This post provides an overview of the PayScale dataset and the proprietary model we use to calculate expected pay. It also defines terms we commonly use in our research.
The PayScale Dataset
The data we use for our research come from the PayScale online salary survey unless otherwise specified. This survey, which is ongoing, incentivizes respondents to provide their information by offering an individualized report of how people like them with the same or a very similar job title are compensated. The main body of the survey collects demographic information (e.g. age, ethnicity, degrees obtained, major, educational institution) as well as other relevant worker characteristics (e.g. job title, years of experience, skills critical for the job, certifications, management responsibilities) and labor market traits (e.g. location, industry, company size). The survey logic is responsive to each respondent’s job title and previous answers. For example, we collect information on the type of aircraft that airline pilots fly, as that is one of the chief determinants of their pay. Similarly, we ask a set of follow-up questions to those who directly supervise people, such as whether they are responsible for hiring and firing.
We also ask questions about the amount and types of compensation respondents receive. Workers can report wages for contract work, hourly pay, base salary, bonuses, commissions, tips, and more. These rich compensation data provide the core data offering for PayScale’s B2B data products. Our products are designed to provide timely, accurate, and useful information to HR managers and business leaders, allowing them to effectively set pay to attract and retain employees in their unique talent market.
We regularly add research questions at the end of the survey on a broad set of topics. These questions vary widely, from salary negotiation tactics to job satisfaction and engagement to social issues like workplace gender policies. We combine these data with the core survey to build a nuanced and unique view of the nature of work and compensation.
Due to the nature of the PayScale survey offering, there are several areas where our data are particularly strong and others where they are relatively thin. White-collar, health care, and tech jobs tend to be very well represented in the PayScale dataset. In other positions, such as minimum wage and union jobs, legal compliance or collective bargaining are the most relevant factors in wage negotiations, and there is a limited role for a third-party salary report at the individual level. This being the case, workers in these fields are less likely to participate in the salary survey. Similarly, compensation for executive roles at mid-sized or larger organizations tends to vary widely both in amount and in structure. Again, pay negotiation and structure in this world are so different from those of the rank-and-file that executives enter the salary survey at lower rates.
We are aware that our raw data are not representative of the entire US labor market. The millions of salary profiles we collect represent a sizable portion of the market, and we employ statistical techniques to control for the occupational and industrial skewness in our data. PayScale’s research team is committed to using our data responsibly and wisely to provide meaningful and accurate insights.
The Salary Model
PayScale employs a proprietary parametric Bayesian model for constructing pay ranges and estimates. Although the model has the flexibility to produce estimated conditional distributions for a range of variables, we rely on it primarily to produce pay ranges for individual respondents conditional on the data they provide. We model pieces of compensation both individually and at the aggregate level, so we have separate models at the job title/country level for base, bonus, and total cash compensation.
The model prioritizes both the most current and the most salient data, meaning recent profiles that most closely match the respondent’s compensable characteristics are weighted more heavily in creating the conditional salary range. We assume a distribution from the double-Pareto lognormal family of distributions for compensation. This allows the data to follow an asymmetric bell curve that can have a variety of different shapes contingent on job title and location.
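To give a feel for the distribution family named above, here is a minimal sketch that draws samples from a double-Pareto lognormal distribution. It relies on a standard representation of that family: the log of the variable is a normal term plus the difference of two scaled exponential terms. It is not PayScale's proprietary model; the function name `sample_dpln` and every parameter value are hypothetical, chosen only for illustration.

```python
import math
import random


def sample_dpln(alpha, beta, nu, tau, rng):
    """Draw one double-Pareto lognormal sample.

    log(X) = nu + tau * Z + E1 / alpha - E2 / beta, where Z is standard
    normal and E1, E2 are independent unit exponentials. alpha controls
    the upper tail, beta the lower tail (smaller value = heavier tail).
    """
    z = rng.gauss(0.0, 1.0)
    e1 = rng.expovariate(1.0)
    e2 = rng.expovariate(1.0)
    return math.exp(nu + tau * z + e1 / alpha - e2 / beta)


# Hypothetical parameters: equal tail indices make log(X) symmetric
# about nu, so the median of X is exp(nu) = 60,000 here.
rng = random.Random(0)
samples = sorted(
    sample_dpln(alpha=3.0, beta=3.0, nu=math.log(60000.0), tau=0.2, rng=rng)
    for _ in range(20000)
)
sample_median = samples[len(samples) // 2]
```

Setting `alpha` and `beta` to different values skews the curve, which is the asymmetry the text describes.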
Pay: Unless otherwise specified, reported pay is effective annual compensation (EAC). EAC includes base pay, bonus, commission, and tips. This definition also includes hourly compensation, annualized by assuming full-time, year-round employment, which allows for an apples-to-apples comparison between hourly rates and base salaries. This measure does not include the cash value of non-cash benefits, such as equity or stock compensation, retirement benefits, health insurance, etc.
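The EAC definition can be sketched as a small function. The text does not state how many hours count as full-time, year-round work, so the figure of 2,080 hours (40 hours per week for 52 weeks) is an assumption, as are the function and parameter names.

```python
# Assumption: "full-time, year-round" means 40 hours/week for 52 weeks.
HOURS_PER_YEAR = 40 * 52


def effective_annual_compensation(base_salary=None, hourly_rate=None,
                                  bonus=0.0, commission=0.0, tips=0.0,
                                  hours_per_year=HOURS_PER_YEAR):
    """Sum the cash components of pay; annualize hourly rates first."""
    if (base_salary is None) == (hourly_rate is None):
        raise ValueError("provide exactly one of base_salary or hourly_rate")
    if base_salary is not None:
        base = base_salary
    else:
        base = hourly_rate * hours_per_year
    # Non-cash benefits (equity, retirement, health insurance) are excluded.
    return base + bonus + commission + tips


salaried = effective_annual_compensation(base_salary=60000.0, bonus=5000.0)
hourly = effective_annual_compensation(hourly_rate=25.0, tips=1000.0)
```

Under these assumptions, a salaried worker earning 60,000 plus a 5,000 bonus and an hourly worker earning 25 per hour plus 1,000 in tips end up on the same annual scale.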
Percentiles and Medians: We virtually always report median pay. Half of the workers within a reporting group earn more than the median amount, while half earn less. For additional texture, we sometimes report other percentiles (most commonly the 25th and 75th percentiles). The interpretation for any percentile is that x percent of the reporting group earns less than this amount, so 25 percent of workers earn less than the 25th percentile and 75 percent earn less than the 75th percentile.
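As an illustration of these definitions, the sketch below computes the 25th percentile, median, and 75th percentile for a small reporting group using Python's standard library. The pay figures are made up, and the interpolation method (`"inclusive"`) is an assumption; the text does not say which percentile convention PayScale uses.

```python
import statistics


def pay_percentiles(pays):
    """Return the 25th, 50th, and 75th percentiles of a list of pay values."""
    # n=4 splits the data into quartiles and returns the three cut points.
    p25, median, p75 = statistics.quantiles(pays, n=4, method="inclusive")
    return {"p25": p25, "median": median, "p75": p75}


# Made-up reporting group of five workers.
group = [30000, 40000, 50000, 60000, 70000]
summary = pay_percentiles(group)
```

For larger groups the choice of interpolation method matters little, but for small groups different conventions can give noticeably different quartiles.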
Pay vs. Market, Overpaid, and Underpaid: We also regularly leverage the PayScale model to compare an individual’s reported pay to the median of the conditional predicted pay range. We typically use the percentage difference between the expected median and reported pay. Sometimes we classify those earning less than 75 percent of the predicted median as underpaid and those earning over 125 percent as overpaid.
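The comparison and classification can be sketched as follows. The 75 and 125 percent thresholds come from the text; the sign convention (negative when reported pay falls below the predicted median), the label for the middle group, and all function names are assumptions made for illustration.

```python
def pay_vs_market(reported_pay, predicted_median):
    """Percentage difference of reported pay relative to the predicted median.

    Assumed sign convention: negative means paid below the predicted median.
    """
    if predicted_median <= 0:
        raise ValueError("predicted_median must be positive")
    return (reported_pay - predicted_median) * 100.0 / predicted_median


def classify_pay(reported_pay, predicted_median, low=0.75, high=1.25):
    """Label a worker using the 75 percent and 125 percent thresholds."""
    if predicted_median <= 0:
        raise ValueError("predicted_median must be positive")
    ratio = reported_pay / predicted_median
    if ratio < low:
        return "underpaid"
    if ratio > high:
        return "overpaid"
    return "within range"  # hypothetical label for everyone in between


gap = pay_vs_market(90000.0, 100000.0)
labels = [classify_pay(p, 100000.0) for p in (70000.0, 100000.0, 130000.0)]
```

Here a worker reporting 90,000 against a predicted median of 100,000 sits 10 percent below market but is not classified as underpaid, because the ratio of 0.90 is above the 0.75 threshold.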