In a previous post, we saw how time-consuming it is to calculate a median with pencil and paper, whether it is the median salary or the median of any data set with more than a few data points.
Even after computers became common in the 1950s and could start doing the work, the mean was still used because of one further nasty property of medians: they require retaining information about every value until the end of the period for which the median is calculated. If you want to know the median salary, you need to save every employee’s salary.
Means do not require nearly as much information. In the early days of computers, storing information was expensive, so the mean was still favored for "typical".
Median Salary: Keeping Running Totals
Let’s look at the example from two earlier posts. The median number of checks per day for a year requires retaining the number of checks written every day until the end of the year (365 values). Only at the end of the year can one sort the list and figure out what the median number of checks per day was.
For the mean number of checks per day, all you need to know are two numbers: the total number of checks written, and the number of days in a year. Storing 2 numbers was a lot less expensive than storing 365.
New Technology Microsoft
Let’s look at a similar pay example: to find the mean salary at Microsoft, all you need to know is the size of Microsoft’s payroll and how many people work there. That is just two numbers. To find the median salary, you need to know and store what each of the 50,000 employees makes.
Making storage even easier, the mean just needs running totals: every time you write a check, just increase the number of checks written for the year by one. You can then throw the check away; you don’t need any more information from the check to calculate the mean number of checks per day.
Median Mean and Mode Relation
The same is true for the mean size of checks: it only requires two running totals, the number of checks written and the total dollar value of those checks. The median requires storing the dollar value of each check until the end of the year (750 numbers).
Generally, means require only running totals, while medians need every value in the data set to be stored. This makes storage requirements much smaller for means than medians. Back when computer information was stored on computer tape (think spinning tapes in early James Bond movies), this difference in storage space was enough to keep the mean as the common definition of “typical.”
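To make the storage difference concrete, here is a minimal Python sketch. The check amounts are made up for illustration, and the `RunningMean` class and variable names are hypothetical, not anything from the original calculations. The mean keeps two numbers no matter how many checks arrive; the median has to keep the whole list.

```python
from statistics import median


class RunningMean:
    """Tracks a mean using only two stored numbers (hypothetical helper)."""

    def __init__(self):
        self.count = 0    # how many values have been seen so far
        self.total = 0.0  # sum of those values

    def add(self, value):
        self.count += 1
        self.total += value

    def mean(self):
        return self.total / self.count


# Made-up check amounts, for illustration only.
checks = [25.00, 40.00, 12.50, 300.00, 60.00]

running = RunningMean()
kept_for_median = []  # the median needs every value retained

for amount in checks:
    running.add(amount)             # after this, the check could be thrown away
    kept_for_median.append(amount)  # ...but the median still needs it

mean_amount = running.mean()
median_amount = median(kept_for_median)

print(f"mean:   {mean_amount:.2f} (stored 2 numbers)")
print(f"median: {median_amount:.2f} (stored {len(kept_for_median)} numbers)")
```

With 750 checks instead of 5, `RunningMean` would still hold two numbers, while `kept_for_median` would hold 750.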
The “Typical Salary Range”
The evil standard deviation also depends only on running totals, which explains why it has lasted so long as the common definition of “typical range.” Beyond the two values for the mean, standard deviation needs only one additional running total: the sum of the squares of the dollar values of the checks.
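As a rough sketch of how three running totals are enough, the Python below uses the textbook shortcut: variance equals the mean of the squares minus the square of the mean. The check amounts are made up, and the population form (dividing by the count) is an assumption here; a sample standard deviation would divide by the count minus one instead.

```python
import math
from statistics import pstdev

# Made-up check amounts, for illustration only.
checks = [25.00, 40.00, 12.50, 300.00, 60.00]

# The three running totals: nothing else is stored.
count = 0
total = 0.0
total_of_squares = 0.0

for amount in checks:
    count += 1
    total += amount
    total_of_squares += amount * amount

mean = total / count
# Population variance: mean of the squares minus the square of the mean.
variance = total_of_squares / count - mean * mean
std_dev = math.sqrt(variance)

print(f"standard deviation from running totals: {std_dev:.2f}")
print(f"standard deviation from the full list:  {pstdev(checks):.2f}")
```

One caveat on this shortcut: when the values are very large compared with their spread, subtracting two nearly equal numbers can lose precision in floating point, so treat this as an illustration of the storage argument rather than a recommended way to compute it.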
Better ways of finding the “typical range” have the same problem as the median. They require keeping information about every check written, and thus require much more storage, as well as being more time-consuming to calculate.
Standard deviation may be a horrible measure of “typical range”, because it is highly dependent on outliers, but it is easy to store and calculate 🙂
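Here is a small Python sketch of that outlier problem. The amounts are made up, and the interquartile range is used only as one example of an alternative “typical range”; that choice is an assumption for illustration, since no specific alternative is named above. Like the median, it needs the full list of values.

```python
from statistics import pstdev, quantiles


def interquartile_range(values):
    """Distance between the 25th and 75th percentiles (needs every value)."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1


# Made-up check amounts, for illustration only.
typical = [40, 45, 50, 55, 60, 65, 70]
with_outlier = typical + [5000]  # one unusually large check

for label, data in (("without outlier", typical), ("with outlier", with_outlier)):
    print(
        f"{label}: standard deviation = {pstdev(data):.1f}, "
        f"interquartile range = {interquartile_range(data):.1f}"
    )
```

A single large check multiplies the standard deviation many times over, while the interquartile range barely moves.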
First Personal Computers Introduced
Fortunately for us, microprocessor-based computers were invented, and those first personal computers gave the average citizen a quick and efficient way to calculate. Disk storage for tens of thousands of checks now costs much less than 1 cent. Microsoft Excel, on any custom-built desktop computer, can calculate medians in less than a millisecond, even for tens of thousands of checks.
At PayScale, we do median salary calculations for tens of thousands of jobs over millions of data points every day, and these do not even make our computers sweat 🙂
It is time to stop using the mean to represent “typical” or “average,” and start using the median. Take the first step and find out the median salary for your job with our salary calculator.
Dr. Al Lee