Transforming a variable to a normal distribution or other specified shape

Copyright © Richard B. Darlington. All rights reserved.

It is often useful to be able to transform a set of scores into a particular distribution, most often a standard normal distribution. That allows you to then apply statistical methods designed for that distribution. It's difficult for a critic to object to the use of a technique that assumes a normal distribution if you have transformed the data into a normal shape.

This section discusses two major ways to transform data. Both are practical with a wide variety of statistical packages. One very inexpensive package that includes these capabilities is Mystat for Business; the "regular" Mystat includes just the first of the two methods. We'll call the two methods the equal-area method and the median-score method. The equal-area method is easier to understand and to execute, while the median-score method seems somewhat more satisfying intuitively. In most practical applications it probably makes little difference which method you use. We'll assume here that you want to transform scores into a normal distribution, though both methods can be applied to other distributions as well.

The central problem is that the normal distribution, like any other distribution you might use, is a distribution of an infinitely large population. Thus it is impossible for any finite number of scores to form exactly a normal distribution. Rather you must transform the scores so you can say that the transformed scores come "as close to a normal distribution as a set of that many scores can come". The quoted phrase has no single meaning, however, so we must choose some particular meaning. The equal-area and median-score methods assign two different meanings to it.

The equal-area method

Suppose we have 9 scores that we want to transform to a normal distribution, or rather "as close to a normal distribution" as any 9 scores can be. If you divide a standard normal distribution into 10 equal areas, so that 10% of the total area falls in each section, the cutting points between the areas will fall at the 9 values

-1.282 -.842 -.524 -.253 0 .253 .524 .842 1.282

Therefore these 9 scores have a certain claim to being called the 9 scores that come as close to forming a standard normal distribution as any 9 scores can come.

More generally, if you have a sample of N cases, you divide a standard normal distribution into N+1 equal areas, and take the N cutting points between adjacent areas. Those N values are the N transformed scores. Assign the highest transformed score to the highest raw score, the second-highest transformed score to the second-highest raw score, and so on.

If you rank the raw scores from low to high, then the area to the left of each score is Area = rank/(N+1). For instance, in a set of 9 scores the 8th lowest score (which is also the second highest score) gets a rank of 8, so Area = rank/(N+1) = 8/10 = .8. Therefore the appropriate transformed score is the score chosen so that 80% of a standard normal distribution is to the left of that score. A standard normal table shows that value is .842.

If some of the raw scores are tied, use mean ranks. For instance, if the 7th and 8th ranked scores were tied when N = 9, then you would use Area = rank/(N+1) = 7.5/10 = .75.
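The equal-area steps above can be sketched in a few lines of Python. This is only an illustration, not part of any statistical package: the function names mean_ranks and equal_area_scores and the sample data are made up for the example, and the standard library's NormalDist.inv_cdf stands in for the standard normal table.

```python
from statistics import NormalDist


def mean_ranks(scores):
    """Rank scores from low (1) to high (N), giving tied scores their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        # Extend the run while the next score ties with the current one.
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        # Sorted positions start..end hold ranks start+1..end+1; share their mean.
        shared = (start + 1 + end + 1) / 2
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def equal_area_scores(scores):
    """Transform scores to normal scores using Area = rank / (N + 1)."""
    n = len(scores)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(rank / (n + 1)) for rank in mean_ranks(scores)]


raw = [12, 15, 19, 23, 30, 31, 40, 58, 90]  # hypothetical data, N = 9, no ties
print([round(z, 3) for z in equal_area_scores(raw)])
```

With 9 untied scores the printed values should agree, to three decimals, with the 9 cutting points listed earlier. If the 7th and 8th scores are made equal, both receive rank 7.5 and therefore Area = .75.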

The median-score method

If you drew infinitely many samples of size 9 from a standard normal distribution, and took the second-highest score in each sample, it can be shown that the median of all those second-highest scores would be .917. The median of all the third-highest scores would be .564. We might call .917 and .564 the expected values of the second- and third-highest scores, except that statisticians use that phrase to refer to the means rather than the medians of sampling distributions. Therefore we'll call these scores median scores instead of "expected values". Ranking from the bottom, for samples of size 9 the median scores for all 9 ranks are:

Rank         1       2       3       4       5       6       7       8       9
Median  -1.446   -.917   -.564   -.271       0    .271    .564    .917   1.446

Like the 9 scores derived by the equal-area method, these 9 scores have a reasonable claim to being called the 9 scores that come as close to forming a standard normal distribution as any 9 scores can come. The difference is a matter of intuition, but most statisticians think that median scores are intuitively more appealing than equal-area scores.
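The figures .917 and .564 quoted above can be checked roughly by simulation. The sketch below replaces "infinitely many samples" with a finite number of random samples, so it only approximates the true medians. The function name simulated_median_score, the number of draws, and the seed are arbitrary choices made for this illustration.

```python
import random
from statistics import median


def simulated_median_score(rank, n, draws=20000, seed=0):
    """Estimate the median of the rank-th lowest score in normal samples of size n."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    values = []
    for _ in range(draws):
        sample = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        values.append(sample[rank - 1])  # rank 1 = lowest score
    return median(values)


print(round(simulated_median_score(8, 9), 2))  # second-highest of 9
print(round(simulated_median_score(7, 9), 2))  # third-highest of 9
```

The estimates should land near .917 and .564, and should move closer as the number of draws grows.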

One might well ask: why not literally use expected values? That is, why not use the means instead of the medians of the sampling distributions of ranked scores? There are two practical answers. First, expected values are far more difficult to calculate than median scores. Second, it is much easier to generalize the median-score method to distributions other than normal distributions.

The median-score method is like the equal-area method in that you first find an Area to the left of each score, then use a standard normal table (or other table if you're transforming to some other distribution) to find the scores corresponding to those areas. In the equal-area method with N = 9, the 9 areas were found by the formula Area = rank/(N+1) and were simply .1 .2 .3 .4 .5 .6 .7 .8 .9. In the median-score method the areas are

.074 .180 .286 .393 .500 .607 .714 .820 .926

To see why these particular values are used, consider the second-highest area, .820. It can be shown that in any continuous distribution (not just a normal distribution), if you draw infinitely many random samples of size 9 from the distribution, and take the second-highest score in each sample, the median of all those second-highest scores will be the point chosen so that .820 of the population distribution's total area falls to its left. Similar statements can be made about .607, .714, and the other areas in this list of 9 areas.

Therefore the second step in the median-score method is just like the second step in the equal-area method: find the scores corresponding to particular areas. The two methods differ only in the first step. Where the equal-area method uses the formula Area = rank/(N+1), the median-score method uses the beta distribution. It can be shown that the desired Area equals the Inverse Beta Function (what Systat and Mystat for Business call BIF) for the 3 values .5, rank, and N+1 - rank. Thus for instance if N = 9 and you want the Area for the second-highest score (whose rank is 8), you want BIF(.5, 8, 2). In those programs, if N = 9 then all the Areas can be found at once by starting with a column of ranks (titled Rank), and then writing the command

LET AREA = BIF(.5, RANK, 10 - RANK)

Whatever N is, replace 10 in this command by N+1. Ties are handled the same way as in the equal-area method; assign the mean rank to tied values.
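For readers without a BIF function, here is a rough Python sketch of the same calculation using only the standard library. It is an illustration under stated assumptions, not the Systat implementation. It handles whole-number ranks only, because in that case the beta area can be found from a binomial sum: the rank-th lowest of N scores falls below a point exactly when at least that many of the N scores do. The function names median_area and median_scores are invented for the example. For the fractional mean ranks that ties produce, a general inverse beta function would be needed instead; one option, assuming SciPy is installed, is scipy.stats.beta.ppf(0.5, rank, N + 1 - rank).

```python
from math import comb
from statistics import NormalDist


def median_area(rank, n, tol=1e-12):
    """Area p at which the rank-th lowest of n scores has its median.

    Whole-number ranks only: the chance that at least `rank` of n draws
    fall below the p-th quantile is a binomial tail sum, and we solve
    tail(p) = 0.5 by bisection.
    """
    def tail(p):
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(rank, n + 1))

    lo, hi = 0.0, 1.0
    while hi - lo > tol:  # tail(p) increases with p, so bisection converges
        mid = (lo + hi) / 2
        if tail(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def median_scores(n):
    """Median-score normal scores for ranks 1..n (no allowance for ties)."""
    std_normal = NormalDist()  # stands in for the standard normal table
    return [std_normal.inv_cdf(median_area(rank, n)) for rank in range(1, n + 1)]


print([round(median_area(r, 9), 3) for r in range(1, 10)])
print([round(z, 3) for z in median_scores(9)])
```

For N = 9 the two printed lists should agree, up to rounding, with the 9 areas and the 9 median scores given earlier.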

A table of median-score normal scores, for N up to 100 but with no allowance for ties, is available.
