Cornell University

Copyright © Richard B. Darlington. All rights reserved.

A second measurement issue stems from the fact that most investigators present each infant with several views of each scene. How should the several looking times be combined? Should they simply be averaged, or is a weighted average better? If so, should the first view receive the most or the least weight? And how unequal should the weights be?

Third, assuming that both compression and averaging will be used, should the compression be done before or after the averaging? In other words, is it best to average and to then compress the average scores, or is it better to compress each individual looking time and then average the compressed scores? Or is some mixture best? That is, might it be best to apply some compression, then average the compressed scores, then further compress the average scores?

I argue in this work that it is in fact possible to choose rationally among this broad array of choices. My answers to the three questions above are:

- The optimum degree of compression is achieved by taking the fourth root--the square root of the square root.
- The best average is a simple unweighted average.
- The compression should be done after the averaging.

The central idea behind the present analysis is that the more
successfully an analytic method *C* removes random error from a set of data,
the higher *r(C_{A}C_{B})* will be. That's because the random errors in an
infant's looking times for stimulus A are unrelated to the random errors in
that infant's looking times for stimulus B, so any error left in the two
composites can only weaken the correlation between them.

Of course some of the 44 groups of infants observed more interesting
stimuli than others, and the groups differed in age. To control these
differences, I combined the 44 groups by first computing variables
*C_{A}* and *C_{B}* within each group, so that the correlation between them
reflects differences among infants within the same group rather than
differences between groups.

It actually doesn't matter whether one uses averages or sums in the
"averaging" part of computing *C_{A}* and *C_{B}*: a sum is just the average
multiplied by the number of views, and multiplying scores by a constant
leaves correlations unchanged.

Various degrees of compression can be achieved by "raising" scores to
various powers *W*, where *W* is forced to fall between 0 and 1. For instance,
taking the square root of a score is equivalent to "raising" that score to the .5
power, so that is also equivalent to setting *W* = .5. Similarly, setting *W* =
.25 is equivalent to taking the fourth root of a score. If *W* is set equal to 1
there is no compression and the original scores are used. The lower *W* is set,
the more compression is achieved.
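A minimal Python sketch (with hypothetical looking times) illustrates how lowering *W* compresses the high end of the scale:

```python
# Compression by raising scores to a power W in (0, 1].
# The looking times below are hypothetical illustrations;
# the smaller W is, the more the large scores are pulled in.
times = [2.0, 10.0, 60.0]

for W in (1.0, 0.5, 0.25):
    compressed = [t ** W for t in times]
    print(W, [round(c, 2) for c in compressed])
```

At *W* = 1 the scores are unchanged; at *W* = .25 the gap between 10 and 60 seconds shrinks from 50 units to about one unit.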

*W* cannot be lowered all the way to 0, since any number raised to the 0
power is 1, so all scores would become 1. However, lowering *W* to near zero turns
out to achieve essentially the same degree of compression as replacing the
original scores by their logarithms. For instance, on a log scale the difference
between 50 and 51 is 10.9% of the difference between 5 and 6, and on a scale
formed by setting *W* = .01 the comparable percentage is 11.1%--nearly the
same. By setting *W* even closer to 0, one can simulate with any desired
degree of precision the compression characteristics of a log scale. Thus by
choosing *W* somewhere between 0 and 1 one can vary the degree of
compression from none at all to the same compression as is achieved in a log
scale.
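The figures just given are easy to verify; a quick Python check:

```python
import math

# Ratio of the (50, 51) gap to the (5, 6) gap on each scale,
# as described in the text.
def gap_ratio(f):
    return (f(51) - f(50)) / (f(6) - f(5))

log_ratio = gap_ratio(math.log)           # log scale: about 10.9%
w_ratio = gap_ratio(lambda x: x ** 0.01)  # W = .01: about 11.1%

print(round(log_ratio, 3), round(w_ratio, 3))
```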

To allow the possibility of compressing a certain amount before
averaging, and compressing more after averaging, one can replace the single
power *W* by two powers *G* and *H*. One can first "raise" each individual
looking time to some power *G*, then compute a weighted sum across the three
views of the same scene, then "raise" that sum to some other power *H*. For
instance, consider the transformation in which *G* = .5, *H* = .4, and the three
weights for the weighted sum are respectively 1, .8, and .6. Suppose infant
Susan has looking times of 25, 30, and 10 seconds for stimulus A. Then for
Susan,

*C_{A}* = (1 × 25^.5 + .8 × 30^.5 + .6 × 10^.5)^.4 ≈ 2.64
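A short Python check of this arithmetic:

```python
# Worked example of the two-power transform, using the values
# from the text: G = .5, H = .4, weights 1, .8, .6, and looking
# times of 25, 30, and 10 seconds.
G, H = 0.5, 0.4
weights = [1.0, 0.8, 0.6]
times = [25.0, 30.0, 10.0]

weighted_sum = sum(w * t ** G for w, t in zip(weights, times))
C_A = weighted_sum ** H
print(round(C_A, 2))  # about 2.64
```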

Thus the problem is to find the values of *G*, *H*, *b2*, and *b3* that
maximize *r(C_{A}C_{B})*. Once the problem is stated this way, one can
solve it with standard methods of numerical optimization.
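The shape of that search can be sketched as a coarse grid search. The code below uses synthetic looking times, not the actual data set, and fixes the weights at 1 for brevity (the full search would vary *b2* and *b3* as well); it is only meant to show the form of the optimization:

```python
import math
import random

random.seed(0)

def pearson(x, y):
    # Plain Pearson correlation, written out to keep the sketch
    # free of external dependencies.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Synthetic stand-ins: each infant has a latent interest level, and
# looking times for two stimuli (three views each) are noisy
# multiplicative functions of it.
infants = []
for _ in range(200):
    trait = random.lognormvariate(0, 0.5)
    views_a = [trait * random.lognormvariate(1, 0.6) for _ in range(3)]
    views_b = [trait * random.lognormvariate(1, 0.6) for _ in range(3)]
    infants.append((views_a, views_b))

def composite(views, G, H):
    # Raise each view to power G, sum, then raise the sum to power H.
    return sum(t ** G for t in views) ** H

# Coarse grid over G and H in (0, 1]; pick the pair that maximizes
# the correlation between the two composites across infants.
best = max(
    ((G / 10, H / 10) for G in range(1, 11) for H in range(1, 11)),
    key=lambda gh: pearson(
        [composite(a, *gh) for a, _ in infants],
        [composite(b, *gh) for _, b in infants],
    ),
)
print("best (G, H):", best)
```

A real analysis would refine the grid (or use a proper optimizer) and search over the weights as well.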

The fact that three of the four parameters (*G*, *b2*, and *b3*) turned out to
be 1 allows us to state the recommended approach far more simply. As
already mentioned, the recommendation is to use the fourth root of the sum of
the individual looking times.
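In code, the recommended composite is a one-liner (the looking times below are hypothetical):

```python
# Recommended composite: the fourth root of the sum of the
# individual looking times (G = 1, equal weights, H = .25).
def composite(times):
    return sum(times) ** 0.25

print(round(composite([25.0, 30.0, 10.0]), 3))
```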

I was also surprised that the first looking time received the smallest weight of the three, though the difference was insignificant. Local lore had been that it might be best to weight the first looking time more heavily, on the ground that infants begin to lose interest by the second and third views of a scene. But that may be precisely why the later views deserve equal weight: we want to see whether infants are sufficiently interested to keep looking during later views.

A third feature of the results fit my prior intuitions more closely. Intuition suggests that some compression of the highest scores is needed. But it also suggests that a log transformation goes too far, for instance treating the difference between 5 and 6 seconds as if it were as large and important as the difference between 50 and 60 seconds. The use of fourth roots matches these intuitions, since it falls between the original scale and the log scale in the degree to which it compresses high scores relative to low scores. For instance, it makes the difference between 50 and 51 seconds 19% as large as the difference between 5 and 6 seconds, while the comparable percentage is 11% for a log scale and 100% for the raw scale.
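A quick check of the 19% figure:

```python
# Ratio of the (50, 51) gap to the (5, 6) gap on a fourth-root scale.
ratio = (51 ** 0.25 - 50 ** 0.25) / (6 ** 0.25 - 5 ** 0.25)
print(round(ratio, 2))  # about 0.19
```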

Approaches that choose a transformation to make distributions normal or
symmetric are designed to minimize *sampling
error* in the final significance tests or confidence bands computed by a
typical investigator, while the optimization approach I used is designed to
minimize *measurement error*. One goal of transforming scores is
to increase the validity of statistical methods that assume normal distributions,
thereby limiting sampling error. But the optimization method I used was not
at all designed to *assure* that distributions would be approximately
normal. Rather that method emphasizes the reduction of measurement error as
measured by the ability of transformed scores to predict the transformed
looking times for other stimuli. By minimizing measurement error one hopes
to maximize statistical *power*, while an approach emphasizing
normality or symmetry would emphasize maximizing the *validity*
of statistical methods that assume normal distributions.

To try to transform looking times to normal distributions ignores the question of whether the transformation that best eliminates measurement error does in fact yield a normal distribution. As engineers well know, there are statistical methods that assume highly skewed exponential distributions, or even more highly skewed Weibull distributions. One should not ignore the possibility that the best transformation (which measures the underlying trait with least error) does not yield a distribution that is normal or even symmetric.

However, we would have the best of both worlds if we found that by
minimizing measurement error we also produced a scale that is roughly
normal, so that we could count on the well-known robustness of parametric
statistical methods to protect us from the moderate remaining levels of
nonnormality. That does seem to be the case with the composite *C*
recommended here. *C*'s skew value was 1.19, compared to 3.92 for the
original 3024 looking times. To present a measure of skew that is more
intuitively understandable, I defined the ratio

(90th percentile point - median)/(median - 10th percentile point)

This measure would of course be 1.00 for a symmetric distribution. This ratio
is 1.37 for *C* and 6.10 for raw looking times. Since (6.10 - 1)/(1.37 - 1) =
13.8, by this measure of skew, the skew of *C* is only about one-fourteenth that
of raw looking times. The skew in *C* is thus noticeable but apparently not
high enough to seriously interfere with the validity of standard statistical
methods.
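The percentile ratio defined above is straightforward to compute; a Python sketch on hypothetical data:

```python
import statistics

# Skew ratio from the text:
# (90th percentile - median) / (median - 10th percentile).
# It equals 1.00 for a symmetric distribution. The data below are
# hypothetical, not the looking-time data.
def skew_ratio(xs):
    qs = statistics.quantiles(xs, n=10, method="inclusive")
    p10, p90 = qs[0], qs[-1]
    med = statistics.median(xs)
    return (p90 - med) / (med - p10)

symmetric = list(range(1, 101))           # symmetric: ratio 1.00
skewed = [x ** 4 for x in range(1, 101)]  # strongly right-skewed
print(round(skew_ratio(symmetric), 2), round(skew_ratio(skewed), 2))
```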