The Use of Infant Looking Times in Statistical Analysis

Richard B. Darlington

Cornell University

Copyright © Richard B. Darlington. All rights reserved.

Three Problems

The time that an infant looks at a scene is a widely used measure of the infant's interest or surprise at the scene. But what is the best way to use looking times in statistical analyses? It seems clear that the difference between looking times of 5 and 6 seconds should be regarded as in some sense larger or more important than the difference between 50 and 51 seconds. Thus it seems useful to somehow "compress" long looking times relative to short times. But how much compression is best? For instance, should the difference between 50 and 51 seconds be compressed to a half, or a quarter, or a tenth of the difference between 5 and 6 seconds?

A second measurement issue stems from the fact that most investigators present each infant with several views of each scene. How should the several looking times be combined? Should they simply be averaged, or is a weighted average better? If so, should the first view receive the most or the least weight? And how unequal should the weights be?

Third, assuming that both compression and averaging will be used, should the compression be done before or after the averaging? In other words, is it best to average first and then compress the averaged scores, or to compress each individual looking time and then average the compressed scores? Or is some mixture best? That is, might it be best to apply some compression, then average the compressed scores, then compress the averages further?

I argue in this work that it is in fact possible to choose rationally among this broad array of choices. My answers to the three questions above are:

1. Compress substantially, by taking the fourth root of each score.
2. Weight the three views of each scene equally.
3. Compress after averaging, not before.

Thus I recommend taking, for each infant, a simple unweighted sum or mean of the times the infant looked at the various displays of a given scene, and then taking the fourth root of that sum or mean. This makes the difference between 50 and 51 seconds only 19% as large as the difference between 5 and 6 seconds, because the difference between the fourth roots of 50 and 51 is only 19% as large as the difference between the fourth roots of 5 and 6. The rest of this work explains how I reached these conclusions.


These conclusions emerged from an analysis of data provided by Elizabeth Spelke. A series of 18 experiments in the Spelke lab at Cornell included 504 infants, each of whom looked three times at each of two scenes A and B, in the order A B A B A B. Each of the 18 experiments included 2 or 3 or 4 experimental groups, making a total of 44 such groups across the 18 experiments. These 18 experiments all tested different hypotheses about infant cognition, but for our purposes we need not focus on the precise hypotheses tested. Rather the central point is that we have 504 infants in 44 groups, with each infant having looked three times at each of two different scenes. Within each group the infants were about the same age and viewed the same scenes.

Analytic Method

Let C denote some scheme for averaging and compressing looking-time data, let CA denote the set of scores produced by applying scheme C to three views of scene A, let CB denote the set of scores produced by applying scheme C to scene B, and let r(CACB) denote the correlation between CA and CB computed across a set of infants.

The central idea behind the present analysis is that the more successfully an analytic method C removes random error from a set of data, the higher r(CACB) will be. That's because the C-scores are influenced both by random error and by the individual infant's nonrandom propensity to look long at stimuli. The more successfully random error is removed from the data, the more important the influence of nonrandom infant propensities will be, and the higher r(CACB) will be.

Of course some of the 44 groups of infants observed more interesting stimuli than others, and the groups differed in age. To control for these differences, I combined the 44 groups by first computing variables CA and CB, then adjusting these variables to means of zero within each of the 44 groups, then computing r(CACB) on the adjusted C-scores in the entire set of 504 infants. This essentially controls for group (here denoted G), so the correlation computed this way will be denoted r(CACB .G).
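The within-group adjustment can be sketched as follows. This is a minimal illustration in Python with NumPy, using hypothetical per-infant C-scores and group labels; it is not the code used in the original analysis.

```python
import numpy as np

def partial_r_given_group(ca, cb, group):
    """Compute r(CACB.G): adjust CA and CB to a mean of zero within
    each group, then correlate the adjusted scores across all infants."""
    ca = np.asarray(ca, dtype=float).copy()
    cb = np.asarray(cb, dtype=float).copy()
    group = np.asarray(group)
    for g in np.unique(group):
        mask = group == g
        ca[mask] -= ca[mask].mean()
        cb[mask] -= cb[mask].mean()
    return np.corrcoef(ca, cb)[0, 1]

# Two hypothetical groups whose within-group patterns match perfectly:
# the group-adjusted correlation is 1 even though the groups' raw means differ.
r = partial_r_given_group([1, 2, 3, 11, 12, 13],
                          [2, 4, 6, 0, 2, 4],
                          [0, 0, 0, 1, 1, 1])
```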

It actually doesn't matter whether one uses averages or sums in the "averaging" part of computing CA and CB. For instance, if one summed three scores using weights of .5, .3, and .2, that sum could be called an average because the three weights sum to 1. But those weights would yield the same value of r(CACB .G) as weights of 1, .6, and .4, because the latter weights are in the same ratio as the former. Thus for simplicity I used a weighted sum in which the first weight b1 was fixed at 1 and the latter two weights b2 and b3 were allowed to vary.
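The scale-invariance claim is easy to verify numerically. The sketch below uses simulated looking times (the data and variable names are mine, for illustration only) to show that rescaling all three weights by a common factor leaves a correlation unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
times = rng.exponential(10.0, size=(50, 3))   # simulated: 50 infants x 3 views
criterion = rng.exponential(10.0, size=50)    # simulated scores for the other scene

# Weights (.5, .3, .2) sum to 1, so this weighted sum is a weighted average.
r_avg = np.corrcoef(times @ np.array([.5, .3, .2]), criterion)[0, 1]
# Weights (1, .6, .4) are the same weights doubled, i.e. in the same ratio.
r_sum = np.corrcoef(times @ np.array([1.0, .6, .4]), criterion)[0, 1]
# The two weight vectors are proportional, so the correlations are identical.
```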

Various degrees of compression can be achieved by "raising" scores to various powers W, where W is forced to fall between 0 and 1. For instance, taking the square root of a score is equivalent to "raising" that score to the .5 power, so that is also equivalent to setting W = .5. Similarly, setting W = .25 is equivalent to taking the fourth root of a score. If W is set equal to 1 there is no compression and the original scores are used. The lower W is set, the more compression is achieved.

W cannot be lowered all the way to 0, since any number to the 0 power is 1 so all scores would become 1. However, lowering W to near zero turns out to achieve essentially the same degree of compression as replacing the original scores by their logarithms. For instance, on a log scale the difference between 50 and 51 is 10.9% of the difference between 5 and 6, and on a scale formed by setting W = .01 the comparable percentage is 11.1%--nearly the same. By setting W even closer to 0, one can simulate with any desired degree of precision the compression characteristics of a log scale. Thus by choosing W somewhere between 0 and 1 one can vary the degree of compression from none at all to the same compression as is achieved in a log scale.
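The effect of various choices of W can be checked with a short computation. The 50-vs-51 and 5-vs-6 comparisons come from the text; the helper function is hypothetical.

```python
import math

def compression_ratio(transform):
    """Size of the transformed 50-to-51 gap relative to the transformed 5-to-6 gap."""
    return (transform(51) - transform(50)) / (transform(6) - transform(5))

print(compression_ratio(lambda x: x))          # W = 1, no compression: 1.0
print(compression_ratio(lambda x: x ** .25))   # W = .25, fourth root: about .190
print(compression_ratio(lambda x: x ** .01))   # W = .01: about .111
print(compression_ratio(math.log))             # log scale: about .109
```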

To allow the possibility of compressing a certain amount before averaging and compressing more after averaging, one can replace the single power W by two powers G and H. (This power G is unrelated to the group variable G that appears in the correlation notation.) One can first "raise" each individual looking time to some power G, then compute a weighted sum across the three views of the same scene, then "raise" that sum to some other power H. For instance, consider the transformation in which G = .5, H = .4, and the three weights for the weighted sum are respectively 1, .8, and .6. Suppose infant Susan has looking times of 25, 30, and 10 seconds for stimulus A. Then for Susan,

CA = (1*25^.5 + .8*30^.5 + .6*10^.5)^.4 = 2.636.
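As a check, the arithmetic for Susan's score can be reproduced directly (a minimal sketch; the variable names are mine):

```python
G, H = 0.5, 0.4
weights = [1.0, 0.8, 0.6]
times = [25.0, 30.0, 10.0]  # Susan's three looking times for scene A

# Raise each time to the power G, form the weighted sum, raise that to the power H.
ca = sum(w * t ** G for w, t in zip(weights, times)) ** H
print(round(ca, 3))  # 2.636
```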

Thus the problem is to find the values of G, H, b2, and b3 that maximize r(CACB .G). Once the problem is stated this way, one can solve it with a mathematical method called steepest ascent. This is one of an entire family of methods that mathematicians and engineers call optimization methods. These are extremely general methods for finding the values of parameters that maximize or minimize any reasonable mathematical expression. In principle, multiple regression and principal component analysis could both be done with optimization programs, since in each case the goal is to find weights that maximize the multiple correlation or the factor variance.
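A bare-bones steepest ascent routine looks like the sketch below. This is a generic illustration of the method, with a numerical gradient and a step size that halves whenever a step fails to improve the objective; it is applied to a toy function with a known maximum, and is not the program used in the analysis.

```python
import numpy as np

def steepest_ascent(f, x0, step=0.1, min_step=1e-8, max_iter=10000):
    """Maximize f by repeatedly stepping in the direction of the numerical
    gradient, halving the step size whenever a step fails to increase f."""
    x = np.asarray(x0, dtype=float)
    eps = 1e-6
    for _ in range(max_iter):
        # Central-difference estimate of the gradient.
        grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(len(x))])
        candidate = x + step * grad
        if f(candidate) > f(x):
            x = candidate
        else:
            step /= 2            # overshot: shorten the step
            if step < min_step:
                break
    return x

# Toy illustration: a function whose maximum is known to be at (1, 2).
best = steepest_ascent(lambda p: -(p[0] - 1) ** 2 - (p[1] - 2) ** 2, [0.0, 0.0])
```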


When a steepest ascent program was applied to the current problem, it yielded a value near 1 for G, a value near .25 for H, and values near 1.15 for b2 and b3. The corresponding value of r(CACB .G) was .5403. Lowering b2 and b3 to 1 lowered r(CACB .G) only insignificantly, from .5403 to .5394; I'll call it .54. Thus for simplicity I suggest using values of 1 for b2 and b3.

The fact that three of the four parameters (G, b2, and b3) turned out to be at or near 1 allows us to state the recommended approach far more simply. As already mentioned, the recommendation is to take the fourth root of the sum of the individual looking times.


I personally was surprised that G turned out to be so high. I had guessed that it would be useful to compress individual looking times, but setting G at 1 and H at .25 puts all the compression at the level of the sum rather than the individual looking time.

I was also surprised that the first looking time received the smallest weight of the three, though the difference was insignificant. Local lore had been that it might be best to weight the first looking time more heavily, on the ground that infants begin to lose interest by the second and third views of a scene. But perhaps that is precisely why one should give the later views at least equal weight: we want to see whether infants are sufficiently interested to keep looking during later views.

A third feature of the results fits my prior intuitions more closely. Intuition suggests that some compression of the highest scores is needed. But it also suggests that a log transformation goes too far, since it treats the difference between 5 and 6 seconds as just as large and important as the difference between 50 and 60 seconds. The use of fourth roots matches these intuitions, since it falls between the original scale and the log scale in the degree to which it compresses high scores relative to low scores. As already mentioned, it makes the difference between 50 and 51 seconds 19% as large as the difference between 5 and 6 seconds, while the comparable percentage for a log scale is 11%.

Alternatives: Minimizing Sampling Error or Measurement Error

One might ask why I didn't take a quite different approach that emphasizes the normality or at least the symmetry of the transformed scores. That is, why not judge each transformation by its ability to produce a normal distribution in the 3024 looking times studied? Or more modestly, why not compute the skew of the resulting scores for each transformation studied, and choose the transformation with the lowest skew?

These latter approaches are designed to minimize sampling error in the final significance tests or confidence bands computed by a typical investigator, while the optimization approach I used is designed to minimize measurement error. One goal of transforming scores is to increase the validity of statistical methods that assume normal distributions, thereby limiting sampling error. But the optimization method I used was not at all designed to assure that distributions would be approximately normal. Rather that method emphasizes the reduction of measurement error as measured by the ability of transformed scores to predict the transformed looking times for other stimuli. By minimizing measurement error one hopes to maximize statistical power, while an approach emphasizing normality or symmetry would emphasize maximizing the validity of statistical methods that assume normal distributions.

Trying to transform looking times toward normality ignores the question of whether the transformation that best eliminates measurement error does in fact yield a normal distribution. As engineers well know, there are statistical methods that assume highly skewed exponential distributions, or even more highly skewed Weibull distributions. One should not ignore the possibility that the best transformation (the one that measures the underlying trait with least error) does not yield a distribution that is normal or even symmetric.

However, we would have the best of both worlds if we found that by minimizing measurement error we also produced a scale that is roughly normal, so that we could count on the well-known robustness of parametric statistical methods to protect us from the moderate remaining levels of nonnormality. That does seem to be the case with the composite C recommended here. C's skew value was 1.19, compared to 3.92 for the original 3024 looking times. To present a measure of skew that is more intuitively understandable, I defined the ratio

(90th percentile point - median)/(median - 10th percentile point)

This measure would of course be 1.00 for a symmetric distribution. The ratio is 1.37 for C and 6.10 for raw looking times. Since (6.10 - 1)/(1.37 - 1) = 13.8, by this measure the skew of C is only about one-fourteenth that of raw looking times. The skew in C is thus noticeable but apparently not high enough to seriously interfere with the validity of standard statistical methods.
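The percentile-ratio measure of skew is easy to compute. The sketch below uses simulated rather than the original data; it shows that the ratio is near 1 for a symmetric distribution, well above 1 for a right-skewed one, and pulled back toward 1 by a fourth-root transformation.

```python
import numpy as np

def skew_ratio(x):
    """(90th percentile - median) / (median - 10th percentile); 1.00 if symmetric."""
    p10, p50, p90 = np.percentile(x, [10, 50, 90])
    return (p90 - p50) / (p50 - p10)

rng = np.random.default_rng(1)
symmetric = rng.normal(size=100_000)      # ratio near 1
skewed = rng.exponential(size=100_000)    # ratio near 2.7 for this distribution
compressed = skewed ** 0.25               # fourth root pulls the ratio back toward 1
```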
