Straight up Statistics: How Random is Your Sample?

Monday, February 16th, 2009

By AUSTIN NELSON

How many times have you read a statistic like this one: “57% of Americans believe that the economy will do X in the next Y years”? Ever wondered how in the heck they can say something like that? Do they poll every American and ask them what they think?

The answer, of course, is no.

Through well-established statistical methods, it is usually perfectly acceptable to make statements like the one above using only a small sample. For national poll results, a sample of 1,000 to 2,000 people usually suffices. Scientifically speaking, much can be inferred from such a sample, and its accuracy can be evaluated mathematically (hence the ubiquitous “+/- 3%” margin of error that accompanies most poll results).
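To see where a figure like +/- 3% comes from, here is a minimal sketch of the standard margin-of-error formula for a proportion. It assumes a truly random sample and a 95% confidence level (z = 1.96); the function name and the worst-case proportion of 0.5 are my own choices for illustration, not something any particular pollster publishes.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    Assumes every member of the population was equally likely to be sampled.
    proportion=0.5 is the worst case (it gives the widest margin).
    z=1.96 corresponds to a 95% confidence level.
    """
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


for n in (1000, 2000):
    points = margin_of_error(n) * 100
    print(f"n = {n}: +/- {points:.1f} percentage points")
```

For 1,000 people this works out to roughly 3.1 percentage points, and for 2,000 people roughly 2.2. Note that doubling the sample does not halve the margin, which is part of why pollsters stop at a couple thousand respondents.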

The goal in polling a sample is to include enough people to form a representative group, without going overboard and designing polls that would take months to perform. The problem, and this is where one can get into trouble with polls and other studies that use small samples to evaluate a large population, is in the process of sampling itself. For instance, “nationwide” polls like the one that could produce the figure quoted above are often conducted by phone, with pollsters “randomly” selecting people to call and collecting the results.

But what does “random” mean? Presumably, these pollsters have a method akin to pulling a name out of a hat filled with every name in the phone book. There are a few problems with this assumption of “randomness.” The first is that not everyone is in the phone book. The second, and more significant, problem is that it takes a very specific kind of person to a) actually answer the phone when someone calls from a number they don’t know and b) actually stay on the line when the person on the other end announces they just need “a few minutes.” By choosing to interview people by phone, the pollsters have thrown randomness out the window and left a large portion of America out of their study entirely.
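A small simulation makes the point concrete. Everything in it is hypothetical: I assume a population in which 57% hold some opinion, and I assume (purely for illustration) that people who hold it pick up the phone and stay on the line 10% of the time while everyone else does so 30% of the time. The names and rates below are invented and do not describe any real poll.

```python
import random


def simulate_phone_poll(population_size=100_000, true_share=0.57,
                        answer_rate_yes=0.10, answer_rate_no=0.30,
                        calls=20_000, seed=0):
    """Compare a population's true opinion share with what a phone poll sees.

    All rates are made-up assumptions chosen to illustrate nonresponse bias.
    """
    rng = random.Random(seed)

    # True means the person holds the opinion being polled.
    population = [rng.random() < true_share for _ in range(population_size)]

    # The phone numbers themselves are drawn at random...
    called = rng.sample(population, calls)

    # ...but whether a person answers depends on who they are.
    responses = [
        holds_opinion
        for holds_opinion in called
        if rng.random() < (answer_rate_yes if holds_opinion else answer_rate_no)
    ]

    true_value = sum(population) / len(population)
    polled_value = sum(responses) / len(responses)
    return true_value, polled_value, len(responses)


true_value, polled_value, respondents = simulate_phone_poll()
print(f"True share in population: {true_value:.1%}")
print(f"Share among {respondents} respondents: {polled_value:.1%}")
```

Under these assumptions the poll lands far below the true 57%, even though several thousand people responded and the dialing was random. A large sample shrinks the margin of error, but it does nothing about who chose to answer.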

Now, I’m not saying that every national poll is worthless. Far from it. In fact, polls can be very informative as to trends in public opinion because you can accurately compare the results of the same poll over time. But it should never be assumed, not even for one moment, that because the poll says 57% of Americans do whatever, 57% of Americans in real life actually do (even allowing for the stated error range).

Sampling is a big issue in any area of scientific inquiry. The assumptions that underlie any statistical analysis are very specific as to the requirements for sampling. Outside of the physics laboratory, these assumptions are almost never met. However, through careful design and data acquisition, one can make a reasonable stab at satisfying their requirements.

One good example of this is the Case-Shiller home price index, or CSHPI. As described previously by Andrew Jeffery, the index uses paired-sale comparisons to evaluate current trends in housing markets. Its methodology is complex but freely available for the world to see.

Some argue against the method, saying that by only sampling homes that have repeat sales within a given time period, you leave out a huge chunk of homes whose sales could give you insight into home values in their area. This is true, and the CSHPI is far from perfect as a result, but there is simply no way one can achieve perfection in an undertaking like modeling home prices.

The important thing to keep in mind is that with the CSHPI, you know exactly what you are getting, and what you’re not.

Is the index a perfect indicator for what is going on in Brentwood, CA or Mesa, AZ? Absolutely not, and anyone who tells you otherwise is selling you something you don’t want to buy. But it is a painstakingly accurate and admirably well-designed method for tracking trends on a large scale. That it leaves a large chunk of the market out of its samples is an inevitable aspect of proper experimental design.

Only by controlling as many variables as possible (in this case, by only comparing one house to itself rather than every other house that has sold within a given time frame) can one hope to do any meaningful analysis of a market as complex as residential real estate.
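Here is a rough sketch of why comparing a house to itself matters. To be clear, this is not the actual Case-Shiller calculation, which is considerably more involved; it is a toy version of the paired-sale idea, set against a naive comparison of average sale prices. The function names and all prices are hypothetical.

```python
import math


def paired_sale_change(pairs):
    """Average price change across houses that each sold twice.

    pairs holds (earlier_price, later_price) for the same house.
    Uses a geometric mean of the price ratios, so a house's size
    or price level does not dominate the result.
    """
    log_ratios = [math.log(later / earlier) for earlier, later in pairs]
    return math.exp(sum(log_ratios) / len(log_ratios)) - 1


def naive_average_change(earlier_sales, later_sales):
    """Change in the average sale price, ignoring which houses sold."""
    before = sum(earlier_sales) / len(earlier_sales)
    after = sum(later_sales) / len(later_sales)
    return after / before - 1


# Hypothetical market: every house lost 10% of its value, but more
# expensive houses happened to sell in the later period.
repeat_pairs = [(200_000, 180_000), (500_000, 450_000)]
earlier_sales = [200_000, 200_000, 500_000]
later_sales = [180_000, 450_000, 450_000]

print(f"Paired-sale change:   {paired_sale_change(repeat_pairs):+.1%}")
print(f"Naive average change: {naive_average_change(earlier_sales, later_sales):+.1%}")
```

In this made-up example the naive average says prices rose, simply because the mix of houses that sold shifted toward pricier ones. The paired comparison reports the 10% decline that every house actually experienced.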

Because of the way it is constructed, the index itself is really only valuable as a tracker of large-scale trends. If the index goes down by 10%, you can’t reasonably say that any given property has declined by 10%, or even use it to reliably estimate the price change of a specific property. But if the index has shown a 25% drop from its peak (as it has), you can reliably infer that things are not going well in US housing. By tracking the rate of that decline, or the difference in trends between the individual indices of one metro area versus another (there are indices available for 20 metropolitan areas), one can gain valuable and reliable insight about the performance of those markets and make inferences about future trends.
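The “drop from peak” figure is simple arithmetic on the index series. A minimal sketch, using invented index values rather than real CSHPI data:

```python
def drop_from_peak(index_values):
    """Fractional decline of the latest index value from the series' peak."""
    peak = max(index_values)
    return 1 - index_values[-1] / peak


# Hypothetical index readings over time (not real data).
index_values = [100.0, 150.0, 200.0, 180.0, 150.0]

print(f"Decline from peak: {drop_from_peak(index_values):.0%}")
```

Running the same calculation on two metro-area series would give the kind of side-by-side trend comparison described above.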

In conclusion, sampling is one of the most important but least appreciated aspects of modern data analysis. In order to correctly interpret any given data set, it is absolutely essential to know how the data were sampled and how that sample fits into the area of study. Be wary of data whose collection and analysis methodology is not freely available. And understand that where samples are involved, usually the most valuable way to use the data is to monitor changes over time rather than to make inferences about how any given time period’s data relate to whatever phenomenon you are interested in.

This is especially true when it comes to home values, where there is absolutely no single data model that can tell you how much your house is worth or how much to pay for that new house you’ve got your eye on. However, there is enough data currently available that with careful scrutiny (and the help of trained professionals like the friendly folks at Cirios Real Estate) you can confidently make those assessments.