Posts Tagged ‘Straight up Statistics’

Straight up Statistics: How Random is Your Sample?

Monday, February 16th, 2009

By AUSTIN NELSON

How many times have you read a statistic like this one: “57% of Americans believe that the economy will do X in the next Y years”? Ever wondered how in the heck they can say something like that? Do they poll every American and ask them what they think?

The answer, of course, is no.

Through a series of statistical tricks, it is usually perfectly acceptable to make statements like the one above using only a small sample. For national polls, between 1,000 and 2,000 respondents usually suffice. Scientifically speaking, much can be inferred from such a sample, and its accuracy can be evaluated mathematically (hence the ubiquitous +/- 3% margin of error that follows most poll data).
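To see where a figure like +/- 3% comes from, here is a minimal sketch using the standard margin-of-error formula for a proportion. It assumes a simple random sample, a 95% confidence level (z = 1.96), and the worst-case share of 50%. The function name margin_of_error is just an illustrative choice.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion p estimated from
    a simple random sample of size n (z = 1.96 is roughly 95% confidence)."""
    return z * math.sqrt(p * (1 - p) / n)

# Sample sizes in the range typically used for national polls
for n in (1000, 1500, 2000):
    print(f"n={n}: +/- {margin_of_error(n) * 100:.1f}%")
```

Under these assumptions, 1,000 respondents give a margin of roughly +/- 3.1%. Note that the formula only measures random sampling error. It says nothing about whether the sample was random in the first place.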

The goal of any poll is to sample enough people to form a representative group without going overboard and designing a poll that would take months to perform. The problem (and this is where one can get into trouble with polls and other studies that use small samples to evaluate a large population) lies in the process of sampling itself. For instance, “nationwide” polls like the one that might produce the figure above are often conducted by phone, with pollsters “randomly” selecting people to call and collecting the results.

But what does “random” mean? Presumably, these pollsters have a method akin to pulling a name out of a hat filled with every name in the phone book. There are a few problems with this assumption of “randomness.” The first is that not everyone is in the phone book. The second, and more significant, problem with this methodology is that it takes a very specific kind of person to a) actually answer the phone when someone calls from a number they don’t know and b) actually stay on the line when the person on the other end announces they just need “a few minutes.” By choosing to interview people by phone, the pollsters have actually thrown randomness out the window and left a large portion of America out of their study entirely.
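The effect of only reaching people who pick up the phone can be illustrated with a small simulation. Everything in it is made up for illustration: the population size, the 40% who answer unknown callers, and the assumption that answerers hold a given opinion more often (70%) than everyone else (40%).

```python
import random

def simulate_phone_poll(population_size=100_000, sample_size=1_500, seed=0):
    """Compare the true share holding an opinion with the share a
    phone poll would report if only 'answerers' can be reached."""
    rng = random.Random(seed)
    population = []
    for _ in range(population_size):
        # Hypothetical: 40% of people answer calls from unknown numbers
        answers_phone = rng.random() < 0.40
        # Hypothetical: answerers hold the opinion more often than others
        holds_opinion = rng.random() < (0.70 if answers_phone else 0.40)
        population.append((answers_phone, holds_opinion))

    true_share = sum(opinion for _, opinion in population) / population_size

    # Dialing is random, but only people who answer end up in the sample
    reachable = [opinion for answers, opinion in population if answers]
    sample = rng.sample(reachable, sample_size)
    polled_share = sum(sample) / sample_size
    return true_share, polled_share

true_share, polled_share = simulate_phone_poll()
print(f"true share:   {true_share:.1%}")
print(f"polled share: {polled_share:.1%}")
```

With these invented numbers the poll lands far from the true share, and a larger sample would not fix it. The stated error range covers only random sampling error, not who was left out.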

Now, I’m not saying that every national poll is worthless. Far from it. In fact, polls can be very informative as to trends in public opinion because you can accurately compare the results of the same poll over time. But it should never be assumed, not even for one moment, that because a poll says 57% of Americans do whatever, 57% of Americans in real life actually do that (even allowing for the stated error range).

Sampling is a big issue in any area of scientific inquiry. The assumptions that underlie any statistical analysis are very specific as to the requirements for sampling. Outside of the physics laboratory, these assumptions are almost never met. However, through careful design and data acquisition, one can make a reasonable stab at satisfying their requirements.

One good example of this is the Case-Shiller home price index, or CSHPI. As described previously by Andrew Jeffery, the index uses paired-sale comparisons to evaluate current trends in housing markets. Their methodology is opaquely complex but freely available for the world to see.

Some argue against the method, saying that by only sampling homes that have repeat sales within a given time period, you leave out a huge chunk of homes whose sales could give you insight into home values in their area. This is true, and the CSHPI is far from perfect as a result, but there is simply no way one can achieve perfection in an undertaking like modeling home prices.

The important thing to keep in mind is that with the CSHPI, you know exactly what you are getting, and what you’re not.

Is the index a perfect indicator for what is going on in Brentwood, CA or Mesa, AZ? Absolutely not, and anyone who tells you otherwise is selling you something you don’t want to buy. But it is a painstakingly accurate and admirably well-designed method for tracking trends on a large scale. That it leaves a large chunk of the market out of its samples is an inevitable aspect of proper experimental design.

Only by controlling as many variables as possible (in this case, by only comparing one house to itself rather than every other house that has sold within a given time frame) can one hope to do any meaningful analysis of a market as complex as residential real estate.
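The idea of comparing a house only to itself can be sketched in a few lines. This is a deliberately simplified illustration, not the actual Case-Shiller methodology, which is considerably more involved. The sale prices are hypothetical and the function name repeat_sales_change is made up for this example.

```python
import math

# Hypothetical (earlier_price, later_price) pairs, each for the SAME house
paired_sales = [
    (400_000, 360_000),
    (250_000, 200_000),
    (900_000, 855_000),
    (320_000, 272_000),
]

def repeat_sales_change(pairs):
    """Typical price change across paired sales, computed as the
    geometric mean of later/earlier price ratios, minus 1."""
    log_ratios = [math.log(later / earlier) for earlier, later in pairs]
    return math.exp(sum(log_ratios) / len(log_ratios)) - 1

overall = repeat_sales_change(paired_sales)
print(f"overall change: {overall:.1%}")
for earlier, later in paired_sales:
    print(f"  house bought at {earlier:,}: {later / earlier - 1:.1%}")
```

In this toy data the overall change is a decline of a bit under 13%, while the individual houses range from a 5% to a 20% decline. The summary figure describes the group, not any one property.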

Because of the way it is constructed, the index itself is really only valuable as a tracker of large-scale trends. If the index goes down by 10%, you can’t reasonably say that any given property has declined by 10% or even use it to reliably estimate the price change of a specific property. But if the index has shown a 25% drop from its peak (as it has), you can reliably infer that things are not going well in US housing. By tracking the rate of that decline or the difference in trends between the individual indices of one metro area versus another (there are indices available for 20 metropolitan areas), one can gain valuable and reliable insight about the performance of those markets and make inferences about future trends.

In conclusion, sampling is one of the most important but least appreciated aspects of modern data analysis. In order to correctly interpret any given data, it is absolutely essential to know how that data was sampled and how that sample fits into the area of study. Be wary of data where the data collection and analysis methodology are not freely available. And understand that where samples are involved, usually the most valuable way to use that data is to monitor changes over time rather than making inferences about how any given time period’s data relates to whatever phenomenon you are interested in.

This is especially true when it comes to home values, where there is absolutely no single data model that can tell you how much your house is worth or how much to pay for that new house you’ve got your eye on. However, there is enough data currently available that with careful scrutiny (and the help of trained professionals like the friendly folks at Cirios Real Estate) you can confidently make those assessments.

Straight Up Statistics: Deconstructing the Average

Thursday, January 15th, 2009

By AUSTIN NELSON

In today’s fast-paced, data-driven world, it’s easy to get lost in the morass of statistics flashing across our TVs and computer screens at a sometimes maddening pace.

Government officials, bankers, retailers and snake oil salesmen alike throw out statistical arguments at the drop of a hat, telling you why their pitch is the only one worth listening to because they have the data to back it up. But before accepting what you hear or read at face value just because some nameless research institute did a study, stop for a minute to ponder the complexities of even the most seemingly innocuous of statistics: The average.

Let’s first assume some particular data being quoted were reliably gathered and analyzed (this is almost never a safe assumption, but that’s a topic for another day), then examine how the average and another so-called “descriptive statistic,” the median, are used in the data reports we see every day.

While on the surface it may seem that these two statistical measures could be interchangeable (indeed they are often used interchangeably with no explanation), they tell us very different things about the data they describe.

The median of a given group of data is its middle value. For instance, if your dataset has five data points and you lined them all up from smallest to largest, the third value would be your median. On the other hand, the average, or mean, of a dataset is determined by summing all values and dividing by the number of data points.

For example, suppose you are looking at real estate sales in a certain area within a certain time frame and you had the following 5 values: $300,000, $320,000, $320,000, $450,000, and $1,200,000. The median of this set is $320,000 (the middle value). The average is $518,000 (2,590,000 / 5). As you can see, even in this simple example, the two descriptive statistics are significantly different.
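The arithmetic above can be checked with Python’s standard statistics module. The sketch below uses the five sale prices from the example. Dropping the $1,200,000 sale is an extra step added here only to show how much a single high value moves the average.

```python
from statistics import mean, median

# The five sale prices from the example
sale_prices = [300_000, 320_000, 320_000, 450_000, 1_200_000]

print(f"median:  ${median(sale_prices):,.0f}")
print(f"average: ${mean(sale_prices):,.0f}")

# Same data without the single highest sale
without_top = sale_prices[:-1]
print(f"median without top sale:  ${median(without_top):,.0f}")
print(f"average without top sale: ${mean(without_top):,.0f}")
```

Removing the one expensive sale leaves the median at $320,000 but pulls the average down from $518,000 to $347,500.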

Real estate sales are often represented by the median value. The reasons for this are varied, but center around the fact that a few sales at extremely high levels (like that $2 million house on the top of the hill) can easily skew the average of a dataset towards those properties, even though most homes in the area are selling at lower prices.

For example, in Temecula, CA where most homes sell at modest levels (by California standards) but some homes sell for significantly more, the average sale price in 2008 was about $435,000. The median price, on the other hand, was around $359,000. That’s a difference of over 20% relative to the median.

Contrast that with areas where home prices are more homogeneous, like Daly City, CA, where the average and median values are more closely in line. In 2008, the average sale price for Daly City was around $562,000 while the median was about $558,000, a much smaller spread (<1%).
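The two spreads can be worked out as follows, using the approximate 2008 figures quoted above. Expressing the gap as a percentage of the median is an assumption made for this sketch (it is the reading that matches the “over 20%” and “<1%” figures). The helper name relative_spread is illustrative.

```python
def relative_spread(average, median):
    """Gap between average and median, as a fraction of the median."""
    return (average - median) / median

# Approximate 2008 figures quoted in the text: (average, median)
markets = {
    "Temecula, CA": (435_000, 359_000),
    "Daly City, CA": (562_000, 558_000),
}

for name, (average, median) in markets.items():
    print(f"{name}: {relative_spread(average, median):.1%}")
```

This gives about 21% for Temecula and under 1% for Daly City.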

So which is better? Average or median? As can be seen from the examples above, neither.

Both display different aspects of the same set of data points. In Temecula, where median and average wildly diverge, using the average skews the picture towards a much higher level. An individual from out of state looking to buy there might incorrectly assume they couldn’t afford to do so. On the other hand, solely looking at the median leaves out the fact that there are million-dollar-plus estates in Temecula available to buyers looking for that sort of thing.

When the National Association of Realtors releases their monthly sales statistics — which is the real estate pricing data carried by most major news outlets — they present sales price data as both median and average values. These values are used to track sales prices over time to identify trends in sales activity nationwide and regionally. While both median and average values are freely available to anyone with internet access, the median values are often the ones quoted in the popular press.

By focusing exclusively on median values, however, one can miss interesting trends.

For example, on a nationwide level and in three of the four regions identified, median and average home sale prices have been tracking at around the same relative spread since 2005. In the West region, however, the median sales price has been falling faster than the average price.

This widening gap between median and average helps tell the story of what’s been happening in Western real estate markets in the past few years. In most markets, high-priced homes have retained their value better than homes that are closer to or below the median. Since so many lower end homes are being sold, many after foreclosure, the sheer volume of these transactions is dragging down the median figures. The average, on the other hand, is propped up by the few expensive homes still being sold.
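A toy example shows how the median can fall faster than the average when lower-priced homes lose more value than expensive ones. All prices and the 30% and 5% declines are hypothetical and are not taken from the National Association of Realtors data.

```python
from statistics import mean, median

# Hypothetical sale prices for one market, split into two tiers
low_end = [250_000, 260_000, 270_000, 280_000, 290_000, 300_000]
high_end = [600_000, 700_000, 800_000, 900_000]

def pct_change(before, after):
    """Percentage change from before to after."""
    return (after - before) / before * 100

before = low_end + high_end
# Assumed scenario: low-end prices drop 30%, high-end prices drop only 5%
after = [p * 70 // 100 for p in low_end] + [p * 95 // 100 for p in high_end]

median_change = pct_change(median(before), median(after))
mean_change = pct_change(mean(before), mean(after))

print(f"median change:  {median_change:.1f}%")
print(f"average change: {mean_change:.1f}%")
```

In this scenario the median drops by about 30% while the average drops by about 14%, so the gap between the two widens even though every home lost value.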

This analysis then raises the question: why does the trend only exist in the West? As other regions decline, can we expect the same pattern to play out? Why are higher priced homes holding up better? If expensive homes begin to lose their value, what would that do to the median and average sales prices? What does the data look like on a city or zip code level?

It’s easy to see that just by comparing the median and average sales price trends, much insight — or at the very least another list of questions — can be gained.

I could go on all day about the wealth of information that such a seemingly simple statistic as the average can provide those with the patience and curiosity to “drill down” past the headlines. But my point is simply this: Pay attention! Don’t let the evening news or your favorite web news source gloss over the statistics to prove whatever skewed point they want to make that day. Spend the time to think critically about the information or you run the risk of being fleeced regularly for the rest of your life.

At the very least, pay close attention to the source of any information you are receiving, particularly when that information comes in the form of a statistic. If you are being presented with a descriptive statistic like an average or a median, notice which one you are being given and pause for a second to think about why they used one and not the other.

Furthermore, if you notice that a single set of data is being described interchangeably by median and average, this should throw up a huge red flag as to the reliability of the information and its source.