The ol’ blog doesn’t allow comments and discourages search-engine indexing these days, so putting things up here is somewhat pointless, but I’m going to keep doing it anyway.
I was talking about R and statistics and ran into an interesting problem. Let’s say I have a CSV file with data for a dependent (response) variable W and three independent variables X, Y, and Z. I also have a second CSV file with some additional values for the input variables. My goal is to fit a model for W in terms of X, Y, and Z from the first file, then see what that model predicts for the values of X, Y, and Z in the second file. If the second file also has measured values for W, you can compare the predictions against the actual values and get an idea of how good your model is. This is called out-of-sample model validation. Anyway, I load the data files like so:
> data <- read.csv("datafile.csv")
> data2 <- read.csv("datafile2.csv")
However, if you don't have a handy data set you can just get R to fake one for you. Note that I've chosen different sizes here.
> data <- data.frame(W=rnorm(10), X=rnorm(10), Y=rnorm(10), Z=rnorm(10))
> data2 <- data.frame(W=rnorm(5), X=rnorm(5), Y=rnorm(5), Z=rnorm(5))
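One caveat: rnorm() draws fresh random numbers every session, so the outputs shown below won’t match yours exactly. If you want reproducible fake data, seed the generator before the data.frame calls above (the seed value 42 is arbitrary):

> set.seed(42)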
It turns out there are two ways to build your linear model: the right way and the wrong way. The wrong way goes like this:
> lmfit <- lm(data$W ~ data$X + data$Y + data$Z)
> print(lmfit$fitted.values)
1 2 3 4 5 6 7 8 9 10
-0.8562631 1.1639021 1.1048939 0.4355315 0.7493099 0.1484507 -0.8458088 -0.3278646 0.3195509 1.3771322
When you then try to predict values of W using data2, it complains at you in a rather unhelpful way:
> predW <- predict(lmfit, data2)
Warning message:
'newdata' had 5 rows but variables found have 10 rows
> print(predW)
1 2 3 4 5 6 7 8 9 10
-0.8562631 1.1639021 1.1048939 0.4355315 0.7493099 0.1484507 -0.8458088 -0.3278646 0.3195509 1.3771322
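A quick diagnostic hints at what’s going on: compare the formula stored in the model against the columns actually available in data2.

> formula(lmfit)
data$W ~ data$X + data$Y + data$Z
> names(data2)
[1] "W" "X" "Y" "Z"

The model is looking for variables literally named data$X and so on, which data2 obviously doesn’t contain.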
R is being kind of unhelpful here. I found a fairly long discussion that ultimately helped me solve the problem, though it takes the long way around: http://faustusnotes.wordpress.com/2012/02/16/problems-with-out-of-sample-prediction-using-r/
The short version is that when we built our model, the formula recorded the variables as "data$W", "data$X", and so on, not just "W" and "X". So when we ask predict() to use inputs from data2, it can't find anything called "data$X" in there, falls back to the data stored with the model, and quietly stuffs the original fitted values for 'data' into predW. That's not very helpful and possibly misleading. The fix is to use bare column names in the formula, so the variables have less scope, and to supply the data frame via the data argument:
> lmfit_new <- lm(W ~ X + Y + Z, data=data)
> print(lmfit_new$fitted.values)
1 2 3 4 5 6 7 8 9 10
-0.8562631 1.1639021 1.1048939 0.4355315 0.7493099 0.1484507 -0.8458088 -0.3278646 0.3195509 1.3771322
> predW_new <- predict(lmfit_new, data2)
> print(predW_new)
1 2 3 4 5
-1.3766995 -1.8670918 1.1370810 2.8103150 0.6789566
All better. And then we can take the square root of the sum of squared differences between our prediction and our (completely made up) data2 W:
> sqrt(sum((data2$W - predW_new)^2))
[1] 3.655083
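If you want a proper root-mean-square error rather than a root sum of squares, average the squared differences before taking the root (with made-up data the exact value varies from run to run, so no output is shown):

> sqrt(mean((data2$W - predW_new)^2))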