Thwarted Again

Casa CorMAnca by Paul Cremoux Studio

futility

The ol’ blog doesn’t allow comments and discourages search engine indexing now so basically putting things up here is somewhat pointless but I’m going to keep doing it.

I was talking about R and statistics and stuff and I ran into an interesting problem. Let’s say I have a csv file with data for a dependent, response variable W and three independent variables X, Y, and Z. Additionally, I have a second csv file with some additional values for the input variables. My goal is to produce a model for W in terms of X, Y and Z from the first data file and then see what the model predicts for the values of X, Y and Z in the second data file. If you had measured values for W in the second file then you could compare your predictions with the actual and get an idea of how good your model is. This is called out-of-sample model validation. Anyway, I load the data files like so:
> data <- read.csv("datafile.csv") > data2 <- read.csv("datafile2.csv")

However, if you don't have a handy data set you can just get R to fake one for you. Note that I've chosen different sizes here.
> data <- data.frame(W=rnorm(10), X=rnorm(10), Y=rnorm(10), Z=rnorm(10)) > data2 <- data.frame(W=rnorm(5), X=rnorm(5), Y=rnorm(5), Z=rnorm(5))

It turns out that there are two ways to make your linear model. The right way and the wrong way. The wrong way goes like this:
> lmfit <- lm(data$W ~ data$X + data$Y + data$Z) > print(lmfit$fitted.values) 1 2 3 4 5 6 7 8 9 10 -0.8562631 1.1639021 1.1048939 0.4355315 0.7493099 0.1484507 -0.8458088 -0.3278646 0.3195509 1.3771322

When you try to predict the values of W using data2 it will complain at you in a kind of unhelpful way.
> predW <- predict(lmfit, data2) Warning message: 'newdata' had 5 rows but variables found have 10 rows > print(predW) 1 2 3 4 5 6 7 8 9 10 -0.8562631 1.1639021 1.1048939 0.4355315 0.7493099 0.1484507 -0.8458088 -0.3278646 0.3195509 1.3771322

R is being kind of unhelpful here. I found a fairly long discussion of this that ultimately helped me solve my problem but it goes around the long way to get there. http://faustusnotes.wordpress.com/2012/02/16/problems-with-out-of-sample-prediction-using-r/

The short version is that when we built our model we said we were specifically looking at "data$W", not just "W". So when we ask it to predict based on inputs from data2 it says, "I can't find data$W in here." and then it just stuffs the original predictions for 'data' into predW. That's not very helpful and possibly misleading. However, this can all be solved by giving our model variable names less scope and providing data via the data argument.
> lmfit_new <- lm(W ~ X + Y + Z, data=data) > print(lmfit_new$fitted.values) 1 2 3 4 5 6 7 8 9 10 -0.8562631 1.1639021 1.1048939 0.4355315 0.7493099 0.1484507 -0.8458088 -0.3278646 0.3195509 1.3771322 > predW_new <- predict(lmfit_new, data2) > print(predW_new) 1 2 3 4 5 -1.3766995 -1.8670918 1.1370810 2.8103150 0.6789566

All better. And then we can get the Root Mean Square of the difference between our prediction and our (completely made up) data2 W:
> sqrt(sum((data2$W - predW_new)^2)) [1] 3.655083

statistics

For a while I’ve used JMP for data analysis professionally. Recently we’ve been talking about experiment design and statistical power at work and during some of these discussions I decided it might be good if I knew how to do some R. The learning curve is OK but there’s a lot of kind of magical behavior that I don’t really get yet.

So far I’m working on getting a handle on linear models. JMP has a dialog that is Analyze->Fit Model. It then gives you a nice little browsing interface and lets you build your model with the crosses and nesting and all that. Then you tell it what kind of model you want and what kind of report you’re interested in seeing and let it go. I have a sample dataset that has nothing to do with work that has a response variable “delta”, a couple of continuous variables that effect delta and then a nominal variable “coding” which is essentially, control / experimental. I’ve fit several different models to this data but knowing what I know about how I built the dataset, it makes sense to nest the continuous variables within “coding”. The report it produces has a lot of nicely formatted data and you can run additional tests and reports from within that.

R is a much more programmy kind of tool. It’s possible to get the same model in one line of code.
> lmfit <- lm(data$delta ~ data$coding + data$dt*data$coding + data$dt2*data$coding, read.csv("derp.csv"))
But then you have a result object that you interrogate for information. I'm still struggling with the data structures R presents. The way you index into them to get at specific data really hasn't clicked yet and I find myself fumbling around in the terminal a lot. This is what it looks like when you ask it for the linear model and a summary thereof.
> lmfit Call: lm(formula = data$delta ~ data$coding + data$dt * data$coding + data$dt2 * data$coding, data = read.csv("derp.csv"))


Coefficients:

          (Intercept)           data$codingb                data$dt

              2.15264                0.01773                0.75516

             data$dt2   data$codingb:data$dt  data$codingb:data$dt2

              0.51304                1.67854                0.96067  
> summary(lmfit)

Call:

lm(formula = data$delta ~ data$coding + data$dt * data$coding +

    data$dt2 * data$coding, data = read.csv("derp.csv"))
Residuals:

     Min       1Q   Median       3Q      Max

-2.25889 -0.22361  0.03567  0.39381  2.18111 
Coefficients:

                      Estimate Std. Error t value Pr(>|t|)

(Intercept)            2.15264    0.37886   5.682 2.23e-06 ***

data$codingb           0.01773    0.55084   0.032  0.97452

data$dt                0.75516    0.21063   3.585  0.00104 **

data$dt2               0.51304    0.20672   2.482  0.01818 *

data$codingb:data$dt   1.67854    0.29369   5.715 2.02e-06 ***

data$codingb:data$dt2  0.96067    0.29090   3.302  0.00226 **

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.018 on 34 degrees of freedom Multiple R-squared: 0.9205, Adjusted R-squared: 0.9088 F-statistic: 78.75 on 5 and 34 DF, p-value: < 2.2e-16

It's not as pretty but all of the information is there. Also you can save everything as a script and rerun it if you get more data or a different dataset with the same columns. JMP also supports scripting but I honestly have never used it.

I do find R's form of the fit equation to be easier to understand but either way you have to make sense of the idea that "if experimental use these coefficients, if control use these others":
R:
(Intercept) 2.15264 data$codingb 0.01773 data$dt 0.75516 data$dt2 0.51304 data$codingb:data$dt 1.67854 data$codingb:data$dt2 0.96067
vs
JMP:
5.476 + Match( :coding, "a", (:dt - 1.3) * 0.755, "b", (:dt - 1.3) * 2.434) + Match( :coding, "a", (:dt2 - 1.25) * 0.513, "b", (:dt2 - 1.25) * 1.474) + Match( :coding, "a", -1.700, "b", 1.700, . )

dinner

I went out to the garden this afternoon. We had a comparatively large amount of rain last week and the garden is now completely overrun with weeds. However, it is also overrun with vegetables. I picked a bunch of green beans and I harvested the largest of the beets. That and a glass of wine is what I had for dinner tonight. I did cheat a bit though, I ate a pear while I was cooking. It looks like later this week I’ll have my first yellow squash.

Sadly the Brussels sprouts have succumbed to the little grey bug infestation. The sprouts themselves are compromised and I’m going to have to pull everything up and throw it away. Sad times. The lesson to learn here is plant winter crops for the winter.

bits

http://arstechnica.com/tech-policy/2013/05/feds-seize-money-from-top-bitcoin-exchange-mt-gox/

After the big bitcoin drop back in April I’ve been occasionally checking in on it for no good reason. Tonight the internet tells me that a Dwolla account owned by Mt.Gox was seized by DHS. Dwolla being a primary way of getting your government backed currency of choice into the deflationary speculation instrument of your dreams. The parts of the internet that care about BTC are in an uproar, naturally. Now everyone is waiting for the other shoe to drop which is to say, at some point it will be announced why the account was seized. The discussion camps seem split along the expected lines. The Romantics claim BTC will come roaring back as populist demand rises for the best money that can’t buy food amid the artificial scarcity brought on by the corrupt DHS, dancing on their puppet strings held by Wall St. / Bipedal Lizards in Disguise, clamping down on liquidity. The Schadenfreudians hoot their told-you-sos from the sidelines and speculate on the exact size and shape of the unsavory things to be revealed for them to roll around in like a dog on a three-day-dead squirrel.

I’ll close with a quote

I recently heard Bitcoins referred to as “Dunning-Krugerrands”, which I think might be the most perfect neologism I’ve ever seen.

changes for the worse

There are a few things I’ve done in the interest of making it take a few seconds longer to trash the site. First, I turned off all of the fancy-ass, trackback, social media bullshit. That’s fine. No one was using that and it shouldn’t be on by default anyway (seriously, wtf). Sadly, the second thing I’ve done is disable comments. I’m not sure, but I have reason to believe that the comment system was the source of the compromise last time. I certainly found a bunch of unpleasant looking stuff in the error logs that was directed at the comment portal. Of course, it’s equally likely the admin account was brute-forced so I guess there’s that too. I’ve resolved to keep better backups (once again) and I’ve read a few hopeful things about how to make it harder to knock this over but I’ve also kind of resigned myself to the idea that I’m going to have to do this again and I’ve taken some steps to make that easier as well.

I ought to keep a counter of how long this stays unmolested. Anyway, if you want to get my attention about something and don’t already know how, you can send mail to the admin address or whatever. Sorry.

Heal Arr

Do we need to have a refresher on how to do the Onyxia encounter? I’ve pretty much stopped going to Ony because it’s a crap shoot if I’m going to take durability on what should be a 20 min loot farm and it always seems to be amature night but against my better judgement I went tonight after the AQ40 drubbing. I died during the trash because I got healing agro when we pulled two Warders at the same time. I should have hearthed right there as should the rest of the raid but ever optimistic we forged boldly ahead.

First round. Fight! We get whelps right off the bat. That’s never a good sign. Fortunately they managed to die quickly so we could get to the business of the dragon. I have been to few raids where I have seen her drop so slowly. I guess I don’t really care because at this point even in full FR I regen enough mana that I can cast heal rank 2 practically forever. I would like to point out that FR is advised for this fight. I cannot believe the amount of damage people were taking from the Warders’ blast waves so I had at least hoped that everyone was wearing dps gear. I guess we were all wearing our pirate costumes or something. At long last we get to phase two and about 1/3rd of the raid gets incinerated by the deep breath. Turns out those pirate costumes are flamable. This may have been a blessing in disguise however, because it meant that she only had to chew through another 1/3rd of the raid to get to the MT on the hate list for phase three. So there we are 2/3rds of the raid dead, me and two paladins healing and a few people doing damage. Over the nine minutes that phase three lasted I used inner focus three times, three major mana potions and a freaking soul stone. The only person I was healing during this portion of this fight was Arr and I was using every freaking trick in the book. I do not need to hear five people in TS, none of which is the raid leader, telling me to heal Arr every time he takes damage. I know what I am doing. Seriously, there is nothing more aggrivating as a healer than to have people yelling for heals. It’s even more egregious that another healer was the worst offender. Arr finally died when I was out of mana, feared and I then could not get my last inner focus Gheal off in time.

The second verse was much the same as the first complete with me using another damned soul stone, and more mana potions and inner foci. This time we got a Warder to go with the whelps during the pull too.

Finally, when I zoned back in to collect my corpse I was summoned in combat to Ony. The one thing that went right tonight was that my hearthstone finished casting before her flame breath did. All told I probably blew 30g on what should have been a profitable venture.

I have issues with other stupid crap I saw going on or was victim to during this travesty but this is already over long and my bridges burn brightly as is.

I have a few of questions.

1) How do you people afford this?

2) Do you believe that Arr is healing himself?

3) What did you think I was doing instead of healing Arr?

4) Why has this guild not been able to execute the phase three transition in this fight correctly for the last three months? (Last week after we killed Nefarian we went to Ony and got about half the raid dead during the transition. In contrast we lost three people during Nef.)

5) What is the most powerful paladin blessing? Here’s a hint: It’s not kings.

I’m sick of this. We just waltz in there and get our asses handed to us and everyone seems fine with it (with a few notable exceptions, thanks for calling the raid Entity /bow). Oh we shouldn’t let it get us down that we suck. We’ll come back another time and get her. Now let’s spend a half an hour talking about dkp in an instance that is so much harder than Onyxia that we are not fit to disgrace its floor with our corpses. How is this fun?

It was pointed out to me that this could be the solution to our dkp problem. If we just never kill anything ever again we don’t have to worry about who gets what first or second or neverth or how unfair it is that some people get loot before others. No one who was there tonight deserves anything other than scorn myself included for wasting everyones time by attempting to save the run but ultimately only prolonging it.

So now it’s an hour after bedtime and all I’ve done is flame the board which is futile at best. I expect to see some hate up ins tomorrow when, groggy and sleepy-eyed I open this thread instead of working. I’m sure I’ve mispelled something or put an accent in the wrong place and clearly that is more important than our collective ineptitude. Feel free to troll the discussion to your heart’s content.

What say you?

Posted on: 2006/10/31 1:06