Jorge Werthein: Analyzing Released NYC Value-Added Data Part 2

February 28, 2012

Analyzing Released NYC Value-Added Data Part 2

In part 1 I demonstrated there was little correlation between how a teacher was rated in 2009 and how that same teacher was rated in 2010. So what can be crazier than a teacher being rated highly effective one year and then highly ineffective the next? How about a teacher being rated highly effective and highly ineffective IN THE SAME YEAR?

I will show in this post how exactly that happened for hundreds of teachers in 2010. By looking at the data I noticed that of the 18,000 entries in 2010, about 6,000 were repeated names. This is because there are two ways that one teacher can get multiple value-added ratings for the same year.
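The step of finding teachers with more than one rating in the same year can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used for this post: the record layout (teacher ID, subject, grade, percentile) and the IDs are assumptions, and apart from the 97/2 and 97/6 pairs quoted in this post, the numbers are invented.

```python
from collections import defaultdict

# Hypothetical records: (teacher_id, subject, grade, percentile).
# The layout and IDs are assumptions; only the 97/2 and 97/6 pairs
# echo figures quoted in the post, the rest is made up.
records = [
    ("T001", "math", 5, 2),
    ("T001", "ela", 5, 97),
    ("T002", "math", 4, 55),
    ("T003", "math", 6, 97),
    ("T003", "math", 7, 6),
]

def group_ratings(rows):
    """Return only the teachers who have more than one rating."""
    grouped = defaultdict(list)
    for teacher, subject, grade, pct in rows:
        grouped[teacher].append((subject, grade, pct))
    return {t: r for t, r in grouped.items() if len(r) > 1}

multi = group_ratings(records)
for teacher, ratings in multi.items():
    print(teacher, ratings)
```

In this toy list, T001 stands in for a self-contained elementary teacher (two subjects, one grade) and T003 for a middle school teacher (one subject, two grades), the two cases described above.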

The most common way this happens is when the teacher is teaching self-contained elementary in 3rd, 4th, or 5th grade. The students take the state test in math and in language arts and that teacher gets two different effectiveness ratings. So a teacher might, according to the formula, ‘add’ a lot of ‘value’ when it comes to math, but ‘add’ little ‘value’ (or even ‘subtract’ value) when it comes to language arts.

To those who don’t know a lot about education (yes, I’m talking to you ‘reformers’), it might seem reasonable that a teacher can do an excellent job in math and a poor job in language arts, and that it should not be surprising if the two scores for that teacher do not correlate. But those who do know about teaching would expect the amounts the students learn in the two subjects to correlate, since someone who is doing an excellent job teaching math is likely to be doing an excellent job teaching language arts: both jobs are set up by some common groundwork that benefits all learning in the class. The teacher has good classroom management. The teacher has helped her students to be self-motivated. The teacher has a relationship with the families. All these things increase the amount of learning in every subject taught. So even if an elementary teacher is a little stronger in one subject than another, it is more about the learning environment that the teacher created than anything else.

Looking through the data I noticed teachers like a 5th grade teacher at P.S. 196, who scored 97 out of 100 in language arts and 2 out of 100 in math. This is with the same students in the same year! How can a teacher be so good and so bad at the same time? Any evaluation system in which this can happen is extremely flawed, of course, but I wanted to explore whether this was a major outlier or something quite common. I ran the numbers and the results shocked me (which is pretty hard to do). Here’s what I learned:

Out of 5,675 elementary school teachers, the average difference between the two scores was a whopping 22 points. One out of six teachers, or approximately 17%, had a difference of 40 or more points. About one out of 25 teachers, 250 teachers altogether, had a difference of 60 or more points, and, believe it or not, 110 teachers, or about 2% (that’s one out of fifty!), had differences of 70 or more points. At the risk of seeming repetitive, let me repeat that this was the same teacher, the same year, with the same kids. Value-added was more inaccurate than I ever imagined.
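For readers who want to reproduce this kind of summary, here is a minimal sketch of the arithmetic: the average absolute gap between a teacher's two scores, and the share of teachers whose gap is at least 40, 60, or 70 points. The function name `gap_summary` is hypothetical, and the sample pairs are invented except for the 97/2 pair mentioned earlier, so the output says nothing about the real data.

```python
def gap_summary(pairs, thresholds=(40, 60, 70)):
    """Mean absolute gap, plus the share of pairs at or above each threshold."""
    gaps = [abs(a - b) for a, b in pairs]
    mean_gap = sum(gaps) / len(gaps)
    shares = {t: sum(g >= t for g in gaps) / len(gaps) for t in thresholds}
    return mean_gap, shares

# (language arts percentile, math percentile) per teacher.
# Only the first pair comes from the post; the others are invented.
pairs = [(97, 2), (50, 45), (80, 30), (60, 62)]

mean_gap, shares = gap_summary(pairs)
print(f"average gap: {mean_gap:.1f} points")
for threshold, share in shares.items():
    print(f"gap of {threshold}+ points: {share:.0%} of teachers")
```

The same function applies unchanged to the second experiment, where each pair holds one teacher's scores for two different grades of the same subject.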

I made a scatter plot of the 5,675 teachers. On the x-axis is that teacher’s language arts score for 2010. On the y-axis is that same teacher’s math score for 2010. There is almost no correlation.

For people who know education, this is shocking, but there are probably people who are not convinced by my explanation that these scores should be more correlated if the formulas truly measured learning. Some might think this simply means that, just as there are people who are better at math than language arts and vice versa, there are teachers who are better at teaching math than language arts and vice versa.

So I ran a different experiment for those who still aren’t convinced. There is another scenario where a teacher got multiple ratings in the same year. This is when a middle school math or language arts teacher teaches multiple grades in the same year. So, for example, there is a teacher at M.S. 35 who taught 6th grade and 7th grade math. As these scores are supposed to measure how well you advanced the kids that were in your class, regardless of their starting point, one would certainly expect a teacher to get approximately the same score on how well they taught 6th grade math and 7th grade math. Maybe you could argue that some teachers are much better at teaching language arts than math, but it would take a lot to try to convince someone that some teachers are much better at teaching 6th grade math than 7th grade math. But when I went to the data report for M.S. 35 I found that while this teacher scored 97 out of 100 for 6th grade math, she only scored a 6 out of 100 for 7th grade math.

Again, I investigated to see if this was just a bizarre outlier. It wasn’t. In fact, the spreads were even worse for teachers teaching one subject to multiple grades than they were for teaching different subjects to the same grade.

Out of 665 teachers who taught two different grade levels of the same subject in 2010, the average difference between the two scores was nearly 30 points. More than one out of four teachers, or approximately 28%, had a difference of 40 or more points. Ten percent of the teachers had differences of 60 points or more, and a full five percent had differences of 70 points or more. When I made my scatter plot with one grade on the x-axis and the other grade on the y-axis, I found that the correlation coefficient was a minuscule 0.24.
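The correlation coefficient here is presumably the standard Pearson r, which can be computed without any libraries. The sketch below assumes that definition; the function name `pearson_r` and the sample scores are illustrative (only the 97/6 pair comes from this post), so the printed value is not the 0.24 reported above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and each list's root sum of squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# One teacher per position: score for one grade, score for the other grade.
# Only the first pair (97, 6) is from the post; the others are invented.
grade_a = [97, 40, 75, 20, 60]
grade_b = [6, 55, 70, 35, 30]

print(f"r = {pearson_r(grade_a, grade_b):.2f}")
```

A value near 1 would mean a teacher's two scores move together; a value near 0 means knowing one score tells you almost nothing about the other.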

Rather than report on these obvious ways to check how invalid these metrics are and how shameful it is that these scores have already been used in tenure decisions, or on how a similarly flawed formula will be used in the future to determine who to fire or who to give a bonus to, newspapers are treating these scores as if they were meaningful. The New York Post searched for the teacher with the lowest score and wrote an article about ‘the worst teacher in the city’ with her picture attached. The New York Times must have felt they were taking the high road when they did a similar thing but, instead, found the ‘best’ teachers based on these ratings.

I hope that these two experiments I ran, particularly the second one where many teachers got drastically different results teaching different grades of the same subject, will bring to life the realities of these horrible formulas. Though error rates have been reported, the absurdity of these results should help everyone understand that we need to spread the word since calculations like these will soon be used in nearly every state.

I’ve never asked the people who read my blog to do this before since I prefer that it happen spontaneously, but I’d ask you to spread the word about this post. Tweet it, email it, post it on Facebook. Whatever needs to happen for this to go ‘viral,’ I’d appreciate it. I don’t do this for money or for personal glory. I do it because I can’t stand it when people lie and teachers, and yes, those teachers’ students, get hurt because of it. I write these posts because I can’t stand by and watch it happen anymore. All you have to do is share it with your friends.
