On New York School Tests, Warning Signs Ignored

Ruby Washington/The New York Times

A student is tested in Brooklyn. Experts had warned that the tests had become too easy.

By JENNIFER MEDINA

October 10, 2010

When New York State made its standardized English and math tests tougher to pass this year, causing proficiency rates to plummet, it said it was relying on a new analysis showing that the tests had become too easy and that score inflation was rampant.

David Goldman for The New York Times

Randi Weingarten, second from left, president of the teachers’ union; Chancellor Joel I. Klein, center; and Mayor Michael R. Bloomberg at a news conference in 2009 discussing improved math results.

Michele McDonald for The New York Times

Daniel Koretz, a Harvard professor, oversaw the study of New York’s tests that led to the state’s conclusion that the exams had become too easy to pass.

Kirsten Luce for The New York Times

Betty Rosa, a member of the Board of Regents, said the unprecedented high scores had seemed unbelievable.

But evidence had been mounting for some time that the state’s tests, which have formed the basis of almost every school reform effort of the past decade, had serious flaws.

The fast rise and even faster fall of New York’s passing rates resulted from policies, decisions and missed red flags that stretched back more than 10 years, as laid out in correspondence and in interviews with city and state education officials, administrators and testing experts.

The process involved direct warnings from experts that went unheeded by the state, and a city administration that trumpeted gains in student performance despite its own reservations about how reliably the test gauged future student success.

It involved the state’s decision to create short, predictable exams and to release them publicly soon after they were given, making coaching easy and depriving test creators of a key tool: the ability to insert in each test questions for future exams. Next year, for the first time, the tests will not be released publicly.

It involved a national push for numbers-based accountability, begun under President George W. Bush and reinforced by President Obama. And it involved a mayor’s full embrace of testing as he sought to make his mark on the city, and then to get re-elected.

“They just kept upping the stakes with the scores, putting more pressure on the schools but not really looking at what it all means,” said Pedro Noguera, an education professor at New York University who has worked with the city’s Department of Education to help improve struggling schools.

New York has been a national model for how to carry out education reform, so its sudden decline in passing rates may be seen as a cautionary tale. The turnaround has also been a blow to Mayor Michael R. Bloomberg and his chancellor, Joel I. Klein, who, despite warnings that a laserlike focus on raising scores could make them less and less reliable, lashed almost every aspect of the city’s school system to them. Schools were graded on how much their scores rose and threatened with closure if the scores did not. The scores dictated which students were promoted or left back, and which teachers and principals would receive bonuses.

Even now, the city believes that the way it uses the tests is valid. The mayor and the chancellor have forcefully defended their students’ performance, noting that even after the changes this year, student scores are still better than they were in 2002. They have argued that their students’ progress is more important than the change in the passing rate, and that years of gains cannot be washed away because of a decision in Albany to require more correct answers from every student this year.

The test scores were even used for a new purpose this year: to help determine which teachers should receive tenure.

“This mayor uses data and metrics to determine whether policies are failing or succeeding,” said Howard Wolfson, the deputy mayor for government affairs and communications. He also helped run Mr. Bloomberg’s re-election campaign in 2009, using the city’s historic rise in test scores to make the case for a third term. “We believe that testing is a key factor for determining the success of schools and teachers.”

“Under any standard you look at,” he added, “we have improved the schools.”

But given all the flaws of the test, said Prof. Howard T. Everson of the City University of New York’s Center for Advanced Study in Education, it is hard to tell what those rising scores really meant.

“Teachers began to know what was going to be on the tests,” said Professor Everson, who was a member of a state testing advisory panel and who warned the state in 2008 that it might have a problem with score inflation. “Then you have to wonder, and folks like me wonder, is that real learning or not?”

New Generation of Tests

The problems that plagued New York’s standardized tests can be traced to the origin of the exams.

In 1996, New York set about creating tests for fourth and eighth graders as a way to measure whether schools were doing their jobs. A precursor to the widespread testing brought about by Mr. Bush’s No Child Left Behind law, the tests replaced more basic exams that had been given in the same grades, which simply determined whether students needed remedial instruction. (The city had also given its own tests for many years.)

Teachers pushed back, saying they could gauge their students’ performance better than any mass-produced tests could. “There was a lot of resistance from throughout the education community to having the tests,” said Alan Ray, who was the chief spokesman for the State Education Department in the 1990s and in 2000, and retired this year after overseeing data for the office.

But education officials in New York, and many other states, were coming to the conclusion that some measurement system, no matter how limited, was necessary.

The officials sought advice from dozens of educators across New York to figure out what the tests should encompass, Mr. Ray said. Teachers and principals asked that the standards be specific, to make it clear what they were expected to teach at each grade level, and superintendents pleaded to keep the tests relatively short so that students would not spend days filling in bubbles. The state granted both requests.

The decision to keep the tests narrow and short — the fifth-grade math test, for example, had 34 questions this year — would have a lasting impact, said Daniel Koretz, a professor at Harvard’s Graduate School of Education who specializes in assessment systems. The same types of questions would be trotted out every year, he said.

“In many cases you could not write an unpredictable question no matter how hard you tried,” Professor Koretz said. He oversaw the study of New York’s tests that led to the state’s conclusion that they had become too easy to pass.

The state also continued making tests public after they were administered. Coupled with the questions’ predictability, the public release of the tests, which started long before the nationwide accountability movement, provided teachers with ready-made practice exams.

“If people had known what an effective lever the tests would be of driving behavior, I think they would have designed the tests differently,” said John King, who became deputy state education commissioner last year.

According to testing experts, publicly releasing the exams had another detrimental effect.

Test designers like to insert questions that do not count in the score but might be used in a future test. The designers gauge how students — who are not told which questions will not count — perform on these questions. This allows them to create tests with a mix of easy, moderate and difficult material that is constant — or standardized — from year to year, so that administrators can compare one year’s performance with another’s.

But in New York, “field test” questions cannot be included in tests because those exams are made public.

Even the solution proved problematic. Test writers had to administer a separate field test made up wholly of questions they wanted to use in the future. This test was given to random samples of students, and was not publicly released.

The trouble with this approach, experts said, is that the students taking these field tests, and their teachers, know that they are not the official exams. They may not try as hard to answer the questions correctly, and thus give an inaccurate measure of how hard the questions are.

“There’s a lot of debate about what’s valid in that kind of prediction,” said Andrew Ho, a testing expert who also teaches at Harvard. “It may be making assumptions that are not really correct to start with.”

The new state tests were rolled out in 1999. The first batch of results, free of test preparation and repeated questions, provided both a starting point and a troubling revelation: New York’s public school students were being poorly educated. Just 38 percent of the state’s eighth graders passed the math test and 48 percent passed in reading. In New York City, those numbers were 23 percent and 35 percent, respectively.

The reaction, Mr. Ray said, “was electrifying.” Some even questioned whether the standards had been set too high.

A Mayor Chases Results

The state tests’ flaws would not become evident for years. But by 2001, the tests had a champion.

During his first campaign, Mr. Bloomberg said that education was his top priority. He pledged to take control of the city’s public schools, then under the supervision of the Board of Education, which had been ridiculed for budget troubles and stagnant academic performance.

Projecting the image of a bottom-line-oriented, pragmatic businessman, Mr. Bloomberg latched on to test scores as a clear way of seeing just how well students were doing.

“If four years from now reading scores and math scores aren’t significantly better,” Mr. Bloomberg said in a radio interview in 2001, “then I will look in the mirror and say that I have been a failure. I’ve never failed at anything yet, and I don’t plan to fail at that.”

After Mr. Bloomberg persuaded the Legislature to give him control of the schools, he appointed Mr. Klein, a former Justice Department lawyer and media executive, as his chancellor. Mr. Klein was seen as a technocrat who was eager and able to produce tangible results, the kind that could be measured.

Scores in the city and state were on their way up. In 2004, for example, the proportion of fourth graders in the city meeting math standards increased to 68 percent, up 16 percentage points since 2001. Only 42 percent of eighth graders met that mark, but that was still a significant improvement from just a few years earlier. By 2009, that rate would jump nearly 30 points.

“What is encouraging is that for two or three years in a row now, the tests have gone in the same direction — up,” the mayor said on a radio show in October 2004. “So there’s reason to believe we’re headed to the correct place.”

In 2003, Mr. Bloomberg ended the practice of “social promotion” in certain grades, requiring that students performing at the lowest levels on the tests be held back unless they attended summer school and showed progress on a retest. That year, Mr. Klein released a list of 200 successful schools, the only places where teachers would not have to follow the citywide math and English curriculums. The list was primarily based on test scores.

More and more of the mayor’s educational initiatives were linked to the scores. They were used to help decide which schools should be closed and replaced with new, smaller schools. The new A-through-F grading system for schools was based primarily on how their students improved on the tests. Teachers and principals earned bonuses of up to $25,000 if their schools’ scores rose. Teachers’ annual evaluations and tenure decisions are partially dependent on test results.

Each new policy was met with denunciations from the teachers’ union or from education experts like Diane Ravitch. Ms. Ravitch, a supporter of standardized testing when she was an adviser to the Clinton and Bush administrations, became one of the biggest critics, arguing that schools were devoting too much time to the pursuit of high scores.

“If they are not learning social studies but their reading scores are going up, they are not getting an education,” Ms. Ravitch said in 2005, as the mayor coasted to re-election.

The mayor and chancellor dismissed these criticisms as the hidebound defenses of an old, failed system devoid of meaningful standards. But some questions were also being raised by people close to the administration.

In the Education Department headquarters on Chambers Street, some officials argued that the A-through-F system of grading schools should incorporate not only the English and math tests, but also the science and social studies exams given by the state. “We wanted to draw this as broadly as possible,” said a former school official who spoke on the condition of anonymity to avoid publicly disagreeing with Mr. Klein.

But after months of running models and tweaking formulas, Mr. Klein decided to stick with the two core subjects. After all, he often argued, if students could not master essential math and English skills, it would be impossible for them to grasp other concepts.

Dr. Noguera, the N.Y.U. education professor and adviser to the city, applauded Mr. Klein for creating a grading system that rewarded improvement from year to year so that schools in poor neighborhoods had the same chance of achieving a good grade as those in wealthier areas.

But it also was risky, Dr. Noguera said. “That got schools fixated on how to raise scores, not looking for more authentic learning,” he said.

Dr. Noguera expressed his views publicly and to some of Mr. Klein’s deputies, but never directly told the chancellor, he said.

Mr. Klein said in recent interviews that while the tests were imperfect, they were still the best measurements available for a school system that previously had no yardsticks. They also were not the only signs proving the city had been making progress, he said: On more difficult federal tests given to a sample of fourth and eighth graders, the city had steadily improved.

And the city’s main goal, he said, was not simply giving out laurels for students’ scoring 3s (“proficient”) and 4s (“advanced”) on the state tests.

Instead, its system of school grades and teacher incentives gave considerable weight to scores that showed improvement from year to year at all levels.

“Nobody else was doing this,” Mr. Klein said. “We never said it was good enough to get to passing and just stay there.”

In 2006, the state added tests for the third, fifth, sixth and seventh grades, in order to align with the requirements of No Child Left Behind. Scores jumped in 2007.

There were improvements at every grade level across the state and in New York City, where 65 percent of all students met state standards in math, an improvement of eight percentage points in one year.

“I’m happy, thrilled — ecstatic, I think, is a better word,” Mr. Bloomberg said at the time. “The hard work going on in our schools is really paying off.”

After Mr. Bloomberg’s first full term as mayor, the new scores seemed to ratify his claims of success. They also raised more alarms.

As a superintendent in the Brownsville section of Brooklyn, Kathleen Cashin had seen several schools improve throughout the early part of the decade. But when she saw the sudden jump, she said, she was shocked.

“I said to my intimate circle of staff, this cannot be possible,” Ms. Cashin recalled. “I knew how much effort and how much planning any little improvement would take, and not all of these schools had done any of it.”

But Ms. Cashin, who retired in February, held her tongue at the time. Asked why she did not take up her concerns with Mr. Klein or his deputies, she said, “I didn’t have their ear.”

A Proposal for a Fix

The following winter, Professor Koretz, of Harvard, and Professor Everson, of CUNY, who was a member of a state testing advisory group, sent a memo to state education officials.

“Research has shown that when educators are pressured to raise scores on conventional achievement tests, some improve instruction, while others turn to inappropriate methods of test preparation that inflate scores,” they wrote in the Feb. 5, 2008, memo. “In some cases, the inflation of scores has been extreme.”

The researchers proposed to devise a kind of audit. While tests tended to be similar from year to year, they would add to each exam some questions that did not resemble those from previous years. If a class performed well on the main section of the test but poorly on the added questions, that would be evidence that scores were inflated by test preparation. If a class performed well on both, the researchers wrote, that teacher might have methods worth emulating.

In addition, they wrote, such a system would give teachers “less incentive to engage in inappropriate test preparation and more incentive to undertake the much harder task of improving instruction.”

State education officials, the professors said, did not give them a hearing.

The 2008 results showed more large gains: 74 percent of city students were deemed proficient in math, an increase of nine points in one year, and the city’s passing rate in reading was now 58 percent, up from 51 percent two years earlier. Statewide, the passing rates jumped to 81 percent in math and 69 percent in reading.

Professor Koretz and Professor Everson wrote another memo in September 2008, again proposing to create a way to make test results more reliable. But the idea went nowhere.

Richard Mills, who resigned as education commissioner in 2009 after 14 years, said in an interview that his administration was confident the tests had been working properly because they had gone through several vetting processes with independent testing experts.

“The whole testing matter at the state and national level is about judgments, and therefore it also always includes criticism,” Mr. Mills said. “Is it too hard or is it too easy? There is almost never a point — or at least it doesn’t endure for very long — where people say it is just right.”

A Decisive Year

The city’s Department of Education constantly mines test score data for patterns to show where improvement is happening and where it is needed. In 2008, it noticed an incongruity: Eighth graders who scored at least a 3 on the state math exam had only a 50 percent chance of graduating from high school four years later with a Regents diploma, which requires a student to pass a certain number of tests in various subjects and is considered the minimum qualification for college readiness.

The city realized that the test results were not as reliable as the state was leading people to believe.

Mr. Klein and several of his deputies spoke by phone with Merryl H. Tisch, the vice chancellor of the Board of Regents, and Mr. Mills, trying to persuade them to create a statewide accountability system similar to the city’s, one that gave improvement at least as much weight as the score itself.

The state said it would consider moving to such a system, but would need more time.

Neither the city nor state publicly disclosed the concerns about the scores. By then, students across the state were preparing for the 2009 tests, filling in bubbles on mock answer sheets, using at least three years of previous state tests as guides.

The scores arrived in May, and with them, the bluntest warning yet.

Just before the results were released, a member of the Regents named Betty Rosa called Ms. Tisch, who had recently become chancellor.

Ms. Rosa, who had been a teacher, principal and superintendent in the Bronx for nearly three decades, said the unprecedented high scores simply seemed too good to be true. She suggested the unthinkable: the scores were so unbelievable, she said, that the state should not publicly release them.

“The question was really are we telling the public the truth,” Ms. Rosa said in a recent interview. Ms. Tisch, she said, relayed that she, too, found the scores suspicious, but that it would be impossible to withhold them. “It was like a train that was already in motion and no way to stop it,” Ms. Rosa said.

The English test scores showed 69 percent of city students passing. Mr. Bloomberg called the results “nothing short of amazing and exactly what this country needs.”

“We have improved the test scores in English,” he continued, “and we expect the same results in math in a couple of weeks, every single year for seven years.” Four weeks later, it was announced that 82 percent of city students had passed the math tests.

Because of the widespread improvement in the scores, 84 percent of all public schools received an A in the city’s grading system, something Mr. Klein said he later regretted. This year, the city limited the number of A’s to 25 percent of schools.

The 2009 numbers came out as the mayor was trying to accomplish two goals: to persuade the Legislature to give him control of the schools for another seven years, and to convince city voters that he deserved a third term.

Mr. Bloomberg’s opponent, Comptroller William C. Thompson, had once been president of the Education Board.

“Mike Bloomberg changed that system,” said one of the mayor’s campaign advertisements. “Now, record graduation rates. Test scores up, violence down. So when you compare apples to apples, Thompson offers politics as usual. Mike Bloomberg offers progress.”

In his debates, Mr. Bloomberg hammered home the theme. “If anybody thinks that the schools were better when Bill ran them, they should vote for him,” he said in one face-off. “And if anybody thinks they’re better now, I’d be honored to have their vote.”

Indeed, according to exit polls, 57 percent of those who said education was their primary concern voted for Mr. Bloomberg, who won the election by a five-point margin.

Mr. Wolfson, the deputy mayor and 2009 campaign strategist, said the mayor had no regrets about focusing on the exams as a matter of policy, and during the election.

“What’s the converse?” he said. “The converse is that we don’t test and we have no way of judging success or failure. Either you believe in standards or tests, or you don’t — and life is not like that. There are tests all the time.”

Ms. Tisch, in releasing the 2009 test results, had not heeded Ms. Rosa’s radical request. But the very day she put out the English test results, she began openly acknowledging doubts about the scores, irking the mayor and chancellor, who privately seethed that she was seeking to undermine their success. “As a board, we will ask whether the test is getting harder or easier,” she said.

Although the Regents did not immediately opt to create an entirely new test, Ms. Tisch and David Steiner, the new education commissioner, asked Professor Koretz, who had been rebuffed in previous requests, to analyze the ones that were in use. His conclusion — and that of another researcher, Jennifer L. Jennings — was that the tests had become too easy, and hence the scores were inflated. That led the State Education Department to raise the number of correct answers required to pass each test.

The state intends to rewrite future tests to encompass a broader range of material, and will stop publicly releasing them.

“We came in here saying we have to stop lying to our kids,” Ms. Tisch said in a recent interview. “We have to be able to know what they do and do not know.”

Robert Gebeloff and Elissa Gootman contributed reporting, and Jack Begg contributed research.

October 11, 2010