A few weeks ago, the Wisconsin Department of Public Instruction (DPI) released an example of what the upcoming report cards for state schools will look like. The report cards are described as part of the package of reforms that DPI promised to implement in order to win a waiver from the federal Department of Education from the more onerous burdens of the federal No Child Left Behind (NCLB) Act.
One of the qualifications for an NCLB waiver is that a state must put into place an accountability system for schools. The system must take into account results for all students and subgroups of students identified in NCLB on: measures of student achievement in at least reading/language arts and mathematics; graduation rates; and school performance and progress over time. Once a state has adopted a “high-quality assessment,” the system must also take into account student growth.
In announcing the NCLB waiver, DPI claimed that it had established accountability measures that “1) are fair; 2) raise expectations; and 3) provide meaningful measures to inform differentiated recognitions, intervention, and support.”
Designing a fair and meaningful system for assessing the performance of the state’s schools is a worthy endeavor. The emphasis for me is on the “fair” requirement. I consider an assessment system to be fair if it measures how successfully a school promotes the learning of whichever students show up at its door.
By this standard, our new report card system falls markedly short. It rewards schools that have the foresight to enroll well-situated students who can avoid the grinding burdens of poverty. At the same time, the report card system undervalues and thereby disserves the good work going on in lots of Wisconsin schools serving diverse populations of students.
In the following section of this post, I’ll describe the complex calculations that eventually generate the report card score that will be assigned to each Wisconsin school. Next, I’ll offer my take on the report cards. I’ll identify two fundamental problems with DPI’s report card approach; three technical oddities with the scoring that seem to lead to biased results; and the one big piece that is missing from the report card measures.
I. How the Report Card Scores Are Calculated
DPI’s report card will feature an overall score on a 100-point scale for each school. A grading scale is also provided. A score between 83 and 100 will qualify a school for the huzzahs that will accompany the “significantly exceeds expectations” level. On the other hand, a score below 53 condemns a school to the awkward “fails to meet expectations” category.
The grades are calculated on the basis of four different measures – “Student Achievement,” “Student Growth,” “Closing Gaps,” and “On-Track and Postsecondary Readiness.” There is also the possibility of deductions from the score if the school fails to meet identified standards with respect to test participation, absenteeism rate, and dropout rate.
A. The “Student Achievement” Measure
The first of the four measures that contribute to the overall score – Student Achievement – is relatively straightforward. This measure is based on WKCE results, with student scores sorted into the four traditional categories of minimal, basic, proficient and advanced, though the standards for each category are to be recalibrated with more demanding cut scores.
A school is assigned points based on the number of students in each of the WKCE categories. Each student who scores in the advanced range qualifies the school for 1.5 points, a score in the proficient range earns one point, one-half point is assigned for a student in the basic category, and students in the minimal category earn no points at all.
A three-year average of points per student is calculated that is weighted in favor of the more recent years. The calculation for each year is based on the three-year average enrollment for the school, rather than on the specific enrollment for the year that is being calculated. Scores are computed for both reading and math and the two are combined into a single overall score.
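The Student Achievement arithmetic can be sketched in a few lines of code. The point values per category are the ones described above; the year weights and the example inputs are my own placeholders, since DPI's exact weighting is not reproduced here.

```python
# Point values per WKCE category, as described above.
POINTS = {"advanced": 1.5, "proficient": 1.0, "basic": 0.5, "minimal": 0.0}
# Hypothetical year weights, oldest to newest (DPI's actual weights differ).
YEAR_WEIGHTS = [0.2, 0.3, 0.5]

def achievement_score(yearly_counts, enrollments):
    """yearly_counts: three dicts (oldest to newest) mapping WKCE
    category -> number of students. enrollments: the three yearly
    enrollments. Per the description above, each year's points are
    divided by the THREE-YEAR AVERAGE enrollment, not that year's own."""
    avg_enroll = sum(enrollments) / len(enrollments)
    score = 0.0
    for weight, counts in zip(YEAR_WEIGHTS, yearly_counts):
        points = sum(POINTS[cat] * n for cat, n in counts.items())
        score += weight * points / avg_enroll
    return score
```

In this form, a school whose students earn 0.8 points apiece every year, with flat enrollment, scores 0.8; the reading and math results would then be combined into the overall measure.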
B. The “Student Growth” Measure
The second measure – “Student Growth” – is based on a complicated calculation. The first step is to analyze the WKCE performance of all students for whom scores are available for the most recent two years. (Since high school students only take the WKCE once, this measure is inapplicable to high schools.) All students are categorized into the appropriate WKCE level – minimal, basic, proficient or advanced – based on their score the prior year, which we’ll call Year 1. Next, the students’ scores in Year 1 are compared to their scores for Year 2, the current year.
For students who did better, the sizes of their jumps are compared to how much improvement would be required to move those students up to higher WKCE levels over three years. If a student is on a three-year trajectory to move up to the next WKCE level, that qualifies the school for one “growth” point. If the student is on a trajectory to move up two levels – from minimal to proficient, for example – the school gets two growth points. In the unusual event that a student is on track to move from minimal to advanced, the school would qualify for three growth points. A similar calculation is carried out for students who start out as proficient or advanced but are on track to move down to the minimal or basic category. Each such student can earn the school one, and only one, “decline” point.
Next, the growth points and the decline points are plugged into a complicated, multi-step formula that eventually generates a Student Growth score.
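The point-assignment step can be sketched as follows. The cut scores separating the WKCE levels are placeholders (the real thresholds are being recalibrated and are not public), and the multi-step formula that converts points into a final Student Growth score is omitted.

```python
# Hypothetical cut scores for the WKCE levels -- for illustration only.
LEVELS = ["minimal", "basic", "proficient", "advanced"]
CUTS = {"basic": 440, "proficient": 480, "advanced": 530}

def level_index(score):
    """Return 0-3 for minimal through advanced under the assumed cuts."""
    idx = 0
    for i, name in enumerate(LEVELS[1:], start=1):
        if score >= CUTS[name]:
            idx = i
    return idx

def growth_points(year1_score, year2_score):
    """Project the one-year change over three years and count the WKCE
    levels crossed. Returns (growth_points, decline_points). Declines
    count only for students starting at proficient or advanced who are
    on track to fall to basic or minimal, capped at one point."""
    projected = year1_score + 3 * (year2_score - year1_score)
    start = level_index(year1_score)
    end = level_index(projected)
    if end > start:
        return end - start, 0
    if start >= 2 and end <= 1:
        return 0, 1
    return 0, 0
```

For example, a basic-level student whose score rises enough to project into the proficient range within three years earns the school one growth point.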
C. The “Closing Gaps” Measure
The third measure is the “Closing Gaps” score. It takes into account year-to-year changes in WKCE scores and graduation rates for the distinct student groups identified in NCLB. These groups are described as “American Indian,” “Asian,” “Black not Hispanic,” “Hispanic,” “students with disabilities,” “economically disadvantaged,” and “limited English proficient.”
Again, two years of data are utilized. For each group, a student achievement score is calculated in the same manner as for Student Achievement, the first measure. For each of the groups, the difference between the scores for the two years is determined. Improvement yields a positive score and decline a negative one.
Each of the student groups has a comparison group, which is white students for the four ethnic or racial groups and all other students for the remaining three groups. If the comparison group improved from the first to the second year, it has no impact on the score. If the performance of the comparison group declined, a penalty of half the measure of decline is applied to the scores of the matching “achievement gap” groups.
An average of the scores for all seven groups (or for as many groups as have at least 20 members) is calculated, that number is mathematically massaged a bit, and a Closing Achievement Gaps score is the result.
If the report card is for a high school, a similar calculation is carried out on the most recent two years’ graduation rates for the seven student groups and the result is a Closing Graduation Gaps score.
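In rough code, the Closing Gaps mechanics look something like this. The 20-student minimum is applied as described above, and the final “massaging” step is omitted.

```python
def gap_change_score(group_change, comparison_change):
    """group_change and comparison_change: year-over-year changes in the
    achievement score for the gap group and its comparison group. Half
    of any decline in the comparison group is charged as a penalty;
    comparison-group improvement has no effect."""
    penalty = 0.5 * comparison_change if comparison_change < 0 else 0.0
    return group_change + penalty

def closing_gaps(groups, min_size=20):
    """groups: list of (group_change, comparison_change, group_size)
    tuples, one per NCLB group present at the school. Groups with fewer
    than min_size members are excluded from the average."""
    scores = [gap_change_score(g, c) for g, c, n in groups if n >= min_size]
    return sum(scores) / len(scores) if scores else None
```

One consequence worth noticing: a gap group that improves modestly can still net zero if its comparison group declined by twice as much.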
D. The “On Track and Postsecondary Readiness” Measure
The fourth and final measure is called “On Track and Postsecondary Readiness,” which for the sake of brevity we’ll refer to as the On Track measure. For elementary schools this measure is based on attendance (80%) and 3rd grade scores on the WKCE reading test (20%). For middle schools, the inputs are attendance (80%) and scores on the 8th grade math WKCE (20%). Attendance is calculated by determining the total number of days students were absent over the course of the school year, dividing that number by the product of the number of school days and the number of students, and subtracting that percentage from 100.
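The attendance formula and the 80/20 combination for elementary schools can be written out directly. Only the weights and the attendance arithmetic come from the description above; the inputs are illustrative.

```python
def attendance_rate(total_absent_days, school_days, num_students):
    """Percent of possible student-days attended: total absences divided
    by (school days x students), subtracted from 100%."""
    possible_days = school_days * num_students
    return 100.0 * (1 - total_absent_days / possible_days)

def on_track_elementary(attendance_pct, grade3_reading_pct):
    # 80% attendance, 20% 3rd grade WKCE reading, per the weights above.
    return 0.8 * attendance_pct + 0.2 * grade3_reading_pct
```

A school of 100 students with 900 total absence days over a 180-day year would have a 95% attendance rate; the middle school version simply swaps in the 8th grade math WKCE for the 20% component.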
For high schools, the On Track measure is a combination of graduation rates (80%) and scores on the ACT test (20%). The graduation rate is an average of the percentage of students who graduate after four years in high school and the percentage who have graduated after six years. I assume that this means that for any given year, a high school’s graduation rate will be based on the four-year-cohort rate for the most recent class of graduates, plus the six-year cohort rate for the graduation class of two years ago.
A high school’s ACT score is based on five measures — the percentage of students who take the test, plus the percentage of tested students who achieve a score that meets or exceeds what the ACT folks have determined is the college-ready benchmark in the four subject areas of reading, English, math, and science. According to ACT, a benchmark score is the minimum score needed on an ACT subject-area test to indicate a 50% chance of obtaining a B or higher or about a 75% chance of obtaining a C or higher in the corresponding credit-bearing college course. These benchmark scores are 21 in reading, 18 in English, 22 in math and 24 in science, all out of a maximum score of 36.
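A sketch of the high school On Track calculation follows. The 80/20 split and the five ACT inputs are as described above; weighting the five ACT inputs equally is my assumption, since the description does not say how they are combined.

```python
# ACT college-ready benchmarks (out of 36), as listed above. The values
# are used upstream to classify tested students; only the resulting
# percentages enter the calculation below.
BENCHMARKS = {"reading": 21, "english": 18, "math": 22, "science": 24}

def act_component(participation_pct, benchmark_pcts):
    """participation_pct: % of students taking the ACT. benchmark_pcts:
    % of tested students at or above the benchmark in each subject.
    Equal weighting of the five inputs is an assumption."""
    parts = [participation_pct] + [benchmark_pcts[s] for s in BENCHMARKS]
    return sum(parts) / len(parts)

def on_track_high_school(grad_4yr_pct, grad_6yr_pct,
                         participation_pct, benchmark_pcts):
    # Graduation rate averages the four- and six-year cohort rates.
    grad_rate = (grad_4yr_pct + grad_6yr_pct) / 2
    return 0.8 * grad_rate + 0.2 * act_component(participation_pct,
                                                 benchmark_pcts)
```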
E. Penalty Points
The final step in determining a report card score is ensuring that the school is satisfying what have been defined as the three Student Engagement Indicators. To pass muster on the Test Participation component, a school must ensure that at least 95% of every NCLB-defined group of students in the school has taken the WKCE. The Absentee Rate indicator defines a category of students who are absent at least 16% of the time and requires that no more than 13% of a school’s students fall into this high-absentee category. Finally, the Dropout Rate indicator is satisfied if a school’s dropout rate is below 6%.
Five points are deducted from a school’s final score for each of the Student Engagement Indicators that it fails to satisfy.
II. Problems with the Report Cards
Before spelling out my concerns with DPI’s school report cards, I should clarify that I’m not against objective measures of school performance. I’m certainly not against standardized tests and I think we should make sensible use of whatever common measures we have to try to come up with objective assessments of schools’ performance.
But the measures and comparisons should be as fair as we can make them, in the sense that scores should be determined on the basis of the learning that’s taking place in a school rather than the particular characteristics of the students who are doing the learning. That’s where the problems I describe come in.
A. Two Fundamental Problems
1. The WKCE Test
The WKCE has few champions these days. The test is administered in October and the results are not available until the following spring, which renders the results essentially useless as a guide for teachers. For most schools, the test is not closely aligned with the curriculum and so the scores inherently have a bit of a random element. The test is only administered once to high school students, at the beginning of their sophomore year, and so neglects nearly three-quarters of high school students’ careers.
Finally, WKCE results only provide a snapshot of a student’s mastery of the materials tested on the exam. Without more, there is no reliable way of disaggregating how much of a student’s achievement on the test is attributable to the pedagogical efforts of the student’s current teacher and how much is the consequence of the aptitude, skills and background knowledge that the student brought to the classroom on the first day of the school year.
The state is in the process of developing a standardized test to replace the WKCE that will be aligned to the Common Core standards. However, the new test will not be administered statewide before 2014-15. So we are stuck with the WKCE for at least two more years.
Despite their manifold shortcomings, WKCE results provide the foundation for DPI’s school report card scores. For elementary and middle schools, measures based on WKCE results represent fully 80% of a school’s report card score. For high schools, the figure is slightly more than 50%, despite the fact that for purposes of the WKCE, a student’s high school career ends in October of his or her sophomore year.
Currently the WKCE is the only game in town when it comes to standardized tests that are administered to elementary and middle school students statewide. (For high school students, the ACT may also fill the bill.) It may be inevitable that WKCE results figure heavily in school accountability systems, at least until a better standardized test can be developed and put in place. Nevertheless, a school report card score is no more reliable than the measures on which it is based. Aside from everything else, the extent to which DPI’s approach is dependent on less-than-reliable WKCE results diminishes the confidence we can place in the report card scores that DPI computes.
2. The Challenge of Demographics
A challenge when devising a grading system for all schools in the state is figuring out how to take into account differences in student demographics. We know that there are significant achievement gaps between different categories of students. On average, white students perform better on most measures of student achievement than African-American and Latino students. Economically-disadvantaged students tend to do measurably worse than non-economically-disadvantaged students.
What is the appropriate way to take this into account? How can you ensure, for example, that an inner city Milwaukee school with a very diverse student body whose students are exceeding expectations would earn a higher score than a suburban school whose students are overwhelmingly white and economically secure but just sort of treading water?
Unfortunately, the DPI report cards have not solved this puzzle. They do not take student differences into account in a meaningful way. As a result, the score a school earns on the report card is likely to serve more as a measure of the school’s demographics than of the school’s success in boosting its students’ learning. Essentially, the report cards will assign a diversity penalty to schools whose students are not predominantly white and economically comfortable.
As we have seen, the report card includes an overall score that is calculated on the basis of four measures. Both the Student Achievement and the On Track measures are particularly sensitive to differences in student demographics.
It’s hard to predict the impact of these differences on measures based on WKCE results because past results will be recalibrated and the thresholds for the various WKCE levels raised in a way that has not yet been made publicly available. However, the On Track measure for high schools is not based on WKCE results. It is therefore a good place to look to see what impact demographic differences can have on the calculation of the report card score.
For high schools, the On Track component of the school report card score is based on graduation rates (80%) and scores on the ACT test (20%). I calculated the scores that would be earned by a hypothetical high school comprised entirely of African-American students who were graduating at rates and earning scores on the ACT at levels that exactly mirrored the rates and scores for African-American students statewide. I also calculated the score for a second hypothetical high school entirely comprised of white students who again were achieving at a level that exactly matched the state average for white students. The score on this measure for the all-African-American school would be 58.8. The score for the all-white school would be 87.4.
This difference has significant consequences. Consider, for example, two high schools. High School A has a student body that is 90% white and 10% African-American. (Obviously an actual school would have a richer mix of students than just white and African-American, but simplifying in this way helps make my point.) Both groups of students at the school achieve at levels that match the state average for their category.
High School B has a student body that is more diverse – 50% white and 50% African-American. Students at High School B are high achieving – both white and African-American students graduate at rates that are 10% higher than the state average for their category and both groups also score 10% higher than the state average on the ACT measures.
High School B should earn a higher score on this measure than High School A, right? Maybe it should, but it doesn’t. The score for this measure for High School A – where all the students match the state average – would be 84.6. The score for the more diverse High School B – where all the students exceed the state average for their category by 10% – would be 80.4. In this example, the difference in demographics overwhelms the difference in levels of student achievement.
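The blended arithmetic behind this comparison is easy to check using the two single-group scores computed above (87.4 and 58.8). Treating “10% higher” as a 1.1 multiplier on each group’s component score is a simplification, which is why School A lands a tenth of a point away from the figure above.

```python
# Single-group On Track scores computed earlier: statewide-average white
# and African-American students, respectively.
WHITE, BLACK = 87.4, 58.8

school_a = 0.9 * WHITE + 0.1 * BLACK    # 90/10 mix, average achievers
school_b = 0.55 * WHITE + 0.55 * BLACK  # 50/50 mix, each group 10% above average

print(round(school_a, 1), round(school_b, 1))
```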
This is not an aberration. The design of the report cards seems to be modeled in large part on the school grading system used in Florida. (About a year ago, members of DPI’s School Accountability Design Team were the audience for a presentation immodestly titled “Florida Formula for Student Achievement: Lessons for the Nation.”) In the invaluable Shanker Blog, Matthew DiCarlo writes that he studied the results of Florida’s grading system and found an inverse relationship between a school’s state-assigned grade and its poverty level. As he explains, “[A]ccording to Florida’s system, almost every single low-performing school in the state is located in a higher-poverty area, whereas almost every single school serving low-poverty students is a high performer.” He adds with understatement, “This is not plausible.” We can expect Wisconsin’s report card system to lead to similarly skewed results.
B. Three Oddities of the Measurements that Contribute to Biased Results
1. The Impact of Enrollment Trends on the Student Achievement Score
For the Student Achievement measure, a three-year average of points per student is calculated that is weighted in favor of the more recent years. Oddly, the calculation for each year is based on the three-year average enrollment for the school, rather than on the specific enrollment for the year that is being calculated. This wrinkle provides a boost to schools with growing enrollment and a handicap to schools whose enrollment is decreasing.
To illustrate, I compared the scores that would be earned by two hypothetical schools that had identical test performance but different enrollment trends. I assumed that both schools earned 0.75 achievement points per enrolled student in each of the three years. But the first school’s enrollment was 160 students in year 1, 180 students in year 2 and 200 students in year 3. The second school’s enrollment history was the mirror of the first – 200 students in year 1, 180 in year 2 and 160 in year 3.
If each year’s total points were divided by the enrollment for that year, both schools would end up with a score of 37.5 on this measure, which is to be expected since both schools had identical levels of performance. But since each year’s points are divided by the three-year average enrollment, and the current year’s results are weighted more heavily than the results from the two prior years, the first school would actually be assigned a score of 38.1 and the second school a score of 36.9. The difference reflects nothing but the opposite enrollment trends in the two schools. This makes no sense to me.
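Here is the calculation in sketch form. The year weights are placeholders (DPI weights recent years more heavily, but I have not reproduced its exact weights), so the scores differ slightly from the 38.1 and 36.9 above; the direction of the bias is the same.

```python
# Hypothetical year weights, oldest to newest. Scores are scaled so that
# 0.75 points per student corresponds to the 37.5 baseline above.
YEAR_WEIGHTS = [0.2, 0.3, 0.5]

def weighted_achievement(points_per_student, enrollments):
    """Each year's points are divided by the three-year AVERAGE
    enrollment, as the report card does, rather than by that year's
    own enrollment -- which is the source of the bias."""
    avg_enroll = sum(enrollments) / 3
    yearly = [points_per_student * e / avg_enroll for e in enrollments]
    return 50 * sum(w * y for w, y in zip(YEAR_WEIGHTS, yearly))

growing = weighted_achievement(0.75, [160, 180, 200])
shrinking = weighted_achievement(0.75, [200, 180, 160])
```

With identical per-student performance, the growing school lands above the 37.5 baseline and the shrinking school below it, and the two errors are mirror images of each other.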
2. The Impact of Number of Advanced Students on the Student Growth Score
There are two kickers to the Student Growth formula. The first is the treatment of students who tested at the advanced level in Year 1. There is no higher WKCE level for these students to reach. The formula deals with this by essentially assigning a growth point to every student who starts out in the advanced category, no matter what his or her performance in Year 2 (though such students can also generate a “decline” point if their Year 2 performance puts them on a downward trajectory toward the basic or minimal categories).
The upshot of this is that a school’s score on this measure depends both on (1) how much improvement is shown by students who started out in any of the bottom three WKCE categories, plus (2) how many students at the school started out at the advanced level. Students at one school can show more year-to-year growth on their WKCE scores than students at another school, yet the first school may end up with a lower score on this measure if the second school started out with significantly more students at the advanced level.
It is hard to tell what practical impact this scoring wrinkle will have, since we don’t yet know the percentages of students who will score at the advanced level under the state’s new and more rigorous scoring standards. Currently, it’s not unusual for somewhere between 60% and 70% of students at some Madison schools to score in the advanced range of the WKCE. At the same time, the percentages of students in the Milwaukee public schools who scored at the advanced level last year were 18.9% in reading and 13.8% in math.
If differences at anywhere close to this level of magnitude are still evident with the new scoring standards, then there will be no meaningful way to compare two schools with sharply different demographics on the Student Growth measure. The greater number of automatic credits awarded to students starting out at the advanced level at the more affluent school is likely to swamp whatever differences there may be in the improvements of the students at the two schools who start out at any of the three lower WKCE levels.
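A back-of-the-envelope example shows how the automatic credit can swamp real growth. All of the numbers here are hypothetical.

```python
def raw_growth_points(n_advanced, n_other, upward_share):
    """Every advanced starter earns an automatic growth point;
    upward_share is the fraction of the remaining students on an upward
    trajectory (one point each, for simplicity)."""
    return n_advanced + n_other * upward_share

# School X: 65 of 100 students start at advanced, modest real growth.
school_x = raw_growth_points(65, 35, 0.20)
# School Y: 15 of 100 start at advanced, twice the real growth.
school_y = raw_growth_points(15, 85, 0.40)
```

School Y's non-advanced students improve at double School X's rate, yet School X still collects substantially more raw growth points on the strength of its advanced starters alone.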
3. The Impact of School Size on the Student Growth Score
The second kicker in the Student Growth measure is its incorporation of a confidence interval. Confidence intervals are statistical calculations designed to indicate how much assurance we have of the accuracy of a reported statistical measure. The law of large numbers applies here. The more observations we can take into account in deriving a statistical measure, the more faith we can have in the accuracy of the result and the smaller the confidence interval associated with that result. Conversely, the fewer observations a statistical measure is based on, the more likely it is that an outlier is having an inordinate impact on the result, the less assurance we can have in the accuracy of that result, and the larger the confidence interval associated with it.
One of the steps in determining the Student Growth score is calculating a 75% confidence interval associated with the score. Once the confidence interval is computed, it is added to the student growth score previously calculated, a bit more mathematical mumbo-jumbo is applied, and then a final student growth score emerges.
Adding the confidence interval to the Student Growth score is obviously favorable to smaller schools. To get a sense of what difference this makes, I started out with the Student Growth “Total Factor” of 0.808 that is used in the example provided in DPI’s Accountability Index Technical Guide. I then calculated the confidence interval that would be generated if this Total Factor had been earned by Madison’s smallest elementary school, Lake View, which had 99 students take the WKCE last year, and also the confidence interval that would result if this Total Factor had been earned by the 245 students taking the WKCE at Leopold, Madison’s largest elementary school.
The upshot is that Lake View would be assigned a Student Growth score of 35.1 and, with identical performance, Leopold’s score would be 34.3. The difference would be attributable to nothing but the difference in the size of each school’s enrollment.
Adding the confidence interval to the initial student growth score also makes no sense to me. It is true that the smaller the school, the less confidence we can have in the accuracy of the Student Growth calculation. But any error is as likely to result in an inflated score as one that is too low. Adding the confidence level to the measure will lead to less rather than more accurate results because it generates a score that by definition is at the outer edge of likelihood. It also awards an unwarranted benefit to smaller schools and imposes an equally unwarranted handicap on larger schools.
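To see the size dependence concretely, here is a rough model of the 75% confidence interval using the standard normal approximation for a proportion. DPI's exact formula may differ, but the 1/sqrt(n) behavior that favors small schools is the same.

```python
import math

Z_75 = 1.1503  # two-sided 75% critical value of the standard normal

def ci_half_width(p, n):
    # Normal approximation for a proportion; width shrinks as 1/sqrt(n).
    return Z_75 * math.sqrt(p * (1 - p) / n)

TOTAL_FACTOR = 0.808  # the example value from DPI's Technical Guide

boost_small = ci_half_width(TOTAL_FACTOR, 99)   # a Lake View-sized school
boost_large = ci_half_width(TOTAL_FACTOR, 245)  # a Leopold-sized school
```

Under this approximation the 99-student school receives a boost roughly half again as large as the 245-student school's, for identical underlying performance.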
* * * *
These three measurement oddities tend to favor schools that boast a lot of students who start the school year under examination with advanced scores on the WKCE; that are smaller rather than larger; and that have growing rather than decreasing enrollment.
So tough luck, Milwaukee. As if that city’s schools don’t have enough problems, DPI’s report card score methodology rubs salt in the wounds by incorporating these gratuitous methodological biases.
The better and fairer approach would be to ignore confidence intervals in the calculation of Student Growth, disregard students who start the year at the advanced level in both the numerator and denominator of the Student Growth calculation, and base a school’s yearly Student Achievement scores on the number of students enrolled in the school that year rather than on a three-year average enrollment figure.
C. The One Big Missing Piece
The biggest shortcoming of the school report cards is that they do not include any value-added measures of student learning. As I explained in an earlier blog post, “Value added” refers to the use of statistical techniques to measure teachers’ impacts on their students’ standardized test scores, controlling for such student characteristics as prior years’ scores, gender, ethnicity, disability, and low-income status.
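A toy version of the idea: regress current-year scores on prior-year scores, then treat a school's average residual as its value added. Real models like VARC's control for many more student characteristics (and for instructional "dosage"); this sketch shows only the core mechanic.

```python
def simple_value_added(prior, current, school_of):
    """prior, current: parallel lists of student test scores for two
    years. school_of: the school label for each student. Fits a simple
    regression of current on prior scores, then returns each school's
    mean residual -- how far its students beat (or miss) prediction."""
    n = len(prior)
    mean_x = sum(prior) / n
    mean_y = sum(current) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(prior, current))
             / sum((x - mean_x) ** 2 for x in prior))
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(prior, current)]
    by_school = {}
    for school, r in zip(school_of, residuals):
        by_school.setdefault(school, []).append(r)
    return {s: sum(rs) / len(rs) for s, rs in by_school.items()}
```

Because each student's predicted score is conditioned on where that student started, a school full of low-scoring students can post a higher value-added figure than a school full of high-scoring ones – precisely the demographic neutrality the report cards lack.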
For the last several years, the Madison school district has received value-added reports prepared by the Value-Added Research Center, part of the UW Center for Education Research. Initially, the reports only compared Madison schools against each other. More recently the district has received a report that looks at our value-added figures as compared with state averages.
Value-added measures are absent from the school report card calculations but they feature prominently in another part of the state’s NCLB waiver request. The request also included a pledge to develop and adopt a new teacher and principal evaluation system designed to support student achievement. In the new system, measures of student achievement are to account for 50% of the overall teacher and principal rating.
DPI’s plan for evaluating teachers and principals explicitly incorporates the use of value-added measures, as this excerpt from DPI’s waiver request explains:
Individual value-added data will be used as one of several measures of student outcomes for teachers of covered grades and subjects. Value-added data will take into account the instructional time spent with students, also known as “dosage” in the value-added model to be developed by the Value-Added Research Center (VARC) at the Wisconsin Center for Education Research (WCER). The VARC model will also examine differential effects, or the varying effects a school/teacher has on student subgroups such as economically disadvantaged, English language learners, and students with disabilities.
It makes sense that the state will incorporate value-added measures in the portion of teacher evaluations that are based on student learning. While it has its shortcomings, value-added is the best way we have to measure student learning on a demographically-neutral basis.
This makes the absence of value-added from the school report card measures all the more puzzling. One would think that a system of rating schools statewide would strive to control for exogenous factors such as differing student demographics that can skew the measures of yearly student achievement. If the goal is to adopt an accountability system that is driven by what’s actually going on in a school’s classrooms rather than by the characteristics of the students the school serves, then value-added is clearly the way to go. It is inexplicable to me that DPI’s school report cards ignore value-added measures.
Any move away from assessing school performance on the basis of overall WKCE proficiency levels is likely to represent an improvement over current practice. And DPI’s school report cards could be useful as a way for individual schools to track how they are doing year-to-year in addressing student achievement issues. The report cards appropriately focus attention on the performance of students at the wrong end of the achievement gap and attempt to take into account year-to-year student growth, in however rudimentary a fashion.
But the report cards are explicitly not intended as a guide for school improvement efforts. In a July 16 letter to school administrators, Deputy State Superintendent Michael Thompson wrote that “The Technical Report Card is more of an instrument for transparency than an instrument for school improvement. Calculating accountability scores does not better inform school efforts. Engaging in a process of inquiry that incorporates multiple measures beyond the report cards will most certainly result in planning and decisions that are better informed and more likely to result in school improvement.”
So, what’s the point of the report cards again? If they are designed to promote transparency, their intended audience is not so much those responsible for the operation of our schools but the wider community instead. In the wider community, they will be used as a way of ranking schools one against another. School Boards will be called upon to explain why the scores for the district’s schools lag behind those of a neighboring district. Newcomers to Wisconsin will take the scores into account in deciding where to buy a house and raise their family.
It is in this comparative sense of ranking schools one against another – in the interests of transparency! – that the methodological shortcomings of the report cards will have their most pronounced detrimental impact. The manner in which report card scores are calculated, with its odd biases and the diversity penalty its unadjusted scores impose, will help ensure that the pride of position traditionally claimed by schools attended by our most fortunately-situated students will not be threatened by particularly inspiring and effective teaching taking place in less heralded locales. Whether intentional or not, it is inarguably the case that the design of the report cards will bring comfort to the comfortable and afflict the schools that already face the most daunting challenges.