August 19th, 2018

The Limitations of Student Evaluations

by Omar Ha-Redeye

Teaching isn’t easy. It can be rewarding, fulfilling, and at times challenging, and its quality varies considerably from instructor to instructor.

In the interest of providing high-quality education, most post-secondary institutions use a variety of metrics to attract and retain the best instructors and to optimize the educational experience. One of these tools is the student evaluation of teaching (SET), in which the individuals who are regularly exposed to the instructor provide the most direct form of input about the pedagogical tools employed.

As any instructor will tell you, though, not all educational approaches work for all students. You will never have every single student satisfied in any class. Often those most willing to complete an evaluation are those who have struggled the most, usually due to their own actions (or inactions, as it may be).

Response rates are rarely high enough to produce a representative sample, let alone statistically meaningful results. The response rates for online student evaluations are usually even worse than those for the paper versions.

These evaluations therefore often depict a bimodal distribution: students who either really love the course or really don’t. The experience in the course is then frequently attributed to the instructor, whether or not that instructor had much (or any) latitude over the course content, required readings, course objectives, or methods of evaluation. Conflating the two is the nemesis of all post-secondary instructors.

Even worse, student evaluations can project and reflect the biases of the students themselves, which can inadvertently hinder universities from including the types of diverse perspectives necessary to properly challenge assumptions and develop critical thinking.

Numerous studies have demonstrated a clear gender bias in student evaluations. French data revealed a perception of greater knowledge and classroom leadership skills that favoured men, despite an absence of any distinction in learning outcomes. Students tend to rate instructors better if they share a gender.

A similar American study found that the language used to describe female instructors differs from that used for male instructors, even when the course is offered online. This effect is even more pronounced for junior female instructors.

Other studies suggest that ethnic and racial biases may also play a role. Gregory S. Parks explores both in “Race, Cognitive Biases, and the Power of Law Student Teaching Evaluations,”

Navigating legal academia can be a difficult road for professors of color. One of the chief obstacles is securing solid teaching evaluations.

While not the only metric used to determine — among other things — promotion and tenure, student evaluations alone may give the impression that a professor of color has not mastered the course material or that he or she does not care about teaching. Student evaluations may give the impression that professors of color are incompetent, unintelligent, or lazy. Navigating poor, and potentially biased, student evaluations may also impact professors of color in other ways. It may cause them to scale back research and/or service to invest more time in teaching, which may undermine their broader promotion and tenure goals. Ultimately, the feedback, insights, and mentoring of senior and white faculty may be of little utility if their pedagogical approach is received well by students not simply because of methodology but because of race.

Parks suggests the following techniques from psychological literature to help law professors address these implicit biases:

Prime students with watermarks of white faces on slides
Prime students with the first names of positively regarded blacks and negative whites in hypotheticals
Dress the part
Conform to the teaching styles of the majority of senior, white faculty
Use a white teaching assistant
Do not give metrics throughout the semester of their performance
Teach more interesting, less difficult, and more familiar subjects
Be friendly and ask students questions outside of class

Quite a few items on this list could demoralize instructors of colour, and detract from the instructional autonomy and the fulfillment of those instructors. Sometimes people are forced to choose between professional success and personal integrity.

And sometimes not.

In 2014, the University of Waterloo launched a Course Evaluation Project Team to review practices on course evaluations, releasing a final report in 2017. The report acknowledges potential bias, and provides some recommendations about how to approach what it calls student course perceptions (SCPs).

Different views from the Faculty Association of the University of Waterloo, the Status of Women and Equity Committee, and members of the psychology department indicated there had not been enough investigation into the presence of historic bias at the University of Waterloo, and called for additional training to address these biases if they exist.

Although the final report proposes to continue to use these student evaluations, they are just one of three potential data sources used for annual performance appraisals, and for tenure and promotion purposes. The evaluations should focus on perceptions of course design, delivery, and learning experience, and should be designed to provide instructors with helpful and timely student feedback. The report also emphasizes that the university should explore additional and complementary teaching evaluation methods.

In 2016, the Ryerson Faculty Association (RFA) and the Ontario Confederation of University Faculty Associations (OCUFA) obtained an Expert Report on Student Evaluations of Teaching by Dr. Philip B. Stark, which reviewed the SET literature and concluded that SET show only a weak or negative statistical association with instructor effectiveness,

The best evidence suggests that SET are neither reliable nor valid, even when the survey response rate is nearly perfect.

Dr. Stark also expanded on the many sources of biases found in SETs in the literature:

22. There is substantial evidence that SET have large biases. Sources of bias include students’ grade expectations; the nature of the course material (for instance, instructors who teach courses with mathematical content tend to get lower ratings), the level of the course and whether the course is required, the course format, the size of the course, instructor gender, instructor age, instructor attractiveness, instructor expressiveness, instructor race, whether the instructor speaks with an accent or is a native speaker, the physical condition of the classroom, and so on. Many of these factors are protected characteristics under employment law: relying on student evaluations may have disparate impact on protected groups. Other factors may not be in the control of the instructor.
[emphasis added; citations omitted]

An Expert Report on Student Evaluations of Teaching (SET) by Dr. Richard L. Freishtat, also obtained by the RFA and OCUFA in 2016, reached a similar conclusion: student evaluations measure student satisfaction with a course, not the faculty’s teaching effectiveness. He provides some words of caution about the accuracy of the information in these evaluations,

Students should not be used to rate the adequacy, relevance, and timeliness of the course content nor the breadth of the instructor’s knowledge and scholarship. Most students lack the expertise needed to comment on whether the teaching methods used were appropriate for the course, if the content covered was appropriate for the course, if the content covered was up-to-date, if methods of student engagement used were appropriate to the level and content of the course, if the assignments were appropriate for promoting and assessing their own student learning, if what they learned has real world application, if what they learned will help them in future classes, if the type of assistance, help or support given to students was appropriate to the learning goals of the class, if the difficulty level of the course material was at an appropriate level, and if the course or the instructor was excellent, average or poor overall.
[citations omitted]

Although student evaluations had been a contentious issue at Ryerson since at least 2003, they became a formal issue in a series of grievances in 2009 and 2015, which culminated in the 2015-2016 collective bargaining negotiations.

All matters between the university and the RFA were resolved, except for the teaching assessments under Article 5. This section included the use of student evaluations for instructor evaluation, and promotion to full-time faculty. The matter was referred to arbitration, with a decision in Ryerson University v Ryerson Faculty Association by Arbitrator William Kaplan released last month.

The faculty relied on the expert reports above to indicate that the student evaluations should not be used for evaluating teaching effectiveness, and that they may contravene the Ontario Human Rights Code. Student evaluations here were reduced to averages and compared to other individuals, departments, faculty and across the university, which the faculty complained had little intrinsic value.

The university agreed that student evaluations could not be determinative of tenure or promotion, but argued they should be used to identify trends and concerns, and to provide relevant information alongside other means of assessment. However, it did not challenge the faculty’s expert evidence in any legally or factually significant manner.

The university also invoked the interest arbitration issue of gradualism,

Any change to a long-standing universally accepted evaluation tool must be careful, deliberate and subject to extensive study and review.

Arbitrator Kaplan agreed that student evaluations are the main source of information from students about their educational experience, and student satisfaction is an important mission of the university. However, given the use of student evaluations and their impact on tenure and promotion, they should be employed in a fair manner and evaluated with a high standard of justice,

The evaluation of teaching effectiveness for purposes of tenure and promotion is so important – to both the faculty member and the University – that it has to be done right. Tenure and promotion decisions need to be made on the best possible evidence.

Based on the demonstrable limitations of student evaluations in the expert evidence, Arbitrator Kaplan adopted the recommendation of Dr. Freishtat that the best methods of assessing teaching effectiveness are a comprehensive teaching dossier and in-class peer evaluations.

The question in the student evaluation assessing the overall effectiveness of the faculty member was struck completely. However, his award allows for the continued use of student evaluations as part of the dossier, even in the context of tenure and promotion decisions, but not for measuring teaching effectiveness during those decisions,

It is probably impossible to precisely measure teaching effectiveness. But the difficulties in doing so cannot serve as a justification for over-relying on a tool – the SET – that the evidence indicates generates ratings but has little usefulness in measuring teaching effectiveness. At the same time, [student evaluation] results can continue to be used in tenure and promotion, when the results are presented as frequency distributions, and when the end users are appropriately educated and cautioned about the inherent limitations both about the tool and the information it generates. As noted at the outset, [student evaluations] results provide information about the student experience, and, contextualized, are appropriately considered for tenure and promotion although, to repeat, not for reaching conclusions about teaching effectiveness.
[emphasis added]
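
The preference for frequency distributions over averages can be illustrated with a minimal sketch in Python. The ratings below are invented for illustration and are not drawn from the arbitration record; the 1-to-5 scale and the helper name `frequency_distribution` are likewise assumptions.

```python
from collections import Counter
from statistics import mean

# Hypothetical ratings on a 1-5 scale (invented numbers, not real data).
polarized = [1, 1, 1, 1, 3, 3, 5, 5, 5, 5]   # students who love or hate the course
uniform = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]     # students who are all lukewarm


def frequency_distribution(ratings, scale=range(1, 6)):
    """Count how many responses fall on each point of the rating scale."""
    counts = Counter(ratings)
    return {point: counts.get(point, 0) for point in scale}


# Both classes share the same average, but the distributions differ sharply.
for label, ratings in (("polarized", polarized), ("uniform", uniform)):
    print(label, "mean:", mean(ratings), "distribution:", frequency_distribution(ratings))
```

Both sets of ratings produce the same average, so a comparison of averages alone would treat the two classes as identical. Only the frequency distribution shows that one class was sharply divided while the other was uniformly lukewarm.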

Although this award did not provide the RFA with the full outcome they desired, it significantly limits the use of student evaluations compared to their historic use. With the move of student evaluations to an alphabetical scale instead of a numerical one, these evaluations will have far less impact on faculty, especially where appropriate training about their proper use is provided.

Arbitrator Kaplan rejected the gradualism argument as well,

While gradualism in interest arbitration is important, this case is one of demonstrated need for change and that conclusion informs the directions that follow. Put another way, this issue has been contested for many years. Despite their best efforts, the parties have not, collegially, or with third party assistance, been able to resolve it. There has been no lack of trying. When those attempts failed the parties, as they are entitled, sought third party mediation and adjudication and this is the result – a result reached after a hearing that benefited from expert evidence.

As an arbitral award, this decision does not hold precedential value for future disputes with other faculty at other universities. However, the expert reports prepared here will provide a solid basis for challenging the use of student evaluations where they are a primary or significant method of assessing teaching effectiveness.

Combined with a desire at Waterloo and other universities to find more effective methods to evaluate instruction, this controversial practice may be supplanted by other techniques that have been demonstrated to be more effective.

Maybe we can leave the slides with the watermark of white faces at home after all.
