Is knowledge worth testing any more? Is testing knowledge ‘Authentic’?

Bypassing the Remembering and Understanding steps when answering exam questions using Google.

I was challenged recently at a Learning and Teaching conference about how I assess students in Biosciences. Apparently my assessment methods are ‘not authentic’ and ‘overly test knowledge’ on the basis that almost every module on my course is, in part, assessed by a formal exam. There was some mention of a dodgy Einstein quote derived from “[I do not] carry such information in my mind since it is readily available in books. The value of a college education is not the learning of many facts but the training of the mind to think.” I’m sure context is everything in this quote, and very much doubt he was talking about fundamental knowledge underpinning established theories in physics. That got me thinking: Is knowledge worth testing anymore given that everything is accessible from Google? Furthermore: Seth Godin in his TEDx talk on education said, “There is zero value in memorizing anything, ever again. Anything worth memorizing can be looked up.” Sugata Mitra in “Living in the age where knowing is obsolete” suggested that answers to Physics tests could be googled and discussed, rather than answered alone in an exam. Eric Mazur suggested that any test whose answers can be googled is not an authentic test. There are many other examples, as outlined very nicely here (sorry, this blog by @webofsubstance has since disappeared). There are two inter-related themes going on here: Knowledge is ‘out there’ and accessible so there is little benefit to memorising it. Even if knowledge is good, testing of knowledge is bad. Related to this final point, all assessments should be ‘authentic’.

As a relative novice to ‘authentic assessment’, I was keen to find out more. In the style of the above experts (excluding Einstein), I Googled it, and came up with a link via Wikipedia on the definition of authentic assessment. A brief search suggested that the article is written by an expert on the topic, so I’m confident in the author being an authority on the subject. I have now read this and will discuss this document in relation to my assessment methods, which are really quite traditional (also known as ‘bad’). In ‘authentic assessment’, the assessment tool should mimic real-life situations. So having knowledge and applying it to a situation seems OK, but not testing the knowledge directly, which I think is what Eric Mazur is getting at. However if someone cannot complete an ‘authentic assessment’, what exactly is limiting the student’s progress? Is it an inability to recall the required knowledge to apply it in context, or an inability to apply their acquired knowledge in context? If Google is allowed in the examination, as proposed by some experts, then we are focussing on the application of knowledge. Sorry, did I say knowledge? It’s just that it’s not necessarily knowledge, is it? It’s more of a ‘state of transient observation’ whilst the information is on screen. Either way, the student got to the answer in a real-life connected-to-the-internet sort of way, so full marks all round. Well done. I’m not convinced.

My view of assessment, and in particular ‘authentic assessment’ in science, is that it should rely upon some known facts, and students should be able to apply those facts to conclude or deduce something, like an authentic scientist. As a research scientist, I have had some ideas over the years, and none of them have come about by not knowing anything about the topic. In fact I’d go as far as to say that ALL of what I’d call half-decent project ideas have come from a deep understanding of two or more disparate topics in great detail, and making some sort of interacting link between the two. So the question is: Am I assessing in a manner that is authentic to science? The classic inauthentic/traditional test is an MCQ test. It is not very often in real life that you have to take an educated punt at 4 possible answers, three of which are deliberately contrived to trip you up. However it can test knowledge. I find a close correlation between MCQ scores and scores on more analytic problems, but rarely can MCQs themselves be designed to test analysis. Where does the traditional University essay-style exam fit in? Authentic or not? It is dependent on the wording and what is being tested, and yes, many exams at lower levels do not really test anything other than pure recall of the lecturers’ notes. However any scientific or research report, where the introduction is a descriptive knowledge base, ideally with a hypothesis based on knowledge, and the discussion contains critical analysis of evidence, is essentially what many of us aim to replicate in traditional essay-style exams. So an essay question along the lines of:

Discuss the role of X and Y in the process of Z. In answering this question, explain the relative contributions of X (and/or Y) on Z in some relevant setting.

This is a fairly typical degree-level essay exam question. There is some knowledge recall, in that for good marks, you need to define X and Y before putting them in the context of process Z. In the second part, the students have to make some judgements on relative importance based on knowledge, which is not possible from memorising a 2D flow diagram. This structure is used below in one of my cancer-related exam questions, which I might set for Final year degree or Master’s-level Molecular Pathology students. Forgive the technical detail on this, but it is really very very interesting…

Discuss the role of p53 and pRb on regulating the cell cycle. In answering this question, explain the relative contributions of inactivation of genes coding for p53 and pRb in Cervical cancer vs Colon cancer.

(Feel free to skip this bit, but for those who are bothered or interested):

p53 and pRb are proteins that stop cells dividing at different stages of the cell cycle (that is, the time from one division to the next) by inter-related and distinct mechanisms. Loss of these (by gene inactivation, or protein degradation) promotes tumour growth. In cervical cancer, genes coding for pRb and p53 are very rarely inactivated because Human Papilloma virus (HPV) inactivates/degrades the p53/pRb proteins, so there is no selective pressure on inactivating the genes coding for them. In contrast, the p53 gene is usually mutated (inactivated) in Colon cancer, and pRb is either inactive, or something controlling pRb ‘messes up’ the cell cycle control process…

Is this an authentic assessment? Well, according to many experts in education, and the expert definition of authentic assessment, no it isn’t, given that I can get the answer from Google and that it is not a ‘real-life’ situation. I can get nicely written reviews on the topic, and even Wikipedia distils this down to the basics quite nicely, so no, this is not authentic at all. Here are the 5 ‘not-at-all-trying-to-induce-a-false-dichotomy’ features of Traditional vs. Authentic assessment by Jon Mueller, as obtained via Wikipedia.

Traditional Assessment vs. Authentic Assessment:

1) Selecting a Response         vs.                   Performing a Task

2) Contrived                            vs.                     Real-life

3) Recall/Recognition           vs.                    Construction/Application

4) Teacher-structured          vs.                     Student-structured

5) Indirect Evidence            vs.                      Direct Evidence

In my essay, students are 1) performing a task which is authentic to any scientist. Yes, they are selecting a response from memory and relaying that information, but they must apply their knowledge of why mutations arise in a certain manner. It is a 2) real-life task, in that it could represent the key elements of a justification/intro for a research proposal or research paper studying these genes in either cancer type. Yes, this is all on Google, but how would you know what to search for without the appropriate knowledge-base and understanding of how to put that information into context? 3) Although the first part of the exam question above relies on traditional recall of knowledge (“Discuss the role of p53 and pRb on regulating the cell cycle”), the question then asks students to apply this knowledge, and hopefully construct their own hypothesis of why certain events occur. They can only properly apply the knowledge if they understand two important concepts that have been taught previously, but are not specifically asked for. They must also recall and relate the causative events in Colon cancer. They must then ‘explain the relative contributions of inactivation of genes coding for p53 and pRb in Cervical cancer vs Colon cancer’. So although both pathways are ‘messed up’ differently in each tumour type, the net effect is the same: Cells grow fast and refuse to die. That is an important point to appreciate, and this essay allows me to assess whether the students have grasped this concept. So how about 4) Teacher-structured vs. Student-structured (whatever that means): As helpfully defined by Jon Mueller, “A student’s attention will understandably be focused on and limited to what is on the test. In contrast, authentic assessments allow more student choice and construction in determining what is presented as evidence of proficiency.” Well, it’s an exam, designed and written by myself.
I really don’t understand how a student-structured task makes it more authentic, other than to tick the ideologically ‘non-traditional’ box. I suppose I could leave it up to the students to write about a tumour type of their choice, or cell signalling pathway of their choice, but that might miss some conceptual nuances that I want to test. As the teacher/’expert’, I know that these pathways are relevant to ALL cancers, unlike a student’s possible choice of some random cell signalling pathway. I know that studying these two tumours highlights crucial differences in the mechanisms of how these important regulatory pathways are bypassed in cancer. Studying say, Colon vs. Pancreatic wouldn’t cover these nuances, but the students don’t know that. Finally, what about 5) Indirect evidence or direct evidence: Again, as helpfully defined by Jon Mueller “What thinking led the student to pick that answer? We really do not know. At best, we can make some inferences about what that student might know and might be able to do with that knowledge. The evidence is very indirect, particularly for claims of meaningful application in complex, real-world situations.” I know why a student has come to the final conclusion at the end of my essay question as I have tested the underlying fundamental knowledge first, and seen how they have applied their interpretation of the fundamental knowledge to a situation that requires synthesis of ideas from distinct parts of the curriculum. So yes, direct evidence is used to assess the student on application of knowledge. So the startling conclusion is that essay-style questions under test/exam conditions can be ‘quite authentic’, even if the answer can be obtained directly from Google. It also leads us to conclude that being able to Google the answer to tests is not necessarily a good way of assessing students. Finally, if a quote appears to be obviously ridiculous, it probably is ridiculous. 
There are so many situations where we cannot reasonably be attached to the internet. Fundamental knowledge recall will always be needed and hence, it’s probably worth ‘testing’ whether students have this prior to academic progression. As a scientist, here are just a few:

1) Understanding subsequent teaching sessions that rely on prior knowledge.

2) Listening to a conference paper presentation and wondering how X relates to Y, and whether it might also relate to Z, which the authors had not considered (but will form the basis of my next research project).

3) The PhD viva voce examination. A must for any independent research scientist.

4) Any viva voce examination at any level, so a must for any of my BSc or MSc graduates.

5) Being interviewed on how your recent research findings are relevant to X by the media, and how they might relate to something else (usually quite obscure…).

6) A lab meeting with the professor: “How does X relate to Y. What is Y? Why are you studying Y if you don’t know what Y is…? Get out of my office…”

7) Being interviewed for a job, where getting an interview may in part be based on your knowledge base and ‘decent grades’.

“Oh hang on, I’ll just ask Google”. Yes, that will go down well…  

PS: I rarely use Wikipedia as an initial research tool, but given the Godin, Mitra and Mazur quotes, I thought I would give it a go. Sorry. I won’t do it again.

PPS: In future posts I’ll discuss how I get students to answer questions based on interpretation of data from scientific research papers under exam conditions. I’d go as far as to say that this is also ‘somewhat authentic’.

Posted in Assessment, Student engagement

Should we adopt more active learning at the expense of cutting the STEM curriculum?

A few things have been troubling me recently. Do I teach too much stuff? Is teaching on my course too didactic? Am I over-reliant on knowledge transfer and passive learning? Do my students forget everything they have learnt on the course once they have sat their written formal exams? Are my students’ BSc/MSc marks not reflective of their ability/knowledge/skills at graduation? I’m going to go with the answer of ‘probably not’, just for now, despite negative comments about what is my preferred teaching style.

I have discussed a couple of major studies on active learning vs. passive learning in STEM subjects in my blog here and here. Despite these rather critical post-publication peer reviews, I am certainly not against what are described as ‘active learning’ approaches. The MSc course that I run has no more than 30% of the marks awarded from formal written exams based around ‘scientific lecture content’. This is very low for the sector, so on the face of it not overly traditional. There is extensive problem-based learning, laboratory work, lab-reports, group assignments, presentations and written essays, professional skills and project work, much of which consolidates learning of lecture content. However for the ‘formal taught scientific content’ parts, I do still tend to give 2 hour lecture sessions, where I am talking for anything up to 80% of the time. When I am not talking, I am encouraging some active learning by asking questions, a bit of peer-instruction here and there, debating, doing the odd quiz, getting students to answer questions embedded in the lecture notes and sometimes testing prior knowledge before I have even started talking. Typically I upload all lecture materials and support reading onto the VLE prior to sessions so students don’t arrive ‘cold’, and can attempt some activities prior to the session. These ‘active learning’ activities within the formal lectures do seem (anecdotally I admit) to encourage learning within the session, and also highlight to me misconceptions due to the absence of required prior knowledge, or just things that I haven’t explained very well.

So if these ‘active learning’ sessions are so useful in my lectures, why not run the full 2 hour session in this manner? Flipped classroom teaching has been proposed by some, whereby lectures are pre-recorded and watched prior to the session, leaving the full lecture time free for discussion. Others propose discovery learning and problem-based learning, which are said to promote deeper understanding than passive learning. For me, the problem with 100% ‘active learning’ approaches comes down to content. Yes, there is lots of it in Bioscience, and the lecture is efficient in this regard. However within STEM subjects, I am slightly troubled by the end-point of many discussions whereby the proposal is to decrease content in favour of depth of learning. It’s not that my students’ answers in exams are superficial. The best of them already give very deep, reasoned, well-thought-out answers that are at times really testing the tutors’ own knowledge. I really don’t see broad superficial learning from those at the top. However follow any educational conference or L&T Twitter discussion and you will perpetually hear the following negative remarks:

‘the answer (for active learning approaches) in many cases is to cut out huge swathes of content’ Narrative: Too much content

‘try cutting out stuff and everyone suddenly has a vested interest’ Narrative: Too much content, much irrelevant

‘reduce content as this will promote deeper approaches’ Narrative: Too much content, specialise in fewer areas

‘Handing out lecture notes = admitting there is a problem with the lecture’ Narrative: too much content to remember in lecture session

‘if you believe that lecturing is simply the transfer of information then you will soon be out of a job’ Narrative: Lectures (and lecturers) are rubbish

All of these comments are from discussions favouring active learning and technology-enhanced learning approaches over more traditional approaches. Probably the most concerning is the notion that these active learning approaches are ‘better’ when we need to cut out vast amounts of content for them to be properly implemented. If 100% active learning approaches cannot deliver the required content to the same learning outcomes, then are they necessarily ‘better’ than a predominantly traditional-based or mixed active-traditional approach, or just different? Do we need to amend the learning outcomes to make the method of delivery ‘fit’?

Let’s put it another way. If my manager asked me to cut out half of the content from my course to make it ‘easier’ for some weak international students to get a ‘deeper understanding’ of specific topic areas, I’d consider this to be dumbing down, and rightly so. Recent events at Anglia Ruskin highlight the potential for falling foul of the QAA on precisely this matter. This case highlights the importance of tightly adhering to validated learning outcomes, if nothing else. If I did the same, and cut half of my content but gave the use of 100% active learning approaches throughout all of my sessions as the reason, I’d probably get promoted into the faculty L&T team for it. And there lies a problem.

So, the question is: Do we deliver too much ‘taught scientific content’ in STEM? If we don’t, then deliver all sessions by whatever methods work well for you, be it 100% active learning, traditional lecture, or in my preference of a lecture with some active learning ‘nuggets’, and continue to assess to the same standards as existing courses at similar institutions. If we do genuinely deliver too much content, then cut it down, and fully justify that decision based on comparisons with similar established courses at similar institutions. However cuts to an existing course curriculum should not be for the sole reason that it doesn’t fit with our new preferred pedagogy.

One last point: Here is a top tip for Edubloggers


Posted in Rants and moans, Student engagement

Comments on Deslauriers “Improved learning in a large-enrollment physics class”.

The role of traditional teaching is taking a beating from some/many quarters in HE. Often the evidence against traditional teaching is from educational research which, by the standards of a typical scientist, is rather questionable. This is a pity as the very positive messages about non-traditional methods can become lost. This gives entirely traditional teachers the opportunity to disregard the findings as nonsense, when clearly there is benefit to using methods other than a straight 2 hour Powerpoint monologue. I have discussed the Freeman paper on active/passive learning in STEM, which is very influential, but has its issues as discussed previously here. Another highly influential paper on active/passive learning was published in Science (Deslauriers et al 2011, Improved Learning in a Large-Enrollment Physics Class, Science, 332, 862-864). This is one of the most influential journals in Science, so only studies that are highly relevant and conducted using the finest experimental design are even considered for peer-review, let alone published. Therefore, this study probably requires more attention than your run-of-the-mill education journal paper.

The study’s main outcome is that the deliberate practice concept resulted in better scores than traditional teaching in a single group of students after 3 hours of teaching, where deliberate practice was delivered by a young inexperienced tutor, whereas traditional teaching was delivered by an experienced academic. That’s it. One group of students. 3 hours tuition. Two separate tutors doing two separate things. One MCQ test. No controls. Blimey! I have seen more comprehensive studies proposed for a MEd final project!

So where do we start?

The whole study is based on 3 hours of tuition by a highly rated and experienced academic instructor with above average student feedback scores (so that means good at teaching, right?) giving a traditional lecture vs. 3 hours of an inexperienced post-doc. Now the results show that the post-doc wins hands-down, but the supplementary data explains that one of them is highly animated and the other not so. Why have two variables (instructor and method of tuition), especially since the study is not repeated? This is essentially an n=1 experiment with no controls.

Why not have the experienced academic deliver both methods, in part to control against the two-variable problem, and in part to show that the deliberate practice concept is transferable even to experienced traditional academics, and not just young enthusiastic post-docs? Even better, why not get the new post-doc to also deliver both methods, either in the same academic year or the following intake of students? You then have something that resembles a controlled experiment.

The students were informed that something new would be done, and it was explained why. Since the traditional group would probably get much of the same as before, the Hawthorne effect needs to be more adequately controlled for. The experimental group had higher attendance than the traditional control group, which is to be commended. However, is this the sole reason that scores were higher? The test group also engaged better (Hawthorne effect again?). There is no mention of correlation of attendance with outcomes in either group, so was it the increase in attendance from 45% to 85% that increased test scores? If that factor (attendance) could be assessed in isolation, we may start to get at causative effects.

The cohort of 850 was split into two groups of approx. 270. Hmm, what about the others? They might have acted as a nice control if ‘no intervention’ was used, if only for the Hawthorne effect. Seems a bit odd to me.

I really disagree with the use of “amounts of learning” being used in an absolutely quantitative manner. For example “… and more than twice the learning in the section taught using research-based instruction”. Twice the learning? The assessment is by MCQ, so twice the score is twice the learning, yes? Well no, given that some questions are 50:50, some are 1 in 5, some are rather simple and even I could answer, some look really quite tricky. The results may be far more, or less impressive than the 2-fold that the authors state.
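To put rough numbers on this, here is a minimal Python sketch. Everything in it is an assumption for illustration: the 12-question test, the mix of 50:50 and 1-in-5 questions, and the two raw scores are all invented, and the function names are mine. It uses a crude model where a student either knows an answer or guesses blind, and simply shows that a 2-fold difference in raw score need not mean a 2-fold difference in what was known.

```python
def expected_guess_score(options_per_question):
    """Expected marks from blind guessing: 1/k for a question with k options."""
    return sum(1.0 / k for k in options_per_question)


def ratio_above_guessing(raw_a, raw_b, options_per_question):
    """Ratio of two raw scores after subtracting the guessing baseline from each."""
    baseline = expected_guess_score(options_per_question)
    return (raw_b - baseline) / (raw_a - baseline)


# HYPOTHETICAL test: six 50:50 questions and six 1-in-5 questions.
options = [2] * 6 + [5] * 6
baseline = expected_guess_score(options)

# HYPOTHETICAL raw scores: the test group scores exactly twice the control group.
raw_control, raw_test = 5.0, 10.0
raw_ratio = raw_test / raw_control
adjusted_ratio = ratio_above_guessing(raw_control, raw_test, options)

print(f"guessing baseline: {baseline:.1f} marks")
print(f"raw ratio: {raw_ratio:.2f}, ratio above guessing: {adjusted_ratio:.2f}")
```

With these made-up numbers the ratio changes a great deal once guessing is taken out, and with a different mix of questions it could move the other way. The point is only that the size of the effect depends on where you put the zero.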

The study propagates the false dichotomy myth that teaching has to be either traditional or progressive. The Freeman paper (as discussed previously in my blog) included any studies where at least 10% of the session was ‘active learning’, so by definition up to 90% could be traditional passive lecture. I doubt whether 100% active or 100% passive is best for assimilating a breadth of content to the required depth. All too often I hear of ‘ditching content’ to facilitate in-depth active learning, using the Freeman paper as evidence. Some of this content ditching may be valid in favour of some active approaches, but it also gives the impression of an anti-knowledge agenda (I’ll save this for another blog).

What really disappoints me is that this type of study could have been conducted over a couple of academic years to obtain controlled data to remove all the competing variables. I still don’t think it would be anywhere near worthy of publication in Science, even if all of the findings were replicated, but at least it would be done properly. We could then make a sensible decision on what teaching methods work best for our students. As it is, this paper allows the more traditional teaching arm to disregard the evidence as rubbish or unreliable on the basis of n=1 and no controls (and hence, no causative factor identified).

I am a little concerned at the school-boy error of describing what is clearly non-parametric data using means and (presumably) standard deviation. No attempt at any statistics was presented.
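As a rough illustration of why this matters, here is a short Python sketch using the standard library only. The scores are invented by me and are not from the paper. They mimic a test where most students hit the ceiling and one does badly, which is the sort of skewed data where mean and standard deviation mislead and median with interquartile range is the safer description.

```python
from statistics import mean, median, quantiles, stdev

# HYPOTHETICAL scores out of 12 for ten students (invented, not from the paper).
MAX_SCORE = 12
scores = [12, 12, 12, 11, 11, 12, 10, 12, 3, 12]

avg, sd = mean(scores), stdev(scores)
q1, q2, q3 = quantiles(scores, n=4)  # quartile cut points; q2 is the median

# Mean plus one SD lands above the maximum possible mark, which is meaningless.
print(f"mean +/- SD: {avg:.1f} +/- {sd:.1f} (mean + SD = {avg + sd:.1f}, max = {MAX_SCORE})")
print(f"median (IQR): {median(scores)} ({q1} to {q3})")
```

A proper comparison between the two groups would then use a rank-based test rather than one that assumes a normal distribution, but that is beyond this sketch.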

Let’s save the best ’til last. The authors are the tutors. OK, not ideal as preconceived bias could creep in, but this is Science journal. That would not happen and would be adequately controlled for in some way. Or maybe not. The authors state “…but we believe that educational benefit does not come primarily from any particular practice but rather from the integration into the overall deliberate practice framework”. This might be OK in a final conclusion, after seeing the data, but NOT in the section describing the design goal. They are effectively saying “We believe our hypothesis to be true, and we will be the tutors to gather the appropriate evidence that will prove that we are correct”. This is not how to do science, and I am amazed that the journal Science let this part through. It is the single most disturbing part of the whole paper for me (the lack of controls is almost as bad). I have previously blogged about the white swan hypothesis. I have also written about ideology and belief, and how it should have no place in science. This paper ticks both boxes.

OK, that’s enough criticism for now, as there is a decent probability that the test group genuinely learned more, and we could distil some really good practice from this paper. I do like the subtle hint of Sceptic-Proponent collaboration in here, but that would only be valid if the traditional academic’s view was that traditional was best, and that academic had been converted by the findings.

See my initial thoughts below from when I first read the paper. Let me know if I’ve got the wrong end of the stick on this one.


Posted in Student engagement, Stuff about research

Active vs. passive learning in STEM: Is a little better than a lot?

Lecturers who lecture have been getting a lot of stick recently for their ‘sage on the stage’, didactic, boring lectures. I have even heard it said that the brain is more active when asleep compared to in lectures (maybe that’s just my lectures), however I have yet to find convincing evidence in the published literature. Certainly the students who sleep through my lectures don’t learn much. So when Freeman et al (2014) published their study “Active learning increases student performance in science, engineering, and mathematics”, I read it with great interest. It is published in Proceedings of the National Academy of Sciences USA, a peer-reviewed journal that all scientists respect as one of the leading journals in science, not some free, online, non-peer-reviewed education journal that would publish a shopping list if the authors would pay the publication charges. Given the source journal, this is something that all lecturers in STEM, even those that wouldn’t touch an Edu Journal, should carefully read.

So, what does the paper claim? The paper is a meta-analysis, so looked at 225 previous studies on active vs. passive learning, and combined the results. The study shows that across the STEM disciplines, results improved by 0.47 standard deviations, and the chances of failing were roughly halved when active learning approaches were used. Active learning worked across all class sizes, but worked better in smaller classes. Any session with at least 10% active learning counted as 'active', with >90% passive learning (didactic lectures) representing the passive group. I really don’t want to criticize so early but isn’t this a bit of a false dichotomy?
So, >10% active learning is better than <10% active learning. I accept that the data supports this conclusion, and it fits well with observations of my own teaching where student understanding 'seems better' where I employ some degree of active learning vs. a 2 hour didactic lecture. I am not surprised by this finding and would encourage all lecturers to not just rely on a monologue. However does this mean that we should ditch the lecture? Well, no, as that is not what the data says.
If active learning should entirely replace passive learning, then we should observe a quantitative increase in attainment with increasing proportion of active learning up to 100% active. The authors state that "we were not able to evaluate the relationship between intensity (or type) of active learning and student performance due to lack of data", which I find surprising since they had the % of active learning, and the effect size for each study. Maybe they just couldn't show a linear response or the clear significance that they expected… In analyzing outliers with high effect sizes, the percent of active learning was 25%, 33% and 100%. I find it difficult to comprehend how the data could not be plotted to show the effect size vs. % active learning. I suspect that there was no clear link between the proportion of active learning (anywhere from 10% to 100%) and outcomes. Or to put it another way, 100% passive is probably worse than anywhere between 0% and 90% passive, but it is not necessarily correct that 100% active learning results in better outcomes than, say, 50:50 active:passive. This is just me reading between the lines, but there is a gaping hole in the data analysis required to be sure of the conclusion that active learning should represent the 'empirically validated teaching method'. This is not sufficient evidence to ditch the lecture (yet) but is certainly strong evidence for further studies teasing out the quantitative effect of increasing student engagement within lecture-type sessions to give optimal outcomes.
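The sort of analysis I have in mind is sketched below in Python, as a simple least-squares fit of effect size against % active learning. The (percent, effect size) pairs are entirely invented for illustration and are not taken from the paper, and the function name is mine. A real analysis would also weight each study by its precision, which this sketch does not attempt.

```python
from statistics import mean


def least_squares(xs, ys):
    """Return (slope, intercept, pearson_r) for a simple linear fit of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# HYPOTHETICAL (percent active learning, effect size) pairs, invented for
# illustration only; these are NOT data from the paper.
studies = [(10, 0.40), (25, 0.55), (33, 0.50), (50, 0.45), (75, 0.52), (100, 0.48)]
pct_active = [p for p, _ in studies]
effect = [e for _, e in studies]

slope, intercept, r = least_squares(pct_active, effect)
print(f"slope per 1% active: {slope:.4f}, intercept: {intercept:.2f}, r: {r:.2f}")
```

A slope near zero with a weak correlation would be consistent with 'some is better than none, but more is not obviously better'. A clearly positive slope would support pushing towards 100% active. Either way, it is a plot worth seeing.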

There are other commentaries that discuss other aspects, or weaknesses of this study. Major concerns include whether it is active learning itself that results in better attainment, or whether teachers who employ active learning simply become more enthusiastic. Is it the method, or the delivery? Furthermore, the file drawer effect could still be at play here. Although effect sizes were symmetrical on funnel plots, there could easily be an absence of either neutral or pro-passive studies. Even if I had such data, I'd probably find it harder to publish a pro-passive learning study than a pro-active learning study.

So in summary, the data says that >90% passive learning is worse than <90% passive. Or to put it another way, some active learning is better than none. Whether this means that the highly efficient lecture should be abandoned in STEM is far from certain, and there are mechanisms to introduce active elements into lectures. What is fairly certain is that any mechanism that can engage students during a lecture is probably worth using.

Posted in Student engagement, Uncategorized

The White Swan hypothesis: How not to fool yourself in research

There is a simple test that I use to determine whether a potential graduate student is in a suitable frame of mind to undertake post-graduate research in science, and this test seems applicable to most disciplines (maybe except Maths). Ask the student how they would ‘test’ the hypothesis that ALL swans are white. Here in the UK, we have three native species of swan. They are all normally white as adults, and any UK/EU student will be familiar with the notion that swans are white. They think they already know the most likely answer before gathering any data, a common problem in research. So how do students go about testing such a hypothesis? There are two (or three) distinct approaches:

The first student would go to a local country park, where they would ‘look for’ white swans. They would find hundreds of white swans. They would go around all of the local parks and find even more white swans. Their preconceived notion of swans being white is supported. They come back to say “I have looked in all of the likely places, looking for white swans, and I counted several hundred. All swans are white, hypothesis proven”.

The second student would also go around the local parks and country parks, taking careful note of the number of white swans, but looking specifically for black swans, or swans of any colour other than white. They return to say “I have looked in all likely places looking for swans, and all of the swans I found were white. If there was a black swan, I think I would have spotted it, but I didn’t see any. Therefore, the hypothesis is supported (for now) that all swans are white”

The third student would go further afield, researching whether there have been any sightings of swans that are not white, following up on such sightings, and gathering evidence that would disprove the central hypothesis. They return to say “I have seen evidence of thousands of white swans, of three different species (as determined by size, and by bill colour), which initially seemed to support the hypothesis, but I’ve got evidence of a black swan, and it is of a different species. I also have evidence of reports of a few pairs breeding elsewhere in the UK. Apparently they are escapees from exotic bird collections. I also found that swans are black in Australia, and that other species are black and white. The central hypothesis has been disproved. I will now refine my hypothesis and undertake further research on…”

The first student knew what he was looking for and looked for data that ‘fitted’ his hypothesis, or belief, that all swans are white, and he gathered as much evidence as possible to convince himself (and his supervisor) that this is the case. I see this confirmation bias first hand all too often in science, and other sectors also (see Thomas Gilovich’s work for some excellent studies). The second student went out looking for black or coloured swans, so was seeking to disprove the hypothesis. OK, so he didn’t quite try hard enough, but came to a logical conclusion. The third student also went out looking for coloured swans, but tried that bit harder to seek out data that would disprove his hypothesis. If even he couldn’t disprove it, despite his best efforts, then he could be content with his data being of the highest quality. Reported sightings wouldn’t be enough; he needed to see one for himself. Having disproved the hypothesis, he could now focus on other specific hypotheses armed with this new knowledge. For example, whether swans are mostly white, or whether the ratio of white:black is changing over time, are very different questions/hypotheses that can now be tested by enquiry to further knowledge in this area. Formulating the initial hypothesis is important in designing a study, and its limitations, and data from both students 1 and 2 would correctly support the hypothesis that MOST swans in the UK are white, but not that ALL swans are white.

Now, who would you take on for a research project to test whether a proposed intervention works, given that if the intervention works, the research team will probably get a big grant to expand the study, lots of kudos and you will probably get a promotion? That depends on your aims. If it is to get promotion and you really don’t care whether the intervention works, or alternatively you are of the mind-set that your hypothesis can’t possibly be wrong for whatever reason, take on student 1 every time. You will be happy with the results! If it is to further knowledge, take on student 3, and if he is not available, settle for student 2.

None of this is new. It is a standard scenario (or variation thereof) used by many scientists to illustrate reliable methods of gathering evidence, and how easily pre-conceived ideas can bias the data generated. The biggest fear is how this mind-set of the scientist can affect the data. Might student 1 see a black swan by chance, but either a) convince himself that it was a large, dark goose, or even worse b) suppress the data for fear of not supporting the hypothesis that the professor/group leader supports? Yes, it happens a lot. Read up on the File Drawer problem, where negative studies, i.e. those with no effect of treatment/intervention vs control, tend not to get published, whereas studies with a clear effect do get published. If the data ‘fits’ current convention, then it’s even more likely to be published in a higher impact journal. This has a serious impact on subsequent meta-analyses. Read Ben Goldacre’s Bad Pharma. Read up on how studies are being retracted in science for both honest mistakes and dishonest practice. One potential solution is to design studies with individuals who have generated data that opposes your own hypothesis. Such Sceptic-Proponent collaborations should reduce confirmation bias, and maybe bring opposing factions together. At this point I could go on to discuss in detail Karl Popper’s critical rationalism, and discuss falsificationism, but I won’t.

In future blogs, I will outline some of the black swans that I have found over the years. Some of them have made me unpopular, albeit temporarily in most cases, but all of them have advanced knowledge in some way, and allowed me to focus on something potentially more worthwhile. None of the black swans that I identified have accelerated my career as much as if I had generated data supporting the initial hypotheses, but I have at least published the findings in peer-reviewed journals to outline the problems to other scientists. This is a major part of scientific enquiry. It is not such a bad thing to have formulated a hypothesis that is subsequently disproved, as long as at the time, it was done with the best possible enquiry methods. It is a major part of science that we have to accept.


PS The smaller black birds in the image are coots.


Strategies for improving exam feedback

A frequent complaint from students regarding exams is the lack of opportunities to learn from their exams. As such, this has prompted education commentators to question their use, over other assessments where students can learn from the assessment and gain useful feedback from the assessed work. Well, I didn’t learn to drive on my driving test, but I admit, that is a lazy and poor answer. Aside from the weaknesses of exams (which Phil Race has covered very nicely recently) they have a prominent position in STEM subjects at least, a position that will not change as long as accreditation bodies promote their use. Furthermore, exams appear on our Key Information Sets (KIS) and their presence is deemed good. So aside from their perceived weaknesses, let’s make the most of them.

Feedback from every exam that I have ever taken has been in the form of a number or a letter. Actually not every exam, as I received some excellent feedback from my driving test examiner, but apart from that, nothing. I’ve never seen a script, and don’t know where my strengths and weaknesses are. Why is this? We give copious amounts of feedback on other assignments, so why not here where students probably make the same repeated mistakes time and time again? It’s largely down to the practicalities of giving it back, and dealing with the fallout of contested marking. GCSE and A-levels are tricky as they are marked by outside exam boards, but what about in Universities? I set the papers, I mark the papers, and the marked scripts sit under my desk from one year to the next, when they are disposed of. I have no excuse for not using them for educational gain.

I recently did a short trial of giving exam feedback to students. My initial approach was a single sheet outlining the key points on my mark scheme that were met for a basic pass, merit or distinction, and the areas that were missed completely, with room for added extras that were relevant but not on a prescribed scheme. A simple marking matrix really, which students preferred to just receiving a grade, and for the first time they received a question-by-question marks breakdown, but they moaned that they would like to see the scripts. I’ll probably do this again, but it does take significantly more time than I am allocated for exam marking. Another colleague showed students their scripts on a different module, and they were ‘happier’. However most students had effectively left the building for that academic year, or even for good, so this is still not logistically ideal.

So, why not make use of the box of scripts under the desk from last year? I frequently get asked “what does it take to pass, a 2:1 or a first in this module?” At the last taught session of the module, I get out last year’s attempts, and after anonymising them, I hand them out. They are sometimes astounded that we really do give 100% for essay answers, but are also shocked by the sheer quality and thought put into the work. They often comment on the poor quality of the 40% scripts, but I make it clear that many in the room will probably produce work of equal or lower quality. I ask them what they could produce now on the same questions. It is hoped that the real hard work of exam revision starts at this point.

So how do students gain from this? Firstly, students realise that they are not going to get a good mark in my modules by regurgitation of my lecture notes, especially when answers are out of context to the question. Secondly, at 2nd year and beyond, marks are not given over first class level without clear evidence of further reading, or independent thought that can be clearly identified on the script. Students see where last year’s cohort did this and were rewarded. Thirdly, students see how the marking scheme is non-linear, and that it is much easier to get the first 40 marks, as defined by the minimum pass level descriptor, than the next 20 marks, which require much more context and argument. It is not just about writing 100 facts and counting up 100 ticks for full marks. Finally, students see how answers can become very good answers if clearly put into context. A well annotated diagram linked to explanatory text, for example, can quickly demonstrate understanding of a complex concept. Seeing evidence of how assessors (well, how I assess them on their module) award marks, and what for, may account for the unusual marks profiles in most of my modules. Many modules in my department have coursework marks that are 10-20% higher than exam marks, and good coursework marks can easily compensate for a failed exam under our current academic awards framework. Since employing the strategy of allowing students to see last year’s scripts, exam marks are catching up with coursework marks, and for the first time I had exam marks exceeding coursework marks in 2 of my modules.

One final point: If only using last year’s scripts, you may ask “you are not letting students learn from their mistakes. Isn’t that central to giving good feedback?”. I disagree, to a point. Students show clear evidence of being able to learn from other people’s mistakes, not just their own. Maybe they will learn better when it is someone else’s mistake.

Mr Schadenfreude

iCard: The low-tech, low cost tool for assessing student engagement in large groups

In this post, I discuss my initial use of a simple coloured card voting system for assessing student understanding in a large group University setting. Not a novel concept, but simple and effective.

There have been some amusing Twitter debates recently on the topic of ‘no hands up’ policies, and the contentious use of randomly asking questions to pupils.  The issue of tutors assessing understanding of entire groups is not really addressed by either picking random students, or allowing students to elect to answer a question. We get a potentially non-representative snapshot, and by carefully selecting who answers, or what question gets asked, we can convince ourselves that we are doing a great job. Working in a University setting with large groups, I am focussed on assessing whether everyone in the group understands, who doesn’t, and reasons for lack of understanding.

So how can all students be encouraged to participate in Q&A sessions, whilst allowing the tutor to assess student understanding of key concepts by the whole cohort? One area that is being used particularly in Universities is mobile technology. Apps such as Socrative allow the group to submit answers which can be displayed on screen, but require users’ own devices, which may be OK for Universities, but not really applicable to schools. Similarly Twitter can be used, where students give an answer in class and answers appear on screen. On the downside, the majority of replies are off-task, although once the novelty wears off, maybe this approach will have some merits. Similarly, ‘clicker’-type voting devices are an option, as they do not require the use of students’ own mobiles. Programming individual devices to individual students, if such information is needed, can be a barrier, especially for large cohorts.

I frequently put a question on screen, and then ask groups of up to 200 undergraduate students the following questions (in this order, with typical responses noted):

Hands up everyone who thinks the answer is true (25%)

Now hands up if you think it is false (25%)

Hands up who doesn’t know (25%). And as I lose the will to live…

Hands up who doesn’t care (25%).

If that doesn’t work, simultaneously I ask for left hand for true, right hand for false. And so on. Getting large groups to answer questions so that I can gauge levels of understanding of key concepts is not easy, especially in my admittedly didactic teaching sessions.

The iCard voting system: For when Technology-Enhanced Learning seems like an unnecessary evil

Expanding on the concept of left-hand or right-hand up is the simple use of coloured voting cards. Hold up a red or green sheet of A4 to test understanding of a key point. It allows the tutor to see who understands (or if we are being pedantic, those who think they know the answer), who doesn’t, and who is disengaged completely. However a 50:50 question is not very informative. This is where the iCard comes in. Four (or more) pieces of coloured card (A6 should suffice) linked by a treasury tag can be given out at the start of the session. Questions can be asked to the entire group, which can be a simple recall of a straightforward fact that is central to understanding a concept, or a more testing question that requires a few minutes of working out.

Is it useful?

Teaching large groups of anything up to 200 renders ‘hands up…’ pretty useless. Even in a smaller group, getting everyone to consider the question, and seeing evidence of some effort by all students, is near impossible. In contrast, the iCard-type approach does work. My initial use of this was in an end of semester informal test with 60 students. Each question was projected via PowerPoint, along with 4 colour-coded answers.


Initially, as anticipated, a minority did not engage well, but 90% were happy to answer all questions. To get the remaining 10% to engage, there has to be an element of bullying.  These students soon found out that if they didn’t answer with the rest of the group, I would push them individually for an answer, and then inform the rest of the group whether the individual was right or wrong. These students soon started to answer with the rest of the group.

What did I learn from using the iCard?

1) Student misconceptions on ‘simple’ fundamental points. Some of the questions were deliberately simple, and I anticipated >95% of respondents giving the correct answer. By quickly scanning the room for incorrect answers, I could identify who got the answer wrong, and crucially, what the misconception was. I could explain why the given answer(s) might be incorrect, without necessarily highlighting students with wrong answers, by encouraging students to keep their card ‘close to their chest’.

2) Identifying weaker students. As the majority of students were answering correctly for each individual question, I could focus my attention on the incorrect answers. As expected, some weaker students consistently answered incorrectly. However some students whom I had down as particularly strong students were exposed as having gaping holes in their knowledge, sometimes on fundamental points.

3) Identifying topics that were poorly understood. Two topics out of 11 were particularly poorly answered. I can now look at, and take action on a) how those topics were delivered, and/or b) whether there is some underpinning knowledge that is missing from earlier in the course.

4) Poorly worded questions can trip up students. All questions should have only one correct answer. However questions can be ‘read’ differently, and in two questions, an ‘alternate reading’ of the question would lead the student to answer incorrectly. By discussing why answers are wrong, the students could argue their case. As a result, I will be re-writing a few questions before using them again.

5) Students’ responses can spark debate over contentious points. As noted above, students are happy to argue with me if they think that I am wrong. Although the voting cards do not allow students to express views, or give complex and well-articulated answers, the voting cards are an ideal way to initiate subsequent debate.

I must stress that this is not a new concept, and is certainly not my idea. The cards are so cheap and simple, yet seem to be effective, especially with carefully worded questions or tasks. This type of in-class formative assessment seems to be only sparsely used in Universities where large group teaching is common, and where if anything, mobile technology and clickers are being introduced more widely. With such diverse opinions on how tutors should assess student understanding during sessions before ‘moving on’, maybe it’s time to re-visit some old technology before blindly moving on to a technology-based approach.

Benefits include:

1) They are very cheap and easy to make

2) They are applicable to situations where tutors are assessing a right/wrong answer/MCQ answer, or where specific/defined opinions of a group are being sought, and especially for large groups

3) Bureaucracy of obtaining anything that is remotely costly or technological can be a barrier to implementation.

4) Some tutors will always remain as technophobic as humanly possible, and a minority are particularly ‘risk averse’. Even technophiles have concerns that the time spent teaching students how to use any technology, and technology failure, may detract from the learning.

On the negative side:

1) There is no permanent record of who voted for which answer, only the tutor’s judgement on who or what to follow up.

2) You will get it in the neck from advocates of Technology Enhanced Learning

In summary, all students get the same questions, and get the same treatment and there is less of a requirement to ‘pick on’ individual students. Please feel free to comment on potential uses, and importantly, limitations of use.

And here is me rambling on about it at a recent TeachMeet
