Assessing Programming Ability in Introductory Computer Science: Why Can't Johnny Code? Robyn McNamara

Transcription

Assessing Programming Ability in Introductory Computer Science: Why Can't Johnny Code?
Robyn McNamara
Computer Science & Software Engineering
Monash University
Can Johnny Program?
● Most second-years can (kind of), but evidence suggests that many can't.
● Not just at Monash – this is a global problem.
  – McCracken report (2002) – four universities on three continents found that their second-years lacked the ability to create a program
  – papers from every continent (except Antarctica) indicate problems with the programming ability of CS2 students
● Antarctica doesn't seem to have any tertiary programs in IT.
Kinds of assessment
● Formative: feedback to students
  – “how am I doing? how can I improve?”
  – “what are the lecturers looking for?”
● Summative: counts toward final mark
  – pracs, exams, assignments, tests, prac exams, hurdles, etc.
Purpose of CS assessment
● Ensure that students enter the next level of study or work practice with enough knowledge and skill to be able to succeed. May include:
  – programming ability
  – grasp of computing theory
  – problem-solving/analytical ability
  – more!
● These skills are mutually reinforcing rather than orthogonal.
What's at stake
Inadequate assessment in early years can lead to:
● inadequate preparation for later-year courses
  – watering down content
  – grade inflation
● “hidden curriculum” effects (Snyder)
  – to students, assessment defines curriculum
● poor student morale, which brings
  – attrition
  – plagiarism (Ashworth et al., 1997)
Characteristics of good assessment
● Reliability: is it a good measure?
  – if you test the same concept twice, students should get similar marks (cf. precision)
  – can be evaluated quantitatively using established statistical techniques (AERA et al., 1985); see the sketch after this list
● Validity: is it measuring the right thing?
  – not directly quantifiable – measured indirectly using (e.g.) correlation studies
  – this is what I'm interested in!
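The slides don't show how such a reliability statistic is computed, so here is a minimal NumPy sketch (my illustration, not part of the talk) of one standard technique, Cronbach's alpha, together with the test-retest style correlation the first sub-point alludes to. The item marks are invented.

```python
import numpy as np

# Illustrative only: rows are students, columns are items intended to test
# the same concept. The marks are made up for this sketch.
item_marks = np.array([
    [4, 5, 4, 3],
    [2, 1, 2, 2],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [1, 2, 1, 1],
], dtype=float)

def cronbach_alpha(items):
    """Cronbach's alpha: internal-consistency reliability of a set of items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# The slide's intuition: marks on the same concept tested twice should
# correlate strongly.
test, retest = item_marks[:, 0], item_marks[:, 1]
print("Cronbach's alpha:", round(cronbach_alpha(item_marks), 2))
print("Test-retest correlation:", round(np.corrcoef(test, retest)[0, 1], 2))
```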
Types of validity
● Content validity: assessment needs to
  – be relevant to the course
  – cover all of the course (not just the parts that are easy to assess)
Who discovered the Quicksort algorithm?
a) Donald Knuth
b) C.A.R. Hoare
c) Edsger Dijkstra
d) Alan Turing
Types of validity
● Construct validity: assessment measures the psychological construct (skill, knowledge, attitude) it's supposed to measure.
  – Can't be evaluated directly, so we have to use other forms of validity as a proxy (Cronbach & Quirk, 1976)
You can store several items of the same type in an:
a) pointer
b) array
c) struct
d) variable
Example: time and construct validity
● Allocating too little time for a task threatens validity
  – you end up assessing time management or organizational skills
● Allocating too much time can also threaten validity!
  – students can spend a long time working on programming tasks
  – they can go through many redesign cycles instead of just a few intelligent ones
  – even with an unintelligent heuristic, a student can eventually converge on a “good enough” answer given enough iterations (see the sketch after this list)
  – not a true test of problem-solving/design ability → construct validity threat
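To make the "unintelligent heuristic" point concrete, here is a toy simulation (my illustration, not from the talk): blind trial-and-error against a marking script eventually reaches a "good enough" mark with no design insight at all. The 20-decision model of an assignment and the pass threshold are both hypothetical.

```python
import random

# Hypothetical model: an assignment is 20 independent design decisions and a
# marking script awards one mark per correct decision (16/20 is "good enough").
random.seed(1)
CORRECT = [random.random() < 0.5 for _ in range(20)]

def marks(attempt):
    """Marks awarded: number of decisions that match the marking script."""
    return sum(a == c for a, c in zip(attempt, CORRECT))

def blind_tinkering(threshold=16):
    """Unintelligent heuristic: flip one decision at random, keep the change
    if the mark goes up, and repeat until the threshold is reached."""
    attempt = [random.random() < 0.5 for _ in range(len(CORRECT))]
    tweaks = 0
    while marks(attempt) < threshold:
        i = random.randrange(len(attempt))
        candidate = attempt[:]
        candidate[i] = not candidate[i]
        if marks(candidate) > marks(attempt):
            attempt = candidate
        tweaks += 1
    return tweaks

print("'Good enough' reached after", blind_tinkering(), "blind tweaks")
```

Given a generous time limit, this kind of iteration is indistinguishable in the final mark from genuine design ability, which is the construct-validity threat the slide describes.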
Types of validity
● Criterion validity: the assessment results correspond well with other criteria that are expected to measure the same construct
  – “predictive validity”: results are a good predictor of performance in later courses
  – “concurrent validity”: results correlate strongly with results in concurrent assessment (e.g. two parts of the same exam, exam and prac in same year, corequisite courses, etc.)
  – We can measure this!
Method
● Took CSE1301 prac and exam results from 2001, using only those students who had sat both the exam and at least one prac
● Grouped exam questions into
  – multiple choice
  – short answer
  – programming
● Calculated percentage mark for each student in each exam category, plus overall exam and overall prac
● Generated scatterplots and best-fit lines from percentage marks (a sketch of this analysis follows)
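As a rough sketch of how that analysis could be done (my reconstruction, not the original scripts; the column names, maximum marks, and sample data are all invented), the percentage conversion, Pearson correlation, and least-squares best-fit line might look like this in Python:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical input: one row per student, raw marks per exam category plus
# an overall prac mark. Names and maxima are assumptions for illustration.
marks = pd.DataFrame({
    "mcq":       [12, 18, 9, 20, 15, 7, 16, 11],
    "short_ans": [25, 38, 15, 44, 30, 12, 35, 22],
    "code":      [10, 28, 5, 35, 22, 4, 26, 14],
    "prac":      [55, 80, 48, 92, 70, 46, 78, 60],
})
maxima = pd.Series({"mcq": 20, "short_ans": 50, "code": 40, "prac": 100})

# Convert each category to a percentage, as described in the Method slide.
pct = marks / maxima * 100

def scatter_with_fit(x, y, xlabel, ylabel):
    """Scatterplot with a least-squares best-fit line and Pearson r."""
    slope, intercept = np.polyfit(pct[x], pct[y], 1)
    r = pct[x].corr(pct[y])
    xs = np.linspace(0, 100, 2)
    plt.figure()
    plt.scatter(pct[x], pct[y])
    plt.plot(xs, slope * xs + intercept)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(f"{xlabel} vs {ylabel}: r = {r:.2f}, "
              f"y-intercept = {intercept:.0f}%")
    return r, intercept

scatter_with_fit("code", "prac", "Exam code %", "Prac %")
plt.show()
```

Run for each pair of categories (MCQ, short answer, code, overall exam, prac), this produces the kind of scatterplots, correlations, and intercept observations discussed on the following slides.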
Predictions
● Programming questions on the exam should be the best predictor of prac mark...
● ...followed by short answer...
● ...with multiple-choice being the worst predictor
  – programming skills are clearly supposed to be assessed by on-paper coding questions and pracs
  – many short-answer questions cover aspects of programming, e.g. syntax
Sounds reasonable, right?
MCQ vs Short Answer
● Strong correlation: 0.8
● Same students, same exam (so same day, same conditions, same level of preparation)
MCQ vs Code
● Correlation 0.82
● Note the X-intercept of 30% for the best-fit line
MCQ vs Prac
● Correlation only 0.55
● We predicted a relatively poor correlation here, so that's OK
● Note the Y-intercept
Short Answer vs Code
● Correlation 0.86
● SA is a better predictor than MCQ; so far so good
● Note the X-intercept at 20 – a guesswork effect?
Short Answer vs Prac
● Correlation 0.59
● Stronger than MCQ, as expected, but only slightly.
Code vs Prac
● Correlation still only 0.59 – no better than short answer!
● Note that the best-fit line has a Y-intercept of more than 50%!
Exam vs Prac
● Note that someone who got zero for the exam could still expect 45% in the pracs
  – 45% was the hurdle requirement for the pracs
Summary
● Exam programming and lab programming are strongly correlated, so they're measuring something. But...
● Exam programming results are not a better predictor of ability in pracs than short-answer questions, and only slightly better than multiple-choice.
● Something is definitely not right here!
What next?
● I still haven't asked the really important questions:
  – what do we think we're assessing?
  – what do the students think they're preparing for?
  – are pracs or exams better predictors of success in later courses, especially at second-year level?
  – what are the factors that affect success in programming-based assessment tasks, other than programming ability?
  – computer programming and computer science: how are they different? What are the ramifications for our teaching and assessment? (This is a big and probably postdoctoral question.)
What's next?
● Current plan for my PhD research: three stages
● What do we think we're doing?
  – interview lecturers to determine what skills they are trying to assess
● What are we doing?
  – obtain finely-grained assessment results for first-year and second-year core subjects for one cohort and analyse these results to see which tasks have highest predictive validity
  – interview students to determine how they approach assessment
● What should we be doing?
  – suggest feasible ways we can improve assessment validity
Bibliography
Reliability and validity
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing.
Cronbach, L. J. (1971). “Test validation”. In R. L. Thorndike (Ed.), Educational Measurement.
Cronbach, L. J. & Quirk, T. J. (1976). “Test validity”. In International Encyclopedia of Education.
Oosterhof, A. (1994). Classroom applications of educational measurement. Macmillan.
General and CS
Ashworth, P., Bannister, P. & Thorne, P. (1997). “Guilty in whose eyes? University students' perceptions of cheating and plagiarism in academic work and assessment”, Studies in Higher Education 22(2), pp. 187–203.
Barros, J. A. et al. (2003). “Using lab exams to ensure programming practice in an introductory programming course”, ITiCSE 2003, pp. 16–20.
Chamillard, A. & Joiner, J. K. (2000). “Evaluating programming ability in an introductory computer science course”, SIGCSE 2000, pp. 212–216.
Daly, C. & Waldron, J. (2001). “Introductory programming, problem solving, and computer assisted assessment”, Proc. 6th Annual International CAA Conference, pp. 95–107.
Daly, C. & Waldron, J. (2004). “Assessing the assessment of programming ability”, SIGCSE 2004, pp. 210–213.
de Raadt, M., Toleman, M. & Watson, R. (2004). “Training strategic problem solvers”, SIGCSE 2004, pp. 48–51.
Knox, D. & Woltz, U. (1996). “Use of laboratories in computer science education: Guidelines for good practice”, ITiCSE 1996, pp. 167–181.
Kuechler, W. L. & Simkin, M. G. (2003). “How well do multiple choice tests evaluate student understanding in computer programming classes?”, Jnl of Information Systems Education 14(4), pp. 389–399.
Lister, R. (2001). “Objectives and objective assessment in CS1”, SIGCSE 2001, pp. 292–297.
McCracken, M. et al. (2002). “A multinational, multi-institutional study of assessment of programming skills of first-year CS students”, SIGCSE 2002, pp. 125–140.
Ruehr, F. & Orr, G. (2002). “Interactive program demonstration as a form of student program assessment”, Jnl of Computing Sciences in Colleges 18(2), pp. 65–78.
Sambell, R. & McDowell, L. (1998). “The construction of the hidden curriculum: Messages and meanings in the assessment of student learning”, Jnl of Assessment and Evaluation in Higher Education 23(4), pp. 391–402.
Snyder, B. R. (1973). The hidden curriculum, MIT Press.
Thomson, K. & Falchikov, N. (1998). “‘Full on until the sun comes out’: The effects of assessment on student approaches to study”, Jnl of Assessment and Evaluation in Higher Education 23(4), pp. 379–390.