Full Paper

The Do's and Don'ts of Computerising IT Skills Assessment

R.D. Dowsing and S. Long

rdd@sys.ua.ac.uk, sl@sys.ua.ac.uk

School of Information Systems,

University of East Anglia, NORWICH NR4 7TJ, UK

Abstract
There is growing interest in computerising the assessment of IT skills, such as word processing, because of the time and cost of traditional human assessment with large numbers of candidates. Skills cannot be assessed satisfactorily by knowledge-based examinations; they need to be assessed by monitoring performance in a real practical test. With computer monitoring, data can be collected and analysed to generate an assessment.

Whilst developing assessors for different IT skills, we have learnt a number of lessons about how an IT skills assessor should be developed and what important considerations should be borne in mind during its development.

The most important 'don't' we have learnt is not to attempt to assess automatically every attempt at an IT skills exercise. Difficult attempts should be filtered out and passed to a human assessor; this reduces the cost and complexity of the assessment software and uses the best features of both computerised and human assessors.

The most important 'do' is to produce flexible, customisable software so that users can tailor it to their exact requirements.

Keywords
Assessment, IT skills

Introduction

An increasing percentage of the population is acquiring IT skills, either as part of education and training or through leisure activities involving computers. With this increasing acquisition of skills comes the necessity of assessing competence and, with the numbers involved, IT-based assessment methods are the only cost-effective solution. The skills encompassed include the use of word processors, spreadsheets, email, WWW browsers and associated operating environments.

Computer-based assessment is approaching the point at which the routine assessment of many IT skills is becoming practical, as seen from the increasing number of products becoming available on the market (NCC, 1995; Biddle & Associates, 1995; RSA, 1995). The CATS project at the University of East Anglia is investigating methods of computerising the assessment of IT skills with particular reference to students in Higher Education (Dowsing et al., 1996). The project is funded by the UK Higher Education Funding Councils under a Teaching and Learning Technology Project (TLTP) grant. A related project for professional skills assessment is being undertaken for the Royal Society of Arts Examinations Board.

There are a number of benefits that arise from computerising the assessment of IT skills. Firstly, the amount of staff time required for assessment is considerably reduced which either reduces costs or frees staff time for other activities. Secondly, computerised assessors are considerably faster than human assessors at marking and hence results can be generated earlier. Thirdly, compared to human assessors, computerised assessors produce more consistent assessment. A disadvantage of present computerised assessors is that they are not as flexible as human assessors and cannot exercise the same judgement on unexpected answers.

Although most of the computerised assessors currently available use a variety of different implementation tools and techniques, many of the problems that occur and the solutions adopted are common to all the assessors. In this paper we report on the general problems and solutions we have found in our research and development of computer based assessment tools.

A Model of Computerised Assessment

Although it is possible to partially assess a candidate's IT skills by the use of multiple choice questions, the only satisfactory way to assess IT competence is to assess the ability to use appropriate tools in a typical working environment. The simplest form of test, called a function test, assesses the ability to use a single function of an IT tool to answer each of a set of problems, for example, to select the correct menu item. A more complete test of a candidate's skill involves using a candidate-defined sequence of functions to perform a complete exercise, for example, to generate a spreadsheet to solve a given problem. The results of such exercises are assessed by comparing the syntax of the answer with the syntax of the correct answer(s) given by the examiner. Present technology does not allow the semantics - the meaning - of the answer to be assessed.

There are two main methods of assessing IT skills: assessment of the result and assessment of the method. Assessment of the result is obtained by comparing the candidate's attempt against the correct result, for example, comparing the candidate's spreadsheet with the examiner's answer. Assessment of method is obtained by comparing the sequence of actions used to generate the result, generally known as the event stream, with the correct set of actions specified by the examiner. Method is not normally assessed using human examiners because of the high cost; typically one examiner per candidate is required.

The assessment is based on criteria set by the examiner or examination board. In the simplest form of assessment an answer is either correct - and awarded full marks - or incorrect - and awarded zero. In general, an answer may be partially correct and the marks awarded depend on the correctness of the answer. In typical IT skills assessment the passing criteria are specified in terms of the number and type of errors allowed (Heywood, 1989).
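Pass criteria of this kind can be pictured as a table of error categories and limits. The following is a minimal sketch only: the category names, the limits and the function name `meets_criteria` are hypothetical, and the choice to fail any category the criteria do not mention is an assumption rather than something taken from a particular examination scheme.

```python
from collections import Counter

def meets_criteria(errors, limits):
    """Return True if no error category exceeds its allowed maximum.

    errors: iterable of category names, one entry per detected error.
    limits: dict mapping category name -> maximum number of errors allowed.
    Categories absent from `limits` are treated as not allowed at all
    (an assumption made for this sketch).
    """
    counts = Counter(errors)
    return all(count <= limits.get(category, 0)
               for category, count in counts.items())

# Hypothetical criteria: at most two spelling errors and one formatting error.
limits = {"spelling": 2, "formatting": 1}
print(meets_criteria(["spelling", "formatting"], limits))
print(meets_criteria(["spelling", "spelling", "spelling"], limits))
```

A partial-credit scheme would replace the boolean with a mark computed from the same counts.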

Major Issues in IT Skills Assessment

Flexibility

Different users often require different behaviour from assessment software because universities and departments are autonomous and there is little standardisation in the courses they provide. Assessment software has to be implemented flexibly if it is to be successful; that is, the software must be able to be tailored to different environments and uses. For example, we have built on-line and off-line assessors for word-processing that allow the user to use either a generic or industry-standard word processor. Additionally, the results of the assessment can be immediately displayed or suppressed and encrypted as a mark sheet so that the software can be used for both formative and summative assessment. Lack of flexibility appears to be a major reason for the lack of enthusiasm for much current educational software.

Management of the Assessment Process

One of the major benefits of computerised assessors is the ability to include management software with the assessor to enable the examiner to reduce the amount of administration associated with examinations. Most of this process can be automated especially where the scripts are computer marked. In several computer-based teaching projects the automation of this assessment process has been found to be the most beneficial aspect of the computerisation.

Coping with Difficult Assessment

There will always be candidate attempts at exercises that are difficult to assess. The reason for this is that the answer space is so large that examiners cannot foresee all possible candidate answers and hence cannot address them all in their assessment criteria. Also, some of the candidate actions taken to answer the test may interact and cause unpredictable and unexpected errors. Neither of these problems is unique to computerised assessment; they also cause major difficulties for human examiners. Thus implementing a computerised assessor which can assess all candidate answers is very difficult unless the assessment criteria are very simple or the software achieves the flexibility of human assessors. Our approach is to filter out those attempts that are difficult to assess - for whatever reason - and pass them to a human examiner, rather than attempting to assess all attempts automatically.

This approach leads to the development method we have adopted. Initially, a computerised assessor is implemented which only assesses simple errors. Candidate attempts that meet the assessment criteria, that is, have fewer errors than the limits imposed by the criteria, are assessed as passes. Candidate attempts that have too many errors are passed to human examiners for checking. Human examiners keep a note of the additional errors that they find and this information is used to guide the development of the computerised assessor. Development of the computerised assessor ceases when the percentage of attempts passed to human assessors becomes sufficiently small or when development costs become too high. This method relies on the fact that complex errors are almost always combinations of simpler errors and hence a computerised assessor that counts simple errors can be guaranteed never to report fewer errors than actually exist. As an example, in word processing, assessment criteria are usually given in terms of errors in words rather than errors in characters. The error count for words must be less than or equal to the error count for characters and hence, if the raw character difference is assessed, any attempt whose character-level error count is below the permitted number of word errors must be a pass.
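The filtering step can be sketched as follows. This is an illustration under stated assumptions, not the CATS implementation: it uses plain Levenshtein distance as the "raw character difference", takes the premise above (that the character-level count never undercounts word errors) as given, and treats `max_word_errors` as a hypothetical inclusive limit. The names `char_edit_distance` and `triage` are invented for the example.

```python
def char_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def triage(attempt: str, model: str, max_word_errors: int) -> str:
    """Pass attempts that are safely within the limit; refer the rest."""
    char_errors = char_edit_distance(attempt, model)
    # Assumed premise: word errors <= character errors, so a small
    # character count is sufficient (but not necessary) for a pass.
    if char_errors <= max_word_errors:
        return "pass"
    return "refer to human examiner"

print(triage("The quikc brown fox", "The quick brown fox", 2))
```

Attempts that are referred are not failures; they are simply outside what this cheap check can decide, which is the point of the filter.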

Should Method or Result be Assessed?

Assessment can be generated from a comparison of the correct answer with the candidate's answer or by a comparison of the candidate's event stream and the correct event stream. Both techniques can detect the correctness of the candidate's answer but the event stream can additionally detect inefficiencies in the skills displayed that cannot be detected by analysis of the final result. For example, if a candidate undertaking a word processing exercise deletes some text by mistake and then re-inserts it, analysis of the final document will not detect this whereas analysis of the event stream will. Similarly, analysis of the final document will not detect if a paragraph is deleted character-by-character using backspace or selected and deleted using the delete key.
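The deleted-and-retyped example can be made concrete with a toy event stream. The event format used here, tuples of `("insert", position, text)` and `("delete", position, length)`, is an assumption made for illustration; real word processors expose their events in other forms.

```python
def replay(events, text=""):
    """Apply a list of editing events to `text` and return the final document."""
    for kind, pos, arg in events:
        if kind == "insert":      # arg is the string inserted at pos
            text = text[:pos] + arg + text[pos:]
        elif kind == "delete":    # arg is the number of characters removed
            text = text[:pos] + text[pos + arg:]
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return text

direct = [("insert", 0, "Dear Sir")]
with_slip = [
    ("insert", 0, "Dear Sir"),
    ("delete", 5, 3),       # "Sir" deleted by mistake
    ("insert", 5, "Sir"),   # and typed back in
]

# Both streams produce the same final document ...
print(replay(direct) == replay(with_slip))
# ... but only the event streams reveal the extra work.
print(len(direct), len(with_slip))
```

Comparing final documents sees no difference between the two candidates; comparing event streams sees three events where one would have sufficed.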

The number of items to be compared, which determines the efficiency of the assessment, is constant for the comparison of the final documents whereas it depends on the number of user actions taken in the case of the event stream comparison. The event stream comparison is faster when the number of user interactions is low but is slower when the number of interactions is large. Thus the fastest technique depends on the number of user events generated. Although the event stream analysis allows the method to be assessed, few present assessment criteria for IT skills relate to method. A further problem is that assessment of the event stream depends on matching items in the user and correct event streams. The matching process is NP-complete and hence can consume considerable time for pathological cases. A further complication is that it is often difficult to collect the user event stream, especially in the form required.

Taking all these difficulties into consideration, it is presently better, in most cases, to generate an assessment from a comparison of the final documents rather than from the event streams. In the future it is likely that event stream comparisons will become more important due to the increasing use of group working where the method of IT tool use becomes very important.

Dealing with Multiple Correct Answers

IT tools provide the user with a set of in-built functions which operate on subsets of the information to be manipulated. The higher level assessment which we are considering here is used to discover if the candidate can apply the IT tool to real scenarios. An important part of these higher level tests is the flexibility given to the candidate both in the functions used and their sequence. This flexibility is required to test the candidate's ability but it complicates the assessment since there are normally many different correct sequences of functions which can be used to perform the task. It is therefore important that the assessor can deal with multiple correct answers.

Examples where there are multiple correct answers include a spreadsheet cell that has to contain a particular value. The value can be generated by a constant in the cell or by a formula which generates the value. In some cases only one answer is correct but in others several or all the answers will suffice. As an example from word processing, consider the centring of a heading. This can be accomplished in many ways including using the centring command, one or more tabs or multiple spaces.

It is possible to build an assessor that can accept multiple correct answers to an exercise and compare a candidate's answer with all the possible correct answers to obtain the assessment. However, this is not a general approach because dealing with the combination of all the correct ways of specifying different parts of the assessment leads to a combinatorial explosion. Typical examples of multiple correct answers for part of an exercise include alternative spellings in word processing and equivalent formulae in spreadsheets.
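The size of the problem is easy to see with a small count. In the sketch below the exercise parts and their acceptable forms are invented for illustration; the only point is that the number of complete model answers is the product of the number of alternatives for each part.

```python
from itertools import product
from math import prod

# Hypothetical exercise: acceptable forms for each independently marked part.
alternatives = [
    ["colour", "color"],
    ["centred with command", "centred with tabs", "centred with spaces"],
    ["=A1+A2", "=SUM(A1:A2)"],
]

def count_model_answers(parts):
    """Number of distinct complete answers if every combination is allowed."""
    return prod(len(forms) for forms in parts)

all_answers = list(product(*alternatives))
print(count_model_answers(alternatives))              # 2 * 3 * 2 = 12
print(count_model_answers([["a", "b", "c"]] * 20))    # 3 ** 20
```

Three parts already give twelve model answers to compare against; twenty parts with three forms each give several thousand million.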

The most effective method of assessing equivalent answers is to pre-process all answers into a single canonical form which can then be assessed by comparison with the model answer. In some cases this is simple, for example, replacing multiple spaces in a document by a single space. In others it is considerably more complex; for example, equivalent formatting methods can often only be dealt with by using image processing techniques (Schurmann et al., 1992).
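For the simple cases, canonicalisation can be a few lines of text processing. The sketch below is an assumption-laden illustration: the spelling table is hypothetical, all runs of whitespace (not only spaces) are collapsed, and words with attached punctuation or different capitalisation are left untouched.

```python
import re

# Hypothetical table of accepted alternative spellings -> preferred spelling.
SPELLING_VARIANTS = {"color": "colour", "center": "centre"}

def canonicalise(text):
    """Reduce a plain-text answer to a single canonical form."""
    # Collapse runs of whitespace (spaces, tabs, newlines) to one space.
    text = re.sub(r"\s+", " ", text).strip()
    # Map each accepted variant spelling onto the preferred one.
    words = [SPELLING_VARIANTS.get(word, word) for word in text.split(" ")]
    return " ".join(words)

candidate = "The  color   of the\tsky "
model = "The colour of the sky"
print(canonicalise(candidate) == canonicalise(model))
```

Because both the candidate answer and the model answer pass through the same function, only one model answer has to be stored for all the equivalent forms the function covers.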

Synchronisation

Differences between a candidate's answer and the correct answer(s) are regarded as errors and the number and type of the errors is used to generate the assessment. The key to the identification of the errors is a synchronisation algorithm that produces the 'best' match of the candidate answer to the model answer. There are many different synchronisation algorithms discussed in the literature, a selection of which are described in Aoe (1994), but these algorithms are, in general, context independent. The best match is defined as the minimum number of changes required to convert the candidate's answer into the correct answer or vice versa. Since computerised assessment often has to mimic the assessment of a human examiner, the synchronisation algorithms need to mimic human synchronisation, that is, they need to be context dependent and exhibit 'intelligence' when identifying errors. For example, some of the standard synchronisation algorithms do not always group adjacent errors together when possible, whereas a human examiner would almost certainly do so. Thus standard synchronisation algorithms are not sufficient on their own; additional 'intelligence' needs to be built into them.
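The "minimum number of changes" definition corresponds to the classic edit-distance alignment, which can be sketched as below. This is the context-independent baseline only, with none of the added 'intelligence' discussed above; the function name `align` and the operation labels are choices made for this example.

```python
def align(candidate, model):
    """Minimum-edit alignment of two token sequences.

    Returns a list of (operation, candidate_token, model_token) tuples,
    where operation is "match", "substitute", "extra" or "missing".
    """
    n, m = len(candidate), len(model)
    # dist[i][j] = fewest edits turning candidate[:i] into model[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if candidate[i - 1] == model[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # extra token
                             dist[i][j - 1] + 1,         # missing token
                             dist[i - 1][j - 1] + cost)  # match / substitute
    # Walk back from the corner to recover one optimal alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if candidate[i - 1] == model[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops.append(("match" if cost == 0 else "substitute",
                            candidate[i - 1], model[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("extra", candidate[i - 1], None))
            i -= 1
        else:
            ops.append(("missing", None, model[j - 1]))
            j -= 1
    ops.reverse()
    return ops

ops = align("the cat sat on mat".split(), "the cat sat on the mat".split())
print([op for op in ops if op[0] != "match"])
```

When several alignments have the same minimum cost, the walk-back picks one of them by a fixed preference order, which is exactly where a human-like tie-breaking rule (such as grouping adjacent errors) would have to be added.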

The units or tokens used in the synchronisation algorithm are also important since errors will be reported in multiples of the token size. For example, in word processing assessment, if the candidate's attempt and the correct answer are compared using words as tokens then errors will be reported as strings of words. There is a trade-off to be made between small tokens, which allow the position of errors to be detected with high precision but carry a high probability of false synchronisation, and larger tokens, which do not allow such precise detection of the position of errors but give better synchronisation. Synchronisation may need to be performed on different token sizes since the assessment criteria often cover different units; for example, for a spreadsheet, assessment criteria may be in terms of cells, rows and columns.
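The effect of token size on what is reported can be seen by running the same comparison twice. The sketch uses `difflib.SequenceMatcher` from the Python standard library purely as a convenient stand-in synchroniser; it is not a minimum-edit algorithm and is not the one used in the assessors described here.

```python
from difflib import SequenceMatcher

def differences(candidate, model):
    """Return the non-matching regions between two token sequences."""
    matcher = SequenceMatcher(None, candidate, model, autojunk=False)
    return [
        (tag, candidate[i1:i2], model[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

candidate = "I recieve mail"
model = "I receive mail"

# Word tokens: the error is reported as a whole word.
word_errors = differences(candidate.split(), model.split())
# Character tokens: the error is located within the word.
char_errors = differences(candidate, model)

print(word_errors)
print(char_errors)
```

With word tokens the whole of "recieve" is flagged, which suits criteria expressed in word errors; with character tokens only the transposed letters are flagged, at the cost of more opportunities for spurious matches in longer texts.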

Error Analysis and Unforeseen Errors

The major task of assessment is the classification and assessment of errors. In IT skills assessment the assessment criteria are often couched in terms of a pass if the number of errors in each defined category is less than a specified value. The assessment software therefore has to classify the errors discovered by the synchronisation algorithm in the same way as a human examiner. The problem is that examiners cannot foresee all errors that a candidate may make and thus the assessment criteria may not cover all eventualities. A related problem is that the criteria often overlap and give rise to ambiguities for some error classifications. Examination Boards have elaborate systems to cope with such circumstances using human examiners and our approach is to leave these systems in place so that scripts containing errors that are ambiguous or difficult to classify are filtered out and passed to a human examiner.

We use a rule-based system to translate between errors detected in a candidate answer by the synchronisation algorithm and the error type required by the assessment criteria. The rules are prioritised for conflict resolution and testing allows the rule set to be modified and improved.
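A prioritised rule base of this kind might look like the following sketch. The rules, priorities and category names are hypothetical and far simpler than a real examination scheme; an error that no rule claims is returned as `None`, standing for referral to a human examiner.

```python
# Each rule is (priority, predicate, category); lower number = higher priority.
# An error is a dict with a "kind" plus the candidate and model tokens.
RULES = [
    (1, lambda e: e["kind"] == "missing", "omission"),
    (2, lambda e: e["kind"] == "extra", "insertion"),
    (3, lambda e: e["kind"] == "substitute"
        and e["candidate"].lower() == e["model"].lower(), "capitalisation"),
    (4, lambda e: e["kind"] == "substitute", "spelling"),
]

def classify(error, rules=RULES):
    """Return the category of the highest-priority rule that applies."""
    for _, applies, category in sorted(rules, key=lambda rule: rule[0]):
        if applies(error):
            return category
    return None  # no rule applies: pass the script to a human examiner

print(classify({"kind": "substitute", "candidate": "london", "model": "London"}))
print(classify({"kind": "substitute", "candidate": "recieve", "model": "receive"}))
print(classify({"kind": "transposed"}))
```

Both substitution rules match the first error; the priority order resolves the conflict in favour of the more specific rule. Adding or reordering rules as testing reveals misclassifications does not require changes to `classify` itself.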

Conclusion

In this paper we have described the major issues of computerising IT skills assessment from an implementer's viewpoint. The points made can be summarised in the following list of Do's:

Do make assessors flexible and allow users to customise assessors.
Do include management software as part of any assessor.
Do use filtering to only mark automatically those attempts which are simple to assess. Pass other attempts to human assessors.
Do develop incrementally until development costs are excessive or all assessment criteria have been considered.
Do assess final documents rather than event streams unless method assessment is required.
Do use a standard synchronisation algorithm from the literature but after modification to mimic human assessment.
Do use a rule base to categorise errors.

Acknowledgements

The authors would like to thank Prof. M.R. Sleep and J. Allen for useful comments on the draft of this paper. They would also like to thank the Higher Education Funding Councils of Great Britain for supporting this work through the award of a Phase 2 TLTP award and the Royal Society of Arts Examinations Board for supporting the work on professional examinations.

References

Aoe, J. (1994). Computer Algorithms: String Pattern Matching Strategies. IEEE Computer Society Press, Los Alamitos, CA.

Biddle and Associates (1994). The OPAC System Automated Office Skills Testing Software, Sacramento, CA.

Dowsing, R.D., Long, S., and Sleep, M.R. (1996). The CATS wordprocessing skills assessor, Active Learning, No 4, pp 46 - 52.

Heywood, J. (1989). Assessment in Higher Education, John Wiley, Chichester.

NCC (1995). PC Driving Test, Manchester, England.

RSA (1995). CLAIT DOS file management - Information Brief, RSA Examination Board, Coventry, UK.

Schurmann, J., Bartneck, N., Bayer, T., Franke, J., Mandler, E. and Oberlander, M. (1992). Document Analysis - from pixels to contents, Proc. IEEE, Vol 80, No. 7, pp 1101-1119.

The author(s) assign to ASCILITE and educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The author(s) also grant a non-exclusive licence to ASCILITE to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the ASCILITE 97 conference papers, and for the documents to be published on mirrors on the World Wide Web. Any other usage is prohibited without the express permission of the authors.


This page maintained by Rod Kevill. (Last updated: Friday, 21 November 1997)
NOTE: The page was created by an automated process from the emailed paper and may vary slightly in formatting and layout from the author's original.