THE EVALUATION OF SPEAKING ABILITY IN COMMUNICATIVE SITUATIONS:
global rating and detailed analysis of oral performance of students of 11 to 12 years of age

author: Amos van Gelderen
promotor: L.C.W. Pols
co-promotor: G.C.W. Rijlaarsdam
date of defence: 16 November 1992

 

Summary

Introduction

What are the main dimensions along which one can and should evaluate speaking ability at the end of primary education? This is the question that guided the studies reported here. The question arises in the context of a National Assessment of Educational Performance. This survey has several goals. First, it is intended to inform the public about the effectiveness of language education. Second, it aims to provide an empirical basis for the discussion about the educational level and whether it needs to be improved. Third, it is meant to provide educators and educational researchers with means for educational improvement. In order to fulfil these goals satisfactorily, the testing devices that are used must provide a rich source of information. It will, for example, not be sufficient to inform the public that the speaking ability of students in Holland is 'poor'. In other words, we need more precise information about which aspects of the oral performance are disappointing, under which conditions the results are obtained and how they can be related to educational improvement.
On the other hand, large scale surveys impose restrictions on the administration of tests, especially tests for oral performance that are individually administered. Moreover, rating procedures require the use of trained assessors, which is rather costly and time consuming. Therefore I undertook to develop and test a general rating scheme for the evaluation of speaking ability that results in reliable and valid ratings of different aspects of the ability and at the same time satisfies the requirements of efficiency in large scale assessments. A central assumption in the assessment of speaking ability in the context of primary education is that the most appropriate condition for testing is the simulation of realistic communication. That is, the testing situation, the so-called integrated task, should consist of a communicative purpose against the background of a real-life situation that students recognize as such. Accordingly, criteria for assessment should derive from the communicative effectiveness of speech. These assumptions are based on the fact that language education primarily aims at providing the necessary skills to participate in all kinds of communicative situations. So the task of a national assessment is to evaluate to what extent the educational system succeeds in this. This poses specific problems for a valid evaluation. Which types of communicative tasks are relevant for the assessment of speaking ability of students of a certain age? How many different tasks should be administered and how varied will they have to be to provide a satisfactory coverage of the domain? Although the main purpose of the empirical studies reported was to develop and validate a rating scheme, the so-called problem of task validity could not be ignored. It soon appeared that evaluation criteria are to some extent dependent upon the characteristics of (integrated) tasks. Moreover, the applicability of the rating scheme had to be limited from the beginning: use of the scheme is warranted only in tasks where individual speakers - instead of pairs or trios - can be rated for their contribution to the communication.

 

Data collection

Students in the last year of primary education performed four oral tasks: two tasks were narrative, one task consisted of alarming the police by telephone and one task of an exposition of the way a spider builds its web. Except for the alarming task, in all tasks classmates functioned as listeners.
Sound recordings were made of all performances. Data collection took place in two different samples. One sample consisted of two hundred students and can be regarded as a nationally representative sample; the second sample consisted of one hundred students from the region of Amsterdam and surroundings. The recordings of the oral performances were in general of an acceptable quality for assessment purposes. In view of the intended validity study and the phonetic analyses that had to be carried out, special care was taken in the recording sessions for the second sample.

 

Theoretical foundation

A rating scheme is proposed consisting of four functional dimensions. These are based on an overview of so-called analytic schemes that have been developed in studies of the rating of speaking skills in diverging contexts (Wesdorp, 1981). These dimensions are defined by functions that can be derived from the general criterion of communicative efficacy. Two dimensions - Reference and Delivery - are directly related to communicative content. Reference is defined by the representational function of language; Delivery is defined by the functions of expression and appeal (Bühler, 1982). The dimensions interchangeably - dependent upon the communicative situation - denote the dominant communicative functions that are to be realised. The other two dimensions - Fluency and Intelligibility - are indirectly related to communicative content and apply to the conditions that have to be met in order to produce interpretable utterances. Fluency is defined by the realisation of continuity of speech and Intelligibility by the quality of the realisation of utterances ('decodability') (Crystal & Davy, 1979).
In order to use the four dimensions as a rating scheme, each dimension is regarded as a heuristic device from which specific criteria for assessment in a given speaking situation can be deduced. Furthermore, a linkage is assumed between the criteria deduced from the dimensions on the one hand and the aspects of behaviour that are the objects of assessment on the other. Specifically, for Reference only linguistic aspects are seen to be relevant; for Delivery linguistic, phonetic and non-verbal aspects are relevant; for Fluency and Intelligibility linguistic and phonetic aspects. On a more concrete level, however, it is supposed that the same aspects of behaviour do not always serve the same functions. That is why the differentiation of the dimensions is solely based upon the communicative functions to be evaluated and not upon the precise behavioural aspects that can be distinguished.

 

Empirical test of the rating scheme

The rating scheme has been put to empirical test in two steps. First, several rating categories have been derived from each dimension and have been applied in small scale experiments by juries of four or five raters. In these experiments (N=40) performances of students on the four tasks, selected from the larger data set, were rated after an instruction and training session. The purpose of these experiments was to acquire knowledge as to the applicability of the rating categories for performances on different oral tasks, the degree of consensus among raters, the instrumental differentiation that exists between jury ratings of different categories and optimal rating conditions (rating several categories simultaneously vs each category separately). Second, on the basis of these experiments, a more definite test of the scheme has been carried out. A jury of three raters (all women with experience as teachers in primary education) applied selected categories - one for each dimension of the scheme - to rate the performances of all students in our two samples on the four oral tasks. In both steps - the small scale experiments and the large scale studies - categories have been derived from the dimensions in a task-specific way. That is, categories for the same dimension but applied in a different task often consist of different criteria and require different behavioural aspects to be observed. This is a consequence of a functional - instead of behavioural - rating scheme.
The results of the empirical investigations can be summarized in the following four points.
1. Reliability of the rating categories is at an acceptable level (about .80) for the purposes of a national assessment, when juries of three trained raters are used.
2. A four-factor model for the correlations among jury ratings, each factor representing one of the dimensions of the rating scheme, fits reasonably well. Furthermore there are strong indications that ratings of categories derived from the same dimension hardly convey distinct information about speaking ability in a given task, whereas ratings of categories derived from different dimensions, although sometimes strongly correlated, do convey distinct information.
3. The rating scheme proves to be applicable for performances on all four tasks tested, but there are indications that in two of the tasks (the alarming task and the exposition) rating of categories for Delivery and Fluency is more difficult, due to short duration of the performances and/or to the lack of cohesiveness of the texts produced.
4. An efficient rating procedure is feasible: a jury of three trained raters can rate each performance on four categories simultaneously without significant loss of reliability or validity, provided that the performances are of reasonable length.
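
To make the notion of jury reliability in point 1 concrete, the following is a minimal sketch of one common coefficient for the reliability of a summed jury score, Cronbach's alpha computed over raters. The summary does not state which coefficient was actually used, so the choice of alpha, the function name jury_alpha and the example scores are all assumptions made for illustration only.

```python
from statistics import pvariance


def jury_alpha(ratings):
    """Cronbach's alpha for a jury (illustrative assumption, not the thesis's stated method).

    ratings[i][j] is the score given by rater j to performance i.
    """
    k = len(ratings[0])  # number of raters in the jury
    # Variance of each individual rater's scores across performances
    rater_vars = [pvariance([row[j] for row in ratings]) for j in range(k)]
    # Variance of the summed jury score across performances
    total_var = pvariance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(rater_vars) / total_var)


# Hypothetical scores of three raters for five performances (invented data)
example = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [4, 4, 5],
    [5, 5, 4],
]
print(round(jury_alpha(example), 2))
```

The coefficient approaches 1 when raters order the performances in the same way and drops as their scores diverge, which is why it depends on both the training of the raters and the size of the jury.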

 

Empirical test of rating validity

A question that was not addressed in the foregoing is whether the rating of the dimensions in oral performances does convey the information about the behavioural aspects stated in the dimensions' definitions. As mentioned previously, the aspects of speech to be rated are not invariant across tasks. Although from an instrumental point of view it has been demonstrated that ratings of the dimensions convey distinct (but correlated) information, the diagnostic value of these ratings is not yet clear. In short, we cannot exclude the possibility that ratings are based on other aspects of the speaking performances than we believe they are, or that the ratings of different dimensions have largely overlapping meanings so that their differentiation is invalid. Moreover, some notorious rating problems, such as the 'signific effect' and the 'halo effect', could have invalidated the resulting scores. To investigate the validity of the jury ratings on the four dimensions, several analyses have been carried out to determine the correlations between these ratings and linguistic and phonetic aspects of the rated performances. In a regression design I tested hypotheses about these correlations. First, these hypotheses state a significant relation between jury ratings and the frequency of the linguistic and phonetic variables that had been mentioned in the definition of the rating dimension in question (the convergent prediction). Second, the hypotheses state that a weaker relation exists between the jury ratings and variables that are mentioned in the definition of other rating dimensions (the divergent prediction). The prediction of Delivery and Fluency has received most attention in this examination, because the differential meaning of those two dimensions has proved to be more problematic than that of Reference or Intelligibility. Therefore a rather large number of linguistic and phonetic predictors has been analyzed for Delivery and Fluency in comparison with the other dimensions. (Non-verbal variables could not be included because performances were rated from sound tapes.) On the other hand, because of the time consuming procedures involved in the analysis of phonetic and linguistic variables, only relatively small selections of performances (sixty per dimension) on one (narrative) task could be analyzed.
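
The logic of the convergent and divergent predictions can be sketched as a comparison of explained variance. The sketch below is deliberately simplified to one predictor per dimension, in which case the proportion of explained variance equals the squared Pearson correlation; the study itself used regression with several predictors and significance tests, which are omitted here. All function names and data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def explained_variance(predictor, ratings):
    """Proportion of rating variance explained by a single predictor (r squared)."""
    return pearson_r(predictor, ratings) ** 2


def own_predictor_dominates(own, other, ratings):
    """True if the predictor from the dimension's own definition explains more
    rating variance than a predictor taken from another dimension's definition."""
    return explained_variance(own, ratings) > explained_variance(other, ratings)


# Invented data: jury ratings on one dimension for five performances
ratings = [1, 2, 3, 4, 5]
own = [2, 4, 6, 8, 10]   # hypothetical variable named in this dimension's definition
other = [3, 1, 4, 1, 5]  # hypothetical variable named in another dimension's definition
print(own_predictor_dominates(own, other, ratings))
```

A comparison of this kind only addresses the relative strength of the two relations; whether each relation differs reliably from zero would require the significance tests left out of this sketch.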
For the prediction of ratings of Reference the total number of relevant 'content elements' has been determined in each performance on three tasks (narrative, alarming and expository) (N=200). Prediction of ratings of Intelligibility has been carried out by calculating the correlation between the ratings and the number of 'hardly intelligible' utterances in performances on a narrative task (N=60). For the prediction of ratings of Delivery the following variables have been selected: (1) variation of intonation, based on auditory analysis according to a description of fundamental pitch movements in Dutch ('t Hart, Collier & Cohen, 1990), (2) acoustic measurements of variation of fundamental frequency, (3) acoustic measurements of intensity and intensity variation, (4) relative number of pitch accents (corrected for text length), (5) relative numbers of lexical elements with a positive or negative effect on narrative register. For the prediction of ratings of Fluency the selected predictors are: (1) relative number of self-corrections and non-functional pauses, (2) duration of self-corrections and non-functional pauses, (3) mean speech rate (pauses included), (4) mean articulation rate (pauses not included). For all variables that cannot be measured instrumentally, a detailed coding instruction has been designed and applied by two trained raters. By comparing the codes assigned independently by each rater for the same performances the degree of consensus has been determined. The coding of pitch movements by the two raters has been further examined by comparison with instrumental analyses of a sample of the coded utterances. In all cases coding consensus and accuracy have been found to be satisfactory.
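
The difference between the two rate measures among the Fluency predictors can be illustrated as follows. This is a minimal sketch that assumes rates are expressed in syllables per second and that pause durations are available as a list of seconds; the summary specifies neither the unit nor the data format, and the function names are hypothetical.

```python
def speech_rate(n_syllables, total_seconds):
    """Mean speech rate: pauses are included in the time base."""
    return n_syllables / total_seconds


def articulation_rate(n_syllables, total_seconds, pause_seconds):
    """Mean articulation rate: pause time is removed from the time base."""
    speaking_time = total_seconds - sum(pause_seconds)
    return n_syllables / speaking_time


# Invented example: 120 syllables in a 60-second performance with three pauses
pauses = [5.0, 10.0, 5.0]
print(speech_rate(120, 60.0))
print(articulation_rate(120, 60.0, pauses))
```

Because pause time is subtracted only in the second measure, a speaker who pauses often has a low speech rate but may still have a normal articulation rate, so the two predictors capture different aspects of continuity.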
Results show that for the ratings of three dimensions - Reference, Intelligibility and Delivery - the hypotheses can be accepted. The ratings are more strongly related to the linguistic and phonetic variables that are mentioned in their definition than to those mentioned in the definition of other dimensions. The proportion of explained variance of ratings of Reference ranges from 53 to 79 percent (dependent upon the task). Explained variance of ratings of Intelligibility was 37 percent and for Delivery 83 percent. Intonation variation and relative numbers of lexical elements with reinforcing or decreasing effect on register contribute most to the prediction of Delivery. Ratings of Fluency are also substantially predicted (55 percent of the variance of the ratings); however, only the duration of self-corrections and non-functional pauses plays a significant role. Moreover, it appears that predictors for Delivery also explain a large proportion of the variance of the Fluency ratings (55 percent). Further analysis of the specific meaning of these ratings shows that only rather gross disruptions of continuity of speech are significantly related to Fluency (false starts and pauses of long duration), whereas more subtle hesitations, repeats, filled and unfilled pauses appear to be largely ignored by the jury. Furthermore, no evidence has been found of the occurrence of so-called signific or halo effects in the ratings of the speech performances. The correlation between ratings of Delivery and Fluency can be largely explained by the correlation that exists between the behavioural aspects rated. Also, no indication has been found for diverging interpretations among raters regarding the relevance of certain behavioural aspects for deciding upon their scores.

 

Discussion

The results of the empirical studies reported are rather promising. The rating scheme tested proves to satisfy several needs in large scale assessments of speaking ability, such as the need to supply differential information about the skills students possess in a reliable and efficient way. Moreover, its utility for the rating of performances on several communicative tasks has been demonstrated. Also, the validity and diagnostic meaning of the rating dimensions was, for the greater part, substantiated. Nonetheless, I must point out some limitations of the studies on which these results are based. First, the sample of students for the validation study for Delivery, Intelligibility and Fluency was rather small and not nationally representative of the population, so the possibility of statistical generalization is limited. Second, the results are mainly based on the scores given by three trained raters; we cannot be certain that other raters' scores are equivalent. Third, several relevant predictor variables for the rating dimensions have not been included in the validity study for various reasons. Fourth, the validity of the rating dimensions has been solely determined on the basis of ratings of performances on one (narrative) task. In view of the dependence of rating criteria and the behavioural aspects to be observed on task characteristics, results cannot be generalized to ratings on other types of tasks. Fifth, not all kinds of rating criteria that could be relevant in the assessment of speaking ability in communicative situations have been investigated. Specifically, criteria dealing with standard usage and grammatical correctness or complexity have not been included, although these criteria might be rather important in the case of formal communication. Also, communicative situations in which cooperation among interactants plays an important role require specific rating criteria that have not been included in our scheme. Criteria for turn-taking and turn-giving and for evaluation of the process of negotiation and cooperation as such are important additions if performances on such tasks are to be assessed.
The above limitations all deserve attention in empirical studies. Some of these research themes are, in my opinion, especially important and are elaborated upon. They concern the following:
1. A redefinition of Fluency on the basis of our validity study. The results of the study have made it clear that the meaning of Fluency ratings is considerably narrower than the original definition of the dimension; several explanations and implications of this finding are discussed.
2. The relation between acoustic and perceptual variables in the rating of speech, as found in several empirical studies, is discussed. Several findings in these studies and in the present one give occasion to speculate about the basis of speech perception and rating: detail or Gestalt.
3. The problem of task validity for the evaluation of speaking ability in communicative situations is explored. What are the main parameters of integrated tasks that have to be varied to reach an acceptable coverage of the domain? A suggestion for an experimental analysis of task parameters is given.
In conclusion, the utility of the rating scheme in two different contexts is discussed: large scale performance surveys and (diagnostic) evaluation in primary and secondary education. A comparison is made with a rating scheme now in use for national performance surveys at the end of primary school and several advantages of the present scheme are pointed out. With respect to evaluation in educational contexts, it is indicated what advantages there seem to be in using the functional rating scheme in comparison with current school practice. Furthermore, some ideas for implementation of the scheme and some practical consequences for the teachers, the pupils and the curriculum are outlined.