THE EVALUATION OF SPEAKING ABILITY IN COMMUNICATIVE SITUATIONS:
global rating and detailed analysis of oral performance of students of 11 to 12
years of age
author: Amos van Gelderen
promotor: L.C.W. Pols
co-promotor: G.C.W. Rijlaarsdam
date of defence: 16 November 1992
Summary
Introduction
What are the main dimensions according to which one can and should evaluate the
speaking ability at the end of primary education? This is the question that
guided the studies reported here. The question arises in the context of a
National Assessment of Educational Performance. This survey aims at several
goals. First, it is intended to inform the public about the effectiveness of
language education. Second, it aims to provide an empirical basis for the
discussion about the educational level and whether it needs to be improved. Third,
it is intended to provide educators and educational researchers with means for
educational improvement. In order to fulfil these goals satisfactorily, the
testing devices that are used must provide a rich source of information. It
will, for example, not be sufficient to inform the public that the speaking
ability of students in Holland is 'poor'. In other words, we need more precise
information about which aspects of the oral performance are disappointing, under
which conditions the results are obtained and how they can be related to
educational improvement.
On the other hand, large scale surveys impose restrictions on the administration
of tests, especially tests for oral performance that are individually
administered. Moreover, rating procedures require the use of trained assessors,
which is rather costly and time consuming. Therefore I undertook to develop and
test a general rating scheme for the evaluation of speaking ability that results
in reliable and valid ratings of different aspects of the ability and at the
same time satisfies the requirements of efficiency in large scale assessments. A
central assumption in the assessment of speaking ability in the context of
primary education is that the most appropriate condition for testing is the
simulation of realistic communication. That is, the testing situation, the
so-called integrated task, should consist of a communicative purpose against the
background of a real-life situation that students recognize as such.
Accordingly, criteria for assessment should derive from the communicative
effectiveness of speech. These assumptions are based on the fact that language
education primarily aims at providing the necessary skills to participate in all
kinds of communicative situations. So the task of a national assessment is to
evaluate to what extent the educational system succeeds. This poses specific
problems for a valid evaluation. Which types of communicative tasks are relevant
for the assessment of speaking ability of students of a certain age? How many
different tasks should be administered and how varied will they have to be to
provide a satisfactory coverage of the domain? Although the main purpose of the
empirical studies reported was to develop and validate a rating scheme, the
so-called problem of task validity could not be ignored. It soon appeared that
evaluation criteria are to some extent dependent upon the characteristics of
(integrated) tasks. Moreover, the applicability of the rating scheme had to be
limited from the beginning: use of the scheme is warranted only in tasks where
individual speakers - instead of pairs or trios - can be rated for their
contribution to the communication.
Data collection
Students of the last year of primary education performed on four oral tasks: two
tasks were narrative, one task consisted of alarming the police by telephone and
one task of an exposition of the way a spider builds its web. Except for the
alarming task, in all tasks classmates functioned as listeners.
Sound recordings were made of all performances. Data collection took place in
two different samples. One sample consisted of two hundred students and can be
regarded as a nationally representative sample; the second sample consisted of
one hundred students from the region of Amsterdam and surroundings. The
registration of the oral performances in general was of an acceptable quality
for assessment purposes. In view of the intended validity study and the phonetic
analyses that had to be carried out, special care was taken in the recording
sessions for the second sample.
Theoretical foundation
A rating scheme is proposed consisting of four functional dimensions. These are
based on an overview of so-called analytic schemes that have been developed in
studies of the rating of speaking skills in diverging contexts (Wesdorp, 1981).
These dimensions are defined by functions that can be derived from the general
criterion of communicative efficacy. Two dimensions - Reference and Delivery -
are directly related to communicative content. Reference is defined by the
representational function of language; Delivery is defined by the functions of
expression and appeal (Bühler, 1982). Depending on the communicative
situation, these two dimensions alternately denote the dominant communicative
functions that are to be realised. The other two dimensions - Fluency and
Intelligibility - are indirectly related to communicative content and apply to
the conditions that have to be met in order to produce interpretable utterances.
Fluency is defined by the realisation of continuity of speech and
Intelligibility by the quality of the realisation of utterances ('decodability')
(Crystal & Davy, 1979).
In order to use the four dimensions as a rating scheme, each dimension is
regarded as a heuristic device from which specific criteria for assessment in a
given speaking situation can be deduced. Furthermore a linkage is assumed
between the criteria deduced from the dimensions on the one hand and the aspects
of behaviour that are the objects of assessment on the other. Specifically, for
Reference only linguistic aspects are seen to be relevant, for Delivery
linguistic, phonetic and non-verbal aspects are relevant, for Fluency and
Intelligibility linguistic and phonetic aspects. On a more concrete level,
however, it is supposed that the same aspects of behaviour do not always serve
the same functions. That is why the differentiation of the dimensions is solely
based upon the communicative functions to be evaluated and not upon the precise
behavioural aspects that can be distinguished.
Empirical test of the rating scheme
The rating scheme has been put to empirical test in two steps. First, several
rating categories have been derived from each dimension and have been applied in
small scale experiments by juries of four or five raters. In these experiments
(N=40) performances of students on the four tasks, selected from the larger data
set, were rated after an instruction and training session. The purpose of these
experiments was to acquire knowledge as to the applicability of the rating
categories for performances on different oral tasks, the degree of consensus
among raters, the instrumental differentiation that exists between jury-ratings
of different categories and optimal rating conditions (rating several categories
simultaneously vs each category separately). Second, on the basis of these
experiments, a more definite test of the scheme has been carried out. A jury of
three raters (all women with experience as teachers in primary education)
applied selected categories - one for each dimension of the scheme - to rate the
performances of all students in our two samples on the four oral tasks. In both
steps - the small scale experiments and the large scale studies - categories
have been derived from the dimensions in a task-specific way. That is, categories
for the same dimension but applied in a different task often consist of
different criteria and require different behavioural aspects to be observed.
This is a consequence of a functional - instead of behavioural - rating scheme.
The results of the empirical investigations can be summarized in the following
four points.
1. Reliability of the rating categories is at an acceptable level (about .80)
for the purposes of a national assessment, when juries of three trained raters
are used.
2. A four-factor model for the correlations among jury ratings, each factor
representing one of the dimensions of the rating scheme, fits reasonably well.
Furthermore there are strong indications that ratings of categories derived from
the same dimension hardly convey distinct information about speaking ability in
a given task, whereas ratings of categories derived from different dimensions,
although sometimes strongly correlated, do convey distinct information.
3. The rating scheme proves to be applicable for performances on all four tasks
tested, but there are indications that in two of the tasks (the alarming task
and the exposition) rating of categories for Delivery and Fluency is more
difficult, due to short duration of the performances and/or to the lack of
cohesiveness of the texts produced.
4. An efficient rating procedure is feasible: a jury of three trained raters
can rate each performance on four categories simultaneously without significant
loss of reliability or validity, provided that the performances are of
reasonable length.
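The reliability figure in point 1 can be illustrated with a minimal sketch. The summary does not state which reliability coefficient was computed; the example below assumes Cronbach's alpha over the raters of one jury. The function name `jury_alpha` and all scores are hypothetical.

```python
from statistics import pvariance


def jury_alpha(ratings):
    """Cronbach's alpha for a jury (an assumed coefficient, not the study's).

    ratings: one list per rater, each holding one score per performance,
    with all raters scoring the same performances in the same order.
    """
    k = len(ratings)
    n = len(ratings[0]) if k else 0
    if k < 2 or n < 2 or any(len(r) != n for r in ratings):
        raise ValueError("need at least two raters with equally long score lists")
    # Variance of each rater's scores, summed over raters.
    sum_rater_var = sum(pvariance(r) for r in ratings)
    # Variance of the jury total (sum over raters) per performance.
    totals = [sum(scores) for scores in zip(*ratings)]
    total_var = pvariance(totals)
    if total_var == 0:
        raise ValueError("jury totals do not vary; alpha is undefined")
    return k / (k - 1) * (1 - sum_rater_var / total_var)


# Hypothetical scores of three raters for five performances.
example_jury = [
    [3, 4, 2, 5, 4],
    [3, 5, 2, 4, 4],
    [2, 4, 3, 5, 3],
]
alpha = jury_alpha(example_jury)
```

When all raters rank the performances identically the coefficient reaches 1; disagreement among raters lowers it.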
Empirical test of rating validity
A question that was not addressed in the foregoing is whether the rating of the
dimensions in oral performances does convey the information about the
behavioural aspects stated in the dimensions' definitions. As mentioned
previously, the aspects of speech to be rated are not invariant across tasks.
Although from an instrumental point of view it has been demonstrated that
ratings of the dimensions convey distinct (but correlated) information, the
diagnostic value of these ratings is not yet clear. In short, we cannot exclude
the possibility that ratings are based on other aspects of the speaking
performances than we believe they are, or that the ratings of different
dimensions have largely overlapping meanings so that their differentiation is
invalid. Moreover, some notorious rating problems, such as the 'signific effect'
and the 'halo effect', could have invalidated the resulting scores. To
investigate the validity of the jury ratings on the four dimensions, several
analyses have been carried out to determine the correlations between these
ratings and linguistic and phonetic aspects of the rated performances. In a
regression design I tested hypotheses about these correlations. First, these
hypotheses state a significant relation between jury ratings and the frequency
of the linguistic and phonetic variables that had been mentioned in the
definition of the rating dimension in question (the convergent prediction).
Second, the hypotheses state that a weaker relation exists between the jury
ratings and variables that are mentioned in the definition of other rating
dimensions (the divergent prediction). The prediction of Delivery and Fluency
has received most attention in this examination, because the differential
meaning of those two dimensions has proved to be more problematic than that of
Reference or Intelligibility. Therefore a rather large number of linguistic and
phonetic predictors for Delivery and Fluency have been analyzed in comparison
with the other dimensions. (Non-verbal variables could not be included because
the performances were rated from sound recordings.) On the other hand, because of
the time consuming procedures involved in the analysis of phonetic and
linguistic variables, only relatively small selections of performances (sixty
per dimension) on one (narrative) task could be analyzed.
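The logic of the convergent and divergent predictions can be sketched as follows. The study used a regression design with several predictors per dimension; this simplified example assumes a single predictor per comparison, in which case the explained variance equals the squared Pearson correlation. All names and numbers are hypothetical.

```python
def explained_variance(predictor, ratings):
    """Proportion of rating variance explained by one predictor (r squared)."""
    n = len(ratings)
    if n != len(predictor) or n < 2:
        raise ValueError("need two equally long lists with at least two values")
    mean_x = sum(predictor) / n
    mean_y = sum(ratings) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, ratings))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    syy = sum((y - mean_y) ** 2 for y in ratings)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant variable explains no variance")
    return sxy ** 2 / (sxx * syy)


# Hypothetical jury ratings of Fluency for five performances.
fluency_ratings = [1, 2, 3, 4, 5]
# Hypothetical predictor named in the Fluency definition (seconds of pausing).
pause_duration = [10, 8, 6, 4, 2]
# Hypothetical predictor named in the Delivery definition (intonation variation).
pitch_variation = [3, 1, 4, 1, 5]

convergent = explained_variance(pause_duration, fluency_ratings)
divergent = explained_variance(pitch_variation, fluency_ratings)
# The hypotheses are supported when the convergent value is significant
# and clearly larger than the divergent one.
supported = convergent > divergent
```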
For the prediction of ratings of Reference the total number of relevant 'content
elements' has been determined in each performance on three tasks (narrative,
alarming and expository) (N=200). Prediction of ratings of Intelligibility has
been carried out by calculating the correlation between the ratings and the
number of 'hardly intelligible' utterances in performances on a narrative task
(N=60). For the prediction of ratings of Delivery the following variables have
been selected: (1) variation of intonation, based on auditory analysis according
to a description of fundamental pitch movements in Dutch ('t Hart, Collier &
Cohen, 1990), (2) acoustic measurements of variation of fundamental frequency,
(3) acoustic measurements of intensity and intensity variation, (4) relative
amount of pitch accents (corrected for text length), (5) relative amounts of
lexical elements with a positive or negative effect on narrative register. For
the prediction of ratings of Fluency the selected predictors are: (1) relative
amount of self-corrections and non-functional pauses, (2) duration of
self-corrections and non-functional pauses, (3) mean speech rate (pauses
included), (4) mean articulation rate (pauses not included). For all variables
that cannot be measured instrumentally, a detailed coding instruction has been
designed and applied by two trained raters. By comparing the codes assigned
independently by each rater for the same performances the degree of consensus
has been determined. The coding of pitch movements by the two raters has been
further examined by comparison with instrumental analyses of a sample of the
coded utterances. In all cases coding consensus and accuracy have been found to
be satisfactory.
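The difference between the two rate measures among the Fluency predictors can be made concrete with a small sketch. The summary only states that pauses are included in speech rate and excluded from articulation rate; counting syllables per second is an assumption here, and the function names and figures are hypothetical.

```python
def speech_rate(syllables, total_duration):
    """Syllables per second over the whole performance, pauses included."""
    if total_duration <= 0:
        raise ValueError("total duration must be positive")
    return syllables / total_duration


def articulation_rate(syllables, total_duration, pause_durations):
    """Syllables per second over speaking time only, pauses excluded."""
    speaking_time = total_duration - sum(pause_durations)
    if speaking_time <= 0:
        raise ValueError("pauses cannot fill the whole performance")
    return syllables / speaking_time


# Hypothetical performance: 120 syllables in 40 seconds, with three pauses.
rate_with_pauses = speech_rate(120, 40.0)
rate_without_pauses = articulation_rate(120, 40.0, [2.0, 3.0, 5.0])
```

In this example the speech rate is 3.0 and the articulation rate 4.0 syllables per second; the gap between the two grows with the share of time spent pausing.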
Results show that for the ratings of three dimensions - Reference,
Intelligibility and Delivery - the hypotheses can be accepted. The ratings are
more strongly related to the linguistic and phonetic variables that are
mentioned in their definition than to those mentioned in the definition of
other dimensions. The proportion of explained variance of ratings of Reference
ranges from 53 to 79 percent (dependent upon the task).
Explained variance of ratings of Intelligibility was 37 percent and for Delivery
83 percent. Intonation variation and relative amounts of lexical elements with
reinforcing or decreasing effect on register have the greatest part in
predicting Delivery. Ratings of Fluency are also substantially predicted (55
percent of the variance of the ratings); however, only the duration of
self-corrections and non-functional pauses plays a significant role. Moreover it
appears that predictors for Delivery also explain a large proportion of the
variance of the Fluency ratings (55 percent). Further analysis of the specific
meaning of these ratings shows that only rather gross disruptions of continuity
of speech are significantly related to Fluency (false starts and pauses of long
duration), whereas more subtle hesitations, repeats, filled and unfilled pauses
appear to be largely ignored by the jury. Furthermore, no evidence has been
found of the occurrence of so-called signific or halo effects in the ratings of
the speech performances. The correlation between ratings of Delivery and Fluency
can be largely explained by the correlation that exists between the behavioural
aspects rated. Also, no indication has been found for diverging interpretations
among raters regarding the relevance of certain behavioural aspects for deciding
upon their scores.
Discussion
The results of the empirical studies reported are rather promising. The rating
scheme tested proves to satisfy several needs in large scale assessments of
speaking ability such as the need to supply differential information about the
skills students possess in a reliable and efficient way. Moreover, its utility
for the rating of performances on several communicative tasks has been
demonstrated. Also, the validity and diagnostic meaning of the rating dimensions
were, for the greater part, substantiated. Nonetheless, I must point out some
limitations of the studies on which these results are based. First, the sample
of students for the validation study for Delivery, Intelligibility and Fluency
was rather small and not nationally representative of the population, so the
possibility of statistical generalization is limited. Second, the results are
mainly based on the scores given by three trained raters; we cannot be certain
that other raters' scores are equivalent. Third, several relevant predictor
variables for the rating dimensions have not been included in the validity study
for various reasons. Fourth, the validity of the rating dimensions has been
solely determined on the basis of ratings of performances on one (narrative)
task. In view of the dependence of rating criteria and the behavioural aspects
to be observed on task characteristics, results cannot be generalized to
ratings on other types of tasks. Fifth, not all kinds of rating criteria that
could be relevant in the assessment of speaking ability in communicative
situations have been investigated. Specifically, criteria dealing with standard
usage and grammatical correctness or complexity have not been included,
although these criteria might be rather important in the case of formal
communication. Also, communicative situations in which cooperation among
interactants plays an important role require specific rating criteria that have
not been included in our scheme. Criteria for turn-taking and turn-giving and for
evaluation of the process of negotiation and cooperation as such are important
additions if performances on such tasks are to be assessed.
The above limitations all deserve attention in further empirical studies. Some
of these research themes are particularly important in my opinion and are
elaborated upon. They are the following:
1. A redefinition of Fluency on the basis of our validity study. The results of
the study have made it clear that the significance of Fluency ratings has been
severely narrowed in comparison with the original definition of the dimension;
several explanations and implications of this finding are discussed.
2. The relation between acoustic and perceptive variables in the rating of
speech in several empirical studies is discussed. These studies and the
present one offer several occasions to speculate about the basis of speech
perception and rating: detail or Gestalt.
3. The problem of task validity for the evaluation of speaking ability in
communicative situations is explored. What are the main parameters of
integrated tasks that have to be varied to reach an acceptable coverage of the
domain? A suggestion for an experimental analysis of task parameters is given.
In conclusion, the utility of the rating scheme in two different contexts is
discussed: large scale performance surveys and (diagnostic) evaluation in
primary and secondary education. A comparison is made with a rating scheme now
in use for national performance surveys at the end of primary school and
several advantages of the present scheme are pointed out. With respect to
educational contexts, it is indicated what advantages there seem to be in using
the functional rating scheme in comparison with current school practice.
Furthermore, some ideas for implementation of the scheme and some practical
consequences for the teachers, the pupils and the curriculum are portrayed.