CLC FCE Dataset, version 1.1

Description

The CLC FCE Dataset is a set of 1,244 exam scripts written by candidates sitting the Cambridge ESOL First Certificate in English (FCE) examination in 2000 and 2001. The scripts are extracted from the Cambridge Learner Corpus (CLC), developed as a collaborative effort between Cambridge University Press and Cambridge Assessment. For each exam script, the CLC FCE Dataset includes the original text written by the candidate (transcribed and anonymised, but otherwise unmodified) as well as marks, error annotation and essential demographic details including the candidate’s first language and age bracket.

The corpus comprises material from what is now known as the B2 First exam, which underwent substantial revision in 2015. All of the material provided here predates that revision.

Summary of Changes in CLC FCE v1.1

Version 1.1 of the CLC FCE Dataset is a revised and enhanced release of the original 2010 dataset (v1.0). The update focuses on improving score accuracy, contextual documentation, data consistency, and usability for research. No changes have been made to the transcribed candidate responses.

Key changes include:

  • Improved exam documentation
    • Added a PDF facsimile of the FCE Writing Handbook (2001), describing the writing tasks and mark schemes in force at the time of the exams.
    • Updated mark scheme references to explicitly align with the correct handbook version.
  • Score corrections and enrichment
    • Filled in previously missing scores and corrected mis‑transcribed or inaccurate scores.
    • Added original multi‑mark, answer‑level scores for the 97 evaluation scripts.
    • Mapped original band scores to a transparent 0–20 scale per task, with 0–40 script‑level totals.
  • Outlier and validity data improvements
    • Remapped previously released “outlier” scripts to a 0–40 scale for consistency.
    • Detokenised and split outlier scripts into two answers to better reflect authentic candidate responses.
    • Added the original source scripts used to generate outliers, including full scores and error annotations.
  • Prompt and rubric restructuring
    • Clearly separated rubric (instructions) from prompt text, improving clarity and structure.
    • Enhanced plain‑text prompt formatting, including bracketed comments as shown to candidates in the original exam papers.
    • Harmonised phrasing and structure across questions (notably Questions 1 and 5) for greater internal consistency.
  • New data formats
    • Introduced a line‑delimited JSON format alongside a slightly updated version of the original XML, enabling easier use in modern NLP pipelines while preserving XML for nested error annotations.
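
As an illustration of the line‑delimited JSON format and the 0–20 / 0–40 scales listed above, the following is a minimal Python sketch, not the official loader. The field names "answers" and "score" are hypothetical and are not taken from the dataset documentation; check the released files for the actual keys. It also assumes that the script‑level total is the sum of the task‑level scores.

```python
import json
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of line-delimited JSON.

    `lines` can be any iterable of strings, including an open file handle.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as error:
            raise ValueError(f"line {line_number}: invalid JSON") from error


def script_total(record: dict) -> float:
    """Sum per-answer scores into a script-level total.

    ASSUMPTION: the layout {"answers": [{"score": ...}, ...]} is hypothetical,
    as is the idea that the 0-40 total is the sum of the 0-20 task scores.
    """
    scores = [answer["score"] for answer in record["answers"]]
    for score in scores:
        if not 0 <= score <= 20:
            raise ValueError(f"task score {score} is outside the 0-20 scale")
    return sum(scores)
```

With a real file, the records could be read with something like `with open(path, encoding="utf-8") as handle: records = list(parse_jsonl(handle))`, where `path` points at the downloaded JSON file.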

Data security

Please be aware of the risk of leaking benchmark datasets to LLMs (e.g. Balloccu et al., EACL 2024). Only use this dataset with LLMs hosted locally (e.g. models downloaded via Hugging Face Transformers) or, when using LLMs via commercial APIs, with services that do not retain data for training (check that the model cards do not oblige the user to share data or improvements).

Do not provide the corpus (full or partial) to others in any way, even if they have also signed the licence agreement, e.g. through the use of repositories on sites such as Hugging Face and GitHub.
Do not release items (e.g. models, data statistics) derived from the corpus without prior approval of Cambridge University Press & Assessment (CUP&A).

Publication date: 04/05/2026

Keywords: Automated Essay Scoring (AES), ESOL (English for Speakers of Other Languages), Learner Corpus, Error Annotation, Cambridge Learner Corpus (CLC), First Certificate in English (FCE).

Authors and Contributors

From Cambridge University: Oeistein Andersen, Chris Bryant, Marek Rei.
From Cambridge Assessment English: David Dursun, Mark Brenchley, Massimo Innamorati.

Citing this dataset

Please continue to reference the following paper if you are using this dataset:

@inproceedings{aa2011,
author = {Yannakoudakis, Helen and Briscoe, Ted and Medlock, Ben},
booktitle = {The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
title = {A New Dataset and Method for Automatically Grading {ESOL} Texts},
year = {2011}
}

The paper is available at: https://aclanthology.org/P11-1019/.

To distinguish it from the original version, please also note your specific use of CLC FCE v1.1, as opposed to CLC FCE v1.0.

You may publish the results of research using this dataset.  In any such publication you must acknowledge use of the dataset in your research by citing Cambridge University Press & Assessment and the Authors and Contributors as shown. 

We ask you to inform us of any such publications by emailing: researchdatasets@cambridge.org

Please report any issues or problems in downloading the dataset by emailing: researchdatasets@cambridge.org

Licence Agreement 

  1. By downloading this dataset and licence, this licence agreement (the “Agreement”) is entered into, effective this date, between you (the “Licensee”), and the Chancellor, Masters and Scholars of the University of Cambridge acting through its department Cambridge University Press & Assessment (the “Licensor”).
  2. Copyright of the entire licensed dataset is held by the Licensor. No ownership or interest in the dataset is transferred to the Licensee, nor shall the Licensee have any rights in the dataset other than the right to use the dataset in accordance with this Agreement.
  3. The Licensor hereby grants the Licensee a non-exclusive non-transferable right to use the licensed dataset for non-commercial research and educational purposes only. The Licensee shall not sub-license or assign the benefit or burden of this Agreement in whole or in part.
  4. Non-commercial purposes exclude without limitation any use of the licensed dataset or information derived from the dataset for or as part of a product or service which is sold, offered for sale, licensed, leased or rented.
  5. The Licensee shall expressly acknowledge and reference the Licensor when making use of the licensed dataset in all publications of research based on it, in whole or in part, through citation of the paper at the top of the dataset details page.
  6. The Licensee may publish excerpts of less than 100 words from the licensed dataset pursuant to clause 3.
  7. The Licensor grants the Licensee this right to use the licensed dataset "as is". Licensor does not make, and expressly disclaims, any express or implied warranties, representations or endorsements of any kind whatsoever. The Licensor has no liability for any loss or damage whatsoever sustained by Licensee as a result of the availability or use of or reliance on the dataset.
  8. The Licensor shall not be liable for any indirect or consequential loss or damage or for any loss of or corruption of data, loss of programs, profit or goodwill (whether direct or indirect) arising out of or in connection with the access, availability, use of or reliance on the dataset.
  9. The Licensee shall indemnify and hold the Licensor harmless against any loss or damage which it may suffer or incur as a result of the Licensee’s breach of any terms of this Agreement.
  10. This Agreement constitutes the entire agreement between the parties and supersedes any previous agreement between the parties relating to its subject-matter. Each party acknowledges and agrees that, in entering into this Agreement, it does not rely on, and shall have no remedy in respect of, any statement, representation, warranty or understanding (whether negligently or innocently made) other than as expressly set out in this Agreement.
  11. This Agreement shall be governed by and construed in accordance with the laws of England and the English courts shall have exclusive jurisdiction. 

You may download this dataset if you agree to the licence terms above and complete the following registration form.  Publications using this dataset must acknowledge and reference Cambridge University Press & Assessment as the source of the data.

Registration form

Name
Title