In NLP, eye tracking is the process of measuring readers' eye movements while they read text. Several important eye-tracking corpora are listed below.
GECO is an English-Dutch bilingual corpus with eye-tracking data from 17 participants, collected while they read the complete novel “The Mysterious Affair at Styles.” The corpus has 4,934 sentences, 774,015 tokens, and 9,876 words.
Uschi Cop, Nicolas Dirix, Denis Drieghe, and Wouter Duyck. 2016. Presenting GECO: An eyetracking corpus of monolingual and bilingual sentence reading. Behavior Research Methods, pages 1–14.
The Mishra dataset contains 994 text snippets with 383 positive and 611 negative examples from newspaper clippings, sampled from seven native speakers.
Abhijit Mishra, Diptesh Kanojia, and Pushpak Bhattacharyya. 2016a. Predicting readers’ sarcasm understandability by modeling gaze behavior. In AAAI, pages 3747–3753.
The Dundee Corpus (Kennedy et al., 2003) is an open eye-tracking corpus with tokenization and measures similar to the Dundee Treebank. The corpus contains eye-tracking recordings of ten native English-speaking subjects reading 20 newspaper articles from The Independent.
The English corpus contains 51,502 tokens and 9,776 types in 2,368 sentences.
A. Kennedy. 2003. The Dundee Corpus [CD-ROM]. Psychology Department, University of Dundee.
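The token and type figures quoted for the Dundee Corpus distinguish running words (tokens) from distinct word forms (types). A minimal sketch of that distinction follows. The regular-expression tokenizer and the `token_type_counts` helper are assumptions made for illustration; they are not the tokenization used to build the corpus.

```python
import re


def token_type_counts(text):
    """Return (number of tokens, number of types) for a text.

    Assumption: a token is a lowercase run of letters, digits, or
    apostrophes. The real corpus tokenization may differ.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return len(tokens), len(set(tokens))


# Hypothetical sample text, not taken from the corpus
sample = "The cat sat on the mat. The mat was red."
n_tokens, n_types = token_type_counts(sample)
print(n_tokens, n_types)  # 10 7
```

In the sample, "the" and "mat" recur, so the type count is lower than the token count.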
The Provo Corpus is a corpus of eye-tracking data with accompanying predictability norms. These predictability norms differ from those of other corpora. In addition to traditional cloze scores that estimate the predictability of the full orthographic form of each word, the Provo Corpus also includes measures of the predictability of morpho-syntactic and semantic information for each word. This makes the Provo Corpus well suited to studying predictive processes in reading.
Luke, S.G. & Christianson, K. (2018). The Provo Corpus: A Large Eye-Tracking Corpus with Predictability Ratings. Behavior Research Methods, 50, 826-833. https://doi.org/10.3758/s13428-017-0908-4
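A traditional cloze score, as mentioned for the Provo Corpus, is commonly computed as the proportion of participants who produce the target word when asked to guess the next word of a text. The sketch below illustrates that calculation only. The function names, the case-insensitive matching, and the sample responses are assumptions for illustration and do not reproduce the corpus authors' procedure.

```python
from collections import Counter


def cloze_scores(responses):
    """Map each distinct response to its share of all non-empty responses.

    Assumption: responses are compared case-insensitively after trimming
    whitespace; blank responses are ignored.
    """
    normalized = [r.strip().lower() for r in responses if r.strip()]
    counts = Counter(normalized)
    total = len(normalized)
    return {word: n / total for word, n in counts.items()}


def cloze_probability(responses, target):
    """Proportion of responses that match the target word (0.0 if none)."""
    return cloze_scores(responses).get(target.strip().lower(), 0.0)


# Hypothetical guesses from five participants for the same word position
responses = ["dog", "Dog", "cat", "dog", "puppy"]
print(cloze_probability(responses, "dog"))   # 0.6
print(cloze_probability(responses, "bird"))  # 0.0
```

Three of the five hypothetical participants guessed "dog", which gives a cloze score of 0.6 for that word.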
Other eye-tracking and eye-movement databases:
1. Potsdam Sentence Corpus
The Potsdam Sentence Corpus (Kliegl et al., 2004; Kliegl et al., 2006) is a collection of 144 German sentences, with predictability estimates (cloze scores) available for all but the first word in each sentence.
Kliegl, R., Grabner, E., Rolfs, M., & Engbert, R. (2004). Length, frequency, and predictability effects of words on eye movements in reading. European Journal of Cognitive Psychology, 16(1-2), 262–284.
Kliegl, R., Nuthmann, A., & Engbert, R. (2006). Tracking the mind during reading: The influence of past, present, and future words on fixation durations. Journal of Experimental Psychology: General, 135(1), 12–35.
2. Frank, Fernandez Monsalve, Thompson, and Vigliocco (2013) gathered eye movements from 43 English monolingual subjects reading 205 sentences.
Frank, S. L., Fernandez Monsalve, I., Thompson, R. L., & Vigliocco, G. (2013). Reading time data for evaluating broad-coverage models of English sentence processing. Behavior Research Methods, 45, 1182–1190. doi:10.3758/s13428-012-0313-y
3. Dutch DEMONIC database
The Dutch DEMONIC database (Kuperman, Dambacher, Nuthmann, & Kliegl, 2010) contains eye movements from 55 subjects reading 224 constructed Dutch sentences.
Kuperman, V., Dambacher, M., Nuthmann, A., & Kliegl, R. (2010). The effect of word position on eye-movements in sentence and paragraph reading. Quarterly Journal of Experimental Psychology, 63, 1838–1857. doi:10.1080/17470211003602412