FrSemCor, a corpus of semantically annotated French nouns
The project aims to compensate for the lack of semantically annotated data that can be used in both NLP and linguistics research. We provide a gold standard resource for French nouns, based on a careful semantic annotation design.
We designed an annotation guide (in French) during a preliminary annotation phase. Then 73% of the noun tokens from the Sequoia corpus were double-annotated by three native French-speaking students. Their annotations were adjudicated, and all noun tokens were validated by an expert. The remaining 27% of the tokens were annotated by an expert from the team.
The resulting corpus contains 12,917 noun tokens, annotated with 24 supersenses (plus complex supersenses). The FrSemCor annotations are released as the 12th column of the Sequoia corpus, whose morpho-syntactic annotations are available in two syntactic annotation schemes (FTBdep and Universal Dependencies); see the Sequoia corpus and the information on the FrSemCor annotation format.
Our annotation is compatible with the annotation of named entities and (nominal) multiword expressions performed in the Parseme-fr project (available as the 11th column).
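As an illustration of the column layout described above, here is a minimal Python sketch that extracts the supersense annotation from the 12th column. It rests on several assumptions that are not stated on this page: that the file is tab-separated with one token per line, that comment lines start with "#", that the word form is in the 2nd column, and that tokens without a supersense carry a placeholder such as "_" or "*". The function name read_supersenses, the sample lines and the label LABEL_A are hypothetical; please check the FrSemCor annotation format documentation before relying on any of this.

```python
from collections import Counter


def read_supersenses(lines, column=12, empty_values=("_", "*", "")):
    """Yield (form, label) pairs for tokens whose `column` is filled.

    Assumptions (not confirmed by the FrSemCor page): tab-separated
    fields, '#' comment lines, word form in column 2, and a placeholder
    value for tokens that have no supersense annotation.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip lines that do not have enough columns.
        if len(fields) < column:
            continue
        label = fields[column - 1]  # columns are 1-indexed in the text
        if label in empty_values:
            continue
        yield fields[1], label


# Hypothetical sample: three tokens, only the noun carries a label.
sample = [
    "# text = hypothetical sentence",
    "\t".join(["1", "le", "le", "D", "DET", "_", "2", "det", "_", "_", "_", "_"]),
    "\t".join(["2", "chat", "chat", "N", "NC", "_", "3", "suj", "_", "_", "_", "LABEL_A"]),
    "\t".join(["3", "dort", "dormir", "V", "V", "_", "0", "root", "_", "_", "_", "_"]),
    "",
]

pairs = list(read_supersenses(sample))
counts = Counter(label for _, label in pairs)
```

With a real file, the same function could be called on an open file handle, for example read_supersenses(open(path, encoding="utf-8")), where path points to a local copy of the corpus. Passing column=11 would read the Parseme-fr column instead, under the same assumptions.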
The corpus is freely available, under the LGPL-LR licence.
If you use the corpus or need more information, please refer to:
Lucie Barque, Pauline Haas, Richard Huyghe, Delphine Tribout, Marie Candito, Benoît Crabbé and Vincent Segonne (2020). Annotating a French Corpus with Supersenses. Proceedings of LREC-2020, Marseille, May 13-15, 2020. Preprint version.