esTenTen – Spanish corpus from the web
The Spanish Web corpus (esTenTen) is a text corpus built from texts collected from the internet. It belongs to the TenTen corpus family, a set of web corpora that are all processed in the same way and have a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 40 languages.
The corpus contains subcorpora for two language varieties, European Spanish and American Spanish. Texts for each variety were downloaded from web domains on the respective continent.
Detailed information about TenTen corpora is available on the separate page Common TenTen corpora attributes.
Part-of-speech tagset
Part-of-speech tagging and lemmatisation were performed using the FreeLing analyser with its Spanish configuration; see the Spanish FreeLing tagset.
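As a rough illustration of how such tags can be read, the sketch below maps the first character of a tag to a coarse word class. It assumes the EAGLES-style convention that FreeLing uses for Spanish, in which the initial letter encodes the category (for example, `NCMS000` for a common masculine singular noun). The table and the function name `coarse_category` are illustrative only, and the Spanish FreeLing tagset page remains the authoritative reference for the tags actually used in the corpus.

```python
# Coarse word classes keyed by the first character of a FreeLing-style tag.
# Assumption: EAGLES-style Spanish tags; check the tagset page for the exact
# inventory used in esTenTen.
CATEGORY_BY_INITIAL = {
    "A": "adjective",
    "C": "conjunction",
    "D": "determiner",
    "F": "punctuation",
    "I": "interjection",
    "N": "noun",
    "P": "pronoun",
    "R": "adverb",
    "S": "adposition",
    "V": "verb",
    "W": "date",
    "Z": "number",
}


def coarse_category(tag: str) -> str:
    """Return the coarse word class for a tag, or "unknown" if unrecognised."""
    if not tag:
        return "unknown"
    return CATEGORY_BY_INITIAL.get(tag[0].upper(), "unknown")


for example in ("NCMS000", "VMIP3S0", "SP"):
    print(example, "->", coarse_category(example))
```

Only the first position is decoded here; the remaining positions carry attributes such as gender, number, tense, and person, which vary by category.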
Overview of Spanish TenTen corpora
A list of Spanish TenTen corpora available in the Sketch Engine database:
- Spanish Web corpus 2023 (esTenTen23) – 28.6 billion words (European Spanish Web, American Spanish Web, whole Spanish Wikipedia) with topic classification for the biggest web domains based on a semi-manual check of sample texts
- Spanish Web corpus 2018 (esTenTen18) – 16.9 billion words (European Spanish Web, American Spanish Web, whole Spanish Wikipedia) with topic classification for the biggest web domains based on a semi-manual check of sample texts
- Spanish Web corpus 2011 (esTenTen11) – 9.5 billion words (European Spanish Web, American Spanish Web, small part of Spanish Wikipedia)
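To make the growth between releases easier to compare, the sizes listed above can be put into a small table and expressed relative to the 2011 release. This is only a sketch: the figures are copied from the list, while the names `CORPUS_SIZES` and `growth_factors` are hypothetical helpers and not part of any Sketch Engine API.

```python
# Corpus sizes in billions of words, copied from the list above.
CORPUS_SIZES = {
    "esTenTen11": 9.5,
    "esTenTen18": 16.9,
    "esTenTen23": 28.6,
}


def growth_factors(sizes: dict, baseline: str) -> dict:
    """Express every corpus size as a multiple of the baseline corpus size."""
    base = sizes[baseline]
    return {name: round(size / base, 2) for name, size in sizes.items()}


factors = growth_factors(CORPUS_SIZES, "esTenTen11")
for name, factor in factors.items():
    print(f"{name}: {CORPUS_SIZES[name]} billion words, {factor}x esTenTen11")
```

By this calculation the 2023 release is roughly three times the size of the 2011 release.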