Katya or The Liberated Corpus 🙈

The need for linguistic analysis has grown steadily over the past couple of decades. As interest and the number of research questions increased, so did the demand for language data and for observations drawn from it. Unfortunately, one cannot spend one's time meticulously searching by hand for the words and constructions in question, as such an endeavor might take a lifetime or two to complete. Human time is better spent elsewhere, so we employ computers and automated systems to collect and analyze linguistic data en masse.

Let us define a corpus as a collection, large or small, of authentic text, such as literature, transcripts of real-life conversations, newspapers, etc. Corpus linguistics is then the branch of linguistics whose inquiries and results rely on such a corpus. This allows researchers to process amounts of data that could not have been examined before, simply due to limits on human time and resources. Katya is one such corpus, and it focuses on features that other existing and commercial corpora do not provide.

Katya is the liberated corpus, built to offer what existing corpora lack. Its main difference is that it lets users provide their own web link, which then serves as a source of data that can be queried later. Katya accomplishes this by web scraping the user's link; more on that in the next section. Katya is also capable of exporting complete search results. Another advantage is that Katya’s code and infrastructure are public, so anyone can learn how it works for educational purposes, and other developers can contribute to the platform.
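To make the idea of scraping a link, querying it, and exporting the results more concrete, here is a minimal sketch in Python. It is not Katya's actual code: the names `TextExtractor`, `extract_text`, `concordance`, and `export_csv` are hypothetical, the CSV layout is an assumption, and the page is a hard-coded sample rather than a downloaded one. It only illustrates the general flow of turning a page into text, finding a word with its context, and exporting every hit.

```python
import csv
import io
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def concordance(text, word, width=2):
    """Find every occurrence of `word` with up to `width` tokens of context."""
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def export_csv(hits):
    """Serialize all hits to CSV text (hypothetical export format)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["left", "match", "right"])
    writer.writerows(hits)
    return buf.getvalue()


# Stand-in for a page downloaded from a user-provided link.
SAMPLE_HTML = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>The cat sat on the mat.</p><script>var x = 1;</script>"
    "<p>A cat is a small animal.</p></body></html>"
)

text = extract_text(SAMPLE_HTML)
hits = concordance(text, "cat")
print(export_csv(hits))
```

In a real setting the HTML would come from the user's link (for example via `urllib.request.urlopen`), and the extracted text would be stored so that it can be queried later instead of being searched on the spot.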

-> Go to Katya

-> Go to Katya’s backend repository

-> Go to Katya’s website repository