Text Mining

Key Terms & Definitions

API (Application Programming Interface): An interface that allows applications to talk to one another and can be used to facilitate downloading large amounts of data from a website. Some of the resources on this guide provide an API to access data.

API Wrapper: A library or tool that exposes an API through a particular programming language or interface, streamlining the process of making API calls.

Association: Associations measure how often a word co-occurs with other words. The more often words occur close to each other relative to their general frequency, the higher their association will be (see Collocation).

Classification: Objects are assigned to pre-defined classes based on similarity, so that similar objects end up in the same class. Similarity is defined by example: the algorithm is given objects that have already been assigned to a class and must learn a function that reflects the class definitions implied by these learning examples.

Clustering: Objects are grouped based on similarity. Each cluster contains objects that are more similar to each other than to objects in other clusters.

Collocation: Series of words or terms that co-occur more often than would be expected by chance.
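To make "more often than would be expected by chance" concrete, the sketch below scores adjacent word pairs with pointwise mutual information (PMI), one common association measure among several. The function name bigram_pmi and the sample sentence are illustrative assumptions, not part of any resource on this guide.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return PMI (log base 2) for each adjacent word pair in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    pmi = {}
    for (a, b), count in bigrams.items():
        p_pair = count / n_bi          # observed probability of the pair
        p_a = unigrams[a] / n_uni      # general frequency of each word
        p_b = unigrams[b] / n_uni
        # Ratio above 1 (PMI above 0) means the pair occurs more than chance predicts.
        pmi[(a, b)] = math.log2(p_pair / (p_a * p_b))
    return pmi

# Hypothetical sample text, already lowercased and free of punctuation.
tokens = "new york is big and new york is busy and the city is new".split()
pmi_scores = bigram_pmi(tokens)
print(pmi_scores[("new", "york")] > pmi_scores[("is", "new")])
```

On very small samples PMI tends to overrate rare pairs, so real collocation studies usually apply a minimum frequency threshold before ranking.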

Corpus: A collection of written texts, particularly the entire body of work on a subject or by a specific creator; a collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures, frequencies, etc.

Concepts: Meaning is defined beyond a word. A concept is a semantic entity which can be expressed by several words or by a group of words.

Information Retrieval: Information retrieval is concerned with the representation of knowledge and the subsequent search for relevant information within these knowledge sources. Information retrieval provides the technology behind search engines.

n-gram: In linguistics, a sequence of n items from a given sequence of text or speech. N-grams can be any combination of letters, phonemes, syllables, or words.
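As a minimal sketch of the definition above, the function below slides a window of size n over any sequence. The name ngrams and the sample inputs are assumptions chosen for illustration; the same function works for words (a list of tokens) and for letters (a string).

```python
def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Word bigrams (n = 2) from a tokenized phrase.
words = "to be or not to be".split()
word_bigrams = ngrams(words, 2)

# Letter trigrams (n = 3) from a single word.
char_trigrams = ngrams("text", 3)

print(word_bigrams)
print(char_trigrams)
```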

Metadata: Data describing other data. Metadata provide information about one or more aspects of data, such as type, date, creator, location, and so on. Most often encountered in library and archival contexts, metadata facilitate the organization, discovery, and use of a wide range of resources.

Named-entity recognition (NER) (entity extraction): Seeks to locate and classify named entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

OCR (optical character recognition): Use of computer technologies to convert scanned images of typewritten, printed, or handwritten text into machine-readable text. This conversion allows for the computerization of material texts into formats for digital storage, search, and display. Adobe Acrobat Professional supports OCR processes, as does Microsoft Office for Windows. OCR accuracy depends on the font and style of the original document.

Representational State Transfer (REST): REST is a software architectural style that defines a set of constraints to be used for creating Web services. Web services that conform to the REST architectural style, termed RESTful Web services (RWS), provide interoperability between computer systems on the internet.

Sentiment Analysis / Opinion Mining: Opinion mining or Sentiment Analysis means finding and classifying opinionated parts of texts. These subjective parts need to be identified by Text Mining methods and separated from objective text parts. A technique typically applied is the search for words which express opinion.
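The "search for words which express opinion" can be sketched as a simple lexicon lookup. The two word lists and the function name lexicon_sentiment below are tiny hypothetical examples; real projects rely on much larger published lexicons or trained models, and this sketch ignores negation and context.

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def lexicon_sentiment(text):
    """Label text by counting opinion words from the two lexicons."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # no opinion words found, or they cancel out

print(lexicon_sentiment("I love this great book!"))
print(lexicon_sentiment("The scan quality is terrible."))
print(lexicon_sentiment("The book has 300 pages."))
```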

Stemming: Stemming refers to the mapping of word forms to stems or basic word forms. Word forms may differ from stems due to morphological changes necessary for grammatical reasons. The plural of English nouns, for example, is mostly constructed by adding an s to the basic noun.
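The plural example can be sketched with a few suffix rules. This is a toy rule set written for illustration (the name naive_plural_stem is an assumption), not an established algorithm such as the Porter stemmer, and it will mis-stem many words, for instance "houses".

```python
def naive_plural_stem(word):
    """Strip common English plural endings; a toy rule set, not a full stemmer."""
    w = word.lower()
    if len(w) > 4 and w.endswith("ies"):
        return w[:-3] + "y"            # libraries -> library
    if w.endswith(("ses", "xes", "zes", "ches", "shes")):
        return w[:-2]                  # boxes -> box
    if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]                  # texts -> text
    return w                           # class, corpus stay unchanged

stems = [naive_plural_stem(w) for w in ["texts", "libraries", "boxes", "class", "corpus"]]
print(stems)
```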

TF-IDF (Term Frequency-Inverse Document Frequency): Numerical statistic that is intended to reflect how important a word is to a document in a corpus. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.
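The sketch below computes one common variant of this statistic: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word. Other weightings and smoothing schemes exist; the function name tf_idf and the three sample documents are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Score each word in each tokenized document as count * log(N / df)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))      # count each word once per document
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append({w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()})
    return scores

# Hypothetical three-document corpus, already tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
tfidf_scores = tf_idf(corpus)

# "the" appears in every document, so its score is zero despite being frequent;
# "mat" appears in only one document, so it scores highest in the first document.
print(tfidf_scores[0]["the"], tfidf_scores[0]["cat"], tfidf_scores[0]["mat"])
```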

Mandl, T. (2014). Text mining. In M. Khosrow-Pour, Encyclopedia of information science and technology (3rd ed.). Hershey, PA: IGI Global. Retrieved from 

Folger Shakespeare Library (2017). Glossary of Digital Humanities Terms. Retrieved from