A New Challenge in the Data Processing of Non-Standard Texts Containing Accents / Diacritics: A Case Study
The INTELLIT project is developing a virtual online museum of Romanian literature. The data sources provided by the Romanian Academy, such as the General Dictionary of Romanian Literature, the Timeline of the Romanian Literary Life, and the canonical works of Romanian writers, are digitized and indexed using smart text analytics. One challenge in this process is handling diacritics and textual accents. Here, we present an in-depth analysis of possible solutions and describe our implementation for detecting and processing various forms of Unicode text. The solution we identified is an accessible way to remove specific Unicode code points, which greatly improves our search and filtering capabilities while preserving the original source text at the database level.
data processing, Unicode, text encoding, canonical normalization
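The abstract does not spell out how code points are removed, so the following is only a minimal sketch of one common approach consistent with the "canonical normalization" keyword: decompose the text (NFD), drop the combining marks, and keep the original string alongside a derived search key. The function names `strip_diacritics` and `make_record` and the dictionary layout are hypothetical illustrations, not the project's actual implementation.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining marks (accents, diacritics) removed.

    Assumption: canonical decomposition (NFD) followed by dropping
    combining characters; the paper's exact rules may differ.
    """
    # NFD splits precomposed letters such as "ă" into "a" + U+0306.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose anything that remains, for a stable stored form.
    return unicodedata.normalize("NFC", stripped)


def make_record(text: str) -> dict:
    """Hypothetical record: the original is preserved, the key is searchable."""
    return {
        "original": text,
        "search_key": strip_diacritics(text).casefold(),
    }


record = make_record("Ștefan cel Mare și Sfânt")
print(record["original"])
print(record["search_key"])
```

Storing the stripped, case-folded key next to the untouched original lets a query typed without diacritics match the indexed entry, while the displayed text stays faithful to the source. The same function also handles the legacy cedilla forms (ş, ţ), which decompose to a base letter plus a combining cedilla.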