Text Preparation Through Extended Tokenization
Author(s)
M. Hassler & G. Fliedl
Abstract
Tokenization is commonly understood as the first step of any kind of natural
language text preparation. The major goal of this early (pre-linguistic) task is to
convert a stream of characters into a stream of processing units called tokens.
Beyond the text mining community this job is taken for granted. Commonly
it is seen as an already solved problem comprising the identification of word
borders and punctuation marks separated by spaces and line breaks. In our
view, however, it should manage language-related word dependencies, incorporate
domain-specific knowledge, and handle morphosyntactically relevant linguistic
specificities. Therefore, we propose rule-based extended tokenization that draws on
all sorts of linguistic knowledge (e.g., grammar rules, dictionaries). The core
features of our implementation are identification and disambiguation of all kinds of
linguistic markers, detection and expansion of abbreviations, treatment of special
formats, and typing of tokens including single- and multi-tokens. To improve the
quality of text mining we suggest linguistically-based tokenization as a necessary
step preceding further text processing tasks.
In this paper, we focus on the task of improving the quality of standard tagging.
Keywords: text preparation, natural language processing, tokenization, tagging
improvement, tokenization prototype.
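The features listed in the abstract (abbreviation detection and expansion, treatment of special formats, and typing of single- and multi-tokens) can be illustrated with a minimal sketch. It is not the authors' prototype: the dictionaries, the date pattern, the type labels, and all function names below are hypothetical and chosen only for illustration.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    kind: str            # hypothetical type label, e.g. "word", "date"
    expansion: str = ""  # filled only for abbreviations


# Hypothetical linguistic resources; a real system would load large lexica.
ABBREVIATIONS = {"e.g.": "for example", "Dr.": "Doctor", "etc.": "et cetera"}
MULTI_TOKENS = {("New", "York"), ("text", "mining")}

# Hypothetical special-format rules.
DATE = re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}")
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
PUNCT = ".,;:!?()\"'"


def _classify(chunk: str) -> List[Token]:
    """Split one whitespace-delimited chunk into typed tokens."""
    leading: List[Token] = []
    trailing: List[Token] = []
    # Peel off leading punctuation such as an opening parenthesis.
    while chunk and chunk[0] in PUNCT:
        leading.append(Token(chunk[0], "punctuation"))
        chunk = chunk[1:]
    # Peel off trailing punctuation unless it belongs to a known abbreviation.
    while chunk and chunk[-1] in PUNCT and chunk not in ABBREVIATIONS:
        trailing.insert(0, Token(chunk[-1], "punctuation"))
        chunk = chunk[:-1]
    core: List[Token] = []
    if chunk in ABBREVIATIONS:
        core.append(Token(chunk, "abbreviation", ABBREVIATIONS[chunk]))
    elif DATE.fullmatch(chunk):
        core.append(Token(chunk, "date"))
    elif NUMBER.fullmatch(chunk):
        core.append(Token(chunk, "number"))
    elif chunk:
        core.append(Token(chunk, "word"))
    return leading + core + trailing


def _merge_multi_tokens(tokens: List[Token]) -> List[Token]:
    """Join adjacent tokens that form an entry of the multi-token lexicon."""
    merged: List[Token] = []
    i = 0
    while i < len(tokens):
        pair = (tokens[i].text, tokens[i + 1].text) if i + 1 < len(tokens) else None
        if pair in MULTI_TOKENS:
            merged.append(Token(" ".join(pair), "multi-token"))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def extended_tokenize(text: str) -> List[Token]:
    """Whitespace split, then rule-based typing and multi-token merging."""
    tokens: List[Token] = []
    for chunk in text.split():
        tokens.extend(_classify(chunk))
    return _merge_multi_tokens(tokens)


if __name__ == "__main__":
    for token in extended_tokenize("Dr. Smith moved to New York on 12.03.2008, e.g. for work."):
        print(token)
```

A plain whitespace split would yield "12.03.2008," and "work." as single units and would separate "New" from "York". The sketch instead keeps the periods of "Dr." and "e.g." attached while detaching sentence punctuation, which is the kind of disambiguation a downstream tagger could benefit from.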
1 Introduction
Nearly all researchers concerned with text mining presuppose tokenization as the first
step of text preparation [1–5]. Good surveys of tokenization techniques are
provided by Frakes and Baeza-Yates [6], Baeza-Yates and Ribeiro-Neto [7],
and Manning and Schütze [8, pp. 124–136]. But, as far as we know, only very
few treat tokenization as a task of multi-language text processing with far-reaching
impact [9]. This involves language-related knowledge about linguistically