... and OntoNotes (Hovy et al., 2006). Clearly, as one moves towards a more application- and domain-driven idea of ‘correct’ tokenization, a more transparent, flexible, and adaptable approach to ... keep track of character start and end positions as offsets between a string before and after each rule application (i.e. all pairs I, O), and these offsets are eventually traced back to the ... decades. Approximately, this means splitting off punctuation into separate tokens, disambiguating straight quotes, and separating contractions such as can’t into ca and n’t. There are, however, many...
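
The offset bookkeeping described above can be illustrated with a minimal sketch. It is not the implementation discussed in the text: the rule set `RULES` and the helpers `apply_rule` and `tokenize` are hypothetical names introduced here, the rules are only loosely PTB-like, and replacements are assumed to be literal strings (no backreferences). Each character of the intermediate string carries the span of the original input it derives from, so token offsets can be traced back after all rules have applied.

```python
import re

# Hypothetical, loosely PTB-like rules (an assumption, not any tool's rule set).
# Each rule is (compiled pattern, literal replacement string).
RULES = [
    (re.compile(r'(?<!\S)"'), '`` '),                  # opening straight quote
    (re.compile(r'"'), " ''"),                         # any remaining quote closes
    (re.compile(r'(?<=\S)(?=[,.!?](?:\s|$))'), ' '),   # split off punctuation
    (re.compile(r"(?<=[A-Za-z])(?=n['’]t\b)"), ' '),   # can't -> ca n't
]


def apply_rule(text, spans, pattern, repl):
    """Rewrite `text` with one rule and carry the per-character spans along.

    spans[i] is the (start, end) span in the ORIGINAL input that character i
    derives from, or None for characters with no original counterpart.
    """
    out_chars, out_spans, pos = [], [], 0
    for m in pattern.finditer(text):
        # Copy the unchanged stretch before the match, spans included.
        out_chars.append(text[pos:m.start()])
        out_spans.extend(spans[pos:m.start()])
        # Non-whitespace replacement characters inherit the span covered by
        # the matched characters; inserted whitespace maps to nothing.
        covered = [s for s in spans[m.start():m.end()] if s is not None]
        span = (covered[0][0], covered[-1][1]) if covered else None
        out_chars.append(repl)
        out_spans.extend(None if ch.isspace() else span for ch in repl)
        pos = m.end()
    out_chars.append(text[pos:])
    out_spans.extend(spans[pos:])
    return ''.join(out_chars), out_spans


def tokenize(text):
    """Return (token, start, end) triples with offsets into the original text."""
    spans = [(i, i + 1) for i in range(len(text))]
    current = text
    for pattern, repl in RULES:
        current, spans = apply_rule(current, spans, pattern, repl)
    tokens = []
    for m in re.finditer(r'\S+', current):
        covered = [s for s in spans[m.start():m.end()] if s is not None]
        if covered:  # skip tokens that have no original counterpart
            tokens.append((m.group(), covered[0][0], covered[-1][1]))
    return tokens
```

Under these assumptions, `tokenize('"I can\'t," she said.')` yields, among others, the triples `('ca', 3, 5)` and `("n't", 5, 8)`, and the normalized opening quote ``` `` ``` still points back to the single original character at offsets 0 to 1. Storing a span per character is the simplest way to compose offsets across rules; a production system would more likely keep a compact alignment per rule application.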