Package conllutils
Sub-modules
conllutils.io
conllutils.pipeline
Functions
def create_index(sentences, fields=None, min_frequency=1, missing_index=None)
-
Return an index mapping the string values of the sentences to integer indexes.
An index is a nested dictionary where the indexes for the field values are stored as index[field][value]. See the Sentence.to_instance() method for usage of the index dictionary for sentence indexing.
For each field, the indexes are assigned to the string values starting from 1 in descending order of their frequency of occurrence in the sentences, i.e. the most frequent value has index 1, the second most frequent has index 2, etc. Index 0 represents an unknown value, and the dictionary returns 0 for all unmapped values.
To map instances back to sentences, use the create_inverse_index() function to create an inverse mapping from the indexes to the string values.
Args
sentences
:iterable
- The indexed sentences.
fields
:set
- The set of indexed fields included in the index. By default, all string-valued fields are indexed except ID and HEAD.
min_frequency
:int or dictionary
- If specified, the field values with a frequency lower than min_frequency are discarded from the index. By default, all values are preserved. The min_frequency can be specified as an integer applied to all fields, or as a dictionary setting the frequency threshold for a specific field.
missing_index
:int or dictionary
- The integer index representing missing values (i.e. when a token does not have a value for the indexed field). By default, the missing index is not mapped in the index dictionary, and all missing values are indexed as -1. If specified, the mapping index[field][None] = missing_index is added into the index dictionary. The missing_index can be specified as an integer applied to all fields, or as a dictionary setting the missing index for a specific field.
Raises
ValueError
- If a non-string value is indexed for some of the fields.
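The frequency-based indexing scheme described above can be sketched in plain Python. This is only an illustrative re-implementation, not the library code: the names build_index and invert_index are hypothetical, and the sketch omits the min_frequency and missing_index handling.

```python
from collections import Counter

def build_index(sentences, fields):
    """Sketch: assign 1, 2, ... to values by descending frequency."""
    counts = {f: Counter() for f in fields}
    for sentence in sentences:
        for token in sentence:
            for f in fields:
                value = token.get(f)
                if value is not None:
                    counts[f][value] += 1
    # Index 0 is reserved for unknown values, so lookups can use
    # index[field].get(value, 0) to model the behaviour described above.
    return {
        f: {value: i for i, (value, _) in enumerate(c.most_common(), start=1)}
        for f, c in counts.items()
    }

def invert_index(index):
    """Sketch of the inverse mapping: inverse[field][i] = value."""
    return {f: {i: v for v, i in mapping.items()} for f, mapping in index.items()}

# Toy sentences: each token is a dict of field -> string value.
sentences = [
    [{"form": "the"}, {"form": "cat"}],
    [{"form": "the"}, {"form": "dog"}, {"form": "the"}],
]
index = build_index(sentences, {"form"})
assert index["form"]["the"] == 1            # most frequent value gets index 1
assert index["form"].get("unseen", 0) == 0  # unknown values map to 0
assert invert_index(index)["form"][1] == "the"
```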
def create_inverse_index(index)
-
Return an inverse index mapping the integer indexes back to the string values.
For an index with the mapping index[field][v] = i, the inverse index has the mapping inverse_index[field][i] = v. See the Instance.to_sentence() method for usage of the inverse index for transforming instances back to sentences.
def empty_id(word_id, index=1)
-
Return a new ID value for an empty token, indexed by word_id starting from 0 and index starting from 1.
The empty token ID is encoded as a tuple with id[0] = word_id and id[1] = index. For more information about the ordering of empty tokens in the sentence, see the Sentence class.
Raises
ValueError
- If word_id < 0 or index < 1.
def multiword_id(start, end)
-
Return a new ID value for a multiword token spanning the words with IDs from start to end (inclusive) in the sentence.
The multiword token ID is encoded as a tuple with id[0] = start and id[1] = end. For more information about the ordering of multiword tokens in the sentence, see the Sentence class.
Raises
ValueError
- If start < 1 or end <= start.
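The tuple encoding of the special IDs can be sketched as follows (an illustrative re-implementation of the two functions above, including the documented validation):

```python
def empty_id(word_id, index=1):
    # Empty token ID: tuple (word_id, index), written as "word_id.index".
    if word_id < 0 or index < 1:
        raise ValueError("required: word_id >= 0 and index >= 1")
    return (word_id, index)

def multiword_id(start, end):
    # Multiword token ID: tuple (start, end) covering words start..end inclusive.
    if start < 1 or end <= start:
        raise ValueError("required: start >= 1 and end > start")
    return (start, end)

assert empty_id(5) == (5, 1)         # CoNLL-U notation "5.1"
assert multiword_id(3, 4) == (3, 4)  # CoNLL-U notation "3-4"
```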
def pipe(source=None, *args)
-
Build a data processing pipeline.
A pipeline specifies a chain of operations performed over the processed data. The operations can be divided into three types:
- data sources,
- filters and transformations,
- and actions.
Data sources generate the processed data, e.g. read the data from a CoNLL-U file. Filters and transformations filter the data for subsequent processing, transform data values, or map one data type to another (e.g. index sentences to instances or extract the texts of the sentences). Actions invoke the whole pipeline chain and perform the final operation on the processed data (e.g. collect the processed data in a Python list or write the data to a CoNLL-U file).
A pipeline can optionally specify at most one data source, and if specified, the data source has to be configured as the first operation of the pipeline. Alternatively, the data source can be provided as an iterable object in the source argument.
Pipeline objects are iterable and callable. The iterator invokes the configured data source and processes the data with all filters and transformations of the pipeline. Calling p(data) applies the filters and transformations of p to the provided data and returns an iterator over the processed data. The data argument can be any iterable object (including another pipeline).
Pipelines can be arbitrarily chained using the Pipeline.pipe() method, i.e. the data can be loaded and partially processed by some operations of the first pipeline, then processed by the second pipeline, and finally processed by the remaining operations of the first one.
The operations can be further divided according to the processed data type into operations for sentences, tokens, field values, instances, etc. For an overview and more information about the operations, see the description of the Pipeline class.
Args
source
:iterable
- The configured data source of the pipeline.
*args
:pipelines
- The list of pipelines chained after the data source, i.e. pipe(data, p1, p2, ..., pn) is equivalent to pipe(data).pipe(p1, p2, ..., pn). See the Pipeline.pipe() method for more information.
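The iterable-and-callable behaviour described above can be mimicked with plain generators. This is only a sketch of the chaining idea, not the Pipeline API itself; the Pipe class and its map/filter methods here are hypothetical stand-ins.

```python
class Pipe:
    """Minimal iterable-and-callable pipeline sketch."""
    def __init__(self, source=None, ops=()):
        self.source = source
        self.ops = list(ops)

    def map(self, fn):
        # Transformation: apply fn to every item lazily.
        return Pipe(self.source, self.ops + [lambda it: (fn(x) for x in it)])

    def filter(self, pred):
        # Filter: keep only items satisfying pred.
        return Pipe(self.source, self.ops + [lambda it: (x for x in it if pred(x))])

    def __call__(self, data):
        # Apply only the filters/transformations to externally provided data.
        for op in self.ops:
            data = op(data)
        return data

    def __iter__(self):
        # Invoke the configured source and run the whole chain.
        return iter(self(self.source))

p = Pipe([3, 1, 4, 1, 5]).filter(lambda x: x > 1).map(lambda x: x * 10)
assert list(p) == [30, 40, 50]      # iterating invokes the configured source
assert list(p([2, 0, 7])) == [20, 70]  # calling applies the chain to other data
```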
def read_conllu(file, underscore_form=True, parse_comments=True, parse_feats=False, parse_deps=False)
-
Read a CoNLL-U file and return an iterator over the parsed sentences.
The file argument can be a path-like or file-like object.
To parse the values of the FEATS or DEPS fields into dictionaries or sets of tuples, set the parse_feats or parse_deps argument to True. By default, the features and dependencies are not parsed and the values are stored as strings.
If underscore_form is True (default) and the LEMMA field is an underscore, the underscore character in the FORM field is parsed as the FORM value. Otherwise, it indicates an unspecified FORM value.
By default, comments are parsed into the metadata dictionary. To skip comment parsing, set the parse_comments argument to False.
def write_conllu(file, data, write_comments=True)
-
Write the sentences to a CoNLL-U file.
The file argument can be a path-like or file-like object. The written data is an iterable object of sentences, or a single sentence. If the write_comments argument is True (default), sentence metadata are encoded as comments and written to the file.
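To illustrate what parse_feats=True produces, here is a rough sketch of parsing a FEATS string into a dictionary, with multiple values stored in a set as described for the Token class. This is illustrative only; the library's parser also handles edge cases not shown here.

```python
def parse_feats(feats):
    # In CoNLL-U, "_" denotes an empty FEATS field.
    if feats == "_":
        return {}
    parsed = {}
    for pair in feats.split("|"):
        attr, value = pair.split("=", 1)
        values = value.split(",")
        # Multiple values (e.g. "PronType=Int,Rel") are stored in a set.
        parsed[attr] = values[0] if len(values) == 1 else set(values)
    return parsed

assert parse_feats("Case=Nom|Number=Sing") == {"Case": "Nom", "Number": "Sing"}
assert parse_feats("PronType=Int,Rel") == {"PronType": {"Int", "Rel"}}
```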
Classes
class DependencyTree (sentence)
-
A dependency tree representation of the sentence.
A basic dependency tree is a labeled tree structure where each node of the tree corresponds to exactly one syntactic word in the sentence. The relation between a node and its parent (head) is labeled with the Universal Dependencies relation stored in the HEAD and DEPREL fields of the corresponding word.
The DependencyTree class should not be instantiated directly. Use the Sentence.to_tree() or Instance.to_tree() method to create a dependency representation of a sentence or an indexed instance. The implementation of nodes is provided by the Node class.
The dependency tree object is iterable and returns an iterator over all nodes in the order of the corresponding words in the sentence. len(tree) returns the number of nodes.
Note that the dependency tree is constructed only from the basic dependency relations. Enhanced dependency relations stored in the DEPS field are not included in the tree.
Methods
def inorder(self)
-
Return an iterator traversing in-order over all nodes.
def is_projective(self, return_arcs=False)
-
Return True if the dependency tree is projective, otherwise False.
A dependency tree is projective when all of its arcs are projective, i.e. for every arc (i, j) from parent i to child j and for every node k between i and j in the sentence, there must be a path from i to k.
If the return_arcs argument is True, the method returns the list of conflicting non-projective arcs. For projective trees the list is empty.
def leaves(self)
-
Return an iterator over all leaves of the tree in the sentence order.
def postorder(self)
-
Return an iterator traversing post-order over all nodes.
def preorder(self)
-
Return an iterator traversing pre-order over all nodes.
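The projectivity condition above can be checked directly on the HEAD values. The following is an illustrative sketch (not the library implementation) over a plain list of heads, where heads[i] is the HEAD of word i+1 and 0 marks the root:

```python
def is_projective(heads):
    """Sketch of the projectivity check; assumes heads encode a tree (no cycles)."""
    n = len(heads)

    def ancestors(k):
        # Yield k and all of its ancestors up to (not including) the root 0.
        while k != 0:
            yield k
            k = heads[k - 1]

    for dep in range(1, n + 1):
        head = heads[dep - 1]
        if head == 0:
            continue
        lo, hi = min(head, dep), max(head, dep)
        for k in range(lo + 1, hi):
            # Every word strictly between head and dep must be dominated
            # by head, otherwise the arc (head, dep) is crossed.
            if head not in ancestors(k):
                return False
    return True

assert is_projective([2, 0, 2])         # 1 <- 2 -> 3: projective
assert not is_projective([3, 4, 0, 3])  # arcs (3,1) and (4,2) cross
```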
class Instance (fields=(), metadata=None)
-
An indexed representation of the sentence in a compact numerical form.
An instance can be created from a sentence using the Sentence.to_instance() method. The sentence values are mapped to numerical indexes by the provided index mapping. The index for a set of sentences can be created with the create_index() function.
An instance is a dictionary type where each field is mapped to a NumPy array of integer values contiguously indexed for all tokens in the sentence, i.e. the field value of the i-th token is stored as instance[field][i]. The length of all mapped arrays is equal to the length of the sentence. The default numerical type of the arrays is np.int64.
The ID field is not stored in the instance. Note that this also means that the type of the tokens is not preserved. The FEATS and DEPS fields are indexed as unparsed strings, i.e. the features and dependencies are not indexed separately.
By default, unknown values (i.e. values not mapped in the provided index) are stored as 0. Missing values (i.e. when a token does not have a value for the indexed field) are stored as -1. For more information, see the create_index() function.
Attributes
metadata
:any
- Any optional data associated with the instance, by default copied from the sentence.
Ancestors
- builtins.dict
Instance variables
var length
-
int: The length of the instance (i.e. the number of tokens in the indexed sentence).
Methods
def copy(self)
-
Return a shallow copy of the instance.
def is_projective(self, return_arcs=False)
-
Return True if this instance can be represented as a projective dependency tree, otherwise False.
See the DependencyTree.is_projective() method for more information.
def to_sentence(self, inverse_index, fields=None)
-
Return a new sentence built from the instance, with the values re-indexed by the inverse_index.
The optional fields argument specifies a subset of the fields added into the sentence. By default, all instance fields are included. The ID values are always generated as a sequence of integers starting from 1, which corresponds to the sequence of lexical words without empty or multiword tokens.
This operation is the inverse of the indexing in the Sentence.to_instance() method.
Raises
KeyError
- If some of the instance values are not mapped in the inverse_index.
def to_tree(self)
-
Return a dependency tree representation of the instance.
See the DependencyTree class for more information. All tokens referenced in the tree are indexed views, as described for the Instance.token() method. Note that the implementation assumes proper ordering of the tokens and that the instance does not contain empty or multiword tokens.
Raises
ValueError
- If the instance contains tokens without the HEAD field (HEAD = -1), or when the instance does not have exactly one root with HEAD = 0.
def token(self, i)
-
Return a view of the i-th token of the instance.
The view is a mutable mapping object which maps fields to the scalar values stored in the instance at the i-th position, i.e. for the values of the i-th token view, the condition token[field] == instance[field][i] holds.
The view object supports all mapping methods and operations, except deleting a field or setting the value of a field that is not indexed in the instance.
def tokens(self)
-
Return an iterator over all tokens. The iterated values are token view objects.
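The dict-of-arrays layout and the write-through token views can be illustrated in plain Python. Plain lists stand in for the NumPy arrays, and the TokenView class is a hypothetical sketch, not the library's view implementation:

```python
# A toy "instance": each field maps to one integer value per token.
# The form/upos values would come from create_index(); HEAD is stored as-is.
instance = {
    "form": [1, 3, 2],
    "upos": [4, 4, 5],
    "head": [2, 0, 2],
}

class TokenView:
    """Mutable mapping-like view of the i-th token (sketch)."""
    def __init__(self, instance, i):
        self._instance, self._i = instance, i

    def __getitem__(self, field):
        return self._instance[field][self._i]

    def __setitem__(self, field, value):
        # Writing through the view mutates the underlying arrays;
        # fields not indexed in the instance cannot be added.
        if field not in self._instance:
            raise KeyError(field)
        self._instance[field][self._i] = value

token = TokenView(instance, 1)
assert token["form"] == 3 and token["head"] == 0
token["upos"] = 6
assert instance["upos"][1] == 6  # the view writes through to the instance
```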
class Node (index, token)
-
A node in the dependency tree corresponding to the syntactic word in the sentence.
A node object is iterable and returns an iterator over its direct children. len(node) returns the number of children, and node[i] returns the i-th child (or a sublist of children, if i is a slice of indices).
Instance variables
var deprel
-
str or int: Universal dependency relation to the HEAD, stored in token[DEPREL], or None if the token does not have the DEPREL field.
var is_leaf
-
bool: True, if the node is a leaf node (has no children).
var is_root
-
bool: True, if the node is the root of the tree (has no parent).
class Sentence (tokens=(), metadata=None)
-
A list type representing the sentence, i.e. the sequence of tokens.
For valid CoNLL-U sentences, tokens have to be ordered according to their IDs. The syntactic words form the sequence with ID=1, 2, 3, etc. Multiword tokens with the range ID 'start-end' are inserted before the first word in the range (i.e. before the word with ID=start). The ranges of all multiword tokens must be non-empty and non-overlapping. Empty tokens with the decimal IDs 'token_id.index' are inserted in the index order at the beginning of the sentence (if token_id=0), or immediately after the word with ID=token_id.
Note that the Sentence methods do not check the order of the tokens; it is up to the programmer to preserve the correct ordering.
The Sentence class provides the Sentence.words() method to extract only the sequence of syntactic words without empty or multiword tokens, and the Sentence.raw_tokens() method to extract the sequence of raw tokens (i.e. how the sentence is written orthographically, with the multiword tokens).
For example, for the Spanish sentence:
1-2 vámonos 1 vamos 2 nos 3-4 al 3 a 4 el 5 mar
the words method returns the sequence of expanded syntactic words 'vamos', 'nos', 'a', 'el', 'mar', and raw_tokens returns the sequence for the raw text 'vámonos', 'al', 'mar'.
For a sentence with empty tokens:
1 Sue 2 likes 3 coffee 4 and 5 Bill 5.1 likes 6 tea
both the words and raw_tokens methods return the sequence without the empty token: 'Sue', 'likes', 'coffee', 'and', 'Bill', 'tea'.
Attributes
metadata
:any
- Any optional data associated with the sentence. By default, for the CoNLL-U format, metadata are parsed from the comment lines as a dictionary of key = value pairs. If a comment string has no key-value format separated with =, it is stored as a key with a None value.
Create an empty sentence, or initialize a new sentence with the tokens from the provided sequence and optional metadata.
Ancestors
- builtins.list
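The selection rules behind words() and raw_tokens() can be sketched over the Spanish example above. Tokens are simplified to (id, form) pairs with tuple IDs for the multiword ranges; empty tokens are ignored in this sketch, and the function names are illustrative, not the library API:

```python
# (id, form) pairs for: 1-2 vámonos  1 vamos  2 nos  3-4 al  3 a  4 el  5 mar
tokens = [((1, 2), "vámonos"), (1, "vamos"), (2, "nos"),
          ((3, 4), "al"), (3, "a"), (4, "el"), (5, "mar")]

def words(tokens):
    # Syntactic words are the tokens with plain integer IDs.
    return [form for tid, form in tokens if isinstance(tid, int)]

def raw_tokens(tokens):
    # Multiword tokens, plus words not covered by any multiword range.
    covered, result = set(), []
    for tid, form in tokens:
        if isinstance(tid, tuple):
            start, end = tid
            covered.update(range(start, end + 1))
            result.append(form)
        elif tid not in covered:
            result.append(form)
    return result

assert words(tokens) == ["vamos", "nos", "a", "el", "mar"]
assert raw_tokens(tokens) == ["vámonos", "al", "mar"]
```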
Static methods
def from_conllu(s, multiple=False, **kwargs)
-
Parse a sentence (or a list of sentences) from a string in the CoNLL-U format.
If the multiple argument is True, the function returns the list of all sentences parsed from the string. Otherwise (default), it returns only the first sentence. This function supports the same additional keyword arguments as the read_conllu() function.
Raises
ValueError
- If there is an error parsing at least one sentence from the string.
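In the CoNLL-U format, sentences are separated by blank lines, so the multiple=True behaviour boils down to splitting the string on those boundaries before parsing each block. A minimal sketch of the splitting step (the helper name is illustrative):

```python
def split_sentences(s):
    # CoNLL-U sentences are separated by blank lines.
    return [b for b in s.strip().split("\n\n") if b.strip()]

data = """\
# text = Hi
1\tHi\t_\t_\t_\t_\t0\t_\t_\t_

1\tBye\t_\t_\t_\t_\t0\t_\t_\t_
"""
blocks = split_sentences(data)
assert len(blocks) == 2
assert blocks[0].startswith("# text = Hi")
```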
Methods
def copy(self)
-
Return a shallow copy of the sentence.
def get(self, id, default=None)
-
Return the token with the specified ID.
The id argument can be an integer starting from 1, a tuple generated by the empty_id() or multiword_id() functions, or a string in CoNLL-U notation (e.g. "1" for words, "2-3" for multiword tokens, or "0.1" for empty tokens). Note that the implementation assumes proper ordering of the tokens according to their IDs.
If a token with the id cannot be found, the method returns the provided default value, or None if default is not given.
def is_projective(self, return_arcs=False)
-
Return True if this sentence can be represented as a projective dependency tree, otherwise False.
See the DependencyTree.is_projective() method for more information.
def raw_tokens(self)
-
Return an iterator over all raw tokens representing the written text of the sentence.
The raw tokens are all multiword tokens and all words outside of the multiword ranges (excluding the empty tokens). Note that the implementation assumes the proper ordering of the tokens according to their IDs.
def text(self, default_form='_')
-
Return the text of the sentence reconstructed from the raw tokens.
The insertion of spaces is controlled by the SpaceAfter=No feature in the MISC field. Unspecified forms are replaced with the value of the default_form argument, which defaults to the underscore '_'.
Note that a space is also appended after the last word, unless the last token specifies SpaceAfter=No.
def to_conllu(self, write_comments=True)
-
Return a string representation of the sentence in the CoNLL-U format.
If the write_comments argument is True (default), the string also includes the comments generated from the metadata.
def to_instance(self, index, fields=None, dtype=numpy.int64)
-
Return an instance representation of the sentence, with the values indexed by the index.
The optional fields argument specifies a subset of the fields added into the instance. By default, the HEAD field and all fields from the index are included.
The numerical type of the instance data can be specified in the dtype argument. The default type is np.int64. See the Instance class for more information.
Raises
KeyError
- If some of the fields are not indexed in the index.
def to_tree(self)
-
Return a dependency tree representation of the sentence.
See the DependencyTree class for more information. Note that the implementation assumes proper ordering of the tokens according to their IDs.
Raises
ValueError
- If the sentence contains words without the HEAD field, or when the sentence does not have exactly one root with HEAD = 0.
def tokens(self)
-
Return an iterator over all tokens in the sentence (an alias for iter(self)).
def words(self)
-
Return an iterator over all syntactic words (i.e. without multiword and empty tokens).
class Token (fields=(), **kwargs)
-
A dictionary type representing a token in the sentence.
A token can represent a regular syntactic word, or a multiword token spanning multiple words (e.g. in Spanish, vámonos = vamos nos), or an empty token (inserted in the extended dependency tree, e.g. for the analysis of ellipsis). The type of a token can be tested using the read-only is_multiword and is_empty properties.
A token can contain mappings for the following standard CoNLL-U fields:
- ID: word index (integer starting from 1); or range of the indexes for multiword tokens; or decimal notation for empty tokens.
- FORM: word form or punctuation symbol.
- LEMMA: lemma or stem of word form.
- UPOS: Universal part-of-speech tag.
- XPOS: language-specific part-of-speech tag.
- FEATS: list of morphological features from the Universal feature inventory or language-specific extension.
- HEAD: head of the current word in the dependency tree representation (ID or 0 for root).
- DEPREL: Universal dependency relation to the HEAD.
- DEPS: enhanced dependency graph in the form of head-deprel pairs.
- MISC: any other annotation associated with the token.
The ID values are parsed as integers for regular words, or as tuples for multiword and empty tokens (see the multiword_id() and empty_id() functions for more information).
The HEAD values are parsed as integers.
The FORM, LEMMA, UPOS, XPOS, DEPREL and MISC values are strings.
The FEATS values are strings, or are parsed as dictionaries with attribute-value mappings, where multiple values are stored in sets.
The DEPS values are strings, or are parsed as sets of head-deprel tuples.
Create an empty token or token with the fields initialized from the provided mapping object or keyword arguments.
Ancestors
- builtins.dict
Instance variables
var is_empty
-
bool: True if the token is an empty token, otherwise False.
var is_multiword
-
bool: True if the token is a multiword token, otherwise False.
Methods
def copy(self)
-
Return a shallow copy of the token.
def to_collu(self)
-
Return a string representation of the token in the CoNLL-U format.