medacy.model.feature_extractor module

class medacy.model.feature_extractor.FeatureExtractor(window_size=2, spacy_features=['pos_', 'shape_', 'prefix_', 'suffix_', 'like_num'])[source]

Bases: object

Extracts training data for use in a CRF. Features are given as rich dictionaries, as described in: https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#features

sklearn-crfsuite is a wrapper for CRFsuite that gives it a scikit-learn-compatible interface.
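
As a rough illustration of the "rich dictionary" format, the sketch below shows one sequence as a list of per-token feature dictionaries with a parallel list of labels. The feature keys, the values, and the label names are hypothetical and are not taken from medaCy itself.

```python
# One sequence (e.g. a sentence) is a list of per-token feature dicts.
# Feature keys and values here are illustrative assumptions.
X_train = [
    [
        {"bias": 1.0, "0:text": "aspirin", "0:pos_": "NOUN", "0:like_num": False},
        {"bias": 1.0, "0:text": "81", "0:pos_": "NUM", "0:like_num": True},
        {"bias": 1.0, "0:text": "mg", "0:pos_": "NOUN", "0:like_num": False},
    ]
]

# Labels are a parallel list of strings, one per token (hypothetical labels).
y_train = [["Drug", "Strength", "Strength"]]

# Every sequence must have exactly one label per feature dict.
for features, labels in zip(X_train, y_train):
    assert len(features) == len(labels)
```

Lists shaped like these are what `sklearn_crfsuite.CRF().fit(X_train, y_train)` expects as input.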

_sent_to_feature_dicts(sent)[source]
_sent_to_labels(sent, attribute='gold_label')[source]
_token_to_feature_dict(index, sentence)[source]
Parameters
  • index – the index of the token in the sequence

  • sentence – an array of tokens corresponding to a sequence

Returns
  a feature dictionary for the token at the given index
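
A minimal sketch of how a windowed feature dictionary might be built from `window_size` and `spacy_features`. It is not medaCy's actual implementation: the function name, the `offset:attribute` key scheme, and the `SimpleNamespace` stand-ins for spaCy tokens are all assumptions made for illustration.

```python
from types import SimpleNamespace

SPACY_FEATURES = ["pos_", "shape_", "prefix_", "suffix_", "like_num"]


def token_to_feature_dict(index, sentence, window_size=2,
                          spacy_features=SPACY_FEATURES):
    """Collect attributes of the token at `index` and of its neighbours."""
    features = {"bias": 1.0}
    for offset in range(-window_size, window_size + 1):
        position = index + offset
        # Skip window positions that fall outside the sentence.
        if 0 <= position < len(sentence):
            token = sentence[position]
            features[f"{offset}:text"] = token.text
            for name in spacy_features:
                features[f"{offset}:{name}"] = getattr(token, name)
    return features


# Stand-in tokens carrying the attributes a spaCy Token would expose.
sentence = [
    SimpleNamespace(text="Take", pos_="VERB", shape_="Xxxx",
                    prefix_="T", suffix_="ake", like_num=False),
    SimpleNamespace(text="2", pos_="NUM", shape_="d",
                    prefix_="2", suffix_="2", like_num=True),
    SimpleNamespace(text="tablets", pos_="NOUN", shape_="xxxx",
                    prefix_="t", suffix_="ets", like_num=False),
]

features = token_to_feature_dict(1, sentence, window_size=1)
```

Prefixing each key with its relative offset keeps the neighbours' attributes distinct from those of the token being classified.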

get_features_with_span_indices(doc)[source]

Given a document, this method orchestrates the organization of features and labels for the sequences to classify. Sequences for classification are determined by the sentence boundaries set by spaCy. These can be modified.

Parameters
  • doc – an annotated spaCy Doc object

Returns
  Tuple of parallel arrays: ‘features’, an array of feature dictionaries for each sequence (spaCy-determined sentence), and ‘indices’, arrays of character offsets corresponding to each extracted sequence of features.
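
The parallel-array shape of the return value can be sketched without spaCy. The example below uses stand-in tokens with a `text` and a character offset `idx`, and assumes each index entry is a `(start, end)` character pair. That pair format, the helper names, and the single-feature dictionaries are assumptions for illustration, not medaCy's actual output.

```python
from types import SimpleNamespace


def make_token(text, idx):
    """Stand-in for a spaCy Token: its text plus its character offset."""
    return SimpleNamespace(text=text, idx=idx)


def features_with_span_indices(sentences):
    """Return parallel lists: per-sentence features and character spans."""
    features, indices = [], []
    for sent in sentences:
        features.append([{"0:text": tok.text} for tok in sent])
        indices.append([(tok.idx, tok.idx + len(tok.text)) for tok in sent])
    return features, indices


text = "Take aspirin. Rest well."
sentences = [
    [make_token("Take", 0), make_token("aspirin", 5), make_token(".", 12)],
    [make_token("Rest", 14), make_token("well", 19), make_token(".", 23)],
]

features, indices = features_with_span_indices(sentences)

# Each span maps a feature dict back to the exact characters it came from.
for sent_features, sent_indices in zip(features, indices):
    for feature_dict, (start, end) in zip(sent_features, sent_indices):
        assert text[start:end] == feature_dict["0:text"]
```

Keeping the offsets alongside the features lets predicted labels be mapped back to character spans in the original text.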

mapper_for_crf_wrapper(text)[source]

CURRENTLY UNUSED. The CRF wrapper uses regexes to extract the output of the underlying C++ code. The inclusion of \n and space characters breaks these regexes, so they are mapped to text here.
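
One way such a mapping could look is sketched below. The placeholder strings and the function name are hypothetical; the source does not say which replacement text is used.

```python
# Hypothetical placeholders; neither contains a newline or a space.
PLACEHOLDERS = {"\n": "#NEWLINE#", " ": "#SPACE#"}


def map_for_crf_wrapper(text):
    """Replace newline and space characters with printable placeholders."""
    for char, placeholder in PLACEHOLDERS.items():
        text = text.replace(char, placeholder)
    return text


mapped_newline = map_for_crf_wrapper("\n")
mapped_word = map_for_crf_wrapper("mg")
```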