medacy.tools.annotations module

author

Andriy Mulyar, Steele W. Farnsworth

date

12 January, 2019

class medacy.tools.annotations.Annotations(annotation_data, annotation_type='ann', source_text_path=None)[source]

Bases: object

A medaCy annotation. This stores all relevant information needed as input to or output from medaCy. The Annotations object is utilized by medaCy to structure input to models and output from models. This object wraps a dictionary containing two keys at the root level: ‘entities’ and ‘relations’. This structured dictionary is designed to interface easily with the BRAT ANN format. The key ‘entities’ maps to a dictionary whose keys T1, T2, …, TN each correspond to a single entity. The key ‘relations’ maps to a list of relation tuples, where the first element of each tuple is the relation type and the last two elements correspond to keys in the ‘entities’ dictionary.

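For orientation, the sketch below shows what the wrapped dictionary might look like for a toy annotation. Only the ‘entities’/‘relations’ layout is stated above; the exact shape of each entity value and the relation type name are assumptions inferred from the examples under compare_by_entity() below.

    {
        'entities': {
            'T1': ('Sex', 1396, 1403, 'females'),       # assumed (label, start, end, text) layout
            'T2': ('GroupName', 5408, 5414, 'SALDOX')
        },
        'relations': [
            ('describes', 'T2', 'T1')                   # (relation type, entity key, entity key)
        ]
    }
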
_Annotations__default_strict = 0.2
compare_by_entity(gold_anno)[source]

Compares two Annotations objects, which is useful for checking whether an unverified annotation matches an accurate one, by creating a data structure that looks like this:

    {
        'females': {
            'this_anno': [('Sex', 1396, 1403), ('Sex', 295, 302), ('Sex', 3205, 3212)],
            'gold_anno': [('Sex', 1396, 1403), ('Sex', 4358, 4365), ('Sex', 263, 270)]
        },
        'SALDOX': {
            'this_anno': [('GroupName', 5408, 5414)],
            'gold_anno': [('TestArticle', 5406, 5412)]
        },
        'MISSED_BY_PREDICTION': [
            ('GroupName', 8644, 8660, 'per animal group'),
            ('CellLine', 1951, 1968, 'on control diet (')
        ]
    }

The object itself should be the predicted Annotations and the argument should be the gold Annotations.

Parameters

gold_anno – the Annotations object for the gold data.

Returns

The data structure detailed above.

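A minimal usage sketch, assuming the constructor accepts the path to an .ann file for annotation_data (as the default annotation_type='ann' suggests); the file paths are hypothetical:

    from medacy.tools.annotations import Annotations

    predicted = Annotations('predicted/doc1.ann')   # the object is the predicted data
    gold = Annotations('gold/doc1.ann')             # the argument is the gold data

    comparison = predicted.compare_by_entity(gold)
    missed = comparison['MISSED_BY_PREDICTION']     # gold entities the prediction missed
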
compare_by_index(gold_anno, strict=0.2)[source]

Similar to compare_by_entity(), but organized by start index. The two data sets used in the comparison will often not have two annotations beginning at the same index, so the strict value is used to calculate within what margin a matched pair can be separated.

Parameters

gold_anno – the Annotations object representing an annotation set that is known to be accurate.

strict – used to calculate the range within which a possible match can fall. The length of the entity is multiplied by this value, and the product is the amount by which the entity may begin or end away from the starting index of the entity in the gold dataset. Default is 0.2.

Returns

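The snippet below illustrates the matching margin implied by strict; it is a worked example of the rule described above, not the library's internal code:

    gold_entity = ('Sex', 1396, 1403)           # an entity from the gold data
    length = gold_entity[2] - gold_entity[1]    # 7 characters
    margin = length * 0.2                       # 1.4 -> a predicted entity may begin or end
                                                # within about 1.4 characters of the gold indices
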
compare_by_index_stats(gold_anno, strict=0.2)[source]

Runs compare_by_index() and returns a dict of related statistics.

Parameters

gold_anno – see compare_by_index().

strict – see compare_by_index().

Returns

A dictionary with two keys: "num_not_matched", the number of entities in the predicted data that are not matched to an entity in the gold data, and "avg_accuracy", the average of the decimal values representing how close to a 1:1 correlation there was between the start and end indices in the gold and predicted data.

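A usage sketch with hypothetical paths, showing how the returned statistics might be read:

    from medacy.tools.annotations import Annotations

    predicted = Annotations('predicted/doc1.ann')
    gold = Annotations('gold/doc1.ann')

    index_stats = predicted.compare_by_index_stats(gold, strict=0.2)
    index_stats['num_not_matched']   # predicted entities with no gold match
    index_stats['avg_accuracy']      # average closeness of matched index pairs
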
diff(other_anno)[source]

Identifies the difference between two Annotations objects. Useful for checking if an unverified annotation matches an annotation known to be accurate.

Parameters

other_anno – another Annotations object.

Returns

A list of tuples of non-matching annotation pairs.

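A small sketch of diff() on two annotation sets loaded from hypothetical paths:

    from medacy.tools.annotations import Annotations

    predicted = Annotations('predicted/doc1.ann')
    gold = Annotations('gold/doc1.ann')

    mismatches = predicted.diff(gold)   # list of tuples of non-matching annotation pairs
    if not mismatches:
        print('The two annotation sets match.')
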
from_ann(ann_file_path)[source]

Loads an ANN file from the given path.

Parameters

ann_file_path – the system path to the ANN file to load.

Returns

The Annotations object, loaded with the contents of the ANN file.

from_con(con_file_path)[source]

Converts a con file at the given path into an Annotations object. The conversion takes place through the from_ann() method in this class because the indices for the Annotations object must be those used in the BRAT format. The path to the source text for the annotations must be defined unless that file exists in the same directory as the con file.

Parameters

con_file_path – path to the con file being converted to an Annotations object.

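A sketch of loading annotations from both formats. Passing annotation_type='con' to the constructor is an assumption based on the presence of from_con(); only annotation_type='ann' is shown as the default above. Paths are hypothetical:

    from medacy.tools.annotations import Annotations

    # Load a BRAT .ann file (default annotation_type='ann').
    anno = Annotations('data/doc1.ann')

    # Load a .con file; the source text is needed because the conversion
    # goes through BRAT character indices (annotation_type='con' is assumed).
    anno_con = Annotations('data/doc2.con', annotation_type='con',
                           source_text_path='data/doc2.txt')
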
get_entity_annotations(return_dictionary=False)[source]

Returns a list of entity annotation tuples.

Parameters

return_dictionary – if True, returns the dictionary storing the annotation mappings instead; useful if also working with relationship extraction.

Returns

A list of entities, or the underlying dictionary of entities.

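A short sketch of both return modes, with a hypothetical input path:

    from medacy.tools.annotations import Annotations

    anno = Annotations('data/doc1.ann')

    entity_tuples = anno.get_entity_annotations()                       # list of entity tuples
    entity_map = anno.get_entity_annotations(return_dictionary=True)    # underlying dict keyed T1, T2, ...
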
get_entity_count()[source]
stats()[source]

Counts the number of instances of each entity type and the number of unique entities.

Returns

A dict with keys: "entity_counts", a dict matching each entity to the number of times that entity appears in the Annotations; "unique_entity_num", an int giving how many unique entities are in the Annotations; and "entity_list", a list of all the entities that appear in the Annotations, each appearing only once.

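A sketch of reading the statistics; the counts shown in the comments are illustrative values, not real output, and the meaning of get_entity_count() is assumed since it is undocumented above:

    from medacy.tools.annotations import Annotations

    anno = Annotations('data/doc1.ann')   # hypothetical path

    anno.get_entity_count()               # total number of entity annotations (assumed meaning)
    summary = anno.stats()
    summary['entity_counts']              # e.g. {'Sex': 3, 'GroupName': 1}
    summary['unique_entity_num']          # e.g. 2
    summary['entity_list']                # e.g. ['Sex', 'GroupName']
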
to_ann(write_location=None)[source]

Formats the Annotations object into a string representing a valid ANN file. Optionally writes the formatted string to a destination.

Parameters

write_location – path of the location to write the ANN file to.

Returns

A string formatted as an ANN file; if write_location is a valid path, the string is also written to that path.

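A minimal sketch of serializing back to the ANN format; the output path is hypothetical:

    from medacy.tools.annotations import Annotations

    anno = Annotations('data/doc1.ann')

    ann_string = anno.to_ann()                    # return the ANN-formatted string only
    anno.to_ann(write_location='out/doc1.ann')    # also write the string to this path
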
to_con(write_location=None)[source]

Formats the Annotations object into a valid con file. Optionally writes the string to a specified location.

Parameters

write_location – optional path to an output file; if provided but the file does not yet exist, it will be created. If this parameter is not provided, nothing is written to file.

Returns

A string representation of the annotations in the con format.

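A similar sketch for the con format, again with hypothetical paths:

    from medacy.tools.annotations import Annotations

    anno = Annotations('data/doc1.ann')

    con_string = anno.to_con()                    # return the con-formatted string only
    anno.to_con(write_location='out/doc1.con')    # also write; the file is created if missing
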
to_html(output_file_path, title='medaCy')[source]

Converts the Annotations to a displaCy-formatted HTML representation. The Annotations must have the path to the source file as one of its attributes. Does not return a value.

Parameters

output_file_path – where to write the HTML to.

title – what should appear in the header of the outputted HTML file; not very important.

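A sketch of producing the displaCy HTML, assuming the source text path was supplied when the Annotations was built; paths and title are hypothetical:

    from medacy.tools.annotations import Annotations

    anno = Annotations('data/doc1.ann', source_text_path='data/doc1.txt')
    anno.to_html('out/doc1.html', title='medaCy demo')   # writes HTML; returns nothing
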
exception medacy.tools.annotations.InvalidAnnotationError[source]

Bases: ValueError

Raised when a given input is not in the valid format for that annotation type.