medacy.data.dataset module

A Dataset facilitates the management of data for both model training and model prediction. A Dataset object provides a wrapper for a Unix file directory containing training/prediction data. If a Dataset, at training time, is fed into a pipeline requiring auxiliary files (MetaMap, for instance), the Dataset will automatically create those files as efficiently as possible.

# Training

When a directory contains raw text files alongside annotation files, an instantiated Dataset detects and facilitates access to those files.

# Prediction

When a directory contains only raw text files, an instantiated Dataset object interprets this as a directory of files that need to be predicted over. In this case, the internal DataFile objects that aggregate metadata for each prediction file do not have their annotation_file_path field set.
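As a rough sketch of the detection described above, a directory can be classified by pairing raw text files with annotation files. The helper below is hypothetical and not part of the medaCy API; it only illustrates the training-versus-prediction distinction:

```python
import os
import tempfile

def classify_directory(data_directory, raw_ext="txt", ann_ext="ann"):
    """Hypothetical helper: a directory whose raw text files have matching
    annotation files is a training set; one without annotations is a
    prediction set."""
    files = os.listdir(data_directory)
    raw = {os.path.splitext(f)[0] for f in files if f.endswith("." + raw_ext)}
    ann = {os.path.splitext(f)[0] for f in files if f.endswith("." + ann_ext)}
    return "training" if raw & ann else "prediction"

# A directory holding only raw text is treated as a prediction set
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "note_1.txt"), "w").close()
    print(classify_directory(d))  # prediction
```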

# External Datasets

A dataset can be versioned and distributed by interfacing with this class as described in the Dataset examples. Existing datasets can be imported by installing the relevant Python packages that wrap them.

class medacy.data.dataset.Dataset(data_directory, raw_text_file_extension='txt', annotation_file_extension='ann', metamapped_files_directory=None, data_limit=None)[source]

Bases: object

A facilitation class for data management.

_parallel_metamap(files, i)[source]

Facilitates metamapping in parallel by forking off processes to MetaMap each file individually.

:param files: an array of file paths to the files to map
:param i: index into the array identifying the file that the spawned process is responsible for mapping
:return: metamapped_files_directory now contains metamapped versions of the dataset files

get_data_directory()[source]

Retrieves the directory this Dataset abstracts from.

:return: the path of the directory this Dataset wraps

get_data_files()[source]

Retrieves a list containing all the files registered by a Dataset.

:return: a list of DataFile objects

is_metamapped()[source]

Verifies that all files in the Dataset are metamapped.

:return: True if all data files are metamapped, False otherwise
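The check can be pictured as follows. This is an illustrative re-implementation, not the medaCy source; the helper name and the ".metamapped" file extension are assumptions made for the example:

```python
import os

def is_metamapped(data_directory, metamapped_directory, raw_ext="txt"):
    """Hypothetical sketch: every raw text file must have a metamapped
    counterpart in metamapped_directory (".metamapped" extension assumed)."""
    for f in os.listdir(data_directory):
        if not f.endswith("." + raw_ext):
            continue
        base = os.path.splitext(f)[0]
        if not os.path.isfile(os.path.join(metamapped_directory, base + ".metamapped")):
            return False
    return True
```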

is_training()[source]

Whether this Dataset can be used for training. A training dataset is a collection of raw text files and corresponding annotation files, while a prediction dataset contains solely raw text files.

:return: True if this is a training dataset, False otherwise

static load_external(package_name)[source]

Loads an external medaCy-compatible dataset. Requires the dataset’s associated package to be installed. Alternatively, you can import the package directly and call its .load() method.

:param package_name: the package name of the dataset
:return: a tuple containing a training set, evaluation set, and meta_data
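In effect, loading an external dataset amounts to importing its wrapper package and delegating to its load() method. The sketch below shows that shape with importlib; the package name used in any real call is whatever dataset package you have installed:

```python
import importlib

def load_external(package_name):
    """Sketch of the described behaviour: import the dataset's wrapper
    package and call its load() method, which is expected to return
    (training_set, evaluation_set, meta_data)."""
    package = importlib.import_module(package_name)
    return package.load()
```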

metamap(metamap, n_jobs=3, retry_possible_corruptions=True)[source]

Metamaps the files registered by a Dataset. Attempts to MetaMap using a max prune depth of 30, but on failure retries with a lower max prune depth. A lower prune depth roughly equates to decreased MetaMap performance. More information can be found in the MetaMap documentation.

:param metamap: an instance of MetaMap
:param n_jobs: the number of processes to spawn when metamapping (default: 3)
:param retry_possible_corruptions: re-metamaps files that are detected as possibly corrupt. Set to False for more control over what gets metamapped or if you are encountering bugs while metamapping. (default: True)
:return: for each raw text file, an auxiliary metamapped version is created inside metamapped_files_directory (by default a sub-directory of your data_directory named metamapped)
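The retry-on-failure behaviour can be sketched as follows. The prune-depth sequence and the failing callable are illustrative assumptions; the real method drives a MetaMap instance rather than an arbitrary function:

```python
def metamap_with_retries(map_at_depth, prune_depths=(30, 20, 10)):
    """Try each max prune depth in turn, returning the first successful
    result; re-raise the last error if every depth fails."""
    last_error = None
    for depth in prune_depths:
        try:
            return map_at_depth(depth)
        except Exception as error:
            last_error = error
    raise last_error

# Example: a mapper that only succeeds once the depth drops to 20
def flaky_mapper(depth):
    if depth > 20:
        raise RuntimeError("MetaMap failed at depth %d" % depth)
    return "mapped at depth %d" % depth

print(metamap_with_retries(flaky_mapper))  # mapped at depth 20
```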

set_data_limit(data_limit)[source]

Sets a limit on the number of files in the Dataset that medaCy works with. This is useful for preliminary experimentation when working with the entire Dataset would take too long.

:param data_limit: the maximum number of files to use
:return:
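Conceptually the limit is just a truncation of the registered file list; this hypothetical helper mirrors the described behaviour:

```python
def apply_data_limit(data_files, data_limit):
    """Hypothetical sketch: return at most data_limit files;
    None means no limit."""
    if data_limit is None:
        return list(data_files)
    return list(data_files)[:data_limit]

print(apply_data_limit(["a.txt", "b.txt", "c.txt"], 2))  # ['a.txt', 'b.txt']
```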