mlsnippet.datafs¶

class mlsnippet.datafs.TarArchiveFS(archive_file, strict=False)¶

Bases: mlsnippet.datafs.archivefs._ArchiveFS

Tar archive file based DataFS.

__init__(archive_file, strict=False)¶

Construct a new TarArchiveFS.

Parameters:	archive_file (str) – Path of the archive file. strict (bool) – Whether or not this `DataFS` works in strict mode? (default `False`)

_close()¶: Override this method to destroy the internal states.

_init()¶: Override this method to initialize the internal states.

isfile(filename)¶

Check whether or not a file exists.

Parameters:	filename (str) – The name of the file.
Returns:	`True` if `filename` exists and is a file, and `False` otherwise.
Return type:	bool

iter_files(meta_keys=None)¶

Iterate through all the files in this DataFS.

Parameters:

meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)

Yields:

(filename, content, [meta-data…]) –

A tuple containing the name of a file, its content, and the value of each meta key in meta_keys. If a requested key is absent for a file, None takes its place.

Raises:

UnsupportedOperation – If meta_keys is specified, but READ_META capacity is absent.

iter_names()¶

Iterate through all the file names in this DataFS.

Yields:	str – The file name of each file.

open(filename, mode)¶

Open a file-like object to read / write a file.

Parameters:	filename (str) – The name of the file. mode ({'r', 'w'}) – The open mode of the file, either ‘r’ for reading or ‘w’ for writing. Other modes are not supported in general.
Returns:	The file-like object. This object will be immediately closed as soon as this `DataFS` instance is closed.
Return type:	file-like
Raises:	`InvalidOpenMode` – If the specified mode is not supported, e.g., `mode == 'w'` but `WRITE_DATA` capacity is absent. `DataFileNotExist` – If `mode == 'r'` but filename does not exist.
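As a rough illustration of what iterating over a tar archive involves, the sketch below uses only the standard-library `tarfile` module. It is not the `TarArchiveFS` implementation; the helpers `iter_tar_files` and `make_demo_tar` and the in-memory archive are assumptions made for this example.

```python
import io
import tarfile


def iter_tar_files(fileobj):
    """Yield (filename, content) for every regular file in a tar archive.

    Hypothetical helper for illustration; not part of mlsnippet.
    """
    with tarfile.open(fileobj=fileobj, mode='r') as tar:
        for member in tar:
            if member.isfile():  # skip directories, links, etc.
                yield member.name, tar.extractfile(member).read()


def make_demo_tar():
    """Build a small in-memory tar archive holding two files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w') as tar:
        for name, data in [('a.txt', b'hello'), ('dir/b.txt', b'world')]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


files = dict(iter_tar_files(make_demo_tar()))
```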

class mlsnippet.datafs.ZipArchiveFS(archive_file, strict=False)¶

Bases: mlsnippet.datafs.archivefs._ArchiveFS

Zip archive file based DataFS.

__init__(archive_file, strict=False)¶

Construct a new ZipArchiveFS.

Parameters:	archive_file (str) – Path of the archive file. strict (bool) – Whether or not this `DataFS` works in strict mode? (default `False`)

_close()¶: Override this method to destroy the internal states.

_init()¶: Override this method to initialize the internal states.

isfile(filename)¶

Check whether or not a file exists.

Parameters:	filename (str) – The name of the file.
Returns:	`True` if `filename` exists and is a file, and `False` otherwise.
Return type:	bool

iter_files(meta_keys=None)¶

Iterate through all the files in this DataFS.

Parameters:

meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)

Yields:

(filename, content, [meta-data…]) –

A tuple containing the name of a file, its content, and the value of each meta key in meta_keys. If a requested key is absent for a file, None takes its place.

Raises:

UnsupportedOperation – If meta_keys is specified, but READ_META capacity is absent.

iter_names()¶

Iterate through all the file names in this DataFS.

Yields:	str – The file name of each file.

open(filename, mode)¶

Open a file-like object to read / write a file.

Parameters:	filename (str) – The name of the file. mode ({'r', 'w'}) – The open mode of the file, either ‘r’ for reading or ‘w’ for writing. Other modes are not supported in general.
Returns:	The file-like object. This object will be immediately closed as soon as this `DataFS` instance is closed.
Return type:	file-like
Raises:	`InvalidOpenMode` – If the specified mode is not supported, e.g., `mode == 'w'` but `WRITE_DATA` capacity is absent. `DataFileNotExist` – If `mode == 'r'` but filename does not exist.
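The `isfile` and read-mode `open` behaviour described above can be approximated with the standard-library `zipfile` module. This is only a sketch under that assumption, not the `ZipArchiveFS` implementation; `zip_isfile` is a hypothetical helper and the archive is built in memory.

```python
import io
import zipfile

# Build a tiny in-memory zip archive to work with.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode='w') as zf:
    zf.writestr('images/cat.jpg', b'fake-jpeg-bytes')


def zip_isfile(zf, filename):
    """Return True if `filename` names a file (not a directory) in `zf`."""
    try:
        info = zf.getinfo(filename)
    except KeyError:  # getinfo raises KeyError for unknown names
        return False
    return not info.is_dir()


with zipfile.ZipFile(buf, mode='r') as zf:
    exists = zip_isfile(zf, 'images/cat.jpg')
    missing = zip_isfile(zf, 'images/dog.jpg')
    with zf.open('images/cat.jpg', 'r') as f:
        content = f.read()
```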

class mlsnippet.datafs.DataFSCapacity(mode=0)¶

Bases: object

Enumeration class to represent the capacity of a DataFS.

There are 7 categories of capacities. A method of DataFS works only if the DataFS has the capacities that method requires. To check whether a DataFS has a certain capacity, call the corresponding can_[capacity_name]() method.

ALL = 127¶: All capacities are supported.

LIST_META = 16¶: Can enumerate the meta keys for a particular file.

QUICK_COUNT = 32¶: Can get the count of files without iterating through them.

RANDOM_SAMPLE = 64¶: Can randomly sample files without obtaining the whole file list.

READ_DATA = 1¶: Can read file data, the basic capacity of a DataFS.

READ_META = 4¶: Can read meta data.

READ_WRITE_DATA = 3¶: Can read and write file data.

READ_WRITE_META = 12¶: Can read and write meta data.

WRITE_DATA = 2¶: Can write file data.

WRITE_META = 8¶: Can write meta data.

__init__(mode=0)¶

Construct a new DataFSCapacity.

Parameters:	mode (int) – The mode number of this capacity flag.

can_list_meta()¶

can_quick_count()¶

can_random_sample()¶

can_read_data()¶

can_read_meta()¶

can_write_data()¶

can_write_meta()¶
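The documented values (for example READ_WRITE_DATA = 3 and ALL = 127) suggest that the capacities are bit flags combined with bitwise OR. The sketch below works under that assumption; `CapacitySketch` and its `has` method are hypothetical stand-ins, not the library's `DataFSCapacity`.

```python
# The seven basic capacity values listed above.
READ_DATA = 1
WRITE_DATA = 2
READ_META = 4
WRITE_META = 8
LIST_META = 16
QUICK_COUNT = 32
RANDOM_SAMPLE = 64


class CapacitySketch(object):
    """Illustrative stand-in for a capacity flag; not the library class."""

    def __init__(self, mode=0):
        self.mode = mode

    def has(self, flag):
        # A capacity is present when all of its bits are set in the mode.
        return (self.mode & flag) == flag


all_flags = (READ_DATA | WRITE_DATA | READ_META | WRITE_META |
             LIST_META | QUICK_COUNT | RANDOM_SAMPLE)
read_only = CapacitySketch(READ_DATA | READ_META)
```

Under this reading, a read-only backend would report `has(READ_META)` as true and `has(WRITE_DATA)` as false, mirroring what `can_read_meta()` and `can_write_data()` are documented to check.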

class mlsnippet.datafs.DataFS(capacity, strict=False)¶

Bases: mlsnippet.utils.concepts.AutoInitAndCloseable

Base class for all data file systems.

A DataFS provides access to a machine learning dataset stored in a file-system-like backend. For example, large image datasets are usually stored as raw image files, gathered in a directory. Such a true file system can be accessed via LocalFS.

Alternatively, the images may be stored in a virtual file system provided by a database, for example the GridFS of MongoDB, which can be accessed via MongoFS.

__init__(capacity, strict=False)¶

Initialize the base DataFS class.

Parameters:	capacity (int or DataFSCapacity) – Specify the capacity of the derived `DataFS`. strict (bool) – Whether or not this `DataFS` works in strict mode? (default `False`) In strict mode, accessing the value of a non-existent meta key raises `MetaKeyNotExist` instead of returning `None`.
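The difference between strict and non-strict mode can be sketched with a plain dictionary as the meta store. Everything here is an assumption for illustration: the local `MetaKeyNotExist` class merely mirrors the documented exception, and `get_meta` is a free function rather than the library method.

```python
class MetaKeyNotExist(Exception):
    """Local stand-in for mlsnippet.datafs.MetaKeyNotExist."""

    def __init__(self, filename, meta_key):
        super(MetaKeyNotExist, self).__init__(filename, meta_key)
        self.filename = filename
        self.meta_key = meta_key


def get_meta(meta_store, filename, meta_keys, strict=False):
    """Look up `meta_keys` for `filename` in a plain dict of dicts."""
    file_meta = meta_store[filename]
    values = []
    for key in meta_keys:
        if key in file_meta:
            values.append(file_meta[key])
        elif strict:
            raise MetaKeyNotExist(filename, key)
        else:
            values.append(None)  # non-strict: a missing key becomes None
    return tuple(values)


store = {'cat.jpg': {'label': 3}}
lenient = get_meta(store, 'cat.jpg', ['label', 'width'])
try:
    get_meta(store, 'cat.jpg', ['label', 'width'], strict=True)
    strict_error = None
except MetaKeyNotExist as ex:
    strict_error = ex
```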

as_flow(batch_size, with_names=True, meta_keys=None, shuffle=False, skip_incomplete=False, names_pattern=None)¶

Construct a DataFlow, which iterates through the files once and only once in an epoch.

The returned DataFSFlow will hold a copy of this instance (obtained by clone()) instead of holding this instance itself.

Parameters:

batch_size (int) – Size of each mini-batch.
with_names (bool) – Whether or not to include the file names in the returned flow? (default True)
meta_keys (None or Iterable[str]) – The keys of the meta data to be included in the returned flow. (default None)
shuffle (bool) – Whether or not to shuffle the files in each epoch of the flow? Setting this to True will force loading the file list into memory. (default False)
skip_incomplete (bool) – Whether or not to exclude a mini-batch, if it has fewer data than batch_size? (default False, the final mini-batch will always be visited even if it has fewer data than batch_size)
names_pattern (None or str or regex) – The file name pattern. If specified, only if the file name matches this pattern, would the file be included in the constructed data flow. Specifying this option will force loading the file list into memory. (default None)

Returns:

A dataflow, with each mini-batch having numpy arrays ([filename,] content, [meta-data...]), according to the arguments.

Return type:

tfsnippet.dataflow.DataFlow
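The effect of `batch_size`, `skip_incomplete` and `names_pattern` on the mini-batches can be sketched in plain Python. This is not how `as_flow` is implemented; `iter_batches` is a hypothetical helper that batches file names only, and applying the pattern with `re.match` is an assumption.

```python
import re


def iter_batches(names, batch_size, skip_incomplete=False, names_pattern=None):
    """Group `names` into mini-batches (hypothetical helper)."""
    if names_pattern is not None:
        pattern = re.compile(names_pattern)
        # Filtering needs the whole name list, hence the in-memory list.
        names = [n for n in names if pattern.match(n)]
    batch = []
    for name in names:
        batch.append(name)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and not skip_incomplete:
        yield batch  # final mini-batch with fewer than batch_size items


names = ['a.png', 'b.png', 'c.txt', 'd.png', 'e.png', 'f.png']
all_batches = list(iter_batches(names, 2, names_pattern=r'.*\.png$'))
full_only = list(iter_batches(names, 2, skip_incomplete=True,
                              names_pattern=r'.*\.png$'))
```

With five matching names and a batch size of 2, the default setting keeps the trailing one-element batch, while `skip_incomplete=True` drops it.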

batch_get_meta(filenames, meta_keys)¶

Get meta data of files.

Parameters:

filenames (Iterable[str]) – The names of the files.
meta_keys (Iterable[str]) – The keys of the meta data.

Returns:

A list with one entry per file: a tuple of meta values, or None if the corresponding file does not exist.

Return type:

list[tuple[any] or None]

batch_isfile(filenames)¶

Check whether or not the files exist.

Parameters:	filenames (Iterable[str]) – The names of the files.
Returns:	A list of indicators, where `True` if the corresponding `filename` exists and is a file, and `False` otherwise.
Return type:	list[bool]

capacity¶

Get the capacity of this DataFS.

Returns:	The capacity object.
Return type:	DataFSCapacity

clear_and_put_meta(filename, meta_dict=None, **meta_dict_kwargs)¶

Set the meta data of a file. The un-mentioned meta data will be cleared. This method is not necessarily slower than put_meta().

Parameters:	filename (str) – The name of the file. meta_dict (dict[str, any]) – The meta values to be updated. **meta_dict_kwargs – The meta values to be updated, as keyword arguments. This will override the values provided in `meta_dict`.
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `WRITE_META` capacity (and possibly the `LIST_META` capacity) is(are) absent.

clear_meta(filename)¶

Clear all the meta data of a file.

Parameters:	filename (str) – The name of the file.
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `WRITE_META` capacity (and possibly the `LIST_META` capacity) is(are) absent.

clone()¶

Obtain a clone of this DataFS instance.

Returns:	The cloned `DataFS`. Only the construction arguments will be copied. All the internal states (e.g., database connections) are kept un-initialized.
Return type:	DataFS

count()¶

Count the files in this DataFS.

Will iterate through all the files via iter_names(), if QUICK_COUNT capacity is absent.

Returns:	The total number of files.
Return type:	int
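The fallback described above, counting by a full scan when no quick count is available, can be sketched as follows. `count_files`, `demo_names` and the `quick_count` callable are hypothetical names for this example, not library API.

```python
def count_files(iter_names, quick_count=None):
    """Return the file count, using `quick_count` when the backend offers it."""
    if quick_count is not None:
        return quick_count()
    return sum(1 for _ in iter_names())  # full scan over all the names


def demo_names():
    """Stand-in for a backend's name iterator."""
    for i in range(5):
        yield 'file-%d.bin' % i


slow = count_files(demo_names)
fast = count_files(demo_names, quick_count=lambda: 5)
```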

get_data(filename)¶

Get the content of a file.

Parameters:	filename (str) – The name of the file.
Returns:	The content of the file.
Return type:	bytes
Raises:	`DataFileNotExist` – If filename does not exist.

get_meta(filename, meta_keys)¶

Get meta data of a file.

Parameters:	filename (str) – The name of the file. meta_keys (Iterable[str]) – The keys of the meta data.
Returns:	The meta values, corresponding to `meta_keys`. If a requested key is absent for a file, `None` will take the place.
Return type:	tuple[any]
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `READ_META` capacity is absent.

get_meta_dict(filename)¶

Get all the meta data of a file, as a dict.

Parameters:	filename (str) – The name of the file.
Returns:	The meta values, as a dict.
Return type:	dict[str, any]
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `READ_META` or `LIST_META` capacity is absent.

isfile(filename)¶

Check whether or not a file exists.

Parameters:	filename (str) – The name of the file.
Returns:	`True` if `filename` exists and is a file, and `False` otherwise.
Return type:	bool

iter_files(meta_keys=None)¶

Iterate through all the files in this DataFS.

Parameters:

meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)

Yields:

(filename, content, [meta-data…]) –

A tuple containing the name of a file, its content, and the value of each meta key in meta_keys. If a requested key is absent for a file, None takes its place.

Raises:

UnsupportedOperation – If meta_keys is specified, but READ_META capacity is absent.

iter_names()¶

Iterate through all the file names in this DataFS.

Yields:	str – The file name of each file.

list_meta(filename)¶

List the meta keys of a file.

Parameters:	filename (str) – The name of the file.
Returns:	The keys of the meta data of the file.
Return type:	tuple[str]
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `LIST_META` capacity is absent.

list_names()¶

Get the list of all the file names.

Returns:	The file names list.
Return type:	list[str]

open(filename, mode)¶

Open a file-like object to read / write a file.

Parameters:	filename (str) – The name of the file. mode ({'r', 'w'}) – The open mode of the file, either ‘r’ for reading or ‘w’ for writing. Other modes are not supported in general.
Returns:	The file-like object. This object will be immediately closed as soon as this `DataFS` instance is closed.
Return type:	file-like
Raises:	`InvalidOpenMode` – If the specified mode is not supported, e.g., `mode == 'w'` but `WRITE_DATA` capacity is absent. `DataFileNotExist` – If `mode == 'r'` but filename does not exist.

put_data(filename, data)¶

Save the content of a file.

Parameters:	filename (str) – The name of the file. data (bytes or file-like) – The content of the file, or a file-like object with `read(size)` method.
Raises:	`UnsupportedOperation` – If `WRITE_DATA` capacity is absent.

put_meta(filename, meta_dict=None, **meta_dict_kwargs)¶

Update the meta data of a file. The un-mentioned meta data will remain unchanged. This method is not necessarily faster than clear_and_put_meta(). In some backends it may be implemented by first calling get_meta_dict, then updating the meta dict in memory, and finally calling clear_and_put_meta.

Parameters:	filename (str) – The name of the file. meta_dict (dict[str, any]) – The meta values to be updated. **meta_dict_kwargs – The meta values to be updated, as keyword arguments. This will override the values provided in `meta_dict`.
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `WRITE_META` capacity (and possibly the `READ_META` capacity) is(are) absent.
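The three-step fallback described for `put_meta` (read the meta dict, update it in memory, write it back with `clear_and_put_meta`) can be sketched with a toy in-memory store. `InMemoryMeta` is a hypothetical class for illustration only; real backends may update meta data more directly.

```python
class InMemoryMeta(object):
    """Toy meta store used only for this illustration."""

    def __init__(self):
        self._meta = {}

    def get_meta_dict(self, filename):
        return dict(self._meta.get(filename, {}))

    def clear_and_put_meta(self, filename, meta_dict=None, **meta_dict_kwargs):
        merged = dict(meta_dict or {})
        merged.update(meta_dict_kwargs)  # keyword arguments override meta_dict
        self._meta[filename] = merged    # un-mentioned keys are dropped

    def put_meta(self, filename, meta_dict=None, **meta_dict_kwargs):
        merged = self.get_meta_dict(filename)      # 1. read the existing meta
        merged.update(meta_dict or {})             # 2. update it in memory
        merged.update(meta_dict_kwargs)
        self.clear_and_put_meta(filename, merged)  # 3. write everything back


fs = InMemoryMeta()
fs.clear_and_put_meta('cat.jpg', {'label': 1, 'width': 32})
fs.put_meta('cat.jpg', {'label': 2}, label=3)
after_put = fs.get_meta_dict('cat.jpg')
fs.clear_and_put_meta('cat.jpg', height=64)
after_clear = fs.get_meta_dict('cat.jpg')
```

Here `put_meta` keeps `width` and takes `label` from the keyword argument, whereas the later `clear_and_put_meta` leaves only `height`.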

random_flow(batch_size, with_names=True, meta_keys=None, skip_incomplete=False, batch_count=None)¶

Construct a DataFlow, with infinite or pre-configured number of mini-batches in an epoch, randomly sampled from the whole DataFS.

The returned DataFSRandomFlow will hold a copy of this instance (obtained by clone()) instead of holding this instance itself.

Parameters:	batch_size (int) – Size of each mini-batch. with_names (bool) – Whether or not to include the file names in the returned flow? (default `True`) meta_keys (None or Iterable[str]) – The keys of the meta data to be included in the returned flow. (default `None`) skip_incomplete (bool) – Whether or not to exclude a mini-batch, if it has fewer data than `batch_size`? (default `False`, the final mini-batch will always be visited even if it has fewer data than `batch_size`) batch_count (int or None) – The number of mini-batches to obtain in an epoch. (default `None`, infinite mini-batches)
Returns:	A dataflow, with each mini-batch having numpy arrays `([filename,] content, [meta-data...])`, according to the arguments.
Return type:	tfsnippet.dataflow.DataFlow
Raises:	`UnsupportedOperation` – If `RANDOM_SAMPLE` capacity is absent.

retrieve(filename, meta_keys=None)¶

Retrieve the content and maybe meta data of a file.

Parameters:

filename (str) – The name of the file to be retrieved.
meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)

Returns:

The content alone, or a tuple containing the content and the meta values corresponding to meta_keys. If a requested key is absent for a file, None takes its place.

Return type:

bytes or (bytes, [meta-data…])

Notes

As long as meta_keys is not None, a tuple will always be returned, even if meta_keys is an empty collection.

Raises:	`UnsupportedOperation` – If `meta_keys` is specified, but `READ_META` capacity is absent. `DataFileNotExist` – If filename does not exist.
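The return convention noted above, bytes when `meta_keys` is None and a tuple otherwise, can be sketched with two dictionaries standing in for the backend. The free function `retrieve` below is a hypothetical stand-in, and the flat `(content, value, ...)` layout is one reading of the documented return type.

```python
def retrieve(files, meta, filename, meta_keys=None):
    """Return the content alone, or (content, meta values...) as a tuple."""
    content = files[filename]
    if meta_keys is None:
        return content
    file_meta = meta.get(filename, {})
    # A tuple is returned even when meta_keys is empty.
    return (content,) + tuple(file_meta.get(k) for k in meta_keys)


files = {'cat.jpg': b'\xff\xd8'}
meta = {'cat.jpg': {'label': 3}}
only_content = retrieve(files, meta, 'cat.jpg')
with_meta = retrieve(files, meta, 'cat.jpg', ['label', 'width'])
empty_keys = retrieve(files, meta, 'cat.jpg', [])
```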

sample_files(n_samples, meta_keys=None)¶

Sample n_samples files from this DataFS.

Parameters:	n_samples (int) – The number of files to sample. meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default `None`)
Returns:	A list of tuples, each containing the name of a file, its content, and the value of each meta key in `meta_keys`. If a requested key is absent for a file, `None` takes its place.
Return type:	list[(filename, content, [meta-data…])]
Raises:	`UnsupportedOperation` – If `RANDOM_SAMPLE` capacity is absent, or `meta_keys` is specified, but `READ_META` capacity is absent.

sample_names(n_samples)¶

Sample n_samples file names from this DataFS.

Parameters:	n_samples (int) – Number of names to sample. Fewer names may be returned if this `DataFS` contains fewer than `n_samples` files.
Returns:	The list of sampled file names.
Return type:	list[str]
Raises:	`UnsupportedOperation` – If `RANDOM_SAMPLE` capacity is absent.
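A minimal sketch of why fewer than `n_samples` names may come back, assuming sampling without replacement from a known name list. The library may sample differently (for example on the database side); the free function `sample_names` and the `rng` parameter are hypothetical.

```python
import random


def sample_names(names, n_samples, rng=None):
    """Sample up to `n_samples` distinct names without replacement."""
    rng = rng or random.Random()
    # Cap the request at the number of available names.
    return rng.sample(names, min(n_samples, len(names)))


names = ['a.png', 'b.png', 'c.png']
two = sample_names(names, 2, random.Random(0))
capped = sample_names(names, 10, random.Random(0))
```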

strict¶: Whether or not this DataFS works in strict mode?

sub_flow(batch_size, names, with_names=True, meta_keys=None, shuffle=False, skip_incomplete=False)¶

Construct a DataFlow, which iterates through the files according to selected names.

The returned DataFSFlow will hold a copy of this instance (obtained by clone()) instead of holding this instance itself.

Parameters:

batch_size (int) – Size of each mini-batch.
names (list[str] or np.ndarray[str]) – The names to retrieve.
with_names (bool) – Whether or not to include the file names in the returned flow? (default True)
meta_keys (None or Iterable[str]) – The keys of the meta data to be included in the returned flow. (default None)
shuffle (bool) – Whether or not to shuffle the files in each epoch of the flow? Setting this to True will force loading the file list into memory. (default False)
skip_incomplete (bool) – Whether or not to exclude a mini-batch, if it has fewer data than batch_size? (default False, the final mini-batch will always be visited even if it has fewer data than batch_size)

Returns:

A dataflow, with each mini-batch having numpy arrays ([filename,] content, [meta-data...]), according to the arguments.

Return type:

tfsnippet.dataflow.DataFlow

exception mlsnippet.datafs.DataFSError¶

Bases: exceptions.Exception

Base class for all DataFS errors.

exception mlsnippet.datafs.UnsupportedOperation¶

Bases: mlsnippet.datafs.errors.DataFSError

Class to indicate that a requested operation is not supported by the specific DataFS subclass.

exception mlsnippet.datafs.InvalidOpenMode(mode)¶

Bases: mlsnippet.datafs.errors.UnsupportedOperation

Class to indicate that the specified open mode is not supported.

mode¶

exception mlsnippet.datafs.DataFileNotExist(filename)¶

Bases: mlsnippet.datafs.errors.DataFSError

Class to indicate a requested data file does not exist.

filename¶

exception mlsnippet.datafs.MetaKeyNotExist(filename, meta_key)¶

Bases: mlsnippet.datafs.errors.DataFSError

Class to indicate a requested meta key does not exist.

filename¶

meta_key¶
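Because `InvalidOpenMode` derives from `UnsupportedOperation`, a handler for the latter also catches the former, while `DataFileNotExist` falls through to the `DataFSError` handler. The classes below are local re-declarations that mirror the documented hierarchy so the example runs on its own; `describe` is a hypothetical helper.

```python
class DataFSError(Exception):
    """Mirrors the documented base class for all DataFS errors."""


class UnsupportedOperation(DataFSError):
    """Mirrors the documented 'operation not supported' error."""


class InvalidOpenMode(UnsupportedOperation):
    def __init__(self, mode):
        super(InvalidOpenMode, self).__init__(mode)
        self.mode = mode


class DataFileNotExist(DataFSError):
    def __init__(self, filename):
        super(DataFileNotExist, self).__init__(filename)
        self.filename = filename


def describe(error):
    """Raise `error` and report which handler caught it."""
    try:
        raise error
    except UnsupportedOperation as ex:
        return 'unsupported: %s' % type(ex).__name__
    except DataFSError as ex:
        return 'other datafs error: %s' % type(ex).__name__


open_result = describe(InvalidOpenMode('a'))
missing_result = describe(DataFileNotExist('cat.jpg'))
```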

class mlsnippet.datafs.LocalFS(root_dir, strict=False)¶

Bases: mlsnippet.datafs.base.DataFS

Local directory based DataFS.

__init__(root_dir, strict=False)¶

Construct a new LocalFS.

Parameters:	root_dir (str) – The root directory for this `LocalFS`. strict (bool) – Whether or not this `DataFS` works in strict mode? (default `False`)

_close()¶: Override this method to destroy the internal states.

_init()¶: Override this method to initialize the internal states.

clear_meta(filename)¶

Clear all the meta data of a file.

Parameters:	filename (str) – The name of the file.
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `WRITE_META` capacity (and possibly the `LIST_META` capacity) is(are) absent.

clone()¶

Obtain a clone of this DataFS instance.

Returns:	The cloned `DataFS`. Only the construction arguments will be copied. All the internal states (e.g., database connections) are kept un-initialized.
Return type:	DataFS

get_meta(filename, meta_keys)¶

Get meta data of a file.

Parameters:	filename (str) – The name of the file. meta_keys (Iterable[str]) – The keys of the meta data.
Returns:	The meta values, corresponding to `meta_keys`. If a requested key is absent for a file, `None` will take the place.
Return type:	tuple[any]
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `READ_META` capacity is absent.

isfile(filename)¶

Check whether or not a file exists.

Parameters:	filename (str) – The name of the file.
Returns:	`True` if `filename` exists and is a file, and `False` otherwise.
Return type:	bool

iter_names()¶

Iterate through all the file names in this DataFS.

Yields:	str – The file name of each file.

list_meta(filename)¶

List the meta keys of a file.

Parameters:	filename (str) – The name of the file.
Returns:	The keys of the meta data of the file.
Return type:	tuple[str]
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `LIST_META` capacity is absent.

open(filename, mode)¶

Open a file-like object to read / write a file.

Parameters:	filename (str) – The name of the file. mode ({'r', 'w'}) – The open mode of the file, either ‘r’ for reading or ‘w’ for writing. Other modes are not supported in general.
Returns:	The file-like object. This object will be immediately closed as soon as this `DataFS` instance is closed.
Return type:	file-like
Raises:	`InvalidOpenMode` – If the specified mode is not supported, e.g., `mode == 'w'` but `WRITE_DATA` capacity is absent. `DataFileNotExist` – If `mode == 'r'` but filename does not exist.

put_meta(filename, meta_dict=None, **meta_dict_kwargs)¶

Update the meta data of a file. The un-mentioned meta data will remain unchanged. This method is not necessarily faster than clear_and_put_meta(). In some backends it may be implemented by first calling get_meta_dict, then updating the meta dict in memory, and finally calling clear_and_put_meta.

Parameters:	filename (str) – The name of the file. meta_dict (dict[str, any]) – The meta values to be updated. **meta_dict_kwargs – The meta values to be updated, as keyword arguments. This will override the values provided in `meta_dict`.
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `WRITE_META` capacity (and possibly the `READ_META` capacity) is(are) absent.

root_dir¶: Get the absolute path of the root directory.

sample_names(n_samples)¶

Sample n_samples file names from this DataFS.

Parameters:	n_samples (int) – Number of names to sample. Fewer names may be returned if this `DataFS` contains fewer than `n_samples` files.
Returns:	The list of sampled file names.
Return type:	list[str]
Raises:	`UnsupportedOperation` – If `RANDOM_SAMPLE` capacity is absent.

class mlsnippet.datafs.MongoFS(conn_str, db_name, coll_name, strict=False)¶

Bases: mlsnippet.datafs.base.DataFS, mlsnippet.utils.mongo_binder.MongoBinder

MongoDB GridFS based DataFS.

This class provides a DataFS that saves the files in a MongoDB GridFS and stores the meta values in the metadata field of each record in the fs collection.

__init__(conn_str, db_name, coll_name, strict=False)¶

Construct a new MongoFS.

Parameters:	conn_str (str) – The MongoDB connection string. db_name (str) – The MongoDB database name. coll_name (str) – The collection name (prefix) of the GridFS. strict (bool) – Whether or not this `DataFS` works in strict mode? (default `False`)

batch_get_meta(filenames, meta_keys)¶

Get meta data of files.

Parameters:

filenames (Iterable[str]) – The names of the files.
meta_keys (Iterable[str]) – The keys of the meta data.

Returns:

A list with one entry per file: a tuple of meta values, or None if the corresponding file does not exist.

Return type:

list[tuple[any] or None]

batch_isfile(filenames)¶

Check whether or not the files exist.

Parameters:	filenames (Iterable[str]) – The names of the files.
Returns:	A list of indicators, where `True` if the corresponding `filename` exists and is a file, and `False` otherwise.
Return type:	list[bool]

clear_and_put_meta(filename, meta_dict=None, **meta_dict_kwargs)¶

Set the meta data of a file. The un-mentioned meta data will be cleared. This method is not necessarily slower than put_meta().

Parameters:	filename (str) – The name of the file. meta_dict (dict[str, any]) – The meta values to be updated. **meta_dict_kwargs – The meta values to be updated, as keyword arguments. This will override the values provided in `meta_dict`.
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `WRITE_META` capacity (and possibly the `LIST_META` capacity) is(are) absent.

clear_meta(filename)¶

Clear all the meta data of a file.

Parameters:	filename (str) – The name of the file.
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `WRITE_META` capacity (and possibly the `LIST_META` capacity) is(are) absent.

clone()¶

Obtain a clone of this DataFS instance.

Returns:	The cloned `DataFS`. Only the construction arguments will be copied. All the internal states (e.g., database connections) are kept un-initialized.
Return type:	DataFS

count()¶

Count the files in this DataFS.

Will iterate through all the files via iter_names(), if QUICK_COUNT capacity is absent.

Returns:	The total number of files.
Return type:	int

get_meta(filename, meta_keys)¶

Get meta data of a file.

Parameters:	filename (str) – The name of the file. meta_keys (Iterable[str]) – The keys of the meta data.
Returns:	The meta values, corresponding to `meta_keys`. If a requested key is absent for a file, `None` will take the place.
Return type:	tuple[any]
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `READ_META` capacity is absent.

get_meta_dict(filename)¶

Get all the meta data of a file, as a dict.

Parameters:	filename (str) – The name of the file.
Returns:	The meta values, as a dict.
Return type:	dict[str, any]
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `READ_META` or `LIST_META` capacity is absent.

isfile(filename)¶

Check whether or not a file exists.

Parameters:	filename (str) – The name of the file.
Returns:	`True` if `filename` exists and is a file, and `False` otherwise.
Return type:	bool

iter_files(meta_keys=None)¶

Iterate through all the files in this DataFS.

Parameters:

meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)

Yields:

(filename, content, [meta-data…]) –

A tuple containing the name of a file, its content, and the value of each meta key in meta_keys. If a requested key is absent for a file, None takes its place.

Raises:

UnsupportedOperation – If meta_keys is specified, but READ_META capacity is absent.

iter_names()¶

Iterate through all the file names in this DataFS.

Yields:	str – The file name of each file.

list_meta(filename)¶

List the meta keys of a file.

Parameters:	filename (str) – The name of the file.
Returns:	The keys of the meta data of the file.
Return type:	tuple[str]
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `LIST_META` capacity is absent.

open(filename, mode)¶

Open a file-like object to read / write a file.

Parameters:	filename (str) – The name of the file. mode ({'r', 'w'}) – The open mode of the file, either ‘r’ for reading or ‘w’ for writing. Other modes are not supported in general.
Returns:	The file-like object. This object will be immediately closed as soon as this `DataFS` instance is closed.
Return type:	file-like
Raises:	`InvalidOpenMode` – If the specified mode is not supported, e.g., `mode == 'w'` but `WRITE_DATA` capacity is absent. `DataFileNotExist` – If `mode == 'r'` but filename does not exist.

put_data(filename, data)¶

Save the content of a file.

Parameters:	filename (str) – The name of the file. data (bytes or file-like) – The content of the file, or a file-like object with `read(size)` method.
Raises:	`UnsupportedOperation` – If `WRITE_DATA` capacity is absent.

put_meta(filename, meta_dict=None, **meta_dict_kwargs)¶

Update the meta data of a file. The un-mentioned meta data will remain unchanged. This method is not necessarily faster than clear_and_put_meta(). In some backends it may be implemented by first calling get_meta_dict, then updating the meta dict in memory, and finally calling clear_and_put_meta.

Parameters:	filename (str) – The name of the file. meta_dict (dict[str, any]) – The meta values to be updated. **meta_dict_kwargs – The meta values to be updated, as keyword arguments. This will override the values provided in `meta_dict`.
Raises:	`DataFileNotExist` – If filename does not exist. `UnsupportedOperation` – If the `WRITE_META` capacity (and possibly the `READ_META` capacity) is(are) absent.

retrieve(filename, meta_keys=None)¶

Retrieve the content and maybe meta data of a file.

Parameters:

filename (str) – The name of the file to be retrieved.
meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)

Returns:

The content alone, or a tuple containing the content and the meta values corresponding to meta_keys. If a requested key is absent for a file, None takes its place.

Return type:

bytes or (bytes, [meta-data…])

Notes

As long as meta_keys is not None, a tuple will always be returned, even if meta_keys is an empty collection.

Raises:	`UnsupportedOperation` – If `meta_keys` is specified, but `READ_META` capacity is absent. `DataFileNotExist` – If filename does not exist.

sample_files(n_samples, meta_keys=None)¶

Sample n_samples files from this DataFS.

Parameters:	n_samples (int) – The number of files to sample. meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default `None`)
Returns:	A list of tuples, each containing the name of a file, its content, and the value of each meta key in `meta_keys`. If a requested key is absent for a file, `None` takes its place.
Return type:	list[(filename, content, [meta-data…])]
Raises:	`UnsupportedOperation` – If `RANDOM_SAMPLE` capacity is absent, or `meta_keys` is specified, but `READ_META` capacity is absent.

sample_names(n_samples)¶

Sample n_samples file names from this DataFS.

Parameters:	n_samples (int) – Number of names to sample. Fewer names may be returned if this `DataFS` contains fewer than `n_samples` files.
Returns:	The list of sampled file names.
Return type:	list[str]
Raises:	`UnsupportedOperation` – If `RANDOM_SAMPLE` capacity is absent.

class mlsnippet.datafs.DataFSForwardFlow(fs, batch_size, with_names=True, meta_keys=None, skip_incomplete=False)¶

Bases: mlsnippet.datafs.dataflow._BaseDataFSFlow

A DataFS derived DataFlow, iterating through mini-batches in a forward-only fashion (data are obtained by iter_files()).

__init__(fs, batch_size, with_names=True, meta_keys=None, skip_incomplete=False)¶

Construct a new DataFSForwardFlow.

Parameters:

fs (DataFS) – The DataFS instance to read data from.
batch_size (int) – Size of each mini-batch.
with_names (bool) – Whether or not to include the file names in mini-batches? (default True)
meta_keys (None or Iterable[str]) – The keys of the meta data to be included in mini-batches. (default None)
skip_incomplete (bool) – Whether or not to exclude a mini-batch, if it has fewer data than batch_size? (default False, the final mini-batch will always be visited even if it has fewer data than batch_size)

class mlsnippet.datafs.DataFSIndexedFlow(fs, batch_size, names, with_names=True, meta_keys=None, shuffle=False, skip_incomplete=False, random_state=None)¶

Bases: mlsnippet.datafs.dataflow._BaseDataFSFlow

A DataFS derived DataFlow, iterating through mini-batches according to given names (data are obtained by retrieve()).

__init__(fs, batch_size, names, with_names=True, meta_keys=None, shuffle=False, skip_incomplete=False, random_state=None)¶

Construct a new DataFSIndexedFlow.

Parameters:

fs (DataFS) – The DataFS instance to read data from.
batch_size (int) – Size of each mini-batch.
names (list[str] or np.ndarray[str]) – The names to retrieve.
with_names (bool) – Whether or not to include the file names in mini-batches? (default True)
meta_keys (None or Iterable[str]) – The keys of the meta data to be included in mini-batches. (default None)
shuffle (bool) – Whether or not to shuffle the name indices before each epoch? (default False)
skip_incomplete (bool) – Whether or not to exclude a mini-batch, if it has fewer data than batch_size? (default False, the final mini-batch will always be visited even if it has fewer data than batch_size)
random_state (RandomState) – Optional numpy RandomState for shuffling data before each epoch. (default None, use the global RandomState).

is_shuffled¶: Whether or not to shuffle the names before each epoch?

names¶

Get the names of files to retrieve.

Returns:	The names, as numpy array.
Return type:	np.ndarray[str]

class mlsnippet.datafs.DataFSRandomFlow(fs, batch_size, with_names=True, meta_keys=None, batch_count=None, skip_incomplete=False)¶

Bases: mlsnippet.datafs.dataflow._BaseDataFSFlow

A DataFS derived DataFlow, obtaining random samples from the DataFS.

__init__(fs, batch_size, with_names=True, meta_keys=None, batch_count=None, skip_incomplete=False)¶

Construct a new DataFSRandomFlow.

Parameters:

fs (DataFS) – The DataFS instance to read data from.
batch_size (int) – Size of each mini-batch.
with_names (bool) – Whether or not to include the file names in mini-batches? (default True)
meta_keys (None or Iterable[str]) – The keys of the meta data to be included in mini-batches. (default None)
batch_count (int or None) – The number of mini-batches to obtain in an epoch. (default None, infinite mini-batches)
skip_incomplete (bool) – Whether or not to exclude a mini-batch, if it has fewer data than batch_size? (default False, the final mini-batch will always be visited even if it has fewer data than batch_size)

batch_count¶: Get the number of mini-batches to obtain in an epoch.
