mlsnippet.datafs

class mlsnippet.datafs.TarArchiveFS(archive_file, strict=False)

Bases: mlsnippet.datafs.archivefs._ArchiveFS

Tar archive file based DataFS.

__init__(archive_file, strict=False)

Construct a new TarArchiveFS.

Parameters:
  • archive_file (str) – Path of the archive file.
  • strict (bool) – Whether or not this DataFS works in strict mode? (default False)
_close()

Override this method to destroy the internal states.

_init()

Override this method to initialize the internal states.

isfile(filename)

Check whether or not a file exists.

Parameters:filename (str) – The name of the file.
Returns:
True if filename exists and is a file,
and False otherwise.
Return type:bool
iter_files(meta_keys=None)

Iterate through all the files in this DataFS.

Parameters:

meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)

Yields:

(filename, content, [meta-data…]) – A tuple containing the name of a file, its content, and the value of each meta key in meta_keys. If a requested key is absent for a file, None takes its place.

Raises:

UnsupportedOperation – If meta_keys is specified, but READ_META capacity is absent.
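
As a rough illustration of the iteration contract above, the sketch below walks a tar archive with the standard-library tarfile module and yields (filename, content) pairs. It is not taken from the TarArchiveFS source: build_tar and iter_tar_files are hypothetical helpers, and the archive is built in memory only to keep the example self-contained.

```python
import io
import tarfile


def build_tar(files):
    """Hypothetical helper: pack {name: bytes} into an in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w') as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    buf.seek(0)
    return buf


def iter_tar_files(fileobj):
    """Hypothetical helper: yield (filename, content) for each regular file."""
    with tarfile.open(fileobj=fileobj, mode='r') as tar:
        for member in tar:
            if member.isfile():  # skip directories, links, etc.
                yield member.name, tar.extractfile(member).read()


files = {'a.txt': b'1', 'dir/b.txt': b'22'}
found = dict(iter_tar_files(build_tar(files)))
```

Iterating the archive member by member avoids loading the whole file list before the first file is yielded.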

iter_names()

Iterate through all the file names in this DataFS.

Yields:str – The file name of each file.
open(filename, mode)

Open a file-like object to read / write a file.

Parameters:
  • filename (str) – The name of the file.
  • mode ({'r', 'w'}) – The open mode of the file, either ‘r’ for reading or ‘w’ for writing. Other modes are not supported in general.
Returns:

The file-like object. This object will be closed as soon as this DataFS instance is closed.

Return type:

file-like

Raises:
  • InvalidOpenMode – If the specified mode is not supported, e.g., mode == 'w' but WRITE_DATA capacity is absent.
  • DataFileNotExist – If mode == 'r' but filename does not exist.
class mlsnippet.datafs.ZipArchiveFS(archive_file, strict=False)

Bases: mlsnippet.datafs.archivefs._ArchiveFS

Zip archive file based DataFS.

__init__(archive_file, strict=False)

Construct a new ZipArchiveFS.

Parameters:
  • archive_file (str) – Path of the archive file.
  • strict (bool) – Whether or not this DataFS works in strict mode? (default False)
_close()

Override this method to destroy the internal states.

_init()

Override this method to initialize the internal states.

isfile(filename)

Check whether or not a file exists.

Parameters:filename (str) – The name of the file.
Returns:
True if filename exists and is a file,
and False otherwise.
Return type:bool
iter_files(meta_keys=None)

Iterate through all the files in this DataFS.

Parameters:

meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)

Yields:

(filename, content, [meta-data…]) – A tuple containing the name of a file, its content, and the value of each meta key in meta_keys. If a requested key is absent for a file, None takes its place.

Raises:

UnsupportedOperation – If meta_keys is specified, but READ_META capacity is absent.

iter_names()

Iterate through all the file names in this DataFS.

Yields:str – The file name of each file.
open(filename, mode)

Open a file-like object to read / write a file.

Parameters:
  • filename (str) – The name of the file.
  • mode ({'r', 'w'}) – The open mode of the file, either ‘r’ for reading or ‘w’ for writing. Other modes are not supported in general.
Returns:

The file-like object. This object will be closed as soon as this DataFS instance is closed.

Return type:

file-like

Raises:
  • InvalidOpenMode – If the specified mode is not supported, e.g., mode == 'w' but WRITE_DATA capacity is absent.
  • DataFileNotExist – If mode == 'r' but filename does not exist.
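
The behaviour of isfile() and open(filename, 'r') can be pictured with the standard-library zipfile module. This is only a sketch, under the assumption that directory entries do not count as files; zip_isfile is a hypothetical helper and not ZipArchiveFS code.

```python
import io
import zipfile


def zip_isfile(zf, filename):
    """Hypothetical helper: True if `filename` is a regular entry in `zf`."""
    try:
        info = zf.getinfo(filename)  # raises KeyError for unknown names
    except KeyError:
        return False
    return not info.is_dir()


# Build a small archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('data/a.txt', b'hello')
    zf.writestr('data/empty/', b'')  # explicit directory entry

with zipfile.ZipFile(buf, 'r') as zf:
    exists = zip_isfile(zf, 'data/a.txt')
    dir_counts_as_file = zip_isfile(zf, 'data/empty/')
    missing = zip_isfile(zf, 'data/b.txt')
    with zf.open('data/a.txt', 'r') as f:  # read-only file-like object
        content = f.read()
```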
class mlsnippet.datafs.DataFSCapacity(mode=0)

Bases: object

Enumeration class to represent the capacity of a DataFS.

There are 7 different categories of capacities. A method of DataFS may work only if the DataFS has the one or more capacities that the method requires. To check whether a DataFS has a certain capacity, call can_[capacity_name]().

ALL = 127

All capacities are supported.

LIST_META = 16

Can enumerate the meta keys for a particular file.

QUICK_COUNT = 32

Can get the count of files without iterating through them.

RANDOM_SAMPLE = 64

Can randomly sample files without obtaining the whole file list.

READ_DATA = 1

Can read file data, the basic capacity of a DataFS.

READ_META = 4

Can read meta data.

READ_WRITE_DATA = 3

Can read and write file data.

READ_WRITE_META = 12

Can read and write meta data.

WRITE_DATA = 2

Can write file data.

WRITE_META = 8

Can write meta data.

__init__(mode=0)

Construct a new DataFSCapacity.

Parameters:mode (int) – The mode number of this capacity flag.
can_list_meta()
can_quick_count()
can_random_sample()
can_read_data()
can_read_meta()
can_write_data()
can_write_meta()
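
The numeric values listed for this class are consistent with a bit-flag layout: each basic capacity occupies one bit, and the combined constants are bitwise ORs of the basic ones. The sketch below reproduces only that arithmetic with plain integers. The has helper is a hypothetical stand-in for the can_*() methods, not the library's implementation.

```python
# Flag values copied from the constants documented above.
READ_DATA = 1
WRITE_DATA = 2
READ_META = 4
WRITE_META = 8
LIST_META = 16
QUICK_COUNT = 32
RANDOM_SAMPLE = 64

# The combined constants are bitwise ORs of the basic flags.
READ_WRITE_DATA = READ_DATA | WRITE_DATA
READ_WRITE_META = READ_META | WRITE_META
ALL = (READ_WRITE_DATA | READ_WRITE_META |
       LIST_META | QUICK_COUNT | RANDOM_SAMPLE)


def has(mode, flag):
    """Hypothetical helper: True if every bit of `flag` is set in `mode`."""
    return (mode & flag) == flag


# A read-only backend that also supports reading meta data.
mode = READ_DATA | READ_META
can_read_meta = has(mode, READ_META)
can_write_data = has(mode, WRITE_DATA)
```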
class mlsnippet.datafs.DataFS(capacity, strict=False)

Bases: mlsnippet.utils.concepts.AutoInitAndCloseable

Base class for all data file systems.

A DataFS provides access to a machine learning dataset stored in a file-system-like backend. For example, large image datasets are usually stored as raw image files, gathered in a directory. Such a true file system can be accessed by LocalFS.

Apart from a true file system, some may instead store these images in a virtual file system provided by a database, for example, the GridFS of MongoDB, which can be accessed via MongoFS.

__init__(capacity, strict=False)

Initialize the base DataFS class.

Parameters:
  • capacity (int or DataFSCapacity) – Specify the capacity of the derived DataFS.
  • strict (bool) –

    Whether or not this DataFS works in strict mode? (default False)

    In strict mode, the following behaviours will take place:

    1. Accessing the value of a non-existent meta key will raise MetaKeyNotExist, instead of returning None.
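
The strict-mode rule for meta keys can be sketched as follows. Everything here is illustrative: the exception class is a local stand-in for mlsnippet.datafs.MetaKeyNotExist, and lookup_meta is a hypothetical helper operating on a plain dict rather than on a real backend.

```python
class MetaKeyNotExist(Exception):
    """Local stand-in for mlsnippet.datafs.MetaKeyNotExist (illustration only)."""

    def __init__(self, filename, meta_key):
        super().__init__(filename, meta_key)
        self.filename = filename
        self.meta_key = meta_key


def lookup_meta(meta, filename, meta_keys, strict=False):
    """Hypothetical helper mimicking the documented strict-mode rule."""
    values = []
    for key in meta_keys:
        if key in meta:
            values.append(meta[key])
        elif strict:
            raise MetaKeyNotExist(filename, key)  # strict: missing key is an error
        else:
            values.append(None)  # non-strict: missing key yields None
    return tuple(values)


meta = {'label': 3}
relaxed = lookup_meta(meta, 'a.jpg', ['label', 'width'])
```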
as_flow(batch_size, with_names=True, meta_keys=None, shuffle=False, skip_incomplete=False, names_pattern=None)

Construct a DataFlow, which iterates through the files once and only once in an epoch.

The returned DataFSFlow will hold a copy of this instance (obtained by clone()) instead of holding this instance itself.

Parameters:
  • batch_size (int) – Size of each mini-batch.
  • with_names (bool) – Whether or not to include the file names in the returned flow? (default True)
  • meta_keys (None or Iterable[str]) – The keys of the meta data to be included in the returned flow. (default None)
  • shuffle (bool) – Whether or not to shuffle the files in each epoch of the flow? Setting this to True will force loading the file list into memory. (default False)
  • skip_incomplete (bool) – Whether or not to exclude a mini-batch, if it has fewer data than batch_size? (default False, the final mini-batch will always be visited even if it has fewer data than batch_size)
  • names_pattern (None or str or regex) – The file name pattern. If specified, only if the file name matches this pattern, would the file be included in the constructed data flow. Specifying this option will force loading the file list into memory. (default None)
Returns:

A dataflow, in which each mini-batch consists of numpy arrays ([filename,] content, [meta-data...]), according to the arguments.

Return type:

tfsnippet.dataflow.DataFlow
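
The effect of skip_incomplete on the final mini-batch can be sketched with a plain generator. iter_batches is a hypothetical helper that batches any iterable; it does not build numpy arrays or a DataFlow, and only mirrors the batching rule described above.

```python
def iter_batches(items, batch_size, skip_incomplete=False):
    """Hypothetical helper: group `items` into lists of `batch_size`."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # The trailing batch is shorter than batch_size; keep it unless told to skip.
    if batch and not skip_incomplete:
        yield batch


names = ['0.png', '1.png', '2.png', '3.png', '4.png']
with_tail = list(iter_batches(names, 2))
without_tail = list(iter_batches(names, 2, skip_incomplete=True))
```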

batch_get_meta(filenames, meta_keys)

Get meta data of files.

Parameters:
  • filenames (Iterable[str]) – The names of the files.
  • meta_keys (Iterable[str]) – The keys of the meta data.
Returns:

A list with one entry per file name: a tuple of meta values, or None if the corresponding file does not exist.

Return type:

list[tuple[any] or None]
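
The shape of the return value, list[tuple[any] or None], can be illustrated with an in-memory mapping. batch_get_meta_sketch is a hypothetical function, and the non-strict behaviour (None for an absent key) is assumed here.

```python
def batch_get_meta_sketch(meta_by_file, filenames, meta_keys):
    """Hypothetical helper returning one entry per requested file name."""
    meta_keys = tuple(meta_keys)
    result = []
    for name in filenames:
        file_meta = meta_by_file.get(name)
        if file_meta is None:
            result.append(None)  # the file does not exist
        else:
            result.append(tuple(file_meta.get(k) for k in meta_keys))
    return result


meta_by_file = {'a.png': {'label': 1, 'width': 28}, 'b.png': {'label': 0}}
rows = batch_get_meta_sketch(
    meta_by_file, ['a.png', 'missing.png', 'b.png'], ['label', 'width'])
```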

batch_isfile(filenames)

Check whether or not the files exist.

Parameters:filenames (Iterable[str]) – The names of the files.
Returns:
A list of indicators, one per file name: True if the corresponding filename exists and is a file, and False otherwise.
Return type:list[bool]
capacity

Get the capacity of this DataFS.

Returns:The capacity object.
Return type:DataFSCapacity
clear_and_put_meta(filename, meta_dict=None, **meta_dict_kwargs)

Set the meta data of a file. The un-mentioned meta data will be cleared. This method is not necessarily slower than put_meta().

Parameters:
  • filename (str) – The name of the file.
  • meta_dict (dict[str, any]) – The meta values to be updated.
  • **meta_dict_kwargs – The meta values to be updated, as keyword arguments. This will override the values provided in meta_dict.
Raises:
clear_meta(filename)

Clear all the meta data of a file.

Parameters:

filename (str) – The name of the file.

Raises:
clone()

Obtain a clone of this DataFS instance.

Returns:
The cloned DataFS. Only the construction
arguments will be copied. All the internal states (e.g., database connections) are kept un-initialized.
Return type:DataFS
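
The clone contract (copy the construction arguments, leave internal states un-initialized) might look like the sketch below. SketchFS is a hypothetical class written for this example; it does not derive from DataFS, and its init method only stands in for whatever initialization a real backend performs.

```python
class SketchFS:
    """Hypothetical DataFS-like class (illustration only)."""

    def __init__(self, root_dir, strict=False):
        self.root_dir = root_dir
        self.strict = strict
        self._handle = None  # internal state, created lazily

    def init(self):
        self._handle = object()  # stands in for e.g. a database connection

    def clone(self):
        # Re-run the constructor with the same arguments; _handle is not copied.
        return SketchFS(self.root_dir, strict=self.strict)


fs = SketchFS('/data/images', strict=True)
fs.init()
copy = fs.clone()
```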
count()

Count the files in this DataFS.

Will iterate through all the files via iter_names(), if QUICK_COUNT capacity is absent.

Returns:The total number of files.
Return type:int
get_data(filename)

Get the content of a file.

Parameters:filename (str) – The name of the file.
Returns:The content of the file.
Return type:bytes
Raises:DataFileNotExist – If filename does not exist.
get_meta(filename, meta_keys)

Get meta data of a file.

Parameters:
  • filename (str) – The name of the file.
  • meta_keys (Iterable[str]) – The keys of the meta data.
Returns:

The meta values, corresponding to meta_keys.

If a requested key is absent for a file, None takes its place.

Return type:

tuple[any]

Raises:
get_meta_dict(filename)

Get all the meta data of a file, as a dict.

Parameters:

filename (str) – The name of the file.

Returns:

The meta values, as a dict.

Return type:

dict[str, any]

Raises:
isfile(filename)

Check whether or not a file exists.

Parameters:filename (str) – The name of the file.
Returns:
True if filename exists and is a file,
and False otherwise.
Return type:bool
iter_files(meta_keys=None)

Iterate through all the files in this DataFS.

Parameters:

meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)

Yields:

(filename, content, [meta-data…]) – A tuple containing the name of a file, its content, and the value of each meta key in meta_keys. If a requested key is absent for a file, None takes its place.

Raises:

UnsupportedOperation – If meta_keys is specified, but READ_META capacity is absent.

iter_names()

Iterate through all the file names in this DataFS.

Yields:str – The file name of each file.
list_meta(filename)

List the meta keys of a file.

Parameters:

filename (str) – The name of the file.

Returns:

The keys of the meta data of the file.

Return type:

tuple[str]

Raises:
list_names()

Get the list of all the file names.

Returns:The file names list.
Return type:list[str]
open(filename, mode)

Open a file-like object to read / write a file.

Parameters:
  • filename (str) – The name of the file.
  • mode ({'r', 'w'}) – The open mode of the file, either ‘r’ for reading or ‘w’ for writing. Other modes are not supported in general.
Returns:

The file-like object. This object will be closed as soon as this DataFS instance is closed.

Return type:

file-like

Raises:
  • InvalidOpenMode – If the specified mode is not supported, e.g., mode == 'w' but WRITE_DATA capacity is absent.
  • DataFileNotExist – If mode == 'r' but filename does not exist.
put_data(filename, data)

Save the content of a file.

Parameters:
  • filename (str) – The name of the file.
  • data (bytes or file-like) – The content of the file, or a file-like object with read(size) method.
Raises:

UnsupportedOperation – If WRITE_DATA capacity is absent.

put_meta(filename, meta_dict=None, **meta_dict_kwargs)

Update the meta data of a file. The un-mentioned meta data will remain unchanged. This method is not necessarily faster than clear_and_put_meta(). In some backends it may be implemented by first calling get_meta_dict, then updating the meta dict in memory, and finally calling clear_and_put_meta.
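
The fallback implementation described above (read the meta dict, update it in memory, then write it back) can be sketched with an in-memory store. InMemoryMetaStore is a hypothetical class for illustration; it is not one of the library's backends.

```python
class InMemoryMetaStore:
    """Hypothetical backend keeping meta dicts in memory (illustration only)."""

    def __init__(self):
        self._meta = {}

    def get_meta_dict(self, filename):
        return dict(self._meta.get(filename, {}))

    def clear_and_put_meta(self, filename, meta_dict=None, **meta_dict_kwargs):
        merged = dict(meta_dict or {})
        merged.update(meta_dict_kwargs)  # keyword arguments override meta_dict
        self._meta[filename] = merged    # un-mentioned keys are dropped

    def put_meta(self, filename, meta_dict=None, **meta_dict_kwargs):
        current = self.get_meta_dict(filename)  # 1. read the existing meta
        current.update(meta_dict or {})         # 2. update it in memory
        current.update(meta_dict_kwargs)
        self.clear_and_put_meta(filename, current)  # 3. write everything back


store = InMemoryMetaStore()
store.clear_and_put_meta('a.png', {'x': 1, 'y': 2})
store.put_meta('a.png', {'y': 3}, z=4)
after_put = store.get_meta_dict('a.png')
store.clear_and_put_meta('a.png', {'y': 0}, y=5)
after_clear_and_put = store.get_meta_dict('a.png')
```

In this sketch put_meta costs one read plus one full write, which is why it need not be faster than clear_and_put_meta.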

Parameters:
  • filename (str) – The name of the file.
  • meta_dict (dict[str, any]) – The meta values to be updated.
  • **meta_dict_kwargs – The meta values to be updated, as keyword arguments. This will override the values provided in meta_dict.
Raises:
random_flow(batch_size, with_names=True, meta_keys=None, skip_incomplete=False, batch_count=None)

Construct a DataFlow, with infinite or pre-configured number of mini-batches in an epoch, randomly sampled from the whole DataFS.

The returned DataFSRandomFlow will hold a copy of this instance (obtained by clone()) instead of holding this instance itself.

Parameters:
  • batch_size (int) – Size of each mini-batch.
  • with_names (bool) – Whether or not to include the file names in the returned flow? (default True)
  • meta_keys (None or Iterable[str]) – The keys of the meta data to be included in the returned flow. (default None)
  • skip_incomplete (bool) – Whether or not to exclude a mini-batch, if it has fewer data than batch_size? (default False, the final mini-batch will always be visited even if it has fewer data than batch_size)
  • batch_count (int or None) – The number of mini-batches to obtain in an epoch. (default None, infinite mini-batches)
Returns:

A dataflow, in which each mini-batch consists of numpy arrays ([filename,] content, [meta-data...]), according to the arguments.

Return type:

tfsnippet.dataflow.DataFlow

Raises:

UnsupportedOperation – If RANDOM_SAMPLE capacity is absent.
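
The difference between batch_count=None (an unbounded epoch) and a fixed batch_count can be sketched with a generator over file names. random_batches is a hypothetical helper; sampling without replacement within a mini-batch is an assumption of this sketch, not a documented property of DataFSRandomFlow.

```python
import itertools
import random


def random_batches(names, batch_size, batch_count=None, seed=None):
    """Hypothetical generator: one epoch of randomly sampled mini-batches."""
    rng = random.Random(seed)
    # None means an unbounded epoch; otherwise stop after batch_count batches.
    steps = itertools.count() if batch_count is None else range(batch_count)
    for _ in steps:
        # Assumption: sample without replacement within a mini-batch.
        yield rng.sample(names, min(batch_size, len(names)))


names = ['a.png', 'b.png', 'c.png', 'd.png']
finite = list(random_batches(names, batch_size=2, batch_count=3, seed=0))
# An unbounded epoch must be cut off by the consumer.
first_five = list(itertools.islice(random_batches(names, 2, seed=0), 5))
```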

retrieve(filename, meta_keys=None)

Retrieve the content and maybe meta data of a file.

Parameters:
  • filename (str) – The name of the file to be retrieved.
  • meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)
Returns:

The content, or a tuple containing the content and the meta values corresponding to meta_keys. If a requested key is absent for a file, None takes its place.

Return type:

bytes or (bytes, [meta-data…])

Notes

As long as meta_keys is not None, a tuple will always be returned, even if meta_keys is an empty collection.
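
The note on return shapes can be made concrete with a small stand-in. retrieve_sketch is a hypothetical function over two plain dicts; it only mirrors the documented rule that meta_keys=None yields bare content, while any other value, including an empty collection, yields a tuple.

```python
def retrieve_sketch(contents, meta_by_file, filename, meta_keys=None):
    """Hypothetical helper mirroring the documented return shapes."""
    content = contents[filename]
    if meta_keys is None:
        return content  # bare bytes
    file_meta = meta_by_file.get(filename, {})
    return (content,) + tuple(file_meta.get(k) for k in meta_keys)


contents = {'a.png': b'\x89PNG'}
meta_by_file = {'a.png': {'label': 1}}

bare = retrieve_sketch(contents, meta_by_file, 'a.png')
empty_keys = retrieve_sketch(contents, meta_by_file, 'a.png', meta_keys=[])
with_meta = retrieve_sketch(contents, meta_by_file, 'a.png', ['label', 'width'])
```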

Raises:
sample_files(n_samples, meta_keys=None)

Sample n_samples files from this DataFS.

Parameters:
  • n_samples (int) – The number of files to sample.
  • meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)
Returns:

A list of tuples, each containing the name of a file, its content, and the value of each meta key in meta_keys. If a requested key is absent for a file, None takes its place.

Return type:

list[(filename, content, [meta-data…])]

Raises:

UnsupportedOperation – If RANDOM_SAMPLE capacity is absent, or meta_keys is specified, but READ_META capacity is absent.

sample_names(n_samples)

Sample n_samples file names from this DataFS.

Parameters:n_samples (int) – Number of names to sample. Fewer names may be returned if this DataFS contains fewer than n_samples files.
Returns:The list of sampled file names.
Return type:list[str]
Raises:UnsupportedOperation – If RANDOM_SAMPLE capacity is absent.
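
One way to obtain the "may return fewer names" behaviour is to cap the sample size at the number of available files. The sketch below assumes sampling without replacement, which the text above suggests but does not state; sample_names_sketch is a hypothetical helper.

```python
import random


def sample_names_sketch(names, n_samples, rng=None):
    """Hypothetical helper: sample up to n_samples distinct names."""
    rng = rng or random.Random()
    names = list(names)
    # Cap the request so that a small DataFS yields a shorter list, not an error.
    return rng.sample(names, min(n_samples, len(names)))


all_names = ['a.png', 'b.png', 'c.png']
two = sample_names_sketch(all_names, 2, rng=random.Random(0))
capped = sample_names_sketch(all_names, 5, rng=random.Random(0))
```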
strict

Whether or not this DataFS works in strict mode?

sub_flow(batch_size, names, with_names=True, meta_keys=None, shuffle=False, skip_incomplete=False)

Construct a DataFlow, which iterates through the files according to selected names.

The returned DataFSFlow will hold a copy of this instance (obtained by clone()) instead of holding this instance itself.

Parameters:
  • batch_size (int) – Size of each mini-batch.
  • names (list[str] or np.ndarray[str]) – The names to retrieve.
  • with_names (bool) – Whether or not to include the file names in the returned flow? (default True)
  • meta_keys (None or Iterable[str]) – The keys of the meta data to be included in the returned flow. (default None)
  • shuffle (bool) – Whether or not to shuffle the files in each epoch of the flow? Setting this to True will force loading the file list into memory. (default False)
  • skip_incomplete (bool) – Whether or not to exclude a mini-batch, if it has fewer data than batch_size? (default False, the final mini-batch will always be visited even if it has fewer data than batch_size)
Returns:

A dataflow, in which each mini-batch consists of numpy arrays ([filename,] content, [meta-data...]), according to the arguments.

Return type:

tfsnippet.dataflow.DataFlow

exception mlsnippet.datafs.DataFSError

Bases: exceptions.Exception

Base class for all DataFS errors.

exception mlsnippet.datafs.UnsupportedOperation

Bases: mlsnippet.datafs.errors.DataFSError

Class to indicate that a requested operation is not supported by the specific DataFS subclass.

exception mlsnippet.datafs.InvalidOpenMode(mode)

Bases: mlsnippet.datafs.errors.UnsupportedOperation

Class to indicate that the specified open mode is not supported.

mode
exception mlsnippet.datafs.DataFileNotExist(filename)

Bases: mlsnippet.datafs.errors.DataFSError

Class to indicate a requested data file does not exist.

filename
exception mlsnippet.datafs.MetaKeyNotExist(filename, meta_key)

Bases: mlsnippet.datafs.errors.DataFSError

Class to indicate a requested meta key does not exist.

filename
meta_key
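
The "Bases" entries above imply a small hierarchy, so a handler for a base class also catches its subclasses. The classes below are local stand-ins that copy only the documented names, bases, and attributes; the try_open function is hypothetical.

```python
class DataFSError(Exception):
    """Stand-in for the documented base class of all DataFS errors."""


class UnsupportedOperation(DataFSError):
    pass


class InvalidOpenMode(UnsupportedOperation):
    def __init__(self, mode):
        super().__init__(mode)
        self.mode = mode


class DataFileNotExist(DataFSError):
    def __init__(self, filename):
        super().__init__(filename)
        self.filename = filename


def try_open(mode):
    """Hypothetical read-only open(): rejects every mode except 'r'."""
    if mode != 'r':
        raise InvalidOpenMode(mode)
    return 'opened'


try:
    try_open('w')
    outcome = 'opened'
except UnsupportedOperation as ex:  # also catches InvalidOpenMode
    outcome = 'unsupported mode: ' + ex.mode
```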
class mlsnippet.datafs.LocalFS(root_dir, strict=False)

Bases: mlsnippet.datafs.base.DataFS

Local directory based DataFS.

__init__(root_dir, strict=False)

Construct a new LocalFS.

Parameters:
  • root_dir (str) – The root directory for this LocalFS.
  • strict (bool) – Whether or not this DataFS works in strict mode? (default False)
_close()

Override this method to destroy the internal states.

_init()

Override this method to initialize the internal states.

clear_meta(filename)

Clear all the meta data of a file.

Parameters:

filename (str) – The name of the file.

Raises:
clone()

Obtain a clone of this DataFS instance.

Returns:
The cloned DataFS. Only the construction
arguments will be copied. All the internal states (e.g., database connections) are kept un-initialized.
Return type:DataFS
get_meta(filename, meta_keys)

Get meta data of a file.

Parameters:
  • filename (str) – The name of the file.
  • meta_keys (Iterable[str]) – The keys of the meta data.
Returns:

The meta values, corresponding to meta_keys.

If a requested key is absent for a file, None takes its place.

Return type:

tuple[any]

Raises:
isfile(filename)

Check whether or not a file exists.

Parameters:filename (str) – The name of the file.
Returns:
True if filename exists and is a file,
and False otherwise.
Return type:bool
iter_names()

Iterate through all the file names in this DataFS.

Yields:str – The file name of each file.
list_meta(filename)

List the meta keys of a file.

Parameters:

filename (str) – The name of the file.

Returns:

The keys of the meta data of the file.

Return type:

tuple[str]

Raises:
open(filename, mode)

Open a file-like object to read / write a file.

Parameters:
  • filename (str) – The name of the file.
  • mode ({'r', 'w'}) – The open mode of the file, either ‘r’ for reading or ‘w’ for writing. Other modes are not supported in general.
Returns:

The file-like object. This object will be closed as soon as this DataFS instance is closed.

Return type:

file-like

Raises:
  • InvalidOpenMode – If the specified mode is not supported, e.g., mode == 'w' but WRITE_DATA capacity is absent.
  • DataFileNotExist – If mode == 'r' but filename does not exist.
put_meta(filename, meta_dict=None, **meta_dict_kwargs)

Update the meta data of a file. The un-mentioned meta data will remain unchanged. This method is not necessarily faster than clear_and_put_meta(). In some backends it may be implemented by first calling get_meta_dict, then updating the meta dict in memory, and finally calling clear_and_put_meta.

Parameters:
  • filename (str) – The name of the file.
  • meta_dict (dict[str, any]) – The meta values to be updated.
  • **meta_dict_kwargs – The meta values to be updated, as keyword arguments. This will override the values provided in meta_dict.
Raises:
root_dir

Get the absolute path of the root directory.

sample_names(n_samples)

Sample n_samples file names from this DataFS.

Parameters:n_samples (int) – Number of names to sample. Fewer names may be returned if this DataFS contains fewer than n_samples files.
Returns:The list of sampled file names.
Return type:list[str]
Raises:UnsupportedOperation – If RANDOM_SAMPLE capacity is absent.
class mlsnippet.datafs.MongoFS(conn_str, db_name, coll_name, strict=False)

Bases: mlsnippet.datafs.base.DataFS, mlsnippet.utils.mongo_binder.MongoBinder

MongoDB GridFS based DataFS.

This class provides a DataFS which saves the files in a MongoDB GridFS, and stores the meta values in the metadata field of each record in the fs collection.

__init__(conn_str, db_name, coll_name, strict=False)

Construct a new MongoFS.

Parameters:
  • conn_str (str) – The MongoDB connection string.
  • db_name (str) – The MongoDB database name.
  • coll_name (str) – The collection name (prefix) of the GridFS.
  • strict (bool) – Whether or not this DataFS works in strict mode? (default False)
batch_get_meta(filenames, meta_keys)

Get meta data of files.

Parameters:
  • filenames (Iterable[str]) – The names of the files.
  • meta_keys (Iterable[str]) – The keys of the meta data.
Returns:

A list with one entry per file name: a tuple of meta values, or None if the corresponding file does not exist.

Return type:

list[tuple[any] or None]

batch_isfile(filenames)

Check whether or not the files exist.

Parameters:filenames (Iterable[str]) – The names of the files.
Returns:
A list of indicators, one per file name: True if the corresponding filename exists and is a file, and False otherwise.
Return type:list[bool]
clear_and_put_meta(filename, meta_dict=None, **meta_dict_kwargs)

Set the meta data of a file. The un-mentioned meta data will be cleared. This method is not necessarily slower than put_meta().

Parameters:
  • filename (str) – The name of the file.
  • meta_dict (dict[str, any]) – The meta values to be updated.
  • **meta_dict_kwargs – The meta values to be updated, as keyword arguments. This will override the values provided in meta_dict.
Raises:
clear_meta(filename)

Clear all the meta data of a file.

Parameters:

filename (str) – The name of the file.

Raises:
clone()

Obtain a clone of this DataFS instance.

Returns:
The cloned DataFS. Only the construction
arguments will be copied. All the internal states (e.g., database connections) are kept un-initialized.
Return type:DataFS
count()

Count the files in this DataFS.

Will iterate through all the files via iter_names(), if QUICK_COUNT capacity is absent.

Returns:The total number of files.
Return type:int
get_meta(filename, meta_keys)

Get meta data of a file.

Parameters:
  • filename (str) – The name of the file.
  • meta_keys (Iterable[str]) – The keys of the meta data.
Returns:

The meta values, corresponding to meta_keys.

If a requested key is absent for a file, None takes its place.

Return type:

tuple[any]

Raises:
get_meta_dict(filename)

Get all the meta data of a file, as a dict.

Parameters:

filename (str) – The name of the file.

Returns:

The meta values, as a dict.

Return type:

dict[str, any]

Raises:
isfile(filename)

Check whether or not a file exists.

Parameters:filename (str) – The name of the file.
Returns:
True if filename exists and is a file,
and False otherwise.
Return type:bool
iter_files(meta_keys=None)

Iterate through all the files in this DataFS.

Parameters:

meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)

Yields:

(filename, content, [meta-data…]) – A tuple containing the name of a file, its content, and the value of each meta key in meta_keys. If a requested key is absent for a file, None takes its place.

Raises:

UnsupportedOperation – If meta_keys is specified, but READ_META capacity is absent.

iter_names()

Iterate through all the file names in this DataFS.

Yields:str – The file name of each file.
list_meta(filename)

List the meta keys of a file.

Parameters:

filename (str) – The name of the file.

Returns:

The keys of the meta data of the file.

Return type:

tuple[str]

Raises:
open(filename, mode)

Open a file-like object to read / write a file.

Parameters:
  • filename (str) – The name of the file.
  • mode ({'r', 'w'}) – The open mode of the file, either ‘r’ for reading or ‘w’ for writing. Other modes are not supported in general.
Returns:

The file-like object. This object will be closed as soon as this DataFS instance is closed.

Return type:

file-like

Raises:
  • InvalidOpenMode – If the specified mode is not supported, e.g., mode == 'w' but WRITE_DATA capacity is absent.
  • DataFileNotExist – If mode == 'r' but filename does not exist.
put_data(filename, data)

Save the content of a file.

Parameters:
  • filename (str) – The name of the file.
  • data (bytes or file-like) – The content of the file, or a file-like object with read(size) method.
Raises:

UnsupportedOperation – If WRITE_DATA capacity is absent.

put_meta(filename, meta_dict=None, **meta_dict_kwargs)

Update the meta data of a file. The un-mentioned meta data will remain unchanged. This method is not necessarily faster than clear_and_put_meta(). In some backends it may be implemented by first calling get_meta_dict, then updating the meta dict in memory, and finally calling clear_and_put_meta.

Parameters:
  • filename (str) – The name of the file.
  • meta_dict (dict[str, any]) – The meta values to be updated.
  • **meta_dict_kwargs – The meta values to be updated, as keyword arguments. This will override the values provided in meta_dict.
Raises:
retrieve(filename, meta_keys=None)

Retrieve the content and maybe meta data of a file.

Parameters:
  • filename (str) – The name of the file to be retrieved.
  • meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)
Returns:

The content, or a tuple containing the content and the meta values corresponding to meta_keys. If a requested key is absent for a file, None takes its place.

Return type:

bytes or (bytes, [meta-data…])

Notes

As long as meta_keys is not None, a tuple will always be returned, even if meta_keys is an empty collection.

Raises:
sample_files(n_samples, meta_keys=None)

Sample n_samples files from this DataFS.

Parameters:
  • n_samples (int) – The number of files to sample.
  • meta_keys (None or Iterable[str]) – The keys of the meta data to be retrieved. (default None)
Returns:

A list of tuples, each containing the name of a file, its content, and the value of each meta key in meta_keys. If a requested key is absent for a file, None takes its place.

Return type:

list[(filename, content, [meta-data…])]

Raises:

UnsupportedOperation – If RANDOM_SAMPLE capacity is absent, or meta_keys is specified, but READ_META capacity is absent.

sample_names(n_samples)

Sample n_samples file names from this DataFS.

Parameters:n_samples (int) – Number of names to sample. Fewer names may be returned if this DataFS contains fewer than n_samples files.
Returns:The list of sampled file names.
Return type:list[str]
Raises:UnsupportedOperation – If RANDOM_SAMPLE capacity is absent.
class mlsnippet.datafs.DataFSForwardFlow(fs, batch_size, with_names=True, meta_keys=None, skip_incomplete=False)

Bases: mlsnippet.datafs.dataflow._BaseDataFSFlow

A DataFS derived DataFlow, iterating through mini-batches in a forward-only fashion (data are obtained by iter_files()).

__init__(fs, batch_size, with_names=True, meta_keys=None, skip_incomplete=False)

Construct a new DataFSForwardFlow.

Parameters:
  • fs (DataFS) – The DataFS instance to read data from.
  • batch_size (int) – Size of each mini-batch.
  • with_names (bool) – Whether or not to include the file names in mini-batches? (default True)
  • meta_keys (None or Iterable[str]) – The keys of the meta data to be included in mini-batches. (default None)
  • skip_incomplete (bool) – Whether or not to exclude a mini-batch, if it has fewer data than batch_size? (default False, the final mini-batch will always be visited even if it has fewer data than batch_size)
class mlsnippet.datafs.DataFSIndexedFlow(fs, batch_size, names, with_names=True, meta_keys=None, shuffle=False, skip_incomplete=False, random_state=None)

Bases: mlsnippet.datafs.dataflow._BaseDataFSFlow

A DataFS derived DataFlow, iterating through mini-batches according to given names (data are obtained by retrieve()).

__init__(fs, batch_size, names, with_names=True, meta_keys=None, shuffle=False, skip_incomplete=False, random_state=None)

Construct a new DataFSIndexedFlow.

Parameters:
  • fs (DataFS) – The DataFS instance to read data from.
  • batch_size (int) – Size of each mini-batch.
  • names (list[str] or np.ndarray[str]) – The names to retrieve.
  • with_names (bool) – Whether or not to include the file names in mini-batches? (default True)
  • meta_keys (None or Iterable[str]) – The keys of the meta data to be included in mini-batches. (default None)
  • shuffle (bool) – Whether or not to shuffle the name indices before each epoch? (default False)
  • skip_incomplete (bool) – Whether or not to exclude a mini-batch, if it has fewer data than batch_size? (default False, the final mini-batch will always be visited even if it has fewer data than batch_size)
  • random_state (RandomState) – Optional numpy RandomState for shuffling data before each epoch. (default None, use the global RandomState).
is_shuffled

Whether or not to shuffle the names before each epoch?

names

Get the names of files to retrieve.

Returns:The names, as numpy array.
Return type:np.ndarray[str]
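
Shuffling "the name indices before each epoch" can be sketched without numpy. indexed_epoch is a hypothetical generator, and the standard-library random.Random is used here in place of the numpy RandomState named in the parameters, purely to keep the example dependency-free.

```python
import random


def indexed_epoch(names, batch_size, shuffle=False, rng=None):
    """Hypothetical generator: yield one epoch of name mini-batches."""
    indices = list(range(len(names)))
    if shuffle:
        # Shuffle indices rather than names, so `names` itself is untouched.
        (rng or random.Random()).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [names[i] for i in indices[start:start + batch_size]]


names = ['a.png', 'b.png', 'c.png', 'd.png', 'e.png']
ordered = list(indexed_epoch(names, 2))
shuffled = list(indexed_epoch(names, 2, shuffle=True, rng=random.Random(0)))
```

Calling the generator again starts a new epoch with a fresh permutation, while the stored names keep their original order.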
class mlsnippet.datafs.DataFSRandomFlow(fs, batch_size, with_names=True, meta_keys=None, batch_count=None, skip_incomplete=False)

Bases: mlsnippet.datafs.dataflow._BaseDataFSFlow

A DataFS derived DataFlow, obtaining random samples from the DataFS.

__init__(fs, batch_size, with_names=True, meta_keys=None, batch_count=None, skip_incomplete=False)

Construct a new DataFSRandomFlow.

Parameters:
  • fs (DataFS) – The DataFS instance to read data from.
  • batch_size (int) – Size of each mini-batch.
  • with_names (bool) – Whether or not to include the file names in mini-batches? (default True)
  • meta_keys (None or Iterable[str]) – The keys of the meta data to be included in mini-batches. (default None)
  • batch_count (int or None) – The number of mini-batches to obtain in an epoch. (default None, infinite mini-batches)
  • skip_incomplete (bool) – Whether or not to exclude a mini-batch, if it has fewer data than batch_size? (default False, the final mini-batch will always be visited even if it has fewer data than batch_size)
batch_count

Get the number of mini-batches to obtain in an epoch.