Streams module

StreamGenerator([n_chunks, chunk_size, ...])

Data streams generator for both stationary and drifting data streams.

ARFFParser(path[, chunk_size, n_chunks])

Stream-aware parser of datasets in ARFF format.

CSVParser(path[, chunk_size, n_chunks])

Stream-aware parser of datasets in CSV format.

NPYParser(path[, chunk_size, n_chunks])

Stream-aware parser of datasets in NumPy (.npy) format.

SemiSyntheticStreamGenerator(X, y[, n_chunks, ...])

Semi-synthetic data streams generator for drifting data streams based on real-world data.

class strlearn.streams.ARFFParser(path, chunk_size=200, n_chunks=250)

Bases: object

Stream-aware parser of datasets in ARFF format.

Parameters:
  • path (string) – Path to the ARFF file.

  • chunk_size (integer, optional (default=200)) – The number of instances in each data chunk.

  • n_chunks (integer, optional (default=250)) – The number of data chunks that the stream is composed of.

Example:

>>> import strlearn as sl
>>> stream = sl.streams.ARFFParser("Agrawal.arff")
>>> clf = sl.classifiers.AccumulatedSamplesClassifier()
>>> evaluator = sl.evaluators.PrequentialEvaluator()
>>> evaluator.process(clf, stream)
>>> stream.reset()
>>> print(evaluator.scores_)
...
[[0.855      0.80815508 0.79478582 0.80815508 0.89679715]
[0.795      0.75827674 0.7426779  0.75827674 0.84644195]
[0.8        0.75313899 0.73559983 0.75313899 0.85507246]
...
[0.885      0.86181169 0.85534199 0.86181169 0.91119691]
[0.895      0.86935764 0.86452058 0.86935764 0.92134831]
[0.87       0.85104088 0.84813907 0.85104088 0.9       ]]
get_chunk()

Generating a data chunk of a stream.

Used by all evaluators but also accessible for custom evaluation.

Returns:

Generated samples and target values.

Return type:

tuple {array-like, shape (n_samples, n_features), array-like, shape (n_samples, )}

is_dry()

Checking if we have reached the end of the stream.

Returns:

flag showing if the stream has ended

Return type:

boolean

reset()

Reset the processed stream and close the ARFF file.
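
The parser can also be consumed directly, without one of the built-in evaluators. The following is a minimal test-then-train sketch using get_chunk(), is_dry() and reset(); it assumes the same Agrawal.arff file as in the example above and binary labels 0 and 1.

>>> import strlearn as sl
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.metrics import accuracy_score
>>> stream = sl.streams.ARFFParser("Agrawal.arff", chunk_size=200, n_chunks=250)
>>> clf = GaussianNB()
>>> X, y = stream.get_chunk()                # the first chunk only trains the model
>>> clf.partial_fit(X, y, classes=[0, 1])
>>> scores = []
>>> while not stream.is_dry():
...     X, y = stream.get_chunk()
...     scores.append(accuracy_score(y, clf.predict(X)))  # test on the new chunk ...
...     clf.partial_fit(X, y)                              # ... then train on it
>>> stream.reset()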

class strlearn.streams.CSVParser(path, chunk_size=200, n_chunks=250)

Bases: object

Stream-aware parser of datasets in CSV format.

Parameters:
  • path (string) – Path to the CSV file.

  • chunk_size (integer, optional (default=200)) – The number of instances in each data chunk.

  • n_chunks (integer, optional (default=250)) – The number of data chunks that the stream is composed of.

Example:

>>> import strlearn as sl
>>> stream = sl.streams.CSVParser("Agrawal.csv")
>>> clf = sl.classifiers.AccumulatedSamplesClassifier()
>>> evaluator = sl.evaluators.PrequentialEvaluator()
>>> evaluator.process(clf, stream)
>>> stream.reset()
>>> print(evaluator.scores_)
...
[[0.855      0.80815508 0.79478582 0.80815508 0.89679715]
[0.795      0.75827674 0.7426779  0.75827674 0.84644195]
[0.8        0.75313899 0.73559983 0.75313899 0.85507246]
...
[0.885      0.86181169 0.85534199 0.86181169 0.91119691]
[0.895      0.86935764 0.86452058 0.86935764 0.92134831]
[0.87       0.85104088 0.84813907 0.85104088 0.9       ]]
get_chunk()

Generating a data chunk of a stream.

Used by all evaluators but also accessible for custom evaluation.

Returns:

Generated samples and target values.

Return type:

tuple {array-like, shape (n_samples, n_features), array-like, shape (n_samples, )}

is_dry()

Checking if we have reached the end of the stream.

Returns:

flag showing if the stream has ended

Return type:

boolean

reset()

Reset stream to the beginning.
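
A stream produced by one of the package generators can also be written to disk and replayed through this parser. The following is a minimal round-trip sketch, assuming write access to the working directory; the file name stream.csv is only illustrative.

>>> import strlearn as sl
>>> gen = sl.streams.StreamGenerator(n_chunks=100, chunk_size=200, random_state=1410)
>>> gen.save_to_csv("stream.csv")
>>> stream = sl.streams.CSVParser("stream.csv", chunk_size=200, n_chunks=100)
>>> clf = sl.classifiers.AccumulatedSamplesClassifier()
>>> evaluator = sl.evaluators.PrequentialEvaluator()
>>> evaluator.process(clf, stream)
>>> stream.reset()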

class strlearn.streams.NPYParser(path, chunk_size=200, n_chunks=250)

Bases: object

Stream-aware parser of datasets in NumPy (.npy) format.

Parameters:
  • path (string) – Path to the .npy file.

  • chunk_size (integer, optional (default=200)) – The number of instances in each data chunk.

  • n_chunks (integer, optional (default=250)) – The number of data chunks that the stream is composed of.

Example:

>>> import strlearn as sl
>>> stream = sl.streams.NPYParser("Agrawal.npy")
>>> clf = sl.classifiers.AccumulatedSamplesClassifier()
>>> evaluator = sl.evaluators.PrequentialEvaluator()
>>> evaluator.process(clf, stream)
>>> stream.reset()
>>> print(evaluator.scores_)
...
[[0.855      0.80815508 0.79478582 0.80815508 0.89679715]
[0.795      0.75827674 0.7426779  0.75827674 0.84644195]
[0.8        0.75313899 0.73559983 0.75313899 0.85507246]
...
[0.885      0.86181169 0.85534199 0.86181169 0.91119691]
[0.895      0.86935764 0.86452058 0.86935764 0.92134831]
[0.87       0.85104088 0.84813907 0.85104088 0.9       ]]
get_chunk()

Generating a data chunk of a stream.

Used by all evaluators but also accessible for custom evaluation.

Returns:

Generated samples and target values.

Return type:

tuple {array-like, shape (n_samples, n_features), array-like, shape (n_samples, )}

is_dry()

Checking if we have reached the end of the stream.

Returns:

flag showing if the stream has ended

Return type:

boolean

reset()

Reset stream to the beginning.
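
If the .npy file is not produced by one of the generators' save_to_npy() methods, it can also be prepared directly with NumPy. The sketch below rests on an assumption that is not documented here: that the parser expects a single two-dimensional array whose last column holds the class labels. make_classification() merely stands in for any static dataset, and 50000 samples match the default chunk_size * n_chunks.

>>> import numpy as np
>>> import strlearn as sl
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=50000, n_features=20, random_state=1410)
>>> data = np.hstack([X, y[:, np.newaxis]])  # labels assumed to occupy the last column
>>> np.save("stream.npy", data)
>>> stream = sl.streams.NPYParser("stream.npy", chunk_size=200, n_chunks=250)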

class strlearn.streams.SemiSyntheticStreamGenerator(X, y, n_chunks=200, chunk_size=250, random_state=None, n_drifts=2, n_features=10, interpolation='nearest', stabilize_factor=0.2, binarize=True, density=None, base_projection_pool_size=50, evaluation_measures=[])

Bases: object

Semi-synthetic data streams generator for drifting data streams.

A generator that prepares a replicable classification data stream based on real-world input data. One-dimensional interpolation is used to generate the drifting projections, from which the final data stream is generated.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Static dataset features.

  • y (array-like, shape (n_samples, )) – Static dataset labels.

  • n_chunks (integer, optional (default=200)) – The number of data chunks that the stream is composed of.

  • chunk_size (integer, optional (default=250)) – The number of instances in each data chunk.

  • random_state (integer, optional (default=None)) – The seed used by the random number generator.

  • n_drifts (integer, optional (default=2)) – The number of concept changes in the data stream.

  • n_features (integer, optional (default=10)) – The number of features in the output stream.

  • interpolation (string, optional (default='nearest')) – Interpolation type used to generate the drifting projections (e.g. 'nearest' or 'cubic').

  • stabilize_factor (float, optional (default=0.2)) – The factor describing the stability of a concept.

  • binarize (boolean, optional (default=True)) – Flag describing if the data should be binarized.

  • density (integer, optional (default=None)) – The number of possible drift points from which the generated drifts are randomly selected.

  • base_projection_pool_size (integer, optional (default=50)) – Number of initial projections from which the final ones are selected.

  • evaluation_measures (list, optional (default=[])) – Measures based on which the projections are selected.

Example:

>>> import strlearn as sl
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.naive_bayes import GaussianNB
>>> X, y = load_breast_cancer(return_X_y=True)
>>> stream = sl.streams.SemiSyntheticStreamGenerator(X, y, n_drifts=4, interpolation='cubic')
>>> clf = GaussianNB()
>>> evaluator = sl.evaluators.TestThenTrain()
>>> evaluator.process(stream, clf)
>>> print(stream._get_drifts())
[ 14  48  89 155]
get_chunk()

Generating a data chunk of a stream.

Used by all evaluators but also accessible for custom evaluation.

Returns:

Generated samples and target values.

Return type:

tuple {array-like, shape (n_samples, n_features), array-like, shape (n_samples, )}

save_to_arff(filepath)

Save the generated stream to a file in ARFF format.

Parameters:

filepath (string) – Path to the file where the data will be saved in ARFF format.

save_to_csv(filepath)

Save the generated stream to a file in CSV format.

Parameters:

filepath (string) – Path to the file where the data will be saved in CSV format.

save_to_npy(filepath)

Save the generated stream to a file in NumPy (.npy) format.

Parameters:

filepath (string) – Path to the file where the data will be saved in NumPy (.npy) format.

class strlearn.streams.StreamGenerator(n_chunks=250, chunk_size=200, random_state=None, n_drifts=0, concept_sigmoid_spacing=None, n_classes=2, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_clusters_per_class=2, recurring=False, weights=None, incremental=False, y_flip=0.01, **kwargs)

Bases: object

Data streams generator for both stationary and drifting data streams.

A key element of the stream-learn package is a generator that allows preparing a replicable (according to the given random_state value) classification dataset whose class distribution changes over the course of the stream, with base concepts built on the default class distributions of the scikit-learn make_classification() function. These distributions try to reproduce the rules for generating the Madelon dataset. The StreamGenerator is capable of preparing any variation of data stream known in the general taxonomy of data streams.

Parameters:
  • n_chunks (integer, optional (default=250)) – The number of data chunks that the stream is composed of.

  • chunk_size (integer, optional (default=200)) – The number of instances in each data chunk.

  • random_state (integer, optional (default=None)) – The seed used by the random number generator.

  • n_drifts (integer, optional (default=0)) – The number of concept changes in the data stream.

  • concept_sigmoid_spacing (float, optional (default=None)) – Value that determines the shape of the sigmoid function and how sudden the change of concept is. The higher the value, the more sudden the drift.

  • n_classes (integer, optional (default=2)) – The number of classes in the generated data stream.

  • y_flip (float or tuple (default=0.01)) – Label noise for the whole dataset or for separate classes.

  • recurring (boolean, optional (default=False)) – Determines whether the stream can return to previously encountered concepts.

  • weights (array-like, shape (n_classes, ) or tuple (only for 2 classes)) – If an array, class weights for static imbalance; if a 3-valued tuple, (n_drifts, concept_sigmoid_spacing, IR amplitude [0-1]) for generation of continuous dynamically imbalanced streams; if a 2-valued tuple, (mean value, standard deviation) for generation of discrete dynamically imbalanced streams. See the configuration sketch after the example below.

Example:

>>> import strlearn as sl
>>> stream = sl.streams.StreamGenerator(n_drifts=2, weights=[0.2, 0.8], concept_sigmoid_spacing=5)
>>> clf = sl.classifiers.AccumulatedSamplesClassifier()
>>> evaluator = sl.evaluators.PrequentialEvaluator()
>>> evaluator.process(clf, stream)
>>> print(evaluator.scores_)
[[0.955      0.93655817 0.93601827 0.93655817 0.97142857]
 [0.94       0.91397849 0.91275313 0.91397849 0.96129032]
 [0.9        0.85565271 0.85234488 0.85565271 0.93670886]
 ...
 [0.815      0.72584133 0.70447376 0.72584133 0.8802589 ]
 [0.83       0.69522145 0.65223303 0.69522145 0.89570552]
 [0.845      0.67267706 0.61257135 0.67267706 0.90855457]]
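
The three accepted forms of the weights parameter described above can be sketched as follows; the numeric values are only illustrative.

>>> import strlearn as sl
>>> # static imbalance: fixed class priors for the whole stream
>>> static = sl.streams.StreamGenerator(weights=[0.2, 0.8])
>>> # continuous dynamic imbalance: (n_drifts, concept_sigmoid_spacing, IR amplitude)
>>> dynamic = sl.streams.StreamGenerator(weights=(2, 5, 0.9))
>>> # discrete dynamic imbalance: (mean value, standard deviation)
>>> discrete = sl.streams.StreamGenerator(weights=(0.3, 0.1))
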
get_chunk()

Generating a data chunk of a stream.

Used by all evaluators but also accessible for custom evaluation.

Returns:

Generated samples and target values.

Return type:

tuple {array-like, shape (n_samples, n_features), array-like, shape (n_samples, )}

save_to_arff(filepath)

Save the generated stream to a file in ARFF format.

Parameters:

filepath (string) – Path to the file where the data will be saved in ARFF format.

save_to_csv(filepath)

Save the generated stream to a file in CSV format.

Parameters:

filepath (string) – Path to the file where the data will be saved in CSV format.

save_to_npy(filepath)

Save the generated stream to a file in NumPy (.npy) format.

Parameters:

filepath (string) – Path to the file where the data will be saved in NumPy (.npy) format.