Streams module
- StreamGenerator – Data streams generator for both stationary and drifting data streams.
- ARFFParser – Stream-aware parser of datasets in ARFF format.
- CSVParser – Stream-aware parser of datasets in CSV format.
- NPYParser – Stream-aware parser of datasets in numpy format.
- class strlearn.streams.ARFFParser(path, chunk_size=200, n_chunks=250)
Bases: object
Stream-aware parser of datasets in ARFF format.
- Parameters:
path (string) – Path to the ARFF file.
chunk_size (integer, optional (default=200)) – The number of instances in each data chunk.
n_chunks (integer, optional (default=250)) – The number of data chunks that the stream is composed of.
- Example:
>>> import strlearn as sl
>>> stream = sl.streams.ARFFParser("Agrawal.arff")
>>> clf = sl.classifiers.AccumulatedSamplesClassifier()
>>> evaluator = sl.evaluators.PrequentialEvaluator()
>>> evaluator.process(clf, stream)
>>> stream.reset()
>>> print(evaluator.scores_)
...
[[0.855      0.80815508 0.79478582 0.80815508 0.89679715]
 [0.795      0.75827674 0.7426779  0.75827674 0.84644195]
 [0.8        0.75313899 0.73559983 0.75313899 0.85507246]
 ...
 [0.885      0.86181169 0.85534199 0.86181169 0.91119691]
 [0.895      0.86935764 0.86452058 0.86935764 0.92134831]
 [0.87       0.85104088 0.84813907 0.85104088 0.9       ]]
- get_chunk()
Generates the next data chunk of the stream.
Used by all evaluators, but also accessible for custom evaluation.
- Returns:
Generated samples and target values.
- Return type:
tuple {array-like, shape (n_samples, n_features), array-like, shape (n_samples, )}
- is_dry()
Checks whether the end of the stream has been reached.
- Returns:
flag showing if the stream has ended
- Return type:
boolean
- reset()
Reset processed stream and close ARFF file.
- class strlearn.streams.CSVParser(path, chunk_size=200, n_chunks=250)
Bases: object
Stream-aware parser of datasets in CSV format.
- Parameters:
path (string) – Path to the csv file.
chunk_size (integer, optional (default=200)) – The number of instances in each data chunk.
n_chunks (integer, optional (default=250)) – The number of data chunks that the stream is composed of.
- Example:
>>> import strlearn as sl
>>> stream = sl.streams.CSVParser("Agrawal.csv")
>>> clf = sl.classifiers.AccumulatedSamplesClassifier()
>>> evaluator = sl.evaluators.PrequentialEvaluator()
>>> evaluator.process(clf, stream)
>>> stream.reset()
>>> print(evaluator.scores_)
...
[[0.855      0.80815508 0.79478582 0.80815508 0.89679715]
 [0.795      0.75827674 0.7426779  0.75827674 0.84644195]
 [0.8        0.75313899 0.73559983 0.75313899 0.85507246]
 ...
 [0.885      0.86181169 0.85534199 0.86181169 0.91119691]
 [0.895      0.86935764 0.86452058 0.86935764 0.92134831]
 [0.87       0.85104088 0.84813907 0.85104088 0.9       ]]
- get_chunk()
Generates the next data chunk of the stream.
Used by all evaluators, but also accessible for custom evaluation.
- Returns:
Generated samples and target values.
- Return type:
tuple {array-like, shape (n_samples, n_features), array-like, shape (n_samples, )}
- is_dry()
Checks whether the end of the stream has been reached.
- Returns:
flag showing if the stream has ended
- Return type:
boolean
- reset()
Reset stream to the beginning.
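To make the roles of chunk_size and n_chunks concrete, here is a minimal sketch of chunk-wise CSV reading built on the standard library only. It is not the CSVParser implementation: iter_csv_chunks is a hypothetical helper, and the sketch assumes numeric rows with the label in the last column and drops a trailing incomplete chunk.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(handle, chunk_size=200, n_chunks=250):
    """Yield (X, y) chunks lazily; assumes numeric rows with the label last."""
    reader = csv.reader(handle)
    for _ in range(n_chunks):
        # Pull at most chunk_size rows without loading the whole file.
        rows = list(islice(reader, chunk_size))
        if len(rows) < chunk_size:
            return  # not enough rows left for a full chunk
        X = [[float(value) for value in row[:-1]] for row in rows]
        y = [int(row[-1]) for row in rows]
        yield X, y


# Five rows of two features plus a label; the fifth row cannot fill a chunk.
text = "0.1,0.2,0\n0.3,0.4,1\n0.5,0.6,0\n0.7,0.8,1\n0.9,1.0,0\n"
chunks = list(iter_csv_chunks(io.StringIO(text), chunk_size=2, n_chunks=250))
```

The stream ends at whichever limit is hit first: n_chunks chunks delivered, or the file running out of rows.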
- class strlearn.streams.NPYParser(path, chunk_size=200, n_chunks=250)
Bases: object
Stream-aware parser of datasets in numpy format.
- Parameters:
path (string) – Path to the npy file.
chunk_size (integer, optional (default=200)) – The number of instances in each data chunk.
n_chunks (integer, optional (default=250)) – The number of data chunks that the stream is composed of.
- Example:
>>> import strlearn as sl
>>> stream = sl.streams.NPYParser("Agrawal.npy")
>>> clf = sl.classifiers.AccumulatedSamplesClassifier()
>>> evaluator = sl.evaluators.PrequentialEvaluator()
>>> evaluator.process(clf, stream)
>>> stream.reset()
>>> print(evaluator.scores_)
...
[[0.855      0.80815508 0.79478582 0.80815508 0.89679715]
 [0.795      0.75827674 0.7426779  0.75827674 0.84644195]
 [0.8        0.75313899 0.73559983 0.75313899 0.85507246]
 ...
 [0.885      0.86181169 0.85534199 0.86181169 0.91119691]
 [0.895      0.86935764 0.86452058 0.86935764 0.92134831]
 [0.87       0.85104088 0.84813907 0.85104088 0.9       ]]
- get_chunk()
Generates the next data chunk of the stream.
Used by all evaluators, but also accessible for custom evaluation.
- Returns:
Generated samples and target values.
- Return type:
tuple {array-like, shape (n_samples, n_features), array-like, shape (n_samples, )}
- is_dry()
Checks whether the end of the stream has been reached.
- Returns:
flag showing if the stream has ended
- Return type:
boolean
- reset()
Reset stream to the beginning.
- class strlearn.streams.SemiSyntheticStreamGenerator(X, y, n_chunks=200, chunk_size=250, random_state=None, n_drifts=2, n_features=10, interpolation='nearest', stabilize_factor=0.2, binarize=True, density=None, base_projection_pool_size=50, evaluation_measures=[])
Bases: object
Semi-synthetic data streams generator for drifting data streams.
A generator that allows preparing a replicable classification dataset based on real-world input data. The generator uses one-dimensional interpolation to generate the drifting projections, based on which the final data stream is generated.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Static dataset features.
y (array-like, shape (n_samples, )) – Static dataset labels.
n_chunks (integer, optional (default=200)) – The number of data chunks that the stream is composed of.
chunk_size (integer, optional (default=250)) – The number of instances in each data chunk.
random_state (integer, optional (default=None)) – The seed used by the random number generator.
n_drifts (integer, optional (default=2)) – The number of concept changes in the data stream.
n_features (integer, optional (default=10)) – The number of features in output stream.
interpolation (string, optional (default='nearest')) – Interpolation type.
stabilize_factor (float, optional (default=0.2)) – The factor describing the stability of a concept.
binarize (boolean, optional (default=True)) – Flag describing if the data should be binarized.
density (integer, optional (default=None)) – The number of possible drift points from which the generated drifts are randomly selected.
base_projection_pool_size (integer, optional (default=50)) – Number of initial projections from which the final ones are selected.
evaluation_measures (list, optional (default=[])) – Measures based on which the projections are selected.
- Example:
>>> import strlearn as sl
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.naive_bayes import GaussianNB
>>> X, y = load_breast_cancer(return_X_y=True)
>>> stream = sl.streams.SemiSyntheticStreamGenerator(X, y, n_drifts=4, interpolation='cubic')
>>> clf = GaussianNB()
>>> evaluator = sl.evaluators.TestThenTrain()
>>> evaluator.process(stream, clf)
>>> print(stream._get_drifts())
[ 14  48  89 155]
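The class description mentions one-dimensional interpolation between drifting projections. The sketch below illustrates that idea for a single projection coefficient, comparing 'nearest' (the default named in the signature) with plain linear interpolation. It is an illustration only: the interpolate helper, the anchor chunk indices and the coefficient values are hypothetical, and the generator's actual interpolation routine may differ.

```python
def interpolate(anchors_x, anchors_y, x, kind="nearest"):
    """One-dimensional interpolation of a single projection coefficient."""
    if kind == "nearest":
        # Pick the anchor closest to x (ties go to the earlier anchor).
        nearest = min(range(len(anchors_x)), key=lambda i: abs(anchors_x[i] - x))
        return anchors_y[nearest]
    # Linear: blend the two anchors that bracket x.
    points = list(zip(anchors_x, anchors_y))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x lies outside the anchor range")


# Hypothetical coefficient values at chunks 0, 4 and 8 (three concepts, two drifts).
anchor_chunks = [0, 4, 8]
anchor_values = [1.0, 0.0, -1.0]

nearest = [interpolate(anchor_chunks, anchor_values, c, "nearest") for c in range(9)]
linear = [interpolate(anchor_chunks, anchor_values, c, "linear") for c in range(9)]
```

In this sketch, nearest-neighbour interpolation holds each concept and then jumps, which resembles a sudden drift, while linear interpolation moves the coefficient a little in every chunk, which resembles a gradual one.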
- get_chunk()
Generates the next data chunk of the stream.
Used by all evaluators, but also accessible for custom evaluation.
- Returns:
Generated samples and target values.
- Return type:
tuple {array-like, shape (n_samples, n_features), array-like, shape (n_samples, )}
- save_to_arff(filepath)
Save generated stream to the ARFF format file.
- Parameters:
filepath (string) – Path to the file where data will be saved in ARFF format.
- save_to_csv(filepath)
Save generated stream to the csv format file.
- Parameters:
filepath (string) – Path to the file where data will be saved in csv format.
- save_to_npy(filepath)
Save generated stream to the numpy format file.
- Parameters:
filepath (string) – Path to the file where data will be saved in numpy format.
- class strlearn.streams.StreamGenerator(n_chunks=250, chunk_size=200, random_state=None, n_drifts=0, concept_sigmoid_spacing=None, n_classes=2, n_features=20, n_informative=2, n_redundant=2, n_repeated=0, n_clusters_per_class=2, recurring=False, weights=None, incremental=False, y_flip=0.01, **kwargs)
Bases: object
Data streams generator for both stationary and drifting data streams.
A key element of the stream-learn package is a generator that prepares a replicable (according to the given random_state value) classification dataset whose class distribution changes over the course of the stream, with base concepts built on the default class distributions of the make_classification() function from the scikit-learn package. These distributions try to reproduce the rules used to generate the Madelon set. The StreamGenerator is capable of preparing any variation of the data stream known in the general taxonomy of data streams.
- Parameters:
n_chunks (integer, optional (default=250)) – The number of data chunks that the stream is composed of.
chunk_size (integer, optional (default=200)) – The number of instances in each data chunk.
random_state (integer, optional (default=None)) – The seed used by the random number generator.
n_drifts (integer, optional (default=0)) – The number of concept changes in the data stream.
concept_sigmoid_spacing (float, optional (default=None)) – Value that determines the shape of the sigmoid function and how sudden the change of concept is. The higher the value, the more sudden the drift.
n_classes (integer, optional (default=2)) – The number of classes in the generated data stream.
y_flip (float or tuple (default=0.01)) – Label noise for whole dataset or separate classes.
recurring (boolean, optional (default=False)) – Determines if the streams can go back to the previously encountered concepts.
weights (array-like, shape (n_classes, ) or tuple (only for 2 classes)) – If an array, the class weights for static imbalance; if a 3-valued tuple, (n_drifts, concept_sigmoid_spacing, IR amplitude [0-1]) for generating continuous dynamically imbalanced streams; if a 2-valued tuple, (mean value, standard deviation) for generating discrete dynamically imbalanced streams.
- Example:
>>> import strlearn as sl
>>> stream = sl.streams.StreamGenerator(n_drifts=2, weights=[0.2, 0.8], concept_sigmoid_spacing=5)
>>> clf = sl.classifiers.AccumulatedSamplesClassifier()
>>> evaluator = sl.evaluators.PrequentialEvaluator()
>>> evaluator.process(clf, stream)
>>> print(evaluator.scores_)
[[0.955      0.93655817 0.93601827 0.93655817 0.97142857]
 [0.94       0.91397849 0.91275313 0.91397849 0.96129032]
 [0.9        0.85565271 0.85234488 0.85565271 0.93670886]
 ...
 [0.815      0.72584133 0.70447376 0.72584133 0.8802589 ]
 [0.83       0.69522145 0.65223303 0.69522145 0.89570552]
 [0.845      0.67267706 0.61257135 0.67267706 0.90855457]]
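The effect of concept_sigmoid_spacing can be illustrated with a plain logistic function. The sketch below is not the generator's actual formula: concept_weight and transition_width are hypothetical helpers, and placing the drift at position 0 on a [-1, 1] axis is an assumption made for the illustration. It only shows why a larger spacing value means a more sudden change of concept.

```python
import math


def concept_weight(position, spacing):
    """Logistic blend between two concepts; the drift sits at position 0."""
    return 1.0 / (1.0 + math.exp(-spacing * position))


def transition_width(spacing, steps=1000):
    """Fraction of the [-1, 1] axis where the blend lies between 0.05 and 0.95."""
    positions = [-1 + 2 * i / steps for i in range(steps + 1)]
    mixed = [p for p in positions if 0.05 < concept_weight(p, spacing) < 0.95]
    return len(mixed) / len(positions)


gradual = transition_width(spacing=5)   # wide zone where both concepts mix
sudden = transition_width(spacing=50)   # narrow zone, close to a step change
```

Under these assumptions, the zone in which both concepts are noticeably mixed shrinks roughly in proportion to the spacing value.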
- get_chunk()
Generates the next data chunk of the stream.
Used by all evaluators, but also accessible for custom evaluation.
- Returns:
Generated samples and target values.
- Return type:
tuple {array-like, shape (n_samples, n_features), array-like, shape (n_samples, )}
- save_to_arff(filepath)
Save generated stream to the ARFF format file.
- Parameters:
filepath (string) – Path to the file where data will be saved in ARFF format.
- save_to_csv(filepath)
Save generated stream to the csv format file.
- Parameters:
filepath (string) – Path to the file where data will be saved in csv format.
- save_to_npy(filepath)
Save generated stream to the numpy format file.
- Parameters:
filepath (string) – Path to the file where data will be saved in numpy format.
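The 2-valued form of the weights parameter, (mean value, standard deviation), together with random_state, can be pictured with the following sketch. It is one plausible reading, not the library's implementation: chunk_priors is a hypothetical helper, and drawing one clipped normal value per chunk as the prior of one class is an assumption. The sketch mainly shows how a fixed seed makes such a stream replicable.

```python
import random


def chunk_priors(n_chunks, mean, std, seed=None):
    """Draw one class prior per chunk from a normal distribution clipped to [0, 1]."""
    rng = random.Random(seed)
    return [min(1.0, max(0.0, rng.gauss(mean, std))) for _ in range(n_chunks)]


# The same seed reproduces the same sequence of per-chunk priors.
first = chunk_priors(n_chunks=250, mean=0.2, std=0.05, seed=1410)
second = chunk_priors(n_chunks=250, mean=0.2, std=0.05, seed=1410)
```

Using a private random.Random instance rather than the module-level functions keeps the sequence independent of any other random calls in the program.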