Data Streams
A key element of the stream-learn
package is a generator that allows to prepare a replicable (according to the given random_state
value) classification dataset with class distribution changing over the course of stream, with base concepts build on a default class distributions for the scikit-learn
package from the make_classification()
function. These types of distributions try to reproduce the rules for generating the Madelon
set. The StreamGenerator
is capable of preparing any variation of the data stream known in the general taxonomy of data streams.
Stationary stream
The simplest variation of data streams are stationary streams. They contain one basic concept, static for the whole course of the processing. Chunks differ from each other in terms of the patterns inside, but the decision boundaries of the models built on them should not be statistically different. This type of stream may be generated with a clean generator call, without any additional parameters.
StreamGenerator()
The above illustration contains the series of scatter plots for a two-dimensional stationary stream with the binary problem. The StreamGenerator
class in the initializer accepts almost all standard attributes of the make_classification()
function, so to get exactly the distribution as above, the used call was:
stream = StreamGenerator(
n_classes=2,
n_features=2,
n_informative=2,
n_redundant=0,
n_repeated=0,
n_features=2,
random_state=105,
n_chunks=100,
chunk_size=500
)
What’s very important, contrary to the typical call to make_classification()
, we don’t specify the
n_samples
parameter here, which determines the number of patterns in the set, but instead we provide
two new attributes of data stream:
n_chunks
— to determine the number of chunks in a data stream.chunk_size
— to determine the number of patterns in each data chunk.
Additionally, data streams may contain noise which, while not considered as concept drift, provides additional challenge during the data stream analysis and data stream classifiers should be robust to it. The StreamGenerator
class implements noise by inverting the class labels of a given percentage of incoming instances in the data stream. This percentage can be defined by a y_flip
parameter, like in standard make_classification()
call. If a single float is given as the parameter value, the percentage of noise refers to combined instances from all classes, while if we specify a tuple of floats, the noise occurs within each class separately using the given percentages.
Streams containing concept drifts
The most commonly studied nature of data streams is their variability in time. Responsible for this is the phenomenon of the concept drift, where class distributions change over time with different dynamics, which necessitates the rebuilding of already
fitted classification models. The stream-learn
package tries to meet the need to synthesize all basic variations of this phenomenon (i.e. sudden (abrupt) and gradual drifts).
Sudden (Abrupt) drift
This type of drift occurs when the concept from which the data stream is generated is suddenly replaced by another one. Concept probabilities used by the StreamGenerator
class are created based on sigmoid function, which is generated using concept_sigmoid_spacing
parameter, which determines the function shape and how sudden the change of concept is. The higher the value, the more sudden the drift becames. Here, this parameter takes the default value of 999, which allows us for a generation of sigmoid function simulating an abrupt change in the data stream.
StreamGenerator(n_drifts=2)
Gradual drift
Unlike sudden drifts, gradual ones are associated with a slower change rate, which can be noticed during a longer observation of the data stream. This kind of drift refers to the transition phase
where the probability of getting instances from the first concept decreases while the probability of sampling from the next concept increases. The StreamGenerator
class simulates gradual drift by
comparing the concept probabilities with the generated random noise and, depending on the result, selecting which concept is active at a given time.
StreamGenerator(
n_drifts=2, concept_sigmoid_spacing=5
)
Incremental (Stepwise) drift
The incremental drift occurs when we are dealing with a series of barely noticeable changes in the concept used to generate the data stream, in opposite of gradual drift, where we are mixing samples from different concepts without changing them. Due to this, the drift may be identified only after some time. The severity of changes, and hence the speed of transition of one concept into another, is, like in previous example, described by the concept_sigmoid_spacing parameter.
StreamGenerator(
n_drifts=2, concept_sigmoid_spacing=5, incremental=True
)
Recurrent drift
The cyclic repetition of class distributions is a completely different property of concept drifts. If after another drift, the concept earlier present in the stream returns, we are dealing with a recurrent drift. We can get this kind of data stream by setting the recurring
flag in the generator.
StreamGenerator(
n_drifts=2, recurring=True
)
Non-recurring drift
The default mode of consecutive concept occurences is a non-recurring drift, where in each concept drift we are generating a completely new, previously unseen class distribution.
StreamGenerator(
n_drifts=2
)
Class imbalance
Another area of data stream properties, different from the concept drift phenomenon, is the prior probability of problem classes. By default, a balanced stream is generated, i.e. one in which patterns of all classes are present in a similar number.
StreamGenerator()
Stationary imbalanced stream
The basic type of problem in which we are dealing with disturbed class distribution is a dataset imbalanced stationary, where the classes maintain a predetermined proportion in each chunk of data stream. To acquire this type of a stream, one should pass the list
to the weights
parameter of the generator (i) consisting of as many elements as the classes in the problem and (ii) adding to one.
StreamGenerator(weights=[0.3, 0.7])
Dynamically imbalanced stream
A less common type of imbalanced data, impossible to obtain in static datasets, is data imbalanced dynamically. In this case, the class distribution is not constant throughout the course of a stream, but changes over time, similar to changing the concept presence in gradual streams. To get this type of data stream, we pass a tuple
of three numeric values to the weights
parameter of the generator:
the number of cycles of distribution changes,
concept_sigmoid_spacing
parameter, deciding about the dynamics of changes on the same principle as in gradual and incremental drifts,range within which oscillation is to take place.
StreamGenerator(weights=(2, 5, 0.9))
Mixing drift properties
Of course, when generating data streams, we don’t have to limit ourselves to just one modification of their properties. We can easily prepare a stream with many drifts, any dynamics of changes, a selected type of drift and a diverse, dynamic imbalanced ratio. The last example in this chapter of User Guide is such proposition, namely, DISCO (Dynamically Imbalanced Stream with Concept Oscillation).
StreamGenerator(
weights=(2, 5, 0.9), n_drifts=3, concept_sigmoid_spacing=5,
recurring=True, incremental=True
)