Data Preprocessing with Orange Tool

Gohil Rushabh Navinchandra
3 min readOct 26, 2021

--

Data processing is manipulation of data by a computer. It includes the conversion of raw data to machine-readable form, flow of data through the CPU and memory to output devices, and formatting or transformation of output. Any use of computers to perform defined operations on data can be included under data processing.

In the Orange tool canvas, take the Python script from the left panel and double click on it.

Discretization

Discretization is the process through which we can transform continuous variables, models or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that go across the range of our desired variable/model/function.

import Orange
iris = Orange.data.Table("iris.tab")
disc = Orange.preprocess.Discretize()
disc.method=Orange.preprocess.discretize.EqualFreq(n=3)
d_iris = disc(iris)
print("Original dataset:\n")
for e in iris[:3]:
print(e)
print("Discretized dataset:")
for e in d_iris[:3]:
print(e)

Continuization

A continuation reifies the program control state, i.e. the continuation is a data structure that represents the computational process at a given point in the process’ execution; the created data structure can be accessed by the programming language, instead of being hidden in the runtime environment.

Continuize_Indicators

The variable is replaced by indicator variables, each corresponding to one value of the original variable. For each value of the original attribute, only the corresponding new attribute will have a value of one and others will be zero. This is the default behavior.

For example as shown in the below code snippet, dataset “titanic” has feature “status” with values “crew”, “first”, “second” and “third”, in that order. Its value for the 10th row is “first”. Continuization replaces the variable with variables “status=crew”, “status=first”, “status=second” and “status=third”.

titanic = Orange.data.Table("titanic")continuizer = Orange.preprocess.Continuize()titanic1 = continuizer(titanic)print("Before Continuization : ",titanic.domain)print("After Continuization : ",titanic1.domain)#Data of row 15 in the before and after continuizationprint("15th row data before : ",titanic[15])print("15th row data after : ",titanic1[15])

Normalization

Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. Normalization is generally required when we are dealing with attributes on a different scale, otherwise, it may lead to a dilution in effectiveness of an important equally important attribute(on lower scale) because of other attribute having values on larger scale. We use the Normalize function to perform normalization.

from Orange.data import Table
from Orange.preprocess import Normalize
data = Table("iris")
normalizer = Normalize(norm_type=Normalize.NormalizeBySpan)
normalized_data = normalizer(data)
print("Before Normalization : ",iris[2])
print("After Normalization : ",normalized_data[2])

Randomization

With randomization, given a data table, preprocessor returns a new table in which the data is shuffled. Randomize function is used from the Orange library to perform randomization.

#python script for Randomize
from Orange.data import Table
from Orange.preprocess import Randomize
data = Table("iris")
randomizer = Randomize(Randomize.RandomizeClasses)
randomized_data = randomizer(data)
print("Before randomization : ",iris[2])
print("After Randomization : ",randomized_data[2])

--

--