Scikit-Learn

Feature extraction

Q: How to vectorize dictionaries?

A: Use DictVectorizer to convert a list of dictionaries into a numeric array. String values are one-hot encoded, while numeric and boolean values are passed through as numbers.

from sklearn.feature_extraction import DictVectorizer

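# Three sample records mixing numeric, string, and boolean values
dic1 = {'x': 4, 'y': "a", 'z': True, 'u': "a"}
dic2 = {'x': 2, 'y': "b", 'z': False, 'u': "a"}
dic3 = {'x': 9, 'y': "a", 'z': True, 'u': "c"}
dic = [dic1, dic2, dic3]
# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
dict_vect = DictVectorizer(sparse=False)
dict_vect.fit_transform(dic)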

# array([[1., 0., 4., 1., 0., 1.],
#        [1., 0., 2., 0., 1., 0.],
#        [0., 1., 9., 1., 0., 1.]])

"""
The result is a 2-D array with one row per input dictionary.

The dictionaries have four keys: 'x', 'y', 'z', and 'u'. In the output, the
features are sorted alphabetically by name.

The first key, 'u', has two unique string values ("a" and "c"), so it is one-hot encoded into the first two columns ('u=a' and 'u=c').
The second key, 'x', is numeric and takes a single column.
The third key, 'y', also has two unique string values, so it takes the next two columns ('y=a' and 'y=b').
The fourth key, 'z', is boolean. It is treated as a number (True -> 1, False -> 0) and takes the last column.

dict_vect has now been fitted on dic and can be used to transform new dictionaries. If a string value was not seen during fitting, such as "aaa" for 'y' below, it is ignored: the columns for 'y=a' and 'y=b' are both set to 0.
"""
dic4 = {'x':99, 'y':"aaa", 'z':True, 'u':"a"}
dict_vect.transform([dic4])
# array([[ 1.,  0., 99.,  0.,  0.,  1.]])
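
To make the column layout described above concrete, here is a minimal pure-Python sketch of the encoding rule. It is a simplified illustration, not scikit-learn's implementation: the helper names `sketch_fit` and `sketch_transform` are hypothetical, and the sketch assumes values are only strings, numbers, or booleans.

```python
from numbers import Number


def sketch_fit(rows):
    """Collect sorted feature names: 'key=value' for strings, 'key' for numbers."""
    names = set()
    for row in rows:
        for key, value in row.items():
            if isinstance(value, str):
                names.add(f"{key}={value}")   # one column per distinct string value
            elif isinstance(value, Number):   # bool is a Number subclass too
                names.add(key)                # one column for the whole key
    return sorted(names)


def sketch_transform(rows, names):
    """Encode each dict as a list of floats; features not in `names` are skipped."""
    index = {name: i for i, name in enumerate(names)}
    matrix = []
    for row in rows:
        encoded = [0.0] * len(names)
        for key, value in row.items():
            if isinstance(value, str):
                name, cell = f"{key}={value}", 1.0
            elif isinstance(value, Number):
                name, cell = key, float(value)
            else:
                continue  # other value types are outside this sketch
            if name in index:  # unseen values such as 'y=aaa' are ignored
                encoded[index[name]] = cell
        matrix.append(encoded)
    return matrix


train = [
    {'x': 4, 'y': "a", 'z': True, 'u': "a"},
    {'x': 2, 'y': "b", 'z': False, 'u': "a"},
    {'x': 9, 'y': "a", 'z': True, 'u': "c"},
]
names = sketch_fit(train)
print(names)
# ['u=a', 'u=c', 'x', 'y=a', 'y=b', 'z']
print(sketch_transform([{'x': 99, 'y': "aaa", 'z': True, 'u': "a"}], names))
# [[1.0, 0.0, 99.0, 0.0, 0.0, 1.0]]
```

With scikit-learn 1.0 or later, dict_vect.get_feature_names_out() reports the column names of the fitted vectorizer, which is a quick way to check which column belongs to which feature.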
