
DataTransform

Data Transform is the most basic component of the FATE algorithm module. It transforms the input Table into a Table whose values are Instance objects. The transformed Table is the input data format of all other algorithm modules, such as intersect, homo LR, hetero LR, SecureBoost and so on.

The Data Transform module accepts the following input data formats and transforms them into the desired output Table.

  • dense input format
    each value of the input Table is a list of single elements, e.g.:

    1.0,2.0,3.0,4.5
    1.1,2.1,3.4,1.3
    2.4,6.3,1.5,9.0
    
  • svm-light input format
    the first item of each input Table value is the label, followed by a list of "feature_id:value" items, e.g.:

    1 1:0.5 2:0.6
    0 1:0.7 3:0.8 5:0.2
    
  • tag input format
    each value of the input Table is a list of tags. The data transform module first aggregates all tags that occur in the input table, sorts them in lexicographic order, and then converts each input line to a one-hot representation over the sorted tags, e.g. assume the values are:

    a c
    a b d
    

    after processing, the new values become:

    1 0 1 0
    1 1 0 1
    
  • tag:value input format
    each value of the input Table is a list of tag:value items, like a mix of the svm-light and tag input formats. The data transform module first aggregates all tags that occur in the input table, sorts them in lexicographic order, converts each input line to a one-hot style representation over the sorted tags, and then fills each occurring position with the tag's value, e.g. assume the values are:

    a:0.2 c:1.5
    a:0.3 b:0.6 d:0.7
    

    after processing, the new values become:

    0.2 0 1.5 0
    0.3 0.6 0 0.7
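
The svm-light, tag and tag:value conversions described above can be sketched in plain Python. This is a minimal illustration, not FATE's implementation: the function names `parse_svmlight_line` and `tags_to_dense` are hypothetical, and the sketch assumes integer labels and whitespace-separated items, as in the examples above.

```python
def parse_svmlight_line(line):
    """Parse 'label feature_id:value ...' into (label, {feature_id: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        feature_id, value = pair.split(":")
        features[int(feature_id)] = float(value)
    return int(label), features  # assumption: the label is an integer


def tags_to_dense(rows, tag_with_value=False, tag_value_delimitor=":"):
    """Expand tag or tag:value rows into dense rows over the sorted tag set."""
    parsed = []
    for row in rows:
        entries = {}
        for item in row.split():  # assumption: items are separated by whitespace
            if tag_with_value:
                tag, value = item.split(tag_value_delimitor)
                entries[tag] = float(value)
            else:
                entries[item] = 1
        parsed.append(entries)
    # Aggregate every tag seen in the table, in lexicographic order.
    tags = sorted({tag for entries in parsed for tag in entries})
    # Tags that do not occur in a row are filled with 0.
    dense = [[entries.get(tag, 0) for tag in tags] for entries in parsed]
    return tags, dense


print(parse_svmlight_line("0 1:0.7 3:0.8 5:0.2"))
print(tags_to_dense(["a c", "a b d"]))
print(tags_to_dense(["a:0.2 c:1.5", "a:0.3 b:0.6 d:0.7"], tag_with_value=True))
```

Both tag variants share one code path here because they differ only in what is written at an occurring position: 1 for plain tags, the parsed value for tag:value items.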
    

Param

data_transform_param

Classes

DataTransformParam(input_format='dense', delimitor=',', data_type='float64', exclusive_data_type=None, tag_with_value=False, tag_value_delimitor=':', missing_fill=False, default_value=0, missing_fill_method=None, missing_impute=None, outlier_replace=False, outlier_replace_method=None, outlier_impute=None, outlier_replace_value=0, with_label=False, label_name='y', label_type='int', output_format='dense', need_run=True, with_match_id=False, match_id_name='', match_id_index=0)

Bases: BaseParam

Defines the data transform parameters used in federated ML.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_format | str | See the "DataTransform" section of federatedml/util/README.md. Use "dense" for dense input data, "sparse" for svm-light input data, and "tag" for tag or tag:value input data. Note: in FATE >= 1.9.0, this parameter can be set in the data meta when uploading/binding data. | 'dense' |
| delimitor | str | The delimiter of the input data. | ',' |
| data_type | str | The data type of the input data, one of {'float64', 'float', 'int', 'int64', 'str', 'long'}. | 'float64' |
| exclusive_data_type | dict | Keys are column names and values are data types; used to specify a special data type for some features. | None |
| tag_with_value | bool | Used if input_format is 'tag'. If True, each input item should have the form tag[delimitor]value; otherwise it is a tag only. | False |
| tag_value_delimitor | str | Used if input_format is 'tag' and tag_with_value is True; the delimiter inside each tag[delimitor]value item. | ':' |
| missing_fill | bool | Whether to fill missing values; only True/False are accepted. | False |
| default_value | None or object or list | The value used to replace missing values. If None, the default value defined in federatedml/feature/imputer.py is used. If a single object, missing values are filled with this object. If a list, its length should be the same as the feature dimension of the input data; a missing value in a column is replaced by the element at the same position in this list. | 0 |
| missing_fill_method | None or str | The method used to replace missing values, one of [None, 'min', 'max', 'mean', 'designated']. | None |
| missing_impute | None or list | Defines which values are considered missing; list elements can be of any type. If None, the list is generated automatically. | None |
| outlier_replace | bool | Whether to replace outlier values; only True/False are accepted. | False |
| outlier_replace_method | None or str | The method used to replace outlier values, one of [None, 'min', 'max', 'mean', 'designated']. | None |
| outlier_impute | None or list | Defines which values are regarded as outliers; list elements can be of any type. | None |
| outlier_replace_value | None or object or list | The value used to replace outliers. If None, the default value defined in federatedml/feature/imputer.py is used. If a single object, outliers are replaced with this object. If a list, its length should be the same as the feature dimension of the input data; an outlier in a column is replaced by the element at the same position in this list. | 0 |
| with_label | bool | True if the input data contains a label, False otherwise. Note: in FATE >= 1.9.0, this parameter can be set in the data meta when uploading/binding data. | False |
| label_name | str | Name of the column that holds the label; only used with the dense input format. | 'y' |
| label_type | str | Used when with_label is True; one of {'int', 'int64', 'float', 'float64', 'long', 'str'}. | 'int' |
| output_format | str | Output format, 'dense' or 'sparse'. | 'dense' |
| with_match_id | bool | True if the dataset has a match_id. Note: in FATE >= 1.9.0, this parameter can be set in the data meta when uploading/binding data. | False |
| match_id_name | str | Valid if input_format is "dense" and multiple columns are considered as match_ids; the name of the match_id to be used in the current job. Note: in FATE >= 1.9.0, this parameter can be set in the data meta when uploading/binding data. | '' |
| match_id_index | int | Valid if input_format is "tag" or "sparse" and multiple columns are considered as match_ids; the index of the match_id. This parameter works only when the data meta has been set during uploading/binding. | 0 |
Source code in federatedml/param/data_transform_param.py
def __init__(self, input_format="dense", delimitor=',', data_type='float64',
             exclusive_data_type=None,
             tag_with_value=False, tag_value_delimitor=":",
             missing_fill=False, default_value=0, missing_fill_method=None,
             missing_impute=None, outlier_replace=False, outlier_replace_method=None,
             outlier_impute=None, outlier_replace_value=0,
             with_label=False, label_name='y',
             label_type='int', output_format='dense', need_run=True,
             with_match_id=False, match_id_name='', match_id_index=0):
    self.input_format = input_format
    self.delimitor = delimitor
    self.data_type = data_type
    self.exclusive_data_type = exclusive_data_type
    self.tag_with_value = tag_with_value
    self.tag_value_delimitor = tag_value_delimitor
    self.missing_fill = missing_fill
    self.default_value = default_value
    self.missing_fill_method = missing_fill_method
    self.missing_impute = missing_impute
    self.outlier_replace = outlier_replace
    self.outlier_replace_method = outlier_replace_method
    self.outlier_impute = outlier_impute
    self.outlier_replace_value = outlier_replace_value
    self.with_label = with_label
    self.label_name = label_name
    self.label_type = label_type
    self.output_format = output_format
    self.need_run = need_run
    self.with_match_id = with_match_id
    self.match_id_name = match_id_name
    self.match_id_index = match_id_index
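
The constructor only stores its keyword arguments, so a parameter set can be prepared as an ordinary dict first. The sketch below shows one possible set for tag:value input. The parameter names come from the signature above, while the chosen values (in particular the space delimiter) are illustrative assumptions.

```python
# Keyword arguments for a tag:value dataset. The names match the constructor
# signature; the values are illustrative only.
tag_value_params = {
    "input_format": "tag",
    "tag_with_value": True,
    "tag_value_delimitor": ":",
    "delimitor": " ",  # assumption: items in a row are separated by spaces
    "with_label": False,
    "output_format": "sparse",
}

# Where FATE's federatedml package is installed, the dict could be passed on:
#   from federatedml.param.data_transform_param import DataTransformParam
#   param = DataTransformParam(**tag_value_params)
#   param.check()
print(tag_value_params)
```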
Attributes
input_format = input_format instance-attribute
delimitor = delimitor instance-attribute
data_type = data_type instance-attribute
exclusive_data_type = exclusive_data_type instance-attribute
tag_with_value = tag_with_value instance-attribute
tag_value_delimitor = tag_value_delimitor instance-attribute
missing_fill = missing_fill instance-attribute
default_value = default_value instance-attribute
missing_fill_method = missing_fill_method instance-attribute
missing_impute = missing_impute instance-attribute
outlier_replace = outlier_replace instance-attribute
outlier_replace_method = outlier_replace_method instance-attribute
outlier_impute = outlier_impute instance-attribute
outlier_replace_value = outlier_replace_value instance-attribute
with_label = with_label instance-attribute
label_name = label_name instance-attribute
label_type = label_type instance-attribute
output_format = output_format instance-attribute
need_run = need_run instance-attribute
with_match_id = with_match_id instance-attribute
match_id_name = match_id_name instance-attribute
match_id_index = match_id_index instance-attribute
Functions
check()
Source code in federatedml/param/data_transform_param.py
def check(self):

    descr = "data_transform param's"

    self.input_format = self.check_and_change_lower(self.input_format,
                                                    ["dense", "sparse", "tag"],
                                                    descr)

    self.output_format = self.check_and_change_lower(self.output_format,
                                                     ["dense", "sparse"],
                                                     descr)

    self.data_type = self.check_and_change_lower(self.data_type,
                                                 ["int", "int64", "float", "float64", "str", "long"],
                                                 descr)

    if type(self.missing_fill).__name__ != 'bool':
        raise ValueError("data_transform param's missing_fill {} not supported".format(self.missing_fill))

    if self.missing_fill_method is not None:
        self.missing_fill_method = self.check_and_change_lower(self.missing_fill_method,
                                                               ['min', 'max', 'mean', 'designated'],
                                                               descr)

    if self.outlier_replace_method is not None:
        self.outlier_replace_method = self.check_and_change_lower(self.outlier_replace_method,
                                                                  ['min', 'max', 'mean', 'designated'],
                                                                  descr)

    if type(self.with_label).__name__ != 'bool':
        raise ValueError("data_transform param's with_label {} not supported".format(self.with_label))

    if self.with_label:
        if not isinstance(self.label_name, str):
            raise ValueError("data transform param's label_name {} should be str".format(self.label_name))

        self.label_type = self.check_and_change_lower(self.label_type,
                                                      ["int", "int64", "float", "float64", "str", "long"],
                                                      descr)

    if self.exclusive_data_type is not None and not isinstance(self.exclusive_data_type, dict):
        raise ValueError("exclusive_data_type is should be None or a dict")

    if not isinstance(self.with_match_id, bool):
        raise ValueError("with_match_id should be boolean variable, but {} find".format(self.with_match_id))

    if not isinstance(self.match_id_index, int) or self.match_id_index < 0:
        raise ValueError("match_id_index should be non negative integer")

    if self.match_id_name is not None and not isinstance(self.match_id_name, str):
        raise ValueError("match_id_name should be str")

    return True
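
The string options in check() go through `check_and_change_lower`, which is inherited from BaseParam and whose source is not shown on this page. The stand-in below is an assumption about its behaviour, inferred from how check() uses it: lower-case the value and reject anything outside the allowed list. `validate_formats` is a hypothetical helper that mirrors the first two checks on plain values.

```python
def check_and_change_lower(param, valid_list, descr=""):
    """Hypothetical stand-in for the helper inherited from BaseParam."""
    message = f"{descr} {param!r} not supported, should be one of {valid_list}"
    if not isinstance(param, str):
        raise ValueError(message)
    lowered = param.lower()
    if lowered not in valid_list:
        raise ValueError(message)
    return lowered


def validate_formats(input_format, output_format):
    """Mirror the input_format and output_format checks of check()."""
    descr = "data_transform param's"
    return (
        check_and_change_lower(input_format, ["dense", "sparse", "tag"], descr),
        check_and_change_lower(output_format, ["dense", "sparse"], descr),
    )


print(validate_formats("Tag", "SPARSE"))  # ('tag', 'sparse')

try:
    validate_formats("dense", "tag")  # "tag" is not a valid output format
except ValueError as error:
    print(error)
```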

Other Features of DataTransform

  • Missing value imputation: provides several methods for filling missing values.
  • Outlier replacement: provides several replacement methods, similar to missing value imputation.
  • Since FATE v1.9.0, data meta parameters should be set when uploading or binding data; please refer to the upload guides.
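
The missing value imputation listed above can be sketched column by column for the four named methods. This is an illustration under stated assumptions, not the code in federatedml/feature/imputer.py: the function name `impute_missing` and the default set of values treated as missing (`None` and the empty string) are hypothetical.

```python
# Column statistics for the 'min', 'max' and 'mean' methods.
REDUCERS = {
    "min": min,
    "max": max,
    "mean": lambda values: sum(values) / len(values),
}


def impute_missing(rows, method="designated", default_value=0,
                   missing_values=(None, "")):
    """Return a copy of rows with missing cells filled, column by column."""
    fills = []
    for index, column in enumerate(zip(*rows)):
        present = [v for v in column if v not in missing_values]
        if method == "designated":
            # A list supplies one fill value per column; a scalar is shared.
            if isinstance(default_value, list):
                fill = default_value[index]
            else:
                fill = default_value
        elif method in REDUCERS:
            if not present:
                raise ValueError(f"column {index} has no observed values")
            fill = REDUCERS[method](present)
        else:
            raise ValueError(f"unsupported method: {method}")
        fills.append(fill)
    return [
        [fills[i] if v in missing_values else v for i, v in enumerate(row)]
        for row in rows
    ]


rows = [[1.0, None, 3.0], [2.0, 5.0, ""]]
print(impute_missing(rows, method="mean"))
print(impute_missing(rows, method="designated", default_value=[0, -1, -2]))
```

Outlier replacement could reuse the same routine by passing the values regarded as outliers as `missing_values`, which matches the note that the two features offer similar methods.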

Last updated: 2022-08-25