DataTransform(DataIO)

DataIO is the most basic component of the FATE algorithm module. It transforms the input Table into a Table whose values are Instance objects. The transformed Table is the input data format of all other algorithm modules, such as intersect, homo LR, hetero LR, and SecureBoost.

The DataIO module accepts the following input data formats and transforms them into the desired output Table.

dense input format

Each value of the input Table is a list of single elements, e.g.

1.0,2.0,3.0,4.5
1.1,2.1,3.4,1.3
2.4,6.3,1.5,9.0

svm-light input format

The first item of each input Table value is the label, followed by a list of “feature_id:value” items, e.g.

1 1:0.5 2:0.6
0 1:0.7 3:0.8 5:0.2
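
As a rough illustration of the svm-light layout above, the following sketch splits one line into a label and a sparse feature mapping. The function name `parse_svmlight_line` is hypothetical and is not part of FATE; it only mirrors the format described here.

```python
from typing import Dict, Tuple


def parse_svmlight_line(line: str) -> Tuple[int, Dict[int, float]]:
    """Split 'label id:value id:value ...' into (label, {feature_id: value})."""
    label, *items = line.split()
    features: Dict[int, float] = {}
    for item in items:
        # Each remaining item has the form "feature_id:value".
        feature_id, value = item.split(":", 1)
        features[int(feature_id)] = float(value)
    return int(label), features


rows = ["1 1:0.5 2:0.6", "0 1:0.7 3:0.8 5:0.2"]
parsed = [parse_svmlight_line(row) for row in rows]
```

Only the feature ids that appear in a line are stored, which is why this format corresponds to the “sparse” input_format setting.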

tag input format

Each value of the input Table is a list of tags. The DataIO module first collects all tags that occur in the input Table, sorts them in lexicographic order, and then converts each input line to a one-hot representation over the sorted tags. For example, assume the values are:

a c
a b d

After processing, the new values become:

1 0 1 0
1 1 0 1

tag:value input format

Each value of the input Table is a list of tag:value items, a mix of the svm-light and tag input formats. The DataIO module first collects all tags that occur in the input Table, sorts them in lexicographic order, converts each input line to a one-hot representation over the sorted tags, and then fills each occurring position with its value. For example, assume the values are:

a:0.2 c:1.5
a:0.3 b:0.6 d:0.7

After processing, the new values become:

0.2 0 1.5 0
0.3 0.6 0 0.7
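
The two tag-based transformations above can be sketched in a few lines of Python. This is a minimal illustration of the described behavior, not FATE's implementation; the function `encode_tags` and its arguments are hypothetical names.

```python
from typing import Dict, List


def encode_tags(rows: List[str], with_value: bool = False,
                delimiter: str = ":") -> List[List[float]]:
    """Encode space-separated tag (or tag:value) rows over the sorted tag set."""
    parsed: List[Dict[str, float]] = []
    for row in rows:
        entry: Dict[str, float] = {}
        for item in row.split():
            if with_value:
                tag, value = item.split(delimiter, 1)
                entry[tag] = float(value)
            else:
                entry[item] = 1.0  # plain tag: mark presence with 1
        parsed.append(entry)
    # Collect every tag seen in the input and sort lexicographically.
    vocabulary = sorted({tag for entry in parsed for tag in entry})
    # Tags that are absent from a row are encoded as 0.
    return [[entry.get(tag, 0.0) for tag in vocabulary] for entry in parsed]


one_hot = encode_tags(["a c", "a b d"])
valued = encode_tags(["a:0.2 c:1.5", "a:0.3 b:0.6 d:0.7"], with_value=True)
```

With the sorted tag set a, b, c, d, `one_hot` reproduces the tag example and `valued` reproduces the tag:value example.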

Param

class DataIOParam(input_format='dense', delimitor=',', data_type='float64', exclusive_data_type=None, tag_with_value=False, tag_value_delimitor=':', missing_fill=False, default_value=0, missing_fill_method=None, missing_impute=None, outlier_replace=False, outlier_replace_method=None, outlier_impute=None, outlier_replace_value=0, with_label=False, label_name='y', label_type='int', output_format='dense', need_run=True)

Defines the DataIO parameters used in federated ML.

Parameters
  • input_format (str, accepted 'dense', 'sparse', 'tag' only in this version. default: 'dense') –

    See the “DataIO” section of federatedml/util/README.md for a tutorial. Formally,

    dense input format data should be set to “dense”, svm-light input format data should be set to “sparse”, and tag or tag:value input format data should be set to “tag”.

  • delimitor (str, the delimitor of data input, default: ',') –

  • data_type (str, the data type of data input, accepted 'float','float64','int','int64','str','long', default: 'float64') –

  • exclusive_data_type (dict, the key is col_name and the value is data_type; used to specify a special data type for some features) –

  • tag_with_value (bool, used if input_format is 'tag'; if tag_with_value is True, the input column data format should be tag[delimitor]value, otherwise tag only) –

  • tag_value_delimitor (str, used if input_format is 'tag' and tag_with_value is True; the delimitor of the tag[delimitor]value column value) –

  • missing_fill (bool, whether to fill missing values, accepted only True/False, default: False) –

  • default_value (None or single object type or list, the value used to replace missing values) –

    If None, the default value defined in federatedml/feature/imputer.py is used. If a single object, missing values are filled with this object. If a list, its length should be the same as the feature dimension of the input data,

    meaning that a missing value in a column is replaced by the element at the same position in this list.

    default: 0

  • missing_fill_method (None or str, the method to replace missing values, should be one of [None, 'min', 'max', 'mean', 'designated'], default: None) –

  • missing_impute (None or list, elements of the list can be any type; defines which values are considered missing; auto generated if None, default: None) –

  • outlier_replace (bool, whether to replace outlier values, accepted only True/False, default: False) –

  • outlier_replace_method (None or str, the method to replace outlier values, should be one of [None, 'min', 'max', 'mean', 'designated'], default: None) –

  • outlier_impute (None or list, elements of the list can be any type; defines which values are regarded as outliers, default: None) –

  • outlier_replace_value (None or single object type or list, the value used to replace outliers) –

    If None, the default value defined in federatedml/feature/imputer.py is used. If a single object, outliers are replaced with this object. If a list, its length should be the same as the feature dimension of the input data,

    meaning that an outlier in a column is replaced by the element at the same position in this list.

    default: 0

  • with_label (bool, True if the input data contains a label, False otherwise. default: False) –

  • label_name (str, name of the column that holds the label, only used in dense input format. default: 'y') –

  • label_type (object, accepted 'int','int64','float','float64','long','str' only; used when with_label is True. default: 'int') –

  • output_format (str, accepted 'dense','sparse' only in this version. default: 'dense') –
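
To make the parameter list more concrete, here is a hedged sketch of a parameter set for tag:value input, written as a plain Python dict whose keys mirror the DataIOParam signature above. How the values are actually passed to FATE (job configuration, pipeline API) is not covered by this section; the space delimitor and the helper `check_dataio_param` are assumptions for illustration only, and the helper is not FATE's own validation.

```python
# Hypothetical parameter set; keys follow the DataIOParam signature.
dataio_param = {
    "input_format": "tag",        # tag or tag:value input
    "delimitor": " ",             # assumption: items are space-separated
    "tag_with_value": True,       # items look like tag:value
    "tag_value_delimitor": ":",
    "data_type": "float64",
    "with_label": False,
    "output_format": "sparse",
}

ACCEPTED_INPUT_FORMATS = {"dense", "sparse", "tag"}
ACCEPTED_OUTPUT_FORMATS = {"dense", "sparse"}


def check_dataio_param(param: dict) -> None:
    """Illustrative sanity check of the documented accepted values."""
    if param.get("input_format", "dense") not in ACCEPTED_INPUT_FORMATS:
        raise ValueError("input_format must be 'dense', 'sparse' or 'tag'")
    if param.get("output_format", "dense") not in ACCEPTED_OUTPUT_FORMATS:
        raise ValueError("output_format must be 'dense' or 'sparse'")
    if param.get("tag_with_value", False) and param.get("input_format") != "tag":
        raise ValueError("tag_with_value is only used when input_format is 'tag'")


check_dataio_param(dataio_param)
```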

Other Features of DataIO

  • Missing value imputation: provides the “mean”, “designated”, “min”, and “max” methods to impute missing values.

  • Outlier value replacement: provides the same replacement methods as missing value imputation.

Please check out federatedml/feature/imputer.py for more details.

“__init__ of class Imputer”
    def __init__(self, missing_value_list=None):
        """
        Parameters
        ----------
        missing_value_list: list of str, the value to be replaced. Default None, if is None, it will be set to list of blank, none, null and na,
                            which regarded as missing filled. If not, it can be outlier replace, and missing_value_list includes the outlier values
        """
        if missing_value_list is None:
            self.missing_value_list = ['', 'none', 'null', 'na']
        else:
            self.missing_value_list = missing_value_list

        self.support_replace_method = ['min', 'max', 'mean', 'median', 'quantile', 'designated']
        self.support_output_format = {
            'str': str,
            'float': float,
            'int': int,
            'origin': None
        }

        self.support_replace_area = {
            'min': 'col',
            'max': 'col',
            'mean': 'col',
            'median': 'col',
            'quantile': 'col',
            'designated': 'col'
        }

        self.cols_fit_impute_rate = []
        self.cols_transform_impute_rate = []
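
The column-wise replacement that these settings describe can be sketched as follows. This is a simplified stand-alone illustration, not the Imputer class itself: the function `impute_columns`, its arguments, and the case-insensitive matching of missing markers are assumptions, and only the 'min', 'max', 'mean' and 'designated' methods are covered.

```python
from statistics import mean
from typing import List, Sequence

# Same default markers as missing_value_list in the excerpt above.
MISSING = ("", "none", "null", "na")


def impute_columns(rows: List[List[str]], method: str = "mean",
                   designated: float = 0.0,
                   missing_values: Sequence[str] = MISSING) -> List[List[float]]:
    """Replace missing cells column by column ('col' replace area)."""
    def is_missing(cell: str) -> bool:
        # Assumption: markers are matched case-insensitively.
        return cell.lower() in missing_values

    n_cols = len(rows[0])
    fill_values = []
    for j in range(n_cols):
        present = [float(row[j]) for row in rows if not is_missing(row[j])]
        if method == "designated" or not present:
            fill_values.append(designated)
        elif method == "min":
            fill_values.append(min(present))
        elif method == "max":
            fill_values.append(max(present))
        elif method == "mean":
            fill_values.append(mean(present))
        else:
            raise ValueError(f"unsupported method: {method}")
    return [
        [fill_values[j] if is_missing(row[j]) else float(row[j])
         for j in range(n_cols)]
        for row in rows
    ]


raw = [["1.0", "na"], ["3.0", "4.0"], ["", "6.0"]]
filled_mean = impute_columns(raw, method="mean")
filled_designated = impute_columns(raw, method="designated", designated=-1.0)
```

Outlier replacement follows the same pattern if `missing_values` lists the outlier values instead of the missing-value markers.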

Sample Weight

Sample Weight assigns weights to input samples. Weights may be specified by the input param class_weight or sample_weight_name. Each output data instance has a weight value, which is used for training.

If the resulting weighted instances include negative weights, a warning message is given.

Please note that when weight is not None, only the weight_diff convergence check method may be used for training GLM.

How to Use

params

class_weight

str or dict, class weight dictionary or class weight computation mode. The string value only accepts ‘balanced’. If a dict is provided, each key should be a class (label), and the weights will not be normalized.

sample_weight_name

str, name of the column that specifies the sample weight. Extracted weight values are normalized if normalize is True.

normalize

bool, default False. Whether to normalize sample weight extracted from sample_weight_name column

need_run

bool, whether to run this module or not

Note

If both class_weight and sample_weight_name are provided, values from column of sample_weight_name will be used.
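
The weight assignment rules above can be sketched as follows. This is not FATE's implementation: the function names are hypothetical, 'balanced' is assumed to follow the common n_samples / (n_classes * class_count) heuristic, the fallback to a weight of 1.0 when neither parameter is given is an assumption, and normalization is left out because this section does not say how it is computed.

```python
import warnings
from collections import Counter
from typing import Dict, Hashable, List, Optional, Union


def balanced_class_weight(labels: List[Hashable]) -> Dict[Hashable, float]:
    """Assumed 'balanced' mode: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def assign_weights(rows: List[dict], label_name: str = "y",
                   class_weight: Optional[Union[str, dict]] = None,
                   sample_weight_name: Optional[str] = None) -> List[float]:
    """Return one weight per row; the weight column takes precedence."""
    if sample_weight_name is not None:
        # Values from the sample_weight_name column win over class_weight.
        weights = [float(row[sample_weight_name]) for row in rows]
    elif class_weight is not None:
        labels = [row[label_name] for row in rows]
        if class_weight == "balanced":
            table = balanced_class_weight(labels)
        else:
            table = class_weight  # dict keyed by class label, used as given
        weights = [float(table[label]) for label in labels]
    else:
        weights = [1.0] * len(rows)  # assumption: unweighted fallback
    if any(weight < 0 for weight in weights):
        warnings.warn("weighted instances include negative weight")
    return weights


rows = [
    {"y": 1, "w": 2.0},
    {"y": 0, "w": 0.5},
    {"y": 0, "w": 1.0},
    {"y": 0, "w": 1.5},
]
from_column = assign_weights(rows, class_weight="balanced", sample_weight_name="w")
from_dict = assign_weights(rows, class_weight={1: 3.0, 0: 1.0})
from_balanced = assign_weights(rows, class_weight="balanced")
```

In the first call both parameters are set, so the values of the `w` column are used, matching the note above.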