DataTransform(DataIO)

DataIO is the most basic component of the FATE algorithm module. It transforms the input Table into a Table whose values are Instance objects. The transformed Table is the input data format of all other algorithm modules, such as intersect, homo LR, hetero LR and SecureBoost.

The DataIO module accepts the following input data formats and transforms them into the desired output Table.

  • dense input format
    each value of the input Table is a delimited list of single elements, e.g.:

    1.0,2.0,3.0,4.5
    1.1,2.1,3.4,1.3
    2.4,6.3,1.5,9.0
    
  • svm-light input format
    the first item of the input Table's value is the label, followed by a list of "feature_id:value" items, e.g.:

    1 1:0.5 2:0.6
    0 1:0.7 3:0.8 5:0.2
    
  • tag input format
    the input Table's value is a list of tags. The DataIO module first aggregates all tags that occur in the input Table, then converts each input line to a one-hot representation with the tags sorted in lexicographic order, e.g. assume the values are:

    a c
    a b d
    

    after processing, the new values become:

    1 0 1 0
    1 1 0 1
    
  • tag:value input format
    the input Table's value is a list of tag:value items, like a mix of the svm-light and tag input formats. The DataIO module first aggregates all tags that occur in the input Table, then converts each input line to a one-hot style representation with the tags sorted in lexicographic order, and finally fills the position of each occurring tag with its value, e.g. assume the values are:

    a:0.2 c:1.5
    a:0.3 b:0.6 d:0.7
    

    after processing, the new values become:

    0.2 0 1.5 0
    0.3 0.6 0 0.7
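
The conversions described in the list above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the DataIO implementation: the function names `parse_svmlight_line` and `tags_to_dense` are hypothetical, items are assumed to be separated by whitespace, and absent tags are assumed to be filled with 0.

```python
from typing import Dict, List, Tuple


def parse_svmlight_line(line: str) -> Tuple[str, Dict[int, float]]:
    """Split 'label id:value id:value ...' into a label and a sparse feature dict."""
    label, *items = line.split()
    features = {}
    for item in items:
        feature_id, value = item.split(":")
        features[int(feature_id)] = float(value)
    return label, features


def tags_to_dense(lines: List[str], with_value: bool = False,
                  value_delimiter: str = ":") -> Tuple[List[str], List[List[float]]]:
    """Expand tag (or tag:value) lines into dense rows over the sorted tag set."""
    parsed = []
    for line in lines:
        row = {}
        for item in line.split():
            if with_value:
                # Split on the last delimiter so the value is always the final part.
                tag, value = item.rsplit(value_delimiter, 1)
                row[tag] = float(value)
            else:
                row[item] = 1.0
        parsed.append(row)
    # Column order is the lexicographically sorted union of all tags seen.
    header = sorted({tag for row in parsed for tag in row})
    dense = [[row.get(tag, 0.0) for tag in header] for row in parsed]
    return header, dense
```

Calling `tags_to_dense(["a c", "a b d"])` returns the sorted header `["a", "b", "c", "d"]` together with one dense row per input line; passing `with_value=True` handles the tag:value variant in the same way.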
    

Param

dataio_param

Classes

DataIOParam (BaseParam)

Defines the DataIO parameters used in federated ML.

Parameters:

  • input_format ({'dense', 'sparse', 'tag'}, default: 'dense')
    the format of the input data; see the "DataIO" section of federatedml/util/README.md. Set "dense" for dense input, "sparse" for svm-light input, and "tag" for tag or tag:value input.
  • delimitor (str, default: ',')
    the delimiter of the input data.
  • data_type ({'float64', 'float', 'int', 'int64', 'str', 'long'}, default: 'float64')
    the data type of the input data.
  • exclusive_data_type (dict, default: None)
    maps col_name to data_type; used to specify a data type for particular features.
  • tag_with_value (bool, default: False)
    used only if input_format is 'tag'. If True, each input item should have the form tag[delimitor]value; otherwise it is a tag only.
  • tag_value_delimitor (str, default: ':')
    used only if input_format is 'tag' and tag_with_value is True; the delimiter inside each tag[delimitor]value item.
  • missing_fill (bool, default: False)
    whether to fill missing values; only True or False is accepted.
  • default_value (None, object or list; default: 0)
    the value that replaces missing values. If None, the default value defined in federatedml/feature/imputer.py is used. If a single object, every missing value is filled with it. If a list, its length should equal the feature dimension of the input data, and a missing value in a column is replaced by the list element at the same position.
  • missing_fill_method ({None, 'min', 'max', 'mean', 'designated'}, default: None)
    the method used to replace missing values.
  • missing_impute (None or list, default: None)
    the values to be treated as missing; list elements can be of any type. If None, the list is generated automatically.
  • outlier_replace (bool, default: False)
    whether to replace outlier values; only True or False is accepted.
  • outlier_replace_method ({None, 'min', 'max', 'mean', 'designated'}, default: None)
    the method used to replace outlier values.
  • outlier_impute (None or list, default: None)
    the values to be treated as outliers; list elements can be of any type.
  • outlier_replace_value (None, object or list; default: 0)
    the value that replaces outliers. If None, the default value defined in federatedml/feature/imputer.py is used. If a single object, every outlier is replaced with it. If a list, its length should equal the feature dimension of the input data, and an outlier in a column is replaced by the list element at the same position.
  • with_label (bool, default: False)
    True if the input data contains a label, False otherwise.
  • label_name (str, default: 'y')
    name of the column that holds the label; only used with the dense input format.
  • label_type ({'int', 'int64', 'float', 'float64', 'long', 'str'}, default: 'int')
    used when with_label is True.
  • output_format ({'dense', 'sparse'}, default: 'dense')
    the output format.
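
As a rough illustration of how these parameters combine, the dictionary below collects settings for a comma-separated dense file with an integer label column. The concrete values are hypothetical and chosen only for the example; the keys are the parameter names listed above, and the commented lines assume FATE is installed and that the class is importable from the module path named on this page.

```python
# Hypothetical settings for a comma-separated dense file whose label column is "y".
dataio_kwargs = {
    "input_format": "dense",
    "delimitor": ",",              # spelled as in the DataIOParam signature
    "data_type": "float64",
    "with_label": True,
    "label_name": "y",
    "label_type": "int",
    "missing_fill": True,          # enable missing value filling
    "missing_fill_method": "mean",
    "output_format": "dense",
}

# With FATE installed, these would presumably be passed as keyword arguments:
# from federatedml.param.dataio_param import DataIOParam
# param = DataIOParam(**dataio_kwargs)
# param.check()
```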
Source code in federatedml/param/dataio_param.py
class DataIOParam(BaseParam):
    """
    Define dataio parameters that are used in federated ml.

    Parameters
    ----------
    input_format : {'dense', 'sparse', 'tag'}
        please have a look at this tutorial at "DataIO" section of federatedml/util/README.md.
        Formally,
            dense input format data should be set to "dense",
            svm-light input format data should be set to "sparse",
            tag or tag:value input format data should be set to "tag".

    delimitor : str
        the delimitor of data input, default: ','

    data_type : {'float64', 'float', 'int', 'int64', 'str', 'long'}
        the data type of data input

    exclusive_data_type : dict
        the key of dict is col_name, the value is data_type, used to specify a special data type
        for some features.

    tag_with_value: bool
        use if input_format is 'tag', if tag_with_value is True,
        input column data format should be tag[delimitor]value, otherwise is tag only

    tag_value_delimitor: str
        use if input_format is 'tag' and 'tag_with_value' is True,
        delimitor of tag[delimitor]value column value.

    missing_fill : bool
        need to fill missing value or not, accepted only True/False, default: False

    default_value : None or object or list
        the value to replace missing value.
            if None, it will use the default value defined in federatedml/feature/imputer.py,
            if single object, will fill missing value with this object,
            if list, its length should be the same as the feature dimension of the input data,
                meaning that if some column happens to have missing values, they will be replaced
                by the element in the identical position of this list.

    missing_fill_method : {None, 'min', 'max', 'mean', 'designated'}
        the method to replace missing value

    missing_impute: None or list
        defines which values are considered missing; elements of the list can be any type, auto generated if value is None

    outlier_replace: bool
        need to replace outlier value or not, accepted only True/False, default: False

    outlier_replace_method : {None, 'min', 'max', 'mean', 'designated'}
        the method to replace outlier value

    outlier_impute: None or list
        defines which values are regarded as outliers; elements of the list can be any type, default: None

    outlier_replace_value : None or object or list
        the value to replace outlier.
            if None, it will use the default value defined in federatedml/feature/imputer.py,
            if single object, will replace outlier with this object,
            if list, its length should be the same as the feature dimension of the input data,
                meaning that if some column happens to have outliers, they will be replaced
                by the element in the identical position of this list.

    with_label : bool
        True if input data contains a label, False otherwise. default: False

    label_name : str
        column_name of the column where the label is located, only used in dense input format. default: 'y'

    label_type : {'int', 'int64', 'float', 'float64', 'long', 'str'}
        use when with_label is True.

    output_format : {'dense', 'sparse'}
        output format

    """

    def __init__(self, input_format="dense", delimitor=',', data_type='float64',
                 exclusive_data_type=None,
                 tag_with_value=False, tag_value_delimitor=":",
                 missing_fill=False, default_value=0, missing_fill_method=None,
                 missing_impute=None, outlier_replace=False, outlier_replace_method=None,
                 outlier_impute=None, outlier_replace_value=0,
                 with_label=False, label_name='y',
                 label_type='int', output_format='dense', need_run=True):
        self.input_format = input_format
        self.delimitor = delimitor
        self.data_type = data_type
        self.exclusive_data_type = exclusive_data_type
        self.tag_with_value = tag_with_value
        self.tag_value_delimitor = tag_value_delimitor
        self.missing_fill = missing_fill
        self.default_value = default_value
        self.missing_fill_method = missing_fill_method
        self.missing_impute = missing_impute
        self.outlier_replace = outlier_replace
        self.outlier_replace_method = outlier_replace_method
        self.outlier_impute = outlier_impute
        self.outlier_replace_value = outlier_replace_value
        self.with_label = with_label
        self.label_name = label_name
        self.label_type = label_type
        self.output_format = output_format
        self.need_run = need_run

    def check(self):

        descr = "dataio param's"

        self.input_format = self.check_and_change_lower(self.input_format,
                                                        ["dense", "sparse", "tag"],
                                                        descr)

        self.output_format = self.check_and_change_lower(self.output_format,
                                                         ["dense", "sparse"],
                                                         descr)

        self.data_type = self.check_and_change_lower(self.data_type,
                                                     ["int", "int64", "float", "float64", "str", "long"],
                                                     descr)

        if type(self.missing_fill).__name__ != 'bool':
            raise ValueError("dataio param's missing_fill {} not supported".format(self.missing_fill))

        if self.missing_fill_method is not None:
            self.missing_fill_method = self.check_and_change_lower(self.missing_fill_method,
                                                                   ['min', 'max', 'mean', 'designated'],
                                                                   descr)

        if self.outlier_replace_method is not None:
            self.outlier_replace_method = self.check_and_change_lower(self.outlier_replace_method,
                                                                      ['min', 'max', 'mean', 'designated'],
                                                                      descr)

        if type(self.with_label).__name__ != 'bool':
            raise ValueError("dataio param's with_label {} not supported".format(self.with_label))

        if self.with_label:
            if not isinstance(self.label_name, str):
                raise ValueError("dataio param's label_name {} should be str".format(self.label_name))

            self.label_type = self.check_and_change_lower(self.label_type,
                                                          ["int", "int64", "float", "float64", "str", "long"],
                                                          descr)

        if self.exclusive_data_type is not None and not isinstance(self.exclusive_data_type, dict):
            raise ValueError("exclusive_data_type should be None or a dict")

        return True
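
The check method relies on `check_and_change_lower`, which is inherited from BaseParam and not shown on this page. Judging only from its name and from how it is called here, it appears to lowercase a string and verify it against a list of allowed values. The stand-alone function below is a hypothetical stand-in written under that assumption; its error messages are made up for the example.

```python
def check_and_change_lower(param, valid_list, descr=""):
    """Hypothetical stand-in: lowercase a string and require it to be in valid_list."""
    if not isinstance(param, str):
        raise ValueError("{} {} should be a str".format(descr, param))
    lowered = param.lower()
    if lowered not in valid_list:
        raise ValueError(
            "{} {} is not supported, should be one of {}".format(descr, param, valid_list)
        )
    return lowered
```

Under this reading, a value such as "Dense" would be normalised to "dense", while a value outside the list would raise a ValueError before the component runs.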

Other Features of DataIO

  • Missing value imputation: provides several methods to fill missing values.
  • Outlier replacement: provides several methods to replace outlier values, analogous to missing value imputation.
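
To make the two features above concrete, here is a minimal column-wise fill routine in plain Python. It is a sketch under stated assumptions rather than the imputer in federatedml/feature/imputer.py: the name `fill_missing` and its arguments are hypothetical, the input is assumed to be a non-empty rectangular list of rows, and the four methods mirror the 'min', 'max', 'mean' and 'designated' options of the parameters.

```python
from statistics import mean


def fill_missing(rows, missing_values=(None, ""), method="designated", fill_value=0):
    """Replace entries listed in missing_values, column by column.

    'min', 'max' and 'mean' use the statistic of the remaining values in the
    same column; 'designated' (or a column with no usable value) uses fill_value.
    """
    methods = {"min": min, "max": max, "mean": mean}
    if method != "designated" and method not in methods:
        raise ValueError("unsupported method: {}".format(method))
    replacements = []
    for col in range(len(rows[0])):
        present = [row[col] for row in rows if row[col] not in missing_values]
        if method == "designated" or not present:
            replacements.append(fill_value)
        else:
            replacements.append(methods[method](present))
    return [
        [replacements[col] if value in missing_values else value
         for col, value in enumerate(row)]
        for row in rows
    ]
```

The same routine covers outlier replacement if the values regarded as outliers are passed as `missing_values`, which is one way to read the parallel between the missing_* and outlier_* parameters.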

Last update: 2021-11-08