
Data Split

The Data Split module splits data into train, test, and/or validate sets of desired sizes. It is based on the sklearn train_test_split method, but its output can additionally include a validate set.

Use

Data Split supports both homogeneous (both Guest & Host have y) and heterogeneous (only Guest has y) modes.

Supported split modes and scenarios are listed below.

| Split Mode | Federated Heterogeneous | Federated Homogeneous (Local) |
| --- | --- | --- |
| Random | ✓ | ✓ |
| Stratified (continuous label split into intervals) | ✓ | ✓ |

The module takes a single table input as specified in the job configuration file. The table must be uploaded beforehand, as with other FederatedML models. Module parameters should be specified in the job configuration file; any unspecified parameter takes the default value detailed in the parameter definition below.

The Data Split module always outputs three tables (train, test, and validate sets). Each table may be used as input to another module. Below are the rules regarding set sizes:

  1. if all three set sizes are None, the original data input is split in the following ratio: 80% to train set, 20% to validate set, and an empty test set;

  2. if only the test size or validate size is given, the train size is set to the complement of the given size;

  3. only one of the three sizes is needed to split input data, but all three may be specified. The module takes either int (instance count) or float (fraction) values for set sizes, but mixed-type inputs cannot be used.
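The size-resolution rules above can be sketched as follows. `resolve_split_sizes` is a hypothetical helper that mirrors the documented behavior for illustration; it is not the module's actual implementation.

```python
def resolve_split_sizes(n, train_size=None, test_size=None, validate_size=None):
    """Resolve train/test/validate sizes into instance counts for n rows.

    Hypothetical sketch of the documented rules, not FATE's code:
    rule 1 - no sizes given: default 80% train / 20% validate / empty test;
    rule 2 - missing train size is the complement of the given size(s);
    rule 3 - sizes may be int counts or float fractions, but not mixed.
    """
    # Rule 1: fall back to the default 0.8 / 0.0 / 0.2 split
    if train_size is None and test_size is None and validate_size is None:
        train_size, test_size, validate_size = 0.8, 0.0, 0.2

    given = [s for s in (train_size, test_size, validate_size) if s is not None]
    # Rule 3: reject mixed int/float inputs
    if len({isinstance(s, float) for s in given}) > 1:
        raise ValueError("set sizes must be all int or all float, not mixed")

    # Rule 2: complement the missing train size in the same unit as the inputs
    if train_size is None:
        total = 1.0 if isinstance(given[0], float) else n
        train_size = total - sum(given)

    def to_count(s):
        # None -> empty set; float -> fraction of n; int -> exact count
        return 0 if s is None else (s if isinstance(s, int) else round(n * s))

    return to_count(train_size), to_count(test_size), to_count(validate_size)
```

For example, with 100 input rows and no sizes given, this resolves to 80 train, 0 test, and 20 validate instances; giving only `test_size=3` on 10 rows yields a 7-instance train set.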

Param

data_split_param

Attributes

Classes

DataSplitParam(random_state=None, test_size=None, train_size=None, validate_size=None, stratified=False, shuffle=True, split_points=None, need_run=True)

Bases: BaseParam

Defines the data split parameters used in Data Split.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| random_state | None or int | Specify the random state for shuffle. | None |
| test_size | float, int, or None | Specify test data set size. A float value specifies a fraction of the input data set; an int value specifies an exact number of data instances. | None |
| train_size | float, int, or None | Specify train data set size. A float value specifies a fraction of the input data set; an int value specifies an exact number of data instances. | None |
| validate_size | float, int, or None | Specify validate data set size. A float value specifies a fraction of the input data set; an int value specifies an exact number of data instances. | None |
| stratified | bool | Define whether sampling should be stratified according to label value. | False |
| shuffle | bool | Define whether to shuffle before splitting. | True |
| split_points | None or list | Specify the point(s) by which continuous label values are bucketed into bins for stratified split, e.g. [0.2] for two bins or [0.1, 1, 3] for four bins. | None |
| need_run | bool | Specify whether to run data split. | True |
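The split_points parameter lets stratified splitting work on continuous (regression) labels by bucketing each label into a bin before stratifying. A minimal sketch of that bucketing with sorted cut points; the helper `bin_label` is hypothetical, not FATE's implementation:

```python
import bisect

def bin_label(value, split_points):
    # Map a continuous label to a bin index given sorted cut points.
    # One cut point, e.g. [0.2], yields 2 bins; [0.1, 1, 3] yields 4 bins,
    # matching the bin counts described for split_points above.
    return bisect.bisect_left(split_points, value)

# Labels below 0.2 fall in bin 0, the rest in bin 1.
bins = [bin_label(y, [0.2]) for y in (0.05, 0.15, 0.5, 0.9)]
```

Stratification then proceeds on these bin indices exactly as it would on categorical labels.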
Source code in python/federatedml/param/data_split_param.py
def __init__(self, random_state=None, test_size=None, train_size=None, validate_size=None, stratified=False,
             shuffle=True, split_points=None, need_run=True):
    super(DataSplitParam, self).__init__()
    self.random_state = random_state
    self.test_size = test_size
    self.train_size = train_size
    self.validate_size = validate_size
    self.stratified = stratified
    self.shuffle = shuffle
    self.split_points = split_points
    self.need_run = need_run
Attributes
random_state = random_state instance-attribute
test_size = test_size instance-attribute
train_size = train_size instance-attribute
validate_size = validate_size instance-attribute
stratified = stratified instance-attribute
shuffle = shuffle instance-attribute
split_points = split_points instance-attribute
need_run = need_run instance-attribute
Functions
check()
Source code in python/federatedml/param/data_split_param.py
def check(self):
    model_param_descr = "data split param's "
    if self.random_state is not None:
        if not isinstance(self.random_state, int):
            raise ValueError(f"{model_param_descr} random state should be int type")
        BaseParam.check_nonnegative_number(self.random_state, f"{model_param_descr} random_state ")

    if self.test_size is not None:
        BaseParam.check_nonnegative_number(self.test_size, f"{model_param_descr} test_size ")
        if isinstance(self.test_size, float):
            BaseParam.check_decimal_float(self.test_size, f"{model_param_descr} test_size ")
    if self.train_size is not None:
        BaseParam.check_nonnegative_number(self.train_size, f"{model_param_descr} train_size ")
        if isinstance(self.train_size, float):
            BaseParam.check_decimal_float(self.train_size, f"{model_param_descr} train_size ")
    if self.validate_size is not None:
        BaseParam.check_nonnegative_number(self.validate_size, f"{model_param_descr} validate_size ")
        if isinstance(self.validate_size, float):
            BaseParam.check_decimal_float(self.validate_size, f"{model_param_descr} validate_size ")
    # use default size values if none given
    if self.test_size is None and self.train_size is None and self.validate_size is None:
        self.test_size = 0.0
        self.train_size = 0.8
        self.validate_size = 0.2

    BaseParam.check_boolean(self.stratified, f"{model_param_descr} stratified ")
    BaseParam.check_boolean(self.shuffle, f"{model_param_descr} shuffle ")
    BaseParam.check_boolean(self.need_run, f"{model_param_descr} need run ")

    if self.split_points is not None:
        if not isinstance(self.split_points, list):
            raise ValueError(f"{model_param_descr} split_points should be list type")

    LOGGER.debug("Finish data_split parameter check!")
    return True

Examples

Example
## Data Split Pipeline Example Usage Guide.

#### Example Tasks

This section introduces the Pipeline scripts for different types of tasks.

1. Heterogeneous Data Split Task:

    script: pipeline-hetero-data-split.py

    data type: continuous

    stratification: stratified by given split points

2. Homogeneous Data Split Task:

    script: pipeline-homo-data-split.py

    data type: categorical

    stratification: None


3. Homogeneous Data Split Task (only validate size specified):

    script: pipeline-homo-data-split-validate.py

    data type: categorical

    stratification: stratified by label

4. Heterogeneous Data Split Task with Multiple Models:

    script: pipeline-hetero-data-split-multi-model.py

    data type: continuous

    stratification: None

Users can run a pipeline job directly:

    python ${pipeline_script}
pipeline-homo-data-split-validate.py
import argparse

from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HomoDataSplit
from pipeline.component import Reader
from pipeline.interface import Data

from pipeline.utils.tools import load_job_config


def main(config="../../config.yaml", namespace=""):
    # obtain config
    if isinstance(config, str):
        config = load_job_config(config)
    parties = config.parties
    guest = parties.guest[0]
    host = parties.host[0]

    guest_train_data = {"name": "breast_homo_guest", "namespace": f"experiment{namespace}"}
    host_train_data = {"name": "breast_homo_host", "namespace": f"experiment{namespace}"}

    pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host)

    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_train_data)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_train_data)

    data_transform_0 = DataTransform(name="data_transform_0")

    data_transform_0.get_party_instance(
        role='guest',
        party_id=guest).component_param(
        with_label=True,
        output_format="dense",
        label_name="y",
        label_type="int")
    data_transform_0.get_party_instance(role='host', party_id=host).component_param(with_label=True)

    homo_data_split_0 = HomoDataSplit(name="homo_data_split_0", stratified=True, validate_size=0.2)

    pipeline.add_component(reader_0)
    pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
    pipeline.add_component(homo_data_split_0, data=Data(data=data_transform_0.output.data))

    pipeline.compile()

    pipeline.fit()

    print(pipeline.get_component("homo_data_split_0").get_summary())


if __name__ == "__main__":
    parser = argparse.ArgumentParser("PIPELINE DEMO")
    parser.add_argument("-config", type=str,
                        help="config file")
    args = parser.parse_args()
    if args.config is not None:
        main(args.config)
    else:
        main()
pipeline-hetero-data-split.py
import argparse

from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HeteroDataSplit
from pipeline.component import HeteroLinR
from pipeline.component import Intersection
from pipeline.component import Reader
from pipeline.interface import Data

from pipeline.utils.tools import load_job_config


def main(config="../../config.yaml", namespace=""):
    # obtain config
    if isinstance(config, str):
        config = load_job_config(config)
    parties = config.parties
    guest = parties.guest[0]
    host = parties.host[0]
    arbiter = parties.arbiter[0]

    guest_train_data = {"name": "motor_hetero_guest", "namespace": f"experiment{namespace}"}
    host_train_data = {"name": "motor_hetero_host", "namespace": f"experiment{namespace}"}

    pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host, arbiter=arbiter)

    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_train_data)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_train_data)

    data_transform_0 = DataTransform(name="data_transform_0")
    data_transform_0.get_party_instance(
        role='guest',
        party_id=guest).component_param(
        with_label=True,
        label_name="motor_speed",
        label_type="float",
        output_format="dense")
    data_transform_0.get_party_instance(role='host', party_id=host).component_param(with_label=False)

    intersection_0 = Intersection(name="intersection_0")
    hetero_data_split_0 = HeteroDataSplit(name="hetero_data_split_0", stratified=True,
                                          test_size=0.3, split_points=[0.0, 0.2])
    hetero_linr_0 = HeteroLinR(name="hetero_linr_0", penalty="L2", optimizer="sgd", tol=0.001,
                               alpha=0.01, max_iter=10, early_stop="weight_diff", batch_size=-1,
                               learning_rate=0.15, decay=0.0, decay_sqrt=False,
                               init_param={"init_method": "zeros"})

    pipeline.add_component(reader_0)
    pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
    pipeline.add_component(intersection_0, data=Data(data=data_transform_0.output.data))
    pipeline.add_component(hetero_data_split_0, data=Data(data=intersection_0.output.data))
    pipeline.add_component(hetero_linr_0, data=Data(train_data=hetero_data_split_0.output.data.train_data,
                                                    validate_data=hetero_data_split_0.output.data.test_data))

    pipeline.compile()

    pipeline.fit()


if __name__ == "__main__":
    parser = argparse.ArgumentParser("PIPELINE DEMO")
    parser.add_argument("-config", type=str,
                        help="config file")
    args = parser.parse_args()
    if args.config is not None:
        main(args.config)
    else:
        main()
pipeline-homo-data-split.py
import argparse

from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HomoDataSplit
from pipeline.component import Reader
from pipeline.interface import Data

from pipeline.utils.tools import load_job_config


def main(config="../../config.yaml", namespace=""):
    # obtain config
    if isinstance(config, str):
        config = load_job_config(config)
    parties = config.parties
    guest = parties.guest[0]
    host = parties.host[0]

    guest_train_data = {"name": "breast_homo_guest", "namespace": f"experiment{namespace}"}
    host_train_data = {"name": "breast_homo_host", "namespace": f"experiment{namespace}"}

    pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host)

    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_train_data)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_train_data)

    data_transform_0 = DataTransform(name="data_transform_0")

    data_transform_0.get_party_instance(
        role='guest',
        party_id=guest).component_param(
        with_label=True,
        output_format="dense",
        label_name="y",
        label_type="int")
    data_transform_0.get_party_instance(role='host', party_id=host).component_param(with_label=True)

    homo_data_split_0 = HomoDataSplit(name="homo_data_split_0", stratified=False, test_size=0.3, validate_size=0.2)

    pipeline.add_component(reader_0)
    pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
    pipeline.add_component(homo_data_split_0, data=Data(data=data_transform_0.output.data))

    pipeline.compile()

    pipeline.fit()

    print(pipeline.get_component("homo_data_split_0").get_summary())


if __name__ == "__main__":
    parser = argparse.ArgumentParser("PIPELINE DEMO")
    parser.add_argument("-config", type=str,
                        help="config file")
    args = parser.parse_args()
    if args.config is not None:
        main(args.config)
    else:
        main()
__init__.py (empty)

data_split_testsuite.json
{
    "data": [
        {
            "file": "examples/data/motor_hetero_guest.csv",
            "head": 1,
            "partition": 16,
            "table_name": "motor_hetero_guest",
            "namespace": "experiment",
            "role": "guest_0"
        },
        {
            "file": "examples/data/motor_hetero_host.csv",
            "head": 1,
            "partition": 16,
            "table_name": "motor_hetero_host",
            "namespace": "experiment",
            "role": "host_0"
        },
        {
            "file": "examples/data/breast_homo_guest.csv",
            "head": 1,
            "partition": 16,
            "table_name": "breast_homo_guest",
            "namespace": "experiment",
            "role": "guest_0"
        },
        {
            "file": "examples/data/breast_homo_host.csv",
            "head": 1,
            "partition": 16,
            "table_name": "breast_homo_host",
            "namespace": "experiment",
            "role": "host_0"
        }
    ],
    "pipeline_tasks": {
        "hetero_data_split": {
            "script": "pipeline-hetero-data-split.py"
        },
        "homo_data_split": {
            "script": "pipeline-homo-data-split.py"
        },
        "homo_data_split_validate": {
            "script": "pipeline-homo-data-split-validate.py"
        },
        "hetero_data_split_multi_model": {
            "script": "pipeline-hetero-data-split-multi-model.py"
        }
    }
}
pipeline-hetero-data-split-multi-model.py
import argparse

from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HeteroDataSplit
from pipeline.component import HeteroLinR
from pipeline.component import Intersection
from pipeline.component import Reader
from pipeline.interface import Data
from pipeline.interface import Model

from pipeline.utils.tools import load_job_config


def main(config="../../config.yaml", namespace=""):
    # obtain config
    if isinstance(config, str):
        config = load_job_config(config)
    parties = config.parties
    guest = parties.guest[0]
    host = parties.host[0]
    arbiter = parties.arbiter[0]

    guest_train_data = {"name": "motor_hetero_guest", "namespace": f"experiment{namespace}"}
    host_train_data = {"name": "motor_hetero_host", "namespace": f"experiment{namespace}"}

    pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host, arbiter=arbiter)

    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_train_data)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_train_data)

    data_transform_0 = DataTransform(name="data_transform_0")
    data_transform_0.get_party_instance(
        role='guest',
        party_id=guest).component_param(
        with_label=True,
        label_name="motor_speed",
        label_type="float",
        output_format="dense")
    data_transform_0.get_party_instance(role='host', party_id=host).component_param(with_label=False)

    intersection_0 = Intersection(name="intersection_0")
    hetero_data_split_0 = HeteroDataSplit(name="hetero_data_split_0", stratified=False,
                                          test_size=0.3, validate_size=0.2)
    hetero_linr_0 = HeteroLinR(name="hetero_linr_0", penalty="L2", optimizer="sgd", tol=0.001,
                               alpha=0.01, max_iter=10, early_stop="weight_diff", batch_size=-1,
                               learning_rate=0.15, decay=0.0, decay_sqrt=False,
                               init_param={"init_method": "zeros"})
    hetero_linr_1 = HeteroLinR()

    pipeline.add_component(reader_0)
    pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
    pipeline.add_component(intersection_0, data=Data(data=data_transform_0.output.data))
    pipeline.add_component(hetero_data_split_0, data=Data(data=intersection_0.output.data))
    pipeline.add_component(hetero_linr_0, data=Data(train_data=hetero_data_split_0.output.data.train_data,
                                                    validate_data=hetero_data_split_0.output.data.validate_data))
    pipeline.add_component(hetero_linr_1, data=Data(test_data=hetero_data_split_0.output.data.test_data),
                           model=Model(model=hetero_linr_0.output.model))

    pipeline.compile()

    pipeline.fit()


if __name__ == "__main__":
    parser = argparse.ArgumentParser("PIPELINE DEMO")
    parser.add_argument("-config", type=str,
                        help="config file")
    args = parser.parse_args()
    if args.config is not None:
        main(args.config)
    else:
        main()
## Data Split Configuration Usage Guide.

#### Example Tasks

This section introduces the DSL and conf files for different types of tasks.

1. Heterogeneous Data Split Task:

    dsl: test_hetero_data_split_job_dsl.json

    runtime_config : test_hetero_data_split_job_conf.json

    data type: continuous

    stratification: stratified by given split points

2. Homogeneous Data Split Task:

    dsl: test_homo_data_split_job_dsl.json

    runtime_config: test_homo_data_split_job_conf.json

    data type: categorical

    stratification: None


3. Homogeneous Data Split Task (only validate size specified):

    dsl: test_homo_data_split_job_dsl.json

    runtime_config: test_homo_data_split_validate_job_conf.json

    data type: categorical

    stratification: stratified by label

4. Heterogeneous Data Split Task with Multiple Models:

    dsl: test_hetero_data_split_multi_model_job_dsl.json

    runtime_config: test_hetero_data_split_multi_model_job_conf.json

    data type: continuous

    stratification: None

Users can use the following command to run a task:

    flow job submit -c ${runtime_config} -d ${dsl}
test_homo_data_split_job_conf.json
{
    "dsl_version": 2,
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "component_parameters": {
        "common": {
            "homo_data_split_0": {
                "test_size": 0.3,
                "validate_size": 0.2,
                "stratified": false
            }
        },
        "role": {
            "host": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_homo_host",
                            "namespace": "experiment"
                        }
                    },
                    "data_transform_0": {
                        "with_label": true
                    }
                }
            },
            "guest": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_homo_guest",
                            "namespace": "experiment"
                        }
                    },
                    "data_transform_0": {
                        "with_label": true,
                        "label_name": "y",
                        "label_type": "int",
                        "output_format": "dense"
                    }
                }
            }
        }
    }
}            
test_hetero_data_split_multi_model_job_conf.json
{
    "dsl_version": 2,
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "arbiter": [
            10000
        ],
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "component_parameters": {
        "common": {
            "hetero_data_split_0": {
                "test_size": 0.3,
                "validate_size": 0.2,
                "stratified": false
            },
            "hetero_linr_0": {
                "penalty": "L2",
                "tol": 0.001,
                "alpha": 0.01,
                "optimizer": "sgd",
                "batch_size": -1,
                "learning_rate": 0.15,
                "init_param": {
                    "init_method": "zeros"
                },
                "max_iter": 10,
                "early_stop": "weight_diff",
                "decay": 0.0,
                "decay_sqrt": false
            }
        },
        "role": {
            "host": {
                "0": {
                    "data_transform_0": {
                        "with_label": false
                    },
                    "reader_0": {
                        "table": {
                            "name": "motor_hetero_host",
                            "namespace": "experiment"
                        }
                    }
                }
            },
            "guest": {
                "0": {
                    "data_transform_0": {
                        "with_label": true,
                        "label_name": "motor_speed",
                        "label_type": "float",
                        "output_format": "dense"
                    },
                    "reader_0": {
                        "table": {
                            "name": "motor_hetero_guest",
                            "namespace": "experiment"
                        }
                    }
                }
            }
        }
    }
}            
test_hetero_data_split_job_dsl.json
{
    "components": {
        "reader_0": {
            "module": "Reader",
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "data_transform_0": {
            "module": "DataTransform",
            "input": {
                "data": {
                    "data": [
                        "reader_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "intersection_0": {
            "module": "Intersection",
            "input": {
                "data": {
                    "data": [
                        "data_transform_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "hetero_data_split_0": {
            "module": "HeteroDataSplit",
            "input": {
                "data": {
                    "data": [
                        "intersection_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "train_data",
                    "validate_data",
                    "test_data"
                ]
            }
        },
        "hetero_linr_0": {
            "module": "HeteroLinR",
            "input": {
                "data": {
                    "train_data": [
                        "hetero_data_split_0.train_data"
                    ],
                    "validate_data": [
                        "hetero_data_split_0.test_data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        }
    }
}            
test_homo_data_split_validate_job_conf.json
{
    "dsl_version": 2,
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "component_parameters": {
        "common": {
            "homo_data_split_0": {
                "validate_size": 0.2,
                "stratified": true
            }
        },
        "role": {
            "host": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_homo_host",
                            "namespace": "experiment"
                        }
                    },
                    "data_transform_0": {
                        "with_label": true
                    }
                }
            },
            "guest": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_homo_guest",
                            "namespace": "experiment"
                        }
                    },
                    "data_transform_0": {
                        "with_label": true,
                        "label_name": "y",
                        "label_type": "int",
                        "output_format": "dense"
                    }
                }
            }
        }
    }
}            
test_hetero_data_split_multi_model_job_dsl.json
{
    "components": {
        "reader_0": {
            "module": "Reader",
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "data_transform_0": {
            "module": "DataTransform",
            "input": {
                "data": {
                    "data": [
                        "reader_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "intersection_0": {
            "module": "Intersection",
            "input": {
                "data": {
                    "data": [
                        "data_transform_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "hetero_data_split_0": {
            "module": "HeteroDataSplit",
            "input": {
                "data": {
                    "data": [
                        "intersection_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "train_data",
                    "validate_data",
                    "test_data"
                ]
            }
        },
        "hetero_linr_0": {
            "module": "HeteroLinR",
            "input": {
                "data": {
                    "train_data": [
                        "hetero_data_split_0.train_data"
                    ],
                    "validate_data": [
                        "hetero_data_split_0.validate_data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "heterolinr_1": {
            "module": "HeteroLinR",
            "input": {
                "data": {
                    "test_data": [
                        "hetero_data_split_0.test_data"
                    ]
                },
                "model": [
                    "hetero_linr_0.model"
                ]
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        }
    }
}            
data_split_testsuite.json
{
    "data": [
        {
            "file": "examples/data/motor_hetero_guest.csv",
            "head": 1,
            "partition": 16,
            "table_name": "motor_hetero_guest",
            "namespace": "experiment",
            "role": "guest_0"
        },
        {
            "file": "examples/data/motor_hetero_host.csv",
            "head": 1,
            "partition": 16,
            "table_name": "motor_hetero_host",
            "namespace": "experiment",
            "role": "host_0"
        },
        {
            "file": "examples/data/breast_homo_guest.csv",
            "head": 1,
            "partition": 16,
            "table_name": "breast_homo_guest",
            "namespace": "experiment",
            "role": "guest_0"
        },
        {
            "file": "examples/data/breast_homo_host.csv",
            "head": 1,
            "partition": 16,
            "table_name": "breast_homo_host",
            "namespace": "experiment",
            "role": "host_0"
        }
    ],
    "tasks": {
        "hetero_data_split": {
            "conf": "test_hetero_data_split_job_conf.json",
            "dsl": "test_hetero_data_split_job_dsl.json"
        },
        "homo_data_split": {
            "conf": "test_homo_data_split_job_conf.json",
            "dsl": "test_homo_data_split_job_dsl.json"
        },
        "homo_data_split_validate": {
            "conf": "test_homo_data_split_validate_job_conf.json",
            "dsl": "test_homo_data_split_job_dsl.json"
        },
        "hetero_data_split_multi_model": {
            "conf": "test_hetero_data_split_multi_model_job_conf.json",
            "dsl": "test_hetero_data_split_multi_model_job_dsl.json"
        }
    }
}            
test_hetero_data_split_job_conf.json
{
    "dsl_version": 2,
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "arbiter": [
            10000
        ],
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "component_parameters": {
        "common": {
            "hetero_data_split_0": {
                "test_size": 0.3,
                "stratified": true,
                "split_points": [
                    0.0,
                    0.2
                ]
            },
            "hetero_linr_0": {
                "penalty": "L2",
                "tol": 0.001,
                "alpha": 0.01,
                "optimizer": "sgd",
                "batch_size": -1,
                "learning_rate": 0.15,
                "init_param": {
                    "init_method": "zeros"
                },
                "max_iter": 10,
                "early_stop": "weight_diff",
                "decay": 0.0,
                "decay_sqrt": false
            }
        },
        "role": {
            "host": {
                "0": {
                    "data_transform_0": {
                        "with_label": false
                    },
                    "reader_0": {
                        "table": {
                            "name": "motor_hetero_host",
                            "namespace": "experiment"
                        }
                    }
                }
            },
            "guest": {
                "0": {
                    "data_transform_0": {
                        "with_label": true,
                        "label_name": "motor_speed",
                        "label_type": "float",
                        "output_format": "dense"
                    },
                    "reader_0": {
                        "table": {
                            "name": "motor_hetero_guest",
                            "namespace": "experiment"
                        }
                    }
                }
            }
        }
    }
}            
test_homo_data_split_job_dsl.json
{
    "components": {
        "reader_0": {
            "module": "Reader",
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "data_transform_0": {
            "module": "DataTransform",
            "input": {
                "data": {
                    "data": [
                        "reader_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "homo_data_split_0": {
            "module": "HomoDataSplit",
            "input": {
                "data": {
                    "data": [
                        "data_transform_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "train_data",
                    "validate_data",
                    "test_data"
                ]
            }
        }
    }
}            

Last updated: 2022-07-12