
Data Split

The Data Split module splits data into train, test, and/or validate sets of desired sizes. It is based on the sklearn train_test_split method, but its output can additionally include a validate set.

Use

Data Split supports both homogeneous (both Guest & Host have y) and heterogeneous (only Guest has y) modes.

Supported split modes and scenarios are listed below.

| Split Mode | Federated Heterogeneous | Federated Homogeneous (Local) |
| --- | --- | --- |
| Random | ✓ | ✓ |
| Stratified (continuous label split into intervals) | ✓ | ✓ |

The module takes a single table input as specified in the job configuration file. The table must be uploaded beforehand, as with other FederatedML models. Module parameters should be specified in the job configuration file; any unspecified parameter takes the default value detailed in the parameter definition below.

The Data Split module always outputs three tables (train, test, and validate sets). Each table may be used as input to another module. Below are the rules regarding set sizes:

  1. if all three set sizes are None, the original data input is split in the following ratio: 80% to train set, 20% to validate set, and an empty test set;

  2. if only the test size or validate size is given, the train size is set to the complement of the given size;

  3. only one of the three sizes is needed to split input data, but all three may be specified. The module takes either int (instance count) or float (fraction) values for set sizes, but mixed-type inputs cannot be used.
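The size-resolution rules above can be sketched as follows. `resolve_split_sizes` is a hypothetical helper that mirrors the documented behavior for illustration; it is not the module's actual implementation.

```python
def resolve_split_sizes(n, train_size=None, test_size=None, validate_size=None):
    """Resolve train/test/validate sizes into instance counts for n rows.

    Hypothetical sketch of the documented rules, not FATE's code:
    rule 1 - no sizes given: default 80% train / 20% validate / empty test;
    rule 2 - missing train size is the complement of the given size(s);
    rule 3 - sizes may be int counts or float fractions, but not mixed.
    """
    # Rule 1: fall back to the default 0.8 / 0.0 / 0.2 split
    if train_size is None and test_size is None and validate_size is None:
        train_size, test_size, validate_size = 0.8, 0.0, 0.2

    given = [s for s in (train_size, test_size, validate_size) if s is not None]
    # Rule 3: reject mixed int/float inputs
    if len({isinstance(s, float) for s in given}) > 1:
        raise ValueError("set sizes must be all int or all float, not mixed")

    # Rule 2: complement the missing train size in the same unit as the inputs
    if train_size is None:
        total = 1.0 if isinstance(given[0], float) else n
        train_size = total - sum(given)

    def to_count(s):
        # None -> empty set; float -> fraction of n; int -> exact count
        return 0 if s is None else (s if isinstance(s, int) else round(n * s))

    return to_count(train_size), to_count(test_size), to_count(validate_size)
```

For example, with 100 input rows and no sizes given, this resolves to 80 train, 0 test, and 20 validate instances; giving only `test_size=3` on 10 rows yields a 7-instance train set.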

Param

data_split_param

Attributes

Classes

DataSplitParam(random_state=None, test_size=None, train_size=None, validate_size=None, stratified=False, shuffle=True, split_points=None, need_run=True)

Bases: BaseParam

Defines the data split parameters used in Data Split.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| random_state | None or int | Specify the random state for shuffle. | None |
| test_size | float, int, or None | Specify test data set size. A float value specifies a fraction of the input data set; an int value specifies an exact number of data instances. | None |
| train_size | float, int, or None | Specify train data set size. A float value specifies a fraction of the input data set; an int value specifies an exact number of data instances. | None |
| validate_size | float, int, or None | Specify validate data set size. A float value specifies a fraction of the input data set; an int value specifies an exact number of data instances. | None |
| stratified | bool | Define whether sampling should be stratified according to label value. | False |
| shuffle | bool | Define whether to shuffle before splitting. | True |
| split_points | None or list | Specify the point(s) by which continuous label values are bucketed into bins for stratified split, e.g. [0.2] for two bins or [0.1, 1, 3] for four bins. | None |
| need_run | bool | Specify whether to run data split. | True |
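The split_points parameter lets stratified splitting work on continuous (regression) labels by bucketing each label into a bin before stratifying. A minimal sketch of that bucketing with sorted cut points; the helper `bin_label` is hypothetical, not FATE's implementation:

```python
import bisect

def bin_label(value, split_points):
    # Map a continuous label to a bin index given sorted cut points.
    # One cut point, e.g. [0.2], yields 2 bins; [0.1, 1, 3] yields 4 bins,
    # matching the bin counts described for split_points above.
    return bisect.bisect_left(split_points, value)

# Labels below 0.2 fall in bin 0, the rest in bin 1.
bins = [bin_label(y, [0.2]) for y in (0.05, 0.15, 0.5, 0.9)]
```

Stratification then proceeds on these bin indices exactly as it would on categorical labels.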
Source code in python/federatedml/param/data_split_param.py
def __init__(self, random_state=None, test_size=None, train_size=None, validate_size=None, stratified=False,
             shuffle=True, split_points=None, need_run=True):
    super(DataSplitParam, self).__init__()
    self.random_state = random_state
    self.test_size = test_size
    self.train_size = train_size
    self.validate_size = validate_size
    self.stratified = stratified
    self.shuffle = shuffle
    self.split_points = split_points
    self.need_run = need_run
Attributes
random_state = random_state instance-attribute
test_size = test_size instance-attribute
train_size = train_size instance-attribute
validate_size = validate_size instance-attribute
stratified = stratified instance-attribute
shuffle = shuffle instance-attribute
split_points = split_points instance-attribute
need_run = need_run instance-attribute
Functions
check()
Source code in python/federatedml/param/data_split_param.py
def check(self):
    model_param_descr = "data split param's "
    if self.random_state is not None:
        if not isinstance(self.random_state, int):
            raise ValueError(f"{model_param_descr} random state should be int type")
        BaseParam.check_nonnegative_number(self.random_state, f"{model_param_descr} random_state ")

    if self.test_size is not None:
        BaseParam.check_nonnegative_number(self.test_size, f"{model_param_descr} test_size ")
        if isinstance(self.test_size, float):
            BaseParam.check_decimal_float(self.test_size, f"{model_param_descr} test_size ")
    if self.train_size is not None:
        BaseParam.check_nonnegative_number(self.train_size, f"{model_param_descr} train_size ")
        if isinstance(self.train_size, float):
            BaseParam.check_decimal_float(self.train_size, f"{model_param_descr} train_size ")
    if self.validate_size is not None:
        BaseParam.check_nonnegative_number(self.validate_size, f"{model_param_descr} validate_size ")
        if isinstance(self.validate_size, float):
            BaseParam.check_decimal_float(self.validate_size, f"{model_param_descr} validate_size ")
    # use default size values if none given
    if self.test_size is None and self.train_size is None and self.validate_size is None:
        self.test_size = 0.0
        self.train_size = 0.8
        self.validate_size = 0.2

    BaseParam.check_boolean(self.stratified, f"{model_param_descr} stratified ")
    BaseParam.check_boolean(self.shuffle, f"{model_param_descr} shuffle ")
    BaseParam.check_boolean(self.need_run, f"{model_param_descr} need run ")

    if self.split_points is not None:
        if not isinstance(self.split_points, list):
            raise ValueError(f"{model_param_descr} split_points should be list type")

    LOGGER.debug("Finish data_split parameter check!")
    return True

Examples

Example
## Data Split Pipeline Example Usage Guide.

#### Example Tasks

This section introduces the Pipeline scripts for different types of tasks.

1. Heterogeneous Data Split Task:

    script: pipeline-hetero-data-split.py

    data type: continuous

    stratification: stratified by given split points

2. Homogeneous Data Split Task:

    script: pipeline-homo-data-split.py

    data type: categorical

    stratification: None


3. Homogeneous Data Split Task (only validate size specified):

    script: pipeline-homo-data-split-validate.py

    data type: categorical

    stratification: stratified by label

4. Heterogeneous Data Split Task with Multiple Models:

    script: pipeline-hetero-data-split-multi-model.py

    data type: continuous

    stratification: None

Users can run a pipeline job directly:

    python ${pipeline_script}
pipeline-homo-data-split-validate.py
import argparse

from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HomoDataSplit
from pipeline.component import Reader
from pipeline.interface import Data

from pipeline.utils.tools import load_job_config


def main(config="../../config.yaml", namespace=""):
    # obtain config
    if isinstance(config, str):
        config = load_job_config(config)
    parties = config.parties
    guest = parties.guest[0]
    host = parties.host[0]

    guest_train_data = {"name": "breast_homo_guest", "namespace": f"experiment{namespace}"}
    host_train_data = {"name": "breast_homo_host", "namespace": f"experiment{namespace}"}

    pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host)

    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_train_data)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_train_data)

    data_transform_0 = DataTransform(name="data_transform_0")

    data_transform_0.get_party_instance(
        role='guest',
        party_id=guest).component_param(
        with_label=True,
        output_format="dense",
        label_name="y",
        label_type="int")
    data_transform_0.get_party_instance(role='host', party_id=host).component_param(with_label=True)

    homo_data_split_0 = HomoDataSplit(name="homo_data_split_0", stratified=True, validate_size=0.2)

    pipeline.add_component(reader_0)
    pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
    pipeline.add_component(homo_data_split_0, data=Data(data=data_transform_0.output.data))

    pipeline.compile()

    pipeline.fit()

    print(pipeline.get_component("homo_data_split_0").get_summary())


if __name__ == "__main__":
    parser = argparse.ArgumentParser("PIPELINE DEMO")
    parser.add_argument("-config", type=str,
                        help="config file")
    args = parser.parse_args()
    if args.config is not None:
        main(args.config)
    else:
        main()
pipeline-hetero-data-split.py
import argparse

from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HeteroDataSplit
from pipeline.component import HeteroLinR
from pipeline.component import Intersection
from pipeline.component import Reader
from pipeline.interface import Data

from pipeline.utils.tools import load_job_config


def main(config="../../config.yaml", namespace=""):
    # obtain config
    if isinstance(config, str):
        config = load_job_config(config)
    parties = config.parties
    guest = parties.guest[0]
    host = parties.host[0]
    arbiter = parties.arbiter[0]

    guest_train_data = {"name": "motor_hetero_guest", "namespace": f"experiment{namespace}"}
    host_train_data = {"name": "motor_hetero_host", "namespace": f"experiment{namespace}"}

    pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host, arbiter=arbiter)

    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_train_data)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_train_data)

    data_transform_0 = DataTransform(name="data_transform_0")
    data_transform_0.get_party_instance(
        role='guest',
        party_id=guest).component_param(
        with_label=True,
        label_name="motor_speed",
        label_type="float",
        output_format="dense")
    data_transform_0.get_party_instance(role='host', party_id=host).component_param(with_label=False)

    intersection_0 = Intersection(name="intersection_0")
    hetero_data_split_0 = HeteroDataSplit(name="hetero_data_split_0", stratified=True,
                                          test_size=0.3, split_points=[0.0, 0.2])
    hetero_linr_0 = HeteroLinR(name="hetero_linr_0", penalty="L2", optimizer="sgd", tol=0.001,
                               alpha=0.01, max_iter=10, early_stop="weight_diff", batch_size=-1,
                               learning_rate=0.15, decay=0.0, decay_sqrt=False,
                               init_param={"init_method": "zeros"})

    pipeline.add_component(reader_0)
    pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
    pipeline.add_component(intersection_0, data=Data(data=data_transform_0.output.data))
    pipeline.add_component(hetero_data_split_0, data=Data(data=intersection_0.output.data))
    pipeline.add_component(hetero_linr_0, data=Data(train_data=hetero_data_split_0.output.data.train_data,
                                                    validate_data=hetero_data_split_0.output.data.test_data))

    pipeline.compile()

    pipeline.fit()


if __name__ == "__main__":
    parser = argparse.ArgumentParser("PIPELINE DEMO")
    parser.add_argument("-config", type=str,
                        help="config file")
    args = parser.parse_args()
    if args.config is not None:
        main(args.config)
    else:
        main()
pipeline-homo-data-split.py
import argparse

from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HomoDataSplit
from pipeline.component import Reader
from pipeline.interface import Data

from pipeline.utils.tools import load_job_config


def main(config="../../config.yaml", namespace=""):
    # obtain config
    if isinstance(config, str):
        config = load_job_config(config)
    parties = config.parties
    guest = parties.guest[0]
    host = parties.host[0]

    guest_train_data = {"name": "breast_homo_guest", "namespace": f"experiment{namespace}"}
    host_train_data = {"name": "breast_homo_host", "namespace": f"experiment{namespace}"}

    pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host)

    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_train_data)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_train_data)

    data_transform_0 = DataTransform(name="data_transform_0")

    data_transform_0.get_party_instance(
        role='guest',
        party_id=guest).component_param(
        with_label=True,
        output_format="dense",
        label_name="y",
        label_type="int")
    data_transform_0.get_party_instance(role='host', party_id=host).component_param(with_label=True)

    homo_data_split_0 = HomoDataSplit(name="homo_data_split_0", stratified=False, test_size=0.3, validate_size=0.2)

    pipeline.add_component(reader_0)
    pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
    pipeline.add_component(homo_data_split_0, data=Data(data=data_transform_0.output.data))

    pipeline.compile()

    pipeline.fit()

    print(pipeline.get_component("homo_data_split_0").get_summary())


if __name__ == "__main__":
    parser = argparse.ArgumentParser("PIPELINE DEMO")
    parser.add_argument("-config", type=str,
                        help="config file")
    args = parser.parse_args()
    if args.config is not None:
        main(args.config)
    else:
        main()
__init__.py (empty)

data_split_testsuite.json
{
    "data": [
        {
            "file": "examples/data/motor_hetero_guest.csv",
            "head": 1,
            "partition": 16,
            "table_name": "motor_hetero_guest",
            "namespace": "experiment",
            "role": "guest_0"
        },
        {
            "file": "examples/data/motor_hetero_host.csv",
            "head": 1,
            "partition": 16,
            "table_name": "motor_hetero_host",
            "namespace": "experiment",
            "role": "host_0"
        },
        {
            "file": "examples/data/breast_homo_guest.csv",
            "head": 1,
            "partition": 16,
            "table_name": "breast_homo_guest",
            "namespace": "experiment",
            "role": "guest_0"
        },
        {
            "file": "examples/data/breast_homo_host.csv",
            "head": 1,
            "partition": 16,
            "table_name": "breast_homo_host",
            "namespace": "experiment",
            "role": "host_0"
        }
    ],
    "pipeline_tasks": {
        "hetero_data_split": {
            "script": "pipeline-hetero-data-split.py"
        },
        "homo_data_split": {
            "script": "pipeline-homo-data-split.py"
        },
        "homo_data_split_validate": {
            "script": "pipeline-homo-data-split-validate.py"
        },
        "hetero_data_split_multi_model": {
            "script": "pipeline-hetero-data-split-multi-model.py"
        }
    }
}
pipeline-hetero-data-split-multi-model.py
import argparse

from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HeteroDataSplit
from pipeline.component import HeteroLinR
from pipeline.component import Intersection
from pipeline.component import Reader
from pipeline.interface import Data
from pipeline.interface import Model

from pipeline.utils.tools import load_job_config


def main(config="../../config.yaml", namespace=""):
    # obtain config
    if isinstance(config, str):
        config = load_job_config(config)
    parties = config.parties
    guest = parties.guest[0]
    host = parties.host[0]
    arbiter = parties.arbiter[0]

    guest_train_data = {"name": "motor_hetero_guest", "namespace": f"experiment{namespace}"}
    host_train_data = {"name": "motor_hetero_host", "namespace": f"experiment{namespace}"}

    pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host, arbiter=arbiter)

    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_train_data)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_train_data)

    data_transform_0 = DataTransform(name="data_transform_0")
    data_transform_0.get_party_instance(
        role='guest',
        party_id=guest).component_param(
        with_label=True,
        label_name="motor_speed",
        label_type="float",
        output_format="dense")
    data_transform_0.get_party_instance(role='host', party_id=host).component_param(with_label=False)

    intersection_0 = Intersection(name="intersection_0")
    hetero_data_split_0 = HeteroDataSplit(name="hetero_data_split_0", stratified=False,
                                          test_size=0.3, validate_size=0.2)
    hetero_linr_0 = HeteroLinR(name="hetero_linr_0", penalty="L2", optimizer="sgd", tol=0.001,
                               alpha=0.01, max_iter=10, early_stop="weight_diff", batch_size=-1,
                               learning_rate=0.15, decay=0.0, decay_sqrt=False,
                               init_param={"init_method": "zeros"})
    hetero_linr_1 = HeteroLinR()

    pipeline.add_component(reader_0)
    pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
    pipeline.add_component(intersection_0, data=Data(data=data_transform_0.output.data))
    pipeline.add_component(hetero_data_split_0, data=Data(data=intersection_0.output.data))
    pipeline.add_component(hetero_linr_0, data=Data(train_data=hetero_data_split_0.output.data.train_data,
                                                    validate_data=hetero_data_split_0.output.data.validate_data))
    pipeline.add_component(hetero_linr_1, data=Data(test_data=hetero_data_split_0.output.data.test_data),
                           model=Model(model=hetero_linr_0.output.model))

    pipeline.compile()

    pipeline.fit()


if __name__ == "__main__":
    parser = argparse.ArgumentParser("PIPELINE DEMO")
    parser.add_argument("-config", type=str,
                        help="config file")
    args = parser.parse_args()
    if args.config is not None:
        main(args.config)
    else:
        main()
## Data Split Configuration Usage Guide.

#### Example Tasks

This section introduces the DSL and conf files for different types of tasks.

1. Heterogeneous Data Split Task:

    dsl: test_hetero_data_split_job_dsl.json

    runtime_config : test_hetero_data_split_job_conf.json

    data type: continuous

    stratification: stratified by given split points

2. Homogeneous Data Split Task:

    dsl: test_homo_data_split_job_dsl.json

    runtime_config: test_homo_data_split_job_conf.json

    data type: categorical

    stratification: None


3. Homogeneous Data Split Task (only validate size specified):

    dsl: test_homo_data_split_job_dsl.json

    runtime_config: test_homo_data_split_validate_job_conf.json

    data type: categorical

    stratification: stratified by label

4. Heterogeneous Data Split Task with Multiple Models:

    dsl: test_hetero_data_split_multi_model_job_dsl.json

    runtime_config: test_hetero_data_split_multi_model_job_conf.json

    data type: continuous

    stratification: None

Users can use the following command to run a task:

    flow job submit -c ${runtime_config} -d ${dsl}
test_homo_data_split_job_conf.json
{
    "dsl_version": 2,
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "component_parameters": {
        "common": {
            "homo_data_split_0": {
                "test_size": 0.3,
                "validate_size": 0.2,
                "stratified": false
            }
        },
        "role": {
            "host": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_homo_host",
                            "namespace": "experiment"
                        }
                    },
                    "data_transform_0": {
                        "with_label": true
                    }
                }
            },
            "guest": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_homo_guest",
                            "namespace": "experiment"
                        }
                    },
                    "data_transform_0": {
                        "with_label": true,
                        "label_name": "y",
                        "label_type": "int",
                        "output_format": "dense"
                    }
                }
            }
        }
    }
}            
test_hetero_data_split_multi_model_job_conf.json
{
    "dsl_version": 2,
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "arbiter": [
            10000
        ],
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "component_parameters": {
        "common": {
            "hetero_data_split_0": {
                "test_size": 0.3,
                "validate_size": 0.2,
                "stratified": false
            },
            "hetero_linr_0": {
                "penalty": "L2",
                "tol": 0.001,
                "alpha": 0.01,
                "optimizer": "sgd",
                "batch_size": -1,
                "learning_rate": 0.15,
                "init_param": {
                    "init_method": "zeros"
                },
                "max_iter": 10,
                "early_stop": "weight_diff",
                "decay": 0.0,
                "decay_sqrt": false
            }
        },
        "role": {
            "host": {
                "0": {
                    "data_transform_0": {
                        "with_label": false
                    },
                    "reader_0": {
                        "table": {
                            "name": "motor_hetero_host",
                            "namespace": "experiment"
                        }
                    }
                }
            },
            "guest": {
                "0": {
                    "data_transform_0": {
                        "with_label": true,
                        "label_name": "motor_speed",
                        "label_type": "float",
                        "output_format": "dense"
                    },
                    "reader_0": {
                        "table": {
                            "name": "motor_hetero_guest",
                            "namespace": "experiment"
                        }
                    }
                }
            }
        }
    }
}            
test_hetero_data_split_job_dsl.json
{
    "components": {
        "reader_0": {
            "module": "Reader",
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "data_transform_0": {
            "module": "DataTransform",
            "input": {
                "data": {
                    "data": [
                        "reader_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "intersection_0": {
            "module": "Intersection",
            "input": {
                "data": {
                    "data": [
                        "data_transform_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "hetero_data_split_0": {
            "module": "HeteroDataSplit",
            "input": {
                "data": {
                    "data": [
                        "intersection_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "train_data",
                    "validate_data",
                    "test_data"
                ]
            }
        },
        "hetero_linr_0": {
            "module": "HeteroLinR",
            "input": {
                "data": {
                    "train_data": [
                        "hetero_data_split_0.train_data"
                    ],
                    "validate_data": [
                        "hetero_data_split_0.test_data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        }
    }
}            
test_homo_data_split_validate_job_conf.json
{
    "dsl_version": 2,
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "component_parameters": {
        "common": {
            "homo_data_split_0": {
                "validate_size": 0.2,
                "stratified": true
            }
        },
        "role": {
            "host": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_homo_host",
                            "namespace": "experiment"
                        }
                    },
                    "data_transform_0": {
                        "with_label": true
                    }
                }
            },
            "guest": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_homo_guest",
                            "namespace": "experiment"
                        }
                    },
                    "data_transform_0": {
                        "with_label": true,
                        "label_name": "y",
                        "label_type": "int",
                        "output_format": "dense"
                    }
                }
            }
        }
    }
}            
test_hetero_data_split_multi_model_job_dsl.json
{
    "components": {
        "reader_0": {
            "module": "Reader",
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "data_transform_0": {
            "module": "DataTransform",
            "input": {
                "data": {
                    "data": [
                        "reader_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "intersection_0": {
            "module": "Intersection",
            "input": {
                "data": {
                    "data": [
                        "data_transform_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "hetero_data_split_0": {
            "module": "HeteroDataSplit",
            "input": {
                "data": {
                    "data": [
                        "intersection_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "train_data",
                    "validate_data",
                    "test_data"
                ]
            }
        },
        "hetero_linr_0": {
            "module": "HeteroLinR",
            "input": {
                "data": {
                    "train_data": [
                        "hetero_data_split_0.train_data"
                    ],
                    "validate_data": [
                        "hetero_data_split_0.validate_data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "heterolinr_1": {
            "module": "HeteroLinR",
            "input": {
                "data": {
                    "test_data": [
                        "hetero_data_split_0.test_data"
                    ]
                },
                "model": [
                    "hetero_linr_0.model"
                ]
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        }
    }
}            
data_split_testsuite.json
{
    "data": [
        {
            "file": "examples/data/motor_hetero_guest.csv",
            "head": 1,
            "partition": 16,
            "table_name": "motor_hetero_guest",
            "namespace": "experiment",
            "role": "guest_0"
        },
        {
            "file": "examples/data/motor_hetero_host.csv",
            "head": 1,
            "partition": 16,
            "table_name": "motor_hetero_host",
            "namespace": "experiment",
            "role": "host_0"
        },
        {
            "file": "examples/data/breast_homo_guest.csv",
            "head": 1,
            "partition": 16,
            "table_name": "breast_homo_guest",
            "namespace": "experiment",
            "role": "guest_0"
        },
        {
            "file": "examples/data/breast_homo_host.csv",
            "head": 1,
            "partition": 16,
            "table_name": "breast_homo_host",
            "namespace": "experiment",
            "role": "host_0"
        }
    ],
    "tasks": {
        "hetero_data_split": {
            "conf": "test_hetero_data_split_job_conf.json",
            "dsl": "test_hetero_data_split_job_dsl.json"
        },
        "homo_data_split": {
            "conf": "test_homo_data_split_job_conf.json",
            "dsl": "test_homo_data_split_job_dsl.json"
        },
        "homo_data_split_validate": {
            "conf": "test_homo_data_split_validate_job_conf.json",
            "dsl": "test_homo_data_split_job_dsl.json"
        },
        "hetero_data_split_multi_model": {
            "conf": "test_hetero_data_split_multi_model_job_conf.json",
            "dsl": "test_hetero_data_split_multi_model_job_dsl.json"
        }
    }
}            
test_hetero_data_split_job_conf.json
{
    "dsl_version": 2,
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "arbiter": [
            10000
        ],
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "component_parameters": {
        "common": {
            "hetero_data_split_0": {
                "test_size": 0.3,
                "stratified": true,
                "split_points": [
                    0.0,
                    0.2
                ]
            },
            "hetero_linr_0": {
                "penalty": "L2",
                "tol": 0.001,
                "alpha": 0.01,
                "optimizer": "sgd",
                "batch_size": -1,
                "learning_rate": 0.15,
                "init_param": {
                    "init_method": "zeros"
                },
                "max_iter": 10,
                "early_stop": "weight_diff",
                "decay": 0.0,
                "decay_sqrt": false
            }
        },
        "role": {
            "host": {
                "0": {
                    "data_transform_0": {
                        "with_label": false
                    },
                    "reader_0": {
                        "table": {
                            "name": "motor_hetero_host",
                            "namespace": "experiment"
                        }
                    }
                }
            },
            "guest": {
                "0": {
                    "data_transform_0": {
                        "with_label": true,
                        "label_name": "motor_speed",
                        "label_type": "float",
                        "output_format": "dense"
                    },
                    "reader_0": {
                        "table": {
                            "name": "motor_hetero_guest",
                            "namespace": "experiment"
                        }
                    }
                }
            }
        }
    }
}            
test_homo_data_split_job_dsl.json
{
    "components": {
        "reader_0": {
            "module": "Reader",
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "data_transform_0": {
            "module": "DataTransform",
            "input": {
                "data": {
                    "data": [
                        "reader_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "homo_data_split_0": {
            "module": "HomoDataSplit",
            "input": {
                "data": {
                    "data": [
                        "data_transform_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "train_data",
                    "validate_data",
                    "test_data"
                ]
            }
        }
    }
}            

Last updated: 2022-07-12