Data Split
The Data Split module splits data into train, test, and/or validate sets of desired sizes. It is based on sklearn's train_test_split method, extended so that its output can include an additional validate set.
Use
Data Split supports both homogeneous mode (both Guest and Host hold y) and heterogeneous mode (only Guest holds y).
The table below lists the supported split modes and scenarios.
Split Mode | Federated Heterogeneous | Federated Homogeneous (Local)
---|---|---
Random | ✓ | ✓
Stratified | ✓ (continuous labels split into intervals) | ✓
The module takes a single table input, as specified in the job config file. As with other FederatedML modules, the table must be uploaded beforehand. Module parameters should be specified in the job config file; any unspecified parameter takes the default value given in the parameter definition below.
The Data Split module always outputs three tables (train, test, and validate sets); each table may be used as input to another module. The following rules govern set sizes:

- If all three set sizes are None, the input data is split in the following ratio: 80% to the train set, 20% to the validate set, and an empty test set.
- If only the test size or validate size is given, the train size is set to the complement of the given size.
- Only one of the three sizes is needed to split the input data, but all three may be specified. Each size may be an int (instance count) or a float (fraction of the input data), but mixed-type inputs cannot be used.
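As a rough illustration, the size-resolution rules above can be sketched as follows. This is a simplified stand-in for FATE's internal logic, covering only the float-fraction case; resolve_fractions is a hypothetical helper, not part of FederatedML:

```python
def resolve_fractions(train_size=None, test_size=None, validate_size=None):
    """Resolve (train, test, validate) fractions per the rules above.

    Hypothetical helper for illustration only; handles float fractions.
    """
    if train_size is None and test_size is None and validate_size is None:
        # default split: 80% train, empty test, 20% validate
        return 0.8, 0.0, 0.2
    if train_size is None:
        # train size is the complement of the specified size(s)
        train_size = 1.0 - (test_size or 0.0) - (validate_size or 0.0)
    if test_size is None:
        test_size = max(0.0, 1.0 - train_size - (validate_size or 0.0))
    if validate_size is None:
        validate_size = max(0.0, 1.0 - train_size - test_size)
    return train_size, test_size, validate_size
```

For example, specifying only test_size=0.3 would yield a 0.7 train fraction and an empty validate set under these rules.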
Param

data_split_param

Classes

DataSplitParam(random_state=None, test_size=None, train_size=None, validate_size=None, stratified=False, shuffle=True, split_points=None, need_run=True)

Bases: BaseParam

Define the data split parameters used in the Data Split module.
Parameters:

Name | Type | Description | Default
---|---|---|---
random_state | None or int | Random state for shuffling. | None
test_size | float, int, or None | Test set size. A float specifies a fraction of the input data set; an int specifies an exact number of data instances. | None
train_size | float, int, or None | Train set size. A float specifies a fraction of the input data set; an int specifies an exact number of data instances. | None
validate_size | float, int, or None | Validate set size. A float specifies a fraction of the input data set; an int specifies an exact number of data instances. | None
stratified | bool | Whether sampling should be stratified according to label value. | False
shuffle | bool | Whether to shuffle data before splitting. | True
split_points | None or list | Point(s) by which continuous label values are bucketed into bins for stratified split, e.g. [0.2] for 2 bins or [0.1, 1, 3] for 4 bins. | None
need_run | bool | Whether to run Data Split. | True
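For intuition, here is a minimal sketch of how split_points could bucket continuous label values into bins for a stratified split. This is assumed behavior for illustration, not FATE's internal code, and the exact boundary convention (inclusive vs. exclusive) may differ in FederatedML:

```python
import bisect

def label_to_bin(label, split_points):
    # Hypothetical helper: split_points of length k define k + 1 bins;
    # bisect_left returns the index of the first split point >= label,
    # which serves as the bin index.
    return bisect.bisect_left(split_points, label)

split_points = [0.1, 1, 3]  # 4 bins, as in the example above
bins = [label_to_bin(v, split_points) for v in (0.05, 0.5, 2, 10)]
# each label falls into one of bins 0..3
```

Stratified splitting then samples within each bin so that the bin proportions are preserved across the train, test, and validate sets.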
Source code in python/federatedml/param/data_split_param.py
Attributes

- random_state = random_state (instance attribute)
- test_size = test_size (instance attribute)
- train_size = train_size (instance attribute)
- validate_size = validate_size (instance attribute)
- stratified = stratified (instance attribute)
- shuffle = shuffle (instance attribute)
- split_points = split_points (instance attribute)
- need_run = need_run (instance attribute)

Functions

check()

Source code in python/federatedml/param/data_split_param.py
Examples

## Data Split Pipeline Example Usage Guide

#### Example Tasks

This section introduces the Pipeline scripts for different types of tasks.

1. Heterogeneous Data Split Task:
    - script: pipeline-hetero-data-split.py
    - data type: continuous
    - stratification: stratified by given split points
2. Homogeneous Data Split Task:
    - script: pipeline-homo-data-split.py
    - data type: categorical
    - stratification: None
3. Homogeneous Data Split Task (only validate size specified):
    - script: pipeline-homo-data-split-validate.py
    - data type: categorical
    - stratification: stratified by label
4. Heterogeneous Data Split Task with Multiple Models:
    - script: pipeline-hetero-data-split-multi-model.py
    - data type: continuous
    - stratification: None

Users can run a pipeline job directly:

```bash
python ${pipeline_script}
```
pipeline-homo-data-split-validate.py
```python
import argparse

from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HomoDataSplit
from pipeline.component import Reader
from pipeline.interface import Data
from pipeline.utils.tools import load_job_config


def main(config="../../config.yaml", namespace=""):
    # obtain config
    if isinstance(config, str):
        config = load_job_config(config)
    parties = config.parties
    guest = parties.guest[0]
    host = parties.host[0]

    guest_train_data = {"name": "breast_homo_guest", "namespace": f"experiment{namespace}"}
    host_train_data = {"name": "breast_homo_host", "namespace": f"experiment{namespace}"}

    pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host)

    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_train_data)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_train_data)

    data_transform_0 = DataTransform(name="data_transform_0")
    data_transform_0.get_party_instance(
        role='guest',
        party_id=guest).component_param(
        with_label=True,
        output_format="dense",
        label_name="y",
        label_type="int")
    data_transform_0.get_party_instance(role='host', party_id=host).component_param(with_label=True)

    homo_data_split_0 = HomoDataSplit(name="homo_data_split_0", stratified=True, validate_size=0.2)

    pipeline.add_component(reader_0)
    pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
    pipeline.add_component(homo_data_split_0, data=Data(data=data_transform_0.output.data))

    pipeline.compile()
    pipeline.fit()

    print(pipeline.get_component("homo_data_split_0").get_summary())


if __name__ == "__main__":
    parser = argparse.ArgumentParser("PIPELINE DEMO")
    parser.add_argument("-config", type=str,
                        help="config file")
    args = parser.parse_args()
    if args.config is not None:
        main(args.config)
    else:
        main()
```
pipeline-hetero-data-split.py
```python
import argparse

from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HeteroDataSplit
from pipeline.component import HeteroLinR
from pipeline.component import Intersection
from pipeline.component import Reader
from pipeline.interface import Data
from pipeline.utils.tools import load_job_config


def main(config="../../config.yaml", namespace=""):
    # obtain config
    if isinstance(config, str):
        config = load_job_config(config)
    parties = config.parties
    guest = parties.guest[0]
    host = parties.host[0]
    arbiter = parties.arbiter[0]

    guest_train_data = {"name": "motor_hetero_guest", "namespace": f"experiment{namespace}"}
    host_train_data = {"name": "motor_hetero_host", "namespace": f"experiment{namespace}"}

    pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host, arbiter=arbiter)

    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_train_data)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_train_data)

    data_transform_0 = DataTransform(name="data_transform_0")
    data_transform_0.get_party_instance(
        role='guest',
        party_id=guest).component_param(
        with_label=True,
        label_name="motor_speed",
        label_type="float",
        output_format="dense")
    data_transform_0.get_party_instance(role='host', party_id=host).component_param(with_label=False)

    intersection_0 = Intersection(name="intersection_0")
    hetero_data_split_0 = HeteroDataSplit(name="hetero_data_split_0", stratified=True,
                                          test_size=0.3, split_points=[0.0, 0.2])
    hetero_linr_0 = HeteroLinR(name="hetero_linr_0", penalty="L2", optimizer="sgd", tol=0.001,
                               alpha=0.01, max_iter=10, early_stop="weight_diff", batch_size=-1,
                               learning_rate=0.15, decay=0.0, decay_sqrt=False,
                               init_param={"init_method": "zeros"})

    pipeline.add_component(reader_0)
    pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
    pipeline.add_component(intersection_0, data=Data(data=data_transform_0.output.data))
    pipeline.add_component(hetero_data_split_0, data=Data(data=intersection_0.output.data))
    pipeline.add_component(hetero_linr_0, data=Data(train_data=hetero_data_split_0.output.data.train_data,
                                                    validate_data=hetero_data_split_0.output.data.test_data))

    pipeline.compile()
    pipeline.fit()


if __name__ == "__main__":
    parser = argparse.ArgumentParser("PIPELINE DEMO")
    parser.add_argument("-config", type=str,
                        help="config file")
    args = parser.parse_args()
    if args.config is not None:
        main(args.config)
    else:
        main()
```
pipeline-homo-data-split.py
```python
import argparse

from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HomoDataSplit
from pipeline.component import Reader
from pipeline.interface import Data
from pipeline.utils.tools import load_job_config


def main(config="../../config.yaml", namespace=""):
    # obtain config
    if isinstance(config, str):
        config = load_job_config(config)
    parties = config.parties
    guest = parties.guest[0]
    host = parties.host[0]

    guest_train_data = {"name": "breast_homo_guest", "namespace": f"experiment{namespace}"}
    host_train_data = {"name": "breast_homo_host", "namespace": f"experiment{namespace}"}

    pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host)

    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_train_data)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_train_data)

    data_transform_0 = DataTransform(name="data_transform_0")
    data_transform_0.get_party_instance(
        role='guest',
        party_id=guest).component_param(
        with_label=True,
        output_format="dense",
        label_name="y",
        label_type="int")
    data_transform_0.get_party_instance(role='host', party_id=host).component_param(with_label=True)

    homo_data_split_0 = HomoDataSplit(name="homo_data_split_0", stratified=False, test_size=0.3, validate_size=0.2)

    pipeline.add_component(reader_0)
    pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
    pipeline.add_component(homo_data_split_0, data=Data(data=data_transform_0.output.data))

    pipeline.compile()
    pipeline.fit()

    print(pipeline.get_component("homo_data_split_0").get_summary())


if __name__ == "__main__":
    parser = argparse.ArgumentParser("PIPELINE DEMO")
    parser.add_argument("-config", type=str,
                        help="config file")
    args = parser.parse_args()
    if args.config is not None:
        main(args.config)
    else:
        main()
```
data_split_testsuite.json
```json
{
    "data": [
        {
            "file": "examples/data/motor_hetero_guest.csv",
            "head": 1,
            "partition": 16,
            "table_name": "motor_hetero_guest",
            "namespace": "experiment",
            "role": "guest_0"
        },
        {
            "file": "examples/data/motor_hetero_host.csv",
            "head": 1,
            "partition": 16,
            "table_name": "motor_hetero_host",
            "namespace": "experiment",
            "role": "host_0"
        },
        {
            "file": "examples/data/breast_homo_guest.csv",
            "head": 1,
            "partition": 16,
            "table_name": "breast_homo_guest",
            "namespace": "experiment",
            "role": "guest_0"
        },
        {
            "file": "examples/data/breast_homo_host.csv",
            "head": 1,
            "partition": 16,
            "table_name": "breast_homo_host",
            "namespace": "experiment",
            "role": "host_0"
        }
    ],
    "pipeline_tasks": {
        "hetero_data_split": {
            "script": "pipeline-hetero-data-split.py"
        },
        "homo_data_split": {
            "script": "pipeline-homo-data-split.py"
        },
        "homo_data_split_validate": {
            "script": "pipeline-homo-data-split-validate.py"
        },
        "hetero_data_split_multi_model": {
            "script": "pipeline-hetero-data-split-multi-model.py"
        }
    }
}
```
pipeline-hetero-data-split-multi-model.py
```python
import argparse

from pipeline.backend.pipeline import PipeLine
from pipeline.component import DataTransform
from pipeline.component import HeteroDataSplit
from pipeline.component import HeteroLinR
from pipeline.component import Intersection
from pipeline.component import Reader
from pipeline.interface import Data
from pipeline.interface import Model
from pipeline.utils.tools import load_job_config


def main(config="../../config.yaml", namespace=""):
    # obtain config
    if isinstance(config, str):
        config = load_job_config(config)
    parties = config.parties
    guest = parties.guest[0]
    host = parties.host[0]
    arbiter = parties.arbiter[0]

    guest_train_data = {"name": "motor_hetero_guest", "namespace": f"experiment{namespace}"}
    host_train_data = {"name": "motor_hetero_host", "namespace": f"experiment{namespace}"}

    pipeline = PipeLine().set_initiator(role='guest', party_id=guest).set_roles(guest=guest, host=host, arbiter=arbiter)

    reader_0 = Reader(name="reader_0")
    reader_0.get_party_instance(role='guest', party_id=guest).component_param(table=guest_train_data)
    reader_0.get_party_instance(role='host', party_id=host).component_param(table=host_train_data)

    data_transform_0 = DataTransform(name="data_transform_0")
    data_transform_0.get_party_instance(
        role='guest',
        party_id=guest).component_param(
        with_label=True,
        label_name="motor_speed",
        label_type="float",
        output_format="dense")
    data_transform_0.get_party_instance(role='host', party_id=host).component_param(with_label=False)

    intersection_0 = Intersection(name="intersection_0")
    hetero_data_split_0 = HeteroDataSplit(name="hetero_data_split_0", stratified=False,
                                          test_size=0.3, validate_size=0.2)
    hetero_linr_0 = HeteroLinR(name="hetero_linr_0", penalty="L2", optimizer="sgd", tol=0.001,
                               alpha=0.01, max_iter=10, early_stop="weight_diff", batch_size=-1,
                               learning_rate=0.15, decay=0.0, decay_sqrt=False,
                               init_param={"init_method": "zeros"})
    hetero_linr_1 = HeteroLinR()

    pipeline.add_component(reader_0)
    pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
    pipeline.add_component(intersection_0, data=Data(data=data_transform_0.output.data))
    pipeline.add_component(hetero_data_split_0, data=Data(data=intersection_0.output.data))
    pipeline.add_component(hetero_linr_0, data=Data(train_data=hetero_data_split_0.output.data.train_data,
                                                    validate_data=hetero_data_split_0.output.data.validate_data))
    pipeline.add_component(hetero_linr_1, data=Data(test_data=hetero_data_split_0.output.data.test_data),
                           model=Model(model=hetero_linr_0.output.model))

    pipeline.compile()
    pipeline.fit()


if __name__ == "__main__":
    parser = argparse.ArgumentParser("PIPELINE DEMO")
    parser.add_argument("-config", type=str,
                        help="config file")
    args = parser.parse_args()
    if args.config is not None:
        main(args.config)
    else:
        main()
```
## Data Split Configuration Usage Guide

#### Example Tasks

This section introduces the dsl and conf for different types of tasks.

1. Heterogeneous Data Split Task:
    - dsl: test_hetero_data_split_job_dsl.json
    - runtime_config: test_hetero_data_split_job_conf.json
    - data type: continuous
    - stratification: stratified by given split points
2. Homogeneous Data Split Task:
    - dsl: test_homo_data_split_job_dsl.json
    - runtime_config: test_homo_data_split_job_conf.json
    - data type: categorical
    - stratification: None
3. Homogeneous Data Split Task (only validate size specified):
    - dsl: test_homo_data_split_job_dsl.json
    - runtime_config: test_homo_data_split_validate_job_conf.json
    - data type: categorical
    - stratification: stratified by label
4. Heterogeneous Data Split Task with Multiple Models:
    - dsl: test_hetero_data_split_multi_model_job_dsl.json
    - runtime_config: test_hetero_data_split_multi_model_job_conf.json
    - data type: continuous
    - stratification: None

Users can use the following command to run a task:

```bash
flow job submit -c ${runtime_config} -d ${dsl}
```
test_homo_data_split_job_conf.json
```json
{
    "dsl_version": 2,
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "component_parameters": {
        "common": {
            "homo_data_split_0": {
                "test_size": 0.3,
                "validate_size": 0.2,
                "stratified": false
            }
        },
        "role": {
            "host": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_homo_host",
                            "namespace": "experiment"
                        }
                    },
                    "data_transform_0": {
                        "with_label": true
                    }
                }
            },
            "guest": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_homo_guest",
                            "namespace": "experiment"
                        }
                    },
                    "data_transform_0": {
                        "with_label": true,
                        "label_name": "y",
                        "label_type": "int",
                        "output_format": "dense"
                    }
                }
            }
        }
    }
}
```
test_hetero_data_split_multi_model_job_conf.json
```json
{
    "dsl_version": 2,
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "arbiter": [
            10000
        ],
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "component_parameters": {
        "common": {
            "hetero_data_split_0": {
                "test_size": 0.3,
                "validate_size": 0.2,
                "stratified": false
            },
            "hetero_linr_0": {
                "penalty": "L2",
                "tol": 0.001,
                "alpha": 0.01,
                "optimizer": "sgd",
                "batch_size": -1,
                "learning_rate": 0.15,
                "init_param": {
                    "init_method": "zeros"
                },
                "max_iter": 10,
                "early_stop": "weight_diff",
                "decay": 0.0,
                "decay_sqrt": false
            }
        },
        "role": {
            "host": {
                "0": {
                    "data_transform_0": {
                        "with_label": false
                    },
                    "reader_0": {
                        "table": {
                            "name": "motor_hetero_host",
                            "namespace": "experiment"
                        }
                    }
                }
            },
            "guest": {
                "0": {
                    "data_transform_0": {
                        "with_label": true,
                        "label_name": "motor_speed",
                        "label_type": "float",
                        "output_format": "dense"
                    },
                    "reader_0": {
                        "table": {
                            "name": "motor_hetero_guest",
                            "namespace": "experiment"
                        }
                    }
                }
            }
        }
    }
}
```
test_hetero_data_split_job_dsl.json
```json
{
    "components": {
        "reader_0": {
            "module": "Reader",
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "data_transform_0": {
            "module": "DataTransform",
            "input": {
                "data": {
                    "data": [
                        "reader_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "intersection_0": {
            "module": "Intersection",
            "input": {
                "data": {
                    "data": [
                        "data_transform_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "hetero_data_split_0": {
            "module": "HeteroDataSplit",
            "input": {
                "data": {
                    "data": [
                        "intersection_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "train_data",
                    "validate_data",
                    "test_data"
                ]
            }
        },
        "hetero_linr_0": {
            "module": "HeteroLinR",
            "input": {
                "data": {
                    "train_data": [
                        "hetero_data_split_0.train_data"
                    ],
                    "validate_data": [
                        "hetero_data_split_0.test_data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        }
    }
}
```
test_homo_data_split_validate_job_conf.json
```json
{
    "dsl_version": 2,
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "component_parameters": {
        "common": {
            "homo_data_split_0": {
                "validate_size": 0.2,
                "stratified": true
            }
        },
        "role": {
            "host": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_homo_host",
                            "namespace": "experiment"
                        }
                    },
                    "data_transform_0": {
                        "with_label": true
                    }
                }
            },
            "guest": {
                "0": {
                    "reader_0": {
                        "table": {
                            "name": "breast_homo_guest",
                            "namespace": "experiment"
                        }
                    },
                    "data_transform_0": {
                        "with_label": true,
                        "label_name": "y",
                        "label_type": "int",
                        "output_format": "dense"
                    }
                }
            }
        }
    }
}
```
test_hetero_data_split_multi_model_job_dsl.json
```json
{
    "components": {
        "reader_0": {
            "module": "Reader",
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "data_transform_0": {
            "module": "DataTransform",
            "input": {
                "data": {
                    "data": [
                        "reader_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "intersection_0": {
            "module": "Intersection",
            "input": {
                "data": {
                    "data": [
                        "data_transform_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "hetero_data_split_0": {
            "module": "HeteroDataSplit",
            "input": {
                "data": {
                    "data": [
                        "intersection_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "train_data",
                    "validate_data",
                    "test_data"
                ]
            }
        },
        "hetero_linr_0": {
            "module": "HeteroLinR",
            "input": {
                "data": {
                    "train_data": [
                        "hetero_data_split_0.train_data"
                    ],
                    "validate_data": [
                        "hetero_data_split_0.validate_data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "hetero_linr_1": {
            "module": "HeteroLinR",
            "input": {
                "data": {
                    "test_data": [
                        "hetero_data_split_0.test_data"
                    ]
                },
                "model": [
                    "hetero_linr_0.model"
                ]
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        }
    }
}
```
data_split_testsuite.json
```json
{
    "data": [
        {
            "file": "examples/data/motor_hetero_guest.csv",
            "head": 1,
            "partition": 16,
            "table_name": "motor_hetero_guest",
            "namespace": "experiment",
            "role": "guest_0"
        },
        {
            "file": "examples/data/motor_hetero_host.csv",
            "head": 1,
            "partition": 16,
            "table_name": "motor_hetero_host",
            "namespace": "experiment",
            "role": "host_0"
        },
        {
            "file": "examples/data/breast_homo_guest.csv",
            "head": 1,
            "partition": 16,
            "table_name": "breast_homo_guest",
            "namespace": "experiment",
            "role": "guest_0"
        },
        {
            "file": "examples/data/breast_homo_host.csv",
            "head": 1,
            "partition": 16,
            "table_name": "breast_homo_host",
            "namespace": "experiment",
            "role": "host_0"
        }
    ],
    "tasks": {
        "hetero_data_split": {
            "conf": "test_hetero_data_split_job_conf.json",
            "dsl": "test_hetero_data_split_job_dsl.json"
        },
        "homo_data_split": {
            "conf": "test_homo_data_split_job_conf.json",
            "dsl": "test_homo_data_split_job_dsl.json"
        },
        "homo_data_split_validate": {
            "conf": "test_homo_data_split_validate_job_conf.json",
            "dsl": "test_homo_data_split_job_dsl.json"
        },
        "hetero_data_split_multi_model": {
            "conf": "test_hetero_data_split_multi_model_job_conf.json",
            "dsl": "test_hetero_data_split_multi_model_job_dsl.json"
        }
    }
}
```
test_hetero_data_split_job_conf.json
```json
{
    "dsl_version": 2,
    "initiator": {
        "role": "guest",
        "party_id": 9999
    },
    "role": {
        "arbiter": [
            10000
        ],
        "host": [
            10000
        ],
        "guest": [
            9999
        ]
    },
    "component_parameters": {
        "common": {
            "hetero_data_split_0": {
                "test_size": 0.3,
                "stratified": true,
                "split_points": [
                    0.0,
                    0.2
                ]
            },
            "hetero_linr_0": {
                "penalty": "L2",
                "tol": 0.001,
                "alpha": 0.01,
                "optimizer": "sgd",
                "batch_size": -1,
                "learning_rate": 0.15,
                "init_param": {
                    "init_method": "zeros"
                },
                "max_iter": 10,
                "early_stop": "weight_diff",
                "decay": 0.0,
                "decay_sqrt": false
            }
        },
        "role": {
            "host": {
                "0": {
                    "data_transform_0": {
                        "with_label": false
                    },
                    "reader_0": {
                        "table": {
                            "name": "motor_hetero_host",
                            "namespace": "experiment"
                        }
                    }
                }
            },
            "guest": {
                "0": {
                    "data_transform_0": {
                        "with_label": true,
                        "label_name": "motor_speed",
                        "label_type": "float",
                        "output_format": "dense"
                    },
                    "reader_0": {
                        "table": {
                            "name": "motor_hetero_guest",
                            "namespace": "experiment"
                        }
                    }
                }
            }
        }
    }
}
```
test_homo_data_split_job_dsl.json
```json
{
    "components": {
        "reader_0": {
            "module": "Reader",
            "output": {
                "data": [
                    "data"
                ]
            }
        },
        "data_transform_0": {
            "module": "DataTransform",
            "input": {
                "data": {
                    "data": [
                        "reader_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "data"
                ],
                "model": [
                    "model"
                ]
            }
        },
        "homo_data_split_0": {
            "module": "HomoDataSplit",
            "input": {
                "data": {
                    "data": [
                        "data_transform_0.data"
                    ]
                }
            },
            "output": {
                "data": [
                    "train_data",
                    "validate_data",
                    "test_data"
                ]
            }
        }
    }
}
```