Tree Models

Hetero SecureBoost

Gradient Boosting Decision Tree(GBDT) is a widely used statistic model for classification and regression problems. FATE has provided a novel lossless privacy-preserving tree-boosting system known as [SecureBoost: A Lossless Federated Learning Framework].

This federated learning system allows a learning process to be jointly conducted over multiple parties with partially common user samples but different feature sets, which corresponds to a vertically partitioned virtual data set. An advantage of SecureBoost is that it provides the same level of accuracy as the non privacy-preserving approach while at the same time, reveal no information of each private data provider.

The following figure shows the proposed Federated SecureBoost framework.

../../_images/secureboost.png

Figure 1: Framework of Federated SecureBoost

Active Party

We define the active party as the data provider who holds both a data matrix and the class label. Since the class label information is indispensable for supervised learning, there must be an active party with access to the label y. The active party naturally takes the responsibility as a dominating server in federated learning.

Passive Party

We define the data provider which has only a data matrix as a passive party. Passive parties play the role of clients in the federated learning setting. They are also in need of building a model to predict the class label y for their prediction purposes. Thus they must collaborate with the active party to build their model to predict y for their future users using their own features.

We align the data samples under an encryption scheme by using the privacy-preserving protocol for inter-database intersections to find the common shared users or data samples across the parties without compromising the non-shared parts of the user sets.

To ensure security, passive parties cannot get access to gradient and hessian directly. We use a “XGBoost” like tree-learning algorithm. In order to keep gradient and hessian confidential, we require the active party to encrypt gradient and hessian before sending them to passive parties. After encrypted the gradient and hessian, active party willsend the encrypted [[gradient]] and [[hessian]] to passive party, the passive party use [[gradient]] and [[hessian]] to calculate the encrypted feature histograms, then encodes the (feature, split_bin_val) and construct a (feature, split_bin_val) lookup table, sends the encode value of (feature, split_bin_val) with feature histograms to active party. After receiving the feature histograms from passive party, the active party decrypt them and find the best gains, if the feature belongs to passive party, send back the encode (feature, split_bin_val) to passive party. The following figure shows the process of federated split finding.

../../_images/split_finding.png

Figure 2: Process of Federated Split Finding

The parties continue the split finding process until finishing constructed the tree. Each party only knows the detailed split information of the tree nodes where the split features belong to its data. The following figure shows the final structure of a single decision tree.

../../_images/tree_structure.png

Figure 3: A Single Decision Tree

To use the learned model to classify a new instance, firstly the active party judge if current tree node belongs to it or not. If the current tree belongs to active party, then it can use its (feature, split_bin_val) lookup table to decide going to left child node of right,otherwise, the active party sends the node id to designated passive party, the passive party looks at its lookup table and sends back to active party which branch should the current node goes to. This process stops until the current node is a leave. The following figure shows the federated inference process.

../../_images/federated_inference.png

Figure 4: Process of Federated Inference

By following the SecureBoost framework, multiple parties can jointly build tree ensemble model without leaking privacy in federated learning. If you want to learn more about the algorithm details, you can read the paper attached above.

Optimization in Parallel Learning

SecureBoost use data parallel learning algorithm to build the decision trees in every party. The procedure of the data parallel algorithm in each party is:

  • Every party use mapPartitions API interface to generate feature-histograms of each partition of data.

  • Use reduce API interface to merge global histograms from all local feature-histograms

  • Find the best splits from merged global histograms by federated learning, then perform splits.

Applications

Hetero secureboost supports the following applications.

  • binary classification, the objective function is sigmoid cross-entropy

  • multi classification, the objective function is softmax cross-entropy

  • regression, objective function now support is least-squared-error-loss、least-absolutely-error-loss、huber-loss、

tweedie-loss、fair-loss、 log-cosh-loss

Other features

  • Column sub-sample

  • Controlling the number of nodes to split parallelly at each layer by setting max_split_nodes parameter,in order to avoid memory limit exceeding

  • Support feature importance calculation

  • Support Multi-host and single guest to build model

  • Support different encrypt-mode to balance speed and security

  • Support missing value in train and predict process

  • Support evaluate training and validate data during training process

  • Support another homomorphic encryption method called “Iterative Affine” since FATE-1.1

  • Support early stopping in FATE-1.4, to use early stopping, see [Boosting Tree Param]

Homo Secureboost

Unlike Hetero Secureboost, Homo Secureboost is conducted under a different setting. In homo secureboost, every participant(clients) holds data that shares the same feature space, and jointly train a GBDT model without leaking any data sample.

The figure below shows the overall framework of the homo secureboost algorithm.

../../_images/homo_framework.png

Figure 1: Framework of Homo SecureBoost

Client

Clients are the participants who hold their labeled samples. Samples from all client parties have the same feature space. Participants have the demand for building a more powerful model together without leaking local samples, and they share the same trained model after learning.

Server

There are potentials of data leakage if all participants send its local histogram(which contains sum of gradient and hessian) to each other because sometimes features and labels can be inferred from gradient sums and hessian sums. Thus, to ensure security in the learning process, the Server uses secure aggregation to aggregate all participants’ local histograms in a safe manner. The server can get a global histogram while not getting any local histogram and then find and broadcast the best splits to clients. Server collaboratives with all clients in the learning process.

The key steps of learning a homo secureboost model are described below:

  • Clients and the server initialize local settings. Clients and the server apply homo feature binning to get binning points for all features and then to pre-process local samples.

  • Clients and Server build a decision tree collaboratively:

    1. Clients compute local histograms for cur leaf nodes (left nodes or root node)

    2. The server applies secure aggregations: every local histogram plus a random number, and these numbers can cancel each other out. By this way server can get the global histogram without knowing any local histograms and data leakage is prevented. Figure below shows how histogram secure aggregations are conducted.

    ../../_images/secure_agg.png

    Figure 2: Secure aggregation

    1. The server commit histogram subtractions: getting the right node histograms by subtracting left node local histogram from parent node histogram. Then, the server will find the best splits points and broadcast them to clients.

    2. After getting the best split points, clients build the next layer for the current decision tree and re-assign samples. If current decision tree reaches the max depth or stop conditions are fulfilled, stop build the current tree, else go back to step (1). Figure below shows the procedure of fitting a decision tree.

    ../../_images/homo_fit.png

    Figure 3: Example of bulding a two-layer homo-decision tree

  • If tree number reach the max number, or loss is converged, stop fitting homo secureboost.

By follwing the steps above clients are able to jointly build a GBDT model together. After getting the model, every client can conduct inference to predict a new instance locally.

Optimization in learning

Homo secureboost utilizes data parallelization and histogram subtraction to accelerate the learning process.

  • Every party use mapPartitions and reduce API interface to generate local feature-histograms, only samples in left nodes are used in computing feature histograms.

  • The server aggregates all local histograms to get global histograms then get sibling histograms by subtracting left node histograms from parent histograms.

  • The server finds the best splits from merged global histograms, then broadcast best splits.

  • The computational cost and transmission cost are halved by using node subtraction.

Applications

homo secureboost supports the following applications.

  • binary classification, the objective function is sigmoid cross-entropy

  • multi classification, the objective function is softmax cross-entropy

  • regression, objective function now support is least-squared-error-loss、least-absolutely-error-loss、huber-loss、tweedie-loss、fair-loss、 log-cosh-loss

Other features

  • The server uses safety aggregations to aggregate clients’ histograms and losses, ensuring the data security

  • Column sub-sample

  • Controlling the number of nodes to split parallelly at each layer by setting max_split_nodes parameter,in order to avoid memory limit exceeding

  • Support feature importance calculation

  • Support Multi-host and single guest to build model

  • Support missing value in train and predict process

  • Support evaluate training and validate data during training process

Param

class BoostingTreeParam(tree_param=<federatedml.param.boosting_tree_param.DecisionTreeParam object>, task_type='classification', objective_param=<federatedml.param.boosting_tree_param.ObjectiveParam object>, learning_rate=0.3, num_trees=5, subsample_feature_rate=0.8, n_iter_no_change=True, tol=0.0001, encrypt_param=<federatedml.param.encrypt_param.EncryptParam object>, bin_num=32, use_missing=False, zero_as_missing=False, encrypted_mode_calculator_param=<federatedml.param.encrypted_mode_calculation_param.EncryptedModeCalculatorParam object>, predict_param=<federatedml.param.predict_param.PredictParam object>, cv_param=<federatedml.param.cross_validation_param.CrossValidationParam object>, validation_freqs=None, early_stopping_rounds=None, metrics=None, use_first_metric_only=True)

Define boosting tree parameters that used in federated ml.

Parameters
  • task_type (str, accepted 'classification', 'regression' only, default: 'classification') –

  • tree_param (DecisionTreeParam Object, default: DecisionTreeParam()) –

  • objective_param (ObjectiveParam Object, default: ObjectiveParam()) –

  • learning_rate (float, accepted float, int or long only, the learning rate of secure boost. default: 0.3) –

  • num_trees (int, accepted int, float only, the max number of trees to build. default: 5) –

  • subsample_feature_rate (float, a float-number in [0, 1], default: 0.8) –

  • n_iter_no_change (bool,) – when True and residual error less than tol, tree building process will stop. default: True

  • encrypt_param (EncodeParam Object, encrypt method use in secure boost, default: EncryptParam(), this parameter) – is only for hetero-secureboost

  • bin_num (int, positive integer greater than 1, bin number use in quantile. default: 32) –

  • encrypted_mode_calculator_param (EncryptedModeCalculatorParam object, the calculation mode use in secureboost,) – default: EncryptedModeCalculatorParam(), only for hetero-secureboost

  • use_missing (bool, accepted True, False only, use missing value in training process or not. default: False) –

  • zero_as_missing (bool, accepted True, False only, regard 0 as missing value or not,) – will be use only if use_missing=True, default: False

  • validation_freqs (None or positive integer or container object in python. Do validation in training process or Not.) –

    if equals None, will not do validation in train process; if equals positive integer, will validate data every validation_freqs epochs passes; if container object in python, will validate data if epochs belong to this container.

    e.g. validation_freqs = [10, 15], will validate data when epoch equals to 10 and 15.

    Default: None The default value is None, 1 is suggested. You can set it to a number larger than 1 in order to speed up training by skipping validation rounds. When it is larger than 1, a number which is divisible by “num_trees” is recommended, otherwise, you will miss the validation scores of last training iteration.

  • early_stopping_rounds (should be a integer larger than 0,will stop training if one metric of one validation data) – doesn’t improve in last early_stopping_round rounds, need to set validation freqs and will check early_stopping every at every validation epoch,

  • metrics (list, default: []) – Specify which metrics to be used when performing evaluation during training process. If set as empty, default metrics will be used. For regression tasks, default metrics are [‘root_mean_squared_error’, ‘mean_absolute_error’], For binary-classificatiin tasks, default metrics are [‘auc’, ‘ks’]. For multi-classification tasks, default metrics are [‘accuracy’, ‘precision’, ‘recall’]

  • use_first_metric_only (use only the first metric for early stopping) –

class DecisionTreeParam(criterion_method='xgboost', criterion_params=[0.1], max_depth=5, min_sample_split=2, min_imputiry_split=0.001, min_leaf_node=1, max_split_nodes=65536, feature_importance_type='split', n_iter_no_change=True, tol=0.001, use_missing=False, zero_as_missing=False)

Define decision tree parameters that used in federated ml.

Parameters
  • criterion_method (str, accepted "xgboost" only, the criterion function to use, default: 'xgboost') –

  • criterion_params (list, should be non empty and first element is float-number, default: 0.1.) –

  • max_depth (int, positive integer, the max depth of a decision tree, default: 5) –

  • min_sample_split (int, least quantity of nodes to split, default: 2) –

  • min_impurity_split (float, least gain of a single split need to reach, default: 1e-3) –

  • min_leaf_node (int, when samples no more than min_leaf_node, it becomes a leave, default: 1) –

  • max_split_nodes (int, positive integer, we will use no more than max_split_nodes to) – parallel finding their splits in a batch, for memory consideration. default is 65536

  • n_iter_no_change (bool, accepted True,False only, if set to True, tol will use to consider) – stop tree growth. default: True

  • feature_importance_type (str, support 'split', 'gain' only.) – if is ‘split’, feature_importances calculate by feature split times, if is ‘gain’, feature_importances calculate by feature split gain. default: ‘split’

  • tol (float, only use when n_iter_no_change is set to True, default: 0.001) –

  • use_missing (bool, accepted True, False only, use missing value in training process or not. default: False) –

  • zero_as_missing (bool, accepted True, False only, regard 0 as missing value or not,) – will be use only if use_missing=True, default: False

class ObjectiveParam(objective=None, params=None)

Define objective parameters that used in federated ml.

Parameters
  • objective (None or str, accepted None,``’cross_entropy’:class:`,`’lse’:class:`,`’lae’:class:`,`’log_cosh’:class:`,`’tweedie’:class:`,`’fair’:class:`,`’huber’`` only,) – None in host’s config, should be str in guest’config. when task_type is classification, only support cross_enctropy, other 6 types support in regression task. default: None

  • params (None or list, should be non empty list when objective is 'tweedie',``’fair’:class:`,`’huber’``,) – first element of list shoulf be a float-number large than 0.0 when objective is ‘fair’,’huber’, first element of list should be a float-number in [1.0, 2.0) when objective is ‘tweedie’