Hetero Feature Binning

Feature binning, or data binning, is a data pre-processing technique. It can be used to reduce the effects of minor observation errors, calculate information value (IV), and so on.

Currently, we provide quantile binning and bucket binning methods. To implement the quantile binning approach, we use a special data structure described in this [paper]. Feel free to check out the detailed algorithm in the paper.
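For illustration, the two binning strategies can be sketched as follows. This is a simplified sketch with hypothetical function names: it computes exact quantiles for clarity, whereas the actual implementation approximates them with the quantile-summary structure from the paper.

```python
def quantile_split_points(values, bin_num):
    """Split points so each bin holds roughly the same number of samples."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bin_num] for i in range(1, bin_num)]

def bucket_split_points(values, bin_num):
    """Split points that divide the value range into equal-width buckets."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_num
    return [lo + i * width for i in range(1, bin_num)]

data = list(range(100))                 # 0, 1, ..., 99
print(quantile_split_points(data, 4))   # [25, 50, 75]
print(bucket_split_points(data, 4))     # [24.75, 49.5, 74.25]
```

Note how the quantile split points follow the data distribution, while the bucket split points depend only on the min and max values.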

The following figure describes how the federated IV and WOE values are calculated.

../../_images/binning_principle.png

Figure 1: Federated Feature Binning Principle

As the figure shows, party B, which holds the data labels, encrypts its labels with additively homomorphic encryption and sends them to party A. Party A computes each bin's encrypted label sum and sends the result back. Party B can then calculate the WOE and IV values based on the returned information.

For multiple hosts, the process is similar to the one-host case. The guest sends its encrypted label information to all hosts, and each host calculates and sends back the statistical information.
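The per-bin WOE/IV computation the guest performs once the bin statistics are available can be sketched in plaintext as follows (in the federated protocol the counts are aggregated under homomorphic encryption; the function name and count layout here are hypothetical). The adjustment factor stands in for empty cells, matching the `adjustment_factor` parameter described below.

```python
import math

def woe_iv(bin_counts, adjustment_factor=0.5):
    """bin_counts: list of (event_count, non_event_count) per bin."""
    total_event = sum(e for e, _ in bin_counts)
    total_non_event = sum(n for _, n in bin_counts)
    woes, iv = [], 0.0
    for event, non_event in bin_counts:
        # Substitute the adjustment factor for empty cells to keep log finite.
        e = event if event > 0 else adjustment_factor
        n = non_event if non_event > 0 else adjustment_factor
        event_rate = e / total_event
        non_event_rate = n / total_non_event
        woe = math.log(event_rate / non_event_rate)
        woes.append(woe)
        iv += (event_rate - non_event_rate) * woe
    return woes, iv

woes, iv = woe_iv([(10, 90), (40, 60), (50, 50)])
print(woes, iv)
```

A bin with a higher share of events than non-events gets a positive WOE; the IV sums each bin's contribution into a single predictiveness score for the feature.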


Figure 2: Multi-Host Binning Principle

For optimal binning, each party uses quantile binning or bucket binning to find initial split points. The guest then sends encrypted labels to the host. The host uses them to calculate the histogram of each bin and sends the histograms back to the guest. The optimal binning method then starts.

../../_images/optimal_binning.png

Figure 3: Optimal Binning Principle

There are two kinds of methods: merge-optimal binning and split-optimal binning. When iv, gini, or chi-square is chosen as the metric, merge-type optimal binning is used. If ks is chosen, split-type optimal binning is used.
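As a simplified illustration of the merge-type approach (a ChiMerge-style sketch with hypothetical helpers, not the exact implementation, which also enforces constraints such as min_bin_pct and the mixture requirement): start from many small bins and repeatedly merge the adjacent pair with the lowest chi-square statistic until the target bin count is reached.

```python
def chi2_pair(a, b):
    """Chi-square statistic of two adjacent bins, each (event, non_event)."""
    total = sum(a) + sum(b)
    chi2 = 0.0
    for row in (a, b):
        for col in range(2):
            expected = sum(row) * (a[col] + b[col]) / total
            if expected > 0:
                chi2 += (row[col] - expected) ** 2 / expected
    return chi2

def merge_bins(bins, target_num):
    bins = [list(b) for b in bins]
    while len(bins) > target_num:
        # Find the most similar (lowest chi-square) adjacent pair ...
        i = min(range(len(bins) - 1),
                key=lambda j: chi2_pair(bins[j], bins[j + 1]))
        # ... and merge it into a single bin.
        merged = [bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1]]
        bins[i:i + 2] = [merged]
    return bins

initial = [(1, 9), (2, 8), (9, 1), (8, 2)]
print(merge_bins(initial, 2))   # [[3, 17], [17, 3]]
```

Bins with similar event/non-event proportions produce a low chi-square and get merged first, so the surviving split points are the ones that best separate the labels.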

Param

class FeatureBinningParam(method='quantile', compress_thres=10000, head_size=10000, error=0.001, bin_num=10, bin_indexes=-1, bin_names=None, adjustment_factor=0.5, transform_param=<federatedml.param.feature_binning_param.TransformParam object>, optimal_binning_param=<federatedml.param.feature_binning_param.OptimalBinningParam object>, local_only=False, category_indexes=None, category_names=None, need_run=True)

Define the feature binning method

Parameters
  • method (str, 'quantile' or 'bucket', default: 'quantile') – Binning method.

  • compress_thres (int, default: 10000) – When the number of saved summaries exceeds this threshold, its compress function will be called.

  • head_size (int, default: 10000) – The buffer size to store inserted observations. When the head list reaches this buffer size, the QuantileSummaries object starts to generate summaries (or stats) and insert them into its sampled list.

  • error (float, 0 <= error < 1, default: 0.001) – The error tolerance of binning. The final split point comes from the original data, and the rank of this value is close to the exact rank. More precisely, floor((p - 2 * error) * N) <= rank(x) <= ceil((p + 2 * error) * N), where p is the quantile as a float and N is the total number of data points.

  • bin_num (int, bin_num > 0, default: 10) – The max bin number for binning

  • bin_indexes (list of int or int, default: -1) – Specify which columns need to be binned. -1 represents all columns. If you need to indicate specific columns, provide a list of header indexes instead of -1.

  • bin_names (list of string, default: []) – Specify which columns need to be calculated. Each element in the list represents a column name in the header.

  • adjustment_factor (float, default: 0.5) – The adjustment factor used when calculating WOE. This is useful when a bin contains no event or no non-event. Please note that this parameter does NOT take effect on the host side.

  • category_indexes (list of int or int, default: []) –

    Specify which columns are category features. -1 represents all columns. A list of int indicates a set of such features. For category features, bin_obj takes their original values as split points and treats them as already binned. If this is not what you expect, please do NOT put them into this parameter.

    The number of categories should not exceed the bin_num set above.

  • category_names (list of string, default: []) – Use column names to specify category features. Each element in the list represents a column name in the header.

  • local_only (bool, default: False) – Whether to provide the binning method to the guest party only. If True, the host party will do nothing.

  • transform_param (TransformParam) – Define how to transform the binned data.

  • need_run (bool, default: True) – Indicate if this module needs to be run.

class OptimalBinningParam(metric_method='iv', min_bin_pct=0.05, max_bin_pct=1.0, init_bin_nums=1000, mixture=True, init_bucket_method='quantile')

Indicate optimal binning params

Parameters
  • metric_method (str, default: 'iv') – The algorithm metric method. Supports iv, gini, ks, and chi-square.

  • min_bin_pct (float, default: 0.05) – The minimum percentage of each bucket.

  • max_bin_pct (float, default: 1.0) – The maximum percentage of each bucket.

  • init_bin_nums (int, default: 1000) – The number of bins used at initialization.

  • mixture (bool, default: True) – Whether each bucket needs both event and non-event records.

  • init_bucket_method (str, default: 'quantile') – The initial bucketing method. Accepts quantile and bucket.

class TransformParam(transform_cols=-1, transform_names=None, transform_type='bin_num')

Define how to transform the columns

Parameters
  • transform_cols (list of column index, default: -1) – Specify which columns need to be transformed. If the column index is None, no columns will be transformed. If it is -1, the same columns as in the binning module will be used.

  • transform_names (list of string, default: []) – Specify which columns need to be calculated. Each element in the list represents a column name in the header.

transform_type: str, 'bin_num' or 'woe' or None, default: 'bin_num'
Specify which value these columns will be replaced with.
  1. bin_num: Transform the original feature value to the index of the bin it belongs to.

  2. woe: Valid for the guest party only. It replaces the original value with its WOE value.

  3. None: nothing will be replaced.
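The three transform_type options can be sketched on a single column as follows (a hypothetical helper; the split points and per-bin WOE values are assumed to have been computed already by the binning step).

```python
import bisect

def transform(values, split_points, woe_values=None, transform_type='bin_num'):
    if transform_type is None:
        return list(values)                       # leave values untouched
    # Map each value to the index of the bin it falls into.
    bin_idx = [bisect.bisect_right(split_points, v) for v in values]
    if transform_type == 'bin_num':
        return bin_idx                            # replace with bin index
    if transform_type == 'woe':                   # guest side only
        return [woe_values[i] for i in bin_idx]
    raise ValueError(transform_type)

splits = [10, 20]              # 3 bins: x < 10, 10 <= x < 20, x >= 20
col = [5, 15, 25]
print(transform(col, splits))                          # [0, 1, 2]
print(transform(col, splits, woe_values=[-0.5, 0.1, 0.8],
                transform_type='woe'))                 # [-0.5, 0.1, 0.8]
```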

Features

  1. Support Quantile Binning based on quantile summary algorithm.

  2. Support Bucket Binning.

  3. Support missing value input by ignoring them.

  4. Support sparse data format generated by dataio component.

  5. Support calculating woe and iv as well as counting positive and negative cases for each bin.

  6. Support transforming data into bin indexes.

  7. Support multiple hosts binning.

  8. Support 4 types of optimal binning.

  9. Support asymmetric binning methods on Host & Guest sides.

Hetero Feature Selection

Feature selection is a process that selects a subset of features for model construction. Taking good advantage of feature selection can improve model performance.

In this version, we provide several filter methods for feature selection.

Param

class FeatureSelectionParam(select_col_indexes=-1, select_names=None, filter_methods=None, unique_param=<federatedml.param.feature_selection_param.UniqueValueParam object>, iv_value_param=<federatedml.param.feature_selection_param.IVValueSelectionParam object>, iv_percentile_param=<federatedml.param.feature_selection_param.IVPercentileSelectionParam object>, iv_top_k_param=<federatedml.param.feature_selection_param.IVTopKParam object>, variance_coe_param=<federatedml.param.feature_selection_param.VarianceOfCoeSelectionParam object>, outlier_param=<federatedml.param.feature_selection_param.OutlierColsSelectionParam object>, manually_param=<federatedml.param.feature_selection_param.ManuallyFilterParam object>, percentage_value_param=<federatedml.param.feature_selection_param.PercentageValueParam object>, need_run=True)

Define the feature selection parameters.

Parameters
  • select_col_indexes (list or int, default: -1) – Specify which columns need to be calculated. -1 represents all columns.

  • select_names (list of string, default: []) – Specify which columns need to be calculated. Each element in the list represents a column name in the header.

  • filter_methods (list, accepted: ["manually", "unique_value", "iv_value_thres", "iv_percentile", "iv_top_k", "coefficient_of_variation_value_thres", "outlier_cols", "percentage_value"], default: ["unique_value"]) –

    Specify the filter methods used in feature selection. The order in which filters are applied depends on this list. Please note that if a percentile method is used after some other filter method, the percentile applies to the ratio of remaining features.

    e.g. If you have 10 features at the beginning and 8 remain after the first filter method, and you then want the top 80% highest-iv features, floor(0.8 * 8) = 6 features will be chosen instead of 8.

  • unique_param (UniqueValueParam) – Filter the columns if all values in the feature are the same.

  • iv_value_param (IVValueSelectionParam) – Use information value to filter columns. If this method is set, a float threshold needs to be provided. Filter those columns whose iv is smaller than the threshold.

  • iv_percentile_param (IVPercentileSelectionParam) – Use information value to filter columns. If this method is set, a float ratio threshold needs to be provided. Pick floor(ratio * feature_num) features with higher iv. If multiple features around the threshold have the same iv, all those columns will be kept.

  • variance_coe_param (VarianceOfCoeSelectionParam) – Use the coefficient of variation to judge whether to filter a column or not.

  • outlier_param (OutlierColsSelectionParam) – Filter columns whose certain percentile value is larger than a threshold.

  • percentage_value_param (PercentageValueParam) – Filter the columns that have a value that exceeds a certain percentage.

  • need_run (bool, default: True) – Indicate if this module needs to be run.
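The cascading behaviour of percentile filters described above (each percentile applies to the features remaining after earlier filters) can be sketched with a hypothetical helper:

```python
import math

def apply_filters(feature_count, filters):
    """filters: ordered list of ('drop', n) or ('percentile', ratio) steps."""
    for kind, arg in filters:
        if kind == 'drop':
            feature_count -= arg                        # an earlier filter removed n features
        elif kind == 'percentile':
            feature_count = math.floor(arg * feature_count)  # ratio of the *remaining* features
    return feature_count

# 10 features, first filter leaves 8, then top 80% by iv -> floor(0.8 * 8) = 6
print(apply_filters(10, [('drop', 2), ('percentile', 0.8)]))  # 6
```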

class IVPercentileSelectionParam(percentile_threshold=1.0, local_only=False)

Use information values to select features.

Parameters

percentile_threshold (float, 0 <= percentile_threshold <= 1.0, default: 1.0) – Percentile threshold for iv_percentile method

class IVTopKParam(k=10, local_only=False)

Use information values to select features.

Parameters

k (int, should be greater than 0, default: 10) – The number of top-iv features to keep.

class IVValueSelectionParam(value_threshold=0.0, host_thresholds=None, local_only=False)

Use information values to select features.

Parameters
  • value_threshold (float, default: 0.0) – Used if the iv_value_thres method is used in feature selection.

  • host_thresholds (List of float or None, default: None) – Set threshold for different host. If None, use same threshold as guest. If provided, the order should map with the host id setting.

class ManuallyFilterParam(filter_out_indexes=None, filter_out_names=None, left_col_indexes=None, left_col_names=None)

Specify columns that need to be filtered. If a specified column exists, it will be filtered directly; otherwise, it will be ignored.

Parameters
  • filter_out_indexes (list of int, default: None) – Specify columns’ indexes to be filtered out

  • filter_out_names (list of string, default: None) – Specify columns’ names to be filtered out

  • left_col_indexes (list of int, default: None) – Specify left_col_index

  • left_col_names (list of string, default: None) – Specify left col names

  • Both filter_out and left parameters only work for this specific filter. For instance, if you set some columns to be left in this filter but those columns are filtered out by other filters, those columns will NOT be left in the final result.

  • Please note that (left_col_indexes & left_col_names) cannot be used together with (filter_out_indexes & filter_out_names) simultaneously.

class OutlierColsSelectionParam(percentile=1.0, upper_threshold=1.0)

Given a percentile and a threshold, judge whether the quantile point is larger than the threshold, and filter out the larger ones.

Parameters
  • percentile (float, [0., 1.] default: 1.0) – The percentile points to compare.

  • upper_threshold (float, default: 1.0) – The threshold that the percentile value is compared against.

class PercentageValueParam(upper_pct=1.0)

Filter the columns that have a value that exceeds a certain percentage.

Parameters

upper_pct (float, [0.1, 1.], default: 1.0) – The upper percentage threshold for filtering, upper_pct should not be less than 0.1.

class UniqueValueParam(eps=1e-05)

Use the difference between the max value and the min value to judge.

Parameters

eps (float, default: 1e-5) – The column will be filtered if the difference between its max and min values is smaller than eps.
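The unique_value judgment can be sketched as follows (a hypothetical helper over a name-to-values mapping, not the actual module interface): a column is dropped when the spread between its max and min is below eps, i.e. all values are effectively the same.

```python
def unique_value_filter(columns, eps=1e-5):
    """columns: dict of column name -> list of values; returns kept names."""
    return [name for name, values in columns.items()
            if max(values) - min(values) >= eps]

cols = {'constant': [3.0, 3.0, 3.0], 'varying': [1.0, 2.0, 5.0]}
print(unique_value_filter(cols))  # ['varying']
```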

class VarianceOfCoeSelectionParam(value_threshold=1.0)

Use coefficient of variation to select features. When judging, the absolute value will be used.

Parameters

value_threshold (float, default: 1.0) – Used if the coefficient_of_variation_value_thres method is used in feature selection. Filter those columns whose coefficient of variation is smaller than the threshold.
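The coefficient-of-variation judgment can be sketched as follows (hypothetical helpers; the population standard deviation is assumed here, and per the description above the absolute value is used when judging):

```python
import statistics

def coe_of_variation(values):
    """|std / mean| of a column."""
    return abs(statistics.pstdev(values) / statistics.mean(values))

def variance_coe_filter(columns, value_threshold=1.0):
    """Keep only columns whose coefficient of variation reaches the threshold."""
    return [name for name, values in columns.items()
            if coe_of_variation(values) >= value_threshold]

cols = {'flat': [100.0, 101.0, 99.0], 'spread': [1.0, 10.0, -8.0]}
print(variance_coe_filter(cols))  # ['spread']
```

A column with a large mean and tiny variation (like 'flat') carries little signal relative to its scale, so it is filtered out.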

Features

  1. unique_value: filter the columns if all values in the feature are the same.

  2. iv_value_thres: Use information value to filter columns. Filter those columns whose iv is smaller than the threshold.

  3. iv_percentile: Use information value to filter columns. A float ratio threshold needs to be provided. Pick floor(ratio * feature_num) features with higher iv. If multiple features around the threshold have the same iv, all those columns will be kept.

  4. coefficient_of_variation_value_thres: Use coefficient of variation to judge whether to filter or not.

  5. outlier_cols: Filter columns whose percentile value is larger than the given threshold.

  6. manually: Indicate features that need to be filtered.

Besides, we support multi-host federated feature selection for iv filters. Hosts encode their feature names and send the feature ids that are involved in feature selection to the guest. The guest uses the iv filters' logic to judge whether each feature is kept or not, then sends the results back to the hosts. The hosts decode the feature ids back to feature names and obtain the selection results.

../../_images/multi_host_selection.png

Figure 3: Multi-Host Selection Principle

More feature selection methods will be provided. Please make suggestions by submitting an issue.

Federated Sampling

FATE supports a sample method since v0.2. The sample module supports two sample modes: random sample mode and stratified sample mode.

  • In random mode, “downsample” and “upsample” methods are provided. Users can set the sample parameter “fractions”, which is the sample ratio within the data.

  • In stratified mode, “downsample” and “upsample” methods are also provided. Users can set the sample parameter “fractions” too, but it should be a list of tuples of the form (label_i, ratio). Each tuple in the list specifies the sample ratio of the corresponding label. e.g.

    [(0, 1.5), (1, 2.5), (3, 3.5)]
    
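A sketch of the stratified mode (hypothetical helper; sampling with replacement is assumed for ratios above 1, i.e. upsampling): each (label, ratio) pair controls its own stratum independently.

```python
import random

def stratified_sample(rows, fractions, seed=42):
    """rows: list of (label, features); fractions: list of (label, ratio)."""
    rng = random.Random(seed)
    sampled = []
    for label, ratio in fractions:
        stratum = [r for r in rows if r[0] == label]
        target = int(len(stratum) * ratio)
        if ratio <= 1:                       # downsample without replacement
            sampled.extend(rng.sample(stratum, target))
        else:                                # upsample with replacement
            sampled.extend(rng.choices(stratum, k=target))
    return sampled

rows = [(0, 'a'), (0, 'b'), (0, 'c'), (0, 'd'), (1, 'e'), (1, 'f')]
out = stratified_sample(rows, [(0, 0.5), (1, 2.0)])
# stratum 0 contributes 2 rows, stratum 1 contributes 4 rows
```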

Param

class SampleParam(mode='random', method='downsample', fractions=None, random_state=None, task_type='hetero', need_run=True)

Define the sample method

Parameters
  • mode (str, accepted 'random', 'stratified' only in this version; specify the sample mode to use, default: 'random') –

  • method (str, accepted 'downsample', 'upsample' only in this version, default: 'downsample') –

  • fractions (None or float or list, default: None) – If mode is 'random', it should be a float number greater than 0; otherwise, a list of pairs like [label_i, sample_rate_i], e.g. [[0, 0.5], [1, 0.8], [2, 0.3]].

  • random_state (int, RandomState instance or None, default: None) –

  • need_run (bool, default: True) – Indicate if this module needs to be run.

Feature Scale

Feature scale is a process that scales each feature along its column. The feature scale module currently supports min-max scale and standard scale.

  1. min-max scale: this estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between min and max value of each feature.

  2. standard scale: standardize features by removing the mean and scaling to unit variance
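Plain sketches of the two scaling methods (hypothetical helpers; the population standard deviation is assumed): min_max_scale maps each column to [0, 1] over its training range, and standard_scale computes z = (x - u) / s, forcing u to 0 when with_mean is False and s to 1 when with_std is False, matching the parameter descriptions below.

```python
import statistics

def min_max_scale(column):
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]

def standard_scale(column, with_mean=True, with_std=True):
    u = statistics.mean(column) if with_mean else 0.0
    s = statistics.pstdev(column) if with_std else 1.0
    return [(x - u) / s for x in column]

col = [2.0, 4.0, 6.0, 8.0]
print(min_max_scale(col))       # [0.0, 0.333..., 0.666..., 1.0]
print(standard_scale(col))      # zero mean, unit variance
```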

Param

class ScaleParam(method=None, mode='normal', scale_col_indexes=-1, scale_names=None, feat_upper=None, feat_lower=None, with_mean=True, with_std=True, need_run=True)

Define the feature scale parameters.

Parameters
  • method (str, default: None) – Like scale in sklearn; now supports "min_max_scale" and "standard_scale", and other scale methods will be supported soon. The default None does nothing for scale.

  • mode (str, default: "normal") – Supports "normal" and "cap". For "normal", feat_upper and feat_lower are plain values like "10" or "3.1"; for "cap", feat_upper and feat_lower must be between 0 and 1, meaning the percentile of the column.

  • feat_upper (int or float, default: None) – The upper limit in the column. If the scaled value is larger than feat_upper, it will be set to feat_upper.

  • feat_lower (int or float, default: None) – The lower limit in the column. If the scaled value is less than feat_lower, it will be set to feat_lower.

  • scale_col_indexes (list) – The columns whose indexes are in scale_col_indexes will be scaled; columns whose indexes are not in it will not be scaled.

  • scale_names (list of string, default: []) – Specify which columns need to be scaled. Each element in the list represents a column name in the header.

  • with_mean (bool, default: True) – Used for "standard_scale".

  • with_std (bool, default: True) – Used for "standard_scale". The standard scale of column x is calculated as z = (x - u) / s, where u is the mean of the column and s is the standard deviation of the column. If with_mean is False, u will be 0, and if with_std is False, s will be 1.

  • need_run (bool, default: True) – Indicate if this module needs to be run.

OneHot Encoder

OneHot encoding is a process by which category variables are converted to binary values. The detailed info can be found in [OneHot wiki]
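The conversion can be sketched for a single categorical column as follows (a hypothetical helper; the generated column names are illustrative): each distinct category becomes its own 0/1 indicator column.

```python
def one_hot_encode(column):
    """Return (new header, encoded rows) for one categorical column."""
    categories = sorted(set(column))
    header = [f'x_{c}' for c in categories]
    # Each row becomes a 0/1 indicator vector over the categories.
    rows = [[1 if value == c else 0 for c in categories] for value in column]
    return header, rows

header, rows = one_hot_encode(['red', 'green', 'red'])
print(header)  # ['x_green', 'x_red']
print(rows)    # [[0, 1], [1, 0], [0, 1]]
```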

Param

class OneHotEncoderParam(transform_col_indexes=-1, transform_col_names=None, need_run=True)
Parameters
  • transform_col_indexes (list or int, default: -1) – Specify which columns need to be calculated. -1 represents all columns.

  • transform_col_names (list of string, default: []) – Specify which columns need to be calculated. Each element in the list represents a column name in the header.

  • need_run (bool, default: True) – Indicate if this module needs to be run.