Hetero Feature Selection
Feature selection is the process of choosing a subset of features for model construction. A well-chosen subset can improve model performance.
This version provides several filter methods for feature selection. Note that the module works in a cascade manner: the features selected by filter A become the input of the next filter B. Pay attention to the order in which filters are listed when supplying multiple filters to the filter_methods parameter in the job configuration.
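The cascade behaviour can be illustrated with a minimal sketch. This is not the module's implementation: the scores, the column names and the helper functions `top_k_by_iv`, `max_missing_ratio` and `run_cascade` are all hypothetical, and they only show why the listing order can change the result.

```python
from typing import Callable, Dict, List

# Hypothetical per-feature scores; real values would come from upstream components.
IV: Dict[str, float] = {"x0": 0.30, "x1": 0.02, "x2": 0.15, "x3": 0.08}
MISSING_RATIO: Dict[str, float] = {"x0": 0.01, "x1": 0.00, "x2": 0.40, "x3": 0.05}

Filter = Callable[[List[str]], List[str]]


def top_k_by_iv(k: int) -> Filter:
    """Keep the k features with the highest IV among the columns passed in."""
    return lambda cols: sorted(cols, key=lambda c: IV[c], reverse=True)[:k]


def max_missing_ratio(limit: float) -> Filter:
    """Keep features whose missing ratio does not exceed `limit`."""
    return lambda cols: [c for c in cols if MISSING_RATIO[c] <= limit]


def run_cascade(columns: List[str], filters: List[Filter]) -> List[str]:
    """Apply filters in order; each filter only sees what the previous one kept."""
    for apply_filter in filters:
        columns = apply_filter(columns)
    return columns


all_columns = ["x0", "x1", "x2", "x3"]
a_then_b = run_cascade(all_columns, [top_k_by_iv(2), max_missing_ratio(0.1)])
b_then_a = run_cascade(all_columns, [max_missing_ratio(0.1), top_k_by_iv(2)])
print(a_then_b)  # ['x0']
print(b_then_a)  # ['x0', 'x3']
```

The same two filters keep different features depending on which one runs first, which is why the order of the list matters.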
Features
The table below lists the available input models and their corresponding filter methods (given as parameters in the configuration):
Isometric Model | Filter Method |
---|---|
None | manually, percentage_value |
Binning | iv_filter(threshold), iv_filter(top_k), iv_filter(top_percentile) |
Statistic | statistic_filter |
Pearson | correlation_filter (with 'iv' metric & binning model), vif_filter |
SBT | hetero_sbt_filter, hetero_fast_sbt_filter |
PSI | psi_filter |
Most of the filter methods above share the same set of configurable parameters. The table below lists their acceptable parameter values.
Filter Method | Parameter Name | metrics | filter_type | take_high |
---|---|---|---|---|
IV Filter | filter_param | "iv" | "threshold", "top_k", "top_percentile" | True |
Statistic Filter | statistic_param | "max", "min", "mean", "median", "stddev", "variance", "coefficient_of_variance", "skewness", "kurtosis", "missing_count", "missing_ratio", quantile(e.g."95%") | "threshold", "top_k", "top_percentile" | True/False |
PSI Filter | psi_param | "psi" | "threshold", "top_k", "top_percentile" | False |
VIF Filter | vif_param | "vif" | "threshold", "top_k", "top_percentile" | False |
Hetero/Homo/HeteroFast SBT Filter | sbt_param | "feature_importance" | "threshold", "top_k", "top_percentile" | True |
- unique_value: Filter out a column if all values in the feature are the same.
- iv_filter: Use IV as the criterion for selecting features. Three modes are supported: threshold value, top-k and top-percentile.
  - threshold value: Filter out columns whose IV is smaller than the threshold. A different threshold can be set for each party.
  - top-k: Sort features by IV from largest to smallest and keep the top k features of the sorted result.
  - top-percentile: Sort features by IV from largest to smallest and keep the top percentile.

  A multi-class IV filter is also available if multi-class IV has been calculated in an upstream component. Keep in mind that as many IVs are calculated as there are labels, since one-vs-rest is used for multi-class cases. Three mechanisms are available for merging them:
  - "min": take the minimum IV among all results.
  - "max": take the maximum IV among all results.
  - "average": take the average of all results.

  After merging, each column has a single IV, so the three modes mentioned above can be used to select features.
- statistic_filter: Use statistic values calculated by the DataStatistic component. Supports coefficient of variance, missing value, percentile value, etc. You can pick the columns with higher or lower statistic values as needed.
- psi_filter: Take the PSI component as the input isometric model, then use its PSI value as the selection criterion.
- hetero_sbt_filter/homo_sbt_filter/hetero_fast_sbt_filter: Take the secureboost component as the input isometric model and use feature importance as the selection criterion.
- manually: Indicate the features that need to be filtered.
- percentage_value: Filter the columns that have a value that exceeds a certain percentage.
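The three iv_filter modes and the multi-class merge step can be sketched as follows. This is an illustrative reading of the description above, not the module's code: the function names `merge_multi_class_iv` and `select_by_iv` are hypothetical, ties around a cut-off are not treated specially, and the top-percentile count is assumed to be floor(ratio * feature_num).

```python
import math
from statistics import mean
from typing import Dict, List


def merge_multi_class_iv(ivs_per_label: Dict[str, List[float]],
                         merge_type: str = "average") -> Dict[str, float]:
    """Collapse the one-vs-rest IVs of each column into a single IV."""
    reducers = {"min": min, "max": max, "average": mean}
    reduce_fn = reducers[merge_type]
    return {col: reduce_fn(values) for col, values in ivs_per_label.items()}


def select_by_iv(iv: Dict[str, float], filter_type: str, threshold: float) -> List[str]:
    """Return the kept columns, ordered from largest to smallest IV."""
    ranked = sorted(iv, key=iv.get, reverse=True)
    if filter_type == "threshold":
        # Columns whose IV is smaller than the threshold are filtered out.
        return [col for col in ranked if iv[col] >= threshold]
    if filter_type == "top_k":
        return ranked[:int(threshold)]
    if filter_type == "top_percentile":
        return ranked[:math.floor(threshold * len(ranked))]
    raise ValueError(f"unsupported filter_type: {filter_type}")


# Hypothetical IVs for a 3-label problem: one IV per label and column.
ivs = {
    "x0": [0.30, 0.10, 0.20],
    "x1": [0.02, 0.04, 0.03],
    "x2": [0.15, 0.25, 0.05],
}
merged_avg = merge_multi_class_iv(ivs, "average")
merged_min = merge_multi_class_iv(ivs, "min")

kept_threshold = select_by_iv(merged_avg, "threshold", 0.1)
kept_top_k = select_by_iv(merged_avg, "top_k", 1)
kept_top_pct = select_by_iv(merged_avg, "top_percentile", 0.67)
kept_min_threshold = select_by_iv(merged_min, "threshold", 0.1)
```

Choosing "min" as the merge type is the strictest option here: a column survives a threshold only if its IV is high enough for every label.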
In addition, multi-host federated feature selection is supported for IV filters. Hosts encode their feature names and send the ids of the features involved in feature selection. The guest applies the IV filter logic to decide whether each feature is kept, then sends the result back to the hosts. The hosts decode the feature ids back into feature names and obtain the selection results.
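The message flow described above can be sketched in a single process. Everything here is a simplified assumption: the `Host` class, the id format `host_feature_<i>` and `guest_select` are hypothetical, and how the guest obtains an IV for each encoded id is outside the scope of the sketch.

```python
import random
from typing import Dict, List


class Host:
    """Holds real feature names and only exposes encoded ids to the guest."""

    def __init__(self, feature_iv: Dict[str, float]) -> None:
        self._iv = feature_iv
        names = list(feature_iv)
        random.shuffle(names)  # ids should not reveal the original column order
        self._id_of = {name: f"host_feature_{i}" for i, name in enumerate(names)}
        self._name_of = {fid: name for name, fid in self._id_of.items()}

    def encoded_scores(self) -> Dict[str, float]:
        """What the guest sees: encoded id -> IV (assumed available to the guest)."""
        return {self._id_of[name]: iv for name, iv in self._iv.items()}

    def decode(self, kept_ids: List[str]) -> List[str]:
        """Map the guest's decision back to real feature names."""
        return sorted(self._name_of[fid] for fid in kept_ids)


def guest_select(scores_by_id: Dict[str, float], threshold: float) -> List[str]:
    """Guest-side IV threshold logic applied to encoded ids only."""
    return [fid for fid, iv in scores_by_id.items() if iv >= threshold]


host = Host({"age": 0.30, "zip_code": 0.01, "income": 0.12})
visible_to_guest = host.encoded_scores()
kept_ids = guest_select(visible_to_guest, threshold=0.1)
kept_names = host.decode(kept_ids)
```

The guest never handles the names `age` or `income`; it only returns the ids that passed the filter.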
Param
feature_selection_param
Attributes
deprecated_param_list = ['iv_value_param', 'iv_percentile_param', 'iv_top_k_param', 'variance_coe_param', 'unique_param', 'outlier_param'] (module-attribute)
Classes
UniqueValueParam(eps=1e-05)
Bases: BaseParam
Use the difference between max-value and min-value to judge.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
eps | float | The column(s) will be filtered if its difference is smaller than eps. | 1e-05 |
Source code in python/federatedml/param/feature_selection_param.py
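A minimal sketch of the max-minus-min test, assuming numeric columns held as plain Python lists. The helper name `is_unique_column` and the sample data are hypothetical.

```python
from typing import Dict, List


def is_unique_column(values: List[float], eps: float = 1e-5) -> bool:
    """True if the spread between the largest and smallest value is below eps."""
    return max(values) - min(values) < eps


# Hypothetical columns: x0 is constant, x1 varies.
columns: Dict[str, List[float]] = {
    "x0": [1.0, 1.0, 1.0],
    "x1": [0.0, 0.5, 1.0],
}
kept = [name for name, values in columns.items() if not is_unique_column(values)]
print(kept)  # ['x1']
```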
IVValueSelectionParam(value_threshold=0.0, host_thresholds=None, local_only=False)
Bases: BaseParam
Use information values to select features.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
value_threshold | | Used if the iv_value_thres method is used in feature selection. | 0.0 |
host_thresholds | | Set thresholds for different hosts. If None, use the same threshold as the guest. If provided, the order should match the host id setting. | None |
Source code in python/federatedml/param/feature_selection_param.py
Attributes
value_threshold = value_threshold (instance-attribute)
host_thresholds = host_thresholds (instance-attribute)
local_only = local_only (instance-attribute)
Functions
check()
Source code in python/federatedml/param/feature_selection_param.py
IVPercentileSelectionParam(percentile_threshold=1.0, local_only=False)
Bases: BaseParam
Use information values to select features.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
percentile_threshold | | 0 <= percentile_threshold <= 1.0, default: 1.0. Percentile threshold for the iv_percentile method. | 1.0 |
Source code in python/federatedml/param/feature_selection_param.py
Attributes
percentile_threshold = percentile_threshold (instance-attribute)
local_only = local_only (instance-attribute)
Functions
check()
Source code in python/federatedml/param/feature_selection_param.py
IVTopKParam(k=10, local_only=False)
Bases: BaseParam
Use information values to select features.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
k | | Should be greater than 0, default: 10. The number of features with the highest IV to keep. | 10 |
Source code in python/federatedml/param/feature_selection_param.py
Attributes
k = k (instance-attribute)
local_only = local_only (instance-attribute)
Functions
check()
Source code in python/federatedml/param/feature_selection_param.py
VarianceOfCoeSelectionParam(value_threshold=1.0)
Bases: BaseParam
Use coefficient of variation to select features. When judging, the absolute value will be used.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
value_threshold | | Used if the coefficient_of_variation_value_thres method is used in feature selection. Filters out columns whose coefficient of variation is smaller than the threshold. | 1.0 |
Source code in python/federatedml/param/feature_selection_param.py
OutlierColsSelectionParam(percentile=1.0, upper_threshold=1.0)
Bases: BaseParam
Given a percentile and a threshold, judge whether the value at this quantile point is larger than the threshold. Columns with larger values are filtered.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
percentile | | The percentile point to compare. | 1.0 |
upper_threshold | | The threshold that the value at the percentile point is compared against. | 1.0 |
Source code in python/federatedml/param/feature_selection_param.py
Attributes
percentile = percentile (instance-attribute)
upper_threshold = upper_threshold (instance-attribute)
Functions
check()
Source code in python/federatedml/param/feature_selection_param.py
CommonFilterParam(metrics, filter_type='threshold', take_high=True, threshold=1, host_thresholds=None, select_federated=True)
Bases: BaseParam
Each of the following parameters can be set to a single value or to a list of values. A single value means that only one metric is used for filtering, while a list means that multiple metrics are used.
Note that if any of the following parameters is set as a list, all list-valued parameters must have the same length; otherwise an error is raised. In addition, if any parameter is a list, metrics must also be a list.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
metrics | | Indicates which metrics are used in this filter. | required |
filter_type | | Should be one of "threshold", "top_k" or "top_percentile". | 'threshold' |
take_high | | Whether to take the highest values when filtering. | True |
threshold | | If the filter type is "threshold", this is the threshold value. If it is "top_k", this is the k value. If it is "top_percentile", this is the percentile threshold. | 1 |
host_thresholds | | Set thresholds for different hosts. If None, use the same threshold as the guest. If provided, the order should match the host id setting. | None |
select_federated | | Whether to select in a federated manner with other parties or based on local variables only. | True |
Source code in python/federatedml/param/feature_selection_param.py
Attributes
metrics = metrics (instance-attribute)
filter_type = filter_type (instance-attribute)
take_high = take_high (instance-attribute)
threshold = threshold (instance-attribute)
host_thresholds = host_thresholds (instance-attribute)
select_federated = select_federated (instance-attribute)
Functions
check()
Source code in python/federatedml/param/feature_selection_param.py
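The single-value versus list rule for CommonFilterParam can be sketched as a small normalization helper. This is an illustration of the rule as stated, not the class's check() method: the function name `normalize_filter_params` is hypothetical, and repeating a single value once per metric when metrics is a list is an assumption.

```python
from typing import Any, Dict, List


def normalize_filter_params(metrics: Any,
                            filter_type: Any = "threshold",
                            take_high: Any = True,
                            threshold: Any = 1) -> Dict[str, List[Any]]:
    """Return every parameter as a list with one entry per metric."""
    others = {"filter_type": filter_type, "take_high": take_high, "threshold": threshold}

    if not isinstance(metrics, list):
        # A list-valued parameter requires metrics to be a list as well.
        for name, value in others.items():
            if isinstance(value, list):
                raise ValueError(f"{name} must not be a list when metrics is a single value")
        normalized = {"metrics": [metrics]}
        normalized.update({name: [value] for name, value in others.items()})
        return normalized

    normalized = {"metrics": list(metrics)}
    for name, value in others.items():
        if isinstance(value, list):
            if len(value) != len(metrics):
                raise ValueError(f"{name} must have the same length as metrics")
            normalized[name] = list(value)
        else:
            # Assumption: a single value applies to every metric.
            normalized[name] = [value] * len(metrics)
    return normalized


multi = normalize_filter_params(
    metrics=["skewness", "missing_ratio"],
    take_high=[True, False],
    threshold=[0.5, 0.1],
)
single = normalize_filter_params(metrics="psi", take_high=False, threshold=0.1)
```

In `multi`, the first entry keeps features with high skewness and the second keeps features with a low missing ratio, each with its own threshold.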
IVFilterParam(filter_type='threshold', threshold=1, host_thresholds=None, select_federated=True, mul_class_merge_type='average')
Bases: CommonFilterParam
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mul_class_merge_type | | Indicates how to merge multi-class IV results. Supports "average", "min" and "max". | 'average' |
Source code in python/federatedml/param/feature_selection_param.py
CorrelationFilterParam(sort_metric='iv', threshold=0.1, select_federated=True)
Bases: BaseParam
This filter follows these rules:
1. Sort all the columns from high to low based on a specific metric, e.g. IV.
2. Traverse each sorted column. If there are other columns whose absolute correlation with it is larger than the threshold, those columns are filtered.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sort_metric | | Specify which metric is used to sort features. | 'iv' |
threshold | | Correlation threshold. | 0.1 |
select_federated | | Whether to select in a federated manner with other parties or based on local variables only. | True |
Source code in python/federatedml/param/feature_selection_param.py
Attributes
sort_metric = sort_metric (instance-attribute)
threshold = threshold (instance-attribute)
select_federated = select_federated (instance-attribute)
Functions
check()
Source code in python/federatedml/param/feature_selection_param.py
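The two rules of the correlation filter can be sketched as a greedy pass over the sorted columns. This is one possible reading, not the module's implementation: the function name `correlation_filter`, the sample scores and the choice to skip columns that have already been filtered are assumptions.

```python
from typing import Dict, FrozenSet, List, Set


def correlation_filter(sort_metric: Dict[str, float],
                       corr: Dict[FrozenSet[str], float],
                       threshold: float = 0.1) -> List[str]:
    """Keep higher-ranked columns and drop the columns strongly correlated with them."""
    ranked = sorted(sort_metric, key=sort_metric.get, reverse=True)
    removed: Set[str] = set()
    for col in ranked:
        if col in removed:
            continue  # assumption: a filtered column no longer filters others
        for other in ranked:
            if other == col or other in removed:
                continue
            if abs(corr[frozenset((col, other))]) > threshold:
                removed.add(other)
    return [col for col in ranked if col not in removed]


# Hypothetical IVs and pairwise correlations.
iv = {"x0": 0.3, "x1": 0.2, "x2": 0.1}
corr = {
    frozenset(("x0", "x1")): 0.9,
    frozenset(("x0", "x2")): 0.05,
    frozenset(("x1", "x2")): -0.8,
}
kept = correlation_filter(iv, corr, threshold=0.5)
print(kept)  # ['x0', 'x2']
```

Here x1 is dropped because of its correlation with the higher-ranked x0, and x2 survives because the only column it correlates with strongly has already been removed.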
PercentageValueParam(upper_pct=1.0)
Bases: BaseParam
Filter the columns that have a value that exceeds a certain percentage.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
upper_pct | | The upper percentage threshold for filtering. upper_pct should not be less than 0.1. | 1.0 |
Source code in python/federatedml/param/feature_selection_param.py
Attributes
upper_pct = upper_pct (instance-attribute)
Functions
check()
Source code in python/federatedml/param/feature_selection_param.py
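A minimal sketch of the percentage-value test, assuming it means that a single value accounts for too large a share of a column's rows. The helper name `exceeds_upper_pct` is hypothetical, and treating the boundary as inclusive (a share equal to upper_pct also triggers the filter) is an assumption, not something the description states.

```python
from collections import Counter
from typing import Dict, List


def exceeds_upper_pct(values: List[float], upper_pct: float = 1.0) -> bool:
    """True if the most frequent value covers at least upper_pct of the rows."""
    most_common_count = Counter(values).most_common(1)[0][1]
    # Assumption: the comparison is inclusive.
    return most_common_count / len(values) >= upper_pct


# Hypothetical columns: 90% of x0 is the value 1, x1 has no repeated value.
columns: Dict[str, List[float]] = {
    "x0": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
kept = [name for name, values in columns.items() if not exceeds_upper_pct(values, 0.8)]
print(kept)  # ['x1']
```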
ManuallyFilterParam(filter_out_indexes=None, filter_out_names=None, left_col_indexes=None, left_col_names=None)
Bases: BaseParam
Specify the columns that need to be filtered. If a specified column exists, it is filtered directly; otherwise, it is ignored.
Both the filter_out and the left parameters only work for this specific filter. For instance, if you set some columns to be left in this filter but those columns are filtered by other filters, those columns will NOT be left in the final result.
Please note that (left_col_indexes & left_col_names) cannot be used together with (filter_out_indexes & filter_out_names).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filter_out_indexes | | Specify columns' indexes to be filtered out. Note that columns specified by | None |
filter_out_names | list of string | Specify columns' names to be filtered out. Note that columns specified by | None |
left_col_indexes | | Specify left_col_index. Note that columns specified by | None |
left_col_names | | Specify left col names. Note that columns specified by | None |
Source code in python/federatedml/param/feature_selection_param.py
Attributes
filter_out_indexes = filter_out_indexes (instance-attribute)
filter_out_names = filter_out_names (instance-attribute)
left_col_indexes = left_col_indexes (instance-attribute)
left_col_names = left_col_names (instance-attribute)
Functions
check()
Source code in python/federatedml/param/feature_selection_param.py
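The mutual exclusion between the left and filter_out settings, and the rule that unknown columns are ignored, can be sketched with names only. The helper `apply_manual_filter` and the header are hypothetical; index-based selection is left out to keep the sketch short.

```python
from typing import List, Optional


def apply_manual_filter(header: List[str],
                        filter_out_names: Optional[List[str]] = None,
                        left_col_names: Optional[List[str]] = None) -> List[str]:
    """Return the columns kept by a manual filter, preserving header order."""
    if filter_out_names is not None and left_col_names is not None:
        raise ValueError("left_col_names cannot be combined with filter_out_names")
    if left_col_names is not None:
        keep = set(left_col_names)
        return [col for col in header if col in keep]
    drop = set(filter_out_names or [])
    # Names that are not in the header simply have no effect.
    return [col for col in header if col not in drop]


header = ["x0", "x1", "x2", "x3"]
after_filter_out = apply_manual_filter(header, filter_out_names=["x1", "not_a_column"])
after_left = apply_manual_filter(header, left_col_names=["x3", "x0"])
```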
FeatureSelectionParam(select_col_indexes=-1, select_names=None, filter_methods=None, unique_param=UniqueValueParam(), iv_value_param=IVValueSelectionParam(), iv_percentile_param=IVPercentileSelectionParam(), iv_top_k_param=IVTopKParam(), variance_coe_param=VarianceOfCoeSelectionParam(), outlier_param=OutlierColsSelectionParam(), manually_param=ManuallyFilterParam(), percentage_value_param=PercentageValueParam(), iv_param=IVFilterParam(), statistic_param=CommonFilterParam(metrics=consts.MEAN), psi_param=CommonFilterParam(metrics=consts.PSI, take_high=False), vif_param=CommonFilterParam(metrics=consts.VIF, threshold=5.0, take_high=False), sbt_param=CommonFilterParam(metrics=consts.FEATURE_IMPORTANCE), correlation_param=CorrelationFilterParam(), use_anonymous=False, need_run=True)
Bases: BaseParam
Define the feature selection parameters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
select_col_indexes | | Specify which columns need to be calculated. -1 represents all columns. Note that columns specified by | -1 |
select_names | list of string | Specify which columns need to be calculated. Each element in the list represents a column name in the header. Note that columns specified by | None |
filter_methods | | "hetero_sbt_filter", "homo_sbt_filter", "hetero_fast_sbt_filter", "percentage_value", "vif_filter", "correlation_filter"], default: ["manually"]. The following methods will be deprecated in a future version: "unique_value", "iv_value_thres", "iv_percentile", "coefficient_of_variation_value_thres", "outlier_cols". Specify the filter methods used in feature selection. The order in which the filters are applied follows this list. Note that if a percentile method is used after another filter method, the percentile refers to the ratio of the remaining features. E.g. if you have 10 features at the beginning and 8 remain after the first filter method, asking for the top 80% highest-IV features selects floor(0.8 * 8) = 6 features instead of 8. | None |
unique_param | | Filter the columns if all values in this feature are the same. | UniqueValueParam() |
iv_value_param | | Use information value to filter columns. If this method is set, a float threshold needs to be provided. Filters out columns whose IV is smaller than the threshold. Will be deprecated in the future. | IVValueSelectionParam() |
iv_percentile_param | | Use information value to filter columns. If this method is set, a float ratio threshold needs to be provided. Picks floor(ratio * feature_num) features with higher IV. If multiple features around the threshold have the same value, all of those columns are kept. Will be deprecated in the future. | IVPercentileSelectionParam() |
variance_coe_param | | Use coefficient of variation to judge whether a column is filtered or not. Will be deprecated in the future. | VarianceOfCoeSelectionParam() |
outlier_param | | Filter columns whose value at a certain percentile is larger than a threshold. Will be deprecated in the future. | OutlierColsSelectionParam() |
percentage_value_param | | Filter the columns that have a value that exceeds a certain percentage. | PercentageValueParam() |
iv_param | | Set how to filter based on IV. It supports take-high mode only. All of "threshold", "top_k" and "top_percentile" are accepted. See CommonFilterParam for more details. To use this filter, the hetero-feature-binning module has to be provided. | IVFilterParam() |
statistic_param | | Set how to filter based on statistic values. All of "threshold", "top_k" and "top_percentile" are accepted. See CommonFilterParam for more details. To use this filter, the data_statistic module has to be provided. | CommonFilterParam(metrics=consts.MEAN) |
psi_param | | Set how to filter based on PSI values. All of "threshold", "top_k" and "top_percentile" are accepted. Its take_high property should be False to choose features with lower PSI. See CommonFilterParam for more details. To use this filter, the data_statistic module has to be provided. | CommonFilterParam(metrics=consts.PSI, take_high=False) |
use_anonymous | | Whether to interpret 'select_names' as anonymous names. | False |
need_run | | Indicates whether this module needs to be run. | True |
Source code in python/federatedml/param/feature_selection_param.py
Attributes
correlation_param = correlation_param (instance-attribute)
vif_param = vif_param (instance-attribute)
select_col_indexes = select_col_indexes (instance-attribute)
select_names = [] (instance-attribute)
filter_methods = [consts.MANUALLY_FILTER] (instance-attribute)
unique_param = copy.deepcopy(unique_param) (instance-attribute)
iv_value_param = copy.deepcopy(iv_value_param) (instance-attribute)
iv_percentile_param = copy.deepcopy(iv_percentile_param) (instance-attribute)
iv_top_k_param = copy.deepcopy(iv_top_k_param) (instance-attribute)
variance_coe_param = copy.deepcopy(variance_coe_param) (instance-attribute)
outlier_param = copy.deepcopy(outlier_param) (instance-attribute)
percentage_value_param = copy.deepcopy(percentage_value_param) (instance-attribute)
manually_param = copy.deepcopy(manually_param) (instance-attribute)
iv_param = copy.deepcopy(iv_param) (instance-attribute)
statistic_param = copy.deepcopy(statistic_param) (instance-attribute)
psi_param = copy.deepcopy(psi_param) (instance-attribute)
sbt_param = copy.deepcopy(sbt_param) (instance-attribute)
need_run = need_run (instance-attribute)
use_anonymous = use_anonymous (instance-attribute)
Functions
check()
Source code in python/federatedml/param/feature_selection_param.py
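As a rough illustration of how these parameters fit together, the fragment below writes a configuration as a Python dict and works through the percentile arithmetic from the filter_methods description. The parameter names come from this page, but the exact layout a job configuration expects, the column names `x8` and `x9`, and the helper `count_after_percentile` are assumptions.

```python
import math

# Hypothetical configuration fragment using parameter names from this page.
feature_selection_conf = {
    "select_col_indexes": -1,
    # Filters run in this order: manual removal first, then the IV filter.
    "filter_methods": ["manually", "iv_filter"],
    "manually_param": {"filter_out_names": ["x8", "x9"]},  # hypothetical columns
    "iv_param": {"filter_type": "top_percentile", "threshold": 0.8},
}


def count_after_percentile(n_remaining: int, percentile: float) -> int:
    """The percentile applies to the features that remain, not to the original count."""
    return math.floor(percentile * n_remaining)


n_features = 10
n_after_manual = n_features - len(feature_selection_conf["manually_param"]["filter_out_names"])
n_after_iv = count_after_percentile(n_after_manual, feature_selection_conf["iv_param"]["threshold"])
print(n_after_manual, n_after_iv)  # 8 6
```

Starting from 10 features, the manual filter leaves 8, and the 80% percentile then keeps floor(0.8 * 8) = 6 of them.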