Data Statistic

This component computes summary statistics of the data, such as the mean, maximum, minimum and median.

The indicators that can be computed for each column are listed below.

  1. count: Number of data

  2. sum: The sum of this column

  3. mean: The mean of this column

  4. variance/stddev: Variance and standard deviation of this column

  5. median: Median of this column

  6. min/max: Min and Max value of this column

  7. coefficient of variance: The coefficient of variation of this column, computed as abs(stddev / mean)

  8. missing_count/missing_ratio: Number and ratio of missing values in this column

  9. skewness: The definition can be referred to [here]

  10. kurtosis: The definition can be referred to [here]

  11. percentile: The value at the given percentile. Accepts 0% to 100%; the number before the “%” must be an integer.

These statistics can be used as criteria in feature selection.
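As a rough illustration of how the indicators above relate to one another, the following sketch computes several of them for a single column using only the Python standard library. It is not the component's implementation: the function name `column_stats`, the use of `None` for missing values, the population form of the variance, counting missing rows in `count`, and the nearest-rank percentile rule are all assumptions made for this example. Skewness and kurtosis are left out because their exact definition is not given here.

```python
import statistics


def column_stats(column, percentiles=(25, 50, 75)):
    """Summarise one column; None marks a missing value (assumption)."""
    present = sorted(v for v in column if v is not None)
    n = len(present)
    mean = statistics.fmean(present)
    stddev = statistics.pstdev(present)  # population form (assumption)
    stats = {
        "count": len(column),  # counts every row, missing or not (assumption)
        "sum": sum(present),
        "mean": mean,
        "variance": statistics.pvariance(present),
        "stddev": stddev,
        "median": statistics.median(present),
        "min": present[0],
        "max": present[-1],
        "coefficient_of_variance": abs(stddev / mean),
        "missing_count": len(column) - n,
        "missing_ratio": (len(column) - n) / len(column),
    }
    for p in percentiles:  # p is an integer from 0 to 100
        rank = max(1, (p * n + 99) // 100)  # nearest-rank rule (assumption)
        stats[f"percentile_{p}"] = present[rank - 1]
    return stats


stats = column_stats([2.0, 4.0, None, 4.0, 6.0])
print(stats["mean"], stats["median"], stats["missing_ratio"])
```

A column whose mean is zero would make the coefficient of variance undefined; the sketch does not handle that case.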

Intersection

This module provides several methods for PSI (Private Set Intersection).

RSA Intersection

This folder contains code that implements an algorithm based on [RSA Intersection]. This work is built on FATE, eggroll and the federation API, which together provide the secure, distributed and parallel infrastructure.

Our Intersection module addresses the problem of Privacy-Preserving Entity Match. It helps at least two parties find their common user ids without leaking all of their user ids to each other. This is illustrated in figure 1.

../../../../_images/rsa_intersection.png

Figure 1 (RSA Intersection between party A and party B)

In figure 1, party A has user ids u1, u2, u3, u4, while party B has u1, u2, u3, u5. After intersection, both parties know their common user ids, u1, u2, u3, but party A learns nothing about party B's other user ids, such as u5, and party B likewise learns nothing about party A's ids beyond u1, u2, u3. The processed id information that the parties transmit to each other, such as \(Y-A\) and \(Z-B\), does not leak any raw ids. \(Z-B\) is safe because it is protected by party B's private key. Each \(Y-A\) includes a different random value bound to its corresponding value in \(X-A\), so it is safe as well.

Using this module, we can obtain the intersection ids between two parties securely and efficiently.
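The exchange described above resembles the textbook RSA blind-signature approach to private set intersection, which can be sketched as follows. This is a toy under stated assumptions, not the module's code: the function names are hypothetical, the key is built from two small known primes and is far too weak for real use, and `y_a` and `z_b` are only presumed to correspond to \(Y-A\) and \(Z-B\) in figure 1.

```python
import hashlib
import math
import secrets

# Toy RSA key held by party B, built from two known Mersenne primes.
P, Q = 2**61 - 1, 2**89 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # party B's private exponent


def h(user_id):
    """Hash an id to an integer modulo N."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest, "big") % N


def final_hash(value):
    """Second hash, applied to signed values before they are compared."""
    return hashlib.sha256(str(value).encode()).hexdigest()


def blind(ids_a):
    """Party A: mask each hashed id with its own random factor r."""
    blinded, factors = [], []
    for uid in ids_a:
        r = secrets.randbelow(N - 2) + 2
        while math.gcd(r, N) != 1:  # r must be invertible modulo N
            r = secrets.randbelow(N - 2) + 2
        factors.append(r)
        blinded.append(pow(r, E, N) * h(uid) % N)
    return blinded, factors


def sign_blinded(y_a):
    """Party B: sign A's blinded values without seeing the ids behind them."""
    return [pow(y, D, N) for y in y_a]


def sign_own(ids_b):
    """Party B: sign and hash its own ids; only B's private key can do this."""
    return {final_hash(pow(h(uid), D, N)) for uid in ids_b}


def intersect(ids_a, factors, signed_y, z_b):
    """Party A: remove each r, hash, and compare with B's values."""
    common = []
    for uid, r, signed in zip(ids_a, factors, signed_y):
        unblinded = signed * pow(r, -1, N) % N
        if final_hash(unblinded) in z_b:
            common.append(uid)
    return common


ids_a = ["u1", "u2", "u3", "u4"]
ids_b = ["u1", "u2", "u3", "u5"]
y_a, factors = blind(ids_a)        # sent from A to B
signed_y = sign_blinded(y_a)       # sent from B to A
z_b = sign_own(ids_b)              # sent from B to A
common = intersect(ids_a, factors, signed_y, z_b)
print(common)
```

Because each blinded value carries a fresh random factor, party B cannot link it to an id, and because the values in `z_b` require the private exponent, party A cannot test ids it does not hold a signature for.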

RAW Intersection

This intersection module implements a simple intersection method in which A or B, acting as the sender, sends all of its ids to the other party, and the other party finds the common ids by comparing them with its own ids. Finally, it sends the intersection ids back to the sender.
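The three steps of the raw method can be sketched in a few lines. The function name `raw_intersect` is hypothetical and the network transfer is reduced to ordinary function arguments and return values.

```python
def raw_intersect(sender_ids, receiver_ids):
    """Return the ids that the receiver reports back to the sender."""
    # Step 1: the sender's ids arrive at the receiver unprotected.
    received = set(sender_ids)
    # Step 2: the receiver keeps those of its own ids that also arrived.
    common = [uid for uid in receiver_ids if uid in received]
    # Step 3: the common ids are returned to the sender.
    return common


common = raw_intersect(["u1", "u2", "u3", "u4"], ["u1", "u2", "u3", "u5"])
print(common)
```

Unlike the RSA method, the receiver sees every id the sender holds, including those outside the intersection.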

Multi-Host Intersection

Both rsa and raw intersection support multiple hosts. This means a guest can intersect with more than one host simultaneously and finally obtain the IDs common to all hosts.

../../../../_images/multi_hosts.png

Figure 2 (multi-hosts Intersection)

Figure 2 shows a guest intersecting with two hosts; the procedure is the same for more than two hosts. First, the guest intersects with each host and gets the overlapping IDs for each. Second, the guest finds the common IDs across all intersection results. Finally, the guest sends the common IDs to every host if necessary.
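The guest-side logic of these steps can be sketched as follows. The function name is hypothetical, and a plain set intersection stands in for the pairwise rsa or raw intersection with each host.

```python
def multi_host_intersect(guest_ids, ids_per_host):
    """Return the ids the guest shares with every host."""
    guest = set(guest_ids)
    # First: intersect with each host separately.
    per_host = [guest & set(host_ids) for host_ids in ids_per_host]
    # Second: keep only the ids present in every pairwise result.
    common = set.intersection(guest, *per_host)
    # Finally: this is what the guest would send back to each host.
    return sorted(common)


common = multi_host_intersect(
    ["u1", "u2", "u3", "u4"],
    [["u1", "u2", "u3", "u5"], ["u2", "u3", "u4"]],
)
print(common)
```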

Repeated ID intersection

We support repeated id intersection for some applications. In this case, one should provide mask ids that map to the repeated ids. For instance, in Guest, your data is

mask_id, id, value
alice_1, alice, 2
alice_2, alice, 3
bob, bob, 4

In Host, your data is

id, value
alice, 5
bob, 6

After intersecting, guest will get the intersection results:

mask_id, value
alice_1, 2
alice_2, 3
bob, 4

And in host:

id, value
alice_1, 5
alice_2, 5
bob, 6

The switch for this is “repeated_id_process” in the intersection parameters; set it to true if you want to use this function.
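The id expansion in this example can be sketched as below. The function name and the data layout are hypothetical, and the sketch assumes that the mask-id mapping for the intersecting ids is made available to the host side; each side keeps its own values.

```python
guest_rows = [("alice_1", "alice", 2), ("alice_2", "alice", 3), ("bob", "bob", 4)]
host_rows = {"alice": 5, "bob": 6}


def expand_repeated_ids(guest_rows, host_rows):
    """Return (guest_result, host_result), both keyed by mask id."""
    matched = [row for row in guest_rows if row[1] in host_rows]
    # Guest keeps its own value for every mask id that matched.
    guest_result = [(mask_id, value) for mask_id, _, value in matched]
    # Host repeats its own value once per mask id that maps to the id.
    host_result = [(mask_id, host_rows[uid]) for mask_id, uid, _ in matched]
    return guest_result, host_result


guest_result, host_result = expand_repeated_ids(guest_rows, host_rows)
print(guest_result)
print(host_result)
```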

Param

class EncodeParam(salt='', encode_method='none', base64=False)

Defines the hash method for the raw intersection method.

Parameters
  • salt (str) – appended to the source string before hashing, i.e. str = str + salt. Default: empty string.

  • encode_method (str) – the hash method applied to the source string. Supports md5, sha1, sha224, sha256, sha384, sha512 and sm3. Default: 'none'.

  • base64 (bool) – if True, the hash result is converted to base64. Default: False.
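How these three settings might combine can be sketched with `hashlib` and `base64` from the standard library. The function `encode_id` is hypothetical and is not the library's code. It assumes that 'none' leaves the id unchanged and that base64 is applied to the raw digest bytes, and it omits sm3 because `hashlib` does not guarantee that algorithm.

```python
import base64
import hashlib

SUPPORTED = {"md5", "sha1", "sha224", "sha256", "sha384", "sha512"}


def encode_id(src, salt="", encode_method="none", use_base64=False):
    """Hash one id string according to the three settings."""
    if encode_method == "none":
        return src  # assumption: no hashing leaves the id as it is
    if encode_method not in SUPPORTED:
        raise ValueError(f"unsupported encode_method: {encode_method}")
    data = (src + salt).encode("utf-8")  # str = str + salt
    digest = hashlib.new(encode_method, data)
    if use_base64:
        # assumption: base64 is taken over the raw digest bytes
        return base64.b64encode(digest.digest()).decode("ascii")
    return digest.hexdigest()


hex_form = encode_id("alice", salt="s1", encode_method="sha256")
b64_form = encode_id("alice", salt="s1", encode_method="sha256", use_base64=True)
print(hex_form)
print(b64_form)
```

Both parties must use the same salt, method and base64 setting, otherwise equal ids will not produce equal encoded values.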

class IntersectParam(intersect_method: str = 'raw', random_bit=128, sync_intersect_ids=True, join_role='guest', with_encode=False, only_output_key=False, encode_params=<federatedml.param.intersect_param.EncodeParam object>, rsa_params=<federatedml.param.intersect_param.RSAParam object>, intersect_cache_param=<federatedml.param.intersect_param.IntersectCache object>, repeated_id_process=False, repeated_id_owner='guest', with_sample_id=False, allow_info_share: bool = False, info_owner='guest')

Defines the intersection method.

Parameters
  • intersect_method (str) – the intersection method; supports 'rsa' and 'raw'. Default: 'raw'.

  • random_bit (positive int) – defines the encryption length of the rsa algorithm. Effective only when intersect_method is 'rsa'.

  • sync_intersect_ids (bool) – in rsa, True means that guest or host will send the intersection results to the others, and False means they will not; in raw, True means that the role given by join_role will send the intersection results and the others will receive them. Default: True.

  • join_role (str) – the role that joins the ids; supports "guest" and "host" only, and is effective only for raw. If it is "guest", the host sends its ids to the guest, and the intersection is found in the guest; if it is "host", the guest sends its ids to the host. Default: "guest".

  • with_encode (bool) – if True, a hash method is applied to the ids before intersection. Effective only for 'raw'.

  • encode_params (EncodeParam) – effective only when with_encode is True.

  • rsa_params (RSAParam) – effective for the rsa method only.

  • only_output_key (bool) – if False, the intersection results include both the key and the value from the input data; if True, they include only the key from the input data, and the value is empty or a placeholder string such as "intersect_id".

  • repeated_id_process (bool) – if True, intersection handles ids that may be repeated.

  • repeated_id_owner (str) – the role that holds the repeated ids.

  • with_sample_id (bool) – whether the data carries sample ids. Default: False. Setting this parameter to True may lead to unexpected behavior.
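Several of these parameters only take effect for one method. The sketch below writes a parameter set as a plain Python dict, using the names from the class signature, and reports the options that would have no effect under the rules stated above. The helper `ineffective_options` is hypothetical, and how such a dict is actually passed to a job is not covered in this section.

```python
intersect_param = {
    "intersect_method": "raw",
    "sync_intersect_ids": True,
    "join_role": "guest",
    "with_encode": True,
    "encode_params": {"salt": "demo-salt", "encode_method": "sha256", "base64": False},
    "only_output_key": False,
}


def ineffective_options(param):
    """List the options in `param` that have no effect for its method."""
    method = param.get("intersect_method", "raw")
    ignored = []
    if method != "rsa":  # rsa-only options
        ignored += [k for k in ("random_bit", "rsa_params") if k in param]
    if method != "raw":  # raw-only options
        ignored += [k for k in ("join_role", "with_encode") if k in param]
    encode_active = method == "raw" and param.get("with_encode", False)
    if not encode_active and "encode_params" in param:
        ignored.append("encode_params")
    return ignored


print(ineffective_options(intersect_param))
print(ineffective_options({"intersect_method": "rsa", "join_role": "guest"}))
```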

class RSAParam(salt='', hash_method='sha256', final_hash_method='sha256', split_calculation=False, random_base_fraction=None, key_length=1024)

Defines the hash method for the RSA intersection method.

Parameters
  • salt (str) – appended to the source string before hashing, i.e. str = str + salt. Default: ''.

  • hash_method (str) – the hash method applied to the source string. Supports sha256, sha384, sha512 and sm3. Default: sha256.

  • final_hash_method (str) – the hash method applied to the result string. Supports md5, sha1, sha224, sha256, sha384, sha512 and sm3. Default: sha256.

  • split_calculation (bool) – if True, Host and Guest split operations for faster performance; recommended for large data sets.

  • random_base_fraction (positive float) – if not None, generate (fraction * public key id count) values of r for encryption and reuse the generated r. Values greater than 0.99 are taken as 1, and values less than 0.01 are rounded up to 0.01.

  • key_length (positive int) – bit count of the rsa key. Default: 1024.
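The rounding rule for random_base_fraction can be written out as a short sketch. Both helper names are hypothetical. Rounding the number of r values up to a whole number, and generating one r per id when the parameter is None, are assumptions made here and are not stated in the parameter description.

```python
import math


def effective_fraction(random_base_fraction):
    """Apply the 0.01 and 0.99 thresholds from the description."""
    if random_base_fraction is None:
        return None
    if random_base_fraction > 0.99:
        return 1.0
    if random_base_fraction < 0.01:
        return 0.01
    return random_base_fraction


def r_count(random_base_fraction, id_count):
    """Number of random values r to generate for `id_count` ids."""
    fraction = effective_fraction(random_base_fraction)
    if fraction is None:
        return id_count  # assumption: one r per id when no fraction is set
    return max(1, math.ceil(fraction * id_count))  # rounding up is an assumption


print(effective_fraction(0.995), effective_fraction(0.001))
print(r_count(0.5, 1000))
```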

Feature

Both RSA and RAW intersection support the following features:

  1. Support for multi-host modeling tasks. The detailed configuration for multi-host modeling tasks is located here.

  2. Repeated ID intersection using ID expanding.

RSA intersection supports the following extra features:

  1. RSA supports caching to speed up computation.

RAW intersection supports the following extra features:

  1. RAW supports encoders such as md5 or sha256 to make it safer.