Title: | Toolkit for Credit Modeling, Analysis and Visualization |
---|---|
Description: | Provides a highly efficient R tool suite for Credit Modeling, Analysis and Visualization. Contains infrastructure functionalities such as data exploration and preparation, missing values treatment, outliers treatment, variable derivation, variable selection, dimensionality reduction, grid search for hyper parameters, data mining and visualization, model evaluation, strategy analysis etc. This package is designed to make the development of binary classification models (machine learning based models as well as credit scorecard) simpler and faster. References: (1) Refaat, M. (2011, ISBN: 9781447511199). Credit Risk Scorecard: Development and Implementation Using SAS; (2) Bezdek, James C. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences (0098-3004), <DOI:10.1016/0098-3004(84)90020-7>. |
Authors: | Dongping Fan [aut, cre] |
Maintainer: | Dongping Fan <[email protected]> |
License: | AGPL-3 |
Version: | 1.3.1 |
Built: | 2025-02-16 03:32:05 UTC |
Source: | https://github.com/cran/creditmodel |
creditmodel provides a highly efficient R tool suite for Credit Modeling, Analysis and Visualization. Contains infrastructure functionalities such as data exploration and preparation, missing values treatment, outliers treatment, variable derivation, variable selection, dimensionality reduction, grid search for hyper parameters, data mining and visualization, model evaluation, strategy analysis etc. This package is designed to make the development of binary classification models (machine learning based models as well as credit scorecard) simpler and faster.
It has three main goals:
creditmodel is a free and open source automated modeling R package designed to help model developers work more efficiently and to enable people with no background in data science to complete modeling work in a short time, so that they can focus on the problem itself and devote more time to decision-making.
creditmodel covers various tools such as data preprocessing, variable processing/derivation, variable screening/dimensionality reduction, modeling, data analysis, data visualization, model evaluation, strategy analysis, etc. It is a customized "core" toolkit for model developers.
creditmodel is suitable for automated machine learning modeling of classification targets, and is especially suited to the risk and marketing data of financial credit, e-commerce, and insurance, which have relatively high noise and low information content.
To learn more about creditmodel, start with the WeChat Platform: hansenmode
Fuzzy String matching
x %alike% y
x |
A string. |
y |
A string. |
Logical.
"xyz" %alike% "xy"
Fuzzy String matching
x %islike% y
x |
A string. |
y |
A string. |
Logical.
"xyz" %islike% "yz$"
This function is not intended to be used by end user.
add_variable_process(add)
add |
A data.frame containing address variables. |
This function is not intended to be used by end user.
address_varieble( df, address_cols = NULL, address_pattern = NULL, parallel = TRUE )
df |
A data.frame. |
address_cols |
Variables of address, |
address_pattern |
Regular expressions, used to match address variable names. |
parallel |
Logical, parallel computing. Default is TRUE. |
analysis_nas
is for understanding why data are missing and how the missing values are distributed, so that the missingness can be categorised as:
missing completely at random (MCAR),
missing at random (MAR), or
missing not at random (MNAR), also known as informative missingness (IM).
analysis_nas( dat, class_var = FALSE, nas_rate = NULL, na_vars = NULL, mat_nas_shadow = NULL, dt_nas_random = NULL, ... )
dat |
A data.frame with independent variables and target variable. |
class_var |
Logical, nas analysis of the nominal variables. Default is FALSE. |
nas_rate |
A list contains nas rate of each variable. |
na_vars |
Names of variables which contain nas. |
mat_nas_shadow |
A shadow matrix of variables which contain nas. |
dt_nas_random |
A data.frame with random nas imputation. |
... |
Other parameters. |
A data.frame with outliers analysis for each variable.
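Deciding between MCAR, MAR and MNAR usually starts from the missing rate of each variable and from a "shadow matrix" that flags where values are absent. The following is a minimal base-R sketch of that idea, not the package's implementation: the `toy` data is made up, and the object names merely mirror the `nas_rate`, `na_vars` and `mat_nas_shadow` arguments.

```r
# Toy data with missing values (hypothetical, for illustration only)
toy <- data.frame(
  income = c(50, NA, 70, NA, 90, 65),
  age    = c(25, 40, NA, 35, 52, 47),
  target = c(0, 1, 0, 1, 0, 0)
)

# Missing rate of each variable
nas_rate <- colMeans(is.na(toy))

# Names of the variables that contain NAs
na_vars <- names(nas_rate)[nas_rate > 0]

# Shadow matrix: 1 where a value is missing, 0 otherwise
mat_nas_shadow <- 1 * is.na(toy[, na_vars, drop = FALSE])

# Missing rate of income within each target class; a large gap between
# classes suggests the values are not missing completely at random
miss_by_target <- tapply(mat_nas_shadow[, "income"], toy$target, mean)
print(miss_by_target)
```

In this toy example every missing `income` value falls in the positive class, which is the kind of pattern that argues against MCAR.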
analysis_outliers
is the function for outliers analysis.
analysis_outliers(dat, target, x, lof = NULL)
dat |
A data.frame with independent variables and target variable. |
target |
The name of target variable. |
x |
The name of variable to process. |
lof |
Outliers of each variable detected by |
A data.frame with outliers analysis for each variable.
as_percent
is a small function for formatting numbers as percentages.
as_percent(x, digits = 2)
x |
A numeric vector or list. |
digits |
Number of digits. Default: 2. |
x with percent format.
as_percent(0.2363, digits = 2)
as_percent(1)
auc_value
is used to get the best lambda required in lasso_filter.
auc_value(target, prob)
target |
Vector of target. |
prob |
A list of predicted probabilities or scores. |
Lambda value
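Assuming, as its name suggests, that auc_value scores predictions by the area under the ROC curve, the quantity can be sketched in base R with the rank (Mann-Whitney) formula. `auc_rank` is a hypothetical helper written for illustration, not the package's code.

```r
# AUC via the rank-sum formula; `target` is 0/1 and `prob` is a score
auc_rank <- function(target, prob) {
  pos   <- target == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(prob)  # average ranks, so ties count as half
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

target <- c(0, 0, 1, 1)
prob   <- c(0.1, 0.4, 0.35, 0.8)
auc_rank(target, prob)  # 0.75
```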
char_cor_vars
is a function for calculating the Cramer's V matrix between categorical variables.
char_cor
is a function for calculating the correlation coefficients between variables using Cramer's V.
char_cor_vars(dat, x)
char_cor(dat, x_list = NULL, ex_cols = "date$", parallel = FALSE, note = FALSE)
dat |
A data frame. |
x |
The name of variable to process. |
x_list |
Names of independent variables. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is "date$". |
parallel |
Logical, parallel computing. Default is FALSE. |
note |
Logical. Outputs info. Default is FALSE. |
A list contains correlation index of x with other variables in dat.
## Not run:
char_x_list = get_names(dat = UCICreditCard,
                        types = c('factor', 'character'),
                        ex_cols = "ID$|date$|default.payment.next.month$",
                        get_ex = FALSE)
char_cor(dat = UCICreditCard[char_x_list])
## End(Not run)
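For readers unfamiliar with Cramer's V, the statistic for a single pair of categorical variables can be sketched in base R from the chi-squared statistic of their contingency table. `cramers_v` is a hypothetical helper for illustration; the package's own computation may differ in details such as bias correction or NA handling.

```r
# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n    <- sum(tab)
  k    <- min(nrow(tab), ncol(tab))
  as.numeric(sqrt(chi2 / (n * (k - 1))))
}

sex <- c("m", "m", "f", "f", "m", "f")
edu <- c("a", "a", "b", "b", "a", "b")
cramers_v(sex, edu)  # perfectly associated toy data give 1
```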
char_to_num
is for converting character variables that actually hold numbers stored as strings to numeric.
char_to_num( dat, char_list = NULL, m = 0, p = 0.5, note = FALSE, ex_cols = NULL )
dat |
A data frame |
char_list |
The list of character variables that need to merge categories. Default is NULL. In case of NULL, merge categories for all variables of string type. |
m |
The minimum number of categories. |
p |
The max percent of categories. |
note |
Logical, outputs info. Default is FALSE. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
A data.frame
dat_sub = lendingclub[c('dti_joint', 'emp_length')]
str(dat_sub)
# variables that are converted to numbers containing strings
dat_sub = char_to_num(dat_sub)
str(dat_sub)
checking_data
checks the data before processing.
checking_data( dat = NULL, target = NULL, occur_time = NULL, note = FALSE, pos_flag = NULL )
dat |
A data.frame with independent variables and target variable. |
target |
The name of target variable. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
note |
Logical. Outputs info. Default is FALSE. |
pos_flag |
The value of positive class of target variable, default: "1". |
data.frame
dat = checking_data(dat = UCICreditCard, target = "default.payment.next.month")
This function is used for city variables derivation.
city_varieble( df = df, city_cols = NULL, city_pattern = NULL, city_class = city_class, parallel = TRUE )
df |
A data.frame. |
city_cols |
Variables of city, |
city_pattern |
Regular expressions, used to match city variable names. Default is "city$". |
city_class |
Class or levels of cities. |
parallel |
Logical, parallel computing. Default is TRUE. |
This function is not intended to be used by end user.
city_varieble_process(df_city, x, city_class)
df_city |
A data.frame. |
x |
Variables of city, |
city_class |
Class or levels of cities. |
cohort_table_plot
is for plotting the cohort (vintage) analysis table. This function is not intended to be used by end user.
cohort_table_plot(cohort_dat)
cohort_plot(cohort_dat)
cohort_dat |
A data.frame generated by |
cor_heat_plot
is for plotting a correlation matrix.
cor_heat_plot( cor_mat, low_color = love_color("deep_red"), high_color = love_color("light_cyan"), title = "Correlation Matrix" )
cor_mat |
A correlation matrix. |
low_color |
color of the lowest correlation between variables. |
high_color |
color of the highest correlation between variables. |
title |
title of plot. |
train_test = train_test_split(UCICreditCard, split_type = "Random", prop = 0.8, save_data = FALSE)
dat_train = train_test$train
dat_test = train_test$test
cor_mat = cor(dat_train[,8:12], use = "complete.obs")
cor_heat_plot(cor_mat)
cor_plot
is for plotting a correlation matrix.
cor_plot( dat, dir_path = tempdir(), x_list = NULL, gtitle = NULL, save_data = FALSE, plot_show = FALSE )
dat |
A data.frame with independent variables and target variable. |
dir_path |
The path for periodically saved graphic files. Default is tempdir(). |
x_list |
Names of independent variables. |
gtitle |
The title of the graph & The name for periodically saved graphic file. Default is "_correlation_of_variables". |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
plot_show |
Logical, show graph in current graphic device. |
train_test = train_test_split(UCICreditCard, split_type = "Random", prop = 0.8, save_data = FALSE)
dat_train = train_test$train
dat_test = train_test$test
cor_plot(dat_train[,8:12], plot_show = TRUE)
This function is not intended to be used by end user.
cos_sim(x, y, cos_margin = 1)
x |
A list of numbers |
y |
A list of numbers |
cos_margin |
Margin of matrix, 1 for rows and 2 for cols, Default is 1. |
A number: the cosine similarity.
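For two numeric vectors, cosine similarity is the dot product divided by the product of the vector norms. A minimal base-R sketch for the vector case follows; `cos_sim_sketch` is a hypothetical name, and the matrix handling controlled by `cos_margin` is not reproduced here.

```r
# Cosine similarity of two numeric vectors of equal length
cos_sim_sketch <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

cos_sim_sketch(c(1, 2, 3), c(2, 4, 6))  # parallel vectors: 1
cos_sim_sketch(c(1, 0), c(0, 1))        # orthogonal vectors: 0
```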
customer_segmentation
is a function for clustering and finding the best segmentation variable.
customer_segmentation( dat, x_list = NULL, ex_cols = NULL, cluster_control = list(meth = "Kmeans", kc = 2, nstart = 1, epsm = 1e-06, sf = 2, max_iter = 100), tree_control = list(cv_folds = 5, maxdepth = kc + 1, minbucket = nrow(dat)/(kc + 1)), save_data = FALSE, file_name = NULL, dir_path = tempdir() )
dat |
A data.frame containing only predictor variables. |
x_list |
A list of x variables. |
ex_cols |
A list of excluded variables. Default is NULL. |
cluster_control |
A list of controls for clustering. kc is the number of cluster centers (default is 2), nstart is the number of random groups (default is 1), max_iter is the maximum number of iterations (default is 100).
|
tree_control |
A list of controls for the decision tree used to find the best segmentation variable.
|
save_data |
Logical. If TRUE, save outliers analysis file to the specified folder at |
file_name |
The name for periodically saved segmentation file. Default is NULL. |
dir_path |
The path for periodically saved segmentation file. |
A "data.frame" object containing the cluster results.
Bezdek, James C. "FCM: The fuzzy c-means clustering algorithm". Computers & Geosciences (0098-3004),doi:10.1016/0098-3004(84)90020-7
clust = customer_segmentation(dat = lendingclub[1:10000,20:30], x_list = NULL, ex_cols = "id$|loan_status", cluster_control = list(meth = "FCM", kc = 2), save_data = FALSE, tree_control = list(minbucket = round(nrow(lendingclub) / 10)), file_name = NULL, dir_path = tempdir())
cut_equal
is used to generate initial breaks for equal frequency binning.
cut_equal(dat_x, g = 10, sp_values = NULL, cut_bin = "equal_depth")
dat_x |
A vector of a variable x. |
g |
numeric, number of initial bins for equal_bins. |
sp_values |
a list of special value. Default: list(-1, "missing") |
cut_bin |
A string, 'equal_depth' or 'equal_width', default is 'equal_depth'. |
get_breaks, get_breaks_all, get_tree_breaks
# equal sample size breaks
equ_breaks = cut_equal(dat = UCICreditCard[, "PAY_AMT2"], g = 10)
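The two binning modes named by `cut_bin` can be illustrated with base R: equal-depth breaks come from quantiles, equal-width breaks from an evenly spaced sequence. This is a sketch under that assumption, with hypothetical helper names; special values (`sp_values`) are not handled.

```r
# Equal-depth (equal-frequency) breaks: quantiles at evenly spaced probabilities
equal_depth_breaks <- function(x, g = 10) {
  probs <- seq(0, 1, length.out = g + 1)
  unique(quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
}

# Equal-width breaks: evenly spaced points between min and max
equal_width_breaks <- function(x, g = 10) {
  seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), length.out = g + 1)
}

x <- 1:100
b <- equal_depth_breaks(x, g = 4)
bin_counts <- table(cut(x, breaks = b, include.lowest = TRUE))
print(bin_counts)  # each of the four bins holds 25 observations
```

Calling `unique()` on the quantiles matters for skewed variables, where several quantiles can coincide and would otherwise produce duplicate break points.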
This function creates stratified folds for cross validation.
cv_split(dat, k = 5, occur_time = NULL, seed = 46)
dat |
A data.frame. |
k |
k is an integer specifying the number of folds. |
occur_time |
time variable for creating OOT folds. Default is NULL. |
seed |
A seed. Default is 46. |
a list of indices
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
The data_cleansing
function is a simple wrapper for data cleaning functions, such as:
delete variables whose values are all NAs;
check the format of dat and target;
delete low variance variables;
replace null or NULL or blank with NA;
encode variables whose NA and missing value rate is more than 95%;
encode variables whose unique value rate is more than 95%;
merge categories of character variables that have more than 10 categories;
transfer time variables to date format;
remove duplicated observations;
process outliers;
process NAs.
data_cleansing( dat, target = NULL, obs_id = NULL, occur_time = NULL, pos_flag = NULL, x_list = NULL, ex_cols = NULL, miss_values = NULL, remove_dup = TRUE, outlier_proc = TRUE, missing_proc = "median", low_var = 0.999, missing_rate = 0.999, merge_cat = TRUE, note = TRUE, parallel = FALSE, save_data = FALSE, file_name = NULL, dir_path = tempdir() )
dat |
A data frame with x and target. |
target |
The name of target variable. |
obs_id |
The name of ID of observations. Default is NULL. |
occur_time |
The name of occur time of observations. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
x_list |
A list of x variables. |
ex_cols |
A list of excluded variables. Default is NULL. |
miss_values |
Other extreme value might be used to represent missing values, e.g: -9999, -9998. These miss_values will be encoded to -1 or "missing". |
remove_dup |
Logical, if TRUE, remove the duplicated observations. |
outlier_proc |
Logical, process outliers or not. Default is TRUE. |
missing_proc |
If logical, process missing values or not. If "median", NAs are imputed with the median of the k nearest neighbors. If "avg_dist", NAs are imputed with the distance-weighted average of the k nearest neighbors. If "default", missing values are assigned -1 or "missing". Otherwise, missing values are processed according to the results of the missing analysis. |
low_var |
The maximum percent of unique values (including NAs) for filtering low variance variables. |
missing_rate |
The maximum percent of missing values for recoding values to missing and non_missing. |
merge_cat |
The minimum number of categories for merging categories of character variables. |
note |
Logical. Outputs info. Default is TRUE. |
parallel |
Logical, parallel computing or not. Default is FALSE. |
save_data |
Logical, save the result or not. Default is FALSE. |
file_name |
The name for periodically saved data file. Default is NULL. |
dir_path |
The path for periodically saved data file. Default is tempdir(). |
A preprocessed data.frame
remove_duplicated, null_blank_na, entry_rate_na, low_variance_filter, process_nas, process_outliers
# data cleaning
dat_cl = data_cleansing(dat = UCICreditCard[1:2000,],
                        target = "default.payment.next.month",
                        x_list = NULL,
                        obs_id = "ID",
                        occur_time = "apply_date",
                        ex_cols = c("PAY_6|BILL_"),
                        outlier_proc = TRUE,
                        missing_proc = TRUE,
                        low_var = TRUE,
                        save_data = FALSE)
The data_exploration
function includes both univariate and bivariate analysis, ranging from univariate statistics and frequency distributions to correlations, cross-tabulation and characteristic analysis.
data_exploration( dat, save_data = FALSE, file_name = NULL, dir_path = tempdir(), note = FALSE )
dat |
A data.frame with x and target. |
save_data |
Logical. If TRUE, save files to the specified folder at |
file_name |
The file name for periodically saved outliers analysis file. Default is NULL. |
dir_path |
The path for periodically saved outliers analysis file. Default is tempdir(). |
note |
Logical, outputs info. Default is FALSE. |
A list containing both category and numeric variable analysis.
data_ex = data_exploration(dat = UCICreditCard[1:1000,])
date_cut
is a small function to get date point.
date_cut(dat_time, pct = 0.7, g = 100)
dat_time |
time vectors. |
pct |
the percent of cutting. Default: 0.7. |
g |
Number of cuts. |
A Date.
date_cut(dat_time = lendingclub$issue_d, pct = 0.8) #"2018-08-01"
de_one_hot_encoding
is for reversing one-hot encoding, recovering the original categorical variables.
de_one_hot_encoding(dat_one_hot, cat_vars = NULL, na_act = TRUE, note = FALSE)
dat_one_hot |
A data frame with the one-hot encoded variables. |
cat_vars |
Variables to be recovered. Default is NULL; if NULL, these variables are found through regular expressions. |
na_act |
Logical. If TRUE, missing values are assigned as "missing"; if FALSE, missing values are omitted. Default is TRUE. |
note |
Logical. Outputs info. Default is FALSE. |
A data frame with the character variables recovered from one-hot encoding.
# one hot encoding
dat1 = one_hot_encoding(dat = UCICreditCard, cat_vars = c("SEX", "MARRIAGE"), merge_cat = TRUE, na_act = TRUE)
# de one hot encoding
dat2 = de_one_hot_encoding(dat_one_hot = dat1, cat_vars = c("SEX","MARRIAGE"), na_act = FALSE)
de_percent
is a small function for converting percent-formatted strings back to numbers.
de_percent(x, digits = 2)
x |
Character with percent format. |
digits |
Number of digits. Default: 2. |
x without percent format.
de_percent("24%")
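The round trip between numbers and percent strings can be sketched in base R as below. `to_percent` and `from_percent` are hypothetical stand-ins written for illustration; the exact output format of as_percent and de_percent (for example padding or rounding rules) may differ.

```r
# Number -> percent string, e.g. 0.2363 -> "23.63%"
to_percent <- function(x, digits = 2) {
  paste0(round(x * 100, digits), "%")
}

# Percent string -> number, e.g. "24%" -> 0.24
from_percent <- function(x, digits = 2) {
  round(as.numeric(sub("%", "", x, fixed = TRUE)) / 100, digits)
}

to_percent(0.2363, digits = 2)
from_percent("24%")
```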
This function is not intended to be used by end user.
derived_interval(dat_s, interval_type = c("cnt_interval", "time_interval"))
dat_s |
A data.frame containing only predictor variables. |
interval_type |
One of c("cnt_interval", "time_interval"). |
This function is not intended to be used by end user.
derived_partial_acf(dat_s)
dat_s |
A data.frame |
This function is not intended to be used by end user.
derived_pct(dat_s, pct_type = "total_pct")
dat_s |
A data.frame containing only predictor variables. |
pct_type |
Available option: "total_pct". |
This function is used for deriving behavioral variables and is not intended to be used by end user.
derived_ts_vars( dat, grx = NULL, td = NULL, ID = NULL, ex_cols = NULL, x_list = NULL, der = c("cvs", "sums", "means", "maxs", "max_mins", "time_intervals", "cnt_intervals", "total_pcts", "cum_pcts", "partial_acfs"), parallel = TRUE, note = TRUE )
derived_ts( dat, grx_x = NULL, x_list = NULL, td = NULL, ID = NULL, ex_cols = NULL, der = c("cvs", "sums", "means", "maxs", "max_mins", "time_intervals", "cnt_intervals", "total_pcts", "cum_pcts", "partial_acfs") )
dat |
A data.frame containing only predictor variables. |
grx |
Regular expressions used to match variable names. |
td |
Number of variables to derive. |
ID |
The name of ID of observations or key variable of data. Default is NULL. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
x_list |
Names of independent variables. |
der |
Variables to derive. |
parallel |
Logical, parallel computing. Default is TRUE. |
note |
Logical, outputs info. Default is TRUE. |
grx_x |
Regular expression used to match a group of variable names. |
The key to creating a good model is not the power of a specific modelling technique, but the breadth and depth of derived variables that represent a higher level of knowledge about the phenomena under examination.
digits_num
is for calculating the optimal number of digits for numeric variables.
digits_num(dat_x)
dat_x |
A numeric variable. |
A number of digits
## Not run:
digits_num(lendingclub[,"dti"]) # 7
## End(Not run)
entropy_weight
is for calculating Entropy Weight.
entropy_weight(dat, pos_vars, neg_vars)
dat |
A data.frame with independent variables. |
pos_vars |
Names or index of positive direction variables, the bigger the better. |
neg_vars |
Names or index of negative direction variables, the smaller the better. |
Step 1: Normalize the raw data.
Step 2: Find the total contribution of all samples to the index Xj.
Step 3: Transform each element of the matrix generated in Step 2 into the product of the element and its natural logarithm, ln(element), and calculate the information entropy.
Step 4: Calculate the redundancy.
Step 5: Calculate the weight of each index.
A data.frame with weights of each variable.
entropy_weight(dat = ewm_data, pos_vars = c(6,8,9,10), neg_vars = c(7,11))
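The five steps of the entropy weight method can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: `entropy_weight_sketch` is a hypothetical name, min-max scaling is assumed for the normalization step, 0 * ln(0) is taken as 0, and constant columns (which would divide by zero) are not handled.

```r
entropy_weight_sketch <- function(dat, pos_vars, neg_vars) {
  vars <- c(pos_vars, neg_vars)
  x <- as.matrix(dat[, vars, drop = FALSE])

  # Step 1: min-max normalization; negative-direction variables are flipped
  norm <- x
  for (j in seq_along(vars)) {
    rng <- range(x[, j])
    if (vars[j] %in% pos_vars) {
      norm[, j] <- (x[, j] - rng[1]) / (rng[2] - rng[1])
    } else {
      norm[, j] <- (rng[2] - x[, j]) / (rng[2] - rng[1])
    }
  }

  # Step 2: share of each sample in the column total
  p <- sweep(norm, 2, colSums(norm), "/")

  # Step 3: information entropy, with 0 * log(0) treated as 0
  plogp <- ifelse(p > 0, p * log(p), 0)
  e <- -colSums(plogp) / log(nrow(x))

  # Step 4: redundancy
  d <- 1 - e

  # Step 5: weights
  data.frame(variable = vars, weight = as.numeric(d / sum(d)))
}

toy <- data.frame(profit = c(10, 20, 30, 40), cost = c(8, 3, 2, 1))
res <- entropy_weight_sketch(toy, pos_vars = "profit", neg_vars = "cost")
print(res)
```

The weights always sum to 1, and a variable whose normalized values are spread less evenly across samples receives a larger weight.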
entry_rate_na
is the function for recoding variables whose rate of missing values exceeds a given percentage as missing and non_missing.
entry_rate_na(dat, nr = 0.98, note = FALSE)
dat |
A data frame with x and target. |
nr |
The maximum percent of NAs. |
note |
Logical. Outputs info. Default is FALSE. |
A data.frame
datss = entry_rate_na(dat = lendingclub[1:1000, ], nr = 0.98)
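The recoding idea can be sketched in base R: when a column is almost entirely missing, the only usable information left is whether a value is present. `recode_high_missing` is a hypothetical helper; whether the package compares with `>` or `>=`, and which labels it writes, are assumptions here.

```r
# Recode columns whose NA rate exceeds `nr` into "missing" / "non_missing"
recode_high_missing <- function(dat, nr = 0.98) {
  for (v in names(dat)) {
    if (mean(is.na(dat[[v]])) > nr) {
      dat[[v]] <- ifelse(is.na(dat[[v]]), "missing", "non_missing")
    }
  }
  dat
}

toy <- data.frame(a = c(NA, NA, NA, 1), b = c(1, 2, NA, 4))
out <- recode_high_missing(toy, nr = 0.5)
print(out)  # column a is recoded, column b is left unchanged
```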
This function is not intended to be used by end user.
euclid_dist(x, y, cos_margin = 1)
x |
A list |
y |
A list |
cos_margin |
rows or cols |
eval_auc, eval_ks, eval_lift, eval_tnr
are for getting the best params of xgboost.
eval_auc(preds, dtrain)
eval_ks(preds, dtrain)
eval_tnr(preds, dtrain)
eval_lift(preds, dtrain)
preds |
A list of predict probability or score. |
dtrain |
Matrix of x predictors. |
List of best value
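As an illustration of one of these metrics, and assuming eval_ks reports the Kolmogorov-Smirnov statistic commonly used for scorecards, KS is the largest gap between the cumulative score distributions of the positive and negative classes. `ks_value` is a hypothetical base-R helper, not the xgboost evaluation callback itself.

```r
# KS = max over thresholds of |F_pos(t) - F_neg(t)|
ks_value <- function(target, prob) {
  pos <- prob[target == 1]
  neg <- prob[target == 0]
  thr <- sort(unique(prob))
  max(abs(sapply(thr, function(t) mean(pos <= t) - mean(neg <= t))))
}

ks_value(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.5
```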
This data is for Entropy Weight Method examples.
A data frame with 10 rows and 13 variables.
fast_high_cor_filter
In a highly correlated variable group, select the variable with the highest IV.
high_cor_filter
In a highly correlated variable group, select the variable with the highest IV.
fast_high_cor_filter( dat, p = 0.95, x_list = NULL, com_list = NULL, ex_cols = NULL, save_data = FALSE, cor_class = TRUE, vars_name = TRUE, parallel = FALSE, note = FALSE, file_name = NULL, dir_path = tempdir(), ... )
high_cor_filter( dat, com_list = NULL, x_list = NULL, ex_cols = NULL, onehot = TRUE, parallel = FALSE, p = 0.7, file_name = NULL, dir_path = tempdir(), save_data = FALSE, note = FALSE, ... )
dat |
A data.frame with independent variables. |
p |
Threshold of correlation between features. Default is 0.95. |
x_list |
Names of independent variables. |
com_list |
A data.frame with important values of each variable. eg : IV_list |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
cor_class |
Calculate the correlation matrix of categorical variables. Default is TRUE. |
vars_name |
Logical, output a list of filtered variables or table with detailed compared value of each variable. Default is TRUE. |
parallel |
Logical, parallel computing. Default is FALSE. |
note |
Logical. Outputs info. Default is FALSE. |
file_name |
The name for periodically saved results files. Default is "Feature_selected_COR". |
dir_path |
The path for periodically saved results files. Default is tempdir(). |
... |
Additional parameters. |
onehot |
one-hot-encoding independent variables. |
A list of selected variables.
get_correlation_group, high_cor_selector, char_cor_vars
# calculate iv for each variable.
iv_list = feature_selector(dat_train = UCICreditCard[1:1000,], dat_test = NULL,
                           target = "default.payment.next.month",
                           occur_time = "apply_date",
                           filter = c("IV"), cv_folds = 1, iv_cp = 0.01,
                           ex_cols = "ID$|date$|default.payment.next.month$",
                           save_data = FALSE, vars_name = FALSE)
fast_high_cor_filter(dat = UCICreditCard[1:1000,],
                     com_list = iv_list, save_data = FALSE,
                     ex_cols = "ID$|date$|default.payment.next.month$",
                     p = 0.9, cor_class = FALSE, vars_name = FALSE)
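The selection rule "within a highly correlated group, keep the variable with the highest IV" can be sketched greedily in base R. This is one plausible reading, not the package's algorithm: `select_by_iv`, the toy data and the IV values are all hypothetical.

```r
# Walk through variables from highest to lowest IV and keep a variable
# only if its absolute correlation with every kept variable is below p
select_by_iv <- function(cor_mat, iv, p = 0.95) {
  ord  <- names(sort(iv, decreasing = TRUE))
  kept <- character(0)
  for (v in ord) {
    if (length(kept) == 0 || all(abs(cor_mat[v, kept]) < p)) {
      kept <- c(kept, v)
    }
  }
  kept
}

toy <- data.frame(x1 = c(1, 2, 3, 4, 5),
                  x2 = c(2, 4, 6, 8, 10),  # perfectly correlated with x1
                  x3 = c(5, 1, 4, 2, 3))
iv <- c(x1 = 0.10, x2 = 0.30, x3 = 0.05)   # made-up IV values
selected <- select_by_iv(cor(toy), iv, p = 0.95)
print(selected)  # x2 wins over x1; x3 is kept because it is weakly correlated
```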
feature_selector
This function uses four different methods (IV, PSI, correlation, xgboost) in order to select important features. The correlation algorithm must be used with IV.
feature_selector( dat_train, dat_test = NULL, x_list = NULL, target = NULL, pos_flag = NULL, occur_time = NULL, ex_cols = NULL, filter = c("IV", "PSI", "XGB", "COR"), cv_folds = 1, iv_cp = 0.01, psi_cp = 0.5, xgb_cp = 0, cor_cp = 0.98, breaks_list = NULL, hopper = FALSE, vars_name = TRUE, parallel = FALSE, note = TRUE, seed = 46, save_data = FALSE, file_name = NULL, dir_path = tempdir(), ... )
dat_train |
A data.frame with independent variables and target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
x_list |
Names of independent variables. |
target |
The name of target variable. |
pos_flag |
The value of positive class of target variable, default: "1". |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
filter |
The methods for selecting important and stable variables. |
cv_folds |
Number of cross-validations. Default: 1. |
iv_cp |
The minimum threshold of IV. 0 < iv_i ; 0.01 to 0.1 usually work. Default: 0.01. |
psi_cp |
The maximum threshold of PSI. 0 <= psi_i <=1; 0.05 to 0.2 usually work. Default: 0.5. |
xgb_cp |
Threshold of XGB feature's Gain. 0 <= xgb_cp <=1. Default is 1/number of independent variables. |
cor_cp |
Threshold of correlation between features. 0 <= cor_cp <=1; 0.7 to 0.98 usually work. Default is 0.98. |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
hopper |
Logical.Filtering screening. Default is FALSE. |
vars_name |
Logical, output a list of filtered variables or table with detailed IV and PSI value of each variable. Default is FALSE. |
parallel |
Logical, parallel computing. Default is FALSE. |
note |
Logical.Outputs info. Default is TRUE. |
seed |
Random number seed. Default is 46. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved results files. Default is "select_vars". |
dir_path |
The path for periodically saved results files. Default is tempdir(). |
... |
Other parameters. |
A list of selected features
psi_iv_filter
, xgb_filter
, gbm_filter
feature_selector(dat_train = UCICreditCard[1:1000,c(2,8:12,26)], dat_test = NULL, target = "default.payment.next.month", occur_time = "apply_date", filter = c("IV", "PSI"), cv_folds = 1, iv_cp = 0.01, psi_cp = 0.1, xgb_cp = 0, cor_cp = 0.98, vars_name = FALSE,note = FALSE)
This function is used for Fuzzy Clustering.
fuzzy_cluster_means(dat, kc = 2, sf = 2, nstart = 1, max_iter = 100, epsm = 1e-06)
fuzzy_cluster(dat, kc = 2, init_centers, sf = 3, max_iter = 100, epsm = 1e-06)
dat |
A data.frame containing only predictor variables. |
kc |
The number of cluster centers. Default is 2. |
sf |
Default is 2 for fuzzy_cluster_means and 3 for fuzzy_cluster. |
nstart |
The number of random groups. Default is 1. |
max_iter |
Maximum number of iterations. Default is 100. |
epsm |
Default is 1e-06. |
init_centers |
Initial centers of obs. |
Bezdek, James C. "FCM: The fuzzy c-means clustering algorithm". Computers & Geosciences (0098-3004),doi:10.1016/0098-3004(84)90020-7
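The following is a minimal base-R sketch of the fuzzy c-means iteration described in the Bezdek reference. It does not call fuzzy_cluster_means or fuzzy_cluster and is not the package's implementation; the helper name fcm_sketch and the fuzzifier argument m are hypothetical, and whether the package's sf argument corresponds to m is an assumption that this manual does not confirm.

```r
# Hypothetical helper: a bare-bones fuzzy c-means loop in base R.
fcm_sketch <- function(x, k = 2, m = 2, max_iter = 100, eps = 1e-6, seed = 46) {
  x <- as.matrix(x)
  set.seed(seed)
  # Start from k randomly chosen observations as centers.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  u <- NULL
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every center (n x k).
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    d2 <- pmax(d2, .Machine$double.eps)  # guard against division by zero
    # Membership update: u_ij is proportional to d_ij^(-2 / (m - 1)).
    w <- d2^(-1 / (m - 1))
    u <- w / rowSums(w)
    # Center update: weighted mean of the observations with weights u^m.
    um <- u^m
    new_centers <- t(um) %*% x / colSums(um)
    shift <- max(abs(new_centers - centers))
    centers <- new_centers
    if (shift < eps) break
  }
  list(centers = centers, membership = u)
}

fit <- fcm_sketch(iris[, 1:4], k = 3)
round(head(fit$membership), 3)
```

Each row of the membership matrix sums to 1, which is what distinguishes fuzzy clustering from a hard assignment such as k-means.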
This function is used for gathering or aggregating data.
gather_data(dat, x_list = NULL, ID = NULL, FUN = sum_x)
dat |
A data.frame containing only predictor variables. |
x_list |
The names of variables to gather. |
ID |
The name of ID of observations or key variable of data. Default is NULL. |
FUN |
The function of gathering method. |
The key to creating a good model is not the power of a specific modelling technique, but the breadth and depth of derived variables that represent a higher level of knowledge about the phenomena under examination.
dat = data.frame(id = c(1,1,1,2,2,3,3,3,4,4,4,4,4,5,5,6,7,7,
                        8,8,8,9,9,9,10,10,11,11,11,11,11,11),
                 terms = c('a','b','c','a','c','d','d','a',
                           'b','c','a','c','d','a','c',
                           'd','a','e','f','b','c','f','b',
                           'c','h','h','i','c','d','g','k','k'),
                 time = c(8,3,1,9,6,1,4,9,1,3,4,8,2,7,1,
                          3,4,1,8,7,2,5,7,8,8,2,1,5,7,2,7,3))
gather_data(dat = dat, x_list = "time", ID = 'id', FUN = sum_x)
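For readers who want to see the general group-and-aggregate idea without the package, here is a small base-R sketch using aggregate. It is only an analogy: the exact shape of the object returned by gather_data is not shown in this manual, and treating sum_x as a sum that ignores missing values is an assumption.

```r
dat_small <- data.frame(id = c(1, 1, 2, 2, 2, 3),
                        time = c(8, 3, 9, 6, 1, 4))

# Collapse to one row per id by summing `time`, ignoring missing values.
agg <- aggregate(time ~ id, data = dat_small,
                 FUN = function(v) sum(v, na.rm = TRUE))
agg
```

Swapping the function passed to FUN (for example max or mean) gives other derived variables per key, which is the kind of variable derivation the note above refers to.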
gbm_filter
is for selecting important features using GBM.
gbm_filter( dat, target = NULL, x_list = NULL, ex_cols = NULL, pos_flag = NULL, GBM.params = gbm_params(), cores_num = 2, vars_name = TRUE, note = TRUE, save_data = FALSE, file_name = NULL, dir_path = tempdir(), seed = 46, ... )
dat |
A data.frame with independent variables and target variable. |
target |
The name of target variable. |
x_list |
Names of independent variables. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
GBM.params |
Parameters of GBM. |
cores_num |
The number of CPU cores to use. |
vars_name |
Logical, output a list of filtered variables or table with detailed IV and PSI value of each variable. Default is TRUE. |
note |
Logical, outputs info. Default is TRUE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved results files. Default is "Feature_importance_GBDT". |
dir_path |
The path for periodically saved results files. Default is tempdir(). |
seed |
Random number seed. Default is 46. |
... |
Other parameters to pass to gbdt_params. |
Selected variables.
psi_iv_filter
, xgb_filter
, feature_selector
GBM.params = gbm_params(n.trees = 2, interaction.depth = 2, shrinkage = 0.1,
                        bag.fraction = 1, train.fraction = 1,
                        n.minobsinnode = 30, cv.folds = 2)
## Not run:
features = gbm_filter(dat = UCICreditCard[1:1000, c(8:12, 26)],
                      target = "default.payment.next.month",
                      occur_time = "apply_date",
                      GBM.params = GBM.params, vars_name = FALSE)
## End(Not run)
gbm_params
is the list of parameters used to train a GBM in training_model
.
gbm_params( n.trees = 1000, interaction.depth = 6, shrinkage = 0.01, bag.fraction = 0.5, train.fraction = 0.7, n.minobsinnode = 30, cv.folds = 5, ... )
n.trees |
Integer specifying the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion. Default is 1000. |
interaction.depth |
Integer specifying the maximum depth of each tree (i.e., the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc. Default is 6. |
shrinkage |
A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction; 0.001 to 0.1 usually work, but a smaller learning rate typically requires more trees. Default is 0.01. |
bag.fraction |
The fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction < 1 then running the same model twice will result in similar but different fits. gbm uses the R random number generator, so set.seed can ensure that the model can be reconstructed. Preferably, the user can save the returned gbm.object using save. Default is 0.5. |
train.fraction |
The first train.fraction * nrows(data) observations are used to fit the gbm and the remainder are used for computing out-of-sample estimates of the loss function. |
n.minobsinnode |
Integer specifying the minimum number of observations in the terminal nodes of the trees. Note that this is the actual number of observations, not the total weight. |
cv.folds |
Number of cross-validation folds to perform. If cv.folds > 1 then gbm, in addition to the usual fit, will perform a cross-validation and calculate an estimate of generalization error returned in cv.error. |
... |
Other parameters |
See details at: gbm
A list of parameters.
training_model
, lr_params
, xgb_params
, rf_params
get_auc_ks_lambda
finds the best lambda for lasso_filter. This function is required by lasso_filter.
get_auc_ks_lambda( lasso_model, x_test, y_test, save_data = FALSE, plot_show = TRUE, file_name = NULL, dir_path = tempdir() )
lasso_model |
A lasso model generated by glmnet. |
x_test |
A matrix of x values from the test dataset. |
y_test |
A matrix of y values from the test dataset. |
save_data |
Logical, save results in locally specified folder. Default is FALSE |
plot_show |
Logical, if TRUE plot the results. Default is TRUE. |
file_name |
The name for periodically saved results files. Default is NULL. |
dir_path |
The path for periodically saved results files. |
Lambda values with max K-S and AUC.
lasso_filter
, get_sim_sign_lambda
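The two metrics named in the return value can be computed from scores and 0/1 labels in a few lines of base R. This sketch shows only the metrics, not the lambda search and not the package's internals; the helper name ks_auc_sketch is hypothetical.

```r
# Hypothetical helper: KS and AUC from scores and 0/1 labels in base R.
ks_auc_sketch <- function(score, y) {
  pos <- score[y == 1]
  neg <- score[y == 0]
  # AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks.
  r <- rank(c(pos, neg))
  auc <- (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
  # KS: largest gap between the empirical score distributions of the two classes.
  cuts <- sort(unique(score))
  ks <- max(abs(ecdf(pos)(cuts) - ecdf(neg)(cuts)))
  c(auc = auc, ks = ks)
}

ks_auc_sketch(score = c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2),
              y     = c(1,   1,   0,   1,   0,   0))
```

Evaluating such a function once per candidate lambda and keeping the lambda with the highest value is one plausible way to read "lambda values with max K-S and AUC".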
get_bins_table
is used to generate summary information for variables.
get_bins_table_all
can generate bins tables for all specified independent variables.
get_bins_table_all(dat, x_list = NULL, target = NULL, pos_flag = NULL, dat_test = NULL, ex_cols = NULL, breaks_list = NULL, parallel = FALSE, note = FALSE, bins_total = TRUE, save_data = FALSE, file_name = NULL, dir_path = tempdir())
get_bins_table(dat, x, target = NULL, pos_flag = NULL, dat_test = NULL, breaks = NULL, breaks_list = NULL, bins_total = TRUE, note = FALSE)
dat |
A data.frame with independent variables and target variable. |
x_list |
Names of independent variables. |
target |
The name of target variable. |
pos_flag |
Value of positive class, Default is "1". |
dat_test |
A data.frame of test data. Default is NULL. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
parallel |
Logical, parallel computing. Default is FALSE. |
note |
Logical, outputs info. Default is FALSE. |
bins_total |
Logical, total sum for each column. Default is TRUE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved bins table file. Default is "bins_table". |
dir_path |
The path for periodically saved bins table file. Default is tempdir(). |
x |
The name of an independent variable. |
breaks |
Splitting points for an independent variable. Default is NULL. |
get_iv
,
get_iv_all
,
get_psi
,
get_psi_all
breaks_list = get_breaks_all(dat = UCICreditCard, x_list = names(UCICreditCard)[3:4],
                             target = "default.payment.next.month", equal_bins = TRUE,
                             best = FALSE, g = 5, ex_cols = "ID|apply_date", save_data = FALSE)
get_bins_table_all(dat = UCICreditCard, breaks_list = breaks_list,
                   target = "default.payment.next.month")
get_breaks
is for generating optimal binning for numerical and nominal variables.
The get_breaks_all
is a simpler wrapper for get_breaks
.
get_breaks_all(dat, target = NULL, x_list = NULL, ex_cols = NULL, pos_flag = NULL, occur_time = NULL, oot_pct = 0.7, best = TRUE, equal_bins = FALSE, cut_bin = "equal_depth", g = 10, sp_values = NULL, tree_control = list(p = 0.05, cp = 1e-06, xval = 5, maxdepth = 10), bins_control = list(bins_num = 10, bins_pct = 0.05, b_chi = 0.05, b_odds = 0.1, b_psi = 0.05, b_or = 0.15, mono = 0.3, odds_psi = 0.2, kc = 1), parallel = FALSE, note = FALSE, save_data = FALSE, file_name = NULL, dir_path = tempdir(), ...)
get_breaks(dat, x, target = NULL, pos_flag = NULL, best = TRUE, equal_bins = FALSE, cut_bin = "equal_depth", g = 10, sp_values = NULL, occur_time = NULL, oot_pct = 0.7, tree_control = NULL, bins_control = NULL, note = FALSE, ...)
dat |
A data frame with x and target. |
target |
The name of target variable. |
x_list |
A list of x variables. |
ex_cols |
A list of excluded variables. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
oot_pct |
Percentage of observations retained for the out-of-time (OOT) test (especially to calculate PSI). Default is 0.7. |
best |
Logical, if TRUE, merge initial breaks to get optimal breaks for binning. |
equal_bins |
Logical, if TRUE, generate initial breaks with equal sample sizes. If FALSE, generate tree breaks using a decision tree. |
cut_bin |
A string, if equal_bins is TRUE, 'equal_depth' or 'equal_width', default is 'equal_depth'. |
g |
Integer, number of initial bins for equal_bins. |
sp_values |
A list of missing values. |
tree_control |
The list of tree parameters. |
bins_control |
The list of binning parameters. |
parallel |
Logical, parallel computing or not. Default is FALSE. |
note |
Logical, outputs info. Default is FALSE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
File name that save results in locally specified folder. Default is "breaks_list". |
dir_path |
Path to save results. Default is tempdir(). |
... |
Additional parameters. |
x |
The name of an independent variable. |
A table containing a list of splitting points for each independent variable.
get_tree_breaks
, cut_equal
, select_best_class
, select_best_breaks
# controls
tree_control = list(p = 0.02, cp = 0.000001, xval = 5, maxdepth = 10)
bins_control = list(bins_num = 10, bins_pct = 0.02, b_chi = 0.02, b_odds = 0.1,
                    b_psi = 0.05, b_or = 15, mono = 0.2, odds_psi = 0.1, kc = 5)
# get category variable breaks
b = get_breaks(dat = UCICreditCard[1:1000,], x = "MARRIAGE",
               target = "default.payment.next.month", occur_time = "apply_date",
               sp_values = list(-1, "missing"),
               tree_control = tree_control, bins_control = bins_control)
# get numeric variable breaks
b2 = get_breaks(dat = UCICreditCard[1:1000,], x = "PAY_2",
                target = "default.payment.next.month", occur_time = "apply_date",
                sp_values = list(-1, "missing"),
                tree_control = tree_control, bins_control = bins_control)
# get breaks of all predictive variables
b3 = get_breaks_all(dat = UCICreditCard[1:1000,], target = "default.payment.next.month",
                    x_list = c("MARRIAGE","PAY_2"), occur_time = "apply_date", ex_cols = "ID",
                    sp_values = list(-1, "missing"),
                    tree_control = tree_control, bins_control = bins_control,
                    save_data = FALSE)
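The difference between the two cut_bin options can be illustrated in base R. This is a sketch of the general equal-depth and equal-width ideas only; whether the package's cut_equal computes its breaks in exactly this way is an assumption, and the variable names below are illustrative.

```r
x <- c(1, 2, 2, 3, 5, 8, 13, 21, 34, 55)
g <- 4  # number of initial bins

# Equal depth: interior cut points at quantiles, so bins hold similar counts.
depth_breaks <- unique(quantile(x, probs = seq(0, 1, length.out = g + 1))[-c(1, g + 1)])

# Equal width: interior cut points evenly spaced over the range of x.
width_breaks <- seq(min(x), max(x), length.out = g + 1)[-c(1, g + 1)]

# Counts per bin under each scheme.
table(cut(x, breaks = c(-Inf, depth_breaks, Inf)))
table(cut(x, breaks = c(-Inf, width_breaks, Inf)))
```

On skewed data such as this, equal-width bins leave most observations in the first bin, which is one reason equal-depth initial breaks are a common starting point before merging.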
get_correlation_group
is a function for obtaining highly correlated variable groups.
select_cor_group
is a function for selecting a highly correlated variable group.
select_cor_list
is a function for selecting a highly correlated variable list.
get_correlation_group(cor_mat, p = 0.8)
select_cor_group(cor_vars)
select_cor_list(cor_vars_list)
cor_mat |
A correlation matrix of independent variables. |
p |
Threshold of correlation between features. Default is 0.8. |
cor_vars |
Correlated variables. |
cor_vars_list |
List of correlated variables. |
A list of selected variables.
## Not run:
cor_mat = cor(UCICreditCard[8:20], use = "complete.obs", method = "spearman")
get_correlation_group(cor_mat, p = 0.6)
## End(Not run)
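As a rough illustration of what a "highly correlated variable group" means, the base-R sketch below lists, for each variable, the variables whose absolute correlation with it exceeds a threshold. The helper name cor_groups_sketch is hypothetical and the grouping rule is an assumption; the package may form or merge groups differently.

```r
# Hypothetical helper: group variables whose absolute correlation exceeds p.
cor_groups_sketch <- function(cor_mat, p = 0.8) {
  vars <- colnames(cor_mat)
  groups <- lapply(vars, function(v) vars[abs(cor_mat[v, ]) > p])
  names(groups) <- vars
  # Keep only variables that are correlated with something besides themselves.
  groups[lengths(groups) > 1]
}

set.seed(46)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))
cor_groups_sketch(cor(toy, method = "spearman"), p = 0.8)
```

Here a and b are near-duplicates and end up in each other's group, while the independent variable c is dropped from the result.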
get_iv
is used to calculate Information Value (IV) of an independent variable.
get_iv_all
can loop through IV for all specified independent variables.
get_iv_all(dat, x_list = NULL, ex_cols = NULL, breaks_list = NULL, target = NULL, pos_flag = NULL, best = TRUE, equal_bins = FALSE, tree_control = NULL, bins_control = NULL, g = 10, parallel = FALSE, note = FALSE)
get_iv(dat, x, target = NULL, pos_flag = NULL, breaks = NULL, breaks_list = NULL, best = TRUE, equal_bins = FALSE, tree_control = NULL, bins_control = NULL, g = 10, note = FALSE)
dat |
A data.frame with independent variables and target variable. |
x_list |
Names of independent variables. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
target |
The name of target variable. |
pos_flag |
Value of positive class, Default is "1". |
best |
Logical, merge initial breaks to get optimal breaks for binning. |
equal_bins |
Logical, generates initial breaks for equal frequency binning. |
tree_control |
Parameters for using a decision tree to segment initial breaks. See details: |
bins_control |
Parameters used to control binning. See details: |
g |
Number of initial breakpoints for equal frequency binning. |
parallel |
Logical, parallel computing. Default is FALSE. |
note |
Logical, outputs info. Default is FALSE. |
x |
The name of an independent variable. |
breaks |
Splitting points for an independent variable. Default is NULL. |
IV rules of thumb for evaluating the strength of a predictor:
Less than 0.02: unpredictive
0.02 to 0.1: weak
0.1 to 0.3: medium
0.3 and above: strong
Information Value Statistic: Bruce Lund, Magnify Analytics Solutions, a Division of Marketing Associates, Detroit, MI (Paper AA-14-2013)
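To make the rules of thumb concrete, here is a minimal base-R sketch of the standard Information Value formula, IV = sum over bins of (good share - bad share) * ln(good share / bad share), applied to an already binned predictor. The helper name iv_sketch is hypothetical, the target is assumed to be coded 0/1, and this is not the package's implementation (which also handles binning, special values, and empty bins).

```r
# Hypothetical helper: IV of a binned predictor against a 0/1 target.
iv_sketch <- function(bin, y) {
  tab <- table(bin, y)
  # Share of goods (y == 0) and bads (y == 1) falling in each bin.
  good <- tab[, "0"] / sum(tab[, "0"])
  bad  <- tab[, "1"] / sum(tab[, "1"])
  # Weight of evidence per bin; a bin with zero goods or bads yields Inf here.
  woe <- log(good / bad)
  sum((good - bad) * woe)
}

bin <- rep(c("low", "mid", "high"), times = c(40, 40, 20))
y   <- c(rep(0:1, c(36, 4)), rep(0:1, c(30, 10)), rep(0:1, c(10, 10)))
iv_sketch(bin, y)
```

In this toy data the bad rate rises from 10% in the low bin to 50% in the high bin, and the resulting IV falls in the "strong" band of the rules above.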
get_iv
,get_iv_all
,get_psi
,get_psi_all
get_iv_all(dat = UCICreditCard, x_list = names(UCICreditCard)[3:10],
           equal_bins = TRUE, best = FALSE,
           target = "default.payment.next.month", ex_cols = "ID|apply_date")
get_iv(UCICreditCard, x = "PAY_3", equal_bins = TRUE, best = FALSE,
       target = "default.payment.next.month")
get_logistic_coef
is for getting logistic coefficients.
get_logistic_coef( lg_model, file_name = NULL, dir_path = tempdir(), save_data = FALSE )
lg_model |
An object of logistic model. |
file_name |
The name for periodically saved coefficient file. Default is "LR_coef". |
dir_path |
The path for periodically saved coefficient file. Default is tempdir(). |
save_data |
Logical, save the result or not. Default is FALSE. |
A data.frame with logistic coefficients.
# dataset splitting
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
# rename the target variable
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID",
                     occur_time = "apply_date", miss_values = list("", -1))
# train-test splitting
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
                              occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
# get breaks of all predictive variables
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION", "PAY_3", "PAY_2")
breaks_list = get_breaks_all(dat = dat_train, target = "target",
                             x_list = x_list, occur_time = "apply_date",
                             ex_cols = "ID", save_data = FALSE, note = FALSE)
# woe transforming
train_woe = woe_trans_all(dat = dat_train, target = "target",
                          breaks_list = breaks_list, woe_name = FALSE)
test_woe = woe_trans_all(dat = dat_test, target = "target",
                         breaks_list = breaks_list, note = FALSE)
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = train_woe[, c("target", x_list)], family = binomial(logit))
# get LR coefficient
dt_imp_LR = get_logistic_coef(lg_model = lr_model, save_data = FALSE)
bins_table = get_bins_table_all(dat = dat_train, target = "target",
                                x_list = x_list, dat_test = dat_test,
                                breaks_list = breaks_list, note = FALSE)
# score card
LR_score_card = get_score_card(lg_model = lr_model, bins_table, target = "target")
# scoring
train_pred = dat_train[, c("ID", "apply_date", "target")]
test_pred = dat_test[, c("ID", "apply_date", "target")]
train_pred$pred_LR = score_transfer(model = lr_model, tbl_woe = train_woe,
                                    save_data = TRUE)[, "score"]
test_pred$pred_LR = score_transfer(model = lr_model, tbl_woe = test_woe,
                                   save_data = FALSE)[, "score"]
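For comparison, the coefficient table of any glm fit can be pulled out with base R alone. The sketch below uses the built-in mtcars data rather than credit data, and the column layout it produces is base R's, not necessarily the layout returned by get_logistic_coef, which this manual does not spell out.

```r
# Fit a small logistic regression on a built-in dataset.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))

# Coefficient table: estimate, standard error, z value, and p-value per term.
coef_tbl <- as.data.frame(coef(summary(fit)))
coef_tbl$term <- rownames(coef_tbl)
rownames(coef_tbl) <- NULL
coef_tbl[, c("term", "Estimate", "Std. Error", "z value", "Pr(>|z|)")]
```

In a scorecard workflow the Estimate column is the part that feeds the points calculation, since each variable's WOE value is multiplied by its coefficient.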
This function is not intended to be used by end user.
get_median(x, weight_avg = NULL)
x |
A vector or list. |
weight_avg |
avg weight to calculate means. |
get_names
is for getting names of particular classes of variables
get_names( dat, types = c("logical", "factor", "character", "numeric", "integer64", "integer", "double", "Date", "POSIXlt", "POSIXct", "POSIXt"), ex_cols = NULL, get_ex = FALSE )
dat |
A data.frame with independent variables and target variable. |
types |
The classes or types of variables whose names to get. Default is every type listed in the usage. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
get_ex |
Logical, if TRUE, return a list containing the names of excluded variables. Default is FALSE. |
A list containing the names of variables.
x_list = get_names(dat = UCICreditCard, types = c('factor', 'character'),
                   ex_cols = c("default.payment.next.month","ID$|_date$"), get_ex = FALSE)
x_list = get_names(dat = UCICreditCard, types = c('numeric', 'character', "integer"),
                   ex_cols = c("default.payment.next.month", "ID$|SEX "), get_ex = FALSE)
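The selection logic (pick columns by class, then drop names matching a regular expression) can be approximated in base R as follows. The helper name names_by_type and the toy data frame are hypothetical, and the sketch is not the package's implementation.

```r
toy <- data.frame(id = 1:3,
                  grade = factor(c("a", "b", "a")),
                  city = c("x", "y", "z"),
                  amt = c(1.5, 2, 3.25),
                  stringsAsFactors = FALSE)

# Hypothetical helper: names of columns with a given class, minus regex-excluded ones.
names_by_type <- function(dat, types, ex_cols = NULL) {
  is_type <- vapply(dat, function(col) any(class(col) %in% types), logical(1))
  keep <- names(dat)[is_type]
  if (!is.null(ex_cols)) {
    keep <- keep[!grepl(paste(ex_cols, collapse = "|"), keep)]
  }
  keep
}

names_by_type(toy, types = c("factor", "character"))
names_by_type(toy, types = c("numeric", "integer"), ex_cols = "^id$")
```

Anchoring the pattern (for example "^id$" or "ID$") keeps the exclusion from accidentally matching longer variable names that merely contain the same letters.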
This function is not intended to be used by end user.
get_nas_random(dat)
dat |
A data.frame containing only predictor variables. |
get_psi
is used to calculate Population Stability Index (PSI) of an independent variable.
get_psi_all
can loop through PSI for all specified independent variables.
get_psi_all(dat, x_list = NULL, target = NULL, dat_test = NULL, breaks_list = NULL, occur_time = NULL, start_date = NULL, cut_date = NULL, oot_pct = 0.7, pos_flag = NULL, parallel = FALSE, ex_cols = NULL, as_table = FALSE, g = 10, bins_no = TRUE, note = FALSE)
get_psi(dat, x, target = NULL, dat_test = NULL, occur_time = NULL, start_date = NULL, cut_date = NULL, pos_flag = NULL, breaks = NULL, breaks_list = NULL, oot_pct = 0.7, g = 10, as_table = TRUE, note = FALSE, bins_no = TRUE)
dat |
A data.frame with independent variables and target variable. |
x_list |
Names of independent variables. |
target |
The name of target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
start_date |
The earliest occurrence time of observations. |
cut_date |
Time point for splitting data sets, e.g. splitting the Actual and Expected data sets. |
oot_pct |
Percentage of observations retained for the out-of-time (OOT) test (especially to calculate PSI). Default is 0.7. |
pos_flag |
Value of positive class, Default is "1". |
parallel |
Logical, parallel computing. Default is FALSE. |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
as_table |
Logical, output results in a table. Default is FALSE for get_psi_all and TRUE for get_psi. |
g |
Number of initial breakpoints for equal frequency binning. |
bins_no |
Logical, add serial numbers to bins. Default is TRUE. |
note |
Logical, outputs info. Default is FALSE. |
x |
The name of an independent variable. |
breaks |
Splitting points for an independent variable. Default is NULL. |
PSI rules for evaluating the stability of a predictor:
Less than 0.02: very stable
0.02 to 0.1: stable
0.1 to 0.2: unstable
0.2 to 0.5: change
More than 0.5: great change
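The thresholds above apply to the usual PSI formula, PSI = sum over bins of (actual share - expected share) * ln(actual share / expected share). Below is a minimal base-R sketch of that formula with equal-frequency bins taken from the expected sample. The helper name psi_sketch and the 1e-6 floor for empty bins are assumptions made for illustration, not the package's behavior.

```r
# Hypothetical helper: PSI between expected and actual samples of one variable.
psi_sketch <- function(expected, actual, g = 10) {
  # Cut points come from the expected (baseline) sample only.
  brks <- unique(quantile(expected, probs = seq(0, 1, length.out = g + 1)))
  brks[1] <- -Inf
  brks[length(brks)] <- Inf
  e <- prop.table(table(cut(expected, brks)))
  a <- prop.table(table(cut(actual, brks)))
  # Floor tiny shares so empty bins do not produce log(0).
  e <- pmax(as.numeric(e), 1e-6)
  a <- pmax(as.numeric(a), 1e-6)
  sum((a - e) * log(a / e))
}

set.seed(46)
base <- rnorm(5000)
psi_sketch(base, rnorm(5000))            # same distribution
psi_sketch(base, rnorm(5000, mean = 1))  # shifted distribution
```

The first call compares two samples from the same distribution and should land in the stable range, while the shifted sample should score far above the 0.2 mark.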
get_iv
,get_iv_all
,get_psi
,get_psi_all
# dat_test is NULL
get_psi(dat = UCICreditCard, x = "PAY_3", occur_time = "apply_date")
# dat_test is not NULL
# train_test split
train_test = train_test_split(dat = UCICreditCard, prop = 0.7, split_type = "OOT",
                              occur_time = "apply_date", start_date = NULL, cut_date = NULL,
                              save_data = FALSE, note = FALSE)
dat_ex = train_test$train
dat_ac = train_test$test
# generate psi table
get_psi(dat = dat_ex, dat_test = dat_ac, x = "PAY_3",
        occur_time = "apply_date", bins_no = TRUE)
get_psi_iv
is used to calculate Information Value (IV) and Population Stability Index (PSI) of an independent variable.
get_psi_iv_all
can loop through IV & PSI for all specified independent variables.
get_psi_iv_all(dat, dat_test = NULL, x_list = NULL, target, ex_cols = NULL, pos_flag = NULL, breaks_list = NULL, occur_time = NULL, oot_pct = 0.7, equal_bins = FALSE, cut_bin = "equal_depth", tree_control = NULL, bins_control = NULL, bins_total = FALSE, best = TRUE, g = 10, as_table = TRUE, note = FALSE, parallel = FALSE, bins_no = TRUE)
get_psi_iv(dat, dat_test = NULL, x, target, pos_flag = NULL, breaks = NULL, breaks_list = NULL, occur_time = NULL, oot_pct = 0.7, equal_bins = FALSE, cut_bin = "equal_depth", tree_control = NULL, bins_control = NULL, bins_total = FALSE, best = TRUE, g = 10, as_table = TRUE, note = FALSE, bins_no = TRUE)
dat |
A data.frame with independent variables and target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
x_list |
Names of independent variables. |
target |
The name of target variable. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
oot_pct |
Percentage of observations retained for the out-of-time test (especially to calculate PSI). Default is 0.7. |
equal_bins |
Logical, generates initial breaks for equal frequency or width binning. |
cut_bin |
A string, if equal_bins is TRUE, 'equal_depth' or 'equal_width', default is 'equal_depth'. |
tree_control |
Parameters for using a decision tree to segment initial breaks. See details: |
bins_control |
Parameters used to control binning. See details: |
bins_total |
Logical, total sum for each variable. |
best |
Logical, merge initial breaks to get optimal breaks for binning. |
g |
Number of initial breakpoints for equal frequency binning. |
as_table |
Logical, output results in a table. Default is TRUE. |
note |
Logical, outputs info. Default is FALSE. |
parallel |
Logical, parallel computing. Default is FALSE. |
bins_no |
Logical, add serial numbers to bins. Default is TRUE. |
x |
The name of an independent variable. |
breaks |
Splitting points for an independent variable. Default is NULL. |
get_iv
,get_iv_all
,get_psi
,get_psi_all
iv_list = get_psi_iv_all(dat = UCICreditCard[1:1000, ],
                         x_list = names(UCICreditCard)[3:5], equal_bins = TRUE,
                         target = "default.payment.next.month",
                         ex_cols = "ID|apply_date")
get_psi_iv(UCICreditCard, x = "PAY_3",
           target = "default.payment.next.month", bins_total = TRUE)
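To make the two quantities concrete, the following base R sketch computes IV and PSI from already-binned values using the textbook formulas. The helper names `iv_sketch` and `psi_sketch` and the `eps` guard against empty bins are assumptions made for illustration; they are not part of creditmodel, and the package's own binning and smoothing may differ.

```r
# Hypothetical helpers (not creditmodel functions): textbook IV and PSI.
# eps avoids log(0) for empty bins; its value is an assumption.
iv_sketch <- function(bin, target, pos_flag = 1, eps = 1e-10) {
  tab <- table(bin, target == pos_flag)            # rows: bins, columns: FALSE / TRUE
  good_pct <- tab[, "FALSE"] / sum(tab[, "FALSE"]) # share of negatives in each bin
  bad_pct  <- tab[, "TRUE"]  / sum(tab[, "TRUE"])  # share of positives in each bin
  sum((bad_pct - good_pct) * log((bad_pct + eps) / (good_pct + eps)))
}

psi_sketch <- function(bin_expected, bin_actual, eps = 1e-10) {
  lv <- union(unique(bin_expected), unique(bin_actual))
  e <- as.numeric(table(factor(bin_expected, levels = lv))) / length(bin_expected)
  a <- as.numeric(table(factor(bin_actual, levels = lv))) / length(bin_actual)
  sum((a - e) * log((a + eps) / (e + eps)))
}

# Toy data: two bins with clearly different positive rates.
bin    <- c(rep("low", 50), rep("high", 50))
target <- c(rep(1, 10), rep(0, 40), rep(1, 30), rep(0, 20))
iv <- iv_sketch(bin, target)

# PSI between a 50/50 and a 70/30 bin distribution.
psi_shift <- psi_sketch(bin, c(rep("low", 70), rep("high", 30)))
```

A PSI of zero means the two samples fall into the bins in identical proportions; larger values indicate a larger shift.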
You can use the psi_plot
to plot PSI of your data.
get_psi_plots
can loop through plots for all specified independent variables.
get_psi_plots(
  dat_train, dat_test = NULL, x_list = NULL, ex_cols = NULL,
  breaks_list = NULL, occur_time = NULL, g = 10, plot_show = TRUE,
  save_data = FALSE, file_name = NULL, parallel = FALSE, g_width = 8,
  dir_path = tempdir()
)

psi_plot(
  dat_train, x, dat_test = NULL, occur_time = NULL, g_width = 8,
  breaks_list = NULL, breaks = NULL, g = 10, plot_show = TRUE,
  save_data = FALSE, dir_path = tempdir()
)
dat_train |
A data.frame with independent variables. |
dat_test |
A data.frame of test data. Default is NULL. |
x_list |
Names of independent variables. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
g |
Number of initial breakpoints for equal frequency binning. |
plot_show |
Logical, show the plots in the current graphic device. Default is TRUE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved data file. Default is NULL. |
parallel |
Logical, parallel computing. Default is FALSE. |
g_width |
The width of graphs. |
dir_path |
The path for periodically saved graphic files. |
x |
The name of an independent variable. |
breaks |
Splitting points for a continuous variable. |
train_test = train_test_split(UCICreditCard[1:1000,], split_type = "Random",
                              prop = 0.8, save_data = FALSE)
dat_train = train_test$train
dat_test = train_test$test
get_psi_plots(dat_train[, c(8, 9)], dat_test = dat_test[, c(8, 9)])
get_score_card
generates a standard scorecard.
get_score_card(
  lg_model, target, bins_table, a = 600, b = 50, file_name = NULL,
  dir_path = tempdir(), save_data = FALSE
)
lg_model |
An object of glm model. |
target |
The name of target variable. |
bins_table |
a data.frame generated by |
a |
Base line of score. |
b |
Numeric. The change in score when the odds double. |
file_name |
The name for periodically saved scorecard file. Default is "LR_Score_Card". |
dir_path |
The path for periodically saved scorecard file. Default is tempdir(). |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
scorecard
# dataset splitting
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
# rename the target variable
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID",
                     occur_time = "apply_date", miss_values = list("", -1))
# train-test splitting
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
                              occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
# get breaks of all predictive variables
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION", "PAY_3", "PAY_2")
breaks_list = get_breaks_all(dat = dat_train, target = "target",
                             x_list = x_list, occur_time = "apply_date",
                             ex_cols = "ID", save_data = FALSE, note = FALSE)
# WOE transformation
train_woe = woe_trans_all(dat = dat_train, target = "target",
                          breaks_list = breaks_list, woe_name = FALSE)
test_woe = woe_trans_all(dat = dat_test, target = "target",
                         breaks_list = breaks_list, note = FALSE)
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = train_woe[, c("target", x_list)],
               family = binomial(logit))
# get LR coefficients
dt_imp_LR = get_logistic_coef(lg_model = lr_model, save_data = FALSE)
bins_table = get_bins_table_all(dat = dat_train, target = "target",
                                dat_test = dat_test, x_list = x_list,
                                breaks_list = breaks_list, note = FALSE)
# scorecard
LR_score_card = get_score_card(lg_model = lr_model, bins_table, target = "target")
# scoring
train_pred = dat_train[, c("ID", "apply_date", "target")]
test_pred = dat_test[, c("ID", "apply_date", "target")]
train_pred$pred_LR = score_transfer(model = lr_model, tbl_woe = train_woe,
                                    save_data = FALSE)[, "score"]
test_pred$pred_LR = score_transfer(model = lr_model, tbl_woe = test_woe,
                                   save_data = FALSE)[, "score"]
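The roles of `a` (baseline score) and `b` (points per doubling of the odds) can be illustrated with the common "points to double the odds" scaling. The sketch below is an assumption about the convention, not a copy of what get_score_card does internally: it treats `a` as the score at odds of 1:1 and subtracts `b` points each time the odds of the positive class double. The helper name `prob_to_score` is hypothetical.

```r
# Hypothetical helper (not a creditmodel function): maps the predicted
# probability of the positive class to a score under the assumed scaling.
prob_to_score <- function(p, a = 600, b = 50) {
  odds <- p / (1 - p)            # odds of the positive class
  a - (b / log(2)) * log(odds)   # each doubling of the odds costs b points
}

score_even   <- prob_to_score(0.5)    # odds 1:1
score_double <- prob_to_score(2 / 3)  # odds 2:1
```

Under this convention the gap between `score_even` and `score_double` equals `b`.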
This function is not intended to be used by end user.
get_shadow_nas(dat)
dat |
A data.frame containing only predictor variables. |
get_sim_sign_lambda
finds the best lambda required in lasso_filter. This function is a helper for lasso_filter.
get_sim_sign_lambda(lasso_model, sim_sign = "negtive")
lasso_model |
A lasso model generated by glmnet. |
sim_sign |
Default is "negtive". This is related to pos_flag. If pos_flag is "1" or 1, the value must be set to "negtive". If pos_flag is "0" or 0, the value must be set to "positive". |
lambda.sim_sign gives the model in which the coefficients of all variables share the same sign (all positive or all negative).
Lambda value
get_tree_breaks
generates initial breaks by decision tree for a numerical or nominal variable.
The get_breaks
function is a simpler wrapper for get_tree_breaks
.
get_tree_breaks(
  dat, x, target, pos_flag = NULL,
  tree_control = list(p = 0.02, cp = 1e-06, xval = 5, maxdepth = 10),
  sp_values = NULL
)
dat |
A data frame with x and target. |
x |
The name of the variable whose breaks are cut by the tree. |
target |
The name of target variable. |
pos_flag |
The value of positive class of target variable, default: "1". |
tree_control |
the list of parameters to control cutting initial breaks by decision tree.
|
sp_values |
A list of special values. Default: NULL. |
# tree breaks
tree_control = list(p = 0.02, cp = 0.000001, xval = 5, maxdepth = 10)
tree_breaks = get_tree_breaks(dat = UCICreditCard, x = "MARRIAGE",
                              target = "default.payment.next.month",
                              tree_control = tree_control)
get_x_list
returns the variable names common to x_list, the training data, and the test data.
get_x_list(
  dat_train = NULL, dat_test = NULL, x_list = NULL, ex_cols = NULL,
  note = FALSE
)
dat_train |
A data.frame with independent variables. |
dat_test |
Another data.frame. |
x_list |
Names of independent variables. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
note |
Logical. Outputs info. Default is FALSE. |
A list containing the names of variables.
x_list = get_x_list(x_list = NULL, dat_train = UCICreditCard,
                    ex_cols = c("default.payment.next.month", "ID$|_date$"))
high_cor_selector
compares pairs of highly correlated variables and selects the variable with the largest IV value.
high_cor_selector(
  cor_mat, p = 0.95, x_list = NULL, com_list = NULL, retain = TRUE
)
cor_mat |
A correlation matrix. |
p |
The threshold of high correlation. |
x_list |
Names of independent variables. |
com_list |
A data.frame with the importance values of each variable, e.g. IV_list. |
retain |
Logical, output selected variables, if FALSE, output filtered variables. |
A list of selected variables.
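The selection rule can be sketched in base R as a greedy pass: variables are visited from highest to lowest IV, and a variable is kept only if its absolute correlation with every variable already kept does not exceed `p`. This is a minimal sketch of the idea, not the package's implementation; the helper name `select_by_iv` and the use of a named IV vector in place of `com_list` are assumptions.

```r
# Hypothetical helper (not a creditmodel function).
# cor_mat must carry variable names in its dimnames; iv is a named numeric vector.
select_by_iv <- function(cor_mat, iv, p = 0.95) {
  vars <- names(sort(iv, decreasing = TRUE))   # strongest variables first
  keep <- character(0)
  for (v in vars) {
    # keep v only if it is not highly correlated with any variable kept so far
    if (all(abs(cor_mat[v, keep]) <= p)) keep <- c(keep, v)
  }
  keep
}

vars <- c("x1", "x2", "x3")
cor_mat <- matrix(c(1,    0.98, 0.1,
                    0.98, 1,    0.2,
                    0.1,  0.2,  1),
                  nrow = 3, dimnames = list(vars, vars))
iv <- c(x1 = 0.30, x2 = 0.45, x3 = 0.05)
selected <- select_by_iv(cor_mat, iv, p = 0.95)  # x1 loses to x2 (higher IV)
```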
is_date
is a small function for distinguishing time formats
is_date(x)
x |
list or vectors |
A Date.
is_date(lendingclub$issue_d)
This function is not intended to be used by end user.
knn_nas_imp(
  dat, x, nas_rate = NULL, mat_nas_shadow = NULL, dt_nas_random = NULL,
  k = 10, scale = FALSE, method = "median", miss_value_num = -1
)
dat |
A data.frame with independent variables. |
x |
The name of variable to process. |
nas_rate |
A list containing the NA rate of each variable. |
mat_nas_shadow |
A shadow matrix of variables which contain nas. |
dt_nas_random |
A data.frame with random nas imputation. |
k |
Number of neighbors for each observation in which x is missing. |
scale |
Logical. Whether to standardize the variables. |
method |
The methods of imputation by knn. "median" is knn imputation with k neighbors median, "avg_dist" is knn imputation with k neighbors of distance weighted mean. |
miss_value_num |
Default value of missing data imputation for numeric variables. Default is -1. |
ks_table
is for generating a model performance table.
ks_table_plot
plots the table generated by ks_table
ks_psi_plot
plots the K-S & PSI distributions.
ks_table(
  train_pred, test_pred = NULL, target = NULL, score = NULL, g = 10,
  breaks = NULL, pos_flag = list("1", "1", "Bad", 1)
)

ks_table_plot(
  train_pred, test_pred, target = "target", score = "score", g = 10,
  plot_show = TRUE, g_width = 12, file_name = NULL, save_data = FALSE,
  dir_path = tempdir(), gtitle = NULL
)

ks_psi_plot(
  train_pred, test_pred, target = "target", score = "score",
  gtitle = NULL, plot_show = TRUE, g_width = 12, save_data = FALSE,
  breaks = NULL, g = 10, dir_path = tempdir()
)

model_key_index(tb_pred)
train_pred |
A data frame of training with predicted prob or score. |
test_pred |
A data frame of validation data with predicted prob or score. |
target |
The name of target variable. |
score |
The name of prob or score variable. |
g |
Number of breaks for prob or score. |
breaks |
Splitting points of prob or score. |
pos_flag |
The value of positive class of target variable, default: "1". |
plot_show |
Logical, show model performance in current graphic device. Default is TRUE. |
g_width |
Width of graphs. |
file_name |
The name for periodically saved data file. Default is NULL. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
dir_path |
The path for periodically saved graphic files. |
gtitle |
The title of the graph & The name for periodically saved graphic file. Default is "_ks_psi_table". |
tb_pred |
A table generated by ks_table. |
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID",
                     occur_time = "apply_date", miss_values = list("", -1))
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
                              occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = dat_train[, c("target", x_list)],
               family = binomial(logit))
dat_train$pred_LR = round(predict(lr_model, dat_train[, x_list], type = "response"), 5)
dat_test$pred_LR = round(predict(lr_model, dat_test[, x_list], type = "response"), 5)
# model evaluation
ks_psi_plot(train_pred = dat_train, test_pred = dat_test,
            score = "pred_LR", target = "target", plot_show = TRUE)
tb_pred = ks_table_plot(train_pred = dat_train, test_pred = dat_test,
                        score = "pred_LR", target = "target",
                        g = 10, g_width = 13, plot_show = FALSE)
key_index = model_key_index(tb_pred)
ks_value
computes the K-S value for a prob or score.
ks_value(target, prob)
target |
Vector of target. |
prob |
A list of predicted probabilities or scores. |
KS value
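For reference, the K-S statistic is the largest gap between the cumulative share of positives and the cumulative share of negatives when observations are ranked by score. The base R sketch below follows that textbook definition; the helper name `ks_sketch` and its `pos_flag` argument are assumptions for illustration, and ks_value itself may bin or round differently.

```r
# Hypothetical helper (not a creditmodel function): textbook K-S statistic.
ks_sketch <- function(target, prob, pos_flag = 1) {
  ord <- order(prob, decreasing = TRUE)            # rank from highest score down
  is_pos <- target[ord] == pos_flag
  cum_pos <- cumsum(is_pos) / sum(is_pos)          # cumulative share of positives
  cum_neg <- cumsum(!is_pos) / sum(!is_pos)        # cumulative share of negatives
  # with tied scores, compare only at the last observation of each tie group
  at_cut <- !duplicated(prob[ord], fromLast = TRUE)
  max(abs(cum_pos - cum_neg)[at_cut])
}

target <- c(1, 1, 0, 1, 0, 0, 1, 0)
prob   <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2)
ks <- ks_sketch(target, prob)

# A score that separates the classes perfectly reaches the maximum of 1.
ks_perfect <- ks_sketch(c(1, 1, 0, 0), c(0.9, 0.8, 0.2, 0.1))
```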
lasso_filter
filters variables by LASSO.
lasso_filter(
  dat_train, dat_test = NULL, target = NULL, x_list = NULL,
  pos_flag = NULL, ex_cols = NULL, sim_sign = "negtive",
  best_lambda = "lambda.auc", save_data = FALSE, plot.it = TRUE,
  seed = 46, file_name = NULL, dir_path = tempdir(), note = FALSE
)
dat_train |
A data.frame with independent variables and target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
target |
The name of target variable. |
x_list |
Names of independent variables. |
pos_flag |
The value of positive class of target variable, default: "1". |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
sim_sign |
The coefficients of all variables should be all negative or all positive after the WOE transformation. Default is "negtive" when pos_flag is "1". |
best_lambda |
Criterion for choosing the best lambda when filtering variables by LASSO. Three methods are available: "lambda.auc", "lambda.ks", "lambda.sim_sign". Default is "lambda.auc". |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
plot.it |
Logical, shrinkage plot. Default is TRUE. |
seed |
Random number seed. Default is 46. |
file_name |
The name for periodically saved results files. Default is "Feature_selected_LASSO". |
dir_path |
The path for periodically saved results files. Default is tempdir(). |
note |
Logical, outputs info. Default is FALSE. |
A list of filtered x variables by lasso.
sub = cv_split(UCICreditCard, k = 40)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
dat_train = data_cleansing(dat, target = "target", obs_id = "ID",
                           occur_time = "apply_date", miss_values = list("", -1))
dat_train = process_nas(dat_train)
# get breaks of all predictive variables
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION", "PAY_3", "PAY_2")
breaks_list = get_breaks_all(dat = dat_train, target = "target",
                             x_list = x_list, occur_time = "apply_date",
                             ex_cols = "ID", save_data = FALSE, note = FALSE)
# WOE transformation
train_woe = woe_trans_all(dat = dat_train, x_list = x_list, target = "target",
                          breaks_list = breaks_list, woe_name = FALSE)
lasso_filter(dat_train = train_woe, target = "target", x_list = x_list,
             save_data = FALSE, plot.it = FALSE)
This data contains complete loan data for all loans issued through the time period stated, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The data containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter(time period: 2018Q1:2018Q4).
A data frame with 63532 rows and 145 variables.
id: A unique LC assigned ID for the loan listing.
issue_d: The month which the loan was funded.
loan_status: Current status of the loan.
addr_state: The state provided by the borrower in the loan application.
acc_open_past_24mths: Number of trades opened in past 24 months.
all_util: Balance to credit limit on all trades.
annual_inc: The self-reported annual income provided by the borrower during registration.
avg_cur_bal: Average current balance of all accounts.
bc_open_to_buy: Total open to buy on revolving bankcards.
bc_util: Ratio of total current balance to high credit/credit limit for all bankcard accounts.
dti: A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income.
dti_joint: A ratio calculated using the co-borrowers' total monthly payments on the total debt obligations, excluding mortgages and the requested LC loan, divided by the co-borrowers' combined self-reported monthly income.
emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
emp_title: The job title supplied by the Borrower when applying for the loan.
funded_amnt_inv: The total amount committed by investors for that loan at that point in time.
grade: LC assigned loan grade
inq_last_12m: Number of credit inquiries in past 12 months
installment: The monthly payment owed by the borrower if the loan originates.
max_bal_bc: Maximum current balance owed on all revolving accounts
mo_sin_old_il_acct: Months since oldest bank installment account opened
mo_sin_old_rev_tl_op: Months since oldest revolving account opened
mo_sin_rcnt_rev_tl_op: Months since most recent revolving account opened
mo_sin_rcnt_tl: Months since most recent account opened
mort_acc: Number of mortgage accounts.
pct_tl_nvr_dlq: Percent of trades never delinquent
percent_bc_gt_75: Percentage of all bankcard accounts > 75% of limit.
purpose: A category provided by the borrower for the loan request.
sub_grade: LC assigned loan subgrade
term: The number of payments on the loan. Values are in months and can be either 36 or 60.
tot_cur_bal: Total current balance of all accounts
tot_hi_cred_lim: Total high credit/credit limit
total_acc: The total number of credit lines currently in the borrower's credit file
total_bal_ex_mort: Total credit balance excluding mortgage
total_bc_limit: Total bankcard high credit/credit limit
total_cu_tl: Number of finance trades
total_il_high_credit_limit: Total installment high credit/credit limit
verification_status_joint: Indicates if the co-borrowers' joint income was verified by LC, not verified, or if the income source was verified.
zip_code: The first 3 numbers of the zip code provided by the borrower in the loan application.
lift_value
is for getting max lift value for a prob or score.
lift_value(target, prob)
target |
Vector of target. |
prob |
A list of predict probability or score. |
Max lift value
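One common way to read "max lift" is as follows: rank observations by score, split the ranking into `g` cumulative groups, and divide the positive rate within each top-scored group by the overall positive rate; the maximum of these ratios is the max lift. The sketch below implements that reading in base R. The helper name `lift_sketch`, the `g = 10` grouping, and the `pos_flag` argument are assumptions; lift_value may define its groups differently.

```r
# Hypothetical helper (not a creditmodel function): maximum cumulative lift.
lift_sketch <- function(target, prob, pos_flag = 1, g = 10) {
  is_pos <- (target == pos_flag)[order(prob, decreasing = TRUE)]
  n <- length(is_pos)
  cuts <- unique(ceiling(seq_len(g) * n / g))  # sizes of the top 1/g, 2/g, ... groups
  cum_rate <- cumsum(is_pos)[cuts] / cuts      # positive rate within each top group
  max(cum_rate / mean(is_pos))                 # relative to the overall positive rate
}

# 20 observations, 5 positives, all of them ranked at the top.
target <- c(rep(1, 5), rep(0, 15))
prob   <- seq(0.95, 0, length.out = 20)
max_lift <- lift_sketch(target, prob)
```

Here the overall positive rate is 0.25 and the top groups contain only positives, so the lift peaks at 1 / 0.25.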
local_outlier_factor
calculates the local outlier factor (LOF) for a data set using k-nearest neighbors.
This function is not intended to be used by end users.
local_outlier_factor(dat, k = 10)
dat |
A data.frame containing only predictor variables. |
k |
Number of neighbors for LOF. Default is 10. |
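As background, the LOF of a point compares its local density with the density of its k nearest neighbors: values near 1 indicate a point as dense as its surroundings, while values well above 1 indicate an outlier. The base R sketch below follows the standard definition in simplified form (ties at the k-distance are not expanded, and duplicated points are not handled). The helper name `lof_sketch` is hypothetical and the code is not the package's implementation.

```r
# Hypothetical helper (not a creditmodel function): simplified LOF.
lof_sketch <- function(dat, k = 3) {
  d <- as.matrix(dist(dat))        # pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf                   # a point is not its own neighbor
  # k nearest neighbors of each point, one column per point
  nn <- matrix(apply(d, 1, function(row) order(row)[seq_len(k)]), nrow = k)
  k_dist <- sapply(seq_len(n), function(i) d[i, nn[k, i]])
  # local reachability density: inverse of the mean reachability distance
  lrd <- sapply(seq_len(n), function(i) {
    reach <- pmax(k_dist[nn[, i]], d[i, nn[, i]])
    1 / mean(reach)
  })
  # LOF: mean density of the neighbors relative to the point's own density
  sapply(seq_len(n), function(i) mean(lrd[nn[, i]]) / lrd[i])
}

# Five points in a unit square plus one point far away.
pts <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1), c(0.5, 0.5), c(5, 5))
lof <- lof_sketch(pts, k = 3)
```

In this toy data the sixth point receives by far the largest factor, while the clustered points stay close to 1.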
log_trans
is for logarithmic transformation
log_trans(
  dat, target, x_list = NULL, cor_dif = 0.01, ex_cols = NULL, note = TRUE
)

log_vars(dat, x_list = NULL, target = NULL, cor_dif = 0.01, ex_cols = NULL)
dat |
A data.frame. |
target |
The name of target variable. |
x_list |
A list of x variables. |
cor_dif |
The difference between the correlation of the log-transformed variable with the target and the correlation of the original variable with the target. |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
note |
Logical, outputs info. Default is TRUE. |
Log transformed data.frame.
dat = log_trans(dat = UCICreditCard, target = "default.payment.next.month",
                x_list = NULL, cor_dif = 0.01, ex_cols = "ID", note = TRUE)
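The role of `cor_dif` can be illustrated with a small base R sketch: a variable is log-transformed only when doing so raises its absolute correlation with the target by more than `cor_dif`. The helper name `log_trans_sketch` is hypothetical, and skipping variables with non-positive values is an assumption made to keep `log()` defined; the package may treat such values differently.

```r
# Hypothetical helper (not a creditmodel function).
log_trans_sketch <- function(dat, target, x_list, cor_dif = 0.01) {
  y <- as.numeric(dat[[target]])
  for (x in x_list) {
    v <- dat[[x]]
    # assumption: only strictly positive numeric variables are candidates
    if (!is.numeric(v) || any(v <= 0, na.rm = TRUE)) next
    gain <- abs(cor(log(v), y, use = "complete.obs")) -
            abs(cor(v, y, use = "complete.obs"))
    if (gain > cor_dif) dat[[x]] <- log(v)   # keep the log only if it helps enough
  }
  dat
}

toy <- data.frame(
  target = as.integer(1:20 > 10),
  x1 = exp((1:20) / 2),      # exponential growth: the log is far more linear in the target
  x2 = rep(c(1, 2), 10)      # unrelated to the target: no gain from the log
)
out <- log_trans_sketch(toy, target = "target", x_list = c("x1", "x2"))
```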
loop_function
is an iterator that applies a function to each object named in x_list.
loop_function(
  func = NULL, args = list(data = NULL), x_list = NULL, bind = "rbind",
  parallel = TRUE, as_list = FALSE
)
func |
A function. |
args |
A list of arguments required by the function. |
x_list |
Names of objects to loop through. |
bind |
How to combine the results; "rbind" & "cbind" are available. |
parallel |
Logical, parallel computing. |
as_list |
Logical, whether outputs to be a list. |
A data.frame or list
dat = UCICreditCard[24:26]
num_x_list = get_names(dat = dat, types = c('numeric', 'integer', 'double'),
                       ex_cols = NULL, get_ex = FALSE)
dat[ ,num_x_list] = loop_function(func = outliers_kmeans_lof, x_list = num_x_list,
                                  args = list(dat = dat), bind = "cbind",
                                  as_list = FALSE, parallel = FALSE)
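The pattern behind such an iterator can be sketched sequentially with `lapply` and `do.call`. This is a simplified illustration, not the package's code: the helper name `loop_sketch` is hypothetical, parallel execution is omitted, and it assumes the looped function receives the current name through an argument called `x`.

```r
# Hypothetical helper (not a creditmodel function): sequential loop-and-bind.
loop_sketch <- function(func, x_list, args = list(), bind = "rbind", as_list = FALSE) {
  # call func once per name, passing the shared args plus the current name as x
  res <- lapply(x_list, function(x) do.call(func, c(args, list(x = x))))
  names(res) <- x_list
  if (as_list) res else do.call(bind, res)   # combine with rbind or cbind
}

# Example function: value range of one column.
col_range <- function(dat, x) {
  data.frame(variable = x, min = min(dat[[x]]), max = max(dat[[x]]))
}

toy <- data.frame(a = c(1, 5, 3), b = c(10, 2, 8))
ranges <- loop_sketch(col_range, x_list = c("a", "b"), args = list(dat = toy))
```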
love_color
returns colors selected by name or from a named palette.
love_color(color = NULL, type = "Blues", n = 10, ...)
color |
The name of colors. |
type |
The type of colors: "deep", or the name of a palette. The sequential palettes are Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd. The diverging palettes are BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral. The qualitative palettes are Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3. |
n |
Number of different colors, minimum is 1. |
... |
Other parameters. |
love_color(color="dark_cyan")
low_variance_filter
is for removing variables with repeated values up to a certain percentage.
low_variance_filter(
  dat, lvp = 0.97, only_NA = FALSE, note = FALSE, ex_cols = NULL
)
dat |
A data frame with x and target. |
lvp |
The maximum percent of unique values (including NAs). |
only_NA |
Logical, only process variables whose NA rate is more than lvp. |
note |
Logical. Outputs info. Default is FALSE. |
ex_cols |
A list of excluded variables. Default is NULL. |
A data.frame
dat = low_variance_filter(lendingclub[1:1000, ], lvp = 0.9)
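The filtering rule can be sketched in base R: compute, for each column, the share of its single most frequent value (counting NA as a value) and drop the column when that share exceeds `lvp`. The helper name `low_variance_sketch` is hypothetical, and whether the package compares with a strict or non-strict inequality is an assumption here.

```r
# Hypothetical helper (not a creditmodel function).
low_variance_sketch <- function(dat, lvp = 0.97) {
  top_share <- sapply(dat, function(v) {
    max(table(v, useNA = "ifany")) / length(v)   # share of the most frequent value
  })
  dat[, top_share <= lvp, drop = FALSE]          # keep columns with enough variety
}

toy <- data.frame(
  nearly_const = c(rep(1, 99), 2),       # 99% identical values
  varied       = 1:100,                  # all values distinct
  mostly_na    = c(rep(NA, 98), 1, 2)    # 98% missing
)
kept <- low_variance_sketch(toy, lvp = 0.97)
```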
lr_params
is the list of parameters used to train an LR model or scorecard in training_model
.
lr_params_search
searches for the optimal parameters of logistic regression when any parameter in lr_params
has more than one candidate value.
lr_params(
  tree_control = list(p = 0.02, cp = 1e-08, xval = 5, maxdepth = 10),
  bins_control = list(bins_num = 10, bins_pct = 0.05, b_chi = 0.02,
    b_odds = 0.1, b_psi = 0.03, b_or = 0.15, mono = 0.2, odds_psi = 0.15,
    kc = 1),
  f_eval = "ks", best_lambda = "lambda.ks", method = "random_search",
  iters = 10, lasso = TRUE, step_wise = TRUE, score_card = TRUE,
  sp_values = NULL, forced_in = NULL, obsweight = c(1, 1),
  thresholds = list(cor_p = 0.8, iv_i = 0.02, psi_i = 0.1, cos_i = 0.5),
  ...
)

lr_params_search(
  method = "random_search", dat_train, target, dat_test = NULL,
  occur_time = NULL, x_list = NULL, prop = 0.7, iters = 10,
  tree_control = list(p = 0.02, cp = 0, xval = 1, maxdepth = 10),
  bins_control = list(bins_num = 10, bins_pct = 0.02, b_chi = 0.02,
    b_odds = 0.1, b_psi = 0.05, b_or = 0.1, mono = 0.1, odds_psi = 0.03,
    kc = 1),
  thresholds = list(cor_p = 0.8, iv_i = 0.02, psi_i = 0.1, cos_i = 0.6),
  step_wise = FALSE, lasso = FALSE, f_eval = "ks"
)
tree_control |
the list of parameters to control cutting initial breaks by decision tree. See details at: |
bins_control |
the list of parameters to control merging initial breaks. See details at: |
f_eval |
Customized evaluation function; "ks" & "auc" are available. |
best_lambda |
Criterion for choosing the best lambda when filtering variables by LASSO. Three methods are available: "lambda.auc", "lambda.ks", "lambda.sim_sign". Default is "lambda.ks". |
method |
Method of searching optimal parameters. "random_search","grid_search","local_search" are available. |
iters |
Number of iterations of "random_search" optimal parameters. |
lasso |
Logical, if TRUE, variables filtering by LASSO. Default is TRUE. |
step_wise |
Logical, stepwise method. Default is TRUE. |
score_card |
Logical, transfer woe to a standard scorecard. If TRUE, Output scorecard, and score prediction, otherwise output probability. Default is TRUE. |
sp_values |
Values that will be placed in separate bins, e.g. list(-1, "missing") means that -1 & "missing" are treated as special values. Default is NULL. |
forced_in |
Names of forced input variables. Default is NULL. |
obsweight |
An optional vector of 'prior weights' to be used in the fitting process. Should be NULL or a numeric vector. If you oversample or cluster different datasets to train the LR model, set this parameter so that the probabilities output by the logistic regression match those before oversampling or segmentation. E.g., if there are 10,000 0 obs and 500 1 obs before resampling, and 5,000 0 obs and 3,000 1 obs after resampling, this parameter should be set to c(10000/5000, 500/3000). Default is c(1, 1). |
thresholds |
Thresholds for selecting variables.
|
... |
Other parameters |
dat_train |
data.frame of train data. Default is NULL. |
target |
name of target variable. |
dat_test |
data.frame of test data. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. Default is NULL. |
x_list |
names of independent variables. Default is NULL. |
prop |
Percentage of train-data after the partition. Default: 0.7. |
A list of parameters.
training_model, xgb_params, gbm_params, rf_params
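The arithmetic behind the obsweight example can be checked directly. The sketch below is only an illustration in base R; the variable names are hypothetical and nothing here calls the package itself.

```r
# Class counts before and after resampling (figures from the obsweight description).
n_before <- c(good = 10000, bad = 500)
n_after  <- c(good = 5000,  bad = 3000)

# Weight per class = original count / resampled count.
obsweight <- unname(n_before / n_after)
print(obsweight)

# Multiplying the resampled counts by the weights recovers the original class sizes.
weighted_totals <- n_after * obsweight
print(weighted_totals)
```

The weights simply rescale each class back to its original size, which is why the fitted probabilities stay comparable to the pre-resampling population.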
lr_vif
is for calculating Variance-Inflation Factors.
lr_vif(lr_model)
lr_model |
An object of logistic model. |
sub = cv_split(UCICreditCard, k = 30)[[1]]
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
dat = re_name(UCICreditCard[sub,], "default.payment.next.month", "target")
dat = dat[,c("target",x_list)]
dat = data_cleansing(dat, miss_values = list("", -1))
train_test = train_test_split(dat, prop = 0.7)
dat_train = train_test$train
dat_test = train_test$test
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = dat_train[, c("target", x_list)], family = binomial(logit))
lr_vif(lr_model)
get_logistic_coef(lr_model)
class(dat)
mod = lr_model
lr_vif(lr_model)
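For readers unfamiliar with variance inflation factors, the following base R sketch shows the classical linear-model definition, VIF_j = 1 / (1 - R_j^2). The helper `vif_sketch` is hypothetical and uses the built-in `mtcars` data; how `lr_vif` computes its values for a logistic model is not documented here and may differ.

```r
# Hypothetical helper (not part of creditmodel): regress each predictor on the
# others and turn the resulting R^2 into a variance inflation factor.
vif_sketch <- function(dat, x_list) {
  sapply(x_list, function(x) {
    others <- setdiff(x_list, x)
    r2 <- summary(lm(reformulate(others, response = x), data = dat))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_sketch(mtcars, c("wt", "hp", "disp", "qsec"))
print(round(vifs, 2))
```

Values close to 1 indicate little collinearity; larger values indicate that a predictor is well explained by the others.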
max_min_norm
is for normalizing each column vector of matrix 'x' using max_min normalization
max_min_norm(x)
x |
Vector |
Normalized vector
dat_s = apply(UCICreditCard[,12:14], 2, max_min_norm)
merge_category
is for merging the categories of nominal variables whose number of categories exceeds m, or in which the percentage of samples in any category is less than p.
merge_category(dat, char_list = NULL, ex_cols = NULL, m = 10, note = TRUE)
dat |
A data frame with x and target. |
char_list |
The list of character variables whose categories need to be merged. Default is NULL, in which case categories are merged for all variables of string type. |
ex_cols |
A list of excluded variables. Default is NULL. |
m |
The minimum number of categories. |
note |
Logical, outputs info. Default is TRUE. |
A data.frame with merged category variables.
#merge_category
dat = merge_category(lendingclub,ex_cols = "id$|_d$")
char_list = get_names(dat = dat,types = c('factor', 'character'), ex_cols = "id$|_d$", get_ex = FALSE)
str(dat[,char_list])
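The idea of merging sparse categories can be sketched in base R. The helper `merge_rare_levels` below is hypothetical and is not the package's implementation; it assumes one simple rule: keep the m most frequent levels and pool the rest into a single "other" level.

```r
# Hypothetical helper: keep the m most frequent levels, pool the rest.
merge_rare_levels <- function(x, m = 10, other = "other") {
  x <- as.character(x)
  freq <- sort(table(x), decreasing = TRUE)
  keep <- names(freq)[seq_len(min(m, length(freq)))]
  x[!is.na(x) & !(x %in% keep)] <- other
  x
}

x <- c(rep("a", 5), rep("b", 3), "c", "d", "e")
merged <- merge_rare_levels(x, m = 2)
print(table(merged))
```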
min_max_norm
is for normalizing each column vector of matrix 'x' using min_max normalization
min_max_norm(x)
x |
Vector |
Normalized vector
dat_s = apply(UCICreditCard[,12:14], 2, min_max_norm)
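The two normalizations can be illustrated with a small base R sketch. It assumes the usual definitions: min-max maps the minimum to 0 and the maximum to 1, and max-min is the reversed form. The helper names are hypothetical, and the package's own functions may handle edge cases such as constant vectors differently.

```r
# Hypothetical helpers illustrating the assumed formulas.
min_max_sketch <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])   # minimum -> 0, maximum -> 1
}
max_min_sketch <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (rng[2] - x) / (rng[2] - rng[1])   # maximum -> 0, minimum -> 1
}

x <- c(10, 20, 40)
print(min_max_sketch(x))
print(max_min_sketch(x))
```

Under these definitions the two results always sum to 1 element by element, so the choice depends only on whether larger raw values should map to larger or smaller normalized values.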
model_result_plot
is a wrapper of following:
perf_table
is for generating a model performance table.
ks_plot
is for K-S.
roc_plot
is for ROC.
lift_plot
is for Lift Chart.
score_distribution_plot
is for plotting the score distribution.
performance table
ks_plot
lift_plot
roc_plot
score_distribution_plot
model_result_plot(
  train_pred,
  score,
  target,
  test_pred = NULL,
  gtitle = NULL,
  perf_dir_path = NULL,
  save_data = FALSE,
  plot_show = TRUE,
  total = TRUE,
  g = 10,
  cut_bin = "equal_depth",
  digits = 4
)

perf_table(
  train_pred,
  test_pred = NULL,
  target = NULL,
  score = NULL,
  g = 10,
  cut_bin = "equal_depth",
  breaks = NULL,
  digits = 2,
  pos_flag = list("1", "1", "Bad", 1),
  total = FALSE,
  binsNO = FALSE
)

ks_plot(
  train_pred,
  test_pred = NULL,
  target = NULL,
  score = NULL,
  gtitle = NULL,
  breaks = NULL,
  g = 10,
  cut_bin = "equal_width",
  perf_tb = NULL
)

lift_plot(
  train_pred,
  test_pred = NULL,
  target = NULL,
  score = NULL,
  gtitle = NULL,
  breaks = NULL,
  g = 10,
  cut_bin = "equal_depth",
  perf_tb = NULL
)

roc_plot(
  train_pred,
  test_pred = NULL,
  target = NULL,
  score = NULL,
  gtitle = NULL
)

score_distribution_plot(
  train_pred,
  test_pred,
  target,
  score,
  gtitle = NULL,
  breaks = NULL,
  g = 10,
  cut_bin = "equal_depth",
  perf_tb = NULL
)
train_pred |
A data frame of training with predicted prob or score. |
score |
The name of prob or score variable. |
target |
The name of target variable. |
test_pred |
A data frame of validation data with predicted prob or score. |
gtitle |
The title of the graph & The name for periodically saved graphic file. |
perf_dir_path |
The path for periodically saved graphic files. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
plot_show |
Logical, show model performance in current graphic device. Default is TRUE. |
total |
Whether to summarize the table. default: TRUE. |
g |
Number of breaks for prob or score. |
cut_bin |
A string, if equal_bins is TRUE, 'equal_depth' or 'equal_width', default is 'equal_depth'. |
digits |
Number of digits for numeric values. The usage above defaults to 4 for model_result_plot and 2 for perf_table. |
breaks |
Splitting points of prob or score. |
pos_flag |
The value of positive class of target variable, default: "1". |
binsNO |
Bins number. Default is FALSE. |
perf_tb |
Performance table. |
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
dat = data_cleansing(dat, target = "target", obs_id = "ID",x_list = x_list,
  occur_time = "apply_date", miss_values = list("", -1))
dat = process_nas(dat,default_miss = TRUE)
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7, occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = dat_train[, c("target", x_list)], family = binomial(logit))
dat_train$pred_LR = round(predict(lr_model, dat_train[, x_list], type = "response"), 5)
dat_test$pred_LR = round(predict(lr_model, dat_test[, x_list], type = "response"), 5)
# model evaluation
perf_table(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
ks_plot(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
roc_plot(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
#lift_plot(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
#score_distribution_plot(train_pred = dat_train, test_pred = dat_test,
#target = "target", score = "pred_LR")
#model_result_plot(train_pred = dat_train, test_pred = dat_test,
#target = "target", score = "pred_LR")
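As background for the K-S output of these functions, the sketch below computes a K-S statistic by hand on simulated data. It is a simplified illustration: the helper `ks_sketch` and the simulated `score` and `target` vectors are hypothetical, ties are ignored, and no binning into g groups is applied, so the result need not match what perf_table or ks_plot report.

```r
set.seed(46)
n <- 1000
target <- rbinom(n, 1, 0.2)
# Simulated scores: positives tend to score higher than negatives.
score <- plogis(rnorm(n, mean = ifelse(target == 1, 0.5, -0.5)))

# Hypothetical helper: largest gap between the cumulative shares of
# positives and negatives when observations are sorted by score.
ks_sketch <- function(score, target) {
  ord <- order(score, decreasing = TRUE)
  y <- target[ord]
  cum_bad <- cumsum(y) / sum(y)
  cum_good <- cumsum(1 - y) / sum(1 - y)
  max(abs(cum_bad - cum_good))
}

ks <- ks_sketch(score, target)
print(ks)
```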
Plot multiple ggplot-objects as a grid-arranged single plot.
multi_grid(..., grobs = list(...), nrow = NULL, ncol = NULL)
... |
Other parameters. |
grobs |
A list of ggplot-objects to be arranged into the grid. |
nrow |
Number of rows in the plot grid. |
ncol |
Number of columns in the plot grid. |
This function takes a list of ggplot-objects as argument. Plotting functions of this package that produce multiple plot objects (e.g., when there is an argument facet.grid) usually return multiple plots as list.
An object of class gtable.
library(ggplot2)
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID",
  occur_time = "apply_date", miss_values = list("", -1))
dat = process_nas(dat)
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7, occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = dat_train[, c("target", x_list)], family = binomial(logit))
dat_train$pred_LR = round(predict(lr_model, dat_train[, x_list], type = "response"), 5)
dat_test$pred_LR = round(predict(lr_model, dat_test[, x_list], type = "response"), 5)
# model evaluation
p1 = ks_plot(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
p2 = roc_plot(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
p3 = lift_plot(train_pred = dat_train, test_pred = dat_test, target = "target", score = "pred_LR")
p4 = score_distribution_plot(train_pred = dat_train, test_pred = dat_test,
  target = "target", score = "pred_LR")
p_plots= multi_grid(p1,p2,p3,p4)
plot(p_plots)
multi_left_join
is for quickly left joining a list of datasets.
multi_left_join(..., df_list = list(...), key_dt = NULL, by = NULL)
... |
Datasets to join. |
df_list |
A list of datasets. |
key_dt |
Name or index of Key table to left join. |
by |
Name of Key columns to join. |
multi_left_join(UCICreditCard[1:10, 1:10], UCICreditCard[1:10, c(1,8:14)], UCICreditCard[1:10, c(1,20:25)], by = "ID")
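A chained left join over a list of tables can be expressed in base R with Reduce and merge. This is only a sketch of the concept on made-up data frames; it is not how multi_left_join is implemented and makes no claim about its speed.

```r
a <- data.frame(ID = 1:3, x = c(10, 20, 30))
b <- data.frame(ID = c(1, 3), y = c("p", "q"))
c_tbl <- data.frame(ID = c(2, 3, 4), z = c(TRUE, FALSE, TRUE))

# Left join each table in turn onto the first one, keeping all rows of the left side.
joined <- Reduce(
  function(left, right) merge(left, right, by = "ID", all.x = TRUE),
  list(a, b, c_tbl)
)
print(joined)
```

Rows of the first table are always kept; keys that are missing from a later table produce NA in that table's columns.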
Returns the number of "code points" in a string.
n_char(string)
string |
A string. |
A numeric vector giving the number of characters (code points) in each element of the character vector. Missing strings have missing length.
n_char(letters)
n_char(NA)
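For comparison, base R's nchar with type = "chars" also counts characters rather than bytes and returns NA for a missing string. The snippet below uses only base R; whether n_char behaves identically in every case is an assumption.

```r
# "h\u00e9llo" contains a non-ASCII character but is still five characters long.
x <- c("abc", "h\u00e9llo", NA_character_)
n_chars <- nchar(x, type = "chars")
print(n_chars)
```

Note that the character NA (`NA_character_`) is used here; in base R, `nchar(NA)` on a logical NA returns 2 instead.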
null_blank_na
is the function to replace null, NULL, blank or other missing values with NA.
null_blank_na(dat, miss_values = NULL, note = FALSE)
dat |
A data frame with x and target. |
miss_values |
Other extreme value might be used to represent missing values, e.g: -9999, -9998. These miss_values will be encoded to -1 or "missing". |
note |
Logical. Outputs info. The usage above defaults to FALSE. |
A data.frame
datss = null_blank_na(dat = UCICreditCard[1:1000, ], miss_values =list(-1,-2))
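The replacement rule can be sketched in base R as follows. The helper `to_na_sketch` and its token list are hypothetical; the actual set of strings that null_blank_na treats as missing is not listed in this page.

```r
# Hypothetical helper: turn blank / "NULL"-like strings and user-supplied
# sentinel values into NA, column by column.
to_na_sketch <- function(dat, miss_values = NULL) {
  tokens <- c("", "null", "NULL", "NA", as.character(unlist(miss_values)))
  dat[] <- lapply(dat, function(col) {
    col[trimws(as.character(col)) %in% tokens] <- NA
    col
  })
  dat
}

raw <- data.frame(a = c("x", "", "NULL"), b = c(1, -1, 3), stringsAsFactors = FALSE)
clean <- to_na_sketch(raw, miss_values = list(-1))
print(clean)
```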
one_hot_encoding
is for converting the factor or character variables into multiple columns
one_hot_encoding( dat, cat_vars = NULL, ex_cols = NULL, merge_cat = TRUE, na_act = TRUE, note = FALSE )
dat |
A data frame. |
cat_vars |
The name or Column index list to be one_hot encoded. |
ex_cols |
Variables to be excluded, use regular expression matching |
merge_cat |
Logical. If TRUE, to merge categories greater than 8, default is TRUE. |
na_act |
Logical. If TRUE, missing values are processed; if FALSE, missing values are omitted. |
note |
Logical. Outputs info. The usage above defaults to FALSE. |
A data frame with one-hot encoding applied to all variables of type factor or character.
dat1 = one_hot_encoding(dat = UCICreditCard, cat_vars = c("SEX", "MARRIAGE"),
  merge_cat = TRUE, na_act = TRUE)
dat2 = de_one_hot_encoding(dat_one_hot = dat1, cat_vars = c("SEX","MARRIAGE"), na_act = FALSE)
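To show what one-hot encoding produces, here is a base R sketch using model.matrix on a tiny made-up data frame. It illustrates the general technique only; the column naming and the NA handling of one_hot_encoding may differ.

```r
dat <- data.frame(SEX = factor(c("1", "2", "2")), AGE = c(24, 35, 41))

# "- 1" drops the intercept so that every level gets its own 0/1 column.
dummies <- model.matrix(~ SEX - 1, data = dat)
encoded <- cbind(dat["AGE"], as.data.frame(dummies))
print(encoded)
```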
outliers_detection
is for detecting outliers using Kmeans and Local Outlier Factor (LOF).
outliers_detection(dat, x, kc = 3, kn = 5)
dat |
A data.frame with independent variables. |
x |
The name of variable to process. |
kc |
Number of clustering centers for Kmeans |
kn |
Number of neighbors for LOF. |
Outliers of each variable.
This function is not intended to be used by end user.
p_ij(x)
e_ij(x)
x |
A numeric vector. |
A numeric vector of entropy.
p_to_score
is for transforming probability to score.
p_to_score(p, PDO = 20, base = 600, ratio = 1)
p |
Probability. |
PDO |
Points to Double the Odds. |
base |
Base Point. |
ratio |
The corresponding odds when the score is base. |
The score corresponding to the probability p.
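A common scorecard scaling that is consistent with the arguments above can be sketched as follows. The helper `score_sketch` is hypothetical: it assumes that a higher probability of the positive class lowers the score and that the score equals base when the odds equal ratio. The package's own rounding and sign conventions are not confirmed here.

```r
# Hypothetical helper: score = A - B * log(odds), with
#   B = PDO / log(2)           (points needed to double the odds)
#   A = base + B * log(ratio)  (so that odds == ratio gives score == base)
score_sketch <- function(p, PDO = 20, base = 600, ratio = 1) {
  B <- PDO / log(2)
  A <- base + B * log(ratio)
  A - B * log(p / (1 - p))
}

# Odds of 1, 2 and 4: each doubling of the odds moves the score by PDO points.
scores <- score_sketch(c(0.5, 2/3, 0.8))
print(round(scores, 2))
```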
partial_dependence_plot
is for generating a partial dependence plot.
get_partial_dependence_plots
is for plotting the partial dependence of all variables in x_list.
partial_dependence_plot(model, x, x_train, n.trees = NULL)

get_partial_dependence_plots(
  model,
  x_train,
  x_list,
  n.trees = NULL,
  dir_path = getwd(),
  save_data = TRUE,
  plot_show = FALSE,
  parallel = FALSE
)
model |
A fitted model object, for example a logistic regression fitted with glm as in the example below. |
x |
The name of an independent variable. |
x_train |
A data.frame with independent variables. |
n.trees |
Number of trees for best.iter of gbm. |
x_list |
Names of independent variables. |
dir_path |
The path for periodically saved graphic files. |
save_data |
Logical, save results in locally specified folder. The usage above defaults to TRUE. |
plot_show |
Logical, show model performance in current graphic device. Default is FALSE. |
parallel |
Logical, parallel computing. Default is FALSE. |
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID",
  occur_time = "apply_date", miss_values = list("", -1))
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7, occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = dat_train[, c("target", x_list)], family = binomial(logit))
#plot partial dependency of one variable
partial_dependence_plot(model = lr_model, x ="LIMIT_BAL", x_train = dat_train)
#plot partial dependency of all variables
pd_list = get_partial_dependence_plots(model = lr_model, x_list = x_list[1:2],
  x_train = dat_train, save_data = FALSE,plot_show = TRUE)
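The quantity shown in a partial dependence plot can be computed by hand: fix one variable at a grid value for every row, predict, and average. The base R sketch below does this for a glm on the built-in `mtcars` data; the helper `pd_sketch` and its `n_grid` argument are hypothetical and not part of the package.

```r
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial(logit))

# Hypothetical helper: average prediction with variable x forced to each grid value.
pd_sketch <- function(model, x, x_train, n_grid = 10) {
  grid <- seq(min(x_train[[x]]), max(x_train[[x]]), length.out = n_grid)
  avg_pred <- sapply(grid, function(v) {
    tmp <- x_train
    tmp[[x]] <- v   # overwrite x for all rows, keep the other columns as observed
    mean(predict(model, newdata = tmp, type = "response"))
  })
  data.frame(value = grid, avg_pred = avg_pred)
}

pd <- pd_sketch(fit, "hp", mtcars)
print(pd)
```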
PCA_reduce
is used for PCA reduction of high-dimensional data.
PCA_reduce(train = train, test = NULL, mc = 0.9)
train |
A data.frame with independent variables and target variable. |
test |
A data.frame of test data. |
mc |
Threshold of cumulative importance. |
## Not run:
num_x_list = get_names(dat = UCICreditCard, types = c('numeric'),
  ex_cols = "ID$|date$|default.payment.next.month$", get_ex = FALSE)
PCA_dat = PCA_reduce(train = UCICreditCard[num_x_list])
## End(Not run)
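The role of a cumulative threshold such as mc can be illustrated with base R's prcomp. The sketch assumes that "cumulative importance" means the cumulative proportion of variance explained; this interpretation, and the use of `mtcars`, are assumptions rather than documented behaviour of PCA_reduce.

```r
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the leading components.
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

mc <- 0.9
k <- which(cum_var >= mc)[1]               # smallest number of components reaching mc
reduced <- pca$x[, seq_len(k), drop = FALSE]
print(k)
print(dim(reduced))
```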
You can use the plot_colors
to show colors on the graph device.
plot_colors(colors)
color_ramp_palette(colors)
colors |
A vector of colors. |
plot_colors(rgb(158,122,122, maxColorValue = 255 ))
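Base R offers grDevices::colorRampPalette for interpolating between colors, which may help when building a vector of colors to inspect. Whether color_ramp_palette wraps this function is an assumption; the sketch below uses base R only.

```r
# Interpolate five shades between the example color (158, 122, 122) and white.
ramp <- grDevices::colorRampPalette(c("#9E7A7A", "white"))
shades <- ramp(5)
print(shades)
```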
plot_oot_perf
is for plotting the performance of cross-time samples in the future.
plot_oot_perf( dat_test, x, occur_time, target, k = 3, g = 10, period = "month", best = FALSE, equal_bins = TRUE, pl = "rate", breaks = NULL, cut_bin = "equal_depth", gtitle = NULL, perf_dir_path = NULL, save_data = FALSE, plot_show = TRUE )
dat_test |
A data frame of testing dataset with predicted prob or score. |
x |
The name of prob or score variable. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
target |
The name of target variable. |
k |
If period is NULL, number of equal frequency samples. |
g |
Number of breaks for prob or score. |
period |
OOT period, 'weekly' and 'month' are available. If NULL, use k equal frequency samples. |
best |
Logical, merge initial breaks to get optimal breaks for binning. |
equal_bins |
Logical, generates initial breaks for equal frequency or width binning. |
pl |
'lift' is for lift chart plot,'rate' is for positive rate plot. |
breaks |
Splitting points of prob or score. |
cut_bin |
A string, if equal_bins is TRUE, 'equal_depth' or 'equal_width', default is 'equal_depth'. |
gtitle |
The title of the graph & The name for periodically saved graphic file. |
perf_dir_path |
The path for periodically saved graphic files. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
plot_show |
Logical, show model performance in current graphic device. Default is TRUE. |
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "PAY_3", "PAY_2")
dat = data_cleansing(dat, target = "target", obs_id = "ID",x_list = x_list,
  occur_time = "apply_date", miss_values = list("", -1))
dat = process_nas(dat)
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7, occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = dat_train[, c("target", x_list)], family = binomial(logit))
dat_train$pred_LR = round(predict(lr_model, dat_train[, x_list], type = "response"), 5)
dat_test$pred_LR = round(predict(lr_model, dat_test[, x_list], type = "response"), 5)
plot_oot_perf(dat_test = dat_test, occur_time = "apply_date", target = "target", x = "pred_LR")
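The difference between the 'equal_depth' and 'equal_width' options of cut_bin can be seen on a skewed variable. The following base R sketch uses simulated scores and is only an illustration of the two binning ideas; the package's own break computation may differ in detail.

```r
set.seed(46)
score <- rexp(1000)   # skewed, continuous "scores"
g <- 10

# Equal width: g intervals of the same length between min and max.
width_breaks <- seq(min(score), max(score), length.out = g + 1)
# Equal depth: breaks at quantiles, so each bin holds about the same number of rows.
depth_breaks <- quantile(score, probs = seq(0, 1, length.out = g + 1), names = FALSE)

width_counts <- table(cut(score, width_breaks, include.lowest = TRUE))
depth_counts <- table(cut(score, depth_breaks, include.lowest = TRUE))
print(width_counts)
print(depth_counts)
```

On skewed data, equal-width bins concentrate most observations in the first few bins, while equal-depth bins keep the counts balanced.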
plot_table
is for table visualization.
plot_table( grid_table, theme = c("cyan", "grey", "green", "red", "blue", "purple"), title = NULL, title.size = 12, title.color = "black", title.face = "bold", title.position = "middle", subtitle = NULL, subtitle.size = 8, subtitle.color = "black", subtitle.face = "plain", subtitle.position = "middle", tile.color = "white", tile.size = 1, colname.size = 3, colname.color = "white", colname.face = "bold", colname.fill.color = love_color("dark_cyan"), text.size = 3, text.color = love_color("dark_grey"), text.face = "plain", text.fill.color = c("white", love_color("pale_grey")) )
grid_table |
A data.frame or table |
theme |
The theme of color, "cyan","grey","green","red","blue","purple" are available. |
title |
The title of table |
title.size |
The title size of plot. |
title.color |
The title color. |
title.face |
The title face, such as "plain", "bold". |
title.position |
The title position,such as "left","middle","right". |
subtitle |
The subtitle of table |
subtitle.size |
The subtitle size. |
subtitle.color |
The subtitle color. |
subtitle.face |
The subtitle face, such as "plain", "bold". The usage above defaults to "plain". |
subtitle.position |
The subtitle position,such as "left","middle","right", default is "middle". |
tile.color |
The color of table lines, default is 'white'. |
tile.size |
The size of table lines , default is 1. |
colname.size |
The size of colnames, default is 3. |
colname.color |
The color of colnames, default is 'white'. |
colname.face |
The face of colnames,default is 'bold'. |
colname.fill.color |
The fill color of colnames, default is love_color("dark_cyan"). |
text.size |
The size of text, default is 3. |
text.color |
The color of text, default is love_color("dark_grey"). |
text.face |
The face of text, default is 'plain'. |
text.fill.color |
The fill color of text, default is c('white', love_color("pale_grey")). |
iv_list = get_psi_iv_all(dat = UCICreditCard[1:1000, ],
  x_list = names(UCICreditCard)[3:5], equal_bins = TRUE,
  target = "default.payment.next.month", ex_cols = "ID|apply_date")
iv_dt = get_psi_iv(UCICreditCard, x = "PAY_3",
  target = "default.payment.next.month", bins_total = TRUE)
plot_table(iv_dt)
plot_theme
is a simpler wrapper of theme for ggplot2.
plot_theme( legend.position = "top", angle = 30, legend_size = 7, axis_size_y = 8, axis_size_x = 8, axis_title_size = 10, title_size = 11, title_vjust = 0, title_hjust = 0, linetype = "dotted", face = "bold" )
legend.position |
see details at: legend.position |
angle |
see details at: axis.text.x |
legend_size |
see details at: legend.text |
axis_size_y |
see details at: axis.text.y |
axis_size_x |
see details at: axis.text.x |
axis_title_size |
see details at: axis.title.x |
title_size |
see details at: plot.title |
title_vjust |
see details at: plot.title |
title_hjust |
see details at: plot.title |
linetype |
see details at: panel.grid.major |
face |
see details at: axis.title.x |
see details at: theme
pred_score
is for using a logistic regression model to predict new data.
pred_score( model, dat, x_list = NULL, bins_table = NULL, obs_id = NULL, miss_values = list(-1, "-1", "NULL", "-1", "-9999", "-9996", "-9997", "-9995", "-9998", -9999, -9998, -9997, -9996, -9995), woe_name = FALSE )
model |
Logistic Regression Model generated by |
dat |
Dataframe of new data. |
x_list |
Names of the variables that enter the model. |
bins_table |
a data.frame generated by |
obs_id |
The name of ID of observations or key variable of data. Default is NULL. |
miss_values |
Special values. |
woe_name |
Logical. Whether the names of the woe variables contain 'woe'. Default is FALSE. |
new scores.
training_model, lr_params, xgb_params, rf_params
process_nas_var
is for missing value analysis and treatment using knn imputation, central imputation and random imputation.
process_nas
is a simpler wrapper for process_nas_var.
process_nas(
  dat,
  x_list = NULL,
  class_var = FALSE,
  miss_values = list(-1, "missing"),
  default_miss = list(-1, "missing"),
  parallel = FALSE,
  ex_cols = NULL,
  method = "median",
  note = FALSE,
  save_data = FALSE,
  file_name = NULL,
  dir_path = tempdir(),
  ...
)

process_nas_var(
  dat = dat,
  x,
  missing_type = NULL,
  method = "median",
  nas_rate = NULL,
  default_miss = list("missing", -1),
  mat_nas_shadow = NULL,
  dt_nas_random = NULL,
  note = FALSE,
  save_data = FALSE,
  file_name = NULL,
  dir_path = tempdir(),
  ...
)
dat |
A data.frame with independent variables. |
x_list |
Names of independent variables. |
class_var |
Logical. Whether to run the NA analysis on nominal variables. The usage above defaults to FALSE. |
miss_values |
Other extreme value might be used to represent missing values, e.g:-1, -9999, -9998. These miss_values will be encoded to NA. |
default_miss |
Default value for missing data imputation. Default is list(-1, 'missing'). |
parallel |
Logical, parallel computing. Default is FALSE. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
method |
The imputation method. If "median", NAs are imputed with the median of the k nearest neighbors. If "avg_dist", NAs are imputed with the distance-weighted average of the k nearest neighbors. If "default", missing values are set to -1 or "missing". Otherwise, missing values are processed according to the results of the missing-value analysis. |
note |
Logical, outputs info. The usage above defaults to FALSE. |
save_data |
Logical. If TRUE, save missing analysis to |
file_name |
The file name for periodically saved missing analysis file. Default is NULL. |
dir_path |
The path for periodically saved missing analysis file. Default is "./variable". |
... |
Other parameters. |
x |
The name of variable to process. |
missing_type |
Type of missing, generated by analysis_nas. |
nas_rate |
A list containing the NA rate of each variable. |
mat_nas_shadow |
A shadow matrix of the variables that contain NAs. |
dt_nas_random |
A data.frame with random nas imputation. |
A data frame with no NAs.
dat_na = process_nas(dat = UCICreditCard[1:1000,], parallel = FALSE,ex_cols = "ID$", method = "median")
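The two steps described by the miss_values and default_miss arguments (recode sentinel values to NA, then impute) can be sketched in base R. This is only an illustration under simplifying assumptions: it uses a plain column median rather than the kNN-based imputation that process_nas describes, and the toy data and the helper names recode_sentinels and impute_simple are hypothetical.

```r
# Hypothetical toy data: -1 and "missing" stand in for missing values.
dat <- data.frame(
  income = c(3200, -1, 4100, NA, 2800),
  grade  = c("A", "missing", "B", "A", NA),
  stringsAsFactors = FALSE
)

# Step 1: recode sentinel values to NA (hypothetical helper).
recode_sentinels <- function(x, miss_values = list(-1, "missing")) {
  for (m in miss_values) {
    x[!is.na(x) & x == m] <- NA
  }
  x
}

# Step 2: impute. Numeric columns get the column median (a simplification,
# not the kNN median); other columns get the default label.
impute_simple <- function(x, default_label = "missing") {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else {
    x[is.na(x)] <- default_label
  }
  x
}

dat_clean <- as.data.frame(
  lapply(dat, function(col) impute_simple(recode_sentinels(col))),
  stringsAsFactors = FALSE
)
```

Recoding before imputing matters here: otherwise the sentinel -1 would be treated as a genuine value and pull the median down.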
outliers_kmeans_lof is for outlier detection and treatment using K-means and Local Outlier Factor (LOF).
process_outliers is a simpler wrapper for outliers_kmeans_lof.
process_outliers(dat, target, ex_cols = NULL, kc = 3, kn = 5, x_list = NULL, parallel = FALSE, note = FALSE, process = TRUE, save_data = FALSE, file_name = NULL, dir_path = tempdir())
outliers_kmeans_lof(dat, x, target = NULL, kc = 3, kn = 5, note = FALSE, process = TRUE, save_data = FALSE, file_name = NULL, dir_path = tempdir())
dat |
Dataset with independent variables and target variable. |
target |
The name of target variable. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
kc |
Number of clustering centers for Kmeans |
kn |
Number of neighbors for LOF. |
x_list |
Names of independent variables. |
parallel |
Logical, parallel computing. |
note |
Logical, outputs info. Default is FALSE. |
process |
Logical, process outliers, not just analysis. |
save_data |
Logical. If TRUE, save outliers analysis file to the specified folder at |
file_name |
The file name for periodically saved outliers analysis file. Default is NULL. |
dir_path |
The path for periodically saved outliers analysis file. Default is tempdir(). |
x |
The name of variable to process. |
A data frame with outliers processed for all the variables.
dat_out = process_outliers(UCICreditCard[1:10000,c(18:21,26)], target = "default.payment.next.month", ex_cols = "date$", kc = 3, kn = 10, parallel = FALSE,note = TRUE)
psi_iv_filter
is for selecting important and stable features using IV & PSI.
psi_iv_filter( dat, dat_test = NULL, target, x_list = NULL, breaks_list = NULL, pos_flag = NULL, ex_cols = NULL, occur_time = NULL, best = FALSE, equal_bins = TRUE, g = 10, sp_values = NULL, tree_control = list(p = 0.05, cp = 1e-06, xval = 5, maxdepth = 10), bins_control = list(bins_num = 10, bins_pct = 0.05, b_chi = 0.05, b_odds = 0.1, b_psi = 0.05, b_or = 0.15, mono = 0.3, odds_psi = 0.2, kc = 1), oot_pct = 0.7, psi_i = 0.1, iv_i = 0.01, cos_i = 0.7, vars_name = FALSE, note = TRUE, parallel = FALSE, save_data = FALSE, file_name = NULL, dir_path = tempdir(), ... )
dat |
A data.frame with independent variables and target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
target |
The name of target variable. |
x_list |
Names of independent variables. |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
best |
Logical, if TRUE, merge initial breaks to get optimal breaks for binning. |
equal_bins |
Logical, if TRUE, initial breaks are generated with equal sample size. If FALSE, tree breaks are generated using a decision tree. |
g |
Integer, number of initial bins for equal_bins. |
sp_values |
A list of missing values. |
tree_control |
The list of tree parameters. |
bins_control |
The list of binning parameters. |
oot_pct |
Percentage of observations retained for the out-of-time test (especially to calculate PSI). Default is 0.7. |
psi_i |
The maximum threshold of PSI. 0 <= psi_i <=1; 0.05 to 0.2 usually work. Default: 0.1 |
iv_i |
The minimum threshold of IV. 0 < iv_i ; 0.01 to 0.1 usually work. Default: 0.01 |
cos_i |
Cosine similarity of the positive rate of train and test. 0.7 to 0.9 usually work. Default: 0.7. |
vars_name |
Logical, output a list of filtered variables or table with detailed IV and PSI value of each variable. Default is FALSE. |
note |
Logical, outputs info. Default is TRUE. |
parallel |
Logical, parallel computing. Default is FALSE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved results files. Default is "Feature_importance_IV_PSI". |
dir_path |
The path for periodically saved results files. Default is tempdir(). |
... |
Other parameters. |
A list with the following elements:
Feature
Selected variables.
IV
IV of variables.
PSI
PSI of variables.
COS
Cosine similarity of the positive rate of train and test.
xgb_filter, gbm_filter, feature_selector
psi_iv_filter(dat= UCICreditCard[1:1000,c(2,4,8:9,26)], target = "default.payment.next.month", occur_time = "apply_date", parallel = FALSE)
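The roles of the psi_i and iv_i thresholds can be illustrated with the textbook PSI and IV formulas in base R. This is a hedged sketch, not the package's internal code: the bin counts are made up, the helper names psi and iv are hypothetical, and bins with zero counts are not handled.

```r
# Hypothetical bin counts for one binned variable (3 bins).
train_counts <- c(400, 350, 250)  # expected distribution (earlier window)
test_counts  <- c(380, 340, 280)  # actual distribution (later window)

# Population Stability Index: sum over bins of (a - e) * log(a / e).
psi <- function(expected, actual) {
  e <- expected / sum(expected)
  a <- actual / sum(actual)
  sum((a - e) * log(a / e))
}

# Hypothetical counts of negative (good) and positive (bad) cases per bin.
good <- c(380, 300, 180)
bad  <- c(20, 50, 70)

# Information Value: sum over bins of (b - g) * log(b / g).
iv <- function(good, bad) {
  g <- good / sum(good)
  b <- bad / sum(bad)
  sum((b - g) * log(b / g))
}

psi_value <- psi(train_counts, test_counts)
iv_value  <- iv(good, bad)

# Keep a variable when it is predictive enough and stable enough,
# using the default thresholds from the usage above.
keep <- iv_value >= 0.01 && psi_value <= 0.1
```

Each term of both sums is non-negative, so PSI and IV are zero only when the two distributions being compared are identical.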
quick_as_df
is a function for fast data frame transformation.
quick_as_df(df_list)
df_list |
A list of data. |
A data.frame.
UCICreditCard = quick_as_df(UCICreditCard)
ranking_percent_proc
is for processing ranking percent variables.
ranking_percent_dict
is for generating ranking percent dictionary.
ranking_percent_proc(dat, ex_cols = NULL, x_list = NULL, rank_dict = NULL, pct = 0.01, parallel = FALSE, note = FALSE, save_data = FALSE, file_name = NULL, dir_path = tempdir(), ...)
ranking_percent_proc_x(dat, x, rank_dict = NULL, pct = 0.01)
ranking_percent_dict(dat, x_list = NULL, ex_cols = NULL, pct = 0.01, parallel = FALSE, save_data = FALSE, file_name = NULL, dir_path = tempdir(), ...)
ranking_percent_dict_x(dat, x = NULL, pct = 0.01)
dat |
A data.frame. |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
x_list |
A list of x variables. |
rank_dict |
The dictionary of rank_percent generated by |
pct |
Percent of rank. Default is 0.01. |
parallel |
Logical, parallel computing. Default is FALSE. |
note |
Logical, outputs info. Default is FALSE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved rank_percent data file. Default is "dat_rank_percent". |
dir_path |
The path for periodically saved rank_percent data file. Default is tempdir(). |
... |
Additional parameters. |
x |
The name of an independent variable. |
Data.frame with new processed variables.
rank_dict = ranking_percent_dict(dat = UCICreditCard[1:1000,], x_list = c("LIMIT_BAL", "BILL_AMT2", "PAY_AMT3"), ex_cols = NULL)
UCICreditCard_new = ranking_percent_proc(dat = UCICreditCard[1:1000,], x_list = c("LIMIT_BAL", "BILL_AMT2", "PAY_AMT3"), rank_dict = rank_dict, parallel = FALSE)
re_code
recodes the values of a variable using a lookup table of original and recoded values.
re_code(x, codes)
x |
Variable to recode. |
codes |
A data.frame of original value & recode value |
SEX = sample(c("F", "M"), 1000, replace = TRUE)
codes = data.frame(ori_value = c('F', 'M'), code = c(0, 1))
SEX_re = re_code(SEX, codes)
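The lookup-table idea behind the codes argument can be sketched in base R with match(). This is an illustration only and makes no claim about how re_code is implemented; in particular, the handling of values that are absent from the table (NA below) is an assumption of this sketch.

```r
# Lookup table in the same shape as the codes argument above.
codes <- data.frame(ori_value = c("F", "M"), code = c(0, 1))

# Hypothetical input; "U" does not appear in the lookup table.
x <- c("F", "M", "M", "F", "U")

# match() returns the row index of each value in the table (NA if absent),
# which is then used to pick the corresponding code.
recoded <- codes$code[match(x, codes$ori_value)]
```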
re_name
is for renaming variables.
re_name(dat, oldname = c(), newname = c())
dat |
A data frame with variables to rename. |
oldname |
Old names of variables. |
newname |
New names of variables. |
Data with new variable names.
dt = re_name(dat = UCICreditCard, "default.payment.next.month", "target")
names(dt['target'])
read_data
is for loading data in formats such as csv, txt, data and so on.
read_data(path, pattern = NULL, encoding = "unknown", header = TRUE, sep = "auto", stringsAsFactors = FALSE, select = NULL, drop = NULL, nrows = Inf)
check_data_format(path)
path |
Path to the file, or a file name in the working directory. |
pattern |
An optional regular expression. Only file names which match the regular expression will be returned. |
encoding |
Default is "unknown". Other possible options are "UTF-8" and "Latin-1". |
header |
Does the first data line contain column names? |
sep |
The separator between columns. |
stringsAsFactors |
Logical. Convert all character columns to factors? |
select |
A vector of column names or numbers to keep, drop the rest. |
drop |
A vector of column names or numbers to drop, keep the rest. |
nrows |
The maximum number of rows to read. |
reduce_high_cor_filter
is a function for filtering highly correlated variables with the reduce method.
reduce_high_cor_filter( dat, x_list = NULL, size = ncol(dat)/10, p = 0.95, com_list = NULL, ex_cols = NULL, cor_class = TRUE, parallel = FALSE )
dat |
A data.frame with independent variables. |
x_list |
Names of independent variables. |
size |
Size of variable group. |
p |
Threshold of correlation between features. Default is 0.95. |
com_list |
A data.frame with importance values of each variable, e.g. IV_list. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
cor_class |
Calculate the correlation matrix of categorical variables. Default is TRUE. |
parallel |
Logical, parallel computing. Default is FALSE. |
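The interplay of the p threshold and an importance ranking such as com_list can be sketched as a greedy filter in base R. This is a simplified illustration, not the package's grouped "reduce" procedure (the size argument is ignored here); the simulated data, the importance values and the helper name drop_high_cor are hypothetical.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)

# x2 is nearly a copy of x1; x3 is independent noise.
dat <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.05),
  x3 = rnorm(n)
)

# Hypothetical importance values (for example IV), one per variable.
importance <- c(x1 = 0.30, x2 = 0.10, x3 = 0.20)

# Walk through variables from most to least important and keep a variable
# only if its absolute correlation with every kept variable is below p.
drop_high_cor <- function(dat, importance, p = 0.95) {
  cor_mat <- abs(cor(dat, use = "pairwise.complete.obs"))
  ordered <- names(sort(importance, decreasing = TRUE))
  keep <- character(0)
  for (v in ordered) {
    if (length(keep) == 0 || all(cor_mat[v, keep] < p)) {
      keep <- c(keep, v)
    }
  }
  keep
}

kept <- drop_high_cor(dat, importance, p = 0.95)
```

Ordering by importance first means that, within a highly correlated pair, the more informative variable is the one that survives.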
remove_duplicated
is the function to remove duplicated observations
remove_duplicated( dat = dat, obs_id = NULL, occur_time = NULL, target = NULL, note = FALSE )
dat |
A data frame with x and target. |
obs_id |
The name of ID of observations. Default is NULL. |
occur_time |
The name of occur time of observations. Default is NULL. |
target |
The name of target variable. |
note |
Logical, outputs info. Default is FALSE. |
A data.frame
datss = remove_duplicated(dat = UCICreditCard, target = "default.payment.next.month", obs_id = "ID", occur_time = "apply_date")
replace_value
is for replacing values of some variables.
replace_value_x
is for replacing values of a variable.
replace_value(dat = dat, x_list = NULL, x_pattern = NULL, replace_dat, MARGIN = 2, VALUE = if (MARGIN == 2) colnames(replace_dat) else rownames(replace_dat), RE_NAME = TRUE, parallel = FALSE)
replace_value_x(dat, x, replace_dat, MARGIN = 2, VALUE = if (MARGIN == 2) colnames(replace_dat) else rownames(replace_dat), RE_NAME = TRUE)
dat |
A data.frame. |
x_list |
Names of variables to replace value. |
x_pattern |
Regular expressions, used to match variable names. |
replace_dat |
A data.frame contains value to replace. |
MARGIN |
A vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names. |
VALUE |
Values to replace. |
RE_NAME |
Logical, rename the replaced variable. |
parallel |
Logical, parallel computing. Default is FALSE. |
x |
Name of variable to replace value. |
require_packages
is a function for loading required packages and installing missing packages if needed.
require_packages(..., pkg = as.character(substitute(list(...))))
... |
Packages to be loaded. |
pkg |
A list or vector of names of required packages. |
The required packages are installed if missing and then loaded.
## Not run:
require_packages(data.table, ggplot2, dplyr)
## End(Not run)
rf_params is the list of parameters used to train a Random Forest in training_model.
rf_params(ntree = 100, nodesize = 30, samp_rate = 0.5, tune_rf = FALSE, ...)
ntree |
Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times. |
nodesize |
Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time). Note that the default values in randomForest differ for classification (1) and regression (5); the default here is 30. |
samp_rate |
Percentage of sample to draw. Default is 0.5. |
tune_rf |
Logical. If TRUE, tune the Random Forest model. Default is FALSE. |
... |
Other parameters |
See details at : https://www.stat.berkeley.edu/~breiman/Using_random_forests_V3.1.pdf
A list of parameters.
training_model, lr_params, gbm_params, xgb_params
Functions for vector operation.
rowAny(x)
rowAllnas(x)
colAllnas(x)
colAllzeros(x)
rowAll(x)
rowCVs(x, na.rm = FALSE)
rowSds(x, na.rm = FALSE)
colSds(x, na.rm = TRUE)
rowMaxs(x, na.rm = FALSE)
rowMins(x, na.rm = FALSE)
rowMaxMins(x, na.rm = FALSE)
colMaxMins(x, na.rm = FALSE)
cnt_x(x)
sum_x(x)
max_x(x)
min_x(x)
avg_x(x)
x |
A data.frame or Matrix. |
na.rm |
Logical, remove NAs. |
A data.frame or Matrix.
# any row has missing values
row_any = rowAny(UCICreditCard[8:10])
# rows which are all missing values
row_na = rowAllnas(UCICreditCard[8:10])
# columns which are all missing values
col_na = colAllnas(UCICreditCard[8:10])
# columns which are all zeros
col_zero = colAllzeros(UCICreditCard[8:10])
# sum all numbers of a row
row_all = rowAll(UCICreditCard[8:10])
# calculate cv of a row
row_cv = rowCVs(UCICreditCard[8:10])
# calculate sd of a row
row_sd = rowSds(UCICreditCard[8:10])
# calculate sd of a column
col_sd = colSds(UCICreditCard[8:10])
save_data
is for saving a data.frame or a list fast.
save_data( ..., files = list(...), file_name = as.character(substitute(list(...))), dir_path = getwd(), note = FALSE, as_list = FALSE, row_names = FALSE, append = FALSE )
... |
datasets |
files |
A dataset or a list of datasets. |
file_name |
The file name of data. |
dir_path |
A string. The directory path in which to save the data. |
note |
Logical, outputs info. Default is FALSE. |
as_list |
Logical. List format or data.frame format to save. Default is FALSE. |
row_names |
Logical, retain row names. |
append |
Logical, append newdata to old. |
save_data(UCICreditCard,"UCICreditCard", tempdir())
score_transfer
is for transforming WOE values to scores.
score_transfer( model, tbl_woe, a = 600, b = 50, file_name = NULL, dir_path = tempdir(), save_data = FALSE )
model |
A fitted logistic regression model, e.g. generated by glm. |
tbl_woe |
a data.frame with woe variables. |
a |
Base line of score. |
b |
Numeric. Increase in score from doubling the odds. |
file_name |
The name for periodically saved score file. Default is "dat_score". |
dir_path |
The path for periodically saved score file. Default is tempdir(). |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
A data.frame with variables whose values are transformed to scores.
# dataset splitting
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
# rename the target variable
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID", occur_time = "apply_date", miss_values = list("", -1))
# train-test splitting
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7, occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
# get breaks of all predictive variables
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION", "PAY_3", "PAY_2")
breaks_list = get_breaks_all(dat = dat_train, target = "target", x_list = x_list, occur_time = "apply_date", ex_cols = "ID", save_data = FALSE, note = FALSE)
# woe transforming
train_woe = woe_trans_all(dat = dat_train, target = "target", breaks_list = breaks_list, woe_name = FALSE)
test_woe = woe_trans_all(dat = dat_test, target = "target", breaks_list = breaks_list, note = FALSE)
Formula = as.formula(paste("target", paste(x_list, collapse = ' + '), sep = ' ~ '))
set.seed(46)
lr_model = glm(Formula, data = train_woe[, c("target", x_list)], family = binomial(logit))
# get LR coefficients
dt_imp_LR = get_logistic_coef(lg_model = lr_model, save_data = FALSE)
bins_table = get_bins_table_all(dat = dat_train, target = "target", x_list = x_list, dat_test = dat_test, breaks_list = breaks_list, note = FALSE)
# score card
LR_score_card = get_score_card(lg_model = lr_model, bins_table, target = "target")
# scoring
train_pred = dat_train[, c("ID", "apply_date", "target")]
test_pred = dat_test[, c("ID", "apply_date", "target")]
train_pred$pred_LR = score_transfer(model = lr_model, tbl_woe = train_woe, save_data = FALSE)[, "score"]
test_pred$pred_LR = score_transfer(model = lr_model, tbl_woe = test_woe, save_data = FALSE)[, "score"]
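The meaning of the a and b arguments (a baseline score and the score change from doubling the odds) matches the common "points to double the odds" scaling, which can be sketched in base R as follows. This is an assumption-laden illustration rather than the formula score_transfer actually uses: the sign convention (higher odds of the positive class give a lower score), the baseline at odds of 1, the coefficients and the helper name scale_score are all hypothetical.

```r
# Scale a log-odds value to a score: a is the score at odds 1:1 and
# b is the number of points lost each time the odds of default double.
scale_score <- function(log_odds, a = 600, b = 50) {
  factor <- b / log(2)
  a - factor * log_odds
}

# Hypothetical logistic regression on two WOE-transformed variables.
intercept <- -1.2
coefs     <- c(PAY_0 = 0.9, LIMIT_BAL = 0.6)
woe_row   <- c(PAY_0 = 0.5, LIMIT_BAL = -0.3)

# Linear predictor (log-odds of default) for one observation.
log_odds <- intercept + sum(coefs * woe_row)
score    <- scale_score(log_odds)
```

With this convention, an observation whose odds of default are twice those of another receives a score exactly b points lower.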
select_best_class & select_best_breaks are for merging initial breaks of variables using chi-square, odds ratio, PSI, G/B index and so on.
get_breaks is a simpler wrapper for select_best_class & select_best_breaks.
select_best_class(dat, x, target, breaks = NULL, occur_time = NULL, oot_pct = 0.7, pos_flag = NULL, bins_control = NULL, sp_values = NULL, ...)
select_best_breaks(dat, x, target, breaks = NULL, pos_flag = NULL, sp_values = NULL, occur_time = NULL, oot_pct = 0.7, bins_control = NULL, ...)
dat |
A data frame with x and target. |
x |
The name of variable to process. |
target |
The name of target variable. |
breaks |
Splitting points for an independent variable. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
oot_pct |
The percentage of Actual and Expected set for PSI calculating. |
pos_flag |
The value of positive class of target variable, default: "1". |
bins_control |
The list of binning parameters. |
sp_values |
A list of special values. |
... |
Other parameters. |
The following is the list of reference principles:
1. The increasing or decreasing trend of variables is consistent with the actual business experience. (The percentage of non-monotonic intervals that are not at the head or tail is less than 0.35.)
2. Maximum 10 intervals for a single variable.
3. Each interval should cover more than 2
4. Each interval needs at least 30 or 1
5. Combine blank, missing or other special values into the same interval, called missing.
6. The difference of Chi effect size between intervals should be at least 0.02.
7. The difference of absolute odds ratio between intervals should be at least 0.1.
8. The difference of positive rate between intervals should be at least 1/10 of the total positive rate.
9. The difference of G/B index between intervals should be at least 15.
10. The PSI of each interval should be less than 0.1.
A list of breaks for x.
get_tree_breaks, cut_equal, get_breaks
# equal sample size breaks
equ_breaks = cut_equal(dat = UCICreditCard[, "PAY_AMT2"], g = 10)
# select best bins
bins_control = list(bins_num = 10, bins_pct = 0.02, b_chi = 0.02, b_odds = 0.1, b_psi = 0.05, b_or = 0.15, mono = 0.3, odds_psi = 0.1, kc = 1)
select_best_breaks(dat = UCICreditCard, x = "PAY_AMT2", breaks = equ_breaks, target = "default.payment.next.month", occur_time = "apply_date", sp_values = NULL, bins_control = bins_control)
This function is not intended to be used by end user.
sim_str(a, b, sep = "_|[.]|[A-Z]")
a |
A string |
b |
A string |
sep |
Separator of strings. Default is "_|[.]|[A-Z]". |
split_bins
is for binning using breaks.
split_bins( dat, x, breaks = NULL, bins_no = TRUE, as_factor = FALSE, labels = NULL, use_NA = TRUE, char_free = FALSE )
dat |
A data.frame with independent variables. |
x |
The name of an independent variable. |
breaks |
Breaks for binning. |
bins_no |
Number the generated bins. Default is TRUE. |
as_factor |
Whether to convert to factor type. |
labels |
Labels of bins. |
use_NA |
Whether to process NAs. |
char_free |
Logical, if TRUE, character variables are not split. |
A data.frame with binned x.
bins = split_bins(dat = UCICreditCard, x = "PAY_AMT1", breaks = NULL, bins_no = TRUE)
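Binning a numeric variable by a set of break points can be sketched in base R with cut(). This only illustrates the general idea: the interval closure (right-closed here), the label format and the "missing" label for NAs are assumptions of this sketch and may differ from what split_bins produces.

```r
# Hypothetical payment amounts, including one missing value.
x <- c(0, 150, 2000, 5000, NA, 80000)

# Hypothetical break points; -Inf and Inf close the outer intervals.
breaks <- c(1000, 5000)
bins <- cut(x, breaks = c(-Inf, breaks, Inf), right = TRUE)

# Bin numbers, comparable in spirit to numbering the generated bins.
bin_no <- as.integer(bins)

# Treat NAs as their own bin instead of dropping them.
bins_chr <- as.character(bins)
bins_chr[is.na(x)] <- "missing"
```

Because the intervals are right-closed, the value 5000 falls in the second bin together with 2000.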
split_bins is for transforming data to bins.
The split_bins_all function is a simpler wrapper for split_bins.
split_bins_all( dat, x_list = NULL, ex_cols = NULL, breaks_list = NULL, bins_no = TRUE, note = FALSE, return_x = FALSE, char_free = FALSE, save_data = FALSE, file_name = NULL, dir_path = tempdir(), ... )
dat |
A data.frame with independent variables. |
x_list |
A list of x variables. |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
breaks_list |
A list containing breaks of variables. It is generated by get_breaks_all or get_breaks. |
bins_no |
Number the generated bins. Default is TRUE. |
note |
Logical, outputs info. Default is FALSE. |
return_x |
Logical, return data.frame containing only variables in x_list. |
char_free |
Logical, if TRUE, character variables are not split. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved woe file. Default is "dat_woe". |
dir_path |
The path for periodically saved woe file. Default is tempdir(). |
... |
Additional parameters. |
A data.frame with split bins.
get_tree_breaks, cut_equal, select_best_class, select_best_breaks
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID", occur_time = "apply_date", miss_values = list("", -1))
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7, occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
# get breaks of all predictive variables
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION", "PAY_3", "PAY_2")
breaks_list = get_breaks_all(dat = dat_train, target = "target", x_list = x_list, occur_time = "apply_date", ex_cols = "ID", save_data = FALSE, note = FALSE)
# split into bins
train_bins = split_bins_all(dat = dat_train, breaks_list = breaks_list, woe_name = FALSE)
test_bins = split_bins_all(dat = dat_test, breaks_list = breaks_list, note = FALSE)
Returns text parse of hive SQL
sql_hive_text_parse( sql_dt, key_sql = NULL, key_table = NULL, key_id = NULL, key_where = c("dt = date_add(current_date(),-1)"), only_key = FALSE, left_id = NULL, left_where = c("dt = date_add(current_date(),-1)"), new_name = NULL, ... )
sql_dt |
The data dictionary has three columns: table, map and feature. |
key_sql |
You can write your own SQL for the main table. |
key_table |
Key table. |
key_id |
Primary key id. |
key_where |
Key table conditions. |
only_key |
Only key table. |
left_id |
Right table's key id. |
left_where |
Right table conditions. |
new_name |
A string. Rename all variables except the primary key with the suffix 'new_name'. |
... |
Other params. |
Text parse of hive SQL
# sql_dt: table, map and feature
sql_dt = data.frame(
  table = c(rep("table_1", 5), rep("table_2", 14), rep("table_3", 5)),
  map = c(rep("all", 13), rep("id_card_info", 3), rep("mobile_info", 3), rep("all", 5)),
  feature = c("user_id", "real_name", "id_card_encode", "mobile_encode", "dt",
              "user_id", "type_code", "first_channel", "second_channel", "user_name",
              "user_sex", "user_birthday", "user_age", "card_province", "card_zone",
              "card_city", "city", "province", "carrier", "user_id",
              "biz_id", "biz_code", "apply_time", "dt"))
# sample 1
sql_hive_text_parse(sql_dt = sql_dt, key_sql = NULL, key_table = "table_2", key_where = c("user_sex = 'male'", "user_age > 20"), only_key = FALSE, key_id = "user_id", left_id = "user_id", left_where = c("dt = date_add(current_date(),-1)", "apply_time >= '2020-05-01' "), new_name = "basic")
# sample 2
sql_hive_text_parse(sql_dt = subset(sql_dt), key_sql = "SELECT user_id, max(apply_time) as max_apply_time FROM table_3 WHERE dt = date_add(current_date(),-1) GROUP BY user_id", key_id = "user_id", left_id = "user_id", left_where = c("dt = date_add(current_date(),-1)"), new_name = NULL)
This function is not intended to be used by end users.
start_parallel_computing(parallel = TRUE)
parallel |
A logical, default is TRUE. |
parallel works.
This function is not intended to be used by end users.
stop_parallel_computing(cluster)
cluster |
Parallel works. |
stop clusters.
str_match searches for matches to argument pattern within each element of a character vector.
str_match(pattern, str_r)
pattern |
Character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr and gregexpr. |
str_r |
a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported. |
original_name = c("12mdd", "11mdd", "10mdd")
str_match(str_r = original_name, pattern = "\\d+")
The sum_table includes both univariate and bivariate analysis, ranging from univariate statistics and frequency distributions to correlations, cross-tabulation and characteristic analysis.
sum_table(dat, ..., x_s = as.character(substitute(list(...))), x_list = NULL)
dat |
A data.frame with x and target. |
... |
x of dat |
x_s |
A list of x. |
x_list |
Names of dat. |
A list containing both category and numeric variable analysis.
sum_table(UCICreditCard)
sum_table(UCICreditCard, LIMIT_BAL, AGE, EDUCATION, SEX)
The term_filter is for filtering stop_words and low frequency words.
The term_idf is for computing the IDF (inverse document frequency) of terms.
The term_tfidf is for computing the tf-idf of documents.
term_tfidf(term_df, idf = NULL)
term_idf(term_df, n_total = NULL)
term_filter(term_df, low_freq = 0.01, stop_words = NULL)
term_df |
A data.frame with id and term. |
idf |
A data.frame with idf. |
n_total |
Number of documents. |
low_freq |
Low-frequency threshold, given either as a usage rate of terms or as a count of terms. |
stop_words |
Stop words. |
A data.frame
term_df = data.frame(
  id = c(1,1,1,2,2,3,3,3,4,4,4,4,4,5,5,6,7,7,
         8,8,8,9,9,9,10,10,11,11,11,11,11,11),
  terms = c('a','b','c','a','c','d','d','a','b','c','a','c','d','a','c',
            'd','a','e','f','b','c','f','b','c','h','h','i','c','d','g','k','k'))
term_df = term_filter(term_df = term_df, low_freq = 1)
idf = term_idf(term_df)
tf_idf = term_tfidf(term_df, idf = idf)
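The quantities these three functions deal with can be illustrated in base R. The sketch below is not the package's implementation: it assumes the common definition idf = log(N / df), where N is the number of documents and df is the number of documents containing the term, and the package may use a different weighting or smoothing.

```r
# Toy documents: one row per (document id, term) occurrence.
term_df <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3),
  terms = c("a", "b", "a", "a", "c", "c"),
  stringsAsFactors = FALSE
)

# Term frequency: occurrences of each term within each document.
tf <- as.data.frame(table(id = term_df$id, terms = term_df$terms),
                    stringsAsFactors = FALSE)
tf <- tf[tf$Freq > 0, ]
names(tf)[names(tf) == "Freq"] <- "tf"

# Document frequency: number of distinct documents containing each term.
pairs    <- unique(term_df[, c("id", "terms")])
doc_freq <- table(pairs$terms)
n_total  <- length(unique(term_df$id))

# Assumed IDF definition: log(N / df).
idf <- log(n_total / as.numeric(doc_freq))
names(idf) <- names(doc_freq)

# tf-idf is the product of the two, looked up by term name.
tf$tf_idf <- unname(tf$tf * idf[tf$terms])
tf
```

Terms that occur in every document get an IDF of zero under this definition, which is why frequent stop words contribute nothing to the tf-idf score.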
This function is used for time series data processing.
time_series_proc(dat, ID = NULL, group = NULL, time = NULL)
dat |
A data.frame containing only predictor variables. |
ID |
The name of ID of observations or key variable of data. Default is NULL. |
group |
The group of behavioral or status variables. |
time |
The name of the variable that records the time at which the behavior happened. |
The key to creating a good model is not the power of a specific modelling technique, but the breadth and depth of derived variables that represent a higher level of knowledge about the phenomena under examination.
dat = data.frame(
  id = c(1,1,1,2,2,3,3,3,4,4,4,4,4,5,5,6,7,7,
         8,8,8,9,9,9,10,10,11,11,11,11,11,11),
  terms = c('a','b','c','a','c','d','d','a',
            'b','c','a','c','d','a','c',
            'd','a','e','f','b','c','f','b',
            'c','h','h','i','c','d','g','k','k'),
  time = c(8,3,1,9,6,1,4,9,1,3,4,8,2,7,1,
           3,4,1,8,7,2,5,7,8,8,2,1,5,7,2,7,3))
time_series_proc(dat = dat, ID = 'id', group = 'terms', time = 'time')
time_transfer is for transferring time variables to time format.
time_transfer(dat, date_cols = NULL, ex_cols = NULL, note = FALSE)
dat |
A data frame |
date_cols |
Names of time variable or regular expressions for finding time variables. Default is "DATE$|time$|date$|timestamp$|stamp$". |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
note |
Logical, outputs info. Default is FALSE. |
A data.frame with transformed time variables.
# transfer a variable.
dat = time_transfer(dat = lendingclub, date_cols = "issue_d")
class(dat[, "issue_d"])
# transfer a group of variables with similar names.
# transfer all time variables.
dat = time_transfer(dat = lendingclub[1:3], date_cols = "_d$")
class(dat[, "issue_d"])
This function is not intended to be used by end users.
time_variable( dat, date_cols = NULL, enddate = NULL, units = c("secs", "mins", "hours", "days", "weeks") )
dat |
A data.frame. |
date_cols |
Time variables. |
enddate |
End time. |
units |
Units of diff_time, "secs", "mins", "hours", "days", "weeks" is available. |
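To make the units argument concrete, the following base R sketch shows how the same interval reads in each of the five units. It only uses difftime from base R and is an illustration of the unit names, not a description of how time_variable is implemented; the start and end timestamps are made up.

```r
# Hypothetical observation time and end time (7.5 days apart).
start   <- as.POSIXct("2020-05-01 00:00:00", tz = "UTC")
enddate <- as.POSIXct("2020-05-08 12:00:00", tz = "UTC")

units <- c("secs", "mins", "hours", "days", "weeks")

# Express the same interval in every supported unit.
elapsed <- sapply(units, function(u) {
  as.numeric(difftime(enddate, start, units = u))
})
elapsed
```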
This function is not intended to be used by end users.
time_vars_process( df_tm = df_tm, x, enddate = NULL, units = c("secs", "mins", "hours", "days", "weeks") )
df_tm |
A data.frame |
x |
Time variable. |
enddate |
End time. |
units |
Units of diff_time, "secs", "mins", "hours", "days", "weeks" is available. |
tnr_value is for getting the true negative rate for a prob or score.
tnr_value(prob, target)
prob |
A list of predicted probabilities or scores. |
target |
Vector of target. |
True Negative Rate
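As a reminder of what is being measured, the true negative rate is TN / (TN + FP): the share of actual negatives that are predicted as negative. The base R sketch below assumes a fixed cutoff of 0.5 on the probability; tnr_sketch and its cutoff argument are hypothetical names for illustration, and the way tnr_value itself chooses a threshold is not documented here.

```r
# Hypothetical helper: true negative rate at a fixed cutoff.
tnr_sketch <- function(prob, target, cutoff = 0.5) {
  pred <- as.integer(prob >= cutoff)       # 1 = predicted positive
  tn <- sum(pred == 0 & target == 0)       # negatives correctly rejected
  fp <- sum(pred == 1 & target == 0)       # negatives wrongly flagged
  tn / (tn + fp)
}

prob   <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.2)
target <- c(0,   0,   1,    1,   0,   0)

tnr <- tnr_sketch(prob, target)
tnr
```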
train_lr is for training the logistic regression model used in training_model.
train_lr( dat_train, dat_test = NULL, target, x_list = NULL, occur_time = NULL, prop = 0.7, tree_control = list(p = 0.02, cp = 1e-08, xval = 5, maxdepth = 10), bins_control = list(bins_num = 10, bins_pct = 0.05, b_chi = 0.02, b_odds = 0.1, b_psi = 0.03, b_or = 0.15, mono = 0.2, odds_psi = 0.15, kc = 1), thresholds = list(cor_p = 0.8, iv_i = 0.02, psi_i = 0.1, cos_i = 0.6), lasso = TRUE, step_wise = TRUE, best_lambda = "lambda.auc", seed = 1234, ... )
dat_train |
data.frame of train data. Default is NULL. |
dat_test |
data.frame of test data. Default is NULL. |
target |
name of target variable. |
x_list |
names of independent variables. Default is NULL. |
occur_time |
The name of the variable that represents the time at which each observation takes place. Default is NULL. |
prop |
Percentage of train-data after the partition. Default: 0.7. |
tree_control |
the list of parameters to control cutting initial breaks by decision tree. See details at: |
bins_control |
the list of parameters to control merging initial breaks. See details at: |
thresholds |
Thresholds for selecting variables.
|
lasso |
Logical, if TRUE, variables filtering by LASSO. Default is TRUE. |
step_wise |
Logical, stepwise method. Default is TRUE. |
best_lambda |
Method for choosing the best lambda used to filter variables by LASSO. There are 3 methods: ("lambda.auc", "lambda.ks", "lambda.sim_sign"). Default is "lambda.auc". |
seed |
Random number seed. Default is 1234. |
... |
Other parameters |
train_test_split
Functions for partition of data.
train_test_split( dat, prop = 0.7, split_type = "Random", occur_time = NULL, cut_date = NULL, start_date = NULL, save_data = FALSE, dir_path = tempdir(), file_name = NULL, note = FALSE, seed = 43 )
dat |
A data.frame with independent variables and target variable. |
prop |
The percentage of train data samples after the partition. |
split_type |
Methods for partition.
|
occur_time |
The name of the variable that represents the time at which each observation takes place. It is used for "OOT" split. |
cut_date |
Time points for splitting data sets, e.g. splitting Actual and Expected data sets. |
start_date |
The earliest occurrence time of observations. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
dir_path |
The path for periodically saved data file. Default is tempdir(). |
file_name |
The name for periodically saved data file. Default is "dat". |
note |
Logical. Outputs info. Default is FALSE. |
seed |
Random number seed. Default is 43. |
A list of indices (train-test)
train_test = train_test_split(lendingclub,
                              split_type = "OOT",
                              prop = 0.7,
                              occur_time = "issue_d",
                              seed = 12,
                              save_data = FALSE)
dat_train = train_test$train
dat_test = train_test$test
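The idea behind an out-of-time ("OOT") split can be sketched in base R: order the observations by their occurrence time, then put the earliest share prop in the training set and the remainder in the test set. This is a minimal illustration on made-up data, assuming a pure proportion-based cut; train_test_split also accepts cut_date and start_date, which this sketch does not model.

```r
set.seed(12)

# Made-up data: 10 observations with distinct occurrence dates.
dat <- data.frame(
  id      = 1:10,
  issue_d = as.Date("2020-01-01") + sample(0:99, 10)
)

prop <- 0.7

# Order by time and cut at the requested proportion.
ord     <- order(dat$issue_d)
n_train <- round(nrow(dat) * prop)

dat_train <- dat[ord[seq_len(n_train)], ]   # earliest observations
dat_test  <- dat[ord[-seq_len(n_train)], ]  # most recent observations

range(dat_train$issue_d)
range(dat_test$issue_d)
```

Unlike a random split, every test observation here occurs after every training observation, which mimics how a model is used on future applicants.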
train_xgb is for training an XGB model used in training_model.
train_xgb( seed_number = 1234, dtrain, nthread = 2, nfold = 1, watchlist = NULL, nrounds = 100, f_eval = "ks", early_stopping_rounds = 10, verbose = 0, params = NULL, ... )
seed_number |
Random number seed. Default is 1234. |
dtrain |
train-data of xgb.DMatrix datasets. |
nthread |
Number of threads |
nfold |
Number of the cross validation of xgboost |
watchlist |
Named list of xgb.DMatrix datasets to use for evaluating model performance. Generated by |
nrounds |
Max number of boosting iterations. |
f_eval |
Customized evaluation function; "ks" and "auc" are available. |
early_stopping_rounds |
If NULL, the early stopping function is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. |
verbose |
If 0, xgboost will stay silent. If 1, it will print information about performance. |
params |
A list containing parameters of xgboost. The complete list of parameters is available at: http://xgboost.readthedocs.io/en/latest/parameter.html |
... |
Other parameters |
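The "ks" option of f_eval refers to the Kolmogorov-Smirnov statistic, commonly defined as the maximum gap between the cumulative share of positives and the cumulative share of negatives when observations are ranked by score. The base R sketch below illustrates that definition; ks_sketch is a hypothetical helper, it ignores tied scores, and it is not the evaluation function the package passes to xgboost.

```r
# Hypothetical helper: KS statistic from scores and a 0/1 target.
ks_sketch <- function(prob, target) {
  ord    <- order(prob, decreasing = TRUE)   # rank from highest score
  target <- target[ord]
  cum_pos <- cumsum(target == 1) / sum(target == 1)
  cum_neg <- cumsum(target == 0) / sum(target == 0)
  max(abs(cum_pos - cum_neg))
}

prob   <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
target <- c(1,   1,   0,   1,   0,   0,   1,   0)

ks <- ks_sketch(prob, target)
ks
```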
training_model
Model builder
training_model( model_name = "mymodel", dat, dat_test = NULL, target = NULL, occur_time = NULL, obs_id = NULL, x_list = NULL, ex_cols = NULL, pos_flag = NULL, prop = 0.7, split_type = if (!is.null(occur_time)) "OOT" else "Random", preproc = TRUE, low_var = 0.99, missing_rate = 0.98, merge_cat = 30, remove_dup = TRUE, outlier_proc = TRUE, missing_proc = "median", default_miss = list(-1, "missing"), miss_values = NULL, one_hot = FALSE, trans_log = FALSE, feature_filter = list(filter = c("IV", "PSI", "COR", "XGB"), iv_cp = 0.02, psi_cp = 0.1, xgb_cp = 0, cv_folds = 1, hopper = FALSE), algorithm = list("LR", "XGB", "GBM", "RF"), LR.params = lr_params(), XGB.params = xgb_params(), GBM.params = gbm_params(), RF.params = rf_params(), breaks_list = NULL, parallel = FALSE, cores_num = NULL, save_pmml = FALSE, plot_show = FALSE, vars_plot = TRUE, model_path = tempdir(), seed = 46, ... )
model_name |
A string, name of the project. Default is "mymodel" |
dat |
A data.frame with independent variables and target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
target |
The name of target variable. |
occur_time |
The name of the variable that represents the time at which each observation takes place. Default is NULL. |
obs_id |
The name of ID of observations or key variable of data. Default is NULL. |
x_list |
Names of independent variables. Default is NULL. |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
pos_flag |
The value of positive class of target variable, default: "1". |
prop |
Percentage of train-data after the partition. Default: 0.7. |
split_type |
Methods for partition. See details at : |
preproc |
Logical. Preprocess data. Default is TRUE. |
low_var |
Numeric threshold used to delete low-variance variables. Default is 0.99. |
missing_rate |
The maximum percent of missing values for recoding values to missing and non_missing. |
merge_cat |
Merge categories of character variables that have more than m categories. Default is 30. |
remove_dup |
Logical, if TRUE, remove the duplicated observations. |
outlier_proc |
Logical, process outliers or not. Default is TRUE. |
missing_proc |
If logical, process missing values or not. If "median", NAs are imputed with the median of the k nearest neighbors. If "avg_dist", NAs are imputed with the distance-weighted average of the k nearest neighbors. If "default", missing values are assigned -1 or "missing"; otherwise, missing values are processed according to the results of missing analysis. |
default_miss |
Default value of missing data imputation. Default is list(-1, 'missing'). |
miss_values |
Other extreme value might be used to represent missing values, e.g: -9999, -9998. These miss_values will be encoded to -1 or "missing". |
one_hot |
Logical. If TRUE, one-hot encoding of category variables. Default is FALSE. |
trans_log |
Logical, Logarithmic transformation. Default is FALSE. |
feature_filter |
Parameters for selecting important and stable features. See details at: |
algorithm |
Algorithms for training a model. list("LR", "XGB", "GBM", "RF") are available. |
LR.params |
Parameters of logistic regression & scorecard. See details at : |
XGB.params |
Parameters of xgboost. See details at : |
GBM.params |
Parameters of GBM. See details at : |
RF.params |
Parameters of Random Forest. See details at : |
breaks_list |
A table containing a list of splitting points for each independent variable. Default is NULL. |
parallel |
Default is FALSE. |
cores_num |
The number of CPU cores to use. |
save_pmml |
Logical, save model in PMML format. Default is FALSE. |
plot_show |
Logical, show model performance in current graphic device. Default is FALSE. |
vars_plot |
Logical, if TRUE, plot distribution, correlation or partial dependence of model input variables. Default is TRUE. |
model_path |
The path for periodically saved data file. Default is |
seed |
Random number seed. Default is 46. |
... |
Other parameters. |
A list containing Model Objects.
train_test_split, data_cleansing, feature_selector, lr_params, xgb_params, gbm_params, rf_params, fast_high_cor_filter, get_breaks_all, lasso_filter, woe_trans_all, get_logistic_coef, score_transfer, get_score_card, model_key_index, ks_psi_plot, ks_table_plot
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
x_list = c("LIMIT_BAL")
B_model = training_model(dat = dat,
                         model_name = "UCICreditCard",
                         target = "default.payment.next.month",
                         x_list = x_list,
                         occur_time = NULL,
                         obs_id = NULL,
                         dat_test = NULL,
                         preproc = FALSE,
                         outlier_proc = FALSE,
                         missing_proc = FALSE,
                         feature_filter = NULL,
                         algorithm = list("LR"),
                         LR.params = lr_params(lasso = FALSE,
                                               step_wise = FALSE,
                                               score_card = FALSE),
                         breaks_list = NULL,
                         parallel = FALSE,
                         cores_num = NULL,
                         save_pmml = FALSE,
                         plot_show = FALSE,
                         vars_plot = FALSE,
                         model_path = tempdir(),
                         seed = 46)
This research aimed at the case of customers' default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods. This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 24 variables as explanatory variables.
A data frame with 30000 rows and 26 variables.
ID: Customer id
apply_date: This is a fake occur time.
LIMIT_BAL: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
SEX: Gender (male; female).
EDUCATION: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
MARRIAGE: Marital status (1 = married; 2 = single; 3 = others).
AGE: Age (year)
History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows:
PAY_0: the repayment status in September
PAY_2: the repayment status in August
PAY_3: ...
PAY_4: ...
PAY_5: ...
PAY_6: the repayment status in April
The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
Amount of bill statement (NT dollar)
BILL_AMT1: amount of bill statement in September
BILL_AMT2: amount of bill statement in August
BILL_AMT3: ...
BILL_AMT4: ...
BILL_AMT5: ...
BILL_AMT6: amount of bill statement in April
Amount of previous payment (NT dollar)
PAY_AMT1: amount paid in September
PAY_AMT2: amount paid in August
PAY_AMT3: ...
PAY_AMT4: ...
PAY_AMT5: ...
PAY_AMT6: amount paid in April
default.payment.next.month: default payment (Yes = 1, No = 0), as the response variable
http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
This function is used for grouped numeric data processing.
var_group_proc(dat, ID = NULL, group = NULL, num_var = NULL)
dat |
A data.frame containing only predictor variables. |
ID |
The name of ID of observations or key variable of data. Default is NULL. |
group |
The group of behavioral or status variables. |
num_var |
The name of numeric variable to process. |
dat = data.frame(
  id = c(1,1,1,2,2,3,3,3,4,4,4,4,4,5,5,6,7,7,
         8,8,8,9,9,9,10,10,11,11,11,11,11,11),
  terms = c('a','b','c','a','c','d','d','a',
            'b','c','a','c','d','a','c',
            'd','a','e','f','b','c','f','b',
            'c','h','h','i','c','d','g','k','k'),
  time = c(8,3,1,9,6,1,4,9,1,3,4,8,2,7,1,
           3,4,1,8,7,2,5,7,8,8,2,1,5,7,2,7,3))
var_group_proc(dat = dat, ID = 'id', group = 'terms', num_var = 'time')
This function is not intended to be used by end users.
variable_process(add)
add |
A data.frame |
woe_trans is for transforming data to woe.
The woe_trans_all function is a simpler wrapper for woe_trans.
woe_trans_all( dat, x_list = NULL, ex_cols = NULL, bins_table = NULL, target = NULL, breaks_list = NULL, note = FALSE, save_data = FALSE, parallel = FALSE, woe_name = FALSE, file_name = NULL, dir_path = tempdir(), ... )
woe_trans( dat, x, bins_table = NULL, target = NULL, breaks_list = NULL, woe_name = FALSE )
dat |
A data.frame with independent variables. |
x_list |
A list of x variables. |
ex_cols |
Names of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
bins_table |
A table containing the WOE of each bin of the variables. It is generated by get_bins_table_all, get_bins_table |
target |
The name of target variable. Default is NULL. |
breaks_list |
A list containing the breaks of the variables. It is generated by get_breaks_all, get_breaks |
note |
Logical, outputs info. Default is FALSE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
parallel |
Logical, parallel computing. Default is FALSE. |
woe_name |
Logical. Add "_woe" at the end of the variable name. |
file_name |
The name for periodically saved woe file. Default is "dat_woe". |
dir_path |
The path for periodically saved woe file. Default is tempdir(). |
... |
Additional parameters. |
x |
The name of an independent variable. |
A list of breaks for each variable.
get_tree_breaks, cut_equal, select_best_class, select_best_breaks
sub = cv_split(UCICreditCard, k = 30)[[1]]
dat = UCICreditCard[sub,]
dat = re_name(dat, "default.payment.next.month", "target")
dat = data_cleansing(dat, target = "target", obs_id = "ID",
                     occur_time = "apply_date", miss_values = list("", -1))
train_test = train_test_split(dat, split_type = "OOT", prop = 0.7,
                              occur_time = "apply_date")
dat_train = train_test$train
dat_test = train_test$test
# get breaks of all predictive variables
x_list = c("PAY_0", "LIMIT_BAL", "PAY_AMT5", "EDUCATION", "PAY_3", "PAY_2")
breaks_list = get_breaks_all(dat = dat_train, target = "target",
                             x_list = x_list, occur_time = "apply_date",
                             ex_cols = "ID", save_data = FALSE, note = FALSE)
# woe transform
train_woe = woe_trans_all(dat = dat_train,
                          target = "target",
                          breaks_list = breaks_list,
                          woe_name = FALSE)
test_woe = woe_trans_all(dat = dat_test,
                         target = "target",
                         breaks_list = breaks_list,
                         note = FALSE)
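For readers new to weight of evidence, the transformation itself can be sketched in base R on a single numeric variable. The sketch assumes the sign convention WOE = ln(share of bads in the bin / share of goods in the bin); the opposite sign is equally common, and the convention, smoothing and missing-value handling used by woe_trans may differ. The data and break points are made up.

```r
# Made-up variable and 0/1 target (1 = bad).
x      <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
target <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1,  0,  1)

# Break points such as those a breaks_list would hold for one variable.
breaks <- c(4, 8)
bins   <- cut(x, breaks = c(-Inf, breaks, Inf))

# Count bads and goods per bin.
bad  <- as.numeric(tapply(target == 1, bins, sum))
good <- as.numeric(tapply(target == 0, bins, sum))

# Assumed convention: ln(distribution of bads / distribution of goods).
woe <- log((bad / sum(bad)) / (good / sum(good)))
names(woe) <- levels(bins)

# Replace each raw value by the WOE of its bin.
x_woe <- unname(woe[as.integer(bins)])
data.frame(x, bins, x_woe)
```

Because the breaks come from the training data, applying the same breaks and WOE values to the test data keeps both sets on the same scale, which is what the train_woe / test_woe calls above rely on.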
xgb_data is for preparing the data used in training_model.
xgb_data( dat_train, target, dat_test = NULL, x_list = NULL, prop = 0.7, occur_time = NULL )
dat_train |
data.frame of train data. Default is NULL. |
target |
name of target variable. |
dat_test |
data.frame of test data. Default is NULL. |
x_list |
names of independent variables of raw data. Default is NULL. |
prop |
Percentage of train-data after the partition. Default: 0.7. |
occur_time |
The name of the variable that represents the time at which each observation takes place. Default is NULL. |
xgb_filter
is for selecting important features using xgboost.
xgb_filter( dat_train, dat_test = NULL, target = NULL, pos_flag = NULL, x_list = NULL, occur_time = NULL, ex_cols = NULL, xgb_params = list(nrounds = 100, max_depth = 6, eta = 0.1, min_child_weight = 1, subsample = 1, colsample_bytree = 1, gamma = 0, scale_pos_weight = 1, early_stopping_rounds = 10, objective = "binary:logistic"), f_eval = "auc", cv_folds = 1, cp = NULL, seed = 46, vars_name = TRUE, note = TRUE, save_data = FALSE, file_name = NULL, dir_path = tempdir(), ... )
dat_train |
A data.frame with independent variables and target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
target |
The name of target variable. |
pos_flag |
The value of positive class of target variable, default: "1". |
x_list |
Names of independent variables. |
occur_time |
The name of the variable that represents the time at which each observation takes place. |
ex_cols |
A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL. |
xgb_params |
Parameters of xgboost. The complete list of parameters is available at: http://xgboost.readthedocs.io/en/latest/parameter.html. |
f_eval |
Customized evaluation function; "ks" and "auc" are available. |
cv_folds |
Number of cross-validations. Default: 1. |
cp |
Threshold of XGB feature's Gain. Default is 1/number of independent variables. |
seed |
Random number seed. Default is 46. |
vars_name |
Logical, output a list of filtered variables or table with detailed IV and PSI value of each variable. Default is TRUE. |
note |
Logical, outputs info. Default is TRUE. |
save_data |
Logical, save results in locally specified folder. Default is FALSE. |
file_name |
The name for periodically saved results files. Default is "Feature_importance_XGB". |
dir_path |
The path for periodically saved results files. Default is tempdir(). |
... |
Other parameters to pass to xgb_params. |
Selected variables.
psi_iv_filter, gbm_filter, feature_selector
dat = UCICreditCard[1:1000, c(2,4,8:9,26)]
xgb_params = list(nrounds = 100, max_depth = 6, eta = 0.1,
                  min_child_weight = 1, subsample = 1,
                  colsample_bytree = 1, gamma = 0, scale_pos_weight = 1,
                  early_stopping_rounds = 10,
                  objective = "binary:logistic")
## Not run:
xgb_features = xgb_filter(dat_train = dat, dat_test = NULL,
                          target = "default.payment.next.month",
                          occur_time = "apply_date", f_eval = 'ks',
                          xgb_params = xgb_params,
                          cv_folds = 1,
                          ex_cols = "ID$|date$|default.payment.next.month$",
                          vars_name = FALSE)
## End(Not run)
xgb_params is the list of parameters to train an XGB model used in training_model.
xgb_params_search is for searching the optimal parameters of xgboost, if any parameter in params of xgb_params has more than one value.
xgb_params( nrounds = 1000, params = list(max_depth = 6, eta = 0.01, gamma = 0, min_child_weight = 1, subsample = 1, colsample_bytree = 1, scale_pos_weight = 1), early_stopping_rounds = 100, method = "random_search", iters = 10, f_eval = "auc", nfold = 1, nthread = 2, ... )
xgb_params_search( dat_train, target, dat_test = NULL, x_list = NULL, prop = 0.7, occur_time = NULL, method = "random_search", iters = 10, nrounds = 100, early_stopping_rounds = 10, params = list(max_depth = 6, eta = 0.01, gamma = 0, min_child_weight = 1, subsample = 1, colsample_bytree = 1, scale_pos_weight = 1), f_eval = "auc", nfold = 1, nthread = 2, ... )
nrounds |
Max number of boosting iterations. |
params |
A list containing parameters of xgboost. The complete list of parameters is available at: http://xgboost.readthedocs.io/en/latest/parameter.html |
early_stopping_rounds |
If NULL, the early stopping function is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. |
method |
Method of searching optimal parameters. "random_search", "grid_search" and "local_search" are available. |
iters |
Number of iterations of "random_search" optimal parameters. |
f_eval |
Customized evaluation function; "ks" and "auc" are available. |
nfold |
Number of the cross validation of xgboost |
nthread |
Number of threads |
... |
Other parameters |
dat_train |
A data.frame of train data. Default is NULL. |
target |
Name of target variable. |
dat_test |
A data.frame of test data. Default is NULL. |
x_list |
Names of independent variables. Default is NULL. |
prop |
Percentage of train-data after the partition. Default: 0.7. |
occur_time |
The name of the variable that represents the time at which each observation takes place. Default is NULL. |
A list of parameters.
training_model, lr_params, gbm_params, rf_params
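The "random_search" method can be pictured as follows: when params holds several candidate values per parameter, build the full grid of combinations, draw iters of them at random, score each, and keep the best. The base R sketch below shows only that control flow. The score_fun function is a hypothetical stand-in for a validation AUC or KS; a real search would train xgboost for every candidate, and the package's own search may sample differently.

```r
set.seed(46)

# Candidate values: more than one value per parameter triggers a search.
params <- list(max_depth = c(3, 6, 9),
               eta       = c(0.01, 0.1),
               subsample = c(0.8, 1))
iters <- 4

# All combinations, then a random subset of size iters.
grid   <- expand.grid(params)
picked <- grid[sample(nrow(grid), iters), ]

# Hypothetical score standing in for a validation metric (higher is better).
score_fun <- function(p) -abs(p$max_depth - 6) - abs(p$eta - 0.1)

scores <- vapply(seq_len(nrow(picked)),
                 function(i) score_fun(picked[i, ]),
                 numeric(1))

best <- picked[which.max(scores), ]
best
```

With "grid_search" every row of the grid would be scored instead of a random subset, which is exhaustive but grows multiplicatively with the number of candidate values.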