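7.5 Preset Operator Description
7.5.3 Model Engineering
7.5.3.1 Classification
7.5.3.1.3 LightGBM Classification
Overview
This operator wraps the LightGBM classifier from the mmlspark Python package.
Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
A Spark pipeline model (PipelineModel)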
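Parameter Description
Parameter Sub-parameter Description
input_features_str - Comma-separated string of input column names, for example:
"column_a"
"column_a,column_b"
label_col - Target column
classifier_label_index_col - Name of the new column produced by label-encoding the target column; defaults to "label_index"
classifier_feature_vector_col - Name of the feature vector column fed to the operator; defaults to "model_features"
timeout - Timeout in seconds; defaults to 1200
objective - Objective function; supports binary, multiclass, and multiclassova; defaults to "binary"
max_depth - Maximum tree depth; defaults to -1
num_iteration - Number of iterations; defaults to 100
learning_rate - Learning rate; defaults to 0.1
num_leaves - Number of leaves; defaults to 31
max_bin - Maximum number of bins; defaults to 255
bagging_fraction - Bagging fraction; defaults to 1
bagging_freq - Bagging frequency; defaults to 0
bagging_seed - Random seed used for bagging; defaults to 3
early_stopping_round - Number of rounds for early stopping; defaults to 0
feature_fraction - Feature fraction; defaults to 1.0
min_sum_hessian_in_leaf - Minimum sum of hessians in one leaf; value range [0, 1]; defaults to 1e-3
boost_from_average - Whether to adjust the initial score to the mean of the labels to speed up convergence; defaults to True
boosting_type - Boosting type; options are gbdt, gbrt, rf, dart, and goss; defaults to "gbdt"
lambda_l1 - L1 regularization coefficient; defaults to 0.0
lambda_l2 - L2 regularization coefficient; defaults to 0.0
num_batches - If greater than 0, the dataset is split into separate batches during training; defaults to 0
parallelism - Parallelization method for tree learning; supports data_parallel and voting_parallel; defaults to "data_parallel"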
thresholds_str - Used for multiclass classification; the preset array of probability values corresponding to each class, given as a comma-separated string
Sample
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "outer_pipeline_stages": None,
    "input_features_str": "",  # @param {"label":"input_features_str","type":"string","required":"false","helpTip":""}
    "label_col": "",  # @param {"label":"label_col","type":"string","required":"true","helpTip":""}
    "classifier_label_index_col": "label_index",  # @param {"label":"classifier_label_index_col","type":"string","required":"false","helpTip":""}
    "classifier_feature_vector_col": "model_features",  # @param {"label":"classifier_feature_vector_col","type":"string","required":"false","helpTip":""}
    "prediction_index_col": "prediction_index",  # @param {"label":"prediction_index_col","type":"string","required":"false","helpTip":""}
    "prediction_col": "prediction",  # @param {"label":"prediction_col","type":"string","required":"false","helpTip":""}
    "probability_col": "probability",  # @param {"label":"probability_col","type":"string","required":"false","helpTip":""}
    "is_unbalance": False,  # @param {"label":"is_unbalance","type":"boolean","required":"false","helpTip":""}
    "timeout": 1200.0,  # @param {"label":"timeout","type":"number","required":"false","helpTip":""}
    "objective": "binary",  # @param {"label":"objective","type":"string","required":"false","helpTip":""}
    "max_depth": -1,  # @param {"label":"max_depth","type":"integer","required":"false","range":"[-1,2147483647]","helpTip":""}
    "num_iteration": 100,  # @param {"label":"num_iteration","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "learning_rate": 0.1,  # @param {"label":"learning_rate","type":"number","required":"false","helpTip":""}
    "num_leaves": 31,  # @param
    "early_stopping_round": 0,  # @param {"label":"early_stopping_round","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
    "feature_fraction": 1.0,  # @param {"label":"feature_fraction","type":"number","required":"false","helpTip":""}
    "min_sum_hessian_in_leaf": 1e-3,  # @param {"label":"min_sum_hessian_in_leaf","type":"number","required":"false","helpTip":""}
    "boost_from_average": True,  # @param {"label":"boost_from_average","type":"boolean","required":"false","helpTip":""}
    "boosting_type": "gbdt",  # @param {"label":"boosting_type","type":"string","required":"false","helpTip":""}
    "lambda_l1": 0.0,  # @param {"label":"lambda_l1","type":"number","required":"false","helpTip":""}
    "lambda_l2": 0.0,  # @param {"label":"lambda_l2","type":"number","required":"false","helpTip":""}
    "num_batches": 0,  # @param {"label":"num_batches","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
    "parallelism": "data_parallel",  # @param {"label":"parallelism","type":"string","required":"false","helpTip":""}
    "thresholds_str": ""  # @param {"label":"thresholds_str","type":"string","required":"false","helpTip":""}
}
lightgbm_classifier____id___ = MLSLightGBMClassifier(**params)
lightgbm_classifier____id___.run()
# @output {"label":"pipeline_model","name":"lightgbm_classifier____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
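The defaults and enumerated options listed in the LightGBM parameter description can be checked before the operator is run. The sketch below is only an illustration: the helper `build_lightgbm_params` and the column name `target` are hypothetical and are not part of the operator, and the operator itself may validate its parameters differently.

```python
# Hypothetical helper, not part of MLSLightGBMClassifier. It only mirrors a
# subset of the defaults and allowed values listed in the parameter description.
DEFAULTS = {
    "objective": "binary",
    "max_depth": -1,
    "num_iteration": 100,
    "learning_rate": 0.1,
    "num_leaves": 31,
    "boosting_type": "gbdt",
    "parallelism": "data_parallel",
    "thresholds_str": "",
}
ALLOWED = {
    "objective": {"binary", "multiclass", "multiclassova"},
    "boosting_type": {"gbdt", "gbrt", "rf", "dart", "goss"},
    "parallelism": {"data_parallel", "voting_parallel"},
}


def build_lightgbm_params(input_features, label_col, **overrides):
    """Merge overrides into the documented defaults and check enumerated values."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = dict(DEFAULTS, **overrides)
    for name, options in ALLOWED.items():
        if params[name] not in options:
            raise ValueError(f"{name} must be one of {sorted(options)}")
    # input_features_str is a comma-separated string of column names.
    params["input_features_str"] = ",".join(input_features)
    params["label_col"] = label_col
    return params


params = build_lightgbm_params(
    ["column_a", "column_b"], "target", objective="multiclass"
)
print(params["input_features_str"])
```

A dictionary built this way would still need the `inputs` entry and the remaining keys shown in the sample before being passed to the operator.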
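● Binary classification
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
A Spark pipeline model (PipelineModel)
Parameter Description
Parameter Sub-parameter Description
b_use_default_encoder - Whether to use the default encoding; defaults to True
input_features_str - Comma-separated string of input column names, for example:
"column_a"
"column_a,column_b"
label_col - Target column
classifier_label_index_col - Name of the new column produced by label-encoding the target column; defaults to "label_index"
max_iter - Maximum number of iterations; defaults to 100
reg_param - Regularization coefficient; defaults to 0.0
tol - Convergence tolerance; defaults to 1e-6
fit_intercept - Whether to fit an intercept term; defaults to True
standardization - Whether to standardize the training features before training the model; defaults to True
aggregation_depth - Aggregation depth; defaults to 2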
Sample
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
    "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
    "outer_pipeline_stages": None,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": "target label column"}
    "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "true", "helpTip": ""}
    "classifier_feature_vector_col": "model_features",  # @param {"label": "classifier_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
    "max_iter": 100,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "reg_param": 0.0,  # @param {"label": "reg_param", "type": "number", "required": "true", "range": "[0,none)", "helpTip": ""}
    "tol": 1e-6,  # @param {"label": "tol", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
    "fit_intercept": True,  # @param {"label": "fit_intercept", "type": "boolean", "required": "true", "helpTip": ""}
    "standardization": True,  # @param {"label": "standardization", "type": "boolean", "required": "true", "helpTip": ""}
    "aggregation_depth": 2  # @param {"label": "aggregation_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
}
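7.5.3.1.5 Logistic Regression Classification
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
A Spark pipeline model (PipelineModel)
Parameter Description
Parameter Sub-parameter Description
b_use_default_encoder - Whether to use the default encoding; defaults to True
input_features_str - Comma-separated string of input column names, for example:
"column_a"
"column_a,column_b"
label_col - Target column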
classifier_label_index_c
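prediction_col - Name of the column holding the predicted label output by the operator; defaults to "prediction"
prediction_index_col - Name of the label column corresponding to the predicted label output by the operator; defaults to "prediction_index"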
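tol - Convergence tolerance of the iterative algorithm; defaults to 1e-6
fit_intercept - Whether to fit an intercept term; defaults to True
standardization - Whether to standardize the features; defaults to True
aggregation_depth - Aggregation depth; defaults to 2
family - Label distribution used in model training; supports auto, binomial, and multinomial; defaults to "auto"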
Sample
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
    "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
    "outer_pipeline_stages": None,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": "target label column"}
    "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "true", "helpTip": ""}
    "classifier_feature_vector_col": "model_features",  # @param {"label": "classifier_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
    "max_iter": 100,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "reg_param": 0.0,  # @param {"label": "reg_param", "type": "number", "required": "true", "range": "[0,none)", "helpTip": ""}
    "elastic_net_param": 0.0,  # @param {"label": "elastic_net_param", "type": "number", "required": "true", "range": "[0,none)", "helpTip": ""}
    "tol": 1e-6,  # @param {"label": "tol", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
    "fit_intercept": True,  # @param {"label": "fit_intercept", "type": "boolean", "required": "true", "helpTip": ""}
    "standardization": True,  # @param {"label": "standardization", "type": "boolean", "required": "true", "helpTip": ""}
    "aggregation_depth": 2,  # @param {"label": "aggregation_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
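7.5.3.1.6 Multilayer Perceptron Classification
Overview
The "Multilayer Perceptron Classification" node builds a classification model based on a feedforward artificial neural network.
A feedforward artificial neural network uses a one-way, multi-layer structure. Each layer contains several neurons; neurons in the same layer are not connected to one another, and information passes between layers in one direction only. The first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. A feedforward network with K+1 layers can be written in matrix form as follows, where X is the feature set, w are the weights, b are the biases, and y is the predicted value:
$$ y(X) = f_K(\ldots f_2(w_2^T f_1(w_1^T X + b_1) + b_2) \ldots + b_K) $$
Nodes in the hidden layers use the sigmoid function:
$$ f(z_i) = \frac{1}{1 + e^{-z_i}} $$
Nodes in the output layer use the softmax function:
$$ f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}} $$
The number of nodes N in the output layer equals the number of classes.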
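The forward pass described above (sigmoid hidden layers followed by a softmax output layer) can be sketched in plain Python. This is a minimal illustration, not the operator's implementation: the function names and the example weights are made up, and no training is performed.

```python
import math


def sigmoid(z):
    """Element-wise logistic function used by the hidden layers."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z):
    """Softmax used by the output layer; subtracting the max avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights, bias, x):
    """Compute w^T x + b, with weights stored as one row per output node."""
    return [
        sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def forward(layers, x):
    """layers is a list of (weights, bias) pairs; the last pair is the output layer."""
    for weights, bias in layers[:-1]:
        x = sigmoid(affine(weights, bias, x))
    weights, bias = layers[-1]
    return softmax(affine(weights, bias, x))


# Illustrative weights only: 2 inputs -> 2 hidden nodes -> 3 classes.
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]),
    ([[1.0, -1.0], [0.0, 0.5], [-0.5, 0.3]], [0.0, 0.0, 0.0]),
]
probabilities = forward(layers, [1.0, 2.0])
predicted_class = probabilities.index(max(probabilities))
```

The output list has one probability per class, which is why the output layer needs as many nodes as there are classes.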
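Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
A Spark pipeline model (PipelineModel)
Parameter Sub-parameter Description
input_features_str - Comma-separated string of input column names, for example:
"column_a"
"column_a,column_b"
label_col - Target column
classifier_label_index_col - Name of the new column produced by label-encoding the target column; defaults to "label_index"
classifier_feature_vector_col - Name of the feature vector column fed to the operator; defaults to "model_features"
prediction_col - Name of the column holding the predicted label output by the operator; defaults to "prediction"
prediction_index_col - Name of the label column corresponding to the predicted label output by the operator; defaults to "prediction_index"
max_iter - Maximum number of iterations; defaults to 100
tol - Convergence tolerance; defaults to 1e-6
seed - Random seed; defaults to 0
layers_str - Comma-separated string of layer sizes, for example:
"2,3,4"
"3"
step_size - Step size; defaults to 0.03
solver - Optimization algorithm; supports l-bfgs and gd; defaults to "l-bfgs"
initial_weights_str - Comma-separated string of initial weights, for example:
"0.01"
"0.01,0.02,0.04"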
Sample
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
"type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
    "max_iter": 100,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "tol": 1e-6,  # @param {"label": "tol", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
    "seed": 0,  # @param {"label": "seed", "type": "integer", "required": "false", "range": "[0,2147483647]", "helpTip": ""}
    "layers_str": "",  # @param {"label": "layers_str", "type": "string", "required": "false", "helpTip": ""}
    "block_size": 128,
    "step_size": 0.03,  # @param {"label": "step_size", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
    "solver": "l-bfgs",  # @param {"label": "solver", "type": "enum", "required": "true", "options": "gd,l-bfgs", "helpTip": ""}
    "initial_weights_str": ""  # @param {"label": "initial_weights_str", "type": "string", "required": "false", "helpTip": ""}
}
multilayer_perception_classifier____id___ = MLSMultilayerPerceptronClassifier(**params)
multilayer_perception_classifier____id___.run()
# @output {"label":"pipeline_model","name":"multilayer_perception_classifier____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
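The string parameters layers_str and initial_weights_str carry comma-separated numbers. How the operator parses them internally is not documented here, so the following is only a sketch of one plausible way to turn such strings into lists; the function names are hypothetical.

```python
def parse_int_list(text):
    """Parse a string such as "2,3,4" into [2, 3, 4]; an empty string gives []."""
    return [int(part) for part in text.split(",") if part.strip()]


def parse_float_list(text):
    """Parse a string such as "0.01,0.02" into [0.01, 0.02]; an empty string gives []."""
    return [float(part) for part in text.split(",") if part.strip()]


layers = parse_int_list("2,3,4")
weights = parse_float_list("0.01,0.02,0.04")
no_layers = parse_int_list("")
```

Treating the empty string as an empty list matches the sample above, where both parameters default to "".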
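inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
Parameter Description
Parameter Sub-parameter Description
b_use_default_encoder - Whether to use the default encoding; defaults to True
input_features_str - Comma-separated string of input column names, for example:
"column_a"
"column_a,column_b"
label_col - Target column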
classifier_label_index_
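prediction_col - Name of the column holding the predicted label output by the operator; defaults to "prediction"
prediction_index_col - Name of the label column corresponding to the predicted label output by the operator; defaults to "prediction_index"
smoothing - Smoothing parameter; defaults to 1.0
model_type - Model type; supports multinomial and bernoulli; defaults to "multinomial"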
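To give a sense of what the smoothing parameter does, the sketch below applies additive (Laplace) smoothing to per-feature counts for one class, as in a multinomial model. It assumes that smoothing here means standard additive smoothing; the function name and the counts are illustrative and do not come from the operator.

```python
def smoothed_likelihoods(feature_counts, smoothing=1.0):
    """Additive smoothing: (count + smoothing) / (total + smoothing * n_features)."""
    total = sum(feature_counts)
    n_features = len(feature_counts)
    denom = total + smoothing * n_features
    return [(count + smoothing) / denom for count in feature_counts]


# Illustrative counts of three features observed within one class.
counts = [3, 0, 1]
probs = smoothed_likelihoods(counts, smoothing=1.0)
```

With a positive smoothing value, a feature that never appeared in the class (count 0) still receives a non-zero probability, so a single unseen feature does not force the class probability to zero.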
Sample
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
    "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
    "outer_pipeline_stages": None,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "true", "helpTip": ""}
# @output {"label":"pipeline_model","name":"naive_bayes_classifier____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
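7.5.3.1.8 Random Forest Classification
Overview
The "Random Decision Forest Classification" node produces binary or multiclass classification models. A random decision forest is a forest model built in a randomized way from many decision trees, and the trees are independent of one another. When a new sample arrives, every decision tree in the forest classifies it separately, and the sample is assigned to the class chosen by the most trees.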
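The voting rule described above can be sketched in a few lines. The per-tree predictions below are made up for illustration; a real forest would obtain them from trained trees, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the most trees (ties go to the class seen first)."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Illustrative predictions from five trees for one sample.
votes = ["cat", "dog", "cat", "cat", "dog"]
winner = majority_vote(votes)
```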
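The decision tree algorithm in random decision forest classification uses Gini impurity or entropy to quantify how ordered a set is and to score a candidate split.
● Gini impurity is the expected error rate when an outcome drawn at random from the set is applied to a randomly chosen item in the set. It is computed as:
$$ \sum_{i=1}^{C} f_i (1 - f_i) $$
● Entropy is a concept from information theory that measures how disordered a set is: the larger the entropy, the more mixed the set, and the smaller the entropy, the more ordered the set. It is computed as:
$$ \sum_{i=1}^{C} -f_i \log(f_i) $$
Here f_i is the fraction of all samples that belong to class i, and C is the number of classes.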
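As a worked illustration of the two impurity measures, the sketch below computes them from a list of class labels using the standard definitions, sum of f_i(1 - f_i) for Gini impurity and sum of -f_i log(f_i) for entropy. The natural logarithm is an assumption; a different base only rescales the entropy. Function names and the example labels are illustrative.

```python
import math
from collections import Counter


def class_fractions(labels):
    """f_i: fraction of samples belonging to each class i."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]


def gini(labels):
    """Gini impurity: sum over classes of f_i * (1 - f_i)."""
    return sum(f * (1.0 - f) for f in class_fractions(labels))


def entropy(labels):
    """Entropy: sum over classes of -f_i * log(f_i), natural logarithm assumed."""
    return sum(-f * math.log(f) for f in class_fractions(labels))


mixed = ["a", "a", "b", "b"]  # evenly split between two classes
pure = ["a", "a", "a", "a"]   # a single class

gini_mixed, entropy_mixed = gini(mixed), entropy(mixed)
gini_pure, entropy_pure = gini(pure), entropy(pure)
```

Both measures are zero for a set containing a single class and grow as the classes become more evenly mixed, which is what makes them usable for comparing candidate splits.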
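Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
A Spark pipeline model (PipelineModel)
Parameter Sub-parameter Description
input_features_str - Comma-separated string of input column names, for example:
"column_a"
"column_a,column_b"
label_col - Target column
classifier_label_index_col - Target column passed to the classifier; must be a numeric column
classifier_feature_vector_col - Feature column passed to the classifier; must be a vector column
prediction_col - Name of the column holding the predicted label output by the operator; defaults to "prediction"
prediction_index_col - Name of the label column corresponding to the predicted label output by the operator; defaults to "prediction_index"
max_depth - Maximum tree depth; defaults to 5
max_bins - Maximum number of bins; defaults to 32
min_instances_per_node - Minimum number of instances each child node must contain for a node to be split; defaults to 1
min_info_gain - Minimum information gain required for a node to be split; defaults to 0
impurity - Criterion used to compute information gain; supports entropy and gini; defaults to "gini"
num_trees - Number of trees; defaults to 20
feature_subset_strategy - Strategy for choosing the feature columns considered at each node split; supports auto, all, onethird, sqrt, log2, and n; defaults to "all"
subsampling_rate - Fraction of the training set used to learn each decision tree; defaults to 1.0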
seed - Random seed; defaults to 0
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": "target label column"}
    "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "true", "helpTip": ""}
    "classifier_feature_vector_col": "model_features",  # @param {"label": "classifier_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
    "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer",