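7.5 Predefined Operator Description
7.5.3 Model Engineering
7.5.3.1 Classification
7.5.3.3.2 Clustering Evaluation
Overview
Evaluates the result dataset produced by a clustering model's prediction.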
Input
Parameter    Sub-parameter    Description
inputs       dataframe        inputs is a dict; dataframe is a pyspark DataFrame object.
_col         -                Name of the prediction column in the prediction result dataset. Default: "prediction".
Example
inputs = {
    "predict_dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "features_col": "model_features",  # @param {"label": "features_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction"  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
}
cluster_evaluation____id___ = MLSClusterEvaluation(**params)
cluster_evaluation____id___.run()
# @output {"label":"dataframe","name":"cluster_evaluation____id___.get_outputs() ['output_port_1']","type":"DataFrame"}
Input
Parameter    Sub-parameter     Description
inputs       dataframe         inputs is a dict; dataframe is a pyspark DataFrame object.
inputs       pipeline_model    inputs is a dict; pipeline_model is a Spark pipeline model object.
Example
inputs = {
    "dataframe": None,  # @input {"label":"dataframe","type":"DataFrame"}
    "pipeline_model": None  # @input {"label":"pipeline_model","type":"PipelineModel"}
}
params = {
    "inputs": inputs
}
model_predict____id___ = MLSModelPredict(**params)
model_predict____id___.run()
# @output {"label":"dataframe","name":"model_predict____id___.get_outputs() ['output_port_1']","type":"DataFrame"}
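Parameters
Parameter               Sub-parameter    Description
label_col               -                Target column.
prediction_index_col    -                Name of the predicted-label column in the input prediction dataset.
label_index_col         -                Name of the true-label column in the input prediction dataset.
Example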
inputs = {
    "predict_dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
    "label_index_col": "label_index"  # @param {"label": "label_index_col", "type": "string", "required": "true", "helpTip": ""}
}
multi_class_evaluation____id___ = MLSMultiClassEvaluation(**params)
multi_class_evaluation____id___.run()
# @output {"label":"dataframe","name":"multi_class_evaluation____id___.get_outputs() ['output_port_1']","type":"DataFrame"}
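The source does not list which metrics this evaluation reports. As a rough illustration of what comparing prediction_index_col with label_index_col involves, the sketch below computes accuracy and a confusion matrix in plain Python. The function name multiclass_summary and the list-of-dicts row layout are assumptions made for illustration; they are not part of the MLS API.

```python
from collections import Counter


def multiclass_summary(rows, prediction_index_col="prediction_index",
                       label_index_col="label_index"):
    """Return (accuracy, confusion) for a list of dict rows.

    confusion maps (true_label, predicted_label) -> count.
    Hypothetical helper, not an MLS function.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    confusion = Counter(
        (row[label_index_col], row[prediction_index_col]) for row in rows
    )
    # Correct predictions sit on the diagonal of the confusion matrix.
    correct = sum(count for (truth, pred), count in confusion.items() if truth == pred)
    return correct / len(rows), dict(confusion)


rows = [
    {"label_index": 0, "prediction_index": 0},
    {"label_index": 1, "prediction_index": 1},
    {"label_index": 2, "prediction_index": 1},
    {"label_index": 1, "prediction_index": 1},
]
accuracy, confusion = multiclass_summary(rows)
```

Three of the four rows agree, so accuracy is 0.75, and the single off-diagonal entry (2, 1) records the misclassified row.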
inputs            dataframe        inputs is a dict; dataframe is a pyspark DataFrame object.
Output
Parameter         Sub-parameter    Description
prediction_col    -                Name of the prediction column in the prediction result dataset.
Example
inputs = {
    "predict_dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction"  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
}
regression_evaluation____id___ = MLSRegressionEvaluation(**params)
regression_evaluation____id___.run()
# @output {"label":"dataframe","name":"regression_evaluation____id___.get_outputs() ['output_port_1']","type":"DataFrame"}
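The source does not say which regression metrics are produced. The sketch below shows, under that caveat, how common error measures can be derived from a label column and a prediction column. The helper regression_summary, the column name "price", and the choice of MSE, RMSE, and MAE are assumptions for illustration only.

```python
import math


def regression_summary(rows, label_col, prediction_col="prediction"):
    """Compute MSE, RMSE, and MAE for a list of dict rows (hypothetical helper)."""
    if not rows:
        raise ValueError("rows must not be empty")
    # Error of each row: prediction minus true label.
    errors = [float(row[prediction_col]) - float(row[label_col]) for row in rows]
    n = len(errors)
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


rows = [
    {"price": 3.0, "prediction": 2.0},
    {"price": 5.0, "prediction": 5.0},
    {"price": 7.0, "prediction": 9.0},
]
summary = regression_summary(rows, label_col="price")
```

Here the per-row errors are -1, 0, and 2, which gives an MSE of 5/3 and an MAE of 1.0.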
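inputs                 dataframe        inputs is a dict; dataframe is a pyspark DataFrame object.
Output
Parameter              Sub-parameter    Description
rating_col             -                Name of the column that holds the ratings.
recommend_nums         -                Number of items to recommend. Default: 10.
prediction_col         -                Name of the prediction column. Default: "prediction".
cold_start_strategy    -                Cold-start strategy. Default: "nan".
alpha                  -                Regularization coefficient of the matrix factorization. Default: 1.0.
implicit_prefs         -                Whether to use implicit preferences. Default: False.
max_iter               -                Maximum number of iterations. Default: 50.
non_negative           -                Whether to apply non-negativity constraints. Default: False.
rank                   -                Rank of the factorization. Default: 10.
reg_param              -                Regularization coefficient. Default: 0.0.
Example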
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "user_col": "",  # @param {"label":"user_col","type":"string","required":"true","helpTip":""}
    "item_col": "",  # @param {"label":"item_col","type":"string","required":"true","helpTip":""}
    "rating_col": "",  # @param {"label":"rating_col","type":"string","required":"true","helpTip":""}
    "recommend_nums": 10,  # @param {"label":"recommend_nums","type":"integer","required":"false","range":"(0,2147483647)","helpTip":""}
    "prediction_col": "prediction",  # @param {"label":"prediction_col","type":"string","required":"false","helpTip":""}
    "cold_start_strategy": "nan",  # @param {"label":"cold_start_strategy","type":"string","required":"false","helpTip":""}
    "alpha": 1,  # @param {"label":"alpha","type":"number","required":"false","range":"(none,none)","helpTip":""}
    "implicit_prefs": False,  # @param {"label":"implicit_prefs","type":"boolean","required":"false","helpTip":""}
    "max_iter": 10,  # @param {"label":"max_iter","type":"integer","required":"false","range":"(0,2147483647)","helpTip":""}
    "non_negative": False,  # @param {"label":"non_negative","type":"boolean","required":"false","helpTip":""}
    "rank": 10,  # @param {"label":"rank","type":"integer","required":"false","range":"(0,2147483647)","helpTip":""}
    "reg_param": 0.1  # @param {"label":"reg_param","type":"number","required":"false","range":"(none,none)","helpTip":""}
}
als____id___ = MLSALS(**params)
als____id___.run()
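To make rank, recommend_nums, and cold_start_strategy more concrete, the sketch below scores items as the dot product of user and item factor vectors (each of length rank) and returns the top entries. It assumes the operator follows the usual Spark ALS convention, where "nan" keeps NaN scores for users or items unseen in training and "drop" removes them; the source only states the default "nan". The functions and the sample factors are illustrative, not the MLS implementation.

```python
import math


def predict_rating(user_factors, item_factors, user, item):
    """Dot product of latent vectors; NaN if the user or item was never seen."""
    if user not in user_factors or item not in item_factors:
        return math.nan
    return sum(u * v for u, v in zip(user_factors[user], item_factors[item]))


def recommend(user_factors, item_factors, user, recommend_nums=10,
              cold_start_strategy="nan"):
    """Return up to recommend_nums (item, score) pairs, best score first."""
    scored = [(item, predict_rating(user_factors, item_factors, user, item))
              for item in item_factors]
    known = [pair for pair in scored if not math.isnan(pair[1])]
    unknown = [pair for pair in scored if math.isnan(pair[1])]
    known.sort(key=lambda pair: pair[1], reverse=True)
    # "drop" discards NaN scores; "nan" keeps them after the known ones.
    ranked = known if cold_start_strategy == "drop" else known + unknown
    return ranked[:recommend_nums]


# Made-up factors with rank 2.
user_factors = {"u1": [1.0, 0.5]}
item_factors = {"a": [2.0, 0.0], "b": [0.0, 6.0], "c": [1.0, 1.0]}

top_two = recommend(user_factors, item_factors, "u1", recommend_nums=2)
cold_nan = recommend(user_factors, item_factors, "u9")
cold_drop = recommend(user_factors, item_factors, "u9", cold_start_strategy="drop")
```

For the unseen user "u9", the "nan" strategy yields three NaN-scored entries while "drop" yields an empty list, which is why "drop" is often preferred when the result feeds an evaluation metric.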
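7.5.3.5.1 Decision Tree Regression
Overview
The "Decision Tree Regression" node produces a regression model.
The decision tree algorithm builds the tree recursively, selecting features by the minimum squared error criterion and producing a binary tree. The squared error is computed as:
$$\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2$$
where \bar{y} is the mean of the sample labels, y_i is the label of sample i, and N is the number of samples.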
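The squared-error criterion can be sketched in a few lines of plain Python. The code computes the mean squared deviation of the labels from their mean, then scans one numeric feature for the threshold that minimizes the size-weighted error of the two child nodes. The function names and the midpoint thresholds are assumptions for illustration; a production implementation would likely restrict the candidate thresholds, which is what a max_bins setting suggests.

```python
def variance(labels):
    """Mean squared deviation from the mean: (1/N) * sum((y_i - mean)^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    mean = sum(labels) / n
    return sum((y - mean) ** 2 for y in labels) / n


def best_split(values, labels):
    """Return (threshold, weighted_error) of the best binary split on one feature.

    threshold is None when no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_error = None, variance(labels)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal feature values cannot be separated
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * variance(left) + len(right) * variance(right)) / n
        if weighted < best_error:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_error = weighted
    return best_threshold, best_error


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
parent_error = variance(labels)
threshold, child_error = best_split(values, labels)
```

On this toy data the unsplit node has an error of 56.25, and splitting at 6.5 separates the two label groups perfectly, so the weighted child error drops to 0.0. Applying the same search recursively to each child yields the binary tree.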
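Input
Parameter                Sub-parameter    Description
inputs                   dataframe        inputs is a dict; dataframe is a pyspark DataFrame object.
Output
A Spark pipeline model.
Parameters
Parameter                Sub-parameter    Description
b_use_default_encoder    -                Whether to use the default encoding. Default: True.
input_features_str       -                Comma-separated string of input column names, for example "column_a" or "column_a,column_b".
min_info_gain            -                Minimum information gain. Default: 0.0.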
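The comma-separated format of input_features_str can be illustrated with a small parser. This is a hypothetical helper rather than an MLS function, and how the operator itself handles an empty string (which the sample passes) is not stated in the source; here an empty string simply yields an empty list.

```python
def parse_input_features(input_features_str):
    """Split a comma-separated column string into a list of column names."""
    return [name.strip() for name in input_features_str.split(",") if name.strip()]


single = parse_input_features("column_a")
double = parse_input_features("column_a,column_b")
empty = parse_input_features("")
```

Stripping whitespace around each name is a convenience of this sketch, so "column_a, column_b" parses the same way as "column_a,column_b".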
Example
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
    "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
    "outer_pipeline_stages": None,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "regressor_feature_vector_col": "model_features",  # @param {"label": "regressor_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
    "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "true", "range": "[0.0,none)", "helpTip": ""}
    "impurity": "variance"
}
dt_regressor____id___ = MLSDecisionTreeRegression(**params)
dt_regressor____id___.run()
# @output {"label":"pipeline_model","name":"dt_regressor____id___.get_outputs() ['output_port_1']","type":"PipelineModel"}
Input
Parameter                       Sub-parameter    Description
inputs                          dataframe        inputs is a dict; dataframe is a pyspark DataFrame object.
Output
A Spark pipeline model.
Parameters
Parameter                       Sub-parameter    Description
b_use_default_encoder           -                Whether to use the default encoding. Default: True.
input_features_str              -                Comma-separated string of input column names, for example "column_a" or "column_a,column_b".
label_col                       -                Target column.
regressor_feature_vector_col    -                Name of the feature vector column fed to the operator. Default: "model_features".
max_depth                       -                Maximum depth of a tree. Default: 5.
max_bins                        -                Maximum number of bins. Default: 32.
min_instances_per_node          -                Minimum number of instances each child node must contain when a node is split. Default: 1.
min_info_gain                   -                Minimum information gain required for a node to be split. Default: 0.0.
subsampling_rate                -                Fraction of the training set sampled to learn each decision tree. Default: 1.0.
loss_type                       -                Loss function type; squared and absolute are supported. Default: "squared".
max_iter                        -                Maximum number of iterations. Default: 20.
Example
params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
    "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
    "outer_pipeline_stages": None,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": "target label column"}
    "regressor_feature_vector_col": "model_features",  # @param {"label": "regressor_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
    "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "true", "range": "[0.0,none)", "helpTip": ""}
    "subsampling_rate": 1.0,  # @param {"label": "subsampling_rate", "type": "number", "required": "true", "range": "(0.0,1.0]", "helpTip": ""}
    "loss_type": "squared",  # @param {"label": "loss_type", "type": "enum", "required": "true", "options": "squared,absolute", "helpTip": ""}
    "max_iter": 20,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "step_size": 0.1,  # @param {"label": "step_size", "type": "number", "required": "true", "range": "(0.0,none)", "helpTip": ""}
    "impurity": "variance"
}
gbt_regressor____id___ = MLSGBTRegression(**params)
gbt_regressor____id___.run()
# @output {"label":"pipeline_model","name":"gbt_regressor____id___.get_outputs() ['output_port_1']","type":"PipelineModel"}
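To show how max_iter, step_size, and the "squared" loss interact, here is a textbook gradient-boosting sketch in plain Python. Each iteration fits a one-split tree (a stump) to the current residuals and adds its output scaled by step_size. This is a simplified illustration on a single feature, with subsampling omitted, and is not necessarily how the operator or Spark builds its ensemble; all function names are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree on one feature.

    Returns (threshold, left_value, right_value); threshold is None when
    all feature values are equal and no split is possible.
    """
    pairs = sorted(zip(xs, targets))
    n = len(pairs)
    overall_mean = sum(t for _, t in pairs) / n
    best_sse = float("inf")
    best = (None, overall_mean, overall_mean)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if sse < best_sse:
            best_sse = sse
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, left_mean, right_mean)
    return best


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    if threshold is None or x <= threshold:
        return left_value
    return right_value


def fit_gbt(xs, ys, max_iter=20, step_size=0.1):
    """Boost stumps on residuals, the negative gradient of the squared loss."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(max_iter):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + step_size * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return {"base": base, "step_size": step_size, "stumps": stumps}


def gbt_predict(model, x):
    boost = sum(stump_predict(s, x) for s in model["stumps"])
    return model["base"] + model["step_size"] * boost


def mean_squared_error(model, xs, ys):
    return sum((gbt_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 4.0, 4.1, 3.9]
baseline = fit_gbt(xs, ys, max_iter=0)   # constant prediction only
boosted = fit_gbt(xs, ys, max_iter=20, step_size=0.1)
```

A smaller step_size shrinks each tree's contribution, so more iterations are needed to reach the same training error; that trade-off is why the two parameters are usually tuned together.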
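7.5.3.5.3 LightGBM Regression
Overview
A wrapper around the LightGBM regression in the mmlspark Python package.
Input
Parameter                       Sub-parameter    Description
inputs                          dataframe        inputs is a dict; dataframe is a pyspark DataFrame object.
Output
A Spark pipeline model.
Parameter                       Sub-parameter    Description
label_col                       -                Target column.
regressor_feature_vector_col    -                Name of the feature vector column fed to the operator. Default: "model_features".
prediction_col                  -                Name of the predicted-label column output by the operator. Default: "prediction".
objective                       -                Objective function. Default: "regression".
max_depth                       -                Maximum depth of a tree. Default: -1.
num_iteration                   -                Number of iterations. Default: 100.
learning_rate                   -                Learning rate. Default: 0.1.
num_leaves                      -                Number of leaves. Default: 31.
max_bin                         -                Maximum number of bins. Default: 255.
bagging_fraction                -                Bagging fraction. Default: 1.
bagging_freq                    -                Bagging frequency. Default: 0.
bagging_seed                    -                Random seed used for bagging. Default: 3.
early_stopping_round            -                Number of rounds after which iteration stops early. Default: 0.
feature_fraction                -                Feature fraction. Default: 1.0.
min_sum_hessian_in_leaf         -                Minimum sum of hessians in one leaf. Value range: [0, 1]. Default: 1e-3.
boost_from_average              -                Whether to adjust the initial score to the mean of the labels to speed up convergence. Default: True.
boosting_type                   -                Boosting type. Options: gbdt, gbrt, rf, dart, goss. Default: gbdt.
lambda_l1                       -                L1 regularization coefficient. Default: 0.0.
lambda_l2                       -                L2 regularization coefficient. Default: 0.0.
num_batches                     -                If greater than 0, the dataset is split into separate batches during training. Default: 0.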
params = {
    "inputs": inputs,
    "b_output_action": True,
    "outer_pipeline_stages": None,
    "input_features_str": "",  # @param {"label":"input_features_str","type":"string","required":"false","helpTip":""}
    "label_col": "",  # @param {"label":"label_col","type":"string","required":"true","helpTip":""}
    "regressor_feature_vector_col": "model_features",  # @param {"label":"regressor_feature_vector_col","type":"string","required":"false","helpTip":""}
    "prediction_col": "prediction",  # @param {"label":"prediction_col","type":"string","required":"false","helpTip":""}
    "objective": "regression",  # @param {"label":"objective","type":"string","required":"false","helpTip":""}
    "max_depth": -1,  # @param {"label":"max_depth","type":"integer","required":"false","range":"[-1,2147483647]","helpTip":""}
    "num_iteration": 100,  # @param {"label":"num_iteration","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "learning_rate": 0.1,  # @param {"label":"learning_rate","type":"number","required":"false","helpTip":""}
    "num_leaves": 31,  # @param
    "early_stopping_round": 0,  # @param {"label":"early_stopping_round","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
    "feature_fraction": 1.0,  # @param {"label":"feature_fraction","type":"number","required":"false","helpTip":""}
    "min_sum_hessian_in_leaf": 1e-3,  # @param {"label":"min_sum_hessian_in_leaf","type":"number","required":"false","helpTip":""}
    "boost_from_average": True,  # @param {"label":"boost_from_average","type":"boolean","required":"false","helpTip":""}
    "boosting_type": "gbdt",  # @param {"label":"boosting_type","type":"string","required":"false","helpTip":""}
    "lambda_l1": 0.0,  # @param {"label":"lambda_l1","type":"number","required":"false","helpTip":""}
    "lambda_l2": 0.0,  # @param {"label":"lambda_l2","type":"number","required":"false","helpTip":""}
    "num_batches": 0,  # @param {"label":"num_batches","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
    "parallelism": "data_parallel"  # @param {"label":"parallelism","type":"string","required":"false","helpTip":""}
}
lightgbm_regressor____id___ = MLSLightGbmRegression(**params)
lightgbm_regressor____id___.run()
# @output {"label":"pipeline_model","name":"lightgbm_regressor____id___.get_outputs() ['output_port_1']","type":"PipelineModel"}
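The @param annotations in these samples carry range strings such as "(0,2147483647]", "[-1,2147483647]", and "(none,none)". The sketch below shows one way such a string could be checked before a value is passed to an operator. It assumes standard interval notation (round bracket exclusive, square bracket inclusive) and that "none" means unbounded; both readings are inferred from the samples rather than documented, and in_range is a hypothetical helper.

```python
def in_range(value, range_spec):
    """Check a number against an interval string like "(0,2147483647]"."""
    spec = range_spec.strip()
    if len(spec) < 2 or spec[0] not in "([" or spec[-1] not in ")]":
        raise ValueError(f"unrecognized range: {range_spec!r}")
    lower_closed = spec[0] == "["
    upper_closed = spec[-1] == "]"
    lower_text, upper_text = (part.strip().lower() for part in spec[1:-1].split(","))
    if lower_text != "none":  # assumed to mean "no lower bound"
        lower = float(lower_text)
        if value < lower or (value == lower and not lower_closed):
            return False
    if upper_text != "none":  # assumed to mean "no upper bound"
        upper = float(upper_text)
        if value > upper or (value == upper and not upper_closed):
            return False
    return True


checks = {
    "num_iteration=100": in_range(100, "(0,2147483647]"),
    "num_iteration=0": in_range(0, "(0,2147483647]"),
    "max_depth=-1": in_range(-1, "[-1,2147483647]"),
    "min_info_gain=0.0": in_range(0.0, "[0.0,none)"),
}
```

Under this reading, max_depth of -1 is accepted because its lower bound is inclusive, while num_iteration of 0 is rejected because its lower bound is exclusive.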
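7.5.3.5.4 Linear Regression
Overview
The "Linear Regression" node produces a linear regression model. Linear regression is a statistical analysis method that uses regression analysis to determine the quantitative dependence between two or more variables.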
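As a small worked case of the quantitative relationship mentioned above, the sketch below fits y = slope * x + intercept to one input variable by ordinary least squares. It is a plain-Python illustration with a hypothetical function name, not the solver used by the node, which may handle many features and regularization.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one variable: returns (slope, intercept)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

The points lie exactly on y = 2x + 1, so the fit recovers a slope of 2 and an intercept of 1.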