7.5 Built-in Operator Reference
7.5.3 Model Engineering
7.5.3.1 Classification
7.5.3.1.5 Logistic Regression Classification
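inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
A Spark pipeline model
Parameters
Parameter Sub-parameter Description
b_use_default_encoder - Whether to use the default encoding. Default: True
input_features_str - Input column names joined into a comma-separated string, for example:
"column_a"
"column_a,column_b"
label_col - Target column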
classifier_label_index_c
prediction_index_col - Label-index column that corresponds to the predicted label output by the operator. Default: "prediction_index"
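tol - Convergence threshold of the iterative algorithm. Default: 1e-6
fit_intercept - Whether to fit an intercept term. Default: True
standardization - Whether to standardize the features. Default: True
aggregation_depth - Aggregation depth. Default: 2
family - Label distribution used in model training. Supported values: auto, binomial, multinomial. Default: "auto"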
Example
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
    "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
    "outer_pipeline_stages": None,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": "target label column"}
    "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "true", "helpTip": ""}
    "classifier_feature_vector_col": "model_features",  # @param {"label": "classifier_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
    "max_iter": 100,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "reg_param": 0.0,  # @param {"label": "reg_param", "type": "number", "required": "true", "range": "[0,none)", "helpTip": ""}
    "elastic_net_param": 0.0,  # @param {"label": "elastic_net_param", "type": "number", "required": "true", "range": "[0,none)", "helpTip": ""}
    "tol": 1e-6,  # @param {"label": "tol", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
    "fit_intercept": True,  # @param {"label": "fit_intercept", "type": "boolean", "required": "true", "helpTip": ""}
    "standardization": True,  # @param {"label": "standardization", "type": "boolean", "required": "true", "helpTip": ""}
    "aggregation_depth": 2,  # @param {"label": "aggregation_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
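7.5.3.1.6 Multilayer Perceptron Classification
Overview
The "Multilayer Perceptron Classification" node builds a classification model based on a feedforward artificial neural network.
A feedforward artificial neural network has a unidirectional, multi-layer structure. Each layer contains a number of neurons; neurons in the same layer are not connected to each other, and information passes between layers in one direction only. The first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. A feedforward network with K+1 layers can be written in matrix form as follows, where X is the feature set, w_k and b_k are the weights and bias of layer k, and y is the predicted value:
$$y(X) = f_K(\dots f_2(w_2^T f_1(w_1^T X + b_1) + b_2) \dots + b_K)$$
Nodes in the intermediate layers use the sigmoid function:
$$f(z_i) = \frac{1}{1 + e^{-z_i}}$$
Nodes in the output layer use the softmax function:
$$f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}$$
The number of nodes N in the output layer equals the number of classes.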
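The forward pass described above can be sketched in plain Python. This is a minimal illustration only, not the operator's implementation: the function names, the layer layout (one weight row per node) and the weight values are all assumptions made for the example.

```python
import math


def sigmoid(values):
    """Apply the logistic sigmoid to every element."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights, bias, inputs):
    """Compute one weighted sum plus bias per node (one weight row per node)."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]


def forward(layers, features):
    """Sigmoid on every hidden layer, softmax on the output layer."""
    activation = features
    for index, (weights, bias) in enumerate(layers):
        z = affine(weights, bias, activation)
        is_output_layer = index == len(layers) - 1
        activation = softmax(z) if is_output_layer else sigmoid(z)
    return activation


# Hypothetical network: 2 input features, 3 hidden nodes, 2 output classes.
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.1, 0.2], [-0.3, 0.1, -0.2]], [0.05, -0.05]),
]
probabilities = forward(layers, [1.0, 2.0])
print(probabilities)
```

The output list has one probability per class, which is why the size of the last layer has to match the number of classes.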
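Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
A Spark pipeline model
Parameter Sub-parameter Description
input_features_str - Input column names joined into a comma-separated string, for example:
"column_a"
"column_a,column_b"
label_col - Target column
classifier_label_index_col - Name of the new column that holds the label-encoded target column. Default: "label_index"
classifier_feature_vector_col - Name of the feature vector column fed to the operator. Default: "model_features"
prediction_col - Name of the column that holds the predicted label output by the operator. Default: "prediction"
prediction_index_col - Label-index column that corresponds to the predicted label output by the operator. Default: "prediction_index"
max_iter - Maximum number of iterations. Default: 100
tol - Convergence threshold. Default: 1e-6
seed - Random seed. Default: 0
layers_str - Layer sizes joined into a comma-separated string, for example:
"2,3,4"
"3"
step_size - Step size. Default: 0.03
solver - Optimization algorithm. Supported values: l-bfgs, gd. Default: "l-bfgs"
initial_weights_str - Initial weights joined into a comma-separated string, for example:
"0.01"
"0.01,0.02,0.04"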
Example
inputs = {
"dataframe": None # @input {"label":"dataframe","type":"DataFrame"}
"type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
    "max_iter": 100,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "tol": 1e-6,  # @param {"label": "tol", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
    "seed": 0,  # @param {"label": "seed", "type": "integer", "required": "false", "range": "[0,2147483647]", "helpTip": ""}
    "layers_str": "",  # @param {"label": "layers_str", "type": "string", "required": "false", "helpTip": ""}
    "block_size": 128,
    "step_size": 0.03,  # @param {"label": "step_size", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
    "solver": "l-bfgs",  # @param {"label": "solver", "type": "enum", "required": "true", "options": "gd,l-bfgs", "helpTip": ""}
    "initial_weights_str": ""  # @param {"label": "initial_weights_str", "type": "string", "required": "false", "helpTip": ""}
}
multilayer_perception_classifier____id___ = MLSMultilayerPerceptronClassifier(**params)
multilayer_perception_classifier____id___.run()
# @output {"label":"pipeline_model","name":"multilayer_perception_classifier____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
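Several parameters of this operator (input_features_str, layers_str, initial_weights_str) are passed as comma-separated strings. The sketch below shows one way such strings could be turned into Python lists before use. The helper names are hypothetical, and how the operator itself parses these values is not documented here.

```python
def split_csv(value):
    """Split a comma-separated string, dropping blanks and surrounding spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_columns(input_features_str):
    """'column_a,column_b' -> ['column_a', 'column_b']"""
    return split_csv(input_features_str)


def parse_layers(layers_str):
    """'2,3,4' -> [2, 3, 4]; every layer size must be a positive integer."""
    sizes = [int(item) for item in split_csv(layers_str)]
    if any(size <= 0 for size in sizes):
        raise ValueError(f"layer sizes must be positive: {layers_str!r}")
    return sizes


def parse_weights(initial_weights_str):
    """'0.01,0.02,0.04' -> [0.01, 0.02, 0.04]"""
    return [float(item) for item in split_csv(initial_weights_str)]


print(parse_columns("column_a,column_b"))
print(parse_layers("2,3,4"))
print(parse_weights("0.01,0.02,0.04"))
```

An empty string yields an empty list in all three helpers, which matches the empty defaults used in the sample.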
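inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
Parameters
Parameter Sub-parameter Description
b_use_default_encoder - Whether to use the default encoding. Default: True
input_features_str - Input column names joined into a comma-separated string, for example:
"column_a"
"column_a,column_b"
label_col - Target column
classifier_label_index_
prediction_col - Name of the column that holds the predicted label output by the operator. Default: "prediction"
prediction_index_col - Label-index column that corresponds to the predicted label output by the operator. Default: "prediction_index"
smoothing - Smoothing parameter. Default: 1.0
model_type - Model type. Supported values: multinomial, bernoulli. Default: "multinomial"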
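To illustrate what the smoothing parameter does, here is a minimal sketch that assumes it is additive (Laplace) smoothing in a multinomial model, as in Spark's naive Bayes. The function name and the counts are made up for the example.

```python
def smoothed_likelihoods(feature_counts, smoothing=1.0):
    """Estimate P(feature | class) for one class with additive smoothing."""
    total = sum(feature_counts.values())
    vocabulary_size = len(feature_counts)
    denominator = total + smoothing * vocabulary_size
    return {
        feature: (count + smoothing) / denominator
        for feature, count in feature_counts.items()
    }


# Hypothetical per-class feature counts; "b" was never observed in this class.
counts = {"a": 3, "b": 0, "c": 1}
likelihoods = smoothed_likelihoods(counts, smoothing=1.0)
print(likelihoods)
```

With smoothing set to 1.0 the unseen feature "b" still receives a non-zero probability, so a single unseen feature does not force the whole class probability to zero.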
Example
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
    "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
    "outer_pipeline_stages": None,
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": ""}
    "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "true", "helpTip": ""}
# @output {"label":"pipeline_model","name":"naive_bayes_classifier____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
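7.5.3.1.8 Random Forest Classification
Overview
The "Random Decision Forest Classification" node produces binary or multiclass classification models. A random decision forest builds a forest model in a randomized way: the forest consists of many decision trees, and the trees are independent of one another. When a new sample arrives, every decision tree in the forest classifies it separately, and the sample is assigned to the class chosen by the most trees.
The decision tree algorithm used in random decision forest classification measures how ordered a set is with Gini impurity or entropy, and uses that measure to evaluate each split.
● Gini impurity is the expected error rate when a result drawn at random from the set is applied to a data item in the set. It is computed as:
$$\sum_{i=1}^{C} f_i (1 - f_i)$$
● Entropy is a concept from information theory that expresses how disordered a set is: the larger the entropy, the more disordered the set, and the smaller the entropy, the more ordered the set. It is computed as:
$$\sum_{i=1}^{C} -f_i \log(f_i)$$
$f_i$ is the proportion of samples that belong to class i, and C is the number of classes.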
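The two impurity measures can be computed directly from a list of class labels. The sketch below is illustrative only: the function names are hypothetical, and base-2 logarithms are an assumption (the base only rescales the entropy value).

```python
import math
from collections import Counter


def class_fractions(labels):
    """Return f_i, the share of samples in each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini(labels):
    """Gini impurity: sum of f_i * (1 - f_i) over all classes."""
    return sum(f * (1.0 - f) for f in class_fractions(labels))


def entropy(labels):
    """Entropy: sum of -f_i * log2(f_i) over all classes."""
    return sum(-f * math.log2(f) for f in class_fractions(labels))


mixed = ["a", "a", "b", "b"]
pure = ["a", "a", "a"]
print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

Both measures are zero for a set with a single class and grow as the classes become more evenly mixed, which is why a split that lowers them is considered an improvement.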
Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
A Spark pipeline model
Parameter Sub-parameter Description
input_features_str - Input column names joined into a comma-separated string, for example:
"column_a"
"column_a,column_b"
label_col - Target column
classifier_label_index_col - Target column passed to the classifier; must be a numeric column
classifier_feature_vector_col - Feature column passed to the classifier; must be a vector column
prediction_col - Name of the column that holds the predicted label output by the operator. Default: "prediction"
prediction_index_col - Label-index column that corresponds to the predicted label output by the operator. Default: "prediction_index"
max_depth - Maximum depth of a tree. Default: 5
max_bins - Maximum number of bins. Default: 32
min_instances_per_node - Minimum number of instances each child node must contain for a node to be split. Default: 1
min_info_gain - Minimum information gain required for a node to be split. Default: 0
impurity - Criterion used to compute information gain. Supported values: entropy, gini. Default: "gini"
num_trees - Number of trees. Default: 20
feature_subset_strategy - Strategy for choosing the feature columns considered at each node split. Supported values: auto, all, onethird, sqrt, log2, n. Default: "all"
subsampling_rate - Fraction of the training set used to learn each decision tree. Default: 1.0
seed - Random seed. Default: 0
    "label_col": "",  # @param {"label": "label_col", "type": "string", "required": "true", "helpTip": "target label column"}
    "classifier_label_index_col": "label_index",  # @param {"label": "classifier_label_index_col", "type": "string", "required": "true", "helpTip": ""}
    "classifier_feature_vector_col": "model_features",  # @param {"label": "classifier_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_index_col": "prediction_index",  # @param {"label": "prediction_index_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
    "max_depth": 5,  # @param {"label": "max_depth", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "max_bins": 32,  # @param {"label": "max_bins", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "min_instances_per_node": 1,  # @param {"label": "min_instances_per_node", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "min_info_gain": 0.0,  # @param {"label": "min_info_gain", "type": "number", "required": "true", "range": "[0,none)", "helpTip": ""}
    "impurity": "gini",  # @param {"label": "impurity", "type": "enum", "required": "true", "options": "entropy,gini", "helpTip": ""}
    "num_trees": 20,  # @param {"label": "num_trees", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "feature_subset_strategy": "all",  # @param {"label": "feature_subset_strategy", "type": "enum", "required": "true", "options":"auto,all,onethird,sqrt,log2", "helpTip": ""}
    "subsampling_rate": 1.0,  # @param {"label": "subsampling_rate", "type": "number", "required": "true", "range": "(0,1.0]", "helpTip": ""}
    "seed": 0  # @param {"label": "seed", "type": "integer", "required": "true", "range":"[0,2147483647]","helpTip": "seed"}
}
rf_classifier____id___ = MLSRandomForestClassifier(**params)
rf_classifier____id___.run()
# @output {"label":"pipeline_model","name":"rf_classifier____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
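The feature_subset_strategy values listed in the parameter table decide how many feature columns each node split may consider. The sketch below assumes the values carry the same meaning as in Spark MLlib's random forest; the function name is hypothetical and the operator's own behavior may differ.

```python
import math


def features_per_split(strategy, num_features, num_trees=20):
    """Number of features considered at each split for a classification forest."""
    if strategy == "auto":
        # Assumed rule: a single tree uses every feature, a forest uses sqrt.
        strategy = "all" if num_trees == 1 else "sqrt"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3)
    # Otherwise treat the value as "n": a positive integer count,
    # or a fraction in (0, 1] of the available features.
    try:
        count = int(strategy)
    except ValueError:
        count = None
    if count is not None and count > 0:
        return min(count, num_features)
    fraction = float(strategy)  # raises ValueError for non-numeric text
    if 0.0 < fraction <= 1.0:
        return math.ceil(fraction * num_features)
    raise ValueError(f"unsupported feature subset strategy: {strategy!r}")


for name in ["all", "sqrt", "log2", "onethird", "3", "0.5"]:
    print(name, features_per_split(name, 10))
```

Smaller subsets make the trees less correlated with one another, while "all" lets every split see every feature.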
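7.5.3.2 Clustering
7.5.3.2.1 Bisecting k-means
Overview
The bisecting k-means algorithm is a form of hierarchical clustering, a method commonly used in cluster analysis.
3. Use the k-means algorithm to split each divisible cluster into two clusters.
4. Repeat steps 2 and 3 until the iteration stop condition is met.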
Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
A Spark pipeline model
Parameters
Parameter Sub-parameter Description
input_features_str - Input column names joined into a comma-separated string, for example:
prediction_col - Name of the column that holds the predicted label output by the operator. Default: "prediction"
k - Desired number of clusters. Default: 2
max_iter - Maximum number of iterations. Default: 100
min_divisible_cluster_size - If the value is greater than or equal to 1, it is the minimum number of points in a divisible cluster; if the value is less than 1, it is the minimum proportion of all points that a divisible cluster must contain. Default: 1
    "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
    "cluster_feature_vector_col": "model_features",  # @param {"label": "cluster_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
    "k": 2,  # @param {"label": "k", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "max_iter": 100,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "min_divisible_cluster_size": 1.0  # @param {"label": "min_divisible_cluster_size", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
}
bisecting_kmeans____id___ = MLSBisectingKmeans(**params)
bisecting_kmeans____id___.run()
# @output {"label":"pipeline_model","name":"bisecting_kmeans____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
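The two readings of min_divisible_cluster_size (an absolute count when the value is at least 1, a proportion of all points when it is below 1) can be made concrete with a small sketch. The helper names are hypothetical, and rounding up with ceil is an assumption borrowed from Spark's bisecting k-means.

```python
import math


def min_points_to_split(min_divisible_cluster_size, total_points):
    """Smallest cluster size that is still allowed to be split."""
    if min_divisible_cluster_size <= 0:
        raise ValueError("min_divisible_cluster_size must be positive")
    if min_divisible_cluster_size >= 1:
        # Interpreted as an absolute number of points.
        return math.ceil(min_divisible_cluster_size)
    # Interpreted as a proportion of all points.
    return math.ceil(min_divisible_cluster_size * total_points)


def is_divisible(cluster_size, min_divisible_cluster_size, total_points):
    """True if a cluster of this size may be split into two."""
    return cluster_size >= min_points_to_split(min_divisible_cluster_size, total_points)


print(min_points_to_split(1.0, 1000))
print(min_points_to_split(0.25, 1000))
print(is_divisible(100, 0.25, 1000))
```

With the default of 1, every non-empty cluster may be split; a proportion such as 0.25 stops the algorithm from splitting clusters that hold less than a quarter of the data.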
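7.5.3.2.2 Gaussian Mixture Model
Overview
The Gaussian mixture model (GMM) is a clustering algorithm widely used in industry. It uses Gaussian distributions as the parametric model and is trained with the expectation-maximization (EM) algorithm.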
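As a rough illustration of EM training, the sketch below fits a one-dimensional Gaussian mixture in plain Python. It is a simplified teaching example, not the operator's implementation: the function names, the starting means and the data are made up, and treating tol as a threshold on the change in log-likelihood is an assumption.

```python
import math


def normal_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def fit_gmm_1d(data, initial_means, max_iter=100, tol=0.01):
    """Fit a 1-D Gaussian mixture with EM, one component per initial mean."""
    means = list(initial_means)
    k = len(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    previous_ll = None
    for _ in range(max_iter):
        # E-step: responsibility of each component for each point.
        responsibilities = []
        log_likelihood = 0.0
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            log_likelihood += math.log(total)
            responsibilities.append([p / total for p in joint])
        # Stop once the log-likelihood changes by less than tol.
        if previous_ll is not None and abs(log_likelihood - previous_ll) < tol:
            break
        previous_ll = log_likelihood
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            mass = sum(r[j] for r in responsibilities)
            weights[j] = mass / len(data)
            means[j] = sum(r[j] * x for r, x in zip(responsibilities, data)) / mass
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(responsibilities, data)) / mass
            variances[j] = max(spread, 1e-6)  # floor keeps the density finite
    return weights, means, variances


# Hypothetical data: two well-separated groups around 0 and 5.
data = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
weights, means, variances = fit_gmm_1d(data, initial_means=[1.0, 4.0])
print(weights, means, variances)
```

Unlike k-means, each point receives a soft membership in every component, which is what a probability output column would report.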
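Input
Parameter Sub-parameter Description
inputs dataframe inputs is a dictionary; dataframe is a pyspark DataFrame object
Output
A Spark pipeline model
Parameters
Parameter Sub-parameter Description
input_features_str - Input column names joined into a comma-separated string, for example:
"column_a"
"column_a,column_b"
k - Number of clusters. Default: 2
max_iter - Maximum number of iterations. Default: 100
tol - Convergence threshold. Default: 0.01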
Example
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,
    "outer_pipeline_stages": None,
    "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
    "cluster_feature_vector_col": "model_features",  # @param {"label": "cluster_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
    "probability_col": "probability",  # @param {"label": "probability_col", "type": "string", "required": "true", "helpTip": ""}
    "k": 2,  # @param {"label": "k", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "max_iter": 100,  # @param {"label": "max_iter", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "tol": 0.01  # @param {"label": "tol", "type": "number", "required": "true", "range": "(0,none)", "helpTip": ""}
}
gaussian_mixture____id___ = MLSGaussianMixture(**params)
gaussian_mixture____id___.run()
# @output {"label":"pipeline_model","name":"gaussian_mixture____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
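The @param annotations in the samples carry a range field such as "(0,2147483647]" or "(0,none)". The sketch below shows one way to check a value against such a string before running an operator. The helper name is hypothetical, and reading "none" as "no bound on that side" is an assumption.

```python
def in_range(value, spec):
    """Check a number against an interval string such as "(0,2147483647]" or "[0,none)"."""
    spec = spec.strip()
    if len(spec) < 2 or spec[0] not in "([" or spec[-1] not in ")]":
        raise ValueError(f"malformed range: {spec!r}")
    low_text, high_text = (part.strip().lower() for part in spec[1:-1].split(","))
    if low_text != "none":
        low = float(low_text)
        # "(" excludes the bound itself, "[" includes it.
        if value < low or (value == low and spec[0] == "("):
            return False
    if high_text != "none":
        high = float(high_text)
        # ")" excludes the bound itself, "]" includes it.
        if value > high or (value == high and spec[-1] == ")"):
            return False
    return True


print(in_range(100, "(0,2147483647]"))  # a max_iter value
print(in_range(0.01, "(0,none)"))       # a tol value
print(in_range(0, "(0,2147483647]"))    # k must be greater than 0
```

A check like this could run over the params dictionary before the operator is constructed, so that an out-of-range value is reported early.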
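Parameters
Parameter Sub-parameter Description
b_use_default_encoder - Whether to use the default encoding. Default: True
input_features_str - Input column names joined into a comma-separated string, for example:
"column_a"
"column_a,column_b"
cluster_feature_vector_col - Name of the feature vector column fed to the operator. Default: "model_features"
prediction_col - Prediction column output by the pyspark kmeans clusterer
k - Number of clusters. Default: 2
init_mode - Initialization algorithm used for clustering: random or k-means. Default: "random"
init_steps - Number of steps for the k-means|| initialization mode. Default: 2
max_iter - Maximum number of iterations. Default: 20
tol - Convergence threshold of the iterative algorithm. Default: 1e-4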
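How max_iter and tol interact can be seen in a minimal one-dimensional k-means loop. This is an illustrative sketch only: the function name and data are made up, and stopping when no center moves by more than tol is an assumption about the convergence test.

```python
def kmeans_1d(points, initial_centers, max_iter=20, tol=1e-4):
    """Run Lloyd's k-means on 1-D points; return final centers and iterations used."""
    centers = list(initial_centers)
    iterations_run = 0
    for _ in range(max_iter):
        iterations_run += 1
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)), key=lambda j: abs(point - centers[j]))
            clusters[nearest].append(point)
        # Update step: move each center to the mean of its points.
        new_centers = [
            sum(cluster) / len(cluster) if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
        shift = max(abs(old - new) for old, new in zip(centers, new_centers))
        centers = new_centers
        # Converged: no center moved by more than tol.
        if shift <= tol:
            break
    return centers, iterations_run


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, iterations_run = kmeans_1d(points, initial_centers=[0.0, 5.0])
print(centers, iterations_run)
```

The loop ends on whichever comes first: the centers stop moving (tol) or the iteration budget is used up (max_iter).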
Example
inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,  # @param {"label": "b_use_default_encoder", "type": "boolean", "required": "true", "helpTip": ""}
    "outer_pipeline_stages": None,
    "input_features_str": "",  # @param {"label": "input_features_str", "type": "string", "required": "false", "helpTip": ""}
    "cluster_feature_vector_col": "model_features",  # @param {"label": "cluster_feature_vector_col", "type": "string", "required": "true", "helpTip": ""}
    "prediction_col": "prediction",  # @param {"label": "prediction_col", "type": "string", "required": "true", "helpTip": ""}
    "k": 2,  # @param {"label": "k", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
    "init_mode": "random",  # @param {"label": "init_mode", "type": "string", "required": "true", "options": "random,k-means", "helpTip": ""}
    "init_steps": 2,  # @param {"label": "init_steps", "type": "integer", "required": "true", "range": "(0,2147483647]", "helpTip": ""}
"max_iter": 20, # @param {"label": "max_iter", "type": "integer", "required": "true", "range":