第⼆節、研究對象與資料搜集⽅法
第四節、 研究實施步驟
(ㄧ)斷詞、詞庫建⽴與簡繁體轉換
在進⾏資料處理前,中⽂⽂字資料須經過斷詞步驟。在取得各媒體純⽂本資料 後,本研究先以中央研究院詞庫⼩組提供之中⽂斷詞系統 CKIP 進⾏斷詞(Tsai&
Chen, 2004)。依據斷詞的結果以及初步⽂字雲的視覺呈現決定下⼀步使⽤ R 軟體 內 Rwordseg 套件斷詞時之詞庫建⽴應包含哪些字詞。如研究對象中 ShareCourse 的「物聯網」課程,在⽂字雲呈現以及 CKIP 斷詞結果皆會出現部分「聯網」的字 詞統計。在 Rwordseg 的詞庫⽂件中直接輸⼊「物聯網」可避免此字詞被拆解成「物」
+「聯網」。另中國的簡體字媒體內容則先透過 Microsoft Word 中的簡轉繁體功能 轉換成繁體字⽂字資料,確保 R 軟體的⽂字分析結果可以正常顯⽰與輸出。但此 簡轉繁步驟會導致部份詞彙字⾯上的字詞轉換,本研究範圍內之重要關鍵字詞整 理如下表三,在研究結果分析時可參照此表。
表 3
簡體字轉換繁體之字詞對照表
原簡體字詞 ⼤數據 學堂在線 信息 網絡 在線 繁體字結果 ⼤資料 學堂線上 訊息 網路 線上
(⼆)貼⽂與報導數量迴歸分析
⾸先在 R 軟體腳本欄位輸⼊聯合新聞網與⼈民網在 2013~2017 的歷年的報導 數量,這兩個媒體研究對象的數值是台灣與中國社群媒體對應的迴歸分析的對象。
依次再輸⼊更改各社群平台的歷年貼⽂數量變化,可以很快速透過 R 軟體運算得 知其正負相關性與相關顯著程度。如以⼈民網與 MOOC 學院之貼⽂數量作為程式 碼範例:
a<-c(87, 93, 40 , 5) b<-c(720, 1393, 3163, 309)
x<-data.frame(a,b)
result<-lm(data = x)
summary(result)
結果顯⽰:
Call:
lm(data = x)
Residuals:
1 2 3 4 5 37.133 -13.973 8.908 -3.221 -28.846
Coefficients:
Estimate Std. Error t value Pr(>|t|) (Intercept) 33.67636 19.11851 1.761 0.176 b 0.02832 0.01411 2.007 0.138
Residual standard error: 28.84 on 3 degrees of freedom
Multiple R-squared: 0.5732, Adjusted R-squared: 0.4309 F-statistic: 4.029 on 1 and 3 DF, p-value: 0.1384
R 相關係數為 0.4309,p 值為 0.1384,顯⽰⼈民網與 MOOC 學院的貼⽂數量變化 並無顯著相關性。24
(三)語意分析相關視覺化呈現
此研究步驟會運⽤到相關的 R 軟體套件,即使在安裝之後,在 R 軟體內開始 匯⼊資料分析之前需先使⽤ library 命令調⽤相關的套件名稱。如下:
library(tm) library(Rwordseg)
⽽本步驟涉及六個⽅法,呈現五個結果,可以開啟七個 R 軟體專案檔輸⼊以便管 理。這六個⽅法分別是:TF-IDF 字詞權重計算、SVD 奇異值分解語意散布圖、
dendrogram 字詞群聚樹狀圖、apriori 關聯分析共現詞組、LDA 字詞主題模型
(Gibbs)、模型複雜度與最佳主題數量計算。以下皆以研究對象 MOOC 學院為操 作程式碼範例:
1. TF-IDF 字詞權重計算
library(tm) library(Rwordseg)
csv<- read.csv("C:/Users/green/Desktop/Target/data/weibo_MOOC學院.csv",colClasses="character") seg_words <- lapply(csv$content, segmentCN)
doc.list <- strsplit(as.character(seg_words), split=" ")
dg.corpus <- gsub("'", "", doc.list)
dg1.corpus <- gsub("[[:punct:]]", " ", dg.corpus) dg2.corpus <- gsub("[[:cntrl:]]", " ", dg1.corpus) dg3.corpus <- gsub("^[[:space:]]+", "", dg2.corpus) dg4.corpus <- gsub("[[:space:]]+$", "", dg3.corpus)
24 p-value<0.05。此研究步驟由於台⼤ MOOC 與 Openedu 之平台資料只從 2015 年開始,樣本數
過少故不列⼊計算。
dg5.corpus <- gsub("[[0-9]]", " ", dg4.corpus)
wordcorpus <- VCorpus(VectorSource(dg5.corpus))
tdm <- TermDocumentMatrix(wordcorpus, control = list(wordLengths = c(2, Inf)))
tdm.tfidf<-weightTfIdf(tdm, normalize = T) dtm<-as.matrix(tdm.tfidf)
v<-sort(rowSums(dtm), decreasing = T) d<-data.frame(word=names(v),tfidf=v)
write.csv(d, file = "C:/Users/green/Desktop/Target/results/tfidf/weibo_MOOC學院.csv")
gsub 函數協助清除掉檔案中無意義的字詞與數字,最後取 2 個字詞⾧度以上的詞 彙做計算,將結果直接寫⼊預設的檔案位置,建⽴ csv 檔案。
2. SVD 奇異值分解語意散布圖
library(tm) library(Rwordseg) library(ggplot2)
csv<- read.csv("C:/Users/green/Desktop/Target/data/weibo_MOOC學院.csv",colClasses="character") seg_words <- lapply(csv$content, segmentCN)
doc.list <- strsplit(as.character(seg_words), split=" ")
dg.corpus <- gsub("'", "", doc.list)
dg1.corpus <- gsub("[[:punct:]]", " ", dg.corpus) dg2.corpus <- gsub("[[:cntrl:]]", " ", dg1.corpus) dg3.corpus <- gsub("^[[:space:]]+", "", dg2.corpus) dg4.corpus <- gsub("[[:space:]]+$", "", dg3.corpus) dg5.corpus <- gsub("[[0-9]]", " ", dg4.corpus)
wordcorpus <- VCorpus(VectorSource(dg5.corpus))
tdm <- TermDocumentMatrix(wordcorpus, control = list(wordLengths = c(2, Inf))) tdm.tfidf<- weightTfIdf(tdm)
res<- svd(tdm.tfidf) nrow(res$u) ncol(res$v)
datau<- data.frame(res$u[,2:3]) datav<- data.frame(res$v[,2:3])
p<-ggplot()+geom_point(data= datav, aes(X1, X2), size=2, color ='red')+geom_text(data= datav, aes(X1, X2),label=1:5000, vjust=1.5)+theme_bw()
print(p)
csv<- read.csv("C:/Users/green/Desktop/Target/data/weibo_MOOC學院.csv",colClasses="character") seg_words <- lapply(csv$content, segmentCN)
doc.list <- strsplit(as.character(seg_words), split=" ")
dg.corpus<-gsub("\\b\\w{1,1}\\b","",doc.list) dg1.corpus <- gsub("'", "", dg.corpus)
dg2.corpus <- gsub("[[:punct:]]", " ", dg1.corpus) dg3.corpus <- gsub("[[:cntrl:]]", " ", dg2.corpus) dg4.corpus <- gsub("^[[:space:]]+", "", dg3.corpus) dg5.corpus <- gsub("[[:space:]]+$", "", dg4.corpus) dg6.corpus <- gsub("[[0-9]]", "", dg5.corpus)
wordcorpus <- VCorpus(VectorSource(dg6.corpus))
dtm <- DocumentTermMatrix(wordcorpus, control = list(wordLengths = c(2, Inf))) dtm01 <- weightTfIdf(dtm)
dtm02 <- removeSparseTerms(dtm01, 0.95) tdm = as.TermDocumentMatrix(dtm02) tdm <- weightTfIdf(tdm)
mydata.df <- as.matrix(tdm) mydata.df.scale <- scale(mydata.df) d<- dist(mydata.df.scale)
fit <- hclust(d) plot(fit)
gsub("\\b\\w{1,1}\\b","",doc.list)將斷詞後的單獨字詞,即字詞⾧度 1 以下的詞彙刪 除,以利視覺呈現結果。removeSparseTerms 函數可以清除掉字詞矩陣內的空
csv<- read.csv("C:/Users/green/Desktop/Target/data/weibo_MOOC學院.csv",colClasses="character") seg_words <- lapply(csv$content, segmentCN)
word_cut<- function(x, n=1){
x<-gsub("[[0-9]]","",x) x<-gsub("http","",x) x<-gsub("www","",x) x<-gsub("php","",x) x<-gsub("admin","",x) x<-gsub("tool","",x) x<-gsub("org","",x)
x<-gsub("\\b\\w{1,2}\\b","",x) x[nchar(x)>n]
}
seg_words2<-lapply(seg_words, word_cut)
wordcorpus <- Corpus(VectorSource(seg_words2))
cw.len1<-lapply(wordcorpus, length) cw.len2<-unlist(cw.len1)
words2<- seg_words2[cw.len2>0]
words_s<-lapply(words2, as.factor) words_s<-lapply(words2, unique)
trans<- as(words_s, "transactions")
items<- apriori(trans, parameter = list(supp=0.01, conf=0.01, minlen=2, target="frequent itemsets"), control = list(verbose=F))
plot(items, method = "graph",control = list(type="items", main="",cex=2, nodeCol =c("green")))
as(sort(items)[1:10], "data.frame")
運⽤ gsub 函數去除掉⽂本中已知的雜訊,如:http、www、admin、php 等,以及 刪去字詞⾧度 2 以下的詞彙。最後列出前 10 組最為相關且同時出現的字詞組。
csv<- read.csv("C:/Users/green/Desktop/Target/data/weibo_MOOC學院.csv",colClasses="character") seg_words <- lapply(csv$content, segmentCN)
doc.list <- strsplit(as.character(seg_words), split=" ")
dg.corpus <- gsub("'", "", doc.list)
dg1.corpus<-gsub("\\b\\w{1,1}\\b","",doc.list) dg2.corpus <- gsub("[[:punct:]]", " ", dg1.corpus) dg3.corpus <- gsub("[[:cntrl:]]", " ", dg2.corpus) dg4.corpus <- gsub("^[[:space:]]+", "", dg3.corpus) dg5.corpus <- gsub("[[:space:]]+$", "", dg4.corpus)
dg6.corpus <- gsub("[[0-9]]", " ", dg5.corpus)
wordcorpus <- VCorpus(VectorSource(dg6.corpus))
tdm <- TermDocumentMatrix(wordcorpus, control = list(wordLengths = c(2, Inf))) dtm<-DocumentTermMatrix(wordcorpus, control = list(wordLengths = c(2, Inf)))
term_tfidf <-tapply(dtm$v/row_sums(dtm)[dtm$i], dtm$j, mean) * log2(nDocs(dtm)/col_sums(dtm >
0))
l1=term_tfidf >= quantile(term_tfidf, 0.99) dtm <- dtm[,l1]
dtm = dtm[row_sums(dtm)>0, ]; dim(dtm) summary(col_sums(dtm))
k = 30 SEED <- 2003 jss_TM2 <- list(
VEM = LDA(dtm, k = k, control = list(seed = SEED)), save(jss_TM2, file = paste(getwd(), "/jss_TM2.Rdata", sep = ""))
save(jss_TM, file = paste(getwd(), "/jss_TM1.Rdata", sep = ""))
termsForSave1<- terms(jss_TM2[["VEM"]], 5) termsForSave2<- terms(jss_TM2[["VEM_fixed"]], 5) termsForSave3<- terms(jss_TM2[["Gibbs"]], 5) termsForSave4<- terms(jss_TM2[["CTM"]], 5)
write.csv(as.data.frame(t(termsForSave1)),
paste(getwd(), "/topic-document_", "_VEM_", k, "_2.csv", sep=""), fileEncoding = "UTF-8")
write.csv(as.data.frame(t(termsForSave2)),
paste(getwd(), "/topic-document_", "_VEM_fixed_", k, "_2.csv", sep=""), fileEncoding = "UTF-8")
write.csv(as.data.frame(t(termsForSave3)),
paste(getwd(), "/topic-document_", "_Gibbs_", k, "_2.csv", sep=""), fileEncoding = "UTF-8")
write.csv(as.data.frame(t(termsForSave4)),
paste(getwd(), "/topic-document_", "_CTM_", k, "_2.csv", sep=""), fileEncoding = "UTF-8")
tfs = as.data.frame(termsForSave3, stringsAsFactors = F); tfs[,1]
adjacent_list = lapply(1:30, function(i) embed(tfs[,i], 2)[, 2:1])
edgelist = as.data.frame(do.call(rbind, adjacent_list), stringsAsFactors =F) topic = unlist(lapply(1:30, function(i) rep(i, 4)))
edgelist$topic = topic
g <-graph.data.frame(edgelist,directed=T ) l<-layout.fruchterman.reingold(g)
edge.color="black"
nodesize = centralization.degree(g)$res V(g)$size = log( centralization.degree(g)$res )
nodeLabel = V(g)$name
E(g)$color = unlist(lapply(sample(colors()[26:137], 10), function(i) rep(i, 9))); unique(E(g)$color)
png(paste(getwd(), "/topic_graph_gibbs_", g, "XX",".png", sep=''),width=5, height=5, units="in", res=700)
plot(g, vertex.label= nodeLabel, edge.curved=TRUE, vertex.label.cex =0.5, edge.arrow.size=0.2, layout=l )
csv<- read.csv("C:/Users/green/Desktop/Target/data/weibo_MOOC學院.csv",colClasses="character")
seg_words <- lapply(csv$content, segmentCN)
doc.list <- strsplit(as.character(seg_words), split=" ")
dg.corpus<-gsub("\\b\\w{1,1}\\b","",doc.list) dg1.corpus <- gsub("'", "", dg.corpus)
dg2.corpus <- gsub("[[:punct:]]", " ", dg1.corpus) dg3.corpus <- gsub("[[:cntrl:]]", " ", dg2.corpus) dg4.corpus <- gsub("^[[:space:]]+", "", dg3.corpus) dg5.corpus <- gsub("[[:space:]]+$", "", dg4.corpus) dg6.corpus <- gsub("[[0-9]]", "", dg5.corpus)
wordcorpus <- VCorpus(VectorSource(dg6.corpus))
dtm<-DocumentTermMatrix(wordcorpus, control = list(wordLengths = c(2, Inf)))
fold_num = 10
kv_num = c(5, 10*c(1:5, 10)) seed_num = 2003
try_num = 1
smp<-function(cross=fold_num,n,seed) {
set.seed(seed) dd=list()
aa0=sample(rep(1:cross,ceiling(n/cross))[1:n],n) for (i in 1:cross) dd[[i]]=(1:n)[aa0==i]
return(dd) }
selectK<-function(dtm,kv=kv_num,SEED=seed_num,cross=fold_num,sp) # change 60 to 15 {
cat("R is running for", "topic", k, "fold", i,
return(list(perplex=per_ctm,loglik=log_ctm)) }
sp=smp(n=dtm$nrow, seed=seed_num) # n = nrow(dtm)
system.time((ctmK=selectK(dtm=dtm,kv=kv_num,SEED=seed_num,cross=fold_num,sp=sp)))
## plot the perplexity
m_per=apply(ctmK[[1]],1,mean) m_log=apply(ctmK[[2]],1,mean)
k=c(kv_num)
df = ctmK[[1]] # perplexity matrix logLik = ctmK[[2]] # perplexity matrix
write.csv(data.frame(k, df, logLik), paste(getwd(), "/Perplexity2_","gibbs5_100", ".csv", sep = ""))
# save the figure
png(paste(getwd(), "/Perplexity2_",try_num, "_gibbs5_100",".png", sep = ''), width=5, height=5,
units="in", res=700)
matplot(k, df, type = c("b"), xlab = "Number of topics",
ylab = "Perplexity", pch=1:try_num,col = 1, main = '') legend("topright", legend = paste("fold", 1:try_num), col=1, pch=1:try_num)
dev.off()
png(paste(getwd(), "/LogLikelihood2_", "gibbs5_100",".png", sep = ''), width=5, height=5,
units="in", res=700)
matplot(k, logLik, type = c("b"), xlab = "Number of topics",
ylab = "Log-Likelihood", pch=1:try_num,col = 1, main = '') legend("topright", legend = paste("fold", 1:try_num), col=1, pch=1:try_num) dev.off()
此步驟呈現的兩張圖表可藉此判斷此資料範圍內最適合之主題數量,即是上⼀個 LDA 字詞主題模型⽅法中 k=30,數值的由來。
(四)議題主題趨勢分析
此步驟為驗證式資料分析,直接將資料作為驗證研究假說的來源。個⼈化學習、
企業培訓以及⾼等教育改⾰三個議題類⽬對於台灣、中國的社群媒體與傳統報業 媒體作時序分佈統計。此研究設計於時間點前後期之設定蔚為關鍵,本研究以 MOOC 平台⿓頭,使⽤者最多的 Coursera,在 2015 年 2 ⽉宣佈與其他企業合作開 設企業專屬的專項課程為標注前、後期之時間點。使⽤ Word2vec 在 R 軟體中的開 源版本套件 wordVectors 訓練電腦辨別整體研究對象形成的百萬字的⽂字檔
「all_word2vec.txt」中,與議題主題相關的字詞。
library(tm) library(Rwordseg)
library(wordVectors)
d.corpus <- Corpus(DirSource("data"), list(language = NA)) seg_words <- lapply(d.corpus, segmentCN)
word2vec_split = sapply(seg_words, paste, collapse = " ")
writeLines(word2vec_split, "all_word2vec.txt")
model = train_word2vec("all_word2vec.txt", output="hy.bin", vectors=500,threads=4 ,window=12,iter=5,negative_samples=0)
vec = read.vectors("hy.bin")
model %>% closest_to("改⾰") model %>% closest_to("教育") model %>% closest_to("企業") model %>% closest_to("培訓") model %>% closest_to("使⽤者") model %>% closest_to("學習")
此步驟後再以 R 軟體內的正則表⽰式(regular expression)中的 grep 函數辨別各個 研究對象中有無此字詞存在,若存在則會回報字詞所在的貼⽂欄位編號或是報導 編號。最後以 dplyr 套件中的 union 函數取得各個議題字詞統計的聯集,此數值再 計算其與貼⽂或報導總數量則可獲得該議題⽐例(詳細操作步驟以及研究資料可 連結⾄