21-02 Web Scraping and Word Segmentation with R


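李智慎, Associate Statistical Analyst

This issue shows how to use R to extract data from web pages, a task commonly called web scraping or crawling. The example is the Gossiping board of PTT, Taiwan's largest online forum. The rvest package fetches the article content and the jiebaR package segments the text into words.

Before scraping, you need to understand the overall structure of the site, so start by opening the Gossiping board index page:

https://www.ptt.cc/bbs/Gossiping/index.html

As the figure above shows, the first visit to the Gossiping board lands on a page asking you to confirm that you are at least 18 years old. Only after clicking "I agree" (我同意) do you reach the board's article index, shown below.

How can an R program get past this verification page? First, install and load the packages that will be used:

#install.packages(c("tidyverse", "rvest", "stringr", "jiebaR", "tmcn"))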

library(tidyverse)


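library(rvest) # not attached by library(tidyverse), so load it explicitly
library(stringr)
library(jiebaR)
library(tmcn)

The packages are used as follows:

- tidyverse: bundles data-manipulation (dplyr) and plotting (ggplot2) packages, among others
- rvest: web page parsing
- stringr: string handling
- jiebaR: Chinese word segmentation
- tmcn: Chinese text resources (stop words, conversion to Traditional Chinese)

Set up the PTT URL, read it with html_session, and store the result as gossiping.session. (html_session, submit_form, and jump_to are the function names in the rvest version used here; rvest 1.0.0 and later deprecate them in favor of session, session_submit, and session_jump_to.)

ptt.url <- "https://www.ptt.cc"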

gossiping.url <- str_c(ptt.url, "/bbs/Gossiping")
gossiping.url

[1] "https://www.ptt.cc/bbs/Gossiping"

gossiping.session <- html_session(url = gossiping.url)
gossiping.session

<session> https://www.ptt.cc/ask/over18?from=%2Fbbs%2FGossiping%2Findex.html
  Status: 200
  Type: text/html; charset=utf-8
  Size: 2411

The url argument points to the Gossiping board, but because this is the first visit the session is redirected to the verification page. That is why the URL printed at the top of gossiping.session is the verification page's address. The remaining steps are:

1. Find the verification form

gossiping.form <- gossiping.session %>%
  html_node("form") %>%
  html_form()
gossiping.form

<form> '<unnamed>' (POST /ask/over18)
  <input hidden> 'from': /bbs/Gossiping/index.html
  <button submit> 'yes
  <button submit> 'no


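2. Submit the form with the "yes" button

gossiping <- submit_form(
  session = gossiping.session,
  form = gossiping.form,
  submit = "yes"
)
gossiping

<session> https://www.ptt.cc/bbs/Gossiping/index.html
  Status: 200
  Type: text/html; charset=utf-8
  Size: 8661

After the confirmation form is submitted, the URL of gossiping has changed to the Gossiping board index. Now open that page in a browser to look at its overall structure. To view the source, right-click an empty area of the page and choose "View page source", as in the figure below.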


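The a elements inside the red box in the figure above correspond to the article titles on the page shown in the figure below. The next task is to save these links one by one and then visit each link to collect the text of each part of the article. A page lists only a little over ten articles, so we also have to move between pages to collect links. The page URLs follow a regular pattern that we can exploit:

Index page: https://www.ptt.cc/bbs/Gossiping/index.html
Previous page (linked from the index): https://www.ptt.cc/bbs/Gossiping/index25436.html

The word index is followed by a number, so counting that number down and visiting each page in turn collects the article links from every page. The code is as follows.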


page.latest <- gossiping %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  str_subset("index[0-9]{2,}\\.html") %>%
  str_extract("[0-9]+") %>%
  as.numeric()

page.latest is the most recent page number, which has to be looked up on the index page gossiping. The steps are:

1. html_nodes("a"): select all a elements
2. html_attr("href"): take each a element's href attribute, that is, the link URL
3. str_subset("index[0-9]{2,}\\.html"): keep the links in which index is followed by a run of at least two digits
4. str_extract("[0-9]+"): extract the numeric part of the link
5. as.numeric(): convert it to a number

These steps yield the page number page.latest. Strictly speaking, the newest page is served as index.html and carries no number, so the number captured here comes from the index page's numbered navigation link (the previous page). With page.latest in hand, a loop collects the article links from each page.

links.article <- NULL
page.length <- 5

for (page.index in page.latest:(page.latest - page.length)) {
  link <- str_c(gossiping.url, "/index", page.index, ".html")
  print(link)
  links.article <- c(
    links.article,
    gossiping %>%
      jump_to(link) %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      # [A-Za-z] instead of [A-z]: the range A-z also matches [ \ ] ^ _ `
      str_subset("[A-Za-z]\\.[0-9]+\\.[A-Za-z]\\.[A-Za-z0-9]+\\.html")
  )
}

[1] "https://www.ptt.cc/bbs/Gossiping/index25439.html"
[1] "https://www.ptt.cc/bbs/Gossiping/index25438.html"
[1] "https://www.ptt.cc/bbs/Gossiping/index25437.html"
[1] "https://www.ptt.cc/bbs/Gossiping/index25436.html"
[1] "https://www.ptt.cc/bbs/Gossiping/index25435.html"
[1] "https://www.ptt.cc/bbs/Gossiping/index25434.html"


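links.article holds the collected article links and page.length controls how many pages back to go. The range page.latest:(page.latest - page.length) includes both ends, so with page.length <- 5 the loop visits six pages, as the output above shows. Each iteration does the following:

1. link: the URL of the page for the current iteration
2. jump_to(link): navigate to that page
3. html_nodes("a") & html_attr("href"): take the href attribute of every a element
4. str_subset(...): keep only the links that match the article URL format (a letter, a run of digits, a letter, and an alphanumeric code, separated by dots and ending in .html)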

links.article <- unique(links.article)

unique removes any duplicate links.
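The link filters described above can be tried offline before touching the network. The following is a minimal sketch in base R (grep and regmatches rather than the stringr calls used in the article). The href values are made up to resemble those on a board index page, and the three-page range is an arbitrary choice for illustration.

```r
# Hypothetical href values, shaped like those on a PTT board index page.
hrefs <- c(
  "/bbs/Gossiping/index1.html",               # "oldest" link
  "/bbs/Gossiping/index25438.html",           # "previous page" link
  "/bbs/Gossiping/index.html",                # "newest" link
  "/bbs/Gossiping/M.1508812345.A.1B2.html",
  "/bbs/Gossiping/M.1508812345.A.1B2.html",   # duplicate on purpose
  "/bbs/Gossiping/M.1508812400.A.9F0.html",
  "/bbs/Gossiping/search"
)

# Keep index links with at least two digits, then pull out the number.
index.links <- grep("index[0-9]{2,}\\.html", hrefs, value = TRUE)
page.number <- as.numeric(regmatches(index.links, regexpr("[0-9]+", index.links)))

# Build the URLs of that page and the two pages before it.
page.urls <- paste0(
  "https://www.ptt.cc/bbs/Gossiping/index",
  page.number:(page.number - 2),
  ".html"
)

# Keep only article links, then drop duplicates.
article.pattern <- "[A-Za-z]\\.[0-9]+\\.[A-Za-z]\\.[A-Za-z0-9]+\\.html"
article.links <- unique(grep(article.pattern, hrefs, value = TRUE))

# The range A-z spans ASCII 65 to 122, which includes the underscore (95).
underscore.in.wide.range <- grepl("[A-z]", "_", perl = TRUE)
underscore.in.letters <- grepl("[A-Za-z]", "_", perl = TRUE)

print(page.urls)
print(article.links)
```

The character class [A-Za-z] is used here instead of [A-z] because the range A-z also covers the punctuation characters that sit between Z and a in ASCII, such as the underscore.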

After the loop, links.article contains the article links from every page visited. The next step is to visit each link in links.article and scrape the article.

push.table <- tibble()    # storage for pushes (reader comments)
article.table <- tibble() # storage for articles

for (temp.link in links.article) {
  article.url <- str_c(ptt.url, temp.link)          # article URL
  temp.html <- gossiping %>% jump_to(article.url)   # navigate to the article

  article.header <- temp.html %>%
    html_nodes("span.article-meta-value") %>%       # header elements
    html_text()
  article.author <- article.header[1] %>%
    str_extract("^[A-Za-z0-9_]+")                   # author
  article.title <- article.header[3]                # title
  article.datetime <- article.header[4]             # time

  article.content <- temp.html %>%
    html_nodes(                                     # body text
      xpath = '//div[@id="main-content"]/node()[not(self::div|self::span[@class="f2"])]'
    ) %>%
    html_text(trim = TRUE) %>%
    str_c(collapse = "")

  article.table <- article.table %>%                # append the article data
    bind_rows(
      tibble(
        datetime = article.datetime,
        title = article.title,
        author = article.author,
        content = article.content,
        url = article.url
      )
    )

  article.push <- temp.html %>% html_nodes("div.push") # pushes
  push.table.tag <- article.push %>%
    html_nodes("span.push-tag") %>%
    html_text(trim = TRUE)                          # push type
  push.table.author <- article.push %>%
    html_nodes("span.push-userid") %>%
    html_text(trim = TRUE)                          # push author
  push.table.content <- article.push %>%
    html_nodes("span.push-content") %>%
    html_text(trim = TRUE) %>%
    str_sub(3)                                      # push content
  push.table.datetime <- article.push %>%
    html_nodes("span.push-ipdatetime") %>%
    html_text(trim = TRUE)                          # push time

  push.table <- push.table %>%                      # append the push data
    bind_rows(
      tibble(
        tag = push.table.tag,
        author = push.table.author,
        content = push.table.content,
        datetime = push.table.datetime,
        url = article.url
      )
    )
}

article.table <- article.table %>%  # tidy the formats and drop rows with NA
  mutate(
    datetime = str_sub(datetime, 5) %>% parse_datetime("%b %d %H:%M:%S %Y"),
    # date is added here because the daily grouping below uses it;
    # the original listing does not show where it is created
    date = as.Date(datetime),
    month = format(datetime, "%m"),
    day = format(datetime, "%d")
  ) %>%
  filter_all(all_vars(!is.na(.)))

push.table <- push.table %>%        # tidy the formats and drop rows with NA
  mutate(
    # push timestamps carry no year, so 2017 is prepended
    datetime = str_c("2017/", datetime) %>% parse_datetime("%Y/%m/%d %H:%M"),
    month = format(datetime, "%m"),
    day = format(datetime, "%d")
  ) %>%
  filter_all(all_vars(!is.na(.)))

These steps produce two data sets:

1. article.table: article data (title, author, time, content)
2. push.table: push data (type, author, time, content)

Next, the jiebaR package segments the text:

library(jiebaR)
jieba.worker <- worker()

jieba.worker is a segmentation worker that is used together with segment. With it we can segment each day's Gossiping articles:

article.date <- article.table %>%
  group_by(date) %>%                                # group by day
  do((function(input) {
    freq(segment(input$content, jieba.worker)) %>%  # segment, then count word frequency
      filter(
        !(char %in% toTrad(stopwordsCN())),         # drop stop words
        !str_detect(char, "[A-Za-z0-9]"),           # drop words with Latin letters or digits
        nchar(char) > 1                             # drop single characters
      ) %>%
      arrange(desc(freq)) %>%                       # sort by frequency
      slice(1:100)                                  # keep the top 100
  })(.)) %>%
  ungroup()

article.date.words <- freq(article.date$char) %>%
  rename(freq.all = freq)
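The filtering inside the function above can be illustrated without jiebaR. The sketch below is base R only and rests on assumptions: tokens is a hypothetical stand-in for the output of segment(), and stop.words is a made-up list replacing toTrad(stopwordsCN()).

```r
# Hypothetical tokens, standing in for the output of segment().
tokens <- c("選舉", "的", "選舉", "PTT", "機車", "2017",
            "機車", "選舉", "了", "我們", "我")

# Hypothetical stop-word list; the article uses toTrad(stopwordsCN()).
stop.words <- c("的", "了", "我們")

# Count how often each token occurs (the role of jiebaR's freq()).
counts <- as.data.frame(table(char = tokens), stringsAsFactors = FALSE)
names(counts)[2] <- "freq"

# Apply the same three filters as the article's filter() call.
keep <- !(counts$char %in% stop.words) &       # drop stop words
  !grepl("[A-Za-z0-9]", counts$char) &         # drop Latin letters and digits
  nchar(counts$char) > 1                       # drop single characters

# Sort by descending frequency and keep at most the top 100.
top.words <- counts[keep, ]
top.words <- head(top.words[order(-top.words$freq), ], 100)

print(top.words)
```

Note that nchar counts characters rather than bytes only when the strings are handled as UTF-8, so the single-character filter assumes a UTF-8 session.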

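This gives the 100 most frequent words for each day. Counting in how many of these daily lists each word appears lets us filter for words that are specific to a day and pick each day's five most representative words. The code is as follows.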


article.everyday <- article.date %>%
  left_join(                      # attach the all-days count to each word
    article.date.words,
    by = "char"
  ) %>%
  group_by(date) %>%              # group by day
  arrange(freq.all) %>%           # sort ascending by the all-days count
  slice(1:5) %>%                  # keep the first 5 in each group
  summarise(                      # combine the words and sum their frequencies
    char = str_c(char, collapse = ", "),
    freq = sum(freq)
  ) %>%
  ungroup()

First 10 rows of article.everyday:

Date        Top 5 words                      Frequency total
2017-10-24  分身, 悠遊, 選舉, 對方, 機車     453
2017-10-23  黃國, 留言, 甲甲, 使用, 以上     467
2017-10-22  夜市, 禁止, 結婚, 電子, 老婆     576
2017-10-21  產業, 電子, 對方, 研究, 甲甲     557
2017-10-20  硫酸, 保全, 役男, 宿舍, 時區     671
2017-10-19  時區, 法官, 報告, 習近平, 研究   820
2017-10-18  堅持, 電影, 十九, 數據, 報告     567
2017-10-17  研究, 關係, 經濟, 如題, 鄉民     475
2017-10-16  蝦皮, 習近平, 議員, 女性, 小時   519
2017-10-15  醫師, 高雄, 一點, 討論, 如題     479

Finally, we can turn these data into a simple chart and use color to distinguish how popular each day's words are:

article.everyday %>%
  mutate(                                     # compute month, day, and frequency rank
    month = str_c(format(date, "%m"), "月"),  # "月" is the month suffix shown in the facet label
    day = format(date, "%d") %>% parse_number(),
    freq.rank = rank(freq)
  ) %>%
  ggplot() +
  geom_text(                                  # the day's words, colored by rank
    aes(x = 1, y = day, label = char, color = freq.rank),
    hjust = 1, size = 3
  ) +
  geom_text(                                  # the day of the month, in faint text
    aes(x = 0, y = day, label = format(date, "%d")),
    hjust = 0, size = 3, alpha = 0.4
  ) +
  scale_color_continuous(low = "#03A9F4", high = "#EF5350") +
  scale_y_reverse() +
  facet_grid(~ month) +
  theme_void()


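Looking at the whole data set, which covers August to October, attention was highest from late August to early September, possibly because of the Universiade (世大運). Text mining is still a fairly new technique. There are many ways to define what counts as a hot word or a representative word, words can be screened from many angles, and the choice of filter is open to each analyst's own ideas and creativity.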
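As a concrete instance of one such definition, the day-specific selection used earlier in this article can be sketched in base R. The data frame daily is hypothetical and only mimics the columns of article.date. Ties are broken by the higher daily frequency, which is an assumption of this sketch rather than something the article specifies, and top.n is set to 1 only to keep the example small (the article keeps 5).

```r
# Hypothetical daily top-word table with the same columns as article.date.
daily <- data.frame(
  date = as.Date(c("2017-10-23", "2017-10-23", "2017-10-23",
                   "2017-10-24", "2017-10-24", "2017-10-24")),
  char = c("研究", "留言", "選舉", "研究", "悠遊", "選舉"),
  freq = c(50, 30, 20, 40, 35, 10),
  stringsAsFactors = FALSE
)

# Number of daily lists in which each word appears (the article's freq.all).
days.seen <- table(daily$char)
daily$freq.all <- as.integer(days.seen[daily$char])

# Within each day, put the words seen on the fewest days first.
# Assumption: ties go to the word with the higher daily frequency.
daily <- daily[order(daily$date, daily$freq.all, -daily$freq), ]

# Keep the first top.n rows of each day.
top.n <- 1
distinct.words <- do.call(
  rbind,
  lapply(split(daily, daily$date), head, n = top.n)
)

print(distinct.words)
```

Other definitions, such as weighting a word's daily frequency against its overall frequency, would slot into the same place as the order() call.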