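Crafting the Balance between Big Data Analytics Utility and Protection in the Semantic Data Cloud

Ministry of Science and Technology (MOST) Research Project Report

Final Report

Project type: Individual project

Project number: MOST 102-2221-E-004-014-

Project period: August 1, 2013 to September 30, 2014

Host institution: Department of Computer Science, National Chengchi University

Principal investigator: Yuh-Jong Hu (胡毓忠)

Project participants: Master's student, part-time assistant: 劉文友

Master's student, part-time assistant: 潘宗哲

Undergraduate student, part-time assistant: 謝濟謙

Undergraduate student, part-time assistant: 薛元昊

Undergraduate student, part-time assistant: 張筆翔

Report attachments: report on attending an international conference, and the published paper

Handling of the report:

1. Public disclosure: this project involves patents or other intellectual property rights; the report becomes publicly searchable after 2 years

2. Has this research produced findings that seriously harm the public interest: No

3. Is this report recommended as a policy reference for government agencies: Yes. Public-sector agencies need to carry out anonymization when analyzing open government data (Open Data)

December 16, 2014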

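Chinese abstract (translated): This project investigates privacy-preserving WebID analytics on the decentralized Social Web. We first argue why an open, decentralized mechanism should be used for personal data management rather than a closed, centralized one. We then propose a policy-aware system architecture in which a personal data owner can select a trusted data controller to anonymize the WebID that holds his or her personal data and social network context. These anonymized WebID datasets are provided for big data analytics as RDF(S) linked data. In addition, we adopt RHadoop, an analytics platform that combines R and Hadoop, to analyze large RDF(S)-based decentralized social datasets effectively. Finally, we design and implement three types of machine-executable policies for controlling WebID datasets: access control policies for data users, data handling policies, and data releasing policies. These policies can invoke the RHadoop analytics modules above and further balance data utility against personal data protection. This part of the work was published at the IEEE Web Intelligence 2014 international conference in Warsaw, Poland.

We have also completed a draft of a second paper, Propagation Control Services for WebID Analytics on the Decentralized Social Web, which we are preparing to submit to a relevant international computing conference or edited book. The draft extends the published paper and, from the viewpoint of propagation control services, analyzes the relationships among the stakeholders of the decentralized social network: data owners, data controllers, and data users. We reuse the three types of policies for WebID datasets and stress that their enforcement must be accountable and transparent in order to provide data propagation control services. Finally, we point out how, along the information value chain, the three types of policies can invoke WebID propagation control service modules to resolve the conflict between WebID data protection and utility.

For the detailed research objectives, literature review, methods and procedures, conclusions, and future work of this project, Crafting the Balance between Big Data Analytics Utility and Protection in the Semantic Data Cloud, please refer to the paper published at the IEEE Web Intelligence 2014 international conference, Privacy-Preserving WebID Analytics on the Decentralized Policy-Aware Social Web (https://dl.acm.org/citation.cfm?id=2682811), and to a second paper under submission, Propagation Control Services for WebID Analytics on the Decentralized Social Web. The 2014 master's thesis of 孫肇祥 (Jhao-Siang Sun), on integrating R and Hadoop/MapReduce to analyze FOAF social networks, is also one of the results of this project.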

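Chinese keywords (translated): decentralized Social Web, privacy-aware Social Web, personally identifiable information, WebID, semantics-enabled policy, big data analytics, statistical disclosure control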

English abstract: We address the research challenges of privacy-preserving WebID analytics on the decentralized Social Web. We first argue why we should use open and decentralized control but not closed and centralized control of personal data management. Then, we present a policy-aware architecture, where a data owner hand-picks a trusted data controller to mask his/her personally identifiable information (PII) and other sensitive social relationships of the WebID so only anonymous RDF(S) linked datasets are available for analytics. Moreover, we advocate using an R and Hadoop integration paradigm, called RHadoop, for effective hybrid WebID analytics of large-scale social network linked datasets. Finally, we propose various types of semantics-enabled policies to call for the RHadoop hybrid WebID analytics and further balance data utility and protection on the privacy-aware Social Web.

The primary stakeholders in WebID analytics are the data owner, data controller, and data user. The three types of semantics-enabled policy mentioned above are proposed and enforced by data controllers to enable access control, data handling, and data releasing actions on the WebID datasets. The policy enforcement should be accountable and transparent at the data controllers to provide WebID propagation control services. Each data controller enforces a data handling policy to anonymize massive WebIDs. Moreover, the super data controller enforces access control and data releasing policies to ensure that the data owners receive the privacy-preserving WebID analytics services. Finally, we point out how to resolve the conflict between WebID protection and utility through different types of semantics-enabled policy that call for WebID propagation control services at the data controllers of an information value chain.

For more detailed information about the research results of this project, Crafting the Balance between Big Data Analytics Utility and Protection in the Semantic Data Cloud, MOST 102-2221-E-004-014-, please refer to the paper published at the IEEE International Conference on Web Intelligence 2014, Warsaw, Poland (https://dl.acm.org/citation.cfm?id=2682811), and to another article under submission, Propagation Control Services for WebID Analytics on the Decentralized Social Web. A master's thesis, Using R and Hadoop/MapReduce for FOAF-based Social Network Analytics, submitted by Jhao-Siang Sun, is also one of the results.

English keywords: Decentralized Social Web, Privacy-Aware Social Web, Personally Identifiable Information (PII), WebID, Semantics-enabled Policy, Big Data Analytics, Statistical Disclosure Control (SDC)

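Ministry of Science and Technology (MOST) Research Project Report

(□ Interim progress report / ■ Final report)

Crafting the Balance between Big Data Analytics Utility and Protection in the Semantic Data Cloud

Project type: ■ Individual project □ Integrated project

Project number: MOST 102-2221-E-004-014-

Project period: August 1, 2013 to September 30, 2014

Host institution and department: Department of Computer Science, National Chengchi University

Principal investigator: Yuh-Jong Hu (胡毓忠)

Project participants: 劉文友, 潘宗哲, 謝濟謙, 薛元昊, 鍾佳樺, 張筆翔

In addition to this report, the project includes the following overseas travel report(s), 1 in total:

□ Report on international collaboration and off-site research

■ Report on attending an international academic conference

Handling of the final report:

1. Public disclosure:

□ Not a controlled project and none of the following applies; publicly searchable immediately

■ Involves patents or other intellectual property rights; publicly searchable after □ one year ■ two years

2. Has this research produced findings that seriously harm the public interest: ■ No □ Yes

3. Is this report recommended as a policy reference for government agencies: ■ No □ Yes (please list the agencies; the Ministry forwards the report according to the boxes checked, without review)

December 13, 2014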


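MOST Research Project Report: Self-Evaluation Form

Please give an overall evaluation covering how closely the research matches the original plan, how far the expected goals were met, the academic or applied value of the results (briefly describe their significance, value, impact, or potential for further development), whether the results are suitable for journal publication or patent application, the main findings (briefly state whether any finding seriously harms the public interest), and any other relevant value.

1. Overall evaluation of how closely the research matches the original plan and how far the expected goals were met

■ Goals achieved

□ Goals not achieved (please explain, within 100 characters)

□ Experiment failed

□ Experiment interrupted for some reason

□ Other reasons

Explanation:

This project uses the decentralized social network as its main architecture to study how to analyze and protect the personal data and social network data in WebID datasets represented in RDF(S). We propose three types of semantics-enabled policies: access control policies, data handling policies, and data releasing policies. Through the RDF(S) query language SPARQL, these policies invoke the appropriate big data analytics modules and platform, RHadoop (R + Hadoop), together with statistical disclosure control (SDC), to balance big data analytics utility against protection.

2. Publication in academic journals, patent applications, and related outcomes:

Papers: ■ Published ■ Unpublished manuscript □ In preparation □ None

Patents: □ Granted □ Pending □ None

Technology transfer: □ Transferred □ Under negotiation □ None

Other: (within 100 characters)

One paper from this project has been published at the IEEE International Conference on Web Intelligence, Warsaw, Poland, 2014. A second paper, Propagation Control Services for WebID Analytics on the Decentralized Social Web, has been written up as a technical report and is being prepared for submission. The 2014 master's thesis of 孫肇祥, on integrating R and Hadoop/MapReduce to analyze FOAF social networks, is also one of the results of this project.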

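3. Please evaluate the academic or applied value of the results in terms of academic achievement, technical innovation, and social impact (briefly describe their significance, value, impact, or potential for further development). If any finding seriously harms the public interest, briefly describe the extent of the possible harm (within 500 characters).

This project was originally proposed as a three-year project, but only one year was approved, so we selected the feasible key issues to work on. The research centers on the protection and analysis of massive data on decentralized social networks. The massive data of existing social networks such as Facebook, LinkedIn, and Twitter are built on closed environments, which produces the so-called walled garden phenomenon. Analysis and protection of these data are therefore carried out by a single data controller under an opaque operating mechanism. This project instead follows the decentralized social network platform concept advocated by Tim Berners-Lee (TBL) and others: social network users, who are the data owners, can flexibly choose the protection platform of a data controller they trust, and can use a single WebID authentication identifier to log in to any decentralized social network platform.

Because a WebID contains complete personal identity attributes and the FOAF (Friend-Of-A-Friend) friendship chain of the social network, big data analytics techniques and platforms such as RHadoop can be applied to it effectively. To ensure that personal privacy is protected, WebIDs also need to be controlled effectively. Drawing on the WebID specification proposed by TBL at the W3C and on the concept of access control for personal data protection, combined with existing statistical disclosure control (SDC) techniques, we designed three types of data protection policies: Access Control Policy, Data Handling Policy, and Data Releasing Policy. These policies are expressed mainly in RDF(S) and, through the SPARQL query language, invoke the appropriate big data analytics and data anonymization control modules. When a data analyst queries and analyzes data, they provide anonymization during data handling, control of the data analyst's identity, and checks at data query and release time.

Extending this concept of semantics-enabled policies, we analyze the relationships among the participants in the information chain of massive data, namely data owners, data controllers, and data users. With the accountability and transparency that data controllers have when enforcing the three types of semantics-enabled policies, the concept of information chain control services for decentralized social networks can balance the analytics utility of WebID datasets against personal privacy protection.

Future research will aim to use more advanced disclosure control techniques such as differential privacy, and to add suitable machine learning and knowledge base representation structures, such as RDF(S) graphs or Datalog logic, for combined modeling and protection during data analysis. The results of this project were published in the Special Session on Big Data Analytics of the 2014 IEEE Web Intelligence international conference in Warsaw, Poland. The techniques of this research can be applied to settings in which personal data in open government data must be protected while being integrated, collected, released, and analyzed, in order to uphold personal privacy protection principles.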


Privacy-Preserving WebID Analytics on the

Decentralized Policy-Aware Social Web

Yuh-Jong Hu

Dept. of Computer Science

NCCU, Taipei, Taiwan


Abstract—We address the research challenges of

privacy-preserving WebID analytics on the decentralized Social Web. We first argue why we should use open and decentralized control but not closed and centralized control of personal data management. Then, we present a policy-aware architecture, where a data owner hand-picks a trusted data controller to mask his/her personally identifiable information (PII) and other sensitive social relationships of the WebID so only anonymous RDF(S) linked datasets are available for analytics. Moreover, we advocate using an R and Hadoop integration paradigm, called RHadoop, for effective hybrid WebID analytics of large-scale social network linked datasets. Finally, we propose various types of semantics-enabled policies to call for the RHadoop hybrid WebID analytics and further balance data utility and protection on the privacy-aware Social Web.

Keywords—Decentralized Social Web, privacy-aware Social Web, Personally Identifiable Information (PII), WebID, semantics-enabled policy, big data analytics, Statistical Disclosure Control (SDC)

I. INTRODUCTION

The centralized closed online social networking (OSN) sites are walled gardens that limit the cross-site relationships between people who possess accounts only on their sites [1]. The problem of the centralized social network silos motivates people to consider the decentralized Social Web architecture, a boundless social networking infrastructure in which all of the centralized OSN sites are interoperable with each other [2].

Each Web user connects to a service provider via a Web browser using a single sign-on WebID on the decentralized Social Web [4]. WebID, known as the Friend-of-a-Friend (FOAF) + Transport Layer Security (TLS) protocol, uses client-side certificates for a Web user’s authentication. A Web server requests an X.509 certificate from a Web user over TLS to enable secure data communication and service access authentication. This approach is a single sign-on authentication, because a Web user does not have to prepare numerous user names and passwords to access different service providers. A WebID is also portable and linkable across social networking sites because it provides a Webizing capability through URIs.
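The certificate check behind this kind of sign-on can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the client certificate carries the WebID URI and an RSA public key, and that the WebID profile document has already been fetched and reduced to a list of public keys. The names `Certificate`, `PROFILES`, and `verify_webid` are hypothetical, and the TLS handshake and RDF parsing are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    """Hypothetical stand-in for fields read from a client X.509 certificate."""
    webid_uri: str   # URI claimed by the certificate (assumed to be present)
    modulus: int     # RSA public-key modulus
    exponent: int    # RSA public-key exponent


# Assumed toy store: WebID URI -> public keys listed in that WebID's profile.
PROFILES = {
    "https://example.org/alice#me": [(0xC0FFEE, 65537)],
}


def verify_webid(cert: Certificate, profiles=PROFILES) -> bool:
    """Accept the claimed WebID only if its profile lists the certificate's key."""
    keys = profiles.get(cert.webid_uri, [])
    return (cert.modulus, cert.exponent) in keys
```

No password is involved: the server trusts the WebID because the key proven in the TLS session also appears in the document that the WebID URI points to.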

In [5], a “meta-framework” is depicted to describe the identity, profile, and privacy frameworks for a standard, open, and privacy-aware Social Web. A research challenge is learning how to enforce semantics-enabled policies for privacy, profile, and identity services in the “meta-framework” of the policy-aware Social Web. The work in [6] uses discretionary rule-based policy-aware techniques for a Web server’s resource access verification. In this policy-aware Web, a user is authenticated by the rule-based access control policy without having to register with the Website. Similarly, we need a policy-aware Social Web to support privacy protection while providing data analytics. On the one hand, a Web user can act as a data owner who requests Web services from an online social networking site. On the other hand, another Web user can act as a data analyst who discovers new insights into massive datasets through effective data analytics.

While providing boundless data sharing and analytics, we face the new challenge of enforcing law-compliant privacy protection principles. For example, without walled gardens, each user’s WebID is relatively easy to collect, integrate, and analyze. We are aware of the possible pros and cons of re-architecting social networks as a decentralized architecture.

From the pros side, it is much easier to verify transparently whether a data controller’s data protection policy is compliant with the privacy law and to further ensure that the data acquisition, anonymizing, integration, modeling, analysis, and privacy protection are truly following the data owner’s privacy preferences.

From the cons side, it is much more difficult to anonymize a tremendous amount of WebIDs in an open and decentralized architecture; moreover, each user’s sensitive social relationships come from heterogeneous data sources. So the insights of data analytics are more unpredictable, which implies that it is more difficult to set up the justification criteria of the data disclosure principles to avoid any possible privacy violation.

In this paper, we first mask each data owner’s WebID so that a collection of anonymized WebID datasets is stored at a trusted data controller without worrying about privacy violation challenges. Similar to the microdata protection techniques in [7], we must ensure that each anonymized WebID is decoupled from its original one so that the probability of re-identifying personally identifiable information (PII) is low. However, we should still preserve the original WebID dataset’s statistical features for analytics.
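One simple way to picture the masking step is keyed pseudonymization of WebID URIs combined with suppression of directly identifying properties. The sketch below is an assumption for illustration only and is not the masking technique of the paper: the predicate list `PII_PREDICATES`, the `urn:anon:` prefix, and the functions `pseudonym` and `anonymize` are hypothetical. Pseudonymization alone does not guarantee a low re-identification risk, so a real data controller would need further SDC measures.

```python
import hashlib
import hmac

# Assumed list of directly identifying properties to suppress.
PII_PREDICATES = {"foaf:name", "foaf:mbox"}


def pseudonym(uri: str, secret: bytes) -> str:
    """Map a WebID URI to a stable pseudonym using a key held by the data controller."""
    digest = hmac.new(secret, uri.encode("utf-8"), hashlib.sha256).hexdigest()
    return "urn:anon:" + digest[:16]


def anonymize(triples, secret: bytes):
    """Drop PII values and replace WebID URIs, keeping foaf:knows edges intact."""
    masked = []
    for s, p, o in triples:
        if p in PII_PREDICATES:
            continue  # suppress directly identifying values
        masked_object = pseudonym(o, secret) if p == "foaf:knows" else o
        masked.append((pseudonym(s, secret), p, masked_object))
    return masked
```

Because the same URI always maps to the same pseudonym, the friendship graph keeps its shape (for example, its degree distribution), which is the kind of statistical feature the analytics still needs.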

The entire socially aware data cloud layer is modeled as a distributed Hadoop ecosystem. Big data analytics are supported through MapReduce distributed programming and/or the open source statistical computing package R. Given the RHadoop integration paradigm, we justify the incentives of using the semantics-enabled policies to achieve privacy-preserving WebID analytics.
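To make the MapReduce style concrete, the following sketch counts how many `foaf:knows` links each (anonymized) WebID has. It runs in a single Python process and only mimics the map, shuffle, and reduce phases; it does not use Hadoop or RHadoop, and the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(triple):
    """Emit (subject, 1) for every foaf:knows triple."""
    s, p, _ = triple
    if p == "foaf:knows":
        yield (s, 1)


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def knows_degree(triples):
    pairs = (pair for t in triples for pair in map_phase(t))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real deployment the map and reduce functions would run on different nodes over partitions of the triple set; the per-key logic stays the same.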

2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)


We propose a semantics-enabled policy-aware decentralized Social Web architecture that provides automated data analytics. This architecture provides effective mediator services between data owners and a data analyst for analytics purpose. In addition, we attempt to balance privacy protection and data utility through the semantics-enabled policy enforcement. We point out the potential challenges of balancing privacy protection and data utility while providing WebID analytics. These challenges arise from the following conflict: either a data owner’s privacy right is invaded or a data analyst’s analytics power is limited.

It is not easy to balance data utility and protection. In fact, this problem has been intensively investigated in the statistical disclosure control (SDC) research field for decades [8]. SDC techniques are usually used for microdata protection in the statistical databases, so they are not directly applicable to data protection on the decentralized Social Web. When applying Social Web data analytics, we must protect not only each personal profile but also sensitive relationships.

We have built a highly transparent platform to provide socially aware data management services in the distributed Hadoop ecosystems. Both a data owner and a data analyst can ensure that the services for data acquisition, anonymizing, recording, sharing, integration, and analytics are all following the privacy protection principles declared in the Web server’s terms-of-service statements. Initially, a data owner configures the appropriate rules for enforcing data retention, use, and protection. Later on, a data owner can track or be notified of data dissemination and usage through a privacy-aware notification system.

A pattern-based query is a conditional data retrieval, so the requested data are disclosed only when the request metadata satisfy the access conditions specified by the original data owner, such as a data user’s role, purpose, access operation, location, and time. When a data user’s request is rejected, the privacy-aware notification system gives an explanation to that data user through the automated policy reasoning services. Finally, when data protection services are revised, we allow a data owner to update the respective policies in the policy-aware management framework to reflect the updated status.
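The conditional disclosure and the explanation on rejection can be sketched as a plain condition check. This is an illustrative assumption: the paper expresses such conditions as semantics-enabled policies (ontologies and rules), whereas the sketch uses hypothetical Python classes `Request` and `AccessConditions` and a hypothetical function `evaluate`.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Request:
    """Metadata attached to a data user's query (hypothetical fields)."""
    role: str
    purpose: str
    operation: str
    location: str
    at: time


@dataclass(frozen=True)
class AccessConditions:
    """Conditions a data owner attaches to his/her data (hypothetical fields)."""
    roles: frozenset
    purposes: frozenset
    operations: frozenset
    locations: frozenset
    start: time
    end: time


def evaluate(request: Request, cond: AccessConditions):
    """Return (granted, reasons); reasons name every condition that failed."""
    checks = [
        ("role", request.role in cond.roles),
        ("purpose", request.purpose in cond.purposes),
        ("operation", request.operation in cond.operations),
        ("location", request.location in cond.locations),
        ("time", cond.start <= request.at <= cond.end),
    ]
    reasons = [f"{name} condition not satisfied" for name, ok in checks if not ok]
    return (not reasons, reasons)
```

The list of reasons plays the role of the explanation that the notification system returns to a rejected data user.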

By following the big data analytics lifecycle in [9], we propose a revised version that can be shown as six consecutive stages corresponding to three types of semantics-enabled policies for data management services: (1) data acquisition and recording; (2) data extraction, data cleaning, and semantic annotation for anonymizing; (3) data representation, integration, and aggregation; (4) data modeling and analysis; (5) query processing and analytics; and (6) interpretation. We intend to apply semantics-enabled policy management services in order to provide data protection that is seamlessly bound together between the stages of lifecycle services.

A. Research Issue and Contributions

This paper addresses the following major research issues:

1) We argue why we choose the decentralized but not the centralized online social networking architectures for WebID analytics.

2) How can we carry out the collection of anonymized WebIDs and still ensure WebID utility?

3) How can we provide RHadoop WebID analytics through unifying anonymized WebIDs on the decentralized Social Web?

4) How can we call for RHadoop WebID analytics through types of semantics-enabled policy enforcement?

Our contributions. (i) We state the reasons why we should use a decentralized instead of a centralized social network architecture. (ii) We show how to carry out the collection of anonymized WebIDs and still ensure data utility. (iii) We illustrate how to link anonymized WebIDs for analytics. (iv) We propose three types of semantics-enabled policy for access control, data handling, and data releasing, to enable privacy-preserving WebID analytics.

Outline. This paper is organized as follows. In Section I, we give an introduction. Then, in Section II, we address related work. We provide background knowledge in Section III. In Section IV, we argue why we use a decentralized but not a centralized policy-aware Social Web. In Section V, we address types of data analytics and explain why we choose RHadoop for hybrid WebIDs analytics. In Section VI, we present types of semantics-enabled policies that call for RHadoop hybrid WebIDs analytics. Finally, in Section VII, we conclude this paper and point out possible future work.

II. RELATED WORK

Privacy-preserving data analytics for the socially aware data cloud is a not yet resolved challenge [10]. The SDC techniques have been developed for anonymizing the statistical database’s microdata but are not yet ready for socially aware big datasets protection [7] [8].

A critique in [11] pointed out the difficulty of adopting decentralized social networks for personal data management. This viewpoint disagrees with other proposals, which stand for personal data management on the decentralized Social Web [4]. A W5 architecture was addressed in [1] that also endorsed the personal data management through decentralized social networks. In [2] and [3], the researchers proposed concepts of the future decentralized Social Web, but the access control of resources is currently restricted to a subject-based query. They did not deal with the privacy protection analytics for a pattern-based query problem.

Privacy-preserving data access for social networks is becoming an important research problem. However, most of the studies focused on only resource access control [12] [13]. Semantics-enabled policy techniques have been proposed for general data resources access control [14]. Others aim at resources access control and data analytics on the Web [6] [12] [15].

III. BACKGROUND

In [3], Tim Berners-Lee describes a way of re-architecting an open decentralized OSN with its applications (or services) that are separated from the socially aware data cloud. Thus, the Software-as-a-Service (SaaS) layer can be separated from the Platform-as-a-Service (PaaS) layer. Here, the PaaS layer provides services that are barely powerful enough for the requirements of the SaaS layer. Each OSN should extend across the entire Social Web so that boundless data sharing is feasible.

In the decentralized social network, services are created by different service developers and trusted by a data owner (see Figure 1). Services are run for data analytics as a whole for WebIDs collected from various social network sites. Various centralized OSN sites are interoperable with each other. On the one hand, a data user can control and retain his/her own data in the remote trusted infomediary server. On the other hand, each social network site can justify its existence by sharing data with other OSNs for analytics without hoarding the entire datasets.

In [1], the World Wide Web Without Walls (W5) ecosystem concept is proposed to improve the data management problem of current centralized OSNs. They point out that three desired properties are required: decoupling services (or applications) from data, giving users control over their data, and minimizing the data trust footprint. In fact, the status quo of the centralized OSN is a lack of these properties. It has evolved into independent silos, so integrations of these silos’ data are almost impossible. Moreover, the real data protection enforcement principles are not transparent in each online social networking site.

W5 is a possible solution with aggregates. An aggregate, similar to a data controller, is a single virtual logical machine that a W5 provider supplies, and it hosts a large collection of services from developers and commingled data from many Web users. Services are written by the trusted third-party (TTP) developers, and run inside an aggregate. From a cloud computing service viewpoint, a W5’s provider offers PaaS for a W5’s numerous third-party developers, which, in turn, provide SaaS in the socially aware data cloud.

IV. DECENTRALIZED VS. CENTRALIZED

We face several challenges when PII are acquired and recorded in separate walled garden silos of centralized OSNs. For example, a data controller will struggle to integrate PII and socially aware relationships from heterogeneous data sources because data schema and format differences will always exist. Hence, data controllers will always have limited data sharing capabilities with each other. Without intensive human manipulation efforts, it is almost impossible to provide data analytics across heterogeneous data sources.

In addition, a data owner does not have full control over his/her own data. Socially aware relationships and PII data are dispersed but not interoperable. In fact, each centralized social network silo privately conducts its own data analytics without requiring the original data owners’ explicit consent. Finally, data are not portable if a data owner decides to terminate use of the services.

Since the policy enforcement is not transparent, a data owner cannot be sure of the data controller’s privacy policy compliance status. Moreover, the authentication process is cumbersome for data owners and data users because the centralized OSN lacks a single sign-on service. In the centralized OSN, a data owner will find it almost impossible to track data usage and provenance.

Some techniques might alleviate these problems, but only marginally. For example, the Open Graph API provides an interface for limited data sharing and portability on the Facebook platform. In contrast, OpenSocial defines a common API for social applications across multiple Websites. OpenID supports only single log-in services for Web users when they access services on multiple Websites. At best, these techniques provide partial and incomplete solutions for privacy-preserving data analytics. We need a comprehensive solution that can link together datasets that come from existing walled garden silos or other emerging social network sites for data analytics while still preserving privacy.

A. Decentralized Social Web

A certified TTP data controller can be established to store and anonymize all RDF(S)-based WebIDs pertaining to Social Web users’ profiles and social relationships. A WebID is a URI with an HTTPS scheme for secure transmission. A WebID uniquely describes a person and his/her social relationships [4]. Each data owner can flexibly select one of the TTP data controllers as a guardian of his/her WebID, which allows the control over data and the control over services to be separated at a data controller. Using WebID for a single sign-on authentication is feasible on the decentralized Social Web, because each WebID refers to the original person.

A Web user is fully in charge of his/her own WebID sharing and dissemination through the semantics-enabled policy enforcement at a data controller. Therefore, a data owner is endowed with transparency and self-control of a privacy protection policy. In contrast, a Web user lacks these features in the centralized OSNs. An enormous amount of interconnected WebID profiles are fully anonymized after they are collected and archived in a trusted data controller. The real research challenge will be how to perform analytics on the anonymized WebIDs when the WebID datasets still need to be interconnected. Here, we use the enhanced microdata protection techniques at a trusted data controller for anonymizing social network WebID datasets.

Semantics-enabled policies are established at two levels. On the one hand, a data handling policy for data anonymizing is established at a data controller. On the other hand, access control and data releasing policies are established and enforced at the super data controller (see Figure 1). The purposes of the semantics-enabled policy extend beyond the original WebID’s access control concern, because the original WebID restricts nothing more than each person’s resource access. Here, the access control and data handling policies are unified to allow the right person with the right purpose to query a given anonymized WebID dataset. Socially aware data are selectively disclosed with pattern-based queries through a data releasing policy.
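As one illustration of what a data releasing check could look like, the sketch below applies a minimum cell-size rule, a common SDC safeguard, to aggregated query results. This is an assumption chosen for illustration: the text does not state which SDC rule the data releasing policy applies, and the function name `release_counts` and the threshold `k` are hypothetical.

```python
from collections import Counter


def release_counts(records, attribute, k=5):
    """Release per-value counts only for groups with at least k members.

    Small groups are suppressed because they could single out individuals.
    """
    counts = Counter(record[attribute] for record in records)
    return {value: n for value, n in counts.items() if n >= k}
```

A super data controller could run such a check on every aggregate before it leaves the super-peer domain, after the access control policy has already admitted the data user.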

Semantics-enabled policies are represented as a combination of ontologies and rules [16]. Ontologies describe the concepts of data analytics and protection services, and rules enforce selective data disclosure and dissemination services with access control capability for the super data controller. Once generic semantics-enabled policies are established by a TTP at the super data controller, a data owner can verify and configure the data disclosure and usage policies at a selective data controller.


Fig. 1. The super-peer domain (SPD) data cloud for the decentralized Social Web, where a data owner hand-picks a trusted data controller to mask and record his/her WebID in an anonymized WebID dataset, which is forwarded to the super data controller to enforce WebID disclosure and analytics services.

Each data controller registers to the super data controller within a super-peer domain. A super-peer domain is the legal boundary of socially aware data cloud. This design greatly simplifies the management of semantics-enabled policy, because the numerous data controllers that operate within a super-peer domain do not have to enforce their own access control or data releasing policies.

In this study, we downgraded our OWL-based ontologies and rules from structured relational data in [15] to the RDF(S)-based graphs to leverage the power of linked data integration. An RDF(S) linked data graph represents each name-value pair as a triple. Thus, an entire RDF(S) graph is represented as a set of triples, and SPARQL provides access to these triples. A data owner has the right to select appropriate privacy protection preferences that describe the conditions of data usage so that a data controller can flexibly select suitable data masking techniques for anonymizing a WebID and provide future data disclosure. Semantics-enabled policy enforcement is transparent, so all of the data disclosure is under a data owner’s explicit permission.
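The idea of a graph as a set of triples queried by pattern can be sketched in a few lines. The example data, the `ex:` prefix, and the function `match` are hypothetical; the function mimics a single SPARQL triple pattern, where `None` stands for a variable, and is not a SPARQL engine.

```python
# Toy graph: each triple is (subject, predicate, object).
GRAPH = {
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching the pattern; None matches any value."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )
```

For example, `match(GRAPH, p="foaf:name")` corresponds roughly to the pattern `?s foaf:name ?o`.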

V. TYPES OF WEBID ANALYTICS

In the big data analytics lifecycle, the first step’s WebID acquisition and recording, and the second step’s WebID profile extraction and cleaning, are straightforward because WebIDs are uniquely identified through URIs. Therefore, WebIDs are acquired and extracted directly without further cleaning. For the second step’s semantic annotation, we use a semantics-enabled data handling policy to collect the context and anonymize the content of WebIDs. However, before WebIDs are anonymized, we must ensure WebIDs’ interconnections through their unique URIs at a data controller for further integration. Step three involves data representation, integration, and aggregation. Each WebID’s context is defined as the FOAF ontology schema with a tremendous amount of anonymized PII instance content attached, which is represented in RDF(S)’s Turtle serialization format. Multiple FOAF graphs exist, where some graphs are mutually interconnected and others are separated, reflecting the true clustering and grouping topology of the decentralized Social Web.

A. Unifying Anonymized WebID for Analytics

The WebID’s structure and its sharing mechanism are based on the open standardized RDF(S)-based ontologies and secure transmission protocol, e.g., FOAF + TLS. Therefore, we do not face an ontology matching and merging problem when WebIDs are captured and shared between multiple data sources. Here, the centralized OSN’s PII data are also possibly included for analytics if an adapter is available to transform private data into an open WebID by using RDF(S) serialization and de-serialization techniques.

Turtle, the Terse RDF Triple Language, is a primary concrete syntax for representing an RDF(S) dataset, alongside the original, more verbose RDF/XML format [17]. JSON is an interchange language for the data serialization and de-serialization of Graph API outputs captured from NoSQL data stores. Until recently, the outputs of centralized social networks through the Graph API were primarily represented in JSON for third-party applications to consume. In [18], JSON is translated into Turtle to offer data interoperability as semantically enriched RDF Linked Open Data (LOD).

Another emerging solution for unifying PII is creating data in a JSON-based Linked Data format, JSON-LD [19] (see Figure 2). JSON-LD is also an interchange language. It uses @context to describe JSON data schema and vocabulary sources; @type to describe a data type of a vocabulary; and @id to represent a vocabulary as an identifier. Therefore, JSON objects become JSON-LD objects, and are interoperable and reusable with these additional vocabularies. This slight upgrade from JSON to JSON-LD allows existing JSON data to be interpreted as Linked Data with minimal syntax changes.
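
The JSON to JSON-LD upgrade described above can be sketched as follows. This is a minimal illustration, not the adapter used in the study: the sample profile, the `to_json_ld` helper, and the choice of FOAF terms in the context are assumptions.

```python
import json

# Hypothetical plain-JSON profile, as a centralized OSN API might return it.
plain = {"name": "Alice", "homepage": "http://example.org/alice"}


def to_json_ld(record, subject_iri):
    """Wrap a JSON object with @context, @type and @id (illustrative mapping)."""
    upgraded = {
        "@context": {
            "name": "http://xmlns.com/foaf/0.1/name",
            "homepage": {"@id": "http://xmlns.com/foaf/0.1/homepage", "@type": "@id"},
        },
        "@id": subject_iri,
        "@type": "http://xmlns.com/foaf/0.1/Person",
    }
    upgraded.update(record)  # original keys and values are left unchanged
    return upgraded


doc = to_json_ld(plain, "http://example.org/alice#me")
print(json.dumps(doc, indent=2))
```

The original keys survive untouched, which is the sense in which the upgrade needs only minimal syntax changes: existing JSON consumers can ignore the three added keywords.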

B. R and Hadoop for WebID Analytics

In this study, we classify big volumes of WebID analytics into three types: lightweight, heavyweight, and hybrid.


1) A lightweight analytics provides a simple analytics service for unstructured data with small mathematical operations in the MapReduce programming paradigm of the distributed Hadoop environment.

2) A heavyweight analytics provides analytics for structured data with complex mathematical operations in statistical computing software, such as R.

3) A hybrid analytics provides a combination of a lightweight analytics of unstructured data and a heavyweight analytics of structured data to leverage the power of both analytics services.

On the one hand, accessible unstructured JSON-LD data are used for MapReduce lightweight analytics. On the other hand, socially aware structured JSON-LD data are accessed through the SPARQL query language if the data can be transformed into Turtle for R’s heavyweight analytics. A hybrid WebID analytics proceeds as follows. The unstructured text in the anonymized WebIDs of JSON-LD datasets is processed by a MapReduce lightweight analytics. Then, the structured profile attributes and social relationships of anonymized WebIDs are transformed into Turtle and queried through SPARQL filtering for an R heavyweight analytics. Finally, the lightweight and heavyweight analytics results are integrated with a comprehensive interpretation (see Figure 2).

Fig. 2. The PII in JSON is upgraded to JSON-LD in the centralized OSN, and then integrated with the JSON-LD of WebIDs in the decentralized social network for integral data analytics.

We use the Revolution Analytics RHadoop platform as a hybrid analytics testing environment to verify the feasibility of our semantics-enabled policies. Statistical computing packages from open source R are used for heavyweight analytics, but they work only on the in-memory data of a standalone computer. The Hadoop framework with its MapReduce programming paradigm allows the distributed processing of large datasets, but it supports only lightweight analytics. The purpose of integrating R and Hadoop as RHadoop is to bring the distributed (or parallel) MapReduce processing capability of Hadoop to the heavyweight analytics of R.
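
The MapReduce style of lightweight analytics mentioned above can be sketched in a few lines. The example below is plain single-process Python, not the RHadoop API and not a Hadoop job; the sample records and the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions chosen to mirror the three stages of the paradigm.

```python
from collections import defaultdict

# Hypothetical anonymized free-text fields taken from WebID profiles.
records = ["semantic web privacy", "privacy policy", "semantic policy policy"]


def map_phase(text):
    """Emit a (key, 1) pair for every token in one record."""
    return [(token, 1) for token in text.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values of one key into a single count."""
    return key, sum(values)


pairs = [pair for text in records for pair in map_phase(text)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a distributed setting the map and reduce calls run on different nodes and the shuffle moves data between them; the per-key independence shown here is what makes that distribution possible.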

VI. PRIVACY-AWARE SOCIAL WEB

A policy-oriented Social Web becomes a privacy-aware Social Web when the privacy protection principles of personal profiles and relationships are represented and enforced automatically by the semantics-enabled policies. A profile management service could be run in the browser or via a TTP. One type of TTP is an aggregate that keeps track of users’ distributed attributes and profiles on the Social Web. With this aggregate, we allow each user to configure and edit his/her personal data attributes. The core service features offered by third-party social applications are masking, maintaining, and expanding users’ connections without violating privacy protection principles.

A. Semantics-enabled Policies

Privacy-preserving analytics techniques for decentralized online social networks are quite different from the ones used for centralized relational databases, so we need to consider collecting linked data from potentially different RDF(S) data sources. These linked data, which serve as RDF(S) ontologies, reside in triple data stores rather than in the tables of a relational database. Therefore, we need to revise previous data modeling, access, anonymizing, and selective revelation techniques for linked data publishing and disclosure.

Original SDC methods are classiﬁed as conceptual, query restriction, data perturbation, and output perturbation [20]. In this study, an access control policy provides a query restriction and a data handling policy provides data perturbation. Finally, a data releasing policy provides output perturbation.

A super-peer domain (SPD) includes the super data controller and various data controllers. Consider the three types of semantics-enabled policies. In one type of policy, the super data controller uses an access control policy (ACP) to decide whether a data request from a data analyst is permitted. In another type, a data controller uses a data handling policy (DHP) with various SDC techniques for masking WebIDs to provide selective data revelation from its own anonymized WebID dataset. In the third type, the super data controller performs a data releasing policy (DRP) for anonymized WebID datasets collected from numerous data controllers to achieve permissible selective data revelation.

B. Access Control Policy (ACP)

An ACP is used for data request verification. The concept is represented as an ACP ontology (see Figure 3) and enforced as a SPARQL query. An ACP decides whether to permit (or to deny) a particular pattern-based data request from a data analyst. The ACP enforcement can also be applied to subject-based queries. In addition, each permissible request context will be forwarded to the DRP for selective data revelation. Based on an ACP ontology for data access verification, a data analyst provides five profile attributes, namely hasDataUserRole, hasAction, hasPurpose, hasLocation, and hasDateTime, which must satisfy the policy:Condition specified by the super data controller. A data request is permitted if the combination of access condition attributes submitted by a data analyst is a member of the feasible condition set of data access. Otherwise, it is rejected. Let us suppose that a data analyst named Peter submits a data request in PeterRequest.rdf, shown as the following set of triples:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix policy: <http://nccu.edu.tw/policy> .
policy:QueryType rdf:type rdf:Class .
policy:PBQ rdf:type policy:QueryType .
policy:Condition [ hasDataUserRole "DataAnalyst" ;
                   hasPurpose "Analytics" ;
                   hasAction "Read" ;
                   hasLocation "Taipei" ;
                   hasDateTime "2013:12:25:15:00" ] .

Fig. 3. An ACP ontology, where the upper part describes the condition proﬁle attributes of data access, and the lower part describes how the access condition is veriﬁed for a query type, e.g., subject-based or pattern-based.

The super data controller uses the following SPARQL Ask Boolean query, which returns yes (or no), to decide whether a data request from Peter is permitted (or not permitted) based on whether the requester’s profile attribute conditions are declared in the legal access condition set:

Ask ?permit
From <PeterRequest.rdf>
Where { ?r policy:isEmpowered ?permit .
        ?r [ ?qt rdf:type policy:QueryType ;
             policy:hasCondition ?c [ hasDataUserRole ?role ;
                                      hasPurpose ?purpose ;
                                      hasAction ?action ;
                                      hasLocation ?location ;
                                      hasDateTime ?time ] ] . }
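
The decision that the Ask query expresses, namely permitting a request only when its attribute combination belongs to the feasible condition set, can be sketched in Python. The sketch is an assumption-laden simplification: the `is_permitted` function, the exact-match comparison, and the contents of `feasible_conditions` are hypothetical, and only the five attribute names and Peter's values come from the text.

```python
# Hypothetical feasible condition set held by the super data controller.
ATTRIBUTES = ("hasDataUserRole", "hasPurpose", "hasAction", "hasLocation", "hasDateTime")

feasible_conditions = {
    ("DataAnalyst", "Analytics", "Read", "Taipei", "2013:12:25:15:00"),
}


def is_permitted(request, feasible):
    """Permit a request only if its attribute combination is in the feasible set."""
    try:
        combination = tuple(request[name] for name in ATTRIBUTES)
    except KeyError:
        return False  # a missing attribute can never match
    return combination in feasible


peter_request = {
    "hasDataUserRole": "DataAnalyst",
    "hasPurpose": "Analytics",
    "hasAction": "Read",
    "hasLocation": "Taipei",
    "hasDateTime": "2013:12:25:15:00",
}
print(is_permitted(peter_request, feasible_conditions))
```

A production policy would more plausibly match ranges (for example a time window or a set of locations) rather than exact tuples; the exact match is used here only to keep the membership idea visible.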

C. Data Handling Policy (DHP)

A data controller uses a DHP with its ontology’s vocabulary, known as policy:SDC, to describe the SDC techniques that can be applied to anonymize a WebID’s profile and relationship variables (see Figure 4). On the one hand, SDC methods are applied to datasets whose data types are described by a vocabulary known as policy:DataType, with two rdfs:subClassOf vocabularies: Categorical and Continuous. On the other hand, the same datasets can be classified by data attributes with a property vocabulary called policy:dataAttribute, with three rdfs:subPropertyOf vocabularies: identifiers, quasi-Identifiers, and confidential.

The anonymizing principle for data handling is to decide which attribute is protected by which SDC technique. The attribute classes are (1) identifier attributes, which are completely de-identified; (2) quasi-identifier attributes, which are selectively revealed with the SDC methods applicable to categorical or continuous attribute types; and (3) confidential attributes, which are disclosed only through data releasing policy enforcement, as SDC methods are coupled with the de-anonymized (quasi-)identifiable attributes that satisfy at least a minimum criterion such as k-anonymity or differential privacy [8].
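
As an illustration of the k-anonymity criterion named above, the following sketch checks whether every combination of quasi-identifier values occurs at least k times in a dataset. The `is_k_anonymous` function and the sample records are assumptions for illustration; the study does not specify how the criterion is tested.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical generalized records: age as a range, phone as an area prefix.
records = [
    {"age": "20-29", "phone": "+886-2", "interest": "hiking"},
    {"age": "20-29", "phone": "+886-2", "interest": "chess"},
    {"age": "30-39", "phone": "+886-2", "interest": "music"},
]
print(is_k_anonymous(records, ["age", "phone"], 2))
```

Here the third record forms a group of size one, so the dataset fails the test for k = 2 until that record is generalized further or suppressed.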

We use a DHP ontology to describe which SDC methods can be applied to which data types or attributes. policy:SDC can be classified into two subclasses: Masking and Synthetic. For example, a triple Non-Perturbative canApplyTo DataType describes that the Non-Perturbative techniques are feasibly used for the Categorical or Continuous data type.

An anonymized WebID’s profile attributes with one direct hop of friendships are represented as an RDF(S)-based Turtle file, shown as:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix policy: <http://nccu.edu.tw/policy> .

<http://nccu.edu.tw/j/foaf.rdf> a foaf:PersonalProfileDocument .
<http://nccu.edu.tw/j/foaf.rdf> foaf:maker :me .
<http://nccu.edu.tw/j/foaf.rdf> foaf:primaryTopic _:me .
:me a foaf:Person .
/* De-identification */
:me [ foaf:name "Yuh-Jong Hu" ;
      foaf:homepage <http://nccu.edu.tw/j> ;
      foaf:mbox <mailto:[email protected]> ;
      /* Generalization */
      foaf:phone <tel:+886-2-29387620> ;
      ...
      foaf:knows [ a foaf:Person ;
                   foaf:name "Kua-Ping Cheng" ;
                   rdfs:seeAlso
                   /* enhanced microdata protection techniques */
                   <http://nccu.edu.tw/k/foaf.rdf> ] .
      foaf:knows [ a foaf:Person ;
                   foaf:name "Ya-Ling Huang" ;
                   rdfs:seeAlso
                   /* enhanced microdata protection techniques */
                   <http://nccu.edu.tw/y/foaf.rdf> ] .
      ... ]
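
The annotations in the listing above (de-identification for name, homepage, and mailbox; generalization for the phone number) suggest a per-attribute masking step. The following Python sketch shows one way such a step could look. It is hypothetical throughout: the attribute sets, the `generalize` rules, and the sample profile are assumptions, not the masking rules used by the data controllers in this study.

```python
# Hypothetical attribute classification; the DHP ontology names the classes
# (identifiers, quasi-identifiers, confidential) but not this mapping.
IDENTIFIERS = {"name", "homepage", "mbox"}
QUASI_IDENTIFIERS = {"phone", "age"}


def generalize(attribute, value):
    """Coarsen a quasi-identifier value (illustrative rules only)."""
    if attribute == "age":
        low = (int(value) // 10) * 10
        return f"{low}-{low + 9}"
    if attribute == "phone":
        return value[:6] + "*" * (len(value) - 6)  # keep the leading prefix only
    return value


def anonymize(profile):
    """Drop identifiers, generalize quasi-identifiers, withhold the rest."""
    masked = {}
    for attribute, value in profile.items():
        if attribute in IDENTIFIERS:
            continue  # de-identification: removed entirely
        if attribute in QUASI_IDENTIFIERS:
            masked[attribute] = generalize(attribute, value)
        # confidential attributes are left to the data releasing policy
    return masked


profile = {"name": "Alice", "mbox": "mailto:alice@example.org",
           "phone": "+886-2-12345678", "age": 34, "interest": "chess"}
print(anonymize(profile))
```

Confidential attributes are deliberately not emitted here, reflecting the principle that they are disclosed only through data releasing policy enforcement.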

D. Data Releasing Policy (DRP)

A DRP governs what conditions are acceptable for analytics as well as which anonymized WebID’s attributes are available for analytics so that the analytics does not violate the privacy protection principles. The WebID’s structure representation and verification of usage conditions come from the policy:Condition in an ACP’s ontology. A user’s profile includes personal profile attributes, social relationships, etc. A data usage context submitted by a data analyst is compared with the ACP’s preset semantic access conditions, called policy:Condition. If a data analyst’s usage context satisfies the WebIDs’ preset access conditions, then these anonymized WebIDs’ profile attributes are disclosed through a DRP.

Fig. 4. An ontology of a data handling policy describes concepts regarding what SDC methods are applicable to which WebID data types and attributes.

A simple SPARQL query to access anonymized proﬁle attributes with one hop of Social links:

Select ?graph ?gender ?age ?member ?interest
From <http://nccu.edu.tw/j/foaf.rdf>
From named graph???
Where { <http://nccu.edu.tw/j/foaf.rdf#me> foaf:knows ?X .
  { ?X rdfs:seeAlso ?graph .
    graph ?graph { [ a foaf:Person .
      foaf:mbox ?mbox ; foaf:name ?name ; foaf:gender ?gender ;
      ... ;
      /* Generalization */
      foaf:phone ?phone ;
      /* GlobalRecording */
      foaf:age ?age ; foaf:member ?member ; foaf:interest ?interest ;
      foaf:knows [ ?graph ] . ] } } }

The masked PII attributes of WebIDs are disclosed for analytics, but we should still preserve data utility under the PII’s de-identifiable principle. Each personal WebID’s anonymized profile attributes are defined as a name-value set of RDF(S) triples so that the entire dataset of WebID profile attributes is a set of name-value RDF(S)-based triples. These datasets of name-value profile attributes can be turned into key-value pairs and processed by the MapReduce algorithm for lightweight analytics. As for social link analytics, a simple social link attribute of a WebID, such as friendship, is based on a Boolean data type. In addition, complex social links between two persons, such as a relationship type, a friendship trust level, and years of friendship, can be specified as a categorical or continuous data type. These anonymized social links are also described as a set of name-value triples. SDC methods with MapReduce distributed programming can therefore be applied to social links to achieve privacy-preserving lightweight WebID analytics.

The heavyweight WebID analytics of social links clusters the members of a social network into different groups by using one of the social network analysis algorithms, such as k-means clustering, and further derives different centrality measures, such as the in-degree/out-degree, closeness, and betweenness of each group or the entire social network.
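
The simplest of the centrality measures listed above, in-degree and out-degree, can be computed directly from a directed edge list. The sketch below is illustrative only: the `knows` edges and the `degree_centrality` helper are assumptions, and closeness and betweenness would need a graph library or a shortest-path routine that is not shown.

```python
from collections import defaultdict

# Hypothetical directed foaf:knows edges between anonymized WebIDs.
knows = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]


def degree_centrality(edges):
    """Return {node: (in_degree, out_degree)} for a directed edge list."""
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    nodes = set(in_deg) | set(out_deg)
    return {n: (in_deg[n], out_deg[n]) for n in sorted(nodes)}


print(degree_centrality(knows))
```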

E. Privacy-Preserving WebID Analytics

The stakeholder roles, such as a data owner, a data controller, and a data analyst, are created. We ensure that different types of semantics-enabled policies are properly enacted by their respective roles. We enforce actions to achieve respective privacy preference selection, data disclosure, and protection.

Under the original SDC enforcement principle, applying SDC is the data controller’s obligation, so a data analyst does not have the option to choose the SDC methods that mask data attributes for analytics, which certainly decreases data utility. We seek a balance between an individual’s right to privacy and society’s need for information as we acquire and record data at a data controller and release data to a data analyst at the super data controller.

We face the challenges of maximizing data utility while minimizing disclosure risk. However, these two objectives are usually in conﬂict, because we cannot simultaneously increase data utility and decrease disclosure risk [8]. SDC seeks to optimize the trade-off between data utility and disclosure risk for the protected data, such as de-identiﬁable PII.

We compromise between WebID utility and risk through various types of semantics-enabled policy enforcement because we should consider WebID data types, attributes, and the usage criteria requested by a data analyst while applying SDC solutions to WebIDs in the socially aware datasets to enforce data owners’ privacy preferences. The utility and risk criteria of WebID analytics are based on the policies and guidelines of the super data controller.

The super data controller chooses the SDC methods and parameter values based on the feedback information from a data analyst, forwards these methods and parameters to the various data controllers, and ensures a balance between WebIDs’ privacy protection and utility. Therefore, the super data controller can postulate any possible WebID disclosure threats and check whether its disclosure protection strategies are sufficient and effective.

VII. CONCLUSION AND FUTURE WORK

The primary incentives for using the decentralized architecture are data interoperability, self-control of data disclosure, and transparent data usage tracking. In this study, we demonstrate how automated semantics-enabled policy enforcement can enable these incentives. Moreover, we can provide integral data analytics, because WebIDs are easily integrated from different social network architectures. This solves the walled garden silos problem of the centralized social networks. Three types of WebID analytics exist for the decentralized Social Web: heavyweight, lightweight, and hybrid. Hadoop with MapReduce distributed processing provides lightweight analytics, and the statistical computing language R provides heavyweight analytics. RHadoop is one of the R and Hadoop integration paradigms, which leverages R and MapReduce for hybrid WebID analytics.

Semantics-enabled policies are proposed for access control, data handling, and data releasing. An access control policy is enforced at the super data controller. It verifies the data usage context of a data analyst who intends to collect interconnected WebID datasets. In addition, a data releasing policy is enforced at the super data controller to balance data protection and utility. By using a data handling policy, the various data controllers gather a collection of WebID datasets and call for the SDC methods to anonymize PII and social relationship information with selective revelation for the super data controller.

In our future work, we will intensively exploit MapReduce and R parallelism operations for synthesizing, instead of masking, socially aware datasets. If possible, each WebID will be completely anonymized before leaving a data owner’s platform to ensure absolute adherence to the privacy protection principles. Furthermore, we will fully implement the types of semantics-enabled policy for RHadoop hybrid WebID analytics on the decentralized privacy-aware Social Web.

ACKNOWLEDGEMENTS

This research was partially supported by the NSC Taiwan under Grant No. NSC 102-2221-E-004-014.

REFERENCES

[1] M. Krohn et al., “A world wide web without walls,” in 6th ACM Workshop on Hot Topics in Networking (Hotnets). ACM, 2007.

[2] C. Yeung, A. et al., “Decentralization: The future of online social networking,” in W3C Workshop on the Future of Social Networking. W3C, 2009.

[3] T. Berners-Lee, “Socially aware cloud storage,” September 2011.

[4] T. Inkster, H. Story, and B. Harbulot, “WebID-TLS: WebID authentication over TLS,” W3C, Tech. Rep., October 2013.

[5] D. Appelquist et al., “A standard-based, open and privacy-aware social web,” W3C Incubator Group Report, Tech. Rep., December 2010.

[6] J. D. Weitzner et al., “Creating a policy-aware web: Discretionary, rule-based access for the world wide web,” in Web and Information Security, E. Ferrari and B. Thuraisingham, Eds. IGI, 2006, pp. 1–31.

[7] V. Ciriani et al., “Microdata protection,” in Secure Data Management in Decentralized Systems, T. Yu and S. Jajodia, Eds. Springer, 2007, pp. 291–321.

[8] A. Hundepool et al., Statistical Disclosure Control. Wiley Series in Survey Methodology, 2012.

[9] A. Labrinidis et al., “Challenges and opportunities with big data,” Computing Research Consortium (CSR), Tech. Rep., 2012.

[10] K. Liu et al., “Privacy-preserving data analysis on graphs and social networks,” in Next Generation Data Mining, H. Kargupta et al., Eds. CRC Press, 2008, pp. 1–17.

[11] A. Narayanan et al., “A critical look at decentralized personal data architectures,” Cornell University Library, Tech. Rep., 2012.

[12] B. Carminati and E. Ferrari, “Privacy-aware access control in social networks: Issues and solutions,” in Privacy and Anonymity in Information Management Systems, J. Nin and J. Herranz, Eds. Springer, 2010, pp. 181–195.

[13] E. Zheleva, E. Terizi, and L. Getoor, Privacy in Social Networks. Morgan & Claypool, 2012.

[14] S. D. C. d. Vimercati et al., “Access control policies and languages in open environments,” in Secure Data Management in Decentralized Systems, T. Yu and S. Jajodia, Eds. Springer, 2007, pp. 21–58.

[15] Y. J. Hu et al., “Crafting a balance between big data utility and protection in the semantic data cloud,” in International Conference on Web Intelligence, Mining and Semantics (WIMS’13). ACM Press, June 2013.

[16] A. P. Bonatti, “Datalog for security, privacy and trust,” in Datalog 2010, ser. LNCS 6702. Springer, 2011, pp. 21–36.

[17] D. Beckett et al., “Turtle: Terse RDF triple language,” W3C Candidate Recommendation, Tech. Rep., February 2013.

[18] J. Weaver and P. Tarjan, “Facebook linked data via the graph API,” Semantic Web - Interoperability, Usability, Applicability, 2012.

[19] M. Sporny et al., “JSON-LD 1.0,” W3C Proposed Recommendation, Tech. Rep., November 2013.

[20] R. N. Adam and C. J. Worthmann, “Security-control methods for statistical databases: A comparative study,” ACM Computing Survey, vol. 21, no. 4, pp. 515–556, 1989.


Propagation Control Services for WebID Analytics on the Decentralized Social Web

Yuh-Jong Hu

Abstract A WebID is a single sign-on token for a user’s authentication at multiple servers. In this chapter, we allow boundless WebIDs to be collected, shared, and integrated for analytics on the decentralized Social Web. The primary stakeholders in WebID analytics are the data owner, data controller, and data user. All three types of stakeholders are sufficiently aware of propagation control services so that WebIDs receive the best protection and usage. Types of semantics-enabled policy are proposed and enforced by data controllers to enable access control, data handling, and data releasing actions on the WebID datasets. The policy enforcement should be accountable and transparent at the data controllers to provide WebID propagation control services. Each data controller enforces a data handling policy to anonymize massive WebIDs. Moreover, the super data controller enforces access control and data releasing policies to ensure that the data owners receive the privacy-preserving WebID analytics services. Finally, we point out how to resolve the conflict between WebID protection and utility through different types of semantics-enabled policy that call for WebID propagation control services at the data controllers of an information value chain.

1 Introduction

Personal data can be considered a new asset class that provides valuable insights when placed under effective analytics and interpretation [36]. Big data analytics has become one of the emerging research issues in the computer science field and other related fields, such as statistical analysis and data-driven decision making [29]. We face several research challenges when providing socially aware data analytics in online social networks (OSNs). First, the data volume on an OSN is so large and its velocity so fast that it exceeds the processing capacity of conventional data management systems. Second, the data come from heterogeneous sources in a variety of data formats and semantics, so it is extremely difficult to provide effective data integration. Third, current centralized OSNs are all walled gardens, which makes seamless data integration almost impossible.

Yuh-Jong Hu
Emerging Network Technology (ENT) Lab, Department of Computer Science, National Chengchi University, Taipei, Taiwan, e-mail: [email protected]

Most of the current big data analytics studies mainly deal with the three v’s challenges to resolve the data volume, velocity, and variety problems [39]. A special report, titled “Data, Data, Data Everywhere”, explores the problem of vast information collection with the complex issues of data archiving, accessing, managing, and securing [16]. This report suggests that we should consider using metadata, or data about data, for effective machine processing to glean implicit values through data analytics.

We also need new rules to regulate the big data analytics processes and furthermore to ensure compliance with privacy protection principles. We therefore considered the emerging research issue of big data protection [41] in order to reveal the complete landscape of privacy-preserving data analytics on social networks. Otherwise, we might face a new barrier when applying integral data analytics services across legal domains of data sources.

An inter-disciplinary study of big data privacy was recently presented at the workshop of CSAIL, MIT¹. In this workshop, academia and industry pointed out their concerns about the lack of privacy services for big data. In fact, several well-known cryptographic and statistical techniques, e.g., differential privacy [14] and fully homomorphic encryption [19], have been proposed to enable output perturbation and data encryption while providing private data management services for analytics in the open outsourcing cloud computing environment [18].

A previous k-anonymity model introduced the concept of the risk of re-identifying personally identifiable information (PII) across multiple data sources [40]. We must ensure that each combination of quasi-identifier values is shared by at least k PII records in a dataset to avoid re-identification risk. Therefore, we mask PII attributes in a quasi-identifier to de-identify each PII before disclosure to a data analyst. However, k-anonymity offers insufficient privacy protection against a PII cross-linkage attack when an unknown number of external data sources is available.

The studies on differential privacy aim to bring the theoretical soundness of cryptography to statistical disclosure control (SDC) by perturbing query outputs with noise [15]. In fact, the research in differential privacy seems to be more focused on controlling the re-identification risk of data than on providing analytics utility [12]. Conflict always exists between data protection and usage utility. How to balance these two objectives is a prominent challenge for the big data research community [22].
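
Output perturbation by noise can be illustrated with the Laplace mechanism, a standard construction in the differential privacy literature. The sketch below is not taken from this chapter: the function names, the default sensitivity of 1 for a count query, and the fixed seed are assumptions, and the Laplace sample is drawn as the difference of two exponential variates.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Perturb a count query output with noise of scale sensitivity/epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
print(noisy_count(100, epsilon=0.5, rng=rng))
```

A smaller epsilon gives a larger noise scale, which is the protection and utility trade-off discussed above expressed as a single parameter.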

In this study, we consider using socially aware anonymized WebID datasets for analytics. WebID-TLS, known as the Friend-of-a-Friend (FOAF) + Transport Layer Security (TLS) protocol, uses client-side certificates of WebIDs for a Web user’s authentication. A Web server requests an X.509 certificate from a Web user over TLS to enable secure data communication and service access authentication [24]. A WebID [35], including a Web user’s Profile with its certificate, and the social relationship information, are described with the RDF(S)-based FOAF ontology. The WebID Profile attributes of PII and a quasi-identifier must be anonymized before disclosure to prevent a data owner’s privacy from violation. Similarly, the data owner’s social relationships are also anonymized to preserve the owner’s privacy.

The SDC methods were classified as conceptual, query restriction, data perturbation, and output perturbation [1]. In this study, three types of semantics-enabled policy are proposed and enforced to enable access control, data handling, and data releasing actions for appropriate propagation control services. These actions correspond to the original SDC methods for query restriction, data manipulation and perturbation, and output perturbation for microdata protection [11].

The concepts of appropriate propagation control services are described as RDF(S)-based ontologies, and are enforced as SPARQL queries. In fact, we leverage the power of Semantic Web techniques, including RDF(S), FOAF, and SPARQL, and apply three types of semantics-enabled policy enforcement to call for appropriate WebID propagation control services at the data controllers. For more details, please see Section 4.2.

1.1 Research Issues and Contributions

Main research goals. In this study, we argue why we should consider applying propagation control services for WebID analytics on the decentralized Social Web. WebIDs will be collected and propagated at each data controller and will be available later at the super data controller for big data analytics. We must ensure that each data owner’s privacy rights are well-respected and free from any usage violations. Moreover, we must ensure transparent and accountable propagation control services at the data controllers and super data controllers along the entire WebID provenance propagation path.

The WebID secure management services for access control, dissemination, and disclosure are enacted as parts of the WebID propagation control services. For example, an access control policy calls for query restriction services, and a data handling policy calls for data manipulation and anonymizing services. Finally, a data releasing policy calls for output perturbation services. More specifically, this paper addresses the following major research issues:

1. How do we restructure the current centralized online social network architecture into the decentralized Social Web to provide wide-scale WebID capturing, recording, anonymizing, sharing, integration, modeling, and analytics services?

2. How do we provide transparent and accountable WebID propagation control services at the data controllers to assure WebID protection for the data owner and usage utility for the data user?

3. How do we provide WebID protection and usage utility through types of semantics-enabled policy enforcement to call for WebID propagation control services at the data controllers of an information value chain?

Our contributions. Our main contributions are (i) restructuring the centralized online social network architecture into the decentralized Social Web for wide-scale WebID collection and analytics, (ii) demonstrating how to provide transparent and accountable propagation control services at the data controllers to assure WebID protection for the data owner and usage utility for the data user, and (iii) modeling how to provide WebID protection and utility through types of semantics-enabled policy enforcement to call for WebID propagation control services at the data controllers of an information value chain.

Outline. This paper is organized as follows. In Section 1, we give an introduction. Then, we provide background information in Section 2. In Section 3, we explain why we restructured the centralized online social networks into the decentralized Social Web. In Section 4, we present the concepts of transparent and accountable propagation control services for WebID sharing, integration, and protection. In Section 4.2, we also point out the reasons for choosing RDF(S)-based ontologies and SPARQL queries to enable propagation control services. In Section 5, we present three types of semantics-enabled policies that call for WebID propagation control services on the privacy-aware Social Web. In addition, we explain how hybrid analytics services for a big volume of WebIDs can be implemented on the RHadoop platform. In Section 6, we address related work. Finally, we conclude this paper with possible future work in Section 7.

2 Background

We first exploited the centralized social network’s architecture and restructured it into the decentralized Social Web to provide wide-scale data sharing. The research issues of privacy in social networks are not the same as the research issues in data protection in the relational database management system [48]. Given a complete information value chain, we intend to apply types of semantics-enabled policy for information propagation and control to assure the information quality and privacy protection criteria.

We allow the big data analytics process to be operated across the entire information value chain, and the semantics-enabled policies are enacted transparently and accountably at the data controller, which ensures that each data owner’s privacy concerns are respected and each data user’s usage utility is preserved.

Any available data manipulation techniques, such as sanitization, obfuscation, and anonymization, are applied to the WebID datasets to de-identify the PII, quasi-identifiers, and sensitive social relationships. Moreover, we also allow upstream data owners and downstream data users to use data provenance techniques [30] to trace and examine the data protection and usage criteria at each data controller checkpoint along the WebID propagation path of the information value chain. The final goal of


