978-1-4799-4143-8/14 $31.00 © 2014 IEEE DOI 10.1109/WI-IAT.2014.140
2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)
We propose a semantics-enabled, policy-aware decentralized Social Web architecture that provides automated data analytics. This architecture provides effective mediator services between data owners and a data analyst for analytics purposes.
In addition, we attempt to balance privacy protection and data utility through the semantics-enabled policy enforcement.
We point out the potential challenges of balancing privacy protection and data utility while providing WebID analytics.
These challenges arise from the following conflict: either a data owner’s privacy right is invaded or a data analyst’s analytics power is limited.
Balancing data utility against data protection is not easy. In fact, this problem has been intensively investigated in the statistical disclosure control (SDC) research field for decades [8]. SDC techniques are usually used for microdata protection in statistical databases, so they are not directly applicable to data protection on the decentralized Social Web.
When applying Social Web data analytics, we must protect not only each personal profile but also sensitive relationships.
We have built a highly transparent platform to provide socially aware data management services in the distributed Hadoop ecosystems. Both a data owner and a data analyst can ensure that the services for data acquisition, anonymizing, recording, sharing, integration, and analytics are all following the privacy protection principles declared in the Web server’s terms-of-service statements. Initially, a data owner configures the appropriate rules for enforcing data retention, use, and protection. Later on, a data owner can track or be notified of data dissemination and usage through a privacy-aware notification system.
A pattern-based query is a conditional data retrieval: the requested data are disclosed only when the request’s metadata satisfy the data access conditions specified by the original data owner, such as a data user’s role, purpose, access operation, location, and time. When a data user’s request is rejected, the privacy-aware notification system gives an explanation to that data user through the automated policy reasoning services.
Finally, when data protection services are revised, we allow a data owner to update the respective policies in the policy-aware management framework to reflect the updated status.
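The conditional disclosure and the rejection explanation described above can be illustrated with a minimal Python sketch. It is not the policy reasoning service of this paper: the class names, the `evaluate` function, and all example values are hypothetical, and plain set membership stands in for ontology-and-rule reasoning.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class AccessRequest:
    """Metadata carried by a pattern-based query (field names are illustrative)."""
    role: str
    purpose: str
    operation: str
    location: str
    at: time


@dataclass(frozen=True)
class AccessPolicy:
    """Access conditions configured by a data owner (hypothetical structure)."""
    roles: frozenset
    purposes: frozenset
    operations: frozenset
    locations: frozenset
    start: time
    end: time


def evaluate(policy, request):
    """Return (allowed, reasons); reasons lists every unmet condition."""
    reasons = []
    if request.role not in policy.roles:
        reasons.append(f"role '{request.role}' is not permitted")
    if request.purpose not in policy.purposes:
        reasons.append(f"purpose '{request.purpose}' is not permitted")
    if request.operation not in policy.operations:
        reasons.append(f"operation '{request.operation}' is not permitted")
    if request.location not in policy.locations:
        reasons.append(f"location '{request.location}' is not permitted")
    if not (policy.start <= request.at <= policy.end):
        reasons.append("request falls outside the permitted time window")
    return (not reasons, reasons)


# Example values are assumptions made for illustration only.
policy = AccessPolicy(
    roles=frozenset({"analyst"}),
    purposes=frozenset({"research"}),
    operations=frozenset({"read"}),
    locations=frozenset({"TW"}),
    start=time(9, 0),
    end=time(17, 0),
)
granted = AccessRequest("analyst", "research", "read", "TW", time(10, 30))
rejected = AccessRequest("advertiser", "marketing", "read", "TW", time(10, 30))
```

Calling `evaluate(policy, rejected)` returns `False` together with one reason per unmet condition, which is the kind of explanation a notification system could forward to the data user.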
Following the big data analytics lifecycle in [9], we propose a revised version with six consecutive stages that correspond to three types of semantics-enabled policies for data management services: (1) data acquisition and recording; (2) data extraction, data cleaning, and semantic annotation for anonymizing; (3) data representation, integration, and aggregation; (4) data modeling and analysis; (5) query processing and analytics; and (6) interpretation. We intend to apply semantics-enabled policy management services to provide data protection that is seamlessly bound across the stages of the lifecycle services.
A. Research Issue and Contributions
This paper addresses the following major research issues:
1) Why do we choose the decentralized rather than the centralized online social networking architecture for WebID analytics?
2) How can we collect anonymized WebIDs and still ensure WebID utility?
3) How can we provide RHadoop WebID analytics by unifying anonymized WebIDs on the decentralized Social Web?
4) How can we invoke RHadoop WebID analytics through different types of semantics-enabled policy enforcement?
Our contributions. (i) We state why we should use a decentralized instead of a centralized social network architecture. (ii) We show how to collect anonymized WebIDs and still ensure data utility. (iii) We illustrate how to link anonymized WebIDs for analytics. (iv) We propose three types of semantics-enabled policy, for access control, data handling, and data releasing, to enable privacy-preserving WebID analytics.
Outline. This paper is organized as follows. In Section I, we give an introduction. Then, in Section II, we address related work. We provide background knowledge in Section III. In Section IV, we argue why we use a decentralized but not a centralized policy-aware Social Web. In Section V, we address types of data analytics and explain why we choose RHadoop for hybrid WebIDs analytics. In Section VI, we present types of semantics-enabled policies that call for RHadoop hybrid WebIDs analytics. Finally, in Section VII, we conclude this paper and point out possible future work.
II. RELATED WORK
Privacy-preserving data analytics for the socially aware data cloud remains an unresolved challenge [10]. The SDC techniques have been developed for anonymizing the microdata of statistical databases but are not yet ready for protecting socially aware big datasets [7] [8].
A critique in [11] pointed out the difficulty of adopting decentralized social networks for personal data management.
This viewpoint disagrees with other proposals, which stand for personal data management on the decentralized Social Web [4].
A W5 architecture was addressed in [1] that also endorsed the personal data management through decentralized social networks. In [2] and [3], the researchers proposed concepts of the future decentralized Social Web, but the access control of resources is currently restricted to a subject-based query.
They did not deal with the privacy protection analytics for a pattern-based query problem.
Privacy-preserving data access for social networks is becoming an important research problem. However, most of the studies focused on only resource access control [12] [13].
Semantics-enabled policy techniques have been proposed for general data resource access control [14]. Others aim at resource access control and data analytics on the Web [6] [12] [15].
III. BACKGROUND
In [3], Tim describes a way of re-architecting an open decentralized OSN with its applications (or services) that are separated from the socially aware data cloud. Thus, the Software-as-a-Service (SaaS) layer can be separated from the Platform-as-a-Service (PaaS) layer. Here, the PaaS layer provides services that are barely powerful enough for the requirements
of the SaaS layer. Each OSN should extend across the entire Social Web so that boundless data sharing is feasible.
In the decentralized social network, services are created by different service developers and trusted by a data owner (see Figure 1). Services are run for data analytics as a whole for WebIDs collected from various social network sites. Various centralized OSN sites are interoperable with each other. On the one hand, a data user can control and retain his/her own data in the remote trusted informediary server. On the other hand, each social network site can justify its existence by sharing data with other OSNs for analytics without hoarding the entire datasets.
In [1], the World Wide Web Without Walls (W5) ecosystem concept is proposed to improve the data management problem of current centralized OSNs. The authors point out that three desired properties are required: decoupling services (or applications) from data, giving users control over their data, and minimizing the data trust footprint. In fact, the status quo of the centralized OSN lacks these properties. It has evolved into independent silos, so integrating the data of these silos is almost impossible. Moreover, the real data protection enforcement principles are not transparent in each online social networking site.
W5 is a possible solution with aggregates. An aggregate, similar to a data controller, is a single virtual logical machine that a W5 provider supplies, and it hosts a large collection of services from developers and commingled data from many Web users. Services are written by the trusted third-party (TTP) developers, and run inside an aggregate. From a cloud computing service viewpoint, a W5’s provider offers PaaS for a W5’s numerous third-party developers, which, in turn, provide SaaS in the socially aware data cloud.
IV. DECENTRALIZED VS. CENTRALIZED
We face several challenges when PII are acquired and recorded in separate walled garden silos of centralized OSNs.
For example, a data controller will struggle to integrate PII and socially aware relationships from heterogeneous data sources because data schema and format differences will always exist.
Hence, data controllers will always have limited data sharing capabilities with each other. Without intensive human manipulation efforts, it is almost impossible to provide data analytics across heterogeneous data sources.
In addition, a data owner does not have full control over his/her own data. Socially aware relationships and PII data are dispersed but not interoperable. In fact, each centralized social network silo privately conducts its own data analytics without requiring the original data owners’ explicit consent. Finally, data are not portable if a data owner decides to terminate use of the services.
Since the policy enforcement is not transparent, a data owner cannot be assured of the data controller’s privacy policy compliance status. Moreover, the authentication process is cumbersome for data owners and data users because the centralized OSN lacks a single sign-on service. In the centralized OSN, a data owner will find it almost impossible to track data usage and provenance.
Some techniques might alleviate these problems, but only marginally. For example, the Open Graph API provides an interface for limited data sharing and portability on the Facebook platform. In contrast, OpenSocial defines a common API for social applications across multiple Websites. OpenID supports only single log-in services for Web users when they access services on multiple Websites. At best, these techniques provide partial and incomplete solutions for privacy-preserving data analytics. We need a comprehensive solution that can link together datasets that come from existing walled garden silos or other emerging social network sites for data analytics while still preserving privacy.
A. Decentralized Social Web
A certified TTP data controller can be established to store and anonymize all RDF(S)-based WebIDs pertaining to Social Web users’ profiles and social relationships. A WebID is a URI with an HTTPS scheme for secure transmission. A WebID uniquely describes a person and his/her social relationships [4]. Each data owner can flexibly select one of the TTP data controllers as a guardian of his/her WebID, which allows the control over data and the control over services to be separated at a data controller. Using WebID for single sign-on authentication is feasible on the decentralized Social Web, because each WebID refers to the original person.
A Web user is fully in charge of his/her own WebID sharing and dissemination through the semantics-enabled policy enforcement at a data controller. Therefore, a data owner is endowed with transparent self-control of a privacy protection policy. In contrast, a Web user lacks these features in the centralized OSNs. An enormous number of interconnected WebID profiles are fully anonymized after they are collected and archived in a trusted data controller. The real research challenge is how to run analytics on anonymized WebIDs when the WebID datasets still need to be interconnected.
Here, we use the enhanced microdata protection techniques at a trusted data controller for anonymizing social network WebID datasets.
On the one hand, semantics-enabled policies that include a data handling policy for data anonymizing are established at a data controller. On the other hand, access control and data releasing policies are established and enforced at the super data controller (see Figure 1). The purposes of the semantics-enabled policy extend beyond the original WebID’s access control concern, because the original WebID restricts nothing more than each person’s resource access. Here, the access control and data handling policies are unified to allow the right person with the right purpose to query a given anonymized WebID dataset. Socially aware data are selectively disclosed with pattern-based queries through a data releasing policy.
Semantics-enabled policies are represented as a combination of ontologies and rules [16]. Ontologies describe the concepts of data analytics and protection services, and rules enforce selective data disclosure and dissemination services with access control capability for the super data controller.
Once generic semantics-enabled policies are established by a TTP at the super data controller, a data owner can verify and configure the data disclosure and usage policies at a selective data controller.
Fig. 1. The super-peer domain (SPD) data cloud for the decentralized Social Web, where a data owner hand-picks a trusted data controller to mask and record his/her WebID in an anonymized WebID dataset, which is then forwarded to the super data controller to enforce WebID disclosure and analytics services.
Each data controller registers with the super data controller within a super-peer domain. A super-peer domain is the legal boundary of the socially aware data cloud. This design greatly simplifies the management of semantics-enabled policy, because the numerous data controllers that operate within a super-peer domain do not have to enforce their own access control or data releasing policies.
In this study, we downgraded our OWL-based ontologies and rules from structured relational data in [15] to the RDF(S)-based graphs to leverage the power of linked data integration.
An RDF(S) linked data graph represents each name-value pair as a triple. Thus, an entire RDF(S) graph is represented as a set of triples, and SPARQL provides access to these triples.
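As a minimal illustration of the triple view, the Python sketch below stores a small FOAF graph as a set of (subject, predicate, object) tuples and matches a single triple pattern, which is the basic building block of a SPARQL query. The `match` helper and the `example.org` WebIDs are assumptions made for illustration; a real deployment would use a triple store with a SPARQL engine.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# A tiny RDF graph as a set of (subject, predicate, object) triples.
# The example.org WebIDs are hypothetical.
triples = {
    ("https://example.org/alice#me", FOAF + "name", "Alice"),
    ("https://example.org/alice#me", FOAF + "knows", "https://example.org/bob#me"),
    ("https://example.org/bob#me", FOAF + "name", "Bob"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching one pattern; None acts as a variable."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Roughly: SELECT ?s ?o WHERE { ?s foaf:knows ?o }
knows_edges = match(triples, p=FOAF + "knows")
```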
A data owner has the right to select appropriate privacy protection preferences that describe the conditions of data usage so that a data controller can flexibly select suitable data masking techniques for anonymizing a WebID and provide future data disclosure. Semantics-enabled policy enforcement is transparent, so all of the data disclosure is under a data owner’s explicit permission.
V. TYPES OF WEBID ANALYTICS
In the big data analytics lifecycle, the first step’s WebID acquisition and recording and the second step’s WebID profile extraction and cleaning are straightforward because WebIDs are uniquely identified through URIs. Therefore, WebIDs are acquired and extracted directly without further cleaning. For the second step’s semantic annotation, we use a semantics-enabled data handling policy to collect the context and anonymize the content of WebIDs. However, before WebIDs are anonymized, we must ensure WebIDs’ interconnections through their unique URIs at a data controller for further integration. Step three involves data representation, integration, and aggregation. Each WebID’s context is defined by the FOAF ontology schema, with a large amount of anonymized PII instance content attached, and is represented in the Turtle serialization format of RDF(S). Multiple FOAF graphs exist, where some graphs are mutually interconnected and others are separated, reflecting the true clustering and grouping topology of the decentralized Social Web.
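The requirement that WebIDs stay interconnected after anonymizing can be sketched as follows. This Python example is an assumption-laden stand-in, not the enhanced microdata protection techniques used in this work: it replaces every WebID URI with a keyed-hash pseudonym so that the same URI always maps to the same pseudonym, and it simply suppresses identifying literals. The function names, the `anon.example` namespace, and the secret key are hypothetical.

```python
import hashlib
import hmac

ANON_BASE = "https://anon.example/id/"  # hypothetical namespace for pseudonyms


def pseudonymize(uri: str, key: bytes) -> str:
    """Map a WebID URI to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, uri.encode("utf-8"), hashlib.sha256).hexdigest()
    return ANON_BASE + digest[:16]


def anonymize_graph(triples, key, link_predicates, suppressed_predicates):
    """Pseudonymize subjects and linked objects; drop suppressed predicates."""
    masked = set()
    for subject, predicate, obj in triples:
        if predicate in suppressed_predicates:
            continue  # e.g. names are removed entirely in this sketch
        new_subject = pseudonymize(subject, key)
        # Objects of link predicates are WebIDs too, so they get the same mapping.
        new_obj = pseudonymize(obj, key) if predicate in link_predicates else obj
        masked.add((new_subject, predicate, new_obj))
    return masked


KNOWS = "http://xmlns.com/foaf/0.1/knows"
NAME = "http://xmlns.com/foaf/0.1/name"
graph = {
    ("https://example.org/alice#me", NAME, "Alice"),
    ("https://example.org/alice#me", KNOWS, "https://example.org/bob#me"),
}
masked = anonymize_graph(graph, b"controller-secret", {KNOWS}, {NAME})
```

Because the mapping is deterministic for a given key, a `foaf:knows` edge still connects the pseudonyms of the same two people, so graph structure remains available for integration. Keyed hashing alone does not protect against structural re-identification, which is why stronger masking is needed in practice.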
A. Unifying Anonymized WebID for Analytics
The WebID’s structure and its sharing mechanism are based on the open standardized RDF(S)-based ontologies and secure transmission protocol, e.g., FOAF + TLS. Therefore, we do not face an ontology matching and merging problem when WebIDs are captured and shared between multiple data sources. Here, the centralized OSN’s PII data are also possibly included for analytics if an adapter is available to transform private data into an open WebID by using RDF(S) serialization and de-serialization techniques.
Turtle, the Terse RDF Triple Language, is a primary concrete syntax for representing an RDF(S) dataset, alongside the original, more verbose RDF/XML data format [17]. JSON is an interchange language for the data serialization and de-serialization of Graph API outputs captured from NoSQL data stores. Until recently, the centralized social network outputs through the Graph API were primarily represented in JSON for third-party applications to consume. In [18], JSON is translated into Turtle to offer data interoperability as semantically enriched RDF Linked Open Data (LOD).
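To make the JSON-to-Turtle idea concrete, the following Python sketch serializes one JSON profile object as FOAF triples in Turtle. It does not reproduce the translation method of [18]; the input keys `id`, `name`, and `friends` are a hypothetical Graph-API-like shape, and only the most basic literal escaping is handled.

```python
FOAF_PREFIX = "@prefix foaf: <http://xmlns.com/foaf/0.1/> ."


def escape_literal(text: str) -> str:
    """Escape backslashes, double quotes, and newlines for a Turtle string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_turtle(person: dict) -> str:
    """Serialize a profile dict (hypothetical keys: id, name, friends) as Turtle."""
    lines = [f'<{person["id"]}> a foaf:Person']
    lines.append(f'    foaf:name "{escape_literal(person["name"])}"')
    for friend in person.get("friends", []):
        lines.append(f"    foaf:knows <{friend}>")
    # Predicate-object pairs of one subject are separated by ";" and closed by ".".
    return FOAF_PREFIX + "\n\n" + " ;\n".join(lines) + " .\n"


profile = {
    "id": "https://example.org/alice#me",
    "name": "Alice",
    "friends": ["https://example.org/bob#me"],
}
turtle_text = to_turtle(profile)
```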
Another emerging solution for unifying PII is creating data in JSON-based Linked Data, JSON-LD [19] (see Figure 2).
JSON-LD is also an interchange language. It uses @context to describe JSON data schema and vocabulary sources; @type to describe a data type of a vocabulary; and @id to represent a vocabulary as an identifier. Therefore, JSON objects become JSON-LD objects, and are interoperable and reusable with these additional vocabularies. This slight upgrade from JSON to JSON-LD allows existing JSON data to be interpreted as Linked Data with minimal syntax changes.
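The upgrade from JSON to JSON-LD can be sketched in a few lines of Python. The input record shape (`id`, `name`, `friends`) and the `to_json_ld` helper are hypothetical; the keywords `@context`, `@id`, and `@type` and the FOAF term IRIs are standard.

```python
import json

FOAF = "http://xmlns.com/foaf/0.1/"


def to_json_ld(record: dict) -> dict:
    """Wrap a plain JSON profile (hypothetical keys) with JSON-LD keywords."""
    return {
        # @context maps the short JSON keys to vocabulary IRIs.
        "@context": {
            "name": FOAF + "name",
            # "@type": "@id" says the values of "knows" are IRIs, not strings.
            "knows": {"@id": FOAF + "knows", "@type": "@id"},
        },
        "@id": record["id"],        # the WebID identifies the node
        "@type": FOAF + "Person",   # the node is typed as a FOAF person
        "name": record["name"],
        "knows": record.get("friends", []),
    }


plain = {
    "id": "https://example.org/alice#me",
    "name": "Alice",
    "friends": ["https://example.org/bob#me"],
}
linked = to_json_ld(plain)
serialized = json.dumps(linked, indent=2)
```

The data keys `name` and `knows` are unchanged apart from the renaming of `friends`; only the three keyword entries are added, which is what keeps the syntax change minimal.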
B. R and Hadoop for WebID Analytics
In this study, we classify big volumes of WebID analytics into three types: lightweight, heavyweight, and hybrid.
1) A lightweight analytics provides a simple analytics service for unstructured data with small mathematical operations in a MapReduce programming paradigm of the distributed Hadoop environment.
2) A heavyweight analytics provides analytics for structured data with complex mathematical operations in statistical computing software, such as R.
3) A hybrid analytics provides a combination of a lightweight analytics of unstructured data and a heavyweight analytics of structured data to leverage the power of both analytics services.
On the one hand, accessible JSON-LD unstructured data are used for MapReduce lightweight analytics. On the other hand, socially aware structured JSON-LD data are accessed through the SPARQL query language if the data can be transformed into Turtle for R’s heavyweight analytics. A hybrid WebID analytics proceeds as follows. The unstructured text in the anonymized WebIDs of JSON-LD datasets is processed by a MapReduce lightweight analytics. Then, the structured profile attributes and social relationships of anonymized WebIDs are transformed into Turtle and queried through SPARQL filtering for an R heavyweight analytics.
Finally, the lightweight and heavyweight analytics results are integrated with a comprehensive interpretation (see Figure 2).
Fig. 2. The PII in JSON is upgraded to JSON-LD in the centralized OSN, and then integrated with the JSON-LD of WebIDs in the decentralized social network for integral data analytics.
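The hybrid flow can be mimicked on one machine with the Python sketch below. It is only a toy stand-in: plain Python replaces both Hadoop MapReduce and R, the function names are hypothetical, and the chosen statistics (word counts and mean out-degree) are assumptions picked for illustration rather than analytics reported in this paper.

```python
from collections import defaultdict
from statistics import mean


def map_phase(documents):
    """Map step: emit (word, 1) for every word in the unstructured text."""
    for text in documents:
        for word in text.lower().split():
            yield word, 1


def reduce_phase(pairs):
    """Reduce step: sum the emitted counts per key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def degree_stats(edges):
    """Structured step: mean number of outgoing 'knows' links per WebID."""
    out_degree = defaultdict(int)
    for source, _target in edges:
        out_degree[source] += 1
    return {"mean_out_degree": mean(out_degree.values())}


def hybrid_analytics(documents, edges):
    """Integrate the lightweight and heavyweight results in one report."""
    return {
        "lightweight": reduce_phase(map_phase(documents)),
        "heavyweight": degree_stats(edges),
    }


report = hybrid_analytics(
    ["open data", "open web"],
    [("a", "b"), ("a", "c"), ("b", "c")],
)
```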
We use the Revolution Analytics RHadoop platform as a hybrid analytics testing environment to verify the feasibility of our semantics-enabled policies. Statistical computing packages from open source R are used for heavyweight analytics, but they work only on the in-memory data of a standalone computer. The Hadoop framework with a MapReduce programming paradigm allows the distributed processing of large datasets, but it supports only lightweight analytics. The purpose of integrating R and Hadoop as RHadoop is to bring the distributed (or parallel) MapReduce processing capability of Hadoop to the heavyweight analytics of R.
VI. PRIVACY-AWARE SOCIAL WEB
A policy-oriented Social Web becomes a privacy-aware Social Web when the privacy protection principles of personal profiles and relationships are represented and enforced
automatically by the semantics-enabled policies. A profile management service could be run in the browser or via a TTP.
One type of TTP is an aggregate that keeps track of users’
distributed attributes and profiles on the Social Web. With this aggregate, we allow each user to configure and edit his/her personal data attributes. The core service features offered by third-party social applications are masking, maintaining, and expanding users’ connections without violating privacy protection principles.
A. Semantics-enabled Policies
Privacy-preserving analytics techniques for decentralized online social networks are quite different from the ones used for centralized relational databases, because we need to consider collecting linked data from potentially different RDF(S) data sources. These linked data, which serve as RDF(S)’s ontologies, are in the triple data stores but not in the tables of a relational database. Therefore, we need to revise previous data modeling, access, anonymizing, and selective revelation techniques for linked data publishing and disclosure.
Original SDC methods are classified as conceptual, query