Big Data is a term closely related to the development of the Internet. Thanks to the existence of such a global network, it is possible to share and collect this huge amount of data as well as to provide Big Data processing and analytics as a service to a broader audience throughout the entire network.
Big Data means data that’s too big, too fast, or too hard for existing tools to process. The enormous amount of data can no longer be handled by conventional scale-up database systems. With rapidly growing demands, such system designs not only cause exploding costs but also quickly reach the technical limits of performance.
Today’s common design of Big Data is to scale out, that is, to use thousands of inexpensive commodity servers to achieve unprecedented scalability, performance, and availability. The topology of the network connecting those servers with each other and with the clients is expected to have a significant impact on these target characteristics.
The power of Big Data also raises privacy concerns, which could stir a regulatory backlash, dampening the data economy and stifling innovation.
Here, we propose a concept of decentralized network-based Big Data architecture to provide both high scalability and protection of privacy. A network of loosely connected Big Data analytic server nodes allows not only decentralized storage and processing, but also individual and privacy-preserving response policies.
Data generation and collection have grown strongly over the last 10 years. At the same time, storage has become more and more affordable. The appearance of smartphones and tablets enabled people to be connected to the Internet almost anywhere at any time. Along with the expansion of broadband networks, those devices are enabled by integrated sensors to generate additional usable data, for example, motion profiles. These developments opened a huge market for new companies in software development but also generated new growth opportunities for established companies (Internet of things, smart metering, etc.). With respect to analytics and enabling new products, Big Data may create significant value for organizations.
Big Data is mostly characterized by the three V’s of volume, velocity, and variety. Regarding quality and accuracy, there is a fourth V for veracity. These four V’s present mostly technical challenges. In addition, the three F’s of fast, flexible, and focused, as well as other functions and software components, are aspects that need to be considered to find a holistic approach for a Big Data platform.
It is desirable that the performance increases linearly with the number of servers. However, this linear scalability is not achieved by the implemented server clusters. Reported measurements show that the performance per server drops by a factor of 2-5 when the number of servers increases from 1 to 500.
There are also other approaches addressing scalability. One of them is an information integration system based on a mediator/wrapper or virtual data integration approach, which enables central information access without requiring central data storage.
No less than seven “big benefits” of Big Data have been presented: health care, mobile applications, smart grid, traffic management, retail, payments, and online applications. In addition, Big Data may provide significant advantages in law enforcement or terrorism prevention. At the same time, increasing concerns about privacy are raised. From the point of view of privacy protection, it is necessary to provide k-anonymity, that is, “any one individual in the dataset cannot be distinguished from at least k-1 other individuals in the same dataset”. Given its comprehensive data collection, a centralized Big Data solution can never achieve k-anonymity for any k>1. The fact that the data can be stored in a decentralized manner is significant, insofar as the fact that private companies and government organizations collect and hold more and more information about individuals is increasingly regarded as critical and dangerous to society.
Based on the decentralized concept, meta-search engines have been introduced with a certain success, while diverse decentralized social networks like Safebook, Diaspora, Friendica, and many others have also mushroomed since 2009. It seems that the lack of interoperability (particularly between the social networks) and the incomplete business models (all network-based Big Data services including meta-search engines) prevent the breakthrough of decentralized concepts. From a technical point of view, telemetry-based Big Data services in particular, in fields such as transportation, health, environmental protection, and numerous others, will soon be or are already available.
3. Network-based Big Data
3.1. A concept of network-based Big Data
We propose a concept of decentralized network-based Big Data structures. According to this concept, the entire network is partitioned into multiple logical Big Data subnetworks. The Big Data service is provided by multiple decentralized Big Data server nodes, with exactly one server node in each subnetwork. The aim is to provide benefits both for the data creators, regarding their demand for protection of privacy, and for the users of diverse Big Data solutions.
We define the logical structure of data exchange for “network-based Big Data” with the underlying undirected graph given by
As mentioned previously the entire network is partitioned into n subnetworks, denoted as
At the same time, all server nodes form a complete graph
Within any given subnetwork Gi0, vi0 is the only Big Data server node which may collect data there. As a prerequisite to each
a possibly individually negotiated Big Data policy is to be applied. This policy defines how Big Data may be collected, processed, retrieved, and deleted in case of termination.
Within G0 no sharing of Big Data may take place. Instead, the servers cooperate with each other and build a mediator/wrapper architecture(s).
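The logical structure described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the node names follow the naming used in the text (index 0 for the server node of a subnetwork), while the dictionary layout and the helper functions `server_of`, `client_edges`, and `server_graph` are hypothetical and not part of the proposed concept.

```python
from itertools import combinations

# Hypothetical example data: four subnetworks, each with one server node
# (listed first, index 0) and a few client nodes.
subnetworks = {
    1: ["v10", "v11", "v12"],
    2: ["v20", "v21", "v22"],
    3: ["v30", "v31"],
    4: ["v40", "v41"],
}


def server_of(i):
    """Return the single Big Data server node of subnetwork i."""
    return subnetworks[i][0]


def client_edges(i):
    """Logical collection links: every client is linked only to its own server."""
    server = server_of(i)
    return [(server, client) for client in subnetworks[i][1:]]


def server_graph():
    """Edges of the complete graph over all server nodes."""
    servers = [server_of(i) for i in sorted(subnetworks)]
    return list(combinations(servers, 2))


print(client_edges(4))
print(server_graph())
```

With four server nodes the complete graph has six logical links; these links carry requests and policy-filtered answers only, never the collected data itself.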
Figure 1 shows an example of a network with four logical subnetworks G1, G2, G3, G4. Each of them consists of one logical Big Data server node and multiple client nodes from which the Big Data may be collected.
The server nodes v10, v20, v30, v40 in this example are connected logically to each other, and thus form a complete graph
Collecting information from a creator
- Creator v41 sends information (tweets, geo information, consumption data, documents, etc.) to server node v40. As a prerequisite, server v40 ensures the creator applies its individual Big Data policy, which can be considered as a contract between the creator v41 and the provider of the Big Data server node v40.
- Server node v40 processes and stores data using standard Big Data techniques.
- Server node v40 does not share data with the other server nodes, v10, v20, v30.
Figure 1. An example of decentralized network-based Big Data. (0) Collecting information from a creator; (1) Consumer requests data within its network; (2) Server v20 broadcasts the request to all other servers it knows; (3) Server v40 responds to the request of server v20; (4) Server v20 responds to the initial request of Consumer.
Consumer requests data within its network
- Consumer v22 sends a request to server node v20 searching for relevant data (tweets of the originator whom he or she follows, etc.).
Server v20 broadcasts the request to all other servers it knows
- Server node v20 has to have a trusted connection with the other servers.
- Server node v20 has to authenticate itself to the server nodes v10, v30, v40.
- Server node v20 may provide the identity of the consumer subject to the policy of the requested Big Data service and the individual Big Data policy agreement between the consumer v22 and the server node v20.
Server v40 responds to the request of server v20
- Following the policy defined by the creator v41 and based on the requested information provided by v20 and the collected Big Data, server node v40 responds to the request.
- Server node v40 does not provide any Big Data information to v20.
Server v20 responds to the initial request of consumer
- Server node v20 collects all the answers of the other server nodes, if provided.
- Server node v20 sends the results to the consumer v22.
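The five steps (0) to (4) can be sketched in a few lines of code. This is a minimal sketch, not the proposed implementation: the class `ServerNode`, its methods, and the example policy `count_only_policy` are hypothetical names introduced for illustration, and the trusted connection and authentication between servers are assumed to exist and are left out.

```python
class ServerNode:
    """Hypothetical model of one Big Data server node."""

    def __init__(self, name):
        self.name = name
        self.peers = []     # other server nodes this server knows and trusts
        self.store = {}     # creator -> collected items (never shared as such)
        self.policies = {}  # creator -> callable(request, items) -> answer or None

    def collect(self, creator, item, policy):
        # Step 0: data is accepted only together with the creator's policy.
        self.policies[creator] = policy
        self.store.setdefault(creator, []).append(item)

    def answer(self, request):
        # Step 3: apply each creator's policy; only policy output leaves the node.
        answers = []
        for creator, items in self.store.items():
            result = self.policies[creator](request, items)
            if result is not None:
                answers.append(result)
        return answers

    def handle_consumer_request(self, request):
        # Steps 1, 2 and 4: accept the request, forward it to all known peers,
        # collect whatever they answer, and return the combined result.
        results = list(self.answer(request))
        for peer in self.peers:
            results.extend(peer.answer(request))
        return results


def count_only_policy(request, items):
    """Hypothetical policy: reveal only how many items match, never the items."""
    matches = [item for item in items if request in item]
    return {"matches": len(matches)} if matches else None


v20, v40 = ServerNode("v20"), ServerNode("v40")
v20.peers.append(v40)
v40.collect("v41", "traffic jam on A1", count_only_policy)
print(v20.handle_consumer_request("traffic"))  # [{'matches': 1}]
```

The point of the design is visible in `answer`: the requesting server only ever sees what the creator's policy returns, so the stored items stay on the node that collected them.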
3.2. Considerations of the network traffic and the computational effort
Based on the architecture discussed before, the network traffic and the computational effort for the entire network can be estimated.
The network traffic B is given by
Here, Bi is the network traffic related to a single Big Data server node vi0 and consists of two parts, the client-related traffic and the intra-server traffic; cij is the network traffic for collecting data from a single client vij (step 0 in Figure 1), rij the network traffic to respond to the requests by vij (steps 1 and 4 in Figure 1), mi the number of clients within the subnetwork Gi, and fik the network traffic for forwarding a request from vi0 to another server node vk0 and collecting answers (steps 2 and 3 in Figure 1). Notice that
The intra-server traffic can be estimated for a search engine network and a social network application as explained in the following.
Today Google processes about 50,000 search queries every second. Assuming each query may cause a data traffic of 200 KB, we have a total request traffic of about 50,000 × 200 KB = 10 GB/s.
If using a decentralized network-based Big Data concept with n server nodes, each server node has to handle all client requests, either directly or forwarded by the other n-1 server nodes. The intra-server network traffic can then be estimated at roughly (n-1) × 10 GB/s, or about 1 TB/s, for n = 101. This amount of data flow is acceptable for today’s Internet, since the estimated total Internet traffic is about 30 TB/s.
WhatsApp claims to be able to receive more than 300,000 and send more than 700,000 messages per second for its users. In a decentralized network-based version, each message is (1) first sent to the server node to which the originator is related, (2) then forwarded to possibly another server node, which hosts a specific chat, and (3) finally received by all server nodes to which at least one user registered in the chat is related. The intra-server traffic is approximately 1,000,000 messages per second or about 1 GB/s.
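The two estimates can be checked with a short back-of-the-envelope calculation. The query rate, query size, message rate, and total Internet traffic are the figures quoted in the text; the factor n - 1 for forwarding and the average message size of 1 KB are assumptions made here so that the numbers can be reproduced, and decimal units are assumed throughout.

```python
KB = 1_000            # bytes, decimal units assumed
GB = 1_000_000_000
TB = 1_000 * GB

# Search engine example (figures quoted in the text).
queries_per_second = 50_000
bytes_per_query = 200 * KB
request_traffic = queries_per_second * bytes_per_query  # bytes per second

# Assumption: every request is forwarded to each of the other n - 1 servers.
n = 101
intra_server_search = (n - 1) * request_traffic

# Messenger example: about 1,000,000 messages per second between servers.
# Assumption: an average message size of 1 KB, which yields about 1 GB/s.
messages_per_second = 1_000_000
intra_server_messenger = messages_per_second * 1 * KB

internet_total = 30 * TB  # estimated total Internet traffic from the text

print(request_traffic / GB, "GB/s request traffic")
print(intra_server_search / TB, "TB/s intra-server traffic (search)")
print(intra_server_messenger / GB, "GB/s intra-server traffic (messenger)")
print(round(intra_server_search / internet_total, 3), "share of total traffic")
```

Under these assumptions the search engine case needs about one thirtieth of the estimated total Internet traffic, and the messenger case is three orders of magnitude smaller.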
The total computational effort C is given by
where Ci is the computational effort of a single Big Data server node vi0, Ai the data amount collected by this server node, P1 a function of the data collection flow and the data amount, and P2 a function of the data request flow and the data amount.
The sublinear scalability of Big Data server nodes can be expressed as
where Fi is the data flow from and to a Big Data server node vi0, F and A are, respectively, the sum of the data flow and the data amount of all server nodes or
is a characteristic scalability factor depending on the technique and the number of the server nodes used. In the case described earlier, this factor has approximately the value 5; in an ideal case of full linear scalability its value is 1.
Comparison to a centralized architecture
Centralized Big Data architecture is actually a special case in which there is only a single subnetwork with m^ clients. The network traffic and the computational effort are, respectively:
In other words, compared to the decentralized architecture discussed previously, a centralized architecture will certainly reduce the intra-server network traffic to zero, but increase the total computational effort for data collection because of the sublinear scalability of Big Data server clusters.
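The traffic side of this comparison can be made concrete with a small sketch. It assumes that the traffic of each server node is the sum of the collection traffic cij and response traffic rij over its clients plus the forwarding traffic fik to every other server node, which is one reading of the definitions given earlier and not necessarily the exact formula of the original. The function name `total_traffic` and all numbers are hypothetical.

```python
def total_traffic(c, r, f):
    """Return (client_related, intra_server) traffic for the whole network.

    c[i][j], r[i][j]: collection and response traffic of client j in subnetwork i.
    f[i][k]: traffic for forwarding requests from server i to server k.
    """
    client_related = sum(sum(ci) + sum(ri) for ci, ri in zip(c, r))
    intra_server = sum(
        f[i][k] for i in range(len(f)) for k in range(len(f)) if k != i
    )
    return client_related, intra_server


# Decentralized case: two subnetworks with two clients and one client.
decentralized = total_traffic(
    c=[[1, 1], [1]],
    r=[[2, 2], [2]],
    f=[[0, 3], [3, 0]],
)

# Centralized case: the same three clients in a single subnetwork.
centralized = total_traffic(
    c=[[1, 1, 1]],
    r=[[2, 2, 2]],
    f=[[0]],
)

print(decentralized)  # (9, 6)
print(centralized)    # (9, 0)
```

The client-related traffic is the same in both cases; only the intra-server part disappears when there is a single server node.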
Comparison with distributed architecture
An alternative concept to the network-based Big Data architecture is to have distributed server nodes of which the data have to be kept “eventually consistent”. In such an architecture, the network traffic and the computational effort are given by
Here, Si is the network traffic and Ri the computational effort to keep the data on the server node vi0 eventually consistent. Notice that both Si and Ri can be parameterized to a relatively low value as a trade-off for more performance.
The computational effort in a distributed architecture is higher than the computational effort in a network-based architecture divided by a certain factor. The network traffic in the network-based architecture, in turn, is higher than in the distributed architecture by up to about the amount of the intra-server traffic.
3.3. Privacy, interoperability and competition
The previous section shows that compared with both the centralized and the distributed architecture, the network-based Big Data architecture reduces the computational effort for data collection because of the sublinear scalability, but increases the request-related network traffic by a factor of n, in the form of intra-server traffic. From a technical point of view, a choice in favour of the network-based architecture and the determination of the number of server nodes may be a result of diverse considerations of specific applications and the costs of server or network deployment.
When considering privacy protection, however, the network-based Big Data architecture becomes a must, as it provides better privacy protection.
The k-anonymity limits the probability of a random identification of an individual based on any existing quasi-identifier within the data set under consideration to at most 1/k.
As mentioned in the previous section, the data stored on a Big Data server node cannot be considered k-anonymous. But if a specific Big Data application is built based on n network-based server nodes, a similar effect of k-anonymity can be achieved since the probability that the personal data of an individual is on a specific server node i is given by
Just like the configurable eventual consistency, the privacy protection can also be configured
The higher the number of network-based Big Data server nodes, the higher the privacy protection. In addition, in such an architecture each Big Data server node faces much smaller potential damage from hacker attacks.
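For readers who want to see k-anonymity in operational terms, the following sketch computes the largest k for which a small data set is k-anonymous: records are grouped by their quasi-identifier values, and k is the size of the smallest group. The function name `k_anonymity`, the field names, and the example records are hypothetical and serve only to illustrate the definition quoted earlier.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the largest k for which the records are k-anonymous.

    Every record must share its quasi-identifier values with at least
    k - 1 other records, so k is the size of the smallest group.
    """
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical example records with generalized quasi-identifiers.
records = [
    {"zip": "101**", "age": "30-39", "diagnosis": "A"},
    {"zip": "101**", "age": "30-39", "diagnosis": "B"},
    {"zip": "102**", "age": "40-49", "diagnosis": "A"},
    {"zip": "102**", "age": "40-49", "diagnosis": "C"},
    {"zip": "102**", "age": "40-49", "diagnosis": "B"},
]

k = k_anonymity(records, ["zip", "age"])
print(k)  # 2
```

With k = 2, a random identification based on these quasi-identifiers succeeds with a probability of at most 1/2; adding a record with a unique combination would drop k to 1.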
Interoperability in mobile telecommunications is a fundamental principle. Nobody wants to register with several provider networks just to be able to communicate with members of the other networks. Social networks do not support interoperability because they want to bind their users in order to achieve the greatest possible diversification of advertising. If one can choose to be a member of only one specific network, the number of possible members of each network decreases. As a result, the mass of users focuses on the network of the market leader in order to reach most of their peers. This in turn leads to more market power for the market leader and more personal information concentrated on one server node.
Interoperability would lead to better competition, since the social networks would have to advertise to their users. The accessibility of other peers would no longer be in the foreground but rather other quality characteristics, for example, privacy and security matters. There are different approaches to decentralized social networks, but none of them is really successful. They suffer from the fact that users cannot find their peers or like-minded people from whom they can learn about a particular subject within the network.
As discussed earlier, both search engines and messengers can be built on a network-based Big Data architecture, which also allows the implementation of the further applications mentioned before.
A smart grid fault indicator will be capable of helping consumers to manage and optimize their day-to-day energy use, even at the appliance level. However, “The information collected on a smart grid will form a library of personal information, the mishandling of which could be highly invasive of consumer privacy”. In this context, a multilevel network-based Big Data architecture may contribute decisively to provide both privacy and transparency for better grid-wide load management. In such an architecture, appliance-specific data remains in the subnetwork of a private household while aggregated real and forecasted information is forwarded to the Big Data server node at a higher level, say, the neighbourhood server node, which in turn provides aggregated information from several households in the neighbourhood to an even higher level server network. In this way a sufficient anonymization of consumers’ information takes place.
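The multilevel aggregation can be sketched as follows. This is a minimal illustration under assumptions: the functions `household_report` and `neighbourhood_report`, the appliance names, and the readings in watt-hours are all hypothetical, and forecasting is left out.

```python
def household_report(appliance_readings):
    """Aggregate inside the household; appliance-level data never leaves it."""
    return sum(appliance_readings.values())


def neighbourhood_report(household_totals):
    """Aggregate household totals before passing them to the next level."""
    return {
        "households": len(household_totals),
        "total_wh": sum(household_totals),
    }


# Hypothetical appliance-level readings in watt-hours, kept per household.
households = [
    {"fridge": 1200, "oven": 2300},
    {"heat_pump": 5400},
]

totals = [household_report(readings) for readings in households]
report = neighbourhood_report(totals)
print(report)  # {'households': 2, 'total_wh': 8900}
```

Each level forwards only a sum, so the neighbourhood server never sees which appliance caused the load, and the level above it never sees individual households.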
The network-based Big Data architecture may help to establish health care data services which may collect health data of a certain individual only with his or her prior explicit permission. The data collection may even include personal search engine logs. The collected data must not be shared with any other services or persons due to medical confidentiality rules. However, medical analytics can be carried out and the results can be provided to optimize health care management as long as it conforms to the agreement between the data owner and the service provider.
GPS and other physical or even chemical sensors are increasingly becoming relevant data sources. From a technical point of view, the data streams from such sources may be used for many different applications, from motion tracking providing evidence of a healthy lifestyle or a defensive driving style to obtain the maximal discount on health care or vehicle insurance, through traffic management to epidemic control and dragnet investigations. A decentralized network-based Big Data architecture obviously provides an exceptional option to ensure the data is only used for the purposes desired by the owner.
Here, we proposed a concept called “network-based Big Data.” A network of loosely connected Big Data analytic server nodes allows not only decentralized storage and processing, but also individual and privacy-preserving response policies.
The sublinear scalability of Big Data systems makes decentralized data storage more favourable with regard to performance. The amount of extra network traffic between the server nodes turns out to be acceptable. The core benefit of a network-based Big Data architecture, however, is in the possibility to implement individual policies of privacy protection on behalf of the owners of the collected data.
A key issue is to establish standards for intra-server communications which shall be supported by different providers. Such interoperability facilitates not only the network-based Big Data architecture on the operational level but also competition, which in the long run will continuously promote improvements in service quality and especially the protection of privacy.