Integrating Big Data in Cloud Environment – A Review

Mr. Deepak Ahlawat1, Dr. Deepali Gupta2
PhD Research Scholar, MMU Sadopur1; HOD CSE, MMU
[email protected], [email protected]

Abstract: In this paper, the concepts of Big Data and Cloud Computing are integrated and reviewed. The term big data refers to the huge volume of data in today’s internet environment, much of which cannot be integrated easily.

Cloud computing and big data go hand in hand. Big data gives users the ability to utilize massive computing power to process distributed queries across different datasets and return result sets in a timely manner. Cloud computing is the underlying engine that, along with Hadoop, provides the platform for distributed data processing. In the later section, future work on the integration of big data and cloud computing is presented.

Keywords: GA, PRF, CURE.

1 Introduction

1.1. Big Data

Big data [1] can be characterized by 4Vs: the extreme volume of data, the wide variety of types of data, the velocity at which the data must be processed, and the value of the process of discovering huge hidden values from large datasets with various types and rapid generation. The term big data refers to the huge volume of data in today’s internet environment, much of which cannot be integrated easily. Obtaining useful analysis from big data takes a huge amount of time and money.

As knowledge can only be derived from a careful analysis of data (data mining), several new approaches to storing and analysing data have emerged. In these approaches, raw data with extended metadata is aggregated in a data lake, and machine learning and artificial intelligence (AI) programs use complex algorithms to look for repeatable patterns [2]. Large amounts of data are collected because of human involvement in the digital space. Work is shared, stored and managed, and it lives online. As an example, several terabytes of data are uploaded and viewed daily on Facebook.

Fig. 1. Big Data Classification

This kind of huge data containing useful information is known as big data.

Clustering is a capable data mining method widely used for mining valuable information from unlabeled data. Over the last few decades, a number of clustering algorithms have been developed on the basis of a variety of theories and applications.

1.2. Cloud Computing

A cloud is a computing process in which services are delivered over a network by computing processes [3]. Service models consist of three main categories [4]:

Fig. 2. Service Models (Software, Platform, Infrastructure)

SaaS (Software as a Service)
· Web access is given to commercial software.
· The software is managed from a central location.
· The software is delivered in a one-to-many model.
· Users do not need to manage software upgrades and patches.
· Application Programming Interfaces (APIs) allow integration among different pieces of software.

PaaS (Platform as a Service)
· Services to develop, test, deploy, host and maintain applications in the same integrated development environment, together with the services needed to complete the application development process.
· Web-based user interface creation tools that help to create, modify, test and deploy different UI scenarios.
· A multi-tenant architecture in which multiple concurrent users use the same development application.
· Built-in scalability of deployed software, including load balancing and failover.
· Integration with web services and databases through common standards.
· Support for development team collaboration; some PaaS solutions include project planning and communication tools.
· Tools to handle billing and subscription management.

IaaS (Infrastructure as a Service)
· Resources are distributed as a service.
· It allows for efficient scaling.
· It has a variable-cost, utility pricing model.
· It usually has a multi-user environment.

1.3. Relation of Cloud Computing and Big Data

Cloud computing and big data go hand in hand. Big data gives users the ability to utilize massive computing power to process distributed queries across different datasets and return result sets in a timely manner. Cloud computing is the underlying engine that, along with Hadoop, provides the platform for distributed data processing [5].

The relation between cloud computing and big data is shown in the figure below. Large data sources from the cloud and the Web are stored in a distributed, fault-tolerant database and processed via a programming model for huge datasets that runs a parallel distributed algorithm within a cluster [6].

Fig. 3. Relation of Cloud Computing and Big Data
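The programming model for huge datasets is not named in the text; in a Hadoop setting it is commonly MapReduce, which is assumed here. The following is a minimal single-machine sketch of the map, shuffle and reduce phases using a word count. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    mapped = []
    # In a real cluster each document (or split) would be mapped on a
    # different node; here the loop only imitates that sequentially.
    for doc in documents:
        mapped.extend(map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data in the cloud", "cloud computing and big data"])
print(counts)
```

Because each map call looks at one document only and each reduce call looks at one key only, both phases can in principle be spread across the nodes of a cluster, which is the property the paragraph above relies on.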

1.4. Clustering in Big Data

Data clustering is the problem of partitioning a set of unlabeled objects O = {o1, o2, . . . , on} into k groups of similar objects, in which 1

These data can be represented in the form of an n × n dissimilarity matrix D, with Dij representing the dissimilarity (distance) between oi and oj. Typically, the Euclidean distance ||xi − xj|| is used as the dissimilarity measure, but it could be any norm on Rp [7].

Some clustering algorithms are described in the following subsections.
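The dissimilarity matrix D defined above can be sketched in a few lines. This is a minimal illustration that assumes each object is given as a tuple of coordinates and uses the Euclidean distance; the function name `dissimilarity_matrix` and the sample points are made up for the example.

```python
import math


def dissimilarity_matrix(points):
    """Return the n x n matrix D with D[i][j] = ||x_i - x_j|| (Euclidean)."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            # A distance is symmetric, so fill both halves at once.
            D[i][j] = D[j][i] = d
    return D


X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = dissimilarity_matrix(X)
for row in D:
    print(row)
```

Storing D needs memory proportional to n × n, which is one reason the full matrix becomes impractical for very large n.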

1.4.1. K-means clustering

The k-means clustering algorithm is a fundamental partitioning-based algorithm used for many clustering tasks, mainly with low-dimensional datasets. It takes k as a parameter and divides n objects into k clusters so that objects in the same cluster are similar to each other but dissimilar to objects in other clusters. The algorithm finds the cluster centers (C1, ..., Ck) that minimize the sum of the squared distances from every data point xi, 1 ≤ i ≤ n, to its nearest cluster center Cj, 1 ≤ j ≤ k.

Initially, the algorithm arbitrarily selects k objects, each representing a cluster mean (center). Then each object xi in the data set is assigned to the nearest cluster center, i.e. to the most similar center.

The algorithm computes the new mean of every cluster and re-assigns every object to the nearest new center. This process iterates until no object changes its assignment. The converged result (locally) minimizes the sum-of-squares error, defined as the sum of the squared distances from every object to its cluster center [7].
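The steps described above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, and the sample points are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of coordinate tuples.

    Returns (centers, assignment, sse), where assignment[i] is the index
    of the center that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Step 1: arbitrarily pick k objects as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Stop when no object changes its cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Sum-of-squares error of the final partition.
    sse = sum(math.dist(p, centers[a]) ** 2 for p, a in zip(points, assignment))
    return centers, assignment, sse


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment, sse = kmeans(points, k=2)
print(centers, assignment, sse)
```

The result depends on the initial centers, so in practice the algorithm is often run several times with different seeds and the partition with the lowest error is kept.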

1.4.2. Fuzzy K-means

Fuzzy K-means, also known as fuzzy C-means clustering, is an extension of the K-means technique [8]. The K-means algorithm finds only clusters of regular shape, i.e. hard clusters, whereas fuzzy K-means is also suitable for finding soft clusters [9]. The fuzzy k-means algorithm is described as follows:

1. Assume a fixed number of clusters k. Randomly initialize the k means associated with the clusters and compute the probability that every data point is a member of a given cluster.
2. Recalculate the centroid of each cluster as the weighted centroid, given the membership probabilities of all data points.
3. Iterate until convergence or until a user-specified number of iterations is reached.
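A minimal sketch of these three steps is given below. It uses the standard fuzzy C-means update rules with a fuzzifier m = 2; the fuzzifier, the tolerance `tol`, and the names `memberships` and `fuzzy_kmeans` are assumptions made for the illustration and are not specified in the text.

```python
import math
import random


def memberships(points, centers, m=2.0):
    """Membership degree of every point in every cluster (each row sums to 1)."""
    U = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # The point coincides with a center: give it full membership there.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            U.append([h / sum(hits) for h in hits])
            continue
        inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
        total = sum(inv)
        U.append([v / total for v in inv])
    return U


def fuzzy_kmeans(points, k, m=2.0, max_iter=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    dim = len(points[0])
    # Step 1: random initial means, then memberships for every point.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        U = memberships(points, centers, m)
        # Step 2: each center becomes the membership-weighted centroid.
        new_centers = []
        for j in range(k):
            weights = [u[j] ** m for u in U]
            total = sum(weights)
            new_centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dim)
            ))
        # Step 3: stop on convergence or after max_iter iterations.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, U = fuzzy_kmeans(points, k=2)
for p, u in zip(points, U):
    print(p, [round(x, 3) for x in u])
```

Unlike the hard assignment of k-means, every point here keeps a degree of membership in each cluster; taking the largest membership per point recovers a hard partition when one is needed.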

1.4.3. Clustering using Genetic Algorithm

The genetic algorithm (GA), proposed as early as 1989, has attracted much attention because it performs a global search for solutions, whereas other clustering approaches perform a local search and therefore easily get stuck at local optima. In a local search, the newly obtained solution inherits from the solutions of the preceding iteration. Examples include k-means, ANNs, fuzzy clustering algorithms, tabu search and annealing schemes. In a genetic algorithm, however, the crossover and mutation operators can produce new solutions that are very different from those of the preceding iteration, which is where the global optimality essentially comes from [10]. In addition, a genetic algorithm works in parallel, which makes it possible to use parallel hardware to speed up execution. In fact, the genetic algorithm is an evolutionary approach, which applies evolutionary operators to a population of solutions in order to achieve a globally optimal partition.
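A minimal sketch of GA-based clustering follows. It assumes that a chromosome is a list of k cluster centers, that selection is fitness-proportional, that crossover is single-point, and that mutation adds Gaussian noise to a center. The fitness 1 / (1 + SSE) is one way to make fitness decrease as the squared error grows while avoiding division by zero. All names and parameter values are illustrative and are not taken from the cited works.

```python
import math
import random


def sse(points, centers):
    """Squared error: each point contributes its squared distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)


def fitness(points, chromosome):
    # Assumed form: higher fitness for lower squared error.
    return 1.0 / (1.0 + sse(points, chromosome))


def ga_cluster(points, k, pop_size=20, generations=50, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    # Each chromosome encodes one candidate solution: a list of k centers.
    population = [rng.sample(points, k) for _ in range(pop_size)]
    best = max(population, key=lambda ch: fitness(points, ch))
    for _ in range(generations):
        scores = [fitness(points, ch) for ch in population]
        # Selection: fitter chromosomes are more likely to become parents.
        parents = rng.choices(population, weights=scores, k=pop_size)
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            # Single-point crossover swaps the tails of the two parents.
            cut = rng.randint(1, k - 1) if k > 1 else 0
            children.append(a[:cut] + b[cut:])
            children.append(b[:cut] + a[cut:])
        # Mutation: occasionally perturb a center with Gaussian noise.
        for ch in children:
            for j in range(k):
                if rng.random() < mutation_rate:
                    ch[j] = tuple(x + rng.gauss(0.0, 1.0) for x in ch[j])
        population = children
        candidate = max(population, key=lambda ch: fitness(points, ch))
        # Remember the best chromosome seen in any generation.
        if fitness(points, candidate) > fitness(points, best):
            best = candidate
    return best, sse(points, best)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
best_centers, best_error = ga_cluster(points, k=2)
print(best_centers, best_error)
```

Because crossover and mutation can move centers far from their previous positions, the search is less tied to its starting point than k-means, at the cost of evaluating a whole population in every generation.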

GA includes a selection function, a mutation operation and a fitness function. The candidate solutions to the clustering problem are encoded as chromosomes, and a fitness function inversely proportional to the squared error value is then applied to determine each chromosome’s likelihood of surviving into the subsequent generation [11].

2 Related Works

Chen et al. (2017) presented a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform.

The researchers optimized the PRF algorithm with a hybrid approach combining data-parallel and task-parallel optimization. In the data-parallel optimization, a vertical data-partitioning method is used, and in the task-parallel optimization, a dual parallel approach is carried out. Both techniques substantially improve efficiency. Moreover, using dimension reduction in the training process and a weighted voting approach in the process preceding parallelization improves the accuracy of the algorithm for large, noisy and high-dimensional data. The experimental results are superior to those of previous implementations in Spark MLlib and other studies [12].

Xu et al. (2017) designed speculative execution schemes for parallel processing clusters.

The researchers devised two schemes: one for lightly loaded systems and the other for heavily loaded systems. For lightly loaded systems, they proposed the Smart Cloning Algorithm (SCA), and for heavily loaded systems, the Enhanced Speculative Execution (ESE) algorithm. The simulation results compare SCA with Microsoft Mantri: with SCA, the total job flowtime is reduced by 6% in comparison to Microsoft Mantri. In terms of job flowtime, the ESE algorithm outperforms the Microsoft Mantri baseline scheme by 71% [13].

Thingom et al. (2017) discussed the concept of the integration of big data and cloud computing. The researchers pointed out the flexibility and minimum cost (pay model) required in the cloud scenario [14].

El-Seoud et al. (2017) showed the trends and challenges faced in the field of big data and cloud computing. The study reveals the risks and benefits that may arise from the integration of big data and cloud computing. The study also explains the concepts behind big data and cloud computing [15].

Bharill et al. (2016) focused on clustering large datasets in the Apache Spark environment.

The authors designed and implemented partition-based clustering and chose this environment because of its low computational needs. In this research, the Scalable Random Sampling with Iterative Optimization Fuzzy C-Means algorithm (SRSIO-FCM) is implemented on an Apache Spark cluster. Experimental studies on different big datasets are conducted. SRSIO-FCM performs better than Literal Fuzzy C-Means (LFCM). The results are stated in terms of space and time complexity. According to the results, SRSIO-FCM runs in less time without compromising the clustering quality [16].

Wei Shao et al. (2016) presented a model for clustering data by means of spatiotemporal intervals, a spatiotemporal data type associated with a start point and an end point. The proposed model can be used to evaluate clusters of spatiotemporal interval data. The work aims to evaluate clustering results in a variety of Euclidean spaces. This differs from existing clustering, which calculates the outcome in a single Euclidean space.

The existing clustering algorithms are analyzed and compared using an energy function [17].

Sun et al. (2015) performed clustering using a time impact factor matrix. The matrix monitors how user interest drifts and then predicts the rating of an item. In addition to the time impact factor matrix, the authors added one more time impact factor and used linear regression to predict user interest drift. Comparative experiments were conducted on three big datasets, namely MovieLens1M, MovieLens100K and MovieLens10M. The results show that the proposed approach effectively improves the prediction accuracy [18].

Sookhak et al. (2015) improved the storage capability of the cloud system by reducing the communication and computational overhead costs. The authors proposed an RDA (Remote Data Auditing) technique based on algebraic signature properties. They also designed a novel data structure, the DCT (Divide and Conquer Table), which effectively supports dynamic data operations such as insert, append, delete and modify. The DCT data structure can be applied to large-scale data storage and incurs less computational cost. The comparison between the proposed method and other RDA techniques shows that the proposed method is efficient and secure, and hence reduces the computational and communication costs on the server and the auditor [19].

Kumar et al. (2015) proposed the ClusiVAT algorithm. The proposed algorithm is compared with K-means, single-pass K-means, online K-means and CURE (Clustering Using Representatives). The comparison results show that ClusiVAT is the fastest and most accurate of the five algorithms. For example, it recovered 97% of the ground truth labels in the real-world KDD-99 cup data (4 292 637 samples in 41 dimensions) in 76 s [7].

Hashem et al. (2014) studied how the vast amount of data (big data) and cloud computing pose a challenge in today’s computing world. The definition, characteristics and classification of big data are discussed in relation to cloud computing. The relationship between big data and cloud computing, Hadoop technology and big data storage systems are also discussed. In addition, research challenges are investigated, with a focus on availability, data transformation, data heterogeneity, governance and privacy, and legal and regulatory issues [1].

Yin et al. (2014) focused on fault detection and isolation for vehicle suspension systems. The proposed scheme is divided into three main steps: first, confirming the number of clusters based on PCA (principal component analysis); second, detecting faults using fuzzy positivistic C-means clustering with the fault lines; and third, isolating the root causes of faults using Fisher discriminant analysis. Unlike other schemes, the proposed method requires only measurements from accelerometers fixed on the four corners of a vehicle suspension.

Moreover, different spring attenuation coefficients are regarded as a special failure in place of a few others [8].

Konak et al. (2006) studied the emerging technology GA (genetic algorithm) for existing problems. The authors address multi-objective formulations, which are considered realistic models for many complex engineering optimization problems. In real-life problems, the objectives under consideration conflict with each other, and optimizing a particular solution with respect to a single objective can result in unacceptable results with respect to the other objectives [11].

3 Future Works

Beyond basic execution needs, small additional services such as machine learning, analytics and orchestration are being made accessible by the cloud. There are numerous reasons for this move, as summarized below [20]:

i. Clouds are the main providers of data services.
ii. Machine learning and other AI approaches will surely improve the scenario, and orchestration (automation) would enable the service provider to meet the service level agreement on time.
iii. Analytics would accelerate the business, and orchestration can be helpful when the acceleration takes place.
iv. The future of clouds would be a mixture of analytics and orchestration.
v. Big data and cloud computing will surely automate the maximum workload in the distributed computing environment.

References

1. Hashem et al., “The rise of “big data” on cloud computing: Review and open research issues”, Information Systems 47 (2014): 98-115.

2. Wu et al., “Data mining with big data”, IEEE Transactions on Knowledge and Data Engineering 26.1 (2013): 97-107.
3. Subashini et al., “A survey on security issues in service delivery models of cloud computing”, Journal of Network and Computer Applications 34.1 (2011): 1-11.
4. Pallis et al., “Cloud computing: the new frontier of internet computing”, IEEE Internet Computing 14.5 (2010): 70-73.
5. Talia Domenico, “Toward cloud-based big-data analytics”, IEEE Computer Science (2013): 98-101.
6. Fernandez et al., “Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks”, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4.5 (2014): 380-409.
7. Kumar et al., “A hybrid approach to clustering in big data”, IEEE Transactions on Cybernetics 46.10 (2015): 2372-2385.
8. Yin et al., “Performance monitoring for vehicle suspension system via fuzzy positivistic C-means clustering based on accelerometer measurements”, IEEE/ASME Transactions on Mechatronics 20.5 (2014): 2613-2620.
9. Zahid et al., “Fuzzy clustering based on K-nearest-neighbours rule”, Fuzzy Sets and Systems 120.2 (2001): 239-247.
10. Maulik et al., “Genetic algorithm-based clustering technique”, Pattern Recognition 33.9 (2000): 1455-1465.
11. Konak et al., “Multi-objective optimization using genetic algorithms: A tutorial”, Reliability Engineering & System Safety 91.9 (2006): 992-1007.
12. Chen et al., “A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment”, IEEE Transactions on Parallel and Distributed Systems 28.4 (2017): 909-933.
13. Xu et al., “Optimization for Speculative Execution in Big Data Processing Clusters”, IEEE Transactions on Parallel and Distributed Systems 28.2 (2017): 530-545.
14. Thingom et al., “An Integration of Big Data and Cloud Computing”, Proceedings of the International Conference on Data Engineering and Communication Technology (2017): 729-737.
15. El-Seoud et al., “Big Data and Cloud Computing: Trends and Challenges”, International Journal of Interactive Mobile Technologies 11.2 (2017): 34-52.
16. Bharill et al., “Fuzzy Based Scalable Clustering Algorithms for Handling Big Data Using Apache Spark”, IEEE Transactions on Big Data 2.4 (2016): 339-352.
17. Wei Shao et al., “Clustering Big Spatiotemporal-Interval Data”, IEEE Transactions on Big Data 2.3 (2016): 190-203.
18. Sun et al., “Dynamic Model Adaptive to User Interest Drift Based on Cluster and Nearest Neighbors”, IEEE Access 14.8 (2015): 1682-1691.
19. Sookhak et al., “Dynamic remote data auditing for securing big data storage in cloud computing”, Information Sciences 380 (2015): 101-116.
20. Furht et al., “Handbook of cloud computing”, Vol. 3. New York: Springer (2010).
21. Azar et al., “Dimensionality Reduction of Medical Big Data using Neural-Fuzzy Classifier”, Soft Computing: Springer 19.4 (2015): 1115-1127.
22. Cao et al., “Cluster as a Service: A Resource Sharing Approach for Private Cloud”, Tsinghua Science and Technology 21.6 (2016): 610-619.
23. Han et al., “Data Mining: Concepts and Techniques”, Elsevier (2006).
24. Kurasova et al., “Strategies for Big Data Clustering”, IEEE 26th International Conference on Tools with Artificial Intelligence (2014): 740-747.
25. Reshmy et al., “Data Mining of Unstructured Big Data in Cloud Computing”, International Journal of Business Intelligence and Data Mining 12.3 (2017).
26. Zhao et al., “Independent Tasks Scheduling Based on Genetic Algorithm in Cloud Computing”, WiCom ’09 - 5th International Conference (2009).