These two UDFs are useful for calculating age and event intervals. PickupSequenceValues filters data observed consecutively for a period starting from an assigned date. This UDF is useful for extracting log data of pharmaceutical administration repeated over a period (Figure 4: a step-by-step example of date management); a sketch of this style of filter follows below.
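To make the behavior concrete, here is a minimal plain-Java sketch of what such a consecutive-period filter does. This is not the published UDF (which would be written against Pig's UDF API); the record type, field names, and day granularity are illustrative assumptions.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical record type for illustration: one administration log entry.
record LogEntry(String patientId, LocalDate date, String drugCode) {}

public class SequenceFilter {
    /**
     * Keeps entries observed on consecutive days, up to `days` entries,
     * starting from `start` -- the behavior the text ascribes to the
     * PickupSequenceValues UDF. Input must be sorted by date.
     */
    static List<LogEntry> pickupSequence(List<LogEntry> sorted, LocalDate start, int days) {
        List<LogEntry> result = new ArrayList<>();
        LocalDate expected = start;
        for (LogEntry e : sorted) {
            if (e.date().isBefore(start)) continue;  // ignore entries before the window
            if (!e.date().equals(expected)) break;   // stop at the first gap (or duplicate date)
            result.add(e);
            expected = expected.plusDays(1);
            if (result.size() == days) break;        // window complete
        }
        return result;
    }
}
```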
Time efficiency is an important issue in data management. The main goal of this study was to provide researchers with open-source, time-efficient software for handling large-scale administrative data. Existing methods designed to handle small datasets would require a vast amount of time to process a large dataset. This is a serious problem because it may hinder researchers in carrying out large-data studies. We developed our software to solve this problem and to contribute to the enhancement of research using large administrative databases.
Consequently, we evaluated the performance of the software mainly in terms of time efficiency and scalability. The Elastic Compute Cloud (EC2) infrastructure service from Amazon was used as the test bed for the performance evaluation.
In this benchmarking test, we created dummy administrative data for in-hospital services containing patient discharge summary data and medical activity logs for 20 different kinds of medications.
We prepared discharge summary data for 2. The input and output data layout is shown in Additional file 1: Appendix 1, and the program script used for the benchmark test is given in Additional file 2: Appendix 2.
We created a Hadoop cluster on Amazon EC2, composed of one master node (running the NameNode and JobTracker) and varying numbers of slave nodes (running the TaskTrackers and DataNodes). For the scaling benchmark, we used the entire sample data and ran the same script 20 times using one master node and 2 slave nodes.
Then we doubled the number of slave nodes until 48 nodes were used, repeatedly measuring the processing time.
As shown in the graph, there is a clear linear relationship between processing time and data size. The intercept of the model is significant and should correspond to the lead time for batch processing.
Processing speed benchmark. Dots indicate the average processing time for 20 trials. The line indicates the prediction equation fitted with a linear regression.
Scaling benchmark. The line indicates the prediction equation fitted with a power regression.

The motivation for developing the current system was to simplify the use of large-scale administrative databases in epidemiological and health service research, and for policy evaluation. We believe the developed system will be useful and will contribute to this goal for the following reasons. Firstly, the developed system achieves satisfactory scaling when a large-scale dataset is converted for parallelization with Hadoop.
Because of the overhead of managing each node, adding nodes yields diminishing returns in throughput, but the system retains adequate scaling ability. Processing the millions of administrative activity log entries in the performance test took one hour at a cost of 10 US dollars in a parallel environment such as Amazon EC2 with one master and eight slave nodes.
Completing the same task in 10 minutes requires one master plus 48 slave nodes, at a cost of about 50 US dollars. Thus, the system lets users trade response time against cost. Secondly, the current system uses only free and open-source components: the Hadoop framework for distributed data processing and the Pig Latin language for script development.
Furthermore, Pig can execute the same script on local computers even without Hadoop. Durok is a by-product that grew out of this use of Pig, and it allows the developed system to be used as standalone application software.
At present, Durok is open-source software available under an Apache License; however, it is not an official project of the Apache Foundation. The Durok system can be applied to small datasets that can be processed without a distributed data processing environment. Finally, the system achieved quick response times in processing the large administrative database, allowing convenient ad hoc analysis in a trial-and-error fashion.
Quick and easy access to large databases allows researchers and analysts broader opportunities for investigating innovative research questions, generating hypotheses to be tested in formal research, and ad-hoc monitoring of adverse events. The current system still needs further development of the UDFs to allow more complicated data transformation with simpler scripts. Currently, the proposed UDFs are functionally separated into grouping and date functions owing to restrictions in the format design of GroupFilterFormat.
However, users may wish to identify patterns in timing and types of administered pharmaceuticals through data mining to find best practice patterns in a real setting. To satisfy such requirements, the format design needs further development to allow flexibility in setting a reference time point in GroupFilterFormat. We believe that the present system is generalizable to any large scale administrative database which has a similar data format to the DPC data.
Another challenge is to further improve efficiency in data processing with increased data sizes. The Reduce process is a limiting factor in improving the speed of data processing.
Currently, the proposed scheme needs two iterations of the Reduce step to transform a table with Innergroup and numeric calculation.

The model obtains effective classification, but it is not applicable to noisy data.
The Adaptive Exponential Bat algorithm was devised for training the classifier; its security constraints are the major drawback of the method. The major challenges in existing techniques are computational complexity, time, cost, oversampling, and speed. These drawbacks are overcome by using Bayesian classifiers.
The CNB classifier is well suited to imbalanced datasets. By using optimization algorithms, better convergence is obtained, with improved accuracy and low computational complexity. Classifiers based on Bayes' theorem are known as Bayesian classifiers; Naive Bayes (NB) is one example.
Bayesian classification is based on calculating the posterior probability from assumed prior probabilities and the probability of the observed data under each class. The NB classifier additionally assumes that the attributes of the object to be classified are mutually independent.
The NB classifier therefore computes the probability of each category and assigns the object to the category with the largest associated probability.
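As a concrete restatement of this decision rule, the following is a minimal Java sketch of textbook Naive Bayes with log-probabilities; the data structures and smoothing constant are illustrative, not the paper's implementation.

```java
import java.util.List;
import java.util.Map;

public class NaiveBayes {
    /**
     * classPrior.get(c)             -- P(c), estimated from training counts.
     * condProb.get(c).get(i).get(v) -- P(attribute_i = v | c), pre-smoothed.
     * Both structures are illustrative; the source does not publish its code.
     */
    static String classify(Map<String, Double> classPrior,
                           Map<String, List<Map<String, Double>>> condProb,
                           List<String> attributes) {
        String best = null;
        double bestLogPosterior = Double.NEGATIVE_INFINITY;
        for (String c : classPrior.keySet()) {
            // log P(c) + sum_i log P(x_i | c): the NB independence assumption;
            // logs avoid underflow when many attribute probabilities multiply.
            double logPosterior = Math.log(classPrior.get(c));
            for (int i = 0; i < attributes.size(); i++) {
                double p = condProb.get(c).get(i).getOrDefault(attributes.get(i), 1e-9);
                logPosterior += Math.log(p);
            }
            if (logPosterior > bestLogPosterior) {
                bestLogPosterior = logPosterior;
                best = c;
            }
        }
        return best; // the class with the largest posterior
    }
}
```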
NB is one of the most widely used classifiers, and here the classifier is adapted to the Map-Reduce framework for big data classification [59]. In the initial phase of the training process, the input data are arranged into groups according to class; a sketch of how this grouping is typically expressed in MapReduce follows.
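Below is a hedged sketch of the class-wise counting that such a training phase typically performs, written against Hadoop's Java MapReduce API; the CSV record layout and the key encoding are assumptions, not the paper's code.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Emits one count per class and per (class, attribute index, value) triple. */
class NbCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");   // assumed CSV: attrs..., label
        String label = fields[fields.length - 1];
        context.write(new Text("class\t" + label), ONE); // towards the prior P(c)
        for (int i = 0; i < fields.length - 1; i++) {
            // towards the conditional P(attribute_i = v | c)
            context.write(new Text("cond\t" + label + "\t" + i + "\t" + fields[i]), ONE);
        }
    }
}

/** Sums the counts; they are normalized into probabilities afterwards. */
class NbCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}
```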
The block diagram of the developed model for big data classification is depicted in the corresponding figure. In the FCNB classifier, each data sample is classified into one of $K$ classes. For a test sample $X$, the classifier evaluates the posterior probability $P(g_k \mid X)$ of each class $g_k$, and the FCNB outcome is the class with the largest posterior:

$$\hat{g} = \arg\max_{k \in \{1, \dots, K\}} P(g_k \mid X).$$

The expression $C_k$ signifies the correlation for class $k$.

The new classification technique called HCNB [62] is introduced by combining the existing CNB classifier with the holoentropy function. Each attribute $i_b$ is handled through an estimate of its holoentropy, the product of a weight function and an entropy:

$$HL(i_b) = F(i_b) \times T(i_b),$$

where $F$ represents the weight function and $T(i_b)$ the entropy,

$$F(i_b) = 2\left(1 - \frac{1}{1 + \exp(-T(i_b))}\right), \qquad T(i_b) = -\sum_{j=1}^{M_{i_b}} p_j \log p_j,$$

and $M_{i_b}$ is the number of unique values of the attribute vector $i_b$. The training phase of HCNB, based on the training data samples, produces its result in vector form.
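Under the entropy and weight forms above, the holoentropy of a single attribute can be computed with a short plain-Java routine; the frequency-array input is an illustrative representation.

```java
public class Holoentropy {
    /**
     * Holoentropy of one attribute: HL = F * T, where T is the Shannon
     * entropy of the attribute's value distribution and F is the weight
     * 2 * (1 - 1 / (1 + exp(-T))). `counts` holds the frequency of each
     * of the attribute's unique values.
     */
    static double holoentropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double entropy = 0.0;                   // T(i_b)
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            entropy -= p * Math.log(p);
        }
        double weight = 2.0 * (1.0 - 1.0 / (1.0 + Math.exp(-entropy)));  // F(i_b)
        return weight * entropy;                // HL(i_b)
    }

    public static void main(String[] args) {
        // An attribute with three unique values occurring 5, 3, and 2 times.
        System.out.println(holoentropy(new int[]{5, 3, 2}));
    }
}
```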
During the testing phase, the individual class is selected by estimating the posterior probability of each class independently. This section presents the classifiers' evaluation results, together with a detailed comparative analysis against the existing methods.
The system requirements and implementation details are provided in the experimental setup. The methods included in the developed classifiers are implemented in the Java programming language. The parameters used for the experimentation are a maximum iteration count of 5 and a population size of 6, with the mapper size varied across runs. The datasets used for the experimentation are the localization dataset and the cover type dataset. The localization dataset is taken from the UCI machine learning repository [63].
The localization dataset records the activities of five people, each wearing four tags (ankle left, ankle right, belt, and chest). The dataset contains a total of 164,860 instances with 8 attributes. Each instance forms localization data for the tags, and the attributes are used to recognize them. The cover type dataset is likewise taken from the UCI machine learning repository [64].
The dataset contains a total of 581,012 instances with 54 attributes. Five metrics (accuracy, sensitivity, specificity, memory, and time) are used to evaluate the performance of the classifiers. The degree of veracity is measured using the accuracy metric, defined as the proportion of true results; sensitivity and specificity are the proportions of correctly identified true positives and true negatives, respectively.
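These three quality metrics have standard confusion-matrix definitions, restated here for reference; the TP, TN, FP, FN notation (true/false positives/negatives) is a conventional addition, not the source's:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}.$$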
A comparative analysis is done to evaluate the developed classifiers against the existing models based on sensitivity, specificity, and accuracy. In this paper, the first model developed is the CNB classifier. The holoentropy extension of the CNB classifier, HCNB, is created and compared with the existing models under similar conditions and parameters, and improvements are observed when its performance is assessed on the localization dataset and the cover type dataset.
The number of mappers represents the number of desktops used for simulation. The developed classifiers CNB and CGCNB are evaluated on the localization dataset in terms of accuracy, sensitivity, specificity, memory, and time; the performance evaluation is presented in this section. The evaluation is carried out with a mapper size of 5 while varying the amount of training data.
This improvement with increasing training percentage is achieved by all the classifiers. From the above analysis, we can conclude that execution time and memory usage decrease as the mapper size increases.
In this proposed work, the mapper size denotes the number of desktops used. The performance evaluation using the cover type dataset is presented in this section. The developed classifiers are evaluated based on accuracy, sensitivity, specificity, memory, and time. The evaluation is carried out by varying the number of mappers and the amount of training data. As shown in Tables 1 and 2, for all the classifiers, increasing the training percentage improves the system's overall performance in terms of accuracy, sensitivity, specificity, memory, and execution time.
Likewise, increasing the mapper size decreases the memory requirement and the execution time. For the localization dataset, the FCNB classifier achieves better accuracy, sensitivity, and specificity than the other methods.
Similarly, for the cover type dataset, the HCNB classifier achieves better accuracy, sensitivity, and specificity than the other techniques. For both datasets, CNB outperforms NB because only the class with the highest posterior value is selected as the consequent class. HCNB is well suited to big data classification. This paper focused on big data classification based on different functions incorporated into the Map-Reduce framework.
The basic model is the CNB classifier, which is then integrated with optimization algorithms such as cuckoo search and grey wolf optimization. In future work, the performance of the classifiers will be analyzed using log loss and training loss.
References

Smart4Job: a big data framework for intelligent job offers broadcasting using time series forecasting and semantic classification. Big Data Research.

Big data and MapReduce: challenges, opportunities and trends. Int J Electr Comput Eng.

A survey on tools used in big data platform. Adv Appl Math Sci.

Data mining with big data.

Marx V. The big challenges of big data.

Pole G, Gera P. A recent study of emerging tools and technologies boosting big data analytics. IEEE Access.
Extreme Learning Machine (ELM) has been widely used in many fields, such as text classification, image recognition, and bioinformatics, as it provides good generalization performance at an extremely fast learning speed. These features also have to be effectively identified from big data sets so that accurate prediction models can be built in real time.
Online feature selection (OFS) with data streams is closely associated with data stream mining [15]. When the data become unmanageable or huge, parallel processing is employed to reduce the time complexity.
In this research effort, a scalable, efficient OFS method using the parallel Accelerated Bat Algorithm (ABA) is proposed to select features from the dataset online. To work with large-scale datasets, the distributed programming model MapReduce is used, which divides the dataset into smaller portions. The scalability of OFS-ABA over an extremely high-dimensional, big dataset is demonstrated through an empirical study, which also shows that the algorithm performs better than other state-of-the-art FS methods.
This research work is structured into several sections. A feature selection (FS) technique for big data analytics is envisioned as a significant selection method with reduced time complexity and enhanced accuracy.
In recent years, bio-inspired algorithms have been applied to various problems in big data analytics [16]. Hoi et al. established the application of OFS on real-time problems, scaling significantly better than other FS algorithms.
The outcomes validated the efficacy of the projected techniques for extensive and varied large-scale applications. Peralta et al. showed through their evaluations that a Spark-implemented framework was beneficial for performing evolutionary FS on large datasets, with enhanced classification precision and runtime. Tsamardinos et al. demonstrated in an experimental study increased scalability in the number of features, with speedup. Tan et al. proposed an algorithm
based on convex semi-infinite programming (SIP) and a multiple kernel learning (MKL) sub-problem, solved with an adaptive accelerated proximal gradient technique, where each base kernel is associated with a set of features.
The results showed improved training competence over bigger data with ultra-large sample sizes. De la Iglesia et al. noted that the strength of evolutionary computation (EC) is its ability to efficiently search large populations. The assessment and implementation uncovered the competency of these algorithms and point to new research directions in FS problems.
In this approach, the feature weights were proportionally decremented based on a threshold value, which zeroed irrelevant feature weights. Hu et al. provided a comprehensive review of present OFS methods, compared them with other methods, and discussed the open issues in FS. Yu et al. presented a method that exhibited superior performance compared with other prevailing algorithms. Reviews of feature selection methods for handling data streams can be found in many recent works [24, 25, 26, 27, 28, 29].
Fong et al. proposed the APSO algorithm, which is based on swarm intelligence; their results show that the algorithm performed well in terms of accuracy, time complexity, and related measures.
Five benchmark datasets were used in that work. The results demonstrated that the MOANOFS system can be successfully applied to diverse domains and can accomplish high accuracy in real-time applications. Lin et al. presented an algorithm that is practical for FS in text classification problems in big data analytics.
The disadvantage is that it is pertinent only to text classification problems. Gu et al. described an algorithm that performs FS to select minimal subsets, followed by classification; the future work is to explore multi-objective, meta-heuristic FS algorithms that handle huge dimensionality with enhanced accuracy. Manoj et al. emphasized the use of population-based hybrid algorithms for FS problems; the challenge for this approach is applying it to other types of data, such as images and video.
Devi et al. proposed a scheme limited by its use of a single classifier, with no performance comparison against other classifiers. For classification problems, deep learning (DL) techniques are considered efficient [30, 31]. Wan et al. proposed a classifier that demonstrated enhanced accuracy compared with other algorithms. Young et al. highlighted the issues of conventional mining and showed the elevated performance of deep neural networks. From the prevailing literature, it can be deduced that bio-inspired algorithms combined with the MapReduce approach prove effective and competent for FS methods in the field of big data analytics.
It is evident that DMLP is used for classification problems. The MapReduce model is applied to big datasets, which are divided into smaller partitions. In this approach, the feature weights are proportionally decremented based on threshold values, and Clustering Coefficients of Variation (CCV) zero out uninformative feature weights. The scalability of OFS-ABA over an extremely high-dimensional, big dataset is demonstrated through an empirical study, which also illustrates that the algorithm performs markedly better than other known FS methods.
The proposed model is shown in the corresponding figure. Normalization is commonly used to maintain the balance of significance among attributes when they are on diverse scales. When a dataset has attributes with diverse ranges, it is preprocessed with the min-max normalization method, which maps every value onto the same scale between 0 and 1, so that an attribute retains importance even when its raw values span a narrow range:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}.$$

From this equation, every value of the dataset is scaled into the range between 0 and 1. OFS [14] is related to streaming features. Accuracy is achieved by selecting only the most relevant feature subset for classification. For features of a streaming nature, the number of features is not known a priori; this issue is well handled by OFS.
OFS acquires dataset instances one at a time. The target class is then compared with the predicted class, and the weight vector is updated with a stochastic gradient rule of the form

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \, \nabla_{\mathbf{w}} \ell(\mathbf{w}_t; \mathbf{x}_t, y_t),$$

where $\eta$ is the learning rate and $\ell$ is the classification loss on example $(\mathbf{x}_t, y_t)$. The examples are equally distributed and processed in parallel so as to achieve class balance. Generally, the reduce phase is carried out by a distinct process, thus reducing the execution time in MapReduce [33].
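The following compact Java sketch shows a mistake-driven update of this kind combined with truncation-based feature selection (keeping only the B largest-magnitude weights); the linear model, learning rate, and budget B are illustrative assumptions rather than the exact OFS-ABA procedure.

```java
import java.util.Arrays;

public class OnlineFS {
    final double[] w;   // weight vector over all candidate features
    final double eta;   // learning rate
    final int b;        // number of features to keep

    OnlineFS(int numFeatures, double eta, int b) {
        this.w = new double[numFeatures];
        this.eta = eta;
        this.b = b;
    }

    /** One online step: predict, compare with the target, update, truncate. */
    void update(double[] x, int y) {    // label y in {-1, +1}
        double score = 0.0;
        for (int i = 0; i < w.length; i++) score += w[i] * x[i];
        int predicted = score >= 0 ? 1 : -1;
        if (predicted != y) {           // mistake-driven gradient step
            for (int i = 0; i < w.length; i++) w[i] += eta * y * x[i];
            truncate();
        }
    }

    /** Zero all weights smaller in magnitude than the B-th largest. */
    private void truncate() {
        double[] mags = new double[w.length];
        for (int i = 0; i < w.length; i++) mags[i] = Math.abs(w[i]);
        Arrays.sort(mags);              // ascending
        double threshold = mags[Math.max(0, w.length - b)];
        for (int i = 0; i < w.length; i++) {
            if (Math.abs(w[i]) < threshold) w[i] = 0.0;
        }
    }
}
```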
The entire execution is done within a single MapReduce process, which eliminates the extra disk accesses. Bats collect the information about the streaming features: microbats are capable of echolocation, a characteristic exploited here to find optimal streaming features and classify them. The process is as follows [34, 35]. Each bat emits a pulse whose frequency (and hence wavelength) is modified accordingly; at time step $t$, the feature position $fp_i$ and velocity $ve_i$ of bat $i$ in the higher-dimensional population are updated by the standard bat rules

$$f_i = f_{\min} + (f_{\max} - f_{\min})\,\beta,$$

$$ve_i^{t} = ve_i^{t-1} + (fp_i^{t-1} - fp_*)\,f_i,$$

$$fp_i^{t} = fp_i^{t-1} + ve_i^{t},$$

where $\beta \in [0, 1]$ is a uniform random number and $fp_*$ is the current global best position.
The best feature is then refined around the current best solution using a local random walk, $fp_{\text{new}} = fp_{\text{old}} + \epsilon\,A^{t}$, where $\epsilon \in [-1, 1]$ is a random number and $A^{t}$ is the average loudness at step $t$; a runnable sketch of these update rules follows.
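The following self-contained Java sketch exercises the standard bat-algorithm update rules above (frequency, velocity, position, and the local random walk). The toy fitness function and all parameter values are illustrative; the parallel Accelerated Bat Algorithm of this work adds its own variations on top of these rules.

```java
import java.util.Random;

public class BatSketch {
    static final Random RNG = new Random(42);

    /** Toy fitness to minimize (sphere function), a stand-in for feature-subset quality. */
    static double fitness(double[] fp) {
        double s = 0;
        for (double v : fp) s += v * v;
        return s;
    }

    public static void main(String[] args) {
        int n = 6, dim = 5, iterations = 50;          // population, dimension, steps
        double fMin = 0, fMax = 2, loudness = 0.5, pulseRate = 0.5;
        double[][] fp = new double[n][dim];           // positions fp_i
        double[][] ve = new double[n][dim];           // velocities ve_i
        for (double[] row : fp) {
            for (int d = 0; d < dim; d++) row[d] = RNG.nextDouble() * 2 - 1;
        }

        double[] best = fp[0].clone();                // current global best fp_*
        for (double[] row : fp) {
            if (fitness(row) < fitness(best)) best = row.clone();
        }

        for (int t = 0; t < iterations; t++) {
            for (int i = 0; i < n; i++) {
                double freq = fMin + (fMax - fMin) * RNG.nextDouble();   // f_i
                for (int d = 0; d < dim; d++) {
                    ve[i][d] += (fp[i][d] - best[d]) * freq;             // velocity update
                    fp[i][d] += ve[i][d];                                // position update
                }
                if (RNG.nextDouble() > pulseRate) {
                    // local random walk around the current best solution
                    for (int d = 0; d < dim; d++) {
                        fp[i][d] = best[d] + 0.01 * (RNG.nextDouble() * 2 - 1) * loudness;
                    }
                }
                if (fitness(fp[i]) < fitness(best)) best = fp[i].clone();
            }
        }
        System.out.println("best fitness: " + fitness(best));
    }
}
```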