National Chengchi University Institutional Repository (NCCUR): Item 140.119/100634

Please use this identifier to cite or link to this item: https://nccur.lib.nccu.edu.tw/handle/140.119/100634

Title:	透過Spark平台實現大數據分析與建模的比較：以微博為例 Accomplish Big Data Analytic and Modeling Comparison on Spark: Weibo as an Example
Authors:	潘宗哲 Pan, Zong Jhe
Contributors:	胡毓忠 Hu, Yuh Jong 潘宗哲 Pan, Zong Jhe
Keywords:	Big data analytics; machine learning; Weibo; analytics pipeline; Amazon EC2
Date:	2016
Issue Date:	2016-08-22 17:23:53 (UTC+8)
Abstract:	The rapid growth and change of data, together with constantly evolving analysis tools, make data analysis increasingly challenging. This study presents a complete machine learning pipeline as a reference blueprint for academic institutions and companies adopting big data analytics. We use Apache Spark as the computing framework and build machine learning models with the two MLlib packages, spark.ml and spark.mllib, to address problems that traditional data analysis can run into. Along the way we compare the scenarios to which different Spark modules are suited. We first develop on a local cluster and then submit the jobs to an Amazon EC2 cluster to speed up modeling and analysis. The pipeline is demonstrated on the 2012 mainland China Weibo dataset provided by the Journalism and Media Studies Centre of the University of Hong Kong. We use RDDs, Spark SQL, and GraphX to extract features from Weibo users' posts, and build a Random Forest model for binary classification that predicts whether a user is officially verified.
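As a rough illustration of the graph-based feature extraction mentioned in the abstract, the sketch below computes each user's in-degree and out-degree from a list of directed edges. It is plain Python rather than Spark, and the edge data and the `degree_features` name are hypothetical. This record does not state which features the thesis actually extracts, so degree counts are an assumption chosen only because they are a common graph feature.

```python
from collections import Counter

# Hypothetical toy data: (source_user, target_user) means the source
# user reposted the target user. Not taken from the Weibo dataset.
edges = [
    ("u1", "u2"),
    ("u3", "u2"),
    ("u2", "u4"),
    ("u1", "u4"),
    ("u3", "u1"),
]


def degree_features(edge_list):
    """Return {user: (in_degree, out_degree)} for every user in edge_list."""
    out_deg = Counter(src for src, _ in edge_list)
    in_deg = Counter(dst for _, dst in edge_list)
    users = set(out_deg) | set(in_deg)
    # A Counter returns 0 for missing keys, so users with no incoming
    # or no outgoing edges get a zero in the matching position.
    return {u: (in_deg[u], out_deg[u]) for u in sorted(users)}


features = degree_features(edges)
for user, (in_d, out_d) in features.items():
    print(user, in_d, out_d)
```

In a Spark setting, per-user counts of this kind would typically be joined with other per-user columns into a feature table before training a classifier.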
Description:	Master's thesis, Department of Computer Science, National Chengchi University, 103753040
Source URI:	http://thesis.lib.nccu.edu.tw/record/#G0103753040
Data Type:	thesis
Appears in Collections:	[Department of Computer Science] Theses

Files in This Item:

File	Size	Format
304001.pdf	4738Kb	Adobe PDF

All items in the NCCU Institutional Repository are protected by copyright, with all rights reserved.

Copyright Announcement

1. The digital content of this website is held by the National Chengchi University Institutional Repository and is provided free of charge for non-commercial uses such as academic research and public education. Please use it in a proper and reasonable manner and respect the rights of copyright owners. For commercial use, please obtain authorization from the copyright owner in advance.

2. Every effort has been made in building this website to avoid infringing the rights of copyright owners. If you believe that any material on the website infringes copyright, please contact our staff (nccur@nccu.edu.tw). We will promptly take remedial measures such as removing the work from the repository.