Distinguished Lecture Series in Computer Science and Technology
(2013-37)
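Title: Hierarchical MapReduce: Towards Simplified Cross-domain Data Processing
Speaker: Dr. Yuan Luo (骆远)
School of Informatics and Computing, Indiana University, USA
Time: October 21, 2013, 15:30-17:00
Venue: Lecture Hall A521, Computer Building, Qianwei South Campus
Organizers: 太阳成集团tyc122cc
Institute of Computer Science and Technology, 太阳成集团tyc122cc
School of Software, 太阳成集团tyc122cc
Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education
All faculty and students are welcome to attend!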
Abstract:
MapReduce is a programming model well suited to processing large datasets with high-throughput parallelism across a large number of compute resources. While it has proven useful for data-intensive, high-throughput applications, the conventional MapReduce model is limited to scheduling jobs within a single cluster. As job sizes grow, single-cluster solutions become increasingly inadequate. Additionally, the input dataset may be very large and widely distributed across multiple clusters, so repeatedly feeding large datasets to remote computing resources becomes the bottleneck. When mapping such data-intensive tasks to compute resources, scheduling algorithms need to determine whether to bring data to computation or bring computation to data. We present a Hierarchical MapReduce framework that gathers computation resources from different clusters and runs MapReduce jobs across them. Applications implemented in this framework adopt the Map-Reduce-GlobalReduce model, in which computations are expressed as three functions: Map, Reduce, and GlobalReduce. Two scheduling algorithms are introduced: Compute Capacity Aware Scheduling for compute-intensive jobs and Data Location Aware Scheduling for data-intensive jobs. Experimental evaluations using a molecule binding prediction tool, AutoDock, and grep demonstrate promising results for our framework.
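
The Map-Reduce-GlobalReduce model named in the abstract can be illustrated with a small, single-process sketch. This is not the speaker's framework: every function name below (map_fn, reduce_fn, global_reduce_fn, run_local_job, run_hierarchical_job, split_by_capacity) is hypothetical, the grep-style counting job is a toy stand-in, and the proportional split is only one plausible reading of "Compute Capacity Aware Scheduling", assumed here for illustration.

```python
from collections import defaultdict

def map_fn(line, pattern):
    # Map (hypothetical): grep-style, emit (pattern, 1) for each matching line.
    if pattern in line:
        yield pattern, 1

def reduce_fn(key, values):
    # Reduce (hypothetical): runs inside one cluster and sums its local counts.
    return key, sum(values)

def global_reduce_fn(key, values):
    # GlobalReduce (hypothetical): combines the per-cluster partial results.
    return key, sum(values)

def run_local_job(lines, pattern):
    # Simulates an ordinary MapReduce job on a single cluster.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line, pattern):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())

def run_hierarchical_job(cluster_inputs, pattern):
    # Runs one local job per cluster, then applies GlobalReduce to the partials.
    partials = defaultdict(list)
    for lines in cluster_inputs.values():
        for key, value in run_local_job(lines, pattern).items():
            partials[key].append(value)
    return dict(global_reduce_fn(key, values) for key, values in partials.items())

def split_by_capacity(num_tasks, capacities):
    # Assumed policy: share tasks in proportion to each cluster's capacity,
    # handing any rounding leftovers to the largest clusters first.
    total = sum(capacities.values())
    shares = {name: num_tasks * cap // total for name, cap in capacities.items()}
    leftover = num_tasks - sum(shares.values())
    for name in sorted(capacities, key=capacities.get, reverse=True)[:leftover]:
        shares[name] += 1
    return shares

if __name__ == "__main__":
    inputs = {
        "cluster_a": ["map reduce", "grep"],
        "cluster_b": ["reduce again", "global reduce"],
    }
    print(run_hierarchical_job(inputs, "reduce"))
    print(split_by_capacity(10, {"cluster_a": 3, "cluster_b": 1}))
```

In this sketch only the small per-cluster results cross cluster boundaries, which is the motivation the abstract gives for bringing computation to data rather than shipping large inputs to remote resources.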