Parameter Auto-tuning of Hadoop Clusters

Date: 2019-05-07 03:32:57　Source: 柠檬阅读网

　　WANG Jiao, LIU Yan-feng
　　(Xi'an University of Architecture and Technology, Xi'an 710055, China)
　　Abstract: Hadoop, an open-source framework for large-scale data processing on clusters, has been widely adopted by companies. However, running a job on a Hadoop cluster requires manually setting roughly 200 complex parameters, and choosing these settings is very difficult for ordinary users. To address this problem, this paper proposes a sampling algorithm based on strategy selection and adds a strategy-aware layer to Hadoop. Experimental results show that the improved Hadoop framework can tune these parameters automatically, thereby improving the operating efficiency of the whole system.
　　Key words: Hadoop; sampling algorithm; strategy selection; perception layer
　　CLC number: TP311　Document code: A　Article ID: 1009-3044(2012)12-2768-05
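　　Hadoop parameters have a major impact on how efficiently a cluster runs: a good configuration improves both job execution efficiency and cluster throughput. If the user does not set parameters manually before running a job, Hadoop falls back on its own default configuration. [3,4] However, the load on a Hadoop cluster changes from moment to moment, Hadoop jobs come in many types, and the requirements of different users and different jobs vary widely, so the default configuration often performs poorly and cannot guarantee full use of cluster resources or high system throughput. A strategy-aware approach can address this problem effectively.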
　　1 The Hadoop Parameter Problem
　　1.1 Hadoop Parameters
　　Hadoop runs MapReduce programs, and a typical MapReduce program requires many parameters to be set. A simple example follows:
　　import java.io.IOException;
　　import java.util.StringTokenizer;
　　import org.apache.hadoop.io.*;
　　import org.apache.hadoop.mapred.*;
　　// the map function, defined in a class named Map
　　public void map(LongWritable key, Text value,
　　        OutputCollector<Text, DoubleWritable> output, Reporter reporter) throws IOException
　　{
　　    // we need to parse the input line of text
　　    // to extract the product & sales information
　　    StringTokenizer tok = new StringTokenizer(value.toString(), "|");
　　    Text product = new Text();
　　    product.set(tok.nextToken());
　　    DoubleWritable sales = new DoubleWritable();
　　    sales.set(Double.parseDouble(tok.nextToken()));
　　    // output the extracted product & sales information
　　    // to be processed by the reduce function
　　    output.collect(product, sales);
　　}
　　public static void main(String[] args) throws IOException
　　{
　　    JobConf conf = new JobConf();
　　    // set some parameters explicitly
　　    conf.setOutputKeyClass(Text.class);
　　    conf.setOutputValueClass(DoubleWritable.class);
　　    conf.setMapperClass(Map.class);
　　    conf.setReducerClass(Reduce.class);
　　    conf.setNumReduceTasks(10);
　　    // submit the job for execution
　　    JobClient.runJob(conf);
　　}
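　　The main function of this program explicitly sets the following parameters: the classes that implement map and reduce, the data types of the output key and value, and the number of reduce tasks. Beyond these few, there are over a hundred more parameters that need to be set before a run; Figure 1 lists some of the most important Hadoop parameters.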
　　[…] correctness.
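　　First, take the first k blocks as the selected sample. For block k+1, we swap it into the sample with probability k/(k+1); when it is swapped in, a block currently in the sample is chosen uniformly at random to be replaced. Continuing in this way, with block i swapped in with probability k/i, every block in a sample space of any size n is selected with probability k/n. In other words, every block is equally likely to be selected. The pseudocode is as follows: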
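　　Init: a sample of size k (the first k blocks)
　　for i = k+1 to N
　　    M = random(1, i);
　　    if (M <= k) replace the M-th block of the sample with block i;
　　For any n >= k, every block is selected with equal probability, namely k/n.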
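　　The sampling procedure described above is the classic reservoir sampling scheme. As a minimal sketch in Java, assuming blocks are identified by 1-based indices (the class name ReservoirSampler and the method sample are hypothetical names chosen for illustration, not taken from the paper), it could look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSampler {
    /**
     * Picks k block indices (1-based) out of blocks 1..n so that every
     * block ends up in the sample with probability k/n.
     * Hypothetical helper for illustration only.
     */
    public static List<Integer> sample(int n, int k, Random rng) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("need 0 <= k <= n");
        }
        List<Integer> sample = new ArrayList<>(k);
        // Init: the first k blocks form the initial sample.
        for (int i = 1; i <= k; i++) {
            sample.add(i);
        }
        // Each later block i is swapped in with probability k/i.
        for (int i = k + 1; i <= n; i++) {
            int m = rng.nextInt(i) + 1;   // uniform in [1, i]
            if (m <= k) {
                sample.set(m - 1, i);     // replace the m-th sample slot
            }
        }
        return sample;
    }

    public static void main(String[] args) {
        // Draw 5 of 100 blocks; the seed is arbitrary.
        System.out.println(sample(100, 5, new Random(42)));
    }
}
```

　　Drawing a single number M in [1, i] covers both decisions at once: M <= k happens with probability k/i, and when it does, M is itself a uniformly random slot to replace.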
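　　When n = k, the first k blocks are all placed in the sample, so every block is selected with equal probability k/k = 1. Now suppose that for the current block number n every block is in the sample with equal probability k/n; we must show that the same holds for n+1.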
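　　Since block n+1 is swapped into the sample with probability k/(n+1), the probability that block n+1 appears in the sample is k/(n+1). For any block m among the first n blocks (k+1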
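　　The induction step for the earlier blocks follows the standard reservoir sampling argument. As a sketch, write S_n for the sample after n blocks have been processed (this notation is an assumption introduced here, not taken from the source). A block m ≤ n that is in S_n stays in S_{n+1} either when block n+1 is not swapped in, or when it is swapped in but replaces one of the other k−1 slots:

```latex
\begin{aligned}
P(m \in S_{n+1})
  &= P(m \in S_n)\left[\left(1 - \frac{k}{n+1}\right) + \frac{k}{n+1}\cdot\frac{k-1}{k}\right] \\
  &= \frac{k}{n}\cdot\frac{(n+1-k) + (k-1)}{n+1}
   = \frac{k}{n}\cdot\frac{n}{n+1}
   = \frac{k}{n+1}.
\end{aligned}
```

　　Together with the probability k/(n+1) for block n+1 itself, this gives every one of the n+1 blocks the same selection probability, which completes the induction.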
