Planning editor Natalie interview
AI Frontline introduction: A few days ago, the news that Tencent had led the release of a new version of the open source big data platform, Apache Hadoop 2.8.4, caught my attention. Since its birth at Yahoo, Hadoop has been around for more than 10 years. In recent years there have been many cases of Chinese engineers serving as Release Manager for a new version, but the companies behind them were American ones such as Yahoo, Microsoft, Hortonworks, and Cloudera. This is the first time a new version has been led by a Chinese company. On the one hand, this is an important encouragement to the domestic open source community, showing that China's developers and development organizations are able to break through obstacles and play a more influential role in popular open source communities. On the other hand, it means that Tencent's long-term support for and embrace of open source and open source communities has paid off, and that the company has begun to gain influence in the open source community.
For the author, the more intriguing question is a different one: why would Tencent spend so much effort to lead an open source Hadoop release at a time when Hadoop is widely said, at home and abroad, to be in decline?
Hadoop was born in 2006 and became a top-level Apache project in 2008. At first only a few giants at home and abroad tried Hadoop, but before long it became the standard for big data computing in the Internet industry and one of the flagship projects of the Apache Software Foundation. It also gave birth to a series of well-known top-level Apache projects, including HBase, Hive, and ZooKeeper, all of which started in the community as Apache Hadoop subprojects and are familiar to developers.
Hadoop has now been around for 12 years, a long life cycle for any software. Since 2016, voices predicting Hadoop's decline have been heard at home and abroad. Although Hadoop is still an indispensable part of big data computing for many enterprises, many people are not optimistic about its future.
In September last year, Gartner removed Hadoop distributions from its Hype Cycle for data management, citing the complexity and usability of the entire Hadoop stack, and noted that many organizations have begun to rethink Hadoop's role in their information infrastructure. This year's survey of data science and machine learning tools released by KDnuggets shows that Hadoop usage has also fallen.
The 2018 data science and machine learning tool survey showed that Hadoop utilization decreased by 35%.
So why, at a time like this, is Tencent putting so much effort into leading an open source Hadoop release?
Du Junping, the Tencent Cloud expert researcher in charge of leading this release, told AI Frontline the real reason.
Tencent's Hadoop selection: balancing platform stability and technological advancement.
Tencent's big data platform includes many products and components that were optimized or developed in-house for its own special scenarios, but a considerable part of it is built on the open source Hadoop ecosystem.
At present, Tencent's big data platform uses many Hadoop ecosystem components. Taking the Elastic MapReduce service offered on Tencent Cloud as an example, Tencent provides component services such as Hadoop, HBase, Spark, Hive, Presto, Storm, Flink, and Sqoop. Different components serve different purposes: data storage and computing resource scheduling are handled by Hadoop, data import by Sqoop, NoSQL database service by HBase, offline data processing by MapReduce, Spark, and Hive, and stream data processing by Storm, Spark Streaming, and Flink.
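The division of labor described above can be written down as a small lookup table. This is only an illustrative sketch of the paragraph's content: the role names used as dictionary keys and the helper functions `components_for` and `roles_of` are hypothetical labels chosen here, not identifiers from Tencent Cloud.

```python
from typing import Dict, List

# Role-to-component mapping as listed in the paragraph above.
# The keys are illustrative labels, not Tencent Cloud identifiers.
EMR_COMPONENTS: Dict[str, List[str]] = {
    "storage_and_scheduling": ["Hadoop"],
    "data_import": ["Sqoop"],
    "nosql_database": ["HBase"],
    "offline_processing": ["MapReduce", "Spark", "Hive"],
    "stream_processing": ["Storm", "Spark Streaming", "Flink"],
}


def components_for(role: str) -> List[str]:
    """Return the components the article lists for a role (empty if unknown)."""
    return list(EMR_COMPONENTS.get(role, []))


def roles_of(component: str) -> List[str]:
    """Return every role in which a given component appears."""
    return [role for role, comps in EMR_COMPONENTS.items() if component in comps]


if __name__ == "__main__":
    for role, comps in EMR_COMPONENTS.items():
        print(f"{role}: {', '.join(comps)}")
```

A table like this makes it easy to see that several roles have more than one candidate component, which is exactly where the selection question discussed next comes in.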
Du Junping said that in selecting components from the Hadoop ecosystem, Tencent's overall principle is to balance platform stability and technological advancement. On the one hand, one must understand the scenarios each component is suited to and the limits of its capabilities; on the other hand, one must learn from testing and operational practice how stable and how complex each component is. Take Hadoop-based data warehouse components as an example. The new version of Hive adds the LLAP component to improve the performance of interactive queries, but in actual operation it has not yet proved stable, so Tencent has held off on introducing it into production for now. Hive mostly serves offline computing scenarios, while interactive queries are mostly served by the more stable SparkSQL and Presto.
Tencent is not an isolated case. In the big data platforms of many enterprises at home and abroad, components from the Hadoop ecosystem account for a large share. Few can do without it, and Hadoop is used almost everywhere. Du Junping, a member of the Hadoop PMC, said that Hadoop's status as the core and de facto standard of big data platforms does not differ much between China and other countries. However, the maturity of Hadoop applications varies across industries. Internet companies were the earliest to adopt Hadoop and are the most mature users. Next comes the financial industry, where Hadoop big data platforms have many successful deployments and are relatively mature. The current hot spots for Hadoop big data platforms are government affairs, public security, and IoT and industrial Internet platforms. These new hot spots will bring new requirements that drive Hadoop technology and its ecosystem to keep evolving.
Hadoop technology is not outdated, but the way it is used and distributed needs to change.
Regarding Gartner's removal of Hadoop from its Hype Cycle, Du Junping pointed out that the Gartner report targeted commercial Hadoop distributions rather than Hadoop technology itself.
Du Junping acknowledged frankly that the Hadoop ecosystem has some shortcomings. The ecosystem is very complex: each component is an independent module, developed and released by a separate open source community. We can call this loosely coupled. The advantages of this loosely coupled development model are flexibility, wide adaptability, and controllable development cycles. The disadvantages are low maturity, serious version conflicts, and difficult integration testing. It also makes things hard for users, because a single scenario often requires many components to be configured.
Stream computing is becoming more and more important for big data processing, but the lack of stream computing support in Hadoop itself will not be a fatal flaw. Although Hadoop does not provide stream computing services itself, the mainstream stream computing components, such as Storm, Spark Streaming, and Flink, are themselves part of the Hadoop ecosystem, so this does not pose much of a problem.
Competition among Hadoop ecosystem components is fierce: Spark has a clear lead, and MapReduce has entered maintenance mode.
Some developers told AI Frontline that Hadoop has mainly been dragged down by MapReduce, while HDFS and YARN are quite good. Du Junping believes it is inaccurate to say that MapReduce is dragging Hadoop down. First, MapReduce still has application scenarios, although they are getting narrower and narrower. It remains suitable for some very large-scale batch data processing jobs, and it runs them very stably. Second, the Hadoop community has positioned MapReduce in maintenance mode. It no longer pursues new features or performance improvements, so that resources can go into newer computing frameworks, such as Spark and Tez, to help them mature.
HDFS and YARN are still the de facto standards for distributed storage and resource scheduling in big data, but they also face challenges. For HDFS, more and more big data applications in the public cloud choose to skip HDFS and use cloud object storage directly, which makes it easier to separate computing from storage and increases resource elasticity. YARN faces a strong challenge from Kubernetes, particularly its native Docker support, better isolation, and the completeness of the ecosystem built on top of it. However, K8s is still playing catch-up in the big data field, and there is still much room for improvement in its resource schedulers and its support for computing frameworks.
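To make the point about skipping HDFS concrete: jobs in the Hadoop ecosystem generally address data by URI, so moving to object storage is largely a matter of pointing the job at a different URI scheme. The sketch below is a minimal illustration under stated assumptions. The helper `to_object_store`, the bucket name, and the NameNode address are hypothetical; `s3a` is used as the default scheme because it is the scheme of Hadoop's S3A connector, and other object stores use their own schemes.

```python
from urllib.parse import urlparse


def to_object_store(path: str, bucket: str, scheme: str = "s3a") -> str:
    """Rewrite an hdfs:// path to an object-store URI, keeping the path part.

    Hypothetical helper for illustration only; it does not move any data.
    """
    parsed = urlparse(path)
    if parsed.scheme != "hdfs":
        # Not an HDFS path (for example, already an object-store URI): leave as is.
        return path
    return f"{scheme}://{bucket}{parsed.path}"


if __name__ == "__main__":
    # The NameNode address and bucket name below are made up for the example.
    old_path = "hdfs://namenode:8020/warehouse/logs/2018"
    print(to_object_store(old_path, bucket="example-bucket"))
```

Because the compute cluster no longer hosts the data, it can be resized or shut down independently of storage, which is the elasticity benefit mentioned above.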
Among computing frameworks, Spark has essentially become dominant. MapReduce is mainly used in legacy applications, and Tez is more like an execution engine exclusive to Hive. In stream processing, the early engine Storm is retiring, and the current leading roles belong to Spark Streaming and Flink. Each of the two has its own strengths: the former has the richer ecosystem, while the latter has an architectural advantage. Interestingly, adoption of Spark Streaming and Flink differs considerably between China and abroad. A large number of companies in China have started to use Flink to build their own stream processing platforms, while in the US market Spark Streaming is still dominant. There are also some newer stream processing frameworks, such as Kafka Streams.
Among big data SQL engines, the four main players, Hive, SparkSQL, Presto, and Impala, each still have their own strengths.
Hive was originally contributed to open source by Facebook and was the most widely used big data SQL engine in the early years. Like MapReduce, Hive carries a "slow but stable" label in the industry. It has selflessly provided many common components to other engines, such as the metadata service Hive Metastore, the query optimizer Calcite, and the columnar storage format ORC, and can fairly be called the conscience of the industry. In recent years Hive has developed rapidly: query optimization uses CBO, Tez replaces MapReduce as the execution engine, LLAP is used to cache and optimize query results, and ORC storage keeps evolving. However, judging by market adoption, these new technologies are not yet mature and stable, and a large number of users still regard Hive as a reliable ETL tool rather than an ad hoc query product.
SparkSQL has developed rapidly over the past two years, and since Spark entered the 2.x era its progress has come by leaps and bounds. Its excellent SQL compatibility (it is the only open source big data SQL engine to pass all 99 TPC-DS queries), excellent performance, large and active community, and complete ecosystem (machine learning, graph computing, stream processing, and so on) make SparkSQL stand out among these open source products, and it is very widely used in both domestic and foreign markets.
Presto has also been very widely used over the past two years. This in-memory MPP engine handles small-scale data very quickly but struggles when the data volume is large. Impala's performance is also excellent, but its development path is relatively closed, its community and ecosystem progress relatively slowly, its SQL compatibility is relatively poor, and its user base is relatively small.
The Hadoop ecosystem is bound to move toward the cloud, and IoT deserves long-term attention.
Hadoop is already 12 years old. How will the Hadoop ecosystem develop in the future? Du Junping said that the ecosystem will move toward the cloud: simplifying operations and maintenance, or even eliminating them, is both what users need and where cloud vendors have an advantage. More and more data is generated, stored, and consumed in the cloud, forming a closed loop of the data life cycle in the cloud.
In addition, deploying and using Hadoop in the hybrid cloud will be an important trend. The technology and architecture in this area are not yet mature and need continued innovation. Against this background, traditional Hadoop distribution vendors will have a relatively smaller say in both technology and business, while cloud vendors will have a larger one. Another trend is that the Hadoop ecosystem will grow toward the data application side, with the emphasis shifting from data processing to data governance. More convenient ETL tools, metadata management tools, and data management tools will gradually mature. Finally, the Hadoop ecosystem will evolve from a pure big data platform into a data and machine learning platform, which will support many future AI application scenarios.
Du Junping told AI Frontline that among the future directions of big data, IoT is a field that deserves long-term attention. Compared with the rest of big data's history, this line of business has had a relatively short development cycle: many of its technologies are not very mature, and its standards are not fully unified. In addition, big data products on the cloud still have room for technological change, in areas such as cross-data-center and cross-cloud solutions, automated migration of critical data services, data privacy protection, and automated machine learning. In the future, more innovative products will appear on the cloud to win over and attract users.
Tencent Cloud will focus on the core pain points of cloud data users and set its technology and product roadmap accordingly. For the underlying architecture of the big data platform, Tencent Cloud will focus more on serverless and pay more attention to the balance between performance and cost; improving resource utilization will be a long-term direction. The Hadoop ecosystem will continue to play an important role in the market, because the market prefers open and open source products and solutions. Tencent Cloud will continue to contribute and give back to the open source community, and to build better technology together with the community to meet future needs.
It has not been easy for Hadoop to grow, in 12 years, from a new open source project into the standard configuration of big data platforms. Today the Hadoop ecosystem faces competitive pressure from many younger open source components, and survival of the fittest is only natural. No open source platform in the world is perfect. Thanks to its strengths, the Hadoop ecosystem's position is still very stable. Whether it will find new vitality in the future, or gradually decline as everything moves to the cloud, remains an open question.