The 2017 WeChat Open Class PRO kicked off today in Guangzhou Asian Games City. WeChat sees itself as a tool that helps users complete a task in the most efficient way and in the shortest time, and on site we could even see how much room for imagination scanning scenarios offer. Besides Zhang Xiaolong's explanation of mini programs and the mini program case studies, we paid special attention to WeChat's outstanding results in open source: today Mars, the cross-platform component of the WeChat client, was officially open-sourced. From the rise of the mobile Internet to today's popularity of platform-independent, cross-platform development, weak networks have remained a major problem facing IM application developers, and the Mars team has done a great deal of optimization work driven by WeChat's business needs. In this article, the author reviews the origin of Mars and its development process, hoping to inspire those who are exploring network optimization.
In 2012, WeChat supported three platforms: Android, iOS, and Symbian. On none of them did the WeChat client share a unified basic module. WeChat was growing rapidly in 2012, and the iteration speed of each platform, the programming languages used, and the backend architecture were all still being explored. These factors made the basic modules differ from platform to platform, and the server often had to do compatibility work to clean up afterwards. The network is the foundation of WeChat, so its importance is self-evident: any bug in the network implementation can lead to a major incident. Take WeChat's disaster recovery as an example. If implementations differ between versions so that some versions cannot fail over, the user experience suffers badly and users may even be lost. We needed a unified basic network library to safeguard WeChat's rapid development.
Just at this time Symbian was fading into its twilight, and WeChat's support for Symbian was gradually winding down. Our boss moved people over from the Symbian group to form an initial team of three, which began building common basic components. From the start, the basic components were positioned as cross-platform, cross-business infrastructure. Today, besides solving the problems that existed back then, these components also bring many advantages to WeChat's rapid development, such as:
After four years of development, the cross-platform components now include a network component, a log component, and several others. Looking back, it was a road of breaking new ground.
When developing basic modules, design is particularly important. WeChat's basic components take cross-platform and cross-business support as prerequisites, and follow the design principles of high availability, high performance, and load balancing.
Being usable is what an instant messaging app stands on. High availability shows up at several levels: network availability, app availability, system availability, and so on.
Mobile networks suffer from high packet loss, limited bandwidth, fluctuating latency, and third-party interference, so network availability, especially on weak networks, matters more and more in the mobile Internet. The Mars STN component is a network-layer solution built on the socket layer, and many details of its design take weak-network availability fully into account.
App availability covers stability, performance, and many other aspects. The design of xlog, the high-performance logging module, reflects many considerations aimed at logging without affecting app performance.
Beyond normal usage scenarios, app design also has to be considered from the perspective of the whole system. For disaster recovery, for example, Mars not only relies on the server-side disaster recovery scheme but also implements local disaster recovery on the client. When some servers fail, WeChat can currently move more than 95% of users to available servers within 15 minutes.
Guaranteeing high availability does not mean performance can be sacrificed. For an app that users open more often than any other, resources must be used with care. For example, the multi-level timeout design for Mars signaling transmission balances availability against performance.
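To make the idea of multi-level timeouts concrete, here is a minimal sketch. The article does not spell out which levels Mars uses, so the three levels below (first response, gap between packets, and a total budget), along with every name and value, are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical timeout levels, all in milliseconds.
struct TimeoutPolicy {
    int64_t first_response_ms;  // budget for the first packet of the response
    int64_t inter_packet_ms;    // longest allowed silence between two packets
    int64_t total_ms;           // hard cap for the whole request
};

enum class TimeoutKind { kNone, kFirstResponse, kInterPacket, kTotal };

// Decide which timeout, if any, has fired.
// start_ms: when the request was sent.
// last_recv_ms: when the last packet arrived, or -1 if nothing has arrived yet.
// now_ms: the current time on the same clock.
TimeoutKind check_timeout(const TimeoutPolicy& policy, int64_t start_ms,
                          int64_t last_recv_ms, int64_t now_ms) {
    if (now_ms - start_ms >= policy.total_ms) return TimeoutKind::kTotal;
    if (last_recv_ms < 0) {
        return (now_ms - start_ms >= policy.first_response_ms)
                   ? TimeoutKind::kFirstResponse
                   : TimeoutKind::kNone;
    }
    return (now_ms - last_recv_ms >= policy.inter_packet_ms)
               ? TimeoutKind::kInterPacket
               : TimeoutKind::kNone;
}
```

Short inner timeouts let a request on a dead link fail fast and be retried elsewhere, while the outer budget keeps a slow but live transfer from being cut off too early.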
High availability and high performance can be handled by the client alone, but load balancing has to be considered together with the server: client networking can never look only at the client. Any decision about network access should take into account the extra pressure it puts on the server. To pick better-quality IPs, we once wrote complete client-side speed-test code and later deleted it, partly because we did not want to put an additional burden on the server. In the Mars code, IP selection makes heavy use of random functions precisely to avoid large numbers of users hitting the same server at the same time.
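The randomized selection described above can be sketched roughly as follows. This is not the Mars implementation; the function name is hypothetical, and a plain shuffle stands in for whatever weighting the real code applies.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <string>
#include <vector>

// Return the candidate addresses in random order, so that clients holding the
// same list do not all try the same server first.
// The generator is passed in so the caller controls seeding.
std::vector<std::string> shuffled_candidates(std::vector<std::string> candidates,
                                             std::mt19937& rng) {
    std::shuffle(candidates.begin(), candidates.end(), rng);
    return candidates;
}
```

A client would seed the generator once (for example from `std::random_device`) and then try the returned addresses in order.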
What I learned most in these four years is simplicity and balance. Keep the solution as simple as possible, so that it is less prone to error. Most of the time a design cannot satisfy every condition at once, and the various factors have to be balanced. A good example is how often a long connection reconnects (see longlink_connect_monitor.cc for details): the connection frequency is a compromise between power consumption, traffic, network availability, user behavior, and other factors.
The Development of Mars
Stage 1: Getting WeChat running
The cross-platform basic components originated in WeChat, so the first goal was of course to carry WeChat's own business. To avoid being limited to WeChat and to meet the cross-platform, cross-business design goals, we positioned the network component as a stateless network signaling channel between client and server, with only two interaction modes: request-response and push. This means the component does not need to track how requests relate to each other or in what order they occur, which greatly simplifies the core interface. The simple interaction also keeps coupling with business logic very low. At present, the interaction between the basic component and the business consists of only two parts: encoding/decoding and auth status queries. The core interface is as follows (see stn_logic.h for details):
For the threading model, we started with plain multithreading: whenever a job needed to be done asynchronously, we started a thread. Multithreading inevitably means locks, and after several gray releases we found that avoiding the four necessary conditions for deadlock is not as easy as one might imagine. Users' real-world scenarios are complex, and the client has to deal with many timing and state factors, such as network switch events, foreground/background events, timer events, network events, and task events. The result was many deadlocks, as well as memory access problems caused by objects being destructed at the wrong time.
At this point we started to reflect. Multithreading does have its advantages: work can run concurrently and may even finish faster. But for a network module the performance bottleneck is mostly the time spent on the network, not the speed of local execution. Why not run the vast majority of the code serially? Then there are no multithreaded critical sections, and without locks there are naturally no deadlocks.
So we now use a message queue approach (for the implementation, see the comm/messagequeue directory), in which the vast majority of non-blocking operations are executed on the message queue. We also require that interactions between the basic component and its caller:
1. Complete as quickly as possible, without any blocking operation;
2. Are one-way calls, to avoid forming complex, circular call sequences.
Introducing the message queue greatly improved the deadlock problem, but even with a message queue threading model we still cannot avoid blocking calls, such as network operations. Going forward, we plan to try coroutines to simplify the threading model as much as possible.
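The serial execution idea can be illustrated with a minimal single-worker task queue. This is only a sketch written for this article's argument, not the code in comm/messagequeue; the class name `SerialQueue` and its interface are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Tasks posted from any thread run one at a time, in posting order, on a
// single worker thread, so the tasks themselves need no locks between them.
class SerialQueue {
  public:
    SerialQueue() : worker_([this] { run_loop(); }) {}

    // Runs the tasks already posted, then stops the worker.
    ~SerialQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cond_.notify_one();
        worker_.join();
    }

    SerialQueue(const SerialQueue&) = delete;
    SerialQueue& operator=(const SerialQueue&) = delete;

    // Enqueue a task and return immediately; the caller never waits for it.
    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cond_.notify_one();
    }

  private:
    void run_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cond_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // only reached once stopping_ is set
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock, so a task may post further tasks
        }
    }

    std::mutex mutex_;
    std::condition_variable cond_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the other members exist when it starts
};
```

The only lock left protects the queue itself and is never held while a task runs, which is why tasks that stay short and one-way cannot deadlock against each other.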
Other technology choices sometimes came down to individual API calls. For example, for the sake of platform compatibility we gave up the thread-safe variants of some functions and used the non-thread-safe asctime, localtime, rand, and so on.
Stage 2: Building inner strength
After many rounds of gray release validation and data comparison, the network layer on every WeChat platform moved smoothly to the unified basic components. To verify how well the components worked, we developed smc, a statistics and monitoring component, and began to focus on network metrics and on basic research and optimization of the network, in particular the characteristics of mobile networks.
Stage 3: "Catching demons"
Once the basic components were fully rolled out, at WeChat's user scale we were bound to run into all kinds of "demons". For one thing, anyone writing network code cannot get away from the carriers. One memorable case: a user reported that WeChat was unusable when connected to WiFi, and tcpdump showed that packets above a certain size could not get out. The solution was to force the MSS to 1400 on WiFi networks (for the code, see unix_socket.cc).
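On POSIX systems, capping the MSS is typically done with the standard `TCP_MAXSEG` socket option. The helper below is a minimal sketch of that call, not the code in unix_socket.cc; the function name is hypothetical.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Ask the kernel to cap the TCP maximum segment size on this socket.
// Intended to be called before connect(), so the value is advertised during
// the handshake. Returns true if the kernel accepted the option.
bool set_max_segment_size(int fd, int mss) {
    return setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof(mss)) == 0;
}
```

A caller would create the socket, call `set_max_segment_size(fd, 1400)` when the active network is WiFi, and then connect as usual.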
Mobile clients cannot get away from phone manufacturers either. We once hit a puzzling crash, with the following stack:
Analyzing the stack together with the xlog output showed that a non-blocking socket had been stuck in the connect function for more than 6 minutes, until our own ANR detection (for the code, see anr.cc) noticed it and killed the process. In the end we had no choice but to ask the manufacturer to investigate, and the cause was finally found: to save power, when the screen is locked, the phone is not connected to WiFi, and there is no downlink network data, the chip gates off and blocks all network requests until downlink data arrives or more than 20 minutes have passed. When the phone has a network, even just the mobile network, it is rare to have no downlink data at all, so the ANR detection built into the basic components is not triggered. When the phone is not connected to any network, however, it is triggered easily. Solution: the manufacturer changed the code logic so that network requests are not blocked when there is no network.
Carriers and phone manufacturers are already black boxes to us, but we have run into even darker ones. When a phone has not been restarted for a long time, there is a very small probability that WeChat can no longer be used, and restarting the phone fixes it. Because we could not find a user who both met the conditions and was willing to work with us, the problem made no progress for a long time. Eventually we happened to reproduce it on a test device. tcpdump showed that during the three-way handshake, the tsval value sent by the client had been tampered with in the server's reply, so the handshake failed outright, and the tampering happened after the packet left the server and before it reached the client.
This is the problem that the WeChat network module spent the longest time and the most effort investigating. That was not only because for a long time there was no case to analyze, but also because after reproducing it we contacted a large number of colleagues and outside contacts for help in tracking down the culprit. In the end, because too many carrier departments sat in the middle of the path, we could not keep digging and never found the root cause.
This period was both painful and rewarding. Only after seeing all kinds of bad networks first-hand did we truly feel how harsh the mobile network environment is, but watching our network performance data rise steadily brought real satisfaction. Today there are very few genuine network problems that need follow-up, which is one of the reasons we have had time to open-source this code.
Introduction to Mars
Having covered so much of the development history, we finally come to the protagonist. About a year ago we started to think about open-sourcing the components, and everyone struggled over what to call them. The film The Martian happened to be a hit at the time, and a colleague said, why not just call it Mars? So Mars it was. Looking at the code, though, we found that some additional work was needed before it could be open-sourced.
First, the code style. The naming rules we originally used for files, functions, and variables were defined internally. To make the code more comfortable for others to read, we decided to switch to the Google style, for example: all lowercase with underscores between words, opening braces on the same line, and so on. However, to better distinguish access scope, we made some modifications to the Google style, for example: all private functions start with "__", function parameters start with "_", and so on.
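A small example may make these rules easier to picture. The class below is hypothetical and exists only to show the conventions; applying lowercase-with-underscores to variables and Google-style capitalization to type and function names is an assumption, since the article lists only a few of the rules. Note that identifiers containing a double underscore are formally reserved in C++, although common compilers accept them.

```cpp
#include <cassert>

// Hypothetical class used only to illustrate the naming conventions.
class RetryCounter {
  public:
    // Function parameters start with "_".
    explicit RetryCounter(int _max_retries) : max_retries_(_max_retries), used_(0) {}

    // Opening braces stay on the same line as the declaration.
    bool TryOnce() {
        if (!__HasBudget()) return false;
        ++used_;
        return true;
    }

    int used() const { return used_; }

  private:
    // Private functions start with "__".
    bool __HasBudget() const { return used_ < max_retries_; }

    // Variables are all lowercase with underscores between words.
    int max_retries_;
    int used_;
};
```

The prefixes make it possible to tell at a glance, at any call site, whether a name is a private helper or a parameter.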
Second, although the design had always aimed to be independent of the business, WeChat-specific code inevitably crept in during actual development; newdns is a typical example. To keep Mars maintainable and to ensure that the open-source code and the code WeChat uses share the same source, the business-related code had to be pulled out before open-sourcing. The structure after pulling it out is as follows:
Finally, to make the interface easier to use, we rethought and modified the parameters of both the calling interface and the callback interface.
Before Mars, we delivered dynamic libraries (.so) directly to Android, because the code logic was fixed and nothing needed to be customizable. For the Apple platforms we delivered static libraries (.a), because the exposed functions almost never changed and the corresponding header files could simply be added to the project. Open source is completely different: others may need to implement their own log encryption algorithm, some will need to customize the packet headers for long or short connections, we ourselves may change the external interface headers, and so on.
To let users customize the code, we offer two build options for the Android platform:
1. Build the dynamic library directly. Some of the customizable code may need to provide a default implementation.
2. Build the static library first, then the dynamic library. After the static library is built, implement the code you need to customize, then run ndk-build to produce the dynamic library.
For the Apple platforms, all header files are collected and maintained by Mars, and a Framework is built directly.
To help developers get started quickly, we provide demos for Android, iOS, and OS X. Builds and demos for other platforms will be supported soon.
The resulting structure of Mars is as follows:
We did not build a component that meets every need; we built one that suits our own use. Below is a comparison with open-source projects of the same kind.
As can be seen:
In general, Mars is a socket-layer solution designed around the characteristics of mobile apps, and it gives better control over network tuning. It does not yet support the complete HTTP protocol; adding that in a later version is already under consideration.
Summary
Friends often tell me that when the signal is poor, or when other apps cannot be used, WeChat can still send messages. It feels as though we did nothing special, but looking back, we have done a great deal. Not every line of code gets to be tested by 500 million active users, and we are grateful to WeChat for giving us such a platform. Now we want to share the code with everyone. The open-source Mars code and the code WeChat uses are kept in the same source, and all open-source code is verified in WeChat before it is made public. Open source is not the end, only the beginning. We will keep exploring network optimization for the mobile Internet.
Talk is cheap, show you our code.
Follow Mars, and come give us a star on GitHub!