
Don't understand something? Just point and shoot: this cutting-edge AI feature is available the moment you open your phone

via: 博客园 (Cnblogs)     time: 2019/7/16 17:04:18

Gan Ming, reporting from Aofeisi (凹非寺)

Produced by QbitAI (量子位) | WeChat official account: QbitAI

If you don't understand something, just pick up your phone and point it at it.

For example, which cosmetic in the cabinet is which? Just scan them:


This is a cutting-edge feature in the Baidu App, and it marks a new high point for the AI technology Baidu has shown.

At the annual Baidu AI Developers Conference, Shen Dou, Baidu Senior Vice President and General Manager of the Mobile Ecosystem Group, picked up a phone and opened the Baidu App to give exactly this demonstration.

No other steps are needed: with just a phone, information about multiple cosmetics is displayed in real time in the camera view. As the phone moves and the picture in the lens changes, the content shown in the Baidu App changes in real time as well.

After the demonstration, Shen Dou explained that the feature is called dynamic multi-target recognition. It can identify objects within 100 ms and update tracked object positions within 8 ms, which is faster than humans.


Around this technology, Baidu also has a corresponding mobile ecosystem to support it.

This Baidu App feature not only identifies multiple targets dynamically in real time, but also lets you quickly find the same product, compare prices, and read reviews. If you like what you see, you can place an order directly.

On the scope of recognition, Shen Dou also gave some numbers:

Trained on 40 billion data samples, it can identify more than 10 million items and supports more than 30 recognition scenarios.

To try it, open the Baidu App, tap the camera button on the right side of the search box, and select “automatically shoot”.


For example, you can scan a face to get a face reading:


You can also scan a bottle of wine to look up its winery and vintage:


You can also scan a test question to search for its worked solution, scan text to translate or recognize it, scan dishes or ingredients to see their calories and how to use them, scan a celebrity to see the gossip, scan a car to learn its model and price, and so on.

In addition, it can recognize many other categories such as text, books, posters, medicines, currency, and movies. It can fairly be called a must-have tool at home.

Now, with the addition of dynamic multi-target recognition, the experience has evolved again: it can identify multiple objects at once and handle snacks on shelves, daily necessities on a table, cosmetics, and more.


How did Baidu achieve this?

From deployment to implementation, there are five major challenges to be addressed

With artificial intelligence at its current level of development, multi-target recognition by itself is not hard to achieve. However, Baidu had to solve five major challenges to make dynamic, real-time multi-target recognition work in a mobile app.

The first challenge: deploying complex deep learning models in a mobile app.

The computing resources of a mobile phone are very limited. To complete a multi-target recognition task, the model has to be compressed and optimized to fit the constraints of the device.

Shen Dou said the Baidu App achieves this by relying on the PaddlePaddle (飞桨) mobile deployment library.


This library is a subset of Baidu's deep learning platform, optimized for mobile scenarios. For example:

The framework size is reduced to 300K; speed optimization at the assembly instruction level delivers high performance while keeping power consumption low; and the framework supports 8 hardware and software platforms, giving cross-platform coverage on the mobile side.

For dynamic multi-target recognition as a whole, Shen Dou said the original cloud-side visual model of more than 200 layers was optimized down to 10 layers, so that objects are recognized within 100 ms and object positions are updated within 8 ms.

By comparison, the human eye generally takes 170 ms to 400 ms to recognize an object and refreshes a tracked object about every 40 ms, which means the feature's recognition speed exceeds that of the human eye.

He said the PaddlePaddle mobile deployment library is also widely used in Baidu Maps, Baidu Netdisk, and autonomous driving. Seen this way, AI is being rolled out across Baidu's mobile products as well.
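The comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the article; the constant and function names are made up for illustration.

```python
# Latency figures quoted in the article, in milliseconds.
MACHINE_RECOGNITION_MS = 100
MACHINE_TRACKING_MS = 8
HUMAN_RECOGNITION_MS = (170, 400)  # quoted range for the human eye
HUMAN_TRACKING_MS = 40


def speedup(human_ms: float, machine_ms: float) -> float:
    """Return how many times shorter the machine latency is."""
    return human_ms / machine_ms


recognition_speedup = tuple(
    speedup(h, MACHINE_RECOGNITION_MS) for h in HUMAN_RECOGNITION_MS
)
tracking_speedup = speedup(HUMAN_TRACKING_MS, MACHINE_TRACKING_MS)

print(recognition_speedup)  # (1.7, 4.0)
print(tracking_speedup)     # 5.0
```

On these numbers, recognition is 1.7 to 4 times faster than the quoted human range, and tracking updates are 5 times faster.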

The remaining four challenges are inherent to the technology being deployed: dynamic multi-target recognition.

Dynamic multi-target recognition requires the recognition model to compute dynamically and in real time, so that it gives quick feedback whenever something changes (the phone moves or a new object appears).

The second challenge: discovering new objects quickly, continuously, and stably, and relating new objects to old ones.

Technically, there are two problems to solve. One is ensuring the performance of object detection on a single frame; the other is ensuring the stability of object detection across consecutive frames.

Single-frame detection performance covers accuracy, recall, and detection speed. The better current models use very deep CNNs for this task, which makes inference slow. For example, Faster R-CNN takes about 200 to 300 ms per inference even on an NVIDIA Tesla P4 GPU.

In response, Baidu built a lightweight MobileNet network on PaddlePaddle to compress the base model and speed up prediction. According to official data, single-frame multi-target detection on a phone ends up taking less than 60 ms, and both accuracy and recall for the main objects are above 95%.


In addition, deep CNNs generalize poorly to small changes in an image, so object detection across consecutive frames is unstable. As a result, related models fall far short of the human eye at continuously discovering objects.

In recent years, academia has begun to propose solutions, such as sequence-based models that use multi-frame information to stabilize subsequent detections. However, sequence models are too computationally demanding to run on the device.

Against this background, Baidu gives its own solution:

On real-time consecutive frames, a tracking algorithm maintains short-term object state. When the objects in view change, the output of the tracking algorithm is merged into the detection model to give the final, stable detection result across consecutive frames.
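One common way to merge tracker state with fresh detections is to match boxes by overlap (intersection over union, IoU). The sketch below illustrates that general idea only; the article does not describe Baidu's actual fusion method, and the function names, box format, and 0.5 threshold are assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse(
    tracked: Dict[int, Box],
    detected: List[Box],
    next_id: int,
    iou_threshold: float = 0.5,  # hypothetical value
) -> Tuple[Dict[int, Box], int]:
    """Merge tracker output with detector output for one frame.

    A detection that overlaps a tracked box refreshes that track and keeps
    its id; an unmatched detection becomes a new object; a track with no
    matching detection is kept, so a missed detection does not make the
    object flicker.
    """
    fused = dict(tracked)
    unmatched = set(tracked)
    for det in detected:
        best_id, best_score = None, iou_threshold
        for track_id in unmatched:
            score = iou(tracked[track_id], det)
            if score >= best_score:
                best_id, best_score = track_id, score
        if best_id is None:
            fused[next_id] = det  # new object entered the view
            next_id += 1
        else:
            fused[best_id] = det  # same object, refreshed position
            unmatched.discard(best_id)
    return fused, next_id


tracked = {1: (10, 10, 50, 50), 2: (100, 100, 150, 150)}
detected = [(12, 11, 52, 51), (200, 200, 240, 240)]
fused, next_id = fuse(tracked, detected, next_id=3)
print(fused)
```

In this example the detector misses object 2 for one frame, yet it stays in the fused result because the tracker still holds its state, while the box at (200, 200) is registered as new object 3.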


Not only can this run on the mobile side, the results are also very good. According to official data, the frame error rate was reduced from 16.7% to 2%.

Baidu has also patented this approach.

The third challenge: making on-screen feedback as stable as if it were placed in the real world.

In other words, users must not notice any stutter when using dynamic multi-target recognition. To achieve this, the computation has to reach at least 24 FPS, the frame rate at which the human eye perceives motion as continuous.

Moreover, to keep relative positions unchanged, the tracking algorithm has to hold the accumulated deviation between frames to within 3 pixels per 60 frames. Baidu uses SLAM (Simultaneous Localization and Mapping) to solve these problems.

The main application of SLAM is a robot moving through an unknown environment, determining its own trajectory by observing its surroundings and building a three-dimensional map of the environment.

Applied to the phone camera, the goal is to track objects with very small offset error. The way to achieve this is:
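These two requirements translate into concrete per-frame budgets. The sketch below is plain arithmetic on the quoted figures; the variable names are made up, and the 60-frame window is converted to seconds under the assumption that the pipeline runs at exactly 24 FPS.

```python
TARGET_FPS = 24           # minimum frame rate quoted in the article
MAX_DRIFT_PX = 3          # allowed accumulated deviation ...
DRIFT_WINDOW_FRAMES = 60  # ... over this many frames

# Time available to process one frame at the target frame rate.
frame_budget_ms = 1000 / TARGET_FPS

# Average drift allowed per frame if the deviation accumulates evenly.
drift_per_frame_px = MAX_DRIFT_PX / DRIFT_WINDOW_FRAMES

# Duration of the drift window, assuming exactly 24 FPS.
window_seconds = DRIFT_WINDOW_FRAMES / TARGET_FPS

print(round(frame_budget_ms, 2))  # 41.67
print(drift_per_frame_px)         # 0.05
print(window_seconds)             # 2.5
```

At 24 FPS each frame has about 41.67 ms, so the 8 ms tracking update quoted earlier fits comfortably, and the drift limit works out to an average of 0.05 pixels per frame.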

Use the limited movement of the phone to locate the phone and build a three-dimensional map of the environment, placing the virtual information in the specified 3D coordinates.


To support SLAM, Baidu also adopted a VIO (Visual-Inertial Odometry) solution and streamlined the back-end optimization process. The goal is to reduce the amount of computation while also fixing the instability caused by feature points being filtered out during optimization.

To keep the implementation stable, it was also deeply optimized for the phone camera scenario.

The fourth challenge: achieving multi-layered perception of visual signals, from coarse-grained understanding to fine-grained cognition.

When people recognize an object, they usually form a preliminary understanding first, such as "there is a car ahead". A deeper understanding follows, such as "this car is a BMW 320".

The same goes for the machine, whose processing is divided into two parts: coarse-grained understanding and fine-grained cognition.

In the initial understanding stage, the semantic granularity is relatively coarse and the work generally has to finish within milliseconds. Baidu's solution is to integrate its self-developed mobile deep learning prediction framework and run inference for multiple deep learning models on the device.

They said the training data comes from thousands of mobile phone videos and some open-source datasets (ImageNet, OpenImage, etc.). It covers the main scenes of offices, family life, shopping malls, supermarkets, outdoor parks, and streets, with a classification system of 300+ labels and object crops numbering in the millions.


To meet mobile deployment requirements, they chose multi-task model training based on MobileNet plus a hierarchical loss. In the end, classification accuracy in the initial stage reached 92% with 80% coverage. After model compression was introduced, prediction on a single image took only 40 ms.

In the specific cognition stage, the semantic granularity is fine and the work has to finish within seconds. As a whole, it is a complex cloud system. They said the system includes a large-scale fine-grained classification model covering millions of categories, with recognition accuracy above 90% for animals, plants, and automobiles.

Combined with visual search built on approximate nearest neighbor (ANN) vector search, it supports retrieval functions such as similar images, identical products, and celebrity faces. For the same retrieval time, its accuracy and recall far exceed those of Faiss, Facebook's open-source system.
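To make the idea of vector search concrete, here is a minimal sketch of exact nearest-neighbor retrieval by cosine similarity, the brute-force baseline that ANN methods approximate to save time. It is not Baidu's system and does not use the Faiss API; the item names and three-dimensional feature vectors are invented for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k indexed items most similar to the query.

    This scans every item (exact search). ANN indexes avoid the full scan
    at the cost of occasionally missing a true neighbor.
    """
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical image embeddings for three products.
index = {
    "lipstick_a": [0.9, 0.1, 0.0],
    "lipstick_b": [0.8, 0.2, 0.1],
    "snack_c": [0.0, 0.1, 0.9],
}
results = nearest([1.0, 0.1, 0.0], index, k=2)
print([name for name, _ in results])  # ['lipstick_a', 'lipstick_b']
```

With millions of items, the full scan becomes the bottleneck, which is why production systems trade a little recall for speed with approximate indexes.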

The fifth challenge: in different scenarios and behavioral modes, to achieve seamless integration of discovery, tracking and multi-layered cognition.

After the technical capability is achieved, many factors need to be considered at the implementation level, such as judging the user's attention, the frame selection algorithm when focusing, the scheduling and switching strategy of the tracking and detection algorithms, etc., to enhance the user experience.

For attention judgment, the phone's inertial measurement unit (IMU) has large errors, so it is used only to detect intense accelerating motion.

Baidu therefore combines the IMU with visual features, using displacement and scale-change features computed from consecutive images to capture subtle motion.
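The division of labor between the two signals can be sketched as a simple rule: the IMU vetoes frames with violent motion, and the visual features decide the rest. Everything below (field names, thresholds, and the rule itself) is a hypothetical illustration, not Baidu's implementation.

```python
from dataclasses import dataclass


@dataclass
class FrameMotion:
    accel_magnitude: float  # IMU acceleration with gravity removed, m/s^2
    displacement_px: float  # mean feature displacement between two frames
    scale_change: float     # ratio of feature spread between frames; 1.0 = no zoom


def is_user_focused(
    motion: FrameMotion,
    accel_limit: float = 2.0,         # hypothetical threshold
    displacement_limit: float = 4.0,  # hypothetical threshold
    scale_tolerance: float = 0.05,    # hypothetical threshold
) -> bool:
    """Guess whether the user is holding the camera on a scene."""
    # The IMU is noisy, so it is trusted only for intense motion.
    if motion.accel_magnitude > accel_limit:
        return False
    # Subtle motion is judged from the images: panning shows up as
    # displacement, moving closer or farther as a scale change.
    if motion.displacement_px > displacement_limit:
        return False
    return abs(motion.scale_change - 1.0) <= scale_tolerance


steady = FrameMotion(accel_magnitude=0.3, displacement_px=1.2, scale_change=1.01)
slow_pan = FrameMotion(accel_magnitude=0.3, displacement_px=9.0, scale_change=1.00)
print(is_user_focused(steady))    # True
print(is_user_focused(slow_pan))  # False
```

The slow pan is the case the IMU alone would miss: acceleration stays low, but the image displacement shows the user has not settled on a target.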

For the frame selection algorithm, Baidu collected data simulating how users' attention changes in different scenes and used manual labeling to build a training set of optimal frames. A CNN model is trained to fit the manual labeling, and the best frame is selected as input to the subsequent computation.

The reason is that detection is often triggered by the first frame, whose image quality is frequently degraded by noise such as lighting, sharpness, and object position.

To save computation, the scheduling algorithm monitors the tracking algorithm's state and the output of the attention-judgment strategy in real time, and adjusts how often the consecutive-frame detection model runs.

Through this carefully combined scheduling, Baidu claims the power consumption of dynamic multi-target recognition is held to within 2% per 10 minutes, which meets the energy requirements of mobile deployment.
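A scheduler of this kind can be pictured as a small decision function that runs the expensive detector only when needed. The policy, names, and 30-frame refresh interval below are assumptions for illustration; the article does not spell out Baidu's actual rules.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"              # do nothing heavy this frame
    TRACK_ONLY = "track_only"  # cheap tracker update
    DETECT = "detect"          # run the full detection model


def schedule(
    user_focused: bool,
    tracker_confident: bool,
    frames_since_detection: int,
    refresh_interval: int = 30,  # hypothetical value
) -> Action:
    """Pick the cheapest action that keeps results fresh."""
    if not user_focused:
        # The camera is still moving; a detection now would be wasted.
        return Action.SKIP
    if not tracker_confident:
        # The tracker lost its objects; only detection can recover them.
        return Action.DETECT
    if frames_since_detection >= refresh_interval:
        # Periodic refresh to pick up objects that entered the view.
        return Action.DETECT
    return Action.TRACK_ONLY


print(schedule(True, True, 5))    # Action.TRACK_ONLY
print(schedule(True, False, 5))   # Action.DETECT
print(schedule(False, True, 5))   # Action.SKIP
```

Under this policy most frames take the cheap tracking path, which is the kind of saving the power figure quoted in the article depends on.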

It is the combination of the four technical areas above that makes dynamic multi-target recognition possible in the Baidu App.

Together with supporting services in the mobile ecosystem, such as Baidu smart mini programs, this forms Baidu's distinctive path for putting AI into practice.


Baidu's mobile ecosystem is differentiated by AI

Applications and capabilities like these are, on the one hand, a direct demonstration of the technological change Baidu is bringing to the mobile space.

How can AI enhance the user experience? This technology provides an example.

On the other hand, at a time when many people think competition in mobile is long over, AI technology is giving the Baidu mobile ecosystem a differentiated competitive advantage.

Competition in the mobile field has kicked off once again, and it will involve more technical substance than ever before. Whoever has AI and can put it to use will become the final winner.

In addition, with new AI-driven businesses such as DuerOS and Apollo, progress in AI technology is easy to see.

However, AI technology applied inside the Baidu App, such as dynamic multi-target recognition, is not easily perceived, and turning it into a user-friendly feature is no small challenge.

That Baidu can integrate it into the App and deliver a more intuitive experience also shows its years of accumulation in the AI field.

Moreover, AI technology of this kind can bring disruptive changes to the user experience and to daily life.

If someone asks how Baidu's AI revolution begins, the answer is that it has already begun, for example when you open the Baidu App.


QbitAI (量子位) · Toutiao signed author
