Alibaba engineers’ latest tech treat for coffee lovers
Opened in 2017, the Starbucks Reserve Roastery and Tasting Room in Shanghai is the world’s first retail coffee shop with an Augmented Reality (AR) experience. Developed by Alibaba’s AR platform Laboratory X, the AR tech lets customers sip their latte while they explore the shop and discover everything there is to know about coffee, including where it comes from and how it is made. The cutting-edge AR technology is based on Artificial Intelligence (AI), and Starbucks has applied it to create a truly unique café experience.
AR is a type of graphics technology that combines reality with virtual content in real time. Take a visit to Starbucks as an example: if a customer points their phone at a coffeemaker, the system recognizes it as a coffeemaker and overlays an interactive button on top of it.
This process of content overlay involves the three principles of AR technology:
· Identifying the object (i.e. looking for the coffeemaker)
· Rendering the overlaid content (i.e. drawing the interactive button on the coffeemaker)
· Tracking the identified object (i.e. ensuring the interactive button ‘sticks’ to the coffeemaker when the user moves their phone)
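The three principles above can be pictured as a per-frame loop. The Python sketch below is a toy illustration, not Alibaba's implementation: `Box`, `Overlay`, `ARSession` and the `identify` callback are hypothetical names, and tracking is simplified to shifting the stored box by a known camera motion, whereas real trackers estimate that motion from the image itself.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Box:
    """Axis-aligned bounding box in image pixels."""
    x: float
    y: float
    w: float
    h: float

    def center(self) -> Tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)


@dataclass
class Overlay:
    label: str
    position: Tuple[float, float]  # where the interactive button is drawn


class ARSession:
    """Toy identify -> render -> track loop (hypothetical, for illustration only)."""

    def __init__(self, identify: Callable[[object], Optional[Box]], label: str):
        self.identify = identify
        self.label = label
        self.box: Optional[Box] = None

    def process_frame(self, frame,
                      camera_shift: Tuple[float, float] = (0.0, 0.0)) -> Optional[Overlay]:
        if self.box is None:
            # 1. Identify: run recognition until the object is found.
            self.box = self.identify(frame)
            if self.box is None:
                return None
        else:
            # 3. Track: the object appears to move opposite to the camera,
            #    so shift the stored box to keep the button "stuck" to it.
            dx, dy = camera_shift
            self.box = Box(self.box.x - dx, self.box.y - dy, self.box.w, self.box.h)
        # 2. Render: anchor the button at the centre of the tracked box.
        return Overlay(self.label, self.box.center())
```

For example, with an `identify` stub that returns `Box(100, 50, 40, 80)` for a coffeemaker frame, the button is drawn at (120.0, 90.0), then moves to (110.0, 90.0) after a 10-pixel camera shift to the right.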
Tracking and rendering technologies are already relatively mature. Object identification, however, remains the main technical barrier for AR applications.
Not Seeing Clearly
For the most responsive experience, AR requires recognition to run in real time on the client, that is, on the user's own device. This on-device method identifies larger objects that are less affected by surface patterns, such as the Starbucks cask, and applies various data processing strategies to produce a captivating final result. Unlike server-based methods, on-device computation is not limited by network speed, so it delivers fast response times and a stable experience.
However, several types of object are difficult for the AR system to recognize, specifically:
· Reflective metal surfaces, which show different patterns when viewed from different angles.
· Transparent objects, such as hollow, transparent containers.
· Changes in the environment or equipment, such as differing smartphone models and capabilities, which make it harder to identify objects consistently.
To overcome these difficulties, a cloud-based deep learning recognition server was built.
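One way to picture on-device recognition and the cloud server working together is a confidence-based fallback. This is an assumption for illustration: the source does not describe the routing rule, and `recognize`, `Recognition` and the 0.8 `threshold` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Recognition:
    label: str
    confidence: float
    source: str  # "device" or "cloud"


# A recognizer takes a camera frame and returns a result, or None if nothing was found.
Recognizer = Callable[[object], Optional[Recognition]]


def recognize(frame, on_device: Recognizer, cloud: Recognizer,
              threshold: float = 0.8) -> Optional[Recognition]:
    """Hypothetical hybrid routing: try the device first, fall back to the cloud."""
    # Fast path: on-device model, no network round trip.
    local = on_device(frame)
    if local is not None and local.confidence >= threshold:
        return local
    # Slow path: hard cases (e.g. reflective metal, transparent glass)
    # are sent to the cloud deep learning server.
    remote = cloud(frame)
    # If the server finds nothing, keep whatever the device produced.
    return remote if remote is not None else local
```

This keeps the common case fast and independent of the network, and pays for a network round trip only on frames where the on-device model is unsure.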
Deep learning for images is now relatively mature, and two associated models were used for the Starbucks AR experience:
· Image classification model: this recognizes objects well, but cannot determine where the target is located.
· Object detection model: this recognizes objects less reliably, but can accurately determine where the target is located.
Combining the two models plays to the strengths of each: the object detection model recognizes most objects, while the classification model handles objects that are harder to identify, which can then be given an estimated location in the image.
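A minimal sketch of combining the two models, assuming hypothetical `detect` and `classify` callables: the detector's box is used whenever it finds the target, and otherwise the classifier is run over sliding windows so that the best-scoring window serves as the estimated location. The sliding-window step is only one possible way to get a location out of a classifier and is not necessarily the method used for Starbucks; the thresholds, window size and stride are illustrative.

```python
from typing import Callable, Dict, List, Optional, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (x, y, width, height)
Detector = Callable[[np.ndarray], List[Tuple[str, float, Box]]]  # -> [(label, score, box)]
Classifier = Callable[[np.ndarray], Dict[str, float]]            # -> {label: probability}


def locate(image: np.ndarray, target: str, detect: Detector, classify: Classifier,
           det_threshold: float = 0.5, cls_threshold: float = 0.5,
           window: int = 64, stride: int = 32) -> Optional[Tuple[Box, str]]:
    """Return (box, method) for the target, or None if neither model finds it."""
    # 1. Object detection: precise box, used whenever it finds the target.
    hits = [(score, box) for label, score, box in detect(image)
            if label == target and score >= det_threshold]
    if hits:
        _, box = max(hits, key=lambda hit: hit[0])
        return box, "detection"

    # 2. Fallback: classify sliding windows; the best-scoring window
    #    becomes the estimated location of the hard-to-detect object.
    h, w = image.shape[:2]
    best_p, best_box = 0.0, None
    for y in range(0, max(h - window, 0) + 1, stride):
        for x in range(0, max(w - window, 0) + 1, stride):
            p = classify(image[y:y + window, x:x + window]).get(target, 0.0)
            if p > best_p:
                best_p, best_box = p, (x, y, window, window)
    if best_box is not None and best_p >= cls_threshold:
        return best_box, "classification"
    return None
```

The trade-off mirrors the text: detection is preferred because its box is exact, while the classifier's window is coarser (limited by `window` and `stride`) but still gives the renderer somewhere to anchor the overlay.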
Both machine learning and deep learning require huge amounts of data, which take a lot of time and effort to collect. Because objects such as a transparent, reflective coffee pot can appear anywhere in a scene, the training data must be as diverse as possible to cover all likely scenarios.
Image synthesis has become a useful tool for overcoming the problems that scene diversity causes for object identification. It can automate the preparation of data for target object recognition, starting from multiple images of the target object: the object is photographed against a green screen, the green background is removed, and the object's colors are adjusted so that their distribution resembles that of the new background.
The main purpose of image synthesis is to teach the machine to distinguish foreground from background, rather than to obtain a measurable physical result such as the size of the object.
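The green-screen synthesis step can be sketched with NumPy. This is a minimal sketch under stated assumptions: RGB `uint8` images, a simple hypothetical keying rule (a pixel is background when its green channel exceeds red and blue by more than `margin`), and per-channel mean and standard deviation matching as the color-distribution step. The source does not describe the production keying or color method, and `green_screen_mask`, `match_color` and `synthesize` are illustrative names.

```python
import numpy as np


def green_screen_mask(img: np.ndarray, margin: int = 40) -> np.ndarray:
    """True for foreground pixels of an RGB image shot against a green screen."""
    rgb = img.astype(np.int16)  # avoid uint8 wrap-around when subtracting
    greenness = rgb[..., 1] - np.maximum(rgb[..., 0], rgb[..., 2])
    return greenness <= margin


def match_color(src: np.ndarray, ref: np.ndarray, mask=None) -> np.ndarray:
    """Shift src's per-channel mean/std to those of ref (simple color transfer).

    Statistics of src are taken only from pixels where mask is True."""
    src_f = src.astype(np.float64)
    ref_px = ref.astype(np.float64).reshape(-1, 3)
    src_px = src_f[mask] if mask is not None else src_f.reshape(-1, 3)
    scale = ref_px.std(axis=0) / np.maximum(src_px.std(axis=0), 1e-6)
    out = (src_f - src_px.mean(axis=0)) * scale + ref_px.mean(axis=0)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def synthesize(fg: np.ndarray, background: np.ndarray, x: int, y: int, margin: int = 40):
    """Paste the keyed object onto background at (x, y).

    Returns the composite, the object mask and its bounding box (x, y, w, h),
    which can serve as foreground/background labels for training."""
    mask = green_screen_mask(fg, margin)
    if not mask.any():
        raise ValueError("no foreground pixels found")
    adjusted = match_color(fg, background, mask)
    out = background.copy()
    h, w = mask.shape
    region = out[y:y + h, x:x + w]  # view into out; fg must fit inside background
    region[mask] = adjusted[mask]
    ys, xs = np.nonzero(mask)
    box = (int(x + xs.min()), int(y + ys.min()),
           int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1))
    return out, mask, box
```

Calling `synthesize` repeatedly with different backgrounds and positions turns a handful of green-screen shots into many labelled training images, which is the kind of scene diversity the text calls for.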
Although these approaches alone would suffice for a basic AR application, issues remain for larger-scale deployments.
Color and Image Simulation
AR target recognition relies heavily on the resolution of the device camera. For an application on the scale of Starbucks, accommodating users’ many different smartphone models and specifications was therefore vital. One option would be to collect data with many different smartphone models, but this is not a realistic approach. Instead, existing images were transformed into simulated images from other smartphones, using two automated processes:
- Pure color transformation: using image B as a reference, this process transforms the color distribution of image A to match that of image B.