Finding the Perfect Outfit with Alibaba’s Dida AI Assistant

How the Alibaba tech team uses deep learning to help online marketplaces drive strong business results through the creation of quality outfit displays

Maintaining success in online retail is a constant struggle that requires retailers to determine the best method of presenting their products to customers. One method is to make an attractive poster for the item, which previously required a great deal of human labor. Another way is suggesting item combinations — especially for clothes and accessories — in a bid to encourage customers to make an extra purchase.

Alibaba’s Dida AI assistant combines both of these strategies.

Dida is modeled after Alibaba’s famous poster-designing AI engine, Lu Ban. Like Lu Ban, Dida uses graphical algorithms to collate data and train itself to generate targeted, high-quality images of popular item combinations. While mainly used for generating personalized outfit combinations for online fashion retail, Dida’s use has now expanded to other areas, such as copywriting for product descriptions.

So, how does Dida work? Technically, Dida is not a single entity but a combination of multiple platforms and algorithms, including front-end operations, algorithm, image mosaic, and personalized launch platforms. First, operators select items on the front end and designate the items to be combined using deep image processing and combination algorithms. When one or more items trigger a combination request, Dida searches for additional items that are a good match and that comply with the operators’ rules. Next, descriptive titles are generated based on information from the triggering and returned items.

Finally, the items are synthesized and displayed in a visually flattering way using smart typography. The images of the item combinations are then generated, personalized, and pushed to users with recommendation algorithms.

As of February 2017, Dida’s algorithms have been launched across a variety of Alibaba’s eCommerce platforms, including Taobao and Tmall. For major promotional events, Dida aids operators in generating millions of combinations for a wide range of products, and it is utilized by hundreds of thousands of Alibaba design experts. Alibaba is already seeing positive results from trials on several of its pages, including iFashion, Mobile Taobao Primary Focus, and YouHaoHuo.

Images from a backend outfit display on Dida

The next section briefly summarizes the top three benefits of using the Dida platform, and the rest of this article discusses the technical foundations of the Dida platform and the success stories of iFashion, Mobile Taobao Primary Focus, and YouHaoHuo.

Dida’s Benefits in a Nutshell

1. Content production

Dida utilizes a deep learning network to collate extensive collections of information from users, products, and operations knowledge. When applied to content creation, the Dida AI engine performs at the same level of quality as an experienced design expert. The graphical algorithms generated by Dida are then applied across multiple business areas.

2. Platform enabling

Dida combines graphical algorithms with a platform layer, so operators can select items, generate item combinations, and run personalized launches in a one-stop management process.

3. Improved efficiency

The average design expert produces thousands of item combinations daily. The Dida platform generates millions of item combinations per hour, which is a massive improvement in efficiency. Dida also combines algorithms to expand the overall size of the information pool and enhance personalization.

Technical Foundations of Dida

When used for online fashion retail, Dida generates outfit combinations and descriptions, which requires designing both graphical and textual algorithm frameworks. All underlying data is shared, including pictures and titles for items, operational input data, and various other information. Dida uses this information to generate outfits and then generates a description for each one.

First, a picture combination algorithm uses Convolutional Neural Networks (CNN) to pre-process images. Next, two combination logic algorithms are run using the Deep Semantic Similarity Model (DSSM):

– Long Short-Term Memory (LSTM)-based sequencing outfit production.

– Deep Aggregated Network (DAN)-based non-sequencing outfit production.

The generated outfits are fed into Context-aware Pointer-Generator Networks (CPGN) to generate text descriptions for each outfit. The final result includes text and images that provide holistic descriptions of each combination.

The relevant image and textual algorithms are introduced in the following sections.

Image Algorithms

This section includes all relevant preparation and processes for the use of image algorithms in this scenario.


Alibaba’s training data was initially sourced from Polyvore, a website that stores massive amounts of user-submitted item combination examples that are further delineated based on factors such as user likes and comments. Once Taobao’s internal business practices were firmly established, Alibaba created an in-house database using hundreds of thousands of high-quality outfit combinations obtained from veteran Taobao experts.


First, items must be characterized. Images are the most intuitive carrier of information about an item. Alibaba uses Lu Ban’s library of millions of white-background images to generate image characteristics for items in a specific item pool. This is accomplished using the Inception v3 CNN model. The process is as follows:

1. The pre-trained model is fine-tuned using the category as a label, and the vector expression of the penultimate layer is extracted and used as the image representation of the item.

2. Vector expressions for all images are organized into clusters, subject to category constraints, using K-means clustering. The K-means procedure has been optimized to ensure a centralized and balanced distribution, making it well suited to this scenario. The optimized version uses differences in category relationships and combinations, together with the number of items distributed across the various categories, as determining factors. An individual category can include a variety of different clusters. This step ensures every item ends up in a cluster.

3. The Inception v3 model is fine-tuned again, this time using the clusters generated in Step 2 as labels. High-dimension vector expressions are then extracted and used as the final characteristics for images.
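Step 2 can be illustrated with a minimal sketch. It uses plain Lloyd-style K-means in pure Python, not Alibaba’s optimized variant, and the function names, the toy two-dimensional vectors, and the per-category cluster counts are all assumptions made for illustration. The point it shows is that clustering is run inside each category, so every item receives a category-scoped cluster label that could serve as a fine-tuning label in Step 3.

```python
import random
from collections import defaultdict


def kmeans(vectors, k, iters=20, seed=0):
    """Plain K-means (Lloyd's algorithm); returns one cluster index per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: attach each vector to its nearest centroid.
        for idx, vec in enumerate(vectors):
            assignments[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


def cluster_labels_by_category(items, k_per_category):
    """items: list of (item_id, category, vector). Returns {item_id: 'category/cluster'}."""
    by_category = defaultdict(list)
    for item_id, category, vector in items:
        by_category[category].append((item_id, vector))
    labels = {}
    for category, members in by_category.items():
        # Never ask for more clusters than there are items in the category.
        k = min(k_per_category.get(category, 1), len(members))
        assignments = kmeans([vec for _, vec in members], k)
        for (item_id, _), cluster in zip(members, assignments):
            labels[item_id] = f"{category}/{cluster}"
    return labels


# Toy embeddings standing in for penultimate-layer CNN vectors (hypothetical data).
items = [
    ("t1", "tops", [0.0, 0.0]), ("t2", "tops", [0.1, 0.0]),
    ("t3", "tops", [5.0, 5.0]), ("t4", "tops", [5.1, 5.0]),
    ("p1", "pants", [1.0, 1.0]),
]
labels = cluster_labels_by_category(items, {"tops": 2, "pants": 1})
```

Because every item is assigned to exactly one cluster within its own category, a single category can contain several clusters, as the step describes.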

Model I: LSTM-based sequencing outfit combination production

1. High-dimension vectors are obtained using CNN, and the side-information vectors are embedded and stacked with them to form the input layer of the model.

2. Input vectors are divided into two routes after the first MLP layer. One route leads to the LSTM network for sequence learning, and the other leads to the DSSM network for vector alignment. The details of both are below.

LSTM network

Generating combinations is a sequential process. Each item generated for an outfit is considered a sequential step. Starting from the first item, each new item must be correlated with previously generated items. The LSTM network makes this possible because it models sequential relationships intrinsically. LSTM, an extension of the recurrent neural network (RNN), includes gating functions that allow it to capture long-term dependencies effectively.

For this scenario, S represents an outfit, x_t represents the CNN feature representation of the t-th item, and S = {x_1, x_2, …, x_N} represents an outfit sequence. Following maximum likelihood estimation (MLE), the aim is to maximize the expectation of:
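One common way to write such an objective, assumed here from the standard sequence-model formulation rather than taken from the source, is the average log-likelihood of each next item given the items already generated, with θ denoting the model parameters:

```latex
\max_{\theta} \; \frac{1}{N} \sum_{t=1}^{N-1} \log P\left(x_{t+1} \mid x_1, \ldots, x_t;\, \theta\right)
```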

DSSM network

A DSSM network approach is adopted due to the expectation that items in one outfit combination should have closer distances in vector space. Positive samples come from online logs and quality outfits. The most-clicked and highest-quality outfits from veteran Alibaba experts are collected from the online log and divided into pairs to serve as positive examples for the DSSM network. Outfits receiving a low number of clicks are used as negative examples. As shown in the figure, vector expressions for items pass through MLP before entering LSTM. When LSTM generates an item X, X is subjected to MLP transformation, and its distance from other items that have not yet entered LSTM is calculated. Positive examples of other items are denoted as Y+ and negative examples as Y−. Short distances between positive examples and long distances between negative ones are the ideal situation. Therefore, loss is expressed as:
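A pairwise form consistent with this description is sketched below. The logistic loss and the smoothing factor γ are assumptions borrowed from the pairwise formulation commonly used with DSSM, not details confirmed by the source:

```latex
\Delta = \mathrm{sim}_{\theta}(X, Y^{+}) - \mathrm{sim}_{\theta}(X, Y^{-}),
\qquad
L(\Delta;\, \theta) = \log\left(1 + \exp(-\gamma \Delta)\right)
```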

where the sim function uses cosine similarity, θ stands for parameters, and the goal is to maximize Δ. Mini-batch SGD is used to optimize θ on GPU.
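The effect of maximizing Δ can be checked numerically with a small sketch. The function names, the toy vectors, and the logistic loss with smoothing factor `gamma` are assumptions for illustration; only the use of cosine similarity and the goal of a large Δ come from the text above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_delta_and_loss(x, y_pos, y_neg, gamma=10.0):
    """Return (delta, loss) for one (X, Y+, Y-) triple.

    delta = sim(X, Y+) - sim(X, Y-); the logistic loss shrinks as delta grows.
    """
    delta = cosine(x, y_pos) - cosine(x, y_neg)
    loss = math.log1p(math.exp(-gamma * delta))
    return delta, loss


# Toy vectors (hypothetical): the positive item points the same way as X.
x, y_pos, y_neg = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
delta, loss = pairwise_delta_and_loss(x, y_pos, y_neg)
print(f"delta={delta:.3f} loss={loss:.5f}")
```

Swapping the positive and negative items makes Δ negative and the loss much larger, which is the signal that pushes items from the same outfit closer together in vector space.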

Model II: DAN-based non-sequencing outfit production

The launch of the first version, based on the LSTM model, proved satisfactory. However, subsequent studies indicated that DAN is a superior model. The only change made to the previously described structure is the replacement of the LSTM module. The DAN model achieved lower loss and better output.

The core idea of DAN is to treat an outfit as a combination rather than a sequence. Take outfits consisting of tops and pants as an example. Two sequences of training data are required to train the LSTM network: tops + pants and pants + tops. For DAN, tops and pants are input into the network as a combination, with no sequential difference.

An outfit’s training data is input into DAN after passing through CNN and side-information embedding. Vectors first pass through a non-linear layer and then enter the pooling layer. Trials of this process using sum-pooling and max-pooling found that sum-pooling is the better option. The entire DAN-based process is outlined in the following figure.

Compared to LSTM, DAN achieves lower loss during training. Another advantage of DAN is that it only requires combination data, rather than full permutations, to construct training data. Less training data shortens training time and makes periodic model iteration possible.
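The order-independence argument can be made concrete with a short sketch. The layer weights, the tanh activation, and the function names are assumptions for illustration; the source only states that vectors pass through a non-linear layer followed by sum-pooling.

```python
import math
from itertools import combinations, permutations


def nonlinear(vec, weights, bias):
    """One fully connected layer followed by tanh (activation choice is assumed)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vec)) + b)
        for row, b in zip(weights, bias)
    ]


def dan_encode(item_vectors, weights, bias):
    """Apply the non-linear layer to each item, then sum-pool across items."""
    hidden = [nonlinear(vec, weights, bias) for vec in item_vectors]
    return [sum(column) for column in zip(*hidden)]


# Hypothetical item embeddings and layer parameters.
top, pants = [0.2, 0.7], [0.9, 0.1]
weights, bias = [[0.5, -0.3], [0.1, 0.8]], [0.0, 0.1]

forward = dan_encode([top, pants], weights, bias)
reverse = dan_encode([pants, top], weights, bias)
same = all(math.isclose(a, b) for a, b in zip(forward, reverse))

# Training examples needed for one three-item outfit.
outfit = ["top", "pants", "shoes"]
n_sequences = len(list(permutations(outfit)))       # order-sensitive model
n_combinations = len(list(combinations(outfit, 3)))  # order-free model
print(same, n_sequences, n_combinations)
```

Because sum-pooling ignores input order, one training example covers every ordering of the same outfit, whereas an order-sensitive model would need each ordering listed separately.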

Context graph-based prediction process

Outfits tend to be defined differently depending on the operator. Women’s clothing sellers often view top+bottom+shoes or dress+accessories+bag as complete outfits. One-piece dresses and jeans never appear in the same outfit. From the perspective of a seller of household goods, the absence of any one of a bed, a nightstand, lamps, or wallpaper can make a bedroom combination incomplete. In practice, operators often factor in the usage scenario and additional restrictions based on style and season. The main difficulty in this scenario is creating combination algorithms that meet the needs of operators. Alibaba designed a context graph to address this issue.

A context graph is a set of structured operation rules, including constraints on category, style, season, and more. In the outfit combination prediction stage, all items and their side-information undergo embedding, stacking and full connections before they are stored in an item pool.

Take DAN as an example: when an activity sends a request, all triggering items pass through DAN. Without constraints, the MLP output would simply be used to search the item pool for similar items to obtain the next item. With the context graph, constraints are factored into the similarity search, and results are filtered so that only items that meet the operation rules end up in the candidate set. Next, the top K items are picked from the candidate set. Whenever an item is added to the outfit, the combination constraints that have already been satisfied are re-calculated to guide the generation of the next item and a new candidate set.

The context graph is packed into the model, making outfit combination prediction a fully real-time process. It also ensures a high yield rate, as every generated outfit meets the input conditions of operators, reducing the cost of manual filtering.
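The constrained search described above can be sketched as a filter followed by a top-K ranking. The rule schema (`categories`, `season`, `exclusive_pairs`), the item dictionaries, and the function names are all hypothetical; the source does not specify how the context graph is represented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def allowed(item, outfit, context_graph):
    """Check one candidate against the operator rules (hypothetical schema)."""
    used = {o["category"] for o in outfit}
    if item["category"] in used:
        return False  # one item per category
    if item["category"] not in context_graph["categories"]:
        return False  # category is not part of this outfit template
    if item["season"] != context_graph["season"]:
        return False  # season constraint
    for a, b in context_graph["exclusive_pairs"]:
        if (item["category"] == a and b in used) or (item["category"] == b and a in used):
            return False  # e.g. dresses and jeans never share an outfit
    return True


def top_k_next_items(query_vec, item_pool, outfit, context_graph, k=2):
    """Filter the pool by the rules, then rank survivors by similarity to the query."""
    candidates = [it for it in item_pool if allowed(it, outfit, context_graph)]
    candidates.sort(key=lambda it: cosine(query_vec, it["vec"]), reverse=True)
    return candidates[:k]


graph = {
    "categories": {"dress", "jeans", "shoes", "bag"},
    "season": "summer",
    "exclusive_pairs": [("dress", "jeans")],
}
outfit = [{"id": "d1", "category": "dress", "season": "summer", "vec": [1.0, 0.0]}]
pool = [
    {"id": "j1", "category": "jeans", "season": "summer", "vec": [1.0, 0.0]},
    {"id": "s1", "category": "shoes", "season": "summer", "vec": [0.9, 0.1]},
    {"id": "b1", "category": "bag", "season": "summer", "vec": [0.5, 0.5]},
    {"id": "s2", "category": "shoes", "season": "winter", "vec": [1.0, 0.0]},
]
picks = top_k_next_items([1.0, 0.0], pool, outfit, graph, k=2)
print([it["id"] for it in picks])
```

In this toy run the jeans and the winter shoes are the closest matches by similarity alone, but both are filtered out by the rules before ranking, so every returned candidate already satisfies the operator’s conditions.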

Textual Algorithms

This section describes all relevant preparation and processes for the use of textual algorithms in this scenario.


Veteran design experts have created hundreds of thousands of self-determined outfits. Descriptions of these outfits are collated and used as training data. Training data inputs include item titles and outfit tags. Words are used as the basic unit.

CPGN model

Operational inputs are added to the Pointer-Generator Network (PGN) to establish strong connections between copywriting, items, and operational requirements. This new approach is named CPGN, and its algorithm structure is as follows:

The entire framework is an encoder-decoder structure. First, the original data (x_1, x_2, …, x_n) and operational inputs (z_1, z_2, …, z_n) are encoded. Every individual word from the original data goes through a single-layer bi-directional LSTM network, and its hidden state is denoted as h_i. Operational input can be a complete sentence or keywords. In the former case, LSTM processing is still adopted. In the latter case, embedding is applied directly to the keywords, which are denoted as r_i. Where i indexes the i-th input and t indexes the decoding step, the attention distribution (a^t) and context vectors (h_t^* and r_t^*) are denoted as follows:
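A form consistent with these definitions is sketched below, following the standard attention mechanism of pointer-generator networks. Computing a separate attention distribution over the operational inputs (written with a tilde) is an assumption, since the source does not state whether the two context vectors share weights:

```latex
e_i^t = \eta(h_i, s_t), \qquad a^t = \mathrm{softmax}(e^t), \qquad h_t^{*} = \sum_i a_i^t \, h_i

\tilde{e}_i^t = \eta(r_i, s_t), \qquad \tilde{a}^t = \mathrm{softmax}(\tilde{e}^t), \qquad r_t^{*} = \sum_i \tilde{a}_i^t \, r_i
```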

where η is a multi-layer MLP with tanh as its activation function, and s_t represents the decoder state at the t-th step.

The attention distribution can be viewed as the probability that each encoded source word is used to generate the decoder output at the current step, where h_t^* and r_t^* are attention-weighted sums that express the information obtained from the source statements. The probability distribution of the next word across the dictionary can be derived on this basis:
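Assuming the decoder state and both context vectors are concatenated before being passed to g (the concatenation is an assumption in line with the standard pointer-generator formulation), the distribution can be written as:

```latex
P_{\mathrm{vocab}} = \mathrm{softmax}\left( g\left( [\, s_t ;\; h_t^{*} ;\; r_t^{*} \,] \right) \right)
```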

where g is a two-layer MLP.

In this way, the probability of the generated section is derived. To balance pointing and generating, a parameter p_gen ∈ [0, 1] is introduced. It is a soft probability switch that depends on the current decoder state s_t, the context vectors h_t^* and r_t^*, and the decoder input y_{t−1}.

p_gen allows the next word to be generated from the dictionary or copied from the input. Assuming the predicted probability of each word in the dictionary is P_vocab(w), then:
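Following the standard pointer-generator formulation, extended with the operational context vector, the switch and the final distribution can be sketched as below. The weight vectors w_h, w_r, w_s, w_y and the bias b are assumed names for learned parameters:

```latex
p_{\mathrm{gen}} = \sigma\left( w_h^{\top} h_t^{*} + w_r^{\top} r_t^{*} + w_s^{\top} s_t + w_y^{\top} y_{t-1} + b \right)

P(w) = p_{\mathrm{gen}} \, P_{\mathrm{vocab}}(w) + \left(1 - p_{\mathrm{gen}}\right) \sum_{i \,:\, w_i = w} a_i^t
```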

where σ is a sigmoid function.
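The mixing of generation and copying can be illustrated with a few lines of code. This is a sketch of the standard pointer-generator mixture with made-up probabilities; the function name and the toy vocabulary are assumptions for illustration.

```python
import math


def final_distribution(p_gen, p_vocab, attention, source_words):
    """Mix the vocabulary distribution with the copy (attention) distribution.

    p_gen        : probability of generating from the dictionary, in [0, 1]
    p_vocab      : {word: probability} over the fixed dictionary
    attention    : one attention weight per source word (sums to 1)
    source_words : the input words that the attention weights refer to
    """
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must lie in [0, 1]")
    # Generation part: scale every dictionary word by p_gen.
    dist = {word: p_gen * prob for word, prob in p_vocab.items()}
    # Copy part: add (1 - p_gen) * attention mass to each source word,
    # creating an entry if the word is not in the dictionary.
    for weight, word in zip(attention, source_words):
        dist[word] = dist.get(word, 0.0) + (1.0 - p_gen) * weight
    return dist


# Toy numbers (hypothetical): "yoga" and "bricks" are not in the dictionary.
p_vocab = {"quality": 0.5, "choices": 0.3, "brands": 0.2}
source_words = ["yoga", "bricks", "quality"]
attention = [0.6, 0.3, 0.1]

dist = final_distribution(0.7, p_vocab, attention, source_words)
print(sorted(dist.items(), key=lambda kv: -kv[1]))
```

Words that appear only in the input still receive probability mass through the copy term, while the combined distribution continues to sum to one.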

The resulting P(w) covers the entire extended dictionary, including words that were not originally in the dictionary but appear in the inputs. This helps to solve the out-of-vocabulary (OOV) problem. In the training stage, assuming the t-th target word is w_t^*, loss can be denoted as:
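Assuming the usual negative log-likelihood used to train pointer-generator networks, averaged over T decoding steps (T is an assumed symbol for the target length), this is:

```latex
\mathrm{loss}_t = -\log P\left(w_t^{*}\right), \qquad \mathrm{loss} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{loss}_t
```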

Finally, a coverage mechanism was introduced to tackle repeated words. As all algorithms are launched directly online, Alibaba added a regular expression layer after duplicate removal to catch any remaining loopholes.
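The source does not list the patterns used in the regular expression layer. As a minimal sketch of what such a safety net might look like, the hypothetical rule below collapses immediately repeated words in space-delimited text; real copywriting in Chinese would need a different tokenization.

```python
import re

# A word followed one or more times by itself, separated by spaces or commas.
_ADJACENT_REPEAT = re.compile(r"\b(\w+)(?:[\s,]+\1\b)+", flags=re.IGNORECASE)


def collapse_repeats(text):
    """Replace runs like 'big big' or 'quality, quality' with a single word."""
    return _ADJACENT_REPEAT.sub(r"\1", text)


print(collapse_repeats("big big brands, big discounts"))
print(collapse_repeats("quality quality, quality choices"))
```

Non-adjacent repeats, such as the second "big" in the first example, are left alone because they are usually intentional.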

Dida Case Studies

This section looks at some successful case studies of Dida in iFashion outfits, Mobile Taobao Primary Focus, and YouHaoHuo.

iFashion outfit

iFashion is a Taobao page built around outfit combinations. iFashion places high demands on graphical algorithms, both in terms of content quality and visual effect. Alibaba periodically uses Dida to produce matching outfits and text descriptions for iFashion selections. This represents a huge addition to the outfit pool, which previously relied solely on contributions from veteran experts. Outfits generated via algorithm are mixed together with outfits from veteran experts in waterfall feeds and then displayed to users in personalized pushes. Creating outfits via algorithm provides significant benefits, including decreased costs, positive feedback, and high conversion rates.

Mobile Taobao Primary Focus

Mobile Taobao Primary Focus is a Taobao banner with strong operational demands. Every image is linked to an event page that displays copywriting for the event and images of the products being sold. In personalized launches, Alibaba’s recommendation algorithms display personalized content for every user based on their behavior. Before Dida, identical copywriting was used for all item combinations, even if the images were personalized.

In this scenario, Dida’s graphical algorithms were used to create multi-item combinations for the garment industry. The resulting outfit combinations were high-quality and enriched the imagery and presentation in the primary focus area.

Alibaba has also tested using textual algorithms for copywriting across multiple industries. When using a conventional approach, operators often use a very bland, general style. For example:

· Primary copywriting: Sports party

· Secondary copywriting: Choice brands, large discount

· Benefits: Rush for big value coupons.

This style of copywriting is boring and easily ignored by consumers.

In contrast, Dida generates copywriting that is far more interesting and engaging:

For this example, Alibaba’s model uses item titles, descriptions, and attributes (information about yoga bricks) as the input data. Operational inputs are individual benefit keywords or tags (quality choices, big brands, sales promotions). The final copywriting consists of:

· Primary copywriting: Quality choices for yoga bricks (a combination of words extracted from item titles and operational inputs)

· Secondary copywriting: Big brands, big discounts, spend-and-get-back (derived from words extracted from operational inputs).

In comparison, smart copywriting is customized for specific items and events and describes products in a way that is interesting and highlights their benefits and associated sales events.

The image and text combinations generated for Primary Focus resulted in double-digit percentage increases in CTR and UCTR.


YouHaoHuo

YouHaoHuo is a Taobao page that is highly popular among consumers due to its flagship concept, “Haohuo,” which refers to the sale of quality items. Many of the YouHaoHuo product titles designed by veteran experts are too long to be fully displayed on the current page layout. The resulting truncation leads to unclear item descriptions and prevents users from viewing the full information for items. Alibaba used Dida’s copywriting algorithms to re-extract key information from titles and limit their length, helping users to make better decisions.

(Original article by Chen Wen陈雯)
