Multimodal Conversational AI Agents

Ricardo G. Sousa

Data Scientist
Project Manager
Conversational Bot

The goal of this project was to develop a high-end fashion marketplace that delivers a customer experience built on advanced conversational AI technology. The aim was to create a seamless, high-touch customer journey that mimics the expertise of a fashion specialist by understanding customer needs and providing personalized fashion advice.

Key Objectives: Scale up the business while maintaining customer loyalty and satisfaction; Leverage live, human-like conversation services to improve conversion rates significantly; Address the growing demand for conversational assistants in online shopping, especially among younger generations; Utilize vast textual and visual data, along with accumulated knowledge from past user experiences, to provide accurate and tailored fashion recommendations; Develop a task-oriented multimodal conversational agent (MCA) that revolutionizes the way users shop online in the high-fashion marketplace.

https://youtu.be/7QkX3jHeFOU

Solution: The solution combined several components to create an effective, user-friendly conversational AI platform for the high-end fashion marketplace. One critical component is attention-based multimodal utterance processing: text utterances are represented with a combination of state-of-the-art NLP strategies, attention mechanisms capture the relevant details in the visual utterance, and the textual and visual information are combined into multimodal utterances. To capture all relevant details in the visual utterance and combine textual and visual information effectively, a synthetic dataset generator covering all use cases was created.
For the Natural Language Understanding (NLU) task, the well-established JointBERT model was employed. The model feeds the latent representation of the [CLS] token to a softmax classifier for intent prediction, and the latent representation of each utterance token to a softmax classifier for slot filling. The learning objective is to maximize the conditional probability of both tasks jointly by minimizing the cross-entropy loss.
A Directed Acyclic Graph (DAG) was chosen for the policy, incorporating use cases such as product information, multimodal information retrieval (show similar, free and faceted search), and disambiguation strategies. The conversational agent covers three main tasks: question answering, retrieval, and chit-chat. To make the NLU model robust enough to distinguish the slots across these diverse contexts, the baseline JointBERT was tailored to the project's specific requirements.
Both template-based and generative response models were explored, with a focus on adapting existing response-generation models to include product information and attributes in the answers.
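The joint NLU objective described above can be illustrated with a small sketch. This is not the project's code: the intent and slot label sets, the logits, and the function names are hypothetical, and the encoder is left out entirely. It only shows how the intent cross-entropy (from the [CLS] logits) and the per-token slot cross-entropies add up to one loss, which is the negative log of the joint conditional probability.

```python
import math

# Hypothetical label sets, chosen only for illustration.
INTENTS = ["product_info", "retrieval", "chit_chat"]
SLOTS = ["O", "B-colour", "B-category"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


def joint_loss(intent_logits, intent_label, slot_logits, slot_labels):
    """Sum of the intent loss and the slot loss of every token.

    Minimizing this sum maximizes p(intent | x) * prod_n p(slot_n | x).
    """
    intent_term = cross_entropy(intent_logits, intent_label)
    slot_term = sum(
        cross_entropy(logits, label)
        for logits, label in zip(slot_logits, slot_labels)
    )
    return intent_term + slot_term


# Toy utterance "red dresses": made-up logits standing in for model outputs.
intent_logits = [0.2, 2.5, -1.0]                    # from the [CLS] representation
slot_logits = [[0.1, 3.0, 0.0], [0.0, 0.2, 2.8]]    # one row per token
loss = joint_loss(
    intent_logits, INTENTS.index("retrieval"),
    slot_logits, [SLOTS.index("B-colour"), SLOTS.index("B-category")],
)
print(f"joint loss: {loss:.4f}")
```

Practical implementations often average the slot term over tokens or weight the two terms; the plain sum is used here because it corresponds directly to the joint probability.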
Proper product representation is crucial for product retrieval, which is one of the most complex use cases. The work focused on representing products through a structured view of the product category, attributes, and their relationships, and on exploring state-of-the-art algorithms and representations that can satisfy multiple search criteria and feedback mechanisms related to pre-selected products. The pre-trained CLIP model was fine-tuned on the product catalog to obtain a model capable of retrieving products from either image or text queries.
One of the main goals of the project is a tool that guides users through the product catalog, reduces the search space, and returns not only specific products but also similar or complementary items. The search is narrowed through natural language processing, and relevant catalog searches rely on vector representations of the products, so that the products closest to one another in a given latent space can be found. To ensure these representations are maintained and served by a robust, high-performing indexing service, the Milvus Vector Similarity Engine (VSE) was selected.
The entire platform runs in containers, ensuring a consistent and reproducible environment for development and deployment. This work was delivered by a team of four people, with me serving as project manager and principal contributor.
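The vector-based retrieval described in this section can be sketched as a nearest-neighbour search over product embeddings. The snippet below is a brute-force stand-in written in plain Python: it does not use CLIP or the Milvus API, and the product ids, the three-dimensional vectors, and the function names are all hypothetical. In the real system the vectors would come from the fine-tuned model and the search would be delegated to the indexing service.

```python
import math

# Hypothetical catalog: product id -> embedding (toy 3-d vectors).
CATALOGUE = {
    "red_dress": [0.9, 0.1, 0.0],
    "crimson_gown": [0.8, 0.2, 0.1],
    "blue_jeans": [0.0, 0.1, 0.9],
}


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def top_k(query, catalogue, k=2):
    """Return the ids of the k products most similar to the query embedding."""
    scored = [(cosine(query, vector), pid) for pid, vector in catalogue.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[:k]]


# A query embedding that, in this toy space, points towards "red" items.
print(top_k([1.0, 0.0, 0.0], CATALOGUE, k=2))
```

Because text and image queries are embedded into the same latent space, the same lookup serves both "show similar" requests (query with a product's own vector) and free-text search (query with the text embedding).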
