It was quite fun and enlightening talking to Oliver Tan, co-founder and CEO at ViSenze. Having started ViSenze in 2012, he is one of the pioneers in the commercial applications of computer vision and machine learning. We discussed ViSenze's primary areas of focus, its unique value proposition, the industry landscape, the prevalence of visual search, and the future of visual search. I was surprised to hear that ViSenze's API usage increased roughly six-fold in 2017, and it was interesting to learn how the company converts videos and images into shoppable experiences. Below you can find our podcast, edited for clarity and brevity.
Oliver: I'm one of the four founders of ViSenze, which is a combination of two words: visual and sense. The objective is simply to bring together the visual web and try to make sense of it using computer vision. We spun out of the National University of Singapore about five years ago, where we worked in a lab set up jointly by the National University of Singapore and Tsinghua University in China, one of the top three universities in the region. We built technologies focused on areas that Google wasn't looking at, one of which was computer vision, a very nascent field at that time. We taught machines to process pixels and understand concepts within images themselves, and we played around with a lot of social data that we gathered from the visual web in China as well as in the US. This allowed us to understand not just objects, but also concepts in images. The problem statement that we focused on at that point in time was very simple: if we can extract intelligence from pixels and images, what can we do to help online shoppers who are basically telling us that they are searching but not finding? Why is it that they are searching and not finding? Is it because they're using the wrong keywords, or is it because the keywords they used did not exist in the product taxonomies that merchants and retailers had in the first place? The answer is actually both, and therein lies the huge disconnect between the way we describe things and the actual products that are being tagged by merchants and retailers.
If I could show you a very nice caged high-heel ladies' sandal or shoe, how would you even try to verbalize that? You can see it, but if you were to show the same shoe to ten women in a room, they'd all use different descriptions like "cage lady sandal shoe," "strappy lady sandal shoe," or "gladiator lady sandal shoe." We all use the terms that are most familiar to us, but not necessarily the ones that match the product taxonomy. So, we figured if we could take away this hassle of keyword guessing, and instead just use pure images, we would be able to shorten the path to search as well as discovery. Now, I always like to say that a picture is really worth a thousand words, but you just don't want to use a thousand words to describe the picture. Instead, you should let the image speak for itself. So, that's exactly what we do. For example, across the border, products in Japan are tagged with Japanese metadata and keywords. How do I even try to look for a cage lady sandal in Japan if I don't speak Japanese, right? But images themselves are a universal language, so using visual search, we can take that image as an input or as a query, process it to identify what exactly is in that image, and extract the visual attributes (in apparel, that would be things like color and collar: is it a mandarin collar? Is it a standard collar? Etc.). We extract all of these visual attributes, which allow us to approximate the actual result as if you were a real shopper in the real world looking for that product. It's kind of like the experience of walking into a store and saying, "Hey, I like the pattern very much but I don't like the cut – I want it shorter." You can describe that in the real world, but in the online world, you don't have that benefit. So, by training machines, we're able to approximate that – we call it our "nearest neighbor" algorithm.
Because you may find a dress that’s short but the store also has something that’s much longer, so we try to approximate the entire product as if you were looking for the real stuff in the real world.
With that, we've been able to apply ViSenze's technology starting in the fashion vertical, because fashion is the most visual vertical amongst all of the e-commerce verticals that we looked at, and it's easy to see that the technology helps shoppers search better. About three years ago, we transitioned from machine learning to deep learning, and that basically gave us a quantum leap in terms of not just accuracy but the overall relevance of the results that we've been able to generate. We've been playing around with different metadata, not just the visual input that we extract from pixels and images, in order to recommend better and more relevant items. For instance, if you're looking at a $50 dress, does it make sense for me to show you a visually similar dress that is worth $500? That's not relevant, even though it's visually similar, so we've been able to take in a lot more signals beyond just visual signals and generate more relevance in contextual recommendation.
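As an illustration of the "nearest neighbor" idea combined with non-visual signals like price, here is a minimal sketch. Everything in it is a hypothetical stand-in (the tiny three-number feature vectors, the toy catalog, the `price_weight` blend); a production system would use deep-learning embeddings and a far more sophisticated ranking model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def rank_products(query_vec, query_price, catalog, price_weight=0.3):
    """Rank catalog items by visual similarity, penalising large price gaps."""
    scored = []
    for item in catalog:
        visual = cosine_similarity(query_vec, item["vec"])
        # Price affinity: 1.0 for identical prices, shrinking as they diverge.
        price = 1.0 / (1.0 + abs(item["price"] - query_price) / max(query_price, 1))
        score = (1 - price_weight) * visual + price_weight * price
        scored.append((score, item["id"]))
    return [pid for _, pid in sorted(scored, reverse=True)]

catalog = [
    {"id": "dress-a", "vec": [0.9, 0.1, 0.2], "price": 55},   # similar look, similar price
    {"id": "dress-b", "vec": [0.9, 0.1, 0.2], "price": 500},  # similar look, 10x the price
    {"id": "bag-c",   "vec": [0.1, 0.8, 0.5], "price": 50},   # different look
]
print(rank_products([1.0, 0.1, 0.2], 50, catalog))
# ['dress-a', 'dress-b', 'bag-c']
```

Note how the $500 look-alike still outranks the visually dissimilar item but drops behind the similarly priced one, which is the kind of relevance trade-off described above.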
Moving forward very quickly, we've moved beyond fashion. ViSenze is now into home and décor, furniture, CPG (Consumer Packaged Goods), schematic drawings, artwork, product designs, and the like. They're very visual in nature and have distinctive patterns in the images, so we've been able to apply the technology there. Currently, about 80% of our clients are in the retail space and about 20% are in non-retail spaces. To give you a sense of the kind of customers we have on the retail side, they include brands like Rakuten (Japan's number one marketplace) and Uniqlo (Japan's largest fashion retailer), which uses us on its mobile platform. We've worked with three of the top five e-commerce players in Southeast Asia and two of the top five in South Korea, and currently have two of Europe's largest fashion brands using us. I can't name them because we are under strict NDAs, but I'll be very happy to show you some sites that use image search without putting names to them. And of course, we're working with some major retailers, as well as marketplaces, in North America right now.
The team is still very young. We have about 40 people on the R&D side, and we probably have one of the largest concentrations of computer vision and machine learning people under one roof in Singapore, where we are headquartered. Our underlying stack is completely built on deep learning today. And of course, the other 20 percent of use cases (outside of retail) that we looked at – trademarks, logo detection, artwork and product design, schematic drawings – those are very interesting areas that visual search can be applied to. So, what I just described to you relates to one branch of our solution, which is visual search, on the image recognition side. Image recognition is basically converting pixels into keywords. We have our product tagging APIs, which are in beta right now and are used by some major retailers to enhance or enrich the keywords in their catalogs. For instance, if I'm able to detect a mandarin collar, we can then expose that attribute as a keyword description, which allows me to search for a jacket with the exact phrase "mandarin collar." We use that to expose what we call "fashion attributes" in the fashion vertical. When used in fashion, they enrich the catalogs of all these major marketplaces, making their products easier to find using the natural keywords that everyday shoppers use.
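To make the tagging idea concrete, here is a toy sketch of catalog enrichment: detected visual attributes above a confidence threshold become searchable keywords. The detection triples, field names, and threshold are all hypothetical assumptions; this is not ViSenze's actual product tagging API.

```python
# Hypothetical attribute detections as (attribute, value, confidence) triples;
# a real system would get these from an image-recognition model.
def enrich_catalog_entry(entry, detections, threshold=0.8):
    """Add confidently detected visual attributes as searchable keywords."""
    keywords = set(entry.get("keywords", []))
    for attribute, value, confidence in detections:
        if confidence >= threshold:
            keywords.add(value)                           # e.g. "mandarin collar"
            keywords.add(f"{value} {entry['category']}")  # e.g. "mandarin collar jacket"
    return {**entry, "keywords": sorted(keywords)}

entry = {"id": "sku-123", "category": "jacket", "keywords": ["jacket", "navy"]}
detections = [
    ("collar", "mandarin collar", 0.93),
    ("sleeve", "puff sleeve", 0.41),  # below threshold: dropped
]
print(enrich_catalog_entry(entry, detections)["keywords"])
# ['jacket', 'mandarin collar', 'mandarin collar jacket', 'navy']
```

The enriched entry now matches a shopper's query for "mandarin collar jacket" even though the merchant never typed that phrase into the catalog.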
Cem: Thank you. I think we put you under image tagging and visual search and those are the two topics that you mentioned so far, right?
Oliver: Yes, that’s right.
Cem: But I'm sure we missed some things because it's difficult to categorize the whole space, but it's good to hear that we've got the important ones right.
Oliver: These are the two most important verticals that you put us in, but there are another two you might want to take note of. We also offer product recommendations. Visual search and product recommendations are fundamentally different engines; the underlying logic is different. We work with different partners and players who use our technology as part of their product recommendations. Instead of a consumer uploading an image to search for something, the product recommendation system uses visual inputs from whatever he or she is already looking at on the site to make a recommendation to the consumer.
The last area that we've been working on is what I call "video commerce." The idea is very simple; if we can process images, we can also process videos. Videos carry a lot of intelligence, so the signals we can extract from them, from fashion show videos for example, are a lot richer. The items can be exposed, and product recommendations can be shown back within the video stream as we detect them. That's why I call it a video commerce experience; instead of creating shoppable videos, we make videos shoppable using machine learning.
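A minimal sketch of the idea just described: sample frames from a video, run a product detector on each, and record the timestamps of shoppable moments. The stubbed detector and the pre-extracted frames are assumptions standing in for a real video-decoding and recognition pipeline.

```python
def detect_products(frame):
    """Stub detector: returns product labels found in a frame.
    A real system would run an object-detection model on the pixels."""
    return frame.get("products", [])

def shoppable_moments(frames, interval_s=2.0):
    """Scan frames sampled every interval_s seconds; return (timestamp, product) pairs."""
    moments = []
    for i, frame in enumerate(frames):
        timestamp = i * interval_s
        for product in detect_products(frame):
            moments.append((timestamp, product))
    return moments

# Simulated frames sampled from a fashion show video.
frames = [
    {"products": []},
    {"products": ["trench coat"]},
    {"products": ["trench coat", "leather handbag"]},
]
print(shoppable_moments(frames))
# [(2.0, 'trench coat'), (4.0, 'trench coat'), (4.0, 'leather handbag')]
```

Each (timestamp, product) pair is a point in the stream where a recommendation overlay, or a visual search against a product database, could be triggered.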
Cem: That’s a pretty interesting use case, and there’s plenty of existing material to use. So, that’s interesting.
Oliver: Yes, it's a very new area; only a few players in the market that I know of are doing it. A lot of people create shoppable videos the old way: embedding products within videos that are designed from the start for products to be embedded and exposed. But the newer way, using machine learning to expose shoppable moments within videos, is few and far between.
Cem: You are essentially doing image tagging on the video, right?
Oliver: It's a combination of both image recognition and visual search on video. Not only do we detect; we generate a shopping use case by recommending, or by running a visual search against a product database to find relevant products. It doesn't just have to be fashion; it could be other things as well. It could be a branded product, such as a logo, or it could be used for contextual advertising purposes. We call it "in-video contextual advertising," and it's a very rich area that has yet to be fully explored.
Cem: And can you talk a bit about the industry landscape and the competition, large and small players? We can also discuss pricing and roll-out times for your products.
Oliver: Okay, let me take the first one which is the landscape. Visual search is already appearing on mainstream search today where we have major players like Rakuten implementing visual search on their platform. It really heralds the mainstreaming of visual search as part of the natural search experience for any consumer. My vision is to see democratized visual search and have visual search as common a feature as keyword search on any shopping site or any shopping platform.
Now with that, I see the marketplace has all sorts of visual search solutions out there, and even platforms like Google utilize it. But here's the challenge: whereas general image search has reached a sufficient level that stretches right across every vertical, how does one get deeper, sharper data and understand consumers' behavior better? This is where companies like ViSenze come into business, because when we train, we train on much sharper data, real data provided by the merchants and retailers we work with. There is a difference, because working on real-world data, understanding the way people shop, and seeing whether or not they convert on things like optimized prices tells us the optimization required to understand actual consumer behavior. A general visual search engine for the worldwide web may or may not find the product you're searching for. Fundamentally, such engines are not optimized for conversion, and that's where the foundational disconnect between companies like ViSenze and web-wide search engines like Google happens. Since we are vertical, we call ourselves "vertical AI agents" that are heavily optimized for a specific function, like engagement, conversion, or discovery. In terms of the landscape, you will see that there are a few companies like ViSenze, and we are not alone in focusing on key verticals. We look deeply into things such as the visual attributes of fashion apparel. Earlier, I talked about the mandarin collar; that's an example of how we do visual attribution. When we were in India, we thought that we understood modern fashion wear, but didn't realize there was a niche for ethnic wear. For instance, what's a sari?
Cem: I’ve worked in India for a bit so I can imagine the challenges, completely different fashion wardrobe.
Oliver: Yes, that's right. Your standard outfit concept doesn't really work because ethnic wear such as a sari is considered a full outfit; there's no upper-body or lower-body concept, it's just the entire outfit. You have to retrain for that domain understanding. That's what gets us further and deeper than most others when we look at specific verticals.
Cem: Yes, totally. What do you think about Amazon? They have plenty of data for sure.
Oliver: Of course, the big guys have the advantage of data. Amazon just a couple of months ago launched Echo Look, have you tried that?
Cem: No, no. How is it?
Oliver: It’s alright. Amazon’s Echo Look is a device that you place in your bedroom and every morning before you go to work, you are supposed to ask Echo Look, “How am I dressed today?” It’s meant to give you an answer, or a recommendation based on how you could dress better. Now, I honestly wonder how useful it is as a real-world product. If someone has been dressing well every day for 35 years of their life, why would they ever need that product to tell them to dress better moving forward?
I see this product as a fashion statement that Amazon launched, but I think the real use case will be one in which we use AI to help people discover things that they don't already know. For instance, if I bought a new scarf, is there a better way for me to wear this scarf? What color combinations does it go with? It's about sharing good suggestions and then letting the consumer decide at the end of the day. It's also about equipping and empowering consumers with visual knowledge that they would not otherwise have. I see this as a large opportunity for AI given the data that we have, and we will be able to use that at scale for consumers.
Cem: You work with plenty of these big e-commerce companies, but you also have a solution that can be used as self-service via the API. With these large e-commerce companies, is it more like exposing the API to them, or more like consulting work where you work with them to make sure the solution is tightly integrated with their systems?
Oliver: I'll give you two answers to that. Most of our clients use our APIs, and these are easily integrated and implemented. Some clients use the SDKs that we provide as well, and it's pretty standard.
At ViSenze, what we built is a highly trainable and highly configurable algorithm model. On top of the models that we have, we can take in feedback and data directly from clients and retrain the model extremely quickly. We can use a client's data to optimize for their environment and their use case. For instance, in North Asia, unlike in the U.S., a lot of the images are visually noisy. It's not uncommon to see catalogue images with two or three models in the same image, so we optimize for that environment.
Although ViSenze is a cloud-based solution right now, we’re moving towards an on-premise solution where we can actually deploy or transfer our entire algorithm model within a closed data environment. This will let clients feel more secure having their data updated and shared within that environment. So, we have two models to go with.
Cem: Very clear. And once you have the model set up, one thing that I am wondering is how frequently visual search is used by the customers of the e-commerce websites. I'm sure it depends a lot on culture, maybe, or on when it was introduced, but you know so much more, so whatever insight you want to share, I'm curious.
Oliver: That’s a very good question, I’m glad you asked that. When we started ViSenze five years ago, of course there were natural skeptics; does it actually work? Is it a natural consumer behavior? Five years later, when we have guys like ourselves and even Amazon using visual search on their platforms, it’s almost as natural as keyword search.
I’m going to share with you some stats. In 2016, we processed more than 350 million queries in the whole of the year. This is almost equivalent to one million searches a day from all of the clients, who are then exposing our APIs to their end consumers. In the first half of this year, we have seen that volume already increase three times. I am seeing a lot more visual search happening in one form or another, whether it’s an upload search experience or a product discovery experience, or recommendation experience – it is happening at scale.
Cem: And how do you see it changing in the future? Are there any trends you’re already seeing?
Oliver: Yes, in fact I’m seeing a lot more people using social media images – Instagram and Pinterest. In China, we have images from Sina Weibo, which is one of the top three social media platforms. People are not just searching clean catalogue images that they come across on other competitive websites. They’re using social media images – from Instagram, Facebook – that are being shared by their friends or Pinned, reposted, and liked. I see a big influence coming from the social media space and companies like Instagram and Pinterest are catalyzing and propelling visual search forward faster.
Cem: Well, we're almost out of the time we allocated, so if you have anything you want to share as a final message to potential end users, we can talk about that.
Oliver: My favorite line is, "the future of search is visual," and one way or another, it's coming. Mary Meeker talked about it in her 2017 Internet Trends Report; to be fair, she said "visual as well as conversational," but it is there. We believe that visual search will become even more mainstream moving forward, but not just in shopping; visual search can be applied in many areas outside of e-commerce. We're heavily optimized for commerce, which is big enough for any of us to focus on already.