Scientists have been working on the puzzle of human vision for many decades. Deep learning based on Convolutional Neural Networks (CNNs, or convnets) reached a new landmark for image recognition in 2015, when Microsoft announced it had beaten the human benchmark. Five days later, Google one-upped Microsoft with a 0.04% improvement.

Figure 1. In a typical convnet model, the forward pass reduces the raw pixels into a vector representation of visual features. In this condensed form, the features can be effectively classified using fully connected layers.



Data Scientists don’t sleep. The competition immediately moved to the next battlefield of object segmentation and classification for embedded image content. The ability to pick out objects inside a crowded image is a precursor to fantastic capabilities, like image captioning, where the model describes a complex image in full sentences. The initial effort to translate full-image recognition to object classification involved different means of localization to efficiently derive bounding boxes around candidate objects. Each bounding box is then processed with a CNN to classify the single object inside the box. A direct pixel-level dense prediction without preprocessing was, for a long time, a highly sought-after prize.


Figure 2. Using bounding boxes to classify embedded objects in an image.


In 2016, a UC Berkeley group led by E. Shelhamer achieved this goal using a technique called the Fully Convolutional Network (FCN). Instead of using a convnet to extract visual features and then fully connected layers to classify the input image, the fully connected layers are converted into additional convolutional layers. Whereas fully connected layers lose all information about the original pixel locations, the cells in the final layer of a convnet remain path-connected to the original pixels through a construct called the receptive field.
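The conversion from a fully connected layer to a convolutional layer can be illustrated with a small NumPy sketch. This is not code from the paper or from any of the projects mentioned here; the layer sizes and the helper `conv2d_valid` are hypothetical and chosen only to keep the example short. The idea is that the dense weights are reshaped into one filter per class, so the same weights can slide over a larger input and produce a grid of scores.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Stride-1 convolution without padding.

    x:      (H, W, C_in) feature map
    kernel: (kh, kw, C_in, C_out) filter bank
    returns (H - kh + 1, W - kw + 1, C_out)
    """
    kh, kw, _, c_out = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w, c_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + kh, j:j + kw, :]
            # Contract the patch against every filter at once.
            out[i, j] = np.tensordot(patch, kernel, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
H, W, C, K = 4, 4, 3, 5                      # hypothetical sizes, K classes
fc_weights = rng.standard_normal((H * W * C, K))

# Reshape the dense weights into one H x W x C filter per class.
conv_kernel = fc_weights.reshape(H, W, C, K)

# On an input of the original size, both forms give the same K scores.
small = rng.standard_normal((H, W, C))
fc_scores = small.reshape(-1) @ fc_weights
conv_scores = conv2d_valid(small, conv_kernel)   # shape (1, 1, K)

# On a larger input, the convolutional form yields a grid of scores,
# one K-vector per receptive field, where the dense layer would not fit.
large = rng.standard_normal((H + 2, W + 3, C))
score_map = conv2d_valid(large, conv_kernel)     # shape (3, 4, K)
```

Each cell of `score_map` corresponds to one window of the input, which is how spatial location survives in the convolutional form.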

Figure 3. During the forward pass, a convnet reduces raw pixel information to condensed visual features which can then be effectively classified using fully connected neural network layers. In this sense, the feature vectors contain the semantic information derived from looking at the image as a whole.



Figure 4. In dense prediction, we want to both leverage the semantic information contained in the final layers of the convnet and assign the semantic meaning back to the pixels that generated it. The upsampling step, which the paper describes as backwards convolution, maps the feature representations back onto the original pixel positions.



The upsampling step is of great interest. In a sense, it deconvolves the dense representation back to its original resolution, and the deconvolution filters can be learned through Stochastic Gradient Descent, just like the filters in the forward pass. A good visual demonstration of deconvolution can be found here. The most practical way to implement this deconvolution step is through bilinear interpolation, as discussed later.
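To make the link between deconvolution and bilinear interpolation concrete, here is a minimal NumPy sketch of a strided transposed convolution whose filter is fixed to bilinear weights. The function names are hypothetical and the loop-based implementation is for illustration only; it handles a single channel and is not how the projects discussed here implement upsampling. In a learned version, the same filter would simply be a trainable parameter, possibly initialized to these values.

```python
import numpy as np

def bilinear_kernel(size):
    """2-D bilinear interpolation weights for a size x size filter."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    rows, cols = np.ogrid[:size, :size]
    return (1 - abs(rows - center) / factor) * (1 - abs(cols - center) / factor)

def transposed_conv2d(x, kernel, stride):
    """Scatter each input value into the output, weighted by the kernel."""
    h, w = x.shape
    k = kernel.shape[0]
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * kernel
    return out

def upsample_bilinear(x, factor):
    """Upsample a single-channel map by an integer factor."""
    size = 2 * factor - factor % 2
    full = transposed_conv2d(x, bilinear_kernel(size), factor)
    crop = (size - factor) // 2          # trim the border added by the kernel
    return full[crop:crop + x.shape[0] * factor, crop:crop + x.shape[1] * factor]

coarse = np.ones((3, 3))
up = upsample_bilinear(coarse, 2)        # shape (6, 6)
```

Away from the border, a constant input stays constant after upsampling, which is the behavior expected from interpolation weights that sum to one at every output position.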

The best dense prediction goes beyond just upsampling the last and coarsest convnet layer. By fusing results from shallower layers, the result becomes much more finely detailed. Using a skip architecture as shown in Figure 4, the model is able to make accurate local predictions that respect global structure. The fusion operation is based on concatenating vectors from two layers and performing a 1 x 1 convolution to reduce the vector dimension back down again.
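The fusion step as described above can be sketched in a few lines of NumPy. The tensor shapes, the random weights, and the nearest-neighbor upsampling used to align the two layers are all assumptions made for brevity; they are not taken from the paper or from any specific implementation.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Repeat each cell 'factor' times along height and width."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def conv1x1(x, weights):
    """A 1 x 1 convolution is a per-pixel matrix multiply over channels.

    x: (H, W, C_in), weights: (C_in, C_out)
    """
    return x @ weights

rng = np.random.default_rng(0)
coarse = rng.standard_normal((4, 4, 16))   # deep layer: low resolution, rich semantics
fine = rng.standard_normal((8, 8, 8))      # shallower layer: higher resolution

# Bring the coarse map to the finer resolution, then stack along channels.
stacked = np.concatenate([upsample_nearest(coarse, 2), fine], axis=-1)  # (8, 8, 24)

# Reduce the stacked channels back down, here to 21 per-pixel class scores.
weights = rng.standard_normal((24, 21))
fused = conv1x1(stacked, weights)          # (8, 8, 21)
```

Because the 1 x 1 convolution acts on each pixel independently, it mixes information across the two layers without blurring spatial detail.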


Figure 5. Fusing upsampling results from shallower layers pushes the prediction limits to a finer scale.



As is often the case when working with Deep Learning, collecting high-quality training data is a real challenge. In the image recognition field, we are blessed with open source data from the PASCAL VOC project. The 2011 dataset provides 11,530 images with 20 classes. Each image is pre-segmented with pixel-level precision by academic researchers. Examples of segmented images can be found here.



Computer vision enthusiasts also benefit hugely from open source projects, which implement almost every exciting new development in the deep learning field. The paper authors' group posted a Caffe implementation of FCN. For Keras implementations, you will find no fewer than 9 FCN projects on GitHub. After trying out a few, we focused on the Aurora FCN project, which started running with very few modifications. The authors provided rather detailed instructions on environment setup and downloading of datasets. We chose the AtrousFCN_Resnet50_16s model out of the six included in the project. The training took 4 weeks on a cluster with two Nvidia 1080 cards, which was surprising but perhaps understandable given the huge number of layers. The overall model architecture can be visualized either as a JSON tree or as PNG graphics, although both are too long to fit on one page. The figure below shows just one tiny chunk of the overall model architecture.

Figure 6. Top portion of the FCN model. The portion shown is less than one-tenth of the total.


It is important to point out that the authors of the paper and of the code both leveraged established image recognition models, generally the winning entries of the ImageNet competition, such as the VGG nets, ResNet, AlexNet, and GoogLeNet. Imaging is the one area where transfer learning applies readily. Researchers without the near-infinite resources found at Google and Microsoft can still leverage their training results and retrain high-quality models by adding only small new datasets or making minor modifications. In this case, the proven classification architectures named above are modified by stripping away the fully connected layers at the end and replacing them with fully convolutional and upsampling layers.


In particular, the open source code we experimented with is based on ResNet from Microsoft. At the time of its submission, ResNet had the distinction of being the deepest network ever presented on ImageNet, with 152 layers. In order to make such a deep network converge, the submitting group had to tackle a well-known problem where the error rate tends to rise rather than drop after a certain depth. They discovered that by adding skip (aka highway) connections, the overall network converges much better. The explanation lies in the relative ease of training intermediate layers to minimize residuals rather than the originally intended mapping (thus the name Residual Network). The figure below illustrates the skip connections used in the original ResNet paper, which are also found in the open source FCN model derived from ResNet.
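The residual idea can be shown with a toy block in NumPy. This is a simplified sketch, assuming dense layers as stand-ins for the convolutional layers of the real network; the function names and sizes are hypothetical. The block computes a residual F(x) and adds the input back, so when the learned residual is zero the block passes its input through unchanged, while the same layers without the skip connection would discard it.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(x + F(x)), where F is a small two-layer transform."""
    residual = relu(x @ w1) @ w2      # F(x): the part the block has to learn
    return relu(x + residual)         # the skip connection adds the input back

rng = np.random.default_rng(0)
x = np.abs(rng.standard_normal((2, 6)))      # non-negative activations
w1 = rng.standard_normal((6, 6)) * 0.1
w2_zero = np.zeros((6, 6))                   # a block that has learned "nothing"

with_skip = residual_block(x, w1, w2_zero)   # equals x: identity comes for free
without_skip = relu(relu(x @ w1) @ w2_zero)  # all zeros: the signal is lost
```

This is one way to see why very deep stacks of such blocks are easier to train: extra layers can default to the identity instead of having to learn it.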

Figure 7a. ResNet uses multiple skip connections to improve the overall error rate of a very deep network.



Figure 7b. Middle portion of the Aurora model displaying skip connections, which are characteristic of ResNet.


The exact intuition behind the Residual Network is less than obvious. There is plenty of good discussion in this Quora blog.


As alluded to in Figure 4, at the end stage the resolution of the tensor must be brought back to the original dimensions using an upsampling step. The original paper stated that a simple bilinear interpolation is fast and effective, and this is the approach taken in the Aurora project, as illustrated below.

Figure 8. Only a single upsampling stage was implemented in the open source code.


Although the paper authors pointed out the improvement achieved by the use of skips and fusions in the upsampling stage, these are not implemented in the Aurora FCN project. The diagram for the end stage illustrates that only a single upsampling layer is used. This may leave room for further improvement in error rate.

The code simply makes a TensorFlow call to implement this upsampling stage:

X = tf.image.resize_bilinear(X, new_shape)



The metric used to measure segmentation accuracy is intersection over union (IOU). The IOU values measured over 21 randomly selected test images are:

[ 0.90853866  0.75403876  0.35943439  0.63641792  0.46839113  0.55811771
  0.76582419  0.70945356  0.74176198  0.23796475  0.50426148  0.34436233
  0.5800221   0.59974548  0.67946723  0.79982366  0.46768033  0.58926592
  0.33912701  0.71760929  0.54273803]

These have a mean of 0.585907, which is very close to the number published in the original paper. The pixel-level classification accuracy is very high at 0.903266, meaning that when a pixel is classified as a certain object type, it is correct about 90% of the time.
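For readers unfamiliar with these metrics, the following NumPy sketch shows one common way to compute per-class intersection over union and pixel accuracy from a confusion matrix. It is an illustration on a tiny hand-made label map, not the evaluation script of the Aurora project; the function names and the 3 x 3 example are hypothetical.

```python
import numpy as np

def confusion_matrix(truth, pred, num_classes):
    """Entry [t, p] counts pixels with true class t and predicted class p."""
    index = truth.reshape(-1) * num_classes + pred.reshape(-1)
    counts = np.bincount(index, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def iou_per_class(cm):
    """Intersection over union for each class."""
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    return intersection / np.maximum(union, 1)   # guard against empty classes

def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class matches the true class."""
    return np.diag(cm).sum() / cm.sum()

# Tiny example with three classes (0, 1, 2).
truth = np.array([[0, 0, 1],
                  [0, 1, 1],
                  [2, 2, 1]])
pred = np.array([[0, 0, 1],
                 [0, 0, 1],
                 [2, 1, 1]])

cm = confusion_matrix(truth, pred, num_classes=3)
ious = iou_per_class(cm)          # [0.75, 0.6, 0.5]
mean_iou = ious.mean()
accuracy = pixel_accuracy(cm)     # 7 of 9 pixels are correct
```

The example also shows why the two numbers can diverge: pixel accuracy is dominated by large regions, while mean IOU penalizes every class equally, including small ones.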



The ability to identify image pixels as members of a particular object without a pre-processing step of bounding box detection is a major step forward for deep image recognition. The technique demonstrated in Shelhamer’s paper achieves this goal by combining coarse-level semantic identification with pixel-level location information. It leverages transfer learning based on pre-trained image recognition models that were winning entries in the ImageNet competition. Various open source projects have replicated the results, although certain implementations require extraordinarily long training times.

Voice Ordering Is Here. Voice Shopping Is Coming… And It’s Far More Interesting

Siri has been with us for years, but it’s in the last few months and largely due to Amazon that voice assistants have won rapid adoption and heightened awareness.

Over these past few months, we’ve been shown the power of a new interaction paradigm. I have an Echo Dot and I love it. Controlling media and the home controls (doing some lights, maybe thermostat soon) seem among the most useful and sticky applications. The Rock, Paper, Scissors skill… yeah, that one’s probably not going to see as much use. But let’s not forget that this slick device is brought to us by the most dominant eCommerce business in the known universe. So it’s great for voice shopping, right? No, not at all, as it doesn’t actually do “shopping.”

“But I heard the story about the six-year-old who ordered herself a dollhouse?” So did I, and it reinforces my point. Let me explain. The current state of commerce via Alexa is most like a broad set of voice operated Dash Buttons. For quick reorders of things you buy regularly and when you’re not interested in price comparisons, it’s fine. What it’s not — voice shopping. Shopping is an exercise in exploration, research, and comparison. That experience requires a friendly and intelligent guide. As such, voice shopping isn’t supported by the ubiquitous directive-driven (do X, response, end) voice assistants.


Enter Jaxon and Conversational AI

Shopping is about feature and price comparison, consideration of reviews, suggestions from smart recommendation engines, and more. Voice shopping is enabled by a conversational voice experience, one that understands history and context and delivers a far richer experience than is widely available today.


The Mobile Impact

Mobile commerce isn’t new but is still growing fast. Yet, despite consumers spending far more time on mobile devices than on desktops (broadly defined, including laptops), small screen eCommerce spending still lags far behind.

So why can’t merchants close on mobile? The small screen presents numerous challenges. Small screens make promotion difficult and negatively impact upselling and cross-selling. Another major factor, and one you’ve probably experienced, is the often terrible mobile checkout process. Odds are you’ve abandoned a mobile purchase path after fiddling with some poorly designed forms. I have. Maybe you went back via your laptop. Maybe you didn’t. Either way, that’s a terrible user experience.

Our approach to Conversational AI solves these small screen challenges. Merchants can now bring a human commerce experience to the small screen without the mess. It’s a new, unparalleled engagement opportunity — a chance to converse with your customer, capture real intelligence about their needs, and offer just the right thing. It’s an intelligent personal shopper in the hands of every customer.

Come re-imagine voice shopping with us. Imagine product discovery and comparison, driven by voice. Imagine being offered just what you were looking for, based on a natural language description of what you need. Imagine adjusting your cart with your voice. Imagine entering your payment and shipping info quickly and seamlessly, via voice. It’s all possible and it’s happening now with Jaxon.