When I was in graduate school, I designed a construction site of the future. It was in collaboration with Texas Instruments in the late 90s. The big innovation, at the time, was RFID (radio-frequency identification). Not that RFID was new. In fact, it has been around since World War II, when it was used to identify Allied planes. After the war, it made its way into industry through anti-theft applications. In the 80s, a group of scientists from Los Alamos National Laboratory formed a company using RFID for toll payment systems (still in use today). A separate group of scientists there also created a system for tracking medication management in livestock. From there it made its way into multiple other applications and began to proliferate.

RFID got a boost in 1999 when two MIT professors, David Brock and Sanjay Sarma, reversed the trend of adding more memory and more functionality to the tags and stripped them down to a low-cost, very simple microchip. The data gleaned from the chip was stored in a database and was accessible via the web. This was right at the time that the wireless web emerged (good old CDPD) as well, which really bolstered widespread adoption. This also precipitated funding from large companies, like Procter & Gamble and Gillette (this was before P&G acquired Gillette), to institute the Auto-ID Center at MIT, which furthered the creation of standards and cemented RFID as an invaluable weapon for companies, especially those with complex supply chains.

OK, as you can tell, RFID has a special place in my heart. I even patented the idea of marrying RFID with images, but that is another story. Anyway, up to this point you’ve probably decided this is a post about RFID, but it’s not. It’s a post about the path from RFID to IoT (Internet of Things). The term Internet of Things was coined by British entrepreneur Kevin Ashton in 1999 while working at the Auto-ID Center, specifically referring to a global network of objects connected by RFID. But RFID is just one type of sensor and there are numerous sensors out there. I like this definition from Wikipedia:

In the broadest definition, a sensor is an electronic component, module, or subsystem whose purpose is to detect events or changes in its environment and send the information to other electronics, frequently a computer processor. A sensor is always used with other electronics, whether as simple as a light or as complex as a computer.

Sensors have been around for quite some time in various forms. The first thermostat came to market in 1883, and many consider this the first modern, manmade sensor. Infrared sensors have been around since the late 1940s, even though they’ve only recently entered the popular vocabulary. Motion detectors have been in use for a number of years as well. Originally invented by Heinrich Hertz in the late 1800s, they were advanced in World War II in the form of radar technology. There are numerous other sensors: biotech, chemical, natural (e.g. heat and pressure), sonar, infrared, microwave, and silicon sensors to name a few.

According to Gartner, there are currently 8 billion IoT units worldwide and there will be 20 billion by 2020. Suffice it to say there are numerous sources of data to track “things” within an organization and throughout supply chains. There are also numerous complexities to managing all of these sensors, the data they generate, and the actionable intelligence that is extracted and needs to be acted on. Some major obstacles are networks with time delays, switching topologies, density of units in a bounded region, and metadata management (especially across trading partners and customers). These are all challenges we at BigR.io have helped customers work through and resolve. A great example is our Predictive Maintenance offering.

Let’s get back to RFID to IoT. There is a tight coupling because the IP address of the unit needs to be supplemented with other information about the thing (for example, condition, context, location, security, etc.). RFID and other sensors working in unison can provide this supplemental information. This marriage enables advanced analytics, including the ability to make predictions. Large sensor networks must be properly architected to enable effective sensor fusion. Machine Learning helps take IoT to the next level of sophistication for predictions and automated fixes, and can help figure out when and where every “thing” fits in the ecosystem it plays in. A proper IoT agent should monitor the health of the systems individually and in relation to other parts. Consensus filters help with convergence analysis, reduce noise propagation, and improve the ability to track fast signals.
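
To make the consensus filter idea a little more concrete, here is a minimal sketch of average consensus, one common form of it. Each sensor repeatedly nudges its estimate toward its neighbors' values until the whole network agrees on the average. The ring topology, the readings, and the step size are all assumptions made up for illustration.

```python
def consensus_step(values, neighbors, step_size):
    """One synchronous update: each node moves toward its neighbors' values."""
    return [
        x + step_size * sum(values[j] - x for j in neighbors[i])
        for i, x in enumerate(values)
    ]


def run_consensus(values, neighbors, step_size=0.3, iterations=50):
    """Repeat the update; step_size should stay below 1 / (max node degree)."""
    for _ in range(iterations):
        values = consensus_step(values, neighbors, step_size)
    return values


# Hypothetical ring of four temperature sensors, each linked to two neighbors.
readings = [20.0, 22.0, 19.0, 23.0]
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

estimates = run_consensus(readings, ring)
print(estimates)  # every node ends up close to the network average
```

No node ever sees the full set of readings; agreement emerges from local exchanges only, which is why this style of filter suits large sensor networks.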

There are other factors that play into why IoT is so hot right now: the whole Big Data phenomenon has lent itself to the growth, endless compute power has served as a foundation on which advanced applications using IoT can run, and the Machine Learning libraries have been democratized by companies like Google, Facebook, and Microsoft. In general, Machine Learning thrives when mounds of data are available. However, storing all data is cost-prohibitive, and there is so much data being generated that most companies opt to store only bits of critical data. Some companies only store the data to freeze it from failures. You may not want to store all data, but you don’t want to lose “metadata,” or the key information that the data is trying to tell you, whether from the sensor itself or indirectly through neighboring sensors. I had a stint where we supported Federal and Defense-related sensor fusion initiatives and I picked up a handy classification of data:

  • Data
  • Information
  • Knowledge
  • Intelligence

The flow is moving the metadata being generated down the line into information → knowledge → intelligence that can be acted upon.
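
As a rough illustration of that flow, the sketch below pushes a handful of sensor readings through the four stages. The vibration values, the 2.0 rise limit, and the function names are all hypothetical; a real pipeline would be far richer.

```python
from statistics import mean

# Data: hypothetical vibration readings (mm/s) from one sensor.
raw_data = [2.1, 2.3, 2.2, 4.8, 5.1, 5.4]


def to_information(readings):
    """Data -> information: summarize the raw readings."""
    half = len(readings) // 2
    return {
        "early_mean": mean(readings[:half]),
        "late_mean": mean(readings[half:]),
    }


def to_knowledge(info, rise_limit=2.0):
    """Information -> knowledge: interpret the summary against a limit."""
    rise = info["late_mean"] - info["early_mean"]
    return {"rise": rise, "degrading": rise > rise_limit}


def to_intelligence(knowledge):
    """Knowledge -> intelligence: something that can be acted upon."""
    return "schedule maintenance" if knowledge["degrading"] else "no action needed"


action = to_intelligence(to_knowledge(to_information(raw_data)))
print(action)
```

Only the small summaries need to be kept at each stage, which fits the point above about not storing every raw reading.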

There also exists the ABCs of Data Context:

[A]pplication Context: Describes how raw bits are interpreted for use.

[B]ehavioral Context: Information about how data was created and used by real people or systems.

[C]hange Over Time: The version history of the other two forms of data context.

Data context plays a major role in harnessing the power of an IoT network. As we progress to smarter networks, more sophisticated sensors, and artificial intelligence that manages our “things,” the architecture of your infrastructure (enterprise data hub), the cultivation and management of your data flows, and the analytics automation that rides on top of everything become critical for day-to-day operations. The good news is that if this is all done properly, you will reap the rewards of thing harmony (coined here first folks).

Please visit our Deep Learning Neural Networks for IoT white paper for a more technical slant.

Deep Learning: Image and Video Recognition

Written by Bruce Ho

BigR.io’s Chief Big Data Scientist

Abstract

This paper illustrates the advancements in implementing Deep Neural Networks for automatic feature extraction in images and video for applications including facial recognition, programmatic video highlights, and image segmentation and object classification. Given the limitations of human abilities in earlier extraction methods, these networks substantially increase accuracy, output, and available feature selection options for further analysis. BigR.io specializes in the following industry use cases:

  • Image Recognition

  • Video Highlights

  • Anomaly Detection

 

ABOUT BIGR.IO

BigR.io is a technology consulting firm empowering data to drive analytics for revenue growth and operational efficiencies. Our teams deliver software solutions, data science strategies, enterprise infrastructure, and management consulting to the world’s largest companies. We are an elite group with MIT roots, shining when tasked with complex missions: assembling mounds of data from a variety of sources, building high-volume, highly-available systems, and orchestrating analytics to transform technology into perceivable business value. With extensive domain knowledge, BigR.io has teams of architects and engineers that deliver best-in-class solutions across a variety of verticals. This diverse industry exposure and our constant run-in with the cutting edge empower us with invaluable tools, tricks, and techniques. We bring knowledge and horsepower that consistently deliver innovative, cost-conscious, and extensible results to complex software and data challenges. Learn more at www.bigr.io.

 

OVERVIEW

Over the past few years, Deep Neural Networks (DNNs) have matched or surpassed human-level performance in recognizing and interpreting images on some benchmarks. These DNNs use Convolutional Neural Networks (CNNs) to automatically extract features from an input image with the use of convolution filters. Backpropagation then lets these filters learn their kernel values, starting with random values and ending up with elemental features that best represent the class of images being trained (for instance, nose, eye, and jaw shapes for face images). Image recognition is also where the highly coveted idea of transfer learning got its early foothold. Pre-trained models based on certain categories of images can be repurposed for various classification applications using only a small dataset. Since data preparation and labeling is one of the most challenging steps when carrying out supervised learning, the impact this concept has on accelerating this process cannot be overstated. Published models and datasets by some of the biggest players in the field (Google, Microsoft, etc.) now serve as a strong starting point to build robust application-specific models for businesses with only modest means for development.
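
To show what a single convolution filter does, here is a small sketch in plain Python. The 4x4 image and the vertical-edge kernel are hand-picked for illustration; in a real CNN the kernel values would start random and be learned through backpropagation.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (valid mode, no padding).

    This is cross-correlation, which is what CNN layers usually compute
    under the name "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy image: dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)
print(feature_map)  # → [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map lights up only in the column where the edge sits, wherever that column happens to be, which is the basis of the position-independent feature detection CNNs rely on.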

 

INDUSTRY USE CASES

Similar to the adoption of best practices in big data and data science across several industry verticals, image and video recognition solutions affect business outcomes across diverse government agencies and businesses. In this paper, we specifically examine use cases in the security and professional sports segments, but these solutions illustrate applications across all areas of video content creation, consumption, and monitoring.

 

IMAGE INSIGHTS

Image: FCN8s

 

Image recognition can go beyond classification tasks for an entire image. In dense prediction, we ask the neural network to detect the semantic context of any given pixel in a document or image. CNNs work by first finding image features that resemble certain filter functions, then floating such features to a top-level representation as a translation-invariant descriptor (e.g., detection of a nose, regardless of its position within the image). By combining both coarse- and fine-grained features at different scales, we obtain both the semantic context and location information of any one pixel. This opens the door for pixel-level semantic segmentation (aka dense prediction). Recent work on Fully Convolutional Networks (FCNs) leverages this capability to extract the semantic context of a digitized document. One could, for example, use FCNs to detect whether a particular pixel is part of a title, section header, figure caption, image, or long paragraph. A mobile user could then easily re-layout or restyle an electronic document using the extracted semantic context. FCNs have also been successfully applied to segment parts of an image, as well as full documents, with remarkable accuracy.

How does a system pick potential customers from an image of a crowd, a soccer team, or a room full of event attendees? Given a close-up face shot, is this person happy to be here, in the target age group, or giving a positive response to the last sales message? Being able to answer these audience measurement questions for marketing is one of the hot areas in need of a deep learning solution. Many classic approaches to facial feature extraction and classification, such as Support Vector Machines, have been applied to this long-standing problem. Deep learning research in facial identification is relatively new but already outperforming older techniques by a wide margin. This development, and many other impressive improvements achieved by deep learning, are generally attributed to the automatic feature extraction function of neural networks and the incremental accuracy boost that deep learning techniques achieve when given a huge training dataset.

In many applications, a high-quality, close-up facial shot is not always available. Picking faces out of an ordinary action photo may be the first step before applying any facial feature analysis. For this, region-based CNNs (R-CNNs) excel in both speed and accuracy. The R-CNN approach proposes a number of bounding boxes in the original photo using what is called Selective Search. In this method, initial object boundaries are set using a graph-based pixel similarity approach. Neighboring boxes with high pixel similarity metrics are then merged to further reduce the object count. Finally, each boxed object can be classified based on a pre-trained image recognition model.
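
The merging step can be sketched in a few lines. This is a heavily simplified stand-in for Selective Search, not the published algorithm: it compares regions by mean intensity only, and the `Region` class, the 0.9 threshold, and the sample boxes are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Region:
    box: tuple   # (x1, y1, x2, y2), with x2 and y2 exclusive
    mean: float  # mean pixel intensity in [0, 1]
    size: int    # number of pixels


def touches(a, b):
    """True when the two boxes overlap or share a border."""
    ax1, ay1, ax2, ay2 = a.box
    bx1, by1, bx2, by2 = b.box
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2


def similarity(a, b):
    """Toy similarity: 1.0 for identical mean intensity, lower otherwise."""
    return 1.0 - abs(a.mean - b.mean)


def merge(a, b):
    """Combine two regions into their bounding box with a weighted mean."""
    size = a.size + b.size
    mean = (a.mean * a.size + b.mean * b.size) / size
    box = (min(a.box[0], b.box[0]), min(a.box[1], b.box[1]),
           max(a.box[2], b.box[2]), max(a.box[3], b.box[3]))
    return Region(box, mean, size)


def merge_regions(regions, threshold=0.9):
    """Greedily merge the most similar touching pair until none qualifies."""
    regions = list(regions)
    while True:
        best = None
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                if touches(regions[i], regions[j]):
                    score = similarity(regions[i], regions[j])
                    if score >= threshold and (best is None or score > best[0]):
                        best = (score, i, j)
        if best is None:
            return regions
        _, i, j = best
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)


# Three hypothetical side-by-side regions: two dark, one bright.
initial = [
    Region((0, 0, 5, 5), 0.20, 25),
    Region((5, 0, 10, 5), 0.25, 25),
    Region((10, 0, 15, 5), 0.90, 25),
]
proposals = merge_regions(initial)
print(sorted(r.box for r in proposals))
```

The two dark regions collapse into one proposal while the bright one stays separate, so the classifier only has to look at two boxes instead of three.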

 

In other efforts, researchers have extended facial analysis to emotion detection. Classically, this simply involved image labeling where the subject exhibits a range of facial expressions and a group of volunteers would mark each as happy, sad, angry, etc., typically up to eight emotions. More recent work also incorporates dynamic facial movements, for example, capturing the complete sequence of facial movements for a smile or frown. A more generalizable model can be developed using linear scoring along the valence-arousal graph. A prediction of valence and arousal scores on future subjects can then be interpreted using a wider range of emotion states instead of the initial selection of about eight.

 

Image: valence-arousal plot

Reference: G. Paltoglou and M. Thelwall, “Seeing Stars of Valence and Arousal in Blog Posts,” IEEE Transactions on Affective Computing, vol. 4, no. 1, Jan.-Mar. 2013.

Points on the valence-arousal plot can be translated to commonly understood emotions.
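
One simple way to do that translation is a nearest-neighbor lookup against a few labeled anchor points, sketched below. The anchor coordinates are assumptions chosen for illustration and are not taken from the referenced paper.

```python
import math

# Illustrative (valence, arousal) anchors on a [-1, 1] scale.
# These positions are assumptions for this sketch only.
EMOTION_ANCHORS = {
    "excited": (0.7, 0.7),
    "content": (0.7, -0.5),
    "sad": (-0.7, -0.5),
    "angry": (-0.7, 0.7),
    "neutral": (0.0, 0.0),
}


def nearest_emotion(valence, arousal):
    """Return the label whose anchor is closest to the predicted point."""
    point = (valence, arousal)
    return min(
        EMOTION_ANCHORS,
        key=lambda name: math.dist(point, EMOTION_ANCHORS[name]),
    )


print(nearest_emotion(0.8, 0.6))    # → excited
print(nearest_emotion(-0.6, -0.4))  # → sad
```

Because the model predicts continuous scores, adding a new emotion label only means adding another anchor, with no retraining.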

 

VIDEO HIGHLIGHTS

There are numerous highlights in every major sporting event. Manual real-time extraction of these highlights, even by fully attentive labelers, is error-prone, labor-intensive, expensive, and hard to scale. Furthermore, while the most recent games may benefit from manual labeling, there are years of archived footage that remain unprocessed. Most off-stats highlights, such as a ball boy slipping while chasing a tennis ball or a Major League splitter in a Little League game, are overlooked by human observers who are instructed to look only for specific events.

Today, we can automate programmatic video highlights using video recognition techniques. In addition to applying CNNs to static image features, Recurrent Neural Networks (RNNs) are able to classify video segments using optical flow between image frames. This technique is easily trained not only to extract official stat events, but also to extract any interesting player motion not explicitly logged and indexed — for example, an alley-oop in basketball. Due to the automated nature of these extraction tasks, studios can come up with new ideas at any time to build upon an existing menu of highlights.
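
As a very rough illustration of motion-driven segment detection, the sketch below flags stretches of frames where a lot changes between consecutive images. It deliberately swaps in plain frame differencing for the optical flow and RNN classifier described above, so treat it as a toy stand-in; the tiny 2x2 frames and the threshold are made up.

```python
def motion_energy(prev_frame, frame):
    """Mean absolute pixel change between two grayscale frames."""
    total = sum(
        abs(a - b)
        for prev_row, row in zip(prev_frame, frame)
        for a, b in zip(prev_row, row)
    )
    return total / (len(frame) * len(frame[0]))


def highlight_segments(frames, threshold):
    """Return (start, end) frame index ranges with motion above threshold."""
    segments, start = [], None
    for i in range(1, len(frames)):
        active = motion_energy(frames[i - 1], frames[i]) > threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(frames) - 1))
    return segments


# Hypothetical clip: two still frames, a burst of motion, then stillness.
still = [[10, 10], [10, 10]]
burst_a = [[10, 60], [10, 10]]
burst_b = [[60, 10], [10, 10]]
frames = [still, still, burst_a, burst_b, burst_b, burst_b]

segments = highlight_segments(frames, threshold=5)
print(segments)  # → [(2, 3)]
```

A production system would feed richer motion features into a trained sequence model so that it can tell an alley-oop from a camera pan, which simple thresholding cannot do.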

Going beyond sporting events, any kind of motion picture, video ad, or short-form video opens itself up for potential indexing and repurposing. For example, a DC Comics fan may want the ability to easily find all instances of female superhero encounters within the DC universe. This task requires automatic video highlight extraction, which is the key to reviving and monetizing vast archives of content that would otherwise remain buried and forgotten.

 

Image: Durant eyeing Rihanna after hitting a 3-pointer (she was cheering for LeBron).

 

ANOMALY DETECTION

Independent Component Analysis (ICA) is one approach to learning the sparse feature representations used for anomaly detection, and many variants have been proposed. An ICA-based deep sparse feature extraction strategy combined with a non-parametric Bayesian approach can automatically determine the optimal dimension for the latent feature vector, removing the heavy labor in parameter tuning that a full deep learning approach would entail. The reported accuracy improvement exceeds 10% over previous results. Variants of Restricted Boltzmann Machines (RBMs) are another major direction of research for deep-sparse representation. While much progress has been made on the theoretical front, the experimental results thus far lag behind the best ICA models.

Reference: Y. Cong, J. Yuan, and J. Liu, “Sparse reconstruction cost for abnormal event detection,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2011, pp. 3449–3456.

The graph on the right is a sparse vector representation of the image on the left. The vector dimensions, called training bases, are laid out along the x-axis, with the bars representing the coefficients for the bases needed to represent the image. A normal sample (top) can be represented as a sparse linear combination of the training bases, while an anomalous sample (bottom) requires a large number of base elements.
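
The idea in that figure can be sketched with a toy sparse coder. The code below uses greedy matching pursuit as a stand-in (it is not the optimization used in the referenced paper) and flags a sample as anomalous when it needs too many bases. The identity-matrix bases, the sample vectors, and the `max_active` limit are all assumptions for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sparse_code(sample, bases, tolerance=1e-9, max_steps=100):
    """Greedy matching pursuit over unit-norm bases.

    Returns {basis_index: coefficient} for the bases actually used.
    """
    residual = list(sample)
    coefficients = {}
    for _ in range(max_steps):
        if math.sqrt(dot(residual, residual)) <= tolerance:
            break
        # Pick the basis that explains the most of what is left.
        k = max(range(len(bases)), key=lambda i: abs(dot(residual, bases[i])))
        c = dot(residual, bases[k])
        if c == 0.0:
            break
        coefficients[k] = coefficients.get(k, 0.0) + c
        residual = [r - c * b for r, b in zip(residual, bases[k])]
    return coefficients


def is_anomalous(sample, bases, max_active=2):
    """Flag samples that need more than max_active bases to reconstruct."""
    return len(sparse_code(sample, bases)) > max_active


# Hypothetical training bases: four unit vectors.
bases = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
normal_sample = [2.0, 0.0, 0.5, 0.0]  # explained by two bases
odd_sample = [1.0, 0.9, 1.1, 0.8]     # needs all four bases

print(is_anomalous(normal_sample, bases))  # → False
print(is_anomalous(odd_sample, bases))     # → True
```

In practice the bases would be learned from normal training footage, so anything the dictionary cannot explain compactly stands out.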

 

CONCLUSION

Recent advancements in image and video recognition pave the way for many business applications that would have been unimaginably hard or expensive to implement before. BigR.io excels at the application of deep learning to images and electronic documents for use cases ranging from facial recognition, to programmatic video highlights, to image segmentation and object classification.

For many years, and with rapidly accelerating levels of targeting sophistication, marketers have been tailoring their messaging to our tastes. Leveraging our data and capitalizing upon our shopping behaviors, they have successfully delivered finely-tuned, personalized messaging.

Consumers are curating their media ever more by the day. We’re buying smaller cable bundles, cutting cords, and buying OTT services a la carte. At the same time, we’re watching more and more short-form video. Video media is tilting toward snack-size bites and, of course, on demand.

Cable has been in decline for years and the effects are now hitting ESPN, once the mainstay of a cable package. Even live sports programming, long considered must-see and even bulletproof by media executives, has seen declining viewership.

 

So what’s to be done?

To thrive, and perhaps merely to survive, content owners must adapt. Leagues and networks have come a long way toward embracing a “TV Everywhere” distribution model despite the obnoxious gates at every turn. But that’s not enough and the sports leagues know it.

While there are many reasons for declining viewership and low engagement among younger audiences, the length of games and broadcasts is a significant factor. The leagues recognize that games are too long. The NBA has made some changes that will speed up the action and the NFL is also considering shortening games to avoid losing viewership. MLB has long been tinkering in the same vein. These changes are small, incremental, and of little consequence to the declining number of viewers.

Most sporting events are characterized by long stretches of calm, less interesting play that is occasionally accented by higher intensity action. Consider for a moment how much actual action there is in a typical football or baseball game. Intuitively, most sports fans know that the bulk of the three-hour event is consumed by time between plays and pitches. Still, it’s shocking to see the numbers from the Wall Street Journal, which point out that there are only 11 minutes of action in a typical football game and a mere 18 minutes in a typical baseball game.

 

A transformational opportunity

There is so much more the leagues can do. Recent advances in neural network technology have enabled an array of features to be extracted from streaming video. The applications are broad and the impacts significant. In this sports media context, the opportunity is nothing short of transformational.

Computers can now be trained to programmatically classify the action in the underlying video. With intelligence around what happens where in the game video, the productization opportunities are endless. Fans could catch all of the action, or whatever plays and players are most important to them, in just a few minutes. With a large indexed database of sports media content, the leagues could present near unlimited content personalization to fans.

Want to see David Ortiz’s last ten home runs? Done.

Want to see Tom Brady’s last ten TD passes? You’re welcome.
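
Once plays are classified and indexed, requests like these boil down to a simple filtered query. Here is a minimal sketch of what that could look like; the `Clip` record, the field names, and the sample entries are hypothetical placeholders, not a real league data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Clip:
    player: str
    event: str
    game_date: date
    video_id: str


def latest_clips(index, player, event, limit=10):
    """Return the most recent clips for a player and event type."""
    matches = [c for c in index if c.player == player and c.event == event]
    matches.sort(key=lambda c: c.game_date, reverse=True)
    return matches[:limit]


# Hypothetical index entries produced by an automated classifier.
index = [
    Clip("Player A", "home_run", date(2016, 9, 20), "clip-001"),
    Clip("Player A", "strikeout", date(2016, 9, 21), "clip-002"),
    Clip("Player B", "home_run", date(2016, 9, 22), "clip-003"),
    Clip("Player A", "home_run", date(2016, 9, 24), "clip-004"),
]

recent = latest_clips(index, "Player A", "home_run", limit=10)
print([c.video_id for c in recent])  # → ['clip-004', 'clip-001']
```

The hard part is building the index automatically; once it exists, personalization is mostly a matter of which filters a fan picks.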

Robust features like these will drive engagement and revenue. With this level of control, fans are more likely to subscribe to premium offerings, providing predictable recurring revenue that will outpace advertising in the long run.

Computer-driven, personalized content is going to happen. It’s going to be amazing, and we are one step closer to getting there.