This project started with the AlchemyAPI recognition algorithm and was later extended to include the VisualInsights recognition algorithm. AlchemyAPI demonstrated a low signal-to-noise ratio: most images were classified as "NO_TAG", indicating that none of the known objects were identified, and the algorithm returned only a single result.

The VisualInsights algorithm was made available as beta in December 2015, and I added its analysis capabilities to augment the signal. The VisualInsights algorithm returns up to twenty-five (25) objects per call. The additional objects the VI algorithm could identify were a great benefit, but the signal, while improved, still included significant noise. In particular, the identification of "humans" was split across various classifications, without any hierarchical organization into groups.

Sadly, the VisualInsights algorithm was deprecated in June 2016, and AlchemyAPI will also cease to operate in 2017. The new algorithm, Watson VisualRecognition, is the child of AlchemyAPI and VisualInsights. It supports multiple entities per image, but its default classifier still generates a poor signal with significant noise (n.b. there is now a hierarchy, but neither the classes nor the hierarchy is published, so both must be discovered from results).

Therefore, I embarked on building a training loop for whatever recognition algorithm I might use. This loop would capture the images from the camera's local storage (n.b. uSD card), present those images to the application's user community (e.g. elderly individual/couple), and enable manual classification for subsequent training, testing, and deployment of a model specific to both this application context (i.e. people detection) and the local environs (e.g. room location, dogs, cats, residents, ...).

Collecting the images

The images needed to create the training data for Watson VR are stored on each device in a local directory (n.b. /var/lib/motion). Each image file name encodes the date and time of the image, as well as a monotonically increasing sequence number. Access to these images is provided through FTP and is restricted to the LAN.
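As an illustration of how such file names could be decoded, here is a minimal sketch. The exact naming pattern is not given above, so the format `YYYYMMDDHHMMSS-SEQ.jpg` is an assumption, and `parse_image_name` and `ImageName` are hypothetical names introduced only for this example.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# ASSUMPTION: names look like "20170105123456-07.jpg" (timestamp, dash,
# sequence number). The real pattern depends on the camera configuration.
NAME_PATTERN = re.compile(r"^(\d{14})-(\d+)\.jpe?g$")


class ImageName(NamedTuple):
    taken_at: datetime
    sequence: int


def parse_image_name(name: str) -> Optional[ImageName]:
    """Return the timestamp and sequence number, or None if the name does not fit."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    try:
        taken_at = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    except ValueError:
        # Fourteen digits that do not form a valid date and time.
        return None
    return ImageName(taken_at, int(match.group(2)))
```

Names that do not fit the assumed pattern return `None`, so other files in the directory can be skipped rather than causing a failure.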

When the end-user engages in curating, a.k.a. labeling, the images into their respective distinct classes (see the next section), another service is invoked (aah-review). The review service periodically collects new events stored by the device in the Cloudant NoSQL repository (e.g. rough-fog). Each new event includes the image identifier; the device is accessed via FTP and the image is collected and collated. When the process is complete, the count of images in each class is updated in Cloudant (e.g. rough-fog/review/all), along with the sequence number of the last event processed.
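The bookkeeping in one review pass might look roughly like the sketch below. It covers only the counting and sequence tracking; fetching images over FTP and reading or writing Cloudant are left out. The event fields `seq`, `image`, and `class`, and the function name `review_new_events`, are assumptions made for illustration, not the actual record layout.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple


def review_new_events(
    events: Iterable[Mapping[str, object]],
    last_seq: int,
    counts: Mapping[str, int],
) -> Tuple[Dict[str, int], int]:
    """Fold events newer than last_seq into per-class counts.

    Returns the updated counts and the sequence number of the last
    event processed (unchanged if nothing was new).
    """
    updated = Counter(counts)
    newest = last_seq
    for event in sorted(events, key=lambda e: e["seq"]):
        if event["seq"] <= last_seq:
            continue  # already handled in an earlier pass
        updated[event["class"]] += 1
        newest = event["seq"]
    return dict(updated), newest
```

Storing the returned sequence number alongside the counts lets the next pass resume where this one stopped.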

Labeling the images

Below is the user interface for labeling images. Options are available as buttons (e.g. person, kitchen, dog, ...) based on previously assigned labels, and new labels can be added in the text entry box. The image's initial classification and capture date are also shown.

Simple Web application to label images

Ideally, an image is given a label if and only if it contains the entity in question, e.g. a person, and does not contain any of the other entities of interest (e.g. dog or cat). The training set also requires negative examples, which do not include any entity (i.e. person, dog or cat). To achieve this distinction, each camera installation is assigned a pre-defined label (e.g. "kitchen") that is used to identify the negative examples. Other locations may be classified in the same way (e.g. bathroom, dining room, living room, ...).

Labeled images are collated into a separate directory structure for their new classes, and symbolic links are used as a state maintenance indicator (i.e. collected, labeled). Once images are labeled they are deemed ready for training; additional curation of the labeled images is performed in the Training phase.
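One way to realize the symbolic-link approach is sketched below. The layout `root/device/label/image`, where the link's presence marks an image as labeled, is an assumption for illustration. `label_image` and `is_labeled` are hypothetical helpers.

```python
import os
from pathlib import Path


def label_image(root: Path, device: str, image: Path, label: str) -> Path:
    """Link a collected image into the directory for its class.

    ASSUMPTION: a symlink at root/device/label/<name> marks the image
    as labeled; the collected file itself is left in place.
    """
    class_dir = root / device / label
    class_dir.mkdir(parents=True, exist_ok=True)
    link = class_dir / image.name
    if not link.is_symlink():
        os.symlink(image.resolve(), link)
    return link


def is_labeled(root: Path, device: str, image_name: str) -> bool:
    """True if any class directory for the device holds a link with this name."""
    device_dir = root / device
    if not device_dir.is_dir():
        return False
    return any(p.is_symlink() for p in device_dir.glob(f"*/{image_name}"))
```

Because only links are created, relabeling an image amounts to removing one link and creating another, without touching the collected file.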

Training the classifiers

The Watson VisualRecognition service provides for both initial learning and updates with new classes and images. The API does not provide details on which images were used in training, for either positive or negative examples, so an independent record of the images used must be maintained. In addition, no standard of practice is defined for validating or measuring the quality of the learned model, so independent testing and quality measurement must be constructed. Finally, as the training process appears to be a required constituent component, other entities (e.g. myself, my wife, my kids, ...) could also be identified and used to train Watson VR.

The training set is limited to 100 megabytes (MB) of data for each class, with a total maximum of 430 MB; the minimum number of labeled images is ten (10). Updates can be made against a single labeled set at a time, and may also include negative examples (i.e. images not including any previously labeled entities).
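A pre-flight check against these limits could be sketched as follows. The numeric limits come from the paragraph above. Treating 1 MB as 1024 * 1024 bytes and applying the ten-image minimum to each class are assumptions, and `check_training_set` is a hypothetical helper.

```python
from typing import Dict, List

MB = 1024 * 1024            # ASSUMPTION: binary megabytes
MAX_CLASS_BYTES = 100 * MB  # per-class limit stated above
MAX_TOTAL_BYTES = 430 * MB  # total limit stated above
MIN_IMAGES = 10             # ASSUMPTION: minimum applies to each class


def check_training_set(classes: Dict[str, List[int]]) -> List[str]:
    """Return a list of problems; classes maps class name to image sizes in bytes."""
    problems = []
    total = 0
    for name, sizes in classes.items():
        class_bytes = sum(sizes)
        total += class_bytes
        if len(sizes) < MIN_IMAGES:
            problems.append(f"{name}: only {len(sizes)} images, need {MIN_IMAGES}")
        if class_bytes > MAX_CLASS_BYTES:
            problems.append(f"{name}: {class_bytes} bytes exceeds per-class limit")
    if total > MAX_TOTAL_BYTES:
        problems.append(f"total: {total} bytes exceeds overall limit")
    return problems
```

An empty result means the set fits within the stated limits under these assumptions.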

Each learned model is referred to by both a name and a specific identifier. The device name (e.g. rough-fog) is used as the model name; the identifier determines the model and serves as an index to keep track of which images have been used for training purposes -- both positive and negative examples. The train_vr script is still in progress. Evident in the log are failures of the Watson VR API call, e.g. 413 Request Entity Too Large, and the corresponding successful repetition.
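The text does not say how train_vr recovers from a 413 response. One plausible strategy, shown here purely as an assumption, is to split an oversized batch in half and retry each half. `PayloadTooLarge` stands in for the HTTP 413 error, and `send` is a hypothetical callable wrapping the real training request.

```python
from typing import Callable, List, Sequence


class PayloadTooLarge(Exception):
    """Stand-in for an HTTP 413 Request Entity Too Large response."""


def send_in_batches(images: Sequence[str], send: Callable[[List[str]], None]) -> int:
    """Send images, halving any batch that is rejected as too large.

    Returns the number of successful calls. Re-raises if a single
    image is still too large, since it cannot be split further.
    """
    pending = [list(images)] if images else []
    calls = 0
    while pending:
        batch = pending.pop()
        try:
            send(batch)
            calls += 1
        except PayloadTooLarge:
            if len(batch) == 1:
                raise
            mid = len(batch) // 2
            # Push the second half first so the first half is sent next.
            pending.append(batch[mid:])
            pending.append(batch[:mid])
    return calls
```

Halving keeps the number of extra calls small while still converging on a batch size the service accepts.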

Results from Watson VR

Once the process has completed successfully, the updated model is recorded in Cloudant.

I copied a confusion matrix calculator and created a simple Web application to display the matrix for a given model and/or device (i.e. Watson VR classifier_id, and name); the prototype is available below:

Simple Web application to view model confusion matrix

Training the Watson VR algorithm on the curated examples improved the results, but overall the recall was less than 69% and typically under 40%.
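For reference, recall for a class is the fraction of images truly in that class that the model also assigned to it. The sketch below shows how a confusion matrix and per-class recall can be computed from (actual, predicted) label pairs; it is a generic illustration, not the calculator used in the Web application, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Matrix = Dict[str, Dict[str, int]]


def confusion_matrix(pairs: Iterable[Tuple[str, str]]) -> Matrix:
    """Count predictions per actual label: matrix[actual][predicted]."""
    matrix = defaultdict(lambda: defaultdict(int))
    for actual, predicted in pairs:
        matrix[actual][predicted] += 1
    return {actual: dict(row) for actual, row in matrix.items()}


def recall(matrix: Matrix, label: str) -> float:
    """Correct predictions for label divided by all images truly of that label."""
    row = matrix.get(label, {})
    total = sum(row.values())
    return row.get(label, 0) / total if total else 0.0
```

Here the rows are the human-assigned labels and the columns are the model's top predictions, so each row shows where a given class gets lost.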

Process Model

The script executes a number of steps sequentially based on the output of the aah-classify Web application. The curated images are organized in the file-system in a directory structure corresponding to device and class, e.g. rough-fog/person.
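A first step for such a script could be enumerating the curated images per class, as in this minimal sketch. It assumes the `device/class` layout described above, rooted at a hypothetical `root` directory, and treats only `.jpg` and `.jpeg` files as images. `curated_images` is a name introduced for illustration.

```python
from pathlib import Path
from typing import Dict, List

IMAGE_SUFFIXES = {".jpg", ".jpeg"}  # ASSUMPTION: curated images are JPEG files


def curated_images(root: Path, device: str) -> Dict[str, List[Path]]:
    """Map each class directory under root/device to its sorted image paths."""
    device_dir = root / device
    if not device_dir.is_dir():
        return {}
    result: Dict[str, List[Path]] = {}
    for class_dir in sorted(p for p in device_dir.iterdir() if p.is_dir()):
        result[class_dir.name] = sorted(
            f for f in class_dir.iterdir() if f.suffix.lower() in IMAGE_SUFFIXES
        )
    return result
```

Calling `curated_images(root, "rough-fog")` would then yield one entry per class, e.g. a `person` key listing the files under rough-fog/person.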