This project started out using the AlchemyAPI (http://www.alchemyapi.com/) recognition algorithm and was later extended to include the VisualInsights recognition algorithm. AlchemyAPI demonstrated a low signal-to-noise ratio: most images were classified as "NO_TAG", indicating that none of the known objects were identified, and the algorithm returned only a single result.
The VisualInsights algorithm was made available as a beta in December 2015, and I added its analysis capabilities to augment the signal. The VisualInsights algorithm returns up to twenty-five (25) objects per call. The additional objects it could identify were a great benefit, but the signal, while improved, still included significant noise; in particular, "humans" were split across various classifications with no hierarchical organization into groups.
Sadly, the VisualInsights algorithm was deprecated in June 2016, and AlchemyAPI will also cease to operate in 2017. The new algorithm, Watson VisualRecognition, is the child of AlchemyAPI and VisualInsights. It supports multiple entities per image, but its default classifier still produces a poor signal with significant noise (n.b. there is now a hierarchy, but neither the classes nor the hierarchy is published, so both must be discovered from results).
Therefore, I embarked on building a training loop for whatever recognition algorithm I might use. This loop captures the images from the camera's local storage (n.b. a uSD card), presents them to the application's user community (e.g. an elderly individual or couple), and enables manual classification for subsequent training, testing, and deployment of a model specific to both the application context (i.e. people detection) and the local environs (e.g. room location, dogs, cats, residents, ...).
The images needed to create the training data for Watson VR are stored on each device in a local directory (n.b.
The image file names correspond to the date and time of the image, as well as a monotonically increasing sequence number.
Access to these images is provided through FTP, restricted to clients on the local network (LAN).
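As a rough illustration of the FTP access described above, the following sketch downloads one image from a device using Python's standard `ftplib`. The host address, file name, destination directory, and anonymous login are all assumptions made for the example; the actual device credentials and directory layout are not specified here.

```python
from ftplib import FTP
from pathlib import Path


def fetch_image(ftp, remote_name, dest_dir):
    """Download one image over an already-connected FTP session.

    `ftp` is an ftplib.FTP instance (or any object with a compatible
    `retrbinary` method); `remote_name` is the image's name on the device.
    """
    dest = Path(dest_dir) / Path(remote_name).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    with open(dest, "wb") as fh:
        # RETR streams the remote file; each received chunk is written to fh.
        ftp.retrbinary(f"RETR {remote_name}", fh.write)
    return dest


def fetch_from_device(host, remote_name, dest_dir, user="anonymous", passwd=""):
    """Connect to a device on the LAN and download a single image.

    The anonymous login is an assumption for this sketch.
    """
    with FTP(host, timeout=10) as ftp:
        ftp.login(user=user, passwd=passwd)
        return fetch_image(ftp, remote_name, dest_dir)
```

A call might look like `fetch_from_device("192.168.1.34", "example.jpg", "collected/")`, where both the address and the file name are placeholders.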
When the end-user engages in curating, a.k.a. labeling, the images into their respective distinct classes (see the next section), another service is invoked (
The review service periodically collects new events stored by the device in the Cloudant noSQL repository (e.g.
Each new event includes the image identifier; the service then accesses the device via FTP to collect and collate the image.
When the process is complete, the count of images in each class is updated in Cloudant (e.g.
rough-fog/review/all), along with the sequence number of the last event processed.
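The bookkeeping step above can be sketched as a pure function that folds newly collected events into a summary of per-class counts and the last processed sequence number. The field names (`seq`, `class`, `classes`, `seqid`) and the use of integer sequence numbers are assumptions for illustration; the actual Cloudant document layout is not shown here, and writing the result back to Cloudant is left out.

```python
from collections import Counter


def update_summary(summary, events):
    """Return a new summary with per-class counts and the last sequence number.

    `summary` is the previously stored summary (may be empty); `events` is an
    iterable of dicts with hypothetical keys "seq" (int) and "class" (str).
    """
    counts = Counter(summary.get("classes", {}))
    last_seq = summary.get("seqid", 0)
    for event in sorted(events, key=lambda e: e["seq"]):
        if event["seq"] <= last_seq:
            continue  # already counted in an earlier pass
        counts[event["class"]] += 1
        last_seq = event["seq"]
    return {"classes": dict(counts), "seqid": last_seq}
```

Skipping events at or below the stored sequence number keeps the update idempotent, so the periodic service can safely re-read overlapping batches.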
Below is the user interface for labeling images. Options are available as buttons (e.g. person, kitchen, dog, ...) based on previously assigned labels; new labels can be added in the text entry box. The image's initial classification and capture date are also shown.
Ideally, an image is given an entity label (e.g. person) if and only if it contains that entity and none of the other entities of interest (e.g. dog or cat). The training set also requires negative examples, which do not include any entity (i.e. person, dog or cat). To achieve this distinction, each camera installation is pre-assigned a corresponding location label (e.g. "kitchen") that is used to identify the negative examples. Other locations may be classified in the same way (e.g. bathroom, dining room, living room, ...).
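To make the labeling rule concrete, here is a minimal sketch that partitions labeled images into positive examples per entity and a pool of negative examples identified by location label. The function name, the `(image, label)` pair representation, and the example label sets are hypothetical; only the rule itself comes from the description above.

```python
from collections import defaultdict


def split_examples(labeled, entities, location_labels):
    """Partition (image, label) pairs into positives per entity and negatives.

    An image labeled with an entity (e.g. "person") is a positive example for
    that entity; an image labeled with a location (e.g. "kitchen") contains
    no entity and is a negative example. Any other label is left out.
    """
    positives = defaultdict(list)
    negatives = []
    for image, label in labeled:
        if label in entities:
            positives[label].append(image)
        elif label in location_labels:
            negatives.append(image)
    return dict(positives), negatives


positives, negatives = split_examples(
    [("a.jpg", "person"), ("b.jpg", "kitchen"), ("c.jpg", "dog")],
    entities={"person", "dog", "cat"},
    location_labels={"kitchen"},
)
```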
Labeled images are collated into a separate directory structure for their new classes, and symbolic links are used as a state maintenance indicator (i.e. collected, labeled). Once images are labeled they are deemed ready for training; additional curation of the labeled images is performed in the Training phase.
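One way symbolic links could serve as a state indicator is sketched below. It assumes, purely for illustration, that labeling moves the image into a per-class directory and leaves a symbolic link at the original location, so that a regular file means "collected" and a link means "labeled". The directory names and both function names are hypothetical, and the project's actual layout may differ.

```python
import shutil
from pathlib import Path


def label_image(image_path, label, labeled_root):
    """Move an image into its class directory and leave a symlink behind."""
    src = Path(image_path)
    dest_dir = Path(labeled_root) / label
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.move(str(src), str(dest))
    # The link at the original location marks the image as labeled.
    src.symlink_to(dest.resolve())
    return dest


def image_state(image_path):
    """Report "labeled", "collected", or "missing" for a collected-image path."""
    path = Path(image_path)
    if path.is_symlink():  # check first: is_file() follows symlinks
        return "labeled"
    if path.is_file():
        return "collected"
    return "missing"
```

With this scheme the collection directory stays a complete index of every image seen, while the class directories hold exactly the files ready for training.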