Video surveillance is ubiquitous in today's society. Office
buildings, schools and even busy intersections have numerous video
cameras rolling at all times covering numerous scenes sometimes from
different angles. Video surveillance has proved to be very effective
in catching criminals after the crime (e.g. convenience store or bank
robberies). However, due to the vast amount of surveillance data
accumulated each day and the fact that it is usually monitored by very
few "human eyes" (relative to the number of cameras) if at all, it
becomes almost impossible to detect and respond to an abnormal event
as it is happening.
Footage is also frequently analyzed without knowledge of when or where
an event occurred, or even whether one occurred at all. In this kind of
analysis the analyst is often interested in "something that deviates
from the norm."
Without the appropriate tools this can be a daunting task consisting
of sequentially viewing all raw video data and using human judgment to
determine if an event is peculiar and/or requires action.
This paper proposes a tool to aid in this process. Using user-defined
events (both suspicious and normal) this system can determine if a new
video sequence contains any events that might be deemed suspicious and
require further attention from a human user. This should reduce the
user's job to determining whether machine-flagged segments indeed
require action and, if so, taking that action. Use of this tool would
greatly reduce the time spent browsing through raw footage and thus
increase the analyst's efficiency.
The system proposed in this paper uses a combination of event
information (general behavior) and object information (specific actors
and entities) to offer a robust description of a video sequence. Video
sequences are
broken up into key frames. From the frames we extract low-level
features. We use these features to detect objects in the scene as well
as to represent the scene as a whole (event detection). Each event is
represented as a collection of normalized gradient histograms in the
x, y and t dimensions over several different temporal scales. This
representation is compared with events previously defined by the user
by means of a histogram comparison function in order to classify the
new event.
This classification, along with the objects detected within the scene,
is used to compile a video data ontology language description of the
event within the video sequence. Comparison of these event
descriptions with the dynamically defined description of suspicious
activity will allow the system to annotate a new video sequence
appropriately.
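
To make the event representation and comparison steps concrete, the
following is a minimal sketch, not the implementation used in this
paper. It assumes a grayscale clip stored as a NumPy array of shape
(T, H, W) with intensities in [0, 1]. The function names, the bin
count, the use of frame subsampling to obtain coarser temporal scales,
and the choice of chi-square distance as the histogram comparison
function are all illustrative assumptions.

```python
import numpy as np


def gradient_histograms(clip, scales=(1, 2, 4), bins=16):
    """Describe a clip by normalized histograms of its x, y and t gradients.

    clip: float array of shape (T, H, W), intensities in [0, 1].
    Returns an array of shape (3 * len(scales), bins).
    """
    hists = []
    for step in scales:
        # Assumption: a coarser temporal scale is approximated by
        # keeping every `step`-th frame.
        sub = clip[::step]
        # np.gradient returns one array per axis, in axis order (t, y, x).
        gt, gy, gx = np.gradient(sub)
        for grad in (gx, gy, gt):
            # Gradients of [0, 1] data always lie in [-1, 1].
            counts, _ = np.histogram(grad, bins=bins, range=(-1.0, 1.0))
            hists.append(counts / max(counts.sum(), 1))
    return np.stack(hists)


def chi_square(a, b, eps=1e-12):
    """Chi-square distance between two descriptors (assumed comparison)."""
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))


def classify(clip, labeled_events):
    """Return the label of the stored event closest to the clip.

    labeled_events: dict mapping an event label to its descriptor.
    """
    query = gradient_histograms(clip)
    return min(labeled_events,
               key=lambda label: chi_square(query, labeled_events[label]))
```

In this sketch the user-defined events are held in a dictionary that
maps each label to a stored descriptor, and a new sequence receives the
label of its nearest stored event. A practical system would also need a
rejection threshold so that a sequence far from every stored event is
not forced into an existing class.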
Bridging the semantic gap between machine-understandable low-level
features of video data and the high-level semantic events taking place
in that video is the central undertaking of this paper. Defining an
event representation schema and an event comparison function that
enables similar events to be assigned the same class, as well as
building a database of manually defined events for classification, are
major steps towards a solution to this complex problem.