Computer Vision is all about objects. The optimum would be to see the complete outline and content of a single object to be identified against a contrasting, plain background.
However, that’s not the real world. We need to achieve a high level of confidence where the object is partially obscured, at an angle, badly lit, among other similar objects, at a distance, etc. – this is the magic provided by inference based on good models that are, in turn, based on lots of sample images of the desired object.
Taking all that into account, when it comes to capturing the video image, affordable video camera technology is limited in its functionality and we need to understand both its capabilities and its limitations.
The human eye usually works in pairs (not mine, but that’s not the focus of this blog…) and your brain works with your eyes to rapidly scan the field of view and direct the view to concentrate on specific objects.
The stereo view also gives a distance cue that is improved by movement of the head. This is really useful when distinguishing between small objects that are close by and similar, larger objects that are further away.
By contrast, video cameras provide a fixed view of objects as they pass through the field of view. You could have stereo cameras with variable zoom that pan and tilt to simulate a human head with eyes and neck muscles, but that is expensive.
So, let’s think about pixels. With ML hardware there is a trade-off between resolution and performance: a higher resolution means more objects can be resolved, but there is more visual information to process. Camera resolution is measured in pixels and modern vision sensors can deliver a lot of them. It is easy to find 5 Megapixel cameras, meaning they can deliver a native resolution of 2560*1920.
This is a problem: handling this many pixels per image requires a lot of computation, and typical edge computing hardware will limit the frames per second (fps) that can be processed, usually to low single figures. You will also find that low-cost 5 Megapixel cameras often cannot deliver the maximum resolution at a reasonable frame rate, so expect 5fps, not 30.
So, choose your resolution carefully. If the objects to be detected are stationary, then the frame rate can be very low. If the intent is to recognise a moving object, for example a face in a crowd or a vehicle on a motorway, then the resolution should be lower to allow a higher frame rate.
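To put rough numbers on that trade-off, here is a minimal sketch that compares the pixel throughput of two camera settings. It assumes that processing cost scales roughly in line with the number of pixels per second, which is a simplification (many models resize their input before inference), and the 1280*720 at 25fps setting is simply an assumed lower-resolution alternative.

```python
def pixel_rate(width_px, height_px, fps):
    """Pixels the ML hardware would have to handle each second."""
    return width_px * height_px * fps


# 5 Megapixel stream at the 5fps a low-cost camera might manage
full_res = pixel_rate(2560, 1920, 5)

# Assumed alternative: a lower resolution at a higher frame rate
reduced = pixel_rate(1280, 720, 25)

# How many times more pixels there are in each full-resolution frame
frame_ratio = (2560 * 1920) / (1280 * 720)

print(f"2560*1920 at 5fps : {full_res / 1e6:.1f} Mpixels/s")
print(f"1280*720 at 25fps : {reduced / 1e6:.1f} Mpixels/s")
print(f"Pixels per frame  : {frame_ratio:.1f}x more at full resolution")
```

The two settings move a similar number of pixels per second, so the choice is really about whether you spend that budget on detail or on frame rate.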
This brings us to the question of resolution. How many pixels define an object? One way of looking at this is to specify a camera that will fit the vision ML requirements – this comes down to:
- The width of the scene. As a recommendation, for an object to be detected with a high degree of certainty it should occupy 10% of the field of view.
- The distance between camera and object
- Size of the object, and the number of pixels needed across it:
  - a human head, say 50 pixels
  - a UK vehicle registration plate, 75 pixels across the plate – see below for more notes on this specific use case
- The focal length of the camera lens – effectively the magnification of the image
- The horizontal resolution of the camera.
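The factors above are tied together by some simple geometry, which is roughly what an online lens calculator does for you. The sketch below uses the pinhole approximation (scene width = distance * sensor width / focal length). The sensor width is not one of the factors listed above: it is an assumed value of 5.2mm, about right for a small CCTV sensor, so check your camera's datasheet. The function names are made up for illustration.

```python
def scene_width_m(distance_m, focal_length_mm, sensor_width_mm):
    """Width of the scene covered at a given distance (pinhole approximation)."""
    return distance_m * sensor_width_mm / focal_length_mm


def pixels_across_object(object_width_m, scene_width, horizontal_px):
    """Number of horizontal pixels that land on an object of the given width."""
    return object_width_m * horizontal_px / scene_width


def focal_length_for_target(distance_m, object_width_m, target_px,
                            horizontal_px, sensor_width_mm):
    """Focal length (mm) that puts target_px pixels across the object."""
    # Widest scene that still gives the object its target pixel count
    widest_scene_m = object_width_m * horizontal_px / target_px
    return distance_m * sensor_width_mm / widest_scene_m


SENSOR_WIDTH_MM = 5.2  # assumption, not taken from any specific camera

width = scene_width_m(2.0, 2.8, SENSOR_WIDTH_MM)
head_px = pixels_across_object(0.15, width, 1280)
focal = focal_length_for_target(2.0, 0.15, 50, 1280, SENSOR_WIDTH_MM)

print(f"Scene width at 2m with a 2.8mm lens: {width:.2f}m")
print(f"Pixels across a 150mm head: {head_px:.0f}")
print(f"Focal length for 50 pixels per head at 2m: {focal:.2f}mm")
```

Real lenses add distortion, particularly at wide angles, so treat the results as a starting point rather than a guarantee.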
For example, most cameras, even cheap ones, can manage 25fps at 1280*720 and this will give good results in most cases. Here is an example based on a typical vision ML requirement with a ‘standard’ modern CCTV camera. The numbers below are entered into an online ‘lens calculator’ (a browser search will find one). You can typically change the width, distance, pixel density, focal length and resolution of the camera to best suit your purpose.
So, let’s say we want to count the number of individuals wearing the correct Personal Protective Equipment (PPE). This used to mean a hard hat and a high vis jacket; now it’s facemasks. PPE recognition depends first on discriminating individual heads, then on using a second model to check whether PPE is being worn. So, we need to begin by detecting human heads – please note we do not actually identify people!
- Width of scene = 3.7m
- Distance between camera and object = 2m
- Assuming the object is a human head 150mm wide, 50 pixels per head works out at about 333 pixels per metre, so the desired resolution is rounded up to 350 pixels per metre
- Focal length of the camera lens = 2.8mm
- Horizontal resolution of the camera = 1280
This would be a good start and if the camera needs to be located further away, we can select a different focal length lens.
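As a quick sanity check on those numbers, the sketch below works out the pixel density the camera actually delivers across the scene and how many pixels fall on a head. It also estimates the focal length needed to keep the same scene width from further away, assuming focal length scales in proportion to distance (a pinhole approximation). The 4m distance is just an illustrative value.

```python
SCENE_WIDTH_M = 3.7
DISTANCE_M = 2.0
HEAD_WIDTH_M = 0.15
FOCAL_LENGTH_MM = 2.8
HORIZONTAL_PX = 1280

# Pixel density delivered across the scene, in pixels per metre
density = HORIZONTAL_PX / SCENE_WIDTH_M

# Pixels that land on a single head at that density
head_px = density * HEAD_WIDTH_M


def focal_length_at(new_distance_m):
    """Focal length (mm) that keeps the same scene width at a new distance."""
    return FOCAL_LENGTH_MM * new_distance_m / DISTANCE_M


print(f"Pixel density: {density:.0f} pixels per metre")
print(f"Pixels per head: {head_px:.0f}")
print(f"Lens for the same scene width at 4m: {focal_length_at(4.0):.1f}mm")
```

The delivered density comes out close to the 350 pixels per metre target, with just over 50 pixels on each head.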
Finally, in certain special cases there are other ways of improving confidence in the recognition of an object.
For example, it is possible to stack successive images to improve identification of simple objects, such as a rectangular vehicle registration plate. The plate can be identified in multiple images (think moving vehicle) that show it at various angles and sizes, and each view transformed to a regular rectangle of fixed size; this can improve recognition of the characters on the plate. Identification is helped because we know we are looking for specific character combinations.
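To show why stacking helps, here is a minimal sketch using only the Python standard library. It assumes the plate has already been found in each frame and transformed to the same fixed-size rectangle, so the frames line up pixel for pixel; that transformation step is not shown. The ‘plate’ is a synthetic pattern of bars and the noise level is an arbitrary choice, not measured camera data.

```python
import random


def stack_frames(frames):
    """Average equally sized greyscale frames pixel by pixel."""
    count = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    return [[sum(frame[y][x] for frame in frames) / count
             for x in range(width)]
            for y in range(height)]


def mean_abs_error(image, reference):
    """Average absolute difference between an image and a reference."""
    diffs = [abs(a - b)
             for image_row, ref_row in zip(image, reference)
             for a, b in zip(image_row, ref_row)]
    return sum(diffs) / len(diffs)


def noisy_copy(image, sigma=40.0):
    """Return the image with Gaussian noise added (values are not clipped)."""
    return [[pixel + random.gauss(0.0, sigma) for pixel in row]
            for row in image]


random.seed(0)  # fixed seed so the run is repeatable

# Synthetic 'plate': dark vertical bars on a light background, 32 x 8 pixels
clean = [[30.0 if (x // 4) % 2 else 220.0 for x in range(32)]
         for _ in range(8)]

frames = [noisy_copy(clean) for _ in range(16)]
stacked = stack_frames(frames)

single_error = mean_abs_error(frames[0], clean)
stacked_error = mean_abs_error(stacked, clean)

print(f"Error in a single frame: {single_error:.1f}")
print(f"Error after stacking 16 frames: {stacked_error:.1f}")
```

Averaging works here because the noise is random from frame to frame while the plate is not. With real footage, the quality of the alignment matters at least as much as the number of frames.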