We have received a lot of feedback from our users since the launch of BeRoads in October. A big thanks to everybody! There was a common pattern in all these mails, posts and tweets: users were complaining about unavailable webcam feeds.
Okay, so lots of webcam feeds are broken. What should we do about it? What about detecting broken feeds and tagging them as unavailable?
Our webcam downloader is a Python job that runs in 3 steps:
- scrape webcam links from providers
- download webcam feeds
- update the MySQL database
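We spotted 3 different kinds of broken feeds: feeds that are no longer updated, feeds that return a 404 placeholder image, and feeds that are completely blacked out. Each one gets its own check.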
Checking last update with requests
The HTTP protocol provides a way to tell when a resource was last updated: the 'last-modified' header. It's pretty straightforward to access it with requests:
    import requests

    response = requests.get('http://foo.bar/img.jpg')
    # Header lookups in requests are case-insensitive
    if 'last-modified' in response.headers:
        print(response.headers['last-modified'])
The last-modified value is a date expressed in GMT (e.g. 'Tue, 15 Nov 1994 12:45:26 GMT').
The simplest way to check if a resource has been modified in the last <insert time span here> is to use timestamps. To convert a last-modified value to a timestamp, you will have to use the calendar and datetime modules:
    import calendar
    import datetime

    # Parse the HTTP date, then convert it to a Unix timestamp (seconds since the epoch, UTC)
    timestamp = calendar.timegm(datetime.datetime.strptime(
        response.headers['last-modified'],
        '%a, %d %b %Y %H:%M:%S %Z'
    ).utctimetuple())
We use this technique to check if the webcam feed has been updated since the last time we downloaded it.
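The comparison itself could look like the sketch below. It assumes the time of the previous download is kept as a Unix timestamp; the function name `updated_since` and its parameters are hypothetical and only there for illustration, not taken from our downloader.

```python
import calendar
import datetime

# Format of an HTTP date such as 'Tue, 15 Nov 1994 12:45:26 GMT'
HTTP_DATE_FORMAT = '%a, %d %b %Y %H:%M:%S %Z'


def updated_since(last_modified, last_download_ts):
    """Return True if the last-modified value is newer than last_download_ts.

    last_modified: raw header value, e.g. response.headers['last-modified']
    last_download_ts: Unix timestamp of the previous download (assumed to be stored by the job)
    """
    parsed = datetime.datetime.strptime(last_modified, HTTP_DATE_FORMAT)
    feed_ts = calendar.timegm(parsed.utctimetuple())
    return feed_ts > last_download_ts
```

A feed whose header is missing would need its own rule (for instance, always download it); that choice is left open here.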
Compare feed with 404 images
There are two kinds of 404 images that our providers return when a feed is not available: this one and this one. We compare our images against these 404 images with a similarity function and write the image to disk only if the similarity is lower than a predefined ratio.
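    import requests

    def should_write_to_disk(ratio):
        response = requests.get('http://foo.bar/img.jpg')
        # Read the reference 404 image as raw bytes
        with open('404.jpg', 'rb') as f1:
            c1 = f1.read()
        # Fraction of bytes identical to the 404 image, position by position
        similarity = float(sum(a == b for a, b in zip(c1, response.content))) / len(c1)
        if similarity > ratio:
            return False  # we don't write on disk
        else:
            return True  # we write on disk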
The ratio value depends on the 404 image that we compare against.
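Since each 404 image has its own ratio, one way to organise the check is to pair every reference image with its threshold. This is only a sketch: `byte_similarity`, `is_404_image` and the `references` list are hypothetical names, and the byte-by-byte comparison is the same naive one described above.

```python
def byte_similarity(reference, candidate):
    """Fraction of positions in reference where candidate holds the same byte."""
    if not reference:
        return 0.0
    matches = sum(a == b for a, b in zip(reference, candidate))
    return matches / len(reference)


def is_404_image(candidate, references):
    """Return True if candidate looks like any known 404 image.

    references: list of (reference_bytes, ratio) pairs, one per 404 image (assumed layout).
    """
    return any(byte_similarity(ref, candidate) > ratio for ref, ratio in references)
```

With `references = [(b'abcd', 0.5)]`, the candidate `b'abcf'` matches 3 bytes out of 4 (similarity 0.75) and is treated as a 404 image, so it would not be written to disk.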
Checking if an image is completely blacked out
To verify that our webcam feed is not like this one, we use the Python Imaging Library (PIL).
We simply load the RGB channels, go through every pixel and count the black ones. We then check the share of black pixels against a predefined ratio.
    from PIL import Image

    def is_available(ratio):
        im = Image.open('webcam.jpg')
        R, G, B = im.convert('RGB').split()
        r = R.load()
        g = G.load()
        b = B.load()
        w, h = im.size
        pixels = 0
        for i in range(w):
            for j in range(h):
                # A pixel counts as black when all three channels are close to zero
                if r[i, j] < 10 and g[i, j] < 10 and b[i, j] < 10:
                    pixels += 1
        # Share of black pixels in the whole image
        black = float(pixels) / (w * h)
        if black > ratio:
            return False  # we tag it as unavailable
        else:
            return True  # we tag it as available
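To make the counting rule easy to try without an image file, here is the same idea as a small standalone function. It is a sketch under assumptions: `black_ratio` is a hypothetical name, and it takes any iterable of (r, g, b) tuples, however they are obtained from the image.

```python
def black_ratio(pixels, threshold=10):
    """Share of pixels whose red, green and blue values are all below threshold."""
    total = 0
    black = 0
    for r, g, b in pixels:
        total += 1
        if r < threshold and g < threshold and b < threshold:
            black += 1
    # An empty image has no black pixels by convention
    return black / total if total else 0.0
```

For the four pixels `[(0, 0, 0), (255, 255, 255), (5, 5, 5), (9, 200, 9)]`, two are black, so the function gives 0.5; the last pixel is not counted because its green channel is far above the threshold.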
And that's it! The API now provides an 'enabled' attribute for webcams, and we even created a use case for it: the Dead webcams everywhere project :)