How to add privacy-friendly analytics to your Django website
We now have a website running, and we want to measure its success. Today I’ll show you how I added web analytics to promozilla, my Django side project, while respecting my users’ privacy.
What are web analytics?
Web analytics provide a way to understand how our website is being used and who its users are. They collect metrics like referrers (where users clicked to visit the site), search engine keywords, landing pages, countries of origin, preferred languages, and device information like operating system, device type or screen size.
The good thing is that they provide analysis on top of the metrics, so we can understand general trends and our users’ preferences to make decisions. For example, if the bounce rate is much higher on mobile than on desktop, maybe our website is not mobile friendly. Or if our second biggest country is France, localizing to French might be a good idea.
The biggest contender in this space is without doubt Google Analytics, but it comes with a hefty downside: our users’ privacy. Since it stores cookies on our end users’ computers, it is able to track them across sessions, and it also stores information that I think is not really needed for our case.
This is the third post in a series about my side project, promozilla. Promozilla is a Nintendo Switch promotion tracker built with Django. In the previous post, I showed how to add monitoring to our Django website. Today I will focus on the bottom left corner of its architecture: privacy-friendly, cookieless analytics.
Enter cookieless analytics. Cookieless analytics are perfect for a Django website: they let you track your site’s popularity without infringing on your users’ privacy, and they are easy to set up. How easy? It only takes a single line of code, and you do not even have to add those cookie consent banners!
There are many solutions in this space, like Matomo or Fathom Analytics, but I decided to settle on Data Centurion because it provides a nice free plan for our side project. The free plan comes with unlimited websites but only 1000 page views per month. The paid plans start at 2.99€ per month, well under Matomo’s 29€ and Fathom Analytics’ 14€ plans. This allows your site to grow without having to pay a big subscription upfront. Plus, it’s a new contender in this space, and who doesn’t like an underdog?
How to add cookieless analytics to our Django website?
Go to https://datacenturion.io/websites/new and fill in the required information. I prefer to set the ignored IP addresses later. Don’t forget to tick the “Notifications” checkbox if you want to receive those tasty statistics. Lastly, hang on to the script at the end of the page – it will be useful later.
3) Adding cookieless analytics to our Django website – base template
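As a sketch of this step: the script you saved earlier goes into your base template, so every page that extends it is tracked automatically. The `src` below is a placeholder, not the real URL – paste the exact snippet Data Centurion gave you.

```html
{# templates/base.html – sketch; replace the placeholder script with the one
   Data Centurion gave you #}
<!DOCTYPE html>
<html>
  <head>
    <title>{% block title %}promozilla{% endblock %}</title>
    {# the promised single line of code: #}
    <script async src="https://datacenturion.io/placeholder.js"></script>
  </head>
  <body>
    {% block content %}{% endblock %}
  </body>
</html>
```

Since all other templates extend this one, no further changes are needed.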
4) Ignore IP addresses
One last thing: we do not want to skew our analytics and artificially inflate our stats – unless your self-esteem requires it. Otherwise, you can simply add your IP address to a blocklist, so your own activity will not count towards your site usage. Super easy:
Go to your website settings page in Data Centurion
Copy your IP address to the “blocked IPs” text field and you are good to go!
Data Centurion also has event tracking as a feature, but unfortunately it is well hidden – there is no documentation about it. You can track specific actions on your site, like new accounts or sales, and then run analytics on top of them. You can see more information about it in this Medium post.
Web analytics are an important tool to track our Django website’s usage and measure its success. Cookieless analytics provide a way to achieve that without infringing on our users’ privacy. Lastly, they are very easy to add to our site, and some providers, like Data Centurion, offer free plans, so there’s really no excuse not to add them.
This is the second post in my Promozilla series. Today I will discuss how I added monitoring to Promozilla, a Nintendo Switch promotion tracking website built with Django. In the first post, I described the general architecture.
Monitoring our Django site answers a very simple need: if we want our Django website to be used by the world, it is nice for it to be running in the first place.
Monitoring enables us to do that, and preferably in a proactive way. Of course we can ssh in every day and read the logs to check for any issues, but that’s far from perfect.
First, we are not alerted when a problem arises, which means the site can be down for hours or days before we realise it. Second, reading logs is not really clarifying, since they may contain too much noise or miss some information.
So there has to be a better way! And there is!
Django’s ecosystem makes it very easy to monitor our website. With some dependencies installed and a little bit of configuration, we can be monitoring our django website in no time.
I designed promozilla’s monitoring to solve both of those issues. Bugsnag alerts me via email if any issue arises, and Grafana/Prometheus give me a global picture of the different components and how they evolve over time (any degradation while I was away?).
Lastly, take into consideration that if you host your monitoring stack in the same place as the rest of the infrastructure and both go down, you are out of luck.
Error monitoring: Bugsnag for Django
Bugsnag, at its core, is a dashboard for exceptions. One problem in my previous projects was that the only way for me to know if the sites were up was to either a) visit them, b) have someone complain, or c) read the logs. That is no way to sleep soundly at night. Fortunately, Bugsnag is a good peace-of-mind creator.
It provides a very simple Django middleware that can be integrated with your Django project. Every time an unhandled exception occurs, you receive an email with its stack trace and some very nice details. This can be tuned, of course, but the value it brings out of the box with its free plan is tremendous. One strong point: since it is hosted outside your server, if your site goes down you’ll still be able to see how it went down.
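The integration boils down to a few lines in `settings.py`. This sketch follows Bugsnag’s documented Django setup; the environment variable and project root are placeholders for your own values.

```python
# settings.py – Bugsnag's Django integration (sketch)
import os

BUGSNAG = {
    "api_key": os.environ["BUGSNAG_API_KEY"],  # placeholder: your project key
    "project_root": "/app",
}

MIDDLEWARE = [
    # Bugsnag's docs ask for this middleware first, so it can catch
    # unhandled exceptions raised by everything below it.
    "bugsnag.django.middleware.BugsnagMiddleware",
    # ... the rest of your middleware ...
]
```

After that, any unhandled exception in a view is reported automatically.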
Grafana is a powerful monitoring tool. It contains tracing, logging and dashboarding functionalities. To keep things simple, I decided to just use Grafana’s dashboards with Django.
The dashboards Grafana provides are ideal to understand not only business metrics (most popular pages on the site / who is referring us), but also application metrics (number of errors, database connections, average response time, etc.).
Below I’m showing the dashboard I built for promozilla. The first screenshot contains business metrics: visitors over time, number of new accounts and referrers. The second screenshot has service-quality metrics: response time, error rates and requests served.
Here I am also showing metrics retrieved from traefik, my reverse proxy, which has Prometheus support out of the box.
The important thing is: Grafana is not a data storage solution. We need a service responsible for collecting and storing our system metrics. For that, I added Prometheus to the mix.
Prometheus with Django out of the box
Prometheus is the perfect storage solution to integrate with Grafana. At its core, it is a time-series database designed for storing and querying metrics.
The way it works is tremendously simple: services like Django expose an endpoint that Prometheus periodically visits to collect the data. This is called a pull model, where Prometheus visits the application to collect the metrics, as opposed to a push model, where the applications send the metrics to the database. This makes everything simpler (e.g. Prometheus can be offline and the applications are not affected).
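To make the pull model concrete, here is a minimal sketch using only the Python standard library: a toy app exposes a `/metrics` endpoint in the Prometheus text exposition format, and a "scrape" is just an HTTP GET against it. (In a real Django project, a library like django-prometheus does this for you; the metric name here is made up.)

```python
# Minimal sketch of Prometheus's pull model: the app exposes /metrics,
# and Prometheus periodically scrapes it over plain HTTP.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_SERVED = 0  # a "counter" metric: it only ever goes up


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUESTS_SERVED
        if self.path == "/metrics":
            # Prometheus text exposition format
            body = (
                "# HELP app_requests_total Requests served by the app.\n"
                "# TYPE app_requests_total counter\n"
                f"app_requests_total {REQUESTS_SERVED}\n"
            ).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:  # any other path counts as a page view
            REQUESTS_SERVED += 1
            self.send_response(200)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

urllib.request.urlopen(f"http://127.0.0.1:{port}/")  # one visitor hits the app
scrape = urllib.request.urlopen(f"http://127.0.0.1:{port}/metrics").read().decode()
print(scrape)  # this is exactly what Prometheus would ingest on each scrape
server.shutdown()
```

Note how the application never talks to Prometheus directly: if Prometheus is down, the app keeps serving traffic and the counter keeps counting.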
Also, the reverse proxy that I use, traefik, has Prometheus support out of the box, which enables some of the plots you saw in the previous screenshots. This is the great thing about Prometheus: because it is so ubiquitous, you do not need to reinvent the wheel to add monitoring to your applications.
Lastly, Prometheus has a particular data model and query syntax compared to more traditional query languages like SQL. It has specific concepts like metric types (counter, gauge, histogram and summary) and dimensions (labels) that are worth getting familiar with before diving right in.
Caution: in my experience, Prometheus is quite heavy on RAM – be careful if you are running your website on the same machine.
I think my Django monitoring approach is very solid: it has alerting for when something goes wrong, with Bugsnag, and it also provides a bird’s-eye view of recent events and application changes, with dashboards provided by Grafana and Prometheus. I hope this was useful, and I welcome any feedback!
I found that most resources online are focused on building Django websites locally, or on simple use cases that are not ready for the open world. So, I decided to share the architecture of a recent project of mine, promozilla. I believe it is a good example of flexibility in system interactions without compromising ease of use.
What is promozilla.xyz?
Promozilla is a promotion tracker built with a framework I love, Django. It tracks promotions on Nintendo Switch games, consoles and accessories in the Portuguese market. Put simply: every night it scrapes the Portuguese stores, then it stores the game prices, and lastly it sends an email to the Switch owners who have a game on their wish list when it is on promotion.
I believe that this personal project ended up with an interesting architecture, suitable for a production-ready Django system, and I would love to share it with you.
How to architect a django website for the real world?
First of all, what is a system architecture? For me, the architecture of a solution (in this case, a promotion tracking website built with Django) consists of a description of its components, how they interact, and most importantly, why they were chosen.
Why this architecture for a django site?
First of all, this is a side-project, so I must find the technology interesting and/or a good learning opportunity (that’s the reason I chose grafana and prometheus, for example).
Then, its components should be like legos, meaning:
Adding or removing components does not break unrelated stuff
It is easy to deploy (they play well with docker containers, for example)
They play nicely together (django and postgres rather than django and mongodb, for example). This means that I can spend my time adding new features and not configuring stuff
They are easy to test and run locally
They are fun to work with. This is a side project after all!
How to implement this in practice? Django loves docker!
How did I do it? Easy: Docker and docker-compose! Docker provides the container technology and docker-compose the orchestration. For those who do not know what that means, I recommend watching this video from Fireship.
I ended up with two docker-compose repositories, and so two versions of this architecture: one for local development and another for the production environment. The local development repo builds the images from the development code, while the production one pulls the images from the GitLab image registry during the CI/CD cycle.
I really like docker-compose for several reasons. First, it is very easy to add or remove services. Second, the service configurations are kept separate from their secrets, and it is nice to track everything with git. Lastly, it is super simple to back up production data, since I just need to copy the services’ volumes and save them somewhere safe.
This architecture has several components, from the ones visitors interact with directly to system monitoring, web analytics, hosting and DNS, the code repository, and CI/CD flows. Don’t worry, they will all be covered.
In this post, I will only describe the components in the red rectangle, since those are the ones the visitors interact with directly. I will describe the other components in separate posts, because this one was getting too long. Let’s get started!
The star component: the django website
This is the breadwinner of the system and the reason you clicked on this article. It is the main piece of promozilla: it renders the pages the visitor clicks and handles persisting and retrieving user information (registration, login, product wish lists, etc.) and, of course, game prices. It runs with gunicorn to handle more than one visitor using the site at the same time.
It uses SendGrid to send the registration emails (more on that below), PostgreSQL as the database, and shares a disk with nginx so that nginx can serve the images and CSS.
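As a sketch of how these pieces sit together in docker-compose (image names, paths and volume names below are made up, not promozilla’s real ones):

```yaml
# docker-compose.yml (trimmed sketch)
services:
  django:
    image: registry.gitlab.com/example/promozilla:latest  # placeholder image
    command: gunicorn promozilla.wsgi:application --bind 0.0.0.0:8000
    volumes:
      - static-files:/app/static   # the disk shared with nginx
    depends_on:
      - postgres
  postgres:
    image: postgres:14
    volumes:
      - postgres-data:/var/lib/postgresql/data
  nginx:
    image: nginx:stable
    volumes:
      - static-files:/usr/share/nginx/html/static:ro  # read-only is enough

volumes:
  static-files:
  postgres-data:
```

The named `static-files` volume is the mechanism behind “shares a disk with nginx”: Django writes static assets into it, and nginx serves them.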
Django celery worker
This application has some long-running tasks that do not make sense to run in the main application. Why? They would take resources away from the main responsibility of the system: displaying promotions. They are also harder to scale separately, and there are even technical limitations to that (in the case of celery-beat, for example). Some examples of those long-running tasks are:
Notifying the users if a game on their wish list has a new promotion
Storing the scraped prices in the database
Triggering the nightly runs for scraping and email notifications
Retrieving the product thumbnails from the store page.
So those tasks are perfect for a system built for long-running asynchronous tasks that supports concurrency and distribution, or, in a word, celery!
This worker is built using the celery integration with Django. I chose it because these tasks share many of the components of the main app. This makes it much easier to share the data models and the application’s shared settings (like email and database connections, for example), and in general it leads to less code duplication, which is always a nice plus.
To send and receive work, it communicates with the rest of the workers using RabbitMQ (more on that below). Lastly, some tasks must run every day (like the scraping and notifications), so it uses celery-beat, celery’s message scheduler, as a cron job to trigger the messages.
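A celery-beat schedule is just plain Django settings. The task paths and times below are hypothetical, but the `crontab` schedule is the standard celery API:

```python
# settings.py – sketch of a nightly celery-beat schedule
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "nightly-scrape": {
        # hypothetical task path – yours will differ
        "task": "promotions.tasks.trigger_nightly_scrape",
        "schedule": crontab(hour=3, minute=0),  # every night at 03:00
    },
    "nightly-notifications": {
        "task": "promotions.tasks.notify_wishlist_promotions",
        "schedule": crontab(hour=7, minute=0),  # after the scrape has finished
    },
}
```

celery-beat then publishes these task messages to the broker on schedule, and any available worker picks them up.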
It uses SendGrid to send the notification emails (again, more on that below), PostgreSQL as the database, and shares storage with the Django application so that it can store the product thumbnails.
Store scraper celery worker
This is the heart of the operation: it contains the scraping logic to get listings of Nintendo Switch products in online Portuguese stores. This code was initially in the Django worker app, but I decided to split it off because it started having its own needs, for example, scraping libraries and proxy logic, that could have side effects on the main app (e.g. dependency versions, bloating, etc.).
The flow is very simple: the scraper receives a RabbitMQ message through celery with a Store and Product (e.g. Fnac/Games) to scrape. It visits the corresponding site, does its magic, and finishes by sending a message back with a list of the products, their prices and promotion status. This message is consumed by the Django celery worker, which stores its content in the database.
Because it is the component with the stealth technology, the scraper is also responsible for downloading the product thumbnails and storing them so nginx can later serve them.
An email provider: Sendgrid
Like most websites, email is the main method promozilla uses to communicate with its users. Email is used in the registration and profile management flows (e.g. password resets), for example. One could naively use their personal Gmail or host their own email server, but that comes with management headaches and a high risk of the emails being classified as spam. So I decided to use SendGrid here, connected via SMTP. SendGrid has a free plan good for a side project of this size.
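Connecting Django to SendGrid over SMTP is just the standard email settings. The sender address below is an assumption; note that SendGrid expects the literal username `apikey`:

```python
# settings.py – SMTP settings for SendGrid (sketch)
import os

EMAIL_HOST = "smtp.sendgrid.net"
EMAIL_PORT = 587
EMAIL_USE_TLS = True
EMAIL_HOST_USER = "apikey"  # literally the string "apikey" for SendGrid
EMAIL_HOST_PASSWORD = os.environ["SENDGRID_API_KEY"]  # keep the key out of git
DEFAULT_FROM_EMAIL = "noreply@promozilla.xyz"  # assumption: your sender address
```

With this in place, Django’s built-in `send_mail` and the auth flows’ password-reset emails go through SendGrid with no further changes.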
A scraping proxy: OpenVPN proxy
Since we are web-scraping, it is almost certain our requests will be blocked at some point. To circumvent this, I routed the requests through a proxy. I used a Privoxy-OpenVPN docker image through which the scraper worker routes its requests. This is fairly straightforward to set up; one just needs a proxy to use (there are quite a lot online). Typically the free proxies are blocked very quickly, so I recommend going the extra mile and paying for a good service.
The reverse proxy: Traefik proxy
Traefik is a reverse proxy built with micro-services in mind, with lots of features designed for a container environment. I think it is the best reverse proxy for Django for three reasons. First, it is very easy to configure: it takes a single line of configuration to add or block services from being public, or to force incoming connections to use HTTPS. Also, it is configurable from the project’s docker-compose file, meaning git tracking and one less file. But the biggest reason I chose it is its automatic SSL certificate generation.
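For illustration, that “single line per concern” style looks roughly like this as docker-compose labels on the Django service (Traefik v2 syntax; the router and resolver names are made up):

```yaml
# docker-compose.yml – labels on the django service (sketch)
labels:
  - "traefik.enable=true"                                          # expose this service
  - "traefik.http.routers.promozilla.rule=Host(`promozilla.xyz`)"  # route by hostname
  - "traefik.http.routers.promozilla.entrypoints=websecure"        # HTTPS only
  - "traefik.http.routers.promozilla.tls.certresolver=letsencrypt" # auto SSL certs
```

Services without `traefik.enable=true` simply stay private, which is the one-line blocking mentioned above.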
However, there are other alternatives for reverse proxies that I would like to present:
nginx, a reverse proxy that is also capable of serving static files, but lacks automatic SSL certificate generation out of the box
caddy, a reverse proxy that can serve static files AND automatically generate SSL certificates, but is not configurable from docker
The biggest drawback of using traefik with Django is that you need to rely on another static file provider, but I will get to that later.
The database: PostgreSQL
The relational database is the unsung hero of any architecture, because it stores its most precious resource: information. Django’s ORM was built with PostgreSQL in mind, so it was a clear choice for the database. There are other options, like SQLite or MySQL, but I do not think they are worth the hassle. The deciding factor for me, however, is that Django’s text search takes advantage of Postgres features, so I do not need to reinvent the wheel for something as boring as text search.
A piece of advice, however: some features are not enabled in Postgres by default (like the TrigramExtension).
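Enabling it is a one-off Django migration. The app name and migration number below are placeholders:

```python
# migrations/000X_enable_trigram.py – enable the pg_trgm extension (sketch)
from django.contrib.postgres.operations import TrigramExtension
from django.db import migrations


class Migration(migrations.Migration):
    # placeholder dependency – point this at your app's latest migration
    dependencies = [("catalog", "0001_initial")]

    # runs CREATE EXTENSION pg_trgm on the database
    operations = [TrigramExtension()]
```

Once the extension is enabled, `TrigramSimilarity` from `django.contrib.postgres.search` can be used to rank fuzzy matches in querysets.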
The static file server: Nginx
Neither Django nor Traefik is suitable for serving static files, so I needed a solution. What are static files in the first place? Static files are files that are not generated by the server, like images or CSS. Because of their nature, we can optimize their delivery with caching or by serving them from a server close to the website visitor. For this, I chose nginx, since I had used it before and it is quite easy to set up. However, I present three alternatives for serving static files in another post I wrote.
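A minimal nginx location for the shared static directory could look like this (the `alias` path is an assumption and must match wherever the shared volume is mounted):

```nginx
# nginx.conf (sketch) – serve the shared static volume with long-lived caching
location /static/ {
    alias /usr/share/nginx/html/static/;     # the volume shared with Django
    expires 30d;                             # static assets rarely change
    add_header Cache-Control "public, immutable";
}
```

Hashed filenames (e.g. from Django’s `ManifestStaticFilesStorage`) pair well with aggressive caching like this, since a changed file gets a new URL.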
The message broker: RabbitMQ
The workers need to exchange work, so a broker is needed. RabbitMQ provides a common intermediary for them to communicate. It is perfect for this job because it is built with resiliency and fault tolerance in mind. For example, if one of the workers goes down, the job does not disappear and is sent to another. It is also very easy to scale the workers, because new ones can start consuming the workload right away. Lastly, it is very easy to set up with docker, and celery supports it out of the box, meaning I do not need to reinvent the wheel.
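Pointing celery at RabbitMQ is a single setting; the credentials below are placeholders, and the hostname assumes a docker-compose service named `rabbitmq`:

```python
# settings.py – celery broker configuration (sketch)
CELERY_BROKER_URL = "amqp://user:password@rabbitmq:5672//"
```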
Right now I’m still cleaning up the code, but I’ll share the django, scraper and docker-compose repositories. Stay tuned!