GoAccess – Visual Web Log Analyzer (goaccess.io)
152 points by tambourine_man on Aug 1, 2021 | 19 comments



Despite the emphasis on goaccess's visual mode, keep in mind that “While the terminal output is the default output, it has the capability to generate a complete, self-contained real-time HTML report (great for analytics, monitoring and data visualization)”. It was a great replacement for my aging webalizer setup, which in turn had been replaced by Google Analytics.
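
For reference, generating that report is basically a one-liner (a minimal sketch; the paths and log format are assumptions, adjust to your setup):

    # one-off, self-contained HTML report from a combined-format access log
    goaccess /var/log/nginx/access.log --log-format=COMBINED -o /var/www/html/report.html

    # same thing, but kept up to date in the browser via the built-in WebSocket server
    goaccess /var/log/nginx/access.log --log-format=COMBINED -o /var/www/html/report.html --real-time-html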

This is for a personal site and at some point I realized I cared less about having those very detailed google metrics (which I never checked anyway) than about making my site more responsive and less invasive. I nixed google analytics and haven’t looked back; still have basic metrics thanks to goaccess.


The HTML reports are my preferred way of looking at stats, but to make them more useful it's worth taking some additional steps to filter all the garbage traffic.

What works for me is:

- Use ipset to drop all traffic from certain countries (you pick which works best for you)

- Configure fail2ban to 'automagically' drop all IPs requesting .php and wp-admin URLs for a few days (sketch below)

- Integrate Piwik/Matomo's 'referrer spam' blocklist into your list of ignored referrers.

- With a static site, use per-site logging and only log .html hits to see page views.

This approach won't work for everyone and it takes extra sysadmin & Bash scripting skills to achieve, but it works really well with my Jekyll site.
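
For reference, the fail2ban part boils down to a small filter plus a jail, roughly like this (a sketch only, assuming nginx access logs; the filter name, regex and ban time are just examples):

    # /etc/fail2ban/filter.d/nginx-noscript.conf (example name)
    [Definition]
    failregex = ^<HOST> .* "(GET|POST|HEAD) [^"]*(\.php|wp-admin|wp-login)[^"]*"
    ignoreregex =

    # /etc/fail2ban/jail.local
    [nginx-noscript]
    enabled  = true
    port     = http,https
    filter   = nginx-noscript
    logpath  = /var/log/nginx/*access.log
    maxretry = 1
    # ban time in seconds (roughly three days); newer fail2ban versions also accept "3d"
    bantime  = 259200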

I don't receive much traffic on my personal website but my stats page is public and updates hourly with a cronjob. https://www.tombrossman.com/stats/


Nice photos, thank you for piquing my interest in Jersey https://www.tom.je/

Also, pretty good advice in this post, bookmarked it.


What's quite nice with the HTML output view is that it stores stats even if the underlying log files are rotated or deleted; however, if the goaccess process ends (for example when your server needs to restart), you lose all the historical context.


I run goaccess once a day to analyze log files. There's an option to store the results in a goaccess database, so nothing is ever lost and I accumulate stats for as long as I want.

I detailed that in a (lengthy) blog post if you're interested: https://arnaudr.io/2020/08/10/goaccess-14-a-detailed-tutoria...
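
If it saves anyone a click, the gist of it is the --persist/--restore pair that landed in GoAccess 1.4 (the paths below are just placeholders):

    # daily cron job: restore the accumulated stats, add the current log, persist again
    goaccess /var/log/nginx/access.log \
        --log-format=COMBINED \
        --persist --restore \
        --db-path=/var/lib/goaccess \
        -o /var/www/html/report.html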


I love it when something I had once been looking for, but that seemed so specific I didn't know how to search for it, drops right into my lap. Thank you for this, it is exactly what I've been looking for.


I can confirm that the goaccess database works great for incrementally regenerating these HTML reports. We currently pipe all logs from all our ingress-nginx LBs every 15 minutes (and grep by virtual host) to goaccess and thus update its reports periodically. Each month, all reports get archived so we can start with a fresh report again.

Semi-live and super useful, without having to use third-party services like Google Analytics. The HTML report is self-contained (a single file) and thus can easily be shared (or just statically hosted).
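
The cron side of it is nothing fancy; it looks roughly like this (a sketch: the kubectl selector, vhost and paths are made up, and it assumes a combined-style log format):

    # every 15 minutes: pull recent LB logs, keep one vhost, fold them into the persisted report
    kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --since=15m \
      | grep 'example.com' \
      | goaccess - --log-format=COMBINED \
          --persist --restore --db-path=/var/lib/goaccess/example.com \
          -o /var/www/stats/example.com.html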


Sadly, I imagine most of the traffic to a server would be bots or bad actors scanning for common vulns. I used AWStats (another log file analyzer) for many years and had to slice roughly 70% of my traffic away because most of it was automated.

Most bots were courteous enough to state they were bots, typically using a user agent with `-bot` somewhere in the string. Some used generic browser user agents but were scanning for things like `wp-admin` etc.

Most of the genuine human traffic was from people on mobile phones, and that was the only heuristic I looked at to determine how many people visited my site. Very few desktop users were present.
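
A quick-and-dirty way to get that kind of number straight out of a log (just a sketch; it only catches user agents that mention 'bot', so it misses the sneaky ones):

    # rough share of requests whose log line mentions "bot" (usually in the user agent)
    total=$(wc -l < access.log)
    bots=$(grep -ci 'bot' access.log)
    echo "bot share: $(( 100 * bots / total ))%"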


The --ignore-crawlers flag in GoAccess filters out a decent amount of bot traffic. It's not perfect, but it's good enough for a rough estimate of 'real' traffic.
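
E.g. (a sketch with placeholder paths):

    goaccess /var/log/nginx/access.log --log-format=COMBINED --ignore-crawlers -o report.html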


Here's a tricky one. Common Crawl runs its bots from AWS. AWS has like a jillion IP addresses. How do you tell which traffic is from legit Common Crawl bots and which is from imposters?


I mean, they’re all bots one way or the other. The only exception would be a personal VPN running off of AWS, but that’s a bad idea given how many sites block that range.


You could add $ssl_cipher to your log_format configuration (if nginx), and use that as a TLS fingerprint to find more bots.
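
Something along these lines, if anyone wants to try it (a sketch; the format name and field placement are arbitrary, and you'd need a matching custom --log-format on the GoAccess side since COMBINED won't know about the extra field):

    # nginx: combined log format with the negotiated TLS cipher appended
    log_format tls_combined '$remote_addr - $remote_user [$time_local] '
                            '"$request" $status $body_bytes_sent '
                            '"$http_referer" "$http_user_agent" "$ssl_cipher"';
    access_log /var/log/nginx/access.log tls_combined;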


Indeed - about 39% of hits to my site come from crawlers.

How do I know? Thanks to GoAccess!! ;)


If anyone is wondering, this is not a Go project. It's written in C.

Awesome work nonetheless.


Worth mentioning that it comes with an embedded WebSocket server, also available as a standalone classic Unix-style server (write programs that do one thing and do it well):

"Very simple, just redirect the output from your application (stdout) to a file (named pipe) and let gwsocket transfer the data to the browser — That's it."

[1] https://gwsocket.io/
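
A minimal sketch of that pattern, going by the gwsocket docs (the pipe path is arbitrary; double-check the option names against the man page of your version):

    # create a named pipe and have gwsocket read from it
    mkfifo /tmp/wspipe
    gwsocket --pipein=/tmp/wspipe &

    # anything written to the pipe gets pushed to connected browsers over WebSocket
    tail -f /var/log/nginx/access.log > /tmp/wspipe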


Great work. I guess I'm spoiled already by tools like DataDog, because I was clicking on the dashboards and expected to be able to go to the logs that actually generated them. For example, if I see a huge spike in the Requests dashboard at 3am, I would like to be able to jump to the logs that generated that spike just by clicking on it. Does GoAccess provide access to the logs themselves?


I like GoAccess because the reports work well with lots of vhosts in the same logs, which many similar tools don't: it lets you see which sites are busy, which are consuming resources, which have unusual patterns, and so on.
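
For that to work, the virtual host has to be in the log line itself; with an Apache-style vhost_combined log you can use the predefined VCOMBINED format and GoAccess adds a Virtual Hosts panel (a sketch, paths assumed):

    # leading virtual host field + combined format
    goaccess /var/log/access.log --log-format=VCOMBINED -o report.html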


So it won't do my syslog.

I like that it'll run in a terminal. But it would be more useful to me if it had pluggable backends for arbitrary log formats.


This looks great.

Anyone here know of a way to tail a log file that's exposed over HTTP?



