7 Log File Analysis SEO Checks Using ELK Stack – (Using Free Open Source Software)


This post is part 2 of ‘Small Scale SEO Log File Analysis’. Part one focused on installing ELK Stack on an Intel NUC; this post looks at what to do once ELK Stack is installed.

I’ll outline several checks I like to perform when analysing log files and how I set up visualisations in ELK Stack to help with this (the images are from a live analysis and I have permission to use them, however I agreed to blank-out anything that may indicate the site that was analysed).

If you’re an expert at either ELK Stack or log file analysis, this post (and part 1) may be a bit basic for you. This is NOT an ‘ultimate guide to log file analysis’, not by a long shot (it’s a huge subject)! These are the sort of visualisations that I keep on a snapshot dashboard to give a quick overview of info from log files.

Hopefully you’ll find the post and some of the ‘quick wins’ interesting.

Please feel free to comment with your own log file quick-wins, analysis tips, or comments on ELK Stack.

How I Use Kibana For SEO Log File Analysis

First, a quick overview of how Kibana works (the way I use it, anyway!).

Discover: The ‘Discover’ section allows you to perform searches to drill down for specific info using Lucene query syntax (it’s very easy to use, don’t worry!). I use this in two ways:

  1. Create a ‘Saved Search’ that I’ll set-up once, then make a visualization for, which will be saved on a dashboard
  2. To drill down when I’m hunting for something specific – Usually when I’ve spotted something in a visualization on a dashboard that piques my interest… This could be an SEO issue or a potential security problem – Especially when the site is using a CMS such as WordPress.

Visualize: This one is quite simple. In the ‘visualize’ section you create charts that present your data. You can either make a chart using a new search, or from a ‘Saved Search’, as per above.

Dashboard: The ‘dashboard’ shows various visualizations. I like to create dashboards that have a theme, for example ‘SEO Dash’ or ‘Security Dash’. There is sometimes a crossover here, with data appearing in two places… But I don’t mind that.

What Visualisations Can You Use In Kibana For Log File SEO Snapshots?

1 – Top Countries

Understanding where your visitors come from is quite basic, however I like to include a country map on my Kibana SEO dashboard, along with a couple of charts to make it easy to see figures of the most popular countries:

top-countries-map

The map gives a great way to get an at-a-glance impression of where visitors are from, but I also like to see tables of the top countries, too.

top-countries-by-figures

Why is checking the country of origin of visitors useful? It depends. This site doesn’t target internationally and is only available in English. If that wasn’t the case, one of the things I’d be looking to do is break the report down further, segmenting on whatever international URL structure was used (CC-subdomain or CC-subfolder, for example), and then keep an eye on the country of origin of visitors to each CC section. As with most things though, this is just a representation of data… The important part is the human brain deciding what this data means and this, of course, is often different depending on the situation at hand.

2 – Bot Activity Overview Report

Understanding how bots interact with a website is important, more so with a larger site or a site that is old enough to have undergone a few relaunches/redesigns (Dawn Anderson‘s slide on Generational cruft in SEO is a recommended read).

Before digging deep, I like to have a quick overview.

bot-activity

As you can see, most of the bots are not visible on this table, due to huge activity by ‘Uptime Monitoring Bot’, and also by a spike in ‘Yandex’ activity.

I find it really useful that in Kibana, it’s easy to:

  • Change the colours of the bots
  • Drag & Drop to zoom-in on a moment in time
  • Filter on/off specific bots

In fact, let’s do some of this now! We’ll filter out the uptime bot & Yandex, so we can get a view of the other bots.

filter-bots-done

To get the raw figures on bots, which is more useful perhaps than the above (but heh – I included the above for the VERY important, uber-technical reason… that I like pretty colours!):

bot-table

Looking at the above, we can see that YandexBot crawls the site a lot. However – and this is where the timeline graphs come in useful – this was mainly due to two big spikes. I’m going to dig into this later and run a search to see what URLs YandexBot was accessing to figure out what was going on.

This is also useful for spotting dodgy bots (the next visualisation I’ll be adding to my Dashboard soon, is a list of Googlebot hits with associated IP addresses, to spot fake bots).
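For the fake-bot check, one common approach (not something Kibana does for you) is a reverse-then-forward DNS lookup on each IP that claims to be Googlebot. Below is a minimal Python sketch of that idea. The function name and the injectable lookup arguments are my own illustration, and the accepted hostname suffixes (googlebot.com and google.com) are an assumption to check against Google’s current guidance. Note that the default forward lookup only handles IPv4.

```python
import socket


def is_real_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-then-forward DNS check for an IP that claims to be Googlebot.

    The two lookup arguments are hypothetical hooks so the check can be
    tried out without touching the network; by default real DNS is used.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse_lookup(ip)
    except OSError:  # no PTR record, DNS failure, etc.
        return False

    # Assumed suffixes for genuine Googlebot hostnames.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False

    try:
        # The hostname must resolve back to the IP we started with.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

You would feed this the distinct client IPs from hits whose user agent contains ‘Googlebot’; anything that returns False is worth a closer look in Discover.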

3 – Server Response Codes

Another useful snapshot for the Dashboard is server response codes. In the charts below we first have all response codes (including 200) and then all codes excluding 200. (Although I’d set Kibana to label ALL fields, it refused to label them all! When you’re viewing this in the browser though that’s fine, as there’s on-hover interaction on the donut AND labels.)

kibana-server-response-codes

95.22% being 200 codes is far from the worst I’ve seen, but I’d still want to look at the other (nearly) 5%. The 2nd chart above helps with this.

I’d want to check out the 404s and also the 500 status (server error… argh. What URLs threw that response?). For the 403s (forbidden), I’d look at what URLs were triggering this response, but this isn’t always something of concern.

I also want to see how the errors are progressing over time, so I created a visual showing any response codes 4** or 5**. And yes… I know this includes 403 & 401, 410 – maybe even the odd 418 if I’m lucky! But that’s okay… this is just an overview and I’ll dig deeper if needed. For example, I’d certainly want to know what caused the spike seen below… I’d turn to the Discover section and query for this, possibly creating a new chart if I felt it useful.

server-errors-over-time
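If you want to sanity-check a chart like this outside Kibana, the same count can be roughed out straight from the raw log. The sketch below assumes Apache/Nginx ‘combined’ log format with English month names; the function name and sample lines are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Pulls the timestamp and status code out of a combined-format log line.
LINE_RE = re.compile(r'\[(?P<time>[^\]]+)\] "[^"]*" (?P<status>\d{3}) ')


def errors_per_day(lines):
    """Count 4xx and 5xx responses per (day, status code)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m or m["status"][0] not in "45":
            continue
        day = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z").date()
        counts[(day.isoformat(), m["status"])] += 1
    return counts


# Illustrative lines only, not taken from the site analysed in this post.
sample = [
    '203.0.113.5 - - [10/Oct/2016:13:55:36 +0000] "GET /old-page HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [10/Oct/2016:14:01:02 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [11/Oct/2016:09:12:00 +0000] "GET /report HTTP/1.1" 500 0 "-" "Mozilla/5.0"',
]
print(errors_per_day(sample))
```

A sudden jump in one (day, status) pair is the same spike you would see on the timeline, and tells you which day to filter on in Discover.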

4 – Temp Redirects

Something else I like to check for on the overview dashboard is temp redirects (302 and 307 – sorry about the huge red box – client confidentiality & all!):

temp-redirects

Here I can see quite a few hits to URLs that sent a 302 response. These need investigating, as there are usually very few times that a temp. redirect should be used. In this case it turned out to be a login form that was redirecting with a 302 to another login page. Much worse is when a page with external backlinks, social shares, and traffic is redirected with a 302 instead of a 301. There have been arguments about this over the last year or so, but in my mind if a redirect/move is permanent, it should return a 301.
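As a rough cross-check of the table above, here is a short Python sketch that lists the URLs returning 302 or 307, most-hit first. It assumes combined log format; the function name and sample lines are illustrative only.

```python
import re
from collections import Counter

# Captures the requested URL and the status code from a combined-format line.
REQUEST_RE = re.compile(r'"[A-Z]+ (?P<url>\S+) [^"]*" (?P<status>\d{3}) ')


def temp_redirects(lines):
    """Return (url, hits) pairs for 302/307 responses, most-hit first."""
    hits = Counter()
    for line in lines:
        m = REQUEST_RE.search(line)
        if m and m["status"] in ("302", "307"):
            hits[m["url"]] += 1
    return hits.most_common()


# Illustrative lines only.
sample = [
    '203.0.113.5 - - [10/Oct/2016:13:55:36 +0000] "GET /login HTTP/1.1" 302 0 "-" "Mozilla/5.0"',
    '203.0.113.6 - - [10/Oct/2016:13:56:10 +0000] "POST /login HTTP/1.1" 302 0 "-" "Mozilla/5.0"',
    '203.0.113.6 - - [10/Oct/2016:13:56:11 +0000] "GET /old-offer HTTP/1.1" 307 0 "-" "Mozilla/5.0"',
    '203.0.113.6 - - [10/Oct/2016:13:56:12 +0000] "GET /moved HTTP/1.1" 301 0 "-" "Mozilla/5.0"',
]
print(temp_redirects(sample))
```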

5 – Image Size

Checking the bytes downloaded of images is something else worth keeping an eye on. The chart below shows one image in particular deserves some attention.

image-size

A quick check on the image above revealed it to be close to 1 MB in size – not ideal for a website image. This is just an overview of course. Page load times should be investigated fully, but a dashboard widget like this helps to flag the obvious offenders quickly.
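The same ‘which image is the heavy one?’ question can be answered from the raw log with a few lines of Python. This is a sketch under assumptions: combined log format, a file-extension list I picked myself, and only 200 responses counted (a 304 sends no body, so it would understate the size).

```python
import re
from collections import defaultdict

# Matches requests for common image extensions and captures the bytes sent.
IMAGE_RE = re.compile(
    r'"[A-Z]+ (?P<url>\S+\.(?:jpe?g|png|gif|webp|svg))(?:\?\S*)? [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)',
    re.IGNORECASE,
)


def largest_images(lines, top=5):
    """Return the `top` image URLs by largest single 200 response, in bytes."""
    biggest = defaultdict(int)
    for line in lines:
        m = IMAGE_RE.search(line)
        if m and m["status"] == "200":
            biggest[m["url"]] = max(biggest[m["url"]], int(m["bytes"]))
    return sorted(biggest.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Illustrative lines only.
sample = [
    '198.51.100.2 - - [10/Oct/2016:13:55:36 +0000] "GET /img/hero.jpg HTTP/1.1" 200 1012345 "-" "Mozilla/5.0"',
    '198.51.100.2 - - [10/Oct/2016:13:55:37 +0000] "GET /img/logo.png HTTP/1.1" 200 20480 "-" "Mozilla/5.0"',
    '198.51.100.3 - - [10/Oct/2016:13:56:00 +0000] "GET /img/hero.jpg HTTP/1.1" 304 0 "-" "Mozilla/5.0"',
    '198.51.100.3 - - [10/Oct/2016:13:56:01 +0000] "GET /about HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
]
print(largest_images(sample))
```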

6 – Crawl Budget Issues (quick overview only for the dashboard)

Crawl budget optimisation is a big subject and could easily be a blog post (or several) in itself. However, I find that a common cause of ‘crawl bloat’ is the use of query parameters in URLs. So, as part of my snapshot overview dashboard, I like to include a chart showing Googlebot hits to URLs that contain URL parameters.

googlebot-crawling-query-strings-kibana-seo

Looking at the chart above, although you can’t see it because I’m keeping the URLs confidential for the client, I was able to ascertain that a single URL parameter was being crawled by Googlebot a fair bit. Widening the date range and searching for all Googlebot crawls of URLs with this parameter, I found that it was being crawled thousands of times. Checking into this parameter, it was clear that it certainly didn’t need to be indexed or crawled.
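To show the kind of tally involved, here is a minimal Python sketch that counts which query parameter names Googlebot requests most. It assumes combined log format and simply trusts the user agent string; the parameter names in the sample (‘sessionid’, ‘sort’) are hypothetical and are not the client’s.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Captures the requested URL and the user agent from a combined-format line.
LINE_RE = re.compile(
    r'"[A-Z]+ (?P<url>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_param_hits(lines):
    """Count Googlebot hits per query parameter name (once per request)."""
    params = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m or "Googlebot" not in m["agent"]:
            continue
        query = urlsplit(m["url"]).query
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        params.update(names)
    return params


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# Illustrative lines only.
sample = [
    f'66.249.66.1 - - [10/Oct/2016:13:55:36 +0000] "GET /services/?sessionid=abc HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2016:13:55:40 +0000] "GET /blog/?sessionid=def&sort=date HTTP/1.1" 200 4096 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.5 - - [10/Oct/2016:13:56:00 +0000] "GET /blog/?sort=date HTTP/1.1" 200 4096 "-" "Mozilla/5.0"',
    f'66.249.66.1 - - [10/Oct/2016:13:57:00 +0000] "GET /services/ HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
]
print(googlebot_param_hits(sample).most_common())
```

Counting by parameter name rather than by full URL is a deliberate choice here: one parameter with thousands of different values shows up as a single large number instead of thousands of one-hit rows.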

To illustrate the effect of this one parameter being crawled so often, and how this could impact crawl budget, the charts below compare Googlebot activity for URLs that have this parameter (which shouldn’t be crawled or indexed) with Googlebot activity for the most important sections of the website. In this case, these are subfolders for service-sections, some of which have subpages under them; Googlebot hits to both the main subfolder AND any subpages within it are included.

The chart for crawls of URLs with the query parameter is the one in the bottom-right.

important-sections-googlebot-activity-vs-query-strings

The client’s dev team recently added the query parameter (along with several others I found) to Google Search Console, to request Google not to crawl this parameter. I’ll be monitoring this chart over time based on successive log files, to see if/when this takes effect and how much of an impact it makes.

I also like to have the raw count of Googlebot hits to important sections of the website at hand, over the last ‘x’ days:

Googlebot Hits Important Sections

In addition to technical issues hitting crawl budget, such as query parameters, faceted navigation, and search strings, another common quick (ish) win with some larger sites is to run a content audit & merge several separate pieces of content that are on a similar topic into themes. For too long some people had an approach of ‘1 article per keyword’, resulting in many more URLs than are needed. Consider consolidating & merging such articles into solid, more in-depth themes.

That is really all I have set-up at an ‘overview’ level on the Kibana dashboard at the moment. I do intend to add a few more widgets, but I find this dashboard handy and use it in combination with drilling-down on queries for issues I spot, as well as combining with Screaming Frog Log Analyser (and of course Screaming Frog crawls – get their paid version, it’s cheap & worth it!), GSC data, GA data, and crawls from the likes of SEMrush. I also want to have a good go over Deep Crawl, as I know folks like Barry Adams (who I respect for his tech SEO knowledge) like Deep Crawl.

Although that’s it for the SEO visualisations, I do also use Kibana for a security dashboard, which brings me to number 7…

7 – Security Issues

Security And SEO

Okay, some may scream at this point that website security is NOT an ‘SEO issue’. Whilst I agree that web security is an area that requires very specialist expertise and should really be carried out by folks who live & breathe SQL injections, XSS hacks, and RATs… I feel pretty strongly that as SEOs we can (and should!) keep an eye on the most common security issues that clients face, especially when the client is using a popular CMS.

Incidentally, the above meme is from a real-life situation, where I saw a UK, English-only website appear in Google’s SERPs in Chinese. They’d been hacked and a script was present that was serving-up a Chinese version of the site complete with some very questionable links… but ONLY if the useragent was Googlebot.

So as with the SEO Dashboard, the ‘Security Dash’ I create from log files is only intended as an overview – A place to spot any immediate issues and then dig deeper.

WordPress Login Attempts

Wordpress Security Log File Analysis

The above chart looks at hits to the wp-login.php OR the xmlrpc.php pages. Both of these could show security issues. In fact, looking at the chart above and considering that this is for a UK based site with no staff, writers, or designers overseas, I’d say this needs looking at. Oh, and I asked the client… they haven’t been flying between Russia & Brazil, either.

I find the above is useful to illustrate ‘questionable’ login attempts to clients, but drilling down to IP addresses attempting the login can also be useful:

ips-wp-login-xmlrpc

With Kibana, you can easily click on a country (in the above map) to filter the entire dashboard (including the above table of IPs), as well as sorting by IP or clicking an IP to show only results from that IP. With export options too, it’s pretty easy to download IPs in bulk for banning. It may also be worth considering, in this case, using something like the premium version of Wordfence to block access to wp-login.php from specific countries.

A cheaper alternative is generating a list of country-specific IPs and either blocking all access to, or allowing access only to, wp-admin from set countries, using a service like this one, on countryipblocks.net  (always be careful NOT to lock yourself out!).
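If you’d rather pull the list of offending IPs straight from the raw log than export it from Kibana, a tally like the one in the table above can be sketched in a few lines of Python. This assumes combined log format; the function name and sample lines are illustrative.

```python
import re
from collections import Counter

# Captures the client IP and requested URL from a combined-format line.
LINE_RE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[A-Z]+ (?P<url>\S+) ')
TARGETS = ("/wp-login.php", "/xmlrpc.php")


def login_hits_by_ip(lines):
    """Return (ip, hits) pairs for wp-login.php / xmlrpc.php, busiest first."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        # Ignore any query string when checking which script was requested.
        if m and m["url"].split("?", 1)[0].endswith(TARGETS):
            hits[m["ip"]] += 1
    return hits.most_common()


# Illustrative lines only.
sample = [
    '198.51.100.23 - - [10/Oct/2016:13:55:36 +0000] "POST /wp-login.php HTTP/1.1" 200 3000 "-" "Mozilla/5.0"',
    '198.51.100.23 - - [10/Oct/2016:13:55:37 +0000] "POST /xmlrpc.php HTTP/1.1" 200 400 "-" "Mozilla/5.0"',
    '203.0.113.77 - - [10/Oct/2016:13:56:00 +0000] "GET /wp-login.php?action=register HTTP/1.1" 200 3000 "-" "Mozilla/5.0"',
    '203.0.113.77 - - [10/Oct/2016:13:56:05 +0000] "GET /contact HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
]
print(login_hits_by_ip(sample))
```

Remember that legitimate logins show up here too, so check your own (and the client’s) IPs before banning anything.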

Attempts At SQL Injection

This is only a VERY basic check that will not necessarily catch all attempts and will very likely throw up some false positives. However, I find it useful. The chart below shows hits that contain, in the request, strings that MAY indicate attempts at either MySQL injection (you can see the strings I used in the title of the chart, although I also include folks whose request contains ‘sql’), or people trying to access SQL files that folks have left behind – say after an install of a tool or similar.

Further Security Checks From Log Files

Looking at the above, I can see several attempts to access .sql files that do not exist on the site. The above URLs (and folders) do not exist and have never existed on this site, leading me to believe a lot of the above are attempts to crawl the web for known footprints.
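To give a feel for how crude (but handy) this sort of string matching is, here is a minimal Python sketch. Apart from ‘sql’, the strings in SUSPICIOUS are my own illustrative picks and are NOT the ones used for the chart above; the log format (combined) is also an assumption.

```python
import re
from urllib.parse import unquote_plus

# Captures the requested URL from a combined-format line.
REQUEST_RE = re.compile(r'"[A-Z]+ (?P<url>\S+) ')

# Hypothetical watch-list; tune it to your own site and expect false positives.
SUSPICIOUS = ("sql", "union select", "information_schema")


def suspicious_requests(lines):
    """Return requested URLs whose decoded form contains a watched string."""
    flagged = []
    for line in lines:
        m = REQUEST_RE.search(line)
        if not m:
            continue
        # Decode %20 and '+' so 'UNION+SELECT' and 'union%20select' both match.
        decoded = unquote_plus(m["url"]).lower()
        if any(s in decoded for s in SUSPICIOUS):
            flagged.append(m["url"])
    return flagged


# Illustrative lines only.
sample = [
    '198.51.100.23 - - [10/Oct/2016:13:55:36 +0000] "GET /backup/dump.sql HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
    '198.51.100.23 - - [10/Oct/2016:13:55:40 +0000] "GET /item.php?id=1+UNION+SELECT+password+FROM+users HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [10/Oct/2016:13:56:00 +0000] "GET /about HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
]
print(suspicious_requests(sample))
```

A perfectly innocent URL such as a blog post with ‘mysql’ in the slug would be flagged too, which is exactly the false-positive problem mentioned above.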

One IP obviously appears more often than the others. I did a little digging on this IP and found many more entries, a (small) selection of which are below:

IP Security Checks Log File

Summary – & I Am Impressed With The Intel NUC

Okay, that’s it folks! I hope you enjoyed the overview of dashboard visualisations I like to use in Kibana. All in all, I’m quite impressed with how well the Intel NUC handles ELK Stack. Sure, when filtering on large data sets it will slow down a tad, but on the whole it works pretty damn smoothly and is much more efficient than my beast of a desktop PC.

What about you? Anything I missed in this post that you like to extract from server log files to help with SEO? Comment below & let me know!

 
