
BearStat

  • Browse source: source:/bearstat/
  • Checkout code: svn co svn://forge.bearstech.com/forge/bearstat


Introduction

Here at Bearstech we have been using The Webalizer for quite a while. It is easy to install, works with any web site or application, requires no maintenance, and provides useful figures. It is also one of those rare pieces of "honest" software (along with Analog) that precisely explain how they work and how to interpret the different figures.

Webalizer is nice for volume estimation: a look at the famous yearly panel almost instantly gives a good idea of a site's nature and traffic. Things like hits and page/hit ratios per day and month can be key information when setting up a hosting infrastructure. Our clients, on the other hand, will rather use Analytics or similar tools to extract user behaviour and check their site's audience. Different needs, different tools; so far so good.

Lately, however, we have hit several limits of The Webalizer:

  • The characteristics of a web page are no longer captured by a .html extension. URLs are now mapped in many ways, and what matters is the document's MIME type.
  • The notion of a web 2.0 page is fuzzy: a "viewed page" is still a meaningful and interesting metric (for now), but AJAX is bringing a new type of audience. Would you count a "widget" as a page? Conversely, would you count those many AJAX hits as one consolidated "viewed page"?
  • The notion of visitor or visit is a less and less acceptable approximation. For instance, benchmarks and transaction optimisations need a pretty good idea of the nature and quantity of visits and visitors.

Since we are not going to fiddle with numerous sites and various technologies to instrument them with an Analytics-like JavaScript bug/probe, we wondered if we could still deploy, almost automatically, a solution based on Apache logs that would provide meaningful metrics for the "web 2.0 era".

Basic idea

The Apache log facility is versatile and extensible. We have already used this feature to provide a custom log which enables us to measure and compare the activity of several virtual hosts on a single server (source:/misc/apachetraf): one configuration line per virtual host, 70 lines of Perl.

With the Apache log facility, you can record anything from an HTTP request cycle: input and output headers, transaction time and size, and so on. Modules like mod_logio extend it further; this can be very useful to record other per-query metrics such as latency, system resource usage, etc.
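As an illustration, here is the kind of custom log format one could define for this purpose (a sketch: the "bearstat" format name and field choice are ours; %D logs the request duration in microseconds, %I/%O require mod_logio, and %{%s}t needs a strftime that supports %s, as on Linux):

{{{
# Hypothetical "bearstat" log format: Unix timestamp, virtual host,
# final status, response MIME type, duration (µs), bytes in/out, request.
LogFormat "%{%s}t %v %>s %{Content-Type}o %D %I %O \"%r\"" bearstat
CustomLog /var/log/apache2/bearstat.log bearstat
}}}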

Logs have other nice properties: they are easy to handle efficiently, and they are processed asynchronously. For instance, you can easily set up a "statistics server" which receives all your Apache logs in real time via syslog. Impact on the production server: one configuration line added, no performance issue. On the statistics server, logs are processed in chunks at regular intervals: if the processing is efficient, the chunk length can be as low as 1 minute, which means your statistics server needs very little storage.
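As a sketch of this setup, the production side indeed needs only one piped-log line, assuming the local syslog daemon is configured to relay the chosen facility to the statistics server:

{{{
# Pipe each log entry to syslog (facility local0); syslog then forwards
# them over the network to the statistics server in real time.
CustomLog "|/usr/bin/logger -t apache -p local0.info" bearstat
}}}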

Right now we have a slow Perl prototype "crunching" 75,000 hits per second on a regular server. That should be enough for a start...

Sessions, visitors, visits

Important technical note: a cookie is tied to a site, a specific piece of software and a device (e.g. "the google.com cookie planted in Firefox 2 on my laptop").

We have decided to concentrate on the notion of session, which in our experience is always backed by a cookie. A session can be compared to a visit: it is a series of HTTP requests from the same browser within a given time window, and it can be exactly defined by a cookie and a point in time. It does not always map to a simple reality:

  • Most of the time we can assume that the same person is using this very browser in this restricted time window. Computers being mostly "personal", this should hold. Cybercafés and other social practices might bias this assumption in surprising ways, depending on your audience.
  • The session starts with the first request, so it is easy to determine when someone arrives at a web place. But there is no mechanism to know for sure when the user actually leaves. See below.
  • There is no indication that your site gets the user's full attention during a session; zapping habits and the inherent hypertext nature of the Internet tend to suggest the contrary. Session duration is not important, as we'll see later.

A visitor is a vague notion. Although it is most of the time equated with a given cookie, it is very difficult to map to a real person:

  • The same person may use several cookies. There are many reasons for this: cookie expiration, mobility (the same person using several devices), etc.
  • Many persons may use a single cookie, as on a family computer which switches users without the applications being aware.

In practice, the first category is predominant, which means that counting the distinct cookies over a period of time will most probably give you an over-estimation of the number of real persons in your audience.

A visitor may have several sessions, all with the same cookie but at different points in time. If your application cookie has a fitting expiration time, and the real person uses the same device regularly, you can detect "fidelity patterns" in your visitors. Those conditions are rarely met (and fidelity metrics are much better backed by an authentication scheme).

Measuring sessions

Session idle timeout

Technically, we ask Apache to track a specific cookie, so the log must be tailored to each application. By default, PHP applications will use PHPSESSID, for instance.
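Logging the cookie itself is a single Apache directive: %{name}C expands to the value of the named cookie in the request (here PHPSESSID; adapt the name to your application):

{{{
# Hypothetical "session" format: Unix timestamp, session cookie, request.
LogFormat "%{%s}t %{PHPSESSID}C \"%r\"" session
CustomLog /var/log/apache2/session.log session
}}}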

To count the distinct sessions we must define an idle timeout value: if no request from a known session is seen within this delay, the session is considered expired. Illustration: let the idle time be 30 minutes; if a user starts a session at 8:00 and browses until 8:15, the session will be considered expired at 8:45. If the same user comes back at 9:00, it is a new session.
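Here is a minimal Perl sketch of this expiration logic (assuming the simplified "timestamp cookie" input produced by the log format above; the real bearstat prototype does more filtering): remember the last-seen time of each cookie, and count a new session whenever an unknown or expired cookie shows up.

{{{
#!/usr/bin/perl
# Count session starts per hour from "epoch_seconds cookie" lines,
# with a 30-minute idle timeout.
use strict;
use warnings;

my $idle = 30 * 60;  # idle timeout in seconds
my %last_seen;       # cookie => time of its last request
my %starts;          # hour bucket => number of sessions started

while (<>) {
    my ($time, $cookie) = split ' ';
    next unless defined $cookie and $cookie ne '-' and $time =~ /^\d+$/;

    # New session if this cookie is unknown or has been idle too long
    my $seen = $last_seen{$cookie};
    $starts{ int($time / 3600) }++
        if !defined $seen or $time - $seen > $idle;
    $last_seen{$cookie} = $time;

    # Periodically forget expired cookies to bound memory usage
    if (keys %last_seen > 100_000) {
        delete @last_seen{ grep { $time - $last_seen{$_} > $idle }
                           keys %last_seen };
    }
}

printf "%s  %d sessions/hour\n", scalar localtime($_ * 3600), $starts{$_}
    for sort { $a <=> $b } keys %starts;
}}}

Running the same log through this counter with different values of $idle reproduces the experiment described below.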

Why this "idle" time? Because it is needed for the computations. And which value should we pick? Let's start with the whys:

  • Scalability: if sessions never ended, we would need potentially infinite memory to remember all of them. Consolidating one year of statistics would require keeping an ever-growing session list.
  • Noise: why keep around a session started by a one-time user for a full year or more? Forgetting information is also about keeping only the most interesting parts.
  • Meaning: the key notion in the session is its punctual nature. A session represents someone reaching a web site with a goal, acting on it, then leaving. In the real world, you could compare this with a shopping trip.

Finding a fitting value for the idle time is a quest. It depends on your audience's behaviour and on the smallest period of distinct activity on your application. Let's narrow it down:

  • It can't be too small: if set to 1 minute, then someone reading an electronic newspaper at a pace of 5 minutes per article will count as a new session for every article read.
  • It can't be too big: if set larger than a day, you won't be able to capture periodic events more frequent than once a day.

We can now guess that an interesting value probably sits between 10 minutes and 10 hours. To check this, we ran our session counter on a busy production site and graphed the session count estimated with various idle times (see the attached graphs). The vertical axis unit is sessions per hour and the horizontal axis is elapsed time in minutes.

When the idle time is (unrealistically) set to 1 minute, we see an over-estimation because users browsing for more than 1 minute are counted as newcomers the next minute. Over an hour, a persistent user will be over-estimated 60 times! The measure also has a lot of jitter, a direct consequence of the small sample time window (the tested site has roughly 100 hits per minute).

When idle time grows, we have two interesting and pleasant effects:

  • The measure is more stable, because it is smoothed over a larger time window.
  • The measures converge, meaning that by enlarging the time window we capture more and more real sessions and cut down the over-estimation caused by expired sessions.

If the idle time grows too large, say above 10 hours, the result is no longer interesting: the measure grows steadily while there is some traffic, accumulating almost never-ending sessions, and then decreases slowly when everybody leaves. This looks like "one smooth hill a day" and gives neither good real-time traffic estimations nor meaningful daily averages.

Finally, let's mention the "memory effect" induced by the arbitrary session idle timeout: if all current users were to leave your site immediately, the drop in audience would only show when their sessions are considered expired, one time window later. This effect masks quick audience drops, and suggests keeping the smallest acceptable idle time.

With all of these constraints and experiments in mind, we settled on a 30-minute idle timeout as a fitting value. It also sounds realistic when compared to the "away" timeouts of some screensavers or messaging software. Last but not least, Webalizer is also based on a 30-minute session timeout.

Session properties

The measure unit is expressed in sessions per hour. This rate is needed so that the figure does not scale with the idle time parameter. We must keep in mind that this measure still depends on:

  • The idle time parameter: since we don't know the session duration distribution, the effect of the idle time window is not predictable.
  • The session usage of the application: to some extent, applications will have different browsing and request patterns which may affect the observed measures (e.g. COMET-based applications).
  • Technical details of the statistics algorithm implementation, such as bot filtering, page MIME type selection, etc. (see the sketch below).
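To make the last point concrete, here is the kind of filtering such an implementation might apply before counting hits (a sketch, not Bearstat's actual patterns):

{{{
# Hypothetical filters: ignore well-known robots, and only keep hits
# whose MIME type plausibly represents a "viewed page".
my $bot_re  = qr/Googlebot|msnbot|Slurp|spider|crawler/i;
my $page_re = qr{^(?:text/html|application/xhtml\+xml)};

sub is_page_hit {
    my ($user_agent, $mime_type) = @_;
    return 0 if $user_agent =~ $bot_re;   # drop robot traffic
    return $mime_type =~ $page_re ? 1 : 0;
}
}}}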

However, this measure hardly depends on the real session duration. It is designed to be proportional to the flux of visitors, and almost unrelated to their attention span.

Strictly speaking, we should report a measure in "sessions per hour, with bearstat 1.0 and an idle time of 30 minutes" (unit, implementation, configuration). Since "sessions per hour" should be enough, we propose:

  • To standardize on an idle time of 30 minutes; using a different idle time would simply require an honest SEO to state it clearly.
  • To share the implementation of Bearstat as Free Software, exposing its precise selection and filtering patterns and letting anybody use and improve them.

It is technically possible to obtain precise session duration information, but:

  • It requires specific support from the application, such as a periodic "presence" request.
  • There seems to be only a weak correlation between the measured session duration and the user's real attention span (the "zapping" or "multi-tasking" effect).
  • It gives a real-time audience measure, which yields deceptively low figures for an SEO used to big "visits per day" counters.
  • Web server resources do not behave like a game server's: real-time audience accounting is most of the time not relevant (although COMET may change this).
