Most merchants deploy some form of web statistics on their e-shops to monitor and analyze traffic. Some tools rely on JavaScript code that collects the page access information and sends it to an external site for analysis. Others are internal tools that store the information in the database as soon as a page is accessed. The information typically includes the time the page was accessed, the IP address the request originated from, the page script and its parameters, the browser (user agent), and the referrer, that is, the page the visitor came from.
In a nutshell, the request/response cycle between a client and a server over the web goes like this: the client (e.g. a browser) requests a page from a server (e.g. a website), and the server responds by sending the client the content of that page. The information the client sends to the server is useful to merchants who try to measure their e-commerce site's efficiency in terms of sales and overall exposure. Merchants may assume this is legitimate traffic initiated by humans. However, they need to understand the possible origins of the traffic and make sure the information their web system analyzes really comes from a human. To judge the validity of the traffic information presented, one needs to take into account at least the following:

RESPONSE/REQUEST HEADERS

Some of the traffic information elements can be manipulated easily from the client end. The user agent identifies the tool the visitor uses to access a page on the server, and the referrer states which page the client visited before the current one. Both can be hidden or set to any value the client wants. In the real world, therefore, these two elements prove nothing about the visitor on their own and may be altered in many different ways. It is beyond the scope of this document to go into details, but you can experiment, for instance, with plugins like Modify Headers for Firefox. That requires having the browser installed along with the particular plugin.

The Date field cannot be manipulated, as it depends on the time of access and is server based. Whether the page access time is recorded internally or externally, the time-stamp is set at that particular moment, so the client end cannot influence this traffic element. The page script (the access request) is not something the client can falsify either, as it is simply the page the client wants to access.
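Returning to the user agent and referrer, the following Python snippet is a minimal sketch of how little effort it takes to set them. It uses only the standard library. The URL and both header values are made up for illustration, and the request is only constructed, never sent.

```python
import urllib.request

# Hypothetical header values, chosen only for illustration.
FAKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; ExampleBrowser/1.0)",
    "Referer": "http://www.example.com/some-search-result",
}


def build_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a request whose user agent and
    referrer are whatever the client decides they should be."""
    return urllib.request.Request(url, headers=FAKE_HEADERS)


# Hypothetical shop URL; nothing is transmitted over the network here.
request = build_request("http://shop.example.com/shipping.php")

# urllib stores header names in capitalized form, e.g. "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")
```

A server receiving such a request has no way of telling from these two fields alone whether a browser or a script produced them.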
The IP field, under normal circumstances, identifies the origin of the client. The IP can be manipulated directly or indirectly. Direct alteration implies the client end is able to craft its own IP packets (it controls the low-level network layer and operates from an ISP that allows it in some way) and can therefore set whatever source IP it wants. In this case the client will not see the server's response, as the response is sent to an IP the client has no real access to. Indirect alteration covers many methods. A common one is the use of proxies, where an intermediate entity sits between client and server and relays the information in both directions. In this case the server sees the IP of the proxy and not the IP of the real client. To experiment with the indirect method, one can search the popular search engines for keywords like "anonymous proxy". Many offer some sort of free service over the web. Typically these free services send specific headers to the server that signify their nature, so they are fairly simple to detect. Other indirect methods take advantage of the browser's active content switches: when active content is enabled, one server may instruct the browser to access a different site, in which case the real client is totally unaware of the event. Finally, there are hijacked systems and servers that operate as proxies, in which case proxy detection is impossible from a server script.
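The header check for proxies that openly declare themselves can be sketched as follows. This is an illustration only: the list of header names is an assumption based on headers that proxies commonly add (Via, X-Forwarded-For, Forwarded), and the function name is hypothetical. A hijacked machine, or a proxy that strips these headers, passes this check unnoticed, as noted above.

```python
from typing import List, Mapping

# Request headers that proxies commonly add. The list is an assumption
# made for this sketch, not an exhaustive or authoritative set.
PROXY_HEADERS = ("via", "x-forwarded-for", "forwarded")


def proxy_hints(headers: Mapping[str, str]) -> List[str]:
    """Return the proxy-revealing header names present in a request."""
    present = {name.lower() for name in headers}
    return [name for name in PROXY_HEADERS if name in present]


# Made-up request headers for a direct and a proxied visit.
direct = {"Host": "shop.example.com", "User-Agent": "ExampleBrowser/1.0"}
proxied = dict(direct)
proxied["Via"] = "1.1 proxy.example.net"
proxied["X-Forwarded-For"] = "203.0.113.7"

print(proxy_hints(direct))   # []
print(proxy_hints(proxied))  # ['via', 'x-forwarded-for']
```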
HUMAN EMULATION

Operating a regular browser does not necessarily require a human. Certain plugins can launch the browser from the O/S at a given time with specific parameters and instruct it to place a request to a server. Take, for example, sites that provide web-site information. Among other things they show a picture of the web-site. Do you think a human visited the site, took a screenshot and then uploaded the picture for you to see? Another example is the browser emulation many online services offer, where you simply request a screenshot or a full emulation of the website in a particular browser and version. Do you think there is a human servicing these requests?

SYSTEM SECURITY

If the client end is compromised in some way, the real visitor can hide his identity and send whatever requests he wants to any server, for as long as the compromised machine allows it. If a server is compromised, then whatever IPs and systems are controlled by that server are also compromised in terms of web access. Any of these systems can now act as a client and place requests to other sites. There are numerous recent examples if one checks the technology news on popular information sites.
Going back to the programs and methods that collect and analyze the traffic information, take into account what was mentioned above. Most of these tools have no way of providing accurate results, for the following reasons.

WEB-SITE SPECIFICS

Each web-site and each web page has characteristics that cannot be analyzed unless you are pretty much the site author and have some way of feeding this information to the tool, so that it can filter the site's traffic and keep artificial traffic out of the equation. Examples are the way the HTML elements are presented under certain conditions, the images that show up under certain conditions, personalized information shown when someone is logged in, and so forth, all weighed against how a human operating a browser would be expected to react. There are far too many factors to list, but the complexity that arises should be clear. To summarize, a generic tool has no way to do proper web traffic analysis, for the following reasons:
1. The traffic information can be manipulated, as stated above.
2. A generic tool is not aware of the web-site specifics.

Also worth noting here is the misinformation that goes around, claiming that active content can be used to filter web traffic and identify humans. This is a myth. A script, bot, or whatever you want to call it, can be programmed to emulate any type of behavior, especially when documentation about it is available. In other words, a programmer reads the specifics of a popular web analysis tool and then programs his script or bot to reproduce them. Ajax or JavaScript, and active content in general, can therefore be decoded, and the proper parameters passed from the client to the server can mimic a human and present a valid request that is eventually fed to the analysis and corrupts the actual traffic results.

WORKAROUNDS

There are ways to detect and rule out artificial or bogus traffic, and a critical element is site customization. Keep in mind that an automated script or bot will also use some generic approach to mimic how a browser responds when operated by a human. Customizing a web page may help merchants identify the real traffic and derive useful information from it. The raw server logs include every access for a specific period of time. If you check the logs you will see, for instance, that someone who operates a browser does not only access the page but also the side-scripts and other elements the HTML page consists of, such as the stylesheet and the various images the page presents. Again, this is not enough on its own: as mentioned earlier, a browser can be emulated to a certain degree and operated automatically, so you will still see bogus traffic. Another signal used for human detection is the order in which the elements of a web page are accessed.
If, for instance, a page contains an image that is repeated several times, a regular browser with a human operator does not need to fetch it repeatedly. The browser will also follow the response headers of the server. So if the server flags the element as cacheable and the browser requests it again as if it never had it, something is not right. One needs to know the headers a server is capable of responding with, and combine these headers with the page access and with information about the origin of the request (like the IP). Once these items are brought together, the real picture of the traffic starts to show. Here is an example:

1. Client requests page script shipping.php.
2. Client requests 3 times a web design image that represents a left corner, which appears 3 times on that same page.

This is enough information to conclude the access is likely artificial. Under normal circumstances someone downloads a browser and uses it as is. Unless he is experimenting with something, he has no need to alter his browser's settings and disable its cache, for instance. What would be the point for regular use? Slower page access? Few humans change these settings, except perhaps when they do some sort of site development.

Another example is sending headers back from the server to the client end. If the server sends the client's browser a request to delete a cookie, would the browser still send that cookie back to the server? Or, if the server sends two headers with the same field and different values within the same page response, which of the two is the browser going to keep? Information exchanges like these start to reveal the true identity of the client. Combinations of images and headers also help. Assuming a bot is re-programmed to avoid this kind of detection and tries to emulate the cache, what happens if the server sends a regular no-cache response for an image the 2nd time or the 4th time, but not the 3rd time?
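The shipping.php example above can be turned into a rough log filter. The sketch below is only an illustration under stated assumptions: the log is already parsed into (client IP, requested path) pairs covering a single page view, every static asset is assumed to have been served as cacheable, and the function names, paths and file extensions are hypothetical. It flags a client that fetches the same asset more than once, or that never fetches any page asset at all.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# File extensions treated as static page elements (an assumption of this sketch).
ASSET_EXTENSIONS = (".css", ".js", ".gif", ".jpg", ".png")


def is_asset(path: str) -> bool:
    """True if the requested path looks like a static page element."""
    return path.lower().endswith(ASSET_EXTENSIONS)


def suspicious_clients(requests: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each suspicious client IP to the reasons it was flagged."""
    assets_by_ip: Dict[str, Counter] = {}
    for ip, path in requests:
        counts = assets_by_ip.setdefault(ip, Counter())
        if is_asset(path):
            counts[path] += 1

    flagged: Dict[str, List[str]] = {}
    for ip, counts in assets_by_ip.items():
        reasons = []
        if not counts:
            # A page was requested, but none of the elements it consists of.
            reasons.append("no page assets requested")
        repeated = sorted(path for path, n in counts.items() if n > 1)
        if repeated:
            # A caching browser would have fetched each of these only once.
            reasons.append("cacheable asset requested repeatedly: " + ", ".join(repeated))
        if reasons:
            flagged[ip] = reasons
    return flagged


# Made-up log entries: one browser-like client and two suspicious ones.
log = [
    ("198.51.100.4", "/shipping.php"),
    ("198.51.100.4", "/stylesheet.css"),
    ("198.51.100.4", "/images/corner_left.gif"),
    ("203.0.113.9", "/shipping.php"),
    ("203.0.113.9", "/images/corner_left.gif"),
    ("203.0.113.9", "/images/corner_left.gif"),
    ("203.0.113.9", "/images/corner_left.gif"),
    ("192.0.2.15", "/shipping.php"),
]

for ip, reasons in suspicious_clients(log).items():
    print(ip, reasons)
```

Neither signal is proof on its own. They are meant to be combined with the headers the server actually sent and with the origin of the request, as described above.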
The aim of these methods is to force the bot to invoke a regular browser to do the access in order to avoid detection. Within the page request/response, many things are then going to be different. First, as the complexity of the bot increases, so do its mistakes and programming errors. This is a basic software engineering concept. Also, as the size of the code increases, it becomes easier to detect, so where a system has been compromised, a complete browser emulation may easily alert the victim that something is wrong with his system. At the same time the filtered logs will present useful information, as automated scripts and bots are not aware of the site's specifics. If you need to know more about these and other methods of detection, or require a specific service for your e-commerce store to determine what the real traffic is, you can always contact us.

Measurements and analysis from 2008 show that for many sites with heavy traffic, 80%-90% of it is artificial. These sites score high in search engine results for popular keywords. The bogus web traffic also shows up as valid in some of the popular web analysis tools we tested.