Chapter 4 Detailed Design & Algorithms
4.2. Tracking in HTTP Session with URL Rewriting
4.2.1. Finding Session ID
4.2.1.1. Preface
Why do we need to find the session ID? The tracking algorithm for URL rewriting, discussed in the next section, replaces the old session ID with the new one and then revisits the previously traveled URLs to compare the differences. Finding the session ID in the HTTP connection, and in the URLs recorded in the history, is therefore essential.
In this section we first introduce the features of the session ID. From these features we conclude that the session ID is one of the common substrings of fixed length l shared between URLs. We then propose several algorithms for finding such substrings and, based on them, design an algorithm that finds the session ID in the HTTP connection.
4.2.1.2. Features of the Session ID
The session ID used in HTTP has several features. The basic condition is that the session ID is composed of normal characters, that is, any characters other than the special characters ‘;’, ‘/’, ‘&’, ‘%’, ‘?’, and ‘=’.
In brief, the session ID may consist of anything except the special characters, because these characters have special meanings in a URL. There is no other important constraint on the ID. It follows that the character immediately after the session ID must be a special character or the end of string (EOS).
Although the ID has few constraints, in most implementations the session ID is a hashed value with a fixed length. The function of the session ID is to identify clients and the data stored for them on the server without using any client resources such as cookies. An attacker might try to guess a session ID that is in use in order to access the client’s personal information, impersonate the client, or do other harm. To avoid this security problem, implementations usually use encryption so that the session ID cannot be guessed easily or found quickly by trial, yet without too much complexity. Hashing is a common approach in practice, and it gives rise to some useful features. One of them is that the session IDs within the same web site are strings of a fixed length.
The session ID also has another feature by nature. For an HTTP session with URL rewriting, the links must be encoded so that the session ID is embedded in the URL. This means the session ID must be a substring of the URL.
Hence, we propose that the session ID has the following three features:
1. It has a fixed length, and its content does not change within the same session.
2. The ID is embedded in the URL.
3. The character immediately following the session ID must be a special character or EOS.
So we can say that the session ID is one of the common substrings of all the URLs.
An algorithm for finding the fixed-length common substrings of two known strings is proposed in Section 4.2.1.3.
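The three features can be read as a simple test on a position inside a URL. The following is a minimal sketch under that reading; the names `SPECIAL_CHARS` and `looks_like_session_id` are hypothetical and do not come from our implementation.

```python
# Special characters listed in Section 4.2.1.2.
SPECIAL_CHARS = frozenset(";/&%?=")


def looks_like_session_id(url: str, start: int, length: int) -> bool:
    """Check whether url[start:start+length] satisfies the stated features.

    Hypothetical helper: it only tests the syntactic features
    (fixed length, embedded in the URL, no special characters inside,
    followed by a special character or EOS).
    """
    end = start + length
    if start < 0 or length <= 0 or end > len(url):
        return False  # the token must lie inside the URL
    token = url[start:end]
    if any(ch in SPECIAL_CHARS for ch in token):
        return False  # the ID consists of normal characters only
    # The next character must be a special character or end of string.
    return end == len(url) or url[end] in SPECIAL_CHARS


if __name__ == "__main__":
    url = "/p;sid=ABCD1234?x=1"  # made-up example URL
    print(looks_like_session_id(url, 7, 8))  # token "ABCD1234"
```

Passing this test is necessary but not sufficient: any parameter value of the right length passes it too, which is why the comparison between URLs described next is needed.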
4.2.1.3. Find the Common Fixed-Length l Substring
Before introducing the algorithm for finding the common substrings of fixed length l, we use a diagram to explain what common substrings are and how their lengths are determined. It is shown in Figure 4-13.
Figure 4-13 the common substrings
This kind of figure is usually used to introduce the longest common substring (LCS) [23], but our focus is on all common substrings. In Figure 4-13, we compare the two strings “aacwahwjxoexwaacwjx” and “cawaacjx”. We use a two-dimensional array to record the result of comparing each pair of characters. The value 1 means that the two compared characters are the same; otherwise the cell is empty. In the array, a common substring appears as a sequence of consecutive cells with value 1 whose height decreases by one at each step; in other words, the slope of the sequence in the array is -1. The length of the common substring is the length of that sequence. In the figure we can therefore see one common substring of length 4, one of length 3, and three of length 2. The substring of length 4 is the longest common substring.
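The reading of the comparison array can be sketched in a few lines. The function name `maximal_common_runs` is hypothetical; the sketch walks each diagonal run of matching cells and reports where it starts and how long it is, without building the array explicitly.

```python
from collections import Counter


def maximal_common_runs(a: str, b: str):
    """Return (start_in_a, start_in_b, length) for every maximal diagonal run.

    A run is a sequence of cells (i, j), (i+1, j+1), ... in which
    a[i] == b[j]; it is maximal when it cannot be extended at either end.
    """
    runs = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip cells that merely continue a run started up and to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            runs.append((i, j, k))
    return runs


if __name__ == "__main__":
    runs = maximal_common_runs("aacwahwjxoexwaacwjx", "cawaacjx")
    by_length = Counter(length for _, _, length in runs if length >= 2)
    print(sorted(by_length.items()))
```

For the two strings of Figure 4-13 this tally gives one run of length 4 ("waac"), one of length 3 ("aac"), and three of length 2 ("wa" and "jx" twice), matching the counts read from the figure.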
We summarize the useful features of the session ID as follows: the session ID is a fixed-length substring embedded in the dynamic URLs of an HTML document. Based on this feature, we propose an algorithm that finds the common substrings of fixed length l between two strings. We can use this algorithm to find the session ID shared between URLs.
To find the common substrings of fixed length l between two strings a and b, we compare three methods. The first finds all common substrings of the two strings and then chooses those of length l. The second takes a substring of length l from one string and uses the Knuth-Morris-Pratt (KMP) algorithm [20][25][26] to perform string pattern matching [23] against the other string. If the pattern is matched, the substring is a candidate; otherwise it is not. The third uses the Boyer-Moore (BM) algorithm [21][24] instead of KMP.
Figure 4-14 find all common substring with the common way
The first algorithm is simple and intuitive. We take each character a[m] of a in turn as a starting point. If a[m] is a special character, we skip it and move to the next one. Otherwise, we compare the characters a[i], starting from a[m] and running to the end of a, with the characters b[j], scanned from the head to the tail of b. When a[i] and b[j] are the same and b[j] is not a special character, we record the start point with a pivot p and continue with the next a[i] and the next b[j] until they differ or b[j] is a special character, at which point another pivot e records the end point. The substring from b[p] to b[e] is a common substring of a and b. If its length is l, we put it into the candidate set C. If the comparison reaches the end of string a or string b while the pivot p is marked, the common substring lies at the end of one of the strings; again, if its length is l, we put it into C. Otherwise we move on to the next a[m]. When a[m] is the last character of a, the comparison ends, and C contains all common substrings of length l of the two strings. The pseudo code is shown in Figure 4-14.
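One possible reading of this first procedure is sketched below. It is an illustration, not the pseudo code of Figure 4-14; the names `SPECIAL` and `common_fixed_length_bruteforce` are hypothetical, and the sketch assumes that a run is extended forward from each pair of starting positions and kept only when its length is exactly l.

```python
SPECIAL = frozenset(";/&%?=")  # special characters from Section 4.2.1.2


def common_fixed_length_bruteforce(a: str, b: str, l: int) -> set:
    """Collect common substrings of a and b whose forward run has length l."""
    candidates = set()
    for m in range(len(a)):
        if a[m] in SPECIAL:
            continue  # a special character cannot start a candidate
        for j in range(len(b)):
            k = 0  # length of the run starting at a[m] and b[j]
            while (m + k < len(a) and j + k < len(b)
                   and a[m + k] == b[j + k]
                   and b[j + k] not in SPECIAL):
                k += 1
            # The run stopped at a mismatch, a special character, or EOS.
            if k == l:
                candidates.add(b[j:j + k])
    return candidates


if __name__ == "__main__":
    a = "/p;sid=ABCD1234?x=1"   # made-up example URLs
    b = "/q;sid=ABCD1234&y=2"
    print(common_fixed_length_bruteforce(a, b, 8))
```

Under this reading the tail of a longer run also counts as a run of its own, so for a small l the set can contain fragments of longer common substrings.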
Figure 4-15 the algorithms using KMP for common substring
The second and third algorithms rely on pattern matching. The main idea is that the session ID must be a substring of both strings, so a pattern taken from one string that is the session ID will be matched exactly in the other. We therefore use well-known pattern matching algorithms.
The second algorithm uses the KMP algorithm. Its pseudo code is shown in Figure 4-15.
From the string b, we take a substring ss of length l as the pattern. The pattern must contain no special characters, and the character immediately following it must be a special character. If the following character is not in the special character set, the pattern does not end where a URL component ends; in other words, the pattern is not the session ID. We then use the KMP algorithm to check whether the pattern is matched in a. If it is, the pattern is added to the candidate set C; if not, we repeat the process with the next pattern. The third algorithm replaces KMP with the BM algorithm. Its pseudo code is shown in Figure 4-16.
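The pattern-based procedure can be sketched with a textbook KMP matcher. This is a minimal illustration rather than the pseudo code of Figure 4-15: the function names are hypothetical, and accepting EOS after the pattern is an assumption taken from the feature list in Section 4.2.1.2.

```python
SPECIAL = frozenset(";/&%?=")  # special characters from Section 4.2.1.2


def kmp_failure(pattern: str) -> list:
    """fail[i] = length of the longest proper border of pattern[:i+1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find(text: str, pattern: str) -> int:
    """Return the index of the first match of pattern in text, or -1."""
    if not pattern:
        return 0
    fail = kmp_failure(pattern)
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


def common_fixed_length_kmp(a: str, b: str, l: int) -> set:
    """Take each length-l window of b as a pattern and search for it in a."""
    candidates = set()
    for start in range(len(b) - l + 1):
        end = start + l
        ss = b[start:end]
        if any(ch in SPECIAL for ch in ss):
            continue  # the pattern must contain no special characters
        if end < len(b) and b[end] not in SPECIAL:
            continue  # assumption: next character is special or EOS
        if kmp_find(a, ss) != -1:
            candidates.add(ss)
    return candidates
```

Because the matcher is isolated in `kmp_find`, the third algorithm corresponds to swapping that one function for a Boyer-Moore matcher while leaving the window loop unchanged.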
Figure 4-16 the algorithms using BM for common substring
After testing the performance of the three algorithms, we obtained the best result with the third one, which uses Boyer-Moore pattern matching to find the common substrings of fixed length l. It is shown in Figure 4-17. The evaluation is discussed in Chapter 6.
Figure 4-17 we use BM for our L length common string finding algorithm
4.2.1.4. Find the Session ID
We now discuss the session ID further. An HTML Web page usually contains not only static links, which point to a location without parameters, but also dynamic links, which consist of a web location, a number of parameters, and the session ID. The average length of static links is therefore usually shorter than that of dynamic links. If we find all common substrings between two dynamic links that have the same length as the session ID and take them as candidates, one of them must be the real session ID.
We therefore choose the longest links in the page first and find all of their common substrings of a given length e, which we call session ID candidates. Each candidate c has a corresponding integer d that records how many other links contain the same string. Whenever a link contains the candidate c, d increases by one. In the end, the candidate with the largest value of d is the session ID we want. The pseudo code is shown in Figure 4-18. If the candidate set is empty at the beginning, at least one of the two longest links we chose is a static link. We then drop the longest link and take the two longest of the remaining links, repeating until we obtain candidates.
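The selection and voting steps can be sketched as follows. This is an illustration under stated assumptions, not the pseudo code of Figure 4-18: the names `fixed_length_candidates` and `find_session_id` are hypothetical, Python's built-in substring test stands in for the BM matcher, and d is counted over all links of the page.

```python
SPECIAL = frozenset(";/&%?=")  # special characters from Section 4.2.1.2


def fixed_length_candidates(a: str, b: str, e: int) -> set:
    """Length-e windows of b with no special characters that also occur in a."""
    found = set()
    for start in range(len(b) - e + 1):
        end = start + e
        window = b[start:end]
        if any(ch in SPECIAL for ch in window):
            continue
        if end < len(b) and b[end] not in SPECIAL:
            continue  # must be followed by a special character or EOS
        if window in a:  # stand-in for the BM pattern match
            found.add(window)
    return found


def find_session_id(links: list, e: int):
    """Return the candidate contained in the most links, or None."""
    ordered = sorted(links, key=len, reverse=True)  # longest links first
    candidates = set()
    while len(ordered) >= 2 and not candidates:
        candidates = fixed_length_candidates(ordered[0], ordered[1], e)
        if not candidates:
            ordered = ordered[1:]  # drop the longest link and retry
    if not candidates:
        return None
    # d for each candidate c: number of links that contain c.
    d = {c: sum(1 for link in links if c in link) for c in candidates}
    return max(d, key=d.get)


if __name__ == "__main__":
    # Made-up links; the host is deliberately shorter than e so that
    # no fragment of the host name qualifies as a candidate.
    links = [
        "http://ex.test/shop/list;sid=ABCD1234?cat=books&page=2",
        "http://ex.test/shop/item;sid=ABCD1234?id=17&ref=list",
        "http://ex.test/cart;sid=ABCD1234",
        "http://ex.test/about.html",
    ]
    print(find_session_id(links, 8))
```

The sketch also makes the weakness of the vote visible: a host name or path component of length e or more that appears in every link would collect at least as many votes as the real session ID.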
Figure 4-18 the algorithm of get session id
With this algorithm we can easily find the session ID shared between URLs, but the mechanism has some limitations. If the page at the base URL has few dynamic links, the algorithm might misidentify the session ID. If the static links in the page are more numerous and longer than the dynamic ones, it might also make a mistake.
The evaluation and further details are presented in Chapter 6.