In this chapter, we present the software architecture of the stream-based mail proxy. The implementation of this architecture is described in chapter 4.
The system is designed to achieve the following goals:
Scalability: Stream-based processing interleaves file decompression and virus scanning on file segments without storing the entire file. This greatly reduces the buffer space requirement, so a large number of connections can be supported.
Performance: A storage-based system such as AMaViS often calls external commands to decompress files and scan for viruses. AMaViS also needs to cooperate with MTAs, so three daemons in total run on the system at the same time. The stream-based system instead calls shared library functions to decompress files and scan for viruses, and it is implemented in a single-process architecture, which eliminates the overheads of context switching and inter-process communication. Stream-based processing also eliminates the file access overheads, which are especially large in the AMaViS daemon described in chapter 2.
Extensibility: Because the modules are separated, the system should be able to integrate new network protocols easily. Besides SMTP and POP3, other mail services such as IMAP could be integrated in the future.
Transparency: The system transparently monitors every connection between the internal and external networks. The communicating parties need not be aware of the system.
3.1 System overview
Figure 1 shows the overview of our system. The thin lines represent the direction of protocol communication, while the bold lines represent the direction of mail transmission. First, a dispatcher intercepts the packets from the user and redirects them to the corresponding protocol handler. For example, the dispatcher redirects connections with destination port 25 to the SMTP daemon. The SMTP/POP3 handler communicates with the user and the server simultaneously. After the protocol communication, the mail is ready to be sent. SMTP and POP3 differ in the direction of mail transmission.
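The dispatching step can be illustrated with a minimal sketch. It only models the lookup from destination port to protocol handler; the handler functions, the connection identifiers, and the pass-through behaviour for other ports are assumptions made for illustration and are not taken from the implementation. Ports 25 and 110 are the standard SMTP and POP3 ports.

```python
from typing import Callable, Dict

# Hypothetical handler type: receives a connection identifier and
# returns a short description of what happened to the connection.
Handler = Callable[[str], str]


def smtp_handler(conn_id: str) -> str:
    return f"SMTP handler took {conn_id}"


def pop3_handler(conn_id: str) -> str:
    return f"POP3 handler took {conn_id}"


# Destination port -> protocol handler. The table layout is an
# assumption of this sketch; a new protocol would add one entry.
HANDLERS: Dict[int, Handler] = {
    25: smtp_handler,   # standard SMTP port
    110: pop3_handler,  # standard POP3 port
}


def dispatch(conn_id: str, dest_port: int) -> str:
    """Redirect an intercepted connection to the matching handler."""
    handler = HANDLERS.get(dest_port)
    if handler is None:
        # Assumed behaviour: traffic for other ports is left untouched.
        return f"passed through {conn_id}"
    return handler(conn_id)
```

Keeping the mapping in a table mirrors the extensibility goal: supporting another mail protocol would mean registering one more handler rather than changing the dispatcher.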
The data may be encoded or compressed. Attachments in a mail are encoded with MIME encoding, so mail services such as POP3 and SMTP need a MIME parser. The decoded attachment may be a compressed file, in which case the on-the-fly decompression engine decompresses it. After preprocessing, the system holds a block, or segment, of partial data from the attached file and scans it with the virus scanner. If there is no virus, the original data read from the sender is forwarded to the receiver. If the mail contains a virus, the proxy can break the connection immediately and send a notification to the user.
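The decode, decompress, scan, and forward sequence can be sketched as a generator that works on one segment at a time. This is an illustration under stated assumptions, not the proxy's code: Base64 stands in for the MIME encoding, a zlib stream stands in for the compressed file format, and the scanner is a toy byte-pattern match against a made-up signature. The sketch keeps a short tail of the previous segment so that a pattern split across two segments is still found.

```python
import base64
import zlib
from typing import Iterable, Iterator

SIGNATURE = b"EVIL-PATTERN"  # hypothetical stand-in for a real virus signature


class VirusFound(Exception):
    """Raised when the toy scanner matches the signature."""


def scan_stream(encoded_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Yield each original chunk once the data it carries has scanned clean."""
    pending = b""                    # Base64 characters not yet forming a 4-char group
    inflater = zlib.decompressobj()  # incremental decompressor (assumed format)
    tail = b""                       # end of the previous segment
    for chunk in encoded_chunks:
        data = pending + b"".join(chunk.split())  # drop line breaks
        usable = len(data) - len(data) % 4        # Base64 decodes in groups of 4
        decoded = base64.b64decode(data[:usable])
        pending = data[usable:]
        # Scan the new segment together with the retained tail.
        segment = tail + inflater.decompress(decoded)
        if SIGNATURE in segment:
            raise VirusFound("signature matched; stop forwarding")
        tail = segment[-(len(SIGNATURE) - 1):]
        yield chunk                               # forward the original data
```

Only `pending`, `tail`, and the decompressor state persist between segments, so the memory held per connection does not grow with the size of the attachment. A caller would stop forwarding and close the connection when `VirusFound` is raised.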
3.2 Processing workflow
This section presents the detailed workflow of processing one mail, which is the same for SMTP and POP3. A MIME-encoded mail is composed of the mail header followed by several pairs of a MIME header and a MIME body. The MIME header is distinct from the mail header. Figure 3 shows the composition of a mail. The mail body and the attachments are each encoded into a MIME body by one of several encoding methods defined in RFC 2045[25]. Common encoding methods of a MIME body are UUE, Base64, quoted-printable, etc. The MIME header contains information about the MIME body, such as the encoding method, the data type, and the filename of the attachment.
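To make the three MIME header fields concrete, the following sketch reads them from a small hand-written message using Python's standard email package. The message, its addresses, and the attachment name are invented for illustration. Unlike the proxy, this sketch parses a complete message held in memory; it is meant only to show where the encoding method, the data type, and the filename appear.

```python
from email import message_from_string

# Hand-written sample message (illustrative only).
RAW_MAIL = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: report\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "See the attached file.\n"
    "--XYZ\n"
    'Content-Type: application/octet-stream; name="tool.exe"\n'
    "Content-Transfer-Encoding: base64\n"
    'Content-Disposition: attachment; filename="tool.exe"\n'
    "\n"
    "TVqQAAMAAAAEAAAA\n"
    "--XYZ--\n"
)


def describe_parts(raw):
    """Return (is_mime, [(data type, encoding method, filename), ...])."""
    msg = message_from_string(raw)
    # The mail header says whether the mail is MIME encoded.
    is_mime = msg["MIME-Version"] is not None
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # the container itself carries no body data
        parts.append((
            part.get_content_type(),
            # 7bit is the default when no encoding is declared.
            part.get("Content-Transfer-Encoding", "7bit"),
            part.get_filename(),
        ))
    return is_mime, parts
```

For the sample above, the first pair describes the mail body and the second describes the attachment, whose filename and encoding method are what the later processing steps depend on.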
Fig.3 Composition of a MIME encoded mail
Processing the mail header
The mail header is the first part in every mail. The mail header parser reads the header from the raw buffer and checks whether the mail is MIME encoded. If it is MIME encoded, the MIME parser is ready to parse the MIME header and the MIME body.
Process mail body
The mail body immediately follows the mail header. A body parser can be put here to check whether the body is spam and whether it contains malicious links or Java/VB scripts. The body parser may modify the mail body to remove such malicious content. Since we only care about viruses in attachments, the mail body is simply forwarded to the destination. There is no body parser in our implementation.
Process mail attachments
Fig.4. Process mail attachments
Attachments are mostly encoded and may be compressed. Figure 4 shows the overall workflow of processing attachments. First, the MIME parser gets the file name from the MIME header. According to the file name, the proxy processes the attachment in one of three ways: (a) Non-malicious files, identified by the file extension, such as “*.jpg” and “*.txt”, can be ignored because they cannot carry viruses. (b) File types that may carry viruses, such as executable files like “*.exe” and other file types like “*.doc”, need to be scanned. (c) If the type shows that the file is compressed, the proxy needs to decompress it before scanning. The decompressed data must also be examined to determine whether it may contain viruses: a “file recognizer” analyzes the decompressed data to decide the subsequent processing. If the decompressed data contains another compressed file, the system needs to decompress recursively. The sizes of intermediate buffers such as “decoded” and “decompressed” are not directly proportional to the size of the attachment. These buffers are created per mail. The size of the “decompressed” buffer is determined by the compression ratio and the content being decompressed.
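The three-way decision and the recursive handling of nested compressed files can be sketched as follows. Several points are assumptions of this sketch: the “.zip” and “.gz” extensions, the rule that unknown types are scanned, and the modelling of an archive as a name paired with a list of members. File names also stand in for the file recognizer, which in the text analyzes the decompressed data itself.

```python
import os
from enum import Enum


class Route(Enum):
    FORWARD = "forward without scanning"
    SCAN = "scan for viruses"
    DECOMPRESS = "decompress, then classify the contents"


# ".jpg" and ".txt" come from the text; the compressed types are assumed.
SKIP_EXTENSIONS = {".jpg", ".txt"}
COMPRESSED_EXTENSIONS = {".zip", ".gz"}


def classify(filename):
    """Pick one of the three processing routes from the file extension."""
    ext = os.path.splitext(filename.lower())[1]
    if ext in SKIP_EXTENSIONS:
        return Route.FORWARD
    if ext in COMPRESSED_EXTENSIONS:
        return Route.DECOMPRESS
    # Assumed default: anything else (".exe", ".doc", unknown) is scanned.
    return Route.SCAN


def files_to_scan(entry):
    """List the names that reach the scanner, descending into archives.

    An entry is a plain name, or a (name, members) tuple for an archive
    (hypothetical in-memory model of a compressed file).
    """
    name, members = entry if isinstance(entry, tuple) else (entry, [])
    route = classify(name)
    if route is Route.FORWARD:
        return []
    if route is Route.SCAN:
        return [name]
    found = []
    for member in members:
        found.extend(files_to_scan(member))  # nested archives recurse
    return found
```

For example, an archive holding a text file, an executable, and an inner archive with a document and an image would send only the executable and the document to the scanner.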
When the virus scanner finds a virus in the attachment, the proxy drops the remaining data of the attachment. The destination will receive a broken attachment, so the user at the destination is protected from the virus.