Image semantics discovery from web pages for semantic-based image retrieval using self-organizing maps


Hsin-Chang Yang a,*, Chung-Hong Lee b

a Department of Computer Science and Information Engineering, Chang Jung University, Tainan 711, Taiwan

b Department of Electrical Engineering, National Kaohsiung University of Applied Science, Kaohsiung, Taiwan

Abstract

Traditional content-based image retrieval (CBIR) systems often fail to meet a user’s need due to the ‘semantic gap’ between the features extracted by the systems and the user’s query. The cause of the semantic gap is the failure to extract real semantics from an image and from the query. Extracting the semantics of images, however, is a difficult task. Most existing techniques apply some predefined semantic categories and assign the images to appropriate categories through some learning process. Nevertheless, these techniques always need human intervention and rely on content-based features. In this paper we propose a novel approach to bridge the semantic gap, which is the major deficiency of CBIR systems. We address this deficiency by extracting the semantics of an image from the environmental texts around it. Since an image generally co-exists with accompanying texts in various formats, we may rely on such environmental texts to discover the semantics of the image. We apply a text mining process, which adopts the self-organizing map (SOM) learning algorithm as a kernel, to the environmental texts of an image to extract semantic information about this image. Some implicit semantic information about the images can be discovered after the text mining process. We also define a semantic relevance measure to achieve the semantic-based image retrieval task. We performed experiments on a set of images collected from web pages and obtained promising results.

Keywords: Semantic-based image retrieval; Text mining; Self-organizing map

1. Introduction

Recently the task of image retrieval has received a great deal of attention from the web community, since there are so many useful images on web pages. Image retrieval is a branch of information retrieval, whose task is to retrieve pieces of information (the documents) that fulfill a user’s information needs according to certain (semantic) relevance measurements. Currently most information retrieval systems retrieve documents based on their ‘contents’. That is, they measure the relevance between the query and a document according to internal representations or derived features. Such representations or features vary with document styles and retrieval schemes. For text retrieval systems, the contents are often represented by a set of selected keywords that are intended to capture the semantics of the documents. Many studies have successfully represented the semantics of text documents (Baeza-Yates & Ribeiro-Neto, 1999). For image retrieval systems, the representation of image content generally contains a set of visual features extracted from the image that are hoped to represent the image effectively. Many schemes have been proposed to describe the image contents (De Marsicoi, Cinque, & Levialdi, 1997; Doermann, 1998; Gupta & Jain, 1997). However, the semantics of an image are difficult to reveal with these features, so that irrelevant images are retrieved even when they have similar features. For example, an image with 70% green color and 30% blue color could be either a scenic view of a meadow or a book cover. Another example is a round object with a hole in it, which could be either a wheel or a compact disc. These examples show that a user may easily obtain images that are totally irrelevant to the query through CBIR approaches. This difference between the user’s information need and the image representation is called the ‘semantic gap’ in CBIR systems. Thus CBIR systems only work well in relatively small domains of data sets, and to obtain more reasonable results, semantic-based image retrieval (SBIR) systems must be devised to bridge the semantic gap.

* Corresponding author.

E-mail addresses: [email protected] (H.-C. Yang), leechung@mail.ee.kuas.edu.tw (C.-H. Lee).

URL: http://www.im.cju.edu.tw/hcyang (H.-C. Yang).

www.elsevier.com/locate/eswa

Expert Systems with Applications 34 (2008) 266–279

Unlike CBIR systems, which use ‘contents’ to retrieve images, SBIR systems try to discover the real semantic meaning of an image and use it to retrieve relevant images. However, understanding and discovering the semantics of a piece of information are high-level cognitive tasks, and thus hard to automate. Several attempts have been made to tackle this problem. Most of these methods use CBIR techniques, such that primitive features are used to derive higher-order image semantics. However, CBIR systems use no explicit knowledge about the image, which limits their applications to fields such as fingerprint identification and trademark retrieval. To increase user satisfaction with query results, we must incorporate more semantics into the retrieval process. However, there are three major difficulties in doing so. The first is that we must have some kind of high-level description scheme for describing the semantics. The second is that a semantics extraction scheme is necessary to map visual features to high-level semantics. Finally, we must develop a query processing and matching method for the retrieval process. Many techniques have been devised to remedy these difficulties, as we discuss later. In this work we propose a novel approach that addresses these difficulties within a simple framework. First we incorporate explicit knowledge into the framework by representing the images with their surrounding texts in the web pages. Such a representation also solves the difficulty of semantic representation. The semantic extraction process is achieved in our framework by applying a text mining process to these texts. We also design a semantic relevance measure for matching the user’s query against the images in the image collection, which solves the third difficulty. Our idea comes from the recognition that it is too difficult to extract semantics directly from images. Thus we avoid direct access to the image contents, which is generally time-consuming and imprecise. Instead, we try to obtain image semantics from their environmental texts, which are generally contextually relevant to the images.

The paper is organized as follows. In Section 2 we review some related work on semantic-based image retrieval and text mining. We then describe the text mining process and its application to semantic image retrieval in Section 3. In Section 4 we show some experimental results, and in the last section we give some conclusions.

2. Related studies

We review work on several aspects related to our approach.

2.1. Text mining studies

First we discuss related work on the text mining technique used in this work. Several research efforts have aimed at enhanced text mining techniques that make the system and the mining process more effective. Feldman and colleagues use text category labels (associated with the Reuters newswire service) to find ‘unexpected patterns’ among text articles (Dagan, Feldman, & Hirsh, 1996; Feldman & Dagan, 1995; Feldman, Dagan, & Hirsh, 1998; Feldman, Klosgen, & Zilberstein, 1997). In their systems, all documents are labeled by keywords, and the knowledge discovery task is carried out by analyzing the co-occurrence frequencies of the various keywords labeling the documents. The main approach is to compare distributions of category assignments within subsets of the document collection. Another related project is the so-called On-line New Event Detection (Allan, Carbonell, Doddington, Yamron, & Yang, 1998), whose input is a stream of news stories in chronological order, and whose output is a yes/no decision for each story, made at the time the story arrives, indicating whether or not the story is the first reference to a newly occurring event. In other words, the system must detect the first instance of what will become a series of reports on an important topic. Although this can be viewed as a standard classification task (where the class is a binary assignment to the new-event class), it is more in the spirit of data mining, in that the focus is on discovery of the beginning of a new theme or trend. Both of the above approaches are built upon predefined text metadata associated with text documents, rather than on the text content itself. A project that aims at constructing methods for exploring full-text document collections, WEBSOM (Honkela, Kaski, Lagus, & Kohonen, 1996; Kaski, Honkela, Lagus, & Kohonen, 1998; Kohonen, 1998), originated from Honkela’s suggestion to use “self-organizing semantic maps” (Ritter & Kohonen, 1989) as a preprocessing stage for encoding documents. Such maps are, in turn, used to automatically organize (i.e. cluster) documents according to the words that they contain. When the documents are organized on a map in such a way that nearby locations contain similar documents, exploration of the collection is facilitated by the intuitive neighborhood relations. Thus, users can easily navigate a word category map and zoom in on groups of documents related to a specific group of words.
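The idea of organizing documents on a map so that similar documents fall on nearby locations can be illustrated with a minimal self-organizing map. The sketch below is not the WEBSOM implementation; the function names (`train_som`, `best_matching_unit`), the rectangular grid, the linear decay schedules, and the Gaussian neighbourhood are all assumptions chosen for brevity.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))

def train_som(vectors, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a tiny rectangular SOM on equal-length numeric vectors.

    Hypothetical helper: parameter names and decay schedules are
    illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One weight vector per map neuron, initialised with random values in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks as well
        for x in rng.sample(vectors, len(vectors)):  # shuffled presentation order
            winner = best_matching_unit(weights, x)
            for pos, w in weights.items():
                grid_dist = math.dist(pos, winner)
                if grid_dist <= radius:
                    # Gaussian neighbourhood: neurons nearer the winner move more.
                    h = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
    return weights

# Usage: binary document vectors; each document is labelled by its winning neuron.
docs = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
som = train_som(docs, rows=2, cols=2)
labels = [best_matching_unit(som, d) for d in docs]
```

After training, documents that share a winning neuron (or have neighbouring winners) can be treated as one cluster, which is the property that map-based browsing relies on.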

2.2. SBIR schemes

To describe image semantics, Eakins (1996) classified queries into three levels, according to their degree of abstraction. The lowest level, Level 1, comprises primitive features such as color, texture, and shape. Essentially this level uses no semantic information in an image. Most works on CBIR relate to this level. Level 2 comprises derived attributes which involve some degree of logical inference about the identity of the objects depicted in the image. This level includes object semantics such as object classes and spatial relations among objects. Systems which can resolve this level of queries are considered to perform retrieval by semantic content. The third level comprises retrieval by abstract attributes, involving a high degree of abstract, and possibly subjective, reasoning about the meaning and purpose of the objects or scenes depicted. Examples at this level include scene semantics, behavior semantics, and emotion semantics. Eakins’s classification is helpful for describing the capabilities and limitations of different retrieval techniques. We often refer to the retrieval tasks that fulfill Level 2 and Level 3 queries as semantic image retrieval (Gudivada & Raghavan, 1995), and to the difference between Level 1 and Level 2 as the “semantic gap”.

Most of the semantic extraction methods use a multi-level abstraction scheme, as suggested by Al-Khatib, Day, Ghafoor, and Berra (1999). This extraction process is divided into three layers, namely the feature extraction layer, the object recognition layer, and the semantic modeling and knowledge representation layer. The feature extraction layer deals with low-level image processing in order to find portions of the raw data which match the user’s requested pattern. In the object recognition layer, features are analyzed to recognize objects and faces in an image database. The major function of the last layer is to maintain the domain knowledge for representing spatial semantics associated with image databases. Since our method uses no image content in the retrieval process, we cannot classify it as pertaining to any of Eakins’s levels or Al-Khatib’s layers.

2.3. Image semantics representation

There are several types of representation schemes for image semantics. The first is textual representation schemes. Hermes, Klauck, Kreyb, and Zhang (1995) use a similarity technique to derive the natural language description of an outdoor scene image in the IRIS system. The textual description is generated by four sub-steps: extraction of features like colors, textures, and contours, segmentation, and interpretation of part-whole relations. Such a text description is then processed by traditional text retrieval techniques. However, such a textual representation is insufficient to represent the complex relations among concepts, and is only applicable to specific applications. The second type uses traditional knowledge representation schemes such as semantic networks, predicate logic, and frames. Recently several researchers have devised different models for semantics representation. Examples are the fuzzy boolean model (Zhuang, Mehrotra, & Huang, 1999), formal language theory (Colombo, DelBimbo, & Pala, 1999), fuzzy logic (Meghini, Sebastiani, & Straccia, 1997), and semiotic approaches (Cavazza, Green, & Palmer, 1998). These approaches are capable of representing and matching semantics in specific domains. However, there is still no approach that can be applied in the general domain.

2.4. Web image semantics extraction

The effectiveness of the proposed method relies on the correct annotation of an image. Several works adopt similar annotation schemes for image retrieval. Chen, Liu, Zhang, Li, and Zhang (2001) adopted a unified approach which combines low-level visual features, such as color, texture, and shape, with high-level semantic text features. The high-level text features they used are similar to those in this work, including image filename and URL, ALT text, surrounding text, page title, etc. Sclaroff, Cascia, Sethi, and Taycher (1999) proposed a system that combines textual and visual statistics in a single index vector for content-based search of a Web image database. In their method, textual statistics are captured in vector form using latent semantic indexing based on text in the containing HTML document. Visual statistics are captured in vector form using color and orientation histograms. The system assigns different weights to the words appearing in the title and headers and in the ALT fields of the img tags, along with words emphasized with different fonts like bold and italics. However, these weight values were chosen heuristically, and proximity weighting was not considered. A number of other systems also try to assign texts to images to achieve Web image retrieval. WebSEEk, proposed by Chang, Smith, Beigi, and Benitez (1997), uses only Web URL addresses and HTML tags associated with the images to extract the keywords. Harmandas, Sanderson, and Dunlop (1997) use the text after an image URL, until the end of a paragraph or until a link to another image is encountered, as the caption of the image. The Marie-3 system (Rowe & Frew, 1998) uses text ‘near’ an image to identify a caption. ‘Nearness’ is defined as within a fixed number of lines in the parse of the source HTML file. WebSeer (Frankel, Swain, & Athitsos, 1996) defines the caption of an image to be the text in the same center tag as the image, within the same cell of a table as the image, or in the same paragraph. However, none of these techniques can perform reasonably well on all types of HTML pages.

2.5. Image semantics matching

To match the semantics between the query and an image in the database, Chang, Chen, and Sundaram (1998) proposed a semantic visual template model, which is used to approximate user queries. In this model, the user is asked to identify a possible range of colors, textures, shapes or motion parameters to express his or her query, which is then refined using relevance feedback techniques. When the user is satisfied, the query is given a semantic label (such as “sunset”) and stored in a query database for later use. Over time, this query database becomes a kind of visual thesaurus, linking each semantic concept to the range of primitive image features most likely to retrieve relevant items. However, this method requires an initial sketch, which is hard for naive users to generate. In addition, generating a semantic visual template is time-consuming and needs human intervention, making it unsuitable for large datasets. Dori and Hel-Or (1998) proposed the visual-object process diagram (VOPD), which incorporates both low-level image features and texture key sentences as descriptors of the image. Querying is performed by representing the sought image with a VOPD and finding the images in the database whose VOPDs best match the query. However, generation of VOPDs for both the query and the images in the database is a manual task which is time-consuming and requires knowledge about the CASE tool. Ojala, Rautiainen, Matinmikko, and Aittola (2001) used HSV autocorrelograms for image retrieval. However, the semantics in their work was assigned manually rather than automatically. Zhao and Grosky (2002) adopted the latent semantic indexing (LSI) technique (Deerwester, Dumais, Furnas, & Landauer, 1990) to find the latent correlation between low-level visual features and high-level semantics. Srihari and Burhans (1994) proposed a visual semantics model which is closely related to our work. However, although they discover image semantics from the text accompanying the images as we do, only picture captions are used, which restricts their system, PICTION, to limited domains. Wu, Sitharama Iyengar, and Zhu (2001) adopt SOM learning for web image retrieval. They use various kinds of features such as textual features, color histogram, image entropy, shape, and texture features. However, they did not provide evaluations of the results. Laaksonen, Koskela, Laakso, and Oja (2000) proposed the PicSOM system, which uses tree-structured self-organizing maps for CBIR and uses no semantic features. The major difference between the proposed method and the above two methods is the text mining process, which is the key to bridging the semantic gap. Chen et al. (2001) proposed a web mining approach for image retrieval. They analyzed the user feedback log to improve image representations and discover the relationship between low-level and high-level features. Their method belongs to the log mining category of web mining techniques, and is different from our method, which belongs to the content mining category.

3. Text mining for semantic image retrieval

In this section we give a detailed description of the proposed method, starting from the preprocessing and encoding of documents in Section 3.2. The encoded documents are then trained by the self-organizing map algorithm in Section 3.3. In that section a text mining process is also applied to the training results to discover the relationships among images as well as the subjects of the images, which are considered as a kind of semantics of the images. Finally, we use the text mining results to retrieve images in Section 3.4.

3.1. System overview

We first give a brief sketch of the proposed method here. The functional components and the processing flow are depicted in Fig. 1. In the preprocessing and feature generation stages, we collect suitable web pages and transform them into feature vectors, which are indexed and stored in a database. Next we apply the proposed text mining process to this set of feature vectors, and obtain the image cluster map (ICM) and keyword cluster map (KCM). In the final stage we match the user query with each image and obtain the ranking of the images. The dotted lines enclose the functional components of each stage.

3.2. Document preprocessing

A document in our corpus is a typical web page, which contains both texts and images. A web page generally exists in HTML format which uses a set of predefined tags to perform actions such as typesetting the page, inserting hyperlinks and multimedia objects, etc. In this article we are interested in image retrieval, so we first extract images from the documents, and then extract environmental texts from the web pages corresponding to each extracted image. Note that a web page is discarded if it contains no qualified image or environmental texts.

3.2.1. Extracting images from web pages

In HTML the <img> tag is used to add images to web pages. Thus we can extract images from the web pages by examining the <img> tags. We discard extremely small images because they often contain little information. For example, many web pages contain small icons such as buttons and rulers that are used for alignment, segmentation, marking, etc. The threshold of image size is determined experimentally by collecting all images and then determining a threshold that can discriminate such small images. A typical threshold value is a few kilobytes. Other than image size, there is almost no limit on the image properties in our method, since we do not directly extract features from the images. Notice that we treat duplicated images as separate ones to allow different interpretations of the same image.
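The extraction step described above can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions rather than the system used in the experiments: the names `ImgCollector` and `extract_images`, the `sizes_in_bytes` lookup (standing in for whatever mechanism reports each image's file size), and the 2048-byte default threshold are all hypothetical.

```python
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collect the src and alt attributes of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag == "img":
            attr = dict(attrs)
            if attr.get("src"):
                self.images.append({"src": attr["src"],
                                    "alt": attr.get("alt") or ""})

def extract_images(html, sizes_in_bytes, min_bytes=2048):
    """Return the images of a page whose file size reaches the threshold.

    sizes_in_bytes is an assumed lookup from image URL to file size;
    images with unknown size are treated as too small and dropped.
    Duplicated images are kept as separate entries.
    """
    collector = ImgCollector()
    collector.feed(html)
    return [img for img in collector.images
            if sizes_in_bytes.get(img["src"], 0) >= min_bytes]

# Usage with a toy page: the small icon is discarded, the duplicate is kept.
page = '<p><img src="a.jpg" alt="A lake"><img src="dot.gif"><IMG SRC="a.jpg"></p>'
kept = extract_images(page, {"a.jpg": 50000, "dot.gif": 120})
```

Keeping the list unduplicated would lose the property noted above, namely that the same image may appear in different contexts and deserve different interpretations.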

3.2.2. Extracting environmental texts

The remaining part of the web page is further processed to extract the necessary texts. These texts are called the environmental texts (ETs) of the images. There are several types of environmental texts related to an image. For example, we can identify nearby texts, captions, alternate texts, and filenames from an HTML-format web page. In this article, two types of ETs are extracted, namely ETCaption and ETNormal. ETCaption includes texts that are regulated by HTML tags, such as the image URL and ALT text. On the other hand, ETNormal consists of ordinary texts which are located outside any HTML tags. ETCaption is [...] extraction of ETNormal is considerably difficult, since there exists much ambiguity in the texts that surround an image. That is, it is rather difficult to determine which part of the texts is relevant to the image. To resolve such ambiguity, we adopt the scheme that was used by Mukherjea and Cho (1999). In their work, they used criteria such as visual distance, syntactic distance, regular patterns in a table, and groups of images to assign text to images. According to their report, the visual distance, regular pattern, and image group criteria have high accuracy in assigning text to images. Therefore, we extract ETNormal according to these criteria.

When both the ETNormal and the ETCaption associated with each image have been extracted, a word extractor is used to segment these ETs into a list of terms. We discard the terms that appear in a standard stoplist to reduce the vocabulary size. We also discard terms that occur only once in the ETs. We use the resulting list of terms to represent the image. We call these terms the environmental keywords (EKs) of their associated image. Fig. 2 shows the ETs and EKs of an image extracted from an example web page.
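The keyword extraction just described (segmentation, stoplist filtering, removal of terms that occur only once) can be sketched as follows. The tokenizer, the tiny `STOPLIST`, and the function names are illustrative assumptions; in particular, the sketch assumes that "occurring only once" is counted over the ETs of the whole collection, which the text does not state explicitly.

```python
import re
from collections import Counter

# Tiny illustrative stoplist; a real system would use a standard one.
STOPLIST = {"the", "for", "by", "of", "a", "an", "and", "in", "to"}

def tokenize(text):
    """Split text into lower-cased alphabetic terms (digits are dropped)."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def environmental_keywords(ets_per_image):
    """Map each image id to its ordered list of distinct keywords.

    ets_per_image maps an image id to the list of its ET strings.
    Assumption: term frequency is counted over the whole collection.
    """
    tokens = {img: [t for t in tokenize(" ".join(ets)) if t not in STOPLIST]
              for img, ets in ets_per_image.items()}
    counts = Counter(t for toks in tokens.values() for t in toks)
    # dict.fromkeys removes repeats while preserving first-seen order.
    return {img: list(dict.fromkeys(t for t in toks if counts[t] > 1))
            for img, toks in tokens.items()}

# Usage: the first image reuses the ETs shown in Fig. 2; the second is invented.
ets = {
    "img1": ["The Fleurieu Prize for Australian Landscape 2002.",
             "by Joe Furlonger",
             "../images/2002winner_furlonger.jpg"],
    "img2": ["Landscape prize winner", "../images/landscape.jpg"],
}
eks = environmental_keywords(ets)
```

With only two toy images, most terms of the first image are dropped as singletons; on a realistic collection the frequency filter removes far fewer terms per image.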

3.2.3. Encoding images

We adopt a binary vector representation similar to the vector space model to encode the images. All the ETs associated with an image are collectively transformed into a binary vector in which each component corresponds to an EK in the vocabulary. A component with value 1 or 0 indicates the presence or absence, respectively, of the corresponding EK in the associated document. We do not use any term weighting method such as the tf·idf scheme (Baeza-Yates & Ribeiro-Neto, 1999).
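A minimal sketch of this binary encoding is given below. The function names and the choice of an alphabetically sorted vocabulary are assumptions made for illustration; any fixed ordering of the vocabulary would serve equally well.

```python
def build_vocabulary(eks_per_image):
    """Return the sorted list of all distinct keywords in the collection."""
    return sorted({term for eks in eks_per_image.values() for term in eks})

def encode_binary(eks, vocabulary):
    """Encode one image as a 0/1 vector: 1 if the keyword is present."""
    present = set(eks)
    return [1 if term in present else 0 for term in vocabulary]

# Usage with two hypothetical images and their environmental keywords.
eks_per_image = {
    "img1": ["prize", "landscape", "furlonger", "images", "winner", "jpg"],
    "img2": ["landscape", "prize", "winner", "images", "jpg"],
}
vocab = build_vocabulary(eks_per_image)
vectors = {img: encode_binary(eks, vocab) for img, eks in eks_per_image.items()}
```

Because no term weighting is applied, two images are distinguished only by which keywords they contain, not by how often a keyword occurs.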

One problem with this encoding method is that if the vocabulary is very large, then the dimensionality of the vector is also high. In practice, the resulting dimensionality of the space is often huge, since the number of dimensions is determined by the number of distinct index terms in the corpus. In general, feature spaces on the order of 1000 to 100000 dimensions are very common for even reasonably small collections of documents. As a result, techniques for controlling the dimensionality of the vector space are required. Such a problem could be solved, for example, by eliminating some of the most common and some of the rarest indexed terms

[Figure 1: flow diagram. Components shown: Yahoo! directory, Web spider, Web pages, preprocessing, environmental text extraction, document encoding, feature vectors, self-organizing map training, labeling process, ICM, KCM, user interface, query, similarity calculation, ranked result. Stages shown: preprocessing and feature generation, text mining process, matching process.]

Fig. 1. The processing stages and functional components of the proposed method.

[Figure 2: an example image with its environmental texts and environmental keywords.]

ET-Normal: “The Fleurieu Prize for Australian Landscape 2002.” “by Joe Furlonger”

ET-Caption: “../images/2002winner_furlonger.jpg”

Environmental keywords: Fleurieu, Prize, Australian, Landscape, Joe, Furlonger, images, winner, jpg

Fig. 2. The ETs and EKs of the indicated image in the web page with URL: http://www.artprize.com.au/fppages/1new_winners.htm. In ETNormal we only include the paragraph where the image occurs.


of web pages in the same level-1 category belong to the same class, and obtain g_local = 0.17, g_global = 0.73, and g_half = 0.84. For comparison, the average values of the three measurements reported in Laaksonen et al. (2000) are 0.067, 0.269, and 0.651, respectively. The result shows that our scheme outperforms traditional CBIR systems that use visual features.

5. Conclusions

In this work we propose a novel approach for semantic-based image retrieval, using a text mining approach to discover semantically related images. Unlike other semantic-based image retrieval approaches, we avoid direct analysis of images and rely on their environmental texts to discover their semantics. The reason for this approach is that it is hard to extract semantics directly from images, which requires human intervention to provide such semantic information. Our approach reduces this need, although it relies on precise segmentation of the environmental texts. In this work we design several criteria to extract the environmental texts which we believe convey the most semantic information about the image. These criteria include the <img> tag in the HTML source, captions, nearby texts, etc. The environmental texts of every image are further clustered according to their semantic similarities through the self-organizing map algorithm, which is a part of the text mining process. The result of the text mining process contains two maps which reveal the semantic relationships among images and among keywords, respectively. We then use these maps to perform image retrieval tasks.

In this work there are two types of retrieval tasks, namely retrieval by keywords and retrieval by examples. The first type uses the relationships among keywords revealed by the text mining process to retrieve images relevant to the user-specified query words. In the query-by-examples mode, the user presents an example image and the system retrieves the relevant images according to the image relationships discovered by the text mining process. The approach was tested on a set of web pages obtained from a part of the Yahoo! hierarchy, since that hierarchy provides sufficient semantic information for the later evaluation process. We performed two types of evaluation. One is the average precision versus recall curve and the other is a semantic evaluation measurement. Both evaluation schemes produced promising results.

We provide some discussion of the restrictions and further development here. It is clear that our method can only be applied to documents with both images and texts, and Web pages are ideal for this method. However, collections of skeleton images such as medical images or photo albums would not be suitable for our method unless some kind of annotation were added to these images manually or automatically. In fact, automatic annotation of images is another application of our text mining process, and in the near future it will be integrated into this work to overcome the skeleton image problem.

It is difficult to extract ETs precisely, and so both the training results and the retrieval results may deteriorate. A plausible remedy is to use a semi-automatic process that allows users to segment ETs on their own. Another approach would be to use a relevance feedback process to learn and modify the criteria used in the extraction of ETs. Both approaches will be developed in our future work.

References

Al-Khatib, W., Day, F., Ghafoor, A., & Berra, P. B. (1999). Semantic modeling and knowledge representation in multimedia databases. IEEE Transactions on Knowledge and Data Engineering, 11(1), 64–80.

Allan, J., Carbonell, J., Doddington, G., Yamron, J., & Yang, Y. (1998). Topic detection and tracking pilot study: Final report. In Proc. DARPA broadcast news transcription and understanding workshop, Lansdowne, VA (pp. 194–218).

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval (1st ed.). ACM Press.

Cavazza, M., Green, R. J., & Palmer, I. J. (1998). Multimedia semantic features and image content description. In Proc. 1998 multimedia modeling, Lausanne, Switzerland (pp. 39–46).

Chang, S., Smith, J., Beigi, M., & Benitez, A. (1997). Visual information retrieval from large distributed online repositories. Communications of ACM, 40, 63–71.

Chang, S. F., Chen, W., & Sundaram, H. (1998). Semantic visual templates: linking visual features to semantics. In Proc. IEEE international conference on image processing (ICIP’98), Chicago, IL (pp. 531–535).

Chen, Z., Liu, W., Zhang, F., Li, M., & Zhang, H. (2001). Web mining for web image retrieval. Journal of the American Society of Information Science, 52(1), 831–839.

Colombo, C., DelBimbo, A., & Pala, P. (1999). Semantics in visual information retrieval. IEEE Multimedia, 6(3), 38–53.

Cox, T. F., & Cox, M. A. A. (1994). Multidimensional Scaling. London, UK: Chapman & Hall.

Dagan, I., Feldman, R., & Hirsh, H. (1996). Keyword-based browsing and analysis of large document sets. In Proc. 5th annual symposium on document analysis and information retrieval (SDAIR), Las Vegas, NV (pp. 191–208).

Deerwester, S., Dumais, S., Furnas, G., & Landauer, K. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 40(6), 391–407.

Table 6
The 20 query words/phrases used in human evaluation process

Mountains | Fishes | Lakes | Iris flowers
Deserts | Notebook computers | Laser printers | LCD screens
Keyboards | RAM | Bill Clinton | Michelangelo Buonarroti
Claude Monet | Ludwig Van Beethoven | Tom Cruise | Planet Jupiter


De Marsicoi, M., Cinque, L., & Levialdi, S. (1997). Indexing pictorial documents by their content: A survey of current techniques. Image and Vision Computing, 15, 119–141.

Doermann, D. (1998). The indexing and retrieval of document images: a survey. Computer Vision and Image Understanding, 70(3), 287–298.

Dori, D., & Hel-Or, H. (1998). Semantic content based image retrieval using object-process diagrams. In Proc. 7th international workshop on structural and syntactic pattern recognition (SSPR98), Sydney, Australia (pp. 15–30).

Eakins, J. P. (1996). Automatic image content retrieval: Are we going anywhere? In Proc. 3rd international conference on electronic library and visual information research, Milton Keynes, UK (pp. 123–135).

Feldman, R., & Dagan, I. (1995). Kdt – knowledge discovery in texts. In Proc. 1st annual conference on knowledge discovery and data mining (KDD), Montreal, Canada (pp. 112–117).

Feldman, R., Dagan, I., & Hirsh, H. (1998). Mining text using keyword distributions. Journal of Intelligent Information Systems, 10, 281–300.

Feldman, R., Klosgen, W., & Zilberstein, A. (1997). Visualization techniques to explore data mining results for document collections. In Proc. 3rd annual conference on knowledge discovery and data mining (KDD), Newport Beach, CA (pp. 16–23).

Frankel, C., Swain, M., & Athitsos, V. (1996). Webseer: an image search engine for the world-wide web. Tech. Rep. 94-14, Computer Science Department, University of Chicago, Chicago, IL.

Gudivada, V. N., & Raghavan, V. V. (1995). Content-based image retrieval system. IEEE Computer, 28(9), 18–22.

Gupta, A., & Jain, R. (1997). Visual information retrieval. Communications of the ACM, 40(5), 71–79.

Harmandas, V., Sanderson, M., & Dunlop, M. (1997). Image retrieval by hypertext links. In Proc. 20th international ACM SIGIR conference on research and development in information retrieval (pp. 296–303).

Hermes, T., Klauck, C., Kreyb, J., & Zhang, J. (1995). Image retrieval for information systems. In Proc. SPIE, Vol. 2420 storage and retrieval for image and video databases III (pp. 394–405).

Honkela, T., Kaski, S., Lagus, K., & Kohonen, T. (1996). Newsgroup exploration with websom method and browsing interface. Tech. Rep. A32, Laboratory of Computer and Information Science, Helsinki University of Technology, Espoo, Finland.

Jolliffe, I. T. (1986). Principal component analysis. Berlin: Springer-Verlag.

Kaski, S., Honkela, T., Lagus, K., & Kohonen, T. (1998). Websom – self-organizing maps of document collections. Neurocomputing, 21, 101–117.

Kohonen, T. (1997). Self-organizing maps. Berlin: Springer-Verlag.

Kohonen, T. (1998). Self-organization of very large document collections: state of the art. In Proc. 8th international conference on artificial neural networks, London, UK, Vol. 1 (pp. 65–74).

Laaksonen, J., Koskela, M., Laakso, S., & Oja, E. (2000). Picsom: content-based image retrieval with self-organizing maps. Pattern Recognition Letters, 21(13-14), 1199–1207.

Meghini, C., Sebastiani, F., & Straccia, U. (1997). The terminological image retrieval model. In Proc. 9th international conference on image analysis and processing (ICIAP’97), Florence, Italy, Vol. 2 (pp. 156–163).

Mukherjea, S., & Cho, J. (1999). Automatically determining semantics for world wide web multimedia information retrieval. Journal of Visual Languages and Computing, 10, 585–606.

Ojala, T., Rautiainen, M., Matinmikko, E., & Aittola, M. (2001). Semantic image retrieval with hsv correlograms. In Proc. 12th Scandinavian conference on image analysis, Bergen, Norway (pp. 621–627).

Ritter, H., & Kohonen, T. (1989). Self-organizing semantic maps. Biological Cybernetics, 61, 241–254.

Rowe, N., & Frew, B. (1998). Automatic caption localization for photographs on world-wide web pages. Information Processing and Management, 34, 95–107.

Sclaroﬀ, S., Cascia, M. L., Sethi, S., & Taycher, L. (1999). Unifying textual and visual cues for content-based image retrieval on the world wide web. Computer Vision and Image Understanding, 75(1–2), 86–98.

Srihari, R. K., & Burhans, D. T. (1994). Visual semantics: extracting visual information from text. In Proc. 12th national conference on artiﬁcial intelligence, Seattle, WA (pp. 793–798).

Wu, Q., Sitharama Iyengar, S., & Zhu, M. (2001). Web image retrieval using self-organizing feature map. Journal of the American Society for Information Science and Technology, 52(1), 868–875.

Zhao, R., & Grosky, W. I. (2002). Narrowing the semantic gap: Improved text-based web document retrieval using visual features. IEEE Transactions on Multimedia, 4(2), 189–200.

Zhuang, Y., Mehrotra, S., & Huang, T. S. (1999). A multimedia information retrieval model based on semantic and visual content. In Proc. 5th international conference for young computer scientists (ICYCS), Nanjing, China.