Carleton University
Technical Report TR-00-04
May 2000
Structural Characterization of Popular Web Documents
Abstract
Characterizing Web documents is essential for studying performance issues such as minimizing the demands on back-end servers and the communication overhead. It is also important for devising synthetic workload generators used to investigate effective resource management algorithms. Most characterizations of Web documents are based on individual Web files and do not consider their inherent structure. To display a complete Web page, a collection of files, including the files corresponding to the objects embedded in the page, must be transferred. A Web object is defined to be this collection, i.e., a Web page and its related embedded files. Our goal in conducting this study is to collect data on the structure and size of Web objects. Such data are particularly useful for improving Web server performance through techniques such as clustering of files, parallel I/O, and data caching at client sites. We report the results of an empirical study conducted on several popular Web sites, each ranked among the top 100 sites. We chose popular Web sites for this investigation because they are more likely to be efficiently designed. In addition, popular Web servers account for a significant portion of network traffic. We also study a trace from the access log of a busy proxy to characterize Web objects for regular Web environments.