The Web Science Trust

Dynamic characterization of a large Web graph

Bordino, Ilaria and Donato, Debora (2009) Dynamic characterization of a large Web graph. In: Proceedings of the WebSci'09: Society On-Line, 18-20 March 2009, Athens, Greece. (In Press)


Abstract

The Web is extremely dynamic, as shown by its rapid and significant growth over the last decade and by its continuous evolution through the creation and deletion of pages and hyperlinks. Consequently, analyzing the temporal evolution of the Web has become a crucial task that can provide search engines with valuable information for refining crawling policies, improving ranking models, or detecting spam. Understanding how the Web evolves over time is a delicate challenge that requires integrating theoretical modeling efforts with empirical results. Obtaining such empirical results is very expensive in terms of bandwidth, computation time, and human intervention. Robust software is required to gather the data and provide easy access to the collected information. Apart from commercial search engines, there have been only a few attempts to perform this task and to make the data available. Several previous works (e.g.,~\cite{ntoulas2004what,toyoda2006what}) analyzed a sequence of crawls, studying the degree and frequency of change in both the content of Web pages and the hyperlink structure in order to propose measures of their novelty. In this paper we study a temporal dataset~\cite{BSVLTAG} that has recently been made public. It consists of twelve 100M-page snapshots of the \texttt{.uk} domain. The Web graphs of the individual snapshots have been merged into a global graph whose labels provide constant-time access to temporal information. The work done to assess the data and to study some aspects of its evolution at the level of Web pages is described in~\cite{bordino2008temporal}. We now analyze the structure of this huge time-aware graph at the level of the interconnections between hosts. We study the \emph{hostgraph}, i.e., the graph in which every node corresponds to a site and a directed edge represents the existence of hyperlinks between pages belonging to two different hosts.
Understanding the graph structure of the Web at this macroscopic level can provide valuable insights for improving Web site accessibility and navigation, or for discovering related hosts. The notion of hostgraph was proposed in~\cite{bharat2001who}. A few studies have analyzed hostgraphs~\cite{baeza2005spain,liu2005china,toyoda2008thai}. A comparison among the results of twelve characterization studies of several national domains is presented in~\cite{baeza2007domain}.
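
To make the hostgraph definition concrete, the following minimal Python sketch collapses page-level links into host-level edges. It is an illustration only, not the authors' implementation: the function names (`host_of`, `build_hostgraph`, `edge_in_snapshot`) are hypothetical, and the temporal label is assumed to be an integer bitmask in which bit t marks presence in snapshot t, which may differ from the label format actually used in the dataset. A host edge is assumed to exist in a snapshot if at least one page-level link between the two hosts exists in that snapshot.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host name of an absolute URL ('' if absent)."""
    return (urlsplit(url).hostname or "").lower()


def build_hostgraph(page_links):
    """Collapse page-level links into host-level edges.

    page_links: iterable of (source_url, target_url, snapshot_mask).
    Bit t of snapshot_mask is set if the link appears in snapshot t
    (an assumed encoding, not the dataset's actual label format).
    Returns a dict mapping (source_host, target_host) to the bitwise OR
    of the masks of all page-level links between the two hosts.
    """
    edges = defaultdict(int)
    for src, dst, mask in page_links:
        src_host, dst_host = host_of(src), host_of(dst)
        # Skip malformed URLs and links that stay within a single host.
        if not src_host or not dst_host or src_host == dst_host:
            continue
        edges[(src_host, dst_host)] |= mask
    return dict(edges)


def edge_in_snapshot(edges, src_host, dst_host, t):
    """Check whether a host-level edge is present in snapshot t."""
    return bool((edges.get((src_host, dst_host), 0) >> t) & 1)
```

Under these assumptions, checking whether a host edge is present in a given snapshot costs one dictionary lookup and one bit test, which mirrors in spirit the constant-time access to temporal information mentioned above.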

Item Type: Conference or Workshop Item (Poster)
Uncontrolled Keywords: dynamic characterization, web evolution, temporal evolution
Subjects: Web Science Events > Web Science 2009
ID Code: 158
Deposited By: W S T Administrator
Deposited On: 24 Jan 2009 08:45
Last Modified: 25 Oct 2011 16:37
