Web page content analysis with "SmartBrowser"

Saturday, 11 April 2009, Александр Краковецкий

Overview

SmartBrowser is software that loads only the page content most relevant to the user, excluding advertisements, design elements, links, etc. It is the result of scientific research and is still under development. This version is a preview that demonstrates the main idea and helps determine future research directions.

Architecture

The SmartBrowser architecture consists of several main modules:

  1. Graphical user interface (GUI).
  2. Module for slicing web pages into information blocks.
  3. Module for analyzing the content of the blocks.
  4. Module for calculating block importance.
  5. Module for determining the main content among all blocks.

GUI

The software is designed as a web browser, so users are already familiar with the interface and can feel free to work with it.

Slicing Web Pages Module

This module retrieves the web page content and slices it into blocks using the VIPS (Vision-based Page Segmentation) algorithm proposed by Microsoft Research Asia.
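To make the idea of slicing concrete, here is a minimal Python sketch. It is not VIPS: VIPS relies on visual cues from the rendered page, while this stand-in simply starts a new block at each block-level HTML tag. The `BLOCK_TAGS` set and the names `BlockSlicer` and `slice_page` are assumptions made for illustration and are not taken from SmartBrowser.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this simplified sketch (an assumption).
BLOCK_TAGS = {"div", "p", "td", "li", "section", "article", "h1", "h2", "h3"}
# Tags whose text content is never shown to the reader.
SKIP_TAGS = {"script", "style"}


class BlockSlicer(HTMLParser):
    """Collects page text into blocks, closing a block at each block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks
        self._buffer = []     # text pieces of the block being built
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def flush(self):
        # Join the buffered pieces and normalize whitespace.
        text = " ".join(" ".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def slice_page(html):
    """Return the list of text blocks found in an HTML string."""
    slicer = BlockSlicer()
    slicer.feed(html)
    slicer.close()
    slicer.flush()  # keep any trailing text after the last block tag
    return slicer.blocks
```

Calling `slice_page` on a page with a menu `div` followed by two paragraphs yields three separate text blocks, which later stages can then score independently.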

Blocks Content Analyzing Module

This module calculates common properties of each block, such as word count, sentence count, stop-word count, etc., and the relationships between them.
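The properties listed above can be sketched in a few lines of Python. The stop-word list, the simple regular expressions, and the two ratio features are assumptions chosen for illustration; the actual set of properties used by SmartBrowser is not given in this article.

```python
import re

# A tiny illustrative stop-word list; a real one would be much larger.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on"}


def analyze_block(text):
    """Compute simple text properties of one block."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    stop_word_count = sum(1 for w in words if w in STOP_WORDS)

    word_count = len(words)
    sentence_count = len(sentences)
    return {
        "word_count": word_count,
        "sentence_count": sentence_count,
        "stop_word_count": stop_word_count,
        # Relationships between the raw counts.
        "stop_word_ratio": stop_word_count / word_count if word_count else 0.0,
        "words_per_sentence": word_count / sentence_count if sentence_count else 0.0,
    }
```

The intuition behind such features is that running article text tends to contain full sentences with many stop-words, while menus and link lists usually do not.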

Calculating Block Importance Module

This module calculates the importance of each block from its property values. It relies on a mathematical model that has been developed and trained for this purpose.
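The trained model itself is not published in this article, so the following Python sketch only shows the general shape of such a calculation. It assumes a logistic model over three block properties; the `WEIGHTS` and `BIAS` values are hypothetical placeholders, not the learned parameters.

```python
import math

# Hypothetical weights and bias; the real trained model is not given here.
WEIGHTS = {"word_count": 0.02, "sentence_count": 0.3, "stop_word_ratio": 4.0}
BIAS = -3.0


def block_importance(props):
    """Map a dict of block properties to an importance score between 0 and 1."""
    # Weighted sum of the properties; missing properties count as 0.
    z = BIAS + sum(WEIGHTS[name] * props.get(name, 0.0) for name in WEIGHTS)
    # The logistic function squashes the sum into the (0, 1) range.
    return 1.0 / (1.0 + math.exp(-z))
```

With these placeholder weights, a long block with several sentences and many stop-words scores close to 1, while a short menu-like block scores close to 0.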

Main Content Determining Module

This module analyzes the importance values of all blocks and decides which of them are important and which are not. It is based on a fuzzy clustering algorithm.
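The article does not say which fuzzy clustering algorithm is used. As one possible illustration, the Python sketch below applies two-cluster fuzzy c-means to the scalar importance values and keeps the blocks that belong mostly to the cluster with the higher center. The function names, the fuzzifier `m=2.0`, the iteration count, and the 0.5 membership threshold are all assumptions.

```python
def _memberships(values, centers, m):
    """Fuzzy membership of every value in each of the two clusters."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for v in values:
        dists = [abs(v - c) for c in centers]
        if 0.0 in dists:
            # The value coincides with a center: full membership there.
            hit = dists.index(0.0)
            rows.append([1.0 if k == hit else 0.0 for k in range(2)])
        else:
            rows.append([
                1.0 / sum((dists[k] / dists[j]) ** exponent for j in range(2))
                for k in range(2)
            ])
    return rows


def fuzzy_c_means_1d(values, m=2.0, iterations=50):
    """Two-cluster fuzzy c-means on scalar values; returns (centers, memberships)."""
    centers = [min(values), max(values)]  # start from the two extremes
    for _ in range(iterations):
        rows = _memberships(values, centers, m)
        for k in range(2):
            weights = [row[k] ** m for row in rows]
            total = sum(weights)
            if total > 0:
                centers[k] = sum(w * v for w, v in zip(weights, values)) / total
    return centers, _memberships(values, centers, m)


def select_main_blocks(importances, threshold=0.5):
    """Return indices of blocks assigned mostly to the high-importance cluster."""
    if not importances:
        return []
    if min(importances) == max(importances):
        # All scores are equal, so no block can be preferred over another.
        return list(range(len(importances)))
    centers, rows = fuzzy_c_means_1d(importances)
    high = 0 if centers[0] > centers[1] else 1
    return [i for i, row in enumerate(rows) if row[high] > threshold]
```

For example, `select_main_blocks([0.05, 0.1, 0.92, 0.08, 0.88])` returns `[2, 4]`, the two blocks whose importance is clearly separated from the rest. Unlike a fixed cut-off, the boundary adapts to the spread of scores on each page.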

Working examples

Example 1: "Car crash kills Pennsylvania state senator - CNN.com"

To view the original web page, please refer to the "Car crash kills Pennsylvania state senator - CNN.com" article page.

Original web page screenshot:

Web page screenshot after cleaning by SmartBrowser:

Granularity Value: 7

Example 2: "Obama details his 'economic rescue plan' - CNN.com"

To view the original web page, please refer to the "Obama details his 'economic rescue plan' - CNN.com" article page.

Original web page screenshot:

Web page screenshot after cleaning by SmartBrowser:

Granularity Value: 5

System Requirements

  1. OS: Windows XP SP3/2003/2003 R2/Vista/2008 32-bit only
  2. .NET Framework 3.5

Downloads

You can download the latest executable files from http://smartbrowser.codeplex.com/.

Feedback and suggestions

Please send all questions and comments to msugvn[at]gmail.com.

