Big Data Text Automation on Small Machines


Dealing with Big Data may require powerful machines and vast storage. Focusing on a specific type of data, however, it is possible to use commodity computers in a small lab environment to handle Big Data generated from large complex graphs such as online social networks (OSNs) and other types of Internet applications, and produce useful information.

We are focusing on text data in this section. Text data appears everywhere on the Internet, including OSN postings, news articles, commentaries, reports, and reviews, to name just a few.

Data acquisition

Using APIs provided by the content providers and crawling the Internet are common methods for acquiring data from the Internet.


Microblog API’s

OSNs often provide open APIs to developers. MPT collected MBPs continuously for over a year from three popular OSN sites in mainland China. They were Sina, Tencent, and Renren. These OSN sites provided different interfaces for dfferent purposes. We were interested in topic discovering from daily MBPs, rather than in a specific user or a specific topic, and so we used the public-timeline interface to collect MBPs (see Figure 1 for an example of such an interface). These MBPs were semistructured JavaScript object notation style records.


Web crawling

Web-crawling technologies are important mechanisms for collecting data from the Inter-net. The general framework of crawling is given in the following:

  • Provide the crawler a seed URL.

  • The crawler grabs and stores the target page’s content.
  • Enter the URLs contained in the target page in a waiting queue.
  • Process one URL at a time in the queue.
  • Repeat Steps 2 through 4.

A crawler is responsible for doing the following tasks:

  • URL fetching. There are three approaches to grabbing URLs at the target site (initially the target site is the seed URL): Grab all the URLs in the target site. This approach may waste computing re-sources of the machines running the crawler on materials that are not useful for the applications at hand.

Grab a portion of the URLs and ignore certain UR c. Grab only what is needed for the current application.

  • Content extraction. Parse the web page and extract the content for the given application. There are two ways to parse a page:

Write specific rules for each website, then use a web parsing tool such as Jsoup to extract content.

Write common rules for all websites (Google’s content extractor is a typical example of this approach).

  • Fetching frequency. If a crawler visits a target website frequently over a short period of time, then it may trigger the website into blocking the crawler’s IP. Thus, it is important for a web crawler not to visit the target website too often over a short period of time to avoid being blocked.
  • Monitoring. Monitor whether the target website blocks a crawler’s requests and whether the website changes the structures of its web pages.

Crawler framework for new websites

The KWB crawler framework is a vertical crawling mechanism. It can be reused and customized according to the specific layout of a web page. We observe that news websites tend to have the same components: an index page and a number of content pages for news items. When grabbing the index page, we set the crawling depth to one to stop the crawler from grabbing URLs contained in the content page. Meantime, we also want to remove repeating URLs in the URL queue. The KWB crawler framework uses both specific rules and common rules, depending on the individual crawler for a given website.

The KWB crawler framework consists of the following modules (see Figure 2):

  • Visual input module. This module allows the user to specify the patten of the target web page’s layout. The user may specify the following two types of patterns:
  • The first type is a regular expression representing what content the user wants to extract. For example, the regular expression <TAG\b[^>]*>(.*?)</TAG> matches the opening and closing pair of a specific HTML tag TAG, and the text wrapped within would be the content the user wants to extract.
  • b. The second type is an XPath structure of the content that the user wants to extract. For example, suppose that the user wants to select the content enclosed in all the <h2> Then the user can specify an XPath query as h:h2.
  • Web page rule management. It manages the web page rules entered by users, including the following operations: delete, check, and update.
  • The core crawler cluster. This cluster consists of the following components:
  • Thread pool. This is the set of threads in a multitask system.
  • b. URL po This is the database with all the pending URL information when a URL was grabbed. We use a Bloom filter to detect duplicate URLs and remove them. The crawler will visit and remove a URL one at a time from the remaining URLs in this pool.
  • Pattern pool. This is the database of all the web page rules entered by users.

Screen Shot 2017-01-21 at 22.40.34

DAO mo The data access object (DAO) contains the interface for further operations, including data export and data interface.

Duplicate removal. This removes duplicate URLs in the URL pool and the pat-terns in the pattern pool.

  • The crawler task module. This module consists of the following submodules:

Priority processing. Some websites are updated more frequently than others. This module determines which sites need more frequent visits.

  • Temp Sometimes the user may just want to fetch a website once without paying a return visit. This component handles this type of crawling.

Regular grab. For most websites, the user sets up a schedule to grab them peri-odically. This component handles this type of crawling.

  • The supervision module. This module consists of the following submodules:

Resource control (proxy/account). This is a pool containing all the proxy infor-mation and account information. The proxy is used to avoid IP blocking prob-lems, and the account is used to log on to certain websites that require signing in, such as Twitter and Facebook. Monitorin This monitors if the crawler functions normally. For example, it monitors whether the target website has blocked the crawler.

Antiblocking. When the monitoring submodule detects that a crawler is blocked, it decides whether to restart the crawler, change the pattern, or change proxy to avoid being blocked.

Managing antiblocking, exception, and restore This submodule allows the user to manage and change patterns of a website rules. It also determines how often to test if a crawler is still functioning normally.

The program entrance. This component consists of a crawler controller/entrance submodule, which is responsible for starting the entire system.

The KWB crawler framework is implemented using Java. It uses httpclient to connect to a website and construct the document object model (DOM) tree of the page, and uses CSS and Jsoup to parse and extract content. DAO is implemented using MySQL and JDBC.

Generic article extractions

We explore in this section how to extract the main content from a web page without writing specific rules for the page. Human readers can easily distinguish on a web page the main content from irrelevant content by looking at the page layout displayed on the web browser. Any good UI design must guarantee this effect. Irrelevant content is also referred to as junk content or noise, yet such a simple task is very difficult for a machine to do well, as the machine cannot “visualize” the web page displays as easily as humans do. What a machine can do is process the hypertext markup language (HTML) code of the web page.

The main content may occur anywhere in the source code, which is structured with HTML DOM tags, styles, and scripts (e.g., see Figure 3). These tags, styles, and scripts are considered junk content for the purpose of extracting the main content. Other types of junk content are codes for displaying commercial messages, navigation lists, and irrelevant links, among other things, making it more difficult to detect the main content area.


Early methods

Early approaches can be categorized into the following three directions:

  • Devise empirical rules.
  • Construct a DOM tree based on the HTML structures and DOM features.
  • Devise machine learning models, all for the purpose of detecting the main content and eliminating noise.

Empirical rules

Empirical rules are formed based on the HTML tags contained in the source code of the web page such as <div>, <p>, and <span>; or tag labels such as “header,” “nav,” “menu,” and “footer.” In 2009, Joshi pointed out that the article areas are often wrapped in <div> or <p> tags. On the other hand, not everything wrapped in these tags is part of the main content, and so one cannot simply extract everything wrapped in these tags to form the main content.

Another common HTML tag is the comment tag <!–>. Comments are used by programmers to explain the source code, which will not be displayed on the browser; that is, the reader will not see the content wrapped in the comment tags. Thus, it is straightforward to remove all comment tags and everything wrapped in them.

The style tag <style> and the script tag <script> are used to instruct the browser how to display the web page, which can be removed, together with everything wrapped in them. The following is an example:

In 2010, Kohlschutter et al.  started to examine shallow text features. In 2013, Uzun  developed a model for creating rules to infer informative content from simple HTML pages based on the following two block tags: <div> and <td>.

While using tag-based empirical rules to extract the main content makes sense in some situations, this method has limitations. Thus, empirical rules are often combined with other methods to achieve a better extraction result.

DOM trees

Constructing DOM trees from the corresponding HTML DOM structure of a web page is one of the most popular methods to extract the main content.

In 2003, Yi captured the common presentation styles at that time and the actual contents of the web pages in a given website by sampling the pages of the site and built a site style tree to detect the noise and main content areas.

In 2009, Joshi and Liu devised a method of HTML DOM analysis to find subtrees with large bodies of text by summing up all the text size and combining with natural language processing techniques for automated extractions of the main content with associated images.

In 2010, Guo presented ECON, a system that uses a DOM tree to represent a news page and leverages the substantial features of the DOM tree to extract news content from news pages.

Dierent HTML tag structures will generate dierent DOM trees. When a web page has a complex structure, constructing a DOM tree is time consuming. Weninger and Hsu noted that it is not necessary to use the HTML structure or DOM features. They presented a method to extract the main content from diverse web pages using the web page’s text-to-tag ratio. This ratio helps distinguish the main content from the noise content.

Machine-learning methods

Machine-learning methods, classification methods in particular, have been used to train and extract the main content on a web page.

In 2007, Gibson explored nonrepeating content, boundary, and other factors, and used the conditional random field sequence labelling model to label every divided unit as “Content” or “NotContent,” where “Content” means the main content and “NotContent” means noise. Chakrabarti also developed a novel framework to train the classifier using automatic generated training data by assigning a score to every node of the DOM-tree for the given page.

In 2009, Pasternack and Roth applied the maximum subsequence segmentation (MSS) method to the domain of news websites using a small set of training data to identify the main content. The MSS method is a global optimization method over token-level local classifiers.

In 2010, Kohlschutter derived a stochastic model based on a classification model that combines word length and link density to extract the main content.

In 2013, Uzun presented a hybrid approach to discovering the informative content using the decision tree method and automatic rule creation.

The machine-learning approach, however, often incurs high complexity, requires many assumptions, and tends to make simple things unnecessarily more complicated.

We present a new algorithm called qRead devised by Wang and Wang, which achieves higher efficiency and accuracy in extracting the main content by finding computable measures on the text structure on a web page, without relying on HTML structures or DOM features. In particular, qRead removes the HTML tags and text that are clearly irrelevant to the main content, partitions the text into segments, uses the highest ratio of word counts over the number of lines in each segment, and compares the similarity between each segment with the title of the main content to determine the main content area.


According to the HTML standard, the HTML source code would typically contain two parts, which are wrapped, respectively, inside the <head> </head> tags and the <body> </body> tags. The information included inside the <head> </head> tags contains web page configuration or description details and so everything in it can be removed in the preprocessing step. Moreover, qRead filters out comments, styles, and scripts.

Most of the navigation lists are wrapped in the <div> and </div> tags with attributes class=”menu” or class=”nav”, allowing qRead to filter out most of the navigation lists using these features.

Removing HTML tags

HTML tags consist of the tag name, id, class name, or other tag properties. The main content text typically appears between two HTML tags, for example,

<p> This is an article. </p>

qRead removes all <div>, <p>, <span>, <table>, and <ul> tags, together with their properties such as tag id, class, styles, and width.

After this step, the text of web content that contains only readable text without HTML tags is generated (see Figure 4). We can see that Segment 2 is the article area and


Segment 1 looks like navigational text. We can see from Figure 4 by comparing Segments 1through3thateverylineinSegment2islongerthanthelinesinSegments1and3.

An article may contain multiple paragraphs. In the web page source code, the web page title or the article title are typically contained in heading tags such as the <h1> tag and the <h2>. Tables are contained in the <table> tag. The <p> and <span> tags usually contain plain text. A paragraph is typically wrapped in the <p> and </p> tags, or the <br> tags. qRead marks the </p> tags and the <br> tags to indicate paragraph break points before removing them.

Text segmentations

Navigational text is junk content. There are two types of navigational text. One type is a menu list with keywords in each line, for example,

Big Data Tex Automation on Small Machines

The other type of navigational text is a list of article titles with a phrase or sentence in each line. The following is an example:

Trending in U.S
St. Louis police allege hate crime in latest attack on Bosnian resident
Federal autopsy released in Ferguson shooting
Fire in downtown Los Angeles may have been intentionally set
Police shoot, kill knife-wielding man in New York synagogue
See all Trends

Wang and Wang observed that a text line in the noise area is typically much shorter than a text line in the main content where a text line ends with a carriage return in the source code, an end-of-sentence mark placed in the filtering phase, or some other symbols; define line length to be the number of words contained in the line. Figure 5a and b depicts the line length distributions of two different news web pages obtained from, respectively, the ABC News website and the BBC News website, where the main content (the news) in the ABC web page appears in lines 492-494 and in the BBC web page appears in lines 178-188.

Thus, the area of main content would typically include more words in each line than those in the noise area. Let L be a line. Denote by w(L) the number of words contained in L. Partition the remaining content into a sequence of text segments, separated by at

image040-1 copy

least  empty lines, where  is a threshold value, determined empirically. Paragraphs in the main content are typically separated by HTML tags such as <p> and <br> (where no empty lines are necessary), or by a single empty line or double empty lines. In practice, = 3 is sufficient. In other words, inside each segment S, the first line and the last line are nonempty text lines, and there are at most -1 consecutive empty lines in between. Let


denote the sequence of segments. Let L denote a text line. Then the size of S, denoted by w(S), is the number of words contained in S. That is,


Let l(S) denote the number of nonempty text lines contained in S. Let d(S) denote the word desnity of S, which is the ratio of the number of words contained in S over the number of nonempty text lines contained in S. Namely,


Wang and Wang further observed that the main content is typically the text contained in the segments with the highest density, and they hypothesized that any reasonably well-organized web page with reasonably long text in the main-content area has this property. They carried out extensive experiments and confirmed this hypothesis.

Figure 6a and b shows the segment density distribution for each of the web pages depicted in Figure 5a and b. In each  figure, the segment with the highest word density is exactly the area of the main content (i.e., the news article contained in the web page).


Similarity process    

Using segment density alone, however, does not work well on the following two types of web pages:

  1. The main content is a short article with one or two sentences.
  2. The main content consists of multiple text segments with the same density.

The text area of the highest density is not the article area in either of these two web page types. To improve extraction accuracy, qRead applies the measure of cosine similarity between a text segment and the article title to determine if the text segment should be part of the main content.

 Accuracy and efficiency

Extensive experiments were carried out on a laptop computer with a 2.5 GHz Intel Core i5 processor and 8 GB 1600 MHz DDR3 memory for 13,327 web pages written in English and 10,382 web pages written in Chinese collected from a large number of websites, where the article contained in each web page is substantially shorter than the page itself. This means that every web page in the data sets contains lots of junk content.

Screen Shot 2017-01-21 at 22.41.37

  1. An extraction is accurate if it contains exactly the complete article without any junk content.
  2. An extraction is extra if it contains the whole article but with some junk content.
  3. An extraction is missed if it contains an incomplete article or nothing related to the article.

The extraction results are shown in Table 1, which are superior over previously published results.




Data storage

Text data acquired from OSN APIs and the KWB crawlers are unstructured data. To handle unstructured text data in large quantity, we would need a high-performance, steady, and  flexible database system. Mongo DB is a common choice for this purpose.

Database for microblog post

Different MBPs from different sources use different data structures. With Mongo DB, it is possible to store data in different data structures in the same collection. This makes it convenient to manage the data. Moreover, Mongo DB is a database scheme with high performance on the operations of both read and write, which meets the need of intensive writing and querying of MBPs.

It is straightforward to store in one collection all the MBPs posted on the same date, and store all the collections in the same database. However, Mongo DB would write data in the same database into the same file and load all the files that are accessed frequently into the main memory, and so as more MBPs are stored in the same database, the file will grow larger quickly, causing Mongo DB to consume almost all the RAM and crashing the system. To solve this problem, Jia and Wang divided the collections of MBPs according to a fixed time interval into different databases. Since writing to the database is the main operation of MBT and MBT only collects real-time data, Mongo DB would load the file of the most recent week into RAM, which consumes much less RAM. This would solve the problem of crashing.

Note that we may also consider using an SQL-based database for reliability.

Database for an automatic news system

The KWB crawler eliminates duplicate URLs, but the same news article may be posted on different URLs, and so a duplicate article cannot be eliminated by removing duplicate URLs. For each news article KWB needs to retrieve a summary of dierent length (depending on applications) using a proprietary text summary engine. Eliminating duplicate articles and retrieving summaries are both time consuming. To reduce computations, KWB adopts a new database called central DB (see Figure 7) to perform these two tasks on raw data collected every 20 minutes.

There are two different types of duplicates in the raw data:

  1. News items that are exactly the same due to reposting.
  2. Different news items that report the same event.

We will keep the second type of news items, for they report the same event from (possibly) different prospectives, which are useful. To identify the first type of duplicates we may com-pute cosine similarities for all the raw data collected by the KWB crawlers. This approach, however, is time consuming. Instead, we take a greedy approach to reducing the number of news items that we need to retrieve summaries for by eliminating duplicates posted in

Screen Shot 2017-01-21 at 22.42.03

a small window of time. We will further remove duplicates later before computing news classifications.

The central DB retrieves article summaries and detects duplicates in a parallel fashion. In particular, it sorts all the unprocessed raw data in increasing order according to their IDs. These are incremental IDs given to the news items based on the time they are fetched by the KWB crawler framework. Starting from the first news article, repeat the following:


Send a request to the summary engine to retrieve summaries of required lengths.


Compute the cosine similarities of the article with the news items whose IDs fall in a small fixed time window after this article. If a duplicate is found, remove the one whose ID is in the time window (i.e., with a larger ID), for it is likely a reposting and the news article with a smaller ID may have already had the summaries generated from the summary engine running on a different server.


Move to the next news article in the sorted list.

Share on FacebookShare on LinkedInShare on Google+Tweet about this on TwitterShare on VK


High Five! You just read 2 awesome articles, in row. You may want to subscribe to our blog newsletter for new blog posts.