Tuesday, 15 February 2011

web services - Given billions of URLs, how to determine duplicate content -


I was asked this question in a job interview; it was an open-ended question. I have described it below.

Given billions of URLs (deep links), how do I classify which URLs point to duplicate content? The question also extended to: in the case of duplicate pages, how do we identify which one is the authentic original? That was the first part. My approach (with valid assumptions) was to bucket the URLs by domain and then compare the content of URLs within the same bucket.
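A minimal sketch of that bucketing idea might look like the following. It assumes a hypothetical fetch_content helper that returns each page's text, and exact hashing only catches byte-identical content, not near-duplicates:

    import hashlib
    from collections import defaultdict
    from urllib.parse import urlparse

    def find_exact_duplicates(urls, fetch_content):
        # Bucket URLs by domain so comparisons stay within each bucket.
        buckets = defaultdict(list)
        for url in urls:
            buckets[urlparse(url).netloc].append(url)

        duplicate_pairs = []
        for domain, bucket in buckets.items():
            seen = {}  # content hash -> first URL seen with that content
            for url in bucket:
                digest = hashlib.sha256(fetch_content(url).encode("utf-8")).hexdigest()
                if digest in seen:
                    duplicate_pairs.append((seen[digest], url))
                else:
                    seen[digest] = url
        return duplicate_pairs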

In the second part, the interviewer narrowed the question down: there are only two URLs. URL 1 is a wiki page about a celebrity (say, Brad Pitt), and URL 2 contains information about several famous celebrities, including Brad Pitt. How can we identify which is authentic and which is a duplicate? My answer was based on comparing the two pages by their citations.
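A rough sketch of that citation heuristic, assuming we already have an inbound_links map from each URL to the set of pages that cite it (the map itself is an assumption for illustration):

    def likely_authentic(url1, url2, inbound_links):
        # The page with more independent inbound citations is treated as the
        # likely original; the other is flagged as the probable duplicate.
        c1 = len(inbound_links.get(url1, set()))
        c2 = len(inbound_links.get(url2, set()))
        return url1 if c1 >= c2 else url2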

The interviewer asked me to reason from scratch, and I believe we have no prior information about duplicate content across the URLs. Since this is an open-ended question, any leads will prove helpful.

You may find this paper useful: "" by Monika Henzinger at Google. This problem has attracted a fair amount of research:

A naive solution is to compare all pairs of documents. Since that is prohibitively expensive on large datasets, Manber [11] and Heintze proposed the first algorithms for detecting near-duplicate documents with a reduced number of comparisons. Both algorithms work on sequences of adjacent characters. Brin et al. started to use word sequences to detect copyright violations. Shivakumar and Garcia-Molina [13, 14] continued this research and focused on scaling it up to multi-gigabyte databases [15]. Broder et al. [3] also used word sequences to efficiently find near-duplicate web pages. Later, Charikar [4] developed an approach based on random projections of the words in a document. Recently, Hoad and Zobel [10] developed and compared methods for identifying versioned and plagiarised documents.

In other words, it is a complex problem with many solutions of varying degrees of success, and there is no single 'correct' answer; most approaches involve word or character sequences.
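As a concrete illustration of the word-sequence approaches mentioned above, here is a small sketch of Broder-style shingling with Jaccard similarity; the shingle size and the sample strings are arbitrary choices for illustration, not anything from the paper:

    def shingles(text, k=4):
        # Contiguous k-word sequences ("shingles") from the document text.
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

    def jaccard(a, b):
        # Jaccard similarity of two shingle sets; values near 1.0 suggest
        # near-duplicate documents.
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    doc1 = "Brad Pitt is an American actor and film producer."
    doc2 = "Brad Pitt is an American actor and producer of films."
    print(jaccard(shingles(doc1), shingles(doc2)))

Charikar's random-projection idea (the basis of simhash) trades this set comparison for compact fingerprints, which tends to scale better when you are dealing with billions of documents.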

