Web Crawling in R
Standard scraping approach using the RCurl package
We extract the movies and their ratings from the IMDb Top chart: getURL() downloads the page, htmlParse() parses the HTML, and readHTMLTable() converts the first table on the page into a data frame.
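R> library(RCurl)
R> library(XML)
R> # Download the raw HTML of the chart page
R> url <- "http://www.imdb.com/chart/top"
R> top <- getURL(url)
R> # Parse the HTML and read the first table into a data frame
R> parsed_top <- htmlParse(top, encoding = "UTF-8")
R> top_table <- readHTMLTable(parsed_top)[[1]]
R> # Show the first 10 rows (head() alone would return only 6)
R> head(top_table[, 1:3], n = 10)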
   Rank & Title                                             IMDb Rating
1  1. The Shawshank Redemption (1994)                       9.2
2  2. The Godfather (1972)                                  9.2
3  3. The Godfather: Part II (1974)                         9.0
4  4. The Dark Knight (2008)                                8.9
5  5. Pulp Fiction (1994)                                   8.9
6  6. The Good, the Bad and the Ugly (1966)                 8.9
7  7. Schindler’s List (1993)                               8.9
8  8. 12 Angry Men (1957)                                   8.9
9  9. The Lord of the Rings: The Return of the King (2003)  8.9
10 10. Fight Club (1999)                                    8.8
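The scraped table keeps rank, title, and year in a single text column, and the ratings may come back as text rather than numbers. A possible follow-up step, sketched here in base R, is to split that column and convert the ratings. The data frame top_sample, its column names rank_title and rating, and the regular expression are assumptions made for illustration; the rows are copied from the output above so the example runs without a network connection.

```r
# Sample rows copied from the scraped output; the column names are
# hypothetical stand-ins for "Rank & Title" and "IMDb Rating".
top_sample <- data.frame(
  rank_title = c("1. The Shawshank Redemption (1994)",
                 "2. The Godfather (1972)",
                 "3. The Godfather: Part II (1974)",
                 "8. 12 Angry Men (1957)"),
  rating = c("9.2", "9.2", "9.0", "8.9"),
  stringsAsFactors = FALSE
)

# Assumed layout: leading rank, a dot, the title, then a four-digit
# year in parentheses at the end of the string.
pattern <- "^\\s*(\\d+)\\.\\s*(.*?)\\s*\\((\\d{4})\\)\\s*$"

# Pull out each capture group; as.character() guards against factor columns.
raw <- as.character(top_sample$rank_title)
movies <- data.frame(
  rank   = as.integer(sub(pattern, "\\1", raw, perl = TRUE)),
  title  = sub(pattern, "\\2", raw, perl = TRUE),
  year   = as.integer(sub(pattern, "\\3", raw, perl = TRUE)),
  rating = as.numeric(as.character(top_sample$rating)),
  stringsAsFactors = FALSE
)

print(movies)
```

Anchoring the pattern at both ends keeps titles that start with digits, such as "12 Angry Men", from being confused with the rank. If the live page uses a different layout, the pattern would need adjusting.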