What's Wrong: rvest's Error 'in open.connection(x, "rb") and readHTMLTable ()`s "XML contents does not seem to be XML"? [duplicate] - r

I am trying to extract all the table from this page using R, for html_node i had passed "table". In console the output is weird. Data is available in webpage but in R console it shows NA. Please suggest me where i had made mistake.
library(xml2)
library(rvest)
url <- "https://www.iii.org/table-archive/21110"
page <- read_html(url) #Creates an html document from URL
table <- html_table(page, fill = TRUE) #Parses tables into data frames
table
part of Output:
X4 X5 X6
1 Direct premiums written (1) Market share (2) 1
2 Market share (2) <NA> NA
3 10.6% <NA> NA
4 6.0 <NA> NA
5 5.4 <NA> NA
6 5.4 <NA> NA
7 5.2 <NA> NA
8 4.5 <NA> NA
9 3.3 <NA> NA
10 3.2 <NA> NA
11 3.0 <NA> NA
12 2.2 <NA> NA
X7 X8 X9 X10
1 State Farm Mutual Automobile Insurance $51,063,111 10.6% 2
2 <NA> <NA> <NA> NA
3 <NA> <NA> <NA> NA
4 <NA> <NA> <NA> NA
5 <NA> <NA> <NA> NA
6 <NA> <NA> <NA> NA
7 <NA> <NA> <NA> NA
8 <NA> <NA> <NA> NA
9 <NA> <NA> <NA> NA
10 <NA> <NA> <NA> NA
11 <NA> <NA> <NA> NA
12 <NA> <NA> <NA> NA

This will get all of the tables into a single data frame:
library(tidyverse)
library(rvest)
url <- "https://www.iii.org/table-archive/21110"
df <- url %>%
read_html() %>%
html_nodes("table") %>%
html_table(fill = T) %>%
lapply(., function(x) setNames(x, c("Rank", "Company", "Direct_premiums_written",
"Market_share")))
tables <- data.frame()
for (i in seq(2,18,2)) {
temp <- df[[i]]
tables <- bind_rows(tables, temp)
}
You can then subset this however you want. For example, lets extract the information from the third table that represents 2009:
table_2009 <- tables[21:30,] %>%
mutate(Year = 2009)
To add all the years at once:
years <- c(2017, 2008, 2009, 2010, 2011, 2013, 2014, 2015, 2016)
tables <- tables %>%
mutate(Year = rep(years, each = 10))
Hope this helps.

There are a couple of issues with these tables.
First, I think you'll get better results if you specify the class of table. In this case, .tablesorter.
Second, you'll note that in some tables the second column header is Group, in other cases it is Group/company. This is what causes the NA. So you need to rename the columns to be consistent for all tables.
You can get a list of tables with renamed column headers like this:
tables <- page %>%
html_nodes("table.tablesorter") %>%
html_table() %>%
lapply(., function(x) setNames(x, c("rank", "group_company",
"direct_premiums_written", "market_share")))
Looking at the web page we see that the tables are for years 2017, 2008 to 2011 and 2013 to 2016. So we could add these years as names to the list then bind the tables together with a column for year:
library(dplyr)
tables <- setNames(tables, c(2017, 2008:2011, 2013:2016)) %>%
bind_rows(.id = "Year")

There are multiple items in the list that you have named table. (Not a good practice: there's a function by that name.)
str(tbl)
List of 18
$ :'data.frame': 12 obs. of 45 variables:
..$ X1 : chr [1:12] "Rank\nGroup/company\nDirect premiums written (1)\nMarket share (2)\n1\nState Farm Mutual Automobile Insurance\n"| __truncated__ "Rank" "1" "2" ...
..$ X2 : chr [1:12] "Rank" "Group/company" "State Farm Mutual Automobile Insurance" "Berkshire Hathaway Inc." ...
..$ X3 : chr [1:12] "Group/company" "Direct premiums written (1)" "$64,892,583" "38,408,251" ...
snippped rest of long output
Perhaps you only want the last one?
tbl[[18]]
Rank Group/company
1 1 State Farm Mutual Automobile Insurance
2 2 Berkshire Hathaway Inc.
3 3 Liberty Mutual
4 4 Allstate Corp.
5 5 Progressive Corp.
6 6 Travelers Companies Inc.
7 7 Chubb Ltd.
8 8 Nationwide Mutual Group
9 9 Farmers Insurance Group of Companies (3)
10 10 USAA Insurance Group
Direct premiums written (1) Market share (2)
1 $62,189,311 10.2%
2 33,300,439 5.4
3 32,217,215 5.3
4 30,875,771 5.0
5 23,951,690 3.9
6 23,918,048 3.9
7 20,786,847 3.4
8 19,756,093 3.2
9 19,677,601 3.2
10 18,273,675 3.0
Nope; going back to the page it's clear you want the first, but its structure appears to have been misinterpreted and the data has been arranged as "wide", with all the data residing in the first row. So some of the columns are being displayed and the rest of the data seems to be messed up; Just take columns 2:4:
tbl[[1]][ ,c('X2','X3','X4')]
X2 X3
1 Rank Group/company
2 Group/company Direct premiums written (1)
3 State Farm Mutual Automobile Insurance $64,892,583
4 Berkshire Hathaway Inc. 38,408,251
5 Liberty Mutual 33,831,726
6 Allstate Corp. 31,501,664
7 Progressive Corp. 27,862,882
8 Travelers Companies Inc. 24,875,076
9 Chubb Ltd. 21,266,737
10 USAA Insurance Group 20,151,368
11 Farmers Insurance Group of Companies (3) 19,855,517
12 Nationwide Mutual Group 19,218,907
X4
1 Direct premiums written (1)
2 Market share (2)
3 10.1%
4 6.0
5 5.3
6 4.9
7 4.3
8 3.9
9 3.3
10 3.1
11 3.1
12 3.0

Related

Change data type of all columns in list of data frames before using `bind_rows()`

I have a list of data frames, e.g. from the following code:
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
rvest::read_html() %>%
html_nodes(css = 'table[class="wikitable sortable"]') %>%
html_table(fill = TRUE)
I would now like to combine the dataframes into one, e.g. with dplyr::bind_rows() but get the Error: Can't combine ..1$Deaths<integer> and..5$Deaths <character>. (the answer suggested here doesn't do the trick).
So I need to convert the data types before using row binding. I would like to use this inside a pipe (a tidyverse solution would be ideal) and not loop through the data frames due to the structure of the remaining project but instead use something vectorized like lapply(., function(x) {lapply(x %>% mutate_all, as.character)}) (which doesn't work) to convert all values to character.
Can someone help me with this?
You can change all the column classes to characters and bind them together with map_df.
library(tidyverse)
library(rvest)
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
rvest::read_html() %>%
html_nodes(css = 'table[class="wikitable sortable"]') %>%
html_table(fill = TRUE) %>%
map_df(~.x %>% mutate(across(.fns = as.character)))
# Deaths Date Attraction `Amusement park` Location Incident Injuries
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 28 14 Feb… Transvaal Park (entire … Transvaal Park Yasenevo, Mosc… NA NA
#2 15 27 Jun… Formosa Fun Coast music… Formosa Fun Coast Bali, New Taip… NA NA
#3 8 11 May… Haunted Castle; a fire … Six Flags Great … Jackson Townsh… NA NA
#4 7 9 June… Ghost Train; a fire at … Luna Park Sydney Sydney, Austra… NA NA
#5 7 14 Aug… Skylab; a crane collide… Hamburger Dom Hamburg, (Germ… NA NA
# 6 6 13 Aug… Virginia Reel; a fire a… Palisades Amusem… Cliffside Park… NA NA
# 7 6 29 Jun… Eco-Adventure Valley Sp… OCT East Yantian Distri… NA NA
# 8 5 30 May… Big Dipper; the roller … Battersea Park Battersea, Lon… NA NA
# 9 5 23 Jun… Kuzuluk Aquapark swimmi… Kuzuluk Aquapark Akyazi, Turkey… NA NA
#10 4 24 Jul… Big Dipper; a bolt came… Krug Park Omaha, Nebrask… NA NA
# … with 1,895 more rows

In result output, Additional rows for NA is showing

I'm learning R and there is one issue I am facing while running the code. I wrote the code to get the data for NY (New York). But some additional rows as complete NAs is showing up. Please help.
Dummy Data:
ID Name Industry Inception Employees State City Revenue Expenses
1 Over-Hex Software 2006 25 TN Franklin 9,684,527 1,130,700
2 Unimattax IT 2009 36 NY New York 14,016,543 804,035
3 Greenfax Retail 2012 NA SC Greenville 9,746,272 1,044,375
4 Blacklane IT 2011 66 NY New York 15,359,369 4,631,808
Result output:
2 Unimattax IT 2009 36 NY New York 14,016,543 804,035
NA <NA> <NA> NA NA <NA> <NA> NA NA
4 Blacklane IT 2011 66 NY New York 15,359,369 4,631,808
NA <NA> <NA> NA NA <NA> <NA> NA NA
fin[fin$State == "NY",] # fin is the table name
It could be an issue with having NA elements in 'State' which will result in NA when we do the ==. To avoid that, create an & expression with is.na to make those NA elements to FALSE.
fin[fin$State == "NY" & !is.na(fin$State),]
Or another option is %in%, that generates FALSE for NA
fin[fin$State %in% "NY",]

extract table from webpage using R

I am trying to extract all the table from this page using R, for html_node i had passed "table". In console the output is weird. Data is available in webpage but in R console it shows NA. Please suggest me where i had made mistake.
library(xml2)
library(rvest)
url <- "https://www.iii.org/table-archive/21110"
page <- read_html(url) #Creates an html document from URL
table <- html_table(page, fill = TRUE) #Parses tables into data frames
table
part of Output:
X4 X5 X6
1 Direct premiums written (1) Market share (2) 1
2 Market share (2) <NA> NA
3 10.6% <NA> NA
4 6.0 <NA> NA
5 5.4 <NA> NA
6 5.4 <NA> NA
7 5.2 <NA> NA
8 4.5 <NA> NA
9 3.3 <NA> NA
10 3.2 <NA> NA
11 3.0 <NA> NA
12 2.2 <NA> NA
X7 X8 X9 X10
1 State Farm Mutual Automobile Insurance $51,063,111 10.6% 2
2 <NA> <NA> <NA> NA
3 <NA> <NA> <NA> NA
4 <NA> <NA> <NA> NA
5 <NA> <NA> <NA> NA
6 <NA> <NA> <NA> NA
7 <NA> <NA> <NA> NA
8 <NA> <NA> <NA> NA
9 <NA> <NA> <NA> NA
10 <NA> <NA> <NA> NA
11 <NA> <NA> <NA> NA
12 <NA> <NA> <NA> NA
This will get all of the tables into a single data frame:
library(tidyverse)
library(rvest)
url <- "https://www.iii.org/table-archive/21110"
df <- url %>%
read_html() %>%
html_nodes("table") %>%
html_table(fill = T) %>%
lapply(., function(x) setNames(x, c("Rank", "Company", "Direct_premiums_written",
"Market_share")))
tables <- data.frame()
for (i in seq(2,18,2)) {
temp <- df[[i]]
tables <- bind_rows(tables, temp)
}
You can then subset this however you want. For example, lets extract the information from the third table that represents 2009:
table_2009 <- tables[21:30,] %>%
mutate(Year = 2009)
To add all the years at once:
years <- c(2017, 2008, 2009, 2010, 2011, 2013, 2014, 2015, 2016)
tables <- tables %>%
mutate(Year = rep(years, each = 10))
Hope this helps.
There are a couple of issues with these tables.
First, I think you'll get better results if you specify the class of table. In this case, .tablesorter.
Second, you'll note that in some tables the second column header is Group, in other cases it is Group/company. This is what causes the NA. So you need to rename the columns to be consistent for all tables.
You can get a list of tables with renamed column headers like this:
tables <- page %>%
html_nodes("table.tablesorter") %>%
html_table() %>%
lapply(., function(x) setNames(x, c("rank", "group_company",
"direct_premiums_written", "market_share")))
Looking at the web page we see that the tables are for years 2017, 2008 to 2011 and 2013 to 2016. So we could add these years as names to the list then bind the tables together with a column for year:
library(dplyr)
tables <- setNames(tables, c(2017, 2008:2011, 2013:2016)) %>%
bind_rows(.id = "Year")
There are multiple items in the list that you have named table. (Not a good practice: there's a function by that name.)
str(tbl)
List of 18
$ :'data.frame': 12 obs. of 45 variables:
..$ X1 : chr [1:12] "Rank\nGroup/company\nDirect premiums written (1)\nMarket share (2)\n1\nState Farm Mutual Automobile Insurance\n"| __truncated__ "Rank" "1" "2" ...
..$ X2 : chr [1:12] "Rank" "Group/company" "State Farm Mutual Automobile Insurance" "Berkshire Hathaway Inc." ...
..$ X3 : chr [1:12] "Group/company" "Direct premiums written (1)" "$64,892,583" "38,408,251" ...
snippped rest of long output
Perhaps you only want the last one?
tbl[[18]]
Rank Group/company
1 1 State Farm Mutual Automobile Insurance
2 2 Berkshire Hathaway Inc.
3 3 Liberty Mutual
4 4 Allstate Corp.
5 5 Progressive Corp.
6 6 Travelers Companies Inc.
7 7 Chubb Ltd.
8 8 Nationwide Mutual Group
9 9 Farmers Insurance Group of Companies (3)
10 10 USAA Insurance Group
Direct premiums written (1) Market share (2)
1 $62,189,311 10.2%
2 33,300,439 5.4
3 32,217,215 5.3
4 30,875,771 5.0
5 23,951,690 3.9
6 23,918,048 3.9
7 20,786,847 3.4
8 19,756,093 3.2
9 19,677,601 3.2
10 18,273,675 3.0
Nope; going back to the page it's clear you want the first, but its structure appears to have been misinterpreted and the data has been arranged as "wide", with all the data residing in the first row. So some of the columns are being displayed and the rest of the data seems to be messed up; Just take columns 2:4:
tbl[[1]][ ,c('X2','X3','X4')]
X2 X3
1 Rank Group/company
2 Group/company Direct premiums written (1)
3 State Farm Mutual Automobile Insurance $64,892,583
4 Berkshire Hathaway Inc. 38,408,251
5 Liberty Mutual 33,831,726
6 Allstate Corp. 31,501,664
7 Progressive Corp. 27,862,882
8 Travelers Companies Inc. 24,875,076
9 Chubb Ltd. 21,266,737
10 USAA Insurance Group 20,151,368
11 Farmers Insurance Group of Companies (3) 19,855,517
12 Nationwide Mutual Group 19,218,907
X4
1 Direct premiums written (1)
2 Market share (2)
3 10.1%
4 6.0
5 5.3
6 4.9
7 4.3
8 3.9
9 3.3
10 3.1
11 3.1
12 3.0

How can I overcome this error Error in tbl_vars(y) : argument "y" is missing, with no default?

I am trying to perform an inner join on 2 tables.
One is a hotel dataset which I have tokenized before using
df1 = read.csv("chennai.csv", header = TRUE, stringsAsFactors=FALSE)
library(dplyr)
library(tidytext)
hotel <- df1 %>% unnest_tokens(word,Review_Text)
data("stop_words")
hotel <- hotel %>%
anti_join(stop_words)
head(hotel)
Hotel_name Review_Title Sentiment
1 Accord Metropolitan Excellent comfortableness during stay 3
2 Accord Metropolitan Excellent comfortableness during stay 3
3 Accord Metropolitan Excellent comfortableness during stay 3
4 Accord Metropolitan Excellent comfortableness during stay 3
5 Accord Metropolitan Excellent comfortableness during stay 3
6 Accord Metropolitan Not too comfortable 1
Rating_Percentage X X.1 X.2 X.3 word
1 100 NA NA NA nice
2 100 NA NA NA stay
3 100 NA NA NA business
4 100 NA NA NA tourist
5 100 NA NA NA purpose
6 20 NA NA NA hotel
I have also used a simplified version of General Inquirer Dictionary spreadsheet
df <- read.csv("ib.csv", header=T, stringsAsFactors=FALSE)
dat <-subset(df, select=c(2,1))
head(dat)
word Scoree
1 A
2 ABANDON Negativ
3 ABANDONMENT Negativ
4 ABATE Negativ
5 ABATEMENT
6 ABDICATE Negativ
I have tried to do an inner_join where I encounter this error.
observation<- hotel %>%
+ inner_join(dat, by = "word") %>%
+ count(Scoree)

Web scraping with R, content

I am just starting with web scraping in R, I put this code:
mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")
mps %>%
html_nodes("tr") %>%
html_text()
To get the needed content that I put in a text file. My problem is that I want to eliminate these red points, but I can't. Could you please help me?
I think these points are replacing <b> and <br> in the html code.
Whoever constructed that page very frustratingly assembled the table within a table, but not defined as a <table> tag itself, so it's easiest to redefine it so it will parse more easily:
library(rvest)
mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")
df <- mps %>%
html_nodes("tr.Entete1, tr.Tableau1") %>% # get correct rows
paste(collapse = '\n') %>% # paste nodes back to a single string
paste('<table>', ., '</table>') %>% # add enclosing table node
read_html() %>% # reread as HTML
html_node('table') %>%
html_table(fill = TRUE) %>% # parse as table
{ setNames(.[-1,], make.names(.[1,], unique = TRUE)) } # grab names from first row
head(df)
#> X Région NA. Nature NA..1 Type NA..2
#> 2 Prix <NA> NA <NA> NA <NA> NA
#> 3 Modifiée NA <NA> NA <NA> NA
#> 4 Kelibia NA Terrain NA Terrain nu NA
#> 5 Cite El Ghazala NA Location NA App. 4 pièc NA
#> 6 Le Bardo NA Location NA App. 1 pièc NA
#> 7 Le Bardo NA Location vacance NA App. 3 pièc NA
#> Texte.annonce NA..3 Prix Prix.1 X.1 Modifiée
#> 2 <NA> NA <NA> <NA> <NA> <NA>
#> 3 <NA> NA <NA> <NA> <NA> <NA>
#> 4 Terrain a 5 km de kelibi NA 80 000 07/05/2017
#> 5 S plus 3 haut standing c NA 790 07/05/2017
#> 6 Appartements meubles NA 40 000 07/05/2017
#> 7 Un bel appartement au bardo m NA 420 07/05/2017
#> Modifiée.1 NA..4 NA..5
#> 2 <NA> NA NA
#> 3 <NA> NA NA
#> 4 <NA> NA NA
#> 5 <NA> NA NA
#> 6 <NA> NA NA
#> 7 <NA> NA NA
Note there's a lot of NAs and other cruft here yet to be cleaned up, but at least it's usable at this point.
You can always use regular expressions to remove undesired chars, e.g.,
mps <- gsub("•", " ", mps)

Resources