How can I overcome this error Error in tbl_vars(y) : argument "y" is missing, with no default? - r

I am trying to perform an inner join on 2 tables.
One is a hotel dataset which I have tokenized before using
df1 = read.csv("chennai.csv", header = TRUE, stringsAsFactors=FALSE)
library(dplyr)
library(tidytext)
hotel <- df1 %>% unnest_tokens(word,Review_Text)
data("stop_words")
hotel <- hotel %>%
anti_join(stop_words)
head(hotel)
Hotel_name Review_Title Sentiment
1 Accord Metropolitan Excellent comfortableness during stay 3
2 Accord Metropolitan Excellent comfortableness during stay 3
3 Accord Metropolitan Excellent comfortableness during stay 3
4 Accord Metropolitan Excellent comfortableness during stay 3
5 Accord Metropolitan Excellent comfortableness during stay 3
6 Accord Metropolitan Not too comfortable 1
Rating_Percentage X X.1 X.2 X.3 word
1 100 NA NA NA nice
2 100 NA NA NA stay
3 100 NA NA NA business
4 100 NA NA NA tourist
5 100 NA NA NA purpose
6 20 NA NA NA hotel
I have also used a simplified version of General Inquirer Dictionary spreadsheet
df <- read.csv("ib.csv", header=T, stringsAsFactors=FALSE)
dat <-subset(df, select=c(2,1))
head(dat)
word Scoree
1 A
2 ABANDON Negativ
3 ABANDONMENT Negativ
4 ABATE Negativ
5 ABATEMENT
6 ABDICATE Negativ
I have tried to do an inner_join where I encounter this error.
observation<- hotel %>%
+ inner_join(dat, by = "word") %>%
+ count(Scoree)
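One likely cause, assuming the + signs at the start of the continuation lines were pasted from the console along with the code: R then reads "+ inner_join(dat, by = "word")" as a unary plus applied to a call that has only one table, so y is missing. A second thing worth checking is letter case, because unnest_tokens() lowercases tokens by default while the dictionary words are upper case. A minimal sketch, using small made-up data frames as stand-ins for hotel and dat (the scores shown are illustrative, not taken from the real dictionary):

```r
library(dplyr)

# Made-up stand-ins for the tokenized reviews and the dictionary
hotel <- data.frame(word = c("nice", "abandon", "abate", "stay"),
                    stringsAsFactors = FALSE)
dat <- data.frame(word = c("ABANDON", "ABATE", "NICE"),
                  Scoree = c("Negativ", "Negativ", "Positiv"),
                  stringsAsFactors = FALSE)

# No leading "+" on the continuation lines; lowercase the dictionary
# words so they can match the lowercased tokens
observation <- hotel %>%
  inner_join(mutate(dat, word = tolower(word)), by = "word") %>%
  count(Scoree)

observation
```

Words that do not appear in the dictionary (here "stay") are dropped by the inner join, so they do not contribute to the counts.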

Related

What's wrong: rvest's error in open.connection(x, "rb") and readHTMLTable()'s "XML content does not seem to be XML"? [duplicate]

I am trying to extract all the tables from this page using R; for html_node I passed "table". The console output is weird: the data is available on the web page, but the R console shows NA. Please tell me where I made a mistake.
library(xml2)
library(rvest)
url <- "https://www.iii.org/table-archive/21110"
page <- read_html(url) #Creates an html document from URL
table <- html_table(page, fill = TRUE) #Parses tables into data frames
table
Part of the output:
X4 X5 X6
1 Direct premiums written (1) Market share (2) 1
2 Market share (2) <NA> NA
3 10.6% <NA> NA
4 6.0 <NA> NA
5 5.4 <NA> NA
6 5.4 <NA> NA
7 5.2 <NA> NA
8 4.5 <NA> NA
9 3.3 <NA> NA
10 3.2 <NA> NA
11 3.0 <NA> NA
12 2.2 <NA> NA
X7 X8 X9 X10
1 State Farm Mutual Automobile Insurance $51,063,111 10.6% 2
2 <NA> <NA> <NA> NA
3 <NA> <NA> <NA> NA
4 <NA> <NA> <NA> NA
5 <NA> <NA> <NA> NA
6 <NA> <NA> <NA> NA
7 <NA> <NA> <NA> NA
8 <NA> <NA> <NA> NA
9 <NA> <NA> <NA> NA
10 <NA> <NA> <NA> NA
11 <NA> <NA> <NA> NA
12 <NA> <NA> <NA> NA
This will get all of the tables into a single data frame:
library(tidyverse)
library(rvest)
url <- "https://www.iii.org/table-archive/21110"
df <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  lapply(., function(x) setNames(x, c("Rank", "Company", "Direct_premiums_written",
                                      "Market_share")))
tables <- data.frame()
for (i in seq(2,18,2)) {
temp <- df[[i]]
tables <- bind_rows(tables, temp)
}
You can then subset this however you want. For example, let's extract the information from the third table, which represents 2009:
table_2009 <- tables[21:30,] %>%
mutate(Year = 2009)
To add all the years at once:
years <- c(2017, 2008, 2009, 2010, 2011, 2013, 2014, 2015, 2016)
tables <- tables %>%
mutate(Year = rep(years, each = 10))
Hope this helps.
There are a couple of issues with these tables.
First, I think you'll get better results if you specify the class of table. In this case, .tablesorter.
Second, you'll note that in some tables the second column header is Group, while in others it is Group/company. This is what causes the NA. So you need to rename the columns to be consistent for all tables.
You can get a list of tables with renamed column headers like this:
tables <- page %>%
html_nodes("table.tablesorter") %>%
html_table() %>%
lapply(., function(x) setNames(x, c("rank", "group_company",
"direct_premiums_written", "market_share")))
Looking at the web page we see that the tables are for years 2017, 2008 to 2011 and 2013 to 2016. So we could add these years as names to the list then bind the tables together with a column for year:
library(dplyr)
tables <- setNames(tables, c(2017, 2008:2011, 2013:2016)) %>%
bind_rows(.id = "Year")
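To see what the naming step does in isolation, here is a minimal sketch on two tiny made-up tables (the company names and years are placeholders, not scraped data). bind_rows() copies each list element's name into the column given by .id:

```r
library(dplyr)

# Two made-up tables standing in for the scraped list
tables <- list(
  data.frame(rank = 1:2, group_company = c("A", "B"), stringsAsFactors = FALSE),
  data.frame(rank = 1:2, group_company = c("A", "C"), stringsAsFactors = FALSE)
)

# Name the list elements by year, then bind; the names end up in "Year"
combined <- setNames(tables, c(2017, 2008)) %>%
  bind_rows(.id = "Year")

combined
```

Note that the Year column comes back as character, so it may need as.integer() before any numeric use.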
There are multiple items in the list that you have named table. (Not a good practice: there's a function by that name.)
str(tbl)
List of 18
$ :'data.frame': 12 obs. of 45 variables:
..$ X1 : chr [1:12] "Rank\nGroup/company\nDirect premiums written (1)\nMarket share (2)\n1\nState Farm Mutual Automobile Insurance\n"| __truncated__ "Rank" "1" "2" ...
..$ X2 : chr [1:12] "Rank" "Group/company" "State Farm Mutual Automobile Insurance" "Berkshire Hathaway Inc." ...
..$ X3 : chr [1:12] "Group/company" "Direct premiums written (1)" "$64,892,583" "38,408,251" ...
snipped rest of long output
Perhaps you only want the last one?
tbl[[18]]
Rank Group/company
1 1 State Farm Mutual Automobile Insurance
2 2 Berkshire Hathaway Inc.
3 3 Liberty Mutual
4 4 Allstate Corp.
5 5 Progressive Corp.
6 6 Travelers Companies Inc.
7 7 Chubb Ltd.
8 8 Nationwide Mutual Group
9 9 Farmers Insurance Group of Companies (3)
10 10 USAA Insurance Group
Direct premiums written (1) Market share (2)
1 $62,189,311 10.2%
2 33,300,439 5.4
3 32,217,215 5.3
4 30,875,771 5.0
5 23,951,690 3.9
6 23,918,048 3.9
7 20,786,847 3.4
8 19,756,093 3.2
9 19,677,601 3.2
10 18,273,675 3.0
Nope; going back to the page, it's clear you want the first one, but its structure appears to have been misinterpreted: the data has been arranged as "wide", with all of it residing in the first row. So some of the columns are being displayed and the rest of the data seems to be messed up. Just take columns 2:4:
tbl[[1]][ ,c('X2','X3','X4')]
X2 X3
1 Rank Group/company
2 Group/company Direct premiums written (1)
3 State Farm Mutual Automobile Insurance $64,892,583
4 Berkshire Hathaway Inc. 38,408,251
5 Liberty Mutual 33,831,726
6 Allstate Corp. 31,501,664
7 Progressive Corp. 27,862,882
8 Travelers Companies Inc. 24,875,076
9 Chubb Ltd. 21,266,737
10 USAA Insurance Group 20,151,368
11 Farmers Insurance Group of Companies (3) 19,855,517
12 Nationwide Mutual Group 19,218,907
X4
1 Direct premiums written (1)
2 Market share (2)
3 10.1%
4 6.0
5 5.3
6 4.9
7 4.3
8 3.9
9 3.3
10 3.1
11 3.1
12 3.0

Change data type of all columns in list of data frames before using `bind_rows()`

I have a list of data frames, e.g. from the following code:
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
rvest::read_html() %>%
html_nodes(css = 'table[class="wikitable sortable"]') %>%
html_table(fill = TRUE)
I would now like to combine the data frames into one, e.g. with dplyr::bind_rows(), but I get the error: Can't combine ..1$Deaths <integer> and ..5$Deaths <character>. (The answer suggested here doesn't do the trick.)
So I need to convert the data types before using row binding. I would like to use this inside a pipe (a tidyverse solution would be ideal) and not loop through the data frames due to the structure of the remaining project but instead use something vectorized like lapply(., function(x) {lapply(x %>% mutate_all, as.character)}) (which doesn't work) to convert all values to character.
Can someone help me with this?
You can convert every column to character and bind the data frames together with map_df.
library(tidyverse)
library(rvest)
"https://en.wikipedia.org/wiki/List_of_accidents_and_disasters_by_death_toll" %>%
rvest::read_html() %>%
html_nodes(css = 'table[class="wikitable sortable"]') %>%
html_table(fill = TRUE) %>%
map_df(~ .x %>% mutate(across(everything(), as.character)))
# Deaths Date Attraction `Amusement park` Location Incident Injuries
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 28 14 Feb… Transvaal Park (entire … Transvaal Park Yasenevo, Mosc… NA NA
#2 15 27 Jun… Formosa Fun Coast music… Formosa Fun Coast Bali, New Taip… NA NA
#3 8 11 May… Haunted Castle; a fire … Six Flags Great … Jackson Townsh… NA NA
#4 7 9 June… Ghost Train; a fire at … Luna Park Sydney Sydney, Austra… NA NA
#5 7 14 Aug… Skylab; a crane collide… Hamburger Dom Hamburg, (Germ… NA NA
# 6 6 13 Aug… Virginia Reel; a fire a… Palisades Amusem… Cliffside Park… NA NA
# 7 6 29 Jun… Eco-Adventure Valley Sp… OCT East Yantian Distri… NA NA
# 8 5 30 May… Big Dipper; the roller … Battersea Park Battersea, Lon… NA NA
# 9 5 23 Jun… Kuzuluk Aquapark swimmi… Kuzuluk Aquapark Akyazi, Turkey… NA NA
#10 4 24 Jul… Big Dipper; a bolt came… Krug Park Omaha, Nebrask… NA NA
# … with 1,895 more rows
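The same idea can be tried without hitting the network. The sketch below uses two made-up data frames whose Deaths columns have different types, and it swaps map_df() for map() followed by bind_rows(), which is an equivalent two-step form:

```r
library(dplyr)
library(purrr)

# Made-up tables: Deaths is integer in one and character in the other
tbls <- list(
  data.frame(Deaths = c(28L, 15L), Location = c("x", "y"),
             stringsAsFactors = FALSE),
  data.frame(Deaths = c("100+", "7"), Location = c("z", "w"),
             stringsAsFactors = FALSE)
)

# Convert every column of every table to character, then bind
combined <- tbls %>%
  map(~ mutate(.x, across(everything(), as.character))) %>%
  bind_rows()

str(combined)
```

Once the rows are bound, individual columns can be converted back with something like readr::parse_number() where the values allow it.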

Removing "outer rows" to allow for interpolation (and prevent extrapolation)

I have (left)joined two data frames by country-year.
df<- left_join(df, df2, by="country-year")
leading to the following example output:
country country-year a b
1 France France2000 NA NA
2 France France2001 1000 1000
3 France France2002 NA NA
4 France France2003 1600 2200
5 France France2004 NA NA
6 UK UK2000 1000 1000
7 UK UK2001 NA NA
8 UK UK2002 1000 1000
9 UK UK2003 NA NA
10 UK UK2004 NA NA
I initially wanted to remove all values for which both of the added columns (a,b) were NA.
df<-df[!is.na( df$a | df$b ),]
However, on second thought, I decided I wanted to interpolate the data I had (but not extrapolate). So instead I would like to remove all the rows for which I cannot interpolate; in the example:
1 France France2000 NA NA
5 France France2004 NA NA
9 UK UK2003 NA NA
10 UK UK2004 NA NA
I believe there are 2 options. First I somehow adapt this function:
library(tidyerse)
TRcomplete<-TRcomplete%>%
group_by(country) %>%
mutate_at(a:b,~na.fill(.x,"extend"))
to interpolate only, and then apply df<-df[!is.na( df$a | df$b ),] to remove the remaining NA rows,
or I write code to remove the "outer" rows first and then use "extend" as normal. Desired output:
country country-year a b
2 France France2001 1000 1000
3 France France2002 1300 1600
4 France France2003 1600 2200
6 UK UK2000 1000 1000
7 UK UK2001 0 0
8 UK UK2002 1000 1000
Any suggestions?
There are options in na.fill to specify what is done. If you look at ?na.fill, you see that fill can specify the left, interior and right, so if you specify the left and right are NA and the interior is "extend", then it will only fill the interior data. You can then filter the rows with NA.
library(tidyverse)
library(zoo)
df %>%
group_by(country) %>%
mutate_at(vars(a:b),~na.fill(.x,c(NA, "extend", NA))) %>%
filter(!is.na(a) | !is.na(b))
By the way, you have a typo in your library(tidyverse) statement; you are missing the v.
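For comparison, the same interior-only interpolation can be sketched in base R with approx(), whose rule = 1 setting returns NA outside the range of the known points. The helper name interp_inside is hypothetical, the data frame is rebuilt from the example with a plain year column instead of country-year, and the rows are assumed to be sorted by year within each country:

```r
# Hypothetical helper: interpolate interior gaps only;
# leading and trailing NAs stay NA
interp_inside <- function(y) {
  ok <- !is.na(y)
  if (sum(ok) < 2) return(y)  # fewer than two known points: nothing to do
  x <- seq_along(y)
  # rule = 1 gives NA for positions outside the known points
  approx(x[ok], y[ok], xout = x, rule = 1)$y
}

df <- data.frame(
  country = rep(c("France", "UK"), each = 5),
  year = rep(2000:2004, times = 2),
  a = c(NA, 1000, NA, 1600, NA, 1000, NA, 1000, NA, NA),
  b = c(NA, 1000, NA, 2200, NA, 1000, NA, 1000, NA, NA),
  stringsAsFactors = FALSE
)

# Apply the helper within each country, then drop rows that are still NA
df$a <- ave(df$a, df$country, FUN = interp_inside)
df$b <- ave(df$b, df$country, FUN = interp_inside)
res <- df[!is.na(df$a) | !is.na(df$b), ]

res
```

Linear interpolation between two equal neighbours returns that same value, so UK 2001 comes out as 1000 here.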

Web scraping with R, content

I am just starting with web scraping in R. I wrote this code:
mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")
mps %>%
html_nodes("tr") %>%
html_text()
I use it to get the content I need, which I then write to a text file. My problem is that I want to remove these red points, but I can't. Could you please help me?
I think these points are replacing <b> and <br> in the html code.
Whoever constructed that page very frustratingly assembled the table within a table, but not defined as a <table> tag itself, so it's easiest to redefine it so it will parse more easily:
library(rvest)
mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")
df <- mps %>%
html_nodes("tr.Entete1, tr.Tableau1") %>% # get correct rows
paste(collapse = '\n') %>% # paste nodes back to a single string
paste('<table>', ., '</table>') %>% # add enclosing table node
read_html() %>% # reread as HTML
html_node('table') %>%
html_table(fill = TRUE) %>% # parse as table
{ setNames(.[-1,], make.names(.[1,], unique = TRUE)) } # grab names from first row
head(df)
#> X Région NA. Nature NA..1 Type NA..2
#> 2 Prix <NA> NA <NA> NA <NA> NA
#> 3 Modifiée NA <NA> NA <NA> NA
#> 4 Kelibia NA Terrain NA Terrain nu NA
#> 5 Cite El Ghazala NA Location NA App. 4 pièc NA
#> 6 Le Bardo NA Location NA App. 1 pièc NA
#> 7 Le Bardo NA Location vacance NA App. 3 pièc NA
#> Texte.annonce NA..3 Prix Prix.1 X.1 Modifiée
#> 2 <NA> NA <NA> <NA> <NA> <NA>
#> 3 <NA> NA <NA> <NA> <NA> <NA>
#> 4 Terrain a 5 km de kelibi NA 80 000 07/05/2017
#> 5 S plus 3 haut standing c NA 790 07/05/2017
#> 6 Appartements meubles NA 40 000 07/05/2017
#> 7 Un bel appartement au bardo m NA 420 07/05/2017
#> Modifiée.1 NA..4 NA..5
#> 2 <NA> NA NA
#> 3 <NA> NA NA
#> 4 <NA> NA NA
#> 5 <NA> NA NA
#> 6 <NA> NA NA
#> 7 <NA> NA NA
Note there's a lot of NAs and other cruft here yet to be cleaned up, but at least it's usable at this point.
You can always use regular expressions to remove undesired chars, e.g.,
mps <- gsub("•", " ", mps)
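A slightly fuller sketch of the regular-expression route, applied to the extracted text rather than to the parsed document. It assumes the unwanted points are the bullet character U+2022, and the sample strings are made up:

```r
# Made-up strings standing in for the output of html_text()
txt <- c("\u2022 Kelibia \u2022 Terrain", "Le Bardo")

# Replace each bullet with a space, collapse runs of whitespace, trim the ends
no_bullets <- gsub("\u2022", " ", txt, fixed = TRUE)
clean <- trimws(gsub("\\s+", " ", no_bullets))

clean
```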

Clustering / Matching Over Many Dimensions in R

I have a very large and complex data set with many observations of companies. Some of the observations of the companies are redundant and I need to make a key to map the redundant observations to a single one. However the only way to tell if they are actually representing the same company is through the similarity of a variety of variables. I think the appropriate approach is a kind of clustering based on a variety of conditions or perhaps even some kind of propensity score matching. Perhaps I just need flexible tools for making a complex kind of similarity matrix.
Unfortunately, I am not quite sure how to go about that in R. Most of the tools I've seen for clustering and categorizing seem to do so with either numerical distance or categorical data, but don't seem to allow multiple conditions or user specified conditions.
Below I've tried to create a smaller, public example of the kind of data I am working with and the result I am trying to produce. There are some conditions that must apply, for example, the location must be the same. There are some features that may associate one with another, for example var1 and var2. Then there are some features that may associate one with another, but they must not conflict, such as var3.
An additional layer of complexity is that the kind of association I am trying to use to map the redundant observation varies. For example, id1 and id2 are the same company redundantly entered into the data twice. In one place its name is "apples" and another "red apples". They share the same location, var1 value and var3 (after adjusting for formatting). Similarly ids 3, 5 and 6, are also really just one company, though much of the input for each is different. Some clusters would identify multiple observations, others would only have one. Ideally I would like to find a way to categorize or associate the observations based on several conditions, for example:
1. Test that the location is the same
2. Test whether var3 is different
3. Test whether a name is a substring of another
4. Test the edit distance of names
5. Test the similarity of var1 and var2 between observations
Anyways, hopefully there are better, more flexible tools for this than what I am finding or someone has experience with this kind of data work in R. Any and all suggestions and advice are much appreciated!
Data
id name location var1 var2 var3
1 apples US 1 abc 12345
2 red apples US 1 NA 12-345
3 green apples Mexico 2 def 235-92
4 bananas Brazil 2 abc NA
5 oranges Mexico 2 NA 23592
6 green apple Mexico NA def NA
7 tangerines Honduras NA abc 3498
8 mango Honduras 1 NA NA
9 strawberries Honduras NA abcd 3498
10 strawberry Honduras NA abc 3498
11 blueberry Brazil 1 abcd 2348
12 blueberry Brazil 3 abc NA
13 blueberry Mexico NA def 1859
14 bananas Brazil 1 def 2348
15 blackberries Honduras NA abc NA
16 grapes Mexico 6 qrs NA
17 grapefruits Brazil 1 NA 1379
18 grapefruit Brazil 2 bcd 1379
19 mango Brazil 3 efaq NA
20 fuji apples US 4 NA 189-35
Result
id name location var1 var2 var3 Result
1 apples US 1 abc 12345 1
2 red apples US 1 NA 12-345 1
3 green apples Mexico 2 def 235-92 3
4 bananas Brazil 2 abc NA 4
5 oranges Mexico 2 NA 23592 3
6 green apple Mexico NA def NA 3
7 tangerines Honduras NA abc 3498 7
8 mango Honduras 1 NA NA 8
9 strawberries Honduras NA abcd 3498 7
10 strawberry Honduras NA abc 3498 7
11 blueberry Brazil 1 abcd 2348 11
12 blueberry Brazil 3 abc NA 11
13 blueberry Mexico NA def 1859 13
14 bananas Brazil 1 def 2348 11
15 blackberries Honduras NA abc NA 15
16 grapes Mexico 6 qrs NA 16
17 grapefruits Brazil 1 NA 1379 17
18 grapefruit Brazil 2 bcd 1379 17
19 mango Brazil 3 efaq NA 19
20 fuji apples US 4 NA 189-35 20
Thanks in advance for your time and help!
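library(stringdist)

# For each row, return the index of the closest other row, or the row's own
# index when no other row is within `tolerance`.
getMatches <- function(df, tolerance = 6) {
  out <- integer(nrow(df))
  for (row in seq_len(nrow(df))) {
    dists <- numeric(nrow(df))
    # Sum the Levenshtein distances over all columns
    for (col in seq_len(ncol(df))) {
      tempDist <- stringdist(df[row, col], df[, col], method = "lv")
      # WARNING: Matches NA perfectly.
      tempDist[is.na(tempDist)] <- 0
      dists <- dists + tempDist
    }
    dists[row] <- Inf  # never match a row to itself
    min_dist <- min(dists)
    if (min_dist < tolerance) {
      out[row] <- which.min(dists)
    } else {
      out[row] <- row
    }
  }
  return(out)
}

# Drop the id column before matching
test$Result <- getMatches(test[, -1])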
Here test is your data. This almost certainly needs some refining and some postprocessing. It creates a column with the index of the closest match. If it can't find a match within the given tolerance, it returns the row's own index.
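One possible refinement, offered as a sketch rather than a finished method: enforce the hard condition from the question (same location) by clustering names only within each location. It uses base R's adist() for edit distance with single-linkage hclust(), so it is a different technique from the nearest-match loop above. The function name cluster_names and the max_dist threshold are hypothetical, and the sample rows are a small subset of the question's data:

```r
# Hypothetical helper: within each location, group names whose edit
# distance chains together at or below max_dist; label every group
# with the smallest row index it contains.
cluster_names <- function(name, location, max_dist = 3) {
  result <- integer(length(name))
  for (loc in unique(location)) {
    idx <- which(location == loc)
    if (length(idx) == 1) {        # a lone row is its own cluster
      result[idx] <- idx
      next
    }
    d <- as.dist(adist(name[idx]))                  # pairwise edit distances
    grp <- cutree(hclust(d, method = "single"), h = max_dist)
    result[idx] <- ave(idx, grp, FUN = min)
  }
  result
}

name <- c("apples", "red apples", "green apples", "green apple", "mango")
location <- c("US", "US", "Mexico", "Mexico", "Honduras")

res <- cluster_names(name, location, max_dist = 4)
res
```

Other conditions, such as rejecting a pair whose var3 values conflict, could be added by setting the corresponding entries of the distance matrix to a large value before clustering.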
EDIT: I will attempt some more later.

Resources