Revisiting "Can't rename columns that don't exist" - r

I tried to rename columns, which should be a very straightforward operation, but I keep getting errors. I tried two methods and neither of them works. Can anyone explain what needs to be done to rename columns without getting these strange errors? I tried the suggestions in several SO posts, but none of them worked.
library(pacman)
#> Warning: package 'pacman' was built under R version 4.2.1
p_load(dplyr, readr)
data = read_csv("https://raw.githubusercontent.com/srk7774/data/master/august_october_2020.csv",
col_names = TRUE)
#> Rows: 16 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): X.1
#> dbl (2): Total Agree - August 2020, Total Agree - October 2020
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
column_recodes <- c(X.1 = "country",
august = "Total Agree - August 2020",
october = "`Total Agree - October 2020",
`Another non-existent column name` = "bar")
data %>% rename_with(~recode(., !!!column_recodes))
#> # A tibble: 16 × 3
#> country `Total Agree - August 2020` `Total Agree - October 2020`
#> <chr> <dbl> <dbl>
#> 1 Total 77 73
#> 2 India 87 87
#> 3 China 97 85
#> 4 South Korea 84 83
#> 5 Brazil 88 81
#> 6 Australia 88 79
#> 7 United Kingdom 85 79
#> 8 Mexico 75 78
#> 9 Canada 76 76
#> 10 Germany 67 69
#> 11 Japan 75 69
#> 12 South Africa 64 68
#> 13 Italy 67 65
#> 14 Spain 72 64
#> 15 United States 67 64
#> 16 France 59 54
data %>%
rename(country = X.1,
august = Total.Agree...August.2020,
october = Total.Agree...October.2020)
#> Error in `chr_as_locations()`:
#> ! Can't rename columns that don't exist.
#> ✖ Column `Total.Agree...August.2020` doesn't exist.
Created on 2022-10-24 by the reprex package (v2.0.1)

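read_csv() keeps the column names exactly as they appear in the file (unlike read.csv(), which by default converts them to syntactic names such as Total.Agree...August.2020), so names containing spaces must be wrapped in backticks: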
data %>%
rename(country = X.1,
august = `Total Agree - August 2020`,
october = `Total Agree - October 2020`)
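When the lookup may name columns that are absent (as in the first attempt in the question), one option is to pass a named character vector of the form new = "old" to rename() through any_of(), which skips names that are not present. The sketch below is a minimal illustration: `survey` is a small stand-in for the CSV (values copied from its first rows), and `lookup` and `other` are names chosen for this example.

```r
library(dplyr)

# Small stand-in for the CSV in the question (first three rows only)
survey <- data.frame(
  X.1 = c("Total", "India", "China"),
  `Total Agree - August 2020` = c(77, 87, 97),
  `Total Agree - October 2020` = c(73, 87, 85),
  check.names = FALSE
)

# new_name = "old_name"; the last entry points at a column that does not exist
lookup <- c(
  country = "X.1",
  august  = "Total Agree - August 2020",
  october = "Total Agree - October 2020",
  other   = "Another non-existent column name"
)

# any_of() silently ignores lookup entries that are not in the data
renamed <- survey %>% rename(any_of(lookup))
names(renamed)
```

Using all_of() instead of any_of() would raise an error for the missing column, which may be preferable when every rename is expected to succeed.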

Related

dplyr arrange is not working while order is fine

I am trying to get the 10 largest investors in a country, but I get confusing results using arrange in dplyr versus order in base R.
head(fdi_partner)
gives the following result:
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
<chr> <chr> <chr>
1 TOTAL 1818 38854.3
2 Singapore 231 11358.66
3 Korea Rep.of 377 7679.9
4 Japan 204 4325.79
5 Netherlands 24 4209.64
6 China, PR 216 3001.79
and
fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
arrange("Number of projects") %>%
head()
gives almost the same result:
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Singapore 231 11359.
3 Korea Rep.of 377 7680.
4 Japan 204 4326.
5 Netherlands 24 4210.
6 China, PR 216 3002.
while the following code is working fine with base R
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric)
head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n=11)
which gives
# A tibble: 11 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Korea Rep.of 377 7680.
3 Singapore 231 11359.
4 China, PR 216 3002.
5 Japan 204 4326.
6 Hong Kong SAR (China) 132 2365.
7 United States 83 783.
8 Taiwan 66 1464.
9 United Kingdom 50 331.
10 F.R Germany 37 131.
11 Thailand 36 370.
Can anybody help explain what's wrong with my code?
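arrange() (like most dplyr verbs that use data masking) expects unquoted column names. A quoted string such as "Number of projects" is treated as a constant, so every row gets the same sort key and the order is left unchanged. If your variable name has a space in it, you must wrap it in backticks: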
library(dplyr)
test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Your code (doesn't work)
test %>%
arrange("My variable")
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Solution
test %>%
arrange(`My variable`)
#> My variable var2
#> 1 1 1
#> 2 2 1
#> 3 3 1
Created on 2023-01-05 with reprex v2.0.2
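If the column name is only available as a string (for example, stored in a variable), the `.data` pronoun can be used inside arrange(), and desc() gives the largest values first. This is a minimal sketch: `fdi` is a cut-down stand-in for the asker's data with values taken from the printed output, and `sort_col` is a name chosen for this example.

```r
library(dplyr)

# Cut-down stand-in for fdi_partner, already numeric
fdi <- data.frame(
  `Main counterparts` = c("Singapore", "Korea Rep.of", "Japan", "Netherlands"),
  `Number of projects` = c(231, 377, 204, 24),
  check.names = FALSE
)

sort_col <- "Number of projects"

# .data[[...]] looks the string up as a column; desc() sorts largest first
top <- fdi %>% arrange(desc(.data[[sort_col]]))
top[["Main counterparts"]]
```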

R combine rows and columns within a dataframe

I've looked around for a while trying to figure this out, but I just can't seem to describe my problem concisely enough to google my way out of it. I am trying to work with Michigan COVID stats where the data has Detroit listed separately from Wayne County. I need to add Detroit's numbers to Wayne County's numbers, then remove the Detroit rows from the data frame.
I have included a screen grab too. For the purposes of this problem, can someone explain how I can get Detroit City added to Dickinson, and then make the Detroit City rows disappear? Thanks.
library(tidyverse)
library(openxlsx)
cases_deaths <- read.xlsx("https://www.michigan.gov/coronavirus/-/media/Project/Websites/coronavirus/Cases-and-Deaths/4-20-2022/Cases-and-Deaths-by-County-2022-04-20.xlsx?rev=f9f34cd7a4614efea0b7c9c00a00edfd&hash=AA277EC28A17C654C0EE768CAB41F6B5.xlsx")[,-5]
# Remove rows that don't describe counties
cases_deaths <- cases_deaths[-c(51,52,101,102,147,148,167,168),]
[Screenshot: code chunk output]
You could do:
cases_deaths %>%
filter(COUNTY %in% c("Wayne", "Detroit City")) %>%
mutate(COUNTY = "Wayne") %>%
group_by(COUNTY, CASE_STATUS) %>%
summarize_all(sum) %>%
bind_rows(cases_deaths %>%
filter(!COUNTY %in% c("Wayne", "Detroit City")))
#> # A tibble: 166 x 4
#> # Groups: COUNTY [83]
#> COUNTY CASE_STATUS Cases Deaths
#> <chr> <chr> <dbl> <dbl>
#> 1 Wayne Confirmed 377396 7346
#> 2 Wayne Probable 25970 576
#> 3 Alcona Confirmed 1336 64
#> 4 Alcona Probable 395 7
#> 5 Alger Confirmed 1058 8
#> 6 Alger Probable 658 5
#> 7 Allegan Confirmed 24109 294
#> 8 Allegan Probable 3024 52
#> 9 Alpena Confirmed 4427 126
#> 10 Alpena Probable 1272 12
#> # ... with 156 more rows
Created on 2022-04-23 by the reprex package (v2.0.1)
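An alternative sketch relabels the Detroit City rows as Wayne first and then sums within each county and case status, which avoids splitting and re-binding the data. The numbers below are made up for illustration and are not the Michigan figures; only the column names follow the output above.

```r
library(dplyr)

# Made-up numbers; only the column names follow the real data
cases <- data.frame(
  COUNTY = c("Alcona", "Alcona", "Detroit City", "Detroit City", "Wayne", "Wayne"),
  CASE_STATUS = rep(c("Confirmed", "Probable"), 3),
  Cases = c(10, 2, 100, 20, 200, 30),
  Deaths = c(1, 0, 5, 1, 7, 2)
)

merged <- cases %>%
  # fold Detroit City into Wayne before aggregating
  mutate(COUNTY = if_else(COUNTY == "Detroit City", "Wayne", COUNTY)) %>%
  group_by(COUNTY, CASE_STATUS) %>%
  summarise(across(c(Cases, Deaths), sum), .groups = "drop")
merged
```

Because every county goes through the same summarise() step, the result comes back ungrouped and sorted by county rather than with Wayne moved to the top.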

R -> Sum part of Columns + aggregating observations [duplicate]

I am very new to coding and just started doing some R graphics, and now I am a bit lost with my data analysis and need some guidance. I am practising some analyses on a very long dataset with 19 countries x 12 months x 22 products, with a profit for every month. It looks something like this:
Country Month Product Profit
Brazil Jan A 50
Brazil fev A 80
Brazil mar A 15
Austria Jan A 35
Austria fev A 80
Austria mar A 47
France Jan A 21
France fev A 66
France mar A 15
[...]
France Dez C 40 etc...
I was thinking of making one graph showing the profits through the year and another for every country, so I could see the top and bottom 2 countries. I wanted to have something like:
All Countries Jan 106 or Brazil 2021 145
All Countries Fev 146 Austria 2021 162
All Countries Mar 77 France 2021 112
but the sum function can't handle character columns, and since I have a long list, I don't know how to sum only part of the column.
Sorry if this is confusing.
The package dplyr has quite a natural syntax for this:
require(dplyr)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(
Country = rep(c(rep("Brazil", 3L), rep("Austria", 3L), rep("France", 3L)), 3L),
Profit = rep(c(50, 80, 15, 35, 80, 47, 21, 66, 15), 3L),
Month = rep(c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"), 3L),
Year = sort(rep(c(2019, 2020, 2021), 9L))
)
df %>%
group_by(Country, Month) %>%
summarize(sum = sum(Profit))
#> `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
#> # A tibble: 9 × 3
#> # Groups: Country [3]
#> Country Month sum
#> <chr> <chr> <dbl>
#> 1 Austria Apr 105
#> 2 Austria Jun 141
#> 3 Austria May 240
#> 4 Brazil Feb 240
#> 5 Brazil Jan 150
#> 6 Brazil Mar 45
#> 7 France Aug 198
#> 8 France Jul 63
#> 9 France Sep 45
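The two summaries described in the question (one total per month across all countries, one total per country) can be sketched the same way by grouping on a single column. The data frame `profits` below is rebuilt from the nine sample rows in the question, with the month labels written as Jan/Feb/Mar; the object names are chosen for this example.

```r
library(dplyr)

# The nine sample rows from the question (product A only)
profits <- data.frame(
  Country = rep(c("Brazil", "Austria", "France"), each = 3),
  Month = rep(c("Jan", "Feb", "Mar"), times = 3),
  Product = "A",
  Profit = c(50, 80, 15, 35, 80, 47, 21, 66, 15)
)

# Total profit per month, summed over all countries
by_month <- profits %>%
  group_by(Month) %>%
  summarise(Profit = sum(Profit), .groups = "drop")

# Total profit per country, largest first, to pick out top and bottom countries
by_country <- profits %>%
  group_by(Country) %>%
  summarise(Profit = sum(Profit), .groups = "drop") %>%
  arrange(desc(Profit))

by_month
by_country
```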
Using base R, you can try something along these lines.
# sum of profit per month
out1 <- tapply(df$Profit, df$Month, sum)
# sum of profit per year per country
out2 <- data.frame(
profit = sapply(split(df, f = ~ df$Country + df$Year), function(x) sum(x$Profit))
)
out2$Country <- gsub('\\.[0-9]*', '', rownames(out2))
out2$Year <- gsub('[a-zA-Z]*\\.', '', rownames(out2))
rownames(out2) <- NULL
Output
> out1
Apr Aug Feb Jan Jul Jun Mar May Sep
105 198 240 150 63 141 45 240 45
> head(out2)
profit Country Year
1 162 Austria 2019
2 145 Brazil 2019
3 102 France 2019
4 162 Austria 2020
5 145 Brazil 2020
6 102 France 2020
Data
# sample data
df <- data.frame(
Country = rep(c(rep('Brazil',3L),rep('Austria',3L),rep('France',3L)), 3L),
Profit = rep(c(50,80,15,35,80,47,21,66,15), 3L),
Month = rep(c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep'),3L),
Year = sort(rep(c(2019,2020,2021), 9L))
)
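As a possible shortcut in base R, aggregate() with a formula returns the grouping columns directly, so the country and year do not have to be recovered from row names. This sketch reuses the sample data above; `out3` is a name chosen for this example.

```r
# Same sample data as above
df <- data.frame(
  Country = rep(c(rep('Brazil', 3L), rep('Austria', 3L), rep('France', 3L)), 3L),
  Profit = rep(c(50, 80, 15, 35, 80, 47, 21, 66, 15), 3L),
  Month = rep(c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep'), 3L),
  Year = sort(rep(c(2019, 2020, 2021), 9L))
)

# Sum of profit per country per year, with Country and Year kept as columns
out3 <- aggregate(Profit ~ Country + Year, data = df, FUN = sum)
out3
```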

How do I scrape a full table using rvest that uses a button to show more records?

I have a table that I'm trying to scrape. The table has 100 rows. The first 10 rows load initially, and I'm guessing the button to "Show Full List" must have a Javascript response.
I am looking for help on how to scrape the full table.
Here is the code I used to get the first 10 rows using rvest.
library(tidyverse)
library(rvest)
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
url <- "https://www.mlb.com/prospects"
html <- read_html(url)
html %>%
html_node("table") %>%
html_table()
#> # A tibble: 10 x 10
#> Rank Player Position Team Level eta Age `Height / Weigh~ Bats Throws
#> <int> <chr> <chr> <chr> <chr> <int> <int> <chr> <chr> <chr>
#> 1 1 Wander~ SS Tampa~ - 2021 20 "5' 10\" / 189 ~ S R
#> 2 2 Adley ~ C Balti~ - 2021 23 "6' 2\" / 220 l~ S R
#> 3 3 Spence~ 3B/1B Detro~ - 2022 21 "6' 1\" / 220 l~ R R
#> 4 4 Jarred~ OF Seatt~ AA 2021 21 "6' 1\" / 190 l~ L L
#> 5 5 Julio ~ OF Seatt~ - 2022 20 "6' 3\" / 180 l~ R R
#> 6 6 MacKen~ LHP San D~ ROK 2021 22 "6' 2\" / 197 l~ L L
#> 7 7 Bobby ~ SS Kansa~ - 2022 20 "6' 1\" / 200 l~ R R
#> 8 8 CJ Abr~ SS San D~ - 2022 20 "6' 2\" / 185 l~ L R
#> 9 9 Ke'Bry~ 3B Pitts~ MLB 2021 24 "5' 10\" / 205 ~ R R
#> 10 10 Nate P~ RHP Toron~ MLB 2021 24 "6' 6\" / 250 l~ R R
Below is a possible solution.
You could use RSelenium to interact with the page and click the button before reading the table.
library(RSelenium)
library(rvest)
library(dplyr)
driver <- rsDriver(browser=c("firefox"), port = 4441L)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.mlb.com/prospects")
remote_driver$findElement(using = 'css selector', value = '.load-more__button')$clickElement()
# getPageSource() returns the rendered HTML (not a URL) after the click
page_source <- unlist(remote_driver$getPageSource())
html <- read_html(page_source)
html %>%
html_node(css = ".rankings__table") %>%
html_table()
Rank Player Position Team Level eta Age Height / Weight Bats Throws
1 1 Wander Franco SS Tampa Bay Rays - 2021 20 5' 10" / 189 lbs S R
2 2 Adley Rutschman C Baltimore Orioles - 2021 23 6' 2" / 220 lbs S R
............
98 98 Tyler Freeman SS Cleveland Indians - 2022 21 6' 0" / 165 lbs R R
99 99 Cade Cavalli RHP Washington Nationals ROK 2022 22 6' 4" / 230 lbs R R
100 100 Taylor Trammell OF Seattle Mariners MLB 2021 23 6' 2" / 213 lbs L

R - for loop url

Evening all, I'm having a few issues at the moment scraping data from multiple web pages.
library(RCurl)
library(XML)
tables <- readHTMLTable(getURL("https://www.basketball-reference.com/leagues/NBA_2018_games.html"))
for (i in c("october", "november", "december", "january")) {
readHTMLTable(getURL(paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-",i,".html")))
regular <- tables[["schedule"]]
write.csv(regular, file = paste0("./", i, i, ".csv"))
}
I'm having an issue where it doesn't appear to be looping through the months and is just saving 4 files from october.
Any help appreciated.
This is not the most elegant way, but it works well.
Hope it helps.
Web scraping code:
rm(list = ls())
if(!require("rvest")){install.packages("rvest");library("rvest")}
for (i in c("october", "november", "december", "january")) {
nba_url <- read_html(paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-",i,".html"))
#Left part of the table
left<-nba_url %>%
html_nodes(".left") %>% # cells with class "left"
html_text()
left<-left[-length(left)]
left<-left[-(1:4)]
#Assign specific values
Date<-left[seq(1,length(left),4)]
Visitor<-left[seq(2,length(left),4)]
Home<-left[seq(3,length(left),4)]
#Right part of the table
right<-nba_url %>%
html_nodes(".right") %>% # cells with class "right"
html_text()
right<-right[-length(right)]
right<-right[-(1:2)]
#Assign specific values
Start<-right[seq(1,length(right),3)]
PTS1<-right[seq(2,length(right),3)]
PTS2<-right[seq(3,length(right),3)]
nba_data<-data.frame(Date,Start,Visitor,PTS1,Home,PTS2)
write.csv(nba_data, file = paste0("./", i, i, ".csv"))
}
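The loop in the question writes October four times because the result of readHTMLTable() inside the loop is never assigned, so `tables` still holds the page read before the loop. The offline sketch below shows the corrected pattern; `fetch_schedule()` is a hypothetical stand-in for the download step, and the files are written to a temporary directory.

```r
# Hypothetical stand-in for readHTMLTable(getURL(...)): returns a list with a
# "schedule" element, like the real call in the question
fetch_schedule <- function(month) {
  list(schedule = data.frame(month = month, game = 1:2))
}

months <- c("october", "november", "december", "january")
out_dir <- tempdir()

for (m in months) {
  tables <- fetch_schedule(m)        # assign the result inside the loop
  regular <- tables[["schedule"]]
  write.csv(regular, file = file.path(out_dir, paste0(m, ".csv")), row.names = FALSE)
}

# Read the files back to confirm each one holds its own month
saved <- lapply(months, function(m) read.csv(file.path(out_dir, paste0(m, ".csv"))))
saved_months <- vapply(saved, function(x) as.character(unique(x$month)), character(1))
saved_months
```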
This is a solution using the tidyverse to scrape this website. But first we check the website's robots.txt file to get a sense of the rate limit for requests. See the post Analyzing “Crawl-Delay” Settings in Common Crawl robots.txt Data with R for further info.
library(spiderbar)
library(robotstxt)
rt <- robxp(get_robotstxt("https://www.basketball-reference.com"))
crawl_delays(rt)
#> agent crawl_delay
#> 1 * 3
#> 2 ahrefsbot -1
#> 3 twitterbot -1
#> 4 slysearch -1
#> 5 ground-control -1
#> 6 groundcontrol -1
#> 7 matrix -1
#> 8 hal9000 -1
#> 9 carmine -1
#> 10 the-matrix -1
#> 11 skynet -1
We are interested in the * value. We see that we have to wait a minimum of 3 seconds between requests. We will wait 5 seconds.
We use the tidyverse ecosystem to build the urls and iterate through them to get a table with all the data.
library(tidyverse)
library(rvest)
#> Loading required package: xml2
#>
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#>
#> pluck
#> The following object is masked from 'package:readr':
#>
#> guess_encoding
month_sub <- c("october", "november", "december", "january")
urls <- map_chr(month_sub, ~ paste0("https://www.basketball-reference.com/leagues/NBA_2018_games-", .,".html"))
urls
#> [1] "https://www.basketball-reference.com/leagues/NBA_2018_games-october.html"
#> [2] "https://www.basketball-reference.com/leagues/NBA_2018_games-november.html"
#> [3] "https://www.basketball-reference.com/leagues/NBA_2018_games-december.html"
#> [4] "https://www.basketball-reference.com/leagues/NBA_2018_games-january.html"
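Since paste0() is vectorised over its arguments, the same URLs can also be built without map_chr(). A minimal sketch, where `base_url` and `urls_base` are names chosen for this example:

```r
month_sub <- c("october", "november", "december", "january")
base_url <- "https://www.basketball-reference.com/leagues/NBA_2018_games-"

# paste0() recycles the scalar prefix and suffix across all months
urls_base <- paste0(base_url, month_sub, ".html")
urls_base
```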
pb <- progress_estimated(length(urls))
map(urls, ~{
url <- .
pb$tick()$print()
Sys.sleep(5) # wait 5 seconds between requests
tables <- read_html(url) %>%
# we select the table part by its table id tag
html_nodes("#schedule") %>%
# we extract the table
html_table() %>%
# html_table() returns a one-element list, so we flatten it to get a tibble
flatten_df()
}) -> tables
# we get a list of tables, one per month
str(tables, 1)
#> List of 4
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 104 obs. of 8 variables:
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 213 obs. of 8 variables:
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 227 obs. of 8 variables:
#> $ :Classes 'tbl_df', 'tbl' and 'data.frame': 216 obs. of 8 variables:
# we can get all the data in one table by binding rows.
# As we saw on the website that there are 2 empty columns with no names,
# we need to take care of them with repair_names before row binding
res <- tables %>%
map_df(tibble::repair_names)
res
#> # A tibble: 760 x 8
#> Date `Start (ET)` `Visitor/Neutral` PTS
#> <chr> <chr> <chr> <int>
#> 1 Tue, Oct 17, 2017 8:01 pm Boston Celtics 102
#> 2 Tue, Oct 17, 2017 10:30 pm Houston Rockets 121
#> 3 Wed, Oct 18, 2017 7:30 pm Milwaukee Bucks 100
#> 4 Wed, Oct 18, 2017 8:30 pm Atlanta Hawks 111
#> 5 Wed, Oct 18, 2017 7:00 pm Charlotte Hornets 102
#> 6 Wed, Oct 18, 2017 7:00 pm Brooklyn Nets 140
#> 7 Wed, Oct 18, 2017 8:00 pm New Orleans Pelicans 103
#> 8 Wed, Oct 18, 2017 7:00 pm Miami Heat 116
#> 9 Wed, Oct 18, 2017 10:00 pm Portland Trail Blazers 76
#> 10 Wed, Oct 18, 2017 10:00 pm Houston Rockets 100
#> # ... with 750 more rows, and 4 more variables: `Home/Neutral` <chr>,
#> # V1 <chr>, V2 <chr>, Notes <lgl>
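To keep track of which month each row came from, one option is to name the list before binding and let bind_rows() store the names in an id column. The sketch below runs offline: `fake_table()` is a hypothetical stand-in for the read_html() and html_table() step, and its values are invented.

```r
library(dplyr)
library(purrr)

month_sub <- c("october", "november")

# Hypothetical offline stand-in for the scraping step; values are invented
fake_table <- function(month) {
  data.frame(Date = paste(month, 1:2), PTS = c(100L, 110L))
}

res_by_month <- month_sub %>%
  set_names() %>%              # name each element after its month
  map(fake_table) %>%
  bind_rows(.id = "month")     # list names become the "month" column
res_by_month
```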
