read_csv() adds "\r" to *.csv input

I'm trying to read in a small (17kb), simple csv file from EdX.org (for an online course), and I've never had this trouble with readr::read_csv() before. Base-R read.csv() reads the file without generating the problem.
library(tidyverse)
df <- read_csv("https://courses.edx.org/assets/courseware/v1/ccdc87b80d92a9c24de2f04daec5bb58/asset-v1:MITx+15.071x+1T2020+type@asset+block/WHO.csv")
head(df)
This gives the following output:
#> # A tibble: 6 x 13
#>   Country Region Population Under15 Over60 FertilityRate LifeExpectancy
#>   <chr>   <chr>       <dbl>   <dbl>  <dbl> <chr>                  <dbl>
#> 1 Afghan… Easte…      29825    47.4   3.82 "\r5.4\r"                 60
#> 2 Albania Europe       3162    21.3  14.9  "\r1.75\r"                74
#> 3 Algeria Africa      38482    27.4   7.17 "\r2.83\r"                73
#> 4 Andorra Europe         78    15.2  22.9  <NA>                      82
#> 5 Angola  Africa      20821    47.6   3.84 "\r6.1\r"                 51
#> 6 Antigu… Ameri…         89    26.0  12.4  "\r2.12\r"                75
#> # … with 6 more variables: ChildMortality <dbl>, CellularSubscribers <dbl>,
#> #   LiteracyRate <chr>, GNI <chr>, PrimarySchoolEnrollmentMale <chr>,
#> #   PrimarySchoolEnrollmentFemale <chr>
You'll notice that the FertilityRate column has "\r" added to its values. I've downloaded the csv file and can't find those characters anywhere in it.
Base-R read.csv() reads in the file with no problems, so I'm wondering what the problem is with my usage of the tidyverse read_csv().
head(df$FertilityRate)
#> [1] "\r5.4\r" "\r1.75\r" "\r2.83\r" NA "\r6.1\r" "\r2.12\r"
How can I fix my usage of read_csv() so that the "\r" strings are not there?
If possible, I'd prefer not to have to individually specify the type of every single column.

In a nutshell, the characters are inside the file (probably by accident), and read_csv() is right not to remove them automatically: since they occur within quotes, convention says a CSV parser must treat the field as-is and not strip out whitespace characters. read.csv() does strip them, and that is arguably a bug.
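You can see this convention in action with a small inline example (a sketch; the literal string stands in for the file contents):

library(readr)
# a quoted CSV field may legally contain control characters such as "\r",
# and a conforming parser keeps them verbatim rather than stripping them
demo <- read_csv('a,b\n"\r1.75\r",2\n')
demo$a
# the field comes back as "\r1.75\r", mirroring the behaviour in the question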
You can strip them out yourself once you’ve loaded the data:
# remove all carriage returns from every character column
df = mutate_if(df, is.character, ~ stringr::str_remove_all(.x, '\r'))
This seems to be good enough for this file, but in general I’d be wary that the file might be damaged in other ways: the presence of these characters is clearly unintentional, and the file follows no common line-ending convention (it’s neither a conventional Windows nor a Unix file).
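With current dplyr (1.0+), the equivalent clean-up is usually written with across(); a minimal sketch:

library(dplyr)
library(stringr)
# strip stray carriage returns from every character column
df <- df %>%
  mutate(across(where(is.character), ~ str_remove_all(.x, "\r")))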

Related

Make a loop to scrape a website to create multiple dataframes

I'm working on a project where I can see two ways to potentially solve my problem. I'm scraping a webpage, using a loop to save each page locally as an HTML file. The problem is that when I open the saved files in my local folder, they are basically blank pages, and I'm not sure why. I've used this same code on other sites for this project with success.
This is the code I'm using.
# scrape playoff teams for multiple seasons and save html to local folder
for (i in 2002:2021) {
  playoff_url <- read_html(paste0("https://www.espn.com/nfl/stats/player/_/season/", i, "/seasontype/3"))
  playoff_stats <- playoff_url %>%
    write_html(paste0("playoff", i, ".HTML"))
}
My second option is to scrape individual seasons into a data frame, but I would like to do it in a loop rather than run this code 20 different times. I also don't want to re-scrape data from the site every time I run the code. It doesn't matter whether all the data ends up in one large data frame for all 20 seasons or in 20 separate ones; I can export it to a local file and import it when I need it.
# read in playoff QBs from ESPN and add year column
playoff_url <- read_html("https://www.espn.com/nfl/stats/player/_/season/2015/seasontype/3")
play_QB2015 <- playoff_url %>%
  html_nodes("table") %>%
  html_table()
# combine list from QB playoff data to convert to dataframe
play_QB2015 <- c(play_QB2015[[1]], play_QB2015[[2]])
# convert list to dataframe using data.frame()
play_QB2015 <- data.frame(play_QB2015)
play_QB2015$Year <- 2015
Not sure what happens to your files, but downloading and storing them with httr2 first, then parsing the saved files with rvest, works fine for me (sorry for the overused tidyverse..):
library(fs)
library(dplyr)
library(httr2)
library(rvest)
library(purrr)
library(stringr)
dest_dir <- path_temp("playoffs")
dir_create(dest_dir)
years <- 2002:2012
# collect all years to a list
playoff_lst <- map(
  set_names(years),
  ~ {
    dest_file <- path(dest_dir, str_glue("{.x}.html"))
    # only download if a local copy is not present
    if (!file_exists(dest_file)) {
      request(str_glue("https://www.espn.com/nfl/stats/player/_/season/{.x}/seasontype/3")) %>%
        req_perform(dest_file)
    }
    read_html(dest_file) %>%
      html_elements("table") %>%
      html_table() %>%
      bind_cols()
  }
)
Results:
names(playoff_lst)
#> [1] "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011"
#> [11] "2012"
head(playoff_lst$`2002`)
#> # A tibble: 6 × 17
#>      RK Name      POS      GP   CMP   ATT `CMP%`   YDS   AVG `YDS/G`   LNG    TD
#>   <int> <chr>     <chr> <int> <int> <int>  <dbl> <int> <dbl>   <dbl> <int> <int>
#> 1     1 Rich Gan… QB        3    73   115   63.5   841   7.3    280.    50     7
#> 2     2 Brad Joh… QB        3    53    98   54.1   670   6.8    223.    71     5
#> 3     3 Tommy Ma… QB        2    51    89   57.3   633   7.1    316.    40     5
#> 4     4 Steve Mc… QB        2    48    80   60     532   6.7    266     39     3
#> 5     5 Jeff Gar… QB        2    49    85   57.6   524   6.2    262     76     3
#> 6     6 Donovan … QB        2    46    79   58.2   490   6.2    245     42     1
#> # … with 5 more variables: INT <int>, SACK <int>, SYL <int>, QBR <lgl>,
#> #   RTG <dbl>
dir_tree(dest_dir)
#> ... /RtmpcjLFJe/playoffs
#> ├── 2002.html
#> ├── 2003.html
#> ├── 2004.html
#> ├── 2005.html
#> ├── 2006.html
#> ├── 2007.html
#> ├── 2008.html
#> ├── 2009.html
#> ├── 2010.html
#> ├── 2011.html
#> └── 2012.html
Created on 2023-02-16 with reprex v2.0.2
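If one large data frame is preferred over the named list (the question allows either), the list can be collapsed afterwards; a sketch, assuming the per-season tables have compatible columns:

# stack all seasons, keeping each list name as a season column
playoff_df <- dplyr::bind_rows(playoff_lst, .id = "season")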

Strsplit function in R

I'm currently working through some coursework where data has been provided on supermarket chip sales. Part of the task is to remove any entries where the products are not chips, and I have been provided with code to help with this:
productWords <- data.table(unlist(strsplit(unique(transaction_data[, "PROD_NAME"]), "")))
The data file provided is transaction_data, and the PROD_NAME variable is the column we're interested in.
This, however, returns the error:
Error in strsplit(unique(transaction_data[, "PROD_NAME"]), "") : non-character argument
Can someone please explain what I'm doing wrong, or am I missing something? I'm not sure how this code would be able to distinguish one product from another. Am I meant to be adding something to the code based on product names I've seen while looking through the data?
Here are some lines of the data as an example:
  DATE       STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME                             PROD_QTY TOT_SALES
  <date>         <dbl>          <dbl>  <dbl>    <dbl> <chr>                                    <dbl>     <dbl>
1 2018-10-17         1           1000      1        5 Natural Chip Compny SeaSalt175g              2       6
2 2019-05-14         1           1307    348       66 CCs Nacho Cheese 175g                        3       6.3
3 2019-05-20         1           1343    383       61 Smiths Crinkle Cut Chips Chicken 170g        2       2.9
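The error itself comes from the argument type: strsplit() needs a character vector, but subsetting a data.table (or tibble) with transaction_data[, "PROD_NAME"] returns a one-column table, i.e. a list, hence "non-character argument". A minimal sketch of the usual fix, assuming transaction_data is a data.table:

library(data.table)
# extract the column as a character vector before splitting; note that
# splitting on " " (a space) yields words, whereas "" would split the
# names into single characters
productWords <- data.table(unlist(strsplit(unique(transaction_data$PROD_NAME), " ")))

This only breaks the product names into words; filtering out the non-chip products is then a separate step based on those words.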

openweather api to get multiple cities

get_weather_forecaset_by_cities <- function(city_names){
  df <- data.frame("weather_data_frame")
  for (city_name in city_names){
    forecast_url <- 'https://api.openweathermap.org/data/2.5/forecast'
    forecast_query <- list(q = city_name, appid = "a297d3b46d0b5a7aab6dde3512962b99", units = "metric")
    for (result in results) {
      city <- c(city, city_name)
    }
  }
  return(df)
}
I need help understanding the code given above; I am specifically stuck on line 6, where the markdown says "# Loop the json result" (note: json_result is a list of lists).
My actual task is this: "Complete and call get_weather_forecaset_by_cities function with a list of cities, and write the data frame into a csv file called cities_weather_forecast.csv".
Which parts do I have to fill in, and how?
cities <- c("Seoul", "Washington, D.C.", "Paris", "Suzhou")
cities_weather_df <- get_weather_forecaset_by_cities(cities)
After running those lines of code, it shows this error:
"Error in get_weather_forecaset_by_cities(cities): object 'json_result' not found
Traceback:
get_weather_forecaset_by_cities(cities)"
Since this is a homework question, it's not appropriate to provide a complete answer. However, a function that receives a list of city names and obtains their 5-day weather forecasts from openweathermap.org looks like this:
get_weather_forecast_by_cities <- function(city_names){
  library(dplyr)
  library(tidyr)
  library(jsonlite)
  df <- NULL # initialize a data frame
  for (c in city_names){
    aForecast <- paste0("http://api.openweathermap.org/data/2.5/forecast?",
                        "q=", c,
                        "&appid=<your API KEY here>",
                        "&units=metric")
    message(aForecast)
    aCity <- fromJSON(aForecast)
    # extract the date / time content and convert to POSIXct
    # extract the periodic weather content, note that
    #   tidyr::unnest() is helpful here
    # add the city name
    # bind the results into the output data frame, df
  }
  df # return the data frame
}
We replace the commented areas with code and run the function as follows:
source("./examples/get_weather_forecast_by_cities.R")
cityList <- c("Seoul","Paris","Chicago","Beijing","Suzhou")
theData <- get_weather_forecast_by_cities(cityList)
head(theData)
...and the first few rows of output:
> head(theData)
# A tibble: 6 × 24
  dt                   temp feels_like temp_min temp_max pressure sea_level
  <dttm>              <dbl>      <dbl>    <dbl>    <dbl>    <int>     <int>
1 2022-08-03 03:00:00  27.4       31.5     27.4     29.9     1011      1011
2 2022-08-03 06:00:00  29.4       34.2     29.4     31.0     1010      1010
3 2022-08-03 09:00:00  29.2       33.2     29.2     29.2     1009      1009
4 2022-08-03 12:00:00  26.9       29.9     26.9     26.9     1010      1010
5 2022-08-03 15:00:00  25.7       26.7     25.7     25.7     1010      1010
6 2022-08-03 18:00:00  25.0       26.1     25.0     25.0     1010      1010
# … with 17 more variables: grnd_level <int>, humidity <int>, temp_kf <dbl>,
#   id <int>, main <chr>, description <chr>, icon <chr>, all <int>, speed <dbl>,
#   deg <int>, gust <dbl>, visibility <int>, pop <dbl>, `3h` <dbl>, pod <chr>,
#   dt_txt <chr>, city <chr>
>
At this point we have a data frame that can be easily written to csv with write.csv() which is left as an exercise for the reader.
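For reference, that write-out is a single standard call (file name taken from the assignment):

# write the forecast data frame to the csv file the assignment asks for
write.csv(theData, "cities_weather_forecast.csv", row.names = FALSE)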

read file from google drive

I have a spreadsheet uploaded as a csv file in Google Drive, unlocked so that users can read from it.
This is the link to the csv file:
https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/edit?usp=sharing
I am trying to read it from R but I am getting a long list of error messages. I am using:
id = "170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk"
read.csv(sprintf("https://docs.google.com/spreadsheets/d/uc?id=%s&export=download", id))
Could someone suggest how to read files from google drive directly into R?
I would try to publish the sheet as a CSV file (doc), and then read it from there.
It seems like your file is already published as a CSV. So, this should work. (Note that the URL ends with /pub?output=csv)
read.csv("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
To read the CSV file faster you can use vroom, which is even faster than fread(). See here.
Now using vroom,
library(vroom)
vroom("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
#> Rows: 387048 Columns: 14
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): StationCode, SampleID, WeatherCode, OrganismCode, race, race2
#> dbl (7): WaterTemperature, Turbidity, Velocity, ForkLength, Weight, Count, ...
#> date (1): SampleDate
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 387,048 × 14
#>    StationCode SampleDate SampleID WeatherCode WaterTemperature Turbidity
#>    <chr>       <date>     <chr>    <chr>                  <dbl>     <dbl>
#>  1 Gate 11     2000-04-25 116_00   CLD                     13.1      2
#>  2 Gate 5      1995-04-26 117_95   CLR                     NA        2
#>  3 Gate 2      1995-04-21 111_95   W                       10.4     12
#>  4 Gate 6      2008-12-13 348_08   CLR                     49.9      1.82
#>  5 Gate 5      1999-12-10 344_99   CLR                      7.30     1.5
#>  6 Gate 6      2012-05-25 146_12   CLR                     55.5      1.60
#>  7 Gate 10     2011-06-28 179_11   RAN                     57.3      3.99
#>  8 Gate 11     1996-04-25 116_96   CLR                     13.8     21
#>  9 Gate 9      2007-07-02 183_07   CLR                     56.6      2.09
#> 10 Gate 6      2009-06-04 155_09   CLR                     58.6      3.08
#> # … with 387,038 more rows, and 8 more variables: Velocity <dbl>,
#> #   OrganismCode <chr>, ForkLength <dbl>, Weight <dbl>, Count <dbl>,
#> #   race <chr>, year <dbl>, race2 <chr>
Created on 2022-07-08 by the reprex package (v2.0.1)
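Since fread() was mentioned: data.table can also read the published-CSV URL directly; a minimal sketch:

library(data.table)
# fread() accepts a URL and downloads and parses in one step
dt <- fread("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")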

Need help navigating lists when converting JSON into dataframe/CSV

I am trying to scrape a JavaScript-rendered table, and after trying both Selenium and PhantomJS I've decided that JSON would be the easiest way to do it. However, I am quite new to R and not very good at handling lists, and because of that I cannot get my data into the table format I want. I've looked at a number of solutions, but for some reason they don't really work on the JSON I have.
The JSON data is rendered through this URL. And this is the actual website where the table is located.
What I've done so far is to try to parse the JSON into R and coerce it into a dataframe, based on what I've seen from most answers on stackoverflow.
library(httr)
library(jsonlite)
rf <- GET(url) #the entire URL is very long so I'm not putting it here
rfc <- content(rf)
Doing this returns me a large list of four elements, rfc. I then apply the following function.
library(httr)
library(jsonlite)
json_file <- lapply(rfc, function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})
This returns me an error:
Error in x[sapply(x, is.null)] <- NA : invalid subscript type 'list'
Given that I only need the second element of the list, which is where the information is at, I attempt to subset it:
json_file <- lapply(rfc[2], function(x) {
  x[sapply(x, is.null)] <- NA
  unlist(x)
})
This returns me a large list, 12 MB in size. When I try to coerce it to a data frame using as.data.frame(), R returns 506472 observations of 1 variable: the different columns have all been squashed into one, and the headers are gone.
Can anyone tell me how I should go about doing this? There's a free online JSON to CSV converter here that does exactly what I need beautifully. Unfortunately, that is not a solution: because I intend to run this in Shiny, I want to do everything within R. Any help is appreciated, thanks.
You need to take the element rfc$data$DailyProductionAndFlowList, which itself is effectively a list of single-row data frames, and bind them together. You'll need to overwrite the NULL values first:
df <- do.call(rbind, lapply(rfc$data$DailyProductionAndFlowList, function(x) {
  x[sapply(x, is.null)] <- "NULL"
  as.data.frame(x, stringsAsFactors = FALSE)
}))
To show you that the result is sensible, I've put it in a tibble here for nicer printing:
as_tibble(df)
#> # A tibble: 3,997 x 11
#>    GasDate FacilityId FacilityName LocationId LocationName Demand Supply
#>    <chr>        <int> <chr>             <int> <chr>         <dbl>  <dbl>
#>  1 2020-0~     520047 Eastern Gas~     520008 Sydney        94.4     0
#>  2 2020-0~     520047 Eastern Gas~     520009 Canberra      16.5     0
#>  3 2020-0~     520047 Eastern Gas~     530015 Longford Hub   0     234.
#>  4 2020-0~     520047 Eastern Gas~     590011 Regional - ~  22.4     0
#>  5 2020-0~     520047 Eastern Gas~     590012 Regional - ~   2.68   19.4
#>  6 2020-0~     520047 Eastern Gas~     520008 Sydney       113.      0
#>  7 2020-0~     520047 Eastern Gas~     520009 Canberra      19.7     0
#>  8 2020-0~     520047 Eastern Gas~     530015 Longford Hub   0     225.
#>  9 2020-0~     520047 Eastern Gas~     590011 Regional - ~  27.5     0
#> 10 2020-0~     520047 Eastern Gas~     590012 Regional - ~   5.05   20.1
#> # ... with 3,987 more rows, and 4 more variables: TransferIn <dbl>,
#> #   TransferOut <dbl>, HeldInStorage <chr>, LastUpdated <chr>
Another approach:
library( data.table )
library( rjson )

# location of data
json.url = "https://aemo.com.au/aemo/api/v1/GasBBReporting/DailyProductionAndFlow?FacilityIds=540093,580010,540101,544261,540047,530030,540083,540096,580020,540071,540077,520075,540059,520054,520090,540094,540098,540080,540090,540086,540050,540097,540055,520047,540089,540070,540092,530071,530042,540088,540075,544253,540061,530038,530039,530040,580100,580040,540064,530043,550050,550045,550046,550054,520053,530061,520060,580050,540084,530041,530044,580060,580070,540065,550052,530060,540058,540085,540102,540073,540057,540095,544260,540110,540040,540082,540072,540062,540103,550061,550060,540060,540066,540067,540076,540068,580210,570050,540051,532005,530110,540045,540046,540091,580030,540069,540087,580180,540074&FromGasDate=07/08/2020&ToGasDate=07/09/2020"

# retrieve latest data
mydata <- data.table::rbindlist( rjson::fromJSON( file = json.url )$data$DailyProductionAndFlowList,
                                 use.names = TRUE, fill = TRUE )
