Retrieving all releases for a specific R package - r

What R tools are available for obtaining a list of all releases of a specific CRAN package?
At a minimum, I expect to retrieve the date each package version was released.
Other metadata for each version would be valuable too.
Disclosure: this is self-promotion of my new CRAN package https://CRAN.R-project.org/package=pacs

TL;DR
pacs::pac_timemachine
pkgsearch::cran_package_history
pkgdown:::pkg_timeline (non-exported; returns only the publication date)
pacs::pac_timemachine in the pacs package.
pacs::pac_timemachine uses the CRAN website or crandb.
head(pacs::pac_timemachine("tidyr"), 3)
#> Package Version Released Archived LifeDuration
#> 2 tidyr 0.1 2014-07-21 2015-09-08 414 days
#> 3 tidyr 0.2.0 2015-09-08 2015-09-08 0 days
#> 4 tidyr 0.3.0 2015-09-08 2015-09-10 2 days
#> URL Size
#> 2 Archive/tidyr/tidyr_0.1.tar.gz 134K
#> 3 Archive/tidyr/tidyr_0.2.0.tar.gz 139K
#> 4 Archive/tidyr/tidyr_0.3.0.tar.gz 147K
tail(pacs::pac_timemachine("tidyr"), 3)
#> Package Version Released Archived LifeDuration
#> 25 tidyr 1.1.1 2020-07-31 2020-08-27 27 days
#> 26 tidyr 1.1.2 2020-08-27 2021-03-03 188 days
#> 1 tidyr 1.1.3 2021-03-03 <NA> 192 days
#> URL Size
#> 25 Archive/tidyr/tidyr_1.1.1.tar.gz 859K
#> 26 Archive/tidyr/tidyr_1.1.2.tar.gz 861K
#> 1 tidyr_1.1.3.tar.gz <NA>
We can also get the result for a specific date, a date interval, or a specific version.
pacs::pac_timemachine("tidyr", at = as.Date("2018-01-01"))
pacs::pac_timemachine("tidyr", version = "1.0.0")
pacs::pac_timemachine("tidyr", from = as.Date("2020-06-01"), to = as.Date("2020-08-01"))
Created on 2021-09-11 by the reprex package (v2.0.1)
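The LifeDuration column in the output above appears to be the gap between the Released and Archived dates. The following is a minimal base-R sketch of that arithmetic, using values copied from the output; it is an illustration, not the actual pacs implementation, and the end date used for the current release is an assumption.

```r
# Values copied from the head() output above; only the arithmetic is new.
releases <- data.frame(
  Version  = c("0.1", "0.2.0", "0.3.0"),
  Released = as.Date(c("2014-07-21", "2015-09-08", "2015-09-08")),
  Archived = as.Date(c("2015-09-08", "2015-09-08", "2015-09-10")),
  stringsAsFactors = FALSE
)

# Subtracting two Date vectors yields a difftime measured in days.
releases$LifeDuration <- releases$Archived - releases$Released
print(releases)

# For the current release (Archived is NA) the end point is assumed to be
# the day the reprex was run, 2021-09-11.
current_life <- as.Date("2021-09-11") - as.Date("2021-03-03")
print(current_life)
```

The computed values (414, 0 and 2 days, and 192 days for version 1.1.3) agree with the LifeDuration column shown above.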
pkgsearch::cran_package_history in the pkgsearch package.
It is built on a private database that is regularly updated with the new DESCRIPTION files of each CRAN package.
head(pkgsearch::cran_package_history("tidyr"), 3)
#> # A tibble: 3 × 25
#> Package Title Version `Authors#R` Description License LazyData URL
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 tidyr Easily … 0.1 "'Hadley Wick… tidyr is an e… MIT + … true https…
#> 2 tidyr Easily … 0.2.0 "as.person(c(… An evolution … MIT + … true https…
#> 3 tidyr Easily … 0.3.0 "c(<U+000a>pe… An evolution … MIT + … true https…
#> # … with 17 more variables: VignetteBuilder <chr>, Packaged <chr>,
#> # Author <chr>, Maintainer <chr>, NeedsCompilation <chr>, Repository <chr>,
#> # Date/Publication <chr>, crandb_file_date <chr>, date <chr>,
#> # dependencies <list>, BugReports <chr>, RoxygenNote <chr>, Remotes <chr>,
#> # MD5sum <chr>, Encoding <chr>, SystemRequirements <chr>,
#> # Config/testthat/edition <chr>
tail(pkgsearch::cran_package_history("tidyr"), 3)
#> # A tibble: 3 × 25
#> Package Title Version `Authors#R` Description License LazyData URL
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 tidyr Tidy Messy Data 1.1.1 "\nc(perso… "Tools to … MIT + … true http…
#> 2 tidyr Tidy Messy Data 1.1.2 "\nc(perso… "Tools to … MIT + … true http…
#> 3 tidyr Tidy Messy Data 1.1.3 "\nc(perso… "Tools to … MIT + … true http…
#> # … with 17 more variables: VignetteBuilder <chr>, Packaged <chr>,
#> # Author <chr>, Maintainer <chr>, NeedsCompilation <chr>, Repository <chr>,
#> # Date/Publication <chr>, crandb_file_date <chr>, date <chr>,
#> # dependencies <list>, BugReports <chr>, RoxygenNote <chr>, Remotes <chr>,
#> # MD5sum <chr>, Encoding <chr>, SystemRequirements <chr>,
#> # Config/testthat/edition <chr>
Created on 2021-09-11 by the reprex package (v2.0.1)
pkgdown:::pkg_timeline in the pkgdown package.
It is a non-exported function, so keep in mind that it may change without notice. It returns only the date when each package version was published.

Related

Scrape a webpage that requires a button click

I am trying to scrape data from the link below. I need to click the CSV button on the webpage and download the CSV file it provides.
library(netstat)
library(RSelenium)
url <- "https://gtr.ukri.org/search/project?term=%22climate+change%22+OR+%22climate+crisis%22&fetchSize=25&selectedSortableField=&selectedSortOrder=&fields=pro.gr%2Cpro.t%2Cpro.a%2Cpro.orcidId%2Cper.fn%2Cper.on%2Cper.sn%2Cper.fnsn%2Cper.orcidId%2Cper.org.n%2Cper.pro.t%2Cper.pro.abs%2Cpub.t%2Cpub.a%2Cpub.orcidId%2Corg.n%2Corg.orcidId%2Cacp.t%2Cacp.d%2Cacp.i%2Cacp.oid%2Ckf.d%2Ckf.oid%2Cis.t%2Cis.d%2Cis.oid%2Ccol.i%2Ccol.d%2Ccol.c%2Ccol.dept%2Ccol.org%2Ccol.pc%2Ccol.pic%2Ccol.oid%2Cip.t%2Cip.d%2Cip.i%2Cip.oid%2Cpol.i%2Cpol.gt%2Cpol.in%2Cpol.oid%2Cprod.t%2Cprod.d%2Cprod.i%2Cprod.oid%2Crtp.t%2Crtp.d%2Crtp.i%2Crtp.oid%2Crdm.t%2Crdm.d%2Crdm.i%2Crdm.oid%2Cstp.t%2Cstp.d%2Cstp.i%2Cstp.oid%2Cso.t%2Cso.d%2Cso.cn%2Cso.i%2Cso.oid%2Cff.t%2Cff.d%2Cff.c%2Cff.org%2Cff.dept%2Cff.oid%2Cdis.t%2Cdis.d%2Cdis.i%2Cdis.oid%2Ccpro.rtpc%2Ccpro.rcpgm%2Ccpro.hlt&type=#/csvConfirm"
I am struggling to implement that using Selenium. Here is the code I have so far.
rD <- rsDriver(port= free_port(), browser = "chrome", chromever = "106.0.5249.21", check = TRUE, verbose = TRUE)
remote_driver <- rD[["client"]]
remDr <- rD$client
remDr$navigate(url)
webElem <- remDr$findElement(using = "css", "content gtr-body d-flex flex-column ng-scope")
webElem$clickElement()
You can often just record the network log and see what request is sent when hitting the download button. In Chrome, right click Inspect, then look for the network tab. In this case there is only one request sent:
Right click and "copy as cURL" to see the whole request or just click copy URL, since the cookies and headers are not necessary here. I wrote a quick function around the task of querying the site:
dl_ukri <- function(query,
                    destfile = paste0(query, ".csv"),
                    size = 25L,
                    quiet_download = FALSE) {
  # Build the CSV endpoint URL; only the search term varies.
  url <- paste0(
    "https://gtr.ukri.org/search/project/csv?term=",
    urltools::url_encode(query),
    "&selectedFacets=&fields=acp.d,is.t,prod.t,pol.oid,acp.oid,rtp.t,pol.in,prod.i,per.pro.abs,acp.i,col.org,acp.t,is.d,is.oid,cpro.rtpc,prod.d,stp.oid,rtp.i,rdm.oid,rtp.d,col.dept,ff.d,ff.c,col.pc,pub.t,kf.d,dis.t,col.oid,pro.t,per.sn,org.orcidId,per.on,ff.dept,rdm.t,org.n,dis.d,prod.oid,so.cn,dis.i,pro.a,pub.orcidId,pol.gt,rdm.i,rdm.d,so.oid,per.fnsn,per.org.n,per.pro.t,pro.orcidId,pub.a,col.d,per.orcidId,col.c,ip.i,pro.gr,pol.i,so.t,per.fn,col.i,ip.t,ff.oid,stp.i,so.i,cpro.rcpgm,cpro.hlt,col.pic,so.d,ff.t,ip.d,dis.oid,ip.oid,stp.d,rtp.oid,ff.org,kf.oid,stp.t&type=&selectedSortableField=score&selectedSortOrder=DESC"
  )
  # Download the CSV to destfile.
  curl::curl_download(url, destfile, quiet = quiet_download)
}
Testing this with your original search:
dl_ukri('"climate change" OR "climate crisis"', destfile = "test.csv")
readr::read_csv("test.csv")
#> Rows: 5894 Columns: 25
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (23): FundingOrgName, ProjectReference, LeadROName, Department, ProjectC...
#> dbl (2): AwardPounds, ExpenditurePounds
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 5,894 × 25
#> FundingOrgN…¹ Proje…² LeadR…³ Depar…⁴ Proje…⁵ PISur…⁶ PIFir…⁷ PIOth…⁸ PI OR…⁹
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ESRC ES/W00… Univer… School… Fellow… Thew Harriet Christ… http:/…
#> 2 AHRC AH/W00… Univer… Arts L… Resear… Scott Peter Manley <NA>
#> 3 AHRC 2609218 Queen … Drama Studen… <NA> <NA> <NA> <NA>
#> 4 UKRI MR/V02… Univer… Politi… Fellow… Spaiser Viktor… <NA> http:/…
#> 5 MRC MC_PC_… Univer… <NA> Intram… Alessi Dario Renato <NA>
#> 6 AHRC 1948811 Royal … School… Studen… <NA> <NA> <NA> <NA>
#> 7 EPSRC 2688399 Brunel… Chemic… Studen… <NA> <NA> <NA> <NA>
#> 8 ESRC ES/T01… Univer… Social… Resear… Walker Cather… Louise http:/…
#> 9 AHRC AH/X00… Queen … Drama Resear… Herita… Paul <NA> http:/…
#> 10 ESRC 2272756 Univer… Sch of… Studen… <NA> <NA> <NA> <NA>
#> # … with 5,884 more rows, 16 more variables: StudentSurname <chr>,
#> # StudentFirstName <chr>, StudentOtherNames <chr>, `Student ORCID iD` <chr>,
#> # Title <chr>, StartDate <chr>, EndDate <chr>, AwardPounds <dbl>,
#> # ExpenditurePounds <dbl>, Region <chr>, Status <chr>, GTRProjectUrl <chr>,
#> # ProjectId <chr>, FundingOrgId <chr>, LeadROId <chr>, PIId <chr>, and
#> # abbreviated variable names ¹​FundingOrgName, ²​ProjectReference, ³​LeadROName,
#> # ⁴​Department, ⁵​ProjectCategory, ⁶​PISurname, ⁷​PIFirstName, ⁸​PIOtherNames, …
Created on 2022-10-17 with reprex v2.0.2
Voilà. I also played around with the fetchSize=25 which is in the original URL. But it does not seem to do anything, so I just omitted it.
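Two details of the function may be worth a closer look: how the search term is percent-encoded, and the default destfile, which is built from the raw query and can therefore contain double quotes (not allowed in Windows file names). Below is a small base-R sketch of both steps. The helper names encode_term and safe_destfile are hypothetical, and utils::URLencode() is used here as a dependency-free stand-in for urltools::url_encode(); whether the site treats the two encodings identically is an assumption.

```r
# Hypothetical helper: percent-encode a search term with base R only.
# reserved = TRUE also encodes characters such as quotes and spaces.
encode_term <- function(query) {
  utils::URLencode(query, reserved = TRUE)
}

# Hypothetical helper: derive a file name without quotes or spaces by
# collapsing every run of non-alphanumeric characters into "_".
safe_destfile <- function(query) {
  paste0(gsub("[^A-Za-z0-9]+", "_", query), ".csv")
}

query <- '"climate change" OR "climate crisis"'
print(encode_term(query))
print(safe_destfile(query))
```

A name produced by safe_destfile() could be passed as the destfile argument of dl_ukri() instead of relying on the default.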

read file from google drive

I have a spreadsheet uploaded as a CSV file to Google Drive, shared so that users can read from it.
This is the link to the csv file:
https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/edit?usp=sharing
I am trying to read it from R but I am getting a long list of error messages. I am using:
id <- "170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk"
read.csv(sprintf("https://docs.google.com/spreadsheets/d/uc?id=%s&export=download", id))
Could someone suggest how to read files from google drive directly into R?
I would try to publish the sheet as a CSV file (doc), and then read it from there.
It seems like your file is already published as a CSV. So, this should work. (Note that the URL ends with /pub?output=csv)
read.csv("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
To read the CSV file faster, you can use vroom, which can be even faster than fread().
Now using vroom,
library(vroom)
vroom("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
#> Rows: 387048 Columns: 14
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): StationCode, SampleID, WeatherCode, OrganismCode, race, race2
#> dbl (7): WaterTemperature, Turbidity, Velocity, ForkLength, Weight, Count, ...
#> date (1): SampleDate
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 387,048 × 14
#> StationCode SampleDate SampleID WeatherCode WaterTemperature Turbidity
#> <chr> <date> <chr> <chr> <dbl> <dbl>
#> 1 Gate 11 2000-04-25 116_00 CLD 13.1 2
#> 2 Gate 5 1995-04-26 117_95 CLR NA 2
#> 3 Gate 2 1995-04-21 111_95 W 10.4 12
#> 4 Gate 6 2008-12-13 348_08 CLR 49.9 1.82
#> 5 Gate 5 1999-12-10 344_99 CLR 7.30 1.5
#> 6 Gate 6 2012-05-25 146_12 CLR 55.5 1.60
#> 7 Gate 10 2011-06-28 179_11 RAN 57.3 3.99
#> 8 Gate 11 1996-04-25 116_96 CLR 13.8 21
#> 9 Gate 9 2007-07-02 183_07 CLR 56.6 2.09
#> 10 Gate 6 2009-06-04 155_09 CLR 58.6 3.08
#> # … with 387,038 more rows, and 8 more variables: Velocity <dbl>,
#> # OrganismCode <chr>, ForkLength <dbl>, Weight <dbl>, Count <dbl>,
#> # race <chr>, year <dbl>, race2 <chr>
Created on 2022-07-08 by the reprex package (v2.0.1)
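If the sheet ID is kept in a variable, as in the question, the published-CSV URL used above can be built from it. This is a minimal sketch: the helper name gsheet_pub_csv_url is hypothetical, and the URL pattern is simply the one from the answer, so it only works for sheets that have been published to the web.

```r
# Hypothetical helper: build the "publish to web" CSV URL from a sheet ID.
gsheet_pub_csv_url <- function(id) {
  sprintf("https://docs.google.com/spreadsheets/d/%s/pub?output=csv", id)
}

id <- "170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk"
csv_url <- gsheet_pub_csv_url(id)
print(csv_url)

# Reading needs network access and a sheet that is actually published:
# dat <- read.csv(csv_url)
```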

How to merge two columns with the same names from two different data frames and compare and print the ones that are similar

I currently have this code:
install.packages(c("httr", "jsonlite", "tidyverse"))
library(httr)
library(jsonlite)
library(tidyverse)
res1 <- GET("https://rss.applemarketingtools.com/api/v2/us/music/most-played/100/songs.json")
res1
rawToChar(res1$content)
data1 <- fromJSON(rawToChar(res1$content))
us100 <- data1$feed$results
res2 <- GET("https://rss.applemarketingtools.com/api/v2/gb/music/most-played/100/songs.json")
data2 <- fromJSON(rawToChar(res2$content))
uk100 <- data2$feed$results
I want to compare the two data frames and create a new one containing the artist names and song names that both data frames have in common. How do I do this?
I think you're just looking for an inner_join:
us100 %>% inner_join(uk100, by = "id") %>% as_tibble()
#> # A tibble: 16 x 21
#> artistName.x id name.x releaseDate.x kind.x artistId.x artistUrl.x
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Jack Harlow 1618~ First~ 2022-04-08 songs 1047679432 https://mu~
#> 2 Harry Styles 1615~ As It~ 2022-03-31 songs 471260289 https://mu~
#> 3 Lil Baby 1618~ In A ~ 2022-04-08 songs 1276656483 https://mu~
#> 4 Lauren Spencer-Smith 1618~ Flowe~ 2022-04-14 songs 1462708784 https://mu~
#> 5 Glass Animals 1508~ Heat ~ 2020-06-29 songs 528928008 https://mu~
#> 6 Carolina Gaitán - ~ 1594~ We Do~ 2021-11-19 songs 1227636438 https://mu~
#> 7 The Kid LAROI & Jus~ 1574~ STAY 2021-07-09 songs 1435848034 https://mu~
#> 8 Frank Ocean 1440~ Lost 2012-07-10 songs 442122051 https://mu~
#> 9 Elton John & Dua Li~ 1578~ Cold ~ 2021-08-13 songs 54657 https://mu~
#> 10 Tate McRae 1606~ she's~ 2022-02-04 songs 1446365464 https://mu~
#> 11 Adele 1590~ Easy ~ 2021-10-14 songs 262836961 https://mu~
#> 12 Lil Tjay 1613~ In My~ 2022-04-01 songs 1436446949 https://mu~
#> 13 Lizzo 1619~ About~ 2022-04-14 songs 472949623 https://mu~
#> 14 Tiësto & Ava Max 1590~ The M~ 2021-11-04 songs 4091218 https://mu~
#> 15 Ed Sheeran 1581~ Shive~ 2021-09-09 songs 183313439 https://mu~
#> 16 The Weeknd 1488~ Blind~ 2019-11-29 songs 479756766 https://mu~
#> # ... with 14 more variables: contentAdvisoryRating.x <chr>,
#> # artworkUrl100.x <chr>, genres.x <list>, url.x <chr>, artistName.y <chr>,
#> # name.y <chr>, releaseDate.y <chr>, kind.y <chr>, artistId.y <chr>,
#> # artistUrl.y <chr>, contentAdvisoryRating.y <chr>, artworkUrl100.y <chr>,
#> # genres.y <list>, url.y <chr>
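Since the question asks only for the artist names and song names that appear in both charts, a narrower result may be enough. Here is a base-R sketch on toy data; the rows are made up, and only the column names id, artistName and name are taken from the output above.

```r
# Toy stand-ins for us100 and uk100 (made-up rows, same column names).
us <- data.frame(
  id         = c("1", "2", "3"),
  artistName = c("Artist A", "Artist B", "Artist C"),
  name       = c("Song 1", "Song 2", "Song 3"),
  stringsAsFactors = FALSE
)
uk <- data.frame(
  id         = c("2", "3", "4"),
  artistName = c("Artist B", "Artist C", "Artist D"),
  name       = c("Song 2", "Song 3", "Song 4"),
  stringsAsFactors = FALSE
)

# Keep the rows of `us` whose id also occurs in `uk`, then keep only the
# two columns of interest. No .x/.y suffixes are created this way.
common <- us[us$id %in% uk$id, c("artistName", "name")]
print(common)
```

With dplyr, the same idea could be written as semi_join(us100, uk100, by = "id") followed by select(artistName, name).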

Step_dummy. Dealing with duplicated column names generated by recipe() steps, Tidymodels

Dear community,
I have been struggling for a long time to understand what is going on when I run the recipe() steps for my linear (glm) model using the Tidymodels framework. The step_dummy(all_nominal(), -all_outcomes()) step was suggested by usemodels https://usemodels.tidymodels.org/index.html .
When I comment out step_dummy(), recipe() and prep() work fine; however, it is important to me that these categorical variables are dummified (is that a word?).
This is the first time I am including a reprex in a question on Stack Overflow, so please let me know if you need more information to help with this.
I have looked everywhere and tried, for example, the one_hot = TRUE and keep_original_cols arguments of step_dummy(), but they do not seem to have any effect.
It should be quite easy, as it is a matter of giving the generated columns unique names, but I have not succeeded. Here is the era.af_train set.
> era.af_train
# A tibble: 7,104 x 44
logRR ID AEZ16simple PrName.Code SubPrName.Code Product Country
<dbl> <dbl> <fct> <fct> <fct> <fct> <fct>
1 -0.851 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
2 -1.17 1663 Warm.Semiar~ BP/Mu Mu-N/TW Pearl Mill~ Niger
3 -0.314 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
4 -0.776 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
5 -0.0850 1675 Warm.Semiar~ AP TPM+N Pearl Mill~ Niger
6 -0.159 1689 Warm.Subhum~ Al/AP/BP Al+N/LF/TP/TPM~ Maize Togo
7 -0.579 1701 Warm.Semiar~ BP TW Fodder (Le~ Tunisia
8 -0.662 1729 Warm.Subhum~ Al Al-N/Al+N Cassava or~ Nigeria
9 -1.80 1802 Cool.Subhum~ Al/AP Al+N/TP Wheat Ethiop~
10 -1.74 1802 Cool.Subhum~ Al/AP Al+N/TP/TPI+N Wheat Ethiop~
# ... with 7,094 more rows, and 37 more variables: Latitude <dbl>,
# Longitude <dbl>, Site.Type <fct>, Tree <fct>, Bio01_MT_Anu.Mean <dbl>,
# Bio02_MDR.Mean <dbl>, Bio03_Iso.Mean <dbl>, Bio04_TS.Mean <dbl>,
# Bio05_TWM.Mean <dbl>, Bio06_MinTCM.Mean <dbl>, Bio07_TAR.Mean <dbl>,
# Bio08_MT_WetQ.Mean <dbl>, Bio09_MT_DryQ.Mean <dbl>,
# Bio10_MT_WarQ.Mean <dbl>, Bio11_MT_ColQ.Mean <dbl>,
# Bio12_Pecip_Anu.Mean <dbl>, Bio13_Precip_WetM.Mean <dbl>,
# Bio14_Precip_DryM.Mean <dbl>, Bio15_Precip_S.Mean <dbl>,
# Bio16_Precip_WetQ.Mean <dbl>, Bio17_Precip_DryQ.Mean <dbl>,
# Mean_log.n_tot_ncs <dbl>, Mean_log.ca_mehlich3 <dbl>,
# Mean_log.k_mehlich3 <dbl>, Mean_log.mg_mehlich3 <dbl>,
# Mean_log.p_mehlich3 <dbl>, Mean_log.s_mehlich3 <dbl>,
# Mean_log.fe_mehlich3 <dbl>, Mean_db_od <dbl>, Mean_bdr <dbl>,
# Mean_sand_tot_psa <dbl>, Mean_clay_tot_psa <dbl>, Mean_ph_h2o <dbl>,
# Mean_log.ecec.f <dbl>, Mean_log.c_tot <dbl>, Mean_log.oc <dbl>,
# Slope.mean <dbl>
I am including the columns ID, AEZ16simple, PrName.Code, SubPrName.Code, Product, Country, Latitude and Longitude as "ID variables", as I wish to compare the glm model later with a random forest model and a XGBoost model.
All help is welcome!
Have a good weekend and
thank you in advance.
library(reprex)
#> Warning: package 'reprex' was built under R version 4.0.5
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.5
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(recipes)
#> Warning: package 'recipes' was built under R version 4.0.5
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(readr)
#> Warning: package 'readr' was built under R version 4.0.5
setwd("C:/Users/lindh011/OneDrive - Wageningen University & Research/Internship ICRAF (ERA)/ERA_Agroforestry_WURwork")
era.af_train <- read_csv("era.af_train.csv")
#>
#> -- Column specification --------------------------------------------------------
#> cols(
#> .default = col_double(),
#> AEZ16simple = col_character(),
#> PrName.Code = col_character(),
#> SubPrName.Code = col_character(),
#> Product = col_character(),
#> Country = col_character(),
#> Site.Type = col_character(),
#> Tree = col_character()
#> )
#> i Use `spec()` for the full column specifications.
era.af_train_Tib <- as_tibble(era.af_train)
glmnet_recipe <-
  recipe(formula = logRR ~ ., data = era.af_train) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), naming = dummy_names) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  update_role(ID,
              AEZ16simple,
              PrName.Code,
              SubPrName.Code,
              Product,
              Country,
              Latitude,
              Longitude,
              new_role = "sample ID") %>%
  step_impute_mode(all_nominal(), -all_outcomes()) %>%
  step_impute_knn(all_numeric_predictors()) %>%
  step_impute_knn(logRR) %>%
  step_corr(all_numeric_predictors()) %>%
  step_nzv(all_numeric_predictors()) %>%
  prep()
#> Error: Column names `SubPrName.Code_AF.N.Al.N.TP`, `SubPrName.Code_AF.N.Al.N.TP.TPM`, `SubPrName.Code_Al.N.In.N`, `SubPrName.Code_Al.N.In.N`, `SubPrName.Code_Al.N`, and 33 more must not be duplicated.
#> Use .name_repair to specify repair.
Created on 2021-07-02 by the reprex package (v2.0.0)
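One plausible cause of the duplicated names, inferred from the error message rather than confirmed: values such as "Al-N" and "Al+N" differ only in characters that are not valid in R names, so they can collapse to the same string once the level names are sanitized. The base-R sketch below shows the collapse with make.names() and one possible workaround. The level values and the helper uniquify_levels are hypothetical, and the assumption that step_dummy() sanitizes names in a make.names()-like way is not verified here.

```r
# Hypothetical level names, modelled on the "Al-N/Al+N" style values shown
# in SubPrName.Code; they differ only in "-" versus "+".
lvls <- c("Al-N/In-N", "Al+N/In-N")

# make.names() replaces every invalid character with ".", so both collapse.
sanitized <- make.names(lvls)
print(sanitized)
print(anyDuplicated(sanitized) > 0)

# Possible workaround: make the levels valid AND unique before step_dummy().
uniquify_levels <- function(f) {
  f <- factor(f)
  levels(f) <- make.unique(make.names(levels(f)))
  f
}

fixed <- uniquify_levels(lvls)
print(levels(fixed))
```

A helper like this could be applied to the nominal columns before the recipe is built, so that the dummy column names derived from the levels are already distinct.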

readr::read_csv() - parsing failure with nested quotations

I have a CSV file where some quoted fields contain another quotation inside them:
"blah blah "nested quote"" and this generates parsing failures. I'm not sure whether this is a bug or whether there is an argument to deal with it.
Reprex (file is here or content pasted below):
readr::read_csv("~/temp/shittyquotes.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> INSTNM = col_character(),
#> ADDR = col_character(),
#> CITY = col_character(),
#> STABBR = col_character(),
#> ZIP = col_character(),
#> CHFNM = col_character(),
#> CHFTITLE = col_character(),
#> EIN = col_character(),
#> OPEID = col_character(),
#> WEBADDR = col_character(),
#> ADMINURL = col_character(),
#> FAIDURL = col_character(),
#> APPLURL = col_character(),
#> ACT = col_character(),
#> IALIAS = col_character(),
#> INSTCAT = col_character(),
#> CCBASIC = col_character(),
#> CCIPUG = col_character(),
#> CCSIZSET = col_character(),
#> CARNEGIE = col_character()
#> # ... with 2 more columns
#> )
#> See spec(...) for full column specifications.
#> Warning: 3 parsing failures.
#> row col expected actual file
#> 2 IALIAS delimiter or quote C '~/temp/shittyquotes.csv'
#> 2 IALIAS delimiter or quote D '~/temp/shittyquotes.csv'
#> 2 NA 59 columns 100 columns '~/temp/shittyquotes.csv'
#> # A tibble: 2 x 59
#> UNITID INSTNM ADDR CITY STABBR ZIP FIPS OBEREG CHFNM CHFTITLE
#> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 441238 City … 1500… Duar… CA 9101… 6 8 Dr. … Director
#> 2 441247 Commu… 3800… Mode… CA 9535… 6 8 Vict… Preside…
#> # ... with 49 more variables: GENTELE <dbl>, EIN <chr>, OPEID <chr>,
#> # OPEFLAG <dbl>, WEBADDR <chr>, ADMINURL <chr>, FAIDURL <chr>,
#> # APPLURL <chr>, SECTOR <dbl>, ICLEVEL <dbl>, CONTROL <dbl>,
#> # HLOFFER <dbl>, UGOFFER <dbl>, GROFFER <dbl>, FPOFFER <dbl>,
#> # HDEGOFFR <dbl>, DEGGRANT <dbl>, HBCU <dbl>, HOSPITAL <dbl>,
#> # MEDICAL <dbl>, TRIBAL <dbl>, LOCALE <dbl>, OPENPUBL <dbl>, ACT <chr>,
#> # NEWID <dbl>, DEATHYR <dbl>, CLOSEDAT <dbl>, CYACTIVE <dbl>,
#> # POSTSEC <dbl>, PSEFLAG <dbl>, PSET4FLG <dbl>, RPTMTH <dbl>,
#> # IALIAS <chr>, INSTCAT <chr>, CCBASIC <chr>, CCIPUG <chr>,
#> # CCIPGRAD <dbl>, CCUGPROF <dbl>, CCENRPRF <dbl>, CCSIZSET <chr>,
#> # CARNEGIE <chr>, TENURSYS <dbl>, LANDGRNT <dbl>, INSTSIZE <chr>,
#> # CBSA <dbl>, CBSATYPE <chr>, CSA <dbl>, NECTA <dbl>, DFRCGID <dbl>
Created on 2018-12-04 by the reprex package (v0.2.1)
Also here's the csv content:
UNITID,INSTNM,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,CHFNM,CHFTITLE,GENTELE,EIN,OPEID,OPEFLAG,WEBADDR,ADMINURL,FAIDURL,APPLURL,SECTOR,ICLEVEL,CONTROL,HLOFFER,UGOFFER,GROFFER,FPOFFER,HDEGOFFR,DEGGRANT,HBCU,HOSPITAL,MEDICAL,TRIBAL,LOCALE,OPENPUBL,ACT,NEWID,DEATHYR,CLOSEDAT,CYACTIVE,POSTSEC,PSEFLAG,PSET4FLG,RPTMTH,IALIAS,INSTCAT,CCBASIC,CCIPUG,CCIPGRAD,CCUGPROF,CCENRPRF,CCSIZSET,CARNEGIE,TENURSYS,LANDGRNT,INSTSIZE,CBSA,CBSATYPE,CSA,NECTA,DFRCGID
441238,"City of Hope Graduate School of Biological Science","1500 E Duarte Rd","Duarte","CA","91010-3000", 6, 8,"Dr. Arthur Riggs","Director","6263018293","953432210","03592400",1,"gradschool.coh.org"," "," "," ",2,1,2,9,2,1,2,10,1,2,-2,2,2,21,1,"A ",-2,-2,"-2",1,1,1,1,1," ",1,25,-2,-2,-2,7,-2,-3,1,2,1,31100,1,348,-2,198
441247,"Community Business College","3800 McHenry Ave Suite M","Modesto","CA","95356-1569", 6, 8,"Victor L. Vandenberghe","President","2095293648","484-8230","03615300",7,"www.communitybusinesscollege.edu","www.communitybusinesscollege.edu","www.cbc123.com","www.123.com",9,3,3,1,1,2,2,0,2,2,-2,2,2,12,1,"A ",-2,-2,"-2",1,1,1,1,2,"formerly "Community Business School"",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,33700,1,-2,-2,71
441256,"Design's School of Cosmetology","715 24th St Ste E","Paso Robles","CA","93446", 6, 8,"Sharon Skinner","Administrator","8052378575","80002030","03646300",1,"designsschool.com"," "," "," ",9,3,3,2,1,2,2,0,2,2,-2,2,2,13,1,"A ",-2,-2,"-2",1,1,1,1,2," ",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,42020,1,-2,-2,46
Jim Hester provided this answer:
You need to use the escape_double = FALSE argument to read_delim(). This isn't part of read_csv() because excel style csvs escape inner quotations by doubling them.
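A minimal sketch of that suggestion follows, using a small temporary file with one nested quotation instead of the full file from the question. The file contents and column names are made up for illustration, and the parsed result is not shown because it may vary between readr versions; problems() can be used to check whether any parsing failures remain.

```r
library(readr)

# Write a tiny CSV with a nested quotation, modelled on the IALIAS field.
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("id,alias,n",
    "1,\"formerly \"Community Business School\"\",6"),
  tmp
)

# read_delim() exposes escape_double; FALSE tells the parser not to treat
# a doubled quote as an escaped quote.
res <- read_delim(tmp, delim = ",", escape_double = FALSE)

# Inspect the result and any remaining parsing problems.
print(res)
print(problems(res))
```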
data.table's fread() parses the file just fine. It throws a warning about the quotes, but you can ignore it.
library(data.table)
data.table::fread("./temp.csv")
Warning message:
In data.table::fread("./temp.csv") :
Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
