I am trying to scrap data from the link below. I need to click and download a csv file available in the csv button from the webpage.
library(netstat)
library(RSelenium)
url <- https://gtr.ukri.org/search/project?term=%22climate+change%22+OR+%22climate+crisis%22&fetchSize=25&selectedSortableField=&selectedSortOrder=&fields=pro.gr%2Cpro.t%2Cpro.a%2Cpro.orcidId%2Cper.fn%2Cper.on%2Cper.sn%2Cper.fnsn%2Cper.orcidId%2Cper.org.n%2Cper.pro.t%2Cper.pro.abs%2Cpub.t%2Cpub.a%2Cpub.orcidId%2Corg.n%2Corg.orcidId%2Cacp.t%2Cacp.d%2Cacp.i%2Cacp.oid%2Ckf.d%2Ckf.oid%2Cis.t%2Cis.d%2Cis.oid%2Ccol.i%2Ccol.d%2Ccol.c%2Ccol.dept%2Ccol.org%2Ccol.pc%2Ccol.pic%2Ccol.oid%2Cip.t%2Cip.d%2Cip.i%2Cip.oid%2Cpol.i%2Cpol.gt%2Cpol.in%2Cpol.oid%2Cprod.t%2Cprod.d%2Cprod.i%2Cprod.oid%2Crtp.t%2Crtp.d%2Crtp.i%2Crtp.oid%2Crdm.t%2Crdm.d%2Crdm.i%2Crdm.oid%2Cstp.t%2Cstp.d%2Cstp.i%2Cstp.oid%2Cso.t%2Cso.d%2Cso.cn%2Cso.i%2Cso.oid%2Cff.t%2Cff.d%2Cff.c%2Cff.org%2Cff.dept%2Cff.oid%2Cdis.t%2Cdis.d%2Cdis.i%2Cdis.oid%2Ccpro.rtpc%2Ccpro.rcpgm%2Ccpro.hlt&type=#/csvConfirm
I am struggling to implement that using Selenium. Here is the code I have so far.
rD <- rsDriver(port= free_port(), browser = "chrome", chromever = "106.0.5249.21", check = TRUE, verbose = TRUE)
remote_driver <- rD[["client"]]
remDr <- rD$client
remDr$navigate(url)
webElem <- remDr$findElement(using = "css", "content gtr-body d-flex flex-column ng-scope")
webElem$clickElement()
You can often just record the network log and see what request is sent when hitting the download button. In Chrome, right click Inspect, then look for the network tab. In this case there is only one request sent:
Right click and "copy as cURL" to see the whole request or just click copy URL, since the cookies and headers are not necessary here. I wrote a quick function around the task of querying the site:
dl_ukri <- function(query,
destfile = paste0(query, ".csv"),
size = 25L,
quiet_download = FALSE) {
url <- paste0(
"https://gtr.ukri.org/search/project/csv?term=",
urltools::url_encode(query),
"&selectedFacets=&fields=acp.d,is.t,prod.t,pol.oid,acp.oid,rtp.t,pol.in,prod.i,per.pro.abs,acp.i,col.org,acp.t,is.d,is.oid,cpro.rtpc,prod.d,stp.oid,rtp.i,rdm.oid,rtp.d,col.dept,ff.d,ff.c,col.pc,pub.t,kf.d,dis.t,col.oid,pro.t,per.sn,org.orcidId,per.on,ff.dept,rdm.t,org.n,dis.d,prod.oid,so.cn,dis.i,pro.a,pub.orcidId,pol.gt,rdm.i,rdm.d,so.oid,per.fnsn,per.org.n,per.pro.t,pro.orcidId,pub.a,col.d,per.orcidId,col.c,ip.i,pro.gr,pol.i,so.t,per.fn,col.i,ip.t,ff.oid,stp.i,so.i,cpro.rcpgm,cpro.hlt,col.pic,so.d,ff.t,ip.d,dis.oid,ip.oid,stp.d,rtp.oid,ff.org,kf.oid,stp.t&type=&selectedSortableField=score&selectedSortOrder=DESC"
)
curl::curl_download(url, destfile, quiet = quiet_download)
}
Testing this with your original search:
dl_ukri('"climate change" OR "climate crisis"', destfile = "test.csv")
readr::read_csv("test.csv")
#> Rows: 5894 Columns: 25
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (23): FundingOrgName, ProjectReference, LeadROName, Department, ProjectC...
#> dbl (2): AwardPounds, ExpenditurePounds
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 5,894 × 25
#> FundingOrgN…¹ Proje…² LeadR…³ Depar…⁴ Proje…⁵ PISur…⁶ PIFir…⁷ PIOth…⁸ PI OR…⁹
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ESRC ES/W00… Univer… School… Fellow… Thew Harriet Christ… http:/…
#> 2 AHRC AH/W00… Univer… Arts L… Resear… Scott Peter Manley <NA>
#> 3 AHRC 2609218 Queen … Drama Studen… <NA> <NA> <NA> <NA>
#> 4 UKRI MR/V02… Univer… Politi… Fellow… Spaiser Viktor… <NA> http:/…
#> 5 MRC MC_PC_… Univer… <NA> Intram… Alessi Dario Renato <NA>
#> 6 AHRC 1948811 Royal … School… Studen… <NA> <NA> <NA> <NA>
#> 7 EPSRC 2688399 Brunel… Chemic… Studen… <NA> <NA> <NA> <NA>
#> 8 ESRC ES/T01… Univer… Social… Resear… Walker Cather… Louise http:/…
#> 9 AHRC AH/X00… Queen … Drama Resear… Herita… Paul <NA> http:/…
#> 10 ESRC 2272756 Univer… Sch of… Studen… <NA> <NA> <NA> <NA>
#> # … with 5,884 more rows, 16 more variables: StudentSurname <chr>,
#> # StudentFirstName <chr>, StudentOtherNames <chr>, `Student ORCID iD` <chr>,
#> # Title <chr>, StartDate <chr>, EndDate <chr>, AwardPounds <dbl>,
#> # ExpenditurePounds <dbl>, Region <chr>, Status <chr>, GTRProjectUrl <chr>,
#> # ProjectId <chr>, FundingOrgId <chr>, LeadROId <chr>, PIId <chr>, and
#> # abbreviated variable names ¹FundingOrgName, ²ProjectReference, ³LeadROName,
#> # ⁴Department, ⁵ProjectCategory, ⁶PISurname, ⁷PIFirstName, ⁸PIOtherNames, …
Created on 2022-10-17 with reprex v2.0.2
Voilà. I also played around with the fetchSize=25 which is in the original URL. But it does not seem to do anything, so I just omitted it.
I have spreadsheet uploaded as csv file in google drive unlocked so users can read from it.
This is the link to the csv file:
https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/edit?usp=sharing
I am trying to read it from R but I am getting a long list of error messages. I am using:
id = "170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk"
read.csv(sprint("https://docs.google.com/spreadsheets/d/uc?id=%s&export=download",id))
Could someone suggest how to read files from google drive directly into R?
I would try to publish the sheet as a CSV file (doc), and then read it from there.
It seems like your file is already published as a CSV. So, this should work. (Note that the URL ends with /pub?output=csv)
read.csv("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
To read the CSV file faster you can use vroom which is even faster than fread(). See here.
Now using vroom,
library(vroom)
vroom("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
#> Rows: 387048 Columns: 14
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): StationCode, SampleID, WeatherCode, OrganismCode, race, race2
#> dbl (7): WaterTemperature, Turbidity, Velocity, ForkLength, Weight, Count, ...
#> date (1): SampleDate
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 387,048 × 14
#> StationCode SampleDate SampleID WeatherCode WaterTemperature Turbidity
#> <chr> <date> <chr> <chr> <dbl> <dbl>
#> 1 Gate 11 2000-04-25 116_00 CLD 13.1 2
#> 2 Gate 5 1995-04-26 117_95 CLR NA 2
#> 3 Gate 2 1995-04-21 111_95 W 10.4 12
#> 4 Gate 6 2008-12-13 348_08 CLR 49.9 1.82
#> 5 Gate 5 1999-12-10 344_99 CLR 7.30 1.5
#> 6 Gate 6 2012-05-25 146_12 CLR 55.5 1.60
#> 7 Gate 10 2011-06-28 179_11 RAN 57.3 3.99
#> 8 Gate 11 1996-04-25 116_96 CLR 13.8 21
#> 9 Gate 9 2007-07-02 183_07 CLR 56.6 2.09
#> 10 Gate 6 2009-06-04 155_09 CLR 58.6 3.08
#> # … with 387,038 more rows, and 8 more variables: Velocity <dbl>,
#> # OrganismCode <chr>, ForkLength <dbl>, Weight <dbl>, Count <dbl>,
#> # race <chr>, year <dbl>, race2 <chr>
Created on 2022-07-08 by the reprex package (v2.0.1)
Dear community,
I have been struggeling for extensive amount of time now trying to understand what is going on here, when I perform my recipe() steps for my linear (glm) model using the Tidymodels framework. The recipe() step_dummy(all_nominal(), -all_outcomes()) was suggested by the usemodels() function https://usemodels.tidymodels.org/index.html .
When I commend out the step_dummy() the recipe() and prep() works fine, however its important to me that these categorical variables are dummyfied (..is that a word!?).
This is the first time I making use of and including a reprex in a question on stackoverflow, so please let me know if you need more information to assist on this matter.
I have looked everywhere, e.g. including a one_hot = TRUE or keep_original_cols argument in the step_dummy() but it does not seem to be effective.
It should be quite easy as it is a matter of renaming the generated columns as unique, but do not succeed. Here is the era.af_train set.
> era.af_train
# A tibble: 7,104 x 44
logRR ID AEZ16simple PrName.Code SubPrName.Code Product Country
<dbl> <dbl> <fct> <fct> <fct> <fct> <fct>
1 -0.851 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
2 -1.17 1663 Warm.Semiar~ BP/Mu Mu-N/TW Pearl Mill~ Niger
3 -0.314 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
4 -0.776 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
5 -0.0850 1675 Warm.Semiar~ AP TPM+N Pearl Mill~ Niger
6 -0.159 1689 Warm.Subhum~ Al/AP/BP Al+N/LF/TP/TPM~ Maize Togo
7 -0.579 1701 Warm.Semiar~ BP TW Fodder (Le~ Tunisia
8 -0.662 1729 Warm.Subhum~ Al Al-N/Al+N Cassava or~ Nigeria
9 -1.80 1802 Cool.Subhum~ Al/AP Al+N/TP Wheat Ethiop~
10 -1.74 1802 Cool.Subhum~ Al/AP Al+N/TP/TPI+N Wheat Ethiop~
# ... with 7,094 more rows, and 37 more variables: Latitude <dbl>,
# Longitude <dbl>, Site.Type <fct>, Tree <fct>, Bio01_MT_Anu.Mean <dbl>,
# Bio02_MDR.Mean <dbl>, Bio03_Iso.Mean <dbl>, Bio04_TS.Mean <dbl>,
# Bio05_TWM.Mean <dbl>, Bio06_MinTCM.Mean <dbl>, Bio07_TAR.Mean <dbl>,
# Bio08_MT_WetQ.Mean <dbl>, Bio09_MT_DryQ.Mean <dbl>,
# Bio10_MT_WarQ.Mean <dbl>, Bio11_MT_ColQ.Mean <dbl>,
# Bio12_Pecip_Anu.Mean <dbl>, Bio13_Precip_WetM.Mean <dbl>,
# Bio14_Precip_DryM.Mean <dbl>, Bio15_Precip_S.Mean <dbl>,
# Bio16_Precip_WetQ.Mean <dbl>, Bio17_Precip_DryQ.Mean <dbl>,
# Mean_log.n_tot_ncs <dbl>, Mean_log.ca_mehlich3 <dbl>,
# Mean_log.k_mehlich3 <dbl>, Mean_log.mg_mehlich3 <dbl>,
# Mean_log.p_mehlich3 <dbl>, Mean_log.s_mehlich3 <dbl>,
# Mean_log.fe_mehlich3 <dbl>, Mean_db_od <dbl>, Mean_bdr <dbl>,
# Mean_sand_tot_psa <dbl>, Mean_clay_tot_psa <dbl>, Mean_ph_h2o <dbl>,
# Mean_log.ecec.f <dbl>, Mean_log.c_tot <dbl>, Mean_log.oc <dbl>,
# Slope.mean <dbl>
I am including the columns ID, AEZ16simple, PrName.Code, SubPrName.Code, Product, Country, Latitude and Longitude as "ID variables", as I wish to compare the glm model later with a random forest model and a XGBoost model.
All help is welcome!
Have a good weekend and
thank you in advance.
library(reprex)
#> Warning: package 'reprex' was built under R version 4.0.5
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.5
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(recipes)
#> Warning: package 'recipes' was built under R version 4.0.5
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(readr)
#> Warning: package 'readr' was built under R version 4.0.5
setwd("C:/Users/lindh011/OneDrive - Wageningen University & Research/Internship ICRAF (ERA)/ERA_Agroforestry_WURwork")
era.af_train <- read_csv("era.af_train.csv")
#>
#> -- Column specification --------------------------------------------------------
#> cols(
#> .default = col_double(),
#> AEZ16simple = col_character(),
#> PrName.Code = col_character(),
#> SubPrName.Code = col_character(),
#> Product = col_character(),
#> Country = col_character(),
#> Site.Type = col_character(),
#> Tree = col_character()
#> )
#> i Use `spec()` for the full column specifications.
era.af_train_Tib <- as_tibble(era.af_train)
glmnet_recipe <-
recipe(formula = logRR ~ ., data = era.af_train) %>%
step_novel(all_nominal(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes(), naming = dummy_names) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal()) %>%
update_role(ID,
AEZ16simple,
PrName.Code,
SubPrName.Code,
Product,
Country,
Latitude,
Longitude,
new_role = "sample ID") %>%
step_impute_mode(all_nominal(), -all_outcomes()) %>%
step_impute_knn (all_numeric_predictors()) %>%
step_impute_knn(logRR) %>%
step_corr(all_numeric_predictors()) %>%
step_nzv(all_numeric_predictors()) %>%
prep()
#> Error: Column names `SubPrName.Code_AF.N.Al.N.TP`, `SubPrName.Code_AF.N.Al.N.TP.TPM`, `SubPrName.Code_Al.N.In.N`, `SubPrName.Code_Al.N.In.N`, `SubPrName.Code_Al.N`, and 33 more must not be duplicated.
#> Use .name_repair to specify repair.
Created on 2021-07-02 by the reprex package (v2.0.0)
I have a csv where some columns have a quoted column with another quotation inside it:
"blah blah "nested quote"" and it generates parsing failures. I'm not sure if this is a bug or there is an argument to deal with this?
Reprex (file is here or content pasted below):
readr::read_csv("~/temp/shittyquotes.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> INSTNM = col_character(),
#> ADDR = col_character(),
#> CITY = col_character(),
#> STABBR = col_character(),
#> ZIP = col_character(),
#> CHFNM = col_character(),
#> CHFTITLE = col_character(),
#> EIN = col_character(),
#> OPEID = col_character(),
#> WEBADDR = col_character(),
#> ADMINURL = col_character(),
#> FAIDURL = col_character(),
#> APPLURL = col_character(),
#> ACT = col_character(),
#> IALIAS = col_character(),
#> INSTCAT = col_character(),
#> CCBASIC = col_character(),
#> CCIPUG = col_character(),
#> CCSIZSET = col_character(),
#> CARNEGIE = col_character()
#> # ... with 2 more columns
#> )
#> See spec(...) for full column specifications.
#> Warning: 3 parsing failures.
#> row col expected actual file
#> 2 IALIAS delimiter or quote C '~/temp/shittyquotes.csv'
#> 2 IALIAS delimiter or quote D '~/temp/shittyquotes.csv'
#> 2 NA 59 columns 100 columns '~/temp/shittyquotes.csv'
#> # A tibble: 2 x 59
#> UNITID INSTNM ADDR CITY STABBR ZIP FIPS OBEREG CHFNM CHFTITLE
#> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 441238 City … 1500… Duar… CA 9101… 6 8 Dr. … Director
#> 2 441247 Commu… 3800… Mode… CA 9535… 6 8 Vict… Preside…
#> # ... with 49 more variables: GENTELE <dbl>, EIN <chr>, OPEID <chr>,
#> # OPEFLAG <dbl>, WEBADDR <chr>, ADMINURL <chr>, FAIDURL <chr>,
#> # APPLURL <chr>, SECTOR <dbl>, ICLEVEL <dbl>, CONTROL <dbl>,
#> # HLOFFER <dbl>, UGOFFER <dbl>, GROFFER <dbl>, FPOFFER <dbl>,
#> # HDEGOFFR <dbl>, DEGGRANT <dbl>, HBCU <dbl>, HOSPITAL <dbl>,
#> # MEDICAL <dbl>, TRIBAL <dbl>, LOCALE <dbl>, OPENPUBL <dbl>, ACT <chr>,
#> # NEWID <dbl>, DEATHYR <dbl>, CLOSEDAT <dbl>, CYACTIVE <dbl>,
#> # POSTSEC <dbl>, PSEFLAG <dbl>, PSET4FLG <dbl>, RPTMTH <dbl>,
#> # IALIAS <chr>, INSTCAT <chr>, CCBASIC <chr>, CCIPUG <chr>,
#> # CCIPGRAD <dbl>, CCUGPROF <dbl>, CCENRPRF <dbl>, CCSIZSET <chr>,
#> # CARNEGIE <chr>, TENURSYS <dbl>, LANDGRNT <dbl>, INSTSIZE <chr>,
#> # CBSA <dbl>, CBSATYPE <chr>, CSA <dbl>, NECTA <dbl>, DFRCGID <dbl>
Created on 2018-12-04 by the reprex package (v0.2.1)
Also here's the csv content:
UNITID,INSTNM,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,CHFNM,CHFTITLE,GENTELE,EIN,OPEID,OPEFLAG,WEBADDR,ADMINURL,FAIDURL,APPLURL,SECTOR,ICLEVEL,CONTROL,HLOFFER,UGOFFER,GROFFER,FPOFFER,HDEGOFFR,DEGGRANT,HBCU,HOSPITAL,MEDICAL,TRIBAL,LOCALE,OPENPUBL,ACT,NEWID,DEATHYR,CLOSEDAT,CYACTIVE,POSTSEC,PSEFLAG,PSET4FLG,RPTMTH,IALIAS,INSTCAT,CCBASIC,CCIPUG,CCIPGRAD,CCUGPROF,CCENRPRF,CCSIZSET,CARNEGIE,TENURSYS,LANDGRNT,INSTSIZE,CBSA,CBSATYPE,CSA,NECTA,DFRCGID
441238,"City of Hope Graduate School of Biological Science","1500 E Duarte Rd","Duarte","CA","91010-3000", 6, 8,"Dr. Arthur Riggs","Director","6263018293","953432210","03592400",1,"gradschool.coh.org"," "," "," ",2,1,2,9,2,1,2,10,1,2,-2,2,2,21,1,"A ",-2,-2,"-2",1,1,1,1,1," ",1,25,-2,-2,-2,7,-2,-3,1,2,1,31100,1,348,-2,198
441247,"Community Business College","3800 McHenry Ave Suite M","Modesto","CA","95356-1569", 6, 8,"Victor L. Vandenberghe","President","2095293648","484-8230","03615300",7,"www.communitybusinesscollege.edu","www.communitybusinesscollege.edu","www.cbc123.com","www.123.com",9,3,3,1,1,2,2,0,2,2,-2,2,2,12,1,"A ",-2,-2,"-2",1,1,1,1,2,"formerly "Community Business School"",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,33700,1,-2,-2,71
441256,"Design's School of Cosmetology","715 24th St Ste E","Paso Robles","CA","93446", 6, 8,"Sharon Skinner","Administrator","8052378575","80002030","03646300",1,"designsschool.com"," "," "," ",9,3,3,2,1,2,2,0,2,2,-2,2,2,13,1,"A ",-2,-2,"-2",1,1,1,1,2," ",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,42020,1,-2,-2,46
Jim Hester provided this answer:
You need to use the escape_double = FALSE argument to read_delim(). This isn't part of read_csv() because excel style csvs escape inner quotations by doubling them.
data.table's fread() parses the file just fine... it throws a warning about the quotes, but you can ignore it..
library( data.table )
data.table::fread("./temp.csv" )
Warning message:
In data.table::fread("./temp.csv") :
Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.