readr::read_csv() - parsing failure with nested quotations - r

I have a csv where some columns contain a quoted field with another quotation nested inside it:
"blah blah "nested quote"" and it generates parsing failures. I'm not sure if this is a bug or whether there is an argument to deal with it?
Reprex (file is here or content pasted below):
readr::read_csv("~/temp/shittyquotes.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> INSTNM = col_character(),
#> ADDR = col_character(),
#> CITY = col_character(),
#> STABBR = col_character(),
#> ZIP = col_character(),
#> CHFNM = col_character(),
#> CHFTITLE = col_character(),
#> EIN = col_character(),
#> OPEID = col_character(),
#> WEBADDR = col_character(),
#> ADMINURL = col_character(),
#> FAIDURL = col_character(),
#> APPLURL = col_character(),
#> ACT = col_character(),
#> IALIAS = col_character(),
#> INSTCAT = col_character(),
#> CCBASIC = col_character(),
#> CCIPUG = col_character(),
#> CCSIZSET = col_character(),
#> CARNEGIE = col_character()
#> # ... with 2 more columns
#> )
#> See spec(...) for full column specifications.
#> Warning: 3 parsing failures.
#> row col expected actual file
#> 2 IALIAS delimiter or quote C '~/temp/shittyquotes.csv'
#> 2 IALIAS delimiter or quote D '~/temp/shittyquotes.csv'
#> 2 NA 59 columns 100 columns '~/temp/shittyquotes.csv'
#> # A tibble: 2 x 59
#> UNITID INSTNM ADDR CITY STABBR ZIP FIPS OBEREG CHFNM CHFTITLE
#> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 441238 City … 1500… Duar… CA 9101… 6 8 Dr. … Director
#> 2 441247 Commu… 3800… Mode… CA 9535… 6 8 Vict… Preside…
#> # ... with 49 more variables: GENTELE <dbl>, EIN <chr>, OPEID <chr>,
#> # OPEFLAG <dbl>, WEBADDR <chr>, ADMINURL <chr>, FAIDURL <chr>,
#> # APPLURL <chr>, SECTOR <dbl>, ICLEVEL <dbl>, CONTROL <dbl>,
#> # HLOFFER <dbl>, UGOFFER <dbl>, GROFFER <dbl>, FPOFFER <dbl>,
#> # HDEGOFFR <dbl>, DEGGRANT <dbl>, HBCU <dbl>, HOSPITAL <dbl>,
#> # MEDICAL <dbl>, TRIBAL <dbl>, LOCALE <dbl>, OPENPUBL <dbl>, ACT <chr>,
#> # NEWID <dbl>, DEATHYR <dbl>, CLOSEDAT <dbl>, CYACTIVE <dbl>,
#> # POSTSEC <dbl>, PSEFLAG <dbl>, PSET4FLG <dbl>, RPTMTH <dbl>,
#> # IALIAS <chr>, INSTCAT <chr>, CCBASIC <chr>, CCIPUG <chr>,
#> # CCIPGRAD <dbl>, CCUGPROF <dbl>, CCENRPRF <dbl>, CCSIZSET <chr>,
#> # CARNEGIE <chr>, TENURSYS <dbl>, LANDGRNT <dbl>, INSTSIZE <chr>,
#> # CBSA <dbl>, CBSATYPE <chr>, CSA <dbl>, NECTA <dbl>, DFRCGID <dbl>
Created on 2018-12-04 by the reprex package (v0.2.1)
Also here's the csv content:
UNITID,INSTNM,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,CHFNM,CHFTITLE,GENTELE,EIN,OPEID,OPEFLAG,WEBADDR,ADMINURL,FAIDURL,APPLURL,SECTOR,ICLEVEL,CONTROL,HLOFFER,UGOFFER,GROFFER,FPOFFER,HDEGOFFR,DEGGRANT,HBCU,HOSPITAL,MEDICAL,TRIBAL,LOCALE,OPENPUBL,ACT,NEWID,DEATHYR,CLOSEDAT,CYACTIVE,POSTSEC,PSEFLAG,PSET4FLG,RPTMTH,IALIAS,INSTCAT,CCBASIC,CCIPUG,CCIPGRAD,CCUGPROF,CCENRPRF,CCSIZSET,CARNEGIE,TENURSYS,LANDGRNT,INSTSIZE,CBSA,CBSATYPE,CSA,NECTA,DFRCGID
441238,"City of Hope Graduate School of Biological Science","1500 E Duarte Rd","Duarte","CA","91010-3000", 6, 8,"Dr. Arthur Riggs","Director","6263018293","953432210","03592400",1,"gradschool.coh.org"," "," "," ",2,1,2,9,2,1,2,10,1,2,-2,2,2,21,1,"A ",-2,-2,"-2",1,1,1,1,1," ",1,25,-2,-2,-2,7,-2,-3,1,2,1,31100,1,348,-2,198
441247,"Community Business College","3800 McHenry Ave Suite M","Modesto","CA","95356-1569", 6, 8,"Victor L. Vandenberghe","President","2095293648","484-8230","03615300",7,"www.communitybusinesscollege.edu","www.communitybusinesscollege.edu","www.cbc123.com","www.123.com",9,3,3,1,1,2,2,0,2,2,-2,2,2,12,1,"A ",-2,-2,"-2",1,1,1,1,2,"formerly "Community Business School"",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,33700,1,-2,-2,71
441256,"Design's School of Cosmetology","715 24th St Ste E","Paso Robles","CA","93446", 6, 8,"Sharon Skinner","Administrator","8052378575","80002030","03646300",1,"designsschool.com"," "," "," ",9,3,3,2,1,2,2,0,2,2,-2,2,2,13,1,"A ",-2,-2,"-2",1,1,1,1,2," ",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,42020,1,-2,-2,46

Jim Hester provided this answer:
You need to use the escape_double = FALSE argument to read_delim(). This isn't part of read_csv() because Excel-style csvs escape inner quotations by doubling them.
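Applied to the file from the question, a minimal sketch of that suggestion (same path as above) looks like:

```r
library(readr)

# escape_double = FALSE tells the parser to treat an inner quote as a
# literal character rather than expecting Excel-style doubled quotes ("")
df <- read_delim("~/temp/shittyquotes.csv", delim = ",", escape_double = FALSE)
```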

data.table's fread() parses the file just fine. It throws a warning about the quotes, but you can ignore it:
library(data.table)
data.table::fread("./temp.csv")
Warning message:
In data.table::fread("./temp.csv") :
Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

Related

Step_dummy. Dealing with duplicated column names generated by recipe() steps, Tidymodels

Dear community,
I have been struggling for an extensive amount of time now trying to understand what is going on here when I perform my recipe() steps for my linear (glm) model using the Tidymodels framework. The recipe() step step_dummy(all_nominal(), -all_outcomes()) was suggested by the usemodels() function https://usemodels.tidymodels.org/index.html .
When I comment out the step_dummy(), the recipe() and prep() work fine; however, it is important to me that these categorical variables are dummified (..is that a word!?).
This is the first time I am making use of and including a reprex in a question on Stack Overflow, so please let me know if you need more information to assist on this matter.
I have looked everywhere, e.g. including a one_hot = TRUE or keep_original_cols argument in step_dummy(), but it does not seem to be effective.
It should be quite easy, as it is a matter of renaming the generated columns so they are unique, but I do not succeed. Here is the era.af_train set.
> era.af_train
# A tibble: 7,104 x 44
logRR ID AEZ16simple PrName.Code SubPrName.Code Product Country
<dbl> <dbl> <fct> <fct> <fct> <fct> <fct>
1 -0.851 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
2 -1.17 1663 Warm.Semiar~ BP/Mu Mu-N/TW Pearl Mill~ Niger
3 -0.314 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
4 -0.776 1663 Warm.Semiar~ BP TW Pearl Mill~ Niger
5 -0.0850 1675 Warm.Semiar~ AP TPM+N Pearl Mill~ Niger
6 -0.159 1689 Warm.Subhum~ Al/AP/BP Al+N/LF/TP/TPM~ Maize Togo
7 -0.579 1701 Warm.Semiar~ BP TW Fodder (Le~ Tunisia
8 -0.662 1729 Warm.Subhum~ Al Al-N/Al+N Cassava or~ Nigeria
9 -1.80 1802 Cool.Subhum~ Al/AP Al+N/TP Wheat Ethiop~
10 -1.74 1802 Cool.Subhum~ Al/AP Al+N/TP/TPI+N Wheat Ethiop~
# ... with 7,094 more rows, and 37 more variables: Latitude <dbl>,
# Longitude <dbl>, Site.Type <fct>, Tree <fct>, Bio01_MT_Anu.Mean <dbl>,
# Bio02_MDR.Mean <dbl>, Bio03_Iso.Mean <dbl>, Bio04_TS.Mean <dbl>,
# Bio05_TWM.Mean <dbl>, Bio06_MinTCM.Mean <dbl>, Bio07_TAR.Mean <dbl>,
# Bio08_MT_WetQ.Mean <dbl>, Bio09_MT_DryQ.Mean <dbl>,
# Bio10_MT_WarQ.Mean <dbl>, Bio11_MT_ColQ.Mean <dbl>,
# Bio12_Pecip_Anu.Mean <dbl>, Bio13_Precip_WetM.Mean <dbl>,
# Bio14_Precip_DryM.Mean <dbl>, Bio15_Precip_S.Mean <dbl>,
# Bio16_Precip_WetQ.Mean <dbl>, Bio17_Precip_DryQ.Mean <dbl>,
# Mean_log.n_tot_ncs <dbl>, Mean_log.ca_mehlich3 <dbl>,
# Mean_log.k_mehlich3 <dbl>, Mean_log.mg_mehlich3 <dbl>,
# Mean_log.p_mehlich3 <dbl>, Mean_log.s_mehlich3 <dbl>,
# Mean_log.fe_mehlich3 <dbl>, Mean_db_od <dbl>, Mean_bdr <dbl>,
# Mean_sand_tot_psa <dbl>, Mean_clay_tot_psa <dbl>, Mean_ph_h2o <dbl>,
# Mean_log.ecec.f <dbl>, Mean_log.c_tot <dbl>, Mean_log.oc <dbl>,
# Slope.mean <dbl>
I am including the columns ID, AEZ16simple, PrName.Code, SubPrName.Code, Product, Country, Latitude and Longitude as "ID variables", as I wish to compare the glm model later with a random forest model and an XGBoost model.
All help is welcome!
Have a good weekend, and thank you in advance.
library(reprex)
#> Warning: package 'reprex' was built under R version 4.0.5
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.5
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(recipes)
#> Warning: package 'recipes' was built under R version 4.0.5
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
library(readr)
#> Warning: package 'readr' was built under R version 4.0.5
setwd("C:/Users/lindh011/OneDrive - Wageningen University & Research/Internship ICRAF (ERA)/ERA_Agroforestry_WURwork")
era.af_train <- read_csv("era.af_train.csv")
#>
#> -- Column specification --------------------------------------------------------
#> cols(
#> .default = col_double(),
#> AEZ16simple = col_character(),
#> PrName.Code = col_character(),
#> SubPrName.Code = col_character(),
#> Product = col_character(),
#> Country = col_character(),
#> Site.Type = col_character(),
#> Tree = col_character()
#> )
#> i Use `spec()` for the full column specifications.
era.af_train_Tib <- as_tibble(era.af_train)
glmnet_recipe <-
recipe(formula = logRR ~ ., data = era.af_train) %>%
step_novel(all_nominal(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes(), naming = dummy_names) %>%
step_zv(all_predictors()) %>%
step_normalize(all_predictors(), -all_nominal()) %>%
update_role(ID,
AEZ16simple,
PrName.Code,
SubPrName.Code,
Product,
Country,
Latitude,
Longitude,
new_role = "sample ID") %>%
step_impute_mode(all_nominal(), -all_outcomes()) %>%
step_impute_knn (all_numeric_predictors()) %>%
step_impute_knn(logRR) %>%
step_corr(all_numeric_predictors()) %>%
step_nzv(all_numeric_predictors()) %>%
prep()
#> Error: Column names `SubPrName.Code_AF.N.Al.N.TP`, `SubPrName.Code_AF.N.Al.N.TP.TPM`, `SubPrName.Code_Al.N.In.N`, `SubPrName.Code_Al.N.In.N`, `SubPrName.Code_Al.N`, and 33 more must not be duplicated.
#> Use .name_repair to specify repair.
Created on 2021-07-02 by the reprex package (v2.0.0)
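No answer is included in this excerpt, but one untested guess at the cause (an assumption, not from the thread): the default naming function for step_dummy() sanitizes factor levels with make.names(), so levels that differ only in characters such as +, - or / can collapse to the same column name:

```r
library(recipes)

# Hypothetical illustration (not from the thread): these two levels differ
# only in +/- characters, which make.names() both replaces with ".", so the
# generated dummy-variable names collide and prep() rejects the duplicates
dummy_names("SubPrName.Code", c("Al+N/In+N", "Al-N/In-N"))
```

If that is the cause, cleaning the factor levels (or supplying a naming function that applies make.unique()) before step_dummy() may avoid the collision.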

Rvest returning table containing NAs

I am trying to scrape data from a table using the Rvest package, but the table is coming back filled with NAs and missing all but the first row.
How can I solve this?
Source <- "https://www.viewbase.com/bitfinex_long_short_position"
Longs <- read_html(Source) %>%
  html_node(xpath = '/html/body/div[1]/div[3]/div[2]/div[2]/div[2]/div/div[2]/div/div[2]/table') %>%
  html_table(fill = TRUE) %>%
  as.data.frame()
Longs
Long Position 24H Change Short Position 24H Change % Long vs. Short NA NA NA NA
1 NA NA NA NA NA NA NA NA NA NA
You might have to use RSelenium to get the exact same output that you see on the webpage. However, you can also get most of the information from the JSON file available on the webpage:
jsonlite::fromJSON('https://api.viewbase.com/margin/bfx_long_short_now') %>%
dplyr::bind_rows()
# BTCUSD BTCUST ETHUSD ETHUST ETHBTC USTUSD XRPUSD XRPBTC BABUSD
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 58323 5.82e4 4.08e3 4078. 7.01e-2 1.00e0 1.57e0 2.69e-5 252
#2 58100 5.81e4 3.88e3 3880. 6.68e-2 1.00e0 1.52e0 2.62e-5 252
#3 31557. 2.30e3 1.93e5 21972. 3.62e+5 3.72e5 5.07e7 2.54e+7 8843.
#4 31547. 2.30e3 1.87e5 21944. 4.05e+5 1.41e6 5.11e7 2.53e+7 8843.
#5 559. 1.97e2 2.43e5 442. 1.02e+4 9.09e6 9.71e6 1.71e+7 15050.
#6 601. 3.81e1 2.46e5 598. 9.35e+3 1.34e5 9.97e6 1.73e+7 15050.
# … with 27 more variables: BABBTC <dbl>, LTCUSD <dbl>, LTCBTC <dbl>,
# EOSUSD <dbl>, EOSBTC <dbl>, ETCUSD <dbl>, ETCBTC <dbl>,
# BSVUSD <dbl>, BSVBTC <dbl>, XTZUSD <dbl>, XTZBTC <dbl>,
# ZECUSD <dbl>, ZECBTC <dbl>, LEOUSD <dbl>, LEOUST <dbl>,
# DSHUSD <dbl>, DSHBTC <dbl>, IOTUSD <dbl>, IOTBTC <dbl>,
# NEOUSD <dbl>, NEOBTC <dbl>, OMGUSD <dbl>, OMGBTC <dbl>,
# XLMUSD <dbl>, XLMBTC <dbl>, XMRUSD <dbl>, XMRBTC <dbl>

tidymodels bake: Error: Please pass a data set to `new_data`

I'm using the recipe() function from the tidymodels packages for imputing missing values and fixing imbalanced data.
Here is my data:
mer_df <- mer2 %>%
filter(!is.na(laststagestatus2)) %>%
select(Id, Age_Range__c, Gender__c, numberoflead, leadduration, firsttouch, lasttouch, laststagestatus2)%>%
mutate_if(is.character, factor) %>%
mutate_if(is.logical, as.integer)
# A tibble: 197,836 x 8
Id Age_Range__c Gender__c numberoflead leadduration firsttouch lasttouch
<fct> <fct> <fct> <int> <dbl> <fct> <fct>
1 0010~ NA NA 2 5.99 Dealer IB~ Walk in
2 0010~ NA NA 1 0 Online Se~ Online S~
3 0010~ NA NA 1 0 Walk in Walk in
4 0010~ NA NA 1 0 Online Se~ Online S~
5 0010~ NA NA 2 0.0128 Dealer IB~ Dealer I~
6 0010~ NA NA 1 0 OB Call OB Call
7 0010~ NA NA 1 0 Dealer IB~ Dealer I~
8 0010~ NA NA 4 73.9 Dealer IB~ Walk in
9 0010~ NA Male 24 0.000208 OB Call OB Call
10 0010~ NA NA 18 0.000150 OB Call OB Call
# ... with 197,826 more rows, and 1 more variable: laststagestatus2 <fct>
Here is my code:
mer_rec <- recipe(laststagestatus2 ~ ., data = mer_train)%>%
step_medianimpute(numberoflead,leadduration)%>%
step_knnimpute(Gender__c,Age_Range__c,firsttouch,lasttouch) %>%
step_other(Id,firsttouch) %>%
step_other(Id,lasttouch) %>%
step_dummy(all_nominal(), -laststagestatus2) %>%
step_smote(laststagestatus2)
mer_rec
mer_rec %>% prep()
It works just fine until here:
Data Recipe
Inputs:
role #variables
outcome 1
predictor 7
Training data contained 148377 data points and 147597 incomplete rows.
Operations:
Median Imputation for 2 items [trained]
K-nearest neighbor imputation for Id, ... [trained]
Collapsing factor levels for Id, firsttouch [trained]
Collapsing factor levels for Id, lasttouch [trained]
Dummy variables from Id, ... [trained]
SMOTE based on laststagestatus2 [trained]
but when I run the bake() function, it gives an error that says:
mer_rec %>% prep() %>% bake(new_data=NULL) %>% count(laststagestatus2)
Error: Please pass a data set to `new_data`.
Could anyone help me with what I'm missing here?
There is a fix in the development version of recipes to get this up and working. You can install via:
devtools::install_github("tidymodels/recipes")
Then you can bake() with new_data = NULL to get out the transformed training data.
library(tidymodels)
data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))
set.seed(123)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
ames_rec <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal()) %>%
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
step_ns(Latitude, Longitude, deg_free = 20)
ames_rec %>% prep() %>% bake(new_data = NULL)
#> # A tibble: 2,199 x 71
#> Gr_Liv_Area Year_Built Sale_Price Neighborhood_Co… Neighborhood_Ol…
#> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 3.22 1960 5.33 0 0
#> 2 2.95 1961 5.02 0 0
#> 3 3.12 1958 5.24 0 0
#> 4 3.21 1997 5.28 0 0
#> 5 3.21 1998 5.29 0 0
#> 6 3.13 2001 5.33 0 0
#> 7 3.11 1992 5.28 0 0
#> 8 3.21 1995 5.37 0 0
#> 9 3.22 1993 5.25 0 0
#> 10 3.17 1998 5.26 0 0
#> # … with 2,189 more rows, and 66 more variables: Neighborhood_Edwards <dbl>,
#> # Neighborhood_Somerset <dbl>, Neighborhood_Northridge_Heights <dbl>,
#> # Neighborhood_Gilbert <dbl>, Neighborhood_Sawyer <dbl>,
#> # Neighborhood_Northwest_Ames <dbl>, Neighborhood_Sawyer_West <dbl>,
#> # Neighborhood_Mitchell <dbl>, Neighborhood_Brookside <dbl>,
#> # Neighborhood_Crawford <dbl>, Neighborhood_Iowa_DOT_and_Rail_Road <dbl>,
#> # Neighborhood_Timberland <dbl>, Neighborhood_Northridge <dbl>,
#> # Neighborhood_Stone_Brook <dbl>,
#> # Neighborhood_South_and_West_of_Iowa_State_University <dbl>,
#> # Neighborhood_Clear_Creek <dbl>, Neighborhood_Meadow_Village <dbl>,
#> # Neighborhood_other <dbl>, Bldg_Type_TwoFmCon <dbl>, Bldg_Type_Duplex <dbl>,
#> # Bldg_Type_Twnhs <dbl>, Bldg_Type_TwnhsE <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_TwoFmCon <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_Duplex <dbl>, Gr_Liv_Area_x_Bldg_Type_Twnhs <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_TwnhsE <dbl>, Latitude_ns_01 <dbl>,
#> # Latitude_ns_02 <dbl>, Latitude_ns_03 <dbl>, Latitude_ns_04 <dbl>,
#> # Latitude_ns_05 <dbl>, Latitude_ns_06 <dbl>, Latitude_ns_07 <dbl>,
#> # Latitude_ns_08 <dbl>, Latitude_ns_09 <dbl>, Latitude_ns_10 <dbl>,
#> # Latitude_ns_11 <dbl>, Latitude_ns_12 <dbl>, Latitude_ns_13 <dbl>,
#> # Latitude_ns_14 <dbl>, Latitude_ns_15 <dbl>, Latitude_ns_16 <dbl>,
#> # Latitude_ns_17 <dbl>, Latitude_ns_18 <dbl>, Latitude_ns_19 <dbl>,
#> # Latitude_ns_20 <dbl>, Longitude_ns_01 <dbl>, Longitude_ns_02 <dbl>,
#> # Longitude_ns_03 <dbl>, Longitude_ns_04 <dbl>, Longitude_ns_05 <dbl>,
#> # Longitude_ns_06 <dbl>, Longitude_ns_07 <dbl>, Longitude_ns_08 <dbl>,
#> # Longitude_ns_09 <dbl>, Longitude_ns_10 <dbl>, Longitude_ns_11 <dbl>,
#> # Longitude_ns_12 <dbl>, Longitude_ns_13 <dbl>, Longitude_ns_14 <dbl>,
#> # Longitude_ns_15 <dbl>, Longitude_ns_16 <dbl>, Longitude_ns_17 <dbl>,
#> # Longitude_ns_18 <dbl>, Longitude_ns_19 <dbl>, Longitude_ns_20 <dbl>
Created on 2020-10-12 by the reprex package (v0.3.0.9001)
If you are unable to install packages from GitHub, you could use juice() to do the same thing.
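Continuing the Ames example above, the juice() equivalent looks like:

```r
# juice() extracts the processed training set from a prepped recipe,
# the same result bake(new_data = NULL) returns in newer versions
ames_rec %>% prep() %>% juice()
```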

How to create several data sets, one for each value iterated in a for loop?

So I am mining data on occurrences of fish species in Brazil belonging to the "Actinopterygii" group using the "rgbif" package, but since the number of occurrences for that group is so high, I can't retrieve them all at once.
With these two lines of code we can see there are 323200 occurrences:
#install.packages("rgbif")
library(rgbif)
actinopterygii<-name_backbone(name="Actinopterygii")
occ_count(taxonKey = actinopterygii$classKey,country="BR")
The thing is that the function that retrieves the occurrences has a maximum limit of 2000 occurrences per retrieval:
actinopterygii_oc<-occ_search(taxonKey = actinopterygii$classKey,country="BR",limit=2000,start=0)
#the start argument refers to the index of the record we are starting at so we can page through all the results
I'm basically trying to avoid having to repeat this line 60 times, changing the start value by 2000 every time, so I tried a for loop, but it's not working. I created an interval over the number of occurrences, to perform the retrieval 2000 at a time:
interval<-seq(from = 0, to = 323200, by = 2000)
for (value in interval){
actinopterygii_oc<-occ_search(taxonKey = actinopterygii$classKey,country="BR",limit=2000,start=value)
}
The problem is that this code overwrites a single data set on every iteration. So, is there any way I can create several data sets, one for each value in the interval, while looping through the values in the interval?
I'm sorry for how confusing this might be, but I'm not able to express it any better. Thank you in advance for any answers.
You can aggregate the sets of data in a list like so:
interval<-seq(from = 0, to = 323200, by = 2000)
actinopterygii_oc <- list()
for (i in seq_along(interval)){
value <- interval[i]
actinopterygii_oc[[i]] <- occ_search(taxonKey = actinopterygii$classKey,country="BR",limit=2000,start=value)
}
and then combine them using for example dplyr::bind_rows(actinopterygii_oc).
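One caveat (an assumption about occ_search()'s default return value): each list element is a gbif result object whose occurrence table lives in its $data element, so you may need to pull that out before binding:

```r
# extract the $data tibble from each gbif result object, then stack them;
# assumes occ_search() was called with its default output format
all_records <- dplyr::bind_rows(lapply(actinopterygii_oc, `[[`, "data"))
```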
Rather than a for loop, try purrr::map to get back a list of tibbles of 2,000 rows at a time. I probably don't have to tell you this will take a long time.
interval <- seq(from = 1, to = 323200, by = 2000)
list_of_tibbles <-
purrr::map(interval,
~ occ_search(taxonKey = actinopterygii$classKey,
country="BR",
limit=2000,
start= .x)
)
I wasn't about to grab all your data, but you get back output like:
[[1]]
Records found [323200]
Records returned [2000]
No. unique hierarchies [661]
No. media records [2000]
No. facets [0]
Args [limit=2000, offset=1, taxonKey=204, country=BR, fields=all]
# A tibble: 2,000 x 145
key scientificName decimalLatitude decimalLongitude issues datasetKey publishingOrgKey
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 2550… Chaetodipteru… -7.91 -34.8 cdrou… 50c9509d-… 28eb1a3f-1c15-4…
2 2550… Myrichthys oc… -7.90 -34.8 cdrou… 50c9509d-… 28eb1a3f-1c15-4…
3 2550… Mugil curema … -7.91 -34.8 cdrou… 50c9509d-… 28eb1a3f-1c15-4…
4 2550… Centropomus u… -7.91 -34.8 cdrou… 50c9509d-… 28eb1a3f-1c15-4…
5 2550… Trachinotus c… -7.91 -34.8 cdrou… 50c9509d-… 28eb1a3f-1c15-4…
6 2550… Phractocephal… -3.18 -59.9 cdrou… 50c9509d-… 28eb1a3f-1c15-4…
7 2550… Diapterus aur… -7.91 -34.8 cdrou… 50c9509d-… 28eb1a3f-1c15-4…
8 2550… Chaetodipteru… -7.91 -34.8 cdrou… 50c9509d-… 28eb1a3f-1c15-4…
9 2550… Centropomus u… -7.91 -34.8 cdrou… 50c9509d-… 28eb1a3f-1c15-4…
10 2550… Calophysus ma… -3.18 -59.9 cdrou… 50c9509d-… 28eb1a3f-1c15-4…
# … with 1,990 more rows, and 138 more variables: installationKey <chr>,
# publishingCountry <chr>, protocol <chr>, lastCrawled <chr>, lastParsed <chr>,
# crawlId <int>, extensions <chr>, basisOfRecord <chr>, occurrenceStatus <chr>,
# taxonKey <int>, kingdomKey <int>, phylumKey <int>, classKey <int>, orderKey <int>,
# familyKey <int>, genusKey <int>, speciesKey <int>, acceptedTaxonKey <int>,
# acceptedScientificName <chr>, kingdom <chr>, phylum <chr>, order <chr>, family <chr>,
# genus <chr>, species <chr>, genericName <chr>, specificEpithet <chr>, taxonRank <chr>,
# taxonomicStatus <chr>, dateIdentified <chr>, coordinateUncertaintyInMeters <dbl>,
# stateProvince <chr>, year <int>, month <int>, day <int>, eventDate <chr>,
# modified <chr>, lastInterpreted <chr>, references <chr>, license <chr>,
# identifiers <chr>, facts <chr>, relations <chr>, gadm.level0.gid <chr>,
# gadm.level0.name <chr>, gadm.level1.gid <chr>, gadm.level1.name <chr>,
# gadm.level2.gid <chr>, gadm.level2.name <chr>, gadm.level3.gid <chr>,
# gadm.level3.name <chr>, geodeticDatum <chr>, class <chr>, countryCode <chr>,
# recordedByIDs <chr>, identifiedByIDs <chr>, country <chr>, rightsHolder <chr>,
# identifier <chr>, http...unknown.org.nick <chr>, verbatimEventDate <chr>,
# datasetName <chr>, collectionCode <chr>, gbifID <chr>, verbatimLocality <chr>,
# occurrenceID <chr>, taxonID <chr>, catalogNumber <chr>, recordedBy <chr>,
# http...unknown.org.occurrenceDetails <chr>, institutionCode <chr>, rights <chr>,
# eventTime <chr>, identifiedBy <chr>, identificationID <chr>, name <chr>,
# occurrenceRemarks <chr>, gadm <chr>, informationWithheld <chr>,
# recordedByIDs.type <chr>, recordedByIDs.value <chr>, individualCount <int>,
# establishmentMeans <chr>, continent <chr>, organismQuantityType <chr>, habitat <chr>,
# http...rs.tdwg.org.dwc.terms.organismQuantity <chr>,
# georeferenceVerificationStatus <chr>, verbatimSRS <chr>, verbatimCoordinateSystem <chr>,
# county <chr>, locality <chr>, taxonRemarks <chr>, preparations <chr>, disposition <chr>,
# vernacularName <chr>, organismName <chr>, fieldNotes <chr>, originalNameUsage <chr>,
# http...rs.tdwg.org.dwc.terms.organismQuantityType <chr>, …
You'll notice that what you get back has not only the data but other metadata as well. To glue all the data back together into one big data frame, use another map:
glued_data <-
purrr::map(list_of_tibbles, "data") %>%
bind_rows()
dim(glued_data)
[1] 10000 162

Apply a function to every case of a tibble

This is my second participation here on Stack Overflow.
I have a function called bw_test with several args, like this:
bw_test <- function(localip, remoteip, localspeed, remotespeed , duracion =30,direction ="both"){
comando <- str_c("ssh usuario@", localip ," /tool bandwidth-test direction=", direction," remote-tx-speed=",remotespeed,"M local-tx-speed=",localspeed,"M protocol=udp user=usuario password=mipasso duration=",duracion," ",remoteip)
resultado <- system(comando,intern = T,ignore.stderr = T)
# resultado pulls from an ssh server a vector like this:
# head(resultado)
# [1] " status: connecting\r"         " tx-current: 0bps\r"         " tx-10-second-average: 0bps\r"
# [4] " tx-total-average: 0bps\r"     " rx-current: 0bps\r"         " rx-10-second-average: 0bps\r"
resultado %<>%
str_replace_all("\r", "") %>%
tail(17) %>%
trimws("both") %>%
as_tibble %>%
mutate(local=localip, remote=remoteip) %>%
separate(value,sep=":", into=c("parametro","valor")) %>%
head(15)
resultado$valor %<>%
trimws() %>%
str_replace("Mbps","") %>% str_replace("%","") %>% str_replace("s","")
resultado %<>%
spread(parametro,valor)
resultado %<>%
mutate(`tx-percentaje`=as.numeric(resultado$`tx-total-average`)/localspeed) %>%
mutate(`rx-percentaje`=as.numeric(resultado$`rx-total-average`)/remotespeed)
return(resultado)
}
This function returns a tibble like this one:
A tibble: 1 x 19
local remote `connection-cou… direction duration `local-cpu-load` `lost-packets` `random-data` `remote-cpu-loa…
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 192.… 192.1… 1 both 4 13 0 no 12
# … with 10 more variables: `rx-10-second-average` <chr>, `rx-current` <chr>, `rx-size` <chr>,
# `rx-total-average` <chr>, `tx-10-second-average` <chr>, `tx-current` <chr>, `tx-size` <chr>,
# `tx-total-average` <chr>, `tx-percentaje` <dbl>, `rx-percentaje` <dbl>
So, when I call the function inside rbind, I get the result of every run in one tibble:
rbind(bw_test("192.168.105.10" ,"192.168.105.18", 75,125),
bw_test("192.168.133.11","192.168.133.9", 5 ,50),
bw_test("192.168.254.251","192.168.254.250", 25,150))
My results for the example are:
# A tibble: 3 x 19
local remote `connection-cou… direction duration `local-cpu-load` `lost-packets` `random-data` `remote-cpu-loa…
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 192.… 192.1… 20 both 28 63 232 no 48
2 192.… 192.1… 20 both 29 4 0 no 20
3 192.… 192.1… 20 both 29 15 0 no 22
# … with 10 more variables: `rx-10-second-average` <chr>, `rx-current` <chr>, `rx-size` <chr>,
# `rx-total-average` <chr>, `tx-10-second-average` <chr>, `tx-current` <chr>, `tx-size` <chr>,
# `tx-total-average` <chr>, `tx-percentaje` <dbl>, `rx-percentaje` <dbl>
My problem is applying the function to every case of a tibble like this:
aps <- tribble(
~name, ~ip, ~remoteip , ~bw_test, ~localspeed,~remotespeed,
"backbone_border_core","192.168.253.1", "192.168.253.3", 1,200,200,
"backbone_2_site2","192.168.254.251", "192.168.254.250", 1, 25,150
)
I was trying to use map, but I got:
map(c(aps$ip,aps$remoteip,aps$localspeed,aps$remotespeed), bw_test)
Error: argument "remotespeed" is missing, with no default
I believe that's because c(aps$ip, aps$remoteip, aps$localspeed, aps$remotespeed) feeds in all cases of aps$ip first, then all of aps$remoteip, and so on.
Am I using the right strategy? Is map suitable here?
What am I doing wrong?
How can I apply the function to every row to get the requested tibble?
I'd appreciate your kind help.
Greetings.
Try using pmap_df.
output <- purrr::pmap_df(list(aps$ip, aps$remoteip, aps$localspeed,
aps$remotespeed), bw_test)
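Since a data frame is itself a list of columns, an equivalent sketch (assuming bw_test()'s arguments are named localip, remoteip, localspeed and remotespeed, as in the question) renames the columns and lets pmap_df() match them to the arguments by name, one row per call:

```r
library(dplyr)
library(purrr)

# rename the tibble's columns to bw_test()'s argument names, then call
# bw_test once per row; pmap_df() row-binds the returned tibbles
output <- aps %>%
  select(localip = ip, remoteip, localspeed, remotespeed) %>%
  pmap_df(bw_test)
```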
