What are available r tools to obtain list of all releases for a specific R CRAN package.
There is expected to retrieve at least Dates each package version was released.
Other metadata for each package are in value too.
self-promotion of my new CRAN package https://CRAN.R-project.org/package=pacs
TL;DR
pacs::pac_timemachine
pkgsearch::cran_package_history
pkgdown:::pkg_timeline (non-exported and only Date of publish)
pacs::pac_timemachine in pacs package.
pacs::pac_timemachine is using CRAN website or crandb.
head(pacs::pac_timemachine("tidyr"), 3)
#> Package Version Released Archived LifeDuration
#> 2 tidyr 0.1 2014-07-21 2015-09-08 414 days
#> 3 tidyr 0.2.0 2015-09-08 2015-09-08 0 days
#> 4 tidyr 0.3.0 2015-09-08 2015-09-10 2 days
#> URL Size
#> 2 Archive/tidyr/tidyr_0.1.tar.gz 134K
#> 3 Archive/tidyr/tidyr_0.2.0.tar.gz 139K
#> 4 Archive/tidyr/tidyr_0.3.0.tar.gz 147K
tail(pacs::pac_timemachine("tidyr"), 3)
#> Package Version Released Archived LifeDuration
#> 25 tidyr 1.1.1 2020-07-31 2020-08-27 27 days
#> 26 tidyr 1.1.2 2020-08-27 2021-03-03 188 days
#> 1 tidyr 1.1.3 2021-03-03 <NA> 192 days
#> URL Size
#> 25 Archive/tidyr/tidyr_1.1.1.tar.gz 859K
#> 26 Archive/tidyr/tidyr_1.1.2.tar.gz 861K
#> 1 tidyr_1.1.3.tar.gz <NA>
We could get the result for certain Date or Date interval or version too.
pacs::pac_timemachine("tidyr", at = as.Date("2018-01-01"))
pacs::pac_timemachine("tidyr", version = "1.0.0")
pacs::pac_timemachine("tidyr", from = as.Date("2020-06-01"), to = as.Date("2020-08-01"))
Created on 2021-09-11 by the reprex package (v2.0.1)
the pkgsearch package.
This one is builded under private DB which is systematically appended with new DESCRIPTION files for each CRAN package.
head(pkgsearch::cran_package_history("tidyr"), 3)
#> # A tibble: 3 × 25
#> Package Title Version `Authors#R` Description License LazyData URL
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 tidyr Easily … 0.1 "'Hadley Wick… tidyr is an e… MIT + … true https…
#> 2 tidyr Easily … 0.2.0 "as.person(c(… An evolution … MIT + … true https…
#> 3 tidyr Easily … 0.3.0 "c(<U+000a>pe… An evolution … MIT + … true https…
#> # … with 17 more variables: VignetteBuilder <chr>, Packaged <chr>,
#> # Author <chr>, Maintainer <chr>, NeedsCompilation <chr>, Repository <chr>,
#> # Date/Publication <chr>, crandb_file_date <chr>, date <chr>,
#> # dependencies <list>, BugReports <chr>, RoxygenNote <chr>, Remotes <chr>,
#> # MD5sum <chr>, Encoding <chr>, SystemRequirements <chr>,
#> # Config/testthat/edition <chr>
tail(pkgsearch::cran_package_history("tidyr"), 3)
#> # A tibble: 3 × 25
#> Package Title Version `Authors#R` Description License LazyData URL
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 tidyr Tidy Messy Data 1.1.1 "\nc(perso… "Tools to … MIT + … true http…
#> 2 tidyr Tidy Messy Data 1.1.2 "\nc(perso… "Tools to … MIT + … true http…
#> 3 tidyr Tidy Messy Data 1.1.3 "\nc(perso… "Tools to … MIT + … true http…
#> # … with 17 more variables: VignetteBuilder <chr>, Packaged <chr>,
#> # Author <chr>, Maintainer <chr>, NeedsCompilation <chr>, Repository <chr>,
#> # Date/Publication <chr>, crandb_file_date <chr>, date <chr>,
#> # dependencies <list>, BugReports <chr>, RoxygenNote <chr>, Remotes <chr>,
#> # MD5sum <chr>, Encoding <chr>, SystemRequirements <chr>,
#> # Config/testthat/edition <chr>
Created on 2021-09-11 by the reprex package (v2.0.1)
pkgdown package
pkgdown:::pkg_timeline function in pkgdown package. It is a non-exported function so sb have to take that into account. It returns only Date when each package version was published.
I'm using recipe()function in tidymodels packages for imputation missing values and fixing imbalanced data.
here is my data;
mer_df <- mer2 %>%
filter(!is.na(laststagestatus2)) %>%
select(Id, Age_Range__c, Gender__c, numberoflead, leadduration, firsttouch, lasttouch, laststagestatus2)%>%
mutate_if(is.character, factor) %>%
mutate_if(is.logical, as.integer)
# A tibble: 197,836 x 8
Id Age_Range__c Gender__c numberoflead leadduration firsttouch lasttouch
<fct> <fct> <fct> <int> <dbl> <fct> <fct>
1 0010~ NA NA 2 5.99 Dealer IB~ Walk in
2 0010~ NA NA 1 0 Online Se~ Online S~
3 0010~ NA NA 1 0 Walk in Walk in
4 0010~ NA NA 1 0 Online Se~ Online S~
5 0010~ NA NA 2 0.0128 Dealer IB~ Dealer I~
6 0010~ NA NA 1 0 OB Call OB Call
7 0010~ NA NA 1 0 Dealer IB~ Dealer I~
8 0010~ NA NA 4 73.9 Dealer IB~ Walk in
9 0010~ NA Male 24 0.000208 OB Call OB Call
10 0010~ NA NA 18 0.000150 OB Call OB Call
# ... with 197,826 more rows, and 1 more variable: laststagestatus2 <fct>
here is my codes;
mer_rec <- recipe(laststagestatus2 ~ ., data = mer_train)%>%
step_medianimpute(numberoflead,leadduration)%>%
step_knnimpute(Gender__c,Age_Range__c,fisrsttouch,lasttouch) %>%
step_other(Id,firsttouch) %>%
step_other(Id,lasttouch) %>%
step_dummy(all_nominal(), -laststagestatus2) %>%
step_smote(laststagestatus2)
mer_rec
mer_rec %>% prep()
it just works fine until here ;
Data Recipe
Inputs:
role #variables
outcome 1
predictor 7
Training data contained 148377 data points and 147597 incomplete rows.
Operations:
Median Imputation for 2 items [trained]
K-nearest neighbor imputation for Id, ... [trained]
Collapsing factor levels for Id, firsttouch [trained]
Collapsing factor levels for Id, lasttouch [trained]
Dummy variables from Id, ... [trained]
SMOTE based on laststagestatus2 [trained]
but when ı run bake() function that gives error says;
mer_rec %>% prep() %>% bake(new_data=NULL) %>% count(laststagestatus2)
Error: Please pass a data set to `new_data`.
Could anyone help me about what I m missing here?
There is a fix in the development version of recipes to get this up and working. You can install via:
devtools::install_github("tidymodels/recipes")
Then you can bake() with new_data = NULL to get out the transformed training data.
library(tidymodels)
data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))
set.seed(123)
ames_split <- initial_split(ames, prob = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
ames_rec <-
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type +
Latitude + Longitude, data = ames_train) %>%
step_log(Gr_Liv_Area, base = 10) %>%
step_other(Neighborhood, threshold = 0.01) %>%
step_dummy(all_nominal()) %>%
step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
step_ns(Latitude, Longitude, deg_free = 20)
ames_rec %>% prep() %>% bake(new_data = NULL)
#> # A tibble: 2,199 x 71
#> Gr_Liv_Area Year_Built Sale_Price Neighborhood_Co… Neighborhood_Ol…
#> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 3.22 1960 5.33 0 0
#> 2 2.95 1961 5.02 0 0
#> 3 3.12 1958 5.24 0 0
#> 4 3.21 1997 5.28 0 0
#> 5 3.21 1998 5.29 0 0
#> 6 3.13 2001 5.33 0 0
#> 7 3.11 1992 5.28 0 0
#> 8 3.21 1995 5.37 0 0
#> 9 3.22 1993 5.25 0 0
#> 10 3.17 1998 5.26 0 0
#> # … with 2,189 more rows, and 66 more variables: Neighborhood_Edwards <dbl>,
#> # Neighborhood_Somerset <dbl>, Neighborhood_Northridge_Heights <dbl>,
#> # Neighborhood_Gilbert <dbl>, Neighborhood_Sawyer <dbl>,
#> # Neighborhood_Northwest_Ames <dbl>, Neighborhood_Sawyer_West <dbl>,
#> # Neighborhood_Mitchell <dbl>, Neighborhood_Brookside <dbl>,
#> # Neighborhood_Crawford <dbl>, Neighborhood_Iowa_DOT_and_Rail_Road <dbl>,
#> # Neighborhood_Timberland <dbl>, Neighborhood_Northridge <dbl>,
#> # Neighborhood_Stone_Brook <dbl>,
#> # Neighborhood_South_and_West_of_Iowa_State_University <dbl>,
#> # Neighborhood_Clear_Creek <dbl>, Neighborhood_Meadow_Village <dbl>,
#> # Neighborhood_other <dbl>, Bldg_Type_TwoFmCon <dbl>, Bldg_Type_Duplex <dbl>,
#> # Bldg_Type_Twnhs <dbl>, Bldg_Type_TwnhsE <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_TwoFmCon <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_Duplex <dbl>, Gr_Liv_Area_x_Bldg_Type_Twnhs <dbl>,
#> # Gr_Liv_Area_x_Bldg_Type_TwnhsE <dbl>, Latitude_ns_01 <dbl>,
#> # Latitude_ns_02 <dbl>, Latitude_ns_03 <dbl>, Latitude_ns_04 <dbl>,
#> # Latitude_ns_05 <dbl>, Latitude_ns_06 <dbl>, Latitude_ns_07 <dbl>,
#> # Latitude_ns_08 <dbl>, Latitude_ns_09 <dbl>, Latitude_ns_10 <dbl>,
#> # Latitude_ns_11 <dbl>, Latitude_ns_12 <dbl>, Latitude_ns_13 <dbl>,
#> # Latitude_ns_14 <dbl>, Latitude_ns_15 <dbl>, Latitude_ns_16 <dbl>,
#> # Latitude_ns_17 <dbl>, Latitude_ns_18 <dbl>, Latitude_ns_19 <dbl>,
#> # Latitude_ns_20 <dbl>, Longitude_ns_01 <dbl>, Longitude_ns_02 <dbl>,
#> # Longitude_ns_03 <dbl>, Longitude_ns_04 <dbl>, Longitude_ns_05 <dbl>,
#> # Longitude_ns_06 <dbl>, Longitude_ns_07 <dbl>, Longitude_ns_08 <dbl>,
#> # Longitude_ns_09 <dbl>, Longitude_ns_10 <dbl>, Longitude_ns_11 <dbl>,
#> # Longitude_ns_12 <dbl>, Longitude_ns_13 <dbl>, Longitude_ns_14 <dbl>,
#> # Longitude_ns_15 <dbl>, Longitude_ns_16 <dbl>, Longitude_ns_17 <dbl>,
#> # Longitude_ns_18 <dbl>, Longitude_ns_19 <dbl>, Longitude_ns_20 <dbl>
Created on 2020-10-12 by the reprex package (v0.3.0.9001)
If you are unable to install packages from GitHub, you could use juice() to do the same thing.
I have a csv where some columns have a quoted column with another quotation inside it:
"blah blah "nested quote"" and it generates parsing failures. I'm not sure if this is a bug or there is an argument to deal with this?
Reprex (file is here or content pasted below):
readr::read_csv("~/temp/shittyquotes.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> INSTNM = col_character(),
#> ADDR = col_character(),
#> CITY = col_character(),
#> STABBR = col_character(),
#> ZIP = col_character(),
#> CHFNM = col_character(),
#> CHFTITLE = col_character(),
#> EIN = col_character(),
#> OPEID = col_character(),
#> WEBADDR = col_character(),
#> ADMINURL = col_character(),
#> FAIDURL = col_character(),
#> APPLURL = col_character(),
#> ACT = col_character(),
#> IALIAS = col_character(),
#> INSTCAT = col_character(),
#> CCBASIC = col_character(),
#> CCIPUG = col_character(),
#> CCSIZSET = col_character(),
#> CARNEGIE = col_character()
#> # ... with 2 more columns
#> )
#> See spec(...) for full column specifications.
#> Warning: 3 parsing failures.
#> row col expected actual file
#> 2 IALIAS delimiter or quote C '~/temp/shittyquotes.csv'
#> 2 IALIAS delimiter or quote D '~/temp/shittyquotes.csv'
#> 2 NA 59 columns 100 columns '~/temp/shittyquotes.csv'
#> # A tibble: 2 x 59
#> UNITID INSTNM ADDR CITY STABBR ZIP FIPS OBEREG CHFNM CHFTITLE
#> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
#> 1 441238 City … 1500… Duar… CA 9101… 6 8 Dr. … Director
#> 2 441247 Commu… 3800… Mode… CA 9535… 6 8 Vict… Preside…
#> # ... with 49 more variables: GENTELE <dbl>, EIN <chr>, OPEID <chr>,
#> # OPEFLAG <dbl>, WEBADDR <chr>, ADMINURL <chr>, FAIDURL <chr>,
#> # APPLURL <chr>, SECTOR <dbl>, ICLEVEL <dbl>, CONTROL <dbl>,
#> # HLOFFER <dbl>, UGOFFER <dbl>, GROFFER <dbl>, FPOFFER <dbl>,
#> # HDEGOFFR <dbl>, DEGGRANT <dbl>, HBCU <dbl>, HOSPITAL <dbl>,
#> # MEDICAL <dbl>, TRIBAL <dbl>, LOCALE <dbl>, OPENPUBL <dbl>, ACT <chr>,
#> # NEWID <dbl>, DEATHYR <dbl>, CLOSEDAT <dbl>, CYACTIVE <dbl>,
#> # POSTSEC <dbl>, PSEFLAG <dbl>, PSET4FLG <dbl>, RPTMTH <dbl>,
#> # IALIAS <chr>, INSTCAT <chr>, CCBASIC <chr>, CCIPUG <chr>,
#> # CCIPGRAD <dbl>, CCUGPROF <dbl>, CCENRPRF <dbl>, CCSIZSET <chr>,
#> # CARNEGIE <chr>, TENURSYS <dbl>, LANDGRNT <dbl>, INSTSIZE <chr>,
#> # CBSA <dbl>, CBSATYPE <chr>, CSA <dbl>, NECTA <dbl>, DFRCGID <dbl>
Created on 2018-12-04 by the reprex package (v0.2.1)
Also here's the csv content:
UNITID,INSTNM,ADDR,CITY,STABBR,ZIP,FIPS,OBEREG,CHFNM,CHFTITLE,GENTELE,EIN,OPEID,OPEFLAG,WEBADDR,ADMINURL,FAIDURL,APPLURL,SECTOR,ICLEVEL,CONTROL,HLOFFER,UGOFFER,GROFFER,FPOFFER,HDEGOFFR,DEGGRANT,HBCU,HOSPITAL,MEDICAL,TRIBAL,LOCALE,OPENPUBL,ACT,NEWID,DEATHYR,CLOSEDAT,CYACTIVE,POSTSEC,PSEFLAG,PSET4FLG,RPTMTH,IALIAS,INSTCAT,CCBASIC,CCIPUG,CCIPGRAD,CCUGPROF,CCENRPRF,CCSIZSET,CARNEGIE,TENURSYS,LANDGRNT,INSTSIZE,CBSA,CBSATYPE,CSA,NECTA,DFRCGID
441238,"City of Hope Graduate School of Biological Science","1500 E Duarte Rd","Duarte","CA","91010-3000", 6, 8,"Dr. Arthur Riggs","Director","6263018293","953432210","03592400",1,"gradschool.coh.org"," "," "," ",2,1,2,9,2,1,2,10,1,2,-2,2,2,21,1,"A ",-2,-2,"-2",1,1,1,1,1," ",1,25,-2,-2,-2,7,-2,-3,1,2,1,31100,1,348,-2,198
441247,"Community Business College","3800 McHenry Ave Suite M","Modesto","CA","95356-1569", 6, 8,"Victor L. Vandenberghe","President","2095293648","484-8230","03615300",7,"www.communitybusinesscollege.edu","www.communitybusinesscollege.edu","www.cbc123.com","www.123.com",9,3,3,1,1,2,2,0,2,2,-2,2,2,12,1,"A ",-2,-2,"-2",1,1,1,1,2,"formerly "Community Business School"",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,33700,1,-2,-2,71
441256,"Design's School of Cosmetology","715 24th St Ste E","Paso Robles","CA","93446", 6, 8,"Sharon Skinner","Administrator","8052378575","80002030","03646300",1,"designsschool.com"," "," "," ",9,3,3,2,1,2,2,0,2,2,-2,2,2,13,1,"A ",-2,-2,"-2",1,1,1,1,2," ",6,-3,-3,-3,-3,-3,-3,-3,2,2,1,42020,1,-2,-2,46
Jim Hester provided this answer:
You need to use the escape_double = FALSE argument to read_delim(). This isn't part of read_csv() because excel style csvs escape inner quotations by doubling them.
data.table's fread() parses the file just fine... it throws a warning about the quotes, but you can ignore it..
library( data.table )
data.table::fread("./temp.csv" )
Warning message:
In data.table::fread("./temp.csv") :
Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
I am trying to add commas for thousands in my data e.g. 10,000 along with dollars e.g. $10,000.
I'm using several dplyr commands along with tidyr gather and spread functions. Here's what I tried:
Cut n paste this code block to generate the random data "dataset" I'm working with:
library(dplyr)
library(tidyr)
library(lubridate)
## Generate some data
channels <- c("Facebook", "Youtube", "SEM", "Organic", "Direct", "Email")
last_month <- Sys.Date() %m+% months(-1) %>% floor_date("month")
mts <- seq(from = last_month %m+% months(-23), to = last_month, by = "1 month") %>% as.Date()
dimvars <- expand.grid(Month = mts, Channel = channels, stringsAsFactors = FALSE)
# metrics
rws <- nrow(dimvars)
set.seed(42)
# generates variablility in the random data
randwalk <- function(initial_val, ...){
initial_val + cumsum(rnorm(...))
}
Sessions <- ceiling(randwalk(3000, n = rws, mean = 8, sd = 1500)) %>% abs()
Revenue <- ceiling(randwalk(10000, n = rws, mean = 0, sd = 3500)) %>% abs()
# make primary df
dataset <- cbind(dimvars, Revenue)
Which looks like:
> tbl_df(dataset)
# A tibble: 144 × 3
Month Channel Revenue
<date> <chr> <dbl>
1 2015-06-01 Facebook 8552
2 2015-07-01 Facebook 12449
3 2015-08-01 Facebook 10765
4 2015-09-01 Facebook 9249
5 2015-10-01 Facebook 11688
6 2015-11-01 Facebook 7991
7 2015-12-01 Facebook 7849
8 2016-01-01 Facebook 2418
9 2016-02-01 Facebook 6503
10 2016-03-01 Facebook 5545
# ... with 134 more rows
Now I want to spread the months into columns to show revenue trend by channel, month over month. I can do that like so:
revenueTable <- dataset %>% select(Month, Channel, Revenue) %>%
group_by(Month, Channel) %>%
summarise(Revenue = sum(Revenue)) %>%
#mutate(Revenue = paste0("$", format(Revenue, big.interval = ","))) %>%
gather(Key, Value, -Channel, -Month) %>%
spread(Month, Value) %>%
select(-Key)
And it looks almost exactly as I want:
> revenueTable
# A tibble: 6 × 25
Channel `2015-06-01` `2015-07-01` `2015-08-01` `2015-09-01` `2015-10-01` `2015-11-01` `2015-12-01` `2016-01-01` `2016-02-01` `2016-03-01` `2016-04-01`
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Direct 11910 8417 4012 359 4473 2702 6261 6167 8630 5230 1394
2 Email 7244 3517 671 1339 10788 10575 8567 8406 7856 6345 7733
3 Facebook 8552 12449 10765 9249 11688 7991 7849 2418 6503 5545 3908
4 Organic 4191 978 219 4274 2924 4155 5981 9719 8220 8829 7024
5 SEM 2344 6873 10230 6429 5016 2964 3390 3841 3163 1994 2105
6 Youtube 186 2949 2144 5073 1035 4878 7905 7377 2305 4556 6247
# ... with 13 more variables: `2016-05-01` <dbl>, `2016-06-01` <dbl>, `2016-07-01` <dbl>, `2016-08-01` <dbl>, `2016-09-01` <dbl>, `2016-10-01` <dbl>,
# `2016-11-01` <dbl>, `2016-12-01` <dbl>, `2017-01-01` <dbl>, `2017-02-01` <dbl>, `2017-03-01` <dbl>, `2017-04-01` <dbl>, `2017-05-01` <dbl>
Now the part I'm struggling with. I would like to format the data as currency. I tried adding this inbetween summarise() and gather() within the chain:
mutate(Revenue = paste0("$", format(Revenue, big.interval = ","))) %>%
This half works. The dollar sign is prepended but the comma separators do not show. I tried removing the paste0("$" part to see if I could get the comma formatting to work with no success.
How can I format my tbl as a currency with dollars and commas, rounded to nearest whole dollars (no $1.99, just $2)?
I think you can just do this at the end with dplyr::mutate_at().
revenueTable %>% mutate_at(vars(-Channel), funs(. %>% round(0) %>% scales::dollar()))
#> # A tibble: 6 x 25
#> Channel `2015-06-01` `2015-07-01` `2015-08-01` `2015-09-01`
#> <chr> <chr> <chr> <chr> <chr>
#> 1 Direct $11,910 $8,417 $4,012 $359
#> 2 Email $7,244 $3,517 $671 $1,339
#> 3 Facebook $8,552 $12,449 $10,765 $9,249
#> 4 Organic $4,191 $978 $219 $4,274
#> 5 SEM $2,344 $6,873 $10,230 $6,429
#> 6 Youtube $186 $2,949 $2,144 $5,073
#> # ... with 20 more variables: `2015-10-01` <chr>, `2015-11-01` <chr>,
#> # `2015-12-01` <chr>, `2016-01-01` <chr>, `2016-02-01` <chr>,
#> # `2016-03-01` <chr>, `2016-04-01` <chr>, `2016-05-01` <chr>,
#> # `2016-06-01` <chr>, `2016-07-01` <chr>, `2016-08-01` <chr>,
#> # `2016-09-01` <chr>, `2016-10-01` <chr>, `2016-11-01` <chr>,
#> # `2016-12-01` <chr>, `2017-01-01` <chr>, `2017-02-01` <chr>,
#> # `2017-03-01` <chr>, `2017-04-01` <chr>, `2017-05-01` <chr>
We can use data.table
library(data.table)
nm1 <- setdiff(names(revenueTable), 'Channel')
setDT(revenueTable)[, (nm1) := lapply(.SD, function(x)
scales::dollar(round(x))), .SDcols = nm1]
revenueTable[, 1:3, with = FALSE]
# Channel `2015-06-01` `2015-07-01`
#1: Direct $11,910 $8,417
#2: Email $7,244 $3,517
#3: Facebook $8,552 $12,449
#4: Organic $4,191 $978
#5: SEM $2,344 $6,873
#6: Youtube $186 $2,949