Changing Default dots argument in R - r

This question is related to this question and this question.
I need to assign a default to the ... argument in a function. I have successfully been able to to use the default package to accomplish this for specific arguments. For instance, lets say I want to allow toJSON from the jsonlite package to show more than four digits. The default is 4, but I want to show 10.
library(jsonlite)
library(default)
df <- data.frame(x = 2:5,
y = 2:5 / pi)
df
#> x y
#> 1 2 0.6366198
#> 2 3 0.9549297
#> 3 4 1.2732395
#> 4 5 1.5915494
# show as JSON - defaults to four digits
toJSON(df)
#> [{"x":2,"y":0.6366},{"x":3,"y":0.9549},{"x":4,"y":1.2732},{"x":5,"y":1.5915}]
# use default pacakge to change to 10
default(toJSON) <- list(digits = 10)
toJSON(df)
#> [{"x":2,"y":0.63661977237},{"x":3,"y":0.95492965855},{"x":4,"y":1.2732395447},{"x":5,"y":1.5915494309}]
There is another function called stream_out which uses toJSON but only uses the digits argument in ....
> stream_out(df)
{"x":2,"y":0.63662}
{"x":3,"y":0.95493}
{"x":4,"y":1.27324}
{"x":5,"y":1.59155}
Complete! Processed total of 4 rows.
>
> stream_out(df, digits = 10)
{"x":2,"y":0.63661977237}
{"x":3,"y":0.95492965855}
{"x":4,"y":1.2732395447}
{"x":5,"y":1.5915494309}
Complete! Processed total of 4 rows.
So even though I have changed the digits in toJSON, it isn't passed to the ... in stream_out. I cannot change this in the same manner as with toJSON.
> default(stream_out) <- list(digits = 10)
Error: 'digits' is not an argument of this function
This is not strictly a jsonlite question, but that is my use case here. I need to somehow change the ... argument of the stream_out function so that any time it is used, 10 digits are returned, rather than 4. However, any examples that show how to change defaults of ... arguments could probably be used to get to what I need.
Thanks!

Related

Getting different results from the same function inside and outside a mutate function call

Can someone explain to me why I get a different result when I run the convertToDisplayTime function inside mutate than when I run it on its own? The correct result is the one I obtain when I run it on its own. Also, why do I get these warnings? It feels like I might be passing the whole timeInSeconds column as an argument when I call convertToDisplayTime in the mutate function, but I'm not sure that I really understand the mechanics in play here.
library('tidyverse')
#> Warning: package 'tibble' was built under R version 4.1.2
convertToDisplayTime <- function(timeInSeconds){
## Takes a time in seconds and converts it
## to a xx:xx:xx string format
if(timeInSeconds>86400){ #Not handling time over a day
stop(simpleError("Enter a time below 86400 seconds (1 day)"))
} else if(timeInSeconds>3600){
numberOfMinutes = 0
numberOfHours = timeInSeconds%/%3600
remainingSeconds = timeInSeconds%%3600
if(remainingSeconds>60){
numberOfMinutes = remainingSeconds%/%60
remainingSeconds = remainingSeconds%%60
}
if(numberOfMinutes<10){displayMinutes = paste0("0",numberOfMinutes)}
else{displayMinutes = numberOfMinutes}
remainingSeconds = round(remainingSeconds)
if(remainingSeconds<10){displaySeconds = paste0("0",remainingSeconds)}
else{displaySeconds = remainingSeconds}
return(paste0(numberOfHours,":",displayMinutes,":", displaySeconds))
} else if(timeInSeconds>60){
numberOfMinutes = timeInSeconds%/%60
remainingSeconds = timeInSeconds%%60
remainingSeconds = round(remainingSeconds)
if(remainingSeconds<10){displaySeconds = paste0("0",remainingSeconds)}
else{displaySeconds = remainingSeconds}
return(paste0(numberOfMinutes,":", displaySeconds))
} else{
return(paste0("0:",timeInSeconds))
}
}
(df <- tibble(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10)) %>% mutate(displayTime = convertToDisplayTime(timeInSeconds)))
#> Warning in if (timeInSeconds > 86400) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 3600) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 60) {: the condition has length > 1 and only the
#> first element will be used
#> Warning in if (remainingSeconds < 10) {: the condition has length > 1 and only
#> the first element will be used
#> # A tibble: 4 x 2
#> timeInSeconds displayTime
#> <dbl> <chr>
#> 1 2710. 45:10
#> 2 2705. 45:5
#> 3 2692. 44:52
#> 4 2708. 45:8
convertToDisplayTime(2710.46)
#> [1] "45:10"
convertToDisplayTime(2705.04)
#> [1] "45:05"
convertToDisplayTime(2691.66)
#> [1] "44:52"
convertToDisplayTime(2708.10)
#> [1] "45:08"
Created on 2022-01-06 by the reprex package (v2.0.1)
Like mentioned in the comments, the problem here is that your function is not vectorized: it works with a single value for an input and outputs a single value. However, this does not work when the input is a vector of values, hence the condition has length 1 warning you get:
1: Problem with `mutate()` column `displayTime`.\
ℹ `displayTime = convertToDisplayTime(timeInSeconds)`.
ℹ the condition has length > 1 and only the first element will be used
Here, when you use dplyr::mutate, you're technically trying to feed a vector to your function, which is not formatted to process it.
Several options you may consider:
1. The "fast and ugly" way:
df <- data.frame(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10))
## This one does not work
df %>% mutate(displayTime = convertToDisplayTime(timeInSeconds))
## This one works
df %>%
rowwise() %>%
mutate(displayTime = convertToDisplayTime(timeInSeconds)) %>%
ungroup()
dplyr::rowwise() allows dplyr::mutate() to work on each row independently, rather than by columns. I assume this is the behavior you initially expected. dplyr::ungroup() sorta reverts rowwise, eg. go back to the default column-wise behavior.
I may be a little harsh on this one, but this is the kind of trick that I used back when I did not quite understand my way around dataframes and their manipulation...
2. Vectorize directly from your dplyr verbs:
df %>%
mutate(displayTime = base::mapply(convertToDisplayTime, timeInSeconds))
## or
df %>%
mutate(displayTime = purrr::map_chr(timeInSeconds, convertToDisplayTime))
Both options are similar.
3. Vectorize your function:
convertToDisplayTime_vec <- base::Vectorize(convertToDisplayTime)
# class(convertToDisplayTime_vec)
df %>% mutate(displayTime = convertToDisplayTime_vec(timeInSeconds))
## or
convertToDisplayTime_vec2 <- function(timeInSeconds_vec) {
mapply(FUN = convertToDisplayTime, timeInSeconds_vec)
}
# class(convertToDisplayTime_vec2)
df %>%
mutate(displayTime = convertToDisplayTime_vec2(timeInSeconds))
# Still works on single variables!
# convertToDisplayTime_vec2(6475)
This is my favourite option, as once it is implemented you can use it either on single variables, vectors or dataframes, without worring about it.
A little documentation to dig a little into the subject.
PS: As an aside, a little tip worth remembering: you may want to be careful when manipulating data.frame and tibble objects. Despite their similarity, they have slight differences, and some functions deal differently with one or the other, or actually convert one to the other without your noticing...

How can I input a single additional parameter to disk.frame's inmapfn at readin?

According to the article https://diskframe.com/articles/ingesting-data.html a good use case for inmapfn as part of csv_to_disk_frame(...) is for date conversion. In my data I know the name of the date column at runtime and would like to feed in the date to a convert at read in time function. One issue I am having is that it doesn't seem any additional parameters can be passed into the inmapfn argument beyond the chunk itself. I can't use a hardcoded variable at runtime as the name of the column isn't known until runtime.
To clarify the issue is that the inmapfn seems to run in its own environment to prevent any data races/other parallelisation issues but I know the variable won't be changed so I am hoping there is someway to override this as I can make sure that this is safe.
I know the function I am calling works when called on an arbitrary dataframe.
I have provided a reproducible example below.
library(tidyverse)
library(disk.frame)
setup_disk.frame()
a <- tribble(~dates, ~val,
"09feb2021", 2,
"21feb2012", 2,
"09mar2013", 3,
"20apr2021", 4,
)
write_csv(a, "a.csv")
dates_col <- "dates"
tmp.df <- csv_to_disk.frame(
"a.csv",
outdir = file.path(tempdir(), "tmp.df"),
in_chunk_size = 1L,
inmapfn = function(chunk) {
chunk[, sdate := as.Date(do.call(`$`, list(chunk,dates_col)), "%d%b%Y")]
}
)
#> -----------------------------------------------------
#> Stage 1 of 2: splitting the file a.csv into smallers files:
#> Destination: C:\Users\joelk\AppData\Local\Temp\RtmpcFBBkr\file4a1876e87bf5
#> -----------------------------------------------------
#> Stage 1 of 2 took: 0.020s elapsed (0.000s cpu)
#> -----------------------------------------------------
#> Stage 2 of 2: Converting the smaller files into disk.frame
#> -----------------------------------------------------
#> csv_to_disk.frame: Reading multiple input files.
#> Please use `colClasses = ` to set column types to minimize the chance of a failed read
#> =================================================
#>
#> -----------------------------------------------------
#> -- Converting CSVs to disk.frame -- Stage 1 of 2:
#>
#> Converting 5 CSVs to 6 disk.frames each consisting of 6 chunks
#>
#> Error in do.call(`$`, list(chunk, dates_col)): object 'dates_col' not found
You can experiment with different backend and chunk_reader arguments. For example, if you set the backend to readr, the inmapfn user defined function will have access to previously defined variables. Furthermore, readr will do column type guessing
and will automatically impute Date type columns if it recognizes the string format as a date (in your example data it wouldn't recognize that as a date type, however).
If you don't want to use the readr backend for performance reasons, then I would ask if your example correctly represents your actual scenario? I'm not seeing the need to pass in the date column as a variable in the example you provided.
There is a working solution in the Just-in-time transformation section of the link you provided, and I'm not seeing any added complexities between that example and yours.
If you really need to use the default backend and chunk_reader plan AND you really need to send the inmapfn function a previously defined variable, you can wrap the the csv_to_disk.frame call in a wrapper function:
library(disk.frame)
setup_disk.frame()
df <- tribble(~dates, ~val,
"09feb2021", 2,
"21feb2012", 2,
"09mar2013", 3,
"20apr2021", 4,
)
write.csv(df, file.path(tempdir(), "df.csv"), row.names = FALSE)
wrap_csv_to_disk <- function(col) {
my_date_col <- col
csv_to_disk.frame(
file.path(tempdir(), "df.csv"),
in_chunk_size = 1L,
inmapfn = function(chunk, dates = my_date_col) {
chunk[, dates] <- lubridate::dmy(chunk[[dates]])
chunk
})
}
date_col <- "dates"
df_disk_frame <- wrap_csv_to_disk(date_col)
#> str(collect(df_disk_frame)$dates)
# Date[1:4], format: "2021-02-09" "2012-02-21" "2013-03-09" "2021-04-20"
I see. For a work around would it be possible to do something like this?
date_var = knonw_at_runtime()
saveRDS(date_var, "some/path/date_var.rds")
a = csv_to_disk.frame(files, inmapfn = function(chunk) {
date_var = readRDS("some/path/date_var.rds")
# do the rest
})
I think letting inmapfn have other options is doable see https://github.com/xiaodaigh/disk.frame/issues/377 for tracking

How to better analyze contents of a `call`?

In the package rootSolve, the function multiroot requires an input argument which is a collection of functions. I found a way to determine dynamically how many functions are contained in that input argument, but would like some help from the R community as to a cleaner approach. The example input function here (underdefined, but that doesn't matter so far as parsing it goes)
Kfunc<-function(x) {
z<-c( z1<- 4*var1 -3*var2 +5*var3, z2<-8*var1 +5*var2 -2*var3 )
}
Where the z1,z2,z3 are the outputs and "varJ" are the parameters to be determined
I came up with this stack of functions to find out how many separate functions are "inside" the definition of z :
bar <- parse(text = (parse(text = body(Kfunc)[2] )[[1]][3]))
length(bar[[1]])
#[1] 3
bar[[1]][1]
#c()
bar[[1]][2]
#(z1 <- 4 * var1 - 3 * var2 + 5 * var3)()
bar[[1]][3]
#(z2 <- 8 * var1 + 5 * var2 - 2 * var3)()
Showing that the number of equations is equal to length(bar[[1]]) - 1
Is there a faster/shorter/cleaner way, preferably without requiring a non-base library?

R, getting an invalid argument to unary operator when using order function

I'm essentially doing the exact same thing 3 times, and when adding a new variable I get this error
Error in -emps$EV : invalid argument to unary operator
The code chunk causing this is
evps<-aggregate(EV~player,s1k,mean)
sort2<-evps[order(-evps$EV),]
head(sort2,10)
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
Works like a charm for the first list, but the identical code thereafter causes the error.
This specific line is causing the error
sort3<-emps[order(-emps$EV),]
How can I fix/workaround this?
Full Code
url <- getURL("https://raw.githubusercontent.com/M-ttM/Basketball/master/class.csv")
shots <- read.csv(text = url)
shots$make<-shots$points>0
shots2<-shots[which(!(shots$player=="Luc Richard Mbah a Moute")),]
fit1<-glm(make~factor(type)+factor(period), data=shots2,family="binomial")
summary(fit1)
shots2$makeodds<-fitted(fit1)
shots2$EV<-shots2$makeodds*ifelse(shots2$type=="3pt",3,2)
shots3<-shots2[which(shots2$y>7),]
locmakes<-data.frame(table(shots3[, c("x", "y")]))
s1k <- shots2[with(shots2, player %in% names(which(table(player)>=1000))), ]
pps<-aggregate(points~player,s1k,mean)
sort<-pps[order(-PPS$points),]
head(sort,10)
evps<-aggregate(EV~player,s1k,mean)
sort2<-evps[order(-evps$EV),]
head(sort2,10)
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
The error message seems to occur when trying to order columns including chr type data. A possible workaround is to use the reverse function rev() instead of the minus sign, like so:
column_a = c("a","a","b","b","c","c")
column_b = seq(6)
df = data.frame(column_a, column_b)
df$column_a = as.character(df$column_a)
df[with(df, order(-column_a, column_b)),]
> Error in -column_a : invalid argument to unary operator
df[with(df, order(rev(column_a), column_b)),]
column_a column_b
5 c 5
6 c 6
3 b 3
4 b 4
1 a 1
2 a 2
Let me know if it works in your case.
On this line, emps$EV doesn't exist.
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EV),]
head(sort3,10)
You probably meant
s1k$EM<-s1k$points-s1k$EV
emps<-aggregate(EM~player,s1k,mean)
sort3<-emps[order(-emps$EM),]
head(sort3,10)

Subset by function's variable using $variable

I am having trouble to subset from a list using a variable of my function.
rankhospital <- function(state,outcome,num = "best") {
#code here
e3<-dataframe(...,state.name,...)
if (num=="worst"){ return(worst(state,outcome))
}else if((num%in%b=="TRUE" & outcome=="heart attack")=="TRUE"){
sep<-split(e3,e3$state.name)
hosp.estado<-sep$state
hospital<-hosp.estado[num,1]
return(as.character(hospital))
I split my data frame by state (which is a variable of my function)
But hosp.estado<-sep$state doesn't work. I have also tried as.data.frame.
The function (rankhospital("NY"....) returns me a character(0).
When I feed the sep$state with sep$"NY" directly in code it works perfectly so I guess the problem is I can't use a function's variable to do this. Am I right? What could I use instead?
Thank you!!
If state is a variable in your function, you can refer to a column with the name given by state using: sep[state] or sep[[state]]. The first produces a data frame with one column named based on the value of state. The second produces an unnamed vector.
df=data.frame(NY=rnorm(10),CA=rnorm(10), IL=rnorm(10))
state="NY"
df[state]
# NY
# 1 -0.79533912
# 2 -0.05487747
# 3 0.25014132
# 4 0.61824329
# 5 -0.17262350
# 6 -2.22390027
# 7 -1.26361438
# 8 0.35872890
# 9 -0.01104548
# 10 -0.94064916
df[[state]]
# [1] -0.79533912 -0.05487747 0.25014132 0.61824329 -0.17262350 -2.22390027 -1.26361438 0.35872890 -0.01104548 -0.94064916
class(df[state])
# [1] "data.frame"
class(df[[state]])
# [1] "numeric"
It seems like you are trying to get the top hospital in a state. You don't want to split here (see the result of sep to see what I mean). Instead, use:
as.character(e3[e3$state.name==state, 1][num])
This hopefully does what you want.
You need sep[[state]] instead of sep$state to get the data frame out of your sep list, which matches the state parameter of your function. Like this:
e3 <- read.csv("https://raw.github.com/Hindol/data-analysis-coursera/master/HW3/hospital-data.csv")
state <- "WY"
num <- 1:5
sep<-split(e3,e3$State)
hosp.estado<-sep[[state]]
hospital<-hosp.estado[num,1]
as.character(hospital)
# [1] "530002" "530006" "530008" "530010" "530011"

Resources