I wrote a function in R that assigns deciles (or any n-tile) based on a volume metric, as opposed to observation counts.
User_Decile <- function(x, n, Output = " "){
  require(dplyr)
  df <- data_frame(index = seq_along(x), value = x)
  x_sum <- sum(df$value)
  x_ranges <- x_sum / n
  df <- df %>% arrange(value)
  df$cumsum <- cumsum(df$value)
  df$bins <- cut(df$cumsum,
                 breaks = floor(seq(0, x_sum, x_ranges)),
                 right = T,
                 include.lowest = T,
                 labels = as.integer(seq(1, n, 1)))
  if (Output == "Summary") {
    df <- df %>% group_by(bins)
    return(df %>% summarise(Lower_Bound = min(value),
                            Upper_Bound = max(value) - 1,
                            Value_sum = sum(value)))
  } else {
    df <- df %>% arrange(index)
    return(as.numeric(df$bins))
  }
}
(x is a vector of numbers, n is the number of bins/n-tiles to group the data into, and Output specifies whether you want a summary of the bounds/data or the actual bin assignments.)
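For context, the intended usage is something like this (a sketch with made-up data):
set.seed(1)
x <- sample(1:100, 50, replace = TRUE)  # hypothetical integer volume metric
User_Decile(x, 10)              # n-tile assignment per element, in the original order of x
User_Decile(x, 10, "Summary")   # per-bin bounds and volume totals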
It previously worked well within a program I created to segment some data, but I just tried to use the function again for the first time in a couple of months and I'm getting:
Error in .bincode(x, breaks, right, include.lowest) :
  invalid 'right' argument
According to the error, the issue is with the 'right' argument in the cut() function. As far as I know, the right= argument is boolean and only takes T or F values. I've tried both, but neither seems to work.
Does anyone have a workaround for this issue, or can recommend another function in place of cut()?
?TRUE states that:
TRUE and FALSE are reserved words denoting logical constants in the R
language, whereas T and F are global variables whose initial values
set to these.
It appears that T is being interpreted as something else here: most likely it has been overwritten somewhere in your workspace. You should always use TRUE to be on the safe side.
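A minimal sketch of how this can happen (the assignment to T here is hypothetical, standing in for whatever overwrote it in your session):
T <- "treatment"   # hypothetical leftover assignment masking the default T == TRUE
cut(1:10, breaks = c(0, 5, 10), right = T)
# Error in .bincode(x, breaks, right, include.lowest) :
#   invalid 'right' argument
rm(T)              # removes the mask; T again evaluates to TRUE
cut(1:10, breaks = c(0, 5, 10), right = TRUE)  # the reserved word always works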
TL;DR: I'm trying to use the merge.data.table() function with row indexes, and the suggestions given in the R documentation are not working.
My data is roughly:
library(data.table)
library(quantreg)
library(purrr)
foo <- expand.grid(seq(60001, 60050, by = 1),
                   c("18-30", "31-60", "61+"),
                   c("pre", "during", "after"))
foo <- as.data.table(foo)
setnames(foo, names(foo), c("zip", "agegroup", "period"))
foo <- cbind(foo,
             quartile = floor(runif(n = nrow(foo), 1, 4)),
             times = runif(n = nrow(foo), 18, 25))
I ran several quantile regressions on the data, subsetting by age group (at someone else's request).
v_tau <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
mq_age1 <- map(v_tau, ~rq(data = foo[agegroup == "18-30", ],
                          times ~ quartile + period + quartile*period,
                          tau = .x))
I'm trying to merge a vector of predicted fitted values from the rq() object with the original data table (I could also transform it into a dataframe, it doesn't have to be a data table). This vector is shorter than the number of rows in the data table, so I've been trying to apply the answer given here for a plm() object, modifying to account for the fact that my fitted values do not have multiple index attributes.
So, what I have been trying to do is join them by row index. I realize I can make another column with an explicit index, but I would like to avoid that because the fitted values are from a subset of the data and I am joining them to a subset of the data; adding an explicit index is possible, but not uniform or parsimonious, and will end up generating a lot of NAs that I don't want to deal with.
fitted <- mq_age1[[10]]$fitted.values
d_fitted <- cbind(attr(fitted, "index"),
                  fitted = fitted)
foo2 <- merge(foo[agegroup == "18-30",], d_fitted, by = 0, all.x = TRUE)
Looking at the merge() documentation, it says: "Columns to merge on can be specified by name, number or by a logical vector: the name "row.names" or the number 0 specifies the row names. If specified by name it must correspond uniquely to a named column in the input."
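For reference, a quick sketch of that documented behaviour with plain toy data frames (not my data):
a <- data.frame(x = 1:3)
b <- data.frame(y = c(10, 20, 30))
merge(a, b, by = 0)  # base::merge joins on row names and adds a Row.names column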
However, when I try this, it gives me the following error message:
Error in merge.data.table(foo[agegroup == "18-30", ], d_fitted, by = 0, :
  A non-empty vector of column names for `by` is required.
Similarly, when I try using "row.names":
foo2 <- merge(foo[agegroup == "18-30",], d_fitted, by = "row.names", all.x = TRUE)
Error in merge.data.table(foo[agegroup == "18-30", ], d_fitted, by = "row.names", :
  Elements listed in `by` must be valid column names in x and y
What is going on? Why can't I do this?
Found the answer: @r2evans kindly pointed out that base::merge has this functionality, while data.table::merge does not. Running
foo <- as.data.frame(foo)
before
foo2 <- merge(foo[foo$agegroup == "18-30", ], d_fitted, by = 0, all.x = TRUE)
did the trick. Thanks!
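For anyone who would rather stay in data.table, the explicit row-index route I was trying to avoid does also work; a sketch, assuming d_fitted has one row per row of the subset, in the same order:
sub <- foo[agegroup == "18-30"]
sub[, ridx := .I]                                     # row number within the subset
d_fitted_dt <- as.data.table(d_fitted)[, ridx := .I]  # matching row number
foo2 <- merge(sub, d_fitted_dt, by = "ridx", all.x = TRUE)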
Description and goal: In RStudio, I would like to define a function that drops columns of a given data.frame if they contain a too-high share of missing values, defined by a cutoff value in percent. This function should return information about the subsetted data.frame (number of remaining columns and remaining share of missing cases) together with the subsetted data.frame itself for further analyses. Additionally, there should be an option to visualize the remaining missing cases using the function vis_miss() from the naniar package.
Packages used:
library(tidyverse)
library(naniar)  # provides vis_miss()
Data:
my.data <- tibble(col_1 = c(1:5),
                  col_2 = c(1, 2, NA, NA, NA))
My function:
cut_cols <- function(df, na.perc.cutoff, vis_miss = FALSE) {
  df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
  cat(paste0("Remaining cols: ", ncol(df)),
      paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
  if (vis_miss == TRUE) {
    return(vis_miss(df[1:nrow(df), c(1:ncol(df))], warn_large_data = F))
  }
  df
}
Test:
cut_cols(my.data, 0.5, vis_miss = F) # without visualization
cut_cols(my.data, 0.5, vis_miss = T) # with visualization
Problem:
As you might have already seen in the example above, only the first line, where vis_miss = F, actually returns the data.frame, but not the second line, where vis_miss = T. I assume that this is because of the extra if () {} clause, which returns a plot and then ends the function without printing df. Is there a way to prevent this from happening, so that the second line also returns the new data.frame?
You were correct in your suspicion that the if () {} clause was stopping the df from printing: return() exits a function immediately, so nothing after it is ever run. If you use it at all, it is best placed at the end of the function.
Further, use print(df) to make sure your function outputs your data frame. Here are a few changes to your code:
cut_cols <- function(df, na.perc.cutoff, vis_miss = FALSE) {
  df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
  cat(paste0("Remaining cols: ", ncol(df)),
      paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
  print(df)
  if (vis_miss == TRUE) {
    return(vis_miss(df[1:nrow(df), c(1:ncol(df))], warn_large_data = F))
  }
}
cut_cols(my.data, 0.5, vis_miss = T)
Here's another option if it interests you: assign both the df and the plot to a list, then return the list.
cut_cols <- function(df, na.perc.cutoff, vis_miss = FALSE) {
  df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
  cat(paste0("Remaining cols: ", ncol(df)),
      paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
  list_ <- list()       # empty list
  list_[[1]] <- df      # assign df to first index of list
  if (vis_miss == TRUE) {
    plot <- vis_miss(df[1:nrow(df), c(1:ncol(df))], warn_large_data = F)
    list_[[2]] <- plot  # assign plot to second index of list
  }
  return(list_)
}
output <- cut_cols(my.data, 0.5, vis_miss = T)
Calling output will print both the df and plot. output[[1]] will print just the df. output[[2]] will print just the plot.
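A third option along the same lines (a sketch): print() the plot as a side effect instead of returning it, so that the plot is drawn and the data frame is returned in both cases:
cut_cols <- function(df, na.perc.cutoff, vis_miss = FALSE) {
  df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
  cat(paste0("Remaining cols: ", ncol(df)),
      paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
  if (vis_miss) print(vis_miss(df, warn_large_data = FALSE))  # draw the plot, don't return it
  df  # returned whether or not the plot was drawn
}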
Good evening,
I asked a question earlier and found it hard to implement the solution, so I am going to re-ask it more clearly.
The problem: I want to add a column to a data frame of daily returns of a stock. Let's say the returns are normally distributed, and I would like to add a column containing the historical value at risk (VaR), for which I wrote a function myself.
The restriction is that the function should be applied to each observation together with the 249 observations before it. So when the next observation is calculated, it should again use only the 249 observations of the days before; the input window should move as time goes on. In other words, I want values from 251 days ago and earlier to be excluded. Hopefully I explained myself well enough; if not, maybe the code speaks for me:
df <- data.frame(Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
                 Returns = rnorm(500))
# function
VaR.hist <- function(x, n = 250, hd = 20, q = 0.05){
  width <- nrow(x)
  NA.x <- na.omit(x)
  quantil <- quantile(NA.x[(width - 249):width], probs = q)
  VaR <- quantil * sqrt(hd) %>%
    return()
}
# Run the function on the dataframe
df$VaR <- df$Returns %>% VaR.hist()
Error in (width - 249):width : argument of length 0
This is the error that I get instead of my new variable...
Thanks !!
As wibom wrote in the comment, nrow(x) does not work for vectors; what you need is length() instead. Also, you do not need return() in the last line, as R automatically returns the value of the last expression of a function if there is no early return() before it. Note that the corrected function still produces a single number (the q-quantile of the last 250 observations times sqrt(hd)), which R recycles across every row of df; for a rolling, per-observation VaR, see the next answer.
library(dplyr)
df <- data.frame(Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
                 Returns = rnorm(500))
# function
VaR.hist <- function(x, n = 250, hd = 20, q = 0.05){
  width <- length(x)  # here you need length(), as x is a vector; nrow() only works for data.frames/matrices
  NA.x <- na.omit(x)
  quantil <- quantile(NA.x[(width - 249):width], probs = q)
  quantil * sqrt(hd)
}
# Run the function on the dataframe
df$VaR <- df$Returns %>% VaR.hist()
It's a bit hard to understand what you want to do exactly.
My understanding is that you wish to compute a new variable VaR, calculated based on the current and previous 249 observations of df$Returns, right?
Is this about what you wish to do?
library(tidyverse)
set.seed(42)
df <- tibble(
  Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
  Returns = rnorm(500)
)
the_function <- function(i, mydata, hd = 20, q = .05) {
  r <- mydata %>%
    filter(ridx <= i, ridx > i - 249) %>%
    pull(Returns)
  quantil <- quantile(r, probs = q)
  VaR <- quantil * sqrt(hd)
}
df <- df %>%
  mutate(ridx = row_number()) %>%
  mutate(VaR = map_dbl(ridx, the_function, mydata = .))
If you are looking for a base-R solution:
set.seed(42)
df <- data.frame(
  Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
  Returns = rnorm(500)
)
a_function <- function(i, mydata, hd = 20, q = .05) {
  r <- mydata$Returns[mydata$ridx <= i & mydata$ridx > (i - 249)]
  quantil <- quantile(r, probs = q)
  VaR <- quantil * sqrt(hd)
}
df$ridx <- 1:nrow(df)  # add index
df$VaR <- sapply(df$ridx, a_function, mydata = df)
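If pulling in another package is acceptable, here is a rolling-window sketch using zoo::rollapply (width = 250 here means the current observation plus the previous 249; partial = TRUE lets the early windows use whatever history is available):
library(zoo)
df$VaR <- rollapply(df$Returns, width = 250,
                    FUN = function(r) quantile(r, probs = 0.05) * sqrt(20),
                    partial = TRUE, align = "right")  # window ends at the current observation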
I am trying to collate results from a simulation study using dplyr and purrr. My results are saved as a list of data frames with the results from several different classification algorithms, and I'm trying to use purrr and dplyr to summarize these results.
I'm trying to calculate
- number of objects assigned to each cluster
- number of objects in the cluster that actually belong to the cluster
- number of true positives, false positives, false negatives, and true negatives using 3 different algorithms (KEEP1 - KEEP3)
- for 2 of the algorithms, I have access to a probability of being in the cluster, so I can compare this probability to alternative choices of alpha and calculate true positives etc. using a different choice of alpha.
I found this: https://github.com/tidyverse/dplyr/issues/3101, which I used successfully on a single element of the list to get exactly what I wanted:
f <- function(.x, .y) {
  sum(.x & .y)
}
actions <- list(
  .vars = lst(
    c('correct'),
    c('KEEP1', 'KEEP2', 'KEEP3'),
    c('pval1', 'pval2')
  ),
  .funs = lst(
    funs(Nk = length, N_correct = sum),
    funs(
      TP1 = f(., .y = correct),
      FN1 = f(!(.), .y = correct),
      TN1 = f(!(.), .y = !(correct)),
      FP1 = f(., .y = !(correct))
    ),
    funs(
      TP2 = f((. < alpha0), .y = correct),
      FN2 = f(!(. < alpha0), .y = correct),
      TN2 = f(!(. < alpha0), .y = !(correct)),
      FP2 = f((. < alpha0), .y = !(correct))
    )
  )
)
reproducible_data <- replicate(
  2,
  data_frame(
    k = factor(rep(1:10, each = 20)),  # group/category
    correct = sample(x = c(TRUE, FALSE), 10 * 20, replace = TRUE, prob = c(.8, .2)),
    pval1 = rbeta(10 * 20, 1, 10),
    pval2 = rbeta(10 * 20, 1, 10),
    KEEP1 = pval1 < 0.05,
    KEEP2 = pval2 < 0.05,
    KEEP3 = runif(10 * 20) > .2,
    alpha0 = 0.05,
    alpha = 0.05 / 20  # divided by no. of objects in each group (k)
  ),
  simplify = FALSE
)
# works
df1 <- reproducible_data[[1]]
pmap(actions, ~ df1 %>% group_by(k) %>% summarize_at(.x, .y)) %>%
  reduce(inner_join, by = 'k')
Now, I want to use map() to do this to the entire list. However, I can no longer access the variable correct (it hasn't gotten far enough to fail on alpha or alpha0, but presumably the same issue will occur there). I'm still learning dplyr/purrr, and my experimenting hasn't proved useful.
# does not work
out_summary <- map(
  reproducible_data,
  pmap(actions, ~ as_tibble(.) %>% group_by("k") %>% summarize_at(.x, .y)) %>%
    reduce(inner_join, by = 'k')
)
# this doesn't either
out_summary <- map(
  reproducible_data,
  pmap(actions, ~ as_tibble(.) %>% group_by("k") %>% summarize_at(.x, .y, alpha = alpha, alpha0 = alpha0, correct = correct)) %>%
    reduce(inner_join, by = 'k')
)
Within map(), I don't see the variable k in group_by(k) unless it is quoted as group_by("k"), but I did not need to quote it when I just used pmap(). I've tried various ways to pass the correct variables to these functions, but I'm still learning dplyr and purrr and haven't succeeded yet.
One more note: the actual data is stored as a regular data frame, so I need as_tibble() in the pmap() call. I was running into some different errors when I removed it in this example, so I opted to add it back so I would get the same issues. Thanks!
Try this
map(
  reproducible_data,
  function(df1) {
    pmap(actions, ~ df1 %>%
           as_tibble() %>%
           group_by(k) %>%
           summarize_at(.x, .y)) %>%
      reduce(inner_join, by = "k")
  }
)
I think your arguments might get mixed up when using map and pmap at the same time, so I used the function syntax for map to define df1 explicitly. The rest of it looks OK (although I switched to pmap_df to return a data frame: the structure of the list was ugly without it, and pmap_df was the easiest way to make it pretty). Let me know if it's not the expected output. 👍
There was also the problem with group_by("k") vs. group_by(k): writing group_by("k") actually creates a variable filled with the character "k" and groups by that constant. That will get your code to run, but it won't do what you like. Sometimes that kind of problem is really because of an error that occurs a line or two before (or, with dplyr, a pipe or two before); in this case, map wasn't passing df1 where you needed it.
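A quick illustration of that pitfall on a toy tibble:
library(dplyr)
d <- tibble(k = c(1, 1, 2), v = 1:3)
d %>% group_by("k") %>% summarise(n = n())  # one group: grouped by the constant string "k"
d %>% group_by(k) %>% summarise(n = n())    # two groups: k = 1 and k = 2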
I am wondering how to properly unquote (UQ) variable names built from strings on the RHS of dplyr verbs like mutate(). See the error messages I got in the comments in the wilcox.test part of this MWE:
require(dplyr)
dfMain <- data.frame(
  base = c(rep('A', 5), rep('B', 5)),
  id = letters[1:10],
  q0 = rnorm(10)
)
backgs <- list(
  A = rnorm(13),
  B = rnorm(11)
)
fun <- function(dfMain, i = 0){
  pcol <- sprintf('p%i', i)
  qcol <- sprintf('q%i', i)
  (
    dfMain %>%
      group_by(id) %>%
      mutate(
        !!pcol := ifelse(
          !is.nan(!!qcol) &
            length(backgs[[base]]),
          wilcox.test(
            # !!(qcol) - backgs[[base]]
            #   object 'base' not found
            # (!!qcol) - backgs[[base]]
            #   non-numeric argument to binary operator
            (!!qcol) - backgs[[base]]
          )$p.value,
          NaN
        )
      )
  )
}
dfMain <- dfMain %>% fun()
I guess that at !!(qcol) - backgs[[base]] the whole expression gets unquoted, not only the variable name, which is why it does not find base? I also found out that (!!qcol) returns the string itself, so it's no surprise the - operator is unable to handle it.
Your code should work as you expect by changing the line where you define qcol to:
qcol <- as.symbol(sprintf('q%i', i))
That is, since qcol was a string, you needed to turn it into a symbol before unquoting, so that it is evaluated as a column name in your mutate() (unquoting the string just inserts the string itself, which is what you observed). Also double-check that the string resolves to a column that actually exists: here sprintf('q%i', 0) gives "q0", which matches the q0 column defined in your data.
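Putting it together, a sketch of the corrected function (only the qcol line changes; everything else is the body from the question):
fun <- function(dfMain, i = 0){
  pcol <- sprintf('p%i', i)
  qcol <- as.symbol(sprintf('q%i', i))  # symbol, not string
  dfMain %>%
    group_by(id) %>%
    mutate(
      !!pcol := ifelse(
        !is.nan(!!qcol) & length(backgs[[base]]),
        wilcox.test((!!qcol) - backgs[[base]])$p.value,
        NaN
      )
    )
}
dfMain <- dfMain %>% fun()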