I have a dataset, and I would like to randomize the order of this dataset 100 times and calculate the cumulative mean each time.
# example data
ID <- seq.int(1,100)
val <- rnorm(100)
df <- cbind(ID, val) %>%
as.data.frame(df)
I already know how to calculate the cumulative mean using the function "cummean()" in dplyr.
df2 <- df %>%
mutate(cm = cummean(val))
However, I don't know how to randomize the dataset 100 times and apply the cummean() function to each iteration of the dataframe. Any advice on how to do this would be greatly appreciated.
I realize this could probably be solved via either a loop, or in tidyverse, and I'm open to either solution.
Additionally, if possible, I'd like to include a column that indicates which iteration the data was produced from (i.e., randomization #1, #2, ..., #100), as well as include the "ID" value, which indicates how many data values were included in the cumulative mean. Thanks in advance!
Here is an approach using the purrr package. Also, not sure what cummean is calculating (maybe someone can share that in the comments) so I included an alternative, the column cm2 as a comparison.
library(tidyverse)
set.seed(2000)
num_iterations <- 100
num_sample <- 100
1:num_iterations %>%
map_dfr(
function(i) {
tibble(
iteration = i,
id = 1:num_sample,
val = rnorm(num_sample),
cm = cummean(val),
cm2 = cumsum(val) / seq_along(val)
)
}
)
You can mutate to create 100 samples then call cummean:
library(dplyr)
library(purrr)
df %>% mutate(map_dfc(1:100, ~cummean(sample(val))))
We may use rerun from purrr
library(dplyr)
library(purrr)
f1 <- function(dat, valcol) {
dat %>%
sample_n(size = n()) %>%
mutate(cm = cummean({{valcol}}))
}
n <- 100
out <- rerun(n, f1(df, val))
The output of rerun is a list, which we can name it with sequence and if we need to create a new column by binding, use bind_rows
out1 <- bind_rows(out, .id = 'ID')
> head(out1)
ID val cm
1 1 0.3376980 0.33769804
2 1 -1.5699384 -0.61612019
3 1 1.3387892 0.03551628
4 1 0.2409634 0.08687807
5 1 0.7373232 0.21696708
6 1 -0.8012491 0.04726439
Related
I would like to create for loop to repeat the same function for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$ var1 ==0)["TRUE"]
n <- table(df$ var1 ==1)["TRUE"]
PREV95 <- (svyciprop(~ var1 ==1, level=0.95, design= design, deff= "replace")*100)
I need to run the same functions for 150 columns. I know that I need to put all my cols in one vector = x but then I don't know how to write the loop to repeat the same command for all my variables.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, R-ish way (various flavours of apply, maping, walking ...)
applying a function to the columns of dataframe df:
a)
with base R, example dataset cars
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b)
tidyverse-style:
cars %>%
summarise_all(my_function)
An anecdotal example: I came across an R-script which took about half an hour to complete and made abundant use of for-loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for-loops and related constructs might be more familiar when coming from another language, they might soon get in your way with R.
This chapter of Hadley Wickham's R for data science gives an introduction into iterating "the R-way".
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it. I created a function eval() that does all the work. First, it filters out just the factors. Then it converts your factors to numeric variables so that the numbers can be summed as 0 and 1 otherwise if we sum the factors it would be based on 1 and 2. Within the function I create another function neg() to give you the number of negative values by subtracting the sum of the 1s from the total length of the vector. Then create the dataframes "n" (sum of the positives), "N" (sum of the negatives), and PREV95. I used pivot_longer to get the data in a long format so that each stat you are looking for will be in its own column when merged together. Note I had to leave PREV95 out because I do not have a 'design' object to use as a parameter to run the function. I hashed it out but you can remove the hash to add back in. I then used left_join to combine these dataframes and return "results". Again, I've hashed out the version that you'd use to include PREV95. The function eval() takes your original dataframe as input. I think the logic for PREV95 should work, but I cannot check it without a 'design' parameter. It returns a dataframe, not a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
eval <- function(df){
df1 <- df %>%
select_if(is.factor) %>%
mutate_all(function(x) as.numeric(as.character(x)))
neg <- function(x){
length(x) - sum(x)
}
n<- df1 %>%
summarize(across(where(is.numeric), sum)) %>%
pivot_longer(everything(), names_to = "Var", values_to = "n")
N <- df1 %>%
summarize(across(where(is.numeric), function(x) neg(x))) %>%
pivot_longer(everything(), names_to = "Var", values_to = "N")
#PREV95 <- df1 %>%
# summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
# pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
results <- n %>%
left_join(N, by = "Var")
#results <- n %>%
# left_join(N, by = "Var") %>%
# left_join(PREV95, by = "Var")
return(results)
}
eval(df)
Var n N
<chr> <dbl> <dbl>
1 Var1 2 8
2 Var2 5 5
3 Var3 4 6
If you really wanted to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on the parameters to make it work.
seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
results[[var]][["n"]] <- sum(df[[var]] == 1)
results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
2 8 5 5 4 6
In dplyr, group_by has a parameter add, and if it's true, it adds to the group_by. For example:
data <- data.frame(a=c('a','b','c'), b=c(1,2,3), c=c(4,5,6))
data <- data %>% group_by(a, add=TRUE)
data <- data %>% group_by(b, add=TRUE)
data %>% summarize(sum_c = sum(c))
Output:
a b sum_c
1 a 1 4
2 b 2 5
3 c 3 6
Is there an analogous way to add summary variables to a summarize statement? I have some complicated conditionals (with dbplyr) where if x=TRUE I want to add
variable x_v to the summary.
I see several related stackoverflow questions, but I didn't see this.
EDIT: Here is some precise example code, but simplified from the real code (which has more than two conditionals).
summarize_num <- TRUE
summarize_num_distinct <- FALSE
data <- data.frame(val=c(1,2,2))
if (summarize_num && summarize_num_distinct) {
summ <- data %>% summarize(n=n(), n_unique=n_distinct())
} else if (summarize_num) {
summ <- data %>% summarize(n=n())
} else if (summarize_num_distinct) {
summ <- data %>% summarize(n_unique=n_distinct())
}
Depending on conditions (summarize_num, and summarize_num_distinct here), the eventual summary (summ here) has different columns.
As the number of conditions goes up, the number of clauses goes up combinatorially. However, the conditions are independent, so I'd like to add the summary variables independently as well.
I'm using dbplyr, so I have to do it in a way that it can get translated into SQL.
Would this work for your situation? Here, we add a column for each requested summation using mutate. It's computationally wasteful since it does the same sum once for every row in each group, and then discards everything but the first row of each group. But that might be fine if your data's not too huge.
data <- data.frame(val=c(1,2,2), grp = c(1, 1, 2)) # To show it works within groups
summ <- data %>% group_by(grp)
if(summarize_num) {summ = mutate(summ, n = n())}
if(summarize_num_distinct) {summ = mutate(summ, n_unique=n_distinct(val))}
summ = slice(summ, 1) %>% ungroup() %>% select(-val)
## A tibble: 2 x 3
# grp n n_unique
# <dbl> <int> <int>
#1 1 2 2
#2 2 1 1
The summarise_at() function takes a list of functions as parameter. So, we can get
data <- data.frame(val=c(1,2,2))
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts)
n_unique n
1 2 3
All functions in the list must take one argument. Therefore, n() was replaced by length().
The list of functions can be modified dynamically as requested by the OP, e.g.,
summarize_num_distinct <- FALSE
summarize_num <- TRUE
fcts <- list(n_unique = n_distinct, n = length)
data %>%
summarise_at(.vars = "val", fcts[c(summarize_num_distinct, summarize_num)])
n
1 3
So, the idea is to define a list of possible aggregation functions and then to select dynamically the aggregation to compute. Even the order of columns in the aggregate can be determined:
fcts <- list(n_unique = n_distinct, n = length, sum = sum, avg = mean, min = min, max = max)
data %>%
summarise_at(.vars = "val", fcts[c(6, 2, 4, 3)])
max n avg sum
1 2 3 1.666667 5
Starting point:
I have a dataset (tibble) which contains a lot of Variables of the same class (dbl). They belong to different settings. A variable (column in the tibble) is missing. This is the rowSum of all variables belonging to one setting.
Aim:
My aim is to produce sub data sets with the same data structure for each setting including the "rowSum"-Variable (i call it "s1").
Problem:
In each setting there are a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables it is a typical situation for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the interessting setting (is working) and
(2) try to rowSums the variables of the setting (does not work; Why?).
Because it is a function for a special designed dataset, the function includes two predefined variables:
day - which is any day of an investigation period
N - which is the Number of cases investigated on this special day
Thank you for any help.
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day,N,!!! subvars) %>%
dplyr::mutate(s1 = rowSums(!!! subvars,na.rm = TRUE))
return(dfplot)
}
We can change it to string with as_name and subset the dataset with [[ for the rowSums
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
v1 <- map_chr(subvars, as_name)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = rowSums( .[v1],na.rm = TRUE))
return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
# day N col1 col2 s1
#1 1 20 -0.5458808 0.4703824 -0.07549832
#2 2 20 0.5365853 0.3756872 0.91227249
#3 3 20 0.4196231 0.2725374 0.69216051
Or another option would be select the quosure and then do the rowSums
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = dplyr::select(., !!! subvars) %>%
rowSums(na.rm = TRUE))
return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))
I want to turn a table into a data frame. Three columns should be there: 1. the zip code 2 outcome "0" and 3 outcome "1". But as.data.frame.matrix turns the zip-code into row names and makes them unusable.
I tried to add a fourth column with imaginary ID's (1:100) so R makes them to row names but R tells me, that "all arguments must be the same length" - which they are!
id <- 1:5000
zip <- sample(100:200, 5000, replace = TRUE)
outcome <- rbinom(5000, 1, 0.23)
df <- data.frame(id, outcome, zip)
abs <- table(df$zip, df$outcome)
abs <- as.data.frame.matrix(abs)
Some has a nice and slick idea? Thanks in advance!
Edit:
When:
abs <- as.matrix(as.data.frame(abs))
I get something close to what I want but the outcomes are together in one column. How to untie them, to make them look like the table again?
You can get to your desired result easier with dplyr and tidyr:
library(dplyr)
library(tidyr)
id <- 1:5000
zip <- sample(100:200, 5000, replace = TRUE)
outcome <- rbinom(5000, 1, 0.23)
df <- data.frame(id, outcome, zip)
df <- df %>% group_by(zip, outcome) %>%
summarise(freq = n()) %>%
ungroup() %>%
spread(outcome, freq)
You are supplying only a 100 values to a data.frame that has 101 rows.
> nrow(abs)
[1] 101
so this would work
abs$new_col <- 1:101
I think you want this:
abs2 <- as.data.frame(abs) %>% select(2,3,1)
I am new to R . I have a large dataset with 1-minute resolution for one year . It makes total of 55940 observation all 1 minute apart with dates and times . I want to change it to six minute resolution data. It necessarily means adding first 6 rows then next 6 and so on so forth . Any good solutions ?
You could try something like this:
library(dplyr)
# original df
df <- data.frame(min = 1:60, val = rnorm(60))
# create a grouping variable and add to df
grp <- floor(df$min / 6)
df <- data.frame(grp, df)
# create new df at 6 min level
new.df <- df %>%
group_by(grp) %>%
summarise(new.val = sum(val))
Another option with a similar approach
library(dplyr)
# original dataframe
n <- 55940
df <- data.frame(id = 1:n , val = rnorm(n))
# new dataframe
df_new <- df %>%
group_by(cut(df$id, n/6)) %>%
summarise(new.val = sum(val))