R apply correlation function to a list - r

I have a data frame like this:
set.seed(1)
category <- c(rep('A',100), rep('B',100), rep('C',100))
var1 = rnorm(1:300)
var2 = rnorm(1:300)
df<-data.frame(category=category, var1 = var1, var2=var2)
I need to calculate the correlations between var1 and var2 by category. I think I can first split the df by category and apply the cor function to the list. But I am really confused about hot to use the lapply function.
Could someone kindly help me out?

This should produce the desired result:
lapply(split(df, category), function(dfs) cor(dfs$var1, dfs$var2))
EDIT:
You can also use by (as suggested by #thelatemail):
by(df, df$category, function(x) cor(x$var1,x$var2))

You can use sapply to get the same but as a vector, not a list
sapply(split(df, category), function(dfs) cor(dfs$var1, dfs$var2))

And just for comparison, here's how you'd do it with the dplyr package.
library(dplyr)
df %>% group_by(category) %>% summarize(cor=cor(var1,var2))
# category cor
# 1 A -0.05043706
# 2 B 0.13519013
# 3 C -0.04186283

Related

Creating a loop in R for a function

I would like to create for loop to repeat the same function for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$ var1 ==0)["TRUE"]
n <- table(df$ var1 ==1)["TRUE"]
PREV95 <- (svyciprop(~ var1 ==1, level=0.95, design= design, deff= "replace")*100)
I need to run the same functions for 150 columns. I know that I need to put all my cols in one vector = x but then I don't know how to write the loop to repeat the same command for all my variables.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, R-ish way (various flavours of apply, maping, walking ...)
applying a function to the columns of dataframe df:
a)
with base R, example dataset cars
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b)
tidyverse-style:
cars %>%
summarise_all(my_function)
An anecdotal example: I came across an R-script which took about half an hour to complete and made abundant use of for-loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for-loops and related constructs might be more familiar when coming from another language, they might soon get in your way with R.
This chapter of Hadley Wickham's R for data science gives an introduction into iterating "the R-way".
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it. I created a function eval() that does all the work. First, it filters out just the factors. Then it converts your factors to numeric variables so that the numbers can be summed as 0 and 1 otherwise if we sum the factors it would be based on 1 and 2. Within the function I create another function neg() to give you the number of negative values by subtracting the sum of the 1s from the total length of the vector. Then create the dataframes "n" (sum of the positives), "N" (sum of the negatives), and PREV95. I used pivot_longer to get the data in a long format so that each stat you are looking for will be in its own column when merged together. Note I had to leave PREV95 out because I do not have a 'design' object to use as a parameter to run the function. I hashed it out but you can remove the hash to add back in. I then used left_join to combine these dataframes and return "results". Again, I've hashed out the version that you'd use to include PREV95. The function eval() takes your original dataframe as input. I think the logic for PREV95 should work, but I cannot check it without a 'design' parameter. It returns a dataframe, not a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
eval <- function(df){
df1 <- df %>%
select_if(is.factor) %>%
mutate_all(function(x) as.numeric(as.character(x)))
neg <- function(x){
length(x) - sum(x)
}
n<- df1 %>%
summarize(across(where(is.numeric), sum)) %>%
pivot_longer(everything(), names_to = "Var", values_to = "n")
N <- df1 %>%
summarize(across(where(is.numeric), function(x) neg(x))) %>%
pivot_longer(everything(), names_to = "Var", values_to = "N")
#PREV95 <- df1 %>%
# summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
# pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
results <- n %>%
left_join(N, by = "Var")
#results <- n %>%
# left_join(N, by = "Var") %>%
# left_join(PREV95, by = "Var")
return(results)
}
eval(df)
Var n N
<chr> <dbl> <dbl>
1 Var1 2 8
2 Var2 5 5
3 Var3 4 6
If you really wanted to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on the parameters to make it work.
seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
Var2 = factor(sample(c(0,1), 10, TRUE)),
Var3 = factor(sample(c(0,1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
results[[var]][["n"]] <- sum(df[[var]] == 1)
results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
2 8 5 5 4 6

Calculate cumulative mean for dataset randomized 100 times

I have a dataset, and I would like to randomize the order of this dataset 100 times and calculate the cumulative mean each time.
# example data
ID <- seq.int(1,100)
val <- rnorm(100)
df <- cbind(ID, val) %>%
as.data.frame(df)
I already know how to calculate the cumulative mean using the function "cummean()" in dplyr.
df2 <- df %>%
mutate(cm = cummean(val))
However, I don't know how to randomize the dataset 100 times and apply the cummean() function to each iteration of the dataframe. Any advice on how to do this would be greatly appreciated.
I realize this could probably be solved via either a loop, or in tidyverse, and I'm open to either solution.
Additionally, if possible, I'd like to include a column that indicates which iteration the data was produced from (i.e., randomization #1, #2, ..., #100), as well as include the "ID" value, which indicates how many data values were included in the cumulative mean. Thanks in advance!
Here is an approach using the purrr package. Also, not sure what cummean is calculating (maybe someone can share that in the comments) so I included an alternative, the column cm2 as a comparison.
library(tidyverse)
set.seed(2000)
num_iterations <- 100
num_sample <- 100
1:num_iterations %>%
map_dfr(
function(i) {
tibble(
iteration = i,
id = 1:num_sample,
val = rnorm(num_sample),
cm = cummean(val),
cm2 = cumsum(val) / seq_along(val)
)
}
)
You can mutate to create 100 samples then call cummean:
library(dplyr)
library(purrr)
df %>% mutate(map_dfc(1:100, ~cummean(sample(val))))
We may use rerun from purrr
library(dplyr)
library(purrr)
f1 <- function(dat, valcol) {
dat %>%
sample_n(size = n()) %>%
mutate(cm = cummean({{valcol}}))
}
n <- 100
out <- rerun(n, f1(df, val))
The output of rerun is a list, which we can name it with sequence and if we need to create a new column by binding, use bind_rows
out1 <- bind_rows(out, .id = 'ID')
> head(out1)
ID val cm
1 1 0.3376980 0.33769804
2 1 -1.5699384 -0.61612019
3 1 1.3387892 0.03551628
4 1 0.2409634 0.08687807
5 1 0.7373232 0.21696708
6 1 -0.8012491 0.04726439

R Subsetting text from a comma seperated column in a data-frame

I have a data.frame with a column that looks like that:
diagnosis
F.31.2,A.43.2,R.45.2,F.43.1
I want to somehow split this column into two colums with one containing all the values with F and one for all the other values, resulting in two columns in a df that looks like that.
F other
F.31.2,F43.1 A.43.2,R.45.2
Thanks in advance
Try next tidyverse approach. You can separate the rows by , and then create a group according to the pattern in order to reshape to wide and obtain the expected result:
library(dplyr)
library(tidyr)
#Data
df <- data.frame(diagnosis='F.31.2,A.43.2,R.45.2,F.43.1',stringsAsFactors = F)
#Code
new <- df %>% separate_rows(diagnosis,sep = ',') %>%
mutate(Group=ifelse(grepl('F',diagnosis),'F','Other')) %>%
pivot_wider(values_fn = toString,names_from=Group,values_from=diagnosis)
Output:
# A tibble: 1 x 2
F Other
<chr> <chr>
1 F.31.2, F.43.1 A.43.2, R.45.2
First, use strsplit at the commas. Then, using grep find indexes of F, and select/antiselect them by multiplying by 1 or -1 and paste them.
tmp <- el(strsplit(d$diagnosis, ","))
res <- lapply(c(1, -1), function(x) paste(tmp[grep("F", tmp)*x], collapse=","))
res <- setNames(as.data.frame(res), c("F", "other"))
res
# F other
# 1 F.31.2,F.43.1 A.43.2,R.45.2
Data:
d <- setNames(read.table(text="F.31.2,A.43.2,R.45.2,F.43.1"), "diagnosis")

How to sum up a list of variables in a customized dplyr function?

Starting point:
I have a dataset (tibble) which contains a lot of Variables of the same class (dbl). They belong to different settings. A variable (column in the tibble) is missing. This is the rowSum of all variables belonging to one setting.
Aim:
My aim is to produce sub data sets with the same data structure for each setting including the "rowSum"-Variable (i call it "s1").
Problem:
In each setting there are a different number of variables (and of course they are named differently).
Because it should be the same structure with different variables it is a typical situation for a function.
Question:
How can I solve the problem using dplyr?
I wrote a function to
(1) subset the original dataset for the interessting setting (is working) and
(2) try to rowSums the variables of the setting (does not work; Why?).
Because it is a function for a special designed dataset, the function includes two predefined variables:
day - which is any day of an investigation period
N - which is the Number of cases investigated on this special day
Thank you for any help.
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day,N,!!! subvars) %>%
dplyr::mutate(s1 = rowSums(!!! subvars,na.rm = TRUE))
return(dfplot)
}
We can change it to string with as_name and subset the dataset with [[ for the rowSums
library(rlang)
library(purrr)
library(dplyr)
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
v1 <- map_chr(subvars, as_name)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = rowSums( .[v1],na.rm = TRUE))
return(dfplot)
}
out <- mkr.sumsetting(col1, col2, dataset = df1)
head(out, 3)
# day N col1 col2 s1
#1 1 20 -0.5458808 0.4703824 -0.07549832
#2 2 20 0.5365853 0.3756872 0.91227249
#3 3 20 0.4196231 0.2725374 0.69216051
Or another option would be select the quosure and then do the rowSums
mkr.sumsetting <- function(...,dataset){
subvars <- rlang::enquos(...)
#print(subvars)
# Summarize the variables belonging to the interessting setting
dfplot <- dataset %>%
dplyr::select(day, N, !!! subvars) %>%
dplyr::mutate(s1 = dplyr::select(., !!! subvars) %>%
rowSums(na.rm = TRUE))
return(dfplot)
}
mkr.sumsetting(col1, col2, dataset = df1)
data
set.seed(24)
df1 <- data.frame(day = 1:20, N = 20, col1 = rnorm(20),
col2 = runif(20))

Parallel wilcox.test using group_by and summarise

There must be an R-ly way to call wilcox.test over multiple observations in parallel using group_by. I've spent a good deal of time reading up on this but still can't figure out a call to wilcox.test that does the job. Example data and code below, using magrittr pipes and summarize().
library(dplyr)
library(magrittr)
# create a data frame where x is the dependent variable, id1 is a category variable (here with five levels), and id2 is a binary category variable used for the two-sample wilcoxon test
df <- data.frame(x=abs(rnorm(50)),id1=rep(1:5,10), id2=rep(1:2,25))
# make sure piping and grouping are called correctly, with "sum" function as a well-behaving example function
df %>% group_by(id1) %>% summarise(s=sum(x))
df %>% group_by(id1,id2) %>% summarise(s=sum(x))
# make sure wilcox.test is called correctly
wilcox.test(x~id2, data=df, paired=FALSE)$p.value
# yet, cannot call wilcox.test within pipe with summarise (regardless of group_by). Expected output is five p-values (one for each level of id1)
df %>% group_by(id1) %>% summarise(w=wilcox.test(x~id2, data=., paired=FALSE)$p.value)
df %>% summarise(wilcox.test(x~id2, data=., paired=FALSE))
# even specifying formula argument by name doesn't help
df %>% group_by(id1) %>% summarise(w=wilcox.test(formula=x~id2, data=., paired=FALSE)$p.value)
The buggy calls yield this error:
Error in wilcox.test.formula(c(1.09057358373486,
2.28465932554436, 0.885617572657959, : 'formula' missing or incorrect
Thanks for your help; I hope it will be helpful to others with similar questions as well.
Your task will be easily accomplished using the do function (call ?do after loading the dplyr library). Using your data, the chain will look like this:
df <- data.frame(x=abs(rnorm(50)),id1=rep(1:5,10), id2=rep(1:2,25))
df <- tbl_df(df)
res <- df %>% group_by(id1) %>%
do(w = wilcox.test(x~id2, data=., paired=FALSE)) %>%
summarise(id1, Wilcox = w$p.value)
output
res
Source: local data frame [5 x 2]
id1 Wilcox
(int) (dbl)
1 1 0.6904762
2 2 0.4206349
3 3 1.0000000
4 4 0.6904762
5 5 1.0000000
Note I added the do function between the group_by and summarize.
I hope it helps.
You can do this with base R (although the result is a cumbersome list):
by(df, df$id1, function(x) { wilcox.test(x~id2, data=x, paired=FALSE)$p.value })
or with dplyr:
ddply(df, .(id1), function(x) { wilcox.test(x~id2, data=x, paired=FALSE)$p.value })
id1 V1
1 1 0.3095238
2 2 1.0000000
3 3 0.8412698
4 4 0.6904762
5 5 0.3095238

Resources