So I've started to dip my toes into the wonderful world of dplyr programming. I'm trying to write a function that accepts a data.frame, a target column, and any number of grouping columns (all given as bare column names). The function bins the data based on the target column and counts the number of entries in each bin. I want a separate set of bin counts for every combination of the grouping variables present in my original data.frame, so I'm using the complete() and nesting() functions to do this. Here's an example of what I'm trying to do and the error I'm running into:
library(dplyr)
library(tidyr)
#Prepare test data
set.seed(42)
test_data =
  data.frame(Gene_ID = rep(paste0("Gene.", 1:10), times = 4),
             Comparison = rep(c("WT_vs_Mut1", "WT_vs_Mut2"), each = 10, times = 2),
             Test_method = rep(c("T-test", "MannWhitney"), each = 20),
             P_value = runif(40))
#Perform operation manually
test_data %>%
  #Start by binning the data according to p-value
  mutate(Probability.bin = cut(P_value,
                               breaks = c(-Inf, seq(0.1, 1, by = 0.1), Inf),
                               labels = c(seq(0.0, 1.0, by = 0.1)),
                               right = FALSE)) %>%
  #Now summarize the results by bin.
  count(Comparison, Test_method, Probability.bin) %>%
  #Fill in any missing bins with 0 counts
  complete(nesting(Comparison, Test_method), Probability.bin,
           fill = list(n = 0))
#Create function that accepts bare column names
bin_by_p_value <- function(df,
                           pvalue_col, #Bare name of p-value column
                           ...) {      #Bare names of grouping columns
  #"Quote" column names so they are ready for use below
  pvalue_col_name <- enquo(pvalue_col)
  group_by_cols <- quos(...)
  #Perform the operation
  df %>%
    #Start by binning the data according to p-value
    mutate(Probability.bin = cut(UQ(pvalue_col_name),
                                 breaks = c(-Inf, seq(0.1, 1, by = 0.1), Inf),
                                 labels = c(seq(0.0, 1.0, by = 0.1)),
                                 right = FALSE)) %>%
    #Now summarize the results by bin.
    count(UQS(group_by_cols), Probability.bin) %>%
    #Fill in any missing bins with 0 counts
    complete(nesting(UQS(group_by_cols)), Probability.bin,
             fill = list(n = 0))
}
#Use function to perform operation
test_data %>%
  bin_by_p_value(P_value, Comparison, Test_method)
When I perform the operation manually, everything works fine. When I use the function, it fails with this error:
Error in overscope_eval_next(overscope, expr) :
object 'Comparison' not found
I've narrowed down the problem to the following piece of code in the function:
complete(nesting(UQS(group_by_cols)), Probability.bin...
If I remove the call to nesting(), the code executes without the error. However, I want to keep the behavior where I use only the combinations of the grouping variables that are present in the original data, and then cross those with all possible bins, so I can fill in the missing bins. Based on the error and where it occurs, my guess is that this is a scoping/environment issue, where I really should use a different environment for the grouping variables in nesting(), since it's contained inside the call to complete(). However, I'm new enough to dplyr programming that I'm not sure how to do that.
I tried to work around this by uniting the grouping columns into a single column, and then using that united column as input into complete(). This lets me perform the complete() operation the way I want to, while avoiding the nesting() function. However, I ran into trouble when I wanted to separate back into the original grouping columns, since I don't know how to convert a list of quosures into a character vector (required for the "into" parameter of separate()). Here are code snippets to illustrate what I'm talking about:
#Fill in any missing bins with 0 counts
unite(Merged_grouping_cols, UQS(group_by_cols), sep = "*") %>%
  complete(Merged_grouping_cols, Probability.bin,
           fill = list(n = 0)) %>%
  separate(Merged_grouping_cols, into = c("What goes here?"), sep = "\\*")
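For reference, a list of quosures can be converted to the character vector that separate()'s "into" parameter needs; a minimal sketch, assuming the group_by_cols list captured with quos() above and that purrr is available:
#Turn each captured quosure back into its column name as a string
#(rlang::quo_name() handles simple column symbols)
group_col_names <- purrr::map_chr(group_by_cols, rlang::quo_name)
#...then: separate(Merged_grouping_cols, into = group_col_names, sep = "\\*")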
Here's the pertinent version info: R version 3.4.2 (2017-09-28), tidyr_0.7.2, dplyr_0.7.4
I'd appreciate any workarounds, but I want to know what I'm doing that's rubbing complete() and nesting() the wrong way.
Use curly-curly ({{ }}) for pvalue_col.
Pass the dots (...) directly to count().
Use ensyms() with !!! in nesting().
bin_by_p_value <- function(df,
                           pvalue_col, #Bare name of p-value column
                           ...) {      #Bare names of grouping columns
  #Perform the operation
  df %>%
    #Start by binning the data according to p-value
    mutate(Probability.bin = cut({{ pvalue_col }},
                                 breaks = c(-Inf, seq(0.1, 1, by = 0.1), Inf),
                                 labels = c(seq(0.0, 1.0, by = 0.1)),
                                 right = FALSE)) %>%
    #Now summarize the results by bin.
    count(..., Probability.bin) %>%
    #Fill in any missing bins with 0 counts
    complete(nesting(!!!ensyms(...)), Probability.bin, fill = list(n = 0))
}
test_data %>% bin_by_p_value(P_value, Comparison, Test_method)
# A tibble: 44 x 4
#    Comparison Test_method Probability.bin     n
#    <chr>      <chr>       <fct>           <dbl>
#  1 WT_vs_Mut1 MannWhitney 0                   1
#  2 WT_vs_Mut1 MannWhitney 0.1                 1
#  3 WT_vs_Mut1 MannWhitney 0.2                 0
#  4 WT_vs_Mut1 MannWhitney 0.3                 1
#  5 WT_vs_Mut1 MannWhitney 0.4                 1
#  6 WT_vs_Mut1 MannWhitney 0.5                 1
#  7 WT_vs_Mut1 MannWhitney 0.6                 0
#  8 WT_vs_Mut1 MannWhitney 0.7                 0
#  9 WT_vs_Mut1 MannWhitney 0.8                 1
# 10 WT_vs_Mut1 MannWhitney 0.9                 4
# … with 34 more rows
Testing the output, with the result of the manual call stored in res:
identical(res, test_data %>% bin_by_p_value(P_value, Comparison, Test_method))
#[1] TRUE
I am attempting to use the lsa::cosine function to derive cosine values between vectors distributed across successive rows of a dataframe. My raw dataframe has 15 numeric columns, so each row is a 15-item vector.
My challenge is to create a new variable (e.g., cosineraw) that reflects cosine(vec1, vec2), where vec1 is the vector for Row1 and vec2 is the vector for the next row (the lead). I need this to work over the rows of very large dataframes and am attempting to avoid a for loop. Essentially, I need to compute a cosine value for each row contrasted with the next row, stopping at the second-to-last row of the dataframe (since there is no cosine value for the last observation).
I've tried selecting observations rowwise:
dat <- mydat %>% rowwise %>% mutate(cosraw = cosine(as.vector(t(select_all))), as.vector(t(lead(select_all))))
but am getting an 'argument is not a matrix' error
In isolation, this code snippet works:
maybe <- lsa::cosine(as.vector(t(dat[2,])), as.vector(t(dat[1,])))
The problem is that the row index must be relative. This only works for row 1 vs. row 2, not as the basis for a function rolling across all rows.
Is there a way to do this avoiding a 'for' loop?
Here's a base R solution:
# Load {lsa}
library(lsa)
# Generate data with 250k rows and 300 columns
gen_list <- lapply(1:250000, function(i){
  rnorm(300)
})
# Convert to matrix
mat <- t(simplify2array(gen_list))
# Obtain desired values
vals <- unlist(
  lapply(
    2:nrow(mat), function(i){
      cosine(mat[i-1, ], mat[i, ])
    }
  )
)
You can ignore the gen_list code as this was to generate example data.
You will want to convert your data frame to a matrix to make it compatible with the {lsa} package.
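For example, the coercion is a one-liner; a minimal sketch, assuming a hypothetical mydat holding only the numeric vector columns:
# mydat: a data frame whose columns are all numeric
mat <- as.matrix(mydat)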
It runs quickly -- 3.39 seconds on my computer for the 250k x 300 example.
My answer is similar to Kat's, but I first packaged the 15 values of each row into a list and then created a new column holding the lead of that list column.
Here is some reproducible data:
library(dplyr)
library(tidyr)
library(lsa)
set.seed(1)
df <- data.frame(replicate(15,runif(10)))
The actual workflow:
df %>%
  rowwise %>%
  summarise(row_v = list(c_across())) %>%
  mutate(nextrow_v = lead(row_v)) %>%
  replace_na(list(nextrow_v = list(rep(NA, 15)))) %>% # replace NA with a list of NAs
  rowwise %>%
  summarise(cosr = cosine(unlist(row_v), unlist(nextrow_v)))
# A tibble: 10 x 1
# Rowwise:
   cosr[,1]
      <dbl>
 1    0.820
 2    0.791
 3    0.780
 4    0.785
 5    0.838
 6    0.808
 7    0.718
 8    0.743
 9    0.773
10       NA
I'm assuming that you aren't looking for vectorization either (i.e., lapply or map).
This works, but it's a bit cumbersome. I didn't have any actual data from you so I made my own.
library(lsa)
library(tidyverse)
set.seed(1)
df1 <- matrix(sample(rnorm(15 * 11, 1, .1), 15 * 10), byrow = T, ncol = 15)
Then I created a copy of the data to use as the lead, because for the mutate to work, you need to lead columnwise, but aggregate rowwise. (That doesn't sound quite right, but hopefully, you can make heads or tails of it.)
df2 <- df1
df3 <- df2[-1, ] # all but the first row
df3 <- rbind(df3, rep(NA, 15)) # fill the missing row with NA
df2 <- cbind(df2, df3) %>% as.data.frame()
So now I've got a data frame that is 30 columns wide: the first 15 are my vector; the second 15 are the lead.
df2 %>%
  rowwise %>%
  mutate(cosr = cosine(c_across(V1:V15), c_across(V16:V30))) %>%
  select(cosr) %>% unlist()
# cosr1 cosr2 cosr3 cosr4 cosr5 cosr6 cosr7 cosr8
# 0.9869402 0.9881976 0.9932426 0.9921418 0.9946119 0.9917792 0.9908216 0.9918681
# cosr9 cosr10
# 0.9972666 NA
If in doubt, you can always use a loop or vectorization to validate the numbers.
for (i in 1:(nrow(df1) - 1)) {
  v1 <- df1[i, ] %>% unlist()
  v2 <- df1[i + 1, ] %>% unlist()
  message(cosine(v1, v2))
}
invisible(
  lapply(1:(nrow(df1) - 1),
         function(i) {message(cosine(unlist(df1[i, ]),
                                     unlist(df1[i + 1, ])))}))
I would like to create a for loop to repeat the same functions for 150 variables. I am new to R and I am a bit stuck.
To give you an example of some commands I need to repeat:
N <- table(df$var1 == 0)["TRUE"]
n <- table(df$var1 == 1)["TRUE"]
PREV95 <- (svyciprop(~var1 == 1, level = 0.95, design = design, deff = "replace")*100)
I need to run the same functions for 150 columns. I know that I need to put all my columns in one vector (say x), but then I don't know how to write the loop to repeat the same commands for all my variables.
Can anyone help me to write a loop?
A word in advance: loops in R can in most cases be replaced with a faster, more R-ish way (various flavours of applying, mapping, walking ...)
Applying a function to the columns of a dataframe df:
a) with base R, using the example dataset cars:
my_function <- function(xs) max(xs)
lapply(cars, my_function)
b) tidyverse-style:
cars %>%
  summarise_all(my_function)
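As a side note, summarise_all() has since been superseded by across() in dplyr 1.0.0+; a sketch of the equivalent call:
cars %>%
  summarise(across(everything(), my_function))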
An anecdotal example: I came across an R script which took about half an hour to complete and made abundant use of for loops. Replacing the loops with vectorized functions and members of the apply family cut the execution time down to about 3 minutes. So while for loops and related constructs might be more familiar when coming from another language, they can soon get in your way in R.
The iteration chapter of Hadley Wickham's R for Data Science gives an introduction to iterating "the R way".
Here is an approach that doesn't use loops. I've created a data set called df with three factor variables to represent your dataset as you described it.
I created a function eval() that does all the work. First, it filters down to just the factor columns. Then it converts the factors to numeric variables so that the values can be summed as 0s and 1s (summing the factors directly would be based on their underlying codes of 1 and 2). Within the function I create another helper, neg(), which gives the number of 0s by subtracting the sum of the 1s from the total length of the vector. The function then creates the dataframes "n" (count of 1s), "N" (count of 0s), and PREV95, using pivot_longer() to get the data into long format so that each statistic you are looking for ends up in its own column when merged. Note that I had to comment out PREV95 because I do not have a 'design' object to pass as a parameter; remove the # to add it back in. I think the logic for PREV95 should work, but I cannot check it without a 'design' object.
Finally, left_join() combines these dataframes and returns "results" (again, the version that includes PREV95 is commented out). eval() takes your original dataframe as input and returns a dataframe rather than a list, which you'll likely find easier to work with.
library(dplyr)
library(tidyr)
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
                 Var2 = factor(sample(c(0,1), 10, TRUE)),
                 Var3 = factor(sample(c(0,1), 10, TRUE)))
eval <- function(df){
  df1 <- df %>%
    select_if(is.factor) %>%
    mutate_all(function(x) as.numeric(as.character(x)))
  neg <- function(x){
    length(x) - sum(x)
  }
  n <- df1 %>%
    summarize(across(where(is.numeric), sum)) %>%
    pivot_longer(everything(), names_to = "Var", values_to = "n")
  N <- df1 %>%
    summarize(across(where(is.numeric), function(x) neg(x))) %>%
    pivot_longer(everything(), names_to = "Var", values_to = "N")
  #PREV95 <- df1 %>%
  #  summarize(across(where(is.numeric), function(x) survey::svyciprop(~x == 1, design = design, level = 0.95, deff = "replace")*100)) %>%
  #  pivot_longer(everything(), names_to = "Var", values_to = "PREV95")
  results <- n %>%
    left_join(N, by = "Var")
  #results <- n %>%
  #  left_join(N, by = "Var") %>%
  #  left_join(PREV95, by = "Var")
  return(results)
}
eval(df)
  Var       n     N
  <chr> <dbl> <dbl>
1 Var1      2     8
2 Var2      5     5
3 Var3      4     6
If you really wanted to use a for loop, here is how to make it work. Again, I've left out the survey function due to a lack of info on the parameters to make it work.
set.seed(100)
df <- data.frame(Var1 = factor(sample(c(0,1), 10, TRUE)),
                 Var2 = factor(sample(c(0,1), 10, TRUE)),
                 Var3 = factor(sample(c(0,1), 10, TRUE)))
VarList <- names(df %>% select_if(is.factor))
results <- list()
for (var in VarList){
  results[[var]][["n"]] <- sum(df[[var]] == 1)
  results[[var]][["N"]] <- sum(df[[var]] == 0)
}
unlist(results)
Var1.n Var1.N Var2.n Var2.N Var3.n Var3.N
     2      8      5      5      4      6
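If you later want to drop the loop entirely, the same counts can be produced with sapply; a minimal sketch reusing the df and VarList defined above:
# One row per variable, with columns n (count of 1s) and N (count of 0s)
t(sapply(VarList, function(v) c(n = sum(df[[v]] == 1),
                                N = sum(df[[v]] == 0))))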
I am trying to learn purrr to simulate data using rnorm with different means, sd, and n in each iteration.
This code generates my dataframe:
parameter = crossing(n = c(60, 80, 100),
                     agegroup = c("a", "b", "c"),
                     effectsize = c(0.2, 0.5, 0.8),
                     sd = 2) %>%
  # create a simulation id number
  group_by(agegroup) %>%
  mutate(sim = row_number()) %>%
  ungroup() %>%
  mutate(# change effect size so that one group has an effect, the others d = 0
         effectsize = if_else(agegroup == "a", effectsize, 0),
         # calculate the mean for the distribution from the effect size
         mean = effectsize*sd)
Now I want to iterate over the different simulations and, for each row, generate data according to mean, sd, and n using rnorm:
# create a nested dataframe to iterate over each simulation and agegroup
nested_df = parameter %>%
  group_by(sim, agegroup, effectsize) %>%
  nest() %>% arrange(sim)
This is what my dataframe then looks like:
Now I want to create normally distributed data with the mean, sd, and n given in the "data" column:
nested_df = nested_df %>%
  mutate(data_points = pmap(data, rnorm))
However the code above gives an error that I haven't been able to find a solution to:
Error in mutate_impl(.data, dots) :
Evaluation error: unused arguments
I read the Iteration chapter in R for Data Science and googled a bunch, but I can't figure out how to combine pmap and nest. The reason I would like to use those functions is that it would make it easier to keep the parameters, simulated data, and output all in one dataframe.
You don't necessarily need to nest the parameters. For example:
parameter %>%
  # Use `pmap` because we explicitly specify three arguments
  mutate(data_points = pmap(list(n, mean, sd), rnorm))
# A tibble: 27 x 7
#       n agegroup effectsize    sd   sim  mean data_points
#   <dbl> <chr>         <dbl> <dbl> <int> <dbl> <list>
# 1    60 a               0.2     2     1   0.4 <dbl [60]>
# 2    60 a               0.5     2     2   1   <dbl [60]>
# 3    60 a               0.8     2     3   1.6 <dbl [60]>
With the nested data frame, you can use map rather than pmap:
nested_df %>%
  # Use `map` because there is really one argument, `data`,
  # but then refer to three different columns of `data`.
  mutate(data_points = map(data, ~ rnorm(.$n, .$mean, .$sd)))
First, it is okay to use pmap like this:
x <- tibble(n = 100, mean = 5, sd = 0.1)
pmap(x, rnorm)
which is very similar to using do.call:
do.call(rnorm, x)
However, if you want to use pmap inside mutate, you need to bring the inputs for the function .f into the right shape.
Writing
nested_df %>%
  mutate(y = pmap(x, f))
means that f expects input x.
In your case, rnorm expects three inputs, but only gets one.
So if you insist on nesting the inputs, you can do this:
nested_df %>%
  mutate(data_points = pmap(list(data), function(z) pmap(z, rnorm))[[1]])
or
nested_df %>%
  mutate(data_points = pmap(list(data), function(z) do.call(rnorm, z)))
However, I would recommend doing it a little bit differently:
parameter %>%
  mutate(data_points = pmap(list(n, mean, sd), rnorm))
Hope this helps a little.
I am a rookie Stata user trying to make the jump to R. I am working through various exercises, but keep getting something wrong with the group_by and subset commands.
I have a simple dataset that I wish to make group-based calculations on. I am trying to use the group_by command from the dplyr package to do this.
My dataset is called itchy and consists of 4 variables:
treat - levels A and B (type of treatment)
type - levels Dark and Fair (skin colour)
y - levels 0 and 1 (failure or success of treatment)
freq - numerical variable indicating how many observations are in this particular group
Using this code you can recreate it:
type <- c(2,2,2,2,1,1,1,1)
treat <- c(1,1,2,2,1,1,2,2)
y <- c(1,0,1,0,1,0,1,0)
freq <- c(9,17,5,20,10,15,3,20)
itchy <- cbind.data.frame(type, treat, y, freq)
itchy$type <- factor(type, levels = c(1,2), labels = c("Dark", "Fair"))
itchy$treat <- factor(treat, levels = c(1,2), labels = c("A", "B"))
itchy$y <- factor(y, levels = c(0,1), labels = c("failure", "succes"))
Now I would like to calculate the odds of success for treatments A and B when applied to skin type Dark or Fair (odds = number of successes / number of failures; for example, 10 successes and 15 failures give odds 10/15 ≈ 0.67).
I have two questions:
1) Can you help me do the odds calculations by groups?
2) I have tried various combinations of group_by and subset, without any luck. The code below shows some of my unsuccessful attempts. Can you tell me whether I have a basic misunderstanding of how the group_by and subset commands work?
itchy %>% group_by(treat, type) %>% summarize(ods = (subset(freq, y==1)/subset(freq, y==0)))
itchy %>% group_by(treat, type) %>% ods <- c((subset(freq, y==1)/subset(freq, y==0)))
itchy %>% group_by(treat, type) %>% itchy$ods <- (subset(freq, y==1)/subset(freq, y==0))
If I understand you correctly, I think the following will work. I made use of the spread function from the tidyr package, which, like dplyr, is part of the tidyverse.
library(tidyr)
itchy %>%
  spread(y, freq) %>%
  mutate(odds = succes / failure)
  type treat failure succes      odds
1 Dark     A      15     10 0.6666667
2 Dark     B      20      3 0.1500000
3 Fair     A      17      9 0.5294118
4 Fair     B      20      5 0.2500000
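As an aside, spread() has since been superseded by pivot_wider() in tidyr 1.0.0+; a sketch of the equivalent call (the succes column name follows the factor label defined in the question's data):
itchy %>%
  pivot_wider(names_from = y, values_from = freq) %>%
  mutate(odds = succes / failure)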
junk = itchy %>% group_by(y, treat, type) %>% summarize(Overall = sum(freq))
myfunc = function(arg1, arg2){
  filter(junk, treat == arg1, type == arg2)[1,4]/filter(junk, treat == arg1, type == arg2)[2,4]
}
myfunc("A","Dark") # You can try all the various combinations here
Does this give you the desired result?
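For the record, the grouped lookup the question attempted with subset() can also be written directly inside summarize(); a minimal sketch (note that y's labels in this data are "succes" and "failure"):
itchy %>%
  group_by(treat, type) %>%
  summarize(odds = freq[y == "succes"] / freq[y == "failure"])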
I want to find the rank correlation of various columns in a data.frame using dplyr.
I am sure there is a simple solution to this problem, but I think it lies in my not being able to pass two inputs to summarize_each_ in dplyr when using the cor function.
For the following df:
df <- data.frame(Universe=c(rep("A",5),rep("B",5)),AA.x=rnorm(10),BB.x=rnorm(10),CC.x=rnorm(10),AA.y=rnorm(10),BB.y=rnorm(10),CC.y=rnorm(10))
I want to get the rank correlations between all the .x and .y combinations. My problem is in the function below, where you see ????:
cor <- df %>% group_by(Universe) %>%
  summarize_each_(funs(cor(., method = 'spearman', use = "pairwise.complete.obs")), ????)
I want cor to include just the correlation pairs AA.x.AA.y, AA.x.BB.y, ... for each Universe.
Please help!
An alternative approach is to just call the cor function once since this will calculate all required correlations. Repeated calls to cor might be a performance issue for a large data set. Code to do this and extract the correlation pairs with labels could look like:
#
# calculate correlations and display in matrix format
#
cor_matrix <- df %>% group_by(Universe) %>%
  do(as.data.frame(cor(.[,-1], method="spearman", use="pairwise.complete.obs")))
#
# to add row names
#
cor_matrix1 <- cor_matrix %>%
  data.frame(row=rep(colnames(.)[-1], n_groups(.)))
#
# calculate correlations and display in column format
# (melt() below comes from the reshape2 package)
#
library(reshape2)
num_col <- ncol(df[,-1])
out_indx <- which(upper.tri(diag(num_col)))
cor_cols <- df %>% group_by(Universe) %>%
  do(melt(cor(.[,-1], method="spearman", use="pairwise.complete.obs"), value.name="cor")[out_indx,])
So here follows the winning (time-wise) solution to my problem:
d <- df %>% gather(R1, R1v, contains(".x")) %>%
  gather(R2, R2v, contains(".y"), -Universe) %>%
  group_by(Universe, R1, R2) %>%
  summarize(ICAC = cor(x = R1v, y = R2v, method = 'spearman', use = "pairwise.complete.obs")) %>%
  unite(Pair, R1, R2, sep = "_")
It takes only about 0.005 milliseconds in this example, but the runtime grows as data is added.
Try this:
library(data.table) # needed for fast melt
setDT(df) # sets by reference, fast
mdf <- melt(df[, id := 1:.N], id.vars = c('Universe','id'))
mdf %>%
  mutate(obs_set = substr(variable, 4, 4)) %>%           # ".x" or ".y" subgroup
  full_join(., ., by=c('Universe', 'obs_set', 'id')) %>% # see notes
  group_by(Universe, variable.x, variable.y) %>%
  filter(variable.x != variable.y) %>%
  dplyr::summarise(rank_corr = cor(value.x, value.y,
                                   method='spearman', use='pairwise.complete.obs'))
Produces:
   Universe variable.x variable.y rank_corr
     (fctr)     (fctr)     (fctr)     (dbl)
1         A       AA.x       BB.x      -0.9
2         A       AA.x       CC.x      -0.9
3         A       BB.x       AA.x      -0.9
4         A       BB.x       CC.x       0.8
5         A       CC.x       AA.x      -0.9
6         A       CC.x       BB.x       0.8
7         A       AA.y       BB.y      -0.3
8         A       AA.y       CC.y       0.2
9         A       BB.y       AA.y      -0.3
10        A       BB.y       CC.y      -0.3
..      ...        ...        ...       ...
Explanation:
Melt: converts the table to long form, one row per observation. To do the melt in a dplyr chain, you would have to use tidyr::gather, I believe, so pick your dependency (see the sketch after this list). Using data.table there is faster and not hard to understand. The step also creates an id for each observation, 1 to nrow(df). The rest is in dplyr, like you wanted.
Full join: joins the melted table to itself to create paired observations from all variable pairings based on common Universe and observation id (edit: and now '.x' or '.y' subgroup).
Filter: we don't need to correlate observations paired to themselves, we know those correlations = 1. If you wanted to include them for a correlation matrix or something, comment out this step.
Summarize using Spearman correlation. Note you should use dplyr::summarise since if you have plyr also loaded you might accidentally call plyr::summarise.
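For completeness, a minimal sketch of the melt step done with tidyr::gather instead of data.table, assuming the original data.frame df from the question (before setDT):
# Long form, one row per observation, with an id per original row
mdf <- df %>%
  mutate(id = row_number()) %>%
  gather(variable, value, -Universe, -id)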