Correlation using funs in dplyr

I want to find the rank correlation of various columns in a data.frame using dplyr.
I am sure there is a simple solution to this problem, but I think the problem lies in me not being able to use two inputs in summarize_each_ in dplyr when using the cor function.
For the following df:
df <- data.frame(Universe=c(rep("A",5),rep("B",5)),AA.x=rnorm(10),BB.x=rnorm(10),CC.x=rnorm(10),AA.y=rnorm(10),BB.y=rnorm(10),CC.y=rnorm(10))
I want to get the rank correlations between all the .x and .y combinations. My problem is in the function below, where you see ????:
cor <- df %>% group_by(Universe) %>%
  summarize_each_(funs(cor(., method = 'spearman', use = "pairwise.complete.obs")), ????)
I want cor to just include the correlation pairs: AA.x.AA.y, AA.x.BB.y, ... for each Universe.
Please help!

An alternative approach is to call the cor function just once, since a single call calculates all the required correlations; repeated calls to cor might become a performance issue for a large data set. Code to do this and extract the correlation pairs with labels could look like:
#
# calculate correlations and display in matrix format
#
cor_matrix <- df %>% group_by(Universe) %>%
  do(as.data.frame(cor(.[, -1], method = "spearman", use = "pairwise.complete.obs")))
#
# to add row names
#
cor_matrix1 <- cor_matrix %>%
  data.frame(row = rep(colnames(.)[-1], n_groups(.)))
#
# calculate correlations and display in column format
#
library(reshape2) # for melt()
num_col <- ncol(df[, -1])
out_indx <- which(upper.tri(diag(num_col)))
cor_cols <- df %>% group_by(Universe) %>%
  do(melt(cor(.[, -1], method = "spearman", use = "pairwise.complete.obs"),
          value.name = "cor")[out_indx, ])

So here follows the winning (time-wise) solution to my problem:
library(tidyr) # gather(), unite()
d <- df %>%
  gather(R1, R1v, contains(".x")) %>%
  gather(R2, R2v, contains(".y"), -Universe) %>%
  group_by(Universe, R1, R2) %>%
  summarize(ICAC = cor(x = R1v, y = R2v, method = 'spearman',
                       use = "pairwise.complete.obs")) %>%
  unite(Pair, R1, R2, sep = "_")
Although it takes only about 0.005 milliseconds on this example, the runtime grows as data is added.
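If you want to check the timing on your own data, here is a minimal sketch using the microbenchmark package (an extra dependency, and assuming dplyr and tidyr are loaded as above):
# time the gather-based pipeline; adjust times= as needed
library(microbenchmark)
microbenchmark(
  gathered = df %>%
    gather(R1, R1v, contains(".x")) %>%
    gather(R2, R2v, contains(".y"), -Universe) %>%
    group_by(Universe, R1, R2) %>%
    summarize(ICAC = cor(x = R1v, y = R2v, method = 'spearman',
                         use = "pairwise.complete.obs")) %>%
    unite(Pair, R1, R2, sep = "_"),
  times = 100L
)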

Try this:
library(data.table) # needed for fast melt
setDT(df) # sets by reference, fast
mdf <- melt(df[, id := 1:.N], id.vars = c('Universe','id'))
mdf %>%
  mutate(obs_set = substr(variable, 4, 4)) %>%              # ".x" or ".y" subgroup
  full_join(., ., by = c('Universe', 'obs_set', 'id')) %>%  # see notes
  group_by(Universe, variable.x, variable.y) %>%
  filter(variable.x != variable.y) %>%
  dplyr::summarise(rank_corr = cor(value.x, value.y,
                                   method = 'spearman', use = 'pairwise.complete.obs'))
Produces:
Universe variable.x variable.y rank_corr
(fctr) (fctr) (fctr) (dbl)
1 A AA.x BB.x -0.9
2 A AA.x CC.x -0.9
3 A BB.x AA.x -0.9
4 A BB.x CC.x 0.8
5 A CC.x AA.x -0.9
6 A CC.x BB.x 0.8
7 A AA.y BB.y -0.3
8 A AA.y CC.y 0.2
9 A BB.y AA.y -0.3
10 A BB.y CC.y -0.3
.. ... ... ... ...
Explanation:
Melt: converts the table to long form, one row per observation. To do the melt in a dplyr chain you would have to use tidyr::gather, I believe, so pick your dependency; a rough gather equivalent is sketched after this list. Using data.table here is faster and not hard to understand. The step also creates an id for each observation, 1 to nrow(df). The rest is in dplyr, like you wanted.
Full join: joins the melted table to itself to create paired observations from all variable pairings based on common Universe and observation id (edit: and now '.x' or '.y' subgroup).
Filter: we don't need to correlate observations paired to themselves, we know those correlations = 1. If you wanted to include them for a correlation matrix or something, comment out this step.
Summarize using Spearman correlation. Note you should use dplyr::summarise since if you have plyr also loaded you might accidentally call plyr::summarise.
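For reference, here is the rough tidyr equivalent of the melt step mentioned above (a sketch, assuming df has not yet been converted with setDT(); note that gather() leaves variable as character rather than factor):
# long form via tidyr::gather instead of data.table::melt,
# with an id per original row
library(tidyr)
mdf <- df %>%
  mutate(id = row_number()) %>%
  gather(variable, value, -Universe, -id)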

Related

Deriving cosine values for vector contrasts distributed over rows in a dataframe (rows to individual vectors)

I am attempting to use the lsa::cosine function to derive cosine values between vectors distributed across successive rows of a dataframe. My raw dataframe has 15 numeric columns, so each row is a 15-item vector.
My challenge is to create a new variable (e.g., cosineraw) that reflects cosine(vec1, vec2), where vec1 is the vector for Row1 and vec2 is the vector for the next row (lead). I need this to work over the rows of very large dataframes and am attempting to avoid a for loop. Essentially, I need to compute a cosine value for each row contrasted with the next row, stopping at the second-to-last row of the dataframe (since there is no cosine value for the last observation).
I've tried selecting observations rowwise:
dat <- mydat %>% rowwise %>% mutate(cosraw = cosine(as.vector(t(select_all))), as.vector(t(lead(select_all))))
but am getting an 'argument is not a matrix' error
In isolation, this code snippet works:
maybe <- lsa::cosine(as.vector(t(dat[2,])), as.vector(t(dat[1,])))
The problem is that the row index must be relative. This only works for row1 vs. row2, not as the basis for a function rolling across all rows.
Is there a way to do this avoiding a 'for' loop?
Here's a base R solution:
# Load {lsa}
library(lsa)
# Generate data with 250k rows and 300 columns
gen_list <- lapply(1:250000, function(i) {
  rnorm(300)
})
# Convert to matrix
mat <- t(simplify2array(gen_list))
# Obtain desired values: cosine of each row against the previous row
vals <- unlist(
  lapply(2:nrow(mat), function(i) {
    cosine(mat[i - 1, ], mat[i, ])
  })
)
You can ignore the gen_list code as this was to generate example data.
You will want to convert your data frame to a matrix to make it compatible with the {lsa} package.
It runs quickly -- 3.39 seconds on my computer.
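For your own data, the conversion and a timing check might look like this sketch (mydat is a placeholder for your real 15-column data frame):
# hypothetical: convert the data frame to a matrix, then time the
# row-by-row cosine computation
mat <- as.matrix(mydat)
system.time(
  vals <- unlist(
    lapply(2:nrow(mat), function(i) cosine(mat[i - 1, ], mat[i, ]))
  )
)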
My answer is similar to Kat's, but I first packaged the 15 row values into a list and then created a new column holding the lead of that list column.
Here is some reproducible data:
library(dplyr)
library(tidyr)
library(lsa)
set.seed(1)
df <- data.frame(replicate(15,runif(10)))
The actual workflow:
df %>%
  rowwise %>%
  summarise(row_v = list(c_across())) %>%
  mutate(nextrow_v = lead(row_v)) %>%
  replace_na(list(nextrow_v = list(rep(NA, 15)))) %>% # replace NA with a list of NAs
  rowwise %>%
  summarise(cosr = cosine(unlist(row_v), unlist(nextrow_v)))
# A tibble: 10 x 1
# Rowwise:
cosr[,1]
<dbl>
1 0.820
2 0.791
3 0.780
4 0.785
5 0.838
6 0.808
7 0.718
8 0.743
9 0.773
10 NA
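The cosr[,1] header shows that cosine() returns a 1x1 matrix, so summarise() stores a matrix column. A small hedged tweak, wrapping the call in as.numeric(), yields a plain numeric column:
# same pipeline, flattening the 1x1 matrix that cosine() returns
df %>%
  rowwise %>%
  summarise(row_v = list(c_across())) %>%
  mutate(nextrow_v = lead(row_v)) %>%
  replace_na(list(nextrow_v = list(rep(NA, 15)))) %>%
  rowwise %>%
  summarise(cosr = as.numeric(cosine(unlist(row_v), unlist(nextrow_v))))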
I'm assuming that you aren't looking for vectorization either (i.e., lapply or map).
This works, but it's a bit cumbersome. I didn't have any actual data from you so I made my own.
library(lsa)
library(tidyverse)
set.seed(1)
df1 <- matrix(sample(rnorm(15 * 11, 1, .1), 15 * 10), byrow = T, ncol = 15)
Then I created a copy of the data to use as the lead, because for the mutate to work you need to lead columnwise but aggregate rowwise. (That doesn't sound quite right, but hopefully you can make heads or tails of it.)
df2 <- df1
df3 <- df2[-1, ] # all but the first row
df3 <- rbind(df3, rep(NA, 15)) # fill the missing row with NA
df2 <- cbind(df2, df3) %>% as.data.frame()
So now I've got a data frame that is 30 columns wide: the first 15 are my vector; the second 15 are the lead.
df2 %>%
  rowwise %>%
  mutate(cosr = cosine(c_across(V1:V15), c_across(V16:V30))) %>%
  select(cosr) %>%
  unlist()
# cosr1 cosr2 cosr3 cosr4 cosr5 cosr6 cosr7 cosr8
# 0.9869402 0.9881976 0.9932426 0.9921418 0.9946119 0.9917792 0.9908216 0.9918681
# cosr9 cosr10
# 0.9972666 NA
If in doubt, you can always use a loop or vectorization to validate the numbers.
for (i in 1:(nrow(df1) - 1)) {
  v1 <- df1[i, ] %>% unlist()
  v2 <- df1[i + 1, ] %>% unlist()
  message(cosine(v1, v2))
}
invisible(
  lapply(1:(nrow(df1) - 1),
         function(i) { message(cosine(unlist(df1[i, ]),
                                      unlist(df1[i + 1, ]))) }))

Calculate cumulative mean for dataset randomized 100 times

I have a dataset, and I would like to randomize the order of this dataset 100 times and calculate the cumulative mean each time.
# example data
library(dplyr)
ID <- seq.int(1, 100)
val <- rnorm(100)
df <- cbind(ID, val) %>%
  as.data.frame()
I already know how to calculate the cumulative mean using the function "cummean()" in dplyr.
df2 <- df %>%
mutate(cm = cummean(val))
However, I don't know how to randomize the dataset 100 times and apply the cummean() function to each iteration of the dataframe. Any advice on how to do this would be greatly appreciated.
I realize this could probably be solved via either a loop, or in tidyverse, and I'm open to either solution.
Additionally, if possible, I'd like to include a column that indicates which iteration the data was produced from (i.e., randomization #1, #2, ..., #100), as well as include the "ID" value, which indicates how many data values were included in the cumulative mean. Thanks in advance!
Here is an approach using the purrr package. Also, I am not sure what cummean is calculating (maybe someone can share that in the comments), so I included an alternative, the column cm2, as a comparison.
library(tidyverse)
set.seed(2000)
num_iterations <- 100
num_sample <- 100
1:num_iterations %>%
  map_dfr(
    function(i) {
      tibble(
        iteration = i,
        id = 1:num_sample,
        val = rnorm(num_sample),
        cm = cummean(val),
        cm2 = cumsum(val) / seq_along(val)
      )
    }
  )
You can use mutate to create 100 shuffled samples and then call cummean:
library(dplyr)
library(purrr)
df %>% mutate(map_dfc(1:100, ~cummean(sample(val))))
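The call above leaves the 100 new columns with default names. If you also want the iteration label and the ID column the question asks for, one hedged sketch is to name the columns and pivot to long form:
# name each shuffled cumulative-mean column, then reshape so each row
# records the randomization number ("iteration") and keeps ID
library(tidyr)
df %>%
  mutate(purrr::map_dfc(1:100, ~cummean(sample(val))) %>%
           rlang::set_names(paste0("iter_", 1:100))) %>%
  pivot_longer(starts_with("iter_"),
               names_to = "iteration", names_prefix = "iter_",
               values_to = "cm")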
We may use rerun from purrr
library(dplyr)
library(purrr)
f1 <- function(dat, valcol) {
  dat %>%
    sample_n(size = n()) %>%
    mutate(cm = cummean({{valcol}}))
}
n <- 100
out <- rerun(n, f1(df, val))
The output of rerun is a list, which we can name with a sequence; if we need to create a new column identifying each iteration while binding, use bind_rows with .id:
out1 <- bind_rows(out, .id = 'ID')
> head(out1)
ID val cm
1 1 0.3376980 0.33769804
2 1 -1.5699384 -0.61612019
3 1 1.3387892 0.03551628
4 1 0.2409634 0.08687807
5 1 0.7373232 0.21696708
6 1 -0.8012491 0.04726439
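A side note: rerun() was deprecated in purrr 1.0.0, so on newer installations the same result can be had with map() (a sketch, reusing f1 and n from above):
# equivalent of rerun(n, f1(df, val)) on purrr >= 1.0.0
out <- map(seq_len(n), ~ f1(df, val))
out1 <- bind_rows(out, .id = 'ID')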

dplyr programming: unquote-splicing causes overscope error with complete() and nesting()

So I've started to dip my toes into the wonderful world of dplyr programming. I'm trying to write a function that accepts a data.frame, a target column, and any number of grouping columns (using bare names for all columns). The function will then bin the data based on the target column and count the number of entries in each bin. I want to keep a separate bin size for every combination of the grouping variables present in my original data.frame(), so I'm using the complete() and nesting() functions to do this. Here's an example of what I'm trying to do and the error I'm running into:
library(dplyr)
library(tidyr)
#Prepare test data
set.seed(42)
test_data =
  data.frame(Gene_ID = rep(paste0("Gene.", 1:10), times=4),
             Comparison = rep(c("WT_vs_Mut1", "WT_vs_Mut2"), each=10, times=2),
             Test_method = rep(c("T-test", "MannWhitney"), each=20),
             P_value = runif(40))
#Perform operation manually
test_data %>%
  #Start by binning the data according to q-value
  mutate(Probability.bin = cut(P_value,
                               breaks = c(-Inf, seq(0.1, 1, by=0.1), Inf),
                               labels = c(seq(0.0, 1.0, by=0.1)),
                               right = FALSE)) %>%
  #Now summarize the results by bin.
  count(Comparison, Test_method, Probability.bin) %>%
  #Fill in any missing bins with 0 counts
  complete(nesting(Comparison, Test_method), Probability.bin,
           fill = list(n = 0))
#Create function that accepts bare column names
bin_by_p_value <- function(df,
                           pvalue_col, #Bare name of p-value column
                           ...) {      #Bare names of grouping columns
  #"Quote" column names so they are ready for use below
  pvalue_col_name <- enquo(pvalue_col)
  group_by_cols <- quos(...)
  #Perform the operation
  df %>%
    #Start by binning the data according to q-value
    mutate(Probability.bin = cut(UQ(pvalue_col_name),
                                 breaks = c(-Inf, seq(0.1, 1, by=0.1), Inf),
                                 labels = c(seq(0.0, 1.0, by=0.1)),
                                 right = FALSE)) %>%
    #Now summarize the results by bin.
    count(UQS(group_by_cols), Probability.bin) %>%
    #Fill in any missing bins with 0 counts
    complete(nesting(UQS(group_by_cols)), Probability.bin,
             fill = list(n = 0))
}
#Use function to perform operation
test_data %>%
bin_by_p_value(P_value, Comparison, Test_method)
When I perform the operation manually, everything works fine. When I use the function, it fails with this error:
Error in overscope_eval_next(overscope, expr) :
object 'Comparison' not found
I've narrowed down the problem to the following piece of code in the function:
complete(nesting(UQS(group_by_cols)), Probability.bin...
If I remove the call to nesting(), the code executes without the error. However, I want to maintain the behaviour where only the combinations of grouping variables present in the original data are used, and then get all possible combinations of those with the bins, so I can fill in all of the missing bins. Based on the error name and where this is failing, my guess is that this is a scoping/environment issue: I probably need a different environment for the grouping variables in nesting(), since it's contained inside the call to complete(). However, I'm new enough to dplyr programming that I'm not sure how to do that.
I tried to work around this by uniting the grouping columns into a single column and then using that united column as input to complete(). This lets me perform the complete() operation the way I want while avoiding the nesting() function. However, I ran into trouble when I wanted to separate back into the original grouping columns, since I don't know how to convert a list of quosures into a character vector (required for the into parameter of separate()). Here are code snippets to illustrate what I'm talking about:
#Fill in any missing bins with 0 counts
unite(Merged_grouping_cols, UQS(group_by_cols), sep="*") %>%
  complete(Merged_grouping_cols, Probability.bin,
           fill = list(n = 0)) %>%
  separate(Merged_grouping_cols, into = c("What goes here?"), sep = "\\*")
Here's the pertinent version info: R version 3.4.2 (2017-09-28), tidyr_0.7.2, dplyr_0.7.4
I'd appreciate any workarounds, but I want to know what I'm doing that's rubbing complete() and nesting() the wrong way.
Use curly-curly ({{ }}) for pvalue_col.
Pass the dots (...) directly to count().
Use ensyms() with !!! in nesting().
bin_by_p_value <- function(df,
                           pvalue_col, #Bare name of p-value column
                           ...) {      #Bare names of grouping columns
  #Perform the operation
  df %>%
    #Start by binning the data according to q-value
    mutate(Probability.bin = cut({{pvalue_col}},
                                 breaks = c(-Inf, seq(0.1, 1, by=0.1), Inf),
                                 labels = c(seq(0.0, 1.0, by=0.1)),
                                 right = FALSE)) %>%
    #Now summarize the results by bin.
    count(..., Probability.bin) %>%
    #Fill in any missing bins with 0 counts
    complete(nesting(!!!ensyms(...)), Probability.bin, fill = list(n = 0))
}
test_data %>% bin_by_p_value(P_value, Comparison, Test_method)
# A tibble: 44 x 4
# Comparison Test_method Probability.bin n
# <chr> <chr> <fct> <dbl>
# 1 WT_vs_Mut1 MannWhitney 0 1
# 2 WT_vs_Mut1 MannWhitney 0.1 1
# 3 WT_vs_Mut1 MannWhitney 0.2 0
# 4 WT_vs_Mut1 MannWhitney 0.3 1
# 5 WT_vs_Mut1 MannWhitney 0.4 1
# 6 WT_vs_Mut1 MannWhitney 0.5 1
# 7 WT_vs_Mut1 MannWhitney 0.6 0
# 8 WT_vs_Mut1 MannWhitney 0.7 0
# 9 WT_vs_Mut1 MannWhitney 0.8 1
#10 WT_vs_Mut1 MannWhitney 0.9 4
# … with 34 more rows
Testing the output, assuming the result of the manual call is stored in res:
identical(res, test_data %>% bin_by_p_value(P_value, Comparison, Test_method))
#[1] TRUE
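As an aside, the unite()/separate() workaround in the question stalled on converting a list of quosures into a character vector for separate()'s into argument. A minimal sketch of that conversion:
# turn quosures captured with quos(...) into plain column-name strings
group_by_cols <- quos(Comparison, Test_method)
purrr::map_chr(group_by_cols, rlang::as_name)
#> [1] "Comparison"  "Test_method"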

apply function to grouped rows in dataframe [duplicate]

This question already has answers here:
Split dataframe using two columns of data and apply common transformation on list of resulting dataframes
(3 answers)
Closed 5 years ago.
I have created a function that computes a number of biological statistics, such as species range edges. Here is a simplified version of the function:
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
  cent_lat <- weighted.mean(x=rangedf[,lat], w=rangedf[,weighting], na.rm=T)
  cent_lon <- weighted.mean(x=rangedf[,lon], w=rangedf[,weighting], na.rm=T)
  out <- data.frame(cent_lat, cent_lon)
  return(out)
}
I would like to apply this to a large dataframe where every row is an observation of a species. As such, I want the function to group rows by a specified set of columns and then compute these statistics for each group. Here is a test dataframe:
library(data.table)
LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- data.table(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)
I want to apply this function to every unique combination of species and year to calculate summary statistics, i.e., the following:
testresult <- testdf %>%
group_by(SPECIES, YEAR) %>%
range_stats(lat="LATITUDE",lon="LONGITUDE",weighting="BIOMASS",na.rm=T)
However, the code above does not work (I get a "(list) object cannot be coerced to type 'double'" error), and I am not sure how else to approach the problem.
Since you added the dplyr and purrr tags, I assume you are interested in a tidyverse solution, so below I will demonstrate a solution based on the tidyverse.
First, your range_stats is problematic; this is why you got the error message. weighted.mean expects a vector for both the x and the w arguments. However, if rangedf is a tibble, subsetting it the way you do, with rangedf[, lat], still returns a one-column tibble. A better way is to use pull from the dplyr package.
library(tidyverse)
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
  cent_lat <- weighted.mean(x = rangedf %>% pull(lat),
                            w = rangedf %>% pull(weighting), na.rm=T)
  cent_lon <- weighted.mean(x = rangedf %>% pull(lon),
                            w = rangedf %>% pull(weighting), na.rm=T)
  out <- data.frame(cent_lat, cent_lon)
  return(out)
}
Next, the way you created the data frame is OK, but data.table is from the data.table package, and you created a data.table, not a tibble. Since you want an approach from the tidyverse, I changed data.table to data_frame as follows.
LATITUDE <- c(27.91977, 21.29066, 26.06340, 28.38918, 25.97517, 27.96313)
LONGITUDE <- c(-175.8617, -157.8645, -173.9593, -178.3571, -173.9679, -175.7837)
BIOMASS <- c(4.3540488, 0.2406332, 0.2406332, 2.1419699, 0.3451426, 1.0946017)
SPECIES <- c('Abudefduf abdominalis','Abudefduf abdominalis','Abudefduf abdominalis','Chaetodon lunulatus','Chaetodon lunulatus','Chaetodon lunulatus')
YEAR <- c('2005', '2005', '2014', '2009', '2009', '2015')
testdf <- data_frame(LATITUDE, LONGITUDE, BIOMASS, SPECIES, YEAR)
Now, you said you want to apply the range_stats function to each combination of SPECIES and YEAR. One approach is to split the data frame into a list of data frames and use an lapply family function; a sketch of that route follows right below. But here I want to show you how to use the map family to achieve this task, as map is from the purrr package, which is part of the tidyverse.
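For reference, the split/lapply route might look like this minimal sketch (assuming the corrected range_stats above; note that the group labels end up in the row names rather than in columns):
# split on the two grouping columns, apply range_stats to each piece,
# then bind the pieces back together
stats_list <- lapply(
  split(testdf, list(testdf$SPECIES, testdf$YEAR), drop = TRUE),
  range_stats,
  lat = "LATITUDE", lon = "LONGITUDE", weighting = "BIOMASS"
)
do.call(rbind, stats_list)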
We can first create a group index based on SPECIES and YEAR.
testdf2 <- testdf %>%
  mutate(Group = group_indices(., SPECIES, YEAR))
testdf2
# A tibble: 6 x 6
LATITUDE LONGITUDE BIOMASS SPECIES YEAR Group
<dbl> <dbl> <dbl> <chr> <chr> <int>
1 27.91977 -175.8617 4.3540488 Abudefduf abdominalis 2005 1
2 21.29066 -157.8645 0.2406332 Abudefduf abdominalis 2005 1
3 26.06340 -173.9593 0.2406332 Abudefduf abdominalis 2014 2
4 28.38918 -178.3571 2.1419699 Chaetodon lunulatus 2009 3
5 25.97517 -173.9679 0.3451426 Chaetodon lunulatus 2009 3
6 27.96313 -175.7837 1.0946017 Chaetodon lunulatus 2015 4
As you can see, Group is a new column showing the index number. Now we can split the data frame based on Group, and then use map_dfr to apply the range_stats function.
testresult <- testdf2 %>%
  split(.$Group) %>%
  map_dfr(range_stats, lat = "LATITUDE", lon = "LONGITUDE",
          weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
testresult
Group cent_lat cent_lon
1 1 27.57259 -174.9191
2 2 26.06340 -173.9593
3 3 28.05418 -177.7480
4 4 27.96313 -175.7837
Notice that map_dfr automatically binds the output list of data frames into a single data frame. .id = "Group" means we want to create a column called Group based on the names of the list elements.
I separated the process into two steps, but of course they can all go in one pipeline, as follows.
testresult <- testdf %>%
  mutate(Group = group_indices(., SPECIES, YEAR)) %>%
  split(.$Group) %>%
  map_dfr(range_stats, lat = "LATITUDE", lon = "LONGITUDE",
          weighting = "BIOMASS", na.rm = TRUE, .id = "Group")
If you want, testresult can be merged back with the data using left_join (a sketch follows), but I will stop here as testresult is probably already the desired output. I hope this helps.
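For completeness, that merge could look like the following sketch (the Group column created by .id comes back as character, so align the types before joining):
# join the per-group stats back onto the observation-level data in testdf2
testdf2 %>%
  left_join(testresult %>% mutate(Group = as.integer(Group)),
            by = "Group")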
Fundamentally, the main issue involves weighted.mean(), where you are passing a dataframe object and not a vector that can be coerced to double. To fix this within the method, simply change:
x=rangedf[,lat]
To double brackets:
x=rangedf[[lat]]
Adjusted method:
range_stats <- function(rangedf, lat, lon, weighting, na.rm=T){
  cent_lat <- weighted.mean(x=rangedf[[lat]], w=rangedf[[weighting]], na.rm=T)
  cent_lon <- weighted.mean(x=rangedf[[lon]], w=rangedf[[weighting]], na.rm=T)
  out <- data.frame(cent_lat, cent_lon)
  return(out)
}
As for the overall group-by-slice computation, forgive me for bypassing dplyr and data.table, which you use, and consider base R's underutilized but useful by() method.
The challenge with your current setup is that range_stats returns a data.frame of two columns, whereas dplyr's group_by()/summarise() expects a single-vector aggregation. by(), however, passes dataframe objects (sliced by factors) into a defined function, returning a list of data.frames which you can then rbind into one final dataframe:
df_List <- by(testdf, testdf[, c("SPECIES", "YEAR")], FUN = function(df)
  data.frame(species = df$SPECIES[1],
             year = df$YEAR[1],
             range_stats(df, "LATITUDE", "LONGITUDE", "BIOMASS"))
)
finaldf <- do.call(rbind, df_List)
finaldf
# species year cent_lat cent_lon
# 1 Abudefduf abdominalis 2005 27.57259 -174.9191
# 2 Chaetodon lunulatus 2009 28.05418 -177.7480
# 3 Abudefduf abdominalis 2014 26.06340 -173.9593
# 4 Chaetodon lunulatus 2015 27.96313 -175.7837

Parallel wilcox.test using group_by and summarise

There must be an R-ly way to call wilcox.test over multiple observations in parallel using group_by. I've spent a good deal of time reading up on this but still can't figure out a call to wilcox.test that does the job. Example data and code below, using magrittr pipes and summarize().
library(dplyr)
library(magrittr)
# create a data frame where x is the dependent variable, id1 is a category variable (here with five levels), and id2 is a binary category variable used for the two-sample wilcoxon test
df <- data.frame(x=abs(rnorm(50)),id1=rep(1:5,10), id2=rep(1:2,25))
# make sure piping and grouping are called correctly, with "sum" function as a well-behaving example function
df %>% group_by(id1) %>% summarise(s=sum(x))
df %>% group_by(id1,id2) %>% summarise(s=sum(x))
# make sure wilcox.test is called correctly
wilcox.test(x~id2, data=df, paired=FALSE)$p.value
# yet, cannot call wilcox.test within pipe with summarise (regardless of group_by). Expected output is five p-values (one for each level of id1)
df %>% group_by(id1) %>% summarise(w=wilcox.test(x~id2, data=., paired=FALSE)$p.value)
df %>% summarise(wilcox.test(x~id2, data=., paired=FALSE))
# even specifying formula argument by name doesn't help
df %>% group_by(id1) %>% summarise(w=wilcox.test(formula=x~id2, data=., paired=FALSE)$p.value)
The buggy calls yield this error:
Error in wilcox.test.formula(c(1.09057358373486,
2.28465932554436, 0.885617572657959, : 'formula' missing or incorrect
Thanks for your help; I hope it will be helpful to others with similar questions as well.
Your task is easily accomplished using the do function (call ?do after loading the dplyr library). Using your data, the chain looks like this:
df <- data.frame(x=abs(rnorm(50)),id1=rep(1:5,10), id2=rep(1:2,25))
df <- tbl_df(df)
res <- df %>% group_by(id1) %>%
  do(w = wilcox.test(x~id2, data=., paired=FALSE)) %>%
  summarise(id1, Wilcox = w$p.value)
Output:
res
Source: local data frame [5 x 2]
id1 Wilcox
(int) (dbl)
1 1 0.6904762
2 2 0.4206349
3 3 1.0000000
4 4 0.6904762
5 5 1.0000000
Note I added the do step between the group_by and the summarise.
I hope it helps.
You can do this with base R (although the result is a cumbersome list):
by(df, df$id1, function(x) { wilcox.test(x~id2, data=x, paired=FALSE)$p.value })
or with ddply from the plyr package (note: ddply is a plyr function, not dplyr):
library(plyr)
ddply(df, .(id1), function(x) { wilcox.test(x~id2, data=x, paired=FALSE)$p.value })
id1 V1
1 1 0.3095238
2 2 1.0000000
3 3 0.8412698
4 4 0.6904762
5 5 0.3095238
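For what it's worth, here is a hedged sketch that sidesteps the formula interface entirely and works inside a plain group_by/summarise (assuming id2 takes exactly the values 1 and 2):
# compare the two id2 subsamples within each id1 group directly
df %>%
  group_by(id1) %>%
  summarise(w = wilcox.test(x[id2 == 1], x[id2 == 2],
                            paired = FALSE)$p.value)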
