I want time series correlations in a grouped data frame. Here's a sample dataset:
x <- cbind(expand.grid(type = letters[1:4], time = 1:4, kind = letters[5:8]), value = rnorm(64)) %>% arrange(type, time, kind)
which produces 64 rows with the variables type, time, kind, and value.
I want a time series correlation of the values for each kind grouped by type. Think of each type and time combination as an ordered vector of 4 values. I group by type and time, then arrange by kind, then remove kind.
y <- x %>% group_by(type) %>% arrange(type, time, kind) %>% select(-kind)
I can then group y by type and time and nest, so that all the values are together in the data list-column, then regroup by type only and create a new variable ahead which is the lead of data.
z <- y %>% group_by(type, time) %>% nest(value) %>% group_by(type) %>% mutate(ahead = lead(data))
Now I want to run mutate(R = cor(data, ahead)), but I can't seem to get the syntax correct.
I've also tried mutate(R = cor(data$value, ahead$value)) and mutate(R = cor(data[1]$value, ahead[1]$value)), to no avail.
The error I get from cor is: supply both 'x' and 'y' or a matrix-like 'x'.
How do I reference the data and ahead variables as vectors to run with cor?
Ultimately, I'm looking for a 16 row data frame with columns type, time, and R where R is a single correlation value.
Thank you for your attention.
We can use map2_dbl from purrr to pass data and ahead simultaneously to the cor function.
library(dplyr)
z %>%
mutate(R = purrr::map2_dbl(data, ahead, cor)) %>%
select(-data, -ahead)
# type time R
# <fct> <int> <dbl>
# 1 a 1 0.358
# 2 a 2 -0.0498
# 3 a 3 -0.654
# 4 a 4 1
# 5 b 1 -0.730
# 6 b 2 0.200
# 7 b 3 -0.928
# 8 b 4 1
# 9 c 1 0.358
#10 c 2 0.485
#11 c 3 -0.417
#12 c 4 1
#13 d 1 0.140
#14 d 2 -0.448
#15 d 3 -0.511
#16 d 4 1
In base R, we can use mapply
z$R <- mapply(cor, z$data, z$ahead)
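To end up with the asker's 16-row frame of type, time and R (a small follow-up sketch, not part of the original answer):
res <- as.data.frame(z[, c("type", "time", "R")])
# in the last row of each group, lead() left ahead empty, so cor() falls back to
# its one-argument form and returns 1, as seen in the output above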
I have 1 dataframe of data and multiple "reference" dataframes. I'm trying to automate checking whether the values of the dataframe match the values of the reference dataframes. Importantly, the values must also be in the same order as the values in the reference dataframes. These are the columns of importance, but my real dataset contains many more columns.
Below is a toy dataset.
Dataframe
group type value
1 A Teddy
1 A William
1 A Lars
2 B Dolores
2 B Elsie
2 C Maeve
2 C Charlotte
2 C Bernard
Reference_A
type value
A Teddy
A William
A Lars
Reference_B
type value
B Elsie
B Dolores
Reference_C
type value
C Maeve
C Hale
C Bernard
For example, in the toy dataset, group 1 would score 1.0 (100% correct) for A because all its values match the values and order of Reference_A. However, group 2 would score 0.0 for B because the values are out of order compared to Reference_B, and 0.66 for C because 2/3 of the values match the values and order of Reference_C.
Desired output
group type score
1 A 1.0
2 B 0.0
2 C 0.66
This was helpful, but does not take order into account:
Check whether values in one data frame column exist in a second data frame
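To illustrate why a plain membership check ignores order (a one-line sketch):
all(c("Dolores", "Elsie") %in% c("Elsie", "Dolores"))
# [1] TRUE -- even though the order differs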
Update: Thank you to everyone who has provided solutions! These solutions are great for the toy dataset, but have not yet been adaptable to datasets with more columns. Again, as I wrote in my post, the columns listed above are the ones of importance; I'd prefer not to drop the unneeded columns if possible.
We may also do this with mget to return a list of data.frames, bind them together, and then take a grouped mean of a logical vector:
library(dplyr)
mget(ls(pattern = '^Reference_[A-Z]$')) %>%
bind_rows() %>%
bind_cols(Dataframe) %>%
group_by(group, type = type...1) %>%
summarise(score = mean(value...2 == value...5))
# Groups: group [2]
# group type score
# <int> <chr> <dbl>
#1 1 A 1
#2 2 B 0
#3 2 C 0.667
This is another tidyverse solution. Here, I add a counter (i.e. a rowname) to both the reference tables and the data. Then I join them together on type and rowname. Finally, I summarise on type to get the desired output.
library(dplyr)
library(purrr)
library(tibble)
list(Reference_A, Reference_B, Reference_C) %>%
map(rownames_to_column) %>%
bind_rows %>%
left_join({Dataframe %>%
group_split(type) %>%
map(rownames_to_column) %>%
bind_rows},
. , by = c("type", "rowname")) %>%
group_by(type) %>%
dplyr::summarise(group = head(group,1),
score = sum(value.x == value.y)/n())
#> # A tibble: 3 x 3
#> type group score
#> <chr> <int> <dbl>
#> 1 A 1 1
#> 2 B 2 0
#> 3 C 2 0.667
Here's a "tidy" method:
library(dplyr)
# library(purrr) # map2_dbl
Reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
nest_by(type, .key = "ref") %>%
ungroup()
Reference
# # A tibble: 3 x 2
# type ref
# <chr> <list<tbl_df[,1]>>
# 1 A [3 x 1]
# 2 B [2 x 1]
# 3 C [3 x 1]
Dataframe %>%
nest_by(group, type, .key = "data") %>%
left_join(Reference, by = "type") %>%
mutate(
score = purrr::map2_dbl(data, ref, ~ {
    if (nrow(.x) == 0 || nrow(.y) == 0) return(NA_real_)  # map2_dbl() needs a length-1 result
    if (nrow(.x) != nrow(.y)) return(0)
    sum((is.na(.x$value) & is.na(.y$value)) | .x$value == .y$value) / nrow(.x)
})
) %>%
select(-data, -ref) %>%
ungroup()
# # A tibble: 3 x 3
# group type score
# <int> <chr> <dbl>
# 1 1 A 1
# 2 2 B 0
# 3 2 C 0.667
I analyse a data set from an experiment and would like to calculate effect sizes for each variable. My dataframe consists of multiple variables (= columns) for 8 treatments t (= rows), with t1 - t4 being the controls for t5 - t8, respectively (t1 is the control for t5, t2 for t6, ...). The original data set is much larger, so I would like to solve the following two tasks:
I would like to calculate log(treatment/control) for each of t5 - t8 for one variable, e.g. the effect size for t5 = log(t5/t1), for t6 = log(t6/t2), ... . The resulting column should be named variablename_effect, and the new column would have only 4 rows instead of 8.
The trickiest part is that I need to encode which rows belong together, so that the correct control is used for each treatment.
I would like to calculate the effect sizes for all my variables with one piece of code, i.e. create multiple new columns with the correct names (variablename_effect).
I would prefer to solve the problem in dplyr or base R to keep it simple.
So far, the only related question I found was /r-dplyr-mutate-refer-new-column-itself (which shows a combination of multiple ifelse() calls). I would be very thankful for a solution, links to similar questions, or pointers to which packages I should use in case it's not possible within dplyr / base R!
Sample data:
df <- data.frame("treatment" = c(1:8), "Var1" = c(9:16), "Var2" = c(17:24))
Edit: this is the df_effect I would expect to receive as an output, thanks #Martin_Gal for the hint!
df_effect <- data.frame("treatment" = c(5:8), "Var1_effect" = c(log(13/9), log(14/10), log(15/11), log(16/12)), "Var2_effect" = c(log(21/17), log(22/18), log(23/19), log(24/20)))
My ideas so far:
For calculating the effect size:
mutate() and a for loop:
# 1st option:
for (i in 5:8) {
dt_effect <- df %>%
mutate(Var1_effect = log(df[i, "Var1"]/df[i - 4, "Var1"]))
}
#2nd option:
for (i in 5:8){
dt_effect <- df %>%
mutate(Var1_effect = log(df[treatment == i , "Var1"]/df[treatment == i - 4 , "Var1"]))
}
problem: both return the result for i = 8 for every row!
mutate() and ifelse():
df_effect <- df %>%
mutate(Var1_effect = ifelse(treatment >= 5, log(df[, "Var1"]/df[ , "Var1"]), NA))
seems to work, but so far I couldn't implement which row to pick for the control, so it returns NA for t1 - t4 (correct) and 0 for t5 - t8 (mathematically correct as I calculate log(t5/t5), ... but not what I want).
maybe I should use summarise() instead of mutate() because I create fewer rows than in my original dataframe?
Make this work for every variable at the same time
My only idea would be to index the columns within a second for loop and use paste() to create the new column names, but I don't know exactly how to do this ...
I don't know if this will solve your problem, but I want to make a suggestion similar to Limey:
library(dplyr)
library(tidyr)
df %>%
mutate(control = 1 - (treatment-1) %/% (nrow(.)/2),
group = ifelse(treatment %% (nrow(.)/2) == 0, nrow(.)/2, treatment %% (nrow(.)/2))) %>%
select(-treatment) %>%
pivot_wider(names_from = c(control), values_from=c(Var1, Var2)) %>%
group_by(group) %>%
mutate(Var1_effect = log(Var1_0/Var1_1))
This yields
# A tibble: 4 x 6
# Groups: group [4]
group Var1_1 Var1_0 Var2_1 Var2_0 Var1_effect
<dbl> <int> <int> <int> <int> <dbl>
1 1 9 13 17 21 0.368
2 2 10 14 18 22 0.336
3 3 11 15 19 23 0.310
4 4 12 16 20 24 0.288
What happened here?
I assumed the first half of your data.frame holds the control treatments for the second half. So I created an indicator variable (control) and a grouping variable (group) based on the treatment IDs.
Now the treatment id isn't used anymore, so I dropped it.
Next I used pivot_wider to create a dataset with Var1_1 (i.e. Var1 for your control variable) and Var1_0 (i.e. Var1 for your "ordinary" variable).
Finally I calculated Var1_effect per group.
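To also cover the second task (all variables at once), here is one possible extension of the code above. The name-matching loop is my own sketch, not part of the original answer, and it assumes the Var*_0/Var*_1 column names produced by pivot_wider:
library(dplyr)
library(tidyr)
widened <- df %>%
  mutate(control = 1 - (treatment - 1) %/% (nrow(.)/2),
         group = ifelse(treatment %% (nrow(.)/2) == 0, nrow(.)/2, treatment %% (nrow(.)/2))) %>%
  select(-treatment) %>%
  pivot_wider(names_from = control, values_from = c(Var1, Var2))
# pair every Var*_0 (treatment) column with its Var*_1 (control) counterpart
vars <- sub("_0$", "", grep("_0$", names(widened), value = TRUE))
for (v in vars) {
  widened[[paste0(v, "_effect")]] <- log(widened[[paste0(v, "_0")]] / widened[[paste0(v, "_1")]])
}
widened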
In response to OP's comment to #MartinGal 's solution (which is perfectly fine in its own right):
First convert the input data to a more convenient form:
# Original input dataset
df <- data.frame("treatment" = c(1:8), "Var1" = c(9:16), "Var2" = c(17:24))
# Revised input dataset
library(dplyr)
library(tidyr)   # pivot_longer(), pivot_wider()
library(tibble)  # add_column()
revisedDF <- df %>%
select(-treatment) %>%
add_column(
Treatment=rep(c("Control", "Test"), each=4),
Experiment=rep(1:4, times=2)
) %>%
pivot_longer(
names_to="Variable",
values_to="Value",
cols=c(Var1, Var2)
) %>%
arrange(Experiment, Variable, Treatment)
revisedDF %>% head(6)
Giving
# A tibble: 6 x 4
Treatment Experiment Variable Value
<chr> <int> <chr> <int>
1 Control 1 Var1 9
2 Test 1 Var1 13
3 Control 1 Var2 17
4 Test 1 Var2 21
5 Control 2 Var1 10
6 Test 2 Var1 14
I like this format because it makes the analysis code completely independent of the number of variables, the number of experiments, and the number of treatments.
The analysis is straightforward, too:
result <- revisedDF %>% pivot_wider(
names_from=Treatment,
values_from=Value
) %>%
mutate(Effect=log(Test/Control))
result
Giving
Experiment Variable Control Test Effect
<int> <chr> <int> <int> <dbl>
1 1 Var1 9 13 0.368
2 1 Var2 17 21 0.211
3 2 Var1 10 14 0.336
4 2 Var2 18 22 0.201
5 3 Var1 11 15 0.310
6 3 Var2 19 23 0.191
7 4 Var1 12 16 0.288
8 4 Var2 20 24 0.182
pivot_wider and pivot_longer are relatively new tidyr verbs. If you're unable to use the most recent version of the package, spread and gather do the same job with slightly different argument names.
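For reference, here is a sketch of the same two reshapes with the older verbs, using the df and revisedDF objects from above (my illustration, not part of the original answer):
library(tidyr)
# pivot_longer(names_to = "Variable", values_to = "Value", cols = c(Var1, Var2)) is roughly:
gathered <- df %>% gather(key = "Variable", value = "Value", Var1, Var2)
# pivot_wider(names_from = Treatment, values_from = Value) is roughly:
widened <- revisedDF %>% spread(key = Treatment, value = Value)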
I am interested in biodiversity index calculations using the vegan package. The Simpson index works, but the Shannon argument returns no results. I was hoping somebody knows the solution.
What I have tried: I converted my data.frame into the vegan package test data format using the code below.
library(vegan)   # diversity()
library(labdsv)  # matrify()
Plot <- c(1, 1, 2, 2, 3, 3, 3)
species <- c("Aa", "Aa", "Aa", "Bb", "Bb", "Rr", "Xx")
count <- c(3, 2, 1, 4, 2, 5, 7)
veganData <- data.frame(Plot, species, count)
matrify(veganData)
diversity(veganData, "simpson")
diversity(veganData, "shannon", base = exp(1))
1. I get the following results, so I think it produces all the Simpson indices:
> diversity(veganData,"simpson")
simpson.D simpson.I simpson.R
1 1.00 0.00 1.0
2 0.60 0.40 1.7
3 0.35 0.65 2.8
2. But when I run the Shannon index, I get the following message:
> diversity(veganData,"shannon")
data frame with 0 columns and 3 rows
I am not sure why it's not working. Do we need to make any changes to the data formatting when switching methods?
Your data need to be in the wide format. Also, the counts must be either totals or averages (not repeated counts for the same plot).
library(dplyr); library(tidyr)
df <- veganData %>%
group_by(Plot, species) %>%
summarise(count = sum(count)) %>%
ungroup %>%
spread(species, count, fill=0)
df
# # A tibble: 3 x 5
# Plot Aa Bb Rr Xx
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 5 0 0 0
# 2 2 1 4 0 0
# 3 3 0 2 5 7
diversity(df[,-1], "shannon")
# [1] 0.0000000 0.5004024 0.9922820
To check that the calculation is correct, note that the Shannon index is computed as -1 * sum(Pi * ln(Pi)):
# For plot 3:
-1*(
(2/(2+5+7))*log((2/(2+5+7))) + #Pi*lnPi of Bb
(5/(2+5+7))*log((5/(2+5+7))) + #Pi*lnPi of Rr
(7/(2+5+7))*log((7/(2+5+7))) #Pi*lnPi of Xx
)
# [1] 0.992282
This question is similar to one asked earlier, but not quite the same. I would like to iterate through a large dataset (~500,000 rows) and, for each unique value in one column, do some processing of all the values in another column.
Here is code that I have confirmed to work:
df = matrix(nrow=783,ncol=2)
counts = table(csvdata$value)
p = (as.vector(counts))/length(csvdata$value)
D = 1 - sum(p**2)
The only problem with it is that it returns the value D for the entire dataset, rather than returning a separate D value for each set of rows where ID is the same.
Say I had data with an ID column and a value column (as in the sample data reproduced in the answer below). How would I do the same thing as the code above, but return a D value for each group of rows sharing an ID, rather than for the entire dataset? I imagine this requires a loop and a matrix to store the D values, with ID in one column and D in the other, but I'm not sure.
Ok, let's work with "In short, I would like whatever is in the for loop to be executed for each block of data with a unique value of "ID"".
In general you can group rows by values in one column (e.g. "ID") and then perform some transformation based on values/entries in other columns per group. In the tidyverse this would look like this
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(value.mean = mean(value))
## A tibble: 8 x 3
## Groups: ID [3]
# ID value value.mean
# <fct> <int> <dbl>
#1 a 13 12.6
#2 a 14 12.6
#3 a 12 12.6
#4 a 13 12.6
#5 a 11 12.6
#6 b 12 15.5
#7 b 19 15.5
#8 cc4 10 10.0
Here we calculate the mean of value per group, and add these values to every row. If instead you wanted to summarise values, i.e. keep only the summarised value(s) per group, you would use summarise instead of mutate.
library(tidyverse)
df %>%
group_by(ID) %>%
summarise(value.mean = mean(value))
## A tibble: 3 x 2
# ID value.mean
# <fct> <dbl>
#1 a 12.6
#2 b 15.5
#3 cc4 10.0
The same can be achieved in base R using one of tapply, ave, or by; a short sketch follows the sample data below. As far as I understand your problem statement, there is no need for a for loop. Just apply a function (per group).
Sample data
df <- read.table(text =
"ID value
a 13
a 14
a 12
a 13
a 11
b 12
b 19
cc4 10", header = T)
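Using the sample data above, here is a quick base R sketch of the same grouped computation (my illustration, not part of the original answer):
# tapply(): like summarise(), one value per group
tapply(df$value, df$ID, mean)
#    a    b  cc4
# 12.6 15.5 10.0
# by(): apply an arbitrary function to each group's sub-data.frame
by(df, df$ID, function(d) mean(d$value))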
Update
To conclude from the comments and chat, this should be what you're after.
# Sample data
set.seed(2017)
csvdata <- data.frame(
microsat = rep(c("A", "B", "C"), each = 8),
allele = sample(20, 3 * 8, replace = T))
csvdata %>%
group_by(microsat) %>%
summarise(D = 1 - sum(prop.table(table(allele))^2))
## A tibble: 3 x 2
# microsat D
# <fct> <dbl>
#1 A 0.844
#2 B 0.812
#3 C 0.812
Note that prop.table returns fractions and is shorter than your (as.vector(counts))/length(csvdata$value). Note also that you can reproduce your results for all values (irrespective of ID) if you omit the group_by line.
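For completeness, that last remark as code (same data, group_by omitted):
csvdata %>%
  summarise(D = 1 - sum(prop.table(table(allele))^2))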
A base R option would be
df$value.mean <- with(df, ave(value, ID))
I am pretty new to R, so this question may be a bit naive.
I have a tibble with several columns, and I want to create a factor (Bin) by binning the values in one of the columns into N bins, inside a pipe. However, I would like to be able to define the column to be binned at the top of the script (e.g. bin2use = 'RT'), because I want this to be flexible.
I've tried several ways of referring to the column name via this variable, but I cannot get it to work. Among others, I have tried get(), eval(), and [[]].
Simplified example code:
Subject <- c(rep(1,100), rep(2,100))
RT <- runif(200, 300, 800 )
data_st <- tibble(Subject, RT)
bin2use = 'RT'
nbin = 5
binned_data <- data_st %>%
group_by(Subject) %>%
mutate(
Bin = cut_number(get(bin2use), nbin, label = F)
)
Error in mutate_impl(.data, dots) :
non-numeric argument to binary operator
We can use non-standard evaluation with lazyeval:
library(dplyr)
library(ggplot2)
f1 <- function(colName, bin){
call <- lazyeval::interp(~cut_number(a, b, label = FALSE),
a = as.name(colName), b = bin)
data_st %>%
group_by(Subject) %>%
mutate_(.dots = setNames(list(call), "Bin"))
}
f1(bin2use, nbin)
#Source: local data frame [200 x 3]
#Groups: Subject [2]
# Subject RT Bin
# <dbl> <dbl> <int>
#1 1 752.2066 5
#2 1 353.0410 1
#3 1 676.5617 4
#4 1 493.0052 2
#5 1 532.2157 3
#6 1 467.5940 2
#7 1 791.6643 5
#8 1 333.1583 1
#9 1 342.5786 1
#10 1 637.8601 4
# ... with 190 more rows
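As an aside (my addition, not from the original answer): on current dplyr versions the deprecated mutate_()/lazyeval machinery can be replaced by the .data pronoun, which accepts a column name stored as a string:
library(dplyr)
library(ggplot2)  # cut_number()
binned_data <- data_st %>%
  group_by(Subject) %>%
  mutate(Bin = cut_number(.data[[bin2use]], nbin, label = FALSE))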