I have a helper function (say foo()) that will be run on various data frames that may or may not contain specified variables. Suppose I have
library(dplyr)
d1 <- data_frame(taxon=1,model=2,z=3)
d2 <- data_frame(taxon=2,pss=4,z=3)
The variables I want to select are
vars <- intersect(names(data),c("taxon","model","z"))
that is, I'd like foo(d1) to return the taxon, model, and z columns, while foo(d2) returns just taxon and z.
If foo contains select(data,c(taxon,model,z)) then foo(d2) fails (because d2 doesn't contain model). If I use select(data,-pss) then foo(d1) fails similarly.
I know how to do this if I retreat from the tidyverse (just return data[vars]), but I'm wondering if there's a handy way to do this either (1) with a select() helper of some sort (tidyselect::select_helpers) or (2) with tidyeval (which I still haven't found time to get my head around!)
Another option is select_if:
d2 %>% select_if(names(.) %in% c('taxon', 'model', 'z'))
# # A tibble: 1 x 2
# taxon z
# <dbl> <dbl>
# 1 2 3
select_if is superseded. Use any_of instead:
d2 %>% select(any_of(c('taxon', 'model', 'z')))
# # A tibble: 1 x 2
# taxon z
# <dbl> <dbl>
# 1 2 3
Type ?dplyr::select in R and you will find this:
These helpers select variables from a character vector:
all_of(): Matches variable names in a character vector. All names must
be present, otherwise an out-of-bounds error is thrown.
any_of(): Same as all_of(), except that no error is thrown for names
that don't exist.
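A minimal sketch of that difference, reusing the d1/d2 frames from the question (built here with tibble() rather than the deprecated data_frame()):

```r
library(dplyr)

d1 <- tibble(taxon = 1, model = 2, z = 3)
d2 <- tibble(taxon = 2, pss = 4, z = 3)

# all_of() insists every name exists -- this would throw an error,
# because d2 has no `model` column:
# d2 %>% select(all_of(c("taxon", "model", "z")))

# any_of() silently keeps whichever names are present:
d2 %>% select(any_of(c("taxon", "model", "z")))
# returns a 1 x 2 tibble with just taxon and z
```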
You can use one_of(), which gives a warning when the column is absent but otherwise selects the correct columns:
d1 %>%
select(one_of(c("taxon", "model", "z")))
d2 %>%
select(one_of(c("taxon", "model", "z")))
Using the builtin anscombe data frame for the example, noting that z is not a column in anscombe:
anscombe %>% select(intersect(names(.), c("x1", "y1", "z")))
giving:
x1 y1
1 10 8.04
2 8 6.95
3 13 7.58
4 9 8.81
5 11 8.33
6 14 9.96
7 6 7.24
8 4 4.26
9 12 10.84
10 7 4.82
11 5 5.68
My function is defined as follows, where I subset a dataframe to a specific name and return the first 5 rows.
Bestideas <- function(x) {
  topideas <- subset(Masterall, Masterall$NAME == x) %>%
    slice(1:5)
  return(topideas)
}
I would then like to apply the function to an entire df (with one column of Names), so that the function is applied to each name on the list and the results are bound into a new df containing the first five ideas from all unique names. Through research, I have arrived at the following:
bestideas_collection = lapply(UNIQUE_NAMES_DF, Bestideas) %>% bind_rows()
However, it doesn't work. It returns a dataframe with only five ideas in total, and from 5 different names. As there are 30 unique names in my list, I expected 30*5 = 150 ideas in the "bestideas_collection" variable. I get this error message:
"longer object length is not a multiple of shorter object lengthlonger object length is not a multiple of shorter object length"
Further, if I do it manually for each name, it works just as intended - which makes me think that the function works fine, and that the issue is with the lapply function.
holder <- Bestideas("NAME 1")
bestideas_collection <- bind_rows(bestideas_collection,holder)
holder <- Bestideas("NAME 2")
bestideas_collection <- bind_rows(bestideas_collection,holder)
holder <- Bestideas("NAME 3")
bestideas_collection <- bind_rows(bestideas_collection,holder)
...
Can anyone help me if I am using the function wrong, or do you have alternative methods of doing it? I have already tried with a for-loop - but it gives me the same error as with the lapply function.
I don't have your data, so I tried to reproduce your problem on a fabricated set. I was unable to do so. With a very simple case, your function works as expected.
library(dplyr)
set.seed(123)
Masterall <- data.frame(NAME = rep(LETTERS, 10), value = rnorm(260)) %>%
group_by(NAME) %>% arrange(desc(value))
UNIQUE_NAMES_DF <- LETTERS
lapply(UNIQUE_NAMES_DF, Bestideas) %>% bind_rows()
# A tibble: 130 x 2
# Groups: NAME [26]
NAME value
<chr> <dbl>
1 A 1.65
2 A 1.44
3 A 0.838
4 A 0.563
5 A 0.181
6 B 1.37
7 B 0.452
8 B 0.153
9 B -0.0450
10 B -0.0540
# ... with 120 more rows
Is your UNIQUE_NAMES_DF a data.frame? If so, that is the trouble. The lapply function expects a vector as its first input. It will accept a data.frame, but it then iterates over the columns rather than the rows, so unexpected results occur. Here is an example:
UNIQUE_NAMES_DF <- data.frame(NAME = LETTERS, other = sample(letters))
lapply(UNIQUE_NAMES_DF, Bestideas) %>% bind_rows()
# A tibble: 12 x 2
# Groups: NAME [11]
NAME value
<chr> <dbl>
1 C -0.785
2 D 0.385
3 E -0.371
4 F 1.13
5 I 1.10
6 N -0.641
7 P -1.02
8 Q -0.0341
9 U -1.07
10 X -0.0834
11 Z 1.26
12 Z -0.739
I do not know the structure of your UNIQUE_NAMES_DF, but if you just feed the column with the names into your lapply, it should work:
lapply(UNIQUE_NAMES_DF$NAME, Bestideas) %>% bind_rows()
# A tibble: 130 x 2
# Groups: NAME [26]
NAME value
<chr> <dbl>
1 A 1.65
2 A 1.44
3 A 0.838
4 A 0.563
5 A 0.181
6 B 1.37
7 B 0.452
8 B 0.153
9 B -0.0450
10 B -0.0540
# ... with 120 more rows
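As a side note (a sketch, not part of the original answer): if all that's needed is the first five rows per NAME, the lapply/bind_rows round trip can be skipped entirely with a grouped slice. Using the same fabricated data as above:

```r
library(dplyr)

set.seed(123)
Masterall <- data.frame(NAME = rep(LETTERS, 10), value = rnorm(260)) %>%
  group_by(NAME) %>% arrange(desc(value))

# slice() respects the existing grouping, so this takes the
# first five rows within each NAME in one pipeline:
bestideas_collection <- Masterall %>%
  slice(1:5) %>%
  ungroup()

nrow(bestideas_collection)  # 130 = 26 names * 5 rows each
```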
I analyse a data set from an experiment and would like to calculate effect sizes for each variable. My dataframe consists of multiple variables (= columns) for 8 treatments t (= rows), with t1 - t4 being the control for t5 - t8, respectively (t1 control for t5, t2 for t6, ... ). The original data set is way larger, so I would like to solve the following two tasks:
I would like to calculate the log(treatment/control) for each t5 - t8 for one variable, e.g. effect size for t5 = log(t5/t1), effect size for t6 = log(t6/t2), ... . The name of the resulting column should be variablename_effect and the new column would only have 4 rows instead of 8.
The most tricky part is, that I need to implement the combination of specific rows into my code, so that the correct control is used for each treatment.
I would like to calculate the effect sizes for all my variables within one code, so create multiple new columns with the correct names (variablename_effect).
I would prefer to solve the problem in dplyr or base R to keep it simple.
So far, the only related question I found was /r-dplyr-mutate-refer-new-column-itself (shows a combination of multiple ifelse()). I would be very thankful for either a solution, links to similar questions, or which packages I should use in case it's not possible within dplyr / base R!
Sample data:
df <- data.frame("treatment" = c(1:8), "Var1" = c(9:16), "Var2" = c(17:24))
Edit: this is the df_effect I would expect to receive as an output, thanks #Martin_Gal for the hint!
df_effect <- data.frame("treatment" = c(5:8), "Var1_effect" = c(log(13/9), log(14/10), log(15/11), log(16/12)), "Var2_effect" = c(log(21/17), log(22/18), log(23/19), log(24/20)))
My ideas so far:
For calculating the effect size:
mutate() and for function:
# 1st option:
for (i in 5:8) {
dt_effect <- df %>%
mutate(Var1_effect = log(df[i, "Var1"]/df[i - 4, "Var1"]))
}
#2nd option:
for (i in 5:8){
dt_effect <- df %>%
mutate(Var1_effect = log(df[treatment == i , "Var1"]/df[treatment == i - 4 , "Var1"]))
}
problem: both return the result for i = 8 for every row!
mutate() and ifelse():
df_effect <- df %>%
mutate(Var1_effect = ifelse(treatment >= 5, log(df[, "Var1"]/df[ , "Var1"]), NA))
seems to work, but so far I couldn't implement which row to pick for the control, so it returns NA for t1 - t4 (correct) and 0 for t5 - t8 (mathematically correct as I calculate log(t5/t5), ... but not what I want).
maybe I should use summarise() instead of mutate() because I create fewer rows than in my original dataframe?
Make this work for every variable at the same time
My only idea would be to index the columns within a second for function and use the paste() to create the new column names, but I don't know exactly how to do this ...
I don't know if this will solve your problem, but I want to make a suggestion similar to Limey:
library(dplyr)
library(tidyr)
df %>%
mutate(control = 1 - (treatment-1) %/% (nrow(.)/2),
group = ifelse(treatment %% (nrow(.)/2) == 0, nrow(.)/2, treatment %% (nrow(.)/2))) %>%
select(-treatment) %>%
pivot_wider(names_from = c(control), values_from=c(Var1, Var2)) %>%
group_by(group) %>%
mutate(Var1_effect = log(Var1_0/Var1_1))
This yields
# A tibble: 4 x 6
# Groups: group [4]
group Var1_1 Var1_0 Var2_1 Var2_0 Var1_effect
<dbl> <int> <int> <int> <int> <dbl>
1 1 9 13 17 21 0.368
2 2 10 14 18 22 0.336
3 3 11 15 19 23 0.310
4 4 12 16 20 24 0.288
What happened here?
I expected the first half of your data.frame to be the control variables for the second half. So I created an indicator variable and a grouping variable based on the treatment id's/numbers.
Now the treatment id isn't used anymore, so I dropped it.
Next I used pivot_wider to create a dataset with Var1_1 (i.e. Var1 for your control variable) and Var1_0 (i.e. Var1 for your "ordinary" variable).
Finally I calculated Var1_effect per group.
In response to OP's comment to @MartinGal 's solution (which is perfectly fine in its own right):
First convert the input data to a more convenient form:
# Original input dataset
df <- data.frame("treatment" = c(1:8), "Var1" = c(9:16), "Var2" = c(17:24))
# Revised input dataset
library(dplyr)
library(tibble)   # for add_column()
library(tidyr)    # for pivot_longer() / pivot_wider()
revisedDF <- df %>%
select(-treatment) %>%
add_column(
Treatment=rep(c("Control", "Test"), each=4),
Experiment=rep(1:4, times=2)
) %>%
pivot_longer(
names_to="Variable",
values_to="Value",
cols=c(Var1, Var2)
) %>%
arrange(Experiment, Variable, Treatment)
revisedDF %>% head(6)
Giving
# A tibble: 6 x 4
Treatment Experiment Variable Value
<chr> <int> <chr> <int>
1 Control 1 Var1 9
2 Test 1 Var1 13
3 Control 1 Var2 17
4 Test 1 Var2 21
5 Control 2 Var1 10
6 Test 2 Var1 14
I like this format because it makes the analysis code completely independent of the number of variables, the number of experiments and the number of treatments.
The analysis is straightforward, too:
result <- revisedDF %>% pivot_wider(
names_from=Treatment,
values_from=Value
) %>%
mutate(Effect=log(Test/Control))
result
Giving
Experiment Variable Control Test Effect
<int> <chr> <int> <int> <dbl>
1 1 Var1 9 13 0.368
2 1 Var2 17 21 0.211
3 2 Var1 10 14 0.336
4 2 Var2 18 22 0.201
5 3 Var1 11 15 0.310
6 3 Var2 19 23 0.191
7 4 Var1 12 16 0.288
8 4 Var2 20 24 0.182
pivot_wider and pivot_longer are relatively new tidyr verbs. If you're unable to use the most recent version of the package, spread and gather do the same job with slightly different argument names.
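For reference, here is a rough sketch of the older spread/gather equivalents on a small toy frame (the column names are illustrative, matching the shape of revisedDF above):

```r
library(tidyr)
library(dplyr)

long <- data.frame(Experiment = rep(1:2, each = 2),
                   Treatment  = rep(c("Control", "Test"), times = 2),
                   Value      = c(9, 13, 10, 14))

# pivot_wider(names_from = Treatment, values_from = Value) is roughly:
wide <- long %>% spread(key = Treatment, value = Value)

# pivot_longer(cols = c(Control, Test)) is roughly:
back <- wide %>% gather(key = "Treatment", value = "Value", Control, Test)
```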
I have a case where I need to apply a dynamically selected function onto a column of a tibble. In some cases, I don't want the values to change at all -- then I select the identity function I().
After applying I() the datatype of the column changes from <dbl> to <I<dbl>>. Why is that? Why is it not just double again?
library(tidyverse)
library(magrittr)  # for the %<>% assignment pipe used below
df <- tibble(x = (1:3*pi))
print(df)
# A tibble: 3 x 1
# x
# <dbl>
# 1 3.14
# 2 6.28
# 3 9.42
df %<>% mutate(x = I(x))
print(df)
# A tibble: 3 x 1
# x
# <I<dbl>>   <-- Why <I...> and not <dbl>?
# 1 3.14
# 2 6.28
# 3 9.42
How can I just get <dbl> again?
I() is not the identity function, technically (that would be identity). I() is to inhibit interpretation/conversion, saying that the component should be used "as is". Further I(...) returns an object of class "AsIs", which is and should be recognized as something unique from its non-I(...) counterpart. As for the effect of this class ... I don't know of any (though I don't use them regularly, so I might be missing something).
And you can still operate on this, it's just classed differently.
dput(1:3)
# 1:3
dput(I(1:3))
# structure(1:3, class = "AsIs")
tibble(x = (1:3*pi)) %>%
mutate(x = I(x)) %>%
mutate(y = x + 1)
# # A tibble: 3 x 2
# x y
# <I<dbl>> <I<dbl>>
# 1 3.14 4.14
# 2 6.28 7.28
# 3 9.42 10.4
though that new column is also "AsIs".
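If the goal is to get a plain <dbl> column back, one option (a sketch, not the only way) is to strip the "AsIs" class again, e.g. with as.numeric():

```r
library(dplyr)

df <- tibble(x = I(1:3 * pi))
class(df$x)   # "AsIs"

# as.numeric() drops the "AsIs" class attribute,
# so the column prints as <dbl> again:
df <- df %>% mutate(x = as.numeric(x))
class(df$x)   # "numeric"
```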
This question is similar to this one asked earlier but not quite. I would like to iterate through a large dataset (~500,000 rows) and for each unique value in one column, I would like to do some processing of all the values in another column.
Here is code that I have confirmed to work:
df = matrix(nrow=783,ncol=2)
counts = table(csvdata$value)
p = (as.vector(counts))/length(csvdata$value)
D = 1 - sum(p**2)
The only problem with it is that it returns the value D for the entire dataset, rather than returning a separate D value for each set of rows where ID is the same.
Say I had data like this:
How would I be able to do the same thing as the code above, but return a D value for each group of rows where ID is the same, rather than for the entire dataset? I imagine this requires a loop, and creating a matrix to store all the D values in with ID in one column and the value of D in the other, but not sure.
Ok, let's work with "In short, I would like whatever is in the for loop to be executed for each block of data with a unique value of "ID"".
In general you can group rows by values in one column (e.g. "ID") and then perform some transformation based on values/entries in other columns per group. In the tidyverse this would look like this
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(value.mean = mean(value))
## A tibble: 8 x 3
## Groups: ID [3]
# ID value value.mean
# <fct> <int> <dbl>
#1 a 13 12.6
#2 a 14 12.6
#3 a 12 12.6
#4 a 13 12.6
#5 a 11 12.6
#6 b 12 15.5
#7 b 19 15.5
#8 cc4 10 10.0
Here we calculate the mean of value per group, and add these values to every row. If instead you wanted to summarise values, i.e. keep only the summarised value(s) per group, you would use summarise instead of mutate.
library(tidyverse)
df %>%
group_by(ID) %>%
summarise(value.mean = mean(value))
## A tibble: 3 x 2
# ID value.mean
# <fct> <dbl>
#1 a 12.6
#2 b 15.5
#3 cc4 10.0
The same can be achieved in base R using one of tapply, ave, by. As far as I understand your problem statement there is no need for a for loop. Just apply a function (per group).
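For illustration, a sketch of those base R equivalents on the same sample data (recreated inline here): ave() recycles the group result back onto every row (mutate-like), while tapply() returns one value per group (summarise-like).

```r
df <- data.frame(ID    = c("a", "a", "a", "a", "a", "b", "b", "cc4"),
                 value = c(13, 14, 12, 13, 11, 12, 19, 10))

# ave(): per-group mean repeated on each row, like mutate()
df$value.mean <- with(df, ave(value, ID))

# tapply(): one summarised value per group, like summarise()
tapply(df$value, df$ID, mean)
#    a    b  cc4
# 12.6 15.5 10.0
```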
Sample data
df <- read.table(text =
"ID value
a 13
a 14
a 12
a 13
a 11
b 12
b 19
cc4 10", header = T)
Update
To conclude from the comments&chat, this should be what you're after.
# Sample data
set.seed(2017)
csvdata <- data.frame(
microsat = rep(c("A", "B", "C"), each = 8),
allele = sample(20, 3 * 8, replace = T))
csvdata %>%
group_by(microsat) %>%
summarise(D = 1 - sum(prop.table(table(allele))^2))
## A tibble: 3 x 2
# microsat D
# <fct> <dbl>
#1 A 0.844
#2 B 0.812
#3 C 0.812
Note that prop.table returns fractions and is shorter than your (as.vector(counts))/length(csvdata$value). Note also that you can reproduce your results for all values (irrespective of ID) if you omit the group_by line.
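A minimal illustration of the prop.table() shortcut, alongside the long form from the question:

```r
x <- c("A", "A", "A", "B")
counts <- table(x)

# long form, as in the question:
as.vector(counts) / length(x)   # 0.75 0.25

# equivalent, shorter:
prop.table(counts)              # A: 0.75, B: 0.25

# Simpson's diversity index D for this vector:
1 - sum(prop.table(counts)^2)   # 0.375
```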
A base R option would be
df1$value.mean <- with(df1, ave(value, ID))