I have a correlation dataset that looks like this:
V1 V2 R2
1 2 0.4
1 3 0.5
3 5 0.3
And I want to convert it to two-column data in such a way that I would have multiple x values (in column V) for one y (in column R2), for scatter plotting. It would look like this:
V R2
1 0.4
2 0.4
1 0.5
2 0.5
3 0.5
3 0.3
4 0.3
5 0.3
How can I do this in R?
In the tidyverse, you can make a list column of the required vectors with purrr::map2 to iterate seq over each pair of start and end points, and then expand with tidyr::unnest:
df <- data.frame(V1 = c(1L, 1L, 3L),
                 V2 = c(2L, 3L, 5L),
                 R2 = c(0.4, 0.5, 0.3))
library(tidyverse)
df %>% transmute(V = map2(V1, V2, seq), R2) %>% unnest()
#> R2 V
#> 1 0.4 1
#> 2 0.4 2
#> 3 0.5 1
#> 4 0.5 2
#> 5 0.5 3
#> 6 0.3 3
#> 7 0.3 4
#> 8 0.3 5
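A version note: in tidyr >= 1.0, calling unnest() with no arguments is deprecated, so you'd name the list-column explicitly (the unnested column then keeps its original position rather than moving to the end):
df %>% transmute(V = map2(V1, V2, seq), R2) %>% unnest(V)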
In base R, there isn't a simple equivalent of unnest, so it's easier to use Map (the multivariate lapply, roughly equivalent to purrr::map2 above) to build a list of data frames, complete with the R2 value (recycled by data.frame), which can then be do.call(rbind, ...)ed into a single data frame:
do.call(rbind,
        Map(function(v1, v2, r2) {data.frame(V = v1:v2, R2 = r2)},
            df$V1, df$V2, df$R2))
#> V R2
#> 1 1 0.4
#> 2 2 0.4
#> 3 1 0.5
#> 4 2 0.5
#> 5 3 0.5
#> 6 3 0.3
#> 7 4 0.3
#> 8 5 0.3
Check out the intermediate products of each to get a feel for how they work.
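For instance, the intermediate list-column from the tidyverse version should look roughly like this before unnest expands it (V holds one integer vector per row):
df %>% transmute(V = map2(V1, V2, seq), R2)
#>         V  R2
#> 1    1, 2 0.4
#> 2 1, 2, 3 0.5
#> 3 3, 4, 5 0.3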
Here is one option using data.table
library(data.table)
setDT(df)[, .(V = V1:V2, R2), by = .(grp = 1:nrow(df))][, grp := NULL][]
# V R2
#1: 1 0.4
#2: 2 0.4
#3: 1 0.5
#4: 2 0.5
#5: 3 0.5
#6: 3 0.3
#7: 4 0.3
#8: 5 0.3
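For completeness, the expansion can also be done fully vectorised in base R, with no per-row function calls; this sketch relies on the from argument of sequence(), which needs R >= 4.0:
lens <- df$V2 - df$V1 + 1 # number of values in each V1:V2 range
data.frame(V = sequence(lens, from = df$V1), R2 = rep(df$R2, lens))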
I would like to create a new df, based upon whether the second or third condition's values for each subject are greater than the first condition's.
Example df:
df1 <- data.frame(subject = rep(1:5, 3),
                  condition = rep(c("first", "second", "third"), each = 5),
                  values = c(.4, .4, .4, .4, .4, .6, .6, .6, .6, .4, .6, .6, .6, .4, .4))
> df1
subject condition values
1 1 first 0.4
2 2 first 0.4
3 3 first 0.4
4 4 first 0.4
5 5 first 0.4
6 1 second 0.6
7 2 second 0.6
8 3 second 0.6
9 4 second 0.6
10 5 second 0.4
11 1 third 0.6
12 2 third 0.6
13 3 third 0.6
14 4 third 0.4
15 5 third 0.4
The resulting df would be this:
> df2
subject condition values
1 1 first 0.4
2 2 first 0.4
3 3 first 0.4
4 4 first 0.4
6 1 second 0.6
7 2 second 0.6
8 3 second 0.6
9 4 second 0.6
11 1 third 0.6
12 2 third 0.6
13 3 third 0.6
14 4 third 0.4
Here, subject #5 does not meet the criteria: subject #5's values in the second and third conditions are not greater than its value in the first condition.
Thanks.
We may group by 'subject' and filter to keep a group if any of the second or third 'values' is greater than the 'first':
library(dplyr)
df1 %>%
  group_by(subject) %>%
  filter(any(values[2:3] > first(values))) %>%
  ungroup()
-output
# A tibble: 12 × 3
subject condition values
<int> <chr> <dbl>
1 1 first 0.4
2 2 first 0.4
3 3 first 0.4
4 4 first 0.4
5 1 second 0.6
6 2 second 0.6
7 3 second 0.6
8 4 second 0.6
9 1 third 0.6
10 2 third 0.6
11 3 third 0.6
12 4 third 0.4
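One caveat: values[2:3] assumes the rows within each subject are ordered first/second/third. A sketch of a version that indexes by the condition column instead, and so doesn't depend on row order:
df1 %>%
  group_by(subject) %>%
  filter(any(values[condition %in% c("second", "third")] > values[condition == "first"])) %>%
  ungroup()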
Using ave.
df1[with(df1, ave(values, subject, FUN=\(x) any(x[2:3] > x[1])) == 1), ]
# subject condition values
# 1 1 first 0.4
# 2 2 first 0.4
# 3 3 first 0.4
# 4 4 first 0.4
# 6 1 second 0.6
# 7 2 second 0.6
# 8 3 second 0.6
# 9 4 second 0.6
# 11 1 third 0.6
# 12 2 third 0.6
# 13 3 third 0.6
# 14 4 third 0.4
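The \(x) lambda shorthand requires R >= 4.1; on older versions the same call works with an explicit anonymous function:
df1[with(df1, ave(values, subject, FUN = function(x) any(x[2:3] > x[1])) == 1), ]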
I have a data frame which is grouped by 'subject'. I want to filter for those 'subject' groups where all 'values' are above a certain value, here values > 0.5.
Example df:
df1 <- data.frame(subject = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                  values = c(.4, .6, .6, .6, .6, .6, .6, .6, .6, .4))
df1
subject values
1 1 0.4
2 2 0.6
3 3 0.6
4 4 0.6
5 5 0.6
6 1 0.6
7 2 0.6
8 3 0.6
9 4 0.6
10 5 0.4
Desired output:
> df1
subject values
1 2 0.6
2 3 0.6
3 4 0.6
4 2 0.6
5 3 0.6
6 4 0.6
You can use all() inside a grouped filter using dplyr:
library(dplyr)
df1 %>%
  group_by(subject) %>%
  filter(all(values > .5)) %>%
  ungroup()
Output:
# A tibble: 6 x 2
subject values
<dbl> <dbl>
1 2 0.6
2 3 0.6
3 4 0.6
4 2 0.6
5 3 0.6
6 4 0.6
Using min in ave.
df1[with(df1, ave(values, subject, FUN=min)) > .5, ]
# subject values
# 2 2 0.6
# 3 3 0.6
# 4 4 0.6
# 7 2 0.6
# 8 3 0.6
# 9 4 0.6
A data.table approach:
library(data.table)
setDT(df1)[, .SD[all(values > 0.5)], by = .(subject)][]
# subject values
# 1: 2 0.6
# 2: 2 0.6
# 3: 3 0.6
# 4: 3 0.6
# 5: 4 0.6
# 6: 4 0.6
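An equivalent, arguably more idiomatic data.table spelling returns .SD only for the groups that pass the test:
setDT(df1)[, if (all(values > 0.5)) .SD, by = subject]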
I have the following data frame:
DF <- data.frame(A = c(0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.3, 0.4, 0.4),
                 B = c(1, 2, 1, 5, 10, 2, 3, 1, 6, 2),
                 C = c(1000, 50, 400, 6, 300, 2000, 20, 30, 40, 50))
and I want to filter DF so that, for each group of equal values in A, I keep the row with the maximum in B.
For example for 0.1 in A the maximum in B is 5.
Ending with the new data frame:
A B C
0.1 5 6
0.2 10 300
0.3 1 30
0.4 6 40
I am not sure if this is a problem to solve with base R or with a library, because I am thinking of using dplyr and grouping by A. Am I correct?
There are a couple of base R options:
Using subset + ave
> subset(DF, as.logical(ave(B, A, FUN = function(x) x == max(x))))
    A  B   C
4 0.1  5   6
5 0.2 10 300
8 0.3  1  30
9 0.4  6  40
Using merge + aggregate
> merge(aggregate(B ~ A, DF, max), DF)
    A  B   C
1 0.1  5   6
2 0.2 10 300
3 0.3  1  30
4 0.4  6  40
An option with data.table: group by 'A' and get the index of the row where 'B' is at its maximum with which.max, wrapped in .I to return the row index. Since we don't name the result, by default it comes back as a 'V1' column, which we extract as a vector to subset the rows of the dataset:
library(data.table)
setDT(DF)[DF[, .I[which.max(B)], A]$V1]
-output
#     A  B   C
#1: 0.1  5   6
#2: 0.2 10 300
#3: 0.3  1  30
#4: 0.4  6  40
You're right: using dplyr and grouping by A, you can use slice_max() (also from dplyr) to select the row with the max value in B for each group:
library(dplyr)
DF %>%
  group_by(A) %>%
  slice_max(B)
Output:
# A tibble: 4 x 3
# Groups: A [4]
A B C
<dbl> <dbl> <dbl>
1 0.1 5 6
2 0.2 10 300
3 0.3 1 30
4 0.4 6 40
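Note that slice_max() keeps all rows tied for the maximum by default; if you need exactly one row per group, pass with_ties = FALSE:
DF %>%
  group_by(A) %>%
  slice_max(B, n = 1, with_ties = FALSE)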
I have a simple for-loop which works as I would like on vectors. I would like to use my for-loop on a column of a dataframe, grouped by another column in the dataframe, e.g.:
# here is my for-loop working as expected on a simple vector:
vect <- c(0.5, 0.7, 0.1)
res <- vector(mode = "numeric", length = 3)
for (i in 1:length(vect)) {
  res[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
}
res
[1] 1.9411537 0.9715143 5.5456579
And here is pseudo-code trying to do it on a column of a dataframe:
#Example data
my.df <- data.frame(let = rep(LETTERS[1:3], each = 3),
                    num1 = 1:3, vect = c(0.5, 0.7, 0.1), num3 = NA)
my.df
let num1 vect num3
1 A 1 0.5 NA
2 A 2 0.7 NA
3 A 3 0.1 NA
4 B 1 0.5 NA
5 B 2 0.7 NA
6 B 3 0.1 NA
7 C 1 0.5 NA
8 C 2 0.7 NA
9 C 3 0.1 NA
# My attempt:
require(tidyverse)
my.df <- my.df %>%
  group_by(let) %>%
  mutate(for (i in 1:length(vect)) {
    num3[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
  })
What the result should look like (but my pseudo-code above doesn't work):
let num1 vect num3
1 A 1 0.5 1.9411537
2 A 2 0.7 0.9715143
3 A 3 0.1 5.5456579
4 B 1 0.5 1.9411537
5 B 2 0.7 0.9715143
6 B 3 0.1 5.5456579
7 C 1 0.5 1.9411537
8 C 2 0.7 0.9715143
9 C 3 0.1 5.5456579
I feel like I am not using tidyverse logic by trying to have a for-loop inside mutate; any suggestions much appreciated.
The simple solution is to create a custom function and pass that to mutate. A working solution:
custom_func <- function(vec) {
  res <- vector(mode = "numeric", length = length(vec))
  for (i in seq_along(vec)) {
    res[i] <- sum(exp(-2 * (vec[i] - vec[-i])))
  }
  res
}
library(tidyverse)
my.df %>%
  group_by(let) %>%
  mutate(num3 = custom_func(vect))
#> # A tibble: 9 x 4
#> # Groups: let [3]
#> let num1 vect num3
#> <fct> <int> <dbl> <dbl>
#> 1 A 1 0.5 1.94
#> 2 A 2 0.7 0.972
#> 3 A 3 0.1 5.55
#> 4 B 1 0.5 1.94
#> 5 B 2 0.7 0.972
#> 6 B 3 0.1 5.55
#> 7 C 1 0.5 1.94
#> 8 C 2 0.7 0.972
#> 9 C 3 0.1 5.55
I'm wondering whether a more elegant version of the custom function is possible - perhaps someone smarter than me can tell you whether purrr::map, for example, could provide an alternative.
We can use map_dbl from purrr and apply the formula to each index:
library(dplyr)
library(purrr)
my.df %>%
  group_by(let) %>%
  mutate(num3 = map_dbl(seq_along(vect), ~ sum(exp(-2 * (vect[.] - vect[-.])))))
# let num1 vect num3
# <fct> <int> <dbl> <dbl>
#1 A 1 0.5 1.94
#2 A 2 0.7 0.972
#3 A 3 0.1 5.55
#4 B 1 0.5 1.94
#5 B 2 0.7 0.972
#6 B 3 0.1 5.55
#7 C 1 0.5 1.94
#8 C 2 0.7 0.972
#9 C 3 0.1 5.55
You can turn your for-loop into a sapply call and then use it in mutate.
sapply applies a function to each element of a vector or list; in this case I'm looping over the row indices within each group (1:n()).
my.df %>%
  group_by(let) %>%
  mutate(num3 = sapply(1:n(), function(i) sum(exp(-2 * (vect[i] - vect[-i])))))
# A tibble: 9 x 4
# Groups: let [3]
# let num1 vect num3
# <fct> <int> <dbl> <dbl>
# 1 A 1 0.5 1.94
# 2 A 2 0.7 0.972
# 3 A 3 0.1 5.55
# 4 B 1 0.5 1.94
# 5 B 2 0.7 0.972
# 6 B 3 0.1 5.55
# 7 C 1 0.5 1.94
# 8 C 2 0.7 0.972
# 9 C 3 0.1 5.55
This is essentially equivalent to the very wrong-looking for-loop inside a mutate call. In this case, however, I'd prefer the custom function provided by A. Stam.
my.df %>%
  group_by(let) %>%
  mutate(num3 = {
    res <- numeric(length = n())
    for (i in 1:n()) {
      res[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
    }
    res
  })
You can also replace sapply with purrr's map_dbl.
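As an aside on the "more elegant version" question above, the per-group computation can be vectorised entirely with outer(): each diagonal term exp(-2 * (vect[i] - vect[i])) equals 1, so subtracting 1 from the row sums reproduces the leave-one-out sum. A sketch:
my.df %>%
  group_by(let) %>%
  mutate(num3 = rowSums(exp(-2 * outer(vect, vect, "-"))) - 1)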
Or using data.table
library(data.table)
setDT(my.df)[, num3 := unlist(lapply(seq_len(.N),
             function(i) sum(exp(-2 * (vect[i] - vect[-i]))))), let]
my.df
# let num1 vect num3
#1: A 1 0.5 1.9411537
#2: A 2 0.7 0.9715143
#3: A 3 0.1 5.5456579
#4: B 1 0.5 1.9411537
#5: B 2 0.7 0.9715143
#6: B 3 0.1 5.5456579
#7: C 1 0.5 1.9411537
#8: C 2 0.7 0.9715143
#9: C 3 0.1 5.5456579
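The unlist(lapply(...)) pair could also be written with vapply, which additionally checks that each result is a single numeric:
setDT(my.df)[, num3 := vapply(seq_len(.N),
             function(i) sum(exp(-2 * (vect[i] - vect[-i]))), numeric(1)), let]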
I have an R dataframe that looks like this
1 A 1
2 A 0.9
5 A 0.7
6 A 0.6
8 A 0.5
3 B 0.6
4 B 0.5
5 B 0.4
6 B 0.3
I'd need to fill all the gaps up to the maximum per category (second column), i.e. the result I wish to obtain is the following:
1 A 1
2 A 0.9
3 A 0.9
4 A 0.9
5 A 0.7
6 A 0.6
7 A 0.6
8 A 0.5
1 B 0.6
2 B 0.6
3 B 0.6
4 B 0.5
5 B 0.4
6 B 0.3
Basically, padding backwards when there is missing data before the first observation, and forward when the missing data is in between.
What I did is group by category:
groupby <- ddply(df, ~ group, summarise, max = max(time))
A 8
B 6
but now I'm stuck on the next steps.
We can try with data.table/zoo. Convert the 'data.frame' to a 'data.table' (setDT(df1)), expand the 'v1' column to the full sequence up to its max value grouped by 'v2', join back on 'v1' and 'v2', and then, grouped by 'v2', pad the NA elements from adjacent values using na.locf (from zoo):
library(data.table)
library(zoo)
setDT(df1)[df1[, .(v1 = seq_len(max(v1))), v2], on = c('v1', 'v2')
           ][, v3 := na.locf(na.locf(v3, na.rm = FALSE), fromLast = TRUE), by = v2][]
# v1 v2 v3
# 1: 1 A 1.0
# 2: 2 A 0.9
# 3: 3 A 0.9
# 4: 4 A 0.9
# 5: 5 A 0.7
# 6: 6 A 0.6
# 7: 7 A 0.6
# 8: 8 A 0.5
# 9: 1 B 0.6
#10: 2 B 0.6
#11: 3 B 0.6
#12: 4 B 0.5
#13: 5 B 0.4
#14: 6 B 0.3
Or using dplyr/zoo
library(dplyr)
library(zoo)
library(tidyr)
df1 %>%
  group_by(v2) %>%
  expand(v1 = seq_len(max(v1))) %>%
  left_join(., df1) %>%
  mutate(v3 = na.locf(na.locf(v3, na.rm = FALSE), fromLast = TRUE)) %>%
  select(v1, v2, v3)
# v1 v2 v3
# <int> <chr> <dbl>
#1 1 A 1.0
#2 2 A 0.9
#3 3 A 0.9
#4 4 A 0.9
#5 5 A 0.7
#6 6 A 0.6
#7 7 A 0.6
#8 8 A 0.5
#9 1 B 0.6
#10 2 B 0.6
#11 3 B 0.6
#12 4 B 0.5
#13 5 B 0.4
#14 6 B 0.3
data
df1 <- structure(list(v1 = c(1L, 2L, 5L, 6L, 8L, 3L, 4L, 5L, 6L),
                      v2 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
                      v3 = c(1, 0.9, 0.7, 0.6, 0.5, 0.6, 0.5, 0.4, 0.3)),
                 .Names = c("v1", "v2", "v3"),
                 class = "data.frame", row.names = c(NA, -9L))
library(dplyr)
library(tidyr)
library(zoo)
complete(df1, v2, v1) %>% mutate(v3 = na.locf(v3))
results in:
# A tibble: 14 × 3
      v2    v1    v3
   <chr> <int> <dbl>
1 A 1 1.0
2 A 2 0.9
3 A 3 0.9
4 A 4 0.9
5 A 5 0.7
6 A 6 0.6
7 A 8 0.5
8 B 1 0.5
9 B 2 0.5
10 B 3 0.6
11 B 4 0.5
12 B 5 0.4
13 B 6 0.3
14 B 8 0.3
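Note, though, that this doesn't quite match the requested output: complete(df1, v2, v1) only expands to v1 values observed somewhere in the data (7 is never created, while B gets an 8 it shouldn't have), and a single forward na.locf() lets A's last value leak into B's leading rows. A grouped tidyr-only sketch that should match the desired output (fill() with .direction = "downup" needs tidyr >= 1.0):
library(dplyr)
library(tidyr)
df1 %>%
  group_by(v2) %>%
  complete(v1 = seq_len(max(v1))) %>% # every integer up to each group's max
  fill(v3, .direction = "downup") %>% # forward-fill gaps, back-fill leading NAs
  ungroup()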