Data and context first: The data in question is
set.seed(123)
df1 <- data.frame(A = rep(1, 4), B = c(2, 6, 4, 4), D = c(0.1, 0.2, 0.3, 0.4))
df2 <- data.frame(A = rep(1, 4), C = c(2, 4, 6, 4), D = c(0.5, 0.6, 0.7, 0.8))
so we have
> df1
A B D
1 1 2 0.1
2 1 6 0.2
3 1 4 0.3
4 1 4 0.4
and
> df2
A C D
1 1 2 0.5
2 1 4 0.6
3 1 6 0.7
4 1 4 0.8
Now, when doing
merge(df1, df2, by.x = c("A", "B"), by.y = c("A", "C"))
one gets
A B D.x D.y
1 1 2 0.1 0.5
2 1 4 0.3 0.6
3 1 4 0.3 0.8
4 1 4 0.4 0.6
5 1 4 0.4 0.8
6 1 6 0.2 0.7
because the key (A, B) = (1, 4) occurs twice in df1 and (A, C) = (1, 4) occurs twice in df2, so the merge produces all 2 x 2 = 4 combinations for that key (plus one row each for (1, 2) and (1, 6)).
The actual question: how could one solve this by randomly distributing the D.x and D.y values over the ambiguous (A, B) combinations, e.g. to get, with equal probability, either
A B D.x D.y
1 1 2 0.1 0.5
2 1 4 0.3 0.6
5 1 4 0.4 0.8
6 1 6 0.2 0.7
or
A B D.x D.y
1 1 2 0.1 0.5
3 1 4 0.3 0.8
4 1 4 0.4 0.6
6 1 6 0.2 0.7
as a result of the merge?
Using the data.table package, you could do it as follows:
library(data.table)
# i.D[sample(.N, 1L)] draws one candidate value per group; sample(i.D, 1) would
# misbehave for length-1 groups whenever the single value is >= 1 (the usual sample() gotcha)
DT <- dt1[dt2, on = c(A = "A", B = "C")][, .(i.D = i.D[sample(.N, 1L)]), by = .(A, B, D)]
which gives one of several possible results; here are two (run the code above several times to see different outcomes; note that the two (A, B) = (1, 4) groups sample their i.D independently, so they can even draw the same value):
> DT
A B D i.D
1: 1 2 0.1 0.5
2: 1 4 0.3 0.6
3: 1 4 0.4 0.8
4: 1 6 0.2 0.7
or:
> DT
A B D i.D
1: 1 2 0.1 0.5
2: 1 4 0.3 0.8
3: 1 4 0.4 0.6
4: 1 6 0.2 0.7
Although this simple solution works, it is less efficient (especially with regard to memory use), and the independent per-group sampling can reuse the same dt2 value twice. A more memory-efficient solution, which also pairs each dt1 row with a distinct dt2 row, is:
dt1[, indx := 1:.N, keyby = .(A, B)]
dt2[, indx := if(.N > 1L) sample(.N) else 1L, keyby = .(A, C)]
dt1[dt2, on = c(A = "A", B = "C", indx = "indx")]
By creating a within-group index in both datasets, and randomly permuting that index in the second dataset, you can join on the index as well. This avoids the Cartesian join that first materialises all possible combinations.
Used data:
dt1 <- data.table(A = rep(1, 4), B = c(2, 6, 4, 4), D = c(0.1, 0.2, 0.3, 0.4))
dt2 <- data.table(A = rep(1, 4), C = c(2, 4, 6, 4), D = c(0.5, 0.6, 0.7, 0.8))
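For illustration, here is one of the two possible results of the index join (which one you get depends on the indx values sampled for dt2; the output below is reconstructed by hand, so treat it as indicative):
dt1[dt2, on = c(A = "A", B = "C", indx = "indx")]
#    A B   D indx i.D
# 1: 1 2 0.1    1 0.5
# 2: 1 4 0.3    1 0.6
# 3: 1 4 0.4    2 0.8
# 4: 1 6 0.2    1 0.7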
In base R you could do:
df12 <- merge(df1, df2, by.x = c("A", "B"), by.y = c("A", "C"))
aggregate( . ~ A + B + D.x, df12, sample, 1)
which gives me the following three results in three consecutive runs of the aggregate function:
# run 1
A B D.x D.y
1 1 2 0.1 0.5
2 1 6 0.2 0.7
3 1 4 0.3 0.6
4 1 4 0.4 0.8
# run 2
A B D.x D.y
1 1 2 0.1 0.5
2 1 6 0.2 0.7
3 1 4 0.3 0.8
4 1 4 0.4 0.8
# run 3
A B D.x D.y
1 1 2 0.1 0.5
2 1 6 0.2 0.7
3 1 4 0.3 0.8
4 1 4 0.4 0.6
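Note that, as with the data.table sample() approach above, the D.y values here are drawn independently per (A, B, D.x) group, which is why run 2 can use 0.8 twice. If you want each df2 row used at most once, the index trick translates to base R as well; a sketch of mine along the lines of the data.table solution above:
# ordered within-group index for df1, randomly permuted within-group index for df2
df1$indx <- ave(seq_along(df1$A), df1$A, df1$B, FUN = seq_along)
df2$indx <- ave(seq_along(df2$A), df2$A, df2$C, FUN = function(i) sample(seq_along(i)))
merge(df1, df2, by.x = c("A", "B", "indx"), by.y = c("A", "C", "indx"))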
I have data on plant species cover at site and plot level which looks like this:
SITE PLOT SPECIES AREA
1 1 A 0.3
1 1 B 25.5
1 1 C 1.0
1 2 A 0.3
1 2 C 0.3
1 2 D 0.3
2 1 B 17.9
2 1 C 131.2
2 2 A 37.3
2 2 C 0.3
2 3 A 5.3
2 3 D 0.3
I have successfully used the following code to obtain percentage values for species at the site level:
dfnew <- merge(df1, prop.table(xtabs(AREA ~ SPECIES + SITE, df1), 2)*100)
I am now trying to find the relative proportion of each species within each plot (as a proportion of all species in the plot), with a desired output like the one below:
SITE PLOT SPECIES AREA Plot-freq
1 1 A 0.3 1.06
1 1 B 25.5 95.39
1 1 C 1.0 3.56
1 2 A 0.3 33.33
1 2 C 0.3 33.33
1 2 D 0.3 33.33
2 1 B 17.9 12.02
2 1 C 131.2 87.98
2 2 A 37.3 99.25
2 2 C 0.3 0.75
2 3 A 5.3 94.94
2 3 D 0.3 5.06
I tried adding the PLOT variable to the original code, but ended up with tiny values:
a <- merge(df1, prop.table(xtabs(AREA ~ SPECIES + PLOT + SITE, df1), 2)*100)
I have been looking at similar questions, but most of those don't have similar data and none of the solutions seem to work for me. Any help much appreciated.
data
> dput(df1)
structure(list(SITE = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
PLOT = c(1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 3, 3), SPECIES = c("A",
"B", "C", "A", "C", "D", "B", "C", "A", "C", "A", "D"), AREA = c(0.3,
25.5, 1, 0.3, 0.3, 0.3, 17.9, 131.2, 37.3, 0.3, 5.3, 0.3)), class = "data.frame", row.names = c(NA,
-12L))
I'm not sure I completely understand your calculation, but I believe you can do this:
library(dplyr)
df1 %>% group_by(SITE, PLOT) %>% mutate(Plot_freq = AREA/sum(AREA))
Output:
SITE PLOT SPECIES AREA Plot_freq
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 1 A 0.3 0.0112
2 1 1 B 25.5 0.951
3 1 1 C 1 0.0373
4 1 2 A 0.3 0.333
5 1 2 C 0.3 0.333
6 1 2 D 0.3 0.333
7 2 1 B 17.9 0.120
8 2 1 C 131. 0.880
9 2 2 A 37.3 0.992
10 2 2 C 0.3 0.00798
11 2 3 A 5.3 0.946
12 2 3 D 0.3 0.0536
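If you want percentages as in your desired output, simply scale by 100 (a trivial variant of the code above):
df1 %>%
  group_by(SITE, PLOT) %>%
  mutate(Plot_freq = round(100 * AREA / sum(AREA), 2))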
Interesting idea to merge with the prop.table! I had no luck modifying that approach either, though.
However, to avoid dplyr you can use ave to calculate the plot sums, then pipe (|>) the result onward to calculate the relative areas, like so:
transform(df1, Psum=ave(AREA, SITE, PLOT, FUN=sum)) |> transform(Plot_freq=AREA/Psum*100)
# SITE PLOT SPECIES AREA Psum Plot_freq
# 1 1 1 A 0.3 26.8 1.1194030
# 2 1 1 B 25.5 26.8 95.1492537
# 3 1 1 C 1.0 26.8 3.7313433
# 4 1 2 A 0.3 0.9 33.3333333
# 5 1 2 C 0.3 0.9 33.3333333
# 6 1 2 D 0.3 0.9 33.3333333
# 7 2 1 B 17.9 149.1 12.0053655
# 8 2 1 C 131.2 149.1 87.9946345
# 9 2 2 A 37.3 37.6 99.2021277
# 10 2 2 C 0.3 37.6 0.7978723
# 11 2 3 A 5.3 5.6 94.6428571
# 12 2 3 D 0.3 5.6 5.3571429
Note: the native pipe |> requires R >= 4.1.
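Two asides. First, the helper column can be skipped entirely, since ave already returns the group sums aligned to the rows (same result in one step):
transform(df1, Plot_freq = AREA / ave(AREA, SITE, PLOT, FUN = sum) * 100)
Second, the original xtabs/prop.table idea looks salvageable: taking proportions over the PLOT and SITE margins jointly should avoid the tiny values. A sketch of mine (untested against the original data, and relying on merge matching the table's factor levels to the numeric SITE/PLOT, as in the 2-d version above):
# margin = c(2, 3): each (PLOT, SITE) slice sums to 1 over SPECIES
merge(df1, prop.table(xtabs(AREA ~ SPECIES + PLOT + SITE, df1), c(2, 3)) * 100)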
I am having trouble reshaping my data set into a panel data set. My df looks as follows
id s1 s2 s3 s4 ct1 ct2 ret1 ret2 ret3 ret4
1 a b c d 0.5 0.5 0.6 0.7 0.8 0.5
2 c b a d 0.6 0.6 0.7 0.6 0.5 0.4
3 a c d b 0.7 0.7 0.7 0.8 0.2 0.1
I would like to reshape so it looks as follows
id s ct1 ct2 ret
1 a 0.5 0.5 0.6
1 b 0.5 0.5 0.7
1 c 0.5 0.5 0.8
1 d 0.5 0.5 0.5
2 a 0.6 0.6 0.5
2 b 0.6 0.6 0.6
2 c 0.6 0.6 0.7
2 d 0.6 0.6 0.4
3 a 0.7 0.7 0.7
3 b 0.7 0.7 0.1
3 c 0.7 0.7 0.8
3 d 0.7 0.7 0.2
I regularly reshape from wide to long, but somehow I cannot get my head around this problem.
1) base R
An option using reshape
out <- reshape(
dat,
idvar = c("id", "ct1", "ct2"),
varying = c(outer(c("s", "ret"), 1:4, paste0)),
sep = "",
direction = "long"
)
Remove the rownames and the time column:
rownames(out) <- out$time <- NULL
Result
out[order(out$id), ]
# id ct1 ct2 s ret
#1 1 0.5 0.5 a 0.6
#4 1 0.5 0.5 b 0.7
#7 1 0.5 0.5 c 0.8
#10 1 0.5 0.5 d 0.5
#2 2 0.6 0.6 c 0.7
#5 2 0.6 0.6 b 0.6
#8 2 0.6 0.6 a 0.5
#11 2 0.6 0.6 d 0.4
#3 3 0.7 0.7 a 0.7
#6 3 0.7 0.7 c 0.8
#9 3 0.7 0.7 d 0.2
#12 3 0.7 0.7 b 0.1
2) data.table
Using melt from data.table
library(data.table)
out <- melt(
setDT(dat),
id.vars = c("id", "ct1", "ct2"),
measure.vars = patterns(c("^s\\d", "^ret\\d")),
value.name = c("s", "ret")
)[, variable := NULL]
out
data
dat <- structure(list(id = 1:3, s1 = structure(c(1L, 2L, 1L), .Label = c("a",
"c"), class = "factor"), s2 = structure(c(1L, 1L, 2L), .Label = c("b",
"c"), class = "factor"), s3 = structure(c(2L, 1L, 3L), .Label = c("a",
"c", "d"), class = "factor"), s4 = structure(c(2L, 2L, 1L), .Label = c("b",
"d"), class = "factor"), ct1 = c(0.5, 0.6, 0.7), ct2 = c(0.5,
0.6, 0.7), ret1 = c(0.6, 0.7, 0.7), ret2 = c(0.7, 0.6, 0.8),
ret3 = c(0.8, 0.5, 0.2), ret4 = c(0.5, 0.4, 0.1)), .Names = c("id",
"s1", "s2", "s3", "s4", "ct1", "ct2", "ret1", "ret2", "ret3",
"ret4"), class = "data.frame", row.names = c(NA, -3L))
You could do it using spread and gather from the tidyr package (str_extract below additionally needs stringr). You will need to create a temporary id variable in order to be able to pivot the data. Note that gathering the character s columns together with the numeric ret columns coerces the gathered values to character, so you may want to finish with mutate(ret = as.numeric(ret)):
library(dplyr)
library(stringr)
library(tidyr)
df %>%
gather(key, value , -id, -ct1, -ct2) %>%
mutate(key = str_extract(key, "[:alpha:]+")) %>%
group_by(key) %>%
mutate(tmp_id = row_number()) %>%
ungroup() %>%
spread(key, value) %>%
select(id, s, ct1, ct2, ret)
Here is one way that the development version of tidyr (install with devtools::install_github("tidyverse/tidyr")) can make this a lot easier with pivot_longer. We make a spec indicating that the s columns should go into an s variable, and similarly for the ret columns. You can remove the final obs column (the digit after s or ret) if desired.
library(tidyverse)
tbl <- read_table2(
"id s1 s2 s3 s4 ct1 ct2 ret1 ret2 ret3 ret4
1 a b c d 0.5 0.5 0.6 0.7 0.8 0.5
2 c b a d 0.6 0.6 0.7 0.6 0.5 0.4
3 a c d b 0.7 0.7 0.7 0.8 0.2 0.1"
)
spec <- tibble(
`.name` = tbl %>% select(matches("^s|ret")) %>% colnames(),
`.value` = str_remove(`.name`, "\\d$"),
obs = str_extract(`.name`, "\\d")
)
tbl %>%
pivot_longer(spec = spec)
#> # A tibble: 12 x 6
#> id ct1 ct2 obs s ret
#> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 1 0.5 0.5 1 a 0.6
#> 2 1 0.5 0.5 2 b 0.7
#> 3 1 0.5 0.5 3 c 0.8
#> 4 1 0.5 0.5 4 d 0.5
#> 5 2 0.6 0.6 1 c 0.7
#> 6 2 0.6 0.6 2 b 0.6
#> 7 2 0.6 0.6 3 a 0.5
#> 8 2 0.6 0.6 4 d 0.4
#> 9 3 0.7 0.7 1 a 0.7
#> 10 3 0.7 0.7 2 c 0.8
#> 11 3 0.7 0.7 3 d 0.2
#> 12 3 0.7 0.7 4 b 0.1
Created on 2019-07-23 by the reprex package (v0.3.0)
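In released tidyr (>= 1.0.0), the spec argument moved to pivot_longer_spec(), but for this shape no spec is needed at all: the special ".value" sentinel together with names_pattern does the same job. A modernised variant of mine, not part of the original answer:
tbl %>%
  pivot_longer(
    cols = matches("^(s|ret)\\d$"),  # s1..s4 and ret1..ret4
    names_to = c(".value", "obs"),   # first regex group names the output column
    names_pattern = "([a-z]+)(\\d)"
  )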
I have a simple for-loop which works as I would like on vectors. I would like to use it on a column of a dataframe, grouped by another column of the dataframe, e.g.:
# here is my for-loop working as expected on a simple vector:
vect <- c(0.5, 0.7, 0.1)
res <- vector(mode = "numeric", length = 3)
for (i in 1:length(vect)) {
res[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
}
res
[1] 1.9411537 0.9715143 5.5456579
And here is pseudo-code trying to do it on a column of a dataframe:
#Example data
my.df <- data.frame(let = rep(LETTERS[1:3], each = 3),
num1 = 1:3, vect = c(0.5, 0.7, 0.1), num3 = NA)
my.df
let num1 vect num3
1 A 1 0.5 NA
2 A 2 0.7 NA
3 A 3 0.1 NA
4 B 1 0.5 NA
5 B 2 0.7 NA
6 B 3 0.1 NA
7 C 1 0.5 NA
8 C 2 0.7 NA
9 C 3 0.1 NA
# My attempt:
require(tidyverse)
my.df <- my.df %>%
group_by(let) %>%
mutate(for (i in 1:length(vect)) {
num3[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
})
What the result should look like (but my pseudo-code above doesn't work):
let num1 vect num3
1 A 1 0.5 1.9411537
2 A 2 0.7 0.9715143
3 A 3 0.1 5.5456579
4 B 1 0.5 1.9411537
5 B 2 0.7 0.9715143
6 B 3 0.1 5.5456579
7 C 1 0.5 1.9411537
8 C 2 0.7 0.9715143
9 C 3 0.1 5.5456579
I feel like I am not using tidyverse logic by trying to put a for-loop inside mutate; any suggestions much appreciated.
The simple solution is to create a custom function and pass that to mutate. A working solution:
custom_func <- function(vec) {
  res <- vector(mode = "numeric", length = length(vec))
  for (i in seq_along(vec)) {
    res[i] <- sum(exp(-2 * (vec[i] - vec[-i])))
  }
  res
}
library(tidyverse)
my.df %>%
group_by(let) %>%
mutate(num3 = custom_func(vect))
#> # A tibble: 9 x 4
#> # Groups: let [3]
#> let num1 vect num3
#> <fct> <int> <dbl> <dbl>
#> 1 A 1 0.5 1.94
#> 2 A 2 0.7 0.972
#> 3 A 3 0.1 5.55
#> 4 B 1 0.5 1.94
#> 5 B 2 0.7 0.972
#> 6 B 3 0.1 5.55
#> 7 C 1 0.5 1.94
#> 8 C 2 0.7 0.972
#> 9 C 3 0.1 5.55
I'm wondering whether a more elegant version of the custom function is possible - perhaps someone smarter than me can tell you whether purrr::map, for example, could provide an alternative.
We can use map_dbl from purrr and apply the formula for calculation.
library(dplyr)
library(purrr)
my.df %>%
group_by(let) %>%
mutate(num3 = map_dbl(seq_along(vect), ~ sum(exp(-2 * (vect[.] - vect[-.])))))
# let num1 vect num3
# <fct> <int> <dbl> <dbl>
#1 A 1 0.5 1.94
#2 A 2 0.7 0.972
#3 A 3 0.1 5.55
#4 B 1 0.5 1.94
#5 B 2 0.7 0.972
#6 B 3 0.1 5.55
#7 C 1 0.5 1.94
#8 C 2 0.7 0.972
#9 C 3 0.1 5.55
You can turn your for-loop into a sapply call and then use it in mutate.
sapply takes a function and applies it to each element of a vector or list. In this case I'm looping over the number of elements in each group (n()).
my.df %>%
group_by(let) %>%
mutate(num3 = sapply(1:n(), function(i) sum(exp(-2 * (vect[i] - vect[-i])))))
# A tibble: 9 x 4
# Groups: let [3]
# let num1 vect num3
# <fct> <int> <dbl> <dbl>
# 1 A 1 0.5 1.94
# 2 A 2 0.7 0.972
# 3 A 3 0.1 5.55
# 4 B 1 0.5 1.94
# 5 B 2 0.7 0.972
# 6 B 3 0.1 5.55
# 7 C 1 0.5 1.94
# 8 C 2 0.7 0.972
# 9 C 3 0.1 5.55
This is essentially equivalent to the rather questionable-looking for-loop inside a mutate call shown below; in this case, however, I'd prefer the custom function provided by A. Stam.
my.df %>%
group_by(let) %>%
mutate(num3 = {
res <- numeric(length = n())
for (i in 1:n()) {
res[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
}
res
})
You can also replace sapply with purrr's map_dbl.
Or using data.table
library(data.table)
setDT(my.df)[, num3 := unlist(lapply(seq_len(.N),
function(i) sum(exp(-2 * (vect[i] - vect[-i]))))), let]
my.df
# let num1 vect num3
#1: A 1 0.5 1.9411537
#2: A 2 0.7 0.9715143
#3: A 3 0.1 5.5456579
#4: B 1 0.5 1.9411537
#5: B 2 0.7 0.9715143
#6: B 3 0.1 5.5456579
#7: C 1 0.5 1.9411537
#8: C 2 0.7 0.9715143
#9: C 3 0.1 5.5456579
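As a closing observation of mine (not from the original answers): the sum can be factored as sum(exp(-2 * (v[i] - v[-i]))) = exp(-2 * v[i]) * (sum(exp(2 * v)) - exp(2 * v[i])), so the whole computation vectorises with no loop over i at all:
library(dplyr)
my.df %>%
  group_by(let) %>%
  # subtracting exp(2 * vect) removes the i == j term from the group sum
  mutate(num3 = exp(-2 * vect) * (sum(exp(2 * vect)) - exp(2 * vect)))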
I have a correlation dataset that looks like this:
V1 V2 R2
1 2 0.4
1 3 0.5
3 5 0.3
And I want to convert it to two-column data in such a way that each y (in column R2) is paired with every x from V1 through V2 (in column V), for scatter plotting. It would look like this:
V R2
1 0.4
2 0.4
1 0.5
2 0.5
3 0.5
3 0.3
4 0.3
5 0.3
How can I do this in R?
In the tidyverse, you can make a list column of the required vectors with purrr::map2 to iterate seq over each pair of start and end points, and then expand with tidyr::unnest:
df <- data.frame(V1 = c(1L, 1L, 3L),
V2 = c(2L, 3L, 5L),
R2 = c(0.4, 0.5, 0.3))
library(tidyverse)
df %>% transmute(V = map2(V1, V2, seq), R2) %>% unnest()
#> R2 V
#> 1 0.4 1
#> 2 0.4 2
#> 3 0.5 1
#> 4 0.5 2
#> 5 0.5 3
#> 6 0.3 3
#> 7 0.3 4
#> 8 0.3 5
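As a small aside: in tidyr >= 1.0, unnest() without arguments triggers a deprecation warning, so with current packages you would spell out the column to unnest:
df %>% transmute(V = map2(V1, V2, seq), R2) %>% unnest(V)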
In base R, there isn't a simple equivalent of unnest, so it's easier to use Map (the multivariate lapply, roughly equivalent to purrr::map2 above) to build a list of data frames, complete with the R2 value (recycled by data.frame), which can then be do.call(rbind, ...)-ed into a single data frame:
do.call(rbind,
Map(function(v1, v2, r2){data.frame(V = v1:v2, R2 = r2)},
df$V1, df$V2, df$R2))
#> V R2
#> 1 1 0.4
#> 2 2 0.4
#> 3 1 0.5
#> 4 2 0.5
#> 5 3 0.5
#> 6 3 0.3
#> 7 4 0.3
#> 8 5 0.3
Check out the intermediate products of each to get a feel for how they work.
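On R >= 4.0 there is also a loop-free base R alternative (a sketch of mine, not from the original answer), using the from argument of sequence() plus rep():
len <- df$V2 - df$V1 + 1                      # run lengths: 2, 3, 3
data.frame(V  = sequence(len, from = df$V1),  # 1:2, 1:3, 3:5 concatenated
           R2 = rep(df$R2, len))              # each R2 repeated to match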
Here is one option using data.table
library(data.table)
setDT(df1)[, .(V = V1:V2, R2), by = .(grp = 1:nrow(df1))][, grp := NULL][]
# V R2
#1: 1 0.4
#2: 2 0.4
#3: 1 0.5
#4: 2 0.5
#5: 3 0.5
#6: 3 0.3
#7: 4 0.3
#8: 5 0.3
I have an R dataframe that looks like this
1 A 1
2 A 0.9
5 A 0.7
6 A 0.6
8 A 0.5
3 B 0.6
4 B 0.5
5 B 0.4
6 B 0.3
I'd need to fill all the gaps up to the maximum per category (second column).
i.e. the result I wish to obtain is the following
1 A 1
2 A 0.9
3 A 0.9
4 A 0.9
5 A 0.7
6 A 0.6
7 A 0.6
8 A 0.5
1 B 0.6
2 B 0.6
3 B 0.6
4 B 0.5
5 B 0.4
6 B 0.3
basically, padding backwards when data are missing before the first observation, and forwards when the missing data are in between.
What I did is group by category:
groupby = ddply(df, ~fit$group,summarise, max=max(time))
A 8
B 6
but now I'm stuck on the next steps.
We can try with data.table/zoo. Convert the 'data.frame' to a 'data.table' (setDT(df1)), expand the 'v1' column to the full sequence up to its max value grouped by 'v2', join that back on 'v1' and 'v2', and then, grouped by 'v2', pad the NA elements with adjacent values using na.locf (from zoo):
library(data.table)
library(zoo)
setDT(df1)[df1[, .(v1=seq_len(max(v1))), v2], on = c('v1', 'v2')
][, v3 := na.locf(na.locf(v3, na.rm = FALSE), fromLast=TRUE), by = v2][]
# v1 v2 v3
# 1: 1 A 1.0
# 2: 2 A 0.9
# 3: 3 A 0.9
# 4: 4 A 0.9
# 5: 5 A 0.7
# 6: 6 A 0.6
# 7: 7 A 0.6
# 8: 8 A 0.5
# 9: 1 B 0.6
#10: 2 B 0.6
#11: 3 B 0.6
#12: 4 B 0.5
#13: 5 B 0.4
#14: 6 B 0.3
Or using dplyr/zoo
library(dplyr)
library(zoo)
library(tidyr)
df1 %>%
group_by(v2) %>%
expand(v1 = seq_len(max(v1))) %>%
left_join(., df1) %>%
mutate(v3 = na.locf(na.locf(v3, na.rm = FALSE), fromLast=TRUE)) %>%
select(v1, v2, v3)
# v1 v2 v3
# <int> <chr> <dbl>
#1 1 A 1.0
#2 2 A 0.9
#3 3 A 0.9
#4 4 A 0.9
#5 5 A 0.7
#6 6 A 0.6
#7 7 A 0.6
#8 8 A 0.5
#9 1 B 0.6
#10 2 B 0.6
#11 3 B 0.6
#12 4 B 0.5
#13 5 B 0.4
#14 6 B 0.3
data
df1 <- structure(list(v1 = c(1L, 2L, 5L, 6L, 8L, 3L, 4L, 5L, 6L), v2 = c("A",
"A", "A", "A", "A", "B", "B", "B", "B"), v3 = c(1, 0.9, 0.7,
0.6, 0.5, 0.6, 0.5, 0.4, 0.3)), .Names = c("v1", "v2", "v3"),
class = "data.frame", row.names = c(NA, -9L))
library(dplyr)
library(tidyr)
library(zoo)
complete(dat, V2, V1) %>% mutate(V3 = na.locf(V3))
results in:
# A tibble: 14 × 3
V2 V1 V3
<fctr> <int> <dbl>
1 A 1 1.0
2 A 2 0.9
3 A 3 0.9
4 A 4 0.9
5 A 5 0.7
6 A 6 0.6
7 A 8 0.5
8 B 1 0.5
9 B 2 0.5
10 B 3 0.6
11 B 4 0.5
12 B 5 0.4
13 B 6 0.3
14 B 8 0.3
Note that this does not quite match the desired output: complete(dat, V2, V1) only crosses the V1 values that occur somewhere in the data (so V1 = 7 is never created, while V1 = 8 is created for B), and na.locf without grouping carries A's last value (0.5) across the group boundary into rows 8 and 9.
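A grouped sketch that should reproduce the desired output (my correction, untested against the original post; fill(.direction = "downup") needs tidyr >= 1.0):
library(dplyr)
library(tidyr)
dat %>%
  group_by(V2) %>%
  complete(V1 = seq_len(max(V1))) %>%  # expand each group to 1:max(V1)
  fill(V3, .direction = "downup") %>%  # forward-fill, then back-fill, per group
  select(V1, V2, V3)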