I have a soil properties data.table with values for different locations and depths. Some values are NA, so I'd like to get the mean values considering the upper and lower layers. In the case of the top layer, I'd take the value from the next one down.
I was able to create a column indicating which are the upper and lower layers for each row, and I thought about doing a self merge. But I'm completely lost on how to proceed.
Any clues as to how to do this? Below is an example data.table and what I'd like to achieve. The example considers two locations with 3 layers, but I have multiple locations and some have more layers than others.
library(data.table)
# I was able to identify which are the bottom and top layers
# using a function to identify the neighbors
dt <- data.table(id = rep(c(1, 2), each = 3),
                 depth = c(10, 20, 30, 10, 20, 30),
                 val = c(12, 18, 11, 25, 27, 29),
                 bot_l = c(20, 30, NA, 20, 30, NA),
                 top_l = c(NA, 10, 20, NA, 10, 20))
# How can I calculate the average between the top and lower layers?
dt_desired <- data.table(id = rep(c(1, 2), each = 3),
                         depth = c(10, 20, 30, 10, 20, 30),
                         val = c(12, 18, 11, 25, 27, 29),
                         bot_l = c(20, 30, NA, 20, 30, NA),
                         top_l = c(NA, 10, 20, NA, 10, 20),
                         mean_top_bot = c(18, 11.5, 18, 27, 27, 27))
To explain a bit more:
mean_top_bot[1] = mean(val[depth == 0], val[depth == 20]). Since there is no value at depth 0, that becomes mean(NA, 18), which is 18 with na.rm = TRUE.
mean_top_bot[2] = mean(val[depth == 10], val[depth == 30]) = (12 + 11)/2 = 11.5
I calculated the mean_top_bot values by hand; that's why there were some errors :facepalm:
Solution using self merge
I was able to merge the table with itself by changing the by.x and by.y parameters. But I have a feeling that I'm doing this in the worst way possible.
dt1 <- merge(dt, dt[, .SD, .SDcols = !c('bot_l', 'top_l')],
             by.x = c('id', 'bot_l'),
             by.y = c('id', 'depth'),
             all = TRUE)[order(id, depth)]
id bot_l depth val.x top_l val.y
1: 1 20 10 12 NA 18
2: 1 30 20 18 10 11
3: 1 NA 30 11 20 NA
4: 1 10 NA NA NA 12
5: 2 20 10 25 NA 27
6: 2 30 20 27 10 29
7: 2 NA 30 29 20 NA
8: 2 10 NA NA NA 25
Are there any easier ways to do this?
It should be easier to use data.table::shift directly, without computing the "top" and "bot" layers.
dt <- data.table(id = rep(c(1, 2), each = 3),
                 depth = c(10, 20, 30, 10, 20, 30),
                 val = c(12, 18, 11, 25, 27, 29))

dt[, v := rowMeans(data.table::setDT(data.table::shift(val, c(1, -1))),
                   na.rm = TRUE),
   by = id]
Same but with magrittr:
library(magrittr)
dt[, v := data.table::shift(val, c(1, -1)) %>% data.table::setDT() %>% rowMeans(na.rm = TRUE),
by = id]
The code above takes the mean of the previous and next val for a given depth. It assumes the gap between a value and its top/bottom layers is one row, and that the data are already ordered by id and depth, as in your example.
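If the ordering is not guaranteed, a small defensive sketch (setorder() sorts the table by reference before the shift):

# sort by reference so shift() sees neighbours in depth order
data.table::setorder(dt, id, depth)
dt[, v := rowMeans(data.table::setDT(data.table::shift(val, c(1, -1))),
                   na.rm = TRUE),
   by = id]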
It took me a while to figure out, but this can be solved as well by a rolling mean:
dt[, mean_top_bot := zoo::rollapply(val, width = list(c(-1L, 1L)),
                                    FUN = mean, partial = TRUE), id][]
id depth val bot_l top_l mean_top_bot
1: 1 10 12 20 NA 18
2: 1 20 18 30 10 11.5
3: 1 30 11 NA 20 18
4: 2 10 25 20 NA 27
5: 2 20 27 30 10 27
6: 2 30 29 NA 20 27
Two characteristics of zoo::rollapply() come in handy:
The width argument alternatively takes a list of integer offsets. So, list(c(-1L, 1L)) refers to the values of the preceding and subsequent rows while omitting the current row.
With partial = TRUE, only the subset of indexes that are in range is passed to FUN. E.g., for the first row, offset -1 refers to index 0, which is out of range; therefore, only the value at index 2 (offset 1) is passed to mean(). Likewise for the last row, where only the second-to-last value is passed to mean().
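A quick way to see both behaviors in isolation is to apply the call to the first id's values (result shown as a comment):

library(zoo)

# offsets -1 and 1 pick the neighbouring values; partial = TRUE drops
# whichever offset falls out of range at the edges
rollapply(c(12, 18, 11), width = list(c(-1L, 1L)), FUN = mean, partial = TRUE)
# [1] 18.0 11.5 18.0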
Related
I have the following dataframe:
data_frame <- structure(list(position = c(10, 15, 10, 10, 25, 19, 25, 15, 20,
31, 22, 20, 10, 19), length = c(21, 15, 19, 10, 27, 19, 25, 31,
34, 31, 26, 27, 10, 19)), class = "data.frame", row.names = c(NA,
-14L))
I want to create the following dataframe from it:
data_frame2 <- structure(list(positions = c(10, 15, 10, 10, 25, 19, 25, 15, 20, 31, 22, 20, 10, 19),
                              lengths = c(21, 15, 19, 10, 27, 19, 25, 31, 34, 31, 26, 27, 10, 19),
                              desired_col1 = c(2, 1, 2, 2, 1, 3, 1, 1, 0, 2, 0, 0, 2, 3),
                              desired_col2 = c(2, 3, 2, 2, 8, 6, 8, 3, 6, 13, 7, 6, 2, 6)),
                         class = "data.frame", row.names = c(NA, -14L))
where:
"desired_col1" - informs of how many "lengths" are equal to the given value in column "position".
desired_col2 - informs of how many "lengths" are equal to or lower than the given value in column "position".
I was trying multiple times, but could not apply the function column-wise.
EDIT: I was able to create the desired column 1 by converting the columns into vectors, using sum within sapply and adding the resulting vector as a new column. Like this:
x <- data_frame$position
y <- data_frame$length
counts <- sapply(x, function(val) sum(val == y))
data_frame$desired <- counts
However, I am still looking for something more straightforward, maybe dplyr-based.
Also I am still struggling with desired column 2.
I would get the count of each length element and then join that back to the original data:
library(dplyr)
data_frame %>%
  count(length) %>%
  rename(position = length) %>%
  left_join(data_frame, ., by = "position") %>%
  mutate(n = coalesce(n, 0))
# position length n
# 1 10 21 2
# 2 15 15 1
# 3 10 19 2
# 4 10 10 2
# 5 25 27 1
# 6 19 19 3
# 7 25 25 1
# 8 15 31 1
# 9 20 34 0
# 10 31 31 2
# 11 22 26 0
# 12 20 27 0
# 13 10 10 2
# 14 19 19 3
Your sapply solution is quite good for base R, and more straightforward. Do note that "converting the columns into vectors" is skippable; the sapply can be done as a one-liner:
data_frame$desired = sapply(data_frame$position, \(x) sum(x == data_frame$length))
The dplyr solution with the join will be more efficient if your data is large.
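If in doubt, a quick benchmark sketch on your own data (using the bench package is my assumption here; check = FALSE because the two approaches return different shapes):

library(dplyr)
library(bench)

bench::mark(
  sapply = sapply(data_frame$position, function(x) sum(x == data_frame$length)),
  join = data_frame %>%
    count(length) %>%
    rename(position = length) %>%
    left_join(data_frame, ., by = "position") %>%
    mutate(n = coalesce(n, 0)),
  check = FALSE   # results differ in shape (vector vs data frame)
)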
Do be careful with your column names. Your two sample data frames and your code are quite inconsistent on length vs lengths and positions vs position.
v <- data_frame$length

data_frame %>%
  rowwise() %>%
  mutate(
    desired_col1 = sum(v == position),
    desired_col2 = sum(v <= position)
  )
Results
# A tibble: 14 × 4
# Rowwise:
position length desired_col1 desired_col2
<dbl> <dbl> <int> <int>
1 10 21 2 2
2 15 15 1 3
3 10 19 2 2
4 10 10 2 2
5 25 27 1 8
6 19 19 3 6
7 25 25 1 8
8 15 31 1 3
9 20 34 0 6
10 31 31 2 13
11 22 26 0 7
12 20 27 0 6
13 10 10 2 2
14 19 19 3 6
This is the perfect time for some combined mutate and sapply usage!
data_frame %>%
  mutate(count_col1 = sapply(position, function(x) sum(x == length)),
         count_col2 = sapply(position, function(x) sum(x >= length)))
Currently I have this code (tabyl() is from the janitor package):

library(janitor)

lc <- round(tabyl(x$Likelihood.to.Click, show_na = FALSE), 2)
lc$percent <- lc$percent * 100
and produced this chart:
[chart of counts and percentages omitted]
But I need help manipulating it to create the df below (basically, summing the percentages of rows 1 and 2, leaving 3 as is, and then summing 4 and 5):
[desired df omitted]
First, make your example reproducible by using dput(lc) to share the data:
lc <- structure(list(L2E = 1:5, n = c(7, 23, 84, 73, 33), pct = c(3,
10, 38, 33, 15)), class = "data.frame", row.names = c(NA, -5L))
Now define the groups:
groups <- list(1:2, 3, 4:5)
sumpct <- sapply(groups, function(x) sum(lc[x, 3]))
lc2 <- data.frame(group=seq(length(groups)), sumpct)
lc2
# group sumpct
# 1 1 13
# 2 2 38
# 3 3 48
Note that row 3 is 38, not 32, and in rows 4 and 5, 33 + 15 = 48, not 54.
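The same collapsing can also be sketched in dplyr, assigning each row to its group by hand:

library(dplyr)

lc %>%
  mutate(group = c(1, 1, 2, 3, 3)) %>%   # rows 1-2, row 3, rows 4-5
  group_by(group) %>%
  summarise(sumpct = sum(pct))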
Does anyone know if it is possible to use a variable in one dataframe (in my case the "deploy" dataframe) to create a variable in another dataframe?
For example, I have two dataframes:
df1:
deploy <- data.frame(ID = c("20180101_HH1_1_1", "20180101_HH1_1_2", "20180101_HH1_1_3"),
Site_Depth = c(42, 93, 40), Num_Depth_Bins_Required = c(5, 100, 4),
Percent_Column_in_each_bin = c(20, 10, 25))
df2:
sp.c <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"),
ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6),
percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42))
I want to create new columns in df2 that assigns each species and count to a bin based on the Percent_Column_in_each_bin for each ID. For example, in 20180101_HH1_1_3 there would be 4 bins that each make up 25% of the column and all species that are within 0-25% of the column (in df2) would be in bin 1 and species within 25-50% of the column would be in depth bin 2, and so on. What I'm imagining this looking like is:
i.want.this <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"),
ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6),
percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42),
'20180101_HH1_1_1_Bin' = c(1, 1, 2, 4, 4, 5, 1, 4, 1, 3),
'20180101_HH1_1_2_Bin' = c(2, 2, 4, 7, 8, 10, 1, 7, 1, 5),
'20180101_HH1_1_3_Bin' = c(1, 1, 2, 3, 3, 4, 1, 3, 1, 2))
I am pretty new to R and I'm not sure how to make this happen. I need to do this for over 100 IDs (all with different depths, number of depth bins, and percent of the column in each bin) so I was hoping that I don't need to do them all by hand. I have tried mutate in dplyr but I can't get it to pull from two different dataframes. I have also tried ifelse statements, but I would need to run the ifelse statement for each ID individually.
I don't know if what I am trying to do is possible but I appreciate the feedback. Thank you in advance!
Edit: my end goal is to find the max count (max ct) for each species within each bin for each ID. What I've been doing to find this (using the bins generated with suggestions from @Ben) is using dplyr to slice and find the max ct like this:
# names that start with a digit must be wrapped in backticks
`20180101_HH1_1_1` <- sp.c %>%
  group_by(`20180101_HH1_1_1`, species) %>%
  arrange(desc(ct)) %>%
  slice(1) %>%
  group_by(`20180101_HH1_1_1`) %>%
  mutate(Count_Total_Per_Bin = sum(ct)) %>%
  group_by(species, add = TRUE) %>%
  mutate(species_percent_of_total_in_bin = 100 * ct / Count_Total_Per_Bin) %>%
  mutate(ID = "20180101_HH1_1_1") %>%
  ungroup()
but I have to do this for over 100 IDs. My desired output would be something like:
end.goal <- data.frame(ID = c(rep("20180101_HH1_1_1", 8)),
                       species = c("RR", "GS", "SH", "GT", "RR", "BR", "RS", "BA"),
                       bin = c(1, 1, 1, 2, 3, 4, 4, 5),
                       Max_count_of_each_species_in_each_bin = c(25, 66, 500, 1, 6, 12, 30, 6),
                       percent_dist_from_surf = c(11, 15, 5, 33, 42, 68, 71, 100),
                       percent_each_species_max_in_each_bin = c((25/591)*100, (66/591)*100, (500/591)*100,
                                                                100, 100, (12/42)*100, (30/42)*100, 100))
I was thinking that by answering the original question I could get to this but I see now that there's still a lot you have to do to get this for each ID.
Here is another approach, which does not require a loop.
Using sapply over deploy$Percent_Column_in_each_bin, you can cut the percent_dist_from_surf values in sp.c to determine the bins for each ID.
res <- sapply(deploy$Percent_Column_in_each_bin, function(x) {
  cut(sp.c$percent_dist_from_surf, seq(0, 100, by = x),
      include.lowest = TRUE, labels = 1:(100/x))
})
colnames(res) <- deploy$ID
cbind(sp.c, res)
Or using purrr:
library(purrr)
cbind(sp.c, imap(setNames(deploy$Percent_Column_in_each_bin, deploy$ID),
                 ~ cut(sp.c$percent_dist_from_surf, seq(0, 100, by = .x),
                       include.lowest = TRUE, labels = 1:(100/.x))))
Output
species ct percent_dist_from_surf 20180101_HH1_1_1 20180101_HH1_1_2 20180101_HH1_1_3
1 RR 25 11 1 2 1
2 GS 66 15 1 2 1
3 GT 1 33 2 4 2
4 BR 12 68 4 7 3
5 RS 30 71 4 8 3
6 BA 6 100 5 10 4
7 GS 1 2 1 1 1
8 RS 22 65 4 7 3
9 SH 500 5 1 1 1
10 RR 6 42 3 5 2
Edit:
To determine the maximum ct value for each species, site, and bin, put the result from above into a dataframe called res and do the following.
First, put the data into long form with pivot_longer. Then you can group_by species, site, and bin, and determine the maximum ct for each combination.
library(tidyverse)
res %>%
  pivot_longer(cols = starts_with("2018"), names_to = "site", values_to = "bin") %>%
  group_by(species, site, bin) %>%
  summarise(max_ct = max(ct)) %>%
  arrange(site, bin)
Output
# A tibble: 26 x 4
# Groups: species, site [21]
species site bin max_ct
<fct> <chr> <fct> <dbl>
1 GS 20180101_HH1_1_1 1 66
2 RR 20180101_HH1_1_1 1 25
3 SH 20180101_HH1_1_1 1 500
4 GT 20180101_HH1_1_1 2 1
5 RR 20180101_HH1_1_1 3 6
6 BR 20180101_HH1_1_1 4 12
7 RS 20180101_HH1_1_1 4 30
8 BA 20180101_HH1_1_1 5 6
9 GS 20180101_HH1_1_2 1 1
10 SH 20180101_HH1_1_2 1 500
11 GS 20180101_HH1_1_2 2 66
12 RR 20180101_HH1_1_2 2 25
13 GT 20180101_HH1_1_2 4 1
14 RR 20180101_HH1_1_2 5 6
15 BR 20180101_HH1_1_2 7 12
16 RS 20180101_HH1_1_2 7 22
17 RS 20180101_HH1_1_2 8 30
18 BA 20180101_HH1_1_2 10 6
19 GS 20180101_HH1_1_3 1 66
20 RR 20180101_HH1_1_3 1 25
21 SH 20180101_HH1_1_3 1 500
22 GT 20180101_HH1_1_3 2 1
23 RR 20180101_HH1_1_3 2 6
24 BR 20180101_HH1_1_3 3 12
25 RS 20180101_HH1_1_3 3 30
26 BA 20180101_HH1_1_3 4 6
It is helpful to distinguish between the contents of your two dataframes.
df2 appears to contain measurements from some sites
df1 appears to contain parameters by which you want to process/summarise the measurements in df2
Given these different purposes of the two dataframes, your best approach is probably to loop over all the rows of df1 each time adding a column to df2. Something like the following:
max_dist = max(df2$percent_dist_from_surf)

for(ii in 1:nrow(df1)){
  # extract parameters
  this_ID = df1[[ii, "ID"]]
  this_depth = df1[[ii, "Site_Depth"]]
  this_bins = df1[[ii, "Num_Depth_Bins_Required"]]
  this_percent = df1[[ii, "Percent_Column_in_each_bin"]]

  # add column to df2
  df2 = df2 %>%
    mutate(!!sym(this_ID) := insert_your_calculation_here)
}
The !!sym(this_ID) := part of the code is to allow dynamic naming of your output columns.
And as best I can determine, the formula you want for insert_your_calculation_here is ceiling(percent_dist_from_surf / max_dist * this_bins).
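For illustration, a hypothetical filled-in version of the loop using the question's deploy and sp.c objects (binned is a name I made up). Deriving the bin from Percent_Column_in_each_bin reproduces the desired output in the example:

library(dplyr)

binned <- sp.c
for (ii in 1:nrow(deploy)) {
  this_ID      <- deploy[[ii, "ID"]]
  this_percent <- deploy[[ii, "Percent_Column_in_each_bin"]]
  # bin k covers ((k - 1) * this_percent, k * this_percent] percent of the column
  binned <- binned %>%
    mutate(!!sym(this_ID) := ceiling(percent_dist_from_surf / this_percent))
}
binned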
I am working with a data set of patients' health state over time, and I would like to compute the data frame of transitions from the current health state to the next health state. Here is an example where the health state is measured only by AFP level and weight. The health state measurements might look like the following:
x <- data.frame(id = c(1, 1, 1, 2, 2, 2),
day = c(1, 2, 3, 1, 2, 3),
event = c('status', 'status', 'death', 'status', 'status', 'status'),
afp = c(10, 50, NA, 20, 30, 40),
weight = c(100, 105, NA, 200, 200, 200))
The desired output looks like the following:
y <- data.frame(id = c(1, 1, 2, 2),
current_afp = c(10, 50, 20, 30),
current_weight = c(100, 105, 200, 200),
next_event = c('status', 'death', 'status', 'status'),
next_afp = c(50, NA, 30, 40),
next_weight = c(105, NA, 200, 200))
One inefficient way to obtain the output is:
take the cross product of the measurements data frame with itself
keep only rows with matching ids, and day.x + 1 = day.y
rename the columns
Is there a more efficient way to obtain the output?
Note: the real measurements data frame can have more than 10 columns, so it is not very efficient, from a lines-of-code perspective, to explicitly write

current_afp = x$afp[1:(n-1)],
next_afp = x$afp[2:n]

and so on.
You could try:
library(dplyr)
x %>%
  mutate_each(funs(lead(.)), -id, -day) %>%
  full_join(x, ., by = c("id", "day")) %>%
  select(-event.x) %>%
  setNames(c(names(.)[1:2],
             paste0("current_", sub("\\..*", "", names(.)[3:4])),
             paste0("next_", sub("\\..*", "", names(.)[5:7])))) %>%
  group_by(id) %>%
  filter(day != last(day))
Which gives:
# id day current_afp current_weight next_event next_afp next_weight
#1 1 1 10 100 status 50 105
#2 1 2 50 105 death NA NA
#3 2 1 20 200 status 30 200
#4 2 2 30 200 status 40 200
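mutate_each() has since been deprecated; a rough equivalent with current dplyr (>= 1.0), where across() with its .names argument handles the renaming, might look like this:

library(dplyr)

x %>%
  group_by(id) %>%
  # keep each column and add a lead()-shifted copy named next_<col>
  mutate(across(c(event, afp, weight), lead, .names = "next_{.col}")) %>%
  rename(current_afp = afp, current_weight = weight) %>%
  filter(day != last(day)) %>%   # each id's final row has no successor
  select(-event) %>%
  ungroup()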
Using base R with a split-apply-combine approach:
res <- lapply(split(x[-2], x$id), function(y) {
  xx <- cbind(y[1:(nrow(y)-1), ], y[2:nrow(y), -1])
  colnames(xx) <- c("id", paste("current", colnames(y)[-1], sep = "_"),
                    paste("next", colnames(y)[-1], sep = "_"))
  xx[, which(colnames(xx) != "current_event")]
})
do.call(rbind, res)
id current_afp current_weight next_event next_afp next_weight
1 1 10 100 status 50 105
2 1 50 105 death NA NA
3 2 20 200 status 30 200
4 2 30 200 status 40 200
Or, an example where not all days are in sequence
x <- data.frame(id = c(1, 1, 1, 2, 2, 2),
day = c(1, 2, 3, 1, 2, 4),
event = c('status', 'status', 'death', 'status', 'status', 'status'),
afp = c(10, 50, NA, 20, 30, 40),
weight = c(100, 105, NA, 200, 200, 200))
x
id day event afp weight
1 1 1 status 10 100
2 1 2 status 50 105
3 1 3 death NA NA
4 2 1 status 20 200
5 2 2 status 30 200
6 2 4 status 40 200
Some of the transitions are NA, which could be removed if desired (see the filter sketch after the output below).
res <- lapply(split(x, x$id), function(y) {
  y <- merge(data.frame(id = unique(y$id), day = 1:max(y$day)), y,
             by = c("id", "day"), all.x = TRUE)[, -2]
  xx <- cbind(y[1:(nrow(y)-1), ], y[2:nrow(y), -1])
  colnames(xx) <- c("id", paste("current", colnames(y)[-1], sep = "_"),
                    paste("next", colnames(y)[-1], sep = "_"))
  xx[, which(colnames(xx) != "current_event")]
})
do.call(rbind, res)
id current_afp current_weight next_event next_afp next_weight
1.1 1 10 100 status 50 105
1.2 1 50 105 death NA NA
2.1 2 20 200 status 30 200
2.2 2 30 200 <NA> NA NA
2.3 2 NA NA status 40 200
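As mentioned above, the NA transitions introduced by the padded missing day can be dropped; one minimal filter that keeps the genuine death row:

out <- do.call(rbind, res)
# drop rows where either side of the transition was a padded missing day
out[!is.na(out$current_afp) & !is.na(out$next_event), ]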
I have a bigger dataset, but I made this smaller one for the purpose of this example. My dataset looks like this
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
APPT_ID = c(11, 11, 11, 12, 12, 12, 13, 13, 13, 14, 14, 14),
Variable = c(letters[1:3], letters[1:3], letters[1:3], letters[1:3]),
Value = c(41:52))
The first two columns (ID and APPT_ID) are identifiers for each observation, so I would like to maintain those as columns, while reshaping the second two columns (Variable and Value) such that each of the variables is its own column showing its value. I would like there to be only one observation row for every unique combination of ID and APPT_ID.
I'd like my output dataset to look like this:
df2 <- data.frame(ID = c(1, 1, 2, 3), APPT_ID = c(11, 12, 13, 14),
a = c(41, 44, 47, 50), b = c(42, 45, 48, 51),
c = c(43, 46, 49, 52) )
What's the best way to do this?
I think this will do the job:

library(tidyr)

df %>%
  spread(Variable, Value)
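spread() still works but has been superseded; with tidyr >= 1.0.0 the same reshape is pivot_wider():

library(tidyr)

pivot_wider(df, names_from = Variable, values_from = Value)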
In base R, you can use reshape():
reshape(df, direction = 'wide', idvar = c('ID', 'APPT_ID'), timevar = 'Variable');
## ID APPT_ID Value.a Value.b Value.c
## 1 1 11 41 42 43
## 4 1 12 44 45 46
## 7 2 13 47 48 49
## 10 3 14 50 51 52
You can use the varying argument to control the names of the resulting varying (non-identifier) columns.
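For instance, a sketch that names the wide columns in the call itself; per ?reshape, varying canonically takes a list of name vectors, one per value column (the names a, b, c here follow the desired output):

reshape(df, direction = 'wide', idvar = c('ID', 'APPT_ID'),
        timevar = 'Variable', varying = list(c('a', 'b', 'c')));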
With reshape2:

library(reshape2)

dcast(df, ID + APPT_ID ~ Variable, value.var = "Value")
# ID APPT_ID a b c
# 1 1 11 41 42 43
# 2 1 12 44 45 46
# 3 2 13 47 48 49
# 4 3 14 50 51 52
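reshape2 is retired these days; the same formula interface lives on in data.table::dcast():

library(data.table)

dcast(as.data.table(df), ID + APPT_ID ~ Variable, value.var = "Value")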