Transposing an R data frame maintaining the first two columns as identifiers - r

I have a bigger dataset, but I made this smaller one for the purpose of this example. My dataset looks like this
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
APPT_ID = c(11, 11, 11, 12, 12, 12, 13, 13, 13, 14, 14, 14),
Variable = c(letters[1:3], letters[1:3], letters[1:3], letters[1:3]),
Value = c(41:52))
The first two columns (ID and APPT_ID) are identifiers for each observation, so I would like to maintain those as columns, while transposing the second two columns (variable and value) such that each of the variables is its own column showing its value. I would like there to be only one observation row for every unique combination of ID and APPT_ID.
I'd like my output dataset to look like this:
df2 <- data.frame(ID = c(1, 1, 2, 3), APPT_ID = c(11, 12, 13, 14),
a = c(41, 44, 47, 50), b = c(42, 45, 48, 51),
c = c(43, 46, 49, 52) )
What's the best way to do this?

I think this will do the job
library(tidyr)
df %>%
spread(Variable, Value)
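In current tidyr releases spread() is superseded by pivot_wider(). A minimal sketch of the same reshaping, assuming tidyr 1.0.0 or later and the df from the question (the name wide is only illustrative):

```r
library(tidyr)

# Sample data from the question
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                 APPT_ID = c(11, 11, 11, 12, 12, 12, 13, 13, 13, 14, 14, 14),
                 Variable = rep(letters[1:3], 4),
                 Value = 41:52)

# ID and APPT_ID are kept as identifier columns automatically,
# because they are not named in names_from or values_from
wide <- pivot_wider(df, names_from = Variable, values_from = Value)
wide
```

The result has one row per unique ID/APPT_ID pair and one column per level of Variable.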

In base R, you can use reshape():
reshape(df, direction = 'wide', idvar = c('ID', 'APPT_ID'), timevar = 'Variable')
## ID APPT_ID Value.a Value.b Value.c
## 1 1 11 41 42 43
## 4 1 12 44 45 46
## 7 2 13 47 48 49
## 10 3 14 50 51 52
You can use the varying argument to control the names of the resulting varying (non-identifier) columns.
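If the goal is simply to get the plain a, b, c column names from the question, one option is to rename after reshaping. This is a sketch that uses a post-hoc rename rather than the varying argument, and the name wide is only illustrative:

```r
# Sample data from the question
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                 APPT_ID = c(11, 11, 11, 12, 12, 12, 13, 13, 13, 14, 14, 14),
                 Variable = rep(letters[1:3], 4),
                 Value = 41:52)

wide <- reshape(df, direction = "wide",
                idvar = c("ID", "APPT_ID"), timevar = "Variable")

# Drop the "Value." prefix that reshape() puts in front of each new column
names(wide) <- sub("^Value\\.", "", names(wide))

# reshape() keeps the original row names (1, 4, 7, 10); reset them to 1..n
rownames(wide) <- NULL
wide
```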

With reshape2:
library(reshape2)
dcast(df, ID + APPT_ID ~ Variable, value.var = "Value")
# ID APPT_ID a b c
# 1 1 11 41 42 43
# 2 1 12 44 45 46
# 3 2 13 47 48 49
# 4 3 14 50 51 52

Related

Count occurrences of values in second column equal to the given value in first column R

I have a following dataframe:
data_frame <- structure(list(position = c(10, 15, 10, 10, 25, 19, 25, 15, 20,
31, 22, 20, 10, 19), length = c(21, 15, 19, 10, 27, 19, 25, 31,
34, 31, 26, 27, 10, 19)), class = "data.frame", row.names = c(NA,
-14L))
I want to create the following dataframe from it:
data_frame2 <- structure(list(position = c(10, 15, 10, 10, 25, 19, 25, 15, 20, 31, 22, 20, 10, 19), length = c(21, 15, 19, 10, 27, 19, 25, 31, 34, 31, 26, 27, 10, 19), desired_col1 = c(2, 1, 2, 2, 1, 3, 1, 1, 0, 2, 0, 0, 2, 3), desired_col2 = c(2, 3, 2, 2, 8, 6, 8, 3, 6, 13, 7, 6, 2, 6)), class = "data.frame", row.names = c(NA, -14L))
where:
"desired_col1" - informs of how many "lengths" are equal to the given value in column "position".
desired_col2 - informs of how many "lengths" are equal to or lower than the given value in column "position".
I was trying multiple times, but could not apply the function column-wise.
EDIT: I was able to create the desired column 1 by converting the columns into vectors, using sum within sapply and adding the resulting vector as a new column. Like this:
x <- data_frame$position
y <- data_frame$length
counts <- sapply(x, function(val) sum(val == y))
data_frame$desired <- counts
However, I am still looking for something more straightforward, maybe dplyr-based.
I am also still struggling with desired column 2.
I would get the count of each length element and then join that back to the original data:
library(dplyr)
data_frame %>%
count(length) %>%
rename(position = length) %>%
left_join(data_frame, ., by = "position") %>%
mutate(n = coalesce(n, 0L))
# position length n
# 1 10 21 2
# 2 15 15 1
# 3 10 19 2
# 4 10 10 2
# 5 25 27 1
# 6 19 19 3
# 7 25 25 1
# 8 15 31 1
# 9 20 34 0
# 10 31 31 2
# 11 22 26 0
# 12 20 27 0
# 13 10 10 2
# 14 19 19 3
Your sapply solution is quite good for base R, and more straightforward. Do note that "converting the columns into vectors" is skippable, the sapply can be done as a one-liner:
data_frame$desired = sapply(data_frame$position, \(x) sum(x == data_frame$length))
The dplyr solution with the join will be more efficient if your data is large.
Do be careful with your column names. Your two sample data frames and your code are quite inconsistent on length vs lengths and positions vs position.
v <- data_frame$length
data_frame %>%
rowwise() %>%
mutate(
desired_col1 = sum(v == position),
desired_col2 = sum(v <= position)
)
Results
# A tibble: 14 × 4
# Rowwise:
position length desired_col1 desired_col2
<dbl> <dbl> <int> <int>
1 10 21 2 2
2 15 15 1 3
3 10 19 2 2
4 10 10 2 2
5 25 27 1 8
6 19 19 3 6
7 25 25 1 8
8 15 31 1 3
9 20 34 0 6
10 31 31 2 13
11 22 26 0 7
12 20 27 0 6
13 10 10 2 2
14 19 19 3 6
This is the perfect time for some combined mutate and sapply usage!
data_frame %>%
mutate(count_col1 = sapply(position, function(x) sum(x == length)),
count_col2 = sapply(position, function(x) sum(x >= length)))
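Both columns can also be computed without an explicit per-row function by comparing every position with every length at once. The following is a base R sketch using outer(); the intermediate names eq and ge are illustrative, and the n x n comparison matrices make this suitable only for moderately sized data:

```r
# Sample data from the question
data_frame <- data.frame(
  position = c(10, 15, 10, 10, 25, 19, 25, 15, 20, 31, 22, 20, 10, 19),
  length   = c(21, 15, 19, 10, 27, 19, 25, 31, 34, 31, 26, 27, 10, 19)
)

# eq[i, j] is TRUE when position[i] == length[j]
eq <- outer(data_frame$position, data_frame$length, "==")
# ge[i, j] is TRUE when position[i] >= length[j], i.e. length[j] <= position[i]
ge <- outer(data_frame$position, data_frame$length, ">=")

# Counting the TRUEs in each row gives one count per position
data_frame$desired_col1 <- rowSums(eq)
data_frame$desired_col2 <- rowSums(ge)
data_frame
```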

Join list of tibbles with different number of rows into one dataframe

I have a list of tibbles (48 to be exact) and they have the same number of columns (2) but a different number of rows. The first column is a value, the second column is a count. Each element has a different name. An example is:
library(tibble)
tib_list = list(tibble(value = c(11, 21, 22, 23, 33, 41, 42, 81, 82),
count = c(1:9)),
tibble(value = c(11, 21, 22, 23, 41, 42, 82),
count = c(1:7)),
tibble(value = c(11, 21, 22, 23),
count = c(1:4)))
names(tib_list) = c("Carrots", "Apples", "Pears")
I want to combine all of the tibbles from the list into a single dataframe. The first column of the created dataframe will have all of the possible values from the tibbles and the second column will have the counts. If there is no count for the corresponding value, then it should say NA. I also want each column to be named based on the element name in the list. I want my final output to basically be like this:
test = data.frame(value = c(11, 21, 22, 23, 33, 41, 42, 81, 82),
Carrots = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
Apples = c(1, 2, 3, 4, NA, 5, 6, NA, 7),
Pears = c(1, 2, 3, 4, NA, NA, NA, NA, NA))
print(test)
value Carrots Apples Pears
1 11 1 1 1
2 21 2 2 2
3 22 3 3 3
4 23 4 4 4
5 33 5 NA NA
6 41 6 5 NA
7 42 7 6 NA
8 81 8 NA NA
9 82 9 7 NA
I have tried cbind (error due to different number of rows) and joins. I am just not sure how to get the output I need. Thanks in advance.
We could also do a reduced full_join and change the names:
library(tidyverse)
reduce(tib_list, full_join, by='value') %>%
rename_with(~c('value', names(tib_list)))
# A tibble: 9 x 4
value Carrots Apples Pears
<dbl> <int> <int> <int>
1 11 1 1 1
2 21 2 2 2
3 22 3 3 3
4 23 4 4 4
5 33 5 NA NA
6 41 6 5 NA
7 42 7 6 NA
8 81 8 NA NA
9 82 9 7 NA
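An alternative that avoids renaming the suffixed count columns is to stack the tibbles first and then widen. This is a sketch assuming dplyr and tidyr 1.0.0 or later; the column name item and the result name combined are illustrative:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Sample list from the question, built as a named list
tib_list <- list(
  Carrots = tibble(value = c(11, 21, 22, 23, 33, 41, 42, 81, 82), count = 1:9),
  Apples  = tibble(value = c(11, 21, 22, 23, 41, 42, 82), count = 1:7),
  Pears   = tibble(value = c(11, 21, 22, 23), count = 1:4)
)

combined <- tib_list %>%
  # Stack all tibbles; .id stores each list element's name in column "item"
  bind_rows(.id = "item") %>%
  # One column per item; value/item combinations with no count become NA
  pivot_wider(names_from = item, values_from = count)
combined
```

Because the list names travel with the data, this also works unchanged for the full list of 48 tibbles.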

print from specific rows with highest value from multiple columns using R Studio

I have attached an image. I want to extract only those columns in which the row has the maximum value compared with the other rows.
First, provide a reproducible version of your data (not a picture):
dput(dta)
structure(list(A = c(45, 20, 9, 6, 6), B = c(23, 34, 7, 10, 5
), C = c(12, 15, 8, 0, 4), D = c(4, 4, 6, 0, 3), E = c(5, 6,
3, 1, 2)), class = "data.frame", row.names = c("BOX_A", "BOX_B",
"BOX_C", "BOX_D", "BOX_E"))
Now find which column is the maximum:
idx <- apply(dta, 1, which.max)
Now display the rows where the maximum is in the first column. This is not what you asked for but it is what your picture shows:
dta[idx==1, ]
# A B C D E
# BOX_A 45 23 12 4 5
# BOX_C 9 7 8 6 3
# BOX_E 6 5 4 3 2
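The same index can be turned into column names so that every row is reported with the column holding its maximum. This is a sketch that rebuilds dta from the dput output above; max_col is an illustrative name:

```r
# Data reconstructed from the dput() output
dta <- data.frame(A = c(45, 20, 9, 6, 6),
                  B = c(23, 34, 7, 10, 5),
                  C = c(12, 15, 8, 0, 4),
                  D = c(4, 4, 6, 0, 3),
                  E = c(5, 6, 3, 1, 2),
                  row.names = c("BOX_A", "BOX_B", "BOX_C", "BOX_D", "BOX_E"))

# Position of the (first) maximum within each row
idx <- apply(dta, 1, which.max)

# Translate the positions into column names
max_col <- names(dta)[idx]

# Group the row names by the column that holds their maximum
split(rownames(dta), max_col)
```

Note that which.max() returns only the first maximum, so ties within a row are resolved in favour of the leftmost column.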

Get mean value from other rows into current row

I have a soil properties data.table with values for different locations and depths. Some values are NA so I'd like to get the mean values considering the upper and lower layers. In the case of the top layer, I'd take the value from the next one down.
I was able to create a column indicating which are the upper and lower layers for each row, and I thought about doing a self merge. But I'm completely lost as to how to proceed.
Any clues as to how to do this? Below is an example data.table and what I'd like to achieve. The example considers two locations with 3 layers, but I have multiple locations and some have more layers than others.
library(data.table)
# I was able to identify which are the bottom and top layers
# using a function to identify the neighbours
dt <- data.table(id = rep(c(1,2), 1, each = 3),
depth = c(10, 20, 30, 10, 20, 30),
val = c(12, 18, 11, 25, 27, 29),
bot_l = c(20, 30, NA, 20, 30, NA),
top_l = c(NA, 10, 20, NA, 10, 20))
# How can I calculate the average between top and lower layers?
dt_desired <- data.table(id = rep(c(1,2), 1, each = 3),
depth = c(10, 20, 30, 10, 20, 30),
val = c(12, 18, 11, 25, 27, 29),
bot_l = c(20, 30, NA, 20, 30, NA),
top_l = c(NA, 10, 20, NA, 10, 20),
mean_top_bot = c(18, 11.5, 18, 27, 27, 27))
To explain a bit more:
mean_top_bot[1] = mean(c(val[depth == 0], val[depth == 20])). Since I don't have a value at depth 0, that becomes mean(c(NA, 18), na.rm = TRUE) = 18
mean_top_bot[2] = mean(c(val[depth == 10], val[depth == 30])) = (12 + 11) / 2 = 11.5
I calculated the mean_top_bot values by hand. That's why I had some errors there :facepalm:
Solution using self merge
I was able to merge the table with itself by changing the by.x and by.y parameters. But I have a feeling that I'm doing this in the worst way possible.
dt1 <- merge(dt, dt[, .SD, .SDcols = !c('bot_l', 'top_l')],
by.x = c('id', 'bot_l'),
by.y = c('id', 'depth'),
all = TRUE)[order(id, depth)]
id bot_l depth val.x top_l val.y
1: 1 20 10 12 NA 18
2: 1 30 20 18 10 11
3: 1 NA 30 11 20 NA
4: 1 10 NA NA NA 12
5: 2 20 10 25 NA 27
6: 2 30 20 27 10 29
7: 2 NA 30 29 20 NA
8: 2 10 NA NA NA 25
Are there any easier ways to do this?
It should be easier to use data.table::shift directly, without computing the "top" and "bot" layers.
dt <- data.table(id = rep(c(1,2), 1, each = 3),
depth = c(10, 20, 30, 10, 20, 30),
val = c(12, 18, 11, 25, 27, 29))
dt[, v := rowMeans(data.table::setDT(data.table::shift(val,
c(1, - 1))),
na.rm = TRUE),
by = id]
Same but with magrittr:
library(magrittr)
dt[, v := data.table::shift(val, c(1, -1)) %>% data.table::setDT() %>% rowMeans(na.rm = TRUE),
by = id]
The code above computes the mean of the previous and next val for a given depth. It assumes that the top and bottom layers are exactly one row away from the current row and that the data are already ordered by id and depth, as in your example.
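The ordering assumption can be made explicit by sorting before shifting. A sketch of the same idea with the lag and lead written out separately, assuming data.table is installed; the column name mean_top_bot follows the question:

```r
library(data.table)

# Sample data from the question, without the helper columns
dt <- data.table(id = rep(c(1, 2), each = 3),
                 depth = c(10, 20, 30, 10, 20, 30),
                 val = c(12, 18, 11, 25, 27, 29))

# Make sure the layers are in depth order within each location
setorder(dt, id, depth)

dt[, mean_top_bot := rowMeans(
       cbind(shift(val, 1L, type = "lag"),    # value of the layer above
             shift(val, 1L, type = "lead")),  # value of the layer below
       na.rm = TRUE),                         # ignore the missing neighbour at the edges
   by = id]
dt
```

A location with a single layer has no neighbours at all, so its result would be NaN rather than a number.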
It took me a while to figure out, but this can be solved as well by a rolling mean:
dt[, mean_top_bot :=
zoo::rollapply(val, width = list(c(-1L, 1L)), FUN = mean, partial = TRUE), id][]
id depth val bot_l top_l mean_top_bot
1: 1 10 12 20 NA 18
2: 1 20 18 30 10 11.5
3: 1 30 11 NA 20 18
4: 2 10 25 20 NA 27
5: 2 20 27 30 10 27
6: 2 30 29 NA 20 27
Two characteristics of zoo::rollapply() come in handy:
The width argument alternatively takes a list of integer offsets. So, list(c(-1L, 1L)) refers to the values of the preceding and subsequent rows while omitting the current row.
With partial = TRUE, only the subset of indexes that are in range are passed to FUN. E.g., for the first row, offset -1 refers to index 0 which is out of range. Therefore, only the value of index 2 (offset 1) is passed to mean(). Likewise for the last row, where only the second to last value is passed to mean().
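To make the two points above concrete, here is a small base R sketch of the same windowing logic. The helper neighbour_mean is hypothetical and written only for illustration; it is not how zoo implements rollapply():

```r
# Mean of the values at the given offsets around each element,
# using only the offsets that fall inside the vector
neighbour_mean <- function(x, offsets = c(-1L, 1L)) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    idx <- i + offsets                  # candidate indexes, e.g. 0 and 2 for i = 1
    idx <- idx[idx >= 1L & idx <= n]    # keep only in-range indexes (the "partial" behaviour)
    mean(x[idx])
  }, numeric(1))
}

# val for the first location in the example
neighbour_mean(c(12, 18, 11))
# 18.0 11.5 18.0
```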

Is there a way to create new columns in R based on manipulations from multiple data frames?

Does anyone know if it is possible to use a variable in one dataframe (in my case the "deploy" dataframe) to create a variable in another dataframe?
For example, I have two dataframes:
df1:
deploy <- data.frame(ID = c("20180101_HH1_1_1", "20180101_HH1_1_2", "20180101_HH1_1_3"),
Site_Depth = c(42, 93, 40), Num_Depth_Bins_Required = c(5, 100, 4),
Percent_Column_in_each_bin = c(20, 10, 25))
df2:
sp.c <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"),
ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6),
percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42))
I want to create new columns in df2 that assigns each species and count to a bin based on the Percent_Column_in_each_bin for each ID. For example, in 20180101_HH1_1_3 there would be 4 bins that each make up 25% of the column and all species that are within 0-25% of the column (in df2) would be in bin 1 and species within 25-50% of the column would be in depth bin 2, and so on. What I'm imagining this looking like is:
i.want.this <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"),
ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6),
percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42),
'20180101_HH1_1_1_Bin' = c(1, 1, 2, 4, 4, 5, 1, 4, 1, 3),
'20180101_HH1_1_2_Bin' = c(2, 2, 4, 7, 8, 10, 1, 7, 1, 5),
'20180101_HH1_1_3_Bin' = c(1, 1, 2, 3, 3, 4, 1, 3, 1, 2))
I am pretty new to R and I'm not sure how to make this happen. I need to do this for over 100 IDs (all with different depths, number of depth bins, and percent of the column in each bin) so I was hoping that I don't need to do them all by hand. I have tried mutate in dplyr but I can't get it to pull from two different dataframes. I have also tried ifelse statements, but I would need to run the ifelse statement for each ID individually.
I don't know if what I am trying to do is possible but I appreciate the feedback. Thank you in advance!
Edit: my end goal is to find the max count (max ct) for each species within each bin for each ID. What I've been doing to find this (using the bins generated with suggestions from @Ben) is using dplyr to slice and find the max count like this:
# names that start with a digit must be wrapped in backticks
`20180101_HH1_1_1` <- sp.c %>%
group_by(`20180101_HH1_1_1`, species) %>%
arrange(desc(ct)) %>%
slice(1) %>%
group_by(`20180101_HH1_1_1`) %>%
mutate(Count_Total_Per_Bin = sum(ct)) %>%
group_by(species, .add = TRUE) %>%
mutate(species_percent_of_total_in_bin =
paste0((100*ct/Count_Total_Per_Bin))) %>%
mutate(ID = "20180101_HH1_1_1") %>%
ungroup()
but I have to do this for over 100 IDs. My desired output would be something like:
end.goal <- data.frame(ID = c(rep("20180101_HH1_1_1", 8)),
species = c("RR", "GS", "SH", "GT", "RR", "BR", "RS", "BA"),
bin = c(1, 1, 1, 2, 3, 4, 4, 5),
Max_count_of_each_species_in_each_bin = c(11, 66, 500, 1, 6, 12, 30, 6),
percent_dist_from_surf = c(11, 15, 5, 33, 42, 68, 71, 100),
percent_each_species_max_in_each_bin = c((11/577)*100, (66/577)*100, (500/577)*100, 100, 100, (12/42)*100, (30/42)*100, 100))
I was thinking that by answering the original question I could get to this but I see now that there's still a lot you have to do to get this for each ID.
Here is another approach, which does not require a loop.
Using sapply over the Percent_Column_in_each_bin values in your deploy dataframe, you can use cut to determine the bin for each percent_dist_from_surf value in sp.c.
res <- sapply(deploy$Percent_Column_in_each_bin, function(x) {
cut(sp.c$percent_dist_from_surf, seq(0, 100, by = x), include.lowest = TRUE, labels = 1:(100/x))
})
colnames(res) <- deploy$ID
cbind(sp.c, res)
Or using purrr:
library(purrr)
cbind(sp.c, imap(setNames(deploy$Percent_Column_in_each_bin, deploy$ID),
~ cut(sp.c$percent_dist_from_surf, seq(0, 100, by = .x), include.lowest = TRUE, labels = 1:(100/.x))
))
Output
species ct percent_dist_from_surf 20180101_HH1_1_1 20180101_HH1_1_2 20180101_HH1_1_3
1 RR 25 11 1 2 1
2 GS 66 15 1 2 1
3 GT 1 33 2 4 2
4 BR 12 68 4 7 3
5 RS 30 71 4 8 3
6 BA 6 100 5 10 4
7 GS 1 2 1 1 1
8 RS 22 65 4 7 3
9 SH 500 5 1 1 1
10 RR 6 42 3 5 2
Edit:
To determine the maximum ct value for each species, site, and bin, store the combined result from above in a dataframe called res (for example res <- cbind(sp.c, res)) and do the following.
First, put the data into long form with pivot_longer. Then you can group_by species, site, and bin, and determine the maximum ct for each combination.
library(tidyverse)
res %>%
pivot_longer(cols = starts_with("2018"), names_to = "site", values_to = "bin") %>%
group_by(species, site, bin) %>%
summarise(max_ct = max(ct)) %>%
arrange(site, bin)
Output
# A tibble: 26 x 4
# Groups: species, site [21]
species site bin max_ct
<fct> <chr> <fct> <dbl>
1 GS 20180101_HH1_1_1 1 66
2 RR 20180101_HH1_1_1 1 25
3 SH 20180101_HH1_1_1 1 500
4 GT 20180101_HH1_1_1 2 1
5 RR 20180101_HH1_1_1 3 6
6 BR 20180101_HH1_1_1 4 12
7 RS 20180101_HH1_1_1 4 30
8 BA 20180101_HH1_1_1 5 6
9 GS 20180101_HH1_1_2 1 1
10 SH 20180101_HH1_1_2 1 500
11 GS 20180101_HH1_1_2 2 66
12 RR 20180101_HH1_1_2 2 25
13 GT 20180101_HH1_1_2 4 1
14 RR 20180101_HH1_1_2 5 6
15 BR 20180101_HH1_1_2 7 12
16 RS 20180101_HH1_1_2 7 22
17 RS 20180101_HH1_1_2 8 30
18 BA 20180101_HH1_1_2 10 6
19 GS 20180101_HH1_1_3 1 66
20 RR 20180101_HH1_1_3 1 25
21 SH 20180101_HH1_1_3 1 500
22 GT 20180101_HH1_1_3 2 1
23 RR 20180101_HH1_1_3 2 6
24 BR 20180101_HH1_1_3 3 12
25 RS 20180101_HH1_1_3 3 30
26 BA 20180101_HH1_1_3 4 6
It is helpful to distinguish between the contents of your two dataframes.
df2 appears to contain measurements from some sites
df1 appears to contain parameters by which you want to process/summarise the measurements in df2
Given these different purposes of the two dataframes, your best approach is probably to loop over all the rows of df1 each time adding a column to df2. Something like the following:
max_dist = max(df2$percent_dist_from_surf)
for(ii in 1:nrow(df1)){
# extract parameters
this_ID = df1[[ii,"ID"]]
this_depth = df1[[ii,"Site_Depth"]]
this_bins = df1[[ii,"Num_Depth_Bins_Required"]]
this_percent = df1[[ii,"Percent_Column_in_each_bin"]]
# add column to df2
df2 = df2 %>%
mutate(!!sym(this_ID) := insert_your_calculation_here)
}
The !!sym(this_ID) := part of the code is to allow dynamic naming of your output columns.
And as best I can determine, the formula you want for insert_your_calculation_here is ceiling(percent_dist_from_surf / max_dist * this_bins)
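A runnable base R sketch of this loop on the question's data follows. It departs from the formula above in one respect, which is an assumption on my part: it divides by Percent_Column_in_each_bin instead of using Num_Depth_Bins_Required, because in the sample data the second ID lists 100 bins but 10% per bin, and only the percentage reproduces the bins shown in the question. The "_Bin" suffix follows the asker's desired column names.

```r
# Sample data from the question
deploy <- data.frame(ID = c("20180101_HH1_1_1", "20180101_HH1_1_2", "20180101_HH1_1_3"),
                     Site_Depth = c(42, 93, 40),
                     Num_Depth_Bins_Required = c(5, 100, 4),
                     Percent_Column_in_each_bin = c(20, 10, 25))
sp.c <- data.frame(species = c("RR", "GS", "GT", "BR", "RS", "BA", "GS", "RS", "SH", "RR"),
                   ct = c(25, 66, 1, 12, 30, 6, 1, 22, 500, 6),
                   percent_dist_from_surf = c(11, 15, 33, 68, 71, 100, 2, 65, 5, 42))

for (ii in seq_len(nrow(deploy))) {
  # extract the parameters for this ID
  this_ID <- as.character(deploy$ID[ii])
  this_percent <- deploy$Percent_Column_in_each_bin[ii]

  # bin k covers the interval ((k - 1) * this_percent, k * this_percent]
  sp.c[[paste0(this_ID, "_Bin")]] <-
    ceiling(sp.c$percent_dist_from_surf / this_percent)
}
sp.c
```

A percent_dist_from_surf of exactly 0 would land in bin 0 with this formula, so such values would need separate handling.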
