Sum values larger than the current value, by group - r

I have measured basal area of trees in different plots. Here's a small example with two plots with 4 trees each:
Plot Tree BasalArea
1 1 4
1 2 5
1 3 7
1 4 3
2 1 4
2 2 6
2 3 9
2 4 5
Within each plot, I want to calculate the sum of basal area of the trees that have a basal area larger than the focal tree.
For example, Tree 1 in Plot 1 has an area of 4. Within that plot there are 2 trees with an area larger than tree 1: Tree 2 and Tree 3 with area 5 and 7, respectively. So, "BA_Larger" for tree 1 is 5 + 7 = 12.
Tree 2 in the same plot has basal area = 5. Within plot 1 there is only one tree with a larger area than tree 2: tree 3 with area 7. Thus, "BA_Larger" for tree 2 is 7.
Finally, the data frame should be like this:
Plot Tree BasalArea BA_Larger
1 1 4 12
1 2 5 7
1 3 7 0
1 4 3 16
2 1 4 20
2 2 6 9
2 3 9 0
2 4 5 15
The data set is very large. I have tried to calculate the "BA_Larger", without success. Any help is highly appreciated.

The base R solution with ave():
within(df, BA_Larger <- ave(BasalArea, Plot, FUN = function(x) sapply(x, function(y) sum(x[x > y]))))
In tidyverse style, you can also use map_int() or map_dbl() from purrr (map_dbl() is the safer choice if BasalArea is stored as a double).
library(dplyr)
library(purrr)
df %>%
  group_by(Plot) %>%
  mutate(BA_Larger = map_int(BasalArea, ~ sum(BasalArea[BasalArea > .]))) %>%
  ungroup()
Output
# # A tibble: 8 x 4
# Plot Tree BasalArea BA_Larger
# <int> <int> <int> <int>
# 1 1 1 4 12
# 2 1 2 5 7
# 3 1 3 7 0
# 4 1 4 3 16
# 5 2 1 4 20
# 6 2 2 6 9
# 7 2 3 9 0
# 8 2 4 5 15
Data
df <- structure(list(Plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Tree = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), BasalArea = c(4L, 5L, 7L, 3L, 4L,
6L, 9L, 5L)), class = "data.frame", row.names = c(NA, -8L))

Another solution
library(tidyverse)
df %>%
  group_by(Plot) %>%
  arrange(BasalArea, .by_group = TRUE) %>%
  mutate(res = sum(BasalArea) - cumsum(BasalArea)) %>%
  arrange(Tree, .by_group = TRUE) %>%
  ungroup()
# A tibble: 8 x 4
Plot Tree BasalArea res
<int> <int> <int> <int>
1 1 1 4 12
2 1 2 5 7
3 1 3 7 0
4 1 4 3 16
5 2 1 4 20
6 2 2 6 9
7 2 3 9 0
8 2 4 5 15
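One caveat: with ties, the running-sum approach counts a tied tree that happens to sort after the focal tree as "larger", so equal BasalArea values within a plot can get different results. If only strictly larger trees should count, a tie-safe variant of the same idea (a sketch, not part of the original answer) is:
library(dplyr)
df %>%
  group_by(Plot) %>%
  arrange(BasalArea, .by_group = TRUE) %>%
  # ave() takes the cumulative sum at the last tied position, i.e. the total
  # area at or below the focal size, so ties are never counted as "larger"
  mutate(BA_Larger = sum(BasalArea) -
           ave(cumsum(BasalArea), BasalArea, FUN = max)) %>%
  arrange(Tree, .by_group = TRUE) %>%
  ungroup()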

Using a non-equi join with data.table (here the data is a data.table named d), calculating the sum of the matches for each row:
library(data.table)
setDT(d)
d[ , ba2 := d[d, on = .(Plot, BasalArea > BasalArea), sum(x.BasalArea), by = .EACHI]$V1]
# Plot Tree BasalArea ba2
# 1: 1 1 4 12
# 2: 1 2 5 7
# 3: 1 3 7 NA
# 4: 1 4 3 16
# 5: 2 1 4 20
# 6: 2 2 6 9
# 7: 2 3 9 NA
# 8: 2 4 5 15
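The rows with no larger tree come back as NA rather than 0 under this join. If you want them as 0 to match the expected output, a small follow-up step on the d above (assuming ba2 stays integer, as it does here) could be:
d[is.na(ba2), ba2 := 0L]
# or, in one step after the join:
# d[, ba2 := fcoalesce(ba2, 0L)]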

Actually you don't need a package to do this. Using by you can split the data on the Plot column, then compare each focal tree i to the other values in the plot subset and sum those that are larger. Finally, unsplit the result according to the df1$Plot column.
res <- unsplit(by(df1, df1$Plot, function(x)
  transform(x, BA_Larger = sapply(1:nrow(x), function(i)
    sum(x[x[, 3] > x[i, 3], 3])))), df1$Plot)
res
# Plot Tree BasalArea BA_Larger
# 1 1 1 4 12
# 2 1 2 5 7
# 3 1 3 7 0
# 4 1 4 3 16
# 5 2 1 4 20
# 6 2 2 6 9
# 7 2 3 9 0
# 8 2 4 5 15
Data:
df1 <- structure(list(Plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Tree = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), BasalArea = c(4L, 5L, 7L, 3L, 4L,
6L, 9L, 5L)), class = "data.frame", row.names = c(NA, -8L))

Related

R Recode variable for all observations that do not occur more than once

I have a simple dataframe that looks like the following:
Observation X1 X2 Group
1 2 4 1
2 6 3 2
3 8 4 2
4 1 3 3
5 2 8 4
6 7 5 5
7 2 4 5
How can I recode the group variable such that all non-recurrent observations are recoded as "unaffiliated"?
The desired output would be the following:
Observation X1 X2 Group
1 2 4 Unaffiliated
2 6 3 2
3 8 4 2
4 1 3 Unaffiliated
5 2 8 Unaffiliated
6 7 5 5
7 2 4 5
We may use duplicated() to create a logical vector flagging the Group values that occur only once, and assign "Unaffiliated" to those rows:
df1$Group[with(df1, !(duplicated(Group)|duplicated(Group,
fromLast = TRUE)))] <- "Unaffiliated"
Output
> df1
Observation X1 X2 Group
1 1 2 4 Unaffiliated
2 2 6 3 2
3 3 8 4 2
4 4 1 3 Unaffiliated
5 5 2 8 Unaffiliated
6 6 7 5 5
7 7 2 4 5
data
df1 <- structure(list(Observation = 1:7, X1 = c(2L, 6L, 8L, 1L, 2L,
7L, 2L), X2 = c(4L, 3L, 4L, 3L, 8L, 5L, 4L), Group = c(1L, 2L,
2L, 3L, 4L, 5L, 5L)), class = "data.frame", row.names = c(NA,
-7L))
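To see why both duplicated() passes are needed, here is a quick illustration on the original Group column (run on the df1 defined just above, before the recode): the first occurrence of a repeated value is flagged only by the fromLast pass, later occurrences only by the forward pass, so a value is unique exactly when both are FALSE.
with(df1, data.frame(Group,
                     dup_fwd  = duplicated(Group),
                     dup_back = duplicated(Group, fromLast = TRUE)))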
unaffil takes a vector of Group numbers and returns "Unaffiliated" if it has one element; otherwise it returns the input. We can then apply it by Group using ave. This does not overwrite the input. No packages are used, but if you use dplyr, transform can be replaced with mutate.
unaffil <- function(x) if (length(x) == 1) "Unaffiliated" else x
transform(dat, Group = ave(Group, Group, FUN = unaffil))
giving
Observation X1 X2 Group
1 1 2 4 Unaffiliated
2 2 6 3 2
3 3 8 4 2
4 4 1 3 Unaffiliated
5 5 2 8 Unaffiliated
6 6 7 5 5
7 7 2 4 5
Note
dat <- structure(list(Observation = 1:7, X1 = c(2L, 6L, 8L, 1L, 2L,
7L, 2L), X2 = c(4L, 3L, 4L, 3L, 8L, 5L, 4L), Group = c(1L, 2L,
2L, 3L, 4L, 5L, 5L)), class = "data.frame", row.names = c(NA,
-7L))
One way could be to group first, then check whether the maximum row number within the group is 1, and finish with an ifelse:
library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(Group = ifelse(max(row_number()) == 1, "Unaffiliated", as.character(Group))) %>%
  ungroup()
Observation X1 X2 Group
<int> <int> <int> <chr>
1 1 2 4 Unaffiliated
2 2 6 3 2
3 3 8 4 2
4 4 1 3 Unaffiliated
5 5 2 8 Unaffiliated
6 6 7 5 5
7 7 2 4 5
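A slightly more direct test for singleton groups is n() == 1; a sketch of the same pipeline with that change (using if_else so both branches are character):
library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(Group = if_else(n() == 1, "Unaffiliated", as.character(Group))) %>%
  ungroup()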

Pasting values from a vector to a new column in a for loop with nested data

I have a dataframe that currently looks like this:
subjectID Trial
1 3
1 3
1 3
1 4
1 4
1 5
1 5
1 5
2 1
2 1
2 3
2 3
2 3
2 5
2 5
2 6
3 1
Etc., where trial number is nested under subject ID. I need to make a new column, "NewTrial", that simply records the order in which each subject's trials appear. For example:
subjectID Trial NewTrial
1 3 1
1 3 1
1 3 1
1 4 2
1 4 2
1 5 3
1 5 3
1 5 3
2 1 1
2 1 1
2 3 2
2 3 2
2 3 2
2 5 3
2 5 3
2 6 4
3 1 1
So far, I have a for-loop written that looks like this:
for (myperson in unique(data$subjectID)){
#This line creates a vector of the number of unique trials per subject: for subject 1, c(1, 2, 3)
triallength=1:length(unique(data$Trial[data$subID==myperson]))
I'm having trouble now finding a way to paste the numbers from the created triallength vector as a column in the dataframe. Does anyone know of a way to accomplish this? I am lacking some experience with for-loops and hoping to gain more. If anyone has a tidyverse/dplyr solution, however, I am open to that as well as an alternative to a for-loop. Thanks in advance, and let me know if any clarification is needed!
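For completeness, one way to finish the loop itself (a sketch; it assumes the subject column is consistently named subjectID rather than subID) is to write directly into a new column instead of building a separate vector:
data$NewTrial <- NA_integer_
for (myperson in unique(data$subjectID)) {
  rows <- data$subjectID == myperson
  # match each Trial value against the order in which it first appears for this subject
  data$NewTrial[rows] <- match(data$Trial[rows], unique(data$Trial[rows]))
}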
Converting to a factor with the unique values as levels, then taking as.numeric inside an ave, works nicely.
transform(dat, NewTrial=ave(Trial, subjectID, FUN=\(x) as.numeric(factor(x, levels=unique(x)))))
# subjectID Trial NewTrial
# 1 1 3 1
# 2 1 3 1
# 3 1 3 1
# 4 1 4 2
# 5 1 4 2
# 6 1 5 3
# 7 1 5 3
# 8 1 5 3
# 9 2 1 1
# 10 2 1 1
# 11 2 3 2
# 12 2 3 2
# 13 2 3 2
# 14 2 5 3
# 15 2 5 3
# 16 2 6 4
# 17 3 1 1
Data:
dat <- structure(list(subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L), Trial = c(3L, 3L, 3L, 4L,
4L, 5L, 5L, 5L, 1L, 1L, 3L, 3L, 3L, 5L, 5L, 6L, 1L)), class = "data.frame", row.names = c(NA,
-17L))
We could use match on the unique values after grouping by 'subjectID'
library(dplyr)
df1 <- df1 %>%
  group_by(subjectID) %>%
  mutate(NewTrial = match(Trial, unique(Trial))) %>%
  ungroup()
We could use rleid:
library(dplyr)
library(data.table)
df %>%
  group_by(subjectID) %>%
  mutate(NewTrial = rleid(subjectID, Trial))
subjectID Trial NewTrial
<int> <int> <int>
1 1 3 1
2 1 3 1
3 1 3 1
4 1 4 2
5 1 4 2
6 1 5 3
7 1 5 3
8 1 5 3
9 2 1 1
10 2 1 1
11 2 3 2
12 2 3 2
13 2 3 2
14 2 5 3
15 2 5 3
16 2 6 4
17 3 1 1
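If the data is already a data.table (or you convert it), the same idea works without dplyr; a sketch, assuming rows for a given Trial are contiguous within each subjectID:
library(data.table)
setDT(df)[, NewTrial := rleid(Trial), by = subjectID]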

Creating a new variable based on the orders of existing variables using R

Hoping to create the new variable X based on three existing variables: "SubID", "Day", and "Time". I used to do this manually in Excel with three sorts: first by "SubID", then by "Day", and lastly by "Time". X should run from 1 up to the number of rows for each SubID, following the order of Day and Time.
SubID: assigned subject number
Day: each subject's day number (1,2,3...21)
Time: 1, 2, 3
X: a running row number within each SubID, following the Day/Time order
SubID Day Time X
1 1 1 1
1 1 2 2
1 1 3 3
1 2 1 4
1 2 2 5
2 1 1 1
2 1 2 2
2 1 3 3
2 2 3 6
2 2 2 5
2 2 1 4
I have been doing this manually in excel and I am sure there must be a smarter way to do it in R, but I am new to R and don't know how. Thank you in advance!
Maybe with the data.table package. You will have to install it if you haven't already; the install command is commented out below.
# install.packages("data.table")
library(data.table)
We can generate some example data in the following way.
df <- data.frame(SubId = sample(1:2, 10, replace = TRUE),
                 Day   = sample(1:2, 10, replace = TRUE),
                 Time  = sample(1:2, 10, replace = TRUE))
Then convert the data.frame into a data.table.
setDT(df)
##> df
## SubId Day Time
## 1: 1 2 1
## 2: 1 1 1
## 3: 1 1 2
## 4: 2 2 1
## 5: 2 1 1
## 6: 1 2 2
## 7: 1 2 1
## 8: 1 2 2
## 9: 2 1 1
## 10: 2 1 2
Finally, we can order by SubId, Day, Time. As the table is then ordered the way we want, we just number the rows from 1 to the number of observations within each SubId.
df[order(SubId,Day,Time),X:=1:.N,SubId]
##> df
## SubId Day Time X
## 1: 1 2 1 3
## 2: 1 1 1 1
## 3: 1 1 2 2
## 4: 2 2 1 4
## 5: 2 1 1 1
## 6: 1 2 2 5
## 7: 1 2 1 4
## 8: 1 2 2 6
## 9: 2 1 1 2
## 10: 2 1 2 3
Maybe this helps
library(dplyr)
df1 %>%
  group_by(SubID) %>%
  mutate(X1 = row_number(as.numeric(paste0(Day, Time))))
# A tibble: 11 x 5
# Groups: SubID [2]
# SubID Day Time X X1
# <int> <int> <int> <int> <int>
# 1 1 1 1 1 1
# 2 1 1 2 2 2
# 3 1 1 3 3 3
# 4 1 2 1 4 4
# 5 1 2 2 5 5
# 6 2 1 1 1 1
# 7 2 1 2 2 2
# 8 2 1 3 3 3
# 9 2 2 3 6 6
#10 2 2 2 5 5
#11 2 2 1 4 4
Or using order
df1 %>%
  group_by(SubID) %>%
  mutate(X1 = order(Day, Time))
Or with data.table
library(data.table)
setDT(df1)[, X1 := order(Day, Time), by = SubID]
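One caveat with the order() variants: order() returns the permutation that sorts the rows, not their ranks. The two happen to coincide on this particular data, but in general the rank-style formulation is order(order(...)); a sketch of the safer version:
df1 %>%
  group_by(SubID) %>%
  mutate(X1 = order(order(Day, Time))) %>%   # ranks by Day then Time within SubID
  ungroup()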
data
df1 <- structure(list(SubID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), Day = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L),
Time = c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 3L, 2L, 1L), X = c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 6L, 5L, 4L)), class = "data.frame",
row.names = c(NA,
-11L))

Merge columns of dataframe with all combinations of variables

"w" "n"
"1" 2 1
"2" 3 1
"3" 4 1
"4" 2 1
"5" 5 1
"6" 6 1
"7" 3 2
"8" 7 2
I tried the following command, but it didn't produce the change I expected.
w2 <- w1 %>%
  expand(w, n)
My output should look like this
w n
2 1
2 2
3 1
3 2
4 1
4 2
5 1
5 2
6 1
6 2
7 1
7 2
data
w1 <- structure(list(w = c(2L, 3L, 3L, 4L, 5L, 6L, 7L), n = c(1L, 1L,
2L, 1L, 1L, 1L, 2L)), .Names = c("w", "n"), row.names = c(NA,
-7L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"), groups = structure(list(
w = c(2L, 3L, 3L, 4L, 5L, 6L, 7L), n = c(1L, 1L, 2L, 1L,
1L, 1L, 2L), .rows = list(1L, 2L, 3L, 4L, 5L, 6L, 7L)), .Names = c("w",
"n", ".rows"), row.names = c(NA, -7L), class = c("tbl_df", "tbl",
"data.frame"), .drop = TRUE))
The issue is that your data frame is grouped; consider:
w1 %>%
  ungroup() %>%
  expand(w, n)
Output:
# A tibble: 12 x 2
w n
<int> <int>
1 2 1
2 2 2
3 3 1
4 3 2
5 4 1
6 4 2
7 5 1
8 5 2
9 6 1
10 6 2
11 7 1
12 7 2
We can use complete from tidyr.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
  distinct(w, .keep_all = TRUE) %>%
  complete(w, n)
dat2
# # A tibble: 12 x 2
# w n
# <int> <int>
# 1 2 1
# 2 2 2
# 3 3 1
# 4 3 2
# 5 4 1
# 6 4 2
# 7 5 1
# 8 5 2
# 9 6 1
# 10 6 2
# 11 7 1
# 12 7 2
DATA
dat <- read.table(text = "w n
2 1
3 1
4 1
2 1
5 1
6 1
3 2
7 2",
header = TRUE)
Using the original data frame df you can create a new data frame that repeats w for each unique value of n (note that uniqueN() comes from data.table; length(unique(...)) does the same in base R):
library(data.table)  # for uniqueN()
data.frame(w = rep(unique(df$w), each  = uniqueN(df$n)),
           n = rep(unique(df$n), times = uniqueN(df$w)))
Output:
w n
1 2 1
2 2 2
3 3 1
4 3 2
5 4 1
6 4 2
7 5 1
8 5 2
9 6 1
10 6 2
11 7 1
12 7 2
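If you'd rather stay in base R entirely (no data.table needed for uniqueN()), expand.grid() builds the same grid; a sketch assuming the same df with columns w and n:
# expand.grid varies its first argument fastest, so n cycles within each w;
# reorder the columns afterwards to get w first
res <- expand.grid(n = sort(unique(df$n)),
                   w = sort(unique(df$w)))[, c("w", "n")]
res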

Create a new variable with the max or min of another variable -- by group [duplicate]

This question already has answers here:
Adding a column of means by group to original data [duplicate]
(4 answers)
R Community: I am trying to create a new variable based on the value of an existing variable, not on a row-wise basis but rather on a group-wise basis. I'm trying to create max.var and min.var below based on old.var without collapsing or aggregating the rows, that is, preserving all the id rows:
id old.var min.var max.var
1 1 1 3
1 2 1 3
1 3 1 3
2 5 5 11
2 7 5 11
2 9 5 11
2 11 5 11
3 3 3 4
3 4 3 4
structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L), old.var =
c(1L,
2L, 3L, 5L, 7L, 9L, 11L, 3L, 4L), min.var = c(1L, 1L, 1L, 5L,
5L, 5L, 5L, 3L, 3L), max.var = c(3L, 3L, 3L, 11L, 11L, 11L, 11L,
4L, 4L)), .Names = c("id", "old.var", "min.var", "max.var"), class = "data.frame", row.names = c(NA,
-9L))
I've tried using the aggregate and by functions, but they of course summarize the data. I haven't had much luck trying an Excel-like MATCH/INDEX approach either. Thanks in advance for your assistance!
You can use dplyr:
df %>%
  group_by(id) %>%
  mutate(min.var = min(old.var), max.var = max(old.var))
#Source: local data frame [9 x 4]
#Groups: id [3]
# id old.var min.var max.var
# (int) (int) (int) (int)
#1 1 1 1 3
#2 1 2 1 3
#3 1 3 1 3
#4 2 5 5 11
#5 2 7 5 11
#6 2 9 5 11
#7 2 11 5 11
#8 3 3 3 4
#9 3 4 3 4
Using ave as docendo discimus pointed out in the question's comments:
df$min.var <- ave(df$old.var, df$id, FUN = min)
df$max.var <- ave(df$old.var, df$id, FUN = max)
Output:
id old.var min.var max.var
1 1 1 1 3
2 1 2 1 3
3 1 3 1 3
4 2 5 5 11
5 2 7 5 11
6 2 9 5 11
7 2 11 5 11
8 3 3 3 4
9 3 4 3 4
We can use data.table
library(data.table)
setDT(df1)[, c('min.var', 'max.var') := list(min(old.var), max(old.var)) , by = id]
df1
# id old.var min.var max.var
#1: 1 1 1 3
#2: 1 2 1 3
#3: 1 3 1 3
#4: 2 5 5 11
#5: 2 7 5 11
#6: 2 9 5 11
#7: 2 11 5 11
#8: 3 3 3 4
#9: 3 4 3 4
