I have a simple dataframe that looks like the following:
Observation X1 X2 Group
1 2 4 1
2 6 3 2
3 8 4 2
4 1 3 3
5 2 8 4
6 7 5 5
7 2 4 5
How can I recode the group variable such that all non-recurrent observations are recoded as "unaffiliated"?
The desired output would be the following:
Observation X1 X2 Group
1 2 4 Unaffiliated
2 6 3 2
3 8 4 2
4 1 3 Unaffiliated
5 2 8 Unaffiliated
6 7 5 5
7 2 4 5
We may use duplicated() to build a logical vector that flags the values of 'Group' occurring only once, and assign "Unaffiliated" to 'Group' for those rows. Note that this assignment converts the 'Group' column to character.
df1$Group[with(df1, !(duplicated(Group)|duplicated(Group,
fromLast = TRUE)))] <- "Unaffiliated"
-output
> df1
Observation X1 X2 Group
1 1 2 4 Unaffiliated
2 2 6 3 2
3 3 8 4 2
4 4 1 3 Unaffiliated
5 5 2 8 Unaffiliated
6 6 7 5 5
7 7 2 4 5
data
df1 <- structure(list(Observation = 1:7, X1 = c(2L, 6L, 8L, 1L, 2L,
7L, 2L), X2 = c(4L, 3L, 4L, 3L, 8L, 5L, 4L), Group = c(1L, 2L,
2L, 3L, 4L, 5L, 5L)), class = "data.frame", row.names = c(NA,
-7L))
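To see why the two duplicated() calls are combined, it can help to look at each piece separately. The following is a minimal sketch on the Group values alone; the variable names (dup_forward, dup_backward, singleton, recoded) are illustrative and not part of the answer above.

```r
group <- c(1L, 2L, 2L, 3L, 4L, 5L, 5L)

# TRUE for the second and later occurrences of a value
dup_forward <- duplicated(group)
# TRUE for every occurrence except the last one
dup_backward <- duplicated(group, fromLast = TRUE)

# A value that is flagged by neither scan occurs exactly once
singleton <- !(dup_forward | dup_backward)

# Recode on a character copy so the type change is explicit
recoded <- as.character(group)
recoded[singleton] <- "Unaffiliated"
recoded
```

Either scan alone misses one member of each repeated group (the first or the last), which is why both are needed.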
unaffil takes a vector of Group numbers and returns "Unaffiliated" if it has one element; otherwise it returns the input unchanged. We can then apply it by Group using ave. This does not overwrite the input. No packages are used, but if you use dplyr then transform can be replaced with mutate.
unaffil <- function(x) if (length(x) == 1) "Unaffiliated" else x
transform(dat, Group = ave(Group, Group, FUN = unaffil))
giving
Observation X1 X2 Group
1 1 2 4 Unaffiliated
2 2 6 3 2
3 3 8 4 2
4 4 1 3 Unaffiliated
5 5 2 8 Unaffiliated
6 6 7 5 5
7 7 2 4 5
Note
dat <- structure(list(Observation = 1:7, X1 = c(2L, 6L, 8L, 1L, 2L,
7L, 2L), X2 = c(4L, 3L, 4L, 3L, 8L, 5L, 4L), Group = c(1L, 2L,
2L, 3L, 4L, 5L, 5L)), class = "data.frame", row.names = c(NA,
-7L))
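An alternative sketch, assuming you would rather keep FUN returning the same type as its input: count the group sizes with ave() first and recode in a second step. The column name Group2 and the helper variable group_size are made up for illustration.

```r
dat <- data.frame(Observation = 1:7,
                  Group = c(1L, 2L, 2L, 3L, 4L, 5L, 5L))

# Number of rows in each observation's group, aligned with the original rows
group_size <- ave(seq_along(dat$Group), dat$Group, FUN = length)

# Recode singletons; everything else keeps its group label as text
dat$Group2 <- ifelse(group_size == 1, "Unaffiliated", as.character(dat$Group))
dat
```

Keeping the count as its own vector also makes it easy to change the threshold later, for example to treat groups of fewer than three rows as unaffiliated.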
One way is to group by Group, check whether the group has a single row (its maximum row number is 1), and recode with ifelse:
library(dplyr)
df %>%
group_by(Group) %>%
mutate(Group = ifelse(max(row_number()) == 1, "Unaffiliated", as.character(Group))) %>%
ungroup()
Observation X1 X2 Group
<int> <int> <int> <chr>
1 1 2 4 Unaffiliated
2 2 6 3 2
3 3 8 4 2
4 4 1 3 Unaffiliated
5 5 2 8 Unaffiliated
6 6 7 5 5
7 7 2 4 5
I have a dataframe that currently looks like this:
subjectID Trial
1 3
1 3
1 3
1 4
1 4
1 5
1 5
1 5
2 1
2 1
2 3
2 3
2 3
2 5
2 5
2 6
3 1
Etc., where trial number is nested under subject ID. I need to make a new column, "NewTrial", that simply gives the order in which the trials now appear within each subject. For example:
subjectID Trial NewTrial
1 3 1
1 3 1
1 3 1
1 4 2
1 4 2
1 5 3
1 5 3
1 5 3
2 1 1
2 1 1
2 3 2
2 3 2
2 3 2
2 5 3
2 5 3
2 6 4
3 1 1
So far, I have a for-loop written that looks like this:
for (myperson in unique(data$subjectID)) {
  # This line creates a vector 1..k, where k is the number of unique trials for this subject: for subject 1, c(1, 2, 3)
  triallength <- 1:length(unique(data$Trial[data$subjectID == myperson]))
}
I'm having trouble now finding a way to paste the numbers from the created triallength vector as a column in the dataframe. Does anyone know of a way to accomplish this? I am lacking some experience with for-loops and hoping to gain more. If anyone has a tidyverse/dplyr solution, however, I am open to that as well as an alternative to a for-loop. Thanks in advance, and let me know if any clarification is needed!
Converting Trial to a factor whose levels are the unique values in order of appearance, then applying as.numeric within ave, works nicely.
transform(dat, NewTrial=ave(Trial, subjectID, FUN=\(x) as.numeric(factor(x, levels=unique(x)))))
# subjectID Trial NewTrial
# 1 1 3 1
# 2 1 3 1
# 3 1 3 1
# 4 1 4 2
# 5 1 4 2
# 6 1 5 3
# 7 1 5 3
# 8 1 5 3
# 9 2 1 1
# 10 2 1 1
# 11 2 3 2
# 12 2 3 2
# 13 2 3 2
# 14 2 5 3
# 15 2 5 3
# 16 2 6 4
# 17 3 1 1
Data:
dat <- structure(list(subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L), Trial = c(3L, 3L, 3L, 4L,
4L, 5L, 5L, 5L, 1L, 1L, 3L, 3L, 3L, 5L, 5L, 6L, 1L)), class = "data.frame", row.names = c(NA,
-17L))
We could use match on the unique values after grouping by 'subjectID'
library(dplyr)
df1 <- df1 %>%
group_by(subjectID) %>%
mutate(NewTrial = match(Trial, unique(Trial))) %>%
ungroup
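One design point worth checking before choosing between match and a run-based id: match numbers trials by first appearance, so a trial that shows up again later reuses its earlier number, whereas a run-based id starts a new number at every change. A small base R sketch, where the toy vector trial and the cumsum-based run_id are illustrative stand-ins (the latter written in the spirit of a run-length id, not taken from any answer here):

```r
# One subject whose trial 3 reappears after trial 4
trial <- c(3L, 3L, 4L, 3L)

# Numbering by first appearance: the later 3 gets the same number as the first
by_first_appearance <- match(trial, unique(trial))

# Numbering by runs: a new id every time the value changes
run_id <- cumsum(c(TRUE, trial[-1] != trial[-length(trial)]))

rbind(trial, by_first_appearance, run_id)
```

For the example data in the question the two numberings agree, because each trial value appears in one contiguous block per subject.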
We could use rleid:
library(dplyr)
library(data.table)
df %>%
group_by(subjectID) %>%
mutate(NewTrial = rleid(subjectID, Trial))
subjectID Trial NewTrial
<int> <int> <int>
1 1 3 1
2 1 3 1
3 1 3 1
4 1 4 2
5 1 4 2
6 1 5 3
7 1 5 3
8 1 5 3
9 2 1 1
10 2 1 1
11 2 3 2
12 2 3 2
13 2 3 2
14 2 5 3
15 2 5 3
16 2 6 4
17 3 1 1
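Since the question asked about finishing the for-loop, here is one possible way to complete it in base R, reusing the match() idea from the answers. It assumes a data frame called dat with columns subjectID and Trial; the helper variables rows and trials are illustrative.

```r
dat <- data.frame(
  subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L),
  Trial = c(3L, 3L, 3L, 4L, 4L, 5L, 5L, 5L,
            1L, 1L, 3L, 3L, 3L, 5L, 5L, 6L, 1L)
)

# Create the column first so the loop can fill it in piece by piece
dat$NewTrial <- NA_integer_

for (myperson in unique(dat$subjectID)) {
  rows <- dat$subjectID == myperson        # rows belonging to this subject
  trials <- dat$Trial[rows]                # this subject's trial numbers
  # Position of each trial among the subject's unique trials, in order of appearance
  dat$NewTrial[rows] <- match(trials, unique(trials))
}
dat
```

The key step is assigning into dat$NewTrial[rows], which writes each subject's result back to that subject's rows only.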
The code below computes the mode across the columns Method1, Method2, Method3 and Method4. Notice, however, that alternatives 10 and 12 have the same mode value, 2. I would like my Mode column to have distinct values, as if it were a rank. The alternative with Mode=1 is the best, but I have no way of knowing the second-best alternative, because two rows have a 2 in the Mode column. Do you have suggestions on what approach I can take?
database<-structure(list(Alternatives = c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
Method1 = c(1L, 10L, 7L, 8L, 9L, 6L, 5L, 3L, 4L, 2L), Method2 = c(1L,
8L, 6L, 7L, 10L, 9L, 4L, 2L, 3L, 5L), Method3 = c(1L,
10L, 7L, 8L, 9L, 6L, 4L, 2L, 3L, 5L), Method4 = c(1L,
9L, 6L, 7L, 10L, 8L, 5L, 3L, 4L, 2L)), class = "data.frame", row.names = c(NA,
10L))
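library(dplyr)
ModeFunc <- function(Vec) {
  # Frequency table, most frequent value first
  tmp <- sort(table(Vec), decreasing = TRUE)
  Nms <- names(tmp)
  # Return the most frequent value, or NA when no value repeats
  if (max(tmp) > 1) {
    as.numeric(Nms[1])
  } else NA
}
output <- database |> rowwise() |>
  mutate(Mode = ModeFunc(c_across(Method1:Method4))) |>
  data.frame()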
> output
Alternatives Method1 Method2 Method3 Method4 Mode
1 3 1 1 1 1 1
2 4 10 8 10 9 10
3 5 7 6 7 6 6
4 6 8 7 8 7 7
5 7 9 10 9 10 9
6 8 6 9 6 8 6
7 9 5 4 4 5 4
8 10 3 2 2 3 2
9 11 4 3 3 4 3
10 12 2 5 5 2 2
CHECK
output$Rank <- (nrow(output) + 1) - rank(-output$Mode, ties.method = "last")
output|>
arrange(Mode)
Alternatives Method1 Method2 Method3 Method4 Mode Rank
1 3 1 1 1 1 1 1
2 10 3 2 2 3 2 2
3 12 2 5 5 2 2 3
4 11 4 3 3 4 3 4
5 9 5 4 4 5 4 5
6 5 7 6 7 6 6 6
7 8 6 9 6 8 6 7
8 6 8 7 8 7 7 8
9 7 9 10 9 10 9 9
10 4 10 8 10 9 10 10
OK. Based on OP's comment above, here's a solution that picks the row with the lowest value of Alternatives in case of ties. You can generalise to any other tie break with an appropriate modification of the second mutate.
output |>
arrange(Mode) |> # Sort by mode
group_by(Mode) |> # Assign initial ranks
mutate(Rank=cur_group_id()) |>
arrange(Rank, Alternatives) |> # Sort and assign tie break
mutate(TieBreak=row_number()) |>
ungroup()
# A tibble: 10 × 8
Alternatives Method1 Method2 Method3 Method4 Mode Rank TieBreak
<dbl> <int> <int> <int> <int> <dbl> <int> <int>
1 3 1 1 1 1 1 1 1
2 10 3 2 2 3 2 2 1
3 12 2 5 5 2 2 2 2
4 11 4 3 3 4 3 3 1
5 9 5 4 4 5 4 4 1
6 5 7 6 7 6 6 5 1
7 8 6 9 6 8 6 5 2
8 6 8 7 8 7 7 6 1
9 7 9 10 9 10 9 7 1
10 4 10 8 10 9 10 8 1
Note that cur_group_id() requires dplyr v1.0.0 or later and that row_number() takes account of groups when a data frame is grouped.
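If a single column of unique ranks is all that is needed, the same tie break (lowest Alternatives first) can be sketched in base R with order(). This is only an illustration: the data frame below copies just the Alternatives and Mode columns from the output shown in the question.

```r
output <- data.frame(
  Alternatives = c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  Mode         = c(1, 10, 6, 7, 9, 6, 4, 2, 3, 2)
)

# order() sorts by Mode and breaks ties by Alternatives;
# writing 1..n into those positions gives each row a unique rank
output$Rank <- NA_integer_
output$Rank[order(output$Mode, output$Alternatives)] <- seq_len(nrow(output))

output[order(output$Rank), ]
```

A different tie break only requires changing the second argument of order().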
I have measured basal area of trees in different plots. Here's a small example with two plots with 4 trees each:
Plot Tree BasalArea
1 1 4
1 2 5
1 3 7
1 4 3
2 1 4
2 2 6
2 3 9
2 4 5
Within each plot, I want to calculate the sum of basal area of the trees that have basal area larger than the focal tree.
For example, Tree 1 in Plot 1 has an area of 4. Within that plot there are 2 trees with an area larger than tree 1: Tree 2 and Tree 3 with area 5 and 7, respectively. So, "BA_Larger" for tree 1 is 5 + 7 = 12.
Tree 2 in the same plot has basal area = 5. Within plot 1 there is only one tree with a larger area than tree 2: tree 3 with area 7. Thus, "BA_Larger" for tree 2 is 7.
Finally, the data frame should be like this:
Plot Tree BasalArea BA_Larger
1 1 4 12
1 2 5 7
1 3 7 0
1 4 3 16
2 1 4 20
2 2 6 9
2 3 9 0
2 4 5 15
The data set is very large. I have tried to calculate the "BA_Larger", without success. Any help is highly appreciated.
The base R solution with ave():
within(df, BA_Larger <- ave(BasalArea, Plot, FUN = function(x) sapply(x, function(y) sum(x[x > y]))))
With a tidyverse style, you can also use map_int() or map_dbl() from purrr.
library(dplyr)
library(purrr)
df %>%
group_by(Plot) %>%
mutate(BA_Larger = map_int(BasalArea, ~ sum(BasalArea[BasalArea > .]))) %>%
ungroup()
Output
# # A tibble: 8 x 4
# Plot Tree BasalArea BA_Larger
# <int> <int> <int> <int>
# 1 1 1 4 12
# 2 1 2 5 7
# 3 1 3 7 0
# 4 1 4 3 16
# 5 2 1 4 20
# 6 2 2 6 9
# 7 2 3 9 0
# 8 2 4 5 15
Data
df <- structure(list(Plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Tree = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), BasalArea = c(4L, 5L, 7L, 3L, 4L,
6L, 9L, 5L)), class = "data.frame", row.names = c(NA, -8L))
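Another solution: sort by basal area within each plot and subtract the running total from the plot total. This assumes no two trees in a plot share the same basal area; with ties, the tied trees would get different results.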
library(tidyverse)
df %>%
group_by(Plot) %>%
arrange(BasalArea, .by_group = TRUE) %>%
mutate(res = sum(BasalArea) - cumsum(BasalArea)) %>%
arrange(Tree, .by_group = TRUE) %>%
ungroup()
# A tibble: 8 x 4
Plot Tree BasalArea res
<int> <int> <int> <int>
1 1 1 4 12
2 1 2 5 7
3 1 3 7 0
4 1 4 3 16
5 2 1 4 20
6 2 2 6 9
7 2 3 9 0
8 2 4 5 15
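Because the question mentions a very large data set, a cumulative-sum approach is attractive: it avoids comparing every tree with every other tree in its plot. Here is a base R sketch of that idea that also copes with equal basal areas by accumulating per distinct value. The function name ba_larger and its internals are illustrative, not taken from any of the answers.

```r
df <- data.frame(
  Plot      = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Tree      = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
  BasalArea = c(4L, 5L, 7L, 3L, 4L, 6L, 9L, 5L)
)

ba_larger <- function(x) {
  vals <- sort(unique(x))                  # distinct areas, ascending
  idx <- match(x, vals)                    # which distinct area each tree has
  totals <- as.vector(tapply(x, idx, sum)) # total area per distinct value
  # Plot total minus everything less than or equal to the tree's own area
  sum(x) - cumsum(totals)[idx]
}

df$BA_Larger <- ave(df$BasalArea, df$Plot, FUN = ba_larger)
df
```

With ties, for example ba_larger(c(4L, 4L, 5L)), both trees of area 4 get 5, since neither counts the other as larger.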
Using a non-equi join with data.table. Calculate the sum for each match. Note that trees with no larger tree in their plot get NA rather than 0.
library(data.table)
setDT(d)
d[ , ba2 := d[d, on = .(Plot, BasalArea > BasalArea), sum(x.BasalArea), by = .EACHI]$V1]
# Plot Tree BasalArea ba2
# 1: 1 1 4 12
# 2: 1 2 5 7
# 3: 1 3 7 NA
# 4: 1 4 3 16
# 5: 2 1 4 20
# 6: 2 2 6 9
# 7: 2 3 9 NA
# 8: 2 4 5 15
Actually you don't need a package to do this. Using by you may split the data on the Plot column, then compare the specific tree i to the other values in the split-subset and exclude i in the sum. Finally unsplit the result according to the df1$Plot column.
res <- unsplit(by(df1, df1$Plot, function(x)
transform(x, BA_Larger=sapply(1:nrow(x), function(i)
sum(x[x[, 3] > x[i, 3], 3])))), df1$Plot)
res
# Plot Tree BasalArea BA_Larger
# 1 1 1 4 12
# 2 1 2 5 7
# 3 1 3 7 0
# 4 1 4 3 16
# 5 2 1 4 20
# 6 2 2 6 9
# 7 2 3 9 0
# 8 2 4 5 15
Data:
df1 <- structure(list(Plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Tree = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), BasalArea = c(4L, 5L, 7L, 3L, 4L,
6L, 9L, 5L)), class = "data.frame", row.names = c(NA, -8L))
R Community: I am trying to create a new variable based on the value of an existing variable, not on a row-wise basis but rather on a group-wise basis. I'm trying to create max.var and min.var below based on old.var without collapsing or aggregating the rows, that is, preserving all the id rows:
id old.var min.var max.var
1 1 1 3
1 2 1 3
1 3 1 3
2 5 5 11
2 7 5 11
2 9 5 11
2 11 5 11
3 3 3 4
3 4 3 4
structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L), old.var = c(1L,
2L, 3L, 5L, 7L, 9L, 11L, 3L, 4L), min.var = c(1L, 1L, 1L, 5L,
5L, 5L, 5L, 3L, 3L), max.var = c(3L, 3L, 3L, 11L, 11L, 11L, 11L,
4L, 4L)), .Names = c("id", "old.var", "min.var", "max.var"), class = "data.frame", row.names = c(NA,
-9L))
I've tried using the aggregate and by functions, but they of course summarize the data. I haven't had much luck trying an Excel-like MATCH/INDEX approach either. Thanks in advance for your assistance!
You can use dplyr:
library(dplyr)
df %>%
group_by(id) %>%
mutate(min.var = min(old.var), max.var = max(old.var))
#Source: local data frame [9 x 4]
#Groups: id [3]
# id old.var min.var max.var
# (int) (int) (int) (int)
#1 1 1 1 3
#2 1 2 1 3
#3 1 3 1 3
#4 2 5 5 11
#5 2 7 5 11
#6 2 9 5 11
#7 2 11 5 11
#8 3 3 3 4
#9 3 4 3 4
Using ave as docendo discimus pointed out in the question's comments:
df$min.var <- ave(df$old.var, df$id, FUN = min)
df$max.var <- ave(df$old.var, df$id, FUN = max)
Output:
id old.var min.var max.var
1 1 1 1 3
2 1 2 1 3
3 1 3 1 3
4 2 5 5 11
5 2 7 5 11
6 2 9 5 11
7 2 11 5 11
8 3 3 3 4
9 3 4 3 4
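For readers who, like the asker, started from aggregate(): the collapsed summaries can also be joined back onto the original rows with merge(). This is a sketch of that route rather than a recommendation over ave; the intermediate names mins, maxs and res are illustrative.

```r
df <- data.frame(
  id      = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
  old.var = c(1L, 2L, 3L, 5L, 7L, 9L, 11L, 3L, 4L)
)

# One row per id: this is the "collapsed" result aggregate() produces
mins <- aggregate(old.var ~ id, data = df, FUN = min)
names(mins)[2] <- "min.var"
maxs <- aggregate(old.var ~ id, data = df, FUN = max)
names(maxs)[2] <- "max.var"

# Joining on id repeats each summary value on every row of its group
res <- merge(merge(df, mins, by = "id"), maxs, by = "id")
res
```

Note that merge() may reorder rows, so ave is the simpler choice when the original row order matters.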
We can use data.table
library(data.table)
setDT(df1)[, c('min.var', 'max.var') := list(min(old.var), max(old.var)) , by = id]
df1
# id old.var min.var max.var
#1: 1 1 1 3
#2: 1 2 1 3
#3: 1 3 1 3
#4: 2 5 5 11
#5: 2 7 5 11
#6: 2 9 5 11
#7: 2 11 5 11
#8: 3 3 3 4
#9: 3 4 3 4