I have a dataset that records the group changes for a given ID in a given month.
In the example, in July (month 7), ID 5 changed from group 2 to group 1, then from group 1 to group 2, and so on.
I need to keep only the first and the last change made for each ID/month.
ID groupTO groupFROM MONTH
5 2 1 6
5 1 2 7
5 2 1 7
5 3 2 7
5 1 3 7
5 2 1 8
5 1 2 8
5 2 1 8
6 1 2 6
6 3 1 6
6 2 1 7
6 3 2 8
6 1 3 8
In this case, I need the results to be:
ID groupTO groupFROM MONTH
5 2 1 6
5 1 2 7
5 1 3 7
5 2 1 8
5 2 1 8
6 1 2 6
6 3 1 6
6 2 1 7
6 3 2 8
6 1 3 8
If I remove the duplicates (ID/MONTH), I can get the first occurrence, but how do I get the last one?
Here's an easy way to do it with dplyr:
library(dplyr)
# Create data
dt <-
data.frame(Id = c(rep(5, 8), rep(6, 5)),
groupTO = c(2, 1, 2, 3, 1, 2, 1, 2, 1, 3, 2, 3, 1),
groupFROM = c(1, 2, 1, 2, 3, 1, 2, 1, 2, 1, 1, 2, 3),
MONTH = c(6, 7, 7, 7, 7, 8, 8, 8, 6, 6, 7, 8, 8))
dt %>%
# Group by ID and month
group_by(Id, MONTH) %>%
# Get first and last row
slice(c(1, n())) %>%
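# Drop duplicate rows, e.g. single-row groups where first == last
# (note: this also merges first/last rows that differ but hold identical values)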
distinct()
# # A tibble: 9 x 4
# # Groups: Id, MONTH [6]
# Id groupTO groupFROM MONTH
# <dbl> <dbl> <dbl> <dbl>
# 5 2 1 6
# 5 1 2 7
# 5 1 3 7
# 5 2 1 8
# 6 1 2 6
# 6 3 1 6
# 6 2 1 7
# 6 3 2 8
# 6 1 3 8
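If rows that merely share the same values should be kept (the desired output in the question lists 5 2 1 8 twice, because the first and last rows of ID 5 in month 8 are different rows), one possible variation is to deduplicate the row positions rather than the rows themselves. This is only a sketch, assuming the same dt as above; res is a name introduced here for illustration.

```r
library(dplyr)

# Same data as above
dt <- data.frame(Id = c(rep(5, 8), rep(6, 5)),
                 groupTO = c(2, 1, 2, 3, 1, 2, 1, 2, 1, 3, 2, 3, 1),
                 groupFROM = c(1, 2, 1, 2, 3, 1, 2, 1, 2, 1, 1, 2, 3),
                 MONTH = c(6, 7, 7, 7, 7, 8, 8, 8, 6, 6, 7, 8, 8))

res <- dt %>%
  group_by(Id, MONTH) %>%
  # unique() collapses c(1, 1) to 1 for single-row groups,
  # so no distinct() is needed afterwards
  slice(unique(c(1, n()))) %>%
  ungroup()

nrow(res)
```

With this variation, a single-row group still contributes one row, while a group whose first and last rows happen to be identical contributes both.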
Using data.table
library(data.table)
unique(setDT(df1)[, .SD[c(1, .N)], .(ID, MONTH)])
# ID MONTH groupTO groupFROM
#1: 5 6 2 1
#2: 5 7 1 2
#3: 5 7 1 3
#4: 5 8 2 1
#5: 6 6 1 2
#6: 6 6 3 1
#7: 6 7 2 1
#8: 6 8 3 2
#9: 6 8 1 3
data
df1 <- structure(list(ID = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L,
6L, 6L, 6L), groupTO = c(2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L,
3L, 2L, 3L, 1L), groupFROM = c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 1L,
2L, 1L, 1L, 2L, 3L), MONTH = c(6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L,
6L, 6L, 7L, 8L, 8L)), class = "data.frame", row.names = c(NA,
-13L))
Here is a base R solution using split
dfout <- do.call(rbind, c(make.row.names = FALSE,
  lapply(split(df, df[c("Id", "MONTH")], lex.order = TRUE),
         function(v) if (nrow(v) == 1) v[1, ] else v[c(1, nrow(v)), ])))
such that
> dfout
Id groupTO groupFROM MONTH
1 5 2 1 6
2 5 1 2 7
3 5 1 3 7
4 5 2 1 8
5 5 2 1 8
6 6 1 2 6
7 6 3 1 6
8 6 2 1 7
9 6 3 2 8
10 6 1 3 8
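A base R way using ave: we flag the first and last row within each ID and MONTH (the groupTO == 1 argument only supplies a logical vector of the right length for ave to fill in), then keep the unique rows of the data frame.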
unique(subset(df, ave(groupTO == 1, ID, MONTH, FUN = function(x)
seq_along(x) %in% c(1, length(x)))))
# ID groupTO groupFROM MONTH
#1 5 2 1 6
#2 5 1 2 7
#5 5 1 3 7
#6 5 2 1 8
#9 6 1 2 6
#10 6 3 1 6
#11 6 2 1 7
#12 6 3 2 8
#13 6 1 3 8
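The question mentions that removing ID/MONTH duplicates yields the first occurrence. A sketch of how the same idea could also give the last one: duplicated() accepts fromLast = TRUE, which scans from the bottom, so combining both directions flags first and last rows. The names key, is_first, is_last and res are introduced here for illustration, and df is assumed to hold the question's data with an ID column.

```r
# Data from the question
df <- data.frame(ID = c(rep(5L, 8), rep(6L, 5)),
                 groupTO = c(2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 3L, 2L, 3L, 1L),
                 groupFROM = c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 3L),
                 MONTH = c(6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 6L, 6L, 7L, 8L, 8L))

key <- df[c("ID", "MONTH")]

# First occurrence of each ID/MONTH pair, scanning top to bottom
is_first <- !duplicated(key)
# Last occurrence of each ID/MONTH pair, scanning bottom to top
is_last <- !duplicated(key, fromLast = TRUE)

res <- df[is_first | is_last, ]
res
```

Because rows are selected by position rather than by value, both 5 2 1 8 rows survive, which matches the 10-row result requested in the question.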
Related
I have a simple dataframe that looks like the following:
Observation X1 X2 Group
1 2 4 1
2 6 3 2
3 8 4 2
4 1 3 3
5 2 8 4
6 7 5 5
7 2 4 5
How can I recode the group variable such that all non-recurrent observations are recoded as "unaffiliated"?
The desired output would be the following:
Observation X1 X2 Group
1 2 4 Unaffiliated
2 6 3 2
3 8 4 2
4 1 3 Unaffiliated
5 2 8 Unaffiliated
6 7 5 5
7 2 4 5
We may use duplicated (scanning from both ends) to build a logical vector that flags groups occurring only once, and assign "Unaffiliated" to 'Group' for those rows:
df1$Group[with(df1, !(duplicated(Group)|duplicated(Group,
fromLast = TRUE)))] <- "Unaffiliated"
-output
> df1
Observation X1 X2 Group
1 1 2 4 Unaffiliated
2 2 6 3 2
3 3 8 4 2
4 4 1 3 Unaffiliated
5 5 2 8 Unaffiliated
6 6 7 5 5
7 7 2 4 5
data
df1 <- structure(list(Observation = 1:7, X1 = c(2L, 6L, 8L, 1L, 2L,
7L, 2L), X2 = c(4L, 3L, 4L, 3L, 8L, 5L, 4L), Group = c(1L, 2L,
2L, 3L, 4L, 5L, 5L)), class = "data.frame", row.names = c(NA,
-7L))
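An equivalent way to think about it is in terms of group sizes. The following is a rough sketch, assuming the same data; dat and group_size are names chosen here for illustration. It looks up each row's group count in a frequency table and recodes the groups of size one.

```r
dat <- data.frame(Observation = 1:7,
                  X1 = c(2L, 6L, 8L, 1L, 2L, 7L, 2L),
                  X2 = c(4L, 3L, 4L, 3L, 8L, 5L, 4L),
                  Group = c(1L, 2L, 2L, 3L, 4L, 5L, 5L))

# Count rows per group, then look up the count for every row by group label;
# as.vector() drops the table attributes so a plain integer vector remains
group_size <- as.vector(table(dat$Group)[as.character(dat$Group)])

# Groups of size 1 become "Unaffiliated"; the column becomes character
dat$Group <- ifelse(group_size == 1, "Unaffiliated", as.character(dat$Group))
dat
```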
unaffil takes a vector of Group numbers and returns "Unaffiliated" if it has one element and otherwise returns the input. We can then apply it by Group using ave. This does not overwrite the input. No packages are used, but if you use dplyr then transform can be replaced with mutate.
unaffil <- function(x) if (length(x) == 1) "Unaffiliated" else x
transform(dat, Group = ave(Group, Group, FUN = unaffil))
giving
Observation X1 X2 Group
1 1 2 4 Unaffiliated
2 2 6 3 2
3 3 8 4 2
4 4 1 3 Unaffiliated
5 5 2 8 Unaffiliated
6 6 7 5 5
7 7 2 4 5
Note
dat <- structure(list(Observation = 1:7, X1 = c(2L, 6L, 8L, 1L, 2L,
7L, 2L), X2 = c(4L, 3L, 4L, 3L, 8L, 5L, 4L), Group = c(1L, 2L,
2L, 3L, 4L, 5L, 5L)), class = "data.frame", row.names = c(NA,
-7L))
One way could be to group first, then check whether the maximum row number within each group is 1, and finish with an ifelse:
library(dplyr)
df %>%
group_by(Group) %>%
mutate(Group = ifelse(max(row_number()) == 1, "Unaffiliated", as.character(Group))) %>%
ungroup()
Observation X1 X2 Group
<int> <int> <int> <chr>
1 1 2 4 Unaffiliated
2 2 6 3 2
3 3 8 4 2
4 4 1 3 Unaffiliated
5 5 2 8 Unaffiliated
6 6 7 5 5
7 7 2 4 5
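A variation on the same idea that avoids modifying the grouping column inside a grouped mutate is to attach the group size with add_count() first. This is a sketch under the assumption that df holds the question's data; group_size and res are illustrative names.

```r
library(dplyr)

df <- data.frame(Observation = 1:7,
                 X1 = c(2L, 6L, 8L, 1L, 2L, 7L, 2L),
                 X2 = c(4L, 3L, 4L, 3L, 8L, 5L, 4L),
                 Group = c(1L, 2L, 2L, 3L, 4L, 5L, 5L))

res <- df %>%
  # Add a column holding the number of rows in each Group
  add_count(Group, name = "group_size") %>%
  # Recode groups that occur only once
  mutate(Group = if_else(group_size == 1, "Unaffiliated", as.character(Group))) %>%
  select(-group_size)

res
```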
The code below computes the mode across the columns Method1, Method2, Method3 and Method4. However, notice that alternatives 10 and 12 have the same mode value, 2. I would like my Mode column to hold distinct values, as if it were a rank. The alternative with Mode = 1 is the best, but I have no way of knowing the second-best alternative, because two rows have 2 in the Mode column. Do you have suggestions on what approach I can take?
database<-structure(list(Alternatives = c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
Method1 = c(1L, 10L, 7L, 8L, 9L, 6L, 5L, 3L, 4L, 2L), Method2 = c(1L,
8L, 6L, 7L, 10L, 9L, 4L, 2L, 3L, 5L), Method3 = c(1L,
10L, 7L, 8L, 9L, 6L, 4L, 2L, 3L, 5L), Method4 = c(1L,
9L, 6L, 7L, 10L, 8L, 5L, 3L, 4L, 2L)), class = "data.frame", row.names = c(NA,
10L))
ModeFunc <- function(Vec) {
tmp <- sort(table(Vec),decreasing = TRUE)
Nms <- names(tmp)
if(max(tmp) > 1) {
as.numeric(Nms[1])
} else NA}
library(dplyr)
output <- database |> rowwise() |>
mutate(Mode = ModeFunc(c_across(Method1:Method4))) |>
data.frame()
> output
Alternatives Method1 Method2 Method3 Method4 Mode
1 3 1 1 1 1 1
2 4 10 8 10 9 10
3 5 7 6 7 6 6
4 6 8 7 8 7 7
5 7 9 10 9 10 9
6 8 6 9 6 8 6
7 9 5 4 4 5 4
8 10 3 2 2 3 2
9 11 4 3 3 4 3
10 12 2 5 5 2 2
CHECK
output$Rank <- (nrow(output) + 1) - rank(-output$Mode, ties.method = "last")
output|>
arrange(Mode)
Alternatives Method1 Method2 Method3 Method4 Mode Rank
1 3 1 1 1 1 1 1
2 10 3 2 2 3 2 2
3 12 2 5 5 2 2 3
4 11 4 3 3 4 3 4
5 9 5 4 4 5 4 5
6 5 7 6 7 6 6 6
7 8 6 9 6 8 6 7
8 6 8 7 8 7 7 8
9 7 9 10 9 10 9 9
10 4 10 8 10 9 10 10
OK. Based on OP's comment above, here's a solution that picks the row with the lowest value of Alternatives in case of ties. You can generalise to any other tie break with an appropriate modification of the second mutate.
output |>
arrange(Mode) |> # Sort by mode
group_by(Mode) |> # Assign initial ranks
mutate(Rank=cur_group_id()) |>
arrange(Rank, Alternatives) |> # Sort and assign tie break
mutate(TieBreak=row_number()) |>
ungroup()
# A tibble: 10 × 8
Alternatives Method1 Method2 Method3 Method4 Mode Rank TieBreak
<dbl> <int> <int> <int> <int> <dbl> <int> <int>
1 3 1 1 1 1 1 1 1
2 10 3 2 2 3 2 2 1
3 12 2 5 5 2 2 2 2
4 11 4 3 3 4 3 3 1
5 9 5 4 4 5 4 4 1
6 5 7 6 7 6 6 5 1
7 8 6 9 6 8 6 5 2
8 6 8 7 8 7 7 6 1
9 7 9 10 9 10 9 7 1
10 4 10 8 10 9 10 8 1
Note that cur_group_id() requires dplyr v1.0.0 or later and that row_number() takes account of groups when a data frame is grouped.
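If a single rank column without ties is all that is needed, a base R sketch along these lines might also work: sort by Mode, break ties by Alternatives, and invert the resulting permutation. Only the Alternatives and Mode columns from the output above are reproduced here, and ord is an illustrative name.

```r
# Alternatives and Mode columns as shown in the question's output
output <- data.frame(
  Alternatives = c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  Mode = c(1, 10, 6, 7, 9, 6, 4, 2, 3, 2)
)

# Row order when sorting by Mode, with ties broken by the lower Alternatives value
ord <- order(output$Mode, output$Alternatives)

# order() of a permutation gives its inverse, i.e. the rank of each original row
output$Rank <- order(ord)

output[ord, ]
```

Swapping output$Alternatives for another column in the order() call would change the tie-break rule.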
The code below computes the mode of the values obtained by Methods 1, 2, 3 and 4. For some alternatives the mode is correct, for example alternatives 3 and 4, but for others it is not: alternative 5 has two values of 7 and two values of 6, yet the mode shown is 6. Furthermore, alternatives 11 and 12 have no mode value, because every method gives a different value. For these cases, that is, when two values are tied for the same alternative or when there is no mode at all, I would like the value obtained by Method 1 to be used as the mode. I have inserted the correct output below.
Executable code below:
database<-structure(list(Alternatives = c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
Method1 = c(1L, 10L, 7L, 8L, 9L, 6L, 5L, 3L, 4L, 2L), Method2 = c(1L,
8L, 6L, 7L, 10L, 9L, 4L, 2L, 5L, 3L), Method3 = c(1L,
10L, 7L, 8L, 9L, 6L, 4L, 2L, 3L, 5L), Method4 = c(1L,
9L, 6L, 7L, 10L, 8L, 5L, 3L, 2L, 4L)), class = "data.frame", row.names = c(NA,
10L))
ModeFunc <- function(Vec) {
tmp <- sort(table(Vec),decreasing = TRUE)
Nms <- names(tmp)
if(max(tmp) > 1) {
as.numeric(Nms[1])
} else NA}
library(dplyr)
output <- database |> rowwise() |>
mutate(Mode = ModeFunc(c_across(Method1:Method4))) |>
data.frame()
> output
Alternatives Method1 Method2 Method3 Method4 Mode
1 3 1 1 1 1 1
2 4 10 8 10 9 10
3 5 7 6 7 6 6
4 6 8 7 8 7 7
5 7 9 10 9 10 9
6 8 6 9 6 8 6
7 9 5 4 4 5 4
8 10 3 2 2 3 2
9 11 4 5 3 2 NA
10 12 2 3 5 4 NA
The correct output would then be:
Alternatives Method1 Method2 Method3 Method4 Mode
3 1 1 1 1 1
4 10 8 10 9 10
5 7 6 7 6 7
6 8 7 8 7 8
7 9 10 9 10 9
8 6 9 6 8 6
9 5 4 4 5 5
10 3 2 2 3 3
11 4 5 3 2 4
12 2 3 5 4 2
You could use a conventional mode() function that returns all tied values (note that defining it masks base R's mode()):
mode <- function(x) {
ux <- unique(x)
tb <- tabulate(match(x, ux))
ux[tb == max(tb)]
}
and then fall back to Method1 wherever it returns more than one value, using ifelse in mapply:
mds <- apply(database[-1], 1, mode) |> setNames(database$Alternatives)
mapply(\(x, y) ifelse(length(x) > 1, y, x), mds, database$Method1)
# 3 4 5 6 7 8 9 10 11 12
# 1 10 7 8 9 6 5 3 4 2
So, altogether it could look like this:
database |>
cbind(Mode=mapply(\(x, y) ifelse(length(x) > 1, y, x),
apply(database[-1], 1, mode),
database$Method1))
# Alternatives Method1 Method2 Method3 Method4 Mode
# 1 3 1 1 1 1 1
# 2 4 10 8 10 9 10
# 3 5 7 6 7 6 7
# 4 6 8 7 8 7 8
# 5 7 9 10 9 10 9
# 6 8 6 9 6 8 6
# 7 9 5 4 4 5 5
# 8 10 3 2 2 3 3
# 9 11 4 5 3 2 4
# 10 12 2 3 5 4 2
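The same rule can also be folded into a single helper, which avoids masking base R's mode(). This is a sketch only: mode_or_first is a hypothetical name, and it assumes the first element of each row vector is the Method1 value, which holds when the Alternatives column is dropped as below.

```r
database <- data.frame(
  Alternatives = c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  Method1 = c(1L, 10L, 7L, 8L, 9L, 6L, 5L, 3L, 4L, 2L),
  Method2 = c(1L, 8L, 6L, 7L, 10L, 9L, 4L, 2L, 5L, 3L),
  Method3 = c(1L, 10L, 7L, 8L, 9L, 6L, 4L, 2L, 3L, 5L),
  Method4 = c(1L, 9L, 6L, 7L, 10L, 8L, 5L, 3L, 2L, 4L)
)

# Hypothetical helper: return the unique most frequent value,
# or the first element (Method1) when several values are tied
mode_or_first <- function(x) {
  counts <- table(x)
  top <- names(counts)[counts == max(counts)]
  if (length(top) == 1) as.numeric(top) else x[[1]]
}

# Apply row by row over the Method columns only
database$Mode <- unname(apply(database[-1], 1, mode_or_first))
database
```

A row with no repeated value is treated as a tie among all of its values, so it also falls back to Method1.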
I am trying to drop observations in R from my dataset. I need each Person_ID to have wave 0 AND (wave 1 OR wave 3 OR wave 6 OR wave 12 OR wave 18). Can someone help me?
Initial dataset
Person_ID wave
1 0
1 1
1 3
1 6
1 12
1 18
2 0
3 0
3 1
4 6
4 12
Wanted result
Person_ID wave
1 0
1 1
1 3
1 6
1 12
1 18
3 0
3 1
Thanks!
You can do a grouped filter. We keep a person if both 0 and any of 1, 3, 6, 12, 18 are in their corresponding wave values.
library(tidyverse)
tbl <- read_table2(
"Person_ID wave
1 0
1 1
1 3
1 6
1 12
1 18
2 0
3 0
3 1
4 6
4 12"
)
tbl %>%
group_by(Person_ID) %>%
filter(0 %in% wave, any(c(1, 3, 6, 12, 18) %in% wave))
#> # A tibble: 8 x 2
#> # Groups: Person_ID [2]
#> Person_ID wave
#> <dbl> <dbl>
#> 1 1 0
#> 2 1 1
#> 3 1 3
#> 4 1 6
#> 5 1 12
#> 6 1 18
#> 7 3 0
#> 8 3 1
Created on 2019-03-25 by the reprex package (v0.2.1)
We can also do this in base R
df1[with(df1, Person_ID %in% intersect(Person_ID[wave %in% c(1, 3, 6, 12, 18)],
Person_ID[!wave])),]
# Person_ID wave
#1 1 0
#2 1 1
#3 1 3
#4 1 6
#5 1 12
#6 1 18
#8 3 0
#9 3 1
data
df1 <- structure(list(Person_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L,
3L, 4L, 4L), wave = c(0L, 1L, 3L, 6L, 12L, 18L, 0L, 0L, 1L, 6L,
12L)), class = "data.frame", row.names = c(NA, -11L))
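Another base R option, sketched here under the assumption that df1 is the data above, is to compute the two conditions per person with ave and combine them. has_baseline, has_followup and res are names introduced for illustration.

```r
df1 <- data.frame(
  Person_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 4L, 4L),
  wave = c(0L, 1L, 3L, 6L, 12L, 18L, 0L, 0L, 1L, 6L, 12L)
)

# TRUE for every row of a person who has a wave 0 record
has_baseline <- ave(df1$wave == 0, df1$Person_ID, FUN = any)

# TRUE for every row of a person who has at least one follow-up wave
has_followup <- ave(df1$wave %in% c(1, 3, 6, 12, 18), df1$Person_ID, FUN = any)

res <- df1[has_baseline & has_followup, ]
res
```

Keeping the two flags separate makes it straightforward to inspect which condition excluded a given person.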
R Community: I am trying to create a new variable based on the values of an existing variable, not on a row-wise basis but on a group-wise basis. I'm trying to create max.var and min.var below based on old.var without collapsing or aggregating the rows, that is, preserving all the id rows:
id old.var min.var max.var
1 1 1 3
1 2 1 3
1 3 1 3
2 5 5 11
2 7 5 11
2 9 5 11
2 11 5 11
3 3 3 4
3 4 3 4
structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L),
old.var = c(1L, 2L, 3L, 5L, 7L, 9L, 11L, 3L, 4L),
min.var = c(1L, 1L, 1L, 5L, 5L, 5L, 5L, 3L, 3L),
max.var = c(3L, 3L, 3L, 11L, 11L, 11L, 11L, 4L, 4L)),
.Names = c("id", "old.var", "min.var", "max.var"),
class = "data.frame", row.names = c(NA, -9L))
I've tried using the aggregate and by functions, but they of course summarize the data. I haven't had much luck trying an Excel-like MATCH/INDEX approach either. Thanks in advance for your assistance!
You can use dplyr:
library(dplyr)
df %>%
group_by(id) %>%
mutate(min.var = min(old.var), max.var = max(old.var))
#Source: local data frame [9 x 4]
#Groups: id [3]
# id old.var min.var max.var
# (int) (int) (int) (int)
#1 1 1 1 3
#2 1 2 1 3
#3 1 3 1 3
#4 2 5 5 11
#5 2 7 5 11
#6 2 9 5 11
#7 2 11 5 11
#8 3 3 3 4
#9 3 4 3 4
Using ave as docendo discimus pointed out in the question's comments:
df$min.var <- ave(df$old.var, df$id, FUN = min)
df$max.var <- ave(df$old.var, df$id, FUN = max)
Output:
id old.var min.var max.var
1 1 1 1 3
2 1 2 1 3
3 1 3 1 3
4 2 5 5 11
5 2 7 5 11
6 2 9 5 11
7 2 11 5 11
8 3 3 3 4
9 3 4 3 4
We can use data.table
library(data.table)
setDT(df1)[, c('min.var', 'max.var') := list(min(old.var), max(old.var)) , by = id]
df1
# id old.var min.var max.var
#1: 1 1 1 3
#2: 1 2 1 3
#3: 1 3 1 3
#4: 2 5 5 11
#5: 2 7 5 11
#6: 2 9 5 11
#7: 2 11 5 11
#8: 3 3 3 4
#9: 3 4 3 4