I have an R data frame with an ID column and multiple records per ID. When the flag is set to 1 for an ID, I want to create a new column, New_Timepoint, that starts at 1 on the flagged row and then follows the original 6-unit spacing (1, 6, 12, ...). How can I achieve this in R using dplyr?
Below is a sample data frame
ID  Timepoint  Flag
A   0          0
A   6          0
A   12         0
A   18         1
A   24         0
A   30         0
A   36         0
Expected Dataframe
ID  Timepoint  Flag  New_Timepoint
A   0          0
A   6          0
A   12         0
A   18         1     1
A   24         0     6
A   30         0     12
A   36         0     18
One option is to group by 'ID' and take the lag of 'Timepoint', with n set to the position of the row where 'Flag' is 1, minus 1. The 0 in 'Timepoint' is first replaced by 1 so that the new timeline starts at 1
library(dplyr)
df1 %>%
  group_by(ID) %>%
  mutate(New_Timepoint = dplyr::lag(replace(Timepoint, !Timepoint, 1),
                                    n = which(Flag == 1) - 1)) %>%
  ungroup()
-output
# A tibble: 7 x 4
# ID Timepoint Flag New_Timepoint
# <chr> <int> <int> <dbl>
#1 A 0 0 NA
#2 A 6 0 NA
#3 A 12 0 NA
#4 A 18 1 1
#5 A 24 0 6
#6 A 30 0 12
#7 A 36 0 18
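Or use a double cumsum to create the index (the 0 in 'Timepoint' is again replaced by 1 so that the new timeline starts at 1 rather than 0)
df1 %>%
  group_by(ID) %>%
  mutate(New_Timepoint = replace(Timepoint, !Timepoint, 1)[na_if(cumsum(cumsum(Flag)), 0)]) %>%
  ungroup()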
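To see why the double cumsum produces a usable index, it can help to look at the intermediate vectors for a single ID. The following is a minimal base R sketch on the sample values; the names once, twice, idx, raw and shifted are only illustrative. Note that indexing the raw 'Timepoint' yields 0 on the flagged row, which is why the 0 is swapped for 1 to match the expected output.

```r
# Sample values for one ID (copied from the question)
Timepoint <- c(0L, 6L, 12L, 18L, 24L, 30L, 36L)
Flag      <- c(0L, 0L, 0L, 1L, 0L, 0L, 0L)

once  <- cumsum(Flag)                    # 0 0 0 1 1 1 1: 1 from the flagged row onward
twice <- cumsum(once)                    # 0 0 0 1 2 3 4: position counted from the flag
idx   <- replace(twice, twice == 0, NA)  # rows before the flag become NA

raw     <- Timepoint[idx]                               # flagged row picks up Timepoint[1] = 0
shifted <- replace(Timepoint, Timepoint == 0, 1L)[idx]  # same lookup with 0 swapped for 1
```

Inside a grouped mutate() the same steps run once per ID, so each ID gets its own restart point.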
data
df1 <- structure(list(ID = c("A", "A", "A", "A", "A", "A", "A"),
Timepoint = c(0L,
6L, 12L, 18L, 24L, 30L, 36L),
Flag = c(0L, 0L, 0L, 1L, 0L, 0L,
0L)), class = "data.frame", row.names = c(NA, -7L))
Another dplyr option
df %>%
  group_by(ID) %>%
  mutate(New_Timepoint = pmax(1, Timepoint - c(NA, Timepoint[Flag == 1])[cumsum(Flag) + 1])) %>%
  ungroup()
gives
ID Timepoint Flag New_Timepoint
<chr> <int> <int> <dbl>
1 A 0 0 NA
2 A 6 0 NA
3 A 12 0 NA
4 A 18 1 1
5 A 24 0 6
6 A 30 0 12
7 A 36 0 18
I have a df which looks like this
ID X003-APP X005-APP X008-APP X003-COP X004-COP X008-PIN X009-PIN
363 NA NA 1 0 NA 4 5
364 0 2 NA 1 5 1 5
678 0 NA NA 5 NA NA NA
713 1 1 1 1 1 1 1
219 1 2 3 3 NA 4 5
234 NA NA NA 2 3 NA NA
321 2 3 1 NA NA 1 2
I am interested in minimum counts for non-null values across the column substrings APP, COP and PIN. My required output is:
ID APP COP PIN
363 1 1 1
364 1 1 1
678 1 1 0
713 1 1 1
219 1 1 1
234 0 1 0
321 1 0 1
For reference, I am sharing the dput():
structure(list(ID = c(363L, 364L, 678L, 713L, 219L, 234L, 321L),
X003.APP = c(NA, 0L, 0L, 1L, 1L, NA, 2L),
X005.APP = c(NA, 2L, NA, 1L, 2L, NA, 3L),
X008.APP = c(1L, NA, NA, 1L, 3L, NA, 1L),
X003.COP = c(0L, 1L, 5L, 1L, 3L, 2L, NA),
X004.COP = c(NA, 5L, NA, 1L, NA, 3L, NA),
X008.PIN = c(4L, 1L, NA, 1L, 4L, NA, 1L),
X009.PIN = c(5L, 5L, NA, 1L, 5L, NA, 2L)),
class = "data.frame", row.names = c(NA, -7L))
Edit:
Later on, I would like to analyse sequences of length 2 and 3 across IDs. Ultimately, I am interested in the minimum counts of non-null values across combinations of the column substrings APP, COP and PIN. My required output for a sequence of length 2 would be:
Spec_1 Spec_2 Counts
APP COP 5
APP PIN 5
COP PIN 4
Or correspondingly, my required output for a sequence of length 3 would be:
Spec_1 Spec_2 Spec_3 Counts
APP COP PIN 4
Is there an easy way to achieve this? It would be great to have a solution that could cater for longer sequences - even beyond 3. Thank you very much for your time.
You may try
library(reshape2)
library(tidyverse)
df %>%
  reshape2::melt(id = "ID") %>%
  separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
  group_by(ID, Spec) %>%
  summarize(value = as.numeric(any(!is.na(value)))) %>%
  filter(value == 1) %>%
  pivot_wider(names_from = "Spec", values_from = "value") %>%
  replace(is.na(.), 0)
ID APP COP PIN
<int> <dbl> <dbl> <dbl>
1 219 1 1 1
2 234 0 1 0
3 321 1 0 1
4 363 1 1 1
5 364 1 1 1
6 678 1 1 0
7 713 1 1 1
The above gives the 0/1 table per ID. For the counts of sequences of length 2:
df %>%
  reshape2::melt(id = "ID") %>%
  separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
  group_by(ID, Spec) %>%
  summarize(value = any(!is.na(value))) %>%
  filter(value) %>%
  group_by(ID) %>%
  filter(n() > 1) %>%
  summarise(Spec = combn(Spec, 2, simplify = FALSE)) %>%
  unnest_wider(Spec, names_sep = "_") %>%
  group_by(Spec_1, Spec_2) %>%
  summarize(Counts = n())
Spec_1 Spec_2 Counts
<chr> <chr> <int>
1 APP COP 5
2 APP PIN 5
3 COP PIN 4
For sequences of length 3, the same pipeline works with combn(Spec, 3) and a filter on IDs that have more than 2 specs:
df %>%
  reshape2::melt(id = "ID") %>%
  separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
  group_by(ID, Spec) %>%
  summarize(value = any(!is.na(value))) %>%
  filter(value) %>%
  group_by(ID) %>%
  filter(n() > 2) %>%
  summarise(Spec = combn(Spec, 3, simplify = FALSE)) %>%
  unnest_wider(Spec, names_sep = "_") %>%
  group_by(Spec_1, Spec_2, Spec_3) %>%
  summarize(Counts = n())
Spec_1 Spec_2 Spec_3 Counts
<chr> <chr> <chr> <int>
1 APP COP PIN 4
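Since the question asks for something that also handles sequences longer than 3, here is a hedged base R sketch that takes the sequence length as a parameter. It is not the answerer's code: the helper count_combos() and the intermediate names specs and present are hypothetical, and it assumes the column names follow the "prefix.SPEC" pattern from the dput().

```r
# Sample data copied from the question's dput()
df <- structure(list(ID = c(363L, 364L, 678L, 713L, 219L, 234L, 321L),
  X003.APP = c(NA, 0L, 0L, 1L, 1L, NA, 2L),
  X005.APP = c(NA, 2L, NA, 1L, 2L, NA, 3L),
  X008.APP = c(1L, NA, NA, 1L, 3L, NA, 1L),
  X003.COP = c(0L, 1L, 5L, 1L, 3L, 2L, NA),
  X004.COP = c(NA, 5L, NA, 1L, NA, 3L, NA),
  X008.PIN = c(4L, 1L, NA, 1L, 4L, NA, 1L),
  X009.PIN = c(5L, 5L, NA, 1L, 5L, NA, 2L)),
  class = "data.frame", row.names = c(NA, -7L))

# Spec suffix of every non-ID column ("APP", "COP", "PIN")
specs <- sub("^[^.]+\\.", "", names(df)[-1])

# Logical matrix: does each ID have at least one non-NA value per spec?
present <- sapply(split(names(df)[-1], specs),
                  function(cols) rowSums(!is.na(df[cols])) > 0)

# Hypothetical helper: count IDs in which all k specs of a combination are present
count_combos <- function(present, k) {
  combos <- combn(colnames(present), k, simplify = FALSE)
  counts <- vapply(combos,
                   function(s) sum(rowSums(present[, s, drop = FALSE]) == k),
                   integer(1))
  data.frame(Specs = vapply(combos, paste, character(1), collapse = "_"),
             Counts = counts, stringsAsFactors = FALSE)
}

pairs   <- count_combos(present, 2)
triples <- count_combos(present, 3)
```

Calling count_combos(present, k) with a larger k should extend to longer sequences as long as there are at least k distinct specs.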
Try this using dplyr
library(dplyr)
df |> rowwise() |> transmute( ID,
APP = case_when(all(is.na(c_across(contains("APP")))) ~ 0 , TRUE ~ 1) ,
COP = case_when(all(is.na(c_across(contains("COP")))) ~ 0 , TRUE ~ 1) ,
PIN = case_when(all(is.na(c_across(contains("PIN")))) ~ 0 , TRUE ~ 1)) -> df1
output
# A tibble: 7 × 4
# Rowwise:
ID APP COP PIN
<int> <dbl> <dbl> <dbl>
1 363 1 1 1
2 364 1 1 1
3 678 1 1 0
4 713 1 1 1
5 219 1 1 1
6 234 0 1 0
7 321 1 0 1
For your second requirement you can use
df1 |> transmute(AC = case_when(sum(c_across(c(APP,COP))) == 2 ~ 1 , TRUE ~ 0) ,
AP = case_when(sum(c_across(c(APP,PIN))) == 2 ~ 1 , TRUE ~ 0) ,
CP = case_when(sum(c_across(c(PIN,COP))) == 2 ~ 1 , TRUE ~ 0) ,
ACP = case_when(sum(c_across(c(APP,COP,PIN))) == 3 ~ 1 , TRUE ~ 0)) |> ungroup() |>
summarise(APP_COP = sum(AC) , APP_PIN = sum(AP) , COP_PIN = sum(CP) , APP_COP_PIN = sum(ACP))
output
# A tibble: 1 × 4
APP_COP APP_PIN COP_PIN APP_COP_PIN
<dbl> <dbl> <dbl> <dbl>
1 5 5 4 4
dfin <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 1 0 20 20
1 2 1 20 20
Per study and ID, for those who have duplicate CYCLE == 0 values, remove the row that had the higher TIME.
dfout <-
STUDY ID CYCLE TIME VALUE
1 1 0 10 50
1 2 1 20 20
Using RStudio.
An option is to do a group by 'STUDY', 'ID' and filter out the duplicated 0 values in 'CYCLE'
library(dplyr)
dfin %>%
arrange(STUDY, ID, TIME) %>%
group_by(STUDY, ID) %>%
filter(!(duplicated(CYCLE) & CYCLE == 0))
# A tibble: 2 x 5
# Groups: STUDY, ID [2]
# STUDY ID CYCLE TIME VALUE
# <int> <int> <int> <int> <int>
#1 1 1 0 10 50
#2 1 2 1 20 20
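Also, if there are many duplicates for 0 and we want to remove only the row where 'TIME' is also the max (note that max(TIME) is taken over the whole group here, so a CYCLE == 0 row is dropped whenever it holds the largest 'TIME' in its group)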
dfin %>%
group_by(STUDY, ID) %>%
filter(!(TIME == max(TIME) & CYCLE == 0))
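Because the filter above compares against the largest 'TIME' of the whole group, a group with a single CYCLE == 0 row could lose that row. A more guarded variant can be sketched in base R as follows; the extra fourth row (ID 3) is hypothetical data added only to show that case, and the names n_zero, max_zero and to_drop are illustrative.

```r
# Sample data from the question plus one hypothetical row (ID 3) with a lone CYCLE == 0
dfin <- data.frame(STUDY = c(1L, 1L, 1L, 1L),
                   ID    = c(1L, 1L, 2L, 3L),
                   CYCLE = c(0L, 0L, 1L, 0L),
                   TIME  = c(10L, 20L, 20L, 30L),
                   VALUE = c(50L, 20L, 20L, 5L))

# Number of CYCLE == 0 rows in each STUDY/ID group
n_zero <- ave(as.integer(dfin$CYCLE == 0), dfin$STUDY, dfin$ID, FUN = sum)

# Largest TIME among the CYCLE == 0 rows of each group (NA if the group has none)
zero_time <- ifelse(dfin$CYCLE == 0, dfin$TIME, NA_integer_)
max_zero  <- ave(zero_time, dfin$STUDY, dfin$ID,
                 FUN = function(t) if (all(is.na(t))) NA_integer_ else max(t, na.rm = TRUE))

# Drop a row only if it is a CYCLE == 0 duplicate with the highest TIME among them
to_drop <- dfin$CYCLE == 0 & n_zero > 1 & dfin$TIME == max_zero
out <- dfin[!to_drop, ]
```

Here the second row is removed, while the lone CYCLE == 0 row of ID 3 is kept.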
Or using base R
dfin1 <- dfin[do.call(order, dfin[c("STUDY", "ID", "TIME")]), ]
dfin1[!(duplicated(dfin1[1:3]) & dfin1$CYCLE == 0), ]
# STUDY ID CYCLE TIME VALUE
#1 1 1 0 10 50
#3 1 2 1 20 20
data
dfin <- structure(list(STUDY = c(1L, 1L, 1L), ID = c(1L, 1L, 2L), CYCLE = c(0L,
0L, 1L), TIME = c(10L, 20L, 20L), VALUE = c(50L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-3L))
For the following data - I would like to count the number of students per class each year.
Class Students Gender Height Year_1999 Year_2000 Year_2001 Year_2002
1 Mark M 180 80 54 22 12
2 John M 234 0 59 32 62
1 Tom M 124 0 53 26 12
2 Jane F 180 80 54 22 0
3 Kim F 140 0 2 3 32
The output should be
Class Year_1999 Year_2000 Year_2001 Year_2002
1 1 2 2 2
2 1 2 2 1
3 0 1 1 1
I tried the following but didn't have much luck
Number_obs = df %>%
group_by(class) %>%
summarise(count=n())
We can use summarise_at in dplyr. After grouping by 'Class', summarise_at loops through the columns whose names match 'Year' and, for each one, counts the values that are not equal to 0
library(dplyr)
df1 %>%
group_by(Class) %>%
summarise_at(vars(matches("Year")), list(~ sum(as.logical(.))))
# A tibble: 3 x 5
# Class Year_1999 Year_2000 Year_2001 Year_2002
# <int> <int> <int> <int> <int>
#1 1 1 2 2 2
#2 2 1 2 2 1
#3 3 0 1 1 1
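In dplyr 1.0.0 and later, summarise_at is superseded by across(). A minimal sketch of the same count with across(), assuming a recent dplyr is installed, could look like this (res is just an illustrative name):

```r
library(dplyr)

# Sample data from the question
df1 <- data.frame(Class = c(1L, 2L, 1L, 2L, 3L),
                  Students = c("Mark", "John", "Tom", "Jane", "Kim"),
                  Gender = c("M", "M", "M", "F", "F"),
                  Height = c(180L, 234L, 124L, 180L, 140L),
                  Year_1999 = c(80L, 0L, 0L, 80L, 0L),
                  Year_2000 = c(54L, 59L, 53L, 54L, 2L),
                  Year_2001 = c(22L, 32L, 26L, 22L, 3L),
                  Year_2002 = c(12L, 62L, 12L, 0L, 32L))

# Count the non-zero entries of every Year_* column within each Class
res <- df1 %>%
  group_by(Class) %>%
  summarise(across(starts_with("Year"), ~ sum(.x != 0)))
```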
Or we can gather into 'long' format, do the group_by operation on a single column and spread it to 'wide' format
library(tidyr)
df1 %>%
gather(key, val, matches("Year")) %>%
group_by(Class, key) %>%
summarise(val = sum(val != 0)) %>%
spread(key, val)
Or using data.table
library(data.table)
setDT(df1)[, lapply(.SD, function(x) sum(as.logical(x))), .(Class), .SDcols = 5:8]
Or using base R with aggregate
aggregate(.~ Class, df1[-(2:4)], function(x) sum(x != 0))
# Class Year_1999 Year_2000 Year_2001 Year_2002
#1 1 1 2 2 2
#2 2 1 2 2 1
#3 3 0 1 1 1
Or using rowsum
rowsum(+(!!df1[5:8]), df1$Class)
# Year_1999 Year_2000 Year_2001 Year_2002
#1 1 2 2 2
#2 1 2 2 1
#3 0 1 1 1
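The `+(!!x)` idiom in the rowsum call is compact but easy to misread. The sketch below spells out the same idea step by step on two of the year columns; the intermediate names nonzero and as_int are only illustrative.

```r
# Two of the year columns from the question's data
df1 <- data.frame(Class = c(1L, 2L, 1L, 2L, 3L),
                  Year_1999 = c(80L, 0L, 0L, 80L, 0L),
                  Year_2000 = c(54L, 59L, 53L, 54L, 2L))

nonzero <- df1[-1] != 0   # logical matrix: TRUE where a student has a non-zero value
as_int  <- +nonzero       # unary plus turns TRUE/FALSE into 1/0, keeping the matrix shape

# rowsum() adds up the rows of the matrix within each Class
counts <- rowsum(as_int, df1$Class)
```

The double negation in the original plays the same role as the explicit `!= 0` comparison here.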
Or using colSums
t(sapply(split(as.data.frame(df1[5:8] != 0), df1$Class), colSums))
data
df1 <- structure(list(Class = c(1L, 2L, 1L, 2L, 3L), Students = c("Mark",
"John", "Tom", "Jane", "Kim"), Gender = c("M", "M", "M", "F",
"F"), Height = c(180L, 234L, 124L, 180L, 140L), Year_1999 = c(80L,
0L, 0L, 80L, 0L), Year_2000 = c(54L, 59L, 53L, 54L, 2L), Year_2001 = c(22L,
32L, 26L, 22L, 3L),
Year_2002 = c(12L, 62L, 12L, 0L, 32L)), class = "data.frame",
row.names = c(NA,
-5L))
Similar to @akrun's colSums solution, using by.
do.call(rbind, by(df[5:8] > 0, df[1], colSums))
# Year_1999 Year_2000 Year_2001 Year_2002
# 1 1 2 2 2
# 2 1 2 2 1
# 3 0 1 1 1
or
Reduce(rbind, by(df[5:8] > 0, df[1], colSums))
# Year_1999 Year_2000 Year_2001 Year_2002
# init 1 2 2 2
# 1 2 2 1
# 0 1 1 1
do.call is faster.
Using dplyr, we can use summarise_at
library(dplyr)
df %>%
group_by(Class) %>%
summarise_at(vars(starts_with("Year")), ~sum(. != 0))
# Class Year_1999 Year_2000 Year_2001 Year_2002
# <int> <int> <int> <int> <int>
#1 1 1 2 2 2
#2 2 1 2 2 1
#3 3 0 1 1 1
Solution
I went with the solutions provided by @MauritsEvers and @akrun below.
Question
For a data frame, I want to keep only 1 column of each set of duplicate columns. In addition, the column that is kept takes on a name that is a concatenation of all column names in the set of duplicate columns. There are multiple sets of duplicate columns in the data frame. The data frame contains tens of thousands of columns, so using a for loop might take too much time.
I have tried a combination of duplicated(), summary(), aggregate(), lapply(), apply(), and for loops.
Input data frame (df_in):
0 1 2 3 4 5 6 7
0 1 0 0 1 0 1 1
0 1 0 1 1 0 0 0
1 0 1 0 0 1 1 0
Output data frame (df_out):
0-2-5 1-4 3 6 7
0 1 0 1 1
0 1 1 0 0
1 0 0 1 0
Here is an option with tidyverse. We gather the data into 'long' format, collapse the values of each 'key' into a single string, group by that 'value' string and paste the 'key' names together, then separate the rows of 'value' and spread back to 'wide' format to get the expected output
library(tidyverse)
gather(df_in) %>%
  group_by(key) %>%
  summarise(value = toString(value)) %>%
  group_by(value) %>%
  summarise(key = paste(key, collapse = "-")) %>%
  separate_rows(value) %>%
  group_by(key) %>%
  mutate(n = row_number()) %>%
  spread(key, value) %>%
  select(-n)
# A tibble: 3 x 5
# `0-2-5` `1-4` `3` `6` `7`
# <chr> <chr> <chr> <chr> <chr>
#1 0 1 0 1 1
#2 0 1 1 0 0
#3 1 0 0 1 0
Or another option with tidyverse would be
t(df_in) %>%
as.data.frame %>%
mutate(grp = group_indices(., V1, V2, V3)) %>%
mutate(rn = row_number() - 1) %>%
group_split(grp, keep = FALSE) %>%
map_dfc(~ .x %>%
mutate(rn = str_c(rn, collapse="-")) %>%
slice(1) %>%
gather(key, val, -rn) %>%
rename(!! .$rn[1] := val) %>%
select(ncol(.)))
# A tibble: 3 x 5
# `0-2-5` `3` `7` `6` `1-4`
# <int> <int> <int> <int> <int>
#1 0 0 1 1 1
#2 0 1 0 0 1
#3 1 0 0 1 0
Or we can also do this with data.table methods
library(data.table)
dcast(melt(as.data.table(t(df_in))[, grp := .GRP, .(V1, V2, V3)][,
c(.SD[1], cn = paste(.I-1, collapse="-")) , .(grp)],
id.var = c('cn', 'grp')), variable ~ cn, value.var = 'value')[,
variable := NULL][]
# 0-2-5 1-4 3 6 7
#1: 0 1 0 1 1
#2: 0 1 1 0 0
#3: 1 0 0 1 0
data
df_in <- structure(list(`0` = c(0L, 0L, 1L), `1` = c(1L, 1L, 0L), `2` = c(0L,
0L, 1L), `3` = c(0L, 1L, 0L), `4` = c(1L, 1L, 0L), `5` = c(0L,
0L, 1L), `6` = c(1L, 0L, 1L), `7` = c(1L, 0L, 0L)),
class = "data.frame", row.names = c(NA, -3L))
You can do the following in base R
Get indices of identical columns
idx <- split(seq_along(names(df)), apply(df, 2, paste, collapse = "_"))
Sort indices from low to high
idx <- idx[order(sapply(idx, function(x) x[1]))]
Set the names of idx to the concatenation of the column names
names(idx) <- sapply(idx, function(x) paste(names(df)[x], collapse = "_"))
Create final matrix
sapply(idx, function(x) df[, x[1]])
# col0_col2_col5 col1_col4 col3_col6 col7
#[1,] 0 1 1 1
#[2,] 0 1 0 0
#[3,] 1 0 1 0
Note that the resulting object is a matrix, so if you need a data.frame, simply wrap the result in as.data.frame().
Sample data
I've changed your sample data slightly to not have numbers as column names.
df <- read.table(text =
"col0 col1 col2 col3 col4 col5 col6 col7
0 1 0 1 1 0 1 1
0 1 0 0 1 0 0 0
1 0 1 1 0 1 1 0", header = T)
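The same idea (build one string key per column, then group columns by key) can also be sketched against the question's original df_in, keeping the numeric column names and the "-" separator from the requested output. This is an illustrative variant rather than the answerer's code; keys, groups and df_out are assumed names.

```r
# Input data copied from the question
df_in <- structure(list(`0` = c(0L, 0L, 1L), `1` = c(1L, 1L, 0L),
  `2` = c(0L, 0L, 1L), `3` = c(0L, 1L, 0L), `4` = c(1L, 1L, 0L),
  `5` = c(0L, 0L, 1L), `6` = c(1L, 0L, 1L), `7` = c(1L, 0L, 0L)),
  class = "data.frame", row.names = c(NA, -3L))

# One string per column; identical columns get identical keys
keys <- vapply(df_in, paste, character(1), collapse = "_")

# Column names grouped by key, in order of first appearance
groups <- split(names(df_in), factor(keys, levels = unique(keys)))

# Keep the first column of each set and rename it with the joined names
df_out <- df_in[!duplicated(keys)]
names(df_out) <- unname(vapply(groups, paste, character(1), collapse = "-"))
```

There is no explicit loop over column pairs here, only one pass to build the keys, which may matter with tens of thousands of columns.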
My input data is
df
anger sad joy happy trust disgust
1 1 0 1 2 3 0
2 2 0 0 2 0 3
3 2 2 1 1 1 1
4 0 1 1 1 0 1
I want output like this
mydata
anger sad joy happy trust disgust col
1 1 0 1 2 3 0 trust
2 2 0 0 2 0 3 disgust
I want to extract the column name of the maximum value in each row, but keep only those rows where the maximum occurs in exactly one column and discard every row where the maximum is shared by more than one column.
I tried this
d1 <- df[!apply(df[-1], 1, function(x) anyDuplicated(x[x == max(x)])),]
but I am getting this
anger sad joy happy trust disgust
1 1 0 1 2 3 0
2 2 0 0 2 0 3
3 2 2 1 1 1 1
I don't want third row in the output.
Thanks for help in advance.
We can use max.col to get the index of columns for each row after subsetting the rows
d1 <- mydata[!apply(mydata[-1], 1, anyDuplicated),]
d1$out <- names(d1)[-1][max.col(d1[-1], 'first')]
d1
# zone_id v1 v2 v3 v4 out
#1 1 12 15 18 20 v4
#3 3 31 28 14 2 v1
#4 4 12 16 9 5 v2
#5 5 5 18 10 12 v2
Update
If the OP wanted to remove only the duplicate values of max values, then replace the first line with
d1 <- mydata[!apply(mydata[-1], 1, function(x) anyDuplicated(x[x == max(x)])),]
Update2
Based on the new dataset from the OP, we don't need to remove the first column, as it is not an id column
d2 <- mydata1[!apply(mydata1, 1, function(x) anyDuplicated(x[x == max(x)])),]
d2$out <- names(d2)[max.col(d2, 'first')]
d2
# anger sad joy happy trust disgust out
#1 1 0 1 2 3 0 trust
#2 2 0 0 2 0 3 disgust
data
mydata1 <- structure(list(anger = c(1L, 2L, 2L, 0L), sad = c(0L, 0L, 2L,
1L), joy = c(1L, 0L, 1L, 1L), happy = c(2L, 2L, 1L, 1L), trust = c(3L,
0L, 1L, 0L), disgust = c(0L, 3L, 1L, 1L)), .Names = c("anger", "sad",
"joy", "happy", "trust", "disgust"), row.names = c(NA, 4L),
class = "data.frame")
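An equivalent way to express "the row maximum occurs exactly once" is to count, per row, how many cells equal the row maximum. The following is a small base R sketch on the same mydata1 values; m, row_max, n_at_max and out are illustrative names, not part of the answer above.

```r
# Same values as mydata1 in the answer
mydata1 <- data.frame(anger = c(1L, 2L, 2L, 0L), sad = c(0L, 0L, 2L, 1L),
                      joy = c(1L, 0L, 1L, 1L), happy = c(2L, 2L, 1L, 1L),
                      trust = c(3L, 0L, 1L, 0L), disgust = c(0L, 3L, 1L, 1L))

m        <- as.matrix(mydata1)
row_max  <- apply(m, 1, max)        # maximum of each row
n_at_max <- rowSums(m == row_max)   # how many columns reach that maximum
keep     <- n_at_max == 1           # rows with a unique maximum

out <- mydata1[keep, ]
out$col <- colnames(m)[max.col(m[keep, , drop = FALSE], ties.method = "first")]
```

Rows 3 and 4 are dropped because their maximum is shared by several columns.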
You can try:
library(dplyr)
library(tidyr)
mydata %>%
  select(-zone_id) %>%
  mutate(mx = do.call(pmax, (.))) %>%   # row-wise maximum
  select(mx) %>%
  cbind(mydata) %>%
  mutate(flg = rowSums(. == mx)) %>%    # counts 'mx' itself plus every column equal to it
  filter(flg == 2) %>%                  # 2 means exactly one column holds the maximum
  select(-flg) %>%
  gather(key = out, value = v, -mx, -zone_id) %>%
  filter(mx == v) %>%
  select(zone_id, mx, out) %>%
  left_join(mydata)
which gives:
zone_id mx out v1 v2 v3 v4
1 3 31 v1 31 28 2 2
2 4 16 v2 1 16 9 1
3 5 18 v2 5 18 10 12
4 1 20 v4 12 15 18 20