Row wise comparison of a dataframe in R

I have a data frame with multiple data points for each ID. When the Status value differs between two time points for an ID, I want to flag the first status change. How do I achieve that in R? Below is a sample dataset.
ID   Time  Status
ID1  0     X
ID1  6     X
ID1  12    Y
ID1  18    Z
Result dataset
ID   Time  Status  Flag
ID1  0     X
ID1  6     X
ID1  12    Y       1
ID1  18    Z

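Here is a base R solution with ave(). It creates a vector y that, within each ID, equals 1 whenever the current Status differs from the previous one. Because ave() returns a vector of the same type as Status, y is converted with as.integer() before Flag is computed with diff(), which marks the rows where y changes value. Note that diff() is applied to the whole column rather than per ID, so with several IDs the rows at group boundaries should be checked.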
y <- with(df1, ave(Status, ID, FUN = function(x) c(0, x[-1] != x[-length(x)])))
df1$Flag <- c(0, diff(as.integer(y)) != 0)
df1
# ID Time Status Flag
#1 ID1 0 X 0
#2 ID1 6 X 0
#3 ID1 12 Y 1
#4 ID1 18 Z 0
Data
df1 <- read.table(text = "
ID Time Status
ID1 0 X
ID1 6 X
ID1 12 Y
ID1 18 Z
", header = TRUE)

You can use mutate() with ifelse() and lag(), then use replace() to reset every Flag == 1 after the first one to 0:
library(dplyr)
df1 %>%
  group_by(ID) %>%
  mutate(Flag = ifelse(is.na(lag(Status)), 0,
                       as.integer(Time != lag(Time) & Status != lag(Status)))) %>%
  group_by(ID, Flag) %>%
  mutate(Flag = replace(Flag, Flag == lag(Flag) & Flag == 1, 0))
# A tibble: 4 x 4
# Groups: ID, Flag [2]
ID Time Status Flag
<fct> <int> <fct> <dbl>
1 ID1 0 X 0
2 ID1 6 X 0
3 ID1 12 Y 1
4 ID1 18 Z 0
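
Both answers above are demonstrated on a single ID. As a hedged base R sketch for data with several IDs, the helper below (first_change, a name introduced here for illustration) flags only the first Status change within each ID. The ID2 rows are made-up sample data, and the rows are assumed to be sorted by Time within each ID.

```r
# Sample data: ID1 is taken from the question, ID2 is invented for illustration.
df1 <- data.frame(
  ID     = c("ID1", "ID1", "ID1", "ID1", "ID2", "ID2", "ID2"),
  Time   = c(0, 6, 12, 18, 0, 6, 12),
  Status = c("X", "X", "Y", "Z", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 1 on the first row whose value differs from the
# previous row, 0 everywhere else.
first_change <- function(x) {
  changed <- c(FALSE, x[-1] != x[-length(x)])
  as.integer(changed & cumsum(changed) == 1)
}

# ave() passes the row indices of each ID to the function, so the
# comparison never crosses an ID boundary.
df1$Flag <- ave(seq_along(df1$Status), df1$ID,
                FUN = function(i) first_change(df1$Status[i]))
print(df1)
```

The cumsum() counter is what restricts the flag to the first change; dropping that condition would flag every change instead.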

Related

Difference between rows in long format for R based on other column variables

I have an R dataframe such as:
df <- data.frame(ID = rep(c(1, 1, 2, 2), 2), Condition = rep(c("A", "B"),4),
Variable = c(rep("X", 4), rep("Y", 4)),
Value = c(3, 5, 6, 6, 3, 8, 3, 6))
ID Condition Variable Value
1 1 A X 3
2 1 B X 5
3 2 A X 6
4 2 B X 6
5 1 A Y 3
6 1 B Y 8
7 2 A Y 3
8 2 B Y 6
I want to obtain the difference between each value of Condition (A - B) for each Variable and ID while keeping the long format. That would mean the value must appear every two rows, like this:
ID Condition Variable Value diff_value
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3
So far I have managed something similar with the dplyr package, but it fills in only the second row of each pair, so it does not work if I want to maintain the long format:
df %>%
  group_by(Variable, ID) %>%
  mutate(diff_value = lag(Value, default = Value[1]) - Value)
# A tibble: 8 x 5
# Groups: Variable, ID [4]
ID Condition Variable Value diff_value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 0
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 0
6 1 B Y 8 -5
7 2 A Y 3 0
8 2 B Y 6 -3

You don't need lag(); use diff() instead. This relies on each group having exactly two rows, ordered A then B:
df %>%
  group_by(Variable, ID) %>%
  mutate(diff = -diff(Value))
Output:
# A tibble: 8 x 5
# Groups: Variable, ID [4]
ID Condition Variable Value diff
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3

You don't need to create a lag variable; just use Value[Condition == "A"] - Value[Condition == "B"], as below:
df %>%
  group_by(ID, Variable) %>%
  mutate(diff_value = Value[Condition == "A"] - Value[Condition == "B"])
# A tibble: 8 x 5
# Groups: ID, Variable [4]
ID Condition Variable Value diff_value
<dbl> <chr> <chr> <dbl> <dbl>
1 1 A X 3 -2
2 1 B X 5 -2
3 2 A X 6 0
4 2 B X 6 0
5 1 A Y 3 -5
6 1 B Y 8 -5
7 2 A Y 3 -3
8 2 B Y 6 -3

This should work:
# Step one: create a new column of df, where we store the "Value" we need
# to subtract, as you required (same "ID", same "Variable", different
# "Condition").
temp.fun = function(x, dta)
{
  # Given a row x of dta, this function selects the value corresponding to the row
  # with same "ID", same "Variable" and different "Condition".
  # Notice that if "Condition" is not binary, we need to generalize this function.
  # Notice also that this function is super specific to your case, and that it has
  # been thought to be used within apply().
  # INPUTS:
  # - x, a row of a data frame.
  # - dta, the data frame (df, in your case).
  # OUTPUT:
  # - temp.corresponding, "Value" you want for each row.
  # Saving information.
  temp.id = as.numeric(x["ID"])
  temp.condition = as.character(x["Condition"])
  temp.variable = as.character(x["Variable"])
  # Index for selecting row.
  temp.row = dta$ID == temp.id & dta$Condition != temp.condition & dta$Variable == temp.variable
  # Selecting "Value".
  temp.corresponding = dta$Value[temp.row]
  return(temp.corresponding)
}
df$corr_value = apply(df, MARGIN = 1, FUN = temp.fun, dta = df)
# Step two: subtract to create the column "diff_value".
# Key: the order of subtraction depends on "Condition", so both rows hold A - B.
df$diff_value = NA
df$diff_value[df$Condition == "A"] = df$Value[df$Condition == "A"] - df$corr_value[df$Condition == "A"]
df$diff_value[df$Condition == "B"] = df$corr_value[df$Condition == "B"] - df$Value[df$Condition == "B"]
Notice that this solution only fits the specifics of your problem, and may be neither elegant nor efficient.
I wrote comments in the code to explain how this solution works. The idea is to first write the function temp.fun(), which operates on single rows: for each row we pass, it finds df$Value of the row satisfying the criteria you asked for (same ID, same Variable, different Condition). Then we use apply() to pass all rows to temp.fun(), thus creating a new column in df storing the Value mentioned above.
We are now ready to compute df$diff_value. First, we initialize the column with NA. Then we perform the subtraction. Be careful: the order of the operands depends on Condition, so that every row ends up holding A - B. When Condition equals A we compute df$Value - df$corr_value; when Condition equals B we compute df$corr_value - df$Value.
Final warning: if Condition is not binary, this solution must be generalized in order to work.
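
For comparison, the same A - B column can be sketched in base R with ave() instead of apply(). This is only an illustration under the assumption that every ID/Variable pair has exactly one A row and one B row; a_minus_b is a hypothetical helper name, not something from the answers above.

```r
df <- data.frame(ID = rep(c(1, 1, 2, 2), 2),
                 Condition = rep(c("A", "B"), 4),
                 Variable = c(rep("X", 4), rep("Y", 4)),
                 Value = c(3, 5, 6, 6, 3, 8, 3, 6))

# Hypothetical helper: i holds the row indices of one ID/Variable group.
# Assumes exactly one "A" row and one "B" row in the group.
a_minus_b <- function(i) {
  value <- df$Value[i]
  cond  <- df$Condition[i]
  rep(value[cond == "A"] - value[cond == "B"], length(i))
}

# ave() applies the helper per group and writes the result back in place,
# so the long format is preserved.
df$diff_value <- ave(seq_len(nrow(df)), df$ID, df$Variable, FUN = a_minus_b)
print(df)
```

Because the rows are picked by Condition rather than by position, the result does not depend on A appearing before B.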

Filter rows based on a ID column in R

I have a data frame with ID, Time, and Status columns. Each ID has multiple time points, each with an associated status. I want to keep only the IDs that have the same status at every time point. How can I achieve that with dplyr in R?
Below is a sample dataset
ID  Time  Status
A   1     X
A   2     X
A   3     Y
A   4     Z
B   1     X
B   2     X
B   3     X
C   1     Z
C   2     Z
D   1     X
E   1     X
E   2     Y
Expected Dataframe
ID  Time  Status
B   1     X
B   2     X
B   3     X
C   1     Z
C   2     Z
D   1     X

Does this work:
library(dplyr)
df %>% group_by(ID) %>% filter(length(unique(Status)) == 1)
# A tibble: 6 x 3
# Groups: ID [3]
ID Time Status
<chr> <dbl> <chr>
1 B 1 X
2 B 2 X
3 B 3 X
4 C 1 Z
5 C 2 Z
6 D 1 X

We can use data.table:
library(data.table)
setDT(df)[, .SD[uniqueN(Status)==1], ID]
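
The same filter can also be sketched in base R without any packages. The data frame below re-creates the sample data from the question; n_status and keep are names introduced here for illustration.

```r
df <- data.frame(
  ID     = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "E", "E"),
  Time   = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 1, 1, 2),
  Status = c("X", "X", "Y", "Z", "X", "X", "X", "Z", "Z", "X", "X", "Y"),
  stringsAsFactors = FALSE
)

# Count the distinct Status values per ID.
n_status <- tapply(df$Status, df$ID, function(s) length(unique(s)))

# Keep the IDs with a single distinct Status, then subset the rows.
keep <- names(n_status)[n_status == 1]
res <- df[df$ID %in% keep, ]
print(res)
```

An ID with a single row (D here) trivially has one distinct status, so it is kept, as in the expected output.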

Sum values from DF and make a new one

I have this dataframe in R:
ID <- c(rep("ID1" , 4) , rep("ID2" , 4))
mut <- rep(c("AC", "TG", "AG", "TC"), 2)
count <- c(2,4,6,8,1,3,5,7)
data.frame(ID, mut, count)
ID mut count
1 ID1 AC 2
2 ID1 TG 4
3 ID1 AG 6
4 ID1 TC 8
5 ID2 AC 1
6 ID2 TG 3
7 ID2 AG 5
8 ID2 TC 7
I want to create a new one where I sum the values of count based on "mut" column.
Basically, for each ID, I would sum the count from mut=AC and TG and from AG and TC, to obtain this:
ID new_mut count
1 ID1 AC-TG 6
2 ID1 AG-TC 14
3 ID2 AC-TG 4
4 ID2 AG-TC 12
I have absolutely no clue on how to do this!!
Thanks!!
M

This assumes each ID has an even number of rows, because consecutive rows are paired up:
df = data.frame(ID, mut, count)
# Pair index: rows 1-2 get 1, rows 3-4 get 2, and so on.
df$sek = rep(1:(nrow(df)/2), each = 2)
do.call(rbind,
  by(df, list(df$sek), function(x) {
    data.frame(
      "ID" = x$ID[1],
      "new_mut" = paste0(x$mut, collapse = "-"),
      "count" = sum(x$count)
    )
  })
)
ID new_mut count
1 ID1 AC-TG 6
2 ID1 AG-TC 14
3 ID2 AC-TG 4
4 ID2 AG-TC 12

Using dplyr :
library(dplyr)
df %>%
group_by(ID, val = ceiling(match(mut, unique(mut))/2)) %>%
summarise(mut = paste0(mut,collapse="-"),
count = sum(count)) %>%
select(-val)
# ID mut count
# <chr> <chr> <dbl>
#1 ID1 AC-TG 6
#2 ID1 AG-TC 14
#3 ID2 AC-TG 4
#4 ID2 AG-TC 12
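
Both answers above pair rows by their position. As a hedged alternative, the pairing can be written out as an explicit lookup table, which then works regardless of row order. The vector pair_of is a hypothetical name, and its contents simply encode the AC/TG and AG/TC pairs from the question.

```r
df <- data.frame(
  ID    = c(rep("ID1", 4), rep("ID2", 4)),
  mut   = rep(c("AC", "TG", "AG", "TC"), 2),
  count = c(2, 4, 6, 8, 1, 3, 5, 7),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: which combined label each mutation belongs to.
pair_of <- c(AC = "AC-TG", TG = "AC-TG", AG = "AG-TC", TC = "AG-TC")
df$new_mut <- unname(pair_of[as.character(df$mut)])

# Sum count within each ID / new_mut combination, then sort for display.
res <- aggregate(count ~ ID + new_mut, data = df, FUN = sum)
res <- res[order(res$ID, res$new_mut), ]
print(res)
```

A mutation missing from pair_of would get NA as its label and be dropped by aggregate(), so the lookup should cover every value of mut.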

Removing mirrored combinations of variables in a data frame

I'm looking to get each unique combination of two variables:
library(purrr)
cross_df(list(id1 = seq_len(3), id2 = seq_len(3)), .filter = `==`)
# A tibble: 6 x 2
id1 id2
<int> <int>
1 2 1
2 3 1
3 1 2
4 3 2
5 1 3
6 2 3
How do I remove the mirrored combinations? That is, I want only one of rows 1 and 3 in the data frame above, only one of rows 2 and 5, and only one of rows 4 and 6. My desired output would be something like:
# A tibble: 3 x 2
id1 id2
<int> <int>
1 2 1
2 3 1
3 3 2
I don't care if a particular id value is in id1 or id2, so the below is just as acceptable as the output:
# A tibble: 3 x 2
id1 id2
<int> <int>
1 1 2
2 1 3
3 2 3

A tidyverse version of Dan's answer:
library(dplyr)
library(tidyr)
cross_df(list(id1 = seq_len(3), id2 = seq_len(3)), .filter = `==`) %>%
  mutate(min = pmap_int(., min), max = pmap_int(., max)) %>% # Find the min and max in each row
  unite(check, c(min, max), remove = FALSE) %>% # Combine them in a "check" variable
  distinct(check, .keep_all = TRUE) %>% # Remove duplicates of the "check" variable
  select(id1, id2)
# A tibble: 3 x 2
id1 id2
<int> <int>
1 2 1
2 3 1
3 3 2

A base R approach:
# create a string with the sorted elements of the row; the "_" separator
# keeps multi-digit ids apart (without it, 1 and 12 would collide with 11 and 2)
df$temp <- apply(df, 1, function(x) paste(sort(x), collapse = "_"))
# then you can simply keep rows with a unique sorted-string value
df[!duplicated(df$temp), 1:2]
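
The base R answer does not show how df is built. A self-contained sketch, assuming the pairs come from expand.grid() rather than cross_df(), can use the vectorised pmin() and pmax() to build the order-independent key without apply():

```r
# All ordered pairs of 1..3, minus the rows where both ids are equal.
df <- expand.grid(id1 = seq_len(3), id2 = seq_len(3))
df <- df[df$id1 != df$id2, ]

# Order-independent key: smaller id first, larger id second.
key <- paste(pmin(df$id1, df$id2), pmax(df$id1, df$id2), sep = "_")

# Keep the first row seen for each key.
res <- df[!duplicated(key), ]
print(res)
```

Keeping the key in a separate vector avoids adding a temporary column to df that has to be dropped afterwards.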

R: Generating indicators that values differ within groups

I have a data frame where each row is an observation and I have two columns:
the group membership of the observation
the outcome for the observation.
I'm trying to create a new variable outcome_change that takes a value of 1 if outcome is NOT identical for all observations in a given group and 0 otherwise.
Shown in the below code (dat) is an example of the data I have. Meanwhile, dat_out1 shows what I'm looking for the code to produce in the presence of no NA values. The dat_out2 is identical except it shows that the same results arise when there are missing values in a group's values.
Surely there is some way to do this with dplyr::group_by()? I don't know how to make these comparisons within groups.
# Input (2 groups: 1 with identical values of outcome
# in the group (group a) and 1 with differing values of
# outcome in the group (group b)
dat <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,1,3,2,2))
# Output 1: add a variable for all observations belonging to
# a group where the outcome changed within each group
dat_out1 <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,1,3,2,2),
outcome_change = c(0,0,0,1,1,1))
# Output 2: same as Output 1, but able to ignore NA values
dat_out2 <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,NA,3,2,NA),
outcome_change = c(0,0,0,1,1,1))

Here is an approach:
library(tidyverse)
dat %>%
group_by(group) %>%
mutate(outcome_change = ifelse(length(unique(outcome[!is.na(outcome)])) > 1, 1, 0))
#output
# A tibble: 6 x 3
# Groups: group [2]
group outcome outcome_change
<fctr> <dbl> <dbl>
1 a 1 0
2 a 1 0
3 a 1 0
4 b 3 1
5 b 2 1
6 b 2 1
With dat2:
# A tibble: 6 x 3
# Groups: group [2]
group outcome outcome_change
<fctr> <dbl> <dbl>
1 a 1 0
2 a 1 0
3 a NA 0
4 b 3 1
5 b 2 1
6 b NA 1

library(dplyr)
dat <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,1,3,2,2))
dat2 <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,NA,3,2,NA))
dat_out1 <- dat %>% group_by(group) %>%
mutate(outcome_change = ifelse(min(outcome) == max(outcome), 0, 1))
dat_out2 <- dat2 %>% group_by(group) %>%
mutate(outcome_change = ifelse(min(outcome, na.rm = TRUE) == max(outcome, na.rm = TRUE), 0, 1))

Here is an option using data.table
library(data.table)
setDT(dat)[, outcome_change := as.integer(uniqueN(outcome[!is.na(outcome)])>1), group]
dat
# group outcome outcome_change
#1: a 1 0
#2: a 1 0
#3: a 1 0
#4: b 3 1
#5: b 2 1
#6: b 2 1
If we apply the same with 'dat2'
dat2
# group outcome outcome_change2
#1: a 1 0
#2: a 1 0
#3: a NA 0
#4: b 3 1
#5: b 2 1
#6: b NA 1
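
For completeness, the same indicator can be sketched in base R with ave(), which repeats the per-group result on every row of the group. This uses the NA-containing data from the question and drops the NA values before counting distinct outcomes.

```r
dat2 <- data.frame(group = c("a", "a", "a", "b", "b", "b"),
                   outcome = c(1, 1, NA, 3, 2, NA))

# 1 if the group has more than one distinct non-missing outcome, else 0.
dat2$outcome_change <- ave(
  dat2$outcome, dat2$group,
  FUN = function(x) as.numeric(length(unique(x[!is.na(x)])) > 1)
)
print(dat2)
```

A group made up entirely of NA values would get 0 under this definition, since it has no distinct non-missing outcomes.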
