Remove rows after the first change in a value within each group in R

I have data sets
ID <- c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4)
x <- c(1,2,3,1,2,3,4,1,2,1,2,3,4,5,6)
y <- c(2,2,3,6,6,4,5, 1,1,5,5,5,2,2,2)
df <- data.frame(ID, x, y)
df
ID x y
1 1 1 2
2 1 2 2
3 1 3 3
4 2 1 6
5 2 2 6
6 2 3 4
7 2 4 5
8 3 1 1
9 3 2 1
10 4 1 5
11 4 2 5
12 4 3 5
13 4 4 2
14 4 5 2
15 4 6 2
ID 1 has 3 rows, and y changes to 3 in the third row, so I want to set y = 2 there (the same value as the previous row). For ID 2, y changes at y = 4; I want to set that to y = 6 and delete the next row. In general, when y changes within an ID, only the first changed row is kept, with its y set to the value of the previous row; the remaining rows are removed.
The resulting table should be
ID x y
1 1 2
1 2 2
1 3 2
2 1 6
2 2 6
2 3 6
3 1 1
3 2 1
4 1 5
4 2 5
4 3 5
4 4 5
I couldn't figure out how to do this. Do you have any ideas? Thanks.

Or we can do
library(data.table)
df1 <- setDT(df)[, .SD[shift(rleid(y), fill = 1) == 1], .(ID)]
df1[, y := y[1], .(ID)]
df1
ID x y
1: 1 1 2
2: 1 2 2
3: 1 3 2
4: 2 1 6
5: 2 2 6
6: 2 3 6
7: 3 1 1
8: 3 2 1
9: 4 1 5
10: 4 2 5
11: 4 3 5
12: 4 4 5

We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)) and group by 'ID'. If 'y' has only one unique value, take all rows of the group (1:.N); otherwise take the difference of 'y' (diff), check where it is not equal to 0, use which to get the numeric index of the first TRUE ([1]), build the sequence up to that index, and wrap it in .I to return the row index.
library(data.table)
i1 <- setDT(df)[, if(uniqueN(y) >1) .I[seq(which(c(FALSE,diff(y)!=0))[1])]
else .I[1:.N], ID]$V1
Using 'i1', we subset the rows of 'df'; then, grouped by 'ID', we assign (:=) the first element of 'y' to the whole 'y' column.
df[i1][, y:= y[1], ID][]
# ID x y
#1: 1 1 2
#2: 1 2 2
#3: 1 3 2
#4: 2 1 6
#5: 2 2 6
#6: 2 3 6
#7: 3 1 1
#8: 3 2 1
#9: 4 1 5
#10: 4 2 5
#11: 4 3 5
#12: 4 4 5
Or we can use somewhat simpler code with dplyr. (Disclaimer: the idea is similar to @Psidom's code.) After grouping by 'ID', we take the lag of 'y', build a logical index by comparing it with the first observation, filter the rows on that index, and change the 'y' values to the first value.
library(dplyr)
df %>%
group_by(ID) %>%
filter(first(y)==lag(y, default = first(y))) %>%
mutate(y = first(y))
# ID x y
# <dbl> <dbl> <dbl>
#1 1 1 2
#2 1 2 2
#3 1 3 2
#4 2 1 6
#5 2 2 6
#6 2 3 6
#7 3 1 1
#8 3 2 1
#9 4 1 5
#10 4 2 5
#11 4 3 5
#12 4 4 5
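Or another option is ave from base R, building the lagged vector by hand with c(x[1], head(x, -1)) (lag with a default argument is dplyr's lag, not a base function)
df1 <- df[with(df, as.logical(ave(y, ID, FUN = function(x)
c(x[1], head(x, -1)) == x[1]))),]
df1$y <- with(df1, ave(y, ID, FUN= function(x) x[1]))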
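For comparison, here is a minimal base R sketch of the run-length idea used in the rleid answer above: number the runs of y within each ID, keep a row only while the previous row is still in the first run, then overwrite y with the first value. The helper name keep_until_first_change is made up for illustration, and the sketch assumes rows are already sorted by ID.

```r
# Data from the question
df <- data.frame(
  ID = c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4),
  x  = c(1,2,3,1,2,3,4,1,2,1,2,3,4,5,6),
  y  = c(2,2,3,6,6,4,5,1,1,5,5,5,2,2,2)
)

# Hypothetical helper, not part of any package
keep_until_first_change <- function(d) {
  # Run number of y within each ID: 1 for the first run, 2 after the first change, ...
  run_id <- ave(d$y, d$ID, FUN = function(v) cumsum(c(TRUE, diff(v) != 0)))
  # Run number of the previous row (the first row of each ID counts as run 1)
  prev_run <- ave(run_id, d$ID, FUN = function(r) c(1, head(r, -1)))
  # Keep rows whose previous row is still in the first run
  out <- d[prev_run == 1, ]
  # Overwrite y with the first value of each ID
  out$y <- ave(out$y, out$ID, FUN = function(v) v[1])
  rownames(out) <- NULL
  out
}

res <- keep_until_first_change(df)
res
```

Unlike the comparison against the first value, this version also stops when y later returns to its starting value (for example 5, 2, 5), which may or may not be what you want.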

You could use a for loop, matching to the first instance of a given ID:
for( i in 1:nrow(df) ){
df$new[i] <- df$y[ match( df$ID[i], df$ID ) ]
}
This works because you're effectively asking for all subsequent values of y to be replaced with the first value for a given ID. match returns the position of the first element that matches, which is exactly what is needed here.
Since match is vectorised, you could also eliminate the for loop by first extracting ID as a variable:
ID <- df$ID
df$new <- df$y[ match( ID, df$ID ) ]
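EDIT TO ADD: Sorry, here's a step to delete rows as requested (shift comes from data.table). It keeps the first row of each ID plus every row whose previous row still holds the first value; the wanted y is then in the new column.
library(data.table)
df <- subset( df, !duplicated( ID ) |
( shift( y, 1, type = "lag" ) == new &
shift( ID, 1, type = "lag" ) == ID )
)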

Related

Set values of a column to NA after a given point

I have a dataset like this:
ID NUMBER X
1 5 2
1 3 4
1 6 3
1 2 5
2 7 3
2 3 5
2 9 3
2 4 2
and I'd like to set the values of variable X to NA once the variable NUMBER increases (even if it decreases again afterwards) for each ID, obtaining:
ID NUMBER X
1 5 2
1 3 4
1 6 NA
1 2 NA
2 7 3
2 3 5
2 9 NA
2 4 NA
How can I do it?
Thanks for your help!
Surely not the most elegant solution, but it is quite intuitive:
library(data.table)
setDT(d)
d[, n := ifelse(NUMBER > shift(NUMBER, 1, type = "lag"),1,0), by=ID]
d[is.na(n), n := 0]
d[, n := cumsum(n), by=ID]
d[n>0, X := NA ]
d
ID NUMBER X n
1: 1 5 2 0
2: 1 3 4 0
3: 1 6 NA 1
4: 1 2 NA 1
5: 2 7 3 0
6: 2 3 5 0
7: 2 9 NA 1
8: 2 4 NA 1
You can do this with the dplyr package. If your data frame is called df, you can use this code (it assumes NUMBER increases at least once within every ID):
df %>% group_by(ID) %>%
mutate ( X = c(X[1:(min(which(diff(NUMBER) > 0)))],rep(NA,length(X)-(min(which(diff(NUMBER) > 0)))))) %>%
as.data.frame()
I first grouped them with ID and then I found the first increasing number with diff and which.
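The same idea can be sketched in base R with ave and cumsum. This is only an illustration on the example data from the question; the data frame name d and the helper vector increased are my own choices. It flags every row at or after the first increase of NUMBER within its ID, and leaves groups without any increase untouched.

```r
# Example data from the question
d <- data.frame(
  ID     = rep(1:2, each = 4),
  NUMBER = c(5, 3, 6, 2, 7, 3, 9, 4),
  X      = c(2, 4, 3, 5, 3, 5, 3, 2)
)

# Within each ID, count how many increases have happened up to each row;
# a count above 0 means the row is at or after the first increase
increased <- ave(d$NUMBER, d$ID,
                 FUN = function(v) cumsum(c(FALSE, diff(v) > 0))) > 0

# Blank out X from the first increase onwards
d$X[increased] <- NA
d
```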

is there a way in R to subtract two rows within a group by specifying another grouping var?

Say I have something like this:
ID = c("a","a","a","a","a", "b","b","b","b","b")
Group = c("1","2","3","4","5", "1","2","3","4","5")
Value = c(3, 4,2,4,3, 6, 1, 8, 9, 10)
df<-data.frame(ID,Group,Value)
I want to subtract the Group 3 value from the Group 5 value within each ID, with an output column that holds this difference for each ID, like so:
ID Group Value Want
1 a 1 3 1
2 a 2 4 1
3 a 3 2 1
4 a 4 4 1
5 a 5 3 1
6 b 1 6 2
7 b 2 1 2
8 b 3 8 2
9 b 4 9 2
10 b 5 10 2
Also, if that calculation cannot be done (i.e. group 5 is missing), NA values for the 'want' column would be ideal.
As each 'Group' value occurs only once per 'ID', we can do the subsetting directly
library(dplyr)
df %>%
group_by(ID) %>%
mutate(want = Value[Group == 5] - Value[Group == 3])
# A tibble: 10 x 4
# Groups: ID [2]
# ID Group Value want
# <fct> <fct> <dbl> <dbl>
# 1 a 1 3 1
# 2 a 2 4 1
# 3 a 3 2 1
# 4 a 4 4 1
# 5 a 5 3 1
# 6 b 1 6 2
# 7 b 2 1 2
# 8 b 3 8 2
# 9 b 4 9 2
#10 b 5 10 2
The above can be made more robust by converting the logical test to a numeric index with which and taking its first element. When there is no TRUE, indexing with [1] returns NA
df %>%
slice(-10) %>%
group_by(ID) %>%
mutate(want = Value[which(Group == 5)[1]] - Value[which(Group == 3)[1]])
Or use match, which returns an index of NA if there are no matches; indexing with NA returns NA, which in turn makes the subtraction return NA (e.g. NA - 8)
df %>%
slice(-10) %>% # removing the last row (ID 'b', Group 5)
group_by(ID) %>%
mutate(want = Value[match(5, Group)] - Value[match(3, Group)])
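To see why the match version degrades gracefully, here is a small base R sketch on a single ID whose Group 5 row is missing. The vectors mirror ID 'b' from the question without its last row; the variable names idx5, idx3 and want are just for illustration.

```r
# Values for ID "b" after dropping the Group 5 row
Group <- c("1", "2", "3", "4")
Value <- c(6, 1, 8, 9)

idx5 <- match("5", Group)  # no match, so the index is NA
idx3 <- match("3", Group)  # position of Group 3

# Indexing with NA yields NA, and NA minus a number is NA
want <- Value[idx5] - Value[idx3]
want
```

With a plain logical subset (Value[Group == "5"]) the left-hand side would instead have length zero, which is what makes the grouped mutate fail.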
Here is a base R solution
dfout <- Reduce(rbind,
lapply(split(df,df$ID),
function(x) within(x, Want <-diff(subset(Value, Group %in% c("3","5"))))))
such that
> dfout
ID Group Value Want
1 a 1 3 1
2 a 2 4 1
3 a 3 2 1
4 a 4 4 1
5 a 5 3 1
6 b 1 6 2
7 b 2 1 2
8 b 3 8 2
9 b 4 9 2
10 b 5 10 2
A data.table method:
library(data.table)
setDT(df)[, want := (Value[Group == 5] - Value[Group == 3]), by = .(ID)]
df
# ID Group Value want
# 1: a 1 3 1
# 2: a 2 4 1
# 3: a 3 2 1
# 4: a 4 4 1
# 5: a 5 3 1
# 6: b 1 6 2
# 7: b 2 1 2
# 8: b 3 8 2
# 9: b 4 9 2
# 10: b 5 10 2
Here is a solution using base R.
unsplit(
lapply(
split(df, df$ID),
function(d) {
x5 = d$Value[d$Group == "5"]
x5 = ifelse(length(x5) == 1, x5, NA)
x3 = d$Value[d$Group == "3"]
x3 = ifelse(length(x3) == 1, x3, NA)
d$Want = x5 - x3
d
}),
df$ID)

Shifting rows up in columns and flush remaining ones

I want to shift values up within each group so that the NA cells are filled, and then drop the rows that end up completely NA. My current approach, however, still keeps the second row of each group.
Here is my approach
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
so using this approach
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As we can see, I still have the second row in each group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
thanks!
If there's at most one valid value per gr, you can use na.omit then take the first value from it:
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5
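If you would rather avoid extra packages, the same "first non-missing value per group" idea can be sketched in base R with split and lapply. The helper first_non_na and the result name collapsed are hypothetical names chosen for this example.

```r
data <- data.frame(gr = rep(1:3, each = 2),
                   A = c(1, NA, 2, NA, 4, NA),
                   B = c(NA, 1, NA, 3, NA, 7),
                   C = c(1, NA, 4, NA, 5, NA))

# First non-missing value of a vector (NA if every value is missing)
first_non_na <- function(v) v[!is.na(v)][1]

# Collapse each group to a single row, column by column
collapsed <- do.call(rbind, lapply(split(data, data$gr), function(g) {
  as.data.frame(lapply(g, first_non_na))
}))
rownames(collapsed) <- NULL
collapsed
```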
You can do it with dplyr and tidyr (fill comes from tidyr) like this:
library(dplyr)
library(tidyr)
data$ind <- rep(c(1,2), times = nrow(data)/2)
data %>% fill(A,B,C) %>% filter(ind == 2) %>% mutate(ind=NULL)
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.
One more solution using data.table:-
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5

Remove ID:s with only one observation in time in r

Hi, I have panel data and would like to remove any individuals that only have observations at one time point, keeping the ones that have 2 points in time.
so the dataframe:
df <- data.frame(id = c(1,2,2,3,3,4,4,5,6), time = c(1,1,2,1,2,1,2,2,2))
id time
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 4 1
7 4 2
8 5 2
9 6 2
becomes this:
id time
1 2 1
2 2 2
3 3 1
4 3 2
5 4 1
6 4 2
i.e. removing individuals 1, 5 and 6 so that the panel is balanced.
Thx
We can do this using a couple of options. With data.table, convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'id', we get the number of rows (.N) and if that is greater than 1, get the Subset of Data.table (.SD)
library(data.table)
setDT(df)[, if(.N>1) .SD, by = id]
# id time
#1: 2 1
#2: 2 2
#3: 3 1
#4: 3 2
#5: 4 1
#6: 4 2
Can use the same methodology with dplyr.
library(dplyr)
df %>%
group_by(id) %>%
filter(n()>1)
# id time
# (dbl) (dbl)
#1 2 1
#2 2 2
#3 3 1
#4 3 2
#5 4 1
#6 4 2
Or with base R, get the table of 'id', check whether each count is greater than 1, subset the names based on that logical index ('i1'), and use them to subset the 'data.frame' with %in%.
i1 <- table(df$id)>1
subset(df, id %in% names(i1)[i1] )
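Another base R way to express the same count-per-id filter is ave, which returns the group size for every row so it can be used directly as a row filter. This is just a sketch on the question's data; n_obs and balanced are names I picked.

```r
df <- data.frame(id = c(1,2,2,3,3,4,4,5,6), time = c(1,1,2,1,2,1,2,2,2))

# Number of observations for the id of each row
n_obs <- ave(df$time, df$id, FUN = length)

# Keep only ids observed more than once
balanced <- df[n_obs > 1, ]
balanced
```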
Another option,
ind <- rle(df$id)$values[rle(df$id)$lengths > 1]
df[df$id %in% ind,]
# id time
#2 2 1
#3 2 2
#4 3 1
#5 3 2
#6 4 1
#7 4 2
library(data.table)
setDT(df, key = "id")[(duplicated(id) | duplicated(id, fromLast = TRUE))]
# id time
#1: 2 1
#2: 2 2
#3: 3 1
#4: 3 2
#5: 4 1
#6: 4 2
You can use the dplyr package to do this, counting the rows per id and dropping the ids that appear only once
library(dplyr)
df %>% group_by(id) %>% mutate(count = n()) %>%
filter(count != 1)

Inserting a count field for each row by a grouping variable

I have a data set with observations that are both grouped and ordered (by rank). I'd like to add a third variable that is a count of the number of observations for each grouping variable. I'm aware of ways to group and count variables but I can't find a way to re-insert these counts back into the original data set, which has more rows. I'd like to get the variable C in the example table below.
A B C
1 1 3
1 2 3
1 3 3
2 1 4
2 2 4
2 3 4
2 4 4
Here's one way using ave:
DF <- within(DF, {C <- ave(A, A, FUN=length)})
# A B C
# 1 1 1 3
# 2 1 2 3
# 3 1 3 3
# 4 2 1 4
# 5 2 2 4
# 6 2 3 4
# 7 2 4 4
Here is one approach using data.table that makes use of .N, which the "data.table" help file describes as follows: ".N is an integer, length 1, containing the number of rows in the group."
> library(data.table)
> DT <- data.table(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))
> DT
A B
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
7: 2 4
> DT[, C := .N, by = "A"]
> DT
A B C
1: 1 1 3
2: 1 2 3
3: 1 3 3
4: 2 1 4
5: 2 2 4
6: 2 3 4
7: 2 4 4
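Since the question mentions not being able to re-insert the counts into the original data, here is a rough base R sketch of that two-step route: count per group with aggregate, then join the counts back with merge. The names counts and DF2 are illustrative, and the data mirror the example table.

```r
DF <- data.frame(A = rep(c(1, 2), times = c(3, 4)), B = c(1:3, 1:4))

# One row per value of A with the number of observations
counts <- aggregate(B ~ A, data = DF, FUN = length)
names(counts)[2] <- "C"

# Join the per-group counts back onto every original row
DF2 <- merge(DF, counts, by = "A")
DF2
```

The ave and .N versions above do the same thing in one step; the merge route is mainly useful when the counts already live in a separate summary table.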

Resources