Remove IDs with only one observation in time in R

Hi, I have panel data and would like to remove any individuals that have observations at only one time point, keeping the ones observed at two time points.
So the data frame:
df <- data.frame(id = c(1,2,2,3,3,4,4,5,6), time = c(1,1,2,1,2,1,2,2,2))
id time
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 4 1
7 4 2
8 5 2
9 6 2
becomes this:
id time
1 2 1
2 2 2
3 3 1
4 3 2
5 4 1
6 4 2
i.e. removing individuals 1, 5 and 6 so that the panel is balanced.
Thanks!

We can do this using a couple of options. With data.table, convert the 'data.frame' to 'data.table' (setDT(df)); grouped by 'id', get the number of rows (.N) and, if that is greater than 1, return the Subset of Data.table (.SD).
library(data.table)
setDT(df)[, if(.N>1) .SD, by = id]
# id time
#1: 2 1
#2: 2 2
#3: 3 1
#4: 3 2
#5: 4 1
#6: 4 2
We can use the same methodology with dplyr.
library(dplyr)
df %>%
  group_by(id) %>%
  filter(n() > 1)
# id time
# (dbl) (dbl)
#1 2 1
#2 2 2
#3 3 1
#4 3 2
#5 4 1
#6 4 2
Or with base R: get the table of the 'id' column, check whether each count is greater than 1, subset the names based on the logical index ('i1'), and use it to subset the 'data.frame' with %in%.
i1 <- table(df$id) > 1
subset(df, id %in% names(i1)[i1])
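To see why this works, the intermediate table gives the count per id for the example data:
table(df$id)
# 1 2 3 4 5 6
# 1 2 2 2 1 1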

Another option, using rle (this assumes equal ids are adjacent, since rle works on runs):
r <- rle(df$id)
ind <- r$values[r$lengths > 1]
df[df$id %in% ind, ]
# id time
#2 2 1
#3 2 2
#4 3 1
#5 3 2
#6 4 1
#7 4 2

library(data.table)
setDT(df, key = "id")[(duplicated(id) | duplicated(id, fromLast = TRUE))]
# id time
#1: 2 1
#2: 2 2
#3: 3 1
#4: 3 2
#5: 4 1
#6: 4 2
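The same duplicated() idea also works in plain base R, without setting a key (a sketch along the same lines; row order is preserved):
df[duplicated(df$id) | duplicated(df$id, fromLast = TRUE), ]
# id time
#2  2    1
#3  2    2
#4  3    1
#5  3    2
#6  4    1
#7  4    2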

You can use the dplyr package to do this, counting the rows per 'id' and keeping the ids that appear more than once:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(count = n()) %>%
  filter(count != 1) %>%
  select(-count)

Related

Filter duplicated rows that have non-matching variable values in R

I am trying to filter rows that are duplicated by id, and for the duplicates I need to keep only the non-matching row.
Here is the sample dataset.
df <- data.frame(
  id = c(1,2,2,3,4,5,5,6),
  cat = c(3,3,4,5,2,2,1,5),
  actual.cat = c(3,4,4,5,2,1,1,7))
> df
id cat actual.cat
1 1 3 3
2 2 3 4
3 2 4 4
4 3 5 5
5 4 2 2
6 5 2 1
7 5 1 1
8 6 5 7
So each id has cat and actual.cat. When an id is duplicated, I need to keep only the row where cat does not match actual.cat.
Here is what I need.
> df
id cat actual.cat
1 3 3
2 3 4
3 5 5
4 2 2
5 2 1
6 5 7
Any ideas on this?
Thanks!
We can do a group by operation and filter
library(dplyr)
df %>%
  group_by(id) %>%
  filter((n() > 1 & cat != actual.cat) | n() == 1)
Output:
# A tibble: 6 x 3
# Groups: id [6]
# id cat actual.cat
# <dbl> <dbl> <dbl>
#1 1 3 3
#2 2 3 4
#3 3 5 5
#4 4 2 2
#5 5 2 1
#6 6 5 7
Or using base R
subset(df, (id %in% names(which(table(id) > 1)) & cat != actual.cat) |
         id %in% names(which(table(id) == 1)))
In base R, you can use subset with ave to select rows in each id where the number of rows in the group is 1 or cat is not equal to actual.cat.
subset(df, ave(cat != actual.cat, id, FUN = function(x) length(x) == 1 | x))
# id cat actual.cat
#1 1 3 3
#2 2 3 4
#4 3 5 5
#5 4 2 2
#6 5 2 1
#8 6 5 7
You can also write this logic in data.table :
library(data.table)
setDT(df)[, .SD[.N == 1 | cat != actual.cat], id]

Shifting rows up in columns and flush remaining ones

I have a problem with shifting rows up by one within each group. When rows become completely NA after the shift, I would like to drop them. My current approach, however, still keeps the extra rows.
Here is my approach
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
So, using this approach:
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As we can see, the result still has six rows, with the values recycled, instead of one row per group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Thanks!
If there's at most one valid value per gr, you can use na.omit then take the first value from it:
library(dplyr)
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5
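In dplyr 1.0+, summarise_all is superseded; the same logic with across would be:
data %>% group_by(gr) %>% summarise(across(everything(), ~ na.omit(.x)[1]))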
You can do it with dplyr and tidyr like this:
library(dplyr)
library(tidyr)  # fill() comes from tidyr
data$ind <- rep(c(1, 2), length.out = nrow(data))
data %>% fill(A, B, C) %>% filter(ind == 2) %>% mutate(ind = NULL)
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.
One more solution using data.table:
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5
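The three per-column calls can also be collapsed into one step. A sketch with the same na.locf logic (like the per-column version, it relies on na.locf dropping a leading NA and the length-1 result being recycled within the group):
setDT(data)[, c("A", "B", "C") := lapply(.SD, na.locf), by = gr,
            .SDcols = c("A", "B", "C")]
unique(data)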

Remove rows after the first change within each group in R

I have data sets
ID <- c(1,1,1,2,2,2,2,3,3,4,4,4,4,4,4)
x <- c(1,2,3,1,2,3,4,1,2,1,2,3,4,5,6)
y <- c(2,2,3,6,6,4,5, 1,1,5,5,5,2,2,2)
df <- data.frame(ID, x, y)
df
ID x y
1 1 1 2
2 1 2 2
3 1 3 3
4 2 1 6
5 2 2 6
6 2 3 4
7 2 4 5
8 3 1 1
9 3 2 1
10 4 1 5
11 4 2 5
12 4 3 5
13 4 4 2
14 4 5 2
15 4 6 2
If you look at ID 1, it has 3 rows; y changes to 3 in the third row, so I want to set that y to 2 (the same value as the previous row). For ID 2, y changes to 4, so I want to set it to 6 and delete the rows after it. In short: within each ID, the first row where y changes should take the previous row's y value, and all later rows should be removed.
The table will be
ID x y
1 1 2
1 2 2
1 3 2
2 1 6
2 2 6
2 3 6
3 1 1
3 2 1
4 1 5
4 2 5
4 3 5
4 4 5
I couldn't figure this out. Do you have any ideas? Please help me, thanks.
Or we can do
library(data.table)
df1 <- setDT(df)[, .SD[shift(rleid(y), fill = 1) == 1], .(ID)]
df1[, y := y[1], .(ID)]
df1
ID x y
1: 1 1 2
2: 1 2 2
3: 1 3 2
4: 2 1 6
5: 2 2 6
6: 2 3 6
7: 3 1 1
8: 3 2 1
9: 4 1 5
10: 4 2 5
11: 4 3 5
12: 4 4 5
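To see what the first filter keeps, consider ID 2, where y = c(6, 6, 4, 5):
rleid(c(6, 6, 4, 5))                   # 1 1 2 3
shift(rleid(c(6, 6, 4, 5)), fill = 1)  # 1 1 1 2
Comparing the shifted run ids to 1 keeps everything up to and including the first change, and y := y[1] then overwrites that row with the group's first value.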
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df)) and group by 'ID'. If there is more than one unique element in 'y', take the difference of 'y' (diff), check whether it is not equal to 0, use which to return the numeric index of the first TRUE ([1]), and take the sequence up to it; otherwise keep the full sequence of rows (1:.N). Wrapping with .I returns the row index.
library(data.table)
i1 <- setDT(df)[, if (uniqueN(y) > 1) .I[seq(which(c(FALSE, diff(y) != 0))[1])]
                else .I[1:.N], ID]$V1
Based on 'i1', we subset the rows of 'df'; then, grouped by 'ID', we assign (:=) the first element of 'y' to the whole 'y' column.
df[i1][, y:= y[1], ID][]
# ID x y
#1: 1 1 2
#2: 1 2 2
#3: 1 3 2
#4: 2 1 6
#5: 2 2 6
#6: 2 3 6
#7: 3 1 1
#8: 3 2 1
#9: 4 1 5
#10: 4 2 5
#11: 4 3 5
#12: 4 4 5
Or we can use somewhat simpler code with dplyr. (Disclaimer: the idea is somewhat similar to #Psidom's code.) After grouping by 'ID', we take the lag of 'y', get a logical index by comparing with the first observation, filter the rows based on that, and change the 'y' values to the first value.
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(first(y) == lag(y, default = first(y))) %>%
  mutate(y = first(y))
# ID x y
# <dbl> <dbl> <dbl>
#1 1 1 2
#2 1 2 2
#3 1 3 2
#4 2 1 6
#5 2 2 6
#6 2 3 6
#7 3 1 1
#8 3 2 1
#9 4 1 5
#10 4 2 5
#11 4 3 5
#12 4 4 5
Or another option is ave from base R. Base R's lag() is meant for time series, so the lag is built by hand with c(x[1], x[-length(x)]):
df1 <- df[with(df, as.logical(ave(y, ID, FUN = function(x)
  c(x[1], x[-length(x)]) == x[1]))), ]
df1$y <- with(df1, ave(y, ID, FUN = function(x) x[1]))
You could use a for loop, matching to the first instance of a given ID:
for (i in 1:nrow(df)) {
  df$new[i] <- df$y[match(df$ID[i], df$ID)]
}
This works because you're effectively asking for all subsequent values of y to be replaced with the first value for a given ID. match returns the position of the first match, which works well for what you're after.
Or you could eliminate the for loop by first extracting ID as a variable:
ID <- df$ID
df$new <- df$y[match(ID, df$ID)]
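For example, on the sample data above, match(ID, df$ID) gives the position of the first occurrence of each ID:
match(ID, df$ID)
# [1] 1 1 1 4 4 4 4 8 8 10 10 10 10 10 10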
EDIT TO ADD: Sorry, here's a step to add to delete rows as requested (shift() here comes from data.table):
library(data.table)  # for shift()
df <- subset(df, y == new |
               (shift(y, 1, type = "lag") != y &
                shift(ID, 1, type = "lag") == ID))

Calculate each chunk by group using dplyr?

How can I get the expected calculation using dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df <- read.table(header = TRUE, text = 'row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate for each chunk (row 1-3, 4-5, 6-7) although row 1-3 and 6-7 are labelled as the same group number?
Here is a similar approach. I created a new group variable using cumsum: whenever the difference between two consecutive values of group is not 0, a new group number starts. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(TRUE, diff(group) != 0))) %>%
  group_by(foo) %>%
  mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
r <- rle(df$group)
df$aux <- rep(seq_along(r$values), times = r$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3
Here is an option using data.table_1.9.5. The devel version introduced the new functions rleid and shift (the default type is "lag" and fill is NA), which are useful for this.
library(data.table)
setDT(df)[, expected := value - shift(value), by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3
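For reference, rleid assigns a new id whenever the value changes, which is exactly the chunking the question asks for:
rleid(c(1, 1, 1, 2, 2, 1, 1))
# [1] 1 1 1 2 2 3 3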

How to add index of a List item after melt() in R [duplicate]

I am working with a list as follows:
> library(reshape2)
> l <- list(c(2,4,9), c(4,2,6,1))
> m <- melt(l)
> m
value L1
2 1
4 1
9 1
4 2
2 2
6 2
1 2
I want to add an index i so that my resulting data frame m looks like this:
> m
i value L1
1 2 1
2 4 1
3 9 1
1 4 2
2 2 2
3 6 2
4 1 2
i indicates that 3 values belong to the first list element and 4 values belong to the second.
How can I achieve this? Can anyone help?
Just for completeness, some other options
data.table (which is basically what getanID is doing)
library(data.table)
setDT(m)[, i := seq_len(.N), L1]
dplyr
library(dplyr)
m %>%
  group_by(L1) %>%
  mutate(i = row_number())
Base R (from comments by @user20650)
transform(m, i = ave(L1, L1, FUN = seq_along))
You could use splitstackshape
library(splitstackshape)
getanID(m, 'L1')[]
# value L1 .id
#1: 2 1 1
#2: 4 1 2
#3: 9 1 3
#4: 4 2 1
#5: 2 2 2
#6: 6 2 3
#7: 1 2 4
Or using base R
transform(stack(setNames(l, seq_along(l))), .id = rapply(l, seq_along))
Less elegant than ave but does the job:
transform(m, i = unlist(sapply(rle(m$L1)$lengths, seq_len)))
# value L1 i
#1 2 1 1
#2 4 1 2
#3 9 1 3
#4 4 2 1
#5 2 2 2
#6 6 2 3
#7 1 2 4
Or
m$i <- sequence(rle(m$L1)$lengths)
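sequence concatenates seq_len of each run length, so it rebuilds the within-run counter directly:
sequence(c(3, 4))
# [1] 1 2 3 1 2 3 4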
