Assigning a unique identification variable across repeated values

I will create a simple example of some dummy data:
case <- c('a','a','a','b','b','c','c','c','c','d','d','e','e')
object <- c(1,1,2,1,1,1,1,2,3,1,1,1,2)
df1 <- data.frame(case, object)
Now, for each unique combination of case and object, I want to create a corresponding unique numerical value (an identifier):
df1$UNIQ_ID <- ........
The end result should take the values c(1,1,2,3,3,4,4,5,6,7,7,8,9): a new identifier for each distinct case/object pair, in order of first appearance, as can be seen from
unique(df1$object[df1$case=='a'])
unique(df1$object[df1$case=='b'])
I have thought of using dplyr and group_by(case).

We can use .GRP from data.table after grouping by 'case' and 'object' on a data.table object (setDT(df1)).
library(data.table)
setDT(df1)[, UNIQ_ID := .GRP, .(case, object)]
df1
# case object UNIQ_ID
# 1: a 1 1
# 2: a 1 1
# 3: a 2 2
# 4: b 1 3
# 5: b 1 3
# 6: c 1 4
# 7: c 1 4
# 8: c 2 5
# 9: c 3 6
#10: d 1 7
#11: d 1 7
#12: e 1 8
#13: e 2 9
A base R option would be
grp <- interaction(df1)
as.numeric(factor(grp, levels= unique(grp)))
#[1] 1 1 2 3 3 4 4 5 6 7 7 8 9
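Since dplyr and group_by were mentioned, the same grouping idea works there as well. A minimal sketch, assuming dplyr >= 1.0.0 for cur_group_id(), which numbers groups in sorted-key order; that coincides with order of first appearance for this data:
library(dplyr)
df1 %>%
  group_by(case, object) %>%
  mutate(UNIQ_ID = cur_group_id()) %>%
  ungroup()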


Index the first and the last rows with NA in a dataframe

I have a large dataset, which contains many NAs. I want to find the rows where the first NA and the last NA appear. For example, for column A, I want the output to be the second row (the last NA before a number) and the fifth row (the first NA after a number). My code, shown below, does not work:
nonnaindex <- which(!is.na(df))
firstnonna <- apply(nonnaindex, 2, min)
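# Note: is.na(df) returns a logical matrix, but which() without arr.ind = TRUE
# flattens it to a plain integer vector, so apply(nonnaindex, 2, min) fails
# because the vector has no dimensions.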
Data:
ID A B C
1 NA NA 3
2 NA 2 2
3 3 3 1
4 4 5 NA
5 NA 6 NA
I believe this function might be what you are looking for:
first_and_last_na_row <- function(DT, col) {
  library(data.table)
  data.table(DT)[, grp := rleid(is.na(get(col)))][
    , rbind(last(.SD[is.na(get(col)) & grp == min(grp)]),
            first(.SD[is.na(get(col)) & grp == max(grp)]))][
    !is.na(ID)][, grp := NULL][]
}
which returns
first_and_last_na_row(DT, "A")
ID A B C
1: 2 NA 2 2
2: 5 NA 6 NA
first_and_last_na_row(DT, "B")
ID A B C
1: 1 NA NA 3
first_and_last_na_row(DT, "C")
ID A B C
1: 4 4 5 NA
first_and_last_na_row(DT, "D")
Empty data.table (0 rows) of 4 cols: ID,A,B,C
in case of
DT
ID A B C
1: 1 NA NA 3
2: 2 NA 2 2
3: 3 3 3 1
4: 4 4 5 NA
5: 5 NA 6 NA
or
first_and_last_na_row(DT2, "D")
ID A B C D
1: 1 NA NA 3 NA
in case of Akrun's (simplified) example
DT2
ID A B C D
1: 1 NA NA 3 NA
2: 2 NA 2 2 2
3: 3 3 3 1 NA
4: 4 4 5 NA NA
5: 5 NA 6 NA 4
Edit: Faster version using melt()
The OP has commented that his production data set consists of 4000 columns and 192 rows and that he needs the indices to clean another data set. He tried a for loop across all columns, which is very slow.
Therefore, I suggest reshaping the data set from wide to long format and using data.table's efficient grouping mechanism:
# reshape from wide to long format
long <- setDT(DT2)[, melt(.SD, id = "ID")][
  # add grouping variable to distinguish continuous streaks of NA/non-NA values
  # for each variable
  , grp := rleid(variable, is.na(value))]
# set sort order just for convenience, not essential
setorder(long, variable, ID)
long
ID variable value grp
1: 1 A NA 1
2: 2 A NA 1
3: 3 A 3 2
4: 4 A 4 2
5: 5 A NA 3
6: 1 B NA 4
7: 2 B 2 5
8: 3 B 3 5
9: 4 B 5 5
10: 5 B 6 5
11: 1 C 3 6
12: 2 C 2 6
13: 3 C 1 6
14: 4 C NA 7
15: 5 C NA 7
16: 1 D NA 8
17: 2 D 2 9
18: 3 D NA 10
19: 4 D NA 10
20: 5 D 4 11
Now we get the indices of the starting and ending NA sequences for each variable (if any):
# starting NA sequence
long[, .(ID = which(is.na(value) & grp == min(grp))), by = variable]
variable ID
1: A 1
2: A 2
3: B 1
4: D 1
# ending NA sequence
long[, .(ID = which(is.na(value) & grp == max(grp))), by = variable]
variable ID
1: A 5
2: C 4
3: C 5
Note that this returns all indices of the starting or ending NA sequences, which might be more convenient for subsequent cleaning of another data set. If only the last and first indices are required, this can be achieved by
long[long[, is.na(value) & grp == min(grp), by = variable]$V1, .(ID = max(ID)), by = variable]
variable ID
1: A 2
2: B 1
3: D 1
long[long[, is.na(value) & grp == max(grp), by = variable]$V1, .(ID = min(ID)), by = variable]
variable ID
1: A 5
2: C 4
I have tested this approach using a dummy data set of 192 rows times 4000 columns. The whole operation needed less than one second.
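For reference, here is a sketch of how such a timing could be reproduced; the dimensions come from the OP's comment, while the dummy values and the DT_big name are made up:
library(data.table)
set.seed(1)
# 192 rows x 4000 columns with scattered NAs, plus an ID column (hypothetical data)
DT_big <- as.data.table(matrix(sample(c(NA, 1:10), 192 * 4000, replace = TRUE),
                               nrow = 192))
DT_big[, ID := .I]
system.time({
  long_big <- DT_big[, melt(.SD, id = "ID")][, grp := rleid(variable, is.na(value))]
  long_big[, .(ID = which(is.na(value) & grp == min(grp))), by = variable]
})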

Removing rows in an R data.table with NAs in specific columns

I have a data.table with a large number of features. I would like to remove the rows that have NAs in certain features only.
Currently I am using the following to handle this:
data.joined.sample <- data.joined.sample %>%
  filter(!is.na(lat)) %>%
  filter(!is.na(long)) %>%
  filter(!is.na(temp)) %>%
  filter(!is.na(year)) %>%
  filter(!is.na(month)) %>%
  filter(!is.na(day)) %>%
  filter(!is.na(hour)) %>%
.......
Is there a more concise way to achieve this?
str(data.joined.sample)
Classes ‘data.table’ and 'data.frame': 336776 obs. of 50 variables:
We can select those columns, get a logical vector of complete rows from complete.cases, and use that to drop the rows containing NA:
data.joined.sample[complete.cases(data.joined.sample[colsofinterest]),]
where
colsofinterest <- c("lat", "long", "temp", "year", "month", "day", "hour")
Update
Based on the OP's comments, if it is a data.table, then subset the colsofinterest and use complete.cases
data.joined.sample[complete.cases(data.joined.sample[, colsofinterest, with = FALSE])]
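Alternatively, data.table ships its own na.omit method with a cols argument, which does this in one step (a sketch, assuming a reasonably recent data.table version, >= 1.9.6):
library(data.table)
na.omit(data.joined.sample, cols = colsofinterest)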
data.table objects, if that is in fact what you're working with, have a somewhat different syntax for the "[" function. Look through this console session:
> DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
> DT[x=="a"&y==1]
x y v
1: a 1 4
> is.na(DT[x=="a"&y==1]$v) <- TRUE # make one item NA
> DT[x=="a"&y==1]
x y v
1: a 1 NA
> DT
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 NA
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> DT[complete.cases(DT)] # note no comma
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 3 5
5: a 6 6
6: c 1 7
7: c 3 8
8: c 6 9
> DT # But that didn't remove the NA from DT; it only returned a value
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 NA
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> DT <- DT[complete.cases(DT)] # do this assignment to make permanent
> DT
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 3 5
5: a 6 6
6: c 1 7
7: c 3 8
8: c 6 9
Probably not the true "data.table way".

Why is class(.SD) on a data.table showing "data.frame"?

colnames() seems to be enumerating all columns per group as expected, but class() shows exactly two rows per group! And one of them is data.frame
> dt <- data.table("a"=1:3, "b"=1:3, "c"=1:3, "d"=1:3, "e"=1:3)
> dt[, class(.SD), by=a]
x y z V1
1: 1 1 1 data.table
2: 1 1 1 data.frame
3: 2 2 2 data.table
4: 2 2 2 data.frame
5: 3 3 3 data.table
6: 3 3 3 data.frame
> dt[, colnames(.SD), by=.(x, y, z)]
x y z V1
1: 1 1 1 a
2: 1 1 1 b
3: 1 1 1 c
4: 1 1 1 d
5: 1 1 1 e
6: 2 2 2 a
7: 2 2 2 b
8: 2 2 2 c
9: 2 2 2 d
10: 2 2 2 e
11: 3 3 3 a
12: 3 3 3 b
13: 3 3 3 c
14: 3 3 3 d
15: 3 3 3 e
.SD stands for Subset of Data.table, so it is itself a data.table object. And because every data.table is also a data.frame, class(.SD) returns a length-2 character vector for each group, which is a little confusing if you expect a single row per group.
To avoid such confusion, you can wrap the result into another list, enforcing a single row for each group.
library(data.table)
dt <- data.table(x=1:3, y=1:3)
dt[, .(class = list(class(.SD))), by = x]
# x class
#1: 1 data.table,data.frame
#2: 2 data.table,data.frame
#3: 3 data.table,data.frame
Every data.table is a data.frame, and shows both applicable classes when asked:
> class(dt)
[1] "data.table" "data.frame"
This applies to .SD, too, because .SD is a data.table by definition (".SD is a data.table containing the Subset of x's Data for each group").
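One way to see this directly is to test both predicates inside a grouped call (a small sketch using the dt from the answer above):
dt[, .(is_dt = is.data.table(.SD), is_df = is.data.frame(.SD)), by = x]
# x is_dt is_df
#1: 1 TRUE TRUE
#2: 2 TRUE TRUE
#3: 3 TRUE TRUE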

Calculate differences on a variable between factor levels

I have a data.frame with exactly one value measured for each subject at multiple timepoints. It simplifies to this:
> set.seed(42)
> x = data.frame(subject=rep(c('a', 'b', 'c'), 3), time=rep(c(1,2,3), each=3), value=rnorm(3*3, 0, 1))
> x
subject time value
1 a 1 1.37095845
2 b 1 -0.56469817
3 c 1 0.36312841
4 a 2 0.63286260
5 b 2 0.40426832
6 c 2 -0.10612452
7 a 3 1.51152200
8 b 3 -0.09465904
9 c 3 2.01842371
I want to calculate the change in value for each timepoint and for each subject. My current solution for this simple example is this:
> x$diff[x$time==1] = x$value[x$time==2] - x$value[x$time==1]
> x$diff[x$time==2] = x$value[x$time==3] - x$value[x$time==2]
> x
subject time value diff
1 a 1 1.37095845 -0.7380958
2 b 1 -0.56469817 0.9689665
3 c 1 0.36312841 -0.4692529
4 a 2 0.63286260 0.8786594
5 b 2 0.40426832 -0.4989274
6 c 2 -0.10612452 2.1245482
7 a 3 1.51152200 NA
8 b 3 -0.09465904 NA
9 c 3 2.01842371 NA
... and then remove the last rows. However, in my actual data set there are many more levels of time, and I need to do this for several columns instead of just value. The code gets very ugly. Is there a neat way to do this? A solution which does not assume that rows are ordered within subjects according to time would be nice.
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(x)); then, grouped by 'subject', we take the difference between the next value (shift(value, type='lead')) and the current value, and assign (:=) the output to create the 'Diff' column.
library(data.table) # v1.9.6+
setDT(x)[order(time), Diff := shift(value, type = 'lead') - value,
         by = subject]
# subject time value Diff
#1: a 1 1.37095845 -0.7380958
#2: b 1 -0.56469817 0.9689665
#3: c 1 0.36312841 -0.4692529
#4: a 2 0.63286260 0.8786594
#5: b 2 0.40426832 -0.4989274
#6: c 2 -0.10612452 2.1245482
#7: a 3 1.51152200 NA
#8: b 3 -0.09465904 NA
#9: c 3 2.01842371 NA
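Since the OP needs this for several columns, the same data.table idea extends via .SDcols. A sketch; cols is a hypothetical vector naming the measurement columns:
library(data.table)
cols <- c("value") # hypothetical: list all measurement columns here
setDT(x)[order(time),
         paste0("Diff_", cols) := lapply(.SD, function(v) shift(v, type = "lead") - v),
         by = subject, .SDcols = cols]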
You can use dplyr for this:
library(dplyr)
x %>%
  arrange(time, subject) %>%
  group_by(subject) %>%
  mutate(diff = c(diff(value), NA))
# Source: local data frame [9 x 4]
# Groups: subject [3]
#
# subject time value diff
# (fctr) (dbl) (dbl) (dbl)
# 1 a 1 1.37095845 -0.7380958
# 2 b 1 -0.56469817 0.9689665
# 3 c 1 0.36312841 -0.4692529
# 4 a 2 0.63286260 0.8786594
# 5 b 2 0.40426832 -0.4989274
# 6 c 2 -0.10612452 2.1245482
# 7 a 3 1.51152200 NA
# 8 b 3 -0.09465904 NA
# 9 c 3 2.01842371 NA
If you want to get rid of the NAs, add %>% na.omit.
You could try ave. ave applies a function to subsets of a vector; for more details see ?ave, e.g.:
x$diff <- ave(x$value, x$subject, FUN=function(x)c(diff(x), NA))
x
# subject time value diff
# 1 a 1 1.37095845 -0.7380958
# 2 b 1 -0.56469817 0.9689665
# 3 c 1 0.36312841 -0.4692529
# 4 a 2 0.63286260 0.8786594
# 5 b 2 0.40426832 -0.4989274
# 6 c 2 -0.10612452 2.1245482
# 7 a 3 1.51152200 NA
# 8 b 3 -0.09465904 NA
# 9 c 3 2.01842371 NA
BTW, the diff function requires that the rows are ordered by time.
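If the rows might not already be sorted, ordering by subject and time first keeps the differences correct (a minimal sketch building on the ave call above):
x <- x[order(x$subject, x$time), ]
x$diff <- ave(x$value, x$subject, FUN = function(v) c(diff(v), NA))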
EDIT: Update with set.seed(42).

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data.table?
Data:
DT = data.table(x=rep(c("a","b","c"), each=6), y=c(1,3,6), v=1:9,
                id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove rows belonging to any "id" group that has fewer than 5 rows. Here "id" is the grouping variable; in DT, groups "1" and "4" have fewer than 5 members, so their rows should be removed, leaving:
x y v id
1: a 3 5 2
2: a 6 6 2
3: b 1 7 2
4: b 3 8 2
5: b 6 9 2
6: b 1 1 3
7: b 3 2 3
8: b 6 3 3
9: c 1 4 3
10: c 3 5 3
11: c 6 6 3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors <- tapply(DT$id, DT$id, length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data.table answer
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
group_by(id) %>%
filter(n() >= 5)
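For completeness, the whole-group filter can also be written directly in data.table (a sketch; note it returns id as the first column, and the .I approach shown above is generally faster on many groups):
library(data.table)
DT[, if (.N >= 5L) .SD, by = id]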
