Row-wise difference in two lists using data.table in R

I want to use data.table to incrementally find new elements, i.e. for every row, check whether the values in the list have been seen before. If they have, we ignore them; if not, we select them.
I was able to wrap elements by group in a list, but I am unsure how I can find incremental differences.
Here's my attempt:
df = data.table::data.table(id = c('A','B','C','A','B','A','A','A','D','E','E','E'),
                            Value = c(1,2,3,4,3,5,2,3,7,2,3,9))
df_wrapped = df[, .(Values = list(unique(Value))), by = id]
expected_output = data.table::data.table(id = c("A","B","C","D","E"),
                                         Value = list(c(1,4,5,2,3), c(2,3), c(3), c(7), c(2,3,9)),
                                         Diff = list(c(1,4,5,2,3), c(NA), c(NA), c(7), c(9)),
                                         Count = c(5,0,0,1,1))
Thoughts about expected output:
For the first row, all elements are unique, so we include them in the Diff column.
In the second row, 2 and 3 have already occurred in row 1, so we ignore them. Ditto for row 3.
Similarly, 7 and 9 are seen for the first time in rows 4 and 5, so we include them.
Here's visual representation:
expected_output
id  Value      Diff       Count
A   1,4,5,2,3  1,4,5,2,3  5
B   2,3        NA         0
C   3          NA         0
D   7          7          1
E   2,3,9      9          1
I'd appreciate any thoughts. I am only looking for data.table-based solutions because of performance issues with my original dataset.

I am not sure why you specifically need to put them in a list, but in case you don't, here is a small piece of code that could help you.
library(data.table)  # attach so uniqueN() is visible inside j
df = data.table(id = c('A','B','C','A','B','A','A','A','D','E','E','E'),
                Value = c(1,2,3,4,3,5,2,3,7,2,3,9))
df = df[order(id, Value)]
df = df[!duplicated(Value), diff := Value][]               # keep only first global occurrences
df = df[, count := uniqueN(diff, na.rm = TRUE), by = id]   # number of new values per id
The outcome would be:
> df
id Value diff count
1: A 1 1 5
2: A 2 2 5
3: A 3 3 5
4: A 4 4 5
5: A 5 5 5
6: B 2 NA 0
7: B 3 NA 0
8: C 3 NA 0
9: D 7 7 1
10: E 2 NA 1
11: E 3 NA 1
12: E 9 9 1
Hope this helps, or at least gets you started.
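If the list-column shape from the question is still needed, the long result can be collapsed back per id. A sketch building on the df computed above (note Values comes out in the sorted order, and the if/else guard reproduces the NA entries from expected_output):
res = df[, .(Values = list(Value),
             Diff   = list(if (all(is.na(diff))) NA else diff[!is.na(diff)]),
             Count  = count[1]),
         by = id]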

Here is another possible approach:
library(data.table)
df = data.table(
  id = c('A','B','C','A','B','A','A','A','D','E','E','E'),
  Value = c(1,2,3,4,3,5,2,3,7,2,3,9))
valset <- c()   # running set of values seen in earlier groups
df[, {
    d <- setdiff(Value, valset)          # values never seen before this group
    valset <- unique(c(valset, Value))   # grow the running set
    .(Values = .(Value), Diff = .(d), Count = length(d))
  },
  by = .(id)]
output:
   id    Values      Diff Count
1:  A 1,4,5,2,3 1,4,5,2,3     5
2:  B       2,3               0
3:  C         3               0
4:  D         7         7     1
5:  E     2,3,9         9     1
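One small difference from expected_output: groups B and C come back with empty vectors in Diff rather than NA. Assuming the grouped result above is assigned to a variable res, the empty entries could be patched by reference, for example:
res[Count == 0, Diff := list(list(NA))]   # empty Diff -> NA, as in expected_output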


How to sort a dataframe in decreasing order with lapply and sort in R

Not sure if this is a duplicate, but I couldn't find anything that solves either my original problem or the issue I'm running into with the partial solution I did find.
The goal is to sort a dataframe independently by column.
Reproducible example
a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
a
name date1 date2 date3
1 a 2 0 0
2 a 3 2 2
3 a 1 3 0
4 b 3 1 3
5 b 1 2 2
6 b 2 0 1
library(plyr)
b <- ddply(a, "name", function(x) as.data.frame(lapply(x, sort)))
b
name date1 date2 date3
1 a 1 0 0
2 a 2 2 0
3 a 3 3 2
4 b 1 0 1
5 b 2 1 2
6 b 3 2 3
Now this works as expected, but is the opposite of what I'm looking to do.
Desired output
b
name date1 date2 date3
1 a 3 3 2
2 a 2 2 0
3 a 1 0 0
4 b 3 2 3
5 b 2 1 2
6 b 1 0 1
I've tried to add in the decreasing = TRUE parameter but haven't had any luck with the variations I've tried, and I usually end up with an error about missing arguments or undefined columns being selected. How does one correctly implement a decreasing sort with this syntax, and/or otherwise achieve the end result without relying on explicitly naming the columns (the names are dates, so they change often)?
Bonus
How could this code be adapted to account for NAs with na.last?
Thank you!
I think you nuked the data.frame rows with your code, which is not very good practice. Standard dplyr would use the arrange() function, like this:
library(tidyverse)
a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
a %>%
  arrange(name, -date1)
If you want to live a dangerous life, here is the code for it:
a %>%
  group_by(name) %>%
  mutate_all(sort, decreasing = TRUE)
name date1 date2 date3
<fct> <dbl> <dbl> <dbl>
1 a 3 3 2
2 a 2 2 0
3 a 1 0 0
4 b 3 2 3
5 b 2 1 2
6 b 1 0 1
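As an aside (my note, not part of the original answer): mutate_all is superseded in current dplyr, and with dplyr >= 1.0 the same independent per-column sort can be written with across:
a %>%
  group_by(name) %>%
  mutate(across(everything(), ~ sort(.x, decreasing = TRUE)))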
A solution with the data.table package is the following:
library(data.table)
a <- data.table(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
# alternatively:
# a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
# setDT(a)
b <- a[, lapply(.SD, sort, decreasing = TRUE), by = name]
.SD is the Subset of Data for each group, here created by by = name: it splits the original data.table by the values in the given column.
This also fulfills the bonus requirement: the na.last argument can be passed straight through to sort.
aa <- data.table(name = c("a","a","a","b","b","b"),date1 = c(NA,3,1,3,1,NA),date2 = c(0,2,NA,1,2,0),date3 = c(0,2,0,3,2,NA))
bb <- aa[, lapply(.SD, sort, decreasing = TRUE, na.last = TRUE), by = name]
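Given the aa defined above, bb should print along these lines, with NAs pushed to the end of each group:
   name date1 date2 date3
1:    a     3     2     2
2:    a     1     0     0
3:    a    NA    NA     0
4:    b     3     2     3
5:    b     1     1     2
6:    b    NA     0    NA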

Group observations into a specified number of groups according to id with a data.table solution

I have the following data.table:
library(data.table)
dt <- data.table(id = rep(1:5, 5), obs = rnorm(n = 25, mean = 1))[order(id)]
dt
id obs
1: 1 0.1470735
2: 1 1.6954685
3: 1 2.3947260
4: 1 2.1782338
5: 1 0.5168873
6: 2 -0.8879545
7: 2 1.9320034
8: 2 2.6269272
9: 2 1.5212627
10: 2 -0.1581711
Which has a total of 5 distinct ids (numbers 1 through 5) and 5 observations (obs) for each id. I want to group the ids together randomly in groups of X ids according to id and create a new column with the grouping. For this example, let's say I want to end up with a data.table like this:
id obs group
1: 1 0.1470735 A
2: 1 1.6954685 A
3: 1 2.3947260 A
4: 1 2.1782338 A
5: 1 0.5168873 A
6: 2 -0.8879545 A
7: 2 1.9320034 A
8: 2 2.6269272 A
9: 2 1.5212627 A
10: 2 -0.1581711 A
Where ids 1 and 2 are assigned to group A, ids 3 and 4 are assigned to group B, and id 5 is assigned to group C.
My actual dataset is much larger and will not necessarily group evenly, but I do not need the groups to contain the same number of ids. I do need to control the general size of the group (for example I want to be able to say 5 ids per group and if the last group has only 3 ids that's fine).
Could someone please help me with an elegant data.table way to accomplish this?
This is the same as @Shree's answer, just using length.out in rep and no dplyr.
I do need to control the general size of the group (for example I want to be able to say 5 ids per group and if the last group has only 3 ids that's fine).
You can make an id table; assign groups there; and if necessary merge back:
# bigger, reproducible example
library(data.table)
max_per_group = 5
n_ids = 1e5+1
DT = data.table(id = rep(1:n_ids, each = max_per_group), obs = 1)
# make an id table
idDT = unique(DT[, "id"])
# randomly assign groups
idDT[, g := sample(rep(.I, each = max_per_group, length.out = .N))]
# merge back if needed
DT[idDT, on=.(id), g := i.g]
You refer to "my actual dataset", but R allows you to juggle multiple tables; trying to do everything in one is almost always counterproductive.
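As a quick sanity check (my addition, not part of the original answer), the group sizes can be tabulated to confirm that no group exceeds max_per_group:
idDT[, .N, by = g][, max(N)]   # should be <= max_per_group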
EDIT: Didn't notice that you needed this with data.table. I'll leave this here as an alternative.
I am creating a dataframe with id and a randomly assigned group. This will be joined with your data to get the group for each record by id:
library(dplyr)
library(data.table)
dt <- data.table(id = rep(1:5, 5), obs = rnorm(n = 25, mean = 1))[order(id)]
max_per_group <- 5
n_ids <- length(unique(dt$id))
data.frame(id = unique(dt$id), grp = sample(rep(LETTERS, max_per_group), n_ids)) %>%
left_join(dt, ., by = "id")
id obs grp
1 1 1.28879713 S
2 1 1.04471197 S
3 1 0.36470847 S
4 1 0.46741567 S
5 1 1.07749891 S
6 2 1.73640785 K
7 2 1.61144042 K
8 2 2.85196859 K
9 2 1.84848117 K
10 2 2.11395863 K
11 3 0.88623462 S
12 3 2.11706351 S
13 3 1.29225433 S
14 3 0.30458037 S
15 3 -1.72070005 S
16 4 2.24593162 U
17 4 2.10346287 U
18 4 2.28724412 U
19 4 0.02978044 U
20 4 0.56234660 U
21 5 2.92050008 F
22 5 1.08048974 F
23 5 0.58885261 F
24 5 1.53299092 F
25 5 1.47271123 F
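Since both answers assign groups with sample(), the output above will differ from run to run; calling set.seed() before sampling makes the assignment reproducible (my note, not from the answers):
set.seed(1)   # any fixed seed gives a repeatable group assignment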

Repeat sequence by group

I have the following dataframe:
a <- data.frame(
  group1 = factor(rep(c("a","b"), each = 6, times = 1)),
  time = rep(1:6, each = 1, times = 2),
  newcolumn = c(1,1,2,2,3,3,1,1,2,2,3,3)
)
I'm looking to replicate the output of newcolumn with a rep-by-group operation (the time variable is there for ordering purposes). In other words, for each group, ordered by time, how can I assign the sequence 1,1,2,2,...,n,n? I also need a general solution (for the case that groups have differing numbers of rows, or that I want to repeat values 3, 10, or n times).
For instance, I can generate that sequence with this:
newcolumn=rep(1:3,each=2,times=2)
But that wouldn't work in a group by statement where group1 has differing rows.
We specify the length.out in the rep after grouping by 'group1':
library(dplyr)
a %>%
  group_by(group1) %>%
  mutate(new = rep(seq_len(n()/2), each = 2, length.out = n()))
NOTE: each and times are not used in the same call; we use either each or times.
EDIT: Based on comments from #r2evans
A data.table alternative:
library(data.table)
DT <- as.data.table(a[1:2])
DT[order(time),newcolumn := rep(seq_len(.N/2), each=2, length.out=.N),by=c("group1")]
DT
# group1 time newcolumn
# 1: a 1 1
# 2: a 2 1
# 3: a 3 2
# 4: a 4 2
# 5: a 5 3
# 6: a 6 3
# 7: b 1 1
# 8: b 2 1
# 9: b 3 2
# 10: b 4 2
# 11: b 5 3
# 12: b 6 3
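For the general case mentioned in the question (repeating values 3, 10, or n times), the hard-coded 2 can be replaced by a parameter. A sketch along the same lines, where k is the repeat count (my generalization, not from the answers):
k <- 3   # repeat each value k times within each group
DT[order(time),
   newcolumn := rep(seq_len(ceiling(.N / k)), each = k, length.out = .N),
   by = group1]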

R previous index per group

I am trying to set the previous observation per group to NA, if a certain condition applies.
Assume I have the following data.table:
DT = data.table(group=rep(c("b","a","c"),each=3), v=c(1,1,1,2,2,1,1,2,2), y=c(1,3,6,6,3,1,1,3,6), a=1:9, b=9:1)
and I am using the simple condition:
DT[y == 6]
How can I set the previous rows of DT[y == 6] within DT to NA, namely rows 2 and 8 of DT? That is, how do I set the respective previous rows per group to NA?
Please note: from DT we can see that there are 3 rows where y is equal to 6, but for group a (row 4) I do not want to set the previous row to NA, as the previous row belongs to a different group.
So what I want, in other terms, is the previous index of certain elements in a data.table. Is that possible? It would also be interesting if one could go back further than 1 period. Thanks for any hints.
You can find the row indices where current y is not 6 and next row is 6, then set the whole row to NA:
DT[shift(y, type="lead")==6 & y!=6,
(names(DT)) := lapply(.SD, function(x) NA)]
DT
output:
group v y a b
1: b 1 1 1 9
2: <NA> NA NA NA NA
3: b 1 6 3 7
4: a 2 6 4 6
5: a 2 3 5 5
6: a 1 1 6 4
7: c 1 1 7 3
8: <NA> NA NA NA NA
9: c 2 6 9 1
As usual, Frank comments with a more succinct version:
DT[shift(y, type="lead")==6 & y!=6, names(DT) := NA]

Replacing the values from another data frame based on the information in the first column in R

I'm trying to merge information from two different data frames, but the problem begins with uneven dimensions and with matching on the information in a column rather than the column index. R's merge function and dplyr's joins don't work for my data.
I have two data frames (one is a subset of the other, with updated info in the last column):
df1 = data.frame(Name = LETTERS[1:9], val = seq(1:3), Case = c("NA","1","NA","NA","1","NA","1","NA","NA"))
Name val Case
1 A 1 NA
2 B 2 1
3 C 3 NA
4 D 1 NA
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 NA
9 I 3 NA
Some rows in the Case column in df1 have to be changed using the info in df2 below:
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1")
Name val Case
1 A 1 1
2 D 2 1
3 H 3 1
So there's nothing important in the val column; I only added it to the examples to indicate that I have more than two columns, and that my real data is much bigger than these examples.
Basically, I want to change specific rows by checking the information in the first column (in this case, unique letters), and in the end I still want df1 as the final data frame.
For a better explanation, I want to see something like this:
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
Note the changed information for A, D and H.
Thanks.
%in% from base R to the rescue.
df1 = data.frame(Name = LETTERS[1:9], val = seq(1:3),
                 Case = c("NA","1","NA","NA","1","NA","1","NA","NA"),
                 stringsAsFactors = FALSE)
df2 = data.frame(Name = c("A","D","H"), val = seq(1:3), Case = "1", stringsAsFactors = FALSE)
df1$Case <- ifelse(df1$Name %in% df2$Name, df2$Case[df2$Name %in% df1$Name], df1$Case)
df1
Output:
> df1
Name val Case
1 A 1 1
2 B 2 1
3 C 3 NA
4 D 1 1
5 E 2 1
6 F 3 NA
7 G 1 1
8 H 2 1
9 I 3 NA
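One caveat with the ifelse() line above: the yes branch is recycled positionally, so it only lines up here because every Case in df2 is "1". A match()-based variant (my suggestion) aligns values by Name and stays correct when df2 carries different values per row:
hits <- match(df1$Name, df2$Name)   # position of each df1 Name in df2, NA when absent
df1$Case <- ifelse(is.na(hits), df1$Case, df2$Case[hits])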
Here is what I would do using dplyr:
library(dplyr)
df1 %>%
  left_join(df2, by = "Name") %>%
  mutate(val = if_else(is.na(val.y), val.x, val.y),
         Case = if_else(is.na(Case.y), Case.x, Case.y)) %>%
  select(Name, val, Case)
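A slightly more compact spelling of the same idea (my suggestion), using dplyr::coalesce, which takes the first non-NA value of its arguments (this assumes the Case columns are character, not factor):
df1 %>%
  left_join(df2, by = "Name") %>%
  mutate(val = coalesce(val.y, val.x),
         Case = coalesce(Case.y, Case.x)) %>%
  select(Name, val, Case)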
