I have a large data.table (millions of rows) that I need to trim down to one row per ID. The rule: if an ID has any art other than "X",
its "X" rows should be dropped. If "X" is the only art for that ID, a single "X" row should stay.
Test dataset:
dt <- data.table(
ID=c(1,1,1,2,2,3,4,4),
art=c("X", "Y", "X", "X", "X", "X", "Z", "X"),
redskb=c("a", "Y", "a", "b", "b", "c", "k", "n")
)
ID art redskb
1: 1 X a
2: 1 Y Y
3: 1 X a
4: 2 X b
5: 2 X b
6: 3 X c
7: 4 Z k
8: 4 X n
Required output:
ID art redskb
1: 1 Y Y
2: 2 X b
3: 3 X c
4: 4 Z k
I tried with
unique(dt, by = c("ID"))
but could not find an efficient way to combine it with the conditional logic.
I'd try something like this:
unique(dt)[, flag := if (.N == 1) TRUE else art != "X", by = ID][(flag)]
## ID art redskb flag
## 1: 1 Y Y TRUE
## 2: 2 X b TRUE
## 3: 3 X c TRUE
## 4: 4 Z k TRUE
data.table:
dt[order(ID,art=="X"),.SD[1],ID]
or #Frank's version:
unique(dt[order(ID,art == "X")], by="ID")
# ID art redskb
# 1: 1 Y Y
# 2: 2 X b
# 3: 3 X c
# 4: 4 Z k
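The ordering trick above works because art == "X" is a logical vector and FALSE sorts before TRUE, so within each ID any non-X row comes first and unique(..., by = "ID") keeps it. A minimal sketch on the question's data as built by the data.table() call (the names ordered and res are only illustrative):

```r
library(data.table)

dt <- data.table(
  ID     = c(1, 1, 1, 2, 2, 3, 4, 4),
  art    = c("X", "Y", "X", "X", "X", "X", "Z", "X"),
  redskb = c("a", "Y", "a", "b", "b", "c", "k", "n")
)

# FALSE (non-X) sorts before TRUE (X), so non-X rows lead each ID
ordered <- dt[order(ID, art == "X")]

# keep the first row of every ID
res <- unique(ordered, by = "ID")
res$art     # "Y" "X" "X" "Z"
res$redskb  # "Y" "b" "c" "k"
```

If an ID has several different non-X values, this keeps only the first of them in the original row order, which may or may not be what is wanted.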
dplyr:
dt %>% group_by(ID) %>% slice(which.max(art != "X"))
# # A tibble: 4 x 3
# # Groups: ID [4]
# ID art redskb
# <dbl> <fctr> <chr>
# 1 1 Y Y
# 2 2 X b
# 3 3 X c
# 4 4 Z k
We can do
dt[dt[, .I[if(uniqueN(art) >1 & any(art == "X")) art!="X" else seq_len(.N)==1], ID]$V1]
# ID art redskb
#1: 1 Y Y
#2: 2 X b
#3: 3 X c
#4: 4 Z k
I have two incomplete data.tables with the same column names.
dt1 <- data.table(id = c(1, 2, 3), v1 = c("w", "x", NA), v2 = c("a", NA, "c"))
dt2 <- data.table(id = c(2, 3, 4), v1 = c(NA, "y", "z"), v2 = c("b", "c", NA))
They look like this:
dt1
id v1 v2
1: 1 w a
2: 2 x <NA>
3: 3 <NA> c
> dt2
id v1 v2
1: 2 <NA> b
2: 3 y c
3: 4 z <NA>
Is there a way to merge the two by filling in the missing info?
This is the result I'm after:
id v1 v2
1: 1 w a
2: 2 x b
3: 3 y c
4: 4 z <NA>
I've tried various data.table joins and merges, but I get either the columns repeated:
> merge(dt1,
+ dt2,
+ by = "id",
+ all = TRUE)
id v1.x v2.x v1.y v2.y
1: 1 w a <NA> <NA>
2: 2 x <NA> <NA> b
3: 3 <NA> c y c
4: 4 <NA> <NA> z <NA>
or the rows repeated:
> merge(dt1,
+ dt2,
+ by = names(dt1),
+ all = TRUE)
id v1 v2
1: 1 w a
2: 2 <NA> b
3: 2 x <NA>
4: 3 <NA> c
5: 3 y c
6: 4 z <NA>
Both data.tables have the same column names.
You can group by ID and get the unique values after omitting NAs, i.e.
library(data.table)
merge(dt1, dt2, all = TRUE)[,
lapply(.SD, function(i)na.omit(unique(i))),
by = id][]
# id v1 v2
#1: 1 w a
#2: 2 x b
#3: 3 y c
#4: 4 z <NA>
You could also start out with rbind():
rbind(dt1, dt2)[, lapply(.SD, \(x) unique(x[!is.na(x)])), by = id]
# id v1 v2
# <num> <char> <char>
# 1: 1 w a
# 2: 2 x b
# 3: 3 y c
# 4: 4 z <NA>
First do a full_join, then group by id, fill the missing values within each group, and keep one row per id:
library(dplyr)
library(tidyr)
dt1 %>%
  full_join(dt2, by = c("id", "v1", "v2")) %>%
  group_by(id) %>%
  fill(starts_with('v'), .direction = 'updown') %>%
  slice(1) %>%
  ungroup()
Output:
# A tibble: 4 × 3
id v1 v2
<dbl> <chr> <chr>
1 1 w a
2 2 x b
3 3 y c
4 4 z NA
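A further data.table sketch, not taken from the answers above: merge on id only and collapse each .x/.y pair with fcoalesce() (available in data.table 1.12.4 and later). It assumes that when both tables hold a non-missing value for the same id and column, the two values agree. If they differ, the value from dt1 silently wins.

```r
library(data.table)

dt1 <- data.table(id = c(1, 2, 3), v1 = c("w", "x", NA), v2 = c("a", NA, "c"))
dt2 <- data.table(id = c(2, 3, 4), v1 = c(NA, "y", "z"), v2 = c("b", "c", NA))

# full outer join on id gives v1.x, v2.x, v1.y, v2.y
m <- merge(dt1, dt2, by = "id", all = TRUE)

# take the dt1 value where present, otherwise the dt2 value
m[, v1 := fcoalesce(v1.x, v1.y)]
m[, v2 := fcoalesce(v2.x, v2.y)]

res <- m[, .(id, v1, v2)]
res$v1  # "w" "x" "y" "z"
res$v2  # "a" "b" "c" NA
```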
I am struggling to understand how the mult argument works in an update-on-join.
What I am trying to do is to implement a left join as defined in lj.
For performance reasons I'd like to update the left table by reference.
The non-trivial part is that when the left and right tables have a column in common (other than the join columns), I'd like to use the first value in the right table to override the value in the left table.
I thought mult would help me deal with this multiple-match issue, but I cannot get it right.
library(data.table)
X <- data.table(x = c("a", "a", "b", "c", "d"), y = c(0, 1, 1, 2, 2), t = 0:4)
X
# x y t
# <char> <num> <int>
#1: a 0 0
#2: a 1 1
#3: b 1 2
#4: c 2 3
#5: d 2 4
Y <- data.table(xx = c("f", "b", "c", "c", "e", "a"), y = c(2, NA, 3, 4, 5, 6), u = 2:7)
Y
# xx y u
# <char> <num> <int>
#1: f 2 2
#2: b NA 3
#3: c 3 4
#4: c 4 5
#5: e 5 6
#6: a 6 7
# Expected result
# x y t
# <char> <num> <int>
#1: a 6 0 <= single match on xx == "a" so Y[xx == "a", y] is used
#2: a 6 1 <= single match on xx == "a" so Y[xx == "a", y] is used
#3: b NA 2 <= single match on xx == "b" so Y[xx == "b", y] is used
#4: c 3 3 <= mult match on xx == "c" so Y[xx == "c", y[1L]] is used
#5: d 2 4 <= no xx == "d" in Y so nothing changes
copy(X)[Y, y := i.y, by = .EACHI, on = c(x = "xx"), mult = "first"][]
# x y t
# <char> <num> <int>
#1: a 6 0
#2: a 1 1 <= a should always have the same value ie 6
#3: b NA 2
#4: c 4 3 <= y == 4 is not the first value of y in the Y table
#5: d 2 4
# Using mult = "all" is the closest I get from the right result
copy(X)[Y, y := i.y, by = .EACHI, on = c(x = "xx"), mult = "all"][]
# x y t
# <char> <num> <int>
#1: a 6 0
#2: a 6 1
#3: b NA 2
#4: c 4 3 <= y == 4 is not the first value of y in the Y table
#5: d 2 4
Can someone explain what is wrong in the above?
I guess I could use Y[X, ...] to get what I want, but X is very large and the performance of Y[X, ...] is much worse.
I'd like to use the first value in the right table to override the value of the left table
Select the first values and update with them alone:
X[unique(Y, by="xx", fromLast=FALSE), on=.(x=xx), y := i.y]
x y t
1: a 6 0
2: a 6 1
3: b NA 2
4: c 3 3
5: d 2 4
fromLast= can select the first or last row when dropping dupes.
How multiple matches are handled:
In x[i, mult=], if a row of i has multiple matches, mult determines which matching row(s) of x are selected. This explains the results shown in the OP.
In x[i, v := i.v], if multiple rows of i match to the same row in x, all of the relevant i-rows write to the x-row sequentially, so the last i-row gets the final write. Turn on verbose output to see how many edits are made in an update -- it will exceed the number of x rows in this case (because the rows are edited repeatedly):
options(datatable.verbose=TRUE)
data.table(a=1,b=2)[.(a=1, b=3:4), on=.(a), b := i.b][]
# Assigning to 2 row subset of 1 rows
a b
1: 1 4
In an update-on-join with :=, the behaviour is effectively mult = "last": when several rows of i match the same row of x, the last one wins.
I recall this being described somewhere in the documentation.
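To make the two behaviours described above concrete, here is a small sketch on the question's X and Y. It compares a plain update-on-join, where the last matching row of Y is the one that sticks, with an update from a deduplicated Y. The names last_wins and first_wins are only illustrative.

```r
library(data.table)

X <- data.table(x = c("a", "a", "b", "c", "d"), y = c(0, 1, 1, 2, 2), t = 0:4)
Y <- data.table(xx = c("f", "b", "c", "c", "e", "a"), y = c(2, NA, 3, 4, 5, 6), u = 2:7)

# Plain update: both "c" rows of Y write to the same row of X,
# so the later one (y = 4) is what remains
last_wins <- copy(X)[Y, on = .(x = xx), y := i.y]
last_wins$y   # 6 6 NA 4 2

# Keep only the first row per xx in Y, then update
first_wins <- copy(X)[unique(Y, by = "xx"), on = .(x = xx), y := i.y]
first_wins$y  # 6 6 NA 3 2
```

Deduplicating Y only touches the smaller right-hand table, so X is still updated by reference in a single join.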
I have a dataset with the structure below, and I need to generate one column per value of v2, filled with the corresponding value of v3. How can I do this? The desired result is shown below.
df <- tibble(v1 = c(3, 3, 2, 2, 3, 1, 1, 1, 0), v2 = c("a", "b", "a", "c", "c", "b", "c", "a", "a"), v3 = c("y", "y", "n","y", "n","y", "y",NA, "n"))
> df
# A tibble: 9 x 3
v1 v2 v3
<dbl> <chr> <chr>
1 3 a y
2 3 b y
3 2 a n
4 2 c y
5 3 c n
6 1 b y
7 1 c y
8 1 a NA
9 0 a n
The desired outcome: within each v1 group, take the value of v3 on the row where v2 == "a" and put it in a new column v_a for every row of that group. Apply the same logic to the other values of v2.
# A tibble: 9 x 4
v1 v2 v3 v_a ...
<dbl> <chr> <chr> <chr>
1 3 a y y
2 3 b y y
3 2 a n n
4 2 c y n
5 3 c n y
6 1 b y NA
7 1 c y NA
8 1 a NA NA
9 0 a n n
We can reshape the data to wide format and join it back to the original:
library(dplyr)
df %>%
tidyr::pivot_wider(names_from = v2, values_from = v3, names_prefix = 'v_') %>%
left_join(df, by = 'v1')
# A tibble: 9 x 6
# v1 v_a v_b v_c v2 v3
# <dbl> <chr> <chr> <chr> <chr> <chr>
#1 3 y y n a y
#2 3 y y n b y
#3 3 y y n c n
#4 2 n NA y a n
#5 2 n NA y c y
#6 1 NA y y b y
#7 1 NA y y c y
#8 1 NA y y a NA
#9 0 n NA NA a n
To get the names with a suffix instead of a prefix (a_v rather than v_a), we can use:
cols <- unique(df$v2)
df %>%
  tidyr::pivot_wider(names_from = v2, values_from = v3) %>%
  left_join(df, by = 'v1') %>%
  rename_with(~ paste0(.x, '_v'), all_of(cols))
An option using data.table:
uv <- setDT(df)[, unique(v2)]
df[, paste0(uv, "_v") := lapply(uv, function(x)
if(any(v2==x)) v3[v2==x] else NA_character_), v1]
output:
v1 v2 v3 a_v b_v c_v
1: 3 a y y y n
2: 3 b y y y n
3: 2 a n n <NA> y
4: 2 c y n <NA> y
5: 3 c n y y n
6: 1 b y <NA> y y
7: 1 c y <NA> y y
8: 1 a <NA> <NA> y y
9: 0 a n n <NA> <NA>
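The reshape-and-join idea can also be sketched in data.table with dcast(). This is an illustration rather than one of the original answers, and it assumes each (v1, v2) pair occurs at most once. With duplicates, dcast() would fall back to an aggregation function instead of copying v3.

```r
library(data.table)

df <- data.table(
  v1 = c(3, 3, 2, 2, 3, 1, 1, 1, 0),
  v2 = c("a", "b", "a", "c", "c", "b", "c", "a", "a"),
  v3 = c("y", "y", "n", "y", "n", "y", "y", NA, "n")
)

# one row per v1, one column per value of v2, cells hold v3
wide <- dcast(df, v1 ~ v2, value.var = "v3")

# prefix the new columns: a, b, c -> v_a, v_b, v_c
uv <- setdiff(names(wide), "v1")
setnames(wide, uv, paste0("v_", uv))

# join back so every original row carries its group's values
res <- wide[df, on = "v1"]
res$v_a  # "y" "y" "n" "n" "y" NA NA NA "n"
```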
Tried finding a similar post, but couldn't.
I have a column in a data.table that looks like this:
x,x,x,x,y,y,y,c,c,c
I want to add an index in a separate column, like this:
1,1,1,1,2,2,2,3,3,3
How can I do it?
I'd go with this, which has the advantage of working with both data frames and data tables (and probably tibbles, though I haven't checked). The index is taken from the first appearance of each value in col, so it does not depend on equal values being in adjacent rows (if col goes x,x,x,x,y,y,y,x,x,x, all the x rows get index 1).
> dt <- data.table(col = c("x", "x", "x", "x", "y", "y", "y", "c", "c", "c"))
> dt$index = as.numeric(factor(dt$col,levels=unique(dt$col)))
> dt
col index
1: x 1
2: x 1
3: x 1
4: x 1
5: y 2
6: y 2
7: y 2
8: c 3
9: c 3
10: c 3
A solution with data.table:
library(data.table)
dt <- data.table(col = c("x", "x", "x", "x", "y", "y", "y", "c", "c", "c"))
dt[ , idx := .GRP, by = col]
# col idx
# 1: x 1
# 2: x 1
# 3: x 1
# 4: x 1
# 5: y 2
# 6: y 2
# 7: y 2
# 8: c 3
# 9: c 3
# 10: c 3
A solution in base R:
dat <- data.frame(col = c("x", "x", "x", "x", "y", "y", "y", "c", "c", "c"))
dat <- transform(dat, idx = match(col, unique(col)))
# col idx
# 1 x 1
# 2 x 1
# 3 x 1
# 4 x 1
# 5 y 2
# 6 y 2
# 7 y 2
# 8 c 3
# 9 c 3
# 10 c 3
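Another option, which is only correct when equal values sit in adjacent rows (the column is called a here):
dt$index <- cumsum(!duplicated(dt$a))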
dt
a index
# 1 x 1
# 2 x 1
# 3 x 1
# 4 x 1
# 5 y 2
# 6 y 2
# 7 y 2
# 8 c 3
# 9 c 3
# 10 c 3
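The approaches above agree on the question's data but differ once a value reappears later in the column. A small comparison on a made-up vector (not from the question), with data.table's rleid() added for the case where each run of adjacent values should get its own number:

```r
library(data.table)

col <- c("x", "x", "y", "y", "x")

# index by first appearance of the value: the trailing x is group 1 again
by_value <- match(col, unique(col))
by_value  # 1 1 2 2 1

# index by run of adjacent equal values: the trailing x starts a new run
by_run <- rleid(col)
by_run    # 1 1 2 2 3

# cumsum(!duplicated()) only counts first appearances, so the trailing x
# inherits the index of the group before it
naive <- cumsum(!duplicated(col))
naive     # 1 1 2 2 2
```

Which one is right depends on whether the index should identify the value or the run.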
I have a data.table that looks something like this:
> dt <- data.table(
group1 = c("a", "a", "a", "b", "b", "b", "b"),
group2 = c("x", "x", "y", "y", "z", "z", "z"),
data1 = c(NA, rep(T, 3), rep(F, 2), "sometimes"),
data2 = c("sometimes", rep(F,3), rep(T,2), NA))
> dt
group1 group2 data1 data2
1: a x NA sometimes
2: a x TRUE FALSE
3: a y TRUE FALSE
4: b y TRUE FALSE
5: b z FALSE TRUE
6: b z FALSE TRUE
7: b z sometimes NA
My goal is to find the number of non-NA records in each data column, grouped by group1 and group2.
group1 group2 data1 data2
1: a x 1 2
2: a y 1 1
3: b y 1 1
4: b z 3 2
I have this code left over from dealing with another part of the dataset, which had no NAs and was logical:
dt[
,
lapply(.SD, sum),
by = list(group1, group2),
.SDcols = c("data3", "data4")
]
But it won't work with NA values, or non-logical values.
dt[, lapply(.SD, function(x) sum(!is.na(x))), by = .(group1, group2)]
# group1 group2 data1 data2
#1: a x 1 2
#2: a y 1 1
#3: b y 1 1
#4: b z 3 2
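If the real table has other non-grouping columns that should not be counted, the same call can be restricted with .SDcols, as in the question's own snippet. A sketch assuming the columns of interest all start with "data" and a data.table version that accepts patterns() in .SDcols (1.12.0 or later):

```r
library(data.table)

dt <- data.table(
  group1 = c("a", "a", "a", "b", "b", "b", "b"),
  group2 = c("x", "x", "y", "y", "z", "z", "z"),
  data1  = c(NA, rep(TRUE, 3), rep(FALSE, 2), "sometimes"),
  data2  = c("sometimes", rep(FALSE, 3), rep(TRUE, 2), NA)
)

# count non-NA values per group, only in columns whose names start with "data"
res <- dt[, lapply(.SD, function(x) sum(!is.na(x))),
          by = .(group1, group2),
          .SDcols = patterns("^data")]
res$data1  # 1 1 1 3
res$data2  # 2 1 1 2
```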
Another alternative is to melt and then dcast, which avoids the per-column operation. melt(..., na.rm = TRUE) drops the NAs, and dcast falls back to length as the aggregation function, so each cell is the count of remaining values.
dcast(melt(dt, id = c("group1", "group2"), na.rm = TRUE), group1 + group2 ~ variable)
# Aggregate function missing, defaulting to 'length'
# group1 group2 data1 data2
# 1: a x 1 2
# 2: a y 1 1
# 3: b y 1 1
# 4: b z 3 2
Using dplyr (with some help from David Arenburg & eddi):
library(dplyr)
dt %>% group_by(group1, group2) %>% summarise(across(everything(), ~ sum(!is.na(.x))))
Source: local data table [4 x 4]
Groups: group1
group1 group2 data1 data2
1 a x 1 2
2 a y 1 1
3 b y 1 1
4 b z 3 2