DT1 holds the mapping for all IDs, DT2 holds relationships between IDs.
I'd like to join the mappings from DT1 directly to both IDs in DT2.
Example datasets below:
# Join a mapping to multiple variables.
library(data.table)
# Dataset with mappings.
set.seed(1)
dt1 <- data.table(id=1:10,
group=sample(letters[1:4], 10, replace=TRUE))
# > dt1
# id group
# 1: 1 b
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 a
# 6: 6 d
# 7: 7 d
# 8: 8 c
# 9: 9 c
# 10: 10 a
# Dataset with relationship between IDs.
dt2 <- data.table(id1=1:5,
id2=6:10)
# > dt2
# id1 id2
# 1: 1 6
# 2: 2 7
# 3: 3 8
# 4: 4 9
# 5: 5 10
I could, of course, use two joins: first on ID1, then on ID2. Another way of achieving what I want is to first melt DT2 so that all the ID values are in a single column before joining...
# Now melt, join group variable of DT1 to DT2, then cast again to obtain
# original structure.
dt2[, i := .I] # need an observation ID
dt2Long <- melt(dt2, id="i")
setkey(dt2Long, value)
dcast(dt2Long[dt1], i ~ variable, value.var=c("value", "group"))
# i value_id1 value_id2 group_id1 group_id2
# 1: 1 1 6 b d
# 2: 2 2 7 b d
# 3: 3 3 8 c c
# 4: 4 4 9 d c
# 5: 5 5 10 a a
This gives the desired result, but I would like to know whether something like the following is possible (i.e. joining a single variable to two variables at once)?
setkey(dt1, id)
dt1[dt2, on=c("id1", "id2")]
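For reference, the two-join approach mentioned above can be sketched like this (the column names group1/group2 are my own choice):

```r
library(data.table)

set.seed(1)
dt1 <- data.table(id = 1:10,
                  group = sample(letters[1:4], 10, replace = TRUE))
dt2 <- data.table(id1 = 1:5, id2 = 6:10)

# Two update joins, matching each ID column of dt2 against dt1$id in turn:
dt2[dt1, on = .(id1 = id), group1 := i.group]
dt2[dt1, on = .(id2 = id), group2 := i.group]
dt2
```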
Related
I noticed that, when updating by reference, I was losing some rows from my right table when there is more than one row per join key.
No matter how I search the forum, I can't find how to do it; something escapes me.
Even with mult= it doesn't seem to work.
For performance and data-volume reasons, I would like to keep updating by reference.
In my reprex below, I expected two rows for a = 2:
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = c(2,2,5), b = 23:25)
A[B, on = 'a', newvar := i.b, mult = 'all']
Thanks!!
One option is to create a list column in 'B', then do the join and assign (:=), since := cannot expand the rows of the original data.
A[B[, .(b = .(b)), a], on = .(a), newvar := i.b]
Output:
> A
a b newvar
1: 1 12
2: 2 13 23,24
3: 3 14
4: 4 15
Once we have the list column, it is easier to unnest:
library(tidyr)
A[, unnest(.SD, newvar, keep_empty = TRUE)]
# A tibble: 5 x 3
a b newvar
<int> <int> <int>
1 1 12 NA
2 2 13 23
3 2 13 24
4 3 14 NA
5 4 15 NA
Or use a left join (all.x = TRUE) with merge.data.table, which does expand the rows:
merge(A, B, by = 'a', all.x = TRUE)
a b.x b.y
1: 1 12 NA
2: 2 13 23
3: 2 13 24
4: 3 14 NA
5: 4 15 NA
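For what it's worth, the reason mult = 'all' has no effect here is that a := join never changes nrow(A); mult only controls which matching rows are used. A plain (non-reference) join, by contrast, does expand the rows, at the cost of materialising a new table:

```r
library(data.table)

A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = c(2, 2, 5), b = 23:25)

# A normal right join returns one row per match, so a == 2 appears twice:
B[A, on = .(a)]
```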
I need to merge some data from one data.table into another. I know how to add a new column from one data.table to another in a join. Below, values from column b in df2 are added to df1 in a join on the Id column:
df1 <- data.table(Id = letters[1:5], a = 1:5)
df2 <- data.table(Id = letters[1:3], a = 7:9, b = 7:9)
setkey(df1, Id)
setkey(df2, Id)
df1[df2, b := b][]
#> Id a b
#> 1: a 1 7
#> 2: b 2 8
#> 3: c 3 9
#> 4: d 4 NA
#> 5: e 5 NA
However, that idiom does not work when the column already exists in df1, here column a:
df1[df2, a := a][]
#> Id a
#> 1: a 1
#> 2: b 2
#> 3: c 3
#> 4: d 4
#> 5: e 5
I understand that a is not updated by this assignment because the column a already exists in df1: the reference to a on the right-hand side of the assignment resolves to that value, not the one in df2.
So how do I update the values in df1$a with those in df2$a in a join on matching Id, to get the following:
#> Id a
#> 1: a 7
#> 2: b 8
#> 3: c 9
#> 4: d 4
#> 5: e 5
From ?data.table:
When i is a data.table, the columns of i can be referred to in j by using the prefix i., e.g., X[Y, .(val, i.val)]. Here val refers to X's column and i.val Y's.
Thus, in the RHS of :=, use the i. prefix to refer to the a column in df2, i.a:
library(data.table)
df1 <- data.table(Id = letters[1:5], a = 1:5)
df2 <- data.table(Id = letters[1:3], a = 7:9, b = 7:9)
setkey(df1, Id)
setkey(df2, Id)
df1[df2, a := i.a]
# or instead of setting keys, use `on` argument:
df1[df2, on = .(Id), a := i.a]
df1
# Id a
# <char> <int>
# 1: a 7
# 2: b 8
# 3: c 9
# 4: d 4
# 5: e 5
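A sketch showing that the same i. idiom extends to updating several columns in a single join (reusing the tables from above):

```r
library(data.table)

df1 <- data.table(Id = letters[1:5], a = 1:5)
df2 <- data.table(Id = letters[1:3], a = 7:9, b = 7:9)

# Update a and add b in one join, both taken from df2 via the i. prefix:
df1[df2, on = .(Id), `:=`(a = i.a, b = i.b)]
df1
```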
I am looking for a faster data.table alternative to the relist function, specifically to the skeleton argument that lets you group values based on a given structure. I know the by option does this for you, but only by columns in the data.table. I ultimately want to take a column of values from a data.table and group those values based on an already created 'frame'. For example, the code below shows how listFrame is created; it would be passed to the skeleton argument to group values based on each sequence of numbers.
library(data.table)
sampleData <- data.table(rep(seq(0,10), each = 10))
sampleData <- sampleData[-(sample(1:length(sampleData[[1]]), 25))]
nList <- as.data.table(sampleData[, table(V1)])
listFrame <- data.table(sapply(1:nrow(nList), function(x) 1:nList$N[x]))
sampleData <- relist(sampleData[[1]], listFrame[[1]])
Furthermore, after relisting sampleData I plan to apply a function over each sublist, using lapply. If it can all be done using data.table that would be great.
lapply(1:length(sampleData), function(x) median(sampleData[[x]]))
If I understand correctly, your listFrame skeleton is a way of counting identical values in V1. As your input data is sorted on V1, I think this is the same as a run-length type group ID, which you can obtain with the function rleid.
So you could group by rleid(V1) and then apply the median:
library(data.table)
sampleData <- data.table(rep(seq(0,10), each = 10))
sampleData <- sampleData[-(sample(1:length(sampleData[[1]]), 25))]
sampleData[, .(med = median(V1)), by = .(V1, rleid(V1))][, rleid := NULL][]
#> V1 med
#> 1: 0 0
#> 2: 1 1
#> 3: 2 2
#> 4: 3 3
#> 5: 4 4
#> 6: 5 5
#> 7: 6 6
#> 8: 7 7
#> 9: 8 8
#> 10: 9 9
#> 11: 10 10
The results in column med are the same as in your example, but stored in a table rather than a list.
Edit, after clarification in the comments:
rleid is just a way of creating the ID column. Taking your specific case of the function dplR::tbrm, you only need an ID column to apply the function.
library(data.table)
library(dplR)
sampleData <- data.table(rep(seq(0,10), each = 10))
sampleData <- sampleData[-(sample(1:length(sampleData[[1]]), 25))]
Create an ID column:
sampleData[, ID := LETTERS[rleid(V1)]]
Apply your function by ID:
sampleData[, dplR::tbrm(V1), by = ID]
#> ID V1
#> 1: A 0
#> 2: B 1
#> 3: C 2
#> 4: D 3
#> 5: E 4
#> 6: F 5
#> 7: G 6
#> 8: H 7
#> 9: I 8
#> 10: J 9
#> 11: K 10
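If a genuine list (like relist's output) is still needed rather than a grouped table, here is a sketch using base split() on the same rleid grouping (a set.seed call is added so the sample is reproducible; the original question sets no seed):

```r
library(data.table)

set.seed(42)  # added for reproducibility
sampleData <- data.table(V1 = rep(0:10, each = 10))
sampleData <- sampleData[-sample(seq_len(nrow(sampleData)), 25)]

# split() on the run-length group ID yields a plain list of vectors,
# which lapply can then process just like the relist output:
grouped <- split(sampleData$V1, rleid(sampleData$V1))
lapply(grouped, median)
```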
There are many answers for how to split a dataframe, for example How to split a data frame?
However, I'd like to split a dataframe so that the smaller dataframes contain the last row of the previous dataframe and the first row of the following dataframe.
Here's an example
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
data.frame(n = n, group)
n group
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
6 6 b
7 7 c
8 8 c
9 9 c
I'd like the output to look like:
d1 <- data.frame(n = 1:4, group = c(rep("a",3),"b"))
d2 <- data.frame(n = 3:7, group = c("a",rep("b",3),"c"))
d3 <- data.frame(n = 6:9, group = c("b",rep("c",3)))
d <- list(d1, d2, d3)
d
[[1]]
n group
1 1 a
2 2 a
3 3 a
4 4 b
[[2]]
n group
1 3 a
2 4 b
3 5 b
4 6 b
5 7 c
[[3]]
n group
1 6 b
2 7 c
3 8 c
4 9 c
What is an efficient way to accomplish this task?
Suppose DF is the original data.frame, the one with columns n and group. Let n be the number of rows in DF. Now define a function extract which given a sequence of indexes ix enlarges it to include the one prior to the first and after the last and then returns those rows of DF. Now that we have defined extract, split the vector 1, ..., n by group and apply extract to each component of the split.
n <- nrow(DF)
extract <- function(ix) DF[seq(max(1, min(ix) - 1), min(n, max(ix) + 1)), ]
lapply(split(seq_len(n), DF$group), extract)
$a
n group
1 1 a
2 2 a
3 3 a
4 4 b
$b
n group
3 3 a
4 4 b
5 5 b
6 6 b
7 7 c
$c
n group
6 6 b
7 7 c
8 8 c
9 9 c
Or why not try good ol' by, which "[a]ppl[ies] a Function to a Data Frame Split by Factors [INDICES]":
by(data = df, INDICES = df$group, function(x){
id <- c(min(x$n) - 1, x$n, max(x$n) + 1)
na.omit(df[id, ])
})
# df$group: a
# n group
# 1 1 a
# 2 2 a
# 3 3 a
# 4 4 b
# --------------------------------------------------------------------------------
# df$group: b
# n group
# 3 3 a
# 4 4 b
# 5 5 b
# 6 6 b
# 7 7 c
# --------------------------------------------------------------------------------
# df$group: c
# n group
# 6 6 b
# 7 7 c
# 8 8 c
# 9 9 c
Although the print method of by creates a 'fancy' output, the (default) result is a list, with elements named by the levels of the grouping variable (just try str and names on the resulting object).
I was going to comment under @cdeterman's answer, but it's too late now.
You can generalize that approach using data.table::shift (or dplyr::lag) to find the group boundary indices and then run a simple lapply on the ranges, something like:
library(data.table) # v1.9.6+
indx <- setDT(df)[, which(group != shift(group, fill = TRUE))]
lapply(Map(`:`, c(1L, indx - 1L), c(indx, nrow(df))), function(x) df[x,])
# [[1]]
# n group
# 1: 1 a
# 2: 2 a
# 3: 3 a
# 4: 4 b
#
# [[2]]
# n group
# 1: 3 a
# 2: 4 b
# 3: 5 b
# 4: 6 b
# 5: 7 c
#
# [[3]]
# n group
# 1: 6 b
# 2: 7 c
# 3: 8 c
# 4: 9 c
This could be done with data.frame as well, but is there ever a reason not to use data.table? This approach also has the option of being executed in parallel.
library(data.table)
n <- 1:9
group <- rep(c("a","b","c"), each = 3)
df <- data.table(n = n, group)
df[, `:=` (group = factor(df$group))]
df[, `:=` (group_i = seq_len(.N), group_N = .N), by = "group"]
library(doParallel)
groups <- unique(df$group)
foreach(i = seq(groups)) %do% {
df[group == groups[i] |
     (as.integer(group) == i + 1 & group_i == 1) |
     (as.integer(group) == i - 1 & group_i == group_N),
   c("n", "group"), with = FALSE]
}
[[1]]
n group
1: 1 a
2: 2 a
3: 3 a
4: 4 b
[[2]]
n group
1: 3 a
2: 4 b
3: 5 b
4: 6 b
5: 7 c
[[3]]
n group
1: 6 b
2: 7 c
3: 8 c
4: 9 c
Here is another dplyr way:
library(dplyr)
data =
  tibble(n = n, group) %>%  # data_frame() is deprecated in favour of tibble()
  group_by(group)
firsts =
data %>%
slice(1) %>%
ungroup %>%
mutate(new_group = lag(group)) %>%
slice(-1)
lasts =
data %>%
slice(n()) %>%
ungroup %>%
mutate(new_group = lead(group)) %>%
slice(-n())
bind_rows(firsts, data, lasts) %>%
mutate(final_group =
ifelse(is.na(new_group),
group,
new_group) ) %>%
arrange(final_group, n) %>%
group_by(final_group)
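The pipeline above stops at a grouped tibble; if a list like the other answers produce is wanted, one way to finish it is group_split, a sketch assuming dplyr >= 0.8:

```r
library(dplyr)

n <- 1:9
group <- rep(c("a", "b", "c"), each = 3)

data <- tibble(n = n, group = group) %>% group_by(group)

firsts <- data %>% slice(1) %>% ungroup() %>%
  mutate(new_group = lag(group)) %>% slice(-1)
lasts  <- data %>% slice(n()) %>% ungroup() %>%
  mutate(new_group = lead(group)) %>% slice(-n())

out <- bind_rows(firsts, data, lasts) %>%
  mutate(final_group = coalesce(new_group, group)) %>%
  arrange(final_group, n) %>%
  group_split(final_group)  # one tibble per overlapping chunk
```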
I need to re-format a table in R.
I have a table like this.
ID category
1 a
1 b
2 c
3 d
4 a
4 c
5 a
And I want to reshape it as:
ID category1 category2
1 a b
2 c null
3 d null
4 a c
5 a null
Is this doable in R?
This is a very straightforward "long to wide" type of reshaping problem, but you need a secondary "id" (or "time") variable.
You can try using getanID from my "splitstackshape" package and use dcast to reshape from long to wide. getanID will create a new column called ".id" that would be used as your "time" variable:
library(splitstackshape)
dcast.data.table(getanID(mydf, "ID"), ID ~ .id, value.var = "category")
# ID 1 2
# 1: 1 a b
# 2: 2 c NA
# 3: 3 d NA
# 4: 4 a c
# 5: 5 a NA
Same as Ananda's, but using dplyr and tidyr:
library(tidyr)
library(dplyr)
mydf %>% group_by(ID) %>%
mutate(cat_row = paste0("category", 1:n())) %>%
spread(key = cat_row, value = category)
# Source: local data frame [5 x 3]
#
# ID category1 category2
# 1 1 a b
# 2 2 c NA
# 3 3 d NA
# 4 4 a c
# 5 5 a NA
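On current data.table versions, the intermediate getanID step can also be done with rowid, which numbers duplicates within each ID; a sketch with mydf written out explicitly:

```r
library(data.table)

mydf <- data.table(ID = c(1, 1, 2, 3, 4, 4, 5),
                   category = c("a", "b", "c", "d", "a", "c", "a"))

# rowid(ID) is a within-ID counter (1, 2, 1, 1, 1, 2, 1 here), which serves
# as the "time" variable for dcast; prefix controls the output column names:
dcast(mydf, ID ~ rowid(ID, prefix = "category"), value.var = "category")
```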