I ran into an unexpected (to me) issue when subsetting columns in a data.table. I was trying to use unique on three columns of a data.table. The original data.table contained several (>5) columns, of which 2 were set as keys. One of the three columns on which I used unique was a key column. The surprising part was that the results differed depending on whether I selected the columns by their character names or with .() notation. Here is a small example that replicates the issue (NOTE: selecting 2 columns rather than 3 for simplicity).
library(data.table)
dt1 <- data.table(c1 = rep(c("a", "a", "b"), each = 3),
c2 = rep(1:3, each = 3),
c3 = rep(1:3, 3))
setkey(dt1, c1, c3)
unique(dt1[, c("c1", "c2"), with = FALSE])
c1 c2
1: a 1
2: b 3
unique(dt1[, .(c1, c2)])
c1 c2
1: a 1
2: a 2
3: b 3
It seems that column selection using c("c1", "c2") notation retains the key on c1, but column selection using .(c1, c2) does not. Is there a way to control whether or not the key is retained while subsetting columns?
A little more context to my problem. I was trying to execute this code within a function which took the data and the column names to be selected as arguments. It was easier to pass character column names.
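One way to make the result independent of whether the key survives the subset is to tell unique explicitly which columns to compare through its by argument. The sketch below wraps that in a small helper; the name unique_cols is hypothetical and not part of data.table, and it assumes the column names arrive as a character vector.

```r
library(data.table)

dt1 <- data.table(c1 = rep(c("a", "a", "b"), each = 3),
                  c2 = rep(1:3, each = 3),
                  c3 = rep(1:3, 3))
setkey(dt1, c1, c3)

# Hypothetical helper: select columns by name, then deduplicate on exactly
# those columns, regardless of any key carried over from the input.
unique_cols <- function(dt, cols) {
  unique(dt[, cols, with = FALSE], by = cols)
}

res <- unique_cols(dt1, c("c1", "c2"))
print(res)
```

Passing by explicitly means the outcome no longer depends on the key or on the default that a particular data.table version uses for unique.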
I am trying to merge two dataframes in R, joining them by the one column that they share.
Here are screenshots of the two dataframes, and I am merging on the column "INC_KEY".
This is the code I have written to merge the two dataframes:
dp <- inner_join(d,p,by="INC_KEY")
d has 177156 observations, and p has 1641137 observations, but the final merged dataframe has 8416113 observations, which does not make sense to me. I have also tried changing the inner_join function above to the merge function, but I still get the same result. I am wondering how to fix this code so that the merged dataframe has a realistic number of observations - thanks so much for any help!
You most probably have duplicates in either d or p or both of them. Try keeping only one row for each unique INC_KEY value before joining.
library(dplyr)
dp <- inner_join(d %>% distinct(INC_KEY, .keep_all = TRUE),
p %>% distinct(INC_KEY, .keep_all = TRUE),by="INC_KEY")
This can happen if your INC_KEY is not a unique identifier. Here is a simplified example:
library(dplyr)
df1 <- data.frame(key = c("A", "B", "C", "A"),
val1 = 1:4)
df2 <- data.frame(key = c("A", "B", "C", "C", "B"),
val2 = 1:5)
inner_join(df1, df2, by = "key")
Joining, by = "key"
key val1 val2
1 A 1 1
2 B 2 2
3 B 2 5
4 C 3 3
5 C 3 4
6 A 4 1
Because there are two values of "A" in the key column in df1, both rows match the one row of df2 with "A". The one row in df1 with a key of "C" matches both rows with the key of "C" in df2. This is the expected behavior of an inner join with duplicated key values. The join returns all rows in the second data.frame that match each row in the first data.frame. If there are multiple matches, they are all returned.
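The row count of such a join can be predicted before running it: for every key present in both tables, multiply the number of rows carrying that key in each table, then add the products up. A minimal base R sketch on the same toy data (the variable names are only illustrative):

```r
df1 <- data.frame(key = c("A", "B", "C", "A"), val1 = 1:4)
df2 <- data.frame(key = c("A", "B", "C", "C", "B"), val2 = 1:5)

# Count how often each key occurs in each table
n1 <- table(df1$key)
n2 <- table(df2$key)

# Only keys present in both tables contribute rows to an inner join
common <- intersect(names(n1), names(n2))
expected_rows <- sum(as.integer(n1[common]) * as.integer(n2[common]))

# Compare with the actual inner join (merge defaults to an inner join)
actual_rows <- nrow(merge(df1, df2, by = "key"))
print(c(expected = expected_rows, actual = actual_rows))
```

Running the same count on d and p should show which INC_KEY values account for the inflated result.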
If you want one row per INC_KEY, then you need to do something to your original data before the join, especially if the rows are not complete duplicates.
The key column INC_KEY has duplicates in at least one of your tables. inner_join will then output a table with additional rows depending on the number of duplicates found, minus the rows whose INC_KEY is missing from either d or p.
If you expect your new table to have the same number of rows as table d, then you need to aggregate the information in table p first; grouped by INC_KEY. Then you can perform inner_join.
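A minimal sketch of that aggregate-then-join approach, using made-up data: only INC_KEY comes from the question, while the age and cost columns and the choice of sum as the summary are assumptions for illustration.

```r
library(dplyr)

# Toy stand-ins for d and p; the non-key columns are hypothetical
d <- data.frame(INC_KEY = c(1, 2, 3), age = c(30, 41, 25))
p <- data.frame(INC_KEY = c(1, 1, 2, 2, 2), cost = c(10, 20, 5, 5, 5))

# Collapse p to one row per INC_KEY before joining
p_agg <- p %>%
  group_by(INC_KEY) %>%
  summarise(n_rows = n(), total_cost = sum(cost))

# Each row of d can now match at most one row of p_agg
dp <- inner_join(d, p_agg, by = "INC_KEY")
print(dp)
```

Which summary is appropriate (sum, first value, a count, and so on) depends on what the repeated rows in p represent.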
Is there a nice way to make a sub-group within a grouping column in data.table operations?
The result I would like is the output from this:
dt <- data.table(
group = c("a","a","a","b","b","b","c","c"),
value = c(1,2,3,4,5,6,7,8)
)
dt[group!="a", group:="Other"][, sum(value), by=.(group)][]
which gives
group V1
a 6
Other 30
However, this alters the original data.table. I don't know if there is a different way to do this that wouldn't involve merging two data.table. I can imagine a more complicated use case where I want group %in% c("a","b") as one sub-group and group %in% c("c","d") another, etc.
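One option that leaves the original table untouched is to compute the grouping variable inside by, so no column is assigned by reference. The sketch below shows this for the two-level case, plus a lookup-vector variant for several sub-groups; the lookup vector grp_map and the labels "ab" and "cd" are made up for illustration.

```r
library(data.table)

dt <- data.table(
  group = c("a", "a", "a", "b", "b", "b", "c", "c"),
  value = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Group on an expression instead of overwriting the column with :=
res1 <- dt[, .(total = sum(value)),
           by = .(group = ifelse(group == "a", "a", "Other"))]
print(res1)

# Several sub-groups: map each original group to a sub-group label
grp_map <- c(a = "ab", b = "ab", c = "cd", d = "cd")
res2 <- dt[, .(total = sum(value)), by = .(subgroup = unname(grp_map[group]))]
print(res2)

# The original group column is unchanged
print(unique(dt$group))
```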
I think this is like a SQL right excluding join (using the terminology here)
You can go through by group and within each group perform an anti-join
#group no longer found in .SD, hence make a copy of the column
dt[, g:=group]
#go through each group, anti-join with other groups, aggregate value
dt[, .(
sumGrpVal=sum(value),
sumNonGrpVal=dt[!.SD, sum(value), on=c("group"="g")]
), by=.(group)]
or an even faster way:
dt[, .(
sumGrpVal=sum(value),
sumNonGrpVal=dt[group!=.BY$group, sum(value)]
), by=.(group)]
output:
group sumGrpVal sumNonGrpVal
1: a 6 30
2: b 15 21
3: c 15 21
I have a 9801 by 3 reference table.
The first 2 columns of this table are defined as follows.
x1 = x2 = seq(0.01,0.99,0.01)
x12 = data.matrix(expand.grid(x1,x2))
The 3rd column contains the outcome values.
Now I have another n by 3 matrix where the 1st and 2nd columns are selected rows of the above matrix 'x12' and the 3rd column is to be filled. I would like to fill in the 3rd column of the 2nd table by looking up the same combination of the 1st and 2nd columns in the 1st table and taking the value from its 3rd column.
How can I do this?
You can do this with the merge function:
# Original data frame
x1 = x2 = seq(0.01,0.99,0.01)
x12 = expand.grid(x1,x2)
# Add a fake "outcome"
x12$outcome = rnorm(nrow(x12))
# New data frame with 100 random rows and the first two columns of x12
x12new = x12[sample(1:nrow(x12), 100), c(1,2)]
# Merge the outcome values from x12 into x12new
x12new = merge(x12new, x12, by=c("Var1","Var2"), all.x=TRUE)
by tells merge which columns must match when comparing the two data frames. all.x=TRUE tells merge to keep all rows from the first data frame, x12new in this case, even if they don't have a match in the second data frame (not an issue here, but you'll often want to make sure you don't lose any rows when merging).
One other thing to note is that, unlike vlookup in Excel, merge will increase the number of rows in the new, merged data frame if there are multiple rows that match the criteria. For example, see what happens when you merge values from df2 into df1:
df1 = data.frame(x = c(1,2,3,4), z=c(10,20,30,40))
df2 = data.frame(x = c(1,1,1,2,3), y=c("a","b","c","a","c"))
merge(df1, df2, by="x", all.x=TRUE)
x z y
1 1 10 a
2 1 10 b
3 1 10 c
4 2 20 a
5 3 30 c
6 4 40 <NA>
You can also use left_join from the dplyr package (other types of joins are available as well):
library(dplyr)
left_join(df1, df2, by="x")
I have several data frames that I need to merge into the one data frame to rule them all. The master data frame will end up with thousands of columns. All of the data frames have an ID column to join on. One problem is that hundreds of columns are duplicated across data frames. Another problem is that a handful of those columns contain inconsistent values. I would like to find a way to
Combine all data frames, keeping only 1 "master column" of data if there are duplicate column names and the values do not conflict between data frames
Keep both columns of data if they share the same name but have conflicting values.
Are there any packages that can help automate this? Or am I going to be stuck writing a lot of code/manually checking data?
I wrote the package safejoin, which solves this very succinctly:
#devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
See the following data frames, A is identical in both, B is different in df1 and df2,
C and D are in only one data frame
df1 <- data.frame(id = 1:2, A = 3:4, B= 5:6, C = 7:8)
df2 <- data.frame(id = 1:2, A = 3:4, B= 9:10, D = 11:12)
library(tidyverse)
safe_full_join(df1, df2, by = "id", conflict = ~ if(identical(.x, .y)) .x else
map2( .x, .y,~tibble(df1=.x,df2=.y))) %>%
unnest(.sep="_")
# id A C D B_df1 B_df2
# 1 1 3 7 11 5 9
# 2 2 4 8 12 6 10
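For comparison, the same idea can be sketched in base R without any extra package. This is not how safejoin works internally, only an illustration of the rule "keep one column when values agree, keep both when they conflict"; it assumes both data frames contain the same ids, since rows missing from one side would introduce NAs and be treated as conflicts.

```r
df1 <- data.frame(id = 1:2, A = 3:4, B = 5:6, C = 7:8)
df2 <- data.frame(id = 1:2, A = 3:4, B = 9:10, D = 11:12)

# Full join; columns present in both inputs get a suffix
merged <- merge(df1, df2, by = "id", all = TRUE, suffixes = c("_df1", "_df2"))

# Columns shared by both inputs, excluding the join key
shared <- setdiff(intersect(names(df1), names(df2)), "id")

for (col in shared) {
  x <- merged[[paste0(col, "_df1")]]
  y <- merged[[paste0(col, "_df2")]]
  if (identical(x, y)) {
    # No conflict: keep a single column under the original name
    merged[[col]] <- x
    merged[[paste0(col, "_df1")]] <- NULL
    merged[[paste0(col, "_df2")]] <- NULL
  }
}
print(merged)
```

With more than two data frames, the same loop could be applied pairwise, for example through Reduce, although the suffixes would then need more care.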
I recently started to use the data.table package to identify values in a table's column that conform to some conditions. Although I managed to get most things done, I'm now stuck with this problem:
I have a data table, table1, in which the first column (labels) is a group ID, and the second column, o.cell, is an integer. The key is on "labels"
I have another data table, table2, containing a single column: "cell".
Now, I'm trying to find, for each group in table1, the values from the column "o.cell" that are in the "cell" column of table2. table1 has some 400K rows divided into 800+ groups of unequal sizes. table2 has about 1.3M rows of unique cell numbers. Cell numbers in column "o.cell" of table1 can be found in more than one group.
This seems like a simple task but I can't find the right way to do it. Depending on the way I structure my call, it either gives me a different result than what I expect or it never completes and I have to end the R task because it's frozen (my machine has 24 GB RAM).
Here's an example of one of the "variant" of the calls I have tried:
overlap <- table1[, list(over.cell =
o.cell[!is.na(o.cell) & o.cell %in% table2$cell]),
by = labels]
I'm pretty sure this is the wrong way to use data tables for this task, and on top of that I can't get the result I want.
I will greatly appreciate any help. Thanks.
Sounds like this is your set up:
dt1 = data.table(labels = c('a','b'), o.cell = 1:10)
dt2 = data.table(cell = 4:7)
And you want to do a simple merge:
setkey(dt1, o.cell)
dt1[dt2]
# o.cell labels
#1: 4 b
#2: 5 a
#3: 6 b
#4: 7 a
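The same lookup can also be written without changing the key of the first table, by naming the join columns in on. The sketch below reuses the toy tables above and then lists the matching cells per label; the output column name over.cell is taken from the question, and nomatch = 0L is used to drop cells of dt2 that do not occur in dt1.

```r
library(data.table)

dt1 <- data.table(labels = c("a", "b"), o.cell = 1:10)
dt2 <- data.table(cell = 4:7)

# Inner join on o.cell == cell; the key of dt1 is left as it is
matched <- dt1[dt2, on = .(o.cell = cell), nomatch = 0L]

# Matching cells listed per group
overlap <- matched[, .(over.cell = o.cell), by = labels]
print(overlap)
```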