I'm writing a function that needs to work with different datasets. The columns that have to be passed inside the function look somewhat like the following data frames:
df1 <- data.frame(x1 = c("d","e","f","g"), x2 = c("Aug 2017","Sep 2017","Oct 2017","Nov 2017"), x3 = c(456,678,876,987))
df2 <- data.frame(x1 = c("a","b","c","d"), x2 = c("Aug 2017","Sep 2017","Oct 2017","Nov 2017"), x3 = c(123,324,345,564))
From these I need to find out whether any of the values in df1$x1 are present in df2$x1. If so, print the entire row of df1 whose x1 value is also present in df2$x1.
I need to use the data frames inside the function but I can't specify the column names explicitly. So I need to find a way to access the columns without exactly using the column name.
The desired output:
x1 x2 x3 x4
d Aug 2017 456 common
My problem is, I can't use any kind of function where I need to specify the column names explicitly. For example, inner join cannot be performed since I have to specify
by = 'col_name'
You can use match with column indices:
df1[na.omit(match(df2[, 1], df1[, 1])), ]
# x1 x2 x3
#1 d Aug 2017 456
Here are three simple examples of functions that you might use to return the rows you want, where the function itself does not need to hardcode the name of the column, and does not require the indices to match:
Pass the frames directly, and the column names as strings:
f <- function(d1, d2, col1, col2) d1[which(d1[,col1] %in% d2[,col2]),]
Usage and Output
f(df1,df2, "x1", "x1")
x1 x2 x3
1 d Aug 2017 456
Pass only the values of the columns as vectors:
f <- function(x,y) which(x %in% y)
Usage and Output
df1 %>% filter(row_number() %in% f(x1, df2$x1))
x1 x2 x3
1 d Aug 2017 456
Pass the frames and the unquoted columns, using the {{}} operator
f <- function(d1, d2, col1, col2) {
filter(d1, {{col1}} %in% pull(d2, {{col2}}))
}
Usage and Output:
f(df1,df2,x1,x1)
x1 x2 x3
1 d Aug 2017 456
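If the extra x4 column from the desired output is also needed, one possible sketch is to pick the columns by position and append the flag afterwards. The helper name flag_common and its defaults (the first column of each frame) are assumptions made for illustration; they do not come from the answers above.

```r
df1 <- data.frame(x1 = c("d", "e", "f", "g"),
                  x2 = c("Aug 2017", "Sep 2017", "Oct 2017", "Nov 2017"),
                  x3 = c(456, 678, 876, 987))
df2 <- data.frame(x1 = c("a", "b", "c", "d"),
                  x2 = c("Aug 2017", "Sep 2017", "Oct 2017", "Nov 2017"),
                  x3 = c(123, 324, 345, 564))

# Hypothetical helper: col1/col2 are column positions (names would work too).
flag_common <- function(d1, d2, col1 = 1, col2 = 1) {
  # as.character() keeps the comparison safe if the columns are factors
  hit <- as.character(d1[[col1]]) %in% as.character(d2[[col2]])
  out <- d1[hit, , drop = FALSE]       # keep only the rows of d1 with a match
  out$x4 <- rep("common", nrow(out))   # flag column, as in the desired output
  out
}

res <- flag_common(df1, df2)
print(res)
```

Because the columns are addressed with [[ ]], the same function accepts either positions or names stored in a variable, so nothing is hardcoded inside it.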
I have a vector with 2 elements:
v1 <- c('X1','X2')
I want to create possible combinations of these elements.
The resultant data frame would look like:
structure(list(ID = c(1, 2, 3, 4), c1 = c("X1", "X2", "X1", "X2"
), c2 = c("X1", "X1", "X2", "X2")), class = "data.frame", row.names = c(NA,
-4L))
Here, rows with ID=2 and ID=3 have same elements (however arranged in different order). I would like to consider these 2 rows as duplicate. Hence, the final output will have 3 rows only.
i.e. 3 combinations
X1, X1
X1, X2
X2, X2
In my actual dataset, I have 16 such elements in the vector V1.
I have tried using expand.grid approach for obtaining possible combinations but this actually exceeds the machine limit. (number of combinations with 16 elements will be too large). This is potentially due to duplications described above.
Can someone help here to get all possible combinations without any duplications ?
I am actually looking for a solution that uses data table functionality. I believe this can be really faster
Thanks in advance.
Here is a data.table solution using your sample data:
First, create your combinations. Using unique = TRUE cuts back on the number of combinations.
library(data.table)
data <- setDT(CJ(df$c1, df$c2, unique = TRUE))
Then, filter out duplicates:
data[!duplicated(t(apply(data, 1, sort))),]
This gives us:
V1 V2
1 X1 X1
2 X2 X1
10 X2 X2
I would look into the ?expand.grid function for this type of task.
expand.grid(v1, v1)
Var1 Var2
1 X1 X1
2 X2 X1
3 X1 X2
4 X2 X2
dat4 is the final output.
v1 <- c('X1','X2')
library(data.table)
dat <- expand.grid(v1, v1, stringsAsFactors = FALSE)
setDT(dat)
# A function to combine and sort string from two columns
f <- function(x, y){
z <- paste(sort(c(x, y)), collapse = "-")
return(z)
}
# Apply the f function to each row
dat2 <- dat[, comb := f(Var1, Var2), by = 1:nrow(dat)]
# Remove the duplicated elements in the comb column
dat3 <- unique(dat2, by = "comb")
# Select the columns
dat4 <- dat3[, c("Var1", "Var2")]
print(dat4)
# Var1 Var2
# 1: X1 X1
# 2: X2 X1
# 3: X2 X2
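If building the full grid first is too expensive, the unordered pairs can also be generated directly from index pairs i <= j, so duplicates never exist in the first place. This is only a base R sketch under the assumption that the elements of v1 are unique; the variable names are illustrative.

```r
v1 <- c("X1", "X2")
n <- length(v1)

# First index: 1 repeated n times, 2 repeated n - 1 times, and so on
i <- rep(seq_len(n), times = rev(seq_len(n)))
# Second index: for each first index k, every position from k to n
j <- unlist(lapply(seq_len(n), function(k) k:n))

pairs <- data.frame(c1 = v1[i], c2 = v1[j], stringsAsFactors = FALSE)
print(pairs)
```

The result has n * (n + 1) / 2 rows, which is 3 for this example and 136 for a vector of 16 elements.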
You may want to check RcppAlgos::comboGeneral, which does exactly what you want and is known to be fast and memory efficient. Just do something like this:
vars <- paste0("X", 1:2)
RcppAlgos::comboGeneral(vars, length(vars), repetition = TRUE)
Output
[,1] [,2]
[1,] "X1" "X1"
[2,] "X1" "X2"
[3,] "X2" "X2"
On my laptop with 16 GB of RAM, I can run this function with up to 14 variables, and it takes less than 5 seconds to finish, so speed is less of a concern. However, note that you need at least 17.9 GB of RAM to get all 16-variable combinations.
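The size of the result can be estimated before generating anything. The number of combinations with repetition of m items drawn from n elements is choose(n + m - 1, m). The snippet below is a back-of-the-envelope sketch of that formula; the helper name n_multicombos is made up for illustration.

```r
# Hypothetical helper: number of combinations with repetition
n_multicombos <- function(n, m) choose(n + m - 1, m)

small  <- n_multicombos(2, 2)    # the X1/X2 example: 3
pairs  <- n_multicombos(16, 2)   # pairs drawn from 16 elements: 136
full   <- n_multicombos(16, 16)  # 16 elements taken 16 at a time: 300540195
print(c(small, pairs, full))
```

So if only pairs are needed, passing m = 2 instead of length(vars) as the second argument of comboGeneral should keep the result small even with 16 elements.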
We can use crossing from tidyr
library(tidyr)
crossing(v1, v1)
I am working with a dataset of more than 3 million observations. This data set includes more than 770,000 unique IDs that are of interest to me. The data includes descriptive information about these IDs. The challenge is that these unique IDs contain non-unique duplicates, which means I need to find a way to consolidate the data.
After much thinking, I decided to take the mode of each column for each ID in the data set. The output gives me most common value for each column for each id. By taking the most common value, I am able to consolidate the non-unique duplicates into one row per each id.
The problem: To do so, I have to iterate over 770,000 unique IDs in a for loop. I want to use code that is as efficient as possible, because the for loop I have been using takes days to complete.
Given the code I have provided, is there a way to optimize the code, use parallel processing, or a different way to complete the task more efficiently?
Reproducible code:
ID <- c(1,2,2,3,3,3)
x1 <- c("A", "B", "B","C", "C", "C")
x2 <- c("alpha", "bravo", "bravo", "charlie", "charlie2", "charlie2")
x3 <- c("apple", "banana", "banana", "plum1", "plum1", "plum")
df <- data.frame(ID, x1, x2, x3)
#Mode Function
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
library(reshape2)
#Takes the mode for every column
mode_row <- function(dat){
x <- setNames(as.data.frame(apply(dat, 2, getmode)), c("value"))
x$variable <- rownames(x); rownames(x) <- NULL
mode_row <- reshape2::dcast(x, . ~ variable, value.var = "value")
mode_row$. <- NULL
return(mode_row)
}
#Take the mode of each row to account for duplicate donors
df2 <- NULL
for(i in unique(df$ID)){
df2 <- rbind(df2, mode_row(subset(df, ID == i)))
#message(i)
}
df2
Expected Output:
ID x1 x2 x3
1 1 A alpha apple
2 2 B bravo banana
3 3 C charlie2 plum1
There are grouped functions available in base R, dplyr and data.table :
Base R :
aggregate(.~ID, df, getmode)
# ID x1 x2 x3
#1 1 A alpha apple
#2 2 B bravo banana
#3 3 C charlie2 plum1
dplyr :
library(dplyr)
df %>% group_by(ID) %>% summarise(across(x1:x3, getmode))
#Use summarise_at in older version of dplyr
#df %>% group_by(ID) %>% summarise_at(vars(x1:x3), getmode)
data.table :
library(data.table)
setDT(df)[, lapply(.SD, getmode), ID, .SDcols = x1:x3]
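One caveat with the base R version: the formula interface of aggregate drops rows containing NA by default (na.action = na.omit). If the real data has missing values, the non-formula interface may be the safer choice. A minimal sketch on the sample data, assuming ID is the only grouping column:

```r
ID <- c(1, 2, 2, 3, 3, 3)
x1 <- c("A", "B", "B", "C", "C", "C")
x2 <- c("alpha", "bravo", "bravo", "charlie", "charlie2", "charlie2")
x3 <- c("apple", "banana", "banana", "plum1", "plum1", "plum")
df <- data.frame(ID, x1, x2, x3, stringsAsFactors = FALSE)

# Mode function from the question
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Non-formula interface: every column except ID is summarised per ID
res <- aggregate(df[-1], by = df["ID"], FUN = getmode)
print(res)
```

Note that getmode itself still treats NA as an ordinary value when counting, so whether NA should be able to win the vote is a separate decision.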
Given is a data.table representing the relations between 6 objects:
# create sample data.table
library(data.table)
x1 <- c(1,1,1,2,2,2,3,3,3,4,5,6)
x2 <- c(1,2,3,1,2,3,1,2,3,4,6,5)
dt <- data.table(x1, x2)
The first column (x1) represents the objects.
The second column (x2) represents the connections with other objects.
# check combinations
dt[dt$x1 != dt$x2]
Object 4 has no connections with other objects.
Objects 1, 2 and 3 are connected, as well as objects 5 and 6.
Now, a new column should be created where all connected objects get the same number (ID)
The resulting data.table should look like:
x3 <- c(1,1,1,1,1,1,1,1,1,2,3,3)
dt.res <- data.table(dt, x3)
How can this be achieved?
library(data.table)
x1 <- c(1,1,1,2,2,2,3,3,3,4,5,6)
x2 <- c(1,2,3,1,2,3,1,2,3,4,6,5)
dt <- data.frame(x1, x2)
dt$x3=dt$x1
dt
for(i in 1:nrow(dt)){
if(dt$x3[i]!=dt$x2[i]){
dt$x3[dt$x3==dt$x2[i]]=dt$x3[i]
}
}
setDT(dt)[, id := .GRP, by=x3]
dt
Create x3 as a copy of x1.
Iterate through the rows and check whether x3 differs from x2.
If it does, replace every element of x3 that equals the current value of x2 with the current value of x3.
Finally, convert to a data.table with setDT and number the groups of x3 with .GRP, which stores the result in a new column id.
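The loop above relabels only values that already occur in x3, so it relies on every object appearing in x1 at least once (true for the sample data). For edge lists where that is not guaranteed, a small union-find is one alternative. The following is a base R sketch; component_id is a hypothetical helper name, not something from the original posts.

```r
# Label connected components of an edge list (from -> to) with union-find.
component_id <- function(from, to) {
  nodes  <- sort(unique(c(from, to)))
  parent <- seq_along(nodes)            # every node starts as its own root
  find <- function(i) {                 # follow parent links up to the root
    while (parent[i] != i) i <- parent[i]
    i
  }
  a <- match(from, nodes)
  b <- match(to, nodes)
  for (k in seq_along(a)) {
    ra <- find(a[k])
    rb <- find(b[k])
    # merge the two components, keeping the smaller index as the root
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
  }
  roots <- vapply(seq_along(nodes), find, integer(1))
  # number the components in order of first appearance in the rows
  match(roots[a], unique(roots[a]))
}

x1 <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 6)
x2 <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 6, 5)
x3 <- component_id(x1, x2)
print(x3)
```

With a data.table, the result could be attached as a new column, for example with dt[, x3 := component_id(x1, x2)].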
I have two data frames with the same headers, similar to this:
Jul X1 X2 X3 X4 X5
The sizes of the two data frames (A and B) are:
nrowA=2191, ncolA=51.
nrowB=366, ncolB=51.
Both data frames have exactly the same columns. The first data frame holds daily temperature data for 4 years, while the second is a "reference". I want to compute (A-B) for the rows where the first column (Jul) of the two data frames matches. Could you please suggest a method to do that while AVOIDING loops? Cheers
If you know SQL there is a library that allows you to compute SQL queries:
D1 <- data.frame(a = 1:5, b=letters[1:5])
D2 <- data.frame(a = 1:3, b=letters[1:3])
require(sqldf)
a1NotIna2 <- sqldf('SELECT * FROM D1 WHERE (a NOT IN (SELECT a FROM D2))')
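For the subtraction itself, a loop-free option in base R is to look up, for each row of the daily data, the reference row with the same Jul value via match and then subtract the aligned value columns. The tiny frames A and B below are made-up stand-ins for the real data, and the sketch assumes Jul is unique in the reference frame.

```r
# Made-up stand-ins: A is the daily data, B the reference (one row per Jul)
A <- data.frame(Jul = c(1, 2, 3, 1, 2, 3),
                X1  = c(10, 11, 12, 13, 14, 15),
                X2  = c(5, 5, 5, 6, 6, 6))
B <- data.frame(Jul = c(1, 2, 3),
                X1  = c(1, 2, 3),
                X2  = c(1, 1, 1))

idx <- match(A$Jul, B$Jul)              # row of B that matches each row of A
value_cols <- setdiff(names(A), "Jul")  # every column except the key

res <- A
res[value_cols] <- A[value_cols] - B[idx, value_cols, drop = FALSE]
print(res)
```

Rows of A whose Jul value has no counterpart in B get NA in the value columns, because match returns NA for them.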
I have two data frames:
df1
x1 x2
1 a
2 b
3 c
4 d
and
df2
x1 x2
2 zz
3 qq
I want to replace some of the values in df1$x2 with the values in df2$x2, based on a match between df1$x1 and df2$x1, to produce:
df1
x1 x2
1 a
2 zz
3 qq
4 d
Use match(), assuming the values in df1$x1 are unique.
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3,x2=c("zz","qq"),stringsAsFactors=FALSE)
df1$x2[match(df2$x1,df1$x1)] <- df2$x2
> df1
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
If the values aren't unique, use :
for(id in 1:nrow(df2)){
df1$x2[df1$x1 %in% df2$x1[id]] <- df2$x2[id]
}
We can use {powerjoin}, and handle the conflicting columns with coalesce_yx
library(powerjoin)
df1 <- data.frame(x1 = 1:4, x2 = letters[1:4], stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)
power_left_join(df1, df2, by = "x1", conflict = coalesce_yx)
#> x1 x2
#> 1 1 a
#> 2 2 zz
#> 3 3 qq
#> 4 4 d
The first part of Joris' answer is good, but in the case of non-unique values in df1, the row-wise for-loop will not scale well on large data.frames.
You could use a data.table "update join" to modify in place, which will be quite fast:
library(data.table)
setDT(df1); setDT(df2)
df1[df2, on = .(x1), x2 := i.x2]
Or, assuming you don't care about maintaining row order, you could use SQL-inspired dplyr:
library(dplyr)
union_all(
inner_join( df1["x1"], df2 ), # x1 from df1 with matches in df2, x2 from df2
anti_join( df1, df2["x1"] ) # rows of df1 with no match in df2
) # %>% arrange(x1) # optional, won't maintain an arbitrary row order
Either of these will scale much better than the row-wise for-loop.
I see that Joris and Aaron have both chosen to build examples without factors. I can certainly understand that choice. For the reader with columns that are already factors, there would also be the option of coercion to "character". There is a strategy that avoids that constraint and that also allows for the possibility that there may be indices in df2 that are not in df1, which I believe would invalidate Joris Meys' solution but not Aaron's, among those posted so far:
df1 <- data.frame(x1=1:4, x2=letters[1:4], stringsAsFactors=TRUE)
df2 <- data.frame(x1=c(2,3,5), x2=c("zz", "qq", "xx"), stringsAsFactors=TRUE)
It requires that the levels be expanded to include the union of the levels of both factor variables, and that the non-matching entries (= NA values) in match(df2$x1, df1$x1) be dropped:
df1$x2 <- factor(df1$x2 , levels=c(levels(df1$x2), levels(df2$x2)) )
df1$x2[na.omit(match(df2$x1,df1$x1))] <- df2$x2[which(df2$x1 %in% df1$x1)]
df1
#-----------
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
(Note that since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, unlike for most of the history of R, so factor columns have to be requested explicitly.)
You can do it by matching the other way too but it's more complicated. Joris's solution is better but I'm putting this here also as a reminder to think about which way you want to match.
df1 <- data.frame(x1=1:4, x2=letters[1:4], stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3, x2=c("zz", "qq"), stringsAsFactors=FALSE)
swap <- df2$x2[match(df1$x1, df2$x1)]
ok <- !is.na(swap)
df1$x2[ok] <- swap[ok]
> df1
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
It can be done with dplyr.
library(dplyr)
full_join(df1,df2,by = c("x1" = "x1")) %>%
transmute(x1 = x1,x2 = coalesce(x2.y,x2.x))
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
I'm new here, but the following dplyr approach seems to work as well.
It is similar to, but slightly different from, one of the answers above:
df3 <- anti_join(df1, df2, by = "x1")
df3 <- rbind(df3, df2)
df3
As of dplyr 1.0.0 there is a function specifically for this:
library(dplyr)
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3,x2=c("zz","qq"),stringsAsFactors=FALSE)
rows_update(df1, df2, by = "x1")
See https://stackoverflow.com/a/65254214/2738526