merge data.frame but keep only unique columns?

Let's say I want to merge two data.frames but some of the columns are redundant (the same). How would I merge those data.frames but drop the redundant columns?
X1 = data.frame(id = c("a","b","c"), same = c(1,2,3), different1 = c(4,5,6))
X2 = data.frame(id = c("b","c","a"), same = c(2,3,1), different2 = c(7,8,9))
merge(X1,X2, by="id", all = TRUE, sort = FALSE)
id same.x different1 same.y different2
1 a 1 4 1 9
2 b 2 5 2 7
3 c 3 6 3 8
But how would I get just the different1 and different2 columns?
id same different1 different2
1 a 1 4 9
2 b 2 5 7
3 c 3 6 8

You could include the column same in your by argument. The default is by=intersect(names(x), names(y)). Try merge(X1, X2) (it is the same as merge(X1, X2, by=c("id", "same"))):
merge(X1, X2)
# id same different1 different2
#1 a 1 4 9
#2 b 2 5 7
#3 c 3 6 8
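If you'd rather keep by = "id", you can also just drop the duplicated column after the merge; a minimal base R sketch along those lines:
m <- merge(X1, X2, by = "id", all = TRUE, sort = FALSE)
m$same.y <- NULL                          # drop the redundant copy
names(m)[names(m) == "same.x"] <- "same"  # restore the original name
m
#  id same different1 different2
#1  a    1          4          9
#2  b    2          5          7
#3  c    3          6          8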

Just subset via indexing inside the merge statement. There are many ways to subset, e.g. by name or position. There is even a subset() function, but the [] notation works well for almost all cases:
merge(X1[,c("id","same","different1")], X2[,c("id","different2")], by="id", all = TRUE, sort = FALSE)
As shown in the other answer, you could put same into the by argument, but this becomes an issue once you leave one-to-one merges and enter one-to-many or many-to-many merges.
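To see why it is fragile, here is a small sketch with a hypothetical X2b in which the supposedly redundant values disagree for one id; merging on c("id", "same") then silently drops that row, while merging on "id" alone keeps it and exposes the conflict as same.x / same.y:
X2b <- X2
X2b$same[X2b$id == "b"] <- 99            # hypothetical: values now disagree for id "b"
merge(X1, X2b, by = c("id", "same"))     # row "b" is silently dropped
#  id same different1 different2
#1  a    1          4          9
#2  c    3          6          8
merge(X1, X2b, by = "id")                # keeps all ids, with same.x and same.y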

Related

Create functions to select every second value

How do I select every second value in a column of a data frame in R, starting from the second value?
I tried something like this:
df.new = df[seq(1, nrow(df), 2), ]
You can subset the data.frame with the logical vector c(FALSE, TRUE), which is recycled along the rows, giving every second row starting with the second.
x[c(FALSE, TRUE),]
# a b
#2 2 9
#4 4 7
#6 6 5
#8 8 3
#10 10 1
And for a specific column:
x$a[c(FALSE, TRUE)]
#[1] 2 4 6 8 10
Data
x <- data.frame(a = 1:10, b=10:1)
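For what it's worth, the seq attempt in the question was close: starting the sequence at 2 instead of 1 selects the even-numbered rows as well, and is equivalent here:
x[seq(2, nrow(x), by = 2), ]   # same rows as x[c(FALSE, TRUE), ]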

R multiple regular expressions, dataframe column names

I have a dataframe data with a lot of columns in the form of
...v1...min ...v1...max ...v2...min ...v2...max
1 a a a a
2 b b b b
3 c c c c
where in place of ... there could be any string.
I would like to create a function createData that takes three arguments:
X: a dataframe,
cols: a vector containing the first part of the column name, e.g. c("v1", "v2"),
fun: a vector containing the second part of the column name, e.g. c("min") or c("max", "min"),
and returns the filtered dataframe. For example, createData(X, c("v1"), NULL) would return this kind of dataframe:
...v1...min ...v1...max
1 a a
2 b b
3 c c
while createData(X, c("v1", "v2"), c("min")) would give me
...v1...min ...v2...min
1 a a
2 b b
3 c c
At this point I decided I need something like select(contains()) from the dplyr package:
createData <- function(X, fun, cols) {
  X <- X %>% select(contains())  # but what goes into contains()?
  return(X)
}
What I struggle with is:
how to filter columns whose names contain two (or maybe more?) strings, e.g. both v1 and min? I tried data[grepl(".*(v1*min|min*v1).*", colnames(data), ignore.case=TRUE)], but it doesn't seem to work, and my expressions aren't fixed - they depend on the vector I pass,
how to filter multiple columns with different names, e.g. c("v1", "v2"), passed in a vector? And how to combine it with the first question?
I don't really need to stick with dplyr package, it was just for the sake of the example. Thanks!
EDIT:
A reproducible example:
data = data.frame(AXv1c2min = c(1,2,3),
                  subv1trwmax = c(4,5,6),
                  ss25v2xxmin = c(7,8,9),
                  cwfv2urttmmax = c(10,11,12))
If you pass a vector to contains, it acts like an OR across its elements, while chaining multiple select statements narrows the result like an AND. So for your example data, we can filter for (v1 OR v2) AND min like this:
library(tidyverse)
data %>%
  select(contains(c('v1','v2'))) %>%
  select(contains('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
So as a function where either argument is optional:
createData <- function(data, fun = NULL, cols = NULL) {
  if (!is.null(fun)) data <- select(data, contains(fun))
  if (!is.null(cols)) data <- select(data, contains(cols))
  return(data)
}
A series of examples:
createData(data, cols=c('v1', 'v2'), fun='min')
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, fun=c('min'))
AXv1c2min ss25v2xxmin
1 1 7
2 2 8
3 3 9
createData(data, cols=c('v1'), fun=c('min', 'max'))
AXv1c2min subv1trwmax
1 1 4
2 2 5
3 3 6
createData(data, cols=c('v1'), fun=c('max'))
subv1trwmax
1 4
2 5
3 6
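Since you mentioned not needing dplyr: the same OR-within-a-part, AND-across-parts logic can be written in base R with grepl, close to what you originally tried. A sketch (createDataBase is a hypothetical name, not an existing function; note the parts are treated as plain regexes, so metacharacters would need escaping):
createDataBase <- function(data, fun = NULL, cols = NULL) {
  keep <- rep(TRUE, ncol(data))
  # OR within each vector: collapse the alternatives into one regex
  if (!is.null(cols)) keep <- keep & grepl(paste(cols, collapse = "|"), names(data))
  if (!is.null(fun))  keep <- keep & grepl(paste(fun, collapse = "|"), names(data))
  data[, keep, drop = FALSE]
}
createDataBase(data, cols = c("v1", "v2"), fun = "min")
#  AXv1c2min ss25v2xxmin
#1         1           7
#2         2           8
#3         3           9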

How to create a parameter using $ to select from data frame

I have three different data frames that are similar in their columns, such as:
df1 df2 df3
Class 1 2 3 Class 1 2 3 Class 1 2 3
A 5 3 2 A 7 3 10 A 5 4 1
B 9 1 4 B 2 6 2 A 2 6 2
C 7 9 8 C 4 7 1 A 12 3 8
I would like to iterate through the three files and select the data from the columns with the same name. In other words, I want to iterate three times, each time selecting the data of column 1, then column 2, then column 3, and merge them into one data frame.
To do that, I did the following:
df1 <- read.csv(R1)
df2 <- read.csv(R2)
df3 <- read.csv(R3)
df <- data.frame(Class = character(), B1_1 = integer(), B1_2 = integer(), B1_3 = integer(), stringsAsFactors = FALSE)
for (i in 1:3) {
  nam <- paste("X", i, sep = "")  # here I want to build the column name, such as X1, X2, and X3
  df[seq_along(df1[nam]), ]$B1_1 <- df1[nam]
  df[seq_along(df2[nam]), ]$B1_2 <- df2[nam]
  df[seq_along(df3[nam]), ]$B1_3 <- df3[nam]
  df$Class <- df1$Class
}
For the line df[seq_along(df1[nam]), ]$B1_1 <- df1[nam] I followed the solution from this, but it produces the following error:
Error in `$<-.data.frame`(`*tmp*`, "B1_1", value = list(X1 = c(5L, 7L, :
replacement has 10 rows, data has 1
Do you have any idea how to solve it?
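The question is left unanswered above, but the error message points at the immediate cause: df1[nam] (single brackets) returns a one-column data.frame, not a vector, so the assignment tries to place a whole data.frame into a single column. A minimal sketch of the usual fix, assuming the rest of the loop stays as-is:
df$B1_1 <- df1[[nam]]   # [[ ]] returns the column as a vector; [ ] returns a one-column data.frame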

Merging different data frames depending on column value

I have a data frame df1
df1 <- data.frame(ID = c("A","B","A","A","B"), CLASS = c(1,1,2,1,4))
ID CLASS
1 A 1
2 B 1
3 A 2
4 A 1
5 B 4
and another two data frames A and B
> A<- data.frame(CLASS = c(1,2,3), DESCRIPTION = c("Unknown", "Tall", "Short"))
CLASS DESCRIPTION
1 1 Unknown
2 2 Tall
3 3 Short
> B <- data.frame(CLASS = c(1,2,3,4), DESCRIPTION = c("Big", "Small", "Medium", "Very Big"))
CLASS DESCRIPTION
1 1 Big
2 2 Small
3 3 Medium
4 4 Very Big
I want to merge these three data frames depending on the ID and class of df1 to have something like this:
ID CLASS DESCRIPTION
1 A 1 Unknown
2 B 1 Big
3 A 2 Tall
4 A 1 Unknown
5 B 4 Very Big
I know I can merge it as df1 <- merge(df1, A, by = "CLASS") but I can't find a way to add the conditional (maybe an "if" is too much) to also merge B according to the ID.
I need to have an efficient way to do this as I am applying it to over 2M rows.
Add the ID variable to A and B, rbind A and B together, and use ID and CLASS to merge:
A$ID = 'A'
B$ID = 'B'
AB <- rbind(A, B)
merge(df1, AB, by = c('ID', 'CLASS'))
ID CLASS DESCRIPTION
1 A 1 Unknown
2 A 1 Unknown
3 A 2 Tall
4 B 1 Big
5 B 4 Very Big
I would suggest using stringsAsFactors = FALSE when creating the data:
df1 <- data.frame(ID = c("A","B","A","A","B"), CLASS = c(1,1,2,1,4),
                  stringsAsFactors = FALSE)
A <- data.frame(CLASS = c(1,2,3),
                DESCRIPTION = c("Unknown", "Tall", "Short"),
                stringsAsFactors = FALSE)
B <- data.frame(CLASS = c(1,2,3,4),
                DESCRIPTION = c("Big", "Small", "Medium", "Very Big"),
                stringsAsFactors = FALSE)
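Given the 2M-row scale mentioned in the question, a data.table update join is another option; a minimal sketch, reusing the stacked AB table built above (note it modifies df1 in place, by reference):
library(data.table)
setDT(df1)
setDT(AB)
df1[AB, DESCRIPTION := i.DESCRIPTION, on = .(ID, CLASS)]   # join and assign by reference
df1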
To merge multiple dataframes in one go, Reduce is often helpful:
out <- Reduce(function(x, y) merge(x, y, by = "CLASS", all.x = TRUE), list(df1, A, B))
out
CLASS ID DESCRIPTION.x DESCRIPTION.y
1 1 A Unknown Big
2 1 B Unknown Big
3 1 A Unknown Big
4 2 A Tall Small
5 4 B <NA> Very Big
As you can see, columns present in more than one dataframe were given a suffix (default merge behavior). This lets you apply whatever logic you want to build the final column. For instance,
out$Description <- ifelse(out$ID == "A", as.character(out$DESCRIPTION.x), as.character(out$DESCRIPTION.y))
> out
CLASS ID DESCRIPTION.x DESCRIPTION.y Description
1 1 A Unknown Big Unknown
2 1 B Unknown Big Big
3 1 A Unknown Big Unknown
4 2 A Tall Small Tall
5 4 B <NA> Very Big Very Big
Note that ifelse is vectorized and quite efficient.
A dplyr solution:
library(dplyr)
bind_rows(lst(A,B),.id="ID") %>% inner_join(df1)
# ID CLASS DESCRIPTION
# 1 A 1 Unknown
# 2 A 1 Unknown
# 3 A 2 Tall
# 4 B 1 Big
# 5 B 4 Very Big
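Here inner_join matches on the shared columns ID and CLASS automatically (dplyr prints a "Joining, by = ..." message); spelling the keys out makes the intent explicit:
bind_rows(lst(A, B), .id = "ID") %>% inner_join(df1, by = c("ID", "CLASS"))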

Speed up data.frame rearrangement

I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which each label is assigned to every element of the sequence spanned by its coordinate range:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df <- data.frame()
for (i in 1:dim(a)[1]) {
  s <- seq(a[i, 1], a[i, 2])
  df <- rbind(df, data.frame(s, rep(a[i, 3], length(s))))
}
colnames(df) <- c("V1", "V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
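If groups could repeat, one workaround is to also group by row number so that each row expands independently; a sketch (the row helper column exists only for the expansion and is dropped afterwards):
setDT(a)[, .(V1 = start:end), by = .(row = seq_len(nrow(a)), group)][, row := NULL][]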
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
1. Use mapply to create a list of your ranges from "start" to "end".
2. Use rep + lengths to repeat the "group" column to the expected number of rows.
Unlike the data.table approach, this won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
           values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
  temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
  data.frame(group = rep(indf[["group"]], lengths(temp)),
             values = unlist(temp, use.names = FALSE))
}
Then, if you want some larger sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this approach does seem to slow down as the number of unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)
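For completeness, the same expansion can be written with no per-row calls at all, using sequence plus rep; a small variant of the same base R idea:
len <- a$end - a$start + 1
data.frame(V1 = sequence(len) + rep(a$start - 1, len),
           V2 = rep(a$group, len))
# on R >= 4.0, sequence(len, from = a$start) does the offsetting directly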
