I have 3 data frames that share some column names, but each also has columns that are missing from at least one of the others. I am trying to 1) select only the columns that are present in all 3 dfs and 2) make sure all of the columns are in the same order (not necessarily alphabetical, per se).
df1
A B C D
4 5 2 9
df2
A D C F
13 23 94 1
df3
E C A D
3 83 12 7
**Ideal Output**
df1
A C D
4 2 9
df2
A C D
13 94 23
df3
A C D
12 83 7
I am honestly not too sure where to start. Intuitively I think something like this:
df1 <- apply(df1, 2, function(x) ifelse(colnames(x) %in% colnames(df2) & colnames(df1) %in% colnames(df3), x, subset(df1, select = -c(x))
Then repeat for the other 2 dfs. Once all three dfs have the same columns, I would just order them using one of the dfs as a template.
col_order <- colnames(df1)
df2 <- df2[, col_order]
Where am I going wrong?
We can get the datasets in a list, get the intersecting names, and use that to subset:
lst1 <- mget(ls(pattern = '^df\\d+$'))
nm1 <- Reduce(intersect, lapply(lst1, names))
lapply(lst1, subset, select = nm1)
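For the example data above, this returns a named list whose elements match the ideal output; the column order follows the first data frame, because intersect() keeps the order of its first argument. A quick check (the name out is only for illustration):
out <- lapply(lst1, subset, select = nm1)
out$df1
#   A C D
# 1 4 2 9
out$df2
#    A  C  D
# 1 13 94 23
out$df3
#    A  C  D
# 1 12 83  7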
I am trying to use a vector of logical expressions to subset a data frame. I have a data frame I want to subset based on several columns, where I want to exclude "B" each time. First I want to define a vector of logical expressions based on the data frame's column names.
set.seed(42)
n <- 24
dataframe <- data.frame(column1=as.character(factor(paste("obs",1:n))),
rand1=rep(LETTERS[1:4], n/4),
rand2=rep(LETTERS[1:6], n/6),
rand3=rep(LETTERS[1:3], n/3),
x=rnorm(n))
columns <- colnames(dataframe)[2:4]
criteria <- quote(rep(paste0(columns[1:3], " != ", quote("B")), length(columns)))
What I want to achieve is a vector criteria containing
rand1 != "B" rand2 != "B" rand3 != "B" so I can use it to subset data frame based on columns like
dfs1 <- subset(dataframe, criteria[1])
dfs2 <- subset(dataframe, criteria[2])
dfs3 <- subset(dataframe, criteria[3])
I might be misunderstanding your question, but it seems like you want a collection of data.frames where each one excludes rows where a given column = 'B'.
Assuming this is what you want:
cols <- c('rand1', 'rand2', 'rand3')
result <- lapply(dataframe[, cols], function(x) dataframe[x!='B',])
will create a list of data.frames, each of which has the result of excluding rows where the indicated column == 'B'.
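Since lapply() here iterates over the named columns of dataframe[, cols], result is a named list, so the individual subsets (dfs1, dfs2, dfs3 in the question) can, for example, be pulled out by column name:
dfs1 <- result$rand1   # rows where rand1 != 'B'
dfs2 <- result$rand2   # rows where rand2 != 'B'
dfs3 <- result$rand3   # rows where rand3 != 'B'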
Based on Using tidy eval for multiple, arbitrary filter conditions
library(dplyr)
library(purrr)

filter_fun <- function(df, cols, conds){
  fp <- map2(cols, conds, function(x, y) quo((!!(as.name(x))) != !!y))
  filter(df, !!!fp)
}
filter_col <- columns[1:3] %>% as.list()
cond_list <- rep(list("B"), length(columns[1:3]))
filter_fun(dataframe, cols = filter_col,
conds = cond_list)
column1 rand1 rand2 rand3 x
1 obs 1 A A A 1.3709584
2 obs 3 C C C 0.3631284
3 obs 4 D D A 0.6328626
4 obs 7 C A A 1.5115220
5 obs 9 A C C 2.0184237
6 obs 12 D F C 2.2866454
7 obs 13 A A A -1.3888607
8 obs 15 C C C -0.1333213
9 obs 16 D D A 0.6359504
10 obs 19 C A A -2.4404669
11 obs 21 A C C -0.3066386
12 obs 24 D F C 1.2146747
How do I simply "paste" two data frames next to each other, filling unequal rows with NAs (e.g. because I want to make a "kable" or something similar)?
df1 <- data.frame(a = c(1,2,3),
b = c(3,4,5))
df2 <- data.frame(a = c(4,5),
b = c(5,6))
# The desired "merge"
a b a b
1 3 4 5
2 4 5 6
3 5 NA NA
Thanks to Ronak Shah, I found an easy answer in the answers to this post: How to cbind or rbind different lengths vectors without repeating the elements of the shorter vectors?
Without having to hack anything together, one can use cbind.na from the qpcR package:
df1 <- data.frame(a = c(1,2,3),
b = c(3,4,5))
df2 <- data.frame(a = c(4,5),
b = c(5,6))
comb <- qpcR:::cbind.na(df1, df2)
As this answer is 4 years old, I wonder if there are more "modern" solutions in popular packages like the tidyverse et al.
In base R you could do:
nr <- max(nrow(df1), nrow(df2))
cbind(df1[1:nr, ], df2[1:nr, ])
# a b a b
# 1 1 3 4 5
# 2 2 4 5 6
# 3 3 5 NA NA
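The same indexing trick generalizes to any number of data frames if they are kept in a list; a minimal sketch under that assumption (the name dfs is only illustrative):
dfs <- list(df1, df2)
nr <- max(sapply(dfs, nrow))
# indexing past the last row pads the shorter data frames with NA rows
do.call(cbind, lapply(dfs, function(d) d[seq_len(nr), ]))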
I have several data frames I would like to combine, but I need to get rid of rows that don't have matching values in a column in the other data frames. For example, I want to merge a, b, and c data frames, based on the values in column x.
a <- data.frame(1:5, 5:9)
colnames(a) <- c("x", "y")
b <- data.frame(1:4, 7:10)
colnames(b) <- c("x", "y")
c <- data.frame(1:3, 6:8)
colnames(c) <- c("x", "y")
and have the result be
1 5
2 6
3 7
1 7
2 8
3 9
1 6
2 7
3 8
where the first three rows are from data frame a, the second three rows are from data frame b, and the third three rows are from data frame c, and the rows that didn't have matching values in column x were not included.
We create an index based on intersecting elements of 'x'
v1 <- Reduce(intersect, list(a$x, b$x, c$x))
rbind(a[a$x %in% v1,], b[b$x %in% v1,], c[c$x %in% v1, ])
# x y
#1 1 5
#2 2 6
#3 3 7
#4 1 7
#5 2 8
#6 3 9
#7 1 6
#8 2 7
#9 3 8
If there are many dataset objects, it is better to keep them in a list. Here, the example uses completely different object identifiers, but if the identifiers follow a pattern, e.g. df1, df2, ..., df100, it is easy to collect them into a list
lst1 <- mget(ls(pattern = "^df\\d+$"))
If the object identifiers are all different (xyz, abc, fq12, etc.), but these are the only data.frame objects loaded in the global environment
lst1 <- mget(names(which(unlist(eapply(.GlobalEnv, is.data.frame)))))  # keep only the data.frame objects
Then, get the intersecting elements of the column 'x'
v1 <- Reduce(intersect, lapply(lst1, `[[`, "x"))
Use the intersecting vector to subset the rows of the list elements
do.call(rbind, lapply(lst1, function(x) x[x$x %in% v1, ]))
Here, we assume the column names are the same across all the datasets
Another option is to do a merge and then unlist
out <- Reduce(function(...) merge(..., by = 'x'), list(a, b, c))
data.frame(x = out$x, y = unlist(out[-1], use.names = FALSE))
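For the example a, b, and c above, the intermediate out is a single merged data frame with one y column per input; unlisting everything except x (and letting x recycle) reproduces the nine-row result shown earlier:
out
#   x y.x y.y y
# 1 1   5   7 6
# 2 2   6   8 7
# 3 3   7   9 8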
I have two datasets that look like this:
What I want is to change the values from the second column in the first dataset to the values from the second column from the second dataset. All the names in the first dataset are in the second one, and obviously my dataset is much bigger than that.
I was trying to use R to do that, but I am very new at it. I was looking at the intersect command, but I am not sure if it's going to work. I don't include any code because I'm really lost here.
I also need the order of the first column (which holds the names) in the first dataset to stay the same, but with the new values from the second column of the second dataset.
Agree with @agstudy, a simple use of merge would do the trick. Try something like this:
df1 <- data.frame(name=c("ab23242", "ab35366", "ab47490", "ab59614"),
X=c(72722, 88283, 99999, 114278.333))
df2 <- data.frame(name=c("ab35366", "ab47490", "ab59614", "ab23242" ),
X=c(12345, 23456, 34567, 456789))
df.merge <- merge(df1, df2, by = "name", all.x = TRUE)
df.merge <- df.merge[, -2]  # drop the original X.x column, keeping name and X.y
Output:
name X.y
1 ab23242 456789
2 ab35366 12345
3 ab47490 23456
4 ab59614 34567
I think merge will keep the order of the first frame, but you can also enforce the original order explicitly by adding an order column (df1$order <- 1:nrow(df1)) before merging and sorting on it afterwards, as sketched below.
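A minimal sketch of that order-preserving variant (the helper column name order is only illustrative):
df1$order <- 1:nrow(df1)                      # remember the original row order
df.merge <- merge(df1, df2, by = "name", all.x = TRUE)
df.merge <- df.merge[order(df.merge$order), c("name", "X.y")]  # restore order, keep new values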
df1 <- data.frame(name1 = letters[6:10], valuecol1 = seq(2, 10, by = 2))
df2 <- data.frame(name2 = letters[1:10], valuecol2 = 10:1)
df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ df1$name1 %in% df2$name2 , "valuecol1"]
df2
name2 valuecol2
1 a 10
2 b 9
3 c 8
4 d 7
5 e 6
6 f 2
7 g 4
8 h 6
9 i 8
10 j 10
This is what I thought might work, but doing replacements using indexing with match sometimes bites me in ways I need to adjust:
df2 [match(df1$name1, df2$name2) , "valuecol2"] <-
df1[ match(df1$name1, df2$name2) , "valuecol1"]
Here's how I tested it (edited).
> df2 <- data.frame( name2 = letters[1:10], valuecol2=10:1)
> df1<- data.frame( name1 = letters[1:5], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f 5
7 g 4
8 h 3
9 i 2
10 j 1
Yep.... bitten again.
> df1<- data.frame( name1 = letters[6:10], valuecol1=seq(2,10,by=2))
> df2 [ match(df1$name1, df2$name2) , "valuecol2"] <- df1[ match(df1$name1, df2$name2) , "valuecol1"]
> df2
name2 valuecol2
1 a 2
2 b 4
3 c 6
4 d 8
5 e 10
6 f NA
7 g NA
8 h NA
9 i NA
10 j NA
How about this:
library(data.table)
# generate some random data
dt.1 <- data.table(id = 1:1000, value = rnorm(1000), key = "id")
dt.2 <- data.table(id = 2 * (500:1), value = as.numeric(1:500), key = "id")
# objective is to replace value in dt.1 with value from dt.2 where the ids match
# data.table joins - very efficient
# after the join, dt.1 has 3 columns: id, value (from dt.2, NA where there is no match),
# and i.value, dt.1's original value (the original answer's data.table version named it value.1)
dt.1 <- dt.2[dt.1, nomatch = NA]
dt.1[is.na(value), value := i.value]  # keep dt.1's original value where dt.2 had no match
dt.1[, i.value := NULL]               # get rid of the extra column
NB: This sorts dt.1 by id which should be OK since it's sorted that way already.
Also: In future, please include data that can be imported into R. Images are not useful!
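As a side note (not part of the original answer), current data.table versions can do the replacement as an in-place "update join", which avoids the extra column and the NA bookkeeping entirely; a minimal sketch:
library(data.table)
dt.1 <- data.table(id = 1:1000, value = rnorm(1000), key = "id")
dt.2 <- data.table(id = 2 * (500:1), value = as.numeric(1:500), key = "id")
# for rows of dt.1 whose id also appears in dt.2, overwrite value with dt.2's value
dt.1[dt.2, value := i.value, on = "id"]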
In attempting to answer a question earlier, I ran into a problem that seemed like it should be simple, but that I couldn't figure out.
If I have a list of dataframes:
df1 <- data.frame(a=1:3, x=rnorm(3))
df2 <- data.frame(a=1:3, x=rnorm(3))
df3 <- data.frame(a=1:3, x=rnorm(3))
df.list <- list(df1, df2, df3)
That I want to rbind together, I can do the following:
library(plyr)
df.all <- ldply(df.list, rbind)
However, I want another column that identifies which data.frame each row came from. I expected to be able to use the deparse(substitute(x)) method (here and elsewhere) to get the name of the relevant data.frame and add a column. This is how I approached it:
fun <- function(x) {
name <- deparse(substitute(x))
x$id <- name
return(x)
}
df.all <- ldply(df.list, fun)
Which returns
a x id
1 1 1.1138062 X[[1L]]
2 2 -0.5742069 X[[1L]]
3 3 0.7546323 X[[1L]]
4 1 1.8358605 X[[2L]]
5 2 0.9107199 X[[2L]]
6 3 0.8313439 X[[2L]]
7 1 0.5827148 X[[3L]]
8 2 -0.9896495 X[[3L]]
9 3 -0.9451503 X[[3L]]
So obviously each element of the list does not contain the name I think it does. Can anyone suggest a way to get what I expected (shown below)?
a x id
1 1 1.1138062 df1
2 2 -0.5742069 df1
3 3 0.7546323 df1
4 1 1.8358605 df2
5 2 0.9107199 df2
6 3 0.8313439 df2
7 1 0.5827148 df3
8 2 -0.9896495 df3
9 3 -0.9451503 df3
Define your list with names and it should give you an .id column with the data.frame name
df.list <- list(df1=df1, df2=df2, df3=df3)
df.all <- ldply(df.list, rbind)
Output:
.id a x
1 df1 1 1.84658809
2 df1 2 -0.01177462
3 df1 3 0.58579469
4 df2 1 -0.64748756
5 df2 2 0.24384614
6 df2 3 0.59012676
7 df3 1 -0.63037679
8 df3 2 -1.17416295
9 df3 3 1.09349618
Then you can recover the data.frame name from the df.all$.id column.
Edit:
As per @Gary Weissman's comment, if you want to generate the names automatically you can do
names(df.list) <- paste0('df', seq_along(df.list))
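As a side note (not part of the original answer), the same named-list idea also works with the more recent dplyr::bind_rows(), which takes an .id argument:
library(dplyr)
df.list <- list(df1 = df1, df2 = df2, df3 = df3)
df.all <- bind_rows(df.list, .id = "id")   # the id column contains "df1", "df2", "df3"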
Using base only, one could try something like:
dd <- lapply(seq_along(df.list), function(x) cbind(df_name = paste0('df', x), df.list[[x]]))
do.call(rbind,dd)
In your definition, df.list does not have names; however, even then the deparse(substitute()) idiom does not appear to work easily, as lapply() calls .Internal(lapply(X, FUN)) -- you would have to look at the source to see whether the object name is available and how to get it.
Something like
names(df.list) <- paste('df', 1:3, sep = '')
foo <- function(n, .list){
  .list[[n]]$id <- n
  .list[[n]]
}
do.call(rbind, lapply(names(df.list), foo, .list = df.list))
  a          x  id
1 1  0.8204213 df1
2 2 -0.8881671 df1
3 3  1.2880816 df1
4 1 -2.2766111 df2
5 2  0.3912521 df2
6 3 -1.3963381 df2
7 1 -1.8057246 df3
8 2  0.5862760 df3
9 3  0.5605867 df3
If you want to use your function, instead of deparse(substitute(x)) use match.call() and take the second argument, making sure to convert it to character:
name <- as.character(match.call()[[2]])