Appending data frames in R based on column names - r

I am relatively new to R, so bear with me. I have a list of data frames that I need to combine into one data frame, like so:
dfList <- list(
df1 = data.frame(x=letters[1:2],y=1:2),
df2 = data.frame(x=letters[3:4],z=3:4)
)
comes out as:
$df1
x y
1 a 1
2 b 2
$df2
x z
1 c 3
2 d 4
I want to combine the common columns and add any columns that are not already there. The result would be:
final result
x y z
1 a 1
2 b 2
3 c 3
4 d 4
Is this even possible?

Yep, it's pretty easy, actually:
library(dplyr)
df_merged <- bind_rows(dfList)
df_merged
x y z
1 a 1 NA
2 b 2 NA
3 c NA 3
4 d NA 4
And if you don't want NA in the empty cells, you can replace them like this:
df_merged[is.na(df_merged)] <- 0 # or whatever you want to replace NA with
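If you would rather stay in base R, the same idea can be sketched by hand: pad every data frame with the columns it is missing, then rbind. This is only a minimal sketch on the toy data from the question; pad_cols is a hypothetical helper name, not part of any package.

```r
# Toy data from the question
dfList <- list(
  df1 = data.frame(x = letters[1:2], y = 1:2),
  df2 = data.frame(x = letters[3:4], z = 3:4)
)

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(dfList, names)))

# Hypothetical helper: add each missing column as NA, then put the
# columns in a common order so that rbind can line them up
pad_cols <- function(df, cols) {
  for (col in setdiff(cols, names(df))) df[[col]] <- NA
  df[cols]
}

combined <- do.call(rbind, lapply(dfList, pad_cols, cols = all_cols))
rownames(combined) <- NULL  # drop the df1.1, df1.2, ... row names
combined
```

In practice bind_rows is shorter and handles more edge cases; the sketch is mainly useful for seeing what it does.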

Just use do.call with rbind.fill from the plyr package:
library(plyr)
do.call(rbind.fill, dfList)
x y z
1 a 1 NA
2 b 2 NA
3 c NA 3
4 d NA 4

You could do that with base function merge():
merge(dfList$df1, dfList$df2, by = "x", all = TRUE)
# x y z
# 1 a 1 NA
# 2 b 2 NA
# 3 c NA 3
# 4 d NA 4
Or with dplyr package with function full_join:
dplyr::full_join(dfList$df1, dfList$df2, by = "x")
# x y z
# 1 a 1 NA
# 2 b 2 NA
# 3 c NA 3
# 4 d NA 4
Both keep every row from both data frames (a full outer join) and fill the cells that have no match with NA.
Hope that works for you.
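Since the question has a list of data frames, the two-table merge can be folded over the whole list with Reduce. This is a rough sketch on the toy data; it assumes the data frames share only the key column x, which merge then picks up by default.

```r
# Toy data from the question
dfList <- list(
  df1 = data.frame(x = letters[1:2], y = 1:2),
  df2 = data.frame(x = letters[3:4], z = 3:4)
)

# Fold a full outer join over the list; merge() joins on the
# column names the two inputs have in common (here only "x")
merged <- Reduce(function(a, b) merge(a, b, all = TRUE), dfList)
merged
```

Keep in mind that this is a join, not an append: if two data frames contain the same value of x, their rows are combined into one row instead of being stacked.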

Related

Removing columns based on a vector of names in R

I have a data.frame called DATA. Using BASE R, I was wondering how I could remove any variables in DATA that are named any of the following: ar = c("out", "Name", "mdif" , "stder" , "mpre")?
Currently, I use DATA[ , !names(DATA) %in% ar] but while this removes the unwanted variables, it again creates some new nuisance variables suffixed .1.
After extraction, is it possible to remove just suffixes?
Note1: We have NO ACCESS to r, the only input is DATA.
Note2: This is toy data, a functional solution is appreciated.
r <- list(
data.frame(Name = rep("Jacob", 6),
X = c(2,2,1,1,NA, NA),
Y = c(1,1,1,2,1,NA),
Z = rep(3, 6),
out = rep(1, 6)),
data.frame(Name = rep("Jon", 6),
X = c(1,NA,3,1,NA,NA),
Y = c(1,1,1,2,NA,NA),
Z = rep(2, 6),
out = rep(1, 6)))
DATA <- do.call(cbind, r) ## DATA
ar = c("out", "Name", "mdif" , "stder" , "mpre") # The names for exclusion
DATA[ , !names(DATA) %in% ar] ## Current solution
#>
# X Y Z X.1 Y.1 Z.1 ## X.1 Y.1 Z.1 are automatically created but not needed
# 1 2 1 3 1 1 2
# 2 2 1 3 NA 1 2
# 3 1 1 3 3 1 2
# 4 1 2 3 1 2 2
# 5 NA 1 3 NA NA 2
# 6 NA NA 3 NA NA 2
Ideally, column names should be unique, but if you want to keep the duplicated column names, we can remove the suffixes with sub after extraction (the $ anchors the pattern to the end of the name):
DATA1 <- DATA[ , !names(DATA) %in% ar]
names(DATA1) <- sub("\\.\\d+$", "", names(DATA1))
DATA1
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
In base R, if we create an object with the index, we can reuse it later instead of doing additional manipulations on the column name
i1 <- !names(DATA) %in% ar
DATA1 <- setNames(DATA[i1], names(DATA)[i1])
DATA1
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
For reusability, we can create a function
f1 <- function(dat, vec) {
i1 <- !names(dat) %in% vec
setNames(dat[i1], names(dat)[i1])
}
f1(DATA, ar)
If the datasets are stored in a list, use lapply to loop over the list and apply the f1
lst1 <- list(DATA, DATA)
lapply(lst1, f1, vec = ar)
If the 'ar' elements also differ between list elements, define one vector per element and use Map
ar1 <- c("out", "Name")
ar2 <- c("mdif" , "stder" , "mpre")
arLst <- list(ar1, ar2)
Map(f1, lst1, vec = arLst)
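To see the Map version end to end, here is a small self-contained sketch. The data is a cut-down version of the toy data from the question, and the second exclusion vector is changed to "X" (an assumption for illustration) so that it actually removes something.

```r
# Cut-down toy data; cbind keeps the duplicated column names
r <- list(
  data.frame(Name = rep("Jacob", 2), X = c(2, 1), out = c(1, 1)),
  data.frame(Name = rep("Jon", 2),   X = c(1, 3), out = c(1, 1))
)
DATA <- do.call(cbind, r)  # names: Name X out Name X out

# Same function as in the answer: subset by a logical index, then
# restore the original names that `[` made unique (X, X.1, ...)
f1 <- function(dat, vec) {
  i1 <- !names(dat) %in% vec
  setNames(dat[i1], names(dat)[i1])
}

ar1 <- c("out", "Name")  # exclusions for the first list element
ar2 <- c("X")            # exclusions for the second (illustrative)

res <- Map(f1, list(DATA, DATA), list(ar1, ar2))
lapply(res, names)
```

Map pairs the i-th data frame with the i-th exclusion vector, so the two lists need to have the same length.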
Here is another option using the tidyverse (set_names comes from purrr)
library(dplyr)
library(purrr)
library(stringr)
DATA %>%
set_names(make.unique(names(.))) %>%
select(-matches(str_c(ar, collapse="|"))) %>%
set_names(str_remove(names(.), "\\.\\d+$"))
# X Y Z X Y Z
#1 2 1 3 1 1 2
#2 2 1 3 NA 1 2
#3 1 1 3 3 1 2
#4 1 2 3 1 2 2
#5 NA 1 3 NA NA 2
#6 NA NA 3 NA NA 2
NOTE: It is not recommended to have duplicate column names

merge columns that have the same name r

I am working in R with a dataset that is created from mongodb with the use of mongolite.
I am getting a list that looks like so:
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4
I would like to merge the dataset to look like this:
_id A B
1 a 1
2 k 4
1 b 2
2 l 3
1 e 5
2 c 3
1 NA NA
2 d 4
The NAs in the last columns are there because the columns are named from the first entry and if a later entry has more columns than that they don't get names assigned to them, (if I get help for this as well it would be awesome but it's not the reason I am here).
Also the number of columns might differ for different subsets of the dataset.
I have tried melt(), but since it is a list and not a data frame it doesn't work as expected. I have also tried stack(), but it didn't work because the columns have the same name and some of them don't even have a name.
I know this is a very weird situation and appreciate any help.
Thank you.
Using magrittr for the pipe and data.table for fread and setDF:
library(magrittr)
library(data.table)
data:
df <- fread("
_id A B A B A B NA NA
1 a 1 b 2 e 5 NA NA
2 k 4 l 3 c 3 d 4 ",header=T)
setDF(df)
Code:
df2 <- df[,-1]
odds<- df2 %>% ncol %>% {(1:.)%%2} %>% as.logical
even<- df2 %>% ncol %>% {!(1:.)%%2}
cbind(df[,1,drop=F],
A=unlist(df2[,odds]),
B=unlist(df2[,even]),
row.names=NULL)
result:
# _id A B
# 1 1 a 1
# 2 2 k 4
# 3 1 b 2
# 4 2 l 3
# 5 1 e 5
# 6 2 c 3
# 7 1 <NA> NA
# 8 2 d 4
We can use data.table, assuming A and B always follow each other. I created an example with two sets of NAs in the header. With grep we can find the columns that fread has named V8 etc. Thanks to R's vector recycling, you can rename multiple headers in one go. If yours are named differently, change the pattern in the grep call. Then we reshape the data with melt:
library(data.table)
df <- fread("
_id A B A B A B NA NA NA NA
1 a 1 b 2 e 5 NA NA NA NA
2 k 4 l 3 c 3 d 4 e 5",
header = TRUE)
df
_id A B A B A B A B A B
1: 1 a 1 b 2 e 5 <NA> NA <NA> NA
2: 2 k 4 l 3 c 3 d 4 e 5
# assuming A B are always following each other. Can be done in 1 statement.
cols <- names(df)
cols[grep(pattern = "^V", x = cols)] <- c("A", "B")
names(df) <- cols
# melt data (if df is a data.frame, replace df with setDT(df))
df_melted <- melt(df, id.vars = 1,
measure.vars = patterns(c('A', 'B')),
value.name=c('A', 'B'))
df_melted
_id variable A B
1: 1 1 a 1
2: 2 1 k 4
3: 1 2 b 2
4: 2 2 l 3
5: 1 3 e 5
6: 2 3 c 3
7: 1 4 <NA> NA
8: 2 4 d 4
9: 1 5 <NA> NA
10: 2 5 e 5
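For comparison, the same reshaping can be sketched in base R by walking over the column pairs by position, which sidesteps the duplicated names entirely. This is only a sketch under the same assumption as above (an id column followed by strictly alternating A, B pairs); the small wide table below is made up for illustration.

```r
# Made-up wide table with duplicated column names
wide <- data.frame(1:2, c("a", "k"), c(1, 4), c("b", "l"), c(2, 3),
                   stringsAsFactors = FALSE)
names(wide) <- c("_id", "A", "B", "A", "B")

# Position of the first column of every (A, B) pair
starts <- seq(2, ncol(wide), by = 2)

# Build one small data frame per pair and stack them
long <- do.call(rbind, lapply(starts, function(j) {
  data.frame(`_id` = wide[[1]], A = wide[[j]], B = wide[[j + 1]],
             check.names = FALSE, stringsAsFactors = FALSE)
}))
long
```

Indexing with wide[[j]] works by position, so it does not matter that several columns share a name.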
Thank you for your help, they were great inspirations.
Even though @Andre Elrico's solution worked better on the reproducible example, @phiver's solution worked better on my overall problem.
Using both of those, I came up with the following.
library(data.table)
#The data were in a list of lists called list for this example
temp <- as.data.table(matrix(t(sapply(list, '[', seq(max(sapply(list, length))))),
nrow = m))
# m here is the number of lists in list
cols <- names(temp)
cols[grep(pattern = "^V", x = cols)] <- c("B", "A")
#They need to be the opposite way because the first column is going to be substituted with id, and this way they fall on the correct column after that
cols[1] <- "id"
names(temp) <- cols
l <- melt.data.table(temp, id.vars = 1,
measure.vars = patterns(c("A", "B")),
value.name = c("A", "B"))
That way I can use this also if I have more than 2 columns that I need to manipulate like that.

R: Combine columns ignoring NAs

I have a dataframe with a few columns, where for each row only one column can have a non-NA value. I want to combine the columns into one, keeping only the non-NA value, similar to this post:
Combine column to remove NA's
However, in my case, some rows may contain only NAs, so in the combined column, we should keep an NA, like this (adapted from the post I mentioned):
data <- data.frame('a' = c('A','B','C','D','E','F'),
'x' = c(1,2,NA,NA,NA,NA),
'y' = c(NA,NA,3,NA,NA,NA),
'z' = c(NA,NA,NA,4,5,NA))
So I would have
a x y z
1 A 1 NA NA
2 B 2 NA NA
3 C NA 3 NA
4 D NA NA 4
5 E NA NA 5
6 F NA NA NA
And I would like to get
'a' 'mycol'
A 1
B 2
C 3
D 4
E 5
F NA
The solution from the post mentioned above does not work in my case because of row F, it was:
cbind(data[1], mycol = na.omit(unlist(data[-1])))
Thanks!
Using base R...
data$mycol <- apply(data[,2:4], 1, function(x) x[!is.na(x)][1])
data
a x y z mycol
1 A 1 NA NA 1
2 B 2 NA NA 2
3 C NA 3 NA 3
4 D NA NA 4 4
5 E NA NA 5 5
6 F NA NA NA NA
One option is coalesce from dplyr
library(tidyverse)
data %>%
transmute(a, mycol = coalesce(!!! rlang::syms(names(.)[-1])))
# a mycol
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
#6 F NA
Or we can use max.col from base R
cbind(data[1], mycol= data[-1][cbind(1:nrow(data),
max.col(!is.na(data[-1])) * NA^!rowSums(!is.na(data[-1]))+1)])
# a mycol
#1 A 1
#2 B 2
#3 C 3
#4 D 4
#5 E 5
#6 F NA
Or only with rowSums
v1 <- rowSums(data[-1], na.rm = TRUE)
cbind(data[1], mycol = v1 * NA^!v1)
Or another option is pmax
cbind(data[1], mycol = do.call(pmax, c(data[-1], na.rm = TRUE)))
or pmin
cbind(data[1], mycol = do.call(pmin, c(data[-1], na.rm = TRUE)))
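Note that the rowSums, pmax and pmin variants rely on each row having at most one non-NA value (and the rowSums one also on that value not being 0). If that is not guaranteed, a "first non-NA wins" coalesce can be sketched in base R with Reduce. coalesce_cols is a hypothetical helper name used only for this example.

```r
# Data from the question
data <- data.frame(a = c("A", "B", "C", "D", "E", "F"),
                   x = c(1, 2, NA, NA, NA, NA),
                   y = c(NA, NA, 3, NA, NA, NA),
                   z = c(NA, NA, NA, 4, 5, NA))

# Hypothetical helper: walk over the columns from left to right and
# fill the gaps of the running result with the next column's values
coalesce_cols <- function(df) {
  Reduce(function(acc, nxt) ifelse(is.na(acc), nxt, acc), df)
}

result <- cbind(data[1], mycol = coalesce_cols(data[-1]))
result
```

Rows that contain only NAs (row F here) simply stay NA.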

R: fill a new column in a data frame with a value by matching variables in reverse

I apologize for the title of this question. I can't figure out a good way to briefly describe what I want to do.
I have something like this, with >8000 rows:
x y value_xy
A B 7
A C 2
B A 3
B C 6
C A 2
C B 1
I want to create a new column, value_yx, that looks like this:
x y value_xy value_yx
A B 7 3
A C 2 2
B A 3 7
B C 6 1
C A 2 2
C B 1 6
For each (x, y) pair, the new column should hold the value_xy of the reversed pair (y, x), which appears elsewhere in the table. Sometimes these values are equal, other times they aren't.
I have explored using for loops, ave(), and several other functions, but I haven't been able to make it work.
Try merge. The by.x and by.y arguments specify columns to be matched, and here the order of matching columns is reversed in by.y:
merge(x = df, y = df, by.x = c("x", "y"), by.y = c("y", "x"))
# x y value_xy.x value_xy.y
# 1 A B 7 3
# 2 A C 2 2
# 3 B A 3 7
# 4 B C 6 1
# 5 C A 2 2
# 6 C B 1 6
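If you want to keep the original row order and get a single new column, the same reversed lookup can be sketched with match on a pasted key. This assumes every (x, y) pair occurs only once and that the values contain no spaces (otherwise pick a separator that cannot occur in the data).

```r
# Data from the question
df <- data.frame(x = c("A", "A", "B", "B", "C", "C"),
                 y = c("B", "C", "A", "C", "A", "B"),
                 value_xy = c(7, 2, 3, 6, 2, 1))

# For every row, find the row whose (x, y) equals this row's (y, x)
idx <- match(paste(df$y, df$x), paste(df$x, df$y))

# Rows without a reversed partner get NA
df$value_yx <- df$value_xy[idx]
df
```

Unlike the merge result, no columns need renaming afterwards and unmatched rows are kept.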
Looks like I was beat to it but an alternative solution with mapply
df$value_yx = mapply(function(x_flip, y_flip) df[df$x == y_flip & df$y == x_flip,]$value_xy, df$x, df$y)
# x y value_xy value_yx
#1 A B 7 3
#2 A C 2 2
#3 B A 3 7
#4 B C 6 1
#5 C A 2 2
#6 C B 1 6
xtabs will return a value matrix that can be indexed by a two-column, character-valued matrix formed from the first two columns in reverse order. Those columns are probably factors (hence the need for the as.character() conversion):
> dfrm$value_yx <- xtabs(value_xy~x+y, dfrm)[
sapply(dfrm[2:1],as.character) ]
> dfrm
x y value_xy value_yx
1 A B 7 3
2 A C 2 2
3 B A 3 7
4 B C 6 1
5 C A 2 2
6 C B 1 6
--- See what is being indexed
> xtabs(value_xy~x+y, dfrm)
y
x A B C
A 0 7 2
B 3 0 6
C 2 1 0

Condensing Data Frame in R

I just have a simple question. I really appreciate everyone's input; you have been a great help to my project. I have an additional question about data frames in R.
I have data frame that looks similar to something like this:
C <- c("","","","","","","","A","B","D","A","B","D","A","B","D")
D <- c(NA,NA,NA,2,NA,NA,1,1,4,2,2,5,2,1,4,2)
G <- list(C=C,D=D)
T <- as.data.frame(G)
T
C D
1 NA
2 NA
3 NA
4 2
5 NA
6 NA
7 1
8 A 1
9 B 4
10 D 2
11 A 2
12 B 5
13 D 2
14 A 1
15 B 4
16 D 2
I would like to be able to condense all the repeat characters into one, and look similar to this:
J B C E
1 2 1
2 A 1 2 1
3 B 4 5 4
4 D 2 2 2
So of course, the data is all the same, it is just that it is condensed and new columns are formed to hold the data. I am sure there is an easy way to do it, but from the books I have looked through, I haven't seen anything for this!
EDIT I edited the example because it wasn't working with the answers so far. I wonder if the NA's, blanks, and unevenness from the blanks are contributing??
Here's a reshape solution:
require(reshape)
cast(T, C ~ ., function(x) x)
Changed T to my.df to avoid a bad habit (T is shorthand for TRUE). Returns a list, which may not be what you want, but you can convert from there.
C <- c("A","B","D","A","B","D","A","B","D")
D <- c(1,4,2,2,5,2,1,4,2)
my.df <- data.frame(id=C,val=D)
ret <- function(x) x
by.df <- by(my.df$val,INDICES=my.df$id,ret)
This seems to get the results you are looking for. I'm assuming it's OK to remove the NA values since that matches the desired output you show.
T <- na.omit(T)
T$ind <- ave(1:nrow(T), T$C, FUN = seq_along)
reshape(T, direction = "wide", idvar = "C", timevar = "ind")
# C D.1 D.2 D.3
# 4 2 1 NA
# 8 A 1 2 1
# 9 B 4 5 4
# 10 D 2 2 2
library(reshape2)
dcast(T, C ~ ind, value.var = "D", fill = "")
# C 1 2 3
# 1 2 1
# 2 A 1 2 1
# 3 B 4 5 4
# 4 D 2 2 2
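When every group has the same number of values, a very small base R sketch is to split the values by group and bind the pieces as rows. It uses only the clean part of the example (no blanks or NAs) and assumes equal group sizes; with unequal sizes rbind would recycle values and warn.

```r
# Clean part of the example data
C <- c("A", "B", "D", "A", "B", "D", "A", "B", "D")
D <- c(1, 4, 2, 2, 5, 2, 1, 4, 2)

# One numeric vector per group, then one matrix row per group
wide <- do.call(rbind, split(D, C))
wide
```

The result is a matrix with the group labels as row names; wrap it in as.data.frame() if you need a data frame.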
