Merging multiple data frames without getting duplicates - r

I am trying to merge 6+ datasets into one by ID. Right now, the duplication of IDs makes merge treat each as a new observation.
Example code:
combined <-Reduce(function(x,y) merge(x,y, all=TRUE), list(NRa,NRb,NRc,NRd,NRe,NRf,NRg,NRh))
Which gives me this:
ID Segment.h Segment.g Segment.f Segment.e Segment.d Segment.c
1 62729107 NA NA NA NA NA 1
2 62734839 NA 1 NA NA 1 NA
3 62734839 NA NA NA 1 NA NA
4 62737229 NA 1 NA NA NA NA
5 62737229 NA NA NA 1 1 NA
I would like each ID to have a single row:
ID Segment.h Segment.g Segment.f Segment.e Segment.d Segment.c
1 62729107 NA NA NA NA NA 1
2 62734839 NA 1 NA 1 1 NA
3 62737229 NA 1 NA 1 1 NA
Any help is appreciated. Thank you.

R's sqldf package can do this with SQL joins, leaving you with one ID per row.
Data1 <- data.frame(
X = sample(1:10),
Housing = sample(c("yes", "no"), 10, replace = TRUE)
)
Data2 <- data.frame(
X = sample(1:10),
Credit = sample(c("yes", "no"), 10, replace = TRUE)
)
Data3 <- data.frame(
X = sample(1:10),
OwnsCar = sample(c("yes", "no"), 10, replace = TRUE)
)
Data4 <- data.frame(
X = sample(1:10),
CollegeGrad = sample(c("yes", "no"), 10, replace = TRUE)
)
library(sqldf)
sqldf("Select Data1.X,Data1.Housing,Data2.Credit,Data3.OwnsCar,Data4.CollegeGrad from Data1
inner join Data2 on Data1.X = Data2.X
inner join Data3 on Data1.X = Data3.X
inner join Data4 on Data1.X = Data4.X
")

Why don't you try by = 'ID' in your merge() call? If that's not enough, try aggregate().
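A minimal sketch of the aggregate() step, assuming the merged frame looks like the output in the question (only some of its columns are reproduced here, and first_non_na is a hypothetical helper name). The merge itself would pass by = "ID" inside the Reduce() call; this part only collapses whatever duplicate IDs remain.

```r
# Assumption: a merged frame with duplicated IDs, modelled on the question's output
combined <- data.frame(
  ID        = c(62729107, 62734839, 62734839, 62737229, 62737229),
  Segment.g = c(NA, 1, NA, 1, NA),
  Segment.e = c(NA, NA, 1, NA, 1),
  Segment.d = c(NA, 1, NA, NA, 1),
  Segment.c = c(1, NA, NA, NA, NA)
)

# Hypothetical helper: first non-missing value in a group, or NA if there is none
first_non_na <- function(v) v[!is.na(v)][1]

# Collapse every Segment column to one row per ID
collapsed <- aggregate(combined[-1], by = list(ID = combined$ID), FUN = first_non_na)
collapsed
```

Taking the first non-missing value is a choice made for this sketch; if an ID can carry conflicting values in the same column, a different summary (for example max with na.rm = TRUE) may suit the data better.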

Your description of the problem is not entirely clear, and you don't provide data.
Assuming that all of your dataframes have the same dimensions, column names, column orders, ID entries, that the ID row orders match, that ID is the first column, that all other entries are either NA or 1 and that any cell in one dataframe featuring a 1 has NA values in that cell for all other data frames or that sums of numeric values are acceptable, and that you want the result as a data frame ...
An Old-School solution using the abind package:
consolidate <- function(lst) {
  stopifnot(require(abind))
  ## form 3D array, replace NA
  x <- abind(lst, along=3)
  x[is.na(x)] <- 0
  z <- x[,,1] ## data store
  ## sum array along 3rd dimension
  for (j in seq(2,ncol(x)))
    for (i in seq(nrow(x)))
      z[i,j] <- sum(x[i,j,])
  z[z==0] <- NA ## restore NA
  as.data.frame(z)
}
For dataframes (with the above caveats) a,b,c:
consolidate(list(a,b,c))

Related

Fast way to insert values in column of data frame in R

I am currently trying to find unique elements between two columns of a data frame and write these to a new final data frame.
This is my code, which works perfectly fine, and creates a result which matches my expectation.
set.seed(42)
df <- data.frame(a = sample(1:15, 10),
b=sample(1:15, 10))
unique_to_a <- df$a[!(df$a %in% df$b)]
unique_to_b <- df$b[!(df$b %in% df$a)]
n <- max(c(unique_to_a, unique_to_b))
out <- data.frame(A=rep(NA,n), B=rep(NA,n))
for (element in unique_to_a){
out[element, "A"] = element
}
for (element in unique_to_b){
out[element, "B"] = element
}
out
The problem is that it is very slow, because the real data contains hundreds of thousands of rows. I am quite sure this is because of the repeated indexing in the for loops, and I am sure there is a quicker, vectorized way, but I don't see it...
Any ideas on how to speed up the operation are much appreciated.
Cheers!
Didn't compare the speed but at least this is more concise:
elements <- with(df, list(setdiff(a, b), setdiff(b, a)))
data.frame(sapply(elements, \(x) replace(rep(NA, max(unlist(elements))), x, x)))
# X1 X2
# 1 NA NA
# 2 NA NA
# 3 NA 3
# 4 NA NA
# 5 NA NA
# 6 NA NA
# 7 NA NA
# 8 NA NA
# 9 NA NA
# 10 NA NA
# 11 11 NA
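Since the question asks about speed, here is a sketch of a vectorized version that replaces both for loops with one indexed assignment per column. The input values are made up for illustration (fixed rather than sampled), and I have not benchmarked it against the real data.

```r
# Assumption: small fixed example data instead of the sampled columns
df <- data.frame(a = c(1, 5, 9, 11, 4), b = c(5, 3, 9, 4, 7))

unique_to_a <- setdiff(df$a, df$b)  # values in a that never occur in b
unique_to_b <- setdiff(df$b, df$a)  # values in b that never occur in a

n <- max(c(unique_to_a, unique_to_b))
out <- data.frame(A = rep(NA_real_, n), B = rep(NA_real_, n))

# Each value is written to the row whose index equals the value, all at once
out$A[unique_to_a] <- unique_to_a
out$B[unique_to_b] <- unique_to_b
out
```

Like the original loop, this assumes the values are positive integers, because they are used as row indices.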

randomly add NA values across columns in data

I have something like this:
dates <- seq(from = as.Date("2010-01-01"), as.Date("2017-12-01"), "1 day")
values = cumsum(rnorm(length(dates)))
df <- cbind(dates, values)
Which looks like:
dates values
1 14610 -0.3750827
2 14611 0.2068051
3 14612 0.1986609
4 14613 0.1793758
5 14614 1.1068358
6 14615 0.9621490
I would like to randomly add NA values to the data, such that:
dates values
1 14610 -0.3750827
2 NA NA
3 14612 0.1986609
4 14613 0.1793758
5 NA NA
6 14615 0.9621490
That is, some rows should contain only NA values. I have found code that randomly adds NA values, but only to one column.
ind <- sample(df, 100)
df[ind] <- NA
Does not work for me.
To do it your way, you'd need a logical array with the same dimensions as df, filled with random TRUEs and FALSEs, so that you can replace the cells marked TRUE with NA. Here's a way:
ind <- matrix(sample(c(TRUE, FALSE), prod(dim(df)), replace = TRUE),
nrow = nrow(df), ncol = ncol(df))
df[ind] <- NA
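The desired output in the question blanks whole rows rather than individual cells. A minimal sketch of that variant, assuming df is built as a data frame (the question uses cbind(), which gives a matrix) and that n_blank, a made-up name, is the number of rows to blank:

```r
set.seed(1)
dates <- seq(as.Date("2010-01-01"), as.Date("2010-01-20"), by = "1 day")
df <- data.frame(dates = dates, values = cumsum(rnorm(length(dates))))

n_blank <- 5                        # hypothetical: how many rows to blank out
rows <- sample(nrow(df), n_blank)   # row numbers drawn without replacement
df[rows, ] <- NA                    # every column of the chosen rows becomes NA
head(df)
```

Sampling row numbers instead of cells keeps the NAs aligned across columns, so a row is either complete or entirely missing.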

Insert an empty column between every column of a dataframe in R

Say you have a dataframe of four columns:
dat <- data.frame(A = rnorm(5), B = rnorm(5), C = rnorm(5), D = rnorm(5))
And you want to insert an empty column between each of the columns in the dataframe, so that the output is:
A A1 B B1 C C1 D D1
1 1.15660588 NA 0.78350197 NA -0.2098506 NA 2.07495662 NA
2 0.60107853 NA 0.03517539 NA -0.4119263 NA -0.08155673 NA
3 0.99680981 NA -0.83796981 NA 1.2742644 NA 0.67469277 NA
4 0.09940946 NA -0.89804952 NA 0.3419173 NA -0.95347049 NA
5 0.28270734 NA -0.57175554 NA -0.4889045 NA -0.11473839 NA
How would you do this?
The dataframe I would like to do this operation to has hundreds of columns and so obviously I don't want to type out each column and add them naively like this:
dat$A1 <- NA
dat$B1 <- NA
dat$C1 <- NA
dat$D1 <- NA
dat <- dat[, c("A", "A1", "B", "B1", "C", "C1", "D", "D1")]
Thanks for your help in advance!
You can try
res <- data.frame(dat, dat*NA)[order(rep(names(dat),2))]
res
# A A.1 B B.1 C C.1 D D.1
#1 1.15660588 NA 0.78350197 NA -0.2098506 NA 2.07495662 NA
#2 0.60107853 NA 0.03517539 NA -0.4119263 NA -0.08155673 NA
#3 0.99680981 NA -0.83796981 NA 1.2742644 NA 0.67469277 NA
#4 0.09940946 NA -0.89804952 NA 0.3419173 NA -0.95347049 NA
#5 0.28270734 NA -0.57175554 NA -0.4889045 NA -0.11473839 NA
NOTE: I am leaving the . in the column names as it is a trivial task to remove it.
Or another option is
dat[paste0(names(dat),1)] <- NA
dat[order(names(dat))]
You can try this:
df <- cbind(dat, dat)
df <- df[, sort(names(df))]
# blank every second column, whatever the number of columns
df[, seq(2, ncol(df), by=2)] <- NA
names(df) <- sub("\\.", "", names(df))
# create new data frame with twice the number of columns
bigdat <- data.frame(matrix(ncol = dim(dat)[2]*2, nrow = dim(dat)[1]))
# set sequence of target column indices
inds <- seq(1,dim(bigdat)[2],by=2)
# insert values
bigdat[,inds] <- dat
# set column names
colnames(bigdat)[inds] <- colnames(dat)
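The name-sorting approaches above depend on the column names sorting into the right order. Here is a sketch that interleaves by position instead, so it should also work when the names are not alphabetical. interleave_na and its suffix argument are hypothetical names introduced for this example.

```r
# Hypothetical helper: put an all-NA column after every column of dat
interleave_na <- function(dat, suffix = "1") {
  n <- ncol(dat)
  blanks <- as.data.frame(matrix(NA, nrow = nrow(dat), ncol = n))
  names(blanks) <- paste0(names(dat), suffix)
  # order() is stable, so column i of dat is followed by column i of blanks
  cbind(dat, blanks)[order(c(seq_len(n), seq_len(n)))]
}

res <- interleave_na(data.frame(Z = 1:3, B = 4:6))
names(res)
```

It works because both halves of the combined frame carry the same position keys 1..n, and ties in order() keep the original column ahead of its blank partner.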

How to create a new column with if condition

This seems simple, but I could not get it to work. It is different from the similar-sounding questions asked here.
I want to create new columns, say df$col1, df$col2 and df$col3, in the data frame df, using if conditions on the existing columns df$con and df$val.
I would like to write the value of df$val in df$col1 if df$con > 3,
the value of df$val in df$col2 if df$con < 2,
and 30% of df$val in df$col3 if df$con is between 1 and 3.
How should I do it? Below is my data frame df with two columns: "con" for the condition and "val" for the value to use.
dput(df)
structure(list(con = c(-33.09524956, -36.120924, -28.7020053,
-26.06385399, -18.45731163, -14.51817928, -20.1005132, -23.62346403,
-24.90464018, -23.51471516), val = c(0.016808197, 1.821442227,
4.078385886, 3.763593573, 2.617612605, 2.691796601, 1.060565469,
0.416400183, 0.348732675, 1.185505136)), .Names = c("con", "val"
), row.names = c(NA, 10L), class = "data.frame")
This might do it. First we write a function to change FALSE values to NA
foo <- function(x) {
is.na(x) <- x == FALSE
return(x)
}
Then apply it over the list of logical vectors and take the matching val column values
df[paste0("col", 1:3)] <- with(df, {
x <- list(con > 3, con < 2, con < 3 & con > 1)
## col3 takes 30% of val, as the question asks; col1 and col2 take val unchanged
Map(function(y, w) w * val[foo(y)], x, c(1, 1, 0.3))
})
resulting in
df
con val col1 col2 col3
1 -33.09525 0.0168082 NA 0.0168082 NA
2 -36.12092 1.8214422 NA 1.8214422 NA
3 -28.70201 4.0783859 NA 4.0783859 NA
4 -26.06385 3.7635936 NA 3.7635936 NA
5 -18.45731 2.6176126 NA 2.6176126 NA
6 -14.51818 2.6917966 NA 2.6917966 NA
7 -20.10051 1.0605655 NA 1.0605655 NA
8 -23.62346 0.4164002 NA 0.4164002 NA
9 -24.90464 0.3487327 NA 0.3487327 NA
10 -23.51472 1.1855051 NA 1.1855051 NA
You could go the tidyverse way. The pipe %>% sends the output of each operation to the next function. mutate adds a new column to the data frame, but you have to remember to assign the result; here it is stored as output. ifelse conditionally assigns values to the new column, for example col1: its second argument is the value used when the condition is TRUE, and its third argument is the value used when it is FALSE. Hope this helps!
Go tidyverse!
library(tidyverse)
output <- df %>%
mutate(col1=ifelse(con>3, val, NA)) %>%
mutate(col2=ifelse(con<2, val, NA)) %>%
mutate(col3=ifelse(con<=3 & con>=1, 0.3*val, NA))
Here's a df that actually meets some of the conditions:
structure(list(con = c(-33.09524956, 2.5, -28.7020053, 2, -18.45731163,
2, -20.1005132, 6, -24.90464018, -23.51471516), val = c(0.016808197,
1.821442227, 4.078385886, 3.763593573, 2.617612605, 2.691796601,
1.060565469, 0.416400183, 0.348732675, 1.185505136)), .Names = c("con",
"val"), row.names = c(NA, 10L), class = "data.frame")
Here's the output after running the code:
con val col1 col2 col3
1 -33.09525 0.0168082 NA 0.0168082 NA
2 2.50000 1.8214422 NA NA 0.5464327
3 -28.70201 4.0783859 NA 4.0783859 NA
4 2.00000 3.7635936 NA NA 1.1290781
5 -18.45731 2.6176126 NA 2.6176126 NA
6 2.00000 2.6917966 NA NA 0.8075390
7 -20.10051 1.0605655 NA 1.0605655 NA
8 6.00000 0.4164002 0.4164002 NA NA
9 -24.90464 0.3487327 NA 0.3487327 NA
10 -23.51472 1.1855051 NA 1.1855051 NA
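For comparison, a sketch of the same three columns in base R, without loading any package. The five rows are made-up values chosen so that every condition is hit at least once.

```r
# Assumption: small illustrative data, not the data from the question
df <- data.frame(con = c(-33.1, 2.5, 2, 6, 1.5),
                 val = c(0.02, 1.82, 3.76, 0.42, 1.00))

df$col1 <- ifelse(df$con > 3, df$val, NA)                        # val where con > 3
df$col2 <- ifelse(df$con < 2, df$val, NA)                        # val where con < 2
df$col3 <- ifelse(df$con >= 1 & df$con <= 3, 0.3 * df$val, NA)   # 30% of val where 1 <= con <= 3
df
```

The conditions for col2 and col3 overlap for con between 1 and 2, so a row such as the last one gets a value in both columns. Whether "between 1 and 3" should include the endpoints is left open in the question; this sketch includes them.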

Can you use rbind.fill without having it fill in NA's?

I am trying to combine two dataframes with different numbers of columns and different column headers. However, after I combine them using rbind.fill(), the resulting file has filled the empty cells with NA.
This is very inconvenient since one of the columns has data that is also represented as "NA" (for North America), so when I export it to a CSV, the spreadsheet can't tell them apart.
Is there a way for me to:
Use the rbind.fill function without having it populate the empty cells with NA
or
Change the column to replace the NA values*
*I've scoured the blogs, and have tried the two most popular solutions:
df$col[is.na(df$col)] <- 0 # it does not work
df$col = ifelse(is.na(df$col), "X", df$col) # it changes all the characters to numbers, and ruins the column
Let me know if you have any advice! I (unfortunately) cannot share the df, but will be willing to answer any questions!
NA is not the same as "NA" to R, but might be interpreted as such by your favourite spreadsheet program. NA is a special value in R just like NaN (not a number). If I understand correctly, one of your solutions is to replace the "NA" values in the column representing North America with something else, in which case you should just be able to do...
df$col[ df$col == "NA" ] <- "NorthAmerica"
This assumes that your "NA" values are actually character strings. is.na() returns FALSE for character strings, which is why df$col[ is.na(df$col) ] <- 0 won't work.
An example of the difference between NA and "NA":
> x <- c( 1, 2, 3 , "NA" , 4 , 5 , NA )
> x[ !is.na(x) ]
[1] "1" "2" "3" "NA" "4" "5"
> x[ x == "NA" & !is.na(x) ]
[1] "NA"
Method to resolve this
I think you want to leave "NA" and any NAs as they are in the first df, but make all NA in the second df formed from rbind.fill() change to something like "NotAvailable". You can accomplish this like so...
library(plyr) # provides rbind.fill()
df1 <- data.frame( col = rep( "NA" , 6 ) , x = 1:6 , z = rep( 1 , 6 ) )
df2 <- data.frame( col = rep( "SA" , 2 ) , x = 1:2 , y = 5:6 )
df <- rbind.fill( df1 , df2 )
temp <- df [ (colnames(df) %in% colnames(df2)) ]
temp[ is.na( temp ) ] <- "NotAvailable"
res <- cbind( temp , df[ !( colnames(df) %in% colnames(df2) ) ] )
#df has real NA values in column z and column y. We just want to get rid of y's
df
# col x z y
# 1 NA 1 1 NA
# 2 NA 2 1 NA
# 3 NA 3 1 NA
# 4 NA 4 1 NA
# 5 NA 5 1 NA
# 6 NA 6 1 NA
# 7 SA 1 NA 5
# 8 SA 2 NA 6
#res has "NA" strings in col representing "North America" and NA values in z, whilst those in y have been removed
#More generally, any NA in df1 will be left 'as-is', whilst NA from df2 formed using rbind.fill will be converted to character string "NotAvailable"
res
# col x y z
# 1 NA 1 NotAvailable 1
# 2 NA 2 NotAvailable 1
# 3 NA 3 NotAvailable 1
# 4 NA 4 NotAvailable 1
# 5 NA 5 NotAvailable 1
# 6 NA 6 NotAvailable 1
# 7 SA 1 5 NA
# 8 SA 2 6 NA
If you have a dataframe that contains NA's and you want to replace them all you can do something like:
df[is.na(df)] <- -999
This will take care of all NAs in one shot.
If you only want to act on a single column, you can do something like:
df$col[which(is.na(df$col))] <- -999
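If the real goal is a CSV in which missing cells and the string "NA" stay distinguishable, another option is to leave the data alone and change how missing values are written. A sketch using the na argument of write.csv(); the data frame and the column name region are made up for illustration.

```r
# Assumption: a character column where "NA" means North America,
# plus genuinely missing cells in both columns
df <- data.frame(region = c("NA", "SA", NA),
                 x = c(1, NA, 3),
                 stringsAsFactors = FALSE)

out_file <- tempfile(fileext = ".csv")

# na = "" writes missing values as empty cells instead of the text NA
write.csv(df, out_file, na = "", row.names = FALSE)

# Reading back with na.strings = "" treats only empty cells as missing
back <- read.csv(out_file, na.strings = "", stringsAsFactors = FALSE)
back
```

How a spreadsheet program displays the file is up to that program, but the empty cells no longer share their text with the "NA" strings.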
