I am trying to find an efficient (fast to run and simple to code) way to do what rbind.fill does, but in base R. From my searching, there are plenty of package functions such as smartbind, bind_rows, and rbind on data.table, but, as stated, I need a solution in base R. I found this:
df3 <- rbind(df1, df2[, names(df1)])
It comes from an answer to this question, but it drops the extra columns, whereas I want them added and filled with NA.
EDIT It would also be nice if this method worked when one data.frame is empty and the other is populated, essentially just returning the populated one. (This is for the sake of simplicity; if it's not possible, it's not hard to just replace the variable with the new data.frame if it's empty.)
EDIT2 I would also like it to bind by column names for the columns which are labeled the same. Additionally, the first data frame can be both bigger and smaller than the second one, and both may have columns the other does not have.
EDIT3 As suggested by a comment, here is an example input and the output I would like (I just made up the numbers; they don't really matter).
#inputs
a <- data.frame(aaa=c(1, 1, 2), bbb=c(2, 3, 3), ccc=c(1, 3, 4))
b <- data.frame(aaa=c(8, 5, 4), bbb=c(1, 1, 4), ddd=c(9, 9, 9), eee=c(1, 2, 4))
#desired output
aaa bbb ccc ddd eee
1 2 1 NA NA
1 3 3 NA NA
2 3 4 NA NA
8 1 NA 9 1
5 1 NA 9 2
4 4 NA 9 4
While I've been using R for a few weeks, it's still relatively new to me so I haven't gotten the mechanics down enough yet to actually make a solution, though I've been thinking about using intersect somehow with names(a) and names(b) and trying to bind only those columns first, and then adding the other ones in somehow, but I'm not really sure where to go from here / how to actually implement it in an 'R' way...
I don't know how efficient it may be, but one simple way to code this would be to add the missing columns to each data frame and then rbind together.
rbindx <- function(..., dfs=list(...)) {
ns <- unique(unlist(sapply(dfs, names)))
do.call(rbind, lapply(dfs, function(x) {
for(n in ns[! ns %in% names(x)]) {x[[n]] <- NA}; x }))
}
a <- data.frame(aaa=c(1, 1, 2), bbb=c(2, 3, 3), ccc=c(1, 3, 4))
b <- data.frame(aaa=c(8, 5, 4), bbb=c(1, 1, 4), ddd=c(9, 9, 9), eee=c(1, 2, 4))
rbindx(a, b)
# aaa bbb ccc ddd eee
# 1 1 2 1 NA NA
# 2 1 3 3 NA NA
# 3 2 3 4 NA NA
# 4 8 1 NA 9 1
# 5 5 1 NA 9 2
# 6 4 4 NA 9 4
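The question's first EDIT also asks for an empty data.frame to be handled, and assigning a length-one NA to a zero-row data frame may fail. Below is a minimal sketch of one way around that, assuming a hypothetical wrapper named rbindx_safe (not part of the answer above) that drops zero-column inputs before padding the rest:

```r
# Hypothetical helper: the name rbindx_safe is an assumption, not from the answer
rbindx_safe <- function(...) {
  # Drop inputs that have no columns at all, e.g. data.frame()
  dfs <- Filter(function(x) ncol(x) > 0, list(...))
  if (length(dfs) == 0) return(data.frame())
  # Union of all column names, in order of first appearance
  ns <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(x) {
    # Size the NA column to the frame so a zero-row frame gets a zero-length column
    for (n in setdiff(ns, names(x))) x[[n]] <- rep(NA, nrow(x))
    x[ns]  # put the columns in the same order everywhere
  })
  do.call(rbind, padded)
}

a <- data.frame(aaa=c(1, 1, 2), bbb=c(2, 3, 3), ccc=c(1, 3, 4))
b <- data.frame(aaa=c(8, 5, 4), bbb=c(1, 1, 4), ddd=c(9, 9, 9), eee=c(1, 2, 4))
both <- rbindx_safe(a, b)               # 6 rows, columns aaa bbb ccc ddd eee
only_a <- rbindx_safe(data.frame(), a)  # the empty frame is ignored
```

Filtering on ncol rather than nrow is a design choice: a frame with columns but no rows still contributes its column names to the result.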
Just use rbind.fill. If you can't install the plyr package, pull out the parts you need.
rbind.fill seems to have very few internal dependencies: plyr::compact is a one-liner, and plyr:::output_template depends on plyr:::allocate_column, but at a glance that looks like it is all base code. So copy those 4 functions (attribute the source and make sure the license is compatible with your use; the current version on CRAN uses the MIT license, which is quite permissive, so you just need to keep the copyright and license notice with the copied code), and then you have the real implementation of rbind.fill.
Why take this approach? Because, as Aaron points out, you know it works. It has been in use and debugged for years.
I'm pretty new to R so the answer might be obvious, but so far I have only found answers to similar problems that don't match, or which I can't translate to mine.
The Requirement:
I have two vectors of the same length which contain numeric values as well as NA-values which might look like:
[1] 12 8 11 9 NA NA NA
[1] NA 7 NA 10 NA 11 9
What I need now is two vectors that only contain those values that are not NA in both original vectors, so in this case the result should look like this:
[1] 8 9
[1] 7 10
I was thinking about simply going through the vectors in a loop, but the dataset is quite large so I would appreciate a faster solution to that... I hope someone can help me on that...
You are looking for complete.cases, but you should put your vectors in a data.frame first.
dat <- data.frame(x=c(12 ,8, 11, 9, NA, NA, NA),
y=c(NA ,7, NA, 10, NA, 11, 9))
dat[complete.cases(dat),]
x y
2 8 7
4 9 10
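If the two vectors should stay separate, complete.cases also accepts several vectors directly. A small sketch of that variant, where the names a, b, keep, a_clean and b_clean are only illustrative:

```r
a <- c(12, 8, 11, 9, NA, NA, NA)
b <- c(NA, 7, NA, 10, NA, 11, 9)

# TRUE only at positions where neither vector is NA
keep <- complete.cases(a, b)

a_clean <- a[keep]  # 8 9
b_clean <- b[keep]  # 7 10
```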
Try this:
#dummy vector
a <- c(12,8,11,9,NA,NA,NA)
b <- c(NA,7,NA,10,NA,11,9)
#result
a[!is.na(a) & !is.na(b)]
b[!is.na(a) & !is.na(b)]
Something plus NA in R is generally NA. So, using that piece of information, you can simply do:
cbind(a, b)[!is.na(a + b), ]
# a b
# [1,] 8 7
# [2,] 9 10
More generally, you could write a function like the following to easily accept any number of vectors:
myFun <- function(...) {
myList <- list(...)
Names <- sapply(substitute(list(...)), deparse)[-1]
out <- do.call(cbind, myList)[!is.na(Reduce("+", myList)), ]
colnames(out) <- Names
out
}
With that function, the usage would be:
myFun(a, b)
# a b
# [1,] 8 7
# [2,] 9 10
In my timings, this is by far the fastest option here, but that's only important if you are able to detect differences down to the microseconds or if your vector lengths are in the millions, so I won't bother posting benchmarks.
I have a data frame with some information. Some data is NA. Something like:
id fact sex
1 1 3 M
2 2 6 F
3 3 NA <NA>
4 4 8 F
5 5 2 F
6 6 2 M
7 7 NA <NA>
8 8 1 F
9 9 10 M
10 10 10 M
I have to change fact by some rule (e.g. multiply by 3 the elements that have sex == "M").
I tried survey$fact[survey$sex== "M"] <- survey$fact[survey$sex== "M"] * 3, but I get an error because of the NAs.
I know I can check whether an element is NA with is.na(x) and add this condition inside [...], but I hope a more elegant solution exists.
I really like ifelse, it always seems to have the desired behaviour with respect to NA values for me.
survey$fact <- ifelse(survey$sex == "M", survey$fact * 3, survey$fact)
?ifelse shows that the first argument is the test, the second the value assigned if the test is true and the final argument the value if false. If you assign the original data.frame column as the false return value, it will assign rows for which the test fails without modifying them.
This is an extension of what you asked, to show that you can also test for NA values.
survey$fact <- ifelse(is.na(survey$sex), survey$fact * 2, survey$fact)
I also like that it's very readable.
Using which() filters out those NAs:
survey$fact[which(survey$sex == "M")] <- survey$fact[which(survey$sex== "M")] * 3
There are many ways you can make that a little cleaner, e.g.:
males <- which(survey$sex == "M")
survey$fact[males] <- 3 * survey$fact[males]
or
survey <- within(survey, fact[males] <- 3 * fact[males])
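For a reproducible check, here is a sketch that rebuilds the example data (values copied from the question) and uses %in%, an alternative not mentioned in the answers above. Unlike ==, %in% returns FALSE rather than NA for missing values, so the logical index is safe to use in an assignment:

```r
# Example data copied from the question
survey <- data.frame(
  id   = 1:10,
  fact = c(3, 6, NA, 8, 2, 2, NA, 1, 10, 10),
  sex  = c("M", "F", NA, "F", "F", "M", NA, "F", "M", "M")
)

# %in% yields FALSE (not NA) where sex is missing
is_male <- survey$sex %in% "M"
survey$fact[is_male] <- survey$fact[is_male] * 3

survey$fact
# [1]  9  6 NA  8  2  6 NA  1 30 30
```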
I have a vector that I append to in a loop, and the code below is pretty slow because nrow is big.
All I want is to speed it up. I have tried c() and append(), and neither seems fast enough.
I have also checked "Efficiently adding or removing elements to a vector or list in R?"
Here is the code:
compare<-vector()
for (i in 1:nrow(domin)){
for (j in 1:nrow(domin)){
a=0
if ((domin[i,]$GPA>domin[j,]$GPA) & (domin[i,]$SAT>domin[j,]$SAT)){
a=1
}
compare<-c(compare,a)
}
print(i)
}
I found it hard to work out the index into compare if I preallocate it with
#compare<-rep(0,times=nrow(opt_predict)*nrow(opt_predict))
The information you want would be better placed in a matrix:
v1 <- 1:3
v2 <- c(1,2,2)
mat1 <- outer(v1,v1,`>`)
mat2 <- outer(v2,v2,`>`)
both <- mat1 & mat2
To see which positions the inequality holds for, use which:
which(both,arr.ind=TRUE)
# row col
# [1,] 2 1
# [2,] 3 1
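Applied to the data frame from the question, the same idea could look like the sketch below. The contents of domin here are made up as a stand-in, and the flattening step is an assumption about what the asker wants: it reproduces the order in which the nested loop appends values (i in the outer loop, j in the inner loop), so element (i - 1) * n + j of compare corresponds to the pair (i, j). The result is stored as integer 0/1.

```r
# Hypothetical stand-in for the asker's 'domin' data frame
domin <- data.frame(GPA = c(3.0, 3.5, 3.9), SAT = c(1200, 1400, 1300))
n <- nrow(domin)

# dominates[i, j] is TRUE when row i beats row j on both GPA and SAT
dominates <- outer(domin$GPA, domin$GPA, ">") & outer(domin$SAT, domin$SAT, ">")

# Transpose before flattening so the values come out row by row,
# matching the loop order in the question
compare <- as.integer(t(dominates))
compare
# [1] 0 0 0 1 0 0 1 0 0

# The same vector built with a preallocated loop, to show the index formula
compare_loop <- integer(n * n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    wins <- domin$GPA[i] > domin$GPA[j] && domin$SAT[i] > domin$SAT[j]
    compare_loop[(i - 1) * n + j] <- as.integer(wins)
  }
}
```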
Comments:
This answer should be a lot faster than your loop. However, you are really just sorting two vectors, so there is probably a faster way to do this than taking the exhaustive set of inequalities...
In your case, there is only a partial ordering (since, for a given i and j, it is possible that neither one is strictly greater than the other in both dimensions). If you were satisfied with sorting first on v1 and then on v2, you could use the data.table package to easily get a full ordering:
set.seed(1)
v1 <- sample.int(10,replace=TRUE)
v2 <- sample.int(10,replace=TRUE)
require(data.table)
DT <- data.table(v1,v2)
setkey(DT)
DT[,rank:=.GRP,by='v1,v2']
which gives
v1 v2 rank
1: 1 8 1
2: 3 3 2
3: 3 8 3
4: 4 2 4
5: 6 7 5
6: 7 4 6
7: 7 10 7
8: 9 5 8
9: 10 4 9
10: 10 8 10
It depends on what you were planning to do next.
I have a data set with around 400 observations (rows). For each row I need to find the root of a function such as: f(x)= variable_1 - variable_2 + x.
For finding the root I want to use the function uniroot.all(f,interval) from the rootSolve package.
My question is, how do I do this for every row. Should I use a loop or would "apply" be more suitable to do this?
With "apply" I tried the following code, but I always get an error message.
> library(rootSolve)
> df<-as.data.frame(matrix(1:6,ncol=2))
> df
V1 V2
1 1 4
2 2 5
3 3 6
> apply(df,1,uniroot.all(fun<- function(x) df$V1-df$V2 + x, interval=c(0,100)))
Thanks a lot!
Here is the correct syntax when using apply:
apply(df, 1,
function(z) uniroot.all(function(x)z[1]-z[2]+x,
interval = c(0,100)))
# [1] 3 3 3
Personally, I like using the plyr package for this kind of things, so I can access variables by their column names (here V1 and V2):
library(plyr)
adply(df, 1, summarize,
solution = uniroot.all(function(x)V1-V2+x,
interval = c(0,100)))
# V1 V2 solution
# 1 1 4 3
# 2 2 5 3
# 3 3 6 3
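If installing packages is a concern, a similar row-wise pattern can be sketched in base R with stats::uniroot and mapply. This swaps in a different function from the one used above: uniroot returns a single root and needs the function to change sign across the interval, whereas uniroot.all can return several roots. The argument names v1 and v2 are illustrative.

```r
df <- as.data.frame(matrix(1:6, ncol = 2))

# One uniroot() call per row; v1 and v2 are the row's values of V1 and V2
roots <- mapply(
  function(v1, v2) {
    uniroot(function(x) v1 - v2 + x, interval = c(0, 100))$root
  },
  df$V1, df$V2
)

roots  # each value is approximately 3 (within uniroot's tolerance)
```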
Suppose I have a data.frame like:
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
a b c
1 1 4 5
2 2 3 4
3 3 5 3
4 4 2 2
5 5 1 1
and I need to replace every 5 with NA in columns b and c, then assign the result back to df:
df
a b c
1 1 4 NA
2 2 3 4
3 3 NA 3
4 4 2 2
5 5 1 1
But I want to use a generic apply() function instead of calling replace() on each column one by one, because there are actually many variables that need replacing in the real data. Suppose I've defined a variable list:
var <- c("b", "c")
and come up with something like:
df <- within(df, sapply(var, function(x) x <- replace(x, x==5, NA)))
but nothing happens. I was wondering whether there is a way to make this work with something similar to the above, by passing a list of column names from a data.frame into a generic apply / plyr function (or maybe in some completely different way). Thanks~
You could just do
df[,var][df[,var] == 5] <- NA
df <- data.frame(a=1:5, b=sample(1:5, 5, replace=TRUE), c=5:1)
df
var <- c("b","c")
df[,var] <- sapply(df[,var],function(x) ifelse(x==5,NA,x))
df
I find the ifelse notation easier to understand here, but most Rers would probably use indexing instead.
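Since the question asks for something built on replace(), here is a sketch of that variant with lapply. The attempt in the question does nothing because sapply iterates over the strings "b" and "c" rather than over the columns, and the result is never assigned back. The value of b is fixed here (instead of sampled) so the outcome is reproducible.

```r
# Fixed data matching the printed example in the question
df <- data.frame(a = 1:5, b = c(4L, 3L, 5L, 2L, 1L), c = 5:1)
var <- c("b", "c")

# df[var] is a list of columns; lapply returns a list of the same shape,
# which can be assigned straight back
df[var] <- lapply(df[var], function(x) replace(x, x == 5, NA))

df$b  # 4 3 NA 2 1
df$c  # NA 4 3 2 1
```

Assigning a list back to df[var] keeps each column's own type, whereas sapply first collapses the columns into a single matrix.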