R smartbind produces extra row

Adding rows to a data frame is easy with rbind, but I find that smartbind is often better at handling exceptions such as data frames with different column names.
However, iteratively adding rows with smartbind produces an additional row in some instances:
library(gtools)
alldf <- data.frame()
for (i in 1:3) {
  df <- data.frame(x=i)
  alldf <- smartbind(df,alldf)
}
smartbind :
> alldf
x
1 3
2:1 2
2:2 1
2:3 1
rbind :
> alldf
x
1 3
2 2
3 1
I don't have a clue why smartbind does this. I've tried removing the row names with rownames(alldf) <- NULL, but that doesn't seem to change anything. I can use rbind instead for now, or I could initialize alldf on the first iteration of the loop, but that seems like a hassle. Plus, I sometimes prefer smartbind, so I would like to fix this.
Thanks for reading.
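A minimal sketch of one way to sidestep the empty starting data frame, using base R only. It does not call smartbind and does not explain why smartbind adds the extra row; it only shows the pattern of collecting the pieces in a list and binding once at the end, so nothing is ever bound to an empty data.frame(). The name pieces is made up for illustration.

```r
# Build each one-row data frame first, without binding inside the loop.
pieces <- lapply(1:3, function(i) data.frame(x = i))

# Bind all pieces in a single call; no empty data frame is involved.
alldf <- do.call(rbind, pieces)

print(alldf)
```

The same list could presumably be handed to another binding function in place of rbind, but that is an untested assumption here.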

Related

data.table rbindlist column of lists

I want to transform a nested list to a data.table but I get this error:
library(data.table)
res = list(list(a=1, b=2, c=list(1,2,3,4,5)), list(a=2, b=3, c=list(1,2,3,4,5)))
rbindlist(res)
Error in rbindlist(res) :
Column 3 of item 1 is length 5, inconsistent with first column of that item which is length 1. rbind/rbindlist doesn't recycle as it already expects each item to be a uniform list, data.frame or data.table
My result should look like this:
data.table(a=c(1,2), b=c(2,3), c=list(c(1,2,3,4,5), c(1,2,3,4,5)))
a b c
1: 1 2 1,2,3,4,5
2: 2 3 1,2,3,4,5
Is there a way to transform this list? It should work without knowing the column names beforehand.
What you want to do is easy, but your input must be formatted as follows:
library(data.table)
res = list(list(a=1, b=2, c=list(c(1,2,3,4,5))), list(a=2, b=3, c=list(c(1,2,3,4,5))))
rbindlist(res)
You can transform your input with the following code:
res = lapply(res, function(x)
{
  # collapse the third element into a single vector wrapped in a list
  x[[3]] <- list(unlist(x[[3]]))
  return(x)
})
Generally, R wants a column of a data.frame or data.table to be like a vector, meaning all single values. And rbindlist expects a list of data.frames, or things that can be treated as or converted to data.frames. So your code fails because it first tries to transform your input into 2 data.frames, in which the third column is longer than the first and second.
But it is possible: you need to force R to construct 2 data.frames with just one row each. So we need the value of c to have length one, which we can do by making it a list with one element: a vector of length 5. And then we need to tell R to really treat it "AsIs", and not convert it to something of length 5. For that, we have the function I, which does nothing more than mark its input as "AsIs":
res <- list(data.frame(a=1, b=2, c=I(list(c(1,2,3,4,5)))),
            data.frame(a=2, b=3, c=I(list(c(1,2,3,4,5)))))
res2 <- rbindlist(res)
The data.frame calls are not even necessary; it also works with list. But generally, I think it works best not to rely on how other functions must first convert your input.
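To make the effect of I() concrete, here is a small base R sketch (no data.table needed) comparing the row counts with and without the "AsIs" marker. The names with_I and without_I are made up for illustration.

```r
# With I(): the list is kept as a single cell, so the data frame has one row.
with_I <- data.frame(a = 1, b = 2, c = I(list(c(1, 2, 3, 4, 5))))

# Without I(): the list is expanded into a length-5 column and a, b are recycled.
without_I <- data.frame(a = 1, b = 2, c = list(c(1, 2, 3, 4, 5)))

print(nrow(with_I))
print(nrow(without_I))

# The single cell of the list column still holds the whole vector.
print(length(with_I$c[[1]]))
```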
I think you need to first process every sublist and convert it to achieve your example output, like so:
library(data.table)
res = list(list(a=1, b=2, c=list(1,2,3,4,5)), list(a=2, b=3, c=list(1,2,3,4,5)))
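DTlist <- lapply(res, function(row_){
  lapply(row_, function(col_){
    # is.list() is safer than class(col_) == 'list', which breaks
    # inside if() when an object has more than one class
    if(is.list(col_)){
      list(unlist(col_))
    }else{
      col_
    }
  })
})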
rbindlist(DTlist)
The result would be
a b c
1: 1 2 1,2,3,4,5
2: 2 3 1,2,3,4,5
Sorry, I edited this post because I didn't initially recognize what the OP was trying to do. This also works if the OP doesn't know which column holds the sublist.

In R: remove rows containing non-integer values (such as characters) from a data frame

My data frame df looks as follows:
Variable A   Variable B   Variable C
9            2            1
2            0            don't know
maybe        1            1
?            0            3
I need to remove all rows that contain non-numerical values. It should look like this afterwards:
Variable A   Variable B   Variable C
9            2            1
I thought about something like
df[! grepl(*!= numerical*, df),]
or
df[! df %in% *!= numerical*, ]
but I can't find anything I could use as input for "take everything that doesn't match numerical values". Could you please help me?
Thanks a lot!
One option is to loop through the columns and convert each to numeric, so that all non-numeric elements become NA. Then check for NA with is.na, negate it with !, combine the resulting logical vectors element-wise with Reduce and &, and use the result to subset the rows.
df[Reduce(`&`, lapply(df, function(x) !is.na(as.numeric(x)))),]
This might not be the best way to do it, but it works.
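A self-contained sketch of this approach on toy data shaped like the question's. The column names A, B, C and the helper is_num are assumptions made for illustration. The helper goes through as.character() first, because calling as.numeric() directly on a factor column would return the integer level codes instead of NA.

```r
# Toy data mirroring the question; every column is stored as character.
df <- data.frame(
  A = c("9", "2", "maybe", "?"),
  B = c("2", "0", "1", "0"),
  C = c("1", "don't know", "1", "3"),
  stringsAsFactors = FALSE
)

# TRUE where a value parses as a number. suppressWarnings() hides the
# expected "NAs introduced by coercion" warning.
is_num <- function(x) !is.na(suppressWarnings(as.numeric(as.character(x))))

# A row is kept only if every column is numeric in that row.
keep <- Reduce(`&`, lapply(df, is_num))
clean <- df[keep, ]

print(clean)
```

The surviving columns are still character at this point; something like clean[] <- lapply(clean, as.numeric) would convert them afterwards.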
Here s is the data frame that contains your data:
# TRUE for each row that has at least one cell with a non-numeric character
contains <- vapply(seq_len(nrow(s)), function(i){
  # regex: any character other than a digit or "."
  length(grep("[^0-9.]", s[i,])) > 0
}, logical(1))
s <- s[!contains,]
Thanks!

Lookup of entries with multiplicities

Suppose I have a vector data <- c(1,2,2,1) and a reference table, say ref <- cbind(c(1,1,2,2,2,2,4,4), c(1,2,3,4,5,6,7,8)).
I would like my code to return the following vector: result <- c(1,2,3,4,5,6,3,4,5,6,1,2). It's like using the R function match(), but match() only returns the first occurrence in the reference vector. The same goes for %in%.
I have tried functions like merge() and join(), but I would like something that uses only a combination of the R functions rep() and seq().
You can try:
ref[ref[,1] %in% data,2]
This returns the second-column value wherever the first-column value occurs in data, but each matching row only once. To repeat the matches for every element of data, wrap the lookup in lapply:
unlist(lapply(data, function(x) ref[ref[,1] ==x, 2]))
You can get the indices you are looking for like this:
indices <- sapply(data,function(xx)which(ref[,1]==xx))
Of course, that is a list, since the number of hits will be different for each entry of data. So you just unlist() this:
ref[unlist(indices),2]
[1] 1 2 3 4 5 6 3 4 5 6 1 2
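Putting the pieces together, here is a runnable base R sketch of the lookup with multiplicities on the question's own data. The second variant, which pre-splits the reference column by key, is an alternative not taken from the answers above; the names result_lapply, groups and result_split are made up for illustration.

```r
data <- c(1, 2, 2, 1)
ref  <- cbind(c(1, 1, 2, 2, 2, 2, 4, 4), c(1, 2, 3, 4, 5, 6, 7, 8))

# Variant 1: for each element of data, pull all matching rows of ref.
result_lapply <- unlist(lapply(data, function(x) ref[ref[, 1] == x, 2]))

# Variant 2: split column 2 by the key in column 1 once, then index the
# resulting list by key name (keys are converted to character names).
groups <- split(ref[, 2], ref[, 1])
result_split <- unlist(groups[as.character(data)], use.names = FALSE)

print(result_lapply)
print(result_split)
```

Variant 2 scans ref only once, which may matter when data is long. It assumes every value in data occurs as a key in ref.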

How to avoid listing out function arguments but still subset too?

I have a function myFun(a,b,c,d,e,f,g,h) which contains a vectorised expression of its parameters within it.
I'd like to add a new column: data$result <- with(data, myFun(A,B,C,D,E,F,G,H)) where A,B,C,D,E,F,G,H are column names of data. I'm using data.table but data.frame answers are appreciated too.
So far the parameter list (column names) can be tedious to type out, and I'd like to improve readability. Is there a better way?
> myFun <- function(a,b,c) a+b+c
> dt <- data.table(a=1:5,b=1:5,c=1:5)
> with(dt,myFun(a,b,c))
[1] 3 6 9 12 15
The ultimate thing I would like to do is:
dt[isFlag, newCol:=myFun(A,B,C,D,E,F,G,H)]
However:
> dt[a==1,do.call(myFun,dt)]
[1] 3 6 9 12 15
Notice that the j expression seems to ignore the subset. The result should be just 3.
Ignoring the subset aspect for now: df$result <- do.call("myFun", df). But that copies the whole df, whereas data.table allows you to add the column by reference: df[,result:=myFun(A,B,C,D,E,F,G,H)].
To include the comment from @Eddi (and I'm not sure how to combine these operations so easily with a data.frame):
dt[isFlag, newCol := do.call(myFun, .SD)]
Note that .SD can be used even when you aren't grouping, just subsetting.
Or, if your function literally just adds its arguments together element-wise (note that do.call(sum, .SD) would collapse all columns into a single number):
dt[isFlag, newCol := Reduce(`+`, .SD)]
This automatically places NA into newCol where isFlag is FALSE.
You can use
df$result <- do.call(myFun, df)
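For a plain data.frame, the subset can be combined with do.call along these lines. This is a sketch using the toy myFun from the question; the filter flag and the vector arg_cols are hypothetical names. do.call matches columns to arguments by name, so the data frame passed in must contain exactly the columns the function expects (an extra column such as result would cause an "unused argument" error).

```r
myFun <- function(a, b, c) a + b + c
df <- data.frame(a = 1:5, b = 1:5, c = 1:5)

arg_cols <- c("a", "b", "c")   # columns handed to myFun, matched by name
flag <- df$a == 1              # hypothetical row filter

# Start with NA everywhere, then fill only the flagged rows.
df$result <- NA_integer_
df$result[flag] <- do.call(myFun, df[flag, arg_cols])

print(df)
```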

Colwise eats column names within ddply

I'm trying to chunk through a data frame, find instances where the sub-data frames are unbalanced, and add 0 values for certain levels of a factor that are missing. To do this, within ddply, I did a quick comparison to a set vector of what levels of a factor should be there, and then create some new rows, replicating the first row of the subdata set but modifying their values, and then rbinding them to the old data set.
I use colwise to do the replication.
This works great outside of ddply. Inside of ddply, however, the identifying columns get eaten, and rbind borks on me. It's curious behavior. See the following code, with some debugging print statements thrown in to show the difference in results:
library(plyr)
#a test data frame
g <- data.frame(a=letters[1:5], b=1:5)
#repeat rows using colwise
rep.row <- function(r, n){
  colwise(function(x) rep(x, n))(r)
}
#if I want to do this with just one row, I get all of the columns
rep.row(g[1,],5)
This is fine. It prints:
a b
1 a 1
2 a 1
3 a 1
4 a 1
5 a 1
#but, as soon as I use ddply to create some new data
#and try and smoosh it to the old data, I get errors
ddply(g, .(a), function(x) {
  newrows <- rep.row(x[1,],5)
  newrows$b <- 0
  rbind(x, newrows)
})
This gives
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
You can see the problem with this debugging version:
#So, what is going on here?
ddply(g, .(a), function(x) {
  newrows <- rep.row(x[1,],5)
  newrows$b <- 0
  print(x)
  print("\n\n")
  print(newrows)
  rbind(x, newrows)
})
You can see that x and newrows have different columns - they differ in a.
a b
1 a 1
[1] "\n\n"
b
1 0
2 0
3 0
4 0
5 0
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
What is going on here? Why do the identifying columns get eaten when I use colwise on a sub-data frame?
It's a funny interaction between ddply and colwise, it seems. More specifically, the problem occurs when colwise calls strip_splits and finds a vars attribute that was given by ddply.
As a workaround, try putting this line first in your function:
attr(x, "vars") <- NULL
# your code follows
newrows <- rep.row(x[1,],5)
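If colwise is not essential, the row replication itself can be sketched with plain base R row indexing, which keeps every column and does not look at the vars attribute. This is an alternative to the workaround above, not a fix for the plyr interaction; rep_row_base is a hypothetical name.

```r
g <- data.frame(a = letters[1:5], b = 1:5)

# Repeat the first row of r n times by indexing row 1 repeatedly.
rep_row_base <- function(r, n) {
  out <- r[rep(1, n), , drop = FALSE]
  rownames(out) <- NULL   # reset row names such as "1.1", "1.2"
  out
}

newrows <- rep_row_base(g[1, ], 5)
newrows$b <- 0

# Both pieces have the same columns, so rbind succeeds.
combined <- rbind(g[1, ], newrows)
print(combined)
```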
