Colwise eats column names within ddply

I'm trying to chunk through a data frame, find instances where the sub-data frames are unbalanced, and add 0 values for certain levels of a factor that are missing. To do this, within ddply, I do a quick comparison against a set vector of which levels of the factor should be there, create some new rows by replicating the first row of the sub-data frame but modifying its values, and then rbind them to the old data set.
I use colwise to do the replication.
This works great outside of ddply. Inside of ddply, the identifying columns get eaten and rbind borks on me. It's curious behavior. See the following code, with some debugging print statements thrown in, to see the difference in results:
#plyr provides both ddply and colwise
library(plyr)
#a test data frame
g <- data.frame(a=letters[1:5], b=1:5)
#repeat rows using colwise
rep.row <- function(r, n){
  colwise(function(x) rep(x, n))(r)
}
#if I want to do this with just one row, I get all of the columns
rep.row(g[1,],5)
This works fine. It prints:
a b
1 a 1
2 a 1
3 a 1
4 a 1
5 a 1
#but, as soon as I use ddply to create some new data
#and try and smoosh it to the old data, I get errors
ddply(g, .(a), function(x) {
  newrows <- rep.row(x[1,], 5)
  newrows$b <- 0
  rbind(x, newrows)
})
This gives
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
You can see the problem with this debugged version
#So, what is going on here?
ddply(g, .(a), function(x) {
  newrows <- rep.row(x[1,], 5)
  newrows$b <- 0
  print(x)
  print("\n\n")
  print(newrows)
  rbind(x, newrows)
})
You can see that x and newrows have different columns: newrows is missing a.
a b
1 a 1
[1] "\n\n"
b
1 0
2 0
3 0
4 0
5 0
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
What is going on here? Why do the identifying columns get eaten when I use colwise on a sub-data frame?

It's a funny interaction between ddply and colwise, it seems. More specifically, the problem occurs when colwise calls strip_splits and finds a vars attribute that was given by ddply.
As a workaround, try putting this line first in your function:
attr(x, "vars") <- NULL
# your code follows
newrows <- rep.row(x[1,],5)
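Putting it together, a minimal sketch of the full ddply call with the workaround applied (same toy data and rep.row as above); with the vars attribute stripped, colwise should keep the a column and the rbind should succeed:
library(plyr)
ddply(g, .(a), function(x) {
  #drop the "vars" attribute that ddply attaches, so colwise keeps all columns
  attr(x, "vars") <- NULL
  newrows <- rep.row(x[1,], 5)
  newrows$b <- 0
  rbind(x, newrows)
})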

Related

data.table rbindlist column of lists

I want to transform a nested list to a data.table but I get this error:
library(data.table)
res = list(list(a=1, b=2, c=list(1,2,3,4,5)), list(a=2, b=3, c=list(1,2,3,4,5)))
rbindlist(res)
Error in rbindlist(res) :
Column 3 of item 1 is length 5, inconsistent with first column of that item which is length 1. rbind/rbindlist doesn't recycle as it already expects each item to be a uniform list, data.frame or data.table
My result should look like this:
data.table(a=c(1,2), b=c(2,3), c=list(c(1,2,3,4,5), c(1,2,3,4,5)))
a b c
1: 1 2 1,2,3,4,5
2: 2 3 1,2,3,4,5
Is there a way to transform this list? It should work without knowing the column names beforehand.
What you want to do is easy, but your input must be formatted as follows:
library(data.table)
res = list(list(a=1, b=2, c=list(c(1,2,3,4,5))), list(a=2, b=3, c=list(c(1,2,3,4,5))))
rbindlist(res)
You can transform your input with the following code
res = lapply(res, function(x) {
  x[[3]] <- list(unlist(x[[3]]))
  return(x)
})
Generally, R wants a column of a data.frame or data.table to be like a vector, meaning all single values. And rbindlist expects a list of data.frames, or things that can be treated as or converted to data.frames. So the reason your code fails is that it first tries to transform your input into 2 data.frames, for which the third column seems longer than the first and second.
But it is possible: you need to force R to construct 2 data.frames with just one row each. So we need the value of c to have length one, which we can do by making it a list with one element: a vector of length 5. Then we need to tell R to really treat it "AsIs" and not convert it to something of length 5. For that, we have the function I, which does nothing more than mark its input as "AsIs":
res <- list(data.frame(a=1, b=2, c=I(list(c(1,2,3,4,5)))),
            data.frame(a=2, b=3, c=I(list(c(1,2,3,4,5)))))
res2 <- rbindlist(res)
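If the input is built this way, printing res2 should show the list column kept intact, matching the desired output (reproduced here for reference):
res2
   a b         c
1: 1 2 1,2,3,4,5
2: 2 3 1,2,3,4,5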
The data.frame calls are not even necessary; it also works with list. But generally, I think it's best not to rely on how other functions must first convert your input.
I think you need to first process every sublist and convert it to achieve your example output, like so
library(data.table)
res = list(list(a=1, b=2, c=list(1,2,3,4,5)), list(a=2, b=3, c=list(1,2,3,4,5)))
DTlist <- lapply(res, function(row_) {
  lapply(row_, function(col_) {
    if (class(col_) == 'list') {
      list(unlist(col_))
    } else {
      col_
    }
  })
})
rbindlist(DTlist)
The result would be
a b c
1: 1 2 1,2,3,4,5
2: 2 3 1,2,3,4,5
Sorry, the post is edited because I didn't initially recognize what the OP was trying to do. This also works if the OP doesn't know which column holds the sublists.

R smartbind produces extra row

Adding rows to a data frame is easy with rbind, but I find that smartbind is often better at handling exceptions such as data frames with different column names and so on.
However, iteratively adding rows with smartbind produces an additional row in some instances:
library(gtools)
alldf <- data.frame()
for (i in 1:3) {
  df <- data.frame(x=i)
  alldf <- smartbind(df, alldf)
}
smartbind:
> alldf
x
1 3
2:1 2
2:2 1
2:3 1
rbind:
> alldf
x
1 3
2 2
3 1
I don't have a clue why smartbind does this. I've tried fiddling with removing row names (rownames(alldf) <- NULL), but it doesn't seem to change anything. I can use rbind instead for now (see the sketch below), or I could initialize alldf on the first loop iteration, but that seems like a hassle. Plus, I sometimes prefer to use smartbind, so I would like to correct this.
Thanks for reading.
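For reference, a sketch of the rbind route mentioned above: collect the pieces in a list and bind once at the end, which sidesteps binding against an empty data.frame (the made-up loop mirrors the example; do.call(smartbind, pieces) should behave similarly when column names differ):
library(gtools)
pieces <- vector("list", 3)
for (i in 1:3) {
  #each iteration just stores its one-row data frame
  pieces[[i]] <- data.frame(x=i)
}
#bind everything in one go, in loop order
alldf <- do.call(rbind, pieces)
alldf
  x
1 1
2 2
3 3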

Adding Zero to a column in first x rows in R

I am creating a classification model for forecasting purposes. I have several text files which I converted into one large list containing several lists (called comb). I then broke the large list into a separate data frame with each list as its own column (called BI). Because each list may contain a different number of elements, the simpler approach matrix(unlist(l), ncol=ncol) does not work. When reviewing alternatives, I made modifications to compile the following:
max_length <- max(sapply(comb, length))
BI <- sapply(comb, function(x){
  c(x, rep(0, max_length - length(x)))
})
This creates a data frame assigning each list a column and assigning each missing element within that column a value of zero. Those zeros show up at the end of each column, but I would like them to be at the beginning. Here is an example of the current output:
cola colb colc
   2    2    2
   1    1    0
   4    0    0
I need your help in converting my original code to produce the following format:
cola colb colc
   2    0    0
   1    2    0
   4    1    2
It might be sufficient to interchange the order in the concatenation c:
max_length <- max(sapply(comb, length))
BI <- sapply(comb, function(x){
  c(rep(0, max_length - length(x)), x)
})
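As a quick sanity check, a made-up comb with the lengths from the example (assuming comb is a list of numeric vectors) reproduces the desired layout:
comb <- list(cola=c(2, 1, 4), colb=c(2, 1), colc=2)
max_length <- max(sapply(comb, length))
BI <- sapply(comb, function(x) c(rep(0, max_length - length(x)), x))
BI
     cola colb colc
[1,]    2    0    0
[2,]    1    2    0
[3,]    4    1    2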
EDIT: Based on additional information in the comments below, here's an approach that modifies the code in another way. The idea is that, as long as your first approach gives you a proper data frame, we can circumvent the problem by using the order function.
max_length <- max(sapply(comb, length))
BI <- sapply(comb, function(x){
  .zeros <- rep(0, max_length - length(x))
  .rearange <- order(c(1:length(x), .zeros))
  c(x, .zeros)[.rearange]
})
I have tested that this code works on a minor test example I created, but I'm not certain that this example resembles your comb... If this revised approach doesn't work, then it's still possible to first create the data frame with your original code and then reorder one column at a time.

Counting non-missing occurrences

I need help counting the number of non-missing data points across files and subsetting out only two columns of the larger data frame.
I was able to limit the data to only valid responses, but then I struggled to get it to return only two of the columns.
I found http://www.statmethods.net/management/subset.html and tried their solution, but myvars did not hold my column label; it returned the vector of data (1:10). My code was:
myvars <- c("key")
answer <- data_subset[myvars]
answer
But instead of printing out my data subset with only the "key" column, it returns the following errors:
Error in `[.data.frame`(observations_subset, myvars) : undefined columns selected
Error: object 'answer' not found
Lastly, I'm not sure how to count occurrences. Excel has a simple "Count" function, and in SPSS you can aggregate based on the count, but I couldn't find a similarly named command in R. The incredibly long way I was going to go about this, once I had the data subsetted, was adding a column of nothing but 1's and summing those, but I would imagine there is an easier way.
To count unique occurrences, use table.
For example:
# load the "iris" data set that's built into R
data(iris)
# print the count of each species
table(iris$Species)
Take note of the handy function prop.table for converting a table into proportions, and of the fact that table can actually take a second argument to get a cross-tab. There's also an argument useNA, to include missing values as unique items (instead of ignoring them).
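For the non-missing count itself, the usual idiom is summing over !is.na(); a small sketch with made-up data (the column names here are just for illustration):
d <- data.frame(key=c(1, NA, 3, 4, NA), value=c(10, 20, NA, 40, 50))
#count of non-missing entries per column
colSums(!is.na(d))
#  key value
#    3     4
#count of non-missing entries in a single column
sum(!is.na(d$key))
#[1] 3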
Not sure whether this is what you wanted.
Creating some data, since the post mentioned multiple files:
set.seed(42)
d1 <- as.data.frame(matrix(sample(c(NA,0:5), 5*10, replace=TRUE), ncol=10))
set.seed(49)
d2 <- as.data.frame(matrix(sample(c(NA,0:8), 5*10, replace=TRUE), ncol=10))
Create a list with the datasets as the list elements:
l1 <- mget(ls(pattern="d\\d+"))
Create an index to pick the list element that has the most non-missing elements:
indx <- which.max(sapply(l1, function(x) sum(!is.na(x))))
Key of columns to subset from the larger (non-missing) dataset:
key <- c("V2", "V3")
Subset the dataset:
l1[[indx]][key]
# V2 V3
#1 1 1
#2 1 3
#3 0 0
#4 4 5
#5 7 8
names(l1[indx])
#[1] "d2"

Lookup of entries with multiplicities

Suppose I have a vector data <- c(1,2,2,1) and a reference table, say: ref <- cbind(c(1,1,2,2,2,2,4,4), c(1,2,3,4,5,6,7,8))
I would like my code to return the following vector: result <- c(1,2,3,4,5,6,3,4,5,6,1,2). It's like using the R function match(), but match() only returns the first occurrence in the reference vector. The same goes for %in%.
I have tried functions like merge() and join(), but I would like something using only a combination of the rep() and seq() R functions.
You can try
ref[ref[,1] %in% data,2]
to return the second-column values whenever the first-column value is in the given set. To get the matches per element of data (with repeats, in order), wrap it in lapply:
unlist(lapply(data, function(x) ref[ref[,1] ==x, 2]))
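On the example data, this should return the requested vector:
[1] 1 2 3 4 5 6 3 4 5 6 1 2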
You can get the indices you are looking for like this:
indices <- sapply(data, function(xx) which(ref[,1] == xx))
Of course, that is a list, since the number of hits will be different for each entry of data. So you just unlist() this:
ref[unlist(indices),2]
[1] 1 2 3 4 5 6 3 4 5 6 1 2
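Since the question asks for something built from rep() and seq(), here is one possible sketch along those lines (the helper variables are made up; it assumes ref is sorted by its first column, as in the example):
counts <- table(ref[,1])                     #multiplicity of each key
first <- match(data, ref[,1])                #first matching row for each element of data
n <- as.vector(counts[as.character(data)])   #how many rows each element expands to
ref[rep(first, n) + sequence(n) - 1, 2]
[1] 1 2 3 4 5 6 3 4 5 6 1 2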
