Adding Zero to a column in first x rows in R

I am creating a classification model for forecasting purposes. I have several text files which I converted into one large list containing several lists (called comb). I then broke the large list into a separate data frame with each list as its own column (called BI). Because each list may contain a different number of elements, the simpler approach matrix(unlist(l), ncol=ncol) does not work. Reviewing alternatives, I put together the following:
max_length <- max(sapply(comb, length))
BI <- sapply(comb, function(x) {
  c(x, rep(0, max_length - length(x)))
})
This creates a data frame assigning each list a column and filling each missing element within that column with a value of zero. Those zeros appear at the end of each column, but I would like them at the beginning. Here is an example of the current output:
cola colb colc
   2    2    2
   1    1    0
   4    0    0
I need your help in converting my original code to produce the following format:
cola colb colc
   2    0    0
   1    2    0
   4    1    2

It might be sufficient to swap the order of the arguments in the concatenation c():
max_length <- max(sapply(comb, length))
BI <- sapply(comb, function(x) {
  c(rep(0, max_length - length(x)), x)
})
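For instance, with a hypothetical three-list comb (the names and values below are made up for illustration), the padding now lands at the top of each column. Note that sapply actually returns a matrix here, so wrap it in as.data.frame() if you need a data frame:
# Hypothetical stand-in for comb; your real lists will differ
comb <- list(cola = c(2, 1, 4), colb = c(2, 1), colc = 2)
max_length <- max(sapply(comb, length))
BI <- sapply(comb, function(x) {
  c(rep(0, max_length - length(x)), x)
})
BI
#      cola colb colc
# [1,]    2    0    0
# [2,]    1    2    0
# [3,]    4    1    2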
EDIT: Based on additional information in the comments below, here's an approach that modifies the code in another way. The idea is that, as long as your first approach gives you a proper data frame, we can circumvent the problem by using the order function.
max_length <- max(sapply(comb, length))
BI <- sapply(comb, function(x) {
  .zeros <- rep(0, max_length - length(x))
  # order() sorts the zeros ahead of the indices 1:length(x),
  # so indexing with it moves the padding to the front
  .rearrange <- order(c(1:length(x), .zeros))
  c(x, .zeros)[.rearrange]
})
I have tested that this code works on a small test example I created, but I'm not certain that this example resembles your comb...
If this revised approach doesn't work, it's still possible to first create the data frame with your original code and then reorder one column at a time; a sketch of that fallback follows.
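A minimal sketch of that fallback, assuming BI was built by the original code (real values on top, zero padding at the bottom) and comb is still available to tell us how many genuine values each column holds:
# Fallback sketch: move each column's zero padding from the bottom to the top
BI <- as.data.frame(BI)
for (j in seq_along(BI)) {
  n.real <- length(comb[[j]])                  # genuine values in this column
  pad <- rep(0, nrow(BI) - n.real)
  BI[[j]] <- c(pad, BI[[j]][seq_len(n.real)])  # zeros first, then the values
}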

Related

How to sum the values of different columns in a dataframe looping on the variables names

I'm relatively new to R (I used to work in Stata), so sorry if the question is too trivial.
I have a data frame with variables named in a sequential way that follows this logic:
q12.X.Y
where X assumes the values from 1 to 9, and Y from 1 to 5
I need to add together the values of all the q12.X.Y variables whose Y runs from 1 to 3 (but NOT those ending in 4 or 5).
Ideally I would have written a loop based on the sequential numbers of the variables, namely something like:
df$test <- 0
for(i in 1:9){
  for(j in 1:3){
    df$test <- df$test + df$q12.i.j
  }
}
That obviously does not work.
I also tried the commands rowSums and subset:
df$test <- rowSums(subset(df, select = ...))
However, I find it a bit cumbersome, as the column numbers are not sequential and I do not want to type the names of all the variables.
Any suggestion on how to do that?
We can use grep to get the match
rowSums(df[grep("q12\\.[1-9]\\.[1-3]", names(df))])
or, if all the column names are present, then use an exact match by creating the column names with paste0 (note the each = 3, so that every X is paired with each of the three Y values):
rowSums(df[paste0(rep(paste0("q12.", 1:9, "."), each = 3), 1:3)])
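A quick sanity check on a made-up stand-in for the real data (4 rows, all 45 q12.X.Y columns) confirms the two selections agree:
# Hypothetical data frame mimicking the q12.X.Y naming scheme
df <- as.data.frame(matrix(rpois(4 * 45, 2), nrow = 4))
names(df) <- paste0(rep(paste0("q12.", 1:9, "."), each = 5), 1:5)
a <- rowSums(df[grep("q12\\.[1-9]\\.[1-3]", names(df))])
b <- rowSums(df[paste0(rep(paste0("q12.", 1:9, "."), each = 3), 1:3)])
all.equal(a, b)
# [1] TRUE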

How to apply operation and sum over columns in R?

I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variant (of which you are seeing many here). That said, you can still use rowSums if you find it useful, like so. Note that I use sapply, which also returns a matrix.
# random custom function
myfun <- function(x){
  return(x^2 + 3)
}
rowSums(sapply(x[, c(2, 3)], myfun))
I would suggest converting the data set into 'long' format, grouping it by sample, and then calculating the result. Here is a solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
# sample V1
#1: 1 65
#2: 2 89
#3: 3 117
You can easily replace value^2 by any function you want.
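For example, swapping in a made-up cube-plus-one transform (assuming the same three-column x as above):
# same pipeline, different per-value function
melt(setDT(x), id.vars = 'sample')[, sum(value^3 + 1), by = sample]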
You can use the apply function, selecting the columns you need with c(i1, i2, ...):
apply(x[, c(2, 3)]^2, 1, sum)
If you want to apply a function named somefunction to some of the columns, whose indices or names are in the vector col_indices, and then sum the results, you can do:
# if somefunction is vectorized:
x$results <- apply(x[, col_indices], 1, function(x) sum(somefunction(x)))
# if not:
x$results <- apply(x[, col_indices], 1, function(x) sum(sapply(x, somefunction)))
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors--each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define the function as the square, as you have above, but of course this can be any function of any complexity (so long as it takes a vector as input and returns a vector of the same length; if it doesn't, it won't fit back into the original data.frame!).
The steps below are extra pedantic to show each little bit, but obviously they can be compressed into one or two steps. Note that I only retain the row sums of the squared columns, given that you might want to save memory if you are working with lots and lots of data.
1. Create the data; define the function.
2. Grab the columns you want as a separate (temporary) data.frame.
3. Apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to retain it separately, make it a temporary data.frame (this is not strictly necessary).
5. Calculate the sums of the rows of the temporary data.frame and append it as a new column in x.
6. Remove the temporary data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
Here is another apply solution:
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))

R creating multiple 2 by 2 tables from a data frame

Next question - I have created the following data frame in R
x <- as.integer(rnorm(n=1000, mean=10, sd=5))
y <- 1:1000
z <- sample (c(0,1),1000, replace=T)
df <- data.frame(x,y,z)
# create variables df using x
for(i in 1:10){
  df[paste0("col", i)] <- ifelse(df$x < i, 1, 0)
}
# create 2 by 2 tables of z against col1 to col10
for(i in 1:10){
  table[i] <- table(df[paste0("col", i)], df$z)
}
I already received some excellent help to create variables in R using a for loop within a data frame.
However I am now struggling with using a similar for loop to create the two-by-two tables (last section of the code).
Can anybody tell me where I am going wrong?
Thanks again as always!
There are several problems with the code you have written.
First of all, the table data-object does not exist, so you cannot index-assign to it.
Secondly, you need to use "[[" when accessing a named item (otherwise you get a sublist).
Finally, if you make a list, which is really the most sensible type of storage for a series of table-objects, you need to use "[[" rather than "[" to extract an item (rather than a sublist).
I also took the liberty of renaming it to tbl so there would not be cognitive confusion about what was function and what was data.
tbl <- list()
for(i in 1:10){
  tbl[[i]] <- table(df[[paste0("col", i)]], df$z)
}
tbl[[1]]

      0   1
  0 488 473
  1  16  23
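As an aside, the same list of tables can be built without an explicit loop; a minimal equivalent sketch:
# lapply builds the whole list of 2-by-2 tables directly
tbl <- lapply(1:10, function(i) table(df[[paste0("col", i)]], df$z))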

Colwise eats column names within ddply

I'm trying to chunk through a data frame, find instances where the sub-data frames are unbalanced, and add 0 values for certain levels of a factor that are missing. To do this, within ddply, I do a quick comparison against a set vector of which levels of the factor should be there, create some new rows by replicating the first row of the sub-data set and modifying their values, and then rbind them to the old data set.
I use colwise to do the replication.
This works great outside of ddply. Inside of ddply... the identifying columns get eaten, and rbind borks on me. It's curious behavior. See the following code, with some debugging print statements thrown in, to see the difference in results:
library(plyr)  # colwise and ddply live here

#a test data frame
g <- data.frame(a=letters[1:5], b=1:5)

#repeat rows using colwise
rep.row <- function(r, n){
  colwise(function(x) rep(x, n))(r)
}

#if I want to do this with just one row, I get all of the columns
rep.row(g[1,], 5)
is fine. It prints
a b
1 a 1
2 a 1
3 a 1
4 a 1
5 a 1
#but, as soon as I use ddply to create some new data
#and try and smoosh it to the old data, I get errors
ddply(g, .(a), function(x) {
  newrows <- rep.row(x[1,], 5)
  newrows$b <- 0
  rbind(x, newrows)
})
This gives
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
You can see the problem with this debugged version
#So, what is going on here?
ddply(g, .(a), function(x) {
  newrows <- rep.row(x[1,], 5)
  newrows$b <- 0
  print(x)
  print("\n\n")
  print(newrows)
  rbind(x, newrows)
})
You can see that x and newrows have different columns - they differ in a.
a b
1 a 1
[1] "\n\n"
b
1 0
2 0
3 0
4 0
5 0
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
What is going on here? Why, when I use colwise on a sub-data frame, do the identifying columns get eaten?
It's a funny interaction between ddply and colwise, it seems. More specifically, the problem occurs when colwise calls strip_splits and finds a vars attribute that was given by ddply.
As a workaround, try putting this first line in your function,
attr(x, "vars") <- NULL
# your code follows
newrows <- rep.row(x[1,],5)
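Putting the pieces together, a sketch of the full call with the workaround applied (same toy g and rep.row as above):
# drop the vars attribute ddply attaches before colwise sees the piece
ddply(g, .(a), function(x) {
  attr(x, "vars") <- NULL
  newrows <- rep.row(x[1,], 5)
  newrows$b <- 0
  rbind(x, newrows)
})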

Optimization: splitting dataframe into a list of dataframes, transforming data per row

Preliminaries: this question is mostly of educational value; the actual task at hand is completed, even if the approach is not entirely optimal. My question is whether the code below can be optimized for speed and/or implemented more elegantly, perhaps using additional packages such as plyr or reshape. Run on the actual data it takes about 140 seconds, much longer than on the simulated data, since some of the original rows contain nothing but NA and additional checks have to be made. For comparison, the simulated data are processed in about 30 seconds.
Conditions: the dataset contains 360 variables, 30 times the set of 12. Let's name them V1_1, V1_2... (first set), V2_1, V2_2 ... (second set) and so forth. Each set of 12 variables contains dichotomous (yes/no) responses, in practice corresponding to a career status. For instance: work (yes/no), study (yes/no) and so forth, in total 12 statuses, repeated 30 times.
Task: the task at hand is to recode each set of 12 dichotomous variables into a single variable with 12 response categories (e.g. work, study... ). Ultimately we should get 30 variables, each with 12 response categories.
Data: I cannot post the actual dataset, but here is a good simulated approximation:
randomRow <- function() {
  # make a row with a single 1 and some NA's
  sample(x=c(rep(0,9), 1, NA, NA), size=12, replace=F)
}

# create a data frame with 12 variables and 1500 cases
makeDf <- function() {
  data <- matrix(NA, ncol=12, nrow=1500)
  for (i in 1:1500) {
    data[i,] <- randomRow()
  }
  return(data)
}

mydata <- NULL
# combine 30 of these dataframes horizontally
for (i in 1:30) {
  mydata <- cbind(mydata, makeDf())
}
mydata <- as.data.frame(mydata) # example data ready
My solution:
# Divide the dataset into a list with 30 dataframes, each with 12 variables
S1 <- lapply(1:30, function(i) {
  Z <- rep(1:30, each=12)  # define selection vector
  mydata[Z==i]             # use selection vector to get groups of variables (x12)
})

recodeDf <- function(df) {
  result <- as.numeric(apply(df, 1, function(x) {
    if (any(!is.na(x))) which(x == 1) else NA  # return the position of "1" per row
  }))  # the if/else check is for the real data
  return(result)
}

# Combine individual position vectors into a dataframe
final.df <- as.data.frame(do.call(cbind, lapply(S1, recodeDf)))
All in all, there is a double *apply function, one across the list, the other across the dataframe rows. This makes it a bit slow. Any suggestions? Thanks in advance.
Here is an approach that is basically instantaneous (system.time: about 0.1 seconds). The idea is to use data.table's set. The columnMatch component will depend on your data, but if it is every 12 columns, then the following will work.
library(data.table)
MYD <- data.table(mydata)
# a new data.table (changed to numeric : Arun)
newDT <- as.data.table(replicate(30, numeric(nrow(MYD)), simplify = FALSE))
# for each column, which values equal 1
whiches <- lapply(MYD, function(x) which(x == 1))
# create a list of column matches (those you wish to aggregate)
columnMatch <- split(names(mydata), rep(1:30, each = 12))
setattr(columnMatch, 'names', names(newDT))
# cycle through all new columns
# and assign the rows in the new data.table
## Arun: had to generate numeric indices for
## cycling through 1:12, 13:24 in whiches[[.]]. That was the problem.
for(jj in seq_along(columnMatch)) {
  for(ii in seq_along(columnMatch[[jj]])) {
    set(newDT, j = jj, i = whiches[[ii + 12 * (jj-1)]], value = ii)
  }
}
This would work just as well adding columns by reference to the original.
Note set works on data.frames as well....
I really like #Arun's matrix multiplication idea. Interestingly, if you compile R against some OpenBLAS libraries, you could get this to operate in parallel.
However, I wanted to provide you with another, perhaps slower than matrix multiplication, solution that uses your original pattern, but is much faster than your implementation:
# Match is usually faster than which, because it only returns the first match
# (and therefore won't fail on multiple matches)
# It also neatly handles your *all NA* case
recodeDf2 <- function(df) apply(df,1,match,x=1)
# You can split your data.frame by column with split.default
# (Using split on data.frame will split-by-row)
S2<-split.default(mydata,rep(1:30,each=12))
final.df2<-lapply(S2,recodeDf2)
If you had a very large data frame, and many processors, you may consider parallelizing this operation with:
library(parallel)
final.df2<-mclapply(S2,recodeDf2,mc.cores=numcores)
# Where numcores is your number of processors.
Having read #Arun and #mnel, I learned a lot about how to improve this function, by avoiding the coercion to an array, by processing the data.frame by column instead of by row. I don't mean to "steal" an answer here; OP should consider switching the checkbox to #mnel's answer.
I wanted, however, to share a solution that doesn't use data.table and avoids the for loop. It is still slower than #mnel's solution, albeit slightly.
nograpes2 <- function(mydata) {
  test <- function(df) {
    l <- lapply(df, function(x) which(x == 1))  # per-column positions of the 1's
    lens <- lengths(l)                          # how many 1's each column holds
    rep.int(seq_along(l), times = lens)[order(unlist(l))]
  }
  S2 <- split.default(mydata, rep(1:30, each = 12))
  data.frame(lapply(S2, test))
}
I would also like to add that #Aaron's approach, using which with arr.ind=TRUE would also be very fast and elegant, if mydata started out as a matrix, rather than a data.frame. Coercion to a matrix is slower than the rest of the function. If speed were an issue, it would be worth considering reading the data in as a matrix in the first place.
IIUC, you have only one 1 per set of 12 columns, with the rest 0's or NA's. If so, the operation can be performed much faster with this idea.
The idea: Instead of going through each row and asking for the position of 1, you could use a matrix with dimensions 1500 * 12 where each row is just 1:12. That is:
mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow=TRUE)
Now, you can multiply this matrix by each of your subset data.frames (of the same dimensions, 1500*12 here) and then take their "rowSums" (which is vectorised) with na.rm = TRUE. This directly gives the column within the set where you have the 1 (because that 1 will have been multiplied by the corresponding value between 1 and 12).
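A tiny illustration of the idea on a made-up 2 x 12 block:
# the single 1 sits in columns 3 and 7 respectively
blk <- rbind(c(0,0,1,0,0,0,0,0,0,NA,NA,0),
             c(NA,0,0,0,0,0,1,0,0,0,0,NA))
mul <- matrix(rep(1:12, nrow(blk)), ncol = 12, byrow = TRUE)
rowSums(blk * mul, na.rm = TRUE)
# [1] 3 7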
data.table implementation: Here, I'll use data.table to illustrate the idea. Since it adds columns by reference, I'd expect that the same idea used on a data.frame would be a tad slower, although it should drastically speed up your current code.
require(data.table)
DT <- data.table(mydata)
ids <- seq(1, ncol(DT), by=12)
# for multiplying with each subset and taking rowSums to get position of 1
mul.mat <- matrix(rep(1:12, nrow(DT)), ncol = 12, byrow=TRUE)
for (i in ids) {
  sdcols <- i:(i+12-1)
  # keep appending the new columns by reference to the original data
  DT[, paste0("R", i %/% 12 + 1) := rowSums(.SD * mul.mat,
       na.rm = TRUE), .SDcols = sdcols]
}
# delete all original 360 columns by reference from the original data
DT[, grep("V", names(DT), value=TRUE) := NULL]
Now, you'll be left with 30 columns that correspond to the position of 1's. On my system, this takes about 0.4 seconds.
all(unlist(final.df) == unlist(DT)) # not a fan of `identical`
# [1] TRUE
Another way this could be done with base R is by simply getting the values you want to put in the new matrix and filling them in directly with matrix indexing.
idx <- which(mydata==1, arr.ind=TRUE)   # get indices of 1's
i <- ((idx[,2] - 1) %% 12) + 1          # position within the 12-column set
idx[,2] <- ((idx[,2] - 1) %/% 12) + 1   # get "group" and put in "col" of idx
out <- array(NA, dim=c(1500,30))        # make empty matrix
out[idx] <- i                           # and fill it in!
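As a quick sanity check (assuming the simulated mydata, where every row has exactly one 1 per block, and the final.df from the question):
# both are 1500 x 30 in column-major order, so a flat comparison works
all(unlist(final.df) == c(out))
# [1] TRUE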
