r: data.table, most efficient row-wise normalization

This code normalizes each value in each row (all values end up between -1 and 1).
dt <- setDT(knime.in)
df <- as.data.frame(t(apply(dt[, -1], 1, function(x) x / sum(x))))
df1 <- cbind(knime.in$Majors_Final, df)
BUT
It is not dynamic. The code "knows" that the string categorical variable is in column one and removes it before running the calculations.
It seems old school, and I suspect it does not take advantage of data.table's update-by-reference memory model.
QUESTIONS
How do I use the most memory-efficient data.table code to achieve the row-wise normalization?
How do I exclude all is.character() columns (or include only the is.numeric() ones) if I do not know the position or name of these columns?
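A minimal sketch of one data.table-idiomatic approach, assuming knime.in as in the question: detect the numeric columns by type rather than by position, then normalize them in place by reference.
library(data.table)
dt <- as.data.table(knime.in)
num_cols <- names(dt)[sapply(dt, is.numeric)] # select columns by type, not position
# divide each numeric column by the row totals; := updates by reference, no copies
dt[, (num_cols) := .SD / rowSums(.SD), .SDcols = num_cols]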

Related

how to access rownames in a chained data.table in R

Sometimes we need rownames to create a new column that is a function of other columns, aggregated separately for each row. In other words, the function operates across the row.
Consider this:
library(data.table)
library(geosphere)
dt <- data.table(lon = 77 + rnorm(100), lat = 13 + rnorm(100), i.lon = 77 + rnorm(100), i.lat = 13 + rnorm(100))
dt[, dist := distGeo(p1 = c(lon, lat), p2 = c(i.lon, i.lat)), by = rownames(dt)] # correct
The second line works because the data.table's name, dt, is available inside the square brackets (which in itself does not look very elegant to me), but that is not always the case.
What if there is a chain of data.tables? Consider this extension of previous example:
dt[lon > 77 & lat < 12.5][, dist := distGeo(p1 = c(lon, lat), p2 = c(i.lon, i.lat)), by = rownames(dt)] # incorrect
Clearly this is incorrect, as rownames(dt) has a different length from the subsetted data.table passed along the chain.
I guess my larger question is: is rownames() the only way to achieve summarisation on each row? If not, then the specific question remains: how do we access the data.table inside the by= construct when it is a chained data.table?
Try cbind:
dt <- data.table(lon = 77 + rnorm(100), lat = 13 + rnorm(100), i.lon = 77 + rnorm(100), i.lat = 13 + rnorm(100))
dt[, dist := distGeo(p1 = cbind(lon, lat), p2 = cbind(i.lon, i.lat))]
# correct: 100 lines
dt[lon > 77 & lat < 12.5][, dist := distGeo(p1 = cbind(lon, lat), p2 = cbind(i.lon, i.lat))]
# also correct: 16 lines
:= works on each row without any need for summarization.
cbind lets us supply the expected n × 2 lon/lat matrix to the function.
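If the function were not vectorized, a hedged alternative that avoids rownames() entirely (and so survives chaining) is to iterate over the columns with mapply; a sketch, not the answerer's code:
dt[lon > 77 & lat < 12.5][, dist := mapply(function(a, b, p, q) distGeo(c(a, b), c(p, q)), lon, lat, i.lon, i.lat)]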

Summing over all previous rows in large column efficiently

I have a large data set (>100,000 rows) and would like to create a new column that sums all previous values of another column.
For a simulated data set test.data with 100,000 rows and 2 columns, I create the new vector that sums the contents of column 2 with:
sapply(1:100000, function(x) sum(test.data[1:x, 2]))
I append this vector to test.data later with cbind(). This is too slow, however. Is there a faster way to accomplish this, or a way to reference the vector sapply is building from inside sapply, so I can just update the cumulative sum instead of redoing the whole calculation each time?
Per my comment above, it'll be faster to do a direct assignment and use cumsum instead of sapply (cumsum was built for exactly what you want to do).
This should work:
test.data$sum <- cumsum(test.data[, 2])
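If test.data is a data.table (or is converted to one), the running total can also be added by reference; a sketch assuming the second column is named value, which the question never specifies:
library(data.table)
setDT(test.data)[, sum := cumsum(value)] # 'value' is a hypothetical column name; := adds the column in place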

R: Warning when creating a (long) list of dummies

A dummy column for a column c and a given value x equals 1 if c == x and 0 otherwise. Usually, when creating dummies for a column c, one excludes one value x of one's choosing, as the last dummy column adds no information beyond the already existing dummy columns.
Here's how I'm trying to create a long list of dummies for a column firm, in a data.table:
values <- unique(myDataTable$firm)
cols <- paste('d', as.character(values[-1]), sep = '_') # gives us nice d_value names for columns
# the [-1]: I arbitrarily do not create a dummy for the first unique value
myDataTable[, (cols) := lapply(values[-1], function(x) firm == x)]
This code reliably worked for previous columns, which had fewer unique values. firm, however, is larger:
str(values)
num [1:3082] 51560090 51570615 51603870 51604677 51606085 ...
I get a warning when trying to add the columns:
Warning message:
truelength (6198) is greater than 1000 items over-allocated (length = 36). See ?truelength. If you didn't set the datatable.alloccol option very large, please report this to datatable-help including the result of sessionInfo().
As far as I can tell, all the columns I need are still there. Can I just ignore this issue? Will it slow down future computations? I'm not sure what to make of this, or of the relevance of truelength.
Taking Arun's comment as an answer.
You should use the alloc.col function to pre-allocate the required number of columns in your data.table, choosing a number larger than the expected ncol.
alloc.col(myDataTable, 3200)
Additionally, depending on how you consume the data, I would recommend considering reshaping your wide table into a long one; see the entity-attribute-value (EAV) model. Then you only need one column per data type.
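A sketch of that reshaping with data.table's melt(), keeping firm as the identifier; the variable.name and value.name labels here (dummy, flag) are made up for illustration:
longDT <- melt(myDataTable, id.vars = 'firm', variable.name = 'dummy', value.name = 'flag')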

using loops or lapply in R

I'm trying to iteratively loop through subsets of an R df but am having some trouble. df$A contains values from 0-1000. I'd like to subset the df based on each unique value of df$A, manipulate that data, save it as a newdf, and then eventually concatenate (rbind) the 1000 generated newdf's into one single df.
My current code for a single iteration (no loops) is like this:
dfA = 1
dfA_1 <- subset(df, A == dfA)
:: some ddply commands on dfA_1 altering its length and content ::
EDIT: To clarify, in the single-iteration version, once I have the subset, I use ddply to count the number of rows that contain certain values. Not all subsets have all values, so the result can be of variable length. I have therefore been appending the result to a skeleton df that accounts for cases in which a given subset of df has no rows containing the values I expect (i.e., nrow = 0). Ideally, I wind up with a fixed-length subset for each value of A. How can I incorporate this into a single (or multiple) plyr or dplyr set of code?
My issue with for loops here is that the thing to iterate over is not a length but the set of unique values of df$A.
My questions are as follows:
1. How would I use a for loop (or some form of apply) to perform this operation?
2. Can these operations both manipulate the data and generate iterative df names (e.g., the df named dfA_1 would be dfA_x, where x is one of the values of df$A from 1 to 1000)? My current thinking is that I'd then rbind the 1000 dfA_x's, though this seems cumbersome.
Many thanks for any assistance.
You should really use the dplyr package for this. What you want to do would probably take this form:
library(dplyr)
df %>%
  group_by(A) %>%
  summarize( . . . )
It will be easier to do, easier to read, less prone to error, and faster.
http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
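To answer question 1 directly all the same, a hedged base-R sketch of the loop-style alternative: split df on A, manipulate each piece, and recombine, with no need for 1000 separately named data frames.
pieces <- lapply(split(df, df$A), function(sub) {
  # :: per-subset manipulation goes here, as in the question ::
  sub
})
result <- do.call(rbind, pieces)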

issue summing columns

I have a very large dataset and I'm trying to get the sums of values. The variables are binary with 0s and 1s.
Somehow, when I run a for loop
for (i in 7:39) {
  agegroup1[53640, i] <- sum(agegroup1[, i])
}
The loop runs, but every column except the first ends up containing nothing but NA. I tried printing the values and saw 0s and 1s, and I checked the class (it returns "integer"). But when adding it all up, R does not give the result I expect.
Any advice?
cs <- colSums(agegroup1[, 7:39])
will give you the vector of column sums without looping (at the R level).
If you have any missing values (NAs) in agegroup1[, 7:39] then you may want to add na.rm = TRUE to the colSums() call (or even your sum() call).
You don't say what agegroup1 is or how many rows it has, etc., but to replicate what your loop was doing, you then need
agegroup1[53640, 7:39] <- cs
What was in agegroup1[53640, ] before you started adding the column sums? NA? If so that would explain some behaviour.
We do really need more detail though...
@Gavin Simpson provided a workable solution, but alternatively you could use apply. This function lets you apply a function over the row or column margin.
x <- cbind(x1 = 1, x2 = 1:8, y = runif(8))
# If you want to sum the rows of columns 2 and 3
apply(x[, 2:3], 1, sum, na.rm = TRUE)
# If you want to sum the columns of columns 2 and 3
apply(x[, 2:3], 2, sum, na.rm = TRUE)
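For plain sums, the dedicated rowSums() and colSums() functions are usually faster than apply(); a minimal equivalent on the same toy matrix:
rowSums(x[, 2:3], na.rm = TRUE) # same result as the apply(..., 1, sum) call
colSums(x[, 2:3], na.rm = TRUE) # same result as the apply(..., 2, sum) call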
