I have a large data set (>100,000 rows) and would like to create a new column that sums all previous values of another column.
For a simulated data set test.data with 100,000 rows and 2 columns, I create the new vector that sums the contents of column 2 with:
sapply(1:100000, function(x) sum(test.data[1:x[1],2]))
I append this vector to test.data later with cbind(). This is too slow, however. Is there a faster way to accomplish this, or a way to reference the vector sapply is building from inside sapply, so I can just update a running cumulative sum instead of recomputing the whole sum each time?
Per my comment above, it will be faster to do a direct assignment and use cumsum instead of sapply (cumsum was built specifically for what you want to do).
This should work:
test.data$sum <- cumsum(test.data[, 2])
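For a quick sanity check, here is a minimal sketch on simulated data (this test.data is made up for illustration and kept at 10,000 rows so the slow sapply() version finishes quickly):
set.seed(1)
test.data <- data.frame(id = 1:10000, value = rnorm(10000))

slow <- sapply(1:10000, function(x) sum(test.data[1:x, 2]))  # original approach
fast <- cumsum(test.data[, 2])                               # single vectorized pass
all.equal(slow, fast)  # TRUE, and cumsum() is effectively instantaneous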
Related
I have two dataframes
dataframe 1 has around a million rows and has two columns named 'row' and 'columns' that hold the row and column indices of another dataframe (i.e. dataframe 2).
I want to extract, for each row of dataframe 1, the value in dataframe 2 at the position given by those 'row' and 'columns' indices.
I used a simple for loop to get the solution, but it is time consuming and takes around 9 minutes. Is there any other way, using R functions, to solve this problem?
for (i in 1:nrow(dataframe1)) {
  dataframe1$value[i] <- dataframe2[dataframe1$row[i], dataframe1$columns[i]]
}
You actually don't need a for loop to do this. You can pull all the values in one shot by indexing dataframe 2 with a two-column matrix of the row and column positions:
DataFrame1$value <- DataFrame2[cbind(DataFrame1$row, DataFrame1$columns)]
This should work a lot faster. If you wanted to try it a different way, you could build the values in a separate vector and then use cbind to join the vector to the data frame. The fact that you're trying to update the whole data frame during the loop is most likely what's slowing it down.
Maybe you can try the code below
dataframe1$value <- dataframe2[as.matrix(dataframe1[c("row","columns")])]
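A tiny made-up example of what that matrix indexing does: a two-column matrix of (row, column) positions returns one element per pair rather than a rectangular subset.
dataframe2 <- data.frame(a = 1:3, b = 4:6, c = 7:9)
dataframe1 <- data.frame(row = c(1, 3, 2), columns = c(2, 1, 3))
dataframe2[as.matrix(dataframe1[c("row", "columns")])]
# [1] 4 3 8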
Since your loop only considers the rows in df1, you can cut the surplus rows from df2 and then use cbind:
dataframe2 <- dataframe2[1:nrow(dataframe1), ]
df3 <- cbind(dataframe1, dataframe2)
I have a dataframe with cases that repeat on the rows. Some rows have more complete data than others. I would like to group cases and then assign the first non-missing value to all NA cells in that column for that group. This seems like a simple enough task but I'm stuck. I have working syntax but when I try to use apply to apply the code to all columns in the dataframe I get a list back instead of a dataframe. Using do.call(rbind) or rbindlist or unlist doesn't quite fix things either.
Here's the syntax.
df$groupid <- group_indices(df, id1, id2)  # creates a group id from the combination of two variables
df %<>% group_by(id1, id2)  # actually groups the dataframe according to these variables
df <- summarise(df, xvar1 = xvar1[which(!is.na(xvar1))[1]])  # works great to assign the first non-missing value to all missing values, but only on one column (xvar1) at a time
I have many columns so I try using apply to make this a manageable task..
df <- apply(df, MARGIN = 2, FUN = function(x) {
  summarise(df, x = x[which(!is.na(x))[1]])
})
This gets me a list for each variable, I wanted a dataframe (which I would then de-duplicate). I tried rbindlist and do.call(rbind) and these result in a long dataframe with only 3 columns - the two group_by variables and 'x'.
I know the problem is simply how I'm using apply, probably the indexing with 'which', but I'm stumped.
What about using lapply with do.call and cbind, like the following:
df <- do.call(cbind, lapply(df, function(x) {summarise(df, x = x[which(!is.na(x))[1]])}))
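One caveat worth checking: inside that lapply(), x is not a column of df, so summarise() may resolve it in the function environment rather than per group. If that turns out to be the case, a sketch using dplyr's across() (assuming dplyr >= 1.0) keeps the per-group, first-non-missing behaviour for every column at once:
library(dplyr)

df <- df %>%
  group_by(id1, id2) %>%
  summarise(across(everything(), ~ .x[which(!is.na(.x))[1]]), .groups = "drop")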
This code normalizes each value in each row (all values end up between -1 and 1).
dt <- setDT(knime.in)
df <- as.data.frame(t(apply(dt[, -1], 1, function(x) x / sum(x))))
df1 <- cbind(knime.in$Majors_Final, df)
BUT
It is not dynamic. The code "knows" that the string categorical variable is in column one and removes it before running the calculations.
It seems old school, and I suspect it does not make full use of data.table's by-reference memory semantics.
QUESTIONS
How do I use the most memory-efficient data.table code to achieve the row-wise normalization?
How do I exclude all is.character() columns (or include only is.numeric), if I do not know the position or name of these columns?
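Not from the original thread, but one possible data.table sketch for both points, assuming the goal is to divide every numeric cell by its row total while leaving character columns untouched (the column names below are invented):
library(data.table)
dt <- data.table(Majors_Final = c("A", "B"),
                 x1 = c(1, 2), x2 = c(3, 4), x3 = c(6, 4))

num_cols <- names(dt)[sapply(dt, is.numeric)]           # select numeric columns by type, not position
row_tot  <- dt[, Reduce(`+`, .SD), .SDcols = num_cols]  # vectorised row totals
dt[, (num_cols) := lapply(.SD, `/`, row_tot), .SDcols = num_cols]  # update by reference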
I'm trying to compute a present value on every row of a dataframe. This shouldn't be too hard, but each row has to use a specific range of columns. I have this:
int <- sample(0:3, 1000, rep = TRUE) / 100
period <- sample(1:9, 1000, rep = TRUE)
a <- data.frame(replicate(10, sample(0:10, 1000, rep = TRUE)), int, period)
Suppose that columns 1:10 are the payments, then there is the interest rate for the PV, and the last column (period) indicates in which column the PV starts for that particular row. I'm using a for loop to accomplish this, but I'm wondering if there is an easier way, since I'm doing it on a dataframe with over 1 million rows:
a$vpn <- 0
for (i in 1:nrow(a)) {
  a$vpn[i] <- pv.uneven(a$int[i], a[i, a$period[i]:10])
}
Any help would be greatly appreciated.
Thanks.
We don't know what pv.uneven is, but suppose we have
pv <- function(interest, cfs) sum(cfs * (1 + interest)^-seq_along(cfs))
so that for a given row i of data frame a, you want pv(a$int[i],a[i,a$period[i]:10]).
A useful trick then is to create an auxiliary function
pv.aux <- function(int, period, end = 10, ...) pv(int, c(...)[period:end])
so that you can just do.call an mapply over the data frame:
a$vpn <- do.call(mapply, c(pv.aux, a))
For me this is many times faster than your current loop solution.
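For what it's worth, a minimal sketch that checks the do.call/mapply version against the row-wise loop on a small sample (it reuses the simulated columns from the question and the stand-in pv() above, since pv.uneven() isn't shown):
set.seed(1)
int <- sample(0:3, 20, rep = TRUE) / 100
period <- sample(1:9, 20, rep = TRUE)
a <- data.frame(replicate(10, sample(0:10, 20, rep = TRUE)), int, period)

pv <- function(interest, cfs) sum(cfs * (1 + interest)^-seq_along(cfs))
pv.aux <- function(int, period, end = 10, ...) pv(int, c(...)[period:end])

vec  <- do.call(mapply, c(pv.aux, a))  # int and period match by name; X1..X10 fall into ...
loop <- sapply(seq_len(nrow(a)), function(i) pv(a$int[i], unlist(a[i, a$period[i]:10])))
all.equal(vec, loop)  # TRUE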
I'm trying to iteratively loop through subsets of an R df but am having some trouble. df$A contains values from 0-1000. I'd like to subset the df based on each unique value of df$A, manipulate that data, save it as a newdf, and then eventually concatenate (rbind) the 1000 generated newdf's into one single df.
My current code for a single iteration (no loops) is like this:
dfA = 1
dfA_1 <- subset(df, A == dfA)
:: some ddply commands on dfA_1 altering its length and content ::
EDIT: to clarify, in the single-iteration version, once I have the subset, I have been using ddply to count the number of rows that contain certain values. Not all subsets have all values, so the result can be of variable length. I have therefore been appending the result to a skeleton df that accounts for cases in which a given subset of df has no rows containing the values I expect (i.e., nrow = 0). Ideally, I wind up with a fixed-length result for each value of A. How can I incorporate this into a single (or multiple) set of plyr or dplyr code?
My issue with a for loop here is that the iteration isn't over a simple 1:n length but over the unique values of df$A.
My questions are as follows:
1. How would I use a for loop (or some form of apply) to perform this operation?
2. Can these operations also be used to manipulate the data, in addition to generating iterative df names (e.g., the df named dfA_1 would be dfA_x, where x is one of the values of df$A from 1 to 1000)? My current thinking is that I'd then rbind the 1000 dfA_x's, though this seems cumbersome.
Many thanks for any assistance.
You should really use the dplyr package for this. What you want to do would probably take this form:
library(dplyr)
df %>%
group_by(A) %>%
summarize( . . . )
It will be easier to do, easier to read, less prone to error, and faster.
http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
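As a concrete (made-up) illustration of the pattern, here is a sketch that counts rows per value of A and pads the values of A that contribute no rows, which sounds like the fixed-length requirement in the edit (the n_rows count is illustrative; the 1:1000 range follows the question's description):
library(dplyr)
library(tidyr)

df %>%
  group_by(A) %>%
  summarize(n_rows = n(), .groups = "drop") %>%
  complete(A = 1:1000, fill = list(n_rows = 0))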