I'm asking this as a general/beginner question about R, not specific to the package I was using.
I have a dataframe with 3 million rows and 15 columns. I don't consider this a huge dataframe, but maybe I'm wrong.
I was running the following script and it's been running for 2+ hours - I imagine there must be something I can do to speed this up.
Code:
ddply(orders, .(ClientID), NumOrders=len(OrderID))
This is not an overly intensive script, or at least I don't think it is.
In a database, you could add an index to a table to increase join speed. Is there a similar action in R I should be doing on import to make functions/packages run faster?
Looks to me like you might want:
orders$NumOrders <- with(orders, ave(OrderID, ClientID, FUN = length))
(I'm not aware of any len() function in R.)
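For reference, the original plyr call was presumably meant to look something like this (a sketch using plyr's summarise helper; it is valid, though it will still be slow on 3 million rows):
library(plyr)
# count the number of orders per client; length() replaces the non-existent len()
ddply(orders, .(ClientID), summarise, NumOrders = length(OrderID))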
With the suggested data.table package, the following operation should do the job within a second:
orders[,list(NumOrders=length(OrderID)),by=ClientID]
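If orders is still a plain data.frame it needs to be converted first, and setting a key on ClientID is the closest analogue to a database index (a sketch, reusing the column names from the question):
library(data.table)
orders <- as.data.table(orders)   # one-time conversion from data.frame
setkey(orders, ClientID)          # optional: keyed grouping/joins behave like an indexed table
orders[, list(NumOrders = .N), by = ClientID]   # .N counts rows per group, same as length(OrderID)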
It seems like all your code is doing is this:
orders[order(orders$ClientID), ]
That would be faster.
I'm working with a data.frame of about 2 million rows. I need to group the rows and apply functions to each group, and I was using split.data.frame and modify for that.
Unfortunately the split.data.frame step alone breaks the memory limit. I'm working on my company's server, so I can't really install a new R version or add any memory.
I think I can multi-thread the modify part, but first the splitting needs to succeed (the current pattern is sketched below).
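For reference, the pattern described above is roughly the following (a sketch; the grouping column and the per-group function are placeholders, since the question does not name them):
# split() on a data.frame dispatches to split.data.frame(); this step alone
# materialises one data.frame per group and can exhaust memory
groups  <- split(df, df$group_id)
results <- lapply(groups, function(g) summarise_group(g))  # the "modify" step
out     <- do.call(rbind, results)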
What else can I try?
First question, so I'll try to go straight to the point.
I'm currently working with tables, and I chose R because it has no hard limit on data frame size and can perform many different operations on the data in those tables. I'm happy with that: I can manipulate the data at will, and merges, concatenations, and row and column manipulation all work fine. But I recently had to run a loop at roughly 0.00001 sec per instruction over a table of 6 million rows, and it took over an hour.
Maybe the R approach was wrong to begin with, and I've tried to look for the most efficient ways to run some operations (using list assignment instead of c(list, new_element)). But as far as I can tell, this is not something you can optimize with an algorithm like graphs or heaps (it's just tables; you have to iterate through them all), so I was wondering whether there are other instructions or basic ways of working with tables that I don't know about (assign, extract, ...) that take less time, or some RStudio configuration that improves performance.
This is the loop, in case it helps to understand the question:
my_list <- vector("list", nrow(table[, "Date_of_count"]))
for (i in 1:nrow(table[, "Date_of_count"])) {
  # needs dplyr/magrittr for %>% and pull(); parses and reformats one row per iteration
  my_list[[i]] <- format(as.POSIXct(strptime(table[i, "Date_of_count"] %>% pull(1), "%Y-%m-%d")), format = "%Y-%m-%d")
}
As mentioned, the table has over 6 million rows and 25 variables. I want the list to be filled so I can append it to the table as a column once it's done.
Please let me know if this lacks specificity or concreteness, or if it just doesn't belong here.
In order to improve performance (and properly work with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a dataframe and followed the points above.
df <- as.data.frame(table)
In this case the dates were converted directly to character by doing this, so I did not have to apply any further conversions.
New execution time over 6 million rows: 25.25 sec.
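Had a conversion still been needed, the whole loop could have been replaced by a single vectorized call along these lines (a sketch, reusing the column name from the question):
# format the entire column at once instead of row by row
df$Date_of_count <- format(as.Date(df$Date_of_count), format = "%Y-%m-%d")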
I was trying to create a dataframe from the results of the colMeans function, but the result always came out weird. There has been previous discussion (see: R - creating dataframe from colMeans function), but when I implement the suggested solution, I get an extremely wide data frame: it has a lot of columns but only one row, instead of one column with many rows.
temporary <- data.matrix(tempdb[, 2:5])
temp2 <- as.numeric(colMeans(temporary, na.rm = TRUE))  # note: na.rm belongs inside colMeans(), not as.numeric()
trgdphts <- c(trgdphts, temp2)
This is the code that I used.
It turned out that the problem was that I had to clean up my workspace variables.
After deleting everything and rerunning the new code, it all worked. As it turns out, the functions were being run on stale data, and the stale data was never overwritten.
Thanks to Akrun for helping out. His advice inadvertently made me discover the problem.
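For anyone hitting the same shape problem: colMeans() returns a named numeric vector, and how you wrap it determines the shape of the resulting data frame (a sketch, reusing the objects from the question):
means <- colMeans(temporary, na.rm = TRUE)
wide <- as.data.frame(t(means))                            # one row, one column per variable
long <- data.frame(variable = names(means), mean = means)  # one row per variable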
I have a large dataset that I am reading into R.
I want to apply the unique() function to it so I can work with it more easily, but when I try to do so, I get this:
clients <- unique(clients)
Error: cannot allocate vector of size 27.9 Mb
So I am trying to apply this function part by part by doing this:
clientsmd <- data.frame()
n <- 7316738  # number of observations in the dataset
t <- 0
for (i in 1:200) {
  clientsm <- clients[1 + (t * round((n / 200))):(t + 1) * round((n / 200)), ]
  clientsm <- unique(clientsm)
  clientsmd <- rbind(clientsm)
  t <- (t + 1)
}
But I get this:
Error in `[.default`(xj, i) : subscript too large for 32-bit R
I have been told that I could do this more easily with packages such as "ff" or "bigmemory" (or any other), but I don't know how to use them for this purpose.
I'd appreciate any kind of guidance, whether it's explaining why my code won't work or showing me how I could take advantage of these packages.
Is clients a data.frame or a data.table? data.table can handle much larger amounts of data than data.frame.
library(data.table)
clients<-data.table(clients)
clientsUnique<-unique(clients)
or
duplicateIndex <- duplicated(clients)
will give a logical vector marking the rows that are duplicates.
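That vector can then be used to keep only the non-duplicated rows, for example:
clientsUnique <- clients[!duplicateIndex, ]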
Increase your memory limit as below and then try executing.
memory.limit(4000) ## windows specific command
You could use the distinct() function from the dplyr package:
df %>% distinct(ID)
where ID is a column that uniquely identifies rows in your dataframe.
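A slightly fuller sketch (dplyr; note that distinct(ID) drops the other columns unless .keep_all = TRUE is set):
library(dplyr)
clients_unique <- clients %>% distinct()                      # unique combinations of all columns
clients_by_id  <- clients %>% distinct(ID, .keep_all = TRUE)  # one row per ID, keeping the other columns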
I want to read in a large ido file that has just under 110,000,000 rows and 8 columns: 2 integer columns and 6 logical columns. The file uses "|" as the delimiter. I tried using read.big.matrix and it took forever. I also tried dumpDf and it ran out of RAM. I tried ff, which I heard was a good package, but I am struggling with errors. I would like to do some analysis on this table if I can read it in some way. If anyone has any suggestions, that would be great.
Kind Regards,
Lorcan
Thank you for all your suggestions. I managed to figure out why I was getting the error. I'll share the answer and the suggestions so no one makes my stupid mistake again.
First of all, the data that was given to me contained some errors, so I was doomed to fail from the start. I was unaware of this until a colleague came across it in another piece of software: a column that should contain only integers had some letters in it, so when read.table.ffdf (from the ff package) tried to read in the data set it got confused. In any case, I was given another sample of data, 16,000,000 rows and 8 columns with correct entries, and it worked perfectly. The code that I ran is as follows and took about 30 seconds to read:
setwd("D:/data test")
library(ff)
ffdf1 <- read.table.ffdf(file = "test.ido", header = TRUE, sep = "|")
Thank you all for your time and if you have any questions about the answer feel free to ask and I will do my best to help.
Do you really need all the data for your analysis? Maybe you could aggregate your dataset (say, from minute values to daily averages). This aggregation only needs to be done once, and can hopefully be done in chunks. That way you do not need to load all your data into memory at once.
Reading in chunks can be done using scan; the important arguments are skip and n. Alternatively, put your data into a database and extract the chunks that way. You could even use the functions from the plyr package to run the chunks in parallel; see this blog post of mine for an example.
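A minimal chunked-aggregation sketch along those lines, using read.table for readability (scan with its skip and n arguments works the same way); the file name, chunk size, column names, and the aggregation itself are all assumptions:
chunk_size <- 1e6
col_names <- c("id", "value", paste0("flag", 1:6))  # assumed layout: 2 integer + 6 logical columns
skip_rows <- 1                                      # skip the header line
daily <- list()
repeat {
  chunk <- tryCatch(
    read.table("test.ido", sep = "|", skip = skip_rows,
               nrows = chunk_size, col.names = col_names),
    error = function(e) NULL)                       # read.table errors once we are past the end of the file
  if (is.null(chunk) || nrow(chunk) == 0) break
  # aggregate each chunk down to one row per id before keeping it in memory
  daily[[length(daily) + 1]] <- aggregate(value ~ id, data = chunk, FUN = mean)
  skip_rows <- skip_rows + nrow(chunk)
}
result <- do.call(rbind, daily)
# if the same id can appear in several chunks, aggregate once more over 'result'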