Applying a function to a large data set - r

I have a large dataset that I am reading into R.
I want to apply the unique() function to it so I can work with it more easily, but when I try to do so, I get this error:
clients <- unique(clients)
Error: cannot allocate vector of size 27.9 Mb
So I am trying to apply the function piece by piece, like this:
clientsmd <- data.frame()
n <- 7316738  # Amount of observations in the dataset
t <- 0
for (i in 1:200) {
  clientsm <- clients[1+(t*round((n/200))):(t+1)*round((n/200)),]
  clientsm <- unique(clientsm)
  clientsmd <- rbind(clientsm)
  t <- (t + 1)
}
But I get this:
Error in `[.default`(xj, i) : subscript too large for 32-bit R
I have been told that I could do this more easily with packages such as "ff" or "bigmemory" (or any other), but I don't know how to use them for this purpose.
I would appreciate any guidance, whether it is an explanation of why my code won't work or advice on how I could take advantage of these packages.
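For reference, here is a minimal sketch of the chunked approach the question is attempting, with the indexing and the rbind() accumulation written out explicitly (this assumes clients is already a data frame in memory; a final unique() over the combined pieces is still needed for global deduplication, so it does not by itself remove the memory pressure):
# Sketch: deduplicate chunk by chunk, then deduplicate the combined result
n      <- nrow(clients)
chunks <- 200
groups <- ceiling(seq_len(n) / (n / chunks))   # chunk id for every row

pieces <- lapply(split(seq_len(n), groups), function(idx) {
  unique(clients[idx, , drop = FALSE])
})

clientsmd <- unique(do.call(rbind, pieces))    # final pass for global uniqueness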

Is clients a data.frame or a data.table? data.table can handle quite large amounts of data compared to data.frame.
library(data.table)
clients<-data.table(clients)
clientsUnique<-unique(clients)
or
duplicateIndex <- duplicated(clients)
will give rows that are duplicates.
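To go from that logical index to a deduplicated table, a one-line follow-up such as the following should work (sketch):
clientsUnique <- clients[!duplicateIndex, ]  # keep only the first occurrence of each row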

Increase your memory limit as shown below and then try executing:
memory.limit(4000) ## Windows-specific command

You could use the distinct() function from the dplyr package:
df %>% distinct(ID)
where ID is a column that uniquely identifies rows in your data frame.
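As a sketch of that suggestion applied to the question's data (calling distinct() with no column arguments deduplicates across all columns; the ClientID column below is hypothetical):
library(dplyr)

clients <- distinct(clients)                                # unique rows across all columns
# clients <- distinct(clients, ClientID, .keep_all = TRUE)  # or key on one identifier column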

Related

Sample data after using filter or select from sparklyr

I have a large dataframe to analyse, so I'm using sparklyr to manage it in a fast way. My goal is to take a sample of the data, but first I need to select some variables of interest and filter some values of certain columns.
I tried to select and/or filter the data and then use the function sample_n, but it always gives me this error:
Error in vapply(dots(...), escape_expr, character(1)) : values must
be length 1, but FUN(X[[2]]) result is length 8
Below is an example of the behaviour:
library(sparklyr)
library(dplyr)
sc<-spark_connect(master='local')
data_example<-copy_to(sc,iris,'iris')
data_select<-select(data_example,Sepal_Length,Sepal_Width,Petal_Length)
data_sample<-sample_n(data_select,25)
data_sample
I don't know if I'm doing something wrong, since I started using this package a few days ago, but I could not find any solution to this problem. Any help will be appreciated!
It seemed to be a problem with the type of object returned when you select/mutate/filter the data.
So, I managed to get around the problem by sending the data to spark using the compute() command, and then sampling the data.
library(sparklyr)
library(dplyr)
sc<-spark_connect(master='local')
data_example<-copy_to(sc,iris,'iris')
data_select <- data_example %>%
  select(Sepal_Length, Sepal_Width, Petal_Length) %>%
  compute('data_select')
data_sample<-sample_n(data_select,25)
data_sample
Unfortunately, this approach takes a long time to run and consumes a lot of memory, so I expect someday I'll find a better solution.
I also ran into the same issue earlier; then I tried the following:
data_sample = data_select %>% head(25)
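If an exact row count is not required, another option is sparklyr's sdf_sample(), which draws a fraction of the rows on the Spark side; a rough sketch:
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
data_example <- copy_to(sc, iris, "iris")

data_sample <- data_example %>%
  select(Sepal_Length, Sepal_Width, Petal_Length) %>%
  sdf_sample(fraction = 0.2, replacement = FALSE, seed = 42)  # roughly 20% of rows

data_sample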

Convert Document Term Matrix (DTM) to Data Frame (R Programming)

I am a beginner in the R programming language and am currently trying to work on a project.
There's a huge Document Term Matrix (DTM) and I would like to convert it into a Data Frame.
However due to the restrictions of the functions, I am not able to do so.
The method that I have been using is to first convert it into a matrix, and then convert it to data frame.
DF <- data.frame(as.matrix(DTM), stringsAsFactors=FALSE)
It was working perfectly with a smaller DTM. However, when the DTM is too large, I am not able to convert it to a matrix, which yields the error shown below:
Error: cannot allocate vector of size 2409.3 Gb
I have tried looking online for a few days; however, I am not able to find a solution.
I would be really thankful if anyone is able to suggest the best way to convert a DTM into a DF (especially when dealing with a large DTM).
In the tidytext package there is actually a function to do just that. Try using the tidy function, which will return a tibble (basically a fancy dataframe that will print nicely). The nice thing about the tidy function is that it takes care of the pesky stringsAsFactors=FALSE issue by not converting strings to factors, and it will deal nicely with the sparsity of your DTM.
as.matrix is trying to convert your DTM into a non-sparse matrix with an entry for every document and term, even if the term occurs 0 times in that document, which is causing your memory usage to balloon. tidy() will convert it into a dataframe where each document only has the counts for the terms found in it.
In your example here you'd run
library(tidytext)
DF <- tidy(DTM)
There's even a vignette on how to use the tidytext package (meant to work in the tidyverse).
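A minimal end-to-end sketch of that approach, assuming the DTM comes from the tm package (the tiny corpus below is made up purely for illustration):
library(tm)
library(tidytext)

# A tiny hypothetical corpus, just to produce a DocumentTermMatrix
docs <- Corpus(VectorSource(c("the cat sat", "the dog sat", "the cat ran")))
DTM  <- DocumentTermMatrix(docs)

# tidy() keeps only the non-zero counts: one row per (document, term) pair
DF <- tidy(DTM)
DF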
It's possible that as.data.frame(as.matrix(DTM), stringsAsFactors=FALSE) instead of data.frame(as.matrix(DTM), stringsAsFactors=FALSE) might do the trick.
The API documentation notes that as.data.frame() simply coerces a matrix into a dataframe, whereas data.frame() creates a new data frame from the input.
as.data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.data.frame.html
data.frame(...) -> https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html

In R, Create Summary Data Frame from Multiple Objects

I'm trying to create a "summary" data frame that holds some high-level stats about a few objects in my R project. I'm having trouble even accomplishing this simple task and I've tried using For loops and Apply functions with no luck.
After searching (a lot) on SO I'm seeing that For loops might not be the best performing option, so I'm open to any solution that gets the job done.
I have three objects: text1, text2, and text3 of class "Large Character (vectors)" (imagine I might be exploring these objects and will create an NLP predictive model from them). Each is > 250 MB in size (upwards of 1 million "rows" each) once loaded into R.
My goal: Store the results of object.size(), length(), and max(nchar()) in a table for my 3 objects.
Method 1: Use an Apply() Function
Issue: I haven't successfully applied multiple functions to a single object. I understand how to do simple applies like lapply(x, mean) but I'm falling short here.
Method 2: Bind Rows Using a For loop
I'm liking this solution because I almost know how to implement it. A lot of SO users say this is a bad approach, but I'm lacking other ideas.
sources <- c("text1", "text2", "text3")
text.summary <- data.frame()
for (i in sources){
  text.summary[i ,] <- rbind(i, object.size(get(i)), length(get(i)),
                             max(nchar(get(i))))
}
Issue: This returns the error "data length exceeds size of matrix" - I know I could define the structure of my data frame (on line 2), but I've seen too much feedback on other questions that advise against doing this.
Thanks for helping me understand the proper way to accomplish this. I know I'm going to have trouble doing NLP if I can't even figure out this simple problem, but R is my first foray into programming. Oof!
Just try for example:
do.call(rbind, lapply(list(text1, text2, text3),
                      function(x) c(objectSize = c(object.size(x)),
                                    length = length(x),
                                    max = max(nchar(x)))))
You'll obtain a matrix. You can coerce to data.frame later if you need.
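A variation of the same idea that keeps the object names and returns a data frame directly might look like this (the column names here are chosen just for illustration):
objs <- list(text1 = text1, text2 = text2, text3 = text3)

text.summary <- do.call(rbind, lapply(names(objs), function(nm) {
  x <- objs[[nm]]
  data.frame(source      = nm,
             object.size = as.numeric(object.size(x)),  # size in bytes
             length      = length(x),
             max.nchar   = max(nchar(x)))
}))

text.summary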

R: Why does it take so long to parse this data table

I have a data frame df that has 15 columns and 1000000 rows of all ints. My code is:
for (i in 1:nrow(df))
{
  if (is.null(df$col1[i]) || .... || is.null(df$col9[i]))
    df[-i, ]  # to delete the row if one of those columns is null
}
This has been running for an hour and is still going. Why? It seems like it should be relatively fast code to run. How can I speed it up?
The reason it is slow is that R is relatively slow at looping through vectors. Most functions in R are vectorized which means you can perform them on a vector at once much faster than it can loop through each element one by one. On a side note, I don't think you have NULLs in your data frame. I think you have NAs so I'm going to assume that is what you have. Even if you have NULLs then the following should still work.
This syntax should give you a nice speed boost.
This will take advantage of rowSums producing NA for every row that has missing values in it.
df <- subset(df, !is.na(rowSums(df[, 1:9])))  # columns col1 through col9
This syntax should also work.
df <- df[rowSums(is.na(df[, 1:9])) == 0, ]
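Another idiomatic option for the same task is base R's complete.cases(), shown here as a sketch:
df <- df[complete.cases(df[, 1:9]), ]  # keep rows with no NA in columns 1 through 9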

Actions to speed up R calculations

I'm asking this as a general/beginner question about R, not specific to the package I was using.
I have a dataframe with 3 million rows and 15 columns. I don't consider this a huge dataframe, but maybe I'm wrong.
I was running the following script and it's been running for 2+ hours - I imagine there must be something I can do to speed this up.
Code:
ddply(orders, .(ClientID), NumOrders=len(OrderID))
This is not an overly intensive script, or at least I don't think it is.
In a database, you could add an index to a table to increase join speed. Is there a similar action in R I should be doing on import to make functions/packages run faster?
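For reference, the count the question appears to be after, written with plyr's actual syntax (summarise as the function and length() in place of the nonexistent len()), might look like this sketch:
library(plyr)

# Count the number of orders per client
order_counts <- ddply(orders, .(ClientID), summarise, NumOrders = length(OrderID))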
Looks to me that you might want:
orders$NumOrders <- with(orders, ave(OrderID, ClientID, FUN = length))
(I'm not aware that a len() function exists.)
With the suggested data.table package, the following operation should do the job within a second:
orders[, list(NumOrders = length(OrderID)), by = ClientID]
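A short end-to-end sketch of that approach, including the conversion step (using .N, data.table's built-in row counter, in place of length(OrderID)):
library(data.table)

setDT(orders)  # convert the data.frame to a data.table by reference
num_orders <- orders[, .(NumOrders = .N), by = ClientID]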
It seems like all your code is doing is this:
orders[order(orders$ClientID), ]
That would be faster.
