I have a data frame with roughly three million rows (each row is a booking) and a small number of columns, one of them being a customer ID. My goal is to split this data frame by customer ID into a list of data frames such that each data frame contains all the bookings of one customer. So I tried
cstmr_list <- split(df, f = df$cstmr_id)
but cancelled it after half an hour because it took too long. Next, I only split the indexes with
idx_list <- split(seq(nrow(df)), f = df$cstmr_id)
which took less than 10 seconds. Now I want to populate idx_list with the corresponding rows of df. Does anyone know how to do this?
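One way to populate the list from the index list (a sketch, assuming df and idx_list exactly as above):

# pull each customer's rows out of df by integer index;
# drop = FALSE keeps one-row results as data frames
cstmr_list <- lapply(idx_list, function(i) df[i, , drop = FALSE])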
I am trying to make a column called ID that contains 5000 rows to act as an identification column for observations on 20 individuals. I want there to be 200 observations for each of the first ten individuals and 300 observations for each of the next ten (because I don't want the same number of observations for every individual). So I made two separate one-column data frames:
ID <- data.frame(ID=rep(c(12,122,242,329,595,130,145,245,654,878), each = 200))
ID2 <- data.frame(ID=rep(c(863,425,24,92,75,3,200,300,40,500), each = 300))
Why am I unable to stack one on top of the other (making a single column with all individuals) using rbind?
ID <- rbind(c(ID,ID2))
You were almost there; just don't use c() inside the rbind():
ID <- rbind(ID,ID2)
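To see why the original call failed: c() concatenates the two one-column data frames into a single plain list, so rbind() receives one list rather than two data frames to stack. A quick check (using the objects defined above):

str(c(ID, ID2))
# List of 2: two numeric vectors named "ID", lengths 2000 and 3000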
I have these data:
There are 14,865 of these rows/packets.
I want to divide them into smaller data frames according to the second column (time).
The division should create many data frames, each covering a range of 2 seconds: e.g., the first data frame should range from approximately 18:49:17.8 to 18:49:19.8.
But the next data frame should be shifted by one third of a second, i.e., range from approximately 18:49:18.1 to 18:49:20.1.
My main question is: how do I write code that finds the row whose time is 2 seconds (or one third of a second) past a given row's time?
EDIT#1:
Data saved from dput() in a text file (console output was too long).
https://drive.google.com/open?id=195rcvz_YnmbYbWhJr-IHFreoL8HwzNYt
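A sketch of one approach, assuming the time column has already been parsed to POSIXct (the names df and time and the window values below mirror the question text, not the posted file):

window <- 2    # window width in seconds
step <- 1/3    # offset between consecutive window starts
starts <- seq(from = min(df$time), to = max(df$time) - window, by = step)
# one data frame per window: the rows whose time falls in [start, start + window)
frames <- lapply(starts, function(s) df[df$time >= s & df$time < s + window, ])

Since the windows overlap, each row lands in up to six consecutive frames (2 seconds divided by 1/3 second).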
The point of this question is that I want to know how to update a data frame inside either a for loop or a function. I know there are other ways to do the specific task I am looking at, but I want to know how to do it the way I am trying to do it.
I have a data frame with 15 columns and 2k observations, with some 98s and 99s. For each row where any variable/column holds a 98 or 99, I want to remove the whole row. I created a function that filters a variable to values below 98/99 and applied it with lapply. However, instead of continually updating the data frame, it just spits out a series of data frames, each overwriting the previous one, meaning that at the end I only get a data frame with the last column cleaned. How do I get it to update the data frame for each column sequentially?
nafunction <- function(variable) {
  kuwait5 <- kuwait5 %>%
    filter(variable < 90)
}
lapply(kuwait5, nafunction)
The expected result is a new data frame with every row containing a 98 removed. What I get instead is a sequence of data frames, each one cleaned with respect to only ONE column.
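One way to make the updates stick (a sketch, not the original code): loop over the column names and reassign the data frame on every pass, using dplyr's .data pronoun so filter() can look a column up by its name as a string.

library(dplyr)
# overwrite kuwait5 on each pass, so every column's filter
# is applied to the already-cleaned result of the previous one
for (var in names(kuwait5)) {
  kuwait5 <- kuwait5 %>% filter(.data[[var]] < 90)
}

With dplyr 1.0.4 or later the loop collapses to a single call, kuwait5 %>% filter(if_all(everything(), ~ .x < 90)), which keeps only the rows where every column is below 90.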
My Problem:
I have a data frame consisting of 86,016,000 rows of observations:
there are 512,000 observations for each hour
there are 24 hours of data for each of seven days
so 24 * 7 * 512,000 = 86,016,000
there are 40 columns (variables)
There is no date or datetime-stamp column.
Row numbers alone are enough to identify which observations belong to each day, and there are no errors in the recording of this data.
Given such a large data set, what I want to do is create subsets of 12,288,000 (i.e. 24 * 512,000) rows, so that we have seven subsets, one per day.
What I tried:
d <- split(PltB_Fold3_1_Data, rep(1:12288000, each=7))
But unfortunately, after almost half an hour I terminated the process, as there was still no result.
Is there any better solution than the one above?
You're probably looking for seq rather than rep. (As written, rep(1:12288000, each = 7) defines 12,288,000 groups of 7 consecutive rows each; rep(1:7, each = 12288000) would give split() the seven day-sized groups.) With seq, you can generate a sequence of numbers from 0 to 86,016,000 in steps of 12,288,000.
To save resources, you can then use this sequence to generate temporary data frames and do whatever you want with each.
sequence <- seq(from = 0, to = 86016000, by = 12288000)
for (i in 1:(length(sequence) - 1)) {
  # the parentheses matter: `:` binds tighter than `+`, so
  # sequence[i] + 1:sequence[i + 1] would not give the intended row range
  temp <- df[(sequence[i] + 1):sequence[i + 1], ]
  # do something here with your temporary data frame
}
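If the goal is to keep all seven day-sized subsets around rather than process each one in passing, the same index arithmetic can fill a named list (a sketch reusing sequence from above):

# element i holds the rows for day i
days <- lapply(1:(length(sequence) - 1), function(i) {
  df[(sequence[i] + 1):sequence[i + 1], ]
})
names(days) <- paste0("day", 1:7)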
I created a data frame from a data set with unique marketing sources. Let's say I have 20 unique marketing sources in this new data frame D1. I want to add another column that has the count of times each marketing source appeared in my original data frame. I'm trying to use the dplyr package, but I'm not sure how to reference more than one data frame.
The original data has 16,000 observations.
The new data frame has 20 observations, as there are only 20 unique marketing sources.
How to use summarize in dplyr to reference two data frames?
My objective is to find the percentage of marketing sources.
My original data frame has two columns: NAME, MARKETING_SOURCE
This data frame has 16,000 observations and 20 distinct marketing sources (email, event, sales call, etc.)
I created a new data frame with only the unique MARKETING_SOURCES and called that data frame D1
In my new data frame, I want to add another column that has the number of times each marketing source appeared in the original data frame.
My new data frame should have two columns: MARKETING_SOURCE, COUNT
I don't know if you need to use dplyr for something like this...
First let's create some data.frames:
df1 <- data.frame(source = letters[sample(1:26, 400, replace = TRUE)])
df2 <- data.frame(source = letters, count = NA)
Then we can use table() to get the frequencies:
counts <- table(df1$source)
df2$count <- counts
head(df2)
source count
1 a 10
2 b 22
3 c 12
4 d 17
5 e 18
6 f 18
UPDATE:
In response to @MrFlick's wise comment below, you can take the names() of the output from table() to ensure the order is preserved:
df2$source <- names(counts)
Certainly not quite as elegant, and it would be even less elegant if df2 had other columns, but sufficient for the simple case presented above.
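For the dplyr route the question asked about, count() does the split-and-tally in one step (a sketch; original_df stands in for the asker's 16,000-row data frame, and the PCT column is an assumption based on the stated objective of finding percentages):

library(dplyr)
D1 <- original_df %>%
  count(MARKETING_SOURCE, name = "COUNT") %>%   # one row per distinct source
  mutate(PCT = 100 * COUNT / sum(COUNT))        # each source's share of all rows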