I am trying to calculate some rates for time on condition parameters, and have written the following, which successfully calculates the desired rates. But, I'm sure there must be a more succinct way to do this using the data.table methods. Any suggestions?
Background on what I'm trying to achieve with the code.
For each run number there are 10 record numbers. Each record number refers to a value bin (the full range of values for each parameter is split into 10 equal sized bins). The values are counts of time spent in each bin. I am trying to sum the counts for P1 over each run number (calling this opHours for the run number). I then want to divide each of the bin counts by the opHours to show the proportion of each run that is spent in each bin.
library(data.table)
#### Create dummy parameter values
P1 <- rnorm(2000,400, 50);
Date <- seq(from=as.Date("2010/1/1"), by = "day", length.out = length(P1));
RECORD_NUMBER <- rep(1:10, 200);
RUN_NUMBER <- rep(1:200, each=10, len = 2000);
#### Combine the dummy parameters into a dataframe
data <- data.frame(Date, RECORD_NUMBER, RUN_NUMBER, P1);
#### Calculating operating hours for each run
setDT(data);
running_hours_table <- data[ , .(opHours = sum(P1)), by = .(RUN_NUMBER)];
#### Set the join keys for the data and running_hours tables
setkey(data, RUN_NUMBER);
setkey(running_hours_table, RUN_NUMBER);
#### Combine tables row-wise
data <- data[running_hours_table];
data$P1.countRate <- (data$P1 / data$opHours)
Is it possible to generate the opHours column in the data table without first creating a separate table and then joining them back together?
data[ , opHours := sum(P1), by = .(RUN_NUMBER)]
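Putting that together with the rate calculation from the question, a minimal sketch (my own, using the question's dummy data) that adds both columns by reference, with no intermediate table or join:
library(data.table)
setDT(data)
# group total and per-row proportion, added by reference in one chained call
data[, opHours := sum(P1), by = RUN_NUMBER][, P1.countRate := P1 / opHours]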
You should probably read some materials about data.table, for example the Getting Started wiki or the data.table cheat sheet.
I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it does need to be something that split() can coerce to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Add 1 to group
dat$group <- dat$group + 1
# Split the dat by group
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
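As a quick usage sketch (my own addition), here is how one of the pieces can be pulled out and how everything can be recombined at the end, since the question mentions recombining with rbind():
# work on one of the pieces
head(dat_list[["newdf_1"]])
# recombine all pieces in one call instead of pairwise rbind()
recombined <- do.call(rbind, dat_list)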
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
Here's a tidyverse-based solution. Try using read_csv_chunked().
library(tidyverse)
# write some practice data to disk
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# here's the solution
partial_data <- read_csv_chunked("test.csv",
DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
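For example, a minimal sketch of such a wrapper (the function name read_filtered is my own; it assumes readr and dplyr are already loaded via library(tidyverse) above):
read_filtered <- function(file, letter, chunk_size = 1000) {
  # keep only the chunk rows whose 'string' column equals the requested letter
  read_csv_chunked(file,
                   DataFrameCallback$new(function(x, pos) filter(x, string == letter)),
                   chunk_size = chunk_size)
}
# e.g. partial_a <- read_filtered("test.csv", "a")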
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?
Example Data
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE))
Problem
Pick a random start point within each id and, from that point, select that row and the subsequent, sequential rows totaling 1% of the rows within that ID. Then do it again for 2% of each ID's rows, then 3%, and so on up to 99% of the rows per ID. Also, do not pick a start point that is closer to the end of the ID's rows than the percentage to be sampled (i.e., don't try to sample 20% of sequential rows from a point that's 10% from the end of an ID's rows).
Desired Result
What dfcombine looks like from the first code chunk below, only instead of randomly selected fruit rows within an id, the fruit rows will have only a random start-point, with the subsequent rows needed for the sample following the start-point row sequentially.
What I've Tried
I can pull part of this problem off with the following code -- but it selects all rows at random, and I need the sample chunks to be sequential following the random start point (FYI: if you run this, you'll see your chunks start at 6% b/c this is a small dataset -- no rows <6% of sample-per-id):
library(tidyverse)
set.seed(123) # pick same sample each time
dflist<-list() # make an empty list
for (i in 1:100) # "do i a hundred times"
{
i.2<-i/100 # i.2 is i/100
dflooped <- df %>% # new df
group_by(id) %>% # group by id
sample_frac(i.2,replace=TRUE) # every i.2, take a random sample
dflooped
dflist[[i]]<-dflooped
}
dflist # check
library(data.table)
dfcombine <- rbindlist(dflist, idcol = "id%") # put the list elements in a df
I can also pick the sequentially larger chunks I'm looking for with this -- but it doesn't allow me the random start (it always goes from the beginning of the df):
lapply(seq(.01,.1,.01), function(i) df[1:(nrow(df)*i),])
and using dplyr group_by spits an error I don't understand:
df2 <- df %>%
group_by(id) %>%
lapply(seq(.01,1,.01), function(i) df[1:(nrow(df)*i),])
Error in match.fun(FUN) :
'seq(0.01, 1, 0.01)' is not a function, character or symbol
So I may have some of the pieces, but am having trouble putting them together -- the solution may or may not include what I've done above. Thanks.
Sequential sampling within ID
Create fake data
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)
adding a more unique data element to test data for testing sampling
df$random_numb <- round(runif(nrow(df), 1, 100), 2)
Here we'll define a function to do what you want:
I question the statistical impact of only starting your random sample from a spot where you won't "run out" of observations within this ID category.
Would it not be better to loop back to the top of the records within each ID category if you were to run out? That would ensure a uniform chance of beginning your sample within any portion of a specific ID field as opposed to limiting yourself to only within the first 80% of the data if we want a 20% sample size. Just a thought! I built this as you asked though!
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
#browser()
# subset the data frame for the ID we're currently interested in
p_df <- p_df[ p_df[, p_idname] == p_idvalue, ]
# calculate number of rows we need in order to sample _% of the data within this ID
nrows_to_sample <- floor(p_sampleperc * nrow(p_df))
# calculate a single random number to serve as our start point somewhere between:
# 1 and the (number of rows - (number of rows to sample + 1)) -- the plus 1
# is to add a cushion and avoid issues
start_samp_indx <- as.integer(runif(1, 1, (nrow(p_df) - (nrows_to_sample + 1) )))
# sample our newly subset dataframe for what we need (nrows to sample minus 1) and return
all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
return(p_df[all_samp_indx,])
}
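As an aside, a rough sketch of the wrap-around idea mentioned above (my own assumption, not something the asker requested) could look like this:
# hypothetical wrap-around variant: any row can be the start point; if the
# window runs past the last row of the ID, it continues from row 1
random_start_wrap_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
  p_df <- p_df[p_df[, p_idname] == p_idvalue, ]
  nrows_to_sample <- max(1, floor(p_sampleperc * nrow(p_df)))
  start_samp_indx <- sample(nrow(p_df), 1)
  all_samp_indx <- (start_samp_indx + 0:(nrows_to_sample - 1) - 1) %% nrow(p_df) + 1
  return(p_df[all_samp_indx, ])
}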
Test function for a single function call
Test out the function with just a single sample for a certain percent (10% here). This is also a good way to redo several of the same function call to ensure a randomized starting location.
# single test: give me 10% of the rows with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)
Now place function in for loop
Set aside a unique list of all potential values within the id field. Also set aside a vector of sample sizes in percent format (between 0 and 1).
# capture all possible values in id field
possible_ids <- unique(df$id)
# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:length(possible_ids) / 10)
# initialize list:
combined_list <- list()
for(i in 1:length(possible_ids)) {
#browser()
print(paste0("Now sampling ", sampleperc_sequence[i], " from ", possible_ids[i]))
combined_list[[i]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[i])
}
Process the results
# process results of for loop
combined_list
# number of rows in each df in our list
sapply(combined_list, nrow)
This is the resulting dataset of all combinations of samples
# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
EDIT:
I'll leave what I initially wrote up there, but in retrospect, I think this is actually a bit closer to what you are asking for.
This solution uses the same type of function, but I used nested for loops to achieve what you were asking for.
For each ID, it will:
subset dataframe for this ID value
find random starting point
sample n% of the data (starting with 1%)
repeat with +1% to n (up to 99%)
Code:
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)
# adding a more unique data element to test data for testing sampling
df$random_numb <- round(runif(nrow(df), 1, 100), 2)
# function to do what you want:
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
# subset the data frame for the ID we're currently interested in
p_df <- p_df[ p_df[, p_idname] == p_idvalue, ]
# calculate number of rows we need in order to sample _% of the data within this ID
nrows_to_sample <- floor(p_sampleperc * nrow(p_df))
# don't let us use zero as an index
if(nrows_to_sample < 1) {
nrows_to_sample <- 1
}
# calculate a single random number to serve as our start point somewhere between:
# 1 and the (number of rows - (number of rows to sample + 1)) -- the plus 1
# is to add a cushion and avoid issues
start_samp_indx <- as.integer(runif(1, 1, (nrow(p_df) - nrows_to_sample )))
# sample our newly subset dataframe for what we need (nrows to sample minus 1) and return
all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
return(p_df[all_samp_indx,])
}
# single test: give me 10% of the rows with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)
# now put this bad boy in a for loop -- put these in order of what IDs match what sequence
possible_ids <- unique(df$id)
# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:99 / 100)
# build all id/percentage combinations (not used below -- the nested loops cover the same combinations)
ids_sample <- expand.grid(possible_ids, sampleperc_sequence)
# initialize list:
combined_list <- list()
counter <- 1
for(i in 1:length(possible_ids)) {
for(j in 1:length(sampleperc_sequence)) {
print(paste0("Now sampling ", (sampleperc_sequence[j] * 100), "% from ", possible_ids[i]))
combined_list[[counter]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[j])
# manually keep track of counter
counter <- counter + 1
}
}
# one more spot check: id 'A' at 91%
random_start_seq_sample(df, 'id', possible_ids[1], sampleperc_sequence[91])
# process results of for loop
combined_list
# check size of first list element
combined_list[[1]] # A, 1% sample is bumped up to the 1-record minimum
# check thirtieth element
combined_list[[30]] # A, 30% sample is 3 records
# check size of the sixtieth list element
combined_list[[60]] # A, 60% sample is 6 records
sapply(combined_list, nrow) # number of rows in each df in our list
# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
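One optional extra (my own sketch, not part of the answer above): name each list element with the id and percentage it came from, then bind with data.table::rbindlist() and idcol = (which the asker already used in their own attempt) so every row of the combined frame is traceable.
library(data.table)
# element order is: all percentages for the first id, then all for the next id, etc.
names(combined_list) <- paste0(rep(possible_ids, each = length(sampleperc_sequence)), "_",
                               rep(sampleperc_sequence * 100, times = length(possible_ids)), "pct")
dfcombined2 <- rbindlist(combined_list, idcol = "id_pct")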
I have two data sets, TEST and TRAIN. TEST is a subset of TRAIN. Using the columns "prod" and "clnt" I need to find all rows in TRAIN which correspond to TEST (it is a one-to-many correspondence). Then I make a temporal analysis of the respective values of the column "order" of TEST (the first column, "week", is the time).
So I take the first row of TRAIN and check all rows of TEST for the same combination of "prod" and "clnt" values, recording the respective values of "order" in TS. Usually I have zero to about ten values in TS per row of TRAIN. Then I do some calculations on TS (in this artificial case just mean(TS)) and record the result, as well as the "Id" of the row of TEST, in a data set Subm.
The algorithm works, but because I have millions of rows in TRAIN and TEST, I need it as fast as possible, and especially to get rid of the loop, which is the slowest part. I may also have messed up the data.frame declaration/usage, but I am not sure.
set.seed(42)
NumObsTrain=100000 # this can be as much as 70 000 000
NumObsTest=10000 # this can be as much as 6 000 000
#create the TRAIN data set
train1=floor(runif(NumObsTrain, min=0, max=NumObsTrain+1))
train1=matrix(train1,ncol = 2)
train=cbind(8,train1) #week
train=rbind(train,cbind(9,train1)) #week
train=cbind(train,runif(NumObsTrain,min=1,max=10)) #order
train=cbind(c(1:nrow(train)),train)# id number of each row
colnames(train)=c("id","week","prod","clnt","order")
train=as.data.frame(train)
train=train[sample(nrow(train)),] # reshuffle the rows of train
# Create the TEST dataset
test=train[1:NumObsTest,]
test[,"week"][1:{NumObsTest/2}]=10
test[,"week"][{(NumObsTest/2)+1}:NumObsTest]=11
TS=numeric(length = 10)
id=c(1:NumObsTest*2)
order=c(1:NumObsTest*2)
Subm=data.frame(id,order)
ptm <- proc.time()
# This is the loop
for (i in 1:NumObsTest){
Subm$id[i]=test$id[i]
TS=train$order[train$clnt==test$clnt[i]&train$prod==test$prod[i]]
Subm$order[i]=mean(TS)
}
proc.time() - ptm
The following will create a data.frame with all (prod, clnt) and order combinations, then group them by prod and clnt, and take the mean of order for each group. The final result is missing the id, and for some reason you have more data in your final data.frame than I can account for; I cannot figure out why. But the order results are correct.
newtrain <- train[, 3:5]
newtest <- test[, c(1, 3:4)]
x <- dplyr::inner_join(newtest, newtrain)
y <- dplyr::group_by(x, prod, clnt)
z <- dplyr::summarise(y, mean(order))
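If keeping the id matters (my assumption about what is wanted, not something the answer above covers), grouping on id as well preserves it, so each test row keeps its own mean order:
x2 <- dplyr::inner_join(newtest, newtrain, by = c("prod", "clnt"))
y2 <- dplyr::group_by(x2, id, prod, clnt)
z2 <- dplyr::summarise(y2, order = mean(order))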
Need help to speed up this code!
The goal is to create a dataframe where the TPS (transactions per second) of the first DF, TPS_Jan7_11h_13h_CheckIMEI, is accumulated from record 1 to 30, then reset to 0, and so on.
This is what it looks like in graph form:
https://docs.google.com/spreadsheets/d/1-286za99C5gdHLDErR9B4ZazVrZFFINGaH3xzVMghFk/edit?usp=sharing
My dataset has more than 6 million rows...
I start creating a sequence where I need to reset to 0 my cumulative variable. Then I go through the full dataset and just add on top of the previous value.
I have been running this for a few hours on a quad-core x64 machine with 8 GB of RAM and it is still running... so... crazy slow!
Any ideas how to speed this up? Subsets or some magic with Tables?
Here's the code:
# Create a sequence of when to reset the cumulative TPS
TPS_Jan7_11h_13h_CheckIMEI_seq30 <- seq(from = 1,nrow(TPS_Jan7_11h_13h_CheckIMEI),by = 30)
# Initialize Dataframe
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30 <- data.frame(matrix(ncol = 3, nrow = nrow(Jan7_11h_13h_CheckIMEI)))
colnames(TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30) <- c("CumulTPS","100%","130%")
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30[2] = 1000*30
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30[3] = (1000*30)*1.3
CumulVal = 0
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30$CumulTPS[1] = TPS_Jan7_11h_13h_CheckIMEI$TPS[1]
for(i in 2:nrow(Jan7_11h_13h_CheckIMEI)) {
CumulVal = CumulVal + TPS_Jan7_11h_13h_CheckIMEI$TPS[i-1]
TPS_Jan7_11h_13h_CheckIMEI_CumulTPS30$CumulTPS[i] = CumulVal
# print(CumulVal)
if (i %in% TPS_Jan7_11h_13h_CheckIMEI_seq30) CumulVal = 0
}
The TPS DF is simply a list of TPS on the TPS column and timestamp on first column.
Goal is to recreate what I put in the spreadsheet example, but on millions of rows!
Thanks,
Simon
Use dplyr to group your data into groups of 30 records, then compute the cumulative sum for each value in each group.
Here's some code; note that it needs some refinement to include all values (take a look at the cut() documentation for help):
library(dplyr)
# Create a sequence of when to reset the cumulative TPS
TPS_Jan7_11h_13h_CheckIMEI_seq30 <- seq(from = 1,nrow(TPS_Jan7_11h_13h_CheckIMEI),by = 30)
#use cut() to add a factor column to the data frame with a different level for each group of 30
TPS_Jan7_11h_13h_CheckIMEI$numgroup = cut(as.numeric(row.names(TPS_Jan7_11h_13h_CheckIMEI)), TPS_Jan7_11h_13h_CheckIMEI_seq30)
#group by the new column and get the cumulative sum at each line, within each group
newdf = TPS_Jan7_11h_13h_CheckIMEI %>% group_by(numgroup) %>% mutate(cumulsum = cumsum(TPS))
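An alternative sketch (my own; it assumes the TPS column lives in TPS_Jan7_11h_13h_CheckIMEI and the rows are already in time order) that avoids the cut() edge cases by building the group index with integer division:
library(dplyr)
newdf <- TPS_Jan7_11h_13h_CheckIMEI %>%
  mutate(numgroup = (row_number() - 1) %/% 30) %>%  # 0,0,...,0,1,1,... every 30 rows
  group_by(numgroup) %>%
  mutate(cumulsum = cumsum(TPS)) %>%
  ungroup()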
I am an R neophyte, with a data frame of database function runtimes with the following data:
> head(data2)
dbfunc runtime
1 fn_slot03_byperson 38.083
2 fn_slot03_byperson 32.396
3 fn_slot03_byperson 41.246
4 fn_slot03_byperson 92.904
5 fn_slot03_byperson 130.512
6 fn_slot03_byperson 113.853
The data covers 127 discrete functions and comprises some 1,940,170 rows.
I would like to:
Summarise the data to only include database functions with a mean runtime of over 100 ms
Produce boxplots of the 25 slowest database functions showing the distribution of runtimes, sorted by slowest first.
I'm particularly stumped by the summary step.
Note: I've also asked this question at stats.stackexchange.com.
Here's one approach using ggplot and plyr. The steps you outlined could be combined to be slightly more efficient, but for learning purposes I'll show you the steps as you asked them.
#Load ggplot2 and plyr, then make some fake data
library(ggplot2)
library(plyr)
dat <- data.frame(dbfunc = rep(letters[1:10], each = 100)
, runtime = runif(1000, max = 300))
#Use plyr to calculate a new variable for the mean runtime by dbfunc and add as
#a new column
dat <- ddply(dat, "dbfunc", transform, meanRunTime = mean(runtime))
#Subset only those dbfunc with mean run times greater than 100. Is this step necessary?
dat.long <- subset(dat, meanRunTime > 100)
#Reorder the levels of the dbfunc variable by mean runtime. Note that reorder()
#accepts a function like mean, so if the subset step above isn't necessary, then we can simply
#use that instead.
dat.long$dbfunc <- reorder(dat.long$dbfunc, -dat.long$meanRunTime)
#Subset one more time to get the top *n* dbfunctions based on mean runtime. I chose three here...
dat.plot <- subset(dat.long, dbfunc %in% levels(dbfunc)[1:3])
#Now you have your top three dbfuncs, but a bunch of unused levels hanging out so let's drop them
dat.plot$dbfunc <- droplevels(dat.plot$dbfunc)
#Plotting time!
ggplot(dat.plot, aes(dbfunc, runtime)) +
geom_boxplot()
Like I said, I feel a few of those steps could be combined and made more efficient, but wanted to show you the steps as you outlined them.
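If you do want a more compact version, here is one possible way to fold those steps together (my own sketch, still using plyr and ggplot2 on the same fake dat):
library(plyr)
library(ggplot2)
# mean runtime per function, then keep the three slowest
means <- ddply(dat, "dbfunc", summarise, meanRunTime = mean(runtime))
top3 <- head(means[order(-means$meanRunTime), "dbfunc"], 3)
dat.top <- subset(dat, dbfunc %in% top3)
# order the boxes slowest first and plot
dat.top$dbfunc <- reorder(dat.top$dbfunc, -dat.top$runtime)
ggplot(dat.top, aes(dbfunc, runtime)) + geom_boxplot()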
The summary step is easy:
attach(data2)
func_mean = tapply(runtime, dbfunc, mean)
Regarding question 1:
func_mean[func_mean > 100]
Regarding question 2:
slowest25 = head(sort(func_mean, decreasing = TRUE), n=25)
sl25_data = merge(data.frame(dbfunc = names(slowest25)), data2, sort = F)
plot(sl25_data$runtime ~ sl25_data$dbfunc)
Hope this helps. Yet the boxplots are not sorted in the plot.
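One possible way to get the sorted order (my own sketch, reusing the objects above): set the factor levels of dbfunc to the slowest-first order before plotting.
sl25_data$dbfunc <- factor(sl25_data$dbfunc, levels = names(slowest25))
plot(sl25_data$runtime ~ sl25_data$dbfunc)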
I'm posting this as the 'answer', whereas Tomas' and Chase's answers are in fact more complete. In Chase's case I couldn't get ggplot to operate, and time was short. In Tomas' case I got stuck at the sl25_data step.
We ended up using the following, which works with one remaining problem:
# load data frame
dbruntimes <- read.csv("db_runtimes.csv", sep=',', header=TRUE)
# calc means
meanruns <- aggregate(dbruntimes["runtime"],dbruntimes["dbfunc"],mean)
# filter
topmeanruns <- meanruns[meanruns$runtime>100,]
# order by means
meanruns <- meanruns[rev(order(meanruns$runtime)),]
# get top 25 results
drawfuncs <- meanruns[1:25,"dbfunc"]
# subset for plot
forboxplot <- subset(dbruntimes,dbfunc %in% levels(drawfuncs)[0:25])
# plot
boxplot(forboxplot$runtime~forboxplot$dbfunc)
This gives us the result we are looking for, but all the functions are still shown on the plot xaxis, rather than just the top 25.
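A likely fix for that last issue (my own sketch, reusing the objects above): re-factor dbfunc on just the remaining rows before plotting, so only the kept functions appear on the x axis.
# keep only the levels actually present in the subset
forboxplot$dbfunc <- factor(forboxplot$dbfunc)
boxplot(forboxplot$runtime ~ forboxplot$dbfunc)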