I have two data sets - TEST and TRAIN. TEST is a subset of TRAIN. Using the columns "prod" and "clnt", I need to find all rows in TRAIN that correspond to TEST (it is a one-to-many correspondence). Then I do a temporal analysis of the respective values of the column "order" of TRAIN (the first column, "week", is the time).
So I take the first row of TEST, compare it against all rows of TRAIN to see which of them contain the same combination of "prod" and "clnt" values, and record the respective values of "order" in TS. Usually I have zero to about ten values in TS per row of TEST. Then I do some calculations on TS (in this artificial case just mean(TS)) and record the result, together with the "id" of the row of TEST, in a data set Subm.
The algorithm works, but because I have millions of rows in TRAIN and TEST, I need it as fast as possible, and especially I need to get rid of the loop, which is the slowest part. I may also have messed up the data.frame declaration/usage, but I am not sure.
set.seed(42)
NumObsTrain=100000 # this can be as much as 70 000 000
NumObsTest=10000 # this can be as much as 6 000 000
#create the TRAIN data set
train1=floor(runif(NumObsTrain, min=0, max=NumObsTrain+1))
train1=matrix(train1,ncol = 2)
train=cbind(8,train1) #week
train=rbind(train,cbind(9,train1)) #week
train=cbind(train,runif(NumObsTrain,min=1,max=10)) #order
train=cbind(c(1:nrow(train)),train)# id number of each row
colnames(train)=c("id","week","prod","clnt","order")
train=as.data.frame(train)
train=train[sample(nrow(train)),] # shuffle the rows of train
# Create the TEST dataset
test=train[1:NumObsTest,]
test[,"week"][1:{NumObsTest/2}]=10
test[,"week"][{(NumObsTest/2)+1}:NumObsTest]=11
TS=numeric(length = 10)
# preallocate the result vectors; note that the original c(1:NumObsTest*2)
# would give 2, 4, 6, ... because : binds tighter than *
id=numeric(NumObsTest)
order=numeric(NumObsTest)
Subm=data.frame(id,order)
ptm <- proc.time()
# This is the loop
for (i in 1:NumObsTest){
  Subm$id[i]=test$id[i]
  TS=train$order[train$clnt==test$clnt[i] & train$prod==test$prod[i]]
  Subm$order[i]=mean(TS)
}
proc.time() - ptm
The following will create a data.frame with all (prod, clnt, order) combinations, then group them by prod and clnt, and take the mean of order within each group. The final result is missing the id, and for some reason you end up with more rows in your final data.frame than expected; I cannot figure out why. But the order results are correct.
newtrain <- train[, 3:5]     # prod, clnt, order
newtest <- test[, c(1, 3:4)] # id, prod, clnt
x <- dplyr::inner_join(newtest, newtrain, by = c("prod", "clnt"))
y <- dplyr::group_by(x, prod, clnt)
z <- dplyr::summarise(y, mean(order))
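If you also need the id in the result, one option (a sketch, assuming each id identifies exactly one test row, and dplyr >= 1.0 for the .groups argument) is to carry id through the grouping; the mean for each (prod, clnt) pair is unchanged, you just get one row per test id:
library(dplyr)
x <- inner_join(newtest, newtrain, by = c("prod", "clnt"))
z <- x %>%
  group_by(id, prod, clnt) %>% # id rides along with its (prod, clnt) pair
  summarise(order = mean(order), .groups = "drop")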
I need help rewriting my function (see randomdraws() below), which currently works through a repeat loop inside a for loop. This takes a lot of time (especially the for loop) on my bigger datasets. Additionally, I need to repeat this function x times and want to store the results in a list object.
Here is what I am trying to achieve: I have two dataframes (here df_1 and df_2) which I need as input for my function randomdraws(). The interesting part of this function begins in the repeat{} section: here I need to draw values from the extreme value distribution (evd), one per column of df_1. Afterwards I need to add these values (called evd_draw) to the values of df_1 and check whether the altered values fulfill a certain condition (i.e. varX==varY). If this is not the case (the condition is not met with the random draw added), I want to repeat this part until the condition is met. If the condition is met, I need to store the evd_draw with which the condition was met. I then want to iterate this over each row in my dataframe df_1. In the end I get a new dataframe with the "stored" random draws of the evd, one row per row of df_1, that fulfilled the condition varX==varY. In my example below, with only 10 observations, my code runs just fine.
But if the number of rows and columns of df_1 (and df_2) grows, the function randomdraws() gets very slow. I therefore need another solution that performs the calculation of the repeat loop for each row of dataframe df_1. I think I need to parallelize my computations instead of iterating over each row one after another, but I seem to fail at (i) rewriting the repeat part for this and (ii) using it in functions like apply()/map()/...
QUESTION: Is there a way to achieve my result (i.e. a dataframe/list of the random draws that fulfilled the condition performed on dataframes df_1 and df_2) that avoids the for loop and is quick for large datasets/dataframes?
Example data:
df_1 <- as.data.frame(rbind(
  c(0.23040, 0.10153, 0.28394, 0.17105, 0.00125),
  c(0.11839, 0.16768, 0.26914, 0.19163, 0.00126),
  c(0.11703, 0.18245, 0.16571, 0.16223, 0.00423),
  c(0.39406, 0.08535, 0.21181, 0.12780, 0.00039),
  c(0.16097, 0.16369, 0.23839, 0.17830, 0.00158),
  c(0.39812, 0.04525, 0.17583, 0.09064, 0.00167),
  c(0.30547, 0.10900, 0.18930, 0.12665, 0.00197),
  c(0.19357, 0.17854, 0.18003, 0.19576, 0.00189),
  c(0.19466, 0.17339, 0.21267, 0.18410, 0.00069),
  c(0.07884, 0.21299, 0.18480, 0.17908, 0.00178)))
colnames(df_1) <- c("xf0m40","xf30m40","xf10m40","xf20m40","xf40m0")
rownames(df_1) <- c(2,7,21,33,50,77,80,96,102,110)
df_2 <- cbind.data.frame(varX = c("xf0m40","xf30m40","xf10m40","xf0m40","xf20m40",
                                  "xf0m40","xf0m40","xf40m0","xf10m40","xf30m40"),
                         id = c(2,7,21,33,50,77,80,96,102,110))
Function (that runs smoothly but is too slow):
randomdraws <- function(df_1, df_2) {
  require(tidyverse)
  require(EnvStats)
  dfx <- df_1 # here df_1 is actually retrieved from fitted values of a regression
              # output, simplified here for the sake of clarity
  df <- df_2  # two variables selected from the separate dataframe df_2 (already
              # simplified here), where varX is a character variable and id is numeric
  # matrix containing only 0, to be filled by row-wise iteration
  df_evd <- matrix(0, nrow = nrow(dfx), ncol = ncol(dfx), byrow = TRUE)
  colnames(df_evd) <- colnames(dfx)
  rownames(df_evd) <- rownames(dfx)
  for (i in 1:nrow(dfx)) {
    repeat {
      # draw from the evd, one value per column (length(dfx) == ncol(dfx))
      evd_draw <- revd(length(dfx), scale = .5)
      # t sums row i of dfx and evd_draw, then adds varX and id from df_2 and
      # computes varY (the name of the max column) to check whether varX == varY
      t <- as.data.frame(dfx[i, ] + evd_draw) %>%
        bind_cols(df[i, ]) %>%
        mutate(varY = as.character(pmap(across(1:ncol(dfx)),
                                        ~ names(c(...)[which.max(c(...))]))),
               overlap = ifelse(varX == varY, 1, 0))
      df_evd[i, ] <- evd_draw
      # repeat until the condition varX == varY holds (in other words,
      # overlap == 1, i.e. t[, ncol(t)] == 1)
      if (t[, ncol(t)] == 1) break
    }
  }
  return(df_evd)
}
Apply function on data:
system.time(exampledf <- randomdraws(df_1, df_2))
# replicate this function 3 times (takes even longer!)
ls_example <- replicate(3, list(as.data.frame(randomdraws(df_1, df_2))), simplify = TRUE)
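One possible direction (a sketch, not a drop-in replacement; the helper names draw_until_match and randomdraws2 are mine, and it assumes df_2's rows line up with df_1's rows and that varX holds column names of df_1) is to isolate the repeat part in a small function that works on a plain numeric vector, avoiding the data.frame/bind_cols/mutate overhead inside the hot loop, and then map it over the row indices:
library(EnvStats) # for revd()
# repeat-draw for a single row: returns the first evd draw for which
# the maximum of (row + draw) falls on the target column
draw_until_match <- function(row_vals, target, scale = 0.5) {
  repeat {
    evd_draw <- revd(length(row_vals), scale = scale) # one value per column
    if (names(row_vals)[which.max(row_vals + evd_draw)] == target) {
      return(evd_draw)
    }
  }
}
randomdraws2 <- function(df_1, df_2) {
  out <- t(vapply(seq_len(nrow(df_1)),
                  function(i) draw_until_match(unlist(df_1[i, ]),
                                               as.character(df_2$varX[i])),
                  numeric(ncol(df_1))))
  dimnames(out) <- dimnames(df_1)
  out
}
system.time(exampledf2 <- randomdraws2(df_1, df_2))
Because every row is independent, the vapply() call can be swapped for parallel::mclapply() (or furrr::future_map()) over the same indices to parallelize the work.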
I need to divide certain values in a column by 1000 but do not know how to go about it.
I attempted to use this function initially:
test <- Updins(weight,)
test$weight <- as.numeric(test$weight) / 1000
head(test)
with Updins being the dataframe and weight the column, just to see if it would at least divide the entire column by 1000, but no such luck. It did not recognise 'test' as a variable.
Can anyone provide any guidance? I'm very new to R :)
If 'Updins' is the dataset object name, we select columns with [ and not with (, as ( is used for function invocation:
test <- Updins['weight']
test$weight <- as.numeric(test$weight) / 1000
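Note that single-bracket selection by name, Updins['weight'], returns a one-column data.frame (which is why test$weight still works afterwards), whereas Updins[['weight']] or Updins$weight would return the bare vector.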
Here is a fake data set, dividing all rows by 1000. I also included a for loop as one potential way to do this only for certain rows. Since you didn't specify how you were selecting those rows, I did it for any rows with a value greater than 1,005, and I did a second version that divides by 1,000 only if the ID is an odd number. If you have NAs, you may need an additional if statement to deal with them; the third/last for loop shows an example of that.
ID <- 1:10
grams <- 1000:1009
df <- data.frame(ID, grams)
df$kg <- as.numeric(df$grams)/1000
df[,"kg"] <- as.numeric(df[,"grams"])/1000 # does the same thing as the line above
for (i in 1:nrow(df)){
  if (df[i,"grams"] > 1005) {df[i,"kg3"] <- as.numeric(df[i,"grams"])/1000}
} # if the weight is greater than 1,005 grams
for (i in 1:nrow(df)){
  if (df[i,"ID"] %in% seq(1, 101, by = 2)) {df[i,"kg4"] <- as.numeric(df[i,"grams"])/1000}
} # if the ID is an odd number
df[3,"grams"] <- NA # add an NA to the weight data to test the next loop
for (i in 1:nrow(df)){
  if (is.na(df[i,"grams"]) & (df[i,"ID"] %in% seq(1, 101, by = 2))) {df[i,"kg4"] <- NA}
  else if (df[i,"ID"] %in% seq(1, 101, by = 2)) {df[i,"kg4"] <- as.numeric(df[i,"grams"])/1000}
} # same as above, but works with NAs
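As an aside, the same conditional division can be done without a loop, since ifelse() operates on whole columns at once; a minimal vectorized sketch using the df built above:
df$kg3 <- ifelse(df$grams > 1005, df$grams/1000, NA) # NA grams simply stay NA
df$kg4 <- ifelse(df$ID %% 2 == 1, df$grams/1000, NA) # odd IDs only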
Hard without data to work with or expected output, but here's a skeleton you could probably use:
library(dplyr) #The package you'll need, for the pipes (%>% -- passes objects from one line to the next)
test <- Updins %>% #Using the dataset Updins
  mutate(weight = ifelse(as.numeric(weight) > 199, #CHANGING the weight variable: where weight > 199...
                         as.character(as.numeric(weight)/1000), #...divide a numeric version of the weight variable by 1000, but keep it as a character...
                         weight)) #OTHERWISE, keep the weight variable as is
head(test)
I kept the new value as a character, because it seems that your weight variable is a character variable based on some of the warnings ('NAs introduced by coercion') that you're getting.
Hi, I'm trying to create 10 sub-training sets (from a training set of 75%) in a loop, extracting randomly from a dataframe (DB). I'm using:
smp_size <- floor((0.75 * nrow(DB))/10)
train_ind <- sample(seq_len(nrow(DB)), size = smp_size)
training <- matrix(ncol = ncol(DB), nrow = smp_size)
for (i in 1:10){
  training[i] <- DB[train_ind, ]
}
What's wrong?
To partition your dataset in 10 equally sized subsets, you may use the following:
# Randomly order the rows in your training set:
DB <- DB[order(runif(nrow(DB))), ]
# Create a sequence 1,2,...,10,1,2,...,10,... that you will use to subset
inds <- rep(1:10, nrow(DB)/10)
# split() will store the subsets (created by inds) in a list
subsets <- split(DB, inds)
Note, however, that split() this way will only give you equally sized subsets. Therefore, it might (and probably will) happen that some of the observations are not included in any of the subsets.
If you wish to use all observations, causing some subsets to be larger than others, use inds <- rep(1:10, length.out = nrow(DB)) instead.
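To see the whole approach end to end, here is a quick sketch with stand-in data (DB itself is not shown in the question, so a small fake dataframe stands in for it):
DB <- data.frame(a = rnorm(103), b = rnorm(103)) # stand-in for your DB
DB <- DB[order(runif(nrow(DB))), ]               # shuffle the rows
inds <- rep(1:10, length.out = nrow(DB))         # 1..10 recycled over all rows
subsets <- split(DB, inds)
sapply(subsets, nrow)                            # groups 1-3 get 11 rows, the rest get 10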
I have a data set of x entries and I need to resample it to y entries, with y being a number smaller than x. My data set is not a series of numbers but rather x rows, and I need the entire row of information when resampling.
I am aware of the sample() function, but given that my dataset is not a vector, I am unclear how the exact code should be written.
Any help would be appreciated!
The idea is that you want to sample the row indices, then use those to pull back all of the columns for those rows, like so:
set.seed(4444) # for reproducibility
data(iris)
x <- nrow(iris)
y <- 7
irisSubset <- iris[sample(x,y),]
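If by "resample" you actually mean sampling with replacement (as in a bootstrap), the same pattern works by adding replace = TRUE to sample():
irisBoot <- iris[sample(x, y, replace = TRUE), ] # sampled rows may repeat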
I want to apply a statistical function to increasingly larger subsets of a data frame, starting at row 1 and incrementing by, say, 10 rows each time. So the first subset is rows 1-10, the second rows 1-20, and the final subset is rows 1-nrows. Can this be done without a for loop? And if so, how?
Here is one solution:
# some sample data
df <- data.frame(x = sample(1:105, 105))
# the endpoints of the sequences you wanted: 10, 20, ..., plus nrow(df) itself
# (seq() starts at 10 rather than 0, so the first subset is rows 1-10; unique()
# guards against a duplicated endpoint when nrow(df) is a multiple of 10)
row_seq <- unique(c(seq(10, nrow(df), 10), nrow(df)))
# the data subsets, filtering df from row 1 to each endpoint
data.subsets <- lapply(row_seq, function(x) df[1:x, ])
# applying the mean function to each subset
# (just replace mean by whatever function you want to use)
lapply(data.subsets, mean)
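Since each subset here is a plain numeric vector (subsetting a single-column data.frame this way drops it to a vector), the same results can also be collected into a vector instead of a list; a minor variation using the row_seq defined above:
sapply(row_seq, function(n) mean(df$x[1:n])) # one running mean per endpoint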