I am very new to R and have a question you may find simple. I have two data frames with exactly the same column names. One data frame has around 58k rows (each row is an article number and each column is a month; the values are quantities). The second data frame is a much smaller subset of the first (around 1000 rows). Every row in the second data frame always has a matching row in the first. What I need to do is subtract the second data frame's quantities for each month/article from the first, larger data frame. It is almost like a vlookup on two values. Any ideas?
UPDATE: this is what I think it would look like in SQL:
SELECT I.Division,
ILS.Brand,
ILS.Cust #,
ILS.Article,
ILS.201811change - SLT.201811change AS '201811change',
ILS.201812change - SLT.201812change AS '201812change',
ILS.201901change - SLT.201901change AS '201901change',
ILS.201903change,
ILS.201904change,
ILS.201905change,
ILS.201906change,
ILS.201907change,
ILS.201808change,
ILS.201809change
FROM ILS LEFT OUTER JOIN SLT ON ILS.Article = SLT.Article
You can use the left_join function from dplyr, which is the analog of SQL's LEFT JOIN. In your case, in simplified form, it would be ILS %>% left_join(SLT, by = "Article"). Please see the full code below:
library(dplyr)
# data.frame simulation
strs3 <- c("Brand", "Cust", "Article", "201808change", "201809change", "201903change", "201904change", "201905change",
"201906change", "201907change", "201811change", "201812change", "201901change")
n <- 1000
total <- cbind(
as.data.frame(matrix(sample(LETTERS, 3 * n, replace = TRUE), ncol = 3)),
matrix(rnorm(n * 10), ncol = 10)
)
names(total) <- strs3
spl <- ceiling(n * 57 / 58)
ils <- total[1:spl, ]
u <- unique(ils$Article)
ul <- length(u)
slt <- total[(spl + 1): (spl + ul), ]
slt$Article <- u
# left join
z <- ils %>% left_join(slt, by = "Article") %>%
mutate(`201811change` = `201811change.x` - `201811change.y`) %>%
mutate(`201812change` = `201812change.x` - `201812change.y`) %>%
mutate(`201901change` = `201901change.x` - `201901change.y`) %>%
select(-ends_with("y")) %>% select(-one_of("201811change.x", "201812change.x", "201901change.x"))
str(z)
Output (structure of the resultant data frame):
'data.frame': 983 obs. of 13 variables:
$ Brand.x : Factor w/ 26 levels "A","B","C","D",..: 16 23 19 20 19 26 7 21 22 9 ...
$ Cust.x : Factor w/ 26 levels "A","B","C","D",..: 21 15 25 3 24 2 1 26 3 23 ...
$ Article : Factor w/ 26 levels "A","B","C","D",..: 13 14 2 17 23 13 4 1 17 15 ...
$ 201808change.x: num -1.398 -0.357 -1.042 -0.653 -1.037 ...
$ 201809change.x: num 1.483 0.604 0.276 0.846 -1.245 ...
$ 201903change.x: num -0.733 -0.413 0.61 -1.037 1.048 ...
$ 201904change.x: num -0.794 -1.0688 0.577 0.3368 0.0472 ...
$ 201905change.x: num -0.427 -0.898 1.124 -0.435 -0.304 ...
$ 201906change.x: num 2.094 0.177 -0.892 -1.655 -1.091 ...
$ 201907change.x: num 0.228 0.546 0.141 -1.166 -0.687 ...
$ 201811change : num 1.5082 0.0148 -0.5335 -0.763 -1.7196 ...
$ 201812change : num 1.415 -2.128 -0.576 1.205 -0.631 ...
$ 201901change : num -0.883 -0.892 -2.032 -2.172 0.483 ...
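If you would rather stay in base R, the "vlookup" idea from the question can be sketched with match(). This is only a minimal sketch on made-up toy data (the article codes and the two month columns below are assumptions, not the real tables): match() finds the row of slt for each article in ils, and articles without a match are treated as subtracting 0.

```r
# Toy stand-ins for the two tables (hypothetical values)
ils <- data.frame(Article = c("A1", "A2", "A3", "A4"),
                  `201811change` = c(10, 20, 30, 40),
                  `201812change` = c(5, 6, 7, 8),
                  check.names = FALSE)
slt <- data.frame(Article = c("A3", "A1"),
                  `201811change` = c(3, 1),
                  `201812change` = c(2, 4),
                  check.names = FALSE)

months <- c("201811change", "201812change")

# Row of slt for every article in ils (NA where slt has no such article)
idx <- match(ils$Article, slt$Article)

result <- ils
for (m in months) {
  delta <- slt[[m]][idx]        # looked-up quantity, NA if no match
  delta[is.na(delta)] <- 0      # no match: subtract nothing
  result[[m]] <- ils[[m]] - delta
}
result
```

This assumes Article is unique in slt; if it is not, match() only picks the first occurrence, and you would need to aggregate slt first.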
I'm scaling one column in a dataset with the intention of fitting a linear model. However, when I try to write the dataframe (with scaled column) to a csv, it doesn't work because the scaled column became complex with center and scale attributes.
Can someone please indicate how to convert the scaled column to something that can write to a csv? (and maybe why scale() needs to do it this way.)
# make a data frame
testDF <- data.frame(x1 = c(1,2,2,3,2,4,4,5,6,15,36,42,11,12,23,24,25,66,77,18,9),
x2 = c(1,4,5,9,4,15,17,25,35,200,1297,1764,120,150,500,500,640,4200,6000,365,78))
# scale the x1 attribute
testDF <- testDF %>%
mutate(x1_scaled = scale(x1, center = TRUE, scale = TRUE))
# write to csv doesn't work
write_csv(as.matrix(testDF), "testDF.csv")
# but plotting and lm do work
ggplot(testDF, aes(x1_scaled)) +
geom_histogram(aes(y = ..density..),binwidth = 1)
Lm_scaled <- lm(x2 ~ x1_scaled, data = testDF)
plot(Lm_scaled)
scale() returns a one-column matrix. We can extract the column or use as.vector() to drop the dim attribute (along with the scaling attributes):
testDF <- testDF %>%
mutate(x1_scaled = as.vector(scale(x1, center = TRUE, scale = TRUE)))
Check the structure of the output without as.vector and with as.vector
> testDF %>%
+ mutate(x1_scaled = scale(x1, center = TRUE, scale = TRUE)) %>% str
'data.frame': 21 obs. of 3 variables:
$ x1 : num 1 2 2 3 2 4 4 5 6 15 ...
$ x2 : num 1 4 5 9 4 15 17 25 35 200 ...
$ x1_scaled: num [1:21, 1] -0.824 -0.776 -0.776 -0.729 -0.776 ...
..- attr(*, "scaled:center")= num 18.4
..- attr(*, "scaled:scale")= num 21.2
> testDF %>%
+ mutate(x1_scaled = as.vector(scale(x1, center = TRUE, scale = TRUE))) %>% str
'data.frame': 21 obs. of 3 variables:
$ x1 : num 1 2 2 3 2 4 4 5 6 15 ...
$ x2 : num 1 4 5 9 4 15 17 25 35 200 ...
$ x1_scaled: num -0.824 -0.776 -0.776 -0.729 -0.776 ...
You can simply convert the scaled column to numeric in base R and write out the data frame:
testDF$x1_scaled <- as.numeric(testDF$x1_scaled)
write_csv(testDF, "testDF.csv")
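As for why scale() behaves this way: it is written for matrices, so it returns a matrix and stores the centering and scaling constants as attributes. A small base-R sketch (the vector x is made up for illustration) shows that the plain vector holds the same numbers, and that you can save the constants yourself before dropping them:

```r
x <- c(1, 2, 3, 4, 5)

s <- scale(x)                      # one-column matrix with attributes
center <- attr(s, "scaled:center") # mean used for centering
spread <- attr(s, "scaled:scale")  # standard deviation used for scaling

v <- as.vector(s)                  # plain numeric vector, attributes gone

# Same values as standardising by hand
manual <- (x - mean(x)) / sd(x)
all.equal(v, manual)

# The saved constants let you undo the scaling later
restored <- v * spread + center
all.equal(restored, x)
```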
I changed my dataset to data.table and I'm using sapply (apply family), but so far that hasn't been sufficient. Is this fully correct?
I already went from this:
library(data.table)
library(lubridate)
buying_volume_before_breakout <- list()
for (e in 1:length(df_1_30sec_5min$date_time)) {
interval <- dolar_tick_data_unified_dt[date_time <= df_1_30sec_5min$date_time[e] &
date_time >= df_1_30sec_5min$date_time[e] - time_to_collect_volume &
Type == "Buyer"]
buying_volume_before_breakout[[e]] <- sum(interval$Quantity)
}
To this (created a function and used sapply):
fun_buying_volume_before_breakout <- function(e) {
interval <- dolar_tick_data_unified_dt[date_time <= df_1_30sec_5min$date_time[e] &
date_time >= df_1_30sec_5min$date_time[e] - time_to_collect_volume &
Type == "Buyer"]
return(sum(interval$Quantity))
}
buying_volume_before_breakout <- sapply(1:length(df_1_30sec_5min$date_time), fun_buying_volume_before_breakout)
I couldn't make my data reproducible but here are some more insights about its structure.
> str(dolar_tick_data_unified_dt)
Classes ‘data.table’ and 'data.frame': 3120650 obs. of 6 variables:
$ date_time : POSIXct, format: "2017-06-02 09:00:35" "2017-06-02 09:00:35" "2017-06-02 09:00:35" ...
$ Buyer_from : Factor w/ 74 levels "- - ","- - BGC LIQUIDEZ DTVM",..: 29 44 19 44 44 44 44 17 17 17 ...
$ Price : num 3271 3271 3272 3271 3271 ...
$ Quantity : num 5 5 5 5 5 5 10 5 50 25 ...
$ Seller_from: Factor w/ 73 levels "- - ","- - BGC LIQUIDEZ DTVM",..: 34 34 42 28 28 28 28 34 45 28 ...
$ Type : Factor w/ 4 levels "Buyer","Direct",..: 1 3 1 1 1 1 1 3 3 3 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(df_1_30sec_5min)
Classes ‘data.table’ and 'data.frame': 3001 obs. of 13 variables:
$ date_time : POSIXct, format: "2017-06-02 09:33:30" "2017-06-02 09:49:38" "2017-06-02 10:00:41" ...
$ Price : num 3251 3252 3256 3256 3260 ...
$ fast_small_mm : num 3250 3253 3254 3256 3259 ...
$ slow_small_mm : num 3254 3253 3254 3256 3259 ...
$ fast_big_mm : num 3255 3256 3256 3256 3258 ...
$ slow_big_mm : num 3258 3259 3260 3261 3262 ...
$ breakout_strength : num 6.5 2 0.5 2 2.5 0.5 1 2.5 1 0.5 ...
$ buying_volume_before_breakout: num 1285 485 680 985 820 ...
$ total_volume_before_breakout : num 1285 485 680 985 820 ...
$ average_buying_volume : num 1158 338 318 394 273 ...
$ average_total_volume : num 1158 338 318 394 273 ...
$ relative_strenght : num 1 1 1 1 1 1 1 1 1 1 ...
$ relative_strenght_last_6min : num 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
First, separate the 'buyer' data from the rest. Then add a column for the start of the time interval and do a non-equi join in data.table, which is what @chinsoon is suggesting. I've made a reproducible example below:
library(data.table)
set.seed(123)
N <- 1e5
# Filter buyer details first
buyer_dt <- data.table(
tm = Sys.time()+runif(N,-1e6,+1e6),
quantity=round(runif(N,1,20))
)
time_dt <- data.table(
t = seq(
min(buyer_dt$tm),
max(buyer_dt$tm),
by = 15*60
)
)
t_int <- 300
time_dt[,t1:=t-t_int]
library(rbenchmark)
benchmark(
a={ # Your sapply code
bv1 <- sapply(1:nrow(time_dt), function(i){
buyer_dt[between(tm,time_dt$t[i]-t_int,time_dt$t[i]),sum(quantity)]
})
},
b={ # data.table non-equi join
all_intervals <- buyer_dt[time_dt,.(t,quantity),on=.(tm>=t1,tm<=t)]
bv2 <- all_intervals[,sum(quantity),by=.(t)]
}
,replications = 9
)
#> test replications elapsed relative user.self sys.self user.child
#> 1 a 9 42.75 158.333 81.284 0.276 0
#> 2 b 9 0.27 1.000 0.475 0.000 0
#> sys.child
#> 1 0
#> 2 0
Edit: In general, any join of two tables A and B is a subset of the cross join [A x B] (the Cartesian product). The rows of [A x B] are all possible combinations of the rows of A and the rows of B. An equi join subsets [A x B] by checking equality conditions, i.e. if x and y are the join columns in A and B, your join will be: rows from [A x B] where A.x = B.x and A.y = B.y.
In a NON-equi join, the subset condition uses comparison operators OTHER than =. For example, in your case you want the rows such that A.x <= B.x <= A.x + delta.
I don't know much about how they are implemented, but data.table has a pretty fast one that has worked well for me with large data frames.
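To make the "subset of the cross join" idea concrete, here is a naive base-R sketch on made-up data (the events and windows tables are assumptions for illustration only). It builds the full cross join with merge(..., by = NULL) and then filters it with the inequality condition. It is far too slow and memory-hungry for millions of rows, which is exactly why the data.table non-equi join is preferable, but the result it computes is the same kind of per-window sum.

```r
# Hypothetical tick data: time stamp and quantity
events <- data.frame(tm = c(1, 4, 6, 9, 12),
                     quantity = c(5, 10, 20, 40, 80))

# Hypothetical breakout times, each with a look-back window of 3 time units
windows <- data.frame(t = c(6, 12))
windows$t1 <- windows$t - 3

# Cross join: every window paired with every event (2 x 5 = 10 rows)
cross <- merge(windows, events, by = NULL)

# Non-equi condition: keep events that fall inside the window [t1, t]
hits <- cross[cross$tm >= cross$t1 & cross$tm <= cross$t, ]

# Sum the quantity per window
totals <- aggregate(quantity ~ t, data = hits, FUN = sum)
totals
```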
Here is my data frame example. It includes a list column named "dta", where each element holds the n values I want to keep for each of my scenarios:
set.seed(777)
df <- data.frame(theo = numeric(),
size = numeric(),
dta = I(list()))
df[ 1: 5,"theo"] <- qlnorm(0.1, meanlog=0, sdlog=1, lower.tail = TRUE, log.p = FALSE)
df[ 6:10,"theo"] <- qlnorm(0.2, meanlog=0, sdlog=1, lower.tail = TRUE, log.p = FALSE)
df[ 1: 5,"size"] <- 10
df[ 6:10,"size"] <- 20
for(i in 1:10){
df$dta[i] <- list(rlnorm(df$size[i], meanlog = 0, sdlog = 1))
}
df
str(df)
This should give a df like:
theo size dta
1 0.2776062 10 1.631967....
2 0.2776062 10 0.737667....
3 0.2776062 10 0.131252....
4 0.2776062 10 1.937334....
5 0.2776062 10 0.739868....
6 0.4310112 20 4.631176....
7 0.4310112 20 2.610180....
8 0.4310112 20 0.175918....
9 0.4310112 20 3.501670....
10 0.4310112 20 0.588178....
or:
'data.frame': 10 obs. of 3 variables:
$ theo: num 0.278 0.278 0.278 0.278 0.278 ...
$ size: num 10 10 10 10 10 20 20 20 20 20
$ dta :List of 10
..$ : num 1.632 0.671 1.667 0.671 5.148 ...
..$ : num 0.738 1.056 0.152 0.967 10.089 ...
..$ : num 0.131 1.256 0.457 3.574 4.211 ...
..$ : num 1.937 2.359 3.496 0.297 4.587 ...
..$ : num 0.74 0.66 0.481 0.434 1.874 ...
..$ : num 4.631 0.298 10.28 0.933 1.286 ...
..$ : num 2.61 0.472 0.251 1.61 0.303 ...
..$ : num 0.176 0.566 2.156 0.407 3.52 ...
..$ : num 3.502 1.748 1.283 0.648 1.359 ...
..$ : num 0.588 0.392 2.447 1.926 0.86 ...
..- attr(*, "class")= chr "AsIs"
Now, I want to subset that list in such a way that:
for each list, each value is compared with the fixed value "theo" stored in the dataframe
when that value is below or equal to "theo", then recode that value NA
Here is a working code and gives me exactly what I want:
df$dta2 <- df$dta
for(i in 1:10){
df$dta2[[i]] [ df$dta2[[i]] <= df$theo[i] ] <- NA
}
However, I was wondering if there is a way to get the same result with a single line of code and no "for loop", i.e. to do a conditional replacement of values contained in a list that is nested in a data frame?
We can use Map
df$dta3 <- Map(function(x,y) replace(x, x<=y, NA), df$dta, df$theo)
all.equal(df$dta2, df$dta3, check.attributes=FALSE)
#[1] TRUE
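In case Map is new to you: it walks over its arguments in parallel, so the first list element is paired with the first threshold, the second with the second, and so on. A tiny stand-alone sketch with made-up numbers (not the data from the question):

```r
vals <- list(c(0.1, 0.5, 2), c(3, 0.2))  # two small numeric vectors
thr  <- c(0.5, 1)                         # one threshold per vector

# For each pair (x, y): set the entries of x that are <= y to NA
out <- Map(function(x, y) replace(x, x <= y, NA), vals, thr)

out[[1]]  # NA NA  2
out[[2]]  #  3 NA
```

replace() returns a modified copy, so the original vals list is left untouched.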
I am new to using apply and functions together, and I am stuck and frustrated. I have 2 different lists. The first is a list of data frames, and I need to add a certain number of columns to each of them, where that number is given by the second list. Below is the structure of the first list, which has one data frame per station; every data frame has 2 or more columns, one per pressure:
> str(KDzlambdaEG)
List of 3
$ 176:'data.frame': 301 obs. of 3 variables:
..$ 0 : num [1:301] 0.186 0.182 0.18 0.181 0.177 ...
..$ 5 : num [1:301] 0.127 0.127 0.127 0.127 0.127 ...
..$ 20: num [1:301] 0.245 0.241 0.239 0.236 0.236 ...
$ 177:'data.frame': 301 obs. of 2 variables:
..$ 0 : num [1:301] 0.132 0.132 0.132 0.13 0.13 ...
..$ 25: num [1:301] 0.09 0.092 0.0902 0.0896 0.0896 ...
$ 199:'data.frame': 301 obs. of 2 variables:
..$ 0 : num [1:301] 0.181 0.182 0.181 0.182 0.179 ...
..$ 10: num [1:301] 0.186 0.186 0.185 0.183 0.184 ...
On the other hand, the second list holds the number of columns that I need to add after every column of each data frame in the first list:
> str(dif)
List of 3
[[176]]
[1] 4 15 28
[[177]]
[1] 24 67
[[199]]
[1] 9 53
I've tried tonnes of things, even this, using the append_col function that appears in:
How to add a new column between other dataframe columns?
for (i in 1:length(dif)){
A<-lapply(KDzlambdaEG,append_col,rep(list(NA),dif[[i]][1]),after=1)
}
but nothing seems to work so far... I have searched for answers here, but it's difficult to find specific ones as a newcomer.
Try:
indxlst <- lapply(dif, function(x) c(1, x[-length(x)]+1, x[length(x)]))
newdflist <- lapply(indxlst, function(x) data.frame(matrix(0, 2, sum(x))))
for(i in 1:length(newdflist)) {
newdflist[[i]][indxlst[[i]]] <- KDzlambdaEG[[i]]
}
Reproducible Data Test
df1 <- data.frame(x=1:2, y=c("Jan", "Feb"), z=c("A", "B"))
df3 <- df2 <- df1[,-3]
KDzlambdaEG <- list(df1,df2,df3)
x1 <- c(4,15,28)
x2 <- c(24,67)
x3 <- c(9, 53)
dif <- list(x1,x2,x3)
indxlst <- lapply(dif, function(x) c(1, x[-length(x)]+1, x[length(x)]))
newdflist <- lapply(indxlst, function(x) data.frame(matrix(0, 2, sum(x))))
for(i in 1:length(newdflist)) {
newdflist[[i]][indxlst[[i]]] <- KDzlambdaEG[[i]]
}
newdflist
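An alternative sketch, under one reading of the question: dif[[i]][j] is taken to be the number of empty (NA) columns to insert right after column j of the i-th data frame. That reading is an assumption, and pad_columns below is a hypothetical helper written for illustration, not an existing function.

```r
# Hypothetical helper: insert n_after[j] NA columns after column j of df
pad_columns <- function(df, n_after) {
  stopifnot(length(n_after) == ncol(df))
  pieces <- lapply(seq_along(df), function(j) {
    # Block of NA filler columns for position j (may have zero columns)
    filler <- as.data.frame(matrix(NA_real_, nrow(df), n_after[j]))
    names(filler) <- if (n_after[j] > 0) {
      paste0(names(df)[j], "_gap", seq_len(n_after[j]))
    } else {
      character(0)
    }
    cbind(df[j], filler)
  })
  do.call(cbind, pieces)
}

# Toy example: two columns named like pressures, 2 and 1 filler columns
toy <- data.frame(`0` = c(0.1, 0.2), `5` = c(0.3, 0.4), check.names = FALSE)
padded <- pad_columns(toy, c(2, 1))
names(padded)
```

For the two lists in the question this would be applied pairwise, e.g. Map(pad_columns, KDzlambdaEG, dif), provided each element of dif has one entry per column of the matching data frame.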
How to split automatically a matrix using R for 5-fold cross-validation?
I actually want to generate the 5 sets of (test_matrix_indices, train_matrix_indices).
I suppose you want the matrix rows to be the cases to split. Then all you need is sample and split :
X <- matrix(rnorm(1000),ncol=5)
id <- sample(1:5,nrow(X),replace=TRUE)
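# split() on a matrix would treat it as a plain vector, so split the row numbers instead
ListX <- lapply(split(seq_len(nrow(X)), id), function(i) X[i, , drop = FALSE]) # gives you a list with the 5 matrices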
X[id==2,] # gives you the second matrix
I'd work with the list, as it allows you to do something like :
names(ListX) <- c("Train1","Train2","Train3","Test1","Test2")
mean(ListX$Train3)
which makes for code that's easier to read, and keeps you from creating tons of matrices in your workspace. You're bound to mess up if you put the matrices individually in your workspace. Use lists!
In case you want the test matrix to be smaller or larger than the other ones, use the prob argument of sample :
id <- sample(1:5,nrow(X),replace=TRUE,prob=c(0.15,0.15,0.15,0.15,0.3))
gives you a fifth matrix that is, on average, double the size of the other four (sample() rescales prob, so it need not sum to 1).
In case you want to determine the exact number of cases, sample and prob aren't the best options. You could use a trick like :
indices <- rep(1:5,c(100,20,20,20,40))
id <- sample(indices)
to get matrices with respectively 100, 20, ... and 40 cases.
f_K_fold <- function(Nobs,K=5){
rs <- runif(Nobs)
id <- seq(Nobs)[order(rs)]
k <- as.integer(Nobs*seq(1,K-1)/K)
k <- matrix(c(0,rep(k,each=2),Nobs),ncol=2,byrow=TRUE)
k[,1] <- k[,1]+1
l <- lapply(seq.int(K),function(x,k,d)
list(train=d[!(seq(d) %in% seq(k[x,1],k[x,2]))],
test=d[seq(k[x,1],k[x,2])]),k=k,d=id)
return(l)
}
Solution without split:
set.seed(7402313)
X <- matrix(rnorm(999), ncol=3)
k <- 5 # number of folds
# Generating random indices
id <- sample(rep(seq_len(k), length.out=nrow(X)))
table(id)
# 1 2 3 4 5
# 67 67 67 66 66
# lapply over them:
indicies <- lapply(seq_len(k), function(a) list(
test_matrix_indices = which(id==a),
train_matrix_indices = which(id!=a)
))
str(indicies)
# List of 5
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 12 13 14 17 18 20 23 28 41 45 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 5 6 7 8 9 10 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 4 19 31 36 47 53 58 67 83 89 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 5 6 7 8 9 10 11 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:67] 5 8 9 30 32 35 37 56 59 60 ...
# ..$ train_matrix_indices: int [1:266] 1 2 3 4 6 7 10 11 12 13 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:66] 1 2 3 6 21 24 27 29 33 34 ...
# ..$ train_matrix_indices: int [1:267] 4 5 7 8 9 10 11 12 13 14 ...
# $ :List of 2
# ..$ test_matrix_indices : int [1:66] 7 10 11 15 16 22 25 26 40 42 ...
# ..$ train_matrix_indices: int [1:267] 1 2 3 4 5 6 8 9 12 13 ...
But you could return matrices too:
matrices <- lapply(seq_len(k), function(a) list(
test_matrix = X[id==a, ],
train_matrix = X[id!=a, ]
))
str(matrices)
# List of 5
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] -1.0132 -1.3657 -0.3495 0.6664 0.0762 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] 0.484 0.418 -0.622 0.996 0.414 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.682 0.186 ...
# $ :List of 2
# ..$ test_matrix : num [1:67, 1:3] 0.682 0.812 -1.111 -0.467 0.37 ...
# ..$ train_matrix: num [1:266, 1:3] -0.65 0.797 0.689 0.484 0.186 ...
# $ :List of 2
# ..$ test_matrix : num [1:66, 1:3] -0.65 0.797 0.689 0.186 -1.398 ...
# ..$ train_matrix: num [1:267, 1:3] 0.484 0.682 0.473 0.812 -1.111 ...
# $ :List of 2
# ..$ test_matrix : num [1:66, 1:3] 0.473 0.212 -2.175 -0.746 1.707 ...
# ..$ train_matrix: num [1:267, 1:3] -0.65 0.797 0.689 0.484 0.682 ...
Then you could use lapply to get results:
lapply(matrices, function(x) {
m <- build_model(x$train_matrix)
performance(m, x$test_matrix)
})
Edit: compare to Wojciech's solution:
f_K_fold <- function(Nobs, K=5){
id <- sample(rep(seq.int(K), length.out=Nobs))
l <- lapply(seq.int(K), function(x) list(
train = which(x!=id),
test = which(x==id)
))
return(l)
}
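Whichever variant you pick, it is worth checking that the folds really partition the data: every observation should land in exactly one test set, and train and test should never overlap within a fold. A small sketch of such a check, using the same sample(rep(...)) construction with made-up sizes (23 observations, 5 folds):

```r
set.seed(42)
Nobs <- 23
K <- 5

# Shuffled fold labels, as balanced as possible
id <- sample(rep(seq_len(K), length.out = Nobs))

folds <- lapply(seq_len(K), function(x) list(
  train = which(id != x),
  test  = which(id == x)
))

# Every observation is used for testing exactly once
test_all <- sort(unlist(lapply(folds, function(f) f$test)))
covers_all <- identical(test_all, seq_len(Nobs))

# Train and test never share an index within a fold
no_overlap <- all(vapply(folds, function(f) {
  length(intersect(f$train, f$test)) == 0
}, logical(1)))

# Fold sizes differ by at most one
fold_sizes <- vapply(folds, function(f) length(f$test), integer(1))
```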
Edit : Thanks for your answers.
I have found the following solution (http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/fr_Tanagra_Validation_Croisee_Suite.pdf) :
n <- nrow(mydata)
K <- 5
size <- n %/% K
set.seed(5)
rdm <- runif(n)
ranked <- rank(rdm)
block <- (ranked-1) %/% size+1
block <- as.factor(block)
Then I use :
for (k in 1:K) {
matrix_train <- mydata[block != k, ]
matrix_test <- mydata[block == k, ]
# [Algorithm sequence]
}
in order to generate the appropriate sets for each iteration.
However this solution can omit one individual for tests. I do not recommend it.
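To see where observations get lost, run the same block computation with a sample size that is not a multiple of K. A quick sketch with a made-up n = 23:

```r
set.seed(5)
n <- 23          # deliberately not divisible by K
K <- 5
size <- n %/% K  # 4

rdm <- runif(n)
ranked <- rank(rdm)
block <- (ranked - 1) %/% size + 1

# Ranks 21 to 23 end up in a sixth block
block_sizes <- table(block)
block_sizes

# A loop over k in 1:K never uses block 6 as a test set
never_tested <- sum(block > K)
never_tested
```

Those leftover observations are still used for training in every fold, but they are never predicted, so the cross-validated error is computed on 20 observations instead of 23.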
The code below does the trick without having to create separate data frames/matrices. All you need to do is keep an integer sequence, id, that stores the shuffled fold label for each row.
X <- read.csv('data.csv')
k = 5 # number of folds
indices <- rep(seq_len(k), length.out = nrow(X)) # fold labels 1..k; also works when nrow(X) is not a multiple of k
id <- sample(indices) # shuffle the labels (random draws without replacement)
log_models <- new.env(hash = TRUE, parent = emptyenv())
for (i in 1:k){
train <- X[id != i,]
test <- X[id == i,]
# run algorithm, e.g. logistic regression
log_models[[as.character(i)]] <- glm(outcome~., family="binomial", data=train)
}
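The loop above only fits the models. To get an actual cross-validated error you also need to predict on each held-out fold. Here is a minimal runnable sketch of that step; it swaps in the built-in mtcars data and a linear model (lm with mpg ~ wt + hp) purely as stand-ins, since data.csv and its outcome column are not available here.

```r
set.seed(1)
k <- 5

# Shuffled fold label for every row of the stand-in data set
id <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# One RMSE per fold: fit on the other folds, predict on the held-out one
rmse <- sapply(seq_len(k), function(i) {
  train <- mtcars[id != i, ]
  test  <- mtcars[id == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))
})

cv_rmse <- mean(rmse)  # average held-out error across the k folds
```

For the logistic regression in the loop above you would call predict(..., type = "response") on the glm fit and replace the RMSE with a classification metric.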
The sperrorest package provides this ability. You can choose between a random split (partition.cv()), a spatial split (partition.kmeans()), or a split based on factor levels (partition.factor.cv()). The latter is currently only available in the GitHub version.
Example:
library(sperrorest)
data(ecuador)
## non-spatial cross-validation:
resamp <- partition.cv(ecuador, nfold = 5, repetition = 1:1)
# first repetition, second fold, test set indices:
idx <- resamp[['1']][[2]]$test
# test sample used in this particular repetition and fold:
ecuador[idx , ]
If you have a spatial data set (with coords), you can also visualize your generated folds
# this may take some time...
plot(resamp, ecuador)
Cross-validation can then be performed using sperrorest() (sequential) or parsperrorest() (parallel).