Optimize my regression using vectorization instead of a for loop

How would I vectorize this loop? With the backward stepwise regression inside the loop, it takes over 15 minutes to run. (My full dataset has over 4,000 observations and 20+ independent variables.) Any idea how I would vectorize this? I'm new to the whole concept.
I've looked into turning this into a function and then using an ifelse statement to pick the training and validation sets, but I haven't been able to get that to work. Any ideas?
Here is a small dataset:
name <- c("Joe I.", "Joe I.", "Joe I.", "Joe I.", "Jane P.", "Jane P.", "Jane P.", "Jane P.",
"John K.", "John K.", "John K.", "John K.")
name_id <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
grade <- c(80, 99, 70, 65, 88, 90, 76, 65, 67, 68, 89, 67)
score <- c(82, 93, 72, 61, 89, 93, 71, 63, 64, 65, 82, 62)
attendance <- c(80, 99, 82, 62, 70, 65, 88, 90, 76, 93, 71, 99)
participation <- c(71, 63, 64, 71, 99, 76, 65, 67, 93, 72, 68, 89)
# Build the data frame directly: data.frame() keeps the numeric columns numeric,
# whereas cbind() coerces everything to character and forces as.numeric() afterwards.
# (The original call also listed a `class` column that is never defined above.)
df <- data.frame(name, name_id, grade, score, attendance, participation)
Here is the loop:
library(MASS)      # stepAIC()
library(magicfor)  # magic_for(), magic_result_as_dataframe()
magic_for(print, silent = TRUE)
for (i in 1:3) {
  validation <- df[df$name_id == i, ]
  training <- df[df$name_id != i, ]
  m <- lm(score ~ grade + attendance + participation, data = training)
  stepm <- stepAIC(m, direction = "backward", trace = FALSE)
  pred1 <- predict(stepm, validation)
  print(pred1)
}
options(max.print = 999999)
pred1 <- magic_result_as_dataframe()

I am not sure whether the following code will speed up your program, but please give it a try. Here df is first split by df$name_id, so that you get one chunk per name_id:
library(MASS)  # stepAIC()
dfs <- split(df, df$name_id)
lapply(seq_along(dfs), function(k) {
  validation <- dfs[[k]]
  training <- do.call(rbind, dfs[-k])  # every chunk except the k-th
  m <- lm(score ~ grade + attendance + participation, data = training)
  stepm <- stepAIC(m, direction = "backward", trace = FALSE)
  predict(stepm, validation)
})
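
The per-group predictions can also be collected into one data frame without magicfor. The following is only a minimal sketch on the toy data, assuming the intended model is score ~ grade + attendance + participation; the helper name predict_fold and the output column names are made up for illustration. Note that lapply is itself a loop, so this mainly tidies how results are gathered. The stepwise fit is still run once per held-out group.

```r
library(MASS)  # stepAIC()

# Toy data from the question (the undefined `class` column is left out)
df <- data.frame(
  name_id = rep(1:3, each = 4),
  grade = c(80, 99, 70, 65, 88, 90, 76, 65, 67, 68, 89, 67),
  score = c(82, 93, 72, 61, 89, 93, 71, 63, 64, 65, 82, 62),
  attendance = c(80, 99, 82, 62, 70, 65, 88, 90, 76, 93, 71, 99),
  participation = c(71, 63, 64, 71, 99, 76, 65, 67, 93, 72, 68, 89)
)

# Hypothetical helper: hold out one name_id, fit on the rest,
# then predict the held-out rows with the stepwise-reduced model
predict_fold <- function(id, data) {
  held_out <- data$name_id == id
  training <- data[!held_out, ]
  full <- lm(score ~ grade + attendance + participation, data = training)
  reduced <- stepAIC(full, direction = "backward", trace = FALSE)
  data.frame(
    name_id = id,
    observed = data$score[held_out],
    predicted = unname(predict(reduced, data[held_out, ]))
  )
}

# One data frame with a row per observation, stacked fold by fold
cv_pred <- do.call(rbind, lapply(unique(df$name_id), predict_fold, data = df))
```

Because each fold is independent of the others, the same helper could be handed to a parallel apply function if the per-fold fit turns out to be the bottleneck on the full dataset.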

Related

Fit custom function to data

I have data produced by a specific function:
y(t) = C * (t - t0)^alpha / (1 + (q - 1) * beta * (t - t0)^gamma)^(1/(q - 1))
where t0 = 1, and alpha, q, gamma, C and beta are unknown parameters.
The question is how to fit the above function to the following data in R.
mydata<-structure(list(x = 1:100, y = c(0, 0, 2, 1, 3, 4, 4, 3, 7, 8,
9, 11, 12, 11, 15, 15, 17, 21, 49, 43, 117, 75, 85, 97, 113,
129, 135, 147, 149, 149, 123, 129, 127, 122, 143, 157, 144, 139,
123, 117, 141, 138, 124, 134, 158, 151, 136, 133, 121, 117, 122,
125, 117, 111, 98, 94, 92, 89, 73, 87, 91, 88, 94, 90, 93, 76,
60, 96, 71, 80, 71, 63, 65, 47, 74, 63, 78, 68, 55, 48, 51, 45,
48, 50, 71, 48, 35, 51, 69, 62, 64, 66, 51, 59, 58, 34, 57, 56,
63, 50)), class = "data.frame", row.names = c(NA, -100L))
I defined the function as follows:
t0 <- 1
fyy <- function(t, cc0, alpha0, qq0, beta0, gamma0){
  ret <- cc0*((t-t0)^alpha0)/(((1+(qq0-1)*beta0*(t-t0)^gamma0))^(1/(qq0-1)))
  return(ret)
}
But I don't know how to continue.
As @mhovd suggested, I used the nls function, but I got the following error:
> fit <- nls(y ~ fyy(x, cc0, alpha0, beta0, gamma0, qq0),
    data = data.frame(mydata),
    start = list(cc0 = .01, alpha0 = 1, beta0 = .3, gamma0 = 2, qq0 = 1))
Error in numericDeriv(form[[3L]], names(ind), env) :
Missing value or an infinity produced when evaluating the model

In the comments, @masoud references a paper about the specific function in the question. It suggests fixing gamma0 and qq0; if we do that, we do get a solution, fm, shown in red in the plot. We also show an alternative parametric curve, fm2, in blue. It also has 3 optimized parameters but a lower residual sum of squares (lower is better).
fyy <- function(t,cc0,alpha0,qq0,beta0,gamma0){
cc0 * ((t-t0)^alpha0) / (((1+(qq0-1)*beta0*(t-t0)^gamma0))^(1/(qq0-1)))
}
mydata0 <- subset(mydata, y > 0)
# fixed values
t0 <- 1
gamma0 <- 3
qq0 <- 1.2
st <- list(cc0 = 1, alpha0 = 1, beta0 = 1) # starting values
fm <- nls(y ~ fyy(x, cc0, alpha0, qq0, beta0, gamma0), mydata0,
lower = list(cc0 = 0.1, alpha0 = 0.1, beta0 = 0.00001),
start = st, algorithm = "port")
deviance(fm) # residual sum of squares
## [1] 61458.5
st2 <- list(a = 1, b = 1, c = 1)
fm2 <- nls(y ~ exp(a + b/x + c*log(x)), mydata0, start = st2)
deviance(fm2) # residual sum of squares
## [1] 16669.24
plot(mydata0, ylab = "y", xlab = "t")
lines(fitted(fm) ~ x, mydata0, col = "red")
lines(fitted(fm2) ~ x, mydata0, col = "blue")
legend("topright", legend = c("fm", "fm2"), lty = 1, col = c("red", "blue"))

How to loop between columns to calculate the variables

Below, assume this is part of the data:
library(tibble)  # tribble()
df <- tribble(
~temp1, ~temp2, ~temp3, ~temp4, ~temp5, ~temp6, ~temp7, ~temp8,
75, 88, 85, 71, 98, 76, 71, 57,
80, 51, 84, 72, 59, 81, 70, 64,
54, 65, 90, 66, 93, 88, 77, 59,
59, 87, 94, 75, 74, 53, 56, 87,
52, 55, 64, 77, 50, 64, 83, 87,
)
Now I want to write a loop to get the results. In this example, temp1 should be paired with temp2 only, temp3 with temp4 only, temp5 with temp6 only, and temp7 with temp8 only.
Suppose I want to run a correlation or a t-test between the intended pairs of variables (temp1 with temp2, temp3 with temp4, temp5 with temp6, temp7 with temp8 only).
I would also like to get only the statistics, for example only the value of r for the correlation. A table would be very helpful.
I have searched, and it seems we need to use the map function, but I struggled to make it work. Can this be done in R?

We can use seq to select alternating columns and map2 to get the correlation between temp1 and temp2, temp3 and temp4, etc.
library(purrr)
out <- map2_dbl(df[seq(1, ncol(df), 2)], df[seq(2, ncol(df), 2)], ~ cor(.x, .y))
names(out) <- paste0("Time", seq_along(out))
Or with Map from base R
out <- unlist(Map(function(x, y) cor(x, y), df[seq(1, ncol(df), 2)],
df[seq(2, ncol(df), 2)]))
names(out) <- paste0("Time", seq_along(out))

You could split your data frame in two: one with columns 1, 3, 5, 7 and the other with columns 2, 4, 6, 8.
Then take one column from each at a time and perform cor or t.test with pmap.
library(purrr)
df %>%
split.default(rep_len(1:2, ncol(.))) %>%
pmap_dbl(~cor(.x,.y))
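
The question also asks for a t-test and for a table of statistics. A base-R sketch of that is below. It rebuilds the example data as a plain data frame and assumes a paired t-test is what is wanted (the question does not say which kind); the names pair_stats, odd and even are illustrative only.

```r
# Example data from the question, as a base data frame
df <- data.frame(
  temp1 = c(75, 80, 54, 59, 52), temp2 = c(88, 51, 65, 87, 55),
  temp3 = c(85, 84, 90, 94, 64), temp4 = c(71, 72, 66, 75, 77),
  temp5 = c(98, 59, 93, 74, 50), temp6 = c(76, 81, 88, 53, 64),
  temp7 = c(71, 70, 77, 56, 83), temp8 = c(57, 64, 59, 87, 87)
)

odd  <- df[seq(1, ncol(df), by = 2)]  # temp1, temp3, temp5, temp7
even <- df[seq(2, ncol(df), by = 2)]  # temp2, temp4, temp6, temp8

# One row per pair: Pearson r plus the t statistic and p-value of the t-test
pair_stats <- do.call(rbind, Map(function(x, y, nx, ny) {
  tt <- t.test(x, y, paired = TRUE)  # assumption: paired test
  data.frame(
    pair = paste(nx, ny, sep = "_"),
    r = cor(x, y),
    t = unname(tt$statistic),
    p_value = tt$p.value
  )
}, odd, even, names(odd), names(even)))
rownames(pair_stats) <- NULL

print(pair_stats)
```

Dropping paired = TRUE switches to R's default Welch two-sample test, which may be the better fit if the two columns are independent samples.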

R n most similar time series - dtw clustering / nearest neighbour

The data attached is a simplified example, as in reality I have hundreds of people and hundreds of points in time.
I am looking for a way to determine similar time series.
I have some code here to determine clusters, but this isn't exactly what I want.
What I would like is if I selected one person it would return the names of the n most similar time series.
I.e., if n = 1 and I enter Bob, it would return Dave; however, if I entered Sam, it would return Bob (with these names going into a new column in df). If n = 2, the first column would contain the most similar time series and the second would contain the next most similar. This is similar to k-nearest neighbours but across time series, so that each individual person has a different set of "neighbours".
If this is unfeasible or too difficult, I would alternatively like to specify the number of people in each group rather than the number of groups.
In this example I specified 4 groups, but this does not make 4 groups of 2.
Group B contains 4 people, whilst C and D have only 1 person each.
        hc@cluster
James   A
Dave    B
Bob     B
Joe     C
Robert  A
Michael B
Sam     B
Steve   D
library(dtwclust)
df <- data.frame(
row.names = c("James", "Dave", "Bob", "Joe", "Robert", "Michael", "Sam", "Steve"),
Monday = c(82, 46, 96, 57, 69, 28, 100, 10),
Tuesday = c(77, 62, 112, 66, 54, 34, 107, 20),
Wednesday = c(77, 59, 109, 65, 50, 37, 114, 30),
Thursday = c(73, 92, 142, 77, 54, 30, 128, 40),
Friday = c(74, 49, 99, 90, 50, 25, 111, 50),
Saturday = c(68, 26, 76, 81, 42, 28, 63, 60),
Sunday = c(79, 37, 87, 73, 53, 33, 79, 70)
)
hc<- tsclust(df, type = "h", k = 4,
preproc = zscore, seed = 899,
distance = "sbd", centroid = shape_extraction,
control = hierarchical_control(method = "average"))
plot(hc)
yo <- as.data.frame(hc@cluster)
yo$`hc@cluster` <- LETTERS[yo$`hc@cluster`]
print(yo)

What you want is not to cluster the data but to rank it by similarity to one specific time series; that is where the problem lies. To do that, you first have to choose a measure of "distance", for example Euclidean distance or correlation. The code below implements both. It simply calculates the distance between the selected time series and each of the others, sorts the results, and picks the n closest. Note that the choice of distance measure will change your results.
df <- data.frame(
Monday = c(82, 46, 96, 57, 69, 28, 100, 10),
Tuesday = c(77, 62, 112, 66, 54, 34, 107, 20),
Wednesday = c(77, 59, 109, 65, 50, 37, 114, 30),
Thursday = c(73, 92, 142, 77, 54, 30, 128, 40),
Friday = c(74, 49, 99, 90, 50, 25, 111, 50),
Saturday = c(68, 26, 76, 81, 42, 28, 63, 60),
Sunday = c(79, 37, 87, 73, 53, 33, 79, 70)
)
df <- as.data.frame(t(df))
colnames(df) <- c("James", "Dave", "Bob", "Joe", "Robert", "Michael", "Sam", "Steve")
get_nearest_n <- function(data, name, n = 1){
  #' n must be a positive integer
  #' name must be a column name of data
  #' data must be a data frame
  serie <- data[, name]
  data <- data[, colnames(data) != name, drop = FALSE]
  dist <- sqrt(colSums((data - serie)^2))  # Euclidean distance to every other series
  sorted_names <- names(sort(dist))[seq_len(n)]
  return(data[, sorted_names, drop = FALSE])  # drop = FALSE keeps a data frame when n = 1
}
get_nearest_n2 <- function(data, name, n = 1){
  #' n must be a positive integer
  #' name must be a column name of data
  #' data must be a data frame
  serie <- data[, name]
  data <- data[, colnames(data) != name, drop = FALSE]
  sims <- cor(serie, data)[1, ]  # correlation with every other series, as a named vector
  sorted_names <- names(sort(sims, decreasing = TRUE))[seq_len(n)]
  return(data[, sorted_names, drop = FALSE])
}
get_nearest_n(data = df, name = 'Bob', n = 3)
get_nearest_n2(data = df, name = 'Bob', n = 3)
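
The question asks for the names of the nearest series for every person, stored as new columns, rather than the series themselves. One way this might be sketched in base R is below. It uses Euclidean distance on row-wise z-scored data; the z-scoring is borrowed from the question's preproc = zscore and is an assumption here, as are the helper name nearest_names and the nearest_1, nearest_2, ... column names.

```r
# Data from the question: one row per person, one column per day
df <- data.frame(
  row.names = c("James", "Dave", "Bob", "Joe", "Robert", "Michael", "Sam", "Steve"),
  Monday = c(82, 46, 96, 57, 69, 28, 100, 10),
  Tuesday = c(77, 62, 112, 66, 54, 34, 107, 20),
  Wednesday = c(77, 59, 109, 65, 50, 37, 114, 30),
  Thursday = c(73, 92, 142, 77, 54, 30, 128, 40),
  Friday = c(74, 49, 99, 90, 50, 25, 111, 50),
  Saturday = c(68, 26, 76, 81, 42, 28, 63, 60),
  Sunday = c(79, 37, 87, 73, 53, 33, 79, 70)
)

# Hypothetical helper: for every row, the names of its n nearest rows
nearest_names <- function(data, n = 1) {
  z <- t(scale(t(data)))     # z-score each person's series (assumption)
  d <- as.matrix(dist(z))    # pairwise Euclidean distances between people
  diag(d) <- Inf             # a series is never its own neighbour
  nn <- vapply(rownames(d),
               function(p) names(sort(d[p, ]))[seq_len(n)],
               character(n))
  # vapply returns a vector when n = 1 and an n x people matrix otherwise
  nn <- matrix(nn, nrow = n,
               dimnames = list(paste0("nearest_", seq_len(n)), rownames(d)))
  as.data.frame(t(nn))
}

neighbours <- nearest_names(df, n = 2)
df_with_neighbours <- cbind(df, neighbours)
```

In this data Bob's values are Dave's plus a constant 50 on every day, so after z-scoring the two are each other's nearest neighbour. On the raw values the ranking can differ, which is the answer's point about the choice of distance.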

Trying to create a single graphic with 4 lines representing 4 data frames in a file

My file looks something like this:
Student, HW Average, Weekly Quiz Average, Grade (Midterm)
1, 94, 82, 87
2, 78, 75, 79
3, 83, 72, 79
4, 97, 94, 96
5, 98, 93, 96
6, 97, 88, 93
7, 93, 88, 91
8, 77, 58, 72
9, 85, 71, 78
10, 96, 79, 90
(and so on, up to student 66)
The code I tried using was the following.
library(readr)
Students <- read_csv("Julies_Data.csv", quote = ",")
library(ggplot2)
df <- data.frame(Students=r(2...66))
new.df <- NULL
for (i in seq(2, ncol(df), 2)) {
new.df <- rbind(new.df,
data.frame(no=i/2,
"HW Average" = df[, i],
"Weekly Quiz Average"= df[, i+1],
"Grade Midterm" = df[, i+2]))
}
qplot("HW Average", "Weekly Quiz Average", "Grade Midterm",
group=no, data=new.df, col=factor(no), geom="line")

How to paste names of an object within a loop?

root <- c(54, 7, 125, 80, 0, 60, 30, 51, 23, 7)
smith <- c(112, 2, 104, 19, 7, 12, 61, 80, 4, 49)
kohli <- c(43, 31, 25, 186, 75, 37, 40, 1, 4, 130)
williamson <- c(22, 73, 98, 38, 61, 9, 22, 101, 0, 12)
batsmen <- list(root, smith, kohli, williamson)
bat_names <- c("Joe Root", "Steve Smith", "Virat Kohli", "Kane Williamson")
names(batsmen) <- bat_names
bat_averages <- function(x) {
paste(names(x), "average is", mean(x))
}
lapply(batsmen, bat_averages)
$`Joe Root`
[1] " average is 43.7"
$`Steve Smith`
[1] " average is 45"
$`Virat Kohli`
[1] " average is 57.2"
$`Kane Williamson`
[1] " average is 43.6"
How can I print the names of each object in batsmen inside the function? I think I understand why using a loop/lapply doesn't work with names(), but I can't figure out how to get the desired result. Many thanks.
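
For illustration only: lapply passes each element (a plain numeric vector) to the function, and those vectors carry no names of their own, which is why names(x) comes back empty. One possible sketch is to iterate over the names and index back into the list. The two-argument form of bat_averages and the result name res are assumptions, not part of the original code.

```r
batsmen <- list(
  "Joe Root" = c(54, 7, 125, 80, 0, 60, 30, 51, 23, 7),
  "Steve Smith" = c(112, 2, 104, 19, 7, 12, 61, 80, 4, 49),
  "Virat Kohli" = c(43, 31, 25, 186, 75, 37, 40, 1, 4, 130),
  "Kane Williamson" = c(22, 73, 98, 38, 61, 9, 22, 101, 0, 12)
)

# Takes the player's name and looks the scores up in the list,
# so the name is available inside the function
bat_averages <- function(player, scores) {
  paste(player, "average is", mean(scores[[player]]))
}

# Loop over the names rather than over the list elements
res <- vapply(names(batsmen), bat_averages, character(1), scores = batsmen)
print(res)
```

Map(function(x, nm) paste(nm, "average is", mean(x)), batsmen, names(batsmen)) is an equivalent base-R alternative that returns a list instead of a character vector.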
