Writing data from loops to a matrix in R

I need help. I have Form X (5000 samples) and Form Y (5000 samples). I draw samples (sample sizes varying between 10 and 400) from forms X and Y and equate these forms in R. But I have a problem: I want to write the equated scores (for each sample size and for 100 replications) to a matrix, but I could not manage it. I would be glad if you could help.
My code:
x <- read.table("X_Top_25_SANS_0.csv")
y <- read.table("Y_sum25_SANS_0_SD0.1.csv")
data_xy <- data.frame(x, y)
ii <- c(10, 15, 25, 50, 75, 100, 125, 150, 200, 250, 400) # sample sizes
jj <- 1:100 # replications
################# sample loops ######################
for (i in 1:length(ii)) {    # sample-size loop
  for (j in 1:length(jj)) {  # replication loop
    x_rep <- sample(data_xy[, 1], ii[i], replace = TRUE) # draw sample
    y_rep <- sample(data_xy[, 2], ii[i], replace = TRUE)
    xy_lin <- (sd(y_rep) / sd(x_rep)) * x_rep + mean(y_rep) - (mean(x_rep) * (sd(y_rep) / sd(x_rep)))
  }
}
I want to write "xy_lin" to a matrix.

As I said in my comment, it is difficult to understand the problem without a reproducible example, and I am not sure I understand the formula you apply in the loop correctly, so my solution is based on some guesses. Anyway, try this:
data_xy <- data.frame(x = rnorm(5000), y = rnorm(5000)) # random data
ii <- c(10, 15, 25, 50, 75, 100, 125, 150, 200, 250, 400)
jj <- 1:100
xy_lin <- matrix(NA, nrow = length(ii), ncol = length(jj)) # create the results matrix
for (i in 1:length(ii)) {
  for (j in 1:length(jj)) {
    x_rep <- sample(data_xy[, 1], ii[i], replace = TRUE)
    y_rep <- sample(data_xy[, 2], ii[i], replace = TRUE)
    xy_lin[i, j] <- sd(y_rep) / sd(x_rep) * mean(x_rep) + mean(y_rep) - mean(x_rep) * (sd(y_rep) / sd(x_rep))
  }
}
Note that I changed your formula from ...*x_rep to ...*mean(x_rep), since I think that this is what you had in mind. But I might be mistaken, so please be aware of that.
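If what you actually need is the full vector of equated scores from each replication (rather than one number per cell of the matrix), a list can hold them instead. This is only a sketch built on the same data_xy, ii and jj objects as above; adjust it to whichever summary you really want:
# keep the whole equated-score vector for every sample size / replication
xy_lin_all <- vector("list", length(ii))
for (i in seq_along(ii)) {
  xy_lin_all[[i]] <- vector("list", length(jj))
  for (j in seq_along(jj)) {
    x_rep <- sample(data_xy[, 1], ii[i], replace = TRUE)
    y_rep <- sample(data_xy[, 2], ii[i], replace = TRUE)
    slope <- sd(y_rep) / sd(x_rep)
    xy_lin_all[[i]][[j]] <- slope * x_rep + mean(y_rep) - slope * mean(x_rep)
  }
}
# e.g. the equated scores from the 3rd sample size and 10th replication:
# xy_lin_all[[3]][[10]]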

Related

How to remove two data points from a data set that have a large influence on the regression model

I have found two outlier data points in my data set, but I don't know how to remove them. All of the guides I have found online emphasize plotting the data, but my question does not require plotting, only regression model fitting. I am having great difficulty finding out how to remove the two data points from my data set and then fit a new model to the remaining data.
Here is the code that I have written and the outliers that I found:
library(alr4)
library(MASS)
data(lathe1)
head(lathe1)
y=lathe1$Life
x1=lathe1$Speed
x2=lathe1$Feed
x1_square=(x1)^2
x2_square=(x2)^2
#part A (Box-Cox method show log transformation)
y.regression=lm(y~x1+x2+x1_square+x2_square+(x1*x2)) # use the precomputed squares; (x1)^2 inside a formula does not add a quadratic term
mod=boxcox(y.regression, data=lathe1, lambda = seq(-1, 1, length=10))
best.lam=mod$x[which(mod$y==max(mod$y))]
best.lam
#part B (null-hypothesis F-test)
y.regression1_Reduced=lm(log(y)~1)
y.regression1=lm(log(y)~x1+x2+x1_square+x2_square+(x1*x2))
anova(y.regression1_Reduced, y.regression1)
#part D (F-test of log(Y) without beta1)
y.regression2=lm(log(y)~x2+x2_square)
anova(y.regression1_Reduced, y.regression2)
#part E (Cook's distance and refit)
cooks.distance(y.regression1)
Outliers:
9 10
0.7611370235 0.7088115474
I think you may be able (if execution time / corpus size allows it) to pass through your data using a loop and copy / remove elements by your criteria to obtain your desired result, e.g.
corpus_list_without_outliers = []
for elem in corpus_list:
    if elem.speed <= 10000:  # elem.<any_param_name> <= arbitrary_outlier_cutoff
        corpus_list_without_outliers.append(elem)  # keep it because it is OK :)
print(corpus_list_without_outliers)
# regression algorithm after
This is how I'd see the situation, but you can replace the if above with a remove statement to avoid creating a second list, etc., e.g.
for elem in list(corpus_list):  # iterate over a copy so removal is safe
    if elem.speed > 10000:  # elem.<any_param_name>
        corpus_list.remove(elem)  # drop it because it is an outlier :(
print(corpus_list)
# regression algorithm after
Hope this helps!
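Since the question is actually in R and the two influential observations are already identified by row number (9 and 10), a more direct option is to refit the model from the question without them, using lm's subset argument. This sketch simply reuses the objects defined in the question:
# refit the log-model without observations 9 and 10 (largest Cook's distances)
y.regression1.refit <- lm(log(y) ~ x1 + x2 + x1_square + x2_square + x1 * x2,
                          subset = -c(9, 10))
summary(y.regression1.refit)
cooks.distance(y.regression1.refit)  # check that no remaining point is overly influential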

What is the difference between stats::kmeans and "naive" k-means

I am trying to understand what stats::kmeans does differently from the simple version explained, e.g., on Wikipedia. I am honestly so supremely clueless.
Reading the help for kmeans, I learned that the default algorithm is Hartigan–Wong rather than the more basic method, so there should be a difference, but playing around with some normally distributed variables I couldn't find a case where they differed substantially and predictably.
For reference, this is my utterly horrible code that I tested it against:
## square of the Euclidean metric
my_metric <- function(x = vector(), y = vector()) {
  stopifnot(length(x) == length(y))
  sum((x - y)^2)
}
## data: xy data
## k: number of groups
my_kmeans <- function(data, k, maxIt = 10) {
  ## get length, check that the data columns are equally long and that enough data is provided
  l <- length(data[, 1])
  stopifnot(l == length(data[, 2]))
  stopifnot(l > k)
  ## generate the starting points
  ms <- data[sample(1:l, k), ]
  ## append a group column g to the data and initialize last
  data$g <- 0
  last <- data$g
  it <- 0
  repeat {
    it <- it + 1
    ## iterate through each data point and assign it to the nearest cluster
    for (i in 1:l) {
      distances <- rep(Inf, k)  # one distance per cluster (not hard-coded to 3)
      for (j in 1:k) {
        distances[j] <- my_metric(data[i, c(1, 2)], ms[j, ])
      }
      data$g[i] <- which.min(distances)
    }
    ## update the cluster centres
    for (i in 1:k) {
      points_in_cluster <- data[data$g == i, 1:2]
      ms[i, ] <- c(mean(points_in_cluster[, 1]), mean(points_in_cluster[, 2]))
    }
    ## break condition: nothing changed or iteration limit reached
    if (my_metric(last, data$g) == 0 | it > maxIt) {
      break
    }
    last <- data$g
  }
  data
}
First off, this is (as I just found out) a duplicate of this post.
But I will still try to give an example: when the clusters are well separated, Lloyd's algorithm tends to leave the centres inside the clusters they started in, meaning that some clusters may end up split while others are lumped together.
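If you want to see this for yourself, one rough experiment (not from the linked post) is to hand stats::kmeans the same deliberately poor starting centres and compare the Lloyd and Hartigan-Wong results on well-separated data. Whether the two disagree depends on the random draw, so treat this only as a sketch:
set.seed(42)
# three well-separated clusters of 50 two-dimensional points each
dat <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 5),  ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))
# deliberately poor starting centres: three points from the same cluster
# (Lloyd can stop with an "empty cluster" error for unlucky data; rerun with a new seed if so)
starts <- dat[1:3, ]
fit_lloyd <- kmeans(dat, centers = starts, algorithm = "Lloyd", iter.max = 100)
fit_hw    <- kmeans(dat, centers = starts, algorithm = "Hartigan-Wong", iter.max = 100)
# compare the partitions and the achieved objective values
table(Lloyd = fit_lloyd$cluster, HW = fit_hw$cluster)
c(Lloyd = fit_lloyd$tot.withinss, HW = fit_hw$tot.withinss)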

Find the best value of K for KNN with a for statement

I am trying to write a for statement to find the best value of K in KNN. Unfortunately, I have tried my code snippet several times now, but it does not seem to calculate the correct value. Do you have an idea what is wrong with the statement?
# Tune the value of K using K-fold cross-validation
bestaccuracy <- 0
bestaccuracy
n.folds <- 100
for (k in 1:n.folds) {
  set.seed(1)
  knn.cvac <- knn.cv(train = x.australian.stan, cl = y.australian, k = k)
  knn.cvac.table <- table(knn.cvac, y.australian)
  knn.cvac.accuracy <- sum(diag(knn.cvac.table)) / sum(knn.cvac.table)
  if (bestaccuracy < knn.cvac.accuracy) bestk <- k
  if (bestaccuracy < knn.cvac.accuracy) bestaccuracy <- knn.cvac.accuracy
}
print(bestk)
print(bestaccuracy)
I tested it on a few simulated data sets and it works just fine! The only thing to notice is that there may be several values of K that give the highest accuracy, and the code as written keeps only one of them (because of the way it is coded).
Perhaps you can change the line of your code to this:
if(bestaccuracy< knn.cvac.accuracy) bestk=c(bestk, k)
So you can see all the optimal Ks when you print bestk.
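A slightly different way to get at the same thing is to store every accuracy and inspect the whole vector afterwards. This is only a sketch, assuming x.australian.stan and y.australian are defined as in the question and that knn.cv comes from the class package:
library(class)

# n.folds is 100 as in the question (it is really the largest k tried, not a fold count)
set.seed(1)
accuracies <- sapply(1:n.folds, function(k) {
  pred <- knn.cv(train = x.australian.stan, cl = y.australian, k = k)
  mean(pred == y.australian)  # same accuracy as sum(diag(table(...))) / sum(table(...))
})
which(accuracies == max(accuracies))  # every k that reaches the best accuracy
max(accuracies)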

How to create a loop for Regression

I just started using R for statistical purposes and I appreciate any kind of help.
My task is to make calculations on one index and 20 stocks from the index. The data contains 22 columns (DATE, INDEX, S1 .... S20) and about 4000 rows (one row per day).
First, I imported the .csv file, called it "dataset", and calculated log returns this way, repeating it for all stocks S1-S20 plus the INDEX:
n <- nrow(dataset)
S1 <- dataset$S1
S1_logret <- log(S1[2:n])-log(S1[1:(n-1)])
Secondly, I stored the data in a data.frame:
logret_data <- data.frame(INDEX_logret, S1_logret, S2_logret, S3_logret, S4_logret, S5_logret, S6_logret, S7_logret, S8_logret, S9_logret, S10_logret, S11_logret, S12_logret, S13_logret, S14_logret, S15_logret, S16_logret, S17_logret, S18_logret, S19_logret, S20_logret)
Then I ran the regression (S1 to S20) using the log returns:
S1_Reg1 <- lm(S1_logret~INDEX_logret)
I couldn't figure out how to write the code in a more efficient way and use some function for repetition.
In a further step I have to run a cross-sectional regression for each day in a selected interval. It is impossible to do this manually, and R should provide a quick solution. I am quite unsure how to do this part, and I would also like to use some kind of loop for the previous calculations.
Yet I lack the necessary R coding knowledge. Any help to the point, or advice on literature or tutorials, is highly appreciated! Thank you!
You could provide all the separate dependent variables in a matrix to run your regressions. Something like this:
#example data
Y1 <- rnorm(100)
Y2 <- rnorm(100)
X <- rnorm(100)
df <- data.frame(Y1, Y2, X)
#run all models at once
lm(as.matrix(df[c('Y1', 'Y2')]) ~ df$X)
Out:
Call:
lm(formula = as.matrix(df[c("Y1", "Y2")]) ~ df$X)
Coefficients:
Y1 Y2
(Intercept) -0.15490 -0.08384
df$X -0.15026 -0.02471
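For the first part of the question (building all the log-return series without writing the same line 21 times), something along these lines should work, assuming the columns really are named DATE, INDEX, S1, ..., S20 as described:
# log returns for the index and all 20 stocks in one step
# (diff(log(p)) equals log(p[2:n]) - log(p[1:(n-1)]) from the question)
logret_data <- as.data.frame(lapply(dataset[, -1], function(p) diff(log(p))))
names(logret_data) <- paste0(names(dataset)[-1], "_logret")

# the multi-response regression from the answer, applied to these columns
fits <- lm(as.matrix(logret_data[paste0("S", 1:20, "_logret")]) ~ INDEX_logret,
           data = logret_data)
coef(fits)  # one column of coefficients per stock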

Dynamic linear regression loop for different order summation

I've been trying hard to recreate this model in R:
(model equation image from FARHANI 2012)
I've tried many things, such as building the formula with cumsum and paste; however, that would not work, as I could not assign the strings to the correct variables because R kept treating L as a function.
I tried to do it manually; I'm only looking at p, q = 1, 2, 3, 4, 5, but after starting I realized how inefficient this is.
This is essentially what I am trying to do
model5 <- vector("list",20)
#p=1-5, q=0
model5[[1]] <- dynlm(DLUSGDP~L(DLUSGDP,1))
model5[[2]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2))
model5[[3]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3))
model5[[4]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4))
model5[[5]] <- dynlm(DLUSGDP~L(DLUSGDP,1)+L(DLUSGDP,2)+L(DLUSGDP,3)+L(DLUSGDP,4)+L(DLUSGDP,5))
I'm also trying to do this for regressing DLUSGDP on DLWTI (my oil variable's name) for p = 0, q = 1-5 and also for p = 1-5, q = 1-5.
cumsum would not work, as it would sum the variables rather than treating them as independent regressors.
My goal is to run these models and then use IC to determine which should be analyzed further.
I hope you understand my problem and any help would be greatly appreciated.
I think this is what you are looking for:
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
where n is some order you want to try. For example,
n <- 3
reformulate(paste0("L(DLUSGDP,", 1:n,")"), "DLUSGDP")
# DLUSGDP ~ L(DLUSGDP, 1) + L(DLUSGDP, 2) + L(DLUSGDP, 3)
Then you can construct your model fitting by
model5 <- vector("list", 20)
for (i in 1:20) {
  form <- reformulate(paste0("L(DLUSGDP,", 1:i, ")"), "DLUSGDP")
  model5[[i]] <- dynlm(form)
}
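If you also need the versions with DLWTI lags (the p/q grid mentioned in the question), the same reformulate idea extends. This is only a sketch, assuming DLUSGDP and DLWTI are available as time series and dynlm is loaded:
library(dynlm)

models <- list()
for (p in 0:5) {
  for (q in 0:5) {
    if (p == 0 && q == 0) next  # skip the empty model
    rhs <- c(if (p > 0) paste0("L(DLUSGDP,", 1:p, ")"),
             if (q > 0) paste0("L(DLWTI,", 1:q, ")"))
    form <- reformulate(rhs, response = "DLUSGDP")
    models[[paste0("p", p, "q", q)]] <- dynlm(form)
  }
}
# compare the fits by information criterion
# (note the estimation samples shrink as more lags are added)
sapply(models, AIC)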
