Delete a row and recalculate R^2

Delete a row and recalculate R^2 - r

I have coded the following in R:
User chooses a file that contains 2 columns (V1 and V2), with numerous rows (number of rows will vary depending on input file)
The script calculates the rsq of the relationship between 2 the variables. There can be anything from 10 to 1000 rows of data depending on the input file.
I want to code the following:
The code should loop through all rows, removing/omitting/ignoring one row at a time and calculating the new rsq with this row missing. So, for example:
There are 10 rows of data and the total rsq = 0.97
Step1: The first row of data are removed from the equation
The rsq is calculated again, but this time for 9 rows, giving rsq = 0.98.
Step 2:The 1st row is re-added and the 2nd row is removed
rsq is calculated again
Step 3: The second row is re-added and the 3rd row is removed
rsq is calculated again
After each loop the "new rsq" will be placed in a new column next to the row that was removed.
Can anyone advise how to do this? I have this coded in excel and it works well but is cumbersome and therefore not ideal.

Do you want to do something like this?
# Make some sample data
set.seed(1095)
data <- data.frame( V1 = 1:10 , V2 = sample.int(5 ,10 ,repl = TRUE ) )
# Use sapply to get r2 removing each row at a time
r2 <- sapply( 1:nrow(data) , function(x) ( cor( data[-x,1] , data[-x,2] ) )^2 )
# Combine into a data frame
newdata <- cbind( data , r2 )
newdata
# V1 V2 r2
# 1 1 5 0.2526316
# 2 2 3 0.4657601
# 3 3 5 0.3204721
# 4 4 5 0.3691612
# 5 5 1 0.5405405
# 6 6 3 0.3769480
# 7 7 3 0.3840426
# 8 8 2 0.3409425
# 9 9 1 0.2725806
# 10 10 3 0.4986702

Related

Beta estimation over panel data by group

I found some previous questions on this topic especially this R: Grouped rolling window linear regression with rollapply and ddply and R: Rolling / moving avg by group , however, both questions did not provide an exact solution for the problem that I am facing. I am currently trying to estimate CAPM beta over panel data using a linear regression. So I have different funds (in the example below I used 3 fund groups) for which I would like to calculate the betas separately and per row. To put this more abstract: I am trying to do a linear regression with a moving window by group to estimate the coefficient for every row based on the data in the window.
install.packages("zoo","dplyr")
library(zoo);library(dplyr)
# Create dataframe
fund <- as.numeric(c(1,1,1,1,1,1,1,1,3,3,3,3,3,3,2,2,2,2,2,2,2))
return<- as.numeric(c(1:21))
benchmark <- as.numeric(c(1,13,14,20,14,32,4,1,5,7,1,0,7,1,-2,1,6,-7,9,10,9))
riskfree<-as.numeric(c(1,5,1,2,1,6,4,7,5,-5,10,0,3,1,2,1,6,7,8,9,10))
date <- as.Date(c("2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2011-02-28","2010-07-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30"))
funddata<-data.frame(date,fund,return,benchmark,riskfree)
# Creating variables of interest
funddata["ret_riskfree"]<-as.numeric(funddata$return-funddata$riskfree)
funddata["benchmark_riskfree"]<-as.numeric(funddata$benchmark-funddata$riskfree)
I want to do a rolling regression over two columns df[6:7] for every group indicated by the column "fund". The calculation should be done separately so the first two rows in the beta column for every fund group will always show "NA". In the end I want to have a full dataframe with all fund groups and all beta values combined.
I managed to come up with a new code that works but is pretty messy and it requires to order the data by fund & date before executing. I would welcome any suggestions on how to make it better.
funddata <- funddata[order(funddata$fund, funddata$date),]
beta_func <- function(x, benchmark_riskfree, ret_riskfree) {
a <- coef(lm(as.formula(paste(ret_riskfree, "~", benchmark_riskfree,-1)),
data = x))
return(a)
}
beta_list<-list()
for (i in c(1:3)){beta_list[[paste(i, sep="_")]]<- (rollapplyr(funddata[(funddata$fund==i),6:7], width = 3,
FUN = function(x) beta_func(as.data.frame(x), "benchmark_riskfree" , "ret_riskfree"),
by.column = FALSE,fill=NA))}
beta_list<-unlist(beta_list, recursive=FALSE)
funddata$beta<-beta_list

As I mentioned in the comment above, this solution might be a bit off since I'm not able to reproduce your desired output 100%. Still, the functionality of what you're trying to accomplish is there. Have a look at it and let me know if this is something you could use or I could develop further.
EDIT: The code below does not reproduce the desired output as specified above, but turned out to be what the OP was looking for after all.
Here goes:
# Datasource
fund <- as.numeric(c(1,1,1,1,1,1,1,1,3,3,3,3,3,3,2,2,2,2,2,2,2))
return<- as.numeric(c(1:21))
benchmark <- as.numeric(c(1,13,14,20,14,32,4,1,5,7,1,0,7,1,-2,1,6,-7,9,10,9))
riskfree<-as.numeric(c(1,5,1,2,1,6,4,7,5,-5,10,0,3,1,2,1,6,7,8,9,10))
date <- as.Date(c("2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2011-02-28","2010-07-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30"))
funddata<-data.frame(date,fund,return,benchmark,riskfree)
# Creating variables of interest
funddata["ret_riskfree"]<-as.numeric(funddata$return-funddata$riskfree)
funddata["benchmark_riskfree"]<-as.numeric(funddata$benchmark-funddata$riskfree)
# Target check #################################################################
# Subset last three rows in original dataframe
df_check <- funddata[funddata$fund == 1,]
df_check <- tail(df_check,3)
# Run regression check
mod_check <- lm(df_check$ret_riskfree~df_check$benchmark_riskfree)
coef(mod_check)
# My suggestion ################################################################
# The following function takes three arguments:
# 1. a dataframe, myDf
# 2. a column that you'd like to myDf on
# 3. a window length for a sliding window, myWin
fun_rollreg <- function(myDf, subCol, varY, varX, myWin){
df_main <- myDf
# Make an empty data frame to store results in
df_data <- data.frame()
# Identify unique funds
unFunds <- unique(unlist(df_main[subCol]))
# Loop through your subset
for (fundx in unFunds){
# Subset
df <- df_main
df <- df[df$fund == fundx,]
# Keep a copy of the original until later
df_new <- df
# Specify a container for your beta estimates
betas <- c()
# Specify window length
wlength <- myWin
# Retrieve some data dimensions to loop on
rows = dim(df)[1]
periods <- rows - wlength
# Loop through each subset of the data
# and run regression
for (i in rows:(rows - periods)){
# Split dataframe in subsets
# according to the window length
df1 <- df[(i-(wlength-1)):i,]
# Run regression
beta <- coef(lm(df1[[varY]]~df1[[varX]]))[2]
# Keep regression ressults
betas[[i]] <- beta
}
# Add regression data to dataframe
df_new <- data.frame(df, betas)
# Keep the new dataset for later concatenation
df_data <- rbind(df_data, df_new)
}
return(df_data)
}
# Run the function:
df_roll <- fun_rollreg(myDf = funddata, subCol = 'fund',
varY <- 'ret_riskfree', varX <- 'benchmark_riskfree',
myWin = 3)
# Show the results
print(head(df_roll,8))
For the first 8 rows in the new dataframe (fund = 1), this is the result:
date fund return benchmark riskfree ret_riskfree benchmark_riskfree betas
1 2010-07-30 1 1 1 1 0 0 NA
2 2010-08-31 1 2 13 5 -3 8 NA
3 2010-09-30 1 3 14 1 2 13 0.10465116
4 2010-10-31 1 4 20 2 2 18 0.50000000
5 2010-11-30 1 5 14 1 4 13 -0.20000000
6 2010-12-31 1 6 32 6 0 26 -0.30232558
7 2011-01-30 1 7 4 4 3 0 -0.11538462
8 2011-02-28 1 8 1 7 1 -6 -0.05645161

'Random' Sorting with a condition in R for Psychology Research

I have Valence Category for word stimuli in my psychology experiment.
1 = Negative, 2 = Neutral, 3 = Positive
I need to sort the thousands of stimuli with a pseudo-randomised condition.
Val_Category cannot have more than 2 of the same valence stimuli in a row i.e. no more than 2x negative stimuli in a row.
for example - 2, 2, 2 = not acceptable
2, 2, 1 = ok
I can't sequence the data i.e. decide the whole experiment will be 1,3,2,3,1,3,2,3,2,2,1 because I'm not allowed to have a pattern.
I tried various packages like dylpr, sample, order, sort and nothing so far solves the problem.

I think there's a thousand ways to do this, none of which are probably very pretty. I wrote a small function that takes care of the ordering. It's a bit hacky, but it appeared to work for what I tried.
To explain what I did, the function works as follows:
Take the vector of valences and samples from it.
If sequences are found that are larger than the desired length, then, (for each such sequence), take the last value of that sequence at places it "somewhere else".
Check if the problem is solved. If so, return the reordered vector. If not, then go back to 2.
# some vector of valences
val <- rep(1:3,each=50)
pseudoRandomize <- function(x, n){
# take an initial sample
out <- sample(val)
# check if the sample is "bad" (containing sequences longer than n)
bad.seq <- any(rle(out)$lengths > n)
# length of the whole sample
l0 <- length(out)
while(bad.seq){
# get lengths of all subsequences
l1 <- rle(out)$lengths
# find the bad ones
ind <- l1 > n
# take the last value of each bad sequence, and...
for(i in cumsum(l1)[ind]){
# take it out of the original sample
tmp <- out[-i]
# pick new position at random
pos <- sample(2:(l0-2),1)
# put the value back into the sample at the new position
out <- c(tmp[1:(pos-1)],out[i],tmp[pos:(l0-1)])
}
# check if bad sequences (still) exist
# if TRUE, then 'while' continues; if FALSE, then it doesn't
bad.seq <- any(rle(out)$lengths > n)
}
# return the reordered sequence
out
}
Example:
The function may be used on a vector with or without names. If the vector was named, then these names will still be present on the pseudo-randomized vector.
# simple unnamed vector
val <- rep(1:3,each=5)
pseudoRandomize(val, 2)
# gives:
# [1] 1 3 2 1 2 3 3 2 1 2 1 3 3 1 2
# when names assigned to the vector
names(val) <- 1:length(val)
pseudoRandomize(val, 2)
# gives (first row shows the names):
# 1 13 9 7 3 11 15 8 10 5 12 14 6 4 2
# 1 3 2 2 1 3 3 2 2 1 3 3 2 1 1
This property can be used for randomizing a whole data frame. To achieve that, the "valence" vector is taken out of the data frame, and names are assigned to it either by row index (1:nrow(dat)) or by row names (rownames(dat)).
# reorder a data.frame using a named vector
dat <- data.frame(val=rep(1:3,each=5), stim=rep(letters[1:5],3))
val <- dat$val
names(val) <- 1:nrow(dat)
new.val <- pseudoRandomize(val, 2)
new.dat <- dat[as.integer(names(new.val)),]
# gives:
# val stim
# 5 1 e
# 2 1 b
# 9 2 d
# 6 2 a
# 3 1 c
# 15 3 e
# ...

I believe this loop will set the Valence Category's appropriately. I've called the valence categories treat.
#Generate example data
s1 = data.frame(id=c(1:10),treat=NA)
#Setting the first two rows
s1[1,"treat"] <- sample(1:3,1)
s1[2,"treat"] <- sample(1:3,1)
#Looping through the remainder of the rows
for (i in 3:length(s1$id))
{
s1[i,"treat"] <- sample(1:3,1)
#Check if the treat value is equal to the previous two values.
if (s1[i,"treat"]==s1[i-1,"treat"] & s1[i-1,"treat"]==s1[i-2,"treat"])
#If so draw one of the values not equal to that value
{
a = 1:3
remove <- s1[i,"treat"]
a=a[!a==remove]
s1[i,"treat"] <- sample(a,1)
}
}
This solution is not particularly elegant. There may be a much faster way to accomplish this by sorting several columns or something.

R ignore missing data

I have two R data files each with 100 columns but row number vary from 220 to 360 in each data1 and data2. data1 and data2 represent changes of two quantities changes during a set of experiments. so [i,j] of data1 and[i,j] of data2 represent same event, but will have different value. I want to print data which is greater than 2.5 in any of the file, along with the column and row number
for (i in 1:360){
for (j in 1:100){
if((data1[i,j]>2.5) | ( data2[i,j]>2.5)) {
cat(i, j, data1[i,j], data2[i,j], "\n", file="extr-b2.5.txt", append=T)
}
}
}
I get this error because of NAs.
Error in if ((data1[i, j] > 2.5) | (data2[i, j] > :
missing value where TRUE/FALSE needed
if I set i to 1:220 (every column has at least 220 row), it works fine.
How can modify above code to neglect NA values.

I would something like this :
idx <- which(dat1>2.5 & dat2>2.5,arr.ind=TRUE)
cbind(idx,v1=dat1[idx],v2=dat2[idx])
reproducible example:
set.seed(1)
dat1 <- as.data.frame(matrix(runif(12,1,5),ncol=3))
dat2 <- as.data.frame(matrix(runif(12,1,5),ncol=3))
idx <- which(dat1>2.5 & dat2>2.5,arr.ind=TRUE)
cbind(idx,v1=dat1[idx],v2=dat2[idx])
# row col v1 v2
# [1,] 3 1 3.291413 4.079366
# [2,] 4 1 4.632831 2.990797
# [3,] 2 2 4.593559 4.967624
# [4,] 3 2 4.778701 2.520141
# [5,] 4 2 3.643191 4.109781
# [6,] 1 3 3.516456 4.738821
where dat1 and dat2:
# dat1
# V1 V2 V3
# 1 2.062035 1.806728 3.516456
# 2 2.488496 4.593559 1.247145
# 3 3.291413 4.778701 1.823898
# 4 4.632831 3.643191 1.706227
# > dat2
# V1 V2 V3
# 1 3.748091 3.870474 4.738821
# 2 2.536415 4.967624 1.848570
# 3 4.079366 2.520141 3.606695
# 4 2.990797 4.109781 1.502220

Without the for loops you can use pmax to compare two arrays.
bigger=pmax(data1,data2)
this gives an array with the maximum values. Then you can check if the max is bigger than 2.5
which( bigger>2.5,arr.ind=T)
will give the location where the max is bigger than your cutoff.
for completeness if I were to do it in your double looping framework, I would just set the Missing values to be below the min of all the other data, this will work so long as you have a value below 2.5 somewhere in your data.
lowest=min(c(data1,data2))
data1[which(is.na(data1),arr.ind=T)]=lowest
then run your double loop

Random number selection from a data-frame

I have created a dataframe of "errors" following the steps outlined by Bernaard & Sijtsma's (2000) two-way method for missing data imputation. In order to complete my calculation for missing data, I need to make a random selection of a SINGLE NUMBER from this error dataframe and add it to my already calculated missing data values.
I am familiar with the sample() function, but I am not looking for a random sample of a row or a column, but rather one individual cell from a data-frame. Is there a simple way to do this, such as a single "select random number()" command? Is there an alternative method I have yet to explore?
Any help is greatly appreciated.

It's easier if you can convert to a matrix instead of a dataframe , but on the assumption that you need to keep different data types or some such limitation,
foo<-as.data.frame(matrix(runif(20),nrow=4,ncol=5))
foo[sample(1:nrow(foo)),sample(1:ncol(foo))]
will pick a random element.

Similar to what #CarlWitthoft answered, you can convert your data frame back to matrix to make sure you sample a random cell
> set.seed(10)
> M <- data.frame(matrix(runif(20), nrow = 4, ncol = 5))
> M
# X1 X2 X3 X4 X5
# 1 0.5074782 0.08513597 0.6158293 0.1135090 0.05190332
# 2 0.3067685 0.22543662 0.4296715 0.5959253 0.26417767
# 3 0.4269077 0.27453052 0.6516557 0.3580500 0.39879073
# 4 0.6931021 0.27230507 0.5677378 0.4288094 0.83613414
> sample(as.matrix(M), 1)
# [1] 0.2641777 ## came from row 2, column 5
> sample(as.matrix(M), 1)
# [1] 0.113509 ## came from row 1, column 4
> sample(as.matrix(M), 1)
# [1] 0.4288094 ## came from row 4, column 4
> sample(as.matrix(M), 1)
# [1] 0.2723051 ## came from row 4, column 2
seq(as.matrix(M)) will show you all the cell numbers (top to bottom, left to right). You could also sample from that.
> seq(as.matrix(M))
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> sample(seq(as.matrix(M)), 1)
# [1] 15

Subset based on granularity and average values

I have large data-frame consists of two columns. I want to calculate the average of the second column values for each subset of the first column. The subset of the first column is based on a specified granularity. For example, for the following data-frame, df, I want to calculate the average of df$B values for each subset of df$A with an increment(granularity) of 1 for each subset. The results should be in two new columns.
A B expected results newA newB
0.22096 1 0 1.142857
0.33489 1 1 2
0.33655 1 2 4
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5
This is a simple example, I'm not sure how to loop over the whole data-frame and perform the calculation i.e. the average of the df$B.
tried below to subset, but couldn't figure how to append the results and create final results:
Tried something like :
increment<-1
mx<-max(df$A)
i<-0
newDF<-data.frame()
while(i < mx){
tmp<-subset(df, (A >i & A< (i+increment)))
i<-i+granualrity
}
Not sure about the logic. But I'm sure there is a short way to do the required calculation. Any thoughts?

I would use findInterval for the subset selection (In your example a simple ceiling for each A value should be sufficient, too. But if your increment is different from 1 you need findInterval.) and tapply to calculate the mean:
df <- read.table(textConnection("
A B
0.22096 1
0.33489 1
0.33655 1
0.43953 1
0.64933 2
0.86668 1
0.96932 1
1.09342 2
1.58314 2
1.88481 2
2.07654 4
2.34652 3
2.79777 5"), header=TRUE)
## sort data.frame by column A (needed for findInterval)
df <- df[order(df$A), ]
## define granuality
subsets <- seq(1, max(ceiling(df$A)), by=1) # change the "by" argument for different increments
df$subset <- findInterval(df$A, subsets)
tapply(df$B, df$subset, mean)
# 0 1 2
#1.142857 2.000000 4.000000

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Delete a row and recalculate R^2 - r

Related

Beta estimation over panel data by group

'Random' Sorting with a condition in R for Psychology Research

R ignore missing data

Random number selection from a data-frame

Subset based on granularity and average values

Categories

Resources