Fill upper triangular matrix without loop in R

The objective is to fill the upper triangle of a matrix with counts calculated from the ratings dataset. Each value is calculated and stored by finding the correct index; the values are not stored sequentially. The R code below works, but takes too much time for large datasets.
ratings <- read.csv("ratings.csv", header=TRUE, sep=",")
>> head(ratings)
userId movieId rating timestamp
1 1 16 4.0 1217897793
2 1 24 1.5 1217895807
3 1 32 4.0 1217896246
4 1 47 4.0 1217896556
5 1 50 4.0 1217896523
6 1 110 4.0 1217896150
no_nodes <- nrow(movies)*2
temp <- movies$movieId
nodes_name <- c(paste(temp,"-L",sep=""),paste(temp,"-D",sep=""))
ac_graph <- matrix(NA,nrow=length(nodes_name),ncol=length(nodes_name),dimnames = list(nodes_name,nodes_name))
for(i in 1:nrow(movies)){
for(j in (i+1):nrow(movies)){
ac_graph[which(nodes_name==paste(i,"-L",sep="")),which(nodes_name==paste(j,"-L",sep=""))] <- length(intersect(ratings[ratings$movieId==i&ratings$rating>2.5,1],ratings[ratings$movieId==j&ratings$rating>2.5,1]))
ac_graph[which(nodes_name==paste(i,"-D",sep="")),which(nodes_name==paste(j,"-D",sep=""))] <- length(intersect(ratings[ratings$movieId==i&ratings$rating<=2.5,1],ratings[ratings$movieId==j&ratings$rating<=2.5,1]))
}
}
Is it possible to do the same using apply, sapply, outer, or some other function?
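One possible loop-free approach, sketched here under assumptions rather than taken from the thread: build a 0/1 user-by-movie matrix for the "liked" ratings (rating > 2.5) and another for the "disliked" ratings, then let crossprod() count the users shared by every pair of movies at once. The toy ratings data, the helper name indicator, and the vectors user_ids and movie_ids are hypothetical and only stand in for the real ratings.csv.

```r
# Toy stand-in for ratings.csv (hypothetical values)
ratings <- data.frame(
  userId  = c(1, 1, 2, 2, 3, 3, 3),
  movieId = c(10, 20, 10, 20, 10, 20, 30),
  rating  = c(4.0, 3.0, 5.0, 1.0, 2.0, 2.0, 4.5)
)
user_ids  <- sort(unique(ratings$userId))
movie_ids <- sort(unique(ratings$movieId))

# 0/1 user-by-movie matrix marking which users appear in `df` for each movie
indicator <- function(df) {
  m <- matrix(0, nrow = length(user_ids), ncol = length(movie_ids),
              dimnames = list(as.character(user_ids), as.character(movie_ids)))
  m[cbind(match(df$userId, user_ids), match(df$movieId, movie_ids))] <- 1
  m
}

# crossprod(m) is t(m) %*% m: entry [i, j] counts users present for both movies
co_liked    <- crossprod(indicator(ratings[ratings$rating >  2.5, ]))
co_disliked <- crossprod(indicator(ratings[ratings$rating <= 2.5, ]))

# Assemble the block matrix: "-L" block top-left, "-D" block bottom-right
n <- length(movie_ids)
nodes_name <- c(paste0(movie_ids, "-L"), paste0(movie_ids, "-D"))
ac_graph <- matrix(NA_real_, 2 * n, 2 * n, dimnames = list(nodes_name, nodes_name))
ac_graph[1:n, 1:n] <- co_liked
ac_graph[n + 1:n, n + 1:n] <- co_disliked
ac_graph[lower.tri(ac_graph, diag = TRUE)] <- NA  # keep only the strict upper triangle
```

For a large dataset the dense indicator matrices may not fit in memory; a sparse matrix class would be the natural next step, but that is outside this sketch.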

Related

Beta estimation over panel data by group

I found some previous questions on this topic, especially "R: Grouped rolling window linear regression with rollapply and ddply" and "R: Rolling / moving avg by group"; however, neither provides an exact solution for the problem I am facing. I am currently trying to estimate CAPM beta over panel data using a linear regression. I have different funds (in the example below I used 3 fund groups) for which I would like to calculate the betas separately and per row. To put it more abstractly: I am trying to do a linear regression with a moving window by group, to estimate the coefficient for every row based on the data in the window.
install.packages(c("zoo","dplyr"))
library(zoo);library(dplyr)
# Create dataframe
fund <- as.numeric(c(1,1,1,1,1,1,1,1,3,3,3,3,3,3,2,2,2,2,2,2,2))
return<- as.numeric(c(1:21))
benchmark <- as.numeric(c(1,13,14,20,14,32,4,1,5,7,1,0,7,1,-2,1,6,-7,9,10,9))
riskfree<-as.numeric(c(1,5,1,2,1,6,4,7,5,-5,10,0,3,1,2,1,6,7,8,9,10))
date <- as.Date(c("2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2011-02-28","2010-07-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30"))
funddata<-data.frame(date,fund,return,benchmark,riskfree)
# Creating variables of interest
funddata["ret_riskfree"]<-as.numeric(funddata$return-funddata$riskfree)
funddata["benchmark_riskfree"]<-as.numeric(funddata$benchmark-funddata$riskfree)
I want to do a rolling regression over the two columns funddata[6:7] for every group indicated by the column "fund". The calculation should be done separately per group, so the first two rows in the beta column for every fund group will always show "NA". In the end I want a full dataframe with all fund groups and all beta values combined.
I managed to come up with code that works, but it is pretty messy and requires ordering the data by fund and date before executing. I would welcome any suggestions on how to make it better.
funddata <- funddata[order(funddata$fund, funddata$date),]
beta_func <- function(x, benchmark_riskfree, ret_riskfree) {
a <- coef(lm(as.formula(paste(ret_riskfree, "~", benchmark_riskfree,-1)),
data = x))
return(a)
}
beta_list<-list()
for (i in c(1:3)){beta_list[[paste(i, sep="_")]]<- (rollapplyr(funddata[(funddata$fund==i),6:7], width = 3,
FUN = function(x) beta_func(as.data.frame(x), "benchmark_riskfree" , "ret_riskfree"),
by.column = FALSE,fill=NA))}
beta_list<-unlist(beta_list, recursive=FALSE)
funddata$beta<-beta_list
As I mentioned in the comment above, this solution might be a bit off since I'm not able to reproduce your desired output 100%. Still, the functionality of what you're trying to accomplish is there. Have a look at it and let me know if this is something you could use or I could develop further.
EDIT: The code below does not reproduce the desired output as specified above, but turned out to be what the OP was looking for after all.
Here goes:
# Datasource
fund <- as.numeric(c(1,1,1,1,1,1,1,1,3,3,3,3,3,3,2,2,2,2,2,2,2))
return<- as.numeric(c(1:21))
benchmark <- as.numeric(c(1,13,14,20,14,32,4,1,5,7,1,0,7,1,-2,1,6,-7,9,10,9))
riskfree<-as.numeric(c(1,5,1,2,1,6,4,7,5,-5,10,0,3,1,2,1,6,7,8,9,10))
date <- as.Date(c("2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2011-02-28","2010-07-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30"))
funddata<-data.frame(date,fund,return,benchmark,riskfree)
# Creating variables of interest
funddata["ret_riskfree"]<-as.numeric(funddata$return-funddata$riskfree)
funddata["benchmark_riskfree"]<-as.numeric(funddata$benchmark-funddata$riskfree)
# Target check #################################################################
# Subset last three rows in original dataframe
df_check <- funddata[funddata$fund == 1,]
df_check <- tail(df_check,3)
# Run regression check
mod_check <- lm(df_check$ret_riskfree~df_check$benchmark_riskfree)
coef(mod_check)
# My suggestion ################################################################
# The following function takes five arguments:
# 1. a dataframe, myDf
# 2. the name of the column that you'd like to subset myDf on, subCol
# 3. the names of the response and predictor columns, varY and varX
# 4. a window length for a sliding window, myWin
fun_rollreg <- function(myDf, subCol, varY, varX, myWin){
df_main <- myDf
# Make an empty data frame to store results in
df_data <- data.frame()
# Identify unique funds
unFunds <- unique(unlist(df_main[subCol]))
# Loop through your subset
for (fundx in unFunds){
# Subset
df <- df_main
df <- df[df[[subCol]] == fundx,]
# Keep a copy of the original until later
df_new <- df
# Specify a container for your beta estimates
betas <- c()
# Specify window length
wlength <- myWin
# Retrieve some data dimensions to loop on
rows = dim(df)[1]
periods <- rows - wlength
# Loop through each subset of the data
# and run regression
for (i in rows:(rows - periods)){
# Split dataframe in subsets
# according to the window length
df1 <- df[(i-(wlength-1)):i,]
# Run regression
beta <- coef(lm(df1[[varY]]~df1[[varX]]))[2]
# Keep regression results
betas[[i]] <- beta
}
# Add regression data to dataframe
df_new <- data.frame(df, betas)
# Keep the new dataset for later concatenation
df_data <- rbind(df_data, df_new)
}
return(df_data)
}
# Run the function:
df_roll <- fun_rollreg(myDf = funddata, subCol = 'fund',
varY = 'ret_riskfree', varX = 'benchmark_riskfree',
myWin = 3)
# Show the results
print(head(df_roll,8))
For the first 8 rows in the new dataframe (fund = 1), this is the result:
date fund return benchmark riskfree ret_riskfree benchmark_riskfree betas
1 2010-07-30 1 1 1 1 0 0 NA
2 2010-08-31 1 2 13 5 -3 8 NA
3 2010-09-30 1 3 14 1 2 13 0.10465116
4 2010-10-31 1 4 20 2 2 18 0.50000000
5 2010-11-30 1 5 14 1 4 13 -0.20000000
6 2010-12-31 1 6 32 6 0 26 -0.30232558
7 2011-01-30 1 7 4 4 3 0 -0.11538462
8 2011-02-28 1 8 1 7 1 -6 -0.05645161
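For comparison, here is a more compact base-R sketch of the same idea. It is an illustration under assumptions, not the answerer's code: the helper roll_slope and the small data frame d are hypothetical, rows are assumed to be in time order within each fund, and the slope includes an intercept (as in the answer above, unlike the no-intercept formula in the question). With a single predictor, the OLS slope equals cov(x, y) / var(x), so lm() is not needed inside the window.

```r
# Slope of y ~ x (with intercept) over a trailing window; NA until the window is full
roll_slope <- function(y, x, width) {
  out <- rep(NA_real_, length(y))
  if (length(y) >= width) {
    for (i in width:length(y)) {
      idx <- (i - width + 1):i
      out[i] <- cov(x[idx], y[idx]) / var(x[idx])
    }
  }
  out
}

# Hypothetical mini data set: first four rows of fund 1 plus a short second fund
d <- data.frame(
  fund               = c(1, 1, 1, 1, 2, 2),
  ret_riskfree       = c(0, -3, 2, 2, 5, 1),
  benchmark_riskfree = c(0, 8, 13, 18, 2, 4)
)

# Fill the beta column group by group, using row indices split by fund
d$beta <- NA_real_
for (idx in split(seq_len(nrow(d)), d$fund)) {
  d$beta[idx] <- roll_slope(d$ret_riskfree[idx], d$benchmark_riskfree[idx], width = 3)
}
d
```

Because the row indices are split by fund, the original row order of the data frame is preserved and no rbind step is needed. A fund with fewer rows than the window simply gets NA throughout.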

Finding out the percentage of times a sequence in one column is the same as in another column

I hope I articulate this properly. I have a data set with two columns that I am trying to compare in a memory experiment. Recall.CRESP is a column specifying the correct answers on a memory test, selected through grid coordinates. Recall.RESP shows the participants' responses.
The columns look something like this:
|Recall.CRESP | Recall.RESP |
|---------------------------------|---------------------------------|
|grid35grid51grid12grid43grid54 | grid35grid51grid12grid43grid54 |
|grid22grid53grid35grid21grid44 | grid23grid53grid35grid21grid43 |
|grid12grid14grid15grid41grid23 | grid12grid24grid31grid41grid25 |
|grid15grid41grid33grid24grid55 | grid15grid41grid33grid14grid55 |
I have the following line of code to tell me the percentage of times per row that the columns are identical to each other:
paste0((100*with(Data, mean(Recall.CRESP==Recall.RESP, na.rm = TRUE))), "%")
So for example, in my dataset 20% of the time column Recall.CRESP matches Recall.RESP exactly, signifying that a subject scored 5 out of 5 in their memory test 20% of the time.
However, I want to be able to expand on this in two ways. The first is that, rather than a percentage of rows that are identical, I would like a percentage for partial matches in the sequence. For instance, grid11grid42grid22grid51grid32 and grid11grid15grid55grid42grid32 share a match of 2/5, with both the first and the last grid coordinate being identical. I am not sure how to specify the request in R for a partial sequence match of 2/5 (or any other outcome out of 5). Also keep in mind that in this example grid42 shows up in both sequences, but it is not correctly recalled because it is remembered out of position in Recall.RESP. The order is important in these sequences.
The other point is that so far I have described the experiment in terms of checking accuracy for forwards recall of memory items. I also have separate data where participants were recalling in backwards order. So, for example, grid11grid22grid33grid44grid55 from Recall.CRESP and grid51grid44grid33grid22grid11 from Recall.RESP match correctly in 4 of 5 positions. How can I turn the code around to check for reverse sequences and calculate percentages out of 5?
Any thoughts would be greatly appreciated.
I would separate the strings into columns of matrices, which will make them easy to compare and manipulate:
# borrowing Oriol's nicely shared data
Recall.CRESP <- c('grid35grid51grid12grid43grid54',
'grid22grid53grid35grid21grid44',
'grid12grid14grid15grid41grid23',
'grid15grid41grid33grid24grid55')
Recall.RESP <- c('grid35grid51grid12grid43grid54',
'grid23grid53grid35grid21grid43',
'grid12grid24grid31grid41grid25',
'grid15grid41grid33grid14grid55')
# function to create matrices
matrixify = function(dat) {
dat = do.call(rbind, strsplit(dat, split = "grid"))
dat = dat[, -1]
mode(dat) = "numeric"
return(dat)
}
cresp_mat = matrixify(Recall.CRESP)
resp_mat = matrixify(Recall.RESP)
## an example of what we made: just the numbers in the right order
cresp_mat
# [,1] [,2] [,3] [,4] [,5]
# [1,] 35 51 12 43 54
# [2,] 22 53 35 21 44
# [3,] 12 14 15 41 23
# [4,] 15 41 33 24 55
## Calculating results is now easy:
(forwards = rowMeans(cresp_mat == resp_mat))
# [1] 1.0 0.6 0.4 0.8
(reverse = rowMeans(cresp_mat == resp_mat[, 5:1]))
# [1] 0.2 0.2 0.0 0.2
You could, of course, assign the results to be new columns of your original data.
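To get from per-row scores to the percentages the question asks about (how often participants scored 0/5, 1/5, ..., 5/5), one option is to tabulate the number of matching positions. This is a minimal sketch that rebuilds the two example matrices by hand so it runs on its own; the names n_forward, n_reverse, pct_forward and pct_reverse are hypothetical.

```r
# The example matrices from above, entered directly
cresp_mat <- rbind(c(35, 51, 12, 43, 54),
                   c(22, 53, 35, 21, 44),
                   c(12, 14, 15, 41, 23),
                   c(15, 41, 33, 24, 55))
resp_mat  <- rbind(c(35, 51, 12, 43, 54),
                   c(23, 53, 35, 21, 43),
                   c(12, 24, 31, 41, 25),
                   c(15, 41, 33, 14, 55))

# Number of correct positions per row, forwards and with the response reversed
n_forward <- rowSums(cresp_mat == resp_mat)
n_reverse <- rowSums(cresp_mat == resp_mat[, ncol(resp_mat):1])

# Percentage of rows at each score; factor() keeps scores that never occur
pct_forward <- 100 * prop.table(table(factor(n_forward, levels = 0:5)))
pct_reverse <- 100 * prop.table(table(factor(n_reverse, levels = 0:5)))
pct_forward
pct_reverse
```

Using ncol(resp_mat):1 instead of a hard-coded 5:1 keeps the reversal correct if the sequences are not five items long.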
Here is my solution:
Recall.CRESP <- c('grid35grid51grid12grid43grid54',
'grid22grid53grid35grid21grid44',
'grid12grid14grid15grid41grid23',
'grid15grid41grid33grid24grid55')
Recall.RESP <- c('grid35grid51grid12grid43grid54',
'grid23grid53grid35grid21grid43',
'grid12grid24grid31grid41grid25',
'grid15grid41grid33grid14grid55')
df <- data.frame(Recall.CRESP, Recall.RESP, stringsAsFactors = F)
df$correctNormal <- NA
df$correctReverse <- NA
for (row in 1:nrow(df)) {
crespVector <- unlist(strsplit(as.character(df[row, 1]), 'grid'))[-1]
respVector <- unlist(strsplit(as.character(df[row, 2]), 'grid'))[-1]
correctNormal <- 0
correctReverse <- 0
for (i in 1:length(crespVector)) {
if (crespVector[i] == respVector[i]) correctNormal <- correctNormal + 1
if (crespVector[i] == respVector[length(respVector) + 1 - i]) correctReverse <- correctReverse + 1
}
df$correctNormal[row] = correctNormal / 5
df$correctReverse[row] = correctReverse / 5
}
df
## Recall.CRESP Recall.RESP correctNormal correctReverse
## 1 grid35grid51grid12grid43grid54 grid35grid51grid12grid43grid54 1.0 0.2
## 2 grid22grid53grid35grid21grid44 grid23grid53grid35grid21grid43 0.6 0.2
## 3 grid12grid14grid15grid41grid23 grid12grid24grid31grid41grid25 0.4 0.0
## 4 grid15grid41grid33grid24grid55 grid15grid41grid33grid14grid55 0.8 0.2

Solve a linear equation on every row in datatable

I did some linear regression and I want to forecast the moment of exceeding a certain value.
This means I have three columns:
a= slope
b = intercept
c = target value
On every row I want to calculate
solve(a,(c-b))
How do I do this in an efficient way, without using a loop (it is an extensive dataset)?
So you basically want to solve the equation
c = a*x + b
for x for each row? That has the pretty simple solution of
x = (c-b)/a
which is a vectorized operation in R. No loop is necessary:
dd <- data.frame(
a = 1:5,
b = -2:2,
c = 10:14
)
transform(dd, solution=(c-b)/a)
# a b c solution
# 1 1 -2 10 12.0
# 2 2 -1 11 6.0
# 3 3 0 12 4.0
# 4 4 1 13 3.0
# 5 5 2 14 2.4
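One case the closed form does not cover is a zero slope: a regression line with a = 0 never reaches the target, and (c-b)/a returns Inf, -Inf or NaN. A small sketch, assuming you would rather see NA for those rows (the example values below are made up):

```r
# Hypothetical data with one flat regression line (a = 0)
dd <- data.frame(a = c(2, 0, 4), b = c(1, 1, 1), c = c(5, 5, 5))

# Still fully vectorized: rows with a zero slope get NA instead of Inf/NaN
dd$solution <- ifelse(dd$a == 0, NA_real_, (dd$c - dd$b) / dd$a)
dd
```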
In addition to the other answers, you could also use the mutate function from the tidyverse, like so:
library(magrittr)
library(tidyverse)
# solve() expects a square coefficient matrix and is not vectorized over rows,
# so the closed form (c - b) / a is used inside mutate()
dataframe %<>% mutate(prediction = (c - b) / a)
In this example we assume the columns 'a', 'b', and 'c' are in a table called 'dataframe'. The %<>% operator from the magrittr package pipes the data frame into the function that follows and assigns the result back to 'dataframe'.
Here is a simple way using the Vectorize function:
solve_vec <- Vectorize(solve)
solve_vec(d$a, d$c - d$b)
> solve_vec(d$a, d$c - d$b)
[1] 12.0 6.0 4.0 3.0 2.4

Find row that matches a range of values

I am trying to get the column importantval for a number that is within a range. I have no clue how to even start this, anyone have any ideas?
data<-data.frame(lower=c(1,4,6,7,7),upper=c(3,5,7,8,9),importantval=c(99,98,97,96,95))
vals<-c(1.14,3.5,7.2,19)
> data
lower upper importantval
1 1 3 99
2 4 5 98
3 6 7 97
4 7 8 96
5 7 9 95
output goal
# 1.14 99
# 3.5 NA
# 7.2 96 <--return the smalller interval (from 7 to 8 is smaller than 7 to 9)
# 19 NA <--doesnt exist so return NA
A simple sapply (or lapply) would do the trick. Identifying the matching rows is relatively easy. The if statement handles the case where several intervals contain the value: it keeps only the row whose interval is the narrowest.
foo <- function(i) {
res <- data[data$lower < i & data$upper > i, ]
if (nrow(res) > 1) {
res <- res[which(res$upper - res$lower == min(res$upper - res$lower)), ]
}
if (nrow(res) == 0) return(NA)
return(res$importantval)
}
results <- data.frame(vals, sapply(vals, foo))
This assumes that there are no intervals that are of same length. If this is a possibility, you could add return(min(res$importantval)) at the end to get only the smaller value.
If you would want to keep both values, take the results in a list:
results <- lapply(vals, foo)
names(results) <- vals
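The tie-breaking suggestion above (return the minimum importantval when several equally narrow intervals match) could look like the following. This is a sketch, not the answerer's code: the function name lookup and its tbl argument are hypothetical, and it keeps the strict inequalities used in foo.

```r
data <- data.frame(lower = c(1, 4, 6, 7, 7),
                   upper = c(3, 5, 7, 8, 9),
                   importantval = c(99, 98, 97, 96, 95))
vals <- c(1.14, 3.5, 7.2, 19)

# Return importantval of the narrowest interval containing v, or NA if none does;
# if several intervals are equally narrow, return the smallest importantval
lookup <- function(v, tbl) {
  hit <- tbl[tbl$lower < v & tbl$upper > v, ]
  if (nrow(hit) == 0) return(NA_real_)
  width <- hit$upper - hit$lower
  min(hit$importantval[width == min(width)])
}

# vapply guarantees exactly one numeric result per value
results <- data.frame(vals, importantval = vapply(vals, lookup, numeric(1), tbl = data))
results
```

Passing the table as an argument instead of reading a global called data also avoids clashing with the base function data().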

How to count and test for the sum and repeat the action

I need to test the value of 'peso' (see replication code below) for each factor. If a factor reaches 50% of the overall sum of 'peso', the values of each factor should be pasted into a new object 'results'. Otherwise, R should find which factor has the lowest aggregated value of 'peso' and use the factor in the next column when aggregating 'peso' again. Basically, this process replaces the lowest-scoring factor with the next factor. The process should repeat until a factor crosses the 50% threshold. So my question is, where do I start?
set.seed(51)
Data <- sapply(1:100, function(x) sample(1:10, size=5))
Data <- data.frame(t(Data))
names(Data) <- letters[1:5]
Data$peso <- sample(0:3.5, 100, rep=TRUE)
It should be like
If your first few rows are:
a b c d e peso
8 2 3 7 9 1
8 3 4 5 7 3
9 7 4 10 1 2
10 3 4 5 7 3
What would you like for the total?
Totals_08 = 4
Totals_09 = 2
Totals_10 = 3
etc?
So factor 8 got the greatest share, 4/(4+2+3) = 0.4444444, but did not reach the 50% threshold in round 'a'. Therefore, I need something more: repeat the aggregation, but now considering factor 7 in column 'b' instead of factor 9 in column 'a', since factor 9 got the lowest aggregated value in the first round.
It's unclear if you have your list of factors already or not. If you do not have it, and are taking it from the data set, you can grab it in a few different ways:
# Get a list of all the factors
myFactors <- levels(Data[[1]]) # If actual factors.
myFactors <- sort(unique(unlist(Data))) # Otherwise use similar to this line
Then to calculate the Totals per factor, you can do the following
Totals <-
colSums(sapply(myFactors, function(fctr)
# calculate totals per fctr
as.integer(Data$peso) * rowSums(fctr == subset(Data, select= -peso))
))
names(Totals) <- myFactors
Which gives
Totals
# 1 2 3 4 5 6 7 8 9 10
# 132 153 142 122 103 135 118 144 148 128
Next:
I'm not sure if afterwards, you want to compare to the sum of peso or the sum of the totals. Here are both options, broken down into steps:
# Calculate the total of all the Totals:
TotalSum <- sum(Totals)
# See percentage for each:
Totals / TotalSum
Totals / sum(as.integer(Data$peso))
# See which, if any, is greater than 50%
Totals / TotalSum > 0.50
Totals / sum(as.integer(Data$peso)) > 0.50
# Using Which to identify the ones you are looking for
which(Totals / TotalSum > 0.50)
which(Totals / sum(as.integer(Data$peso)) > 0.50)
Note on your sampling for Peso
You took a sample of 0:3.5, however, the x:y sequence only gives integers.
If you want fractions, you can either use seq() or you can take a larger sequence and then divide appropriately:
option1 <- (0:7) / 2
option2 <- seq(from=0, to=3.5, by=0.5)
If you want whole integers from 0:3 and also the value 3.5, then use c()
option3 <- c(0:3, 3.5)
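Putting the sampling note together, a short sketch of drawing 'peso' in half steps. The names half_steps and peso are illustrative, and the choice of half steps is an assumption about what was intended.

```r
# 0:3.5 stops at the last whole step, so it yields only 0, 1, 2, 3
length(0:3.5)

# Half steps from 0 to 3.5, then sample with replacement as in the question
half_steps <- seq(from = 0, to = 3.5, by = 0.5)
set.seed(51)
peso <- sample(half_steps, size = 100, replace = TRUE)
head(peso)
```

If 'peso' does contain fractions, note that the as.integer(Data$peso) calls in the code above would truncate them, so those calls would need to be dropped.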
