How do I create a similarity matrix from a similarity data frame? - r

I found this online and used this with my data:
df <- data.frame(seasons = c("Season1","Season2","Season3","Season4"))
for(i in unique(df$seasons)) {
df[[paste0(i)]] <- ifelse(df$seasons==i,1,0)
}
The only challenge is where there is a 0 in the resultant cell, I want to fill in a meaningful value from a data frame that has data arranged like so:
S1
S2
Value
Season1
Season2
3
Season3
Season1
5
Season2
Season3
4
Note how a season in a pair could pop up at S1 or S2.
I'll need to fill for example,{row Season1; col Season 2} as well as {col Season 1 and row Season 2} in my matrix as 3.
Is there anyway for me to do this? I tried a few things but decided to give a shoutout to the community in case there is something simple out there I'm missing!
Thanks a bunch!

There are three steps and decided to rebuild the original matrix and call it S:
# Make square matrix of zeros
rc <- length(unique(df[[1]]) ) # going to assume that number of unique values is same in both cols
S <- diag(1, rc,rc)
# Label rows and cols
dimnames(S) <- list( sort(unique(df[[1]])), sort( unique(df[[2]])) )
# Assign value to matrix positions based on values of df[[3]]
S[ data.matrix( df[1:2]) ] <- # using 2 col matrix indexing
df[[3]]
# -------
> S
Season1 Season2 Season3
Season1 1 3 0
Season2 0 1 4
Season3 5 0 1
It's now a real matrix rather than a dataframe.

Related

Beta estimation over panel data by group

I found some previous questions on this topic especially this R: Grouped rolling window linear regression with rollapply and ddply and R: Rolling / moving avg by group , however, both questions did not provide an exact solution for the problem that I am facing. I am currently trying to estimate CAPM beta over panel data using a linear regression. So I have different funds (in the example below I used 3 fund groups) for which I would like to calculate the betas separately and per row. To put this more abstract: I am trying to do a linear regression with a moving window by group to estimate the coefficient for every row based on the data in the window.
install.packages("zoo","dplyr")
library(zoo);library(dplyr)
# Create dataframe
fund <- as.numeric(c(1,1,1,1,1,1,1,1,3,3,3,3,3,3,2,2,2,2,2,2,2))
return<- as.numeric(c(1:21))
benchmark <- as.numeric(c(1,13,14,20,14,32,4,1,5,7,1,0,7,1,-2,1,6,-7,9,10,9))
riskfree<-as.numeric(c(1,5,1,2,1,6,4,7,5,-5,10,0,3,1,2,1,6,7,8,9,10))
date <- as.Date(c("2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2011-02-28","2010-07-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30"))
funddata<-data.frame(date,fund,return,benchmark,riskfree)
# Creating variables of interest
funddata["ret_riskfree"]<-as.numeric(funddata$return-funddata$riskfree)
funddata["benchmark_riskfree"]<-as.numeric(funddata$benchmark-funddata$riskfree)
I want to do a rolling regression over two columns df[6:7] for every group indicated by the column "fund". The calculation should be done separately so the first two rows in the beta column for every fund group will always show "NA". In the end I want to have a full dataframe with all fund groups and all beta values combined.
I managed to come up with a new code that works but is pretty messy and it requires to order the data by fund & date before executing. I would welcome any suggestions on how to make it better.
funddata <- funddata[order(funddata$fund, funddata$date),]
beta_func <- function(x, benchmark_riskfree, ret_riskfree) {
a <- coef(lm(as.formula(paste(ret_riskfree, "~", benchmark_riskfree,-1)),
data = x))
return(a)
}
beta_list<-list()
for (i in c(1:3)){beta_list[[paste(i, sep="_")]]<- (rollapplyr(funddata[(funddata$fund==i),6:7], width = 3,
FUN = function(x) beta_func(as.data.frame(x), "benchmark_riskfree" , "ret_riskfree"),
by.column = FALSE,fill=NA))}
beta_list<-unlist(beta_list, recursive=FALSE)
funddata$beta<-beta_list
As I mentioned in the comment above, this solution might be a bit off since I'm not able to reproduce your desired output 100%. Still, the functionality of what you're trying to accomplish is there. Have a look at it and let me know if this is something you could use or I could develop further.
EDIT: The code below does not reproduce the desired output as specified above, but turned out to be what the OP was looking for after all.
Here goes:
# Datasource
fund <- as.numeric(c(1,1,1,1,1,1,1,1,3,3,3,3,3,3,2,2,2,2,2,2,2))
return<- as.numeric(c(1:21))
benchmark <- as.numeric(c(1,13,14,20,14,32,4,1,5,7,1,0,7,1,-2,1,6,-7,9,10,9))
riskfree<-as.numeric(c(1,5,1,2,1,6,4,7,5,-5,10,0,3,1,2,1,6,7,8,9,10))
date <- as.Date(c("2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2011-02-28","2010-07-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30"))
funddata<-data.frame(date,fund,return,benchmark,riskfree)
# Creating variables of interest
funddata["ret_riskfree"]<-as.numeric(funddata$return-funddata$riskfree)
funddata["benchmark_riskfree"]<-as.numeric(funddata$benchmark-funddata$riskfree)
# Target check #################################################################
# Subset last three rows in original dataframe
df_check <- funddata[funddata$fund == 1,]
df_check <- tail(df_check,3)
# Run regression check
mod_check <- lm(df_check$ret_riskfree~df_check$benchmark_riskfree)
coef(mod_check)
# My suggestion ################################################################
# The following function takes three arguments:
# 1. a dataframe, myDf
# 2. a column that you'd like to myDf on
# 3. a window length for a sliding window, myWin
fun_rollreg <- function(myDf, subCol, varY, varX, myWin){
df_main <- myDf
# Make an empty data frame to store results in
df_data <- data.frame()
# Identify unique funds
unFunds <- unique(unlist(df_main[subCol]))
# Loop through your subset
for (fundx in unFunds){
# Subset
df <- df_main
df <- df[df$fund == fundx,]
# Keep a copy of the original until later
df_new <- df
# Specify a container for your beta estimates
betas <- c()
# Specify window length
wlength <- myWin
# Retrieve some data dimensions to loop on
rows = dim(df)[1]
periods <- rows - wlength
# Loop through each subset of the data
# and run regression
for (i in rows:(rows - periods)){
# Split dataframe in subsets
# according to the window length
df1 <- df[(i-(wlength-1)):i,]
# Run regression
beta <- coef(lm(df1[[varY]]~df1[[varX]]))[2]
# Keep regression ressults
betas[[i]] <- beta
}
# Add regression data to dataframe
df_new <- data.frame(df, betas)
# Keep the new dataset for later concatenation
df_data <- rbind(df_data, df_new)
}
return(df_data)
}
# Run the function:
df_roll <- fun_rollreg(myDf = funddata, subCol = 'fund',
varY <- 'ret_riskfree', varX <- 'benchmark_riskfree',
myWin = 3)
# Show the results
print(head(df_roll,8))
For the first 8 rows in the new dataframe (fund = 1), this is the result:
date fund return benchmark riskfree ret_riskfree benchmark_riskfree betas
1 2010-07-30 1 1 1 1 0 0 NA
2 2010-08-31 1 2 13 5 -3 8 NA
3 2010-09-30 1 3 14 1 2 13 0.10465116
4 2010-10-31 1 4 20 2 2 18 0.50000000
5 2010-11-30 1 5 14 1 4 13 -0.20000000
6 2010-12-31 1 6 32 6 0 26 -0.30232558
7 2011-01-30 1 7 4 4 3 0 -0.11538462
8 2011-02-28 1 8 1 7 1 -6 -0.05645161

R looping through multiple dataframes in a list

Attempting to calculate differences between every two values in a row then sum the total differences for each dataframe in a list. I know for/while loops in R absolutely suck. I had this working before, but I've broken it. Can someone suggest how to optimize this using an alternative in the apply family? Current code:
for (i in 1:length(refdata)) { #for each dataframe in a list
refdif <- as.data.frame(matrix(0, ncol = 1, nrow = nrow(refdata[[i]])))
refdif1 <- c()
for (z in 1:ncol(refdata[[i]])) { #for each column in a dataframe
for(x in 1:nrow(refdata[[i]])) { #for each row in a dataframe
refdif <- (refdata[[i]][x,z] - refdata[[i]][x,z+1]) #difference of first value + the enxt
refdif1[x,1] <- (refdif1[x,1] + refidf) #sum of latest difference
}
}
print(refdif1) #where I can conduct tests on each individual dataframe with a column of sums of differences
}
example data:
list 1 refdata[[1]]
$`1`
var1 var2 var3 var4
1 1 2 3 4
2 5 6 7 8
$`2`
var1 var2 var3 var4
1 1 2 3 4
2 5 6 7 8
var 1 + 2 has the difference calculated, var 3 and 4 has the difference calculated, then each difference is summed together and placed in a new dataframe in a single column. (5-6) + (7-8), (1-2) + (3-4), etc etc:
$`1`
dif
1 -2
2 -2
$`2`
dif
1 -2
2 -2
One way to do it (per unlisted dataframe) could be by using logical vectors for indexing - their values are recycled - that way calculating the difference between every other column and finally summing the resulting df row-wise.
refdata1<-rowSums(refdata[c(T,F)]-refdata[c(F,T)])
Edit
Exact output can be obtained by
lapply(refdata, function(df){ data.frame(dif=rowSums(df[c(T,F)]-df[c(F,T)])) })
thx Heroka
# Create test data
x <- rbind(1:4, 5:8)
refdata <- list(x,x)
# Calculate results (all elements should have an even number of columns)
lapply(refdata, FUN = function(x) x %*% rep_len(c(1, -1), NCOL(x)))

Create a new data frame based on another dataframe

I am trying to use a huge dataframe (180000 x 400) to calculate another one that would be much smaller.
I have the following dataframe
df1=data.frame(LOCAT=c(1,2,3,4,5,6),START=c(120,345,765,1045,1347,1879),END=c(150,390,802,1120,1436,1935),CODE1=c(1,1,0,1,0,0),CODE2=c(1,0,0,0,-1,-1))
df1
LOCAT START END CODE1 CODE2
1 1 120 150 1 1
2 2 345 390 1 0
3 3 765 802 0 0
4 4 1045 1120 1 0
5 5 1347 1436 0 -1
6 6 1879 1935 0 -1
This is a sample dataframe. The rows continue until 180000 and the columns are over 400.
What I need to do is create a new dataframe based on each column that tells me the size of each continues "1" or "-1" and returns it with the location, size and value.
Something like this for CODE1:
LOCAT SIZE VALUE
1 1 to 2 270 POS
2 4 to 4 75 POS
And like this for CODE2:
LOCAT SIZE VALUE
1 1 to 1 30 POS
2 5 to 6 588 NEG
Unfortunately I still didn't figure out how to do this. I have been trying several lines of code to develop a function to do this automatically but start to get lost or stuck in loops and it seems that nothing works.
Any help would be appreciated.
Thanks in advance
Below is code that gives you the answer in the exact format that you wanted, except I split your "LOCAT" column into two columns entitled "Starts" and "Stops". This code will work for your entire data frame, no need to replicate it manually for each CODE (CODE1, CODE2, etc).
It assumes that the only non-CODE column have the names "LOCAT" "START" and "END".
# need package "plyr"
library("plyr")
# test2 is the example data frame that you gave in the question
test2 <- data.frame(
"LOCAT"=1:6,
"START"=c(120,345,765, 1045, 1347, 1879),
"END"=c(150,390,803,1120,1436, 1935),
"CODE1"=c(1,1,0,1,0,0),
"CODE2"=c(1,0,0,0,-1,-1)
)
codeNames <- names(test2)[!names(test2)%in%c("LOCAT","START","END")] # the names of columns that correspond to different codes
test3 <- reshape(test2, varying=codeNames, direction="long", v.names="CodeValue", timevar="Code") # reshape so the different codes are variables grouped into the same column
test4 <- test3[,!names(test3)%in%"id"] #remove the "id" column
sss <- function(x){ # sss gives the starting points, stopping points, and sizes (sss) in a data frame
rleX <- rle(x[,"CodeValue"]) # rle() to get the size of consecutive values
stops <- cumsum(rleX$lengths) # cumulative sum to get the end-points for the indices (the second value in your LOCAT column)
starts <- c(1, head(stops,-1)+1) # the starts are the first value in your LOCAT column
ssX0 <- data.frame("Value"=rleX$values, "Starts"=starts, "Stops"=stops) #the starts and stops from X (ss from X)
ssX <- ssX0[ssX0[,"Value"]!=0,] # remove the rows the correspond to CODE_ values that are 0 (not POS or NEG)
# The next 3 lines calculate the equivalent of your SIZE column
sizeX1 <- x[ssX[,"Starts"],"START"]
sizeX2 <- x[ssX[,"Stops"],"END"]
sizeX <- sizeX2 - sizeX1
sssX <- data.frame(ssX, "Size"=sizeX) # Combine the Size to the ssX (start stop of X) data frame
return(sssX) #Added in EDIT
}
answer0 <- ddply(.data=test4, .variables="Code", .fun=sss) # use the function ddply() in the package "plyr" (apply the function to each CODE, why we reshaped)
answer <- answer0 # duplicate the original, new version will be reformatted
answer[,"Value"] <- c("NEG",NA,"POS")[answer0[,"Value"]+2] # reformat slightly so that we have POS/NEG instead of 1/-1
Hopefully this helps, good luck!
Use run-length encoding to determine groups where CODE1 takes the same value.
rle_of_CODE1 <- rle(df1$CODE1)
For convenience, find the points where the value is non-zero, and the lenghts of the corresponding blocks.
CODE1_is_nonzero <- rle_of_CODE1$values != 0
n <- rle_of_CODE1$lengths[CODE1_is_nonzero]
Ignore the parts of df1 where CODE1 is zero.
df1_with_nonzero_CODE1 <- subset(df1, CODE1 != 0)
Define a group based on the contiguous blocks we found with rle.
df1_with_nonzero_CODE1$GROUP <- rep(seq_along(n), times = n)
Use ddply to get summary stats for each group.
summarised_by_CODE1 <- ddply(
df1_with_nonzero_CODE1,
.(GROUP),
summarise,
MinOfLOCAT = min(LOCAT),
MaxOfLOCAT = max(LOCAT),
SIZE = max(END) - min(START)
)
summarised_by_CODE1$VALUE <- ifelse(
rle_of_CODE1$values[CODE1_is_nonzero] == 1,
"POS",
"NEG"
)
summarised_by_CODE1
## GROUP MinOfLOCAT MaxOfLOCAT SIZE VALUE
## 1 1 1 2 270 POS
## 2 3 4 4 75 POS
Now repeat with CODE2.

R Matrix process with conditional additions

I have to pre-process a big matrix. To make my example easier to understand I will use the following matrix:
Raw data
Where col = people and row = skills
In R my matrix is:
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
Aim
In my case I need to process row by row. So there is 3 steps. For each row I have to :
Put 0 if ij=ij (So all diagonals equals zero)
Put 0 if one of the ij=0
Otherwise I have to add ij+ij
I will show the 3 steps to be more clear.
Step 1 (row1)
The data are the row 1
The result is:
Step 2 (row2)
The data are the row 2
The result is:
Step 3 (row3)
The data are the row 3
The result is:
Create a maximum matrix
Then the maximum matching are :
So my final matrix should be:
Question
Can someone tell me how to succeed to achieve this in R?
And of course the same process should work if my matrix has more row and columns...
Thanks a lot :)
Here is my implementation in R. The code doesn't execute the steps exactly in the way you specified them. I focused on your final matrix and assumed that this is the main result you're interested in.
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
rownames(test) <- paste("Skill", 1:dim(test)[1], sep="")
colnames(test) <- paste("People", 1:dim(test)[2], sep="")
test
# Pairwise combinations
comb.mat <- combn(1:dim(test)[2], 2)
pairwise.mat <- data.frame(matrix(t(comb.mat), ncol=2))
pairwise.mat$max.score <- 0
names(pairwise.mat) <- c("Person1", "Person2", "Max.Score")
for ( i in 1:dim(comb.mat)[2] ) { # Loop over the rows
first.person <- comb.mat[1,i]
second.person <- comb.mat[2,i]
temp.mat <- test[, c(first.person, second.person)]
temp.mat[temp.mat == 0] <- NA
temp.rowSums <- rowSums(temp.mat, na.rm=FALSE)
temp.rowSums[is.na(temp.rowSums)] <- 0
max.sum <- max(temp.rowSums)
previous.val <- pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person & pairwise.mat$Person2 == second.person]
pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person & pairwise.mat$Person2 == second.person] <- max.sum*(max.sum > previous.val)
}
pairwise.mat
Person1 Person2 Max.Score
1 1 2 25
2 1 3 32
3 2 3 0
person.mat <- matrix(NA, nrow=dim(test)[2], ncol=dim(test)[2])
rownames(person.mat) <- colnames(person.mat) <- paste("People", 1:dim(test)[2], sep="")
diag(person.mat) <- 0
person.mat[cbind(pairwise.mat[,1], pairwise.mat[,2])] <- pairwise.mat$Max.Score
person.mat[lower.tri(person.mat, diag=F)] <- t(person.mat)[lower.tri(person.mat, diag=F)]
person.mat
People1 People2 People3
People1 0 25 32
People2 25 0 0
People3 32 0 0

R converting a long list of questionnaire choices to a dataframe with one row for each questionnaire

A questionnaire was passed to teachers to check their curriculum preferences. They had to choose 20 items from about 50 options.
The resulting data is a long list of choices of the following type:
Teacher ID, Question ID
i want to format it to a list with one row for each teacher and a colomn per each question with the possible values: 0 (not chosen), 1 (chosen).
In pseudo code (of a programming language)
it would probably be something like this:
iterate list {
data [teacher_id] [question_id] = 0
}
Here is a sample data and the intended result:
a <- data.frame(
Case_ID = c(1,1,2,2,4,4),
Q_ID = c(3,5,5,8,2,6)
)
intended result is
res <- data.frame(
Case_ID = c(1,2,4),
Q_1 = c(0,0,0),
Q_2 = c(0,0,1),
Q_3 = c(1,0,0),
Q_4 = c(0,0,0),
Q_5 = c(1,1,0),
Q_6 = c(0,0,1),
Q_7 = c(0,0,0),
Q_8 = c(0,1,0)
)
Any help would be greatly appreciated.
Tnx
Hed
Returning a matrix and using matrix indexing to do the work:
m <- matrix(0, nrow=3, ncol=8)
rownames(m) <- c(1,2,4)
colnames(m) <- 1:8
idx <-apply(a, 2, as.character)
m[idx] <- 1
m
## 1 2 3 4 5 6 7 8
## 1 0 0 1 0 1 0 0 0
## 2 0 0 0 0 1 0 0 1
## 4 0 1 0 0 0 1 0 0
Note that you can think of a as a list of indecies, which themselves reference which cells in a "master array" are TRUE.
Then if you have a master matrix, say res of all 0's, you can then tell R: "all of the elements that are referenced in a should be 1"
This is done below
First we create the "master matrix"
# identify the unique teacher ID's
teacherIDs <- unique(a$Case_ID)
# count how many teachers there are
numbTeachers <- length(teacherIDs)
# create the column names for the questions
colNames <- c(paste0("Q_", 1:50))
# dim names for matrix. Using T_id for the row names
dnames <- list(paste0("T_", teacherIDs),
colNames)
# create the matrix
res2 <- matrix(0, ncol=50, nrow=numbTeachers, dimnames=dnames)
Next we convert a to a set of indices.
*Note that the first two lines below are only needed if there are Teacher ID's that are not present. ie in your example, T_3 is not present*
# create index out of a
indx <- a
indx$Case_ID <- as.numeric(as.factor(indx$Case_ID))
indx <- as.matrix(indx)
# populate those in a with 1
res2[indx] <- 1
res2

Resources