How to generate a grouping variable based on correlation?

library(magrittr)
library(dplyr)
V1 <- c("A","A","A","A","A","A","B","B","B","B", "B","B","C","C","C","C","C","C","D","D","D","D","D","D","E","E","E","E","E","E")
V2 <- c("A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F","A","B","C","D","E","F")
cor <- c(1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.8,NA,NA,NA,NA,0.8,1,NA,NA,NA,NA,NA,NA,1,0.9)
df <- data.frame(V1,V2,cor)
# exclude rows where cor is NA
df <- df[complete.cases(df), ]
This is the full data frame; cor = NA represents a correlation smaller than 0.8.
df
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
30 E F 0.9
In the df above, F does not appear in V1, which means F is not of interest,
so I remove the rows where V2 is F (more generally, where V2 takes a value that is not in V1).
V1.LIST <- unique(df$V1)
df.gp <- df[which(df$V2 %in% V1.LIST),]
df.gp
V1 V2 cor
1 A A 1.0
2 A B 0.8
7 B A 0.8
8 B B 1.0
15 C C 1.0
16 C D 0.8
21 D C 0.8
22 D D 1.0
29 E E 1.0
So df.gp is the dataset I need to work on.
I drop the unused level in V2 (F in this example). In R 4.0 and later V2 is a character vector by default, so it is converted to a factor first.
df.gp$V2 <- droplevels(factor(df.gp$V2))
I do not want to exclude the self-correlation rows (cor = 1), because some values of V1 may not be correlated with any other variable, and each of those should be placed in a group of its own.
By looking at cor, A and B are correlated, C and D are correlated, and E forms a group by itself.
Therefore, this example should produce three groups.

The way I see this, you may have complicated things by working your data straight into a data.frame. I took the liberty of transforming it back to a matrix.
library(reshape2)
cormat <- as.matrix(dcast(data = df,formula = V1~V2))[,-1]
row.names(cormat) <- colnames(cormat)[-length(colnames(cormat))]
cormat
Once I had your correlation matrix, it was easy to see which indices of non-NA values each variable shares with the others.
a <- apply(cormat, 1, function(x) which(!is.na(x)))
a <- data.frame(t(a))
a$var <- row.names(a)
row.names(a) <- NULL
a
X1 X2 var
1 1 2 A
2 1 2 B
3 3 4 C
4 3 4 D
5 5 6 E
Now either X1 or X2 determines your unique groupings.
Edited by cyrusjan:
The script above is a possible solution when the rows with cor >= a have already been selected, where a is a threshold (0.8 in the question above).
Contributed by alexis_laz:
By using cutree and hclust, we can set the threshold in the script itself (i.e. h=0.8), as below.
cor.gp <- data.frame(cor.gp =
cutree(hclust(1 - as.dist(xtabs(cor ~ V1 + V2, df.gp))), h = 0.8))
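The cutree/hclust approach can be tried end to end with the sketch below. It is a minimal, self-contained version that rebuilds the example pairs; the names pairs, vars, cormat and groups are illustrative and do not come from the original posts. Pairs that are absent from the table are assumed to have correlation 0. Note that h is a cut height on the 1 - cor distance scale, not a correlation.

```r
# Example pairs after removing NA rows and rows where V2 is "F"
pairs <- data.frame(
  V1  = c("A", "A", "B", "B", "C", "C", "D", "D", "E"),
  V2  = c("A", "B", "A", "B", "C", "D", "C", "D", "E"),
  cor = c(1, 0.8, 0.8, 1, 1, 0.8, 0.8, 1, 1),
  stringsAsFactors = FALSE
)

# Build a square correlation matrix; missing pairs are assumed to be 0
vars <- sort(unique(c(pairs$V1, pairs$V2)))
cormat <- matrix(0, nrow = length(vars), ncol = length(vars),
                 dimnames = list(vars, vars))
cormat[cbind(pairs$V1, pairs$V2)] <- pairs$cor

# Cluster on the dissimilarity 1 - cor and cut the tree at height h
groups <- cutree(hclust(as.dist(1 - cormat)), h = 0.8)
cor.gp <- data.frame(var = names(groups), cor.gp = unname(groups))
cor.gp
```

For this data, A and B share one group, C and D share a second group, and E sits in a group of its own.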

Related

dataframe to correlation matrix

I have a data frame in R (df) which looks like this:
colA colB
A,B 0.5
A,C 8
B,A 0.5
B,C 9
C,A 8
C,B 9
It represents correlation values obtained by running a certain software.
Now, I would like to convert this data frame to a correlation matrix to be plotted with the Corr() function:
DESIRED OUTPUT:
A B C
A 1 0.5 8
B 0.5 1 9
C 8 9 1
Any suggestions for code I could use?

Data:
input <- structure(list(colA = c("A,B", "A,C", "B,A", "B,C", "C,A", "C,B"
), colB = c(0.5, 8, 0.5, 9, 8, 9)), class = "data.frame", row.names = c(NA, -6L))
Solution:
## separate that column "colA" into 2
rc <- read.csv(text = input$colA, header = FALSE)
# V1 V2
#1 A B
#2 A C
#3 B A
#4 B C
#5 C A
#6 C B
tapply(input$colB, unname(rc), FUN = identity, default = 1)
# A B C
#A 1.0 0.5 8
#B 0.5 1.0 9
#C 8.0 9.0 1
Note 1: The OP's data is made up. A correlation is never greater than 1, so values such as 8 and 9 are not valid correlations.
Note 2: Thanks thelatemail for suggesting simply using read.csv instead of scan + matrix + asplit, as was in my initial answer.
Remark 1: If using xtabs, we have to modify diagonal elements to 1 later.
Remark 2: Matrix indexing is also a good approach, but takes more lines of code.
Remark 3: "reshaping" solution is also a good idea.
rc$value <- input$colB
reshape2::acast(rc, V1 ~ V2, fill = 1)
# A B C
#A 1.0 0.5 8
#B 0.5 1.0 9
#C 8.0 9.0 1
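The matrix-indexing approach mentioned in Remark 2 could look like the sketch below. This is one possible version, not the answerer's own code; parts, vars and cormat are illustrative names. It starts from an identity matrix so the diagonal is already 1, then fills the off-diagonal cells by indexing with a two-column character matrix of row and column names.

```r
input <- data.frame(
  colA = c("A,B", "A,C", "B,A", "B,C", "C,A", "C,B"),
  colB = c(0.5, 8, 0.5, 9, 8, 9),
  stringsAsFactors = FALSE
)

# Split "A,B" into a two-column character matrix: row name, column name
parts <- do.call(rbind, strsplit(as.character(input$colA), ",", fixed = TRUE))

# Identity matrix with named rows and columns (diagonal = 1)
vars <- sort(unique(c(parts)))
cormat <- diag(1, length(vars))
dimnames(cormat) <- list(vars, vars)

# Fill each (row, column) cell named in `parts` with the matching value
cormat[parts] <- input$colB
cormat
```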

Something like this?
# create your input df:
df <- data.frame(colA = c("A,B","A,C","B,A","B,C","C,A","C,B"), value = c(0.5,8,0.5,9,8,9))
# split ID column
df[, c("col.A","col.B")] <- matrix(unlist(strsplit(as.character(df$colA), ",")), ncol = 2, byrow = TRUE)
# reshape, filling the missing diagonal with 1
library(reshape2)
dcast(df, col.A ~ col.B, value.var = "value", fill = 1)

Assign values in table from vector

In R I have a table containing a set of insect species and an empty column "habitat specifity". Additionally, a vector lists the species considered habitat specialists: species B and C are habitat specialists; species A, D and E are habitat generalists.
example.species <- data.frame (species = c("A","B","C","D","E"), habitat.specifity=NA)
example.species
species habitat.specifity
1 A NA
2 B NA
3 C NA
4 D NA
5 E NA
example.specialists <- c("B","C")
I simply want to fill column two ("habitat specifity") with "s" for specialist and "g" for generalist. The table should then look like this:
species habitat.specifity
1 A g
2 B s
3 C s
4 D g
5 E g
I think it must be a simple task to accomplish, but I cannot figure out how. Any help is appreciated!

Here's a straightforward way in base R:
example.species <- data.frame(species = c("A","B","C","D","E"), habitat.specifity = NA)
example.specialists <- c("B","C")
example.species$habitat.specifity <- "g" # default value
example.species$habitat.specifity[example.species$species %in% example.specialists] <- "s"
# species habitat.specifity
# 1 A g
# 2 B s
# 3 C s
# 4 D g
# 5 E g

Example with dplyr:
library(dplyr)
# Your data
example.species <- data.frame(species = c("A","B","C","D","E"),habitat.specifity=NA)
# Simple if_else with dplyr and pipes
example.species %>%
mutate(habitat.specifity = if_else(species %in% c("B","C"), "s", "g"))
# Result
species habitat.specifity
1 A g
2 B s
3 C s
4 D g
5 E g

Aggregated rolling average with a conditional statement in R

I have a data frame that follows the following format.
match team1 team2 winningTeam
1 A D A
2 B E E
3 C F C
4 D C C
5 E B B
6 F A A
7 A D D
8 D A A
What I want to do is create variables that calculate the form of both team1 and team2 over the last x matches. For example, a variable called team1_form_last3_matches would be 0.33 for match 8 (team D won 1 of its last 3 matches), and a variable called team2_form_last3_matches would be 0.66 for match 8 (team A won 2 of its last 3 matches). Ideally I would like to specify the number of previous matches to consider and have the teamx_form_lasty variables created automatically. I have tried several approaches using dplyr, zoo rolling-mean functions and a lot of nested for / if statements, but I have not quite cracked it, and certainly not in an elegant way. I feel like I am missing a simple solution to this generic problem. Any help would be much appreciated!
Cheers,
Jack

This computes t1l3 (team1's form over its last three matches); you will need to replicate it for team2.
dat <- data.frame(match = c(1:8), team1 = c("A","B","C","D","E","F","A","D"), team2 = c("D","E","F","C","B","A","D","A"), winningTeam = c("A","E","C","C","B","A","D","A"),stringsAsFactors = FALSE)
dat$t1l3 <- c(NA, sapply(2:nrow(dat), function(i) {
  df <- dat[1:(i-1),] # just previous games, i.e. excludes current game
  df <- df[df$team1==dat$team1[i] | df$team2==dat$team1[i],] # just those containing T1
  df <- tail(df,3) # just the last three (or fewer if there aren't three previous games)
  return(sum(df$winningTeam==dat$team1[i])/nrow(df)) # total wins/total games (up to three)
}))
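To cover both teams and a configurable number of matches, the same logic can be wrapped in a function. The sketch below is one way to do that; form_before and add_form are hypothetical helper names, and the column naming simply follows the pattern given in the question. A team with no earlier matches gets NA.

```r
dat <- data.frame(
  match = 1:8,
  team1 = c("A", "B", "C", "D", "E", "F", "A", "D"),
  team2 = c("D", "E", "F", "C", "B", "A", "D", "A"),
  winningTeam = c("A", "E", "C", "C", "B", "A", "D", "A"),
  stringsAsFactors = FALSE
)

# Share of wins for `team` over its last `n` matches played before row `i`
form_before <- function(dat, i, team, n) {
  prev <- dat[seq_len(i - 1), ]                            # earlier matches only
  prev <- prev[prev$team1 == team | prev$team2 == team, ]  # matches involving `team`
  prev <- tail(prev, n)                                    # last n (or fewer)
  if (nrow(prev) == 0) return(NA_real_)                    # no history yet
  mean(prev$winningTeam == team)
}

# Add one form column per team column, e.g. team1_form_last3_matches
add_form <- function(dat, n) {
  for (col in c("team1", "team2")) {
    new_col <- paste0(col, "_form_last", n, "_matches")
    dat[[new_col]] <- sapply(seq_len(nrow(dat)), function(i) {
      form_before(dat, i, dat[[col]][i], n)
    })
  }
  dat
}

res <- add_form(dat, 3)
res
```

For match 8 this gives 1/3 for team1 (D) and 2/3 for team2 (A), matching the values described in the question.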

How about something like:
dat <- data.frame(match = c(1:8), team1 = c("A","B","C","D","E","F","A","D"), team2 = c("D","E","F","C","B","A","D","A"), winningTeam = c("A","E","C","C","B","A","D","A"))
match team1 team2 winningTeam
1 1 A D A
2 2 B E E
3 3 C F C
4 4 D C C
5 5 E B B
6 6 F A A
7 7 A D D
8 8 D A A
Allteams <- c("A","B","C","D","E","F")
# A vectorized function for you to use to do as you ask:
teamX_form_lastY <- function(teams, games, dat){
  sapply(teams, function(x) {
    games_info <- rowSums(dat[,c("team1","team2")] == x) + (dat[,"winningTeam"] == x)
    lookup <- ifelse(rev(games_info[games_info != 0])==2,1,0)
    games_won <- sum(lookup[1:games])
    if(length(lookup) < games) warning(paste("maximum games for team",x,"should be",length(lookup)))
    games_won/games
  })
}
teamX_form_lastY("A", 4, dat)
A
0.75
# Has a warning for the number of games you should be using
teamX_form_lastY("A", 5, dat)
A
NA
Warning message:
In FUN(X[[i]], ...) : maximum games for team A should be 4
# vectorized input
teamX_form_lastY(teams = c("A","B"), games = 2, dat = dat)
A B
0.5 0.5
# so you can do all teams
teamX_form_lastY(teams = Allteams, 2, dat)
A B C D E F
0.5 0.5 1.0 0.5 0.5 0.0

following sequence of A-B-C-D-E-F. How to proceed using R

I have one data frame, and I want to go from A to F by following the sequence A-B-C-D-E-F. How can I do this in R?
> m
V1 V2 V3
1 A B 0.1
2 B C 0.2
3 C D 0.3
4 D E 0.2
5 E F 0.5

The way I understand your comments, the relationship between A and F is the product of m$V3 between their rows.
af <- function(from, to){
  x <- which(m$V1 == from)
  y <- which(m$V2 == to)
  return(prod(m$V3[x:y]))
}
af("A", "F")
[1] 6e-04
Then, F = A * 0.0006.
To generalize to any sequence and any row order in the table, we first define the sequence.
sq <- c("A", "B", "C", "D", "E", "F") # or LETTERS[1:6] in this case
Within the function, we select the rows where both V1 and V2 fall inside the part of the sequence that runs from `from` to `to`.
af2 <- function(from, to){
  cond <- sq[which(sq == from):which(sq == to)]
  x <- m$V1 %in% cond & m$V2 %in% cond
  return(prod(m$V3[x]))
}
Test
Using the original data frame, both functions provide identical results.
af("B","E")
[1] 0.012
af2("B","E")
[1] 0.012
When we randomise row order, only the second function returns the correct result.
set.seed(123456)
m <- m[sample(1:5),]
m
V1 V2 V3
4 D E 0.2
5 E F 0.5
2 B C 0.2
1 A B 0.1
3 C D 0.3
af("B","E")
[1] 0.02
af2("B","E")
[1] 0.012
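If the sequence is not known in advance, one alternative is to follow the V1 to V2 links row by row instead of relying on sq. The sketch below illustrates that idea and is not part of the original answer; af3 and its max_steps guard are hypothetical, and it assumes each value occurs at most once in V1.

```r
# The shuffled table from the test above
m <- data.frame(
  V1 = c("D", "E", "B", "A", "C"),
  V2 = c("E", "F", "C", "B", "D"),
  V3 = c(0.2, 0.5, 0.2, 0.1, 0.3),
  stringsAsFactors = FALSE
)

# Walk from `from` to `to`, multiplying V3 along the way
af3 <- function(from, to, m, max_steps = nrow(m)) {
  total <- 1
  current <- from
  steps <- 0
  while (current != to) {
    i <- match(current, m$V1)            # row whose V1 is the current node
    if (is.na(i) || steps >= max_steps) {
      stop("no path from ", from, " to ", to)
    }
    total <- total * m$V3[i]
    current <- m$V2[i]                   # move to the next node
    steps <- steps + 1
  }
  total
}

af3("B", "E", m)
af3("A", "F", m)
```

Because the walk only uses the links stored in the rows, the result does not depend on row order.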

R: Reshape count matrix to long format with multiple entries

I have a matrix. The entries of the matrix are counts for the combination of the dimension levels. For example:
(m0 <- matrix(1:4, nrow=2, dimnames=list(c("A","B"),c("A","B"))))
A B
A 1 3
B 2 4
I can change it to a long format:
library("reshape")
(m1 <- melt(m0))
X1 X2 value
1 A A 1
2 B A 2
3 A B 3
4 B B 4
But I would like to have multiple entries according to value:
m2 <- m1
for (i in 1:nrow(m1)) {
  j <- m1[i,"value"]
  k <- 2
  while (k <= j) {
    m2 <- rbind(m2,m1[i,])
    k <- k+1
  }
}
m2 <- subset(m2,select = - value)
m2[order(m2$X1),]
X1 X2
1 A A
3 A B
31 A B
32 A B
2 B A
4 B B
21 B A
41 B B
42 B B
43 B B
Is there a parameter in melt that repeats the entries according to value? Or is there another library that can do this?

We could do this with base R. We convert the dimnames of 'm0' to a 'data.frame' with two columns using expand.grid, then replicate the rows of the dataset with the values in 'm0', order the rows and change the row names to NULL (if necessary).
d1 <- expand.grid(dimnames(m0))
d2 <- d1[rep(1:nrow(d1), c(m0)),]
res <- d2[order(d2$Var1),]
row.names(res) <- NULL
res
# Var1 Var2
#1 A A
#2 A B
#3 A B
#4 A B
#5 B A
#6 B A
#7 B B
#8 B B
#9 B B
#10 B B
Or with melt, we convert the 'm0' to 'long' format and then replicate the rows as before.
library(reshape2)
dM <- melt(m0)
dM[rep(1:nrow(dM), dM$value),1:2]
As @Frank mentioned, we can also use as.table with as.data.frame to create 'dM' (the count column is then named Freq rather than value).
dM <- as.data.frame(as.table(m0))
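Put together, the as.table variant might look like the sketch below. It only combines the steps already shown above; res is an illustrative name, and the rows are repeated using the Freq column that as.data.frame creates.

```r
m0 <- matrix(1:4, nrow = 2, dimnames = list(c("A", "B"), c("A", "B")))

# Long format with columns Var1, Var2 and Freq
dM <- as.data.frame(as.table(m0))

# Repeat each row Freq times and keep only the two dimension columns
res <- dM[rep(seq_len(nrow(dM)), dM$Freq), c("Var1", "Var2")]
res <- res[order(res$Var1), ]
row.names(res) <- NULL
res
```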
