Create a dendrogram

I have a dataset like this:
SELF OTHER
1 3
1 4
1 5
2 6
2 1
2 5
3 5
3 4
3 2
4 8
4 7
4 5
5 1
5 2
5 3
6 7
6 8
6 5
7 5
7 2
7 3
8 6
8 6
8 6
SELF identifies each of the 8 football players, and OTHER is one of those 8 players whom SELF is befriended with. I want to evaluate how similar the friendship circles of two football players are, i.e. how many of the same people they share as friends, and then cluster the players on that basis.
This is what I did so far, but I got an error message:
footballplayers <- read.csv("footballplayer.csv", header = TRUE)
footballplayers_e <- graph_from_edgelist(as.matrix(footballplayers),directed = TRUE)
similarities <- as.dist(1 - cor(footballplayers_e), upper = TRUE)
footballplayersdend<-hclust(similarities)
plot(footballplayersdend)
But it doesn't work.
Error in FUN(X[[i]], ...) : as.sociomatrix.sna input must be an adjacency matrix/array, network, or list.

An easy way to get an adjacency matrix from an edge list is by using Matrix::sparseMatrix
library(tibble)
df <- tribble(~SELF, ~OTHER,
1, 3,
1, 4,
1, 5,
2, 6,
2, 1,
2, 5,
3, 5,
3, 4,
3, 2,
4, 8,
4, 7,
4, 5,
5, 1,
5, 2,
5, 3,
6, 7,
6, 8,
6, 5,
7, 5,
7, 2,
7, 3,
8, 6,
8, 6,
8, 6)
with(df, Matrix::sparseMatrix(SELF, OTHER, symmetric = TRUE))
##> 8 x 8 sparse Matrix of class "ngCMatrix"
##>
##> [1,] . . | | | . . .
##> [2,] . . . . | | . .
##> [3,] | . . | | . . .
##> [4,] | . | . | . | |
##> [5,] | | | | . . . .
##> [6,] . | . . . . | |
##> [7,] . . . | . | . .
##> [8,] . . . | . | . .
Then we can apply dist and hclust e.g.:
with(df, Matrix::sparseMatrix(SELF, OTHER, symmetric = TRUE)) |>
dist("binary") |>
hclust() |>
plot()
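As a cross-check, the same pipeline can be sketched in base R without the Matrix package. This is only a sketch under a few assumptions: the names edges, adj, d and hc are made up for illustration, the ties are kept directed exactly as listed (row i marks whom player i named), and repeated pairs such as 8, 6 collapse to a single tie. dist(method = "binary") then gives the Jaccard distance between the players' friend sets.

```r
# Edge list from the question: SELF names OTHER as a friend.
edges <- data.frame(
  SELF  = rep(1:8, each = 3),
  OTHER = c(3, 4, 5,  6, 1, 5,  5, 4, 2,  8, 7, 5,
            1, 2, 3,  7, 8, 5,  5, 2, 3,  6, 6, 6)
)

# 8 x 8 binary adjacency matrix; duplicated pairs simply set the same cell again.
adj <- matrix(0L, nrow = 8, ncol = 8, dimnames = list(1:8, 1:8))
adj[cbind(edges$SELF, edges$OTHER)] <- 1L

# Jaccard distance between rows, i.e. between the friend sets of two players.
d <- dist(adj, method = "binary")

# Hierarchical clustering (complete linkage by default) and the dendrogram.
hc <- hclust(d)
plot(hc)
```

Players 4 and 6 name exactly the same friends (5, 7 and 8), so their distance is 0 and they are merged first.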

Related

Vector of repeated index values

I have a vector of the following form:
a <- c(4, 6, 3, 6, 1)
I want to build a vector in which each index of a is repeated as many times as the value stored at that index.
For example, the first element is 4, so there should be 4 ones, followed by 6 twos, then 3 threes, and so on.
The resulting vector should look like this:
b <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5)
Thanks in advance.

We can use rep as :
a <- c(4, 6, 3, 6, 1)
rep(seq_along(a), a)
#[1] 1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 4 4 4 4 5

We can use sequence
cumsum(sequence(a) == 1)
#[1] 1 1 1 1 2 2 2 2 2 2 3 3 3 4 4 4 4 4 4 5
Or using uncount
library(dplyr)
library(tidyr)
tibble(a) %>%
mutate(rn = row_number()) %>%
uncount(a)
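A quick way to compare the rep and sequence approaches is to run them side by side. The sketch below uses made-up names (b_rep, b_seq, a0) and adds one assumption that is not in the question: a second vector containing a zero count, to show that the cumsum(sequence(a) == 1) trick only matches rep when every count is at least 1.

```r
a <- c(4, 6, 3, 6, 1)

b_rep <- rep(seq_along(a), times = a)   # repeat index i exactly a[i] times
b_seq <- cumsum(sequence(a) == 1)       # count the restarts of each 1..a[i] run

all(b_rep == b_seq)                     # both give the same vector here

# Hypothetical input with a zero count: index 2 should not appear at all.
a0 <- c(2, 0, 1)
rep(seq_along(a0), times = a0)          # skips index 2
cumsum(sequence(a0) == 1)               # relabels the third run as 2
```

With a0, rep returns 1 1 3 while the cumsum version returns 1 1 2, so rep is the safer choice when zeros can occur.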

Importing a fixed-width file into R when the variables are defined in another file

I'm trying to import this data into R.
https://www.cdc.gov/healthyyouth/data/yrbs/data.htm
I know I need the survey package, but these files are odd.
Anyone know what to do?

To read the data in, you can use read.fwf, which ships with R in the utils package.
As mentioned in a comment, you can get the concordance from the SPSS syntax: https://www.cdc.gov/healthyyouth/data/yrbs/sadc_2017/2017_sadc_spss_input_program.sps
I've used a text editor to quickly obtain the column widths:
vec <- c(5, 50, 50, 8, 8, 3, 10, 8, 8, 8, 3, 3, 3, 3, 3, 8, 8, 8, 8,
3, 3, 1, 1, 8, 8, 8, 8, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3)
The corresponding names of each column/variable:
names <- c("sitecode", "sitename", "sitetype", "sitetypenum", "year",
"survyear", "weight", "stratum", "PSU", "record", "age", "sex",
"grade", "race4", "race7", "stheight", "stweight", "bmi", "bmipct",
"qnobese", "qnowt", "q67", "q66", "sexid", "sexid2", "sexpart",
"sexpart2", "Q8", "Q9", "Q10", "Q11", "Q12", "Q13", "Q14", "Q15",
"Q16", "Q17", "Q18", "Q19", "Q20", "Q21", "Q22", "Q23", "Q24",
"Q25", "Q26", "Q27", "Q28", "Q29", "Q30", "Q31", "Q32", "Q33",
"Q34", "Q35", "Q36", "Q37", "Q38", "Q39", "Q40", "Q41", "Q42",
"Q43", "Q44", "Q45", "Q46", "Q47", "Q48", "Q49", "Q50", "Q51",
"Q52", "Q53", "Q54", "Q55", "Q56", "Q57", "Q58", "Q59", "Q60",
"Q61", "Q62", "Q63", "Q64", "Q65", "Q68", "Q69", "Q70", "Q71",
"Q72", "Q73", "Q74", "Q75", "Q76", "Q77", "Q78", "Q79", "Q80",
"Q81", "Q82", "Q83", "Q84", "Q85", "Q86", "Q87", "Q88", "Q89",
"QN8", "QN9", "QN10", "QN11", "QN12", "QN13", "QN14", "QN15",
"QN16", "QN17", "QN18", "QN19", "QN20", "QN21", "QN22", "QN23",
"QN24", "QN25", "QN26", "QN27", "QN28", "QN29", "QN30", "QN31",
"QN32", "QN33", "QN34", "QN35", "QN36", "QN37", "QN38", "QN39",
"QN40", "QN41", "QN42", "QN43", "QN44", "QN45", "QN46", "QN47",
"QN48", "QN49", "QN50", "QN51", "QN52", "QN53", "QN54", "QN55",
"QN56", "QN57", "QN58", "QN59", "QN60", "QN61", "QN62", "QN63",
"QN64", "QN65", "QN68", "QN69", "QN70", "QN71", "QN72", "QN73",
"QN74", "QN75", "QN76", "QN77", "QN78", "QN79", "QN80", "QN81",
"QN82", "QN83", "QN84", "QN85", "QN86", "QN87", "QN88", "QN89",
"qnfrcig", "qndaycig", "qnfrevp", "qndayevp", "qnfrskl", "qndayskl",
"qnfrcgr", "qndaycgr", "qntb2", "qntb3", "qntb4", "qniudimp",
"qnshparg", "qnothhpl", "qndualbc", "qnbcnone", "qnfr0", "qnfr1",
"qnfr2", "qnfr3", "qnveg0", "qnveg1", "qnveg2", "qnveg3", "qnsoda1",
"qnsoda2", "qnsoda3", "qnmilk1", "qnmilk2", "qnmilk3", "qnbk7day",
"qnpa0day", "qnpa7day", "qndlype", "qnnodnt", "qbikehelmet",
"qdrivemarijuana", "qcelldriving", "qpropertydamage", "qbullyweight",
"qbullygender", "qbullygay", "qchokeself", "qcigschool", "qchewtobschool",
"qalcoholschool", "qtypealcohol", "qhowmarijuana", "qmarijuanaschool",
"qcurrentcocaine", "qcurrentheroin", "qcurrentmeth", "qhallucdrug",
"qprescription30d", "qgenderexp", "qtaughtHIV", "qtaughtsexed",
"qtaughtstd", "qtaughtcondom", "qtaughtbc", "qdietpop", "qcoffeetea",
"qsportsdrink", "qenergydrink", "qsugardrink", "qwater", "qfastfood",
"qfoodallergy", "qwenthungry", "qmusclestrength", "qsunscreenuse",
"qindoortanning", "qsunburn", "qconcentrating", "qcurrentasthma",
"qwheresleep", "qspeakenglish", "qtransgender", "qnbikehelmet",
"qndrivemarijuana", "qncelldriving", "qnpropertydamage", "qnbullyweight",
"qnbullygender", "qnbullygay", "qnchokeself", "qncigschool",
"qnchewtobschool", "qnalcoholschool", "qntypealcohol", "qnhowmarijuana",
"qnmarijuanaschool", "qncurrentcocaine", "qncurrentheroin", "qncurrentmeth",
"qnhallucdrug", "qnprescription30d", "qngenderexp", "qntaughtHIV",
"qntaughtsexed", "qntaughtstd", "qntaughtcondom", "qntaughtbc",
"qndietpop", "qncoffeetea", "qnsportsdrink", "qnspdrk1", "qnspdrk2",
"qnspdrk3", "qnenergydrink", "qnsugardrink", "qnwater", "qnwater1",
"qnwater2", "qnwater3", "qnfastfood", "qnfoodallergy", "qnwenthungry",
"qnmusclestrength", "qnsunscreenuse", "qnindoortanning", "qnsunburn",
"qnconcentrating", "qncurrentasthma", "qnwheresleep", "qnspeakenglish",
"qntransgender")
As mentioned above, we can use read.fwf to read the fixed-width *.dat file (I have saved just a subset; I expect reading the entire file will take a while):
df <- read.fwf(file = "c:/temp/file", widths = vec)
# Rename columns
names(df) <- names
# Inspect the head.
head(df, n=2)
# sitecode sitename sitetype sitetypenum year survyear weight stratum PSU record age sex grade race4 race7
# 1 XX United States (XX) National 3 1991 1 0.2645 12210 5 29890 . . 1 3 4
# 2 XX United States (XX) National 3 1991 1 0.5060 12310 29 29891 . . . . .
# stheight stweight bmi bmipct qnobese qnowt q67 q66 sexid sexid2 sexpart sexpart2 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q31 Q32 Q33
# 1 . . . . . . NA NA . . . . 2 4 NA NA 4 NA NA NA NA 3 NA NA NA NA NA NA NA NA 2 2 1 1 1 NA 2 4
# 2 . . . . . . NA NA . . . . NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA 1 1 1 1 1 NA 1 1
# Q34 Q35 Q36 Q37 Q38 Q39 Q40 Q41 Q42 Q43 Q44 Q45 Q46 Q47 Q48 Q49 Q50 Q51 Q52 Q53 Q54 Q55 Q56 Q57 Q58 Q59 Q60 Q61 Q62 Q63 Q64 Q65 Q68 Q69 Q70 Q71 Q72 Q73 Q74 Q75 Q76 Q77 Q78 Q79 Q80 Q81 Q82 Q83 Q84
# 1 NA NA NA NA NA NA 4 4 3 NA NA NA 5 5 5 1 NA NA NA NA NA 1 NA NA 1 1 5 4 4 3 3 8 1 NA NA NA NA NA NA NA NA NA NA NA NA NA 6 NA NA
# 2 NA NA NA NA NA NA 6 2 2 NA NA NA 1 1 1 1 NA NA NA NA NA 1 NA NA 1 1 2 2 2 3 3 2 3 NA NA NA NA NA NA NA NA NA NA NA NA NA 6 NA NA
# Q85 Q86 Q87 Q88 Q89 QN8 QN9 QN10 QN11 QN12 QN13 QN14 QN15 QN16 QN17 QN18 QN19 QN20 QN21 QN22 QN23 QN24 QN25 QN26 QN27 QN28 QN29 QN30 QN31 QN32 QN33 QN34 QN35 QN36 QN37 QN38 QN39 QN40 QN41 QN42
# 1 NA NA NA NA NA 1 1 . . 1 . . . . 1 . . . . . . . . 2 2 2 2 1 . 1 2 . . . . . . 1 1 1
# 2 NA NA NA NA NA . . . . . . . . . 2 . . . . . . . . 1 1 2 2 1 . 2 . . . . . . . 1 1 1
# QN43 QN44 QN45 QN46 QN47 QN48 QN49 QN50 QN51 QN52 QN53 QN54 QN55 QN56 QN57 QN58 QN59 QN60 QN61 QN62 QN63 QN64 QN65 QN68 QN69 QN70 QN71 QN72 QN73 QN74 QN75 QN76 QN77 QN78 QN79 QN80 QN81 QN82 QN83
# 1 . . . 1 2 1 2 . . . . . 2 . . 1 1 2 2 1 2 2 2 2 . . . . . . . . . . . . . 1 .
# 2 . . . 2 2 2 2 . . . . . 2 . . 1 1 1 2 2 . . . 2 . . . . . . . . . . . . . 1 .
# QN84 QN85 QN86 QN87 QN88 QN89 qnfrcig qndaycig qnfrevp qndayevp qnfrskl qndayskl qnfrcgr qndaycgr qntb2 qntb3 qntb4 qniudimp qnshparg qnothhpl qndualbc qnbcnone qnfr0 qnfr1 qnfr2 qnfr3 qnveg0
# 1 . . . . . . 2 2 . . . . . . . . . . . . . 2 . . . . .
# 2 . . . . . . 2 2 . . . . . . . . . . . . . . . . . . .
# qnveg1 qnveg2 qnveg3 qnsoda1 qnsoda2 qnsoda3 qnmilk1 qnmilk2 qnmilk3 qnbk7day qnpa0day qnpa7day qndlype qnnodnt qbikehelmet qdrivemarijuana qcelldriving qpropertydamage qbullyweight qbullygender
# 1 . . . . . . . . . . . . 1 . 2 NA NA NA NA NA
# 2 . . . . . . . . . . . . 1 . NA NA NA NA NA NA
# qbullygay qchokeself qcigschool qchewtobschool qalcoholschool qtypealcohol qhowmarijuana qmarijuanaschool qcurrentcocaine qcurrentheroin qcurrentmeth qhallucdrug qprescription30d qgenderexp
# 1 NA NA NA NA NA NA NA NA 1 NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA 1 NA NA NA NA NA
# qtaughtHIV qtaughtsexed qtaughtstd qtaughtcondom qtaughtbc qdietpop qcoffeetea qsportsdrink qenergydrink qsugardrink qwater qfastfood qfoodallergy qwenthungry qmusclestrength qsunscreenuse
# 1 2 NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA
# 2 1 NA NA NA NA NA NA NA NA NA NA NA NA NA 2 NA
# qindoortanning qsunburn qconcentrating qcurrentasthma qwheresleep qspeakenglish qtransgender qnbikehelmet qndrivemarijuana qncelldriving qnpropertydamage qnbullyweight qnbullygender qnbullygay
# 1 NA NA NA NA NA NA NA 1 . . . . . .
# 2 NA NA NA NA NA NA NA . . . . . . .
# qnchokeself qncigschool qnchewtobschool qnalcoholschool qntypealcohol qnhowmarijuana qnmarijuanaschool qncurrentcocaine qncurrentheroin qncurrentmeth qnhallucdrug qnprescription30d qngenderexp
# 1 . . . . . . . 2 . . . . .
# 2 . . . . . . . 2 . . . . .
# qntaughtHIV qntaughtsexed qntaughtstd qntaughtcondom qntaughtbc qndietpop qncoffeetea qnsportsdrink qnspdrk1 qnspdrk2 qnspdrk3 qnenergydrink qnsugardrink qnwater qnwater1 qnwater2 qnwater3
# 1 2 . . . . . . . . . . . . . . . .
# 2 1 . . . . . . . . . . . . . . . .
# qnfastfood qnfoodallergy qnwenthungry qnmusclestrength qnsunscreenuse qnindoortanning qnsunburn qnconcentrating qncurrentasthma qnwheresleep qnspeakenglish qntransgender
# 1 . . . 2 . . . . . . . .
# 2 . . . 2 . . . . . . . .
Note that any character columns may need to be trimmed. Missing values are also coded as ".", so you will likely want to convert these to NA as well.
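Both clean-up steps can be done while reading, because read.fwf passes extra arguments on to read.table. The sketch below is not the YRBS layout: the three records, the widths and the column names are invented for illustration, and the data are written to a temporary file so the example is self-contained.

```r
# Three hypothetical fixed-width records:
# a 2-character code, a 10-character name and a 4-character value.
lines <- c("XXNational    12",
           "AZArizona      .",
           "NYNew York     7")
tmp <- tempfile(fileext = ".dat")
writeLines(lines, tmp)

dat <- read.fwf(tmp,
                widths      = c(2, 10, 4),
                col.names   = c("sitecode", "sitename", "value"),
                na.strings  = ".",     # treat "." as missing
                strip.white = TRUE)    # trim the padding around each field
dat
str(dat)
```

With na.strings = "." the value column is read as a number with NA for the missing record, and strip.white = TRUE removes the padding around sitename while keeping the space inside "New York".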

Although I can't answer your question fully, I can get you started. The reason you are unsure what to do is that the data are not formatted in a way you are used to. The data are in an ASCII format. Here's what the website says:
"Note: SAS and SPSS programs need to be used to convert ASCII into SAS and SPSS datasets. How to use the ASCII data varies from one software package to another. Column positions for each variable usually have to be specified. Column positions for each variable can be found in the documentation for each year’s data. Consult your software documentation for more information."
ASCII is just a different way of storing data, like a .csv or other format, but it's not as readable as having it all in columns. You can start by searching for how to import ASCII data into R and go from there. Sorry I can't be of more help.

Create all pairs within groups and maintaining variables

I have a dataframe with around 30k observations, divided in 300 groups. For example
id, group, x, y
1, 1, 2, 3
2, 1, 4, 3
3, 1, 2, 4
4, 2, 5, 4
5, 2, 5, 3
6, 2, 6, 4
I want to make it so
pair, group, x_i, x_j, y_i, y_j
12, 1, 2, 4, 3, 3
13, 1, 2, 2, 3, 4
23, 1, 4, 2, 3, 4
45, 2, 5, 5, 4, 3
and so on. I've found a few topics, but they don't seem to apply exactly to my problem.

The combn function can be used to generate each corresponding pair of x and y values. We operate by group using lapply. lapply returns a list so we use rbind to put each list element (the results for each group) back together in a single data frame.
new.dat = lapply(unique(dat$group), function(g) {
data.frame(pairs = apply(t(combn(dat$id[dat$group==g], 2)), 1, paste, collapse=""),
group=g,
x = t(combn(dat$x[dat$group==g], 2)),
y = t(combn(dat$y[dat$group==g], 2)))
})
do.call(rbind, new.dat)
pairs group x.1 x.2 y.1 y.2
1 12 1 2 4 3 3
2 13 1 2 2 3 4
3 23 1 4 2 3 4
4 45 2 5 5 4 3
5 46 2 5 6 4 4
6 56 2 5 6 3 4
You could also use split, which saves some typing, but is about 10% slower on my machine:
lapply(split(dat, dat$group), function(df) {
data.frame(pairs = apply(t(combn(df$id, 2)), 1, paste, collapse=""),
group=df$group[1], # g is not defined here; take the group from the subset
x = t(combn(df$x, 2)),
y = t(combn(df$y, 2)))
})
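For reference, here is a self-contained sketch of the split approach that also produces the x_i, x_j, y_i, y_j column names from the question. The helper name pairs_within is made up, dat is rebuilt from the example in the question, and the code assumes every group has at least two rows (combn fails otherwise).

```r
# Example data from the question.
dat <- data.frame(id    = 1:6,
                  group = rep(1:2, each = 3),
                  x     = c(2, 4, 2, 5, 5, 6),
                  y     = c(3, 3, 4, 4, 3, 4))

# Hypothetical helper: all row pairs (i, j) with i < j inside one group.
pairs_within <- function(df) {
  idx <- combn(nrow(df), 2)          # 2 x n_pairs matrix of row positions
  i <- idx[1, ]
  j <- idx[2, ]
  data.frame(pair  = paste0(df$id[i], df$id[j]),
             group = df$group[i],
             x_i   = df$x[i], x_j = df$x[j],
             y_i   = df$y[i], y_j = df$y[j])
}

res <- do.call(rbind, lapply(split(dat, dat$group), pairs_within))
rownames(res) <- NULL
res
```

Indexing by row position once and reusing i and j for every variable avoids calling combn separately for each column.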

I won't say this is an optimal result, but it should work:
df <- read.table(text="id, group, x, y
1,1,2,3
2,1,4,3
3,1,2,4
4,2,5,4
5,2,5,3
6,2,6,4", header=T, sep=",")
df.new <- do.call(rbind,lapply(tapply(df$id, df$group, combn, m=2), FUN=function(x) data.frame(pairi=x[1,], pairj=x[2,])))
df.new <- do.call(rbind,apply(df.new, 1, FUN=function(x) data.frame(pair=paste0(x[1], x[2]),group=df[df$id==x[1], 'group'], x_i=df[df$id==x[1],'x'], x_j=df[df$id==x[2],'x'], y_i=df[df$id==x[1],'y'], y_j=df[df$id==x[2],'y'] )))
df.new
pair group x_i x_j y_i y_j
1.1 12 1 2 4 3 3
1.2 13 1 2 2 3 4
1.3 23 1 4 2 3 4
2.1 45 2 5 5 4 3
2.2 46 2 5 6 4 4
2.3 56 2 5 6 3 4

remove a bunch of rows by rownames - how do I initialize a null string in R?

I have this sparse-matrix I named N:
4 x 4 sparse Matrix of class "dgCMatrix"
C1 C2 C3 C4
V1 . 3 5 2
V2 . 5 1 .
V3 . . . .
V4 . . 4 .
I'm trying to remove rows that have two or more missing values. I expect to end up with this:
C1 C2 C3 C4
V1 . 3 5 2
I wrote this piece of code:
#iterate on rows and count:
#how many values in row ri are bigger than 0
# if count is not bigger than limit, remove row ri
limit = 3
for(ri in 1:nrow(N)){
count <- length(which(N[ri,]>0))
if (count <limit){
tmp <- paste("V",ri,sep="")
rmv <- paste (rmv, tmp, sep= " ")
}
}
#now remove specific row names
N <- N[!rownames(N) %in% rmv, ]
The problem is - this doesn't work since in the first loop rmv is unspecified and I receive an error:
"object 'rmv' not found"
How can I initialize rmv?
If I use:
rmv <- ""
Then I get a string that starts with an empty space, for example:
> rmv
[1] " V2"
and then my final line doesn't work:
N <- N[!rownames(N) %in% rmv, ]
Also - this is the very first code I have ever written in R, so if there is anything major I'm missing in the basic concepts I'd love to read it (this has taken me 6 hours and a lot of reading in stackoverflow and different R tutorials, but I'm pretty proud of myself getting this far, this is my first question).
Thanks!

Assuming your sparse matrix is called N, this should do it:
N[rowSums(as.matrix(N) == 0) < 2, ]
A small example with some data from ?xtabs:
d.ergo <- data.frame(Type = paste0("T", rep(1:4, 9*4)),
Subj = gl(9, 4, 36*4))
set.seed(15) # a subset of cases:
N <- xtabs(~ Type + Subj,
data = d.ergo[sample(36, 10), ],
sparse = TRUE)
N
# 4 x 9 sparse Matrix of class "dgCMatrix"
# 1 2 3 4 5 6 7 8 9
# T1 . 1 . 1 . 1 . 1 .
# T2 1 . . . . . 1 . 1
# T3 . . . . 1 . . . .
# T4 1 . . . . . 1 . .
rowSums(as.matrix(N) == 0) ## How many missing
# T1 T2 T3 T4
# 5 6 8 7
## Let's remove any with 7 or more missing
N[rowSums(as.matrix(N) == 0) < 7, ]
# 2 x 9 sparse Matrix of class "dgCMatrix"
# 1 2 3 4 5 6 7 8 9
# T1 . 1 . 1 . 1 . 1 .
# T2 1 . . . . . 1 . 1
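As for the literal question of how to initialize rmv: one option is to start from an empty character vector and append names with c(), rather than pasting everything into a single string. The sketch below keeps the loop from the question but, as an assumption for the sake of a self-contained example, uses an ordinary dense matrix in place of the sparse N (zeros stand in for the dots); the name kept is made up.

```r
# Dense stand-in for the question's matrix N.
N <- matrix(c(0, 3, 5, 2,
              0, 5, 1, 0,
              0, 0, 0, 0,
              0, 0, 4, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("V", 1:4), paste0("C", 1:4)))

limit <- 3
rmv <- character(0)                    # empty character vector, not ""
for (ri in seq_len(nrow(N))) {
  if (sum(N[ri, ] > 0) < limit) {
    rmv <- c(rmv, rownames(N)[ri])     # one element per row name
  }
}
rmv

# %in% now compares against separate names instead of one long string.
kept <- N[!rownames(N) %in% rmv, , drop = FALSE]
kept
```

Because rmv holds one name per element, the final %in% test matches each row name individually, which is why the pasted string " V2 V3 V4" never matched anything.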

With a large sparse matrix, you'll need to work with the matrix's summary, or as.matrix will make you run out of memory:
library(Matrix)
M <- sparseMatrix(i = c(1, 1, 1, 2, 2, 4),
j = c(2, 3, 4, 2, 3, 2),
x = c(3, 5, 2, 5, 1, 4))
M[tabulate(summary(M)$i) > 2, , drop = FALSE]
# 1 x 4 sparse Matrix of class "dgCMatrix"
#
# [1,] . 3 5 2
Step-by-step to see how it works:
summary(M)
# 4 x 4 sparse Matrix of class "dgCMatrix", with 6 entries
# i j x
# 1 1 2 3
# 2 2 2 5
# 3 4 2 4
# 4 1 3 5
# 5 2 3 1
# 6 1 4 2
tabulate(summary(M)$i)
# [1] 3 2 0 1
tabulate(summary(M)$i) > 2
# [1] TRUE FALSE FALSE FALSE
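One detail to watch with this approach: tabulate only counts up to the largest row index it sees, so if the last rows of the matrix are completely empty the logical index comes out shorter than nrow(M). A hedged variant is to pass nbins explicitly. The example below is an assumption-laden sketch: it pads the matrix above with two empty rows (dims = c(6, 4)) purely to show the effect, and the names counts and kept are made up.

```r
library(Matrix)

# Same entries as above, but with two extra, completely empty rows (5 and 6).
M <- sparseMatrix(i = c(1, 1, 1, 2, 2, 4),
                  j = c(2, 3, 4, 2, 3, 2),
                  x = c(3, 5, 2, 5, 1, 4),
                  dims = c(6, 4))

length(tabulate(summary(M)$i))                      # 4: shorter than nrow(M)

# One count per row, including the trailing empty rows.
counts <- tabulate(summary(M)$i, nbins = nrow(M))
counts

kept <- M[counts > 2, , drop = FALSE]
kept
```

Setting nbins = nrow(M) guarantees that the logical index has exactly one entry per row.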

How to calculate the difference between different data frames with common column names

I have three data frames and am trying to calculate the difference between two of them (Df2 and Df3), conditioned on the first (Df1). As shown in the following example, the three data frames Df1, Df2 and Df3 share column names. First, in Df1, I want to compare the "Standard" column row by row with the three columns "Das", "Dss" and "Tri". Wherever a value in "Das", "Dss" or "Tri" is higher than "Standard" in Df1, I want to calculate the difference between the values at the same position in Df2 and Df3 and put that difference in a separate column.
Df1
Names Standard Das Dss Tri
Aa 3 3 6 2
Ab 4 6 4 3
Ac 2 5 2 4
Ad 4 3 3 8
Ae 6 4 5 7
Af 4 5 7 5
Ag 2 6 8 2
Ah 9 7 6 2
Df2
Names Das Dss Tri
Aa 4 2 5
Ab 7 5 4
Ac 5 7 2
Ad 6 4 3
Ae 5 3 5
Af 3 2 6
Ag 2 5 4
Ah 4 6 3
Df3
Names Das Dss Tri
Aa 5 3 5
Ab 8 5 4
Ac 6 7 2
Ad 6 4 7
Ae 5 3 8
Af 4 5 6
Ag 1 5 4
Ah 4 6 3
Final Output
Df3
Names Das Dss Tri Difference
Aa 5 3 5 -1
Ab 8 5 4 -1
Ac 6 7 2 -1
Ad 6 4 7 -4
Ae 5 3 8 -3
Af 4 5 6 -4
Ag 1 5 4 1
Ah 4 6 3 0

Here's a script that, for each row, finds every column of df1 whose value exceeds standard and sums the df2 - df3 differences over those columns. If no column exceeds standard, NA is returned.
df1 <- structure(list(standard = c(3, 4, 2, 4, 6, 4, 2, 9), das = c(3,
6, 5, 3, 4, 5, 6, 7), dss = c(6, 4, 2, 3, 5, 7, 8, 6), tri = c(2,
3, 4, 8, 7, 5, 2, 2)), .Names = c("standard", "das", "dss", "tri"
), row.names = c(NA, -8L), class = "data.frame")
df2 <- structure(list(das = c(4, 7, 5, 6, 5, 3, 2, 4), dss = c(2,
5, 7, 4, 3, 2, 5, 6), tri = c(5,4,2,3,5,6,4,3)), .Names = c("das", "dss", "tri"
), row.names = c(NA, -8L), class = "data.frame")
df3 <- structure(list(das = c(5, 8, 6, 6, 5, 4, 1, 4), dss = c(3,
5, 7, 4, 3, 5, 5, 6), tri = c(5,4,2,7,8,6,4,3)), .Names = c("das", "dss", "tri"
), row.names = c(NA, -8L), class = "data.frame")
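# Get indices: run through every row of df1 and collect the indices of
# all columns (das, dss, tri) whose value is greater than standard.
# lapply always returns a list, even if every row has the same number of matches.
idx.v <- lapply(seq_len(nrow(df1)), function(idx) {
which(df1[idx, 2:4] > df1[idx, 1])
})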
df3$result <- sapply(1:length(idx.v), function(ix) {
col.idx <- idx.v[[ix]]
len.idx <- length(col.idx)
if (len.idx > 0) {
res <- sum(df2[ix, col.idx] - df3[ix, col.idx])
} else {
res <- NA
}
})
Output:
> df3
das dss tri result
1 5 3 5 -1
2 8 5 4 -1
3 6 7 2 -1
4 6 4 7 -4
5 5 3 8 -3
6 4 5 6 -4
7 1 5 4 1
8 4 6 3 NA
Thanks for the chat. This is what you require.
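The same rule can also be written without an explicit loop, by building a logical mask and summing the masked differences row by row. This is only a sketch of that idea, not the approach used above: the names mask, diffs and res are made up, and the data frames are rebuilt with data.frame() so the example runs on its own.

```r
df1 <- data.frame(standard = c(3, 4, 2, 4, 6, 4, 2, 9),
                  das = c(3, 6, 5, 3, 4, 5, 6, 7),
                  dss = c(6, 4, 2, 3, 5, 7, 8, 6),
                  tri = c(2, 3, 4, 8, 7, 5, 2, 2))
df2 <- data.frame(das = c(4, 7, 5, 6, 5, 3, 2, 4),
                  dss = c(2, 5, 7, 4, 3, 2, 5, 6),
                  tri = c(5, 4, 2, 3, 5, 6, 4, 3))
df3 <- data.frame(das = c(5, 8, 6, 6, 5, 4, 1, 4),
                  dss = c(3, 5, 7, 4, 3, 5, 5, 6),
                  tri = c(5, 4, 2, 7, 8, 6, 4, 3))

cols <- c("das", "dss", "tri")

# TRUE where a value in df1 exceeds that row's standard
# (the vector is recycled down each column of the matrix).
mask <- as.matrix(df1[, cols]) > df1$standard

# Element-wise df2 - df3, zeroed out wherever the mask is FALSE.
diffs <- as.matrix(df2[, cols] - df3[, cols])
res <- rowSums(diffs * mask)

# Rows where nothing exceeds standard get NA instead of 0.
res[rowSums(mask) == 0] <- NA
res
```

This reproduces the result column shown above, including NA for the last row, where no value exceeds standard.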

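I think this is the correct result, but note that the sixth value differs from the expected output (-1 instead of -4), because only the first column that exceeds Standard is used. This code assumes the data frames still contain the Names column, so Standard is column 2 of df1. Using the max value of the three columns (an easier task) produces a result that differs in even more slots.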
df1.w <- sapply( seq(1, nrow(df1)),
function(idx) min(c(Inf, which(df1[-(1:2)][idx,] > df1[idx, 2])))
)
df1.mat <- matrix(c(seq(1, nrow(df1)), df1.w), ncol=2)
df1.mat[is.infinite(df1.mat)] <- 1
ifelse(is.infinite(df1.w), 0,
df2[-1][df1.mat] - df3[-1][df1.mat]
)
## [1] -1 -1 -1 -4 -3 -1 1 0
If you actually do want to use the index of the max value in df1[-(1:2)], replace the definition of df1.w (the sapply call) with this:
df1.w <- apply(df1[-(1:2)], 1, which.max)
Using the rest of the code above then gives this result:
## [1] -1 -1 -1 -4 -3 -3 0 0
