For loop to get rowmeans of each 8 columns in a large dataframe - r

I have a large data.frame with 8 columns per sample and 200 samples. I need to get the row means of each group of 8 columns:
rowMeans(mat[j,1:8]), rowMeans(mat[j,9:16]), rowMeans(mat[j,17:24])...
rownames are gene names.
I used the following:
for (j in 1:nrow(mat)) {
  for (i in 1:ncol(mat)/8) {
    row_m[j, i] <- rowMeans(mat[j, c(i:i+7)])
  }
}
Sample data from the data frame (dput below). Here I have shown 9 columns; the mean should be taken over the first 8 columns (the AM sample) and then repeated for the other samples.
dput(head(deconv3[1:9], 20))
structure(list(AM.amplifying.intestine = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), AM43.5.epithelial.of.mammary = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4.76506, 0, 0, 1406.48, 0, 196.401,
0, 1996.5, 0), AM.epithelium.of.bronchus = c(549.649, 1647.63,
0, 0, 0, 0, 0, 0, 699.868, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
AM.epithelium.of.intestine = c(0, 0, 0, 0, 0, 0, 572.85,
59.2414, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), AM.epithelium.of.trachea = c(0,
0, 0, 0, 199.549, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), AM.kidney.epithelial.cell = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1.32926, 0, 0, 333.592, 0), AM.medullary.thymic.epithelial.cell = c(126.847,
0, 0, 0, 0, 0, 0, 0, 0, 63.1822, 0, 0, 0, 0, 0, 0, 0, 26.0598,
0, 11.117), AM.myoepithelial.cell = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), AK.amplifying.intestine = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c("A1BG",
"A2M", "NAT2", "SERPINA3", "AANAT", "ABAT", "ABCA2", "ABCA3",
"ABCB7", "ABCA4", "ABO", "ACACA", "ACADL", "ACADS", "ACADSB",
"ACAT1", "ACLY", "ACR", "ACP1", "ACRV1"), class = "data.frame")
But it does not work. I am wondering if you could help me with this. Thanks in advance!

sample_length <- 8
row_m <- matrix(nrow = dim(mat)[1], ncol = ncol(mat) / sample_length)
for (j in 1:nrow(mat)) {
  for (i in seq(from = 1, to = ncol(mat), by = sample_length)) {
    row_m[j, (sample_length - 1 + i) / sample_length] <- mean(as.numeric(mat[j, i:(i + (sample_length - 1))]))
  }
}

Try:
row_m <- do.call(cbind, lapply(1:ceiling(NCOL(mat) / 8), function(i){
  rowMeans(mat[, ((1:NCOL(mat) - 1) %/% 8 + 1) == i, drop = FALSE])
}))
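Another base R sketch of the same idea (not from the original answers; assumes ncol(mat) is an exact multiple of 8 and each sample's 8 columns sit next to each other):
starts <- seq(1, ncol(mat), by = 8)                                    # first column of each sample block
row_m <- sapply(starts, function(i) rowMeans(mat[, i:(i + 7), drop = FALSE]))
rownames(row_m) <- rownames(mat)                                       # keep the gene names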

Related

Problems with merging two files with yearly binary data for two overlapping subsets of individuals

I work with mark-recaptures of animals, and I have two capture histories I need to merge. Both files look like this:
Both files include subsets of the same group of animals; however, not all individuals are present in both files. Also, one file contains more YEARS (in columns) than the other. The 0's and 1's indicate whether or not the animal was observed in a given year.
I need to merge both files, ending up with a file that contains all individuals included in either file. Observation data need to be combined for those individuals present in both files: if the observation status for a given animal is 0 in FILE1 and 0 in FILE2, the merged status must be 0; if it is 0 in FILE1 and 1 in FILE2, it should be 1; and if it is 1 in both files, it still needs to be 1 in the merged file (NOT 2).
Below you'll find samples of both files, FILE1 and FILE2. Any help appreciated.
FILE1:
> dput(FILE1)
structure(list(ID = c("1", "LL-30", "M-300", "NKW-001", "NKW-002",
"NKW-003", "NKW-004", "NKW-006", "NKW-007", "NKW-009", "NKW-010",
"NKW-011", "NKW-012", "NKW-013", "NKW-014", "NKW-015", "NKW-016",
"NKW-017", "NKW-018", "NKW-019", "NKW-021", "NKW-022", "NKW-023",
"NKW-024", "NKW-025", "NKW-026", "NKW-028", "NKW-029", "NKW-030",
"NKW-031", "NKW-032", "NKW-033", "NKW-034", "NKW-035", "NKW-036",
"NKW-037", "NKW-038", "NKW-039", "NKW-040"), `1986` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1987` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1988` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1989` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1990` = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1991` = c(0,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1992` = c(0,
0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1993` = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1994` = c(0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1995` = c(1,
0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1996` = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1997` = c(0,
0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1998` = c(1,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1999` = c(1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2000` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2001` = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2002` = c(1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2003` = c(1,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2004` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2005` = c(1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0), `2006` = c(0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2007` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2008` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2012` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2013` = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0), `2014` = c(0,
0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0), `2015` = c(0,
0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1), `2016` = c(0,
0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0), `2017` = c(0,
0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1), `2018` = c(0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1), `2019` = c(0,
0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1)), class = "data.frame", row.names = c(NA,
-39L))
FILE2:
> dput(FILE2)
structure(list(ID = c("KI03", "KI05", "KI06", "KI07", "KI08",
"KI10", "NKW-001", "NKW-004", "NKW-005", "NKW-009", "NKW-019",
"NKW-023", "NKW-025", "NKW-027", "NKW-031", "NKW-032", "NKW-040",
"NKW-045", "NKW-424", "NKW-431", "NKW-441", "NKW-443"), `2008` = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0
), `2009` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0), `2010` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2011` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1), `2012` = c(0,
0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1
), `2013` = c(1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0), `2014` = c(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2015` = c(1, 1, 1, 0, 1, 1,
1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0), `2016` = c(1,
0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0
), `2017` = c(1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 1, 0, 0, 0), `2018` = c(1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `2019` = c(0, 0, 0, 1, 1, 1,
0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0), `2020` = c(0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
)), class = "data.frame", row.names = c(NA, -22L))
Here is a scalable data.table solution with no merging involved.
If you have more files, just add them to the list L.
library( data.table )
setDT(FILE1); setDT(FILE2) #set to data.table format
L <- list( FILE1, FILE2 ) #put the data.tables in a list
#melt all data.tables in the list to long format
L.melt <- lapply( L, melt, id.vars = "ID", variable.name = "year", variable.factor = FALSE )
#rowbind to one large data.table
DT <- data.table::rbindlist( L.melt, use.names = TRUE, fill = TRUE )
#summarise: output 1 (logical TRUE) if the sum of the 0's and 1's is > 0, else 0 (FALSE)
ans <- DT[, .( seen = as.numeric( sum(value) > 0 ) ), by = .(ID, year) ]
#cast to wide again, fill in missing observations in years with 0
dcast( ans, ID ~ year, value.var = "seen", fill = 0 )
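To keep the result and put the year columns in chronological order, a small follow-up sketch (assumes all year columns have four-digit names):
merged <- dcast( ans, ID ~ year, value.var = "seen", fill = 0 )
setcolorder( merged, c( "ID", sort( setdiff( names(merged), "ID" ) ) ) ) #ID first, then the years in order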

I am dealing with a DTM and I want to do k-means, hierarchical, and k-medoids clustering. Am I supposed to normalize the DTM first?

The data, AllBooks has 590 observations of 8266 variables. Here is the code I have:
library(readr)
AllBooks = read_csv("AllBooks_baseline_DTM_Unlabelled.csv")
dtms = as.matrix(AllBooks)
dtms_freq = as.matrix(rowSums(dtms) / 8266)
dtms_freq1 = dtms_freq[order(dtms_freq),]
sd = sd(dtms_freq)
mean = mean(dtms_freq)
This tells me that my mean is: 0.01242767
and my std. dev. is: 0.01305608
Since my standard deviation is low, the data have low variability in terms of document size. So I do not need to normalize the DTM? By normalize I mean using the scale function in R, which subtracts the mean of the data and divides by the standard deviation.
In other words, my big question is: when am I supposed to standardize data (specifically a document-term matrix) for clustering purposes?
Here is a little output of data:
dput(head(AllBooks,10))
structure(list(budding = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), enjoyer = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), needs = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), sittest = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), eclipsed = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), engagement = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
exuberant = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), abandons = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), well = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0), cheerfulness = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
hatest = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), state = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0), stained = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0), production = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), whitened = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), revered = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), developed = c(0, 0, 0, 2, 0, 0, 0, 0, 0, 0),
regarded = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), enactments = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), aromatical = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0), admireth = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0
), foothold = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), shots = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), turner = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), inversion = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
lifeless = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), postponement = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), stout = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0), taketh = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), kettle = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), erred = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0), thinkest = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), modern = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), reigned = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0), sparingly = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
visual = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), thoughts = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), illumines = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0), attire = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
explains = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -10L))
You can view full data from link: https://www.dropbox.com/s/p9v1y6oxith1prh/AllBooks_baseline_DTM_Unlabelled.csv?dl=0
You have a sparse dataset, dominated by zeros, hence the standard deviation is very low. You can scale it if some of your non-zero counts are extremely large, e.g. some are in the hundreds while others are 1s and 2s.
It might not be such a good idea to use k-means on sparse data, because it is unlikely you can find meaningful centers. There are a few options available; check this link on dimension reduction. There are also graph-based approaches, such as this one used in biology.
Below is a simplistic way to cluster and visualize:
x = read.csv("AllBooks_baseline_DTM_Unlabelled.csv")
# drop all-zero rows and columns with at most one non-zero entry
x = x[rowMeans(x) > 0, colSums(x > 0) > 1]
Treat it as binary and do hierarchical clustering on a binary distance:
hc=hclust(dist(x,method="binary"),method="ward.D")
clus = cutree(hc,5)
Calculate PCA, then run t-SNE on the top components and visualize:
library(Rtsne)
library(ggplot2)
pca = prcomp(x,scale=TRUE,center=TRUE)
TS = Rtsne(pca$x[,1:30])
ggplot(data.frame(Dim1=TS$Y[,1],Dim2=TS$Y[,2],C=factor(clus)),
aes(x=Dim1,y=Dim2,col=C))+geom_point()
Cluster 5 seems to be very different, and they differ in these words:
names(tail(sort(colMeans(x[clus==5,]) - colMeans(x[clus!=5,])),10))
[1] "wisdom" "thee" "lord" "things" "god" "hath" "thou" "man"
[9] "thy" "shall"

How to fix 'Node inconsistent with parents' in R2jags::jags

I am working with the R-package R2jags. After running the code I attach below, R produced the error message: "Node inconsistent with parents".
I tried to solve it. However, the error message persists. The variables I am using are:
i) "Adop": a 0-1 dummy variable.
ii) "NumInfo": a counter variable whose range is {0, 1, 2,...}.
iii) "Price": 5
iv) "NRows": 326.
install.packages("R2jags")
library(R2jags)
# Data you need to run the model.
# Adop: a 0-1 dummy variable.
Adop <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
# NumInfo: a counter variable.
NumInfo <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)
# NRows: length of both 'NumInfo' and 'Adop'.
NRows <- length(NumInfo)
# Price: 5
Price <- 5
Data <- list("NRows" = NRows, "Adop" = Adop, "NumInfo" = NumInfo, "Price" = Price)
# The Bayesian model. The parameters I would like to infer are: 'mu.m', 'tau2.m', 'r.s', 'lambda.s', 'k', 'c', and 'Sig2'.
# I would like to obtain samples from the posterior distribution of the vector of parameters.
Bayesian_Model <- "model {
mu.m ~ dnorm(0, 1)
tau2.m ~ dgamma(1, 1)
r.s ~ dgamma(1, 1)
lambda.s ~ dgamma(1, 1)
k ~ dunif(1, 1/Price)
c ~ dgamma(1, 1)
Sig2 ~ dgamma(1, 1)
precision.m <- 1/tau2.m
m ~ dnorm(mu.m, precision.m)
s2 ~ dgamma(r.s, lambda.s)
for(i in 1:NRows){
Media[i] <- NumInfo[i]/Sig2 * m
Var[i] <- equals(NumInfo[i], 0) * 10 + (1 - equals(NumInfo[i], 0)) * NumInfo[i]/Sig2 * s2 * (NumInfo[i]/Sig2 + 1/s2)
Prec[i] <- pow(Var[i], -1)
W[i] ~ dnorm(Media[i], Prec[i])
PrAd1[i] <- 1 - step(-m/s2 - 1/c * 1/s2 * log(1 - k * Price) + 1/2 * c)
PrAd2[i] <- 1 - step(-W[i] - m/s2 - 1/c * 1/s2 * log(1 - k * Price) + 1/2 * c - 1/c * log(1 - k * Price))
PrAd[i] <- equals(NumInfo[i], 0) * PrAd1[i] + (1 - equals(NumInfo[i], 0)) * PrAd2[i]
Adop[i] ~ dbern(PrAd[i])
}
}"
# Save the Bayesian model on your computer with the extension '.bug'.
# Suppose that you saved the .bug file in: "C:/Users/Default/Bayesian_Model.bug".
writeLines(Bayesian_Model, "C:/Users/Default/Bayesian_Model.bug")
# Here I use the jags command from the R package R2jags.
# I would like to generate 1000 iterations.
MCMC_Bayesian_Model <- R2jags::jags(
model.file = "C:/Users/Default/Bayesian_Model.bug",
data = Data,
n.chains = 1,
n.iter = 1000,
parameters.to.save = c("mu.m", "tau2.m", "r.s", "lambda.s", "k", "c", "Sig2")
)
When running the code, R produced the error message: "Node inconsistent with parents". I do not know what the mistakes are. I was wondering if you could help me with this problem, please. If you need more information, please let me know. Thank you very much.
It's a little hard to figure out the model without knowing what you're trying to do, but I suggest two fixes:
Instead of k ~ dunif(1, 1/Price), did you mean k ~ dunif(0, 1/Price)? For dunif(a, b), you must have a < b (see page 48 here: http://people.stat.sc.edu/hansont/stat740/jags_user_manual.pdf).
I inserted an additional line in the model,
PrAd01[i] <- max(min(PrAd[i], 0.99), 0.01)
and changed the last line to
Adop[i] ~ dbern(PrAd01[i])
Page 49 of the manual above states that 0 < p < 1 for dbern(p).
The model runs with the above two changes.
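For reference, with both changes applied the relevant model lines read as follows (a sketch; everything else stays as in the question):
k ~ dunif(0, 1/Price)                        # lower bound must be below the upper bound
# ... other model lines unchanged ...
PrAd01[i] <- max(min(PrAd[i], 0.99), 0.01)   # keep the Bernoulli probability strictly inside (0, 1)
Adop[i] ~ dbern(PrAd01[i])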

plot a very large data with many zeros

This is a small portion of a very big dataset:
df<- structure(list(A = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0.68906, 0, 0, 0, 0, 0, 0, 0, 0, 0.13597, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), B = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0.40001, 0, 0, 0, 0, 0.69718, 0, 0, 0, 0, 0, 0, 0,
0, 0.090752, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), C = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0.84068, 0, 0, 0, 0.34713, 0, 0, 0, 0, 0.65201,
0, 0, 0.25725, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
), D = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.86419, 0, 0, 0, 0.3845,
0, 0, 0, 0, 0.67091, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0), E = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1.1083, 0.8324,
0, 0, 0, 0.38499, 0, 0, 0, 0, 0.69064, 0, 0, 0.14596, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), F = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 1.0954, 0.74426, 0, 0, 0, 0.37715, 0, 0, 0, 0, 0.68884,
0, 0, 0.20826, 0, 0.38782, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0), G = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1.0985, 0.66651, 0, 0,
0, 0, 0, 0, 0, 0, 0.68861, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.1812,
0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("A", "B", "C", "D", "E",
"F", "G"), class = "data.frame", row.names = c(NA, -39L))
What I want is to emphasize the non-zero values when there are a lot of zeros in the data.
This is how I plot it:
library(gplots)
eucl_dist = dist(df, method = 'euclidean')
hie_clust = hclust(eucl_dist, method = 'complete')
my_palette <- colorRampPalette(c("green", "yellow", "red"))(n = 1000)
heatmap.2(as.matrix(df), scale = "none", Colv = F, Rowv = as.dendrogram(hie_clust),
          xlab = "X", ylab = "Y", key = TRUE, keysize = 1.1, trace = "none",
          density.info = "none", margins = c(4, 4), col = my_palette, dendrogram = "row")
But as you see, even in this small example the zeros dominate my plot, and when the data are very large it is impossible to see anything. Also, I cannot change the position of the values.
You are asking a lot of questions here, I'll try to answer those I see.
Zero dominates plot
Zeros dominate your data, but what do the zeros mean? Without some insight into what the zeros actually represent, it's hard to prescribe one best way to deal with them.
Colormap
The colorful colormap that you chose is not the best way to describe quantitative data. I would suggest a simple white-to-blue ramp (or another single colour of your choice), so that the zeros are shown as white and fade into the background while the non-zero values are emphasized. For example, only change my_palette <- colorRampPalette(c("white", "cornflowerblue"))(n = 1000).
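Plugging that palette into the call from the question looks roughly like this (a sketch; everything else stays the same):
my_palette <- colorRampPalette(c("white", "cornflowerblue"))(n = 1000)
heatmap.2(as.matrix(df), scale = "none", Colv = F, Rowv = as.dendrogram(hie_clust),
          xlab = "X", ylab = "Y", key = TRUE, keysize = 1.1, trace = "none",
          density.info = "none", margins = c(4, 4), col = my_palette, dendrogram = "row")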
Changing the position of the values
I'm not certain what you mean here but the layout is fixed by the dendrogram you defined.

Find the smallest distance between the profiles

I would like to find the smallest distance between the profiles stored in a data frame. I am especially interested in one row compared to the rest of the rows in the data frame.
This is the data frame:
structure(list(`10` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `34` = c(0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 393090, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6718400,
0, 311350, 0), `59` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2164949.7,
4834137.6, 0, 0, 0, 1187816.7, 0, 0, 0, 0, 0, 0, 1340912.5, 0
), `84` = c(0, 0, 0, 0, 0, 0, 0, 0, 8607100, 0, 0, 17586713.2,
22629743.6, 0, 0, 0, 2808791.7, 0, 0, 4026222.5, 0, 0, 0, 1981900,
0), `110` = c(2296000, 0, 0, 0, 0, 2140221.7, 0, 0, 5809230.6,
0, 0, 37134898.5, 3861828.7, 2553100, 0, 12075845.8, 0, 0, 1272950,
8695273, 0, 0, 2657180, 2710080, 0), `134` = c(0, 0, 0, 1176150,
0, 1329596.7, 1471000, 0, 6511934, 6511934, 0, 18709227.3, 0,
1041211.2, 0, 6544176.9, 0, 0, 2412651.7, 7724956.9, 2878418.3,
0, 8620131.7, 2386972.8, 0), `165` = c(0, 1226610, 0, 1345098.7,
2083771.9, 0, 1808231.4, 0, 0, 10742997.7, 0, 13060798.9, 0,
538340, 538340, 2791649.5, 0, 0, 6217622, 1316097.1, 4716931.8,
0, 6615816.9, 1510532, 0), `199` = c(0, 1571525, 0, 1903038.3,
1676700, 0, 888832.2, 0, 0, 9084418.6, 0, 11189460.1, 0, 0, 1807662.5,
2564275, 0, 0, 18080359.7, 0, 0, 0, 2397710.2, 1717949.2, 0),
`234` = c(0, 1314900, 2482696, 1325684, 0, 0, 0, 0, 0, 7321432.7,
0, 9843409.2, 0, 0, 1073341.7, 2762775, 0, 0, 9335312.8,
0, 0, 0, 1950788.2, 1509100, 0), `257` = c(0, 1568700, 14604298.7,
940162.2, 0, 0, 0, 0, 0, 4779505.9, 0, 9691692.4, 0, 0, 735290,
2650165, 0, 2311383.7, 5193383.4, 0, 0, 0, 1341998.7, 1225325.6,
0), `362` = c(0, 0, 4190740.5, 288800, 0, 0, 0, 0, 0, 4846634.8,
0, 9574498.7, 0, 0, 0, 1425600, 0, 8339312.1, 3877892.5,
0, 0, 0, 1752866.7, 0, 0), `433` = c(0, 0, 773280, 0, 0,
0, 0, 0, 0, 3926582.8, 3926582.8, 5962586.5, 0, 0, 0, 1041400,
0, 1972909.3, 1895439.4, 0, 0, 0, 963891.2, 0, 1109800),
`506` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9332272, 0, 0, 0,
0, 0, 0, 2219100, 0, 0, 0, 0, 0, 0, 0), `581` = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 4371537.1, 0, 0, 0, 0, 0, 0, 2428800,
0, 0, 0, 0, 0, 0, 0), `652` = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1689871.4, 0, 0, 0, 0, 0, 0, 988399.7, 0, 0, 0, 0, 0,
0, 0), `733` = c(0, 0, 0, 0, 0, 0, 0, 1250100, 0, 0, 1754205.3,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `818` = c(0, 0,
0, 0, 0, 0, 0, 517340, 0, 0, 1149227.6, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), `896` = c(0, 0, 0, 0, 0, 0, 0, 579846.7,
0, 0, 985931.2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
`972` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 858255.5, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `1039` = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 848993.3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0)), .Names = c("10", "34", "59", "84", "110", "134",
"165", "199", "234", "257", "362", "433", "506", "581", "652",
"733", "818", "896", "972", "1039"), row.names = c("Mark_1",
"Mark_2", "Alex_1", "Katrin_1", "Georg_1", "Martin_1",
"Tim_1", "Tom_1", "Mike_1", "Mike_2", "Mike_3",
"Hare_1", "Dea_1", "Monty_1", "Monty_2", "Niko_1",
"Lee_1", "Marq_1", "Otto_1", "Priaq_1", "Surkta_1",
"Norsa_1", "Norsa_2", "Quer_1", "Quer_2"), class = "data.frame")
So the row named Katrin_1 is the one that interests me. I would like to find which rows have the smallest Euclidean distance to Katrin_1, say the 3-5 closest rows.
Let's get rid of the Katrin_1 row with df[!rownames(df) %in% "Katrin_1", ], subtract df["Katrin_1", ] from each of the remaining rows with sweep, get the (squared) Euclidean distances by squaring the resulting matrix element-wise and using rowSums, and use which.min to get the final result:
names(which.min(rowSums(sweep(df[!rownames(df) %in% "Katrin_1", ], 2, as.numeric(df["Katrin_1", ]), `-`)^2)))
# [1] "Mark_2"
This should be much more efficient than using dist, as dist would compute all pairwise distances while we need only a few.
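If you want the 3-5 closest rows rather than just the single nearest one, sort the same squared distances and take the first few names:
d2 <- rowSums(sweep(df[!rownames(df) %in% "Katrin_1", ], 2, as.numeric(df["Katrin_1", ]), `-`)^2)
names(sort(d2))[1:5]   # the five rows closest to Katrin_1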
