R: "Error: unexpected string constant in" with read_fwf()

I am trying to read a fixed-width file from the U.S. Census Bureau into R using read_fwf(). I keep getting an error at the same place in the list of column names, even after restarting R in a new session. The 39th item in the list seems to be the problem. The first call in the code block below uses the original column names; there the 39th name is "cbsac", but the error prints that as "... "" ". I have tried renaming the 39th, and sometimes the 38th, name (the second call below shows one such attempt) and R keeps throwing the error. "cbsac" is close to the name in the 38th position, "cbsa", but many consecutive names elsewhere in the list are just as similar and cause no error. I don't know what the message is supposed to indicate. Does "cbsac" mean something in R that I'm not aware of?
library(readr)
> tf <- read_fwf("D:/projects_and_data/data/PostgreSQL/data/data/or2010.sf1/orgeo2010.sf1", fwf_widths( c(6, 2, 3, 2, 3, 2, 7, 1, 1, 2, 3, 2, 2, 5, 2, 2, 5, 2, 2, 6, 1, 4, 2, 5, 2, 2, 4, 5, 2, 1, 3, 5, 2, 6, 1, 5, 2, 5, 2, 5, 3, 5, 2, 5, 3, 1, 1, 5, 2, 1, 1, 2, 3, 3, 6, 1, 3, 5, 5, 2, 5, 5, 5, 14, 14, 90, 1, 1, 9, 9, 11, 12, 2, 1, 6, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 1, 1, 5, 18), c("fileid", "stusab", "sumlev", "geocomp", "chariter", "cifsn", "logrecno", "region", "division", "state", "county", "countycc", "countysc", "cousub", "cousubcc", "cousubsc", "place", "placecc", "placesc", "tract", "blkgrp", "block", "iuc", "concit", "concitcc", "concitsc", "aianhh", "aianhhfp", "aianhhcc", "aihhtli", "aitsce", "aits", "aitscc", "ttract", "tblkgrp", "anrc", "anrccc", "cbsa", "cbsac", "metdiv", "csa", "necta", "nectasc", "nectadiv" "cnecta", "cbsapci", "nectapci", "ua", "uasc", "uatype", "ur", "cd", "sldu", "sldl", "vtd", "vtdi", "reserve2", "zcta5", "submcd", "submcdcc", "sdelem", "sdsec", "sduni", "arealand", "areawatr", "name", "funcstat", "gcuni", "pop100", "hu100", "intptlat", "intptlon", "lsadc", "partflag", "reserve3", "uga", "statens", "countyns", "cousubns", "placens", "concitns", "aianhhns", "aitsns", "anrcns", "submcdns", "cd113", "cd114", "cd115", "sldu2", "sldu3", "sldu4", "sldl2", "sldl3", "sldl4", "aianhhsc", "csasc", "cnectasc", "memi", "nmemi", "puma", "reserved")))
Error: unexpected string constant in ""tract", "blkgrp", "block", "iuc", "concit", "concitcc", "concitsc", "aianhh", "aianhhfp", "aianhhcc", "aihhtli", "aitsce", "aits", "aitscc", "ttract", "tblkgrp", "anrc", "anrccc", "cbsa", ""
> tf <- read_fwf("D:/projects_and_data/data/PostgreSQL/data/data/or2010.sf1/orgeo2010.sf1", fwf_widths( c(6, 2, 3, 2, 3, 2, 7, 1, 1, 2, 3, 2, 2, 5, 2, 2, 5, 2, 2, 6, 1, 4, 2, 5, 2, 2, 4, 5, 2, 1, 3, 5, 2, 6, 1, 5, 2, 5, 2, 5, 3, 5, 2, 5, 3, 1, 1, 5, 2, 1, 1, 2, 3, 3, 6, 1, 3, 5, 5, 2, 5, 5, 5, 14, 14, 90, 1, 1, 9, 9, 11, 12, 2, 1, 6, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 1, 1, 5, 18), c("fileid", "stusab", "sumlev", "geocomp", "chariter", "cifsn", "logrecno", "region", "division", "state", "county", "countycc", "countysc", "cousub", "cousubcc", "cousubsc", "place", "placecc", "placesc", "tract", "blkgrp", "block", "iuc", "concit", "concitcc", "concitsc", "aianhh", "aianhhfp", "aianhhcc", "aihhtli", "aitsce", "aits", "aitscc", "ttract", "tblkgrp", "anrc", "anrccc", "BCas", "CBsac", "metdiv", "csa", "necta", "nectasc", "nectadiv" "cnecta", "cbsapci", "nectapci", "ua", "uasc", "uatype", "ur", "cd", "sldu", "sldl", "vtd", "vtdi", "reserve2", "zcta5", "submcd", "submcdcc", "sdelem", "sdsec", "sduni", "arealand", "areawatr", "name", "funcstat", "gcuni", "pop100", "hu100", "intptlat", "intptlon", "lsadc", "partflag", "reserve3", "uga", "statens", "countyns", "cousubns", "placens", "concitns", "aianhhns", "aitsns", "anrcns", "submcdns", "cd113", "cd114", "cd115", "sldu2", "sldu3", "sldu4", "sldl2", "sldl3", "sldl4", "aianhhsc", "csasc", "cnectasc", "memi", "nmemi", "puma", "reserved")))
Error: unexpected string constant in ""tract", "blkgrp", "block", "iuc", "concit", "concitcc", "concitsc", "aianhh", "aianhhfp", "aianhhcc", "aihhtli", "aitsce", "aits", "aitscc", "ttract", "tblkgrp", "anrc", "anrccc", "BCas", ""
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readr_1.3.1
loaded via a namespace (and not attached):
[1] compiler_3.6.1 backports_1.1.5 R6_2.4.0 hms_0.5.1
[5] pillar_1.4.2 tibble_2.1.3 Rcpp_1.0.2 crayon_1.3.4
[9] vctrs_0.2.0 zeallot_0.1.0 pkgconfig_2.0.3 rlang_0.4.0
This links to a zip that has the source file. The file is "orgeo2010.sf1". I should have said, the zip is kind of big. Sorry about that.

Does this fix your issue?
widths <- c(6, 2, 3, 2, 3, 2, 7, 1, 1, 2, 3, 2, 2, 5, 2, 2, 5,
2, 2, 6, 1, 4, 2, 5, 2, 2, 4, 5, 2, 1, 3, 5, 2, 6, 1, 5, 2, 5,
2, 5, 3, 5, 2, 5, 3, 1, 1, 5, 2, 1, 1, 2, 3, 3, 6, 1, 3, 5, 5,
2, 5, 5, 5, 14, 14, 90, 1, 1, 9, 9, 11, 12, 2, 1, 6, 5, 8, 8,
8, 8, 8, 8, 8, 8, 8, 2, 2, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2, 1, 1, 5, 18)
vars <- c("fileid", "stusab", "sumlev", "geocomp", "chariter", "cifsn", "logrecno",
"region", "division", "state", "county", "countycc", "countysc", "cousub",
"cousubcc", "cousubsc", "place", "placecc", "placesc", "tract", "blkgrp", "block",
"iuc", "concit", "concitcc", "concitsc", "aianhh", "aianhhfp", "aianhhcc", "aihhtli",
"aitsce", "aits", "aitscc", "ttract", "tblkgrp", "anrc", "anrccc", "cbsa", "cbsac",
"metdiv", "csa", "necta", "nectasc", "nectadiv", "cnecta", "cbsapci", "nectapci",
"ua", "uasc", "uatype", "ur", "cd", "sldu", "sldl", "vtd", "vtdi", "reserve2",
"zcta5", "submcd", "submcdcc", "sdelem", "sdsec", "sduni", "arealand", "areawatr",
"name", "funcstat", "gcuni", "pop100", "hu100", "intptlat", "intptlon", "lsadc",
"partflag", "reserve3", "uga", "statens", "countyns", "cousubns", "placens",
"concitns", "aianhhns", "aitsns", "anrcns", "submcdns", "cd113", "cd114", "cd115",
"sldu2", "sldu3", "sldu4", "sldl2", "sldl3", "sldl4", "aianhhsc", "csasc",
"cnectasc", "memi", "nmemi", "puma", "reserved")
td <- read_fwf("D:/projects_and_data/data/PostgreSQL/data/data/or2010.sf1/orgeo2010.sf1", fwf_widths(widths))
names(td) <- vars
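The "unexpected string constant" error is caused by the character vector not being defined correctly: you were missing a comma between "nectadiv" and "cnecta". Renaming "cbsa" or "cbsac" has no effect because those names are not the problem.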
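To see how a missing comma produces this message, the following minimal sketch reproduces the parse error with a short vector. The three-element vector is made up for illustration, and only base R's parse(), eval() and tryCatch() are used.

```r
# Two adjacent string literals with no comma between them cannot be parsed.
bad_code  <- 'c("cbsa", "nectadiv" "cnecta")'
good_code <- 'c("cbsa", "nectadiv", "cnecta")'

# parse() only checks syntax, so the error can be captured without running anything.
msg <- tryCatch(
  {
    parse(text = bad_code)
    "parsed without error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# With the comma restored, the code parses and evaluates to three names.
fixed <- eval(parse(text = good_code))
print(length(fixed))
```

The text echoed in the message is the code the parser had read when it gave up, which may not end exactly at the typo, so checking the commas in the surrounding names is a reasonable first step.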

Related

R: item name missing in the plot legend

With this code I get the plot I want
d <- density(mydata$item1)
plot(d)
This code is the same, but omits NAs, and there is a flaw in the plot's legend: it doesn't say which item is plotted (it shows x = .).
Can you tell me what the problem is and how to fix it? Thank you for your help.
My data
structure(list(item1 = c(5, 5, 5, 5, 4, 4, 2, 1, 3, 4, 4, 3,
2, 5, 2, 4, 4, 3, 6, 5, 3, 2, 5, 3, 3, 1, 3, 5, 1, 3, 2, 6, 3,
5, 4, 4, 3, 5, 6, 3, 2, 6, 6, 5, 2, 2, 2, 3, 3, 3), item2 = c(5,
4, 5, 1, 2, 2, 3, 2, 2, 2, 2, 3, 2, 5, 1, 4, 4, 3, 3, 5, 3, 2,
4, 4, 3, 4, 4, 3, 7, NA, 2, 4, 2, 4, 2, 3, 5, 3, 5, 3, 2, 6,
6, 7, 2, 3, 2, 3, 1, 4), item3 = c(5, 5, 6, 7, 3, 4, 5, 2, 2,
6, 4, 2, 5, 7, 1, 2, 4, 5, 6, 6, 5, 2, 6, 5, 6, 4, 6, 4, 6, 4,
6, 5, 5, 6, 6, 6, 5, 6, 7, 5, 5, 7, 7, 6, 2, 6, 6, 6, 5, 3)), row.names = c(NA,
-50L), class = c("tbl_df", "tbl", "data.frame"))
Use the main = argument inside plot to make the title say whatever you want it to.
library(magrittr)  # provides the %>% pipe
mydata$item2 %>%
na.omit() %>%
density() %>%
plot(main = 'Density of mydata$item2')
You had a small typo in your code: the density() call was piped into a plot() call that referred to the variable the result was being assigned to, which might have caused the strange plot.
According to the documentation, density() stops with an error on NA values by default, because na.rm defaults to FALSE, so you have to set na.rm = TRUE for the plot to work. Also, as @AllanCameron pointed out in an earlier answer, you can set the plot title manually.
d <- density(mydata$item2, na.rm = TRUE)
plot(d)
Alternatively, you may be able to substitute, interpolate or impute the NA values so that you do not have to remove them for the density() call, though this obviously depends on your data, context and goals.
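The two points above (handling NAs and naming the plot) can be combined in one short sketch. The vector x is toy data made up for illustration, and the title text is just an example.

```r
# Toy vector with one missing value (made up for illustration).
x <- c(5, 4, 5, 1, 2, NA, 3, 2, 2, 4)

# With the default na.rm = FALSE, density() stops with an error.
default_result <- tryCatch(density(x), error = function(e) "error")
print(default_result)

# Drop the NAs inside density() and give the plot an explicit title,
# so the title no longer depends on how the call was written.
d <- density(x, na.rm = TRUE)
plot(d, main = "Density of item2")
```

Here d$n reports the sample size after the missing value has been removed.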

Select and compare two elements from two dataframes using R

I want to calculate the shortest path between two proteins using two dataframes. For example, I want the shortest path between the first protein in the first list and the first in the second list, then between the first in the first list and the second in the second list, and so on.
structure(list(LAS1L = c("FKBP4", "RBM6", "UPF1", "SLC25A5",
"DHX33", "ELAC2", "CCDC124", "RPS20", "CSDE1", "AKAP8L", "UTP18",
"PTBP1", "DCN", "MATR3", "SAMD4A", "AQR", "STRAP", "SEC63", "BCLAF1",
"TFB1M", "GRN", "ZCCHC8", "NSUN2", "SKIV2L2", "STAU2", "CTNNA1",
"YTHDC2", "POLR2B", "TPR", "MAP4", "NOP16", "FAM120A", "R3HDM1",
"PTCD2", "RRP12", "MRTO4", "THRAP3", "NOP58", "USP36", "MLL3",
"PUM2", "MRPL43", "ZFR", "RC3H2", "ZC3H11A", "PARP12", "ALDH18A1",
"CSDA", "CCAR1")), class = "data.frame", row.names = c(NA, -49L
))
structure(list(GNL3L = c("FMR1", "FRAXA", "UBA1", "CSTF2", "MECP2",
"PHF6", "RBM10", "GSPT2", "SLC25A5", "EIF1AX", "NKRF", "RPS4X",
"RBMX2", "HTATSF1", "LAS1L", "MBNL3", "HUWE1", "RPL10", "RPL15",
"RBMX", "NONO", "RPGR", "UPF3B", "RBM3", "HNRNPH2", "UTP14A",
"DKC1", "MEX3C", "DDX3X", "FLNA", "FAM120C")), class = "data.frame", row.names = c(NA,
-31L))
So far, I have only come up with this.
sp<-shortest_path[protein1[,1],protein2[,1]]
dput for shortest_path:
structure(c(0, 4, 6, 4, 4, 4, 4, 3, 3, 3, 5, 3, 5, 3, 3, 3, 4,
3, 3, 3, 4, 0, 5, 4, 4, 4, 4, 3, 3, 3, 5, 3, 5, 3, 3, 3, 4, 3,
3, 3, 6, 5, 0, 6, 4, 6, 5, 5, 5, 5, 7, 5, 6, 5, 5, 3, 6, 5, 5,
5, 4, 4, 6, 0, 3, 3, 3, 3, 3, 3, 4, 3, 5, 3, 3, 3, 4, 3, 3, 3,
4, 4, 4, 3, 0, 4, 3, 3, 3, 3, 4, 3, 5, 3, 3, 3, 4, 3, 3, 3, 4,
4, 6, 3, 4, 0, 3, 3, 3, 3, 5, 3, 5, 3, 3, 3, 4, 3, 3, 3, 4, 4,
5, 3, 3, 3, 0, 3, 3, 2, 3, 3, 5, 3, 3, 3, 4, 3, 3, 3, 3, 3, 5,
3, 3, 3, 3, 0, 2, 2, 4, 2, 4, 2, 2, 2, 3, 2, 2, 2, 3, 3, 5, 3,
3, 3, 3, 2, 0, 2, 4, 2, 4, 2, 2, 2, 3, 2, 2, 2, 3, 3, 5, 3, 3,
3, 2, 2, 2, 0, 2, 2, 4, 2, 2, 2, 3, 2, 2, 2, 5, 5, 7, 4, 4, 5,
3, 4, 4, 2, 0, 4, 6, 4, 4, 4, 5, 4, 4, 4, 3, 3, 5, 3, 3, 3, 3,
2, 2, 2, 4, 0, 4, 2, 2, 2, 3, 2, 2, 2, 5, 5, 6, 5, 5, 5, 5, 4,
4, 4, 6, 4, 0, 4, 4, 4, 5, 4, 4, 4, 3, 3, 5, 3, 3, 3, 3, 2, 2,
2, 4, 2, 4, 0, 2, 2, 3, 2, 2, 2, 3, 3, 5, 3, 3, 3, 3, 2, 2, 2,
4, 2, 4, 2, 0, 2, 3, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 4,
2, 4, 2, 2, 0, 3, 2, 2, 2, 4, 4, 6, 4, 4, 4, 4, 3, 3, 3, 5, 3,
5, 3, 3, 3, 0, 3, 3, 3, 3, 3, 5, 3, 3, 3, 3, 2, 2, 2, 4, 2, 4,
2, 2, 2, 3, 0, 1, 2, 3, 3, 5, 3, 3, 3, 3, 2, 2, 2, 4, 2, 4, 2,
2, 2, 3, 1, 0, 2, 3, 3, 5, 3, 3, 3, 3, 2, 2, 2, 4, 2, 4, 2, 2,
2, 3, 2, 2, 0), .Dim = c(20L, 20L), .Dimnames = list(c("1810055G02Rik",
"2810046L04Rik", "4922501C03Rik", "4930572J05Rik", "9830001H06Rik",
"A1CF", "A2M", "AAGAB", "AATF", "ABCA1", "ABCA13", "ABCA2", "ABCA4",
"ABCB1", "ABCB7", "ABCC2", "ABCC8", "ABCD1", "ABCD3", "ABCD4"
), c("1810055G02Rik", "2810046L04Rik", "4922501C03Rik", "4930572J05Rik",
"9830001H06Rik", "A1CF", "A2M", "AAGAB", "AATF", "ABCA1", "ABCA13",
"ABCA2", "ABCA4", "ABCB1", "ABCB7", "ABCC2", "ABCC8", "ABCD1",
"ABCD3", "ABCD4")))
Thanks in advance!
Maybe you can try the code below
outer(protein1$LAS1L, protein2$GNL3L, FUN = function(x, y) shortest_path[cbind(x, y)])
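As a minimal sketch of the lookup on toy data: the matrix sp and the name vectors from and to are made up for illustration. outer() calls FUN once with two full-length vectors, so FUN needs to return one value per pair; indexing with a two-column character matrix is one way to do that. Names that are missing from the matrix are filtered out first, on the assumption that they should simply be skipped.

```r
# Toy distance matrix with protein names as dimnames (values are made up).
sp <- matrix(c(0, 1, 2,
               1, 0, 3,
               2, 3, 0),
             nrow = 3,
             dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

from <- c("A", "B", "Z")   # "Z" is deliberately absent from the matrix
to   <- c("B", "C", "A")

# Keep only names the matrix knows about, to avoid "subscript out of bounds".
from <- from[from %in% rownames(sp)]
to   <- to[to %in% colnames(sp)]

# A two-column character matrix index picks one cell per (x, y) pair.
res <- outer(from, to, FUN = function(x, y) sp[cbind(x, y)])
dimnames(res) <- list(from, to)
print(res)

# For a plain lookup, direct subsetting gives the same table.
print(sp[from, to])
```

For a simple table lookup like this, sp[from, to] is the shorter route; outer() becomes useful when the function of each pair is more than a lookup.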

Math Symbols within for loop of GGplots in R

I'm currently trying to produce a result similar to the one in this link. I have a large number of columns and several different labels for the x-axis.
col1 <- c(2, 4, 1, 2, 5, 1, 2, 0, 1, 4, 4, 3, 5, 2, 4, 3, 3, 6, 5, 3, 6, 4, 3, 4, 4, 3, 4,
2, 4, 3, 3, 5, 3, 5, 5, 0, 0, 3, 3, 6, 5, 4, 4, 1, 3, 3, 2, 0, 5, 3, 6, 6, 2, 3,
3, 1, 5, 3, 4, 6)
col2 <- c(2, 4, 4, 0, 4, 4, 4, 4, 1, 4, 4, 3, 5, 0, 4, 5, 3, 6, 5, 3, 6, 4, 4, 2, 4, 4, 4,
1, 1, 2, 2, 3, 3, 5, 0, 3, 4, 2, 4, 5, 5, 4, 4, 2, 3, 5, 2, 6, 5, 2, 4, 6, 3, 3,
3, 1, 4, 3, 5, 4)
col3 <- c(2, 5, 4, 1, 4, 2, 3, 0, 1, 3, 4, 2, 5, 1, 4, 3, 4, 6, 3, 4, 6, 4, 1, 3, 5, 4, 3,
2, 1, 3, 2, 2, 2, 4, 0, 1, 4, 4, 3, 5, 3, 2, 5, 2, 3, 3, 4, 2, 4, 2, 4, 5, 1, 3,
3, 3, 4, 3, 5, 4)
col4 <- c(2, 5, 2, 1, 4, 1, 3, 4, 1, 3, 5, 2, 4, 3, 5, 3, 4, 6, 3, 4, 6, 4, 3, 2, 5, 5, 4,
2, 3, 2, 2, 3, 3, 4, 0, 1, 4, 3, 3, 5, 4, 4, 4, 3, 3, 5, 4, 3, 5, 3, 6, 6, 4, 2,
3, 3, 4, 4, 4, 6)
data2 <- data.frame(col1,col2,col3,col4)
data2[,1:4] <- lapply(data2[,1:4], as.factor)
colnames(data2)<- c("A","B","C", "D")
> x.axis.list
[[1]]
expression(beta[paste(1, ",", 1L)])
[[2]]
expression(beta[paste(1, ",", 2L)])
[[3]]
expression(beta[paste(1, ",", 3L)])
[[4]]
expression(beta[paste(1, ",", 4L)])
myplots <- vector('list', ncol(data2))
for (i in seq_along(data2)) {
message(i)
myplots[[i]] <- local({
i <- i
p1 <- ggplot(data2, aes(x = data2[[i]])) +
geom_histogram(fill = "lightgreen") +
xlab(x.axis.list[[i]])
print(p1)
})
}
In the past, I've been able to do something similar, putting x.axis.list[[i]] in my loop to change the symbols. This time, however, the word "expression" keeps appearing on the axis: the beta symbol and the subscript are correct, but "expression" remains. I'm not sure what I'm doing wrong. For a moment I was able to produce a plot without "expression", but it has since come back.
I want to be able to produce this plot, or one with the title on the y-axis, without the word "expression".
I'm not worried about this example data or the resulting plot; I only want to know how to get rid of "expression" so that only the math symbol shows.
Thanks in advance.
You can do:
for (i in seq_along(data2)) {
df <- data2[i]
names(df)[1] <- "x"
myplots[[i]] <- local({
p1 <- ggplot(df, aes(x = x)) +
geom_bar(fill = "lightgreen", stat = "count") +
xlab(x.axis.list[[i]])
})
}
And we can show all the plots together:
library(patchwork)
(myplots[[1]] + myplots[[2]]) / (myplots[[3]] + myplots[[4]])
Note I created the expression list like this:
x.axis.list <- lapply(1:4, function(i){
parse(text = paste0("beta[paste(1, \",\", ", i, ")]"))
})
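The question does not show how the original x.axis.list was built, so the following is only a guess at the cause: a list of unevaluated calls to expression() prints in the same form as the list in the question, but each element is a call rather than an expression object, and plotmath would then draw the function name too. The sketch below contrasts that hypothetical construction (labels_wrapped) with the parse() approach above and with a bquote() alternative; all three list names are assumptions introduced for illustration.

```r
# A real plotmath label: an expression object built from text, as in the answer.
labels_parse <- lapply(1:4, function(i) {
  parse(text = paste0("beta[paste(1, \",\", ", i, ")]"))
})

# An alternative without string pasting: bquote() substitutes .(i) into a call.
labels_bquote <- lapply(1:4, function(i) bquote(beta[paste(1, ",", .(i))]))

# Hypothetical source of the problem: an unevaluated call to expression().
labels_wrapped <- lapply(1:4, function(i) {
  substitute(expression(beta[paste(1, ",", j)]), list(j = i))
})

print(is.expression(labels_parse[[1]]))   # expression object
print(is.call(labels_wrapped[[1]]))       # call whose function is `expression`

# Evaluating each call once yields proper expression objects.
fixed <- lapply(labels_wrapped, eval)
print(is.expression(fixed[[1]]))
```

If the list was built this way, passing eval(x.axis.list[[i]]) to xlab(), or rebuilding the list with parse() or bquote(), should leave only the math symbol.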

Add legend to graph in R

For a sample dataframe:
df <- structure(list(antibiotic = c(0.828080341411847, 1.52002304506738,
1.31925434545302, 1.66681722567074, 1.17791610945551, 0.950096368502059,
1.10507733691997, 1.0568193215304, 1.03853131016669, 1.02313195567946,
0.868629787234043, 0.902126485349154, 1.12005679002801, 1.88261441540084,
0.137845900627507, 1.07040656448604, 1.41496470588235, 1.30978543173373,
1.16931780610558, 1.05894439450366, 1.24805122785724, 1.21318238007025,
0.497310305098053, 0.872362356327429, 0.902584749481137, 0.999731895498823,
0.907560340983954, 1.05930840957587, 1.40457554864091, 1.09747179272879,
0.944219456216072, 1.10363111431903, 0.974649273935516, 0.989983064420841,
1.14784471036171, 1.17232858907798, 1.44675812720393, 0.727078405331282,
1.36341361598635, 1.06120293299474, 1.06920290856811, 0.711007267992205,
1.39034247642439, 0.710873996527168, 1.30529753573398, 0.781191310196629,
0.921788181250106, 0.932214675722466, 0.752289683770589, 0.942392026874501
), year = c(3, 1, 4, 1, 2, 4, 1, 3, 4, 3, 4, 1, 2, 3, 4, 1, 1,
4, 1, 1, 1, 1, 4, 1, 3, 3, 1, 4, 1, 4, 2, 1, 1, 1, 3, 4, 3, 2,
2, 2, 3, 3, 1, 2, 3, 2, 3, 4, 4, 1), imd.decile = c(8, 2, 5,
5, 4, 3, 2, 8, 6, 4, 3, 6, 9, 2, 5, 3, 5, 6, 4, 2, 9, 11, 2,
8, 3, 5, 7, 8, 7, 4, 9, 7, 6, 4, 8, 10, 5, 6, 6, 11, 6, 4, 2,
4, 10, 8, 2, 8, 4, 3)), .Names = c("antibiotic", "year", "imd.decile"
), row.names = c(17510L, 6566L, 24396L, 2732L, 13684L, 28136L,
1113L, 15308L, 28909L, 21845L, 23440L, 1940L, 8475L, 22406L,
27617L, 4432L, 3411L, 27125L, 6891L, 6564L, 1950L, 5683L, 25240L,
5251L, 20058L, 18068L, 5117L, 29066L, 2807L, 24159L, 12309L,
6044L, 7629L, 2336L, 16583L, 23921L, 17465L, 14911L, 8879L, 13929L,
17409L, 19421L, 7239L, 11570L, 15283L, 8283L, 16246L, 27950L,
23723L, 4411L), class = "data.frame")
I am trying to graph imd.decile by antibiotic for each year
library(ggplot2)
p <- ggplot(df, aes(x = imd.decile, y = antibiotic, group = factor(year))) +
stat_summary(geom = "line", fun.y = mean)
p
How do I colour each line by its wave (year) and add a matching legend? I can't seem to use the aes command correctly.
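One possible approach, sketched on made-up toy data rather than the df above, is to turn year into a factor and map it to colour inside aes(); ggplot2 then adds the legend automatically. The data frame toy and the legend title "Year" are assumptions for illustration, and the sketch uses fun rather than fun.y, which assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Toy data standing in for df (values are made up for illustration).
toy <- data.frame(
  imd.decile = rep(1:5, times = 4),
  year       = rep(c(1, 2), each = 10),
  antibiotic = c(1.0, 1.1, 1.2, 1.3, 1.4, 1.1, 1.2, 1.3, 1.4, 1.5,
                 0.8, 0.9, 1.0, 1.1, 1.2, 0.9, 1.0, 1.1, 1.2, 1.3)
)

# A discrete colour scale (one legend key per year) needs a factor.
toy$year <- factor(toy$year)

# Mapping colour inside aes() is what creates the legend.
p <- ggplot(toy, aes(x = imd.decile, y = antibiotic,
                     colour = year, group = year)) +
  stat_summary(geom = "line", fun = mean) +  # mean antibiotic per decile and year
  labs(colour = "Year")
print(p)
```

Setting colour outside aes(), for example as a fixed argument to stat_summary(), would colour the lines but produce no legend.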

R: how to count the number of times two elements have the same ID (perhaps using the outer function)

I have the following three dimensional array:
dput(a)
structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 2, 1, 1, 1, 2, 2,
2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 6, 2, 7, 6, 2, 7, 6, 2, 7, 4, 2, 4, 4, 2, 6, 4, 2, 4, 6, 2,
7, 4, 2, 6, 4, 2, 6, 4, 2, 6, 4, 2, 4, 4, 2, 6, 4, 2, 4, 4, 2,
6, 4, 2, 6, 4, 2, 6, 6, 2, 7, 4, 2, 6, 4, 2, 6, 4, 2, 4, 2, 3,
1, 2, 3, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 3, 7, 2, 3,
7, 2, 3, 7, 2, 3, 7, 2, 3, 7, 2, 3, 7, 2, 3, 7, 2, 3, 7, 2, 3,
7, 2, 3, 7, 1, 2, 5, 2, 3, 7, 1, 2, 4, 2, 3, 7, 2, 3, 7, 2, 3,
7, 2, 3, 7, 2, 3, 7, 2, 3, 7, 2, 3, 7, 2, 6, 3, 2, 6, 3, 2, 6,
3, 2, 6, 3, 2, 6, 3, 2, 6, 3, 2, 6, 3, 2, 6, 3, 2, 6, 3, 2, 6,
3, 1, 1, 1, 2, 6, 3, 1, 5, 5, 2, 6, 3, 2, 6, 3, 2, 6, 3, 2, 6,
3, 2, 6, 3, 2, 6, 3, 2, 6, 3, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 13,
2, 3, 13, 2, 3, 5, 2, 3, 5, 2, 15, 17, 2, 15, 17, 2, 15, 17,
2, 3, 5, 2, 15, 17, 2, 3, 13, 2, 15, 17, 2, 15, 17, 2, 3, 13,
2, 3, 5, 2, 15, 17, 2, 15, 17, 2, 3, 5, 2), .Dim = c(3L, 20L,
6L), .Dimnames = list(c("cl.tmp", "cl.tmp", "cl.tmp"), NULL,
NULL))
The dimension of this array (a) is 3x20x6 (after edits).
I wanted to count the proportion of times that a[,i,] matches a[,j,] element-by-element in the matrix. Basically, I wanted to get mean(a[,i,] == a[,j,]) for all i, j, and I would like to do this fast but in R.
It occurred to me that the outer function might be a possibility but I am not sure how to specify the function. Any suggestions, or any other alternative ways?
The output would be a 20x20 symmetric matrix of nonnegative elements with 1 on the diagonals.
The solution given below works (thanks!) but I have one further question (sorry).
I would like to display the coordinates above in a heatmap. I try the following:
n<-dim(a)[2]
xx <- matrix(apply(a[,rep(1:n,n),]==a[,rep(1:n,each=n),],2,sum),nrow=n)/prod(dim(a)[-2])
image(1:20, 1:20, xx, xlab = "", ylab = "")
This gives me the following heatmap.
However, I would like to reorder the coordinates so that those with high values amongst each other are displayed together, without biasing the results by deciding on the number of groups myself. I tried
hc <- hclust(as.dist(1-xx), method = "single")
but I cannot decide how to cut the resulting tree to group the coordinates together. Any suggestions? Basically, in the figure, I would like the coordinate pairs in the top-left and bottom-right off-diagonal blocks to be as low-valued (in this case, as red) as possible.
Looking around on SO, I found that there exists a function heatmap which might do this,
heatmap(xx,Colv=T,Rowv=T, scale='none',symm = T)
and I get the following:
which is all right, but I can not figure out how to get rid of the dendrograms on the sides or the axes labels. It does work if I extract out and do the following:
yy <- heatmap(xx,Colv=T,Rowv=T, scale='none',symm = T,keep.dendro=F)
image(1:20, 1:20, xx[yy$rowInd,yy$colInd], xlab = "", ylab = "")
so I guess that is what I will stick with. Here is the result:
Try this:
n<-dim(a)[2]
matrix(apply(a[,rep(1:n,n),]==a[,rep(1:n,each=n),],2,sum),nrow=n)/prod(dim(a)[-2])
Note that the memory usage of this method grows as n^2, so you may have trouble using it with larger arrays.
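If memory becomes a problem, a plain double loop is one alternative: it only ever compares two slices at a time, at the cost of speed. The sketch below checks it against the vectorised version on a small random array; the array a_toy and its dimensions are made up for illustration.

```r
set.seed(1)
# Small random array standing in for a: 2 x 5 x 4, values in 1..3.
a_toy <- array(sample(1:3, 2 * 5 * 4, replace = TRUE), dim = c(2, 5, 4))
n <- dim(a_toy)[2]

# Vectorised version from the answer: builds two 2 x n^2 x 4 arrays in memory.
vec <- matrix(apply(a_toy[, rep(1:n, n), ] == a_toy[, rep(1:n, each = n), ], 2, sum),
              nrow = n) / prod(dim(a_toy)[-2])

# Loop version: compares one pair of slices at a time and fills both triangles.
loop <- matrix(0, n, n)
for (i in 1:n) {
  for (j in i:n) {
    loop[i, j] <- loop[j, i] <- mean(a_toy[, i, ] == a_toy[, j, ])
  }
}

print(isTRUE(all.equal(vec, loop)))
print(diag(loop))
```

Because the result is symmetric, the loop only visits pairs with j >= i, which roughly halves the number of comparisons.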
