Build data frame from multiple CSVs in R - r

I don't know much about R, but I need to use it to draw a set of box plots from a data ensemble.
I have a set of .csv files representing a set of 2D data. They contain the following columns:
i: the row of the matrix
j: the column of the matrix
VBoot: a property of the matrix
My data is 128 x 128, but each .csv contains only the indices of the non-zero entries.
I have to plot a box plot for each of these files, side by side.
This is my approach:
library(ggplot2)
library(reshape)
# Set the directory to read the files
setwd("/Users/me/data/CSV/")
operatorProperty <- function(operator, property, degrees, m, n)
{
p <- list()
for (degree in degrees)
{
file <- paste(c(degree, operator, property, ".csv"), collapse="")
data <- read.csv(file, header=TRUE, sep=" ", dec=".")
# Create an array m * n to fill with the data
b <- vector(mode="double", length=(m*n))
# Rebuild the complete data to properly build the box plot
b[data$i * m + data$j] = sqrt(data$VBoot)
p <- append(p, list(b))
}
p
}
So far, I have just created a list holding the data for each ensemble member.
Then I thought I should build a data.frame:
min_degree = 0
max_degree = 45
delta = 5
m = 128
n = 128
degrees <- seq(min_degree, max_degree, delta)
property <- "VBoot"
operator <- "Prewitt"
Sobel <- operatorProperty(operator, property, degrees, m, n)
df <- data.frame(degrees, Sobel)
df2<- melt(data=df,id.vars="degrees")
p <- ggplot(df2, aes(x=degrees,y=value,colour=variable)) +
geom_boxplot() +
theme(legend.title=element_blank()) +
xlab(expression(theta)) +
ylab("Bootstrap Variance")
However, I can't build the data.frame. I don't know how to proceed. An example of the data can be found here.
Thank you in advance.

OK. I had to change a few things to get this to work with the sample data. Here's the setup:
m = 128
n = 128
operatorProperty <- function(operator, property, degrees, m, n)
{
Map(function(degree) {
file <- paste(c(degree, operator, property, ".csv"), collapse="")
data <- read.table(file, header=TRUE, dec=".")
# Create an array m * n to fill with the data
b <- vector(mode="double", length=(m*n))
# Rebuild the complete data to properly build the box plot
b[data$i * m + data$j] = sqrt(data[[property]])
b
}, degrees)
}
degrees <- c('00','05')
property <- "MSE"
operator <- "Prewitt"
Sobel <- operatorProperty(operator, property, degrees, m, n)
With this modified form, Sobel is a list with named elements corresponding to the different degrees. We can turn this into a data.frame and plot with
df2<- melt(data.frame(Sobel, check.names=F))
p <- ggplot(df2, aes(x=variable,y=value,colour=variable)) +
geom_boxplot() +
theme(legend.title=element_blank()) +
xlab(expression(theta)) +
ylab("Bootstrap Variance")
which looks very odd because there are so many zeros: all of your non-zero entries are just marked as outliers.
But even if we hadn't named the elements of the Sobel list, it would basically be a list with two vectors (one for each degree):
list(c(0,0,0,0, ...), c(0,0,0,0,...))
If you wanted to merge that with degrees and turn it into a data.frame, another choice could have been
do.call(rbind, Map(cbind.data.frame, degrees, Sobel))
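As a minimal sketch of the same reshaping without reshape::melt, base R's stack() turns a named list of vectors into a long data.frame. The toy values below are made up and only stand in for the vectors returned by operatorProperty; the column names value and degree are my own choice:

```r
# Hypothetical stand-in for the list returned by operatorProperty()
Sobel <- list(`00` = c(0, 0, 1.5, 2), `05` = c(0, 3, 0, 4.5))

# stack() yields two columns: `values` (the data) and `ind` (the list name)
long <- stack(Sobel)
names(long) <- c("value", "degree")

# Optional: drop the zeros so the boxes are not squashed onto zero
nonzero <- long[long$value != 0, ]
```

long can then go straight into ggplot(long, aes(x = degree, y = value)) + geom_boxplot(). Whether dropping the zeros is appropriate depends on what the box plot is meant to show.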

Related

how to remove cluster of pixels using clump function in R

I would like to remove the pixels that form a large cluster and keep only the small clusters for analysis (that is, get the pixel counts and locations). First I apply a filter that colors white all pixels with a value lower than 0.66. Then I use the clump() function in R. The code runs, but I cannot remove only the large cluster. I do not understand how the clump function works.
Initial image:
Result images: plot_r is the image where the pixels with value < 0.66 are changed to 0. plot_rc is the result after the clump() function. As can be seen, I cannot remove only the large cluster of pixels (at the top of the image plot_r). I changed the threshold (700 in the code), but the result was no better. How can I do this?
Here is the code:
library(magick)
library(pixmap)
library(raster)
library(igraph)
f <- "https://i.stack.imgur.com/2CjCh.jpg"
x <- image_read(f)
x <- image_convert(x, format = "pgm", depth = 8)
# Save the PGM file
f <- tempfile(fileext = ".pgm")
image_write(x, path = f, format = "pgm")
# Read in the PGM file
picture <- read.pnm(file = f, cellres = 1)
str(picture)
picture@size
mat <- picture@grey
mat[mat<0.66] <- 0; x
##############################################################
##Remove clumps of pixels in R using package Raster and igraph
#Detect clumps (patches) of connected cells
r <-raster(mat)
rc <- clump(r)
#extract IDs of clumps according to some criteria
clump9 = data.frame(freq(rc))
#remove clump observations with frequency smaller/larger than N
clump9 = clump9[ ! clump9$count > 700, ]
# record IDs from clumps which met the criteria in previous step
clump9 = as.vector(clump9$value)
#replace cells with IDs which do not belong to the group of interest
rc[rc != clump9[1] & rc != clump9[2]] = NA
# converting rasterlayer to matrix
n <- as.matrix(r)
m <- as.matrix(rc)
Perhaps something like this
library(raster)
library(igraph)
Short-cutting your approach a bit
f <- "https://i.stack.imgur.com/2CjCh.jpg"
b <- brick(f)
x <- sum(b)
r <- x > 450
rc <- clump(r)
f <- freq(rc, useNA="no")
Replace the clumps with the number of cells they consist of and then set the larger one (here more than 100 cells) to NA, and use the result to mask the original raster
rs <- subs(rc, data.frame(f))
rsc <- reclassify(rs, cbind(100,Inf,NA))
m <- mask(b, rsc)
plotRGB(m)
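To see what the "replace each clump ID with its cell count, then drop the big ones" step does, here is a small base R sketch on a made-up 3 x 3 matrix of clump IDs. It does not use raster at all; clump_id and the size threshold of 2 are purely illustrative:

```r
# Made-up clump IDs as clump() might label them; NA marks background cells
clump_id <- matrix(c(1, 1, 1, NA, 2, NA, 1, NA, 3), nrow = 3)

# Number of cells in each clump (NA cells are not counted)
clump_size <- table(clump_id)

# Look up, for every cell, the size of the clump it belongs to
cell_size <- as.vector(clump_size)[match(clump_id, as.numeric(names(clump_size)))]

# Keep only cells that belong to small clumps (here: at most 2 cells)
small_only <- clump_id
small_only[is.na(cell_size) | cell_size > 2] <- NA
```

Clump 1 has four cells and is blanked out, while the single-cell clumps 2 and 3 survive. This is the role that subs() and reclassify() play on the raster in the answer above.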

Minimal number of coverage of big data lists

Following my question
I use the following code:
dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')
....
p32<-c('att35','att34','att32')
In the real case there can be 1024 vectors.
I would like to find the p vectors whose union covers as many components of dist as possible. In this case the solution would be p1, p3, p5. I want to choose the minimal number of p vectors. In addition, if there is no way to cover all the components of dist, I want to choose the maximal cover with the minimal number of vectors (p).
N = 32
library(qdapTools)
library(dplyr)
library(data.table)
## generate matrix of attributes
attribute_matrix <- mtabulate(list(p1, p2, p3, p4, p5,...,p32))
library (bigmemory)
## generate matrix of attributes
grid_matrix <- do.call(CJ, rep(list(1:0), N)) %>% as.big.matrix
Error: cannot allocate vector of size 8.0 Gb
I tried an alternative way for it:
grid_matrix <- do.call(CJ, rep(list(1:0), N)) %>% as.data.frame
grid_matrix <- as.matrix (grid_matrix)
And still got the same error.
How can I fix it and use it for big data? I wanted to continue with:
colnames(grid_matrix) <- paste0("p", 1:N)
combin_all_element_present <- rowSums(grid_matrix %*% attribute_matrix > 0) %>% `==`(., ncol(attribute_matrix))
grid_matrix_sub <- grid_matrix[combin_all_element_present, ]
grid_matrix_sub[rowSums(grid_matrix_sub) == min(rowSums(grid_matrix_sub)), ]
This is known as a set covering problem. It can be solved using integer linear programming. Let x1, x2, ... be 0/1 variables (one for each p variable) and represent p1, p2, ... as 0/1 vectors P1, P2, ... and dist as
a 0/1 vector D. Then the problem can be stated as:
min x1 + x2 + ... + x32
such that
P1 * x1 + P2 * x2 + ... + P32 * x32 >= D
which in R code is the following. First create a list p with the p vectors in sorted order. Use mixedsort so that p32 comes at the end instead of right after p3. Define attnames as the set of all att names in all the p vectors.
Then formulate the objective function (which equals the number of p's in the cover), the constraint matrix (consisting of the P vectors as columns) and the right hand side of the constraint equations (which is dist as a 0/1 vector). Finally run the integer linear program and convert the solution from a 0/1 vector to a vector of p names.
library(gtools)
library(lpSolve)
p <- mget(mixedsort(ls(pattern = "^p\\d+$")))
attnames <- mixedsort(unique(unlist(p)))
objective <- rep(1L, length(p))
const.mat <- sapply(p, function(x) attnames %in% x) + 0L
const.rhs <- (attnames %in% dist) + 0L
ans <- lp("min", objective, const.mat, ">=", const.rhs, all.bin = TRUE)
names(p)[ans$solution == 1L]
## [1] "p2" "p4" "p5"
The constraint matrix has a row for each attnames entry and a column for each p vector.
The solution produces the minimal covers of those attnames elements that are in dist. If every element of dist appears in at least one p vector then the solution will represent a cover of dist. If not, the solution will represent a cover of those att names in one or more p vectors that are also in dist; thus, this handles both cases discussed in the question. The uncovered elements of dist are:
setdiff(dist, attnames)
so if that is of zero length then the solution represents a complete cover of dist. If not the solution represents a cover of
intersect(dist, attnames)
The sorting done in the code is not strictly needed but it may be easier to work with the various inputs to the optimization by having the rows and columns of the constraint matrix in a logical order.
Note: Run this code from the question before running the above code:
dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')
p32<-c('att35','att34','att32')
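For an instance as small as p1 to p5, the size of the minimum cover can be cross-checked by brute force in base R. This is only a sanity check under the assumption that there are few sets; it enumerates subsets and will not scale to the 1024-vector case, which is exactly why the integer program is used:

```r
dist <- c("att1", "att2", "att3", "att4", "att5", "att6")
p <- list(
  p1 = c("att1", "att5", "att2"),
  p2 = c("att5", "att1", "att4"),
  p3 = c("att3", "att4", "att2"),
  p4 = c("att1", "att2", "att3"),
  p5 = c("att6")
)

# TRUE when the chosen sets together contain every element of dist
covers <- function(idx) all(dist %in% unlist(p[idx]))

# Try subset sizes 1, 2, ... and stop at the first size that yields a cover
for (k in seq_along(p)) {
  hits <- Filter(covers, combn(length(p), k, simplify = FALSE))
  if (length(hits) > 0) break
}

min_covers <- vapply(hits, function(idx) paste(names(p)[idx], collapse = " "),
                     character(1))
```

Here k ends at 3 and min_covers lists every cover of that size. The cover reported by lp() above (p2, p4, p5) is one of them, and so is p1, p3, p5 from the question, so the minimum is not unique.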
The answer already provided is perfect but another approach could be the following:
dist<-c('att1','att2','att3','att4','att5','att6')
p1<-c('att1','att5','att2')
p2<-c('att5','att1','att4')
p3<-c('att3','att4','att2')
p4<-c('att1','att2','att3')
p5<-c('att6')
library(qdapTools)
library(data.table)
attribute_matrix <- mtabulate(list(p1, p2, p3, p4, p5))
minimal_sets <- function(superset, subsets_matrix, p){
setDT(subsets_matrix)
# removing the columns that are not in the superset
updated_sub_matr <- subsets_matrix[, which(names(subsets_matrix) %in% superset), with = F]
# initializing counter for iterations and the subset selected
subset_selected <- integer(0)
counter <- p
## Loop until either we ran out of iterations counter = 0 or we found the solution
while (counter > 0 & length(superset) > 0){
## find the row with the most matches with the superset we want to achieve
max_index <- which.max(rowSums(updated_sub_matr))
## remove from the superset the entries that match that line and from the subsets_matrix those columns as they dont contribute anymore
superset <- superset[which(updated_sub_matr[max_index, ] == 0)]
updated_sub_matr <- updated_sub_matr[, - which(updated_sub_matr[max_index, ] != 0), with = F]
counter <- counter - 1
subset_selected <- c(subset_selected, max_index)
}
if (length(superset) > 0){
print(paste0("No solution found, there are(is) ", length(superset), " element(s) left ", paste(superset, collapse = "-")))
} else {
print(paste0("Found a solution after ", p - counter, " iterations"))
}
print(paste0("Selected the following subsets: ", paste(subset_selected, collapse = "-")))
}
In this function you input your superset (in this case dist), the attribute_matrix and the number p which you want to check and it outputs the best possible solution it found as well as the iterations.
> minimal_sets(dist, attribute_matrix, 1)
[1] "No solution found, there are(is) 3 element(s) left att3-att4-att6"
[1] "Selected the following subsets: 1"
> minimal_sets(dist, attribute_matrix, 3)
[1] "Found a solution after 3 iterations"
[1] "Selected the following subsets: 1-3-5"
> minimal_sets(dist, attribute_matrix, 5)
[1] "Found a solution after 3 iterations"
[1] "Selected the following subsets: 1-3-5"
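The function above is a greedy heuristic: at each step it takes the subset that covers the most still-uncovered elements. A compact sketch of the same idea that works directly on a named list and returns its result instead of printing it could look like this (greedy_cover is a hypothetical name, not part of any package). Note that a greedy choice is not guaranteed to give the smallest possible cover in general, unlike the integer program:

```r
greedy_cover <- function(universe, sets) {
  chosen <- character(0)
  remaining <- universe
  while (length(remaining) > 0) {
    # How many still-uncovered elements would each set add?
    gain <- vapply(sets, function(s) sum(remaining %in% s), integer(1))
    if (max(gain) == 0) break            # nothing left can be covered
    best <- names(sets)[which.max(gain)] # ties go to the first set
    chosen <- c(chosen, best)
    remaining <- setdiff(remaining, sets[[best]])
  }
  list(chosen = chosen, uncovered = remaining)
}

dist <- c("att1", "att2", "att3", "att4", "att5", "att6")
p <- list(
  p1 = c("att1", "att5", "att2"),
  p2 = c("att5", "att1", "att4"),
  p3 = c("att3", "att4", "att2"),
  p4 = c("att1", "att2", "att3"),
  p5 = c("att6")
)
res <- greedy_cover(dist, p)
```

On this data res$chosen is p1, p3, p5, matching the 1-3-5 selection printed above, and res$uncovered is empty.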

How to display counts using the periodic table with ggplot?

I have a list of elemental compositions and I'd like to display a count for the number of times an element is included in a composition mapped onto the periodic table (e.g. CH4 would increase the count on H and C by one).
How can I do this with ggplot? Is there a map I can use?
With a bit of searching I found information about the periodic table in this example code project. They had an Access Database with element information. I've exported it to this gist. You can import the data using the httr library with
library(httr)
dd <- read.table(text=content(GET("https://gist.githubusercontent.com/MrFlick/c1183c911bc5398105d4/raw/715868fba2d0d17a61a8081de17c468bbc525ab1/elements.txt")), sep=",", header=TRUE)
(You should probably create your own local version for easier loading in the future.)
Then your other challenge is decomposing something like "CH4" into the raw element counts. I've created this helper function which I think does what you need.
decompose <- function(x) {
m <- gregexpr("([A-Z][a-z]?)(\\d*)", x, perl=T)
dx <- Map(function(x, y) {
ElementSymbol <- gsub("\\d","", x)
cnt <- as.numeric(gsub("\\D","", x))
cnt[is.na(cnt)]<-1
cbind(Sym=y, as.data.frame(xtabs(cnt~ElementSymbol)))
}, regmatches(x,m), x)
do.call(rbind, dx)
}
Here I test the function
test_input <- c("H2O","CH4")
decompose(test_input)
# Sym ElementSymbol Freq
# 1 H2O H 2
# 2 H2O O 1
# 3 CH4 C 1
# 4 CH4 H 4
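To make the regular-expression step easier to follow, here is the core of it on a single formula, written out step by step in base R. The example formula "C2H5OH" is my own; note that, like decompose above, this simple pattern does not understand parentheses or hydrates:

```r
formula <- "C2H5OH"

# One token per element occurrence: a capital, an optional lowercase letter,
# then optional digits
tokens <- regmatches(formula,
                     gregexpr("[A-Z][a-z]?\\d*", formula, perl = TRUE))[[1]]

symbols <- gsub("\\d", "", tokens)             # strip digits  -> element symbol
counts <- as.numeric(gsub("\\D", "", tokens))  # strip letters -> count, or NA
counts[is.na(counts)] <- 1                     # no digits means a count of one

# Sum repeated elements (H appears twice in this formula)
totals <- tapply(counts, symbols, sum)
```

The tokens are "C2", "H5", "O" and "H", so the two hydrogen tokens are added together in totals.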
Now we can combine the data and the reference information to make a plot
library(ggplot2)
ggplot(merge(decompose("CH4"), dd), aes(Column, -Row)) +
geom_tile(data=dd, aes(fill=GroupName), color="black") +
geom_text(aes(label=Freq))
Clearly there are opportunities for improvement but this should give you a good start.
You might look for a more robust decomposition function. Looks like the CHNOSZ package has one
library(CHNOSZ)
data(thermo)
decompose <- function(x) {
do.call(`rbind`, lapply(x, function (x) {
z <- makeup(x)
cbind(data.frame(ElementSymbol = names(z),Freq=z), Sym=x)
}))
}
ggplot(merge(decompose("CaAl2Si2O7(OH)2*H2O"), dd), aes(Column, -Row)) +
geom_tile(data=dd, aes(fill=GroupName), color="black") +
geom_text(aes(label=Freq))

Retain start and end IDs when creating spatial lines from X Y points

The following code will create SpatialLines that connect all x/y points in a dataset. However, each of those x/y points has a unique ID. I need to retain both the startpoint and the endpoint IDs of the x/y as "attributes" (the ArcGIS term) of each of the SpatialLines that connect them.
Help would be appreciated.
I have a dataset of XY values that looks like this
x<-c(2,4,6,3,7,9,1)
y<-c(6,4,8,2,9,6,1)
id<-c("a","b","c","d","e","f","g")
dataset<-data.frame(cbind(x,y,id))
dataset$x<-as.numeric(as.character(dataset$x)) #converting from factor to numeric
dataset$y<-as.numeric(as.character(dataset$y))
plot(dataset$x,dataset$y)
Replicate the dataframe to cover for all possible combinations
dataset<-do.call(rbind, replicate(7, dataset, simplify=FALSE))
Now, create a matrix with all the same destination points, mixed:
nm=matrix(ncol=3)
for (i in 1:7){
nm<-rbind(nm,do.call(rbind,replicate(7,as.matrix(dataset[i,]),simplify=FALSE)))
}
nm<-nm[-1,]
Rename the columns of matrix, so they make sense, and bind the existing data frame with the new matrix
colnames(nm)<-c("x2","y2","id.dest")
newds<-cbind(dataset,as.data.frame(nm))
Remove duplicated trajectories:
newds1<-newds[-which(newds$id==newds$id.dest),]
converting destination x & y to numeric from factor
newds1$x2<-as.numeric(as.character(newds1$x2)) #converting from factor to numeric
newds1$y2<-as.numeric(as.character(newds1$y2))
plotting the destination points . . .same as the origin points
plot(newds1$x, newds1$y)
plot(newds1$x2, newds1$y2, col="red")
####*
converting the begin and end points to spatial lines
raw list to store Lines objects
l <- vector("list", nrow(newds1)) #
this l is now an empty vector w/ number of rows defined by length (nrow) of newds1
splitting origin and destination coordinates so I can run this script
origins<-data.frame(cbind(newds1$x, newds1$y))
destinations<-data.frame(cbind(newds1$x2, newds1$y2))
library(sp)
for (i in seq_along(l)) {
l[[i]] <- Lines(list(Line(rbind(origins[i, ], destinations[i,]))), as.character(i))
}
l.spatial<-SpatialLines(l)
plot(l.spatial, add=T)
The object newds1 contains both the startpoint and endpoint. However, the final SpatialLines that are created from those start and endpoints (l.spatial) do not contain a reference to the start and endpoints. I would like those SpatialLines to contain two "attribute" columns that refer to the IDs of the start and endpoints. I think this is a matter of binding newds1 (a dataframe) onto l.spatial (spatial lines), but the code I'm running doesn't seem to do it.
Attempting to spatially bind the start and end IDs to l.spatial
row.names(newds1)<-1:length(newds1$id) #renaming rows in the dataframe so they match the spatial object
id<-newds1$id
newds2<-spCbind(l.spatial, id)
I get
"Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘spCbind’ for signature ‘"SpatialLines", "factor"’"
In short:
newds2 <- SpatialLinesDataFrame(l.spatial, newds1, match.ID = FALSE)
## (or you can use the rownames of newds1 in the lines loop)
There are 42 distinct lines, made by matching every coordinate to each other coordinate once.
FWIW, you don't need to convert from factor to numeric:
x <- c(2,4,6,3,7,9,1)
y <- c(6,4,8,2,9,6,1)
id <- c("a","b","c","d","e","f","g")
## don't coerce to character in the first place: cbind(x, y, id) *must* be
## character, and then data.frame converts characters to factors
dataset <- data.frame(x = x, y = y, id = id)
There are other ways to simplify your task, but here's a reasonably straightforward way (I think this is what you are after):
x <- c(2,4,6,3,7,9,1)
y <- c(6,4,8,2,9,6,1)
id<-c("a","b","c","d","e","f","g")
## don't coerce to character in the first place: cbind(x, y, id) must be character,
## and then data.frame converts characters to factors by default
dataset<-data.frame(x = x, y = y, id = id)
l <- vector("list", nrow(dataset) * (nrow(dataset) - 1))
origID <- destID <- character(length(l))
##xy <- as.matrix(dataset[, c("x", "y")])
cnt <- 0
for (i in seq(nrow(dataset))) {
pt0 <- as.matrix(dataset[i, c("x", "y") ])
pts <- dataset[-i, ]
for (j in seq(nrow(pts))) {
cnt <- cnt + 1
l[[cnt]] <- Lines(list(Line(rbind(pt0, as.matrix(pts[j, c("x", "y")])))), as.character(cnt))
destID[cnt] <- pts$id[j]
origID[cnt] <- dataset$id[i]
}
}
x <- SpatialLinesDataFrame(SpatialLines(l), data.frame(dest = destID, orig = origID, row.names = as.character(1:cnt)))
Pick out one line and investigate:
itest <- 10
## so for example
as.data.frame(x[itest, ])
index <- c(x$orig[itest], x$dest[itest])
plot(x)
plot(x[itest, ], lwd = 4, add = TRUE)
lines(dataset[index, c("x", "y")], col = "firebrick", lwd = 2)
text(dataset[index, c("x", "y")], label = dataset$id[index], col = "dodgerblue", cex = 4)
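As an aside, the table of origin/destination pairs that the question builds with replicate() and rbind() can also be sketched with expand.grid() in base R. This only builds the attribute table, not the SpatialLines themselves; the names idx and od are mine, and stringsAsFactors = FALSE is set explicitly so the IDs stay character regardless of R version:

```r
x <- c(2, 4, 6, 3, 7, 9, 1)
y <- c(6, 4, 8, 2, 9, 6, 1)
id <- c("a", "b", "c", "d", "e", "f", "g")
dataset <- data.frame(x = x, y = y, id = id, stringsAsFactors = FALSE)

# All ordered (origin, destination) row pairs, minus the self-pairs
idx <- expand.grid(orig = seq_len(nrow(dataset)),
                   dest = seq_len(nrow(dataset)))
idx <- idx[idx$orig != idx$dest, ]

# One row per line: both endpoint IDs plus both sets of coordinates
od <- data.frame(
  orig = dataset$id[idx$orig], dest = dataset$id[idx$dest],
  x1 = dataset$x[idx$orig], y1 = dataset$y[idx$orig],
  x2 = dataset$x[idx$dest], y2 = dataset$y[idx$dest],
  stringsAsFactors = FALSE
)
```

od has 7 * 6 = 42 rows, one per directed line, which matches the 42 lines mentioned above. Its rows could serve as the data argument of SpatialLinesDataFrame, provided the Lines objects are created in the same row order.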

Splitting a data set using two parameters and saving the sub-data sets in a list

I am trying to split my data set using two parameters, the fraction of missing values and "maf", and store the sub-data sets in a list. Here is what I have done (it's not working). Any help will be appreciated,
Thanks.
library(BLR)
library(missForest)
data(wheat)
X2<- prodNA(X, 0.4) ### creating missing values
dim(X2)
fd<-t(X2)
MAF<-function(geno){ ## markers are in the rows
geno[(geno!=0) & (geno!=1) & (geno!=-1)] <- NA
geno <- as.matrix(geno)
## calc_Freq for alleles
n0 <- apply(geno==0,1,sum,na.rm=T)
n1 <- apply(geno==1,1,sum,na.rm=T)
n2 <- apply(geno==-1,1,sum,na.rm=T)
n <- n0 + n1 + n2
## calculate allele frequencies
p <- ((2*n0)+n1)/(2*n)
q <- 1 - p
maf <- pmin(p, q)
maf}
frac.missing <- apply(fd,1,function(z){length(which(is.na(z)))/length(z)})
maf<-MAF(fd)
lst<-matrix()
for (i in seq(0.2,0.7,by =0.2)){
for (j in seq(0,0.2,by =0.005)){
lst=fd[(maf>j)|(frac.missing < i),]
}}
It sounds like you want the results that the split function provides.
If you have a vector "frac.missing", and "maf" is defined on the basis of values in "fd" (and has the same length as the number of rows in "fd"), then this would provide the split you are looking for:
spl.fd <- split(fd, list(maf, frac.missing) )
If you want to group the fd values based on MAF(fd) and frac.missing within the bands specified by your for-loop, then the same split construct may do what your current code is failing to accomplish:
lst <- split( fd, list(cut(MAF(fd), breaks = seq(0,0.2,by =0.005) ,
include.lowest=TRUE),
cut(frac.missing, breaks = seq(0.2,0.7,by =0.2),
right=TRUE,include.lowest=TRUE)
)
)
The right argument accommodates the desire to have the splits based on a "<" operator whereas the default operation of cut presumes a ">" comparison against the 'breaks'. The other function that provides similar facility is by.
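One caveat worth illustrating: split() applied to a matrix treats it as a plain vector, so if whole rows (markers) should stay together, one option is to split the row indices and subset the matrix afterwards. The sketch below uses made-up maf and frac.missing values and coarser breaks than the question, purely for illustration:

```r
# Toy stand-ins: 6 markers (rows) by 3 individuals (columns)
fd <- matrix(seq_len(18), nrow = 6,
             dimnames = list(paste0("m", 1:6), NULL))
maf <- c(0.01, 0.06, 0.12, 0.18, 0.03, 0.15)
frac.missing <- c(0.25, 0.35, 0.55, 0.45, 0.30, 0.50)

# Bin each marker by its maf and by its fraction of missing values
maf_band <- cut(maf, breaks = seq(0, 0.2, by = 0.1), include.lowest = TRUE)
fm_band <- cut(frac.missing, breaks = c(0.2, 0.4, 0.6), include.lowest = TRUE)

# Split the row numbers, dropping empty band combinations
groups <- split(seq_len(nrow(fd)), list(maf_band, fm_band), drop = TRUE)

# Turn each group of row numbers back into a sub-matrix
sub_fd <- lapply(groups, function(rows) fd[rows, , drop = FALSE])
```

sub_fd is then a named list of sub-matrices, one per non-empty (maf band, missing band) combination. Unlike the nested for-loop with cumulative thresholds, these bands do not overlap.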
The code below gives me exactly what I need:
Y<-t(GBS.binary)
nn<-colnames(Y)
fd<-Y
maf<-as.matrix(MAF(Y))
dff<-cbind(frac.missing,maf,Y)
colnames(dff)<-c("fm","maf",nn)
dff<-as.data.frame(dff)
for (i in seq(0.1,0.6,by=0.1)) {
for (j in seq(0,0.2,by=0.005)){
assign(paste("fm_",i,"maf_",j,sep=""),
(subset(dff, maf>j & fm <i))[,-c(1,2)])
} }
