To investigate the distribution of pixel values in an image, I want to compute a Grey-Level Co-Occurrence Matrix (GLCM) for entire images (no sliding/moving windows). The idea is to obtain a single value per image for each statistic ("mean", "variance", "homogeneity", "contrast", "dissimilarity", "entropy", "second_moment", "correlation"), so the images can be compared with each other regarding their distribution of pixel values.
e.g.:
image 1:
0 0 0 0
0 0 0 1
0 0 1 1
0 1 1 1
image 2:
1 0 0 1
0 1 0 0
0 0 1 0
1 0 0 1
image 3:
1 1 1 0
1 1 0 0
1 0 0 0
0 0 0 0
All three of these images have the same summary statistics (mean, max, min, …), yet the distribution of the pixel values is completely different. To find a measure that describes this difference, I want to compute the GLCMs for each of these images.
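To make that claim concrete, here is a quick sketch (the matrix names m1, m2, m3 are just for illustration) that types the three example images into R and checks that their global statistics agree:
# the three example images from above
m1 <- matrix(c(0,0,0,0,
               0,0,0,1,
               0,0,1,1,
               0,1,1,1), nrow = 4, byrow = TRUE)
m2 <- matrix(c(1,0,0,1,
               0,1,0,0,
               0,0,1,0,
               1,0,0,1), nrow = 4, byrow = TRUE)
m3 <- matrix(c(1,1,1,0,
               1,1,0,0,
               1,0,0,0,
               0,0,0,0), nrow = 4, byrow = TRUE)
# identical global statistics: mean 0.375, min 0, max 1 for all three
sapply(list(m1, m2, m3), function(m) c(mean = mean(m), min = min(m), max = max(m)))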
So far I have been using the package "glcm", a fantastic package for texture analysis by Alex Zvoleff. Unfortunately it only seems possible to use it with a sliding/moving window… But since I want one single value per image for each statistical measure, that does not seem to work for me. Is there anyone who can help an R rookie like me out with that? :)
install.packages("glcm")
library(glcm)
# install and load package "glcm"
# see URL:http://azvoleff.com/articles/calculating-image-textures-with-glcm/
values <- seq(1, c(12*12), 1)
values_mtx <- matrix(data = values, nrow = 12, ncol = 12, byrow = TRUE)
# create an "image"
values_mtx_small <- values_mtx[-12, -12]
# since you have to use a sliding/moving window in glcm::glcm(), give the image
# ...an odd number of rows and cols by deleting the last row and last column
values_raster_small <- raster(values_mtx_small)
# create rasterlayer-object
values_textures <- glcm::glcm(values_raster_small,
                              window = c(nrow(values_raster_small) - 2,
                                         ncol(values_raster_small) - 2),
                              shift = list(c(0,1), c(1,1), c(1,0), c(1,-1)),
                              statistics = c("mean", "variance", "homogeneity",
                                             "contrast", "dissimilarity", "entropy",
                                             "second_moment", "correlation"),
                              min_x = NULL, max_x = NULL,
                              na_opt = "ignore", na_val = NA, asinteger = FALSE)
# compute a GLCM for the image with a maximum size for the moving window to
# ...receive a "measure" for the image
values_textures_mean <- as.matrix(values_textures$glcm_mean)
# extract the calculated GLCM_mean data
values_textures_mean
# get an Output
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] NA NA NA NA NA NA NA NA NA NA NA
[2,] NA NA NA NA NA NA NA NA NA NA NA
[3,] NA NA NA NA NA NA NA NA NA NA NA
[4,] NA NA NA NA NA NA NA NA NA NA NA
[5,] NA NA NA NA NA 0.4589603 NA NA NA NA NA
[6,] NA NA NA NA NA 0.5516493 NA NA NA NA NA
[7,] NA NA NA NA NA NA NA NA NA NA NA
[8,] NA NA NA NA NA NA NA NA NA NA NA
[9,] NA NA NA NA NA NA NA NA NA NA NA
[10,] NA NA NA NA NA NA NA NA NA NA NA
[11,] NA NA NA NA NA NA NA NA NA NA NA
# unfortunately only two numbers are left as a "measure"…
My R package GLCMTextures is mainly meant to deal with spatial raster data like glcm, but it should be able to do this too. You'll have to tabulate a GLCM for each of the four shifts [c(1, 0), c(1, 1), c(0, 1), c(-1, 1)] individually and then average each texture metric across the shifts to get directionally invariant measures (see the averaging sketch after the example below).
library(GLCMTextures)
library(raster)
# create an "image"
values_mtx <- matrix(data = seq(1, 12*12, 1), nrow = 12, ncol = 12, byrow = TRUE)
values_mtx_raster <- raster(values_mtx) # make it a raster
values_mtx_raster_quantized <- quantize_raster(values_mtx_raster, n_levels = 32, method = "equal prob") # make values integers from 0-31
plot(values_mtx_raster_quantized)
text(values_mtx_raster_quantized)
values_mtx_quantized <- as.matrix(values_mtx_raster_quantized) # make it a matrix
glcm_10 <- make_glcm(values_mtx_quantized, n_levels = 32, shift = c(1, 0), na.rm = FALSE, normalize = TRUE) # tabulate GLCM with xshift = 1, yshift = 0 (i.e. pixel to the right)
glcm_metrics(glcm_10)
# glcm_contrast glcm_dissimilarity glcm_homogeneity glcm_ASM glcm_entropy glcm_mean glcm_variance glcm_correlation
# 0.21212121 0.21212121 0.89393939 0.02100551 4.08422180 15.50000000 84.25000000 0.99874112
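To get the directionally invariant values mentioned above, one GLCM can be tabulated per shift and the metrics averaged. A minimal sketch, assuming glcm_metrics() returns the named numeric vector shown in the output above:
# tabulate one GLCM per shift and average the metrics across shifts
shifts <- list(c(1, 0), c(1, 1), c(0, 1), c(-1, 1))
metrics_by_shift <- sapply(shifts, function(s) {
  glcm_metrics(make_glcm(values_mtx_quantized, n_levels = 32, shift = s,
                         na.rm = FALSE, normalize = TRUE))
})
rowMeans(metrics_by_shift) # one value per texture metric for the whole image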
This suggestion might provide the tools needed to get at the answer through the package EBImage. The complete answer would likely require applying additional data reduction techniques and statistical analysis to the results of the textural analysis demonstrated here.
# EBImage needed through Bioconductor, which uses BiocManager
if (!require(EBImage)) {
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install("EBImage")
  library(EBImage)
}
For EBImage, a binary mask is required to define the objects for subsequent analysis. In this case the entire image (array) serves as the object of analysis, so a binary mask covering the entire image is created and copies of it are modified to replicate the example.
# Create three 32 x 32 images similar to the example
mask <- Image(1, dim = c(32, 32))
img1 <- img2 <- img3 <- mask
img1[upper.tri(img1)] <- 0
nzero <- sum(img1 == 0)
img2[sample(32*32, nzero)] <- 0
img3[lower.tri(img3)] <- 0
# Combine three images into a single 32 x 32 x 3 array for simplicity
img <- combine(img1, img2, img3)
# Verify similarity of global properties of each image
apply(img, 3, mean)
> [1] 0.515625 0.515625 0.515625
apply(img, 3, sd)
> [1] 0.5 0.5 0.5
The Haralick features are rotationally invariant textural properties computed from the gray-level co-occurrence matrix. The parameter haralick.scales specifies the expected repeat scale of the textural patterns; the default, c(1, 2), looks for repeats every 1 and 2 pixels. Here I limit it to 1 pixel.
I have to admit that I use it without fully understanding it. One helpful resource may be a post by Earl Glynn. Also, a question answered on the Bioconductor support forum about computing Haralick features provides great information that's hard to find elsewhere.
# Introduce and apply the computeFeatures.haralick function at a scale of 1
# The first line simply captures the names and properties of the features
props <- computeFeatures.haralick(properties = TRUE, haralick.scales = 1)
# Apply computeFeatures.haralick to each of the 3 dimensions (frames)
m <- sapply(getFrames(img),
function(ref) computeFeatures.haralick(mask, ref, haralick.scales = 1))
# Add meaningful row and column names to the resulting matrix
rownames(m) <- props$name
colnames(m) <- paste0("img", 1:3)
print(round(m, 4))
> img1 img2 img3
> h.asm.s1 0.4702 0.2504 0.4692
> h.con.s1 30.7417 480.7583 30.7417
> h.cor.s1 0.9359 -0.0013 0.9360
> h.var.s1 240.6937 241.0729 241.1896
> h.idm.s1 0.9680 0.5003 0.9680
> h.sav.s1 34.4917 33.8417 33.4917
> h.sva.s1 2093.5247 1594.4603 2028.1987
> h.sen.s1 0.3524 0.4511 0.3528
> h.ent.s1 0.3620 0.6017 0.3625
> h.dva.s1 0.0000 0.0000 0.0000
> h.den.s1 0.0137 0.1506 0.0137
> h.f12.s1 0.7954 0.0000 0.7957
> h.f13.s1 0.6165 0.0008 0.6169
Here I use a heatmap to visualize and organize the 13 Haralick parameters. The plot shows pretty clearly that images 1 and 3 are rather similar and quite different from image 2. Still, differences between images 1 and 3 can be seen.
The matrix used for this heatmap, especially if it were generated from many more images, could be scaled and further analyzed by principal components analysis to identify related images (a sketch follows the heatmap call below).
heatmap(m)
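As a rough sketch of that idea (not part of the original analysis): drop zero-variance features such as h.dva.s1 so that scaling works, then run prcomp with the images as observations. With only three images this is purely illustrative:
# PCA on the Haralick feature matrix m (features in rows, images in columns)
keep <- apply(m, 1, sd) > 0           # drop constant features (e.g. h.dva.s1)
pca <- prcomp(t(m[keep, ]), center = TRUE, scale. = TRUE)
summary(pca)                          # variance explained per component
pca$x[, 1:2]                          # image scores on the first two components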
To learn more about EBImage, see the package vignette.
Related
I am testing the impact of missing data on regression analysis. So, using a simulated dataset, I want to randomly remove a proportion of observations (not entire rows) from a designated set of columns. I am using sample to do this. Unfortunately, some columns end up with many more missing values than others. See the example below:
#Data frame with 5 columns, 10 rows
DF = data.frame(A = paste(letters[1:10]),B = rnorm(10, 1, 10), C = rnorm(10, 1, 10), D = rnorm(10, 1, 10), E = rnorm(10,1,10))
#Function to randomly delete a proportion (ProportionRemove) of records per column, for a designated set of columns (ColumnStart - ColumnEnd)
RandomSample = function(DataFrame,ColumnStart, ColumnEnd,ProportionRemove){
#ci is the opposite of the proportion
ci = 1-ProportionRemove
Missing = sapply(DataFrame[(ColumnStart:ColumnEnd)], function(x) x[sample(c(TRUE, NA), prob = c(ci,ProportionRemove), size = length(DataFrame), replace = TRUE)])}
#Randomly sample column 2 - 5 within DF, deleting 80% of the observation per column
Test = RandomSample(DF, 2, 5, 0.8)
I understand there is an element of randomness to this, but in 10 trials (10*4 = 40 columns), 17 of the columns had no data, and in one trial, one column still had 6 records (rather than the expected ~2) - see below.
B C D E
[1,] NA 24.004402 7.201558 NA
[2,] NA NA NA NA
[3,] NA 4.029659 NA NA
[4,] NA NA NA NA
[5,] NA 29.377632 NA NA
[6,] NA 3.340918 -2.131747 NA
[7,] NA NA NA NA
[8,] NA 15.967318 NA NA
[9,] NA NA NA NA
[10,] NA -8.078221 NA NA
In summary, I want to replace a proportion of observations with NAs in each column.
Any help is greatly appreciated!!!
This makes sense to me. As #Frank suggested (in a since-deleted comment ... *sigh*), "randomness" can give you really non-random-looking results (Dilbert: Tour of Accounting, 2001-10-25).
If you want random samples with guaranteed ratios, try this:
guaranteedSampling <- function(DataFrame, ProportionRemove) {
  # number of values to blank out per column (at least 1)
  n <- max(1L, floor(nrow(DataFrame) * ProportionRemove))
  # draw a separate set of row indices for every column
  inds <- replicate(ncol(DataFrame), sample(nrow(DataFrame), size=n), simplify=FALSE)
  # set the chosen cells in each column to NA
  DataFrame[] <- mapply(`[<-`, DataFrame, inds, MoreArgs=list(NA), SIMPLIFY=FALSE)
  DataFrame
}
set.seed(2)
guaranteedSampling(DF[2:5], 0.8)
# B C D E
# 1 NA NA NA NA
# 2 NA NA NA NA
# 3 NA NA NA NA
# 4 6.792463 10.582938 NA NA
# 5 NA NA -0.612816 NA
# 6 NA -2.278758 NA NA
# 7 NA NA NA 2.245884
# 8 NA NA NA 5.993387
# 9 7.863310 NA 9.042127 NA
# 10 NA NA NA NA
Further to #joran's comment, you wanted either nrow(DataFrame) or length(x).
The specific impact in your example is that you are producing a vector with 5 elements (because DF has 5 variables) each with 0.8 probability of being NA and 0.2 of being TRUE.
Then this statement (which is what the sapply does to each column you specify; here I'm applying it to DF$B only):
DF$B[sample(c(TRUE, NA), prob=c(0.2, 0.8), size = 5, replace=TRUE)]
does something that isn't immediately obvious to the uninitiated*. This:
sample(c(TRUE, NA), prob=c(0.2, 0.8), size = 5, replace=TRUE)
gives a logical vector, which, when used to extract elements of a vector, is silently recycled. So let's say you end up with:
NA TRUE NA TRUE NA
When you subset DF$B you end up getting this:
DF$B[c(NA, TRUE, NA, TRUE, NA, NA, TRUE, NA, TRUE, NA)]
Notice in your example how the top 5 numbers always follow the same pattern as the bottom 5 numbers. This explains why so many columns ended up being all NA: there is a 0.32768 (= 0.8^5) probability of drawing 5 NAs out of 5, and that pattern then gets recycled across the whole column.
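Both points are easy to verify in a couple of lines (x here is just a throwaway vector):
x <- 1:10
x[c(NA, TRUE, NA, TRUE, NA)]  # the 5-element logical index is recycled to cover all 10 elements
0.8^5                         # 0.32768, the chance that all 5 draws are NA and wipe the column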
The other issue with your code is that the function only returns its result because the value of the final assignment is returned invisibly; it is better to return the result explicitly. Here it is corrected and cleaned up following the style guide at http://adv-r.had.co.nz/Style.html:
random_sample <- function(x, col_start, col_end, p) {
  # for each selected column, draw a fresh length(y) logical index
  # where TRUE keeps the value and NA blanks it
  sapply(x[col_start:col_end],
         function(y) y[sample(c(TRUE, NA), prob = c(1-p, p), size = length(y), replace = TRUE)])
}
*The uninitiated in this case includes me! I had no idea that logical vectors were recycled when used to extract until having a look at this question.
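As a quick sanity check of the corrected function (a sketch only, reusing DF from the question), the per-column NA counts should now vary around 8 of 10 instead of collapsing into the all-or-nothing pattern:
set.seed(1)
Test <- random_sample(DF, 2, 5, 0.8)
colSums(is.na(Test))  # roughly 8 NAs per column, varying binomially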
This is not a regex problem.
I am trying to correct for lookahead bias in data, basically to move the values up by 1. This is what I came up with. Does anyone have a better/faster/ built-in method to do this?
d<-c(1,2,3,4)
#correct for lookahead bias, move values up by 1
e<-d[-c(1)]
length(e)<-length(d)
cbind(d,e)
> cbind(d,e)
d e
[1,] 1 2
[2,] 2 3
[3,] 3 4
[4,] 4 NA
There are a couple of ways I can think of to do this. Both are fairly concise one-liners:
base
cbind(d, c(d[-1], NA))
data.table
rev(data.table::shift(rev(d), 1))
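If data.table is available anyway, the same thing can be written without the double rev by asking shift() for a lead rather than the default lag (this assumes a reasonably recent data.table that supports type = "lead"):
data.table::shift(d, 1, type = "lead")  # c(2, 3, 4, NA)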
If we want to write it as a function, we can do that too. Note that this function does not attempt any error handling.
shift_up <- function(x, n = 1) c(x[-(1:n)], rep(NA, n))
Which is very useful for fans of the comic series Batman:
d <- 1:16
shift_up(d, 16)
# [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA #BATMAN!
I am looking for a more efficient way (in terms of length of code) of converting a data.frame from:
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1 1 2 3 NA NA NA NA NA NA
# 2 NA NA NA 3 2 1 NA NA NA
# 3 NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA 1 2 3
to
# [,1] [,2] [,3]
#[1,] 1 2 3
#[2,] 3 2 1
#[3,] NA NA NA
#[4,] NA NA NA
#[5,] 1 2 3
That is, I want to remove excess NAs but correctly represent rows with only NAs.
I wrote the following function which does the job, but I am sure there is a less lengthy way of achieving the same.
#Dummy data.frame
data <- matrix(c(1:3, rep(NA, 6),
rep(NA, 3), 3:1, rep(NA, 3),
rep(NA, 9),
rep(NA, 9),
rep(NA, 6), 1:3),
byrow=TRUE, ncol=9)
data <- as.data.frame(data)
sieve <- function(data) {
#get a list of all entries that are not NA
cond <- apply(data, 1, function(x) x[!is.na(x)])
#set integer(0) equal to NA
cond[sapply(cond, function(x) length(x)==0)] <- NA
#check how many items there are in non-empty rows
#(rows are either empty or contain the same number of items)
n <- max(sapply(cond, length))
#replace single NA with n NAs, where n = number of items
#first get an index of entries with single NAs
index <- (1:length(cond)) [sapply(cond, function(x) length(x)==1)]
#then replace each entry with n NAs
for (i in index) cond[[i]] <- rep(NA, n)
#turn list into a data.frame
cond <- matrix(unlist(cond), nrow=length(cond), byrow=TRUE)
cond
}
sieve(data)
My question resembles this question about extracting conditions to which participants are assigned (for which I received great answers). I tried expanding these answers to the current dummy data, but without success so far. Hence my rather lengthy custom function.
Edit: Additional info on why I am asking this question: The first data frame represents the raw output from an experiment in which I assigned participants to one of three conditions (using 3 here for simplicity). In each condition, participants read a different scenario, but then answered the same set of questions about the scenario they had read. Qualtrics recorded answers from participants in the first condition in columns V1 through V3, answers from participants in the second condition in columns V4 through V6, and answers from participants in the third condition in columns V7 through V9. (If this block had contained 4 questions, it would have been columns V1 through V4 for the first condition, V5 through V8 for the second condition, and so on.)
You can try this if the length of non-NAs is always the same in rows that aren't entirely filled with NA:
First, create a data frame with the appropriate (transposed) dimensions, and fill it with NAs.
d <- data  # the dummy data.frame from the question
d2 <- data.frame(
  matrix(nrow = max(apply(d, 1, function(ii) sum(!is.na(ii)))),
         ncol = nrow(d)))
Then fill that data frame using apply, and transpose it to get your desired outcome:
d2[] <- apply(d, 1, function(ii) ii[!is.na(ii)])
t(d2)
# [,1] [,2] [,3]
#X1 1 2 3
#X2 3 2 1
#X3 NA NA NA
#X4 NA NA NA
#X5 1 2 3
I have this vector
vec <- c(NA, 1, 2, 3, 4, NA)
for which I wish to calculate a rolling mean with a window of size 3, aligned to the right (that is, if I understand correctly, looking backwards).
The expected rolling mean of my vector would be
# [1] NA NA NA 2 3 NA
and yet if I do
rollmean(vec, 3, align='right', fill=NA)
I obtain
# [1] NA NA NA NA NA NA
You can use rollapply instead:
rollapply(vec, 3, mean, fill = NA, align = "right")
[1] NA NA NA 2 3 NA
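As far as I can tell from the zoo documentation, rollmean simply does not handle inputs containing NAs, which is why the whole result collapses to NA, whereas rollapply evaluates mean() window by window. Two variations, sketched under that assumption:
library(zoo)
rollapplyr(vec, 3, mean, fill = NA)               # rollapplyr() is the align = "right" shorthand
rollapplyr(vec, 3, mean, na.rm = TRUE, fill = NA) # average whatever non-NA values fall in each window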
I have a table with a lot of columns and I want to remove the columns having more than 500 missing values.
I already know the number of missing values per column from:
library(fields)
t(stats(mm))
I got:
N mean Std.Dev. min Q1 median Q3 max missing values
V1 1600 8.67 … 400
Some columns exhibit NA for all the characteristics:
N mean Std.Dev. min Q1 median Q3 max missing values
V50 NA NA NA NA NA NA
I also want to remove these kind of columns.
Here is a one-liner that keeps only the columns with at most 500 missing values: mm[colSums(is.na(mm)) <= 500]
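A tiny sketch of the same idea on made-up data (mm_toy is hypothetical, and the threshold is lowered to 1 so the effect is visible):
mm_toy <- data.frame(a = c(1, 2, 3), b = c(NA, NA, 3), c = c(NA, 5, 6))
mm_toy[colSums(is.na(mm_toy)) <= 1]  # drops b (2 NAs), keeps a and c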
If you store the results of the stats call like this:
tmpres<-t(stats(mm))
You can do something like:
whichcolsneedtogo<-apply(tmpres, 1, function(currow){all(is.na(currow)) || (currow["missing values"] > 500)})
Finally:
mmclean<-mm[!whichcolsneedtogo]
Of course this is untested, as you have not provided data to reproduce your example.
Another potential solution (works especially well with data frames):
data[, !sapply(data, function(x) sum(is.na(x)) > 500)]
rem <- NULL
for (col.nr in 1:dim(data)[2]) {
  # flag the column if it has more than 500 NAs or consists entirely of NAs
  if (sum(is.na(data[, col.nr])) > 500 | all(is.na(data[, col.nr]))) {
    rem <- c(rem, col.nr)
  }
}
# drop the flagged columns (only if any were flagged)
if (length(rem) > 0) data <- data[, -rem]
Here m is the matrix you are working with. The code below creates a vector, wntg (which stands for "which needs to go"), listing the columns whose total number of NA values is greater than 500; the condition in this comparison can easily be modified to fit your needs. It then makes a new matrix, mr ("m reduced"), from which the columns named in wntg have been removed. In this small example I use a cutoff of more than 2 NAs instead of 500.
wntg<-which(colSums(is.na(m))>2)
mr<-m[,-c(wntg)]
> m<-matrix(c(1,2,3,4,NA,NA,7,8,9,NA,NA,NA), nrow=4, ncol =3)
> m
[,1] [,2] [,3]
[1,] 1 NA 9
[2,] 2 NA NA
[3,] 3 7 NA
[4,] 4 8 NA
> wntg<-which(colSums(is.na(m))>2)
> wntg
[1] 3
> mr<-m[,-c(wntg)]
> mr
[,1] [,2]
[1,] 1 NA
[2,] 2 NA
[3,] 3 7
[4,] 4 8