I have a table with a lot of columns and I want to remove the columns that have more than 500 missing values.
I already know the number of missing values per column with:
library(fields)
t(stats(mm))
I got:
N mean Std.Dev. min Q1 median Q3 max missing values
V1 1600 8.67 … 400
Some columns show NA for all of these statistics:
N mean Std.Dev. min Q1 median Q3 max missing values
V50 NA NA NA NA NA NA
I also want to remove these kinds of columns.
Here is a one-liner to do it: mm[colSums(is.na(mm)) <= 500] keeps only the columns with at most 500 missing values (this also drops the all-NA columns, since they have far more than 500 missing values).
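As a quick check, here is a minimal sketch on made-up data (the data frame mm below is invented for illustration):
set.seed(1)
mm <- data.frame(V1 = rnorm(1000), V2 = rnorm(1000), V3 = rnorm(1000))
mm$V2[sample(1000, 600)] <- NA   # more than 500 missing values
mm$V3[] <- NA                    # entirely missing
mm_clean <- mm[colSums(is.na(mm)) <= 500]
names(mm_clean)
# [1] "V1"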
If you store the results of the stats call like this:
tmpres<-t(stats(mm))
You can do something like:
whichcolsneedtogo <- apply(tmpres, 1, function(currow) {
  all(is.na(currow)) || (currow["missing values"] > 500)
})
Finally:
mmclean<-mm[!whichcolsneedtogo]
Of course this is untested, as you have not provided data to reproduce your example.
Another potential solution (works especially well with data frames), though note that it drops every column that contains any NA, not just those with more than 500:
data[, !sapply(data, function(x) any(is.na(x)))]
rem <- NULL
for (col.nr in 1:ncol(data)) {
  if (sum(is.na(data[, col.nr])) > 500 || all(is.na(data[, col.nr]))) {
    rem <- c(rem, col.nr)
  }
}
# guard against rem being NULL (data[, -NULL] would drop every column)
if (!is.null(rem)) data <- data[, -rem]
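The same rule can also be written without a loop; a minimal sketch, assuming data is a data frame:
na_per_col <- colSums(is.na(data))
keep <- na_per_col <= 500 & na_per_col < nrow(data)   # drop columns with >500 NAs or all NA
data[, keep, drop = FALSE]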
m is the matrix that you are working with. The code below creates a vector, wntg (short for "which needs to go"), that lists the columns whose total number of NA values is greater than 500. The condition in this comparison can easily be modified to fit your needs. Then make a new matrix, which I call mr (short for "m reduced"), with the columns listed in wntg removed. In this simple example I show the case where you want to exclude columns with more than 2 NAs:
wntg<-which(colSums(is.na(m))>2)
mr<-m[,-c(wntg)]
> m<-matrix(c(1,2,3,4,NA,NA,7,8,9,NA,NA,NA), nrow=4, ncol =3)
> m
[,1] [,2] [,3]
[1,] 1 NA 9
[2,] 2 NA NA
[3,] 3 7 NA
[4,] 4 8 NA
> wntg<-which(colSums(is.na(m))>2)
> wntg
[1] 3
> mr<-m[,-c(wntg)]
> mr
[,1] [,2]
[1,] 1 NA
[2,] 2 NA
[3,] 3 7
[4,] 4 8
To investigate the distribution of pixel values in an image, I want to compute a Grey-Level Co-Occurrence Matrix (GLCM) for entire images (NO sliding/moving windows). The idea is to obtain a single value (for "mean", "variance", "homogeneity", "contrast", "dissimilarity", "entropy", "second_moment", "correlation") per image, so the images can be compared with each other regarding their distribution of pixel values.
e.g.:
image 1:
0 0 0 0
0 0 0 1
0 0 1 1
0 1 1 1
image 2:
1 0 0 1
0 1 0 0
0 0 1 0
1 0 0 1
image 3:
1 1 1 0
1 1 0 0
1 0 0 0
0 0 0 0
All three of these images have the same summary statistics (mean, max, min, …); nevertheless the distribution of the pixel values is completely different. To find some kind of measure that describes this difference, I want to compute the GLCMs for each of these images.
So far I am using the package "glcm", a fantastic package for texture analysis by Alex Zvoleff. Unfortunately it can only be used with a sliding/moving window… But since I want to obtain one single value per image per statistical measure, it does not seem to help me here... Is there anyone who can help an R rookie like me out with that? :)
install.packages("glcm")
library(glcm)
library(raster)   # for raster() used below
# install and load package "glcm"
# see URL: http://azvoleff.com/articles/calculating-image-textures-with-glcm/
values <- seq(1, c(12*12), 1)
values_mtx <- matrix(data = values, nrow = 12, ncol = 12, byrow = TRUE)
# create an "image"
values_mtx_small <- values_mtx[-12, -12]
# since you have to use a sliding/moving window in glcm::glcm(), give the image
# an odd number of rows and cols by deleting the last row and last column
values_raster_small <- raster(values_mtx_small)
# create rasterlayer-object
values_textures <- glcm::glcm(values_raster_small,
                              window = c(nrow(values_raster_small) - 2, ncol(values_raster_small) - 2),
                              shift = list(c(0,1), c(1,1), c(1,0), c(1,-1)),
                              statistics = c("mean", "variance", "homogeneity", "contrast",
                                             "dissimilarity", "entropy", "second_moment", "correlation"),
                              min_x = NULL, max_x = NULL, na_opt = "ignore", na_val = NA, asinteger = FALSE)
# compute a GLCM for the image with a maximum size for the moving window to
# receive a "measure" for the image
values_textures_mean <- as.matrix(values_textures$glcm_mean)
# extract the calculated GLCM_mean data
values_textures_mean
# get an Output
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] NA NA NA NA NA NA NA NA NA NA NA
[2,] NA NA NA NA NA NA NA NA NA NA NA
[3,] NA NA NA NA NA NA NA NA NA NA NA
[4,] NA NA NA NA NA NA NA NA NA NA NA
[5,] NA NA NA NA NA 0.4589603 NA NA NA NA NA
[6,] NA NA NA NA NA 0.5516493 NA NA NA NA NA
[7,] NA NA NA NA NA NA NA NA NA NA NA
[8,] NA NA NA NA NA NA NA NA NA NA NA
[9,] NA NA NA NA NA NA NA NA NA NA NA
[10,] NA NA NA NA NA NA NA NA NA NA NA
[11,] NA NA NA NA NA NA NA NA NA NA NA
# unfortunately this leaves two numbers as the "measure" rather than one…
My R package GLCMTextures is mainly meant to deal with spatial raster data like glcm, but it should be able to do this too. You'll have to tabulate a GLCM for each of the four shifts [c(1, 0), c(1, 1), c(0, 1), c(-1, 1)] individually and then average the texture metrics of each type to get directionally invariant measures.
library(GLCMTextures)
library(raster)
# create an "image"
values_mtx <- matrix(data = seq(1, c(12*12), 1), nrow = 12, ncol = 12, byrow = TRUE)
values_mtx_raster<- raster(values_mtx) #Make it a raster
values_mtx_raster_quantized<- quantize_raster(values_mtx_raster, n_levels = 32, method = "equal prob") #make values integers from 0-31
plot(values_mtx_raster_quantized)
text(values_mtx_raster_quantized)
values_mtx_quantized<- as.matrix(values_mtx_raster_quantized) #Make it a matrix
glcm_10<- make_glcm(values_mtx_quantized, n_levels = 32, shift = c(1,0), na.rm = FALSE, normalize = TRUE) #tabulate glcm with xshift=1, yshift=0 (i.e. pixel to the right)
glcm_metrics(glcm_10)
# glcm_contrast glcm_dissimilarity glcm_homogeneity glcm_ASM glcm_entropy glcm_mean glcm_variance glcm_correlation
# 0.21212121 0.21212121 0.89393939 0.02100551 4.08422180 15.50000000 84.25000000 0.99874112
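The code above tabulates only the c(1,0) shift. A minimal sketch of the averaging over all four shifts described in the answer, reusing the same make_glcm() and glcm_metrics() calls (the averaging step itself is my own sketch, not part of the original answer):
shifts <- list(c(1, 0), c(1, 1), c(0, 1), c(-1, 1))
metrics_by_shift <- sapply(shifts, function(s) {
  g <- make_glcm(values_mtx_quantized, n_levels = 32, shift = s,
                 na.rm = FALSE, normalize = TRUE)
  glcm_metrics(g)
})
rowMeans(metrics_by_shift)   # direction-invariant average of each texture metric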
This suggestion might provide the tools needed to get at the answer through the package EBImage. The complete answer would likely require applying additional data reduction techniques and statistical analysis to the results from the textural analysis demonstrated here.
# EBImage needed through Bioconductor, which uses BiocManager
if (!require(EBImage)) {
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("EBImage")
library(EBImage)
}
For EBImage, a binary mask is required to define objects for subsequent analysis. In this case, the entire image (array) serves as the object of analysis, so a binary mask covering the entire image is created and copies of it are modified to replicate the example images.
# Create three 32 x 32 images similar to the example
mask <- Image(1, dim = c(32, 32))
img1 <- img2 <- img3 <- mask
img1[upper.tri(img1)] <- 0
nzero <- sum(img1 == 0)
img2[sample(32*32, nzero)] <- 0
img3[lower.tri(img3)] <- 0
# Combine three images into a single 32 x 32 x 3 array for simplicity
img <- combine(img1, img2, img3)
# Verify similarity of global properties of each image
apply(img, 3, mean)
> [1] 0.515625 0.515625 0.515625
apply(img, 3, sd)
> [1] 0.5 0.5 0.5
The Haralick features are rotation-invariant textural properties computed from the gray-level co-occurrence matrix. The parameter haralick.scales is used to specify the expected repeating scale for the textural patterns. The default uses c(1, 2) to look for repeats every 1 and 2 pixels. Here I just limit it to 1 pixel.
I have to admit that I use it without fully understanding it. One helpful resource may be a post by Earl Glynn. Also, a question answered on the Bioconductor support site about computing Haralick features provides great information that's hard to find.
# Introduce and apply the computeFeatures.haralick function at a scale of 1
# The first line simply captures the names and properties of the features
props <- computeFeatures.haralick(properties = TRUE, haralick.scales = 1)
# Apply computeFeatures.haralick to each of the 3 dimensions (frames)
m <- sapply(getFrames(img),
function(ref) computeFeatures.haralick(mask, ref, haralick.scales = 1))
# Add meaningful row and column names to the resulting matrix
rownames(m) <- props$name
colnames(m) <- paste0("img", 1:3)
print(round(m, 4))
> img1 img2 img3
> h.asm.s1 0.4702 0.2504 0.4692
> h.con.s1 30.7417 480.7583 30.7417
> h.cor.s1 0.9359 -0.0013 0.9360
> h.var.s1 240.6937 241.0729 241.1896
> h.idm.s1 0.9680 0.5003 0.9680
> h.sav.s1 34.4917 33.8417 33.4917
> h.sva.s1 2093.5247 1594.4603 2028.1987
> h.sen.s1 0.3524 0.4511 0.3528
> h.ent.s1 0.3620 0.6017 0.3625
> h.dva.s1 0.0000 0.0000 0.0000
> h.den.s1 0.0137 0.1506 0.0137
> h.f12.s1 0.7954 0.0000 0.7957
> h.f13.s1 0.6165 0.0008 0.6169
Here I use a heatmap to visualize and organize the 13 Haralick parameters. The plot pretty clearly shows that images 1 and 3 are rather similar and quite different from image 2. Still, differences between image 1 and 3 can be seen.
The matrix used for this heatmap, especially if it were generated from many more images, could be scaled and further analyzed by principal components analysis to identify related images (a minimal sketch follows the heatmap call below).
heatmap(m)
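As a minimal sketch of that further analysis (not part of the original answer), the feature matrix m could be passed to prcomp() after dropping constant features, since h.dva.s1 has zero variance here and scale. = TRUE cannot handle that:
keep <- apply(m, 1, sd) > 0                 # drop features that do not vary across images
pc <- prcomp(t(m[keep, ]), center = TRUE, scale. = TRUE)
pc$x                                        # image scores on the principal components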
To learn more about EBImage see the package vignette.
When I use mean it works:
mean(b1temp[1, ])
but for standard deviation, it returns NA
sd(b1temp[1, ])
NA
So I modified the call, but it still returns NA:
sd(b1temp[1, ], na.rm=FALSE)
NA
My dataset contains only one row. Is this an issue?
The problem here is incorrect subsetting of the data.frame: when you execute b1temp[1, ] you get only one number, and the standard deviation of a single number is not defined, which is why you get NA.
By default, data in a data.frame are organized by column, not by row. So to apply sd to your data you should subset by column: b1temp[, 1].
Please see the code and simulation below:
b1temp <- data.frame(x = 1:10)
b1temp[1, ]
# [1] 1
mean(b1temp[1, ])
# [1] 1
sd(b1temp[1, ])
# [1] NA
sd(1)
# [1] NA
b1temp[, 1]
# [1] 1 2 3 4 5 6 7 8 9 10
sd(b1temp[, 1])
# [1] 3.02765
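If your data frame really does have just one row spread over many numeric columns, a minimal sketch is to flatten that row into a plain vector before calling sd() (the toy data frame below is invented):
b1temp <- data.frame(a = 1, b = 4, c = 9)   # one row, several numeric columns
sd(unlist(b1temp[1, ]))                      # flatten the row to a vector, then take the sd
# [1] 4.041452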
I need to do a z-normalization on my data (i.e. transform variables to mean = 0 and sd = 1).
I am using the following formula (e.g. scaling mean annual temperature, "MAT"):
sca$MAT <- (sca$MAT - mean(sca$MAT)) / sd(sca$MAT)
But I get NaN values since a few values are missing for this variable. How can I exclude NA values for MAT in the above formula?
PS: I tried to include na.rm=TRUE in the formula but it doesn't work.
A faster way could probably use dplyr as shown here, but I get the same problem.
A fast solution is to use the is.na function to identify the NA elements and then drop them. The commands are the following:
clean <- sca$MAT[!is.na(sca$MAT)]
standardized <- (clean - mean(clean)) / sd(clean)
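Note that clean is shorter than the original column, so it cannot simply be assigned back to sca$MAT. A minimal sketch that standardizes in place while leaving the NAs where they are:
ok <- !is.na(sca$MAT)
sca$MAT[ok] <- (sca$MAT[ok] - mean(sca$MAT[ok])) / sd(sca$MAT[ok])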
scale will exclude NAs for you
x <- c(1:5,NA)
scale(x)
[,1]
[1,] -1.2649111
[2,] -0.6324555
[3,] 0.0000000
[4,] 0.6324555
[5,] 1.2649111
[6,] NA
attr(,"scaled:center")
[1] 3
attr(,"scaled:scale")
[1] 1.581139
so sca$MAT <- scale(sca$MAT) should do what you need.
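One small caveat: scale() returns a one-column matrix with scaling attributes attached, so if you want MAT to remain a plain numeric vector you can wrap the call in as.numeric():
sca$MAT <- as.numeric(scale(sca$MAT))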
Using na.rm=TRUE should work
For example:
> sca <- data.frame(L=LETTERS[1:6], MAT=c(1:5,NA))
> sca
L MAT
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
6 F NA
> sca$MAT <- (sca$MAT - mean(sca$MAT, na.rm=TRUE)) / sd(sca$MAT, na.rm=TRUE)
> sca
L MAT
1 A -1.2649111
2 B -0.6324555
3 C 0.0000000
4 D 0.6324555
5 E 1.2649111
6 F NA
This gives the same results as Glen_b's use of scale.
I currently have a dataframe that looks like this:
speed <- c(61,24,3,10,18,19,12,12,7,9)
distance <-c(58,111,92,93,84,103,83,93,88,81)
df <- as.data.frame(cbind(speed, distance))
What I would like is to sort my speed variable into different columns based on their distance value. For example, for the example dataframe I would like it to look like this:
under50 <- rep(NA,10)
under100<- c(61,3,10,18,12,12,7,9,NA,NA)
under150 <- c(61,24,3,10,18,19,12,12,7,9)
df2 <- as.data.frame(cbind(under50, under100, under150))
I would like it to be as automated as possible since I have 23 data frames, each with 100+ rows, but I am not sure where to start. Any help would be much appreciated!
So here is yet another way:
breaks=c(50,100,150)
result <- data.frame(sapply(breaks,function(x)with(df,ifelse(distance<x,speed,NA))))
result <- sapply(result,function(x)c(na.omit(x),rep(NA,sum(is.na(x)))))
colnames(result) <- paste0("under",breaks)
result
# under50 under100 under150
# [1,] NA 61 61
# [2,] NA 3 24
# [3,] NA 10 3
# [4,] NA 18 10
# [5,] NA 12 18
# [6,] NA 12 19
# [7,] NA 7 12
# [8,] NA 9 12
# [9,] NA NA 7
# [10,] NA NA 9
The line:
result <- data.frame(sapply(breaks,function(x)with(df,ifelse(distance<x,speed,NA))))
takes advantage of the ifelse(...) function to return speed or NA depending on the value of distance. The line:
result <- sapply(result,function(x)c(na.omit(x),rep(NA,sum(is.na(x)))))
moves the NA's to the end.
You can use logical vectors to select elements, which lets you pick out elements of the speed vector by doing a logical check on all elements of the distance vector.
under50 <- speed[distance<50]
This can be extended to multiple distance cutoffs; because fewer speeds fall under the smaller cutoffs, each column is padded with NAs to a common length:
cutoffs <- c(50, 100, 150)
under <- matrix(nrow = length(distance), ncol = length(cutoffs))
for (i in 1:length(cutoffs)) {
  vals <- speed[distance < cutoffs[i]]
  under[, i] <- c(vals, rep(NA, length(distance) - length(vals)))  # pad with NAs
}
You could again extend this to multiple data.frames. I haven't actually tested the above loop, and if you have many (large) data.frames, loops could get slow.
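As a minimal sketch of the multiple-data.frame case (the helper bin_speeds() and the list dfs are invented names, and the padding mirrors the desired output in the question):
bin_speeds <- function(df, breaks = c(50, 100, 150)) {
  res <- sapply(breaks, function(x) {
    vals <- df$speed[df$distance < x]
    c(vals, rep(NA, nrow(df) - length(vals)))   # pad with NAs to a common length
  })
  colnames(res) <- paste0("under", breaks)
  as.data.frame(res)
}
dfs <- list(df)                 # in practice, a list of all 23 data frames
results <- lapply(dfs, bin_speeds)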
I get flummoxed by some of the simplest of things. In the following code I wanted to extract just a portion of one column in a data.frame called 'a'. I get the right values, but the final entity is padded with NAs which I don't want. 'b' is the extracted column, 'c' is the correct portion of data but has extra NA padding at the end.
How do I best do this so that 'c' naturally ends up only 9 elements long (i.e. the 15 original elements minus the 6 I skipped)?
NumBars = 6
a = as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))
a[,2] = c(11,12,13,14,15,16,17,18,19,20,21,22,23,24,25)
names(a)[1] = "Data1"
names(a)[2] = "Data2"
# Use 1st column of data only
b = as.matrix(a[,1])
c = as.matrix(b[NumBars+1:length(b)])
The immediate reason why you're getting NA's is that the sequence operator : takes precedence over the addition operator +, as is detailed in the R Language Definition. Therefore NumBars+1:length(b) is not the same as (NumBars+1):length(b). The first adds NumBars to the vector 1:length(b), while the second adds first and then takes the sequence.
ind.1 <- 1+1:3 # == 2:4
ind.2 <- (1+1):3 # == 2:3
When you index with this longer vector, you get all the elements you want, and you also are asking for entries like b[length(b)+1], which the R Language Definition tells us returns NA. That's why you have trailing NA's.
If i is positive and exceeds length(x) then the corresponding
selection is NA. A negative out of bounds value for i causes an error.
b <- c(1,2,3)
b[ind.1]
#[1] 2 3 NA
b[ind.2]
#[1] 2 3
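Applied to the code in the question, adding the parentheses gives the intended 9 elements:
c = as.matrix(b[(NumBars + 1):length(b)])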
From a design perspective, the other solutions listed here are good choices to help avoid this mistake.
It is often easier to think of what you want to remove from your vector / matrix. Use negative subscripts to remove items.
c = as.matrix(b[-1:-NumBars])
c
## [,1]
## [1,] 7
## [2,] 8
## [3,] 9
## [4,] 10
## [5,] 11
## [6,] 12
## [7,] 13
## [8,] 14
## [9,] 15
If your goal is to remove NAs from a column, you can also do something like
c <- na.omit(a[,1])
E.g.
> x
[1] 1 2 3 NA NA
> na.omit(x)
[1] 1 2 3
attr(,"na.action")
[1] 4 5
attr(,"class")
[1] "omit"
You can ignore the attributes - they are there to let you know what elements were removed.
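If you would rather not see those attributes at all, wrapping the result in as.vector() (or c()) drops them:
as.vector(na.omit(x))
# [1] 1 2 3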