I am testing the impact of missing data on regression analysis. Using a simulated dataset, I want to randomly remove a proportion of observations (not entire rows) from a designated set of columns. I am using 'sample' to do this. Unfortunately, some columns end up with many more missing values than others. See an example below:
#Data frame with 5 columns, 10 rows
DF = data.frame(A = paste(letters[1:10]),B = rnorm(10, 1, 10), C = rnorm(10, 1, 10), D = rnorm(10, 1, 10), E = rnorm(10,1,10))
#Function to randomly delete a proportion (ProportionRemove) of records per column, for a designated set of columns (ColumnStart - ColumnEnd)
RandomSample = function(DataFrame, ColumnStart, ColumnEnd, ProportionRemove){
  #ci is the opposite of the proportion
  ci = 1 - ProportionRemove
  Missing = sapply(DataFrame[(ColumnStart:ColumnEnd)],
                   function(x) x[sample(c(TRUE, NA), prob = c(ci, ProportionRemove),
                                        size = length(DataFrame), replace = TRUE)])
}
#Randomly sample column 2 - 5 within DF, deleting 80% of the observation per column
Test = RandomSample(DF, 2, 5, 0.8)
I understand there is an element of randomness to this, but in 10 trials (10*4 = 40 columns), 17 of the columns had no data, and in one trial, one column still had 6 records (rather than the expected ~2) - see below.
B C D E
[1,] NA 24.004402 7.201558 NA
[2,] NA NA NA NA
[3,] NA 4.029659 NA NA
[4,] NA NA NA NA
[5,] NA 29.377632 NA NA
[6,] NA 3.340918 -2.131747 NA
[7,] NA NA NA NA
[8,] NA 15.967318 NA NA
[9,] NA NA NA NA
[10,] NA -8.078221 NA NA
In summary, I want to replace a proportion of observations with NAs in each column.
Any help is greatly appreciated!!!
This makes sense to me. As @Frank suggested (in a since-deleted comment ... *sigh*), "randomness" can give you really non-random-looking results (Dilbert: Tour of Accounting, 2001-10-25).
If you want random samples with guaranteed ratios, try this:
guaranteedSampling <- function(DataFrame, ProportionRemove) {
  n <- max(1L, floor(nrow(DataFrame) * ProportionRemove))
  inds <- replicate(ncol(DataFrame), sample(nrow(DataFrame), size = n), simplify = FALSE)
  DataFrame[] <- mapply(`[<-`, DataFrame, inds, MoreArgs = list(NA), SIMPLIFY = FALSE)
  DataFrame
}
set.seed(2)
guaranteedSampling(DF[2:5], 0.8)
# B C D E
# 1 NA NA NA NA
# 2 NA NA NA NA
# 3 NA NA NA NA
# 4 6.792463 10.582938 NA NA
# 5 NA NA -0.612816 NA
# 6 NA -2.278758 NA NA
# 7 NA NA NA 2.245884
# 8 NA NA NA 5.993387
# 9 7.863310 NA 9.042127 NA
# 10 NA NA NA NA
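If you want to keep the untouched column A alongside the sampled columns, a minimal usage sketch (assuming the DF and function defined above) is to assign the result back into just those columns:
set.seed(2)
DF[2:5] <- guaranteedSampling(DF[2:5], 0.8)
DF
# column A keeps all 10 letters; columns B-E each retain exactly 2 values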
Further to @joran's comment, you wanted either nrow(DataFrame) or length(x).
The specific impact in your example is that length(DataFrame) returns the number of columns (5, because DF has 5 variables), so you are producing a vector with only 5 elements, each with 0.8 probability of being NA and 0.2 of being TRUE.
Then this statement (which is what the sapply is doing to each column you specify and in this case I'm applying to DF$B only):
DF$B[sample(c(TRUE, NA), prob=c(0.2, 0.8), size = 5, replace=TRUE)]
does something that isn't immediately obvious to the uninitiated*. This:
sample(c(TRUE, NA), prob=c(0.2, 0.8), size = 5, replace=TRUE)
gives a logical vector which, when used to extract elements of a longer vector, is silently recycled. So let's say you end up with:
NA TRUE NA TRUE NA
When you subset DF$B you end up getting this:
DF$B[c(NA, TRUE, NA, TRUE, NA, NA, TRUE, NA, TRUE, NA)]
Notice in your example how the top 5 numbers always follow the same pattern as the bottom 5 numbers. This also explains why so many columns ended up being all NA: there is a 0.8^5 = 0.32768 probability of getting 5 NAs out of 5, and that pattern gets recycled to the whole column.
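You can see the recycling directly with a small illustrative example (unrelated to DF, just the vector 1:10 and the 5-element logical pattern above):
x <- 1:10
x[c(NA, TRUE, NA, TRUE, NA)]
# [1] NA  2 NA  4 NA NA  7 NA  9 NA
# the 5-element pattern was recycled once to cover all 10 elements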
The other issue with your code is that the function only assigns its result to Missing and never explicitly returns a value. Here it is corrected and cleaned up, following the style guide at http://adv-r.had.co.nz/Style.html:
random_sample <- function(x, col_start, col_end, p) {
  sapply(x[col_start:col_end],
         function(y) y[sample(c(TRUE, NA), prob = c(1 - p, p), size = length(y), replace = TRUE)])
}
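A quick usage sketch (assuming the DF from the question): with size = length(y) the logical vector now matches each column's length, so nothing is recycled.
set.seed(1)
Test <- random_sample(DF, 2, 5, 0.8)
Test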
*The uninitiated in this case includes me! I had no idea that logical vectors were recycled when used to extract until having a look at this question.
I am currently trying to find unique elements between two columns of a data frame and write these to a new final data frame.
This is my code, which works perfectly fine, and creates a result which matches my expectation.
set.seed(42)
df <- data.frame(a = sample(1:15, 10),
                 b = sample(1:15, 10))
unique_to_a <- df$a[!(df$a %in% df$b)]
unique_to_b <- df$b[!(df$b %in% df$a)]
n <- max(c(unique_to_a, unique_to_b))
out <- data.frame(A=rep(NA,n), B=rep(NA,n))
for (element in unique_to_a){
  out[element, "A"] = element
}
for (element in unique_to_b){
  out[element, "B"] = element
}
out
The problem is that it is very slow, because the real data contains 100,000s of rows. I am quite sure this is because of the repeated indexing in the for loops, and I am sure there is a quicker, vectorized way, but I don't see it...
Any ideas on how to speed up the operation are much appreciated.
Cheers!
Didn't compare the speed but at least this is more concise:
elements <- with(df, list(setdiff(a, b), setdiff(b, a)))
data.frame(sapply(elements, \(x) replace(rep(NA, max(unlist(elements))), x, x)))
# X1 X2
# 1 NA NA
# 2 NA NA
# 3 NA 3
# 4 NA NA
# 5 NA NA
# 6 NA NA
# 7 NA NA
# 8 NA NA
# 9 NA NA
# 10 NA NA
# 11 11 NA
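If speed is the main concern, a loop-free sketch of your original approach is below (it assumes, as in your example, that the values are positive integers that can be used directly as row indices):
unique_to_a <- setdiff(df$a, df$b)
unique_to_b <- setdiff(df$b, df$a)
n <- max(unique_to_a, unique_to_b)
out <- data.frame(A = rep(NA, n), B = rep(NA, n))
out$A[unique_to_a] <- unique_to_a  # vectorized index assignment replaces the first loop
out$B[unique_to_b] <- unique_to_b  # and the second
out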
To investigate the distribution of pixel values in an image, I want to compute a Grey-Level Co-Occurrence Matrix (GLCM) for entire images (NO sliding/moving windows). The idea is to obtain a single value (for "mean", "variance", "homogeneity", "contrast", "dissimilarity", "entropy", "second_moment", "correlation") for every image, to compare the images among each other regarding their distribution of pixel values.
e.g.:
image 1:
0 0 0 0
0 0 0 1
0 0 1 1
0 1 1 1
image 2:
1 0 0 1
0 1 0 0
0 0 1 0
1 0 0 1
image 3:
1 1 1 0
1 1 0 0
1 0 0 0
0 0 0 0
All three of these images have the same summary statistics (mean, max, min, …); nevertheless, the distribution of the pixel values is completely different. To find some kind of measure that describes that difference, I want to compute the GLCMs for each of these images.
So far I am using the package "glcm", a fantastic package for texture analysis by Alex Zvoleff. Unfortunately it is only possible to use it with a sliding/moving window… But since I want to obtain one single value per statistical measure for every image, it seems to be useless for me... Is there anyone who can help an R rookie like me out with that? :)
install.packages("glcm")
library(glcm)
library(raster)
# install and load package "glcm" (raster is loaded for the raster() call below)
# see URL: http://azvoleff.com/articles/calculating-image-textures-with-glcm/
values <- seq(1, c(12*12), 1)
values_mtx <- matrix(data = values, nrow = 12, ncol = 12, byrow = TRUE)
# create an "image"
values_mtx_small <- values_mtx[-12, -12]
# since you have to use a sliding/moving window in glcm::glcm(), give the image
# ...an odd number of rows and cols by deleting the last row and last column
values_raster_small <- raster(values_mtx_small)
# create rasterlayer-object
values_textures <- glcm::glcm(values_raster_small,
                              window = c(nrow(values_raster_small) - 2, ncol(values_raster_small) - 2),
                              shift = list(c(0,1), c(1,1), c(1,0), c(1,-1)),
                              statistics = c("mean", "variance", "homogeneity", "contrast",
                                             "dissimilarity", "entropy", "second_moment", "correlation"),
                              min_x = NULL, max_x = NULL, na_opt = "ignore", na_val = NA, asinteger = FALSE)
# compute a GLCM for the image with a maximum size for the moving window to
# ...receive a "measure" for the image
values_textures_mean <- as.matrix(values_textures$glcm_mean)
# extract the calculated GLCM_mean data
values_textures_mean
# get an Output
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,] NA NA NA NA NA NA NA NA NA NA NA
[2,] NA NA NA NA NA NA NA NA NA NA NA
[3,] NA NA NA NA NA NA NA NA NA NA NA
[4,] NA NA NA NA NA NA NA NA NA NA NA
[5,] NA NA NA NA NA 0.4589603 NA NA NA NA NA
[6,] NA NA NA NA NA 0.5516493 NA NA NA NA NA
[7,] NA NA NA NA NA NA NA NA NA NA NA
[8,] NA NA NA NA NA NA NA NA NA NA NA
[9,] NA NA NA NA NA NA NA NA NA NA NA
[10,] NA NA NA NA NA NA NA NA NA NA NA
[11,] NA NA NA NA NA NA NA NA NA NA NA
# unfortunately this still leaves two numbers as the "measure" rather than a single value…
My R package GLCMTextures is mainly meant to deal with spatial raster data like glcm, but it should be able to do this too. You'll have to tabulate a GLCM for each of the four shifts [c(1, 0), c(1, 1), c(0, 1), c(-1, 1)] individually and then average the texture metrics of each type to get directionally invariant measures.
library(GLCMTextures)
library(raster)
# create an "image"
values_mtx <- matrix(data = seq(1, c(12*12), 1), nrow = 12, ncol = 12, byrow = TRUE)
values_mtx_raster<- raster(values_mtx) #Make it a raster
values_mtx_raster_quantized<- quantize_raster(values_mtx_raster, n_levels = 32, method = "equal prob") #make values integers from 0-31
plot(values_mtx_raster_quantized)
text(values_mtx_raster_quantized)
values_mtx_quantized<- as.matrix(values_mtx_raster_quantized) #Make it a matrix
glcm_10<- make_glcm(values_mtx_quantized, n_levels = 32, shift = c(1,0), na.rm = FALSE, normalize = TRUE) #tabulate glcm with xshift=1, yshift=0 (i.e. pixel to the right)
glcm_metrics(glcm_10)
# glcm_contrast glcm_dissimilarity glcm_homogeneity glcm_ASM glcm_entropy glcm_mean glcm_variance glcm_correlation
# 0.21212121 0.21212121 0.89393939 0.02100551 4.08422180 15.50000000 84.25000000 0.99874112
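To get the directionally invariant measures described above, a short sketch (reusing values_mtx_quantized from the code above) is to tabulate a GLCM for each of the four shifts and then average the metrics:
shifts <- list(c(1, 0), c(1, 1), c(0, 1), c(-1, 1))
metrics_by_shift <- sapply(shifts, function(s)
  unlist(glcm_metrics(make_glcm(values_mtx_quantized, n_levels = 32, shift = s,
                                na.rm = FALSE, normalize = TRUE))))
rowMeans(metrics_by_shift)  # one direction-invariant value per texture metric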
This suggestion might provide the tools needed to get at the answer through the package EBImage. The complete answer would likely require applying additional data reduction techniques and statistical analysis to the results from the textural analysis demonstrated here.
# EBImage needed through Bioconductor, which uses BiocManager
if (!require(EBImage)) {
  if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  BiocManager::install("EBImage")
  library(EBImage)
}
For EBImage, a binary mask is required to define objects for subsequent analysis. In this case, the entire image (array) seems to serve as the object of analysis so a binary mask covering the entire image is created and then modified to replicate the example.
# Create three 32 x 32 images similar to the example
mask <- Image(1, dim = c(32, 32))
img1 <- img2 <- img3 <- mask
img1[upper.tri(img1)] <- 0
nzero <- sum(img1 == 0)
img2[sample(32*32, nzero)] <- 0
img3[lower.tri(img3)] <- 0
# Combine three images into a single 64 x 64 x 3 array for simplicity
img <- combine(img1, img2, img3)
# Verify similarity of global properties of each image
apply(img, 3, mean)
> [1] 0.515625 0.515625 0.515625
apply(img, 3, sd)
> [1] 0.5 0.5 0.5
Haralick features are rotationally invariant textural properties computed from the gray-level co-occurrence matrix. The parameter haralick.scales is used to specify the expected repeating scale for the textural patterns. The default uses c(1, 2) to look for repeats every 1 and 2 pixels. Here I just limit it to 1 pixel.
I have to admit that I use it without fully understanding it. One helpful resource may be a post by Earl Glynn. Also, a question answered on the Bioconductor support site about computing Haralick features provides great information that's hard to find.
# Introduce and apply the computeFeatures.haralick function at a scale of 1
# The first line simply captures the names and properties of the features
props <- computeFeatures.haralick(properties = TRUE, haralick.scales = 1)
# Apply computeFeatures.haralick to each of the 3 dimensions (frames)
m <- sapply(getFrames(img),
function(ref) computeFeatures.haralick(mask, ref, haralick.scales = 1))
# Add meaningful row and column names to the resulting matrix
rownames(m) <- props$name
colnames(m) <- paste0("img", 1:3)
print(round(m, 4))
> img1 img2 img3
> h.asm.s1 0.4702 0.2504 0.4692
> h.con.s1 30.7417 480.7583 30.7417
> h.cor.s1 0.9359 -0.0013 0.9360
> h.var.s1 240.6937 241.0729 241.1896
> h.idm.s1 0.9680 0.5003 0.9680
> h.sav.s1 34.4917 33.8417 33.4917
> h.sva.s1 2093.5247 1594.4603 2028.1987
> h.sen.s1 0.3524 0.4511 0.3528
> h.ent.s1 0.3620 0.6017 0.3625
> h.dva.s1 0.0000 0.0000 0.0000
> h.den.s1 0.0137 0.1506 0.0137
> h.f12.s1 0.7954 0.0000 0.7957
> h.f13.s1 0.6165 0.0008 0.6169
Here I use a heatmap to visualize and organize the 13 Haralick parameters. The plot pretty clearly shows that images 1 and 3 are rather similar and quite different from image 2. Still, differences between images 1 and 3 can be seen.
The matrix used for this heatmap, especially if it was generated from many more images, could be scaled and further analyzed by principal components analysis to identify related images.
heatmap(m)
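As a minimal sketch of that idea (purely illustrative here, since there are only three images), drop the zero-variance features and pass the rest to prcomp:
keep <- apply(m, 1, sd) > 0       # drop features with no variance across images (e.g. h.dva.s1)
pca <- prcomp(t(m[keep, ]), scale. = TRUE)
pca$x[, 1:2]                      # scores on the first two principal components, one row per image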
To learn more about EBImage see the package vignette.
I am looking for a more efficient way (in terms of length of code) of converting a data.frame from:
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1 1 2 3 NA NA NA NA NA NA
# 2 NA NA NA 3 2 1 NA NA NA
# 3 NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA 1 2 3
to
# [,1] [,2] [,3]
#[1,] 1 2 3
#[2,] 3 2 1
#[3,] NA NA NA
#[4,] NA NA NA
#[5,] 1 2 3
That is, I want to remove excess NAs but correctly represent rows with only NAs.
I wrote the following function which does the job, but I am sure there is a less lengthy way of achieving the same.
#Dummy data.frame
data <- matrix(c(1:3, rep(NA, 6),
rep(NA, 3), 3:1, rep(NA, 3),
rep(NA, 9),
rep(NA, 9),
rep(NA, 6), 1:3),
byrow=TRUE, ncol=9)
data <- as.data.frame(data)
sieve <- function(data) {
  #get a list of all entries that are not NA
  cond <- apply(data, 1, function(x) x[!is.na(x)])
  #set integer(0) equal to NA
  cond[sapply(cond, function(x) length(x)==0)] <- NA
  #check how many items there are in non-empty rows
  #(rows are either empty or contain the same number of items)
  n <- max(sapply(cond, length))
  #replace single NA with n NAs, where n = number of items
  #first get an index of entries with single NAs
  index <- (1:length(cond))[sapply(cond, function(x) length(x)==1)]
  #then replace each entry with n NAs
  for (i in index) cond[[i]] <- rep(NA, n)
  #turn list into a data.frame
  cond <- matrix(unlist(cond), nrow=length(cond), byrow=TRUE)
  cond
}
sieve(data)
My question resembles this question about extracting conditions to which participants are assigned (for which I received great answers). I tried expanding these answers to the current dummy data, but without success so far. Hence my rather lengthy custom function.
Edit: Additional info on why I am asking this question: The first data frame represents the raw output from an experiment in which I assigned participants to one of three conditions (using 3 here for simplicity). In each condition, participants read a different scenario, but then answered the same set of questions about the scenario they had read. Qualtrics recorded answers from participants in the first condition in columns V1 through V3, answers from participants in the second condition in columns V4 through V6, and answers from participants in the third condition in columns V7 through V9. (If this block of questions had contained 4 questions, it would have been columns V1 through V4 for answers from participants in the first condition, V5 through V8 for answers from participants in the second condition, ...).
You can try this if the number of non-NA values is always the same in rows that aren't entirely filled with NAs (below, d is the question's data frame, called data above):
First, create a data frame with the appropriate (transposed) dimensions, and fill it with NAs.
d2 <- data.frame(
  matrix(nrow = max(apply(d, 1, function(ii) sum(!is.na(ii)))),
         ncol = nrow(d)))
Then fill that data frame using apply, and transpose it to get your desired outcome:
d2[] <- apply(d, 1, function(ii) ii[!is.na(ii)])
t(d2)
# [,1] [,2] [,3]
#X1 1 2 3
#X2 3 2 1
#X3 NA NA NA
#X4 NA NA NA
#X5 1 2 3
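Another compact sketch under the same assumption (non-empty rows all have the same number of non-NA values), collecting the values row by row and padding the all-NA rows explicitly:
sieve2 <- function(d) {
  vals <- lapply(seq_len(nrow(d)), function(i) {
    x <- as.numeric(unlist(d[i, ], use.names = FALSE))  # row values as a plain numeric vector
    x[!is.na(x)]
  })
  n <- max(lengths(vals))  # number of items in a non-empty row
  t(vapply(vals, function(x) if (length(x)) x else rep(NA_real_, n), numeric(n)))
}
sieve2(data)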
I have this vector
vec <- c(NA, 1, 2, 3, 4, NA)
for which I wish to calculate a rolling mean with a window of size 3, aligned to the right (that is, if I understand correctly, looking backwards)
The expected rolling mean of my vector would be
# [1] NA NA NA 2 3 NA
and yet if I do
rollmean(vec, 3, align='right', fill=NA)
I obtain
# [1] NA NA NA NA NA NA
You can use rollapply instead:
rollapply(vec, 3, mean, fill = NA, align = "right")
[1] NA NA NA 2 3 NA
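If you additionally want an NA inside a window to be skipped rather than turn that window's mean into NA (an extra assumption about what you want, not part of the question), you can pass na.rm = TRUE through to mean:
library(zoo)
rollapply(vec, 3, mean, na.rm = TRUE, fill = NA, align = "right")
# [1]  NA  NA 1.5 2.0 3.0 3.5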
I want to create a lot of variables across several separate dataframes which I will then combine into one grand data frame.
Each sheet is labeled by a letter (there are 24) and each sheet contributes somewhere between 100-200 variables. I could write it as such:
a$variable1 <- NA
a$variable2 <- NA
.
.
.
w$variable25 <- NA
This can/will get ugly, and I'd like to write a loop or use a vector to do the work. I'm having a heck of a time doing it though.
I essentially need a script which will allow me to specify a form and then just tack numbers onto it.
So,
a$variable[i] <- NA
where [i] gets tacked onto the actual variable created.
I just learnt this neat little trick from @eddi:
#created some random dataset with 3 columns
library(data.table)
a <- data.table(
  a1 = c(1, 5),
  a2 = c(2, 1),
  a3 = c(3, 4)
)
#assuming that you now need to add more columns from a4 to a200
# first, creating the sequence from 4 to 200
v = c(4:200)
# then using that sequence to add the 197 more columns
a[, paste0("a", v) := NA]
# now a has 200 columns, as compared to the three we initiated it with
dim(a)
#[1] 2 200
I don't think you actually need this, although you seem to think so for some reason.
Maybe something like this:
a <- as.data.frame(matrix(NA, ncol=10, nrow=5))
names(a) <- paste0("Variable", 1:10)
print(a)
# Variable1 Variable2 Variable3 Variable4 Variable5 Variable6 Variable7 Variable8 Variable9 Variable10
# 1 NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA NA NA NA NA
If you want variables with different types:
p <- 10 # number of variables
N <- 100 # number of records
vn <- vector(mode="list", length=p)
names(vn) <- paste0("V", seq(p))
vn[1:8] <- NA_real_ # numeric
vn[9:10] <- NA_character_ # character
df <- as.data.frame(lapply(vn, function(x, n) rep(x, n), n=N))
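A quick check of the result from the sketch above (shape and column types):
dim(df)            # 100 rows, 10 columns
sapply(df, class)  # V1-V8 numeric, V9-V10 character (factor on R < 4.0 with default stringsAsFactors)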