I have a data frame with 100+ columns. I want to filter those columns based on how non-uniform their values are.
For example, if a column has more than 90% (or 95%, or 99%) NAs (or 0s, or -999, or any other single value), remove it from the data frame.
I can remove columns dominated by NAs or 0s, but the problem is that I don't know in advance what the dominant value will be.
Example of removing columns with more than 90% NAs: df[, colMeans(is.na(df)) <= 0.9]
I would simply use table to count the number of occurrences of each value. When the maximum of these counts reaches the chosen threshold, you can discard the column.
In the following toy example, x, y and z are nearly constant: x is 96% NA, y is 99% 0 and z is 97% -1 (but any value would work).
set.seed(26012023)
df <- data.frame(w = rnorm(100), x = c(rep(NA, 96), rnorm(4)), y = c(rep(0, 99), rnorm(1)),
z = c(rep(-1, 97), rnorm(3)))
apply(df, 2, function(x, cutoff = .95) {
tab <- table(x, useNA = "ifany")
max_val <- max(tab)
max_val >= cutoff * length(x)
})
# w x y z
# FALSE TRUE TRUE TRUE
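To actually drop the flagged columns, the logical vector can be used for subsetting. The following is a minimal sketch, not part of the original answer: the helper name is_near_constant and the object df_filtered are made up for illustration, and vapply is used instead of apply so that the columns are not first coerced to a common type via as.matrix.

```r
# Hypothetical helper: TRUE when one value (NA included) fills at least
# `cutoff` of the column
is_near_constant <- function(x, cutoff = 0.95) {
  max(table(x, useNA = "ifany")) >= cutoff * length(x)
}

# Same toy data as above
set.seed(26012023)
df <- data.frame(w = rnorm(100),
                 x = c(rep(NA, 96), rnorm(4)),
                 y = c(rep(0, 99), rnorm(1)),
                 z = c(rep(-1, 97), rnorm(3)))

# One logical per column, computed column by column
drop <- vapply(df, is_near_constant, logical(1))

# Keep only the columns that are not flagged; drop = FALSE keeps a data frame
df_filtered <- df[, !drop, drop = FALSE]
names(df_filtered)
```

Logical indexing also behaves sensibly when no column is flagged, in which case all columns are kept.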
We can create a toy example by defining the following data.frame named df:
# Seed to make it reproducible
set.seed(12345)
df <- data.frame(cbind(Var1 = c(rep(10,19),1),
Var2 = sample(letters[1:5],20, prob = c(0.8,0.1,0.5,0.25,0.25), replace = T),
Var3 = sample(c("Yes","No"), 20, prob = c(.95, .05), replace = T),
Var4 = sample(1:3, 20, replace = T),
Var5 = c(rep(NA,15),rep(1,5))))
Then we compute the maximum frequency of a single value for each column, and finally we delete the columns that exceed the required threshold:
# Calculate the maximum relative frequency of a single value for each column
aux <- apply(df,2,function(x) max(prop.table(table(x, useNA = "ifany"))))
# Define new.df as df without the columns in which one value occurs more than 90% of the time
# (logical indexing also works when no column exceeds the threshold, unlike -which())
new.df <- df[, aux <= .9, drop = FALSE]
I want to do bootstrapping while comparing two data frames column-wise, where the data frames have different numbers of rows.
I have two data frames in which the rows represent values from experiments and the columns are named after the datasets (data1, data2, data3, data4).
emp.data1 <- data.frame(
data1 = c(234,0,34,0,46,0,0,0,2.26,0, 5,8,93,56),
data2 = c(1.40,1.21,0.83,1.379,2.60,9.06,0.88,1.16,0.64,8.28, 5,8,93,56),
data3 =c(0,34,43,0,0,56,0,0,0,45,5,8,93,56),
data4 =c(45,0,545,34,0,35,0,35,0,534, 5,8,93,56),
stringsAsFactors = FALSE
)
emp.data2 <- data.frame(
data1 = c(45, 0, 0, 45, 45, 53),
data2 = c(23, 0, 45, 12, 90, 78),
data3 = c(72, 45, 756, 78, 763, 98),
data4 = c(1, 3, 65, 78, 9, 45),
stringsAsFactors = FALSE
)
I am trying to do bootstrapping (n = 1000). Values are sampled at random with replacement from emp.data1 (14 x 4), while emp.data2 (6 x 4) is left unchanged. For example, for the first column (data1): take the column sum of the 6 values in emp.data2, sample 6 random non-zero values from emp.data1 and take their sum, divide the two sums, and store the result in a temporary object. Repeat this 1000 times and take the median at the end. I want to do this for each column of the data frame. The sample code below works, but I am not able to restrict the sampling to the non-zero values of emp.data1.
nboot <- 1e3
boot_temp_emp<- c()
n_data1 <- nrow(emp.data1); n_data2 <- nrow(emp.data2)
for (j in seq_len(nboot)) {
boot <- sample(x = seq_len(n_data1), size = n_data2, replace = TRUE)
value <- colSums(emp.data2)/colSums(emp.data1[boot,])
boot_temp_emp <- rbind(boot_temp_emp, value)
}
boot_data<- apply(boot_temp_emp, 2, median)
The script above produces output, but the rows sampled with emp.data1[boot,] include zero values, which then go into the column sums. I want the sum of randomly selected non-zero values for each column individually, so I tried the script below, but it does not remove the zero values. I cannot get the desired output; could someone please help me correct my script?
nboot <- 1e3
boot_temp_emp<- c()
for (i in colnames(emp.data2)){
for (j in seq_len(nboot)){
data1=emp.data1[i]
data2=emp.data2[i]
n_data1 <- nrow(data1); n_data2 <- nrow(data2)
boot <- sample(x = seq_len(n_data1), size = n_data2, replace = TRUE)
value <- colSums(data2[i])/colSums(data1[boot, ,drop = FALSE])
boot_temp_emp <- rbind(boot_temp_emp, value)
}
}
boot_data<- apply(boot_temp_emp, 2, median)
Thank you
Here is a solution.
Write a function to make the code clearer. This function takes the following arguments.
x the input data.frame emp.data1;
s2 the column sums of emp.data2;
n the number of elements to sample from emp.data1's columns, with a default value of 6.
Then create a results matrix, pre-compute the column sums of emp.data2 and call the function in a loop.
boot_fun <- function(x, s2, n = 6){
# the loop makes sure there is no division by zero
nrx <- nrow(x)
repeat{
i <- sample(nrx, n, replace = TRUE)
s1 <- colSums(x[i, ])
if(all(s1 != 0)) break
}
s2/s1
}
set.seed(2022)
nboot <- 1e3
sums2 <- colSums(emp.data2)
results <- matrix(nrow = nboot, ncol = ncol(emp.data1))
for(i in seq_len(nboot)){
results[i, ] <- boot_fun(emp.data1, sums2)
}
ratios_medians <- apply(results, 2, median)
old_par <- par(mfrow = c(2, 2))
for(j in 1:4) {
main <- paste0("data", j)
hist(results[, j], main = main, xlab = "ratios", freq = FALSE)
abline(v = ratios_medians[j], col = "blue", lty = "dashed")
}
par(old_par)
Created on 2022-02-24 by the reprex package (v2.0.1)
Edit
Following the comments, here is a revised version of the bootstrap function. It makes sure there are no zeros in the sampled vectors before computing their sums.
boot_fun2 <- function(x, s2, n = 6){
nrx <- nrow(x)
ncx <- ncol(x)
s1 <- numeric(ncx)
for(j in seq.int(ncx)) {
repeat{
i <- sample(nrx, n, replace = TRUE)
if(all(x[i, j] != 0)) {
s1[j] <- sum(x[i, j])
break
}
}
}
s2/s1
}
set.seed(2022)
nboot <- 1e3
sums2 <- colSums(emp.data2)
results2 <- matrix(nrow = nboot, ncol = ncol(emp.data1))
for(i in seq_len(nboot)){
results2[i, ] <- boot_fun2(emp.data1, sums2)
}
ratios_medians2 <- apply(results2, 2, median)
old_par <- par(mfrow = c(2, 2))
for(j in 1:4) {
main <- paste0("data", j)
hist(results2[, j], main = main, xlab = "ratios", freq = FALSE)
abline(v = ratios_medians2[j], col = "blue", lty = "dashed")
}
par(old_par)
Created on 2022-02-27 by the reprex package (v2.0.1)
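Drawing n rows with replacement and rejecting every draw that contains a zero should give the same distribution as drawing n values with replacement directly from the non-zero entries of a column, so the repeat loop can be avoided. The following is only a sketch under that assumption; the names boot_fun3, results3 and ratios_medians3 are hypothetical and not part of the original answer.

```r
# Data from the question
emp.data1 <- data.frame(
  data1 = c(234, 0, 34, 0, 46, 0, 0, 0, 2.26, 0, 5, 8, 93, 56),
  data2 = c(1.40, 1.21, 0.83, 1.379, 2.60, 9.06, 0.88, 1.16, 0.64, 8.28, 5, 8, 93, 56),
  data3 = c(0, 34, 43, 0, 0, 56, 0, 0, 0, 45, 5, 8, 93, 56),
  data4 = c(45, 0, 545, 34, 0, 35, 0, 35, 0, 534, 5, 8, 93, 56)
)
emp.data2 <- data.frame(
  data1 = c(45, 0, 0, 45, 45, 53),
  data2 = c(23, 0, 45, 12, 90, 78),
  data3 = c(72, 45, 756, 78, 763, 98),
  data4 = c(1, 3, 65, 78, 9, 45)
)

# Hypothetical variant: sample n values from the non-zero entries of each column
boot_fun3 <- function(x, s2, n = 6) {
  s1 <- vapply(x, function(col) {
    nz <- col[col != 0]
    # sample.int on positions avoids sample()'s special case for a single number
    sum(nz[sample.int(length(nz), n, replace = TRUE)])
  }, numeric(1))
  s2 / s1
}

set.seed(2022)
nboot <- 1e3
sums2 <- colSums(emp.data2)

# replicate() returns a 4 x nboot matrix, so transpose to nboot x 4
results3 <- t(replicate(nboot, boot_fun3(emp.data1, sums2)))
ratios_medians3 <- apply(results3, 2, median)
```

Because the random draws are consumed in a different order, the individual values will not match results2 even with the same seed; only the distributions should be comparable.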
I would expect the imputed values of x to be the same whenever the same predictor variables are used, regardless of whether other variables are imputed, but that is not the case, as reproduced here:
library(data.table)
library(robustlmm)
library(mice)
library(miceadds)
library(magrittr)
library(dplyr)
library(tidyr)
set.seed(1)
# Data ------------------------------------
dt1 <- data.table(id = rep(1:10, each=3),
group = rep(1:2, each=15),
time = rep(1:3, 10),
sex = rep(sample(c("F","M"),10,replace=T), each=3),
x = rnorm(30),
y = rnorm(30),
z = rnorm(30))
setDT(dt1)[id %in% sample(1:10,4) & time == 2, `:=` (x = NA, y = NA)][
id %in% sample(1:10,4) & time == 3, `:=` (x = NA, y = NA)]
dt2 <- dt1 %>% group_by(id) %>% fill(y) %>% ungroup %>% as.data.table
# MI 1 ------------------------------------
pm1 <- make.predictorMatrix(dt1)
pm1['x',c('y','z')] <- 0
pm1[c('x','y'), 'id'] <- -2
imp1 <- mice(dt1, pred = pm1, meth = "2l.pmm", seed = 1, m = 2, print = F, maxit = 20)
# boundary (singular) fit: see ?isSingular - I don't know how to interpret this (it doesn't occur with my real data)
View(complete(imp1, 'long'))
# MI 2 ------------------------------------
pm2 <- make.predictorMatrix(dt2)
pm2['x',c('y','z')] <- 0
pm2['x', 'id'] <- -2
imp2 <- mice(dt2, pred = pm2, meth = "2l.pmm", seed = 1, m = 2, print = F, maxit = 20, remove.constant = F)
# imp2$loggedEvents reports sex as constant (I don't know why), so I include remove.constant = F to keep that variable (this doesn't occur with my real data)
View(complete(imp2, 'long'))
In imp1:
group, time and sex are used to predict x
group, time, sex, x and z are used to predict y
In imp2:
group, time and sex are used to predict x
y is complete so no imputation is performed for this variable
Given that, why are the imputed values of x different?
Is this the expected behavior?
Thank you!
I have this data set and I plot F_1 against ks. I need to find the value of ks that has the maximum F_1 value.
set.seed(1)
library(caret)
library(dplyr)
library(modelr)
data("heights")
ks <- seq(1, 101, 3)
F_1 <- sapply(ks, function(k){
test_index <- createDataPartition(heights$sex, times = 1, p = 0.5, list = FALSE)
test_set <- heights[test_index, ]
train_set <- heights[-test_index, ]
fit <- knn3(sex ~ height, data = train_set, k = k)
y_hat <- predict(fit, test_set, type = "class") %>%
factor(levels = levels(train_set$sex))
F_meas(data = y_hat, reference = test_set$sex)
})
plot(ks, F_1)
I can get the maximum F_1 value from max(F_1). But how do I get the corresponding ks value for that maximum F_1 value?
To obtain the input value that corresponds to the maximum of the output, you can index the input vector by the position of the maximum in the output vector of your function.
Example:
f1 <- function(x){
-x^2
}
# Input Values
z <- -100:100
# Corresponding Input Value(s) to max output
z[f1(z) == max(f1(z))]
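Applied to the question, the same idea would look roughly like the sketch below. The F_1 vector here is a made-up stand-in (not the real knn3 result) so that the example runs without caret; best_k and best_ks are hypothetical names.

```r
ks <- seq(1, 101, 3)

# Stand-in for the F_1 scores from the question (assumption: one score per k)
F_1 <- -(ks - 40)^2

# which.max() returns the position of the first maximum
best_k <- ks[which.max(F_1)]

# Comparing against max() returns every k that attains the maximum
best_ks <- ks[F_1 == max(F_1)]
```

which.max is convenient when a single value is wanted, while the comparison form keeps all ties.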
I am new to R and trying to loop through each row of df1 and search for rows in df2 that are close in distance (5mi/8046.72m). I think df1 is looping as intended but I don't think it is going through all of df2.
{for (i in 1:1452){
p1 <- df1[i, 4:5]
p2 <- df2[1:11, 2:3]
d <- distCosine(p1, p2, r=6378137)
return(d< 8046.72)
i <- i+1}
}
I get the output:
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
I would just use an apply function. First, let's make your problem reproducible by creating some "fake" data - I am making the lon/lat pairs artificially close so that we can get a few TRUE's back in the results:
library(geosphere)
df1 <- data.frame(x1 = sample(letters, 100, replace = T),
x2 = sample(letters, 100, replace = T),
x3 = sample(letters, 100, replace = T),
lon = sample(10:12 + rnorm(100, 0, 0.1), 100, replace = T),
lat = sample(10:12 + rnorm(100, 0, 0.1), 100, replace = T))
df2 <- data.frame(x1 = sample(letters, 100, replace = T),
lon = sample(10:12 + rnorm(100, 0, 0.1), 100, replace = T),
lat = sample(10:12 + rnorm(100, 0, 0.1), 100, replace = T))
We can then create two matrices containing the values of interest:
m1 <- as.matrix(df1[, c("lon", "lat")])
m2 <- as.matrix(df2[1:11, c("lon", "lat")])
Now we can use the apply function across the rows of m2, which returns a 100 x 11 matrix:
results <- apply(m2, 1, FUN = function(x) distCosine(x, m1))
To get the results under 5 mi (~8046.72 m), we simply subset:
results[results < 8046.72]
# Showing the next two for alternative output
which(results < 8046.72)
which(results < 8046.72, arr.ind = T)
Note: In your question, it looks like you are interested in the first 1,452 rows; this would mean the results would be a 1,452 x 11 matrix.
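If an explicit loop is preferred, the key change is to store each row's comparison instead of returning during the first iteration, which may be why only 11 values came back in the question. The sketch below is an illustration under assumptions: great_circle is a hypothetical base R stand-in for geosphere::distCosine (spherical law of cosines, same default radius) so the example runs without extra packages, and the coordinates are made up.

```r
# Hypothetical stand-in for geosphere::distCosine: distance in metres from one
# lon/lat point p to every row of the lon/lat matrix m
great_circle <- function(p, m, r = 6378137) {
  to_rad <- pi / 180
  lon1 <- p[1] * to_rad
  lat1 <- p[2] * to_rad
  lon2 <- m[, 1] * to_rad
  lat2 <- m[, 2] * to_rad
  cosd <- sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lon1 - lon2)
  # Clamp to [-1, 1] to guard against rounding before acos()
  acos(pmin(pmax(cosd, -1), 1)) * r
}

# Made-up coordinates (columns: lon, lat)
m1 <- cbind(lon = c(10, 10.05, 12), lat = c(10, 10, 12))
m2 <- cbind(lon = c(10.01, 11.99), lat = c(10.01, 12.01))

# One row per point in m1, one column per point in m2
near <- matrix(FALSE, nrow = nrow(m1), ncol = nrow(m2))
for (i in seq_len(nrow(m1))) {
  near[i, ] <- great_circle(m1[i, ], m2) < 8046.72
}

# Number of m2 points within 5 miles of each m1 point
rowSums(near)
```

The loop counter is advanced by for itself, so no manual i <- i + 1 is needed.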
Within the simulated data set
n = 50
set.seed(378)
df <- data.frame(
age = sample(c(20:90), n, rep = T),
sex = sample(c("m", "f"), n, rep = T, prob = c(0.55, 0.45)),
smoker = sample(c("never", "former", "active"), n, rep = T, prob = c(0.4, 0.45, 0.15)),
py = abs(rnorm(n, 25, 10)),
yrsquit = abs (rnorm (n, 10,2)),
outcome = as.factor(sample(c(0, 1), n, rep = T, prob = c(0.8, 0.2)))
)
I need to introduce some imbalance between the outcome groups (1=disease, 0=no disease). For example, subjects with the disease are older and more likely to be male. I tried
df1 <- within(df, sapply(length(outcome), function(x) {
if (outcome[x] == 1) {
age[x] <- age[x] + 15
sex[x] <- sample(c("m","f"), prob=c(0.8,0.2))
}
}))
but there is no difference as shown by
tapply(df$sex, df$outcome, length)
tapply(df1$sex, df$outcome, length)
tapply(df$age, df$outcome, mean)
tapply(df1$age, df$outcome, mean)
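The use of sapply inside within does not work as you expect. The function within only keeps assignments made directly in the expression it evaluates. In your code, the assignments to age[x] and sex[x] happen inside the anonymous function passed to sapply, so they change local copies that are discarded when the function returns. In addition, sapply(length(outcome), ...) calls the function once with the value length(outcome), not once per row. Hence, within does not modify the data frame.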
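The effect can be seen on a tiny example. This is only an illustrative sketch with a made-up data frame d, not the data from the question:

```r
d <- data.frame(a = 1:3)

# Assignment inside the anonymous function: only a local copy of `a` changes
d1 <- within(d, sapply(seq_along(a), function(i) a[i] <- a[i] + 10))

# Assignment made directly in the expression: within() picks it up
d2 <- within(d, a <- a + 10)

d1$a  # unchanged
d2$a  # shifted by 10
```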
Here is an easier way to modify the data frame without a loop or sapply:
idx <- df$outcome == "1"
df1 <- within(df, {age[idx] <- age[idx] + 15;
sex[idx] <- sample(c("m", "f"), sum(idx),
replace = TRUE, prob = c(0.8, 0.2))})
Now, the data frames are different:
> tapply(df$age, df$outcome, mean)
0 1
60.46341 57.55556
> tapply(df1$age, df$outcome, mean)
0 1
60.46341 72.55556
> tapply(df$sex, df$outcome, summary)
$`0`
f m
24 17
$`1`
f m
2 7
> tapply(df1$sex, df$outcome, summary)
$`0`
f m
24 17
$`1`
f m
1 8