I am trying to use reformulate in R to build a pair of formulas that looks like this:
~ X, ~Y
To be fed into this type of function:
as.data.frame(svyby(~X,~Y, design, svymean, na.rm= T))
I know that:
reformulate("X","Y")
returns: Y ~ X
How do I modify reformulate to achieve the above formula? I have tried:
reformulate(c(`~`,"X"), c(`~`,"Y")) # throws an error
form <- noquote(paste0("~", x, ",", "~", column)) # is not accepted in the function
This is the full function I am feeding it into:
Here form contains the 'formula' portion, x is pulled in from a vector vector_vars such as c("Q1", "Q2"), and column is a character variable (e.g. 'poverty') passed in from a larger function.
myfun <- function(x){
  form <- noquote(paste0("~", x, ",", "~", column))
  cbind(as.data.frame(svyby(form, design, svymean, na.rm = TRUE)),
        freq = c(svytable(form, design)))
}
do.call(rbind, lapply(vector_vars, myfun))
Data and function for testing:
library(survey)
library(readr)
data <- read_table2("Q50_1 Q50_2 Q38 Q90 pov gender wgt id
yes 3 Yes NA High M 1.3 A
NA 4 No 2 Med F 0.4 B
no 2 NA 4 Low F 1.2 C
maybe 3 No 2 High M 0.5 D
yes NA No NA High M 0.7 E
no 2 Yes 3 Low F 0.56 F
maybe 4 Yes 2 Med F 0.9 G
")
design <- svydesign(id = ~id,
                    weights = ~wgt,
                    nest = FALSE,
                    data = data)
vector_vars <- c("Q50_1", "Q38")
create_df <- function(design, vector_vars, column){
  # function to retrieve the weighted mean and se
  myfun <- function(x){
    form <- noquote(paste0("~", x, ",", "~", column))
    cbind(as.data.frame(svyby(form, design, svymean, na.rm = TRUE)),
          freq = c(svytable(form, design)))
  }
  final <- do.call(rbind, lapply(vector_vars, myfun))
  return(final)
}
create_df(design, vector_vars, 'gender')
In svyby, the ~X and ~Y go to different arguments: the first is the formula and the second is the by argument.
library(survey)
myfun <- function(design, x, colnm) {
  # // first formula
  fmla <- reformulate(x)
  # // formula for by
  by <- reformulate(colnm)
  # // return a named list
  list(d1 = as.data.frame(svyby(fmla, by, design, svymean, na.rm = TRUE)),
       freq = c(svytable(fmla, design)))
}
Testing:
lapply(vector_vars, function(x) myfun(design, x, "gender"))
[[1]]
[[1]]$d1
gender Q50_1maybe Q50_1no se.Q50_1maybe se.Q50_1no
F F 0.3383459 0.6616541 0.3026058 0.3026058
M M 0.2000000 0.8000000 0.2148115 0.2148115
[[1]]$freq
maybe no yes
1.40 1.76 2.00
[[2]]
[[2]]$d1
gender Q38No Q38Yes se.Q38No se.Q38Yes
F F 0.2150538 0.7849462 0.2253182 0.2253182
M M 0.4800000 0.5200000 0.3317149 0.3317149
[[2]]$freq
No Yes
1.60 2.76
As the results show, the column names differ between list elements, so rbind wouldn't work. We can use bind_rows from dplyr or rbindlist from data.table with the updated function:
myfun <- function(design, x, colnm) {
  fmla <- reformulate(x)
  by <- reformulate(colnm)
  d1 <- as.data.frame(svyby(fmla, by, design, svymean, na.rm = TRUE))
  freq <- c(svytable(fmla, design))
  d1[names(freq)] <- as.list(freq)
  return(d1)
}
library(data.table)
rbindlist(lapply(vector_vars, function(x) myfun(design, x, "gender")), fill = TRUE)
gender Q50_1maybe Q50_1no se.Q50_1maybe se.Q50_1no maybe no yes Q38No Q38Yes se.Q38No se.Q38Yes No Yes
1: F 0.3383459 0.6616541 0.3026058 0.3026058 1.4 1.76 2 NA NA NA NA NA NA
2: M 0.2000000 0.8000000 0.2148115 0.2148115 1.4 1.76 2 NA NA NA NA NA NA
3: F NA NA NA NA NA NA NA 0.2150538 0.7849462 0.2253182 0.2253182 1.6 2.76
4: M NA NA NA NA NA NA NA 0.4800000 0.5200000 0.3317149 0.3317149 1.6 2.76
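As mentioned above, dplyr's bind_rows is a drop-in alternative that likewise fills non-matching columns with NA (a minimal sketch, assuming the same updated myfun):
library(dplyr)
# bind_rows pads missing columns with NA, just like rbindlist(fill = TRUE)
bind_rows(lapply(vector_vars, function(x) myfun(design, x, "gender")))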
I am trying to write a function that returns regression coefficients and standard errors, since I need to run a large number of regressions.
The data could look like this
library(tidyverse)
library(fixest)
library(broom)
data <- tibble(Date = c("2020-01-01","2020-01-01","2020-01-01","2020-01-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01"),
               Card = c(1,2,3,4,1,2,3,4),
               A = rnorm(8),
               B = rnorm(8),
               C = rnorm(8))
My current code is as follows:
estimation_fun <- function(col1, col2, df) {
  regression <- feols(df[[col1]] ~ df[[col2]] | Card + Date, df)
  est <- tidy(regression)$estimate
  se <- tidy(regression)$std.error
  output <- list(est, se)
  return(output)
}
estimation_fun("A", "B", data)
However, it does not work. I guess it is related to column name in feols because I can make it work for lm().
The feols function needs a formula object. You can create one using paste0/sprintf.
estimation_fun <- function(col1, col2, df) {
  regression <- feols(as.formula(sprintf('%s ~ %s | Card + Date', col1, col2)), df)
  est <- tidy(regression)$estimate
  se <- tidy(regression)$std.error
  output <- list(est, se)
  return(output)
}
estimation_fun("A","B",data)
#[[1]]
#[1] -0.1173276
#attr(,"type")
#[1] "Clustered (Card)"
#[[2]]
#[1] 1.083011
#attr(,"type")
#[1] "Clustered (Card)"
To apply this to every pair of variables you may do:
cols <- names(data)[-(1:2)]
do.call(rbind, combn(cols, 2, function(x) {
  data.frame(cols = paste0(x, collapse = '-'),
             t(estimation_fun(x[1], x[2], data)))
}, simplify = FALSE))
# cols X1 X2
#1 A-B -0.1173276 1.083011
#2 A-C -0.1117691 0.5648162
#3 B-C -0.3771884 0.1656587
Ronak's right: only formulas made of variable names can be used.
Since fixest 0.10.0, you can use the dot square bracket operator to do just that. See the help page for formula manipulation in xpd.
Just change one line in your code to make it work:
estimation_fun <- function(lhs, rhs, df) {
  # lhs must be of length 1 (otherwise => not what you'd want)
  # rhs can be a vector of variables
  regression <- feols(.[lhs] ~ .[rhs] | Card + Date, df)
  # etc...
}
# Example of how ".[]" works:
lhs = "A"
rhs = c("B", "C")
feols(.[lhs] ~ .[rhs], data)
#> OLS estimation, Dep. Var.: A
#> Observations: 8
#> Standard-errors: IID
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.375548 0.428293 0.876849 0.42069
#> B -0.670476 0.394592 -1.699164 0.15004
#> C 0.177647 0.537452 0.330536 0.75440
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.737925 Adj. R2: 0.183702
By the way, I recommend using the built-in multiple estimation facility (see the help here), since estimation speed will be substantially improved.
Update
All combinations can be estimated with one line of code:
# All combinations at once
est_all = feols(c(A, B, C) ~ sw(A, B, C) | Card + Date, data)
Extraction of coefs/SEs can be done with another line:
# Coef + SE // see doc for summary.fixest_multi
coef_se_all = summary(est_all, type = "se_long")
coef_se_all
#> lhs rhs type A B C
#> 1 A A coef 1.0000000 NA NA
#> 2 A A se NaN NA NA
#> 3 A B coef NA 0.8204932 NA
#> 4 A B se NA 1.1102853 NA
#> 5 A C coef NA NA -0.7889534
#> 6 A C se NA NA 0.3260451
#> 7 B A coef 0.2456443 NA NA
#> 8 B A se 0.2314143 NA NA
#> 9 B B coef NA 1.0000000 NA
#> 10 B B se NA NaN NA
#> 11 B C coef NA NA -0.1977089
#> 12 B C se NA NA 0.3335988
#> 13 C A coef -0.4696954 NA NA
#> 14 C A se 0.3858851 NA NA
#> 15 C B coef NA -0.3931512 NA
#> 16 C B se NA 0.8584968 NA
#> 17 C C coef NA NA 1.0000000
#> 18 C C se NA NA NaN
NOTE: this requires fixest 0.10.1 or higher.
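To keep only the meaningful rows, you can drop the degenerate same-variable estimates (A ~ A and so on); a small sketch, assuming coef_se_all behaves like the data.frame shown in the printout above:
# Drop rows where the dependent and explanatory variable coincide
coef_se_all[coef_se_all$lhs != coef_se_all$rhs, ]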
I needed a function to get the p-values of multiple chi-square tests in a matrix.
Searching around, I found this code:
chisqmatrix <- function(x) {
  names = colnames(x); num = length(names)
  m = matrix(nrow = num, ncol = num, dimnames = list(names, names))
  for (i in 1:(num-1)) {
    for (j in (i+1):num) {
      m[i,j] = chisq.test(x[,i], x[,j])$p.value
    }
  }
  return(m)
}
mat = chisqmatrix(DATAFRAME)
mat
And it works perfectly!
The problem is that I need this function to omit the NA values. I can't just omit the NAs across the whole data frame; they need to be omitted for each pair of columns inside the function. So when x[,i] selects a column, how can I take only the values that are not NA? I tried things like != "NA", but that is not the correct way.
Thank you!
You really need to provide reproducible data. As documented on the manual page, chisq.test removes missing values before computing:
set.seed(42)
x <- matrix(sample(c(LETTERS[1:3], NA), 100, replace=TRUE), 20, 5)
x <- data.frame(x)
head(x)
# X1 X2 X3 X4 X5
# 1 A C C A <NA>
# 2 A C C B A
# 3 A A B B B
# 4 A A B B A
# 5 B C <NA> B A
# 6 <NA> <NA> <NA> B <NA>
x.chi <- chisq.test(x[, 1], x[, 2])
# Warning message:
# In chisq.test(x[, 1], x[, 2]) : Chi-squared approximation may be incorrect
x.chi$observed
# x[, 2]
# x[, 1] A B C
# A 3 1 3
# B 2 1 2
sum(x.chi$observed) # How many observations in x.chi?
# [1] 12
nrow(na.omit(x[, 1:2])) # How many rows in x after removing NAs?
# [1] 12
Your function will do exactly what you want it to do.
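If you nevertheless want the pairwise NA removal to be explicit, here is a minimal sketch that makes it visible (it reproduces the default behaviour of chisq.test):
chisqmatrix <- function(x) {
  nms <- colnames(x); num <- length(nms)
  m <- matrix(nrow = num, ncol = num, dimnames = list(nms, nms))
  for (i in 1:(num-1)) {
    for (j in (i+1):num) {
      keep <- complete.cases(x[, c(i, j)]) # rows with no NA in either column
      m[i, j] <- chisq.test(x[keep, i], x[keep, j])$p.value
    }
  }
  m
}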
I have a large data.frame with 'staggered' data and would like to align it. What I mean is I would like to take something like
var1 var2 var3 var4
1 1 6 NA NA
2 2 2 NA NA
3 3 4 8 NA
4 4 7 6 NA
5 5 3 3 5
6 6 NA 7 NA
7 7 NA NA 2
8 8 NA NA 6
9 9 NA NA 2
10 10 NA NA 9
and remove the leading (top) NAs from all columns to get
var1 var2 var3 var4
1 1 6 8 5
2 2 2 6 NA
3 3 4 3 2
4 4 7 7 6
5 5 3 NA 2
6 6 NA NA 9
7 7 NA NA NA
8 8 NA NA NA
9 9 NA NA NA
10 10 NA NA NA
I know about the na.trim function from the zoo package, but this didn't work on either the initial data.frame presented above or its transpose. For this I used, with transposed dataframe t.df,
t.df <- na.trim(t.df, sides = 'left')
This only returned an empty data.frame, and wouldn't work the way I wanted anyway since it would create vectors of different lengths. Can anyone point me to a package or function that might be more helpful?
Here is the code for my example used above:
# example of what I have
var1 <- c(1,2,3,4,5,6,7,8,9,10)
var2 <- c(6,2,4,7,3,NA,NA,NA,NA,NA)
var3 <- c(NA,NA,8,6,3,7,NA,NA,NA,NA)
var4 <- c(NA,NA,NA,NA,5,NA,2,6,2,9)
df <- data.frame(var1, var2, var3, var4)
# transpose and (unsuccessful) attempt to remove leading NAs
t.df <- t(df)
t.df <- na.trim(t.df, sides = 'left')
We can loop over the columns with lapply(...) and apply na.trim to each. Then pad NAs at the end of each list element by setting its length to the maximum length among the list elements.
library(zoo)
lst <- lapply(df, na.trim)
df[] <- lapply(lst, `length<-`, max(lengths(lst)))
df
# var1 var2 var3 var4
#1 1 6 8 5
#2 2 2 6 NA
#3 3 4 3 2
#4 4 7 7 6
#5 5 3 NA 2
#6 6 NA NA 9
#7 7 NA NA NA
#8 8 NA NA NA
#9 9 NA NA NA
#10 10 NA NA NA
Or as #G.Grothendieck mentioned in the comments
replace(df, TRUE, do.call("merge", lapply(lst, zoo)))
You can do it with base functions:
my.na.trim <- function(x) {
  r <- rle(is.na(x))
  if (!r$values[1]) return(x)
  x[c(((r$lengths[1]+1):length(x)), 1:r$lengths[1])]
}
df[,] <- lapply(df, my.na.trim)
df
# var1 var2 var3 var4
# 1 1 6 8 5
# 2 2 2 6 NA
# 3 3 4 3 2
# 4 4 7 7 6
# 5 5 3 NA 2
# 6 6 NA NA 9
# 7 7 NA NA NA
# 8 8 NA NA NA
# 9 9 NA NA NA
# 10 10 NA NA NA
An alternative coding for the function:
my.na.trim <- function(x) {
  r <- rle(is.na(x))
  if (!r$values[1]) return(x)
  r1 <- r$lengths[1]
  c(tail(x, -r1), head(x, r1))
}
We can use the cbind.na() function from the qpcR package and combine it with the na.trim() function from the zoo package:
do.call(qpcR:::cbind.na, lapply(df, zoo::na.trim))
# var1 var2 var3 var4
# [1,] 1 6 8 5
# [2,] 2 2 6 NA
# [3,] 3 4 3 2
# [4,] 4 7 7 6
# [5,] 5 3 NA 2
# [6,] 6 NA NA 9
# [7,] 7 NA NA NA
# [8,] 8 NA NA NA
# [9,] 9 NA NA NA
#[10,] 10 NA NA NA
If speed is a concern, you can use this data.table solution.
library(data.table)
dt_foo <- function(dt) {
  shift_v <- sapply(dt, function(col) min(which(+(is.na(col)) == 0)) - 1)
  shift_expr <- parse(text = paste0("list(", paste("shift(", names(shift_v), ", n = ", shift_v, ", type = 'lead')", collapse = ", "), ")"))
  dt[, (names(shift_v)) := eval(shift_expr)]
  dt[]
}
Some benchmarking follows.
library(zoo)
library(microbenchmark)
set.seed(1)
DT <- as.data.table(matrix(sample(c(0:9L, NA), 1e8, T, prob = c(rep(.01, 10), .9)), ncol = 1000))
zoo_foo <- function(df) {
  lst <- lapply(df, na.trim)
  df[] <- lapply(lst, `length<-`, max(lengths(lst)))
  df
}
my.na.trim <- function(x) {
  r <- rle(is.na(x))
  if (!r$values[1]) return(x)
  x[c(((r$lengths[1]+1):length(x)), 1:r$lengths[1])]
}
microbenchmark(dt_foo(copy(DT)), zoo_foo(DT),
               as.data.frame(lapply(DT, my.na.trim)), times = 10)
Unit: seconds
expr min lq mean median uq max neval cld
dt_foo(copy(DT)) 1.468749 1.618289 1.690293 1.699926 1.725534 1.893018 10 a
zoo_foo(DT) 6.493227 6.516247 6.834768 6.779045 7.190705 7.319058 10 c
as.data.frame(lapply(DT, my.na.trim)) 4.988514 5.013340 5.384399 5.385273 5.508889 6.517748 10 b
I've got a rather large (around 100k observations) data set, similar to this:
data <- data.frame(
  ID = seq(1, 5, 1),
  Values = c("1,2,3", "4", " ", "4,1,6,5,1,1,6", "0,0"),
  stringsAsFactors = FALSE)
data
ID Values
1 1 1,2,3
2 2 4
3 3
4 4 4,1,6,5,1,1,6
5 5 0,0
I want to split the Values column by "," with NA for missed cells:
ID v1 v2 v3 v4 v5 v6 v7
1 1 2 3 NA NA NA NA
2 4 NA NA NA NA NA NA
3 NA NA NA NA NA NA NA
4 4 1 6 5 1 1 6
5 0 0 NA NA NA NA NA
...
My best attempt was strsplit + rbind:
df <- data.frame(do.call(
  "rbind",
  strsplit(as.character(data$Values), split = ",", fixed = FALSE)
))
But rbind just recycles all 'short' rows instead of padding with NA.
I have found a similar problem.
Many thanks, Leo
I would suggest looking at my cSplit function or approaching the problem manually.
The cSplit approach would simply be:
cSplit(data, "Values", ",")
# ID Values_1 Values_2 Values_3 Values_4 Values_5 Values_6 Values_7
# 1: 1 1 2 3 NA NA NA NA
# 2: 2 4 NA NA NA NA NA NA
# 3: 3 NA NA NA NA NA NA NA
# 4: 4 4 1 6 5 1 1 6
# 5: 5 0 0 NA NA NA NA NA
Approaching the problem manually would look like:
## Split up the values
Split <- strsplit(data$Values, ",", fixed = TRUE)
## How long is each list element?
Ncol <- vapply(Split, length, 1L)
## Create an empty character matrix to store the results
M <- matrix(NA_character_, nrow = nrow(data),
            ncol = max(Ncol),
            dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
## Use matrix indexing to figure out where to put the results
M[cbind(rep(1:nrow(data), Ncol),
        sequence(Ncol))] <- unlist(Split, use.names = FALSE)
## Bind the values back together, here as a "data.table" (faster)
data.table(ID = data$ID, M)
^^ That's pretty much what goes on in cSplit, but the function has a few other options and some basic error checking and so on that might make it a little bit slower than a purely manual approach (or a function written to address your specific problem).
Both of these approaches would be faster than a "data.table" + "reshape2" approach. Also, since each row is treated individually, you shouldn't have any problems even if you have duplicated ID values--your output should have the same number of rows as your input.
Benchmarks
I've done benchmarks on more rows and on data that would give "wider" results (since that's implied in your comments to David's answer).
Here is the sample data:
set.seed(1)
a <- sample(0:100, 100000, TRUE)
Values <- vapply(a, function(x)
paste(sample(0:100, x, TRUE), collapse = ","), character(1L))
Values[sample(length(Values), length(Values) * .15)] <- ""
ID <- c(1:80000, 1:20000)
data <- data.frame(ID, Values, stringsAsFactors = FALSE)
DT <- as.data.table(data)
Here are the functions to test:
fun1a <- function(inDT) {
  data2 <- inDT[, list(Values = unlist(
    strsplit(Values, ","))), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID]
  dcast.data.table(data2, ID ~ Var,
                   fill = NA_character_,
                   value.var = "Values")
}
fun1b <- function(inDT) {
  data2 <- inDT[, list(Values = unlist(
    strsplit(Values, ",", fixed = TRUE),
    use.names = FALSE)), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID]
  dcast.data.table(data2, ID ~ Var,
                   fill = NA_character_,
                   value.var = "Values")
}
fun2 <- function(inDT) {
  cSplit(inDT, "Values", ",")
}
fun3 <- function(inDF) {
  Split <- strsplit(inDF$Values, ",", fixed = TRUE)
  Ncol <- vapply(Split, length, 1L)
  M <- matrix(NA_character_, nrow = nrow(inDF),
              ncol = max(Ncol),
              dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
  M[cbind(rep(1:nrow(inDF), Ncol),
          sequence(Ncol))] <- unlist(Split, use.names = FALSE)
  data.table(ID = inDF$ID, M)
}
Here are the results:
library(microbenchmark)
microbenchmark(fun2(DT), fun3(data), times = 20)
# Unit: seconds
# expr min lq median uq max neval
# fun2(DT) 4.810942 5.173103 5.498279 5.622279 6.003339 20
# fun3(data) 3.847228 3.929311 4.058728 4.160082 4.664568 20
## Didn't want to microbenchmark here...
system.time(fun1a(DT))
# user system elapsed
# 16.92 0.50 17.59
system.time(fun1b(DT)) # fixed = TRUE & use.names = FALSE
# user system elapsed
# 11.54 0.42 12.01
NOTE: The results of fun1a and fun1b would not be the same as those of fun2 and fun3 because of the duplicated IDs.
Here's a data.table combined with reshape2 approach (should be very efficient)
library(data.table) # Loading `data.table` package
data2 <- setDT(data)[, list(Values = unlist(strsplit(Values, ","))), by = ID] # splitting the values by `,` for each `ID`
data2[, Var := paste0("v", seq_len(.N)), by = ID] # Adding the `Var` variable
library(reshape2) # Loading `reshape2` package
dcast.data.table(data2, ID ~ Var, fill = NA_character_, value.var = "Values") # casting to wide format
# ID v1 v2 v3 v4 v5 v6 v7
# 1: 1 1 2 3 NA NA NA NA
# 2: 2 4 NA NA NA NA NA NA
# 3: 3 NA NA NA NA NA NA NA
# 4: 4 4 1 6 5 1 1 6
# 5: 5 0 0 NA NA NA NA NA
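For completeness, tidyr's separate offers a one-liner for this kind of split; a sketch, assuming you first compute the maximum number of pieces (the blank Values entry in row 3 may need an extra cleanup step afterwards):
library(tidyr)
# Widest row determines how many columns we need
n_max <- max(lengths(strsplit(data$Values, ",", fixed = TRUE)))
# fill = "right" pads the short rows with NA instead of recycling
separate(data, Values, into = paste0("v", seq_len(n_max)), sep = ",", fill = "right")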
I was using the prcomp function when I received this error
Error in prcomp.default(x, ...) :
cannot rescale a constant/zero column to unit variance
I know I can scan my data manually but is there any function or command in R that can help me remove these constant variables?
I know this is a very simple task, but I have never come across a function that does this.
Thanks,
The problem here is that your column variance is equal to zero. You can check which column of a data frame is constant this way, for example :
df <- data.frame(x=1:5, y=rep(1,5))
df
# x y
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# Supply names of columns that have 0 variance
names(df[, sapply(df, function(v) var(v, na.rm=TRUE)==0)])
# [1] "y"
So if you want to exclude these columns, you can use:
df[,sapply(df, function(v) var(v, na.rm=TRUE)!=0)]
EDIT: In fact, it is simpler to use apply instead. Something like this:
df[,apply(df, 2, var, na.rm=TRUE) != 0]
This Q&A is a popular Google search result, but the answer is a bit slow for a large matrix, and I do not have enough reputation to comment on the first answer, so I am posting a new answer.
For each column of a large matrix, checking whether the maximum is equal to the minimum is sufficient.
df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]
Here is the test; it cuts more than 90% of the runtime compared to the first answer, and it is also faster than the approach from the second comment on the question.
ncol = 1000000
nrow = 10
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE))]) # my method
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 })]) # Keith's method
time1
# user system elapsed
# 22.267 0.194 22.626
time2
# user system elapsed
# 2.073 0.077 2.155
time3
# user system elapsed
# 6.702 0.060 6.790
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
Since this Q&A is a popular Google search result, but the answer is a bit slow for a large matrix and #raymkchow's version is slow with NAs, I propose a new version using exponential search and the power of data.table.
This is a function I implemented in the dataPreparation package.
First build an example data.table with more rows than columns (which is usually the case) and 10% NAs:
ncol = 1000
nrow = 100000
df <- matrix(sample(1:(ncol*nrow),ncol*nrow,replace = FALSE), ncol = ncol)
df <- apply (df, 2, function(x) {x[sample( c(1:nrow), floor(nrow/10))] <- NA; x} ) # Add 10% of NAs
df[,sample(1:ncol,70,replace = FALSE)] <- rep(1,times = nrow) # df is a large matrix
df <- as.data.table(df)
Then benchmark all approaches:
time1 <- system.time(df1 <- df[,apply(df, 2, var, na.rm=TRUE) != 0, with = F]) # the first method
time2 <- system.time(df2 <- df[,!apply(df, MARGIN = 2, function(x) max(x, na.rm = TRUE) == min(x, na.rm = TRUE)), with = F]) # raymkchow
time3 <- system.time(df3 <- df[,apply(df, 2, function(col) { length(unique(col)) > 1 }), with = F]) # Keith's method
library(dataPreparation)
time4 <- system.time(df4 <- df[,-which_are_constant(df, verbose = FALSE)]) # My method
The results are the following:
time1 # Variance approch
# user system elapsed
# 2.55 1.45 4.07
time2 # Min = max approach
# user system elapsed
# 2.72 1.5 4.22
time3 # length(unique()) approach
# user system elapsed
# 6.7 2.75 9.53
time4 # Exponential search approach
# user system elapsed
# 0.39 0.07 0.45
all.equal(df1, df2)
# [1] TRUE
all.equal(df3, df2)
# [1] TRUE
all.equal(df4, df2)
# [1] TRUE
dataPreparation::which_are_constant is 10 times faster than the other approaches. Plus, the more rows you have, the bigger the gain.
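For a quick check on a toy data set, a sketch (assuming, per the package documentation, that which_are_constant returns the indices of the constant columns):
library(dataPreparation)
toy <- data.frame(x = 1:5, y = rep(1, 5))
constant_cols <- which_are_constant(toy, verbose = FALSE) # index of the constant column y
toy[, -constant_cols, drop = FALSE] # keeps only the non-constant column x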
The janitor package has the function remove_constant that deletes constant columns.
Let's create some synthetic data for illustration:
library(janitor)
test_dat <- data.frame(A=1, B=1:10, C= LETTERS[1:10])
test_dat
This is test_dat:
> test_dat
A B C
1 1 1 A
2 1 2 B
3 1 3 C
4 1 4 D
5 1 5 E
6 1 6 F
7 1 7 G
8 1 8 H
9 1 9 I
10 1 10 J
Then the function remove_constant deletes the constant column:
remove_constant(test_dat)
remove_constant(test_dat, na.rm= TRUE)
Either of the two calls above gives:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
NOTE: use the argument na.rm = TRUE to make sure that a column containing a single value plus NAs is also deleted. For example,
test_dat_with_NA <- data.frame(A=c(1, NA), B=1:10, C= LETTERS[1:10])
test_dat_with_NA
This gives test_dat_with_NA:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
Then the call
remove_constant(test_dat_with_NA)
does not delete column A:
A B C
1 1 1 A
2 NA 2 B
3 1 3 C
4 NA 4 D
5 1 5 E
6 NA 6 F
7 1 7 G
8 NA 8 H
9 1 9 I
10 NA 10 J
while the call
remove_constant(test_dat_with_NA, na.rm = TRUE)
deletes column A, which contains only the value 1 and NAs:
B C
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 6 F
7 7 G
8 8 H
9 9 I
10 10 J
If you are after a dplyr solution that returns the non-constant variables in a df, I'd recommend the following. Optionally, you can add %>% colnames() if the column names are desired:
library(dplyr)
df <- data.frame(x = 1:5, y = rep(1,5))
# returns dataframe
var_df <- df %>%
select_if(function(v) var(v, na.rm=TRUE) != 0)
var_df %>% colnames() # returns column names
A tidyverse version of Keith's comment:
df %>% purrr::keep(~length(unique(.x)) != 1)
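Note that length(unique(.x)) counts NA as one more distinct value, so a column holding a single value plus NAs is kept. If such columns should also count as constant, dplyr's n_distinct with na.rm = TRUE is one option (a minimal sketch):
# Keep only columns with more than one distinct non-NA value
df %>% purrr::keep(~ dplyr::n_distinct(.x, na.rm = TRUE) > 1)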