Subset data frame by factor cardinality? - r

I suspect that this will be a duplicate, but my efforts to find an answer have failed. Suppose that I have a data frame whose columns are all either integers or factors. Some of the factor columns have many levels and some do not. I want to subset the data so that I keep only the factor columns with fewer than 10 levels. How can I do this? My first thought was to write a particularly nasty sapply command, but I'm hoping for a better way.

We can use select_if
library(dplyr)
df1 %>%
select_if(~ is.factor(.) && nlevels(.) < 10)
With a reproducible example using iris
data(iris)
iris %>%
select_if(~ is.factor(.) && nlevels(.) < 10)
Or using sapply
i1 <- sapply(df1, function(x) is.factor(x) && nlevels(x) < 10)
df1[i1]
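For a dependency-free variant of the same idea, base R's Filter() applies a predicate to each column and keeps the columns for which it returns TRUE. The sketch below assumes a made-up data frame df1 with one integer column and two factor columns; the column names and the helper few_levels are illustrative only.

```r
# Hypothetical example data: one integer column, one low-cardinality factor,
# and one high-cardinality factor (26 levels).
df1 <- data.frame(
  id    = 1:26,
  small = factor(rep(c("a", "b"), 13)),
  big   = factor(letters)
)

# Predicate: TRUE only for factor columns with fewer than 10 levels.
few_levels <- function(x) is.factor(x) && nlevels(x) < 10

# Filter() on a data frame returns a data frame with the matching columns.
res <- Filter(few_levels, df1)
names(res)
```

The is.factor() check matters here: nlevels() returns 0 for a non-factor column, so without it the integer column would also pass the "fewer than 10" test.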

With data.table you can do:
library(data.table)
setDT(df)
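# is.factor() is needed because non-factor columns have zero levels and would otherwise be selected
df[, .SD, .SDcols = sapply(df, function(x) is.factor(x) && nlevels(x) < 10)]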
Example:
df <- data.table(x = factor(1:3, levels = 1:5), y = factor(1:3, levels = 1:10))
df[,.SD, .SDcols = sapply(df, function(x) length(levels(x))>5)]
y
1: 1
2: 2
3: 3

Related

Replace NA in all columns: argument is not numeric or logical

I want to loop through a lot of columns in an R data frame and replace NA with the column mean.
I can get a mean for columns like this
mean(df$col20, na.rm = TRUE)
But this gets the warning: argument is not numeric or logical: returning NA
mean(df[ , 20], na.rm = TRUE)
I tried the above syntax with a small dummy df including some NA and it works fine. Any idea what else to look for to fix this?
ps. head(df[20]) tells me it's a dbl and str(df) says it's num.
(and [ , 20] is an example; I actually get lots of warnings because it really sits in a for loop - but I have executed the line by itself as a test)
1) na.aggregate Create a logical vector ok which is TRUE for each numeric column and FALSE for other columns. Then use na.aggregate on the numeric columns.
library(zoo)
df <- data.frame(a = c(1, NA, 2), b = c("a", NA, "b")) # test data
ok <- sapply(df, is.numeric)
replace(df, ok, na.aggregate(df[ok]))
giving:
a b
1 1.0 a
2 1.5 <NA>
3 2.0 b
2) dplyr/tidyr Alternately use dplyr. df is from above and the output is the same.
library(dplyr)
library(tidyr)
df %>%
mutate(across(where(is.numeric), ~ replace_na(., mean(., na.rm =TRUE))))
3) collapse We could alternately use ftransformv in collapse.
library(collapse)
library(zoo)
ftransformv(df, is.numeric, na.aggregate)
4) base A base solution would be:
fill_na <- function(x) {
if (!is.numeric(x) || all(is.na(x))) x
else replace(x, is.na(x), mean(x, na.rm = TRUE))
}
replace(df, TRUE, lapply(df, fill_na))
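As for the warning itself, one plausible cause (an assumption, since the question does not show how df was created) is that df is a tibble. Unlike a base data frame, a tibble keeps df[, 20] as a one-column table instead of dropping it to a vector, and mean() does not accept a table. The sketch below reproduces the same behaviour in base R by using drop = FALSE; the data and the column name col20 are made up.

```r
# Hypothetical data; `col20` stands in for the 20th column from the question.
d <- data.frame(col20 = c(1, NA, 3))

one_col <- d[, "col20", drop = FALSE]  # still a data frame, as with a tibble
vec     <- d[["col20"]]                # plain numeric vector

is.numeric(one_col)  # FALSE, so mean() warns and returns NA
is.numeric(vec)      # TRUE

m_bad  <- suppressWarnings(mean(one_col, na.rm = TRUE))
m_good <- mean(vec, na.rm = TRUE)
```

Inside a loop, indexing with df[[i]] instead of df[, i] returns a vector for both data frames and tibbles.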

How to omit rows with a value contained in a separate vector [duplicate]

I have a vector of values and a data frame.
I would like to filter out the rows of the data frame which contain (in specific column) any of the values in my vector.
I'm trying to figure out if a person in the survey has a child who was also questioned in the survey - if so I would like to remove them from my data frame.
I have a list of respondent IDs, and vectors of mother/father personal IDs. If the ID appears in the mother/father column I would like to remove it.
df <- data.frame(ID = c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
Output should be a dataframe with three rows - Martin, Sammie, Seamus.
ID Name
1 101 Martin
2 102 Sammie
3 104 Seamus
df[!(df$ID %in% vec), ] # Or subset(df, !(ID %in% vec))
# ID Name
# 1 101 Martin
# 2 102 Sammie
# 4 104 Seamus
Data
df <- data.frame(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
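To see why this works, it can help to look at the intermediate logical vector. The sketch below uses the same data; resetting the row names is an optional extra step, added here only so that the result matches the row numbering in the question's expected output.

```r
df <- data.frame(ID = c(101, 102, 103, 104, 105),
                 Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103, 105, 108, 120, 150)

in_vec <- df$ID %in% vec   # TRUE where the ID appears in vec
res <- df[!in_vec, ]       # keep the rows whose ID is not in vec
rownames(res) <- NULL      # optional: renumber the remaining rows from 1
res
```

Values in vec that never occur in df$ID (108, 120, 150 here) are simply ignored by %in%.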
You can do this with filter from dplyr
library(tidyverse)
df2 <- df%>%
filter(!ID %in% vec)
If you create this as a data.table (after loading the data.table package and using the corrected example data):
library(data.table)
df <- data.table(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
# solution, slightly different from base R
df[!(ID %in% vec)]
data.table is likely to run a bit quicker than base R, which makes it useful for large datasets. Microbenchmarking the three approaches on a large dataset showed data.table to be a bit quicker than tidyverse and a lot faster than base R.
library(tidyverse)
library(data.table)
library(microbenchmark)
n <- 10000000
df <- data.frame("ID" = c(1:n), "Name" = sample(LETTERS, size = n, replace = TRUE))
dt <- data.table(df)
vec <- sample(1:n, size = n/10, replace = FALSE)
microbenchmark(dt[!(ID %in% vec)], df[!(df$ID %in% vec),], df%>% filter(!ID %in% vec))

Impute missing data with mean by group

I have a categorical variable with three levels (A, B, and C).
I also have a continuous variable with some missing values on it.
I would like to replace the NA values with the mean of their group. That is, missing observations from group A have to be replaced with the mean of group A.
I know I can just calculate each group's mean and replace missing values, but I'm sure there's another way to do so more efficiently with loops.
A <- subset(data, group == "A")
mean(A$variable, na.rm = TRUE)
A$variable[which(is.na(A$variable))] <- mean(A$variable, na.rm = TRUE)
Now, I understand I could do the same for group B and C, but perhaps a for loop (with if and else) might do the trick?
require(dplyr)
data %>% group_by(group) %>%
mutate(variable=ifelse(is.na(variable),mean(variable,na.rm=TRUE),variable))
For a faster, base-R version, you can use ave:
data$variable<-ave(data$variable,data$group,FUN=function(x)
ifelse(is.na(x), mean(x,na.rm=TRUE), x))
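A small worked example of the ave() approach, as a sketch on made-up data. The column names group and variable follow the question; the helper impute_mean is hypothetical and adds one guard that the one-liner above lacks: a group that is entirely NA is left untouched instead of being filled with NaN.

```r
# Hypothetical toy data using the question's column names.
data <- data.frame(
  group    = c("A", "A", "A", "B", "B", "C", "C"),
  variable = c(1, NA, 3, 10, NA, NA, 7)
)

# Replace NAs in a vector with the mean of its non-missing values.
# An all-NA vector is returned unchanged (its mean would be NaN).
impute_mean <- function(x) {
  if (all(is.na(x))) return(x)
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

# ave() applies impute_mean within each group and keeps the original row order.
data$variable <- ave(data$variable, data$group, FUN = impute_mean)
data
```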
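You could use the data.table package to achieve this:
tomean <- c("var1", "var2") # columns to impute
library(data.table)
setDT(dat)
# by = group (assumed name of the grouping column) makes each mean a group mean
dat[, (tomean) := lapply(tomean, function(x) {
x <- get(x)
x[is.na(x)] <- mean(x, na.rm = TRUE)
x
}), by = group]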

Removing infrequent rows in a data frame

Let's say I have a following very simple data frame:
a <- rep(5,30)
b <- rep(4,80)
d <- rep(7,55)
df <- data.frame(Column = c(a,b,d))
What would be the most generic way of removing all rows with a value that appears fewer than 60 times?
I know you could say "in this case it's just a", but in my real data there are many more frequencies, so I wouldn't want to specify them one by one.
I was thinking of writing a loop such that if length() of an 'i' is smaller than 60, these rows will be deleted, but perhaps you have other ideas. Thanks in advance.
A solution using dplyr.
library(dplyr)
df2 <- df %>%
group_by(Column) %>%
filter(n() >= 60)
Or a solution from base R
uniqueID <- sort(unique(df$Column)) # sorted so the order matches the groups returned by split()
targetID <- sapply(split(df, df$Column), function(x) nrow(x) >= 60)
df2 <- df[df$Column %in% uniqueID[targetID], , drop = FALSE]
We create a frequency table and then subset the rows based on the 'count' of values in 'Column'
tbl <- table(df$Column) >=60
subset(df, Column %in% names(tbl)[tbl])
Or with ave from base R
df[with(df, ave(Column, Column, FUN = length) >= 60), , drop = FALSE] # drop = FALSE keeps a one-column result as a data frame
Or we use data.table
library(data.table)
setDT(df)[, .SD[.N >= 60], Column]
Or another option with data.table is .I
setDT(df)[df[, .I[.N >=60], Column]$V1]
If there is more than one grouping column, place them in a list (or, more compactly, .())
setDT(df)[df[, .I[.N >=60], by = .(Column1, Column2)]$V1]
If there are many columns, we can also pass as a character string or object
colnms <- paste0("Column", 1:5)
setDT(df)[df[, .I[.N >=60], by = c(colnms)]$V1]
Using data.table
library(data.table)
setDT(df)
df[Column %in% df[, .N, by = Column][N >= 60, Column]]
There is also a variant to Eric Watt's answer which uses a join instead of %in%:
library(data.table)
setDT(df)
df[df[, .N, by = Column][N >= 60, .(Column)], on = "Column"]
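As a quick sanity check of the frequency-table idea, here is a base-R sketch run on the question's data. The names counts and keep are illustrative, and the threshold of 60 comes from the question. Only the value 4 occurs at least 60 times, so 80 rows should remain.

```r
# Data from the question: 5 appears 30 times, 4 appears 80 times, 7 appears 55 times.
df <- data.frame(Column = c(rep(5, 30), rep(4, 80), rep(7, 55)))

counts <- table(df$Column)            # frequency of each distinct value
keep   <- names(counts)[counts >= 60] # values (as strings) that are frequent enough

# Compare as character so the values line up with the table's names.
df2 <- df[as.character(df$Column) %in% keep, , drop = FALSE]

nrow(df2)
unique(df2$Column)
```

Matching on the table's names avoids relying on the order in which values first appear in the data.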

how can I apply a function to all dataframe variables?

I have a data frame with something like 90 variables and over 1 million observations. I want to calculate the percentage of NA rows for each variable. I have the following code:
sum(is.na(dataframe$variable) / nrow(dataframe) * 100)
My question is, how can I apply this function to all 90 variables, without having to type all variable names in the code?
Use lapply() with your method:
lapply(df, function(x) sum(is.na(x))/nrow(df)*100)
If you want to return a data.frame rather than a list (via lapply()) or a vector (via sapply()), you can use summarise_each from the dplyr package:
library(dplyr)
df %>%
summarise_each(funs(sum(is.na(.)) / length(.)))
or, even more concisely:
df %>% summarise_each(funs(mean(is.na(.))))
data
df <- data.frame(
x = 1:10,
y = 1:10,
z = 1:10
)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA
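Note that summarise_each() and funs() have been deprecated in later dplyr releases, so the calls above may warn or fail depending on the installed version. As a version-independent alternative, the sketch below uses only base R on the same example data: is.na() on a data frame returns a logical matrix, and the mean of each logical column is the proportion of NA values. The name pct_na is illustrative.

```r
df <- data.frame(x = 1:10, y = 1:10, z = 1:10)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA

# is.na(df) is a logical matrix; column means give the NA proportion,
# and multiplying by 100 turns it into a percentage per variable.
pct_na <- colMeans(is.na(df)) * 100
pct_na
```

The result is a named numeric vector with one entry per column, which is usually convenient for sorting or filtering variables by their share of missing values.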
