Apply filter to the table function - r

I'm looking for a way to execute a simple task faster than I am currently able to.
I want to use the table function in R on part of a dataframe. Of course it would be possible to first use subset and then table, but this is a bit tedious. (In my case, during a first inspection of the data, I want to check the frequency of NAs on individual variables in a multi-national survey for each of the 25 participating countries. So I'd need to create 25 subsets, make the table, and then remove the subsets again because I don't need them anymore.)
Here is some example data:
a <- c(1,1,1,1,1,2,2,2,2,2)
b <- c(1,3,99,99,2,3,2,99,1,1)
df <- cbind.data.frame(a,b)
And this is the workaround solution.
df1 <- subset(df, a == 1)
table(df1$b)
df2 <- subset(df, a == 2)
table(df2$b)
rm(df1, df2)
Is there a simpler way?
Also, I feel like I am spamming with ultra-basic questions like these. If anyone has a suggestion on how I could have found the answer directly I'd be happy to hear it. Other than trying some code myself, I googled terms like 'r apply filter to table', 'r filter table function', 'r table subset dataframe', etc.

Assuming 99 codes your NAs, the purrr package offers a convenient way to see how many there are in each column:
library(purrr)
df |>
map_df(~sum(. == 99))
a b
<int> <int>
1 0 3

Can you provide an example of the structure of the original data (multi-national survey)?
You could probably answer your question with much tidier code using the dplyr package, with functions such as
survey_data %>%
select(column1, column2, country, etc) %>% #choose your desired columns
group_by(country) %>%
summarise_all(~ sum(is.na(.))) # formula lambda; funs() is deprecated in recent dplyr
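For comparison, here is a minimal base R sketch of the same per-country NA count. The data frame survey_data and its columns country, q1 and q2 are made up for illustration, since the structure of the real survey is not shown in the question.

```r
# Hypothetical survey data: three countries, two variables with some NAs
survey_data <- data.frame(
  country = c("AT", "AT", "BE", "BE", "CH", "CH"),
  q1 = c(1, NA, 3, 4, NA, NA),
  q2 = c(NA, 2, 3, 4, 5, 6)
)

vars <- c("q1", "q2")  # assumed names of the variables to inspect

# Split the selected columns by country, count NAs per column in each piece,
# then transpose so that rows are countries and columns are variables
na_counts <- t(sapply(
  split(survey_data[vars], survey_data$country),
  function(d) colSums(is.na(d))
))
na_counts
```

No temporary subsets are left behind, because split() only creates the pieces inside the sapply() call.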

You could split on your a variable and use lapply to use table on each list like this:
lapply(split(df, df$a), \(x) table(x))
#> $`1`
#> b
#> a 1 2 3 99
#> 1 1 1 1 2
#>
#> $`2`
#> b
#> a 1 2 3 99
#> 2 2 1 1 1
Created on 2023-02-18 with reprex v2.0.2

Just use it in an lapply.
alv <- unique(df$a)
lapply(alv, \(x) table(subset(df, a == x, b))) |> setNames(alv)
# $`1`
# b
# 1 2 3 99
# 1 1 1 2
#
# $`2`
# b
# 1 2 3 99
# 2 1 1 1
However, it might be better to code 99 (and probably others) as NA,
df[] <- lapply(df, \(x) replace(x, x %in% c(99), NA))
and count the NAs in b for each individual a.
with(df, tapply(b, a, \(x) sum(is.na(x))))
# 1 2
# 2 1

Just use table() on the whole dataframe, and pull out the parts you want afterwards. You convert the a and b values to character values when indexing into the two-way table. For example,
a <- c(1,1,1,1,1,2,2,2,2,2)
b <- c(1,3,99,99,2,3,2,99,1,1)
df <- cbind.data.frame(a,b)
full <- table(df$a, df$b)
full["1",] # corresponds to subset a == 1
#> 1 2 3 99
#> 1 1 1 2
full["2",] # corresponds to subset a == 2
#> 1 2 3 99
#> 2 1 1 1
full[, "99"] # corresponds to subset b == 99
#> 1 2
#> 2 1
Created on 2023-02-18 with reprex v2.0.2
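If the 99 codes are first recoded to real NA values, the same two-way table idea still works, provided table() is told to keep NAs. A small sketch under that assumption (the recoding step is not part of the answer above):

```r
a <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
b <- c(1, 3, 99, 99, 2, 3, 2, 99, 1, 1)

# Assumption: 99 is the only missing-value code
b[b == 99] <- NA

# useNA = "ifany" adds an NA column when missing values are present
full <- table(a, b, useNA = "ifany")

# The NA column has NA as its name, so select it with is.na()
na_per_group <- full[, is.na(colnames(full))]
na_per_group
```

The result holds one NA count per value of a, which is what the question asks for per country.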


R function to find the sum of c given the column values

I want to create a function that I can simulate n number of times. My ultimate goal is to find the sum of c for each of the n simulations. I am a beginner in R coding, so I am just starting to practice with for loops and if else statements.
This is what I hope to achieve as of now: If a > b, c would be "2" and if a < b, c would be "-2". If a = b, c would be determined by the a and b values of the NEXT row. This is what I have so far, but I keep getting errors. I would like to know if what I have for a = b is how I should approach this. Any help is appreciated.
a<-c(5,6,7,8,9,10,1,4,6,7)
b<-c(4,6,8,5,3,4,5,2,1,3)
c<-c(0,0,0,0,0,0,0,0,0,0)
df<-data.frame(a,b,c)
if(df$a > df$b){
df$c<- c(2)}
else if(df$a < df$b){
df$c<- c(-2)}
else if(df$a == df$b){ # a=b
if(df$a[+1,] > df$b[+1,]) {
df$c<- c(2)}
else(df$a[+1,] < df$b[+1,]){
df$c<- c(-2) }
}
else
print("error")
}
sum(df$c)
The problem
if() and else in R are meant for control flow and are not vectorized. In plain English, this means that if() expects a condition that evaluates to a single TRUE or FALSE. When you do df$a > df$b you get a logical vector with one element per row of your dataframe. In R versions before 4.2.0, if() then uses only the first element and gives you a warning, which leads to wrong answers; from R 4.2.0 onwards it is an error.
A better solution
I think you are looking for ifelse() which is vectorized. And since you have nested if-else statements you are probably better off with dplyr::case_when().
Here is an example which also fixes cases where a == b for multiple rows:
# Note that I've added two consecutive rows where a == b
a <- c(5,6,6,7,8,9,10,1,4,6,7)
b <- c(4,6,6,8,5,3,4,5,2,1,3)
df <- data.frame(a, b)
library(dplyr)
df %>%
mutate(
c = case_when(
a > b ~ 2,
a < b ~ -2,
# If not a > b nor a < b is TRUE, they must be equal,
# so we set all other cases to NA...
TRUE ~ NA_real_
)
) %>%
# ... and then we use fill() to replace NAs with the first
# non-NA value after it
tidyr::fill(c, .direction = "up")
#> a b c
#> 1 5 4 2
#> 2 6 6 -2
#> 3 6 6 -2
#> 4 7 8 -2
#> 5 8 5 2
#> 6 9 3 2
#> 7 10 4 2
#> 8 1 5 -2
#> 9 4 2 2
#> 10 6 1 2
#> 11 7 3 2
Created on 2022-03-30 by the reprex package (v2.0.1)
How this works:
ifelse() works like if() and else() in your code, but it accepts multiple values
case_when() acts like nested ifelse() statements, so it will first check if a > b and set those values equal to 2, next it will check the remaining rows if a < b and set those to -2 and so on.
In cases where a is not less nor more than b, they must be equal. We set these cases to NA.
Afterwards, we use tidyr::fill() to replace each missing value with the first non-missing value after it. This handles cases where there are multiple consecutive rows of a == b.
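To make the fill step less of a black box, here is a rough base R sketch of the same logic without dplyr or tidyr. The helper fill_up() is a hypothetical stand-in written for this illustration, not a function from any package.

```r
a <- c(5, 6, 6, 7, 8, 9, 10, 1, 4, 6, 7)
b <- c(4, 6, 6, 8, 5, 3, 4, 5, 2, 1, 3)

# Vectorized step: 2 or -2 where a and b differ, NA where they are equal
c_raw <- ifelse(a == b, NA, 2 * sign(a - b))

# Hypothetical helper: walk from the second-to-last element backwards and
# copy the following value into every NA (an "upward" fill)
fill_up <- function(x) {
  for (i in rev(seq_along(x))[-1]) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

c_filled <- fill_up(c_raw)
sum(c_filled)
```

Note that a tie in the very last row has no following row to borrow from, so it stays NA in this sketch.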
Edit: two users already pointed out what to do if there are consecutive rows of a == b. Good opportunity to dive into the tidyverse (as already suggested by others):
library(dplyr)
library(tidyr)
df <- data.frame(
a = c(5,6,7,8,9,10,1,4,6,7),
b = c(4,6,8,5,3,4,5,2,1,3)
)
df %>%
mutate(c = ifelse(a == b, NA, 2 * sign(a-b))) %>% ## (1)
fill(c, .direction = 'up') ## (2)
(1) set c to NA when a == b
(2) 'fill' (replace) NAs with the next available value down the rows
When starting out with R, it's helpful to know that vectorizing (the x[n] thing) usually makes your code more concise and, in certain situations, much faster than using loops. In your case:
df$c <- 2 * sign(df$a - df$b) ## see ?sign
z <- df$c == 0 ## see (1)
df$c[z] = lead(df$c,1)[z] ## see (2)
(1) equal numbers have sign zero, z is a boolean vector indicating the positions (rows) where a == b (thus: z is TRUE)
(2) change c only at the positions where z is TRUE. lead and lag are functions taking a vector and returning its shifted (by a given number of positions) vector.
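The shift-and-replace idea can be tried without dplyr. In the sketch below, c(x[-1], NA) is used as a base R stand-in for lead(x, 1); the variable names are chosen for this illustration only.

```r
# One tie (row 2): the value is taken from the next row
a <- c(5, 6, 7)
b <- c(4, 6, 8)
cc <- 2 * sign(a - b)        # 2, 0, -2
z <- cc == 0                 # TRUE where a == b
shifted <- c(cc[-1], NA)     # every element moved up by one position
cc[z] <- shifted[z]
cc

# Two consecutive ties (rows 2 and 3): a single shift is not enough,
# because row 2 picks up the 0 of row 3
a2 <- c(5, 6, 6, 7)
b2 <- c(4, 6, 6, 8)
cc2 <- 2 * sign(a2 - b2)     # 2, 0, 0, -2
z2 <- cc2 == 0
cc2[z2] <- c(cc2[-1], NA)[z2]
cc2
```

The second case shows why a single shift only covers isolated ties; runs of equal rows need a fill that carries values across several rows.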
Here is a tidyverse solution. This will also work with multiple equal a and b in series (I have added row 3 to the data to demonstrate).
It relies on cumsum() to group the data, such that rows with a == b are in the same group as the next row that is a != b. Then it sets c to the last value in the group.
library(tidyverse)
a<-c(5,6,5,7,8,9,10,1,4,6,7)
b<-c(4,6,5,8,5,3,4,5,2,1,3)
df <-data.frame(a,b)
df |>
mutate(c = ifelse(a>b, 2, -2), # Determines c for `a != b` cases
grp = rev(cumsum(rev(a != b)))) |> # create group variable, use rev() since we want backward cumsum
group_by(grp) |>
mutate(c = last(c)) |>
ungroup() |>
select(-grp)
#> # A tibble: 11 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 5 4 2
#> 2 6 6 -2
#> 3 5 5 -2
#> 4 7 8 -2
#> 5 8 5 2
#> 6 9 3 2
#> 7 10 4 2
#> 8 1 5 -2
#> 9 4 2 2
#> 10 6 1 2
#> 11 7 3 2
Created on 2022-03-30 by the reprex package (v2.0.1)
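The reversed cumsum() grouping is easier to follow when the intermediate vector is printed. The following base R sketch uses a small made-up example and ave() in place of group_by() and mutate(); it is an illustration of the grouping idea, not the answer's exact code.

```r
a <- c(5, 6, 5, 9)
b <- c(4, 6, 5, 8)

# Provisional value: correct wherever a != b, a placeholder for ties
c0 <- ifelse(a > b, 2, -2)

# Counting a != b from the bottom up puts each run of ties into the same
# group as the next row below it where a and b differ
grp <- rev(cumsum(rev(a != b)))
grp

# Within each group, every row takes the value of the group's last row
c_final <- ave(c0, grp, FUN = function(v) v[length(v)])
c_final
```

Rows 2 and 3 are ties, so they share a group with row 4 and inherit its value.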

Aggregate by NA in R

Does anybody know how to aggregate by NA in R?
If you take the example below
a <- matrix(1,5,2)
a[1:2,2] <- NA
a[3:5,2] <- 2
aggregate(a[,1], by=list(a[,2]), sum)
The output is:
Group.1 x
2 3
But is there a way to get the output to include NAs in the output like this:
Group.1 x
2 3
NA 2
Thanks
Instead of aggregate(), you may want to consider rowsum(). It is actually designed for this exact operation on matrices and is known to be much faster than aggregate(). We can add NA to the factor levels of a[, 2] with addNA(). This will assure that NA shows up as a grouping variable.
rowsum(a[, 1], addNA(a[, 2]))
# [,1]
# 2 3
# <NA> 2
If you still want to use aggregate(), you can incorporate addNA() as well.
aggregate(a[, 1], list(Group = addNA(a[, 2])), sum)
# Group x
# 1 2 3
# 2 <NA> 2
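To see why addNA() makes the difference, it can help to look at the factor levels it produces. A short sketch using the same values as the matrix in the question, with tapply() as one more base R grouping function:

```r
x <- rep(1, 5)
g <- c(NA, NA, 2, 2, 2)

# factor() drops NA from the levels, addNA() keeps it as a level of its own
levels(factor(g))
f <- addNA(g)
levels(f)

# Because NA is now a regular level, it forms its own group
sums <- tapply(x, f, sum)
sums
```

Grouping functions generally skip values that are not among the factor levels, which is why NA has to become an explicit level first.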
And one more option with data.table -
library(data.table)
as.data.table(a)[, .(x = sum(V1)), by = .(Group = V2)]
# Group x
# 1: NA 2
# 2: 2 3
Use summarize from dplyr
library(dplyr)
a %>%
as.data.frame %>%
group_by(V2) %>%
summarize(V1_sum = sum(V1))
Using sqldf:
library(sqldf)
a <- as.data.frame(a)
sqldf("SELECT V2 [Group], SUM(V1) x
FROM a
GROUP BY V2")
Output:
Group x
1 NA 2
2 2 3
stats package
A variation of AdamO's proposal:
data.frame(xtabs( V1 ~ V2 , data = a,na.action = na.pass, exclude = NULL))
Output:
V2 Freq
1 2 3
2 <NA> 2
You can also try aggregating by is.na(a[,2]) instead.
aggregate(a[,1], by=list(is.na(a[,2])), sum)
# Group.1 x
# 1 FALSE 3
# 2 TRUE 2
If you want a finer distinction than just NA or not, then you may want to define a new variable that uses a previously unused value to denote NA (a factor would be more elegant, but a numeric vector is the simplest):
b <- a[,2]
b[is.na(b)] <- 999
aggregate(a[,1], by=list(b), sum)
# Group.1 x
# 1 2 3
# 2 999 2
The addNA solution of Rich doesn't require any substantial change to the aggregate syntax, so I think it's the best solution. I'll point out that another option, which produces output similar to table (and thus can be coerced into a data.frame structure similar to that of aggregate) is xtabs.
xtabs(a[, 1] ~ a[, 2], addNA=T)
Gives:
Group.1 x
1 2 3
2 <NA> 2
Another "trick" I see is assigning a missing code to these data. We all like the NA output of R, but assigning a missing code to a grouping variable is a good coding exercise. We take it so that it has one more digit than the largest value in the dataset and is of the form -999...99.
codemiss <- function(x) -(10^(floor(log(max(abs(x), na.rm = TRUE), base = 10)) + 2) - 1)
works in general.
Then you get
a[, 2][is.na(a[, 2])] <- codemiss(a[, 2])
And:
aggregate(a[, 1], list(a[, 2]), sum)
Gives you:
Group.1 x
1 -99 2
2 2 3

Complete.cases used on list of data frames

I'm trying to remove all the NA values from a list of data frames. The only way I have got it to work is by cleaning the data with complete.cases in a for loop. Is there another way of doing this with lapply? I have been trying for a while to no avail. Here is the code that works.
I start with
data_in <- lapply (file_name,read.csv)
Then have:
clean_data <- list()
for (i in seq_along(id)) {
clean_data[[i]] <- data_in[[i]][complete.cases(data_in[[i]]), ]
}
But what I tried to get to work was using lapply all the way like this.
comp <- lapply(data_in, complete.cases)
clean_data <- lapply(data_in, data_in[[id]][comp,])
Which returns this error "Error in [.default(xj, i) : invalid subscript type 'list' "
What I'd like to know is some alternatives, or whether I was going about this right. And why didn't the last example work?
Thank you so much for your time. Have a nice day.
I'm not sure what you expected with
clean_data <- lapply(data_in, data_in[[id]][comp,])
The second parameter to lapply should be a proper function to which each member of the data_in list will be passed one at a time. Your expression data_in[[id]][comp,] is not a function. I'm not sure where you expected id to come from, but lapply does not create magic variables for you like that. Also, at this point comp is now a list itself of indices. You are making no attempt to iterate over this list in sync with your data_in list. If you wanted to do it in two separate steps, a more appropriate approach would be
comp <- lapply(data_in, complete.cases)
clean_data <- Map(function(d,c) {d[c,]}, data_in, comp)
Here we use Map to iterate over the data_in and comp lists simultaneously. They each get passed in to the function as a parameter and we can do the proper extraction that way. Otherwise, if we wanted to do it in one step, we could do
clean_data <- lapply(data_in, function(x) x[complete.cases(x),])
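Since the question has no reproducible data, here is a small self-contained sketch of both variants. The two data frames in data_in are made up for illustration and stand in for the CSV files read in the question.

```r
# Hypothetical stand-in for lapply(file_name, read.csv)
data_in <- list(
  data.frame(x = 1:3, y = c(1, NA, 3)),
  data.frame(x = c(NA, 2), y = c(5, 6))
)

# Two steps: one logical vector per data frame, then Map() walks over
# both lists in parallel
comp <- lapply(data_in, complete.cases)
two_step <- Map(function(d, keep) d[keep, ], data_in, comp)

# One step: compute and apply the row filter inside the same function
one_step <- lapply(data_in, function(d) d[complete.cases(d), ])

sapply(one_step, nrow)
```

Both variants keep only the rows without any NA; the one-step version avoids having to keep two lists in sync.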
welcome to SO, please provide some working code next time
here is how i would do it with na.omit (since complete.cases only returns a logical)
(dat.l <- list(dat1 = data.frame(x = 1:2, y = c(1, NA)),
dat2 = data.frame(x = 1:3, y = c(1, NA, 3))))
# $dat1
# x y
# 1 1 1
# 2 2 NA
#
# $dat2
# x y
# 1 1 1
# 2 2 NA
# 3 3 3
Map(na.omit, dat.l)
# $dat1
# x y
# 1 1 1
#
# $dat2
# x y
# 1 1 1
# 3 3 3
Do you mean like the below?
> lst
$a
a
1 1
2 2
3 NA
4 3
5 4
$b
b
1 1
2 NA
3 2
4 3
5 4
$d
d e
1 NA 1
2 NA 2
3 3 3
4 4 NA
5 5 NA
> f <- function(x) x[complete.cases(x),]
> lapply(lst, f)
$a
[1] 1 2 3 4
$b
[1] 1 2 3 4
$d
d e
3 3 3
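Note that $a and $b come back as plain vectors in the output above, because subsetting a single-column data frame by rows drops the data frame structure by default. If the results should stay data frames, adding drop = FALSE is one option; a minimal sketch with a made-up one-column example:

```r
lst <- list(a = data.frame(a = c(1, 2, NA, 3, 4)))

# drop = FALSE keeps the result a data frame even with a single column
f_keep <- function(x) x[complete.cases(x), , drop = FALSE]

res <- lapply(lst, f_keep)
is.data.frame(res$a)
nrow(res$a)
```

This matters when later steps expect every list element to be a data frame, for example when binding them together.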
file_name[complete.cases(file_name), ]
complete.cases() returns only a logical vector, so it has to be used as a row index. This should do the job and return only the rows with no NA values.

Extract only first line in a data frame from several subgroups that satisfy a conditional

I have a data frame similar to the dummy example here:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
In the original data frame, there are many more groups, each with 10 values. For each group (a,b or c) I would like to extract the first line where value!=NA, but only the first line where this is true. As in a group there could be several values different from NA and from each other I can't simply subset.
I was imagining something like this using plyr and a conditional, but I honestly have no idea what the conditional should take:
ddply<-(df,.(Group),function(sub_data){
for(i in 1:length(sub_data$value)){
if(sub_data$Value!='NA'){'take value but only for the first non NA')
return(first line that satisfies)
})
Maybe this is easy with other strategies that I don't know of
Any suggestion is very much appreciated!
I know this has been answered but for this you should be looking at the data.table package. It provides a very expressive and terse syntax for doing what you ask:
library(data.table)
df<-data.table(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
> df[ Value != "NA", .SD[1], by=Group ]
Group Value
1: a 10
2: b 4
3: c 2
Do yourself a favor and learn data.table
Some other notes:
You can easily convert data.frames to data.tables
I think that you don't want "NA" but simply NA in your example, in that case the syntax is:
df[ ! is.na(Value), .SD[1], by=Group ]
Since you suggested plyr in the first place:
ddply(subset(df, !is.na(Value)), .(Group), head, 1L)
That assumes you have NAs and not 'NA's. If the latter (not recommended), then:
ddply(subset(df, Value != 'NA'), .(Group), head, 1L)
Note how concise this is. I would agree with using plyr.
If you're willing to use actual NA's vs strings, then the following should give you what you're looking for:
df <- data.frame(Group=rep(letters[1:3], each=3),
Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
print(df)
## Group Value
## 1 a <NA>
## 2 a <NA>
## 3 a 10
## 4 b <NA>
## 5 b 4
## 6 b 8
## 7 c <NA>
## 8 c <NA>
## 9 c 2
df.1 <- by(df, df$Group, function(x) {
head(x[complete.cases(x),], 1)
})
print(df.1)
## df$Group: a
## Group Value
## 3 a 10
## ------------------------------------------------------------------------
## df$Group: b
## Group Value
## 5 b 4
## ------------------------------------------------------------------------
## df$Group: c
## Group Value
## 9 c 2
First you should take care of NA's:
options(stringsAsFactors=FALSE)
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
And then maybe something like this would do the trick:
for(i in unique(df$Group)) {
for(j in df$Value[df$Group==i]) {
if(!is.na(j)) {
print(paste(i,j))
break
}
}
}
Assuming that Value is actually numeric, not character.
> df <- data.frame(Group=rep(letters[1:3],each=3),
Value=c(NA, NA, 10, NA, 4, 8, NA, NA, 2))
> do.call(rbind, lapply(split(df, df$Group), function(x){
x[ is.na(x[,2]) == FALSE, ][1,]
}))
## Group Value
## a a 10
## b b 4
## c c 2
I don't see any solutions using aggregate(...), which would be the simplest:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c('NA','NA','10','NA','4','8','NA','NA','2'))
aggregate(Value~Group,df[df$Value!="NA",],head,1)
# Group Value
# 1 a 10
# 2 b 4
# 3 c 2
If your df contains actual NA, and not "NA" as in your example, then use this:
df<-data.frame(Group=rep(letters[1:3],each=3),Value=c(NA,NA,'10',NA,'4','8',NA,NA,'2'))
aggregate(Value~Group,df[!is.na(df$Value),],head,1)
Group Value
1 a 10
2 b 4
3 c 2
Your life would be easier if you marked missing values with NA and not as a character string 'NA'; the former is really missing to R and it has tools to work with such missingness. The latter ('NA') is really not missing except for the meaning that this string has to you alone; R cannot divine that information directly. Assuming you correct this, then the solution below is one way to go about doing this.
Similar in spirit to @hrbrmstr's by() but to my eyes aggregate() gives nicer output:
> foo <- function(x) head(x[complete.cases(x)], 1)
> aggregate(Value ~ Group, data = df, foo)
Group Value
1 a 10
2 b 4
3 c 2
> aggregate(df$Value, list(Group = df$Group), foo)
Group x
1 a 10
2 b 4
3 c 2
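Another base R route, sketched here as an alternative to the answers above, first turns the 'NA' strings into real NA values and then uses duplicated() to keep the first remaining row of each group. It assumes that Value is meant to be numeric.

```r
df <- data.frame(Group = rep(letters[1:3], each = 3),
                 Value = c('NA', 'NA', '10', 'NA', '4', '8', 'NA', 'NA', '2'),
                 stringsAsFactors = FALSE)

# Turn the 'NA' strings into real missing values, then convert to numeric
df$Value[df$Value == "NA"] <- NA
df$Value <- as.numeric(df$Value)

# Drop the missing rows, then keep the first row seen for each group
non_missing <- df[!is.na(df$Value), ]
first_rows <- non_missing[!duplicated(non_missing$Group), ]
first_rows
```

duplicated() marks every repeat of a group label after its first occurrence, so negating it keeps exactly one row per group, in the original row order.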

Remove columns from dataframe where some of values are NA

I have a dataframe where some of the values are NA. I would like to remove these columns.
My data.frame looks like this
v1 v2
1 1 NA
2 1 1
3 2 2
4 1 1
5 2 2
6 1 NA
I tried to estimate the col mean and select the column means !=NA. I tried this statement, it does not work.
data=subset(Itun, select=c(is.na(colMeans(Itun))))
I got an error,
error : 'x' must be an array of at least two dimensions
Can anyone give me some help?
The data:
Itun <- data.frame(v1 = c(1,1,2,1,2,1), v2 = c(NA, 1, 2, 1, 2, NA))
This will remove all columns containing at least one NA:
Itun[ , colSums(is.na(Itun)) == 0]
An alternative way is to use apply:
Itun[ , apply(Itun, 2, function(x) !any(is.na(x)))]
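Looking at the intermediate values shows what the column filter does. One caveat worth knowing: when only a single column survives, as with this data, the [ , cols] form returns a plain vector unless drop = FALSE is added. A short sketch:

```r
Itun <- data.frame(v1 = c(1, 1, 2, 1, 2, 1), v2 = c(NA, 1, 2, 1, 2, NA))

# Number of NAs per column, and the columns that have none
na_per_col <- colSums(is.na(Itun))
keep <- na_per_col == 0
na_per_col

# With one remaining column, the default drops to a vector
as_vector <- Itun[, keep]

# drop = FALSE preserves the data frame
as_df <- Itun[, keep, drop = FALSE]
str(as_df)
```

Using single-bracket list-style indexing, Itun[keep], is another way to keep the data frame structure.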
Here's a convenient way to do it using the dplyr function select_if(). Combine not (!), any() and is.na(), which is equivalent to selecting all columns that don't contain any NA values.
library(dplyr)
Itun %>%
select_if(~ !any(is.na(.)))
Alternatively, select(where(~FUNCTION)) can be used:
library(dplyr)
(df <- data.frame(x = letters[1:5], y = NA, z = c(1:4, NA)))
#> x y z
#> 1 a NA 1
#> 2 b NA 2
#> 3 c NA 3
#> 4 d NA 4
#> 5 e NA NA
# Remove columns where all values are NA
df %>%
select(where(~!all(is.na(.))))
#> x z
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e NA
# Remove columns with at least one NA
df %>%
select(where(~!any(is.na(.))))
#> x
#> 1 a
#> 2 b
#> 3 c
#> 4 d
#> 5 e
You can use transpose twice:
newdf <- t(na.omit(t(df)))
data[,!apply(is.na(data), 2, any)]
A base R method related to the apply answers is
Itun[!unlist(vapply(Itun, anyNA, logical(1)))]
v1
1 1
2 1
3 2
4 1
5 2
6 1
Here, vapply is used as we are operating on a list, and, unlike apply, it does not coerce the object into a matrix. Also, since we know that the output for each column will be a logical vector of length 1, we can tell vapply so and potentially get a little speed boost. For the same reason, I used anyNA instead of any(is.na()).
Another alternative, in base R rather than dplyr, is to make use of the Filter function
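A quick look at what vapply() returns here may help. In the sketch below the unlist() call is left out, on the assumption that it is not needed because vapply() already returns a named logical vector.

```r
Itun <- data.frame(v1 = c(1, 1, 2, 1, 2, 1), v2 = c(NA, 1, 2, 1, 2, NA))

# One TRUE/FALSE per column; logical(1) declares the expected result shape
has_na <- vapply(Itun, anyNA, logical(1))
has_na

# List-style indexing keeps the result a data frame
clean <- Itun[!has_na]
names(clean)
```

If a function passed to vapply() ever returned something other than a single logical value, vapply() would stop with an error instead of silently changing the result type.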
Filter(function(x) !any(is.na(x)), Itun)
With data.table it would be a little more cumbersome:
setDT(Itun)[,.SD,.SDcols=setdiff((1:ncol(Itun)),
which(colSums(is.na(Itun))>0))]
You can also try the following, which drops only the columns that are entirely NA:
df <- df[,colSums(is.na(df))<nrow(df)]
