Remove all factorial variables with more than 50% NA [duplicate] - r

This question already has answers here:
Deleting columns from a data.frame where NA is more than 15% of the column length [duplicate]
(2 answers)
Closed 5 years ago.
I have a CSV file with headers. Some of the features (columns) are factorial, some are numerical.
For the factorial variables I have a lot of columns with a lot of NAs, e.g.:
Num1 Fact1 Num2 Fact2 Fact3
9889 Bla 23 BBxv NA
NA NA 456 BBxz NA
NA Abcd 3 BBxx Jet
NA NA 100 BBxy NA
NA NA NA NA NA
I want to remove all factor columns that contain more than 50% NAs.
e.g. the resulting data frame should be:
Num1 Num2 Fact2
9889 23 BBxv
NA 456 BBxz
NA 3 BBxx
NA 100 BBxy
NA NA NA
Moreover, is there a way to also remove numerical columns that contain more than 50% NAs, in the SAME process?
e.g. after the cleanup the resulting data frame would be one that contains only Num2 and Fact2 columns.

Try:
dff[colMeans(is.na(dff)) <= 0.5]
Should get:
Num2 Fact2
23 BBxv
456 BBxz
3 BBxx
100 BBxy
NA <NA>
Edit:
If you're looking to remove columns with more than 50% of zeros in the same process, give the following a try:
dff[colMeans(is.na(dff)) <= 0.5 & colMeans(dff == 0, na.rm = TRUE) <= 0.5]
I hope this helps.
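The one-liner above drops every column that is more than 50% NA, regardless of type. For the first part of the question (dropping only the factor columns), a minimal sketch might look like the following. The names na_frac and is_fact are illustrative, and the data frame is rebuilt from the question's example with stringsAsFactors = TRUE, which is an assumption about how the CSV was read.

```r
# Example data from the question; text columns are stored as factors (assumption).
dff <- data.frame(
  Num1  = c(9889, NA, NA, NA, NA),
  Fact1 = c("Bla", NA, "Abcd", NA, NA),
  Num2  = c(23, 456, 3, 100, NA),
  Fact2 = c("BBxv", "BBxz", "BBxx", "BBxy", NA),
  Fact3 = c(NA, NA, "Jet", NA, NA),
  stringsAsFactors = TRUE
)

# Fraction of NA values in each column.
na_frac <- colMeans(is.na(dff))

# TRUE for factor columns, FALSE otherwise.
is_fact <- vapply(dff, is.factor, logical(1))

# Drop only the factor columns that are more than 50% NA.
factors_cleaned <- dff[!(is_fact & na_frac > 0.5)]

# Drop every column that is more than 50% NA, whatever its type.
all_cleaned <- dff[na_frac <= 0.5]
```

If the text columns were read as character rather than factor, is.character would be the check to use in place of is.factor.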

Related

R - Select all rows that have one NA value at most? [duplicate]

This question already has answers here:
How to delete rows from a dataframe that contain n*NA
(4 answers)
Closed 3 days ago.
I'm trying to impute my data and keep as many observations as I can. I want to select observations that have 1 NA value at most from the PimaIndiansDiabetes2 dataset in the mlbench package, loaded with data(PimaIndiansDiabetes2, package = "mlbench").
For example:
Var1 Var2 Var3
1 NA NA
2 34 NA
3 NA NA
4 NA 55
5 NA NA
6 40 28
What I would like returned:
Var1 Var2 Var3
2 34 NA
4 NA 55
6 40 28
This code returns rows with NA values, and I know that I could use merge() to join all observations with 1 NA value to the observations without NA values. I'm not sure how to extract those, though.
na_rows <- df[!complete.cases(df), ]
A base R solution:
df[rowSums(is.na(df)) <= 1, ]
Its dplyr equivalent:
library(dplyr)
df %>%
filter(rowSums(is.na(pick(everything()))) <= 1)
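As a worked check of the base R approach, here is a small sketch that rebuilds the example table from the question and applies the same filter. The intermediate name na_per_row is only for illustration.

```r
# Example data from the question.
df <- data.frame(
  Var1 = 1:6,
  Var2 = c(NA, 34, NA, NA, NA, 40),
  Var3 = c(NA, NA, NA, 55, NA, 28)
)

# is.na(df) is a logical matrix; rowSums counts the TRUEs (NAs) in each row.
na_per_row <- rowSums(is.na(df))

# Keep the rows with at most one NA.
kept <- df[na_per_row <= 1, ]
kept
```

Rows 2, 4 and 6 survive, which matches the output the question asks for.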

Move data from small data frame to columns in large dataframe with R [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
(14 answers)
Closed last year.
I have two data frames in R. There is not an ID of any sort in DF1 to use to map the rows to - I just need the entire column copied over for a data migration.
DF1 has 1349 named columns, and empty rows.
DF2 has 10 named columns and 2990 rows of sample data.
I made a small scale example:
DF1 <- data.frame(matrix(ncol = 10, nrow = 0))
colnames(DF1) <- c('one','two','three','four','five','six','seven','eight','nine','ten')
one <- c(1,54,7,3,6,3)
seven <- c('MLS','Marshall','AAE','JC','AAA','EXE')
DF2 <- data.frame(one,seven)
The column names are the same, but they are not blocked together in DF1 - they are randomly dispersed.
I want to find an efficient way of mapping the 10 columns and all of the rows from DF2 to DF1 without needing to type in each column name, as I will also need to do this with a much larger data frame later.
I expect the rest of the columns in DF1 to be blank/null once the 'imported' columns from DF2 have been added -- this is okay. Is there an easy way to do this?
Thanks!
dplyr has a nice utility for this:
dplyr::bind_rows(DF1, DF2)
# one two three four five six seven eight nine ten
# 1 1 NA NA NA NA NA MLS NA NA NA
# 2 54 NA NA NA NA NA Marshall NA NA NA
# 3 7 NA NA NA NA NA AAE NA NA NA
# 4 3 NA NA NA NA NA JC NA NA NA
# 5 6 NA NA NA NA NA AAA NA NA NA
# 6 3 NA NA NA NA NA EXE NA NA NA
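If you would rather stay in base R, one possible sketch is to build an all-NA frame with DF1's column layout and then copy over the columns the two frames share by name. The names out and shared are illustrative and not part of the original answer.

```r
# Small-scale example from the question.
DF1 <- data.frame(matrix(ncol = 10, nrow = 0))
colnames(DF1) <- c('one','two','three','four','five','six','seven','eight','nine','ten')
one <- c(1, 54, 7, 3, 6, 3)
seven <- c('MLS', 'Marshall', 'AAE', 'JC', 'AAA', 'EXE')
DF2 <- data.frame(one, seven)

# An all-NA frame with DF1's columns and as many rows as DF2.
out <- as.data.frame(matrix(NA, nrow = nrow(DF2), ncol = ncol(DF1),
                            dimnames = list(NULL, names(DF1))))

# Copy over every column that exists in both frames, matched by name.
shared <- intersect(names(DF1), names(DF2))
out[shared] <- DF2[shared]
out
```

Because the columns are matched by name, it does not matter where they sit in DF1, and no column name has to be typed by hand.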

Convert a single column into multiple columns based on delimiter in R

I have the following dataframe:
ID Parts
-- -----
1 A:B::
2 X2:::
3 ::J4:
4 A:C:D:G4:X6
I would like to convert the Parts column into multiple columns, splitting on the : delimiter, so that it looks like:
ID A B X2 J4 C D G4 X6 ........
-- - - -- -- - - -- --
1 A B na na na na na na
2 na na X2 na na na na na
3 na na na J4 na na na na
4 A na na na C D G4 X6
where I would not know the number of potential columns in advance.
I have met my match on this one - I can use strsplit() with the delimiter, but only with a fixed number of entities in the Parts column.
You can use a combination of tidyr::separate, tidyr::pivot_longer, and tidyr::pivot_wider. First, count the delimiters with stringr::str_count to determine how many columns to split Parts into; this is the number of pieces per row, not the number of unique values (see How it works):
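library(dplyr)
library(tidyr)
library(stringr)
# Number of pieces per row = number of delimiters + 1
n_col <- max(stringr::str_count(df$Parts, ":")) + 1
df %>%
  # Split Parts into col1..colN; rows with fewer pieces are padded with NA
  tidyr::separate(Parts, into = paste0("col", 1:n_col), sep = ":", fill = "right") %>%
  # Empty strings become NA (character columns only, so the integer ID is left alone)
  dplyr::mutate(across(where(is.character), ~dplyr::na_if(., ""))) %>%
  # Stack all pieces into one column, then drop the helper names and the NAs
  tidyr::pivot_longer(-ID) %>%
  dplyr::select(-name) %>%
  tidyr::drop_na() %>%
  # One column per distinct value, filled with NA where absent
  tidyr::pivot_wider(id_cols = ID,
                     names_from = value)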
ID A B X2 J4 C D G4 X6
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 A B NA NA NA NA NA NA
2 2 NA NA X2 NA NA NA NA NA
3 3 NA NA NA J4 NA NA NA NA
4 4 A NA NA NA C D G4 X6
How it works
You do not need to know the number of unique values with this code -- the pivots take care of that. What you do need to know is how many new columns Parts will be split into with separate. That's easy to do by counting the number of delimiters with str_count and adding one. This way you have the appropriate number of columns to separate Parts into by your delimiter.
This is because pivot_longer will create a two column dataframe with repeated ID and a column with the delimited values of Parts -- an ID, Parts pairing. Then when you use pivot_wider the columns are automatically created for each unique value of Parts and the value is retained within the column. This function automatically fills with NA where an ID and Parts combination is not found.
Try running this pipe by pipe to better understand if need be.
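For comparison, the same idea (split, collect the distinct values, then lay them out one column per value) can be sketched in base R with strsplit, which the question mentions. This is an illustrative alternative, not the answer's method; the names parts, vals, wide and out are assumptions of this sketch.

```r
# Example data from the question.
df <- data.frame(ID = 1:4,
                 Parts = c("A:B::", "X2:::", "::J4:", "A:C:D:G4:X6"),
                 stringsAsFactors = FALSE)

# Split each row on ":" and drop the empty pieces.
parts <- strsplit(df$Parts, ":", fixed = TRUE)
parts <- lapply(parts, function(p) p[nzchar(p)])

# Distinct values, in order of first appearance; these become the columns.
vals <- unique(unlist(parts))

# For each row, keep the value where it is present and NA otherwise.
wide <- t(vapply(parts,
                 function(p) ifelse(vals %in% p, vals, NA_character_),
                 character(length(vals))))
colnames(wide) <- vals

out <- data.frame(ID = df$ID, wide, stringsAsFactors = FALSE)
out
```

As with the pivot approach, the number of distinct values does not need to be known in advance.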
Data
lines <- "
ID Parts
1 A:B::
2 X2:::
3 ::J4:
4 A:C:D:G4:X6
"
df <- read.table(text = lines, header = TRUE)
Could the separate function from tidyr be what you are looking for?
https://tidyr.tidyverse.org/reference/separate.html
It might require some fancy regex implementation, but could potentially work.

Conditionals calculations across rows R

First, I'm brand new to R and am making the switch from SAS. I have a dataset that is 1000 rows by 24 columns, where the columns are different treatments. I want to count the number of times an observation meets a criterion across the rows of my dataset, shown below.
Gene A B C D
1 AARS_3 NA NA 4.168365 NA
2 AASDHPPT_21936 NA NA NA -3.221287
3 AATF_26432 NA NA NA NA
4 ABCC2_22 4.501518 3.17992 NA NA
5 ABCC2_26620 NA NA NA NA
I was trying to create column vectors that counted
1) Number of NAs
2) Number of columns <0
3) Number of columns >0
I would then use cbind to add these to my large dataset
I solved the first one with :
NA.Count <- (apply(b01,MARGIN=1,FUN=function(x) length(x[is.na(x)])))
I tried to modify this to evaluate !is.na and then count the number of times the value was less than zero with this:
lt0 <- (apply(b01,MARGIN=1,FUN=function(x) ifelse(x[!is.na(x)],count(x[x<0]))))
which didn't work at all.
I tried a dozen ways to get dplyr mutate to work with this and did not succeed.
What I want are the last two columns below; and if you had a cleaner version of the NA.Count I did, that would also be greatly appreciated.
Gene A B C D NA.Count lt0 gt0
1 AARS_3 NA NA 4.168365 NA 3 0 1
2 AASDHPPT_21936 NA NA NA -3.221287 3 1 0
3 AATF_26432 NA NA NA NA 4 0 0
4 ABCC2_22 4.501518 3.17992 NA NA 2 0 2
5 ABCC2_26620 NA NA NA NA 4 0 0
Here is one way to do it, taking advantage of the fact that TRUE counts as 1 (and FALSE as 0) when summed in R.
# test data frame
lil_df <- data.frame(Gene = c("AAR3", "ABCDE"),
A = c(NA, 3),
B = c(2, NA),
C = c(-1, -2),
D = c(NA, NA))
# is.na
NA.count <- rowSums(is.na(lil_df[,-1]))
# less than zero
lt0 <- rowSums(lil_df[,-1]<0, na.rm = TRUE)
# more than zero
mt0 <- rowSums(lil_df[,-1]>0, na.rm = TRUE)
# cbind to data frame
larger_df <- cbind(lil_df, NA.count, lt0, mt0 )
larger_df
Gene A B C D NA.count lt0 mt0
1 AAR3 NA 2 -1 NA 2 1 1
2 ABCDE 3 NA -2 NA 2 1 1
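Applied to the five rows shown in the question, the same rowSums pattern reproduces the requested NA.Count, lt0 and gt0 columns. This is a sketch that assumes the data frame is called b01, as in the question, and selects the numeric columns by type instead of by position.

```r
# The five example rows from the question.
b01 <- data.frame(
  Gene = c("AARS_3", "AASDHPPT_21936", "AATF_26432", "ABCC2_22", "ABCC2_26620"),
  A = c(NA, NA, NA, 4.501518, NA),
  B = c(NA, NA, NA, 3.17992, NA),
  C = c(4.168365, NA, NA, NA, NA),
  D = c(NA, -3.221287, NA, NA, NA)
)

# Work on the numeric treatment columns only.
num <- b01[vapply(b01, is.numeric, logical(1))]

# Each comparison gives a logical matrix; rowSums counts the TRUEs per row.
b01$NA.Count <- rowSums(is.na(num))
b01$lt0 <- rowSums(num < 0, na.rm = TRUE)
b01$gt0 <- rowSums(num > 0, na.rm = TRUE)
b01
```

Assigning with $ adds the columns directly, so the separate cbind step is optional.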

Creating categorical variables from mutually exclusive dummy variables

My question regards an elaboration on a previously answered question about combining multiple dummy variables into a single categorical variable.
In the question previously asked, the categorical variable was created from dummy variables that were NOT mutually exclusive. For my case, my dummy variables are mutually exclusive because they represent crossed experimental conditions in a 2X2 between-subjects factorial design (that also has a within subjects component which I'm not addressing here), so I don't think interaction does what I need to do.
For example, my data might look like this:
id conditionA conditionB conditionC conditionD
1 NA 1 NA NA
2 1 NA NA NA
3 NA NA 1 NA
4 NA NA NA 1
5 NA 2 NA NA
6 2 NA NA NA
7 NA NA 2 NA
8 NA NA NA 2
I'd like to now make categorical variables that combine ACROSS different types of conditions. For example, people who had values for condition A and B might be coded with one categorical variable, and people who had values for condition C and D with another.
id conditionA conditionB conditionC conditionD factor1 factor2
1 NA 1 NA NA 1 NA
2 1 NA NA NA 1 NA
3 NA NA 1 NA NA 1
4 NA NA NA 1 NA 1
5 NA 2 NA NA 2 NA
6 2 NA NA NA 2 NA
7 NA NA 2 NA NA 2
8 NA NA NA 2 NA 2
Right now, I'm doing this using ifelse() statements, which quite simply is a hot mess (and doesn't always work). Please help! There's probably some super-obvious "easier way."
EDIT:
The kinds of ifelse commands that I am using are as follows:
attach(df)
df$factor<-ifelse(conditionA==1 | conditionB==1, 1, NA)
df$factor<-ifelse(conditionA==2 | conditionB==2, 2, df$factor)
In reality, I'm combining across 6-8 columns each time, so a more elegant solution would help a lot.
Update (2019): Please use dplyr::coalesce(), it works pretty much the same.
My R package has a convenience function that lets you choose the first non-NA value for each element in a list of vectors:
#library(devtools)
#install_github('kimisc', 'muelleki')
library(kimisc)
df$factor1 <- with(df, coalesce.na(conditionA, conditionB))
(I'm not sure if this works if conditionA and conditionB are factors. If necessary, convert them to numeric first using as.numeric(as.character(...)).)
Otherwise, you could give interaction a try, combined with recoding of the levels of the resulting factor -- but to me it looks like you're more interested in the first solution:
df$conditionAB <- with(df, interaction(coalesce.na(conditionA, 0),
coalesce.na(conditionB, 0)))
levels(df$conditionAB) <- c('A', 'B')
I think this function gives you what you need (admittedly, this is a quick hack).
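to_indicator <- function(tbl, grp)
{
  # For each row, return the single non-NA value if it sits in one of
  # the columns named in grp; otherwise return NA.
  apply(tbl, 1,
        function (x)
        {
          idx <- which(!is.na(x))
          nm <- names(idx)
          if (length(nm) == 1 && nm %in% grp)
            x[idx]
          else
            NA
        })
}
And here is how it's used with the example data you provide.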
tbl <- read.table(header=TRUE, text="
conditionA conditionB conditionC conditionD
NA 1 NA NA
1 NA NA NA
NA NA 1 NA
NA NA NA 1
NA 2 NA NA
2 NA NA NA
NA NA 2 NA
NA NA NA 2")
tbl <- data.frame(tbl)
(tbl <- cbind(tbl,
factor1=to_indicator(tbl, c("conditionA", "conditionB")),
factor2=to_indicator(tbl, c("conditionC", "conditionD"))))
Well, I think you can do it simply with ifelse, something like :
factor1 <- ifelse(is.na(conditionA), conditionB, conditionA)
Another way could be :
factor1 <- conditionA
factor1[is.na(factor1)] <- conditionB[is.na(factor1)]
And a third solution, certainly more practical if you have more than two condition columns (note that a row that is NA in every column comes back as 0 rather than NA):
factor1 <- apply(df[,c("conditionA","conditionB")], 1, sum, na.rm=TRUE)
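When combining across 6-8 columns, the pairwise ifelse idea can be folded over any number of columns with Reduce. The sketch below is one way to do that in base R; coalesce_cols is a hypothetical helper name, not a function from any package, and the data frame is rebuilt from the question's example.

```r
# Example data from the question.
df <- data.frame(
  id = 1:8,
  conditionA = c(NA, 1, NA, NA, NA, 2, NA, NA),
  conditionB = c(1, NA, NA, NA, 2, NA, NA, NA),
  conditionC = c(NA, NA, 1, NA, NA, NA, 2, NA),
  conditionD = c(NA, NA, NA, 1, NA, NA, NA, 2)
)

# Hypothetical helper: first non-NA value per row across the named columns.
coalesce_cols <- function(data, cols) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), data[cols])
}

df$factor1 <- coalesce_cols(df, c("conditionA", "conditionB"))
df$factor2 <- coalesce_cols(df, c("conditionC", "conditionD"))
df
```

Unlike the sum-based version, rows with no value in any of the chosen columns stay NA. Note that ifelse drops factor attributes, so factor columns would need converting first.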

Resources