find the row with highest number of NA value in R - r

I have datafrom
df
1 a c NA NA
2 a a a NA
3 c NA NA NA
Firstly, I want to find which row has the highest number of NA value. I am also interested to find rows with the condition of having more than 2 NA values.
How can I do it in R?

na_rows = rowSums(is.na(df)) gives the count of NA by row. You can then look at which.max(na_rows) and which(na_rows > 2).

Related

Data.frame copy and paste values based on a condition

I have a data frame with the following structure/values and would like to go through the data frame (by row) and paste the values from the first column ("One") into the cells of the other columns only if they are not NA:
My data:
One Two Three Four
1 Bar_2_Foo NA NA 1
2 Mur_4_Doo 1 NA 2
3 Bur_3_Hoo NA 1 NA
What I would like to achieve:
One Two Three Four
1 Bar_2_Foo NA NA Bar_2_Foo_1
2 Mur_4_Doo Mur_4_Doo_1 NA Mur_4_Doo_2
3 Bur_3_Hoo NA Bur_3_Hoo_1 NA
Any ideas how to achieve this would be great. Thanks.
Is this what you're looking for?
mutate_at(data, Two:Four, function(i){
ifelse(!is.na(i), paste0(One, "_", i), i) } )

Merging multiple columns to one in data frame by sum r

I want to merge 7 columns in one column by sum, but i can not find a good way to do this. The data frames contains 71 observations and 7 variables.
The first ones are
> head(df)
pop_exposed_1_1 pop_exposed_1_2 pop_exposed_1_3 pop_exposed_1_4
1 NA NA 15778358 NA
2 NA NA NA NA
3 NA NA NA 3971412
4 NA NA NA 2694625
5 NA NA NA NA
6 NA NA NA NA
pop_exposed_2_2 pop_exposed_2_3 pop_exposed_2_4
1 NA NA NA
2 38044072 NA NA
3 NA NA NA
4 NA NA NA
5 NA 1626335.0 NA
6 NA 429924.4 NA
All the NA values need to be replaced by a value from another variable and some rows have multiple values that need to be combined by sum. So that the outcome is just one variable pop_exposed. I have tried several things, but nothing worked the way I would like to.
Look up ?rowSums
rowSums(df, na.rm=TRUE)
rowMeans(df, na.rm=TRUE)
or the apply way
apply(df,1,sum ,na.rm = TRUE) # Sum by row '1' (for columns use '2')
apply(df,1,mean,na.rm = TRUE) # Mean by row '1' (for columns use '2')

How can I find out the names of columns that satisfy a condition in a data frame

I wish to know (by name) which columns in my data frame satisfy a particular condition. For example, if I was looking for the names of any columns that contained more than 3 NA, how could I proceed?
>frame
m n o p
1 0 NA NA NA
2 0 2 2 2
3 0 NA NA NA
4 0 NA NA 1
5 0 NA NA NA
6 0 1 2 3
> for (i in frame){
na <- is.na(i)
as.numeric(na)
total<-sum(na)
if(total>3){
print (i) }}
[1] NA 2 NA NA NA 1
[2] NA 2 NA NA NA 2
So this actually succeeds in evaluating which columns satisfy the condition, however, it does not display the column name. Perhaps subsetting the columns which interest me would be another way to do it, but I'm not sure how to solve it that way either. Plus I'd prefer to know if there's a way to just get the names directly.
I'll appreciate any input.
We can use colSums on a logical matrix (is.na(frame)), check whether it is greater than 3 to get a logical vector and then subset the names of 'frame' based on that.
names(frame)[colSums(is.na(frame))>3]
#[1] "n" "o"
If we are using dplyr, one way is
library(dplyr)
frame %>%
summarise_each(funs(sum(is.na(.))>3)) %>%
unlist() %>%
names(.)[.]
#[1] "n" "o"

Conditionals calculations across rows R

First, I'm brand new to R and am making the switch from SAS. I have a dataset that is 1000 rows by 24 columns, where the columns are different treatments. I want to count the number of times an observation meets a criteria across rows of my dataset listed below.
Gene A B C D
1 AARS_3 NA NA 4.168365 NA
2 AASDHPPT_21936 NA NA NA -3.221287
3 AATF_26432 NA NA NA NA
4 ABCC2_22 4.501518 3.17992 NA NA
5 ABCC2_26620 NA NA NA NA
I was trying to create column vectors that counted
1) Number of NAs
2) Number of columns <0
3) Number of columns >0
I would then use cbind to add these to my large dataset
I solved the first one with :
NA.Count <- (apply(b01,MARGIN=1,FUN=function(x) length(x[is.na(x)])))
I tried to modify this to count evaluate the !is.na and then count the number of times the value was less than zero with this:
lt0 <- (apply(b01,MARGIN=1,FUN=function(x) ifelse(x[!is.na(x)],count(x[x<0]))))
which didn't work at all.
I tried a dozen ways to get dplyr mutate to work with this and did not succeed.
What I want are the last two columns below; and if you had a cleaner version of the NA.Count I did, that would also be greatly appreciated.
Gene A B C D NA.Count lt0 gt0
1 AARS_3 NA NA 4.168365 NA 3 0 1
2 AASDHPPT_21936 NA NA NA -3.221287 3 1 0
3 AATF_26432 NA NA NA NA 4 0 0
4 ABCC2_22 4.501518 3.17992 NA NA 2 0 2
5 ABCC2_26620 NA NA NA NA 4 0 0
Here is one way to do it taking advantage of the fact that TRUE equals 1 in R.
# test data frame
lil_df <- data.frame(Gene = c("AAR3", "ABCDE"),
A = c(NA, 3),
B = c(2, NA),
C = c(-1, -2),
D = c(NA, NA))
# is.na
NA.count <- rowSums(is.na(lil_df[,-1]))
# less than zero
lt0 <- rowSums(lil_df[,-1]<0, na.rm = TRUE)
# more that zero
mt0 <- rowSums(lil_df[,-1]>0, na.rm = TRUE)
# cbind to data frame
larger_df <- cbind(lil_df, NA.count, lt0, mt0 )
larger_df
Gene A B C D NA.count lt0 mt0
1 AAR3 NA 2 -1 NA 2 1 1
2 ABCDE 3 NA -2 NA 2 1 1

Creating categorical variables from mutually exclusive dummy variables

My question regards an elaboration on a previously answered question about combining multiple dummy variables into a single categorical variable.
In the question previously asked, the categorical variable was created from dummy variables that were NOT mutually exclusive. For my case, my dummy variables are mutually exclusive because they represent crossed experimental conditions in a 2X2 between-subjects factorial design (that also has a within subjects component which I'm not addressing here), so I don't think interaction does what I need to do.
For example, my data might look like this:
id conditionA conditionB conditionC conditionD
1 NA 1 NA NA
2 1 NA NA NA
3 NA NA 1 NA
4 NA NA NA 1
5 NA 2 NA NA
6 2 NA NA NA
7 NA NA 2 NA
8 NA NA NA 2
I'd like to now make categorical variables that combine ACROSS different types of conditions. For example, people who had values for condition A and B might be coded with one categorical variable, and people who had values for condition C and D.
id conditionA conditionB conditionC conditionD factor1 factor2
1 NA 1 NA NA 1 NA
2 1 NA NA NA 1 NA
3 NA NA 1 NA NA 1
4 NA NA NA 1 NA 1
5 NA 2 NA NA 2 NA
6 2 NA NA NA 2 NA
7 NA NA 2 NA NA 2
8 NA NA NA 2 NA 2
Right now, I'm doing this using ifelse() statements, which quite simply is a hot mess (and doesn't always work). Please help! There's probably some super-obvious "easier way."
EDIT:
The kinds of ifelse commands that I am using are as follows:
attach(df)
df$factor<-ifelse(conditionA==1 | conditionB==1, 1, NA)
df$factor<-ifelse(conditionA==2 | conditionB==2, 2, df$factor)
In reality, I'm combining across 6-8 columns each time, so a more elegant solution would help a lot.
Update (2019): Please use dplyr::coalesce(), it works pretty much the same.
My R package has a convenience function that allows to choose the first non-NA value for each element in a list of vectors:
#library(devtools)
#install_github('kimisc', 'muelleki')
library(kimisc)
df$factor1 <- with(df, coalesce.na(conditionA, conditionB))
(I'm not sure if this works if conditionA and conditionB are factors. Convert them to numerics before using as.numeric(as.character(...)) if necessary.)
Otherwise, you could give interaction a try, combined with recoding of the levels of the resulting factor -- but to me it looks like you're more interested in the first solution:
df$conditionAB <- with(df, interaction(coalesce.na(conditionA, 0),
coalesce.na(conditionB, 0)))
levels(df$conditionAB) <- c('A', 'B')
I think this function gives you what you need (admittedly, this is a quick hack).
to_indicator <- function(x, grp)
{
apply(tbl, 1,
function (x)
{
idx <- which(!is.na(x))
nm <- names(idx)
if (nm %in% grp)
x[idx]
else
NA
})
}
And here is it's used with the example data you provide.
tbl <- read.table(header=TRUE, text="
conditionA conditionB conditionC conditionD
NA 1 NA NA
1 NA NA NA
NA NA 1 NA
NA NA NA 1
NA 2 NA NA
2 NA NA NA
NA NA 2 NA
NA NA NA 2")
tbl <- data.frame(tbl)
(tbl <- cbind(tbl,
factor1=to_indicator(tbl, c("conditionA", "conditionB")),
factor2=to_indicator(tbl, c("conditionC", "conditionD"))))
Well, I think you can do it simply with ifelse, something like :
factor1 <- ifelse(is.na(conditionA), conditionB, conditionA)
Another way could be :
factor1 <- conditionA
factor1[is.na(factor1)] <- conditionB
And a third solution, certainly more pratical if you have more than two columns conditions :
factor1 <- apply(df[,c("conditionA","conditionB")], 1, sum, na.rm=TRUE)

Resources