Remove NAs by keeping all the populated cells in new columns using R

How can I drop all the missing values without deleting entire columns, and instead create columns containing just the populated cells? For example, getting from this
A B C D
1 NA 2 NA
NA 3 NA 4
NA 5 6 NA
(data1) in order to create a data-set containing only the populated cells, as this
AB BB
1 2
3 4
5 6
Below I have created a small working example to test a solution.
># Create example dataset (data1)
>data1 <- data.frame(matrix(c(1,NA,2,NA,NA,3,NA,4,NA,5,6,NA),nrow = 3, byrow = T))
>colnames(data1) <- c("A","B","C","D")
>print(data1)
A B C D
1 NA 2 NA
NA 3 NA 4
NA 5 6 NA
> # Create new dataset?

Here is a potential solution using akrun's/Valentin's answer from this question.
Let's say the data is
data1 <- data.frame(matrix(c(1,NA,2,NA,NA,3,NA,4,NA,5,NA,NA),nrow = 3, byrow = T))
> data1
X1 X2 X3 X4
1 1 NA 2 NA
2 NA 3 NA 4
3 NA 5 NA NA
Then use
df1 <- t(sapply(apply(data1, 1, function(x) x[!is.na(x)]), "length<-", max(lengths(lapply(data1, function(x) x[!is.na(x)])))))
to arrive at
> df1
X1 X3
[1,] 1 2
[2,] 3 4
[3,] 5 NA
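The one-liner above depends on apply() returning a list. apply() simplifies to a matrix when every row keeps the same number of values, so the result can differ on such input. A more defensive sketch (the names m, kept, width and compact are made up for illustration) collects the rows with lapply() and pads them explicitly:

```r
# Example data from the answer above (row 3 keeps only one value).
data1 <- data.frame(matrix(c(1, NA, 2, NA, NA, 3, NA, 4, NA, 5, NA, NA),
                           nrow = 3, byrow = TRUE))

m <- as.matrix(data1)

# Collect the non-missing values of each row; lapply() always returns a list,
# so rows with equal counts are not simplified into a matrix.
kept <- lapply(seq_len(nrow(m)), function(i) unname(m[i, !is.na(m[i, ])]))

# Pad every row with NA up to the widest row, then stack the rows.
width <- max(lengths(kept))
compact <- do.call(rbind, lapply(kept, function(v) {
  length(v) <- width
  v
}))
compact
```

The result is a plain numeric matrix; wrap it in as.data.frame() if a data frame is needed.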

Related

adding two variables which has NA present

Let's say the data is 'ab':
a <- c(1,2,3,NA,5,NA)
b <- c(5,NA,4,NA,NA,6)
ab <-c(a,b)
I would like a new variable that is the sum of the two while keeping NAs, as follows:
desired output:
ab$c <-(6,2,7,NA,5,6)
so adding a number and NA should give the number.
I tried the following, but it does not work as desired:
ab$c <- a+b
gives me : 6 NA 7 NA NA NA
I also don't know how to include "na.rm=TRUE", which is something I was trying.
I would also like to create a third, categorical variable based on a cutoff: if the value is <=4 then the event is 1, otherwise 0:
desired output:
ab$d <-(1,1,1,NA,0,0)
I tried:
ab$d =ifelse(ab$a<=4|ab$b<=4,1,0)
print(ab$d)
gives me logical(0)
Thanks!
a <- c(1,2,3,NA,5,NA)
b <- c(5,NA,4,NA,NA,6)
dfd <- data.frame(a,b)
dfd$c <- rowSums(dfd, na.rm = TRUE)
dfd$c <- ifelse(is.na(dfd$a) & is.na(dfd$b), NA_integer_, dfd$c)
dfd$d <- ifelse(dfd$c >= 4, 1, 0)
dfd
a b c d
1 1 5 6 1
2 2 NA 2 0
3 3 4 7 1
4 NA NA NA NA
5 5 NA 5 1
6 NA 6 6 1
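Note that column d above (1, 0, 1, NA, 1, 1) differs from the desired output in the question (1, 1, 1, NA, 0, 0). A sketch that reproduces the desired output, assuming the intended rule is "1 if either a or b is <= 4, NA only when both are missing" (an interpretation inferred from the asker's ifelse attempt; vals, all_missing and hit are illustrative names):

```r
a <- c(1, 2, 3, NA, 5, NA)
b <- c(5, NA, 4, NA, NA, 6)
dfd <- data.frame(a, b)

vals <- dfd[c("a", "b")]

# TRUE for rows where both a and b are missing.
all_missing <- rowSums(is.na(vals)) == ncol(vals)

# Sum that ignores NA, but stays NA when nothing was observed.
dfd$c <- ifelse(all_missing, NA, rowSums(vals, na.rm = TRUE))

# Event flag: 1 if any observed value is <= 4, 0 otherwise, NA if all missing.
hit <- rowSums(vals <= 4, na.rm = TRUE) > 0
dfd$d <- ifelse(all_missing, NA, as.integer(hit))

dfd
```

Here c comes out as 6, 2, 7, NA, 5, 6 and d as 1, 1, 1, NA, 0, 0.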

Fill matrix from list based on column name match from other list

I have some data:
num.list1 <- list(1,2,1,4,5)
num.list2 <- list(2,3)
num.list3 <- list(3,5,2)
num.data.list <- list(num.list1, num.list2, num.list3)
name.list1 <- list("A","B","C","D","E")
name.list2 <- list("B","C")
name.list3 <- list("A","C","E")
name.data.list <- list(name.list1, name.list2, name.list3)
all.names <- unique(unlist(name.data.list))
my.matrix <- matrix(data = NA, nrow = length(name.data.list), ncol = length(all.names))
colnames(my.matrix) <- all.names
I would like to populate my.matrix with the values from num.data.list based on matching the column names of my.matrix with the values in name.data.list.
i.e. :
A B C D E
1 1 2 1 4 5
2 NA 2 3 NA NA
3 3 NA 5 NA 2
Any Ideas? Thanks.
Using matrix subsetting:
library(reshape2)
nm = melt(name.data.list)
my.matrix[matrix(c(nm$L1, match(nm$value, all.names)), ncol = 2)] = unlist(num.data.list)
# A B C D E
#[1,] 1 2 1 4 5
#[2,] NA 2 3 NA NA
#[3,] 3 NA 5 NA 2
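The same matrix-indexing idea can be sketched in base R without reshape2, assuming (as in the question) that the i-th element of name.data.list lines up with the i-th element of num.data.list. The row index is built with rep() and lengths(); rows and cols are illustrative names:

```r
num.data.list <- list(list(1, 2, 1, 4, 5), list(2, 3), list(3, 5, 2))
name.data.list <- list(list("A", "B", "C", "D", "E"), list("B", "C"), list("A", "C", "E"))

all.names <- unique(unlist(name.data.list))
my.matrix <- matrix(NA, nrow = length(name.data.list), ncol = length(all.names))
colnames(my.matrix) <- all.names

# One (row, column) pair per value: the row is the position of the sub-list,
# the column is where the value's name sits in all.names.
rows <- rep(seq_along(name.data.list), lengths(name.data.list))
cols <- match(unlist(name.data.list), all.names)

# Indexing with a two-column matrix assigns each value to its own cell.
my.matrix[cbind(rows, cols)] <- unlist(num.data.list)
my.matrix
```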
I'd start by giving num.data.list and name.data.list the proper structure:
num.data.list<-lapply(num.data.list,unlist)
name.data.list<-lapply(name.data.list,unlist)
Then:
for (i in 1:nrow(my.matrix)) my.matrix[i,name.data.list[[i]]]<-num.data.list[[i]]
my.matrix
# A B C D E
#[1,] 1 2 1 4 5
#[2,] NA 2 3 NA NA
#[3,] 3 NA 5 NA 2
Here is another option using Map and the rbindlist function, assuming that your names and data are ordered so that each element in num.data.list matches the corresponding element in name.data.list.
library(data.table);
rbindlist(Map(function(x, y) setNames(data.frame(x), y),
num.data.list, name.data.list), fill = T)
A B C D E
1: 1 2 1 4 5
2: NA 2 3 NA NA
3: 3 NA 5 NA 2

Conditionals calculations across rows R

First, I'm brand new to R and am making the switch from SAS. I have a dataset that is 1000 rows by 24 columns, where the columns are different treatments. I want to count the number of times an observation meets a criterion across the rows of my dataset, listed below.
Gene A B C D
1 AARS_3 NA NA 4.168365 NA
2 AASDHPPT_21936 NA NA NA -3.221287
3 AATF_26432 NA NA NA NA
4 ABCC2_22 4.501518 3.17992 NA NA
5 ABCC2_26620 NA NA NA NA
I was trying to create column vectors that counted
1) Number of NAs
2) Number of columns <0
3) Number of columns >0
I would then use cbind to add these to my large dataset
I solved the first one with :
NA.Count <- (apply(b01,MARGIN=1,FUN=function(x) length(x[is.na(x)])))
I tried to modify this to evaluate !is.na and then count the number of times the value was less than zero with this:
lt0 <- (apply(b01,MARGIN=1,FUN=function(x) ifelse(x[!is.na(x)],count(x[x<0]))))
which didn't work at all.
I tried a dozen ways to get dplyr mutate to work with this and did not succeed.
What I want are the last two columns below; and if you had a cleaner version of the NA.Count I did, that would also be greatly appreciated.
Gene A B C D NA.Count lt0 gt0
1 AARS_3 NA NA 4.168365 NA 3 0 1
2 AASDHPPT_21936 NA NA NA -3.221287 3 1 0
3 AATF_26432 NA NA NA NA 4 0 0
4 ABCC2_22 4.501518 3.17992 NA NA 2 0 2
5 ABCC2_26620 NA NA NA NA 4 0 0
Here is one way to do it taking advantage of the fact that TRUE equals 1 in R.
# test data frame
lil_df <- data.frame(Gene = c("AAR3", "ABCDE"),
A = c(NA, 3),
B = c(2, NA),
C = c(-1, -2),
D = c(NA, NA))
# is.na
NA.count <- rowSums(is.na(lil_df[,-1]))
# less than zero
lt0 <- rowSums(lil_df[,-1]<0, na.rm = TRUE)
# more than zero
mt0 <- rowSums(lil_df[,-1]>0, na.rm = TRUE)
# cbind to data frame
larger_df <- cbind(lil_df, NA.count, lt0, mt0 )
larger_df
Gene A B C D NA.count lt0 mt0
1 AAR3 NA 2 -1 NA 2 1 1
2 ABCDE 3 NA -2 NA 2 1 1
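As a check against the question's desired table, the same rowSums() approach can be applied to the five sample rows from the question. This is a sketch that assumes b01 holds the Gene column first, followed by the numeric treatment columns; vals is an illustrative name:

```r
# The five sample rows shown in the question.
b01 <- data.frame(
  Gene = c("AARS_3", "AASDHPPT_21936", "AATF_26432", "ABCC2_22", "ABCC2_26620"),
  A = c(NA, NA, NA, 4.501518, NA),
  B = c(NA, NA, NA, 3.17992, NA),
  C = c(4.168365, NA, NA, NA, NA),
  D = c(NA, -3.221287, NA, NA, NA)
)

# Work on the numeric columns only (drop Gene).
vals <- b01[, -1]

# Comparisons with NA give NA, so na.rm = TRUE counts only the observed hits.
b01$NA.Count <- rowSums(is.na(vals))
b01$lt0 <- rowSums(vals < 0, na.rm = TRUE)
b01$gt0 <- rowSums(vals > 0, na.rm = TRUE)
b01
```

The three new columns match the NA.Count, lt0 and gt0 columns in the desired output.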

R- Perform operations on column and place result in a different column, with the operation specified by the output column's name

I have a dataframe with 3 columns- L1, L2, L3- of data and empty columns labeled L1+L2, L2+L3, L3+L1, L1-L2, etc. combinations of column operations. Is there a way to check the column name and perform the necessary operation to fill that new column with data?
I am thinking:
-use match to find the appropriate original columns and using a for loop to iterate over all of the columns in this search?
so if the column I am attempting to fill is L1+L2 I would have something like:
apply(dataframe[, c(i, j)], 1, sum)
It seems strange that you would store your operations in your column names, but I suppose it is possible to achieve:
As always, sample data helps.
## Creating some sample data
mydf <- setNames(data.frame(matrix(1:9, ncol = 3)),
c("L1", "L2", "L3"))
## The operation you want to do...
morecols <- c(
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "+")),
combn(names(mydf), 2, FUN=function(x) paste(x, collapse = "-"))
)
## THE FINAL SAMPLE DATA
mydf[, morecols] <- NA
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 NA NA NA NA NA NA
# 2 2 5 8 NA NA NA NA NA NA
# 3 3 6 9 NA NA NA NA NA NA
One solution could be to use eval(parse(...)) within lapply to perform the calculations and store them to the relevant column.
mydf[morecols] <- lapply(names(mydf[morecols]), function(x) {
with(mydf, eval(parse(text = x)))
})
mydf
# L1 L2 L3 L1+L2 L1+L3 L2+L3 L1-L2 L1-L3 L2-L3
# 1 1 4 7 5 8 11 -3 -6 -3
# 2 2 5 8 7 10 13 -3 -6 -3
# 3 3 6 9 9 12 15 -3 -6 -3
dfrm <- data.frame( L1=1:3, L2=1:3, L3=3+1, `L1+L2`=NA,
`L2+L3`=NA, `L3+L1`=NA, `L1-L2`=NA,
check.names=FALSE)
dfrm
#------------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 NA NA NA NA
2 2 2 4 NA NA NA NA
3 3 3 4 NA NA NA NA
#-------------
dfrm[, 4:7] <- lapply(names(dfrm[, 4:7]),
function(nam) eval(parse(text=nam), envir=dfrm) )
dfrm
#-----------
L1 L2 L3 L1+L2 L2+L3 L3+L1 L1-L2
1 1 1 4 2 5 5 0
2 2 2 4 4 6 6 0
3 3 3 4 6 7 7 0
I chose to use eval(parse(text=...)) rather than with, since the use of with is specifically cautioned against in its help page. I'm not sure I can explain why the eval(..., target_dfrm) form should be any safer, though.
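If evaluating column names as code is a concern, one alternative sketch is to split each name into operand, operator and operand, then look the operator up with match.fun(). This assumes every target name has the form "<column><+ or -><column>"; it does not handle longer expressions. The names targets, parts and op are illustrative:

```r
mydf <- data.frame(L1 = 1:3, L2 = 4:6, L3 = 7:9)
targets <- c("L1+L2", "L1+L3", "L2+L3", "L1-L2", "L1-L3", "L2-L3")

for (nm in targets) {
  # parts: full match, left column name, operator, right column name.
  parts <- regmatches(nm, regexec("^([[:alnum:]_.]+)([+-])([[:alnum:]_.]+)$", nm))[[1]]
  # Look up the arithmetic function by its symbol, e.g. `+` or `-`.
  op <- match.fun(parts[3])
  # Apply it to the two source columns and store the result under the name.
  mydf[[nm]] <- op(mydf[[parts[2]]], mydf[[parts[4]]])
}
mydf
```

Nothing from the column name is executed here; the name only selects two existing columns and one of two known operators.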

How does one merge dataframes by row name without adding a "Row.names" column?

If I have two data frames, such as:
df1 = data.frame(x=1:3,y=1:3,row.names=c('r1','r2','r3'))
df2 = data.frame(z=5:7,row.names=c('r5','r6','r7'))
(
R> df1
x y
r1 1 1
r2 2 2
r3 3 3
R> df2
z
r5 5
r6 6
r7 7
), I'd like to merge them by row names, keeping everything (so an outer join, or all=T). This does it:
merged.df <- merge(df1,df2,all=T,by='row.names')
R> merged.df
Row.names x y z
1 r1 1 1 NA
2 r2 2 2 NA
3 r3 3 3 NA
4 r5 NA NA 5
5 r6 NA NA 6
6 r7 NA NA 7
but I want the input row names to be the row names in the output dataframe (merged.df).
I can do:
rownames(merged.df) <- merged.df[[1]]
merged.df <- merged.df[-1]
which works, but seems inelegant and hard to remember. Anyone know of a cleaner way?
Not sure if it's any easier to remember, but you can do it all in one step using transform.
transform(merge(df1,df2,by=0,all=TRUE), row.names=Row.names, Row.names=NULL)
# x y z
#r1 1 1 NA
#r2 2 2 NA
#r3 3 3 NA
#r5 NA NA 5
#r6 NA NA 6
#r7 NA NA 7
From the help of merge:
If the matching involved row names, an extra character column called
Row.names is added at the left, and in all cases the result has
‘automatic’ row names.
So it is clear that you can't avoid the Row.names column at least using merge. But maybe to remove this column you can subset by name and not by index. For example:
dd <- merge(df1,df2,by=0,all=TRUE) ## by=0 easier to write than row.names ,
## TRUE is cleaner than T
Then I drop the Row.names column by name and restore the row names like this:
res <- subset(dd,select=-c(Row.names))
rownames(res) <- dd[,'Row.names']
res
x y z
r1 1 1 NA
r2 2 2 NA
r3 3 3 NA
r5 NA NA 5
r6 NA NA 6
r7 NA NA 7
