Delete column with NAs in first row - r

If I have a dataframe like so
a <- c(NA,1,2,NA,4)
b <- c(6,7,8,9,10)
c <- c(NA,12,13,14,15)
d <- c(16,NA,18,NA,20)
df <- data.frame(a,b,c,d)
How can I delete columns "a" and "c" by asking R to delete those columns that contain an NA in the first row?
My actual dataset is much bigger, and this is only by way of a reproducible example.
Please note that this isn't the same as asking to delete columns with any NAs in it. My columns may have other NA values in it. I'm looking to delete just the ones with an NA in the first row.

You can use a vector of booleans indicating wether the first row is missing in this case.
res <- df[,!is.na(df[1,])]
> res
b d
1 6 16
2 7 NA
3 8 18
4 9 NA
5 10 20

Related

How to exclude missing data in specific columns in R

I have a df with 15,105 rows and 127 columns. I'd like to exclude some specific colunms' rows that have NA. I´m using the following command:
wave1b <- na.omit(wave1, cols=c("Bx", "Deq", "Gef", "Has", "Pla", "Ty"))
However, when I run it it returns with 19 rows only, when it was expected to return with 14,561 rows (if it should have excluded only the NA in those specific colunms requested). I'm afirming this, cause I did a subset on the df in order to test the accuracy of the missing deletion.
Does anyone could help me solving this issue? Thank you!
I think this code is not efficient but it could work:
df <- data.frame(A = rep(NA,3), B = c(NA,2,3),C=c(1,NA,2))
df
A B C
1 NA NA 1
2 NA 2 NA
3 NA 3 2
It removes only the rows which have missing values for the columns B and C:
df[-which(is.na(df$B)|is.na(df$C)),]
A B C
3 NA 3 2
You can use complete.cases
> df[complete.cases(df[, -1]), ]
A B C
3 NA 3 2

Getting wrong result while removing all NA value columns in R

I am getting wrong result while removing all NA value column in R
data file : https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
trainingData <- read.csv("D:\\pml-training.csv",na.strings = c("NA","", "#DIV/0!"))
Now I want to remove all the column which only has NA's
Approach 1: here I mean read all the column which has more than 0 sum and not NA
aa <- trainingData[colSums(!is.na(trainingData)) > 0]
length(colnames(aa))
154 columns
Approach 2: As per this query, it will give all the columns which is NA and sum = 0, but it is giving the result of column which does not have NA and gives expected result
bb <- trainingData[,colSums(is.na(trainingData)) == 0]
length(colnames(bb))
60 columns (expected)
Can someone please help me to understand what is wrong in first statement and what is right in second one
aa <- trainingData[,colSums(!is.na(trainingData)) > 0]
length(colnames(aa))
You convert the dataframe to a boolean dataframe with !is.na(trainingData), and find all columns where there is more than one TRUE (so non-NA) in the column. So this returns all columns that have at least one non-NA value, which seem to be all but 6 columns.
bb <- trainingData[colSums(is.na(trainingData)) == 0]
length(colnames(bb))
You convert the dataframe to boolean with is.na(trainingData) and return all values where there is no TRUE (no NA) in the column. This returns all columns where there are no missing values (i.e. no NA's).
Example as requested in comment:
df = data.frame(a=c(1,2,3),b=c(NA,1,1),c=c(NA,NA,NA))
bb <- df[colSums(is.na(df)) == 0]
> df
a b c
1 1 NA NA
2 2 1 NA
3 3 1 NA
> bb
a
1 1
2 2
3 3
So the statements are in fact different. If you want to remove all columns that are only NA's, you should use the first statement. Hope this helps.

new column with value from another one which name is in a third one in R

got that one I can't resolve.
Example dataset:
company <- c("compA","compB","compC")
compA <- c(1,2,3)
compB <- c(2,3,1)
compC <- c(3,1,2)
df <- data.frame(company,compA,compB,compC)
I want to create a new column with the value from the column which name is in the column "company" of the same line. the resulting extraction would be:
df$new <- c(1,3,2)
df
The way you have it set up, there's one row and one column for every company, and the rows and columns are in the same order. If that's your real dataset, then as others have said diag(...) is the solution (and you should select that answer).
If your real dataset has more than one instance of company (e.g., more than one row per company, then this is more general:
# using your df
sapply(1:nrow(df),function(i)df[i,as.character(df$company[i])])
# [1] 1 3 2
# more complex case
set.seed(1) # for reproducible example
newdf <- data.frame(company=LETTERS[sample(1:3,10,replace=T)],
A = sample(1:3,10,replace=T),
B=sample(1:5,10,replace=T),
C=1:10)
head(newdf)
# company A B C
# 1 A 1 5 1
# 2 B 1 2 2
# 3 B 3 4 3
# 4 C 2 1 4
# 5 A 3 2 5
# 6 C 2 2 6
sapply(1:nrow(newdf),function(i)newdf[i,as.character(newdf$company[i])])
# [1] 1 2 4 4 3 6 7 2 5 3
EDIT: eddi's answer is probably better. It is more likely that you would have the dataframe to work with rather than the individual row vectors.
I am not sure if I understand your question, it is unclear from your description. But it seems you are asking for the diagonals of the data values since this would be the place where "name is in the column "company" of the same line". The following will do this:
df$new <- diag(matrix(c(compA,compB,compC), nrow = 3, ncol = 3))
The diag function will return the diagonal of the matrix for you. So I first concatenated the three original vectors into one vector and then specified it to be wrapped into a matrix of three rows and three columns. Then I took the diagonal. The whole thing is then added to the dataframe.
Did that answer your question?

Merge data.frames with duplicates

I have many data.frames, for example:
df1 = data.frame(names=c('a','b','c','c','d'),data1=c(1,2,3,4,5))
df2 = data.frame(names=c('a','e','e','c','c','d'),data2=c(1,2,3,4,5,6))
df3 = data.frame(names=c('c','e'),data3=c(1,2))
and I need to merge these data.frames, without delete the name duplicates
> result
names data1 data2 data3
1 'a' 1 1 NA
2 'b' 2 NA NA
3 'c' 3 4 1
4 'c' 4 5 NA
5 'd' 5 6 NA
6 'e' NA 2 2
7 'e' NA 3 NA
I cant find function like merge with option to handle with name duplicates. Thank you for your help.
To define my problem. The data comes from biological experiment where one sample have a different number of replicates. I need to merge all experiment, and I need to produce this table. I can't generate unique identifier for replicates.
First define a function, run.seq, which provides sequence numbers for duplicates since it appears from the output that what is desired is that the ith duplicate of each name in each component of the merge be associated. Then create a list of the data frames and add a run.seq column to each component. Finally use Reduce to merge them all.
run.seq <- function(x) as.numeric(ave(paste(x), x, FUN = seq_along))
L <- list(df1, df2, df3)
L2 <- lapply(L, function(x) cbind(x, run.seq = run.seq(x$names)))
out <- Reduce(function(...) merge(..., all = TRUE), L2)[-2]
The last line gives:
> out
names data1 data2 data3
1 a 1 1 NA
2 b 2 NA NA
3 c 3 4 1
4 c 4 5 NA
5 d 5 6 NA
6 e NA 2 2
7 e NA 3 NA
EDIT: Revised run.seq so that input need not be sorted.
See other questions:
How to join data frames in R (inner, outer, left, right)
recombining-a-list-of-data-frames-into-a-single-data-frame
...
Examples:
library(reshape)
out <- merge_recurse(L)
or
library(plyr)
out<-join(df1, df2, type="full")
out<-join(out, df3, type="full")
*can be looped
or
library(plyr)
out<-ldply(L)
I think there is just not enough information in your example data frames to do this. Which 'c' in dataframe 1 should be paired with which 'c' in data frame 2? We cannot tell, so R can't either. I suspect you will have to add another variable to each of your dataframes that uniquely identifies these duplicate cases.

Select last non-NA value in a row, by row

I have a data frame where each row is a vector of values of varying lengths. I would like to create a vector of the last true value in each row.
Here is an example data frame:
df <- read.table(tc <- textConnection("
var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
2 NA 3 NA
4 4 4 4
1 NA NA NA"), header = TRUE); close(tc)
The vector of values I want would therefore be c(2,6,3,4,1).
I just can't figure out how to get R to identify the last value.
Any help is appreciated!
Do this by combining three things:
Identify NA values with is.na
Find the last value in a vector with tail
Use apply to apply this function to each row in the data.frame
The code:
lastValue <- function(x) tail(x[!is.na(x)], 1)
apply(df, 1, lastValue)
[1] 2 6 3 4 1
Here's an answer using matrix subsetting:
df[cbind( 1:nrow(df), max.col(!is.na(df),"last") )]
This max.col call will select the position of the last non-NA value in each row (or select the first position if they are all NA).
Here's another version that removes all infinities, NA, and NaN's before taking the first element of the reversed input:
apply(df, 1, function(x) rev(x[is.finite(x)])[1] )
# [1] 2 6 3 4 1

Resources