I have a dataframe in which I want to identify columns that have NA values.
Only about 10% of the columns have NA values, so I only want to list out the columns where the number of NAs is greater than 0.
The line below returns every column. Is there a way to filter out the columns that don't have NA values?
colSums(is.na(df))
To list the columns with at least one null value in pandas, you can use:
df.columns[df.isna().any()].tolist()
To list the columns whose values are all null, you can use:
df.columns[df.isna().all()].tolist()
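The two snippets above use pandas syntax, while the question is written in R. A base-R equivalent might look like the sketch below; the example data frame and the variable names (na_counts, cols_any_na, cols_all_na) are made up for illustration.

```r
# Hypothetical example data: b has one NA, d is entirely NA
df <- data.frame(a = 1:3,
                 b = c(1, NA, 3),
                 c = c("x", "y", "z"),
                 d = c(NA, NA, NA))

# NA count per column, as in the question
na_counts <- colSums(is.na(df))

# Keep only the columns with at least one NA, together with their counts
na_counts[na_counts > 0]

# Just the column names
cols_any_na <- names(df)[na_counts > 0]          # at least one NA
cols_all_na <- names(df)[na_counts == nrow(df)]  # every value is NA
```

Filtering the named vector returned by colSums keeps the counts visible, which is handy when only a small share of the columns are affected.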
I have a dataset and the row which I need has an NA value.
If I use na.omit, the rows will be omitted; however, I need the row. Hence I need to predict a value in place of the NA.
How do I proceed?
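One possible way to proceed, sketched under assumptions: the data frame dat and its columns id and height are hypothetical, and whether a mean or a model-based estimate is appropriate depends on the data. The idea is to fill the NA in place so that the row survives.

```r
# Hypothetical data: the row with id 3 is needed, but its height is missing
dat <- data.frame(id = 1:5, height = c(150, 160, NA, 170, 180))

# Option 1: replace the NA with the mean of the observed values
mean_imputed <- dat
mean_imputed$height[is.na(mean_imputed$height)] <- mean(dat$height, na.rm = TRUE)

# Option 2: predict the missing value from another column with a linear model
# (lm() drops incomplete rows when fitting by default)
fit <- lm(height ~ id, data = dat)
missing_rows <- is.na(dat$height)
model_imputed <- dat
model_imputed$height[missing_rows] <- predict(fit, newdata = dat[missing_rows, ])
```

Both variants keep all five rows, unlike na.omit.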
I have a data frame extracted from a database that contains different types of data (record types). The different record types have different column names, which occupy the first three rows (including the header). This data frame is meant to be used in Excel, where you can easily filter the data by choosing the correct record type.
Here is a small sample of my data frame, which in reality contains many more columns (59) and rows (34000).
sample <- data.frame(X01RecordType=c("01HL","01CA","HH","HH","HH","HL"), X02Quarter=c(NA,NA,2,2,2,1),X05Gear=c(NA,NA,"KRA","KRA","KRA",NA),X06SweepLngt=c(NA,NA,35,35,-9,-9),
X12Month=c("12SpecCodeType",NA,4,5,4,2), X13Day=c("13SpecCode",NA,26,5,25,160617), X22StatRec=c("22LngtCode","22CANoAtLngt","45G1",NA,NA,NA),X23Depth=c("23LngtClass","23IndWgt",41,NA,63,NA))
As you can see, the cells that contain column names are preceded by an X and a number and then text, e.g. X01RecordType. It would be very easy to replace the column names with the first row by using:
colnames(df) <- df[1,]
However, as you can see, some of the cells in the first two rows also contain NA values. These NA values indicate that the column name is the same for all record types, i.e. the current header applies, so I would like to keep it. So what I would really like to do is replace the column names with the values of the first row (where the record type header equals 01HL), except where that value is NA.
If possible, I would like to do this without any external packages. Cells within the data may also contain NA values, and I would like to keep those rows, so filtering out all columns containing NA is not an option unless it applies only to the first row. That is how I tried to approach the problem, but I can't figure out how to do it.
I hope this is all the information required to help me out and thanks!
Another option without a loop:
colnames(sample)[!is.na(sample[1,])] <- sample[1,][!is.na(sample[1,])]
sample[1:2,]
# 01HL X02Quarter X05Gear X06SweepLngt 12SpecCodeType 13SpecCode 22LngtCode
#1 01HL NA <NA> NA 12SpecCodeType 13SpecCode 22LngtCode
#2 01CA NA <NA> NA <NA> <NA> 22CANoAtLngt
# 23LngtClass
#1 23LngtClass
#2 23IndWgt
I suggest a simple loop:
for (i in seq_along(sample)) if (!is.na(sample[1, i])) colnames(sample)[i] <- as.character(sample[1, i])
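For readability, the same idea can be spelled out with intermediate variables. This is only a sketch: the small data frame d and the names new_names and keep_old are made up here, and d merely mimics the shape of the question's sample (row 1 holds the replacement names, NA means "keep the current name").

```r
# Hypothetical frame in the same shape as the question's sample
d <- data.frame(X01RecordType = c("01HL", "HH"),
                X02Quarter    = c(NA, 2),
                X12Month      = c("12SpecCodeType", "4"),
                stringsAsFactors = FALSE)

# First row as a plain character vector (as.character also handles factor columns)
new_names <- vapply(d[1, ], as.character, character(1), USE.NAMES = FALSE)

# TRUE where the first row is NA, i.e. where the existing header should stay
keep_old <- is.na(new_names)

# Take the old name where the first row is NA, the first-row value otherwise
names(d) <- ifelse(keep_old, names(d), new_names)
```

Only base R is used, and NA values further down in the data are left untouched because only the first row is inspected.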
So I have a block of text that I've separated into a vector, and I've further separated each line of that vector into a data frame. In a perfect world, every row of the DF would be exactly the same, but it's not, and there are a number of rows with NA values in them. What I need to do is select the row from the data frame with the fewest NA values.
So say the DF looked like this:
Name Year NA Address NA State NA
Name Year ID Address City State Rank
Name Year NA NA City State NA
Name NA NA NA NA NA Rank
Name Year NA NA NA NA NA
where each value belongs to its own column. So I need a way to identify which row has the fewest NAs, and then select that row's elements. Ultimately I want the result to be a single-row DF (or preferably a vector) that reads
Name Year ID Address City State Rank
In this case, row 2.
I know that:
max( rowSums(!is.na(x)) )
Will return me the row# with the most number of not-na values, but I can't seem to figure out how to grab the elements of that row. I was thinking using which() would work, but I can't seem to figure it out.
Thanks for your help!
David
If your data frame is df, then:
df[which.max(rowSums(!is.na(df))),]
Should return the single-row data frame with the fewest NAs.
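As a quick check, here is the example from the question rebuilt as a data frame (the column names V1 to V7 are placeholders, since the question does not give any), followed by the same which.max call and a conversion to a plain vector, which the question said was preferred.

```r
# The five example rows from the question; V1..V7 are placeholder column names
df <- data.frame(
  V1 = rep("Name", 5),
  V2 = c("Year", "Year", "Year", NA, "Year"),
  V3 = c(NA, "ID", NA, NA, NA),
  V4 = c("Address", "Address", NA, NA, NA),
  V5 = c(NA, "City", "City", NA, NA),
  V6 = c("State", "State", "State", NA, NA),
  V7 = c(NA, "Rank", NA, "Rank", NA),
  stringsAsFactors = FALSE
)

# Index of the row with the most non-NA values (first one wins on ties)
best <- which.max(rowSums(!is.na(df)))

# That row as a single-row data frame, and as a plain character vector
best_row <- df[best, ]
row_vec  <- unlist(best_row, use.names = FALSE)
```

Note that max() on its own returns the highest count of non-NA values, whereas which.max() returns the position of the row that has it, which is what the row subscript needs.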
I am new to R and have a fairly simple question whose answer I just can't figure out. For my example I will use a data frame with 3 columns, but my actual data set is 139 columns with 10000 rows.
I want to replace all of the values in a given row with NA if the value in the same row in column C contains a value < 10.
Assume that all of my columns are either number or integer values.
So I want to take the data frame:
x=data.frame(c(5,9,2),c(3,4,6),c(12,9,11))
names(x)=c("A","B","C")
and replace row 2 with NA to create
y=data.frame(c(5,NA,2),c(3,NA,6),c(12,NA,11))
names(y)=c("A","B","C")
Thanks!
How about:
x[x$C <10 ,] <- NA
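One caveat, shown as a sketch with made-up data: if column C itself contains NA, the comparison x$C < 10 yields NA for those rows, and data frame assignment typically fails with an error when the row index contains missing values. Wrapping the condition in which() is one way around that, assuming rows with an unknown C should be left as they are.

```r
# Hypothetical data: like the question's x, plus a row where C is already NA
x <- data.frame(A = c(5, 9, 2, 7),
                B = c(3, 4, 6, 8),
                C = c(12, 9, 11, NA))

# which() keeps only the positions where the comparison is TRUE and drops NAs,
# so only rows where C is known to be below 10 are set to NA
x[which(x$C < 10), ] <- NA
```

Here row 2 is set to NA in every column, while row 4 keeps its A and B values.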