remove rows with NA values in a specific column - r

I have a huge dataset of about 1.6 million rows, and the variable (column) I need to focus on is 'temperature'. The temperature column has many NA values, and the other columns have NA values throughout as well. I want to remove only the rows with NA values in the temperature column; I don't particularly care about the NA values in the other columns. How can I do this? If I end up needing to remove rows with NA values in more than just the temperature column (e.g. the depth column as well), how can I select two columns? This is my code:
otn <- tidync(filename, row.names=TRUE) %>% activate('D0')
glider_table <- hyper_tibble(otn)
attach(glider_table)
summary(temperature)
na.omit(glider_table)
na.omit() removes all rows with NA values regardless of which column they're in, so I need something more selective.

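You can use the drop_na() function from the tidyr package. The first argument is the dataset, and the remaining arguments are optional: they name the specific columns to check for NA values. Rows that have NA only in other columns are kept.
For example: drop_na(glider_table, temperature)
To check more than one column, list them all: drop_na(glider_table, temperature, depth)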
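If loading tidyr is not an option, the same filter can be sketched in base R. The data frame below is a made-up stand-in for glider_table (only the column names temperature and depth come from the question; salinity and all values are invented for illustration).

```r
# Toy stand-in for glider_table; values are invented for illustration.
glider_table <- data.frame(
  temperature = c(10.2, NA, 11.5, NA, 9.8),
  depth       = c(5, 10, NA, 20, 25),
  salinity    = c(NA, 35.1, 35.0, NA, 34.9)
)

# Keep rows where temperature is present; NAs in other columns are left alone.
temp_only <- glider_table[!is.na(glider_table$temperature), ]

# Keep rows where both temperature and depth are present.
both <- glider_table[complete.cases(glider_table[, c("temperature", "depth")]), ]

nrow(temp_only)  # 3
nrow(both)       # 2
```

Note that the first row survives both filters even though its salinity is NA, which is the selective behaviour the question asks for.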

Related

Columns With NA Values

I have a dataframe in which I want to identify columns that have NA values.
Only about 10% of the columns have NA values, so I only want to list out the columns where the number of NAs is greater than 0.
The line below returns counts for all columns. Is there a way to filter out the columns that don't have NA values?
colSums(is.na(df))
In pandas (Python), to list the columns with at least one null value you can use:
df.columns[df.isna().any()].tolist()
To list the columns where every value is null you can use:
df.columns[df.isna().all()].tolist()
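The question's own code is R, while the two snippets above use pandas syntax (df.isna(), .tolist()). A rough R counterpart, building on the colSums(is.na(df)) line from the question, might look like this. The data frame and its column names are invented for illustration.

```r
# Toy data frame; values are invented for illustration.
df <- data.frame(
  a = c(1, 2, 3),
  b = c(NA, 5, 6),
  c = c("x", "y", "z"),
  d = c(NA, NA, NA)
)

na_counts <- colSums(is.na(df))

# Only the columns with at least one NA, together with their counts.
na_counts[na_counts > 0]

# Names of columns with at least one NA.
some_na <- names(df)[na_counts > 0]

# Names of columns that are entirely NA.
all_na <- names(df)[na_counts == nrow(df)]
```

Here some_na holds "b" and "d", and all_na holds only "d".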

Predict values in place of NA by using the original data frame?

I have a dataset and the row which I need has an NA value.
If I use na.omit, the rows will be omitted; however, I need the row. Hence I need to predict a value in place of the NA.
How do I proceed?
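No answer is recorded for this question here. As one possible direction, and only as a sketch, a regression fitted on the complete rows can be used to fill in the missing value. The columns x and y are hypothetical, and whether a linear model is a sensible choice depends entirely on the real data.

```r
# Toy data; y is missing in one row that we still want to keep.
dat <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, NA, 8, 10)
)

# Fit on the complete rows only (lm() drops rows with NA by default).
fit <- lm(y ~ x, data = dat)

# Predict y for the rows where it is missing and fill those cells in.
missing <- is.na(dat$y)
dat$y[missing] <- predict(fit, newdata = dat[missing, , drop = FALSE])
```

The row is kept and its NA is replaced by the model's prediction rather than being dropped by na.omit().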

R: Replace column names with row values except where cells equal NA

I have a data frame extracted from a database that contains different types of data (record types). The different record types have different column names, which occupy the first three rows (including the header). This data frame is made to be used in Excel, where you can easily filter the data by choosing the correct record type.
Here I present a small sample of my data frame, which in reality contains many more columns (59) as well as rows (34000).
sample <- data.frame(X01RecordType=c("01HL","01CA","HH","HH","HH","HL"), X02Quarter=c(NA,NA,2,2,2,1),X05Gear=c(NA,NA,"KRA","KRA","KRA",NA),X06SweepLngt=c(NA,NA,35,35,-9,-9),
X12Month=c("12SpecCodeType",NA,4,5,4,2), X13Day=c("13SpecCode",NA,26,5,25,160617), X22StatRec=c("22LngtCode","22CANoAtLngt","45G1",NA,NA,NA),X23Depth=c("23LngtClass","23IndWgt",41,NA,63,NA))
As you might see the cells which contain column names are preceded by an X and a number and then a text, e.g. X01RecordType. It would be very easy to replace column names with the first rows by using:
colnames(df) <- df[1,]
However, as you can see some of the cells in the first two rows also contain NA-values. These NA-values indicate that the column names are the same for all record types, using the current header and therefore I would like to keep these. So really what I would like to do is replace the column names with the values of the first row (where record type header equals 01HL) except for NA-values.
If possible I would like to do this without using any external packages. Cells within the data may also contain NA values, and I would like to keep those rows, so filtering out every column that contains an NA is not an option unless it applies only to the first row. That is how I tried to approach the problem, but I can't figure out how to do it.
I hope this is all the information required to help me out and thanks!
Another option without a loop
colnames(sample)[!is.na(sample[1,])] <- sample[1,][!is.na(sample[1,])]
sample[1:2,]
# 01HL X02Quarter X05Gear X06SweepLngt 12SpecCodeType 13SpecCode 22LngtCode
#1 01HL NA <NA> NA 12SpecCodeType 13SpecCode 22LngtCode
#2 01CA NA <NA> NA <NA> <NA> 22CANoAtLngt
# 23LngtClass
#1 23LngtClass
#2 23IndWgt
I suggest a simple loop:
for (i in seq_along(sample)) if (!is.na(sample[1, i])) colnames(sample)[i] <- as.character(sample[1, i])
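A variant of the same idea, sketched in base R on a cut-down version of the question's data: the first row is converted to a plain character vector first, so the result does not depend on whether text columns were read as factors. The names dat, first_row and keep are only illustrative.

```r
# Small stand-in for the question's data frame (three of its columns, three rows).
dat <- data.frame(
  X01RecordType = c("01HL", "01CA", "HH"),
  X02Quarter    = c(NA, NA, 2),
  X22StatRec    = c("22LngtCode", "22CANoAtLngt", "45G1"),
  stringsAsFactors = FALSE
)

# First row as a character vector, one element per column (factor-safe).
first_row <- vapply(dat[1, ], as.character, character(1))

# Overwrite only the names whose first-row cell is not NA.
keep <- !is.na(first_row)
names(dat)[keep] <- first_row[keep]

names(dat)  # "01HL" "X02Quarter" "22LngtCode"
```

The data rows themselves are untouched, so NA values further down the columns are preserved.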

Select a row from a data frame with the least number of NA values

So I have a block of text that I've separated into a vector, and from each line of the vector I've further separated it into a data frame. In a perfect world, every row of the DF would be exactly the same, but it's not, and there are a number of rows with NA values in them. What I need to do is select the row from the data frame with the least number of NA values.
So say the DF looked like this:
Name Year NA Address NA State NA
Name Year ID Address City State Rank
Name Year NA NA City State NA
Name NA NA NA NA NA Rank
Name Year NA NA NA NA NA
where each value belongs to its own column. So I need a way to identify which row has the fewest NAs, and then select that row's elements. Ultimately I want the return value to be a single-row DF (or preferably a vector) that reads
Name Year ID Address City State Rank
In this case, row 2.
I know that:
max( rowSums(!is.na(x)) )
Will return me the row# with the most number of not-na values, but I can't seem to figure out how to grab the elements of that row. I was thinking using which() would work, but I can't seem to figure it out.
Thanks for your help!
David
If your data frame is df, then:
df[which.max(rowSums(!is.na(df))),]
Should return the single-row data frame with the fewest NAs.
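A small worked example of the line above, on invented data. Note that max() returns the largest count of non-NA cells, whereas which.max() returns the position of the row that has it, and that position is what the row index needs.

```r
# Toy data frame; values are invented for illustration.
df <- data.frame(
  name = c("A", "B", "C"),
  year = c(2001, 2002, NA),
  city = c(NA, "Oslo", NA),
  stringsAsFactors = FALSE
)

# Number of non-NA cells in each row.
filled <- rowSums(!is.na(df))

# which.max() gives the position of the first maximum, so ties go to the earliest row.
best <- df[which.max(filled), ]

# The same row as a plain character vector, as the question prefers.
best_vec <- unlist(lapply(best, as.character))
```

Here the second row is selected, and best_vec contains "B", "2002" and "Oslo".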

Change values in row based on a column value r

I am new to R and have a fairly simple question that I just can't figure out the answer to. For my example I will use a data frame with 3 columns, but my actual data set has 139 columns and 10000 rows.
I want to replace all of the values in a given row with NA if the value in the same row in column C contains a value < 10.
Assume that all of my columns are either number or integer values.
so I want to take the data frame:
x=data.frame(c(5,9,2),c(3,4,6),c(12,9,11))
names(x)=c("A","B","C")
and replace row 2 with NA to create
y=data.frame(c(5,"NA",2),c(3,"NA",6),c(12,"NA",11))
names(y)=c("A","B","C")
Thanks!
How about:
x[x$C <10 ,] <- NA
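One caveat worth sketching: if column C itself contains NA, the comparison x$C < 10 yields NA for those rows, and a logical index containing NA can make the data frame assignment fail with an error. Wrapping the test in which() is a common way around this, shown here on the question's own data.

```r
x <- data.frame(A = c(5, 9, 2), B = c(3, 4, 6), C = c(12, 9, 11))

# which() converts the logical test into row positions and skips NA results,
# so the assignment also works when C has missing values.
x[which(x$C < 10), ] <- NA

x
```

Row 2 becomes NA in every column, and the columns stay numeric, unlike the "NA" strings in the question's hand-built y.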
