In the midst of merging several data sets, I'm trying to remove all rows of a data frame that have a missing value for one particular variable (I want to keep the NAs in some of the other columns for the time being). I used the following line:
data.frame <- data.frame[!is.na(data.frame$year),]
This successfully removes all rows with NAs for year (and no others), but the other columns, which previously had data, are now entirely NA. In other words, non-missing values appear to be converted to NA. Any ideas as to what's going on here? I've tried these alternatives and got the same outcome:
data.frame <- subset(data.frame, !is.na(year))
data.frame$x <- ifelse(is.na(data.frame$year), 1, 0)
data.frame <- subset(data.frame, x == 0)
Am I using is.na incorrectly? Are there any alternatives to is.na in this scenario? Any help would be greatly appreciated!
Edit: Here is code that should reproduce the issue:
#data
tc <- read.csv("http://dl.dropbox.com/u/4115584/tc2008.csv")
frame <- read.csv("http://dl.dropbox.com/u/4115584/frame.csv")
#standardize NA codes
tc[tc == "."] <- NA
tc[tc == -9] <- NA
#standardize spatial units
colnames(frame)[1] <- "loser"
colnames(frame)[2] <- "gainer"
frame$dyad <- paste(frame$loser,frame$gainer,sep="")
tc$dyad <- paste(tc$loser,tc$gainer,sep="")
drops <- c("loser","gainer")
tc <- tc[,!names(tc) %in% drops]
frame <- frame[,!names(frame) %in% drops]
rm(drops)
#merge tc into frame
data <- merge(tc, frame, by.x = "year", by.y = "dyad", all.x=T, all.y=T) #year column is duplicated in this process. I haven't had this problem with nearly identical code using other data.
rm(tc,frame)
#the first column in the new data frame is the duplicate year, which does not actually contain years. I'll rename it.
colnames(data)[1] <- "double"
summary(data$year) #shows 833 NA's
summary(data$procedur) #note that at this point there are non-NA values
#later, I want to create 20 year windows following the events in the tc data. For simplicity, I want to remove cases with NA in the year column.
new.data <- data[!is.na(data$year),]
#now let's see what the above operation did
summary(new.data$year) #missing years were successfully removed
summary(new.data$procedur) #this variable is now entirely NA's
I think the actual problem is with your merge.
After you merge and have the data in data, if you do:
# > table(data$procedur, useNA="always")
# 1 2 3 4 5 6 <NA>
# 122 112 356 59 39 19 192258
You can see that data$procedur has a number of non-NA values (122 + 112 + ... + 19). However, every one of them sits in a row where data$year is NA:
> all(is.na(data$year[!is.na(data$procedur)]))
# [1] TRUE # every value of procedur occurs where year = NA
So nothing is being converted to NA: when you drop the rows where year is NA, you also drop every row that holds a value for procedur.
To solve this problem, I think you should use merge as:
merge(tc, frame, all = TRUE) # by default, merge joins on the columns common to both data frames
# this also avoids the duplicated year column
Check if this merge gives you the desired result.
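To see why the values seem to "turn into" NA, here is a minimal sketch on made-up data (the data frames left and right and their columns are hypothetical, not taken from the linked CSV files). It assumes, as in the question, that the two key columns passed to by.x and by.y never share a value:

```r
# Toy data (hypothetical): 'left' carries year, 'right' carries procedur
left  <- data.frame(key = c("a", "b"), year = c(1990, 1991))
right <- data.frame(id = c("x", "y"), procedur = c(1, 2))

# The keys never match, so all = TRUE keeps every row from both sides
# and pads the columns of the other side with NA.
m <- merge(left, right, by.x = "key", by.y = "id", all = TRUE)

nrow(m)                                  # 4 rows: 2 from left + 2 from right
all(is.na(m$year[!is.na(m$procedur)]))   # TRUE: procedur only exists where year is NA

# Dropping rows with a missing year therefore drops every procedur value
kept <- m[!is.na(m$year), ]
all(is.na(kept$procedur))                # TRUE
```

Running the same all(is.na(...)) check on your merged data is a quick way to tell a bad join from a bad filter.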
Try complete.cases:
data.frame.clean <- data.frame[complete.cases(data.frame$year),]
...though, as noted above, you may want to pick a more descriptive name.
I found sort of the reverse question here: R: Replace multiple values in multiple columns of dataframes with NA
But I couldn't make it work with my data. In my case, I want to find the NA's and replace them with the value from another column.
I have a dataset dta1 in which there are 2493 variables I am interested in manipulating. Aside from these 2493 variables there's a column var_fill. When any of the columns named in vars is NA I want to fill it in with the value from var_fill. I tried reverse engineering the solution posted above but it gives me multiple warnings of:
1: In `[<-.factor`(`*tmp*`, list, value = structure(c(16946L, ... : invalid factor level, NA generated
2: In x[...] <- m : number of items to replace is not a multiple of replacement length
And also just doesn't work.
vars <- sprintf("var%0.4d",seq(1:2493))
dta1[vars] <- lapply(dta1[vars], function(x) replace(x,is.na(x), dta1$var_fill) )
I apologize but because of the size of this data I couldn't generate a full reproducible dataset so I heavily subsetted it but I am working with about 3000 columns and 240K rows of data.
Here's the data: https://drive.google.com/file/d/1oj_nhd99ftgN1Bh930_IRQftLACR2FO9/view?usp=sharing
It's too big to post even though it's only 10 people.
Turn the columns to characters and replace the NA values with the corresponding var_fill value.
dta1$var_fill <- as.character(dta1$var_fill)
dta1[vars] <- lapply(dta1[vars], function(x) {
x <- as.character(x)
x[is.na(x)] <- dta1$var_fill[is.na(x)]
x
})
In dplyr, you can use coalesce.
library(dplyr)
dta1 <- dta1 %>% mutate(across(all_of(vars), ~coalesce(., var_fill)))
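The two warnings in the question suggest what went wrong: replace() was handed the whole var_fill column rather than only the elements at the NA positions, and the target columns were factors. As a rough base R sketch of the same fill on made-up data (dta and its two var columns are hypothetical stand-ins for dta1 and its 2493 columns), assuming the columns are already character:

```r
# Toy stand-in for dta1 (hypothetical values)
dta <- data.frame(
  var0001  = c("a", NA, "c"),
  var0002  = c(NA, "e", NA),
  var_fill = c("x", "y", "z"),
  stringsAsFactors = FALSE
)
vars <- c("var0001", "var0002")

# ifelse() picks, element by element, the var_fill value where x is NA
# and the original value otherwise, so the lengths always line up.
dta[vars] <- lapply(dta[vars], function(x) ifelse(is.na(x), dta$var_fill, x))

dta$var0001   # "a" "y" "c"
dta$var0002   # "x" "e" "z"
```

With factor columns you would still need the as.character() conversion shown above before filling.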
I have a data frame of NA values and specified dimensions, and I am trying to replace the rows in that data frame with the rows from other data frames, but the replacement rows have several data types, and the factors keep getting converted to integers. How do I stop this?
#Example
df.na <- data.frame(matrix(NA,nrow=2,ncol=3)) #data frame of NA values
df <- cbind.data.frame(c("a","b"),c(1,2),c(TRUE,FALSE)) #data frame of values I want
df.na[1,] <- df[1,] #replace row 1
str(df.na) #column one is now an integer not a factor
The purpose behind this is that it is part of a for loop, and I'm using the indexing to replace rows as I go (each iteration generates a row) rather than building the data frame one iteration at a time.
Thanks in advance for the help
You can prevent the conversion to integer by storing the column as character rather than factor in the first place, using stringsAsFactors = FALSE:
df.na <- data.frame(matrix(NA,nrow=2,ncol=3))
df <- cbind.data.frame(c("a","b"),c(1,2),c(TRUE,FALSE), stringsAsFactors = FALSE)
df.na[1,] <- df[1,]
df.na
# X1 X2 X3
# 1 a 1 TRUE
# 2 <NA> NA NA
If you need the column as factor, use
df.na[, 1] <- factor(df.na[, 1])
But this is usually not necessary.
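Since the rows are filled inside a loop, another option is to preallocate the data frame with typed NA columns instead of a logical NA matrix, so each column already has the class you expect. This is only a sketch: the names res, label, value and flag are made up for illustration. (Since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE; it is spelled out here so the sketch behaves the same on older versions.)

```r
n <- 2

# Preallocate with typed NAs so each column has its final class up front
res <- data.frame(
  label = rep(NA_character_, n),
  value = rep(NA_real_, n),
  flag  = rep(NA, n),
  stringsAsFactors = FALSE
)

for (i in seq_len(n)) {
  # Each iteration produces a one-row data frame (hypothetical values)
  new_row <- data.frame(
    label = letters[i],
    value = as.numeric(i),
    flag  = (i %% 2 == 1),
    stringsAsFactors = FALSE
  )
  res[i, ] <- new_row
}

sapply(res, class)   # label: character, value: numeric, flag: logical
```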
How can I know how many values are NA in a dataset? OR if there are any NAs and NaNs in dataset?
This may also work fine
sum(is.na(df)) # For entire dataset
For a particular column in a dataset:
sum(is.na(df$col1))
Or, to check all the columns at once, as mentioned by @nicola:
colSums(is.na(df))
As @Roland noticed, there are multiple functions for finding and dealing with missing values in R (see help("NA")).
Example:
Create a fake dataset with some NA's:
data <- matrix(1:300,,3)
data[sample(300, 40)] <- NA
Check if there are any missing values:
anyNA(data)
Columnwise check if there are any missing values:
apply(data, 2, anyNA)
Check percentages and counts of missing values in columns:
colMeans(is.na(data))*100
colSums(is.na(data))
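The question also asks about NaN. is.na() returns TRUE for both NA and NaN, while is.nan() is TRUE only for NaN, so the two can be counted separately. A small sketch on made-up values (x and df here are hypothetical); note that is.nan() has no data frame method, so it is applied column by column:

```r
x <- c(1, NA, NaN, 4, NaN)

sum(is.na(x))                 # 3: counts NA and NaN together
sum(is.nan(x))                # 2: NaN only
sum(is.na(x) & !is.nan(x))    # 1: "true" NA only

# For a data frame of numeric columns, apply is.nan() per column
df <- data.frame(a = c(1, NaN, NA), b = c(NA, 2, 3))
nan_per_col <- sapply(df, function(col) sum(is.nan(col)))
nan_per_col                   # a: 1, b: 0
```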
For a data frame:
sum(is.na(df))
Here df is the data frame.
For a particular column in the data frame, you can use:
sum(is.na(df$col))
or
cnt <- 0
for (i in df$col) {
  if (is.na(i)) {
    cnt <- cnt + 1
  }
}
cnt
Here cnt gives the number of NAs in the column.
You can also get the number of NAs in each column of a dataset with summary().
For a vector x
summary(x)
For a data frame df
summary(df)
I have a data frame (df) that has some NA values. I wanted to extract the rows where there are NA values across multiple columns (in the example below, I am doing so for columns 12-20):
NArows = which(is.na(df[,20])&is.na(df[,19])&is.na(df[,18])&is.na(df[,17])&is.na(df[,16])&is.na(df[,15])&is.na(df[,14])&is.na(df[,13])&is.na(df[,12]))
Is there a more readable (and condensed) way to accomplish this, without joining a separate condition for each column with &?
Thank you for any help...
Try this:
cols <- 12:20
NArows <- which(apply(df[cols],1,function(y)sum(!is.na(y))==0))
It slices your df to just the 'cols' you care about, then counts the non-NA values in each row; if there are none (all values in those columns are NA), that row number is added to NArows.
Or, based on David Arenburg's answer, this flags any rows with 2 or more NAs:
NArows <- which(rowSums(is.na(df[12:20])) > 1L)
Adapting it to more closely match your requirements, this flags only rows where all are NAs:
cols <- 12:20
NArows <- which(rowSums(is.na(df[cols])) == ncol(df[cols]))
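Here is a quick check of the rowSums() approach on a small made-up data frame (the columns id, p, q and r are hypothetical), selecting columns by name rather than by position:

```r
df <- data.frame(
  id = 1:4,
  p  = c(NA, 1, NA, 4),
  q  = c(NA, NA, NA, 5),
  r  = c(NA, 3, 2, 6)
)
cols <- c("p", "q", "r")

# Number of NAs per row, restricted to the columns of interest
na_count <- rowSums(is.na(df[cols]))

all_na <- which(na_count == length(cols))   # row 1: every selected column is NA
any_na <- which(na_count > 0)               # rows 1, 2, 3: at least one NA
```

Changing the comparison on na_count is all it takes to switch between "all NA", "any NA", or "at least k NAs".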
I need help counting the number of non-missing data points across files and subsetting out only two columns of the larger data frame.
I was able to limit the data to only valid responses, but then I struggled getting it to return only two of the columns.
I found http://www.statmethods.net/management/subset.html and tried their solution, but myvars did not hold my column label; it returned the vector of data (1:10). My code was:
myvars <- c("key")
answer <- data_subset[myvars]
answer
But instead of printing out my data subset with only the "key" column, it returns the following errors:
"Error in [.data.frame(observations_subset, myvars) : undefined columns selected" and "Error: object 'answer' not found"
Lastly, I'm not sure how I count occurrences. In Excel, they have a simple "Count" function, and in SPSS you can aggregate based on the count, but I couldn't find a command similarly titled in R. The incredibly long way that I was going to go about this once I had the data subsetted was adding in a column of nothing but 1's and summing those, but I would imagine there is an easier way.
To count unique occurrences, use table.
For example:
# load the "iris" data set that's built into R
data(iris)
# print the count of each species
table(iris$Species)
Take note of the handy function prop.table for converting a table into proportions, and of the fact that table can actually take a second argument to get a cross-tab. There's also an argument useNA, to include missing values as unique items (instead of ignoring them).
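A short sketch of those options on a made-up vector (x is hypothetical), including a direct count of non-missing values, which is what the question asks for:

```r
x <- c("a", "b", "a", NA, "a")

counts <- table(x)                        # a: 3, b: 1 (NA is ignored by default)
counts_na <- table(x, useNA = "ifany")    # a: 3, b: 1, <NA>: 1
props <- prop.table(counts)               # a: 0.75, b: 0.25

# Number of non-missing data points, no helper column of 1s needed
n_valid <- sum(!is.na(x))                 # 4
```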
Not sure whether this is what you wanted.
First, create some example data, since the post mentions multiple files.
set.seed(42)
d1 <- as.data.frame(matrix(sample(c(NA,0:5), 5*10, replace=TRUE), ncol=10))
set.seed(49)
d2 <- as.data.frame(matrix(sample(c(NA,0:8), 5*10, replace=TRUE), ncol=10))
Create a list with datasets as the list elements
l1 <- mget(ls(pattern="d\\d+"))
Create an index to pick the list element that has the most non-missing values
indx <- which.max(sapply(l1, function(x) sum(!is.na(x))))
Key of columns to subset from the larger (non-missing) dataset
key <- c("V2", "V3")
Subset the dataset
l1[[indx]][key]
# V2 V3
#1 1 1
#2 1 3
#3 0 0
#4 4 5
#5 7 8
names(l1[indx])
#[1] "d2"