Stargazer package gives me really nice descriptive table to include in latex document.
library(stargazer)
stargazer(attitude)
Is it possible to add a column reporting number of NAs for each of the variables?
This snippet will give you the number of NAs per row and column in a data frame
# make some data
count <- 10
data <- data.frame( a=runif(count), b=runif(count))
# add some NAs
data[data$a>0.5,]$a <- NA
data[data$b>0.5,]$b <- NA
# NAs per row
data$NACount <- apply(data, 1, function(x) {length(x[is.na(x)])})
# NAs per column
NACountsByColumn <- lapply(data, function(x) {length(x[is.na(x)])} )
The "N" column of stargazer does not count the NAs. Alhough there is not direct way to input NAs in the output table, you can simply calculate it by
nrow(dataset) - the number in "N" column
Related
I used below codes to identify outliers on different columns:
outliers_x1 <- boxplot(mydata$x1, plot=FALSE)$out
outliers_x4 <- boxplot(mydata$x4, plot=FALSE)$out
outliers_x6 <- boxplot(mydata$x6, plot=FALSE)$out
Now, how can I remove those outliers from the dataset by one code?
This will set any outlier values to NA, and then optionally remove all rows where any column contains an outlier. Works with arbitrary number of columns.
Uses data.table for convenience.
library(data.table)
library(matrixStats)
##
# create sample data
#
set.seed(1)
dt <- data.table(x1=rnorm(100), x2=rnorm(100), x3=rnorm(100))
##
# incorporate possible outliers
#
dt[sample(100, 5), x1:=10*x1]
dt[sample(100, 5), x2:=10*x2]
dt[sample(100, 5), x3:=10*x3]
##
# you start here...
# remove all rows where any column contains an outlier
#
indx <- sapply(dt, \(x) !(x %in% boxplot(x, plot=FALSE)$out))
dt[as.logical(rowProds(indx))]
In the above, indx is a matrix with three logical columns. Each element is TRUE unless the corresponding column contained an outlier in that row. We use rowProds(...) from the matrixStats package to multiply ( & ) the 3 rows together. Unfortunately this converts everything numeric (1, 0), so we have to convert back to logical to use as an index into dt.
##
# replaces outliers with NA in each column
#
dt.melt <- melt(dt[, id:=seq(.N)], id='id')
dt.melt[, ol:=(value %in% boxplot(value, plot=FALSE)$out), by=.(variable)]
dt.melt[(ol), value:=NA]
result <- dcast(dt.melt, id~variable)[, id:=NULL]
##
# remove all rows where any column contains an outlier
#
na.omit(result)
In the code above we add an id column, then melt(...) so all other columns are in one column (value) with a second column (variable) indicating the original source column. Then we apply the boxplot(...) algorithm group-wise (by variable) to produce an ol column indicating an outlier. Then we set any value corresponding to ol == TRUE to NA. Then we re-convert to your original wide format with dcast(...) and remove the id.
It's a bit roundabout but this melt - process - dcast pattern is common when processing multiple columns like this.
Finally, na.omit(result) will remove any rows which have NA in any of the columns. If that's what you want it's simpler to use the first approach.
I found sort of the reverse question here: R: Replace multiple values in multiple columns of dataframes with NA
But I couldn't make it work with my data. In my case, I want to find the NA's and replace them with the value from another column.
I have a dataset dta1 in which there are 2493 variables I am interested in manipulating. Aside from these 2493 variables there's a column var_fill. When any of the columns named in vars is NA I want to fill it in with the value from var_fill. I tried reverse engineering the solution posted above but it gives me multiple warnings of:
1: In `[<-.factor`(`*tmp*`, list, value = structure(c(16946L, ... : invalid factor level, NA generated
2: In x[...] <- m : number of items to replace is not a multiple of replacement length
And also just doesn't work.
vars <- sprintf("var%0.4d",seq(1:2493))
dta1[vars] <- lapply(dta1[vars], function(x) replace(x,is.na(x), dta1$var_fill) )
I apologize but because of the size of this data I couldn't generate a full reproducible dataset so I heavily subsetted it but I am working with about 3000 columns and 240K rows of data.
Here's the data: https://drive.google.com/file/d/1oj_nhd99ftgN1Bh930_IRQftLACR2FO9/view?usp=sharing
It's too big to post even though it's only 10 people.
Turn the columns to characters and replace the NA values with the corresponding var_fill value.
dta1$var_fill <- as.character(dta1$var_fill)
dta1[vars] <- lapply(dta1[vars], function(x) {
x <- as.character(x)
x[is.na(x)] <- dta1$var_fill[is.na(x)]
x
})
In dplyr, you can use coalesce.
library(dplyr)
dta1 <- dta1 %>% mutate(across(all_of(vars), ~coalesce(., var_fill)))
How can I know how many values are NA in a dataset? OR if there are any NAs and NaNs in dataset?
This may also work fine
sum(is.na(df)) # For entire dataset
for a particular column in a dataset
sum(is.na(df$col1))
Or to check for all the columns as mentioned by #nicola
colSums(is.na(df))
As #Roland noticed there are multiple functions for finding and dealing with missing values in R (see help("NA") and here).
Example:
Create a fake dataset with some NA's:
data <- matrix(1:300,,3)
data[sample(300, 40)] <- NA
Check if there are any missing values:
anyNA(data)
Columnwise check if there are any missing values:
apply(data, 2, anyNA)
Check percentages and counts of missing values in columns:
colMeans(is.na(data))*100
colSums(is.na(data))
For a dataframe it is:
sum(is.na(df)
here df is the dataframe
where as for a particular column in the dataframe you can use:
sum(is.na(df$col)
or
cnt=0
for(i in df$col){
if(is.na(i)){
cnt=cnt+1
}
}
cnt
here cnt gives the no. of NA in the column
You can simply get the number of "NA" included in the each column of dataset by using R.
For a vector x
summary(x)
For a data frame df
summary(df)
I need help counting the number of non-missing data points across files and subsetting out only two columns of the larger data frame.
I was able to limit the data to only valid responses, but then I struggled getting it to return only two of the columns.
I found http://www.statmethods.net/management/subset.html and tried their solution, but myvars did not house my column label, it return the vector of data (1:10). My code was:
myvars <- c("key")
answer <- data_subset[myvars]
answer
But instead of printing out my data subset with only the "key" column, it returns the following errors:
"Error in [.data.frame(observations_subset, myvars) : undefined columns selected" and "Error: object 'answer' not found
Lastly, I'm not sure how I count occurrences. In Excel, they have a simple "Count" function, and in SPSS you can aggregate based on the count, but I couldn't find a command similarly titled in R. The incredibly long way that I was going to go about this once I had the data subsetted was adding in a column of nothing but 1's and summing those, but I would imagine there is an easier way.
To count unique occurrences, use table.
For example:
# load the "iris" data set that's built into R
data(iris)
# print the count of each species
table(iris$Species)
Take note of the handy function prop.table for converting a table into proportions, and of the fact that table can actually take a second argument to get a cross-tab. There's also an argument useNA, to include missing values as unique items (instead of ignoring them).
Not sure whether this is what you wanted.
Creating some data as it was mentioned in the post as multiple files.
set.seed(42)
d1 <- as.data.frame(matrix(sample(c(NA,0:5), 5*10, replace=TRUE), ncol=10))
set.seed(49)
d2 <- as.data.frame(matrix(sample(c(NA,0:8), 5*10, replace=TRUE), ncol=10))
Create a list with datasets as the list elements
l1 <- mget(ls(pattern="d\\d+"))
Create a index to subset the list element that has the maximum non-missing elements
indx <- which.max(sapply(l1, function(x) sum(!is.na(x))))
Key of columns to subset from the larger (non-missing) dataset
key <- c("V2", "V3")
Subset the dataset
l1[[indx]][key]
# V2 V3
#1 1 1
#2 1 3
#3 0 0
#4 4 5
#5 7 8
names(l1[indx])
#[1] "d2"
In the midst of merging several data sets, I'm trying to remove all rows of a data frame that have a missing value for one particular variable (I want to keep the NAs in some of the other columns for the time being). I used the following line:
data.frame <- data.frame[!is.na(data.frame$year),]
This successfully removes all rows with NAs for year, (and no others), but the other columns, which previously had data, are now entirely NAs. In other words, non-missing values are being converted to NA. Any ideas as to what's going on here? I've tried these alternatives and got the same outcome:
data.frame <- subset(data.frame, !is.na(year))
data.frame$x <- ifelse(is.na(data.frame$year) == T, 1, 0);
data.frame <- subset(data.frame, x == 0)
Am I using is.na incorrectly? Are there any alternatives to is.na in this scenario? Any help would be greatly appreciated!
Edit Here is code that should reproduce the issue:
#data
tc <- read.csv("http://dl.dropbox.com/u/4115584/tc2008.csv")
frame <- read.csv("http://dl.dropbox.com/u/4115584/frame.csv")
#standardize NA codes
tc[tc == "."] <- NA
tc[tc == -9] <- NA
#standardize spatial units
colnames(frame)[1] <- "loser"
colnames(frame)[2] <- "gainer"
frame$dyad <- paste(frame$loser,frame$gainer,sep="")
tc$dyad <- paste(tc$loser,tc$gainer,sep="")
drops <- c("loser","gainer")
tc <- tc[,!names(tc) %in% drops]
frame <- frame[,!names(frame) %in% drops]
rm(drops)
#merge tc into frame
data <- merge(tc, frame, by.x = "year", by.y = "dyad", all.x=T, all.y=T) #year column is duplicated in this process. I haven't had this problem with nearly identical code using other data.
rm(tc,frame)
#the first column in the new data frame is the duplicate year, which does not actually contain years. I'll rename it.
colnames(data)[1] <- "double"
summary(data$year) #shows 833 NA's
summary(data$procedur) #note that at this point there are non-NA values
#later, I want to create 20 year windows following the events in the tc data. For simplicity, I want to remove cases with NA in the year column.
new.data <- data[!is.na(data$year),]
#now let's see what the above operation did
summary(new.data$year) #missing years were successfully removed
summary(new.data$procedur) #this variable is now entirely NA's
I think the actual problem is with your merge.
After you merge and have the data in data, if you do:
# > table(data$procedur, useNA="always")
# 1 2 3 4 5 6 <NA>
# 122 112 356 59 39 19 192258
You see there are these many (122+112...+19) values for data$procedur. But, all these values are corresponding to data$year = NA.
> all(is.na(data$year[!is.na(data$procedur)]))
# [1] TRUE # every value of procedur occurs where year = NA
So, basically, all values of procedur are also removed because you removed those rows checking for NA in year.
To solve this problem, I think you should use merge as:
merge(tc, frame, all=T) # it'll automatically calculate common columns
# also this will not result in duplicated year column.
Check if this merge gives you the desired result.
Try complete.cases:
data.frame.clean <- data.frame[complete.cases(data.frame$year),]
...though, as noted above, you may want to pick a more descriptive name.