Splitting a data frame based on NA values in R

I want to split a dataset in R based on NA values from a variable, for example:
var1 var2
1 21
2 NA
3 NA
4 10
and make it like this:
var1 var2
1 21
4 10
var1 var2
2 NA
3 NA

More details:
Most model-fitting functions (e.g., lm()) have an na.action argument that applies to the model as a whole, not to individual variables:
na.fail() returns the object (the dataset) if it contains no NA values; otherwise it signals an error, stopping the analysis.
na.pass() returns the data object whether or not it has NA values, which is useful if the function handles NA values internally.
na.omit() returns the object with entire observations (rows) removed if any of the variables used in the model are NA for that observation.
na.exclude() is the same as na.omit(), except that it lets functions use naresid() or napredict() to pad residuals and predictions back out to the original length.
You can think of na.action as a function applied to your data object, the result being the data that lm() actually fits. Note that lm() takes the formula first; na.action can be passed as a parameter:
lm(y ~ a + b + c, data = na.omit(dataset))
lm(y ~ a + b + c, data = dataset, na.action = na.omit) # same as above, and the more common usage
You can set your default handling of missing values with
options(na.action = na.omit)
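A small sketch of the na.omit()/na.exclude() difference on hypothetical data: both drop the NA row when fitting, but na.exclude() pads the residuals back to the original length.

```r
# Hypothetical data with one missing response
d <- data.frame(y = c(1, 2, NA, 4, 5), x = 1:5)

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

length(residuals(fit_omit))  # 4 -- the NA row is simply dropped
length(residuals(fit_excl))  # 5 -- naresid() pads an NA back in at row 3
```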

You could just subset the data frame using is.na():
df1 <- df[!is.na(df$var2), ]
df2 <- df[is.na(df$var2), ]

Try this:
new_DF <- DF[rowSums(is.na(DF)) > 0, ]
or, if you want to check a particular column:
new_DF <- DF[is.na(DF$Var), ]
If you have the character string "NA" rather than real missing values, first run
DF[DF == 'NA'] <- NA
to replace them with missing values.
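Applied to the data from the question, the rowSums() version splits on an NA in any column:

```r
DF <- data.frame(var1 = 1:4, var2 = c(21, NA, NA, 10))

DF[rowSums(is.na(DF)) > 0, ]   # rows with at least one NA (var1 = 2, 3)
DF[rowSums(is.na(DF)) == 0, ]  # fully complete rows (var1 = 1, 4)
```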

The split() function comes in handy here.
data <- read.table(text="var1 var2
1 21
2 NA
3 NA
4 10", header=TRUE)
split(data, is.na(data$var2))
#
# $`FALSE`
# var1 var2
# 1 1 21
# 4 4 10
#
# $`TRUE`
# var1 var2
# 2 2 NA
# 3 3 NA

An alternative and more general approach is the complete.cases() function, which flags rows that have no missing values (no NAs), returning TRUE or FALSE for each row.
dt = data.frame(var1 = c(1,2,3,4),
var2 = c(21,NA,NA,10))
dt1 = dt[complete.cases(dt),]
dt2 = dt[!complete.cases(dt),]
dt1
# var1 var2
# 1 1 21
# 4 4 10
dt2
# var1 var2
# 2 2 NA
# 3 3 NA

Related

R - Aggregate Function different Results When Adding new grouping column

I am an R beginner and I am stuck and can't find a solution. Any remarks are highly appreciated. Here is the problem:
I have a dataframe df.
The columns are converted to char (Attributes) and num.
I want to reduce the dataframe by using the aggregate function (dplyr is not an option).
When I am aggregating using
df_agg <- aggregate(df["AMOUNT"], df[c("ATTRIBUTE1")], sum)
I get correct results. But I want to group by more attributes. When adding more attributes for example
df_agg <- aggregate(df["AMOUNT"], df[c("ATTRIBUTE1", "ATTRIBUTE2")], sum)
then at some point the aggregate result changes. The sum of AMOUNT is no longer equal to the result of the first aggregation (or to the total in the original dataframe).
Does anyone have an idea what causes this behavior?
My best guess is that you have missing values in some of your grouping columns. Demonstrating on the built-in mtcars data, which has no missing values, everything is fine:
sum(mtcars$mpg)
# [1] 642.9
sum(aggregate(mtcars["mpg"], mtcars[c("am")], sum)$mpg)
# [1] 642.9
sum(aggregate(mtcars["mpg"], mtcars[c("am", "cyl")], sum)$mpg)
# [1] 642.9
But if we introduce a missing value in a grouping variable, it is not included in the aggregation:
mt = mtcars
mt$cyl[1] = NA
sum(aggregate(mt["mpg"], mt[c("am", "cyl")], sum)$mpg)
# [1] 621.9
The easiest fix would be to fill in the missing values with something other than NA, perhaps the string "missing".
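A minimal sketch of that fix on the mtcars example above: recoding the NA before aggregating brings the total back (note that this coerces cyl to character):

```r
mt <- mtcars
mt$cyl[1] <- NA

# Recode NA in the grouping column so no rows are silently dropped
mt$cyl[is.na(mt$cyl)] <- "missing"

sum(aggregate(mt["mpg"], mt[c("am", "cyl")], sum)$mpg)
# [1] 642.9
```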
I think @Gregor has correctly pointed out that the problem could be a grouping variable containing NA. dplyr handles NA in grouping variables differently than aggregate does.
There is an alternative solution with aggregate. Note that the documentation says:
`by` a list of grouping elements, each as long as the variables in the data
frame x. The elements are coerced to factors before use.
Here is the clue: you can convert your grouping variables to factors using exclude = "", which ensures that NA becomes part of the factor levels.
set.seed(1)
df <- data.frame(ATTRIBUTE1 = sample(LETTERS[1:3], 10, replace = TRUE),
ATTRIBUTE2 = sample(letters[1:3], 10, replace = TRUE),
AMOUNT = 1:10)
df$ATTRIBUTE2[5] <- NA
aggregate(df["AMOUNT"], by = list(factor(df$ATTRIBUTE1,exclude = ""),
factor(df$ATTRIBUTE2, exclude="")), sum)
# Group.1 Group.2 AMOUNT
# 1 A a 1
# 2 B a 2
# 3 B b 9
# 4 C b 10
# 5 A c 10
# 6 B c 11
# 7 C c 7
# 8 A <NA> 5
The result when grouping variables are not explicitly converted to factor to include NA is as:
aggregate(df["AMOUNT"], df[c("ATTRIBUTE1", "ATTRIBUTE2")], sum)
# ATTRIBUTE1 ATTRIBUTE2 AMOUNT
# 1 A a 1
# 2 B a 2
# 3 B b 9
# 4 C b 10
# 5 A c 10
# 6 B c 11
# 7 C c 7
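Base R also has addNA(), which adds NA as a factor level directly. On the same df it should behave like the exclude = "" version (a sketch, not from the original answer; the exact group sums depend on the RNG, but the total is preserved either way):

```r
set.seed(1)
df <- data.frame(ATTRIBUTE1 = sample(LETTERS[1:3], 10, replace = TRUE),
                 ATTRIBUTE2 = sample(letters[1:3], 10, replace = TRUE),
                 AMOUNT = 1:10)
df$ATTRIBUTE2[5] <- NA

# addNA() keeps NA as a level, so those rows are not dropped
agg <- aggregate(df["AMOUNT"],
                 by = list(addNA(df$ATTRIBUTE1), addNA(df$ATTRIBUTE2)),
                 sum)
sum(agg$AMOUNT)  # 55, i.e. sum(1:10): nothing was dropped
```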

Replacing NULL values in a data.frame

I am quite new to R and I am working on a data frame with several NULL values. So far I have not been able to replace them, and I can't wrap my head around a solution, so it would be amazing if anybody could help me.
All the variables where the NULL value comes up are classified as factor.
If I use the function is.null(data) the answer is FALSE, so the NULLs are apparently stored as values, which means they have to be replaced before I can make a decent graph.
Can I use set.seed to replace all the NULL values, or do I need to use a different function?
You can use dplyr and replace
Data
df <- data.frame(A=c("A","NULL","B"), B=c("NULL","C","D"), stringsAsFactors=F)
solution
library(dplyr)
ans <- df %>% replace(.=="NULL", NA) # replace with NA
Output
A B
1 A <NA>
2 <NA> C
3 B D
Another example
ans <- df %>% replace(.=="NULL", "Z") # replace with "Z"
Output
A B
1 A Z
2 Z C
3 B D
In general, R works better with NA values instead of NULL values. If by NULL values you mean the value actually says "NULL", as opposed to a blank value, then you can use this to replace NULL factor values with NA:
df <- data.frame(Var1=c('value1','value2','NULL','value4','NULL'),
Var2=c('value1','value2','value3','NULL','value5'))
#Before
Var1 Var2
1 value1 value1
2 value2 value2
3 NULL value3
4 value4 NULL
5 NULL value5
df <- apply(df, 2, function(x) sub("^NULL$", NA, x))  # returns a character matrix
#After
Var1 Var2
[1,] "value1" "value1"
[2,] "value2" "value2"
[3,] NA "value3"
[4,] "value4" NA
[5,] NA "value5"
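Note that apply() returns a character matrix. If you would rather keep a data.frame, a base-R sketch with lapply() does the same replacement column by column:

```r
df <- data.frame(Var1 = c('value1', 'value2', 'NULL', 'value4', 'NULL'),
                 Var2 = c('value1', 'value2', 'value3', 'NULL', 'value5'),
                 stringsAsFactors = FALSE)

# Replace the literal string "NULL" with a real NA, column by column
df[] <- lapply(df, function(x) replace(x, x == "NULL", NA))
df
```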
It really depends on what the content of your column looks like, though. The above only makes sense for columns that aren't numeric. If the values in a column are meant to be numeric, then as.numeric() will turn anything that isn't a valid number (including the string "NULL") into NA, with a warning. Note that it's important to convert factors to character before converting to numeric, so use as.numeric(as.character(x)), as shown below:
df <- data.frame(Var1=c('1','2','NULL','4','NULL'))
df$Var1 <- as.numeric(as.character(df$Var1))
#After
Var1
1 1
2 2
3 NA
4 4
5 NA

Combine 2 tables with common variables but no common observations

I would like to combine 2 data sets (tables) that have only some (not all) variables in common and no observations in common. So effectively I want to append dataset1 to dataset2, keeping the column names of both, and fields with no corresponding column should be filled with NA.
So what I did is, I tried the following function;
matchcol = function(x,y){
y = y[,match(colnames(x),colnames(y))]
colnames(y)=colnames(x)
return(y)
}
sum =matchcol(dataset1,dataset2)
data = rbind(dataset1,dataset2)
But I get: "Error: NA columns indexes not supported."
What can I do? What should I change in my code?
Thanks!
To use rbind you need to have the same column names, but with bind_rows from the dplyr package you don't. Try this:
library(dplyr)
data <- bind_rows(dataset1, dataset2)
example :
dataset1 <- data.frame(a= 1:5,b=6:10)
dataset2 <- data.frame(a= 11:15,c=16:20)
data <- bind_rows(dataset1,dataset2)
# a b c
# 1 1 6 NA
# 2 2 7 NA
# 3 3 8 NA
# 4 4 9 NA
# 5 5 10 NA
# 6 11 NA 16
# 7 12 NA 17
# 8 13 NA 18
# 9 14 NA 19
# 10 15 NA 20
If I understand your question right, it looks like dplyr::full_join is good for that:
library(dplyr)
dataset1 <- data.frame(Var_A = 1:10, Var_B = 100:109)
dataset2 <- data.frame(Var_A = 11:20, Var_C = 200:209)
dataset_new <- full_join(dataset1, dataset2)
dataset_new
This will automatically join the two datasets by common column names and add all other columns from both datasets. And empty fields are NAs.
Does that work for you?
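If you'd rather stay in base R, merge() with all = TRUE performs the same full outer join (same toy data as above):

```r
dataset1 <- data.frame(Var_A = 1:10, Var_B = 100:109)
dataset2 <- data.frame(Var_A = 11:20, Var_C = 200:209)

# Full outer join on the shared column Var_A; unmatched cells become NA
dataset_new <- merge(dataset1, dataset2, all = TRUE)
dim(dataset_new)  # 20 rows, 3 columns
```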

How to create new rows in a dataframe based on another row's contents

Given
index = c(1,2,3,4,5)
codes = c("c1","c1,c2","","c3,c1","c2")
df=data.frame(index,codes)
df
index codes
1 1 c1
2 2 c1,c2
3 3
4 4 c3,c1
5 5 c2
How can I create a new df that looks like
df1
index codes
1 1 c1
2 2 c1
3 2 c2
4 3
5 4 c3
6 4 c1
7 5 c2
so that I can perform aggregates on the codes? The "index" of the actual data set are a series of timestamps, so I'll want to aggregate by day or hour.
Roland's method is quite good, provided the variable index has unique keys, and you can gain some speed by working with the lists directly. Take into account that:
in your original data frame, codes is a factor. There is no point in that here; you want it to be character.
in your original data frame, "" is used instead of NA. As strsplit() gives a zero-length result for "", you can get into all kinds of trouble later on. I'd use NA there: " " is an actual value and "" is no value at all, but what you want is a missing value, hence NA.
So my idea would be:
The data:
index = c(1,2,3,4,5)
codes = c("c1","c1,c2",NA,"c3,c1","c2")
df=data.frame(index,codes,stringsAsFactors=FALSE)
Then :
X <- strsplit(df$codes,",")
data.frame(
index = rep(df$index,sapply(X,length)),
codes = unlist(X)
)
Or, if you insist on using "" instead of NA:
X <- strsplit(df$codes,",")
ll <- sapply(X,length)
X[ll==0] <- NA
data.frame(
index = rep(df$index,pmax(1,ll)),
codes = unlist(X)
)
Neither method assumes a unique key in index; both work perfectly well with non-unique timestamps.
You need to split the string (using strsplit) and then combine the resulting list with the data.frame.
The following relies on the assumption that codes are unique in each row. If you have many codes in some rows and only few in others, this might waste a lot of RAM and it might be better to loop.
#to avoid character(0), which would be omitted in rbind
levels(df$codes)[levels(df$codes)==""] <- " "
#rbind fills each row by propagating the values to the "empty" columns for each row
df2 <- cbind(df, do.call(rbind,strsplit(as.character(df$codes),",")))[,-2]
library(reshape2)
df2 <- melt(df2, id="index")[-2]
#here the assumption is needed
df2 <- df2[!duplicated(df2),]
df2[order(df2[,1], df2[,2]),]
# index value
#1 1 c1
#2 2 c1
#7 2 c2
#3 3
#9 4 c1
#4 4 c3
#5 5 c2
Here's another alternative using "data.table". The sample data includes NA instead of a blank space and includes duplicated index values:
index = c(1,2,3,2,4,5)
codes = c("c1","c1,c2",NA,"c3,c1","c2","c3")
df = data.frame(index,codes,stringsAsFactors=FALSE)
library(data.table)
## We could create the data.table directly, but I'm
## assuming you already have a data.frame ready to work with
DT <- data.table(df)
DT[, list(codes = unlist(strsplit(codes, ","))), by = "index"]
# index codes
# 1: 1 c1
# 2: 2 c1
# 3: 2 c2
# 4: 2 c3
# 5: 2 c1
# 6: 3 NA
# 7: 4 c2
# 8: 5 c3

Select last non-NA value in a row, by row

I have a data frame where each row holds a run of values of varying length, padded out with NA. I would like to create a vector of the last non-NA value in each row.
Here is an example data frame:
df <- read.table(tc <- textConnection("
var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
2 NA 3 NA
4 4 4 4
1 NA NA NA"), header = TRUE); close(tc)
The vector of values I want would therefore be c(2,6,3,4,1).
I just can't figure out how to get R to identify the last value.
Any help is appreciated!
Do this by combining three things:
Identify NA values with is.na
Find the last value in a vector with tail
Use apply to apply this function to each row in the data.frame
The code:
lastValue <- function(x) tail(x[!is.na(x)], 1)
apply(df, 1, lastValue)
[1] 2 6 3 4 1
Here's an answer using matrix subsetting:
df[cbind( 1:nrow(df), max.col(!is.na(df),"last") )]
This max.col call will select the position of the last non-NA value in each row (or select the first position if they are all NA).
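A quick sketch of that all-NA caveat with hypothetical data: max.col() still returns a column position for an all-NA row, and the subsetting then yields NA for it.

```r
d <- data.frame(var1 = c(1, NA), var2 = c(2, NA))

# Row 1 has non-NA values; row 2 is all NA
d[cbind(1:nrow(d), max.col(!is.na(d), "last"))]
# [1]  2 NA   -- the all-NA row simply produces NA
```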
Here's another version that removes all infinities, NA, and NaN's before taking the first element of the reversed input:
apply(df, 1, function(x) rev(x[is.finite(x)])[1] )
# [1] 2 6 3 4 1
