I am quite new to R and I am working on a data frame with several NULL values. So far I have not been able to replace them, and I can't wrap my head around a solution, so it would be amazing if anybody could help me.
All the variables where the NULL value comes up are classified as factor.
If I use the function is.null(data) the answer is FALSE, which means they have to be replaced before I can make a decent graph.
Can I use set.seed to replace all the NULL values, or do I need to use a different function?
You can use dplyr and replace:
Data
df <- data.frame(A=c("A","NULL","B"), B=c("NULL","C","D"), stringsAsFactors=F)
Solution
library(dplyr)
ans <- df %>% replace(.=="NULL", NA) # replace with NA
Output
A B
1 A <NA>
2 <NA> C
3 B D
Another example
ans <- df %>% replace(.=="NULL", "Z") # replace with "Z"
Output
A B
1 A Z
2 Z C
3 B D
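Since the question says the affected columns are factors, here is a minimal sketch of the same idea for genuine factor columns (assuming dplyr >= 1.0 for across(); df_fct is just an illustrative name):
library(dplyr)
# the same toy data, but with factor columns this time
df_fct <- data.frame(A = c("A", "NULL", "B"), B = c("NULL", "C", "D"),
                     stringsAsFactors = TRUE)
# convert each factor to character, turn the literal string "NULL" into NA,
# and re-create the factor so the unused "NULL" level is dropped
ans <- df_fct %>%
  mutate(across(where(is.factor), ~ factor(na_if(as.character(.x), "NULL"))))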
In general, R works better with NA values instead of NULL values. If by NULL values you mean the value actually says "NULL", as opposed to a blank value, then you can use this to replace NULL factor values with NA:
df <- data.frame(Var1=c('value1','value2','NULL','value4','NULL'),
Var2=c('value1','value2','value3','NULL','value5'))
#Before
Var1 Var2
1 value1 value1
2 value2 value2
3 NULL value3
4 value4 NULL
5 NULL value5
# Note: apply() coerces the data frame to a character matrix, so the result below is a matrix
df <- apply(df, 2, function(x) suppressWarnings(levels(x) <- sub("NULL", NA, x)))
#After
Var1 Var2
[1,] "value1" "value1"
[2,] "value2" "value2"
[3,] NA "value3"
[4,] "value4" NA
[5,] NA "value5"
It really depends on what the content of your column looks like, though. The above only really makes sense for columns that aren't numeric. If the values in a column are numeric, then as.numeric() will turn anything that can't be parsed as a number (such as the string "NULL") into NA, with a warning. Note that it's important to convert factors to character before converting to numeric, so use as.numeric(as.character(x)), as shown below:
df <- data.frame(Var1=c('1','2','NULL','4','NULL'))
df$Var1 <- as.numeric(as.character(df$Var1))
#After
Var1
1 1
2 2
3 NA
4 4
5 NA
Related
I have some data that I am looking at in R. One particular column, titled "Height", contains a few rows of NA.
I am looking to subset my data-frame so that all Heights above a certain value are excluded from my analysis.
df2 <- subset ( df1 , Height < 40 )
However, whenever I do this, R automatically removes all rows that contain NA values for Height. I do not want this. I have tried including an na.rm argument:
f1 <- function ( x , na.rm = FALSE ) {
df2 <- subset ( x , Height < 40 )
}
f1 ( df1 , na.rm = FALSE )
but this does not seem to do anything; the rows with NA still end up disappearing from my data-frame. Is there a way of subsetting my data as such, without losing the NA rows?
If we decide to use the subset function, we need to watch out for this note in ?subset:
For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.
So only non-NA values will be retained.
If you want to keep the NA cases, use a logical "or" condition to tell R not to drop them:
subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`
Don't use the following directly (the reason is explained below):
df2 <- df1[df1$Height < 40, ]
Example
df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)
subset(df1, Height < 40 | is.na(Height))
# Height y
#1 NA 1
#2 2 2
#3 4 3
#4 NA 4
df1[df1$Height < 40, ]
# Height y
#1 NA NA
#2 2 2
#3 4 3
#4 NA NA
The reason the latter fails is that indexing by NA gives NA. Consider this simple example with a vector:
x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA 2 NA
We need to somehow replace those NA with TRUE. The most straightforward way is to add another "or" condition, is.na(ind):
x[ind | is.na(ind)]
# [1] 1 2 3
This is exactly what happens in your situation. If your Height contains NA, the logical operation Height < 40 ends up as a mix of TRUE / FALSE / NA, so we need to replace the NA with TRUE, as above.
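With the df1 defined above, you can see that mixed logical vector directly:
df1$Height < 40
# [1]    NA  TRUE  TRUE    NA FALSE FALSE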
You could also do:
df2 <- df1[(df1$Height < 40 | is.na(df1$Height)),]
For subsetting by character/factor variables, you can use %in% to keep NAs. Specify the data you wish to exclude.
# Create Dataset
library(data.table)
df=data.table(V1=c('Surface','Bottom',NA),V2=1:3)
df
# V1 V2
# 1: Surface 1
# 2: Bottom 2
# 3: <NA> 3
# Keep all but 'Bottom'
df[!V1 %in% c('Bottom')]
# V1 V2
# 1: Surface 1
# 2: <NA> 3
This works because %in% never returns an NA (see ?match).
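A quick illustration of that behaviour: the NA element simply yields FALSE rather than NA.
c('Surface', 'Bottom', NA) %in% 'Bottom'
# [1] FALSE  TRUE FALSE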
I want to split a dataset in R based on NA values from a variable, for example:
var1 var2
1 21
2 NA
3 NA
4 10
and make it like this:
var1 var2
1 21
4 10
var1 var2
2 NA
3 NA
More details on how model functions handle NAs:
Most statistical functions (e.g., lm()) have something like na.action, which applies to the model, not to individual variables. na.fail() returns the object (the dataset) if there are no NA values; otherwise it signals an error, stopping the analysis. na.pass() returns the data object whether or not it has NA values, which is useful if the function deals with NA values internally. na.omit() returns the object with entire observations (rows) omitted if any of the variables used in the model are NA for that observation. na.exclude() is the same as na.omit(), except that it lets functions using naresid or napredict pad residuals and predictions back out with NA. You can think of na.action as a function applied to your data object, the result being the data object used by lm(). The lm() syntax also allows specifying na.action as a parameter:
lm(y ~ a + b + c, data = na.omit(dataset))
lm(y ~ a + b + c, data = dataset, na.action = na.omit) # similar effect, and the more common usage
You can set your default handling of missing values with
options(na.action = "na.omit")
You could just subset the data frame using is.na():
df1 <- df[!is.na(df$var2), ]
df2 <- df[is.na(df$var2), ]
Try this:
new_DF <- DF[rowSums(is.na(DF)) > 0,]
or in case you want to check a particular column, you can also use
new_DF <- DF[is.na(DF$Var),]
In case you have 'NA' character values, first run
DF[DF == 'NA'] <- NA
to replace them with missing values.
The split function comes in handy in this case.
data <- read.table(text="var1 var2
1 21
2 NA
3 NA
4 10", header=TRUE)
split(data, is.na(data$var2))
#
# $`FALSE`
# var1 var2
# 1 1 21
# 4 4 10
#
# $`TRUE`
# var1 var2
# 2 2 NA
# 3 3 NA
An alternative and more general approach is the complete.cases command, which spots rows that have no missing values (no NAs) and returns a TRUE/FALSE value for each row.
dt = data.frame(var1 = c(1, 2, 3, 4),
                var2 = c(21, NA, NA, 10))
dt1 = dt[complete.cases(dt),]
dt2 = dt[!complete.cases(dt),]
dt1
# var1 var2
# 1 1 21
# 4 4 10
dt2
# var1 var2
# 2 2 NA
# 3 3 NA
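For reference, the logical vector that drives this subsetting looks like this:
complete.cases(dt)
# [1]  TRUE FALSE FALSE  TRUE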
I am using R for a project and I have a data frame in in the following format:
A B C
1 1 0 0
2 0 1 1
I want to return a data frame that gives the Column Name when the value is 1.
i.e.
Impair1 Impair2
1 A NA
2 B C
Is there a way to do this for thousands of records? The max impairment number is 4.
Note: There are more than 3 columns. Only 3 were listed to make it easier.
You could loop through the rows of your data, returning the column names where the value is 1, padded with an appropriate number of NA values at the end (here dat is your data frame):
`colnames<-`(t(apply(dat == 1, 1, function(x) c(colnames(dat)[x], rep(NA, 4 - sum(x))))),
             paste0("Impair", 1:4))
# Impair1 Impair2 Impair3 Impair4
# 1 "A" NA NA NA
# 2 "B" "C" NA NA
Using the apply family of functions, here is a general solution that should work for your larger dataset:
res <- apply(df, 1, function(x) {
  out <- character(4)                 # a length-4 vector of empty strings
  tmp <- colnames(df)[which(x == 1)]  # column names where the row equals 1
  out[1:length(tmp)] <- tmp           # fill the leading positions
  out
})
# transpose and turn it into a data.frame
data.frame(t(res))
X1 X2 X3 X4
1 A
2 B C
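Note that out[1:length(tmp)] <- tmp assumes every row has at least one 1; if a row could have none, 1:length(tmp) counts backwards and the assignment errors. A small hedged variant that guards against that case (same approach, just with a length check and NA instead of empty strings):
res <- apply(df, 1, function(x) {
  out <- rep(NA_character_, 4)        # start with 4 NAs
  tmp <- colnames(df)[x == 1]         # column names where the row equals 1
  if (length(tmp) > 0) out[seq_along(tmp)] <- tmp
  out
})
data.frame(t(res))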
I have a data frame where each row contains a varying number of values, padded with NA. I would like to create a vector of the last non-NA value in each row.
Here is an example data frame:
df <- read.table(tc <- textConnection("
var1 var2 var3 var4
1 2 NA NA
4 4 NA 6
2 NA 3 NA
4 4 4 4
1 NA NA NA"), header = TRUE); close(tc)
The vector of values I want would therefore be c(2,6,3,4,1).
I just can't figure out how to get R to identify the last value.
Any help is appreciated!
Do this by combining three things:
Identify NA values with is.na
Find the last value in a vector with tail
Use apply to apply this function to each row in the data.frame
The code:
lastValue <- function(x) tail(x[!is.na(x)], 1)
apply(df, 1, lastValue)
[1] 2 6 3 4 1
Here's an answer using matrix subsetting:
df[cbind( 1:nrow(df), max.col(!is.na(df),"last") )]
This max.col call will select the position of the last non-NA value in each row (or select the first position if they are all NA).
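For the example df above, the intermediate max.col result and the final answer look like this:
max.col(!is.na(df), "last")
# [1] 2 4 3 4 1
df[cbind(1:nrow(df), max.col(!is.na(df), "last"))]
# [1] 2 6 3 4 1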
Here's another version that removes all infinities, NAs, and NaNs before taking the first element of the reversed row:
apply(df, 1, function(x) rev(x[is.finite(x)])[1] )
# [1] 2 6 3 4 1