Removing NA’s from a dataset in R - r

I want to remove all of the NA’s from the variables selected however when I used na.omited() for example:
na.omit(df$livharm)
it does not work and the NA’s are still there. I have also tried an alternative way for example:
married[is.na(livharm1)] <-NA
I have done this for each variable within the larger variable I am looking at using the code:
E.g.
df <- within(df, {
married <- as.numeric(livharm == 1)
“
“
“
married[is.na(livharm1)] <- NA
})
however I’m not sure what I actually have to do. Any help I would greatly appreciate!

Using complete.cases gives:
dat <- data.frame( a=c(1,2,3,4,5),b=c(1,NA,3,4,5) )
dat
a b
1 1 1
2 2 NA
3 3 3
4 4 4
5 5 5
complete.cases(dat)
[1] TRUE FALSE TRUE TRUE TRUE
# is.na equivalent has to be used on a vector for the same result:
!is.na(dat$b)
[1] TRUE FALSE TRUE TRUE TRUE
dat[complete.cases(dat),]
a b
1 1 1
3 3 3
4 4 4
5 5 5
Using na.omit is the same as complete.cases but instead of returning a boolean vector the object itself is returned.
na.omit(dat)
a b
1 1 1
3 3 3
4 4 4
5 5 5
This function returns a different result when applied only to a vector, which is probably not handled correctly by ggplot2. It can be "rescued" by putting it back in a data frame. base plot works as intended though.
na.omit(dat$b)
[1] 1 3 4 5
attr(,"na.action")
[1] 2
attr(,"class")
[1] "omit"
data.frame(b=na.omit(dat$b))
b
1 1
2 3
3 4
4 5
Plotting with ggplot2
ggplot(dat[complete.cases(dat),]) + geom_point( aes(a,b) )
# <plot>
# See warning when using original data set with NAs
ggplot(dat) + geom_point( aes(a,b) )
Warning message:
Removed 1 rows containing missing values (geom_point).
# <same plot as above>

Related

Assigning vector elements a value associated with preceding matching value [duplicate]

This question already has answers here:
Calculating cumulative sum for each row
(6 answers)
Sum of previous rows in a column R
(1 answer)
Closed 3 years ago.
I have a vector of alternating TRUE and FALSE values:
dat <- c(T,F,F,T,F,F,F,T,F,T,F,F,F,F)
I'd like to number each instance of TRUE with a unique sequential number and to assign each FALSE value the number associated with the TRUE value preceding it.
therefore, my desired output using the example dat above (which has 4 TRUE values):
1 1 1 2 2 2 2 3 3 4 4 4 4 4
What I tried:
I've tried the following (which works), but I know there must be a simpler solution!!
whichT <- which(dat==T)
whichF <- which(dat==F)
l1 <- lapply(1:length(whichT),
FUN = function(x)
which(whichF > whichT[x] & whichF < whichT[(x+1)])
)
l1[[length(l1)]] <- which(whichF > whichT[length(whichT)])
replaceFs <- unlist(
lapply(1:length(whichT),
function(x) l1[[x]] <- rep(x,length(l1[[x]]))
)
)
replaceTs <- 1:length(whichT)
dat2 <- dat
dat2[whichT] <- replaceTs
dat2[whichF] <- replaceFs
dat2
[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
I need a simpler and quicker solution b/c my real data set is 181k rows long!
Base R solutions preferred, but any solution works
cumsum(dat) will do what you want. When used in mathematical functions TRUE gets converted to 1 and FALSE to 0 so taking the cumulative sum will add 1 every time you see a TRUE and add nothing when there is a FALSE which is what you want.
dat <- c(T,F,F,T,F,F,F,T,F,T,F,F,F,F)
cumsum(dat)
# [1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
Instead of doing the indexing, it can be easily done with cumsum from base R. Here, TRUE/FALSE gets coerced to 1/0 and when we do the cumulative sum, whereever there is 1, it gets increment by 1
cumsum(dat)
#[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
cumsum() is the most straightforward way, however, you can also do:
Reduce("+", dat, accumulate = TRUE)
[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4

passing `cbind()` as a function argument

Suppose I have a nice little data frame
df <- data.frame(x=seq(1,5),y=seq(5,1),z=c(1,2,3,2,1),a=c(1,1,1,2,2))
df
## x y z a
## 1 1 5 1 1
## 2 2 4 2 1
## 3 3 3 3 1
## 4 4 2 2 2
## 5 5 1 1 2
and I want to aggregate a part of it:
aggregate(cbind(x,z)~a,FUN=sum,data=df)
## a x z
## 1 1 6 6
## 2 2 9 3
How do I go about making it programmatic? I want to pass:
The list of variables to be aggregated cbind(x,z)
The grouping variable a (I will be using it in several other parts of the program, so passing the whole thing cbind(x,z)~a is not helpful)
The environment within which the things are happening
My starting point is
blah <- function(varlist,groupvar,df) {
# I kinda like to see what I am doing here
cat(paste0(deparse(substitute(varlist)),"~",deparse(substitute(groupvar))),"\n")
cat(is.data.frame(df),"\n")
cat(dim(df),"\n")
# but I really need to aggregate this
return( aggregate(eval(deparse(substitute(varlist))~deparse(substitute(groupvar)),df),
FUN=sum,data=df) )
}
and it works halfway:
blah(cbind(x,z),a,df)
## [1] "cbind(x, z)~a"
## TRUE
## 5 4
## Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument
So I am kind of able to build the character representation of the formula that I need, but putting it into aggregate() fails.

How to change characters into NA?

I have a census dataset with some missing variables indicated with a ?,
When checking for incomplete cases in R it says there are none because R takes the ? as a valid character. Is there any way to change all the ? to NAs? I would like to run multiple imputation using the mice package to fill in the missing data after.
Data frames. You may need to fiddle with the quotation marks. I have not tested this.
df[df == "?"] <- NA
Creating data frame df
df <- data.frame(A=c("?",1,2),B=c(2,3,"?"))
df
# A B
# 1 ? 2
# 2 1 3
# 3 2 ?
I. Using replace() function
replace(df,df == "?",NA)
# A B
# 1 <NA> 2
# 2 1 3
# 3 2 <NA>
II. While importing a file with ?
data <- read.table("xyz.csv",sep=",",header=T,na.strings=c("?",NA))
data
# A B
# 1 1 NA
# 2 2 3
# 3 3 4
# 4 NA NA
# 5 NA NA
# 6 4 5

Removing same values in columns in some rows of a file in R

I have a file like this.
1 3
1 2
1 10
1 5
**5 5**
6 7
8 9
4 6
1 2
**10 10**
......
The file contains thousands of rows. I wanted to know, how can I remove the rows which contains the same values in columns in R ( The row containing 5 5 and row containing 10 10 )? I know how to remove duplicate columns or duplicate rows, but how do I go about selectively removing them? Thanks. :)
I would do this with indexing, example with small data frame:
myDf <- data.frame(a=c(3,5,8,6,9,4,3), b=c(3,3,5,8,9,6,4))
myDf <- myDf[myDf$a != myDf$b,]
I would consider writing a helper function like this:
indicator <- function(indf) {
rowSums(vapply(indf, function(x) x == indf[, 1],
logical(nrow(indf)))) == ncol(indf)
}
Basically, the function compares each column in the data.frame with the first column of the data.frame, then, checks to see which rowSums are the same as the number of columns in the data.frame.
This basically creates a logical vector that can be used to subset your data.frame.
Example:
mydf <- data.frame(a=c(3,5,8,6,9,4,3),
b=c(3,3,5,8,9,6,4),
c=c(3,4,5,6,9,7,2))
indicator(mydf)
# [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE
mydf[!indicator(mydf), ]
# a b c
# 2 5 3 4
# 3 8 5 5
# 4 6 8 6
# 6 4 6 7
# 7 3 4 2

R: Creating a column by adding values of two previous columns

I am working in R. I have typed in the command :
table(shoppingdata$Identifier, shoppingdata$Coupon)
I have the following data:
FALSE TRUE
197386 0 5
197388 0 2
197390 2 0
197392 0 3
197394 1 0
197397 0 1
197398 1 1
197400 0 4
197402 1 5
197406 0 5
First of all, I cannot name the vectors FALSE and TRUE by something else, e.g couponused.
Most importantly, I want to create a third column which is the sum of FALSE+TRUE( Coupon used+coupon not used= number of visits). The actual columns contain hundreds of entries.
The solution is not obvious at all.
You have stumbled into the abyss of R data types, through no fault of your own.
Assuming that shoppingdata is a data frame,
table(shoppingdata$Identifier, shoppingdata$Coupon)
creates an object of type "table". One would think that using, e.g.
as.data.frame(table(shoppingdata$Identifier, shoppingdata$Coupon))
would turn this into a data frame with the same format as in the printout, but, as the example below shows, it does not!
# example
data <- data.frame(ID=rep(1:5,each=10),coupon=(sample(c(T,F),50,replace=T)))
# creates "contingency table", not a data frame.
t <- table(data)
t
# coupon
# ID FALSE TRUE
# 1 5 5
# 2 3 7
# 3 4 6
# 4 6 4
# 5 3 7
as.data.frame(t) # not useful!!
# ID coupon Freq
# 1 1 FALSE 5
# 2 2 FALSE 3
# 3 3 FALSE 4
# 4 4 FALSE 6
# 5 5 FALSE 3
# 6 1 TRUE 5
# 7 2 TRUE 7
# 8 3 TRUE 6
# 9 4 TRUE 4
# 10 5 TRUE 7
# this works...
coupons <- data.frame(ID=rownames(t),not.used=t[,1],used=t[,2])
# add two columns to make a third
coupons$total <- coupons$used + coupons$not.used
# or, less typing
coupons$ total <- with(coupons,not.used+used)
FWIW, I think yours is a perfectly reasonable question. The reason more people don't use R is that it has an extremely steep learning curve, and the documentation is not very good. On the other hand, once you've climbed that learning curve, R is astonishingly powerful.

Resources