I am using R to generate examples of how to deal with missing data for the statistics class I am teaching. One method requires generating a "missing values binary variable", with 0 for cases containing missing values and 1 for cases with no missing values. For example:
n X Y Z
1 4 300 2
2 8 400 4
3 10 500 7
4 18 NA 10
5 20 50 NA
6 NA 1000 5
I would like to generate a variable m, such that:
n m
1 1
2 1
3 1
4 0
5 0
6 0
It seems this should be simple, given R's ability to handle missing values. The closest I have found is m <- ifelse(is.na(missguns), 0, 1), but all this does is generate an entire new data matrix with 0 or 1 indicating missingness in each cell. I just want one variable indicating whether a row contains any missing values.
complete.cases does exactly what you want.
complete.cases(x)
## [1] TRUE TRUE TRUE FALSE FALSE FALSE
You can coerce to numeric or integer:
as.integer(complete.cases(x))
## [1] 1 1 1 0 0 0
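For a self-contained version, here is a minimal sketch that rebuilds the example data from the question. The data frame name x and the column names X, Y, Z are taken from the example table; m_alt is just an illustrative name for an equivalent rowSums approach.

```r
# Rebuild the example data from the question
x <- data.frame(
  X = c(4, 8, 10, 18, 20, NA),
  Y = c(300, 400, 500, NA, 50, 1000),
  Z = c(2, 4, 7, 10, NA, 5)
)

# 1 = row has no missing values, 0 = row has at least one NA
m <- as.integer(complete.cases(x))
m
## [1] 1 1 1 0 0 0

# Equivalent: count the NAs per row and flag rows with none
m_alt <- as.integer(rowSums(is.na(x)) == 0)
```

The rowSums version is handy if you later want the number of missing values per row rather than just a flag.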
The numeric variable weitage is given like,
> weitage
[1] 20 10 50 10 5 5
Then,
sort_wei<-sort(weitage,decreasing = T)
sort_wei
[1] 50 20 10 10 5 5
match(sort_wei,weitage)
results in 3 1 2 2 5 5, but the positions I actually need are 3 1 2 4 5 6. How can I get these positions? Can I use match() in R?
We can try using the order function, which returns the indices of the input vector according to some sort order:
order(weitage, decreasing=TRUE)
#[1] 3 1 2 4 5 6
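The reason match() repeats positions is that it only ever returns the position of the first match, so tied values all map to the same index. A short sketch using the vector from the question (pos is just an illustrative name):

```r
weitage <- c(20, 10, 50, 10, 5, 5)
sort_wei <- sort(weitage, decreasing = TRUE)

# match() reports only the first position of each value, so ties collapse
match(sort_wei, weitage)
# [1] 3 1 2 2 5 5

# order() returns one distinct index per element
pos <- order(weitage, decreasing = TRUE)

# Indexing by pos reproduces the sorted vector
identical(weitage[pos], sort_wei)
```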
I'm having some trouble with the following:
I have a list with 6 different factors:
1 stand
2 stand
3 walk
4 downstairs
5 sit
6 stand
7 lay
8 walk
How can I convert these factors into a binary list in which walk is considered 1 and not walking is 0?
Output:
1 0
2 0
3 1
4 0
5 0
6 0
7 0
8 1
Currently, when I try something like this, I get the following accompanying error.
test[1] <- 0
Warning message:
In `[<-.factor`(`*tmp*`, 1, value = 0) : invalid factor level, NA generated
I don't completely understand why I'm getting the error I'm getting.
Thank you for any assistance you can provide.
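One possible approach, sketched under the assumption that the factor is called test as in the attempt above (is_walk is a made-up name): the warning appears because 0 is not one of the factor's levels, so R stores NA instead. Rather than overwriting the factor in place, you can build a new integer vector from a comparison.

```r
# Recreate the factor from the question
test <- factor(c("stand", "stand", "walk", "downstairs",
                 "sit", "stand", "lay", "walk"))

# TRUE where the level is "walk", then coerce TRUE/FALSE to 1/0
is_walk <- as.integer(test == "walk")
is_walk
# [1] 0 0 1 0 0 0 0 1
```

This leaves the original factor untouched and avoids the invalid factor level warning altogether.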
I'm trying to format a dataset for use in some survival analysis models. Each row is a school, and the time-varying columns are the total number of students enrolled in the school that year. Say the data frame looks like this (there are time-invariant columns as well).
Name total.89 total.90 total.91 total.92
a 8 6 4 0
b 1 2 4 9
c 7 9 0 0
d 2 0 0 0
I'd like to create a new column indicating when the school "died," i.e., the first column in which a zero appears. Ultimately I'd like to have this column be "years since 1989" and can re-name columns accordingly.
A more general version of the question: for a series of time-ordered columns, how do I identify the first column in which a given value occurs?
Here's a base R approach to get a column with the first zero (x = 0) or NA if there isn't one:
data$died <- apply(data[, -1], 1, match, x = 0)
data
# Name total.89 total.90 total.91 total.92 died
# 1 a 8 6 4 0 4
# 2 b 1 2 4 9 NA
# 3 c 7 9 0 0 3
# 4 d 2 0 0 0 2
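To get from the column position to "years since 1989", a rough sketch could look like the following. It assumes the year columns are contiguous and start at 1989, as in the example; years_since_1989 and year_died are illustrative names, not part of the original data.

```r
# Rebuild the example data from the question
data <- data.frame(
  Name     = c("a", "b", "c", "d"),
  total.89 = c(8, 1, 7, 2),
  total.90 = c(6, 2, 9, 0),
  total.91 = c(4, 4, 0, 0),
  total.92 = c(0, 9, 0, 0)
)

# Position of the first zero among the year columns (NA if none)
died <- apply(data[, -1], 1, match, x = 0)

# Column 1 is 1989, so subtract 1 to get years since 1989
years_since_1989 <- died - 1

# Calendar year of the first zero, if that is more convenient
year_died <- 1989 + years_since_1989
```

Schools that never reach zero keep NA, which can then be treated as censored in the survival model.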
Here is an option using max.col with rowSums
df1$died <- max.col(!df1[-1], "first") * NA^!rowSums(!df1[-1])
df1$died
#[1] 4 NA 3 2
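That one-liner is compact but dense: NA^TRUE is NA and NA^FALSE is 1, so the multiplication blanks out rows that have no zero at all. Here is the same idea unrolled into steps, as a sketch with made-up intermediate names (is_zero, first_zero, has_zero):

```r
# Rebuild the example data from the question
df1 <- data.frame(
  Name     = c("a", "b", "c", "d"),
  total.89 = c(8, 1, 7, 2),
  total.90 = c(6, 2, 9, 0),
  total.91 = c(4, 4, 0, 0),
  total.92 = c(0, 9, 0, 0)
)

# Logical matrix: TRUE where a year column equals zero
is_zero <- df1[-1] == 0

# Column of the first TRUE in each row (adding 0 makes the matrix numeric)
first_zero <- max.col(is_zero + 0, ties.method = "first")

# Rows with no zero would otherwise report column 1, so set them to NA
has_zero <- rowSums(is_zero) > 0
first_zero[!has_zero] <- NA
first_zero
# [1]  4 NA  3  2
```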
If I have a vector numbers <- c(1,1,2,4,2,2,2,2,5,4,4,4), and I use 'table(numbers)', I get
names 1 2 4 5
counts 2 5 4 1
What if I want it to include 3 as well or, more generally, all numbers in 1:max(numbers), even if they do not appear in numbers? In other words, how would I generate output like this:
names 1 2 3 4 5
counts 2 5 0 4 1
If you want R to count values that aren't there, you should create a factor and explicitly set the levels. table will return a count for each level, including levels with zero occurrences.
table(factor(numbers, levels=1:max(numbers)))
# 1 2 3 4 5
# 2 5 0 4 1
For this particular example (positive integers), tabulate would also work:
numbers <- c(1,1,2,4,2,2,2,2,5,4,4,4)
tabulate(numbers)
# [1] 2 5 0 4 1
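One difference worth keeping in mind, shown in the sketch below: table on a factor keeps the values as names, while tabulate returns a bare vector of counts and silently ignores anything outside 1..nbins (zeros, for example). The name counts and the small vector c(0, 2, 2, 3) are made up for illustration.

```r
numbers <- c(1, 1, 2, 4, 2, 2, 2, 2, 5, 4, 4, 4)

# table keeps the values as names, so the counts stay labelled
counts <- table(factor(numbers, levels = seq_len(max(numbers))))
names(counts)
# [1] "1" "2" "3" "4" "5"

# tabulate can pad with extra bins via nbins
tabulate(numbers, nbins = 7)
# [1] 2 5 0 4 1 0 0

# Values outside 1..nbins are dropped without warning (here the 0)
tabulate(c(0, 2, 2, 3))
# [1] 0 2 1
```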
I am working in R. I have typed in the command:
table(shoppingdata$Identifier, shoppingdata$Coupon)
and I get the following output:
FALSE TRUE
197386 0 5
197388 0 2
197390 2 0
197392 0 3
197394 1 0
197397 0 1
197398 1 1
197400 0 4
197402 1 5
197406 0 5
First of all, I cannot rename the columns FALSE and TRUE to something else, e.g. couponused.
Most importantly, I want to create a third column which is the sum of FALSE + TRUE (coupon not used + coupon used = number of visits). The actual columns contain hundreds of entries.
The solution is not obvious at all.
You have stumbled into the abyss of R data types, through no fault of your own.
Assuming that shoppingdata is a data frame,
table(shoppingdata$Identifier, shoppingdata$Coupon)
creates an object of type "table". One would think that using, e.g.
as.data.frame(table(shoppingdata$Identifier, shoppingdata$Coupon))
would turn this into a data frame with the same format as in the printout, but, as the example below shows, it does not!
# example
data <- data.frame(ID=rep(1:5,each=10),coupon=(sample(c(TRUE,FALSE),50,replace=TRUE)))
# creates "contingency table", not a data frame.
t <- table(data)
t
# coupon
# ID FALSE TRUE
# 1 5 5
# 2 3 7
# 3 4 6
# 4 6 4
# 5 3 7
as.data.frame(t) # not useful!!
# ID coupon Freq
# 1 1 FALSE 5
# 2 2 FALSE 3
# 3 3 FALSE 4
# 4 4 FALSE 6
# 5 5 FALSE 3
# 6 1 TRUE 5
# 7 2 TRUE 7
# 8 3 TRUE 6
# 9 4 TRUE 4
# 10 5 TRUE 7
# this works...
coupons <- data.frame(ID=rownames(t),not.used=t[,1],used=t[,2])
# add two columns to make a third
coupons$total <- coupons$used + coupons$not.used
# or, less typing
coupons$total <- with(coupons,not.used+used)
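Another route that may be worth trying is as.data.frame.matrix, which keeps the two-way layout of the table instead of stacking it into long format. This is only a sketch on a tiny made-up data set (the ID and coupon values below are invented so the result is deterministic); tab and wide are illustrative names.

```r
# Small deterministic stand-in for shoppingdata (values are made up)
data <- data.frame(
  ID     = c(1, 1, 1, 2, 2, 3),
  coupon = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

tab <- table(data$ID, data$coupon)

# Keep the wide layout: one row per ID, one column per coupon value
wide <- as.data.frame.matrix(tab)

# Rename the FALSE/TRUE columns (FALSE comes first in the table)
names(wide) <- c("not.used", "used")

# Number of visits = coupon not used + coupon used
wide$total <- wide$not.used + wide$used
```

The identifiers end up as row names; wide$ID <- rownames(wide) would turn them into an ordinary column if you need one.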
FWIW, I think yours is a perfectly reasonable question. The reason more people don't use R is that it has an extremely steep learning curve, and the documentation is not very good. On the other hand, once you've climbed that learning curve, R is astonishingly powerful.