How do I identify the first zero in a group of ordered columns? - r

I'm trying to format a dataset for use in some survival analysis models. Each row is a school, and the time-varying columns are the total number of students enrolled in the school that year. Say the data frame looks like this (there are time-invariant columns as well).
Name total.89 total.90 total.91 total.92
a 8 6 4 0
b 1 2 4 9
c 7 9 0 0
d 2 0 0 0
I'd like to create a new column indicating when the school "died," i.e., the first column in which a zero appears. Ultimately I'd like to have this column be "years since 1989" and can re-name columns accordingly.
A more general version of the question: for a series of time-ordered columns, how do I identify the first column in which a given value occurs?

Here's a base R approach to get a column with the first zero (x = 0) or NA if there isn't one:
data$died <- apply(data[, -1], 1, match, x = 0)
data
# Name total.89 total.90 total.91 total.92 died
# 1 a 8 6 4 0 4
# 2 b 1 2 4 9 NA
# 3 c 7 9 0 0 3
# 4 d 2 0 0 0 2
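The question also asks for "years since 1989". A minimal sketch of one way to get there from the column position, assuming the columns are named total.YY and that the two-digit suffix means 19YY (both are assumptions, and the helper names year_cols, first_zero and years are hypothetical):

```r
# Rebuild the example data from the question
data <- data.frame(
  Name     = c("a", "b", "c", "d"),
  total.89 = c(8, 1, 7, 2),
  total.90 = c(6, 2, 9, 0),
  total.91 = c(4, 4, 0, 0),
  total.92 = c(0, 9, 0, 0)
)

# Select the time-ordered columns by name rather than by position
year_cols <- grep("^total\\.", names(data), value = TRUE)

# Position of the first zero in each row (NA if the row has no zero)
first_zero <- apply(data[year_cols], 1, match, x = 0)

# Assumption: the suffix is a two-digit year in the 1900s
years <- as.integer(sub("^total\\.", "", year_cols)) + 1900L

data$died_year        <- years[first_zero]
data$years_since_1989 <- data$died_year - 1989L
```

Selecting the columns by a name pattern keeps the time-invariant columns out of the search without relying on their position.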

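Here is an option using max.col with rowSums. !df1[-1] is TRUE wherever a value is zero, max.col(..., "first") returns the position of the first TRUE in each row, and multiplying by NA^!rowSums(!df1[-1]) turns rows without any zero into NA (NA^0 is 1, NA^1 is NA):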
df1$died <- max.col(!df1[-1], "first") * NA^!rowSums(!df1[-1])
df1$died
#[1] 4 NA 3 2

Related

Use if-else function on data frame with multiple values

I have a data frame that contains multiple values in each spot, like this:
library(plyr)
ID <- c(1,1,1,2,2,2,2,3,3,4,4,4,5,6,6)
W <- c(29,72,32,33,34,44,42,78,32,42,18,26,10,34,39)
df1 <- data.frame(ID, W)
df <- ddply(df1, .(ID), summarize,
            X = paste(unique(W), collapse = ","))
ID X
1 1 29,72,32
2 2 33,34,44,42
3 3 78,32
4 4 42,18,26
5 5 10
6 6 34,39
I am trying to generate another column using an if-else function so that every ID that has an X value greater than 70 will show a 1, and all others will show a 0, like this:
ID X Y
1 1 29,72,32 1
2 2 33,34,44,42 0
3 3 78,32 1
4 4 42,18,26 0
5 5 10 0
6 6 34,39 0
This is the code that I tried:
df$Y <- ifelse(df$X>=70, 1, 0)
But it doesn't work; it only seems to put the first value of each spot through the function:
ID X Y
1 1 29,72,32 0
2 2 33,34,44,42 0
3 3 78,32 1
4 4 42,18,26 0
5 5 10 0
6 6 34,39 0
It worked fine on my one column that has only one value per spot. Is there a way to get the if-else function to evaluate every value in each spot and assign a 1 if any of them fit the statement?
Thank you, I'm sorry that I do not know a lot of R vocabulary yet.
As 'X' is a string, we can split it at the "," to create a list of vectors, loop over the list with map_int, convert each vector to numeric, and check whether any of its values is greater than 70:
library(dplyr)
library(purrr)
df %>%
mutate(Y = map_int(strsplit(X, ","), ~ +(any(as.numeric(.x) > 70))))
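The same split-then-test idea can be sketched in base R, without dplyr or purrr. This is only an illustration: the data frame below is typed in by hand to match the question's output, and X is assumed to be a character column.

```r
# Hand-built copy of the summarised data from the question
df <- data.frame(
  ID = 1:6,
  X  = c("29,72,32", "33,34,44,42", "78,32", "42,18,26", "10", "34,39"),
  stringsAsFactors = FALSE
)

# Split each string at the commas, convert to numeric,
# and flag the row with 1 if any value exceeds 70
df$Y <- sapply(strsplit(df$X, ","),
               function(v) as.integer(any(as.numeric(v) > 70)))
```

If X were a factor, it would need as.character(df$X) before strsplit.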

Threshold values row-wise in a dataframe

Consider an example data frame:
A B C v
5 4 2 3
7 1 3 5
1 2 1 1
I want to set each element of a row to 1 if it is greater than or equal to v, and to 0 otherwise. The example data frame would result in the following:
A B C v
1 1 0 3
1 0 0 5
1 1 1 1
How can I do this efficiently? The number of columns will be much higher, and I would like a solution that does not require me to specify the names of the columns individually, and will apply it to all of them (except v) instead.
My solution with a for loop is way too slow.
We can create a logical matrix by comparing every column except v (the fourth) against v, and coerce it to binary with the unary +:
df1[-4] <- +(df1[-4] >= df1$v)
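The one-liner above relies on v being the fourth column. A small sketch of a name-based variant, assuming the threshold column is called v (the helper name cols is hypothetical):

```r
# Example data from the question
df1 <- data.frame(A = c(5, 7, 1),
                  B = c(4, 1, 2),
                  C = c(2, 3, 1),
                  v = c(3, 5, 1))

# Every column except the threshold column
cols <- setdiff(names(df1), "v")

# df1$v has one value per row, so it is recycled down each column,
# which gives a row-wise comparison; unary + turns TRUE/FALSE into 1/0
df1[cols] <- +(df1[cols] >= df1$v)
```

Because the comparison is vectorised over the whole block of columns, no loop over rows or columns is needed.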

R: ifelse statement: comparing data.frames

I have two data frames and I'm trying to compare a value in one with a value in the other.
If the value matches in both table 1 and table 2, then a third value from table 2 should be inserted into table 1.
Example Table My DF
words number
1 it 1
2 was 2
3 the 3
4 LTD QTY 4
5 end 5
6 of 6
7 winter 7
Table x.sub
lev_dist Var1 Var2
31 1 LTD QTY LTD QTY
What I want to say is: if Var1 in x.sub is equal to words in mydf, then insert x.sub$lev_dist in a third column next to the word in mydf.
My attempt is below, but it keeps producing 3 in the results instead of the lev_dist value:
mydf$lev_dist <- ifelse(test = (mydf$words == x.sub$Var1),x.sub$Var1,0)
Results:
words number lev_dist
1 it 1 0
2 was 2 0
3 the 3 0
4 LTD QTY 4 3
5 end 5 0
6 of 6 0
7 winter 7 0
Can anyone help?
x.sub$Var1 is a factor column, so when it is used as the "yes" value in ifelse we get the integer level code of the factor (here 3) instead of the value we want. Compare against as.character(x.sub$Var1) and return x.sub$lev_dist instead:
mydf$lev_dist <- ifelse(mydf$words == as.character(x.sub$Var1),
                        x.sub$lev_dist, 0)
This could have been avoided if the columns were of character class. Using stringsAsFactors=FALSE in read.csv/read.table or data.frame ensures that all the character columns stay character.
You can also use merge:
x.sub = setNames(x.sub, c('lev_dist', 'words', 'Var2'))
df_ = merge(mydf, x.sub[, 1:2], by = 'words', all = TRUE)
df_[is.na(df_)] = 0
# >df_
# words number lev_dist
#1 end 5 0
#2 it 1 0
#3 LTD QTY 4 1
#4 of 6 0
#5 the 3 0
#6 was 2 0
#7 winter 7 0
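Another way to express this lookup is with match(), which keeps the original row order and also works when x.sub has more than one row. A minimal sketch with hand-built copies of the two tables (x.sub$Var1 is deliberately made a factor to mirror the question; idx is a hypothetical helper name):

```r
mydf <- data.frame(
  words  = c("it", "was", "the", "LTD QTY", "end", "of", "winter"),
  number = 1:7,
  stringsAsFactors = FALSE
)
x.sub <- data.frame(lev_dist = 1, Var1 = "LTD QTY", Var2 = "LTD QTY",
                    stringsAsFactors = TRUE)

# Row of x.sub whose Var1 equals each word (NA where there is no match)
idx <- match(mydf$words, as.character(x.sub$Var1))

# Take lev_dist from the matched row, or 0 when nothing matched
mydf$lev_dist <- ifelse(is.na(idx), 0, x.sub$lev_dist[idx])
```

Unlike merge, this does not sort the result by words.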

Removing rows after a certain value in R

I have a data frame in R,
df <- data.frame(a=c(1,1,1,2,2,5,5,5,5,5,6,6), b=c(0,1,0,0,0,0,0,1,0,0,0,1))
I want to remove the rows where b equals 0 that occur after a row where b equals 1, within each group of duplicated a values.
So the output I am looking for is,
df.out <- data.frame(a=c(1,1,2,2,5,5,5,6,6), b=c(0,1,0,0,0,0,1,0,1))
Is there a way to do this in R?
This should do the trick?
ind = intersect(which(df$b==0), which(df$b==1)+1)
df.out = df[-ind,]
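which(df$b==1) returns the row indices where b==1. Adding one to these and intersecting with the indices where b==0 gives the zeros that directly follow a 1. Note that this only catches a zero immediately after a 1 and ignores a, so in the example row 10 (a=5, b=0) is still kept.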
How about
df[ ave(df$b, df$a, FUN=function(x) x>=cummax(x))==1, ]
# a b
# 1 1 0
# 2 1 1
# 4 2 0
# 5 2 0
# 6 5 0
# 7 5 0
# 8 5 1
# 11 6 0
# 12 6 1
Here we use ave to look within each level of a, and we use cummax to test whether we've already seen a 1: once the running maximum is 1, any later 0 fails x >= cummax(x) and is dropped.
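The ave/cummax approach can be unpacked into steps to make the intermediate values visible. This sketch assumes b only contains 0 and 1; seen_one and keep are hypothetical names.

```r
df <- data.frame(a = c(1,1,1,2,2,5,5,5,5,5,6,6),
                 b = c(0,1,0,0,0,0,0,1,0,0,0,1))

# Running maximum of b within each group of a:
# it stays 0 until the first 1 appears, then stays 1
seen_one <- ave(df$b, df$a, FUN = cummax)

# Keep a row unless it is a 0 that comes after a 1 in its group
keep <- df$b >= seen_one

df.out <- df[keep, ]
```

Rows 3, 9 and 10 are the ones dropped here, which matches the desired output in the question.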

Select max or equal value from several columns in a data frame

I'm trying to select the column with the highest value for each row in a data.frame. So for instance, the data is set up as such.
> df <- data.frame(one = c(0:6), two = c(6:0))
> df
one two
1 0 6
2 1 5
3 2 4
4 3 3
5 4 2
6 5 1
7 6 0
Then I'd like to set another column based on those rows. The data frame would look like this.
> df
one two rank
1 0 6 2
2 1 5 2
3 2 4 2
4 3 3 3
5 4 2 1
6 5 1 1
7 6 0 1
I imagine there is some sort of way that I can use plyr or sapply here but it's eluding me at the moment.
There might be a more efficient solution, but
ranks <- apply(df, 1, which.max)
ranks[which(df[, 1] == df[, 2])] <- 3
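For more than two columns, the same result can be sketched with max.col plus a tie check. The convention that a tie gets the code "number of columns + 1" is taken from the desired output in the question (3 for two columns); m, row_max and tied are hypothetical names.

```r
df <- data.frame(one = 0:6, two = 6:0)

m <- as.matrix(df)

# Column index of the row maximum (first one if there are several)
rank <- max.col(m, ties.method = "first")

# A row is tied when its maximum occurs in more than one column
row_max <- do.call(pmax, df)
tied    <- rowSums(m == row_max) > 1

# Assumption: ties are coded as ncol + 1, i.e. 3 in this example
rank[tied] <- ncol(m) + 1L

df$rank <- rank
```

The matrix is built before the rank column is added, so the new column does not take part in the comparison.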
