Count consecutive occurrences and stop once a value is found - R

I have a data frame that looks like this:
account <- c('123','123','123','123')
bin <- c(3,6,9,12)
count <- c(0,0,2,0)
df <- data.frame(account,bin,count)
df
> df
account bin count
1 123 3 0
2 123 6 0
3 123 9 2
4 123 12 0
I want an output that looks like this:
> df
account bin count cumCount
1 123 3 0 1
2 123 6 0 2
3 123 9 2 0
4 123 12 0 0
Basically, I need to count the number of consecutive zeros starting from bin = 3. But once the count column is > 0, I want the rest of the values to be zero.
I've looked around the web a bit, and here are two partial solutions that are almost there:
df %>%
group_by(count) %>%
mutate(id = row_number())
# A tibble: 4 x 4
# Groups: count [2]
account bin count id
<fctr> <dbl> <dbl> <int>
1 123 3 0 1
2 123 6 0 2
3 123 9 2 1
4 123 12 0 3
And
df %>%
mutate( x = sequence(rle(
as.character(count))$lengths))
> df %>%
+ mutate( x = sequence(rle(
+ as.character(count))$lengths))
account bin count x
1 123 3 0 1
2 123 6 0 2
3 123 9 2 1
4 123 12 0 1
but they still keep counting after the first non-zero count is found.
Is there another solution?

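We could first create a row number column, cumCount. After that, we replace the values with 0 from the first non-zero count to the end of the data frame. Note that which.max returns 1 when count contains no non-zero value, so this version assumes at least one non-zero count is present.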
df$cumCount = 1:nrow(df)
df$cumCount[which.max(df$count != 0) : nrow(df)] <- 0
df
# account bin count cumCount
#1 123 3 0 1
#2 123 6 0 2
#3 123 9 2 0
#4 123 12 0 0
In dplyr, this is easier with row_number and replace:
library(dplyr)
df %>%
mutate(cumCount = replace(row_number(), cumsum(count!=0) > 0, 0))
# account bin count cumCount
#1 123 3 0 1
#2 123 6 0 2
#3 123 9 2 0
#4 123 12 0 0
The equivalent base R of the above dplyr version would be
df$cumCount <- replace(1:nrow(df), cumsum(df$count != 0) > 0, 0)
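The answers above treat the whole data frame as a single run. A minimal sketch of a per-account version follows, assuming the real data holds several accounts. The helper name count_until_nonzero and the second account '456' are made up for illustration and are not part of the original question. Because it uses cumsum rather than which.max, an account with no non-zero count keeps counting to the end.

```r
# Hypothetical helper: running row count that drops to 0 from the
# first non-zero value onward (never resets afterwards).
count_until_nonzero <- function(x) {
  replace(seq_along(x), cumsum(x != 0) > 0, 0L)
}

# Illustrative data: account '456' is an assumption added to show grouping
df <- data.frame(
  account = c('123', '123', '123', '123', '456', '456', '456'),
  bin     = c(3, 6, 9, 12, 3, 6, 9),
  count   = c(0, 0, 2, 0, 0, 0, 0)
)

# ave() applies the helper within each account and keeps row order
df$cumCount <- ave(df$count, df$account, FUN = count_until_nonzero)
df
```

For account '456', which has no non-zero count, the which.max version would zero every row, because which.max returns 1 on an all-FALSE vector.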

Related

How to have R sum nonexistent or null data

A bit convoluted, so I will start with the basic concept. The data is employment by area and sizeclass. From there, I produce a data frame that has the sizeclass, area, total employment by sizeclass, and number of worksites by sizeclass. The bigger the sizeclass, the more employment: 1 means employing between 0 and 4, and 9 means employing 1000+. Obviously some areas do not have large employers. However, I need the end result to always have 9 rows per area, even if there is 0 employment for that sizeclass. Sample data is below.
area <- c(01,01,01,01,01,01,01,03,03,03,03)
employment <- c(1,5,9,10,11,12,67,100,4,444,149)
sizeclass <- c(1,2,2,3,3,3,5,6,1,7,6)
df2 <- data.frame(area,employment,sizeclass)
This is the code that I am using and while it works, it does not produce a result for sizeclass 4 in area 01 for example. How would I have it sum by sizeclass even if there is nothing to sum or count?
sizeclassreport <- df2 %>%
select (area,employment,sizeclass) %>%
group_by(area,sizeclass) %>%
summarise(employment = sum(employment),worksites = n())
The desired result would be 18 rows in length with the sum of employment by sizeclass for each sizeclass and number of worksites even if there is no employment.
We can use complete to expand 'sizeclass' to the full range 1 to 9 within each area. By default, the other columns are filled with NA in the added rows. If wanted, the fill argument replaces that with a custom value, here 0.
library(dplyr)
library(tidyr)
sizeclassreport %>%
group_by(area) %>%
complete(sizeclass = 1:9,
fill = list(employment = 0, worksites = 0)) %>%
ungroup
-output
# A tibble: 18 x 4
area sizeclass employment worksites
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 1 2 14 2
3 1 3 33 3
4 1 4 0 0
5 1 5 67 1
6 1 6 0 0
7 1 7 0 0
8 1 8 0 0
9 1 9 0 0
10 3 1 4 1
11 3 2 0 0
12 3 3 0 0
13 3 4 0 0
14 3 5 0 0
15 3 6 249 2
16 3 7 444 1
17 3 8 0 0
18 3 9 0 0
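For comparison, the same result can be sketched in base R without tidyr. This is only an illustration under the assumption that every area should get sizeclass 1 to 9. The names grid, agg and out are made up here; the idea is to build the full area by sizeclass grid with expand.grid, left-join the aggregated data onto it, and turn the resulting NA values into 0.

```r
area <- c(1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3)
employment <- c(1, 5, 9, 10, 11, 12, 67, 100, 4, 444, 149)
sizeclass <- c(1, 2, 2, 3, 3, 3, 5, 6, 1, 7, 6)
df2 <- data.frame(area, employment, sizeclass)

# One worksite per input row, so summing this column counts worksites
df2$worksites <- 1

# Sum employment and worksites for the combinations that exist
agg <- aggregate(cbind(employment, worksites) ~ area + sizeclass,
                 data = df2, FUN = sum)

# Every area paired with every sizeclass from 1 to 9
grid <- expand.grid(area = unique(df2$area), sizeclass = 1:9)

# Left join: combinations missing from agg come back as NA
out <- merge(grid, agg, all.x = TRUE)
out[is.na(out)] <- 0
out <- out[order(out$area, out$sizeclass), ]
out
```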

Assign observation level values by grouping variable

Thanks in advance for any help.
I have the below data frame
> df <- data.frame(
id = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5),
time = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6),
mortality = c(NA,1,0,0,0,0,NA,0,0,0,0,1,NA,0,0,0,0,0,NA,0,0,1,0,0,NA,0,1,0,0,0)
)
> head(df)
id time mortality
1 1 1 NA
2 1 2 1
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
df$id represents individuals measured at six points in time throughout a survival trial. At the start of the trial all individuals are alive, and they subsequently die or remain alive. df$mortality represents the time period in which the individual died; for example, individual 1 died in time period 2.
I would like to create a new variable indicating what I have called cumulative survival. This would indicate if the individual had died in the current time period or any of the previous time periods. How would I code this?
I have tried a number of different ways using ifelse() statements and dplyr group_by() without success.
Below is what the new data frame should look like. Thanks
> df
id time mortality cum.survival
1 1 1 NA 0
2 1 2 1 1
3 1 3 0 1
4 1 4 0 1
5 1 5 0 1
6 1 6 0 1
7 2 1 NA 0
8 2 2 0 0
9 2 3 0 0
10 2 4 0 0
11 2 5 0 0
12 2 6 1 1
13 3 1 NA 0
14 3 2 0 0
15 3 3 0 0
16 3 4 0 0
17 3 5 0 0
18 3 6 0 0
19 4 1 NA 0
20 4 2 0 0
21 4 3 0 0
22 4 4 1 1
23 4 5 0 1
24 4 6 0 1
25 5 1 NA 0
26 5 2 0 0
27 5 3 1 1
28 5 4 0 1
29 5 5 0 1
30 5 6 0 1
Assuming a person can die only once, we can also use cumsum.
First, copy mortality into cum.survival, replacing NA with 0.
df <- transform(df, cum.survival = replace(mortality, is.na(mortality), 0))
We can then use base R :
df$cum.survival <- with(df, ave(cum.survival, id, FUN = cumsum))
dplyr :
library(dplyr)
df %>% group_by(id) %>% mutate(cum.survival = cumsum(cum.survival))
Or data.table :
library(data.table)
setDT(df)[, cum.survival := cumsum(cum.survival), id]
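Another option is to compare the row index within the group to the index where the first 1 is present. Individuals who never die (id 3 here) need a guard: which.max skips NA and would return the position of the first 0, while match would return NA.
We can use which.max :
df %>%
group_by(id) %>%
mutate(cum.survival = +(any(mortality == 1, na.rm = TRUE) & row_number() >= which.max(mortality)))
OR match, with nomatch pointing past the end of the group :
df %>%
group_by(id) %>%
mutate(cum.survival = +(row_number() >= match(1, mortality, nomatch = n() + 1L)))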
An option using by:
df$cum.survival <- unlist(by(df$mortality, df$id, function(x) cummax(replace(x, is.na(x), 0L))))
or ave:
df$cum.survival <- ave(df$mortality, df$id, FUN=function(x) cummax(replace(x, is.na(x), 0L)))
or tapply:
df$cum.survival <- unlist(tapply(df$mortality, df$id, FUN=function(x) cummax(replace(x, is.na(x), 0L))))
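As a quick check of the cummax approach, here is a small self-contained sketch that rebuilds the question's data and looks at the edge case of an individual who never dies (id 3). The helper name na_to_zero is made up for illustration. Unlike cumsum, cummax cannot exceed 1 even if a 1 were recorded more than once for the same individual.

```r
df <- data.frame(
  id = rep(1:5, each = 6),
  time = rep(1:6, times = 5),
  mortality = c(NA,1,0,0,0,0, NA,0,0,0,0,1, NA,0,0,0,0,0,
                NA,0,0,1,0,0, NA,0,1,0,0,0)
)

# Hypothetical helper: treat a missing mortality value as "alive" (0)
na_to_zero <- function(x) replace(x, is.na(x), 0)

# Running maximum within each id: switches to 1 at death and stays there
df$cum.survival <- ave(df$mortality, df$id,
                       FUN = function(x) cummax(na_to_zero(x)))

# id 3 never dies, so all six of its rows stay at 0
df$cum.survival[df$id == 3]

# which.max() skips NA and returns the first 0 when no 1 is present,
# which is why an unguarded which.max(mortality) is unsafe for id 3
which.max(c(NA, 0, 0, 0, 0, 0))
```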

Creating an ifelse statement that is conditional by factor level

I'm new to R, so I'm sorry if this is obvious. I've been stuck on this for a while, and my search for answers has been fruitless thus far.
Data frame:
1 b c id e
2 0 1 45 5
3 1 0 45 7
4 0 1 48 5
5 1 0 46 7
Desired result:
1 b c id e f
2 0 1 45 5 1
3 1 0 45 7 1
4 0 1 48 5 0
5 1 0 46 7 0
What I'm trying to do: I am trying to create column F based on levels of b and c for people with the same ID. Column E is still important to me along with other omitted values, so I can't collapse the data on ID.
The closest I've gotten:
library(dplyr)
df2 <- df %>%
group_by(id) %>%
mutate(ifelse(b == 1 & c == 1, 1, 0))
But, I think my problem there is that I'm not using dplyr::group_by correctly so I'm essentially doing a base ifelse statement.
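We don't need an ifelse here. Within each id group, any(b) & any(c) is TRUE only when at least one row has b equal to 1 and at least one row has c equal to 1, and that single value is recycled across the group's rows.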
df %>%
group_by(id) %>%
mutate(f = as.integer(any(b) & any(c)))
# A tibble: 4 x 5
# Groups: id [3]
# b c id e f
# <int> <int> <int> <int> <int>
#1 0 1 45 5 1
#2 1 0 45 7 1
#3 0 1 48 5 0
#4 1 0 46 7 0
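A base R sketch of the same idea, in case dplyr is not available. It assumes b and c only take the values 0 and 1, so the per-id maximum is 1 exactly when some row in that id has a 1. The intermediate names has_b and has_c are made up for illustration.

```r
df <- data.frame(
  b  = c(0L, 1L, 0L, 1L),
  c  = c(1L, 0L, 1L, 0L),
  id = c(45L, 45L, 48L, 46L),
  e  = c(5L, 7L, 5L, 7L)
)

# Per-id maximum, repeated on every row of that id (assumes 0/1 coding)
has_b <- ave(df$b, df$id, FUN = max)
has_c <- ave(df$c, df$id, FUN = max)

# f is 1 only when the id has both a b == 1 row and a c == 1 row
df$f <- as.integer(has_b == 1 & has_c == 1)
df
```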

Deleting unnecessary rows after column shuffling in a data frame in R

I have a data frame as below. The Status of each ID is recorded at different time points. 0 means the person is alive and 1 means dead.
ID Status
1 0
1 0
1 1
2 0
2 0
2 0
3 0
3 0
3 0
3 1
I want to shuffle the column Status so that each ID has a status of 1 at most once. After that first 1, I want NA in the remaining rows for that ID. For instance, I want my data frame to look like below after shuffling:
ID Status
1 0
1 0
1 0
2 0
2 1
2 NA
3 0
3 1
3 NA
3 NA
From the data you posted and your example output, it looks like you want to randomly sample df$Status and then do the replacement. To get what you want in one step you could do:
set.seed(3)
df$Status <- ave(sample(df$Status), df$ID, FUN = function(x) replace(x, which(cumsum(x)>=1)[-1], NA))
df
# ID Status
#1 1 0
#2 1 0
#3 1 0
#4 2 1
#5 2 NA
#6 2 NA
#7 3 0
#8 3 0
#9 3 1
#10 3 NA
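To see what the function passed to ave does without the randomness of sample, here is a small sketch that pulls it out into a named helper (keep_first_one is a made-up name) and applies it to a few fixed vectors.

```r
# Hypothetical helper: keep everything up to and including the first 1,
# then replace all later positions with NA.
keep_first_one <- function(x) {
  # which(cumsum(x) >= 1) lists the first 1 and everything after it;
  # dropping its first element leaves only the positions after the first 1
  replace(x, which(cumsum(x) >= 1)[-1], NA)
}

keep_first_one(c(0, 1, 0, 0))  # first 1 at position 2, later values become NA
keep_first_one(c(1, 1, 0))     # only the first 1 survives
keep_first_one(c(0, 0, 0))     # no 1 at all, so nothing is replaced
```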
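One option is to use cumsum of cumsum to find the first 1 appearing for an ID: the double cumulative sum is 0 before the first 1, exactly 1 at the first 1, and greater than 1 on every row after it.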
Note that I have modified OP's sample dataframe to represent logic of reshuffling.
library(dplyr)
df %>% group_by(ID) %>%
mutate(Sum = cumsum(cumsum(Status))) %>%
mutate(Status = ifelse(Sum > 1, NA, Status)) %>%
select(-Sum)
# # A tibble: 10 x 2
# # Groups: ID [3]
# ID Status
# <int> <int>
# 1 1 0
# 2 1 0
# 3 1 1
# 4 2 0
# 5 2 1
# 6 2 NA
# 7 3 0
# 8 3 1
# 9 3 NA
# 10 3 NA
Data
df <- read.table(text =
"ID Status
1 0
1 0
1 1
2 0
2 1
2 0
3 0
3 1
3 0
3 0", header = TRUE)
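The double cumsum can be traced by hand on a single vector. The sketch below uses an illustrative status vector (not taken from the data above) that contains a second 1, to show that everything after the first 1 is replaced.

```r
status <- c(0, 1, 0, 1)

inner   <- cumsum(status)  # 0 1 1 2: number of 1s seen so far
running <- cumsum(inner)   # 0 1 2 4: equals 1 only at the first 1

# Rows where the running total exceeds 1 come after the first 1
result <- ifelse(running > 1, NA, status)
result
```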

Wide to long format with several variables

This question is related to a previous question I asked on converting from wide to long format in R, with an additional complication.
The previous question is here: Wide to long data conversion
The wide data I start with looks like the following:
d2 <- data.frame('id' = c(1,2),
'Q1' = c(2,3),
'Q2' = c(1,3),
'Q3' = c(3,1),
'Q1_X_Opt_1' = c(0,0),
'Q1_X_Opt_2' = c(75,200),
'Q1_X_Opt_3' = c(150,300),
'Q2_X_Opt_1' = c(0,0),
'Q2_X_Opt_2' = c(150,200),
'Q2_X_Opt_3' = c(75,300),
'Q3_X_Opt_1' = c(0,0),
'Q3_X_Opt_2' = c(100,500),
'Q3_X_Opt_3' = c(150,300))
In this example, there are two individuals who have answered three questions. The answer to each question takes one of the values {1,2,3}, encoded in Q1, Q2, and Q3. So, in this example, individual 1 chose option 2 in Q1, chose option 1 in Q2, and chose option 3 in Q3.
There is also a variable X associated with each option that I need to convert to long format as well. The output I am seeking looks like the following:
id question option choice cost
1 1 1 1 0 0
2 1 1 2 1 75
3 1 1 3 0 150
4 1 2 1 1 0
5 1 2 2 0 150
6 1 2 3 0 75
7 1 3 1 0 0
8 1 3 2 0 100
9 1 3 3 1 150
10 2 1 1 0 0
11 2 1 2 0 200
12 2 1 3 1 300
13 2 2 1 0 0
14 2 2 2 0 200
15 2 2 3 1 300
16 2 3 1 1 0
17 2 3 2 0 500
18 2 3 3 0 300
I have tried adapting the code from the answer to the prior question, but with no success thus far. Thanks for any suggestions or comments.
It's not exactly elegant, but here's a tidyverse version:
library(tidyverse)
d3 <- d2 %>%
gather(option, cost, -id:-Q3) %>%
gather(question, choice, Q1:Q3) %>%
separate(option, c('question2', 'option'), extra = 'merge') %>%
filter(question == question2) %>%
mutate_at(vars(question, option), parse_number) %>%
mutate(choice = as.integer(option == choice)) %>%
select(1, 5, 3, 6, 4) %>%
arrange(id)
d3
#> id question option choice cost
#> 1 1 1 1 0 0
#> 2 1 1 2 1 75
#> 3 1 1 3 0 150
#> 4 1 2 1 1 0
#> 5 1 2 2 0 150
#> 6 1 2 3 0 75
#> 7 1 3 1 0 0
#> 8 1 3 2 0 100
#> 9 1 3 3 1 150
#> 10 2 1 1 0 0
#> 11 2 1 2 0 200
#> 12 2 1 3 1 300
#> 13 2 2 1 0 0
#> 14 2 2 2 0 200
#> 15 2 2 3 1 300
#> 16 2 3 1 1 0
#> 17 2 3 2 0 500
#> 18 2 3 3 0 300
1) First melt the input, transforming it to long form. Then break apart the variable column on underscore using read.table, giving columns named V1, V2, V3, V4 representing the question, junk, junk and the option parts, respectively. Append that back to m and set question to the factor level number of V1 and option to V4. Sort it by id to give the same ordering as in the question. (If the order does not matter this line could be omitted.)
Now put the parts together noting that choice is 1 if the appropriate column among the Q1/Q2/Q3 columns equals the option and 0 otherwise.
library(reshape2)
m <- melt(d2, id = 1:4)
m <- cbind(m, read.table(text = as.character(m$variable), sep = "_"))
m <- transform(m, question = as.numeric(factor(V1)), option = V4)  # factor() so this also works when read.table returns character columns (R >= 4.0)
m <- m[order(m$id), ]
n <- nrow(m)
with(m, data.frame(id,
question,
option,
choice = (m[cbind(1:n, question + 1)] == option) + 0,
value))
The result is:
id question option choice value
1 1 1 1 0 0
2 1 1 2 1 75
3 1 1 3 0 150
4 1 2 1 1 0
5 1 2 2 0 150
6 1 2 3 0 75
7 1 3 1 0 0
8 1 3 2 0 100
9 1 3 3 1 150
10 2 1 1 0 0
11 2 1 2 0 200
12 2 1 3 1 300
13 2 2 1 0 0
14 2 2 2 0 200
15 2 2 3 1 300
16 2 3 1 1 0
17 2 3 2 0 500
18 2 3 3 0 300
2) This could also be expressed using magrittr, giving the same answer. Note that the last two pipes use the exposition operator %$%, providing an implicit with(., ...) around the subsequent expression:
library(magrittr)
library(reshape2)
d2 %>%
melt(id = 1:4) %>%
cbind(read.table(text = as.character(.$variable), sep = "_")) %>%
transform(question = as.numeric(factor(V1)), option = V4) %$%
.[order(id), ] %$%
data.frame(id,
question,
option,
choice = (.[cbind(1:nrow(.), question + 1)] == option) + 0,
value)
3) This can be translated to reshape2/dplyr/tidyr:
library(reshape2)
library(dplyr)
library(tidyr)
d2 %>%
melt(id = 1:4) %>%
separate(variable, c("question", "X", "Opt", "option")) %>%
arrange(id) %>%
mutate(question = as.numeric(factor(question)),
choice = (.[cbind(1:n(), question + 1)] == option) + 0) %>%
select(id, question, option, choice, value)
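4) For completeness, a dependency-free sketch in base R. It is not taken from the answers above. It assumes the column names follow the exact pattern Qn and Qn_X_Opt_k with three questions and three options, and the names grid, row, cost_col and choice_col are made up for illustration. The idea is to enumerate every id/question/option combination and then look up the matching cells of d2 by matrix indexing.

```r
d2 <- data.frame(id = c(1, 2),
                 Q1 = c(2, 3), Q2 = c(1, 3), Q3 = c(3, 1),
                 Q1_X_Opt_1 = c(0, 0), Q1_X_Opt_2 = c(75, 200), Q1_X_Opt_3 = c(150, 300),
                 Q2_X_Opt_1 = c(0, 0), Q2_X_Opt_2 = c(150, 200), Q2_X_Opt_3 = c(75, 300),
                 Q3_X_Opt_1 = c(0, 0), Q3_X_Opt_2 = c(100, 500), Q3_X_Opt_3 = c(150, 300))

# expand.grid varies its first argument fastest, so rows come out
# ordered by id, then question, then option
grid <- expand.grid(option = 1:3, question = 1:3, id = d2$id)

# Row of d2 for each output row, plus the names of the columns to read
row        <- match(grid$id, d2$id)
cost_col   <- paste0("Q", grid$question, "_X_Opt_", grid$option)
choice_col <- paste0("Q", grid$question)

# A two-column (row, column) index matrix picks one cell per output row
grid$cost   <- d2[cbind(row, match(cost_col, names(d2)))]
grid$choice <- as.integer(d2[cbind(row, match(choice_col, names(d2)))] == grid$option)

out <- grid[, c("id", "question", "option", "choice", "cost")]
out
```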
