How to have R sum nonexistent or null data

A bit convoluted, so I will start with the basic concept. The data is employment by area and sizeclass. From there, I produce a data frame that has the sizeclass, area, total employment by sizeclass, and number of worksites by sizeclass. The bigger the sizeclass, the more employment: sizeclass 1 means employing between 0 and 4 workers, while sizeclass 9 means employing 1000 or more. Obviously some areas do not have large employers. However, I need the end result to always have 9 rows per area, even if there is 0 employment for a given sizeclass. Sample data is below.
area <- c(01,01,01,01,01,01,01,03,03,03,03)
employment <- c(1,5,9,10,11,12,67,100,4,444,149)
sizeclass <- c(1,2,2,3,3,3,5,6,1,7,6)
df2 <- data.frame(area,employment,sizeclass)
This is the code I am using; while it works, it does not produce a row for sizeclass 4 in area 01, for example. How would I have it sum by sizeclass even if there is nothing to sum or count?
sizeclassreport <- df2 %>%
  select(area, employment, sizeclass) %>%
  group_by(area, sizeclass) %>%
  summarise(employment = sum(employment), worksites = n())
The desired result would be 18 rows in length, with the sum of employment and the number of worksites for every sizeclass in each area, even where there is no employment.

We can use complete to expand 'sizeclass' to the full custom range of values from 1 to 9. By default, the other columns' values will be filled with NA; if wanted, they can be filled with a custom value, e.g. 0.
library(dplyr)
library(tidyr)
sizeclassreport %>%
  group_by(area) %>%
  complete(sizeclass = 1:9,
           fill = list(employment = 0, worksites = 0)) %>%
  ungroup()
Output:
# A tibble: 18 x 4
area sizeclass employment worksites
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 1 2 14 2
3 1 3 33 3
4 1 4 0 0
5 1 5 67 1
6 1 6 0 0
7 1 7 0 0
8 1 8 0 0
9 1 9 0 0
10 3 1 4 1
11 3 2 0 0
12 3 3 0 0
13 3 4 0 0
14 3 5 0 0
15 3 6 249 2
16 3 7 444 1
17 3 8 0 0
18 3 9 0 0
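If you prefer not to use complete, a rough equivalent is to build the full area-by-sizeclass grid yourself and left-join the summary onto it. This is only a sketch of the same idea, reusing df2 and sizeclassreport from above:
library(dplyr)
library(tidyr)
# every combination of the observed areas with sizeclasses 1 to 9
full_grid <- crossing(area = unique(df2$area), sizeclass = 1:9)
full_grid %>%
  left_join(sizeclassreport, by = c("area", "sizeclass")) %>%
  mutate(employment = coalesce(employment, 0),
         worksites  = coalesce(worksites, 0L))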

Assign observation level values by grouping variable

I have the below data frame
> df <- data.frame(
id = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5),
time = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6),
mortality = c(NA,1,0,0,0,0,NA,0,0,0,0,1,NA,0,0,0,0,0,NA,0,0,1,0,0,NA,0,1,0,0,0)
)
> head(df)
id time mortality
1 1 1 NA
2 1 2 1
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
df$id represents individuals measured at six points in time throughout a survival trial. At the start of the trial all individuals are alive, and they subsequently die or remain alive. df$mortality records the time period in which the individual died; for example, individual 1 died in time period 2.
I would like to create a new variable, which I have called cumulative survival, indicating whether the individual has died in the current time period or any previous time period. How would I code this?
I have tried a number of different ways using ifelse() statements and dplyr group_by() without success.
Below is what the new data frame should look like. Thanks
> df
id time mortality cum.survival
1 1 1 NA 0
2 1 2 1 1
3 1 3 0 1
4 1 4 0 1
5 1 5 0 1
6 1 6 0 1
7 2 1 NA 0
8 2 2 0 0
9 2 3 0 0
10 2 4 0 0
11 2 5 0 0
12 2 6 1 1
13 3 1 NA 0
14 3 2 0 0
15 3 3 0 0
16 3 4 0 0
17 3 5 0 0
18 3 6 0 0
19 4 1 NA 0
20 4 2 0 0
21 4 3 0 0
22 4 4 1 1
23 4 5 0 1
24 4 6 0 1
25 5 1 NA 0
26 5 2 0 0
27 5 3 1 1
28 5 4 0 1
29 5 5 0 1
30 5 6 0 1
Assuming each person dies only once, we can also use cumsum. First, replace the NA values in mortality with 0 and store the result in cum.survival:
df <- transform(df, cum.survival = replace(mortality, is.na(mortality), 0))
We can then use base R:
df$cum.survival <- with(df, ave(cum.survival, id, FUN = cumsum))
Or dplyr:
library(dplyr)
df %>% group_by(id) %>% mutate(cum.survival = cumsum(cum.survival))
Or data.table:
library(data.table)
setDT(df)[, cum.survival := cumsum(cum.survival), id]
Another option is to compare the row number within each group to the index of the first 1 in mortality. Individuals who never die (e.g. id 3) need a guard, added below, so that their rows stay 0.
We can use which.max:
df %>%
  group_by(id) %>%
  mutate(cum.survival = +(any(mortality == 1, na.rm = TRUE) &
                            row_number() >= which.max(mortality)))
Or match:
df %>%
  group_by(id) %>%
  mutate(cum.survival = +(!is.na(match(1, mortality)) &
                            row_number() >= match(1, mortality)))
An option using by:
df$cum.survival <- unlist(by(df$mortality, df$id, function(x) cummax(replace(x, is.na(x), 0L))))
or ave:
df$cum.survival <- ave(df$mortality, df$id, FUN=function(x) cummax(replace(x, is.na(x), 0L)))
or tapply:
df$cum.survival <- unlist(tapply(df$mortality, df$id, FUN=function(x) cummax(replace(x, is.na(x), 0L))))
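If an individual could have more than one 1 in mortality, or you simply want a single pipe, a minimal dplyr sketch combining the NA handling with cummax (rather than cumsum) would be:
library(dplyr)
df %>%
  group_by(id) %>%
  # turn NA into 0, then take the running maximum within each individual
  mutate(cum.survival = cummax(coalesce(mortality, 0))) %>%
  ungroup()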

Add column to data frame with loop calculation from another data frame

I have two datasets, one at the individual level and one at the school level. I would like to calculate the proportion of fighting in each school using a loop (since I have more than 100 schools).
Current code:
for (i in levels(df$school_id)) {
  school <- subset(df, school_id == i)
  number_students <- nrow(school)
  prop <- sum(school$fight_binary, na.rm = TRUE) / number_students
  df$proportion_fight[df$school_id == i] <- prop
}
I tried initializing the new column first, but when I run this loop nothing happens at all.
Here's some sample data
INDIVIDUAL LEVEL:
student_id school_id ever_fight
1 2 1
2 3 0
3 1 1
4 1 1
5 2 0
6 2 0
7 2 0
8 2 0
9 3 1
10 1 0
11 3 1
12 3 1
13 3 1
14 3 1
15 1 0
16 2 0
17 1 0
18 1 0
19 1 0
20 1 0
SCHOOL LEVEL (need to fill the second column with data from above):
school_id proportion_fight
1
2
3
We can use a group-by mean, where df1 is the individual-level data:
library(dplyr)
df1 %>%
  group_by(school_id) %>%
  summarise(proportion_fight = mean(ever_fight))
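As for why the original loop appears to do nothing: levels() only works on factors, so if school_id is numeric or character, levels(df$school_id) returns NULL and the loop body never runs. A sketch of the same loop iterating over unique(df$school_id) instead, keeping the column names from the question's code (fight_binary corresponds to ever_fight in the sample data):
for (i in unique(df$school_id)) {
  school <- df[df$school_id == i, ]
  # proportion of students in school i who reported fighting
  prop <- sum(school$fight_binary, na.rm = TRUE) / nrow(school)
  df$proportion_fight[df$school_id == i] <- prop
}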

Count consecutive occurrences and stop once a value is found

I have a data frame that looks like this:
account <- c('123','123','123','123')
bin <- c(3,6,9,12)
count <- c(0,0,2,0)
df <- data.frame(account,bin,count)
df
> df
account bin count
1 123 3 0
2 123 6 0
3 123 9 2
4 123 12 0
I want an output that looks like this:
> df
account bin count cumCount
1 123 3 0 1
2 123 6 0 2
3 123 9 2 0
4 123 12 0 0
Basically, I need to count the number of consecutive zeros starting from bin = 3, but once the count column is > 0 I want the rest of the values to be zero.
I've looked around the web a bit, and here are two partial solutions that are almost there:
df %>%
  group_by(count) %>%
  mutate(id = row_number())
# A tibble: 4 x 4
# Groups: count [2]
account bin count id
<fctr> <dbl> <dbl> <int>
1 123 3 0 1
2 123 6 0 2
3 123 9 2 1
4 123 12 0 3
And
df %>%
  mutate(x = sequence(rle(as.character(count))$lengths))
account bin count x
1 123 3 0 1
2 123 6 0 2
3 123 9 2 1
4 123 12 0 1
but they both keep counting after the non-zero value is found.
Is there another solution?
We can first fill cumCount with the row numbers, then replace every value from the first occurrence of a non-zero count through the end of the data frame with 0.
df$cumCount = 1:nrow(df)
df$cumCount[which.max(df$count != 0) : nrow(df)] <- 0
df
# account bin count cumCount
#1 123 3 0 1
#2 123 6 0 2
#3 123 9 2 0
#4 123 12 0 0
In dplyr, this is easier using the row_number and replace functions:
library(dplyr)
df %>%
  mutate(cumCount = replace(row_number(), cumsum(count != 0) > 0, 0))
# account bin count cumCount
#1 123 3 0 1
#2 123 6 0 2
#3 123 9 2 0
#4 123 12 0 0
The base R equivalent of the above dplyr version would be:
df$cumCount <- replace(1:nrow(df), cumsum(df$count != 0) > 0, 0)
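If the real data holds several accounts, the same idea should extend naturally with group_by; a sketch, assuming rows are already ordered by bin within each account:
library(dplyr)
df %>%
  group_by(account) %>%
  mutate(cumCount = replace(row_number(), cumsum(count != 0) > 0, 0)) %>%
  ungroup()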

Mutate function only uses the first row and then repeats the same result for every subsequent row

I have a data frame that contains a column with hex color values. I want to mutate the data frame so that I have the red, green, and blue values (RGB format) in separate columns, so I'm using the col2rgb function from the grDevices package. This is what the data frame, which I named color_count, looks like before mutating:
hex Count
<chr> <int>
1 #00000B 3
2 #00000C 10
3 #00000D 1
4 #00000E 42
5 #00000F 4
Here's the code where I mutate the data frame:
color_rgb <- color_count %>%
  mutate(r = col2rgb(hex)[1], g = col2rgb(hex)[2], b = col2rgb(hex)[3])
This is the output:
hex Count r g b
<chr> <int> <int> <int> <int>
1 #00000B 3 0 0 11
2 #00000C 10 0 0 11
3 #00000D 1 0 0 11
4 #00000E 42 0 0 11
5 #00000F 4 0 0 11
The RGB values are only correct for the first row; every other row just repeats that same combination.
I tried using rowwise() as this other thread suggested, but it didn't work.
> col2rgb(color_count$hex)
[,1] [,2] [,3] [,4] [,5]
red 0 0 0 0 0
green 0 0 0 0 0
blue 11 12 13 14 15
> class(col2rgb(color_count$hex))
[1] "matrix"
col2rgb gives you a matrix, so you need to index by rows instead of by elements:
> col2rgb(color_count$hex)[1,]
[1] 0 0 0 0 0
> col2rgb(color_count$hex)[2,]
[1] 0 0 0 0 0
> col2rgb(color_count$hex)[3,]
[1] 11 12 13 14 15
Adding a comma inside [] means you are indexing by rows and columns instead of by single elements: the number before the comma is the row index, and the number after it is the column index.
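As a small made-up illustration of the difference:
m <- matrix(1:6, nrow = 2)
m[3]    # third element in column-major order: 3
m[1, ]  # first row: 1 3 5
m[, 2]  # second column: 3 4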
library(dplyr)
color_rgb <- color_count %>%
  mutate(r = col2rgb(hex)[1, ], g = col2rgb(hex)[2, ], b = col2rgb(hex)[3, ])
# hex Count r g b
# 1 #00000B 3 0 0 11
# 2 #00000C 10 0 0 12
# 3 #00000D 1 0 0 13
# 4 #00000E 42 0 0 14
# 5 #00000F 4 0 0 15
Here's a solution that avoids multiple calls to col2rgb in each row, at the cost of some more involved tidyverse manipulation:
library(tidyverse)
color_rgb <- color_count %>%
  group_nest(hex, Count) %>%
  mutate(data = map(hex, ~ col2rgb(.) %>% t() %>% as_tibble())) %>%
  unnest(data)
# A tibble: 5 x 5
# hex Count red green blue
# <chr> <dbl> <int> <int> <int>
# 1 #00000B 3 0 0 11
# 2 #00000C 10 0 0 12
# 3 #00000D 1 0 0 13
# 4 #00000E 42 0 0 14
# 5 #00000F 4 0 0 15
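A simpler base R sketch of the same idea: call col2rgb once on the whole column, transpose it so each colour becomes a row, and bind the result onto the original data frame (this adds integer columns named red, green and blue):
color_rgb <- cbind(color_count, t(col2rgb(color_count$hex)))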

Wide to long format with several variables

This question is related to a previous question I asked on converting from wide to long format in R with an additional complication.
The previous question is here: Wide to long data conversion
The wide data I start with looks like the following:
d2 <- data.frame('id' = c(1,2),
                 'Q1' = c(2,3),
                 'Q2' = c(1,3),
                 'Q3' = c(3,1),
                 'Q1_X_Opt_1' = c(0,0),
                 'Q1_X_Opt_2' = c(75,200),
                 'Q1_X_Opt_3' = c(150,300),
                 'Q2_X_Opt_1' = c(0,0),
                 'Q2_X_Opt_2' = c(150,200),
                 'Q2_X_Opt_3' = c(75,300),
                 'Q3_X_Opt_1' = c(0,0),
                 'Q3_X_Opt_2' = c(100,500),
                 'Q3_X_Opt_3' = c(150,300))
In this example, there are two individuals who have answered three questions. The answer to each question takes a value in {1,2,3}, encoded in Q1, Q2, and Q3. So, in this example, individual 1 chose option 2 in Q1, option 1 in Q2, and option 3 in Q3.
For each option there is also a variable X (the cost) that I also need converted to long format. The output I am seeking looks like the following:
id question option choice cost
1 1 1 1 0 0
2 1 1 2 1 75
3 1 1 3 0 150
4 1 2 1 1 0
5 1 2 2 0 150
6 1 2 3 0 75
7 1 3 1 0 0
8 1 3 2 0 100
9 1 3 3 1 150
10 2 1 1 0 0
11 2 1 2 0 200
12 2 1 3 1 300
13 2 2 1 0 0
14 2 2 2 0 200
15 2 2 3 1 300
16 2 3 1 1 0
17 2 3 2 0 500
18 2 3 3 0 300
I have tried adapting the code from the answer to the prior question, but with no success so far. Thanks for any suggestions or comments.
It's not exactly elegant, but here's a tidyverse version:
library(tidyverse)
d3 <- d2 %>%
  gather(option, cost, -id:-Q3) %>%
  gather(question, choice, Q1:Q3) %>%
  separate(option, c('question2', 'option'), extra = 'merge') %>%
  filter(question == question2) %>%
  mutate_at(vars(question, option), parse_number) %>%
  mutate(choice = as.integer(option == choice)) %>%
  select(1, 5, 3, 6, 4) %>%
  arrange(id)
d3
#> id question option choice cost
#> 1 1 1 1 0 0
#> 2 1 1 2 1 75
#> 3 1 1 3 0 150
#> 4 1 2 1 1 0
#> 5 1 2 2 0 150
#> 6 1 2 3 0 75
#> 7 1 3 1 0 0
#> 8 1 3 2 0 100
#> 9 1 3 3 1 150
#> 10 2 1 1 0 0
#> 11 2 1 2 0 200
#> 12 2 1 3 1 300
#> 13 2 2 1 0 0
#> 14 2 2 2 0 200
#> 15 2 2 3 1 300
#> 16 2 3 1 1 0
#> 17 2 3 2 0 500
#> 18 2 3 3 0 300
1) First melt the input, transforming it to long form. Then break apart the variable column on underscores using read.table, giving columns named V1, V2, V3, V4 that hold the question (as a factor), junk, junk, and the option parts, respectively. Append that back to m and set question to the factor level of V1 and option to V4. Sort by id to give the same ordering as in the question. (If the order does not matter, this line could be omitted.)
Now put the parts together, noting that choice is 1 if the appropriate Q1/Q2/Q3 column equals the option and 0 otherwise.
library(reshape2)
m <- melt(d2, id = 1:4)
m <- cbind(m, read.table(text = as.character(m$variable), sep = "_"))
m <- transform(m, question = as.numeric(V1), option = V4)
m <- m[order(m$id), ]
n <- nrow(m)
with(m, data.frame(id,
                   question,
                   option,
                   choice = (m[cbind(1:n, question + 1)] == option) + 0,
                   value))
The result is:
id question option choice value
1 1 1 1 0 0
2 1 1 2 1 75
3 1 1 3 0 150
4 1 2 1 1 0
5 1 2 2 0 150
6 1 2 3 0 75
7 1 3 1 0 0
8 1 3 2 0 100
9 1 3 3 1 150
10 2 1 1 0 0
11 2 1 2 0 200
12 2 1 3 1 300
13 2 2 1 0 0
14 2 2 2 0 200
15 2 2 3 1 300
16 2 3 1 1 0
17 2 3 2 0 500
18 2 3 3 0 300
2) This could also be expressed using magrittr, giving the same answer. Note that the last two pipes use the exposition operator %$%, which provides an implicit with(., ...) around the subsequent expression:
library(magrittr)
library(reshape2)
d2 %>%
  melt(id = 1:4) %>%
  cbind(read.table(text = as.character(.$variable), sep = "_")) %>%
  transform(question = as.numeric(V1), option = V4) %$%
  .[order(id), ] %$%
  data.frame(id,
             question,
             option,
             choice = (.[cbind(1:nrow(.), question + 1)] == option) + 0,
             value)
3) This can be translated to reshape2/dplyr/tidyr:
library(reshape2)
library(dplyr)
library(tidyr)
d2 %>%
  melt(id = 1:4) %>%
  separate(variable, c("question", "X", "Opt", "option")) %>%
  arrange(id) %>%
  mutate(question = as.numeric(factor(question)),
         choice = (.[cbind(1:n(), question + 1)] == option) + 0) %>%
  select(id, question, option, choice, value)
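For completeness, here is a sketch of the same reshape using tidyr's newer pivot_longer (assuming tidyr >= 1.1 for names_transform). The chosen options are pivoted separately and joined back on to compute choice:
library(dplyr)
library(tidyr)
# the chosen option for each question, one row per id/question
answers <- d2 %>%
  select(id, Q1:Q3) %>%
  pivot_longer(-id, names_to = "question", names_prefix = "Q",
               names_transform = list(question = as.integer),
               values_to = "answer")
d2 %>%
  pivot_longer(matches("_X_Opt_"),
               names_to = c("question", "option"),
               names_pattern = "Q(\\d+)_X_Opt_(\\d+)",
               names_transform = list(question = as.integer, option = as.integer),
               values_to = "cost") %>%
  select(-Q1, -Q2, -Q3) %>%
  left_join(answers, by = c("id", "question")) %>%
  mutate(choice = +(option == answer)) %>%
  select(id, question, option, choice, cost) %>%
  arrange(id, question, option)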
