Add column to data frame with loop calculation from another data frame - r

I have two datasets, one at the individual level and one at the school level. I would like to calculate the proportion of students who fight in each school using a loop (since I have >100 schools).
Current code:
for (i in levels(df$school_id)) {
school <- subset(df, school_id == i)
number_students <- nrow(school)
prop <- (sum(school$fight_binary, na.rm = TRUE))/number_students
df$proportion_fight[df$school_id == i] <- prop
}
I tried initializing the new column first, but when I run this loop nothing happens at all.
Here's some sample data
INDIVIDUAL LEVEL:
student_id school_id ever_fight
1 2 1
2 3 0
3 1 1
4 1 1
5 2 0
6 2 0
7 2 0
8 2 0
9 3 1
10 1 0
11 3 1
12 3 1
13 3 1
14 3 1
15 1 0
16 2 0
17 1 0
18 1 0
19 1 0
20 1 0
SCHOOL LEVEL (need to fill the second column with data from above):
school_id proportion_fight
1
2
3

We can use a grouped mean. Since ever_fight is coded 0/1, its mean within each school is the proportion of students who fight:
library(dplyr)
df1 %>%
group_by(school_id) %>%
summarise(proportion_fight = mean(ever_fight))
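As for why the original loop appeared to do nothing: one likely cause (an assumption, since the structure of df is not shown) is that school_id is stored as a number rather than a factor. In that case levels() returns NULL and the loop body never runs. The loop also refers to fight_binary, while the sample data calls the column ever_fight. A minimal base R sketch on the sample data, without a loop:

```r
# Sample data from the question (individual level)
df <- data.frame(
  student_id = 1:20,
  school_id  = c(2, 3, 1, 1, 2, 2, 2, 2, 3, 1, 3, 3, 3, 3, 1, 2, 1, 1, 1, 1),
  ever_fight = c(1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)

# A numeric column has no levels, so for (i in levels(...)) iterates zero times
levels(df$school_id)

# Per-student column: each student gets the proportion for their school
df$proportion_fight <- ave(df$ever_fight, df$school_id,
                           FUN = function(x) mean(x, na.rm = TRUE))

# School-level table: one row per school
school <- aggregate(ever_fight ~ school_id, data = df, FUN = mean)
names(school)[2] <- "proportion_fight"
school
```

ave() returns one value per student, which fills the individual-level column; aggregate() returns one row per school, which is the shape of the school-level table.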

Related

R script/Survey data: Find sibling and record when that sibling is <=1 year old

I would like to find a child's sibling(s) in survey data, check if it has ANY sibling whose age is <= 1 year, and store the result (1,0).
Here is my data:
cluster house_number age
1 5 0
1 5 1
1 8 4
1 21 4
1 21 1
2 22 0
2 36 0
2 5 0
2 5 2
2 5 3
I thought of looking for matches on cluster and house_number, and then checking the age. But when there is a match, how can I check the ages of each child's siblings and store the result (1 when the child has at least one sibling <= 1 year of age)? The end result should look like this:
cluster house_number age sibling_age1
1 5 0 1
1 5 1 1
1 8 4 0
1 21 4 1
1 21 1 0
2 22 0 0
2 36 0 0
2 5 0 0
2 5 2 1
2 5 3 1
Do you mean something like this?
# let's call your dataframe : data
# we create a new column called sibling_age on the condition of the age column
# and we use the ifelse function
# the first value represents the if argument
# the second value represents the else argument
data$sibling_age = ifelse(data$age>1,1,0)
# I hope that this is what you were looking for.
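The ifelse() call above flags each child by its own age. If what is wanted is the table in the question (1 when some other child in the same cluster and house is <= 1 year old), a minimal base R sketch could count the young children per household and then leave the child itself out of that count. This assumes that children sharing cluster and house_number are siblings, which the question implies but does not state:

```r
# Data from the question
data <- data.frame(
  cluster      = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  house_number = c(5, 5, 8, 21, 21, 22, 36, 5, 5, 5),
  age          = c(0, 1, 4, 4, 1, 0, 0, 0, 2, 3)
)

# 1 if the child itself is <= 1 year old
young <- as.integer(data$age <= 1)

# Number of young children in the same cluster + house
n_young <- ave(young, data$cluster, data$house_number, FUN = sum)

# Subtract the child itself: what is left is the number of young siblings
data$sibling_age1 <- as.integer(n_young - young > 0)
data
```

On the sample data this gives 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, which matches the desired column.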

How to have R sum nonexistent or null data

This is a bit convoluted, so I will start with the basic concept. The data is employment by area and sizeclass. From it, I produce a data frame that has the sizeclass, area, total employment by sizeclass, and number of worksites by sizeclass. The bigger the sizeclass, the more employment: sizeclass 1 means employing between 0 and 4, and sizeclass 9 means employing 1000+. Obviously some areas do not have large employers. However, I need the end result to always have 9 rows per area, even if there is 0 employment for that sizeclass. Sample data is below.
area <- c(01,01,01,01,01,01,01,03,03,03,03)
employment <- c(1,5,9,10,11,12,67,100,4,444,149)
sizeclass <- c(1,2,2,3,3,3,5,6,1,7,6)
df2 <- data.frame(area,employment,sizeclass)
This is the code that I am using and while it works, it does not produce a result for sizeclass 4 in area 01 for example. How would I have it sum by sizeclass even if there is nothing to sum or count?
sizeclassreport <- df2 %>%
select (area,employment,sizeclass) %>%
group_by(area,sizeclass) %>%
summarise(employment = sum(employment),worksites = n())
The desired result would be 18 rows in length with the sum of employment by sizeclass for each sizeclass and number of worksites even if there is no employment.
We can use complete() from tidyr to expand 'sizeclass' to the full range 1 to 9 within each area. By default, the other columns are filled with NA in the added rows; the fill argument replaces that with a custom value, here 0.
library(dplyr)
library(tidyr)
sizeclassreport %>%
group_by(area) %>%
complete(sizeclass = 1:9,
fill = list(employment = 0, worksites = 0)) %>%
ungroup()
-output
# A tibble: 18 x 4
area sizeclass employment worksites
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 1 2 14 2
3 1 3 33 3
4 1 4 0 0
5 1 5 67 1
6 1 6 0 0
7 1 7 0 0
8 1 8 0 0
9 1 9 0 0
10 3 1 4 1
11 3 2 0 0
12 3 3 0 0
13 3 4 0 0
14 3 5 0 0
15 3 6 249 2
16 3 7 444 1
17 3 8 0 0
18 3 9 0 0
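The same idea can be sketched in base R without dplyr or tidyr: build the full grid of area and sizeclass combinations, left-join the summaries onto it, and turn the resulting NAs into zeros. The object names (sums, counts, grid, full) are placeholders chosen for this sketch:

```r
area       <- c(1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3)
employment <- c(1, 5, 9, 10, 11, 12, 67, 100, 4, 444, 149)
sizeclass  <- c(1, 2, 2, 3, 3, 3, 5, 6, 1, 7, 6)
df2 <- data.frame(area, employment, sizeclass)

# Total employment and number of worksites per observed area/sizeclass pair
sums   <- aggregate(employment ~ area + sizeclass, data = df2, FUN = sum)
counts <- aggregate(employment ~ area + sizeclass, data = df2, FUN = length)
names(counts)[3] <- "worksites"
report <- merge(sums, counts)

# Every area crossed with every sizeclass 1..9
grid <- expand.grid(area = sort(unique(df2$area)), sizeclass = 1:9)

# Left join keeps all grid rows; combinations with no data become NA, then 0
full <- merge(grid, report, all.x = TRUE)
full[is.na(full)] <- 0
full <- full[order(full$area, full$sizeclass), ]
full
```

The result has 18 rows (2 areas times 9 sizeclasses), with 0 employment and 0 worksites wherever a sizeclass does not occur in an area.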

Assign observation level values by grouping variable

Thanks in advance for any help.
I have the below data frame
> df <- data.frame(
id = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5),
time = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6),
mortality = c(NA,1,0,0,0,0,NA,0,0,0,0,1,NA,0,0,0,0,0,NA,0,0,1,0,0,NA,0,1,0,0,0)
)
> head(df)
id time mortality
1 1 1 NA
2 1 2 1
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
df$id represents individuals measured at six points in time throughout a survival trial. At the start of the trial all individuals are alive, and they subsequently die or remain alive. df$mortality indicates the time period in which the individual died; for example, individual 1 died in time period 2.
I would like to create a new variable indicating what I have called cumulative survival. This would indicate if the individual had died in the current time period or any of the previous time periods. How would I code this?
I have tried a number of different ways using ifelse() statements and dplyr group_by() without success.
Below is what the new data frame should look like. Thanks
> df
id time mortality cum.survival
1 1 1 NA 0
2 1 2 1 1
3 1 3 0 1
4 1 4 0 1
5 1 5 0 1
6 1 6 0 1
7 2 1 NA 0
8 2 2 0 0
9 2 3 0 0
10 2 4 0 0
11 2 5 0 0
12 2 6 1 1
13 3 1 NA 0
14 3 2 0 0
15 3 3 0 0
16 3 4 0 0
17 3 5 0 0
18 3 6 0 0
19 4 1 NA 0
20 4 2 0 0
21 4 3 0 0
22 4 4 1 1
23 4 5 0 1
24 4 6 0 1
25 5 1 NA 0
26 5 2 0 0
27 5 3 1 1
28 5 4 0 1
29 5 5 0 1
30 5 6 0 1
Assuming each individual dies at most once (at most one 1 per id), we can use cumsum.
First, replace the NA values in mortality with 0 and store the result in cum.survival:
df <- transform(df, cum.survival = replace(mortality, is.na(mortality), 0))
We can then use base R :
df$cum.survival <- with(df, ave(cum.survival, id, FUN = cumsum))
dplyr :
library(dplyr)
df %>% group_by(id) %>% mutate(cum.survival = cumsum(cum.survival))
Or data.table :
library(data.table)
setDT(df)[, cum.survival := cumsum(cum.survival), id]
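Another option is to compare the row index within each group to the index of the first 1. An individual who never dies (id 3 here) has no 1 to find, so that case needs a guard: which.max would otherwise return the position of the first 0, and match would return NA.
We can use which.max:
df %>%
group_by(id) %>%
mutate(cum.survival = +(any(mortality == 1, na.rm = TRUE) & row_number() >= which.max(mortality)))
Or match:
df %>%
group_by(id) %>%
mutate(cum.survival = +(row_number() >= match(1, mortality, nomatch = n() + 1L)))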
An option using by:
df$cum.survival <- unlist(by(df$mortality, df$id, function(x) cummax(replace(x, is.na(x), 0L))))
or ave:
df$cum.survival <- ave(df$mortality, df$id, FUN=function(x) cummax(replace(x, is.na(x), 0L)))
or tapply:
df$cum.survival <- unlist(tapply(df$mortality, df$id, FUN=function(x) cummax(replace(x, is.na(x), 0L))))
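The cumsum answers rely on the "dies only once" assumption, while the cummax versions do not. A small sketch of the difference, using a made-up vector (not from the question's data) in which a 1 is recorded twice for the same individual:

```r
# Hypothetical record with a duplicated death entry
x  <- c(NA, 0, 1, 0, 1, 0)
x0 <- replace(x, is.na(x), 0)

cumsum(x0)  # keeps counting: 0 0 1 1 2 2
cummax(x0)  # stays at 1 once a 1 has been seen: 0 0 1 1 1 1
```

If the data may contain repeated 1s per individual, cummax keeps the indicator binary.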

R Spread function with duplicates- still can't get to work after adding transient row

I'm trying to get the spread() function to work with duplicates in the key column. Yes, this has been covered before, but I can't seem to get it to work and I've spent the better part of a day on it (I'm somewhat new to R).
I have two columns of data. The first column 'snowday' represents the first day of a winter season, with the corresponding snow depth in the 'depth' column. This is several years of data (~62 years). So there should be sixty two years of first, second, third, etc days for the snowday column- this produces duplicates in snowday:
snowday row depth
1 1 0
1 2 0
1 3 0
1 4 0
1 5 0
1 6 0
...
75 4633 24
75 4634 4
75 4635 6
75 4636 20
75 4637 29
75 4638 1
I added a "row" column to make the data frame more transient (a term I only vaguely understand), so rows 1:4638 are the total measurements taken over ~62 years at 75 days per year. Now I'd like to spread it wide:
wide <- spread(seasondata, key = snowday, value = depth, fill = 0)
and i get all zeros:
row 1 2 3 4 5 6 7 8 9 10 11 12 13 14
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
What I want is something like this (the columns are defined by "snowday" and the row values are the various depths recorded for that particular day over the various years, e.g. days 1 through 11):
1 2 3 4 5 6 7 8 9 10 11 12 13 14
2 1 3 4 0 0 1 0 2 8 9 19 0 3
0 8 0 0 0 4 0 6 6 0 1 0 2 0
3 5 0 0 0 2 0 1 0 2 7 0 12 4
I think I'm fundamentally missing something here- I've tried working through drop=TRUE or convert = TRUE, and the output values are either all zeros or NA's depending on how I tinker. Also, all values in the data.frame(seasondata) are integers. Any thoughts?
It seems to me that what you wish to do is split up the depth column according to the values of snowday, and then bind all 75 columns together.
There is a complication: 62*75 is 4650, not 4638, so I assume we do not observe 75 snowdays in some years. That is, some of the 75 columns (snowdays) will not have 62 observations. We'll make sure all 75 columns are 62 entries long by padding the short columns with NAs.
I make some fake data as an example. We observe 3 "years" of data for snowdays 1 and 2, but only 2 "years" of data for snowdays 3 and 4.
set.seed(1)
seasondata <- data.frame(
snowday = c(rep(1:2, each = 3), rep(3:4, each = 2)),
depth = round(runif(10, 0, 10), 0))
# snowday depth
# 1 1 3
# 2 1 4
# 3 1 6
# 4 2 9
# 5 2 2
# 6 2 9
# 7 3 9
# 8 3 7
# 9 4 6
# 10 4 1
We first figure out how long a column should be. In your case, m == 62. In my example, m == 3 (the years of data).
m <- max(table(seasondata$snowday))
Now, we use the by function to split up depth by values of snowdays, and fill short columns with NAs, and finally cbind all the columns together:
out <- do.call(cbind,
by(seasondata$depth, seasondata$snowday,
function(x) {
c(x, rep(NA, m - length(x)))
}
)
)
out
# 1 2 3 4
# [1,] 3 9 9 6
# [2,] 4 2 7 1
# [3,] 6 9 NA NA
Using spread:
You can use spread if you wish. In this case, you have to define row correctly. row should be 1 for the first first snowday (snowday == 1), 2 for the second first snowday, etc. row should also be 1 for the first second snowday, 2 for the second second snowday, etc.
seasondata$row <- sequence(rle(seasondata$snowday)$lengths)
seasondata
# snowday depth row
# 1 1 3 1
# 2 1 4 2
# 3 1 6 3
# 4 2 9 1
# 5 2 2 2
# 6 2 9 3
# 7 3 9 1
# 8 3 7 2
# 9 4 6 1
# 10 4 1 2
Now we can use spread:
library(tidyr)
spread(seasondata, key = snowday, value = depth, fill = NA)
# row 1 2 3 4
# 1 1 3 9 9 6
# 2 2 4 2 7 1
# 3 3 6 9 NA NA
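A plausible reading of the original all-zero result: with a row id that is unique across the whole data set (1:4638), every depth ends up alone in its own row, and fill = 0 pads everything else. The key is an index that restarts within each snowday. For comparison, a base R sketch of the same reshaping on the fake data, using ave() for the within-group index and reshape() for the widening (the depth values are typed in directly rather than drawn with runif):

```r
seasondata <- data.frame(
  snowday = c(rep(1:2, each = 3), rep(3:4, each = 2)),
  depth   = c(3, 4, 6, 9, 2, 9, 9, 7, 6, 1)
)

# Running count within each snowday: 1 = first observation, 2 = second, ...
seasondata$row <- ave(seasondata$snowday, seasondata$snowday, FUN = seq_along)

# One row per index, one depth.<snowday> column per snowday
wide <- reshape(seasondata, idvar = "row", timevar = "snowday",
                direction = "wide")
wide
```

Snowdays with fewer observations get NA in the trailing rows, the same as with spread(fill = NA).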

Subsequent row summing in dataframe object

I would like to compute a running sum of one column, restarting for each value of another column, and put the result into a new column without deleting any rows.
Below is some R code and an example that does the trick and hopefully illustrates my question. I was wondering if there is a more elegant way to do this, since the for loop will be time consuming on my actual object.
Thanks for any feedback.
As an example dataframe:
MyDf <- data.frame(ID = c(1,1,1,2,2,2), Y = 1:6)
MyDf$FIRST <- c(1,0,0,1,0,0)
MyDf.2 <- MyDf
MyDf.2$Y2 <- c(1,3,6,4,9,15)
The purpose of this is so that I can write code that calculates Y2 in MyDf.2 above for each ID, separately.
This is what I came up with, and it does the trick. (It calculates a TEST column in MyDf that has to be equal to Y2 in MyDf.2.)
MyDf$TEST <- NA
for(i in 1:length(MyDf$Y)){
MyDf[i,]$TEST <- ifelse(MyDf[i,]$FIRST == 1, MyDf[i,]$Y,MyDf[i,]$Y + MyDf[i-1,]$TEST)
}
MyDf
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
MyDf.2
ID Y FIRST Y2
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
You need ave and cumsum to get the column you want. transform is just to modify your existing data.frame.
> MyDf <- transform(MyDf, TEST = ave(Y, ID, FUN = cumsum))
> MyDf
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
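If the grouping had to come from the FIRST flag rather than from ID (as the original loop does), one possible sketch is to turn the flag into a group id with cumsum and then apply the same ave() call. The name grp is a placeholder for this sketch:

```r
MyDf <- data.frame(ID = c(1, 1, 1, 2, 2, 2), Y = 1:6)
MyDf$FIRST <- c(1, 0, 0, 1, 0, 0)

# Each FIRST == 1 starts a new group: 1 1 1 2 2 2
grp <- cumsum(MyDf$FIRST)

# Running sum of Y that restarts at every FIRST == 1
MyDf$TEST <- ave(MyDf$Y, grp, FUN = cumsum)
MyDf
```

This assumes the rows are already sorted so that each group's rows are contiguous and begin with FIRST == 1.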

Resources