Add rows to data frame based on count - r

I have a data frame that looks like this:
Species_ID
Location_ID
Altitude
Female
Male
mon
WH
1700
3
10
jon
IF
1850
5
2
sylv
WS
2100
7
3
ter
MB
1700
20
15
I would like to have a total number of individuals (Female & Male) as an extra column
I would like to add rows to the data frame based on the total number of individuals, each row containing all info of the columns. So for example for the Species_ID "mon" we have a total number of 13 individuals. So i want 13 extra rows containing all infos of "Species_ID", "Location_ID" and "Altitude"
I pretty sure I can handle the first question by using mutate(), but I have absolutely no idea how to solve the second step.

You can use uncount from tidyr. The optional argument .id creates a new variable which gives a unique identifier for each created row.
library(tidyr)
df %>%
uncount(Female + Male, .id = "ID")
# Species_ID Location_ID Altitude Female Male ID
# 1 mon WH 1700 3 10 1
# 2 mon WH 1700 3 10 2
# 3 mon WH 1700 3 10 3
# 4 mon WH 1700 3 10 4
# 5 mon WH 1700 3 10 5
# 6 mon WH 1700 3 10 6
# 7 mon WH 1700 3 10 7
# 8 mon WH 1700 3 10 8
# 9 mon WH 1700 3 10 9
# 10 mon WH 1700 3 10 10
# 11 mon WH 1700 3 10 11
# 12 mon WH 1700 3 10 12
# 13 mon WH 1700 3 10 13
# ...
Data
df <- structure(
list(Species_ID = c("mon", "jon", "sylv", "ter"),
Location_ID = c("WH", "IF", "WS", "MB"),
Altitude = c(1700L, 1850L, 2100L, 1700L),
Female = c(3L, 5L, 7L, 20L),
Male = c(10L, 2L, 3L, 15L)),
class = "data.frame", row.names = c(NA, -4L))

Is this what you're looking for:
library(dplyr)
d <- tibble::tribble(
~Species_ID, ~Location_ID, ~Altitude, ~Female, ~Male,
"mon", "WH", 1700, 3, 10,
"jon", "IF", 1850, 5, 2,
"sylv", "WS", 2100, 7, 3,
"ter", "MB", 1700, 20, 15)
d <- d %>%
mutate(all_obs = Female + Male)
d[rep(1:nrow(d), d$all_obs), 1:3]
#> # A tibble: 65 × 3
#> Species_ID Location_ID Altitude
#> <chr> <chr> <dbl>
#> 1 mon WH 1700
#> 2 mon WH 1700
#> 3 mon WH 1700
#> 4 mon WH 1700
#> 5 mon WH 1700
#> 6 mon WH 1700
#> 7 mon WH 1700
#> 8 mon WH 1700
#> 9 mon WH 1700
#> 10 mon WH 1700
#> # … with 55 more rows
Created on 2023-01-17 by the reprex package (v2.0.1)

Ok, so I solved it like this:
b2 <- b1 %>%
rowwise() %>%
mutate(all_obs = sum(Weibchen,Arbeiterinnen,Männchen, na.rm=TRUE))
%>%
dplyr::select(Taxon_ID, Standort, Höhenstufe, all_obs)
b3 <- uncount(b2, all_obs, .remove=TRUE, .id="ID")

Related

Repeatedly count events before a certain date in R

I have a data set with a list of event dates and a list of sample dates. Events and samples are grouped by unit. For each sample date, I want to count the number of events that came before that sample date
and the number of different months in which those events occurred, grouped by unit. A couple complications: sometimes the event date happens after the sample date in the same year. Sometimes there are sample dates but no event in a particular year.
Example data (my actual dataset has ~6000 observations):
data<-read.table(header=T, text="
unit eventdate eventmonth sampledate year
a 1996-06-01 06 1996-08-01 1996
a 1997-09-03 09 1997-08-02 1997
a 1998-05-15 05 1998-08-03 1998
a NA NA 1999-08-02 1999
b 1996-05-31 05 1996-08-01 1996
b 1997-05-31 05 1997-08-02 1997
b 1998-05-15 05 1998-08-03 1998
b 1999-05-16 05 1999-08-02 1999")
Output data should look something like this:
year unit numevent nummonth
1996 a 1 1
1997 a 1 1
1998 a 3 3
1999 a 3 3
1996 b 1 1
1997 b 2 1
1998 b 3 1
1999 b 4 1
Note that in 1997 in unit a, the event is not counted because it happened after the sample date.
For smaller datasets, I have manually subset the data by each sample date and counted events/unique months (and then merged the datasets back together), but I can't do that with ~6000 observations.
numevent.1996<-ddply(data[data$eventdate<'1996-08-01',], .(unit),
summarize, numevent=length(eventdate), nummth=length(unique(eventmonth)), year=1996)
This might work:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data<-read.table(header=T, text="
unit eventdate eventmonth sampledate year
a 1996-06-01 06 1996-08-01 1996
a 1997-09-03 09 1997-08-02 1997
a 1998-05-15 05 1998-08-03 1998
a NA NA 1999-08-02 1999
b 1996-05-31 05 1996-08-01 1996
b 1997-05-31 05 1997-08-02 1997
b 1998-05-15 05 1998-08-03 1998
b 1999-05-16 05 1999-08-02 1999")
data <- data %>%
mutate(eventdate = lubridate::ymd(eventdate),
sampledate = lubridate::ymd(sampledate))
data %>%
group_by(unit, year, eventmonth) %>%
summarise(numevent = sum(sampledate >= eventdate)) %>%
group_by(unit, year) %>%
summarise(nummonth = sum(numevent > 0),
numevent = sum(numevent))
#> `summarise()` has grouped output by 'unit', 'year'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'unit'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 × 4
#> # Groups: unit [2]
#> unit year nummonth numevent
#> <chr> <int> <int> <int>
#> 1 a 1996 1 1
#> 2 a 1997 0 0
#> 3 a 1998 1 1
#> 4 a 1999 NA NA
#> 5 b 1996 1 1
#> 6 b 1997 1 1
#> 7 b 1998 1 1
#> 8 b 1999 1 1
Created on 2023-01-08 by the reprex package (v2.0.1)
Note, I don't think the data you've included actually produce the output you proposed as the output looks to have 18 events that meet the condition and there are only 8 rows in the sample data provided.
Try this?
data %>%
group_by(unit) %>%
mutate(
numevent = sapply(sampledate, function(z) sum(eventdate < z, na.rm = TRUE)),
nummonth = sapply(sampledate, function(z) length(unique(na.omit(eventmonth[eventdate < z]))))
) %>%
ungroup()
# # A tibble: 8 × 7
# unit eventdate eventmonth sampledate year numevent nummonth
# <chr> <date> <int> <date> <int> <int> <int>
# 1 a 1996-06-01 6 1996-08-01 1996 1 1
# 2 a 1997-09-03 9 1997-08-02 1997 1 1
# 3 a 1998-05-15 5 1998-08-03 1998 3 3
# 4 a NA NA 1999-08-02 1999 3 3
# 5 b 1996-05-31 5 1996-08-01 1996 1 1
# 6 b 1997-05-31 5 1997-08-02 1997 2 1
# 7 b 1998-05-15 5 1998-08-03 1998 3 1
# 8 b 1999-05-16 5 1999-08-02 1999 4 1
Data
data <- structure(list(unit = c("a", "a", "a", "a", "b", "b", "b", "b"), eventdate = structure(c(9648, 10107, 10361, NA, 9647, 10012, 10361, 10727), class = "Date"), eventmonth = c(6L, 9L, 5L, NA, 5L, 5L, 5L, 5L), sampledate = structure(c(9709, 10075, 10441, 10805, 9709, 10075, 10441, 10805), class = "Date"), year = c(1996L, 1997L, 1998L, 1999L, 1996L, 1997L, 1998L, 1999L)), class = "data.frame", row.names = c(NA, -8L))

How best to parse fields in R?

Below is the sample data. This is how it comes from the current population survey. There are 115 columns in the original. Below is just a subset. At the moment, I simply append a new row each month and leave it as is. However, there has been a new request that it be made longer and parsed a bit.
For some context, the first character is the race, a = all, b=black, w=white, and h= hispanic. The second character is the gender, x = all, m = male, and f= female. The third variable, which does not appear in all columns is the age. These values are 2024 for ages 20-24, 3039 or 30-39, and so on. Each one will end in the terms, laborforce unemp or unemprate.
stfips <- c(32,32,32,32,32,32,32,32)
areatype <- c(01,01,01,01,01,01,01,01)
periodyear <- c(2021,2021,2021,2021,2021,2021,2021,2021)
period <- (01,02,03,04,05,06,07,08)
xalaborforce <- c(1210.9,1215.3,1200.6,1201.6,1202.8,1209.3,1199.2,1198.9)
xaunemp <- c(55.7,55.2,65.2,321.2,77.8,88.5,92.4,102.6)
xaunemprate <- c(2.3,2.5,2.7,2.9,3.2,6.5,6.0,12.5)
walaborforce <- c(1000.0,999.2,1000.5,1001.5,998.7,994.5,999.2,1002.8)
waunemp <- c(50.2,49.5,51.6,251.2,59.9,80.9,89.8,77.8)
waunemprate <- c(3.4,3.6,3.8,4.0,4.2,4.5,4.1,2.6)
balaborforce <- c (5.5,5.7,5.2,6.8,9.2,2.5,3.5,4.5)
ba2024laborforce <- c(1.2,1.4,1.2,1.3,1.6,1.7,1.4,1.5)
ba2024unemp <- c(.2,.3,.2,.3,.4,.5,.02,.19))
ba2024lunemprate <- c(2.1,2.2,3.2,3.2,3.3,3.4,1.2,2.5)
test2 <- data.frame (stfips,areatype,periodyear, period, xalaborforce,xaunemp,xaunemprate,walaborforce, waunemp,waunemprate,balaborforce,ba2024laborforce,ba2024unemp,ba2024unemprate)
Desired result
stfips areatype periodyear period race gender age laborforce unemp unemprate
32 01 2021 01 x a all 1210.9 55.7 2.3
32 01 2021 02 x a all 1215.3 55.2 2.5
.....(the other six rows for race = x and gender = a
32 01 2021 01 w a all 1000.0 50.2 3.4
32 01 2021 02 w a all 999.2 49.5 3.6
....(the other six rows for race = w and gender = a
32 01 2021 01 b a 2024 1.2 .2 2.1
Edit -- added handling for columns with age prefix. Mostly there, but would be nice to have a concise way to add the - to make 2024 into 20-24....
test2 %>%
pivot_longer(xalaborforce:ba2024laborforce) %>%
separate(name, c("race", "gender", "stat"), sep = c(1,2)) %>%
mutate(age = coalesce(parse_number(stat) %>% as.character, "all"),
stat = str_remove_all(stat, "[0-9]")) %>%
pivot_wider(names_from = stat, values_from = value)
# A tibble: 32 × 10
stfips areatype periodyear period race gender age laborforce unemp unemprate
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 32 1 2021 1 x a all 1211. 55.7 2.3
2 32 1 2021 1 w a all 1000 50.2 3.4
3 32 1 2021 1 b a all 5.5 NA NA
4 32 1 2021 1 b a 2024 1.2 NA NA
5 32 1 2021 2 x a all 1215. 55.2 2.5
6 32 1 2021 2 w a all 999. 49.5 3.6
7 32 1 2021 2 b a all 5.7 NA NA
8 32 1 2021 2 b a 2024 1.4 NA NA
9 32 1 2021 3 x a all 1201. 65.2 2.7
10 32 1 2021 3 w a all 1000. 51.6 3.8
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows

Joining two data frames using range of values

I have two data sets I would like to join. The income_range data is the master dataset and I would like to join data_occ to the income_range data based on what band the income falls inside. Where there are more than two observations(incomes) that are within the range I would like to take the lower income.
I was attempting to use data.table but was having trouble. I was would also like to keep all columns from both data.frames if possible.
The output dataset should only have 7 observations.
library(data.table)
library(dplyr)
income_range <- data.frame(id = "France"
,inc_lower = c(10, 21, 31, 41,51,61,71)
,inc_high = c(20, 30, 40, 50,60,70,80)
,perct = c(1,2,3,4,5,6,7))
data_occ <- data.frame(id = rep(c("France","Belgium"), each=50)
,income = sample(10:80, 50)
,occ = rep(c("manager","clerk","manual","skilled","office"), each=20))
setDT(income_range)
setDT(data_occ)
First attempt.
df2 <- income_range [data_occ ,
on = .(id, inc_lower <= income, inc_high >= income),
.(id, income, inc_lower,inc_high,perct,occ)]
Thank you in advance.
Since you tagged dplyr, here's one possible solution using that library:
library('fuzzyjoin')
# join dataframes on id == id, inc_lower <= income, inc_high >= income
joined <- income_range %>%
fuzzy_left_join(data_occ,
by = c('id' = 'id', 'inc_lower' = 'income', 'inc_high' = 'income'),
match_fun = list(`==`, `<=`, `>=`)) %>%
rename(id = id.x) %>%
select(-id.y)
# sort by income, and keep only the first row of every unique perct
result <- joined %>%
arrange(income) %>%
group_by(perct) %>%
slice(1)
And the (intermediate) results:
> head(joined)
id inc_lower inc_high perct income occ
1 France 10 20 1 10 manager
2 France 10 20 1 19 manager
3 France 10 20 1 14 manager
4 France 10 20 1 11 manager
5 France 10 20 1 17 manager
6 France 10 20 1 12 manager
> result
# A tibble: 7 x 6
# Groups: perct [7]
id inc_lower inc_high perct income occ
<chr> <dbl> <dbl> <dbl> <int> <chr>
1 France 10 20 1 10 manager
2 France 21 30 2 21 manual
3 France 31 40 3 31 manual
4 France 41 50 4 43 manager
5 France 51 60 5 51 clerk
6 France 61 70 6 61 manager
7 France 71 80 7 71 manager
I've added the intermediate dataframe joined for easy of understanding. You can omit it and just chain the two command chains together with %>%.
Here is one data.table approach:
cols = c("inc_lower", "inc_high")
data_occ[, (cols) := income]
result = data_occ[order(income)
][income_range,
on = .(id, inc_lower>=inc_lower, inc_high<=inc_high),
mult="first"]
data_occ[, (cols) := NULL]
# id income occ inc_lower inc_high perct
# 1: France 10 clerk 10 20 1
# 2: France 21 manager 21 30 2
# 3: France 31 clerk 31 40 3
# 4: France 41 clerk 41 50 4
# 5: France 51 clerk 51 60 5
# 6: France 62 manager 61 70 6
# 7: France 71 manager 71 80 7

Using dplyr mutate function to create new variable conditionally based on current row

I am working on creating conditional averages for a large data set that involves # of flu cases seen during the week for several years. The data is organized as such:
What I want to do is create a new column that tabulates that average number of cases for that same week in previous years. For instance, for the row where Week.Number is 1 and Flu.Year is 2017, I would like the new row to give the average count for any year with Week.Number==1 & Flu.Year<2017. Normally, I would use the case_when() function to conditionally tabulate something like this. For instance, when calculating the average weekly volume I used this code:
mutate(average = case_when(
Flu.Year==2016 ~ mean(chcc$count[chcc$Flu.Year==2016]),
Flu.Year==2017 ~ mean(chcc$count[chcc$Flu.Year==2017]),
Flu.Year==2018 ~ mean(chcc$count[chcc$Flu.Year==2018]),
Flu.Year==2019 ~ mean(chcc$count[chcc$Flu.Year==2019]),
),
However, since there are four years of data * 52 weeks which is a lot of iterations to spell out the conditions for. Is there a way to elegantly code this in dplyr? The problem I keep running into is that I want to call values in counts column based on Week.Number and Flu.Year values in other rows conditioned on the current value of Week.Number and Flu.Year, and I am not sure how to accomplish that. Please let me know if there is further information / detail I can provide.
Thanks,
Steven
dat <- tibble( Flu.Year = rep(2016:2019,each = 52), Week.Number = rep(1:52,4), count = sample(1000, size=52*4, replace=TRUE) )
It's bad-form and, in some cases, an error when you use $-indexing within dplyr verbs.
I think a better way to get that average field is to group_by(Flu.Year) and calculate it straight-up.
library(dplyr)
set.seed(42)
dat <- tibble(
Flu.Year = sample(2016:2020, size=100, replace=TRUE),
count = sample(1000, size=100, replace=TRUE)
)
dat %>%
group_by(Flu.Year) %>%
mutate(average = mean(count)) %>%
# just to show a quick summary
slice(1:3) %>%
ungroup()
# # A tibble: 15 x 3
# Flu.Year count average
# <int> <int> <dbl>
# 1 2016 734 578.
# 2 2016 356 578.
# 3 2016 411 578.
# 4 2017 217 436.
# 5 2017 453 436.
# 6 2017 920 436.
# 7 2018 963 558
# 8 2018 609 558
# 9 2018 536 558
# 10 2019 943 543.
# 11 2019 740 543.
# 12 2019 536 543.
# 13 2020 627 494.
# 14 2020 218 494.
# 15 2020 389 494.
An alternative approach is to generate a summary table (just one row per year) and join it back in to the original data.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count))
# # A tibble: 5 x 2
# Flu.Year average
# <int> <dbl>
# 1 2016 578.
# 2 2017 436.
# 3 2018 558
# 4 2019 543.
# 5 2020 494.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count)) %>%
full_join(dat, by = "Flu.Year")
# # A tibble: 100 x 3
# Flu.Year average count
# <int> <dbl> <int>
# 1 2016 578. 734
# 2 2016 578. 356
# 3 2016 578. 411
# 4 2016 578. 720
# 5 2016 578. 851
# 6 2016 578. 822
# 7 2016 578. 465
# 8 2016 578. 679
# 9 2016 578. 30
# 10 2016 578. 180
# # ... with 90 more rows
The result, after chat:
tibble( Flu.Year = rep(2016:2018,each = 3), Week.Number = rep(1:3,3), count = 1:9 ) %>%
arrange(Flu.Year, Week.Number) %>%
group_by(Week.Number) %>%
mutate(year_week.average = lag(cumsum(count) / seq_along(count)))
# # A tibble: 9 x 4
# # Groups: Week.Number [3]
# Flu.Year Week.Number count year_week.average
# <int> <int> <int> <dbl>
# 1 2016 1 1 NA
# 2 2016 2 2 NA
# 3 2016 3 3 NA
# 4 2017 1 4 1
# 5 2017 2 5 2
# 6 2017 3 6 3
# 7 2018 1 7 2.5
# 8 2018 2 8 3.5
# 9 2018 3 9 4.5
We can use aggregate from base R
aggregate(count ~ Flu.Year, data, FUN = mean)

How to group by in base R

I would like to express the following SQL query using base R (without any particular package):
select month, day, count(*) as count, avg(dep_delay) as avg_delay
from flights
group by month, day
having count > 1000
It selects the mean departure delay and the number of flights per day on busy days (days with more than 1000 flights). The dataset is nycflights13 containing information of flights departed from NYC in 2013.
Notice I can easily write this in dplyr as:
flights %>%
group_by(month, day) %>%
summarise(count = n(), avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
filter(count > 1000)
Since I was reminded earlier about the elegance of by (tip of the hat to #Parfait), here is a solution using by:
res <- by(flights, list(flights$month, flights$day), function(x)
if (nrow(x) > 1000) {
c(
month = unique(x$month),
day = unique(x$day),
count = nrow(x),
avg_delay = mean(x$dep_delay, na.rm = TRUE))
})
# Store in data.frame and order by month, day
df <- do.call(rbind, res);
df <- df[order(df[, 1], df[, 2]) ,];
# month day count avg_delay
#[1,] 7 8 1004 37.296646
#[2,] 7 9 1001 30.711499
#[3,] 7 10 1004 52.860702
#[4,] 7 11 1006 23.609392
#[5,] 7 12 1002 25.096154
#[6,] 7 17 1001 13.670707
#[7,] 7 18 1003 20.626789
#[8,] 7 25 1003 19.674134
#[9,] 7 31 1001 6.280843
#[10,] 8 7 1001 8.680402
#[11,] 8 8 1001 43.349947
#[12,] 8 12 1001 8.308157
#[13,] 11 27 1014 16.697651
#[14,] 12 2 1004 9.021978
as commented you can use a combi of subset and aggregate. Changed the order of day & month to recieve the same order as your dplyr approach. Using na.action = NULL to count rows inclunding NAs.
library(nycflights13)
#> Warning: Paket 'nycflights13' wurde unter R Version 3.4.4 erstellt
subset(aggregate(dep_delay ~ day + month, flights,
function(x) cbind(count=length(x), avg_delay=mean(x, na.rm = TRUE)),
na.action = NULL),
dep_delay[,1] > 1000)
#> day month dep_delay.1 dep_delay.2
#> 189 8 7 1004.000000 37.296646
#> 190 9 7 1001.000000 30.711499
#> 191 10 7 1004.000000 52.860702
#> 192 11 7 1006.000000 23.609392
#> 193 12 7 1002.000000 25.096154
#> 198 17 7 1001.000000 13.670707
#> 199 18 7 1003.000000 20.626789
#> 206 25 7 1003.000000 19.674134
#> 212 31 7 1001.000000 6.280843
#> 219 7 8 1001.000000 8.680402
#> 220 8 8 1001.000000 43.349947
#> 224 12 8 1001.000000 8.308157
#> 331 27 11 1014.000000 16.697651
#> 336 2 12 1004.000000 9.021978
Created on 2018-04-05 by the reprex package (v0.2.0).
Not a particularly elegant solution, but this will do what you want using Base R
flights_split <- split(flights, f = list(flights$month, flights$day))
result <- lapply(flights_split, function(x) {
if(nrow(x) > 1000) {
data.frame(month = unique(x$month), day = unique(x$day), avg_delay = mean(x$dep_delay, na.rm = T), count = nrow(x))
} else {
NULL
}
}
)
do.call(rbind, result)
# month day mean_delay n
# 12.2 12 2 9.021978 1004
# 8.7 8 7 8.680402 1001
# 7.8 7 8 37.296646 1004
# 8.8 8 8 43.349947 1001
# 7.9 7 9 30.711499 1001
# 7.10 7 10 52.860702 1004
# 7.11 7 11 23.609392 1006
# 7.12 7 12 25.096154 1002
# 8.12 8 12 8.308157 1001
# 7.17 7 17 13.670707 1001
# 7.18 7 18 20.626789 1003
# 7.25 7 25 19.674134 1003
# 11.27 11 27 16.697651 1014
# 7.31 7 31 6.280843 1001
Here is my solution:
grp <- expand.grid(mth = unique(flights$month), d = unique(flights$day))
out <- mapply(function(mth, d){
sub_data <- subset(flights, month == mth & day == d)
df <- data.frame(
month = mth,
day = d,
count = nrow(sub_data),
avg_delay = mean(sub_data$dep_delay, na.rm = TRUE)
)
df[df$count > 1000]
}, grp$mth, grp$d)
res <- do.call(rbind, out)
This is a lot slower than the dplyr solution.

Resources