I have a data set with a list of event dates and a list of sample dates. Events and samples are grouped by unit. For each sample date, I want to count the number of events that came before that sample date
and the number of different months in which those events occurred, grouped by unit. A couple complications: sometimes the event date happens after the sample date in the same year. Sometimes there are sample dates but no event in a particular year.
Example data (my actual dataset has ~6000 observations):
data<-read.table(header=T, text="
unit eventdate eventmonth sampledate year
a 1996-06-01 06 1996-08-01 1996
a 1997-09-03 09 1997-08-02 1997
a 1998-05-15 05 1998-08-03 1998
a NA NA 1999-08-02 1999
b 1996-05-31 05 1996-08-01 1996
b 1997-05-31 05 1997-08-02 1997
b 1998-05-15 05 1998-08-03 1998
b 1999-05-16 05 1999-08-02 1999")
Output data should look something like this:
year unit numevent nummonth
1996 a 1 1
1997 a 1 1
1998 a 3 3
1999 a 3 3
1996 b 1 1
1997 b 2 1
1998 b 3 1
1999 b 4 1
Note that in 1997 in unit a, the event is not counted because it happened after the sample date.
For smaller datasets, I have manually subset the data by each sample date and counted events/unique months (and then merged the datasets back together), but I can't do that with ~6000 observations.
numevent.1996<-ddply(data[data$eventdate<'1996-08-01',], .(unit),
summarize, numevent=length(eventdate), nummth=length(unique(eventmonth)), year=1996)
This might work:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data<-read.table(header=T, text="
unit eventdate eventmonth sampledate year
a 1996-06-01 06 1996-08-01 1996
a 1997-09-03 09 1997-08-02 1997
a 1998-05-15 05 1998-08-03 1998
a NA NA 1999-08-02 1999
b 1996-05-31 05 1996-08-01 1996
b 1997-05-31 05 1997-08-02 1997
b 1998-05-15 05 1998-08-03 1998
b 1999-05-16 05 1999-08-02 1999")
data <- data %>%
mutate(eventdate = lubridate::ymd(eventdate),
sampledate = lubridate::ymd(sampledate))
data %>%
group_by(unit, year, eventmonth) %>%
summarise(numevent = sum(sampledate >= eventdate)) %>%
group_by(unit, year) %>%
summarise(nummonth = sum(numevent > 0),
numevent = sum(numevent))
#> `summarise()` has grouped output by 'unit', 'year'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'unit'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 × 4
#> # Groups: unit [2]
#> unit year nummonth numevent
#> <chr> <int> <int> <int>
#> 1 a 1996 1 1
#> 2 a 1997 0 0
#> 3 a 1998 1 1
#> 4 a 1999 NA NA
#> 5 b 1996 1 1
#> 6 b 1997 1 1
#> 7 b 1998 1 1
#> 8 b 1999 1 1
Created on 2023-01-08 by the reprex package (v2.0.1)
Note, I don't think the data you've included actually produce the output you proposed as the output looks to have 18 events that meet the condition and there are only 8 rows in the sample data provided.
Try this?
data %>%
group_by(unit) %>%
mutate(
numevent = sapply(sampledate, function(z) sum(eventdate < z, na.rm = TRUE)),
nummonth = sapply(sampledate, function(z) length(unique(na.omit(eventmonth[eventdate < z]))))
) %>%
ungroup()
# # A tibble: 8 × 7
# unit eventdate eventmonth sampledate year numevent nummonth
# <chr> <date> <int> <date> <int> <int> <int>
# 1 a 1996-06-01 6 1996-08-01 1996 1 1
# 2 a 1997-09-03 9 1997-08-02 1997 1 1
# 3 a 1998-05-15 5 1998-08-03 1998 3 3
# 4 a NA NA 1999-08-02 1999 3 3
# 5 b 1996-05-31 5 1996-08-01 1996 1 1
# 6 b 1997-05-31 5 1997-08-02 1997 2 1
# 7 b 1998-05-15 5 1998-08-03 1998 3 1
# 8 b 1999-05-16 5 1999-08-02 1999 4 1
Data
data <- structure(list(unit = c("a", "a", "a", "a", "b", "b", "b", "b"), eventdate = structure(c(9648, 10107, 10361, NA, 9647, 10012, 10361, 10727), class = "Date"), eventmonth = c(6L, 9L, 5L, NA, 5L, 5L, 5L, 5L), sampledate = structure(c(9709, 10075, 10441, 10805, 9709, 10075, 10441, 10805), class = "Date"), year = c(1996L, 1997L, 1998L, 1999L, 1996L, 1997L, 1998L, 1999L)), class = "data.frame", row.names = c(NA, -8L))
Related
I have a data frame that looks like this:
Species_ID
Location_ID
Altitude
Female
Male
mon
WH
1700
3
10
jon
IF
1850
5
2
sylv
WS
2100
7
3
ter
MB
1700
20
15
I would like to have a total number of individuals (Female & Male) as an extra column
I would like to add rows to the data frame based on the total number of individuals, each row containing all info of the columns. So for example for the Species_ID "mon" we have a total number of 13 individuals. So i want 13 extra rows containing all infos of "Species_ID", "Location_ID" and "Altitude"
I pretty sure I can handle the first question by using mutate(), but I have absolutely no idea how to solve the second step.
You can use uncount from tidyr. The optional argument .id creates a new variable which gives a unique identifier for each created row.
library(tidyr)
df %>%
uncount(Female + Male, .id = "ID")
# Species_ID Location_ID Altitude Female Male ID
# 1 mon WH 1700 3 10 1
# 2 mon WH 1700 3 10 2
# 3 mon WH 1700 3 10 3
# 4 mon WH 1700 3 10 4
# 5 mon WH 1700 3 10 5
# 6 mon WH 1700 3 10 6
# 7 mon WH 1700 3 10 7
# 8 mon WH 1700 3 10 8
# 9 mon WH 1700 3 10 9
# 10 mon WH 1700 3 10 10
# 11 mon WH 1700 3 10 11
# 12 mon WH 1700 3 10 12
# 13 mon WH 1700 3 10 13
# ...
Data
df <- structure(
list(Species_ID = c("mon", "jon", "sylv", "ter"),
Location_ID = c("WH", "IF", "WS", "MB"),
Altitude = c(1700L, 1850L, 2100L, 1700L),
Female = c(3L, 5L, 7L, 20L),
Male = c(10L, 2L, 3L, 15L)),
class = "data.frame", row.names = c(NA, -4L))
Is this what you're looking for:
library(dplyr)
d <- tibble::tribble(
~Species_ID, ~Location_ID, ~Altitude, ~Female, ~Male,
"mon", "WH", 1700, 3, 10,
"jon", "IF", 1850, 5, 2,
"sylv", "WS", 2100, 7, 3,
"ter", "MB", 1700, 20, 15)
d <- d %>%
mutate(all_obs = Female + Male)
d[rep(1:nrow(d), d$all_obs), 1:3]
#> # A tibble: 65 × 3
#> Species_ID Location_ID Altitude
#> <chr> <chr> <dbl>
#> 1 mon WH 1700
#> 2 mon WH 1700
#> 3 mon WH 1700
#> 4 mon WH 1700
#> 5 mon WH 1700
#> 6 mon WH 1700
#> 7 mon WH 1700
#> 8 mon WH 1700
#> 9 mon WH 1700
#> 10 mon WH 1700
#> # … with 55 more rows
Created on 2023-01-17 by the reprex package (v2.0.1)
Ok, so I solved it like this:
b2 <- b1 %>%
rowwise() %>%
mutate(all_obs = sum(Weibchen,Arbeiterinnen,Männchen, na.rm=TRUE))
%>%
dplyr::select(Taxon_ID, Standort, Höhenstufe, all_obs)
b3 <- uncount(b2, all_obs, .remove=TRUE, .id="ID")
I am trying to obtain the largest 10 investors in a country but obtain confusing result using arrange in dplyr versus order in base R.
head(fdi_partner)
give the following results
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Total registered capital (Mill. USD)(*)`
<chr> <chr> <chr>
1 TOTAL 1818 38854.3
2 Singapore 231 11358.66
3 Korea Rep.of 377 7679.9
4 Japan 204 4325.79
5 Netherlands 24 4209.64
6 China, PR 216 3001.79
and
fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric) %>%
arrange("Number of projects") %>%
head()
give almost the same result
# A tibble: 6 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Singapore 231 11359.
3 Korea Rep.of 377 7680.
4 Japan 204 4326.
5 Netherlands 24 4210.
6 China, PR 216 3002.
while the following code is working fine with base R
head(fdi_partner)
fdi_numeric <- fdi_partner %>%
rename("Registered capital" = "Total registered capital (Mill. USD)(*)") %>%
mutate_at(c("Number of projects", "Registered capital"), as.numeric)
head(fdi_numeric[order(fdi_numeric$"Number of projects", decreasing = TRUE), ], n=11)
which gives
# A tibble: 11 x 3
`Main counterparts` `Number of projects` `Registered capital`
<chr> <dbl> <dbl>
1 TOTAL 1818 38854.
2 Korea Rep.of 377 7680.
3 Singapore 231 11359.
4 China, PR 216 3002.
5 Japan 204 4326.
6 Hong Kong SAR (China) 132 2365.
7 United States 83 783.
8 Taiwan 66 1464.
9 United Kingdom 50 331.
10 F.R Germany 37 131.
11 Thailand 36 370.
Can anybody help explain what's wrong with me?
dplyr (and more generally tidyverse packages) accept only unquoted variable names. If your variable name has a space in it, you must wrap it in backticks:
library(dplyr)
test <- data.frame(`My variable` = c(3, 1, 2), var2 = c(1, 1, 1), check.names = FALSE)
test
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Your code (doesn't work)
test %>%
arrange("My variable")
#> My variable var2
#> 1 3 1
#> 2 1 1
#> 3 2 1
# Solution
test %>%
arrange(`My variable`)
#> My variable var2
#> 1 1 1
#> 2 2 1
#> 3 3 1
Created on 2023-01-05 with reprex v2.0.2
I am trying to calculate the number of days over which all objects in a group overlap with each member of the group. To do this I want to compare each row of one column in a group, to each other row in that column in the same group. However, I am unable to come up with a simple solution for this; most of my effort has been with the map variants from purrr. Aside from that I have gone down some nested loop (:-/), nested apply rabbit holes; but I suspect there is a very simple way to accomplish this comparison.
Essentially I want the sum of the intersect of each interval in a group to one row of the group.
Input data: (format with intervals)
ID Group year interval_obs
1 A 2020 2020-04-29 UTC--2020-05-19 UTC
2 A 2020 2020-05-04 UTC--2020-05-29 UTC
3 A 2020 2020-05-09 UTC--2020-05-24 UTC
4 A 2020 2020-04-24 UTC--2020-04-28 UTC
5 A 2020 2020-05-30 UTC--2020-06-03 UTC
6 B 2020 2019-12-31 UTC--2020-01-20 UTC
7 B 2020 2020-01-10 UTC--2020-01-30 UTC
8 B 2020 2020-01-20 UTC--2020-02-09 UTC
9 B 2020 2020-01-15 UTC--2020-02-04 UTC
Input data (more human readable?) - where each start/end is the Day of Year (doy)
ID Group Year start end
1 A 2020 120 140
2 A 2020 125 150
3 A 2020 130 145
4 A 2020 115 119
5 A 2020 151 155
6 B 2020 0 20
7 B 2020 10 30
8 B 2020 20 40
9 B 2020 15 35
Desired Results:
ID total_overlap
1 25
2 30
3 25
4 0
5 0
6 15
7 35
8 25
9 35
note the desired total overlap is in days, the sum of all days the 4 other observations in group A overlap. Group B with 4 records to indicate variable lengths.
example data for problem
data <- structure(list(
ID = 1:9,
group = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
year = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L),
start = c(120L, 125L, 130L, 115L, 151L, 0L, 10L, 20L, 15L),
end = c(140L, 150L, 145L, 119L, 155L, 20L, 30L, 40L, 35L)),
class = "data.frame",
row.names = c(NA, -9L))
data <- data %>%
group_by(group, year) %>% # real dataset has several combos - both vars left as reminder
mutate(across(c(start, end), ~ as_date(., origin = paste0(year-1, "-12-31")))) %>% #this year-1 term is due to leap years etc.
mutate(interval_obs = interval(ymd(start), ymd(end))) %>%
dplyr::select(-start, -end)
output <- data %>% map(.x = .$interval_obs, # this code at least runs
.f = ~{results = sum(as.numeric(intersect(.x, .y$interval_obs)))})
The little chunk above is one of many types of way's I have approached this (map2, map_df etc.), and while it does not work I imagine (...) a solution is in that ballpark. Note that my example output has two features: 1) units are converted to days, 2) the 'self intersection' is subtracted out. Do not worry about those features I have ways to do both of those, I just did not include those because they may obfuscate the problem. However if it helps...
mutate(self_intersection = as.numeric(intersect(interval_obs, interval_obs2))) %>%
mutate(results = results - self_intersection) %>%
mutate(total_overlap = as.numeric(results)/86400))
I have been trying to keep data in lubridate or another date format so that different temporal resolutions could be easily accommodated in the future (e.g. hours, minutes)
edit 2 - example of calculating overlap for Group A
(data reproduced here)
ID Group Year start end
1 A 2020 120 140
2 A 2020 125 150
3 A 2020 130 145
4 A 2020 115 119
5 A 2020 151 155
for Group # 1, numbers after 'comparison' refer to ID.
comparison 1 - 2. End1 - Start2 = 15 days
comparison 1 - 3. End1 - Start2 = 10 days
comparison 1 - 4. NO OVERLAP = 0 days
comparison 1 - 5. NO OVERLAP = 0 days
total_overlap 25 days
Is this what you are looking for?
The total overlap in the third line is off from your desired output, but that may be a typo?
library(tidyverse)
library(lubridate)
data |>
group_by(group) |>
mutate(total_overlap = map_dbl(interval_obs,
\(x) x |>
intersect(interval_obs) |>
int_length() |>
sum(na.rm = T) - int_length(x)
) / 86400
)
#> # A tibble: 9 × 5
#> # Groups: group [2]
#> ID group year interval_obs total_overlap
#> <int> <chr> <int> <Interval> <dbl>
#> 1 1 A 2020 2020-04-29 UTC--2020-05-19 UTC 25
#> 2 2 A 2020 2020-05-04 UTC--2020-05-29 UTC 30
#> 3 3 A 2020 2020-05-09 UTC--2020-05-24 UTC 25
#> 4 4 A 2020 2020-04-24 UTC--2020-04-28 UTC 0
#> 5 5 A 2020 2020-05-30 UTC--2020-06-03 UTC 0
#> 6 6 B 2020 2019-12-31 UTC--2020-01-20 UTC 15
#> 7 7 B 2020 2020-01-10 UTC--2020-01-30 UTC 35
#> 8 8 B 2020 2020-01-20 UTC--2020-02-09 UTC 25
#> 9 9 B 2020 2020-01-15 UTC--2020-02-04 UTC 35
This question already has answers here:
Group by multiple columns and sum other multiple columns
(7 answers)
How to sum a variable by group
(18 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed last year.
I am very new to coding and just started doing some R graphics and now I am kinda lost with my data analyse and need some light! I am training some analyses and I got a very long dataset with 19 Countries x 12 months x 22 Products and for every month a Profit. Kinda like this:
Country Month Product Profit
Brazil Jan A 50
Brazil fev A 80
Brazil mar A 15
Austria Jan A 35
Austria fev A 80
Austria mar A 47
France Jan A 21
France fev A 66
France mar A 15
[...]
France Dez C 40 etc...
I am was thinking to do one graph showing the profits through the year and another for every country, so I could see the top and bottom 2 countries. I wanted to have something like:
All Countries Jan 106 or Brazil 2021 145
All Countries Fev 146 Austria 2021 162
All Countries Mar 77 France 2021 112
but the sum function can't help with characters type and as I have a long List, idk how to sum only part of the column.
sorry if it got confusing.
The package dplyr has quite a natural syntax for this:
require(dplyr)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(
Country = rep(c(rep("Brazil", 3L), rep("Austria", 3L), rep("France", 3L)), 3L),
Profit = rep(c(50, 80, 15, 35, 80, 47, 21, 66, 15), 3L),
Month = rep(c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"), 3L),
Year = sort(rep(c(2019, 2020, 2021), 9L))
)
df %>%
group_by(Country, Month) %>%
summarize(sum = sum(Profit))
#> `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
#> # A tibble: 9 × 3
#> # Groups: Country [3]
#> Country Month sum
#> <chr> <chr> <dbl>
#> 1 Austria Apr 105
#> 2 Austria Jun 141
#> 3 Austria May 240
#> 4 Brazil Feb 240
#> 5 Brazil Jan 150
#> 6 Brazil Mar 45
#> 7 France Aug 198
#> 8 France Jul 63
#> 9 France Sep 45
Using base R, you can try something along these lines.
# sum of profit per month
out1 <- tapply(df$Profit, df$Month, sum)
# sum of profit per year per country
out2 <- data.frame(
profit = sapply(split(df, f = ~ df$Country + df$Year), function(x) sum(x$Profit))
)
out2$Country <- gsub('\\.[0-9]*', '', rownames(out2))
out2$Year <- gsub('[a-zA-z]*\\.', '', rownames(out2))
rownames(out2) <- NULL
Output
> out1
Apr Aug Feb Jan Jul Jun Mar May Sep
105 198 240 150 63 141 45 240 45
> head(out2)
profit Country Year
1 162 Austria 2019
2 145 Brazil 2019
3 102 France 2019
4 162 Austria 2020
5 145 Brazil 2020
6 102 France 2020
Data
# sample data
df <- data.frame(
Country = rep(c(rep('Brazil',3L),rep('Austria',3L),rep('France',3L)), 3L),
Profit = rep(c(50,80,15,35,80,47,21,66,15), 3L),
Month = rep(c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep'),3L),
Year = sort(rep(c(2019,2020,2021), 9L))
)
> str(pc)
'data.frame': 562 obs. of 9 variables:
$ id : int 1 2 3 4 5 10 12 17 19 22 ...
$ gender : chr "M" "F" "M" "M" ...
$ birth_year: int 1973 1974 1937 1943 1958 1958 1940 1973 1971 1950 ...
$ type : chr "spontaneous SAH" "traumatic SAH" "spontaneous SAH" "traumatic SAH" ...
$ admit_year: int 2011 2011 2016 2012 2018 2017 2010 2018 2016 2018 ...
$ admit_date: chr "2011-06-22" "2011-12-19" "2016-12-06" "2012-10-28" ...
$ admitage : int 38 37 79 69 60 59 70 45 45 68 ...
$ death_date: chr NA NA NA "2012-10-28" ...
$ death_year: int NA NA NA 2012 NA NA NA NA 2016 NA ...
Hello. I have a list that looks like this. The column "id" indicates patients IDs. But there are rows with the same ID because some patients got admitted to the hospital several times. How do I delete the duplicates and leave one row per ID?
I tried this
c <- unique(pc$id)
to extract the "id" numbers, but I don't know what to do next.
I'm a beginner, so I would appreciate it if you could explain it to me with easy codes!
EDIT: I want to make one list containing the ones with the initial admitted dates of the patients, and another list containing the ones with the final admitted dates?
How can I do that? This list is in ID order, but if one patient got admitted multiple times, the date is not necessarily in chronological order. I'd like to know how I can achieve that just by using !duplicated.
Something like this should work : pc[!duplicated(pc$id),]. It will by default keep the first occurence.
library(tidyverse)
data <- tibble::tribble(
~id, ~gender, ~birth_year, ~admit_year,
1, "M", 1973, 2014,
2, "F", 1974, 2016,
3, "M", 1958, 2013,
2, "F", 1974, 2017,
1, "M", 1973, 2011,
1, "M", 1973, 2020,
1, "M", 1973, 2018,
2, "F", 1974, 2009,
)
data
# A tibble: 8 x 4
id gender birth_year admit_year
<dbl> <chr> <dbl> <dbl>
1 1 M 1973 2014
2 2 F 1974 2016
3 3 M 1958 2013
4 2 F 1974 2017
5 1 M 1973 2011
6 1 M 1973 2020
7 1 M 1973 2018
8 2 F 1974 2009
to keep the first and last row (first admit year and last admit year) by id
df <- data %>%
# I will keep the patient with the last admit year
arrange(admit_year) %>%
# I group by id
group_by(id) %>%
# to keep the first and last row (first admit year and last admit year) by id
slice(unique(c(1, n())))
df
# A tibble: 5 x 4
# Groups: id [3]
id gender birth_year admit_year
<dbl> <chr> <dbl> <dbl>
1 1 M 1973 2011
2 1 M 1973 2020
3 2 F 1974 2009
4 2 F 1974 2017
5 3 M 1958 2013
to keep the last row (last admit year) by id
df2 <- data %>%
# I will keep the patient with the last admit year
arrange(admit_year) %>%
# I group by id
group_by(id) %>%
# to keep the last row (last admit year) by id
slice(n())
df2
# A tibble: 3 x 4
# Groups: id [3]
id gender birth_year admit_year
<dbl> <chr> <dbl> <dbl>
1 1 M 1973 2020
2 2 F 1974 2017
3 3 M 1958 2013
to keep the first row (first admit year) by id
df3 <- data %>%
# I will keep the patient with the last admit year
arrange(admit_year) %>%
# I group by id
group_by(id) %>%
# to keep the first row (first admit year) by id
slice(1)
df3
# A tibble: 3 x 4
# Groups: id [3]
id gender birth_year admit_year
<dbl> <chr> <dbl> <dbl>
1 1 M 1973 2011
2 2 F 1974 2009
3 3 M 1958 2013