I am very new to coding and have just started doing some R graphics, and now I am a bit lost with my data analysis and need some help! I am practising some analyses on a very long dataset with 19 countries x 12 months x 22 products, with a Profit value for every month. It looks something like this:
Country Month Product Profit
Brazil Jan A 50
Brazil fev A 80
Brazil mar A 15
Austria Jan A 35
Austria fev A 80
Austria mar A 47
France Jan A 21
France fev A 66
France mar A 15
[...]
France Dez C 40 etc...
I was thinking of doing one graph showing the profits through the year and another for every country, so I could see the top and bottom 2 countries. I wanted to have something like:
All Countries Jan 106
All Countries Fev 146
All Countries Mar 77
or
Brazil 2021 145
Austria 2021 162
France 2021 112
but the sum function can't handle character columns, and since I have a long list I don't know how to sum only part of a column.
Sorry if this is confusing.
The package dplyr has quite a natural syntax for this:
require(dplyr)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data.frame(
Country = rep(c(rep("Brazil", 3L), rep("Austria", 3L), rep("France", 3L)), 3L),
Profit = rep(c(50, 80, 15, 35, 80, 47, 21, 66, 15), 3L),
Month = rep(c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep"), 3L),
Year = sort(rep(c(2019, 2020, 2021), 9L))
)
df %>%
group_by(Country, Month) %>%
summarize(sum = sum(Profit))
#> `summarise()` has grouped output by 'Country'. You can override using the `.groups` argument.
#> # A tibble: 9 × 3
#> # Groups: Country [3]
#> Country Month sum
#> <chr> <chr> <dbl>
#> 1 Austria Apr 105
#> 2 Austria Jun 141
#> 3 Austria May 240
#> 4 Brazil Feb 240
#> 5 Brazil Jan 150
#> 6 Brazil Mar 45
#> 7 France Aug 198
#> 8 France Jul 63
#> 9 France Sep 45
Using base R, you can try something along these lines.
# sum of profit per month
out1 <- tapply(df$Profit, df$Month, sum)
# sum of profit per year per country
out2 <- data.frame(
profit = sapply(split(df, f = ~ df$Country + df$Year), function(x) sum(x$Profit))
)
out2$Country <- gsub('\\.[0-9]*', '', rownames(out2))
out2$Year <- gsub('[a-zA-Z]*\\.', '', rownames(out2))
rownames(out2) <- NULL
Output
> out1
Apr Aug Feb Jan Jul Jun Mar May Sep
105 198 240 150 63 141 45 240 45
> head(out2)
profit Country Year
1 162 Austria 2019
2 145 Brazil 2019
3 102 France 2019
4 162 Austria 2020
5 145 Brazil 2020
6 102 France 2020
Data
# sample data
df <- data.frame(
Country = rep(c(rep('Brazil',3L),rep('Austria',3L),rep('France',3L)), 3L),
Profit = rep(c(50,80,15,35,80,47,21,66,15), 3L),
Month = rep(c('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep'),3L),
Year = sort(rep(c(2019,2020,2021), 9L))
)
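For the "top and bottom 2 countries" part of the question, a minimal base R sketch could look like the following. It assumes a data frame shaped like the sample `df` above; the names `totals`, `top2`, and `bottom2` are made up for illustration.

```r
# Sample data in the same shape as the answer's `df` (an assumed layout).
df <- data.frame(
  Country = rep(c(rep("Brazil", 3L), rep("Austria", 3L), rep("France", 3L)), 3L),
  Profit  = rep(c(50, 80, 15, 35, 80, 47, 21, 66, 15), 3L),
  Year    = sort(rep(c(2019, 2020, 2021), 9L))
)

# Total profit per country as a named numeric vector,
# sorted from highest to lowest.
totals <- sort(sapply(split(df$Profit, df$Country), sum), decreasing = TRUE)

top2    <- head(totals, 2L)  # two most profitable countries
bottom2 <- tail(totals, 2L)  # two least profitable countries
```

Because `totals` is a plain named vector, something like `barplot(totals)` should be enough for a first look at the ranking.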
I have a data set with a list of event dates and a list of sample dates. Events and samples are grouped by unit. For each sample date, I want to count the number of events that came before that sample date and the number of different months in which those events occurred, grouped by unit. A couple of complications: sometimes the event date falls after the sample date in the same year, and sometimes there are sample dates but no event in a particular year.
Example data (my actual dataset has ~6000 observations):
data<-read.table(header=T, text="
unit eventdate eventmonth sampledate year
a 1996-06-01 06 1996-08-01 1996
a 1997-09-03 09 1997-08-02 1997
a 1998-05-15 05 1998-08-03 1998
a NA NA 1999-08-02 1999
b 1996-05-31 05 1996-08-01 1996
b 1997-05-31 05 1997-08-02 1997
b 1998-05-15 05 1998-08-03 1998
b 1999-05-16 05 1999-08-02 1999")
Output data should look something like this:
year unit numevent nummonth
1996 a 1 1
1997 a 1 1
1998 a 3 3
1999 a 3 3
1996 b 1 1
1997 b 2 1
1998 b 3 1
1999 b 4 1
Note that in 1997 in unit a, the event is not counted because it happened after the sample date.
For smaller datasets, I have manually subset the data by each sample date and counted events/unique months (and then merged the datasets back together), but I can't do that with ~6000 observations.
numevent.1996<-ddply(data[data$eventdate<'1996-08-01',], .(unit),
summarize, numevent=length(eventdate), nummth=length(unique(eventmonth)), year=1996)
This might work:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data<-read.table(header=T, text="
unit eventdate eventmonth sampledate year
a 1996-06-01 06 1996-08-01 1996
a 1997-09-03 09 1997-08-02 1997
a 1998-05-15 05 1998-08-03 1998
a NA NA 1999-08-02 1999
b 1996-05-31 05 1996-08-01 1996
b 1997-05-31 05 1997-08-02 1997
b 1998-05-15 05 1998-08-03 1998
b 1999-05-16 05 1999-08-02 1999")
data <- data %>%
mutate(eventdate = lubridate::ymd(eventdate),
sampledate = lubridate::ymd(sampledate))
data %>%
group_by(unit, year, eventmonth) %>%
summarise(numevent = sum(sampledate >= eventdate)) %>%
group_by(unit, year) %>%
summarise(nummonth = sum(numevent > 0),
numevent = sum(numevent))
#> `summarise()` has grouped output by 'unit', 'year'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'unit'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 × 4
#> # Groups: unit [2]
#> unit year nummonth numevent
#> <chr> <int> <int> <int>
#> 1 a 1996 1 1
#> 2 a 1997 0 0
#> 3 a 1998 1 1
#> 4 a 1999 NA NA
#> 5 b 1996 1 1
#> 6 b 1997 1 1
#> 7 b 1998 1 1
#> 8 b 1999 1 1
Created on 2023-01-08 by the reprex package (v2.0.1)
Note: I don't think the data you've included actually produces the output you proposed. That output appears to contain 18 events that meet the condition, but the sample data provided has only 8 rows.
Try this?
data %>%
group_by(unit) %>%
mutate(
numevent = sapply(sampledate, function(z) sum(eventdate < z, na.rm = TRUE)),
nummonth = sapply(sampledate, function(z) length(unique(na.omit(eventmonth[eventdate < z]))))
) %>%
ungroup()
# # A tibble: 8 × 7
# unit eventdate eventmonth sampledate year numevent nummonth
# <chr> <date> <int> <date> <int> <int> <int>
# 1 a 1996-06-01 6 1996-08-01 1996 1 1
# 2 a 1997-09-03 9 1997-08-02 1997 1 1
# 3 a 1998-05-15 5 1998-08-03 1998 3 3
# 4 a NA NA 1999-08-02 1999 3 3
# 5 b 1996-05-31 5 1996-08-01 1996 1 1
# 6 b 1997-05-31 5 1997-08-02 1997 2 1
# 7 b 1998-05-15 5 1998-08-03 1998 3 1
# 8 b 1999-05-16 5 1999-08-02 1999 4 1
Data
data <- structure(list(unit = c("a", "a", "a", "a", "b", "b", "b", "b"), eventdate = structure(c(9648, 10107, 10361, NA, 9647, 10012, 10361, 10727), class = "Date"), eventmonth = c(6L, 9L, 5L, NA, 5L, 5L, 5L, 5L), sampledate = structure(c(9709, 10075, 10441, 10805, 9709, 10075, 10441, 10805), class = "Date"), year = c(1996L, 1997L, 1998L, 1999L, 1996L, 1997L, 1998L, 1999L)), class = "data.frame", row.names = c(NA, -8L))
I tried to rename columns, which should be a very straightforward operation, but I am still getting errors. I tried two methods and neither of them works. Can anyone explain what needs to be done to rename columns without getting these strange errors? I tried several SO posts but none of them really worked.
library(pacman)
#> Warning: package 'pacman' was built under R version 4.2.1
p_load(dplyr, readr)
data = read_csv("https://raw.githubusercontent.com/srk7774/data/master/august_october_2020.csv",
col_names = TRUE)
#> Rows: 16 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): X.1
#> dbl (2): Total Agree - August 2020, Total Agree - October 2020
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
column_recodes <- c(X.1 = "country",
august = "Total Agree - August 2020",
october = "`Total Agree - October 2020",
`Another non-existent column name` = "bar")
data %>% rename_with(~recode(., !!!column_recodes))
#> # A tibble: 16 × 3
#> country `Total Agree - August 2020` `Total Agree - October 2020`
#> <chr> <dbl> <dbl>
#> 1 Total 77 73
#> 2 India 87 87
#> 3 China 97 85
#> 4 South Korea 84 83
#> 5 Brazil 88 81
#> 6 Australia 88 79
#> 7 United Kingdom 85 79
#> 8 Mexico 75 78
#> 9 Canada 76 76
#> 10 Germany 67 69
#> 11 Japan 75 69
#> 12 South Africa 64 68
#> 13 Italy 67 65
#> 14 Spain 72 64
#> 15 United States 67 64
#> 16 France 59 54
data %>%
rename(country = X.1,
august = Total.Agree...August.2020,
october = Total.Agree...October.2020)
#> Error in `chr_as_locations()`:
#> ! Can't rename columns that don't exist.
#> ✖ Column `Total.Agree...August.2020` doesn't exist.
Created on 2022-10-24 by the reprex package (v2.0.1)
Use backticks around names that contain spaces:
data %>%
  rename(country = X.1,
         august = `Total Agree - August 2020`,
         october = `Total Agree - October 2020`)
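If you would rather keep the old and new names in a lookup vector (as the question attempts with `column_recodes`), here is a rough base R sketch. The small `data` frame is a hypothetical stand-in for the CSV, and `lookup` and `idx` are names invented for this example. Note that the lookup is written as new name = old name.

```r
# Hypothetical stand-in for the CSV, using the column names reported by read_csv().
data <- data.frame(
  X.1 = c("Total", "India"),
  `Total Agree - August 2020` = c(77, 87),
  `Total Agree - October 2020` = c(73, 87),
  check.names = FALSE  # keep the spaces and dashes in the column names
)

# Lookup written as new name = old name.
lookup <- c(
  country = "X.1",
  august  = "Total Agree - August 2020",
  october = "Total Agree - October 2020",
  bar     = "Another non-existent column name"
)

# Position of each old name in the data; NA when the column is absent.
idx <- match(lookup, names(data))

# Rename only the columns that were actually found.
names(data)[idx[!is.na(idx)]] <- names(lookup)[!is.na(idx)]
```

Entries whose old name is missing from the data are skipped silently, which may or may not be what you want.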
> str(pc)
'data.frame': 562 obs. of 9 variables:
$ id : int 1 2 3 4 5 10 12 17 19 22 ...
$ gender : chr "M" "F" "M" "M" ...
$ birth_year: int 1973 1974 1937 1943 1958 1958 1940 1973 1971 1950 ...
$ type : chr "spontaneous SAH" "traumatic SAH" "spontaneous SAH" "traumatic SAH" ...
$ admit_year: int 2011 2011 2016 2012 2018 2017 2010 2018 2016 2018 ...
$ admit_date: chr "2011-06-22" "2011-12-19" "2016-12-06" "2012-10-28" ...
$ admitage : int 38 37 79 69 60 59 70 45 45 68 ...
$ death_date: chr NA NA NA "2012-10-28" ...
$ death_year: int NA NA NA 2012 NA NA NA NA 2016 NA ...
Hello. I have a list that looks like this. The column "id" holds patient IDs, but some IDs appear in several rows because some patients were admitted to the hospital several times. How do I delete the duplicates and keep one row per ID?
I tried this
c <- unique(pc$id)
to extract the "id" numbers, but I don't know what to do next.
I'm a beginner, so I would appreciate it if you could explain it to me with simple code!
EDIT: I want to make one list containing each patient's initial admission date, and another list containing each patient's final admission date.
How can I do that? The list is in ID order, but if a patient was admitted multiple times, the dates are not necessarily in chronological order. I'd like to know how I can achieve that just by using !duplicated.
Something like this should work: pc[!duplicated(pc$id), ]. By default it keeps the first occurrence of each id.
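For the EDIT (first and last admission using only `!duplicated`), one possible sketch is to sort first and then use the `fromLast` argument of `duplicated()`. The tiny `pc` below is a made-up miniature of the real data, and it assumes `admit_date` is stored as "YYYY-MM-DD" text, which sorts chronologically.

```r
# Made-up miniature of `pc`; admit_date is ISO-formatted text.
pc <- data.frame(
  id = c(1, 2, 1, 2, 3),
  admit_date = c("2016-12-06", "2011-12-19", "2011-06-22", "2012-10-28", "2018-01-05"),
  stringsAsFactors = FALSE
)

# Sort by patient, then by admission date within each patient.
pc_sorted <- pc[order(pc$id, pc$admit_date), ]

# First row per id = earliest admission.
first_admit <- pc_sorted[!duplicated(pc_sorted$id), ]

# Scanning from the end keeps the last row per id = latest admission.
last_admit <- pc_sorted[!duplicated(pc_sorted$id, fromLast = TRUE), ]
```

A patient with a single admission (id 3 here) ends up in both results with the same row.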
library(tidyverse)
data <- tibble::tribble(
~id, ~gender, ~birth_year, ~admit_year,
1, "M", 1973, 2014,
2, "F", 1974, 2016,
3, "M", 1958, 2013,
2, "F", 1974, 2017,
1, "M", 1973, 2011,
1, "M", 1973, 2020,
1, "M", 1973, 2018,
2, "F", 1974, 2009,
)
data
# A tibble: 8 x 4
id gender birth_year admit_year
<dbl> <chr> <dbl> <dbl>
1 1 M 1973 2014
2 2 F 1974 2016
3 3 M 1958 2013
4 2 F 1974 2017
5 1 M 1973 2011
6 1 M 1973 2020
7 1 M 1973 2018
8 2 F 1974 2009
To keep the first and last row (first and last admit year) by id:
df <- data %>%
  # sort by admit year so the rows within each id are in chronological order
  arrange(admit_year) %>%
  # group by id
  group_by(id) %>%
  # keep the first and last row per id; unique() avoids repeating a single-row group
  slice(unique(c(1, n())))
df
# A tibble: 5 x 4
# Groups: id [3]
id gender birth_year admit_year
<dbl> <chr> <dbl> <dbl>
1 1 M 1973 2011
2 1 M 1973 2020
3 2 F 1974 2009
4 2 F 1974 2017
5 3 M 1958 2013
To keep the last row (last admit year) by id:
df2 <- data %>%
  # sort by admit year so the latest admission comes last within each id
  arrange(admit_year) %>%
  # group by id
  group_by(id) %>%
  # keep the last row (last admit year) per id
  slice(n())
df2
# A tibble: 3 x 4
# Groups: id [3]
id gender birth_year admit_year
<dbl> <chr> <dbl> <dbl>
1 1 M 1973 2020
2 2 F 1974 2017
3 3 M 1958 2013
To keep the first row (first admit year) by id:
df3 <- data %>%
  # sort by admit year so the earliest admission comes first within each id
  arrange(admit_year) %>%
  # group by id
  group_by(id) %>%
  # keep the first row (first admit year) per id
  slice(1)
df3
# A tibble: 3 x 4
# Groups: id [3]
id gender birth_year admit_year
<dbl> <chr> <dbl> <dbl>
1 1 M 1973 2011
2 2 F 1974 2009
3 3 M 1958 2013
I am trying to find a solution but haven't, yet.
I have a dataframe structured as follows:
country City 2014 2015 2016 2017 2018 2019
France Paris 23 34 54 12 23 21
US NYC 1 2 2 12 95 54
I want to find the moving average for every 3 years (i.e. 2014-16, 2015-17, etc) to be placed in ad-hoc columns.
country City 2014 2015 2016 2017 2018 2019 2014-2016 2015-2017 2016-2018 2017-2019
France Paris 23 34 54 12 23 21 37 33.3 29.7 18.7
US NYC 1 2 2 12 95 54 etc etc etc etc
Any hint?
1) Using the data shown reproducibly in the Note at the end we apply rollmean to each column in the transpose of the data and then transpose back. We rollapply the appropriate paste command to create the names.
library(zoo)
DF2 <- DF[-(1:2)]
cbind(DF, setNames(as.data.frame(t(rollmean(t(DF2), 3))),
rollapply(names(DF2), 3, function(x) paste(range(x), collapse = "-"))))
giving:
country City 2014 2015 2016 2017 2018 2019 2014-2016 2015-2017 2016-2018 2017-2019
1 France Paris 23 34 54 12 23 21 37.000000 33.333333 29.66667 18.66667
2 US NYC 1 2 2 12 95 54 1.666667 5.333333 36.33333 53.66667
2) This could also be expressed using dplyr/tidyr/zoo like this:
library(dplyr)
library(tidyr)
library(zoo)
DF %>%
pivot_longer(-c(country, City)) %>%
group_by(country, City) %>%
mutate(value = rollmean(value, 3, fill = NA),
name = rollapply(name, 3, function(x) paste(range(x), collapse="-"), fill=NA)) %>%
ungroup %>%
drop_na %>%
pivot_wider %>%
left_join(DF, ., by = c("country", "City"))
Note
Lines <- "country City 2014 2015 2016 2017 2018 2019
France Paris 23 34 54 12 23 21
US NYC 1 2 2 12 95 54 "
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, check.names = FALSE)
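If zoo is not available, the same 3-year windows can probably be computed in base R with `rowMeans` over sliding groups of columns. This is only a sketch: `DF` is rebuilt here by hand to match the Note, and `years`, `k`, `starts`, `roll`, and `result` are names chosen for the example.

```r
# Same data as in the Note, built directly so the example stands alone.
DF <- data.frame(
  country = c("France", "US"), City = c("Paris", "NYC"),
  `2014` = c(23, 1), `2015` = c(34, 2), `2016` = c(54, 2),
  `2017` = c(12, 12), `2018` = c(23, 95), `2019` = c(21, 54),
  check.names = FALSE  # keep the year columns named "2014", "2015", ...
)

years  <- names(DF)[-(1:2)]                 # the year columns
k      <- 3L                                # window width
starts <- seq_len(length(years) - k + 1L)   # first column index of each window

# One column of row means per window; cbind() keeps a matrix shape
# even when DF has a single row.
roll <- do.call(cbind, lapply(starts, function(i) {
  rowMeans(DF[years[i:(i + k - 1L)]])
}))
colnames(roll) <- paste(years[starts], years[starts + k - 1L], sep = "-")

result <- cbind(DF, roll)
```

Changing `k` should give other window widths, with the column labels following along.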
I have the dataframe below.
year<-c(2016,2016,2017,2017,2016,2016,2017,2017)
city<-c("NY","NY","NY","NY","WS","WS","WS","WS")
spec<-c("df","df","df","df","vb","vb","vb","vb")
num<-c(45,67,89,90,45,67,89,90)
df<-data.frame(year,city,spec,num)
I would like to know if it is possible to sum the num based on year,city and spec columns in order to bring it from this form:
year city spec num
1 2016 NY df 45
2 2016 NY df 67
3 2017 NY df 89
4 2017 NY df 90
5 2016 WS vb 45
6 2016 WS vb 67
7 2017 WS vb 89
8 2017 WS vb 90
to this:
year city spec num
1 2016 NY df 112
2 2017 NY df 179
3 2016 WS vb 112
4 2017 WS vb 179
Possible duplicate, but here is an answer:
library(tidyverse)
df %>%
group_by(year,city,spec) %>%
summarise(sum = sum(num))
...results in ...
# A tibble: 4 x 4
# Groups: year, city [4]
year city spec sum
<dbl> <fct> <fct> <dbl>
1 2016 NY df 112
2 2016 WS vb 112
3 2017 NY df 179
4 2017 WS vb 179
One way is to use the sqldf package:
library(sqldf)
sqldf("select year, city, spec, sum(num) from df
       group by year, city, spec order by city")
year city spec sum(num)
1 2016 NY df 112
2 2017 NY df 179
3 2016 WS vb 112
4 2017 WS vb 179
Using dplyr
df %>%
group_by(year, city, spec) %>%
summarise(SumNum = sum(num)) %>%
arrange(city)
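For completeness, base R can do the same grouping without any extra packages via `aggregate()` and its formula interface. A minimal sketch, rebuilding the question's `df` so it runs on its own (`out` is just a name picked for the result):

```r
# The question's data, rebuilt so the example is self-contained.
year <- c(2016, 2016, 2017, 2017, 2016, 2016, 2017, 2017)
city <- c("NY", "NY", "NY", "NY", "WS", "WS", "WS", "WS")
spec <- c("df", "df", "df", "df", "vb", "vb", "vb", "vb")
num  <- c(45, 67, 89, 90, 45, 67, 89, 90)
df   <- data.frame(year, city, spec, num)

# Sum `num` within each combination of year, city and spec.
out <- aggregate(num ~ year + city + spec, data = df, FUN = sum)
```

The summed column keeps the name `num`, which matches the layout asked for in the question.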