The question is: Count the number of storms per year since 1975 and write their names
What I have so far is the code below:
storms %>%
count(name, year >= 1975, sort = TRUE)
I got this output:
name `year >= 1975` n
<chr> <lgl> <int>
1 Emily TRUE 217
2 Bonnie TRUE 209
3 Alberto TRUE 184
4 Claudette TRUE 180
5 Felix TRUE 178
6 Danielle TRUE 165
7 Josephine TRUE 165
8 Edouard TRUE 159
9 Gordon TRUE 158
10 Gabrielle TRUE 157
I think this is the correct output, but I just want to make sure.
Technically, the year >= 1975 condition belongs in filter(), not in count(). However, the data start in 1975, so your output is correct as well. You get the same output even if you remove the filter() step from the code below.
library(dplyr)
storms %>% filter(year >= 1975) %>% count(name, sort = TRUE)
# A tibble: 214 × 2
# name n
# <chr> <int>
# 1 Emily 217
# 2 Bonnie 209
# 3 Alberto 184
# 4 Claudette 180
# 5 Felix 178
# 6 Danielle 165
# 7 Josephine 165
# 8 Edouard 159
# 9 Gordon 158
#10 Gabrielle 157
# … with 204 more rows
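To make the difference between the two forms visible, here is a minimal sketch on a hypothetical toy data frame (toy and its values are made up for illustration, not taken from storms) that includes one row before 1975:

```r
library(dplyr)

# Hypothetical toy data, NOT the real storms dataset
toy <- data.frame(
  name = c("Amy", "Amy", "Amy", "Bob", "Bob"),
  year = c(1974, 1975, 1975, 1975, 1976)
)

# count() treats `year >= 1975` as an extra grouping column,
# so the pre-1975 row is kept as its own (name, FALSE) group
by_flag <- toy %>% count(name, year >= 1975, sort = TRUE)

# filter() drops the pre-1975 row before counting
by_filter <- toy %>% filter(year >= 1975) %>% count(name, sort = TRUE)

nrow(by_flag)   # 3 groups: (Amy, TRUE), (Bob, TRUE), (Amy, FALSE)
nrow(by_filter) # 2 groups: Amy, Bob
```

With storms, every row satisfies year >= 1975, so both forms give the same counts and only the extra logical column differs.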
I am looking to make the table below. I have the first 2 columns of data for 2 years and want to calculate the daily mean price and apply it to all relevant days.
Publish_Date  Product_Price  Daily_Mean
2019-07-01    146            142
2019-07-01    144            142
2019-07-01    136            142
2019-07-02    120            123
2019-07-02    126            123
2019-07-02    123            123
2019-07-03    112            112
I have tried the following:
TGPDailyMean = aggregate(Product_Price ~ Publish_Date, TGP, mean)
but it only gives one value per day, reducing the number of rows by a factor of 3 or so. I need the number of rows to stay the same so I can take the difference between another data frame and Daily_Mean.
I have also tried:
TGP$DailyMean = lapply(TGP$Product_Price, mean)
but this only replicates the values in Product_Price and does not find the mean per day.
Tidyverse Solution
You can use group_by and mutate:
library(dplyr)
TGP %>%
group_by(Publish_Date) %>%
mutate(Daily_Mean = mean(Product_Price)) %>%
ungroup()
#> # A tibble: 7 x 3
#> Publish_Date Product_Price Daily_Mean
#> <chr> <int> <dbl>
#> 1 2019-07-01 146 142
#> 2 2019-07-01 144 142
#> 3 2019-07-01 136 142
#> 4 2019-07-02 120 123
#> 5 2019-07-02 126 123
#> 6 2019-07-02 123 123
#> 7 2019-07-03 112 112
Base R solution
As suggested by @nicola in the comments, you can also use ave:
TGP$Daily_Mean <- ave(TGP$Product_Price, TGP$Publish_Date)
TGP
#> Publish_Date Product_Price Daily_Mean
#> 1 2019-07-01 146 142
#> 2 2019-07-01 144 142
#> 3 2019-07-01 136 142
#> 4 2019-07-02 120 123
#> 5 2019-07-02 126 123
#> 6 2019-07-02 123 123
#> 7 2019-07-03 112 112
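The ave call above relies on FUN defaulting to mean. As a self-contained sketch, the data frame below is rebuilt from the table in the question (using the question's TGP name, with the dates typed in as character strings, which is an assumption) and the function is spelled out explicitly:

```r
# Rebuild the example data from the question's table
TGP <- data.frame(
  Publish_Date = rep(c("2019-07-01", "2019-07-02", "2019-07-03"),
                     times = c(3, 3, 1)),
  Product_Price = c(146, 144, 136, 120, 126, 123, 112)
)

# ave() returns one value per input row, so the row count is unchanged;
# FUN = mean is the default and is written out here only for clarity
TGP$Daily_Mean <- ave(TGP$Product_Price, TGP$Publish_Date, FUN = mean)

nrow(TGP) # still 7 rows
```

Because the result has the same length as Product_Price, it can be subtracted row by row from a column of another data frame with the same row order.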
data.table oneliner
library(data.table)
# or setDT(mydata)
setDT(mydata)[, Daily_Mean2 := mean(Product_Price), by = .(Publish_Date)]
# Publish_Date Product_Price Daily_Mean Daily_Mean2
# 1: 2019-07-01 146 142 142
# 2: 2019-07-01 144 142 142
# 3: 2019-07-01 136 142 142
# 4: 2019-07-02 120 123 123
# 5: 2019-07-02 126 123 123
# 6: 2019-07-02 123 123 123
# 7: 2019-07-03 112 112 112
Typically I use dplyr::distinct() to remove duplicated rows from the data. This function selects one copy of the duplicated rows and keeps it.
However, sometimes I wish to remove all copies if I suspect the row is not valid.
Example
Let's say that I survey people and ask them about height, weight, and country they're from.
library(dplyr)
library(tibble)
set.seed(2021)
df_1 <- data.frame(id = 1:10,
height = sample(c(150:210), size = 10),
weight = sample(c(80: 200), size = 10))
df_2 <- df_1
df_final <- rbind(df_1, df_2)
df_final <- dplyr::arrange(df_final, id)
df_final <-
df_final %>%
add_column("country" = c("uk", "uk",
"france", "usa",
"germany", "germany",
"denmark", "norway",
"india", "india",
"chine", "china",
"mozambique", "argentina",
"morroco", "morroco",
"sweden", "japan",
"italy", "italy"))
df_final
#> id height weight country
#> 1 1 156 189 uk
#> 2 1 156 189 uk
#> 3 2 187 148 france
#> 4 2 187 148 usa
#> 5 3 195 190 germany
#> 6 3 195 190 germany
#> 7 4 207 182 denmark
#> 8 4 207 182 norway
#> 9 5 188 184 india
#> 10 5 188 184 india
#> 11 6 161 102 chine
#> 12 6 161 102 china
#> 13 7 201 155 mozambique
#> 14 7 201 155 argentina
#> 15 8 155 130 morroco
#> 16 8 155 130 morroco
#> 17 9 209 139 sweden
#> 18 9 209 139 japan
#> 19 10 202 97 italy
#> 20 10 202 97 italy
Created on 2021-07-19 by the reprex package (v2.0.0)
In df_final, each id means one person. In this example data we have duplicates for all 10 people. Everyone took the survey twice. However, if we look closely we see that some people reported they're from a different country. For example, id == 2 reported both usa in one case and france in another. In my data cleaning I wish to remove those people.
My primary goal is to remove duplicates. My secondary goal is to filter out those people who answered a different country.
If I simply go with dplyr::distinct(), I am left with all 10 ids.
df_final %>%
distinct(id, .keep_all = TRUE)
#> id height weight country
#> 1 1 156 189 uk
#> 2 2 187 148 france
#> 3 3 195 190 germany
#> 4 4 207 182 denmark
#> 5 5 188 184 india
#> 6 6 161 102 chine
#> 7 7 201 155 mozambique
#> 8 8 155 130 morroco
#> 9 9 209 139 sweden
#> 10 10 202 97 italy
What should I do in order to run distinct() but only on those who have the same value for country in all duplicated copies (per id)?
Thanks
Here is one option...
df_final %>%
group_by(id) %>%
filter(length(unique(country)) == 1) %>%
distinct()
# A tibble: 5 x 4
# Groups: id [5]
id height weight country
<int> <int> <int> <chr>
1 1 177 83 uk
2 3 191 151 germany
3 5 186 175 india
4 8 164 178 morroco
5 10 201 141 italy
We may also do
library(dplyr)
df_final %>%
distinct(id, country, .keep_all = TRUE) %>%
filter(id %in% names(which(table(id) == 1)))
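As a quick check that the two options agree, here is a minimal sketch on a small hypothetical data frame (toy is made up for illustration). n_distinct() is used in place of length(unique()), which is a stylistic swap rather than part of the original answer:

```r
library(dplyr)

# Hypothetical toy data: id 2 reports two different countries
toy <- data.frame(
  id = c(1, 1, 2, 2, 3, 3),
  height = c(170, 170, 180, 180, 160, 160),
  country = c("uk", "uk", "france", "usa", "india", "india")
)

# Option 1: keep only ids with a single distinct country, then de-duplicate
opt1 <- toy %>%
  group_by(id) %>%
  filter(n_distinct(country) == 1) %>%
  ungroup() %>%
  distinct()

# Option 2: de-duplicate on (id, country), then keep ids that appear once
opt2 <- toy %>%
  distinct(id, country, .keep_all = TRUE) %>%
  filter(id %in% names(which(table(id) == 1)))

opt1$id # 1 3
opt2$id # 1 3
```

Both drop id 2 entirely and keep one row for each of the remaining ids.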
I can pivot the data into wide format when the values to be pivoted span more than one column.
us_rent_income %>%
pivot_wider(
names_from = variable,
names_glue = "{variable}_{.value}",
values_from = c(estimate, moe)
)
# A tibble: 52 x 6
GEOID NAME income_estimate rent_estimate income_moe rent_moe
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 01 Alabama 24476 747 136 3
2 02 Alaska 32940 1200 508 13
3 04 Arizona 27517 972 148 4
4 05 Arkansas 23789 709 165 5
5 06 California 29454 1358 109 3
6 08 Colorado 32401 1125 109 5
7 09 Connecticut 35326 1123 195 5
8 10 Delaware 31560 1076 247 10
9 11 District of Columbia 43198 1424 681 17
10 12 Florida 25952 1077 70 3
# ... with 42 more rows
In this output, I want the columns ordered as income_estimate, income_moe, rent_estimate and rent_moe. Setting names_sort = T doesn't help, and neither does changing the order in names_glue. I know I can reorder columns with select and other functions, but I want to know whether pivot_wider has an argument for this.
EDIT the issue seems already in development; it has been discussed here and here at least.
Since tidyr 1.2.0, this is easy with the names_vary argument:
library(tidyr)
us_rent_income %>%
pivot_wider(
names_from = variable,
names_glue = "{variable}_{.value}",
values_from = c(estimate, moe),
names_vary = 'slowest'
)
#> # A tibble: 52 x 6
#> GEOID NAME income_estimate income_moe rent_estimate rent_moe
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 01 Alabama 24476 136 747 3
#> 2 02 Alaska 32940 508 1200 13
#> 3 04 Arizona 27517 148 972 4
#> 4 05 Arkansas 23789 165 709 5
#> 5 06 California 29454 109 1358 3
#> 6 08 Colorado 32401 109 1125 5
#> 7 09 Connecticut 35326 195 1123 5
#> 8 10 Delaware 31560 247 1076 10
#> 9 11 District of Columbia 43198 681 1424 17
#> 10 12 Florida 25952 70 1077 3
#> # ... with 42 more rows
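To see the effect of names_vary in isolation, here is a small sketch on a hypothetical two-row data frame (toy is made up for illustration). It assumes tidyr 1.2.0 or later, since names_vary does not exist in earlier versions:

```r
library(tidyr)

# Hypothetical toy data in the same shape as us_rent_income
toy <- data.frame(
  id = c(1, 1),
  variable = c("income", "rent"),
  estimate = c(100, 10),
  moe = c(5, 1)
)

widen <- function(vary) {
  pivot_wider(
    toy,
    names_from = variable,
    names_glue = "{variable}_{.value}",
    values_from = c(estimate, moe),
    names_vary = vary
  )
}

fastest <- widen("fastest") # the default
slowest <- widen("slowest")

names(fastest) # id, income_estimate, rent_estimate, income_moe, rent_moe
names(slowest) # id, income_estimate, income_moe, rent_estimate, rent_moe
```

The values are the same in both results; only the column order differs.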
The package help page explains names_vary as follows:
names_vary
When names_from identifies a column (or columns) with multiple unique values, and multiple values_from columns are provided, in what order should the resulting column names be combined?
"fastest" varies names_from values fastest, resulting in a column naming scheme of the form: value1_name1, value1_name2, value2_name1, value2_name2. This is the default.
"slowest" varies names_from values slowest, resulting in a column naming scheme of the form: value1_name1, value2_name1, value1_name2, value2_name2.
For fine-grained control, you can use pivot_wider_spec(), which lets you define the specification for the resulting data frame:
library(tidyverse)
spec <- tibble(
.name = c("income_estimate", "income_moe", "rent_estimate", "rent_moe"),
.value = c("estimate", "moe", "estimate", "moe"),
variable = c("income", "income", "rent", "rent")
)
us_rent_income %>% pivot_wider_spec(spec)
Output:
# A tibble: 52 x 6
GEOID NAME income_estimate income_moe rent_estimate rent_moe
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 01 Alabama 24476 136 747 3
2 02 Alaska 32940 508 1200 13
3 04 Arizona 27517 148 972 4
4 05 Arkansas 23789 165 709 5
5 06 California 29454 109 1358 3
6 08 Colorado 32401 109 1125 5
7 09 Connecticut 35326 195 1123 5
8 10 Delaware 31560 247 1076 10
9 11 District of Columbia 43198 681 1424 17
10 12 Florida 25952 70 1077 3
# … with 42 more rows
And with a few pre-processing steps, you can avoid having to manually enter all the values in spec:
field <- us_rent_income %>% distinct(variable) %>% pull()
sub_field <- colnames(us_rent_income)[4:5]
pivot_names <- map(field, ~paste(., sub_field, sep = "_")) %>% unlist()
pivot_vals <- rep(sub_field, 2)
pivot_vars <- map(field, rep, 2) %>% unlist()
spec <- tibble(.name = pivot_names, .value = pivot_vals, variable = pivot_vars)
us_rent_income %>% pivot_wider_spec(spec)
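As a possible alternative to assembling spec by hand, tidyr's build_wider_spec() can generate the default specification, which can then be reordered like any other data frame. This is a sketch, not part of the original answer; sorting by variable and then .value is an assumption that happens to give the order asked for here:

```r
library(dplyr)
library(tidyr)

# Generate the default spec (columns: .name, .value, variable),
# then sort its rows into the desired column order
spec <- us_rent_income %>%
  build_wider_spec(
    names_from = variable,
    values_from = c(estimate, moe),
    names_glue = "{variable}_{.value}"
  ) %>%
  arrange(variable, .value)

wide <- us_rent_income %>% pivot_wider_spec(spec)

names(wide)
# GEOID, NAME, income_estimate, income_moe, rent_estimate, rent_moe
```

The row order of spec determines the column order of the result, so any other ordering can be obtained by arranging spec differently.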
After the pivoting, we could do a select by ordering the substring of column names
library(dplyr)
library(tidyr)
library(stringr)
us_rent_income %>%
pivot_wider(
names_from = variable,
names_glue = "{variable}_{.value}",
values_from = c(estimate, moe)
) %>%
select(GEOID, NAME, order(str_remove(names(.)[-(1:2)], "_.*")) + 2)
-output
# A tibble: 52 x 6
# GEOID NAME income_estimate income_moe rent_estimate rent_moe
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 01 Alabama 24476 136 747 3
# 2 02 Alaska 32940 508 1200 13
# 3 04 Arizona 27517 148 972 4
# 4 05 Arkansas 23789 165 709 5
# 5 06 California 29454 109 1358 3
# 6 08 Colorado 32401 109 1125 5
# 7 09 Connecticut 35326 195 1123 5
# 8 10 Delaware 31560 247 1076 10
# 9 11 District of Columbia 43198 681 1424 17
#10 12 Florida 25952 70 1077 3
# … with 42 more rows
The ordering is driven by the names_from column, so names_sort has no effect on how the names coming from values_from are ordered; in the OP's code, changing the order inside names_glue would not change the column order either. In the data, the unique values of the 'variable' column appear as income followed by rent, so that is the order used with the default names_sort = FALSE. With names_sort = TRUE the order is alphabetical, which is again i followed by r.
We can check this by first reshaping to 'long' format, uniting the columns, and then calling pivot_wider:
us_rent_income %>%
pivot_longer(cols = c(estimate, moe)) %>%
unite(variable, variable, name) %>%
pivot_wider(names_from = variable, values_from = value)
-output
# A tibble: 52 x 6
# GEOID NAME income_estimate income_moe rent_estimate rent_moe
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 01 Alabama 24476 136 747 3
# 2 02 Alaska 32940 508 1200 13
# 3 04 Arizona 27517 148 972 4
# 4 05 Arkansas 23789 165 709 5
# 5 06 California 29454 109 1358 3
# 6 08 Colorado 32401 109 1125 5
# 7 09 Connecticut 35326 195 1123 5
# 8 10 Delaware 31560 247 1076 10
# 9 11 District of Columbia 43198 681 1424 17
#10 12 Florida 25952 70 1077 3
# … with 42 more rows
Now, if we set a custom order through factor levels and specify names_sort = TRUE, the columns come out in the order of those levels:
us_rent_income %>%
pivot_longer(cols = c(estimate, moe)) %>%
unite(variable, variable, name) %>%
mutate(variable = factor(variable,
levels = c('income_estimate', 'rent_moe', 'rent_estimate', 'income_moe'))) %>%
pivot_wider(names_from = variable, values_from = value, names_sort = TRUE)
# A tibble: 52 x 6
# GEOID NAME income_estimate rent_moe rent_estimate income_moe
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 01 Alabama 24476 3 747 136
# 2 02 Alaska 32940 13 1200 508
# 3 04 Arizona 27517 4 972 148
# 4 05 Arkansas 23789 5 709 165
# 5 06 California 29454 3 1358 109
# 6 08 Colorado 32401 5 1125 109
# 7 09 Connecticut 35326 5 1123 195
# 8 10 Delaware 31560 10 1076 247
# 9 11 District of Columbia 43198 17 1424 681
#10 12 Florida 25952 3 1077 70
# … with 42 more rows
Question
I use time-series data regularly. Sometimes, I would like to transmute an entire data frame to obtain some data frame of growth rates, or shares, for example.
When using transmute this is relatively straight-forward. But when I have a lot of columns to transmute and I want to keep the date column, I'm not sure if that's possible.
Below, using the economics data set, is an example of what I mean.
Example
library(dplyr)
economics %>%
transmute(date,
pce * 10,
pop * 10,
psavert * 10)
# A tibble: 574 x 4
date `pce * 10` `pop * 10` `psavert * 10`
<date> <dbl> <dbl> <dbl>
1 1967-07-01 5067 1987120 126
2 1967-08-01 5098 1989110 126
3 1967-09-01 5156 1991130 119
4 1967-10-01 5122 1993110 129
5 1967-11-01 5174 1994980 128
6 1967-12-01 5251 1996570 118
7 1968-01-01 5309 1998080 117
8 1968-02-01 5336 1999200 123
9 1968-03-01 5443 2000560 117
10 1968-04-01 5440 2002080 123
# ... with 564 more rows
Now, using transmute_at. The below predictably removes date in the .vars argument, but I haven't found a way of removing date and reintroducing it in .funs such that the resulting data frame looks as it does above. Any ideas?
economics %>%
transmute_at(.vars = vars(-c(date, uempmed, unemploy)),
.funs = list("trans" = ~ . * 10))
# A tibble: 574 x 3
pce_trans pop_trans psavert_trans
<dbl> <dbl> <dbl>
1 5067 1987120 126
2 5098 1989110 126
3 5156 1991130 119
4 5122 1993110 129
5 5174 1994980 128
6 5251 1996570 118
7 5309 1998080 117
8 5336 1999200 123
9 5443 2000560 117
10 5440 2002080 123
# ... with 564 more rows
We can use if/else inside the function.
library(dplyr)
library(ggplot2)
data(economics)
economics %>%
transmute_at(vars(date:psavert), ~ if(is.numeric(.)) .* 10 else .)
# A tibble: 574 x 4
# date pce pop psavert
# <date> <dbl> <dbl> <dbl>
# 1 1967-07-01 5067 1987120 126
# 2 1967-08-01 5098 1989110 126
# 3 1967-09-01 5156 1991130 119
# 4 1967-10-01 5122 1993110 129
# 5 1967-11-01 5174 1994980 128
# 6 1967-12-01 5251 1996570 118
# 7 1968-01-01 5309 1998080 117
# 8 1968-02-01 5336 1999200 123
# 9 1968-03-01 5443 2000560 117
#10 1968-04-01 5440 2002080 123
# … with 564 more rows
If we need to rename the columns selectively, we can do this after the transmute_at:
library(stringr)
economics %>%
transmute_at(vars(date:psavert), ~ if(is.numeric(.)) .* 10 else .) %>%
rename_at(vars(-date), ~ str_c(., '_trans'))
# A tibble: 574 x 4
# date pce_trans pop_trans psavert_trans
# <date> <dbl> <dbl> <dbl>
# 1 1967-07-01 5067 1987120 126
# 2 1967-08-01 5098 1989110 126
# 3 1967-09-01 5156 1991130 119
# 4 1967-10-01 5122 1993110 129
# 5 1967-11-01 5174 1994980 128
# 6 1967-12-01 5251 1996570 118
# 7 1968-01-01 5309 1998080 117
# 8 1968-02-01 5336 1999200 123
# 9 1968-03-01 5443 2000560 117
#10 1968-04-01 5440 2002080 123
# … with 564 more rows
If we are renaming all of the selected columns in transmute_at, pass a named list, list(trans = ...):
economics %>%
transmute_at(vars(date:psavert), list(trans = ~if(is.numeric(.)) .* 10 else .))
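A different technique from the one above: in dplyr 1.0.0 and later, across() can be used inside transmute(), which keeps date untouched and renames only the transformed columns. This is a sketch under that version assumption, not the original answer's method:

```r
library(dplyr)

# economics ships with ggplot2; only the dataset is needed here
econ <- ggplot2::economics

out <- econ %>%
  transmute(
    date,                                  # kept as-is
    across(c(pce, pop, psavert),           # columns to transform
           ~ .x * 10,
           .names = "{.col}_trans")        # suffix only the transformed columns
  )

names(out) # date, pce_trans, pop_trans, psavert_trans
```

Since date is listed separately, no is.numeric() check is needed inside the function.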
I need to calculate summary statistics for observations of bird breeding activity for each of 150 species. The data frame has the species (sCodef), the type of observation (codef) (e.g. nest building), and the ordinal date (jdate, days since 1 January, since the data were collected over multiple years). Using dplyr I get exactly the result I want.
library(dplyr)
library(tidyr)
phenology %>% group_by(sCodef, codef) %>%
summarize(N=n(), Min=min(jdate), Max=max(jdate), Median=median(jdate))
# A tibble: 552 x 6
# Groups: sCodef [?]
sCodef codef N Min Max Median
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 ABDU AY 3 172 184 181
2 ABDU FL 12 135 225 188
3 ACFL AY 18 165 222 195
4 ACFL CN 4 142 156 152.
5 ACFL FL 10 166 197 192.
6 ACFL NB 6 139 184 150.
7 ACFL NY 6 166 207 182
8 AMCO FL 1 220 220 220
9 AMCR AY 53 89 198 161
10 AMCR FL 78 133 225 166.
# ... with 542 more rows
How do I get these summary statistics into some sort of data object so that I can export them to use ultimately in a Word document? I have tried this and gotten an error. All of the many explanations of summarize I have reviewed just show the summary data on screen. Thanks
out3 <- summarize(N=n(), Min=min(jdate), Max=max(jdate), median=median(jdate))
Error: This function should not be called directly
Assign this to a variable, then write to a csv like so:
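# Same pipeline as in the question, now assigned to an object
summarydf <- phenology %>%
  group_by(sCodef, codef) %>%
  summarize(N = n(), Min = min(jdate), Max = max(jdate), Median = median(jdate))
# write.csv() takes `file`, not `filename`
write.csv(summarydf, file = "yourfilenamehere.csv")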
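For a version that can be run end to end, here is a minimal sketch with a hypothetical stand-in for phenology (phenology_toy and its values are made up). It writes to a temporary file and reads it back; .groups = "drop" assumes dplyr 1.0.0 or later:

```r
library(dplyr)

# Hypothetical stand-in for the phenology data frame
phenology_toy <- data.frame(
  sCodef = c("ABDU", "ABDU", "ACFL"),
  codef  = c("AY", "AY", "FL"),
  jdate  = c(172, 184, 166)
)

# The data frame must be piped (or passed) into summarize();
# calling summarize() without data is what triggers the error
summarydf <- phenology_toy %>%
  group_by(sCodef, codef) %>%
  summarize(N = n(), Min = min(jdate), Max = max(jdate),
            Median = median(jdate), .groups = "drop")

# Write to a temporary file and read it back to confirm the export
out_file <- tempfile(fileext = ".csv")
write.csv(summarydf, file = out_file, row.names = FALSE)
check <- read.csv(out_file)
```

The resulting CSV can be opened in a spreadsheet program and pasted into Word as a table.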