Frequency count in R

I want it to display a frequency table with the total domestic frequency (Boston + Salt Lake City) and the total international frequency (London + Shanghai), but it prints out like this:
table$Category <- c("Domestic", "International")
> table
  problem.6.data Freq      Category
1         Boston  136      Domestic
2         London  102 International
3 Salt Lake City  277      Domestic
4       Shanghai  184 International
I want an output of:
1. Domestic: 136 + 277
2. International: 102 + 184
So, in the end, the table should look like:
Domestic: 413
International: 286
What am I doing wrong?

If you don't mind using the tidyverse, you could use group_by() and summarize():
library(tidyverse)
df <-
  data.frame(
    stringsAsFactors = FALSE,
    problem.6.data = c("Boston", "London", "Salt Lake City", "Shanghai"),
    Freq = c(136L, 102L, 277L, 184L),
    Category = c("Domestic", "International", "Domestic", "International")
  )

df %>%
  group_by(Category) %>%
  summarise(sum = sum(Freq))
#> # A tibble: 2 x 2
#>   Category        sum
#>   <chr>         <int>
#> 1 Domestic        413
#> 2 International   286
Created on 2020-03-19 by the reprex package (v0.3.0)

Maybe aggregate from base R can give the desired output
dfout <- aggregate(Freq ~ Category, df, sum)
such that
> dfout
       Category Freq
1      Domestic  413
2 International  286
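For reference, the same sum-by-category can also be computed with a couple of other base R tools; a small sketch (not part of the original answers), assuming the same df as above:
# xtabs() with a left-hand-side variable sums it within each level of the right-hand side
xtabs(Freq ~ Category, df)
# tapply() returns a named vector of the per-category sums
tapply(df$Freq, df$Category, sum)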


Create several columns from a complex column in R

Imagine this dataset:
df1 <- tibble::tribble(
  ~City,                                ~Population,
  "United Kingdom > Leeds",             1500000,
  "Spain > Las Palmas de Gran Canaria", 200000,
  "Canada > Nanaimo, BC",               150000,
  "Canada > Montreal",                  250000,
  "United States > Minneapolis, MN",    700000,
  "United States > Milwaukee, WI",      NA,
  "United States > Milwaukee",          400000
)
I would like to:
1. Split the column City into three columns: City, Country, and State (if available, NA otherwise).
2. Make sure Milwaukee ends up with both a State and a Population (the row with the NA Population should get the value 400000 from the other Milwaukee row, and then be split into City/State/Country).
Could you please suggest the easiest way to do this? :)
Here's another solution using extract() to pull out Country, City, and State in a single go, with State captured by an optional group (the rest of the task is done as in @Allen Cameron's answer below):
library(tidyr)
library(dplyr)

df1 %>%
  extract(City,
          into = c("Country", "City", "State"),
          regex = "([^>]+) > ([^,]+),? ?([A-Z]+)?") %>%
  # as in @Allen Cameron's answer:
  group_by(Country, City) %>%
  summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
            Population = Population[!is.na(Population)])
You can use separate twice to get the country and state, then group_by Country and City to summarize away the NA values where appropriate:
library(tidyverse)

df1 %>%
  separate(City, sep = " > ", into = c("Country", "City")) %>%
  separate(City, sep = ", ", into = c("City", "State")) %>%
  group_by(Country, City) %>%
  summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
            Population = Population[!is.na(Population)])
#> # A tibble: 6 x 4
#> # Groups:   Country [4]
#>   Country        City                       State Population
#>   <chr>          <chr>                      <chr>      <dbl>
#> 1 Canada         Montreal                   <NA>      250000
#> 2 Canada         Nanaimo                    BC        150000
#> 3 Spain          Las Palmas de Gran Canaria <NA>      200000
#> 4 United Kingdom Leeds                      <NA>     1500000
#> 5 United States  Milwaukee                  WI        400000
#> 6 United States  Minneapolis                MN        700000

How to conditionally mutate a new column when data is in long format and the condition depends on a grouping combination

I have data in long format, and I'm trying to test each row against the mean of a certain grouping combination, in order to generate a new column with the conclusion from that test.
Example
In this toy example, I have data about 20 cars. Each car could be of one of three possible makers. We have mpg data for each car, measured 8 times: in the city or highway, in the morning or evening, during the winter or spring.
library(tidyr)

set.seed(2021)

df_id_and_makers <-
  data.frame(id = 1:20,
             maker = sample(c("toyota", "audi", "bmw"), size = 20, replace = TRUE))

df <- tidyr::expand_grid(df_id_and_makers,
                         road_type = c("city", "highway"),
                         time_of_day = c("morning", "evening"),
                         season = c("winter", "spring"))

df$mpg_val <- sample(15:40, size = nrow(df), replace = TRUE)

df
#> # A tibble: 160 x 6
#>       id maker road_type time_of_day season mpg_val
#>    <int> <chr> <chr>     <chr>       <chr>    <int>
#>  1     1 bmw   city      morning     winter      28
#>  2     1 bmw   city      morning     spring      22
#>  3     1 bmw   city      evening     winter      40
#>  4     1 bmw   city      evening     spring      18
#>  5     1 bmw   highway   morning     winter      19
#>  6     1 bmw   highway   morning     spring      36
#>  7     1 bmw   highway   evening     winter      30
#>  8     1 bmw   highway   evening     spring      16
#>  9     2 audi  city      morning     winter      33
#> 10     2 audi  city      morning     spring      18
#> # ... with 150 more rows
Created on 2021-07-07 by the reprex package (v2.0.0)
I want to analyze this data to test my hypothesis that mpg in the city is larger than mpg on the highway. To this end, I want to create a new column that tests whether the value of mpg_val when road_type is city is larger than the mean of mpg_val across rows where road_type is highway. Furthermore, I want to compare only among cars of the same maker.
So, for example, id = 1 is a bmw, and therefore the new column should test each value of mpg_val in rows where road_type == city (i.e., rows 1-4, but not rows 5-8), and see whether mpg_val is larger than mean(mpg_val) over rows where road_type == highway and maker == bmw.
Expected output
Here's the manual and dumb way of doing this. I'll show only how I do this for maker = bmw for the sake of demonstration.
library(dplyr)

# step 1 -- calculate the mean of `mpg_val` for `road_type = "highway"`, only across bmw
mean_bmw_highway_mpg <-
  df %>%
  filter(maker == "bmw",
         road_type == "highway") %>%
  pull(mpg_val) %>%
  mean()

mean_bmw_highway_mpg
## [1] 26.22222

# step 2 -- compare each row where `maker = "bmw"` and `road_type = "city"` for its `mpg_val` against `mean_bmw_highway_mpg`
result_bmw_only <-
  df %>%
  mutate(is_mpg_city_larger_than_mpg_highway =
           case_when(maker != "bmw" ~ "not_relevant",
                     road_type != "city" ~ "not_relevant",
                     mpg_val > mean_bmw_highway_mpg ~ "yes",
                     TRUE ~ "no"))

result_bmw_only
## # A tibble: 160 x 7
##       id maker road_type time_of_day season mpg_val is_mpg_city_larger_than_mpg_highway
##    <int> <chr> <chr>     <chr>       <chr>    <int> <chr>
##  1     1 bmw   city      morning     winter      28 yes           ## because 28 > 26.222
##  2     1 bmw   city      morning     spring      22 no            ## because 22 < 26.222
##  3     1 bmw   city      evening     winter      40 yes
##  4     1 bmw   city      evening     spring      18 no
##  5     1 bmw   highway   morning     winter      19 not_relevant
##  6     1 bmw   highway   morning     spring      36 not_relevant
##  7     1 bmw   highway   evening     winter      30 not_relevant
##  8     1 bmw   highway   evening     spring      16 not_relevant
##  9     2 audi  city      morning     winter      33 not_relevant
## 10     2 audi  city      morning     spring      18 not_relevant
## # ... with 150 more rows
How could I achieve the same result as result_bmw_only (but applied to the entire df) in a more elegant way? Hopefully with a dplyr approach, because this is what I'm used to, but otherwise any method will do.
Thanks!
EDIT 1
One solution I could think of involves purrr, but I can't get this done yet.
library(purrr)

solution_purrr <-
  df %>%
  group_by(maker) %>%
  nest(data = -maker) %>%
  mutate(tbl_with_desired_new_col =
           map(.x = data,
               .f = ~ .x %>%
                 mutate(is_mpg_city_lrgr_thn_mpg_hwy =
                          case_when(road_type != "city" ~ "not_relevant",
                                    mpg_val > mean(mpg_val) ~ "yes",
                                    TRUE ~ "no"))))
solution_purrr comes close to the desired output, but not quite. This is because the second condition in case_when() (i.e., mpg_val > mean(mpg_val) ~ "yes") is not what I want. I want to compare mpg_val to mean(mpg_val) where that mean is computed based only on rows where road_type == "highway". But here mean(mpg_val) is computed across all rows.
EDIT 2
Based on #Till's answer below, I'd like to clarify that I'm looking for a solution that avoids a separate calculation of the mean we want to test against. What I did above with mean_bmw_highway_mpg is the undesired way of working towards the output. I showed mean_bmw_highway_mpg only for demonstrating the kind of mean I need to calculate.
What you tried is already close. Take a look at the documentation of dplyr::group_by(); it is designed for exactly these kinds of operations.
Below is how you can expand your BMW-only solution to the full dataset using group_by().
library(tidyverse)

mean_highway_mpg_df <-
  df %>%
  filter(road_type == "highway") %>%
  group_by(maker) %>%
  summarise(mean_highway_mpg = mean(mpg_val))

result_df <-
  df %>%
  filter(road_type == "city") %>%
  group_by(maker) %>%
  left_join(mean_highway_mpg_df) %>%
  mutate(mpg_city_higher_highway = mpg_val > mean_highway_mpg)
#> Joining, by = "maker"

result_df %>%
  select(-(time_of_day:season))
#> # A tibble: 80 x 6
#> # Groups:   maker [3]
#>       id maker road_type mpg_val mean_highway_mpg mpg_city_higher_highway
#>    <int> <chr> <chr>       <int>            <dbl> <lgl>
#>  1     1 bmw   city           28             26.2 TRUE
#>  2     1 bmw   city           22             26.2 FALSE
#>  3     1 bmw   city           40             26.2 TRUE
#>  4     1 bmw   city           18             26.2 FALSE
#>  5     2 audi  city           33             28.1 TRUE
#>  6     2 audi  city           18             28.1 FALSE
#>  7     2 audi  city           35             28.1 TRUE
#>  8     2 audi  city           36             28.1 TRUE
#>  9     3 audi  city           25             28.1 FALSE
#> 10     3 audi  city           32             28.1 TRUE
#> # … with 70 more rows
I think I got it. The following solution is based both on my EDIT 1 above and on @MrFlick's comment here.
First, we define a helper function:
is_x_larger_than_mean_y <- function(x, y) {
  x > mean(y)
}
Then, we run:
library(dplyr)
library(purrr)
library(tidyr)

df %>%
  group_by(maker) %>%
  nest(data = -maker) %>%
  mutate(tbl_with_desired_new_col =
           map(.x = data,
               .f = ~ .x %>%
                 mutate(is_mpg_city_lrgr_thn_mpg_hwy =
                          case_when(road_type != "city" ~ "not_relevant",
                                    is_x_larger_than_mean_y(mpg_val, mpg_val[road_type == "highway"]) ~ "yes",
                                    TRUE ~ "no")))) %>%
  select(-data) %>%
  unnest(cols = tbl_with_desired_new_col)
This way, the line within case_when() that says is_x_larger_than_mean_y(mpg_val, mpg_val[road_type == "highway"]) ~ "yes" ensures that we compute the mean of mpg_val only based on rows in which road_type == "highway".
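For what it's worth, the same per-maker comparison can also be written without the nest()/map() step, by computing the highway mean inside a grouped mutate(). A minimal sketch of that idea (assuming the same df and column names as above):
library(dplyr)

df %>%
  group_by(maker) %>%
  mutate(is_mpg_city_lrgr_thn_mpg_hwy =
           case_when(road_type != "city" ~ "not_relevant",
                     # the mean is taken only over the highway rows of the current maker
                     mpg_val > mean(mpg_val[road_type == "highway"]) ~ "yes",
                     TRUE ~ "no")) %>%
  ungroup()
Unlike the nested version, this keeps the original row order of df.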

How to fill in missing time series data in a data frame?

I am working with the following time series data:
Weeks <- c("1995-01", "1995-02", "1995-03", "1995-04", "1995-06", "1995-08",
           "1995-10", "1995-15", "1995-16", "1995-24", "1995-32")
Country <- c("United States")
Values <- sample(seq(1, 500, 1), length(Weeks), replace = TRUE)

df <- data.frame(Weeks, Country, Values)
     Weeks       Country Values
1  1995-01 United States    193
2  1995-02 United States    183
3  1995-03 United States    402
4  1995-04 United States     75
5  1995-06 United States    402
6  1995-08 United States    436
7  1995-10 United States     97
8  1995-15 United States    445
9  1995-16 United States    336
10 1995-24 United States     31
11 1995-32 United States    413
The data is structured by year and by the week number within that year (column 1). Notice how some weeks are omitted (as a result of the aggregation function); for example, 1995-05 is not included. How can I add the omitted rows to the data, with the appropriate country name and a value of 0?
Thank you for your help!
Separate the year and week values into different columns. For each Country and Years combination, complete the missing weeks and fill Values with 0. Finally, unite the year and week columns to get the data back into the same format as the original.
library(dplyr)
library(tidyr)

df %>%
  separate(Weeks, c('Years', 'Weeks'), sep = '-', convert = TRUE) %>%
  group_by(Country, Years) %>%
  complete(Weeks = min(Weeks):max(Weeks), fill = list(Values = 0)) %>%
  ungroup() %>%
  mutate(Weeks = sprintf('%02d', Weeks)) %>%
  unite(Weeks, Years, Weeks, sep = '-')
#    Country       Weeks   Values
#    <chr>         <chr>    <dbl>
#  1 United States 1995-01    354
#  2 United States 1995-02    395
#  3 United States 1995-03    408
#  4 United States 1995-04    143
#  5 United States 1995-05      0
#  6 United States 1995-06    481
#  7 United States 1995-07      0
#  8 United States 1995-08     49
#  9 United States 1995-09      0
# 10 United States 1995-10    229
# … with 22 more rows
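If you prefer base R, a similar result can be sketched with merge() against a full set of week labels. This is only a sketch for the single-year sample above (assuming the weeks should run from 01 up to the last observed week, 32):
# build all week labels for 1995, then merge the observed data onto them
all_weeks <- data.frame(Weeks = sprintf("1995-%02d", 1:32),
                        Country = "United States")
full <- merge(all_weeks, df, all.x = TRUE)  # merges on the shared Weeks and Country columns
full$Values[is.na(full$Values)] <- 0        # missing weeks get a value of 0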

How to set column names to row values in R

I have this type of table in R:
April Tourist
 2018     123
 2018     222
I want my table to look like this:
Month Year Domestic International Total
April 2018      123           222   345
I am new to R. I tried using melt() and the rownames() function, but I'm not getting the output I want.
Based on your comment that you only have 2 rows in your data set, here's a way to do this with dplyr and tidyr:
library(dplyr)
library(tidyr)

df <- tibble(April = c(2018, 2018),
             Tourist = c(123, 222))

df %>%
  mutate(Type = c("Domestic", "International")) %>%
  gather(Month, Year, April) %>%
  spread(Type, Tourist) %>%
  mutate(Total = Domestic + International)
# A tibble: 1 x 5
  Month  Year Domestic International Total
  <chr> <dbl>    <dbl>         <dbl> <dbl>
1 April  2018      123           222   345
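Since gather() and spread() have been superseded in tidyr, here is a rough equivalent of the same answer using pivot_longer() and pivot_wider() (a sketch, not a different approach):
library(dplyr)
library(tidyr)

df %>%
  mutate(Type = c("Domestic", "International")) %>%
  pivot_longer(April, names_to = "Month", values_to = "Year") %>%
  pivot_wider(names_from = Type, values_from = Tourist) %>%
  mutate(Total = Domestic + International)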

R dataframe: combine columns based on other columns

I have a dataframe as follows:
df <- tibble::tribble(~home,         ~visitor,     ~hcountry, ~vcountry,
                      "Milan",       "Manchester", "ITA",     "ENG",
                      "LIVERPOOL",   "MILAN",      "ENG",     "ITA",
                      "Real Madrid", "Juventus",   "SPA",     "ITA")
#> # A tibble: 3 x 4
#>   home        visitor    hcountry vcountry
#>   <chr>       <chr>      <chr>    <chr>
#> 1 Milan       Manchester ITA      ENG
#> 2 LIVERPOOL   MILAN      ENG      ITA
#> 3 Real Madrid Juventus   SPA      ITA
and would like to get only the Italian teams, i.e. Milan, Milan, Juventus... How is this possible without using loops?
First off, I recommend a basic R tutorial to familiarise yourself with basic R data operations like subsetting etc. See for example R for Beginners on CRAN.
In your case you can do:
df[df$hcountry == "ITA" | df$vcountry == "ITA", ]
#         home    visitor hcountry vcountry
#1       Milan Manchester      ITA      ENG
#2   LIVERPOOL      MILAN      ENG      ITA
#3 Real Madrid   Juventus      SPA      ITA
Or
subset(df, hcountry == "ITA" | vcountry == "ITA")
Sample data
df <- read.table(text =
  "home visitor hcountry vcountry
   Milan Manchester ITA ENG
   LIVERPOOL MILAN ENG ITA
   'Real Madrid' Juventus SPA ITA", header = TRUE)
Alternatively you could try stacking home and visitor countries to find unique values
library(dplyr)
library(tidyr)

df %>%
  gather(key1, country, -c(home, visitor)) %>%
  gather(key2, team, -c(key1, country)) %>%
  mutate_at(vars(key1, key2), substr, start = 1, stop = 1) %>%
  filter(key1 == key2) %>%
  select(-key1, -key2) %>%
  mutate(team = tools::toTitleCase(tolower(team))) %>%
  filter(country == "ITA") %>%
  distinct()
#> # A tibble: 2 x 2
#>   country team
#>   <chr>   <chr>
#> 1 ITA     Milan
#> 2 ITA     Juventus
Remove the last distinct() if you want to see the Milan value duplicated.
We can use filter from dplyr
library(dplyr)

df %>%
  filter(hcountry == "ITA" | vcountry == "ITA")
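And if only the team names themselves are wanted ("Milan", "MILAN", "Juventus"), a quick base R sketch (an assumption about the desired output, not part of the answers above):
# home teams where the home country is ITA, plus visitors where the visiting country is ITA
italian_teams <- c(df$home[df$hcountry == "ITA"],
                   df$visitor[df$vcountry == "ITA"])
italian_teams
# [1] "Milan"    "MILAN"    "Juventus"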
