I have data on 30 people that includes ethnicity, gender, school type, whether they received free school meals, etc.
I want to produce frequency counts for all of these features. Currently my code looks like this:
df <- read.csv("~file")
df %>% select(Ethnicity) %>% group_by(Ethnicity) %>% summarise(freq = n())
df %>% select(Gender) %>% group_by(Gender) %>% summarise(freq = n())
df %>% select(School.type) %>% group_by(School.type) %>% summarise(freq = n())
Is there a way I can create a frequency tibble for 8 columns (e.g. ethnicity, gender, school type, etc.) in a more efficient way (e.g. 1 or 2 lines of code)?
As an example output for the ethnicity code:
# A tibble: 13 × 2
Ethnicity freq
<chr> <int>
1 Asian or Asian British - Bangladeshi 1
2 Asian or Asian British - Indian 7
3 Asian or Asian British - Pakistani 1
4 Black or Black British - African 5
5 Black or Black British - Caribbean 2
6 Chinese 3
7 Mixed - White and Asian 2
8 Mixed - White and Black African 1
9 Mixed - White and Black Caribbean 1
10 Not known/ prefer not to say 1
11 White British 27
12 White Irish 1
13 White Other 5
And for gender:
# A tibble: 2 × 2
Gender freq
<chr> <int>
1 Female 36
2 Male 21
NB: some columns contain data such as postcode and name, which I obviously don't want frequency counts for, so I think I'll somehow need to select only the columns I want to apply this function to
One option would be to use lapply to loop over a vector of your desired columns and dplyr::count for the frequency table.
Using the starwars dataset as example data:
library(dplyr, warn.conflicts = FALSE)
cols <- c("hair_color", "sex")
lapply(cols, function(x) {
  count(starwars, .data[[x]], name = "freq")
})
#> [[1]]
#> # A tibble: 13 × 2
#> hair_color freq
#> <chr> <int>
#> 1 auburn 1
#> 2 auburn, grey 1
#> 3 auburn, white 1
#> 4 black 13
#> 5 blond 3
#> 6 blonde 1
#> 7 brown 18
#> 8 brown, grey 1
#> 9 grey 1
#> 10 none 37
#> 11 unknown 1
#> 12 white 4
#> 13 <NA> 5
#>
#> [[2]]
#> # A tibble: 5 × 2
#> sex freq
#> <chr> <int>
#> 1 female 16
#> 2 hermaphroditic 1
#> 3 male 60
#> 4 none 6
#> 5 <NA> 4
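If you'd rather end up with one long table than a list of tibbles, you can stack the per-column counts and record which variable each row came from. A minimal base-R sketch of that idea (the toy data frame and its column names are made up for illustration, so it runs without any packages):

```r
# Toy stand-in for the survey data (column names are assumptions)
df <- data.frame(
  Gender = c("Female", "Female", "Male"),
  School.type = c("State", "Private", "State")
)
cols <- c("Gender", "School.type")

# One frequency table per column, stacked with a 'variable' label
freqs <- do.call(rbind, lapply(cols, function(x) {
  tab <- as.data.frame(table(df[[x]]), stringsAsFactors = FALSE)
  names(tab) <- c("value", "freq")
  cbind(variable = x, tab)
}))
freqs
```

The same stacking works on the `lapply`/`count` output above with `dplyr::bind_rows()`, if you rename each column to a common name first.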
Ok, so here is my scenario: I have a dataset with a column composed of lists of words (keyword tags for YT videos, where each row is video data).
What I want is a complete count of all unique elements within these lists, across the entire column. So basically what I want in the end is a table with two fields: keyword, count.
If I just do a simple dplyr query, then it counts the list itself as a unique object. While this is also interesting, this is not what I want.
So this is the above dplyr query that I want to utilize further, but not sure how to nest unique instances within the unique lists:
vid_tag_freq = df %>%
  count(tags)
To further clarify:
With a dataset like:
Tags
1 ['Dog', 'Cat', 'Mouse', 'Fish']
2 ['Cat', 'Fish']
3 ['Cat', 'Fish']
I am now getting:
Tags Count
1 ['Dog', 'Cat', 'Mouse', 'Fish'] 1
2 ['Cat', 'Fish'] 2
What I actually want:
Tags Count
1 'Cat' 3
2 'Fish' 3
3 'Dog' 1
4 'Mouse' 1
I hope that explains it lol
EDIT: This is what my data looks like, guess most are lists of lists? Maybe I should clean up [0]s as null?
[1] "[['Flood (Disaster Type)', 'Burlington (City/Town/Village)', 'Ontario (City/Town/Village)']]"
[2] "[0]"
[3] "[0]"
[4] "[['Rocket (Product Category)', 'Interview (TV Genre)', 'Canadian Broadcasting Corporation (TV Network)', 'Israel (Country)', 'Gaza War (Military Conflict)']]"
[5] "[0]"
[6] "[['Iraq (Country)', 'Military (Film Genre)', 'United States Of America (Country)']]"
[7] "[['Ebola (Disease Or Medical Condition)', 'Chair', 'Margaret Chan (Physician)', 'WHO']]"
[8] "[['CBC Television (TV Network)', 'CBC News (Website Owner)', 'Canadian Broadcasting Corporation (TV Network)']]"
[9] "[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)', 'Bill Blair', 'Gillian Findlay', 'Documentary (TV Genre)']]"
[10] "[['B.C.', 'Dog Walking (Profession)', 'dogs', 'dog walker', 'death', 'dead']]"
[11] "[['Suicide Of Amanda Todd (Event)', 'Amanda Todd', 'cyberbullying', 'CBC Television (TV Network)', 'the fifth estate', 'Mark Kelley', 'cappers', 'Documentary (TV Genre)']]"
[12] "[['National Hockey League (Sports Association)', 'Climate Change (Website Category)', 'Hockey (Sport)', 'greenhouse gas', 'emissions']]"
[13] "[['Rob Ford (Politician)', 'bomb threat', 'Toronto (City/Town/Village)', 'City Hall (Building)']]"
[14] "[['Blue Jays', 'Ashes', 'friends']]"
[15] "[['Robin Williams (Celebrity)', 'Peter Gzowski']]"
It would help if you could dput() some of the data for a working example. Going off the idea that you have a list column, here are a couple of general solutions you may be able to work with:
df <- tibble::tibble(
  x = replicate(10, sample(state.name, sample(5:10, 1), TRUE), simplify = FALSE)
)
df
#> # A tibble: 10 × 1
#> x
#> <list>
#> 1 <chr [7]>
#> 2 <chr [7]>
#> 3 <chr [8]>
#> 4 <chr [6]>
#> 5 <chr [8]>
#> 6 <chr [8]>
#> 7 <chr [8]>
#> 8 <chr [6]>
#> 9 <chr [5]>
#> 10 <chr [10]>
# dplyr in a dataframe
df |>
  tidyr::unnest(x) |>
  dplyr::count(x)
#> # A tibble: 36 × 2
#> x n
#> <chr> <int>
#> 1 Alabama 1
#> 2 Alaska 1
#> 3 Arkansas 4
#> 4 California 3
#> 5 Colorado 5
#> 6 Connecticut 1
#> 7 Delaware 3
#> 8 Florida 1
#> 9 Georgia 3
#> 10 Hawaii 2
#> # … with 26 more rows
# vctrs
vctrs::vec_count(unlist(df$x))
#> key count
#> 1 Colorado 5
#> 2 Louisiana 5
#> 3 North Dakota 4
#> 4 Mississippi 4
#> 5 Arkansas 4
#> 6 Delaware 3
#> 7 Vermont 3
#> 8 Minnesota 3
#> 9 Utah 3
#> 10 California 3
#> 11 Georgia 3
#> 12 Indiana 2
#> 13 Missouri 2
#> 14 New Hampshire 2
#> 15 Maryland 2
#> 16 Nebraska 2
#> 17 Hawaii 2
#> 18 New Jersey 2
#> 19 Oklahoma 2
#> 20 Massachusetts 1
#> 21 Illinois 1
#> 22 Texas 1
#> 23 Connecticut 1
#> 24 Rhode Island 1
#> 25 Michigan 1
#> 26 New York 1
#> 27 Ohio 1
#> 28 Nevada 1
#> 29 Florida 1
#> 30 Montana 1
#> 31 Wisconsin 1
#> 32 Alabama 1
#> 33 Alaska 1
#> 34 North Carolina 1
#> 35 Washington 1
#> 36 Kansas 1
Created on 2022-10-07 with reprex v2.0.2
Edit
If your list is actually a character vector, you'll need to do some string parsing.
# "list" but are actually strings
x <- c(
"[['Flood (Disaster Type)', 'Burlington (City/Town/Village)', 'Ontario (City/Town/Village)']]",
"[0]",
"[0]",
"[['Rocket (Product Category)', 'Interview (TV Genre)', 'Canadian Broadcasting Corporation (TV Network)', 'Israel (Country)', 'Gaza War (Military Conflict)']]",
"[0]",
"[['Iraq (Country)', 'Military (Film Genre)', 'United States Of America (Country)']]",
"[['Ebola (Disease Or Medical Condition)', 'Chair', 'Margaret Chan (Physician)', 'WHO']]",
"[['CBC Television (TV Network)', 'CBC News (Website Owner)', 'Canadian Broadcasting Corporation (TV Network)']]",
"[['Rob Ford (Politician)', 'the fifth estate', 'CBC Television (TV Network)', 'Bill Blair', 'Gillian Findlay', 'Documentary (TV Genre)']]",
"[['B.C.', 'Dog Walking (Profession)', 'dogs', 'dog walker', 'death', 'dead']]",
"[['Suicide Of Amanda Todd (Event)', 'Amanda Todd', 'cyberbullying', 'CBC Television (TV Network)', 'the fifth estate', 'Mark Kelley', 'cappers', 'Documentary (TV Genre)']]",
"[['National Hockey League (Sports Association)', 'Climate Change (Website Category)', 'Hockey (Sport)', 'greenhouse gas', 'emissions']]",
"[['Rob Ford (Politician)', 'bomb threat', 'Toronto (City/Town/Village)', 'City Hall (Building)']]",
"[['Blue Jays', 'Ashes', 'friends']]",
"[['Robin Williams (Celebrity)', 'Peter Gzowski']]"
)
# assign to a data.frame
df <- data.frame(x = x)
df |>
  dplyr::mutate(
    # remove square brackets at the beginning or end
    x = gsub("^\\[{1,2}|\\]{1,2}$", "", x),
    # separate the strings into an actual list
    x = strsplit(x, "',\\s|,\\s'")
  ) |>
  # unnest the list column so elements appear as individual rows
  tidyr::unnest(x) |>
  # some extra cleaning to strip out the '
  dplyr::mutate(x = gsub("^'|'$", "", x)) |>
  # count the individual elements
  dplyr::count(x, sort = TRUE)
#> # A tibble: 47 × 2
#> x n
#> <chr> <int>
#> 1 0 3
#> 2 CBC Television (TV Network) 3
#> 3 Canadian Broadcasting Corporation (TV Network) 2
#> 4 Documentary (TV Genre) 2
#> 5 Rob Ford (Politician) 2
#> 6 the fifth estate 2
#> 7 Amanda Todd 1
#> 8 Ashes 1
#> 9 B.C. 1
#> 10 Bill Blair 1
#> # … with 37 more rows
# same result just working with the vector
x |>
  gsub("^\\[{1,2}|\\]{1,2}$", "", x = _) |>
  strsplit("',\\s|,\\s'") |>
  unlist() |>
  gsub("^'|'$", "", x = _) |>
  vctrs::vec_count() # or table()
#> key count
#> 1 CBC Television (TV Network) 3
#> 2 0 3
#> 3 Rob Ford (Politician) 2
#> 4 the fifth estate 2
#> 5 Documentary (TV Genre) 2
#> 6 Canadian Broadcasting Corporation (TV Network) 2
#> 7 City Hall (Building) 1
#> 8 United States Of America (Country) 1
#> 9 Mark Kelley 1
#> 10 Israel (Country) 1
#> 11 Bill Blair 1
#> 12 Interview (TV Genre) 1
#> 13 Blue Jays 1
#> 14 Hockey (Sport) 1
#> 15 friends 1
#> 16 Peter Gzowski 1
#> 17 Suicide Of Amanda Todd (Event) 1
#> 18 greenhouse gas 1
#> 19 Dog Walking (Profession) 1
#> 20 Flood (Disaster Type) 1
#> 21 National Hockey League (Sports Association) 1
#> 22 Amanda Todd 1
#> 23 Chair 1
#> 24 dog walker 1
#> 25 bomb threat 1
#> 26 dogs 1
#> 27 Climate Change (Website Category) 1
#> 28 Robin Williams (Celebrity) 1
#> 29 Margaret Chan (Physician) 1
#> 30 cyberbullying 1
#> 31 Ashes 1
#> 32 Ontario (City/Town/Village) 1
#> 33 Iraq (Country) 1
#> 34 WHO 1
#> 35 cappers 1
#> 36 Gillian Findlay 1
#> 37 Military (Film Genre) 1
#> 38 CBC News (Website Owner) 1
#> 39 B.C. 1
#> 40 Ebola (Disease Or Medical Condition) 1
#> 41 Toronto (City/Town/Village) 1
#> 42 death 1
#> 43 emissions 1
#> 44 Rocket (Product Category) 1
#> 45 Gaza War (Military Conflict) 1
#> 46 dead 1
#> 47 Burlington (City/Town/Village) 1
Created on 2022-10-08 with reprex v2.0.2
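A base-R variant of the same parsing: instead of stripping brackets and then splitting, you can pull every single-quoted tag out with a regex and tabulate. A minimal sketch on a shortened toy version of the strings; note the `"[0]"` rows simply produce no matches, so they drop out on their own rather than being counted as a `0` tag:

```r
x <- c(
  "[['Dog', 'Cat']]",
  "[0]",
  "[['Cat', 'Fish']]"
)

# extract every '...'-quoted substring, then drop the quotes
tags <- regmatches(x, gregexpr("'[^']*'", x))
tags <- gsub("'", "", unlist(tags))

sort(table(tags), decreasing = TRUE)
```

This assumes tags never contain an embedded single quote; if they can, the split-based approach above is safer.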
It looks like you need unnest_longer():
library(dplyr)
library(tidyr)
df <- tibble(
  Tags = list(
    list('Dog', 'Cat', 'Mouse', 'Fish'),
    list('Cat', 'Fish'),
    list('Cat', 'Fish')
  )
)
df %>%
  tidyr::unnest_longer(Tags) %>%
  count(Tags) %>%
  arrange(desc(n))
#> # A tibble: 4 × 2
#> Tags n
#> <chr> <int>
#> 1 Cat 3
#> 2 Fish 3
#> 3 Dog 1
#> 4 Mouse 1
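For completeness, the same count can be had without tidyr: since the column is just a list of character vectors, base R's `unlist()` plus `table()` gets there in one line. A minimal sketch using the toy tags from above:

```r
tags <- list(
  c("Dog", "Cat", "Mouse", "Fish"),
  c("Cat", "Fish"),
  c("Cat", "Fish")
)

# flatten the list column and tabulate, most frequent first
counts <- sort(table(unlist(tags)), decreasing = TRUE)
counts
```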
starwars %>%
  group_by(species, sex) %>%
  summarise() %>%
  select(unique.species = species, unique.sex = sex)
How do I get the unique values from 2 columns ("species", "sex") together? I wrote the code above but I'm not sure it's right. Thank you
library(tidyverse)
starwars |>
  select(species, sex) |>
  distinct()
#> # A tibble: 41 × 2
#> species sex
#> <chr> <chr>
#> 1 Human male
#> 2 Droid none
#> 3 Human female
#> 4 Wookiee male
#> 5 Rodian male
#> 6 Hutt hermaphroditic
#> 7 Yoda's species male
#> 8 Trandoshan male
#> 9 Mon Calamari male
#> 10 Ewok male
#> # … with 31 more rows
Created on 2022-04-25 by the reprex package (v2.0.1)
library(tidyverse)
starwars %>%
  expand(nesting(species, sex))
#> # A tibble: 41 × 2
#> species sex
#> <chr> <chr>
#> 1 Aleena male
#> 2 Besalisk male
#> 3 Cerean male
#> 4 Chagrian male
#> 5 Clawdite female
#> 6 Droid none
#> 7 Dug male
#> 8 Ewok male
#> 9 Geonosian male
#> 10 Gungan male
#> # … with 31 more rows
Created on 2022-04-25 by the reprex package (v2.0.1)
There are multiple options. You can use the following code:
unique(starwars[c("species", "sex")])
Output:
species sex
<chr> <chr>
1 Human male
2 Droid none
3 Human female
4 Wookiee male
5 Rodian male
6 Hutt hermaphroditic
7 Yoda's species male
8 Trandoshan male
9 Mon Calamari male
10 Ewok male
# … with 31 more rows
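If, beyond the distinct pairs, you also want to know how often each combination occurs, you can count on both columns at once. A base-R sketch on a small made-up frame (so it runs without the starwars data):

```r
# Toy stand-in for starwars; only the two columns of interest
df <- data.frame(
  species = c("Human", "Human", "Droid", "Human"),
  sex     = c("male", "male", "none", "female")
)

# distinct pairs, as in the answers above
pairs <- unique(df[c("species", "sex")])

# frequency of each pair
freq <- aggregate(list(n = rep(1, nrow(df))), df[c("species", "sex")], sum)
freq
```

With dplyr loaded, `count(df, species, sex)` gives the same table in one call.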
I am working through Rob Hyndman's FPP3. I am on section 2.5 and there is an example about Australian holiday tourism. Here is the example with output:
holidays <- tourism %>%
  filter(Purpose == "Holiday") %>%
  group_by(State) %>%
  summarise(Trips = sum(Trips))
holidays
#> # A tsibble: 640 x 3 [1Q]
#> # Key: State [8]
#> State Quarter Trips
#> <chr> <qtr> <dbl>
#> 1 ACT 1998 Q1 196.
#> 2 ACT 1998 Q2 127.
#> 3 ACT 1998 Q3 111.
#> 4 ACT 1998 Q4 170.
#> 5 ACT 1999 Q1 108.
#> 6 ACT 1999 Q2 125.
#> 7 ACT 1999 Q3 178.
#> 8 ACT 1999 Q4 218.
#> 9 ACT 2000 Q1 158.
#> 10 ACT 2000 Q2 155.
#> # … with 630 more rows
However, when I use the same code I get the following output:
> holidays
# A tibble: 8 x 2
State Trips
<chr> <dbl>
1 ACT 12089.
2 New South Wales 238741.
3 Northern Territory 14917.
4 Queensland 170787.
5 South Australia 52887.
6 Tasmania 31229.
7 Victoria 179228.
8 Western Australia 63349.
As you can see, the tsibble has been changed to a tibble. When I run everything but the summarise function, I still get a tsibble. I am thinking that perhaps the summarise function is somehow changing the type to tibble. Any help would be appreciated. Thanks!
I uninstalled and reinstalled the tsibble package. I noticed that my original version was 0.8.6 but after installation I now have 0.9.0. After I did that it fixed the issue. Thanks!
I'm struggling to understand exactly how to compute a deflation factor for wages in a panel based on inflation.
I've put together the R example below to help illustrate the issue.
In Wooldridge (2009:452), Introductory Econometrics, 5th ed., he creates a deflation factor by dividing 107.6 by 65.2, i.e. 107.6/65.2 ≈ 1.65, but I can't figure out how to apply this to my own panel data. Wooldridge only mentions the deflation factor in passing.
Say I have a mini panel with two people, Jane and Tom, starting in 2006/2009 and running until 2015, with their yearly wage,
# install.packages(c("dplyr"), dependencies = TRUE)
library(dplyr)
set.seed(2)
tbl <- tibble(id = rep(c('Jane', 'Tom'), c(7, 10)),
              yr = c(2009:2015, 2006:2015),
              wg = c(rnorm(7, mean = 5.1*10^4, sd = 9), rnorm(10, 4*10^4, 12))
); tbl
#> A tibble: 17 x 3
#> id yr wg
#> <chr> <int> <dbl>
#> 1 Jane 2009 50991.93
#> 2 Jane 2010 51001.66
#> 3 Jane 2011 51014.29
#> 4 Jane 2012 50989.83
#> 5 Jane 2013 50999.28
#> 6 Jane 2014 51001.19
#> 7 Jane 2015 51006.37
#> 8 Tom 2006 39997.12
#> 9 Tom 2007 40023.81
#> 10 Tom 2008 39998.33
#> 11 Tom 2009 40005.01
#> 12 Tom 2010 40011.78
#> 13 Tom 2011 39995.29
#> 14 Tom 2012 39987.52
#> 15 Tom 2013 40021.39
#> 16 Tom 2014 39972.27
#> 17 Tom 2015 40010.54
I now get the consumer price index (CPI) (using this answer)
# install.packages(c("Quandl"), dependencies = TRUE)
CPI00to16 <- Quandl::Quandl("FRED/CPIAUCSL", collapse = "annual",
                            start_date = "2000-01-01", end_date = "2016-01-01")
as_tibble(CPI00to16)
#> # A tibble: 17 x 2
#> Date Value
#> <date> <dbl>
#> 1 2016-12-31 238.106
#> 2 2015-12-31 237.846
#> 3 2014-12-31 236.290
#> 4 2013-12-31 234.723
#> 5 2012-12-31 231.221
#> 6 2011-12-31 227.223
#> 7 2010-12-31 220.472
#> 8 2009-12-31 217.347
#> 9 2008-12-31 211.398
#> 10 2007-12-31 211.445
#> 11 2006-12-31 203.100
#> 12 2005-12-31 198.100
#> 13 2004-12-31 191.700
#> 14 2003-12-31 185.500
#> 15 2002-12-31 181.800
#> 16 2001-12-31 177.400
#> 17 2000-12-31 174.600
My question is: how do I deflate Jane and Tom's wages, cf. Wooldridge 2009, selecting 2015 as the baseline year?
Update: following MrSmithGoesToWashington's comment below.
CPI00to16$yr <- as.numeric(format(CPI00to16$Date,'%Y'))
CPI00to16 <- mutate(CPI00to16, deflation_factor = CPI00to16[2,2]/Value)
df <- tbl %>% inner_join(as_tibble(CPI00to16[,3:4]), by = "yr")
df <- mutate(df, wg_defl = deflation_factor*wg, wg_diff = wg_defl-wg)
df
#> # A tibble: 17 x 6
#> id yr wg deflation_factor wg_defl wg_diff
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Jane 2009 50991.93 1.094315 55801.21 4809.2844
#> 2 Jane 2010 51001.66 1.078804 55020.78 4019.1176
#> 3 Jane 2011 51014.29 1.046751 53399.28 2384.9910
#> 4 Jane 2012 50989.83 1.028652 52450.80 1460.9728
#> 5 Jane 2013 50999.28 1.013305 51677.83 678.5477
#> 6 Jane 2014 51001.19 1.006585 51337.04 335.8494
#> 7 Jane 2015 51006.37 1.000000 51006.37 0.0000
#> 8 Tom 2006 39997.12 1.171078 46839.76 6842.6394
#> 9 Tom 2007 40023.81 1.124860 45021.18 4997.3691
#> 10 Tom 2008 39998.33 1.125110 45002.53 5004.1909
#> 11 Tom 2009 40005.01 1.094315 43778.07 3773.0575
#> 12 Tom 2010 40011.78 1.078804 43164.86 3153.0747
#> 13 Tom 2011 39995.29 1.046751 41865.12 1869.8369
#> 14 Tom 2012 39987.52 1.028652 41133.26 1145.7322
#> 15 Tom 2013 40021.39 1.013305 40553.87 532.4863
#> 16 Tom 2014 39972.27 1.006585 40235.49 263.2225
#> 17 Tom 2015 40010.54 1.000000 40010.54 0.0000
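To make the deflation step itself easy to check, here is the same computation boiled down to base R, using three of the CPI values from the table above with 2015 as the base year: each wage is multiplied by CPI_base / CPI_year, so base-year wages are unchanged and earlier wages are scaled up. (The flat 100 wages are made up so the effect of the factor is obvious.)

```r
cpi <- data.frame(yr = 2013:2015, cpi = c(234.723, 236.290, 237.846))
wages <- data.frame(yr = 2013:2015, wg = c(100, 100, 100))

# CPI in the chosen base year
base_cpi <- cpi$cpi[cpi$yr == 2015]

# join CPI onto the wage panel, then deflate
m <- merge(wages, cpi, by = "yr")
m$deflation_factor <- base_cpi / m$cpi
m$wg_defl <- m$wg * m$deflation_factor
m
```

This is exactly what the `inner_join`/`mutate` pipeline above does; hard-coding `CPI00to16[2,2]` works but is fragile, since it silently depends on the 2015 row staying in position 2.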