I have two datasets, one for migration inflow to county A from other counties and the other for migration outflow from county A to other counties. I want to combine the two datasets as shown below.
Desired output:
Key County State FIPS Inflow Outflow FiscalYear Year
510012012 Accomack County VA 51001 NA 27 2011 - 2012 2012
160012012 Ada County ID 16001 16 16 2011 - 2012 2012
80012012 Adams County CO 8001 39 30 2011 - 2012 2012
80012011 Adams County CO 8001 42 31 2010 - 2011 2011
450032012 Aiken County SC 45003 NA 21 2011 - 2012 2012
120012012 Alachua County FL 12001 433 NA 2011 - 2012 2012
How can I combine the two into one dataset in such a way that I don't have to hardcode each and every common county and state name and FIPS and Year? Missing values would be NA.
The common value between the two data sets is the key.
My original migration outflow data has 517 observations and migration inflow has 441, thus different number of counties in each dataset.
Sample data:
# People moving out of county A to other counties
inflow_df = structure(list(Origin_FIPS = c(12001L, 8001L, 16001L, 12001L,
8001L, 16001L), Origin_StateName = c("FL", "CO", "ID", "FL",
"CO", "ID"), Origin_Place = c("Alachua County", "Adams County",
"Ada County", "Alachua County", "Adams County", "Ada County"),
InIndividuals = c(433L, 30L, 16L, 381L, 42L, 21L), FiscalYear = c("2011 - 2012",
"2011 - 2012", "2011 - 2012", "2011 - 2012", "2010 - 2011",
"2010 - 2011"), Year = c(2012L, 2012L, 2012L, 2011L, 2011L,
2011L), Key = c(120012012L, 80012012L, 160012012L, 120012011L,
80012011L, 160012011L)), class = "data.frame", row.names = c(NA,
-6L))
# People moving in county A from other counties
outflow_df = structure(list(Dest_FIPS = c(51001L, 16001L, 8001L, 8001L, 45003L
), Dest_StateName = c("VA", "ID", "CO", "CO", "SC"), Dest_Place = c("Accomack County",
"Ada County", "Adams County", "Adams County", "Aiken County"),
OutIndividuals = c(27L, 16L, 39L, 31L, 21L), FiscalYear = c("2011 - 2012",
"2011 - 2012", "2011 - 2012", "2011 - 2012", "2011 - 2012"
), Year = c(2012L, 2012L, 2012L, 2011L, 2012L), Key = c(510012012L,
160012012L, 80012012L, 80012011L, 450032012L)), class = "data.frame", row.names = c(NA,
-5L))
We can collate the two tables by giving them consistent names (presumably Origin_Place in one should match Dest_Place in the other) and then performing a join. full_join keeps every row found in either table and matches rows on all columns the two tables share, in this case c("Key", "County", "State", "FIPS", "FiscalYear", "Year"). Where a row has no match in the other table, the missing Inflow or Outflow value is filled with NA.
I would have expected inflow_df to reflect the counties that are seeing inflows (i.e. the destinations) and outflow_df to reflect the counties that have outflows (i.e. the origins), so it seems possible the table names are swapped in the question.
library(dplyr)
inflow2 <-
inflow_df %>%
transmute(Key,
County = Origin_Place,
State = Origin_StateName,
FIPS = Origin_FIPS,
Inflow = InIndividuals,
FiscalYear,
Year)
outflow2 <-
outflow_df %>%
transmute(Key,
County = Dest_Place,
State = Dest_StateName,
FIPS = Dest_FIPS,
Outflow = OutIndividuals,
FiscalYear,
Year)
inflow2 %>%
full_join(outflow2)
Result (updated with data from 2022-11-04)
Joining, by = c("Key", "County", "State", "FIPS", "FiscalYear", "Year")
Key County State FIPS Inflow FiscalYear Year Outflow
1 120012012 Alachua County FL 12001 433 2011 - 2012 2012 NA
2 80012012 Adams County CO 8001 30 2011 - 2012 2012 39
3 160012012 Ada County ID 16001 16 2011 - 2012 2012 16
4 120012011 Alachua County FL 12001 381 2011 - 2012 2011 NA
5 80012011 Adams County CO 8001 42 2010 - 2011 2011 NA
6 160012011 Ada County ID 16001 21 2010 - 2011 2011 NA
7 510012012 Accomack County VA 51001 NA 2011 - 2012 2012 27
8 80012011 Adams County CO 8001 NA 2011 - 2012 2011 31
9 450032012 Aiken County SC 45003 NA 2011 - 2012 2012 21
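For readers without dplyr, the same kind of full join can be sketched in base R with merge() and all = TRUE. This is an alternative shown for illustration, not the code used above; the two small tables are cut-down stand-ins for inflow2 and outflow2, with values taken from the sample data.

```r
# Cut-down stand-ins for inflow2 and outflow2 (illustrative subset of the sample data)
inflow2 <- data.frame(Key = c(120012012L, 80012012L, 160012012L),
                      County = c("Alachua County", "Adams County", "Ada County"),
                      State = c("FL", "CO", "ID"),
                      Inflow = c(433L, 30L, 16L))
outflow2 <- data.frame(Key = c(510012012L, 160012012L, 80012012L),
                       County = c("Accomack County", "Ada County", "Adams County"),
                       State = c("VA", "ID", "CO"),
                       Outflow = c(27L, 16L, 39L))

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA
combined <- merge(inflow2, outflow2, by = c("Key", "County", "State"), all = TRUE)
combined
```

Naming the join columns explicitly in by also guards against accidentally joining on a column that merely happens to share a name.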
I have two datasets, one for migration inflow to county A from other counties and the other for migration outflow from county A to other counties. I want to combine the two datasets as shown below.
Desired output:
Key County State FIPS Inflow Outflow FiscalYear Year
510012012 Accomack County VA 51001 NA 27 2011 - 2012 2012
160012012 Ada County ID 16001 12 18 2011 - 2012 2012
80012012 Adams County CO 8001 22 39 2011 - 2012 2012
80012011 Adams County CO 8001 42 31 2010 - 2011 2011
450032012 Aiken County SC 45003 NA 21 2011 - 2012 2012
120012012 Alachua County FL 12001 433 NA 2011 - 2012 2012
Part of the problem is that the common columns have unequal number of rows. Another problem is that different counties can belong to the same state as shown in the dummy data below. So, what I am thinking is to have a unique key (Key) for both destination and origin county by concatenating FIPS (unique for each county) and Year. That way I can combine the counties with their respective states and the rest of the associated columns values in a single row.
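A minimal sketch of building such a key, assuming the FIPS code and the year are stored as integers as in the sample data below (the vector names fips, year and key are made up for this illustration):

```r
# Hypothetical vectors standing in for the FIPS and Year columns
fips <- c(51001L, 8001L)
year <- c(2012L, 2011L)

# Concatenate FIPS and Year; the year is always four digits, so the key stays unambiguous
key <- paste0(fips, year)
key
```

Keeping the key as a character string avoids any integer overflow concerns if longer identifiers are ever used.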
How can I combine the two into one dataset in such a way that I don't have to hardcode each and every common county and state name and FIPS and Year? Missing values would be NA.
My original migration outflow data has 517 observations and migration inflow has 441, thus different number of counties in each dataset.
Sample data:
# People moving out of county A to other counties
inflow_df = structure(list(Origin_FIPS = c(12001L, 8001L, 16001L, 12001L,
8001L, 16001L), Origin_StateName = c("FL", "CO", "ID", "FL",
"CO", "ID"), Origin_Place = c("Alachua County", "Adams County",
"Ada County", "Alachua County", "Adams County", "Ada County"),
InIndividuals = c(433L, 30L, 16L, 381L, 42L, 21L), FiscalYear = c("2011 - 2012",
"2011 - 2012", "2011 - 2012", "2010 - 2011", "2010 - 2011",
"2010 - 2011"), Year = c(2012L, 2012L, 2012L, 2011L, 2011L,
2011L), Key = c(120012012L, 80012012L, 160012012L, 120012011L,
80012011L, 160012011L)), class = "data.frame", row.names = c(NA,
-6L))
# People moving in county A from other counties
outflow_df = structure(list(Dest_FIPS = c(51001L, 16001L, 8001L, 8001L, 45003L
), Dest_StateName = c("VA", "ID", "CO", "CO", "SC"), Dest_Place = c("Accomack County",
"Ada County", "Adams County", "Adams County", "Aiken County"),
OutIndividuals = c(27L, 16L, 39L, 31L, 21L), FiscalYear = c("2011 - 2012",
"2011 - 2012", "2011 - 2012", "2010 - 2011", "2011 - 2012"
), Year = c(2012L, 2012L, 2012L, 2011L, 2012L), Key = c(510012012L,
160012012L, 80012012L, 80012011L, 450032012L)), class = "data.frame", row.names = c(NA,
-5L))
Perhaps this helps
library(dplyr)
library(tidyr)
# Stack both tables under shared column names by stripping the
# Origin_/Dest_ and In/Out prefixes, labelling each row with its source table
bind_rows(Inflow = rename_with(inflow_df, ~ sub("^(Origin_|In)", "", .x)),
Outflow = rename_with(outflow_df, ~ sub("^(Dest_|Out)", "", .x)),
.id = "datname") %>%
# Spread the counts into separate Inflow and Outflow columns; missing pairs become NA
pivot_wider(names_from = datname, values_from = Individuals) %>%
rename(County = Place, State = StateName)
I've posted this as another question, but realised I've got my sample data wrong.
I've got two separate datasets. df1 looks like this:
loc_ID year observations
nin212 2002 90
nin212 2003 98
nin212 2004 102
cha670 2001 18
cha670 2002 19
cha670 2003 21
df2 looks like this:
loc_ID start_year end_year
nin212 2002 2003
nin212 2003 2004
cha670 2001 2002
cha670 2002 2003
I want to calculate the number of observations in the time intervals (start_year to end_year) per loc_ID. In the example above, I would like to achieve this final dataset:
loc_ID start_year end_year observations
nin212 2002 2003 188
nin212 2003 2004 200
cha670 2001 2002 37
cha670 2002 2003 40
How could I do this?
We can do a non-equi join
library(data.table)
setDT(df2)[, observations := setDT(df1)[df2, sum(observations),
on = .(loc_ID, year >= start_year, year <= end_year),
by = .EACHI]$V1]
Output
df2
# loc_ID start_year end_year observations
#1: nin212 2002 2003 188
#2: nin212 2003 2004 200
#3: cha670 2001 2002 37
#4: cha670 2002 2003 40
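If data.table is not available, the same interval sums can be sketched in base R by looping over the rows of df2 with mapply(). This is only an illustrative alternative to the non-equi join above, and it will be slower on large tables; the data frames repeat the example from the question.

```r
df1 <- data.frame(loc_ID = c("nin212", "nin212", "nin212", "cha670", "cha670", "cha670"),
                  year = c(2002L, 2003L, 2004L, 2001L, 2002L, 2003L),
                  observations = c(90L, 98L, 102L, 18L, 19L, 21L))
df2 <- data.frame(loc_ID = c("nin212", "nin212", "cha670", "cha670"),
                  start_year = c(2002L, 2003L, 2001L, 2002L),
                  end_year = c(2003L, 2004L, 2002L, 2003L))

# For each interval, sum the observations at the same location whose year
# falls within [start_year, end_year]
df2$observations <- mapply(function(id, from, to) {
  sum(df1$observations[df1$loc_ID == id & df1$year >= from & df1$year <= to])
}, df2$loc_ID, df2$start_year, df2$end_year, USE.NAMES = FALSE)
df2
```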
data
df1 <- structure(list(loc_ID = c("nin212", "nin212", "nin212", "cha670",
"cha670", "cha670"), year = c(2002L, 2003L, 2004L, 2001L, 2002L,
2003L), observations = c(90L, 98L, 102L, 18L, 19L, 21L)),
class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(loc_ID = c("nin212", "nin212", "cha670", "cha670"
), start_year = c(2002L, 2003L, 2001L, 2002L), end_year = c(2003L,
2004L, 2002L, 2003L)), class = "data.frame", row.names = c(NA,
-4L))
I have two datasets that look like this:
Sales
Region ReviewYear Sales Index
South Asia 2006 1.5 NA
South Asia 2009 4.5 NA
South Asia 2011 11 0
South Asia 2014 16.7 NA
Africa 2008 0.4 NA
Africa 2013 3.5 0
Africa 2017 9.7 NA
Strategy
Region StrategyYear
South Asia 2011
Africa 2013
Japan 2007
SE Asia 2009
There are multiple regions and many review years, which are not periodic and not even the same for all regions. I have added a column 'Index' to the dataframe 'Sales' such that, for a strategy year from the second dataframe, the index value is zero. I now want to replace each NA with a number that tells how many rows before (negative) or after (positive) the 0 row that particular row is, grouped by 'Region'.
I can do this using a for loop, but that is tedious, so I am checking whether there is a cleaner way to do this. The final output should look like this:
Sales
Region ReviewYear Sales Index
South Asia 2006 1.5 -2
South Asia 2009 4.5 -1
South Asia 2011 11 0
South Asia 2014 16.7 1
Africa 2008 0.4 -1
Africa 2013 3.5 0
Africa 2017 9.7 1
Join the two datasets by Region and, for each Region, create an Index column by subtracting from each row number the position of the row where ReviewYear matches StrategyYear.
library(dplyr)
left_join(Sales, Strategy, by = 'Region') %>%
arrange(Region, StrategyYear) %>%
group_by(Region) %>%
mutate(Index = row_number() - match(first(StrategyYear), ReviewYear))
# Region ReviewYear Sales Index StrategyYear
# <chr> <int> <dbl> <int> <int>
#1 Africa 2008 0.4 -1 2013
#2 Africa 2013 3.5 0 2013
#3 Africa 2017 9.7 1 2013
#4 SouthAsia 2006 1.5 -2 2011
#5 SouthAsia 2009 4.5 -1 2011
#6 SouthAsia 2011 11 0 2011
#7 SouthAsia 2014 16.7 1 2011
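The same idea can be sketched in base R, shown here only as an illustration of the row-number-minus-match-position logic and not as the answer's own code. The example rebuilds the two tables without the pre-filled Index column and sorts by ReviewYear within each Region before counting rows.

```r
Sales <- data.frame(Region = c("SouthAsia", "SouthAsia", "SouthAsia", "SouthAsia",
                               "Africa", "Africa", "Africa"),
                    ReviewYear = c(2006L, 2009L, 2011L, 2014L, 2008L, 2013L, 2017L),
                    Sales = c(1.5, 4.5, 11, 16.7, 0.4, 3.5, 9.7))
Strategy <- data.frame(Region = c("SouthAsia", "Africa", "Japan", "SEAsia"),
                       StrategyYear = c(2011L, 2013L, 2007L, 2009L))

# Attach each region's strategy year, then sort so rows are chronological per region
m <- merge(Sales, Strategy, by = "Region")
m <- m[order(m$Region, m$ReviewYear), ]

# Within each region: row number minus the position of the strategy-year row
m$Index <- unlist(lapply(split(m, m$Region), function(d) {
  seq_len(nrow(d)) - match(d$StrategyYear[1], d$ReviewYear)
}), use.names = FALSE)
m
```

Because split() returns the regions in sorted order and m is already sorted by Region, the unlisted result lines up with the rows of m.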
data
Sales <- structure(list(Region = c("SouthAsia", "SouthAsia", "SouthAsia",
"SouthAsia", "Africa", "Africa", "Africa"), ReviewYear = c(2006L,
2009L, 2011L, 2014L, 2008L, 2013L, 2017L), Sales = c(1.5, 4.5,
11, 16.7, 0.4, 3.5, 9.7), Index = c(NA, NA, 0L, NA, NA, 0L, NA
)), class = "data.frame", row.names = c(NA, -7L))
Strategy <- structure(list(Region = c("SouthAsia", "Africa", "Japan", "SEAsia"
), StrategyYear = c(2011L, 2013L, 2007L, 2009L)), class = "data.frame",
row.names = c(NA, -4L))
I have a dataframe combined_data that looks like this (this is just an example):
Year state_name VoS_thousUSD industry
2008 Alabama 100 Shipping
2009 Alabama 100 Shipping
2008 Alabama 200 Shipping
2010 Alabama 100 Shipping
2010 Alabama 50 Shipping
2010 Alabama 100 Shipping
2008 Alabama 100 Shipping
There are multiple Year, state_name, and industry values, each with associated VoS_thousUSD values, as well as other columns I no longer need.
I am trying to produce this
Year state_name VoS_thousUSD industry
2008 Alabama 400 Shipping
2009 Alabama 100 Shipping
2010 Alabama 250 Shipping
Here the dataframe is grouped by Year, state_name, and industry, and VoS_thousUSD is the sum within each group.
So far I have
combined_data %>%
group_by(Year, state_name, GCAM_industry) %>%
summarise() -> VoS_thousUSD_state_ind
But I am not sure how or where to add the sum of VoS_thousUSD. I would like to use a dplyr pipeline.
We can use
aggregate(VoS_thousUSD ~ ., combined_data, FUN = sum)
Or with dplyr
library(dplyr)
combined_data %>%
group_by(Year, state_name, industry) %>%
summarise(VoS_thousUSD = sum(VoS_thousUSD))
# A tibble: 3 x 4
# Groups: Year, state_name [3]
# Year state_name industry VoS_thousUSD
# <int> <chr> <chr> <int>
#1 2008 Alabama Shipping 400
#2 2009 Alabama Shipping 100
#3 2010 Alabama Shipping 250
data
combined_data <- structure(list(Year = c(2008L, 2009L, 2008L, 2010L, 2010L, 2010L,
2008L), state_name = c("Alabama", "Alabama", "Alabama", "Alabama",
"Alabama", "Alabama", "Alabama"), VoS_thousUSD = c(100L, 100L,
200L, 100L, 50L, 100L, 100L), industry = c("Shipping", "Shipping",
"Shipping", "Shipping", "Shipping", "Shipping", "Shipping")),
class = "data.frame", row.names = c(NA,
-7L))
I am learning dplyr and working on summarizing, filtering, and grouping.
I have a dataset "set2c" that contains census data: city name, state name, year, total population, totals by gender, and the average population per city and year:
https://share.getcloudapp.com/o0uQnyGn
Continuing with this dataset, I am trying to find state populations by summing the population of each city. I'd like to see a new column "stateAvePop" that holds the summarized average population for each state.
set2d <- set2c %>%
group_by(YEAR,STNAME,totalAll,totalMale,totalFemale) %>%
summarize(stateavepop = sum(avgpop))
set2d
https://share.getcloudapp.com/Z4uw0epe
But I think I need to sort by state somehow when I am trying to get the average sum. Would someone please take a look at where I am going wrong?
head(set2c, 20) returns:
# A tibble: 20 x 7
# Groups: CTYNAME, YEAR, STNAME, totalAll, totalMale [20]
CTYNAME YEAR STNAME totalAll totalMale totalFemale avgpop
<fct> <int> <fct> <int> <int> <int> <dbl>
1 Abbeville County 10 South Carolina 24560 11895 12665 24560
2 Abbeville County 11 South Carolina 24541 11868 12673 24541
3 Acadia Parish 10 Louisiana 62514 30405 32109 62514
4 Acadia Parish 11 Louisiana 62190 30342 31848 62190
5 Accomack County 10 Virginia 32566 15871 16695 32566
6 Accomack County 11 Virginia 32412 15817 16595 32412
7 Ada County 10 Idaho 456885 228715 228170 456885
8 Ada County 11 Idaho 469966 235266 234700 469966
9 Adair County 10 Iowa 7053 3503 3550 7053
10 Adair County 10 Kentucky 19294 9578 9716 19294
11 Adair County 10 Missouri 25306 12183 13123 25306
12 Adair County 10 Oklahoma 21981 10981 11000 21981
13 Adair County 11 Iowa 7063 3509 3554 7063
14 Adair County 11 Kentucky 19215 9508 9707 19215
15 Adair County 11 Missouri 25339 12194 13145 25339
16 Adair County 11 Oklahoma 22082 11015 11067 22082
17 Adams County 10 Colorado 504428 254651 249777 504428
18 Adams County 10 Idaho 4132 2129 2003 4132
19 Adams County 10 Illinois 66094 32521 33573 66094
20 Adams County 10 Indiana 35422 17683 17739 35422
Based on the OP's updated data example, we only need to group by 'STNAME'
library(dplyr)
set2c %>%
group_by(STNAME) %>%
summarise(totalAll = sum(totalAll), avppop = mean(avgpop))
# A tibble: 11 x 3
# STNAME totalAll avppop
# <chr> <int> <dbl>
# 1 Colorado 504428 504428
# 2 Idaho 930983 310328.
# 3 Illinois 66094 66094
# 4 Indiana 35422 35422
# 5 Iowa 14116 7058
# 6 Kentucky 38509 19254.
# 7 Louisiana 124704 62352
# 8 Missouri 50645 25322.
# 9 Oklahoma 44063 22032.
#10 South Carolina 49101 24550.
#11 Virginia 64978 32489
If the intention is to select certain columns while creating new columns, use transmute instead of summarise
set2c %>%
group_by(STNAME) %>%
transmute(totalAll, totalAllSum = sum(totalAll), avppop = mean(avgpop))
# A tibble: 20 x 4
# Groups: STNAME [11]
# STNAME totalAll totalAllSum avppop
# <chr> <int> <int> <dbl>
# 1 South Carolina 24560 49101 24550.
# 2 South Carolina 24541 49101 24550.
# 3 Louisiana 62514 124704 62352
# 4 Louisiana 62190 124704 62352
# 5 Virginia 32566 64978 32489
# 6 Virginia 32412 64978 32489
# 7 Idaho 456885 930983 310328.
# 8 Idaho 469966 930983 310328.
# 9 Iowa 7053 14116 7058
#10 Kentucky 19294 38509 19254.
#11 Missouri 25306 50645 25322.
#12 Oklahoma 21981 44063 22032.
#13 Iowa 7063 14116 7058
#14 Kentucky 19215 38509 19254.
#15 Missouri 25339 50645 25322.
#16 Oklahoma 22082 44063 22032.
#17 Colorado 504428 504428 504428
#18 Idaho 4132 930983 310328.
#19 Illinois 66094 66094 66094
#20 Indiana 35422 35422 35422
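The difference between collapsing to one row per state and keeping every row while attaching the state total can also be sketched in base R with aggregate() and ave(). This is an illustration on a small made-up subset (the data frame name pop is hypothetical), not the dplyr code above.

```r
# Small illustrative subset of the STNAME and totalAll columns
pop <- data.frame(STNAME = c("Idaho", "Idaho", "Iowa", "Iowa", "Idaho"),
                  totalAll = c(456885L, 469966L, 7053L, 7063L, 4132L))

# One row per state, like summarise()
per_state <- aggregate(totalAll ~ STNAME, data = pop, FUN = sum)

# Every original row kept, with the state total repeated alongside it
pop$totalAllSum <- ave(pop$totalAll, pop$STNAME, FUN = sum)

per_state
pop
```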
data
set2c <- structure(list(CTYNAME = c("Abbeville County", "Abbeville County",
"Acadia Parish", "Acadia Parish", "Accomack County", "Accomack County",
"Ada County", "Ada County", "Adair County", "Adair County", "Adair County",
"Adair County", "Adair County", "Adair County", "Adair County",
"Adair County", "Adams County", "Adams County", "Adams County",
"Adams County"), YEAR = c(10L, 11L, 10L, 11L, 10L, 11L, 10L,
11L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 10L, 10L, 10L, 10L
), STNAME = c("South Carolina", "South Carolina", "Louisiana",
"Louisiana", "Virginia", "Virginia", "Idaho", "Idaho", "Iowa",
"Kentucky", "Missouri", "Oklahoma", "Iowa", "Kentucky", "Missouri",
"Oklahoma", "Colorado", "Idaho", "Illinois", "Indiana"), totalAll = c(24560L,
24541L, 62514L, 62190L, 32566L, 32412L, 456885L, 469966L, 7053L,
19294L, 25306L, 21981L, 7063L, 19215L, 25339L, 22082L, 504428L,
4132L, 66094L, 35422L), totalMale = c(11895L, 11868L, 30405L,
30342L, 15871L, 15817L, 228715L, 235266L, 3503L, 9578L, 12183L,
10981L, 3509L, 9508L, 12194L, 11015L, 254651L, 2129L, 32521L,
17683L), totalFemale = c(12665L, 12673L, 32109L, 31848L, 16695L,
16595L, 228170L, 234700L, 3550L, 9716L, 13123L, 11000L, 3554L,
9707L, 13145L, 11067L, 249777L, 2003L, 33573L, 17739L), avgpop = c(24560L,
24541L, 62514L, 62190L, 32566L, 32412L, 456885L, 469966L, 7053L,
19294L, 25306L, 21981L, 7063L, 19215L, 25339L, 22082L, 504428L,
4132L, 66094L, 35422L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20"))