Finding the Average of One Column Based on Two Other Columns in R

I currently have a data frame with three columns (City, State, and Income). An example of the data is below:
City State Income
Addison Illinois 71,000
Addison Illinois 101,000
Addison Illinois 81,000
Addison Texas 74,000
As you can see, there are repeated cities. There are several rows for Addison, IL because income differs by zip code/area within the city.
I want to take the average of all incomes in a given city and state. In this example, I want the average of all Addison, IL rows but NOT Addison, Texas.
For this example, I am looking for:
City State MeanIncome
Addison Illinois 84,333
Addison Texas 74,000
I tried this:
Income_By_City <- aggregate( Income ~ City, df, mean )
But it gave me the average of ALL Addisons, including Texas...
Is there a way to take the average of the Income column based on City AND State?
I am pretty new to coding, so I'm not sure if this is a simple question, but I would appreciate any help I can get.

df <- data.frame(City = c("Addison", "Addison", "Addison", "Addison"),
                 State = c("Illinois", "Illinois", "Illinois", "Texas"),
                 Income = c(71000, 101000, 81000, 74000))
library(dplyr)
df %>%
  group_by(City, State) %>%
  summarise(MeanIncome = mean(Income))
# City State MeanIncome
#1 Addison Illinois 84333.33
#2 Addison Texas 74000.00
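For comparison, the same two-column grouping can also be done in base R by adding State to the aggregate formula from the question; a minimal sketch using the df defined above:
# group by both City and State, then average Income within each group
aggregate(Income ~ City + State, data = df, FUN = mean)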

Here is a dplyr solution:
library(tidyverse)
df <- tribble(
  ~City, ~State, ~Income,
  "Addison", "Illinois", 71000,
  "Addison", "Illinois", 101000,
  "Addison", "Illinois", 81000,
  "Addison", "Texas", 74000
)
df %>%
  group_by(City, State) %>%
  mutate(AverageIncome = mean(Income))
# A tibble: 4 x 4
# Groups: City, State [2]
City State Income AverageIncome
<chr> <chr> <dbl> <dbl>
1 Addison Illinois 71000 84333.33
2 Addison Illinois 101000 84333.33
3 Addison Illinois 81000 84333.33
4 Addison Texas 74000 74000.00

Related

Create several columns from a complex column in R

Imagine this dataset:
df1 <- tibble::tribble(~City, ~Population,
                       "United Kingdom > Leeds", 1500000,
                       "Spain > Las Palmas de Gran Canaria", 200000,
                       "Canada > Nanaimo, BC", 150000,
                       "Canada > Montreal", 250000,
                       "United States > Minneapolis, MN", 700000,
                       "United States > Milwaukee, WI", NA,
                       "United States > Milwaukee", 400000)
I would like to:
Split the City column into three columns: City, Country, and State (if available, NA otherwise).
Make sure Milwaukee ends up with both a state and a population (the NA population for Milwaukee, WI should take the value 400000 from the duplicate row once the City/State/Country split is done :).
Could you please suggest the easiest method to do so? :)
Here's another solution that uses extract to pull out Country, City, and State in a single step, with State captured by an optional group (the rest of the task is handled as in #Allen Cameron's answer):
library(tidyr)
library(dplyr)
df1 %>%
  extract(City,
          into = c("Country", "City", "State"),
          regex = "([^>]+) > ([^,]+),? ?([A-Z]+)?"
  ) %>%
  # as by #Allen Cameron:
  group_by(Country, City) %>%
  summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
            Population = Population[!is.na(Population)])
You can use separate twice to get the country and state, then group_by Country and City to summarize away the NA values where appropriate:
library(tidyverse)
df1 %>%
  separate(City, sep = " > ", into = c("Country", "City")) %>%
  separate(City, sep = ", ", into = c("City", "State")) %>%
  group_by(Country, City) %>%
  summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
            Population = Population[!is.na(Population)])
#> # A tibble: 6 x 4
#> # Groups: Country [4]
#> Country City State Population
#> <chr> <chr> <chr> <dbl>
#> 1 Canada Montreal <NA> 250000
#> 2 Canada Nanaimo BC 150000
#> 3 Spain Las Palmas de Gran Canaria <NA> 200000
#> 4 United Kingdom Leeds <NA> 1500000
#> 5 United States Milwaukee WI 400000
#> 6 United States Minneapolis MN 700000

Filter data by values (common values but different data entry) stored in another data frame

Based on the data below, how can I filter data by values stored in another data frame object?
Sample data:
# Data to be filtered
Dest_FIPS = c(1,2,3,4)
Dest_county = c("West Palm Beach County","Brevard County","Bay County","Miami-Dade County")
Dest_State = c("FL", "FL", "FL", "FL")
OutFlow = c(111, 222, 333, 444)
Orig_county = c("Broward County", "Broward County", "Broward County", "Broward County")
Orig_FIPS = c(5,5,5,5)
Orig_State = c("FL", "FL", "FL", "FL")
df = data.frame(Dest_FIPS, Dest_county, Dest_State, OutFlow, Orig_county, Orig_FIPS, Orig_State)
# rows to be filtered in column Dest_county based on the values in val_df
COUNTY_NAM = c("WEST PALM BEACH","BAY","MIAMI-DADE") #(values are actually stored in a CSV, so will be imported as a dataframe)
val_df = data.frame(COUNTY_NAM) # will use val_df to filter df
Desired output:
Dest_FIPS Dest_county OutFlow Orig_county
1 West Palm Beach County 111 Broward County
3 Bay County 333 Broward County
4 Miami-Dade County 444 Broward County
Transform df$Dest_county to match the format in val_df, then check which values are %in% val_df$COUNTY_NAM.
Base R:
df[toupper(gsub(" County", "", df$Dest_county)) %in% val_df$COUNTY_NAM,]
tidyverse:
library(dplyr)
library(stringr)
filter(df, str_to_upper(str_remove(Dest_county, " County")) %in% val_df$COUNTY_NAM)
Output for both:
Dest_FIPS Dest_county Dest_State OutFlow Orig_county Orig_FIPS Orig_State
1 1 West Palm Beach County FL 111 Broward County 5 FL
2 3 Bay County FL 333 Broward County 5 FL
3 4 Miami-Dade County FL 444 Broward County 5 FL
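An equivalent join-based sketch, assuming dplyr and stringr are loaded (the helper column key is introduced here only to normalize the names on both sides):
library(dplyr)
library(stringr)
df %>%
  mutate(key = str_to_upper(str_remove(Dest_county, " County"))) %>%  # normalized county name
  semi_join(mutate(val_df, key = COUNTY_NAM), by = "key") %>%         # keep rows with a match in val_df
  select(-key)                                                        # drop the helper column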

Finding rows that have the minimum of a specific factor group

I am attempting to find the minimum incomes in the state.x77 dataset based on the state.region variable.
df1 <- data.frame(state.region,state.x77,row.names = state.name)
tapply(state.x77,state.region,min)
I am trying to get it to output which state has the lowest income for each region (e.g., for the South, Alabama would be the state with the lowest income). I'm trying to use tapply, but I keep getting an error saying
Error in tapply(state.x77, state.region, min) :
arguments must have same length
What is the issue?
Here is a solution. First extract the income vector and turn it into a named vector. Then use tapply to get the names of the states with the minimum income in each region.
state <- setNames(state.x77[, "Income"], rownames(state.x77))
tapply(state, state.region, function(x) names(x)[which.min(x)])
# Northeast South North Central West
# "Maine" "Mississippi" "South Dakota" "New Mexico"
The following, more complicated, code will output state names, regions and incomes.
df1 <- data.frame(
  State = rownames(state.x77),
  Income = state.x77[, "Income"],
  Region = state.region
)
merge(aggregate(Income ~ Region, df1, min), df1)[c(3, 1, 2)]
# State Region Income
#1 South Dakota North Central 4167
#2 Maine Northeast 3694
#3 Mississippi South 3098
#4 New Mexico West 3601
And another solution with aggregate but avoiding merge.
agg <- aggregate(Income ~ Region, df1, min)
i <- match(agg$Income, df1$Income)
data.frame(
  State = df1$State[i],
  Region = df1$Region[i],
  Income = df1$Income[i]
)
# State Region Income
#1 Maine Northeast 3694
#2 Mississippi South 3098
#3 South Dakota North Central 4167
#4 New Mexico West 3601
You can also use this solution, where state2 is state.x77 converted to a data frame:
library(dplyr)
library(tibble)
state2 <- as.data.frame(state.x77)  # assumption: state2 is state.x77 as a data frame
state2 %>%
  rownames_to_column() %>%
  bind_cols(state.region) %>%
  rename(State = rowname,
         Region = ...10) %>%
  group_by(Region, State) %>%
  summarise(Income = sum(Income)) %>%
  arrange(desc(Income)) %>%
  slice_tail(n = 1)
# A tibble: 4 x 3
# Groups: Region [4]
Region State Income
<fct> <chr> <dbl>
1 Northeast Maine 3694
2 South Mississippi 3098
3 North Central South Dakota 4167
4 West New Mexico 3601
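With more recent versions of dplyr (>= 1.0.0), slice_min() expresses the same idea more directly; a minimal sketch using the df1 built above:
library(dplyr)
df1 %>%
  group_by(Region) %>%
  slice_min(Income, n = 1)  # one row per region with the smallest Income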

R column mapping

How can I map a column of one CSV file to a column of another CSV file in R, if both have the same data type?
For example, the first column of data frame A contains text with a country name somewhere in it, while a column of data frame B contains a standard list of all countries. Now I have to map every row of the first data frame to the standard country column.
For example, the location column of data frame A consists of 10,000 rows of data like this:
Sydney, Australia
Aarhus C, Central Region, Denmark
Auckland, New Zealand
Mumbai Area, India
Singapore
df1 <- data.frame(col1 = 1:5, col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"))
Now I have another column (country) in data frame B:
India
USA
New Zealand
UK
Singapore
Denmark
China
df2 <- data.frame(col1=1:7, col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"))
If the location column matches the country column, I want to replace that location with the country name; otherwise it should remain as it is. Sample output:
Sydney, Australia
Denmark
New Zealand
India
Singapore
Initially, it looked like a trivial question, but it's not. This approach works like this:
1. We convert the location string into a vector using unlist and strsplit.
2. Then we check whether any string in the vector is present in the country column. If it is, we store the country name in res; if not, we store 'notfound'.
3. Finally, we check whether res contains a country name or not.
library(stringr)
df1 <- data.frame(location = c('Sydney, Australia',
                               'Aarhus C, Central Region, Denmark',
                               'Auckland, New Zealand',
                               'Mumbai Area, India',
                               'Singapore'),
                  stringsAsFactors = F)
df2 <- data.frame(country = c('India',
                              'USA',
                              'New Zealand',
                              'UK',
                              'Singapore',
                              'Denmark',
                              'China'),
                  stringsAsFactors = F)
get_values <- function(i)
{
  val <- unlist(strsplit(i, split = ','))
  val <- sapply(val, str_trim)
  res <- c()
  for(j in val)
  {
    if(j %in% df2$country) res <- append(res, j)
    else res <- append(res, 'notfound')
  }
  if(all(res == 'notfound')) return (i)
  else return (res[res != 'notfound'])
}
df1$location2 <- sapply(df1$location, get_values)
location location2
1 Sydney, Australia Sydney, Australia
2 Aarhus C, Central Region, Denmark Denmark
3 Auckland, New Zealand New Zealand
4 Mumbai Area, India India
5 Singapore Singapore
Here is a solution using the tidyverse. First, convert your col2 to character by setting stringsAsFactors = FALSE, because character columns are easier to work with.
We can use str_extract to extract the matched country name, and then create a new col2 with mutate and ifelse.
library(tidyverse)
df3 <- df1 %>%
  mutate(Country = str_extract(col2, paste0(df2$col2, collapse = "|")),
         col2 = ifelse(is.na(Country), col2, Country)) %>%
  select(-Country)
df3
# col1 col2
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
We can also start with df1 and use separate_rows to split out the country name. After that, use semi_join to check whether the country names are in df2. Finally, combine the result with the original df1 by rows and keep the first row for each id in col1. df3 is the final output.
library(tidyverse)
df3 <- df1 %>%
  separate_rows(col2, sep = ", ") %>%
  semi_join(df2, by = "col2") %>%
  bind_rows(df1) %>%
  group_by(col1) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(col1)
df3
# # A tibble: 5 x 2
# col1 col2
# <int> <chr>
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
DATA
df1 <- data.frame(col1 = 1:5,
                  col2 = c("Sydney, Australia", "Aarhus C, Central Region, Denmark",
                           "Auckland, New Zealand", "Mumbai Area, India", "Singapore"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = 1:7,
                  col2 = c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"),
                  stringsAsFactors = FALSE)
If you are looking for the countries and they come after the cities, then you can do something like this:
transform(df1, col3 = sub(paste0(".*,\\s*(", paste0(df2$col2, collapse = "|"), ")"), "\\1", col2))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore
Breakdown:
> A=sub(".*,\\s(.*)","\\1",df1$col2)
> B=sapply(A,grep,df2$col2,value=T)
> transform(df1,col3=replace(A,!lengths(B),col2[!lengths(B)]))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore

R package "acs": Get county name, FIPS?

In search of a solution to an unsolved problem, I came across the acs package. I assume there's no way within the choroplethr package to get county information from data in the [city, state] format. That's why pre-processing with acs needs to be done.
I tried the following code to get the county information for a city:
library(acs)
geo.lookup(state="CA", place="San Francisco")
> geo.lookup(state="CA", place="San Francisco")
state state.name county.name place place.name
1 6 California <NA> NA <NA>
2 6 California San Francisco County 67000 San Francisco city
3 6 California San Mateo County 73262 South San Francisco city
As we know, cities can be part of different counties. Most likely, I will go with the second row
> geo.lookup(state="CA", place="San Francisco")[2,]
state state.name county.name place place.name
2 6 California San Francisco County 67000 San Francisco city
by default.
My question:
Is there a way to get the state abbreviation, county name and county FIPS, too? I could not find the answer in the documentation.
Also, for further processing (matching with choroplethr), the last "County" in county.name and "city" in place.name need to be removed.
Here's how to add the state abbreviation, county name, and county FIPS to your example. R has built-in variables for state names and state abbreviations. For the FIPS codes, I read a csv file from the Census Bureau's website.
library(acs)
library(tidyverse)
states <- cbind(state.name, state.abb) %>% tbl_df()
fips <-
  read_csv(
    "https://www2.census.gov/geo/docs/reference/codes/files/national_county.txt",
    col_names = c("state.abb", "statefp", "countyfp", "county.name", "classfp")
  )
query <- geo.lookup(state = "CA", place = "San Francisco")[2, ] %>%
  tbl_df() %>%
  left_join(states, by = "state.name") %>%
  left_join(fips, by = c("county.name", "state.abb"))
query
# # A tibble: 1 x 9
# state state.name county.name place place.name state.abb statefp countyfp classfp
# <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr>
# 1 6 California San Francisco County 67000 San Francisco city CA 06 075 H6
As you note at the end of your question, you may need to clean up this data a bit more to make it fit choroplethr.
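For the cleanup mentioned in the question, one minimal sketch (using the query tibble from above) is to strip the trailing words with sub():
# drop the trailing "County" / "city" so the names line up with choroplethr
query$county.name <- sub(" County$", "", query$county.name)
query$place.name <- sub(" city$", "", query$place.name)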
