Finding rows that have the minimum of a specific factor group - r

I am attempting to find the minimum incomes from the state.x77 dataset based on the state.region variable.
df1 <- data.frame(state.region,state.x77,row.names = state.name)
tapply(state.x77,state.region,min)
I am trying to get it to output which state has the lowest income for each region, e.g. for the South, Alabama would be the lowest income. I'm trying to use tapply but I keep getting an error saying:
Error in tapply(state.x77, state.region, min) :
arguments must have same length
What is the issue?
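
For reference, the error arises because state.x77 is a 50 x 8 matrix, not a data frame, so tapply() sees an object of length 400 against the length-50 factor state.region. Passing a single column resolves the length mismatch; a minimal sketch of the idea the answers below build on:
tapply(state.x77[, "Income"], state.region, min)
#     Northeast         South North Central          West
#          3694          3098          4167          3601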

Here is a solution. First get the vector of incomes as a named vector. Then use tapply to get the names of the states with the minimum incomes.
state <- setNames(state.x77[, "Income"], rownames(state.x77))
tapply(state, state.region, function(x) names(x)[which.min(x)])
# Northeast South North Central West
# "Maine" "Mississippi" "South Dakota" "New Mexico"
The following, more complicated, code will output state names, regions and incomes.
df1 <- data.frame(
  State = rownames(state.x77),
  Income = state.x77[, "Income"],
  Region = state.region
)
merge(aggregate(Income ~ Region, df1, min), df1)[c(3, 1, 2)]
# State Region Income
#1 South Dakota North Central 4167
#2 Maine Northeast 3694
#3 Mississippi South 3098
#4 New Mexico West 3601
And another solution with aggregate but avoiding merge.
agg <- aggregate(Income ~ Region, df1, min)
i <- match(agg$Income, df1$Income)
data.frame(
  State = df1$State[i],
  Region = df1$Region[i],
  Income = df1$Income[i]
)
# State Region Income
#1 Maine Northeast 3694
#2 Mississippi South 3098
#3 South Dakota North Central 4167
#4 New Mexico West 3601
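
One caveat worth noting: match() on Income alone returns the first position of each value, so if two regions ever shared the same minimum income the wrong row could be picked. A stricter variant (a sketch, matching on the combined Region and Income key) would be:
agg <- aggregate(Income ~ Region, df1, min)
# match on the combined key so an identical income in another region
# cannot be selected by mistake
i <- match(paste(agg$Region, agg$Income), paste(df1$Region, df1$Income))
df1[i, c("State", "Region", "Income")]  # returns the same four states as above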

You can also use this solution:
library(dplyr)
library(tibble)

# state.x77 is a matrix, so make it a data frame first
state2 <- as.data.frame(state.x77)

state2 %>%
  rownames_to_column() %>%
  # bind_cols() adds state.region as an unnamed column, which gets named ...10
  bind_cols(state.region) %>%
  rename(State = rowname,
         Region = ...10) %>%
  group_by(Region, State) %>%
  summarise(Income = sum(Income)) %>%
  arrange(desc(Income)) %>%
  slice_tail(n = 1)
# A tibble: 4 x 3
# Groups: Region [4]
Region State Income
<fct> <chr> <dbl>
1 Northeast Maine 3694
2 South Mississippi 3098
3 North Central South Dakota 4167
4 West New Mexico 3601
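
With dplyr 1.0 or later, slice_min() can express the same idea more directly; a sketch, again converting state.x77 to a data frame first:
library(dplyr)
library(tibble)

as.data.frame(state.x77) %>%
  rownames_to_column("State") %>%
  mutate(Region = state.region) %>%
  group_by(Region) %>%
  slice_min(Income, n = 1)  # one row per region (plus any ties)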


How to fill in blanks based off another column's values in R

Here is the code and the resulting data frame:
category <- c("East","BLANK","NorthEast","BLANK","BLANK")
subcat <- c("East","North","SW","NE","SE")
data1 <- as.data.frame(category)
data1$subcat <- subcat
**Category** **Subcat**
East East
BLANK North
NorthEast SW
BLANK NE
BLANK SE
The East category contains subcats East and North. The NorthEast category contains subcats SW, NE, and SE.
As you can see, there are blanks for each category. How would I make it so the 2nd value in Category is East and the 4th and 5th rows are NorthEast? I have many more rows in the actual data, so a general way to do this would be helpful.
The result should be
**Category** **Subcat**
East East
*East* North
NorthEast SW
*NorthEast* NE
*NorthEast* SE
Here is a dplyr-only solution: first we group on a running count of the non-BLANK rows, then replace every group member with the group's first value:
library(dplyr)

data1 %>%
  group_by(x = cumsum(category != "BLANK")) %>%
  mutate(category = first(category)) %>%
  ungroup() %>%
  select(-x)
category subcat
<chr> <chr>
1 East East
2 East North
3 NorthEast SW
4 NorthEast NE
5 NorthEast SE
We could convert the 'BLANK' entries to NA and use fill:
library(tidyr)
library(dplyr)

data1 %>%
  mutate(category = na_if(category, "BLANK")) %>%
  fill(category)
Output:
category subcat
1 East East
2 East North
3 NorthEast SW
4 NorthEast NE
5 NorthEast SE
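
For completeness, the same forward fill can be sketched in base R without any packages (assuming category is a character column, which is the data.frame() default since R 4.0):
# index of the most recent non-BLANK row for each position
idx <- cumsum(data1$category != "BLANK")
data1$category <- data1$category[data1$category != "BLANK"][idx]
data1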

Create a ggplot with grouped factor levels

This is variation on a question asked here: Group factor levels in ggplot.
I have a dataframe:
df <- data.frame(respondent = factor(c(1, 2, 3, 4, 5, 6, 7)),
                 location = factor(c("California", "Oregon", "Mexico",
                                     "Texas", "Canada", "Mexico", "Canada")))
There are three separate levels related to the US. I don't want to collapse them, as the distinction between states is useful for data analysis. I would, however, like a basic barplot that combines the three US states and stacks them on top of one another, so that there are three bars in the barplot (Canada, Mexico, and US), with the US bar divided into the three states.
If the state factor levels had the "US" in their names, e.g. "US: California", I could use
library(tidyverse)
with_states <- df %>%
  separate(location, into = c("Country", "State"), sep = ": ") %>%
  replace_na(list(State = "Other")) %>%
  mutate(State = as.factor(State) %>% fct_relevel("Other", after = Inf))
to achieve the desired outcome. But how can this be done when R doesn't know that the three states are in the US?
If you look at the previous example, all the separate and replace_na functions do is separate the location variable into a country and state variable:
df
respondent location
1 1 US: California
2 2 US: Oregon
3 3 Mexico
...
df %>%
  separate(location, into = c("Country", "State"), sep = ": ") %>%
  replace_na(list(State = "Other"))
respondent Country State
1 1 US California
2 2 US Oregon
3 3 Mexico Other
...
So really all you need to do is get your data into this format: a column for country and a column for state/province.
There are many ways to do this yourself. Many times your data will already be in this format. If it isn't, the easiest way to fix it is to do a join to a table which maps location to country:
df
respondent location
1 1 California
2 2 Oregon
3 3 Mexico
4 4 Texas
5 5 Canada
6 6 Mexico
7 7 Canada
state_mapping <- data.frame(state = c("California", "Oregon", "Texas"),
                            country = c("US", "US", "US"),
                            stringsAsFactors = FALSE)

df %>%
  left_join(state_mapping, by = c("location" = "state")) %>%
  mutate(country = if_else(is.na(country),
                           as.character(location),
                           country))
respondent location country
1 1 California US
2 2 Oregon US
3 3 Mexico Mexico
4 4 Texas US
5 5 Canada Canada
6 6 Mexico Mexico
7 7 Canada Canada
Once you've got it in this format, you can just do what the other question suggested.
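
For completeness, a minimal sketch of the plot itself, assuming the joined data frame from above: counting respondents per country and filling by location stacks the three US states inside the US bar.
library(dplyr)
library(ggplot2)

df %>%
  left_join(state_mapping, by = c("location" = "state")) %>%
  mutate(country = if_else(is.na(country), as.character(location), country)) %>%
  ggplot(aes(x = country, fill = location)) +
  geom_bar()  # stacked counts: one bar per country, one segment per location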

R column mapping

How can I map a column of one CSV file to a column of another CSV file in R, when both are of the same data type?
For example, the first column of data frame A consists of text with a country name in it, while a column of data frame B contains a standard list of all countries. I now have to map all rows of the first data frame to the standard country column.
For example, the column (location) of data frame A consists of 10,000 rows of data like this:
Sydney, Australia
Aarhus C, Central Region, Denmark
Auckland, New Zealand
Mumbai Area, India
Singapore
df1 <- data.frame(col1 = 1:5, col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"))
Now I have another column (country) in data frame B:
India
USA
New Zealand
UK
Singapore
Denmark
China
df2 <- data.frame(col1=1:7, col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"))
If the location column matches the country column, then I want to replace that location with the country name; otherwise it should remain as it is. Sample output:
Sydney, Australia
Denmark
New Zealand
India
Singapore
Initially, it looked like a trivial question, but it's not. This approach works like this:
1. We convert the location string into a vector using unlist and strsplit.
2. Then we check whether any string in the vector is present in the country column. If it is, we store the country name in res; if not, we store 'notfound'.
3. Finally, we check whether res contains a country name or not.
library(stringr)  # str_trim() comes from stringr

df1 <- data.frame(location = c('Sydney, Australia',
                               'Aarhus C, Central Region, Denmark',
                               'Auckland, New Zealand',
                               'Mumbai Area, India',
                               'Singapore'),
                  stringsAsFactors = FALSE)
df2 <- data.frame(country = c('India',
                              'USA',
                              'New Zealand',
                              'UK',
                              'Singapore',
                              'Denmark',
                              'China'),
                  stringsAsFactors = FALSE)

get_values <- function(i)
{
  # split the location on commas and trim the whitespace
  val <- unlist(strsplit(i, split = ','))
  val <- sapply(val, str_trim)
  res <- c()
  for (j in val)
  {
    if (j %in% df2$country) res <- append(res, j)
    else res <- append(res, 'notfound')
  }
  # if no piece matched a country, keep the original string
  if (all(res == 'notfound')) return(i)
  else return(res[res != 'notfound'])
}

df1$location2 <- sapply(df1$location, get_values)
location location2
1 Sydney, Australia Sydney, Australia
2 Aarhus C, Central Region, Denmark Denmark
3 Auckland, New Zealand New Zealand
4 Mumbai Area, India India
5 Singapore Singapore
A solution using tidyverse. First, please convert your col2 to character by setting stringsAsFactors = FALSE because that is easier to work with.
We can use str_extract to extract the matched country name, and then create a new col2 with mutate and ifelse.
library(tidyverse)

df3 <- df1 %>%
  mutate(Country = str_extract(col2, paste0(df2$col2, collapse = "|")),
         col2 = ifelse(is.na(Country), col2, Country)) %>%
  select(-Country)
df3
# col1 col2
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
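
A small robustness note, not part of the original answer: paste0(df2$col2, collapse = "|") treats the country names as a regular expression. That is harmless here because none of the names contain regex metacharacters, but for arbitrary lookup tables it is safer to escape them first, e.g. with stringr::str_escape() (stringr 1.5+); a sketch:
library(dplyr)
library(stringr)

# escape the lookup values before building the alternation pattern
pattern <- paste0(str_escape(df2$col2), collapse = "|")
df1 %>%
  mutate(Country = str_extract(col2, pattern),
         col2 = ifelse(is.na(Country), col2, Country)) %>%
  select(-Country)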
We can also start with df1, use separate_rows to separate the country name. After that, use semi_join to check if the country names are in df2. Finally, we can combine the data frame with the original df1 by rows, and then filter the first one for each id in col1. df3 is the final output.
library(tidyverse)
df3 <- df1 %>%
  separate_rows(col2, sep = ", ") %>%
  semi_join(df2, by = "col2") %>%
  bind_rows(df1) %>%
  group_by(col1) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(col1)
df3
# # A tibble: 5 x 2
# col1 col2
# <int> <chr>
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
DATA
df1 <- data.frame(col1 = 1:5,
                  col2 = c("Sydney, Australia", "Aarhus C, Central Region, Denmark",
                           "Auckland, New Zealand", "Mumbai Area, India", "Singapore"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = 1:7,
                  col2 = c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"),
                  stringsAsFactors = FALSE)
If you are looking for the countries and they come after the cities, then you can do something like this:
transform(df1,col3= sub(paste0(".*,\\s*(",paste0(df2$col2,collapse="|"),")"),"\\1",col2))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore
Breakdown:
> A=sub(".*,\\s(.*)","\\1",df1$col2)
> B=sapply(A,grep,df2$col2,value=T)
> transform(df1,col3=replace(A,!lengths(B),col2[!lengths(B)]))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore
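
Whichever variant is used, a quick sanity check (not part of the original answers) shows which rows contain none of the countries in df2; here only "Sydney, Australia" remains unmatched, because Australia is not in df2:
# rows whose location contains none of the countries in df2$col2
unmatched <- !grepl(paste0(df2$col2, collapse = "|"), df1$col2)
df1$col2[unmatched]
# [1] "Sydney, Australia"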

Finding Average of One Column Based on 2 Other Columns RStudio

I currently have a data frame that has three columns (City, State and Income). I wrote an example of the data below...
City State Income
Addison Illinois 71,000
Addison Illinois 101,000
Addison Illinois 81,000
Addison Texas 74,000
As you can see, there are repeats of the cities. There are several Addison, IL rows because income differs by the zip code/area of the city.
I want to take the average of all incomes in a given city and state. In this example I want the average of all the Addison, IL rows but NOT including Addison, Texas.
I am looking for this (in this example):
City State MeanIncome
Addison Illinois 84,333
Addison Texas 74,000
I tried this:
Income_By_City <- aggregate( Income ~ City, df, mean )
But it gave me the average of ALL Addisons, including Texas...
Is there a way to take the average of the Income column based on City AND State?
I am pretty new to coding, so I'm not sure if this is a simple question. But I would appreciate any help I can get.
df <- data.frame(City = c("Addison", "Addison", "Addison", "Addison"),
                 State = c("Illinois", "Illinois", "Illinois", "Texas"),
                 Income = c(71000, 101000, 81000, 74000))
library(dplyr)
df %>%
  group_by(City, State) %>%
  summarise(MeanIncome = mean(Income))
# City State MeanIncome
#1 Addison Illinois 84333.33
#2 Addison Texas 74000.00
Here is a dplyr solution:
library(tidyverse)
df <- tribble(
  ~City,     ~State,     ~Income,
  "Addison", "Illinois",  71000,
  "Addison", "Illinois", 101000,
  "Addison", "Illinois",  81000,
  "Addison", "Texas",     74000
)

df %>%
  group_by(City, State) %>%
  mutate(AverageIncome = mean(Income))
# A tibble: 4 x 4
# Groups: City, State [2]
City State Income AverageIncome
<chr> <chr> <dbl> <dbl>
1 Addison Illinois 71000 84333.33
2 Addison Illinois 101000 84333.33
3 Addison Illinois 81000 84333.33
4 Addison Texas 74000 74000.00
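
Since the question started from aggregate(), it is worth adding that the same summary is also available in base R by grouping on both columns; a sketch using the df defined above:
aggregate(Income ~ City + State, df, mean)
#      City    State   Income
# 1 Addison Illinois 84333.33
# 2 Addison    Texas 74000.00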

state.division index in R

I'm asked to use the state.x77 data set and find the minimum income for each division defined by state.division and then use the state.name to find the name of the state that is in New England that has the minimum income. I'm getting some weird answers. Does anyone know what I'm doing wrong?
x <- tapply(state.x77$Income, state.division, min)
x
New England Middle Atlantic South Atlantic East South Central
3694 4449 3617 3098
West South Central East North Central West North Central Mountain
3378 4458 4167 3601
Pacific
4660
x1 <- tapply(state.x77$Income, state.name[state.division], min)
x1
Alabama Alaska Arizona Arkansas California Colorado
3694 4449 3617 3098 3378 4458
Connecticut Delaware Florida
4167 3601 4660
I personally tend to go straight for dplyr, where you could use either
library(dplyr)

# state.x77 is a matrix, so first combine it with state.division into a data frame
state_df <- data.frame(state.x77, state.division)

result <- state_df %>%
  group_by(state.division) %>%
  filter(Income == min(Income))
if you want to preserve all minimum value rows (as in, if there are two minimums) or
state_df %>%
  group_by(state.division) %>%
  slice(which.min(Income))
if you want only one minimum value row.
If you want to only use the base package, you could try using ave() with min:
state_df[state_df$Income == ave(state_df$Income, state_df$state.division, FUN = min), ]
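
And since the exercise also asks for the New England state with the minimum income, a small base-R sketch using which.min on just that division:
ne_income <- state.x77[state.division == "New England", "Income"]
names(which.min(ne_income))
# [1] "Maine"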
