Merge two data frames by a max number condition in R

Cheers, I have a data frame df1 with the Major City with max visitors in 2011.
df1:
Country City Visitors_2011
UK London 100000
USA Washington D.C 200000
USA New York 100000
France Paris 100000
The other data frame df2 consists of top visited cities in the country for 2012:
df2:
Country City Visitors_2012
USA Washington D.C 200000
USA New York 100000
USA Los Angeles 100000
UK London 100000
UK Manchester 100000
France Paris 100000
France Nice 100000
The Output I would need is:
Logic: To obtain df3, merge df1 and df2 by Country and City; if a city from df2 is not present in df1, add its visitor count to the biggest city of the same country in df1.
Example: The Los Angeles visitor count is added to Washington D.C because Los Angeles is not present in df1 and Washington D.C has more visitors (2012) than New York.
df3:
Country City Visitors_2011 Visitors_2012
UK London 100000 200000
USA Washington D.C 200000 300000
USA New York 100000 100000
France Paris 100000 200000
Can anyone point me to the right direction?

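Assume df1.txt and df2.txt contain your space-delimited data frames, with the spaces inside city names replaced by underscores (e.g. New_York) so that read.table parses each city as a single field.
Here is a solution in base R:
df1 <- read.table("df1.txt", header = TRUE, stringsAsFactors = FALSE);
df2 <- read.table("df2.txt", header = TRUE, stringsAsFactors = FALSE);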
# Merge with all = TRUE, see ?merge
df <- merge(df1, df2, all = TRUE);
# Deal with missing values
tmp <- lapply(split(df, df$Country), function(x) {
# Make sure NA's are at the bottom
x <- x[order(x$Visitors_2011), ];
# Select the first max Visitors_2012 entry among the cities present in df1
idx <- which.max(ifelse(is.na(x$Visitors_2011), NA, x$Visitors_2012));
# Add any NA's to max entry
x$Visitors_2012[idx] <- x$Visitors_2012[idx] + sum(x$Visitors_2012[is.na(x$Visitors_2011)]);
# Return dataframe
return(x[!is.na(x$Visitors_2011), ])});
# Bind list entries into dataframe
df <- do.call(rbind, tmp);
print(df);
Country City Visitors_2011 Visitors_2012
France France Paris 100000 200000
UK UK London 100000 200000
USA.6 USA New_York 100000 100000
USA.7 USA Washington_D.C 200000 300000

A dplyr approach:
library(dplyr)
max.cities <- df1 %>% group_by(Country) %>% summarise(City = City[which.max(Visitors_2011)])
result <- df2 %>% mutate(City=ifelse(City %in% df1$City, City,
max.cities$City[match(Country, max.cities$Country)])) %>%
group_by(Country,City) %>%
summarise(Visitors_2012=sum(Visitors_2012)) %>%
left_join(df1,., by=c("Country", "City"))
Notes:
First, compute the City with the max visitors for each Country in df1 (via group_by) and store the result in a separate data frame, max.cities.
Then mutate the City column in df2 so that if the City is in df1, the name is unchanged; otherwise, the City from max.cities that matches the Country is used.
Once the City has been suitably modified, group_by both Country and City and sum up the Visitors_2012.
Finally, left_join with df1 by c("Country", "City") to get the final result.
The result using your posted data is as expected:
print(result)
## Country City Visitors_2011 Visitors_2012
##1 UK London 100000 200000
##2 USA Washington D.C 200000 300000
##3 USA New York 100000 100000
##4 France Paris 100000 200000
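For anyone who wants to try the same idea without dplyr or input files, here is a minimal base R sketch with the posted data typed in by hand. It assumes, like the dplyr answer, that the "biggest city" is the one with the most 2011 visitors in df1; the helper names top, known and totals are just for illustration and not part of either answer.

```r
# Data from the question, entered inline (city names contain spaces).
df1 <- data.frame(
  Country = c("UK", "USA", "USA", "France"),
  City = c("London", "Washington D.C", "New York", "Paris"),
  Visitors_2011 = c(100000, 200000, 100000, 100000),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Country = c("USA", "USA", "USA", "UK", "UK", "France", "France"),
  City = c("Washington D.C", "New York", "Los Angeles", "London",
           "Manchester", "Paris", "Nice"),
  Visitors_2012 = c(200000, 100000, 100000, 100000, 100000, 100000, 100000),
  stringsAsFactors = FALSE
)

# For each country, the df1 city with the most 2011 visitors (assumption:
# this is what "biggest city" means). Sort descending within country, then
# keep the first row per country.
top <- df1[order(df1$Country, -df1$Visitors_2011), ]
top <- top[!duplicated(top$Country), c("Country", "City")]

# A df2 row is "known" only if its Country/City pair occurs in df1.
known <- paste(df2$Country, df2$City, sep = "|") %in%
  paste(df1$Country, df1$City, sep = "|")

# Redirect unknown cities to their country's top city, then sum per city.
df2$City[!known] <- top$City[match(df2$Country[!known], top$Country)]
totals <- aggregate(Visitors_2012 ~ Country + City, data = df2, FUN = sum)

# Left join onto df1 (merge uses the shared Country and City columns).
df3 <- merge(df1, totals, all.x = TRUE)
print(df3)
```

Note that merge sorts the result by the key columns, so the row order differs from df1 even though the values match the expected df3.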

Related

How can I add the country name to a dataset based on city name and population? [duplicate]

I have a dataset containing information on a range of cities, but there is no column which says what country the city is located in. In order to perform the analysis, I need to add an extra column which has the name of the country.
population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin
I expect the output to look like this:
population city country
500,000 Oslo Norway
750,000 Bristol England
500,000 Liverpool England
1,000,000 Dublin Ireland
How can I add a column of country names based on the city and population to a large dataset in R?
I am adapting Tom Hoel's answer, as suggested by Ian Campbell. If this is selected I am happy to mark it as community wiki.
library(maps)
library(dplyr)
data("world.cities")
df <- readr::read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df |>
inner_join(
select(world.cities, name, country.etc, pop),
by = c("city" = "name")
) |> group_by(city) |>
filter(
abs(pop - population) == min(abs(pop - population))
)
# A tibble: 4 x 4
# Groups: city [4]
# population city country.etc pop
# <dbl> <chr> <chr> <int>
# 1 500000 Oslo Norway 821445
# 2 750000 Bristol UK 432967
# 3 500000 Liverpool UK 468584
# 4 1000000 Dublin Ireland 1030431
As others have noted, cities with these names exist in other countries as well.
library(tidyverse)
library(maps)
data("world.cities")
df <- read_table("population city
500,000 Oslo
750,000 Bristol
500,000 Liverpool
1,000,000 Dublin")
df %>%
merge(., world.cities %>%
select(name, country.etc),
by.x = "city",
by.y = "name")
# A tibble: 7 × 3
city population country.etc
<chr> <dbl> <chr>
1 Bristol 750000 UK
2 Bristol 750000 USA
3 Dublin 1000000 USA
4 Dublin 1000000 Ireland
5 Liverpool 500000 UK
6 Liverpool 500000 Canada
7 Oslo 500000 Norway
I think your best bet would be to add a new column called country to your dataset and fill it in yourself. This is part of the data preparation phase of the CRISP-DM process, so it is not uncommon. If that does not answer your question, please let me know and I will do my best to help.

If/Else statement in R

I have two dataframes in R:
city price bedroom
San Jose 2000 1
Barstow 1000 1
NA 1500 1
Code to recreate:
data = data.frame(city = c('San Jose', 'Barstow'), price = c(2000,1000, 1500), bedroom = c(1,1,1))
and:
Name Density
San Jose 5358
Barstow 547
Code to recreate:
population_density = data.frame(Name=c('San Jose', 'Barstow'), Density=c(5358, 547));
I want to create an additional column named city_type in the data dataset based on a condition: if the city's population density is above 1000, it is Urban; if it is lower than 1000, it is a Suburb; and if the city is NA, the result is NA.
city price bedroom city_type
San Jose 2000 1 Urban
Barstow 1000 1 Suburb
NA 1500 1 NA
I am using a for loop for conditional flow:
for (row in 1:length(data)) {
if (is.na(data[row,'city'])) {
data[row, 'city_type'] = NA
} else if (population[population$Name == data[row,'city'],]$Density>=1000) {
data[row, 'city_type'] = 'Urban'
} else {
data[row, 'city_type'] = 'Suburb'
}
}
The for loop runs with no error in my original dataset with over 20000 observations; however, it yields a lot of wrong results (it yields NA for the most part).
What has gone wrong here and how can I do better to achieve my desired result?
I have become quite a fan of dplyr pipelines for this type of join/filter/mutate workflow. So here is my suggestion:
library(dplyr)
# I had to add that extra "NA" there, did you not? Hm...
data <- data.frame(city = c('San Jose', 'Barstow', NA), price = c(2000,1000, 500), bedroom = c(1,1,1))
population <- data.frame(Name=c('San Jose', 'Barstow'), Density=c(5358, 547));
data %>%
# join the two dataframes by matching up the city name columns
left_join(population, by = c("city" = "Name")) %>%
# add your new column based on the desired condition
mutate(
city_type = ifelse(Density >= 1000, "Urban", "Suburb")
)
Output:
city price bedroom Density city_type
1 San Jose 2000 1 5358 Urban
2 Barstow 1000 1 547 Suburb
3 <NA> 500 1 NA <NA>
Use ifelse to create city_type in population_density, then use match to bring it into data:
population_density$city_type=ifelse(population_density$Density>1000,'Urban','Suburb')
data$city_type=population_density$city_type[match(data$city,population_density$Name)]
data
city price bedroom city_type
1 San Jose 2000 1 Urban
2 Barstow 1000 1 Suburb
3 <NA> 1500 1 <NA>
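On the "what has gone wrong" part of the question: one likely cause is that length(data) on a data frame returns the number of columns, not rows, so the loop only visits the first few rows and leaves the rest as NA. The loop also refers to population while the lookup table is created as population_density. Below is a minimal sketch of a corrected loop, assuming the lookup table is population_density and that cities missing from it should stay NA (that last part is my assumption, not something the question states).

```r
data <- data.frame(city = c("San Jose", "Barstow", NA),
                   price = c(2000, 1000, 1500),
                   bedroom = c(1, 1, 1),
                   stringsAsFactors = FALSE)
population_density <- data.frame(Name = c("San Jose", "Barstow"),
                                 Density = c(5358, 547),
                                 stringsAsFactors = FALSE)

# length() counts columns of a data frame; nrow() counts rows.
n_cols <- length(data)
n_rows <- nrow(data)

# Start with NA everywhere, then fill in the rows we can classify.
data$city_type <- NA_character_
for (row in seq_len(nrow(data))) {
  city <- data[row, "city"]
  if (is.na(city)) next                 # unknown city: leave NA
  dens <- population_density$Density[population_density$Name == city]
  if (length(dens) == 0) next           # city not in lookup table: leave NA
  data[row, "city_type"] <- if (dens[1] >= 1000) "Urban" else "Suburb"
}
print(data)
```

The vectorised answers above are still the better choice for 20000 rows; the loop is only here to show where the original went wrong.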

R column mapping

How can I map a column of one CSV file to a column of another CSV file in R, if both have the same data type?
For example, the first column of data frame A consists of text containing a country name, while a column of the second data frame B contains a standard list of all countries. Now I have to map all rows of the first data frame to the standard country column.
For example column (location) of data frame A consist 10000 rows of data like this
Sydney, Australia
Aarhus C, Central Region, Denmark
Auckland, New Zealand
Mumbai Area, India
Singapore
df1 <- data.frame(col1 = 1:5, col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"))
Now I have another column (country) of data frame B as
India
USA
New Zealand
UK
Singapore
Denmark
China
df2 <- data.frame(col1=1:7, col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"))
If the location column matches an entry in the country column, I want to replace that location with the country name; otherwise it should remain as it is. Sample output:
Sydney, Australia
Denmark
New Zealand
India
Singapore
Initially, it looked like a trivial question, but it's not. This approach works like this:
1. We convert the location string into a vector using unlist and strsplit.
2. Then we check whether any string in the vector appears in the country column. If it does, we store the country name in res; if not, we store notfound.
3. Finally, we check whether res contains a country name or not.
df1 <- data.frame(location = c('Sydney, Australia',
'Aarhus C, Central Region, Denmark',
'Auckland, New Zealand',
'Mumbai Area, India',
'Singapore'),stringsAsFactors = F)
df2 <- data.frame(country = c('India',
'USA',
'New Zealand',
'UK',
'Singapore',
'Denmark',
'China'),stringsAsFactors = F)
library(stringr) # provides str_trim
get_values <- function(i)
{
val <- unlist(strsplit(i, split = ','))
val <- sapply(val, str_trim)
res <- c()
for(j in val)
{
if(j %in% df2$country) res <- append(res, j)
else res <- append(res, 'notfound')
}
if(all(res == 'notfound')) return (i)
else return (res[res!='notfound'])
}
df1$location2 <- sapply(df1$location, get_values)
location location2
1 Sydney, Australia Sydney, Australia
2 Aarhus C, Central Region, Denmark Denmark
3 Auckland, New Zealand New Zealand
4 Mumbai Area, India India
5 Singapore Singapore
A solution using tidyverse. First, please convert your col2 to character by setting stringsAsFactors = FALSE because that is easier to work with.
We can use str_extract to extract the matched country name, and then create a new col2 with mutate and ifelse.
library(tidyverse)
df3 <- df1 %>%
mutate(Country = str_extract(col2, paste0(df2$col2, collapse = "|")),
col2 = ifelse(is.na(Country), col2, Country)) %>%
select(-Country)
df3
# col1 col2
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
We can also start with df1 and use separate_rows to split each location into its comma-separated parts. After that, use semi_join to keep only the parts that are country names in df2. Finally, we append the original df1 with bind_rows and keep the first row for each id in col1, so rows without a matching country fall back to their original text. df3 is the final output.
library(tidyverse)
df3 <- df1 %>%
separate_rows(col2, sep = ", ") %>%
semi_join(df2, by = "col2") %>%
bind_rows(df1) %>%
group_by(col1) %>%
slice(1) %>%
ungroup() %>%
arrange(col1)
df3
# # A tibble: 5 x 2
# col1 col2
# <int> <chr>
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
DATA
df1 <- data.frame(col1 = 1:5,
col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"),
stringsAsFactors = FALSE)
df2 <- data.frame(col1=1:7,
col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"),
stringsAsFactors = FALSE)
If you are looking for the countries, and they come after the cities then you can do something like this.
transform(df1,col3= sub(paste0(".*,\\s*(",paste0(df2$col2,collapse="|"),")"),"\\1",col2))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore
Breakdown:
> A=sub(".*,\\s(.*)","\\1",df1$col2)
> B=sapply(A,grep,df2$col2,value=T)
> transform(df1,col3=replace(A,!lengths(B),col2[!lengths(B)]))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore

R factor with overlapping level ranges

Hi, I have been struggling with a problem for a couple of days and haven't found an answer yet.
Suppose I have a dataset with the columns Country and Population. The Country is encoded as numbers, so the raw dataset looks like this:
df <- data.frame(country=c(1,2,3,4,5,6), population=c(10000,20000,30000,4000,50000,60000))
df
country population
1 1 10000
2 2 20000
3 3 30000
4 4 4000
5 5 50000
6 6 60000
I want country to be a factor with the following levels: France, Germany, Canada, USA, India, China and Europe, America, Asia.
That is to say, a factor combining:
df$country <- factor(df$country, labels = c("France", "Germany", "Canada", "USA", "India", "China"))
df
country population
1 France 10000
2 Germany 20000
3 Canada 30000
4 USA 4000
5 India 50000
6 China 60000
and
df$country <- cut(df$country, breaks = c(0,2,4,6),labels = c("Europe", "America", "Asia"))
df
country population
1 Europe 10000
2 Europe 20000
3 America 30000
4 America 4000
5 Asia 50000
6 Asia 60000
My aim is to do something like:
tapply(df$population, df$country, sum)
with a result like this:
France Germany Canada USA India China Europe America Asia
10000 20000 30000 4000 50000 60000 30000 34000 110000
Is there a way to do this without creating a third column in the data frame?
I hope it is understandable what my problem is.
I already tried interaction(), but that's not what I want.
The following function from the plyr package splits your data frame into sub-data-frames (one per country) and then sums the population values within each. The t function just transposes the resulting data frame.
> library(plyr)
> neu <- ddply(df, .(country), summarise, Summe = sum(population))
> t(neu)
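A single factor cannot put one row into two levels at once, so here is a hedged base R sketch of one way to get the combined output the question shows without adding a column: run tapply twice, once per grouping, and concatenate the results. The names by_country, by_region and result are mine, and I am assuming the sixth label was meant to be China, as in the expected output.

```r
df <- data.frame(country = c(1, 2, 3, 4, 5, 6),
                 population = c(10000, 20000, 30000, 4000, 50000, 60000))

# Grouping 1: one level per country code (labels assumed from the question).
by_country <- tapply(
  df$population,
  factor(df$country,
         labels = c("France", "Germany", "Canada", "USA", "India", "China")),
  sum
)

# Grouping 2: country codes binned into regions, as in the cut() call above.
by_region <- tapply(
  df$population,
  cut(df$country, breaks = c(0, 2, 4, 6),
      labels = c("Europe", "America", "Asia")),
  sum
)

# Concatenate both summaries into one named vector.
result <- c(by_country, by_region)
print(result)
```

Both groupings are built on the fly inside tapply, so df itself keeps only its two original columns.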

Manipulating data.frames

I have a sample survey sheet with demographic-style data. One of the columns is country (a factor) and another is annual income. I need to calculate the average income for each country and store it in a new data.frame with the country and its corresponding mean. It should be simple, but I am lost. The data looks something like this:
Country Income($) Education ... ... ...
1. USA 90000 Phd
2. UK 94000 Undergrad
3. USA 94000 Highschool
4. UK 87000 Phd
5. Russia 77000 Undergrad
6. Norway 60000 Masters
7. Korea 90000 Phd
8. USA 110000 Masters
.
.
I need a final result like:
USA UK Russia ...
98000 90000 75000
Thank You.
data example:
dat <- read.table(text="Country Income Education
USA 90000 Phd
UK 94000 Undergrad
USA 94000 Highschool
UK 87000 Phd
Russia 77000 Undergrad
Norway 60000 Masters
Korea 90000 Phd
USA 110000 Masters",header=TRUE)
You can do what you want with plyr, assuming your data is called dat:
library(plyr)
newdf <- ddply(dat, .(Country), function(x) Countrymean = mean(x$Income))
# newdf <- ddply(dat, .(Country), function(x) data.frame(Income = mean(x$Income)))
or with aggregate:
newdf <- aggregate(Income ~ Country, data = dat, FUN = mean)
For the output you show at the end, maybe tapply?
tapply(dat$Income, dat$Country, mean)
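As a small follow-up sketch on the tapply route: the result is a named vector whose groups come out in sorted order of the country names, not in the order the question lists them. If the first-appearance order matters, indexing by name is one way to get it. The variable means is just an illustrative name, and the data is the eight sample rows from above.

```r
dat <- read.table(text = "Country Income Education
USA 90000 Phd
UK 94000 Undergrad
USA 94000 Highschool
UK 87000 Phd
Russia 77000 Undergrad
Norway 60000 Masters
Korea 90000 Phd
USA 110000 Masters", header = TRUE, stringsAsFactors = FALSE)

# Mean income per country; c() turns the 1-d array into a plain named vector.
means <- c(tapply(dat$Income, dat$Country, mean))

# Reorder by first appearance in the data (USA, UK, Russia, ...).
means <- means[unique(dat$Country)]
print(means)
```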
