R column mapping

How can I map a column of one CSV file to a column of another CSV file in R, assuming both columns have the same data type?
For example, the first column of data frame A contains free text with a country name somewhere in it, while a column of data frame B contains a standard list of all countries. I need to map every row of data frame A to the standard country column.
For example, the location column of data frame A contains 10,000 rows of data like this:
Sydney, Australia
Aarhus C, Central Region, Denmark
Auckland, New Zealand
Mumbai Area, India
Singapore
df1 <- data.frame(col1 = 1:5, col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"))
Now I have a country column in data frame B:
India
USA
New Zealand
UK
Singapore
Denmark
China
df2 <- data.frame(col1=1:7, col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"))
If the location matches an entry in the country column, I want to replace that location with the country name; otherwise it should remain as it is. Sample output:
Sydney, Australia
Denmark
New Zealand
India
Singapore

Initially this looked like a trivial question, but it is not. This approach works like this:
1. Convert the location string into a vector using unlist and strsplit.
2. Check whether any string in the vector appears in the country column. If it does, store the country name in res; if not, store 'notfound'.
3. Finally, check whether res contains a country name or not.
df1 <- data.frame(location = c('Sydney, Australia',
'Aarhus C, Central Region, Denmark',
'Auckland, New Zealand',
'Mumbai Area, India',
'Singapore'),stringsAsFactors = F)
df2 <- data.frame(country = c('India',
'USA',
'New Zealand',
'UK',
'Singapore',
'Denmark',
'China'),stringsAsFactors = F)
library(stringr)  # str_trim() is used below
get_values <- function(i) {
  # split the location string on commas and trim whitespace from each piece
  val <- unlist(strsplit(i, split = ','))
  val <- sapply(val, str_trim)
  res <- c()
  for (j in val) {
    # keep the piece if it is a known country, otherwise mark it as not found
    if (j %in% df2$country) res <- append(res, j)
    else res <- append(res, 'notfound')
  }
  # if no piece matched a country, return the original string unchanged
  if (all(res == 'notfound')) return(i)
  else return(res[res != 'notfound'])
}
df1$location2 <- sapply(df1$location, get_values)
location location2
1 Sydney, Australia Sydney, Australia
2 Aarhus C, Central Region, Denmark Denmark
3 Auckland, New Zealand New Zealand
4 Mumbai Area, India India
5 Singapore Singapore

A solution using the tidyverse. First, convert col2 to character by setting stringsAsFactors = FALSE, because character columns are easier to work with.
We can use str_extract to extract the matched country name, and then create a new col2 with mutate and ifelse:
library(tidyverse)  # str_extract() comes from stringr, which the tidyverse attaches
df3 <- df1 %>%
  mutate(Country = str_extract(col2, paste0(df2$col2, collapse = "|")),
         col2 = ifelse(is.na(Country), col2, Country)) %>%
  select(-Country)
df3
# col1 col2
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
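For reference, a hedged illustration (not part of the original answer) of the alternation pattern that paste0() builds, and of how str_extract() returns NA when no country matches, which is exactly what the ifelse() above falls back on:
# illustrative only
paste0(df2$col2, collapse = "|")
# [1] "India|USA|New Zealand|UK|Singapore|Denmark|China"
str_extract("Sydney, Australia", paste0(df2$col2, collapse = "|"))
# [1] NA
str_extract("Aarhus C, Central Region, Denmark", paste0(df2$col2, collapse = "|"))
# [1] "Denmark"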
We can also start from df1 and use separate_rows to split out the candidate country names. After that, use semi_join to keep only the rows whose col2 appears in df2. Finally, bind the result back onto the original df1 by rows, and for each id in col1 keep only the first row. df3 is the final output.
library(tidyverse)
df3 <- df1 %>%
  separate_rows(col2, sep = ", ") %>%
  semi_join(df2, by = "col2") %>%
  bind_rows(df1) %>%
  group_by(col1) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(col1)
df3
# # A tibble: 5 x 2
# col1 col2
# <int> <chr>
# 1 1 Sydney, Australia
# 2 2 Denmark
# 3 3 New Zealand
# 4 4 India
# 5 5 Singapore
DATA
df1 <- data.frame(col1 = 1:5,
col2=c("Sydney, Australia", "Aarhus C, Central Region, Denmark", "Auckland, New Zealand", "Mumbai Area, India", "Singapore"),
stringsAsFactors = FALSE)
df2 <- data.frame(col1=1:7,
col2=c("India", "USA", "New Zealand", "UK", "Singapore", "Denmark", "China"),
stringsAsFactors = FALSE)

If you are looking for the countries and they come after the cities, then you can do something like this:
transform(df1,col3= sub(paste0(".*,\\s*(",paste0(df2$col2,collapse="|"),")"),"\\1",col2))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore
Breakdown:
# 1. Keep only the text after the last comma (strings without a comma are left unchanged)
A <- sub(".*,\\s(.*)", "\\1", df1$col2)
# 2. Look each candidate up in df2$col2 (an empty result means no match)
B <- sapply(A, grep, df2$col2, value = TRUE)
# 3. Where there was no match, fall back to the original col2 value
transform(df1, col3 = replace(A, !lengths(B), col2[!lengths(B)]))
col1 col2 col3
1 1 Sydney, Australia Sydney, Australia
2 2 Aarhus C, Central Region, Denmark Denmark
3 3 Auckland, New Zealand New Zealand
4 4 Mumbai Area, India India
5 5 Singapore Singapore

Related

Create several columns from a complex column in R

Imagine this dataset:
df1 <- tibble::tribble(~City, ~Population,
"United Kingdom > Leeds", 1500000,
"Spain > Las Palmas de Gran Canaria", 200000,
"Canada > Nanaimo, BC", 150000,
"Canada > Montreal", 250000,
"United States > Minneapolis, MN", 700000,
"United States > Milwaukee, WI", NA,
"United States > Milwaukee", 400000)
I would like to:
1. Split the City column into three columns: City, Country, and State (if available, NA otherwise).
2. Make sure Milwaukee ends up with both a state and a population (the row with the NA population should get the value 400000 once the split into City, State, and Country is done).
Could you please suggest the easiest way to do this?
Here's another solution with extract, which pulls out Country, City, and State in a single pass, with State captured by an optional group (the remainder of the task is done as in @Allen Cameron's answer below):
library(tidyr)
library(dplyr)
df1 %>%
  extract(City,
          into = c("Country", "City", "State"),
          regex = "([^>]+) > ([^,]+),? ?([A-Z]+)?") %>%
  # as in @Allen Cameron's answer:
  group_by(Country, City) %>%
  summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
            Population = Population[!is.na(Population)])
You can use separate twice to get the country and state, then group_by Country and City to summarize away the NA values where appropriate:
library(tidyverse)
df1 %>%
  separate(City, sep = " > ", into = c("Country", "City")) %>%
  separate(City, sep = ', ', into = c('City', 'State')) %>%
  group_by(Country, City) %>%
  summarize(State = ifelse(all(is.na(State)), NA, State[!is.na(State)]),
            Population = Population[!is.na(Population)])
#> # A tibble: 6 x 4
#> # Groups: Country [4]
#> Country City State Population
#> <chr> <chr> <chr> <dbl>
#> 1 Canada Montreal <NA> 250000
#> 2 Canada Nanaimo BC 150000
#> 3 Spain Las Palmas de Gran Canaria <NA> 200000
#> 4 United Kingdom Leeds <NA> 1500000
#> 5 United States Milwaukee WI 400000
#> 6 United States Minneapolis MN 700000

Finding rows that have the minimum of a specific factor group

I am attempting to find the minimum incomes from the state.x77 dataset based on the state.region variable.
df1 <- data.frame(state.region,state.x77,row.names = state.name)
tapply(state.x77,state.region,min)
I am trying to get it to output which state has the lowest income for each region, e.g. for the South, Alabama would be the lowest income. I'm trying to use tapply but I keep getting an error saying:
Error in tapply(state.x77, state.region, min) :
arguments must have same length
What is the issue?
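The error arises because state.x77 is a matrix, not a single column: tapply() sees all 50 x 8 = 400 cells, which does not match the 50 elements of state.region. A quick check (illustrative, not from the original answers):
dim(state.x77)       # 50  8
length(state.x77)    # 400
length(state.region) # 50
# tapply() works once a single column is passed:
tapply(state.x77[, "Income"], state.region, min)
#     Northeast         South North Central          West
#          3694          3098          4167          3601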
Here is a solution. First extract the income vector and make it a named vector. Then use tapply to get the name of the state with the minimum income in each region.
state <- setNames(state.x77[, "Income"], rownames(state.x77))
tapply(state, state.region, function(x) names(x)[which.min(x)])
# Northeast South North Central West
# "Maine" "Mississippi" "South Dakota" "New Mexico"
The following, more complicated, code will output state names, regions and incomes.
df1 <- data.frame(
State = rownames(state.x77),
Income = state.x77[, "Income"],
Region = state.region
)
merge(aggregate(Income ~ Region, df1, min), df1)[c(3, 1, 2)]
# State Region Income
#1 South Dakota North Central 4167
#2 Maine Northeast 3694
#3 Mississippi South 3098
#4 New Mexico West 3601
And another solution with aggregate but avoiding merge.
agg <- aggregate(Income ~ Region, df1, min)
i <- match(agg$Income, df1$Income)
data.frame(
State = df1$State[i],
Region = df1$Region[i],
Income = df1$Income[i]
)
# State Region Income
#1 Maine Northeast 3694
#2 Mississippi South 3098
#3 South Dakota North Central 4167
#4 New Mexico West 3601
You can also use this solution (it assumes state2 <- as.data.frame(state.x77), so that bind_cols() adds state.region as the tenth column, ...10):
library(dplyr)
library(tibble)
# assumes: state2 <- as.data.frame(state.x77)
state2 %>%
  rownames_to_column() %>%
  bind_cols(state.region) %>%
  rename(State = rowname,
         Region = ...10) %>%
  group_by(Region, State) %>%
  summarise(Income = sum(Income)) %>%
  arrange(desc(Income)) %>%
  slice_tail(n = 1)
# A tibble: 4 x 3
# Groups: Region [4]
Region State Income
<fct> <chr> <dbl>
1 Northeast Maine 3694
2 South Mississippi 3098
3 North Central South Dakota 4167
4 West New Mexico 3601

Finding shared column information - a least common ancestor question

I have a data.frame object consisting of columns of information that is tree-like. For instance, I have performed a search of a set of features (query_name) and returned a set of potential matches (match_name). Every match has an associated location that is split into continent, country, region, and town.
The problem I'd like to resolve is finding, for a given query_name, the location information that all potential matches have in common.
For example, with this bit of example data:
query_name <- c(rep("feature1", 3), rep("feature2", 2), rep("feature3", 4))
match_name <- paste0("match", seq(1:9))
continent <- c(rep("NorthAmerica", 3), rep("NorthAmerica", 2), rep("Europe", 4))
country <- c(rep("UnitedStates", 3), rep("Canada", 2), rep("Germany", 4))
region <- c(rep("NewYork", 3), "Ontario", NA, rep("Bayern", 2), rep("Berlin", 2))
town <- c("Manhattan", "Albany", "Buffalo", "Toronto", NA, "Munich", "Nuremberg", "Berlin", "Frankfurt")
data <- data.frame(query_name, match_name, continent, country, region, town)
We'd generate this data.frame object:
query_name match_name continent country region town
1 feature1 match1 NorthAmerica UnitedStates NewYork Manhattan
2 feature1 match2 NorthAmerica UnitedStates NewYork Albany
3 feature1 match3 NorthAmerica UnitedStates NewYork Buffalo
4 feature2 match4 NorthAmerica Canada Ontario Toronto
5 feature2 match5 NorthAmerica Canada <NA> <NA>
6 feature3 match6 Europe Germany Bayern Munich
7 feature3 match7 Europe Germany Bayern Nuremberg
8 feature3 match8 Europe Germany Berlin Berlin
9 feature3 match9 Europe Germany Berlin Frankfurt
I'm hoping to get advice on how to construct a function that will produce the result below. Note that the shared location information is concatenated and separated with a ; delimiter.
Feature1 differs only in the town information, so the returned string contains everything from continent through region.
Feature2 does not differ in region or town across its two matches only because one of the matches contains no information there. Nevertheless, lack of information is treated as distinct from actual values, so the only things shared by feature2's matches are continent and country.
Feature3 has shared continent and country information but distinct region and town, so just continent and country are retained.
Hoping for an output file that looks like this:
query_name location_output
feature1 NorthAmerica;UnitedStates;NewYork;
feature2 NorthAmerica;Canada;;
feature3 Europe;Germany;;
Thanks for any advice you can spare.
Cheers!
Here is an option
library(tidyverse)
data %>%
  gather(key, val, -query_name, -match_name) %>%
  select(-match_name, -key) %>%
  # count how often each location value occurs within a query
  group_by(query_name, val) %>%
  add_count() %>%
  group_by(query_name) %>%
  # keep only the values tied for the highest count
  filter(n == max(n)) %>%
  # collapse the shared, non-NA values into a single ;-separated string
  summarise(location_output = paste0(unique(val[!is.na(val)]), collapse = ";"))
## A tibble: 3 x 2
# query_name location_output
# <fct> <chr>
#1 feature1 NorthAmerica;UnitedStates;NewYork
#2 feature2 NorthAmerica;Canada
#3 feature3 Europe;Germany
This is less elegant than @MauritsEvers' solution (it doesn't automatically handle an arbitrary number of levels), but it ensures that every location_output always contains all four location fields separated by ; delimiters.
library(dplyr)
data %>%
  group_by(query_name) %>%
  summarize(continent = ifelse(n_distinct(continent) == 1, first(continent), ""),
            country = ifelse(n_distinct(country) == 1, first(country), ""),
            region = ifelse(n_distinct(region) == 1, first(region), ""),
            town = ifelse(n_distinct(town) == 1, first(town), "")) %>%
  mutate(location_output = paste(continent, country, region, town, sep = ";")) %>%
  select(query_name, location_output)
A base R alternative: split the data by query_name and, for each group, use rle() on the number of distinct values per column to keep the leading run of columns that hold a single shared value.
lapply(split(data, data$query_name), function(x) {
  x <- x[, -(1:2)]                                    # drop query_name and match_name
  r <- rle(sapply(x, function(d) length(unique(d))))  # distinct values per column
  x[1, seq(r$lengths[1])]                             # keep the leading run of shared columns
})
#$feature1
# continent country region
#1 NorthAmerica UnitedStates NewYork
#$feature2
# continent country
#4 NorthAmerica Canada
#$feature3
# continent country
#6 Europe Germany

Create a ggplot with grouped factor levels

This is a variation on a question asked here: Group factor levels in ggplot.
I have a dataframe:
df <- data.frame(respondent = factor(c(1, 2, 3, 4, 5, 6, 7)),
location = factor(c("California", "Oregon", "Mexico",
"Texas", "Canada", "Mexico", "Canada")))
There are three separate levels related to the US. I don't want to collapse them, as the distinction between states is useful for data analysis. I would, however, like a basic barplot that combines the three US states and stacks them on top of one another, so that there are three bars in the barplot (Canada, Mexico, and US), with the US bar divided into its three states.
If the state factor levels had "US" in their names, e.g. "US: California", I could use
library(tidyverse)
with_states <- df %>%
  separate(location, into = c("Country", "State"), sep = ": ") %>%
  replace_na(list(State = "Other")) %>%
  mutate(State = as.factor(State) %>% fct_relevel("Other", after = Inf))
to achieve the desired outcome. But how can this be done when R doesn't know that the three states are in the US?
If you look at the previous example, all the separate and replace_na functions do is separate the location variable into a country and state variable:
df
respondent location
1 1 US: California
2 2 US: Oregon
3 3 Mexico
...
df %>%
separate(location, into = c("Country", "State"), sep = ": ") %>%
replace_na(list(State = "Other"))
respondent Country State
1 1 US California
2 2 US Oregon
3 3 Mexico Other
...
So really all you need to do is get your data into this format: with a column for country and a column for state/province.
There are many ways to do this yourself. Often your data will already be in this format. If it isn't, the easiest way to fix it is to do a join to a table which maps location to country:
df
respondent location
1 1 California
2 2 Oregon
3 3 Mexico
4 4 Texas
5 5 Canada
6 6 Mexico
7 7 Canada
state_mapping <- data.frame(state = c("California", "Oregon", "Texas"),
country = c('US', 'US', 'US'),
stringsAsFactors = F)
df %>%
  left_join(state_mapping, by = c('location' = 'state')) %>%
  mutate(country = if_else(is.na(country),
                           as.character(location),  # non-US rows keep their location
                           country))
respondent location country
1 1 California US
2 2 Oregon US
3 3 Mexico Mexico
4 4 Texas US
5 5 Canada Canada
6 6 Mexico Mexico
7 7 Canada Canada
Once you've got it in this format, you can just do what the other question suggested.
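For the plot itself, here is a minimal sketch (not from the original answer, reusing df and state_mapping from above): map the derived country to the x axis and fill by location, so the three US states stack into a single US bar.
library(tidyverse)
# sketch only: build the country column as above, then plot
plot_df <- df %>%
  left_join(state_mapping, by = c('location' = 'state')) %>%
  mutate(country = if_else(is.na(country), as.character(location), country))
ggplot(plot_df, aes(x = country, fill = location)) +
  geom_bar() +
  labs(x = "Country", y = "Respondents", fill = "State / Country")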

R make new data frame from current one

I'm trying to calculate the best goal differentials in the group stage of the 2014 World Cup.
football <- read.csv(
file="http://pastebin.com/raw.php?i=iTXdPvGf",
header = TRUE,
strip.white = TRUE
)
football <- head(football,n=48L)
football[which(max(abs(football$home_score - football$away_score)) == abs(football$home_score - football$away_score)),]
Results in
home home_continent home_score away away_continent away_score result
4 Cameroon Africa 0 Croatia Europe 4 l
7 Spain Europe 1 Netherlands Europe 5 l
37 Germany
So those are the games with the highest goal differential, but now I need to make a new data frame that has a team name and abs(football$home_score - football$away_score).
football$score_diff <- abs(football$home_score - football$away_score)
football$winner <- ifelse(football$home_score > football$away_score, as.character(football$home),
ifelse(football$result == "d", NA, as.character(football$away)))
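A possible final step (not spelled out in the answer above): keep only the rows with the largest goal difference and just the team and margin columns.
# sketch: the resulting data frame has the winning team and the goal differential
# (the same three games as in the dplyr answer below)
football[football$score_diff == max(football$score_diff), c("winner", "score_diff")]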
You could save some typing this way. First get the score differences and winners. When the result indicates w, the home team is the winner, so you do not have to look at the scores at all. Once you add the score difference and winner columns, you can keep just the rows where the difference equals its max().
mydf <- read.csv(file = "http://pastebin.com/raw.php?i=iTXdPvGf",
                 header = TRUE, strip.white = TRUE)
mydf <- head(mydf, n = 48L)
library(dplyr)
mutate(mydf, scorediff = abs(home_score - away_score),
       winner = ifelse(result == "w", as.character(home),
                       ifelse(result == "l", as.character(away), "draw"))) %>%
  filter(scorediff == max(scorediff))
# home home_continent home_score away away_continent away_score result scorediff winner
#1 Cameroon Africa 0 Croatia Europe 4 l 4 Croatia
#2 Spain Europe 1 Netherlands Europe 5 l 4 Netherlands
#3 Germany Europe 4 Portugal Europe 0 w 4 Germany
Here is another option that avoids ifelse for creating the winner column. It is based on row/column indexing: the numeric column index is created by matching the result column against its possible values (match(football$result, c('w', 'l', 'd'))), and the row index is just 1:nrow(football). Subset the football data to its home and away columns and cbind an additional draw column of NAs, so that 'd' elements in result map to NA.
football$score_diff <- abs(football$home_score - football$away_score)
# index a home/away/draw table by (row, result) pairs: 'w' -> home, 'l' -> away, 'd' -> NA
football$winner <- cbind(football[c('home', 'away')], draw = NA)[
  cbind(1:nrow(football), match(football$result, c('w', 'l', 'd')))]
football[with(football, score_diff==max(score_diff)),]
# home home_continent home_score away away_continent away_score result
#60 Brazil South America 1 Germany Europe 7 l
# score_diff winner
#60 6 Germany
If the dataset is very big, you could speed up the match by using chmatch from library(data.table)
library(data.table)
chmatch(as.character(football$result), c('w', 'l', 'd'))
NOTE: I used the full dataset in the link
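For reference, a hedged sketch of how chmatch() would slot into the indexing above (same result, just a faster lookup for character vectors):
library(data.table)
# sketch only: chmatch() replaces match() for the column index
football$winner <- cbind(football[c('home', 'away')], draw = NA)[
  cbind(1:nrow(football), chmatch(as.character(football$result), c('w', 'l', 'd')))]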

Resources