R duplicate rows based on the elements in a string column [duplicate] - r

I have a more or less specific question that probably pertains to loops in R. I have a dataframe:
X location year
1 North Dakota, Minnesota, Michigan 2011
2 California, Tennessee 2012
3 Bastrop County (Texas) 2013
4 Dallas (Texas) 2014
5 Shasta (California) 2015
6 California, Oregon, Washington 2011
I have two problems with this data: 1) I need a column that consists of just the state names of each row. I guess this should be generally easy with gsub and using a list of all US state names.
list <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "etc")
pat <- paste0("\\b(", paste0(list, collapse="|"), ")\\b")
pat
data$state <- gsub(data$location, "", paragraph)
The bigger issue for me is 2) I need an individual (duplicate) row for each state that is in the dataset. So if row 6 has California, Oregon and Washington in 2011, I need a separate row for each of them, like this:
X location year
1 California 2011
2 Oregon 2011
3 Washington 2011
Thank you for your help!

You can use str_extract_all to extract all the states and unnest to duplicate the rows so that each state ends up in its own row. The built-in constant state.name holds the US state names and can be used to build the pattern.
library(dplyr)
pat <- paste0("\\b", state.name, "\\b", collapse = "|")
df %>%
mutate(states = stringr::str_extract_all(location, pat)) %>%
tidyr::unnest(states)
# A tibble: 11 x 3
# location year states
# <chr> <int> <chr>
# 1 North Dakota, Minnesota, Michigan 2011 North Dakota
# 2 North Dakota, Minnesota, Michigan 2011 Minnesota
# 3 North Dakota, Minnesota, Michigan 2011 Michigan
# 4 California, Tennessee 2012 California
# 5 California, Tennessee 2012 Tennessee
# 6 Bastrop County (Texas) 2013 Texas
# 7 Dallas (Texas) 2014 Texas
# 8 Shasta (California) 2015 California
# 9 California, Oregon, Washington 2011 California
#10 California, Oregon, Washington 2011 Oregon
#11 California, Oregon, Washington 2011 Washington
data
df <- structure(list(location = c("North Dakota, Minnesota, Michigan",
"California, Tennessee", "Bastrop County (Texas)", "Dallas (Texas)",
"Shasta (California)", "California, Oregon, Washington"), year = c(2011L,
2012L, 2013L, 2014L, 2015L, 2011L)), class = "data.frame", row.names = c(NA, -6L))
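For readers who would rather avoid the tidyverse, the same extract-then-duplicate idea can be sketched in base R. This is only an illustration under the assumption that the data look like the df above; the names found and out are made up for the example, and rows without any state name would simply be dropped.

```r
# Rebuild the example data so the sketch is self-contained
df <- data.frame(
  location = c("North Dakota, Minnesota, Michigan", "California, Tennessee",
               "Bastrop County (Texas)", "Dallas (Texas)",
               "Shasta (California)", "California, Oregon, Washington"),
  year = c(2011L, 2012L, 2013L, 2014L, 2015L, 2011L),
  stringsAsFactors = FALSE
)

# One alternation of all state names, bounded by word boundaries
pat <- paste0("\\b(", paste(state.name, collapse = "|"), ")\\b")

# List with one character vector of matched states per input row
found <- regmatches(df$location, gregexpr(pat, df$location, perl = TRUE))

# Repeat each original row once per state it contains
out <- data.frame(
  location = rep(df$location, lengths(found)),
  year     = rep(df$year, lengths(found)),
  state    = unlist(found),
  stringsAsFactors = FALSE
)
out
```

lengths(found) gives the number of states per row, so rep() duplicates location and year exactly as often as needed.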

Related

r deleting certain rows of dataframe based on multiple columns

I have a dataframe that looks like this:
Year Name Place Job
2010 Jim USA CEO
2010 Jim Canada Advisor
2010 Jim Canada Board Member
2011 Jim USA CEO
2017 Peter Mexico COO
2019 Peter Korea CEO
2019 Peter China Advisor
2013 Harry USA Chairman
2014 Harry Canada CEO
2015 Harry Canada CEO
2015 Harry Canada Advisor
I want to remove certain rows from the above dataframe based on the "Year" and "Name" columns. Basically, every row whose Year/Name combination occurs in the list below (also a dataframe) should be removed:
Year Name
2010 Jim
2019 Peter
2013 Harry
2014 Harry
Thus, the final output looks like:
Year Name Place Job
2011 Jim USA CEO
2017 Peter Mexico COO
2015 Harry Canada CEO
2015 Harry Canada Advisor
base R
While dplyr (below) has anti_join, in base R one needs to merge and find the rows that did not match and remove them by hand.
# using the `rem` frame, augmenting a little
rem$keep <- FALSE
tmp <- merge(dat, rem, by = c("Year", "Name"), all.x = TRUE)
tmp
# Year Name Place Job keep
# 1 2010 Jim USA CEO FALSE
# 2 2010 Jim Canada Advisor FALSE
# 3 2010 Jim Canada Board Member FALSE
# 4 2011 Jim USA CEO NA
# 5 2013 Harry USA Chairman FALSE
# 6 2014 Harry Canada CEO FALSE
# 7 2015 Harry Canada CEO NA
# 8 2015 Harry Canada Advisor NA
# 9 2017 Peter Mexico COO NA
# 10 2019 Peter Korea CEO FALSE
# 11 2019 Peter China Advisor FALSE
tmp[ is.na(tmp$keep), ]
# Year Name Place Job keep
# 4 2011 Jim USA CEO NA
# 7 2015 Harry Canada CEO NA
# 8 2015 Harry Canada Advisor NA
# 9 2017 Peter Mexico COO NA
dplyr
dplyr::anti_join(dat, rem, by = c("Year", "Name"))
# Year Name Place Job
# 1 2011 Jim USA CEO
# 2 2017 Peter Mexico COO
# 3 2015 Harry Canada CEO
# 4 2015 Harry Canada Advisor
Data
dat <- structure(list(Year = c(2010L, 2010L, 2010L, 2011L, 2017L, 2019L, 2019L, 2013L, 2014L, 2015L, 2015L), Name = c("Jim", "Jim", "Jim", "Jim", "Peter", "Peter", "Peter", "Harry", "Harry", "Harry", "Harry"), Place = c("USA", "Canada", "Canada", "USA", "Mexico", "Korea", "China", "USA", "Canada", "Canada", "Canada"), Job = c("CEO", "Advisor", "Board Member", "CEO", "COO", "CEO", "Advisor", "Chairman", "CEO", "CEO", "Advisor")), row.names = c(NA, -11L), class = "data.frame")
rem <- structure(list(Year = c(2010L, 2019L, 2013L, 2014L), Name = c("Jim", "Peter", "Harry", "Harry")), class = "data.frame", row.names = c(NA, -4L))
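If adding a helper column and merging feels heavy, a possible base R shortcut is to compare a combined Year/Name key directly. The sketch below is an assumption-laden illustration, not part of the original answer: make_key and kept are hypothetical names, and the separator is chosen so that it cannot occur in the data.

```r
# Rebuild the example data so the sketch is self-contained
dat <- data.frame(
  Year  = c(2010L, 2010L, 2010L, 2011L, 2017L, 2019L, 2019L, 2013L, 2014L, 2015L, 2015L),
  Name  = c("Jim", "Jim", "Jim", "Jim", "Peter", "Peter", "Peter",
            "Harry", "Harry", "Harry", "Harry"),
  Place = c("USA", "Canada", "Canada", "USA", "Mexico", "Korea", "China",
            "USA", "Canada", "Canada", "Canada"),
  Job   = c("CEO", "Advisor", "Board Member", "CEO", "COO", "CEO", "Advisor",
            "Chairman", "CEO", "CEO", "Advisor"),
  stringsAsFactors = FALSE
)
rem <- data.frame(
  Year = c(2010L, 2019L, 2013L, 2014L),
  Name = c("Jim", "Peter", "Harry", "Harry"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one string per row that identifies the Year/Name pair
make_key <- function(d) paste(d$Year, d$Name, sep = "\r")

# Keep only the rows whose key does not appear in `rem`
kept <- dat[!(make_key(dat) %in% make_key(rem)), ]
kept
```

Unlike merge, this keeps the original row order and adds no extra column, at the cost of building a string key for every row.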
Using data.table
library(data.table)
# `dat` and `rem` are the data frames from the Data section above
setDT(dat)[!rem, on = .(Year, Name)]
-output
# Year Name Place Job
#1: 2011 Jim USA CEO
#2: 2017 Peter Mexico COO
#3: 2015 Harry Canada CEO
#4: 2015 Harry Canada Advisor
Another approach:
library(dplyr)
library(stringr)
dat %>% mutate(x = str_c(Year, Name)) %>%
filter(str_detect(x, str_c(str_c(rem$Year,rem$Name), collapse = '|'), negate = TRUE)) %>%
select(-x)
Year Name Place Job
1 2011 Jim USA CEO
2 2017 Peter Mexico COO
3 2015 Harry Canada CEO
4 2015 Harry Canada Advisor
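We could use str_c to build a combined Year/Name key and filter on it (matching on Year alone would also drop rows for other names in the same year):
library(dplyr)
library(stringr)
dat %>%
filter(!str_c(Year, Name, sep = "_") %in% str_c(rem$Year, rem$Name, sep = "_"))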
Output:
Year Name Place Job
1 2011 Jim USA CEO
2 2017 Peter Mexico COO
3 2015 Harry Canada CEO
4 2015 Harry Canada Advisor
A base R option using merge + subset
subset(
merge(
dat,
cbind(rem, Removal = 1),
all = TRUE
),
is.na(Removal),
select = -Removal
)
gives
Year Name Place Job
4 2011 Jim USA CEO
7 2015 Harry Canada CEO
8 2015 Harry Canada Advisor
9 2017 Peter Mexico COO

Split a data frame column based on a comma [duplicate]

I have a data frame with the following structure, titled "final_proj_data"
ID County Population Year
<dbl> <chr> <dbl> <dbl>
1003 Baldwin County, Alabama 169162 2006
1015 Calhoun County, Alabama 112903 2006
1043 Cullman County, Alabama 80187 2006
1049 DeKalb County, Alabama 68014 2006
I am trying to split the column County into two different columns, County and State, and remove the comma.
I tried a number of permutations of the separate() function but I keep getting back this error:
Error: var must evaluate to a single number or a column name, not a
character vector
I've tried (among others)
final_proj_data %>%
separate(final_proj_data$County, c("State", "County"), sep = ",", remove = TRUE)
final_proj_data %>%
separate(data = final_proj_data, col = County,
into = c("State", "County"), sep = ",")
I'm not sure what I am doing wrong, or why the "col =" keeps throwing this error. Any help would be appreciated!
Using dplyr and base R:
library(dplyr)
final_proj_data %>%
mutate(State=unlist(lapply(strsplit(County,", "),function(x) x[2])),
County=gsub(",.*","",County))
ID County Population Year State
1 1003 Baldwin County 169162 2006 Alabama
2 1015 Calhoun County 112903 2006 Alabama
3 1043 Cullman County 80187 2006 Alabama
4 1049 DeKalb County 68014 2006 Alabama
Original:
With dplyr and tidyr (just seen that @Ronak Shah commented the same above):
library(dplyr)
library(tidyr)
final_proj_data %>%
separate(County,c("County","State"),sep=", ")
ID County State Population Year
1 1003 Baldwin County Alabama 169162 2006
2 1015 Calhoun County Alabama 112903 2006
3 1043 Cullman County Alabama 80187 2006
4 1049 DeKalb County Alabama 68014 2006
We can try using sub here for a base R option:
County <- sub(",.*$", "", final_proj_data$County)
State <- sub("^.*,\\s*", "", final_proj_data$County)
final_proj_data$County <- County
final_proj_data$State <- State
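The two sub() calls can also be folded into a single pattern with two capture groups. The following is a minimal sketch using strcapture() from the utils package (available in R 3.4.0 and later); the variable name parts is made up for the example, and the pattern assumes exactly one "County, State" pair per value.

```r
# Rebuild the example data so the sketch is self-contained
final_proj_data <- data.frame(
  ID = c(1003L, 1015L, 1043L, 1049L),
  County = c("Baldwin County, Alabama", "Calhoun County, Alabama",
             "Cullman County, Alabama", "DeKalb County, Alabama"),
  Population = c(169162L, 112903L, 80187L, 68014L),
  Year = c(2006L, 2006L, 2006L, 2006L),
  stringsAsFactors = FALSE
)

# Group 1: everything before the first comma; group 2: the rest,
# with any whitespace after the comma skipped
parts <- strcapture(
  "^([^,]*),\\s*(.*)$",
  final_proj_data$County,
  proto = data.frame(County = character(), State = character(),
                     stringsAsFactors = FALSE)
)

final_proj_data$County <- parts$County
final_proj_data$State  <- parts$State
final_proj_data
```

Values that do not match the pattern come back as NA in both columns, which makes malformed entries easy to spot.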
We can do this in base R using read.csv
final_proj_data[c("County", "State")] <- read.csv(text = final_proj_data$County,
header = FALSE, stringsAsFactors = FALSE, strip.white = TRUE)
final_proj_data
# ID County Population Year State
#1 1003 Baldwin County 169162 2006 Alabama
#2 1015 Calhoun County 112903 2006 Alabama
#3 1043 Cullman County 80187 2006 Alabama
#4 1049 DeKalb County 68014 2006 Alabama
data
final_proj_data <- structure(list(ID = c(1003L, 1015L, 1043L, 1049L),
County = c("Baldwin County, Alabama",
"Calhoun County, Alabama", "Cullman County, Alabama", "DeKalb County, Alabama"
), Population = c(169162L, 112903L, 80187L, 68014L), Year = c(2006L,
2006L, 2006L, 2006L)), class = "data.frame", row.names = c(NA,
-4L))
We can use strsplit in base R.
cbind(d, `colnames<-`(do.call(rbind, strsplit(d$County, ", ")), c("County", "State")))[-2]
# ID Population Year County State
# 1 1003 169162 2006 Baldwin County Alabama
# 2 1015 112903 2006 Calhoun County Alabama
# 3 1043 80187 2006 Cullman County Alabama
# 4 1049 68014 2006 DeKalb County Alabama
Note: Use strsplit(as.character(d$County), ", ") if County is a factor column.
Data
d <- structure(list(ID = c("1003", "1015", "1043", "1049"), County = c("Baldwin County, Alabama",
"Calhoun County, Alabama", "Cullman County, Alabama", "DeKalb County, Alabama"
), Population = c("169162", "112903", "80187", "68014"), Year = c("2006",
"2006", "2006", "2006")), row.names = c(NA, -4L), class = "data.frame")

Interpolate based on multiple conditions in r

Beginner r user here. I have a dataset of yearly employment numbers for different industry classifications and different subregions. For some observations, the number of employees is null. I would like to fill these values through linear interpolation (using na.approx or some other method). However, I only want to interpolate within the same industry classification and subregion.
For example, I have this:
subregion <- c("East Bay", "East Bay", "East Bay", "East Bay", "East Bay", "South Bay")
industry <-c("A","A","A","A","A","B" )
year <- c(2013, 2014, 2015, 2016, 2017, 2002)
emp <- c(50, NA, NA, 80,NA, 300)
data <- data.frame(cbind(subregion,industry,year, emp))
subregion industry year emp
1 East Bay A 2013 50
2 East Bay A 2014 <NA>
3 East Bay A 2015 <NA>
4 East Bay A 2016 80
5 East Bay A 2017 <NA>
6 South Bay B 2002 300
I need to generate this table, leaving the fifth observation uninterpolated because the next observation belongs to a different subregion and industry.
subregion industry year emp
1 East Bay A 2013 50
2 East Bay A 2014 60
3 East Bay A 2015 70
4 East Bay A 2016 80
5 East Bay A 2017 <NA>
6 South Bay B 2002 300
Articles like this have been helpful, but I cannot figure out how to adapt the solution to match the requirement that two columns be the same for interpolation to occur, instead of one. Any help would be appreciated.
We could group by subregion and industry and apply na.approx (from zoo) within each group:
library(tidyverse)
data %>%
group_by(subregion, industry) %>%
mutate(emp = zoo::na.approx(emp, na.rm = FALSE))
# A tibble: 6 x 4
# Groups: subregion, industry [2]
# subregion industry year emp
# <fct> <fct> <dbl> <dbl>
#1 East Bay A 2013 50
#2 East Bay A 2014 60
#3 East Bay A 2015 70
#4 East Bay A 2016 80
#5 East Bay A 2017 NA
#6 South Bay B 2002 300
data
data <- data.frame(subregion,industry,year, emp)
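The same grouped interpolation can be sketched without extra packages by combining ave() with approx(). This is only an illustration: interp is a hypothetical helper, it interpolates by position within each group (the rows are assumed to be sorted by year), and it leaves leading or trailing NAs untouched.

```r
# Rebuild the example data so the sketch is self-contained
data <- data.frame(
  subregion = c("East Bay", "East Bay", "East Bay", "East Bay", "East Bay", "South Bay"),
  industry  = c("A", "A", "A", "A", "A", "B"),
  year      = c(2013, 2014, 2015, 2016, 2017, 2002),
  emp       = c(50, NA, NA, 80, NA, 300),
  stringsAsFactors = FALSE
)

# Hypothetical helper: linear interpolation of interior NAs in one group
interp <- function(v) {
  ok <- !is.na(v)
  if (sum(ok) < 2) return(v)  # nothing to interpolate between
  # approx() returns NA outside the range of known points by default
  approx(seq_along(v)[ok], v[ok], xout = seq_along(v))$y
}

# ave() applies the helper separately to each subregion/industry combination
data$emp <- ave(data$emp, data$subregion, data$industry, FUN = interp)
data
```

Because ave() splits on both grouping columns, values are never interpolated across a change of subregion or industry.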

extracting country name from city name in R

This question may look like a duplicate, but I am facing some issues while extracting country names from strings. I have gone through the question "Extracting Country Name from Author Affiliations" but was not able to solve my problem. I have tried grepl and a for loop for text matching and replacement, but my data column has more than 300k rows, so that approach is very slow.
I have a column like this.
org_loc
Zug
Zug Canton of Zug
Zimbabwe
Zigong
Zhuhai
Zaragoza
York United Kingdom
Delhi
Yalleroi Queensland
Waterloo Ontario
Waterloo ON
Washington D.C.
Washington D.C. Metro
New York
df$org_loc <- c("zug", "zug canton of zug", "zimbabwe",
"zigong", "zhuhai", "zaragoza","York United Kingdom", "Delhi","Yalleroi Queensland","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","New York")
The string may contain the name of a state, city or country. I just want the country as output, like this:
org_loc
Switzerland
Switzerland
Zimbabwe
China
China
Spain
United Kingdom
India
Australia
Canada
Canada
United States
United States
United States
I am trying to convert a state (if a match is found) to its country using the countrycode library, but have not been able to do so. Any help would be appreciated.
You can use your City_and_province_list.csv as a custom dictionary for countrycode. The custom dictionary can not have duplicates in the origin vector (the City column in your City_and_province_list.csv), so you'll have to remove them or deal with them somehow first (as in my example below). Currently, you don't have all of the possible strings in your example in your lookup CSV, so they are not all converted, but if you added all of the possible strings to the CSV, it would work completely.
library(countrycode)
org_loc <- c("Zug", "Zug Canton of Zug", "Zimbabwe", "Zigong", "Zhuhai",
"Zaragoza", "York United Kingdom", "Delhi",
"Yalleroi Queensland", "Waterloo Ontario", "Waterloo ON",
"Washington D.C.", "Washington D.C. Metro", "New York")
df <- data.frame(org_loc)
city_country <- read.csv("https://raw.githubusercontent.com/girijesh18/dataset/master/City_and_province_list.csv")
# custom_dict for countrycode cannot have duplicate origin codes
city_country <- city_country[!duplicated(city_country$City), ]
df$country <- countrycode(df$org_loc, "City", "Country",
custom_dict = city_country)
df
# org_loc country
# 1 Zug Switzerland
# 2 Zug Canton of Zug <NA>
# 3 Zimbabwe <NA>
# 4 Zigong China
# 5 Zhuhai China
# 6 Zaragoza Spain
# 7 York United Kingdom <NA>
# 8 Delhi India
# 9 Yalleroi Queensland <NA>
# 10 Waterloo Ontario <NA>
# 11 Waterloo ON <NA>
# 12 Washington D.C. <NA>
# 13 Washington D.C. Metro <NA>
# 14 New York United States of America
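The custom dictionary is essentially a lookup table, and the idea can be sketched in base R with match(), which is vectorised and should cope well with a few hundred thousand rows. The tiny lookup table below is a made-up stand-in for the CSV, and "Atlantis" is a deliberately unknown entry; it is not the countrycode API.

```r
# Hypothetical, hand-made lookup table standing in for the CSV
lookup <- data.frame(
  City    = c("zug", "zigong", "delhi"),
  Country = c("Switzerland", "China", "India"),
  stringsAsFactors = FALSE
)

org_loc <- c("Zug", "Zigong", "Delhi", "Atlantis")

# Lower-case the input so the lookup is case-insensitive,
# then pick the country at the matching position (NA if absent)
country <- lookup$Country[match(tolower(org_loc), lookup$City)]
country
```

As with the custom dictionary, only exact matches are converted, so longer strings such as "Zug Canton of Zug" would still need their own entries or some preprocessing.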
library(countrycode)
df <- c("zug switzerland", "zug canton of zug switzerland", "zimbabwe",
"zigong chengdu pr china", "zhuhai guangdong china", "zaragoza","York United Kingdom", "Yamunanagar","Yalleroi Queensland Australia","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","USA")
df1 <- countrycode(df, 'country.name', 'country.name')
It didn't match a lot of them, but that should do what you're looking for, based on the reference manual for countrycode.
With the function geocode from the package ggmap you can accomplish your task with good, but not total, accuracy. You must also use your own judgment about homonyms: geocode returns "Zaragoza" as a city in Spain and not somewhere in Argentina, since it tends to give you the biggest city when several places share a name.
(remove the $country to see all of the output)
library(ggmap)
org_loc <- c("zug", "zug canton of zug", "zimbabwe",
"zigong", "zhuhai", "zaragoza","York United Kingdom",
"Delhi","Yalleroi Queensland","Waterloo Ontario","Waterloo ON","Washington D.C.","Washington D.C. Metro","New York")
geocode(org_loc, output = "more")$country
Because geocode is backed by Google, it has a query limit (2,500 per day per IP address). If it returns NAs, this may be due to an inconsistent limit check; just try again.

Merging data by 2 variables in R

I am attempting to merge two data sets. In the past I have used merge() with by equal to the variable I want to merge by. However, now I would like to do so with two variables. My first data set looks something like this:
Year Winning_Tm Losing_Tm
2011 Texas Washington
2012 Alabama South Carolina
2013 Tennessee Texas
Then I have another data set with a rank of each team (this is very simplified) for each year. Like this:
Year Team Rank
2011 Texas 32
2011 Washington 34
2012 South Carolina 45
2012 Alabama 12
2013 Texas 6
2013 Tennessee 51
I would like to merge them so I have a data set that looks like this:
Year Winning_Tm Winning_TM_rank Losing_Tm Losing_Tm_rank
2011 Texas 32 Washington 34
2012 Alabama 12 South Carolina 45
2013 Tennessee 51 Texas 6
My hope is that there is a simple way to do this but it may be more complicated. Thanks!
I reproduced your data (try to include a dput of it next time):
A <- data.frame(
Year = c(2011, 2012, 2013),
Winning_Tm = c("Texas","Alabama","Tennessee"),
Losing_Tm = c("Washington","South Carolina", "Texas"),
stringsAsFactors = FALSE
)
B <- data.frame(
Year = c("2011","2011","2012","2012","2013","2013"),
Team = c("Texas","Washington","South Carolina","Alabama","Texas","Tennessee"),
Rank = c(32,34,45,12,6,51),
stringsAsFactors = FALSE
)
You can melt the first dataframe using the reshape2 package:
library(reshape2)
A <- melt(A, id.vars = "Year")
names(A)[3] <- "Team"
Now it looks like this:
> A
Year variable Team
1 2011 Winning_Tm Texas
2 2012 Winning_Tm Alabama
3 2013 Winning_Tm Tennessee
4 2011 Losing_Tm Washington
5 2012 Losing_Tm South Carolina
6 2013 Losing_Tm Texas
You can then merge the datasets together by the two columns of interest:
AB <- merge(A, B, by=c("Year","Team"))
Which looks like this:
> AB
Year Team variable Rank
1 2011 Texas Winning_Tm 32
2 2011 Washington Losing_Tm 34
3 2012 Alabama Winning_Tm 12
4 2012 South Carolina Losing_Tm 45
5 2013 Tennessee Winning_Tm 51
6 2013 Texas Losing_Tm 6
Then using the reshape command from base R you can change AB to a wide format:
reshape(AB, idvar = "Year", timevar = "variable", direction = "wide")
The result:
Year Team.Winning_Tm Rank.Winning_Tm Team.Losing_Tm Rank.Losing_Tm
1 2011 Texas 32 Washington 34
3 2012 Alabama 12 South Carolina 45
5 2013 Tennessee 51 Texas 6
Two separate merges. You would need to wrap your list of by variables in c(), and since the variables have different names, you need by.x and by.y. Afterward you could rename the rank variables.
I'll call your data winlose and teamrank, respectively. Then you'd need:
first_merge <- merge(winlose, teamrank, by.x = c('Year', 'Winning_Tm'), by.y = c('Year', 'Team'))
second_merge <- merge(first_merge, teamrank, by.x = c('Year', 'Losing_Tm'), by.y = c('Year', 'Team'))
Renaming the variables:
names(second_merge)[names(second_merge) == 'Rank.x'] <- 'Winning_Tm_rank'
names(second_merge)[names(second_merge) == 'Rank.y'] <- 'Losing_Tm_rank'
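If the original row order matters, a possible alternative to merging twice is to look the ranks up with match() on a combined Year/Team key. This is a sketch under the assumption that each Year/Team pair appears only once in the ranking table; rank_key is a made-up name.

```r
# Rebuild the example data so the sketch is self-contained
winlose <- data.frame(
  Year       = c(2011, 2012, 2013),
  Winning_Tm = c("Texas", "Alabama", "Tennessee"),
  Losing_Tm  = c("Washington", "South Carolina", "Texas"),
  stringsAsFactors = FALSE
)
teamrank <- data.frame(
  Year = c(2011, 2011, 2012, 2012, 2013, 2013),
  Team = c("Texas", "Washington", "South Carolina", "Alabama", "Texas", "Tennessee"),
  Rank = c(32, 34, 45, 12, 6, 51),
  stringsAsFactors = FALSE
)

# One key per ranking row, e.g. "2011|Texas"
rank_key <- paste(teamrank$Year, teamrank$Team, sep = "|")

# Look up the rank of the winner and of the loser for each game
winlose$Winning_Tm_rank <- teamrank$Rank[
  match(paste(winlose$Year, winlose$Winning_Tm, sep = "|"), rank_key)]
winlose$Losing_Tm_rank <- teamrank$Rank[
  match(paste(winlose$Year, winlose$Losing_Tm, sep = "|"), rank_key)]
winlose
```

Teams without a ranking for that year get NA instead of being dropped, which differs from the default inner join performed by merge.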
If you are familiar with SQL, a rather complicated but fast way to do this all in one step would be:
library(sqldf)
res <- sqldf("SELECT l.*,
max(case when l.Winning_Tm = r.Team then r.Rank else 0 end) as Winning_Tm_rank,
max(case when l.Losing_Tm = r.Team then r.Rank else 0 end) as Losing_Tm_rank
FROM df1 as l
inner join df2 as r
on (l.Winning_Tm = r.Team
OR l.Losing_Tm = r.Team)
AND l.Year = r.Year
group by l.Year, l.Winning_Tm, l.Losing_Tm")
res
Year Winning_Tm Losing_Tm Winning_Tm_rank Losing_Tm_rank
1 2011 Texas Washington 32 34
2 2012 Alabama South_Carolina 12 45
3 2013 Tennessee Texas 51 6
Data:
df1 <- read.table(header=T,text="Year Winning_Tm Losing_Tm
2011 Texas Washington
2012 Alabama South_Carolina
2013 Tennessee Texas")
df2<- read.table(header=T,text="Year Team Rank
2011 Texas 32
2011 Washington 34
2012 South_Carolina 45
2012 Alabama 12
2013 Texas 6
2013 Tennessee 51")
