How to combine two columns in R data.table like this

I have this data.table:
CITY CITY2
Phoenix NA
NASHVILLE Nashville
Los Angeles Los Angeles
NEWYORK New York
CHICAGO NA
This is the result I want:
CITY
Phoenix
Nashville
Los Angeles
New York
CHICAGO
I tried in many ways and nothing worked. Any idea?

Out of desperation I kept researching and found a solution:
myDataTable[ is.na( CITY2 ) & !is.na( CITY ), CITY2 := CITY, ]
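Note that the line above fills the gaps in CITY2, whereas the desired output is a single CITY column. A minimal base R sketch of the same idea, assuming the data is rebuilt as a plain data.frame (the name cities is made up for illustration), prefers CITY2 where it exists and falls back to CITY otherwise:

```r
# Toy copy of the data from the question, as a plain data.frame
cities <- data.frame(
  CITY  = c("Phoenix", "NASHVILLE", "Los Angeles", "NEWYORK", "CHICAGO"),
  CITY2 = c(NA, "Nashville", "Los Angeles", "New York", NA),
  stringsAsFactors = FALSE
)

# Prefer CITY2 where it is present, otherwise keep the value already in CITY
cities$CITY <- ifelse(is.na(cities$CITY2), cities$CITY, cities$CITY2)

# Drop the helper column so only the combined CITY column remains
cities$CITY2 <- NULL
print(cities)
```

This leaves "CHICAGO" in upper case, as in the desired result; any case normalization would be a separate step.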

This is a bit of a mess of a dataframe as you have some desired results in both columns but there appears to be a lack of predictability. Are you sure that city2 has the correct formatting for all values that are not NA?
Either way, there are a couple of methods to get to your final desired answer with the correct capitalization of city name using dplyr and the "tools" package.
library(dplyr)
library(tools)
city_df <- data.frame(
city = c("Phoenix", "NASHVILLE", "Los Angeles", "NEWYORK", "CHICAGO"),
city2 = c(NA, "Nashville", "Los Angeles", "New York", NA),
stringsAsFactors = FALSE)
The first method assumes city_df$city contains all of the cities but is formatted incorrectly.
city_df %>%
mutate(city =
replace(x = city, city == "NEWYORK", values = "New York")) %>%
select(city) %>%
mutate(city = tools::toTitleCase(tolower(city)))
which returns
city
1 Phoenix
2 Nashville
3 Los Angeles
4 New York
5 Chicago
If you need the values of city_df$city replaced with the non-NA values of city_df$city2, you can do the following:
city_df %>%
mutate(city = case_when(
!(is.na(city2)) ~ city2,
is.na(city2) ~ city)) %>%
select(city) %>%
mutate(city = tools::toTitleCase(tolower(city)))
This returns the same column as above.


How to extract matches from stringr::str_detect in R into a list vector

I am trying to perform the following search on a database of text.
Here is the sample database, df
df <- data.frame(
id = c(1, 2, 3, 4, 5, 6),
name = c("john doe", "carol jones", "jimmy smith",
"jenny ruiz", "joey jones", "tim brown"),
place = c("reno nevada", "poland maine", "warsaw poland",
"trenton new jersey", "brooklyn new york", "atlanta georgia")
)
I have a vector of strings which contains terms I am trying to find.
new_search <- c("poland", "jones")
I pass the vector to str_detect to find ANY of the strings in new_search in ANY of the columns in df and then return rows which match...
df %>%
filter_all(any_vars(str_detect(., paste(new_search, collapse = "|"))))
Question... how can I extract the results of str_detect into a new column?
For each row which is returned... I would like to generate a list of the terms which were successfully matched and put them in a list or character vector (matched_terms)...something like this...
id name place matched_terms
1 2 carol jones poland maine c("jones", "poland")
2 3 jimmy smith warsaw poland c("poland")
3 5 joey jones brooklyn new york c("jones")
This is my naive solution:
new_search <- c("poland", "jones") %>% paste(collapse = "|")
df %>%
mutate(new_var = str_extract_all(paste(name, place), new_search))
You can extract all matching patterns from multiple columns with str_extract_all, then combine the results into one column with unite. Because unite pastes the columns into a single string, empty results show up as the text "character(0)"; we remove those with str_remove_all and keep only the rows that have at least one matched term.
library(tidyverse)
pat <- str_c(new_search, collapse = "|")
df %>%
mutate(across(-id, ~str_extract_all(., pat), .names = '{col}_new')) %>%
unite(matched_terms, ends_with('new'), sep = ',') %>%
mutate(matched_terms = str_remove_all(matched_terms,
'character\\(0\\),?|,character\\(0\\)')) %>%
filter(matched_terms != '')
# id name place matched_terms
#1 2 carol jones poland maine jones,poland
#2 3 jimmy smith warsaw poland poland
#3 5 joey jones brooklyn new york jones
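If matched_terms should be a real list column rather than a comma-separated string, a base R sketch along these lines may work. It assumes the df and new_search from the question, and that pasting name and place together before matching is acceptable (the names combined, matches and result are made up here):

```r
df <- data.frame(
  id = 1:6,
  name = c("john doe", "carol jones", "jimmy smith",
           "jenny ruiz", "joey jones", "tim brown"),
  place = c("reno nevada", "poland maine", "warsaw poland",
            "trenton new jersey", "brooklyn new york", "atlanta georgia"),
  stringsAsFactors = FALSE
)
new_search <- c("poland", "jones")

# One regular expression that matches any of the search terms
pat <- paste(new_search, collapse = "|")

# Search name and place together, collecting every match per row
combined <- paste(df$name, df$place)
matches <- regmatches(combined, gregexpr(pat, combined))

# Keep only rows with at least one match and store the terms as a list column
keep <- lengths(matches) > 0
result <- df[keep, ]
result$matched_terms <- lapply(matches[keep], unique)
print(result)
```

Each element of result$matched_terms is a character vector, so it can be inspected with result$matched_terms[[1]] and so on.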

From state and county names to fips in R

I have the following data frame in R and would like to get FIPS codes from it. I tried the fips function in usmap (https://rdrr.io/cran/usmap/man/fips.html), but could not get it to work because I thought the arguments had to be enclosed in double quotes. I then tried paste0(""", df$state, """), but that did not work either. Is there an efficient way to get the FIPS codes?
> df1
state county
1 california napa
2 florida palm beach
3 florida collier
4 florida duval
UPDATE
I can get "\"california\"" by using dQuote. Thanks. After converting each column, I tried the following. How do I deal with this issue?
> df1$state <- dQuote(df1$state, FALSE)
> df1$county <- dQuote(df1$county, FALSE)
> fips(state = df1$state, county = df1$county)
Error in fips(state = df1$state, county = df1$county) :
`county` parameter cannot be used with multiple states.
> fips(state = df1$state[1], county = df1$county[1])
Error in fips(state = df1$state[1], county = df1$county[1]) :
"napa" is not a valid county in "california".
> fips(state = "california", county = "napa")
[1] "06055"
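There is no need to add quotes with dQuote: it inserts literal quote characters into the values, which is why "napa" was no longer recognized. Starting from the original df1, we can split the dataset by state and apply fips to each group, since fips accepts only one state when county is given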
library(usmap)
lapply(split(df1, df1$state), function(x)
fips(state = x$state[1], county = x$county))
#$california
#[1] "06055"
#$florida
#[1] "12099" "12021" "12031"
Or with Map
lst1 <- split(df1$county, df1$state)
Map(fips, lst1, state = names(lst1))
#$california
#[1] "06055"
#$florida
#[1] "12099" "12021" "12031"
Or with tidyverse
library(dplyr)
library(tidyr)
df1 %>%
group_by(state) %>%
summarise(new = list(fips(state = first(state), county = county))) %>%
unnest(c(new))
# A tibble: 4 x 2
# state new
# <chr> <chr>
#1 california 06055
#2 florida 12099
#3 florida 12021
#4 florida 12031
data
df1 <- structure(list(state = c("california", "florida", "florida",
"florida"), county = c("napa", "palm beach", "collier", "duval"
)), class = "data.frame", row.names = c("1", "2", "3", "4"))
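To attach the codes back to the original rows, the split-and-apply pattern can be combined with unsplit. The sketch below is only an illustration: lookup_fips and fips_table are hypothetical stand-ins for usmap::fips (so the example runs without the package), built from the four codes shown in the output above. Like fips, the stand-in accepts a single state and a vector of counties.

```r
df1 <- data.frame(
  state  = c("california", "florida", "florida", "florida"),
  county = c("napa", "palm beach", "collier", "duval"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table holding only the codes shown in this thread
fips_table <- c(
  "california|napa"    = "06055",
  "florida|palm beach" = "12099",
  "florida|collier"    = "12021",
  "florida|duval"      = "12031"
)

# Hypothetical stand-in for usmap::fips(): one state, many counties
lookup_fips <- function(state, county) {
  stopifnot(length(state) == 1)
  unname(fips_table[paste(state, county, sep = "|")])
}

# Split counties by state, look each group up, then restore the row order
by_state <- split(df1$county, df1$state)
codes <- Map(lookup_fips, names(by_state), by_state)
df1$fips <- unsplit(codes, df1$state)
print(df1)
```

With usmap installed, replacing lookup_fips with fips in the Map call should follow the same shape, though that substitution is untested here.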

How can I fuzzy string match multiple strings from different sized data frames?

I would like to match the strings from my first dataset with all of their closest common matches.
Data looks like:
dataset1:
California
Texas
Florida
New York
dataset2:
Californiia
callifoornia
T3xas
Te xas
texas
Fl0 rida
folrida
New york
new york
desired result is:
col_1 col_2 col_3 col4
California Californiia callifoornia
Texas T3xas texas Te xas
Florida folrida Fl0 rida
New York New york new york
The question is:
How do I search for common strings between the first dataset and the
second dataset, and generate a list of terms in the second dataset
that align with each term in the first?
Thanks in advance.
library(fuzzyjoin); library(tidyverse)
dataset1 %>%
stringdist_left_join(dataset2,
max_dist = 3) %>%
rename(col_1 = "states.x") %>%
group_by(col_1) %>%
mutate(col = paste0("col_", row_number() + 1)) %>%
spread(col, states.y)
#Joining by: "states"
## A tibble: 4 x 4
## Groups: col_1 [4]
# col_1 col_2 col_3 col_4
# <chr> <chr> <chr> <chr>
#1 California Californiia callifoornia NA
#2 Florida Fl0 rida folrida NA
#3 New York New york new york NA
#4 Texas T3xas Te xas texas
data:
dataset1 <- data.frame(states = c("California",
"Texas",
"Florida",
"New York"),
stringsAsFactors = F)
dataset2 <- data.frame(stringsAsFactors = F,
states = c(
"Californiia",
"callifoornia",
"T3xas",
"Te xas",
"texas",
"Fl0 rida",
"folrida",
"New york",
"new york"
)
)
I read a bit about stringdist and came up with this. It's a workaround, but I like it. Can definitely be improved:
library(stringdist)
library(janitor)
# Assumes each CSV file has a single column called `name`
ds1a <- read.csv('dataset1')
ds2a <- read.csv('dataset2')
# Rows are the dataset2 strings, columns are the dataset1 strings
distancematrix <- stringdistmatrix(ds2a$name, ds1a$name, useNames = "strings")
df <- data.frame(distancematrix, check.names = FALSE)
# Go through this df column by column: every cell that's < 4 is replaced
# with the column name, otherwise with an empty string
for (j in seq_len(ncol(df))) {
  trigger <- df[, j] < 4
  df[, j] <- ifelse(trigger, names(df)[j], "")
}
# Drop columns that ended up with the same value in every row
df <- remove_constant(df)
write.csv(df, file="~/Desktop/df.csv")
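For comparison, here is a dependency-free sketch of the same thresholding idea using adist from base R (package utils). It is not the method used in either answer: adist computes plain Levenshtein distance, which can differ from the stringdist default, and the cutoff of 3 is an assumption taken from max_dist in the first answer.

```r
dataset1 <- c("California", "Texas", "Florida", "New York")
dataset2 <- c("Californiia", "callifoornia", "T3xas", "Te xas", "texas",
              "Fl0 rida", "folrida", "New york", "new york")

# Levenshtein distances: rows follow dataset1, columns follow dataset2
d <- adist(dataset1, dataset2)

# For each reference string, keep the candidates within the cutoff
max_dist <- 3
matches <- lapply(seq_along(dataset1), function(i) dataset2[d[i, ] <= max_dist])
names(matches) <- dataset1
print(matches)
```

The result is a named list with one character vector per reference string, which can be padded into columns if a wide table is needed.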

Joining two data frames to turn full state names into state abbreviations in R and dplyr [closed]

The first data frame I have includes a column for states called state, but some of the entries are shown as abbreviations (LA, CA, OH), while others have the full name of the state (Louisiana, California, Ohio).
The second data frame I have includes four columns with the following titles:
allCaps (example: ALABAMA)
full (example: Alabama)
twoLetter (example: AL)
threeLetter (example: Ala.)
Is there a way to join the two data frames so that the first data frame only shows the state abbreviations in the state column, replacing the full names with their abbreviations?
EDIT:
I'm going to include pictures, despite having been shot down for doing so before.
This is table one. Each row is a separate tweet that was sent from the respective states. I created it with this code (drawing data from a separate table called tweets):
tweets_per_state <- tweets %>%
filter(country_code == "US" & place_type == "city" | place_type == "admin") %>%
select(place_type, full_name) %>%
mutate(state = ifelse(place_type == "admin", str_sub(full_name, start = 1, end = -6), str_sub(full_name, -2)))
This is table two, which I am trying to join with table one so that where table one shows "Virginia", instead it shows "VA".
One dplyr-based solution is to cross join the two tables through a dummy column, so that every tweet is paired with every state, and then use grepl to replace the state column with the matching twoLetter value.
I have created small data frames with a few rows to demonstrate the solution.
tweets <- data.frame(place_type = rep("city",4),
full_name = c("Los Angeles, CA", "Maitland, FL", "Indianapolis, IN", "Virginia, USA" ),
state = c("CA", "FL", "IN", "Virginia"), stringsAsFactors = F)
# place_type full_name state
#1 city Los Angeles, CA CA
#2 city Maitland, FL FL
#3 city Indianapolis, IN IN
#4 city Virginia, USA Virginia
state <- data.frame(allCaps = c("CALIFORNIA", "FLORIDA", "INDIANA", "VIRGINIA"),
                    full = c("California", "Florida", "Indiana", "Virginia"),
                    twoLetter = c("CA", "FL", "IN", "VA"),
                    threeLetter = c("Calif.", "Fla.", "Ind.", "Va." ),stringsAsFactors = F)
# Dummy key so that the join pairs every tweet with every state
state <- state %>% mutate(dummy = 1)
tweets%>%
  mutate(dummy = 1) %>%
  filter(place_type == "city" | place_type == "admin") %>%
  inner_join(state, by = "dummy") %>%
  rowwise() %>%
  # Keep an existing abbreviation, or swap a full name for its abbreviation
  mutate(state = ifelse(state == twoLetter , state,
                        ifelse(grepl(full, full_name),twoLetter, NA))) %>%
  filter(!is.na(state)) %>%
  select(place_type,full_name,state)
# Result
# place_type full_name state
# <chr> <chr> <chr>
# 1 city Los Angeles, CA CA
# 2 city Maitland, FL FL
# 3 city Indianapolis, IN IN
# 4 city Virginia, USA VA
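When the state column already holds either an abbreviation or an exact full name, a plain lookup with match avoids the cross join. The sketch below is one possible alternative, not the approach above: it uses the state.name and state.abb vectors that ship with R (package datasets) instead of the second data frame, and it assumes full names are spelled exactly as in state.name.

```r
tweets <- data.frame(
  full_name = c("Los Angeles, CA", "Maitland, FL", "Indianapolis, IN", "Virginia, USA"),
  state = c("CA", "FL", "IN", "Virginia"),
  stringsAsFactors = FALSE
)

# Position of each value among the full state names; NA for anything else
idx <- match(tweets$state, state.name)

# Replace full names with their abbreviation, leave other values untouched
tweets$state <- ifelse(is.na(idx), tweets$state, state.abb[idx])
print(tweets)
```

The same two lines would work with the asker's own lookup table by matching against its full column and indexing its twoLetter column.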

Use aggregate function to calculate output in data frame

I've been trying on my own and searching the net and Stack Overflow for a while now, without success. I have a data frame that I subset by applying conditions and selecting columns, but I fail to retrieve aggregated output.
Dataframe mydf:
mydf = list()
mydf = cbind(mydf,
c("New York", "New York", "San Francisco"),
c(4000, 7600, 2500),
c("Bartosz", "Damian", "Maciej"))
mydf = as.data.frame(mydf)
colnames(mydf) = c("city","salary","name")
Let's assume the relevant part of the data frame is returned with:
subset(mydf, city == "New York", select = c(salary, name))
which returns a data frame such as:
salary name
9 4000 Bartosz
10 7600 Damian
Now I need to calculate from the given salary a sum, avg and choose an employee with least salary from above data frame, preferably using one-liner by modifying the above code (I'm guessing it's possible), so that it returns:
for sum: 11600
for avg: 5800
for least: 4000 Bartosz
I've tried things as (1)
subset(mydf, city == "New York", select = sum(salary))
or (2)
x = subset(mydf, city == "New York", select = salary)
min(x)
and many more combinations, which either yield errors saying that the summary function is only defined on a data frame with all-numeric variables (2) or return the same output as the first code without the sum (1).
The problem might be that your dataframe object actually contains a bunch of lists. So if you take
ny.df = subset(mydf, city == "New York", select = c(salary, name))
then any of the subsequent work needs to be peppered with as.numeric calls to translate your lists into vectors. These will give you your answers:
sum(as.numeric(ny.df$salary)) # sum
mean(as.numeric(ny.df$salary)) # avg
ny.df[which(as.numeric(ny.df$salary) == min(as.numeric(ny.df$salary))),] # row with min salary
Alternatively, you can define mydf as a dataframe of vectors instead of a dataframe of lists:
mydf = data.frame(c("New York", "New York", "San Francisco"),
c(4000, 7600, 2500),
c("Bartosz", "Damian", "Maciej"))
colnames(mydf) = c("city","salary","name")
ny.df = subset(mydf, city == "New York", select = c(salary, name))
sum(ny.df$salary)
mean(ny.df$salary)
ny.df[which(ny.df$salary == min(ny.df$salary)),]
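To see why the original construction causes trouble, the following sketch rebuilds mydf exactly as in the question and checks the column types. It then flattens the list columns in place, which is one possible repair if the data frame cannot be rebuilt from scratch (the unlist step is a suggestion, not part of the question).

```r
# Rebuild mydf the way the question does: cbind() onto an empty list
mydf <- list()
mydf <- cbind(mydf,
              c("New York", "New York", "San Francisco"),
              c(4000, 7600, 2500),
              c("Bartosz", "Damian", "Maciej"))
mydf <- as.data.frame(mydf)
colnames(mydf) <- c("city", "salary", "name")

# Every column is a list, which is why sum() and min() complain
print(sapply(mydf, is.list))

# Flatten each list column into an ordinary atomic vector
mydf[] <- lapply(mydf, unlist)

# The usual summaries now work on the subset
ny.df <- subset(mydf, city == "New York", select = c(salary, name))
print(sum(ny.df$salary))
print(mean(ny.df$salary))
print(ny.df[which.min(ny.df$salary), ])
```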
Your mydf was weird so I made my own. I split mydf by city and then obtained the necessary data from running necessary operations (mean, sum, etc.) on each subgroup.
#DATA
mydf = structure(list(city = structure(c(1L, 1L, 2L), .Label = c("New York",
"San Francisco"), class = "factor"), salary = c(4000, 7600, 2500
), name = structure(1:3, .Label = c("Bartosz", "Damian", "Maciej"
), class = "factor")), .Names = c("city", "salary", "name"), row.names = c(NA,
-3L), class = "data.frame")
do.call(rbind, lapply(split(mydf, mydf$city), function(a)
data.frame(employee = a$name[which.min(a$salary)], #employee with least salary
mean = mean(a$salary), #mean salary
sum = sum(a$salary)))) #sum of salary
# employee mean sum
#New York Bartosz 5800 11600
#San Francisco Maciej 2500 2500
There is a simple and fast solution using data.table
library(data.table)
setDT(mydf)[, .( salary_sum = sum(salary),
salary_avg = mean(salary),
name = name[which.min(salary)]), by= city]
# city salary_sum salary_avg name
# 1: New York 11600 5800 Bartosz
# 2: San Francisco 2500 2500 Maciej
your dataset:
mydf = data.frame(city=c("New York", "New York", "San Francisco"),
salary=c(4000, 7600, 2500),
name=c("Bartosz", "Damian", "Maciej"))
Your data frame is structured unusually, as lists within the data frame, which may be causing you issues. Here is a dplyr solution (now edited to find the lowest salary):
library(dplyr)
mydf <- data.frame(
city = c("New York", "New York", "San Francisco"),
salary = c(4000, 7600, 2500),
name = c("Bartosz", "Damian", "Maciej"))
mydf %>%
group_by(city) %>%
mutate(avg = mean(salary),
sum = sum(salary)) %>%
top_n(-1, wt = salary)
# city salary name avg sum
# <fctr> <dbl> <fctr> <dbl> <dbl>
# 1 New York 4000 Bartosz 5800 11600
# 2 San Francisco 2500 Maciej 2500 2500
I think dplyr is what you might be looking for:
library(dplyr)
mydf %>%
group_by(city) %>%
filter (city =="New York") %>%
summarise(mean(salary), sum(salary))
# A tibble: 1 x 3
# city mean(salary) sum(salary)
# <fctr> <dbl> <dbl>
#1 New York 5800 11600
There is a good tutorial at https://rpubs.com/justmarkham/dplyr-tutorial
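Since the title asks about aggregate specifically, here is a base R sketch that uses it. It assumes mydf has been built with ordinary vector columns, as in the answers above; the names totals, means, summary_df and lowest are made up for illustration.

```r
mydf <- data.frame(
  city   = c("New York", "New York", "San Francisco"),
  salary = c(4000, 7600, 2500),
  name   = c("Bartosz", "Damian", "Maciej"),
  stringsAsFactors = FALSE
)

# Sum and mean of salary per city, each computed with aggregate()
totals <- aggregate(salary ~ city, data = mydf, FUN = sum)
means  <- aggregate(salary ~ city, data = mydf, FUN = mean)

# Combine both summaries into one table with distinct column names
summary_df <- merge(totals, means, by = "city", suffixes = c("_sum", "_avg"))
print(summary_df)

# Rows whose salary equals the minimum salary of their own city
lowest <- mydf[mydf$salary == ave(mydf$salary, mydf$city, FUN = min), ]
print(lowest)
```

Filtering with subset(mydf, city == "New York") before these calls would restrict the summaries to a single city.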
