Get nearest n matching strings - r

Hi, I am trying to match each string in one data frame against the strings in another data frame and get the nearest n matches based on a distance score.
Example: each value in the string_2 column (df_2) should be matched against string_1 (df_1) within the same ID group, returning the nearest 3 matches.
ID = c(100, 100,100,100,103,103,103,103,104,104,104,104)
string_1 = c("Jack Daniel","Jac","JackDan","steve","Mark","Dukes","Allan","Duke","Puma Nike","Puma","Nike","Addidas")
df_1 = data.frame(ID,string_1)
ID = c(100, 100, 185, 103,103, 104, 104,104)
string_2 = c("Jack Daniel","Mark","Order","Steve","Mark 2","Nike","Addidas","Reebok")
df_2 = data.frame(ID,string_2)
My output dataframe df_out will look like below.
ID = c(100, 100,185,103,103,104,104,104)
string_2 = c("Jack Daniel","Mark","Order","Steve","Mark 2","Nike","Addidas","Reebok")
nearest_str_match_1 = c("Jack Daniel","JackDan","NA","Duke","Mark","Nike","Addidas","Nike")
nearest_str_match_2 =c("JackDan","Jack Daniel","NA","Dukes","Duke","Addidas","Nike","Puma Nike")
nearest_str_match_3 =c("Jac","Jac","NA","Allan","Allan","Puma","Puma","Addidas")
df_out = data.frame(ID,string_2,nearest_str_match_1,nearest_str_match_2,nearest_str_match_3)
I have tried this manually with the stringdist package ('jw' method) to get the nearest value:
stringdist::stringdist("Jack Daniel","Jack Daniel","jw")
stringdist::stringdist("Jack Daniel","Jac","jw")
stringdist::stringdist("Jack Daniel","JackDan","jw")
Thanks in advance

library(dplyr)
library(tidyr)

merge(df_1, df_2, by = 'ID') %>%
  group_by(string_2) %>%
  # rank each candidate by Jaro-Winkler distance (1 = closest)
  mutate(dist = stringdist::stringdist(string_2, string_1, 'jw') %>%
           rank(ties.method = 'last')) %>%
  slice_min(dist, n = 3) %>%
  pivot_wider(names_from = dist, names_prefix = 'nearest_str_match_',
              values_from = string_1)
# A tibble: 7 x 5
# Groups:   string_2 [7]
     ID string_2    nearest_str_match_1 nearest_str_match_2 nearest_str_match_3
  <dbl> <chr>       <chr>               <chr>               <chr>
1   104 Addidas     Addidas             Nike                Puma
2   100 Jack Daniel Jack Daniel         JackDan             Jac
3   100 Mark        JackDan             Jack Daniel         Jac
4   103 Mark 2      Mark                Duke                Allan
5   104 Nike        Nike                Addidas             Puma
6   104 Reebok      Nike                Puma Nike           Addidas
7   103 Steve       Duke                Dukes               Allan
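Note that merge() performs an inner join, so an ID that appears only in df_2 (185 here) drops out of the result, whereas the desired df_out keeps it with NA. Below is a minimal sketch of one way to keep such rows, assuming the df_1 and df_2 from the question and the stringdist, dplyr and tidyr packages. The names matches and rnk are introduced here for illustration only, and ties are broken by row order rather than by rank(), so tied candidates may come out in a different order than above.

```r
library(dplyr)
library(tidyr)

df_1 <- data.frame(
  ID = c(100, 100, 100, 100, 103, 103, 103, 103, 104, 104, 104, 104),
  string_1 = c("Jack Daniel", "Jac", "JackDan", "steve", "Mark", "Dukes",
               "Allan", "Duke", "Puma Nike", "Puma", "Nike", "Addidas")
)
df_2 <- data.frame(
  ID = c(100, 100, 185, 103, 103, 104, 104, 104),
  string_2 = c("Jack Daniel", "Mark", "Order", "Steve", "Mark 2", "Nike",
               "Addidas", "Reebok")
)

# All candidate pairs within the same ID (inner join: ID 185 has none)
matches <- merge(df_2, df_1, by = "ID") %>%
  mutate(dist = stringdist::stringdist(string_2, string_1, method = "jw")) %>%
  group_by(ID, string_2) %>%
  arrange(dist, .by_group = TRUE) %>%
  mutate(rnk = row_number()) %>%   # 1 = closest candidate
  filter(rnk <= 3) %>%
  ungroup() %>%
  select(-dist) %>%
  pivot_wider(names_from = rnk, names_prefix = "nearest_str_match_",
              values_from = string_1)

# Join back onto df_2 so unmatched IDs are kept and filled with NA
df_out <- left_join(df_2, matches, by = c("ID", "string_2"))
```

The final left_join() also restores the row order of df_2, which merge() does not preserve.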

Related

Efficient way to repeat operations with columns with similar name in R

I am a beginner with R and have found myself repeatedly running into a problem of this kind. Say I have a dataframe with columns:
company, shares_2010, shares_2011, ... , shares_2020, share_price_2010, ... , share_price_2020
TeslaInc 1000 1200 2000 8 40
.
.
.
I then want to go ahead and calculate the market value in each year. Ordinarily I would do it this way:
dataframe <- dataframe %>%
mutate(value_2010 = shares_2010*share_price_2010,
value_2011 = shares_2011*share_price_2011,
.
:
value_2020 = shares_2020*share_price_2020)
Clearly, all of this is rather cumbersome to type out each time and it cannot be made dynamic with respect to the number of time periods included. Is there any clever way to do these operations in one line instead? I am suspecting something may be possible to do with a combination of starts_with() and some lambda function, but I just haven't been able to figure out how to make the correct things multiply yet. Surely the tidyverse must have a better way to do this?
Any help is much appreciated!
You're right, this is a very common situation in data management.
Let's make a minimal, reproducible example:
dat <- data.frame(
company = c("TeslaInc", "Merta"),
shares_2010 = c(1000L, 1500L),
shares_2011 = c(1200L, 1100L),
shareprice_2010 = 8:7,
shareprice_2011 = c(40L, 12L)
)
dat
#> company shares_2010 shares_2011 shareprice_2010 shareprice_2011
#> 1 TeslaInc 1000 1200 8 40
#> 2 Merta 1500 1100 7 12
This dataset has two issues:
It's in a wide format. This is relatively easy to visualise for humans, but it's not ideal for data analysis. We can fix this with pivot_longer() from tidyr.
Each column actually contains two variables: measure (share or share price) and year. We can fix this with separate() from the same package.
library(tidyr)
dat_reshaped <- dat |>
  pivot_longer(shares_2010:shareprice_2011) |>
  separate(name, into = c("name", "year")) |>
  pivot_wider(names_from = name, values_from = value)
dat_reshaped
#> # A tibble: 4 × 4
#> company year shares shareprice
#> <chr> <chr> <int> <int>
#> 1 TeslaInc 2010 1000 8
#> 2 TeslaInc 2011 1200 40
#> 3 Merta 2010 1500 7
#> 4 Merta 2011 1100 12
The last pivot_wider() is needed to have shares and shareprice as two separate columns, for ease of further calculations.
We can finally use mutate() to calculate in one go all the new values.
dat_reshaped |>
dplyr::mutate(value = shares * shareprice)
#> # A tibble: 4 × 5
#> company year shares shareprice value
#> <chr> <chr> <int> <int> <int>
#> 1 TeslaInc 2010 1000 8 8000
#> 2 TeslaInc 2011 1200 40 48000
#> 3 Merta 2010 1500 7 10500
#> 4 Merta 2011 1100 12 13200
I recommend you read this chapter of R4DS to better understand these concepts - it's worth the effort!
I think further analysis will be simpler if you reshape your data long.
Here we can extract shares, share_price, and the year from the header names using pivot_longer. I specify that each header should be split into two pieces at the last _, with the name (aka .value) taken from the beginning of the header (that is, shares or share_price) and placed next to the year taken from the end of the header.
Then the calculation is a simple one-liner.
library(tidyr); library(dplyr)
data.frame(company = "Tesla",
shares_2010 = 5, shares_2011 = 6,
share_price_2010 = 100, share_price_2011 = 110) %>%
pivot_longer(-company,
names_to = c(".value", "year"),
names_pattern = "(.*)_(.*)") %>%
mutate(value = shares * share_price)
# A tibble: 2 × 5
company year shares share_price value
<chr> <chr> <dbl> <dbl> <dbl>
1 Tesla 2010 5 100 500
2 Tesla 2011 6 110 660
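If the wide layout from the question (value_2010, value_2011, ...) is still needed after the calculation, the long result can be pivoted back. This is only a sketch, assuming the two-company sample data used elsewhere in this thread; the name wide is illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  company = c("TeslaInc", "Merta"),
  shares_2010 = c(1000L, 1500L),
  shares_2011 = c(1200L, 1100L),
  share_price_2010 = 8:7,
  share_price_2011 = c(40L, 12L)
)

wide <- df %>%
  # long: one row per company and year, with shares and share_price columns
  pivot_longer(-company,
               names_to = c(".value", "year"),
               names_pattern = "(.*)_(.*)") %>%
  mutate(value = shares * share_price) %>%
  # wide again: one <measure>_<year> column per measure and year
  pivot_wider(names_from = year,
              values_from = c(shares, share_price, value))

wide$value_2011
#> [1] 48000 13200
```

Because the years are taken from the column names, this works for any number of years without listing them.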
I agree with the other posts about pivoting this data into a longer format. Just to add a different approach that works well with this type of example: you can create a list of expressions and then use the splice operator !!! to evaluate these expressions within your context:
library(purrr)
library(dplyr)
library(rlang)
library(glue)
lexprs <- set_names(2010:2011, paste0("value_", 2010:2011)) %>%
map_chr(~ glue("shares_{.x} * share_price_{.x}")) %>%
parse_exprs()
df %>%
mutate(!!! lexprs)
Output
company shares_2010 shares_2011 share_price_2010 share_price_2011 value_2010
1 TeslaInc 1000 1200 8 40 8000
2 Merta 1500 1100 7 12 10500
value_2011
1 48000
2 13200
Data
Thanks to Andrea M
structure(list(company = c("TeslaInc", "Merta"), shares_2010 = c(1000L,
1500L), shares_2011 = c(1200L, 1100L), share_price_2010 = 8:7,
share_price_2011 = c(40L, 12L)), class = "data.frame", row.names = c(NA,
-2L))
How it works
With this usage, the splice operator takes a named list of expressions. The names of the list become the variable names and the expressions are evaluated in the context of your mutate statement.
> lexprs
$value_2010
shares_2010 * share_price_2010
$value_2011
shares_2011 * share_price_2011
To see how this injection will resolve, we can use rlang::qq_show:
> rlang::qq_show(df %>% mutate(!!! lexprs))
df %>% mutate(value_2010 = shares_2010 * share_price_2010, value_2011 = shares_2011 *
share_price_2011)
It is indeed likely you may need to have your data in a long format. But in case you don't, you can do this:
# thanks Andrea M!
df <- data.frame(
company=c("TeslaInc", "Merta"),
shares_2010=c(1000L, 1500L),
shares_2011=c(1200L, 1100L),
share_price_2010=8:7,
share_price_2011=c(40L, 12L)
)
years <- sub('shares_', '', grep('^shares_', names(df), value = TRUE))
for (year in years) {
df[[paste0('value_', year)]] <-
df[[paste0('shares_', year)]] * df[[paste0('share_price_', year)]]
}
If you wanted to avoid the loop (for (...) {...}) you can use this instead:
sp <- df[, paste0('shares_', years)] * df[, paste0('share_price_', years)]
names(sp) <- paste0('value_', years)
df <- cbind(df, sp)

In R, pivot duplicate row values into column values

My problem is similar to this one, but I am having trouble making the code work for me:
Pivot dataframe to keep column headings and sub-headings in R
My data looks like this:
prod1<-c(1000,2000,1400,1340)
prod2<-c(5000,5400,3400,5400)
partner<-c("World","World","Turkey","Turkey")
year<-c("2017","2018","2017","2018")
type<-c("credit","credit","debit","debit")
s<-as.data.frame(rbind(partner,year,type,prod1,prod2))
But I need to convert all the rows into individual variables so that my columns are:
column.names<-c("products","partner","year","type","value")
I've been trying the code below:
#fix partners
colnames(s)[seq(2, 7, 1)] <- colnames(s)[2] #seq(start,end,increment)
colnames(s)[seq(9, ncol(s), 1)] <- colnames(s)[8]
colnames(s) <-
c(s[1, 1], paste(sep = '_', colnames(s)[2:ncol(s)], as.character(unlist(s[1, 2:ncol(s)]))))
test<-s[-1,]
s <- rename(s, category=1)
test<- s %>%
slice(-1) %>%
pivot_longer(-1,
names_to = c("partner", ".value"),
names_sep = "_") %>%
arrange(partner, `Service item`) %>%
mutate(partner = as.character(partner))
But it keeps saying I can't have duplicate column names. Can someone please help? The initial data is submitted in this format so I need to get it in the right shape.
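library(tibble)  # rownames_to_column()
library(tidyr)   # pivot_longer(), pivot_wider()
library(dplyr)   # select(), %>%

s <- rownames_to_column(s)
s %>%
  pivot_longer(starts_with("V")) %>%
  pivot_wider(names_from = rowname, values_from = value) %>%
  select(-name) %>%
  pivot_longer(starts_with("prod"), names_to = "product", values_to = "value")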
# A tibble: 8 × 5
partner year type product value
<chr> <chr> <chr> <chr> <chr>
1 World 2017 credit prod1 1000
2 World 2017 credit prod2 5000
3 World 2018 credit prod1 2000
4 World 2018 credit prod2 5400
5 Turkey 2017 debit prod1 1400
6 Turkey 2017 debit prod2 3400
7 Turkey 2018 debit prod1 1340
8 Turkey 2018 debit prod2 5400
Sorry, I misread the question at first. Is this what you are looking for?
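Since every row of s is really a variable, another option is to transpose the data frame first and then pivot only the product columns. The following is a sketch under the assumption that s is built exactly as in the question; s_t and long are illustrative names. Because rbind() mixes character and numeric vectors, everything in s is character, so value is converted back to numeric at the end.

```r
library(tidyr)

prod1 <- c(1000, 2000, 1400, 1340)
prod2 <- c(5000, 5400, 3400, 5400)
partner <- c("World", "World", "Turkey", "Turkey")
year <- c("2017", "2018", "2017", "2018")
type <- c("credit", "credit", "debit", "debit")
s <- as.data.frame(rbind(partner, year, type, prod1, prod2))

# t() turns the five rows (partner, year, type, prod1, prod2) into columns
s_t <- as.data.frame(t(s), stringsAsFactors = FALSE)

# Stack the product columns into product/value pairs
long <- pivot_longer(s_t, cols = starts_with("prod"),
                     names_to = "product", values_to = "value")
long$value <- as.numeric(long$value)
```

This gives the same eight rows as above, with value stored as a number instead of text.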

Reshaping multiple long columns into wide column format in R

My sample dataset has multiple columns that I want to convert into wide format. I have tried using the dcast function, but I get an error. Below is my sample dataset:
df2 = data.frame(emp_id = c(rep(1,2), rep(2,4),rep(3,3)),
Name = c(rep("John",2), rep("Kellie",4), rep("Steve",3)),
Year = c("2018","2019","2018","2018","2019","2019","2018","2019","2019"),
Type = c(rep("Salaried",2), rep("Hourly", 2), rep("Salaried",2),"Hourly",rep("Salaried",2)),
Dept = c("Sales","IT","Sales","Sales", rep("IT",3),rep("Sales",2)),
Salary = c(100,1000,95,95,1500,1500,90,1200,1200))
I'm expecting my output to look like:
One option is the function pivot_wider() from the tidyr package:
df.wide <- tidyr::pivot_wider(df2,
                              names_from = c("Type", "Dept", "Year"),
                              values_from = "Salary",
                              values_fn = mean)
This should get you the desired result.
What do you think of this output? It is not the expected layout, but I find it easier to interpret:
library(dplyr)

df2 %>%
  group_by(Name, Year, Type, Dept) %>%
  summarise(mean = mean(Salary))
Output:
Name Year Type Dept mean
<chr> <chr> <chr> <chr> <dbl>
1 John 2018 Salaried Sales 100
2 John 2019 Salaried IT 1000
3 Kellie 2018 Hourly Sales 95
4 Kellie 2019 Salaried IT 1500
5 Steve 2018 Hourly IT 90
6 Steve 2019 Salaried Sales 1200

R: function for home inventory count?

I have a list of homes sales data in my neighborhood listed as
address, listingdate, saledate
101 Street, 2017/01/01, 2017/06/06
106 Street, 2017/03/01, 2017/08/11
102 Street, 2017/05/04, 2017/06/13
109 Street, 2017/07/04, 2017/11/24
...
I would like to calculate the number of homes listed for sale (and not yet sold) at each listing date, to see how home sales and listings vary throughout the year.
in the example:
address, listingdate, saledate, inventory
101 Street, 2017/01/01, 2017/06/06, 1
106 Street, 2017/03/01, 2017/08/11, 2
102 Street, 2017/05/04, 2017/06/13, 3
109 Street, 2017/07/04, 2017/11/24, 2
...
E.g. 109 Street was listed when only 106 and 109 Street were for sale.
Is there a simple 1-step R expression that can calculate that?
I guess this is 3 simple steps. I'll just set the bar, I'm sure someone else will be able to go under it.
library(data.table)
library(lubridate)
dt <- data.table(
  address = paste(c(101,106,102,109),"Street"),
  listing_date = ymd(c('2017/01/01','2017/03/01','2017/05/04','2017/07/04')),
  saledate = ymd(c("2017/06/06","2017/08/11","2017/06/13","2017/11/24")),
  key = 'listing_date')
dt2 <- rbind(dt[,.(date = listing_date, x = 1)], dt[,.(date = saledate, x = -1)])
dt3 <- dt2[, .(x = sum(x)), keyby = date][, .(date, inventory = cumsum(x))]
dt[, inventory := dt3[dt, on=c('date' = 'listing_date'), inventory]]
Or instead as a one-liner
dt[,inventory:=dt[,.(d=listing_date:saledate),.(address)][,.N,key=d][dt,N]]
dt[]
#> address listing_date saledate inventory
#> 1: 101 Street 2017-01-01 2017-06-06 1
#> 2: 106 Street 2017-03-01 2017-08-11 2
#> 3: 102 Street 2017-05-04 2017-06-13 3
#> 4: 109 Street 2017-07-04 2017-11-24 2
I couldn't use the specific solution because of incompatibilities between data.table and tibbles, but the general algorithm was very enlightening. I could translate the general idea to tidyverse land with a couple of changes
library(readr)   # read_csv(), cols(), col_date()
library(dplyr)
library(tibble)

# import data from data file
homesale_file <- "Home sales data.csv"
homesales <- read_csv(homesale_file,
  col_types = cols(listingdate = col_date(format = "%m/%d/%Y"),
                   saledate = col_date(format = "%m/%d/%Y"))
)

# calculation for inventory: +1 at each listing, -1 at each sale,
# then a running total in date order
listingdate <- tibble(address = homesales$address, listingdate = homesales$listingdate, type = "listing", y = 1)
saledate <- tibble(address = homesales$address, listingdate = homesales$saledate, type = "sale", y = -1)
summation <- bind_rows(listingdate, saledate) %>%
  arrange(listingdate) %>%
  mutate(inventory = cumsum(y)) %>%
  select(-y) %>%
  filter(type == "listing")
homesales <- homesales %>% inner_join(summation) %>% select(-type)
#pseudopin, thanks for the help!
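For small data sets, the same count can also be written directly from the definition: for each listing date, count the homes already listed and not yet sold. This is a base R sketch using the four example rows from the question. The name homes is illustrative, and it assumes a home sold on a given day no longer counts as inventory on that day.

```r
homes <- data.frame(
  address = c("101 Street", "106 Street", "102 Street", "109 Street"),
  listingdate = as.Date(c("2017-01-01", "2017-03-01", "2017-05-04", "2017-07-04")),
  saledate = as.Date(c("2017-06-06", "2017-08-11", "2017-06-13", "2017-11-24"))
)

# For each home, count homes listed on or before its listing date
# that have not been sold by then (the home itself is included)
homes$inventory <- vapply(
  seq_len(nrow(homes)),
  function(i) {
    d <- homes$listingdate[i]
    sum(homes$listingdate <= d & homes$saledate > d)
  },
  integer(1)
)

homes$inventory
#> [1] 1 2 3 2
```

This compares every home against every other home, so the cumulative-sum approaches above are likely to scale better on large files.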

How can I transpose data in each variable from long to wide using group_by? R

I have a dataframe with id variable name. I'm trying to figure out a way to transpose each variable in the dataframe by name.
My current df is below:
name jobtitle companyname datesemployed empduration joblocation jobdescrip
1 David… Project… EOS IT Man… Aug 2018 – P… 1 yr 9 mos San Franci… Coordinati…
2 David… Technic… Options Te… Sep 2017 – J… 5 mos Belfast, U… Working wi…
3 David… Data An… NA Jan 2018 – J… 6 mos Belfast, U… Working wi…
However, I'd like a dataframe in which there is only one row for name, and every observation for name becomes its own column, like below:
name jobtitle_1 companyname_1 datesemployed_1 empduration_1 joblocation_1 jobdescrip_1 jobtitle_2 companyname_2 datesemployed_2 empduration_2 joblocation_2 jobdescrip_2
1 David… Project… EOS IT Man… Aug 2018 – P… 1 yr 9 mos San Franci… Coordinati… Technic… Options Te… Sep 2017 – J… 5 mos Belfast, U… Working wi…
I have used commands like gather_by and melt in the past to reshape from long to wide, but in this case, I'm not sure how to apply it, since every observation for the id variable will need to become its own column.
It sounds like you are looking for gather and pivot_wider.
I used my own sample data with two names:
library(dplyr)
library(tidyr)

df <- tibble(name = c('David', 'David', 'David', 'Bill', 'Bill'),
             jobtitle = c('PM', 'TPM', 'Analyst', 'Dev', 'Eng'),
             companyname = c('EOS', 'Options', NA, 'Microsoft', 'Nintendo'))
First add an index column to distinguish the different positions for each name.
indexed <- df %>%
group_by(name) %>%
mutate(.index = row_number())
indexed
# name jobtitle companyname .index
# <chr> <chr> <chr> <int>
# 1 David PM EOS 1
# 2 David TPM Options 2
# 3 David Analyst NA 3
# 4 Bill Dev Microsoft 1
# 5 Bill Eng Nintendo 2
Then it is possible to use gather to get a long form, with one value per row.
gathered <- indexed %>% gather('var', 'val', -c(name, .index))
gathered
# name .index var val
# <chr> <int> <chr> <chr>
# 1 David 1 jobtitle PM
# 2 David 2 jobtitle TPM
# 3 David 3 jobtitle Analyst
# 4 Bill 1 jobtitle Dev
# 5 Bill 2 jobtitle Eng
# 6 David 1 companyname EOS
# 7 David 2 companyname Options
# 8 David 3 companyname NA
# 9 Bill 1 companyname Microsoft
# 10 Bill 2 companyname Nintendo
Now pivot_wider can be used to create a column for each variable and index.
gathered %>% pivot_wider(names_from = c(var, .index), values_from = val)
# name jobtitle_1 jobtitle_2 jobtitle_3 companyname_1 companyname_2 companyname_3
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 David PM TPM Analyst EOS Options NA
# 2 Bill Dev Eng NA Microsoft Nintendo NA
Get the data in long format, create a unique column identifier and get it back to wide format.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -name, names_to = 'col') %>%
group_by(name, col) %>%
mutate(row = row_number()) %>%
pivot_wider(names_from = c(col, row), values_from = value)
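The intermediate long step can also be skipped, because pivot_wider() accepts several columns in values_from. The sketch below assumes the two-name sample data from the earlier answer; idx and wide are illustrative names.

```r
library(dplyr)
library(tidyr)

df <- tibble(name = c("David", "David", "David", "Bill", "Bill"),
             jobtitle = c("PM", "TPM", "Analyst", "Dev", "Eng"),
             companyname = c("EOS", "Options", NA, "Microsoft", "Nintendo"))

wide <- df %>%
  group_by(name) %>%
  mutate(idx = row_number()) %>%   # position of each job within a name
  ungroup() %>%
  # one <variable>_<idx> column per variable and position
  pivot_wider(names_from = idx, values_from = c(jobtitle, companyname))

names(wide)
#> [1] "name"          "jobtitle_1"    "jobtitle_2"    "jobtitle_3"
#> [5] "companyname_1" "companyname_2" "companyname_3"
```

Unlike the long-then-wide route, this keeps each column's original type, which matters when the variables are not all character.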
