standardizing customer ids based on the same company name - r

I need to use one of the many customer IDs and standardize it across all companies whose names are exactly the same.
Before
Customer.Ids Company Location
1211 Lightz New York
1325 Comput.Inc Seattle
1756 Lightz California
After
Customer.Ids Company Location
1211 Lightz New York
1325 Comput.Inc Seattle
1211 Lightz California
The customer ids for the two companies are now the same. Which code would be the best for this?

We can use match here, as it returns the first matching position; we can match Company against itself. According to ?match:
match returns a vector of the positions of (first) matches of its first argument in its second.
df$Customer.Ids <- df$Customer.Ids[match(df$Company, df$Company)]
df
# Customer.Ids Company Location
#1 1211 Lightz New York
#2 1325 Comput.Inc Seattle
#3 1211 Lightz California
where
match(df$Company, df$Company) #returns
#[1] 1 2 1
Another option, using sapply:
df$Customer.Ids <- df$Customer.Ids[sapply(df$Company, function(x)
which.max(x == df$Company))]
Here we loop over each Company and get the first instance of its occurrence.
Or another option using ave, which follows the same logic as @Shree's answer, to get the first occurrence by group:
with(df, ave(Customer.Ids, Company, FUN = function(x) head(x, 1)))
#[1] 1211 1325 1211
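To write the result back to the data frame, a minimal usage sketch (the same ave call, assigned to the column):
df$Customer.Ids <- with(df, ave(Customer.Ids, Company, FUN = function(x) head(x, 1)))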

Here's a way using the dplyr package. It replaces all IDs with the first instance for each company:
df %>%
group_by(Company) %>%
mutate(
Customer.Ids = Customer.Ids[1]
) %>%
ungroup()
# A tibble: 3 x 3
Customer.Ids Company Location
<int> <fct> <fct>
1 1211 Lightz New York
2 1325 Comput.Inc Seattle
3 1211 Lightz California


Summarise rows in dataframe by two columns

I have this data frame called World, which shows the following:
City Year Income Tourist
London 2008 50 100
NY 2009 75 250
Paris 2010 45 340
Dubai 2008 32 240
London 2011 50 140
Abu Dhabi 2009 60 120
Paris 2009 70 140
NY 2007 50 150
Tokyo 2008 45 150
Dubai 2010 40 480
#With 207 more rows
I want to summarise the rows so that every city shows the total income and tourists across all the years. So I want to find code that matches rows by City and then summarises them, so that every city has just one row.
Something like this:
City Income Tourist
London 1051 5040
NY 1547 5432
Paris 2600 4321
Dubai 3222 5312
Abu Dhabi 3100 7654
Tokyo 2404 4321
#With 40 more rows
From the research I've done, n_distinct and group_by should be used.
Base R solution:
You can use the sapply() function to iterate over cities:
The first argument is a vector of unique cities.
We then write a function that selects all the rows (years) for each city and returns the "Income" and "Tourist" columns.
Sum the column values with the colSums() function.
Transpose the output using the t() function.
t(sapply(unique(World$City), function(CITY) colSums(World[World$City == CITY, c("Income", "Tourist")])))
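Note that this returns a matrix with the city names as row names; a small sketch (assuming the World data frame above) to coerce it back into a data frame with a City column:
res <- t(sapply(unique(World$City), function(CITY) colSums(World[World$City == CITY, c("Income", "Tourist")])))
data.frame(City = rownames(res), res, row.names = NULL)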
Solution with R's data.table package:
Make sure your object is of type data.table.
In the j part of the bracket (the do part):
you can provide names for the output columns ("Income="),
and specify the wanted computation ("sum(Income)").
To group by city, add a by argument to the data.table call.
World[,.(Income=sum(Income),Tourist=sum(Tourist)),by=City]
Yes, you can use the group_by and summarise functions.
world %>% group_by(City) %>% summarise(across(c(Income,Tourist), sum))
You can also add Year to group_by:
world %>% group_by(City,Year) %>% summarise(across(c(Income,Tourist), sum))
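For completeness, a base R sketch using aggregate(), which does the grouping and summing in one call (assuming the same World data frame):
aggregate(cbind(Income, Tourist) ~ City, data = World, FUN = sum)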

R Identifying Dataframe Change Patterns by Groups

I have a dataframe that looks like the one below:
person year location salary
Harry 2002 Los Angeles $2000
Harry 2006 Boston $3000
Harry 2007 Los Angeles $2500
Peter 2001 New York $2000
Peter 2002 New York $2300
Lily 2007 New York $7000
Lily 2008 Boston $2300
Lily 2011 New York $4000
Lily 2013 Boston $3300
I want to identify a pattern at the person level: who moved out of a location and came back later. For example, Harry moved out of Los Angeles and came back later. Lily moved out of New York and came back later; we can also say she moved out of Boston and came back later. I am only interested in who has this pattern and do not care about the number of back-and-forth moves. Therefore, ideally, the output would look like:
person move_back (yes/no)
Harry 1
Peter 0
Lily 1
With the help of data.table's rleid you can do:
library(dplyr)
df %>%
arrange(person, year) %>%
group_by(person) %>%
mutate(val = data.table::rleid(location)) %>%
arrange(person, location) %>%
group_by(location, .add = TRUE) %>%
summarise(move_back = any(val != lag(val, default = first(val)))) %>%
summarise(move_back = as.integer(any(move_back)))
# person move_back
# <chr> <int>
#1 Harry 1
#2 Lily 1
#3 Peter 0
You could use rle to identify situations where there are one or more instances of repeats. (I think your item Lily had two repeats.)
lapply( split(dat, dat$person), function(x) duplicated( rle(x$location)$values))
$Harry
[1] FALSE FALSE TRUE
$Lily
[1] FALSE FALSE TRUE TRUE
$Peter
[1] FALSE
You could use sapply with sum or any to determine the number of move-backs or whether any move-backs occurred. If you only want to know if there's a move-back to the first site, the logic would be different.
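For example, a sketch wrapping the rle logic above with sapply to produce the requested 1/0 output:
# 1 if any location run repeats for a person, else 0
move_back <- sapply(split(dat, dat$person), function(x) as.integer(any(duplicated(rle(x$location)$values))))
data.frame(person = names(move_back), move_back, row.names = NULL)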
A slightly different data.table method, based on joins and row number (.I).
Basically I'm flagging all the times that a location for a person matches a row that is not the next row, then aggregating.
library(data.table)
setDT(dat)
dat[, rn := .I]
dat[, rnp1 := .I + 1]
dat[dat, on=.(person, location, rn > rnp1), back := TRUE]
dat[, .(move_back = any(back, na.rm=TRUE)), by=person]
# person move_back
#1: Harry TRUE
#2: Peter FALSE
#3: Lily TRUE
Where dat was:
dat <- read.csv(text="person,year,location,salary
Harry,2002,Los Angeles,$2000
Harry,2006,Boston,$3000
Harry,2007,Los Angeles,$2500
Peter,2001,New York,$2000
Peter,2002,New York,$2300
Lily,2007,New York,$7000
Lily,2008,Boston,$2300
Lily,2011,New York,$4000
Lily,2013,Boston,$3300", header=TRUE)

How to Implement a Complex For-Loop + If Statement

I have two data sets, each containing five-digit ZIPs.
One data set looks like this:
From To Territory
7501 10000 Unassigned
10001 10463 Agent 1
10464 10464 Unassigned
10465 11769 Agent 2
And a second data set looks like this:
zip5 address
1 10009 424 E 9TH ST APT 12, NEW YORK
2 10010 15 E 26TH ST APT 10C, NEW YORK
3 10013 310 GREENWICH ST, NEW YORK
4 10019 457 W 57TH ST, NEW YORK
I would like to write a for-loop in R that loops through the zip5 column in the second data set and through the From and To columns of the first, checks whether each zip5 falls within the From/To range, and, once it finds a match, assigns the Territory value from the first data set to a new column in the second.
I started to try to think through the logic but quickly became overwhelmed and thought I would turn to the StackOverflow community for guidance.
Here was my initial attempt:
for (i in nrow(df1)){
for(j in nrow(df2)){
if(df1[1, "zip5"] > df2[1, "From"] & df1[1, "zip5"] <= df2[1, "To"])
df1$newColumn = df2[j, "Territory"]
}
}
You can use data.table::foverlaps for this:
library(data.table)
dat1 <- fread(text = '
From To Territory
7501 10000 Unassigned
10001 10463 "Agent 1"
10464 10464 Unassigned
10465 11769 "Agent 2"')
dat2 <- fread(text = '
zip5 address
10009 "424 E 9TH ST APT 12, NEW YORK"
10010 "15 E 26TH ST APT 10C, NEW YORK"
10013 "310 GREENWICH ST, NEW YORK"
10019 "457 W 57TH ST, NEW YORK"')
# if you use your own data and it is not a data.table, then do this:
setDT(dat1)
setDT(dat2)
Requirements to use foverlaps:
Both frames must have two fields, a "from" and a "to". While it might seem inane since we want to determine if "zip5" is within "From" to "To", the premise of the function is to find overlaps in two ranges. Instead of putting in special-case code to allow a single column in one frame, they chose (I'm inferring) to keep it general. This means we need to copy zip5 to another column.
Both tables need to have their ranges as "keys". If there are other columns that are keys, then the range columns must be the last two. (And in order.)
# req't 1, need a range in the second frame
dat2[, zip5copy := zip5 ]
# set keys for both
setkey(dat1, From, To)
setkey(dat2, zip5, zip5copy)
And the code:
foverlaps(dat1, dat2)
# zip5 address zip5copy From To Territory
# 1: NA <NA> NA 7501 10000 Unassigned
# 2: 10009 424 E 9TH ST APT 12, NEW YORK 10009 10001 10463 Agent 1
# 3: 10010 15 E 26TH ST APT 10C, NEW YORK 10010 10001 10463 Agent 1
# 4: 10013 310 GREENWICH ST, NEW YORK 10013 10001 10463 Agent 1
# 5: 10019 457 W 57TH ST, NEW YORK 10019 10001 10463 Agent 1
# 6: NA <NA> NA 10464 10464 Unassigned
# 7: NA <NA> NA 10465 11769 Agent 2
The default mode when there are no matches is nomatch=NA, meaning that the missing columns of the extra rows are filled with NA, as above. This is equivalent to a "full join" (one ref for joins: https://stackoverflow.com/a/6188334). If you want just the matching rows, then foverlaps(..., nomatch=NULL) will give you just 4 rows. (You can also reverse the order of dat1 and dat2, but you might still need nomatch=NULL if your actual data requires it.)
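For completeness, the loop from the question can also be fixed directly. A minimal corrected sketch, keeping the question's naming (df1 holds zip5, df2 holds the ranges): the loops must run over seq_len(nrow(...)), the comparison must index rows i and j (with >= for an inclusive lower bound), and the assignment must target a single row rather than the whole column.
df1$Territory <- NA
for (i in seq_len(nrow(df1))) {
  for (j in seq_len(nrow(df2))) {
    # assign the territory whose [From, To] range contains this zip
    if (df1[i, "zip5"] >= df2[j, "From"] & df1[i, "zip5"] <= df2[j, "To"]) {
      df1[i, "Territory"] <- df2[j, "Territory"]
    }
  }
}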

Remove duplicate entries after fuzzy matching between tables

I am trying to find data entry errors in the names and locations of my dataset by fuzzy matching. I have a unique key from the original data, siterow_id, and have made a new key, pi_key, where I already identified some hard matches (no fuzzy matching). After running the fuzzy matching I get duplicate values: matches from both the left and right side of the join for some of the siterow_ids. I can manually look at the data, see where this occurs, and hard-code removal of the rows, but I want a more algorithmic way of doing this as I move to a larger dataset with many more matches.
I tried doing it this way but it removes the matches on the left and the right. If possible I would love a tidyverse way to do this and not a loop.
The table output is included below. You can see a duplicate in rows 8 and 9.
for(site in three_letter_matches$siterow_id.x){
if (any(three_letter_matches$siterow_id.y == site)) {
three_letter_matches <- three_letter_matches[!three_letter_matches$siterow_id.y == site,]
}
}
pi_key.x siterow_id.x last_name.x first_name.x city.x country.x pi_key.y siterow_id.y
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 6309 1-9CH29M kim kevin san f~ united s~ 11870 1-HC3YY6
2 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-2QBRZ2
3 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3AHHSU
4 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3JYF8V
5 7567 1-CW4DXI bar jair ramat~ israel 8822 1-E3UILG
6 8822 1-E3UILG bar jair ramat~ israel 7567 1-CW4DXI
7 11870 1-HC3YY6 kim kevin san f~ united s~ 6309 1-9CH29M
8 12357 1-HUUEA6 lee hyojin daeje~ korea re~ 13460 1-IGKCPP
9 13460 1-IGKCPP lee hyo jin daeje~ korea re~ 12357 1-HUUEA6
I found another way to do it:
update <- three_letter_matches[!is.na(match(three_letter_matches$siterow_id.x, three_letter_matches$siterow_id.y)),]
update %<>% arrange(last_name.x, first_name.x) %>%
filter(row_number() %% 2 != 0)
three_letter_matches_update <- three_letter_matches %>%
anti_join(update)
Still open to suggestions.
Not the easiest problem, but there are a few ways to do this. The first that comes to mind for me, though a bit slow (because it uses rowwise(), which is equivalent to using map() or lapply()), is this one:
NOTE: This only works if siterow_id.x/y are character vectors. Won't work for factors.
three_letter_matches <- three_letter_matches %>%
rowwise() %>%
mutate(both_values = paste0(sort(c(siterow_id.x,siterow_id.y)),collapse = ",")) %>%
ungroup() %>%
distinct(both_values,.keep_all = TRUE) %>%
select(-both_values)
# pi_key.x siterow_id.x last_name.x first_name.x city.x country.x pi_key.y siterow_id.y
# 6309 1-9CH29M kim kevin san f~ united s~ 11870 1-HC3YY6
# 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-2QBRZ2
# 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3AHHSU
# 7198 1-CJGRSZ kim jinseok seoul korea re~ 2952 1-3JYF8V
# 7567 1-CW4DXI bar jair ramat~ israel 8822 1-E3UILG
# 12357 1-HUUEA6 lee hyojin daeje~ korea re~ 13460 1-IGKCPP
Basically, I use rowwise so that I work on one row at a time. Then I take the siterow ids, sort them so that every row has the same order, and paste them together into a single string that is easy to compare for equivalence. Next I ungroup so that you are looking at all rows again (getting rid of the rowwise). Then I run distinct to keep only the first row for each value of the new column, with the .keep_all option to keep all the columns. Finally I clean up by removing the extra column.
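Since the rowwise() step exists only to sort the two ids within each row, a vectorized sketch using pmin()/pmax() (which compare character vectors elementwise) should give the same result without the per-row overhead:
library(dplyr)
three_letter_matches <- three_letter_matches %>%
  # put the smaller id first so each pair gets one canonical string
  mutate(both_values = paste(pmin(siterow_id.x, siterow_id.y),
                             pmax(siterow_id.x, siterow_id.y), sep = ",")) %>%
  distinct(both_values, .keep_all = TRUE) %>%
  select(-both_values)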

Showing multiple columns in aggregate function including strings/characters in R

R noob question here.
Let's say I have this data frame:
City State Pop
Fresno CA 494
San Franciso CA 805
San Jose CA 945
San Diego CA 1307
Los Angeles CA 3792
Reno NV 225
Henderson NV 257
Las Vegas NV 583
Gresham OR 105
Salem OR 154
Eugene OR 156
Portland OR 583
Fort Worth TX 741
Austin TX 790
Dallas TX 1197
San Antonio TX 1327
Houston TX 2100
I want to get, let's say, every 3rd-lowest population per State, which would give:
City State Pop
San Jose CA 945
Las Vegas NV 583
Eugene OR 156
Dallas TX 1197
I tried this one:
ord_pop_state <- aggregate(Pop ~ State , data = ord_pop, function(x) { x[3] } )
And I get this one:
State Pop
CA 945
NV 583
OR 156
TX 1197
What am I missing here, in order to get the desired output that includes the City?
I would suggest trying the data.table package for such a task, as the syntax is easier and the code more efficient. I would also suggest adding the order function to make sure that the data is sorted:
library(data.table)
setDT(ord_pop)[order(Pop), .SD[3L], keyby = State]
# State City Pop
# 1: CA San Jose 945
# 2: NV Las Vegas 583
# 3: OR Eugene 156
# 4: TX Dallas 1197
So basically, the data is first ordered by Pop, then we subset .SD (the data.table notation for the Subset of Data for each group) by State.
This is easily solvable with base R too (we will assume that the data is sorted here): we can just create an index per group and then do a simple subset by that index.
ord_pop$indx <- with(ord_pop, ave(Pop, State, FUN = seq_along))
ord_pop[ord_pop$indx == 3L, ]
# City State Pop indx
# 3 San Jose CA 945 3
# 8 Las Vegas NV 583 3
# 11 Eugene OR 156 3
# 15 Dallas TX 1197 3
Here is a dplyr version:
library(dplyr)
df2 <- df %>%
group_by(state) %>% # Group observations by state
arrange(pop) %>% # Within those groups, sort in ascending order by pop
slice(3) # Extract the third row in each arranged group
Here's the toy data I used to test it:
set.seed(1)
df <- data.frame(state = rep(LETTERS[1:3], each = 5), city = rep(letters[1:5], 3), pop = round(rnorm(15, 1000, 100), digits=0))
And here's the output from that; it's a coincidence that 'b' was third-lowest in each case, not a glitch in the code:
> df2
Source: local data frame [3 x 3]
Groups: state
state city pop
1 A b 1018
2 B b 1049
3 C b 1039
In R, the same end results can be achieved using different packages. The choice of package is a trade-off between efficiency and simplicity of code.
Since you come from a strong SQL background, this might be easier to use:
library(sqldf)
#Example to return 3rd lowest population of a State
result <- sqldf('Select City, State, Pop from data order by Pop limit 1 offset 2;')
#Note: the SQL query is a sample and needs to be modified to get the desired result.
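A hedged sketch of that modification, using a correlated subquery so it works on any SQLite version behind sqldf (assuming no ties in Pop within a State): for each row, count how many rows in the same State have a strictly lower Pop, and keep those where the count is 2.
library(sqldf)
# 3rd lowest per State: exactly two rows in the same State have a smaller Pop
result <- sqldf('SELECT City, State, Pop FROM data d
                 WHERE (SELECT COUNT(*) FROM data d2
                        WHERE d2.State = d.State AND d2.Pop < d.Pop) = 2')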
