I have two data sets, each containing five-digit ZIPs.
One data set looks like this:
From To Territory
7501 10000 Unassigned
10001 10463 Agent 1
10464 10464 Unassigned
10465 11769 Agent 2
And a second data set that looks like this:
zip5 address
1 10009 424 E 9TH ST APT 12, NEW YORK
2 10010 15 E 26TH ST APT 10C, NEW YORK
3 10013 310 GREENWICH ST, NEW YORK
4 10019 457 W 57TH ST, NEW YORK
I would like to write a for-loop in R that loops through the zip5 column in the second data set, then loops through both the From and the To columns from dataset 1, checking if the zip5 falls within the From and To range, and once it finds a match, assigns the Territory value from the first dataset into a new column in second dataset.
I started to try to think through the logic but quickly became overwhelmed and thought I would turn to the StackOverflow community for guidance.
Here was my initial attempt:
for (i in nrow(df1)){
for(j in nrow(df2)){
if(df1[1, "zip5"] > df2[1, "From"] & df1[1, "zip5"] <= df2[1, "To"])
df1$newColumn = df2[j, "Territory"]
}
}
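For reference, the attempt above has a few problems: `for (i in nrow(df1))` visits only the last row (it needs `seq_len(nrow(df1))`), the row indices are hard-coded to 1, the assignment overwrites the whole new column instead of one element, and `>` excludes the lower bound. A minimal corrected sketch in base R follows. The names `ranges` and `addresses` are placeholders chosen for this illustration, and both bounds are assumed to be inclusive.

```r
# Range table (the first data set in the question)
ranges <- data.frame(
  From = c(7501, 10001, 10464, 10465),
  To = c(10000, 10463, 10464, 11769),
  Territory = c("Unassigned", "Agent 1", "Unassigned", "Agent 2"),
  stringsAsFactors = FALSE
)

# ZIPs to tag (the second data set, address column omitted for brevity)
addresses <- data.frame(zip5 = c(10009, 10010, 10013, 10019))

# Start with NA so ZIPs that fall in no range stay visibly unassigned
addresses$Territory <- NA_character_

for (i in seq_len(nrow(addresses))) {
  for (j in seq_len(nrow(ranges))) {
    # Inclusive on both ends (an assumption about the intended semantics)
    if (addresses$zip5[i] >= ranges$From[j] && addresses$zip5[i] <= ranges$To[j]) {
      addresses$Territory[i] <- ranges$Territory[j]
      break  # stop at the first matching range
    }
  }
}
```

This is fine for small inputs, but the nested loop does rows-times-ranges comparisons, so a join-based approach scales better.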
You can use data.table::foverlaps for this:
library(data.table)
dat1 <- fread(text = '
From To Territory
7501 10000 Unassigned
10001 10463 "Agent 1"
10464 10464 Unassigned
10465 11769 "Agent 2"')
dat2 <- fread(text = '
zip5 address
10009 "424 E 9TH ST APT 12, NEW YORK"
10010 "15 E 26TH ST APT 10C, NEW YORK"
10013 "310 GREENWICH ST, NEW YORK"
10019 "457 W 57TH ST, NEW YORK"')
# if you use your own data and it is not a data.table, then do this:
setDT(dat1)
setDT(dat2)
Requirements to use foverlaps:
1. Both frames must have two fields, a "from" and a "to". While it might seem inane since we want to determine if "zip5" is within "From" to "To", the premise of the function is to find overlaps between two ranges. Instead of putting in special-case code to allow a single column in one frame, they chose (I'm inferring) to keep it general. This means we need to copy zip5 to another column.
2. Both tables need to have their ranges as "keys". If there are other columns that are keys, then the range columns must be the last two. (And in order.)
# req't 1, need a range in the second frame
dat2[, zip5copy := zip5 ]
# set keys for both
setkey(dat1, From, To)
setkey(dat2, zip5, zip5copy)
And the code:
foverlaps(dat1, dat2)
# zip5 address zip5copy From To Territory
# 1: NA <NA> NA 7501 10000 Unassigned
# 2: 10009 424 E 9TH ST APT 12, NEW YORK 10009 10001 10463 Agent 1
# 3: 10010 15 E 26TH ST APT 10C, NEW YORK 10010 10001 10463 Agent 1
# 4: 10013 310 GREENWICH ST, NEW YORK 10013 10001 10463 Agent 1
# 5: 10019 457 W 57TH ST, NEW YORK 10019 10001 10463 Agent 1
# 6: NA <NA> NA 10464 10464 Unassigned
# 7: NA <NA> NA 10465 11769 Agent 2
The default when there are no matches is nomatch=NA, meaning that rows of the first argument (here dat1) with no overlap are kept and the columns coming from the other table are filled with NA, as above. This behaves like an outer join that keeps every row of dat1 (one ref for joins: https://stackoverflow.com/a/6188334). If you want just matching rows, then foverlaps(..., nomatch=NULL) will give you just 4 rows. (You can also reverse the order of dat1 and dat2, but you might still need nomatch=NULL if your actual data contains ZIPs that fall in no range.)
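As a sketch of the reversed order mentioned above, the block below puts dat2 first so the result has one row per address, and passes nomatch=NULL to drop anything unmatched. The data is rebuilt with data.table() rather than fread() only to keep the example self-contained. nomatch=NULL assumes a reasonably recent data.table version.

```r
library(data.table)

# Same sample data as above
dat1 <- data.table(
  From = c(7501L, 10001L, 10464L, 10465L),
  To = c(10000L, 10463L, 10464L, 11769L),
  Territory = c("Unassigned", "Agent 1", "Unassigned", "Agent 2")
)
dat2 <- data.table(
  zip5 = c(10009L, 10010L, 10013L, 10019L),
  address = c("424 E 9TH ST APT 12, NEW YORK",
              "15 E 26TH ST APT 10C, NEW YORK",
              "310 GREENWICH ST, NEW YORK",
              "457 W 57TH ST, NEW YORK")
)

# A single point is treated as a zero-width range
dat2[, zip5copy := zip5]
setkey(dat1, From, To)
setkey(dat2, zip5, zip5copy)

# x = dat2 (addresses), y = dat1 (keyed ranges); keep only matching rows
res <- foverlaps(dat2, dat1, nomatch = NULL)

# The helper column is no longer needed
res[, zip5copy := NULL]
```

With this sample data every ZIP falls inside the "Agent 1" range, so res has four rows.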
I need to pick one of the customer IDs and standardize it across all rows whose company names are exactly the same.
Before
Customer.Ids Company Location
1211 Lightz New York
1325 Comput.Inc Seattle
1756 Lightz California
After
Customer.Ids Company Location
1211 Lightz New York
1325 Comput.Inc Seattle
1211 Lightz California
The customer ids for the two companies are now the same. Which code would be the best for this?
We can use match here as it returns the first matching position. We can match Company with Company. According to ?match
match returns a vector of the positions of (first) matches of its first argument in its second.
df$Customer.Ids <- df$Customer.Ids[match(df$Company, df$Company)]
df
# Customer.Ids Company Location
#1 1211 Lightz NewYork
#2 1325 Comput.Inc Seattle
#3 1211 Lightz California
where
match(df$Company, df$Company) #returns
#[1] 1 2 1
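To make this reproducible, here is a self-contained sketch that rebuilds the example data and applies the same match() idea. The data frame is typed in by hand from the question, and `first_pos` is a helper name introduced here.

```r
# Example data from the question
df <- data.frame(
  Customer.Ids = c(1211L, 1325L, 1756L),
  Company = c("Lightz", "Comput.Inc", "Lightz"),
  Location = c("New York", "Seattle", "California"),
  stringsAsFactors = FALSE
)

# For each row, the position of the first row with the same Company
first_pos <- match(df$Company, df$Company)

# Replace every ID with the ID found at that first position
df$Customer.Ids <- df$Customer.Ids[first_pos]
```

Because match() always reports the first hit, the ID that "wins" for each company is the one on its earliest row.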
Some other options, using sapply
df$Customer.Ids <- df$Customer.Ids[sapply(df$Company, function(x)
which.max(x == df$Company))]
Here we loop over each Company and get the first instance of its occurrence.
Another option uses ave, which follows the same logic as @Shree's answer, to get the first occurrence by group.
with(df, ave(Customer.Ids, Company, FUN = function(x) head(x, 1)))
#[1] 1211 1325 1211
Here's a way using the dplyr package. It replaces all Ids with the first instance for each company:
library(dplyr)
df %>%
group_by(Company) %>%
mutate(
Customer.Ids = Customer.Ids[1]
) %>%
ungroup()
# A tibble: 3 x 3
Customer.Ids Company Location
<int> <fct> <fct>
1 1211 Lightz New York
2 1325 Comput.Inc Seattle
3 1211 Lightz California
I have two dfs as below
>codes1
Country State City Start No End No
IN Telangana Hyderabad 100 200
IN Maharashtra Pune (Bund Garden) 300 400
IN Haryana Gurgaon 500 600
IN Maharashtra Pune 700 800
IN Gujarat Ahmedabad (Vastrapur) 900 1000
Now I want to tag the values in the table below using table 1:
>codes2
ID No
1 157
2 346
3 389
4 453
5 562
6 9874
7 98745
I want to tag each number in the No column of codes2 according to the range it falls into in codes1. The expected output is:
ID No Country State City
1 157 IN Telangana Hyderabad
2 346 IN Maharashtra Pune(Bund Garden)
.
.
.
Basically, I want to tag the No column in codes2 with the codes1 row whose range (Start No to End No) the value falls in.
Also, the rows in codes2 could be in any order.
You could use the non-equi join capability of the data.table package for that:
library(data.table)
setDT(codes1)
setDT(codes2)
codes2[codes1, on = .(No > StartNo, No < EndNo), ## (1)
`:=`(cntry = Country, state = State, city = City)] ## (2)
(1) obtains matching row indices in codes2 corresponding to each row in codes1, while matching on the condition provided to the on argument.
(2) updates codes2 values for those matching rows for the columns specified directly by reference (i.e., you don't have to assign the result back to another variable).
This gives:
codes2
# ID No cntry state city
# 1: 1 157 IN Telangana Hyderabad
# 2: 2 346 IN Maharashtra Pune (Bund Garden)
# 3: 3 389 IN Maharashtra Pune (Bund Garden)
# 4: 4 453 NA NA NA
# 5: 5 562 IN Haryana Gurgaon
# 6: 6 9874 NA NA NA
# 7: 7 98745 NA NA NA
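Note that `>` and `<` exclude the range endpoints, so a value exactly equal to a Start No or End No would stay untagged. If the ranges are meant to be inclusive (an assumption, since the question does not say), a variant with `>=` and `<=` could look like this. The column names StartNo and EndNo (without spaces) follow the code above, and the `i.` prefix just makes explicit that the values come from codes1.

```r
library(data.table)

# Range table from the question, with space-free column names
codes1 <- data.table(
  Country = "IN",
  State = c("Telangana", "Maharashtra", "Haryana", "Maharashtra", "Gujarat"),
  City = c("Hyderabad", "Pune (Bund Garden)", "Gurgaon", "Pune",
           "Ahmedabad (Vastrapur)"),
  StartNo = c(100, 300, 500, 700, 900),
  EndNo = c(200, 400, 600, 800, 1000)
)
codes2 <- data.table(
  ID = 1:7,
  No = c(157, 346, 389, 453, 562, 9874, 98745)
)

# Update join: for each codes1 range, fill the matching codes2 rows by reference
codes2[codes1, on = .(No >= StartNo, No <= EndNo),
       `:=`(Country = i.Country, State = i.State, City = i.City)]
```

Rows of codes2 that fall in no range keep NA in the three new columns.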
If you're comfortable writing SQL, you might consider using the sqldf package to do something like:
library('sqldf')
result <- sqldf('select * from codes2 left join codes1 on codes2.No between codes1.StartNo and codes1.EndNo')
You may have to remove special characters and spaces from the column names of your data frames beforehand.
I have a set of data (10 columns, 1000 rows) that is indexed by an ID number that one or more of these rows can share. To give a small example to illustrate my point, consider this table:
ID Name Location
5014 John
5014 Kate California
5014 Jim
5014 Ryan California
5018 Pete
5018 Pat Indiana
5019 Jeff Arizona
5020 Chris Kentucky
5020 Mike
5021 Will Indiana
I need for all entries to have something in the Location field and I'm having a hell of a time trying to do it.
Things to note:
Every unique ID number has at least one row with the location field populated.
If two rows have the same ID number, they have the same location.
Two different ID numbers can have the same location.
ID numbers are not necessarily consecutive, nor are they necessarily completely numeric. The arrangement of them isn't of importance to me, since any rows that are related share the same ID number.
Any ideas for a solution? I'm currently using R with the data.table package, but I'm relatively new to it.
We can convert the 'data.frame' to a 'data.table' with setDT(df1). Then, grouped by 'ID', we take the elements of Location that are not '' and keep the first one (Location[Location != ''][1L]); the [1L] matters if a group has more than one non-blank element. Finally, we assign (:=) the result back to Location.
library(data.table)
setDT(df1)[, Location := Location[Location != ''][1L], by = ID][]
# ID Name Location
# 1: 5014 John California
# 2: 5014 Kate California
# 3: 5014 Jim California
# 4: 5014 Ryan California
# 5: 5018 Pete Indiana
# 6: 5018 Pat Indiana
# 7: 5019 Jeff Arizona
# 8: 5020 Chris Kentucky
# 9: 5020 Mike Kentucky
#10: 5021 Will Indiana
Or we can use setdiff as suggested by @Frank
setDT(df1)[, Location:= setdiff(Location,'')[1L], by = ID][]
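For readers who want to try the idea without data.table, here is a base R sketch of the same "first non-blank value per ID" logic using ave(). The data frame is typed in from the question, and IDs are stored as character because the question says they are not necessarily numeric.

```r
# Example data from the question; blank Location is the empty string
df1 <- data.frame(
  ID = c("5014", "5014", "5014", "5014", "5018", "5018",
         "5019", "5020", "5020", "5021"),
  Name = c("John", "Kate", "Jim", "Ryan", "Pete", "Pat",
           "Jeff", "Chris", "Mike", "Will"),
  Location = c("", "California", "", "California", "", "Indiana",
               "Arizona", "Kentucky", "", "Indiana"),
  stringsAsFactors = FALSE
)

# Within each ID, take the first non-blank Location; ave() recycles it
# across every row of that group
df1$Location <- ave(df1$Location, df1$ID,
                    FUN = function(x) x[x != ""][1])
```

This relies on the stated guarantee that every ID has at least one populated row; an ID with none would end up as NA.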
I have two dataframes. One is full of data about individuals, including their street name and house number but not their house size. The other has information about each house, including street name, house number and house size, but no data on the individuals living in that house. I'd like to add the size information to the first dataframe as a new column so I can see the house size for each individual.
I have over 200,000 individuals and around 100,000 houses, and the methods I've tried so far (cutting down the second dataframe for each individual) are painfully slow. Is there an efficient way to do this? Thank you.
Using @jazzurro's example, another option for larger datasets would be to use data.table:
library(data.table)
setkey(setDT(df1), street, num)
setkey(setDT(df2), street, num)
df2[df1]
# size street num person
#1: large liliha st 3 bob
#2: NA mahalo st 32 dan
#3: small makiki st 15 ana
#4: NA nehoa st 11 ellen
#5: medium nuuanu ave 8 cathy
Here is my suggestion. Based on your description, I created some sample data. In future, please try to provide sample data and your code: you are more likely to receive help, and it saves people time. You have two key variables for merging the two data frames: street name and house number. Here, I chose to keep all data points in df1.
df1 <- data.frame(person = c("ana", "bob", "cathy", "dan", "ellen"),
street = c("makiki st", "liliha st", "nuuanu ave", "mahalo st", "nehoa st"),
num = c(15, 3, 8, 32, 11),
stringsAsFactors = FALSE)
#person street num
#1 ana makiki st 15
#2 bob liliha st 3
#3 cathy nuuanu ave 8
#4 dan mahalo st 32
#5 ellen nehoa st 11
df2 <- data.frame(size = c("small", "large", "medium"),
street = c("makiki st", "liliha st", "nuuanu ave"),
num = c(15, 3, 8),
stringsAsFactors = FALSE)
# size street num
#1 small makiki st 15
#2 large liliha st 3
#3 medium nuuanu ave 8
library(dplyr)
# join on both key columns explicitly
left_join(df1, df2, by = c("street", "num"))
# street num person size
#1 makiki st 15 ana small
#2 liliha st 3 bob large
#3 nuuanu ave 8 cathy medium
#4 mahalo st 32 dan <NA>
#5 nehoa st 11 ellen <NA>
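If adding a package is not an option, base R's merge() can express the same left join. This is a sketch using the sample data above. `merged` is a name introduced here, and all.x = TRUE is what keeps the people whose house has no size record.

```r
# Sample data, as defined above
df1 <- data.frame(person = c("ana", "bob", "cathy", "dan", "ellen"),
                  street = c("makiki st", "liliha st", "nuuanu ave",
                             "mahalo st", "nehoa st"),
                  num = c(15, 3, 8, 32, 11),
                  stringsAsFactors = FALSE)
df2 <- data.frame(size = c("small", "large", "medium"),
                  street = c("makiki st", "liliha st", "nuuanu ave"),
                  num = c(15, 3, 8),
                  stringsAsFactors = FALSE)

# Left join on street and house number; unmatched rows get NA for size
merged <- merge(df1, df2, by = c("street", "num"), all.x = TRUE)
```

By default merge() returns the rows sorted by the key columns; pass sort = FALSE if the original order matters.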
I wonder if there is a simpler way than writing if...else... for the following case. I have a dataframe and I only want the rows where the number in column "percentage" is >= 95. Moreover, for one object, if there are multiple rows fitting this criterion, I only want the largest one(s). If more than one row ties for the largest, I would like to keep all of them.
For example:
object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75
Then I'd like the result to show:
object city street percentage
A NY Sun 100
A NY Waterfall 100
B CA Washington 98
I am able to do it using an if...else statement, but I feel there should be a smarter way to say: 1. >= 95; 2. if more than one, choose the largest; 3. if more than one largest, choose them all.
You can do this by creating a variable that indicates the rows that have the maximum percentage for each of the objects. We can then use this indicator to subset the data.
# your data
dat <- read.table(text = "object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75", header=TRUE, na.strings="", stringsAsFactors=FALSE)
# create an indicator to identify the rows that have the maximum
# percentage by object
id <- with(dat, ave(percentage, object, FUN=function(i) i==max(i)) )
# subset your data - keep rows that are at least 95 and have the
# maximum group percentage (given by id equal to one)
dat[dat$percentage >= 95 & id , ]
This works because the & (logical AND) expression creates a logical vector, which can then be used to subset the rows of dat.
dat$percentage >= 95 & id
#[1] TRUE FALSE TRUE TRUE FALSE FALSE
Or putting these together
with(dat, dat[percentage >= 95 & ave(percentage, object,
FUN=function(i) i==max(i)) , ])
# object city street percentage
# 1 A NY Sun 100
# 3 A NY Waterfall 100
# 4 B CA Washington 98
You could also do this in data.table using the same approach as @user20650
library(data.table)
setDT(dat)[dat[,percentage==max(percentage) & percentage >=95, by=object]$V1,]
# object city street percentage
#1: A NY Sun 100
#2: A NY Waterfall 100
#3: B CA Washington 98
Or using dplyr:
library(dplyr)
dat %>%
group_by(object) %>%
filter(percentage==max(percentage) & percentage >=95)
The following also works:
# keep rows at or above the 95 threshold
ddf2 = ddf[ddf$percentage >= 95, ]
# empty data frame with the same columns, used to collect the results
ddf3 = ddf2[0, ]
for(oo in unique(ddf2$object)){
  tempdf = ddf2[ddf2$object == oo, ]
  maxval = max(tempdf$percentage)
  # keep every row tied for the group maximum
  ddf3 = rbind(ddf3, tempdf[tempdf$percentage == maxval, ])
}
ddf3
object city street percentage
1 A NY Sun 100
3 A NY Waterfall 100
4 B CA Washington 98