R - Finding a specific row of a dataframe and then adding data from that row to another data frame - r

I have two dataframes: one full of data about individuals, including their street name and house number but not their house size, and another with information about each house, including street name, house number, and house size, but no data on the individuals living in that house. I'd like to add the size information to the first dataframe as a new column so I can see the house size for each individual.
I have over 200,000 individuals and around 100,000 houses, and the methods I've tried so far (cutting down the second dataframe for each individual) are painfully slow. Is there an efficient way to do this? Thank you.

Using @jazzurro's example, another option for larger datasets would be to use data.table:
library(data.table)
setkey(setDT(df1), street, num)
setkey(setDT(df2), street, num)
df2[df1]
# size street num person
#1: large liliha st 3 bob
#2: NA mahalo st 32 dan
#3: small makiki st 15 ana
#4: NA nehoa st 11 ellen
#5: medium nuuanu ave 8 cathy
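As an aside, newer versions of data.table also allow an ad hoc join via the `on` argument, so the setkey() calls can be skipped. A minimal sketch with made-up data in the same shape as the example:

```r
library(data.table)

# sample data, analogous to the question's df1/df2
df1 <- data.table(person = c("ana", "bob"),
                  street = c("makiki st", "mahalo st"),
                  num = c(15, 32))
df2 <- data.table(size = "small",
                  street = "makiki st",
                  num = 15)

# ad hoc join on street and num -- no setkey() needed
df2[df1, on = c("street", "num")]
```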

Here is my suggestion. Given what you described, I created sample data. However, next time please try to provide sample data yourself: when you provide sample data and your code, you are more likely to receive help, and you save people time. You have two key variables for merging the two data frames: street name and house number. Here, I chose to keep all data points in df1.
df1 <- data.frame(person = c("ana", "bob", "cathy", "dan", "ellen"),
                  street = c("makiki st", "liliha st", "nuuanu ave", "mahalo st", "nehoa st"),
                  num = c(15, 3, 8, 32, 11),
                  stringsAsFactors = FALSE)
#person street num
#1 ana makiki st 15
#2 bob liliha st 3
#3 cathy nuuanu ave 8
#4 dan mahalo st 32
#5 ellen nehoa st 11
df2 <- data.frame(size = c("small", "large", "medium"),
                  street = c("makiki st", "liliha st", "nuuanu ave"),
                  num = c(15, 3, 8),
                  stringsAsFactors = FALSE)
# size street num
#1 small makiki st 15
#2 large liliha st 3
#3 medium nuuanu ave 8
library(dplyr)
left_join(df1, df2)
# street num person size
#1 makiki st 15 ana small
#2 liliha st 3 bob large
#3 nuuanu ave 8 cathy medium
#4 mahalo st 32 dan <NA>
#5 nehoa st 11 ellen <NA>
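For completeness, the same left join can be written in base R with merge(); all.x = TRUE keeps every row of df1, filling size with NA where no house matches. A sketch with a cut-down version of the sample data:

```r
df1 <- data.frame(person = c("ana", "dan"),
                  street = c("makiki st", "mahalo st"),
                  num = c(15, 32),
                  stringsAsFactors = FALSE)
df2 <- data.frame(size = "small",
                  street = "makiki st",
                  num = 15,
                  stringsAsFactors = FALSE)

# left join: keep all individuals, attach house size where available
merge(df1, df2, by = c("street", "num"), all.x = TRUE)
```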


Join each term with list of keywords

This is probably a simple problem, and you can help me quickly.
I have a vector with all the terms contained in a list of keywords. Now I want to join each term with all keywords that contain that term. Here's an example:
vec <- c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat", …)
keywords <- c("small boat tour", "a house on the river", "a houseboat", …)
The expected result looks like:
keywords terms
small boat tour small
small boat tour boat
small boat tour tour
a house on the river a
a house on the river house
a house on the river on
a house on the river the
a house on the river river
a houseboat a
a houseboat houseboat
You can use expand.grid to get all combinations, wrap the words of vec in word boundaries, then grepl and filter, i.e.
df1 <- expand.grid(vec, keywords)
df1[mapply(grepl, paste0('\\b' ,df1$Var1, '\\b'), df1$Var2),]
Var1 Var2
1 small small boat tour
2 boat small boat tour
5 tour small boat tour
12 river a house on the river
13 house a house on the river
15 a a house on the river
16 on a house on the river
17 the a house on the river
24 a a houseboat
27 houseboat a houseboat
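The word boundaries are what keep partial matches out: without \\b, a term like "house" would also match inside "houseboat". A quick illustration:

```r
grepl("house", "a houseboat")           # TRUE: plain substring match
grepl("\\bhouse\\b", "a houseboat")     # FALSE: "house" is not a whole word here
grepl("\\bhouseboat\\b", "a houseboat") # TRUE
```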
You can do a fuzzyjoin::fuzzy_join using stringr::str_detect as the matching function, and adding \\b word boundaries to each word in vec.
vec <- data.frame(terms = c("small", "boat", "river", "house", "tour", "a", "on", "the", "houseboat"))
keywords <- data.frame(keywords = c("small boat tour", "a house on the river", "a houseboat"))
fuzzyjoin::fuzzy_inner_join(keywords, vec, by = c("keywords" = "terms"),
                            match_fun = \(x, y) stringr::str_detect(x, paste0("\\b", y, "\\b")))
Output:
keywords terms
1 small boat tour small
2 small boat tour boat
3 small boat tour tour
4 a house on the river river
5 a house on the river house
6 a house on the river a
7 a house on the river on
8 a house on the river the
9 a houseboat a
10 a houseboat houseboat
Another way is to use strsplit and intersect.
. <- lapply(strsplit(keywords, " ", TRUE), intersect, vec)
data.frame(keywords = rep(keywords, lengths(.)), terms = unlist(.))
# keywords terms
#1 small boat tour small
#2 small boat tour boat
#3 small boat tour tour
#4 a house on the river a
#5 a house on the river house
#6 a house on the river on
#7 a house on the river the
#8 a house on the river river
#9 a houseboat a
#10 a houseboat houseboat
In case vec contains all the terms, there is no need for a join.
. <- lapply(strsplit(keywords, " ", TRUE), unique)
data.frame(keywords = rep(keywords, lengths(.)), terms = unlist(.))
The expected result can be reproduced by
library(data.table)
data.table(keywords)[, .(terms = tstrsplit(keywords, "\\W+")), by = keywords]
keywords terms
1: small boat tour small
2: small boat tour boat
3: small boat tour tour
4: a house on the river a
5: a house on the river house
6: a house on the river on
7: a house on the river the
8: a house on the river river
9: a houseboat a
10: a houseboat houseboat
This is a rather simple answer which is an interpretation of OP's sentence
I have a vector with all the terms contained in a list of keywords.
The important point is all the terms. So, it is assumed that we just can split the keywords into separate terms.
Note that the regex \\W+ is used to separate the terms in case there is more than one non-word character between the terms, e.g., ", ".
However, in case the vector does not contain all terms intentionally we need to subset the result, e.g.
vec <- c("small", "boat", "river", "house", "tour", "houseboat")
data.table(keywords)[, .(terms = tstrsplit(keywords, "\\W+")), by = keywords][
  terms %in% vec]
keywords terms
1: small boat tour small
2: small boat tour boat
3: small boat tour tour
4: a house on the river house
5: a house on the river river
6: a houseboat houseboat
Here is a base R one-liner with strsplit + stack
> with(keywords, rev(stack(setNames(strsplit(keywords, " "), keywords))))
ind values
1 small boat tour small
2 small boat tour boat
3 small boat tour tour
4 a house on the river a
5 a house on the river house
6 a house on the river on
7 a house on the river the
8 a house on the river river
9 a houseboat a
10 a houseboat houseboat

put the resulting values from a for loop into a table in R [duplicate]

This question already has an answer here:
Using Reshape from wide to long in R [closed]
(1 answer)
Closed 2 years ago.
I'm trying to calculate the total number of matches played by each team in the year 2019 and put them in a table along with the corresponding team names.
teams <- c("Sunrisers Hyderabad", "Mumbai Indians", "Gujarat Lions", "Rising Pune Supergiants",
           "Royal Challengers Bangalore", "Kolkata Knight Riders", "Delhi Daredevils",
           "Kings XI Punjab", "Deccan Chargers", "Rajasthan Royals", "Chennai Super Kings",
           "Kochi Tuskers Kerala", "Pune Warriors", "Delhi Capitals", " Gujarat Lions")
for (j in teams) {
  print(j)
  ipl_table %>%
    filter(season == 2019 & (team1 == j | team2 == j)) %>%
    summarise(match_count = n()) -> kl
  print(kl)
  match_played <- data.frame(Teams = teams, Match_count = kl)
}
The matches played by the last team (i.e., Gujarat Lions) is 0, and it's filling 0s for all the other teams as well.
The output match_played can be found on the link given below.
I'd be really glad if someone could help me regarding this error as I'm very new to R.
Filter for the particular season, get the data in long format, and then count the number of matches.
library(dplyr)
matches %>%
  filter(season == 2019) %>%
  tidyr::pivot_longer(cols = c(team1, team2), values_to = 'team_name') %>%
  count(team_name) -> result
result
# team_name n
# <chr> <int>
#1 Chennai Super Kings 17
#2 Delhi Capitals 16
#3 Kings XI Punjab 14
#4 Kolkata Knight Riders 14
#5 Mumbai Indians 16
#6 Rajasthan Royals 14
#7 Royal Challengers Bangalore 14
#8 Sunrisers Hyderabad 15
Here is an example:
library(tidyr)
df_2019 <- matches[matches$season == 2019, ] # get the season you need
df_long <- gather(df_2019, Team_id, Team_Name, team1:team2) # Make it long format
final_count <- data.frame(t(table(df_long$Team_Name)))[-1] # count the number of matches
names(final_count) <- c("Team", "Matches")
Team Matches
1 Chennai Super Kings 17
2 Delhi Capitals 16
3 Kings XI Punjab 14
4 Kolkata Knight Riders 14
5 Mumbai Indians 16
6 Rajasthan Royals 14
7 Royal Challengers Bangalore 14
8 Sunrisers Hyderabad 15
Or by using base R
final_count <- data.frame(t(table(c(df_2019$team1, df_2019$team2))))[-1]
names(final_count) <- c("Team", "Matches")
final_count
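The t()/[-1] step can be avoided: calling as.data.frame() on a one-dimensional table already gives the two columns directly. A sketch with made-up team columns:

```r
team1 <- c("A", "B", "A")
team2 <- c("B", "C", "C")

# one row per team, with its number of appearances in either column
final_count <- as.data.frame(table(c(team1, team2)))
names(final_count) <- c("Team", "Matches")
final_count
```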

How to collapse many records into one while removing NA values

Say I have the following dataframe df
name <- c("Bill", "Rob", "Joe", "Joe")
address <- c("123 Main St", "234 Broad St", NA, "456 North Ave")
favteam <- c("Dodgers", "Mets", "Pirates", NA)
df <- data.frame(name = name,
                 address = address,
                 favteam = favteam)
df
Which looks like:
name address favteam
1 Bill 123 Main St Dodgers
2 Rob 234 Broad St Mets
3 Joe <NA> Pirates
4 Joe 456 North Ave <NA>
What I want to do is collapse (coalesce) rows by name (or in general, any number of grouping variables) and have any other value than NA replace the NA value in the final data, like so:
df_collapse <- foo(df)
name address favteam
1 Bill 123 Main St Dodgers
2 Rob 234 Broad St Mets
3 Joe 456 North Ave Pirates
Here's an option with dplyr:
library(dplyr)
df %>%
group_by(name) %>%
summarise_each(funs(first(.[!is.na(.)]))) # or summarise_each(funs(first(na.omit(.))))
#Source: local data frame [3 x 3]
#
# name address favteam
#1 Bill 123 Main St Dodgers
#2 Joe 456 North Ave Pirates
#3 Rob 234 Broad St Mets
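Note that summarise_each()/funs() have since been deprecated; in current dplyr the same idea would be written with across() (a sketch, using a cut-down version of the sample data):

```r
library(dplyr)

df <- data.frame(name = c("Joe", "Joe"),
                 address = c(NA, "456 North Ave"),
                 favteam = c("Pirates", NA))

# per group, keep the first non-NA value in every column
df %>%
  group_by(name) %>%
  summarise(across(everything(), ~ first(na.omit(.x))))
```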
And with data.table:
library(data.table)
setDT(df)[, lapply(.SD, function(x) x[!is.na(x)][1L]), by = name]
# name address favteam
#1: Bill 123 Main St Dodgers
#2: Rob 234 Broad St Mets
#3: Joe 456 North Ave Pirates
Or
setDT(df)[, lapply(.SD, function(x) head(na.omit(x), 1L)), by = name]
Edit:
You say in your actual data you have varying numbers of non-NA responses per name. In that case, the following approach may be helpful.
Consider this modified sample data (look at last row):
name <- c("Bill", "Rob", "Joe", "Joe", "Joe")
address <- c("123 Main St", "234 Broad St", NA, "456 North Ave", "123 Boulevard")
favteam <- c("Dodgers", "Mets", "Pirates", NA, NA)
df <- data.frame(name = name,
                 address = address,
                 favteam = favteam)
df
# name address favteam
#1 Bill 123 Main St Dodgers
#2 Rob 234 Broad St Mets
#3 Joe <NA> Pirates
#4 Joe 456 North Ave <NA>
#5 Joe 123 Boulevard <NA>
Then, you can use this data.table approach to get the non-NA responses that can be varying in number by name:
setDT(df)[, lapply(.SD, function(x) unique(na.omit(x))), by = name]
# name address favteam
#1: Bill 123 Main St Dodgers
#2: Rob 234 Broad St Mets
#3: Joe 456 North Ave Pirates
#4: Joe 123 Boulevard Pirates

Find all largest values in a range, across different objects in data frame

I wonder if there is a simpler way than writing if...else... for the following case. I have a dataframe, and I only want the rows with a number in column "percentage" >= 95. Moreover, for one object, if there are multiple rows fitting this criterion, I only want the largest one(s). If there is more than one largest, I would like to keep all of them.
For example:
object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75
Then I'd like the result shows:
object city street percentage
A NY Sun 100
A NY Waterfall 100
B CA Washington 98
I am able to do it using if...else statements, but I feel there should be some smarter way to say: 1. >= 95; 2. if more than one, choose the largest; 3. if more than one largest, choose them all.
You can do this by creating a variable that indicates the rows that have the maximum percentage for each of the objects. We can then use this indicator to subset the data.
# your data
dat <- read.table(text = "object city street percentage
A NY Sun 100
A NY Malino 97
A NY Waterfall 100
B CA Washington 98
B WA Lieber 95
C NA Moon 75", header=TRUE, na.strings="", stringsAsFactors=FALSE)
# create an indicator to identify the rows that have the maximum
# percentage by object
id <- with(dat, ave(percentage, object, FUN=function(i) i==max(i)) )
# subset your data - keep rows that are greater than 95 and have the
# maximum group percentage (given by id equal to one)
dat[dat$percentage >= 95 & id , ]
This works because the combined condition creates a logical vector, which can then be used to subset the rows of dat.
dat$percentage >= 95 & id
#[1] TRUE FALSE TRUE TRUE FALSE FALSE
Or putting these together
with(dat, dat[percentage >= 95 & ave(percentage, object,
                                     FUN = function(i) i == max(i)), ])
# object city street percentage
# 1 A NY Sun 100
# 3 A NY Waterfall 100
# 4 B CA Washington 98
You could do this also in data.table using the same approach as @user20650:
library(data.table)
setDT(dat)[dat[,percentage==max(percentage) & percentage >=95, by=object]$V1,]
# object city street percentage
#1: A NY Sun 100
#2: A NY Waterfall 100
#3: B CA Washington 98
Or using dplyr
dat %>%
group_by(object) %>%
filter(percentage==max(percentage) & percentage >=95)
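With more recent dplyr, the largest-per-group step can also be expressed with slice_max(), which keeps ties by default. A sketch on a minimal version of the data:

```r
library(dplyr)

dat <- data.frame(object = c("A", "A", "A", "B", "C"),
                  percentage = c(100, 97, 100, 98, 75))

dat %>%
  filter(percentage >= 95) %>%  # rule 1: at least 95
  group_by(object) %>%
  slice_max(percentage)         # rules 2 and 3: the largest, keeping ties
```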
The following works:
ddf2 = ddf[ddf$percentage >= 95, ]
ddf3 = ddf2[-c(1:nrow(ddf2)), ]  # empty frame with the same columns
for (oo in unique(ddf2$object)) {
  tempdf = ddf2[ddf2$object == oo, ]
  maxval = max(tempdf$percentage)
  tempdf = tempdf[tempdf$percentage == maxval, ]
  for (i in 1:nrow(tempdf)) ddf3[nrow(ddf3) + 1, ] = tempdf[i, ]
}
ddf3
object city street percentage
1 A NY Sun 100
3 A NY Waterfall 100
4 B CA Washington 98

Lookup values in a vectorized way

I keep reading about the importance of vectorized functionality so hopefully someone can help me out here.
Say I have a data frame with two columns: name and ID. Now I also have another data frame with name and birthplace, but this data frame is much larger than the first and contains some but not all of the names from the first data frame. How can I add a third column to the first table that is populated with birthplaces looked up using the second table?
What I have is now is:
corresponding.birthplaces <- sapply(table1$Name,
                                    function(name) return(table2$Birthplace[table2$Name == name]))
This seems inefficient. Thoughts? Does anyone know of a good book/resource for using R 'properly'? I get the feeling that I generally think in the least computationally efficient manner conceivable.
Thanks :)
See ?merge, which will perform a database-style merge or join.
Here is an example:
set.seed(2)
d1 <- data.frame(ID = 1:5, Name = c("Bill","Bob","Jessica","Jennifer","Robyn"))
d2 <- data.frame(Name = c("Bill", "Gavin", "Bob", "Joris", "Jessica", "Andrie",
                          "Jennifer", "Joshua", "Robyn", "Iterator"),
                 Birthplace = sample(c("London", "New York",
                                       "San Francisco", "Berlin",
                                       "Tokyo", "Paris"), 10, rep = TRUE))
which gives:
> d1
ID Name
1 1 Bill
2 2 Bob
3 3 Jessica
4 4 Jennifer
5 5 Robyn
> d2
Name Birthplace
1 Bill New York
2 Gavin Tokyo
3 Bob Berlin
4 Joris New York
5 Jessica Paris
6 Andrie Paris
7 Jennifer London
8 Joshua Paris
9 Robyn San Francisco
10 Iterator Berlin
Then we use merge() to do the join:
> merge(d1, d2)
Name ID Birthplace
1 Bill 1 New York
2 Bob 2 Berlin
3 Jennifer 4 London
4 Jessica 3 Paris
5 Robyn 5 San Francisco
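If you only need the one lookup column and want to preserve the first table's row order, match() is a vectorized alternative to merge(); unmatched names simply produce NA. A sketch with a smaller version of the data:

```r
d1 <- data.frame(ID = 1:2, Name = c("Bill", "Bob"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(Name = c("Gavin", "Bill", "Bob"),
                 Birthplace = c("Tokyo", "New York", "Berlin"),
                 stringsAsFactors = FALSE)

# match() gives the position of each d1 name in d2; use it to index Birthplace
d1$Birthplace <- d2$Birthplace[match(d1$Name, d2$Name)]
d1
```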
