Remove rows based on several combinations in multiple columns - R

In R, I need to remove rows matching multiple combinations of values in 2 character columns. I have searched for a solution, but most questions either deal with duplicates or try to remove only one combination, not multiple combinations based on two columns. Here is an example df:
Species Harvest Hunt.Type
Sheep         1       Gun
Goat          4       Bow
Turkey        3       Gun
Pig           2       Bow
Quail         6       Bow
Here, I would need to remove any row that has a mammal species with "Gun" in the Hunt.Type, and any row that has a bird species with a "Bow" Hunt.Type. So I would want to end up with this:
Species Harvest Hunt.Type
Goat          4       Bow
Turkey        3       Gun
Pig           2       Bow
My data frame is much larger than this with 13 species and many more columns and rows.
I tried subsetting on conditions in base R as well as dplyr, but I could not figure it out because of the added complication of the multiple species combinations.
I tried something like this in base R:
df[df$Species == c("Goat", "Sheep", "Pig") &
   df$Hunt.Type == "Gun", ]
But for some reason that code omits some rows that meet those conditions and keeps others. For dplyr, I have not been able to get anything even close.

I would suggest creating a vector of all birds or all mammals (whichever is shorter) and then filtering. We only need one of the two groups, since the other is just its complement, assuming that any Species that is not a bird is a mammal and vice versa. (As an aside, your base R attempt misbehaves because == compares two vectors element-wise, recycling the shorter one; %in% is the operator for set membership.)
So for this example, we can do
bird.keywords <- c("Turkey", "Quail")
df[!with(df, (Species %in% bird.keywords & Hunt.Type == "Bow") |
             (!Species %in% bird.keywords & Hunt.Type == "Gun")), ]
#   Species Harvest Hunt.Type
# 2    Goat       4       Bow
# 3  Turkey       3       Gun
# 4     Pig       2       Bow
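And since you asked about dplyr: a minimal sketch of the equivalent filter, reusing the same bird.keywords vector:
library(dplyr)
df %>%
  filter(!((Species %in% bird.keywords & Hunt.Type == "Bow") |
           (!Species %in% bird.keywords & Hunt.Type == "Gun")))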

Related

Find every combination of elements in a column of a dataframe which adds up to a given sum in R

I'm trying to make my life easier by writing a menu creator, which is supposed to assemble a weekly menu from a list of my favourite dishes, in order to get a little more variety into my life.
I gave every dish a value for roughly how many days it lasts and tried to arrange the dishes into menus worth 7 days of food.
I've already tried knapsack-function solutions from here, including dynamic programming, but I'm not experienced enough to get the hang of them. The problem is that all of these solutions target only the single most efficient option, not every combination that fills the knapsack.
library(adagio)
# create some data
dish <- c('Schnitzel','Burger','Steak','Salad','Falafel','Salmon','Mashed potatoes','MacnCheese','Hot Dogs')
days_the_food_lasts <- c(2,2,1,1,3,1,2,2,4)
price_of_the_food <- c(20,20,40,10,15,18,10,15,15)
data <- data.frame(dish, days_the_food_lasts, price_of_the_food)
# give each dish a distinct id
data$rownumber <- 1:nrow(data)
# set the limit for how many days should be covered by the dishes
food_needed_for_days <- 7
# knapsack() from the adagio package as an example; every other knapsack solution I found behaves the same way
most_expensive_food <- knapsack(days_the_food_lasts, price_of_the_food, food_needed_for_days)
data[data$rownumber %in% most_expensive_food$indices, ]
# output
             dish days_the_food_lasts price_of_the_food rownumber
1       Schnitzel                   2                20         1
2          Burger                   2                20         2
3           Steak                   1                40         3
4           Salad                   1                10         4
6          Salmon                   1                18         6
Simplified:
I need a solution to a single-objective knapsack problem which returns all possible combinations of dishes that add up to 7 days of food.
Thank you very much in advance
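Since the dish list is small, here is a brute-force sketch (mine, not from the original thread): with only 9 dishes there are 2^9 - 1 = 511 non-empty subsets, so you can enumerate them all with combn and keep those whose days sum to exactly 7.
# enumerate every non-empty subset of dishes and keep those summing to exactly 7 days
target <- 7
hits <- unlist(lapply(seq_along(dish), function(k) {
  combos <- combn(seq_along(dish), k, simplify = FALSE)
  Filter(function(idx) sum(days_the_food_lasts[idx]) == target, combos)
}), recursive = FALSE)
# each element of hits is a vector of row indices forming a valid 7-day menu
lapply(hits, function(idx) data[idx, ])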

Fuzzy and exact match of two databases

I have two databases. The first one has about 70k rows and 3 columns. The second one has 790k rows and 2 columns. Both databases have a common variable, grantee_name. I want to match each row of the first database to one or more rows of the second database based on this grantee_name. Note that merge will not work, because the grantee_name values do not match perfectly: there are different spellings, etc. So I am using the fuzzyjoin package and trying the following:
library("haven"); library("fuzzyjoin"); library("dplyr")
forfuzzy<-read_dta("/path/forfuzzy.dta")
filings <- read_dta ("/path/filings.dta")
> head(forfuzzy)
# A tibble: 6 x 3
grantee_name grantee_city grantee_state
<chr> <chr> <chr>
1 (ICS)2 MAINE CHAPTER CLEARWATER FL
2 (SUFFOLK COUNTY) VANDERBILT~ CENTERPORT NY
3 1 VOICE TREKKING A FUND OF ~ WESTMINSTER MD
4 10 CAN NEWBERRY FL
5 10 THOUSAND WINDOWS LIVERMORE CA
6 100 BLACK MEN IN CHICAGO INC CHICAGO IL
rows 7-70000 omitted for brevity
> head(filings)
# A tibble: 6 x 2
grantee_name ein
<chr> <dbl>
1 ICS-2 MAINE CHAPTER 123456
2 SUFFOLK COUNTY VANDERBILT 654321
3 VOICE TREKKING A FUND OF VOICES 789456
4 10 CAN 654987
5 10 THOUSAND MUSKETEERS INC 789123
6 100 BLACK MEN IN HOUSTON INC 987321
rows 7-790000 omitted for brevity
The above examples are clear enough to provide some good matches and some not-so-good matches. Note that, for example, 10 THOUSAND WINDOWS will match best with 10 THOUSAND MUSKETEERS INC but it does not mean it is a good match. There will be a better match somewhere in the filings data (not shown above). That does not matter at this stage.
So, I have tried the following:
df <- as.data.frame(stringdist_inner_join(forfuzzy, filings, by = "grantee_name",
                                          method = "jw", p = 0.1, max_dist = 0.1,
                                          distance_col = "distance"))
Totally new to R. This results in an error: cannot allocate vector of size 375GB (with the big database, of course). A sample of 100 rows from forfuzzy always works, so I thought of iterating over chunks of 100 rows at a time.
I have tried the following:
n <- 100
lst <- split(forfuzzy, cumsum((1:nrow(forfuzzy) - 1) %% n == 0))
df <- as.data.frame(lapply(lst, function(df_) {
  stringdist_inner_join(df_, filings, by = "grantee_name", method = "jw",
                        p = 0.1, max_dist = 0.1, distance_col = "distance",
                        nthread = getOption("sd_num_thread"))
}) %>% bind_rows)
I have also tried the above with mclapply instead of lapply. The same error occurs even on a high-performance cluster with 3 CPUs, each with 480G of memory, using mclapply with mc.cores=3. Perhaps a foreach command could help, but I have no idea how to implement it.
I have been advised to use the purrr and repurrrsive packages, so I try the following:
purrr::map(lst, ~ stringdist_inner_join(., filings, by = "grantee_name",
                                        method = "jw", p = 0.1, max_dist = 0.1,
                                        distance_col = "distance",
                                        nthread = getOption("sd_num_thread")))
This seems to be working, after a novice error in the by=grantee_name statement. However, it is taking forever, and I am not sure it will finish. A sample of 100 rows from forfuzzy, split with n=10 (so 10 lists of 10 rows each), has been running for 50 minutes with still no results.
If you split (with base::split or dplyr::group_split) your data frame of unique grantees into a list of data frames, then you can call purrr::map on the list. (map is pretty much lapply.)
purrr::map(list_of_dfs, ~ stringdist_inner_join(., filings, by = "grantee_name",
                                                method = "jw", p = 0.1, max_dist = 0.1,
                                                distance_col = "distance"))
Your result will be a list of data frames, each fuzzy-joined with filings. You can then call bind_rows (or use map_dfr instead) to get all the results in the same data frame again.
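For example, a minimal sketch of the map_dfr variant (same arguments as above, collapsing the list in one step):
results <- purrr::map_dfr(list_of_dfs,
  ~ stringdist_inner_join(., filings, by = "grantee_name", method = "jw",
                          p = 0.1, max_dist = 0.1, distance_col = "distance"))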
See R - Splitting a large dataframe into several smaller dataframes, performing fuzzyjoin on each one and outputting to a single dataframe
I haven't used foreach before but maybe the variable x is already the individual rows of zz1?
Have you tried:
stringdist_inner_join(x, zz2, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance")
?
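If you do want to try foreach, here is a hedged sketch (assuming lst is the list of 100-row chunks created above and 3 worker cores; not tested on data of your size):
library(foreach)
library(doParallel)
library(dplyr)  # for bind_rows
registerDoParallel(cores = 3)
# fuzzy-join each chunk against filings in parallel, then row-bind the pieces
result <- foreach(chunk = lst, .combine = bind_rows,
                  .packages = c("fuzzyjoin", "dplyr")) %dopar% {
  stringdist_inner_join(chunk, filings, by = "grantee_name", method = "jw",
                        p = 0.1, max_dist = 0.1, distance_col = "distance")
}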

Categorizing types of duplicates in R

Let's say I have the following data frame:
df <- data.frame(
  address = c('654 Peachtree St','890 River Rd','890 River Rd','890 River Rd',
              '1234 Main St','1234 Main St','567 1st Ave','567 1st Ave'),
  city    = c('Atlanta','Eugene','Eugene','Eugene','Portland','Portland','Pittsburgh','Etna'),
  state   = c('GA','OR','OR','OR','OR','OR','PA','PA'),
  zip5    = c('30308','97404','97404','97404','97201','97201','15223','15223'),
  zip9    = c('30308-1929','97404-3253','97404-3253','97404-3253','97201-5717',
              '97201-5000','15223-2105','15223-2105'),
  stringsAsFactors = FALSE)
           address       city state  zip5       zip9
1 654 Peachtree St    Atlanta    GA 30308 30308-1929
2     890 River Rd     Eugene    OR 97404 97404-3253
3     890 River Rd     Eugene    OR 97404 97404-3253
4     890 River Rd     Eugene    OR 97404 97404-3253
5     1234 Main St   Portland    OR 97201 97201-5717
6     1234 Main St   Portland    OR 97201 97201-5000
7      567 1st Ave Pittsburgh    PA 15223 15223-2105
8      567 1st Ave       Etna    PA 15223 15223-2105
I'm considering any rows with a matching address and zip5 to be duplicates.
Filtering out or keeping duplicates based on these two columns is simple enough in R. What I'm trying to do is create a new column with a conditional label for each set of duplicates, ending up with something similar to this:
       address       city state  zip5       zip9           type
1 890 River Rd     Eugene    OR 97404 97404-3253    Exact Match
2 890 River Rd     Eugene    OR 97404 97404-3253    Exact Match
3 890 River Rd     Eugene    OR 97404 97404-3253    Exact Match
4 1234 Main St   Portland    OR 97201 97201-5717 Different Zip9
5 1234 Main St   Portland    OR 97201 97201-5000 Different Zip9
6  567 1st Ave Pittsburgh    PA 15223 15223-2105 Different City
7  567 1st Ave       Etna    PA 15223 15223-2105 Different City
(I'd also be fine with a True/False column for each type of duplicate.)
I'm assuming the solution will be in some mutate+ifelse+boolean code, but I think it's the comparing within each duplicate subset that has me stuck...
Any advice?
Edit:
I don't believe this is a duplicate of Find duplicated rows (based on 2 columns) in Data Frame in R. I can use that solution to create a T/F column for each type of duplicate/group_by match, but I'm trying to create exclusive categories. How could my conditions also take differences into account? The exact-match rows should show TRUE only in the "Exact Match" column and FALSE in every other column. If I define my categories simply by feeding different combinations of columns to group_by, the exact-match rows will never return a FALSE.
I think the key is grouping by a "reference" variable (here address makes sense) and then counting the number of unique items in each group. It's not a perfect solution, since case_when prioritizes earlier conditions: if one address has two different cities AND two different zip codes, you'll only see that there are two different cities, and you'd need additional case_when branches if that matters. Still, counting unique values is a reasonable heuristic if you don't need a perfectly granular solution.
df %>%
  group_by(address) %>%
  mutate(match_type = case_when(
    # city, state, zip5 and zip9 all agree within this address group
    all(length(unique(city)) == 1,
        length(unique(state)) == 1,
        length(unique(zip5)) == 1,
        length(unique(zip9)) == 1) ~ "Exact Match",
    length(unique(city)) > 1 ~ "Different City",
    length(unique(state)) > 1 ~ "Different State",
    length(unique(zip5)) > 1 ~ "Different Zip5",
    length(unique(zip9)) > 1 ~ "Different Zip9"
  ))
Otherwise, you'll have to do iterative grouping (address + other variable) and mutate in a Boolean column as you alluded to.
Edit
One additional approach I just thought of, if you need a more granular solution: add an id column (df %>% rowid_to_column("ID")), full-join the table to itself by address with suffixes (e.g. suffix = c("a","b")), filter out rows matched to themselves and call distinct (since each comparison appears twice), and then mutate in Boolean columns for the pairwise comparisons; see the sketch below. It may be too computationally intensive depending on the size of your dataset, but it should work on the scale of a few thousand rows with a reasonable amount of RAM.
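A minimal sketch of that idea (the flag names same_city/same_zip9 are illustrative; rowid_to_column() comes from the tibble package, and ID.a < ID.b replaces the filter-plus-distinct step by keeping each pair exactly once):
library(dplyr)
library(tibble)

ids <- df %>% rowid_to_column("ID")
pairs <- ids %>%
  full_join(ids, by = "address", suffix = c(".a", ".b")) %>%
  filter(ID.a < ID.b) %>%  # drops self-matches and mirror-image duplicates
  mutate(same_city = city.a == city.b,
         same_zip9 = zip9.a == zip9.b)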

Fuzzyjoin match based on two different columns instead of one?

I would like to ask a question about the fuzzyjoin package. I am very new to R, and I promise I have read through the README and worked through the examples at https://cran.r-project.org/web/packages/fuzzyjoin/index.html before asking this question.
I have a list of vernacular names which I want to match with plant species names. A simplified version of my list is below. data1 has a LocalName column containing many typos of vernacular names. data2 is the table with the correct local names and species on which the matching should be based.
data1 <- data.frame(Item = 1:5,
                    LocalName = c("BACTERIA F", "BAHIA", "BAIKEA", "BAIKIA", "BAIKIAEA SP"))
data1
Item LocalName
1 1 BACTERIA F
2 2 BAHIA
3 3 BAIKEA
4 4 BAIKIA
5 5 BAIKIAEA SP
data2 <- data.frame(LocalName = c("ENGOKOM","BAHIA","BAIKIA","BANANIER","BALANITES"),
                    Species = c("Barteria fistulosa","Mitragyna spp","Baikiaea spp",
                                "Musa spp","Balanites wilsoniana"))
data2
LocalName Species
1 ENGOKOM Barteria fistulosa
2 BAHIA Mitragyna spp
3 BAIKIA Baikiaea spp
4 BANANIER Musa spp
5 BALANITES Balanites wilsoniana
I tried using the stringdist_left_join function, and it managed to match many species correctly. I am being conservative by setting max_dist=1 because in my list, many vernacular names are very similar.
library(fuzzyjoin)
library(fuzzyjoin)
table <- data1 %>%
  stringdist_left_join(data2, by = c(LocalName = "LocalName"), max_dist = 1)
table
Item LocalName.x LocalName.y Species
1 1 BACTERIA F <NA> <NA>
2 2 BAHIA BAHIA Mitragyna spp
3 3 BAIKEA BAIKIA Baikiaea spp
4 4 BAIKIA BAIKIA Baikiaea spp
5 5 BAIKIAEA SP <NA> <NA>
However, I have one question. As you can see from data1, Item 5, BAIKIAEA SP, actually matches the Species column of data2 rather than LocalName. I have many entries like this, where the LocalName in data1 is a typo of either a vernacular name or a species name; however, I am not sure how to make stringdist_left_join match two columns of data2 against one column of data1. I tried modifying the code into something like this:
table <- data1 %>%
  stringdist_left_join(data2, by = c(LocalName = "LocalName" | "Species"), max_dist = 1)
but it did not work, citing "Error in "LocalName" | "Species" : operations are possible only for numeric, logical or complex types". Does anyone know whether such matching is possible? Thanks in advance!
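The by= argument only accepts column pairs, so it cannot match one column against two at once. A hedged workaround (a sketch, not from the original thread and not tested on the full data): run two fuzzy joins, one against each column of data2, then fall back on the Species match wherever the LocalName match came up empty. Note that data2$Species needs toupper() first, since case differences count toward the string distance.
library(fuzzyjoin)
library(dplyr)

# upper-case the species names so case doesn't inflate the distance
data2u <- data2 %>% mutate(Species_upper = toupper(Species))

# join 1: LocalName against LocalName
by_local <- stringdist_left_join(data1, data2,
                                 by = c(LocalName = "LocalName"), max_dist = 1)
# join 2: LocalName against the upper-cased Species
by_species <- stringdist_left_join(data1, data2u,
                                   by = c(LocalName = "Species_upper"), max_dist = 1)
# prefer the LocalName match; fall back to the Species match
result <- by_local %>%
  left_join(by_species %>% select(Item, Species_alt = Species), by = "Item") %>%
  mutate(Species = coalesce(Species, Species_alt)) %>%
  select(-Species_alt)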

Aggregating frequencies with multiple columns in R

I am working with a dataframe in R that has three columns: House, Appliance, and Count. The data is essentially an inventory of the different types of kitchen appliances contained within each house on a block. The data look something like this: (spaces added for illustrative purposes)
House Appliance Count
1 Toaster 2
2 Dishwasher 1
2 Toaster 1
2 Refrigerator 1
2 Toaster 1
3 Dishwasher 1
3 Oven 1
For each appliance type, I would like to be able to compute the proportion of houses containing at least one of those appliances. Note that in my data, it is possible for a single house to have zero, one, or multiple appliances in a single category. If a house does not have an appliance, it is not listed in the data for that house. If the house has more than one appliance, the appliance could be listed once with a count >1 (e.g., toasters in House 1), or it could be listed twice (each with count = 1, e.g., toasters in House 2).
As an example showing what I am trying to compute, in the data shown here, the proportion of houses with toasters would be .67 (rounded) because 2/3 of the houses have at least one toaster. Similarly, the proportion of houses with ovens would be 0.33 (since only 1/3 of the houses have an oven). I do not care that any of the houses have more than one toaster -- only that they have at least one.
I have fooled around with xtabs and ftable in R but am not confident that they provide the simplest solution. Part of the problem is that these functions will provide the number of appliances for each house, which then throws off my proportion of houses calculations. Here's my current approach:
temp1 <- xtabs(~ House + Appliance, data = housedata)
temp1[temp1 > 1] <- 1  # needed to correct houses with more than one unit
proportion.of.houses <- data.frame(margin.table(temp1, 2) / 3)  # 3 = number of houses
This appears to work but it's not elegant. I'm guessing there is a better way to do this in R. Any suggestions much appreciated.
library(data.table)
setDT(df)
n.houses <- length(unique(df$House))
df[, length(unique(House)) / n.houses, by = Appliance]
Or with dplyr:
library(dplyr)
n <- length(unique(df$House))
df %>%
  group_by(Appliance) %>%
  summarise(freq = n_distinct(House) / n)
Output:
Appliance freq
1 Dishwasher 0.6666667
2 Oven 0.3333333
3 Refrigerator 0.3333333
4 Toaster 0.6666667
