I have two databases. The first one has about 70k rows and 3 columns; the second one has 790k rows and 2 columns. Both databases share a common variable, grantee_name. I want to match each row of the first database to one or more rows of the second database based on grantee_name. Note that merge will not work because the grantee_name values do not match perfectly: there are different spellings, etc. So I am using the fuzzyjoin package and trying the following:
library("haven"); library("fuzzyjoin"); library("dplyr")
forfuzzy<-read_dta("/path/forfuzzy.dta")
filings <- read_dta("/path/filings.dta")
> head(forfuzzy)
# A tibble: 6 x 3
grantee_name grantee_city grantee_state
<chr> <chr> <chr>
1 (ICS)2 MAINE CHAPTER CLEARWATER FL
2 (SUFFOLK COUNTY) VANDERBILT~ CENTERPORT NY
3 1 VOICE TREKKING A FUND OF ~ WESTMINSTER MD
4 10 CAN NEWBERRY FL
5 10 THOUSAND WINDOWS LIVERMORE CA
6 100 BLACK MEN IN CHICAGO INC CHICAGO IL
... 7 - 70000 rows to go
> head(filings)
# A tibble: 6 x 2
grantee_name ein
<chr> <dbl>
1 ICS-2 MAINE CHAPTER 123456
2 SUFFOLK COUNTY VANDERBILT 654321
3 VOICE TREKKING A FUND OF VOICES 789456
4 10 CAN 654987
5 10 THOUSAND MUSKETEERS INC 789123
6 100 BLACK MEN IN HOUSTON INC 987321
rows 7-790000 omitted for brevity
The above examples are clear enough to provide some good matches and some not-so-good matches. Note that, for example, 10 THOUSAND WINDOWS will match best with 10 THOUSAND MUSKETEERS INC but it does not mean it is a good match. There will be a better match somewhere in the filings data (not shown above). That does not matter at this stage.
So, I have tried the following:
df<-as.data.frame(stringdist_inner_join(forfuzzy, filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
I am totally new to R. This results in an error:
cannot allocate vector of size 375GB (with the big database, of course). A sample of 100 rows from forfuzzy always works, so I thought of iterating over chunks of 100 rows at a time.
I have tried the following:
n <- 100
lst <- split(forfuzzy, cumsum((1:nrow(forfuzzy) - 1) %% n == 0))
df <- lapply(lst, function(df_) {
  stringdist_inner_join(df_, filings, by = "grantee_name", method = "jw",
                        p = 0.1, max_dist = 0.1, distance_col = "distance",
                        nthread = getOption("sd_num_thread"))
}) %>%
  bind_rows() %>%
  as.data.frame()
I have also tried the above with mclapply instead of lapply. The same error happens even on a high-performance cluster with 3 CPUs, each with 480G of memory, using mclapply with mc.cores=3. Perhaps a foreach command could help, but I have no idea how to implement it.
I have been advised to use the purrr and repurrrsive packages, so I try the following:
purrr::map(lst, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance", nthread = getOption("sd_num_thread")))
This seems to be working after I fixed a novice error in the by="grantee_name" argument. However, it is taking forever and I am not sure it will finish. A sample of 100 rows from forfuzzy, with n=10 (so 10 lists of 10 rows each), has been running for 50 minutes with still no results.
If you split (with base::split or dplyr::group_split) your uniquegrantees data frame into a list of data frames, then you can call purrr::map on the list. (map is pretty much lapply)
purrr::map(list_of_dfs, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
Your result will be a list of data frames each fuzzyjoined with filings. You can then call bind_rows (or you could do map_dfr) to get all the results in the same data frame again.
See R - Splitting a large dataframe into several smaller dataframes, performing fuzzyjoin on each one and outputting to a single dataframe
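Putting those pieces together, a minimal end-to-end sketch (assuming forfuzzy and filings as loaded above, and an arbitrary chunk size of 100 rows) could look like this:
library(dplyr)
library(fuzzyjoin)
library(purrr)

# chunk the smaller table, fuzzy-join each chunk against filings,
# then row-bind all the per-chunk results
chunks <- forfuzzy %>%
  mutate(chunk = (row_number() - 1) %/% 100) %>%
  group_split(chunk)

matched <- map_dfr(
  chunks,
  ~ stringdist_inner_join(.x, filings, by = "grantee_name",
                          method = "jw", p = 0.1, max_dist = 0.1,
                          distance_col = "distance")
)
Only one chunk's candidate pairs are held in memory at a time, which is what keeps the join from trying to allocate hundreds of gigabytes at once.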
I haven't used foreach before but maybe the variable x is already the individual rows of zz1?
Have you tried:
stringdist_inner_join(x, zz2, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance")
?
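For the foreach idea mentioned above, a rough sketch (hypothetical setup; lst is the chunked list built from forfuzzy earlier, and filings is the big table) might be:
library(foreach)
library(doParallel)
registerDoParallel(cores = 3)

# each iteration fuzzy-joins one chunk; .combine stitches the results back together
df <- foreach(x = lst, .combine = rbind,
              .packages = "fuzzyjoin") %dopar% {
  stringdist_inner_join(x, filings, by = "grantee_name",
                        method = "jw", p = 0.1, max_dist = 0.1,
                        distance_col = "distance")
}
Keep in mind that memory, not CPU, was the original bottleneck, so running several chunks in parallel multiplies memory use; fewer workers may actually be safer.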
I have six .txt dataset files stored at '../data/csv'. All the datasets have a similar structure: X1 (the speech) and part (the part of the speech, i.e. Charlotte_part_1 ... Charlotte_part_60). I am trying to combine all six datasets into a single .csv file called biden.csv with the columns speech, part, location, event and date, but I am having trouble extracting speech and part (these two come from the file content) and event (which comes from the file name) because of the varying naming structure of the files.
The six datasets
"Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt",
"Cleveland_Sep30_2020_Whistle_Stop_Tour.txt",
"Milwaukee_Aug20_2020_Democratic_National_Convention.txt",
"Philadelphia_Sep20_2020_SCOTUS.txt",
"Washington_Sep26_2020_US_Conference_of_Mayors.txt",
"Wilmington_Nov25_2020_Thanksgiving.txt"
Sample content from 'Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt'
X1 part
"Folks, thanks for taking the time to be here today. I really appreciate it. And we even have an astronaut in our house and I tell you what, that’s pretty cool. Look, first of all, I want to thank Chris and the mayor for being here, and all of you for being here. You know, these are tough times. Over 200,000 Americans have passed away. Over 200,000, and the number is still rising. The impact on communities is bad across the board, but particularly bad for African-American communities. Almost four times as likely, three times as likely to catch the disease, COVID, and when it’s caught, twice as likely to die as white Americans. It’s sort of emblematic of the inequality that exists and the circumstances that exist." Charlotte_part_1
"One of the things that really matters to me, is we could do … It didn’t have to be this bad. You have 30 million people on unemployment, you have 20 million people figuring whether or not they can pay their mortgage payment this month, and what they’re going to be able to do or not do as the consequence of that, and you’ve got millions of people who are worried that they’re going to be thrown out in the street because they can’t pay their rent. Although they’ve been given a reprieve for three months, but they have to pay double the next three months when it comes around." Charlotte_part_2
Here is the code I have designed, but it's not producing the output I want: it just creates the tibble with the column names but no content in any of the variables.
biden_data <- tibble() # initialize empty tibble
# loop through all text files in the specified directory
for (file in list.files(path="./data/csv", pattern='*.txt', full.names=T)){
filename <- strsplit(file, "[./]")[[1]][5] # extract file name from path
# extract location from file name
location <- strsplit(filename, split='_')[[1]][1]
# extract raw date from file name
raw_date <- strsplit(filename, split='_')[[1]][2]
date <- as.Date(raw_date, "%b%d_%Y") # format as datetime
# extract event from file name
event <- strsplit(filename, split='_')[[1]][3]
# extract speech and part from file
content <- readChar(file, file.info(file)$size)
speech <- content[grepl("^X1", content)]
part <- content[grepl("^part", content)]
# create a new observation (row)
new_obs <- tibble(speech=speech, part=part, location=location, event=event, date=date)
# append the new observation to the existing data
biden_data <- bind_rows(biden_data, new_obs)
rm(filename, location, raw_date, date, content, speech, part, new_obs, file) # cleanup
}
Desired Output is supposed to look like this:
## # A tibble: 128 x 5
## speech part location event date
## <chr> <chr> <chr> <chr> <date>
## 1 Folks, thanks for taking the time to be here~ Char~ Charlot~ Raci~ 2020-09-23
## 2 One of the things that really matters to me,~ Char~ Charlot~ Raci~ 2020-09-23
## 3 How people going to do that? And the way, in~ Char~ Charlot~ Raci~ 2020-09-23
## 4 In addition to that, we find ourselves in a ~ Char~ Charlot~ Raci~ 2020-09-23
## 5 If he had spoken, as I said, they said at Co~ Char~ Charlot~ Raci~ 2020-09-23
## 6 But what I want to talk to you about today i~ Char~ Charlot~ Raci~ 2020-09-23
## 7 And thirdly, if you’re a business person, le~ Char~ Charlot~ Raci~ 2020-09-23
## 8 For too many people, particularly in the Afr~ Char~ Charlot~ Raci~ 2020-09-23
## 9 It goes to education, as well as access to e~ Char~ Charlot~ Raci~ 2020-09-23
## 10 And then we’re going to talk about, I think ~ Char~ Charlot~ Raci~ 2020-09-23
## # ... with 118 more rows
Starting with a vector of file paths:
files <- c("Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt", "Cleveland_Sep30_2020_Whistle_Stop_Tour.txt", "Milwaukee_Aug20_2020_Democratic_National_Convention.txt", "Philadelphia_Sep20_2020_SCOTUS.txt", "Washington_Sep26_2020_US_Conference_of_Mayors.txt", "Wilmington_Nov25_2020_Thanksgiving.txt")
We can capture the components into a frame:
meta <- strcapture("^([^_]+)_([^_]+_[^_]+)_(.*)\\.txt$", files, list(location="", date="", event=""))
meta
# location date event
# 1 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 2 Cleveland Sep30_2020 Whistle_Stop_Tour
# 3 Milwaukee Aug20_2020 Democratic_National_Convention
# 4 Philadelphia Sep20_2020 SCOTUS
# 5 Washington Sep26_2020 US_Conference_of_Mayors
# 6 Wilmington Nov25_2020 Thanksgiving
And then iterate on that for the contents into a single frame.
out <- do.call(Map, c(list(f = function(fn, ...) cbind(..., read.table(fn, header = TRUE))),
                      list(files), meta))
out <- do.call(rbind, out)
rownames(out) <- NULL
out[1:3,]
# location date event
# 1 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 2 Charlotte Sep23_2020 Racial_Equity_Discussion-1
# 3 Cleveland Sep30_2020 Whistle_Stop_Tour
# X1
# 1 Folks, thanks for taking the time to be here today. I really appreciate it. And we even have an astronaut in our house and I tell you what, that’s pretty cool. Look, first of all, I want to thank Chris and the mayor for being here, and all of you for being here. You know, these are tough times. Over 200,000 Americans have passed away. Over 200,000, and the number is still rising. The impact on communities is bad across the board, but particularly bad for African-American communities. Almost four times as likely, three times as likely to catch the disease, COVID, and when it’s caught, twice as likely to die as white Americans. It’s sort of emblematic of the inequality that exists and the circumstances that exist.
# 2 One of the things that really matters to me, is we could do … It didn’t have to be this bad. You have 30 million people on unemployment, you have 20 million people figuring whether or not they can pay their mortgage payment this month, and what they’re going to be able to do or not do as the consequence of that, and you’ve got millions of people who are worried that they’re going to be thrown out in the street because they can’t pay their rent. Although they’ve been given a reprieve for three months, but they have to pay double the next three months when it comes around.
# 3 Charlotte_Sep23_2020_Racial_Equity_Discussion-1.txt
# part
# 1 Charlotte_part_1
# 2 Charlotte_part_2
# 3 something
(I made fake files for all but the first file.)
Brief walk-through:
strcapture takes the regex (lots of _-separation) and creates a frame of location, date, etc.
Map takes a function with 1 or more arguments (we use two: fn= for the filename, and ... for "the rest") and applies it to each of the subsequent lists/vectors. In this case, I'm using ... to cbind (column-bind/concatenate) the columns from meta to what we read from the file itself. This is useful in that it combines the 1 row of each meta row with any-number-of-rows from the file itself. (We could have hard-coded ... instead as location, date, and event, but I tend to prefer to generalize, in case you need to extract something else from the filenames.)
Because we use ..., however, we need to combine files and the columns of meta in a list and then call our anon-function with the list contents as arguments.
The contents of out after our do.call(Map, ...) is in a list and not a single frame. Each element of this list is a frame with the same column-structure, so we then combine them by rows with do.call(rbind, out).
R is going to carry the names from files over as row names, which I find unnecessary (and distracting), so I removed the row names. Optional.
If you're interested, this may appear much easier to digest using dplyr and friends:
library(dplyr)
# library(tidyr) # unnest
out <- strcapture("^([^_]+)_([^_]+_[^_]+)_(.*)\\.txt$", files,
                  list(location = "", date = "", event = "")) %>%
  mutate(contents = lapply(files, read.table, header = TRUE)) %>%
  tidyr::unnest(contents)
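To match the desired output more closely, the date column (still a string like "Sep23_2020" at this point) can be parsed and X1 renamed; a small follow-up sketch, assuming the out frame from the dplyr version above:
out <- out %>%
  mutate(date = as.Date(date, format = "%b%d_%Y")) %>%  # e.g. "Sep23_2020" -> 2020-09-23
  rename(speech = X1)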
I'm trying to make my life easier by writing a menu creator, which is supposed to generate a weekly menu from a list of my favourite dishes, in order to get a little more variety into my life.
I gave every dish a value of how many days it approximately lasts and tried to arrange the dishes to end up with menus worth 7 days of food.
I've already tried knapsack solutions from here, including dynamic programming, but I'm not experienced enough to get the hang of it. The problem is that all of these solutions target only the single most valuable option and not every combination that fills the knapsack.
library(adagio)
#create some data
dish <-c('Schnitzel','Burger','Steak','Salad','Falafel','Salmon','Mashed potatoes','MacnCheese','Hot Dogs')
days_the_food_lasts <- c(2,2,1,1,3,1,2,2,4)
price_of_the_food <- c(20,20,40,10,15,18,10,15,15)
data <- data.frame(dish,days_the_food_lasts,price_of_the_food)
#give each dish a distinct id
data$rownumber <- (1:nrow(data))
#set limit for how many days should be covered with the dishes
food_needed_for_days <- 7
#knapsack function from the adagio library as an example; all other solutions I found to the knapsack problem behave the same
most_exspensive_food <- knapsack(days_the_food_lasts,price_of_the_food,food_needed_for_days)
data[data$rownumber %in% most_exspensive_food$indices, ]
#output
dish days_the_food_lasts price_of_the_food rownumber
1 Schnitzel 2 20 1
2 Burger 2 20 2
3 Steak 1 40 3
4 Salad 1 10 4
6 Salmon 1 18 6
Simplified:
I need a solution to a single-objective, single-knapsack problem that returns all possible combinations of dishes adding up to exactly 7 days of food.
Thank you very much in advance
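For reference, a brute-force sketch in base R (assuming the dish, days_the_food_lasts and food_needed_for_days objects defined above) that enumerates every combination of dishes summing to exactly 7 days; with only 9 dishes there are just 511 non-empty subsets to check:
# all non-empty subsets of the dishes, as index vectors
subsets <- unlist(
  lapply(seq_along(dish), function(k) combn(seq_along(dish), k, simplify = FALSE)),
  recursive = FALSE
)

# keep only subsets whose total duration is exactly 7 days
exact_fits <- Filter(function(idx) sum(days_the_food_lasts[idx]) == food_needed_for_days,
                     subsets)

# show the matching menus as dish names
lapply(exact_fits, function(idx) dish[idx])
This does not scale to large dish lists, but for a weekly menu it lists every combination that fills the "knapsack" rather than only the most expensive one.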
I have a long data frame of genes and various forms of ids for them (e.g. OMIM, Ensembl, Genatlas). I want to get the list of all SNPs that are associated with each gene. (This is the reverse of this question.)
So far, the best solution I have found uses the biomaRt package (Bioconductor). There is an example of the kind of lookup I need to do here. Adapted for my purposes, here is my code:
library(biomaRt)
#load the human variation data
variation = useEnsembl(biomart="snp", dataset="hsapiens_snp")
#look up a single gene and get SNP data
getBM(attributes = c("ensembl_gene_stable_id",
                     "refsnp_id",
                     "chr_name",
                     "chrom_start",
                     "chrom_end",
                     "minor_allele",
                     "minor_allele_freq"),
      filters = "ensembl_gene",
      values = "ENSG00000166813",
      mart = variation)
This outputs a data frame that begins like this:
ensembl_gene_stable_id refsnp_id chr_name chrom_start chrom_end minor_allele minor_allele_freq
1 ENSG00000166813 rs8179065 15 89652777 89652777 T 0.242412
2 ENSG00000166813 rs8179066 15 89652736 89652736 C 0.139776
3 ENSG00000166813 rs12899599 15 89629243 89629243 A 0.121006
4 ENSG00000166813 rs12899845 15 89621954 89621954 C 0.421126
5 ENSG00000166813 rs12900185 15 89631884 89631884 A 0.449681
6 ENSG00000166813 rs12900805 15 89631593 89631593 T 0.439297
(4612 rows)
The code works, but the running time is extremely long. For the above, it takes about 45 seconds. I thought maybe this was related to the allele frequencies, which the server perhaps calculates on the fly. But looking up the bare minimum of only the SNP rs ids still takes something like 25 seconds. I have a few thousand genes, so this would take an entire day (assuming no timeouts or other errors). This can't be right. My internet connection is not slow (20-30 Mbit).
I tried looking up more genes per query. This did not help: looking up 10 genes at once is roughly 10 times as slow as looking up a single gene.
What is the best way to get a vector of SNPs that are associated with a vector of gene ids?
If I could just download two tables, one with genes and their positions and one with SNPs and their positions, then I could easily solve this problem using dplyr (or maybe data.table). I haven't been able to find such tables.
Since you're using R, here's an idea that uses the rentrez package. It works with NCBI's Entrez database system, in particular the E-utilities function elink. You'll have to write some code around this and probably tweak parameters, but it could be a good start.
library(rentrez)
# for converting gene name -> gene id
gene_search <- entrez_search(db="gene", term="(PTEN[Gene Name]) AND Homo sapiens[Organism]", retmax=1)
geneId <- gene_search$ids
# elink function
snp_links <- entrez_link(dbfrom='gene', id=geneId, db='snp')
# access results with $links
length(snp_links$links$gene_snp)
5779
head(snp_links$links$gene_snp)
'864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
I suggest you manually double-check that the number of SNPs is about what you'd expect for your genes of interest -- you may need to drill down further and limit by transcript, etc...
For multiple gene ids:
multi_snp_links <- entrez_link(dbfrom='gene', id=c("5728", "374654"), db='snp', by_id=TRUE)
lapply(multi_snp_links, function(x) head(x$links$gene_snp))
1. '864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
2. '797045093' '797044466' '797044465' '797044464' '797044463' '797016353'
With by_id=TRUE, the results are grouped by gene.
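To go from there to the vector of SNPs per gene that the question asks for, the per-gene link lists can be flattened into one long data frame. A sketch, assuming the multi_snp_links object above and that its elements come back in the same order as the ids supplied:
gene_ids <- c("5728", "374654")
snp_by_gene <- lapply(multi_snp_links, function(x) x$links$gene_snp)

# one row per gene/SNP pair
link_df <- data.frame(
  gene_id = rep(gene_ids, lengths(snp_by_gene)),
  snp_id  = unlist(snp_by_gene),
  row.names = NULL
)
head(link_df)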
First of all, I apologize for the title. I really don't know how to succinctly explain this issue in one sentence.
I have a dataframe where each row represents some aspect of a hospital visit by a patient. A single patient might have thousands of rows for dozens of hospital visits, and each hospital visit could account for several rows.
One column is Medical.Record.Number, which corresponds to patient IDs, and the other is Patient.ID.Visit, which corresponds to an ID for an individual hospital visit. I am trying to calculate the number of hospital visits each patient has had.
For example:
Medical.Record.Number Patient.ID.Visit
AAAXXX 1111
AAAXXX 1112
AAAXXX 1113
AAAZZZ 1114
AAAZZZ 1114
AAABBB 1115
AAABBB 1116
would produce the following:
Medical.Record.Number Number.Of.Visits
AAAXXX 3
AAAZZZ 1
AAABBB 2
The solution I am currently using is the following, where "data" is my dataframe:
#this function returns the number of unique hospital visits associated with the
#supplied record number
countVisits <- function(record.number){
  visits.by.number <- data$Patient.ID.Visit[which(data$Medical.Record.Number == record.number)]
  return(length(unique(visits.by.number)))
}
recordNumbers <- unique(data$Medical.Record.Number)
visits <- integer()
for (record in recordNumbers){
visits <- c(visits, countVisits(record))
}
visit.counts <- data.frame(recordNumbers, visits)
This works, but it is pretty slow. I am dealing with potentially millions of rows of data, so I'd like something efficient. From what little I know about R, I know there's usually a faster way to do things without using a for-loop.
This essentially looks like a table() operation after you take out duplicates. First, some sample data
#sample data
dd<-read.table(text="Medical.Record.Number Patient.ID.Visit
AAAXXX 1111
AAAXXX 1112
AAAXXX 1113
AAAZZZ 1114
AAAZZZ 1114
AAABBB 1115
AAABBB 1116", header=T)
then you could do
tt <- table(Medical.Record.Number=unique(dd)$Medical.Record.Number)
as.data.frame(tt, responseName="Number.Of.Visits") #to get a data.frame rather than named vector (table)
# Medical.Record.Number Number.Of.Visits
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
Or you could also think of this as an aggregation problem
aggregate(Patient.ID.Visit~Medical.Record.Number, dd, function(x) length(unique(x)))
# Medical.Record.Number Patient.ID.Visit
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
There are many ways to do this; @MrFlick provided a handful of perfectly valid approaches. Personally I'm fond of the data.table package. It's faster on large data frames and I find the logic more intuitive than the base functions. I'd check it out if you are having problems with execution time.
library(data.table)
med.dt <- data.table(med_tbl)
num.visits.dt <- med.dt[, .(num_visits = length(unique(Patient.ID.Visit))),
                        by = Medical.Record.Number]
data.table should be much faster than data.frame on large tables.
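For completeness, a dplyr version of the same count (a sketch, using the dd sample frame defined above) would be:
library(dplyr)

dd %>%
  group_by(Medical.Record.Number) %>%
  summarise(Number.Of.Visits = n_distinct(Patient.ID.Visit))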
I have a dataset like this:
MQ = data.frame(Model=c("C150A","B174","DG18"),Quantity=c(5000,3800,4000))
MQ is a data.frame; it shows the production plan for a week in the future, with the Model to be produced and the Quantity.
C150A = data.frame( Material=c("A0015", "A0071", "Z00071", "Z00080","Z00090",
"Z00012","SZ0001"), Number=c(1,1,1,1,1,1,4))
B174= data.frame(Material=c("A0014","A0071","Z00080","Z00091","Z00011","SZ0000"),
Number=c(1,1,1,1,2,4))
DG18= data.frame( Material=c("A0014","A0075","Z00085","Z00090","Z00010","SZ0005"),
Number=c(1,1,1,2,3,4))
T75A= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00012","SZ0005"),
Number=c(1,1,1,2,3,4))
G95= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00017","SZ0008"),
Number=c(1,1,1,2,3,4))
These are the models that could be produced.
My first problem: based on the production plan MQ, I want to automatically pull up the needed models and multiply Quantity by Number, to know how many of each component (Material) are needed.
The output could be a data.frame in which the needed components of all models listed in the production plan are combined (different models can use the same components or different ones, and the needed amounts can also differ).
Material_Master = data.frame(
  Material = c("A0013", "A001", "A0015", "A0071", "A0075",
               "A0078", "Z00071", "Z00080", "Z00090", "Z00091",
               "Z00012", "Z00091", "Z00010", "Z00012", "Z00017", "SZ0001",
               "SZ0005", "SZ0005", "SZ0000", "SZ0008", "SZ0009"),
  Number = c(20000, 180000, 250000, 480000, 250000, 170000,
             690000, 1800000, 17000, 45000, 12000, 5000, 5000,
             8000, 16000, 17000, 45000, 88000, 7500, 12000, 45000))
In the last step, the created data.frame should be merged with the Material_Master data: the Material_Master data lists all important components together with their stock.
In my example, all components needed for production are also listed in Material_Master, but it can also happen that a component is missing from Material_Master; in that case, just ignore that component.
The output should compare the needed amount of each component with its actual stock, and report whenever more is needed than the actual stock available.
Thank you for your help.
This should work:
mods <- do.call(rbind, lapply(MQ$Model, function(x) cbind(Model = x, get(x))))
full_plan <- merge(mods, MQ, by = "Model")
material_plan <- with(full_plan, aggregate(Quantity * Number, by = list(Material), sum))
# Group.1 x
# 1 A0014 7800
# 2 A0015 5000
# 3 A0071 8800
# 4 A0075 4000
# 5 SZ0000 15200
# 6 SZ0001 20000
# 7 SZ0005 16000
# 8 Z00010 12000
# 9 Z00011 7600
# 10 Z00012 5000
# 11 Z00071 5000
# 12 Z00080 8800
# 13 Z00085 4000
# 14 Z00090 13000
# 15 Z00091 3800
The first line gets each of your models and stacks them, along with the model name. The second line merges back to get the Quantity, and the third aggregates.
I went ahead and made a usable example by trimming off the 1 at the beginning of each Number in your latter models. Also, I read the Model and Material columns in as character instead of factor.
options(stringsAsFactors=FALSE)
MQ = data.frame(Model=c("C150A","B174","DG18"),Quantity=c(5000,3800,4000))
C150A = data.frame(Material=c("A0015","A0071","Z00071","Z00080","Z00090","Z00012","SZ0001"),Number=c(1,1,1,1,1,1,4))
B174= data.frame(Material=c("A0014","A0071","Z00080","Z00091","Z00011","SZ0000"), Number=c(1,1,1,1,2,4))
DG18= data.frame(Material=c("A0014","A0075","Z00085","Z00090","Z00010","SZ0005"),Number=c(1,1,1,2,3,4))
T75A= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00012","SZ0005"),Number=c(1,1,1,2,3,4))
G95= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00017","SZ0008"),Number=c(1,1,1,2,3,4))
Edit: Added the required stringsAsFactors option, as identified by @RicardoSaporta.
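The final comparison step described in the question (needed amount vs. stock, ignoring components missing from Material_Master) is not shown above. A sketch, assuming material_plan from the aggregate step and that Material_Master$Number holds the current stock:
# rename the aggregate output (Group.1 / x) to something readable
names(material_plan) <- c("Material", "Needed")

# an inner merge drops components that are missing from Material_Master
report <- merge(material_plan, Material_Master, by = "Material")
report$Shortage <- pmax(report$Needed - report$Number, 0)

# components where more is needed than is in stock
report[report$Shortage > 0, ]
If a Material appears more than once in Material_Master, its stock rows should be summed first (e.g. with aggregate) so the merge does not duplicate rows.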