Merge rows of different data frames and multiply afterwards - R

I have a dataset like this:
MQ = data.frame(Model=c("C150A","B174","DG18"),Quantity=c(5000,3800,4000))
MQ is a data.frame showing the production plan for a week in the future, with the Model to be produced and its Quantity.
C150A = data.frame( Material=c("A0015", "A0071", "Z00071", "Z00080","Z00090",
"Z00012","SZ0001"), Number=c(1,1,1,1,1,1,4))
B174= data.frame(Material=c("A0014","A0071","Z00080","Z00091","Z00011","SZ0000"),
Number=c(1,1,1,1,2,4))
DG18= data.frame( Material=c("A0014","A0075","Z00085","Z00090","Z00010","SZ0005"),
Number=c(1,1,1,2,3,4))
T75A= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00012","SZ0005"),
Number=c(1,1,1,2,3,4))
G95= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00017","SZ0008"),
Number=c(1,1,1,2,3,4))
These are the Models that could be produced.
My first problem: based on the production plan MQ, I want to automatically pull up the needed Models and multiply the Quantity by the Number, to know how many of each component (Material) are needed.
The output could be a data.frame that combines the needed components across all Models noted in the production plan (different Models can use the same or different components, and the required amount of a component can also differ).
Material_Master = data.frame(
  Material = c("A0013","A001","A0015","A0071","A0075","A0078",
               "Z00071","Z00080","Z00090","Z00091","Z00012","Z00091",
               "Z00010","Z00012","Z00017","SZ0001","SZ0005","SZ0005",
               "SZ0000","SZ0008","SZ0009"),
  Number = c(20000,180000,250000,480000,250000,170000,
             690000,1800000,17000,45000,12000,5000,5000,
             8000,16000,17000,45000,88000,7500,12000,45000))
In the last step, the created data.frame should be merged with the Material_Master data: Material_Master lists all important components together with their stock.
In my example, all components needed for the production are also noted in Material_Master, but it can happen that a component is missing from Material_Master; in that case, just ignore this component.
The output should compare the needed amount of each component with its actual stock, and give a report whenever more is needed than the stock holds.
Thank you for your help.

This should work:
mods <- do.call(rbind,lapply(MQ$Model,function(x)cbind(Model=x,get(x))))
full_plan <- merge(mods,MQ,by="Model")
material_plan <- with(full_plan,aggregate(Quantity*Number,by=list(Material),sum))
# Group.1 x
# 1 A0014 7800
# 2 A0015 5000
# 3 A0071 8800
# 4 A0075 4000
# 5 SZ0000 15200
# 6 SZ0001 20000
# 7 SZ0005 16000
# 8 Z00010 12000
# 9 Z00011 7600
# 10 Z00012 5000
# 11 Z00071 5000
# 12 Z00080 8800
# 13 Z00085 4000
# 14 Z00090 13000
# 15 Z00091 3800
The first line gets each of your models and stacks them, along with the model name. The second line merges back to get the Quantity, and the third aggregates.
I went ahead and made a usable example by trimming off the 1 at the beginning of each Number in your latter models. Also, I read the Model and Material columns in as character instead of factor.
options(stringsAsFactors=FALSE)
MQ = data.frame(Model=c("C150A","B174","DG18"),Quantity=c(5000,3800,4000))
C150A = data.frame(Material=c("A0015","A0071","Z00071","Z00080","Z00090","Z00012","SZ0001"),Number=c(1,1,1,1,1,1,4))
B174= data.frame(Material=c("A0014","A0071","Z00080","Z00091","Z00011","SZ0000"), Number=c(1,1,1,1,2,4))
DG18= data.frame(Material=c("A0014","A0075","Z00085","Z00090","Z00010","SZ0005"),Number=c(1,1,1,2,3,4))
T75A= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00012","SZ0005"),Number=c(1,1,1,2,3,4))
G95= data.frame(Material=c("A0013","A0075","Z00085","Z00090","Z00017","SZ0008"),Number=c(1,1,1,2,3,4))
Edit: Added the required stringsAsFactors option, as identified by @RicardoSaporta.
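The question's final step (comparing need against stock) is not shown above; here is a minimal sketch, assuming the material_plan result from the code above and the Material_Master data from the question (whose Number column is taken to hold the stock):
# rename the aggregate output for readability
names(material_plan) <- c("Material", "Needed")
# an inner merge drops components missing from Material_Master, as the question requests
report <- merge(material_plan, Material_Master, by = "Material")
names(report)[names(report) == "Number"] <- "Stock"
# report the rows where the need exceeds the stock
report$Shortage <- report$Needed - report$Stock
report[report$Shortage > 0, ]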

Fuzzy and exact match of two databases

I have two databases. The first one has about 70k rows and 3 columns; the second one has 790k rows and 2 columns. Both databases share a common variable, grantee_name. I want to match each row of the first database to one or more rows of the second database based on this grantee_name. Note that merge will not work, because the grantee_name values do not match perfectly; there are different spellings etc. So I am using the fuzzyjoin package and trying the following:
library("haven"); library("fuzzyjoin"); library("dplyr")
forfuzzy<-read_dta("/path/forfuzzy.dta")
filings <- read_dta("/path/filings.dta")
> head(forfuzzy)
# A tibble: 6 x 3
grantee_name grantee_city grantee_state
<chr> <chr> <chr>
1 (ICS)2 MAINE CHAPTER CLEARWATER FL
2 (SUFFOLK COUNTY) VANDERBILT~ CENTERPORT NY
3 1 VOICE TREKKING A FUND OF ~ WESTMINSTER MD
4 10 CAN NEWBERRY FL
5 10 THOUSAND WINDOWS LIVERMORE CA
6 100 BLACK MEN IN CHICAGO INC CHICAGO IL
... 7 - 70000 rows to go
> head(filings)
# A tibble: 6 x 2
grantee_name ein
<chr> <dbl>
1 ICS-2 MAINE CHAPTER 123456
2 SUFFOLK COUNTY VANDERBILT 654321
3 VOICE TREKKING A FUND OF VOICES 789456
4 10 CAN 654987
5 10 THOUSAND MUSKETEERS INC 789123
6 100 BLACK MEN IN HOUSTON INC 987321
rows 7-790000 omitted for brevity
The above examples are clear enough to provide some good matches and some not-so-good matches. Note that, for example, 10 THOUSAND WINDOWS will match best with 10 THOUSAND MUSKETEERS INC but it does not mean it is a good match. There will be a better match somewhere in the filings data (not shown above). That does not matter at this stage.
So, I have tried the following:
df<-as.data.frame(stringdist_inner_join(forfuzzy, filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
Totally new to R. This is resulting in an error:
cannot allocate vector of size 375GB (with the big database of course). A sample of 100 rows from forfuzzy always works. So, I thought of iterating over a list of 100 rows at a time.
I have tried the following:
n=100
lst = split(forfuzzy, cumsum((1:nrow(forfuzzy)-1)%%n==0))
df <- as.data.frame(lapply(lst, function(df_) {
  stringdist_inner_join(df_, filings, by = "grantee_name", method = "jw",
                        p = 0.1, max_dist = 0.1, distance_col = "distance",
                        nthread = getOption("sd_num_thread"))
}) %>% bind_rows)
I have also tried the above with mclapply instead of lapply. The same error happens even though I tried a high-performance cluster with 3 CPUs, each with 480G of memory, using mclapply with mc.cores=3. Perhaps a foreach command could help, but I have no idea how to implement it.
I have been advised to use the purrr and repurrrsive packages, so I try the following:
purrr::map(lst, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance", nthread = getOption("sd_num_thread")))
This seems to be working after fixing a novice error in the by="grantee_name" statement. However, it is taking forever and I am not sure it will work. A sample list from forfuzzy of 100 rows, with n=10 (so 10 lists of 10 rows each), has been running for 50 minutes with still no results.
If you split (with base::split or dplyr::group_split) your uniquegrantees data frame into a list of data frames, then you can call purrr::map on the list. (map is pretty much lapply)
purrr::map(list_of_dfs, ~stringdist_inner_join(., filings, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance"))
Your result will be a list of data frames each fuzzyjoined with filings. You can then call bind_rows (or you could do map_dfr) to get all the results in the same data frame again.
See R - Splitting a large dataframe into several smaller dateframes, performing fuzzyjoin on each one and outputting to a single dataframe
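A minimal sketch of that approach, assuming the forfuzzy and filings objects from the question and a chunk size of 100 rows (the chunk size is an assumption to tune against memory):
library(dplyr); library(purrr); library(fuzzyjoin)
# split forfuzzy into a list of ~100-row data frames
chunks <- split(forfuzzy, ceiling(seq_len(nrow(forfuzzy)) / 100))
# fuzzy-join each chunk against filings, then stack all results into one data frame
result <- map_dfr(chunks,
                  ~ stringdist_inner_join(.x, filings, by = "grantee_name",
                                          method = "jw", p = 0.1, max_dist = 0.1,
                                          distance_col = "distance"))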
I haven't used foreach before but maybe the variable x is already the individual rows of zz1?
Have you tried:
stringdist_inner_join(x, zz2, by="grantee_name", method="jw", p=0.1, max_dist=0.1, distance_col="distance")
?
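Building on that comment, a minimal foreach sketch, assuming zz1 is a list of data-frame chunks and zz2 is the filings table (both names taken from the comment above; doParallel is one possible backend):
library(foreach); library(doParallel); library(fuzzyjoin)
registerDoParallel(cores = 3)
# one fuzzy join per chunk, run in parallel, with the pieces rbind-ed together
res <- foreach(x = zz1, .combine = rbind, .packages = "fuzzyjoin") %dopar% {
  stringdist_inner_join(x, zz2, by = "grantee_name", method = "jw",
                        p = 0.1, max_dist = 0.1, distance_col = "distance")
}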

Is there a way I can use R code to calculate the average price for specific days? (AVERAGEIF function)

Firstly: I have seen other posts about AVERAGEIF translations from Excel into R, but I didn't see one that covered my specific case and I couldn't get one to work.
I have a dataset which encompasses the daily pricings of a bunch of listings.
It looks like this:
listing_id date price
1 1000 1/2/2015 $100
2 1200 2/4/2016 $150
Sample of the dataset (and desired outcome): https://send.firefox.com/download/228f31e39d18738d/#rlMmm6UeGxgbkzsSD5OsQw
The dataset I would like to have has only the date and the average prices of all listings on that date. The goal is to get a (different) dataframe which would look something like this so I can work with it:
Date Average Price
1 4/5/2015 204.5438
2 4/6/2015 182.6439
3 4/7/2015 176.553
4 4/8/2015 182.0448
5 4/9/2015 183.3617
6 4/10/2015 205.0997
7 4/11/2015 197.0118
8 4/12/2015 172.2943
I created this in Excel using the AVERAGEIF function (and copy-pasting by value) from the sample provided above.
I first tried to format the data in Excel, where I could use AVERAGEIF to take the average for each specific date. The problem with this is that the dataset consists of 30 million rows and Excel only allows about 1 million, so it didn't work.
What I have done so far: I created a data frame in R (for the average prices to go into) using
Avg = data.frame("Date" =1:2, "Average Price"=1:2)
Avg[nrow(Avg) + 2036,] = list("v1","v2")
Avg$Date = seq(from = as.Date("2015-04-05"), to = as.Date("2020-11-01"), by = 'day')
I tried to create an AVERAGEIF-like function following this article and another, but could not get it to work.
I hope this is enough information to go on otherwise I would be more than happy to provide more.
If your question is how to replicate the AVERAGEIF function, you can use logical indexing.
R code:
> df
Dates Prices
1 1 100
2 2 120
3 3 150
4 1 320
5 2 250
6 3 210
7 1 102
8 2 180
9 3 150
idx <- df$Dates == 1 # Positions where condition is true
mean(df$Prices[idx]) # Prints same output as Excel
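For the dataset in the question itself (mean price per date), a grouped-aggregation sketch, assuming columns named date and price as in the sample, with price possibly read in as text like "$100":
# strip "$" and commas so price is numeric (an assumption about the raw format)
df$price <- as.numeric(gsub("[$,]", "", df$price))
# average price per date, base R; returns one row per date
avg <- aggregate(price ~ date, data = df, FUN = mean)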

Randomise 380 samples by covariates across four 96-well plates using OSAT package

I need to randomise 380 samples (by age, sex and group [grp]) across four 96-well plates (8 rows, 12 columns), with A01 reserved in each plate for a positive control.
I tried the R package OSAT, and the recommended script is below. The only piece that does not work is excluding well A01 from each of the four plates.
library(OSAT)
samples <- read.table("~/file.csv", sep=";", header=T)
head(samples)
grp sex age
1 A F 45
2 A M 56
3 A F 57
4 A M 67
5 A F 45
6 A M 65
sample.list <- setup.sample(samples, optimal = c("grp", "sex", "age"))
excludedWells <- data.frame("plates"= 1:4, chips=rep(1,4), wells=rep(1,4))
container <- setup.container(IlluminaBeadChip96Plate, 4, batch = 'plates')
exclude(container) <- excludedWells
setup <- create.optimized.setup(fun ="optimal.shuffle", sample.list, container)
out <- map.to.MSA(setup, MSA4.plate)
The corresponding R help doc states:
"If for any reason we need to reserve certain wells for other usage, we can exclude them from the sample assignment process. For this one can create a data frame to mark these excluded wells. Any well in the container can be identified by its location, given by three variables: "plates", "chips", "wells". Therefore the data frame for the excluded wells should have these three columns.
For example, if we will use the first well of the first chip on each plate to hold QC samples, these wells will not be available for sample placement. We have 6 plates in our example, so the following will reserve the 6 wells from sample assignment:
excludedWells <- data.frame(plates=1:6, chips=rep(1,6), wells=rep(1,6))
Our program can let you exclude multiple wells at the same position of plate/chip. For example, the following data frame will exclude the first well on each chip regardless of how many plates we have:
ex2 <- data.frame(wells=1)"
I tried both of these and they do not work - as they simply specify ANY well (and not well #1-A01).
*Update: I emailed the developer of the package and he acknowledged the error and provided a workaround, which is incorporated above (exclude wells only after setting up the container).

Best way to get list of SNPs by gene id?

I have a long data frame of genes and various forms of ids for them (e.g. OMIM, Ensembl, Genatlas). I want to get the list of all SNPs that are associated with each gene. (This is the reverse of this question.)
So far, the best solution I have found uses the biomaRt package (Bioconductor). There is an example of the kind of lookup I need to do here. Adapted for my purposes, here is my code:
library(biomaRt)
#load the human variation data
variation = useEnsembl(biomart="snp", dataset="hsapiens_snp")
#look up a single gene and get SNP data
getBM(attributes = c('ensembl_gene_stable_id',
                     'refsnp_id',
                     'chr_name',
                     'chrom_start',
                     'chrom_end',
                     'minor_allele',
                     'minor_allele_freq'),
      filters = 'ensembl_gene',
      values = 'ENSG00000166813',
      mart = variation)
This outputs a data frame that begins like this:
ensembl_gene_stable_id refsnp_id chr_name chrom_start chrom_end minor_allele minor_allele_freq
1 ENSG00000166813 rs8179065 15 89652777 89652777 T 0.242412
2 ENSG00000166813 rs8179066 15 89652736 89652736 C 0.139776
3 ENSG00000166813 rs12899599 15 89629243 89629243 A 0.121006
4 ENSG00000166813 rs12899845 15 89621954 89621954 C 0.421126
5 ENSG00000166813 rs12900185 15 89631884 89631884 A 0.449681
6 ENSG00000166813 rs12900805 15 89631593 89631593 T 0.439297
(4612 rows)
The code works, but the running time is extremely long: the call above takes about 45 seconds. I thought this might be related to the allele frequencies, which the server perhaps calculates on the fly, but looking up the bare minimum of only the SNP rs ids still takes something like 25 seconds. I have a few thousand genes, so this would take an entire day (assuming no timeouts or other errors). This can't be right; my internet connection is not slow (20-30 Mbit).
I tried looking up more genes per query. This did not help: looking up 10 genes at once is roughly 10 times as slow as looking up a single gene.
What is the best way to get a vector of SNPs that are associated with a vector of gene ids?
If I could just download two tables, one with genes and their positions and one with SNPs and their positions, then I could easily solve this problem using dplyr (or maybe data.table). I haven't been able to find such tables.
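If such tables were available, the positional lookup itself could be an overlap join; here is a minimal sketch with data.table::foverlaps, where the genes and snps tables and their columns (chr, start, end, gene_id; chr, pos, rsid) are hypothetical, with toy values borrowed from the output above:
library(data.table)
# hypothetical gene-position and SNP-position tables
genes <- data.table(chr = "15", start = 89620000, end = 89660000,
                    gene_id = "ENSG00000166813")
snps <- data.table(chr = "15", pos = c(89652777, 89629243),
                   rsid = c("rs8179065", "rs12899599"))
# foverlaps needs an interval on both sides; treat each SNP as a zero-length interval
snps[, `:=`(start = pos, end = pos)]
setkey(genes, chr, start, end)
# keep each SNP that falls within a gene's interval
hits <- foverlaps(snps, genes, type = "within", nomatch = 0L)
hits[, .(gene_id, rsid)]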
Since you're using R, here's an idea that uses the package rentrez. It uses NCBI's Entrez database system, in particular the E-utilities function elink. You'll have to write some code around this and probably tweak parameters, but it could be a good start.
library(rentrez)
# for converting gene name -> gene id
gene_search <- entrez_search(db="gene", term="(PTEN[Gene Name]) AND Homo sapiens[Organism]", retmax=1)
geneId <- gene_search$ids
# elink function
snp_links <- entrez_link(dbfrom='gene', id=geneId, db='snp')
# access results with $links
length(snp_links$links$gene_snp)
5779
head(snp_links$links$gene_snp)
'864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
I suggest you manually double-check that the number of SNPs is about what you'd expect for your genes of interest -- you may need to drill down further and limit by transcript, etc...
For multiple gene ids:
multi_snp_links <- entrez_link(dbfrom='gene', id=c("5728", "374654"), db='snp', by_id=TRUE)
lapply(multi_snp_links, function(x) head(x$links$gene_snp))
1. '864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
2. '797045093' '797044466' '797044465' '797044464' '797044463' '797016353'
The results are grouped by gene with by_id=TRUE
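Note that entrez_link returns bare dbSNP UIDs; since a dbSNP UID is the rs number without its prefix, the familiar identifiers can be recovered with a simple paste (a small sketch reusing snp_links from above):
# prepend the "rs" prefix to the numeric dbSNP UIDs
rs_ids <- paste0("rs", snp_links$links$gene_snp)
head(rs_ids)
# 'rs864622690' 'rs864622594' ...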

Tallying up data

I have two data sets. One has 2 million cases (individual donations to various causes), the other has about 38,000 (all zip codes in the U.S.).
I want to sort through the first data set and tally up the total number of donations by zip code. (Additionally, the total for each zip code will be broken down by cause.) Each case in the first data set includes the zip code of the corresponding donation and information about what kind of cause it went to.
Is there an efficient way to do this? The only approach that I (very much a novice) can think of is to use a for ... if loop to go through each case and count them up one by one. This seems like it would be really slow, though, for data sets of this size.
Edit: thanks, @josilber. This gets me a step closer to what I'm looking for.
One more question, though. table seems to generate frequencies, correct? What if I'm actually looking for the sum for each cause by zip code? For example, if the data frame looks like this:
dat3 <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE),
amt = sample(250:2500, 2000000, replace=TRUE))
Suppose that instead of frequencies, I want to end up with output that looks like this?
# Cause 1(amt) Cause 2(amt) Cause 3(amt)
# Zip 1 (sum) (sum) (sum)
# Zip 2 (sum) (sum) (sum)
# Zip 3 (sum) (sum) (sum)
# etc. ... ... ...
Does that make sense?
Sure, you can accomplish what you're looking for with the table command in R. First, let's start with a reproducible example (I'll create an example with 2 million cases, 3 zip codes, and 3 causes; I know you have more zip codes and more causes but that won't cause the code to take too much longer to run):
# Data
set.seed(144)
dat <- data.frame(zip = sample(paste("Zip", 1:3), 2000000, replace=TRUE),
cause = sample(paste("Cause", 1:3), 2000000, replace=TRUE))
Please note that it's a good idea to include a reproducible example with all your questions on Stack Overflow because it helps make sure we understand what you're asking! Basically you should include a sample dataset (like the one I've just included) along with your desired output for that dataset.
Now you can use the table function to count the number of donations in each zip code, broken down by cause:
table(dat$zip, dat$cause)
# Cause 1 Cause 2 Cause 3
# Zip 1 222276 222004 222744
# Zip 2 222068 222791 222363
# Zip 3 221015 221930 222809
This took about 0.3 seconds on my computer.
Could this work?
aggregate(amt ~ cause + zip, data = dat3, FUN = sum)
cause zip amt
1 Cause 1 Zip 1 306231179
2 Cause 2 Zip 1 306600943
3 Cause 3 Zip 1 305964165
4 Cause 1 Zip 2 305788668
5 Cause 2 Zip 2 306306940
6 Cause 3 Zip 2 305559305
7 Cause 1 Zip 3 304898918
8 Cause 2 Zip 3 304281568
9 Cause 3 Zip 3 303939326
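To get the wide zip-by-cause layout sketched in the question, the same sums can be cross-tabulated with base R's xtabs, which sums amt over each zip/cause combination:
xtabs(amt ~ zip + cause, data = dat3)
#        cause
# zip       Cause 1   Cause 2   Cause 3
#   Zip 1 306231179 306600943 305964165
#   ...   (remaining rows as in the aggregate output above)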
