R subset data by multiple conditions on every row - r

I got a very large dataset which contains many columns and rows. Not every co-worker is allowed to see all data. Based on the dataframe Data_locatie I want to subset my original dataframe DF. The column acces tells me if the co-worker may see this combination yes(=1) or no(=0). I made a reproducible example which you can use.
CityChargeSessions <-c("Amsterdam","Amsterdam","Amsterdam","Amsterdam","Beverwaard","De meern","De Meern","De Meern","Den Haag","Den Haag")
RegionAbbreviation <- c("G4", "G4","G4","G4","G4","G4","G4","G4","G4","G4")
Provider<- c("ALLEGO","Essent","EVBOX","Nuon","EVBOX","EVnet","Ballast Nedam", "Nuon","Alfen","EVnet")
acces<- c(0,1,1,0,1,1,0,0,1,0)
Data_locatie<- data.frame(CityChargeSessions,RegionAbbreviation,Provider,acces)
CityChargeSessions <-c("Amsterdam" ,"Amsterdam" ,"Den Haag" , "Den Haag" ,"Rotterdam", "Rotterdam", "Rotterdam", "Utrecht" , "Utrecht" )
RegionAbbreviation <- c("G4", "G4","G4","G4","G4","G4","G4","G4","G4")
Provider <- c("Essent","Nuon","Alfen","EVnet","Alfen","EVBOX", "EVnet","Ballast Nedam", "EVnet")
kWh<- c(3366231.03, 7547896.10, 2535700.80, 245951.82, 62004.86, 3074192.86, 221362.13, 1272956.51, 281451.94)
DF<- data.frame(CityChargeSessions,RegionAbbreviation,Provider,kWh)
My expected output is:
CityChargeSessions <-c("Amsterdam" ,"Den Haag")
RegionAbbreviation <- c("G4", "G4")
Provider <- c("Essent","Alfen ")
kWh<- c(3366231.03, 2535700.80)
expected_output<- data.frame(CityChargeSessions,RegionAbbreviation,Provider,kWh)
Could you help me out?
Thanks for your help!
Martijn

You could use the data table and do the following:
require(data.table)
setDT(Data_locatie)
setkey(Data_locatie, "CityChargeSessions", "RegionAbbreviation", "Provider")
setDT(DF)
setkey(DF, "CityChargeSessions", "RegionAbbreviation", "Provider")
allowed_combinations <- DF[Data_locatie[acces==1], nomatch=0][, acces:=NULL]
not_allowed_combinations <- DF[Data_locatie[acces==0], nomatch=0][, acces:=NULL]

Related

Find differences betwen 2 dataframes with different lengths

I have two dataframes with each two columns c("price", "size") with different lengths.
Each price must be linked to its size. It's two lists of trade orders. I have to discover the differences between the two dataframes knowing that the two databases can have orders that the other doesn't have and vice versa. I would like an output with the differences or two outputs, it doesn't matter. But I need the row number in the output to find where are the differences in the series.
Here is sample data :
> out
price size
1: 36024.86 0.01431022
2: 36272.00 0.00138692
3: 36272.00 0.00277305
4: 36292.57 0.05420000
5: 36292.07 0.00403948
---
923598: 35053.89 0.30904890
923599: 35072.76 0.00232000
923600: 35065.60 0.00273000
923601: 35049.36 0.01760000
923602: 35037.23 0.00100000
>bit
price size
1: 37279.89 0.01340020
2: 37250.84 0.00930000
3: 37250.32 0.44284049
4: 37240.00 0.00056491
5: 37215.03 0.99891906
---
923806: 35053.89 0.30904890
923807: 35072.76 0.00232000
923808: 35065.60 0.00273000
923809: 35049.36 0.01760000
923810: 35037.23 0.00100000
For example, I need to know if the first row of the database out is in the database bit.
I've tried many functions : comparedf()
summary(comparedf(bit, out, by = c("price","size"))
but I've got error:
Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
!anyDuplicated(f__, :
I've tried compare_df() :
compareout=compare_df(out,bit,c("price","size"))
But I know the results are wrong, I've only 23 results and I know that there are more than 200 differences minimum.
I've tried match(), which() functions but it doesn't get the results I search.
If you have any other methods, I will take them.
Perhaps you could just do inner_join on out and bit by price and size? But first make id variable for both data.frame's
library(dplyr)
out$id <- 1:nrow(out)
bit$id <- 1:nrow(bit)
joined <- inner_join(bit, out, by = c("price", "size"))
Now we can check which id from out and bit are not present in joined table:
id_from_bit_not_included_in_out <- bit$id[!bit$id %in% joined$id.x]
id_from_out_not_included_in_bit <- out$id[!out$id %in% joined$id.y]
And these ids are the rows not included in out or bit, i.e. variable id_from_bit_not_included_in_out contains rows present in bit, but not in out and variable id_from_out_not_included_in_bit contains rows present in out, but not in bit
First attempt here. It will be difficult to do a very clean job with this data tho.
The data I used:
out <- read.table(text = "price size
36024.86 0.01431022
36272.00 0.00138692
36272.00 0.00277305
36292.57 0.05420000
36292.07 0.00403948
35053.89 0.30904890
35072.76 0.00232000
35065.60 0.00273000
35049.36 0.01760000
35037.23 0.00100000", header = T)
bit <- read.table(text = "price size
37279.89 0.01340020
37250.84 0.00930000
37250.32 0.44284049
37240.00 0.00056491
37215.03 0.99891906
37240.00 0.00056491
37215.03 0.99891906
35053.89 0.30904890
35072.76 0.00232000
35065.60 0.00273000
35049.36 0.01760000
35037.23 0.00100000", header = T)
Assuming purely that row 1 of out should match with row 1 of bit a simple solution could be:
df <- cbind(distinct(out), distinct(bit))
names(df) <- make.unique(names(df))
However judging from the data you have provided I am not sure if this is the way to go (big differences in the first few rows) so maybe try sorting the data first?:
df <- cbind(distinct(out[order(out$price, out$size),]), distinct(bit[order(bit$price, bit$size),]))
names(df) <- make.unique(names(df))

Creating a for loop in R from a list

I'm trying to create a for loop in R to iterate through a list of genetic variants, labeled with rsID's, and filter the results by patient ID.
ace2_snps <- c("rs4646121", "rs4646127", "rs1996225", "rs2158082", "rs4830974", "rs148271868", "rs113539251", "rs4646135", "rs4646179", "rs2301693", "rs16980031", "rs12689012", "rs4646141", "rs142049267", "rs16979971", "rs12007623", "rs4646182", "rs147214574", "rs6632677", "rs139469582", "rs149000434", "rs148805807", "rs112032651", "rs144314464", "rs147077778", "rs182259051", "rs112621533", "rs35803318", "rs35304868", "rs113848176", "rs145345877", "rs12009805", "rs233570", "rs73635824", "rs73635823", "rs4646142", "rs4646157", "rs2074192", "rs79878075", "rs144239059", "rs67635467", "rs183583165", "rs137910448", "rs116419580", "rs2097723", "rs4646170")
for (snps in ace2_snps) {
genotype_snps <- as.data.frame(bgen_ACE2$data[snps,,])
idfromcsv <- read.csv("/Users/keeseyyyyy/Desktop/Walley/pospatid.csv")
id <- as.character(idfromcsv[[1]])
filtered_snps <- genotype_snps[id,] }
I need to run genotype_rs146217251 <- as.data.frame(bgen_ACE2$data["rs146217251",,]) for each rsID, and then I'd like to label the variable filtered_snps according to its rsID in the place of "snps" in the variable name for each variant.
I'm not very familiar with R syntax. Can anyone give me some tips?
For one variant, the process would go like this:
genotype_rs146217251 <- as.data.frame(bgen_ACE2$data["rs146217251",,])
idfromcsv <- read.csv("/Users/keeseyyyyy/Desktop/Walley/pospatid.csv")
id <- as.character(idfromcsv[[1]])
filtered <- genotype_rs146217251[id,]

Subset an xts by year into a list. Subset an xts by year and month into a list

im new to R and the stack platforms.
sti <- getSymbols("^STI", src = "yahoo", auto.assign = F, from = "2007-01-01", to = "2017-12-31")
sti_adjusted <- sti[, 6]
I done this in order to subset the data into a list of years.
ls_sti_adjusted <- list(sti_adjusted["2007"], sti_adjusted["2008"], sti_adjusted["2009"], sti_adjusted["2010"], sti_adjusted["2011"], sti_adjusted["2012"], sti_adjusted["2013"], sti_adjusted["2014"], sti_adjusted["2015"], sti_adjusted["2016"], sti_adjusted["2017"])
I'm looking for a more elegant solution, like a for-loop maybe?
ls_sti_adjusted <- list()
for (i in 2007:2017){
ls_sti_adjusted[[]] <- ???
}
The second issue is how can I further subset the elements into months in the year?
so for example: ls_sti_adjusted[[1]][[2]][[3]] returns the 3rd data point of February in 2007. Is this possible?
I hope that I am clear about the problem that I am facing. Thanks folks, plus any tips/tricks to understand loops and lists better would be greatly appreciated.
Combining .indexyear and split(x,f = “months” will give you the desired list.
lapply(unique(.indexyear(STI)),function(x) split.xts(STI[.indexyear(STI) == x ,],f='months’))
If you only need yearly lists leave out the split part, like so:
lapply(unique(.indexyear(STI)),function(x) STI[.indexyear(STI) == x ,])
UPDATE: OP’s follow-up question regarding naming of lists
Assuming you named the list of lists object STIlist you can do the following to name the list by years.( keep in mind that the names are converted to strings! )
names(STIlist) <- 2007:2018
To get the list of the year 2007:
> both(STIlist[['2007']])
STI.Open STI.High STI.Low STI.Close STI.Volume STI.Adjusted
2007-01-03 3015.74 3037.74 3010.22 3037.74 192739200 3037.74
2007-01-04 3035.08 3045.18 3008.23 3023.80 198216700 3023.80
2007-01-05 3031.09 3038.27 3000.50 3029.04 233321400 3029.04
STI.Open STI.High STI.Low STI.Close STI.Volume STI.Adjusted
2007-12-27 3469.11 3491.65 3459.97 3477.20 91474200 3477.20
2007-12-28 3452.18 3463.38 3441.96 3445.82 109442100 3445.82
2007-12-31 3424.48 3482.30 3424.48 3482.30 205741900 3482.30
If you need need more information about naming lists "Google is your best friend” or post another question :-)
for the first question something like this?
library(lubridate)
index(sti_adjusted)=floor_date(index(sti_adjusted),unit="years")
ls_sti_adjusted <- lapply(unique(index(sti_adjusted)),function(x) sti_adjusted[index(sti_adjusted)==x,1])
We could use the indexing directly from xts, check ?index.xts:
split(sti_adjusted, .indexyear(sti_adjusted))
In order to keep the correct naming 2012, 2013, ..., we can try:
split(sti_adjusted, as.integer(format(index(sti_adjusted), '%Y')))
Of course this can be nested in a list as much as you want:
nestedList <- lapply(
split(sti_adjusted, .indexyear(sti_adjusted))
, function(x) split(x, .indexmon(x))
)
nestedList[[3]][[2]][3] #3.year, 2.month, 3. obs.
Example using build-in data from xts:
data(sample_matrix, package = "xts")
sample_matrix <- as.xts(sample_matrix)
nestedList <- lapply(
split(sample_matrix, .indexyear(sample_matrix))
, function(x) split(x, .indexmon(x))
)
nestedList[[1]][[3]][5]
Open High Low Close
2007-03-05 50.26501 50.3405 50.26501 50.29567

linking crsp and compustat in R via WRDS

I am using R to connect to WRDS. Now, I would like to link compustat and crsp tables. In SAS, this would be achieved using macros and the CCM link table. What would be the best way to approach this topic in R?
PROGRESS UPDATE:
I downloaded crsp, compustat and ccm_link tables from wrds.
sql <- "select * from CRSP.CCMXPF_LINKTABLE"
res <- dbSendQuery(wrds, sql)
ccmxpf_linktable <- fetch(res, n = -1)
ccm.dt <- data.table(ccmxpf_linktable)
rm(ccmxpf_linktable)
I am then converting the suggested matching routine from the wrds event study sas file into R:
ccm.dt[,typeflag:=linktype %in% c("LU","LC","LD","LN","LS","LX") & USEDFLAG=="1"]
setkey(ccm.dt, gvkey, typeflag)
for (i in 1:nrow(compu.dt)) {
gvkey.comp = compu.dt[i, gvkey]
endfyr.comp = compu.dt[i,endfyr]
PERMNO.val <- ccm.dt[.(gvkey.comp, TRUE),][linkdt<=endfyr.comp & endfyr.comp<=linkenddt,lpermno]
if (length(PERMNO.val)==0) PERMNO.val <- NA
suppressWarnings(compu.dt[i, "PERMNO"] <- PERMNO.val)
}
However, this code is fantastically inefficient. I started out with data.table, but do not really understand how to apply the logic in the for-loop. I am hoping that some could point me to a way how to improve the for-loop.
Matching fields in stages works better. maybe someone finds this useful. Any suggestions for further improvement are of course very welcome!!!
# filter on ccm.dt
ccm.dt <- ccm.dt[linktype %in% c("LU","LC","LD","LN","LS","LX") & USEDFLAG=="1"]
setkey(ccm.dt, gvkey)
setkey(compu.dt, gvkey)
compu.merged <- merge(compu.dt, ccm.dt, all.x = TRUE, allow.cartesian = TRUE)
# deal with NAs in linkenddt - set NAs to todays date, assuming they still exist.
today <- as.character(Sys.Date())
compu.merged[is.na(linkenddt), "linkenddt":=today]
# filter out date mismatches
compu <- compu.merged[linkdt <= endfyr & endfyr<=linkenddt]

How to improve this code for getting pairwise?

It is a question build upon the previous question (http://stackoverflow.com/questions/6538448/r-how-to-write-a-loop-to-get-a-matrix).
It is different from the previous one, as more details is provided, and libraries and example file is provided according to comments from DWin. So, I submitted it as a new question. Could you mind to teach me how to modify this code further?
To load the necessary libraries:
source("http://bioconductor.org/biocLite.R")
biocLite()
My protseq.fasta file has the following contents:
>drugbank_target|1 Peptidoglycan synthetase ftsI (DB00303)
MVKFNSSRKSGKSKKTIRKLTAPETVKQNKPQKVFEKCFMRGRYMLSTVLILLGLCALVARAAYVQSINADTLSNEADKR
SLRKDEVLSVRGSILDRNGQLLSVSVPMSAIVADPKTMLKENSLADKERIAALAEELGMTENDLVKKIEKNSKSGYLYLA
RQVELSKANYIRRLKIKGIILETEHRRFYPRVEEAAHVVGYTDIDGNGIEGIEKSFNSLLVGKDGSRTVRKDKRGNIVAH
ISDEKKYDAQDVTLSIDEKLQSMVYREIKKAVSENNAESGTAVLVDVRTGEVLAMATAPSYNPNNRVGVKSELMRNRAIT
DTFEPGSTVKPFVVLTALQRGVVKRDEIIDTTSFKLSGKEIVDVAPRAQQTLDEILMNSSNRGVSRLALRMPPSALMETY
QNAGLSKPTDLGLIGEQVGILNANRKRWADIERATVAYGYGITATPLQIARAYATLGSFGVYRPLSITKVDPPVIGKRVF
SEKITKDIVGILEKVAIKNKRAMVEGYRVGVKTGTARKIENGHYVNKYVAFTAGIAPISDPRYALVVLINDPKAGEYYGG
AVSAPVFSNIMGYALRANAIPQDAEAAENTTTKSAKRIVYIGEHKNQKVN
>drugbank_target|3 Histidine decarboxylase (DB00114; DB00117)
MMEPEEYRERGREMVDYICQYLSTVRERRVTPDVQPGYLRAQLPESAPEDPDSWDSIFGDIERIIMPGVVHWQSPHMHAY
YPALTSWPSLLGDMLADAINCLGFTWASSPACTELEMNVMDWLAKMLGLPEHFLHHHPSSQGGGVLQSTVSESTLIALLA
ARKNKILEMKTSEPDADESCLNARLVAYASDQAHSSVEKAGLISLVKMKFLPVDDNFSLRGEALQKAIEEDKQRGLVPVF
VCATLGTTGVCAFDCLSELGPICAREGLWLHIDAAYAGTAFLCPEFRGFLKGIEYADSFTFNPSKWMMVHFDCTGFWVKD
KYKLQQTFSVNPIYLRHANSGVATDFMHWQIPLSRRFRSVKLWFVIRSFGVKNLQAHVRHGTEMAKYFESLVRNDPSFEI
PAKRHLGLVVFRLKGPNCLTENVLKEIAKAGRLFLIPATIQDKLIIRFTVTSQFTTRDDILRDWNLIRDAATLILSQHCT
SQPSPRVGNLISQIRGARAWACGTSLQSVSGAGDDPVQARKIIKQPQRVGAGPMKRENGLHLETLLDPVDDCFSEEAPDA
TKHKLSSFLFSYLSVQTKKKTVRSLSCNSVPVSAQKPLPTEASVKNGGSSRVRIFSRFPEDMMMLKKSAFKKLIKFYSVP
SFPECSSQCGLQLPCCPLQAMV
>drugbank_target|5 Glutaminase liver isoform, mitochondrial (DB00130; DB00142)
MRSMKALQKALSRAGSHCGRGGWGHPSRSPLLGGGVRHHLSEAAAQGRETPHSHQPQHQDHDSSESGMLSRLGDLLFYTI
AEGQERTPIHKFTTALKATGLQTSDPRLRDCMSEMHRVVQESSSGGLLDRDLFRKCVSSSIVLLTQAFRKKFVIPDFEEF
TGHVDRIFEDVKELTGGKVAAYIPQLAKSNPDLWGVSLCTVDGQRHSVGHTKIPFCLQSCVKPLTYAISISTLGTDYVHK
FVGKEPSGLRYNKLSLDEEGIPHNPMVNAGAIVVSSLIKMDCNKAEKFDFVLQYLNKMAGNEYMGFSNATFQSEKETGDR
NYAIGYYHEEKKCFPKGVDMMAALDLYFQLCSVEVTCESGSVMAATLANGGICPITGESVLSAEAVRNTLSLMHSCGMYD
FSGQFAFHVGLPAKSAVSGAILLVVPNVMGMMCLSPPLDKLGNSHRGTSFCQKLVSLFNFHNYDNLRHCARKLDPRREGA
EIRNKTVVNLLFAAYSGDVSALRRFALSAMDMEQKDYDSRTALHVAAAEGHIEVVKFLIEACKVNPFAKDRWGNIPLDDA
VQFNHLEVVKLLQDYQDSYTLSETQAEAAAEALSKENLESMV
>drugbank_target|6 Coagulation factor XIII A chain (DB00130; DB01839; DB02340)
SETSRTAFGGRRAVPPNNSNAAEDDLPTVELQGVVPRGVNLQEFLNVTSVHLFKERWDTNKVDHHTDKYENNKLIVRRGQ
SFYVQIDFSRPYDPRRDLFRVEYVIGRYPQENKGTYIPVPIVSELQSGKWGAKIVMREDRSVRLSIQSSPKCIVGKFRMY
VAVWTPYGVLRTSRNPETDTYILFNPWCEDDAVYLDNEKEREEYVLNDIGVIFYGEVNDIKTRSWSYGQFEDGILDTCLY
VMDRAQMDLSGRGNPIKVSRVGSAMVNAKDDEGVLVGSWDNIYAYGVPPSAWTGSVDILLEYRSSENPVRYGQCWVFAGV
FNTFLRCLGIPARIVTNYFSAHDNDANLQMDIFLEEDGNVNSKLTKDSVWNYHCWNEAWMTRPDLPVGFGGWQAVDSTPQ
ENSDGMYRCGPASVQAIKHGHVCFQFDAPFVFAEVNSDLIYITAKKDGTHVVENVDATHIGKLIVTKQIGGDGMMDITDT
YKFQEGQEEERLALETALMYGAKKPLNTEGVMKSRSNVDMDFEVENAVLGKDFKLSITFRNNSHNRYTITAYLSANITFY
TGVPKAEFKKETFDVTLEPLSFKKEAVLIQAGEYMGQLLEQASLHFFVTARINETRDVLAKQKSTVLTIPEIIIKVRGTQ
VVGSDMTVTVQFTNPLKETLRNVWVHLDGPGVTRPMKKMFREIRPNSTVQWEEVCRPWVSGHRKLIASMSSDSLRHVYGE
LDVQIQRRPSM
To load the data to R for the analysis, I have done:
require("Biostrings")
data(BLOSUM100)
seqs <- readFASTA("./protseq.fasta", strip.descs=TRUE)
To get the the pairwise numbers, as there are a total of 4 sequences, I have done:
number <-c(1:4); dat <- expand.grid(number,number, stringsAsFactors=FALSE)
datr <- dat[dat[,1] > dat[,2] , ]
In order to calculate the score one by one, I can do this:
score(pairwiseAlignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5))
However, I have problem to add a new column as "score" to include all the score for each pairs of the proteins. I tried to do this, but did not work.
datr$score <- lapply(datr, 1, function(i) { x <- datr[i,1]; y<- datr[i,2]; score(pairwiseAlignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5))})
Could you mind to comments how to further improve it? Thanks DWin and diliop for wonderful solutions to my previous question.
Try:
datr$score <- sapply(1:nrow(datr), function(i) {
x <- datr[i,1]
y <- datr[i,2]
score(pairwiseAlignment(seqs[[x]]$seq, seqs[[y]]$seq, substitutionMatrix=BLOSUM100,gapOpening=0, gapExtension=-5))
})
To be able to reference your sequences better using their names, you might want to tidy up datr by doing the following:
colnames(datr) <- c("seq1id", "seq2id", "score")
datr$seq1name <- sapply(datr$seq1id, function(i) seqs[[i]]$desc)
datr$seq2name <- sapply(datr$seq2id, function(i) seqs[[i]]$desc)
Or if you just want to extract the accession IDs i.e. the contents of your parentheses, you could use stringr as such:
library(stringr)
datr$seq1name <- sapply(datr$seq2id, function(i) str_extract(seqs[[i]]$desc, "DB[0-9\\ ;DB]+"))
Hope this helps!

Resources