Hello everybody. My program computes statistics for every pair of groups. For example, if I have groups 1, 2, 3, 4, the program processes the pairs in this order: 1-2; 1-3; 1-4; 2-3; 2-4; 3-4. It does NOT process 2-1, because that is the same as the first pair, and so on.
I created a new data set with columns c('n', 'v', 'd','pv',' pvf'). For columns 1 to 3, I would like the results to be recorded in this format:
1:...
2:...
3:...
And from 4 to 5:
1-2:...
1-3:...
1-4:...
2-3:...
2-4:...
3-4:...
I wrote a small example:
res=list()
for (i in 1:(ncol(mtcars)-1)) {
  for (j in (i+1):ncol(mtcars)) {
    res=c(res,list(c(i,j,paste0(mtcars[,i],'_',mtcars[,j]))))
  }
}
res=do.call(cbind,res)
How do I write data with different indexes to one cell and glue them together?
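One possible reading of the question is that each pairwise result should be stored under an "i-j" label. A minimal sketch of that idea, assuming the first four columns of mtcars stand in for groups 1 to 4 (the names groups, pairs and pair_labels are made up for this illustration):

```r
# Groups 1..4 are represented here by the first four columns of mtcars
groups <- 1:4

# combn() returns every unordered pair exactly once: 1-2, 1-3, ..., 3-4
pairs <- combn(groups, 2)
pair_labels <- paste(pairs[1, ], pairs[2, ], sep = "-")

# One glued vector per pair, as in the example above
res <- lapply(seq_len(ncol(pairs)), function(k) {
  i <- pairs[1, k]
  j <- pairs[2, k]
  paste0(mtcars[, i], "_", mtcars[, j])
})

# Naming the list elements makes cbind() use the "i-j" labels as column names
names(res) <- pair_labels
res <- do.call(cbind, res)
colnames(res)
```

Because the labels are attached to the list before binding, the pair indices no longer have to be stored inside the data itself.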
I have a data frame with media spending for different media channels:
TV <- c(200,500,700,1000)
Display <- c(30,33,47,55)
Social <- c(20,21,22,23)
Facebook <- c(30,31,32,33)
Print <- c(50,51,52,53)
Newspaper <- c(60,61,62,63)
df_media <- data.frame(TV,Display,Social,Facebook, Print, Newspaper)
My goal is to calculate the row sums of specific columns based on their name.
For example: by definition, Facebook falls into the category of Social, so I want to add the Facebook column to the Social column and have only the Social column left. The same goes for Newspaper, which should be added to Print, and so on.
The challenge is that the names and the number of columns that belong to one category change from data set to data set, e.g. the next data set could contain Social, Facebook and Instagram which should be all summed up to Social.
There is a list of rules, which define which media types (column names) belong to each other, but I have to admit that I'm a bit clueless and can only think about a long set of if commands right now, but I hope there is a better solution.
I'm thinking about putting all the names that belong to each other in vectors and use them to find and summarize the relevant columns, but I have no idea, how to execute this.
Any help is appreciated.
You could do something along these lines, which allows columns to be absent from some data sets (with intersect and setdiff).
Define a set of rules, i.e. the columns that are going to be united/grouped together.
Create a vector d of the remaining columns.
Compute the rowSums of every subset of the data set defined in the rules.
Append the remaining columns.
cbind the columns of the list using do.call.
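# Rules: which columns are merged into which category
rules = list(social = c("Social", "Facebook", "Instagram"),
             printed = c("Print", "Newspaper"))
# Columns that are not going to be united
d <- setdiff(colnames(df_media), unlist(rules))
# Sum each category over the columns actually present, then bind everything
# into one data frame. drop = FALSE keeps single-column selections as data
# frames; the `_` placeholder needs R >= 4.2.
lapply(rules, function(x) rowSums(df_media[, intersect(colnames(df_media), x), drop = FALSE])) |>
  append(df_media[, d, drop = FALSE]) |>
  do.call(cbind.data.frame, args = _)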
  social printed   TV Display
1     50     110  200      30
2     52     112  500      33
3     54     114  700      47
4     56     116 1000      55
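Since the set of columns changes from data set to data set, the same steps could be wrapped in a function. A minimal sketch, assuming a hypothetical helper name collapse_columns (not part of any package) and a made-up second data set df_other:

```r
# Hypothetical helper: merge columns according to a named list of rules
collapse_columns <- function(df, rules) {
  # Sum, per category, only the columns that are present in this data set
  grouped <- lapply(rules, function(cols) {
    present <- intersect(colnames(df), cols)
    rowSums(df[, present, drop = FALSE])
  })
  # Columns not mentioned in any rule are kept unchanged
  rest <- setdiff(colnames(df), unlist(rules))
  cbind(as.data.frame(grouped), df[, rest, drop = FALSE])
}

rules <- list(social  = c("Social", "Facebook", "Instagram"),
              printed = c("Print", "Newspaper"))

# The data from the question
df_media <- data.frame(TV = c(200, 500, 700, 1000), Display = c(30, 33, 47, 55),
                       Social = c(20, 21, 22, 23), Facebook = c(30, 31, 32, 33),
                       Print = c(50, 51, 52, 53), Newspaper = c(60, 61, 62, 63))
out_media <- collapse_columns(df_media, rules)

# A made-up data set that also contains Instagram but no Newspaper
df_other <- data.frame(TV = c(1, 2), Social = c(10, 20), Facebook = c(1, 2),
                       Instagram = c(3, 4), Print = c(5, 6))
out_other <- collapse_columns(df_other, rules)
```

The rules list is defined once and reused; only the columns that actually occur in a given data set contribute to each sum.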
Below is my objective, followed by the code I wrote to attempt it. Row 19 is the original street text and 24 is where street2 is located.
The .csv is available at https://www.opendataphilly.org/dataset/shooting-victims/resource/a6240077-cbc7-46fb-b554-39417be606ee
Let's deal with the streets with '&' separating their names. Create a new column named street2 and set it equal to NA.
Then, iterate over the data frame using a for loop, testing if the street variable you created earlier contains an NA value.
In cases where this occurs, separate the names in block according to the & delimiter into the fields street and street2 accordingly.
Output the first 5 lines of the data frame to the screen.
Hint: for; if; :; nrow(); is.na(); strsplit(); unlist().
NewLocation$street2 <- 'NA'
Task7 <- unlist(NewLocation)
for (col in seq (1:dim(NewLocation)[19])) {
  if (Task7[street2]=='NA'){
    for row in seq (1:dim(NewLocation[24])){
      NewLocation[row,col] <-strsplit(street,"&",(NewLocation[row,col]))
    }
  }
}
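For comparison, here is a minimal sketch of how the task description could be followed with the hinted functions. It runs on a small made-up data frame, and it assumes the columns are called block and street as in the task text; the real file may differ.

```r
# Toy stand-in for the real data (values and column names are assumptions)
NewLocation <- data.frame(
  block  = c("1200 MARKET ST", "N 5TH ST & W LEHIGH AVE", "S 52ND ST & WALNUT ST"),
  street = c("MARKET ST", NA, NA),
  stringsAsFactors = FALSE
)

# Use a real missing value, not the string 'NA'
NewLocation$street2 <- NA_character_

# Loop over rows; split `block` only where `street` is missing
for (i in 1:nrow(NewLocation)) {
  if (is.na(NewLocation$street[i])) {
    parts <- trimws(unlist(strsplit(NewLocation$block[i], "&", fixed = TRUE)))
    NewLocation$street[i]  <- parts[1]
    NewLocation$street2[i] <- parts[2]
  }
}

head(NewLocation, 5)
```

The loop runs over rows (1:nrow()) and not over columns, and is.na() tests for missing values, which a comparison with the string 'NA' cannot do.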
I have a long data frame of genes and various forms of ids for them (e.g. OMIM, Ensembl, Genatlas). I want to get the list of all SNPs that are associated with each gene. (This is the reverse of this question.)
So far, the best solution I have found is using the biomaRt package (Bioconductor). There is an example of the kind of lookup I need to do here. Adapted for my purposes, here is my code:
library(biomaRt)
#load the human variation data
variation = useEnsembl(biomart="snp", dataset="hsapiens_snp")
#look up a single gene and get SNP data
getBM(attributes = c(
        "ensembl_gene_stable_id",
        'refsnp_id',
        'chr_name',
        'chrom_start',
        'chrom_end',
        'minor_allele',
        'minor_allele_freq'),
      filters = 'ensembl_gene',
      values ="ENSG00000166813",
      mart = variation
)
This outputs a data frame that begins like this:
  ensembl_gene_stable_id  refsnp_id chr_name chrom_start chrom_end minor_allele minor_allele_freq
1        ENSG00000166813  rs8179065       15    89652777  89652777            T          0.242412
2        ENSG00000166813  rs8179066       15    89652736  89652736            C          0.139776
3        ENSG00000166813 rs12899599       15    89629243  89629243            A          0.121006
4        ENSG00000166813 rs12899845       15    89621954  89621954            C          0.421126
5        ENSG00000166813 rs12900185       15    89631884  89631884            A          0.449681
6        ENSG00000166813 rs12900805       15    89631593  89631593            T          0.439297
(4612 rows)
The code works, but the running time is extremely long. For the above, it takes about 45 seconds. I thought maybe this was related to the allele frequencies, which the server perhaps calculates on the fly. But looking up the bare minimum, only the SNP rs ids, still takes something like 25 seconds. I have a few thousand genes, so this would take an entire day (assuming no timeouts or other errors). This can't be right. My internet connection is not slow (20-30 Mbit/s).
I tried looking up more genes per query. This did not help. Looking up 10 genes at once is roughly 10 times as slow as looking up a single gene.
What is the best way to get a vector of SNPs that are associated with a vector of gene ids?
If I could just download two tables, one with genes and their positions and one with SNPs and their positions, then I could easily solve this problem using dplyr (or maybe data.table). I haven't been able to find such tables.
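To illustrate the join that such tables would allow, here is a minimal base R sketch. Both tables are toy data with made-up names and coordinates, not real annotation, and all column names are assumptions.

```r
# Toy tables with made-up coordinates (not real annotation data)
genes <- data.frame(
  gene  = c("geneA", "geneB"),
  chr   = c("15", "15"),
  start = c(100, 500),
  end   = c(200, 600),
  stringsAsFactors = FALSE
)
snps <- data.frame(
  snp = c("rs1", "rs2", "rs3", "rs4"),
  chr = c("15", "15", "15", "1"),
  pos = c(150, 550, 300, 150),
  stringsAsFactors = FALSE
)

# Pair every gene with every SNP on the same chromosome, then keep
# only the SNPs whose position falls inside the gene's interval
candidates <- merge(genes, snps, by = "chr")
hits <- candidates[candidates$pos >= candidates$start &
                   candidates$pos <= candidates$end, ]

# One character vector of SNP ids per gene
snps_by_gene <- split(hits$snp, hits$gene)
```

Merging on chromosome alone builds every gene-SNP combination per chromosome, so for full-size tables an interval join such as data.table's foverlaps would likely be more practical.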
Since you're using R, here's an idea that uses the rentrez package. It utilizes NCBI's Entrez database system and in particular the eutils function elink. You'll have to write some code around this and probably tweak parameters, but it could be a good start.
library(rentrez)
# for converting gene name -> gene id
gene_search <- entrez_search(db="gene", term="(PTEN[Gene Name]) AND Homo sapiens[Organism]", retmax=1)
geneId <- gene_search$ids
# elink function
snp_links <- entrez_link(dbfrom='gene', id=geneId, db='snp')
# access results with $links
length(snp_links$links$gene_snp)
5779
head(snp_links$links$gene_snp)
'864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
I suggest you manually double-check that the number of SNPs is about what you'd expect for your genes of interest -- you may need to drill down further and limit by transcript, etc...
For multiple gene ids:
multi_snp_links <- entrez_link(dbfrom='gene', id=c("5728", "374654"), db='snp', by_id=TRUE)
lapply(multi_snp_links, function(x) head(x$links$gene_snp))
1. '864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
2. '797045093' '797044466' '797044465' '797044464' '797044463' '797016353'
The results are grouped by gene with by_id=TRUE
My data looks approximately like this (millions of rows):
Customers  Market  Firm
1          NY      A
2          LA      B
1          LA      A
1          NY      A
...        ...     ...
Some of the entries in the "Firm" Column are equal to 'x', and I need to ignore those rows.
I need to create another matrix with Markets as the rows and Firms as the columns, with each element being the sum of "Customers" for each Market-Firm pair.
The code I'm currently using is a relatively straightforward for-loop:
for (i in 1:length(mydata$Customers)) {
  if(mydata$Firm[i]!="x") {
    newmatrix[mydata$Market[i],mydata$Firm[i]] <- newmatrix[mydata$Market[i],mydata$Firm[i]] + mydata$Customers[i]
  }
}
It works, but it takes FOREVER. Is there a way I can speed it up? I'm new to R, but I understand that these kinds of operations are supposed to be simpler...
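One vectorised alternative to the loop is a cross-tabulation with xtabs from base R. A minimal sketch on a toy version of the data (the fifth row, with Firm equal to 'x', is made up to show the filtering):

```r
# Toy version of the data; the last row should be ignored
mydata <- data.frame(
  Customers = c(1, 2, 1, 1, 5),
  Market    = c("NY", "LA", "LA", "NY", "NY"),
  Firm      = c("A", "B", "A", "A", "x"),
  stringsAsFactors = FALSE
)

# Drop the rows to ignore in one vectorised step
kept <- mydata[mydata$Firm != "x", ]

# Sum Customers for every Market (rows) x Firm (columns) pair
newmatrix <- xtabs(Customers ~ Market + Firm, data = kept)
newmatrix
```

The result is a table with Markets as rows and Firms as columns, and it can be indexed like a matrix, e.g. newmatrix["NY", "A"].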
I have a tricky problem with applying a function to a list of data frames. Ultimately I want to plot individual time series charts for a large data set of drug usage figures.
My dataset comprises 30 different antibiotics with a usage rate that has been collected monthly over a 5 year period. It has 3 columns and 1692 rows.
So far I have made a list of individual data frames for each antibiotic class. (The name of the list is drugList and drug.class is a character vector of drug names from the original data frame.)
drugList <- list()
n<-length(drug.class)
for (i in 1:n){
  drugList[[i]] <-AB[Drug==(drug.class[i]),]
}
For example, I have 30 data frames in a list with the following columns:
[[29]]
           Drug Usage       DateA
1353 Tobramycin  5.06 01-Jan-2006
1354 Tobramycin  4.21 01-Feb-2006
1355 Tobramycin  6.34 01-Mar-2006
.
.
.
          Drug Usage       DateA
678 Vancomycin 11.62 01-Jan-2006
679 Vancomycin 11.94 01-Feb-2006
680 Vancomycin 14.29 01-Mar-2006
Before each plot is made, a logical test is performed to determine whether the time series is autocorrelated. The data frames in the list are of varying lengths.
I have written a function to perform the test as follows:
acTest <- function(){
  id<-ts(1:length(DateA))
  a1<-ts(Usage)
  a2<-lag(a1-1)
  tg<-ts.union(a1,id,a2)
  mg<-lm(a1~a2+bs(id,df=3), data=tg)
  a2Pval <- summary(mg)$coefficients[2, 4]
  if (a2Pval<=0.05) {
    TRUE
  } else {
    FALSE
  }
}
I have previously tested all my functions on individual data frames and they work as expected.
I am trying to work out how to apply the test to each data frame in the drug list. I believe if I can get help working this out I will be in a position to apply the time series functions in the same manner.
Thanks in advance for any assistance offered.
A few suggestions:
Change your acTest function so that it actually accepts a data.frame as a parameter. Otherwise you'll have lots of problems with the function looking for (and modifying) objects named DateA and Usage in the global environment.
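library(splines)  # provides bs()

acTest <- function(dat){
  id <- ts(1:length(dat$DateA))
  a1 <- ts(dat$Usage)
  # Kept from the question: lag(a1-1) shifts the series (a1 - 1) by one step.
  # If the previous observation was intended, lag(a1, -1) would express that.
  a2 <- lag(a1-1)
  tg <- ts.union(a1,id,a2)
  mg <- lm(a1~a2+bs(id,df=3), data=tg)
  # p-value of the lagged term (second row of the coefficient table)
  a2Pval <- summary(mg)$coefficients[2, 4]
  # TRUE when the lagged term is significant at the 5% level
  a2Pval <= 0.05
}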
Applying a function to each element of a list is a common task in R. It is (most often) done using lapply.
lapply(drugList,FUN=acTest)
Finally, you can do tasks like this without storing each data frame as a separate list element by using tools like ddply (among others) that split a data frame using one variable, apply a function to each piece and then reassemble them into a single data frame again. In your case, that would look something like:
library(plyr)
ddply(AB,.(Drug),.fun = acTest)
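The split-and-apply pattern described above can also be sketched with base R alone. The example below is self-contained and uses simulated usage figures, not the real data. It also assumes that the intended regressor is the previous observation, written here as lag(a1, -1), which differs from the lag(a1-1) in the question.

```r
library(splines)  # provides bs()

# Simulated stand-in for AB; the Usage values are random, not real data
set.seed(1)
AB <- data.frame(
  Drug  = rep(c("Tobramycin", "Vancomycin"), times = c(24, 36)),
  Usage = c(rnorm(24, mean = 5), rnorm(36, mean = 12)),
  stringsAsFactors = FALSE
)

acTest <- function(dat) {
  a1 <- ts(dat$Usage)
  id <- ts(seq_along(dat$Usage))
  a2 <- lag(a1, -1)               # previous observation (assumed intent)
  tg <- ts.union(a1, id, a2)
  mg <- lm(a1 ~ a2 + bs(id, df = 3), data = tg)
  # TRUE when the lagged term is significant at the 5% level
  summary(mg)$coefficients[2, 4] <= 0.05
}

# split() builds the named list of per-drug data frames in one call
drugList <- split(AB, AB$Drug)

# sapply() returns one named logical per drug
result <- sapply(drugList, acTest)
result
```

Using split() in place of the explicit loop also names each list element after its drug, so the results stay labelled. Data frames of different lengths are handled without extra code.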