Calculating row sums in data frame based on column names - r

I have a data frame with media spending for different media channels:
TV <- c(200,500,700,1000)
Display <- c(30,33,47,55)
Social <- c(20,21,22,23)
Facebook <- c(30,31,32,33)
Print <- c(50,51,52,53)
Newspaper <- c(60,61,62,63)
df_media <- data.frame(TV,Display,Social,Facebook, Print, Newspaper)
My goal is to calculate the row sums of specific columns based on their name.
For example: Per definition Facebook falls into the category of Social, so I want to add the Facebook column to the Social column and just have the Social column left. The same goes for Newspaper which should be added to Print and so on.
The challenge is that the names and the number of columns that belong to one category change from data set to data set, e.g. the next data set could contain Social, Facebook and Instagram which should be all summed up to Social.
There is a list of rules, which define which media types (column names) belong to each other, but I have to admit that I'm a bit clueless and can only think about a long set of if commands right now, but I hope there is a better solution.
I'm thinking about putting all the names that belong to each other in vectors and use them to find and summarize the relevant columns, but I have no idea, how to execute this.
Any help is appreciated.

You could do something along these lines, which allows columns to be absent from some data sets (using intersect and setdiff):
Define a set of rules, i.e. the columns that are going to be united/grouped together.
Create a vector d of the remaining columns.
Compute the rowSums of every subset of the data set defined in the rules.
Append the remaining columns.
cbind the columns of the list using do.call.
# Rules
rules <- list(social = c("Social", "Facebook", "Instagram"),
              printed = c("Print", "Newspaper"))
# Columns that are not going to be united
d <- setdiff(colnames(df_media), unlist(rules))
# Data frame; drop = FALSE keeps a data frame even when only one column matches
lapply(rules, function(x) rowSums(df_media[, intersect(colnames(df_media), x), drop = FALSE])) |>
  append(df_media[d]) |>
  do.call(cbind.data.frame, args = _)
  social printed   TV Display
1     50     110  200      30
2     52     112  500      33
3     54     114  700      47
4     56     116 1000      55
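Since the question mentions that the columns change from data set to data set, the steps above can be wrapped in a small function and reused. This is only a sketch: the function name sum_by_rules and the second data set df_next (with an Instagram column and no Newspaper column) are made up for illustration.

```r
# Hypothetical helper wrapping the steps above; the name is an assumption
sum_by_rules <- function(df, rules) {
  # Columns that are not covered by any rule
  rest <- setdiff(colnames(df), unlist(rules))
  # One summed column per rule, using only the columns present in df
  grouped <- lapply(rules, function(x) {
    rowSums(df[, intersect(colnames(df), x), drop = FALSE])
  })
  cbind(as.data.frame(grouped), df[, rest, drop = FALSE])
}

rules <- list(social = c("Social", "Facebook", "Instagram"),
              printed = c("Print", "Newspaper"))

# Made-up second data set: has Instagram, lacks Newspaper
df_next <- data.frame(TV = c(200, 500), Social = c(20, 21),
                      Facebook = c(30, 31), Instagram = c(5, 6),
                      Print = c(50, 51))

res <- sum_by_rules(df_next, rules)
print(res)
```

Here social is the sum of Social, Facebook and Instagram, while printed is just the Print column, because Newspaper is missing from this data set.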

Related

R - Using Stringr to identify a string across hundreds of rows

I have a database where some people have multiple diagnoses. I posted a similar question in the past, but now have some more nuances I need to work through:
R- How to test multiple 100s of similar variables against a condition
I have this dataset (which was an import of a SAS file)
ID dx1 dx2 dx3 dx4 dx5 dx6 .... dx200
1 343 432 873 129 12 123 3445
2 34 12 44
3 12
4 34 56
Initially, I wanted to create a new variable that flags whether any of the "dxs" equals a certain number, without using hundreds of if statements. All the different variables have the same format (dx#). So I used the following code:
Ex:
dataset$highbloodpressure <- rowSums(screen[0:832] == "410") > 0
This worked great. However, there are many different codes for the same diagnosis. For example, a heart attack can be defined as:
410.1,
410.71,
410.62,
410.42,
...this goes on for 20 additional codes. BUT! They all start with 410.
I thought about using stringr (the variable is a string), to identify the common code components (410, for the example above), but am not sure how to use it in the context of rowsums.
If anyone has any suggestions for this, please let me know!
Thanks for all the help!
You can use the grepl() function, which returns TRUE if a pattern is found in a string. To check all columns simultaneously, just collapse all of them into one character string per row:
df$dx.410 <- NA
for (i in seq_len(nrow(df))) {
  # Paste all diagnosis columns of row i into one string and search it
  if (grepl('410', paste(df[i, 2:200], collapse = ' '))) {
    df$dx.410[i] <- "Present"
  }
}
This will loop through all rows, create one long character string containing all diagnoses for that case, and write "Present" in column dx.410 if any column contains a 410 diagnosis. Note that the pattern '410' matches anywhere in the string, so a code such as 1410 would be flagged as well.
(The solution expects the data structure you have here, with the dx variables in columns 2 to 200. If there are other columns, just adjust these numbers.)
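A loop-free variant that stays close to the rowSums idea from the question could look like the sketch below. The small data frame is invented for illustration, and the regular expressions "^dx[0-9]+$" (to pick the diagnosis columns) and "^410" (to match codes that start with 410) are assumptions about the naming and coding scheme.

```r
# Toy data standing in for the real dataset (values are made up)
df <- data.frame(
  ID  = 1:4,
  dx1 = c("410.1", "34", "12", "34"),
  dx2 = c("432", "410.71", NA, "56"),
  dx3 = c("873", "44", NA, "1410"),
  stringsAsFactors = FALSE
)

# Select the diagnosis columns by name instead of by position
dx_cols <- grep("^dx[0-9]+$", names(df), value = TRUE)

# Logical matrix: TRUE where a code starts with 410 (NA counts as FALSE)
hits <- sapply(df[dx_cols], function(col) grepl("^410", col))

# A row is flagged if at least one of its diagnosis columns matched
df$dx.410 <- rowSums(hits) > 0
print(df)
```

Anchoring the pattern with ^ means that 1410 in the last row is not flagged, whereas 410.1 and 410.71 are.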

Randomise 380 samples by covariates across four 96-well plates using OSAT package

I need to randomise 380 samples (by age, sex and group [grp]) across four 96 well plates (8 rows, 12 columns), with A01 reserved in each plate for a positive control.
I tried the r-pkg (OSAT) and the recommended script is below. The only piece that does not work is excluding well A01 from each of the four plates.
library(OSAT)
samples <- read.table("~/file.csv", sep=";", header=T)
head(samples)
  grp sex age
1   A   F  45
2   A   M  56
3   A   F  57
4   A   M  67
5   A   F  45
6   A   M  65
sample.list <- setup.sample(samples, optimal = c("grp", "sex", "age"))
excludedWells <- data.frame("plates"= 1:4, chips=rep(1,4), wells=rep(1,4))
container <- setup.container(IlluminaBeadChip96Plate, 4, batch = 'plates')
exclude(container) <- excludedWells
setup <- create.optimized.setup(fun ="optimal.shuffle", sample.list, container)
out <- map.to.MSA(setup, MSA4.plate)
The corresponding R help doc states:
"If for any reason we need to reserve certain wells for other usage, we can exclude them from the sample assignment process. For this one can create a data frame to mark these excluded wells. Any wells in the container can be identified by its location identified by three variable "plates", "chips", "wells". Therefore the data frame for the excluded wells should have these three columns.
For example, if we will use the first well of the first chip on each plate to hold QC samples, these wells will not be available for sample placement. We have 6 plates in our example so the following will reserve the 6 wells from sample assignment:
excludedWells <- data.frame(plates=1:6, chips=rep(1,6), wells=rep(1,6))
Our program can let you exclude multiple wells at the same position of plate/chip. For example, the following data frame will exclude the first well on each chips regardless how many plates we have:
ex2 <- data.frame(wells=1)
I tried both of these and they do not work, as they simply exclude ANY well (and not well #1, A01).
Update: I emailed the developer of the package, who acknowledged the error and provided a workaround. It is incorporated in the code above (exclude the wells after setting up the container).

Extracting a value based on multiple conditions in R

Quick question - I have a dataframe (severity) that looks like,
                   industryType  relfreq   relsev
1             Consumer Products 2.032520 0.419048
2                Biotech/Pharma 0.650407 3.771429
3       Industrial/Construction 1.327913 0.609524
4 Computer Hardware/Electronics 1.571816 2.019048
5               Medical Devices 1.463415 3.028571
6                      Software 0.758808 1.314286
7    Business/Consumer Services 0.623306 0.723810
8            Telecommunications 0.650407 4.247619
If I wanted to pull the relfreq of Medical Devices (row 5), how could I subset just that value?
I was thinking about just indexing and doing severity$relfreq[[5]], but I'd be using this line in a bigger function where the user would specify the industry, i.e. (pseudocode):
example <- function(industrytype) {
  weight <- relfreq of industrytype parameter
  thing2 <- thing1*weight
  return(thing2)
}
So if I subset by an index, is there a way for R to know which index corresponds to the industry type specified in the function parameter? Or is there an easier way to just subset the relfreq column by the industry name?
You first need to select the row of interest and then keep the two columns you want (industryType and relfreq).
The tidyverse packages let you do this intuitively:
library(tidyverse)
data_want <- severity %>%
  subset(industryType == "Medical Devices") %>%
  select(industryType, relfreq)
Here you read from left to right, with %>% passing the result of each step on to the next one, as if the calls were nested.
I think it is better to select the whole row first and then choose the column you would like to see.
frame <- severity[severity$industryType == 'Medical Devices',]
frame$relfreq
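To connect this back to the function from the question, a minimal sketch could look up the weight by name inside the function. The shortened severity data frame, the extra thing1 argument (the question does not say where thing1 comes from) and the check for exactly one match are all assumptions made for illustration.

```r
# Shortened copy of the severity data from the question
severity <- data.frame(
  industryType = c("Consumer Products", "Biotech/Pharma", "Medical Devices"),
  relfreq = c(2.032520, 0.650407, 1.463415),
  relsev = c(0.419048, 3.771429, 3.028571)
)

# thing1 is passed in as an argument here; that is an assumption
example <- function(industrytype, thing1) {
  # Logical indexing: pick relfreq where the industry name matches
  weight <- severity$relfreq[severity$industryType == industrytype]
  if (length(weight) != 1) {
    stop("expected exactly one row for industry type: ", industrytype)
  }
  thing1 * weight
}

result <- example("Medical Devices", 10)
print(result)
```

Matching on the name means the function keeps working if the rows are reordered, which indexing by position (severity$relfreq[[5]]) would not.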

Best way to get list of SNPs by gene id?

I have a long data frame of genes and various forms of ids for them (e.g. OMIM, Ensembl, Genatlas). I want to get the list of all SNPs that are associated with each gene. (This is the reverse of this question.)
So far, the best solution I have found is using the biomaRt package (bioconductor). There is an example of the kind of lookup I need to do here. Fitted for my purposes, here is my code:
library(biomaRt)
#load the human variation data
variation = useEnsembl(biomart="snp", dataset="hsapiens_snp")
#look up a single gene and get SNP data
getBM(
  attributes = c(
    "ensembl_gene_stable_id",
    "refsnp_id",
    "chr_name",
    "chrom_start",
    "chrom_end",
    "minor_allele",
    "minor_allele_freq"
  ),
  filters = "ensembl_gene",
  values = "ENSG00000166813",
  mart = variation
)
This outputs a data frame that begins like this:
  ensembl_gene_stable_id  refsnp_id chr_name chrom_start chrom_end minor_allele minor_allele_freq
1        ENSG00000166813  rs8179065       15    89652777  89652777            T          0.242412
2        ENSG00000166813  rs8179066       15    89652736  89652736            C          0.139776
3        ENSG00000166813 rs12899599       15    89629243  89629243            A          0.121006
4        ENSG00000166813 rs12899845       15    89621954  89621954            C          0.421126
5        ENSG00000166813 rs12900185       15    89631884  89631884            A          0.449681
6        ENSG00000166813 rs12900805       15    89631593  89631593            T          0.439297
(4612 rows)
The code works, but the running time is extremely long. For the above, it takes about 45 seconds. I thought maybe this was related to the allele frequencies, which the server perhaps calculated on the fly. But looking up the bare minimum of only the SNPs rs ids takes something like 25 seconds. I have a few thousand genes, so this would take an entire day (assuming no timeouts or other errors). This can't be right. My internet connection is not slow (20-30 mbit).
I tried looking up more genes per query. This did not help. Looking up 10 genes at once is roughly 10 times as slow as looking up a single gene.
What is the best way to get a vector of SNPs that are associated with a vector of gene ids?
If I could just download two tables, one with genes and their positions and one with SNPs and their positions, then I could easily solve this problem using dplyr (or maybe data.table). I haven't been able to find such tables.
Since you're using R, here's an idea that uses the package rentrez. It utilizes NCBI's Entrez database system and in particular the eutils function, elink. You'll have to write some code around this and probably tweak parameters, but could be a good start.
library(rentrez)
# for converting gene name -> gene id
gene_search <- entrez_search(db="gene", term="(PTEN[Gene Name]) AND Homo sapiens[Organism]", retmax=1)
geneId <- gene_search$ids
# elink function
snp_links <- entrez_link(dbfrom='gene', id=geneId, db='snp')
# access results with $links
length(snp_links$links$gene_snp)
5779
head(snp_links$links$gene_snp)
'864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
I suggest you manually double-check that the number of SNPs is about what you'd expect for your genes of interest -- you may need to drill down further and limit by transcript, etc...
For multiple gene ids:
multi_snp_links <- entrez_link(dbfrom='gene', id=c("5728", "374654"), db='snp', by_id=TRUE)
lapply(multi_snp_links, function(x) head(x$links$gene_snp))
1. '864622690' '864622594' '864622518' '864622451' '864622387' '864622341'
2. '797045093' '797044466' '797044465' '797044464' '797044463' '797016353'
The results are grouped by gene with by_id=TRUE

R: iterating through unique values of a vector in for loop

I'm new to R and I am having some trouble iterating through the unique element of a vector. I have a dataframe "School" with 700 different teachers. Each teacher has around 40 students.
I want to be able to loop through each teacher, create a graph of the mean score of his/her students over time, save the graphs in a folder and automatically email that folder to that teacher.
I'm just getting started and am having trouble setting up the for-loop. In Stata, I know how to loop through each unique element in a list, but am having trouble doing that in R. Any help would be appreciated.
School$Teacher  School$Student  School$ScoreNovember  School$ScoreDec  School$TeacherEmail
A               1               35                    45               A#school.org
A               2               43                    65               A#school.org
B               1               66                    54               B#school.org
A               3               97                    99               A#school.org
C               1               23                    45               C#school.org
Your question seems a bit vague and it looks like you want us to write your whole project. Could you share what you have done so far and where exactly you are struggling?
See ?subset
School <- data.frame(Teacher = c("A", "B"), ScoreNovember = 10:11, ScoreDec = 13:14)
for (teacher in unique(School$Teacher)) {
  # Rows belonging to the current teacher
  teacher_df <- subset(School, Teacher == teacher)
  MeanScoreNovember <- mean(teacher_df$ScoreNovember)
  MeanScoreDec <- mean(teacher_df$ScoreDec)
  # do your plot
  # send your email
}
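As one possible way to fill in the "do your plot" step, the sketch below writes one PDF per teacher into a folder. The folder under tempdir(), the file naming scheme and the choice of pdf() with base graphics are assumptions; the emailing step is left out.

```r
# Toy data with the column names from the question (values are made up)
School <- data.frame(
  Teacher = c("A", "A", "B"),
  ScoreNovember = c(35, 43, 66),
  ScoreDec = c(45, 65, 54)
)

# Hypothetical output folder; replace with a real path as needed
out_dir <- file.path(tempdir(), "teacher_plots")
dir.create(out_dir, showWarnings = FALSE)

for (teacher in unique(School$Teacher)) {
  teacher_df <- subset(School, Teacher == teacher)
  # Mean score per month for this teacher's students
  means <- colMeans(teacher_df[, c("ScoreNovember", "ScoreDec")])

  # One PDF per teacher, named after the teacher
  pdf(file.path(out_dir, paste0("teacher_", teacher, ".pdf")))
  plot(means, type = "b", xaxt = "n", xlab = "Month",
       ylab = "Mean score", main = paste("Teacher", teacher))
  axis(1, at = 1:2, labels = c("November", "December"))
  dev.off()
}

print(list.files(out_dir))
```

Each pass through the loop opens a graphics device, draws the plot and closes the device again, so the files are complete by the time the loop ends.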
I think you have three questions here, each of which needs its own post. How do I:
Create graphs
Automatically email output
Compute a subset mean based on group
For the third one, I like using the plyr package; other people will recommend the data.table or dplyr packages. You can also use aggregate from base. To get a teacher's mean:
library(plyr)
ddply(School,.(Teacher),summarise,Nov_m=mean(ScoreNovember))
If you want the mean per student per teacher, etc., just add the extra grouping columns inside .(), like:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember))
You could do that for each score column (and then chart it). If your data were long rather than wide, you could also add the date ('November', 'Dec') as a group in the brackets. Otherwise:
library(plyr)
ddply(School,.(Teacher,Student),summarise,Nov_m=mean(ScoreNovember),Dec_m=mean(ScoreDec))
See if that helps with the 3rd, but look at splitting your questions up too.
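The base R route via aggregate mentioned above could be sketched as follows. The data frame repeats the sample rows from the question (without the email column), and the formula with cbind() on the left-hand side is one way to average several score columns at once.

```r
# Sample rows from the question, without the email column
School <- data.frame(
  Teacher = c("A", "A", "B", "A", "C"),
  Student = c(1, 2, 1, 3, 1),
  ScoreNovember = c(35, 43, 66, 97, 23),
  ScoreDec = c(45, 65, 54, 99, 45)
)

# Mean of both score columns per teacher, using base R only
means <- aggregate(cbind(ScoreNovember, ScoreDec) ~ Teacher,
                   data = School, FUN = mean)
print(means)
```

The result has one row per teacher, with the averaged November and December scores as columns.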
