Choose higher values from two columns after extracting the number (R)

I have a data frame (451 obs. of 8 variables) with two columns (columns 6 and 7) that look like this:
Major Minor
C:726 T:2
A:687 G:41
T:3 C:725
I want to create one column that summarises this. I don't care about the letters in each cell; I just want the larger of the two numbers to remain, whichever column it is in. I.e. I want it to look like this:
Summary_column
726
687
725
Not necessary, but for those who wonder what I'm doing: this is the output from a programme called VCFtools; it has a count function that counts alleles in a VCF, but sometimes it labels an allele as "Minor" when it is clearly the more common one.
Thanks for your help!

I would do something like this:
extract <- function(v) {
  # keep only the digits after the colon and convert to numeric,
  # so that pmax() compares numbers rather than character strings
  as.numeric(gsub("^.*:", "", v))
}
within(d, Summary_column <- pmax(extract(Major), extract(Minor)))
Which gives:
Major Minor Summary_column
1 C:726 T:2 726
2 A:687 G:41 687
3 T:3 C:725 725
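For a self-contained test, d can be rebuilt from the sample data in the question (this reconstruction is my assumption; the post never shows how d was created):
# Rebuild the question's two columns as a small data frame
d <- data.frame(Major = c("C:726", "A:687", "T:3"),
                Minor = c("T:2", "G:41", "C:725"))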

Related

Calculating row sums in data frame based on column names

I have a data frame with media spending for different media channels:
TV <- c(200,500,700,1000)
Display <- c(30,33,47,55)
Social <- c(20,21,22,23)
Facebook <- c(30,31,32,33)
Print <- c(50,51,52,53)
Newspaper <- c(60,61,62,63)
df_media <- data.frame(TV,Display,Social,Facebook, Print, Newspaper)
My goal is to calculate the row sums of specific columns based on their name.
For example: by definition, Facebook falls into the category of Social, so I want to add the Facebook column to the Social column and keep just the Social column. The same goes for Newspaper, which should be added to Print, and so on.
The challenge is that the names and the number of columns that belong to one category change from data set to data set, e.g. the next data set could contain Social, Facebook and Instagram which should be all summed up to Social.
There is a list of rules, which define which media types (column names) belong to each other, but I have to admit that I'm a bit clueless and can only think about a long set of if commands right now, but I hope there is a better solution.
I'm thinking about putting all the names that belong together in vectors and using them to find and summarise the relevant columns, but I have no idea how to execute this.
Any help is appreciated.
You could do something along these lines, which allows columns to be absent from a given data set (via intersect and setdiff):
1. Define a set of rules, i.e. the columns that are going to be united/grouped together.
2. Create a vector d of the remaining columns.
3. Compute the rowSums of every subset of the data set defined in the rules.
4. Append the remaining columns.
5. cbind the columns of the list using do.call.
# Rules: columns to be united
rules <- list(social  = c("Social", "Facebook", "Instagram"),
              printed = c("Print", "Newspaper"))

# Columns that are not going to be united
d <- setdiff(colnames(df_media), unlist(rules))

# Build the summarised data frame
lapply(rules, function(x) rowSums(df_media[, intersect(colnames(df_media), x)])) |>
  append(df_media[, d]) |>
  do.call(cbind.data.frame, args = _)
social printed TV Display
1 50 110 200 30
2 52 112 500 33
3 54 114 700 47
4 56 116 1000 55
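Since the rules change from data set to data set, the same logic can be wrapped in a reusable helper. This is a sketch of my own, not part of the original answer; sum_by_rules is a hypothetical name:
# Apply a list of grouping rules to any data frame:
# each rule becomes one rowSums() column, untouched columns are kept
sum_by_rules <- function(df, rules) {
  rest    <- setdiff(colnames(df), unlist(rules))
  grouped <- lapply(rules, function(x)
    rowSums(df[, intersect(colnames(df), x), drop = FALSE]))
  do.call(cbind.data.frame, append(grouped, df[, rest, drop = FALSE]))
}

sum_by_rules(df_media, rules)  # same output as above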

R - Using Stringr to identify a string across hundreds of rows

I have a database where some people have multiple diagnoses. I posted a similar question in the past, but now have some more nuances I need to work through:
R- How to test multiple 100s of similar variables against a condition
I have this dataset (which was an import of a SAS file)
ID dx1 dx2 dx3 dx4 dx5 dx6 .... dx200
1 343 432 873 129 12 123 3445
2 34 12 44
3 12
4 34 56
Initially, I wanted to create a new variable if any of the "dxs" equals a certain number, without using hundreds of if statements. All the variables have the same format (dx#), so I used the following code:
Ex:
dataset$highbloodpressure <- rowSums(screen[0:832] == "410") > 0
This worked great. However, there are many different codes for the same diagnosis. For example, a heart attack can be coded as 410.1, 410.71, 410.62, 410.42, and this goes on for 20 additional codes. BUT! They all start with 410.
I thought about using stringr (the variable is a string) to identify the common code component (410 in the example above), but am not sure how to use it in the context of rowSums.
If anyone has any suggestions for this, please let me know!
Thanks for all the help!
You can use the grepl() function, which returns TRUE if a value is present. In order to check all columns simultaneously, just collapse all of them into one string per row:
df$dx.410 <- NA
for (i in 1:nrow(df)) {
  # collapse all diagnosis columns of this row into one string, then look
  # for a code starting with 410 (the \\b boundary avoids matching e.g. "2410")
  if (grepl("\\b410", paste(df[i, 2:200], collapse = " "))) {
    df$dx.410[i] <- "Present"
  }
}
This loops through all rows, creates one large string containing all diagnoses for that case, and writes "Present" in column dx.410 if any column contains a 410 diagnosis.
(The solution expects the data structure you have here, with the dx variables in columns 2 to 200. If there are other columns, just adjust these numbers.)
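A vectorized alternative in the spirit of the asker's original rowSums() trick (a sketch, assuming as above that the dx variables sit in columns 2 to 200):
# TRUE wherever a cell starts with "410"; grepl() runs column by column
starts_410 <- sapply(df[, 2:200], function(col) grepl("^410", col))
df$dx.410  <- ifelse(rowSums(starts_410, na.rm = TRUE) > 0, "Present", NA)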

Looping regressions and running column sum based on results

I have a data frame with panel data that looks as follows:
countrycode year 7111 7112 7119 7126 7129 7131 7132 7133 7138
1 AGO 1981 380491 149890 238832 0 166690 449982 710642 430481 890546
2 AGO 1982 339626 66434 183487 0 79682 108356 486799 186884 220545
3 AGO 1983 128043 2697 91404 148617 3988 432725 829958 138764 152822
4 AGO 1984 67832 0 85613 1251 45644 361733 1250272 237236 2952746
5 AGO 1985 354335 11225 143000 2130 7687 2204297 942071 408907 474666
There are 159 four-digit column variables like the ones shown above. There are also column variables named CEPI1_fw and CIPI1_fw. Furthermore, there are 46 countries and 34 years in the data set.
I would like to use the plm command to regress each of the numerical column variables on CEPI1_fw and CIPI1_fw. Then, I would like to sum the numerical column variables in the data frame above based on whether the coefficients from the regressions are above or below a certain threshold. The resulting output should be a pair of columns added to the data frame above.
There are a few ambiguities in your question, but I'll take a shot.
First, I'm going to revamp your code slightly: adding rows to data frames is very inefficient (probably doesn't matter in this application, but it's a bad habit to get into ...)
out <- list()
for (i in colnames(master5)) {
  f <- reformulate(c("CEPI1_fw", "CIPI1_fw"),
                   response = paste0("master5$", i))
  m <- summary(plm(f, data = master4, model = "within"))
  out <- c(out, list(data.frame(yvar = i,
                                coef = m$coefficients[1, 1],
                                pval = m$coefficients[1, 4],
                                stringsAsFactors = FALSE)))
}
out <- do.call(rbind, out)  ## combine elements into a single data frame
Select only statistically significant response variables. From a statistical/inferential point of view, this is probably a bad idea ...
out <- out[out$pval<0.05,]
Select the names of variables where the coefficients are above a threshold
big_vars <- out$yvar[abs(out$coef)>threshold]
Compute column sums from another data set ...
colSums(other_data[big_vars])
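The question also asks for a pair of columns appended to the panel data. One possible reading, sketched here with the same placeholder threshold (and splitting on the value of the coefficient rather than abs() as above):
# Sum the variables whose coefficients fall above vs. below the threshold
above_vars <- out$yvar[out$coef >  threshold]
below_vars <- out$yvar[out$coef <= threshold]
master5$sum_above <- rowSums(master5[above_vars])
master5$sum_below <- rowSums(master5[below_vars])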

Highlighting regions in ggplot2 barplot fulfilling a condition

I want to plot a horizontal barplot using ggplot2 and highlight regions satisfying a particular criteria.
In this case, if any "Term" for point "E15.5-E18.5_up_down" has more than twice the number of samples compared to point "P22-P29_up_down" and vice-versa, highlight that label or region.
I have a dataframe in following format:
CLID CLSZ GOID NodeSize SampleMatch Phyper Padj Term Ont SampleKeys
E15.5-E18.5_up_down 1364 GO:0007568 289 20 0.141830716154421 1 aging BP ENSMUSG00000049932 ENSMUSG00000046352 ENSMUSG00000078249 ENSMUSG00000039428 ENSMUSG00000014030 ENSMUSG00000039323 ENSMUSG00000026185 ENSMUSG00000027513 ENSMUSG00000023224 ENSMUSG00000037411 ENSMUSG00000020429 ENSMUSG00000020897 ENSMUSG00000025486 ENSMUSG00000021477 ENSMUSG00000019987 ENSMUSG00000023067 ENSMUSG00000031980 ENSMUSG00000023070 ENSMUSG00000025747 ENSMUSG00000079017
E15.5-E18.5_up_down 1364 GO:0006397 416 3 0.999999969537913 1 mRNA processing BP ENSMUSG00000027510 ENSMUSG00000021210 ENSMUSG00000027951
P22-P29_up_down 476 GO:0007568 289 11 0.0333771791166823 1 aging BP ENSMUSG00000049932 ENSMUSG00000037664 ENSMUSG00000026879 ENSMUSG00000026185 ENSMUSG00000026043 ENSMUSG00000060600 ENSMUSG00000022508 ENSMUSG00000020897 ENSMUSG00000028702 ENSMUSG00000030562 ENSMUSG00000021670
P22-P29_up_down 476 GO:0006397 416 2 0.998137879564768 1 mRNA processing BP ENSMUSG00000024007 ENSMUSG00000039878
reduced to (only those terms which are necessary for plotting):
CLID SampleMatch Term
E15.5-E18.5_up_down 20 aging
P22-P29_up_down 2 mRNA processing
E15.5-E18.5_up_down 3 mRNA processing
P22-P29_up_down 11 aging
I would prefer a general approach that will work with any condition, not just the one I need for this scenario. One way I imagined is to use sapply for each pair of CLID/Term and create another column which stores, as a boolean, whether the criterion is fulfilled, but I still cannot find a way to highlight the values. What would be the most efficient way to achieve this?
Pseudo-code for my approach:
for (i in CLID) {
  for (k in CLID) {
    if (Term[i] == Term[k]) {
      # check if the SampleMatch count for one CLID/Term pair is
      # significantly higher than that of the corresponding pair
      condition = check(Term[i], Term[k])
      if (condition == TRUE) {
        highlight(term)
      }
    }
  }
}
In the end I want something like this (highlighting the label or column; the example image is not reproduced here), or like this: Highlight data individually with facet_grid in R.
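No answer is included for this question here, but a minimal sketch of the approach the asker outlines (compute the condition per Term, then map it to the fill aesthetic) could look like this, assuming the reduced data frame is named go_df:
library(ggplot2)
library(dplyr)

# Flag every Term where one CLID has more than twice the
# SampleMatch count of the other (the asker's example condition)
go_flagged <- go_df |>
  group_by(Term) |>
  mutate(highlight = max(SampleMatch) > 2 * min(SampleMatch)) |>
  ungroup()

ggplot(go_flagged, aes(x = Term, y = SampleMatch,
                       group = CLID, fill = highlight)) +
  geom_col(position = "dodge") +
  coord_flip()  # horizontal barplot, as requested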

How do I generate a dataframe displaying the number of unique pairs between two vectors, for each unique value in one of the vectors?

First of all, I apologize for the title. I really don't know how to succinctly explain this issue in one sentence.
I have a dataframe where each row represents some aspect of a hospital visit by a patient. A single patient might have thousands of rows for dozens of hospital visits, and each hospital visit could account for several rows.
One column is Medical.Record.Number, which corresponds to patient IDs, and the other is Patient.ID.Visit, which corresponds to an ID for an individual hospital visit. I am trying to calculate the number of hospital visits each patient has had.
For example:
Medical.Record.Number    Patient.ID.Visit
AAAXXX           1111
AAAXXX           1112
AAAXXX           1113
AAAZZZ           1114
AAAZZZ           1114
AAABBB           1115
AAABBB           1116
would produce the following:
Medical.Record.Number   Number.Of.Visits
AAAXXX          3
AAAZZZ          1
AAABBB          2
The solution I am currently using is the following, where "data" is my dataframe:
# This function returns the number of unique hospital visits
# associated with the supplied record number
countVisits <- function(record.number){
  visits.by.number <- data$Patient.ID.Visit[which(data$Medical.Record.Number == record.number)]
  return(length(unique(visits.by.number)))
}

recordNumbers <- unique(data$Medical.Record.Number)
visits <- integer()
for (record in recordNumbers){
  visits <- c(visits, countVisits(record))
}
visit.counts <- data.frame(recordNumbers, visits)
This works, but it is pretty slow. I am dealing with potentially millions of rows of data, so I'd like something efficient. From what little I know about R, I know there's usually a faster way to do things without using a for-loop.
This essentially looks like a table() operation after you take out duplicates. First, some sample data:
#sample data
dd<-read.table(text="Medical.Record.Number Patient.ID.Visit
AAAXXX 1111
AAAXXX 1112
AAAXXX 1113
AAAZZZ 1114
AAAZZZ 1114
AAABBB 1115
AAABBB 1116", header=T)
then you could do
tt <- table(Medical.Record.Number=unique(dd)$Medical.Record.Number)
as.data.frame(tt, responseName="Number.Of.Visits") #to get a data.frame rather than named vector (table)
# Medical.Record.Number Number.Of.Visits
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
Or you could also think of this as an aggregation problem
aggregate(Patient.ID.Visit~Medical.Record.Number, dd, function(x) length(unique(x)))
# Medical.Record.Number Patient.ID.Visit
# 1 AAABBB 2
# 2 AAAXXX 3
# 3 AAAZZZ 1
There are many ways to do this; @MrFlick provided a handful of perfectly valid approaches. Personally, I'm fond of the data.table package: it's faster on large data frames, and I find the logic more intuitive than the base functions. I'd check it out if you are having problems with execution time.
library(data.table)
med.dt <- data.table(med_tbl)  # med_tbl is the original data frame
num.visits.dt <- med.dt[, .(num_visits = length(unique(Patient.ID.Visit))),
                        by = Medical.Record.Number]
data.table should be much faster than data.frame on large tables.
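data.table also provides the uniqueN() helper, which expresses the same count more compactly:
num.visits.dt <- med.dt[, .(num_visits = uniqueN(Patient.ID.Visit)),
                        by = Medical.Record.Number]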
