Extract columns from df by subset of column id characters - r

I am working on a gene expression dataset with hundreds of samples. Each sample in the data frame has a unique column ID (example: OHC_112 of IHC_123). I want to make a new dataframe containing only the columns containing the "OHC". How can i do this?
I am struggling to make workable example dataframe... but this is the best i was able to do.
Data frame "DF"
OHC_1 OHC_2 OHC_3 IHC_4 IHC_5 OHC_6
Gene1 1 1 0 1 1 0
Gene2 0 0 0 1 1 0
Gene3 1 1 1 0 0 1
Gene4 1 1 1 0 0 0
I got close by using the following subset command
newDF <- subset(DF, ,select = OHC_1:OHC_3)
This allows me to subset the dataframe by a range of the columns but does not allow me to choose all the columns containing "OHC" in the header.
Thanks for your help!

Just subset the columns with names that match using grepl?
> DF[, grepl("OHC",names(DF))]
OHC_1 OHC_2 OHC_3 OHC_6
1 1 1 0 0
2 0 0 0 0
3 1 1 1 1
4 1 1 1 0

You can make a shorter call that is also more generalizable with negative-grep:
df.2 <- df[, -grep("^OHC_[1:3]$", names(df) )]
Since grep returns numerics you can use the negative vector indexing to remove columns. You could add further number or more complex patterns.

We can use select with matches from tidyverse
library(tidyverse)
DF %>%
select(matches("^OHC"))
# OHC_1 OHC_2 OHC_3 OHC_6
#Gene1 1 1 0 0
#Gene2 0 0 0 0
#Gene3 1 1 1 1
#Gene4 1 1 1 0

Related

R - In new dataframe: if cell matches another column of same row, then

d <- data.frame(B1 = c(1,2,3,4),B2 = c(0,1,2,3))
d$total=rowSums(d)
B1 B2 total
1 0 1
2 1 3
3 2 5
4 3 7
Using the dataframe above, I want to create a new dataframe with the following logic:
Going by rows, if cells (B1:B2) matches d$total, return 1, else 0.
Ideally output to look like:
B1n B2n
1 0
0 0
0 0
0 0
What is the best way to do this in R?
Thank you.
You can compare first 2 columns with total value.
res <- +(d[1:2] == d$total)
res
# B1 B2
#[1,] 1 0
#[2,] 0 0
#[3,] 0 0
#[4,] 0 0
The result is a matrix, if you want dataframe as output you can do res <- data.frame(res).
Here is an alternate way to solve this problem. You can use dplyr::transmute which is the opposite of dplyr::mutate which will give you two separate columns. Inside transmute are just conditions.
library(dplyr)
newdf <- d %>% transmute(B1n=ifelse(B1+B2==B1,1,0),B2n=ifelse(B1+B2==B2,1,0))
> newdf
B1n B2n
1 1 0
2 0 0
3 0 0
4 0 0

How to add multiple values to data.frame without loop?

Suppose I have matrix D which consists of death counts per year by specific ages.
I want to fill this matrix with appropriate death counts that is stored in
vector Age, but the following code gives me wrong answer. How should I write the code without making a loop?
# Year and age grid for tables
Years=c(2007:2017)
Ages=c(60:70)
#Data.frame of deaths
D=data.frame(matrix(ncol=length(Years),nrow=length(Ages))); D[is.na(D)]=0
colnames(D)=Years
rownames(D)=Ages
Age=c(60,61,62,65,65,65,68,69,60)
year=2010
D[as.character(Age),as.character(year)]<-
D[as.character(Age),as.character(year)]+1
D[,'2010'] # 1 1 1 0 0 1 0 0 1 1 0
# Should be 2 1 1 0 0 3 0 0 1 1 0
You need to use table
AgeTable = table(Age)
D[names(AgeTable), as.character(year)] = AgeTable
D[,'2010']
[1] 2 1 1 0 0 3 0 0 1 1 0

merge multiple columns with condition

I have a data frame like this:
Q17a_17 Q17a_18 Q17a_19 Q17a_20 Q17a_21 Q17a_22 Q17a_23
1 NA NA NA NA NA NA NA
2 0 0 0 0 0 0 1
3 0 0 0 0 0 1 1
4 0 0 0 0 0 0 1
5 1 0 0 0 1 1 0
6 0 0 0 0 0 1 1
7 1 1 0 0 1 0 1
And I would like to merge Q17a_17, Q17a_19 and Q17a_23 in a new column with a new name. The "old" columns Q17a_17, Q17a_19 and Q17a_23 should be deleted.
In the new column should be just one value with the following conditions: "NA" if there was "NA" before, "1" if there was somewhere "1" as value before (like in row 3 or 4 or 7) and "0" if there were only zeros before.
Maybe this is really simple, but I struggle already for hours...
The approach I use here is to first compute a vector which is NA when an NA value occurs in at least one of the three columns, and zero otherwise. Also, we compute a vector containing the numerical result you want. What you want can be obtained by logically ORing together the three columns. Then, adding these two computed vectors together produces the desired result.
na.vector <- df$Q17a_17 * df$Q17a_19 * df$Q17a_23
na.vector[!is.na(na.vector)] <- 0
num.vector <- as.numeric(df$Q17a_17 | df$Q17a_19 | df$Q17a_23)
df$new_column <- na.vector + num.vector
df <- df[ , -which(names(df) %in% c("Q17a_17", "Q17a_19", "Q17a_23"))]

Sample random column in dataframe

I have the following code: model$data
model$data
[[1]]
Category1 Category2 Category3 Category4
3555 1 0 0 0
6447 1 0 0 0
5523 1 0 1 0
7550 1 0 1 0
6330 1 0 1 0
2451 1 0 0 0
4308 1 0 1 0
8917 0 0 0 0
4780 1 0 1 0
6802 1 0 1 0
2021 1 0 0 0
5792 1 0 1 0
5475 1 0 1 0
4198 1 0 0 0
223 1 0 1 0
4811 1 0 1 0
678 1 0 1 0
I am trying to use this formula to get an index of the column names:
sample(colnames(model$data), 1)
But I receive the following error message:
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
Is there a way to avoid that error?
Notice this?
model$data
[[1]]
The [[1]] means that model$data is a list, whose first component is a data frame. To do anything with it, you need to pass model$data[[1]] to your code, not model$data.
sample(colnames(model$data[[1]]), 1)
This seems to be a near-duplicate of Random rows in dataframes in R and should probably be closed as duplicate. But for completeness, adapting that answer to sampling column-indices is trivial:
you don't need to generate a vector of column-names, only their indices. Keep it simple.
sample your col-indices from 1:ncol(df) instead of 1:nrow(df)
then put those column-indices on the RHS of the comma in df[, ...]
df[, sample(ncol(df), 1)]
the 1 is because you apparently want to take a sample of size 1.
one minor complication is that your dataframe is model$data[[1]], since your model$data looks like a list with one element which is a dataframe, rather than a plain dataframe. So first, assign df <- model$data[[1]]
finally, if you really really want the sampled column-name(s) as well as their indices:
samp_col_idxs <- sample(ncol(df), 1)
samp_col_names <- colnames(df) [samp_col_idxs]

Create virtual categorical column based on existing values

Reference:
Transpose and create categorical values in R
Follow-up to this question. While both model.matrix and data.table work very well with values already in it, how can we use them to simulate a column?
Meaning, from data in the same data frame,
data <- read.table(header=T, text='
subject weight sex test
1 2 M control
2 3 F cond1
3 2 F cond2
4 4 M control
5 3 F control
6 2 F control
')
If I were to simulate the case statement with OR condition from SQL in R, how do I go about it? In SQL I would do:
case when ( sex = 'F' OR sex = 'M') AND CONTROL IS NOT NULL THEN 1 ELSE 0 AS F_M_CONTROL
case when (sex = 'F' OR sex = 'M') AND COND1 IS NOT NULL THEN 1 ELSE 0 AS F_M_COND1
bringing the output to:
subject weight control_F_M control_M condtrol_F cond1_F_M cond1_F cond1_M
1 2 0 1 0 0 0 0
2 3 0 0 1 0 0 0
3 2 0 0 0 0 1 0
4 4 0 1 0 0 0 0
5 3 1 0 0 0 0 0
6 2 1 0 0 0 0 0
Any idea how I can generate the "Control_F_M" and Cond1_F_M columns in R?
Thanks in advance,
Bee
Edit:
To generate the afore mentioned output, i'm using the data table & dcast as suggested before.
I can use If-Else if I knew all the values in the column: test. I apologize for not clarifying this earlier. The challenge ofcourse is that the column is dynamic and so I'm hoping to generate that many columns dynamically as an extension to the below or using a similar approach.
dcast(data, subject+weight~test+sex, fun=length, drop=c(TRUE,FALSE))

Resources