merge multiple columns with condition - r

I have a data frame like this:
Q17a_17 Q17a_18 Q17a_19 Q17a_20 Q17a_21 Q17a_22 Q17a_23
1 NA NA NA NA NA NA NA
2 0 0 0 0 0 0 1
3 0 0 0 0 0 1 1
4 0 0 0 0 0 0 1
5 1 0 0 0 1 1 0
6 0 0 0 0 0 1 1
7 1 1 0 0 1 0 1
And I would like to merge Q17a_17, Q17a_19 and Q17a_23 into a new column with a new name. The "old" columns Q17a_17, Q17a_19 and Q17a_23 should be deleted.
The new column should hold a single value per row, with the following conditions: "NA" if there was an "NA" before, "1" if any of the columns had a "1" before (like in row 3 or 4 or 7) and "0" if there were only zeros before.
Maybe this is really simple, but I have been struggling with it for hours...

The approach I use here is to first compute a vector which is NA when an NA value occurs in at least one of the three columns, and zero otherwise. We also compute a vector containing the numeric result you want, obtained by logically ORing the three columns together. Adding these two vectors together then produces the desired result, because NA plus any number is NA.
na.vector <- df$Q17a_17 * df$Q17a_19 * df$Q17a_23
na.vector[!is.na(na.vector)] <- 0
num.vector <- as.numeric(df$Q17a_17 | df$Q17a_19 | df$Q17a_23)
df$new_column <- na.vector + num.vector
# Drop the three source columns (a logical index is safe even if no name matches)
df <- df[ , !(names(df) %in% c("Q17a_17", "Q17a_19", "Q17a_23"))]
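For comparison, here is a minimal self-contained sketch of the same idea using rowSums on the example data from the question. It assumes the columns hold only 0, 1 or NA; the names cols, s and new_column are illustrative, not something the question prescribes.

```r
# Example data from the question (row 1 is all NA)
df <- data.frame(
  Q17a_17 = c(NA, 0, 0, 0, 1, 0, 1),
  Q17a_18 = c(NA, 0, 0, 0, 0, 0, 1),
  Q17a_19 = c(NA, 0, 0, 0, 0, 0, 0),
  Q17a_20 = c(NA, 0, 0, 0, 0, 0, 0),
  Q17a_21 = c(NA, 0, 0, 0, 1, 0, 1),
  Q17a_22 = c(NA, 0, 1, 0, 1, 1, 0),
  Q17a_23 = c(NA, 1, 1, 1, 0, 1, 1)
)

# Columns to merge (illustrative helper vector)
cols <- c("Q17a_17", "Q17a_19", "Q17a_23")

# rowSums() without na.rm = TRUE yields NA for any row that contains an NA
s <- rowSums(df[cols])

# 1 if at least one of the columns was 1, 0 if all were 0, NA stays NA
df$new_column <- as.numeric(s > 0)

# Drop the merged source columns
df <- df[, !(names(df) %in% cols)]
```

Like the approach above, this returns NA as soon as any of the three columns is NA in a row.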

Related

R - In new dataframe: if cell matches another column of same row, then

d <- data.frame(B1 = c(1,2,3,4),B2 = c(0,1,2,3))
d$total=rowSums(d)
B1 B2 total
1 0 1
2 1 3
3 2 5
4 3 7
Using the dataframe above, I want to create a new dataframe with the following logic:
Going by rows, if cells (B1:B2) matches d$total, return 1, else 0.
Ideally output to look like:
B1n B2n
1 0
0 0
0 0
0 0
What is the best way to do this in R?
Thank you.
You can compare the first two columns with the total value.
res <- +(d[1:2] == d$total)
res
# B1 B2
#[1,] 1 0
#[2,] 0 0
#[3,] 0 0
#[4,] 0 0
The result is a matrix; if you want a data frame as output, you can do res <- data.frame(res).
Here is an alternative way to solve this problem. You can use dplyr::transmute, which works like dplyr::mutate but keeps only the newly created columns, so it gives you just the two new columns. The arguments to transmute are simply the conditions.
library(dplyr)
newdf <- d %>% transmute(B1n=ifelse(B1+B2==B1,1,0),B2n=ifelse(B1+B2==B2,1,0))
> newdf
B1n B2n
1 1 0
2 0 0
3 0 0
4 0 0

Compute combination of a pair variables for a given operation in R

From a given dataframe:
# Create dataframe with 4 variables and 10 obs
set.seed(1)
df<-data.frame(replicate(4,sample(0:1,10,rep=TRUE)))
I would like to compute a subtraction between all pairwise combinations of columns, keeping only one subtraction per pair, i.e. column A - column B but not column B - column A, and so on.
What I have is very manual, and it does not scale well when there are many variables.
# Result
df_result <- as.data.frame(list(df$X1-df$X2,
df$X1-df$X3,
df$X1-df$X4,
df$X2-df$X3,
df$X2-df$X4,
df$X3-df$X4))
Also, each column name should describe the operation, e.g. x1_x2 for x1 - x2.
You can use combn:
COMBI = combn(colnames(df),2)
res = data.frame(apply(COMBI,2,function(i)df[,i[1]]-df[,i[2]]))
colnames(res) = apply(COMBI,2,paste0,collapse="minus")
head(res)
X1minusX2 X1minusX3 X1minusX4 X2minusX3 X2minusX4 X3minusX4
1 0 0 -1 0 -1 -1
2 1 1 0 0 -1 -1
3 0 0 0 0 0 0
4 0 0 -1 0 -1 -1
5 1 1 1 0 0 0
6 -1 0 0 1 1 0
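If you want the underscore-style names asked for in the question, a small variation of the same combn idea could look like the sketch below. The variable name pairs is illustrative, and the names keep the capital X that data.frame assigns by default.

```r
set.seed(1)
df <- data.frame(replicate(4, sample(0:1, 10, replace = TRUE)))

# All unordered pairs of column names, one pair per column of the matrix
pairs <- combn(names(df), 2)

# First column of each pair minus the second one
res <- as.data.frame(apply(pairs, 2, function(p) df[[p[1]]] - df[[p[2]]]))

# Name each result after its pair, e.g. "X1_X2" for X1 - X2
names(res) <- apply(pairs, 2, paste, collapse = "_")
```

Because combn only produces each pair once, X2 - X1 is never computed alongside X1 - X2.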

Sample random column in dataframe

I have the following object, model$data:
model$data
[[1]]
Category1 Category2 Category3 Category4
3555 1 0 0 0
6447 1 0 0 0
5523 1 0 1 0
7550 1 0 1 0
6330 1 0 1 0
2451 1 0 0 0
4308 1 0 1 0
8917 0 0 0 0
4780 1 0 1 0
6802 1 0 1 0
2021 1 0 0 0
5792 1 0 1 0
5475 1 0 1 0
4198 1 0 0 0
223 1 0 1 0
4811 1 0 1 0
678 1 0 1 0
I am trying to use this formula to get an index of the column names:
sample(colnames(model$data), 1)
But I receive the following error message:
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
Is there a way to avoid that error?
Notice this?
model$data
[[1]]
The [[1]] means that model$data is a list, whose first component is a data frame. To do anything with it, you need to pass model$data[[1]] to your code, not model$data.
sample(colnames(model$data[[1]]), 1)
This seems to be a near-duplicate of Random rows in dataframes in R and should probably be closed as duplicate. But for completeness, adapting that answer to sampling column-indices is trivial:
You don't need to generate a vector of column names, only their indices. Keep it simple.
Sample your column indices from 1:ncol(df) instead of 1:nrow(df).
Then put those column indices on the right-hand side of the comma in df[, ...]:
df[, sample(ncol(df), 1)]
The 1 is because you apparently want to take a sample of size 1.
One minor complication is that your data frame is model$data[[1]], since your model$data looks like a list with one element which is a data frame, rather than a plain data frame. So first, assign df <- model$data[[1]].
Finally, if you really want the sampled column name(s) as well as their indices:
samp_col_idxs <- sample(ncol(df), 1)
samp_col_names <- colnames(df) [samp_col_idxs]
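Putting these steps together, a minimal sketch might look like this. The model object here is a hypothetical stand-in built only to mimic the printed structure in the question (a list whose data element is a list holding one data frame); the real object will differ.

```r
# Hypothetical stand-in for the asker's object
model <- list(data = list(data.frame(
  Category1 = c(1, 1, 0), Category2 = c(0, 0, 0),
  Category3 = c(0, 1, 0), Category4 = c(0, 0, 0)
)))

# A plain list has no column names, so there is nothing to sample from
colnames(model$data)

# Unwrap the data frame first
df <- model$data[[1]]

# Sample one column index, then look up its name and the column itself
idx <- sample(ncol(df), 1)
col_name <- colnames(df)[idx]
sampled <- df[, idx, drop = FALSE]  # drop = FALSE keeps a one-column data frame
```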

Extract columns from df by subset of column id characters

I am working on a gene expression dataset with hundreds of samples. Each sample in the data frame has a unique column ID (example: OHC_112 or IHC_123). I want to make a new data frame containing only the columns whose ID contains "OHC". How can I do this?
I am struggling to make a workable example data frame, but this is the best I was able to do.
Data frame "DF"
OHC_1 OHC_2 OHC_3 IHC_4 IHC_5 OHC_6
Gene1 1 1 0 1 1 0
Gene2 0 0 0 1 1 0
Gene3 1 1 1 0 0 1
Gene4 1 1 1 0 0 0
I got close by using the following subset command
newDF <- subset(DF, ,select = OHC_1:OHC_3)
This allows me to subset the dataframe by a range of the columns but does not allow me to choose all the columns containing "OHC" in the header.
Thanks for your help!
Just subset the columns with names that match using grepl?
> DF[, grepl("OHC",names(DF))]
OHC_1 OHC_2 OHC_3 OHC_6
1 1 1 0 0
2 0 0 0 0
3 1 1 1 1
4 1 1 1 0
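Since grep returns the integer positions of the matching names, you can also index with it directly, or use negative indexing to drop the matching columns instead. For example, this removes OHC_1 to OHC_3 (note that the character class must be [1-3]; [1:3] would match the characters 1, : and 3):
df.2 <- DF[, -grep("^OHC_[1-3]$", names(DF))]
You could add further numbers or more complex patterns. Be aware that if nothing matches, the negative index is empty and no columns are returned.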
We can use select with matches from tidyverse
library(tidyverse)
DF %>%
select(matches("^OHC"))
# OHC_1 OHC_2 OHC_3 OHC_6
#Gene1 1 1 0 0
#Gene2 0 0 0 0
#Gene3 1 1 1 1
#Gene4 1 1 1 0
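When the IDs always begin with the tag, as in the example, base R's startsWith avoids regular expressions altogether. The sketch below rebuilds the example data frame from the question; the assumption that "OHC" is always a prefix comes from the sample IDs, not from the full dataset.

```r
# Example data from the question, with genes as row names
DF <- data.frame(
  OHC_1 = c(1, 0, 1, 1), OHC_2 = c(1, 0, 1, 1), OHC_3 = c(0, 0, 1, 1),
  IHC_4 = c(1, 1, 0, 0), IHC_5 = c(1, 1, 0, 0), OHC_6 = c(0, 0, 1, 0),
  row.names = c("Gene1", "Gene2", "Gene3", "Gene4")
)

# Keep only the columns whose name starts with "OHC";
# drop = FALSE keeps a data frame even if only one column matches
newDF <- DF[, startsWith(names(DF), "OHC"), drop = FALSE]
```

If "OHC" can appear anywhere in the ID rather than only at the start, grepl("OHC", names(DF)) is the safer choice.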

R subsetting data according to non zeros in row

I have a dataset like this:
1 2 3 4 5
1 1 0 0 0 0
2 1 0 0 2 0
3 0 0 0 1 5
4 2 0 0 0 0
I want to subset the rows with more than one non-zero column (that is, rows 2 and 3).
I know that it has to be something like dataset[... dataset ...], but I could not find out how to access rows as such without using a for-loop.
You just need rowSums really. Assuming your dataset is called "mydf", try:
> mydf[rowSums(mydf != 0) > 1, ]
X1 X2 X3 X4 X5
2 1 0 0 2 0
3 0 0 0 1 5
Here, rowSums(mydf != 0) returns a vector of how many values in each row are not zero. Adding our condition > 1 then creates a logical vector which can be used to subset the rows we want.
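To see the intermediate steps, here is a small self-contained sketch built from the example data in the question. The column names X1 to X5 match the printed output above; nonzero_per_row and subset_df are illustrative names.

```r
# Example data from the question
mydf <- data.frame(
  X1 = c(1, 1, 0, 2), X2 = c(0, 0, 0, 0), X3 = c(0, 0, 0, 0),
  X4 = c(0, 2, 1, 0), X5 = c(0, 0, 5, 0)
)

# mydf != 0 is a logical matrix; rowSums counts the TRUEs in each row
nonzero_per_row <- rowSums(mydf != 0)

# Keep the rows with more than one non-zero entry
subset_df <- mydf[nonzero_per_row > 1, ]
```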
