Coding a series of columns with a numerical condition - r

I have a series of numeric columns with values ranging from 0 to 8. I want to create a binary variable: if a row contains a value of 3 or more at least once, it should be coded as "high", otherwise "low".
structure(list(AE_1 = c(0L, 1L, 0L, 0L, 0L, 2L, 0L), AE_2 = c(0L,
1L, 2L, 1L, 0L, 0L, 0L), AE_3 = c(1L, 4L, 1L, 8L, 0L, 8L, 1L),
AE_4 = c(0L, 1L, 1L, 0L, 0L, 0L, 0L), AE_5 = c(0L, 0L, 1L,
1L, 0L, 0L, 1L), AE_6 = c(0L, 5L, 1L, 3L, 0L, 4L, 1L), AE_7 = c(0L,
1L, 1L, 1L, 0L, 2L, 0L), AE_8 = c(0L, 2L, 1L, 2L, 0L, 0L,
0L), new_AE = c("low", "low", "low", "low", "low", "low",
"low")), class = "data.frame", row.names = c(NA, -7L))
I had this code and the outcome is low for all rows.
df<-df%>%
mutate(new_AE= pmap_chr(select(., starts_with('AE')), ~
case_when(any(c(...) <= 2) ~ "low" , any(c(...) >=3) ~ "high")))
while I want something like this :

This can be done easily by checking the max of each row in base R using pmax. Of course, you won't want to write 8 column names into pmax, so use do.call:
df[,9] <- c("low", "high")[ 1 + (do.call(pmax, df[,-9]) >= 3)]
> df
AE_1 AE_2 AE_3 AE_4 AE_5 AE_6 AE_7 AE_8 new_AE
1 0 0 1 0 0 0 0 0 low
2 1 1 4 1 0 5 1 2 high
3 0 2 1 1 1 1 1 1 low
4 0 1 8 0 1 3 1 2 high
5 0 0 0 0 0 0 0 0 low
6 2 0 8 0 0 4 2 0 high
7 0 0 1 0 1 1 0 0 low
Note that the expression inside [] returns TRUE/FALSE according to your desired condition:
# this returns max of each row
do.call(pmax, df[,-9])
[1] 1 5 2 8 0 8 1
# this checks whether max of each row is 3 or more
do.call(pmax, df[,-9]) >= 3
[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE
If you aren't comfortable with this indexing strategy, you may use replace instead:
df$new_AE <- replace(df$new_AE, do.call(pmax, df[,-9]) >= 3, "high")
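Both snippets above address the label column by position (9). A minimal sketch of the same pmax idea that selects the score columns by name instead, assuming they all share the AE_ prefix; the toy data and the names ae_cols and row_max are illustrative only:

```r
# Toy data (made up for illustration): three score columns, no label column yet
df <- data.frame(AE_1 = c(0L, 1L, 2L),
                 AE_2 = c(1L, 4L, 0L),
                 AE_3 = c(0L, 5L, 2L))

# Pick the score columns by name rather than by position
ae_cols <- grep("^AE_", names(df), value = TRUE)

# Element-wise maximum across those columns, i.e. the max of each row
row_max <- do.call(pmax, df[ae_cols])

# FALSE -> index 1 ("low"), TRUE -> index 2 ("high")
df$new_AE <- c("low", "high")[1 + (row_max >= 3)]
df
```

Selecting by name keeps the code working if columns are added or reordered.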

Update
I made a slight modification to my solution, as it appears the new_AE column exists from the beginning and only its values were wrong. Here is another solution in case you would like to use pmap in one go. However, you have already received some fabulous solutions.
library(dplyr)
library(purrr)
df %>%
# pmap_chr returns a character vector; plain pmap would create a list column
mutate(new_AE = pmap_chr(df %>%
select(-9), ~ ifelse(any(c(...) >= 3), "high", "low")))
AE_1 AE_2 AE_3 AE_4 AE_5 AE_6 AE_7 AE_8 new_AE
1 0 0 1 0 0 0 0 0 low
2 1 1 4 1 0 5 1 2 high
3 0 2 1 1 1 1 1 1 low
4 0 1 8 0 1 3 1 2 high
5 0 0 0 0 0 0 0 0 low
6 2 0 8 0 0 4 2 0 high
7 0 0 1 0 1 1 0 0 low

The issue is that the first condition in case_when, any(c(...) <= 2), is TRUE for every row, so we only ever get the 'low' value. Here, we don't even need case_when, as there are only two categories: convert the logical result to a numeric index and use it to pick from a vector of labels.
library(dplyr)
df %>%
rowwise %>%
mutate(new_AE = c('low', 'high')[1+ any(c_across(where(is.numeric)) >=3)]) %>%
ungroup
-output
# A tibble: 7 x 9
# AE_1 AE_2 AE_3 AE_4 AE_5 AE_6 AE_7 AE_8 new_AE
# <int> <int> <int> <int> <int> <int> <int> <int> <chr>
#1 0 0 1 0 0 0 0 0 low
#2 1 1 4 1 0 5 1 2 high
#3 0 2 1 1 1 1 1 1 low
#4 0 1 8 0 1 3 1 2 high
#5 0 0 0 0 0 0 0 0 low
#6 2 0 8 0 0 4 2 0 high
#7 0 0 1 0 1 1 0 0 low
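Or this may be done more easily with rowSums from base R. The comparison is restricted to the AE_ columns so that the existing character column new_AE is not compared against 3.
df$new_AE <- c("low", "high")[(rowSums(df[startsWith(names(df), "AE_")] >= 3) > 0) + 1]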
df$new_AE
#[1] "low" "high" "low" "high" "low" "high" "low"
When applying case_when, we have to consider the order of the logical statements, or make sure the later expressions account for the earlier ones. If we test the second row of the OP's data:
v1 <- c(1, 1, 4, 1, 0, 5, 1)
any(v1 <= 2)
#[1] TRUE
which is the first expression in case_when. Because case_when uses the first condition that is TRUE, the subsequent expressions have no effect on the result
case_when(any(v1 <=2) ~ 'low', any(v1 >=3) ~ 'high')
#[1] "low"
By reversing the order, we get "high"
case_when( any(v1 >=3) ~ 'high', any(v1 <=2) ~ 'low')
#[1] "high"
So decide which condition has higher priority and order the expressions accordingly.
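The ordering point can also be sketched without case_when. The following base R version, on made-up toy data, checks the higher-priority condition first and treats everything else as "low"; label_row is a hypothetical helper name, not part of any package:

```r
# Toy data (illustrative only)
df <- data.frame(AE_1 = c(0L, 1L, 0L),
                 AE_2 = c(0L, 4L, 2L),
                 AE_3 = c(1L, 5L, 1L))

# Hypothetical helper: the "high" test comes first, so a single value >= 3
# decides the label even though other values in the row are <= 2
label_row <- function(v) {
  if (any(v >= 3)) "high" else "low"
}

# Apply the helper to each row of the score columns
df$new_AE <- apply(df[c("AE_1", "AE_2", "AE_3")], 1, label_row)
df
```

With only two categories, the fallback branch makes the second condition unnecessary, which removes the ordering problem altogether.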

Related

Select columns on match with vector and create ifelse condition with their content

I have a dataset covering several diseases, with 0 indicating not having the disease and 1 having the disease.
To illustrate with an example: I am interested in Disease A and whether the people in the dataset have this disease on its own or as a result of another disease. Therefore I want to create a new variable "Type" with the values "NotDiseasedWithA", "Primary" and "Secondary". The diseases that can cause A are contained in a vector "SecondaryCauses":
SecondaryCauses = c("DiseaseB", "DiseaseD")
"NotDiseasedWithA" means that they do not have disease A.
"Primary" means that they have disease A but not any of the known diseases that can cause it.
"Secondary" means that they have disease A and a disease that probably caused it.
Sample data
ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE
1 0 1 0 0 0
2 1 0 0 0 1
3 1 0 1 1 0
4 1 0 1 1 1
5 0 0 0 0 0
My question is:
How do I select the columns I am interested in? I have more than 20 columns that are not ordered. Therefore I created the vector.
How do I create the condition based on the content of the diseases I am interested in?
I tried something like the following, but this did not work:
DF %>% mutate(Type = ifelse(DiseaseA == 0, "NotDiseasedWithA", ifelse(sum(names(DF) %in% SecondaryCauses) > 0, "Secondary", "Primary")))
So in the end I want to have this results:
ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE Type
1 0 1 0 0 0 NotDiseasedWithA
2 1 0 0 0 1 Primary
3 1 0 1 1 0 Secondary
4 1 0 1 1 1 Secondary
5 0 0 0 0 0 NotDiseasedWithA
Using data.table:
df <- structure(list(ID = 1:5, DiseaseA = c(0L, 1L, 1L, 1L, 0L), DiseaseB = c(1L,
0L, 0L, 0L, 0L), DiseaseC = c(0L, 0L, 1L, 1L, 0L), DiseaseD = c(0L,
0L, 1L, 1L, 0L), DiseaseE = c(0L, 1L, 0L, 1L, 0L)), row.names = c(NA,
-5L), class = c("data.frame"))
library(data.table)
setDT(df) # make it a data.table
SecondaryCauses = c("DiseaseB", "DiseaseD")
df[DiseaseA == 0, Type := "NotDiseasedWithA"][DiseaseA == 1, Type := ifelse(rowSums(.SD) > 0, "Secondary", "Primary"), .SDcols = SecondaryCauses]
df
# ID DiseaseA DiseaseB DiseaseC DiseaseD DiseaseE Type
# 1: 1 0 1 0 0 0 NotDiseasedWithA
# 2: 2 1 0 0 0 1 Primary
# 3: 3 1 0 1 1 0 Secondary
# 4: 4 1 0 1 1 1 Secondary
# 5: 5 0 0 0 0 0 NotDiseasedWithA
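For comparison, here is a base R sketch of the same logic. The attempt in the question did not work because sum(names(DF) %in% SecondaryCauses) counts matching column names (always 2 here) rather than looking at the values in those columns. The sketch below instead uses rowSums over the columns named in SecondaryCauses; has_cause is an illustrative name:

```r
# Sample data from the question
df <- data.frame(ID = 1:5,
                 DiseaseA = c(0L, 1L, 1L, 1L, 0L),
                 DiseaseB = c(1L, 0L, 0L, 0L, 0L),
                 DiseaseC = c(0L, 0L, 1L, 1L, 0L),
                 DiseaseD = c(0L, 0L, 1L, 1L, 0L),
                 DiseaseE = c(0L, 1L, 0L, 1L, 0L))
SecondaryCauses <- c("DiseaseB", "DiseaseD")

# TRUE where a row has at least one of the possible causes
has_cause <- rowSums(df[SecondaryCauses]) > 0

# Nested ifelse: first rule out rows without disease A, then split the rest
df$Type <- ifelse(df$DiseaseA == 0, "NotDiseasedWithA",
                  ifelse(has_cause, "Secondary", "Primary"))
df
```

Because the columns are selected through the character vector, the column order in the data does not matter.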

comparing a value with next value in a column

I have a data set of the following form:
Interval | Count | criteria
0 0 0
0 1 0
0 2 0
0 3 0
1 4 1
1 5 2
1 6 3
1 7 4
2 8 1
2 9 2
3 10 3
I need to compare the values in the Interval. I first need to create a new variable to store the values. If the value in the Interval is the same as the next value, then the new variable should have blanks. If the Interval value is not the same as the next value, then it should return criteria/Count. Output should be like this:
Interval | Count | criteria | N
0 0 0
0 1 0
0 2 0
0 3 0 0
1 4 1
1 5 2
1 6 3
1 7 4 0.5714
2 8 1
2 9 2 0.2222
3 10 3
Here is my code:
fid$N<-''
for (i in 1:length(fid$Interval))
{
if (fid$Interval[i] != fid$Interval[i+1])
fid$N<-fid$criteria/fid$Count
else
fid$N<-''
}
and this is the error I am getting.
Error in if (fid$Interval[i] != fid$Interval[i + 1]) fid$N <- fid$criteria/fid$Count else fid$N <- "" :
missing value where TRUE/FALSE needed
To add to that, there are no missing values in the data set.
I would appreciate it if anyone could help.
You don't necessarily need a loop for this since most of the R functions are vectorized. Here is a way to do this in base R, dplyr and data.table without using a loop.
#Base R
transform(df, N = ifelse(Interval != c(tail(Interval, -1), NA), criteria/Count, NA))
#dplyr
library(dplyr)
df %>% mutate(N = if_else(Interval != lead(Interval), criteria/Count, NA_real_))
#data.table
library(data.table)
setDT(df)[, N:= fifelse(Interval != shift(Interval, type = 'lead'), criteria/Count, NA_real_)]
All of which return :
# Interval Count criteria N
#1 0 0 0 NA
#2 0 1 0 NA
#3 0 2 0 NA
#4 0 3 0 0.0000000
#5 1 4 1 NA
#6 1 5 2 NA
#7 1 6 3 NA
#8 1 7 4 0.5714286
#9 2 8 1 NA
#10 2 9 2 0.2222222
#11 3 10 3 NA
I return NA instead of blank value because if we return blank value the entire column becomes of type character and the numbers are no longer useful. In the answer you can replace NA with '' to get a blank value.
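For completeness, the original loop fails for two reasons. At the last row, fid$Interval[i + 1] is past the end of the vector and evaluates to NA, so if() has no TRUE/FALSE to test. In addition, each assignment overwrites the whole column N rather than element i. A corrected loop on made-up toy data might look like the sketch below, although the vectorized versions above remain preferable:

```r
# Toy data (illustrative only)
fid <- data.frame(Interval = c(0L, 0L, 1L, 1L, 2L),
                  Count    = c(1L, 2L, 3L, 4L, 5L),
                  criteria = c(0L, 1L, 1L, 2L, 1L))

n <- nrow(fid)
fid$N <- NA_real_                      # numeric column, NA meaning "blank"

# Stop at n - 1 so that Interval[i + 1] always exists
for (i in seq_len(n - 1)) {
  if (fid$Interval[i] != fid$Interval[i + 1]) {
    # Assign a single element, not the whole column
    fid$N[i] <- fid$criteria[i] / fid$Count[i]
  }
}
fid
```

The last row stays NA because it has no following row to compare against, matching the vectorized answers.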
data
df <- structure(list(Interval = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 2L,
2L, 3L), Count = 0:10, criteria = c(0L, 0L, 0L, 0L, 1L, 2L, 3L,
4L, 1L, 2L, 3L)), class = "data.frame", row.names = c(NA, -11L))

Replace multiple columns by head string into one column

I want to replace multiple columns of a data frame with a single column per group, and I also want to change the numbers. Example:
A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0
I want to collapse this data frame by its headers, meaning I want only one column "A" instead of the 4 here and only one column "B" instead of the 3 here. The numbers should change with the following pattern: if you are in group "A2" and the observation has the number "1", it should be changed into a "2". If you are in group "A3" and the observation has the number "1", it should be changed into a "3". The end result should contain the highest such number for each row and group (if I have three "1"s in a row within a group, the number that replaces all of them is the one from the highest-numbered column).
If the number is 0 then nothing changes. Here is the result I'm looking for:
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
How can I replace all of these groups by a single column each? (one column for each group)
So far I've tried a lot with the function unite(data= testdata, col= "A") for example, but doing this manually would take too long. There has to be a better way, right?
Thanks in advance!
You can do:
dat <- read.table(header=TRUE, text=
"A1 A2 A3 A4 B1 B2 B3
1 1 1 0 1 1 0 0
2 1 0 1 1 0 1 1
3 1 1 1 1 0 1 1
4 0 0 1 0 0 0 1
5 0 0 0 0 0 1 0")
myfu <- function(x) if (any(x)) max(which(x)) else 0
new <- data.frame(
A=apply(dat[, 1:4]==1, 1, myfu),
B=apply(dat[, 5:7]==1, 1, myfu))
new
A more general solution:
new2 <- data.frame(
A=apply(dat[, grepl("^A", names(dat))]==1, 1, myfu),
B=apply(dat[, grepl("^B", names(dat))]==1, 1, myfu))
new2
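If the set of groups is not known in advance, the same idea can be looped over all prefixes. This is only a sketch: it assumes column names are a letter prefix followed by a number, and that the columns within each group are ordered and numbered from 1, because myfu returns a position rather than the number in the column name. The names prefixes and new3 are illustrative:

```r
# Data from the question
dat <- data.frame(A1 = c(1, 1, 1, 0, 0), A2 = c(1, 0, 1, 0, 0),
                  A3 = c(0, 1, 1, 1, 0), A4 = c(1, 1, 1, 0, 0),
                  B1 = c(1, 0, 0, 0, 0), B2 = c(0, 1, 1, 0, 1),
                  B3 = c(0, 1, 1, 1, 0))

# Position of the last TRUE in a logical vector, or 0 if there is none
myfu <- function(x) if (any(x)) max(which(x)) else 0

# Derive the group names by stripping trailing digits from the column names
prefixes <- unique(sub("\\d+$", "", names(dat)))

# One result column per prefix
new3 <- sapply(prefixes, function(p) {
  cols <- grepl(paste0("^", p, "\\d+$"), names(dat))
  apply(dat[, cols, drop = FALSE] == 1, 1, myfu)
})
new3
```

The result is a matrix with one column per prefix; wrap it in as.data.frame() if a data frame is needed.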
You can try the code like below
dfout <- as.data.frame(
lapply(
split.default(df, gsub("\\d+$", "", names(df))),
function(v) max.col(v, ties.method = "last") * +(rowSums(v) >= 1)
)
)
such that
> dfout
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2
Data
df <- structure(list(A1 = c(1L, 1L, 1L, 0L, 0L), A2 = c(1L, 0L, 1L,
0L, 0L), A3 = c(0L, 1L, 1L, 1L, 0L), A4 = c(1L, 1L, 1L, 0L, 0L
), B1 = c(1L, 0L, 0L, 0L, 0L), B2 = c(0L, 1L, 1L, 0L, 1L), B3 = c(0L,
1L, 1L, 1L, 0L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5"))
Assuming your data is in a data.frame called df1, this works in base R:
df1 <- t(df1)*as.numeric(regmatches(colnames(df1), regexpr("\\d+$", colnames(df1))))
df1 <- split(as.data.frame(df1),sub("\\d+$","",row.names(df1)))
df1 <- sapply(df1, apply, 2, max)
output:
> df1
A B
1 4 1
2 4 3
3 4 3
4 3 3
5 0 2

Concatenate two rows with the same ID into 0,1,2 from presence/absence

I am trying to recode an original table with SNP IDs in rows and sample IDs in columns.
So far, I have only managed to convert the data into presence/absence with 0 and 1.
I tried some simple code to do the further conversion but cannot find anything that does what I want.
The original table looks like this
snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
A_001 0 1 1 1 0 0 1 0
A_001 0 0 1 0 1 0 1 1
A_002 1 1 0 1 1 1 0 0
A_002 0 1 1 0 1 0 1 1
A_003 1 0 0 1 0 1 1 0
A_003 1 1 0 1 1 0 0 1
A_004 0 0 1 0 0 1 0 0
A_004 1 0 0 1 0 1 1 0
I would like to recode the scores as 0/0 = NA, 0/1 = 0, 1/1 = 2, 1/0 = 1 so the result looks something like this.
snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
A_001 NA 1 2 1 0 NA 2 0
A_002 1 2 0 1 2 1 0 0
A_003 2 0 NA 2 0 1 1 0
A_004 0 NA 1 0 NA 2 0 NA
This is just an example. My total snpID is ~96000 and total sample ID column is ~500.
Any helps with writing this code would be really appreciated.
Here are a few dplyr-based examples that each work in a single pipe and get the same output. The main first step is to group by your ID, then collapse all the columns with a /. Then you can use mutate_at to select all columns that start with Cal_—this may be useful if you have other columns besides the ID that you don't want to do this operation on.
First method is a case_when:
library(dplyr)
dat %>%
group_by(snpID) %>%
summarise_all(paste, collapse = "/") %>%
mutate_at(vars(starts_with("Cal_")), ~case_when(
. == "0/1" ~ 0,
. == "1/1" ~ 2,
. == "1/0" ~ 1,
TRUE ~ NA_real_
))
#> # A tibble: 4 x 9
#> snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A_001 NA 1 2 1 0 NA 2 0
#> 2 A_002 1 2 0 1 2 1 0 0
#> 3 A_003 2 0 NA 2 0 1 1 0
#> 4 A_004 0 NA 1 0 NA 2 0 NA
However, (in my opinion) case_when is a little tricky to read, and this doesn't showcase its real power, which is doing if/else checks on multiple variables. Better suited to checks on one variable at a time is dplyr::recode:
dat %>%
group_by(snpID) %>%
summarise_all(paste, collapse = "/") %>%
mutate_at(vars(starts_with("Cal_")),
~recode(.,
"0/1" = 0,
"1/1" = 2,
"1/0" = 1,
"0/0" = NA_real_))
# same output as above
Or, for more flexibility and readability, create a small lookup object. That way, you can reuse the recode logic and change it easily. recode takes a set of named arguments; using tidyeval, you can pass in a named vector and unquote-splice it with !!! (there's a similar example in the recode docs):
lookup <- c("0/1" = 0, "1/1" = 2, "1/0" = 1, "0/0" = NA_real_)
dat %>%
group_by(snpID) %>%
summarise_all(paste, collapse = "/") %>%
mutate_at(vars(starts_with("Cal_")), recode, !!!lookup)
# same output
You might use aggregate to concatenate the values for each snpID and then replace the values according to your needs with the help of case_when from dplyr.
(out <- aggregate(.~ snpID, dat, toString))
# snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
#1 A_001 0, 0 1, 0 1, 1 1, 0 0, 1 0, 0 1, 1 0, 1
#2 A_002 1, 0 1, 1 0, 1 1, 0 1, 1 1, 0 0, 1 0, 1
#3 A_003 1, 1 0, 1 0, 0 1, 1 0, 1 1, 0 1, 0 0, 1
#4 A_004 0, 1 0, 0 1, 0 0, 1 0, 0 1, 1 0, 1 0, 0
Now recode the columns
library(dplyr)
# recode column by column so case_when always receives plain vectors
out[-1] <- lapply(out[-1], function(x) case_when(
x == "0, 0" ~ NA_integer_,
x == "0, 1" ~ 0L,
x == "1, 0" ~ 1L,
TRUE ~ 2L))
Result
out
# snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
#1 A_001 NA 1 2 1 0 NA 2 0
#2 A_002 1 2 0 1 2 1 0 0
#3 A_003 2 0 NA 2 0 1 1 0
#4 A_004 0 NA 1 0 NA 2 0 NA
data
dat <- structure(list(snpID = c("A_001", "A_001", "A_002", "A_002",
"A_003", "A_003", "A_004", "A_004"), Cal_X1 = c(0L, 0L, 1L, 0L,
1L, 1L, 0L, 1L), Cal_X2 = c(1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
Cal_X3 = c(1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L), Cal_X4 = c(1L,
0L, 1L, 0L, 1L, 1L, 0L, 1L), Cal_X5 = c(0L, 1L, 1L, 1L, 0L,
1L, 0L, 0L), Cal_X6 = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L),
Cal_X7 = c(1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L), Cal_X8 = c(0L,
1L, 0L, 1L, 0L, 1L, 0L, 0L)), .Names = c("snpID", "Cal_X1",
"Cal_X2", "Cal_X3", "Cal_X4", "Cal_X5", "Cal_X6", "Cal_X7", "Cal_X8"
), class = "data.frame", row.names = c(NA, -8L))
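With ~96000 SNPs and ~500 samples, pasting strings group by group may be slow. A base R sketch that works on whole matrices at once is shown below. It assumes every snpID occupies exactly two consecutive rows (as in the example); that assumption is not checked by the code. The toy data and the names first, second, lookup and coded are illustrative:

```r
# Toy data (illustrative only): two SNPs, two rows each
dat <- data.frame(snpID  = c("A_001", "A_001", "A_002", "A_002"),
                  Cal_X1 = c(0L, 0L, 1L, 0L),
                  Cal_X2 = c(1L, 0L, 1L, 1L))

# ASSUMPTION: rows come in consecutive pairs per snpID
odd  <- seq(1, nrow(dat), by = 2)
even <- seq(2, nrow(dat), by = 2)
first  <- as.matrix(dat[odd,  -1])   # first row of each pair
second <- as.matrix(dat[even, -1])   # second row of each pair

# Named lookup vector implementing 0/0 = NA, 0/1 = 0, 1/0 = 1, 1/1 = 2
lookup <- c("0/0" = NA, "0/1" = 0, "1/0" = 1, "1/1" = 2)

coded <- first                        # reuse dimensions and column names
coded[] <- lookup[paste(first, second, sep = "/")]

out <- data.frame(snpID = dat$snpID[odd], coded, row.names = NULL)
out
```

If the pairing assumption does not hold for the real data, sorting by snpID first (and verifying that each ID occurs twice) would be needed before using this approach.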

Segregating dataset and name each new dataset as per unique column names

I have a dataset(nm) as shown below:
nm
2_V2O 10_Kutti 14_DD 15_TT 16_DD 19_V2O 20_Kutti
0 1 1 0 0 1 0
1 1 1 1 1 0 0
0 1 0 1 0 0 1
0 1 1 0 1 0 0
Now I want to have multiple new datasets which got segregated as per their unique column names. All dataset names also must be created as per their column names as shown below:
Kutti
10_Kutti 20_Kutti
1 0
1 0
1 1
1 0
V2O
2_V2O 19_V2O
0 1
1 0
0 0
0 0
DD
14_DD 16_DD
1 0
1 1
0 0
1 1
TT
15_TT
0
1
1
0
I know this can be done using "select" function in dplyr but I need one dynamic programme which builds this automatically for any dataset.
We can split by a substring of the column names of 'nm'. Remove the prefix of each column name up to the _ with sub and use the result to split 'nm'.
lst <- split.default(nm, sub(".*_", "", names(nm)))
lst
#$DD
# 14_DD 16_DD
#1 1 0
#2 1 1
#3 0 0
#4 1 1
#$Kutti
# 10_Kutti 20_Kutti
#1 1 0
#2 1 0
#3 1 1
#4 1 0
#$TT
# 15_TT
#1 0
#2 1
#3 1
#4 0
#$V2O
# 2_V2O 19_V2O
#1 0 1
#2 1 0
#3 0 0
#4 0 0
It is better to keep the data.frames in a list. If we insist that it should be individual data.frame objects in the global environment (not recommended), use list2env
list2env(lst, envir = .GlobalEnv)
Now, just call
DD
data
nm <- structure(list(`2_V2O` = c(0L, 1L, 0L, 0L), `10_Kutti` = c(1L,
1L, 1L, 1L), `14_DD` = c(1L, 1L, 0L, 1L), `15_TT` = c(0L, 1L,
1L, 0L), `16_DD` = c(0L, 1L, 0L, 1L), `19_V2O` = c(1L, 0L, 0L,
0L), `20_Kutti` = c(0L, 0L, 1L, 0L)), .Names = c("2_V2O", "10_Kutti",
"14_DD", "15_TT", "16_DD", "19_V2O", "20_Kutti"), class = "data.frame",
row.names = c(NA, -4L))
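As a small illustration of keeping the pieces in a list instead of creating global objects, the sketch below splits a made-up three-column data frame and then accesses one group by name. The toy data and the variable groups are illustrative only:

```r
# Toy data (illustrative only); check.names = FALSE keeps names like "2_V2O"
nm <- data.frame(`2_V2O` = c(0L, 1L), `10_Kutti` = c(1L, 1L),
                 `19_V2O` = c(1L, 0L), check.names = FALSE)

# Group label = everything after the last underscore in each column name
groups <- sub(".*_", "", names(nm))

# split.default treats the data frame as a list of columns,
# so the result is one data frame per group label
lst <- split.default(nm, groups)

names(lst)       # the group labels
lst[["V2O"]]     # access one group by name, no global objects needed

# Work on every group at once, e.g. count the columns in each
sapply(lst, ncol)
```

Keeping everything in lst makes it easy to loop over the groups with lapply or sapply, which is harder once the pieces are separate objects in the global environment.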

Resources