I have a dataframe like this :
G2_ref G10_ref G12_ref G2_alt G10_alt G12_alt
20011953 3 6 0 5 1 5
12677336 0 0 0 1 3 6
20076754 0 3 0 12 16 8
2089670 0 4 0 1 11 9
9456633 0 2 0 3 10 0
468487 0 0 0 0 0 0
And I'm trying to sort the columns to have finally this column order :
G2_ref G2_alt G10_ref G10_alt G12_ref G12_alt
I tried : df[,order(colnames(df))]
But I had this order :
G10_alt G10_ref G12_alt G12_ref G2_alt G2_ref
If anyone has any ideas, that would be great.
One option would be to extract the numeric part and also the substring at the end, and then use both in order:
df[order(as.numeric(gsub("\\D+", "", names(df))),
factor(sub(".*_", "", names(df)), levels = c('ref', 'alt')))]
# G2_ref G2_alt G10_ref G10_alt G12_ref G12_alt
#20011953 3 5 6 1 0 5
#12677336 0 1 0 3 0 6
#20076754 0 12 3 16 0 8
#2089670 0 1 4 11 0 9
#9456633 0 3 2 10 0 0
#468487 0 0 0 0 0 0
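To see how the two sort keys line up, the intermediate pieces can be inspected separately (here `nm` stands in for `names(df)`):

```r
nm <- c("G2_ref", "G10_ref", "G12_ref", "G2_alt", "G10_alt", "G12_alt")

# numeric part of each name: strip every non-digit character
num <- as.numeric(gsub("\\D+", "", nm))
num
# [1]  2 10 12  2 10 12

# suffix after the underscore, as a factor so 'ref' sorts before 'alt'
suf <- factor(sub(".*_", "", nm), levels = c("ref", "alt"))
suf
# [1] ref ref ref alt alt alt
# Levels: ref alt

# order first by number, then by suffix
order(num, suf)
# [1] 1 4 2 5 3 6
```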
data
df <- structure(list(G2_ref = c(3L, 0L, 0L, 0L, 0L, 0L), G10_ref = c(6L,
0L, 3L, 4L, 2L, 0L), G12_ref = c(0L, 0L, 0L, 0L, 0L, 0L), G2_alt = c(5L,
1L, 12L, 1L, 3L, 0L), G10_alt = c(1L, 3L, 16L, 11L, 10L, 0L),
G12_alt = c(5L, 6L, 8L, 9L, 0L, 0L)), .Names = c("G2_ref",
"G10_ref", "G12_ref", "G2_alt", "G10_alt", "G12_alt"),
class = "data.frame", row.names = c("20011953",
"12677336", "20076754", "2089670", "9456633", "468487"))
I am guessing your data is from genetics and looks pretty standard: first come the columns with ref alleles for all variants, followed by the alt alleles for all variants.
That means we can just build an alternating column index starting from the halfway point of your data frame, i.e. we will create the index c(1, 4, 2, 5, 3, 6) and then subset:
ix <- c(rbind(seq(1, ncol(df1)/2), seq(ncol(df1)/2 + 1, ncol(df1))))
ix
# [1] 1 4 2 5 3 6
df1[, ix]
# G2_ref G2_alt G10_ref G10_alt G12_ref G12_alt
# 20011953 3 5 6 1 0 5
# 12677336 0 1 0 3 0 6
# 20076754 0 12 3 16 0 8
# 2089670 0 1 4 11 0 9
# 9456633 0 3 2 10 0 0
# 468487 0 0 0 0 0 0
# or all in one line
df1[, c(rbind(seq(1, ncol(df1)/2), seq(ncol(df1)/2 + 1, ncol(df1))))]
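The interleaving works because `rbind` stacks the two sequences as rows of a matrix, and `c()` flattens a matrix column by column; a tiny demo:

```r
m <- rbind(1:3, 4:6)
m
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

# column-major flattening alternates between the two rows
c(m)
# [1] 1 4 2 5 3 6
```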
An easy solution using dplyr:
library(dplyr)
df <- df %>%
select(G2_ref, G2_alt, G10_ref, G10_alt, G12_ref, G12_alt)
Perhaps this is less complicated code than @akrun's answer, but it is only really suitable when you want to order a small number of columns.
Related
I have a series of numeric columns with values ranging from 0 to 8. I want to create a binomial variable: if a row reports 3 or more even once, it should be coded as "high", otherwise "low".
structure(list(AE_1 = c(0L, 1L, 0L, 0L, 0L, 2L, 0L), AE_2 = c(0L,
1L, 2L, 1L, 0L, 0L, 0L), AE_3 = c(1L, 4L, 1L, 8L, 0L, 8L, 1L),
AE_4 = c(0L, 1L, 1L, 0L, 0L, 0L, 0L), AE_5 = c(0L, 0L, 1L,
1L, 0L, 0L, 1L), AE_6 = c(0L, 5L, 1L, 3L, 0L, 4L, 1L), AE_7 = c(0L,
1L, 1L, 1L, 0L, 2L, 0L), AE_8 = c(0L, 2L, 1L, 2L, 0L, 0L,
0L), new_AE = c("low", "low", "low", "low", "low", "low",
"low")), class = "data.frame", row.names = c(NA, -7L))
I tried this code, but the outcome is "low" for all rows.
df <- df %>%
  mutate(new_AE = pmap_chr(select(., starts_with('AE')), ~
    case_when(any(c(...) <= 2) ~ "low", any(c(...) >= 3) ~ "high")))
while I want something like this:
This may be done easily in base R by checking the max of each row using pmax. Of course, you won't write out all 8 column names for pmax, so use do.call:
df[,9] <- c("low", "high")[ 1 + (do.call(pmax, df[,-9]) >= 3)]
> df
AE_1 AE_2 AE_3 AE_4 AE_5 AE_6 AE_7 AE_8 new_AE
1 0 0 1 0 0 0 0 0 low
2 1 1 4 1 0 5 1 2 high
3 0 2 1 1 1 1 1 1 low
4 0 1 8 0 1 3 1 2 high
5 0 0 0 0 0 0 0 0 low
6 2 0 8 0 0 4 2 0 high
7 0 0 1 0 1 1 0 0 low
Note that the expression inside [] returns TRUE/FALSE according to your desired condition:
# this returns max of each row
do.call(pmax, df[,-9])
[1] 1 5 2 8 0 8 1
# this checks whether max of each row is 3 or more
do.call(pmax, df[,-9]) >= 3
[1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE
So if you aren't comfortable using this strategy, you may use replace instead
df$new_AE <- replace(df$new_AE, do.call(pmax, df[,-9]) >= 3, "high")
Update
I made a slight modification to my solution, as it appears the new_AE column exists from the beginning and only its values were wrong. Here is also another solution, in case you would like to use pmap in one go. However, you have already received some fabulous solutions.
library(dplyr)
library(purrr)
df %>%
  mutate(new_AE = pmap_chr(df %>% select(-9),
                           ~ ifelse(any(c(...) >= 3), "high", "low")))
AE_1 AE_2 AE_3 AE_4 AE_5 AE_6 AE_7 AE_8 new_AE
1 0 0 1 0 0 0 0 0 low
2 1 1 4 1 0 5 1 2 high
3 0 2 1 1 1 1 1 1 low
4 0 1 8 0 1 3 1 2 high
5 0 0 0 0 0 0 0 0 low
6 2 0 8 0 0 4 2 0 high
7 0 0 1 0 1 1 0 0 low
The issue is that case_when's first condition is TRUE for every row, thus we only get the 'low' values. Here we don't even need case_when, as there are only two categories: the label can be created by converting the logical to a numeric index and subsetting a vector of labels.
library(dplyr)
df %>%
rowwise %>%
mutate(new_AE = c('low', 'high')[1+ any(c_across(where(is.numeric)) >=3)]) %>%
ungroup
-output
# A tibble: 7 x 9
# AE_1 AE_2 AE_3 AE_4 AE_5 AE_6 AE_7 AE_8 new_AE
# <int> <int> <int> <int> <int> <int> <int> <int> <chr>
#1 0 0 1 0 0 0 0 0 low
#2 1 1 4 1 0 5 1 2 high
#3 0 2 1 1 1 1 1 1 low
#4 0 1 8 0 1 3 1 2 high
#5 0 0 0 0 0 0 0 0 low
#6 2 0 8 0 0 4 2 0 high
#7 0 0 1 0 1 1 0 0 low
Or this may be done more easily with rowSums from base R (dropping the character new_AE column before the comparison):
df$new_AE <- c("low", "high")[(!!rowSums(df[-9] >= 3)) + 1]
df$new_AE
#[1] "low" "high" "low" "high" "low" "high" "low"
When applying case_when, we have to consider the order of the logical statements, or make sure to correct for the overlap in the succeeding expressions. If we test the second row of the OP's data:
v1 <- c(1, 1, 4, 1, 0, 5, 1)
any(v1 <= 2)
#[1] TRUE
which is the first expression in case_when. As the first condition already matched, the subsequent expressions are not evaluated:
case_when(any(v1 <=2) ~ 'low', any(v1 >=3) ~ 'high')
#[1] "low"
By reversing the order, we get "high"
case_when( any(v1 >=3) ~ 'high', any(v1 <=2) ~ 'low')
#[1] "high"
So decide which condition has higher priority and order the expressions accordingly.
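Applied to the OP's code, putting the `>= 3` branch first gives the expected labels; this is a sketch that changes only the branch order, keeping the OP's own pmap_chr approach:

```r
library(dplyr)
library(purrr)

df %>%
  mutate(new_AE = pmap_chr(select(., starts_with('AE')), ~
    case_when(any(c(...) >= 3) ~ "high",  # check the higher-priority condition first
              TRUE ~ "low")))
```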
This question already has answers here:
How to combine multiple conditions to subset a data-frame using "OR"?
(5 answers)
Closed 2 years ago.
treat age education black hispanic married nodegree re74 re75
1:       0        23    10        1       0        0    1    0
2:       0        26    12        0       0        0    0    0
3:       0        22     9        1       0        0    1    0
4:       0        18     9        1       0        0    1    0
I'm trying to only display data where either re74==0 or re75==0 or both are equal to zero, which implies that I'm disregarding the rows where both are equal to one.
df <- data.frame(
  stringsAsFactors = FALSE,
  treat = c("1:", "2:", "3:", "4:"),
  age = c(0L, 0L, 0L, 0L),
  education = c(23L, 26L, 22L, 18L),
  black = c(10L, 12L, 9L, 9L),
  hispanic = c(1L, 0L, 1L, 1L),
  married = c(0L, 0L, 0L, 0L),
  nodegree = c(0L, 0L, 0L, 0L),
  re74 = c(1L, 0L, 1L, 1L),
  re75 = c(1L, 0L, 0L, 0L)
)
df[df$re74 == 0 | df$re75 == 0, ]
treat age education black hispanic married nodegree re74 re75
2 2: 0 26 12 0 0 0 0 0
3 3: 0 22 9 1 0 0 1 0
4 4: 0 18 9 1 0 0 1 0
You can use filter from dplyr
library(dplyr)
df %>% filter(re74 == 0 | re75 == 0)
We can use subset
subset(df, re74 == 0 | re75 == 0)
I have a database with 100 columns, but a minimal reproduction of my data is as follows:
df1 <- read.table(text="PG1S1AW KOM1S1zo PG2S2AW KOM2S2zo PG3S3AW KOM3S3zo PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
4 1 2 4 4 3 0 4 0 5
4 4 3 1 3 1 0 3 0 1
2 3 5 3 3 2 1 4 0 2
1 1 1 1 1 3 0 5 0 1
2 5 3 4 4 5 0 1 3 4", header=TRUE)
I want to get the columns starting with KOM and PG whose number is greater than 3, so we need PG4, KOM4 and above. Put simply, I want the columns starting with PG and KOM whose numbers are the same and are 4 or greater.
The intended output is:
PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
0 4 0 5
0 3 0 1
1 4 0 2
0 5 0 1
0 1 3 4
I have used the following code, but it does not work for me:
df2<- df1%>% select(contains("KO"))
Thanks for your help.
It is not entirely clear about the patterns. We create a function (f1) that extracts one or more digits (\\d+) following 'KOM' or (|) 'PG' with str_extract (from stringr) and converts them to numeric ('v1'). Similarly, it extracts the numbers after the 'S' ('v2'). We then check whether these two values are the same and whether the value is greater than 3, wrapping the result in which so that any NAs produced by str_extract are dropped (which returns the column index while removing NAs). Finally, we use the function inside select to pick the columns that follow the pattern.
library(dplyr)
library(stringr)
f1 <- function(nm) {
v1 <- as.numeric(str_extract(nm, "(?<=(KOM|PG))\\d+"))
v2 <- as.numeric(str_extract(nm, "(?<=S)\\d+"))
nm[which((v1 == v2) & (v1 > 3))]
}
df1 %>%
select(f1(names(.)))
# PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
#1 0 4 0 5
#2 0 3 0 1
#3 1 4 0 2
#4 0 5 0 1
#5 0 1 3 4
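To see what `f1` extracts from each name, the two str_extract calls can be run on their own (shown here for a subset of the column names of `df1`):

```r
library(stringr)

nm <- c("PG1S1AW", "KOM1S1zo", "PG4S4AW", "KOM4S4zo", "PG5S5AW", "KOM5S5zo")

# digits following 'KOM' or 'PG' (lookbehind)
as.numeric(str_extract(nm, "(?<=(KOM|PG))\\d+"))
# [1] 1 1 4 4 5 5

# digits following 'S'
as.numeric(str_extract(nm, "(?<=S)\\d+"))
# [1] 1 1 4 4 5 5
```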
data
df1 <- structure(list(PG1S1AW = c(4L, 4L, 2L, 1L, 2L), KOM1S1zo = c(1L,
4L, 3L, 1L, 5L), PG2S2AW = c(2L, 3L, 5L, 1L, 3L), KOM2S2zo = c(4L,
1L, 3L, 1L, 4L), PG3S3AW = c(4L, 3L, 3L, 1L, 4L), KOM3S3zo = c(3L,
1L, 2L, 3L, 5L), PG4S4AW = c(0L, 0L, 1L, 0L, 0L), KOM4S4zo = c(4L,
3L, 4L, 5L, 1L), PG5S5AW = c(0L, 0L, 0L, 0L, 3L), KOM5S5zo = c(5L,
1L, 2L, 1L, 4L)), class = "data.frame", row.names = c(NA, -5L
))
Given your example data, you can just instead look for the numbers 4 or 5.
df1 %>%
select(matches("4|5"))
#> PG4S4AW KOM4S4zo PG5S5AW KOM5S5zo
#> 1 0 4 0 5
#> 2 0 3 0 1
#> 3 1 4 0 2
#> 4 0 5 0 1
#> 5 0 1 3 4
I am trying to recode an original table with SNP IDs in rows and sample IDs in columns.
So far, I have only managed to convert the data into presence/absence with 0 and 1.
I tried some simple code for the further conversion but cannot find anything that does what I want.
The original table looks like this
snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
A_001 0 1 1 1 0 0 1 0
A_001 0 0 1 0 1 0 1 1
A_002 1 1 0 1 1 1 0 0
A_002 0 1 1 0 1 0 1 1
A_003 1 0 0 1 0 1 1 0
A_003 1 1 0 1 1 0 0 1
A_004 0 0 1 0 0 1 0 0
A_004 1 0 0 1 0 1 1 0
I would like to recode the scores as 0/0 = NA, 0/1 = 0, 1/1 = 2, 1/0 = 1, so the product looks something like this.
snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
A_001 NA 1 2 1 0 NA 2 0
A_002 1 2 0 1 2 1 0 0
A_003 2 0 NA 2 0 1 1 0
A_004 0 NA 1 0 NA 2 0 NA
This is just an example; my real data has ~96,000 snpIDs and ~500 sample ID columns.
Any help with writing this code would be really appreciated.
Here are a few dplyr-based examples that each work in a single pipe and produce the same output. The main first step is to group by your ID, then collapse all the columns with a /. Then you can use mutate_at to select all columns that start with Cal_; this is useful if you have other columns besides the ID that you don't want to apply this operation to.
First method is a case_when:
library(dplyr)
dat %>%
group_by(snpID) %>%
summarise_all(paste, collapse = "/") %>%
mutate_at(vars(starts_with("Cal_")), ~case_when(
. == "0/1" ~ 0,
. == "1/1" ~ 2,
. == "1/0" ~ 1,
TRUE ~ NA_real_
))
#> # A tibble: 4 x 9
#> snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A_001 NA 1 2 1 0 NA 2 0
#> 2 A_002 1 2 0 1 2 1 0 0
#> 3 A_003 2 0 NA 2 0 1 1 0
#> 4 A_004 0 NA 1 0 NA 2 0 NA
However, (in my opinion) case_when is a little tricky to read, and this doesn't showcase its real power, which is doing if/else checks on multiple variables. Better suited to checks on one variable at a time is dplyr::recode:
dat %>%
group_by(snpID) %>%
summarise_all(paste, collapse = "/") %>%
mutate_at(vars(starts_with("Cal_")),
~recode(.,
"0/1" = 0,
"1/1" = 2,
"1/0" = 1,
"0/0" = NA_real_))
# same output as above
Or, for more flexibility & readability, create a small lookup object. That way, you can reuse the recode logic and change it easily. recode takes a set of named arguments; using tidyeval, you can pass in a named vector and unquote it with !!! (there's a similar example in the recode docs):
lookup <- c("0/1" = 0, "1/1" = 2, "1/0" = 1, "0/0" = NA_real_)
dat %>%
group_by(snpID) %>%
summarise_all(paste, collapse = "/") %>%
mutate_at(vars(starts_with("Cal_")), recode, !!!lookup)
# same output
You might use aggregate to concatenate the values for each snpID and then replace the values according to your needs with the help of case_when from dplyr.
(out <- aggregate(.~ snpID, dat, toString))
# snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
#1 A_001 0, 0 1, 0 1, 1 1, 0 0, 1 0, 0 1, 1 0, 1
#2 A_002 1, 0 1, 1 0, 1 1, 0 1, 1 1, 0 0, 1 0, 1
#3 A_003 1, 1 0, 1 0, 0 1, 1 0, 1 1, 0 1, 0 0, 1
#4 A_004 0, 1 0, 0 1, 0 0, 1 0, 0 1, 1 0, 1 0, 0
Now recode the columns
library(dplyr)
out[-1] <- case_when(out[-1] == "0, 0" ~ NA_integer_,
out[-1] == "0, 1" ~ 0L,
out[-1] == "1, 0" ~ 1L,
TRUE ~ 2L)
Result
out
# snpID Cal_X1 Cal_X2 Cal_X3 Cal_X4 Cal_X5 Cal_X6 Cal_X7 Cal_X8
#1 A_001 NA 1 2 1 0 NA 2 0
#2 A_002 1 2 0 1 2 1 0 0
#3 A_003 2 0 NA 2 0 1 1 0
#4 A_004 0 NA 1 0 NA 2 0 NA
data
dat <- structure(list(snpID = c("A_001", "A_001", "A_002", "A_002",
"A_003", "A_003", "A_004", "A_004"), Cal_X1 = c(0L, 0L, 1L, 0L,
1L, 1L, 0L, 1L), Cal_X2 = c(1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L),
Cal_X3 = c(1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L), Cal_X4 = c(1L,
0L, 1L, 0L, 1L, 1L, 0L, 1L), Cal_X5 = c(0L, 1L, 1L, 1L, 0L,
1L, 0L, 0L), Cal_X6 = c(0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L),
Cal_X7 = c(1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L), Cal_X8 = c(0L,
1L, 0L, 1L, 0L, 1L, 0L, 0L)), .Names = c("snpID", "Cal_X1",
"Cal_X2", "Cal_X3", "Cal_X4", "Cal_X5", "Cal_X6", "Cal_X7", "Cal_X8"
), class = "data.frame", row.names = c(NA, -8L))
I am trying to create a summary table and having a mental hang-up. Essentially, what I think I want is a summaryBy-style statement getting colSums for the subsets of ALL columns except the factor to summarize on.
My data frame looks like this:
Cluster GO:0003677 GO:0003700 GO:0046872 GO:0008270 GO:0043565 GO:0005524
comp103680_c0 10 0 0 0 0 0 1
comp103947_c0 3 0 0 0 0 0 0
comp104660_c0 1 1 1 0 0 0 0
comp105255_c0 10 0 0 0 0 0 0
What I would like to do is get colSums for all columns after Cluster using Cluster as the grouping factor.
I have tried a bunch of things. The last was with plyr's ddply:
> groupColumns = "Cluster"
> dataColumns = colnames(GO_matrix_MF[,2:ncol(GO_matrix_MF)])
> res = ddply(GO_matrix_MF, groupColumns, function(x) colSums(GO_matrix_MF[dataColumns]))
> head(res)
Cluster GO:0003677 GO:0003700 GO:0046872 GO:0008270 GO:0043565 GO:0005524 GO:0004674 GO:0045735
1 1 121 138 196 94 43 213 97 20
2 2 121 138 196 94 43 213 97 20
I am not sure what the return values represent, but they do not represent the colSums
Try:
> aggregate(.~Cluster, data=ddf, sum)
Cluster GO.0003677 GO.0003700 GO.0046872 GO.0008270 GO.0043565 GO.0005524
1 1 1 1 0 0 0 0
2 3 0 0 0 0 0 0
3 10 0 0 0 0 0 1
I think you are looking for something like this. I modified your data a bit. There are other options too.
# Modified data
foo <- structure(list(Cluster = c(10L, 3L, 1L, 10L), GO.0003677 = c(11L,
0L, 1L, 5L), GO.0003700 = c(0L, 0L, 1L, 0L), GO.0046872 = c(0L,
9L, 0L, 0L), GO.0008270 = c(0L, 0L, 0L, 0L), GO.0043565 = c(0L,
0L, 0L, 0L), GO.0005524 = c(1L, 0L, 0L, 0L)), .Names = c("Cluster",
"GO.0003677", "GO.0003700", "GO.0046872", "GO.0008270", "GO.0043565",
"GO.0005524"), class = "data.frame", row.names = c("comp103680_c0",
"comp103947_c0", "comp104660_c0", "comp105255_c0"))
library(dplyr)
foo %>%
group_by(Cluster) %>%
summarise_each(funs(sum))
# Cluster GO.0003677 GO.0003700 GO.0046872 GO.0008270 GO.0043565 GO.0005524
#1 1 1 1 0 0 0 0
#2 3 0 0 9 0 0 0
#3 10 16 0 0 0 0 1
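`summarise_each` has since been superseded in dplyr; with current versions the same grouped column sums are written with `across()`. A sketch of the equivalent call:

```r
library(dplyr)

# sum every non-grouping column within each Cluster
foo %>%
  group_by(Cluster) %>%
  summarise(across(everything(), sum))
```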