Picking a value from the 2nd column by its position in the 1st column - R

I have a field named AD listed in column X1, but its position in the colon-separated sequence differs from row to row. How can I pick the corresponding 'AD' value from X2?
X1 X2
GT:GQ:GQX:DPI:AD:DP 0/1:909:12:125:93,26:119
GT:GQ:GQX:DPI:AD 0/1:909:12:125:35,24
GT:GQ:GQX:DP:DPF:AD 0/1:57:3:11:130:8,3
GT:AD:DP:GQ:PL 0/1:211,31:242:99:138,0,7251
Output
AD
93,26
35,24
8,3
211,31

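Split both columns at ":" using strsplit, find the position of "AD" in X1 using grep, and select that position from X2 with mapply. Note that grep matches any field containing "AD", not only a field that is exactly "AD".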
mapply(`[`, strsplit(d$X2, ":"), sapply(strsplit(d$X1,":"), grep, pattern="AD"))
# [1] "93,26" "35,24" "8,3" "211,31"
Data:
d <- structure(list(X1 = c("GT:GQ:GQX:DPI:AD:DP", "GT:GQ:GQX:DPI:AD",
"GT:GQ:GQX:DP:DPF:AD", "GT:AD:DP:GQ:PL"), X2 = c("0/1:909:12:125:93,26:119",
"0/1:909:12:125:35,24", "0/1:57:3:11:130:8,3", "0/1:211,31:242:99:138,0,7251"
)), class = "data.frame", row.names = c(NA, -4L))
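As a minimal sketch of the same idea, the lookup can be wrapped in a function that uses match() for an exact key comparison and returns NA when the key is absent. The name extract_field and the third test row (which has no AD field) are assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical helper: pull the value for one key out of paired
# colon-delimited "keys" and "values" strings.
extract_field <- function(keys, values, key = "AD") {
  k <- strsplit(keys, ":", fixed = TRUE)
  v <- strsplit(values, ":", fixed = TRUE)
  # match() gives the position of the exact key, or NA if it is missing
  pos <- vapply(k, function(x) match(key, x), integer(1))
  # indexing with NA yields NA, so rows without the key stay NA
  mapply(function(x, i) x[i], v, pos)
}

# Two rows from the question plus one made-up row without "AD"
d <- data.frame(
  X1 = c("GT:GQ:GQX:DPI:AD:DP", "GT:AD:DP:GQ:PL", "GT:GQ"),
  X2 = c("0/1:909:12:125:93,26:119", "0/1:211,31:242:99:138,0,7251", "0/1:50"),
  stringsAsFactors = FALSE
)
ad <- extract_field(d$X1, d$X2)
ad
```

Using match() rather than grep avoids accidentally picking up a field whose name merely contains "AD".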

With base R, you can also try regmatches + regexpr, which returns the first "digits,digits" match in each row:
> unlist(regmatches(df$X2,regexpr("\\d+,\\d+",df$X2)))
[1] "93,26" "35,24" "8,3" "211,31"

Using base R: split both columns with strsplit and extract the element at the position of "AD".
mapply(
function(x, i) x[i],
strsplit(df$X2, ":"),
lapply(strsplit(df$X1, ":"), function(x) which(x == "AD"))
)
[1] "93,26" "35,24" "8,3" "211,31"
Reproducible data
df <- data.frame(
X1 = c("GT:GQ:GQX:DPI:AD:DP", "GT:GQ:GQX:DPI:AD", "GT:GQ:GQX:DP:DPF:AD", "GT:AD:DP:GQ:PL"),
X2 = c("0/1:909:12:125:93,26:119", "0/1:909:12:125:35,24", "0/1:57:3:11:130:8,3", "0/1:211,31:242:99:138,0,7251")
)

Related

Sort numbers separated by colons using R

I have the same question as here but using R:
Sort numbers with colons
I have a data frame A with a column like this one:
1:5
11:36
2:1
2:14
2:8
I'd like to sort A based on that column, in this way:
1:5
2:1
2:8
2:14
11:36
We can separate the data into different columns, arrange the data by all columns and combine them again.
library(dplyr)
library(tidyr)
df %>%
separate(V1, into = c("A", "B"), sep = ":", convert = TRUE) %>%
arrange_all() %>%
unite(A, A, B, sep = ":")
# A
#1 1:5
#2 2:1
#3 2:8
#4 2:14
#5 11:36
data
df <- structure(list(V1 = c("1:5", "11:36", "2:1", "2:14", "2:8")),
row.names = c(NA, -5L), class = "data.frame")
Here is a base R solution using order + gsub, i.e.,
r <- v[order(as.numeric(gsub(":.*","",v)),
as.numeric(gsub(".*:","",v)))]
such that
> r
[1] "1:5" "2:1" "2:8" "2:14" "11:36"
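The same order-based idea can be applied to a whole data frame so that other columns move along with the sorted one. This is only a sketch: the data frame A, its column V1 and the extra column x are assumed names chosen for illustration.

```r
# Assumed example data: the column from the question plus an extra column
A <- data.frame(V1 = c("1:5", "11:36", "2:1", "2:14", "2:8"),
                x = 1:5, stringsAsFactors = FALSE)

# Numeric part before and after the colon
left  <- as.numeric(sub(":.*", "", A$V1))
right <- as.numeric(sub(".*:", "", A$V1))

# order() sorts by the first key and breaks ties with the second
A_sorted <- A[order(left, right), ]
A_sorted
```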
1) gtools mixedsort and mixedorder in gtools can do that. We show how to do it for a vector v and an entire data frame DF which may have additional columns that are to be moved along with the v column. (The test data is defined reproducibly in the Note at the end. If the v column in DF were factor rather than character then use as.character(DF$v) in place of DF$v).
library(gtools)
mixedsort(v)
## [1] "1:5" "2:1" "2:8" "2:14" "11:36"
DF[mixedorder(DF$v), ]
## v x
## 1 1:5 1
## 3 2:1 3
## 5 2:8 5
## 4 2:14 4
## 2 11:36 2
2) Base R This alternative is slightly longer but only uses base R. It gives the same answers as (1). The comment about factors in (1) applies here too.
o <- do.call("order", read.table(text = v, sep = ":"))
v[o]
o <- do.call("order", read.table(text = DF$v, sep = ":"))
DF[o, ]
Note
Test data used
v <- c("1:5", "11:36", "2:1", "2:14", "2:8")
DF <- data.frame(v, x = seq_along(v), stringsAsFactors = FALSE)

Extract subset of string in dataframe column

One of the columns in my data frame looks as follows. I need to get the output shown below.
Data :
NM_001104633|0|Sema3d|-
NM_0011042|0|XYZ|-
NM_0956|0|ghd|+
Required output :
Sema3d
XYZ
ghd
x = c("NM_001104633|0|Sema3d|-", "NM_0011042|0|XYZ|-", "NM_0956|0|ghd|+")
sub(".*0\\|(.*)\\|[+|-]", "\\1", x)
#[1] "Sema3d" "XYZ" "ghd"
#OR
sapply(strsplit(x, "\\|"), function(s) s[3])
#[1] "Sema3d" "XYZ" "ghd"
#OR
sapply(x, function(s){
inds = gregexpr("\\|", s)[[1]]
substring(s, inds[2] + 1, inds[3] - 1)
},
USE.NAMES = FALSE)
#[1] "Sema3d" "XYZ" "ghd"
We can use read.table to separate them in different columns and then select only the one which we are interested in.
read.table(text = df$V1, sep = "|")
# V1 V2 V3 V4
#1 NM_001104633 0 Sema3d -
#2 NM_0011042 0 XYZ -
#3 NM_0956 0 ghd +
We can also use tidyr::separate for this
tidyr::separate(df, V1, into = paste0("col", 1:4), sep = "\\|")
Or cSplit from splitstackshape
splitstackshape::cSplit(df, "V1", sep = "|")
data
df <- structure(list(V1 = c("NM_001104633|0|Sema3d|-", "NM_0011042|0|XYZ|-",
"NM_0956|0|ghd|+")), class = "data.frame", row.names = c(NA, -3L))
The following regex takes all text between the last pair of | followed by a + or a -.
([^\|]*)(?=\|(\+|-))
Demo
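A lookahead pattern like this one is not R-specific. As a hedged sketch, one way to apply it in base R is regexpr with perl = TRUE (lookaheads need the PCRE engine) combined with regmatches; the variable name gene is an assumption for illustration.

```r
x <- c("NM_001104633|0|Sema3d|-", "NM_0011042|0|XYZ|-", "NM_0956|0|ghd|+")

# Run of non-pipe characters that is followed by "|" and then "+" or "-".
# perl = TRUE is required because the default regex engine has no lookahead.
m <- regexpr("[^|]*(?=\\|[+-])", x, perl = TRUE)
gene <- regmatches(x, m)
gene
```

regmatches drops elements without a match, so the result can be shorter than x if some strings do not follow the expected layout.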
We can use sub from base R
sub(".*\\|(\\w+)\\|[-+]$", "\\1", x)
#[1] "Sema3d" "XYZ" "ghd"
Or using gsub
gsub(".*\\d+\\||\\|.*", "", x)
#[1] "Sema3d" "XYZ" "ghd"
data
x <- c("NM_001104633|0|Sema3d|-", "NM_0011042|0|XYZ|-", "NM_0956|0|ghd|+")
The package unglue offers a readable alternative, if not as efficient :
x = c("NM_001104633|0|Sema3d|-", "NM_0011042|0|XYZ|-", "NM_0956|0|ghd|+")
unglue::unglue_vec(x, "{drop1}|0|{keep}|{drop2}",var = "keep")
#> [1] "Sema3d" "XYZ" "ghd"
# or
unglue::unglue_vec(x, "{=.*?}|0|{keep}|{=.*?}")
#> [1] "Sema3d" "XYZ" "ghd"
Or in the data frame directly :
df <- data.frame(col = x)
unglue::unglue_unnest(df, col, "{=.*?}|0|{new_col}|{=.*?}")
#> new_col
#> 1 Sema3d
#> 2 XYZ
#> 3 ghd

Removing the special symbols in data.frame column values

I have two data frames, each with a column called name.
df1:
name
#one2
!iftwo
there_2_go
come&go
df1 = structure(list(name = c("#one2", "!iftwo", "there_2_go", "come&go")),.Names = c("name"), row.names = c(NA, -4L), class = "data.frame")
df2:
name
One2
IfTwo#
there-2-go
come.go
df2 = structure(list(name = c("One2", "IfTwo#", "there-2-go", "come.go")),.Names = c("name"), row.names = c(NA, -4L), class = "data.frame")
Comparing the two data frames for inequality using %in% is cumbersome because of the special symbols. The stringr package could be useful for removing them, but how exactly can we use stringr functions with %in% and display the mismatches between the two data frames?
I have already used mutate() with tolower() to convert everything to lowercase, as follows:
df1<-mutate(df1,name=tolower(df1$name))
df2<-mutate(df2,name=tolower(df2$name))
Current output of comparison:
df2[!(df2 %in% df1),]
[1] "one2" "iftwo#" "there-2-go" "come.go"
Expected output, since the contents are essentially the same apart from the special symbols:
df2[!(df2 %in% df1),]
character(0)
Question: How do we ignore the symbols in the contents of the data frames?
Here it is in a function,
f1 <- function(df1, df2){
i1 <- tolower(gsub('[[:punct:]]', '', df1$name))
i2 <- tolower(gsub('[[:punct:]]', '', df2$name))
d1 <- sapply(i1, function(i) grepl(paste(i2, collapse = '|'), i))
return(!d1)
}
f1(df1, df2)
# one2 iftwo there2go comego
# FALSE FALSE FALSE FALSE
#or use it for indexing,
df2[f1(df1, df2),]
#character(0)
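If exact equality after cleaning is what is needed (rather than the regex containment test above), a minimal sketch is to normalise both name columns and compare them with %in%. The helper name normalize is an assumption for illustration.

```r
# Hypothetical helper: strip punctuation and lowercase
normalize <- function(x) tolower(gsub("[[:punct:]]", "", x))

df1 <- data.frame(name = c("#one2", "!iftwo", "there_2_go", "come&go"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("One2", "IfTwo#", "there-2-go", "come.go"),
                  stringsAsFactors = FALSE)

# Names in df2 whose cleaned form has no exact counterpart in df1
mismatch <- df2$name[!(normalize(df2$name) %in% normalize(df1$name))]
mismatch
```

Indexing the name column directly, instead of the whole data frame, keeps %in% working on character vectors.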

Calculate Mean of Comma-Separated String of Numbers

I have a column in my dataframe that is made up of strings of numbers, separated by commas. I would like to convert the string to a list of numbers, and then get the mean. My dataframe, df:
a3
1,5,2
103.1
34,6
First, I converted the string to a list:
> df$a3_list <- strsplit(as.character(df$a3), split = ',')
New df:
a3 a3_list
1,5,2 c("1", "5", "2")
103.1 103.1
34,6 c("34", "6")
At this point, however, I'm not sure how to get a new column containing the mean of each cell in df$a3_list
You can use stringi, it's fast
library(stringi)
mat <- stri_split_fixed(df$a3, ',', simplify=TRUE)
mat <- `dim<-`(as.numeric(mat), dim(mat)) # convert to numeric and save dims
rowMeans(mat, na.rm=TRUE)
# [1] 2.666667 103.100000 20.000000
or with Base R
sapply(strsplit(as.character(df$a3), ",", fixed=TRUE), function(x) mean(as.numeric(x)))
Another base R option
rowMeans(read.table(text=df$a3, sep=",", fill=TRUE), na.rm=TRUE)
#[1] 2.666667 103.100000 20.000000
NOTE: Assuming that the 'a3' is character class. Otherwise, wrap with as.character(df$a3)
data
df <- structure(list(a3 = c("1,5,2", "103.1", "34,6")), .Names = "a3",
class = "data.frame", row.names = c(NA, -3L))
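To store the result as a new column, as the question asks, the split-and-average step can be assigned back to the data frame. This is a sketch in base R; the column name a3_mean is an assumption.

```r
df <- data.frame(a3 = c("1,5,2", "103.1", "34,6"), stringsAsFactors = FALSE)

# Split each string on commas, then average the numeric pieces per row
parts <- strsplit(df$a3, ",", fixed = TRUE)
df$a3_mean <- vapply(parts, function(x) mean(as.numeric(x)), numeric(1))
df
```

vapply is used instead of sapply so the result is guaranteed to be a numeric vector with one value per row.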

Ordering columns of a data frame

I would like to reorder the columns of the data frame given below.
structure(list(DETECTION = c(0.000219, 0.000673, 0.000322, 0.602006,
0.000468, 0.204022, 0.000491, 0.003067), VALUE = structure(1:8, .Label = c("10071_s_at",
"1053_at", "117_at", "1255_g_at", "1294_at", "1320_at", "1405_i_at",
"14312_at"), class = "factor")), .Names = c("DETECTION", "VALUE"
), class = "data.frame", row.names = c(NA, -8L))
I want the numeric column (DETECTION) to come second.
I tried the following:
d1 <- data[1, , drop = FALSE]
nums <- d1[, nn <- sapply(d1, is.numeric)]
ch <- d1[, !nn, drop = FALSE]
id <- names(ch[, grepl('_at$', as.character(unlist(ch))), drop = FALSE])
p <- names(nums)
d <- data[,c(id,p)]
However, names(nums) returns NULL. What is going wrong here?
library(data.table)
dt <- as.data.table(data)
From R help : " When it's required to reorder the columns of a data.table, the idiomatic way is to use setcolorder(x, neworder), instead of doing x <- x[, neworder, with=FALSE]. This is because the latter makes an entire copy of the data.table, which maybe unnecessary in most situations."
setcolorder(dt,c("VALUE","DETECTION"))
names(nums) is NULL because the dimensions were dropped: selecting a single column returns a plain vector. Add drop = FALSE to keep the data frame structure:
names(nums)
#NULL
nums <- d1[, nn <- sapply(d1, is.numeric), drop=FALSE]
names(nums)
#[1] "DETECTION"
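For the underlying goal of moving numeric columns after the others, a shorter base R sketch is to build the column order from is.numeric directly. The object name reordered is an assumption, and only the first three rows of the question's data are used here.

```r
data <- data.frame(
  DETECTION = c(0.000219, 0.000673, 0.000322),
  VALUE = factor(c("10071_s_at", "1053_at", "117_at"))
)

# TRUE for numeric columns, FALSE otherwise
is_num <- vapply(data, is.numeric, logical(1))

# Non-numeric columns first, numeric columns after them
reordered <- data[, c(names(data)[!is_num], names(data)[is_num]), drop = FALSE]
names(reordered)
```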

Resources