Merging / combining columns is not working at all - r

I can't for the life of me figure out a way of getting this working without changing the class of the columns or getting a random level that isn't even in the original columns!
I have data that looks like this:
data <- structure(list(WHY = structure(1:4, .Label = c("WHY1", "WHY2",
"WHY3", "WHY4"), class = "factor"), HELP1 = structure(c(3L, NA,
1L, 2L), .Label = c("1", "2", "D/A"), class = "factor"), HELP2 = c(NA,
2L, NA, NA)), class = "data.frame", row.names = c(NA, -4L))
What I want to do:
If HELP2 is not NA and HELP1 is D/A, then merge the columns WITHOUT changing class.
Here is what I tried:
data$HELP3 <-
ifelse(
!is.na(data$HELP2) &
data$HELP1 == "D/A",
data$HELP1, data$HELP2)
Result:
data
WHY HELP1 HELP2 HELP3
1 WHY1 D/A NA NA
2 WHY2 <NA> 2 NA
3 WHY3 1 NA NA
4 WHY4 2 NA NA
>
I would be so very very very grateful for any help with this. I have been on Stack Overflow for 5 hours now and am no closer to making this work :( I am not that hot with dplyr, so a base R solution or anything else would be wonderful!

HELP1 and HELP2 have different classes (factor and integer), and ifelse() does not return a factor: it drops the factor attributes and gives back the underlying integer codes. You can, however, do this without ifelse() and without changing the classes of the columns.
data$HELP3 <- data$HELP1
inds <- (!is.na(data$HELP2)) & data$HELP1 == "D/A"
data$HELP3[inds] <- data$HELP2[inds]
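To see why the ifelse() attempt goes wrong, here is a minimal sketch that rebuilds the question's data (the name codes is only illustrative). It also wraps the condition in which(), which is my own addition rather than part of the answer above: the condition is NA wherever HELP1 is NA, and which() drops those positions so they cannot end up in the assignment index. Note that in the sample data no row has both a non-NA HELP2 and HELP1 equal to "D/A", so HELP3 ends up identical to HELP1.

```r
# Same data as in the question, rebuilt with data.frame().
data <- data.frame(
  WHY   = factor(c("WHY1", "WHY2", "WHY3", "WHY4")),
  HELP1 = factor(c("D/A", NA, "1", "2"), levels = c("1", "2", "D/A")),
  HELP2 = c(NA, 2L, NA, NA)
)

# ifelse() keeps only the integer codes of a factor, not its levels.
codes <- ifelse(data$HELP1 == "D/A", data$HELP1, data$HELP2)
class(codes)

# Start from HELP1 so HELP3 is a factor with the same levels.
data$HELP3 <- data$HELP1

# which() converts the logical condition to positions and drops NA.
inds <- which(!is.na(data$HELP2) & data$HELP1 == "D/A")
data$HELP3[inds] <- data$HELP2[inds]

class(data$HELP3)
```

If a HELP2 value is not one of the existing levels of HELP1, the assignment produces NA with a warning, so add the level first with levels(data$HELP3) <- c(levels(data$HELP3), "new level").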

Related

Is there a way in R to convert the following character variable?

I have the following dataframe with a character variable that represents the number of lanes on a highway. Can I replace this vector with a similar vector that has numbers instead of letters?
df<- structure(list(Blocked.Lanes = c("|RS|RS|ML|", "|RS|", "|RS|ML|ML|ML|ML|",
"|RS|", "|RS|RE|", "|ML|ML|ML|", "|RS|ML|", "|RS|", "|ML|ML|ML|ML|ML|ML|",
"|RS|ML|ML|"), Event.Id = c(240314L, 240381L, 240396L, 240796L,
240948L, 241089L, 241190L, 241225L, 241226L, 241241L)), row.names = c(NA,
10L), class = "data.frame")
The output should be something like df2 below:
df2<- structure(list(Blocked.Lanes = c(3L, 1L, 5L, 1L, 2L, 3L, 2L,
1L, 6L, 3L), Event.Id = c(240314L, 240381L, 240396L, 240796L,
240948L, 241089L, 241190L, 241225L, 241226L, 241241L)), class = "data.frame", row.names = c(NA,
-10L))
One way would be to count the number of "|" in each string. We subtract 1 because each string has one more "|" than it has lanes.
stringr::str_count(df$Blocked.Lanes, '\\|') - 1
#[1] 3 1 5 1 2 3 2 1 6 3
In base R :
lengths(gregexpr("\\|", df$Blocked.Lanes)) - 1
Another way would be to count the words in the string.
stringr::str_count(df$Blocked.Lanes, '\\w+')
lengths(gregexpr("\\w+", df$Blocked.Lanes))
Similar to Ronak's solution you could also do:
stringr::str_count(df$Blocked.Lanes, "\\b[A-Z]{2}\\b")
if the lanes are always 2 letters long, or
stringr::str_count(df$Blocked.Lanes, "\\b[A-Z]+\\b")
if the lanes are always at least one letter long.
stringr::str_count(df$Blocked.Lanes, "(?<=\\|)[A-Z]+(?=\\|)")
also works.
Not as succinct as Ronak Shah's, but another method in base R.
String split on string literal "|" and then count elements:
df2 <- transform(df, Blocked.Lanes = lengths(Map(function(x) x[x != ""],
strsplit(df$Blocked.Lanes, "|", fixed = TRUE))))
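One caveat with the lengths(gregexpr(...)) idiom, in case some rows can be empty: gregexpr() reports "no match" as a single -1, so lengths() counts one match where there are none. A small sketch of a safer base R counter follows; count_matches is a hypothetical helper name, not something from the answers above.

```r
# Hypothetical helper: count regex matches per string, returning 0
# (not 1) when a string has no match at all.
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

lanes <- c("|RS|RS|ML|", "|RS|", "|RS|ML|ML|ML|ML|", "")

# Number of lane codes per string; the empty string gives 0.
n_lanes <- count_matches(lanes, "\\w+")

# The shorter idiom counts the "no match" marker as one match.
naive <- lengths(gregexpr("\\w+", lanes))
```

For the data in the question every row has at least one lane, so both versions agree there.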

Looking for how to use separate() with multiple separators in R (ClinVar variant data dealing)

Dear StackOverflow community
I'm a biologist and I'm working with disease/genetic variants from the official ClinVar database. My aim is to extract all gene names, transcripts and variants from this list.
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_2020-01.xml.gz
However, ClinVar offers the information I need in a single column called "Name". (I've separated some of the values with different results that I want to deal with in the example in the table below:)
Name ClinicalSignificance
1 NG_012236.2:g.11027del Pathogenic
2 NM_018077.3(RBM28):c.1052T>C (p.Leu351Pro) Pathogenic
3 NC_012920.1:m.7445A>G Pathogenic
4 m.7510T>C Pathogenic
5 NC_000023.11:g.(134493178_134493182)_(134501172_134501176)del Pathogenic
(There are other types of data, but since they do not contain the information I need, I will treat them as garbage.)
I am looking for a way to split the "Name" column into 3 other columns, using multiple separators. I've tried using "|" as part of my regex argument for multiple matches. However, every time a separator matches, the data that has already been separated is shifted one column to the right.
My code:
ClinVar_Clean <- separate(ClinVar_Clean, Name, into = c("Transcript","gene.var"),sep = "(?<=\\.[0-9]{1,2})[(]|(?<=[0-9]{3,16}\\.[0-9]{1,2}):|(?=[cmpng]\\.)")
ClinVar_Clean <- separate(ClinVar_Clean, gene.var, into = c("Gene","Variant"),sep = "\\):|(?=[cmpng]\\.)")
My result:
Transcript Gene Variant ClinicalSignificance
1 NG_012236.2 <NA> Pathogenic
2 NM_018077.3 RBM28 Pathogenic
3 NC_012920.1 <NA> Pathogenic
4 m.7510T>C Pathogenic
5 NC_000023.11 <NA> Pathogenic
How the result should look:
Transcript Gene Variant ClinicalSignificance
1 NG_012236.2 g.11027del Pathogenic
2 NM_018077.3 RBM28 c.1052T>C (p.Leu351Pro) Pathogenic
3 NC_012920.1 m.7445A>G Pathogenic
4 m.7510T>C Pathogenic
5 NC_000023.11 g.(134493178_134493182)_(134501172_134501176)del Pathogenic
I also tried running each separator individually; instead of shifting the data to the right, that overwrites the remaining data.
If anyone could help, I would appreciate it!
I was trying to do this with a single extract/separate call, but I couldn't come up with one that gives the exact expected output. So here is an attempt that breaks it down into separate steps, using str_extract from stringr and sub from base R.
library(dplyr)
library(stringr)
df %>%
mutate(Transcript = str_extract(Name, ".*(?<=:)"),
Gene = str_extract(Transcript, "(?<=\\().*(?=\\))"),
Variant = sub(".*:(.*)", "\\1", Name)) %>%
select(Transcript, Gene, Variant)
# Transcript Gene Variant
#1 NG_012236.2: <NA> g.11027del
#2 NM_018077.3(RBM28): RBM28 c.1052T>C(p.Leu351Pro)
#3 NC_012920.1: <NA> m.7445A>G
#4 <NA> <NA> m.7510T>C
#5 NC_000023.11: <NA> g.(134493178_134493182)_(134501172_134501176)del
In Transcript we capture everything before the colon.
For Gene, we get the characters inside the parentheses in Transcript.
For Variant, we get everything after colon.
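The output above still carries the trailing colon and the "(RBM28)" part in Transcript, which the expected table in the question does not have. A base R sketch of one way to get closer, assuming that the first colon always separates the transcript part from the variant and that rows without a colon contain only a variant (the intermediate name prefix is my own):

```r
Name <- c("NG_012236.2:g.11027del",
          "NM_018077.3(RBM28):c.1052T>C(p.Leu351Pro)",
          "NC_012920.1:m.7445A>G",
          "m.7510T>C",
          "NC_000023.11:g.(134493178_134493182)_(134501172_134501176)del")

# Everything before the first colon; NA when there is no colon.
has_colon <- grepl(":", Name, fixed = TRUE)
prefix <- ifelse(has_colon, sub(":.*$", "", Name), NA_character_)

# Variant: everything after the first colon (whole string if none).
Variant <- sub("^[^:]*:", "", Name)

# Transcript: the prefix without any "(GENE)" suffix.
Transcript <- sub("\\(.*$", "", prefix)

# Gene: the text inside the parentheses of the prefix, if present.
Gene <- ifelse(grepl("(", prefix, fixed = TRUE),
               sub("^.*\\((.*)\\)$", "\\1", prefix),
               NA_character_)

data.frame(Transcript, Gene, Variant)
```

Other ClinVar name formats may need extra cases, so it is worth checking the rows where Transcript comes out as NA.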
data
df <- structure(list(Name = structure(c(4L, 5L, 3L, 1L, 2L), .Label = c("m.7510T>C",
"NC_000023.11:g.(134493178_134493182)_(134501172_134501176)del",
"NC_012920.1:m.7445A>G", "NG_012236.2:g.11027del",
"NM_018077.3(RBM28):c.1052T>C(p.Leu351Pro)"
), class = "factor"), ClinicalSignificance = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "Pathogenic", class = "factor")), class =
"data.frame", row.names = c("1", "2", "3", "4", "5"))

How to unwind column contents into new columns based on condition in another column in R

I have a dataframe called mydf. I want to split the contents of columns ASM and GPM based on the format given in the FORMAT column and get the result. So basically, there will be as many columns for each of ASM and GPM as there are unique elements (i.e. 5 different unique elements) in the FORMAT column, separated by :, to unwind in the result. Then I need to place the right value in the right column (with .GT, .FT, and so on) as indicated in the FORMAT column.
mydf <- structure(list(`#CHROM` = c(1L, 1L, 1L), POS = c(10490L, 10493L,
10494L), FORMAT = c("GT:FT:GQ", "GT:PS:GL", "GT:PS:FT"), ASM = c("1/1:TRUE:4,2,333",
"./.:.:.", "0/1:.:VQLOW"), GPM = c("./.:.:.", "1/1:4:2,233",
"0/1:22:VQHIGH")), .Names = c("#CHROM", "POS", "FORMAT", "ASM",
"GPM"), class = "data.frame", row.names = c(NA, -3L))
result:
result <- structure(list(`#CHROM` = c(1L, 1L, 1L), POS = c(10490L, 10493L,
10494L), FORMAT = c("GT:FT:GQ", "GT:PS:GL", "GT:PS:FT"), ASM.GT = c("1/1",
"./.", "0/1"), ASM.FT = c("TRUE", NA, "VQLOW"), ASM.GQ = c("4,2,333",
NA, NA), ASM.PS = c(NA, NA, NA), ASM.GL = c(NA, NA, NA), GPM.GT = c("./.",
"1/1", "0/1"), GPM.FT = c(NA, NA, "VQHIGH"), GPM.GQ = c(NA, NA,
NA), GPM.PS = c(NA, 4L, 22L), GPM.GL = c(NA, 2233L, NA)), .Names = c("#CHROM",
"POS", "FORMAT", "ASM.GT", "ASM.FT", "ASM.GQ", "ASM.PS", "ASM.GL",
"GPM.GT", "GPM.FT", "GPM.GQ", "GPM.PS", "GPM.GL"), class = "data.frame", row.names = c(NA,
-3L))
Since it appears that the number of values in each of the columns to be split is the same, we can take advantage of the ability of dcast in "data.table" to handle multiple value.vars.
The splitting can be done by cSplit from my "splitstackshape" package.
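library(data.table) # dcast() with multiple value.var columns; attached explicitly in case splitstackshape only imports it
library(splitstackshape)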
dcast(cSplit(mydf, c("FORMAT", "ASM", "GPM"), ":", "long"),
`#CHROM` + POS ~ FORMAT, value.var = c("ASM", "GPM"))
# #CHROM POS ASM_FT ASM_GL ASM_GQ ASM_GT ASM_PS GPM_FT GPM_GL GPM_GQ GPM_GT GPM_PS
# 1: 1 10490 TRUE NA 4,2,333 1/1 NA . NA . ./. NA
# 2: 1 10493 NA . NA ./. . NA 2,233 NA 1/1 4
# 3: 1 10494 VQLOW NA NA 0/1 . VQHIGH NA NA 0/1 22
Note that "#CHROM" is a very R-unfriendly column name since the # is the comment character.
If you need to add back in the "FORMAT" column, add a [, FORMAT:= mydf$FORMAT][] to the end of the dcast above.
I'm presuming that you can handle further cleaning from here (for example, replacing . with NA and removing the thousand comma separator wherever it appears).
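A rough base R sketch of that cleanup, run on a small hand-built stand-in for part of the dcast output (the data frame wide and its contents are assumptions for illustration, not the real result). Following the expected result in the question, only a lone "." becomes NA, so "./." in the GT columns is kept, and only the PS and GL columns are converted to integers.

```r
# Hand-built stand-in for a few columns of the reshaped table.
wide <- data.frame(
  POS    = c(10490L, 10493L, 10494L),
  ASM_FT = c("TRUE", NA, "VQLOW"),
  GPM_FT = c(".", NA, "VQHIGH"),
  GPM_GT = c("./.", "1/1", "0/1"),
  GPM_GL = c(NA, "2,233", NA),
  GPM_PS = c(NA, "4", "22"),
  stringsAsFactors = FALSE
)

value_cols <- setdiff(names(wide), "POS")

# Replace a lone "." with NA; existing NA values are left alone.
wide[value_cols] <- lapply(wide[value_cols], function(x) {
  x[!is.na(x) & x == "."] <- NA
  x
})

# Drop the thousands separator, then convert to integer.
wide$GPM_GL <- as.integer(gsub(",", "", wide$GPM_GL, fixed = TRUE))
wide$GPM_PS <- as.integer(wide$GPM_PS)
```

The same lapply() pattern should carry over to the real output once value_cols lists all the ASM_* and GPM_* columns.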

How do I plot boxplots of two different series?

I have 2 dataframes sharing the same row IDs but with different columns.
Here is an example
chrom coord sID CM0016 CM0017 CM0018
7 10 3178881 SP_SA036,SP_SA040 0.000000000 0.000000000 0.0009923
8 10 38894616 SP_SA036,SP_SA040 0.000434783 0.000467464 0.0000970
9 11 104972190 SP_SA036,SP_SA040 0.497802888 0.529319536 0.5479003
and
chrom coord sID CM0001 CM0002 CM0003
4 10 3178881 SP_SA036,SA040 0.526806527 0.544927536 0.565610860
5 10 38894616 SP_SA036,SA040 0.009049774 0.002849003 0.002857143
6 11 104972190 SP_SA036,SA040 0.451612903 0.401617251 0.435318275
I am trying to create a composite boxplot figure where the x axis shows chrom and coord combined (so 3 points) and, for each x value, 2 boxplots side by side corresponding to the two dataframes.
What is the best way of doing this? Should I merge the two dataframes somehow in order to get only one, and loop over the boxplot rendering by 3 columns?
Any idea on how this can be done?
The problem is that the two dataframes have the same number of rows but can differ in number of columns
> dim(A)
[1] 99 20
> dim(B)
[1] 99 28
I was thinking about transposing the dataframes in order to get the same number of columns, but got lost on how to do this properly.
Thanks in advance
UPDATE
This is what I tried to do
I merged chrom and coord columns together to create a single ID
I used reshape to melt the dataframes
I merged the 2 melted dataframe into a single one
the head looks like this
I have two variable A2 and A4 corresponding to the 2 dataframes
Then I created a boxplot using this:
ggplot(A2A4, aes(factor(combine), value)) +geom_boxplot(aes(fill = factor(variable)))
I think it solved my problem but the boxplot looks very busy with 99 x values with 2 boxplots each
So if these are your input tables
d1<-structure(list(chrom = c(10L, 10L, 11L),
coord = c(3178881L, 38894616L, 104972190L),
sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SP_SA040", class = "factor"),
CM0016 = c(0, 0.000434783, 0.497802888), CM0017 = c(0, 0.000467464,
0.529319536), CM0018 = c(0.0009923, 9.7e-05, 0.5479003)), .Names = c("chrom",
"coord", "sID", "CM0016", "CM0017", "CM0018"), class = "data.frame", row.names = c("7",
"8", "9"))
d2<-structure(list(chrom = c(10L, 10L, 11L), coord = c(3178881L,
38894616L, 104972190L), sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SA040", class = "factor"),
CM0001 = c(0.526806527, 0.009049774, 0.451612903), CM0002 = c(0.544927536,
0.002849003, 0.401617251), CM0003 = c(0.56561086, 0.002857143,
0.435318275)), .Names = c("chrom", "coord", "sID", "CM0001",
"CM0002", "CM0003"), class = "data.frame", row.names = c("4",
"5", "6"))
Then I would combine and reshape the data to make it easier to plot. Here's what I'd do:
library(reshape2) # for melt()
m1<-melt(d1, id.vars=c("chrom", "coord", "sID"))
m2<-melt(d2, id.vars=c("chrom", "coord", "sID"))
mm<-rbind(cbind(m1, s="T1"), cbind(m2, s="T2"))
mm$pos<-factor(paste(mm$chrom,mm$coord,sep=":"),
levels=do.call(paste, c(unique(mm[order(mm[[1]],mm[[2]]),1:2]), sep=":")))
I first melt the two input tables to turn columns into rows. Then I add a column to each table so I know where the data came from and rbind them together. And finally I do a bit of messy work to make a factor out of the chr/coord pairs sorted in the correct order.
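The factor step matters because factor() sorts its levels as text by default, so a label starting with chromosome 10 would be placed before one starting with chromosome 2. Here is a small sketch of the same ordering idea on toy positions (pos_df and its values are made up for illustration):

```r
# Toy positions, deliberately not in genomic order.
pos_df <- data.frame(chrom = c(2L, 10L, 10L),
                     coord = c(500L, 3178881L, 42L))

labels <- paste(pos_df$chrom, pos_df$coord, sep = ":")

# Order numerically by chrom, then coord, and use that as level order.
ord <- order(pos_df$chrom, pos_df$coord)
sorted_levels <- unique(labels[ord])

pos <- factor(labels, levels = sorted_levels)
levels(pos)
```

ggplot2 draws a discrete x axis in level order, so the boxplots then appear in genomic order.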
With all that done, I'll make the plot like
library(ggplot2)
ggplot(mm, aes(x=pos, y=value, color=s)) +
geom_boxplot(position="dodge")
and it looks like

Extracting values from R table within grouped values

I have the following table, ordered and grouped by first, second and Name.
myData <- structure(list(first = c(120L, 120L, 126L, 126L, 126L, 132L, 132L), second = c(1.33, 1.33, 0.36, 0.37, 0.34, 0.46, 0.53),
Name = structure(c(5L, 5L, 3L, 3L, 4L, 1L, 2L), .Label = c("Benzene",
"Ethene._trichloro-", "Heptene", "Methylamine", "Pentanone"
), class = "factor"), Area = c(699468L, 153744L, 32913L,
4948619L, 83528L, 536339L, 105598L), Sample = structure(c(3L,
2L, 3L, 3L, 3L, 1L, 1L), .Label = c("PO1:1", "PO2:1", "PO4:1"
), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -7L))
Within each group I want to extract the area that corresponds to the specific sample. Several groups don't have areas from all the samples, so if the sample isn't detected it should return "NA". Ideally, the final output should have a column for each sample.
I have tried the ifelse function to create one column to each sample:
PO1<-ifelse(myData$Sample=="PO1:1",myData$Area, "NA")
However, this doesn't take the group distribution into account. I want to do this, but within the group. Within each group (a group has equal values for the first, second and Name columns): if Sample is PO1:1, return Area, else NA.
For the first group:
structure(list(first = c(120L, 120L), second = c(1.33, 1.33),
Name = structure(c(1L, 1L), .Label = "Pentanone", class = "factor"),
Area = c(699468L, 153744L), Sample = structure(c(2L, 1L), .Label = c("PO2:1",
"PO4:1"), class = "factor")), .Names = c("first", "second", "Name",
"Area", "Sample"), class = "data.frame", row.names = c(NA, -2L))
The output should be:
structure(list(PO1.1 = NA, PO2.1 = 153744L, PO3.1 = NA, PO4.1 = 699468L), .Names =c("PO1.1", "PO2.1", "PO3.1", "PO4.1"), class = "data.frame", row.names = c(NA, -1L))
Any suggestion?
As in the example in the question, I am assuming Sample is a factor. If this is not the case, consider making it one.
First, let's clean up the column Sample to make its levels legal names, or else they might cause errors
levels(myData$Sample) <- make.names(levels(myData$Sample))
## DEFINE THE CUTS##
# Adjust these as necessary
#--------------------------
max.second <- 3 # max & min range of myData$second
min.second <- 0 #
sprd <- 0.15 # with spread for each group
#--------------------------
# we will cut the myData$second according to intervals, cut(myData$second, intervals)
intervals <- seq(min.second, max.second, sprd*2)
# Next, let's create a group column to split our data frame by
myData$group <- paste(myData$first, cut(myData$second, intervals), myData$Name, sep='-')
groups <- split(myData, myData$group)
samples <- levels(myData$Sample) ## I'm assuming not all samples are present in the example. Manually adjusting with: samples <- sort(c(samples, "PO3.1"))
# Apply over each group, then apply over each sample
myOutput <-
t(sapply(groups, function(g) {
#-------------------------------
# NOTE: If it's possible that within a group there is more than one Area per Sample, then we have to somehow allow for this. Hence the "paste(...)"
res <- sapply(samples, function(s) paste0(g$Area[g$Sample==s], collapse=" - ")) # allowing for multiple values
unlist(ifelse(res=="", NA, res))
## If there is (or should be) only one Area per Sample, then remove the two lines above and uncomment the two below:
# res <- sapply(samples, function(s) g$Area[g$Sample==s]) # <~~ This line will work when only one value per sample
# unlist(ifelse(res==0, NA, res))
#-------------------------------
}))
# Cleanup names
rownames(myOutput) <- paste("Group", 1:nrow(myOutput), sep="-") ## or whichever proper group name
# remove dummy column
myData$group <- NULL
Results
myOutput
PO1.1 PO2.1 PO3.1 PO4.1
Group-1 NA "153744" NA "699468"
Group-2 NA NA NA "32913 - 4948619"
Group-3 NA NA NA "83528"
Group-4 "536339" NA NA NA
Group-5 "105598" NA NA NA
You cannot really expect R to intuit that there is a fourth factor level between PO2 and PO4, now can you?
> reshape(inp, direction="wide", idvar=c('first','second','Name'), timevar="Sample")
first second Name Area.PO4:1 Area.PO2:1 Area.PO1:1
1 120 1.3 Pentanone 699468 153744 NA
3 126 0.4 Heptene 32913 NA NA
4 126 0.4 Heptene 4948619 NA NA
5 126 0.3 Methylamine 83528 NA NA
6 132 0.5 Benzene NA NA 536339
7 132 0.5 Ethene._trichloro- NA NA 105598
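If the missing sample should still get its own column, one option is to declare all the levels up front, since R only knows about levels it has been told about. A minimal base R sketch on a cut-down version of the data follows; the "PO3:1" level comes from the expected output in the question, and using sum to combine repeated areas is my own assumption.

```r
# Cut-down version of the question's data.
myData <- data.frame(
  Name   = c("Pentanone", "Pentanone", "Benzene"),
  Area   = c(699468L, 153744L, 536339L),
  Sample = c("PO4:1", "PO2:1", "PO1:1"),
  stringsAsFactors = FALSE
)

# Declare every sample, including the one that never occurs.
all_samples <- c("PO1:1", "PO2:1", "PO3:1", "PO4:1")
myData$Sample <- factor(myData$Sample, levels = all_samples)

# One row per Name, one column per level; empty cells are NA.
wide <- tapply(myData$Area, list(myData$Name, myData$Sample), sum)
wide
```

For the full data the grouping list would need first and second as well, for example through interaction() or a pasted key.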
