Concatenating two vectors in R [duplicate]

This question already has answers here:
How to concatenate factors, without them being converted to integer level?
(9 answers)
Closed 5 years ago.
I want to concatenate two vectors one after the other in R. I have written the following code to do it:
> a = head(tracks_listened_train)
> b = head(tracks_listened_test)
> a
[1] cc1a46ee0446538ecf6b65db01c30cd8 19acf9a5cbed34743ce0ee42ef3cae3e
[3] 9e7fdbf2045c9f814f6c0bed5da9bed7 3441b1031267fbb6009221bf47f9c5e8
[5] 206c8b79bd02beeea200879afc414879 1a7a95e3845a6815060628e847d14362
18585 Levels: 0001a423baf29add84af6ec58aeb5b90 ...
> b
[1] 7251a7694b79aa9a39f9a1a5f5c8a253 2f362377ef0e7bca112233fdda22a79c
[3] c1196625b1b733b62c43935334e1d190 58e41e462af4185b08231a41453c3faf
[5] 1cc2517fa9c037e02a14ce0950a28f67
10186 Levels: 0001a423baf29add84af6ec58aeb5b90 ...
> res = c(a,b)
> res
[1] 14898 1898 11556 3859 2408 1950 4473 1865 7674 3488 1130
However, the resulting vector contains integers instead of the original values. What could the problem be?

We need to convert the factors to character vectors before concatenating:
c(as.character(a), as.character(b))
We get numbers instead of strings because a factor is stored as an integer vector of level codes. When c() concatenates the two factors here, it drops the factor attributes and returns those integer codes.
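To make the explanation above concrete, here is a minimal sketch with two small made-up factors (the values "x1", "x2", "y1" are placeholders, not the track IDs from the question). It shows the integer codes behind each factor and the as.character() workaround. As far as I know, c() on factors returns a factor with combined levels in R 4.1.0 and later, so the integer result in the question would only be seen on older versions; the as.character() route should behave the same on both.

```r
# Two small illustrative factors (placeholder values, not the original data)
a <- factor(c("x1", "x2"))
b <- factor(c("y1", "x2"))

# The integer codes that a factor stores internally
as.integer(a)
as.integer(b)

# Convert to character first, then concatenate
res <- c(as.character(a), as.character(b))
res

# If a factor is needed afterwards, rebuild it from the combined strings
res_factor <- factor(res)
levels(res_factor)
```

Converting to character before combining avoids depending on how a particular R version handles c() on factors.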

Related

Creating Vectors sequence in R [duplicate]

This question already has answers here:
Alternate, interweave or interlace two vectors
(2 answers)
Closed 1 year ago.
I want to write an R program that creates the vector 0.1^3, 0.2^1, 0.1^6, 0.2^4, ..., 0.1^36, 0.2^34.
v=c(seq(3,36,3))
w=c(seq(1,34,3))
x=c(0.1^v)
y=c(0.2^w)
z=c(x,y)
Please help.
rbind to a matrix and convert to vector again:
c(rbind(x, y))
Or more directly:
rep(c(0.1, 0.2), 12)^c(rbind(seq(3,36,3), seq(1,34,3)))
You can use matrix to create the desired vector.
c(matrix(z, 2, byrow=TRUE))
# [1] 1.000000e-03 2.000000e-01 1.000000e-06 1.600000e-03 1.000000e-09
# [6] 1.280000e-05 1.000000e-12 1.024000e-07 1.000000e-15 8.192000e-10
#[11] 1.000000e-18 6.553600e-12 1.000000e-21 5.242880e-14 1.000000e-24
#[16] 4.194304e-16 1.000000e-27 3.355443e-18 1.000000e-30 2.684355e-20
#[21] 1.000000e-33 2.147484e-22 1.000000e-36 1.717987e-24
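The answers above rely on R storing matrices in column-major order: once x and y are the two rows of a matrix, flattening it with c() alternates between them. A minimal sketch, reusing the exponents from the question (the names interleaved and direct are only illustrative):

```r
v <- seq(3, 36, 3)   # exponents for 0.1
w <- seq(1, 34, 3)   # exponents for 0.2
x <- 0.1^v
y <- 0.2^w

# rbind() puts x in row 1 and y in row 2 of a 2 x 12 matrix;
# c() then reads the matrix column by column: x[1], y[1], x[2], y[2], ...
interleaved <- c(rbind(x, y))

# The same vector built in one expression by interleaving the exponents
direct <- rep(c(0.1, 0.2), 12)^c(rbind(v, w))

length(interleaved)
all.equal(interleaved, direct)
```

Note that this pattern assumes both vectors have the same length; with unequal lengths rbind() recycles the shorter one.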

How to concatenate (merge) AAStringSets by name?

In bioinformatics/microbial ecology literature a fairly common practice is to concatenate multiple sequence alignments of multiple genes prior to building phylogenetic trees. In R terminology it may be clearer to say 'merge' these sequences by the organism they came from, but I'm sure examples are better.
Say these are two multiple sequence alignments.
library(Biostrings)
set1<-AAStringSet(c("IVR", "RDG", "LKS"))
names(set1)<-paste("org", 1:3, sep="_")
set2<-AAStringSet(c("VRT", "RKG", "AST"))
names(set2)<-paste("org", 2:4, sep="_")
set1
A AAStringSet instance of length 3
width seq names
[1] 3 IVR org_1
[2] 3 RDG org_2
[3] 3 LKS org_3
set2
A AAStringSet instance of length 3
width seq names
[1] 3 VRT org_2
[2] 3 RKG org_3
[3] 3 AST org_4
The correct concatenation of these sequences would be
A AAStringSet instance of length 4
width seq names
[1] 6 IVR--- org_1
[2] 6 RDGVRT org_2
[3] 6 LKSRKG org_3
[4] 6 ---AST org_4
The "-" denotes a 'gap' (lack of an amino acid) in that position, or in this case the lack of a gene to concatenate.
I thought there would be a function to do this in Biostrings, MSA, DECIPHER, or other related packages, but I have been unable to find one.
I found the following Q&As, but none of them provides the desired output as described.
1: https://support.bioconductor.org/p/38955/
output
A AAStringSet instance of length 6
width seq names
[1] 3 IVR org_1
[2] 3 RDG org_2
[3] 3 LKS org_3
[4] 3 VRT org_2
[5] 3 RKG org_3
[6] 3 AST org_4
This may be better described as 'appending' the sequences (it joins the two sets vertically).
2: https://support.bioconductor.org/p/39878/
output
A AAStringSet instance of length 2
width seq
[1] 9 IVRRDGLKS
[2] 9 VRTRKGAST
This concatenates the sequences within each set, producing a complete chimera of each set (certainly not desired).
3: How to concatenate two DNAStringSet sequences per sample in R?
output
A AAStringSet instance of length 3
width seq
[1] 6 IVRVRT
[2] 6 RDGRKG
[3] 6 LKSAST
This creates chimeras of sequences according to the order they are in, not their names. It is even worse with a different number of sequences in each set (loops and concatenates shorter set...)
4: https://www.biostars.org/p/115192/
Output
A AAStringSet instance of length 2
width seq
[1] 3 IVR
[2] 3 VRT
This only appends the first sequence from each set; I am not sure why anyone would want this...
I would normally think these kinds of processes would be done with some combination of bash and Python, but I'm using the DECIPHER multiple sequence aligner in R, so it makes sense to do the rest of the processing in R. In the process of writing up this question I came up with an answer that I will post, but I'm kind of expecting someone to point me to the manual I missed that describes the function that does this. Thanks!
I am a somewhat fanatical user of data.table in R; among many other things, it is great for merging datasets by name. I found that Biostrings AAStringSets can be converted to matrices using as.matrix, and these can be converted to data.tables and merged.
library(data.table)
set1.dt<-data.table(as.matrix(set1), keep.rownames = TRUE)
set2.dt<-data.table(as.matrix(set2), keep.rownames = TRUE)
set12.dt<-merge(set1.dt, set2.dt, by="rn", all=TRUE)
set12.dt
rn V1.x V2.x V3.x V1.y V2.y V3.y
1: org_1 I V R <NA> <NA> <NA>
2: org_2 R D G V R T
3: org_3 L K S R K G
4: org_4 <NA> <NA> <NA> A S T
This is the correct merge, but it needs more work to get the final result.
First, the NA values need to be replaced with "-". I always need to look up this question to remember the best way to do this with a data.table:
Fastest way to replace NAs in a large data.table
# slightly modified from the original: added the argument "x"
f_dowle = function(dt, x) { # see EDIT later for more elegant solution
  na.replace = function(v,value=x) { v[is.na(v)] = value; v }
  for (i in names(dt))
    eval(parse(text=paste("dt[,",i,":=na.replace(",i,")]")))
}
f_dowle(set12.dt, "-")
Concatenate the sequences (the names column is excluded with !"rn"):
set12<-apply(set12.dt[ ,!"rn"], 1, paste, collapse="")
Convert back to AAStringSet and add back names
set12<-AAStringSet(set12)
names(set12)<-set12.dt$rn
Desired output
set12
A AAStringSet instance of length 4
width seq names
[1] 6 IVR--- org_1
[2] 6 RDGVRT org_2
[3] 6 LKSRKG org_3
[4] 6 ---AST org_4
This works, but it seems quite cumbersome, especially the conversion between different data formats. Obviously I can wrap it into a function to use it more easily, but again it seems like this should already be a function in some Bioconductor package...
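As a rough illustration of what such a wrapper could look like, here is a sketch in base R that works on named character vectors rather than on AAStringSet objects. The function name concat_by_name and its gap argument are hypothetical, not part of Biostrings or any other package, and it assumes every sequence within a set has the same width (as in an alignment). With Biostrings loaded, the inputs could presumably be obtained with as.character(set1) and the result converted back with AAStringSet(), but that round trip is an assumption and is not run here.

```r
# Hypothetical helper: concatenate equal-width sequences by name,
# filling in gap characters where a set has no sequence for a name.
concat_by_name <- function(..., gap = "-") {
  sets <- unname(list(...))
  # All organism names, in order of first appearance
  all_names <- unique(unlist(lapply(sets, names)))
  padded <- lapply(sets, function(s) {
    width <- nchar(s[[1]])             # assumes equal widths within a set
    out <- s[all_names]                # NA where the name is missing
    out[is.na(out)] <- strrep(gap, width)
    unname(out)
  })
  res <- do.call(paste0, padded)       # paste the sets together element-wise
  names(res) <- all_names
  res
}

# Plain named character vectors standing in for the two alignments
set1 <- c(org_1 = "IVR", org_2 = "RDG", org_3 = "LKS")
set2 <- c(org_2 = "VRT", org_3 = "RKG", org_4 = "AST")

set12 <- concat_by_name(set1, set2)
set12
```

Because the helper takes any number of sets through its dots argument, more than two alignments could be combined in one call under the same assumptions.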

R "read.table" function gives only odd numbered columns while merging the even numbered columns

I am trying to read a TSV file in R using the read.table function.
myTable <- read.table("file_path", sep='\t', header=T)
But when I try the command
names(myTable)
it appears to give me only the odd-numbered column names, with the even-numbered column names merged into them:
[1] "GeneSymbol" "GSM480304_JK_C_05.07.mas5.chp"
[3] "GSM480355_JK_C_05.07.mas5.chp" "GSM480480_JK_C_05.07.mas5.chp"
[5] "GSM480555_JK_C_05.07.mas5.chp" "GSM480634_JK_C_05.07.mas5.chp"
These are the exact column names. You can see that each line holds two column names separated by a space, while only the odd-numbered indices are listed.
The output should be like this:
[1] "GeneSymbol"
[2] "GSM480304_JK_C_05.07.mas5.chp"
[3] "GSM480355_JK_C_05.07.mas5.chp"
[4] "GSM480480_JK_C_05.07.mas5.chp"
[5] "GSM480555_JK_C_05.07.mas5.chp"
[6] "GSM480634_JK_C_05.07.mas5.chp"
This is creating a problem when assigning names to another table where I want to use these column names. Any suggestions?
As noted in the comments, R is displaying all the columns, but not in the format you expect. A one-name-per-line display can be forced by wrapping the result of names() in as.data.frame(), as follows:
rawData <- "
Number,Name,Type1,Type2,Total,HP,Attack,Defense,SpecialAtk,SpecialDef,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
7,Squirtle,Water,,314,44,48,65,50,64,43,1,False
8,Wartortle,Water,,405,59,63,80,65,80,58,1,False
9,Blastoise,Water,,530,79,83,100,85,105,78,1,False"
gen01 <- read.csv(text=rawData,header=TRUE)
as.data.frame(names(gen01))
...and the output:
> as.data.frame(names(gen01))
names(gen01)
1 Number
2 Name
3 Type1
4 Type2
5 Total
6 HP
7 Attack
8 Defense
9 SpecialAtk
10 SpecialDef
11 Speed
12 Generation
13 Legendary
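The underlying point is that the bracketed number at the start of each printed line is just the index of the first element on that line; the vector itself holds every name separately. A small sketch, with the six column names from the question typed in by hand (col_names is an illustrative variable, standing in for names(myTable)):

```r
# The six column names from the question, entered manually for illustration
col_names <- c("GeneSymbol",
               "GSM480304_JK_C_05.07.mas5.chp",
               "GSM480355_JK_C_05.07.mas5.chp",
               "GSM480480_JK_C_05.07.mas5.chp",
               "GSM480555_JK_C_05.07.mas5.chp",
               "GSM480634_JK_C_05.07.mas5.chp")

# All six names are present, regardless of how print() wraps them
n_names <- length(col_names)
n_names

# Even-numbered elements can be indexed like any other
second_name <- col_names[2]
second_name

# One name per line, without converting to a data frame
writeLines(col_names)
```

Since the names are ordinary elements of a character vector, assigning them to another table should work directly, for example with names(otherTable) <- names(myTable), where otherTable is a hypothetical table with the same number of columns.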

How to keep only the date variable without the time-field string [duplicate]

This question already has answers here:
Date conversion from POSIXct to Date in R
(3 answers)
Closed 5 years ago.
My question is a more general version of the one asked here: How to remove time-field string from a date-as-character variable?.
Suppose I have this date variable:
> head(DataDia$Date)
[1] "2016-09-13 15:56:30.827" "2016-12-12 13:39:17.537" "2016-09-16 21:57:24.977" "2016-09-23 11:19:22.010"
[5] "2017-01-11 20:06:58.490" "2016-10-21 23:40:43.927"
How do I delete all the time-field strings and keep just the date, so that I get this:
> head(DataDia$Date)
[1] "2016-09-13" "2016-12-12" "2016-09-16" "2016-09-23"
[5] "2017-01-11" "2016-10-21"
Please note that I am working on a data table, so I need a way to do this using data.table operations.
Just use as.Date(DataDia$Date).
You can use:
as.POSIXct(Df$Date,format='%Y-%m-%d',tz= "UTC")
Combining as.Date and as.character
x = c("2016-09-13 15:56:30.827", "2016-12-12 13:39:17.537", "2016-09-16 21:57:24.977", "2016-09-23 11:19:22.010",
"2017-01-11 20:06:58.490", "2016-10-21 23:40:43.927")
y = as.character(as.Date(x, format = "%Y-%m-%d"))
y
[1] "2016-09-13" "2016-12-12" "2016-09-16" "2016-09-23" "2017-01-11" "2016-10-21"
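A short sketch of the as.Date approach applied to a whole column, using a plain data frame with two of the timestamps from the question so that it runs without extra packages. The data.table form appears only as a comment and assumes DataDia is a data.table with a character column named Date.

```r
# Two of the timestamps from the question, stored as character strings
DataDia <- data.frame(
  Date = c("2016-09-13 15:56:30.827", "2016-12-12 13:39:17.537"),
  stringsAsFactors = FALSE
)

# Parse only the leading year-month-day part; the time portion is ignored
DataDia$Date <- as.Date(DataDia$Date, format = "%Y-%m-%d")

class(DataDia$Date)
format(DataDia$Date)

# With data.table, the equivalent update by reference would be (assumption:
# DataDia is a data.table with a character column Date):
# DataDia[, Date := as.Date(Date, format = "%Y-%m-%d")]
```

Keeping the column as class Date rather than converting back to character preserves date arithmetic and sorting; wrap the result in as.character() only if strings are really needed.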

R: order a vector of strings with both character and numeric values both alphabetically and numerically

I have a vector of strings that contain both character and numeric values. For example:
a=c("ILLUMINA:420:C2D7UACXX:1:1102:14591:91480","ILLUMINA:420:C2D7UACXX:1:1102:14592:3881","ILLUMINA:420:C2D7UACXX:1:1102:14592:37103","ILLUMINA:420:C2D7UACXX:1:1102:14592:37356")
I'd like to order the vector so that the characters are sorted alphabetically and the numbers numerically. The structure of the strings is always of the format:
"ILLUMINA:420:C2D7UACXX:1:<number>:<number>:<number>", so actually the order only applies to the last three colon separated numbers.
I did try mixedsort {gtools}, but the result was the same as using sort and sort.int, which is:
> mixedsort(a)
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480" "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103"
[3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356" "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
Clearly the right order should be:
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480" "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
[3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103" "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
Is there any immediate solution?
EDIT: I completely changed the solution after the OP's clarification.
You can extract the last 3 fields into a data.frame and order by them:
dat = read.table(text=sub('.*:1:([0-9]+):([0-9]+):([0-9]+)','\\1|\\2|\\3',a),sep='|')
dat
V1 V2 V3
1 1102 14591 91480
2 1102 14592 3881
3 1102 14592 37103
4 1102 14592 37356
Then you order using 3 columns:
a[with(dat,order(V1,V2,V3))]
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480" "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
[3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103" "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
gtools::mixedsort does work in your case, actually:
> a=c("ILLUMINA:420:C2D7UACXX:1:1102:14591:91480",
"ILLUMINA:420:C2D7UACXX:1:1102:14592:3881",
"ILLUMINA:420:C2D7UACXX:1:1102:14592:37103",
"ILLUMINA:420:C2D7UACXX:1:1102:14592:37356")
>
> mixedsort(a)
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480"
[2] "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
[3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103"
[4] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
I am using gtools_3.4.2 and R-3.2.0
Here's a faster solution:
library(data.table)
fields.list = strsplit(a,split=":")
sort.dt = data.table(t(sapply(fields.list,function(x) as.numeric(c(x[5],x[6],x[7])))))
sorted.a = a[with(sort.dt,order(V1,V2,V3))]
> sorted.a
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480" "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881" "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103"
[4] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
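The same split-then-order idea can also be sketched in base R without data.table. The input below is deliberately shuffled so the effect of the ordering is visible; the names fields, keys and sorted_a are illustrative, and the sketch assumes every string has exactly seven colon-separated fields with numeric values in fields 5 to 7, as described in the question.

```r
# The four read names from the question, in a deliberately shuffled order
a <- c("ILLUMINA:420:C2D7UACXX:1:1102:14592:37356",
       "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881",
       "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480",
       "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103")

# Split on ":" and bind the pieces into a character matrix (one row per string)
fields <- do.call(rbind, strsplit(a, ":", fixed = TRUE))

# Convert the last three fields to numbers so they compare numerically
keys <- apply(fields[, 5:7], 2, as.numeric)

# Order by field 5, then field 6, then field 7
sorted_a <- a[order(keys[, 1], keys[, 2], keys[, 3])]
sorted_a
```

Comparing the fields as numbers rather than as strings is what places 3881 before 37103.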

Resources