Read text file in R and convert it to a character object - r

I'm reading a text file like this in R 2.10.0.
248585_at 250887_at 245638_s_at AFFX-BioC-5_at
248585_at 250887_at 264488_s_at 245638_s_at AFFX-BioC-5_at AFFX-BioC-3_at AFFX-BioDn-5_at
248585_at 250887_at
Using the command
clusters<-read.delim("test",sep="\t",fill=TRUE,header=FALSE)
Now, I must pass every row in this file to a BioConductor function that takes only character vectors as input.
My problem is that using as.character on this "clusters" object turns everything into numeric strings.
> clusters[1,]
V1 V2 V3 V4 V5 V6 V7
1 248585_at 250887_at 245638_s_at AFFX-BioC-5_at
But
> as.character(clusters[1,])
[1] "1" "1" "2" "3" "1" "1" "1"
Is there any way to keep the original names and put them into a character vector?
In case it helps: my "clusters" object returned by read.delim is of type "list".
Thanks a lot :-)
Federico

By default (in R versions before 4.0.0), character columns are converted to factors. You can avoid this by setting the argument as.is=TRUE:
clusters <- read.delim("test", sep="\t", fill=TRUE, header=FALSE, as.is=TRUE)
If you only need to turn each line of the text file into a character vector, you could do something like:
x <- readLines("test")
xx <- strsplit(x,split="\t")
xx[[1]] # xx is a list
# [1] "248585_at" "250887_at" "245638_s_at" "AFFX-BioC-5_at"
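As a minimal sketch of how the as.is=TRUE approach could be used row by row: the temporary file, the shortened sample rows, and the helper name row_to_chars are assumptions made for illustration, not part of the original question.

```r
# Write two sample rows to a temporary tab-separated file (illustrative data).
path <- tempfile()
writeLines(c("248585_at\t250887_at\t245638_s_at\tAFFX-BioC-5_at",
             "248585_at\t250887_at"), path)

# as.is = TRUE keeps the columns as character; fill = TRUE pads short rows.
clusters <- read.delim(path, sep = "\t", fill = TRUE, header = FALSE,
                       as.is = TRUE)

# Hypothetical helper: return row i as a plain character vector,
# dropping the empty or NA padding that fill = TRUE adds to short rows.
row_to_chars <- function(df, i) {
  v <- unlist(df[i, ], use.names = FALSE)
  v[!is.na(v) & nzchar(v)]
}

row_to_chars(clusters, 2)
```

Each result can then be passed to a function that expects a character vector, for example inside lapply(seq_len(nrow(clusters)), ...).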

I never would have expected that to happen, but trying a small test case produces the same results you're giving.
Since the result of df[1,] is itself a data.frame, one fix I thought to try was to use unlist -- seems to work:
> df <- data.frame(a=LETTERS[1:10], b=LETTERS[11:20], c=LETTERS[5:14])
> df[1,]
a b c
1 A K E
> as.character(df[1,])
[1] "1" "1" "1"
> as.character(unlist(df[2,]))
[1] "B" "L" "F"
I think turning the data.frame into a matrix first would also get around this:
m <- as.matrix(df)
> as.character(m[2,])
[1] "B" "L" "F"
To avoid issues with factors in your data.frame you might want to set stringsAsFactors=FALSE when reading in your data from the text file, e.g.:
clusters <- read.delim("test", sep="\t", fill=TRUE, header=FALSE,
stringsAsFactors=FALSE)
And, after all that, the unexpected behavior seems to come from the fact that the original affy probes in your data.frame are treated as factors. So, doing the stringsAsFactors=FALSE thing will side-step the fanfare:
df <- data.frame(a=LETTERS[1:10], b=LETTERS[11:20],
c=LETTERS[5:14], stringsAsFactors=FALSE)
> as.character(df[1,])
[1] "A" "K" "E"

Related

Multiple gsub() expressions in R

I'm trying to clean a column of data from a data frame with many gsub commands.
Some examples would be:
df$col1<-gsub("-00070", "-0070", df$col1)
df$col1<-gsub("-00063", "-0063",df$col1)
df$col1<-gsub("F4", "FA", df$col1)
...
Looking at the column after running these lines of code, it looks like some of the changes have taken, but some have not. Moreover, if I run the block of code with the gsub() commands more changes start taking effect the more I run the block.
I'm very confused by this behavior, any information is appreciated.
There's probably a better way, but you could always use Map
new <- 1:3
old <- letters[1:3]
to.change <- letters[1:10]
Map(function(x, y) to.change <<- gsub(x, y, to.change), old, new)
to.change
# [1] "1" "2" "3" "d" "e" "f" "g" "h" "i" "j"
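A variation on the same idea that avoids the global assignment with <<- might look like the sketch below. It assumes the patterns are literal strings (hence fixed = TRUE) and that they should be applied in the order given; the sample values in col1 and the name replacements are made up for illustration.

```r
# Hypothetical pattern -> replacement pairs, applied in order.
replacements <- c("-00070" = "-0070", "-00063" = "-0063", "F4" = "FA")

# Illustrative input standing in for df$col1.
col1 <- c("AB-00070", "F4-00063", "ok")

# Reduce threads the vector through one gsub() call per pattern.
cleaned <- Reduce(
  function(acc, pat) gsub(pat, replacements[[pat]], acc, fixed = TRUE),
  names(replacements),
  col1
)
cleaned
```

Because each substitution works on the output of the previous one, the order of the pairs can change the result when one pattern overlaps another.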

How To Add Characters Around Elements

I am using R and want to convert the following:
"A,B"
to
"A","B" OR 'A','B'
I tried str_replace(), but that's not working out.
Please suggest, thanks.
Update
I tried the answer suggested by d.b. It works, but I didn't realize I should have mentioned that I am going to use the result as a vector. I need to split the values in data, such as "A,B", so that I can use them as a vector.
Using strsplit
> data
[1] "A,B"
> test <- strsplit(x = data, split = ",")
> test
[[1]]
[1] "A" "B"
The test object above isn't useful because I can't use it in the following:
> output_1 <- c(test)
> outputFinalData <- outputFinal[outputFinal$Column %in% output_1,]
outputFinalData is empty with above process. But is not empty when I do:
> output_2 <- c("A", "B")
> outputFinalData <- outputFinal[outputFinal$Column %in% output_2,]
Also, output_1 and output_2 are not the same:
> output_1
[[1]]
[1] "Bin_14" "Bin_15"
> output_2
[1] "Bin_14" "Bin_15"
> output_1 == output_2
[1] FALSE FALSE
Use strsplit:
> data = "A,B"
> strsplit(x=data,split=",")
[[1]]
[1] "A" "B"
Note that it returns a list with a vector. The list is length one because you asked it to split one string. If you ask it to split two strings you get a list of length 2:
> data = c("A,B","Foo,bar")
> strsplit(x=data,split=",")
[[1]]
[1] "A" "B"
[[2]]
[1] "Foo" "bar"
So if you know you are only going to have one thing to split you can get a vector of the parts by taking the first element:
> data = "A,B"
> strsplit(x=data,split=",")[[1]]
[1] "A" "B"
However it might be more efficient to do a load of splits in one go and put the bits in a matrix. As long as you can be sure everything splits into the same number of parts, then something like:
> data = c("A,B","Foo,bar","p1,p2")
> do.call(rbind,(strsplit(x=data,split=",")))
[,1] [,2]
[1,] "A" "B"
[2,] "Foo" "bar"
[3,] "p1" "p2"
>
Gets you the two parts in columns of a matrix that you can then add to a data frame if that's what you need.
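Applied to the filtering problem in the question, taking the first list element could be sketched as follows. The contents of outputFinal are invented here for illustration; only the column name Column and the values "Bin_14" and "Bin_15" come from the question.

```r
# The comma-separated string to split (values taken from the question's output).
data <- "Bin_14,Bin_15"

# [[1]] extracts the character vector from the length-one list.
wanted <- strsplit(data, split = ",", fixed = TRUE)[[1]]

# Hypothetical data frame standing in for outputFinal.
outputFinal <- data.frame(Column = c("Bin_13", "Bin_14", "Bin_15"),
                          value = 1:3)

# %in% now compares against a character vector rather than a list.
outputFinalData <- outputFinal[outputFinal$Column %in% wanted, ]
outputFinalData
```

unlist(strsplit(data, ",")) would give the same vector here, since there is only one string to split.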

How to create multiple columns filled with 0 in an xts object in R?

I would like to create multiple columns filled with zeroes in one xts object. Manually I can use this code :
> class(data)
[1] "xts" "zoo"
> colnames(data)
[1] "A" "B"
> data$C <- 0
> colnames(data)
[1] "A" "B" "C"
But unfortunately when in a for loop the i is interpreted as an object name instead of a variable.
> symbols
[1] "D" "E" "F"
for (i in symbols) {
data$i <- 0
}
> colnames(data)
[1] "A" "B" "C" "i"
When I use [[, the programmatic equivalent of $, colnames(data) returns NULL.
Finally I try with the apply family of functions like below but it doesn't work as expected.
> sapply(symbols, function(i) {data$i <- 0})
D E F
0 0 0
What could be the best solution to do this ?
Thanks in advance
I recommend creating a new xts object with the desired column names and values, then merge that with the original data.
require(xts)
data <- xts(cbind(A=1:5,B=5:1), Sys.Date()-5:1)
symbols <- LETTERS[3:6]
zeros <- xts(matrix(0,nrow(data),length(symbols)),
index(data), dimnames=list(NULL,symbols))
data <- merge(data, zeros)
That makes what is being done explicit, and therefore less confusing for future you. :)
You can try:
do.call(cbind,setNames(c(list(data),rep(list(0),length(symbols))),
c("",symbols)))

Sapply different than individual application of function

When applied individually to each element of the vector, my function gives a different result than using sapply. It's driving me nuts!
The object I'm using is this (simplified) list of the arguments another function was called with:
f <- as.list(match.call()[-1])
> f
$ampm
c(1, 4)
To replicate this you can run the following:
foo <- function(ampm) {as.list(match.call()[-1])}
f <- foo(ampm = c(1,4))
Here is my function. It just strips the 'c(...)' from a string.
stripConcat <- function(string) {
sub(')','',sub('c(','',string,fixed=TRUE),fixed=TRUE)
}
When applied alone it works like so, which is what I want:
> stripConcat(f)
[1] "1, 4"
But when used with sapply, it gives something totally different, which I do NOT want:
> sapply(f, stripConcat)
ampm
[1,] "c"
[2,] "1"
[3,] "4"
Lapply doesn't work either:
> lapply(f, stripConcat)
$ampm
[1] "c" "1" "4"
And neither do any of the other apply functions. This is driving me nuts--I thought lapply and sapply were supposed to be identical to repeated applications to the elements of the list or vector!
The discrepancy you are seeing, I believe, is simply due to how as.character coerces elements of a list.
x2 <- list(1:3, quote(c(1, 5)))
as.character(x2)
[1] "1:3" "c(1, 5)"
lapply(x2, as.character)
[[1]]
[1] "1" "2" "3"
[[2]]
[1] "c" "1" "5"
f is not a call, but a list whose first element is a call.
is(f)
[1] "list" "vector"
as.character(f)
[1] "c(1, 4)"
> is(f[[1]])
[1] "call" "language"
> as.character(f[[1]])
[1] "c" "1" "4"
sub attempts to coerce anything that is not a character vector into one.
When you pass sub a list, it calls as.character on the list.
When you pass it a call, it calls as.character on that call.
It looks like for your stripConcat function, you would prefer a list as input.
In that case, I would recommend the following for that function:
stripConcat <- function(string) {
if (!is.list(string))
string <- list(string)
sub(')','',sub('c(','',string,fixed=TRUE),fixed=TRUE)
}
Note, however, that string is a misnomer, since it doesn't appear that you are ever planning to pass stripConcat a string. (not that this is an issue, of course)
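If the goal is for sapply to give the same text as the direct call, one alternative (not the approach above, but a swapped-in technique) is to deparse each element before stripping. This is only a sketch: it assumes every element is a call like c(1, 4) and reuses the name stripConcat for a differently implemented function.

```r
# Reproduce the list of unevaluated arguments from the question.
foo <- function(ampm) as.list(match.call()[-1])
f <- foo(ampm = c(1, 4))

# deparse() turns a call back into its source text, so the element
# c(1, 4) becomes the single string "c(1, 4)" instead of "c", "1", "4".
stripConcat <- function(x) {
  s <- paste(deparse(x), collapse = "")
  sub(")", "", sub("c(", "", s, fixed = TRUE), fixed = TRUE)
}

sapply(f, stripConcat)
```

With this version the function is meant to be applied per element; passing the whole list f directly would deparse the list itself and give different text.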

R - How to get at a string from a single column and row in a data frame

So I'm trying to do these problems in R in order to learn it.
But I'm stuck on the first problem, which is simply to count the frequency of characters in a string. I can't even seem to get past loading the data and getting to the string :-(
How do I do something like print the first character of the string from this text file?
Here's what I've tried so far:
> rosalind_dna <- read.table("~/Downloads/rosalind_dna.txt", quote="")
Warning message:
In read.table("~/Downloads/rosalind_dna.txt", quote = "") :
incomplete final line found by readTableHeader on '~/Downloads/rosalind_dna.txt'
> viewData(rosalind_dna)
> str(rosalind_dna[1,1,1])
Factor w/ 1 level "GGCCCGGTTACTGCGACTGAACAATCAAAATCTGAAGCATTTAAGCCAAACCAATTGAGATCGACTTACGAGCGATAACCCAGTATATTCAAGTGCTACTGATGAGGCGTGGTCCCCTGGACAAGGC"| __truncated__: 1
What you've done so far is just fine.
read.table returns a data frame. In this case, you just get a data frame with a single column and only a single value in that column.
By default, R versions before 4.0.0 convert character columns in data frames to factors. You can convert the value back using as.character.
Then you'll simply want to split that single string into individual characters (strsplit) and then make a table (table). (No need for loops!)
Here's a toy example illustrating all the functions I mentioned:
> dat <- data.frame(V1 = factor("abcdfjtusje"))
> str(dat)
'data.frame': 1 obs. of 1 variable:
$ V1: Factor w/ 1 level "abcdfjtusje": 1
> x <- as.character(dat[1,1])
> x
[1] "abcdfjtusje"
> strsplit(x,"")
[[1]]
[1] "a" "b" "c" "d" "f" "j" "t" "u" "s" "j" "e"
> strsplit(x,"")[[1]]
[1] "a" "b" "c" "d" "f" "j" "t" "u" "s" "j" "e"
> table(strsplit(x,"")[[1]])
a b c d e f j s t u
1 1 1 1 1 1 2 1 1 1
>
I've copied the file in the link into /tmp/string.txt. This file has just a single line:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
We can read the file using the readLines command:
s = readLines("/tmp/string.txt")
The variable s is just a single string. To split up the bases, we use:
strsplit(s, "")
then tabulate using table:
table(strsplit(s, ""))
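Put together, the steps above could be sketched like this. To keep it self-contained, the sample line is written to a temporary file rather than /tmp/string.txt; the variable names bases and counts are chosen here for illustration.

```r
# Write the sample sequence to a temporary file (stand-in for the download).
path <- tempfile()
writeLines("AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC",
           path)

# Read the single line as one string.
s <- readLines(path, n = 1)

# Split into individual characters; [[1]] pulls the vector out of the list.
bases <- strsplit(s, "")[[1]]

# Tabulate how often each base occurs.
counts <- table(bases)
counts

# First character of the string.
substr(s, 1, 1)
```

Taking [[1]] explicitly makes the code behave predictably if s later holds more than one line, since strsplit then returns one list element per line.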
If you want to display the first character of the whole file, you can do the following:
s = readLines("Your file.txt",n=1)
substr(s, 1, 1)
To display the first character of every line:
s = readLines("Your file.txt")
substr(s, 1, 1)
To display n-th character of every line:
n = 5
s = readLines("Your file.txt")
substr(s, n, n)
You can use readLines and substr to solve the problem, but if you insist on taking the first character from a data frame, you can simply use
substr(dataframe$colname,1,1)
It will return a character vector.
