read.table() changes column names [duplicate]

Whenever I read in a file using read.csv() with option header=T, the headers change in weird (but predictable) ways. A header name which ought to read "P(A>B)" becomes "P.A.B.", for instance:
> # when header=F:
> myfile1 <- read.csv(fullpath,sep="\t",header=F,nrow=3)
> myfile1
V1 V2 V3
1 ID Name P(A>B)
2 AB001 Alice 0.997
3 AB002 Bob 0.497
>
> # When header=T:
> myfile2 <- read.csv(fullpath,sep="\t",header=T,nrow=3)
> myfile2
ID Name P.A.B.
1 AB001 Alice 0.997
2 AB002 Bob 0.497
3 AB003 Charles 0.732
I tried to fix it like this, but it didn't work:
> names(myfile2) <- myfile1[1,]
> myfile2
3 3 3
1 AB001 Alice 0.997
2 AB002 Bob 0.497
3 AB003 Charles 0.732
So then I tried to use sub() to write a function that would take any vector "arbitrary.lengths.here." and return a vector "arbitrary(lengths>here)", but I didn't really get anywhere, and I started to suspect that I was making this problem more complicated than it had to be.
How would you deal with this problem of headers? Was I on the right track with sub()?

Set check.names=FALSE in read.csv()
read.csv(fullpath, sep="\t", header=TRUE, nrow=3, check.names=FALSE)
From the help for ?read.csv:
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names. If necessary they are adjusted (by make.names) so that they are, and also to ensure that there are no duplicates.
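A minimal sketch of the effect, assuming a small tab-separated file like the one in the question (the temporary file and the variable names `mangled` and `kept` are made up for illustration):

```r
# Write a small tab-separated file with a non-syntactic header name.
path <- tempfile(fileext = ".tsv")
writeLines(c("ID\tName\tP(A>B)",
             "AB001\tAlice\t0.997",
             "AB002\tBob\t0.497"), path)

# Default: check.names = TRUE passes the headers through make.names().
mangled <- read.csv(path, sep = "\t", header = TRUE)

# check.names = FALSE keeps the headers exactly as written in the file.
kept <- read.csv(path, sep = "\t", header = TRUE, check.names = FALSE)

names(mangled)
names(kept)

# Non-syntactic names need [[ ]] or backticks when accessed.
kept[["P(A>B)"]]
```

Columns with names such as P(A>B) cannot be reached with a plain `$`; use `kept[["P(A>B)"]]` or backticks instead.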

Not really intended as an answer, but intended to be helpful to R newbs: those headers were read in as factors (and caused the third column to also be a factor). The screwy names() assignment probably used their integer storage mode. @Andrie has already given you the preferred solution, but if you wanted to just reassign the names (which would not undo the damage to the third column) you could use:
names(myfile1) <- scan(file=fullpath, what="character", nlines=1, sep="\t")
myfile1 <- myfile1[-1, ] # gets rid of unneeded line
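A self-contained sketch of reading only the header line with scan(), assuming the same tab-separated layout as in the question (the temporary file and the name `hdr` are illustrative). Note that `nlines = 1` limits the read to the first line, whereas `nmax = 1` would stop after a single value:

```r
# Recreate a file shaped like the one in the question.
path <- tempfile(fileext = ".tsv")
writeLines(c("ID\tName\tP(A>B)",
             "AB001\tAlice\t0.997",
             "AB002\tBob\t0.497"), path)

# Read just the first line as character fields; no name mangling happens here.
hdr <- scan(file = path, what = "character", nlines = 1, sep = "\t", quiet = TRUE)

# Read the data without a header, skipping the header line, then attach the names.
dat <- read.csv(path, sep = "\t", header = FALSE, skip = 1)
names(dat) <- hdr
names(dat)
```

Skipping the header line with `skip = 1` keeps the numeric column numeric, which avoids the damage to the third column described above.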

Related

R convert data into a vector

I have an object (a Seurat object) and I need to get certain data out of it:
> sc@misc[["colors"]][["seurat_clusters"]]
0 1 2 3 4 5 6 7
"#CC0C00FF" "#5C88DAFF" "#84BD00FF" "#FFCD00FF" "#7C878EFF" "#00B5E2FF" "#00AF66FF" "#CC0C00B2"
This data is needed as a vector, but I don't know how to pull "#CC0C00FF" "#5C88DAFF" etc. out of it.
In order to hand this data to the next function, the result should look like this:
> vec
[1] "#CC0C00FF" "#5C88DAFF" "#84BD00FF"
Thanks in advance!
Solved it! I'm pretty disappointed in myself, because I didn't know this function existed:
> as.vector(sc@misc[["colors"]][["seurat_clusters"]])
[1] "#CC0C00FF" "#5C88DAFF" "#84BD00FF" "#FFCD00FF" "#7C878EFF" "#00B5E2FF" "#00AF66FF" "#CC0C00B2"
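For illustration, here is a small sketch that does not need Seurat: the object in the question prints like a named character vector, so a stand-in named vector (`cols`, a made-up name) is used. Both as.vector() and unname() drop the names:

```r
# Stand-in for the named colour vector stored in the Seurat object.
cols <- c("0" = "#CC0C00FF", "1" = "#5C88DAFF", "2" = "#84BD00FF")

vec1 <- as.vector(cols)  # strips attributes, including names
vec2 <- unname(cols)     # removes only the names

vec1
identical(vec1, vec2)
```

unname() states the intent more directly when the only goal is to drop the names.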

subset function returns all rows

I recently reverted to R version 3.1.3 for compatibility reasons and am now encountering an unexplained error with the subset function.
I want to extract all rows for the gene "Migut.A00003" from the data frame transcr_effects using the gene name as listed in the data frame expr_mim_genes. (this will later become a loop). This action always returns all rows instead of specific rows I am looking for, no matter the formatting of the subset lookup:
> class(expr_mim_genes)
[1] "data.frame"
> sapply(expr_mim_genes, class)
gene longest.tr pair.length
"character" "logical" "numeric"
> head(expr_mim_genes)
gene longest.tr pair.length
1 Migut.A00003 NA 0
2 Migut.A00006 NA 0
3 Migut.A00007 NA 0
4 Migut.A00012 NA 0
5 Migut.A00014 NA 0
6 Migut.A00015 NA 0
> class(transcr_effects)
[1] "data.frame"
> sapply(transcr_effects, class)
pair gene
"character" "character"
> head(transcr_effects)
pair gene
1 pair1 Migut.N01020
2 pair10 Migut.A00351
3 pair1000 Migut.F00857
4 pair10007 Migut.D01637
5 pair10008 Migut.A00401
6 pair10009 Migut.G00442
. . .
7168 pair3430 Migut.A00003
. . .
The gene I am interested in:
> expr_mim_genes[1,"gene"]
[1] "Migut.A00003"
R sees these two terms as equivalent:
> expr_mim_genes[1,"gene"] == "Migut.A00003"
[1] TRUE
If I type in the name of the gene manually, the correct number of rows are returned:
> nrow(subset(transcr_effects, transcr_effects$gene=="Migut.A00003"))
[1] 1
> subset(transcr_effects, transcr_effects$gene=="Migut.A00003")
pair gene
7168 pair3430 Migut.A00003
However, this should return one row from the data.frame but it returns all rows:
> nrow(subset(transcr_effects, transcr_effects$gene == (expr_mim_genes[1,"gene"])))
[1] 10122
I have a feeling this has something to do with text formatting, but I've tried everything and haven't been able to figure it out. I've seen this issue with quoted vs. unquoted entries, but it does not appear to be the issue here (see the equality check above).
I didn't have this problem before switching to R v.3.1.3, so maybe it is a version convention I am unaware of?
EDIT:
This is driving me crazy, but at least I think I have found a patch. There was quite a bit of data and file processing to get to this point in the code, involving loading at least 4 files. I've tried taking snippets of each file to post a reproducible example here, but sometimes when I analyze the snippets the error recurs, sometimes it does not (!!). After going through the process though, I discover that:
i = 1
gene = expr_mim_genes[i,"gene"]
> nrow(subset(transcr_effects, gene == gene))
[1] 10122
> nrow(subset(transcr_effects, gene == (expr_mim_genes[i,"gene"])))
[1] 1
I still can't explain this behavior of the code, but at least I know how to work around it.
Thanks all.
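The behaviour in the EDIT is consistent with how subset() evaluates its condition: names are looked up in the data frame first, so in `gene == gene` both sides refer to the column `gene` and every row matches. A minimal sketch with made-up rows (the data frame below and the name `target` are illustrative, not the asker's data):

```r
transcr_effects <- data.frame(
  pair = c("pair1", "pair10", "pair3430"),
  gene = c("Migut.N01020", "Migut.A00351", "Migut.A00003"),
  stringsAsFactors = FALSE
)

gene <- "Migut.A00003"

# Both sides resolve to the column, so the condition is TRUE for every row.
n_all <- nrow(subset(transcr_effects, gene == gene))

# A variable name that does not clash with a column behaves as expected.
target <- gene
n_one <- nrow(subset(transcr_effects, gene == target))

# Plain indexing evaluates `gene` in the calling environment, avoiding the clash.
n_idx <- nrow(transcr_effects[transcr_effects$gene == gene, ])

c(n_all, n_one, n_idx)
```

Inside a loop, a name such as `target` that is not a column of the data frame, or plain `[` indexing, sidesteps the problem.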

Why does my data lose its dimensions when I use lapply?

I'm reading lots of files from a directory and doing some computation on each file.
Because I want to make my script parallel, I use lapply. When I look at the dimensions of the data frame in each element of the list, the column count has become 1.
Could someone help me fix it?
Here is my attempt:
files <- list.files(path="path to file")
dfr <- lapply(files, function(x) read.table(x,header=T,sep="\n"))
for(i in dfr){
  # do some computation
  if (ncol(i) > 1){
    y <- as.matrix(i[1])
    x <- as.matrix(i[2:ncol(i)])
  }
  # ...
}
#
> i
[[1]]
ACAA2.hsa.miR.124.AGO2.hsa.miR.124.AGO134
1 7.6561 18.5924339201 23.4560035028
2 7.2355 26.2524888635 33.6513700944
3 7.365 23.6841865928 28.2168475593
4 8.4768 22.4003094419 28.0983702155
5 5.5838 20.4838449736 26.8616064228
6 7.3123 20.8488005184 26.9155966811
7 7.2345 21.5272944711 26.2954400309
8 7.05 23.3113502366 29.3856555269
> dim(i[1])
NULL
> dim(i[[1]])
[1] 67 1
> a<-i[[1]]
> dim(a)
[1] 67 1
> a
ACAA2.hsa.miR.124.AGO2.hsa.miR.124.AGO134
1 7.6561 18.5924339201 23.4560035028
2 7.2355 26.2524888635 33.6513700944
3 7.365 23.6841865928 28.2168475593
4 8.4768 22.4003094419 28.0983702155
5 5.5838 20.4838449736 26.8616064228
6 7.3123 20.8488005184 26.915596681
but I would expect
> dim(a)
[1] 67 3
Because I lose the dimensions of the data, my for loop doesn't work.
Your problem is not the for loop or the lapply call but your read.table command: you use sep="\n" where the fields are actually separated by spaces.
?read.table shows that the sep argument is the field separator character. Your field separator appears to be a space, so calling read.table without specifying sep (the default splits on any whitespace) should work.
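A small sketch of the difference, assuming a space-separated file with three columns (the header names and the temporary file are guesses based on the mangled column name in the question):

```r
# Recreate a small space-separated file.
path <- tempfile(fileext = ".txt")
writeLines(c("ACAA2 hsa-miR-124-AGO2 hsa-miR-124-AGO134",
             "7.6561 18.5924339201 23.4560035028",
             "7.2355 26.2524888635 33.6513700944"), path)

# sep = "\n" treats each whole line as a single field: one column.
one_col <- read.table(path, header = TRUE, sep = "\n")

# The default sep = "" splits on whitespace: three columns.
three_col <- read.table(path, header = TRUE)

dim(one_col)
dim(three_col)
```

With the default separator the loop's `ncol(i) > 1` branch is reached and the columns can be split into `y` and `x` as intended.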

R correct use of read.csv

I must be misunderstanding how read.csv works in R. I have read the help file, but still do not understand how a csv file containing:
40900,-,-,-,241.75,0
40905,244,245.79,241.25,244,22114
40906,244,246.79,243.6,245.5,18024
40907,246,248.5,246,247,60859
read into R using: euk<-data.matrix(read.csv("path\to\csv.csv"))
produces this as a result (using tail):
Date Open High Low Close Volume
[2713,] 15329 490 404 369 240.75 62763
[2714,] 15330 495 409 378 242.50 127534
[2715,] 15331 1 1 1 241.75 0
[2716,] 15336 504 425 385 244.00 22114
[2717,] 15337 504 432 396 245.50 18024
[2718,] 15338 512 442 405 247.00 60859
It must be something obvious that I do not understand. Please be kind in your responses, I am trying to learn.
Thanks!
The issue is not with read.csv, but with data.matrix. read.csv imports any column with characters in it as a factor (this was the default before R 4.0.0, where stringsAsFactors was TRUE). The '-' entries in the first row of your dataset are characters, so those columns are converted to factors. When you then pass the result of read.csv into data.matrix, it replaces the levels of each factor with its internal codes, as the help states.
Basically, you need to ensure that the columns of your data are numeric before you pass the data.frame into data.matrix.
This should work in your case (assuming the only characters are '-'):
euk <- data.matrix(read.csv("path/to/csv.csv", na.strings = "-", colClasses = 'numeric'))
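To check this without the original file, here is a sketch on a temporary CSV built from the rows in the question (the header line is taken from the column names in the printed output; the file itself is illustrative):

```r
# Recreate a few rows of the CSV, including the '-' placeholders.
path <- tempfile(fileext = ".csv")
writeLines(c("Date,Open,High,Low,Close,Volume",
             "40900,-,-,-,241.75,0",
             "40905,244,245.79,241.25,244,22114",
             "40906,244,246.79,243.6,245.5,18024"), path)

# Treat '-' as missing and force every column to numeric before converting.
euk <- data.matrix(read.csv(path, na.strings = "-", colClasses = "numeric"))
euk
```

The '-' cells come through as NA rather than as factor codes, and the remaining values keep their real magnitudes.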
I'm no R expert, but you may consider using scan() instead, eg:
> data = scan("foo.csv", what = list(x = numeric(), y = numeric()), sep = ",")
Where foo.csv has two columns, x and y, and is comma delimited. I hope that helps.
I took a cut/paste of your data, put it in a file and I get this using 'R'
> c<-data.matrix(read.csv("c:/DOCUME~1/Philip/LOCALS~1/Temp/x.csv",header=F))
> c
V1 V2 V3 V4 V5 V6
[1,] 40900 1 1 1 241.75 0
[2,] 40905 2 2 2 244.00 22114
[3,] 40906 2 3 3 245.50 18024
[4,] 40907 3 4 4 247.00 60859
>
There must be more in your data file, for one thing, data for the header line. And the output you show seems to start with row 2713. I would check:
The format of the header line, or get rid of it and add it manually later.
That each row has exactly 6 values.
That the filename uses forward slashes and has no embedded spaces (use the 8.3 representation as shown in my filename).
Also, if you generated your csv file from MS Excel, the internal representation for a date is a number.

store summary output in a list of tables or matrix

How can I read the following character vector "c" into a list of tables? Which is the shortest way: read.table or strsplit? For example, I can't see how to read the table in c[4:6] in one command.
require(car)
m<-matrix(rnorm(16),4,4,byrow=T)
a<-Anova(lm(m~1),type=3,idata=data.frame(treatment=factor(1:4)),idesign=~treatment)
c<-capture.output(summary(a,multivariate=F))
c
This returns lines 4:6
c[4:6]
Now if you wanted to parse this I would do it in two steps. First on the column values from rows 5:6 and then add back the names.
> vals <- read.table(text=c[5:6])
> txt <- " \t SS\t num Df\t Error SS\t den Df\t F\t Pr(>F)"
> names(vals) <- names(read.delim(text=txt))
> vals
X SS num.Df Error.SS den.Df F Pr..F.
1 (Intercept) 0.57613392 1 0.4219563 3 4.09616 0.13614
2 treatment 1.85936442 3 8.2899759 9 0.67287 0.58996
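As a hedged, self-contained illustration of the same two-step idea without the car dependency: the two data lines below are copied from the printed result, and the column names are assigned by hand so that names such as Pr(>F) are not mangled (`out_lines` and the label "Term" are made-up names):

```r
# Two captured output lines, standing in for c[5:6].
out_lines <- c(
  "(Intercept) 0.57613392 1 0.4219563 3 4.09616 0.13614",
  "treatment 1.85936442 3 8.2899759 9 0.67287 0.58996"
)

# Step 1: parse the whitespace-separated values.
vals <- read.table(text = out_lines, stringsAsFactors = FALSE)

# Step 2: attach the column names directly, bypassing make.names().
names(vals) <- c("Term", "SS", "num Df", "Error SS", "den Df", "F", "Pr(>F)")
vals
```

Assigning names directly keeps them exactly as written; passing them through read.delim() applies check.names and yields forms such as Pr..F.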
EDIT --
You could look at the source code of the summary function and calculate the required quantities yourself:
getAnywhere(summary.Anova.mlm)
The original idea seems not to work.
c2 <- summary(a)
# find out what 'properties' the summary object has
# turns out, it is just the Anova object
class(c2) <- "list"
names(c2)
This returns
[1] "SSP" "SSPE" "P" "df" "error.df"
[6] "terms" "repeated" "type" "test" "idata"
[11] "idesign" "icontrasts" "imatrix" "singular"
and we can access them:
c2$SSP
c2$SSPE
Also, it is not a good idea to use the name of R's built-in c function as a variable name.
