I'm using the sva package in R. dat is a csv file containing genes in rows and samples in columns. The file SIF.csv contains only 3 columns: array, sample, and batch.
http://www.filedropper.com/samplesmall
http://www.filedropper.com/sifsmall
I followed the SVA manual, though I don't understand what modcombat does here. I understand it turns the data table into a
matrix, but why do we write ~1 in the brackets? What does it mean?
Also, it generates an error. I think it means that the number of rows
doesn't match; is there a way to fix that?
library(sva)
dat = read.csv("Combat_matrix_input.csv");
sif = read.csv("sif.csv");
modcombat = model.matrix(~1, data=dat)
newdata = ComBat(dat=dat, batch=sif$Batch, par.prior = TRUE, mod = modcombat)
Found 6 batches
Error in cbind(batchmod, mod) :
number of rows of matrices must match (see arg 2)
Firstly, please kindly post your csv files by uploading them to some cloud drive. In this context "~" is not an arithmetic operator; it introduces a model formula, and ~1 specifies an intercept-only model (no adjustment covariates). For more details on formulas see: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html
Also, for the model.matrix parameters, refer to the manual; that will help you understand the function, see: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/model.matrix.html
Edit: After looking at both of your files, I can see that their dimensions don't line up, as you might have understood by now: model.matrix(~1, data=dat) produces one row per gene, whereas ComBat needs a model matrix with one row per sample (i.e., per column of dat), matching the length of sif$Batch. The following document lists the detailed steps, see: http://www.bioconductor.org/packages/release/bioc/vignettes/sva/inst/doc/sva.pdf. I hope it helps.
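To illustrate, here is a minimal sketch of how the call is usually set up, assuming the first column of Combat_matrix_input.csv holds the gene names and sif has one row per sample (one per column of dat); the column name Batch is taken from your snippet and should be matched to the actual header in sif.csv:
library(sva)
dat <- as.matrix(read.csv("Combat_matrix_input.csv", row.names = 1))  # genes in rows, samples in columns
sif <- read.csv("sif.csv")                                            # one row per sample: array, sample, batch
# Build the model matrix from the sample information, not from dat:
# its rows must correspond to the samples (the columns of dat).
modcombat <- model.matrix(~1, data = sif)   # intercept-only: no covariates of interest
newdata <- ComBat(dat = dat, batch = sif$Batch, mod = modcombat, par.prior = TRUE)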
Trying to learn the ropes in R and already struggling to find a replacement for SAS macros.
I'm trying to run a piece of code several times, but I'm having a hard time and came here for help.
First, I'm working with this example file, with a variable that gives me the quantity of rows that I have previously analysed in another file (qtde_registros), followed by three variables that give me the quantity of rows that had different types of errors.
file <- readRDS(file="file.Rda")
file
qtde_registros error1 error2 error3
1 1175 0 0 0
After that, I created a list with the errors and another one with the description of each one of them.
Then, using those lists and the file mentioned initially, I wish to create several files (one for each error) that will later be bound into one last file to form a final report.
As I said, I'm struggling with it, so I made an example code of how it would be forming the first file:
error_list <- list("Error1","Error2","Error3")
description_list <- list("Code not found",
"Invalid date.",
"Negative value.")
error1 <- file
error1$file_name <- "Clients"
error1$error <- error_list[1]
error1$qtde <- error1$error1
error1$desc <- description_list[1]
library(dplyr)  # select() comes from dplyr
error1 <- select(error1, file_name, error, qtde, desc)
error1
file_name error qtde desc
1 Clients Error1 0 Code not found
And that leads to my question: how can I make the code above run several times, one for each error on my list?
I'm aware that this whole approach may not be the best, as the way to do certain things differs depending on the language used, but I have to work with the knowledge I have at the moment.
I'm thinking of using the apply family of functions, but I haven't managed to work it out.
Thanks in advance for the help, and sorry for any errors in typing or grammar (English is not my first language).
EDIT: I forgot to say that I don't intend to do this via a for or while loop.
In R (and many other languages) you'll be using a form of for-loop. In R there are several wrappers for loops with specific outcomes in the *apply family. Here's a short (incomplete) list of the *apply family and their input/output:
lapply -> list output
sapply -> list or atomic vector (integer vector, numeric vector, etc.)
mapply -> similar to sapply but can take more than 1 input to iterate over (useful if you have 2 simultaneous things to loop over, for example)
tapply -> loop over groups defined by INDEX
apply -> loop over an array (either rows or columns), returns a matrix/vector
And so on.
I am guessing that your example is incomplete, but I'll show 3 examples to get you started. One using a for-loop, one using lapply and one using mapply.
for-loop
A for-loop is the classic method (found in most programming languages). It works by having a for(---) where --- is replaced by something to iterate over. This could be error_list, or it could be a numeric vector such as seq(1, n) or 1:n. Here you have more than 1 thing to iterate over, so a numeric vector makes sense (and we use it to subset the data):
errors <- list() # <== Somewhere to put our results
for(i in 1:length(error_list)){
error_i <- list(file = file,
file_name = "Clients",
error = error_list[[i]], # Use i to subset error_list
qtde = error_list[[i]], # Maybe this should be something else in your case
desc = description_list[[i]]
)
# Put into our errors list. Create "error1" using paste and our index
errors[[paste0('error', i)]] <- error_i
}
And by the end, all of your results will be in the errors list, to be extracted using errors[1] or errors["error1"] (change the number to your error). This can then be combined using do.call(rbind, errors) and then saved using write.table (or write.csv or similar).
lapply
For the *apply family, the *apply takes care of the looping, but instead we have to provide a function (a macro in SAS terms) to execute in each iteration. So we wrap the contents of the loop above in a function:
macro <- function(i){
list(file = file,
file_name = "Clients",
error = error_list[[i]], # Use i to subset error_list
qtde = error_list[[i]], # Maybe this should be something else in your case?
desc = description_list[[i]]
)
}
errors <- lapply(1:length(error_list), macro)
#set names afterwards
names(errors) <- paste0("error", 1:length(error_list))
And once again we have the data ready to be extracted, saved, etc. This is equivalent to:
errors <- list()
for(i in 1:length(error_list))
errors[[i]] <- macro(i)
names(errors) <- paste0("error", 1:length(error_list))
mapply
Now, in your case you have more than 1 thing to iterate over. An alternative is to use mapply and add these as parameters to your function instead. This way we remove error_list[[i]] and description_list[[i]] from the function and instead add them as parameters:
macro_mapply <- function(error, description){
list(file = file,
file_name = "Clients",
error = error, # No need to use i here anymore
qtde = error, # Maybe this should be something else in your case?
desc = description
)
}
errors <- mapply(macro_mapply,
# parameters to iterate over comes after function
error = error_list,
description = description_list,
# Avoid simplification (if we want a list returned)
SIMPLIFY = FALSE)
names(errors) <- paste0("error", 1:length(error_list))
Note that "mapply" will try to return a vector if possible, so I set SIMPLIFY = FALSE to avoid this.
Things to note:
In the 3 examples above I have not taken into account whether you read multiple files, or any other parameters changing. So if you have to read a file in each iteration, it makes sense to go with the first 2 examples and add readRDS to the loop or function with appropriate file naming. Also, I have used your data, but I am guessing qtde and error should be different in your specific case; this is not clear from your example.
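On that note, here is a hedged sketch of the combining step the question ultimately asks for, assuming (as the example data suggests) that the per-error counts live in columns of file named error1, error2, error3; adjust the column lookup if your real names differ:
# Build one one-row data.frame per error, then stack them into the final report.
rows <- lapply(seq_along(error_list), function(i) {
  data.frame(file_name = "Clients",
             error     = error_list[[i]],
             qtde      = file[[tolower(error_list[[i]])]],  # e.g. file$error1 -- an assumption
             desc      = description_list[[i]],
             stringsAsFactors = FALSE)
})
report <- do.call(rbind, rows)
write.csv(report, "error_report.csv", row.names = FALSE)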
I hope this will help getting you started.
Once you've gotten the hang of your first loops and somewhat understand how the *applys work, I would suggest checking out the tidyverse, which provides what many find to be a more "user-friendly" and intuitive interface to data transformation.
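For example, with purrr (part of the tidyverse, assuming it is installed), the mapply example above becomes roughly:
library(purrr)
errors <- map2(error_list, description_list, macro_mapply)  # iterate over both lists in parallel
names(errors) <- paste0("error", seq_along(errors))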
I hope that this helps you get started on solving your problem.
I know there are ways to extract all arguments to a function using, for example, rlang::fn_fmls. But is it possible to extract the description of one of these arguments from the package documentation?
For example, what if I wanted to extract the description of the na.rm argument to base::sum()?
I'm imagining something like:
get_argument_description(fn = 'base::sum', arg = 'na.rm')
Return:
"logical. Should missing values (including NaN) be removed?"
You could try to read the associated help file for that function and grep for the line containing \item{argument}. However, multi-line help texts are allowed, so if the next line does not start with a \, you would want to grab that too.
This answer shows a way to access the file; then it is just a matter of grabbing the correct line(s). I also want to highlight a different function in tools:
tools:::.Rd_get_text()
which almost gets you where you want (if you find the correct line):
library(tools)
db = Rd_db("base")
tools:::.Rd_get_text(x = db[["sum.Rd"]])[21] # 21 is the line you want here
[1] " na.rm: logical. Should missing values (including 'NaN') be removed?"
I have an ID variable with 20 digits. Once I read the data into R, it changes to scientific notation, and then if I write the same ID to a csv file, the value of the ID changes.
For example, running the code below should print the value of x as "12345678912345678912", but it prints "12345678912345679872":
Code:
options(scipen=999)
x <- 12345678912345678912
print(x)
Output:
[1] 12345678912345679872
My questions are:
1) Why is this happening?
2) How can I fix this problem?
I know it has to do with how R stores data types, but I still think there should be some way to deal with this problem. I hope I am clear with this question.
I don't know if this question has been asked before, so point me to a link if it's a duplicate and I will remove this post.
I have gone through this, so I can relate it to my issue, but I am unable to fix it.
Any help would be highly appreciated. Thanks
R does not by default handle integers numerically larger than 2147483647L.
If you append an L to your number (to tell R it's an integer), you get:
x <- 12345678912345678912L
#Warning message:
#non-integer value 12345678912345678912L qualified with L; using numeric value
This also explains the change in the last digits, as R stores the number as a double.
I think the gmp package should be able to handle large numbers in general. You should therefore either accept the loss of precision, store them as character strings, or use a data type from the gmp package.
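If you go the gmp route, a minimal sketch (assuming the gmp package is installed) looks like this; note the value is passed as a string so it never goes through a double:
# install.packages("gmp")
library(gmp)
x <- as.bigz("12345678912345678912")  # arbitrary-precision integer, built from a string
x * 2                                 # arithmetic works without losing digits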
To circumvent the problem due to how numbers are stored/represented, you can import your ID variable directly as character with the option colClasses. For example, using read.csv and importing a data.frame with the ID column and another numeric column:
mydata<-read.csv("file.csv",colClasses=c("character","numeric"),...)
Using readr you can do
mydata <- readr::read_csv("file.csv", col_types = list(ID = readr::col_character()))
where "ID" is the name of your ID column.
Setting:
I have (simple) .csv and .dat files created by laboratory devices and other programs storing information on measurements or calculations. I have found solutions for this in other languages, but not for R.
Problem:
Using R, I am trying to extract values to quickly display results without opening the created files. I have two typical settings:
a) I need to read a priori unknown values after known keywords
b) I need to read lines after known keywords or lines
I can't make functions such as scan() and grep() work.
c) Finally, I would like to loop over dozens of files in a folder and produce a summary (to complete the picture: I will manage this part).
I would appreciate any form of help.
OK, it works for the key value (although it's perhaps not very nice):
variable <- scan("file.csv", what = character(), sep = "")
This returns a character vector of everything.
variable[grep("keyword", variable) + 2] # +2 because the actual value is stored two places after the keyword
This returns the sought values as characters.
as.numeric(gsub(",", ".", variable)) # gsub is vectorised, so no lapply is needed
For completeness: the data had to be converted to numbers, and the "," vs "." decimal-separator problem needed to be solved.
In one line:
data <- as.numeric(gsub(",", ".", ks[grep("Ks_Boden", ks) + 2]))
Perseverance is not too bad of an asset ;-)
The rest isn't finished, yet, I will post once finished.
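Until then, here is a hedged sketch for part (c), looping over all .csv files in a folder ("data_folder" is a placeholder; the keyword "Ks_Boden" and the +2 offset are taken from the snippet above):
# Read every .csv file in the folder, pull the value two tokens after the
# keyword, and collect the results in a small summary data.frame.
files <- list.files("data_folder", pattern = "\\.csv$", full.names = TRUE)
summary_df <- do.call(rbind, lapply(files, function(f) {
  tokens <- scan(f, what = character(), sep = "", quiet = TRUE)
  value  <- tokens[grep("Ks_Boden", tokens) + 2][1]
  data.frame(file = basename(f),
             value = as.numeric(gsub(",", ".", value)),
             stringsAsFactors = FALSE)
}))
summary_df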
I have a file with 15 million lines (will not fit in memory). I also have a small vector of line numbers - the lines that I want to extract.
How can I read out the lines in one pass?
I was hoping for a C function that does it in one pass.
The trick is to use a connection AND open it before read.table:
con<-file('filename')
open(con)
read.table(con,skip=5,nrow=1) #6-th line
read.table(con,skip=20,nrow=1) #27-th line
...
close(con)
You may also try scan, it is faster and gives more control.
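A hedged sketch that generalises this trick to an arbitrary vector of line numbers, reading the file in a single pass over one open connection (the function name and the chunk size are my own choices for illustration):
read_lines_by_number <- function(filename, line_numbers) {
  line_numbers <- sort(line_numbers)
  con <- file(filename, open = "r")
  on.exit(close(con))
  out <- character(length(line_numbers))
  pos <- 0                                   # lines consumed so far
  for (i in seq_along(line_numbers)) {
    skip_n <- line_numbers[i] - pos - 1
    while (skip_n > 0) {                     # discard intervening lines in bounded chunks
      chunk  <- min(skip_n, 100000L)
      invisible(readLines(con, n = chunk))
      skip_n <- skip_n - chunk
    }
    out[i] <- readLines(con, n = 1)
    pos <- line_numbers[i]
  }
  out
}
# e.g. read_lines_by_number("huge_file.txt", c(7, 1000, 123456))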
If it's a binary file
Some discussion is here:
Reading in only part of a Stata .DTA file in R
If it's a CSV or other text file
If they are contiguous and at the top of the file, just use the nrows argument to read.csv or any of the read.table family. If not, you can combine the nrows and skip arguments to repeatedly call read.csv (reading in a new row or group of contiguous rows with each call) and then rbind the results together.
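A small hedged sketch of that repeated-call approach (file name and line numbers are made up; note that each call re-scans the file from the top, so the connection trick above is faster for many lines):
wanted <- c(7, 1000, 123456)                       # 1-based data-row numbers to extract
pieces <- lapply(wanted, function(n) {
  read.csv("big_file.csv", header = FALSE, skip = n - 1, nrows = 1)
})
result <- do.call(rbind, pieces)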
If your file has fixed line lengths then you can use 'seek' to jump to any character position. So just jump to N * line_length for each N you want, and read one line.
However, from the R docs:
Use of seek on Windows is discouraged. We have found so many
errors in the Windows implementation of file positioning that
users are advised to use it only at their own risk, and asked not
to waste the R developers' time with bug reports on Windows'
deficiencies.
You can also use 'seek' from the standard C library in C, but I don't know if the above warning also applies!
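For what it's worth, a hedged sketch of the fixed-width idea in R (line_length and the file name are assumptions, and the Windows caveat quoted above still applies):
con <- file("fixed_width.txt", open = "rb")     # binary mode so seek() behaves predictably
line_length <- 81                               # 80 data characters + "\n"; adjust to your file
wanted <- c(7, 1000, 123456)                    # 1-based line numbers
lines <- vapply(wanted, function(n) {
  seek(con, where = (n - 1) * line_length)      # jump straight to the start of line n
  readChar(con, nchars = line_length - 1)       # read the line without its newline
}, character(1))
close(con)
lines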
Before I was able to get an R solution/answer, I did it in Ruby:
#!/usr/bin/env ruby
NUM_SEQS = 14024829
linenumbers = (1..10).collect{(rand * NUM_SEQS).to_i}
File.open("./data/uniprot_2011_02.tab") do |f|
while line = f.gets
print line if linenumbers.include? f.lineno
end
end
It runs fast (as fast as my storage can read the file).
I compiled a solution based on the discussion here:
scan(filename, what = list(NULL), sep = '\n', blank.lines.skip = FALSE)
This will only show you the number of lines but will read in nothing. If you really want to skip the blank lines, you could just set the last argument to TRUE.