I've a link where I need to download data which is in ".iqy" file and I need to read that for further cleaning.
I'm able to do it manually by entering the link present(in 3rd line) in the file using
con <- file("ABC1.iqy", "r", blocking = FALSE)
readLines(con=con,n=-1L,ok=TRUE, warn=FALSE,encoding='unknown').
Output:
[1] "WEB"
[2] "1"
[3] "https:abc.../excel/execution/EPnx?view=vrs" [4] ""
[5] ""
[6] "Selection=AllTables"
[7] "Formatting=None"
[8] "PreFormattedTextToColumns=True"
[9] "ConsecutiveDelimitersAsOne=True"
[10] "SingleBlockTextImport=False"
[11] "DisableDateRecognition=False"
[12] "DisableRedirections=False"
[13] ""
I need to automate this instead of doing it manually. Is there any option in r that I can use?
simply use download.file :)
con <- file("ABC1.iqy", "r", blocking = FALSE)
dest_path <- "ABC.file"
download.file(readLines(con=con,n=-1L,ok=TRUE, warn=FALSE,encoding='unknown')[3],destfile= dest_path)
if you can't read the file you get, try :
download.file(readLines(con=con,n=-1L,ok=TRUE, warn=FALSE,encoding='unknown')[3],destfile= dest_path, mode = "wb")
I tried to import a text file in R (3.4.0) which actually contains 4 columns but the 4th column is mostly empty until 200,000+th row. I use the fread() in package data.table (ver 1.10.4)
fread("test.txt",fill = TRUE, sep = "\t", quote = "", header = FALSE)
I got this error message:
Error in fread("test.txt", fill = TRUE, sep = "\t", quote = "", header = FALSE) :
Expecting 3 cols, but line 258088 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
I checked the file and there's additional text in 258088th row in the 4th column ("8-4").
Nevertheless, fill = TRUE did not solve this as I expected. I thought it might be fread() determining column numbers inappropriately because the additional column occurs very late in the file. So I tried this:
fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 250000)
The error persisted. On the other hand,
fread("test.txt", fill = TRUE, header = FALSE, sep = "\t", skip = 258080)
This gives no error.
I thought I found the reason, but the weird thing happened when I tested with a dummy file generated by:
write.table(matrix(c(1:990000), nrow = 330000), "test2.txt", sep = "\t", row.names = FALSE)
with the addition of a "8-4" in the 4th column of the 250000th row by Excel. When read by fread():
fread("test2.txt", fill = TRUE, header = FALSE, sep = "\t")
It worked fine with no error message, and this should indicate some late additional column not necessarily trigger error.
I also tried changing encoding ("Latin-1" and "UTF-8") or quote, but neither helped.
Now I feel clueless, and hopefully I did my homework enough with a reproducible information. Thank you for helping.
For additional environmental info, my sessionInfo() is:
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] zh_TW.UTF-8/zh_TW.UTF-8/zh_TW.UTF-8/C/zh_TW.UTF-8/zh_TW.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.5.0 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3
[5] tibble_1.3.3 ggplot2_2.2.1 tidyverse_1.1.1 stringr_1.2.0
[9] microbenchmark_1.4-2.1 data.table_1.10.4
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 cellranger_1.1.0 compiler_3.4.0 plyr_1.8.4 forcats_0.2.0
[6] tools_3.4.0 jsonlite_1.5 lubridate_1.6.0 nlme_3.1-131 gtable_0.2.0
[11] lattice_0.20-35 rlang_0.1.1 psych_1.7.5 DBI_0.6-1 parallel_3.4.0
[16] haven_1.0.0 xml2_1.1.1 httr_1.2.1 hms_0.3 grid_3.4.0
[21] R6_2.2.1 readxl_1.0.0 foreign_0.8-68 reshape2_1.4.2 modelr_0.1.0
[26] magrittr_1.5 scales_0.4.1 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5
[31] colorspace_1.3-2 stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2
Actually there is a difference between the two files that you provide, and I think this is the cause of the different outputs of the fread.
The first file has an end of the line after the 3rd column, except line 258088, where there is a tab a 4th column and then the end of the line. (You can use the option 'show all characters to confirm that').
On the other hand the second file has in all rows an extra tab, i.e. a new empty column.
So in the first case fread expects 3 columns and then finds out a 4th column. On the contrary in the second file, fread expects 4 columns.
I checked read.table with fill=TRUE and it worked with both files. So I think that something is done differently with the fill option of the fread.
I would expect since fill=TRUE, all the lines to be used so as to infer the number of columns (with cost on computational time).
In the comments there are some nice workarounds you can use.
The file has issue: if the table has four columns, at the end of each row with the fourth column missing a \t should have been present.
In this case you may have better luck with a low-level approach: read the file line by line, add a \t to each row which doesn't have the fourth column, split each line with \t and collect all together in a data.frame. Most of the above work is done by the data.table::tstrsplit function. Try something like:
f<-readLines("test.txt")
require(stringr)
require(data.table)
a<-data.frame(tstrsplit(f,"\t",type.convert=TRUE,names=TRUE,keep=1:4),stringsAsFactors=FALSE)
str(a)
#'data.frame': 273070 obs. of 4 variables:
# $ V1: num 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 ...
# $ V2: num -18.7113 -1.2685 0.0768 0.1507 0.1609 ...
# $ V3: num 0 0 0 0 0 0 0 0 0 0 ...
# $ V4: chr NA NA NA NA ...
I was struggling with this as well. I found another solution (for csv and read.table) here How can you read a CSV file in R with different number of columns. This answer you can use the handy function count.fields to count the delimiters of a file by line and then take the max field count to pass the max number of column names to fread. A reproducible example is below.
Generate text with uneven number of fields
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n16520, California, ocean, summer, golden gate, beach, San Francisco\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n"
Write to file
cat(text, file = "foo")
Scan file for delimeters
max.fields<-max(count.fields("foo", sep = ','))
Now use fread to read file, but expect a max number of columns from the col.names argument
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))
However, I was basing this data on the example data from ?count.fields and found if the max number of fields is in the last line of the file, fread will still fail with the following error.
Error in fread("foo", header = FALSE, fill = TRUE, sep = ",", col.names = paste("V", :
Expecting 3 cols, but line 9 contains text after processing all cols. Try again with fill=TRUE. Another reason could be that fread's logic in distinguishing one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes has failed. If quote='' doesn't help, please file an issue to figure out if the logic could be improved.
example
text <- "12223, University\n12227, bridge, Sky\n12828, Sunset\n13801, Ground\n14853, Tranceamerica\n14854, San Francisco\n15595, shibuya, Shrine\n16126, fog, San Francisco\n16520, California, ocean, summer, golden gate, beach, San Francisco\n"
cat(text, file = "foo")
max.fields<-max(count.fields("foo", sep = ','))
fread("foo", header = FALSE, fill=TRUE, sep=",", col.names = paste("V", 1:max.fields, sep = ""))
I'll report this as an issue to the data.table Github. Update: issue logged here https://github.com/Rdatatable/data.table/issues/2691
I'm attempting to use the system function in R to run a program, which I expect to yield an error message in some cases. For this I want to write a tryCatch function.
system(command, intern = TRUE) only returns the actual values which were echo'd by the program I'm running, it does not return my error.
In R, how can I get the error message which was yielded by my system?
My code:
test <- tryCatch({
cmd <- paste0("../Scripts/Plink2/plink --file ../InputData/",prefix," --bmerge ",
"../InputData/fs --missing --out ../InputData/",prefix)
print(cmd)
system(cmd)
} , error = function(e) {
# error handler picks up where error was generated
print("EZEL")
print(paste("MY_ERROR: ",e))
}, finally = {
print("something")
})
[1] "../Scripts/Plink2/plink --file ../InputData/GS80Kdata --bmerge ../InputData/fs --missing --out ../InputData/GS80Kdata"
PLINK v1.90b3.37 64-bit (16 May 2016) https://www.cog-genomics.org/plink2
#....
#skipping some lines here to reduce size
#....
Of these, 1414410 are new, while 2462 are present in the base dataset.
Error: 1 variant with 3+ alleles present.
* If you believe this is due to strand inconsistency, try --flip with
# Skipping some more lines here
[1] "something"
However when using intern=TRUE and assigning the system function to a variable won't catch the error in the vector and still prints it in the R console.
Edit: Here the output of the vector (using gsub to reduce the ridiculous size)
> gsub(pattern="\b\\d.*", replacement = "", x = tst)
[1] "PLINK v1.90b3.37 64-bit (16 May 2016) https://www.cog-genomics.org/plink2"
[2] "(C) 2005-2016 Shaun Purcell, Christopher Chang GNU General Public License v3"
[3] "Logging to ../InputData/GS80Kdata.log."
[4] "Options in effect:"
[5] " --bmerge ../InputData/fs"
[6] " --file ../InputData/GS80Kdata"
[7] " --missing"
[8] " --out ../InputData/GS80Kdata"
[9] ""
[10] "64381 MB RAM detected; reserving 32190 MB for main workspace."
[11] "Scanning .ped file... 0%\b"
[12] "2%\b\b"
[13] "%\b\b"
[14] "\b\b"
[15] "\b"
[16] ""
[17] "58%\b\b"
[18] "7%\b\b"
[19] "%\b\b"
[20] "\b\b"
[21] "\b"
[22] "Performing single-pass .bed write (42884 variants, 14978 people)."
[23] "0%\b"
[24] "../InputData/GS80Kdata-temporary.bim + ../InputData/GS80Kdata-temporary.fam"
[25] "written."
[26] "14978 people loaded from ../InputData/GS80Kdata-temporary.fam."
[27] "144 people to be merged from ../InputData/fs.fam."
[28] "Of these, 140 are new, while 4 are present in the base dataset."
[29] "42884 markers loaded from ../InputData/GS80Kdata-temporary.bim."
[30] "1416872 markers to be merged from ../InputData/fs.bim."
[31] "Of these, 1414410 are new, while 2462 are present in the base dataset."
attr(,"status")
[1] 3
>
I am working with data from: Environment Canada
I am using download.file() to acquire this data. When I use:
download.file(url="http://dd.weather.gc.ca/model_gem_global/25km/grib2/lat_lon/00/000/CMC_glb_VGRD_ISBL_1000_latlon.24x.24_2015091100_P000.grib2",destfile = "Local_Grib.grib2")
GribInfo(grib.file = "Local_File.grib2",file.type = "grib2")
It yields:
$inventory
[1] "" "*** FATAL ERROR: rd_grib2_msg, missing end section ('7777') ***"
[3] ""
attr(,"status")
[1] 8
$grid
[1] "" "*** FATAL ERROR: rd_grib2_msg, missing end section ('7777') ***"
[3] ""
attr(,"status")
[1] 8
Warning messages:
1: running command 'wgrib2 Local_File.grib2 -inv -' had status 8
2: running command 'wgrib2 Local_File.grib2 -grid' had status 8
Whilst a manual download followed by:
GribInfo(grib.file = "CMC_glb_TMP_ISBL_985_latlon.24x.24_2015091100_P000.grib2",file.type = "grib2")
Yields:
$inventory
[1] "1:0:d=2015091100:TMP:985 mb:anl:"
$grid
[1] "1:0:grid_template=0:winds(N/S):" "\tlat-lon grid:(1500 x 751) units 1e-06 input WE:SN output WE:SN res 48"
[3] "\tlat -90.000000 to 90.000000 by 0.240000" "\tlon 180.000000 to 179.760000 by 0.240000 #points=1126500"
I have attempted using the Curl and Wget methods within download.file() however they fail giving a non exit error. I am able to obtain these files using a wget batch file however, I would prefer my entire system be run within R for consistency and ease of use.
As per #Martin Morgan. Downloading as a binary will circumvent this issue. Thanks again Martin.
download.file(url="http://dd.weather.gc.ca/model_gem_global/25km/grib2/lat_lon/00/000/CMC_glb_VGRD_ISBL_1000_latlon.24x.24_2015091100_P000.grib2",destfile = "Local_Grib.grib2", mode="wb")
GribInfo(grib.file = "Local_File.grib2",file.type = "grib2")
Is there a simple way to get the list/array of retention times from a xcmsRaw object?
Example Code:
xraw <- xcmsRaw(cdfFile)
So for example getting information from it :
xraw#env$intensity
or
xraw#env$mz
You can see what slots are available in your xcmsRaw instance with
> slotNames(xraw)
[1] "env" "tic" "scantime"
[4] "scanindex" "polarity" "acquisitionNum"
[7] "profmethod" "profparam" "mzrange"
[10] "gradient" "msnScanindex" "msnAcquisitionNum"
[13] "msnPrecursorScan" "msnLevel" "msnRt"
[16] "msnPrecursorMz" "msnPrecursorIntensity" "msnPrecursorCharge"
[19] "msnCollisionEnergy" "filepath"
What you want is xraw#msnRt - it is a vector of numeric.
The env slot is a environment that stores 3 variables:
> ls(xraw#env)
[1] "intensity" "mz" "profile"
More details on the class itself at class?xcmsRaw.
EDIT: The msnRt slot is populated only if you specify includeMSn = TRUE and your input file must be in mzXML or mzML, not in cdf; if you use the faahKO example from ?xcmasRaw, you will see that
xr <- xcmsRaw(cdffiles[1], includeMSn = TRUE)
Warning message:
In xcmsRaw(cdffiles[1], includeMSn = TRUE) :
Reading of MSn spectra for NetCDF not supported
Also, xr#msnRt will only store the retention times for MSn scans, with n > 1. See the xset#rt where xset is an xcmsSet instance for the raw/corrected MS1 retention times as provided by xcms.
EDIT2: Alternatively, have a go with the mzR package
> library(mzR)
> cdffiles[1]
[2] "/home/lgatto/R/x86_64-unknown-linux-gnu-library/2.16/faahKO/cdf/KO/ko15.CDF"
> xx <- openMSfile(cdffiles[1])
> xx
Mass Spectrometry file handle.
Filename: /home/lgatto/R/x86_64-unknown-linux-gnu-library/2.16/faahKO/cdf/KO/ko15.CDF
Number of scans: 1278
> hd <- header(xx)
> names(hd)
[1] "seqNum" "acquisitionNum"
[3] "msLevel" "peaksCount"
[5] "totIonCurrent" "retentionTime"
[7] "basePeakMZ" "basePeakIntensity"
[9] "collisionEnergy" "ionisationEnergy"
[11] "highMZ" "precursorScanNum"
[13] "precursorMZ" "precursorCharge"
[15] "precursorIntensity" "mergedScan"
[17] "mergedResultScanNum" "mergedResultStartScanNum"
[19] "mergedResultEndScanNum"
> class(hd)
[1] "data.frame"
> dim(hd)
[1] 1278 19
but you will be outside of the default xcms pipeline if you take this route (although Steffen Neumann, from xcms, does know mzR very well, oubviously).
Finally, you are better of to use the Bioconductor mailing list of the xcms online forum if you want to maximise your chances to get feedback from the xcms developers.
Hope this helps.
Good answer but i was looking for this :
xraw <- xcmsRaw(cdfFile)
dp_index <- which(duplicated(rawMat(xraw)[,1]))
xraw_rt_dp <- rawMat(xraw)[,1]
xrawData.rt <- xraw_rt_dp[-dp_index]
Now :
xrawData.rt #contains the retention time.
Observation --> Using mzr package:
nc <- mzR:::netCDFOpen(cdfFile)
ncData <- mzR:::netCDFRawData(nc)
mzR:::netCDFClose(nc)
ncData$rt #contains the same retention time for the same input !!!