How do I read ragged/implied-do data into R?

How do I read data like the example below? (My actual files are like ftp://ftp.aoml.noaa.gov/hrd/pub/hwind/Operational/2012/AL182012/1030/0730/AL182012_1030_0730.gz formatted per http://www.aoml.noaa.gov/hrd/Storm_pages/grid.html -- they look like Fortran implied-do writes.)
The issue is that the file contains multiple headers and vectors, with differing numbers of values per line. scan() seems to start from the beginning each time for .gz files, while I want the reads to parse incrementally through the file.
This is a headerline with a name.
The fourth line has the number of elements in the first vector,
and the next vector is encoded similarly
7
1 2 3
4 5 6
7
8
1 2 3
4 5 6
7 8
This doesn't work as I'd like:
fh<-gzfile("junk.gz")
headers<-readLines(fh,3)
nx<-as.numeric(readLines(fh,1))
x<-scan(fh,n=nx)
ny<-as.numeric(readLines(fh,1))
y<-scan(fh,n=ny)
This sort of works, but I have to then calculate the skip values:
...
x<-scan(fh,skip=3,nx)
...
Ah... I discovered that using the gzfile() to open does not allow seek operations on the data, so the scan()s all rewind and start at the beginning of the file. If I unzip the file and operate on the uncompressed data, I can read the various bits incrementally with readLines(fh,n) and scan(fh,n=n)
readVector<-function(fh,skip=0,what=double()){
  if (skip != 0) {junk<-readLines(fh,skip)}
  n<-scan(fh,n=1)              # element count of the vector that follows
  scan(fh,what=what,n=n)
}
fh<-file("junk",open="r")      # open once so successive reads continue where the last one stopped
headers<-readLines(fh,3)
x<-readVector(fh)
y<-readVector(fh)
xl<-readVector(fh)
yl<-readVector(fh)
... # still need to process a parenthesized complex array, but that is a different problem.
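Unzipping may not be necessary. Here is a minimal sketch, assuming the connection is opened explicitly before reading: an open connection keeps its position between calls, whereas an unopened one is opened and closed again by every readLines() or scan() call. The file written here is a made-up stand-in with the same layout as the example above, and read_vector is a hypothetical helper name.

```r
# Write a small gzip-compressed sample in the layout from the question.
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, open = "w")
writeLines(c("This is a headerline with a name.",
             "The fourth line has the number of elements in the first vector,",
             "and the next vector is encoded similarly",
             "7", "1 2 3", "4 5 6", "7",
             "8", "1 2 3", "4 5 6", "7 8"), out)
close(out)

# Read a count, then that many values, from an already-open connection.
read_vector <- function(con, what = double()) {
  n <- scan(con, n = 1, quiet = TRUE)
  scan(con, what = what, n = n, quiet = TRUE)
}

fh <- gzfile(tmp, open = "r")   # opened explicitly, so the read position persists
headers <- readLines(fh, n = 3)
x <- read_vector(fh)
y <- read_vector(fh)
close(fh)
```

Whether this behaves the same way on the real HWIND files is untested here; the sketch only covers the simplified layout.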

Looking at a few sample files, it looks like you only need to determine the number to be read once, and that can be used for processing all parts of the file.
As I mentioned in a comment, grep would be useful for helping automate the process. Here's a quick function I came up with:
ReadFunky <- function(myFile) {
  fh <- gzfile(myFile)
  myFile <- readLines(fh)
  vecLen <- as.numeric(myFile[5])
  startAt <- grep(paste("^\\s+", vecLen), myFile)
  T1 <- lapply(startAt[-5], function(x) {
    scan(fh, n = vecLen, skip = x)
  })
  T2 <- gsub("\\(|\\)", "",
             unlist(strsplit(myFile[(startAt[5]+1):length(myFile)], ")(",
                             fixed = TRUE)))
  T2 <- read.csv(text = T2, header = FALSE)
  T2 <- split(T2, rep(1:vecLen, each = vecLen))
  T1[[5]] <- T2
  names(T1) <- myFile[startAt-1]
  T1
}
You can apply it to a downloaded file. Just replace with the actual path to where you downloaded the file.
temp <- ReadFunky("~/Downloads/AL182012_1030_0730.gz")
The function returns a list. The first four items in the list are the vectors of coordinates.
str(temp[1:4])
# List of 4
# $ MERCATOR X COORDINATES ... KILOMETERS : num [1:159] -476 -470 -464 -458 -452 ...
# $ MERCATOR Y COORDINATES ... KILOMETERS : num [1:159] -476 -470 -464 -458 -452 ...
# $ EAST LONGITUDE COORDINATES ... DEGREES: num [1:159] -81.1 -81 -80.9 -80.9 -80.8 ...
# $ NORTH LATITUDE COORDINATES ... DEGREES: num [1:159] 36.2 36.3 36.3 36.4 36.4 ...
The fifth item is a set of 2-column data.frames that contain the data from your "parenthesized complex array". Not really sure what the best structure for this data was, so I just stuck it in data.frames. You'll get as many data.frames as the expected number of values for the given data set (in this case, 159).
length(temp[[5]])
# [1] 159
str(temp[[5]][1:4])
# List of 4
# $ 1:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.59 7.6 7.59 7.59 7.58 ...
# ..$ V2: num [1:159] -1.33 -1.28 -1.22 -1.16 -1.1 ...
# $ 2:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.66 7.66 7.65 7.65 7.64 ...
# ..$ V2: num [1:159] -1.29 -1.24 -1.19 -1.13 -1.07 ...
# $ 3:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.73 7.72 7.72 7.71 7.7 ...
# ..$ V2: num [1:159] -1.26 -1.21 -1.15 -1.1 -1.04 ...
# $ 4:'data.frame': 159 obs. of 2 variables:
# ..$ V1: num [1:159] 7.8 7.8 7.79 7.78 7.76 ...
# ..$ V2: num [1:159] -1.22 -1.17 -1.12 -1.06 -1.01 ...
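To make the parsing of the parenthesized pairs easier to follow, here is a small sketch of that step on its own. The two input lines are made up (the numbers are borrowed from the output above), and the exact spacing inside the real files is an assumption.

```r
# Hypothetical lines holding "(u, v)" pairs written back to back.
lines <- c("( 7.59, -1.33)( 7.60, -1.28)", "( 7.59, -1.22)")

# Split between adjacent pairs, then drop the remaining parentheses.
pairs <- unlist(strsplit(lines, ")(", fixed = TRUE))
pairs <- gsub("\\(|\\)", "", pairs)

# Each element is now "u, v", which read.csv can parse as two columns.
uv <- read.csv(text = pairs, header = FALSE)
```

The result is one row per pair, with the first component in V1 and the second in V2.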
Update
If you want to modify the function so you can read directly from the FTP url, change the first two lines to read as the following and continue from the "myFile" line:
ReadFunky <- function(myFile, fromURL = TRUE) {
  if (isTRUE(fromURL)) {
    x <- strsplit(myFile, "/")[[1]]
    y <- download.file(myFile, destfile = x[length(x)])
    fh <- gzfile(x[length(x)])
  } else {
    fh <- gzfile(myFile)
  }
Usage would be like: temp <- ReadFunky("ftp://ftp.aoml.noaa.gov/hrd/pub/hwind/Operational/2012/AL182012/1023/1330/AL182012_1023_1330.gz") for a file that you are going to download directly, and temp <- ReadFunky("~/AL182012_1023_1330.gz", fromURL=FALSE) for a file that you already have saved on your system.


Loop in R not working, generating single Value

I have some metabolomics data I am trying to process (validate the compounds that are actually present).
'data.frame': 544 obs. of 48 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ No. : int 2 32 34 95 114 141 169 234 236 278 ...
$ RT..min. : num 0.89 3.921 0.878 2.396 0.845 ...
$ Molecular.Weight : num 70 72 72 78 80 ...
$ m.z : num 103 145 114 120 113 ...
$ HMDB.ID : chr "HMDB0006804" "HMDB0031647" "HMDB0006112" "HMDB0001505" ...
$ Name : chr "Propiolic acid" "Acrylic acid" "Malondialdehyde" "Benzene" ...
$ Formula : chr "C3H2O2" "C3H4O2" "C3H4O2" "C6H6" ...
$ Monoisotopic_Mass: num 70 72 72 78 80 ...
$ Delta.ppm. : num 1.295 0.833 1.953 1.023 0.102 ...
$ X1 : num 288.3 16.7 1130.9 3791.5 33.5 ...
$ X2 : num 276.8 13.4 1069.1 3228.4 44.1 ...
$ X3 : num 398.6 19.3 794.8 2153.2 15.8 ...
$ X4 : num 247.6 100.5 1187.5 1791.4 33.4 ...
$ X5 : num 98.4 162.1 1546.4 1646.8 45.3 ...
I tried to write a loop so that if the Delta.ppm value is larger than (m/z - molecular weight)/molecular weight, the entire row is deleted in the subsequent dataframe.
for (i in 1:nrow(rawdata)) {
ppm <- (rawdata$m.z[i] - rawdata$Molecular.Weight[i]) /
rawdata$Molecular.Weight[i]
if (ppm > rawdata$Delta.ppm[i]) {
filtered_data <- rbind(filtered_data, rawdata[i,])
}
}
Instead of giving me a new df with the validated compounds, under the 'Values' section, it generates a single number for 'ppm'.
Still very new to R, any help is super appreciated!
No need to do this row by row; all undesired rows can be removed in one operation:
## base R
## keep rows whose relative mass difference exceeds Delta.ppm., as in the question's loop
good <- with(rawdata, (m.z - Molecular.Weight)/Molecular.Weight > Delta.ppm.)
newdat <- rawdata[good, ]
## dplyr
library(dplyr)
newdat <- filter(rawdata, (m.z - Molecular.Weight)/Molecular.Weight > Delta.ppm.)
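As a quick check of the vectorized idea, here is a sketch on a three-row toy frame. The values are made up, only the column names follow the question, and the comparison direction is assumed to match the question's loop (keep a row when the relative difference exceeds Delta.ppm.).

```r
# Made-up stand-in for the real data; only the column names match.
toy <- data.frame(m.z              = c(103, 145, 114),
                  Molecular.Weight = c(70, 72, 72),
                  Delta.ppm.       = c(1.295, 0.833, 1.953))

# One comparison over whole columns instead of a loop over rows.
rel_diff <- (toy$m.z - toy$Molecular.Weight) / toy$Molecular.Weight
keep <- rel_diff > toy$Delta.ppm.

# which() drops NA comparisons instead of producing all-NA rows.
kept_rows <- toy[which(keep), ]
```

Only the second row survives here, since it is the only one whose relative difference (about 1.01) is larger than its Delta.ppm. value.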
Iteratively adding rows to a frame using rbind(old, newrow) works in practice but scales horribly; see "Growing Objects" in The R Inferno. For each row added, it makes a complete copy of all rows in old, so it slows down a lot as the frame grows. It is far better to produce a list of these new rows and then rbind them all at once; e.g.,
out <- list()
for (...) {
# ... newrow ...
out <- c(out, list(newrow))
}
alldat <- do.call(rbind, out)
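A runnable version of that pattern might look like the following sketch. The toy data frame and the 0.5 threshold are invented for illustration, and the list is preallocated so it is not grown inside the loop either.

```r
# Hypothetical toy data standing in for the real frame.
toy <- data.frame(id = 1:5, value = c(0.2, 1.5, 0.7, 2.1, 0.1))

rows <- vector("list", nrow(toy))      # one slot per candidate row
for (i in seq_len(nrow(toy))) {
  if (toy$value[i] > 0.5) {
    rows[[i]] <- toy[i, ]              # slots for skipped rows stay NULL
  }
}

# Drop the empty slots, then bind everything in a single call.
kept <- do.call(rbind, Filter(Negate(is.null), rows))
```

The single do.call(rbind, ...) at the end copies the data once, rather than once per iteration.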
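ppm <- numeric(nrow(rawdata))      # preallocate instead of assigning NULL to ppm[i]
filtered_data <- rawdata[0, ]      # empty frame with the same columns
for (i in 1:nrow(rawdata)) {
  ppm[i] <- (rawdata$m.z[i] - rawdata$Molecular.Weight[i]) /
    rawdata$Molecular.Weight[i]
  if (ppm[i] > rawdata$Delta.ppm.[i]) {
    filtered_data <- rbind(filtered_data, rawdata[i,])
  }
}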

How do you get structure of data frame with limited length for variable names?

I have a data frame for a raw data set where the variable names are extremely long. I would like to display the structure of the data frame using the str function, and impose a character limit on the displayed variable names, so that it is easier to read.
Here is a reproducible example of the kind of thing I am talking about.
#Data frame with long names
set.seed(1);
DATA <- data.frame(ID = 1:50,
Value = rnorm(50),
This_variable_has_a_really_long_and_annoying_name_to_illustrate_the_problem_of_a_data_frame_with_a_long_and_annoying_name = runif(50));
#Show structure of DATA
str(DATA);
> str(DATA)
'data.frame': 50 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Value : num -0.626 0.184 -0.836 1.595 0.33 ...
$ This_variable_has_a_really_long_and_annoying_name_to_illustrate_the_problem_of_a_data_frame_with_a_long_and_annoying_name: num 0.655 0.353 0.27 0.993 0.633 ...
I would like to use the str function but impose an upper limit on the number of characters to display in the variable names, so that I get output that is something like the one below. I have read the documentation, but I have not been able to identify if there is an option to do this. (There seem to be options to impose upper limits on the lengths of strings in the data, but I cannot see an option to impose a limit on the length of the variable name.)
'data.frame': 50 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Value : num -0.626 0.184 -0.836 1.595 0.33 ...
$ This_variable_has... : num 0.655 0.353 0.27 0.993 0.633 ...
Question: Is there a simple way to get the structure of the data frame, but imposing a limitation on the length of the variable names (to get output something like the above)?
As far as I can see you're right: there doesn't seem to be a built-in way to control this. You also can't do it after the fact, because str() prints its output and doesn't return anything. So the easiest option seems to be renaming beforehand. Relying on setNames(), you could create a simple function to accomplish this:
short_str <- function(data, n = 20, ...) {
  name_vec <- names(data)
  str(setNames(data, ifelse(
    nchar(name_vec) > n, paste0(substring(name_vec, 1, n - 4), "... "), name_vec
  )), ...)
}
short_str(DATA)
'data.frame': 50 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Value : num -0.626 0.184 -0.836 1.595 0.33 ...
$ This_variable_has... : num 0.655 0.353 0.27 0.993 0.633 ...
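If it helps, the renaming step can also be separated from the call to str(), which makes it easy to check on its own. This is only a sketch: shorten_names is a hypothetical helper, and the example data frame is made up.

```r
# Truncate any name longer than n characters to exactly n, ending in "...".
shorten_names <- function(nms, n = 20) {
  long <- nchar(nms) > n
  nms[long] <- paste0(substr(nms[long], 1, n - 3), "...")
  nms
}

demo <- data.frame(ID = 1:3,
                   This_variable_has_a_really_long_name = c(0.1, 0.2, 0.3))
short <- shorten_names(names(demo))

# str() is applied to a renamed copy; demo itself keeps its original names.
str(setNames(demo, short))
```

Because setNames() works on a copy, the long names are still available for any later code that refers to them.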

R "for loop" and/or Apply to transform several variables dynamically

I am trying to translate/replicate into R a shorthand "for loop" technique that I would use in EViews. I'm trying to replicate a "for loop" where I would divide one time series variable by another (vectors) and save it as a new series.
As I use a common naming convention (for example GDP (real), GDPn (nominal) and GDP_P (prices), see EViews example below), I can declare the list of variables once and use changes in the suffix ("n" or "_P") to create dynamic series names and loop through the calculations I need. My input data is national accounts expenditure series.
'EViews shorthand "for next" loop:
%CATS = "GDP CONS INV GOV EX IM"
for %CATS {%cats}
series {%cats}_P= {%cats}n / {%cats}
next
'Which is shorthand replication of below ("series" declares a series of the subsequent name):
series GDP_P = GDPn / GDP
series CONS_P = CONSn / CONS
series INV_P = INVn / INV
series GOV_P = GOVn / GOV
series EX_P = EXn / EX
series IM_P = IMn / IM
So far I've tried using an R for loop (which I have read is not the preferred way in R) by creating a vector of the series name and used "assign(paste" to do the calculation. An example is below but it does not work. From what I have read about the "for" command, the declared series for "i" can only be a vector of values or a vector of names with no further context:
cats<-c("GDP","CONS","GOV","INV","EX","IM")
for (i in cats){
assign(paste(i, "_P",sep=""), paste(i, "n",sep="")/i)
}
I've also done a lot of reading into the "apply" function and derivatives, but I can't see how it works the above scenario. Any suggestions for how to do this in R is helpful.
Your loop should work like this:
cats<-c("GDP","CONS","GOV","INV","EX","IM")
for (i in cats){
assign(paste(i, "_P",sep=""), get(paste(i, "n",sep=""))/get(i))
}
The get will use the strings you provide and find the vector of that name.
There's also a non-for-loop way of doing it, using the idea from one of the answers here:
txt<-paste0(cats, "_P <- ", cats, "n/", cats)
eval(parse(text=txt))
txt is a character vector holding all the lines that you would otherwise have had to type to create the vectors manually; eval(parse(text=txt)) then takes each of those commands and executes them one by one.
You can of course skip the assigning of the text to txt -- I just wanted it to be clearer what's going on here:
eval(parse(text=paste0(cats, "_P <- ", cats, "n/", cats)))
Consider working with lists especially for many similar elements. Doing so you can better manage your global environment and process data more compactly and efficiently. For you this means maintaining 3 lists of vectors instead of 18 separate named vectors (2 original sets and new 3rd set). The use of assign to dynamically create variables on the fly usually indicates the opportunity to use a named list.
Specifically, gather your items in GDPn_list and GDP_list, then use Map (the non-simplifying wrapper around mapply) to iterate elementwise over the two equal-length lists, calling the division function / on each pair. Then name the resulting list with setNames(). The code below demonstrates this with random data; with your real series, you can use the commented-out lines to build the lists.
Original Data
cats <- c("GDP","CONS","GOV","INV","EX","IM")
set.seed(9272018)
GDPn_list <- setNames(replicate(6, runif(50)*120, simplify=FALSE), paste0(cats, "n"))
# GDPn_list <- list(GDPn, CONSn, GOVn, INVn, EXn, IMn)
str(GDPn_list)
# List of 6
# $ GDPn : num [1:50] 52.4 31.9 10.6 118.4 66 ...
# $ CONSn: num [1:50] 18.27 22.3 95.13 87.44 9.79 ...
# $ GOVn : num [1:50] 48.83 69.73 113.61 35.53 1.21 ...
# $ INVn : num [1:50] 51.9 96.9 28.2 67.2 19 ...
# $ EXn : num [1:50] 28.3 94.3 42.3 65.5 83.6 ...
# $ IMn : num [1:50] 109.3 26.6 60.2 78.2 55.5 ...
GDP_list <- setNames(replicate(6, runif(50)*100, simplify=FALSE), cats)
# GDP_list <- list(GDP, CONS, GOV, INV, EX, IM)
str(GDP_list)
# List of 6
# $ GDP : num [1:50] 51.1 65.9 41.5 24.5 87.3 ...
# $ CONS: num [1:50] 47.66 77.32 46.97 48.61 2.98 ...
# $ GOV : num [1:50] 32.6 70.3 21.5 73.4 97.8 ...
# $ INV : num [1:50] 80.7 16.8 57.4 80.7 12.1 ...
# $ EX : num [1:50] 38.1 78.1 40.6 62.8 61.9 ...
# $ IM : num [1:50] 39.8 84.8 11.4 39.7 14.7 ...
New Data
GDPp_list <- setNames(Map(`/`, GDPn_list, GDP_list), paste0(cats, "p"))
str(GDPp_list)
# List of 6
# $ GDPp : num [1:50] 1.025 0.484 0.256 4.835 0.756 ...
# $ CONSp: num [1:50] 0.383 0.288 2.025 1.799 3.286 ...
# $ GOVp : num [1:50] 1.4969 0.9921 5.2891 0.4844 0.0124 ...
# $ INVp : num [1:50] 0.644 5.775 0.491 0.832 1.578 ...
# $ EXp : num [1:50] 0.744 1.207 1.043 1.043 1.352 ...
# $ IMp : num [1:50] 2.747 0.314 5.293 1.971 3.783 ...
And you still can reference your underlying numeric vectors via names or index numbers without losing any functionality or data:
GDPp_list$GDPp
GDPp_list$CONSp
GDPp_list$GOVp
...
GDPp_list[[1]]
GDPp_list[[2]]
GDPp_list[[3]]
...
And if the vectors all have equal length, you can build a matrix from your lists! This time using mapply:
GDPp_matrix <- mapply(`/`, GDPn_list, GDP_list)
colnames(GDPp_matrix) <- paste0(cats, "p")
head(GDPp_matrix)
# GDPp CONSp GOVp INVp EXp IMp
# [1,] 1.0252871 0.3832836 1.49687150 0.6436575 0.7441159 2.746551
# [2,] 0.4835700 0.2884577 0.99208666 5.7753575 1.2067694 0.314102
# [3,] 0.2562130 2.0251752 5.28913247 0.4910816 1.0429316 5.292843
# [4,] 4.8345697 1.7987625 0.48436284 0.8322211 1.0431301 1.970523
# [5,] 0.7563794 3.2859395 0.01236608 1.5781949 1.3518592 3.783420
# [6,] 0.1515318 10.9332338 1.10608066 13.7953500 0.7211371 1.918249
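If the series already sit together as columns of one data frame, the EViews-style loop translates fairly directly. This is a sketch under that assumption; the data frame nat and its values are invented, and only two of the six categories are shown.

```r
cats <- c("GDP", "CONS")

# Hypothetical national accounts data: real and nominal series side by side.
nat <- data.frame(GDP  = c(100, 110), GDPn  = c(120, 143),
                  CONS = c(60, 64),   CONSn = c(66, 80))

# Build each column name from the category and its suffix, then divide.
for (i in cats) {
  nat[[paste0(i, "_P")]] <- nat[[paste0(i, "n")]] / nat[[i]]
}
```

The `[[` operator accepts a column name as a string, so no assign() or get() is needed, and the new series stay attached to the data they came from.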

How to load several files into R without overwriting the existing files?

Hi I would like to load into R several databases in .sas7bdat format. Each time a new database is loaded I would like to display its name (e.g. file.sas7bdat -> file). I wrote a code in R (shown below) but it does not work. I think it overwrites the existing database with a new database. I would be grateful for any suggestions how to improve it.
getwd()
files<-list.files(pattern="*.sas7bdat")
for (i in 1:length(files)) {
data[i]<-read.sas7bdat(files[i])
}
I don't have any sas7bdat files handy, but this concept should translate across most of the read.* functions. You're on the right track with the for-loop, but can create the list directly by using lapply() like so:
#Make a few CSV files
x <- matrix(rnorm(10), ncol = 2)
write.csv(x, "a.csv")
write.csv(x, "b.csv")
#Read them into a list
fileList <- lapply(list.files(pattern = "\\.csv$"), function(x) read.csv(x))
#check out what we ended up with
str(fileList)
#---
List of 2
$ :'data.frame': 5 obs. of 3 variables:
..$ X : int [1:5] 1 2 3 4 5
..$ V1: num [1:5] -0.451 -0.317 -1.225 0.445 -1.361
..$ V2: num [1:5] 0.489 -2.8154 0.5147 -0.0561 0.826
$ :'data.frame': 5 obs. of 3 variables:
..$ X : int [1:5] 1 2 3 4 5
..$ V1: num [1:5] -0.451 -0.317 -1.225 0.445 -1.361
..$ V2: num [1:5] 0.489 -2.8154 0.5147 -0.0561 0.826
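Since the question also asks for each data set to be identified by its file name, one option is to name the list elements after the files. The sketch below uses CSV files in a temporary directory; swapping read.csv for read.sas7bdat (and the pattern for "\\.sas7bdat$") is an assumption that has not been tried here.

```r
# Create two small CSV files in a scratch directory.
demo_dir <- tempfile("demo")
dir.create(demo_dir)
write.csv(data.frame(a = 1:2), file.path(demo_dir, "first.csv"), row.names = FALSE)
write.csv(data.frame(a = 3:5), file.path(demo_dir, "second.csv"), row.names = FALSE)

# Read every file into a list, one element per file.
paths <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(paths, read.csv)

# Name each element after its file, without directory or extension.
names(tables) <- tools::file_path_sans_ext(basename(paths))
names(tables)
```

Each data set can then be reached as tables$first, tables[["second"]], and so on, and nothing gets overwritten.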
library(sas7bdat)
setwd(".... WHERE SAS FILES LIVES...")
load.sas <- function(x) {
  name <- strsplit(x,"\\.")[[1]][1]
  assign(name, read.sas7bdat(x), envir=.GlobalEnv)
  TRUE
}
sapply(list.files(path=".", pattern="\\.sas7bdat$", full.names=FALSE), load.sas)
You can add further features to this code, for example to rewrite only some of the data.

Getting column name which holds a max value within a row of a matrix holding a separate max value within an array

For instance given:
dim1 <- c("P","PO","C","T")
dim2 <- c("LL","RR","R","Y")
dim3 <- c("Jerry1", "Jerry2", "Jerry3")
Q <- array(1:48, c(4, 4, 3), dimnames = list(dim1, dim2, dim3))
I want to reference within this array, the matrix that has the max dim3 value at the (3rd row, 4th column) location.
Upon identifying that matrix, I want to return the column name which has the maximum value within the matrix's (3rd Row, 1st Column) to (3rd Row, 3rd Column) range.
So what I'd hope to happen is that Jerry3 gets referenced because the number 47 is stored in its 3rd row, 4th column, and then within Jerry3, I would want the maximum number in row 3 to get referenced which would be 43, and ultimately, what I need returned (the only value I need) is then the column name which would be "R".
That's what I need to know how to do: get that "R" and assign it to a variable, i.e. "column_ref", such that column_ref <- "R".
This should do it - if I understand correctly:
Q <- array(1:48, c(4,4,3), dimnames=list(
c("P","PO","C","T"), c("LL","RR","R","Y"), c("Jerry1", "Jerry2", "Jerry3")))
column_ref <- names(which.max(Q[3,1:3, which.max(Q[3,4,])]))[1] # "R"
Some explanation:
which.max(Q[3,4,]) # return the index of the "Jerry3" slice (3)
which.max(Q[3,1:3, 3]) # returns the index of the "R" column (3)
...and then names returns the name of the index ("R").
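The same two steps can be spelled out with dimension names instead of index numbers, which may be easier to read. This is just a sketch of the one-liner above; at_c_y, best_slice and row_c are illustrative names.

```r
Q <- array(1:48, c(4, 4, 3),
           dimnames = list(c("P", "PO", "C", "T"),
                           c("LL", "RR", "R", "Y"),
                           c("Jerry1", "Jerry2", "Jerry3")))

# Step 1: which slice has the largest value at row "C" (3rd), column "Y" (4th)?
at_c_y <- Q["C", "Y", ]
best_slice <- names(which.max(at_c_y))

# Step 2: within that slice, which of the first three columns is largest in row "C"?
row_c <- Q["C", c("LL", "RR", "R"), best_slice]
column_ref <- names(which.max(row_c))
```

Here best_slice comes out as "Jerry3" (the value 47) and column_ref as "R" (the value 43), matching the result described in the question.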
Here's a simple way to solve:
mxCol=function(df, colIni, colFim){ #201609
  if(missing(colIni)) colIni=1
  if(missing(colFim)) colFim=ncol(df)
  if(colIni>=colFim) { print('colIni>=ColFim'); return(NULL)}
  dfm=cbind(mxCol=apply(df[colIni:colFim], 1, function(x) colnames(df)[which.max(x)+(colIni-1)])
            ,df)
  dfm=cbind(mxVal=as.numeric(apply(dfm,1,function(x) x[x[1]]))
            ,dfm)
  return(dfm)
}
This post helped me to solve a general data.frame problem.
I have repeated measures for groups, G1 and G2.
> str(df)
'data.frame': 6 obs. of 15 variables:
$ G1 : num 0 0 2 2 8 8
$ G2 : logi FALSE TRUE FALSE TRUE FALSE TRUE
$ e.10.100 : num 26.41 -11.71 27.78 3.17 26.07 ...
$ e.10.250 : num 27.27 -12.79 29.16 3.19 26.91 ...
$ e.20.100 : num 29.96 -12.19 26.19 3.44 27.32 ...
$ e.20.100d: num 26.42 -13.16 28.26 4.18 25.43 ...
$ e.20.200 : num 24.244 -18.364 29.047 0.553 25.851 ...
$ e.20.50 : num 26.55 -13.28 29.65 4.34 27.26 ...
$ e.20.500 : num 27.94 -13.92 27.59 2.47 25.54 ...
$ e.20.500d: num 24.4 -15.63 26.78 4.86 25.39 ...
$ e.30.100d: num 26.543 -15.698 31.849 0.572 29.484 ...
$ e.30.250 : num 26.776 -16.532 28.961 0.813 25.407 ...
$ e.50.100 : num 25.995 -14.249 28.697 0.803 27.852 ...
$ e.50.100d: num 26.1 -12.7 27.1 2.5 27.4 ...
$ e.50.500 : num 28.78 -9.39 25.77 2.73 23.73 ...
I need to know which measure (column) has the best (max) result, and I need to disregard the grouping columns.
I ended up with this function
apply(df[colIni:colFim], 1, function(x) colnames(df)[which.max(x)+(colIni-1)])
#colIni: first column to consider; colFim: last column to consider
Once the column name is known, another tiny function gets the max value
apply(dfm,1,function(x) x[x[1]])
And here is the function to solve similar problems; it returns both the column and the max value
mxCol=function(df, colIni, colFim){ #201609
  if(missing(colIni)) colIni=1
  if(missing(colFim)) colFim=ncol(df)
  if(colIni>=colFim) { print('colIni>=ColFim'); return(NULL)}
  dfm=cbind(mxCol=apply(df[colIni:colFim], 1, function(x) colnames(df)[which.max(x)+(colIni-1)])
            ,df)
  dfm=cbind(mxVal=as.numeric(apply(dfm,1,function(x) x[x[1]]))
            ,dfm)
  return(dfm)
}
In this case,
> mxCol(df,3)[1:11]
mxVal mxCol G1 G2 e.10.100 e.10.250 e.20.100 e.20.100d e.20.200 e.20.50 e.20.500
1 29.958 e.20.100 0 FALSE 26.408 27.268 29.958 26.418 24.244 26.553 27.942
2 -9.395 e.50.500 0 TRUE -11.708 -12.789 -12.189 -13.162 -18.364 -13.284 -13.923
3 31.849 e.30.100d 2 FALSE 27.782 29.158 26.190 28.257 29.047 29.650 27.586
4 4.862 e.20.500d 2 TRUE 3.175 3.190 3.439 4.182 0.553 4.337 2.467
5 29.484 e.30.100d 8 FALSE 26.069 26.909 27.319 25.430 25.851 27.262 25.535
6 -9.962 e.30.250 8 TRUE -11.362 -12.432 -15.960 -11.760 -12.832 -12.771 -12.810
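For comparison, base R's max.col() can find the position of the row-wise maximum without apply(). The sketch below is an alternative to mxCol rather than a rewrite of it; the small data frame is made up, and the measure columns are assumed to start at column 3, after the grouping columns.

```r
# Made-up data: two grouping columns followed by three measures.
df <- data.frame(G1 = c(0, 0, 2), G2 = c(FALSE, TRUE, FALSE),
                 a = c(1, 9, 3), b = c(5, 2, 3), c = c(4, 7, 8))

cols <- 3:ncol(df)                  # measure columns only
m <- as.matrix(df[cols])

# Position of the largest value in each row (first one wins on ties).
idx <- max.col(m, ties.method = "first")

best_col <- colnames(m)[idx]
best_val <- m[cbind(seq_len(nrow(m)), idx)]   # pick one cell per row
```

Setting ties.method = "first" matters: the default for max.col() breaks ties at random, which would make the result vary between runs.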
