How to read OMI/MLS unstructured ASCII data in R?

I am trying to read the following data in R. I could find code written for numpy, but none for R.
According to the link: "After skipping the first three lines, each following 12 lines contains one row of the total 2D array. It's concatenated integers of three digits each." Reading ASCII in R needs a separator keyword, but I don't have a separator here. Any help would be really appreciated.
April 2008 OMI/MLS Tropo O3 Column (Dobson Units) X 10
Longitudes: 288 bins centered on 179.375W to 179.375E (1.25 degree steps)
Latitudes: 120 bins centered on -59.5S to 59.5N (1.00 degree steps)
166217223211241326285354290338228257199268342304270284206277284257222239296
260308184247224202214256242220241234247297278270237201247186238239284263212
201204206191184186177199218199234242242205240231246269233208210227275242213
246222272320276248166201197200188173209231199198234238231188250228285253247
249256181180233249252211210184212242242179287233269310305315234236228135128
117219226194243273280321323389427283352446431379385319265243288309242222346
289306281288271267274235277306306273250235252264274260300304322305357336297
281286324324316347308296323294256289246239197256279298275340317337289322356
315286280281276206228225211220251255244210208232239280274268237252238226181
209246254242274248252250240262253226223229204204228197200209245263253247260
267313334255264206163165188220261300266266240247224208223213262248169260298
208356315279242223222168245286303325329 lat = -59.5
312293362274234235237227305283172142294249334273287306283264267228238348370
314255273250279264261285254275261254246249282223225206214199241236244254276
224224232214225226205192175195204202183176194231285284274277284264275286265
242253247187264238203212213181184188219214221192227209233185217255254325236
206328342260262279277296265222275339239319292237352279311238273169141268188
222266281238251310281268268326363389305366312251281289443469386293271355354
342301281240278232224246244218213251291257222242233254230300323350348289258
256264309269273251241264273292271275236245250280277292280299335323345348367
352358346323265265254246235235266260258206221218212232255259260258246233233
223240299281223208269260233246261257250238232236238210215210218259279291298
304301291271240244205193221244252287306297278214235176261322276282196239296
339340295266256241263348321274309308267 lat = -58.5
Edited: I manipulated the original data. First, I removed the first three lines with "sed -i '1,3d' file.txt" (not strictly necessary). Second, I removed the leading blank column so that every value is a consistent three-character field, using "sed -i 's/^ *//g' file.txt". Then, following Andrew Gustar's suggestion, I tried read.fwf. It now reads the data, but I want every three characters to be read as one value, and the output is only one column.
data <- read.fwf(fils[1], skip = 0, widths = 3, comment.char = "l")
Here is part of the output:
head(data)
V1
1 166
2 260
3 201
4 246
5 249
6 117
Only the first three-digit value of each line seems to be read; every line should instead yield a whole row of three-digit values.
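In case it helps, here is a minimal sketch of one way to read this without a separator: pull in the raw lines, strip the trailing "lat = ..." annotations, and cut the concatenated digits into three-character integers. It assumes the three header lines are already removed (as in the sed step above) and that the grid is 120 latitudes by 288 longitudes, as the file header states.
lines <- readLines("file.txt")                    # header lines already removed
lines <- sub("\\s*lat\\s*=.*$", "", lines)        # drop the "lat = ..." tags
digits <- paste(lines, collapse = "")             # one long string of digits
starts <- seq(1, nchar(digits), by = 3)
vals <- as.integer(substring(digits, starts, starts + 2))
o3 <- matrix(vals, nrow = 120, ncol = 288, byrow = TRUE)  # one row per latitude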

Related

Reading tab-delimited text file into R only expecting one column out of five

Here's my problem. I'm trying to read a tab-delimited text file into R, and I keep getting error messages; it also loads only one column out of the five in the dataset.
Our professor is requiring us to use the read_csv() command for this, but I've tried read_tsv() as well, and neither has worked. I've looked everywhere and just can't find anything about what could be going wrong.
waste <- read_tsv("wasterunup.txt", col_names=TRUE, na=c("*"))
I can't seem to link the text file here, but it's a simple tab-delimited text file with 5 columns, column headers, and 22 rows (not counting the headers). * is used for N/A results.
I have no clue how to do this "properly", as my professor requires, using read_csv().
waste <- read_tsv("wasterunup.txt", col_names=TRUE, na=c("*"))
Parsed with column specification:
cols(
  `PT1 PT2 PT3 PT4 PT5` = col_double()
)
Warning: 22 parsing failures.
row col expected actual file
1 -- 1 columns 5 columns 'wasterunup.txt'
2 -- 1 columns 5 columns 'wasterunup.txt'
3 -- 1 columns 5 columns 'wasterunup.txt'
4 -- 1 columns 5 columns 'wasterunup.txt'
5 -- 1 columns 5 columns 'wasterunup.txt'
... ... ......... ......... ................
See problems(...) for more details.
To clarify my errors:
When I use read_csv(), all of the data is there, but all five datapoints are crammed into one cell of each row.
When I use read_tsv(), only one column of the data is there.
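If the five columns are actually separated by runs of spaces rather than real tabs, that would explain both symptoms: read_tsv() finds no tabs and parses a single `PT1 PT2 PT3 PT4 PT5` column, and read_csv() finds no commas. A sketch using readr's whitespace-delimited reader, assuming that is the case:
library(readr)
waste <- read_table("wasterunup.txt", col_names = TRUE, na = "*")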

Reshape specific rows into columns in R

My sample data frame would look like the following:
1 Number Type Code Reason
2 0123 06 09 010
3 Date Amount Damage Act
4 08/31/16 10,000 Y N
5 State City Zip Phone
6 WI GB 1234 Y
I want to make rows 1, 3, and 5 the column names and have the data below each fall into the corresponding column. I was looking into the reshape function, but I only saw examples where an entire column of values needed to become individual columns, so I wasn't sure what to do in this scenario. Apologies if it's obvious.
Here is the desired output:
1 Number Type Code Reason Date Amount Damage Act State City Zip Phone
2 0123 06 09 010 08/31/16 10,000 Y N WI GB 1234 Y
Thanks
As some people have commented, you could build a data frame out of the rows of your starting data frame, but I think it's a little easier to work on the lines of text.
If your starting file looks something like this
Number , Type , Code ,Reason
0123 , 06 , 09 , 010
Date , Amount , Damage , Act
08/31/16 , 10000 , Y , N
State , City , Zip , Phone
WI , GB , 1234, Y
we can read it in with each line as an element of a character vector:
lines <- readLines("start.csv")
make all the odd lines into a single line:
oddind <- seq(from=1, to= length(lines), by=2)
namelines <- paste(lines[oddind], collapse=",")
make all the even lines into a single line:
datlines <- paste(lines[oddind+1], collapse=",")
make those lines into a new CSV to read:
writeLines(text= c(namelines, datlines), con= "nice.csv")
print(read.csv("nice.csv"))
This gives
Number Type Code Reason Date Amount Damage Act State
1 123 6 9 10 08/31/16 10000 Y N WI
City Zip Phone
1 GB 1234 Y
So it's all in one row, and all the variable names show up correctly in the data frame.
The benefits of this strategy are:
It will work for starting CSV files where the number of variables isn't a multiple of 4.
It will work for starting CSV files with any number of rows.
There is no chance of weird dynamic casting happening with unlist() or as.character().
Creating a data frame roughly like that one (although it will necessarily have column names): the columns are probably factors if you used one of the standard read.* functions without stringsAsFactors=FALSE, hence the need to convert with as.character.
dat <- read.table(text = "1 Number Type Code Reason
2 0123 06 09 010
3 Date Amount Damage Act
4 08/31/16 10,000 Y N
5 State City Zip Phone
6 WI GB 1234 Y")
Then you can set the odd-numbered rows as the names of the vector of values taken from the even-numbered rows:
setNames(unlist(lapply(dat[!c(TRUE, FALSE), ], as.character)),
         unlist(lapply(dat[c(TRUE, FALSE), ], as.character)))
1 3 5 Number Date State Type
"2" "4" "6" "0123" "08/31/16" "WI" "06"
Amount City Code Damage Zip Reason Act
"10,000" "GB" "09" "Y" "1234" "010" "N"
Phone
"Y"
The !c(TRUE,FALSE) and its logical complement in the next extract operation get magically recycled along all the rows. Obviously there would be better ways of doing this if you posted a version of a text file rather than saying that the starting point was a dataframe. You would need to remove what were probably rownames. If you want a "clean" solution, post either dput(.) of your dataframe or the raw text file.
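If the goal is a single-row data frame like the desired output above, a hypothetical follow-up (assuming the first column of dat holds row numbers, which are dropped here):
v <- setNames(unlist(lapply(dat[!c(TRUE, FALSE), -1], as.character)),
              unlist(lapply(dat[c(TRUE, FALSE), -1], as.character)))
as.data.frame(t(v), stringsAsFactors = FALSE)   # one row, named columns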

read.table cut a long row in half. how to reorganize the csv file

c <- read.table("sid-110-20130826T164704.csv", sep = ",", fill = TRUE)
so I use the above code to read about 300 csv files.
and some files look like this
65792,1,round-5,72797,140,yellow,75397,192,red,75497,194,crash
86267,1,round6,92767,130,yellow,94702,168,brake,95457,178,go,95807,185,red,96057,190,brake,97307,200,crash
108092,1,round-7,116157,130,yellow,117907,165,red
120108,1,round-8,130173,130,yellow,130772,142,brake,133173,152,red
137027,1,round-9,147097,130,yellow,148197,152,brake,148597,160,red
As you can see, the second line is longer than the others (the third element of each row is supposed to hold the round number), and when I read these files with read.table, R cuts that line in half. Below I copied the first 5 columns from R:
9 86267 1 round-6 92767 130
10 95807 185 red 96057 190
11 108092 1 round-7 116157 130
12 120108 1 round-8 130173 130
Is there a way to fix this so that the row stays as one line instead of being split?
You can prime the data.frame width by specifying the "col.names" argument along with "fill=TRUE", as in:
c <- read.table("sid-110-20130826T164704.csv", sep = ",", fill = TRUE,
                col.names = paste("V", 1:21, sep = ""))
That's assuming you know how many columns you have. If you don't know, you might want to make a single pass through the file to find the maximum width.
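One way to make that single pass is with base R's count.fields(), assuming the same comma separator throughout:
nfields <- max(count.fields("sid-110-20130826T164704.csv", sep = ","))
c <- read.table("sid-110-20130826T164704.csv", sep = ",", fill = TRUE,
                col.names = paste0("V", seq_len(nfields)))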

remove degree symbol from numeric values in data frame

I've had some temperature measurements in .csv format and am trying to analyse them in R. For some reason the data files contain the temperature with a degree C following the numeric value. Is there a way to remove the degree C symbol and return the numeric value? I thought of producing an example here but did not know how to generate a degree symbol in a string in R. Anyhow, this is what the data looks like:
> head(mm)
dateTime Temperature
1 2009-04-23 17:01:00 15.115 °C
2 2009-04-23 17:11:00 15.165 °C
3 2009-04-23 17:21:00 15.183 °C
where the class of mm[,2] is 'factor'
Can anyone suggest a method for converting the second column to 15.115 etc?
You can remove the unwanted part and convert the rest to numeric all at the same time with scan(). Setting flush = TRUE treats the last field (after the last space) as a comment and it gets discarded (since sep expects whitespace separators by default).
mm <- read.table(text = "dateTime Temperature
1 '2009-04-23 17:01:00' '15.115 °C'
2 '2009-04-23 17:11:00' '15.165 °C'
3 '2009-04-23 17:21:00' '15.183 °C'", header = TRUE)
replace(mm, 2, scan(text = as.character(mm$Temp), flush = TRUE))
# dateTime Temperature
# 1 2009-04-23 17:01:00 15.115
# 2 2009-04-23 17:11:00 15.165
# 3 2009-04-23 17:21:00 15.183
Or you can use a Unicode general category to match the unicode characters for the degree symbol.
type.convert(sub("\\p{So}C", "", mm$Temp, perl = TRUE))
# [1] 15.115 15.165 15.183
Here, the regular expression \p{So} matches various symbols that are not math symbols, currency signs, or combining characters. C matches the character C literally (case sensitive). And type.convert() takes care of the extra whitespace.
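To write the cleaned values back into the data frame, something along these lines should do (as.is = TRUE keeps type.convert() from returning a factor):
mm$Temperature <- type.convert(sub("\\p{So}C", "", mm$Temperature, perl = TRUE),
                               as.is = TRUE)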
If all of your temperature values have the same number of digits you can make left and right functions (similar to those in Excel) to select the digits that you want. Such as in this answer from a different post: https://stackoverflow.com/a/26591121/4459730
First make the left function:
left <- function(string, char) {
  substr(string, 1, char)
}
Then recreate your Temperature string using just the digits you want:
mm$Temperature <- left(as.character(mm$Temperature), 6)
The degree symbol is represented as "\u00b0", hence the following should work:
mm$Temperature <- as.numeric(sub("\u00b0C", "", mm$Temperature))

How to import a mixed tab and comma delineated ASCII file in R

I have an ASCII file that includes a set of MODIS data containing a series of pixel values for each acquisition date. The data format is:
Header values are space delimited.
Data values start after the header rows and are comma delimited.
An example of two dates from the data is shown below:
----------------------------------------------------------------------------
MODIS HDF Tile MOD13Q1.A2003273.h11v03.005.2008260032604.hdf
Scientific Data Set (Band) 250m_16_days_EVI
Number of Values Passing QA Filter 81 of 81
Applying the Scale of .0001 MEAN: 0.24070987654321, STD-DEV: 0.0257345931611507
Unscaled MEAN: 2407.0987654321, STD-DEV: 257.345931611507
2213,2160,2206,2408,2369,2362,2423,2466,2318,2160,2429,2316,2260,2362,2431,2172,2021,2254,2424,2391,2427,2331,1934,2220,2235,2254,2186,2325,2046,1956,2273,2220,2235,2257,2425,2534,2141,2288,2273,2263,2436,2568,2603,2470,2561,2288,2369,2628,2725,2730,2603,2704,2744,2732,2624,2606,2694,2730,2718,2765,2771,2732,2771,2726,2694,2637,2699,2806,2712,2384,1904,1982,2747,2788,2610,2647,2408,2096,1946,1858,1791
----------------------------------------------------------------------------
MODIS HDF Tile MOD13Q1.A2003289.h11v03.005.2008263131227.hdf
Scientific Data Set (Band) 250m_16_days_EVI
Number of Values Passing QA Filter 81 of 81
Applying the Scale of .0001 MEAN: 0.261756790123457, STD-DEV: 0.0232843291670261
Unscaled MEAN: 2617.56790123457, STD-DEV: 232.843291670261
2074,2323,2382,2574,2614,2661,2631,2599,2525,2399,2548,2545,2541,2599,2415,2428,2417,2518,2549,2471,2539,2520,2407,2358,2426,2461,2575,2427,2412,2518,2500,2394,2509,2567,2569,2648,2414,2573,2498,2626,2509,2708,2694,2654,2702,2536,2750,2804,2917,2926,2942,2938,2844,2839,2863,2985,3006,2991,2997,2937,2830,2838,2607,3101,3093,3085,2950,2881,2608,2570,2499,2233,2912,2833,2819,2348,2426,2541,2243,2239,2071
A typical ASCII file has about 900 dates, i.e. 900 "tiles" of information in exactly the same format as listed above, one after another. The number of pixels is the same in each, i.e. 81 values for each date.
What I would like to do is read in the file and, for each date, extract the "MODIS HDF Tile" name, e.g. MOD13Q1.A2003289.h11v03.005.2008263131227.hdf, and put each pixel value into its own column, something like:
MODIS HDF Tile Scientific Data Set (Band) V2 V3 V4 V5 V6 V7...
MOD13Q1.A2003273.h11v03.005.2008260032604.hdf 250m_16_days_EVI 2213 2160 2206 2408 2369 .......
MOD13Q1.A2003289.h11v03.005.2008263131227.hdf 250m_16_days_EVI 2074 2323 2382 2574 2614 .....
Any help would be greatly appreciated!
Perhaps something like this could work
modis <- readLines("modis.txt")
# Header lines start with "MODIS"; the band line is assumed to follow each one
headers <- grep("^MODIS", modis)
# Column labels, taken from the first block (fields are split on 2+ spaces)
headtiles <- sapply(strsplit(modis[headers[1]], "\\s{2,}"), '[', 1)
headbands <- sapply(strsplit(modis[headers[1] + 1], "\\s{2,}"), '[', 1)
# The values themselves: second field of each header/band line
tiles <- sapply(strsplit(modis[headers], "\\s{2,}"), '[', 2)
bands <- sapply(strsplit(modis[headers + 1], "\\s{2,}"), '[', 2)
# Pixel rows are the lines containing lots of commas
pxlines <- grep("(,.*?){5,}", modis)
pixels <- do.call(rbind, lapply(strsplit(modis[pxlines], ","), as.numeric))
dd <- data.frame(tiles, bands, pixels)
names(dd) <- c(headtiles, headbands, paste0("pixel", seq.int(ncol(pixels))))
Here we grep through all the lines to find the header lines, then assume the next line after each is the band line, and finally look for lines with lots of commas to get the pixel values. This makes a lot of assumptions about the data file, based on the limited sample you provided.
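If the block layout is truly fixed, as in the sample (the pixel line always sits five lines below its "MODIS HDF Tile" header), a hypothetical simplification is to locate the pixel lines by offset instead of by counting commas:
pxlines <- headers + 5   # band, QA, scaled and unscaled stats, then pixels
pixels <- do.call(rbind, lapply(strsplit(modis[pxlines], ","), as.numeric))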
