My question follows this previous question, which has been partially answered: Reading a grib2 dataset with 4 dimensions and 2 variables with R
I am trying to read a GRIB2 file with R. This file is a probabilistic meteorological forecast: 10 variables, 1 lead time, 17 longitudes, 23 latitudes, and 51 members.
I can extract the data with the terra package and this script:
require(terra)
require(dplyr)
require(data.table)
require(stats)
destfile <- "C:/Users/XXX/Documents/example_grib_file_3"
## Read the file
grib_data <- terra::rast(destfile)
print(grib_data)
## Convert to data frame
df <- as.data.frame(grib_data, xy=TRUE)
## Colnames is a combination of members (50) X time (57) X variables (2)
colNames <- paste(names(grib_data), as.character(time(grib_data)), sep="_")
colnames(df) <- c("lon", "lat", colNames)
df2 <- data.table::melt(as.data.table(df), c("lon", "lat"))
## Split variable and time
df2$time_UTC <- sub(".*_", "", df2$variable)
df2$variable <- sub("_.*", "", df2$variable)
## Add members
df2 <- df2 %>% group_by(lon, lat, variable, time_UTC) %>% mutate(member=(1:length(value)))
##Convert to array
df_array <- stats::xtabs(value~lon+lat+variable+member+time_UTC, df2, drop=F)
The BIG problem is that I can't retrieve the metadata concerning the perturbation number (member). For now, variables and members are mixed, giving 500 columns, with the variable name repeated 50 times. The member is not always at the same position along the other dimension (variable). For example, one particular member is in position 6 for temperature data, and position 50 for precipitation data.
So the line "Add member" in my code is totally irrelevant and needs the "perturbation number" to arrange the array. If I use eccodes, there is a field called "perturbationNumber", how can I retrieve it from terra and R ?
The example file : https://drive.google.com/file/d/1kUfTdtAdNJpugMpPhdQ6tY4QZaZlUQcA/view?usp=sharing
The example file in grib2 : https://drive.google.com/file/d/1Zsf8uajm5AOXTiCus6PyANZAsJ5cKsc1/view?usp=sharing
These are the different parameters retrieved using the eccodes command-line tools:
grib_ls -l 48.5,3.5,1 -p paramId,name,typeOfLevel,level,dataDate,stepRange,dataType,shortName,step,perturbationNumber PATH_TO_FILE
paramId name typeOfLevel level dataDate stepRange dataType shortName step perturbationNumber value
167 2 metre temperature surface 0 20230124 0 pf 2t 0 35 273.517
167 2 metre temperature surface 0 20230124 0 pf 2t 0 3 273.589
167 2 metre temperature surface 0 20230124 0 pf 2t 0 50 273.229
etc...
Thanks for any help
With your example file
f <- "example_grib_file_3"
library(terra)
#terra 1.7.5
x <- rast(f)
This is the metadata that is available
names(x)[1]
#[1] "SFC (Ground or water surface); 2 metre temperature [C]"
time(x)[1]
#[1] "2023-01-24 UTC"
units(x)[1]
#[1] "C"
And with version 1.7-5 (currently the development version) you can also get raw metadata
metadata(x)[[1]]
# [,1] [,2]
#[1,] "GRIB_COMMENT" "2 metre temperature [C]"
#[2,] "GRIB_ELEMENT" "2T"
#[3,] "GRIB_FORECAST_SECONDS" "0"
#[4,] "GRIB_REF_TIME" "1674518400"
#[5,] "GRIB_SHORT_NAME" "0-SFC"
#[6,] "GRIB_UNIT" "[C]"
#[7,] "GRIB_VALID_TIME" "1674518400"
I do not know if there are ways to get other metadata from this file, but in principle that is possible: see your previous question.
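A possible sketch, assuming the eccodes command-line tools are installed and on the PATH: grib_ls (as used in the question) can be called from R and its text output captured, which exposes perturbationNumber for each GRIB message. Matching the message order to the terra layer order is an assumption that should be verified for this file.
## sketch: call grib_ls from R and capture its output (assumes eccodes is installed)
ls_out <- system2("grib_ls", args = c("-p", "shortName,perturbationNumber", f),
                  stdout = TRUE)
head(ls_out)  # a header line, one line per message, then summary lines to strip
If the ordering matches, the per-message perturbationNumber can then be attached to names(x) to replace the "Add members" step.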
There are five polygons for five different cities (see the attached file in the link; it's called bound.shp). I also have a point file "points.csv" with longitude and latitude where, for each point, I know the proportion of people belonging to group m and group h.
I am trying to calculate the spatial segregation proposed by Reardon and O'Sullivan, "Measures of Spatial Segregation".
There is a package called "seg" which should allow me to do this. I have been trying, but so far without success.
Here is the link to the example file: LINK. After downloading the "example", this is what I do:
setwd("~/example")
library(seg)
library(sf)
bound <- st_read("bound.shp")
points <- st_read("points.csv", options=c("X_POSSIBLE_NAMES=x","Y_POSSIBLE_NAMES=y"))
#I apply the following formula
seg::spseg(bound, points[ ,c(group_m, group_h)] , smoothing = "kernel", sigma = bandwidth)
Error: 'x' must be a numeric matrix with two columns
Can someone help me solve this issue? Or is there an alternate method which I can use?
Thanks a lot.
I don't know exactly what the spseg function does, but looking at the spseg documentation in the seg package:
The first argument x should be a data frame or an object of class Spatial.
The second argument data should be a matrix or data frame.
Looking at the examples for the spseg function, note that data should have the same number of rows as the number of ids in the Spatial object. In your sample, the ids are the cities, each with its own polygon.
First, let's examine the bound data:
setwd("~/example")
library(seg)
library(sf)
#For the fortify function
library(ggplot2)
bound <- st_read("bound.shp")
bound <- as_Spatial(bound)
class(bound)
"SpatialPolygonsDataFrame"
attr(,"package")
"sp"
tail(fortify(bound))
Regions defined for each Polygons
long lat order hole piece id group
5379 83.99410 27.17326 972 FALSE 1 5 5.1
5380 83.99583 27.17339 973 FALSE 1 5 5.1
5381 83.99705 27.17430 974 FALSE 1 5 5.1
5382 83.99792 27.17552 975 FALSE 1 5 5.1
5383 83.99810 27.17690 976 FALSE 1 5 5.1
5384 83.99812 27.17700 977 FALSE 1 5 5.1
So you have 5 ids in your SpatialPolygonsDataFrame. Now let's read points.csv with the read.csv function, since the data needs to be in matrix format for the spseg function.
points <- read.csv("c://Users/cemozen/Downloads/example/points.csv")
tail(points)
group_m group_h x y
950 4.95 78.49000 84.32887 26.81203
951 5.30 86.22167 84.27448 26.76932
952 8.68 77.85333 84.33353 26.80942
953 7.75 82.34000 84.35270 26.82850
954 7.75 82.34000 84.35270 26.82850
955 7.75 82.34000 84.35270 26.82850
The documentation and the example within it state that the number of rows of the data (which has two attributes, group_m and group_h here) should equal the number of ids (the cities). You could, for example, compute the mean (or another statistic) of the points per polygon for each city, so that each polygon contributes exactly one value per group; a rough sketch of that idea follows.
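A hedged sketch of that aggregation, assuming the polygon layer carries a city identifier column (here assumed to be named "city", as in the next answer) and that the resulting row order matches the polygon order in bound:
# sketch: average group_m / group_h per polygon so spseg() gets one row per city
library(sf)
bound_sf <- st_as_sf(bound)                              # back from sp to sf
pts_sf <- st_as_sf(points, coords = c("x", "y"), crs = st_crs(bound_sf))
pts_sf <- st_join(pts_sf, bound_sf)                      # attach the polygon id to each point
city_means <- aggregate(st_drop_geometry(pts_sf)[, c("group_m", "group_h")],
                        by = list(city = pts_sf$city), FUN = mean)
# check that the row order of city_means matches the polygon order in bound, then:
# spseg(bound, as.matrix(city_means[, c("group_m", "group_h")]))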
On the other hand, just to show that the function works properly when fed a matrix with 5 rows and 2 groups:
sample_spseg <- spseg(bound, as.matrix(points[1:5,c("group_m", "group_h")]))
print(sample_spseg)
Reardon and O'Sullivan's spatial segregation measures
Dissimilarity (D) : 0.0209283
Relative diversity (R): -0.008781
Information theory (H): -0.0066197
Exposure/Isolation (P):
group_m group_h
group_m 0.07577679 0.9242232
group_h 0.07516285 0.9248372
--
The exposure/isolation matrix should be read horizontally.
Read 'help(spseg)' for more details.
First: I do not have experience with the seg package and its functions.
What I read from your question is that you want to run the spseg function on the points within each area?
If so, here is a possible approach:
library(sf)
library(tidyverse)
library(seg)
library(mapview) # for quick viewing only
# read polygons, make valid to avoid problems later on
areas <- st_read("./temp/example/bound.shp") %>%
sf::st_make_valid()
# read points and convert to sf object
points <- read.csv("./temp/example/points.csv") %>%
sf::st_as_sf(coords = c("x", "y"), crs = 4326) %>%
# spatial join: add the city of each area to each point
sf::st_join(areas)
# what do we have so far??
mapview::mapview(points, zcol = "city")
# get the coordinates back into a data.frame
mydata <- cbind(points, st_coordinates(points))
# drop the geometry, we do not need it anymore
st_geometry(mydata) <- NULL
# looks like...
head(mydata)
# group_m group_h city X Y
# 1 8.02 84.51 2 84.02780 27.31180
# 2 8.02 84.51 2 84.02780 27.31180
# 3 8.02 84.51 2 84.02780 27.31180
# 4 5.01 84.96 2 84.04308 27.27651
# 5 5.01 84.96 2 84.04622 27.27152
# 6 5.01 84.96 2 84.04622 27.27152
# Split to a list by city
L <- split(mydata, mydata$city)
# loop over the list and run the spseg function
final <- lapply(L, function(i) spseg(x = i[, 4:5], data = i[, 1:2]))
# test for the first city
final[[1]]
# Reardon and O'Sullivan's spatial segregation measures
#
# Dissimilarity (D) : 0.0063
# Relative diversity (R): -0.0088
# Information theory (H): -0.0067
# Exposure/Isolation (P):
# group_m group_h
# group_m 0.1160976 0.8839024
# group_h 0.1157357 0.8842643
# --
# The exposure/isolation matrix should be read horizontally.
# Read 'help(spseg)' for more details.
spplot(final[[1]], main = "Equal")
With some effort and help from fellow Stack Overflow users, I have been able to parse a webpage and save it as a data frame. I want to repeat the same operation on multiple XML files and rbind the list. Here is what I tried and did successfully:
library(XML)
xml.url <- "http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml"
doc <- xmlParse(xml.url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
The above code works well. Now when I try to apply a function to do the same for multiple XML files:
ERS_ID <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762")
xml_url_test = as.vector(sprintf("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml",
ERS_ID))
XML_parser <- function(XML_url){
doc <- xmlParse(XML_url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
return(x_t)
}
major_test <- sapply(xml_url_test, XML_parser)
It runs, but gives me a long list that is not in the right data frame format, unlike what I generated for the single XML file.
Finally, I would also like to add a column to the final data frame that has the ERS number from the ERS_ID vector,
something like x_t$ERSid <- ERS_ID in the function.
Can someone point out what I am missing in the function, as well as any better ways to do the task?
Thanks!
Your main issue is using sapply over lapply(): the latter returns a list, while the former attempts to simplify the result to a vector or matrix, here a matrix.
major_test <- lapply(xml_url_test, XML_parser)
Of course, sapply is a wrapper for lapply and can also return a list with sapply(..., simplify=FALSE):
major_test <- sapply(xml_url_test, XML_parser, simplify=FALSE)
However, a few other items came up:
At the beginning, you are not concatenating your ERS_ID to the url stem with sprintf's %s operator. So right now, the same url is repeated.
At the end, you are not binding your list of data frames into a compiled final single data frame.
Add the new ERS column inside your defined function, passing in the ERS_ID value. While creating the column, also remove the ERS prefix with gsub.
R code (adjusted)
XML_parser <- function(eid) {
XML_url <- as.vector(sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", eid))
doc <- xmlParse(XML_url)
x <- xmlToDataFrame(getNodeSet(doc,"//SAMPLE_ATTRIBUTE"))
x$UNITS <- NULL
x_t <- t(x)
x_t <- as.data.frame(x_t)
names(x_t) <- as.matrix(x_t[1, ])
x_t <- x_t[-1, ]
x_t[] <- lapply(x_t, function(x) type.convert(as.character(x)))
x_t$ERSid <- gsub("ERS", "", eid) # ADD COL, REMOVE ERS
x_t <- x_t[,c(ncol(x_t),2:ncol(x_t)-1)] # MOVE NEW COL TO FIRST
return(x_t)
}
major_test <- lapply(ERS_ID, XML_parser)
# major_test <- sapply(ERS_ID, XML_parser, simplify=FALSE)
# BIND DATA FRAMES TOGETHER
finaldf <- do.call(rbind, major_test)
# RESET ROW NAMES
row.names(finaldf) <- seq(nrow(finaldf))
Using xml2 and the tidyverse you can do something like this:
require(xml2)
require(purrr)
require(tidyr)
urls <- rep("http://www.ebi.ac.uk/ena/data/view/ERS445758&display=xml", 2)
identifier <- LETTERS[seq_along(urls)] # Take a unique identifier per url here
parse_attribute <- function(x){
out <- data.frame(tag = xml_text(xml_find_all(x, "./TAG")),
value = xml_text(xml_find_all(x, "./VALUE")), stringsAsFactors = FALSE)
spread(out, tag, value)
}
doc <- map(urls, read_xml)
out <- doc %>%
map(xml_find_all, "//SAMPLE_ATTRIBUTE") %>%
set_names(identifier) %>%
map_df(parse_attribute, .id="url")
This gives you a 2x36 data.frame. To parse the column types I would suggest using readr::type_convert(out).
out looks as follows:
url age body product body site body-mass index chimera check collection date
1 A 28 mucosa Sigmoid colon 16.95502 ChimeraSlayer; Usearch 4.1 database 2009-03-16
2 B 28 mucosa Sigmoid colon 16.95502 ChimeraSlayer; Usearch 4.1 database 2009-03-16
disease status ENA-BASE-COUNT ENA-CHECKLIST ENA-FIRST-PUBLIC ENA-LAST-UPDATE ENA-SPOT-COUNT
1 remission 627051 ERC000015 2014-12-31 2016-10-21 1668
2 remission 627051 ERC000015 2014-12-31 2016-10-21 1668
environment (biome) environment (feature) environment (material) experimental factor
1 organism-associated habitat organism-associated habitat mucus microbiome
2 organism-associated habitat organism-associated habitat mucus microbiome
gastrointestinal tract disorder geographic location (country and/or sea,region) geographic location (latitude)
1 Ulcerative Colitis India 72.82807
2 Ulcerative Colitis India 72.82807
geographic location (longitude) host subject id human gut environmental package investigation type
1 18.94084 1 human-gut metagenome
2 18.94084 1 human-gut metagenome
medication multiplex identifiers pcr primers phenotype project name
1 ASA;Steroids;Probiotics;Antibiotics TGATACGTCT 27F-338R pathological BMRP
2 ASA;Steroids;Probiotics;Antibiotics TGATACGTCT 27F-338R pathological BMRP
sample collection device or method sequence quality check sequencing method sequencing template sex target gene
1 biopsy software pyrosequencing DNA male 16S rRNA
2 biopsy software pyrosequencing DNA male 16S rRNA
target subfragment
1 V1V2
2 V1V2
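A minimal usage sketch of the type conversion suggested above (assuming out is the data frame printed above):
# the spread columns are all character, so let readr guess suitable column types
library(readr)
out <- type_convert(out)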
purrr is really helpful here, as you can iterate over a vector of URLs or a list of XML files with map, or within nested elements with at_depth, and simplify the results with the *_df forms and flatten.
library(tidyverse)
library(xml2)
# be kind, don't call this more times than you need to
x <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762") %>%
sprintf("http://www.ebi.ac.uk/ena/data/view/%s&display=xml", .) %>%
map(read_xml) # read each URL into a list item
df <- x %>% map(xml_find_all, '//SAMPLE_ATTRIBUTE') %>% # for each item select nodes
at_depth(2, as_list) %>% # convert each (nested) attribute to list
map_df(map_df, flatten) # flatten items, collect pages to df, then all to one df
df
## # A tibble: 175 × 3
## TAG VALUE UNITS
## <chr> <chr> <chr>
## 1 investigation type metagenome <NA>
## 2 project name BMRP <NA>
## 3 experimental factor microbiome <NA>
## 4 target gene 16S rRNA <NA>
## 5 target subfragment V1V2 <NA>
## 6 pcr primers 27F-338R <NA>
## 7 multiplex identifiers TGATACGTCT <NA>
## 8 sequencing method pyrosequencing <NA>
## 9 sequence quality check software <NA>
## 10 chimera check ChimeraSlayer; Usearch 4.1 database <NA>
## # ... with 165 more rows
You can retrieve multiple IDs with a single REST url using a comma-separated list or range like ERS445758-ERS445762 and avoid multiple queries to the ENA.
This code gets all 5 samples into a node set and then applies functions using a leading dot in the XPath string so it's relative to that node.
ERS_ID <- c("ERS445758","ERS445759", "ERS445760", "ERS445761", "ERS445762")
url <- paste0( "http://www.ebi.ac.uk/ena/data/view/", paste(ERS_ID, collapse=","), "&display=xml")
doc <- xmlParse(url)
samples <- getNodeSet( doc, "//SAMPLE")
## check the first node
samples[[1]]
## get the sample attribute node set and apply xmlToDataFrame to that
x <- lapply( lapply(samples, getNodeSet, ".//SAMPLE_ATTRIBUTE"), xmlToDataFrame)
# labels for bind_rows
names(x) <- sapply(samples, xpathSApply, ".//PRIMARY_ID", xmlValue)
library(dplyr)
y <- bind_rows(x, .id="sample")
z <- subset(y, TAG %in% c("age","sex","body site","body-mass index") , 1:3)
sample TAG VALUE
15 ERS445758 age 28
16 ERS445758 sex male
17 ERS445758 body site Sigmoid colon
19 ERS445758 body-mass index 16.9550173
50 ERS445759 age 58
51 ERS445759 sex male
...
library(tidyr)
z %>% spread( TAG, VALUE)
sample age body site body-mass index sex
1 ERS445758 28 Sigmoid colon 16.9550173 male
2 ERS445759 58 Sigmoid colon 23.22543185 male
3 ERS445760 26 Sigmoid colon 20.76124567 female
4 ERS445761 30 Sigmoid colon 0 male
5 ERS445762 36 Sigmoid colon 0 male
I would like to discretize zip code data into regions.
I have character data
sample:
zip_code
'45654'
'12321'
'99453'
etc
I have 6 categories with rules:
region 1 - NE: 01000-19999
region 2 - SE: 20000-39999
region 3 - MW: 40000-58999,60000-69999
region 4 - SW: 70000-79999,85000-88499
region 5 - MT: 59000-59999,80000-84999,88900-89999
region 6 - PC: 90000-99999
I would like my output to be factor data:
region
'MW'
'NE'
'PC'
etc
Obviously, I know many ways to discretize the data (loops, ifelse, etc.), but none are clean and elegant.
Is there an elegant way to apply something like a case statement with 6 categories to discretize this data?
Okay, it's messy but this can work. I assume you'll have to use character objects since some zip codes start with 0. Note: replace these toy numbers with your actual zip codes.
zip_code <- c('1','6','15')
regions <- list(NE = as.character(1:3),
SE = as.character(4:6),
MW = as.character(7:9),
SW = as.character(10:12),
MT = as.character(13:15),
PC = as.character(16:19))
sapply(zip_code, function(x) names(regions[sapply(regions, function(y) x %in% y)]))
1 6 15
"NE" "SE" "MT"
Here is a data.table solution using foverlaps(...) and the full US zip code database in package zipcode for the example. Note that your definitions of the ranges are deficient: for instance there are zip codes in NH that are outside the NE range, and PR is completely missing.
library(data.table) # 1.9.4+
library(zipcode)
data(zipcode) # database of US zip codes (a data frame)
zips <- data.table(zip_code=zipcode$zip)
regions <- data.table(region=c("NE" , "SE", "MW", "MW", "SW", "SW", "MT", "MT", "MT", "PC"),
start =c(01000,20000,40000,60000,70000,85000,59000,80000,88900,90000),
end =c(19999,39999,58999,69999,79999,88499,59999,84999,89999,99999))
setkey(regions,start,end)
zips[,c("start","end"):=list(as.integer(zip_code),as.integer(zip_code))]
result <- foverlaps(zips,regions)[,list(zip_code,region)]
result[sample(1:nrow(result),10)] # random sample of the result
# zip_code region
# 1: 27113 SE
# 2: 36101 SE
# 3: 55554 MW
# 4: 91801 PC
# 5: 20599 SE
# 6: 90250 PC
# 7: 95329 PC
# 8: 63435 MW
# 9: 60803 MW
# 10: 07040 NE
foverlaps(...) works this way: suppose a data.table x has columns a and b that represent a range (i.e., a <= b for all rows), and a data.table y has columns c and d that similarly represent a range. Then foverlaps(x,y) finds, for each row in x, all the rows in y which have overlapping ranges.
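A minimal, hypothetical illustration of that setup (toy intervals, not the zip code data):
library(data.table)
ivals <- data.table(label = c("low", "high"), start = c(1L, 51L), end = c(50L, 100L))
setkey(ivals, start, end)                       # the "y" table must be keyed on its range columns
pts <- data.table(value = c(7L, 63L))
pts[, c("start", "end") := list(value, value)]  # degenerate ranges (single points)
foverlaps(pts, ivals)[, list(value, label)]
#    value label
# 1:     7   low
# 2:    63  high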
In your case we set up the y argument as the regions, where the ranges are the beginning and ending zipcodes for each (sub) region. Then we set up x as the original zip code database using the actual zip codes (converted to integer) for both the beginning and end of the range.
foverlaps(...) is extremely fast. In this case the full US zip code database (>44,000 zipcodes) was processed in about 23 milliseconds.
You could also try (Using #Scott Chamberlain's data)
with(stack(regions), unique(ind[ave(values %in% zip_code, ind, FUN=I)]))
#[1] NE SE MT
#Levels: MT MW NE PC SE SW
Or
library(dplyr)
library(tidyr)
unnest(regions, region) %>%
group_by(region) %>%
filter(x %in% zip_code)
# region x
#1 NE 1
#2 SE 6
#3 MT 15
Or
r1 <- vapply(regions, function(x) any(x %in% zip_code), logical(1))
names(r1)[r1]
#[1] "NE" "SE" "MT"
I am currently working on airmass trajectories for 11 different stations all over the city for one year.
For each station I have dataframes of 72-hour trajectories that look like this:
date lon/lat
yymmddhh_1 lon_1
yymmddhh_1 lat_1
yymmddhh_1 lon_2
yymmddhh_1 lat_2
yymmddhh_1 lon_3
yymmddhh_1 lat_3
I didn't put the longitude and latitude values in separate columns because I need them to be in one for my analysis.
The date column starts with a certain day (in my case 011022: 22/10/2001) and goes backwards for 72 hours in 1-hour steps, leaving me with 146 separate lon/lat values. I have trajectories for 329 days, so the dimension of the dataframe is dim=48180 x 2.
Now I need a new dataframe where the columns are my backward timesteps (t-0, t-1, t-2,...,t-72) and each row represents one trajectory (yymmddhh_1,yymmddhh_2,...,yymmddhh_329).
date t-0 t-0 t-1 t-1
yymmddhh_1 lon_1 lat_1 lon_2 lat_2
yymmddhh_2 lon_1 lat_1 lon_2 lat_2
yymmddhh_3 lon_1 lat_1 lon_2 lat_2
So I think my code needs to read column 2 of my current dataframe up to row=146, write these values in the first row of my new dataframe, and repeat the process until the end of the dataframe is reached.
I already managed to do that for the first 146 values, which is rather easy because I just need to
trajectory_1 <- t(station.trajectory[1:146,2])
I also already created the date column.
Maybe I can use read.table? I really have no idea where to start with this, so any help would be highly appreciated.
EDIT: To clear things up, here's an example of what the current dataframe looks like, and what the new one should look like:
[,1] is the date (format: YYMMDDHH), [,2] are the lon, lat values
[,1] [,2]
[1,] 2071000 525500
[2,] 2071000 133300
[3,] 2070923 524918
[4,] 2070923 134759
[5,] 2070922 524238
[6,] 2070922 136058
...
[146,] 2070700 140147
[147,] 2071100 525500
[148,] 2071100 133300
[149,] 2071023 525142
[150,] 2071023 128926
Note that at [147,] a new trajectory for the day following [1,] begins.
Keeping the content of [,1] is not important here; what my code should do in the end is take [,2] and make it look like this:
[,1] [,2] [,3] [,4] [,5]
[1,] 2071000 525500 133300 524918 134759
[2,] 2071100 ... ... ... ...
EDIT 2: I should also add that I am trying to prepare my data for k-means clustering (http://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html). Maybe I am not understanding the manual properly, but to me it looks like each trajectory should have its own row...
EDIT 3:
I tried writing a loop to do the work.
ind1<- matrix()
ind1 <- cbind(seq(0,48034,146))
ind1[1,] <- 1
First I created an index with steps of 146. My final dataframe will be named beusselstr.dataframe:
beusselstr.dataframe <- NULL
k<- NULL
The station "beusselstr" only has 115 days, so I want to use only the first 115 index values until 16790:
for (j in 1:115){
k[j] <- ind1[j+1]
beusselstr.dataframe[j] <- cbind(beusselstr.dataframe[j],t(beusselstr.trajectories[ind1[j]:k[j],2]))
}
However, I receive the error "number of items to replace is not a multiple of replacement length".
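A possible sketch of what the loop is trying to build (assuming each trajectory occupies exactly 146 consecutive rows of column 2, and that this station has 115 complete trajectories): filling a matrix row-wise avoids the loop entirely.
# sketch: one trajectory per row, 146 lon/lat values each
vals <- beusselstr.trajectories[1:(115 * 146), 2]
beusselstr.matrix <- matrix(vals, nrow = 115, ncol = 146, byrow = TRUE)
dim(beusselstr.matrix)   # 115 x 146, suitable as input to kmeans()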
First, let's generate some toy data:
df = as.data.frame(matrix(c(seq(2070700,2070700-72*2+1,-1),seq(2071100,2071100-72*2+1,-1),runif(72*4)),ncol=2))
colnames(df) = c('date','lon.lat')
df$date[seq(2,nrow(df),2)] = df$date[seq(1,nrow(df)-1,2)]
That's a matrix representing two sequences of coordinates, kind of similar to your example except that the date format is a bit different. The important point is that each date is repeated twice.
Next, the method I suggest relies on your data being sorted. In case your data is messy, you should re-order it before going forward:
df = df[order(df$date),]
The trick to doing the reshaping in an easy way is to add new columns that label recordings from the same trajectory (rec.nb) and the relative time (rec.time). As your data is now sorted, all you need to do is:
df$rec.nb = rep(seq(1:2),each=72*2)
df$rec.time = rep(seq(1:72),2)
(if you had 3 trajectories, you would put df$rec.nb = rep(seq(1:3),each=72*2), and so on)
Your data frame should now look like:
date lon.lat rec.nb rec.time
1 2070700 0.47047887 1 1
2 2070700 0.26357648 1 2
3 2070698 0.10793420 1 3
4 2070698 0.09126992 1 4
5 2070696 0.75242114 1 5
6 2070696 0.85941990 1 6
[...]
142 2070560 0.5561255161 1 70
143 2070558 0.7901997303 1 71
144 2070558 0.6179680785 1 72
145 2071100 0.0926457571 2 1
146 2071100 0.7780607140 2 2
147 2071098 0.7008311108 2 3
Finally, you can reshape your data:
reshape(df,v.names='lon.lat',timevar='rec.time',idvar='rec.nb',direction='wide')
outputting something along the lines of:
date rec.nb lon.lat.1 lon.lat.2 lon.lat.3 lon.lat.4 lon.lat.5 [...]
1 2070700 1 0.47047887 0.2635765 0.1079342 0.09126992 0.7524211 [...]
145 2071100 2 0.09264576 0.7780607 0.7008311 0.48613669 0.4928686 [...]
I'm trying to read a GRIB file wavedata.grib with wave heights from the ECMWF ERA-40 website, using an R function. Here is my source code so far:
mylat = 43.75
mylong = 331.25
# read the GRIB file
library(rgdal)
library(sp)
gribfile<-"wavedata.grib"
grib <- readGDAL(gribfile)
summary = GDALinfo(gribfile,silent=TRUE)
save(summary, file="summary.txt",ascii = TRUE)
# >names(summary): rows columns bands ll.x ll.y res.x res.y oblique.x oblique.y
rows = summary[["rows"]]
columns = summary[["columns"]]
bands = summary[["bands"]]
# z=geometry(grib)
# Grid topology:
# cellcentre.offset cellsize cells.dim
# x 326.25 2.5 13
# y 28.75 2.5 7
# SpatialPoints:
# x y
# [1,] 326.25 43.75
# [2,] 328.75 43.75
# [3,] 331.25 43.75
myframe<-t(data.frame(grib))
# myframe[bands+1,3]=331.25 myframe[bands+2,3]=43.75
# myframe[1,3]=2.162918 myframe[2,3]=2.427078 myframe[3,3]=2.211989
# These values should match the values read by Degrib (see below)
# degrib.exe wavedata.grib -P -pnt 43.75,331.25 -Interp 1 > wavedata.txt
# element, unit, refTime, validTime, (43.750000,331.250000)
# SWH, [m], 195709010000, 195709010000, 2.147
# SWH, [m], 195709020000, 195709020000, 2.159
# SWH, [m], 195709030000, 195709030000, 1.931
lines = rows * columns
mycol = 0
for (i in 1:lines) {
if (mylat==myframe[bands+2,i] & mylong==myframe[bands+1,i]) {mycol = i+1}
}
# notice mycol = i+1 in order to get values in column to the right
myvector <- as.numeric(myframe[,mycol])
sink("output.txt")
cat("lat:",myframe[bands+2,mycol],"long:",myframe[bands+1,mycol],"\n")
for (i in 1:bands) { cat(myvector[i],"\n") }
sink()
The wavedata.grib file has gridded SWH values for the period 1957-09-01 to 2002-08-31. Each grid point (lat/long pair) has a series of 16436 SWH values, one at 00h of each day, and the file has 16436 bands, one per day.
myframe has dimensions 16438 x 91. Notice 91 = 7 rows x 13 columns, and 16438 is the number of bands plus 2: the additional 2 rows are the long and lat values, and the remaining 16436 entries of each column should be the wave heights at one grid point.
The problem is that I want to extract the SWH (wave heights) at lat/long = 43.75,331.25, but they don't match the values I get reading the file with the Degrib utility at the same lat/long.
Also, the correct values I want (2.147, 2.159, 1.931, ...) are in column 4 and not column 3 of myframe, even though myframe[16438,3]=43.75 (lat) and myframe[16437,3]=331.25 (long). Why is this? I would like to know which lat/long the myframe[i,j] values actually correspond to, or whether there is a data import error in the process. I'm assuming Degrib has no errors.
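One hedged way to check that correspondence (a sketch, assuming the cell ordering of coordinates(grib) matches the row ordering of data.frame(grib), as it should for a SpatialGridDataFrame): ask sp for the coordinates of every grid cell and pick the cell closest to the target point.
# sketch: find the grid cell (i.e. the column of myframe) nearest the target point
xy <- coordinates(grib)                       # one lon/lat pair per grid cell
mycell <- which.min((xy[, 1] - mylong)^2 + (xy[, 2] - mylat)^2)
xy[mycell, ]                                  # should be 331.25, 43.75
swh_series <- as.numeric(as.data.frame(grib)[mycell, 1:bands])  # all bands at that cell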
Is there any R routine to easily interpolate values in a matrix if I want to extract values between grid points? More generally, I need help in writing an effective R function to extract wave heights like this:
SWH <- function (latitude, longitude, date/time)
Please help.