Ocean flows model in R/Excel (millions of data points) - r

I am building a stochastic model to predict the movement of objects floating in the ocean. I have thousands of data points from drifter buoys all around the world, in the format below:
index month year lat long
72615 10 2010 35,278 129,629
72615 11 2010 37,604 136,365
72615 12 2010 39,404 137,775
72615 1 2011 39,281 138,235
72620 1 2011 35,892 132,766
72620 2 2011 38,83 133,893
72620 3 2011 39,638 135,513
72620 4 2011 41,297 139,448
The general concept for the model is to divide the whole world into 2592 cells of 5x5 degrees each, and then build the Markov chain transition matrix using the rule that
the probability of going from cell i to cell j in 1 month equals:
the number of times any buoy went from cell i to cell j in 1 month,
divided by
the number of times any buoy exited cell i (including going from i to i).
However, I have two problems related to managing the data.
1. Is there an easy solution (preferably in Excel or R) to add a 6th column to the data set whose value would depend only on the latitude and longitude, such that it equals:
1 when both latitude and longitude are between 0 and 5,
2 when latitude is between 0 and 5 and longitude is between 5 and 10,
3 when latitude is between 0 and 5 and longitude is between 10 and 15,
and so on up to the number 2592.
2. Is there an easy way to count the number of times any buoy went from cell i to cell j in 1 month?
I was trying to figure out a solution to question 1 in Excel, but could not think of anything more efficient than sorting by the latitude/longitude columns and then filling in the values manually.
I've also been told that R is much better for managing such data sets, but I am not experienced with it and couldn't find a solution myself.
I would really appreciate any help.

Someone can probably come up with something much more sophisticated/faster, but this is a crude approach that has the benefit of being relatively easy to understand.
Sample data:
dd <- read.table(header=TRUE,dec=",",text="
index month year lat long
72615 10 2010 35,278 129,629
72615 11 2010 37,604 136,365
72615 12 2010 39,404 137,775
72615 1 2011 39,281 138,235
72620 1 2011 35,892 132,766
72620 2 2011 38,83 133,893
72620 3 2011 39,638 135,513
72620 4 2011 41,297 139,448")
Generate indices that equal 1 for (0-5), 2 for (5-10), etc.:
dd$x <- (dd$lat %/% 5) + 1
dd$y <- (dd$long %/% 5) + 1
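The x/y indices above are fine for the positive latitudes and longitudes in the sample, but latitudes south of the equator or longitudes west of Greenwich would give zero or negative indices. If you also want the single cell number from 1 to 2592 asked for in question 1, one possible sketch is to shift the coordinates first (the column names and the exact numbering scheme are my assumptions; any one-to-one mapping onto 1-2592 works for the transition matrix):
dd$lat_band  <- ((dd$lat + 90) %/% 5) + 1      # 1..36, after shifting latitude to 0-180
dd$long_band <- ((dd$long + 180) %/% 5) + 1    # 1..72, after shifting longitude to 0-360
dd$cell <- (dd$lat_band - 1) * 72 + dd$long_band   # single cell number, 1..2592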
Set up an empty matrix (not sure I have the rows/columns right)
mm <- matrix(0,nrow=36,ncol=72)
(you might want to use the dimnames argument here for clarity)
Fill it in:
for (i in 1:nrow(dd)) {
  mm[dd[i, "x"], dd[i, "y"]] <- mm[dd[i, "x"], dd[i, "y"]] + 1
}
If you have only thousands of rows, this might be fast enough. I would try it and see if you need something fancier. (If you need to collapse the matrix back to a set of columns, you can use reshape2::melt or tidyr::gather ...)
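For question 2 (counting how often a buoy moved from cell i to cell j within one month), a rough sketch in base R, assuming a dd$cell column like the one above and one row per buoy per month:
dd <- dd[order(dd$index, dd$year, dd$month), ]
dd$t <- dd$year * 12 + dd$month                        # month counter, to test for 1-month gaps
from      <- head(dd$cell, -1)
to        <- tail(dd$cell, -1)
same_buoy <- head(dd$index, -1) == tail(dd$index, -1)  # consecutive rows belong to the same buoy
one_month <- diff(dd$t) == 1                           # and are exactly one month apart
keep   <- same_buoy & one_month
counts <- table(factor(from[keep], levels = 1:2592),
                factor(to[keep],   levels = 1:2592))   # 2592 x 2592 matrix of transition counts
P <- counts / rowSums(counts)                          # transition probabilities; rows with no exits give NaN
This matches the definition in the question: each row sum of counts is the number of times any buoy exited cell i (including i to i), so dividing each row by its sum gives the transition probabilities.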

Related

Merge two data frames - no unique identifier

I would like to combine two data frames. One is information for birds banded. The other is information on recovered banded birds. I would like to add the recovery data to the banding data, if the bird was recovered (not all birds were recovered). Unfortunately the full band number is not included in the banding data, only in the recovery data, so there is not a unique column to join them by.
One looks like this:
GISBLong     GISBLat    B Flyway  B Month  B Year  Band Prefix Plus
-85.41667    42.41667   8         5        2001    12456
-85.41655    36.0833    9         6        2003    21548
The other looks like this:
GISBLong     GISBLat    B Flyway  B Month  B Year  Band       R Month  R Year
-85.41667    42.41667   8         5        2001    124565482  12       2002
-85.41655    36.0833    9         6        2003    215486256  1        2004
I have tried merge, ifelse, and dplyr joins, with no luck. Any suggestions? Thanks in advance!
You should look up rbind(); that might do the trick. For it to work, the data frames have to have the same columns. I'd suggest adding the missing columns to your first data frame with dplyr::mutate() and later eliminating the useless rows.
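A minimal sketch of that suggestion, assuming the two data frames are called banding and recovery and that the column names are the ones shown above with the spaces turned into dots (adjust both to your real names):
library(dplyr)
banding2  <- banding  %>% mutate(Band = NA, R.Month = NA, R.Year = NA)   # columns only the recovery data has
recovery2 <- recovery %>% mutate(Band.Prefix.Plus = NA)                  # column only the banding data has
cols <- union(names(banding2), names(recovery2))     # same columns in the same order
combined <- rbind(banding2[cols], recovery2[cols])   # stack the two data frames
Rows that turn out to be useless after stacking can then be filtered out, as the answer suggests.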

Is there a way I can use R code to calculate the average price for specific days? (AVERAGEIF function)

Firstly: I have seen other posts about AVERAGEIF translations from Excel into R, but I didn't see one that worked for my specific case and I couldn't get one to work myself.
I have a dataset which encompasses the daily pricings of a bunch of listings.
It looks like this
listing_id date price
1 1000 1/2/2015 $100
2 1200 2/4/2016 $150
Sample of the dataset (and desired outcome): https://send.firefox.com/download/228f31e39d18738d/#rlMmm6UeGxgbkzsSD5OsQw
The dataset I would like to have has only the date and the average prices of all listings on that date. The goal is to get a (different) dataframe which would look something like this so I can work with it:
Date Average Price
1 4/5/2015 204.5438
2 4/6/2015 182.6439
3 4/7/2015 176.553
4 4/8/2015 182.0448
5 4/9/2015 183.3617
6 4/10/2015 205.0997
7 4/11/2015 197.0118
8 4/12/2015 172.2943
I created this in Excel using the AVERAGEIF function (and copy-pasting by value) from the sample provided above.
I tried to format the data in Excel first, where I could use the AVERAGEIF function to take the average if the row has a specific date. The problem with this is that the dataset consists of 30 million rows and Excel only allows about 1 million, so it didn't work.
What I have done so far: I created a data frame in R (which I want the average prices to go into) using
Avg = data.frame("Date" =1:2, "Average Price"=1:2)
Avg[nrow(Avg) + 2036,] = list("v1","v2")
Avg$Date = seq(from = as.Date("2015-04-05"), to = as.Date("2020-11-01"), by = 'day')
I tried to create an AVERAGEIF-like function based on this article and another, but could not get it to work.
I hope this is enough information to go on otherwise I would be more than happy to provide more.
If your question is how to replicate the AVERAGEIF function, you can use logical indexing:
R code:
> df
Dates Prices
1 1 100
2 2 120
3 3 150
4 1 320
5 2 250
6 3 210
7 1 102
8 2 180
9 3 150
idx <- df$Dates == 1 # Positions where condition is true
mean(df$Prices[idx]) # Prints same output as Excel
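To get the average for every date at once (the data frame the question actually asks for), the same idea can be applied per group, for example with aggregate from base R. A sketch, assuming the data frame is called listings, with the columns from the question and price stored as text with a $ sign:
listings$price <- as.numeric(gsub("[$,]", "", listings$price))   # "$100" -> 100
avg <- aggregate(price ~ date, data = listings, FUN = mean)      # one row per date
names(avg) <- c("Date", "Average Price")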

Possible forecast algorithms when time series is short with quarterly data spikes

I have a year's data with quarterly spikes, like below:
Sample code in R to create the dataframe:
x <- data.frame("Month" = c(1:12), "Count" = c(110,220,2500,150,180,1800,300,550,5000,205,313,4218))
Here is how the data looks:
Month Count
1 110
2 220
3 2500
4 150
5 180
6 1800
7 300
8 550
9 5000
10 205
11 313
12 4218
We can see that the last month of every quarter has a spike. My target is to forecast the next year based on this data. I tried linear regression with some feature engineering (like how far a month is from the end of its quarter), and the results were obviously not satisfactory, as there doesn't appear to be a linear dependency.
I tried other techniques like seasonal naive and STLF (using R), and am currently going through a few interpolation techniques (like Lagrange or Newton interpolation); there appears to be a lot of material to study. Can anyone suggest a good possible solution that I can explore further?
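Since the question already mentions seasonal naive and STLF, here is a minimal sketch with the forecast package, treating the counts as a series with a 3-month seasonal period (an assumption based on the quarter-end spikes), not a definitive recommendation:
library(forecast)
x <- data.frame("Month" = c(1:12), "Count" = c(110,220,2500,150,180,1800,300,550,5000,205,313,4218))
y <- ts(x$Count, frequency = 3)                       # one "season" = a 3-month quarter
fc_snaive <- snaive(y, h = 12)                        # seasonal naive: repeat the last quarter's pattern
fc_stlf   <- stlf(y, h = 12, s.window = "periodic")   # STL decomposition, then forecast the adjusted series
plot(fc_stlf)
With only one year of data any model is extrapolating from a single cycle, so expect wide prediction intervals.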

How to apply a summarization measure to matching data.frame columns in R

I have a hypothetical data-frame as follows:
# inventory of goods
year category count-of-good
2010 bikes 1
2011 bikes 3
2013 bikes 5
2010 skates 1
2011 skates 1
2013 skates 0
2010 skis 0
2011 skis 2
2013 skis 2
My end goal is to show a stacked bar chart of how the %-<good>-of-decade-total has changed year-to-year.
Therefore, I want to compute the following:
Now, I should be able to do ggplot(df, aes(factor(year), fill = percent.good.of.decade.total)) + geom_bar(), or similar (hopefully!), creating a bar chart where each bar sums to 100%.
However, I'm struggling to work out how to compute percent.good.of.decade.total (the far-right column) in a non-hacky way. Thanks for your time!
You can use dplyr to compute the sum:
library("dplyr")
newDf <- df %>% group_by(year) %>% mutate(decades.total.goods = sum(count.of.good)) %>% ungroup()
Either use mutate or normal R syntax to compute the "% good of decade total"
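For example, continuing with the same made-up names:
newDf <- newDf %>% mutate(percent.good.of.decade.total = 100 * count.of.good / decades.total.goods)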
Note: you have not shared your exact data-frame, so the names are obviously made up.
We can do this with ave from base R
df1$decades.total.goods <- with(df1, ave(count.of.good, year, FUN = sum))
df1$decades.total.goods
#[1] 2 6 7 2 6 7 2 6 7

Merge spatial point dataset with Spatial grid dataset using R. (Master dataset is in SP Points format)

I am working on spatial datasets using R.
Data Description
My master dataset is in SpatialPointsDataFrame format and has surface temperature data (column names - "ruralLSTday", "ruralLSTnight") for every month. Data snippet is shown below:
Master Data - (in SpatialPointsDataFrame format)
TOWN_ID ruralLSTday ruralLSTnight year month
2920006.11 2920006 303.6800 289.6400 2001 0
2920019.11 2920019 302.6071 289.0357 2001 0
2920015.11 2920015 303.4167 290.2083 2001 0
3214002.11 3214002 274.9762 293.5325 2001 0
3214003.11 3214003 216.0267 293.8704 2001 0
3207010.11 3207010 232.6923 295.5429 2001 0
Coordinates:
longitude latitude
2802003.11 78.10401 18.66295
2802001.11 77.89019 18.66485
2803003.11 79.14883 18.42483
2809002.11 79.55173 18.00016
2820004.11 78.86179 14.47118
I want to add columns to the above data for rainfall and air temperature. This data is present in a SpatialGridDataFrame, in the table "secondary_data", for every month. A snippet of "secondary_data" is shown below:
Secondary Data - (in SpatialGridDataFrame format)
month meant.69_73 rainfall.69_73
1 1 25.40968 0.6283871
2 2 26.19570 0.4580542
3 3 27.48942 1.0800000
4 4 28.21407 4.9440000
5 5 27.98987 9.3780645
Coordinates:
longitude latitude
[1,] 76.5 8.5
[2,] 76.5 8.5
[3,] 76.5 8.5
[4,] 76.5 8.5
[5,] 76.5 8.5
Question
How do I add the columns from the secondary data to my master data by matching on latitude, longitude and month? Currently the latitude/longitude values in the two tables above will not match exactly, as the master data is a set of points and the secondary data is a grid.
Is there a way to find the grid square in the secondary data that each lat/long in my master data falls into, and interpolate?
If your SpatialPointsDataFrame object is called x, and your SpatialGridDataFrame is called y, then
x <- cbind(x, over(x, y))
will add the attributes (grid cell values) of y matching to the locations of x, to the attributes of x. Match is done by point-in-grid cell.
Interpolation is a different question; a simple way would be inverse distance with the four nearest neighbours, e.g. by
library(gstat)
x = idw(meant.69_73~1, y, x, nmax = 4)
Whether you want one or the other really depends on what your grid cells mean: do they refer to (i) the point value at the grid cell centre, (ii) a value that is constant throughout the grid cell, or (iii) an average value over the whole grid cell? In the first case interpolate, in the second use over, in the third use area-to-point interpolation (not explained here).
The R package raster offers similar functionality, but uses different names.
