I am a newbie with R. I have a large dataset (66M observations) with pixel temperature data for 4 water bodies (REF, LMB, OTH, FP) at hourly time steps (6am, 7am, 8am, ...), with many NA values representing blank pixels. I want to calculate a proxy for temperature heterogeneity/diversity for each water body at each time, using the Shannon diversity index or a similar index. So far I have managed to calculate basic summary statistics using a script found online, but I am not sure how to apply more specific diversity indexes.
My data has three columns: Temp (pixel temperature), time, and water (water body).
My code:
DF <- read.csv("DF_total.csv", stringsAsFactors = TRUE)
levels(DF$water)
[1] "OTH" "LMB" "REF" "FP"
levels(DF$time)
NULL
source("group_by_summary_stats.R")   # summary-stats helper script found online
summary <- group_by_summary_stats(DF, Temp, water, time)
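One idea would be to bin the temperatures into classes and compute the Shannon index of the bin proportions for each water body at each time step. A rough sketch (the 0.5 °C bin width is an arbitrary assumption, and this may not be the most appropriate index):

library(dplyr)

# Sketch: drop NA pixels, bin temperatures into 0.5-degree classes, then
# compute Shannon diversity H = -sum(p * log(p)) of the bin proportions
# for each water body at each time step.
shannon <- DF %>%
  filter(!is.na(Temp)) %>%
  mutate(temp_bin = cut(Temp,
                        breaks = seq(floor(min(Temp)), ceiling(max(Temp)) + 0.5, by = 0.5),
                        include.lowest = TRUE)) %>%
  count(water, time, temp_bin) %>%
  group_by(water, time) %>%
  summarise(H = -sum((n / sum(n)) * log(n / sum(n))), .groups = "drop")

The same bin counts could presumably also be fed to vegan::diversity() if an existing implementation of Shannon (or Simpson) is preferred.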
I have generated a decomposition of an additive time series for METAR wind data at a Norwegian airport. I noticed that the monthly average wind values do not correspond to the observed values shown in the decomposition chart: during January 2014 the average wind was measured at 5.74 kts, but the chart shows a dip to a value below 3 kts. When I separated each variable into its own dataset and ran the decomposition separately, the issue was resolved. Does this have something to do with the way the imported data is read? Sorry if this seems like a silly question. Screenshots and code are below. Thanks!
To define ts data:
RtestENGM_ts <- ts(test$Sknt, start=c(2012, 1), frequency=12)
To decompose ts data:
decomposed_test <- decompose(RtestENGM_ts, type="additive")
To plot decomposed data:
plot(decomposed_test)
To plot the ts data:
plot(RtestENGM_ts)
Input dataset (screenshot):
Decomposition of additive time series 2012-22 (screenshot):
I tried importing each variable individually as part of its own dataset, and this allowed the correct observed values to be plotted. I still do not understand why R needs the imported variables to be separate. Do I really need to split my data across dozens of spreadsheets? Does R struggle to isolate a single column during decomposition?
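One thing I plan to check (a sketch, assuming a Date column test$valid, which may not be the real column name): if test$Sknt holds more than one value per month (e.g. hourly METAR observations), ts(..., frequency = 12) will simply assign successive rows to successive months, which could explain the plotted values drifting away from the monthly averages. Aggregating to monthly means first would rule that out.

# Value the ts object actually holds for January 2014 -- to compare with the
# 5.74 kts monthly average computed from the raw data.
window(RtestENGM_ts, start = c(2014, 1), end = c(2014, 1))

# Hypothetical fix, assuming a Date column test$valid exists: aggregate to one
# mean wind speed per month before building the ts.
monthly    <- aggregate(Sknt ~ format(valid, "%Y-%m"), data = test, FUN = mean)
monthly_ts <- ts(monthly$Sknt, start = c(2012, 1), frequency = 12)
plot(decompose(monthly_ts, type = "additive"))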
I would like to create a model to understand how habitat type affects the abundance of bats found, but I am struggling to understand which terms I should include. I wish to use lme4 to fit a GLMM; I have chosen a GLMM because the response is count data, so the distribution is Poisson (you can't have half a bat), and the distribution is heavily skewed (lots of single bats).
My dataset is very large and comprises abundance counts recorded by an individual on a bat survey (the bat survey number is not included as it's public data). The dataset includes abundance, year, month, day, environmental variables (temp, humidity, etc.), recorded_habitat, surrounding_habitat, latitude and longitude, and is structured like the example below. P.S. An occurrence is an anonymous recording made by an observer at a set location; a number of bats will be recorded at a location. It's not relevant here as it comes from a larger dataset.
occurrence                 3456
abundance                  45
latitude                   53.56
longitude                  3.45
year                       2000
month                      5
day                        3
environmental variables    34.6
surrounding_hab            A
recorded_hab               B
Recorded habitat and surrounding habitat take letters (A-I) corresponding to habitat types. One example row is shown above, transposed, because the full table would not fit.
The models shown below are the ones I think are a good choice.
rhab1 <- glmer(individual_count ~ recorded_hab + (1|year) + latitude + longitude + sun_duration2, family = poisson, data = BLE)
summary(rhab1)
rhab2 <- glmer(individual_count ~ surrounding_hab + (1|year) + latitude + longitude + sun_duration2, family = poisson, data = BLE)
summary(rhab2)
I'll now explain my questions regarding the models I have chosen, with my current thinking/justification.
Firstly, I am confused about the mix of categorical and numeric variables: is it wise to include the environmental variables given that they are numeric? My current thinking is that scaling the environmental variables allowed the model to converge, so including them is okay? (A minimal scaling sketch is included after these questions.)
Secondly, I am confused about the mix of spatial and temporal variables, primarily whether I should include temporal variables when the predictor is itself temporal. I'd like to include year as a random effect, as bat populations in one year directly affect bat populations the next, and also include latitude and longitude. Does this seem wise?
I am also unsure whether latitude and longitude should be random effects. The confusion arises because latitude and longitude do have some effect on land use.
Additionally, is it wise to include recorded_habitat and surrounding_habitat in the same model? When I have tried this it produces a massive output with a huge correlation matrix, so I'm thinking I should run two models (year ~ recorded_hab) and (year ~ surrounding_hab) and then discuss them separately, hence the two models.
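For reference, this is the kind of scaling I meant in the first question; a rough sketch using the column names from the models above (not a recommendation):

library(lme4)

# Centre and scale the numeric covariates so the optimiser works on comparable
# scales; the categorical habitat term and the year random effect are unchanged.
BLE$latitude_z     <- as.numeric(scale(BLE$latitude))
BLE$longitude_z    <- as.numeric(scale(BLE$longitude))
BLE$sun_duration_z <- as.numeric(scale(BLE$sun_duration2))

rhab1_z <- glmer(individual_count ~ recorded_hab + latitude_z + longitude_z +
                   sun_duration_z + (1 | year),
                 family = poisson, data = BLE)
summary(rhab1_z)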
Sorry this question is so broad! Any help or thoughts are appreciated, including on data restructuring or model term choice. I'm also new to Stack Overflow, so please advise on question layout/rules etc. if there are glaringly obvious mistakes.
I have two arrays:
data1=array(-10:30, c(2160,1080,12))
data2=array(-20:30, c(2160,1080,12))
#Add in some NAs
ind <- which(data1 %in% sample(data1, 1500))
data1[ind] <- NA
One is modelled global gridded data (lon,lat,month) and the other, global gridded observations (lon,lat,month).
I want to assess how 'skillful' the modelled data is at recreating the obs. I think the best way to do this is with a spatial correlation between the datasets. How can I do that?
I tried a straightforward x <- cor(data1, data2), but that just returned NA (NA_real_).
Then I thought I probably have to break it up by month or season. So, looking at just one month, x <- cor(data1[,,1], data2[,,1]) returned a 1080 x 1080 matrix (most of which is NA).
How can I get a spatial correlation between these two datasets? I.e. I want to see where the modelled data performs well (high correlation with the observations) and where it does badly (low correlation with the observations).
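One idea (a rough sketch, not necessarily the best way): since cor() between two matrices correlates every column of the first with every column of the second (hence the 1080 x 1080 matrix above), correlate the two datasets cell by cell across the 12 months instead, which gives a lon x lat map of where the model tracks the observations:

# For every grid cell, correlate the 12 monthly model values with the 12
# monthly observed values; cells without enough non-missing pairs stay NA.
nlon <- dim(data1)[1]
nlat <- dim(data1)[2]
cor_map <- matrix(NA_real_, nrow = nlon, ncol = nlat)

for (i in seq_len(nlon)) {
  for (j in seq_len(nlat)) {
    x  <- data1[i, j, ]
    y  <- data2[i, j, ]
    ok <- complete.cases(x, y)
    if (sum(ok) >= 3) cor_map[i, j] <- cor(x[ok], y[ok])
  }
}

image(cor_map)   # quick look at the resulting skill map

For a single pattern-correlation number per month, flattening first also works: cor(as.vector(data1[,,1]), as.vector(data2[,,1]), use = "complete.obs").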
I've recently been studying DBSCAN with R for transit research purposes, and I'm hoping someone can help me with this particular dataset.
A summary of my dataset is shown below.
BTIME ATIME
1029 20001 21249
2944 24832 25687
6876 25231 26179
11120 20364 21259
11428 25550 26398
12447 24208 25172
What I am trying to do is cluster these data using BTIME as the x axis and ATIME as the y axis. A pair of BTIME and ATIME represents the boarding time and arrival time of a subway passenger.
For more explanation, I will add the scatter plot of my total dataset.
However, if I split my dataset into smaller time periods, the scatter plot looks like this. I would call this a sample dataset.
If I perform DBSCAN clustering on the second image (the sample dataset), the clustering is performed as expected.
However, it seems that DBSCAN cannot cluster the total dataset at smaller scales, maybe because the data is too dense.
So my questions are:
Is there a way I can perform clustering on the total dataset?
What criteria should be used to separate the time scale of the data? I think the total dataset is highly dense, which is why I tried clustering on a sample time period.
If I separate my total data into smaller time scales, how would I choose the hyperparameters for each separated dataset? Looking at the data, the distribution is similar in both the total dataset and the separated sample dataset.
I would sincerely appreciate any advice.
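For reference, this is the rough workflow I have in mind with the dbscan package (df is assumed to hold the full BTIME/ATIME data, and the eps/minPts values are placeholders, not tuned):

library(dbscan)

# Scale both time axes so a single eps is comparable in each direction.
X <- scale(df[, c("BTIME", "ATIME")])

# Heuristic for eps: plot distances to the k-th nearest neighbour
# (k = minPts - 1) and look for the knee of the curve.
kNNdistplot(X, k = 4)
abline(h = 0.15, lty = 2)          # placeholder eps read off the plot

cl <- dbscan(X, eps = 0.15, minPts = 5)
table(cl$cluster)                  # cluster 0 = noise

plot(df$BTIME, df$ATIME, col = cl$cluster + 1L, pch = 20,
     xlab = "BTIME", ylab = "ATIME")

If the dense total dataset still collapses into one big cluster, I understand hdbscan() (same package, only needs minPts) or OPTICS are common alternatives to splitting the data by time period.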
First post here :) I have a few really simple questions, marked below. I want to do kriging using PM10 daily data from 8 static stations in Santiago, Chile, from 1997-2012, onto 34 centroids which represent different counties. I explain what I've done so far, with some questions in between. I'm using just 2008 data (fewer missing values) as a first experiment.
DESCRIPTION:
DATA: I have a column with days from 1997 to 2012 and 8 columns of PM10 station data. I imported this data into R and established time:
time <- as.POSIXlt(data$date)
which ran with no error:
"1997-04-05 UTC" "1997-04-12 UTC" ... "2012-12-27 UTC"
I imported the stations with their coordinates and set their projection:
coordinates(stations) <- ~longitud + latitud
proj4string(stations) <- CRS("+proj=longlat +ellps=WGS84")
In order to create the STFDF, I first built the vector of PM10 data from the 8 stations, in order:
PM10 <- as.vector(cbind(data$PM10bosque, data$PM10cerrillos, data$PM10cerronavia,
                        data$PM10condes, data$PM10florida, data$PM10independencia,
                        data$PM10parqueoh, data$PM10pudahuel))
PM10 <- data.frame(PM10)
DFPM <- STFDF(stations, time, PM10)
DFPM <- as(DFPM, "STSDF")
The last line is there because I am working with missing data. Then the empirical variogram was estimated and modelled (I know the model is poor) with:
varPM10 <- variogramST(PM10 ~ 1, data = DFPM, tunit = "days", assumeRegular = FALSE, na.omit = TRUE)
sepVgm  <- vgmST("separable", space = vgm(0, "Exp", 8, 700), time = vgm(200, "Exp", 15, 700), sill = 100)
sepVgm  <- fit.StVariogram(varPM10, sepVgm)
Which results in:
Variograms (image)
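For reference, this is how I compared the empirical surface with the fitted separable model (a sketch, assuming the objects above):

library(gstat)

# Empirical spatio-temporal variogram vs. the fitted model: as a map of
# space/time lags, as per-time-lag lines, and as wireframes.
plot(varPM10, sepVgm)
plot(varPM10, sepVgm, map = FALSE)
plot(varPM10, sepVgm, wireframe = TRUE, all = TRUE)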
Then I used krigeST this way:
gridPM10   <- STF(centroids, time)   # centroids defined previously, the same way as stations
krigedPM10 <- krigeST(PM10 ~ 1, DFPM, newdata = gridPM10, modelList = sepVgm)
The result of plotting one station's data and the kriged data for that station county's centroid is:
Kriging result for Cerrillos county and its station data (image)
which looks as if the estimation occurs in time windows over sets of dates.
First question: Does anybody know why this kriging has this shape?
Then I wondered what would happen if I just used distance as a predictor, so I coded instead:
varPM10 <- variogramST(PM10~1,data=DFPM,tunit="days", tlags=0:0, assumeRegular=F,na.omit=T)
Second question: Is this a reasonable way to try just distance as a predictor? If not, any advice on how to adjust my code so I can do this would be very much appreciated. Anyway, this is the result:
Variogram with tlag=0:0 (image)
using
sepVgm <- vgmST("separable", space = vgm(1, "Per", 8, 700), time = vgm(200, "Exp", 15, 700), sill = 100)
By the way, how would you guys fit this? Anyway, the output really surprised me:
Kriging result with tlags=0:0 (image)
Third question: Why am I getting this result? I know the variogram model is poor, but even so, I understand the program should use the station data for the corresponding dates, so at the very least the estimates should change over time.
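For completeness, regarding the second question, a purely spatial single-day alternative I also considered (a rough sketch: the date and variogram starting values are placeholders, and I assume the column order below matches the station order in stations):

library(sp)
library(gstat)

day <- "2008-06-15"    # placeholder date; assumes data$date compares with this string
pm_vals <- unlist(data[data$date == day,
                       c("PM10bosque", "PM10cerrillos", "PM10cerronavia", "PM10condes",
                         "PM10florida", "PM10independencia", "PM10parqueoh", "PM10pudahuel")])

# Attach that day's values to the 8 station locations.
pm_day <- SpatialPointsDataFrame(coords = coordinates(stations),
                                 data   = data.frame(PM10 = as.numeric(pm_vals)),
                                 proj4string = CRS("+proj=longlat +ellps=WGS84"))

ok  <- !is.na(pm_day$PM10)
v_d <- variogram(PM10 ~ 1, pm_day[ok, ])              # very noisy with only 8 stations
m_d <- fit.variogram(v_d, vgm(100, "Exp", 8, 0))      # placeholder starting values
k_d <- krige(PM10 ~ 1, pm_day[ok, ], newdata = centroids, model = m_d)
spplot(k_d["var1.pred"])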