Missing data warning in R

I have a dataframe with climatic values like temperature_max, temperature_min, etc. for different locations. The data are a time series, and on some specific days there is no data recorded. I would like to impute the missing values taking into account the date and also the location (the PLACE variable in the dataframe).
I have tried to impute those missing values with Amelia, but no imputation is done and I get a warning.
Checking the variables with head(df):
PLACE DATE TEMP_MAX TEMP_MIN TEMP_AVG
F 12/01/2007 19.7 2.5 10.1
F 13/01/2007 18.8 3.5 10.4
F 14/01/2007 17.3 2.4 10.4
F 15/01/2007 19.5 4.0 9.2
F 16/01/2007
F 17/01/2007 21.5 2.8 9.7
F 18/01/2007 17.7 3.3 12.9
F 19/01/2007 18.3 3.8 9.7
A 16/01/2007 17.7 3.4 9.7
A 17/01/2007
A 18/01/2007 19.7 6.2 10.4
A 19/01/2007 17.7 3.8 10.1
A 20/01/2007 18.6 3.8 12.9
This is just some of the records of my data set.
DF = amelia(df, m=4, ts= c("DATE"), cs = c("PLACE"))
where DATE is a time series (01/01/2001, 02/01/2001, 03/01/2001, ...), but if you filter by PLACE the time series are not equal (they do not have the same start and end dates).
I have 3 questions:
1. I am not sure whether the time series should be complete for all the places, i.e. the same start and end dates for every place.
2. Since I am not using the lags or polytime parameters, am I imputing correctly, taking the time-series structure into account? I am not sure how to use the lags parameter, although I have checked the package documentation.
3. Finally, when I run that code I get the following warning and no imputation is done:
Warning: There are observations in the data that are completely missing.
These observations will remain unimputed in the final datasets.
-- Imputation 1 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 2 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 3 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 4 --
No missing data in bootstrapped sample: EM chain unnecessary
Can someone help me with this?
Thanks very much for your time!

For the software it does not matter if you have different start and end dates for different places. I think it is more a question of how you think about the data: I would ask myself whether those periods count as missing data (missing at random), and therefore whether or not I would create empty rows for them in the data set.
You want to use lags in order to use past values of the variable to improve the prediction of missing values. It is not mandatory (i.e., the function can impute missing data even without such a specification) but it can be useful.
I contacted the author of the package and he told me that you need to specify the splinetime or polytime arguments to make sure that Amelia will use the time-series information to impute. For instance, if you set polytime = 3, it will impute based on a cubic of time. If you do that, I think you shouldn't see that error anymore.
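For concreteness, a minimal sketch of such a call might look like the following. This is untested on your data: the polytime degree, intercs, and the choice of lags/leads columns are assumptions to adjust, and DATE may need to be converted to a numeric time index first.
library(Amelia)
# sketch only: polytime adds a cubic time trend, intercs lets it vary by PLACE,
# and lags/leads add neighbouring time points of the temperature variables
a.out <- amelia(df, m = 4,
                ts = "DATE", cs = "PLACE",
                polytime = 3,
                intercs = TRUE,
                lags  = c("TEMP_MAX", "TEMP_MIN", "TEMP_AVG"),
                leads = c("TEMP_MAX", "TEMP_MIN", "TEMP_AVG"))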

Related

SpatioTemporal Package error: "Length of 'LUR.in' does not match number of temporal trends."

The full dataset and code I am asking about here can be found at: https://github.com/claytonglasser/siuslaw-basin-precipitation
I am following the Comprehensive Tutorial for the Spatio-Temporal R-package using my own data. I am able to follow all of the steps until I get to the "3 createSTmodel(): Specifying the Spatio-Temporal model" section, at which point I encounter the following error, which I am having trouble interpreting.
My code is as follows:
LUR <- list(~ELEVATION)
cov.beta <- list(covf="exp", nugget=FALSE)
cov.nu <- list(covf="exp", nugget=~ELEVATION, random.effect=FALSE)
locations <- list(coords=c("LONGITUDE","LATITUDE"), long.lat=c("LONGITUDE","LATITUDE"))
siuslaw.ST.model <- createSTmodel(siuslaw.ST, LUR=LUR,
                                  ST=NULL,
                                  cov.beta=cov.beta, cov.nu=cov.nu,
                                  locations=locations)
When creating the siuslaw.ST.model variable, this error is returned:
Error in processLUR(STmodel, LUR) :
Length of 'LUR.in' does not match number of temporal trends.
I don't know how to approach fixing this problem because I'm not sure how to inspect/evaluate the components 'LUR.in' and 'temporal trends'.
Question: My assumption is that there is one temporal trend per location, and therefore 10 in this case. However, I also use the following code to tell the siuslaw.ST object to use 2 temporal basis functions. Is this the "temporal trends" being referred to?
siuslaw.ST <- updateTrend(siuslaw.ST, n.basis=2)
Question: I don't understand how the LUR argument works, what kind of object it expects as input, or how critical a role it plays.
LUR.in is defined as: A vector or list indicating which geographic covariates to use.
In the tutorial, multiple covariates are listed, prepended with ~'s like they are formulas. I only have the one LUR item, ELEVATION, from the siuslaw.ST$covars object:
> siuslaw.ST$covars
# A tibble: 10 x 4
ID LATITUDE LONGITUDE ELEVATION
<chr> <dbl> <dbl> <dbl>
1 US1ORLA0076 44.0 -124. 20.7
2 US1ORLA0003 44.0 -124. 20.4
3 US1ORLA0031 44.0 -124. 25.6
4 US1ORLA0091 44.1 -124. 64
5 USC00352973 44.0 -124. 22.9
6 USC00352972 44.0 -124. 3.7
7 USC00353995 43.9 -124. 35.1
8 US1ORLA0171 43.8 -123. 180.
9 USC00355204 44.0 -124. 5.2
10 US1ORLA0132 44.1 -124. 74.4
Notice there are 10 observations of ELEVATION. I think the LUR argument knows to look in siuslaw.ST$covars for the input, where I think it would find a single vector of 10 observations.
So in summary, why does the length of 'LUR.in' not match the number of temporal trends, and what do I need to inspect or change in order to make them match?
I know this question is a bit of a hydra. Please let me know anything I can clarify and I am happy to do so.
I was able to solve this through experimentation. The discrepancy between LUR.in and the number of temporal trends was due to the LUR argument missing the rest of the geographic covariates. Ultimately, I was able to create the model by modifying the input for the LUR argument like so:
LUR <- list(~ELEVATION, ~LATITUDE, ~LONGITUDE)
The tutorial chooses to list an arbitrary subset of 3 of the geographic covariates in LUR, and still works. I can't say I totally understand why all three of these formulas needed to be specified, as opposed to just ELEVATION. If anyone with more knowledge can shed some light on this, that would be great.
Is it possible you did not specify the "locations" list with both x,y and long,lat? If that does not happen, then it could interpret the long/lat as spatial covariates, which means you would then need to specify it in the "LUR" input. This would explain why adding them as covariates remedied your situation. However, I am surprised not having both x,y and long,lat specified did not throw an error.
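For reference, a rough sketch of that suggestion, untested, where X_KM and Y_KM are hypothetical projected-coordinate columns that would have to be added to siuslaw.ST$covars:
# hypothetical projected x/y columns (X_KM, Y_KM) alongside long/lat,
# so that LONGITUDE/LATITUDE are not picked up as spatial covariates
locations <- list(coords   = c("X_KM", "Y_KM"),
                  long.lat = c("LONGITUDE", "LATITUDE"))
siuslaw.ST.model <- createSTmodel(siuslaw.ST, LUR = list(~ELEVATION),
                                  cov.beta = cov.beta, cov.nu = cov.nu,
                                  locations = locations)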

Power BI: Make a line chart continuous when source contains null values (handle missing values)

This question spun off from a question I posted earlier:
Custom x-axis values in Power BI
Suppose the following data set:
Focus on the second and third rows. How can I make the line in the corresponding graph below continuous, rather than stopping in the middle?
In Excel I used to solve this problem by applying NA() in a formula generating data for the graph. Is there a similar solution using DAX perhaps?
The short version:
What you should do:
Unleash Python, and after following the steps there, insert this script (don't worry, I've added some details here as well):
import pandas as pd
# retrieve index column
index = dataset[['YearWeek_txt']]
# subset input data for further calculations
dataset2 = dataset[['1', '2']]
# handle missing values
dataset2 = dataset2.fillna(method='ffill')
Then you will be able to set it up like this:
Why you should do it:
For built-in options, these are your choices as far as I know:
1. Categorical YearMonth and categorical x-axis does not work
2. Categorical YearMonth and continuous x-axis does not work
3. Numerical YearMonth and categorical x-axis does not work
4. Numerical YearMonth and continuous x-axis does not work
The details, starting with why built-in approaches fail:
1. Categorical YearMonth and categorical x-axis
I've used the following dataset that resembles the screenshot of your table:
YearWeek 1 2
201603 2.37 2.83
201606 2.55
201607 2.98
201611 2.33 2.47
201615 2.14 2.97
201619 2.15 2.02
201623 2.33 2.8
201627 3.04 2.57
201631 2.95 2.98
201635 3.08 2.16
201639 2.50 2.42
201643 3.07 3.02
201647 2.19 2.37
201651 2.38 2.65
201703 2.50 3.02
201711 2.10 2
201715 2.76 3.04
And NO, I didn't bother manually copying ALL your data. Just your YearWeek series. The rest are random numbers between 2 and 3.
Then I set the data types up as numerical for 1 and 2, and YearWeek as type text in the Power Query Editor:
So this is the original setup with a table and chart like yours:
The data is sorted descending by YearWeek_txt:
And the x-axis is set up as Categorical:
Conclusion: Fail
2. Categorical YearMonth and continuous x-axis
With the same setup as above, you can try to change the x-axis type to Continuous:
But as you'll see, it just flips right back to 'Categorical', presumably because the type of YearWeek is text.
Conclusion: Fail
3. Numerical YearMonth and categorical x-axis
I've duplicated the original setup so that I've got two tables Categorical and Numerical where the type of YearWeek are text and integer, respectively:
So numerical YearMonth and categorical x-axis still gives you this:
Conclusion: Fail
4. Numerical YearMonth and continuous x-axis
But now, with the same setup as above, you are able to change the x-axis type to Continuous:
And you'll end up with this:
Conclusion: LOL
And now, Python:
In the Power Query Editor, activate the Categorical table, select Transform > Run Python Script and insert the following snippet in the Run Python Script Editor:
# 'dataset' holds the input data for this script
import pandas as pd
# retrieve index column
index = dataset[['YearWeek_txt']]
# subset input data for further calculations
dataset2 = dataset[['1', '2']]
# handle missing values
dataset2 = dataset2.fillna(method='ffill')
Click OK and click Table next to dataset2 here:
And you'll get this (make sure that the column data types are correct):
As you can see, no more missing values. dataset2 = dataset2.fillna(method='ffill') has replaced all missing values with the preceding value in both columns.
Click Close&Apply to get back to the Desktop, and enjoy your table and chart with no more missing values:
Conclusion: Python is cool
End note:
There are a lot of details that can go wrong here with decimal points, data types and so on. Let me know how things work out for you and I'll have a look at it again if it doesn't work on your end.

Using signal spikes to partition data set in R

I have an example data set that looks like this:
Ho<-c(12,12,12,24,12,11,12,12,14,12,11,13,25,25,12,11,13,12,11,11,12,14,12,2,2,2,11,12,13,14,12,11,12,3,2,2,2,3,2,2,1,14,12,11,13,11,12,13,12,11,12,12,12,2,2,2,12,12,12,12,15)
This data set has both positive and negative spikes in it that I would like to use as markers for calculating means within the data. I would define the start of a spike as any number that is 40% greater or less than the number preceding it. A spike ends when it jumps back by more than 40%. So ideally I would like to locate each spike in the data set and take the mean of the 5 data points immediately following the last number of the spike.
As can be seen, a spike can last for up to 5 data points. The rule for averaging I would like to follow is:
Start averaging after the last recorded spike data point, not after the first spike data point. So if a spike lasts for three data points, begin averaging after the third spiked data point.
So the ideal output would look something like this:
1= 12.2
2= 11.8
3= 12.4
4= 12.2
5= 12.6
The first spike is Ho[4], followed by the 5 numbers (12, 11, 12, 12, 14), for a mean of 12.2.
The next spike in the data is the pair Ho[13:14] (25, 25), followed by the set of 5 numbers (12, 11, 13, 12, 11), for an average of 11.8.
And so on for the rest of the sequence.
It kind of seems like you're actually defining a spike to mean differing from the "medium" values in the dataset, as opposed to differing from the previous value. I've operationalized this by defining a spike as being any data more than 40% above or below the median value (which is 12 for the sample data posted). Then you can use the nifty rle function to get at your averages:
# TRUE for values within 40% of the median (i.e. not part of a spike)
calm <- Ho >= median(Ho)*0.6 & Ho <= median(Ho)*1.4
r <- rle(calm)
run.begin <- cumsum(r$lengths)[r$values] - r$lengths[r$values] + 1
run.end <- run.begin + pmin(4, r$lengths[r$values] - 1)
# keep only the calm runs that follow a spike (drop the leading calm run)
keep <- run.begin > 1
apply(cbind(run.begin, run.end)[keep, ], 1, function(x) mean(Ho[x[1]:x[2]]))
# [1] 12.2 11.8 12.4 12.2 12.6
So here is some code that seems to get the same result as yours.
#Data
Ho<-c(12,12,12,24,12,11,12,12,14,12,11,13,25,25,12,11,13,12,11,11,12,14,12,2,2,2,11,12,13,14,12,11,12,3,2,2,2,3,2,2,1,14,12,11,13,11,12,13,12,11,12,12,12,2,2,2,12,12,12,12,15)
#plot(seq_along(Ho), Ho)
#find changes: consecutive values whose ratio falls outside 0.6-1.4 (a >40% jump)
diffs <- tail(Ho, -1) / head(Ho, -1)
idxs <- which(diffs > 1.4 | diffs < .6) + 1
# every second jump is treated as the exit from a spike (assumes jumps alternate)
starts <- idxs[seq(2, length(idxs), by = 2)]
ends <- ifelse(starts + 4 <= length(Ho), starts + 4, length(Ho))
#find means of the 5 points from each exit point onwards
mapply(function(a, b) mean(Ho[a:b]), starts, ends)

Ensuring temporal data density in R

ISSUE ---------
I have thousands of time series files (.csv) that contain intermittent data spanning between 20 and 50 years (see df). Each file contains the date_time and a metric (temperature). The data is hourly, and where no measurement exists there is an 'NA'.
>df
date_time temp
01/05/1943 11:00 5.2
01/05/1943 12:00 5.2
01/05/1943 13:00 5.8
01/05/1943 14:00 NA
01/05/1943 15:00 NA
01/05/1943 16:00 5.8
01/05/1943 17:00 5.8
01/05/1943 18:00 6.3
I need to check these files to see if they have sufficient data density. I.e. that the ratio of NA's to data values is not too high. To do this I have 3 criteria that must be checked for each file:
Ensure that no more than 10% of the hours in a day are NA's
Ensure that no more than 10% of the days in a month are NA's
Ensure that there are 3 continuous years of data with valid days and months.
Each criterion must be fulfilled sequentially and if the file does not meet the requirements then I must create a data frame (or any list) of the files that do not meet the criteria.
QUESTION--------
I wanted to ask the community how to go about this. I have considered the value of nested if loops, along with using sqldf, plyr, aggregate or even dplyr. But I do not know the simplest way to achieve this. Any example code or suggestions would be very much appreciated.
I think this will work for you. These functions will check every hour for NA's in the following day, month, or 3-year period. They are not tested, because I didn't care to make up data to test them; they should spit out the number of NA's in the respective time period. So for the checkdays function, if it returns a value greater than 2.4 then according to your 10% rule you'd have a problem. For months the cutoff is 72, and for 3-year periods you're hoping for values less than 2628. Again, please check these functions. By the way, the functions assume your NA data is in column 2. Cheers.
checkdays <- function(data){
  countNA = NULL
  for(i in 1:(length(data[,2]) - 23)){
    nadata = data[i:(i + 23), 2]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}

checkmonth <- function(data){
  countNA = NULL
  for(i in 1:(length(data[,2]) - 719)){
    nadata = data[i:(i + 719), 2]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}

check3years <- function(data){
  countNA = NULL
  for(i in 1:(length(data[,2]) - 26279)){
    nadata = data[i:(i + 26279), 2]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
So I ended up testing these. They work for me. Here are system times for a dataset a year long. So I don't think you'll have problems.
> system.time(checkdays(RM_W1))
user system elapsed
0.38 0.00 0.37
> system.time(checkmonth(RM_W1))
user system elapsed
0.62 0.00 0.62
Optimization:
I took the time to run these functions with the data you posted above and it wasn't good. For loops can be dangerous: they work well for small data sets but slow down dramatically as datasets get larger if they're not constructed properly. I cannot report system times for the functions above with your data (they never finished), but I waited about 30 minutes. After reading this awesome post, Speed up the loop operation in R, I rewrote the functions to be much faster. By minimising the amount of work that happens inside the loop and pre-allocating memory, you can really speed things up. You now need to call the function like checkdays(df[,2]), but it's faster this way.
checkdays <- function(data){
  countNA = numeric(length(data) - 23)
  for(i in 1:(length(data) - 23)){
    nadata = data[i:(i + 23)]
    countNA[i] = length(nadata[is.na(nadata)])
  }
  return(countNA)
}
> system.time(checkdays(df[,2]))
user system elapsed
4.41 0.00 4.41
I believe this should be sufficient for your needs. In regard to leap years, you should be able to modify the optimized function as I mentioned in the comments. However, make sure you specify a leap-year dataset as a second dataset rather than as a second column.
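If even the pre-allocated loop is too slow on the full files, a loop-free version of the same rolling NA count is possible. This is only a sketch of the idea (my own addition, assuming as above that the temperature is in column 2):
rolling_na_count <- function(x, window = 24) {
  # cumulative count of NA's; the count inside any window is a difference
  na.cum <- cumsum(is.na(x))
  na.cum[window:length(x)] - c(0, head(na.cum, length(x) - window))
}
# e.g. flag 24-hour windows where more than 10% of the hours are NA
bad.hours <- which(rolling_na_count(df[, 2], 24) > 0.1 * 24)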

plotting histogram by a data frame

I am a new R user, and for the last 7 days I have been using the mixdist package for modal analysis of finite mixture distributions. I am working on nanoparticles, so I am using R to analyse particle size distributions recorded by the particle analyser I use in my experiments.
My problem is illustrated below:
Firstly I collect my data from Excel (raw data):
Diameter dN/dlog(dp) frequencies
4.87 1825.078136 0.001541926
5.62 2363.940947 0.001997187
6.49 2022.259831 0.001708516
7.5 1136.653264 0.000960307
8.66 363.4570006 0.000307068
10 255.6702845 0.000216004
11.55 241.6525906 0.000204161
13.34 410.3425535 0.00034668
15.4 886.929307 0.000749327
17.78 936.4632499 0.000791176
20.54 579.7940281 0.000489842
23.71 11.915522 0.00001
27.38 0 0
31.62 0 0
36.52 5172.088 0.004369665
42.17 19455.13684 0.01643677
48.7 42857.20502 0.036208126
56.23 68085.64903 0.057522504
64.94 87135.1959 0.07361661
74.99 96708.55662 0.081704712
86.6 97982.18946 0.082780747
100 95617.46266 0.080782896
115.48 93732.08861 0.079190028
133.35 93718.2981 0.079178377
153.99 92982.3002 0.078556565
177.83 88545.18227 0.074807844
205.35 78231.4116 0.066094203
237.14 63261.43349 0.053446741
273.84 46759.77702 0.039505233
316.23 32196.42834 0.027201315
365.17 21586.84472 0.018237755
421.7 14703.9162 0.012422678
486.97 10539.84662 0.008904643
562.34 7986.233881 0.00674721
649.38 6133.971913 0.005182317
749.89 4500.351801 0.003802145
865.96 2960.469207 0.002501167
1000 1649.858041 0.001393891
Inf 0 0
using the function
pikraw<-read.table(file="clipboard", sep="\t", header=T)
After importing the data into R, I choose the 1st and the 3rd columns of the above table:
diameter<- pikraw$Diameter
frequencies<-pikraw[3]
Then I am grouping my data using the functions
pikgrp <- data.frame(length =diameter, freq =frequencies)
class(pikgrp) <- c("mixdata", "data.frame")
Having done all this, I plot the histogram of the data:
plot(pikgrp,log="x")
and there something strange happens: the horizontal axis and the values on it look fine, but on the y axis the low frequency values appear as they are while the high values appear with a cut decimal, which lowers the plot.
Do you have any explanation for what is happening? The answer is probably very simple, but after exhausting myself and losing a whole weekend on this I feel I am within my rights to ask.
It looks to me like you are reading your data wrong. Try this:
pikraw <- read.table(file="clipboard", sep="", header=T)
That is, change the sep argument to sep="". Everything worked fine from there.
Also, note that using the clipboard as the file argument only works if you have your data on the clipboard. I recommend creating a .txt (or .csv) file with your data. That way you don't have to have your data on the clipboard every time you want to read it.
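Putting it together, a minimal sketch of the whole workflow reading from a file instead of the clipboard could look like this, where pikraw.txt is a placeholder name for a whitespace-separated file containing the table above:
library(mixdist)
# "pikraw.txt" is a hypothetical file holding the table above, header included
pikraw <- read.table("pikraw.txt", sep = "", header = TRUE)
# build the grouped-data object mixdist expects and plot it on a log x axis
pikgrp <- data.frame(length = pikraw$Diameter, freq = pikraw$frequencies)
class(pikgrp) <- c("mixdata", "data.frame")
plot(pikgrp, log = "x")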

Resources