Interpolate missing climate data from similar datasets - r

I am working with climate data from five data loggers used in a field experiment. Some of the climate data are missing, partly because the loggers were installed after the start of the experiment and partly because of defective loggers.
To interpolate the missing data, I have climate data from two nearby weather stations.
I'm trying to find a rapid and safe way to replace the missing data in this dataset, because the approach I have come up with is tedious and quite convoluted.
df <- data.frame(date=c("2015-06-17","2015-06-18","2015-06-19","2015-06-20","2015-06-21"),
meantemp1=c(15,17,19,15,19),maxtemp1=c(18,25,27,25,28),
meantemp2=c(13,12,12,18,14),maxtemp2=c(22,25,25,24,26),
meantemp3=c(NA,NA,21,17,21),maxtemp3=c(NA,NA,29,25,29),
meantemp4=c(NA,14,14,20,16),maxtemp4=c(NA,27,27,26,28))#create toy dataset
df$date <- as.Date(df$date)
View(df)
date meantemp1 maxtemp1 meantemp2 maxtemp2 meantemp3 maxtemp3 meantemp4 maxtemp4
1 2015-06-17 15 18 13 22 NA NA NA NA
2 2015-06-18 17 25 12 25 NA NA 14 27
3 2015-06-19 19 27 12 25 21 29 14 27
4 2015-06-20 15 25 18 24 17 25 20 26
5 2015-06-21 19 28 14 26 21 29 16 28
Let's say the first four data columns correspond to weather station data and the last four columns correspond to data logger data. In reality, I look at many more climatic factors.
Now I shorten the dataset to a relevant timescale (in real life, a month) without any NA values, to calculate the correlations.
#create a short dataset without NA values
dfshort <- df[df$date >= "2015-06-19" & df$date <= "2015-06-21", ]
dfshort$date <- as.numeric(dfshort$date) #date needs to be transformed to numeric for cor()
corrmatrix <- cor(dfshort)
library(reshape)
m <- melt(corrmatrix)#show correlation matrix as a list
m <- m[order(-abs(m$value)), ] #order the correlation matrix by absolute correlation value
View(m)
X1 X2 value
1 date date 1
11 meantemp1 meantemp1 1
15 meantemp3 meantemp1 1
16 maxtemp3 meantemp1 1
21 maxtemp1 maxtemp1 1
31 meantemp2 meantemp2 1
35 meantemp4 meantemp2 1
10 date meantemp1 0
46 date meantemp3 0
55 date maxtemp3 0
To get a better overview of which factors are most correlated, I shorten this list to the cases where the 'logger data' depend on the 'weather station data'.
m1 <- subset(m, X1 %in% c('meantemp3', 'maxtemp3', 'meantemp4', 'maxtemp4'))
#select logger data for the first ("dependent") column
m2 <- subset(m1, X2 %in% c('meantemp1', 'maxtemp1', 'meantemp2', 'maxtemp2'))
#select cases with weather station data for the second ("reference") column
View(m2)
X1 X2 value
15 meantemp3 meantemp1 1.0000000
16 maxtemp3 meantemp1 1.0000000
35 meantemp4 meantemp2 1.0000000
45 maxtemp4 maxtemp2 1.0000000
27 maxtemp4 maxtemp1 0.9819805
17 meantemp4 meantemp1 -0.9449112
24 meantemp3 maxtemp1 0.9449112
26 meantemp4 maxtemp1 -0.7857143
36 maxtemp4 meantemp2 -0.6546537
44 meantemp4 maxtemp2 -0.6546537
Now I note down the highest correlations for the 'logger data' and create linear models following this post:
#formulate linear models
model.meantemp3 <- lm(meantemp3 ~ meantemp1, data = df)
model.maxtemp3 <- lm(maxtemp3 ~ meantemp1, data = df)
model.meantemp4 <- lm(meantemp4 ~ meantemp2, data = df)
model.maxtemp4 <- lm(maxtemp4 ~ maxtemp2, data = df)
#predict values as column
df$predict.meantemp3 = predict(model.meantemp3, newdata = df)
df$predict.maxtemp3 = predict(model.maxtemp3, newdata = df)
df$predict.meantemp4 = predict(model.meantemp4, newdata = df)
df$predict.maxtemp4 = predict(model.maxtemp4, newdata = df)
# replace (only) NAs with predictions
df$meantemp3 = ifelse(is.na(df$meantemp3), df$predict.meantemp3, df$meantemp3)
df$maxtemp3 = ifelse(is.na(df$maxtemp3), df$predict.maxtemp3,df$maxtemp3)
df$meantemp4 = ifelse(is.na(df$meantemp4), df$predict.meantemp4, df$meantemp4)
df$maxtemp4 = ifelse(is.na(df$maxtemp4), df$predict.maxtemp4, df$maxtemp4)
#tadaa!
df <- df[c(-10:-13)] #drop the prediction columns we no longer need
head(df) #dataset without NA's
date meantemp1 maxtemp1 meantemp2 maxtemp2 meantemp3 maxtemp3 meantemp4 maxtemp4
1 2015-06-17 15 18 13 22 17 25 15 24
2 2015-06-18 17 25 12 25 19 27 14 27
3 2015-06-19 19 27 12 25 21 29 14 27
4 2015-06-20 15 25 18 24 17 25 20 26
5 2015-06-21 19 28 14 26 21 29 16 28
There has to be a more concise and less error-prone way to do this, and I can't be the only one with this issue, as this unanswered Stack Exchange question suggests.
I have looked for packages to do this (e.g. the package "mice"), but they tend to produce quite complicated output. Geographers seem to agree that simple linear models are too primitive for temperature data imputation, but my climate data are highly correlated, so I want to do it this way for the sake of simplicity.
Help is highly appreciated!
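For reference, the fit/predict/replace steps above can be collapsed into a single loop over dependent/reference column pairs. This is only a minimal sketch of that idea, not a vetted general solution; the refs mapping is taken from the correlation list above:
#sketch: automate lm fit, prediction, and NA replacement per logger column
refs <- c(meantemp3 = "meantemp1", maxtemp3 = "meantemp1",
          meantemp4 = "meantemp2", maxtemp4 = "maxtemp2")
for (dep in names(refs)) {
  fit <- lm(reformulate(refs[[dep]], response = dep), data = df) #lm drops NA rows
  miss <- is.na(df[[dep]])
  df[[dep]][miss] <- predict(fit, newdata = df[miss, , drop = FALSE])
}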

Related

evaluating neural network performance

I trained my neural network with a sigmoid activation function, so the predicted values lie in the range (0, 1). However, the real data, which have been z-score transformed, go beyond that range. In this case, what would be the appropriate way to evaluate my model? Should I also rescale the original test data to the same range and then evaluate with criteria like mean squared forecast error?
> real_predicted_neural
predicted real
1 1.909219e-07 -3.57877473
2 4.161819e-08 -2.28704595
3 1.754706e-11 -1.08509429
4 1.149891e-13 -0.46573114
5 7.777560e-02 0.42381300
6 4.173448e-07 -0.44060297
7 1.119703e-01 0.21075550
8 8.682557e-01 -0.01292402
9 4.736056e-08 -0.29830701
10 7.506821e-08 -1.20302227
11 7.341235e-01 -0.03986571
12 7.501776e-05 -0.94315815
13 1.145697e-04 0.49730175
14 2.214929e-13 0.04252241
15 4.597199e-01 -0.38539901
16 2.324931e-03 -0.74468628
17 4.366025e-06 -0.77037244
18 1.394450e-06 0.16679048
19 5.869884e-11 -0.75876486
20 1.817941e-04 0.04303387
21 7.060773e-04 0.06099372
22 8.267170e-06 -1.21687318
23 9.388680e-02 0.61135319
24 1.099290e-01 0.55715201
25 9.757236e-01 -0.33480226
26 9.544055e-01 0.09061006
27 7.322074e-07 0.09290822
28 1.014327e-06 -0.61658893
29 7.848382e-08 -0.78739456
30 1.791908e-04 -0.44073540
31 1.357918e-03 -0.22099008
32 5.192233e-06 -0.32744703
33 2.624779e-06 -0.37644068
34 6.414216e-02 -0.36947939
35 1.388143e-06 -0.00994845
36 3.010872e-05 -0.05984833
37 9.873201e-03 -0.21815268
38 3.896163e-04 -0.24009094
39 2.718760e-02 0.33383333
40 1.025650e-02 0.09779867
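One hedged option, sketched here rather than prescribed: instead of rescaling the test data, invert whatever scaling was applied to the targets before training and evaluate on the original z-score scale. The sketch below assumes the targets were min-max scaled to [0, 1]; train_targets is a hypothetical placeholder for the asker's training targets.
#hypothetical sketch: undo min-max scaling on the predictions, then
#compute RMSE against the real values on their original (z-score) scale
ymin <- min(train_targets) #assumed: extremes of the unscaled training targets
ymax <- max(train_targets)
pred_orig <- real_predicted_neural$predicted * (ymax - ymin) + ymin
rmse <- sqrt(mean((pred_orig - real_predicted_neural$real)^2))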

Eliminate cases based on multiple rows values

I have a base with the following information:
edit: each row is an individual that lives in a house; multiple individuals, each with a unique P_ID and AGE, can live in the same house and share an H_ID. I'm looking for all the houses, with all their individuals, on the condition that at least one person in the house is over 60. I hope that explains it better.
show(base)
H_ID P_ID AGE CONACT
1 10010000001 1001000000102 35 33
2 10010000001 1001000000103 12 31
3 10010000001 1001000000104 5 NA
4 10010000001 1001000000101 37 10
5 10010000002 1001000000206 5 NA
6 10010000002 1001000000205 10 NA
7 10010000002 1001000000204 18 31
8 10010000002 1001000000207 3 NA
9 10010000002 1001000000203 24 35
10 10010000002 1001000000202 43 33
11 10010000002 1001000000201 47 10
12 10010000003 1001000000302 26 33
13 10010000003 1001000000301 29 10
14 10010000004 1001000000401 56 32
15 10010000004 1001000000403 22 31
16 10010000004 1001000000402 49 10
17 10010000005 1001000000503 1 NA
18 10010000005 1001000000501 24 10
19 10010000005 1001000000502 23 10
20 10010000006 1001000000601 44 10
21 10010000007 1001000000701 69 32
I want a list with all the houses and all the individuals living there, based on the condition that there's at least one person aged 60+. Here's a link to the data: https://drive.google.com/drive/folders/1Od8zlOE3U3DO0YRGnBadFz804OUDnuQZ?usp=sharing
And here's how I made the base:
hogares<-read.csv("/home/servicio/Escritorio/TR_VIVIENDA01.CSV")
personas<-read.csv("/home/servicio/Escritorio/TR_PERSONA01.CSV")
datos<-merge(hogares,personas)
base<-data.frame(datos$ID_VIV, datos$ID_PERSONA, datos$EDAD, datos$CONACT)
base
Any help is much appreciated. Thanks!
This can be done by first adding a variable with the maximum age per household:
base$maxage <- ave(base$AGE, base$H_ID, FUN=max)
and then keeping only the households whose maximum age is at least 60:
base <- subset(base, maxage >= 60)
You can also combine the two lines into one. With the column names in your linked data:
> base <- subset(base, ave(base$datos.EDAD, base$datos.ID_VIV, FUN=max) >= 60)
> head(base)
datos.ID_VIV datos.ID_PERSONA datos.EDAD datos.CONACT
21 10010000007 1001000000701 69 32
22 10010000008 1001000000803 83 33
23 10010000008 1001000000802 47 33
24 10010000008 1001000000801 47 10
36 10010000012 1001000001204 4 NA
37 10010000012 1001000001203 2 NA
Using dplyr, we can group_by H_ID and keep the houses where any AGE is greater than 60 (using base from the question):
library(dplyr)
base %>% group_by(H_ID) %>% filter(any(AGE > 60))
Similarly with data.table:
library(data.table)
setDT(base)[, .SD[any(AGE > 60)], H_ID]
To get a list of the houses with a tenant aged over 60, we can filter and create a list of distinct H_IDs:
house_list <- base %>%
  filter(AGE > 60) %>%
  distinct(H_ID) %>%
  pull(H_ID)
Then we can filter the original data frame on that house_list to remove any households that do not have someone over the age of 60:
house_df <- base %>%
  filter(H_ID %in% house_list)
To then calculate the CON values, we can filter out NA values in CONACT, group_by(H_ID), and summarize to count the individuals within each house that have a non-NA CONACT value:
CON_calcs <- house_df %>%
  filter(!is.na(CONACT)) %>%
  group_by(H_ID) %>%
  summarize(Count = n())
Then join that back into house_df by H_ID to include the newly calculated CON counts, and I believe that should give your desired result.
final_df <- left_join(house_df, CON_calcs, by = 'H_ID')

Cut dendrogram / cluster: Error in function 'cutree': tree incorrect (composante 'merge')

I have a dendrogram which I want to cut into fewer clusters, because right now there are too many for interpretation.
My dataframe looks like this:
> head(alpha)
locs 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
1 A1 12 14 15 15 14 21 10 18 18 20
2 A2 11 11 12 13 9 16 20 18 18 11
3 B1 12 13 20 17 21 20 27 14 22 25
4 B2 15 18 18 25 21 17 27 23 28 23
5 B3 22 22 26 24 28 23 31 25 32 25
6 B4 18 21 25 20 20 14 23 22 20 26
library("ggplot2") #for the heatmap
library("ggdendro") #for the dendrogram
library("reshape2") #for data wrangling
library("grid") #to combine the two plots heatmap and dendrogram
# Read in data
setwd("C:/Users/data")
alpha <- read.csv2("alpha.csv", header=TRUE, check.names=FALSE)
str(alpha) #structure of the dataset: locations (locs) = factor, values = integer
#scale the data variables (columns 2-11)
alpha.scaled <- alpha
alpha.scaled[, c(2:11)] <- scale(alpha.scaled[, 2:11])
# run clustering
alpha.matrix <- as.matrix(alpha.scaled[, -c(1)])
rownames(alpha.matrix) <- alpha.scaled$locs
alpha.dendro <- as.dendrogram(hclust(d = dist(x = alpha.matrix), method="complete" ))
# Create dendrogram (=cluster)
dendro.plot <- ggdendrogram(data = alpha.dendro, rotate = TRUE)
alphacut <- cutree(alpha.dendro, h=3)
Error in cutree(alpha.dendro, h = 3) :
'tree' incorrect (composante 'merge')
alphacut <- cutree(as.dendrogram(hclust(d = dist(x = alpha.matrix), method="complete")), k=5)
Error in cutree(as.dendrogram(hclust(d = dist(x = alpha.matrix), method = "complete")), :
'tree' incorrect (composante 'merge')
I haven't found a solution to this. When I look at 'alpha.dendro' there is a list of 2, but no merge component, so this seems to be the problem. Does somebody know what to do?
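For what it's worth, the error most likely arises because stats::cutree() expects an hclust object, while as.dendrogram() produces a structure without the merge component it needs. A minimal sketch of that fix, keeping the hclust result around:
#cut the hclust object, and only convert to a dendrogram for plotting
alpha.hclust <- hclust(d = dist(x = alpha.matrix), method = "complete")
alphacut <- cutree(alpha.hclust, k = 5) #or cutree(alpha.hclust, h = 3)
alpha.dendro <- as.dendrogram(alpha.hclust) #still usable for ggdendrogram()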

Ceil and floor values in R

I have a data.table of integers with values between 1 and 60.
My question is about flooring or ceiling each number to the following values: 12, 18, 24, 30, 36, ..., 60.
For example, let's say my data.table contains the number 13. I want R to "transform" this number into 12 (floor) and 18 (ceiling), as 13 lies between those values. Moreover, if I have 18, I want R to keep it at 18.
If my data.table contains the value 50, I want R to convert that number into 48 and 54, and so on.
My goal is to get two different data.tables. One where the floored values are saved and one where the ceiled values are saved.
Any idea how one could do this in R?
EDIT: Numbers smaller than 12 should always be transformed to 12.
Example output:
If I have the following data.table: data.table(c(1,28,29,41,53,53,17,41,41,53))
I want the following two output data.tables:
floored values: data.table(c(12,24,24,36,48,48,12,36,36,48))
ceiled values: data.table(c(12,30,30,42,54,54,18,42,42,54))
Here is a fairly direct way (edited to round up to 12 if any values are below):
df <- data.frame(nums = 10:20)
df$floors <- with(df,pmax(12,6*floor(nums/6)))
df$ceils <- with(df,pmax(12,6*ceiling(nums/6)))
Leading to:
> df
nums floors ceils
1 10 12 12
2 11 12 12
3 12 12 12
4 13 12 18
5 14 12 18
6 15 12 18
7 16 12 18
8 17 12 18
9 18 18 18
10 19 18 24
11 20 18 24
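Since the question works with a data.table, the same pmax/floor arithmetic carries over directly. A small sketch using the example input from the question:
library(data.table)
dt <- data.table(v = c(1,28,29,41,53,53,17,41,41,53))
floored <- dt[, .(v = pmax(12, 6 * floor(v / 6)))] #12,24,24,36,48,48,12,36,36,48
ceiled <- dt[, .(v = pmax(12, 6 * ceiling(v / 6)))] #12,30,30,42,54,54,18,42,42,54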
Here's a way we could do this, using the sapply and which.min functions. From your question, it's not immediately clear how values < 12 should be handled.
x <- 1:60
num_list <- seq(12, 60, 6)
floorr <- sapply(x, function(xi){
  diff_vec <- xi - num_list
  diff_vec <- ifelse(diff_vec < 0, Inf, diff_vec) #ignore grid values above xi
  num_list[which.min(diff_vec)]
})
ceill <- sapply(x, function(xi){
  diff_vec <- num_list - xi
  diff_vec <- ifelse(diff_vec < 0, Inf, diff_vec) #ignore grid values below xi
  num_list[which.min(diff_vec)]
})
tail(cbind(x, floorr, ceill))
x floorr ceill
[55,] 55 54 60
[56,] 56 54 60
[57,] 57 54 60
[58,] 58 54 60
[59,] 59 54 60
[60,] 60 60 60

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating the calculation of the same set of indices for multiple sites. The following sample dataset shows my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents a field-measured parameter (e.g. depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to calculate simple functions across multiple rows at once:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and then appending the results back to the proper station in a data frame.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to apply the same function across the columns of a single row, then repeat the same calculations for the next row, without mixing up variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)){
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to loop through multiple rows of data when I need unique variables saved for each row I am currently calculating? Thank you for any suggestions.
Your exact problem is still a mystery to me, but it looks like you want a double for loop:
residuals <- matrix(NA_real_, nrow = nrow(thalweg), ncol = 10)
for (i in 1:nrow(thalweg)) {
  residual <- thalweg[i, "Xdep_Vdep"]
  for (j in 2:11) {
    residual <- min(residual, thalweg[i, j])
    residuals[i, j - 1] <- residual #store the running minimum for each column
  }
}
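The row-wise recurrence from the question can also be written without explicit index bookkeeping, using Reduce(..., accumulate = TRUE). A sketch, where step is a hypothetical placeholder for the question's other_variables / Stacks_Equation term (left at 0 here; note the question adds it only from AB1 onward, so with a nonzero step the first term would need special handling):
step <- 0 #placeholder: substitute the real other_variables term here
num <- Thalweg[, 2:12] #drop StationID so each row stays numeric
residuals <- t(apply(num, 1, function(r) {
  #start from Xdep_Vdep, then repeatedly take min(previous + step, depth)
  Reduce(function(prev, depth) min(prev + step, depth),
         r[1:10], init = r[["Xdep_Vdep"]], accumulate = TRUE)[-1]
}))
depths <- Thalweg[, 2:11] - residuals #Depth_ABn = ABn - Residual_ABn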
