I'm a new R user working with meteorological data (the data frame is called "Stations"). I'm trying to plot 3 time series, with temperature on the y-axis and a regression line on each one, but I've run into a few problems and there are no error messages.
The loop doesn't seem to be working and I can't figure out why.
I couldn't get the x-axis tick labels to show years ("Année" in the data frame) instead of an index number.
The title is the same for all 3 plots; how do I give each plot its own title?
The regression line is not shown on the graph.
Thanks in advance!
Here is my code:
for (i in c(6,8,10))
plot(ts(Stations[,i]), col="dodgerblue4", xlab="Temps", ylab="Température", main="Genève")
for (i in c(6,8,10))
abline(h=Stations[,i])
Nb.enr time Année Mois Jour T2m_GE pcp_GE T2m_PU pcp_PU T2m_NY
1 19810101 1981 1 1 1.3 0.3 2.8 0.0 2.3
2 19810102 1981 1 2 1.2 0.1 2.3 1.2 1.6
3 19810103 1981 1 3 4.1 21.8 4.9 5.2 3.8
4 19810104 1981 1 4 5.1 10.3 5.1 17.4 4.9
5 19810105 1981 1 5 0.9 0.0 1.0 0.1 0.8
6 19810106 1981 1 6 0.5 5.7 0.7 6.0 0.5
7 19810107 1981 1 7 -2.7 0.0 -2.1 0.1 -1.9
8 19810108 1981 1 8 -3.2 0.0 -4.1 0.0 -3.8
9 19810109 1981 1 9 -5.2 0.0 -3.5 0.0 -5.1
10 19810110 1981 1 10 -3.1 10.6 -0.9 6.0 -2.6
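A minimal sketch of one way to address the four points, assuming the column layout shown above (columns 6, 8 and 10 are T2m_GE, T2m_PU and T2m_NY). In the original code each call to plot() starts a new figure that replaces the previous one, and abline(h = ...) draws a horizontal line at every observed value rather than a fitted trend, which is likely why the result looks wrong. The small Stations data frame below is a stand-in built from the ten rows shown, the axis labels are in English, and the titles for the PU and NY stations are placeholders.

```r
# Stand-in for the asker's `Stations` data frame: only the ten rows shown above
Stations <- data.frame(
  time   = 19810101:19810110,
  T2m_GE = c(1.3, 1.2, 4.1, 5.1, 0.9, 0.5, -2.7, -3.2, -5.2, -3.1),
  T2m_PU = c(2.8, 2.3, 4.9, 5.1, 1.0, 0.7, -2.1, -4.1, -3.5, -0.9),
  T2m_NY = c(2.3, 1.6, 3.8, 4.9, 0.8, 0.5, -1.9, -3.8, -5.1, -2.6)
)

# Real dates on the x-axis make plot() label the axis with calendar time
Stations$date <- as.Date(as.character(Stations$time), format = "%Y%m%d")

cols   <- c("T2m_GE", "T2m_PU", "T2m_NY")
titles <- c("Geneva", "T2m_PU", "T2m_NY")  # last two are placeholder titles

pdf(NULL)                       # null device so this runs non-interactively; drop it for on-screen plots
op <- par(mfrow = c(3, 1))      # three panels on one page
fits <- list()
x <- as.numeric(Stations$date)  # days since 1970-01-01, the scale plot() uses for Date axes
for (i in seq_along(cols)) {
  y <- Stations[[cols[i]]]
  plot(Stations$date, y, type = "l", col = "dodgerblue4",
       xlab = "Date", ylab = "Temperature", main = titles[i])
  fits[[i]] <- lm(y ~ x)        # linear trend of temperature against time
  abline(fits[[i]], col = "red")
}
par(op)
dev.off()
```

The braces around the loop body matter: without them, only the first statement after for() is repeated. With several years of data, the Date axis is labelled by year automatically.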
s1
name tis1 tis2 tis3 tis4 tis5 tis6 tis7 tis8 tis9 tis10 tis11 tis12
S1 0 0 0 12.1 29.2 1.9 0.45 0.2 17.0 0.4 0.7 0.1
s2
name tis1 tis2 tis3 tis4 tis5 tis6 tis7 tis8 tis9 tis10 tis11 tis12
S2 1 2 0.4 14.1 9.2 1.8 0.7 0.9 7.0 0.3 0.7 0.3
I would like to plot them to visualize their degree of correlation.
Is there a way to do it?
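One simple way to visualize how strongly the two profiles are related is a scatter plot of one against the other, with the correlation coefficient in the title. The sketch below is illustrative only; it assumes the twelve tis values are entered as plain numeric vectors (the names s1 and s2 follow the question).

```r
# Values copied from the two rows in the question
s1 <- c(0, 0, 0, 12.1, 29.2, 1.9, 0.45, 0.2, 17.0, 0.4, 0.7, 0.1)
s2 <- c(1, 2, 0.4, 14.1, 9.2, 1.8, 0.7, 0.9, 7.0, 0.3, 0.7, 0.3)

r <- cor(s1, s2)  # Pearson correlation by default

pdf(NULL)  # null device so this runs non-interactively; drop it for on-screen plots
plot(s1, s2, xlab = "S1", ylab = "S2",
     main = sprintf("S1 vs S2 (Pearson r = %.2f)", r))
abline(lm(s2 ~ s1), lty = 2)  # least-squares line as a visual guide
dev.off()
```

With only twelve points and a few large values, cor(s1, s2, method = "spearman") may be a more robust summary.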
I have a problem with the way aggregate handles NAs when computing sums.
I would like the sums per area.code from the following table:
test <- read.table(text = "
area.code A B C D
1 0 NA 0.00 NA NA
2 1 0.0 3.10 9.6 0.0
3 1 0.0 3.20 6.0 0.0
4 2 0.0 6.10 5.0 0.0
5 2 0.0 6.50 8.0 0.0
6 2 0.0 6.90 4.0 3.1
7 3 0.0 6.70 3.0 3.2
8 3 0.0 6.80 3.1 6.1
9 3 0.0 0.35 3.2 6.5
10 3 0.0 0.67 6.1 6.9
11 4 0.0 0.25 6.5 6.7
12 5 0.0 0.68 6.9 6.8
13 6 0.0 0.95 6.7 0.0
14 7 1.2 NA 6.8 0.0
")
So, seems pretty easy:
aggregate(.~area.code, test, sum)
area.code A B C D
1 1 0 6.30 15.6 0.0
2 2 0 19.50 17.0 3.1
3 3 0 14.52 15.4 22.7
4 4 0 0.25 6.5 6.7
5 5 0 0.68 6.9 6.8
6 6 0 0.95 6.7 0.0
Apparently it's not so simple, because area code 7 is completely omitted from the aggregate() output (as is area code 0).
I would like the NAs to be either ignored or treated as zero. Which na argument gives that option?
Replacing all NAs with 0 is an option if I just want the sum, but then the mean becomes really problematic (since it can no longer differentiate between 0 and NA).
If you are willing to consider an external package (data.table):
library(data.table)
setDT(test)
test[, lapply(.SD, sum), area.code]
area.code A B C D
1: 0 NA 0.00 NA NA
2: 1 0.0 6.30 15.6 0.0
3: 2 0.0 19.50 17.0 3.1
4: 3 0.0 14.52 15.4 22.7
5: 4 0.0 0.25 6.5 6.7
6: 5 0.0 0.68 6.9 6.8
7: 6 0.0 0.95 6.7 0.0
8: 7 1.2 NA 6.8 0.0
One option is to write a function that returns NA when all the values are NA and otherwise uses sum with na.rm = TRUE. In addition, set the na.action argument to na.pass, because by default aggregate's formula method drops every row that contains at least one NA.
f1 <- function(x) if(all(is.na(x))) NA else sum(x, na.rm = TRUE)
aggregate(.~area.code, test, f1, na.action = na.pass)
# area.code A B C D
#1 0 NA 0.00 NA NA
#2 1 0.0 6.30 15.6 0.0
#3 2 0.0 19.50 17.0 3.1
#4 3 0.0 14.52 15.4 22.7
#5 4 0.0 0.25 6.5 6.7
#6 5 0.0 0.68 6.9 6.8
#7 6 0.0 0.95 6.7 0.0
#8 7 1.2 NA 6.8 0.0
The custom function is needed because sum with na.rm = TRUE returns 0 when all the elements are NA:
sum(c(NA, NA), na.rm = TRUE)
#[1] 0
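The same pattern can be extended to the mean, which the question flags as the harder case. The following is a sketch under the assumption that an all-NA group should stay NA; the helper names sum_or_na and mean_or_na are made up for illustration, and the table is re-entered without row numbers so the block is self-contained.

```r
# The table from the question, re-entered so the example runs on its own
test <- read.table(header = TRUE, text = "
area.code A   B    C   D
0         NA  0.00 NA  NA
1         0.0 3.10 9.6 0.0
1         0.0 3.20 6.0 0.0
2         0.0 6.10 5.0 0.0
2         0.0 6.50 8.0 0.0
2         0.0 6.90 4.0 3.1
3         0.0 6.70 3.0 3.2
3         0.0 6.80 3.1 6.1
3         0.0 0.35 3.2 6.5
3         0.0 0.67 6.1 6.9
4         0.0 0.25 6.5 6.7
5         0.0 0.68 6.9 6.8
6         0.0 0.95 6.7 0.0
7         1.2 NA   6.8 0.0
")

# Hypothetical helpers: keep NA for all-NA groups, otherwise ignore NAs
sum_or_na  <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
mean_or_na <- function(x) if (all(is.na(x))) NA_real_ else mean(x, na.rm = TRUE)

# na.pass keeps rows with NAs so every area.code appears in the result
sums  <- aggregate(. ~ area.code, test, sum_or_na,  na.action = na.pass)
means <- aggregate(. ~ area.code, test, mean_or_na, na.action = na.pass)

sums
means
```

Because the NAs are never replaced by 0, the means are computed only over the values that were actually observed.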
Another solution is to use dplyr (note that a group containing only NAs will then sum to 0 rather than NA):
library(dplyr)
test %>%
  group_by(area.code) %>%
  summarise_all(sum, na.rm = TRUE)
I am trying to get all points on a 2D plane in the range (0..10, 0..10) with a step of 0.5. I would like to store these values in a data frame like this:
x y
1 1 1.5
2 0 0.5
3 4 2.0
I am considering using a loop to start from 0.0 for the x column and fill the y column such that I get something like this:
x y
1 0 0
2 0 0.5
3 0 1
and so on up to 10, then increment x by 0.5 and repeat, and so on. Is there a more efficient way of doing this in R?
Is this what you want?
expand.grid(x=seq(0,10,by=0.5),y=seq(0,10,by=0.5))
x y
1 0.0 0.0
2 0.5 0.0
3 1.0 0.0
4 1.5 0.0
5 2.0 0.0
6 2.5 0.0
7 3.0 0.0
8 3.5 0.0
9 4.0 0.0
10 4.5 0.0
11 5.0 0.0
12 5.5 0.0
13 6.0 0.0
14 6.5 0.0
15 7.0 0.0
16 7.5 0.0
17 8.0 0.0
18 8.5 0.0
19 9.0 0.0
20 9.5 0.0
21 10.0 0.0
22 0.0 0.5
23 0.5 0.5
24 1.0 0.5
25 1.5 0.5
26 2.0 0.5
27 2.5 0.5
28 3.0 0.5
29 3.5 0.5
30 4.0 0.5
...
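The grid above has 21 x 21 = 441 rows, with x varying fastest. If the row order sketched in the question is preferred (x held fixed while y steps through its values), one possible approach is to list y first in expand.grid and then reorder the columns. The variable names below are illustrative.

```r
steps <- seq(0, 10, by = 0.5)   # 0, 0.5, ..., 10 (21 values)

# expand.grid varies its first argument fastest, so put y first ...
grid <- expand.grid(y = steps, x = steps)
# ... then restore the x, y column order
grid <- grid[, c("x", "y")]

head(grid, 3)   # x stays at 0 while y takes 0, 0.5, 1
nrow(grid)      # number of grid points
```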
I'm trying to move the data in a data frame around. For each tree, I want to shift the values so that the first non-zero value lands in height1.
Example data looks as follows:
Tree <- c(1:10)
height0 <- c(0,0,0,0,0,0,0,0,0,0)
height1 <- c(1.5,2.0,0.0,1.2,1.3,0.9,0.0,0.0,1.8,0.0)
height2 <- c(2.4,2.2,1.1,1.9,1.4,1.7,0.0,0.0,2.7,0.0)
height3 <- c(3.1,2.9,2.1,2.6,2.2,2.4,0.0,0.6,3.6,0.0)
height4 <- c(3.8,3.4,2.9,3.0,2.9,3.1,0.0,1.1,4.1,0.0)
height5 <- c(4.2,3.7,3.6,3.7,3.5,3.8,0.7,1.9,4.6,0.0)
height6 <- c(4.4,4.1,4.1,4.2,4.0,4.5,1.6,2.6,4.9,1.2)
height7 <- c(4.7,4.4,4.3,4.6,4.2,4.9,2.2,3.0,5.1,2.0)
df <- data.frame(Tree, height0, height1, height2, height3, height4, height5, height6, height7)
So the data frame df looks as follows:
df
Tree height0 height1 height2 height3 height4 height5 height6 height7
1 1 0 1.5 2.4 3.1 3.8 4.2 4.4 4.7
2 2 0 2.0 2.2 2.9 3.4 3.7 4.1 4.4
3 3 0 0.0 1.1 2.1 2.9 3.6 4.1 4.3
4 4 0 1.2 1.9 2.6 3.0 3.7 4.2 4.6
5 5 0 1.3 1.4 2.2 2.9 3.5 4.0 4.2
6 6 0 0.9 1.7 2.4 3.1 3.8 4.5 4.9
7 7 0 0.0 0.0 0.0 0.0 0.7 1.6 2.2
8 8 0 0.0 0.0 0.6 1.1 1.9 2.6 3.0
9 9 0 1.8 2.7 3.6 4.1 4.6 4.9 5.1
10 10 0 0.0 0.0 0.0 0.0 0.0 1.2 2.0
I'm trying to move each tree's first non-zero height value to height1, because not all the trees germinated at the same time and I only want to compare growth speed without getting false results due to differences in germination.
So what my data should look like afterwards is as follows:
df
Tree height0 height1 height2 height3 height4 height5 height6 height7
1 1 0 1.5 2.4 3.1 3.8 4.2 4.4 4.7
2 2 0 2.0 2.2 2.9 3.4 3.7 4.1 4.4
3 3 0 1.1 2.1 2.9 3.6 4.1 4.3
4 4 0 1.2 1.9 2.6 3.0 3.7 4.2 4.6
5 5 0 1.3 1.4 2.2 2.9 3.5 4.0 4.2
6 6 0 0.9 1.7 2.4 3.1 3.8 4.5 4.9
7 7 0 0.7 1.6 2.2
8 8 0 0.6 1.1 1.9 2.6 3.0
9 9 0 1.8 2.7 3.6 4.1 4.6 4.9 5.1
10 10 0 1.2 2.0
Is there a way to do this?
I have over 3000 trees, each measured 40 times, and doing it manually is going to take too long.
Thank you
One option would be to loop through the rows (apply with MARGIN = 1), extract the non-zero elements, pad the rest with NA (using `length<-`), transpose the output and assign it back.
df[-(1:2)] <- t(apply(df[-(1:2)], 1, function(x) `length<-`(x[x!=0], ncol(df)-2)))
df
# Tree height0 height1 height2 height3 height4 height5 height6 height7
#1 1 0 1.5 2.4 3.1 3.8 4.2 4.4 4.7
#2 2 0 2.0 2.2 2.9 3.4 3.7 4.1 4.4
#3 3 0 1.1 2.1 2.9 3.6 4.1 4.3 NA
#4 4 0 1.2 1.9 2.6 3.0 3.7 4.2 4.6
#5 5 0 1.3 1.4 2.2 2.9 3.5 4.0 4.2
#6 6 0 0.9 1.7 2.4 3.1 3.8 4.5 4.9
#7 7 0 0.7 1.6 2.2 NA NA NA NA
#8 8 0 0.6 1.1 1.9 2.6 3.0 NA NA
#9 9 0 1.8 2.7 3.6 4.1 4.6 4.9 5.1
#10 10 0 1.2 2.0 NA NA NA NA NA
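Note that x[x != 0] removes every zero in a row, including any that might occur after germination. If only the leading zeros should be dropped, a variant along the following lines could be used. This is a sketch: drop_leading_zeros is a hypothetical helper, and the data frame contains just four of the trees from the question.

```r
# Four of the trees from the question (trees 1, 3, 7 and 10)
df <- data.frame(
  Tree    = c(1, 3, 7, 10),
  height0 = c(0, 0, 0, 0),
  height1 = c(1.5, 0.0, 0.0, 0.0),
  height2 = c(2.4, 1.1, 0.0, 0.0),
  height3 = c(3.1, 2.1, 0.0, 0.0),
  height4 = c(3.8, 2.9, 0.0, 0.0),
  height5 = c(4.2, 3.6, 0.7, 0.0),
  height6 = c(4.4, 4.1, 1.6, 1.2),
  height7 = c(4.7, 4.3, 2.2, 2.0)
)

# Hypothetical helper: drop zeros before the first non-zero value, pad the end with NA
drop_leading_zeros <- function(x) {
  first <- which(x != 0)[1]                             # first non-zero measurement
  if (is.na(first)) return(rep(NA_real_, length(x)))    # tree never germinated
  kept <- x[first:length(x)]
  c(kept, rep(NA_real_, length(x) - length(kept)))
}

# Apply row by row to the height1..height7 columns and write the result back
df[-(1:2)] <- t(apply(df[-(1:2)], 1, drop_leading_zeros))
df
```

For the example data both approaches give the same result, since no zero appears after a tree's first non-zero height.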
I am using dtw to calculate distances between several series and getting strange results. Notice that in the sample data below the first 9 customers are identical sets (A==B==C, D==E==F, and G==H==I). The remaining rows are only for noise to allow me to make 8 clusters.
I expect that the first sets would be clustered with their identical partners. This happens when I calculate distance on the original data, but when I scale the data before distance/clustering I get different results.
The distance between identical rows in the original data is 0.0 (as expected), but with scaled data the distance is not 0.0 (not even close). Any ideas why they are not the same?
library(TSdist)
library(dplyr)
library(tidyr)
mydata = as_data_frame(read.table(textConnection("
cust P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
1 A 1.1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
2 B 1.1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
3 C 1.1 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
4 D 0.0 1.0 2.0 1.0 0.0 1.0 2.0 1.0 0.0 1.0
5 E 0.0 1.0 2.0 1.0 0.0 1.0 2.0 1.0 0.0 1.0
6 F 0.0 1.0 2.0 1.0 0.0 1.0 2.0 1.0 0.0 1.0
7 G 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 1.5
8 H 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 1.5
9 I 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 1.5
10 D2 1.0 2.0 1.0 0.0 1.0 2.0 1.0 0.0 1.0 2.0
11 E2 5.0 6.0 5.0 4.0 5.0 6.0 5.0 4.0 5.0 6.0
12 F2 9.0 10.0 9.0 8.0 9.0 10.0 9.0 8.0 9.0 10.0
13 G2 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 1.5 1.0
14 H2 5.5 5.0 4.5 4.0 4.5 5.0 5.5 6.0 5.5 5.0
15 I2 9.5 9.0 8.5 8.0 8.5 9.0 9.5 10.0 9.5 9.0
16 A3 1.0 1.0 0.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0
17 B3 5.0 5.0 5.0 5.0 5.0 3.0 8.0 5.0 5.0 5.0
18 C3 9.0 9.0 9.0 9.0 9.0 5.4 14.4 9.0 9.0 9.0
19 D3 0.0 1.0 2.0 1.0 0.0 1.0 1.0 2.0 0.0 1.0
20 E3 4.0 5.0 5.0 6.0 4.0 5.0 6.0 5.0 4.0 5.0
21 F3 8.0 9.0 10.0 9.0 9.0 9.0 9.0 9.0 8.0 9.0
22 G3 2.0 1.5 1.0 0.5 0.0 0.5 1.0 2.0 1.5 1.5
23 H3 6.0 5.5 5.0 4.5 4.0 5.0 4.5 5.5 6.0 5.5
24 I3 10.0 9.5 9.0 9.0 8.0 8.5 9.0 9.5 10.0 9.5
25 D4 0.0 3.0 6.0 3.0 0.0 3.0 6.0 3.0 0.0 5.0
26 E4 3.0 6.0 9.0 6.0 3.0 6.0 9.0 6.0 3.0 6.0
27 F4 4.0 6.0 10.0 7.0 5.0 6.0 11.0 8.0 5.0 7.0
28 D5 5.0 0.0 3.0 6.0 3.0 0.0 3.0 6.0 3.0 0.0
29 D6 9.0 6.0 3.0 6.0 9.0 6.0 3.0 6.0 9.0 6.0
30 D7 9.0 11.0 5.0 4.0 6.0 10.0 7.0 5.0 6.0 11.0
31 Dw 0.0 0.8 1.4 2.0 1.0 0.0 2.0 0.0 1.0 2.0
32 Ew 4.0 4.8 5.4 6.0 5.0 4.0 6.0 4.0 5.0 6.0
33 Fw 8.0 8.8 9.4 10.0 9.0 8.0 10.0 8.0 9.0 10.0
34 Gw 2.0 1.5 1.0 0.5 0.0 1.0 2.0 1.5 1.3 1.1
35 Hw 6.0 5.5 5.0 4.5 4.0 5.0 6.0 5.5 5.3 5.1
36 Iw 10.0 9.5 9.0 8.5 8.0 9.0 10.0 9.5 9.3 9.1"),
header = TRUE, stringsAsFactors = FALSE))
k=8
# create a scaled version of mydata: (raw data - mean) / std dev
mydata_long = mydata %>%
mutate (mean = apply(mydata[,2:ncol(mydata)],1,mean,na.rm = T)) %>%
mutate (sd = apply(mydata[,2:(ncol(mydata))],1,sd,na.rm = T))%>%
gather (period,value,-cust,-mean,-sd) %>%
mutate (sc = (value-mean)/sd)
mydata_sc = mydata_long[,-c(2,3,5)] %>%
spread(period,sc)
# dtw
dtw_dist = TSDatabaseDistances(mydata[2:ncol(mydata)], distance = "dtw",lag.max= 2) #distance
dtw_clus = hclust(dtw_dist, method="ward.D2") # Cluster
dtw_res = data.frame(cutree(dtw_clus, k)) # cut dendrogram into k clusters
# dtw (w scaled data)
dtw_sc_dist = TSDatabaseDistances(mydata_sc[2:ncol(mydata_sc)], distance = "dtw",lag.max= 2) #distance
dtw_sc_clus = hclust(dtw_sc_dist, method="ward.D2") # Cluster
dtw_sc_res = data.frame(cutree(dtw_sc_clus, k)) # cut dendrogram into k clusters
results = cbind (dtw_res,dtw_sc_res)
names(results) = c("dtw", "dtw_scaled")
print(results)
dtw dtw_scaled
1 1 1
2 1 2
3 1 1
4 1 2
5 1 1
6 1 2
7 1 3
8 1 4
9 1 3
10 1 3
11 2 3
12 3 4
13 1 5
14 2 6
15 3 3
16 1 4
17 2 3
18 4 3
19 1 6
20 2 3
21 3 4
22 1 3
23 2 3
24 3 6
25 5 7
26 6 8
27 7 7
28 5 7
29 6 7
30 8 8
31 1 7
32 2 7
33 3 7
34 1 8
35 2 7
36 3 7
A couple of issues:
You are scaling rowwise, not columnwise (take a look at the intermediate results of your dplyr chain -- do they make sense?)
The data manipulations you used to produce the scaled data changed the row ordering of your data frame to alphabetical:
> mydata_sc %>% head
cust P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
(chr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 A 2.84604989 -0.31622777 -0.31622777 -0.31622777 -0.31622777 -0.3162278 -0.3162278 -0.31622777 -0.31622777 -0.31622777
2 A3 0.00000000 0.00000000 -2.12132034 2.12132034 0.00000000 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000
3 B 2.84604989 -0.31622777 -0.31622777 -0.31622777 -0.31622777 -0.3162278 -0.3162278 -0.31622777 -0.31622777 -0.31622777
vs.
> mydata %>% head
Source: local data frame [6 x 11]
cust P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
(chr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 A 1.1 1 1 1 1 1 1 1 1 1
2 B 1.1 1 1 1 1 1 1 1 1 1
(check the cust variable ordering!)
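If scaling each series (row) by its own mean and standard deviation is what is intended, one way to do it without reshaping, and therefore without reordering rows, is to transpose, apply scale(), and transpose back. The sketch below uses four of the series from the question and plain Euclidean dist() as a stand-in for the DTW distance, just to show that identical rows stay identical after scaling and that the row order is preserved.

```r
# Four series from the question: A == B and D == E
m <- rbind(
  A = c(1.1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  B = c(1.1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
  D = c(0, 1, 2, 1, 0, 1, 2, 1, 0, 1),
  E = c(0, 1, 2, 1, 0, 1, 2, 1, 0, 1)
)

# scale() centers and scales columns, so transpose, scale, and transpose back
m_sc <- t(scale(t(m)))

rownames(m_sc)              # same order as the input rows

# Euclidean distance used here only as a simple stand-in for DTW
d <- as.matrix(dist(m_sc))
d["A", "B"]
d["D", "E"]
```

Keeping the customer labels as row names, rather than as a column that gets reshaped, makes it harder for the scaled and unscaled results to fall out of alignment when they are later combined with cbind.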
Here's my approach, and how I think you can avoid similar mistakes in the future:
Scale with the built-in scale function:
mydata_sc <- mydata %>% select(-cust) %>% scale %>% as.data.frame %>% cbind(cust =mydata$cust,.) %>% as.tbl
Assert that your scaled data frame is equivalent to a scaled version of your original data frame:
> (scale(mydata_sc %>% select(-cust)) - scale(mydata %>% select(-cust))) %>% colSums %>% sum
[1] 0.000000000000005353357
Create one single function to perform your desired manipulations:
return_dtw <- function(df) {
res_2 = TSDatabaseDistances(df[2:ncol(df)],distance="dtw",lag.max=2) %>%
hclust(.,method="ward.D2")
return(data.frame(cutree(res_2,k)))
}
Execute the function:
> mydata %>% return_dtw %>% cbind(mydata_sc %>% return_dtw)
cutree.res_2..k. cutree.res_2..k.
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
11 2 2
12 3 3
13 1 1
14 2 2
15 3 3
16 1 1
17 2 2
18 4 3
19 1 1
20 2 2
21 3 3
22 1 1
23 2 2
24 3 3
25 5 4
26 6 5
27 7 5
28 5 6
29 6 7
30 8 8
31 1 1
32 2 2
33 3 3
34 1 1
35 2 2
36 3 3
Some of the later customers are not grouped similarly, but that's for another question!