When I try using as.POSIXlt or strptime I keep getting a single value of 'NA' as a result.
What I need to do is transform 3 and 4 digit numbers e.g. 2300 or 115 to 23:00 or 01:15 respectively, but I simply cannot get any code to work.
Basically, this data frame of outputs:
Time
1 2345
2 2300
3 2130
4 2400
5 115
6 2330
7 100
8 2300
9 1530
10 130
11 100
12 215
13 2245
14 145
15 2330
16 2400
17 2300
18 2230
19 2130
20 30
should look like this:
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
7 01:00
8 23:00
9 15:30
10 01:30
11 01:00
12 02:15
13 22:45
14 01:45
15 23:30
16 24:00
17 23:00
18 22:30
19 21:30
20 00:30
I think you can use the following solution. Note, however, that it produces a character vector rather than a date-time object:
gsub("(\\d{2})(\\d{2})", "\\1:\\2", sprintf("%04d", df$Time)) |>
as.data.frame() |>
setNames("Time") |>
head()
Time
1 23:45
2 23:00
3 21:30
4 24:00
5 01:15
6 23:30
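An arithmetic variant of the same idea, as a minimal sketch: split each number into hours and minutes with integer division and remainder, then format both parts with leading zeros. The vector `time_raw` is a hypothetical stand-in for `df$Time`, and the result is again a character vector.

```r
# Hypothetical stand-in for df$Time (a few values from the question)
time_raw <- c(2345L, 2400L, 115L, 30L)

hours   <- time_raw %/% 100L  # integer division gives the hour part
minutes <- time_raw %% 100L   # remainder gives the minute part

# Zero-pad both parts and join them with a colon
time_chr <- sprintf("%02d:%02d", hours, minutes)
time_chr
```

This keeps "24:00" as text, which matches the desired output in the question.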
I have a matrix of time series data as below and I want to create 104 replicates of it.
A sample of the original data is shown below (the full data set contains 104 years of monthly streamflows):
Month Year streamflow1 streamflow2
1 1913 3632 703
2 1913 2274 407
3 1913 4047 566
4 1913 3226 538
5 1913 4027 911
6 1913 6772 1779
7 1913 5335 1401
8 1913 8138 1626
9 1913 9769 1993
10 1913 6243 1463
11 1913 11913 2694
12 1913 6024 1482
1 1914 3506 674
2 1914 2062 392
3 1914 2083 417
4 1914 1945 428
5 1914 3587 568
6 1914 4035 846
7 1914 7969 1620
8 1914 6218 1588
9 1914 3512 894
10 1914 2277 651
11 1914 1820 519
12 1914 2316 485
1 1915 1751 417
2 1915 1252 327
3 1915 1513 304
4 1915 1817 312
5 1915 4361 653
6 1915 6356 1282
7 1915 7726 1660
8 1915 8852 1586
9 1915 7314 1721
10 1915 8391 1783
11 1915 5968 1702
12 1915 4008 764
and so on
The first replicate is the same as the original data. For the second replicate, the streamflow starts from the first month of the second year; for the third replicate, it starts from the first month of the third year, and so on. The series recycles when it reaches the end of the dataset. Examples of the first, second, and third replicates are as follows:
month year replicate streamflow1 streamflow2
1 1913 1 3632 703
1 1913 2 3506 674
1 1913 3 1751 417
2 1913 1 2274 407
2 1913 2 2062 392
2 1913 3 1252 327
1 1914 1 3506 674
1 1914 2 1751 417
1 1914 3 3632 703
2 1914 1 2062 392
2 1914 2 1252 327
2 1914 3 2274 407
Note: replicate 3 of year 2 recycled and so on
Thanks
You could try something like this:
# Create function to reorder a given variable 'x' (analogous to streamflow), moving everything before the zth observation to the end
f = function(x, z) {
  c(x[z:length(x)], x[1:(z - 1)])
}
# Create dataset for testing
X_orig = data.frame(rep(1:12, 104), rep(1:104, each = 12), 1:(104 * 12)) # pretend this is your data
colnames(X_orig) = c("Month", "Year", "Streamflow")
# Create your 104 replicates
L = list() # to store replicates
year_inds = which(1:nrow(X_orig) %% 12 == 1) # apply function 'f' at the start of every year, i.e. on the 1st, 13th, 25th... obs
k = 1 # counter
for (i in 1:nrow(X_orig)) {
  if (i %in% year_inds) {
    X = X_orig
    X$Replicate = k
    if (i != 1) { # first replicate should be the same as the original data
      X$Streamflow = f(X$Streamflow, i)
    }
    L[[k]] = X; k = k + 1
  }
}
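The same rotation can also be written without looping over every row. The following is only a sketch under stated assumptions: `shift_years` is a hypothetical helper, and the toy data uses two months per year (values taken from the question) so the result can be checked against the expected table by eye.

```r
# Hypothetical helper: rotate x so that replicate k starts (k - 1) years in,
# wrapping around at the end of the series
shift_years <- function(x, k, per_year = 12) {
  n <- length(x)
  offset <- (k - 1) * per_year
  x[(seq_len(n) - 1 + offset) %% n + 1]
}

# Toy data: 3 years with 2 months each (assumption, to keep the example small)
toy <- data.frame(
  Month = rep(1:2, times = 3),
  Year = rep(1913:1915, each = 2),
  streamflow1 = c(3632, 2274, 3506, 2062, 1751, 1252)
)

# Build one data frame per replicate and stack them
replicates <- do.call(rbind, lapply(1:3, function(k) {
  out <- toy
  out$Replicate <- k
  out$streamflow1 <- shift_years(toy$streamflow1, k, per_year = 2)
  out
}))
```

With the real data, `per_year` would stay at 12 and the replicate index would run from 1 to 104.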
The post below answers the above question. It gives two methods: one using a for loop, plus a great suggestion for making it faster.
R how to make loop faster
I have done some manipulations as below to arrive at the following dataframe:
df
cluster.kmeans variable max mean median min sd
1 1 MonthlySMS 191 90.32258 71.0 8 56.83801
2 1 SixMonthlyData 1085 567.09677 573.0 109 275.46994
3 1 SixMonthlySMS 208 94.38710 86.0 29 56.27828
4 1 ThreeMonthlyData 1038 563.03226 573.0 94 275.51340
5 1 ThreeMonthlySMS 199 88.35484 76.0 6 59.15491
6 2 MonthlySMS 155 53.18815 57.0 1 31.64533
7 2 SixMonthlyData 574 280.27352 280.5 -48 139.75252
8 2 SixMonthlySMS 167 57.77526 47.0 1 33.49210
9 2 ThreeMonthlyData 548 280.89547 279.0 -11 137.54755
10 2 ThreeMonthlySMS 149 53.68641 50.5 3 31.40001
11 3 MonthlySMS 215 135.60202 137.0 49 34.09794
12 3 SixMonthlyData 1046 541.76322 557.0 2 258.90622
13 3 SixMonthlySMS 314 152.40302 152.0 27 45.55642
14 3 ThreeMonthlyData 1064 541.50378 558.0 10 255.35560
15 3 ThreeMonthlySMS 240 146.00756 146.0 54 37.06427
16 4 MonthlySMS 136 49.93980 54.5 1 31.47778
17 4 SixMonthlyData 1091 788.09365 805.0 503 145.67031
18 4 SixMonthlySMS 190 57.50167 46.0 1 33.66157
19 4 ThreeMonthlyData 1073 785.19398 799.5 500 142.90054
20 4 ThreeMonthlySMS 141 50.88796 46.0 1 31.07977
I would like to order the variable column based on these strings:
top.vars_kmeans
[1] "ThreeMonthlySMS" "SixMonthlyData" "ThreeMonthlyData"
[4] "MonthlySMS" "SixMonthlySMS"
I could do it using sqldf as below:
library(sqldf)
a <- c(1,2,3,4,5)
a <- data.frame(top.vars_kmeans,a)
a <- sqldf('select a1.* ,b1.a from "MS.DATA.STATS.KMEANS" a1 inner join a b1
on a1.variable=b1."top.vars_kmeans"')
a <- sqldf('select * from a order by "cluster.kmeans",a')
a$a <- NULL
a
cluster.kmeans variable max mean median min sd
1 1 ThreeMonthlySMS 199 88.35484 76.0 6 59.15491
2 1 SixMonthlyData 1085 567.09677 573.0 109 275.46994
3 1 ThreeMonthlyData 1038 563.03226 573.0 94 275.51340
4 1 MonthlySMS 191 90.32258 71.0 8 56.83801
5 1 SixMonthlySMS 208 94.38710 86.0 29 56.27828
6 2 ThreeMonthlySMS 149 53.68641 50.5 3 31.40001
7 2 SixMonthlyData 574 280.27352 280.5 -48 139.75252
8 2 ThreeMonthlyData 548 280.89547 279.0 -11 137.54755
9 2 MonthlySMS 155 53.18815 57.0 1 31.64533
10 2 SixMonthlySMS 167 57.77526 47.0 1 33.49210
11 3 ThreeMonthlySMS 240 146.00756 146.0 54 37.06427
12 3 SixMonthlyData 1046 541.76322 557.0 2 258.90622
13 3 ThreeMonthlyData 1064 541.50378 558.0 10 255.35560
14 3 MonthlySMS 215 135.60202 137.0 49 34.09794
15 3 SixMonthlySMS 314 152.40302 152.0 27 45.55642
16 4 ThreeMonthlySMS 141 50.88796 46.0 1 31.07977
17 4 SixMonthlyData 1091 788.09365 805.0 503 145.67031
18 4 ThreeMonthlyData 1073 785.19398 799.5 500 142.90054
19 4 MonthlySMS 136 49.93980 54.5 1 31.47778
20 4 SixMonthlySMS 190 57.50167 46.0 1 33.66157
I am just curious to know whether this could be achieved using dplyr. It would improve my understanding of this wonderful package.
Any help is appreciated!
We can use arrange with match:
library(dplyr)
a %>%
arrange(cluster.kmeans, match(variable, top.vars_kmeans))
# cluster.kmeans variable max mean median min sd
#1 1 ThreeMonthlySMS 199 88.35484 76.0 6 59.15491
#2 1 SixMonthlyData 1085 567.09677 573.0 109 275.46994
#3 1 ThreeMonthlyData 1038 563.03226 573.0 94 275.51340
#4 1 MonthlySMS 191 90.32258 71.0 8 56.83801
#5 1 SixMonthlySMS 208 94.38710 86.0 29 56.27828
#6 2 ThreeMonthlySMS 149 53.68641 50.5 3 31.40001
#7 2 SixMonthlyData 574 280.27352 280.5 -48 139.75252
#8 2 ThreeMonthlyData 548 280.89547 279.0 -11 137.54755
#9 2 MonthlySMS 155 53.18815 57.0 1 31.64533
#10 2 SixMonthlySMS 167 57.77526 47.0 1 33.49210
#11 3 ThreeMonthlySMS 240 146.00756 146.0 54 37.06427
#12 3 SixMonthlyData 1046 541.76322 557.0 2 258.90622
#13 3 ThreeMonthlyData 1064 541.50378 558.0 10 255.35560
#14 3 MonthlySMS 215 135.60202 137.0 49 34.09794
#15 3 SixMonthlySMS 314 152.40302 152.0 27 45.55642
#16 4 ThreeMonthlySMS 141 50.88796 46.0 1 31.07977
#17 4 SixMonthlyData 1091 788.09365 805.0 503 145.67031
#18 4 ThreeMonthlyData 1073 785.19398 799.5 500 142.90054
#19 4 MonthlySMS 136 49.93980 54.5 1 31.47778
#20 4 SixMonthlySMS 190 57.50167 46.0 1 33.66157
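For comparison, the same `match` trick works in base R with `order`. This is a minimal sketch on a toy subset: `toy` and `top_order` are hypothetical names, and the values are taken from the `max` column in the question.

```r
# Desired order of the variable column (shortened for the example)
top_order <- c("ThreeMonthlySMS", "SixMonthlyData", "MonthlySMS")

toy <- data.frame(
  cluster.kmeans = c(1, 1, 1, 2, 2, 2),
  variable = c("MonthlySMS", "SixMonthlyData", "ThreeMonthlySMS",
               "MonthlySMS", "SixMonthlyData", "ThreeMonthlySMS"),
  max = c(191, 1085, 199, 155, 574, 149)
)

# match() turns each variable into its position in top_order,
# and order() sorts by cluster first, then by that position
sorted <- toy[order(toy$cluster.kmeans, match(toy$variable, top_order)), ]
sorted
```

Values of `variable` that do not appear in `top_order` get `NA` from `match` and are placed last by `order`.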
You can redefine a factor (or ordered factor) with the levels in the desired order (e.g. as stored in top.vars_kmeans):
a$variable <- factor(a$variable, levels = top.vars_kmeans)
See also the help page online, or via ?factor.
If you desire to order the whole data.frame, go by the answer of akrun.
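Once the levels are set, sorting on the factor follows the level order rather than the alphabet. A minimal sketch with hypothetical names (`toy`, `top_order`), using base R `order`:

```r
top_order <- c("ThreeMonthlySMS", "SixMonthlyData", "MonthlySMS")

toy <- data.frame(
  cluster.kmeans = c(1, 1, 1, 2, 2, 2),
  variable = rep(c("MonthlySMS", "SixMonthlyData", "ThreeMonthlySMS"), 2)
)

# Redefine the column as a factor whose levels carry the custom order
toy$variable <- factor(toy$variable, levels = top_order)

# order() sorts factors by their level positions
sorted_by_factor <- toy[order(toy$cluster.kmeans, toy$variable), ]
sorted_by_factor
```

One thing to watch: any value that is missing from `levels` silently becomes `NA` when the factor is created.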
You can try group_by and slice:
df %>% group_by(cluster.kmeans) %>% slice(match(top.vars_kmeans, variable))
# cluster.kmeans variable max mean median min sd
# (int) (fctr) (int) (dbl) (dbl) (int) (dbl)
#1 1 ThreeMonthlySMS 199 88.35484 76.0 6 59.15491
#2 1 SixMonthlyData 1085 567.09677 573.0 109 275.46994
#3 1 ThreeMonthlyData 1038 563.03226 573.0 94 275.51340
#4 1 MonthlySMS 191 90.32258 71.0 8 56.83801
#5 1 SixMonthlySMS 208 94.38710 86.0 29 56.27828
#6 2 ThreeMonthlySMS 149 53.68641 50.5 3 31.40001
#7 2 SixMonthlyData 574 280.27352 280.5 -48 139.75252
#8 2 ThreeMonthlyData 548 280.89547 279.0 -11 137.54755
#9 2 MonthlySMS 155 53.18815 57.0 1 31.64533
#10 2 SixMonthlySMS 167 57.77526 47.0 1 33.49210
#11 3 ThreeMonthlySMS 240 146.00756 146.0 54 37.06427
#12 3 SixMonthlyData 1046 541.76322 557.0 2 258.90622
#13 3 ThreeMonthlyData 1064 541.50378 558.0 10 255.35560
#14 3 MonthlySMS 215 135.60202 137.0 49 34.09794
#15 3 SixMonthlySMS 314 152.40302 152.0 27 45.55642
#16 4 ThreeMonthlySMS 141 50.88796 46.0 1 31.07977
#17 4 SixMonthlyData 1091 788.09365 805.0 503 145.67031
#18 4 ThreeMonthlyData 1073 785.19398 799.5 500 142.90054
#19 4 MonthlySMS 136 49.93980 54.5 1 31.47778
#20 4 SixMonthlySMS 190 57.50167 46.0 1 33.66157
I have some problems reading in dates and times properly, and I wonder why. The problem occurs only on my Windows installation of R. Running the exact same script on my UNIX installation works fine.
Basically, I want to read in a file with date and time as the second column, like this:
TrainData[[i]] = read.csv(TrainFiles[i],header=F, colClasses=c(NA,"POSIXct",rep(NA,8)))
colnames(TrainData[[i]])=c("comp","time","s1","s2","s3","s4","r1","r2","r3","r4")
However, only the dates are read, not the times, and my data looks like this:
comp time s1 s2 s3 s4 r1 r2 r3 r4
1 1 2009-08-18 711 630 69 600 689 20 40 1
2 5 2009-08-18 725 460 101 705 689 20 40 1
3 6 2009-08-18 711 505 69 678 689 20 40 1
4 1 2009-08-18 705 630 69 600 689 20 40 1
5 2 2009-08-18 734 516 101 671 689 20 40 1
6 3 2009-08-18 743 637 69 595 689 20 40 1
7 4 2009-08-18 730 577 101 633 689 20 40 1
8 2 2009-08-18 721 511 101 674 689 20 40 1
9 3 2009-08-18 747 563 101 642 689 20 40 1
10 4 2009-08-18 716 572 101 636 689 20 40 1
Running the exact same code on UNIX returned both dates and times.
When I read in another file in the same script, with dates and times in the two first columns, I get the correct format of the date/time:
TrainData[[i]]=read.csv(TrainFiles[i],header=F, colClasses=c("POSIXct","POSIXct",NA))
colnames(TrainData[[i]])=c("start","end","fault")
returns
start end fault
1 2010-10-24 04:25:53 2010-10-24 11:22:33 6
2 2010-10-30 12:57:16 2010-11-02 12:29:54 6
3 2010-11-05 10:40:17 2010-11-05 11:59:51 6
4 2010-11-05 17:07:37 2010-11-06 14:30:01 6
5 2010-11-06 23:59:59 2010-11-07 00:14:49 6
6 2010-11-06 23:59:59 2010-11-07 00:14:49 6
7 2010-11-06 23:59:59 2010-11-07 00:14:49 6
8 2010-11-06 23:59:59 2010-11-07 00:14:49 6
9 2010-11-06 23:59:59 2010-11-07 00:14:50 6
10 2010-11-06 23:59:47 2010-11-07 00:14:51 6
Actually, I found a solution that works, eventually, but I wonder why I get these problems.
It appears that my Sys.timezone is set to "Europe/Berlin". If I set this to NA, the times will be read in as well, i.e. using Sys.setenv(tz=NA). If I then run the same code, my data looks like this:
comp time s1 s2 s3 s4 r1 r2 r3 r4
1 1 2009-08-18 18:12:00 711 630 69 600 689 20 40 1
2 5 2009-08-18 18:14:27 725 460 101 705 689 20 40 1
3 6 2009-08-18 18:14:31 711 505 69 678 689 20 40 1
4 1 2009-08-18 18:14:43 705 630 69 600 689 20 40 1
5 2 2009-08-18 18:14:47 734 516 101 671 689 20 40 1
6 3 2009-08-18 18:14:51 743 637 69 595 689 20 40 1
7 4 2009-08-18 18:15:00 730 577 101 633 689 20 40 1
8 2 2009-08-18 18:29:33 721 511 101 674 689 20 40 1
9 3 2009-08-18 18:29:37 747 563 101 642 689 20 40 1
10 4 2009-08-18 18:29:45 716 572 101 636 689 20 40 1
The other file still gets times, but they are now consistently two hours off.
This is what the CSV files look like (basically, text separated by commas):
1,2009-08-18 18:12:00,711,630,69,600,689,20,40,1
5,2009-08-18 8:14:27,725,460,101,705,689,20,40,1
6,2009-08-18 18:14:31,711,505,69,678,689,20,40,1
1,2009-08-18 18:14:43,705,630,69,600,689,20,40,1
2,2009-08-18 8:14:47,734,516,101,671,689,20,40,1
3,2009-08-18 18:14:51,743,637,69,595,689,20,40,1
4,2009-08-18 8:15:00,730,577,101,633,689,20,40,1
2,2009-08-18 8:29:33,721,511,101,674,689,20,40,1
3,2009-08-18 8:29:37,747,563,101,642,689,20,40,1
4,2009-08-18 8:29:45,716,572,101,636,689,20,40,1
Why am I having these problems with reading in the times? I suspect that tz=NA is not the correct thing to use, but it is the only way I have found that works. Can anyone help me figure out why the times are ignored when tz = "Europe/Berlin"?
Is it generally advised to set tz=NA when reading files like this? Even though this seems to work for reading in the times, tz=NA results in warning messages when I later want to work with the data:
Warning message:
In as.POSIXlt.POSIXct(x, tz) : unknown timezone 'NA'
Can anyone help me explain the differences I get?
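One way to take the platform's automatic conversion out of the picture is to read the column as text and parse it afterwards with an explicit format and time zone. This is only a sketch, not a diagnosis of the Windows behaviour: the inline CSV is a shortened, hypothetical version of the file, and "UTC" is an assumed time zone that should be replaced with the zone the timestamps were actually recorded in (for example "Europe/Berlin").

```r
# Shortened, hypothetical version of the file (3 of the 10 columns)
csv_text <- "1,2009-08-18 18:12:00,711\n5,2009-08-18 18:14:27,725"

# Read the time column as plain text so no conversion happens on import
raw <- read.csv(text = csv_text, header = FALSE,
                colClasses = c("integer", "character", "integer"))
names(raw) <- c("comp", "time", "s1")

# Parse with an explicit format and time zone ("UTC" is an assumption)
raw$time <- as.POSIXct(raw$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

format(raw$time, "%H:%M:%S")
```

With an explicit `format`, any row that does not match comes back as `NA`, which makes problem rows easy to find with `is.na(raw$time)`.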
My data follows this sequence:
deptime .count
1 4.5 6285
2 14.5 5901
3 24.5 6002
4 34.5 5401
5 44.5 5080
6 54.5 4567
7 104.5 3162
8 114.5 2784
9 124.5 1950
10 134.5 1800
11 144.5 1630
12 154.5 1076
13 204.5 738
14 214.5 556
15 224.5 544
16 234.5 650
17 244.5 392
18 254.5 309
19 304.5 356
20 314.5 364
My ggplot code:
ggplot(pplot, aes(x=deptime, y=.count)) + geom_bar(stat="identity",fill='#FF9966',width = 5) + labs(x="time", y="count")
output figure
There is a gap after each multiple of 100. Does anyone know how to fix it?
Thank You
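A possible explanation, offered as an assumption rather than a confirmed answer: if deptime encodes hours and minutes as hmm (so 104.5 means 1 h 4.5 min), the values from 60 to 99 never occur within each hundred, and a numeric axis shows that as a gap. Under that assumption, converting to minutes since midnight gives evenly spaced values. The vector below is a hypothetical excerpt of the data.

```r
# Hypothetical excerpt around the first gap
deptime <- c(44.5, 54.5, 104.5, 114.5)

# Assumed encoding: hundreds are hours, the remainder is minutes
minutes <- (deptime %/% 100) * 60 + deptime %% 100
minutes

# The plot would then use the converted column, e.g. (not run here):
# pplot$minutes <- (pplot$deptime %/% 100) * 60 + pplot$deptime %% 100
# ggplot(pplot, aes(x = minutes, y = .count)) + geom_bar(stat = "identity", width = 5)
```

The converted values are 10 minutes apart throughout, so the bars are spaced evenly across the hour boundary.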