I have a dataframe containing location data of different animals. Each animal has a unique id and each observation has a time stamp and some further metrics of the location observation. See a subset of the data below. The subset contains the first two observations of each id.
> sub
id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
2 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
3 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
4 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
5 333 B -80.8211 24.8441 11625 6980 37 2018-12-17 20:45:05
6 333 3 -80.8137 24.8263 155 100 69 2018-12-17 21:00:43
7 444 3 -80.4535 25.0848 501 33 104 2019-10-20 19:44:16
8 444 1 -80.8086 24.8364 6356 126 87 2020-01-18 20:32:28
9 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17
10 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35
11 666 2 -77.7221 24.4902 1129 75 66 2020-07-12 21:09:02
12 666 2 -77.7097 24.4905 314 248 164 2020-07-12 21:11:37
13 777 3 -77.7133 24.4820 406 58 110 2020-06-20 11:18:18
14 777 3 -77.7218 24.4844 170 93 107 2020-06-20 11:51:06
15 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
16 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
However, I need to do some data housekeeping, i.e. I need to include the day/time and location each animal was released. And after that I need to filter out observations for each animal that occurred pre-release of the corresponding animal.
I have an additional dataframe that contains the necessary release metadata:
> stack
id release lat lon
1 888 2017-11-27 14:53 25.69201 -79.31534
2 333 2019-01-31 16:09 25.68896 -79.31326
3 222 2019-02-02 15:55 25.70051 -79.31393
4 111 2019-04-02 10:43 25.68534 -79.31341
5 444 2020-03-13 15:04 24.42892 -77.69518
6 666 2020-10-27 09:40 24.58290 -77.69561
7 555 2020-01-21 14:38 24.43333 -77.69637
8 777 2020-06-25 08:54 24.42712 -77.76427
So my question is: how can I add the release information (time and lat/lon) to the dataframe for each id (while the columns a, b, and c can be NA)? And how can I then filter out the observations that occurred before each animal's release time? I have been looking into possibilities using dplyr but have not yet been able to resolve my issue.
You've not provided an easy way of obtaining your data (dput() is by far the best), and your date-time formats are inconsistent (release uses Y-M-D H:M whereas date uses Y-M-D H:M:S), so for clarity I've included the code to build the data frames I use at the end of this post.
First, the solution:
library(tidyverse)
library(lubridate)
sub %>%
  left_join(stack, by="id") %>%
  mutate(
    release=ymd_hms(paste0(release, ":00")),
    date=ymd_hms(date)
  ) %>%
  filter(date >= release)
id lc lon.x lat.x a b c date release lat.y lon.y
1 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17 2020-01-21 14:38:00 24.43333 -77.69637
2 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35 2020-01-21 14:38:00 24.43333 -77.69637
As I indicated in comments.
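For readers who prefer to avoid extra packages, the same join-and-filter can be sketched in base R. This is only a sketch under assumptions: the data is a small inline subset of the post's values, the time zone "UTC" is assumed because the post does not state one, and the column names release_lat and release_lon are made up here to avoid the lat.x / lat.y suffixes.

```r
# Small inline subset of the two data frames (values copied from the post)
sub <- data.frame(
  id   = c(111, 555, 555, 777),
  lon  = c(-79.2975, -77.7211, -77.7163, -77.7133),
  lat  = c(25.6996, 24.4887, 24.4897, 24.4820),
  date = c("2019-04-01 22:08:50", "2020-07-12 21:09:17",
           "2020-07-12 21:10:35", "2020-06-20 11:18:18")
)
stack <- data.frame(
  id      = c(111, 555, 777),
  release = c("2019-04-02 10:43", "2020-01-21 14:38", "2020-06-25 08:54"),
  lat     = c(25.68534, 24.43333, 24.42712),
  lon     = c(-79.31341, -77.69637, -77.76427)
)

# Rename the release coordinates (hypothetical names) so the join does
# not produce ambiguous lat.x / lat.y columns
names(stack)[names(stack) %in% c("lat", "lon")] <- c("release_lat", "release_lon")

# Parse each timestamp with its own explicit format; tz = "UTC" is an assumption
sub$date      <- as.POSIXct(sub$date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
stack$release <- as.POSIXct(stack$release, format = "%Y-%m-%d %H:%M", tz = "UTC")

# Left join on id, then keep only observations at or after the release time;
# which() drops rows where the comparison is NA (ids without release data)
joined <- merge(sub, stack, by = "id", all.x = TRUE)
post_release <- joined[which(joined$date >= joined$release), ]
print(post_release)
```

Giving each column its own format string removes the need to paste ":00" onto the release times.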
To obtain the data
sub <- read.table(textConnection("id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
2 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
3 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
4 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
5 333 B -80.8211 24.8441 11625 6980 37 '2018-12-17 20:45:05'
6 333 3 -80.8137 24.8263 155 100 69 '2018-12-17 21:00:43'
7 444 3 -80.4535 25.0848 501 33 104 '2019-10-20 19:44:16'
8 444 1 -80.8086 24.8364 6356 126 87 '2020-01-18 20:32:28'
9 555 3 -77.7211 24.4887 665 45 68 '2020-07-12 21:09:17'
10 555 3 -77.7163 24.4897 285 129 130 '2020-07-12 21:10:35'
11 666 2 -77.7221 24.4902 1129 75 66 '2020-07-12 21:09:02'
12 666 2 -77.7097 24.4905 314 248 164 '2020-07-12 21:11:37'
13 777 3 -77.7133 24.4820 406 58 110 '2020-06-20 11:18:18'
14 777 3 -77.7218 24.4844 170 93 107 '2020-06-20 11:51:06'
15 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'
16 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'"), header=TRUE)
stack <- read.table(textConnection("id release lat lon
1 888 '2017-11-27 14:53' 25.69201 -79.31534
2 333 '2019-01-31 16:09' 25.68896 -79.31326
3 222 '2019-02-02 15:55' 25.70051 -79.31393
4 111 '2019-04-02 10:43' 25.68534 -79.31341
5 444 '2020-03-13 15:04' 24.42892 -77.69518
6 666 '2020-10-27 09:40' 24.58290 -77.69561
7 555 '2020-01-21 14:38' 24.43333 -77.69637
8 777 '2020-06-25 08:54' 24.42712 -77.76427"), header=TRUE)
Thank you all for your answers. I thought I was smarter than I am and hoped I would have understood some of it. I think I messed up the presentation of my data as well. I have edited my post to better show my sample data. Sorry for the inconvenience, and I truly hope that someone can help me.
I have a question about reshaping my data. The data collected looks as such:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked up some things about converting wide to long format, but they don't seem to do the trick. I am relatively new to RStudio and Stack Overflow (if you couldn't tell that already).
Kind regards, and thank you in advance.
Here is a slightly different pivot_longer() version.
library(tidyr)
library(dplyr)
dw %>%
pivot_longer(cols = -PID, names_to =".value", names_pattern = "(.+)[0-9]")
# A tibble: 9 x 3
PID T measurement
<dbl> <dbl> <dbl>
1 1 1 100
2 1 4 200
3 1 7 50
4 2 2 150
5 2 5 300
6 2 8 60
7 3 3 120
8 3 6 210
9 3 9 70
The names_to = ".value" argument creates new columns from column names based on the names_pattern argument. The names_pattern argument takes a special regex input. In this case, here is the breakdown:
(.+) # match everything - anything captured like this becomes the ".value" columns
[0-9] # a single digit - tells the pattern that the number
# at the end is excluded from ".value". If you have multi-digit
# suffixes, use ([^0-9]+)[0-9]+ instead, because a greedy (.+) would
# swallow all but the last digit
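To see what the capture group picks up, the pattern can be tried on a few column names with base R's regexec() before handing it to pivot_longer(). This is a minimal sketch: the helper capture() and the example names with two-digit suffixes are made up for illustration, and it only mimics the first capture group rather than calling tidyr itself.

```r
# Hypothetical column names, including two-digit suffixes
cols <- c("T1", "measurement1", "T12", "measurement12")

# Return the text matched by the first capture group of `pattern`
capture <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(g) g[2], character(1), USE.NAMES = FALSE)
}

# Greedy (.+) followed by a single digit keeps part of a two-digit suffix
greedy <- capture("(.+)[0-9]", cols)
print(greedy)   # "T" "measurement" "T1" "measurement1"

# Capturing only non-digits leaves the whole numeric suffix out
strict <- capture("([^0-9]+)[0-9]+", cols)
print(strict)   # "T" "measurement" "T" "measurement"
```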
In the last edit you asked for a solution that is easy to understand. A very simple approach would be to stack the measurement columns on top of each other and the Tdays columns on top of each other. Although specialty packages make things very concise and elegant, for simplicity we can solve this without additional packages. Standard R has a convenient function aptly named stack, which works like this:
> exp <- data.frame(value1 = 1:5, value2 = 6:10)
> stack(exp)
values ind
1 1 value1
2 2 value1
3 3 value1
4 4 value1
5 5 value1
6 6 value2
7 7 value2
8 8 value2
9 9 value2
10 10 value2
We can stack measurements and Tdays separately and then combine them via cbind:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurement4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
cbind(stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
This keeps measurements and Tdays neatly together but leaves us without pid, which we can add by using rep to replicate the original pid 4 times:
result <- cbind(pid = rep(data$pid, 4),
stack(data, c(measurement1, measurement2, measurement3, measurement4)),
stack(data, c(Tdays1, Tdays2, Tdays3, Tdays4)))
The head of which looks like
> head(result)
pid values ind values ind
1 1 1356 measurement1 1435 Tdays1
2 2 943 measurement1 1848 Tdays1
3 3 1590 measurement1 185 Tdays1
4 4 130 measurement1 72 Tdays1
5 4 140 measurement1 82 Tdays1
6 4 220 measurement1 126 Tdays1
This is not the order you expected; if that matters, you can sort the data.frame by pid and pick the columns you need:
result <- result[order(result$pid), c(1, 4, 2)]
names(result) <- c("pid", "Time", "Value")
leading to the final result
> head(result)
pid Time Value
1 1 1435 1356
13 1 1405 1483
25 1 1374 1563
37 1 NA NA
2 2 1848 943
14 2 1818 1173
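The result above still contains the all-NA rows, which the desired output in the question does not show. If those rows are unwanted, one possible follow-up is complete.cases(). The sketch below repeats the stack approach on a small made-up subset with two measurement/Tdays pairs; the variable names meas and tdays are only illustrative.

```r
# Small subset of the question's data with two measurement/Tdays pairs
data <- data.frame(
  pid = c(1, 2, 3),
  measurement1 = c(1356, 943, 1590), Tdays1 = c(1435, 1848, 185),
  measurement2 = c(1483, 1173, NA),  Tdays2 = c(1405, 1818, NA)
)

# Stack each group of columns on top of each other
meas  <- stack(data, select = c(measurement1, measurement2))
tdays <- stack(data, select = c(Tdays1, Tdays2))

# Combine with pid replicated once per stacked column pair
result <- data.frame(pid = rep(data$pid, 2),
                     Time = tdays$values,
                     Value = meas$values)

# Sort by pid, then drop rows where Time or Value is missing
result <- result[order(result$pid), ]
result <- result[complete.cases(result), ]
print(result)
```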
tidyverse solution
library(tidyverse)
dw %>%
pivot_longer(-PID) %>%
mutate(name = gsub('^([A-Za-z]+)(\\d+)$', '\\1_\\2', name )) %>%
separate(name, into = c('A', 'B'), sep = '_', convert = T) %>%
pivot_wider(names_from = A, values_from = value)
Gives the following output
# A tibble: 9 x 4
PID B T measurement
<int> <int> <int> <int>
1 1 1 1 100
2 1 2 4 200
3 1 3 7 50
4 2 1 2 150
5 2 2 5 300
6 2 3 8 60
7 3 1 3 120
8 3 2 6 210
9 3 3 9 70
Considering a dataframe, df like the following:
PID T1 measurement1 T2 measurement2 T3 measurement3
1 1 100 4 200 7 50
2 2 150 5 300 8 60
3 3 120 6 210 9 70
You can use this solution to get your required dataframe:
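# Column positions of T2, T3, ... (each is followed by its measurement column)
iters = seq(from = 4, to = length(colnames(df))-1, by = 2)
finalDf = df[, c(1,2,3)]
for(j in iters){
  tobind = df[, c(1,j,j+1)]
  # rbind() matches data frame columns by name, so align the names first
  names(tobind) = names(finalDf)
  finalDf = rbind(finalDf, tobind)
}
finalDf = finalDf[order(finalDf[,1]),]
print(finalDf)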
The output of the print statement is this:
PID T1 measurement1
1 1 1 100
4 1 4 200
7 1 7 50
2 2 2 150
5 2 5 300
8 2 8 60
3 3 3 120
6 3 6 210
9 3 9 70
Maybe you can try reshape like below
reshape(
setNames(data, gsub("(\\d+)$", "\\.\\1", names(data))),
direction = "long",
varying = 2:ncol(data)
)
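To make the reshape call easier to follow, here is a sketch of it on a small made-up subset, followed by one possible clean-up into the pid / Time / Value layout from the question. The intermediate name dotted and the clean-up steps are not part of the answer above.

```r
# Small subset of the question's data with two measurement/Tdays pairs
data <- data.frame(
  pid = c(1, 2, 3),
  measurement1 = c(1356, 943, 1590), Tdays1 = c(1435, 1848, 185),
  measurement2 = c(1483, 1173, NA),  Tdays2 = c(1405, 1818, NA)
)

# Insert a "." before the trailing number so reshape() can split
# "measurement.1" into a variable name and a time index (its default sep)
dotted <- setNames(data, gsub("(\\d+)$", ".\\1", names(data)))
long <- reshape(dotted, direction = "long", varying = 2:ncol(dotted))

# Keep only rows with a measurement and rename to the desired layout
long <- long[!is.na(long$measurement), c("pid", "Tdays", "measurement")]
names(long) <- c("pid", "Time", "Value")
long <- long[order(long$pid), ]
rownames(long) <- NULL
print(long)
```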
I have this data that I want to plot as a time series.
Date Units.Sold
1 Jan-16 588
2 Feb-16 448
3 Mar-16 490
4 Apr-16 512
5 May-16 528
6 Jun-16 432
7 Jul-16 470
8 Aug-16 446
9 Sep-16 465
10 Oct-16 388
11 Nov-16 429
12 Dec-16 414
However, when I use ts(datasetName), I get this:
Time Series:
Start = 1
End = 12
Frequency = 1
Date Units.Sold
1 5 588
2 4 448
3 8 490
4 1 512
5 9 528
6 7 432
7 6 470
8 2 446
9 12 465
10 11 388
11 10 429
12 3 414
As you can see, the dates are in the wrong order. I want January to correspond with 1, February with 2, and so on. Can anybody help?
You need to convert your column named 'Date' to a Date-class object first. You can use as.Date for that, but you'll need to add a year first.
your_year <- 2018
df$Date <- as.Date(paste0(df$Date, '-', your_year), format = '%b-%d-%Y')
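The codes in the question's output (Apr = 1, Aug = 2, Dec = 3, ...) are consistent with the month labels having been turned into alphabetically ordered factor codes. An alternative reading of the data is that "16" is a two-digit year rather than a day of the month. Under that assumption, a sketch could look like the one below; it uses only the first four rows, and %b depends on an English (or C) locale for the month abbreviations.

```r
# First rows of the question's data
df <- data.frame(
  Date = c("Jan-16", "Feb-16", "Mar-16", "Apr-16"),
  Units.Sold = c(588, 448, 490, 512)
)

# Assumption: "16" is a two-digit year, so prepend a dummy day of the month.
# %b matches abbreviated month names in the current locale (English here).
df$Date <- as.Date(paste0("01-", df$Date), format = "%d-%b-%y")
df <- df[order(df$Date), ]

# Build a monthly series from the values only, starting at the first month
first <- min(df$Date)
sales <- ts(df$Units.Sold,
            start = c(as.integer(format(first, "%Y")),
                      as.integer(format(first, "%m"))),
            frequency = 12)
print(sales)
```

Passing only the numeric column to ts() keeps the dates out of the series values; the time index comes from start and frequency instead.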
I have a huge data frame with details of each employee's working hours per day. For e.g:
STAFF_ID DATE MONTH HOURS_WORKED
345 4-May-15 May 5
678 4-May-15 May 2
965 4-May-15 May 4
248 4-May-15 May 6
345 5-May-15 May 7
678 6-May-15 May 3
678 7-May-15 May 3
965 8-May-15 May 5
345 7-Jun-15 June 1
678 8-Jun-15 June 2
965 8-Jun-15 June 4
248 8-Jun-15 June 6
345 8-Jun-15 June 3
678 9-Jun-15 June 2
678 10-Jun-15 June 3
965 11-Jun-15 June 4
965 12-Jun-15 June 3
What I want to find out is whether any employee works more than 7 hours per month and, if so:
On which day does the employee's running total for the month first become greater than 7, and
for that day, by how much does the total exceed the maximum?
Expected results:
STAFF_ID DATE MONTH HOURS_WORKED LATEST_DATE HOURS_EXCEED
345 4-May-15 May 5 5-May-15 5
678 4-May-15 May 2 7-May-15 1
965 4-May-15 May 4 8-May-15 2
248 4-May-15 May 6 NA NA
345 5-May-15 May 7 5-May-15 5
678 6-May-15 May 3 7-May-15 1
678 7-May-15 May 3 7-May-15 1
965 8-May-15 May 5 8-May-15 2
345 7-Jun-15 June 1 NA NA
678 8-Jun-15 June 2 NA NA
965 8-Jun-15 June 4 11-Jun-15 1
248 8-Jun-15 June 6 NA NA
345 8-Jun-15 June 3 NA NA
678 9-Jun-15 June 2 NA NA
678 10-Jun-15 June 3 NA NA
965 11-Jun-15 June 4 11-Jun-15 1
965 12-Jun-15 June 3 11-Jun-15 1
I have also asked the same question, but in that question, I asked for Excel solutions. However, as mentioned, the data file is really huge, hence, I would prefer if I could solve this using R.
Thank you!
Updated
Using data.table with a custom function, assuming DATE is of class Date.
library(data.table)
# Calculate cumulative sum of hours worked per month per group
setDT(df)[,total_hours := cumsum(HOURS_WORKED),by = c("STAFF_ID", "MONTH")]
# Define custom function which selects first match that is total_hours > 7
over.seven <- function(x,z) {
  y <- x[(z>7)][1]
  return(y)
}
# Add desired columns
df[,`:=`(LATEST_DATE = over.seven(DATE,total_hours),
         HOURS_EXCEED = over.seven(total_hours - 7,total_hours)),
   by = c("STAFF_ID", "MONTH")]
> df
# STAFF_ID DATE MONTH HOURS_WORKED total_hours LATEST_DATE HOURS_EXCEED
# 1: 345 2015-05-04 May 5 5 2015-05-05 5
# 2: 678 2015-05-04 May 2 2 2015-05-07 1
# 3: 965 2015-05-04 May 4 4 2015-05-08 2
# 4: 248 2015-05-04 May 6 6 <NA> NA
# 5: 345 2015-05-05 May 7 12 2015-05-05 5
# 6: 678 2015-05-06 May 3 5 2015-05-07 1
# 7: 678 2015-05-07 May 3 8 2015-05-07 1
# 8: 965 2015-05-08 May 5 9 2015-05-08 2
# 9: 345 2015-06-07 June 1 1 <NA> NA
#10: 678 2015-06-08 June 2 2 <NA> NA
#11: 965 2015-06-08 June 4 4 2015-06-11 1
#12: 248 2015-06-08 June 6 6 <NA> NA
#13: 345 2015-06-08 June 3 4 <NA> NA
#14: 678 2015-06-09 June 2 4 <NA> NA
#15: 678 2015-06-10 June 3 7 <NA> NA
#16: 965 2015-06-11 June 4 8 2015-06-11 1
#17: 965 2015-06-12 June 3 11 2015-06-11 1
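The answer assumes DATE is already of class Date, and the cumulative sum depends on the rows being in date order within each employee. A minimal base-R sketch of those two preparation steps, on a few rows from the question, might look like this. The format string assumes day, abbreviated English month and two-digit year, as in "4-May-15".

```r
# A few rows from the question's data
df <- data.frame(
  STAFF_ID = c(345, 678, 345, 678, 678),
  DATE = c("4-May-15", "4-May-15", "5-May-15", "6-May-15", "7-May-15"),
  MONTH = "May",
  HOURS_WORKED = c(5, 2, 7, 3, 3)
)

# Convert the text dates; %b relies on English month abbreviations
df$DATE <- as.Date(df$DATE, format = "%d-%b-%y")

# Sort so that the running total follows calendar order per employee
df <- df[order(df$STAFF_ID, df$DATE), ]

# Running total of hours per employee and month (base-R counterpart
# of the grouped cumsum in the data.table code)
df$total_hours <- ave(df$HOURS_WORKED, df$STAFF_ID, df$MONTH, FUN = cumsum)
print(df)
```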
I have some problems in reading in date and time in a proper way, and I wonder why I get these problems. The problem is only on my windows installation of R. Running the exact same script on my UNIX installation works fine.
Basically, I want to read in a file with data and time as the second column, like this:
TrainData[[i]] = read.csv(TrainFiles[i],header=F, colClasses=c(NA,"POSIXct",rep(NA,8)))
colnames(TrainData[[i]])=c("comp","time","s1","s2","s3","s4","r1","r2","r3","r4")
However, only the dates are read, not the times, and my data looks like this:
comp time s1 s2 s3 s4 r1 r2 r3 r4
1 1 2009-08-18 711 630 69 600 689 20 40 1
2 5 2009-08-18 725 460 101 705 689 20 40 1
3 6 2009-08-18 711 505 69 678 689 20 40 1
4 1 2009-08-18 705 630 69 600 689 20 40 1
5 2 2009-08-18 734 516 101 671 689 20 40 1
6 3 2009-08-18 743 637 69 595 689 20 40 1
7 4 2009-08-18 730 577 101 633 689 20 40 1
8 2 2009-08-18 721 511 101 674 689 20 40 1
9 3 2009-08-18 747 563 101 642 689 20 40 1
10 4 2009-08-18 716 572 101 636 689 20 40 1
Running the exact same code on UNIX returned both times and dates.
When I read in another file in the same script, with dates and times in the two first columns, I get the correct format of the date/time:
TrainData[[i]]=read.csv(TrainFiles[i],header=F, colClasses=c("POSIXct","POSIXct",NA))
colnames(TrainData[[i]])=c("start","end","fault")
returns
start end fault
1 2010-10-24 04:25:53 2010-10-24 11:22:33 6
2 2010-10-30 12:57:16 2010-11-02 12:29:54 6
3 2010-11-05 10:40:17 2010-11-05 11:59:51 6
4 2010-11-05 17:07:37 2010-11-06 14:30:01 6
5 2010-11-06 23:59:59 2010-11-07 00:14:49 6
6 2010-11-06 23:59:59 2010-11-07 00:14:49 6
7 2010-11-06 23:59:59 2010-11-07 00:14:49 6
8 2010-11-06 23:59:59 2010-11-07 00:14:49 6
9 2010-11-06 23:59:59 2010-11-07 00:14:50 6
10 2010-11-06 23:59:47 2010-11-07 00:14:51 6
Actually, I found a solution that works, eventually, but I wonder why I get these problems.
It appears that my Sys.timezone is set to "Europe/Berlin". If I set this to NA, the times will be read in as well, i.e. using Sys.setenv(tz=NA). If I then run the same code, my data looks like this:
comp time s1 s2 s3 s4 r1 r2 r3 r4
1 1 2009-08-18 18:12:00 711 630 69 600 689 20 40 1
2 5 2009-08-18 18:14:27 725 460 101 705 689 20 40 1
3 6 2009-08-18 18:14:31 711 505 69 678 689 20 40 1
4 1 2009-08-18 18:14:43 705 630 69 600 689 20 40 1
5 2 2009-08-18 18:14:47 734 516 101 671 689 20 40 1
6 3 2009-08-18 18:14:51 743 637 69 595 689 20 40 1
7 4 2009-08-18 18:15:00 730 577 101 633 689 20 40 1
8 2 2009-08-18 18:29:33 721 511 101 674 689 20 40 1
9 3 2009-08-18 18:29:37 747 563 101 642 689 20 40 1
10 4 2009-08-18 18:29:45 716 572 101 636 689 20 40 1
The other file still gets times, but now they are consistently off by two hours.
This is what the CSV files look like (basically, text separated by commas):
1,2009-08-18 18:12:00,711,630,69,600,689,20,40,1
5,2009-08-18 8:14:27,725,460,101,705,689,20,40,1
6,2009-08-18 18:14:31,711,505,69,678,689,20,40,1
1,2009-08-18 18:14:43,705,630,69,600,689,20,40,1
2,2009-08-18 8:14:47,734,516,101,671,689,20,40,1
3,2009-08-18 18:14:51,743,637,69,595,689,20,40,1
4,2009-08-18 8:15:00,730,577,101,633,689,20,40,1
2,2009-08-18 8:29:33,721,511,101,674,689,20,40,1
3,2009-08-18 8:29:37,747,563,101,642,689,20,40,1
4,2009-08-18 8:29:45,716,572,101,636,689,20,40,1
Why am I having these problems with reading in the times? I would expect that it is not correct to use tz=NA, but this is the only way I found to work. Can anyone help me figure out why the times are ignored when tz = "Europe/Berlin"?
Is it generally advised to set tz=NA when reading files like this? Even if this seems to work for reading in the times, tz=NA results in warning messages when I later want to work with the data:
Warning message:
In as.POSIXlt.POSIXct(x, tz) : unknown timezone 'NA'
Can anyone help me explain the differences I get?
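One way to narrow this down, offered only as a sketch and not as an explanation of the platform difference, is to read the time column as plain text and convert it with an explicit format and time zone. Any row that does not match the format then shows up as NA instead of silently changing how the whole column is parsed. The two inline rows are copied from the file excerpt above, and tz = "UTC" is an assumption; substitute the zone the timestamps were actually recorded in.

```r
# Two rows copied from the file excerpt, supplied inline for the sketch
csv_text <- "1,2009-08-18 18:12:00,711,630,69,600,689,20,40,1
6,2009-08-18 18:14:31,711,505,69,678,689,20,40,1"

# Read the second column as character instead of POSIXct
train <- read.csv(text = csv_text, header = FALSE,
                  colClasses = c(NA, "character", rep(NA, 8)))
colnames(train) <- c("comp", "time", "s1", "s2", "s3", "s4",
                     "r1", "r2", "r3", "r4")

# Convert with an explicit format and time zone (tz is an assumption)
train$time <- as.POSIXct(train$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Rows whose time did not match the format, if any
bad_rows <- which(is.na(train$time))
print(train$time)
print(bad_rows)
```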