I have a dataframe containing location data of different animals. Each animal has a unique id and each observation has a time stamp and some further metrics of the location observation. See a subset of the data below. The subset contains the first two observations of each id.
> sub
id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
2 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
3 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
4 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
5 333 B -80.8211 24.8441 11625 6980 37 2018-12-17 20:45:05
6 333 3 -80.8137 24.8263 155 100 69 2018-12-17 21:00:43
7 444 3 -80.4535 25.0848 501 33 104 2019-10-20 19:44:16
8 444 1 -80.8086 24.8364 6356 126 87 2020-01-18 20:32:28
9 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17
10 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35
11 666 2 -77.7221 24.4902 1129 75 66 2020-07-12 21:09:02
12 666 2 -77.7097 24.4905 314 248 164 2020-07-12 21:11:37
13 777 3 -77.7133 24.4820 406 58 110 2020-06-20 11:18:18
14 777 3 -77.7218 24.4844 170 93 107 2020-06-20 11:51:06
15 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
16 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
However, I need to do some data housekeeping, i.e. I need to include the day/time and location each animal was released. And after that I need to filter out observations for each animal that occurred pre-release of the corresponding animal.
I have a an additional dataframe that contains the necessary release metadata:
> stack
id release lat lon
1 888 2017-11-27 14:53 25.69201 -79.31534
2 333 2019-01-31 16:09 25.68896 -79.31326
3 222 2019-02-02 15:55 25.70051 -79.31393
4 111 2019-04-02 10:43 25.68534 -79.31341
5 444 2020-03-13 15:04 24.42892 -77.69518
6 666 2020-10-27 09:40 24.58290 -77.69561
7 555 2020-01-21 14:38 24.43333 -77.69637
8 777 2020-06-25 08:54 24.42712 -77.76427
So my question is: how can I add the release information (time and lat/lon) to the dataframe fore each id (while the columns a, b, and c can be NA). And how can I then filter out the observations that occured before each animal's release time? I have been looking into possibilites using dplyr but was not yet able to resolve my issue.
You've not provided an easy way of obtaining your data (dput()) is by far the best and you have issues with your date time values (release uses Y-M-D H:M whereas date uses Y:M:D H:M:S) so for clarity I've included code to obtain the data frames I use at the end of this post.
First, the solution:
library(tidyverse)
library(lubridate)
sub %>%
left_join(stack, by="id") %>%
mutate(
release=ymd_hms(paste0(release, ":00")),
date=ymd_hms(date)
) %>%
filter(date >= release)
id lc lon.x lat.x a b c date release lat.y lon.y
1 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17 2020-01-21 14:38:00 24.43333 -77.69637
2 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35 2020-01-21 14:38:00 24.43333 -77.69637
As I indicated in comments.
To obtain the data
sub <- read.table(textConnection("id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
2 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
3 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
4 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
5 333 B -80.8211 24.8441 11625 6980 37 '2018-12-17 20:45:05'
6 333 3 -80.8137 24.8263 155 100 69 '2018-12-17 21:00:43'
7 444 3 -80.4535 25.0848 501 33 104 '2019-10-20 19:44:16'
8 444 1 -80.8086 24.8364 6356 126 87 '2020-01-18 20:32:28'
9 555 3 -77.7211 24.4887 665 45 68 '2020-07-12 21:09:17'
10 555 3 -77.7163 24.4897 285 129 130 '2020-07-12 21:10:35'
11 666 2 -77.7221 24.4902 1129 75 66 '2020-07-12 21:09:02'
12 666 2 -77.7097 24.4905 314 248 164 '2020-07-12 21:11:37'
13 777 3 -77.7133 24.4820 406 58 110 '2020-06-20 11:18:18'
14 777 3 -77.7218 24.4844 170 93 107 '2020-06-20 11:51:06'
15 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'
16 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'"), header=TRUE)
stack <- read.table(textConnection("id release lat lon
1 888 '2017-11-27 14:53' 25.69201 -79.31534
2 333 '2019-01-31 16:09' 25.68896 -79.31326
3 222 '2019-02-02 15:55' 25.70051 -79.31393
4 111 '2019-04-02 10:43' 25.68534 -79.31341
5 444 '2020-03-13 15:04' 24.42892 -77.69518
6 666 '2020-10-27 09:40' 24.58290 -77.69561
7 555 '2020-01-21 14:38' 24.43333 -77.69637
8 777 '2020-06-25 08:54' 24.42712 -77.76427"), header=TRUE)
I am sure this is a super easy answer but I am struggling with how to add a column with two different variables to my dataframe. Currently, this is what it looks like
vcv.index model.index par.index grid index estimate se lcl ucl fixed
1 6 6 16 A 16 0.8856724 0.07033280 0.6650468 0.9679751
2 7 7 17 A 17 0.6298118 0.06925471 0.4873052 0.7528014
3 8 8 18 A 18 0.6299359 0.06658557 0.4930263 0.7487169
4 9 9 19 A 19 0.6297988 0.05511771 0.5169948 0.7300157
5 10 10 20 A 20 0.7575811 0.05033490 0.6461758 0.8424612
6 21 21 61 B 61 0.8713467 0.07638687 0.6404598 0.9626184
7 22 22 62 B 62 0.6074379 0.06881230 0.4677827 0.7314827
8 23 23 63 B 63 0.6041054 0.06107520 0.4805279 0.7156792
9 24 24 64 B 64 0.5806565 0.06927308 0.4422237 0.7074601
10 25 25 65 B 65 0.7370944 0.05892108 0.6070620 0.8357394
11 41 41 121 C 121 0.8048479 0.09684385 0.5519097 0.9324759
12 42 42 122 C 122 0.5259547 0.07165218 0.3871380 0.6608721
13 43 43 123 C 123 0.5427100 0.07127273 0.4033255 0.6757137
14 44 44 124 C 124 0.5168820 0.06156392 0.3975561 0.6343132
15 45 45 125 C 125 0.6550049 0.07378403 0.5002851 0.7826343
16 196 196 586 A 586 0.8536314 0.08709394 0.5979992 0.9580976
17 197 197 587 A 587 0.5672194 0.07079508 0.4268452 0.6975725
18 198 198 588 A 588 0.5675415 0.06380445 0.4408540 0.6859714
19 199 199 589 A 589 0.5666874 0.06499899 0.4377071 0.6872233
20 200 200 590 A 590 0.7058542 0.05985868 0.5769484 0.8085177
21 211 211 631 B 631 0.8360614 0.09413427 0.5703031 0.9514472
22 212 212 632 B 632 0.5432872 0.07906200 0.3891364 0.6895701
23 213 213 633 B 633 0.5400994 0.06497607 0.4129055 0.6622759
24 214 214 634 B 634 0.5161692 0.06292706 0.3943257 0.6361202
25 215 215 635 B 635 0.6821667 0.07280044 0.5263841 0.8056298
26 226 226 676 C 676 0.7621875 0.10484478 0.5077465 0.9087471
27 227 227 677 C 677 0.4607440 0.07326970 0.3240229 0.6036386
28 228 228 678 C 678 0.4775168 0.08336433 0.3219349 0.6375872
29 229 229 679 C 679 0.4517655 0.06393339 0.3319262 0.5774725
30 230 230 680 C 680 0.5944330 0.07210672 0.4491995 0.7248303
then I am adding a column with periods 1-5 repeated until reaches the end
with this code
SurJagPred$estimates %<>% mutate(Primary = rep(1:5, 6))
and I also need to add sex( F, M) as well. the numbers 1-15 are female and the 16-30 are male. So overall it should look like this.
> vcv.index model.index par.index grid index estimate se lcl ucl fixed Primary Sex
F
1 6 6 16 A 16 0.8856724 0.07033280 0.6650468 0.9679751 1 F
2 7 7 17 A 17 0.6298118 0.06925471 0.4873052 0.7528014 2 F
3 8 8 18 A 18 0.6299359 0.06658557 0.4930263 0.7487169 3 F
4 9 9 19 A 19 0.6297988 0.05511771 0.5169948 0.7300157 4 F
We can use rep with each on a vector of values to replicate each element of the vector to that many times
SurJagPred$estimates %<>%
mutate(Sex = rep(c("F", "M"), each = 15))
I have an excel file with the first column looking like this:
Age
17–20
17–20
17–20
17–20
21–24
21–24
21–24
21–24
25–29
25–29
25–29
25–29
30–34
30–34
30–34
30–34
35–39
35–39
35–39
35–39
40–49
40–49
40–49
40–49
50–59
50–59
50–59
50–59
60+
60+
60+
60+
I would like to read this into R without changing each individual observation in excel. I used the following code:
df39 <- read.csv("AutoCollision.csv",header=TRUE,sep=",",
colClasses=c("character","character","numeric","numeric"))
However, this makes the data set look like this:
Age Vehicle_Use Severity Claim_Count
1 17\xd020 Pleasure 250.48 21
2 17\xd020 DriveShort 274.78 40
3 17\xd020 DriveLong 244.52 23
4 17\xd020 Business 797.80 5
5 21\xd024 Pleasure 213.71 63
6 21\xd024 DriveShort 298.60 171
7 21\xd024 DriveLong 298.13 92
8 21\xd024 Business 362.23 44
9 25\xd029 Pleasure 250.57 140
10 25\xd029 DriveShort 248.56 343
11 25\xd029 DriveLong 297.90 318
12 25\xd029 Business 342.31 129
13 30\xd034 Pleasure 229.09 123
14 30\xd034 DriveShort 228.48 448
15 30\xd034 DriveLong 293.87 361
16 30\xd034 Business 367.46 169
17 35\xd039 Pleasure 153.62 151
18 35\xd039 DriveShort 201.67 479
19 35\xd039 DriveLong 238.21 381
20 35\xd039 Business 256.21 166
21 40\xd049 Pleasure 208.59 245
22 40\xd049 DriveShort 202.80 970
23 40\xd049 DriveLong 236.06 719
24 40\xd049 Business 352.49 304
25 50\xd059 Pleasure 207.57 266
26 50\xd059 DriveShort 202.67 859
27 50\xd059 DriveLong 253.63 504
28 50\xd059 Business 340.56 162
29 60+ Pleasure 192.00 260
30 60+ DriveShort 196.33 578
31 60+ DriveLong 259.79 312
32 60+ Business 342.58 96
Why did it change the minus signs to "\xd0" and how could I go about fixing this? Thanks in advance!
I have some problems in reading in date and time in a proper way, and I wonder why I get these problems. The problem is only on my windows installation of R. Running the exact same script on my UNIX installation works fine.
Basically, I want to read in a file with data and time as the second column, like this:
TrainData[[i]] = read.csv(TrainFiles[i],header=F, colClasses=c(NA,"POSIXct",rep(NA,8)))
colnames(TrainData[[i]])=c("comp","time","s1","s2","s3","s4","r1","r2","r3","r4")
However, only the dates are read, not the times, and my data looks like this:
comp time s1 s2 s3 s4 r1 r2 r3 r4
1 1 2009-08-18 711 630 69 600 689 20 40 1
2 5 2009-08-18 725 460 101 705 689 20 40 1
3 6 2009-08-18 711 505 69 678 689 20 40 1
4 1 2009-08-18 705 630 69 600 689 20 40 1
5 2 2009-08-18 734 516 101 671 689 20 40 1
6 3 2009-08-18 743 637 69 595 689 20 40 1
7 4 2009-08-18 730 577 101 633 689 20 40 1
8 2 2009-08-18 721 511 101 674 689 20 40 1
9 3 2009-08-18 747 563 101 642 689 20 40 1
10 4 2009-08-18 716 572 101 636 689 20 40 1
Running the exact same cond on UNIX returned both time and dates.
When I read in another file in the same script, with dates and times in the two first columns, I get the correct format of the date/time:
TrainData[[i]]=read.csv(TrainFiles[i],header=F, colClasses=c("POSIXct","POSIXct",NA))
colnames(TrainData[[i]])=c("start","end","fault")
returns
start end fault
1 2010-10-24 04:25:53 2010-10-24 11:22:33 6
2 2010-10-30 12:57:16 2010-11-02 12:29:54 6
3 2010-11-05 10:40:17 2010-11-05 11:59:51 6
4 2010-11-05 17:07:37 2010-11-06 14:30:01 6
5 2010-11-06 23:59:59 2010-11-07 00:14:49 6
6 2010-11-06 23:59:59 2010-11-07 00:14:49 6
7 2010-11-06 23:59:59 2010-11-07 00:14:49 6
8 2010-11-06 23:59:59 2010-11-07 00:14:49 6
9 2010-11-06 23:59:59 2010-11-07 00:14:50 6
10 2010-11-06 23:59:47 2010-11-07 00:14:51 6
Actually, I found a solution that works, eventually, but I wonder why I get these problems.
It appears that my Sys.timezone is set to "Europe/Berlin". If I set this to NA, the times will be read in as well, i.e. using Sys.setenv(tz=NA). If I then run the same code, my data looks like this:
comp time s1 s2 s3 s4 r1 r2 r3 r4
1 1 2009-08-18 18:12:00 711 630 69 600 689 20 40 1
2 5 2009-08-18 18:14:27 725 460 101 705 689 20 40 1
3 6 2009-08-18 18:14:31 711 505 69 678 689 20 40 1
4 1 2009-08-18 18:14:43 705 630 69 600 689 20 40 1
5 2 2009-08-18 18:14:47 734 516 101 671 689 20 40 1
6 3 2009-08-18 18:14:51 743 637 69 595 689 20 40 1
7 4 2009-08-18 18:15:00 730 577 101 633 689 20 40 1
8 2 2009-08-18 18:29:33 721 511 101 674 689 20 40 1
9 3 2009-08-18 18:29:37 747 563 101 642 689 20 40 1
10 4 2009-08-18 18:29:45 716 572 101 636 689 20 40 1
The other file still get times, but now consistently two hours different.
This is how the csv-files look like (basically, text separated by commas):
this is my file (basically text separated by commas):
1,2009-08-18 18:12:00,711,630,69,600,689,20,40,1
5,2009-08-18 8:14:27,725,460,101,705,689,20,40,1
6,2009-08-18 18:14:31,711,505,69,678,689,20,40,1
1,2009-08-18 18:14:43,705,630,69,600,689,20,40,1
2,2009-08-18 8:14:47,734,516,101,671,689,20,40,1
3,2009-08-18 18:14:51,743,637,69,595,689,20,40,1
4,2009-08-18 8:15:00,730,577,101,633,689,20,40,1
2,2009-08-18 8:29:33,721,511,101,674,689,20,40,1
3,2009-08-18 8:29:37,747,563,101,642,689,20,40,1
4,2009-08-18 8:29:45,716,572,101,636,689,20,40,1
Why am I having these problems with reading in the times? I would expect that it is not correct to use tz=NA, but this is the only way I found to work. Can anyone help me figure out why the times are ignored when tz = "Europe/Berlin"?
Is it generally adviced to put tz=NA when reading files like this? Even if this seems to work in reading in the times, the tz="NA" results in warning messages when I later want to work with the data:
Warning message:
In as.POSIXlt.POSIXct(x, tz) : unknown timezone 'NA'
Can anyone help me explain the differences I get?
I am using the package trees found here, by #jbaums and explained in this post.
My data are the following:
the tree is composed by
the trunk
Trunk
[1] 13.60415
and the branches
Tree
TreeBranchLength TreeBranchID
1 10.004269 1
2 7.994269 2
3 9.028834 11
4 10.817401 12
5 8.551311 111
6 10.599798 112
7 11.073243 121
8 13.367392 122
9 9.625431 1111
10 10.793569 1112
11 9.896499 11121
12 8.687741 11122
13 7.791180 1211
14 12.506105 1212
15 6.768478 1221
16 10.441796 1222
17 10.751892 1121
18 9.458651 1122
19 10.768509 11221
20 10.150673 11222
21 12.377448 111211
22 12.235136 111212
23 9.074079 11211
24 9.996334 11212
25 9.807019 112221
26 10.895809 112222
27 6.741274 1122211
28 15.841272 1122212
29 5.753920 11222111
30 8.846389 11222112
31 11.925961 112111
32 9.780776 112112
33 8.207965 12221
34 10.079375 12222
the 50 squirrel populations -
Populations
PopulationPositionOnBranch PopulationBranchID ID
1 10.6321655 112111 1
2 1.0644897 1 2
3 3.9315473 1 3
4 1.0310244 0 4
5 9.1768846 0 5
6 13.4267181 0 6
7 7.9461528 0 7
8 6.0533401 121 8
9 2.1227425 121 9
10 1.8256787 121 10
11 4.7332588 11222112 11
12 4.4837432 11222112 12
13 4.6200834 11222112 13
14 2.5622276 1221 14
15 1.2446683 1221 15
16 7.0674052 111 16
17 1.3854674 111 17
18 4.8735635 111 18
19 9.5007998 1222 19
20 6.6373468 1222 20
21 12.6757728 122 21
22 4.2685465 122 22
23 3.9806540 2 23
24 3.1025403 2 24
25 3.9119065 11122 25
26 1.5527653 11122 26
27 1.6687957 11122 27
28 8.0697456 1122 28
29 6.7871391 1122 29
30 9.8050713 111212 30
31 8.5226920 111212 31
32 3.6113379 111212 32
33 7.3184965 111211 33
34 8.6142984 111211 34
35 1.3550870 1211 35
36 8.3650639 12 36
37 4.6411446 112112 37
38 3.2985541 112112 38
39 12.2344148 1212 39
40 9.0290776 1212 40
41 1.3900249 1121 41
42 0.9261425 1122212 42
43 15.2522199 1122212 43
44 4.0253771 12222 44
45 8.7507678 11222 45
46 4.6289841 1122211 46
47 9.1799522 112 47
48 5.1293838 12221 48
49 1.1543080 12221 49
50 10.1014837 112222 50
the code to produce the plot
g <- germinate(list(trunk.height=Trunk,
branches=Tree$TreeBranchID,
lengths=Tree$TreeBranchLength),
left='1', right='2', angle=30))
xy <- squirrels(g, Populations$PopulationBranchID, pos=Populations$PopulationPositionOnBranch,
left='1', right='2', pch=21, bg='white', cex=3, lwd=2)
text(xy$x, xy$y, labels=seq_len(nrow(xy)), font=1)
, which produces
As you can see on the plot bellow population 43 (blue arrow) is out of the tree.. It seems that the length of the branches on the plot do not correspond to the data. For example the branch (left green arrow) on which are populations 38 and 37 is longer than the one where population 43 is (right green arrow), that is not the case in the data. What am I doing wrong? Have I understood correctly how to use trees?
On studying the germinate function it seems to me that the Tree values that you are passing to it needs to be sorted on TreeBranchId field in the ascending order.
The BranchID: 1122212 where you have placed 43 is not the actual 1122212 branch.
Due to the order in which you have fed the values in the Tree, the function is somehow messing the location of branch.
I was curious to see if I increase the length of Branch ID: 1122212, will it change the branch where 43 is placed, and guess what? it didn't. The branch which actually showed an increase in length was the branch where you have placed 37 and 38.
So this hint pointed out that something was wrong with germinate function. On further debugging I was able to make it work using the below code.
Tree<-read.csv("treeBranch.csv")
Tree<-Tree[order(Tree$TreeBranchID),]
g <- germinate(list(trunk.height=15,
branches=Tree$TreeBranchID,
lengths=Tree$TreeBranchLength),
left='1', right='2', angle=30)
xy <- squirrels(g, Populations$PopulationBranchID,pos=Populations$PopulationPositionOnBranch,
left='1', right='2', pch=21, bg='white', cex=3, lwd=2)
text(xy$x, xy$y, labels=seq_len(nrow(xy)), font=1)