Using ggplot2, connect x- and y-coordinates by a third variable - r

I would like to plot latitude vs longitude and connect the points via date and time, which I have stored in an object of class POSIXlt. I have many, many GPS points, but here is a small set of them that I would like to plot using ggplot2.
My data are like so:
Description lat lon
6/16/2012 17:22 12.117017 -89.69692
6/17/2012 9:15 12.1178 -89.69675
6/17/2012 9:33 12.117783 -89.69673
6/17/2012 10:19 12.11785 -89.69665
6/17/2012 10:45 12.11775 -89.69677
6/17/2012 11:22 12.1178 -89.69673
6/17/2012 11:39 12.117817 -89.69662
6/17/2012 11:59 12.117717 -89.69677
6/17/2012 12:10 12.117717 -89.69655
6/16/2012 16:38 12.11795 -89.6965
6/16/2012 18:29 12.1178 -89.69688
6/16/2012 17:11 12.117417 -89.69703
6/16/2012 17:36 12.116967 -89.69668
6/16/2012 17:50 12.117217 -89.69695
6/16/2012 18:02 12.117583 -89.69715
6/16/2012 18:15 12.11785 -89.69665
6/16/2012 18:27 12.117683 -89.69632
I have a map that I am plotting these points onto.
I can plot the points just fine
plot1 <- map + geom_point(data=dat, aes(x = lon, y = lat))
map is an object I made with ggmap, but it's not that important to include here.
The following code produces a line connecting points as lon increases
plot1+geom_line(data=dat, aes(x=lon,y=lat,colour="red"))
I can't figure out how to connect the points by the vector POSIXlt object Description
I know that in this small example I could easily reorder the points using something like dat2 <- dat[with(dat, order(Description)), ], and remake plot1 using dat2 and make the desired plot using the following code:
plot1+geom_path(data=dat2, aes(x = lon, y = lat, colour="red"))
But for my much larger (hundreds of thousands of observations) dataset, this doesn't make sense as a solution without a bit more work to properly id each observation, which I will certainly end up doing anyway as part of additional data exploration.
Is there an argument I haven't discovered in geom_line for telling R how to connect the points?
I am admittedly still a novice at using ggplot2, so I apologize if I have missed something very simple. I have been working on a lot of other code and learning, or at least using, several other packages to work with this GPS data and the other spatial data available. It's all a bit overwhelming... So many ideas, so little know-how! The larger point of this is to visualize (and eventually analyze) movement patterns and use of space by my study organisms, but for now, it would be great to visualize the data in a variety of ways to really get familiar with it.
If you have any recommended packages for working with spatial data and GPS data, I'd love to hear about them, as well.

You need the rows ordered by the date/time object to use geom_path. Since I think this is the best way to display the data we should focus on finding an efficient way to sort a large dataset. Obviously it would be good to get an idea of the scale of dataset you are working with. Millions of rows? Billions perhaps?!
Fortunately the data.table package does this very well indeed. Here is an example on a 1 million row table with an ID column, ID, on which the table is originally sorted, an unsorted time column of 1-second observations, and two random columns for x and y. Sorting it by date/time takes < 1s on my laptop:
set.seed(123)
require(data.table)
# Rows ordered on ID, random order of unique date/time values of 1 second observations
df <- data.frame( ID = seq.int(1e6) , Desc = as.POSIXct(sample(1e6),origin=Sys.Date()) , x = runif(1e6) , y = runif(1e6) )
head(df)
# ID Desc x y
#1 1 2013-05-25 02:39:39 0.2363783 0.1387404
#2 2 2013-05-25 23:58:17 0.1192702 0.1284918
#3 3 2013-05-21 17:41:57 0.8599183 0.6301114
#4 4 2013-05-23 16:12:42 0.8089243 0.7919304
#5 5 2013-05-21 08:17:28 0.8197109 0.4568693
#6 6 2013-05-22 17:57:23 0.4611204 0.5358536
# Convert to data.table
DT <- data.table(df)
# Sort on 'Desc'
setkey(DT , Desc)
head(DT)
# ID Desc x y
#1: 544945 2013-05-18 01:00:01 0.7052422 0.52030877
#2: 886165 2013-05-18 01:00:02 0.2256636 0.04391553
#3: 893690 2013-05-18 01:00:03 0.1860687 0.30978506
#4: 932276 2013-05-18 01:00:04 0.6305562 0.65188810
#5: 407622 2013-05-18 01:00:05 0.5355992 0.98146120
#6: 138936 2013-05-18 01:00:06 0.5999025 0.81722902
# A data.table is also a data.frame, so it can be passed to ggplot2 directly
df2 <- DT
So in your case you can try something like:
datDT <- data.table(dat)
setkey(datDT , Description)
dat2 <- datDT
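A minimal sketch of the whole pipeline on a few rows copied from the question. It assumes Description arrives as text in month/day/year hour:minute form; the format string and tz = "UTC" are my assumptions, not something the question states. Base order() is swapped in for setkey here, since either one leaves the rows in time order, which is all geom_path needs.

```r
library(ggplot2)

# A handful of rows copied from the question; Description is read as text first
dat <- data.frame(
  Description = c("6/16/2012 17:22", "6/17/2012 9:15", "6/17/2012 9:33",
                  "6/16/2012 16:38", "6/16/2012 18:29"),
  lat = c(12.117017, 12.1178, 12.117783, 12.11795, 12.1178),
  lon = c(-89.69692, -89.69675, -89.69673, -89.6965, -89.69688)
)

# Assumed format; POSIXct is easier to keep in a data frame than POSIXlt
dat$Description <- as.POSIXct(dat$Description,
                              format = "%m/%d/%Y %H:%M", tz = "UTC")

# geom_path joins points in row order, so sort the rows by time first
dat2 <- dat[order(dat$Description), ]

p <- ggplot(dat2, aes(x = lon, y = lat)) +
  geom_point() +
  geom_path(colour = "red")  # a constant colour goes outside aes()
print(p)
```

geom_line would instead reorder the points along the x axis, which is why the original attempt followed increasing lon.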

Related

calculating date specific correlation in r (leading to a potential time series)

I have a dataset that looks somewhat like this (the actual dataset is ~150000 lines with additional columns of fluff information such as company name, etc.):
Date return1 return2 rank
01/31/2008 0.05434 0.23413 3
01/31/2008 0.03423 0.43423 4
01/31/2008 0.65277 0.23423 1
01/31/2008 0.02342 0.47234 4
02/31/2008 0.01463 0.01231 4
02/31/2008 0.13456 0.52552 2
02/31/2008 0.34534 0.36663 1
02/31/2008 0.00324 0.56463 3
...
12/31/2015 0.21234 0.02333 2
12/31/2015 0.07245 0.87234 1
12/31/2015 0.47282 0.12998 1
12/31/2015 0.99022 0.03445 2
Basically I need to calculate the date-specific correlation between return1 and rank (so the corr. on 01/31/2008, 02/31/2008, and so on). I know I can split the data using the split() function but I am unsure as to how to get the date-specific correlation. The real data has about 260 entries per date and around 68 dates, so manually subsetting the original table and performing calculations is time consuming but more importantly more susceptible to error.
My ultimate goal is to create a time series of the correlations on different dates.
Thank you in advance!
I had this same problem earlier, except I wasn't calculating correlation. What I would do is
a %>% group_by(Date) %>% summarise(Correlation = cor(return1, rank))
And this will provide, for each date, a correlation value between return1 and rank. Don't forget that you can specify what kind of correlation you would like (e.g. Spearman).
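For comparison, here is a rough base R sketch of the same per-date calculation, built on the split() call the question mentions instead of dplyr. The toy values and the second date are made up for illustration; only the column names come from the question. It uses Spearman correlation to show where the method argument goes.

```r
# Toy data with the question's column names (values are made up)
a <- data.frame(
  Date    = rep(c("01/31/2008", "02/29/2008"), each = 4),
  return1 = c(0.05, 0.03, 0.65, 0.02, 0.01, 0.13, 0.35, 0.00),
  rank    = c(3, 4, 1, 2, 4, 2, 1, 3)
)

# One correlation per date: split into per-date pieces, then cor() on each
by_date <- split(a, a$Date)
corrs <- sapply(by_date,
                function(d) cor(d$return1, d$rank, method = "spearman"))

# Collect into a data frame with a real Date column, ordered in time
result <- data.frame(
  Date = as.Date(names(corrs), format = "%m/%d/%Y"),
  Correlation = unname(corrs)
)
result <- result[order(result$Date), ]

# The time series of correlations
plot(result$Date, result$Correlation, type = "b")
```

Converting Date to a proper Date class before ordering matters, because text dates in month/day/year form do not sort chronologically across years.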

Many dataframes, different row lengths, similar columns and dataframe titles, how to bind?

This takes a bit to explain and the post itself may be a bit too long to be answered.
I have MANY data frames of individual chess players and their specific ratings at points in time.
Here is what my data looks like. Please forgive me for my poor formatting of separating the datasets. Carlsen and Nakamura are separate dataframes.
Player1
Nakamura, Hikaru Year
2364 2001-01-01
2430 2002-01-01
2520 2003-01-01
2571 2004-01-01
2613 2005-01-01
2644 2006-01-01
2651 2007-01-01
2670 2008-01-01
2699 2009-01-01
2708 2010-01-01
2751 2011-01-01
2759 2012-01-01
2769 2013-01-01
2789 2014-01-01
2776 2015-01-01
2787 2016-01-01
Player2
Carlsen, Magnus Year
2127 2002-01-01
2279 2003-01-01
2484 2004-01-01
2553 2005-01-01
2625 2006-01-01
2690 2007-01-01
2733 2008-01-01
2776 2009-01-01
2810 2010-01-01
2814 2011-01-01
2835 2012-01-01
2861 2013-01-01
2872 2014-01-01
2862 2015-01-01
2844 2016-01-01
Between the above code, and below, I've deleted two columns and reassigned an observation as a column title.
Hikaru Nakamura/Magnus Carlsen's chess rating over time
Hikaru's data is assigned to a dataframe, Player1.
Magnus's data is assigned to a dataframe, Player2.
What I want to be able to do is get what you see below, a dataframe of them combined.
The code I used to produce this frame is
merged<- merge(Player1, Player2, by = c("Year"), all = TRUE)
Now, this is all fun and dandy for two data sets, but I am having very annoying difficulties adding more players to this combined data set.
For example, maybe I would like to add 5, 10, 15 more players to this set. Examples of these players would be Kramnik, Anand, Gelfand ( Examples of famous chess players). As you'd expect, for 5 players, the dataframe would have 6 columns, 10 would have 11, 15 would have 16, all ordered nicely by the Year variable.
Fortunately, the number of observations for each Player is less than 100 always. Also, each individual player is assigned his/her own dataset.
For example,
Nakamura is the Player1 dataframe
Carlsen is the Player2 dataframe
Kramnik is the Player3 dataframe
Anand is the Player4 dataframe
Gelfand is the Player5 dataframe
all of which I have created using a for loop assigning process using this code
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
assign(paste("Player",i,sep=""), subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
I don't want to write out something like below:
merged<- merge(Player1, Player2,.....Player99 ,Player100, by = c("Year"), all = TRUE)
I want to be able to merge all 5, 10, 15...i number of Player"i" objects that I created in the loop together by Year.
Also, once it leaves the loop initially, each dataset looks like this.
So what ends up happening is that I assign all of the data sets to a list by using the following snippet:
lst <- mget(ls(pattern='^Player\\d+'))
list2env(lapply(lst,`[`,-2), envir =.GlobalEnv)
lst <- mget(ls(pattern='^Player\\d+'))
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
names(lst[[i]]) [names(lst[[i]]) == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
}
This is what my list looks like.
Is there a way to build a table merged by Year, so that each of the Player"i" dataframes in my list (which are not necessarily equal in length) is combined, whether by cbind, bind_cols, merge, etc., into a combined/merged set like the one you saw below the merge(Player1, Player2) call?
Here is the diagram again, but it would have to be for many players, not just Carlsen and Nakamura.
Also, is there a way I can avoid using the list function, and just straight up do
names(Player"i") [names(Player"i") == 'Rating'] <- eval(unique(Timed_set_filtered$Name)[i])
which just renames the titles of all of the dataframes that start with "Player".
merge(player1, player2, player3,...., player99, player100, by = c("YEAR"), all = TRUE)
which would merge all of the "Player""i" datasets?
If anything is unclear, please mention it.
It was pretty funny that one line of code did the trick. After I assigned all of the Player1, Player 2....Player i into the list, I just joined all of the sets contained in the list by Year.
For loop that generates all of unique datasets.
for (i in 1:nrow(as.data.frame(unique(Timed_set_filtered$Name)))) {
assign(paste("Player",i,sep=""), subset(Timed_set_filtered, Name == unique(Timed_set_filtered$Name)[i]))
}
Puts them into a list
lst <- mget(ls(pattern='^Player\\d+'))
Merge, or join by common value
df <- join_all(lst, by = 'Year')
Unfortunately, unlike merge(datasets...., all= TRUE), it drops certain observations for an unknown reason, will have to see why this happens.
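One likely explanation, as far as I can tell from the plyr documentation, is that join_all defaults to type = "left", which only keeps years present in the first data frame; passing type = "full" should behave like all = TRUE. A dependency-free sketch of the same idea folds merge() over the list with Reduce. The two small data frames below are cut-down stand-ins built from the ratings in the question, already reduced to Year plus one rating column.

```r
# Cut-down stand-ins for the Player data frames (Year + one rating column)
Player1 <- data.frame(
  Year = as.Date(c("2001-01-01", "2002-01-01", "2003-01-01")),
  `Nakamura, Hikaru` = c(2364, 2430, 2520),
  check.names = FALSE
)
Player2 <- data.frame(
  Year = as.Date(c("2002-01-01", "2003-01-01", "2004-01-01")),
  `Carlsen, Magnus` = c(2127, 2279, 2484),
  check.names = FALSE
)
lst <- list(Player1 = Player1, Player2 = Player2)

# Fold merge() over the list; all = TRUE keeps years missing for some players
merged <- Reduce(function(x, y) merge(x, y, by = "Year", all = TRUE), lst)
merged
```

The same Reduce call works unchanged for a list of any length, such as the one returned by mget(ls(pattern = '^Player\\d+')).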

How to extract time from POSIXct and plot?

Here I have a data frame that looks like the following, with the first column "POSIXct" and the second "latitude"
> head(b)
sample_time latitude
3813442 2015-05-21 19:02:41 39.92770
3813483 2015-05-21 19:03:16 39.92770
3813485 2015-05-21 19:14:30 39.92433
3813515 2015-05-21 19:14:59 39.92469
3813550 2015-05-21 19:15:30 39.92520
3813585 2015-05-21 19:16:00 39.92585
Now, I want to plot latitude vs sample_time, with the x axis representing a 24-hour timestamp within a single day, and group latitude by different days.
Any help will be appreciated! Many thanks.
First, you need to define "day", as opposed to the full time. Then you need to figure out what you mean by "group" ... let's just say you want to aggregate and take the daily mean. Third, you need to make the plot.
b$day <- as.Date(b[,"sample_time"])
b_agg <- aggregate(list(latitude=b[,"latitude"]), by=list(day=b[,"day"]), FUN=mean)
plot(b_agg)
Edit:
Just an additional thought, if you didn't want to aggregate, you could skip the second step, and change the third to plot(b[,"day"], b[,"latitude"]). Alternatively, you may even want something like boxplot(latitude~day, data=b).
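If the goal is literally a 0 to 24 hour x axis with one line per day, one possible sketch is to split each timestamp into a calendar day and a decimal hour of day. The first three rows below come from the question; the second day's rows are made up, and tz = "UTC" is an assumption.

```r
# Sample in the same shape as b (second day's rows are made up)
b <- data.frame(
  sample_time = as.POSIXct(c("2015-05-21 19:02:41", "2015-05-21 19:14:30",
                             "2015-05-21 19:16:00", "2015-05-22 08:30:00",
                             "2015-05-22 12:45:00", "2015-05-22 19:10:00"),
                           tz = "UTC"),
  latitude = c(39.92770, 39.92433, 39.92585, 39.92600, 39.92500, 39.92450)
)

# Calendar day, and time of day as decimal hours in [0, 24)
b$day  <- as.Date(b$sample_time, tz = "UTC")
b$hour <- as.numeric(format(b$sample_time, "%H")) +
  as.numeric(format(b$sample_time, "%M")) / 60 +
  as.numeric(format(b$sample_time, "%S")) / 3600

# Empty frame spanning the full day, then one line per day
plot(NA, xlim = c(0, 24), ylim = range(b$latitude),
     xlab = "Hour of day", ylab = "Latitude")
days <- unique(b$day)
for (i in seq_along(days)) {
  d <- b[b$day == days[i], ]
  lines(d$hour, d$latitude, type = "b", col = i)
}
legend("topright", legend = format(days), col = seq_along(days), lty = 1)
```

With many days the legend gets crowded, so it may be better to drop it or colour by week instead.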

Using abline() when x-axis is date (ie, time-series data)

I want to add multiple vertical lines to a plot.
Normally you would specify abline(v=x-intercept) but my x-axis is in the form Jan-95 - Dec-09. How would I adapt the abline code to add a vertical line for example in Feb-95?
I have tried abline(v=as.Date("Jan-95")) and other variants of this piece of code.
Following this is it possible to add multiple vertical lines with one piece of code, for example Feb-95, Feb-97 and Jan-98?
An alternate solution could be to alter my plot. I have a column with month information and a column with the year information; how do I combine these to have a year-month on the x-axis?
example[25:30,]
Year Month YRM TBC
25 1997 1 Jan-97 136
26 1997 2 Feb-97 157
27 1997 3 Mar-97 163
28 1997 4 Apr-97 152
29 1997 5 May-97 151
30 1997 6 Jun-97 170
The first note: your YRM column is probably a factor, not a datetime object, unless you converted it manually. I assume we do not want to do that and our plot is looking fine with YRM as a factor.
In that case
vline_month <- function(s) abline(v=which(levels(df$YRM) %in% s))
# keep original order of levels
df$YRM <- factor(df$YRM, levels=unique(df$YRM))
plot(df$YRM, df$TBC)
vline_month(c("Jan-97", "Apr-97"))
Disclaimer: this solution is a quick hack; it is neither universal nor scalable. For accurate representation of datetime objects and extensible tools for them, see packages zoo and xts.
I see two issues:
a) converting your data to a date/POSIX element, and
b) actually plotting vertical lines at specific rows.
For the first, create a proper date string then use strptime().
The second issue is resolved by converting the POSIX date to numeric using as.numeric().
# dates need Y-M-D
example$ymd <- paste(example$Year, '-', example$Month, '-01', sep='')
# convert to POSIX date
example$ymdPX <- strptime(example$ymd, format='%Y-%m-%d')
# may want to define tz otherwise system tz is used
# plot your data
plot(example$ymdPX, example$TBC, type='b')
# add vertical lines at first and last record
abline(v=as.numeric(example$ymdPX[1]), lwd=2, col='red')
abline(v=as.numeric(example$ymdPX[nrow(example)]), lwd=2, col='red')
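On the question of several vertical lines from one piece of code: abline accepts a vector for v, so a single call can draw them all. A small sketch using the six rows shown in the question, with the Date class in place of POSIXlt (a substitution on my part; the marker dates are arbitrary picks within the plotted range):

```r
# Rows 25-30 from the question
example <- data.frame(Year = 1997, Month = 1:6,
                      TBC = c(136, 157, 163, 152, 151, 170))

# First day of each month as a Date
example$date <- as.Date(paste(example$Year, example$Month, 1, sep = "-"))

plot(example$date, example$TBC, type = "b", xlab = "Month", ylab = "TBC")

# abline() takes a vector for v, so one call draws all the lines
marks <- as.Date(c("1997-02-01", "1997-04-01", "1997-06-01"))
abline(v = as.numeric(marks), lty = 2, col = "red")
```

The as.numeric() wrapper mirrors the answer above: the axis is stored as a plain number of days, so the line positions are passed in the same units.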

Comparing multiple data frames

I need some help with data analysis.
I do have two datasets (before & after) and I want to see how big the difference is between them.
Before
11330 STAT1
2721 STAT2
52438 STAT3
6124 SUZY
After
17401 STAT1
3462 STAT2
0 STAT3
72 SUZY
Tried to group them with tapply(before$V1, before$V2, FUN=mean).
But when I try to plot it, the x axis shows numbers instead of the group names.
How can I plot such tapplied data (frequency on Y axis & group name on X axis)?
Also wanted to ask what is the proper command in R to compare such datasets as I am willing to find the difference between them?
Edited
dput(before$V1)
c(11330L, 2721L, 52438L, 6124L)
dput(before$V2)
structure(1:4, .Label = c("STAT1", "STAT2", "STAT3","SUZY"),class = "factor")
Here are a couple of ideas.
This is what I think your data look like?
before <- data.frame(val=c(11330,2721,52438,6124),
lab=c("STAT1","STAT2","STAT3","SUZY"))
after <- data.frame(val=c(17401,3462,0,72),
lab=c("STAT1","STAT2","STAT3","SUZY"))
Combine them into a single data frame with a period variable:
combined <- rbind(data.frame(before,period="before"),
data.frame(after,period="after"))
Reformat to a matrix and plot with (base R) dotchart:
library(reshape2)
m <- acast(combined,lab~period,value.var="val")
dotchart(m)
Plot with ggplot:
library(ggplot2)
qplot(lab,val,colour=period,data=combined)
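To get the actual differences, one option is to line the two tables up by label and subtract. This is only a sketch using the same reconstructed data frames as above; the suffixes and the diff column name are my own choices.

```r
before <- data.frame(val = c(11330, 2721, 52438, 6124),
                     lab = c("STAT1", "STAT2", "STAT3", "SUZY"))
after  <- data.frame(val = c(17401, 3462, 0, 72),
                     lab = c("STAT1", "STAT2", "STAT3", "SUZY"))

# Match rows by label, then compute after minus before
cmp <- merge(before, after, by = "lab", suffixes = c("_before", "_after"))
cmp$diff <- cmp$val_after - cmp$val_before
cmp

# names.arg puts the group names, not index numbers, on the x axis
barplot(cmp$diff, names.arg = as.character(cmp$lab), ylab = "after - before")
```

Merging by label instead of subtracting the two val columns directly guards against the rows being in a different order in the two data sets.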
