Calculating distance between multiple points at the same time of the day - r

I have two dataframes, one with my boat GPS positions (5512 records) and one with fishing boats positions (35381 records). I want to calculate the distance between my boat and all other fishing boats that were present in the area at the same time of the day (to the minute).
I created an IDdatecode (yyyymmddhhmm) for all the positions, then I merged the two dataframes based on the same IDdatecode. I did this:
merged_table<- merge(myboat,fishboats,by="IDdatecode",all.y=TRUE)
To calculate the distance I used the formula:
merged_table$distance_between_vessels=distm(c("lon1","lat1"),c("lon2","lat2"),fun=distGeo)
where lon1, lat1 are my boat positions and lon2, lat2 are fishing boats.
But I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "distance_between_vessels", value = NA_real_) :
replacement has 1 row, data has 35652
In addition: Warning messages:
1: In .pointsToMatrix(x) : NAs introduced by coercion
2: In .pointsToMatrix(y) : NAs introduced by coercion
What I tried so far is:
use this other formula: merged_table$distance_between_vessels=distGeo(c("lon1","lat1"),c("lon2","lat2"))
put all the columns of lat and lon "as.numeric"
use only interval times where both my boat and fishing boats were present
ignore the warning and keep going
But I still get only a list of NAs.
I used the function "distGeo" on a much simpler dataset (only my boat's positions), where I manually calculated the distance between the first and second point, then between the second and third point, and so on. The function works perfectly: it gives me exactly the right distance between two points (I checked it in ArcGIS). This is what I did:
distGeo(mydata[1, ], mydata[2, ])
distGeo(mydata[2, ], mydata[3, ])
distGeo(mydata[3, ], mydata[4, ])
So, I want to calculate 'one-to-many' distances based on a unique time of the day, but I get the above error. Any ideas on why? Thanks :)
Here, my first 10 rows of the merged table:
structure(list(Record = 1:10, IDdatecode = structure(c(1L, 2L,
3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L), .Label = c("d201805081203",
"d201805081204", "d201805081205", "d201805081206", "d201805081207",
"d201805081208"), class = "factor"), lon1 = c(12.40203333, 12.4071,
12.41165, 12.41165, 12.41485, 12.41485, 12.41663333, 12.41663333,
12.41841667, 12.41841667), lat1 = c(45.1067, 45.10921667, 45.11218333,
45.11218333, 45.11303333, 45.11303333, 45.11313333, 45.11313333,
45.11348333, 45.11348333), boat1 = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "RB", class = "factor"), lon2 = c(13.02718,
13.02585827, 13.02453654, 13.02173, 13.02321482, 13.02052301,
13.02189309, 13.01931602, 13.02057136, 13.01810904), lat2 = c(44.98946,
44.99031749, 44.99117498, 44.98792, 44.99203246, 44.98868065,
44.99288995, 44.98944129, 44.99374744, 44.99020194), boat2 = structure(c(1L,
1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("IMPERO II",
"MISTRAL"), class = "factor")), .Names = c("Record", "IDdatecode",
"lon1", "lat1", "boat1", "lon2", "lat2", "boat2"), row.names = c(NA,
-10L), class = "data.frame")

V2, Update (January 17, 2022)
Glad it works for you. If you are willing to avoid for-loops you could consider a dplyr approach. Have a look.
library(dplyr)
df <- silvia %>%
rowwise() %>%
mutate(distance = geosphere::distGeo(c(lon1, lat1), c(lon2, lat2)))
df
The base R apply family would be another option.
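As a rough sketch of that option: distGeo() also accepts two-column (longitude, latitude) matrices and pairs them row by row, so neither a loop nor rowwise() is strictly needed. The data frame name boats is made up for this illustration; the coordinates are the first three rows of the merged table from the question.

```r
library(geosphere)

# First three rows of the merged table from the question
# ("boats" is a name invented for this sketch)
boats <- data.frame(
  lon1 = c(12.40203333, 12.4071, 12.41165),
  lat1 = c(45.1067, 45.10921667, 45.11218333),
  lon2 = c(13.02718, 13.02585827, 13.02453654),
  lat2 = c(44.98946, 44.99031749, 44.99117498)
)

# Vectorised call: row i of the first matrix is paired with row i of the second
boats$dist_vec <- distGeo(as.matrix(boats[, c("lon1", "lat1")]),
                          as.matrix(boats[, c("lon2", "lat2")]))

# apply-family variant: mapply() walks the four columns in parallel
boats$dist_mapply <- mapply(
  function(x1, y1, x2, y2) distGeo(c(x1, y1), c(x2, y2)),
  boats$lon1, boats$lat1, boats$lon2, boats$lat2
)

boats
```

Both columns should agree; on a table with tens of thousands of rows the vectorised call is likely the faster of the two.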
V1 (January 16, 2022)
Hopefully this approach helps. For-loops are often discouraged in R, but I used one here because it is easy to understand.
I made the following assumptions:
boat1 is your boat. lat1 and lon1 represent the position of boat1 for any IDdatecode;
as I did not fully understand what you mean by "based on a unique time of the day", I assumed looping over each row is sufficient;
the function distGeo() is from geosphere package.
# loading your dataframe as "silvia"
silvia <- structure(list(Record = 1:10, IDdatecode = structure(c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L),
.Label = c("d201805081203","d201805081204", "d201805081205", "d201805081206", "d201805081207", "d201805081208"),
class = "factor"), lon1 = c(12.40203333, 12.4071, 12.41165, 12.41165, 12.41485, 12.41485, 12.41663333,
12.41663333, 12.41841667, 12.41841667), lat1 = c(45.1067, 45.10921667, 45.11218333, 45.11218333, 45.11303333,
45.11303333, 45.11313333, 45.11313333, 45.11348333, 45.11348333), boat1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "RB", class = "factor"), lon2 = c(13.02718, 13.02585827, 13.02453654, 13.02173, 13.02321482,
13.02052301, 13.02189309, 13.01931602, 13.02057136, 13.01810904), lat2 = c(44.98946, 44.99031749, 44.99117498, 44.98792,
44.99203246, 44.98868065, 44.99288995, 44.98944129, 44.99374744, 44.99020194), boat2 = structure(c(1L, 1L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L), .Label = c("IMPERO II", "MISTRAL"), class = "factor")), .Names = c("Record", "IDdatecode",
"lon1", "lat1", "boat1", "lon2", "lat2", "boat2"), row.names = c(NA, -10L), class = "data.frame")
# for EACH ROW in "silvia" calculate the distance between c("lon1", "lat1") and c("lon2", "lat2")
for (i in 1:nrow(silvia)){
silvia$distance[i] <- geosphere::distGeo(c(silvia[i, "lon1"], silvia[i,"lat1"]),
c(silvia[i, "lon2"], silvia[i,"lat2"]))
}
# here you see the first 5 entries of the df "silvia"
# the distances are calculated in metres
# the parameters a and f are set to WGS84 by default.
head(silvia, n=5)
#> Record IDdatecode lon1 lat1 boat1 lon2 lat2 boat2
#> 1 1 d201805081203 12.40203 45.10670 RB 13.02718 44.98946 IMPERO II
#> 2 2 d201805081204 12.40710 45.10922 RB 13.02586 44.99032 IMPERO II
#> 3 3 d201805081205 12.41165 45.11218 RB 13.02454 44.99117 IMPERO II
#> 4 4 d201805081205 12.41165 45.11218 RB 13.02173 44.98792 MISTRAL
#> 5 5 d201805081206 12.41485 45.11303 RB 13.02321 44.99203 IMPERO II
#> distance
#> 1 50943.77
#> 2 50503.93
#> 3 50118.46
#> 4 50005.52
#> 5 49774.51
Note. Created on 2022-01-16 by the reprex package (v2.0.1)
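As for the error in the question, a plausible reading of the "NAs introduced by coercion" warnings is that c("lon1","lat1") is a character vector holding two column names, not the columns themselves, so nothing numeric reaches the distance function. The base R sketch below illustrates the difference on a hypothetical three-row table; it does not call geosphere.

```r
# Hypothetical stand-in for the merged table
merged_table <- data.frame(
  lon1 = c(12.402, 12.407, 12.412),
  lat1 = c(45.107, 45.109, 45.112)
)

# What the original call passed: two strings
by_name <- c("lon1", "lat1")
is.character(by_name)
# Coercing the strings to numbers can only give NA (hence the warning)
coerced <- suppressWarnings(as.numeric(by_name))

# What was intended: the numeric columns, one (lon, lat) row per record
coords <- as.matrix(merged_table[, c("lon1", "lat1")])
dim(coords)
```

Note also that distm() returns a full matrix of all pairwise distances, whereas distGeo() returns one distance per pair of rows, which is what a new column in the merged table needs.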

Related

How to filter row by lowest value in a column using a for loop in R

This is not elegant, but for each file I want to filter the row when dvdt first meets/exceeds 15. I first filtered for each file dvdt values >= 15. Then I tried to filter rows with the minimum time value in this new data frame. The problem is that min(time) returns the global minimum across all files, whereas I'd like to identify the lowest time value within each file. Any help would be appreciated!
df <- structure(list(file = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("19509002.abf", "19509007.abf"
), class = "factor"), time = c(4.800000191, 4.849999905, 4.900000095,
4.949999809, 5, 5.050000191, 5.099999905, 5.150000095, 5.199999809,
5.25, 5.300000191, 5.349999905, 5.400000095, 5.449999809, 5.5,
4.849999905, 4.900000095, 4.949999809, 5, 5.050000191, 5.099999905,
5.150000095, 5.199999809, 5.25, 5.300000191, 5.349999905, 5.400000095,
5.449999809), V = c(-34.8815918, -29.96826172, -23.65112305,
-16.44897461, -7.843017578, 3.234863281, 15.86914063, 27.6184082,
37.109375, 44.18945313, 49.37744141, 52.94799805, 55.41992188,
57.00683594, 57.80029297, -36.28540039, -31.92138672, -24.78027344,
-16.3269043, -6.683349609, 5.310058594, 18.89038086, 31.21948242,
40.67993164, 47.24121094, 51.63574219, 54.32128906, 55.9387207
), dvdt = c(47.6074219, 98.2666016, 126.342773, 144.042969, 172.119141,
221.557617, 252.685547, 234.985352, 189.819336, 141.601563, 103.759766,
71.4111328, 49.4384766, 31.7382813, 15.8691406, 27.4658203, 87.2802734,
142.822266, 169.067383, 192.871094, 239.868164, 271.606445, 246.582031,
189.208984, 131.225586, 87.890625, 53.7109375, 32.3486328)), row.names = c(NA,
28L), class = "data.frame")
vthresh <- data.frame()
for (i in unique(df$file)){
vthresh = rbind(vthresh, df %>% filter(file == i, time == min(time)))
}
# filter dvdt values >= 15
dfsub <- subset(df, dvdt >= 15)
# identify the lowest time value within each file
aggregate(dfsub$time, by = list(dfsub$file), min)
which gives the following output:
Group.1 x
1 19509002.abf 4.80
2 19509007.abf 4.85
I built on Marc's answer and the code below works!
df_sub <- subset(df, dvdt >= 15)
df_agg <- aggregate(df_sub$time, by = list(df_sub$file), min)
colnames(df_agg) <- c('file', 'time')
vthresh <- merge(df_sub, df_agg, by=c("file","time"))
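A possible one-step alternative, sketched in base R: ave() returns the per-file minimum aligned with every row, so the comparison happens within each file and no aggregate/merge round trip is needed. The file names and values below are made up for the illustration.

```r
# Small made-up example with two files
df <- data.frame(
  file = c("a.abf", "a.abf", "a.abf", "b.abf", "b.abf", "b.abf"),
  time = c(4.80, 4.85, 4.90, 4.85, 4.90, 4.95),
  dvdt = c(10, 47, 98, 12, 9, 87)
)

# Keep rows at or above the threshold
df_sub <- df[df$dvdt >= 15, ]

# ave() computes min(time) per file and repeats it for each row of that file,
# so the comparison below picks the earliest qualifying row in every file
first_hit <- df_sub[df_sub$time == ave(df_sub$time, df_sub$file, FUN = min), ]
first_hit
```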

Analyzing spatial data between two points in R using a very large data set

This is my first time writing code in R from scratch and I'm struggling with how to approach it. I'm looking at turtle nests and their proximity to light sources (i.e. houses, light poles, etc.) to determine how often a light source is within a given radius of a nest.
These are both very large data sets (hundreds of thousands of rows) so the code will likely need to run a loop for each nest position. GPS coordinates for both data sets are in decimal degrees.
The nest data is essentially latitude, longitude, date observed, and species (if known)
The light source data is latitude, longitude, type, and several other light-related parameters I'd like to keep in the data set.
Any suggestions on how to loop through the nest coordinates to determine light sources within radius, r, would be greatly appreciated! For each light source within r for a nest, I'd like for the end result to spit out the entire row of light source data (type, location, additional light-related parameters, etc.) if that is possible rather than just say how many values were T vs. F for being inside r. Thanks!
> Nest <- read.csv("Nest.csv", header=T)
> Lights <- read.csv("Lights.csv", header=T)
> #Nest
> dput(droplevels(Nest[1:10, ]))
structure(list(LAT = c(34.146535, 34.194585, 34.216854, 34.269901,
34.358718, 34.37268, 34.380848, 34.394183, 34.410384, 34.415077
), LONG = c(-77.839787, -77.804013, -77.787032, -77.742722, -77.63655,
-77.619872, -77.609373, -77.591654, -77.568456, -77.561256),
DATE = structure(c(2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 1L, 7L
), .Label = c("2016-05-19T03:12", "2016-05-21T07:23", "2016-05-23T08:14",
"2016-05-24T04:21", "2016-05-25T11:15", "2016-05-27T05:12",
"2016-05-27T09:45", "2016-05-28T09:42", "2016-05-28T10:18",
"2016-05-29T02:26"), class = "factor"), SPECIES = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Cc", class = "factor")), row.names = c(NA,
10L), class = "data.frame")
> #Lights
> dput(droplevels(Lights[1:10, ]))
structure(list(LAT = c(34.410925, 34.410803, 34.410686, 34.410476,
34.410361, 34.410237, 34.410151, 34.410016, 34.409821, 34.409671
), LONG = c(-77.568183, -77.568296, -77.568478, -77.568757, -77.568915,
-77.569135, -77.569355, -77.569527, -77.569707, -77.569905),
DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = "5/19/2016", class = "factor"), TYPE = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "R", class = "factor"),
WATTS = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
10L), class = "data.frame")
Since your data sets are large, the proposed solution tries to avoid a full Cartesian product of all nests against all lights.
For this, we use the non-equi join capability of data.table, which only allows simple operators like > or <.
This makes it possible to pre-filter the lights to those inside a box around each nest.
The box should be large enough to contain the circle whose radius is the maximum distance from the nest.
In a second step, we calculate the distance on the filtered data only (far fewer calculations than a Cartesian product of all the data):
library(data.table)
library(geosphere)
#To data.table
setDT(Nest)
setDT(Lights)
# Define a box around each nest
dlon<- 0.001
dlat <- 0.001
Nest[,c("LATNest","LONGNest","latmin","latmax","longmin","longmax"):=.(LAT,LONG,LAT-dlat, LAT+dlat,LONG-dlon,LONG+dlon)]
Nest[,c("LAT","LONG") :=.(NULL,NULL)]
# Search lights in box
LightNearNest <- Nest[Lights, .(LATNest,LONGNest, LATLight = LAT, LONGLight = LONG), on = .(latmin<LAT , latmax>LAT,longmin<LONG,longmax>LONG),nomatch=0,allow.cartesian=T]
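# Calculate distance; the light's point must use its own latitude (LATLight, not LATNest).
# The dist values printed below came from a run that passed LATNest here,
# so they reflect only the east-west offset.
LightNearNest[,dist:= geosphere::distHaversine(cbind(LONGNest,LATNest),cbind(LONGLight,LATLight))]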
LightNearNest
LATNest LONGNest LATLight LONGLight dist
1: 34.41038 -77.56846 34.41092 -77.56818 25.072269
2: 34.41038 -77.56846 34.41080 -77.56830 14.694370
3: 34.41038 -77.56846 34.41069 -77.56848 2.020476
4: 34.41038 -77.56846 34.41048 -77.56876 27.643784
5: 34.41038 -77.56846 34.41036 -77.56892 42.154475
6: 34.41038 -77.56846 34.41024 -77.56914 62.359234
7: 34.41038 -77.56846 34.41015 -77.56936 82.563993
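The box half-widths (0.001 degrees above) can also be derived from the search radius. The following is a minimal base R sketch for a single nest, assuming a spherical Earth of radius 6378137 m; the helper haversine_m and the radius r = 100 m are hypothetical choices for illustration, and the coordinates are taken from the sample data in the question.

```r
# Haversine distance in metres; inputs in decimal degrees
# (spherical Earth, radius assumed to be 6378137 m)
haversine_m <- function(lon1, lat1, lon2, lat2, radius = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

r <- 100  # search radius in metres (hypothetical value)
nest <- c(lon = -77.568456, lat = 34.410384)
lights <- data.frame(
  LONG = c(-77.568183, -77.568478, -77.569905),
  LAT  = c(34.410925, 34.410686, 34.409671)
)

# Half-widths in degrees of a box that encloses a circle of radius r
dlat <- (r / 6378137) * 180 / pi
dlon <- dlat / cos(nest[["lat"]] * pi / 180)

# Step 1: cheap box filter; step 2: exact distance on the survivors only
in_box <- abs(lights$LAT - nest[["lat"]]) <= dlat &
  abs(lights$LONG - nest[["lon"]]) <= dlon
candidates <- lights[in_box, ]
candidates$dist <- haversine_m(nest[["lon"]], nest[["lat"]],
                               candidates$LONG, candidates$LAT)
near <- candidates[candidates$dist <= r, ]
near
```

Any extra columns in the lights table (type, watts, and so on) would be carried along by the row subsetting, so the full light record is kept for each match.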

Extracting a Group from a Cluster

I am working on a cluster analysis with R.
I use the Allbus dataset, from which I extracted 7 columns.
With the following code I built my clusters:
library("haven")
library("cluster") # provides clara()
AllbusDatensatz <- read_sav("AllbusAntworten.sav")
CDU <- AllbusDatensatz$pa22
CSU <- AllbusDatensatz$pa23
SPD <- AllbusDatensatz$pa24
FDP <- AllbusDatensatz$pa25
Linke <- AllbusDatensatz$pa26
Gruenen <- AllbusDatensatz$pa27
AfD <- AllbusDatensatz$pa28
UmbenannterDatensatz <- cbind(CDU, CSU, SPD, FDP, Linke, Gruenen, AfD)
BereinigterDatensatz <- na.omit(UmbenannterDatensatz)
AllbusCentroid <- clara(BereinigterDatensatz,4,metric = "manhattan")
From these four clusters I would now like to extract all the members of the first cluster.
I would then like to compare this list with the initial dataset AllbusDatensatz so that I can get the average age of all members, which is also part of the dataset.
If someone could help me with my problem I would be very thankful.
Since you have not given any data, here is an example with some fake data:
library(cluster)
clarax <- clara(x,4,metric = "manhattan")
As written here, you have to fetch the vector of clusters:
clarax$clustering
And put it in your original data, then subset the desired rows:
# add vector of clustering
x$clust <- clarax$clustering
# subset the part of cluster == 1
x_cl1 <- x[x$clust == 1,]
Now you can calculate what you need.
mean(x_cl1[,1])
Note, the first cluster here is defined as the cluster marked with 1.
With data:
x <- structure(list(X1 = c(-4.58075844925284, 0.0652767299325834,
-3.77639403053622, 2.74184342257295, -1.55283663415684, -0.646564270393359,
1.98008127381616, 8.97937011921846, 17.0830608896667, -0.373624506395029,
2.60144234508749, 1.32892095552686, 1.54997041572331, -5.94773087812292,
8.30056236715301, 18.1001844129369, 24.1689939024213, 1.10899749796051,
1.53087100550846, -6.04743527148338), X2 = c(52.8099714292224,
38.3531449094573, 46.3760873669732, 51.7026666617339, 48.5273685430924,
55.6277967599455, 51.2257527215893, 45.8741668783965, 46.805479767603,
38.5446380799866, 33.9186743463602, 52.7066337605415, 55.2102957192513,
69.4652121754147, 59.5307056986744, 57.3795425387994, 54.9687789881024,
52.8506678644467, 50.5691711634846, 55.8544208074441), clust = c(1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 2L, 2L, 1L, 1L, 4L, 3L, 3L, 3L,
1L, 1L, 1L)), row.names = c(NA, -20L), class = "data.frame")
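One detail worth sketching: when rows are dropped with na.omit() before clustering, the clustering vector is shorter than the original table, so the labels need to be written back by position. The example below is only an illustration under assumptions; the survey data frame and its columns item1, item2 and age are invented stand-ins for the real data.

```r
library(cluster)  # clara()

set.seed(1)
# Invented stand-in for the survey: two rating items plus an age column
survey <- data.frame(
  item1 = c(rnorm(20, mean = 2), rnorm(20, mean = 8)),
  item2 = c(rnorm(20, mean = 2), rnorm(20, mean = 8)),
  age   = sample(18:80, 40, replace = TRUE)
)
survey$item1[c(5, 30)] <- NA  # two respondents skipped an item

items <- c("item1", "item2")
keep <- complete.cases(survey[, items])  # the rows na.omit() would retain
fit <- clara(survey[keep, items], 2, metric = "manhattan")

# Write the labels back into the full table; dropped rows stay NA
survey$cluster <- NA_integer_
survey$cluster[keep] <- fit$clustering

# Members of cluster 1 with all their original columns, then their mean age
cluster1 <- survey[which(survey$cluster == 1), ]
mean(cluster1$age)
```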

Conditional updating coordinate column in dataframe

I am attempting to populate two new, empty columns in a data frame with data from other columns in the same data frame, in different ways depending on whether those columns are populated.
I am trying to populate the values of HIGH_PRCN_LAT and HIGH_PRCN_LON (previously called F_Lat and F_Lon), which represent the final latitudes and longitudes for those rows. The values will be based on the other columns in the table.
Case 1: Lat/Lon2 are populated (like in IDs 1 & 2), using the great
circle algorithm a midpoint between them should be calculated and
then placed into F_Lat & F_Lon.
Case 2: Lat/Lon2 are empty, then the values of Lat/Lon1 should be put
into F_Lat and F_Lon (like with IDs 3 & 4).
My code is as follows but doesn't work (see previous versions, removed in an edit).
The preparatory code I am using is as follows:
incidents <- structure(list(id = 1:9, StartDate = structure(c(1L, 3L, 2L,
2L, 2L, 3L, 1L, 3L, 1L), .Label = c("02/02/2000 00:34", "02/09/2000 22:13",
"20/01/2000 14:11"), class = "factor"), EndDate = structure(1:9, .Label = c("02/04/2006 20:46",
"02/04/2006 22:38", "02/04/2006 23:21", "02/04/2006 23:59", "03/04/2006 20:12",
"03/04/2006 23:56", "04/04/2006 00:31", "07/04/2006 06:19", "07/04/2006 07:45"
), class = "factor"), Yr.Period = structure(c(1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("2000 / 1", "2000 / 2", "2000 /3"
), class = "factor"), Description = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "ENGLISH TEXT", class = "factor"),
Location = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L
), .Label = c("Location 1", "Location 1 : Location 2"), class = "factor"),
Location.1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "Location 1", class = "factor"), Postcode.1 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Postcode 1", class = "factor"),
Location.2 = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
1L), .Label = c("", "Location 2"), class = "factor"), Postcode.2 = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("", "Postcode 2"
), class = "factor"), Section = structure(c(2L, 2L, 3L, 1L,
4L, 4L, 2L, 1L, 4L), .Label = c("East", "North", "South",
"West"), class = "factor"), Weather.Category = structure(c(1L,
2L, 4L, 2L, 2L, 2L, 4L, 1L, 3L), .Label = c("Animals", "Food",
"Humans", "Weather"), class = "factor"), Minutes = c(13L,
55L, 5L, 5L, 5L, 522L, 1L, 11L, 22L), Cost = c(150L, 150L,
150L, 20L, 23L, 32L, 21L, 11L, 23L), Location.1.Lat = c(53.0506727,
53.8721035, 51.0233529, 53.8721035, 53.6988355, 53.4768766,
52.6874562, 51.6638245, 51.4301359), Location.1.Lon = c(-2.9991256,
-2.4004125, -3.0988341, -2.4004125, -1.3031529, -2.2298073,
-1.8023421, -0.3964916, 0.0213837), Location.2.Lat = c(52.7116187,
53.746791, NA, 53.746791, 53.6787167, 53.4527824, 52.5264907,
NA, NA), Location.2.Lon = c(-2.7493169, -2.4777984, NA, -2.4777984,
-1.489026, -2.1247029, -1.4645023, NA, NA)), class = "data.frame", row.names = c(NA, -9L))
#gpsColumns is used as the following line of code is used for several data frames.
gpsColumns <- c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")
incidents [ , gpsColumns] <- NA
#create separate variable(?) containing a list of which rows are complete
ind <- complete.cases(incidents [,17])
#populate rows with a two Lat/Lons with great circle middle of both values
incidents [ind, c("HIGH_PRCN_LON_2","HIGH_PRCN_LAT_2")] <-
with(incidents [ind,,drop=FALSE],
do.call(rbind, geosphere::midPoint(cbind.data.frame(Location.1.Lon, Location.1.Lat), cbind.data.frame(Location.2.Lon, Location.2.Lat))))
#populate rows with one Lat/Lon with those values
incidents[!ind, c("HIGH_PRCN_LAT","HIGH_PRCN_LON")] <- incidents[!ind, c("Location.1.Lat","Location.1.Lon")]
I will use the geosphere::midPoint function based off a recommendation here: http://r.789695.n4.nabble.com/Midpoint-between-coordinates-td2299999.html.
Unfortunately, it doesn't appear that this way of populating the column will work when there are several cases.
The current error that is thrown is:
Error in `$<-.data.frame`(`*tmp*`, F_Lat, value = integer(0)) :
replacement has 0 rows, data has 178012
Edit: also posted to reddit: https://www.reddit.com/r/Rlanguage/comments/bdvavx/conditional_updating_column_in_dataframe/
Edit: Added clarity on the parts of the code I do not understand.
#replaces the F_Lat2/F_Lon2 columns in rows with a both sets of input coordinates
dataframe[ind, c("F_Lat2","F_Lon2")] <-
#I am unclear on what this means, specifically what the "with" function does and what "drop=FALSE" does and also why they were used in this case.
with(dataframe[ind,,drop=FALSE],
#I am unclear on what do.call and rbind are doing here, but the second half (geosphere onwards) is binding the Lats and Lons to make coordinates as inputs for the gcIntermediate function.
do.call(rbind, geosphere::gcIntermediate(cbind.data.frame(Lat1, Lon1),
cbind.data.frame(Lat2, Lon2), n = 1)))
Though your code doesn't work as written for me, and I cannot calculate the same precise values you expect, I suspect the error you're seeing can be fixed with these steps. (Data is down at the bottom here.)
Pre-populate the empty columns.
Pre-calculate the complete.cases step, it'll save time.
Use cbind.data.frame for inside gcIntermediate.
I'm inferring from
gcIntermediate([dataframe...
^
this is an error in R
that you are binding those columns together, so I'll use cbind.data.frame. (Using cbind itself produced some ignorable warnings from geosphere, so you can use it instead and perhaps suppressWarnings, but that function is a little strong in that it'll mask other warnings as well.)
Also, since it appears you want one intermediate value for each pair of coordinates, I added the gcIntermediate(..., n=1) argument.
The use of do.call(rbind, ...) is because gcIntermediate returns a list, so we need to bring them together.
dataframe$F_Lon2 <- dataframe$F_Lat2 <- NA_real_
ind <- complete.cases(dataframe[,4])
dataframe[ind, c("F_Lat2","F_Lon2")] <-
with(dataframe[ind,,drop=FALSE],
do.call(rbind, geosphere::gcIntermediate(cbind.data.frame(Lat1, Lon1),
cbind.data.frame(Lat2, Lon2), n = 1)))
dataframe[!ind, c("F_Lat2","F_Lon2")] <- dataframe[!ind, c("Lat1","Lon1")]
dataframe
# ID Lat1 Lon1 Lat2 Lon2 F_Lat F_Lon F_Lat2 F_Lon2
# 1 1 19.05067 -3.999126 92.71332 -6.759169 55.88200 -5.379147 55.78466 -6.709509
# 2 2 58.87210 -1.400413 54.74679 -4.479840 56.80945 -2.940126 56.81230 -2.942029
# 3 3 33.02335 -5.098834 NA NA 33.02335 -5.098834 33.02335 -5.098834
# 4 4 54.87210 -4.400412 NA NA 54.87210 -4.400412 54.87210 -4.400412
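For the pieces the question flags as unclear (with, drop=FALSE, do.call and rbind), here is a small base R sketch on a toy data frame. The names toy, sums, pieces and stacked are invented for the illustration.

```r
toy <- data.frame(a = c(1, 2, 3), b = c(10, NA, 30))
ind <- !is.na(toy$b)  # TRUE FALSE TRUE

# drop = FALSE keeps a data frame even when a single column is selected;
# without it the result collapses to a plain vector
class(toy[ind, "a"])
class(toy[ind, "a", drop = FALSE])

# with() evaluates an expression using the data frame's columns as variables,
# which saves writing toy$a and toy$b
sums <- with(toy[ind, , drop = FALSE], a + b)

# do.call(rbind, lst) is the same as rbind(lst[[1]], lst[[2]], ...):
# each list element becomes one row of the result
pieces <- list(c(lon = 1, lat = 2), c(lon = 3, lat = 4))
stacked <- do.call(rbind, pieces)
stacked
```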
Update, using your new incidents data and switching to geosphere::midPoint.
Try this:
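incidents$F_Lon2 <- incidents$F_Lat2 <- NA_real_
# rows where the second location has coordinates
ind <- complete.cases(incidents[, c("Location.2.Lat","Location.2.Lon")])
# geosphere expects (longitude, latitude); midPoint returns its columns in that order
incidents[ind, c("F_Lon2","F_Lat2")] <-
with(incidents[ind,,drop=FALSE],
geosphere::midPoint(cbind.data.frame(Location.1.Lon,Location.1.Lat),
cbind.data.frame(Location.2.Lon,Location.2.Lat)))
incidents[!ind, c("F_Lat2","F_Lon2")] <- incidents[!ind, c("Location.1.Lat","Location.1.Lon")]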
One (big) difference is that geosphere::gcIntermediate(..., n=1) returns a list of results, whereas geosphere::midPoint(...) (no n=) returns just a matrix, so no rbinding required.
Data:
dataframe <- read.table(header=T, stringsAsFactors=F, text="
ID Lat1 Lon1 Lat2 Lon2 F_Lat F_Lon
1 19.0506727 -3.9991256 92.713318 -6.759169 55.88199535 -5.3791473
2 58.8721035 -1.4004125 54.746791 -4.47984 56.80944725 -2.94012625
3 33.0233529 -5.0988341 NA NA 33.0233529 -5.0988341
4 54.8721035 -4.4004125 NA NA 54.8721035 -4.4004125")

Converting this ugly for-loop to something more R-friendly

Been using SO as a resource constantly for my work. Thanks for holding together such a great community.
I'm trying to do something kinda complex, and the only way I can think to do it right now is with a pair of nested for-loops (I know that's frowned upon in R)... I have records of three million-odd course enrollments: student UserID's paired with CourseID's. In each row, there's a bunch of data including start/end dates and scores and so forth. What I need to do is, for each enrollment, calculate the average score for that user across the courses she's taken before the course in the enrollment.
The code I'm using for the for-loop follows:
data$Mean.Prior.Score <- 0
for (i in as.numeric(rownames(data)) {
sum <- 0
count <- 0
for (j in as.numeric(rownames(data[data$UserID == data$UserID[i],]))) {
if (data$Course.End.Date[j] < data$Course.Start.Date[i]) {
sum <- sum + data$Score[j]
count <- count + 1
}
}
if (count != 0)
data$Mean.Prior.Score[i] <- sum / count
}
I'm pretty sure this would work, but it runs incredibly slowly... my data frame has over three million rows, but after a good 10 minutes of chugging, the outer loop has only run through 850 of the records. That seems way slower than the time complexity would suggest, especially given that each user has only 5 or 6 courses to her name on average.
Oh, and I should mention that I converted the date strings with as.POSIXct() before running the loop, so the date comparison step shouldn't be too terribly slow...
There's got to be a better way to do this... any suggestions?
Edit: As per mnel's request... finally got dput to play nicely. Had to add control = NULL. Here 'tis:
structure(list(Username = structure(1:20, .Label = c("100225",
"100226", "100228", "1013170", "102876", "105796", "106753",
"106755", "108568", "109038", "110150", "110200", "110350", "111873",
"111935", "113579", "113670", "117562", "117869", "118329"), class = "factor"),
User.ID = c(2313737L, 2314278L, 2314920L, 9708829L, 2325896L,
2315617L, 2314644L, 2314977L, 2330148L, 2315081L, 2314145L,
2316213L, 2317734L, 2314363L, 2361187L, 2315374L, 2314250L,
2361507L, 2325592L, 2360182L), Course.ID = c(2106468L, 2106578L,
2106493L, 5426406L, 2115455L, 2107320L, 2110286L, 2110101L,
2118574L, 2106876L, 2110108L, 2110058L, 2109958L, 2108222L,
2127976L, 2106638L, 2107020L, 2127451L, 2117022L, 2126506L
), Course = structure(c(1L, 7L, 10L, 15L, 11L, 19L, 4L, 6L,
3L, 12L, 2L, 9L, 17L, 8L, 20L, 18L, 13L, 16L, 5L, 14L), .Label = c("ACCT212_A",
"BIOS200_N", "BIS220_T", "BUSN115_A", "BUSN115_T", "CARD205_A",
"CIS211_A", "CIS275_X", "CIS438_S", "ENGL112_A", "ENGL112_B",
"ENGL227_K", "GM400_A", "GM410_A", "HUMN232_M", "HUMN432_W",
"HUMN445_A", "MATH100_X", "MM575_A", "PSYC110_Y"), class = "factor"),
Course.Start.Date = structure(c(1098662400, 1098662400, 1098662400,
1309737600, 1099267200, 1098662400, 1099267200, 1099267200,
1098662400, 1098662400, 1099267200, 1099267200, 1099267200,
1098662400, 1104105600, 1098662400, 1098662400, 1104105600,
1098662400, 1104105600), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
Term.ID = c(12056L, 12056L, 12056L, 66282L, 12057L, 12056L,
12057L, 12057L, 12056L, 12056L, 12057L, 12057L, 12057L, 12056L,
13469L, 12056L, 12056L, 13469L, 12056L, 13469L), Term.Name = structure(c(2L,
2L, 2L, 4L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 3L, 2L,
2L, 3L, 2L, 3L), .Label = c("Fall 2004", "Fall 2004 Session A",
"Fall 2004 Session B", "Summer Session A 2011"), class = "factor"),
Term.Start.Date = structure(c(1L, 1L, 1L, 4L, 2L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("2004-10-21",
"2004-10-28", "2004-12-27", "2011-06-26"), class = "factor"),
Score = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125,
0, 0, 0, 0, 0), First.Course.Date = structure(c(1L, 1L, 1L,
4L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L,
1L, 3L), .Label = c("2004-10-25", "2004-11-01", "2004-12-27",
"2011-07-04"), class = "factor"), First.Term.Date = structure(c(1L,
1L, 1L, 4L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L,
1L, 3L, 1L, 3L), .Label = c("2004-10-21", "2004-10-28", "2004-12-27",
"2011-06-26"), class = "factor"), First.Timer = c(TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), Course.Code = structure(c(1L,
6L, 9L, 13L, 9L, 17L, 4L, 5L, 3L, 10L, 2L, 8L, 15L, 7L, 18L,
16L, 11L, 14L, 4L, 12L), .Label = c("ACCT212", "BIOS200",
"BIS220", "BUSN115", "CARD205", "CIS211", "CIS275", "CIS438",
"ENGL112", "ENGL227", "GM400", "GM410", "HUMN232", "HUMN432",
"HUMN445", "MATH100", "MM575", "PSYC110"), class = "factor"),
Course.End.Date = structure(c(1L, 1L, 1L, 4L, 2L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("2004-12-19",
"2005-02-27", "2005-03-26", "2011-08-28"), class = "factor")), .Names = c("Username",
"User.ID", "Course.ID", "Course", "Course.Start.Date", "Term.ID",
"Term.Name", "Term.Start.Date", "Score", "First.Course.Date",
"First.Term.Date", "First.Timer", "Course.Code", "Course.End.Date"
), row.names = c(NA, 20L), class = "data.frame")
I found that data.table worked well.
# Create some data.
library(data.table)
set.seed(1)
n=3e6
numCourses=5 # Average courses per student
data=data.table(UserID=as.character(round(runif(n,1,round(n/numCourses)))),course=1:n,Score=runif(n),CourseStartDate=as.Date('2000-01-01')+round(runif(n,1,365)))
data$CourseEndDate=data$CourseStartDate+round(runif(n,1,100))
setkey(data,UserID)
# test=function(CourseEndDate,Score,CourseStartDate) sapply(CourseStartDate, function(y) mean(Score[y>CourseEndDate]))
# I vastly reduced the number of comparisons with a better "test" function.
test2=function(CourseEndDate,Score,CourseStartDate) {
o.end = order(CourseEndDate)
run.avg = cumsum(Score[o.end])/seq_along(CourseEndDate)
idx=findInterval(CourseStartDate,CourseEndDate[o.end])
idx=ifelse(idx==0,NA,idx)
run.avg[idx]
}
system.time(data$MeanPriorScore<-data[,test2(CourseEndDate,Score,CourseStartDate),by=UserID]$V1)
# For three million courses, at an average of 5 courses per student:
# user system elapsed
# 122.06 0.22 122.45
Running a test to see if it looks the same as your code:
set.seed(1)
n=1e2
data=data.table(UserID=as.character(round(runif(n,1,1000))),course=1:n,Score=runif(n),CourseStartDate=as.Date('2000-01-01')+round(runif(n,1,365)))
data$CourseEndDate=data$CourseStartDate+round(runif(n,1,100))
setkey(data,UserID)
data$MeanPriorScore<-data[,test2(CourseEndDate,Score,CourseStartDate),by=UserID]$V1
data["246"]
# UserID course Score CourseStartDate CourseEndDate MeanPriorScore
#1: 246 54 0.4531314 2000-08-09 2000-09-20 0.9437248
#2: 246 89 0.9437248 2000-02-19 2000-03-02 NA
# A comparison with your for loop (slightly modified)
data$MeanPriorScore.old<-NA # Set to NA instead of zero for easy comparison.
# I think you forgot a bracket here. Also, There is no need to work with the rownames.
for (i in seq(nrow(data))) {
sum <- 0
count <- 0
# I reduced the complexity of figuring out the vector to loop through.
# It will result in the exact same thing if there are no rownames.
for (j in which(data$UserID == data$UserID[i])) {
if (data$CourseEndDate[j] <= data$CourseStartDate[i]) {
sum <- sum + data$Score[j]
count <- count + 1
}
}
# I had to add "[i]" here. I think that is what you meant.
if (count != 0) data$MeanPriorScore.old[i] <- sum / count
}
identical(data$MeanPriorScore,data$MeanPriorScore.old)
# [1] TRUE
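To see why the cumulative-mean plus findInterval idea gives the prior average, here is a worked example for a single student in base R. The dates and scores are made up, and the rows are deliberately out of date order.

```r
# One student's enrolments (made-up values, not in date order)
end   <- as.Date(c("2000-09-01", "2000-03-01", "2000-06-01"))
start <- as.Date(c("2000-07-01", "2000-01-10", "2000-04-01"))
score <- c(0.6, 0.8, 0.4)

# Running mean of scores, taken in order of course end date
o.end   <- order(end)
run.avg <- cumsum(score[o.end]) / seq_along(end)

# For each start date, count how many courses had ended by then
idx <- findInterval(as.numeric(start), as.numeric(end[o.end]))
idx[idx == 0] <- NA  # no finished course yet, so no prior mean

# The running mean at that count is the mean over all earlier courses
mean_prior <- run.avg[idx]
mean_prior  # 0.6 NA 0.8
```

The course starting 2000-07-01 has two finished courses before it (scores 0.8 and 0.4, mean 0.6), the one starting 2000-04-01 has one (0.8), and the earliest has none.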
This seems to be what you want
library(data.table)
# create a data.table object
DT <- data.table(data)
# key by userID
setkeyv(DT, 'userID')
# for each userID, where the Course.End.Date < Course.Start.Date
# return the mean score
# This is too simplistic
# DT[Course.End.Date < Course.Start.Date,
# list(Mean.Prior.Score = mean(Score)) ,
# by = list(userID)]
As per @joran's comment, this will be more complex than the code above.
This is only an outline of what I think a solution might entail. I'm going to use plyr just to illustrate the steps needed, for simplicity.
Let's just restrict ourselves to the case of one student. If we can calculate this for one student, extending it with some sort of split-apply will be trivial.
So let's suppose we have scores for a particular student, sorted by course end date:
d <- sample(seq(as.Date("2011-01-01"),as.Date("2011-01-31"),by = 1),100,replace = TRUE)
dat <- data.frame(date = sort(d),val = rnorm(100))
First, I think you'd need to summarise this by date and then calculate the cumulative running mean:
library(plyr)
dat_sum <- ddply(dat,.(date),summarise,valsum = sum(val),n = length(val))
dat_sum$mn <- with(dat_sum,cumsum(valsum) / cumsum(n))
Finally, you'd merge these values back into the original data with the duplicate dates:
dat_merge <- merge(dat,dat_sum[,c("date","mn")])
I could probably write something that does this in data.table using an anonymous function to do all those steps, but I suspect others may be better able to do something that will be concise and fast. (In particular, I don't recommend actually tackling this with plyr, as I suspect it will still be extremely slow.)
I think something like this should work though it'd be helpful to have test data with multiple courses per user. Also might need +1 on the start dates in findInterval to make condition be End.Date < Start.Date instead of <=.
# in the test data, one is POSIXct and the other a factor
data$Course.Start.Date = as.Date(data$Course.Start.Date)
data$Course.End.Date = as.Date(data$Course.End.Date)
data = data[order(data$Course.End.Date), ]
data$Mean.Prior.Score = ave(seq_along(data$User.ID), data$User.ID, FUN=function(i)
c(NA, cumsum(data$Score[i]) / seq_along(i))[1L + findInterval(data$Course.Start.Date[i], data$Course.End.Date[i])])
With three million rows, maybe a database is helpful. Here an sqlite example which I believe creates something similar to your for loop:
# data.frame for testing
user <- sample.int(10000, 100)
course <- sample.int(10000, 100)
c_start <- sample(
seq(as.Date("2004-01-01"), by="3 months", length.out=12),
100, replace=TRUE
)
c_end <- c_start + as.difftime(11, units="weeks")
c_idx <- sample.int(100, 1000, replace=TRUE)
enroll <- data.frame(
user=sample(user, 1000, replace=TRUE),
course=course[c_idx],
c_start=as.character(c_start[c_idx]),
c_end=as.character(c_end[c_idx]),
score=runif(1000),
stringsAsFactors=FALSE
)
#variant 1: for-loop
system.time({
enroll$avg.p.score <- NA
for (i in 1:nrow(enroll)) {
sum <- 0
count <- 0
for (j in which(enroll$user==enroll$user[[i]]))
if (enroll$c_end[[j]] < enroll$c_start[[i]]) {
sum <- sum + enroll$score[[j]]
count <- count + 1
}
if(count !=0) enroll$avg.p.score[[i]] <- sum / count
}
})
#variant 2: sqlite
system.time({
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "enroll", enroll, overwrite=TRUE)
sql <- paste("Select e.user, e.course, Avg(p.score)",
"from enroll as e",
"cross join enroll as p",
"where e.user=p.user and p.c_end < e.c_start",
"group by e.user, e.course;")
res <- dbSendQuery(con, sql)
dat <- fetch(res, n=-1)
})
On my machine, sqlite is ten times faster. If that is not enough, it would be possible to index the database.
I can't really test this, as your data doesn't appear to satisfy the inequality in any combination, but I'd try something like this:
library(plyr)
res <- ddply(data, .(User.ID), function(d) {
with(subset(merge(d, d, by=NULL, suffixes=c(".i", ".j")),
Course.End.Date.j < Course.Start.Date.i),
c(Mean.Prior.Score = mean(Score.j)))
})
res$Mean.Prior.Score[is.nan(res$Mean.Prior.Score)] = 0
Here is how it works:
ddply: Group data by User.ID and execute function for each subset d of rows for one User.ID
merge: Create two copies of the data for one user, one with columns suffixed .i the other .j
subset: From this outer join, only select those matching the given inequality
mean: Compute the mean for the matched rows
c(…): Give a name to the resulting column
res: Will be a data.frame with columns User.ID and Mean.Prior.Score
is.nan: mean will return NaN for zero-length vectors, change these to zeros
I guess this might be reasonably fast if there are not too many rows for each User.ID. If this isn't fast enough, the data.tables mentioned in other answers might help.
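The merge/subset steps described above can be tried out in base R on a single user's rows. This is only a sketch with made-up data; unlike the ddply call, it groups by the enrolment (Course.i), so it yields one mean per course instead of one per user, which is an assumption about the desired output.

```r
# One user's enrolments (made-up values)
d <- data.frame(
  Course = c("A", "B", "C"),
  Start  = as.Date(c("2000-01-10", "2000-04-01", "2000-07-01")),
  End    = as.Date(c("2000-03-01", "2000-06-01", "2000-09-01")),
  Score  = c(0.8, 0.4, 0.6)
)

# by = NULL gives the Cartesian product: every row i paired with every row j
pairs <- merge(d, d, by = NULL, suffixes = c(".i", ".j"))

# Keep only pairs where course j ended before course i started
prior <- subset(pairs, End.j < Start.i)

# Mean score of the earlier courses, per enrolment i
res <- aggregate(Score.j ~ Course.i, data = prior, FUN = mean)
res
```

Course A has no earlier course and therefore does not appear in res; a left merge back onto d would restore it with NA.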
Your code is a bit fuzzy on the desired output: you treat data$Mean.Prior.Score like a length-one variable, but assign to it in every iteration of the loop. I assume that this assignment is meant only for one row. Do you need means for every row of the data frame, or only one mean per user?
