This website has helped me with so much over the years, but I can't seem to figure this part out. I am working on modeling terrorist attacks in Afghanistan and want to create a new variable to reflect the clustering of attacks. For each attack I want to calculate the number of attacks that fall within two range criteria: distance and time.
head(timedist_terr_talib, 15)
eventid lat lon event1 Cluster_Num
1 20110104 32.07333 64.83389 2011-01-04 NA
2 20110107 31.00806 66.39806 2011-01-07 NA
3 20110112 34.53306 69.16611 2011-01-12 NA
4 20110112 34.87417 71.15278 2011-01-12 NA
5 20110114 31.65003 65.65002 2011-01-14 1
6 20110115 33.42977 66.21314 2011-01-15 0
7 20110116 35.95000 68.70000 2011-01-16 0
8 20110119 32.68556 68.23778 2011-01-19 0
9 20110119 34.08056 68.51917 2011-01-19 1
10 20110123 34.89000 71.18000 2011-01-23
11 20110128 34.53306 69.16611 2011-01-28
12 20110129 31.61767 65.67594 2011-01-29
13 20110131 35.03924 69.00633 2011-01-31
14 20110201 31.61767 65.67594 2011-02-01
15 20110207 31.48623 64.32139 2011-02-07
I want to create a new column whose values are the number of attacks that happened within the last 14 days and 100 km of that attack.
event1 <- strptime(timedist_terr_talib$eventid,
                   format = "%Y%m%d", tz = "UTC")
I found code that makes a matrix with the distance between each point:
http://eurekastatistics.com/calculating-a-distance-matrix-for-geographic-points-using-r/
#find dist in meters / 1000 to get km
#dis_talib_mat<-round(GeoDistanceInMetresMatrix(timedist_terr_talib) / 1000)
dis_talib_mat1 <- (GeoDistanceInMetresMatrix(timedist_terr_talib) / 1000)
And I have a matrix that calculates the time distance between every pair:
timediff_talib1 <- t(outer(timedist_terr_talib$event1,
                           timedist_terr_talib$event1, difftime))
timediff_talib1 <- timediff_talib1 / (60*60*24)
So, for example, attacks 1:4 are NA because the data do not yet cover a complete 14 days. When I look at attack 5, I look back at attacks 1:4 because they happened within the previous 14 days; the distance matrix shows that 1 of those attacks was within 100 km, and counting manually confirms there is 1 attack under 100 km away.
My current data set is 2813 attacks, so everything runs slowly, but if I could get the code working for these 15 rows and then apply it to my full set, I would be so happy!
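For what it's worth, here is a minimal sketch (my own, not code from the question) of how the two criteria could be combined once the distance matrix exists. It assumes dis_talib_mat1 is in kilometres and row-aligned with the data frame, as above, and it recomputes the day differences directly from eventid so the sign convention is explicit; the NA handling for the first few attacks would still need to be added to match the example.
# Minimal sketch: for each attack, count the earlier attacks that occurred
# within the previous 14 days AND within 100 km.
ev_days <- as.numeric(as.Date(as.character(timedist_terr_talib$eventid),
                              format = "%Y%m%d"))   # days since 1970-01-01
days_apart <- outer(ev_days, ev_days, "-")          # [i, j] = day of i minus day of j
timedist_terr_talib$Cluster_Num <- sapply(seq_along(ev_days), function(i) {
  in_window <- days_apart[i, ] > 0 & days_apart[i, ] <= 14   # j is 1-14 days earlier
  sum(in_window & dis_talib_mat1[i, ] <= 100)                # ...and within 100 km
})
For 2813 attacks the two matrices are only about 2813 x 2813, so this part should still run in seconds once the distance matrix itself has been computed.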
Related
I want to iterate through a sequence of years and capture each output in one large dataframe.
The query only allows one year of data to be requested at a time, so I thought I could run a loop like the one below and capture the results into an empty dataframe. This seems to work, but I was wondering whether there is a more concise way of achieving this.
In case anyone is interested, API info and signup: https://aqs.epa.gov/aqsweb/documents/data_api.html#bdate
library("jsonlite")
library(lubridate)
base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
years <- as.character(2011:2019)
dat = {}
for (year in years) {
  my_raw_result <- httr::GET(base_url_site,
                             query = list(email = "example@email.com",
                                          key = Sys.getenv("AQS_KEY"),
                                          param = "44201",
                                          bdate = paste(year, "0101", sep = ""),
                                          edate = paste(year, "1231", sep = ""),
                                          state = "48",
                                          county = "141",
                                          site = "0055"))
  my_content <- httr::content(my_raw_result, as = 'text')
  my_content_from_json <- fromJSON(my_content)
  df <- my_content_from_json$Data
  dat <- rbind(dat, df)
}
A slightly more efficient solution can be obtained by calling rbind() only once, rather than iteratively inside the loop. We can do this with a combination of base R and lapply(). The key change needed to make the code work was converting the list output from fromJSON() into a data frame, a step that did not work correctly in the code posted with the original question.
# set private key
Sys.setenv(AQS_KEY = "yourKeyGoesHere")
base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
library(RJSONIO)
library(tidyr)
years <- as.character(2011:2019)
system.time(dfList <- lapply(years, function(year) {
  my_raw_result <- httr::GET(base_url_site,
                             query = list(email = "example@gmail.com",
                                          key = Sys.getenv("AQS_KEY"),
                                          param = "44201",
                                          bdate = paste(year, "0101", sep = ""),
                                          edate = paste(year, "1231", sep = ""),
                                          state = "48",
                                          county = "141",
                                          site = "0055"))
  my_content <- httr::content(my_raw_result, as = 'text')
  my_content_from_json <- fromJSON(my_content)
  df <- data.frame(t(sapply(my_content_from_json$Data, c)))
  df$uncertainty <- " "
  tidyr::unnest(df, cols = colnames(df))   # unnest & return to parent
}))
system.time(combinedData <- do.call(rbind, dfList))
The code to extract years 2011-2019 from the EPA database runs in about 46.8 seconds of user time, including the initial extracts, the unnesting of each resulting data structure, and the one-time combination of data frames at the end.
user system elapsed
46.670 0.756 71.432
> system.time(combinedData <- data.frame(do.call(rbind,dfList)))
user system elapsed
0.096 0.027 0.123
The large difference between user time and elapsed time is likely due to wait times to receive data from the API.
A key feature of this solution is the technique used to convert the list of lists into data frame rows (h/t Alex Brown's answer to Convert a List to a Data Frame), combined with unnesting the resulting data structure via tidyr::unnest(). We also had to set the uncertainty column to blank, because unnest() fails on the NULL values extracted from the EPA API.
df <- data.frame(t(sapply(my_content_from_json$Data,c)))
df$uncertainty <- " "
tidyr::unnest(df,cols = colnames(df)) # unnest & return to parent
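To make that conversion step concrete, here is a tiny, self-contained illustration (my own, with made-up data x) of what t(sapply(x, c)) does to a list of lists and why unnest() is still needed afterwards.
x <- list(list(a = 1, b = "p"), list(a = 2, b = "q"))   # stand-in for my_content_from_json$Data
demo <- data.frame(t(sapply(x, c)))   # one row per record...
str(demo)                             # ...but both columns are still lists, hence tidyr::unnest()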
Output from the combined data frame looks like this.
> head(combinedData)
state_code county_code site_number parameter_code poc latitude longitude datum
1 48 141 0055 44201 1 31.74677 -106.4028 WGS84
2 48 141 0055 44201 1 31.74677 -106.4028 WGS84
3 48 141 0055 44201 1 31.74677 -106.4028 WGS84
4 48 141 0055 44201 1 31.74677 -106.4028 WGS84
5 48 141 0055 44201 1 31.74677 -106.4028 WGS84
6 48 141 0055 44201 1 31.74677 -106.4028 WGS84
parameter date_local time_local date_gmt time_gmt sample_measurement
1 Ozone 2011-12-31 23:00 2012-01-01 06:00 0.023
2 Ozone 2011-12-31 22:00 2012-01-01 05:00 NA
3 Ozone 2011-12-31 21:00 2012-01-01 04:00 NA
4 Ozone 2011-12-31 20:00 2012-01-01 03:00 0.018
5 Ozone 2011-12-31 19:00 2012-01-01 02:00 0.006
6 Ozone 2011-12-31 18:00 2012-01-01 01:00 0.002
units_of_measure units_of_measure_code sample_duration sample_duration_code
1 Parts per million 007 1 HOUR 1
2 Parts per million 007 1 HOUR 1
3 Parts per million 007 1 HOUR 1
4 Parts per million 007 1 HOUR 1
5 Parts per million 007 1 HOUR 1
6 Parts per million 007 1 HOUR 1
sample_frequency detection_limit uncertainty
1 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
2 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
3 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
4 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
5 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
6 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
qualifier method_type method
1 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
2 BF - Precision/Zero/Span. FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
3 BF - Precision/Zero/Span. FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
4 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
5 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
6 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
method_code state county date_of_last_change cbsa_code
1 087 Texas El Paso 2012-01-23 21340
2 087 Texas El Paso 2012-01-23 21340
3 087 Texas El Paso 2012-01-23 21340
4 087 Texas El Paso 2012-01-23 21340
5 087 Texas El Paso 2012-01-23 21340
6 087 Texas El Paso 2012-01-23 21340
The original code, updated to produce a data frame without nested lists, runs in about 43.6 seconds, roughly 3 seconds faster than the lapply() version, which is a bit surprising.
base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
years <- as.character(2011:2019)
dat = {}
system.time(for (year in years) {
  my_raw_result <- httr::GET(base_url_site,
                             query = list(email = "example@gmail.com",
                                          key = Sys.getenv("AQS_KEY"),
                                          param = "44201",
                                          bdate = paste(year, "0101", sep = ""),
                                          edate = paste(year, "1231", sep = ""),
                                          state = "48",
                                          county = "141",
                                          site = "0055"))
  my_content <- httr::content(my_raw_result, as = 'text')
  my_content_from_json <- fromJSON(my_content)
  dataList <- my_content_from_json$Data
  df <- data.frame(t(sapply(dataList, c)))
  df <- df[, colnames(df) != "uncertainty"]   # drop the problematic uncertainty column
  unnestedDf <- tidyr::unnest(df, cols = colnames(df))
  dat <- rbind(dat, unnestedDf)
})
...and the runtime stats, which show the same pattern of elapsed time relative to user time:
user system elapsed
43.586 0.686 66.604
Data looks like this:
ID Lat Long Time
1 3 3 00:01
1 3 4 00:02
1 4 4 00:03
2 4 3 00:01
2 4 4 00:02
2 4 5 00:03
3 5 2 00:01
3 5 3 00:02
3 5 4 00:03
4 9 9 00:01
4 9 8 00:02
4 8 8 00:03
5 7 8 00:01
5 8 8 00:02
5 8 9 00:03
I want to measure how far apart the IDs are from each other within a given radius at each time interval. I am doing this for 1057 IDs across 16213 time intervals, so efficiency is important.
It is important to measure distances only within a radius because if the points are too far apart I don't care about them; I am only interested in distances between points that are relatively close. For example, I don't care how far ID 1 is from ID 5, but I do care how far ID 4 is from ID 5.
I am using R and the sp package.
From what I can see, there will be many repeated coordinate values. Therefore, as a starting point, I would suggest calculating the distance for each pair of coordinates only once (even if they are repeated many times in the df). Then you can filter the data and merge the tables. (I would add this as a comment, but I don't have enough reputation to do so yet.)
The first lines would be:
# Creating a DF with no repeated coordinates
library(dplyr)
library(geosphere)   # distm() comes from geosphere
df2 <- df %>% group_by(Lat, Long) %>% summarise()
# Calculating distances (in metres) between the unique coordinates
Dist <- distm(cbind(df2$Long, df2$Lat))
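Continuing that idea with a rough sketch of my own (not part of the original answer): map each row of df back to its unique coordinate, then, within each time interval, look up the pre-computed distances and keep only the pairs inside a chosen radius. The radius value and the pairs_within name below are made up for illustration, and the code assumes the df2/Dist objects created just above.
# Index of each row's coordinates within df2 (and therefore within Dist)
df$coord_idx <- match(paste(df$Lat, df$Long), paste(df2$Lat, df2$Long))

radius <- 5000   # example cutoff in metres; substitute whatever radius you need
pairs_within <- do.call(rbind, lapply(split(df, df$Time), function(d) {
  if (nrow(d) < 2) return(NULL)
  cmb <- t(combn(nrow(d), 2))                 # all ID pairs present at this time
  out <- data.frame(Time = d$Time[1],
                    ID1  = d$ID[cmb[, 1]],
                    ID2  = d$ID[cmb[, 2]],
                    dist = Dist[cbind(d$coord_idx[cmb[, 1]],
                                      d$coord_idx[cmb[, 2]])])
  out[out$dist <= radius, ]                   # drop pairs outside the radius
}))
With 1057 IDs per interval that is still roughly half a million pairs per time step, so if this proves too slow you may want to pre-filter candidate pairs with a coarse latitude/longitude bounding box before looking up exact distances.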
I have longitudinal, geocoded address data and the length of time at each geocode. I then have a series of variables (I'm just calling them x here) that give characteristics of each geoid location. Below are just two cases, but I have thousands.
id <- c(1, 1, 1, 7, 7, 7)
geoid <- c(53, 45, 45, 16, 18, 42)
start <- c("1/1/2004", "10/31/2004", "1/1/2005", "1/1/2005", "6/1/2007", "7/2/2007")
end <- c("10/30/2004", "12/31/2004", "12/31/2007", "5/31/2007", "7/1/2007", "12/31/2007")
x <- c(.5, .7, .7, .3, .4, .6)
dat <- data.frame(id, geoid, x, start, end)
dat$start <- as.Date(dat$start, format = '%m/%d/%Y')
dat$end <- as.Date(dat$end, format = '%m/%d/%Y')
dat
id geoid x start end
1 53 0.5 2004-01-01 2004-10-30
1 45 0.7 2004-10-31 2004-12-31
1 45 0.7 2005-01-01 2007-12-31
7 16 0.3 2005-01-01 2007-05-31
7 18 0.4 2007-06-01 2007-08-01
7 42 0.6 2007-08-02 2007-12-31
I need to end up with a single value for each year (2004, 2005, 2006, 2007) and for each case (1, 7), weighted by the length of time spent at each address. Case 1 moves from geoid 53 to 45 in 2004, and case 7 moves from geoid 16 to 18 to 42 in 2007. So I calculate the percent of the year spent at each geoid (and eventually I will multiply that by x and take the mean for each year to get a weighted average). Cases staying put for a whole year get a weight of 1.
#calculate the percentage of year at each address for id 1
(as.Date("10/31/2004",format='%m/%d/%Y')-as.Date("1/1/2004",format='%m/%d/%Y'))/365.25
Time difference of 0.8323066
(as.Date("12/31/2004",format='%m/%d/%Y')-as.Date("10/31/2004",format='%m/%d/%Y'))/365.25
Time difference of 0.1670089
#calculate the percentage of year at each address for id 7
(as.Date("05/31/2007",format='%m/%d/%Y')-as.Date("1/1/2007",format='%m/%d/%Y'))/365.25
Time difference of 0.4106776
(as.Date("07/01/2007",format='%m/%d/%Y')-as.Date("06/01/2007",format='%m/%d/%Y'))/365.25
Time difference of 0.08213552
(as.Date("12/31/2007",format='%m/%d/%Y')-as.Date("07/02/2007",format='%m/%d/%Y'))/365.25
Time difference of 0.4982888
I can do this by brute force by looking at each year individually and calculating the percent of the year spent at each address. Then I would multiply each weight by the x values and take the mean for that year, but that approach is not feasible with thousands of cases. Any ideas on how to do this more efficiently would be much appreciated. It seems like it might be doable with dplyr slice, but I'm stalled out at the moment. The key is separating out each year.
As eipi10 mentioned, some of your data spans more than a year. It also looks inconsistent with the data you used in your time difference calculations, which are all within the same year.
Assuming that your start and end dates would actually be in the same year, you can do something like the following:
library(dplyr)
library(lubridate)   # for year()

foo <- dat %>%
  mutate(start_year = year(start),
         end_year = year(end),
         same_year = (start_year == end_year),
         year_frac = as.numeric(end - start)/365.25,
         wtd_x = year_frac * x)
This gives you:
id geoid x start end start_year end_year same_year year_frac wtd_x
1 1 53 0.5 2004-01-01 2004-10-31 2004 2004 TRUE 0.83230664 0.41615332
2 1 45 0.7 2004-10-31 2004-12-31 2004 2004 TRUE 0.16700890 0.11690623
3 1 45 0.7 2005-01-01 2007-12-31 2005 2007 FALSE 2.99520876 2.09664613
4 7 16 0.3 2007-01-01 2007-05-31 2007 2007 TRUE 0.41067762 0.12320329
5 7 18 0.4 2007-06-01 2007-07-01 2007 2007 TRUE 0.08213552 0.03285421
6 7 42 0.6 2007-07-02 2007-12-31 2007 2007 TRUE 0.49828884 0.29897331
You can then group and summarise the data using:
bar <- foo %>%
  group_by(start_year, id) %>%
  summarise(sum(wtd_x))
to give you the answer:
start_year id sum(wtd_x)
(dbl) (dbl) (dfft)
1 2004 1 0.5330595 days
2 2005 1 2.0966461 days
3 2007 7 0.4550308 days
Hopefully this will get you started. I wasn't sure how you wanted to deal with cases where the period from start to end spans more than one year or crosses calendar years.
library(dplyr)
dat %>%
  mutate(fractionOfYear = as.numeric(end - start)/365.25)
id geoid x start end fractionOfYear
1 1 53 0.5 2004-01-01 2004-10-30 0.82956879
2 1 45 0.7 2004-10-31 2004-12-31 0.16700890
3 1 45 0.7 2005-01-01 2007-12-31 2.99520876
4 7 16 0.3 2005-01-01 2007-05-31 2.40930869
5 7 18 0.4 2007-06-01 2007-07-01 0.08213552
6 7 42 0.6 2007-07-02 2007-12-31 0.49828884
I was able to find some local help that led us to a simple function. We're still stuck on how to use apply() with dates, but this handles it overall.
#made up sample address data
id<-c(1,1,1,7,7,7)
geoid<-c(53,45,45,16,18,42)
start<-c("1/31/2004","10/31/2004","1/1/2005","1/1/2005","6/1/2007","7/2/2007")
end<-c("10/30/2004","12/31/2004","12/31/2007","5/31/2007","7/1/2007","12/31/2007")
dat <- data.frame(id,geoid,start,end)
#format addresses
dat$start<-as.Date(dat$start,format='%m/%d/%Y')
dat$end<-as.Date(dat$end,format='%m/%d/%Y')
#function to create proportion of time at each address
prop_time <- function(drange, year){
  start <- drange[[1]]; end <- drange[[2]]
  # start year and end year
  syear <- as.numeric(format(start, '%Y'))
  eyear <- as.numeric(format(end, '%Y'))
  # select only those dates that are within the same year
  if (syear <= year & year <= eyear) {
    byear <- as.Date(paste("1/1", sep = "/", year), format = '%m/%d/%Y')
    eyear <- as.Date(paste("12/31", sep = "/", year), format = '%m/%d/%Y')
    astart <- max(byear, start)
    aend <- min(eyear, end)
    prop <- as.numeric((aend - astart)) / as.numeric((eyear - byear))
  } else prop <- 0   # if no proportion within same year calculated then gets 0
  prop
}
#a second function to apply prop_time to multiple cases
prop_apply <- function(dat_times, year){
  out <- NULL
  for (i in 1:dim(dat_times)[1]) {
    out <- rbind(out, prop_time(dat_times[i, ], year))
  }
  out
}
#create new data frame to populate years
dat <- data.frame(dat, y2004=0, y2005=0, y2006=0, y2007=0)
dat_times <- dat[,c("start", "end")]
#run prop_apply in a loop across cases and selected years
for (j in 2004:2007) {
  newdate <- paste("y", j, sep = "")
  dat[, newdate] <- prop_apply(dat_times, j)
}
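On the apply-with-dates point above: one way to sidestep it (an untested sketch of my own, not part of the accepted solution) is to pass the start and end columns to mapply() as separate vectors, so the dates never get coerced into a matrix. prop_time2() below is a hypothetical variant of prop_time() that takes the two dates as separate arguments.
prop_time2 <- function(start, end, year){
  syear <- as.numeric(format(start, '%Y'))
  eyear <- as.numeric(format(end, '%Y'))
  if (syear <= year & year <= eyear) {
    byear <- as.Date(paste("1/1", year, sep = "/"), format = '%m/%d/%Y')
    lyear <- as.Date(paste("12/31", year, sep = "/"), format = '%m/%d/%Y')
    as.numeric(min(lyear, end) - max(byear, start)) / as.numeric(lyear - byear)
  } else 0
}

for (j in 2004:2007) {
  dat[, paste("y", j, sep = "")] <- mapply(prop_time2, dat$start, dat$end, j)
}
The result should match the prop_apply() loop, one new column per year.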
I have a training set here and I need some help with something.
This is the df.
Jobs Agency Location Date RXH HS TMM Payed
14 Netapp Gitex F1 Events House DWTC 2015-10-19 100 8.0 800 TRUE
5 RWC Heineken Lightblue EGC 2015-10-09 90 4.0 360 FALSE
45 Rugby 7s CEO Seven Stadium 2015-12-04 100 10.0 1000 FALSE
29 Playstation Lightblue Mirdiff CC 2015-11-11 90 7.0 630 FALSE
24 RWC Heineken Lightblue EGC 2015-10-31 90 4.5 405 FALSE
33 Playstation Lightblue Mirdiff CC 2015-11-15 90 10.0 900 FALSE
46 Rugby 7s CEO Seven Stadium 2015-12-05 100 10.0 1000 FALSE
44 Rugby 7s CEO Seven Stadium 2015-12-03 100 10.0 1000 FALSE
For example, if the total number of rows were 10 and I had worked for the "CEO" agency 3 times, I would want CEO to show a value of 30% for that month, if that makes sense. In other words, based on the number of observations, I want to know what percentage of the time I've worked for each agency.
That's just a demo df to show what I'm talking about.
Thanks
If I understand correctly, you want to summarize by Agency and by month. Here's how to do it with dplyr:
library(dplyr)
table1 %>%
  mutate(Month = format(Date, "%m-%Y")) %>%
  group_by(Month, Agency) %>%
  summarise(Total = n()) %>%
  mutate(Pct = round(Total/sum(Total)*100))
Source: local data frame [4 x 4]
Groups: Month [3]
Month Agency Total Pct
(chr) (chr) (int) (dbl)
1 10-2015 Events House 1 33
2 10-2015 Lightblue 2 67
3 11-2015 Lightblue 2 100
4 12-2015 CEO 3 100
This is just a simple approach, and I suspect you might be looking for more. However, here's some code that would give you the answer to your sample question:
length(df$Agency[df$Agency == "CEO"]) / length(df$Agency)
The first length() function calculates how many cells in df$Agency are marked "CEO," then the second one calculates the total number of cells in that column. Dividing one by the other will give you the answer.
This will get more complicated if you want to automatically do this for each of the agencies in the column, but there are the basics.
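If you do want that percentage for every agency at once, a short base-R extension of the same idea (my addition) is to let table() do the counting:
# Share of all rows per agency
round(prop.table(table(df$Agency)) * 100)

# Or broken down by month (assuming Date is stored as a Date), matching the
# dplyr result shown above
round(prop.table(table(format(df$Date, "%m-%Y"), df$Agency), margin = 1) * 100)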
I have a data set with multiple sites that were each sampled over multiple years. As part of this I have climate data that were sampled throughout each year as well as calculated means for several variables (mean annual temp, mean annual precipitation, mean annual snow depth, etc). Here is what the data frame actually looks like:
site date year temp precip mean.ann.temp mn.ann.precip
a 5/1/10 2010 15 0 6 .03
a 6/2/10 2010 18 1 6 .03
a 7/3/10 2010 22 0 6 .03
b 5/2/10 2010 16 2 7 .04
b 6/3/10 2010 17 3 7 .04
b 7/4/10 2010 20 0 7 .04
c 5/3/10 2010 14 0 5 .06
c 6/4/10 2010 13 0 5 .06
c 7/8/10 2010 25 0 5 .06
d 5/5/10 2010 16 15 10 .2
d 6/6/10 2010 22 0 10 .2
d 7/7/10 2010 24 0 10 .2
...
It then goes on the same way for multiple years.
How can I extract the mean.ann.temp and mn.ann.precip for each site and year? I've tried doubling up tapply() and using double for loops, but with no success; I can't seem to figure it out. Can someone help me, or do I have to do it the long and tedious way of subsetting everything out?
Thanks,
Paul
Subset the columns and wrap the result in unique():
unique(d[,c("site","year","mean.ann.temp","mn.ann.precip")])
A similar approach, if the last two columns might differ and you want the first row for each site/year combination:
d[!duplicated(d[,c("site","year")]),]
To compute summaries using plyr
require(plyr)
ddply(yourDF, .(site, year), summarize,
      meanTemp = mean(mean.ann.temp),
      meanPrec = mean(mn.ann.precip)
)
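For completeness, an equivalent with dplyr (my addition, in case you prefer it over plyr) would be:
library(dplyr)
yourDF %>%
  group_by(site, year) %>%
  summarise(meanTemp = mean(mean.ann.temp),
            meanPrec = mean(mn.ann.precip))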