Is this the most concise way to iterate and capture output from an API in R?

I want to iterate through a sequence of years and capture each output in one large dataframe.
The query only allows one year of data to be requested at a time, so I thought I could run a loop like the one below and capture the results into an empty data frame. This seems to work, but I was wondering if there is a more concise way of achieving this.
API info and signup, in case anyone is interested: https://aqs.epa.gov/aqsweb/documents/data_api.html#bdate
library("jsonlite")
library(lubridate)
base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
years <- as.character(2011:2019)
dat = {}
for (year in years) {
my_raw_result <- httr::GET(base_url_site,
query = list(email="example#email.com",key=Sys.getenv("AQS_KEY"),
param = "44201",
bdate=paste(year,"0101",sep = ""),
edate=paste(year,"1231",sep = ""),state="48",
county="141", site="0055"))
my_content <- httr::content(my_raw_result, as = 'text')
my_content_from_json <- fromJSON(my_content)
df <- my_content_from_json$Data
dat = rbind(dat,df)
}

A slightly more efficient solution may be obtained by using rbind() only once, rather than iteratively in the loop. We can do this with a combination of Base R and lapply(). The key change needed to make the code work was converting the list output from the fromJSON() function into a data frame, which did not work correctly in the code posted with the original question.
# set private key
Sys.setenv(AQS_KEY = "yourKeyGoesHere")

base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
library(RJSONIO)
library(tidyr)
years <- as.character(2011:2019)
system.time(dfList <- lapply(years, function(year) {
  my_raw_result <- httr::GET(base_url_site,
                             query = list(email = "example#gmail.com",
                                          key = Sys.getenv("AQS_KEY"),
                                          param = "44201",
                                          bdate = paste(year, "0101", sep = ""),
                                          edate = paste(year, "1231", sep = ""),
                                          state = "48",
                                          county = "141", site = "0055"))
  my_content <- httr::content(my_raw_result, as = 'text')
  my_content_from_json <- fromJSON(my_content)
  df <- data.frame(t(sapply(my_content_from_json$Data, c)))
  df$uncertainty <- " "
  tidyr::unnest(df, cols = colnames(df)) # unnest & return to parent
}))

system.time(combinedData <- do.call(rbind, dfList))
The code to extract years 2011-2019 from the EPA database runs in about 46.8 seconds of user time, including the initial extracts, the unnesting of each resulting data structure, and the one-time combination of data frames at the end.
user system elapsed
46.670 0.756 71.432
> system.time(combinedData <- data.frame(do.call(rbind,dfList)))
user system elapsed
0.096 0.027 0.123
The large difference between user time and elapsed time is likely due to wait times to receive data from the API.
A key feature of this solution is the technique used to convert the list of lists into data frame rows (h/t Alex Brown's answer to Convert a List to a Data Frame), together with the unnesting of the resulting data structure via tidyr::unnest(). We also had to set the uncertainty column to blank, because unnest() fails with the NULL values extracted from the EPA API.
df <- data.frame(t(sapply(my_content_from_json$Data,c)))
df$uncertainty <- " "
tidyr::unnest(df,cols = colnames(df)) # unnest & return to parent
Output from the combined data frame looks like this.
> head(combinedData)
state_code county_code site_number parameter_code poc latitude longitude datum
1 48 141 0055 44201 1 31.74677 -106.4028 WGS84
2 48 141 0055 44201 1 31.74677 -106.4028 WGS84
3 48 141 0055 44201 1 31.74677 -106.4028 WGS84
4 48 141 0055 44201 1 31.74677 -106.4028 WGS84
5 48 141 0055 44201 1 31.74677 -106.4028 WGS84
6 48 141 0055 44201 1 31.74677 -106.4028 WGS84
parameter date_local time_local date_gmt time_gmt sample_measurement
1 Ozone 2011-12-31 23:00 2012-01-01 06:00 0.023
2 Ozone 2011-12-31 22:00 2012-01-01 05:00 NA
3 Ozone 2011-12-31 21:00 2012-01-01 04:00 NA
4 Ozone 2011-12-31 20:00 2012-01-01 03:00 0.018
5 Ozone 2011-12-31 19:00 2012-01-01 02:00 0.006
6 Ozone 2011-12-31 18:00 2012-01-01 01:00 0.002
units_of_measure units_of_measure_code sample_duration sample_duration_code
1 Parts per million 007 1 HOUR 1
2 Parts per million 007 1 HOUR 1
3 Parts per million 007 1 HOUR 1
4 Parts per million 007 1 HOUR 1
5 Parts per million 007 1 HOUR 1
6 Parts per million 007 1 HOUR 1
sample_frequency detection_limit uncertainty
1 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
2 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
3 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
4 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
5 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
6 DAILY: 24 - 1 HR SAMPLES -PAMS 0.005
qualifier method_type method
1 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
2 BF - Precision/Zero/Span. FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
3 BF - Precision/Zero/Span. FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
4 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
5 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
6 <NA> FEM INSTRUMENTAL - ULTRA VIOLET ABSORPTION
method_code state county date_of_last_change cbsa_code
1 087 Texas El Paso 2012-01-23 21340
2 087 Texas El Paso 2012-01-23 21340
3 087 Texas El Paso 2012-01-23 21340
4 087 Texas El Paso 2012-01-23 21340
5 087 Texas El Paso 2012-01-23 21340
6 087 Texas El Paso 2012-01-23 21340
The original code, updated to produce a data frame without nested lists, runs in about 43.6 seconds, roughly 3 seconds faster than the lapply() version, which is a bit surprising.
base_url_site <- "https://aqs.epa.gov/data/api/sampleData/bySite"
years <- as.character(2011:2019)
dat = {}
system.time(for (year in years) {
  my_raw_result <- httr::GET(base_url_site,
                             query = list(email = "example#gmail.com",
                                          key = Sys.getenv("AQS_KEY"),
                                          param = "44201",
                                          bdate = paste(year, "0101", sep = ""),
                                          edate = paste(year, "1231", sep = ""),
                                          state = "48",
                                          county = "141", site = "0055"))
  my_content <- httr::content(my_raw_result, as = 'text')
  my_content_from_json <- fromJSON(my_content)
  dataList <- my_content_from_json$Data
  df <- data.frame(t(sapply(dataList, c)))
  df <- df[!(colnames(df) == "uncertainty")]  # drop uncertainty: unnest() fails on its NULL values
  unnestedDf <- tidyr::unnest(df, cols = colnames(df))
  dat <- rbind(dat, unnestedDf)
})
...and the runtime stats, which show the same pattern of elapsed time relative to user time:
user system elapsed
43.586 0.686 66.604

Related

How to calculate mean based on conditions in a for loop in R

I have what I think is a simple question but I can't figure it out! I have a data frame with multiple columns. Here's a general example:
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
test.df
I would like for R to calculate average activity based on the age of the colony in the data frame. Specifically, I want it to only calculate the average activity of the colonies that are the same age or older than the colony in that row, not including the activity of the colony in that row. For example, colony 29683 is 21 years old. I want the average activity of colonies older than 21 for this row of my data. That would include colony 25077 and colony 4865; and the mean would be (45+33)/2 = 39. I want R to do this for each row of the data by identifying the age of the colony in the current row, then identifying the colonies that are older than that colony, and then averaging the activity of those colonies.
I've tried doing this in a for loop in R. Here's the code I used:
test.avg = vector("numeric", nrow(test.df))
for (i in 1:10){
  test.avg[i] <- mean(subset(test.df$activity, test.df$age >= age[i])[-i])
}
R returns a list of values where half of them are correct and the other half are not (I'm not even sure how it calculated those incorrect numbers...). The numbers that are correct are also out of order compared to how they're listed in the data frame. It's clearly able to do the right thing for some iterations of the loop but not all. If anyone could help me out with my code, I would greatly appreciate it!
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
library(tidyverse)
test.df %>%
  mutate(result = map_dbl(age, ~ mean(activity[age > .x])))
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
# base
test.df$result <- with(test.df, sapply(age, FUN = function(x) mean(activity[age > x])))
test.df
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
Created on 2021-03-22 by the reprex package (v1.0.0)
The issue in your solution is that the index i refers to rows of the original data frame, but you subset the data first, so the positions no longer match.
Try something like this: first store the age of the current row, then drop that row and calculate the average activity of the remaining cases with age >= that stored age.
for (i in 1:10){
  test.avg[i] <- {amin = age[i]; mean(subset(test.df[-i,], age >= amin)$activity)}
}
You can use map_df():
library(tidyverse)

test.df %>%
  mutate(map_df(1:nrow(test.df), ~
    test.df %>%
      filter(age >= test.df$age[.x]) %>%
      summarise(av_acti = mean(activity))))

Why does R throw an error on iterative calculation

I'm looking at covid-19 data to calculate estimates for the reproductive number R0.
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(TTR)

# Get COVID cases, available from:
url <- "https://static.usafacts.org/public/data/covid-19/covid_confirmed_usafacts.csv"
DoubleCOV <- read.csv(url, stringsAsFactors = FALSE)
names(DoubleCOV)[1] <- "countyFIPS"

DoubleCovid <- pivot_longer(DoubleCOV, cols = starts_with("X"),
                            values_to = "cases",
                            names_to = c("X", "date_infected"),
                            names_sep = "X") %>%
  mutate(infected = as.Date(date_infected, format = "%m.%d.%y"),
         countyFIPS = str_pad(as.character(countyFIPS), 5, pad = "0"))

# data is by county, summarise for the state of interest
stateData <- DoubleCovid %>% filter(State == "AL") %>% filter(cases != 0) %>%
  group_by(infected) %>% summarise(sum(cases)) %>%
  mutate(DaysSince = infected - min(infected))
names(stateData)[2] <- "cumCases"

# 3 day moving average to smooth a little
stateData <- stateData %>% mutate(MA = runMean(cumCases, 3))

# calculate doubling rate (DR) and then R0 = infectious period / doubling rate
for(j in 4:nrow(stateData)){
  stateData$DR[j] <- log(2)/log(stateData$MA[j]/stateData$MA[j-1])
  stateData$R0[j] <- 14/stateData$DR[j]
}

CDplot <- stateData %>%
  ggplot(mapping = aes(x = as.numeric(DaysSince), y = R0)) +
  geom_line(color = "firebrick")
print(CDplot)
So in the above the state of interest is Alabama, hence filter(State == "AL") and this works.
But if I change the state to "NY" I get
Error in `$<-.data.frame`(`*tmp*`, "DR", value = c(NA, NA, NA, 0.733907206043719 :
replacement has 4 rows, data has 39
head(stateData) yields
infected cumCases DaysSince MA
<date> <int> <drtn> <dbl>
1 2020-03-02 1 0 days NA
2 2020-03-03 2 1 days NA
3 2020-03-04 11 2 days 4.67
4 2020-03-05 23 3 days 12
5 2020-03-06 25 4 days 19.7
6 2020-03-07 77 5 days 41.7
The moving average values in rows 3 and 4 (4.67 and 12) would yield a doubling rate of 0.734, which aligns with the value in the error message, value = c(NA, NA, NA, 0.733907206043719, but why does it throw an error after that?
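For reference, that arithmetic matches the value in the error message once the unrounded moving average (14/3 rather than the printed 4.67) is used for row 3:
log(2) / log(12 / (14/3))
#> [1] 0.7339072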
Bonus question: I know loops are frowned upon in R...is there a way to get the moving average and R0 calculation without one?
You have to initialise the new columns before you can assign to them by the j index. On the first iteration (j = 4), assigning to stateData$DR[4] creates a length-4 vector, c(NA, NA, NA, 0.7339...), which R then has to fit into the data frame as a whole column. Due to recycling, Alabama, which has 28 rows (divisible by 4), does not return an error, only warnings about uninitialised columns. New York, however, has 39 rows, which is not divisible by 4, so recycling fails and R returns an error. You shouldn't ignore warnings; sometimes you can get away with it, but it's not a good idea.
Try this to see what R (you) is trying to do:
stateData[4]
You should get all rows of the 4th column, not the 4th row.
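To see the recycling behaviour in isolation, here is a minimal sketch with plain toy data frames (not the covid data):
# minimal illustration of the recycling described above (toy data, assumed column name DR)
df28 <- data.frame(x = 1:28)
df28$DR[4] <- 0.7   # builds c(NA, NA, NA, 0.7); 28 %% 4 == 0, so the 4 values recycle into 28 rows
df39 <- data.frame(x = 1:39)
df39$DR[4] <- 0.7   # Error: replacement has 4 rows, data has 39 (39 %% 4 != 0, recycling fails)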
Solution: initialise your DR and R0 columns first.
stateData$DR <- NA
stateData$R0 <- NA
for(j in 4:nrow(stateData)){
  stateData$DR[j] <- log(2)/log(stateData$MA[j]/stateData$MA[j-1])
  stateData$R0[j] <- 14/stateData$DR[j]
}
For the bonus question, you can use dplyr::lag() in the same mutate() that computes MA:
stateData <- stateData %>% mutate(MA = runMean(cumCases, 3),
                                  DR = log(2)/log(MA/lag(MA)),
                                  R0 = 14 / DR)
stateData
# A tibble: 28 x 6
infected cumCases DaysSince MA DR R0
<date> <int> <drtn> <dbl> <dbl> <dbl>
1 2020-03-13 5 0 days NA NA NA
2 2020-03-14 11 1 days NA NA NA
3 2020-03-15 22 2 days 12.7 NA NA
4 2020-03-16 29 3 days 20.7 1.42 9.89
5 2020-03-17 39 4 days 30 1.86 7.53
6 2020-03-18 51 5 days 39.7 2.48 5.64
7 2020-03-19 78 6 days 56 2.01 6.96
8 2020-03-20 106 7 days 78.3 2.07 6.78
9 2020-03-21 131 8 days 105 2.37 5.92
10 2020-03-22 167 9 days 135. 2.79 5.03
# ... with 18 more rows
I'm using Alabama's data.

Tabulate number of attacks within time and distance range

This website has helped me with so much over the years, but I can't seem to figure this part out. I am working on modeling terrorist attacks in Afghanistan and want to create a new variable to reflect the clustering of attacks. For each attack I want to calculate the number of attacks that fall into two range criteria, distance and time.
head(timedist_terr_talib, 15)
eventid lat lon event1 Cluster_Num
1 20110104 32.07333 64.83389 2011-01-04 NA
2 20110107 31.00806 66.39806 2011-01-07 NA
3 20110112 34.53306 69.16611 2011-01-12 NA
4 20110112 34.87417 71.15278 2011-01-12 NA
5 20110114 31.65003 65.65002 2011-01-14 1
6 20110115 33.42977 66.21314 2011-01-15 0
7 20110116 35.95000 68.70000 2011-01-16 0
8 20110119 32.68556 68.23778 2011-01-19 0
9 20110119 34.08056 68.51917 2011-01-19 1
10 20110123 34.89000 71.18000 2011-01-23
11 20110128 34.53306 69.16611 2011-01-28
12 20110129 31.61767 65.67594 2011-01-29
13 20110131 35.03924 69.00633 2011-01-31
14 20110201 31.61767 65.67594 2011-02-01
15 20110207 31.48623 64.32139 2011-02-07
I want to create a new column whose values are the number of attacks that happened within the last 14 days and 100 km of that attack.
event1 <- strptime(timedist_terr_talib$eventid,
                   format = "%Y%m%d", tz = "UTC")
I found code that makes a matrix with the distance between each point:
http://eurekastatistics.com/calculating-a-distance-matrix-for-geographic-points-using-r/
#find dist in meters / 1000 to get km
#dis_talib_mat<-round(GeoDistanceInMetresMatrix(timedist_terr_talib) / 1000)
dis_talib_mat1 <- (GeoDistanceInMetresMatrix(timedist_terr_talib) / 1000)
And I have a matrix that calculates the time distance between every pair:
timediff_talib1 <- t(outer(timedist_terr_talib$event1,
                           timedist_terr_talib$event1, difftime))
timediff_talib1 <- timediff_talib1/(60*60*24)
So, for example, attacks 1:4 are NA because the data does not yet span a complete 14 days. When I look at attack 5, I look at attacks 1:4 because they happened within 14 days of it. The distance matrix shows that one of those attacks was within 100 km, so I manually count 1 attack that meets both criteria.
My current data set is 2813 attacks, so this runs slowly, but if I could get the code working for these 15 rows and apply it to my full set, I would be so happy!
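One way the two matrices above could be combined to get that count, sketched under the assumption that dis_talib_mat1 is in km and that timediff_talib1[i, j] holds the number of days attack i occurred after attack j (flip the sign or transpose if the orientation is reversed):
# sketch only: for each attack, count other attacks that happened up to 14 days
# earlier and no more than 100 km away
within_dist <- dis_talib_mat1 <= 100                         # pair is within 100 km
within_time <- timediff_talib1 > 0 & timediff_talib1 <= 14   # other attack was 1-14 days earlier
timedist_terr_talib$Cluster_Num <- rowSums(within_dist & within_time, na.rm = TRUE)
Attacks whose 14-day window reaches back before the start of the data (rows 1:4 in the example) would still need to be set to NA by hand.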

Manipulating Dates with dplyr

I have longitudinal, geocoded address data and the length of time at each geocode. I then have a series of variables (I'm just calling them x here) that give characteristics of each geoid location. Below are just two cases, but I have thousands.
id<-c(1,1,1,7,7,7)
geoid<-c(53,45,45,16,18,42)
start<-c("1/1/2004","10/31/2004","1/1/2005","1/1/2005","6/1/2007","7/2/2007")
end<-c("10/30/2004","12/31/2004","12/31/2007","5/31/2007","7/1/2007","12/31/2007")
x<-c(.5,.7,.7,.3,.4,.6)
dat<-data.frame(id,geoid,x,start,end)
dat$start<-as.Date(dat$start,format='%m/%d/%Y')
dat$end<-as.Date(dat$end,format='%m/%d/%Y')
dat
id geoid x start end
1 53 0.5 2004-01-01 2004-10-30
1 45 0.7 2004-10-31 2004-12-31
1 45 0.7 2005-01-01 2007-12-31
7 16 0.3 2005-01-01 2007-05-31
7 18 0.4 2007-06-01 2007-08-01
7 42 0.6 2007-08-02 2007-12-31
I need to end up with a single value for each year (2004, 2005, 2006, 2007) and for each case (1, 7) that is weighted by the length of time at each address. So case 1 moves from geoid 53 to 45 in 2004 and case 7 moves from geoid 16 to 18 to 42 in 2007. So I calculate the percent of the year at each geoid (and eventually I will multiply that by x and take the mean for each year to get a weighted average). Cases staying put for a whole year will get a weight of 1.
#calculate the percentage of year at each address for id 1
(as.Date("10/31/2004",format='%m/%d/%Y')-as.Date("1/1/2004",format='%m/%d/%Y'))/365.25
Time difference of 0.8323066
(as.Date("12/31/2004",format='%m/%d/%Y')-as.Date("10/31/2004",format='%m/%d/%Y'))/365.25
Time difference of 0.1670089
#calculate the percentage of year at each address for id 7
(as.Date("05/31/2007",format='%m/%d/%Y')-as.Date("1/1/2007",format='%m/%d/%Y'))/365.25
Time difference of 0.4106776
(as.Date("07/01/2007",format='%m/%d/%Y')-as.Date("06/01/2007",format='%m/%d/%Y'))/365.25
Time difference of 0.08213552
(as.Date("12/31/2007",format='%m/%d/%Y')-as.Date("07/02/2007",format='%m/%d/%Y'))/365.25
Time difference of 0.4982888
I can do this by brute force by looking at each year individually and calculating the percent of the year spent at that address. Then I would multiply each weight by the x values and take the mean for that year, but that will not be reasonably possible with thousands of cases. Any ideas of how to address this more efficiently would be much appreciated. It seems like it might be doable with dplyr slice, but I'm stalled out at the moment. The key is separating out each year.
As eipi10 mentioned, some of your data spans more than a year. It also looks inconsistent with the data you used in your time difference calculations, which are all within the same year.
Assuming that your start and end dates would actually be in the same year, you can do something like the following:
library(dplyr)
library(lubridate)  # for year()

foo <- dat %>%
  mutate(start_year = year(dat$start),
         end_year = year(dat$end),
         same_year = (start_year == end_year),
         year_frac = as.numeric(dat$end - dat$start)/365.25,
         wtd_x = year_frac * x)
This gives you:
id geoid x start end start_year end_year same_year year_frac wtd_x
1 1 53 0.5 2004-01-01 2004-10-31 2004 2004 TRUE 0.83230664 0.41615332
2 1 45 0.7 2004-10-31 2004-12-31 2004 2004 TRUE 0.16700890 0.11690623
3 1 45 0.7 2005-01-01 2007-12-31 2005 2007 FALSE 2.99520876 2.09664613
4 7 16 0.3 2007-01-01 2007-05-31 2007 2007 TRUE 0.41067762 0.12320329
5 7 18 0.4 2007-06-01 2007-07-01 2007 2007 TRUE 0.08213552 0.03285421
6 7 42 0.6 2007-07-02 2007-12-31 2007 2007 TRUE 0.49828884 0.29897331
You can then group and summarise the data using:
bar <- foo %>%
  group_by(start_year, id) %>%
  summarise(sum(wtd_x))
to give you the answer:
start_year id sum(wtd_x)
(dbl) (dbl) (dfft)
1 2004 1 0.5330595 days
2 2005 1 2.0966461 days
3 2007 7 0.4550308 days
Hopefully this will get you started. I wasn't sure how you wanted to deal with cases where the period from start to end spans more than one year or crosses calendar years.
library(dplyr)
dat %>%
  mutate(fractionOfYear = as.numeric(end - start)/365.25)
id geoid x start end fractionOfYear
1 1 53 0.5 2004-01-01 2004-10-30 0.82956879
2 1 45 0.7 2004-10-31 2004-12-31 0.16700890
3 1 45 0.7 2005-01-01 2007-12-31 2.99520876
4 7 16 0.3 2005-01-01 2007-05-31 2.40930869
5 7 18 0.4 2007-06-01 2007-07-01 0.08213552
6 7 42 0.6 2007-07-02 2007-12-31 0.49828884
I was able to find some local help that led us to a simple function. We're still stuck on how to use apply with dates but this overall handles it.
# made up sample address data
id <- c(1,1,1,7,7,7)
geoid <- c(53,45,45,16,18,42)
start <- c("1/31/2004","10/31/2004","1/1/2005","1/1/2005","6/1/2007","7/2/2007")
end <- c("10/30/2004","12/31/2004","12/31/2007","5/31/2007","7/1/2007","12/31/2007")
dat <- data.frame(id, geoid, start, end)

# format addresses
dat$start <- as.Date(dat$start, format='%m/%d/%Y')
dat$end <- as.Date(dat$end, format='%m/%d/%Y')

# function to create proportion of time at each address
prop_time <- function(drange, year){
  start <- drange[[1]]; end <- drange[[2]]
  # start year and end year
  syear <- as.numeric(format(start, '%Y'))
  eyear <- as.numeric(format(end, '%Y'))
  # select only those dates that are within the same year
  if(syear <= year & year <= eyear){
    byear <- as.Date(paste("1/1", sep="/", year), format='%m/%d/%Y')
    eyear <- as.Date(paste("12/31", sep="/", year), format='%m/%d/%Y')
    astart <- max(byear, start)
    aend <- min(eyear, end)
    prop <- as.numeric((aend - astart))/as.numeric((eyear - byear))
  } else prop <- 0  # if no proportion within same year calculated then gets 0
  prop
}

# a second function to apply prop_time to multiple cases
prop_apply <- function(dat_times, year){
  out <- NULL
  for(i in 1:dim(dat_times)[1]){
    out <- rbind(out, prop_time(dat_times[i,], year))
  }
  out
}

# create new data frame to populate years
dat <- data.frame(dat, y2004=0, y2005=0, y2006=0, y2007=0)
dat_times <- dat[, c("start", "end")]

# run prop_apply in a loop across cases and selected years
for(j in 2004:2007){
  newdate <- paste("y", j, sep="")
  dat[, newdate] <- prop_apply(dat_times, j)
}
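The final step described in the question (multiply the yearly proportions by x and average them) is not shown above; a rough sketch of it, assuming x from the first example is re-attached to dat, might look like this:
# sketch only: weighted mean of x per id and year, weighting by the yearly
# proportions computed above (x re-attached from the first example data)
dat$x <- c(.5, .7, .7, .3, .4, .6)
for (j in 2004:2007) {
  ycol <- paste0("y", j)
  w <- dat[[ycol]]
  cat(ycol, "\n")
  print(tapply(w * dat$x, dat$id, sum) / tapply(w, dat$id, sum))
}
Years with no residence for a given id come out as NaN (0/0), which can be filtered out afterwards.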

Replacing values in one data frame with values in a second data frame conditional on a logic statement

I have two data frames: "unit_test" with unique descriptions of survey units (one row per survey unit) and "data_test" with field data (multiple rows per survey unit). If it is a ground survey (data_test$type='ground'), I want to replace data_test$easting with the value in unit_test$east for the corresponding code (unit_test$code must match data_test$code1). If it is an air survey (data_test$type=='air'), I want to keep the original values in data_test$easting.
# Create units table
code <- c('pondA','pondB','pondC','pondD','transect1','transect2','transect3','transect4')
east <- c(12345,23456,34567,45678,NA,NA,NA,NA)
north <- c(99876,98765,87654,76543,NA,NA,NA,NA)
unit_test <- data.frame(cbind(code,east,north))
unit_test
# Create data table
code1 <- c('pondA','pondA','transect1','pondB','pondB','transect2','pondC','transect3','pondD','transect4')
type <- c('ground','ground','air','ground','ground','air','ground','air','ground','air')
easting <- c(NA,NA,18264,NA,NA,46378,NA,86025,NA,46295)
northing <-c(NA,NA,96022,NA,NA,85766,NA,21233,NA,23090)
species <- c('NOPI','NOPI','SCAU','GWTE','GWTE','RUDU','NOPI','GADW','NOPI','MALL')
count <- c(10,23,50,1,2,43,12,3,7,9)
data_test <- data.frame(cbind(code1,type,easting,northing,species,count))
data_test
I have tried using the match function:
if(data_test$type == "ground") {
  data_test$easting <- unit_test$east[match(data_test$code1, unit_test$code)]
}
However, it replaces the easting values with NAs where data_test$type == 'air'. Any help would be much appreciated.
I want my final output to look like this:
code1 type easting northing species count
1 pondA ground 12345 99876 NOPI 10
2 pondA ground 12345 99876 NOPI 23
3 transect1 air 18264 96022 SCAU 50
4 pondB ground 23456 98765 GWTE 1
5 pondB ground 23456 98765 GWTE 2
6 transect2 air 46378 85766 RUDU 43
7 pondC ground 34567 87654 NOPI 12
8 transect3 air 86025 21233 GADW 3
9 pondD ground 45678 76543 NOPI 7
10 transect4 air 46295 23090 MALL 9
I think the data.table package is really useful for this task:
install.packages("data.table")
library(data.table)
unit_test = data.table(unit_test)
data_test = data.table(data_test)
Add a column to unit_test specifying it refers to "ground":
unit_test$type = "ground"
Set keys on the tables in order to cross-reference:
setkey(data_test, code1, type, species)
setkey(unit_test, code, type)
Every time you have "ground" for type in data_test, look up the appropriate data in unit_test and replace easting with east:
data_test[unit_test, easting:= east]
data_test[unit_test,northing:= north]
Results:
> data_test
code1 type easting northing species count
1: pondA ground 12345 99876 NOPI 10
2: pondA ground 12345 99876 NOPI 23
3: pondB ground 23456 98765 GWTE 1
4: pondB ground 23456 98765 GWTE 2
5: pondC ground 34567 87654 NOPI 12
6: pondD ground 45678 76543 NOPI 7
7: transect1 air 18264 96022 SCAU 50
8: transect2 air 46378 85766 RUDU 43
9: transect3 air 86025 21233 GADW 3
10: transect4 air 46295 23090 MALL 9
Base R:
data_test[data_test$type == 'ground', c('easting','northing')] <-
  unit_test[match(data_test[data_test$type == 'ground', 'code1'], unit_test$code),
            c('east','north')]
Find the spots you want to fill, and make an index with match() like you mentioned. This is after a change in your sample data: I used stringsAsFactors = FALSE when creating both data frames so I didn't have to deal with factors.
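The exact change to the sample data isn't spelled out; a plausible reconstruction (an assumption, not the answerer's exact code) is to build the frames without cbind(), so east/north stay numeric and the text columns stay character:
# assumed reconstruction of the modified sample data (not shown in the answer)
unit_test <- data.frame(code, east, north, stringsAsFactors = FALSE)
data_test <- data.frame(code1, type, easting, northing, species, count,
                        stringsAsFactors = FALSE)
With numeric east/north columns, the one-line assignment above fills the ground rows without coercing everything to character or factor.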
