Group data and assign group id based on time intervals in R

I am trying to figure out how to assign group id based on time intervals in R.
More context: I have merged GPS data (lat/lon data points, recorded in irregular intervals) with acceleration data (ACC "bursts" of 82 data points, recorded at the start of every minute - all 82 data points in one burst have the same timestamp).
As GPS points and ACC bursts were collected simultaneously, I now want to group GPS points with the associated ACC bursts: assign a unique group id to all GPS and ACC data that occur within the same minute.
EDIT: Here are some sample data. I want to group the GPS point in row 8 with the ACC data recorded within the same minute (in this case, the rows above the GPS point).
structure(list(X.1 = 1:11, timestamp = c("2019-01-26T16:25:00Z", "2019-01-26T16:25:00Z", "2019-01-26T16:25:00Z", "2019-01-26T16:25:00Z", "2019-01-26T16:25:00Z", "2019-01-26T16:25:00Z", "2019-01-26T16:25:00Z", "2019-01-26T16:25:47Z", "2019-01-26T16:26:00Z", "2019-01-26T16:26:00Z", "2019-01-26T16:26:00Z"), sensor.type = c("acceleration", "acceleration", "acceleration", "acceleration", "acceleration", "acceleration", "acceleration", "gps", "acceleration", "acceleration", "acceleration"), location.long = c(NA, NA, NA, NA, NA, NA, NA, 44.4777343, NA, NA, NA), location.lat = c(NA, NA, NA, NA, NA, NA, NA, -12.2839707, NA, NA, NA), annotation = c("Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing", "Moving/Climbing"), X = c(2219L, 1694L, 1976L, 1744L, 2014L, 2202L, 2269L, NA, 1874L, 2024L, 1990L), Y = c(1416L, 1581L, 1524L, 1620L, 1409L, 1545L, 1771L, NA, 1687L, 1773L, 1813L), Z = c(2189L, 2209L, 2121L, 2278L, 2003L, 2034L, 2060L, NA, 2431L, 2504L, 2428L)), class = "data.frame", row.names = c(NA, -11L))
X.1 timestamp sensor.type location.long location.lat annotation X Y Z
1 1 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 2219 1416 2189
2 2 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 1694 1581 2209
3 3 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 1976 1524 2121
4 4 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 1744 1620 2278
5 5 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 2014 1409 2003
6 6 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 2202 1545 2034
7 7 2019-01-26T16:25:00Z acceleration NA NA Moving/Climbing 2269 1771 2060
8 8 2019-01-26T16:25:47Z gps 44.47773 -12.28397 Moving/Climbing NA NA NA
9 9 2019-01-26T16:26:00Z acceleration NA NA Moving/Climbing 1874 1687 2431
10 10 2019-01-26T16:26:00Z acceleration NA NA Moving/Climbing 2024 1773 2504
11 11 2019-01-26T16:26:00Z acceleration NA NA Moving/Climbing 1990 1813 2428
Does that make sense? I know lubridate can summarize data based on time intervals, but how do I add a new group id (variable) based on timestamps?

Here's a solution using dplyr and lubridate. We convert your timestamp column to a proper datetime class, add a new column rounding down to the nearest minute, and then create an ID based on the rounded timestamp:
library(dplyr)
library(lubridate)
dat %>%
  mutate(
    timestamp = ymd_hms(timestamp),
    minute = floor_date(timestamp, unit = "minute"),
    group_id = as.integer(factor(minute))
  )
# X.1 timestamp sensor.type location.long location.lat annotation X Y Z
# 1 1 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 2219 1416 2189
# 2 2 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 1694 1581 2209
# 3 3 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 1976 1524 2121
# 4 4 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 1744 1620 2278
# 5 5 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 2014 1409 2003
# 6 6 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 2202 1545 2034
# 7 7 2019-01-26 16:25:00 acceleration NA NA Moving/Climbing 2269 1771 2060
# 8 8 2019-01-26 16:25:47 gps 44.47773 -12.28397 Moving/Climbing NA NA NA
# 9 9 2019-01-26 16:26:00 acceleration NA NA Moving/Climbing 1874 1687 2431
# 10 10 2019-01-26 16:26:00 acceleration NA NA Moving/Climbing 2024 1773 2504
# 11 11 2019-01-26 16:26:00 acceleration NA NA Moving/Climbing 1990 1813 2428
# minute group_id
# 1 2019-01-26 16:25:00 1
# 2 2019-01-26 16:25:00 1
# 3 2019-01-26 16:25:00 1
# 4 2019-01-26 16:25:00 1
# 5 2019-01-26 16:25:00 1
# 6 2019-01-26 16:25:00 1
# 7 2019-01-26 16:25:00 1
# 8 2019-01-26 16:25:00 1
# 9 2019-01-26 16:26:00 2
# 10 2019-01-26 16:26:00 2
# 11 2019-01-26 16:26:00 2
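An equivalent sketch using group_by() (assuming dplyr >= 1.0, where cur_group_id() is available) avoids the factor() step; the rounded minute itself defines the groups:
library(dplyr)
library(lubridate)
dat %>%
  group_by(minute = floor_date(ymd_hms(timestamp), unit = "minute")) %>%
  mutate(group_id = cur_group_id()) %>%
  ungroup()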

Related

Unnest or move rows to columns?

This is just one of those things that I can't figure out how to word in order to search for a solution. I have some election data for Democratic and Republican candidates. The data contains two rows per county, one row per candidate.
I need a data frame with one row per county, with the second row's values moved into a new column. I've tried to unnest the data frame, but that doesn't work. I've seen something about using unnest and mutate together, but I can't figure that out. Transposing the data frame didn't help either, and I've also tried to ungroup without success.
# Load Michigan 2020 by-county election data
# Data: https://mielections.us/election/results/DATA/2020GEN_MI_CENR_BY_COUNTY.xls
election <- read.csv("2020GEN_MI_CENR_BY_COUNTY.txt", sep = "\t", header = TRUE)
# Remove unnecessary columns
election <- within(election, rm('ElectionDate','OfficeCode.Text.','DistrictCode.Text.','StatusCode','CountyCode','OfficeDescription','PartyOrder','PartyName','CandidateID','CandidateFirstName','CandidateMiddleName','CandidateFormerName','WriteIn.W..Uncommitted.Z.','Recount...','Nominated.N..Elected.E.'))
# Remove offices other than POTUS
election <- election[-c(167:2186),]
# Keep only DEM and REP parties
election <- election %>%
  filter(PartyDescription == "Democratic" |
         PartyDescription == "Republican")
I'd like it to end up with one row per county and each candidate's votes in its own column.
dplyr
library(dplyr)
library(tidyr) # pivot_wider
election %>%
  select(CountyName, PartyDescription, CandidateLastName, CandidateVotes) %>%
  slice(-(167:2186)) %>%
  filter(PartyDescription %in% c("Democratic", "Republican")) %>%
  pivot_wider(id_cols = CountyName, names_from = CandidateLastName, values_from = CandidateVotes)
# # A tibble: 83 x 25
# CountyName Biden Trump Richer LaFave Cambensy Wagner Metsa Markkanen Lipton Strayhorn Carlone Frederick Bernstein Diggs Hubbard Meyers Mosallam Vassar `O'Keefe` Schuitmaker Dewaelsche Stancato Gates Land
# <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 ALCONA 2142 4848 NA NA NA NA NA NA 1812 1748 4186 4209 1818 1738 4332 4114 1696 1770 4273 4187 1682 1733 4163 4223
# 2 ALGER 2053 3014 NA NA 2321 2634 NA NA 1857 1773 2438 2470 1795 1767 2558 2414 1757 1769 2538 2444 1755 1757 2458 2481
# 3 ALLEGAN 24449 41392 NA NA NA NA NA NA 20831 19627 37681 38036 20043 19640 38805 37375 18820 19486 37877 39052 19081 19039 37322 38883
# 4 ALPENA 6000 10686 NA NA NA NA NA NA 5146 4882 8845 8995 5151 4873 9369 8744 4865 4935 9212 8948 4816 4923 9069 9154
# 5 ANTRIM 5960 9748 NA NA NA NA NA NA 5042 4798 8828 8886 4901 4797 9108 8737 4686 4810 9079 8867 4679 4781 8868 9080
# 6 ARENAC 2774 5928 NA NA NA NA NA NA 2374 2320 4626 4768 2396 2224 4833 4584 2215 2243 5025 4638 2185 2276 4713 4829
# 7 BARAGA 1478 2512 NA NA NA NA 1413 2517 1267 1212 2057 2078 1269 1233 2122 2003 1219 1243 2090 2056 1226 1228 2072 2074
# 8 BARRY 11797 23471 NA NA NA NA NA NA 9794 9280 20254 20570 9466 9215 20885 20265 9060 9324 21016 20901 8967 9121 20346 21064
# 9 BAY 26151 33125 NA NA NA NA NA NA 23209 22385 26021 26418 23497 22050 27283 25593 21757 22225 27422 25795 21808 21999 26167 26741
# 10 BENZIE 5480 6601 NA NA NA NA NA NA 4704 4482 5741 5822 4584 4479 6017 5681 4379 4449 5979 5756 4392 4353 5704 5870
# # ... with 73 more rows
@r2evans had the right idea, but slicing the data before filtering lost a lot of the voting data. I hadn't realized that before.
# Load Michigan 2020 by-county election data
# Data: https://mielections.us/election/results/DATA/2020GEN_MI_CENR_BY_COUNTY.xls
election <- read.csv("2020GEN_MI_CENR_BY_COUNTY.txt", sep = "\t", header = TRUE)
# That's an ugly dataset...let's make it better
election <- election[-c(1:5,7:9,11,13:15,17:19)]
election <- election %>%
  filter(CandidateLastName %in% c("Biden", "Trump")) %>%
  select(CountyName, PartyDescription, CandidateLastName, CandidateVotes) %>%
  pivot_wider(id_cols = CountyName, names_from = CandidateLastName, values_from = CandidateVotes)
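A more robust variant (a sketch; the exact OfficeDescription value is an assumption, so check unique(election$OfficeDescription) first) filters by office text instead of hard-coded row positions, which avoids the slice-before-filter trap entirely:
library(dplyr)
library(tidyr)
election <- read.csv("2020GEN_MI_CENR_BY_COUNTY.txt", sep = "\t", header = TRUE)
election %>%
  # keep only presidential rows by matching the office text, not row numbers
  filter(grepl("President", OfficeDescription, ignore.case = TRUE)) %>%
  filter(PartyDescription %in% c("Democratic", "Republican")) %>%
  select(CountyName, CandidateLastName, CandidateVotes) %>%
  pivot_wider(id_cols = CountyName, names_from = CandidateLastName, values_from = CandidateVotes)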

How to filter timestamps of one data frame based on timestamps from another?

I am attempting to filter one data frame, 'Blond_GSE' (bird tracking data with many variables, including a timestamp), by the timestamps from a separate data frame, 'Blond_Prey' (variables including a timestamp of when a bird brought food to a nest).
I would like to filter so that I have a new data frame with all tracking data (Blond_GSE) from the 30 minutes prior to each of the timestamps in 'Blond_Prey'.
Here is a look at each separate data frame.
head(Blond_GSE)
tag_id sensor_type_id acceleration_raw_x acceleration_raw_y
1 977476871 653 30 -942
2 977476871 653 32 -949
3 977476871 653 34 -949
4 977476871 653 40 -944
5 977476871 653 36 -943
6 977476871 653 36 -944
acceleration_raw_z barometric_height battery_charge_percent
1 454 0 100
2 445 0 100
3 450 0 100
4 446 0 100
5 451 0 100
6 455 0 100
battery_charging_current external_temperature flt_switch gps_hdop
1 0 33 NA 0.9
2 0 33 NA 1.0
3 0 33 NA 1.0
4 0 34 NA 0.9
5 0 33 NA 1.0
6 0 33 NA 0.8
gps_maximum_signal_strength gps_satellite_count gps_time_to_fix
1 NA 7 21.46
2 NA 6 12.48
3 NA 7 14.48
4 NA 8 26.41
5 NA 7 7.95
6 NA 9 8.98
ground_speed gsm_mcc_mnc heading height_above_ellipsoid
1 0 NA 86 NA
2 0 NA 296 NA
3 0 NA 331 NA
4 0 NA 44 NA
5 0 NA 213 NA
6 0 NA 225 NA
height_above_msl import_marked_outlier light_level
1 152 false 0
2 152 false 0
3 152 false 0
4 152 false 0
5 152 false 0
6 152 false 0
location_error_numerical location_lat location_long
1 NA 51.86663 27.59045
2 NA 51.86654 27.59053
3 NA 51.86645 27.59056
4 NA 51.86644 27.59071
5 NA 51.86636 27.59047
6 NA 51.86646 27.59067
magnetic_field_raw_x magnetic_field_raw_y magnetic_field_raw_z
1 0.067 -0.354 -0.024
2 0.065 -0.360 -0.013
3 0.067 -0.352 -0.019
4 0.061 -0.360 -0.012
5 0.061 -0.356 -0.014
6 0.073 -0.350 -0.019
ornitela_transmission_protocol tag_voltage timestamp
1 GPRS 4155 2019-04-26 01:42:00
2 GPRS 4150 2019-04-26 01:46:51
3 GPRS 4150 2019-04-26 01:51:53
4 GPRS 4150 2019-04-26 01:57:05
5 GPRS 4147 2019-04-26 02:01:46
6 GPRS 4147 2019-04-26 02:06:47
transmission_timestamp update_ts
1 2019-10-07 09:46:52.104
2 2019-10-07 09:46:52.104
3 2019-10-07 09:46:52.104
4 2019-10-07 09:46:52.104
5 2019-10-07 09:46:52.104
6 2019-10-07 09:46:52.104
vertical_error_numerical visible deployment_id event_id
1 NA true 1003456347 12506913411
2 NA true 1003456347 12506913412
3 NA true 1003456347 12506913413
4 NA true 1003456347 12506913414
5 NA true 1003456347 12506913415
6 NA true 1003456347 12506913416
sensor_type tag_local_identifier location_long.1 location_lat.1
1 GPS 171035 27.59045 51.86663
2 GPS 171035 27.59053 51.86654
3 GPS 171035 27.59056 51.86645
4 GPS 171035 27.59071 51.86644
5 GPS 171035 27.59047 51.86636
6 GPS 171035 27.59067 51.86646
optional sensor timestamps trackId comments
1 TRUE GPS 2019-04-26 01:42:00 Blond NA
2 TRUE GPS 2019-04-26 01:46:51 Blond NA
3 TRUE GPS 2019-04-26 01:51:53 Blond NA
4 TRUE GPS 2019-04-26 01:57:05 Blond NA
5 TRUE GPS 2019-04-26 02:01:46 Blond NA
6 TRUE GPS 2019-04-26 02:06:47 Blond NA
death_comments earliest_date_born exact_date_of_birth
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
individual_id latest_date_born local_identifier nick_name ring_id
1 1003455374 NA Blond Blond
2 1003455374 NA Blond Blond
3 1003455374 NA Blond Blond
4 1003455374 NA Blond Blond
5 1003455374 NA Blond Blond
6 1003455374 NA Blond Blond
sex taxon_canonical_name timestamp_start
1 Aquila clanga 2018-08-31 00:01:23.000
2 Aquila clanga 2018-08-31 00:01:23.000
3 Aquila clanga 2018-08-31 00:01:23.000
4 Aquila clanga 2018-08-31 00:01:23.000
5 Aquila clanga 2018-08-31 00:01:23.000
6 Aquila clanga 2018-08-31 00:01:23.000
timestamp_end number_of_events number_of_deployments
1 2020-07-16 09:54:12.000 85156 1
2 2020-07-16 09:54:12.000 85156 1
3 2020-07-16 09:54:12.000 85156 1
4 2020-07-16 09:54:12.000 85156 1
5 2020-07-16 09:54:12.000 85156 1
6 2020-07-16 09:54:12.000 85156 1
sensor_type_ids taxon_detail
1 GPS Clanga clanga
2 GPS Clanga clanga
3 GPS Clanga clanga
4 GPS Clanga clanga
5 GPS Clanga clanga
6 GPS Clanga clanga
head(Blond_prey)
Location ID Species Habitat Year Date Activity Gender
1 ?????? Blond BP Fen Mire 2019 2019-04-25 Arrival M
2 ?????? Blond BP Fen Mire 2019 2019-04-27 Arrival M
3 ?????? Blond BP Fen Mire 2019 2019-04-27 Arrival M
4 ?????? Blond BP Fen Mire 2019 2019-05-03 Arrival M
5 ?????? Blond BP Fen Mire 2019 2019-05-12 Arrival M
6 ?????? Blond BP Fen Mire 2019 2019-05-13 Arrival M
Activity_1 Category Prey
1 Prey Delivery ? medium-sized bird or large vole
2 Prey Delivery ? Something Small
3 Prey Delivery Crane-like Spotted Crake
4 Prey Delivery Geese Large Duck
5 Prey Delivery ? medium-sized bird or large vole
6 Prey Delivery Snake Grass Snake
Class Age Condition Weight..g. Notes
1 ? <NA> <NA> 100 Imperfectly Seen
2 ? <NA> <NA> NA <NA>
3 Aves ad <NA> NA <NA>
4 Aves ad duck spine with head NA <NA>
5 ? <NA> <NA> 100 Imperfectly Seen
6 Reptilia <NA> <NA> NA <NA>
New_Time
1 2019-04-25 17:03:00 UTC
2 2019-04-27 04:39:00 UTC
3 2019-04-27 07:33:00 UTC
4 2019-05-03 07:26:00 UTC
5 2019-05-12 06:40:00 UTC
6 2019-05-13 13:19:00 UTC
The columns with the timestamps are called "timestamp" in Blond_GSE and "New_Time" in Blond_prey.
Here is a look at the two timestamps.
head(Blond_GSE$timestamp)
[1] "2019-04-26 01:42:00 UTC" "2019-04-26 01:46:51 UTC"
[3] "2019-04-26 01:51:53 UTC" "2019-04-26 01:57:05 UTC"
[5] "2019-04-26 02:01:46 UTC" "2019-04-26 02:06:47 UTC"
head(Blond_prey$New_Time)
[1] "2019-04-25 17:03:00 UTC" "2019-04-27 04:39:00 UTC"
[3] "2019-04-27 07:33:00 UTC" "2019-05-03 07:26:00 UTC"
[5] "2019-05-12 06:40:00 UTC" "2019-05-13 13:19:00 UTC"
I would like to filter the Blond_GSE data by the timestamp of Blond_prey, so I get all data 30 mins prior to the Blond_Prey timestamps.
Is this possible?
I have tried this code:
Blond.GSE <- Blond_GSE %>% filter_time(timestamp => Blond_prey$New_Time <=(Blond_prey&New_Time - 30))
However that returns an error message:
Error: unexpected '>' in "Blond.GSE <- Blond_GSE %>% filter_time(timestamp =>"
Please can someone help?
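One possible approach (a sketch, not from the original thread; it assumes both timestamp columns are already POSIXct, which the UTC printing suggests): loop over the prey timestamps and keep every tracking row that falls in the 30 minutes before each one.
library(dplyr)
library(purrr)
# for each prey delivery, keep tracking rows from the preceding 30 minutes
prior_tracks <- map_dfr(Blond_prey$New_Time, function(t) {
  Blond_GSE %>%
    filter(timestamp >= t - 30 * 60, timestamp <= t) %>%
    mutate(prey_time = t)  # record which delivery this window belongs to
})
Rows that fall within 30 minutes of two deliveries will appear twice, once per window, which is usually what you want for per-delivery summaries.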

Calculating rates when data is in long form

A sample of my data is available here.
I am trying to calculate the growth rate (change in weight (wt) over time) for each squirrel.
When I have my data in wide format:
squirrel fieldBirthDate date1 date2 date3 date4 date5 date6 age1 age2 age3 age4 age5 age6 wt1 wt2 wt3 wt4 wt5 wt6 litterid
22922 2017-05-13 2017-05-14 2017-06-07 NA NA NA NA 1 25 NA NA NA NA 12 52.9 NA NA NA NA 7684
22976 2017-05-13 2017-05-16 2017-06-07 NA NA NA NA 3 25 NA NA NA NA 15.5 50.9 NA NA NA NA 7692
22926 2017-05-13 2017-05-16 2017-06-07 NA NA NA NA 0 25 NA NA NA NA 10.1 48 NA NA NA NA 7719
I am able to calculate growth rate with the following code:
library(dplyr)
# growth rate between weight 1 and weight 3, divided by age when weight 3 was recorded
growth <- growth %>%
  mutate(g.rate = (wt3 - wt1) / age3)
# growth rate between weight 1 and weight 2, divided by age when weight 2 was recorded
merge.growth <- merge.growth %>%
  mutate(g.rate = (wt2 - wt1) / age2)
However, when the data is in long format (a format needed for the analysis I am running afterwards):
squirrel litterid date age wt
22922 7684 2017-05-13 0 NA
22922 7684 2017-05-14 1 12
22922 7684 2017-06-07 25 52.9
22976 7692 2017-05-13 1 NA
22976 7692 2017-05-16 3 15.5
22976 7692 2017-06-07 25 50.9
22926 7719 2017-05-14 0 10.1
22926 7719 2017-06-08 25 48
I cannot use the mutate function I used above. I am hoping to create a new column that includes growth rate as follows:
squirrel litterid date age wt g.rate
22922 7684 2017-05-13 0 NA NA
22922 7684 2017-05-14 1 12 NA
22922 7684 2017-06-07 25 52.9 1.704
22976 7692 2017-05-13 1 NA NA
22976 7692 2017-05-16 3 15.5 NA
22976 7692 2017-06-07 25 50.9 1.609
22926 7719 2017-05-14 0 10.1 NA
22926 7719 2017-06-08 25 48 1.516
22758 7736 2017-05-03 0 8.8 NA
22758 7736 2017-05-28 25 43 1.368
22758 7736 2017-07-05 63 126 1.860
22758 7736 2017-07-23 81 161 1.879
22758 7736 2017-07-26 84 171 1.930
I have been calculating the growth rates (growth between each weight and the first time the squirrel was weighed) in Excel, but I would like to do the calculations in R instead since I have a large number of squirrels to work with. I suspect if/else loops might be the way to go here, but I am not well versed in that sort of coding. Any suggestions or ideas are welcome!
You can use group_by to calculate this for each squirrel:
group_by(df, squirrel) %>%
  mutate(g.rate = (wt - nth(wt, which.min(is.na(wt)))) /
                  (age - nth(age, which.min(is.na(wt)))))
That leaves NaNs where the age term is zero, but you can change those to NAs if you want with df$g.rate[is.nan(df$g.rate)] <- NA.
An alternative using data.table and its shift() function, which takes the previous row (note this computes growth between consecutive weighings rather than from the first weighing):
library(data.table)
df <- data.table(df)
df[, growth := (wt - shift(wt, 1)) / age, by = .(squirrel)]

use replace_na conditionally

I want to conditionally replace missing Revenue values up to 16 July 2017 with zero, using the tidyverse.
My Data
library(tidyverse)
library(lubridate)
df<- tribble(
~Date, ~Revenue,
"2017-07-01", 500,
"2017-07-02", 501,
"2017-07-03", 502,
"2017-07-04", 503,
"2017-07-05", 504,
"2017-07-06", 505,
"2017-07-07", 506,
"2017-07-08", 507,
"2017-07-09", 508,
"2017-07-10", 509,
"2017-07-11", 510,
"2017-07-12", NA,
"2017-07-13", NA,
"2017-07-14", NA,
"2017-07-15", NA,
"2017-07-16", NA,
"2017-07-17", NA,
"2017-07-18", NA,
"2017-07-19", NA,
"2017-07-20", NA
)
df$Date <- ymd(df$Date)
Date up to which I want to conditionally replace NAs
max.date <- ymd("2017-07-16")
Output I desire
# A tibble: 20 × 2
Date Revenue
<date> <dbl>
1 2017-07-01 500
2 2017-07-02 501
3 2017-07-03 502
4 2017-07-04 503
5 2017-07-05 504
6 2017-07-06 505
7 2017-07-07 506
8 2017-07-08 507
9 2017-07-09 508
10 2017-07-10 509
11 2017-07-11 510
12 2017-07-12 0
13 2017-07-13 0
14 2017-07-14 0
15 2017-07-15 0
16 2017-07-16 0
17 2017-07-17 NA
18 2017-07-18 NA
19 2017-07-19 NA
20 2017-07-20 NA
The only way I could work this out was to split the df into several parts, update the NAs, and then rbind the whole lot.
Could someone please help me do this efficiently using the tidyverse?
We can mutate the 'Revenue' column to replace the NA with 0, using a logical condition that checks whether the element is NA and the 'Date' is less than or equal to 'max.date':
df %>%
  mutate(Revenue = replace(Revenue, is.na(Revenue) & Date <= max.date, 0))
# A tibble: 20 x 2
# Date Revenue
# <date> <dbl>
# 1 2017-07-01 500
# 2 2017-07-02 501
# 3 2017-07-03 502
# 4 2017-07-04 503
# 5 2017-07-05 504
# 6 2017-07-06 505
# 7 2017-07-07 506
# 8 2017-07-08 507
# 9 2017-07-09 508
#10 2017-07-10 509
#11 2017-07-11 510
#12 2017-07-12 0
#13 2017-07-13 0
#14 2017-07-14 0
#15 2017-07-15 0
#16 2017-07-16 0
#17 2017-07-17 NA
#18 2017-07-18 NA
#19 2017-07-19 NA
#20 2017-07-20 NA
It can be achieved with data.table by specifying the logical condition in 'i' and assigning (:=) the 'Revenue' to 0:
library(data.table)
setDT(df)[is.na(Revenue) & Date <= max.date, Revenue := 0]
Or with base R
df$Revenue[is.na(df$Revenue) & df$Date <= max.date] <- 0
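Since the title asks about replace_na: tidyr::replace_na() itself has no condition argument, but the same result can be written in a single tidyverse pipe with if_else (a sketch equivalent to the replace() call above):
library(dplyr)
df %>%
  mutate(Revenue = if_else(is.na(Revenue) & Date <= max.date, 0, Revenue))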

Conditional replacement of a data.frame column with matrix values. VLOOKUP in R

I have a very simple question in R, but I cannot find the solution in previous answers for what I need, or I missed it. I want a sort of VLOOKUP (like Excel) formula, but only for specific rows in a data frame. Let's say I have a data frame like the following:
id obs year a1 a2 b1 b2 c
604 43 2003 NA NA NA NA NA
605 43 2004 NA NA NA NA NA
606 43 2005 9000 6421 1748365 0.1616 36872152
769 55 2003 NA NA NA NA NA
770 55 2004 NA NA NA NA NA
771 55 2005 2500 12449 NA NA 125992307
844 61 2003 1800 11633 157977428 0.0089 69901689
845 61 2004 2200 14841 228966763 0.0012 86853166
846 61 2005 2500 15559 345889717 0.0081 103029905
2209 178 2003 NA NA NA NA NA
2210 178 2004 200 45093 NA NA 11668685
2211 178 2005 250 47202 610500 0.1605 12813908
Then, I apply a formula to all the complete cases in the data, so for this particular example I get a matrix with 5 lines of results (2 results per observation), shown below:
id x y
606 8000 30
844 1700 90
845 8000 61
846 400 82
2211 600 30
So now, what I basically want is, only for rows in year 2005 in the data frame, to check where there is a match (by id) in the matrix and modify a specific column in the data frame (that I created before as "value") with its corresponding result in the "y" column of the matrix. Consider two points: (a) for the non-complete cases it should give NA; (b) I only want year 2005 to be modified; other years will be modified later with follow-up formulas that will produce different matrix results. Given this, to my knowledge, functions like merge, match, cbind or the plyr ones will affect the whole column, and I am not looking for that. Other options like %in% or %l% didn't work either, or I am using them mistakenly. This is what I tried so far, with no success:
df$value [c(df$year==2005)] <- matrix[,3[matrix[,1]==df$id]]
df$value [c(df$year==2005)] <- matrix[,3][matrix[,1]==df$id]
Maybe a loop could be the solution, but I am still learning how to build them and my attempts were unfruitful too.
Here is the result I would expect, for better understanding:
id obs year a1 a2 b1 b2 c value
604 43 2003 NA NA NA NA NA NA
605 43 2004 NA NA NA NA NA NA
606 43 2005 9000 6421 1748365 0.1616 36872152 30
769 55 2003 NA NA NA NA NA NA
770 55 2004 NA NA NA NA NA NA
771 55 2005 2500 12449 NA NA 125992307 NA
844 61 2003 1800 11633 157977428 0.0089 69901689 NA
845 61 2004 2200 14841 228966763 0.0012 86853166 NA
846 61 2005 2500 15559 345889717 0.0081 103029905 82
2209 178 2003 NA NA NA NA NA NA
2210 178 2004 200 45093 NA NA 11668685 NA
2211 178 2005 250 47202 610500 0.1605 12813908 30
Thanks a lot for any hint, and keep up the great work. I have been reading this site for about a year and it has helped me a lot!
Using akrun's data, you could also use:
ifelse(df1$year == 2005 & rowSums(sapply(df1[-(1:3)], is.na)) == 0,
       m1[match(df1$id, m1[, "id"]), "y"],
       NA)
#[1] NA NA 30 NA NA NA NA NA 82 NA NA 30
i.e. if the year is 2005 and there is no NA in the row, take the respective "y" from the matrix, else NA.
You could try the following (df1 is the data.frame and m1 the matrix):
indx <- which(df1$year==2005)
Update
I guess I missed one of the conditions, i.e. complete.cases (though with the example dataset it didn't change the results). The new indx should be:
indx <- which(df1$year == 2005 & !rowSums(is.na(df1[-(1:3)]))) # inspired by @alexis_laz's answer
df1$value <- NA
df1$value[indx[df1$id[indx] %in% m1[, "id"]]] <- m1[, "y"][m1[, "id"] %in% df1$id[indx]]
df1
# id obs year a1 a2 b1 b2 c value
#1 604 43 2003 NA NA NA NA NA NA
#2 605 43 2004 NA NA NA NA NA NA
#3 606 43 2005 9000 6421 1748365 0.1616 36872152 30
#4 769 55 2003 NA NA NA NA NA NA
#5 770 55 2004 NA NA NA NA NA NA
#6 771 55 2005 2500 12449 NA NA 125992307 NA
#7 844 61 2003 1800 11633 157977428 0.0089 69901689 NA
#8 845 61 2004 2200 14841 228966763 0.0012 86853166 NA
#9 846 61 2005 2500 15559 345889717 0.0081 103029905 82
#10 2209 178 2003 NA NA NA NA NA NA
#11 2210 178 2004 200 45093 NA NA 11668685 NA
#12 2211 178 2005 250 47202 610500 0.1605 12813908 30
data
df1 <- structure(list(id = c(604L, 605L, 606L, 769L, 770L, 771L, 844L,
845L, 846L, 2209L, 2210L, 2211L), obs = c(43L, 43L, 43L, 55L,
55L, 55L, 61L, 61L, 61L, 178L, 178L, 178L), year = c(2003L, 2004L,
2005L, 2003L, 2004L, 2005L, 2003L, 2004L, 2005L, 2003L, 2004L,
2005L), a1 = c(NA, NA, 9000L, NA, NA, 2500L, 1800L, 2200L, 2500L,
NA, 200L, 250L), a2 = c(NA, NA, 6421L, NA, NA, 12449L, 11633L,
14841L, 15559L, NA, 45093L, 47202L), b1 = c(NA, NA, 1748365L,
NA, NA, NA, 157977428L, 228966763L, 345889717L, NA, NA, 610500L
), b2 = c(NA, NA, 0.1616, NA, NA, NA, 0.0089, 0.0012, 0.0081,
NA, NA, 0.1605), c = c(NA, NA, 36872152L, NA, NA, 125992307L,
69901689L, 86853166L, 103029905L, NA, 11668685L, 12813908L)), .Names = c("id",
"obs", "year", "a1", "a2", "b1", "b2", "c"), class = "data.frame", row.names = c(NA,
-12L))
m1 <- structure(c(606L, 844L, 845L, 846L, 2211L, 8000L, 1700L, 8000L,
400L, 600L, 30L, 90L, 61L, 82L, 30L), .Dim = c(5L, 3L), .Dimnames = list(
NULL, c("id", "x", "y")))
If I were in your shoes, I would probably write a for loop and a function to loop through every record, since it seems like there is different logic going on depending on the condition.
Here is my understanding of your 'specification':
work only on the rows that meet certain criteria (year equals 2005 in this case), instead of affecting the whole column.
Here is some code. It is a bit long, but maybe the idea of breaking the data frame into two parts and then putting them back together using melt/dcast will be helpful:
mytext1 <- "id obs year a1 a2 b1 b2 c
604 43 2003 NA NA NA NA NA
605 43 2004 NA NA NA NA NA
606 43 2005 9000 6421 1748365 0.1616 36872152
769 55 2003 NA NA NA NA NA
770 55 2004 NA NA NA NA NA
771 55 2005 2500 12449 NA NA 125992307
844 61 2003 1800 11633 157977428 0.0089 69901689
845 61 2004 2200 14841 228966763 0.0012 86853166
846 61 2005 2500 15559 345889717 0.0081 103029905
2209 178 2003 NA NA NA NA NA
2210 178 2004 200 45093 NA NA 11668685
2211 178 2005 250 47202 610500 0.1605 12813908"
mytext2 <- "id x y
606 8000 30
844 1700 90
845 8000 61
846 400 82
2211 600 30"
data.1 <- read.table(text=mytext1, header=TRUE)
data.2 <- read.table(text=mytext2, header=TRUE)
require(reshape2)
a <- merge(x = subset(data.1, year == 2005), y = data.2, by = "id")
b <- subset(data.1, year != 2005)
a.new <- melt(a, id.vars = c('id'))
b.new <- melt(b, id.vars = c('id'))
result.new <- rbind(a.new, b.new)
result <- dcast(result.new, id ~ variable)
Now you have a result like this:
> result
id obs year a1 a2 b1 b2 c x y
1 604 43 2003 NA NA NA NA NA NA NA
2 605 43 2004 NA NA NA NA NA NA NA
3 606 43 2005 9000 6421 1748365 0.1616 36872152 8000 30
4 769 55 2003 NA NA NA NA NA NA NA
5 770 55 2004 NA NA NA NA NA NA NA
6 844 61 2003 1800 11633 157977428 0.0089 69901689 NA NA
7 845 61 2004 2200 14841 228966763 0.0012 86853166 NA NA
8 846 61 2005 2500 15559 345889717 0.0081 103029905 400 82
9 2209 178 2003 NA NA NA NA NA NA NA
10 2210 178 2004 200 45093 NA NA 11668685 NA NA
11 2211 178 2005 250 47202 610500 0.1605 12813908 600 30
You still need to fix the column names, either at the end or before putting the pieces back together. :)
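One caveat with the merge step above: with the default all = FALSE, 2005 rows that have no match in data.2 are dropped entirely, which is why id 771 is missing from the result. Merging with all.x = TRUE keeps them (a small fix, same data as above):
a <- merge(x = subset(data.1, year == 2005), y = data.2, by = "id", all.x = TRUE)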
