Conditional Mutating using specific identifiers in a data frame

I have a dataset structured like this one:
ID <- c(4, 5, 6, 7, 3, 8, 9)
quantity <- c(20, 300, 350, 21, 44, 20, 230)
measurementvalue <- c("tin", "kg", "kg", "tin", "tin", "tin", "kg")
kgs <- c(21, 12, 30, 23, 33, 11, 24) # defined but not used below
DF <- data.frame(ID, quantity, measurementvalue)
My standard way of deriving a new column totalkgs using conditional mutating is the code below.
DF <- DF %>%
  mutate(totalkgs =
           ifelse(measurementvalue == "tin", quantity * 5,
                  ifelse(measurementvalue == "kg", quantity * 1, quantity)))
However, the dataset has erroneous entries in the quantity column, so I'd like to perform a division on those specific identifiers. The final values of both the multiplication and the division should be stored in the totalkgs column. How do I go about this?
Let's assume the IDs with the erroneous data are 3, 5, 7, and 9, and that I'd like to divide the values in the quantity column by 10.

You could use case_when:
ID <- c(4, 5, 6, 7, 3, 8, 9)
quantity <- c(20, 300, 350, 21, 44, 20, 230)
measurementvalue <- c("tin", "kg", "kg", "tin", "tin", "tin", "kg")
kgs <- c(21, 12, 30, 23, 33, 11, 24)
DF <- data.frame(ID, quantity, measurementvalue)

library(dplyr)
DF %>%
  mutate(quantity2 = ifelse(ID %in% c(3, 5, 9, 7), quantity / 10, quantity)) %>%
  mutate(totalkgs = case_when(measurementvalue == "tin" ~ quantity2 * 5,
                              measurementvalue == "kg"  ~ quantity2 * 1,
                              TRUE ~ quantity2)) %>%
  select(-quantity2) # if you want
#>   ID quantity measurementvalue totalkgs
#> 1  4       20              tin    100.0
#> 2  5      300               kg     30.0
#> 3  6      350               kg    350.0
#> 4  7       21              tin     10.5
#> 5  3       44              tin     22.0
#> 6  8       20              tin    100.0
#> 7  9      230               kg     23.0
Created on 2022-07-05 by the reprex package (v2.0.1)
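If you want to avoid the helper column entirely, the division and the case_when() can live in a single mutate(), with the helper removed inline; a minimal sketch, assuming dplyr >= 1.0 (where assigning NULL inside mutate() drops a column):

library(dplyr)
DF %>%
  mutate(quantity2 = ifelse(ID %in% c(3, 5, 7, 9), quantity / 10, quantity),
         totalkgs = case_when(measurementvalue == "tin" ~ quantity2 * 5,
                              measurementvalue == "kg"  ~ quantity2,
                              TRUE ~ quantity2),
         quantity2 = NULL) # drop the helper column in the same mutate()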

R: Filtering rows based on a group criterion

I have a data frame with over 100,000 rows and about 40 columns. The schools column has about 100 distinct schools. I have data from 1980 to 2023.
I want to keep all data from schools that have at least 10 rows for each of the years 2018 through 2022. Schools that do not meet that criterion should have all rows deleted.
In my minimal example, Schools, I have three schools.
Computing a table makes it apparent that only Washington should be retained: Adams has only 5 rows for 2018, and Jefferson has 0 (see the counts below).
Schools2 is what the result should look like.
How do I use the table computation or a dplyr computation to perform the filter?
Schools <-
  data.frame(school = c(rep('Washington', 60),
                        rep('Adams', 70),
                        rep('Jefferson', 100)),
             year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5),
                      rep(2017, 25), rep(2018, 5), rep(2019:2022, each = 10),
                      rep(2019:2023, each = 20)),
             stuff = rnorm(230))
Schools2 <-
  data.frame(school = rep('Washington', 60),
             year = c(rep(2016, 5), rep(2018:2022, each = 10), rep(2023, 5)),
             stuff = rnorm(60))
table(Schools$school, Schools$year)
Schools |> group_by(school, year) |> summarize(counts = n())
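For reference, the two-way table should come out as follows (the counts are fixed by the rep() calls above); only Washington has at least 10 rows in every year 2018-2022:

#>              2016 2017 2018 2019 2020 2021 2022 2023
#>   Adams         0   25    5   10   10   10   10    0
#>   Jefferson     0    0    0   20   20   20   20   20
#>   Washington    5    0   10   10   10   10   10    5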
Using filter(), keep only those schools where every year from 2018 through 2022 is present and where the per-year count within that range is at least 10:
library(dplyr) # version >= 1.1.0
Schools %>%
  filter(all(table(year[year %in% 2018:2022]) >= 10) &
           all(2018:2022 %in% year), .by = "school") %>%
  as_tibble()
Output:
# A tibble: 60 × 3
   school      year   stuff
   <chr>      <dbl>   <dbl>
 1 Washington  2016  0.680
 2 Washington  2016 -1.14
 3 Washington  2016  0.0420
 4 Washington  2016 -0.603
 5 Washington  2016  2.05
 6 Washington  2018 -0.810
 7 Washington  2018  0.692
 8 Washington  2018 -0.502
 9 Washington  2018  0.464
10 Washington  2018  0.397
# … with 50 more rows
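For dplyr versions before 1.1.0, which lack the .by argument, the same filter can be written with an explicit group_by(); a minimal sketch:

library(dplyr)
Schools %>%
  group_by(school) %>%
  filter(all(table(year[year %in% 2018:2022]) >= 10),
         all(2018:2022 %in% year)) %>%
  ungroup()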
Or using count
library(magrittr)
Schools %>%
  filter(tibble(year) %>%
           filter(year %in% 2018:2022) %>%
           count(year) %>%
           pull(n) %>%
           is_weakly_greater_than(10) %>%
           all,
         all(2018:2022 %in% year), .by = "school")
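A plainer count-based variant (a sketch, not part of the original answer) builds the set of qualifying schools first and then keeps their rows with a semi_join():

library(dplyr)
keep <- Schools %>%
  filter(year %in% 2018:2022) %>%
  count(school, year) %>%
  group_by(school) %>%
  filter(n() == 5, all(n >= 10)) %>% # all five years present, each with >= 10 rows
  distinct(school)
Schools %>% semi_join(keep, by = "school")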
As it turns out, a friend just helped me come up with a base R solution.
# form a 2-way table, school against year
sdTable <- table(Schools$school, Schools$year)
# we want years 2018-2022 to have lots of rows in the school data
sdTable <- sdTable[, 3:7]
# which schools have >= 10 rows in all years 2018-2022?
allGtEq <- function(oneRow) all(oneRow >= 10)
whichToKeep <- which(apply(sdTable, 1, allGtEq))
# whichToKeep holds row numbers from the table; get the school names
whichToKeep <- names(whichToKeep)
# back to the school data
whichOrigRowsToKeep <- which(Schools$school %in% whichToKeep)
newSchools <- Schools[whichOrigRowsToKeep, ]
newSchools
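The same idea can be written more compactly, indexing the table by year names rather than by column position (a sketch):

tab <- table(Schools$school, Schools$year)[, as.character(2018:2022)]
keep <- rownames(tab)[apply(tab >= 10, 1, all)]
newSchools <- Schools[Schools$school %in% keep, ]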

Calculating distance between all locations to first location, by group

I have GPS locations from several seabird tracks, each starting from colony x. Therefore the individual tracks all have similar first locations. For each track, I would like to calculate the beeline distance between each GPS location and either (a) a specified location that represents the location of colony x, or (b) the first GPS point of a given track which represents the location of colony x. For (b), I would look to use the first location of each new track ID (track_id).
I have looked for appropriate functions in geosphere, sp, raster, adehabitatLT, move, ... and just cannot seem to find what I am looking for.
I can calculate the distance between successive GPS points, but that is not what I need.
library(dplyr)
library(geosphere)
df %>%
  group_by(ID) %>%
  mutate(lat_prev = lag(Lat, 1), lon_prev = lag(Lon, 1)) %>%
  mutate(dist = distVincentyEllipsoid(matrix(c(lon_prev, lat_prev), ncol = 2), # or use distHaversine
                                      matrix(c(Lon, Lat), ncol = 2)))
# example data:
df <- data.frame(
  Lon = c(-96.8, -96.60861, -96.86875, -96.14351, -92.82518, -90.86053, -90.14208,
          -84.64081, -83.7, -82, -80, -88.52732, -94.46049, -94.30, -88.60, -80.50,
          -81.70, -83.90, -84.60, -90.10, -90.80, -92.70, -96.10, -96.55, -96.50, -96.00),
  Lat = c(25.38657, 25.90644, 26.57339, 27.63348, 29.03572, 28.16380, 28.21235,
          26.71302, 25.12554, 24.50031, 24.89052, 30.16034, 29.34550, 29.34550,
          30.16034, 24.89052, 24.50031, 25.12554, 26.71302, 28.21235, 28.16380,
          29.03572, 27.63348, 26.57339, 25.80000, 25.30000),
  ID = c(rep("ID1", 13), rep("ID2", 13)))
Grateful for any pointers.
You were pretty close. The key is that you want to calculate the distance from the first observation in each track. Therefore you first need to add the order within each track (easy with dplyr::row_number()). Then, for the distance calculation, make the reference observation always the first one by subsetting with order == 1.
library(tidyverse)
library(geosphere)
df <- data.frame(
  Lon = c(-96.8, -96.60861, -96.86875, -96.14351, -92.82518, -90.86053, -90.14208,
          -84.64081, -83.7, -82, -80, -88.52732, -94.46049, -94.30, -88.60, -80.50,
          -81.70, -83.90, -84.60, -90.10, -90.80, -92.70, -96.10, -96.55, -96.50, -96.00),
  Lat = c(25.38657, 25.90644, 26.57339, 27.63348, 29.03572, 28.16380, 28.21235,
          26.71302, 25.12554, 24.50031, 24.89052, 30.16034, 29.34550, 29.34550,
          30.16034, 24.89052, 24.50031, 25.12554, 26.71302, 28.21235, 28.16380,
          29.03572, 27.63348, 26.57339, 25.80000, 25.30000),
  ID = c(rep("ID1", 13), rep("ID2", 13)))
df %>%
  group_by(ID) %>%
  mutate(order = row_number()) %>%
  mutate(dist = distVincentyEllipsoid(matrix(c(Lon[order == 1], Lat[order == 1]), ncol = 2),
                                      matrix(c(Lon, Lat), ncol = 2)))
#> # A tibble: 26 x 5
#> # Groups:   ID [2]
#>      Lon   Lat ID    order     dist
#>    <dbl> <dbl> <chr> <int>    <dbl>
#>  1 -96.8  25.4 ID1       1       0
#>  2 -96.6  25.9 ID1       2   60714.
#>  3 -96.9  26.6 ID1       3  131665.
#>  4 -96.1  27.6 ID1       4  257404.
#>  5 -92.8  29.0 ID1       5  564320.
#>  6 -90.9  28.2 ID1       6  665898.
#>  7 -90.1  28.2 ID1       7  732131.
#>  8 -84.6  26.7 ID1       8 1225193.
#>  9 -83.7  25.1 ID1       9 1319482.
#> 10 -82    24.5 ID1      10 1497199.
#> # ... with 16 more rows
Created on 2022-01-09 by the reprex package (v2.0.1)
This also seems to work (sent to me by a friend); it's very similar to Dan's suggestion above, but slightly different:
library(geosphere)
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Dist_to_col = distHaversine(c(Lon[1], Lat[1]), cbind(Lon, Lat)))
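For option (a), a fixed known colony location, no grouping is needed at all; a minimal sketch, where colony_lon and colony_lat are hypothetical placeholders for the coordinates of colony x:

library(geosphere)
library(dplyr)
colony_lon <- -96.8    # placeholder longitude of colony x
colony_lat <- 25.38657 # placeholder latitude of colony x
df %>%
  mutate(dist_to_colony = distHaversine(c(colony_lon, colony_lat),
                                        cbind(Lon, Lat)))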

How to group a data frame by variables that are approximately the same?

I have the following data frame with data for each county:
# A tibble: 6 x 4
# Groups:   countyfips, day_after_reopening, deciles_income [6]
  countyfips day_after_reopening deciles_income winner2016
       <int> <drtn>                       <int> <chr>
1       1001 -109 days                        8 Donald Trump
2       1001 -102 days                        8 Donald Trump
3       1001  -95 days                        8 Donald Trump
4       1001  -88 days                        8 Donald Trump
5       1001  -81 days                        8 Donald Trump
6       1001  -74 days                        8 Donald Trump
And I would like to group it by the day_after_reopening column. The problem is that the day_after_reopening number differs slightly from county to county: the observations are taken at the same time for every county, but each county reopened on a different day of the week (e.g., of two counties I would like in the same group, one might have -109 and the other -108).
How would you group counties with very similar numeric values together? Thank you.
You can create artificial groups based on a predefined maximum difference between values. I created one example below:
library(dplyr)
# Maximum difference you want to allow within a group
difference_max <- 2
# Create a dummy data frame
day_after_reopening <- c(108, 109, 107, 50, 51, 68, 69, 67, 108, 109, 55, 56, 57,
                         100, 101, 101, 100, 56)
df <- data.frame(day_after_reopening = day_after_reopening,
                 index = seq_along(day_after_reopening))
# Order by the column of interest and compute the gap to the next value
df <- df[order(df$day_after_reopening), ]
df$test <- c(diff(df$day_after_reopening, lag = 1), 0)
# Insert marker rows where the gap exceeds the tolerance
breaks <- df[df$test > difference_max, ]
breaks$test <- "here"
df <- rbind(breaks, df)
df <- df[order(df$day_after_reopening, df$test), ]
# Turn the markers into split points and group between them
df <- df %>%
  mutate(split_point = test < "here",
         breaks = with(rle(split_point), rep(seq_along(lengths), lengths))) %>%
  filter(split_point) %>%
  group_by(breaks) %>%
  summarise(day_after_reopening_mean = mean(day_after_reopening))
> df
# A tibble: 5 x 2
  breaks day_after_reopening_mean
   <int>                    <dbl>
1      1                     50.5
2      3                     56
3      5                     68
4      7                    100.
5      9                    108.
OK, then it sounds like you'll first want to get the max number of days so you know how far out to go. The code for a new data frame could be something like below (I've never used cut(), so I wouldn't know how to do it more automatically):
df2 <- df1 %>%
  mutate(day_after_grp =
           case_when(day_after_reopening >= 0  & day_after_reopening <= 6  ~ "0-6",
                     day_after_reopening >= 7  & day_after_reopening <= 13 ~ "7-13",
                     day_after_reopening >= 14 & day_after_reopening <= 20 ~ "14-20"
                     # ... continue the pattern up to the max number of days
                     ))
You'd then have a new variable, day_after_grp, to use for groupings.
Again, there may be a more programmatic way to do it with less copy/paste; see the cut() sketch below.
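For what it's worth, a minimal sketch of the more programmatic cut() version alluded to above (df1 and the 7-day bin width are assumptions; negative values would need the lower break extended):

library(dplyr)
df2 <- df1 %>%
  mutate(day_after_grp = cut(day_after_reopening,
                             breaks = seq(0, max(day_after_reopening) + 7, by = 7),
                             right = FALSE)) # labels like [0,7), [7,14), ...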

How to change all values within group following lag(value) > 1

Someone named Tarqon on Reddit's /r/Rlanguage solved the problem: use 1 + cumsum(days_between >= 45) instead of the if_else().
group_by(DMHID) %>%
  arrange(DMHID, DateOfService) %>%
  mutate(days_between = as.numeric(DateOfService - lag(DateOfService, default = DateOfService[1]))) %>%
  mutate(eoc_45dco = 1 + cumsum(days_between >= 45)) %>%
  mutate(id_eoc = as.integer(paste0(DMHID, eoc_45dco)))
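To see why the cumsum() works where the if_else() didn't: each 45+ day gap adds a permanent +1 to the running total, so every subsequent visit falls into the new episode rather than only the single visit right after the gap. A self-contained illustration with made-up gap values:

days_between <- c(0, 7, 14, 59, 7, 7, 90, 7)
1 + cumsum(days_between >= 45)
#> [1] 1 1 1 2 2 2 3 3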
ORIGINAL QUESTION
So I am trying to split cases based on the number of days (> 45) between one visit and the next. It works for the individual instance where there are more than 45 days between one visit and the next, but I need every visit after that point to be part of the second group. For example, participant 1234 has 362 visits, but between visit 105 (2016-12-26) and visit 106 (2017-02-23) there was a 59-day gap, so I want all cases after that to be labeled 2. In other words, all cases up to and including visit 105 should be 12341 and everything after should be 12342, so I can group by this variable for later analyses. The problem is that I can only seem to get the 106th visit labeled 12342; everything before and after it is 12341. I created a stripped-down dataset and script that reproduces the problem:
https://www.dropbox.com/s/k6gvo8igvbhpgti/reprex.zip?dl=0
EDIT: I just thought of another way to say it. I basically need to figure out how to group/subset each person's data, with the dividing line being the first time there is a gap of 45 days or more. I might be going down the wrong road with my current implementation, so if you can suggest alternative ways to split the data the way I want, let me know. The example only has one person's visits; the full dataset has a few thousand people with similar issues.
barometer <- df_pdencs_orig %>%
  select(-EncID, -SiteName, -EOCKey, -ProgramLevel, -ProgramLevelCode,
         -ProcedureDesc, -MedicationValue, -CheckDate, -PdAmount, -PayerType) %>%
  mutate_at(vars(contains("Date")), funs(ymd)) %>%
  filter(DMHID %in% valid_diag$DMHID & DateOfService >= ymd(open_date)) %>%
  group_by(DMHID) %>%
  arrange(DMHID, DateOfService) %>%
  mutate(days_between = DateOfService - lag(DateOfService, n = 1, default = DateOfService[1])) %>%
  mutate(eoc_45dco = 1) %>%
  mutate(eoc_45dco = if_else(days_between >= 45, lag(eoc_45dco) + 1, eoc_45dco)) %>%
  mutate(eoc_45dco2 = if_else(lag(eoc_45dco) > 1, eoc_45dco + 1, eoc_45dco)) %>%
  mutate(id_eoc = as.integer(paste0(DMHID, eoc_45dco))) %>%
  ...
The reprex below works just fine, so I don't think that helps.
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
df <- data.frame(
  date = sample(seq(as.Date('2016/06/01'), as.Date('2017/03/01'), by = "day"), 11),
  days = as.difftime(c(40:50), units = "days")
)
df %>%
  mutate(id = 1234) %>%
  arrange(days) %>%
  mutate(Z = 1) %>%
  mutate(Z = if_else(days >= 45, lag(Z) + 1, Z)) %>%
  mutate(id_eoc = as.integer(paste0(id, Z)))
#>          date    days   id Z id_eoc
#> 1  2016-06-30 40 days 1234 1  12341
#> 2  2016-11-25 41 days 1234 1  12341
#> 3  2016-09-09 42 days 1234 1  12341
#> 4  2017-01-16 43 days 1234 1  12341
#> 5  2016-08-16 44 days 1234 1  12341
#> 6  2016-09-23 45 days 1234 2  12342
#> 7  2016-09-05 46 days 1234 2  12342
#> 8  2016-08-29 47 days 1234 2  12342
#> 9  2016-07-08 48 days 1234 2  12342
#> 10 2017-01-11 49 days 1234 2  12342
#> 11 2017-02-22 50 days 1234 2  12342
Created on 2018-04-17 by the reprex package (v0.2.0).
As such, I think the issue may be with the dates, since subtracting dates gives a difftime rather than an integer.
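For what it's worth, a quick check suggests the difftime itself isn't the blocker, since comparing it against a plain number works; wrapping it in as.numeric() just makes the comparison explicit (a sketch):

d <- as.Date("2017-02-23") - as.Date("2016-12-26")
d                   #> Time difference of 59 days
as.numeric(d) >= 45 #> [1] TRUE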

Days since a variable changed dplyr

Does anyone know of a dplyr method for calculating the number of days since a variable changed (by groups)? For example, consider the number of days since a particular store last changed its price.
library(dplyr)
df <- data.frame(store = c(34, 34, 34, 34, 34, 28, 28, 28, 81, 81),
                 date = c(20111231, 20111224, 20111217, 20111210, 20111203,
                          20111224, 20111217, 20111203, 20111231, 20111224),
                 price = c(3.45, 3.45, 3.45, 3.36, 3.45, 3.17, 3.25, 3.15,
                           3.49, 3.17))
df <- df %>%
  mutate(date = as.Date(as.character(date), format = "%Y%m%d")) %>%
  arrange(store, desc(date)) %>%
  group_by(store) %>%
  mutate(pchange = price - lead(price))
df$days.since.change <- c(7, 14, 0, 21, 14, 7, 7, 0, 7, 0)
I'm trying to use dplyr to generate a variable called days.since.change. For example, store 34 charged $3.45 on 2011-12-31, a price that had been in effect for 21 days (since it charged $3.36 on 2011-12-10). The variable is entered manually above. The challenge is that a store might change its price back to an earlier price level, which invalidates some grouping strategies.
One option is to calculate the number of days between each price listing for each store, add a second grouping variable that groups together consecutive dates during which the price didn't change, and then take the cumulative sum over the days that passed.
I did this with the dataset sorted by date in ascending order, with lag instead of lead, to avoid using arrange twice, but of course you could change this around. I also left the group variable in the dataset, which you likely won't want; you can remove it by ungrouping and then using select (see the sketch after the output).
df %>%
  mutate(date = as.Date(as.character(date), format = "%Y%m%d")) %>%
  arrange(store, date) %>%
  group_by(store) %>%
  mutate(pchange = price - lag(price), dchange = as.numeric(date - lag(date))) %>%
  group_by(store, group = cumsum(c(1, diff(price) != 0))) %>%
  mutate(dchange = cumsum(dchange))
Source: local data frame [10 x 6]
Groups: store, group

   store       date price pchange dchange group
1     28 2011-12-03  3.15      NA      NA     1
2     28 2011-12-17  3.25    0.10      14     2
3     28 2011-12-24  3.17   -0.08       7     3
4     34 2011-12-03  3.45      NA      NA     1
5     34 2011-12-10  3.36   -0.09       7     2
6     34 2011-12-17  3.45    0.09       7     3
7     34 2011-12-24  3.45    0.00      14     3
8     34 2011-12-31  3.45    0.00      21     3
9     81 2011-12-24  3.17      NA      NA     1
10    81 2011-12-31  3.49    0.32       7     2
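As mentioned above, the helper group column can be dropped once the cumulative sums are computed; a minimal sketch of that final cleanup:

library(dplyr)
df %>%
  mutate(date = as.Date(as.character(date), format = "%Y%m%d")) %>%
  arrange(store, date) %>%
  group_by(store) %>%
  mutate(pchange = price - lag(price), dchange = as.numeric(date - lag(date))) %>%
  group_by(store, group = cumsum(c(1, diff(price) != 0))) %>%
  mutate(dchange = cumsum(dchange)) %>%
  ungroup() %>%
  select(-group)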
