Looping with between() - r

I sort through photos, marking the starting and ending images of photo groups containing animals of interest. The finished product looks something like what's included below. After sorting, I'd normally use the starting and ending photos as markers to move the photos of interest from each subfolder into a main folder for later processing.
Primary.Folder | Sub.folder | Start.Image.. | End.Image..
RPU_03262019_05092019 | 100EK113 | 2019-03-26-11-23-46 | 2019-03-26-11-32-02
RPU_03262019_05092019 | 100EK113 | 2019-03-27-08-35-00 | 2019-03-27-08-35-00
RPU_03262019_05092019 | 101EK113 | 2019-03-31-00-29-58 | 2019-03-31-00-59-58
RPU_03262019_05092019 | 101EK113 | 2019-03-31-01-44-58 | 2019-03-31-01-59-58
RPU_03262019_05092019 | 101EK113 | 2019-03-31-03-14-58 | 2019-03-31-03-44-58
RPU_03262019_05092019 | 101EK113 | 2019-03-31-04-34-58 | 2019-03-31-04-39-58
RPU_03262019_05092019 | 101EK113 | 2019-03-31-05-04-58 | 2019-03-31-05-14-58
RPU_03262019_05092019 | 101EK113 | 2019-03-31-05-44-58 | 2019-03-31-05-44-58
RPU_03262019_05092019 | 101EK113 | 2019-03-31-19-30-58 | 2019-03-31-19-40-58
Using the list of all images, I'm hoping to loop through each row and build a new list containing just the photos of animal subjects that I can file.copy into another folder. I'm hoping between() can help with this.
So far I've removed the .JPG extension from every file in the total photo list to match what's in the sorted CSV, separated the Start.Image.. column into t1 and the End.Image.. column into t2, and tested a for loop to see if they line up.
library(dplyr)    # for the %>% pipe and between()
library(stringr)  # for str_replace_all()

fn <- photolist %>% str_replace_all('\\.JPG', '')  # drop the extension to match the CSV
t1 <- csvfilled[,4]  # Start.Image.. column
t2 <- csvfilled[,5]  # End.Image.. column
# test
for (i in t1) for (j in t2) {
  print(paste(i, j, sep = ","))
}
# using the between() function
for (i in t1) {
  for (j in t2) {
    finalsortedlist <- fn[between(fn, i, j)]
  }
}
The test results show i and j aren't advancing in step. It appears i waits for j to loop through all of its values before it moves on, at which point j loops through again.
"2019-05-09-09-24-24, 2019-05-08-18-35-24"
"2019-05-09-09-24-24, 2019-05-08-19-05-24"
"2019-05-09-09-24-24, 2019-05-08-19-50-24"
"2019-05-09-09-24-24, 2019-05-09-00-09-24"
"2019-05-09-09-24-24, 2019-05-09-09-59-24"
"2019-05-09-09-24-24, 2019-05-09-10-49-24"
Is there a way to run them in sequence like below?
"2019-03-26-11-23-46, 2019-03-26-11-32-02"
"2019-03-27-08-35-00, 2019-03-27-08-35-00"
"2019-03-31-00-29-58, 2019-03-31-00-59-58"
"2019-03-31-01-44-58, 2019-03-31-01-59-58"
"2019-03-31-03-14-58, 2019-03-31-03-44-58"
"2019-03-31-04-34-58, 2019-03-31-04-39-58"
"2019-03-31-05-04-58, 2019-03-31-05-14-58"
"2019-03-31-05-44-58, 2019-03-31-05-44-58"
"2019-03-31-19-30-58, 2019-03-31-19-40-58"
I basically want:
"1,1"
"2,2"
"3,3"
"4,4"
instead of
"1,1"
"1,2"
"1,3"
"1,4"
"2,1"
"2,2"
"2,3"
"2,4"

This would be fairly simple using 'map2' from the purrr library.
finalsortedlist <- map2(t1, t2, ~fn[between(fn, .x, .y)])
Essentially, map2 will take the nth item from both t1 and t2, and pass them as .x and .y respectively to your function. The result will be a list containing the results for every iteration.
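For completeness, here's a minimal sketch of how that result might be used afterwards, assuming fn, t1, and t2 are as defined above; the "sourcefolder" and "mainfolder" paths are hypothetical, purely for illustration:
library(dplyr)  # between()
library(purrr)  # map2()

# one character vector of matching file names per Start/End pair
finalsortedlist <- map2(t1, t2, ~fn[between(fn, .x, .y)])

# collapse the list to a single vector of file names and restore the extension
keepers <- paste0(unlist(finalsortedlist), ".JPG")

# hypothetical folders, just to show the file.copy step mentioned in the question
file.copy(file.path("sourcefolder", keepers), "mainfolder")
Because map2() returns a list with one element per Start/End pair, unlist() flattens it into the single vector of file names needed for copying.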

Related

How to match two columns in one dataframe using values in another dataframe in R

I have two dataframes. One is a set of ≈4000 entries that looks similar to this:
| grade_col1 | grade_col2 |
| --- | --- |
| A-| A-|
| B | 86|
| C+| C+|
| B-| D |
| A | A |
| C-| 72|
| F | 96|
| B+| B+|
| B | B |
| A-| A-|
The other is a set of ≈700 entries that look similar to this:
| grade | scale |
| --- | --- |
| A+|100|
| A+| 99|
| A+| 98|
| A+| 97|
| A | 96|
| A | 95|
| A | 94|
| A | 93|
| A-| 92|
| A-| 91|
| A-| 90|
| B+| 89|
| B+| 88|
...and so on.
What I'm trying to do is create a new column that shows whether grade_col2 matches grade_col1, with a binary 0-1 output (0 = no match, 1 = match). Most of grade_col2 is shown as a letter grade, but every once in a while an entry in grade_col2 was accidentally entered as a numeric grade instead. I want this match column to give me a "1" even when grade_col2 is a numeric grade instead of a letter grade. In other words, if grade_col1 is B and grade_col2 is 86, I want this to still be read as a match. Only when the grades genuinely differ, such as grade_col1 of F with grade_col2 of 96, should it not be a match (just as grade_col1 of B- with grade_col2 of D is not a match).
The second data frame gives me the information I need to translate between one and the other (entries between 97-100 are A+, between 93-96 are A, and so on). I just don't know how to run a script that uses this information to find matches through all ≈4000 entries. Theoretically, I could do this manually, but the real dataset is so lengthy that this isn't realistic.
I had been thinking of using nested if_else statements with dplyr, but once I got past the first "if" statement, I got stuck. I'd appreciate any help people can offer.
You can do this using a join.
Let your first dataframe be grades_df and your second dataframe be lookup_df, then you want something like the following:
library(dplyr)

output = grades_df %>%
  # join on the lookup, keeping everything in the grades table
  # (grade_col2 and scale must share a type, e.g. both character, for the join)
  left_join(lookup_df, by = c(grade_col2 = "scale")) %>%
  # combine grade_col2 from grades_df and grade from lookup_df
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
  # indicator column
  mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))
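As a minimal, self-contained sketch of that pipeline on toy data (values invented here; note the coercion of scale to character so the join columns share a type):
library(dplyr)

grades_df <- tibble(
  grade_col1 = c("A-", "B", "F"),
  grade_col2 = c("A-", "86", "96")
)
lookup_df <- tibble(
  grade = c("A", "A", "B", "B"),
  scale = c(96, 95, 86, 85)
) %>%
  mutate(scale = as.character(scale))  # match the type of grade_col2

grades_df %>%
  left_join(lookup_df, by = c(grade_col2 = "scale")) %>%
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
  mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))
# indicator: 1 for the A-/A- and B/86 rows, 0 for the F/96 row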

Besides indexing, how to speed up this query on 100m rows in PostgreSQL?

Background
First, let me know if this is more appropriate for the DBA StackExchange. Happy to move it there.
I've got a dataset, db1_dummy, with ~100 million rows' worth of car and motorcycle insurance claims that I'm prepping for statistical analysis. It's in PostgreSQL v13, which I'm running on a local 64-bit Windows machine and accessing through DataGrip. db1_dummy has ~15 variables, but only 3 are relevant for this question. Here's a toy version of the dataset:
+-------------------+------------+--+
|member_composite_id|service_date|id|
+-------------------+------------+--+
|eof81j4 |2010-01-12 |1 |
|eof81j4 |2010-06-03 |2 |
|eof81j4 |2011-01-06 |3 |
|eof81j4 |2011-05-21 |4 |
|j42roit |2015-11-29 |5 |
|j42roit |2015-11-29 |6 |
|j42roit |2015-11-29 |7 |
|p8ur0fq |2014-01-13 |8 |
|p8ur0fq |2014-01-13 |9 |
|p8ur0fq |2016-04-04 |10|
|vplhbun |2019-08-15 |11|
|vplhbun |2019-08-15 |12|
|vplhbun |2019-08-15 |13|
|akj3vie |2009-03-31 |14|
+-------------------+------------+--+
id is unique (a primary key), and as you can see member_composite_id identifies policyholders and can have multiple entries (an insurance policyholder can have multiple claims). service_date is just the date a policyholder's vehicle was serviced for an insurance claim.
I need to get the data into a certain format in order to run my analyses, all of which are regression-based implementations of survival analysis in R (Cox proportional hazards models with shared frailty, if anyone's interested). Three main things need to happen:
1. service_date needs to be converted into an integer counted up from 2009-01-01 (days since January 1st, 2009, in other words) and renamed service_date_2.
2. A new column, service_date_1, needs to be created, containing one of two things for each row: 0 if that row is the first for that member_composite_id, or, if it isn't the first, the value of service_date_2 from that member_composite_id's previous row.
3. Since the interval (the difference) between service_date_1 and service_date_2 cannot equal zero, a small amount (0.1) should be subtracted from service_date_1 in such cases.
That may sound confusing, so let me just show you. Here's what I need the dataset to look like:
+--+-------------------+--------------+--------------+
|id|member_composite_id|service_date_1|service_date_2|
+--+-------------------+--------------+--------------+
|1 |eof81j4 |0 |376 |
|2 |eof81j4 |376 |518 |
|3 |eof81j4 |518 |735 |
|4 |eof81j4 |735 |870 |
|5 |j42roit |0 |2523 |
|6 |j42roit |2522.9 |2523 |
|7 |j42roit |2522.9 |2523 |
|8 |p8ur0fq |0 |1838 |
|9 |p8ur0fq |1837.9 |1838 |
|10|p8ur0fq |1838 |2650 |
|11|vplhbun |0 |3878 |
|12|vplhbun |3877.9 |3878 |
|13|vplhbun |3877.9 |3878 |
|14|akj3vie |0 |89 |
+--+-------------------+--------------+--------------+
The good news: I have a query that can do this -- indeed, this query spat out the output above. Here's the query:
CREATE TABLE db1_dummy_2 AS
SELECT
d1.id
, d1.member_composite_id
,
CASE
WHEN (COALESCE(MAX(d2.service_date)::TEXT,'') = '') THEN 0
WHEN (MAX(d2.service_date) - '2009-01-01'::DATE = d1.service_date - '2009-01-01'::DATE) THEN d1.service_date - '2009-01-01'::DATE - 0.1
ELSE MAX(d2.service_date) - '2009-01-01'::DATE
END service_date_1
, d1.service_date - '2009-01-01'::DATE service_date_2
FROM db1_dummy d1
LEFT JOIN db1_dummy d2
ON d2.member_composite_id = d1.member_composite_id
AND d2.service_date <= d1.service_date
AND d2.id < d1.id
GROUP BY
d1.id
, d1.member_composite_id
, d1.service_date
ORDER BY
d1.id;
The Problem
The bad news is that while this query runs very speedily on the dummy dataset I've given you here, it takes an interminably long time on the "real" dataset of ~100 million rows. I've waited as much as 9.5 hours for it to finish, with zero luck.
My question is mainly: is there a faster way to do what I'm asking Postgres to do?
What I've tried
I'm no database genius by any means, so the best I've come up with here is to index the variables being used in the query:
create index index_member_comp_id on db1_dummy(member_composite_id)
And so on like that for id, too. But it doesn't seem to make a dent, time-wise. I'm not sure how to benchmark code in Postgres, but it's a bit of a moot point if I can't get the query to run after 10 hours. I've also thought of trimming some variables in the dataset (ones I won't need for analysis), but that only gets me down from ~15 columns to ~11.
I had outside help with the query above, but they're unsure (for now) about how to approach this issue, too. So I decided to see if the boffins on SO have any ideas. Thanks in advance for your kind help.
EDIT
Per Laurenz's request, here's the output for EXPLAIN on the version of the query I've given you here:
+-------------------------------------------------------------------------------------+
|QUERY PLAN |
+-------------------------------------------------------------------------------------+
|GroupAggregate (cost=2.98..3.72 rows=14 width=76) |
| Group Key: d1.id |
| -> Sort (cost=2.98..3.02 rows=14 width=44) |
| Sort Key: d1.id |
| -> Hash Left Join (cost=1.32..2.72 rows=14 width=44) |
| Hash Cond: (d1.member_composite_id = d2.member_composite_id) |
| Join Filter: ((d2.service_date <= d1.service_date) AND (d2.id < d1.id))|
| -> Seq Scan on db1_dummy d1 (cost=0.00..1.14 rows=14 width=40) |
| -> Hash (cost=1.14..1.14 rows=14 width=40) |
| -> Seq Scan on db1_dummy d2 (cost=0.00..1.14 rows=14 width=40) |
+-------------------------------------------------------------------------------------+
Your query is a real server killer(*). Use the window function lag().
select
    id,
    member_composite_id,
    case service_date_1
        when service_date_2 then service_date_1 - .1
        else service_date_1
    end as service_date_1,
    service_date_2
from (
    select
        id,
        member_composite_id,
        lag(service_date, 1, '2009-01-01') over w - '2009-01-01' as service_date_1,
        service_date - '2009-01-01' as service_date_2
    from db1_dummy
    window w as (partition by member_composite_id order by id)
) main_query
order by id;
Create the index before running the query
create index on db1_dummy(member_composite_id, id)
Read more in the docs:
3.5. Window Functions
9.22. Window Functions
4.2.8. Window Function Calls
(*) The query produces several additional records for each member_composite_id; in the worst case, this is half the Cartesian product. So before the server can group and calculate aggregates, it has to create several hundred million rows. My laptop couldn't stand it: the server ran out of memory on a table with a million rows. Self-joins are always suspicious, especially on large tables.

Appending text to an existing file from inside a parallel loop

I'm currently processing some data in parallel as part of a larger loop that essentially looks something like this:
for (i in files) {
  doFunction
  foreach (j = 1:100) %dopar% {
    parallelRegFun
  }
}
The doFunction extracts average data for each year, while the parallelRegFun regresses the data with maximally overlapping windows (not always 1:100, sometimes it can be 1:1000+, which is why I'm doing it in parallel).
Part of the parallelRegFun involves writing data to CSV
write_csv(parallelResults,
          path = "./outputFile.csv",
          append = TRUE, col_names = FALSE)
The issue is that, quite often, when writing to the output file the data is appended to an existing row, or a blank row is written. For example, the output might look like this:
+-----+-------+------+---+-------+------+
| Uid | X | Y | | | |
+-----+-------+------+---+-------+------+
| 1 | 0.79 | 2.37 | | | |
+-----+-------+------+---+-------+------+
| 2 | -1.88 | 3.53 | 3 | -0.54 | 3.32 |
+-----+-------+------+---+-------+------+
| | | | | | |
+-----+-------+------+---+-------+------+
| 5 | -0.18 | 1.45 | | | |
+-----+-------+------+---+-------+------+
This requires extensive clean-up afterwards, but when some of the output files are 100+ MB, it's a lot of data to have to inspect and clean manually. It also appears that if a blank row is written, the output for that row is completely missing - i.e. it's not in the data that gets appended to an existing row.
Is there any way to get the doParallel workers to check whether the file is being accessed and, if it is, to wait until it's free before appending their output?
I thought something like Sys.sleep() before the write_csv command would work, as it would force each worker to wait a different amount of time before writing, but this doesn't appear to work in the testing I've done.
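One common workaround, shown here only as a sketch under the question's placeholder names (parallelRegFun is assumed to return a one-row data frame), is to have the workers return their results rather than write them, so that a single writer appends to the file:
library(doParallel)
library(readr)

cl <- makeCluster(4)
registerDoParallel(cl)

# nothing is written inside the parallel loop; each worker just returns its rows
results <- foreach(j = 1:100, .combine = rbind) %dopar% {
  parallelRegFun(j)
}

stopCluster(cl)

# one writer, one append: rows can no longer interleave or collide
write_csv(results, "./outputFile.csv", append = TRUE, col_names = FALSE)
If the results are too large to hold in memory, a file-locking package such as filelock can serialize access to the output file instead.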

How are complex conditions represented in decision table

I am trying to model a decision table template.
I understand that simple rules like
(x>10 and y<10) print "red"
can be represented in a decision table with one row, using two columns for the conditions and one column for the action:
+-----+-----+-------------+
| X | Y | Action |
+-----+-----+-------------+
| >10 | <10 | Print "red" |
+-----+-----+-------------+
How are conditions like
((x>10 and y<10) or x>1) or z<5 and y>5 print "red" represented in decision tables?
I assume the above big condition is represented as many rows, one for each combination of mini conditions that is true, with the same action repeated. Is there any method to reduce conditions like this to decision tables?
However, in that case the action is fired by multiple rows, whereas we have only one action. Is there a column for grouping?
One approach is to give actions numbers, and reference them from decision tables. If an action has been fired during an evaluation run, subsequent firings are ignored.
Here is an example:
+-----+-----+-----+--------+
| X | Y | Z | Action |
+-----+-----+-----+--------+
| >10 | >10 | - | 1 |
+-----+-----+-----+--------+
| >10 | <10 | - | 2 |
+-----+-----+-----+--------+
| >50 | - | - | 2 |
+-----+-----+-----+--------+
| - | - | >5 | 2 |
+-----+-----+-----+--------+
The action number corresponds to an action in this table:
+-----+--------------+
| # | Action |
+-----+--------------+
| 1 | Print "red" |
+-----+--------------+
| 2 | Print "blue" |
+-----+--------------+
If action #2 is fired because x>10 AND y<10, it wouldn't fire again even if x>50 or z>5.
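To make that concrete, here is a rough sketch in R (not part of the original answer; the rule encoding and the evaluate() helper are invented for illustration) of a decision table whose rows reference numbered actions, with repeated firings collapsed to one:
# each row is a rule: bounds on x, y, z plus the number of the action it fires
rules <- data.frame(
  x_min  = c(10, 10, 50, -Inf),
  y_min  = c(10, -Inf, -Inf, -Inf),
  y_max  = c(Inf, 10, Inf, Inf),
  z_min  = c(-Inf, -Inf, -Inf, 5),
  action = c(1, 2, 2, 2)
)
actions <- c("1" = 'print "red"', "2" = 'print "blue"')

evaluate <- function(x, y, z) {
  hit <- rules$action[x > rules$x_min & y > rules$y_min & y < rules$y_max & z > rules$z_min]
  actions[as.character(unique(hit))]  # unique(): an action fires at most once per run
}

evaluate(x = 60, y = 5, z = 10)  # several action-2 rows match, but action 2 fires only once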

Last matching date in spreadsheet function

I have a spreadsheet where dates are being recorded for individuals, along with additional data, like so:
Tom | xyz | 5/2/2012
Dick | foo | 5/2/2012
Tom | bar | 6/1/2012
On another sheet there is a line where I want to be able to enter a name, such as Tom, and retrieve in the following cells, through formulas, the data for the LAST (most recent by date) entry in the first sheet. So the first sheet is a log, and the second sheet displays the most recent entry. In the following example, the first cell is entered and the remaining cells are formulas displaying data from the first sheet:
Tom | bar | 6/1/2012
and so on, showing the latest dated entry in the log.
I'm stumped, any ideas?
If you only need to do a single lookup, you can do that by adding two new columns in your log sheet:
Sheet1
| A | B | C | D | E | F
1 | Tom | xyz | 6/2/2012 | | * | *
2 | Dick | foo | 5/2/2012 | | * | *
3 | Tom | bar | 6/1/2012 | | * | *
Sheet2
| A | B | C
1 | Tom | =Sheet1.E1 | =Sheet1.F1
*(E1) = =IF(AND($A1=Sheet2.$A$1;E2=0);B1;E2)
(i.e. paste the formula above in E1, then copy/paste it in the other cells with *)
Explanation: if A is not what you're looking for, go for the next; if it is, but there is a non-empty next, go for the next; otherwise, get it. This way you're selecting the last one corresponding to your search. I'm assuming you want the last entry, not "the one with the most recent date", since that's what you asked in your example. If I interpreted your question wrong, please update it and I can try to provide a better answer.
Update: If the log dates can be out of order, here's how you get the last entry:
*(F1) = =IF(AND($A1=Sheet2.$A$1;C1>=F2);C1;F2)
*(E1) = =IF(C1=F1;B1;E2)
Here I just replaced the test F2=0 (select next if non-empty) with C1>=F2 (select next if more recent) and, for the other column, select next if the first test also did so.
Disclaimer: I'm very inexperienced with spreadsheets, the solution above is ugly but gets the job done. For instance, if you wanted a 2nd row in Sheet2 to do another lookup, you'd need to add two more columns to Sheet1, etc.
