Appending text to an existing file from inside a parallel loop - r

I'm currently processing some data in parallel as part of a larger loop that essentially looks something like this:
for (i in files) {
  doFunction
  foreach (j = 1:100) %dopar% {
    parallelRegFun
  }
}
The doFunction extracts average data for each year, while the parallelRegFun regresses the data with maximally overlapping windows (not always 1:100, sometimes it can be 1:1000+, which is why I'm doing it in parallel).
Part of the parallelRegFun involves writing data to CSV
write_csv(parallelResults,
          path = "./outputFile.csv",
          append = TRUE, col_names = FALSE)
The issue is that quite often, when writing to the output file, the data is appended to an existing row, or a blank row is written. For example, the output might look like this:
+-----+-------+------+---+-------+------+
| Uid | X | Y | | | |
+-----+-------+------+---+-------+------+
| 1 | 0.79 | 2.37 | | | |
+-----+-------+------+---+-------+------+
| 2 | -1.88 | 3.53 | 3 | -0.54 | 3.32 |
+-----+-------+------+---+-------+------+
| | | | | | |
+-----+-------+------+---+-------+------+
| 5 | -0.18 | 1.45 | | | |
+-----+-------+------+---+-------+------+
This requires extensive clean-up afterwards, but when some of the output files are 100+MB it's a lot of data to have to inspect manually and clean. It also appears that if a blank row is written, the output for that row is completely missing from the output - i.e. it's not in the data that gets appended to an existing row.
Is there any way to get the doParallel workers to check whether a file is being accessed and, if it is, to wait until it's not before appending the output?
I thought something like Sys.sleep() before the write_csv command would work, as it would force each worker to wait a different amount of time before writing, but this doesn't appear to work in the testing I've done.
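One way to sidestep the interleaved writes, sketched here under assumptions rather than taken from the original code, is to have each worker return its rows and let the parent process write the file once per outer iteration. The body of parallelRegFun and the temporary output path below are hypothetical stand-ins; the sketch assumes the foreach and doParallel packages are installed.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Hypothetical stand-in for parallelRegFun: it returns one row of results
# instead of writing to disk from inside the worker.
parallelRegFun <- function(j) {
  data.frame(Uid = j, X = j / 10, Y = j * 2)
}

# Each worker returns its rows; foreach binds them together in the parent.
parallelResults <- foreach(j = 1:100, .combine = rbind) %dopar% {
  parallelRegFun(j)
}

stopCluster(cl)

# Only the parent process touches the file, so rows cannot interleave.
out_file <- file.path(tempdir(), "outputFile.csv")
write.table(parallelResults, out_file, sep = ",",
            append = file.exists(out_file),
            col.names = FALSE, row.names = FALSE)
```

Because the single write happens outside the %dopar% block, no locking or Sys.sleep() is needed; the trade-off is that all results of one inner loop are held in memory until they are written.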

How to match two columns in one dataframe using values in another dataframe in R

I have two dataframes. One is a set of ≈4000 entries that looks similar to this:
| grade_col1 | grade_col2 |
| --- | --- |
| A-| A-|
| B | 86|
| C+| C+|
| B-| D |
| A | A |
| C-| 72|
| F | 96|
| B+| B+|
| B | B |
| A-| A-|
The other is a set of ≈700 entries that look similar to this:
| grade | scale |
| --- | --- |
| A+|100|
| A+| 99|
| A+| 98|
| A+| 97|
| A | 96|
| A | 95|
| A | 94|
| A | 93|
| A-| 92|
| A-| 91|
| A-| 90|
| B+| 89|
| B+| 88|
...and so on.
What I'm trying to do is create a new column that shows whether grade_col2 matches grade_col1 with a binary, 0-1 output (0 = no match, 1 = match). Most of grade_col2 is shown by letter grade. But every once in a while an entry in grade_col2 was accidentally entered as a numeric grade instead. I want this match column to give me a "1" even when grade_col2 is a numeric grade instead of a letter grade. In other words, if grade_col1 is B and grade_col2 is 86, I want this to still be read as a match. Only when grade_col1 is F and grade_col2 is 96 would this not be a match (similar to when grade_col1 is B- and grade_col2 is D = not a match).
The second data frame gives me the information I need to translate between one and the other (entries between 97-100 are A+, between 93-96 are A, and so on). I just don't know how to run a script that uses this information to find matches through all ≈4000 entries. Theoretically, I could do this manually, but the real dataset is so lengthy that this isn't realistic.
I had been thinking of using nested if_else statements with dplyr. But once I got past the first "if" statement, I got stuck. I'd appreciate any help with this people can offer.
You can do this using a join.
Let your first dataframe be grades_df and your second dataframe be lookup_df, then you want something like the following:
library(dplyr)

output = grades_df %>%
  # join on the lookup table, keeping everything in the grades table
  # (scale is converted to character so its type matches grade_col2)
  left_join(mutate(lookup_df, scale = as.character(scale)),
            by = c(grade_col2 = "scale")) %>%
  # combine grade_col2 from grades_df and grade from lookup_df
  mutate(grade_col2b = ifelse(is.na(grade), grade_col2, grade)) %>%
  # indicator column: 1 = match, 0 = no match
  mutate(indicator = ifelse(grade_col1 == grade_col2b, 1, 0))
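To see the lookup logic in isolation, here is a minimal base R sketch of the same idea using match() instead of a join. The two small data frames are toy data made up for illustration (the 86 = B row is implied by the question, not shown in the lookup excerpt), and both grade columns are assumed to be character vectors.

```r
# Toy data (assumption): a few rows in the shape described in the question.
grades_df <- data.frame(
  grade_col1 = c("A-", "B", "F", "B-"),
  grade_col2 = c("A-", "86", "96", "D"),
  stringsAsFactors = FALSE
)
lookup_df <- data.frame(
  grade = c("A", "B+", "B"),
  scale = c(96, 89, 86),
  stringsAsFactors = FALSE
)

# Find the lookup row for each numeric entry; letter grades give NA.
idx <- match(grades_df$grade_col2, as.character(lookup_df$scale))

# Translate numeric entries to letters, keep letter entries as they are.
grade_col2b <- ifelse(is.na(idx), grades_df$grade_col2, lookup_df$grade[idx])

# 1 = match, 0 = no match.
grades_df$indicator <- as.integer(grades_df$grade_col1 == grade_col2b)
```

With this toy data the first two rows are matches and the last two are not, which mirrors the B/86 and F/96 cases from the question.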

Cumulative count of occurrences per value in array in Kusto

I'm looking to get the count of query param usage from the query string from page views stored in app insights using KQL. My query currently looks like:
pageViews
| project parsed=parseurl(url)
| project keys=bag_keys(parsed["Query Parameters"])
and the results look like this (screenshot not included), with each row looking like this (screenshot not included).
I'm looking to get the count of each value in the list when it is contained in the url, in order to answer the question "How many times does page appear in the querystring?". So the results might look like:
Page | From | ...
1000 | 67 | ...
Thanks in advance
You could try something along the following lines:
datatable(url:string)
[
"https://a.b.c/d?p1=hello&p2=world",
"https://a.b.c/d?p2=world&p3=foo&p4=bar"
]
| project parsed = parseurl(url)
| project keys = bag_keys(parsed["Query Parameters"])
| mv-expand key = ['keys'] to typeof(string)
| summarize count() by key
which returns:
| key | count_ |
|-----|--------|
| p1 | 1 |
| p2 | 2 |
| p3 | 1 |
| p4 | 1 |

Looping with between()

I sort through photos, marking the starting and ending images of photo groups containing animals of interest. The finished product looks something like what's included below. After sorting, I'd normally use the starting and ending photos as markers to move photos of interest from each subfolder into a main folder for later processing.
Primary.Folder | Sub.folder | Start.Image.. | End.Image..
RPU_03262019_05092019 | 100EK113 | 2019-03-26-11-23-46 | 2019-03-26-11-32-02
RPU_03262019_05092019 | 100EK113 | 2019-03-27-08-35-00 | 2019-03-27-08-35-00
RPU_03262019_05092019 | 101EK113 | 2019-03-31-00-29-58 | 2019-03-31-00-59-58
RPU_03262019_05092019 | 101EK113 | 2019-03-31-01-44-58 | 2019-03-31-01-59-58
RPU_03262019_05092019 | 101EK113 | 2019-03-31-03-14-58 | 2019-03-31-03-44-58
RPU_03262019_05092019 | 101EK113 | 2019-03-31-04-34-58 | 2019-03-31-04-39-58
RPU_03262019_05092019 | 101EK113 | 2019-03-31-05-04-58 | 2019-03-31-05-14-58
RPU_03262019_05092019 | 101EK113 | 2019-03-31-05-44-58 | 2019-03-31-05-44-58
RPU_03262019_05092019 | 101EK113 | 2019-03-31-19-30-58 | 2019-03-31-19-40-58
By having a list of the total images I'm hoping to loop my way through each row and build a new list of photos of just animal subjects that I can file.copy into another folder. I'm hoping between can help with this.
So far I've removed the .JPG from every file in the total photo list to match what's in the sorted csv, separated the Start.Image.. column into t1 and the End.Image.. column into t2, and tested a for loop to see if they line up.
fn <- photolist %>% str_replace_all('\\.JPG', '')
t1 <- csvfilled[,4]
t2 <- csvfilled[,5]
# test
for (i in t1) for (j in t2) {
  print(paste(i, j, sep = ","))
}
# using between() function
for (i in t1) {
  for (j in t2) {
    finalsortedlist <- (fn[between(fn, i, j)])
  }
}
The test results show that i and j are not advancing together. It appears i waits for j to loop through all of its values before it moves on, at which point j loops through them again.
"2019-05-09-09-24-24, 2019-05-08-18-35-24"
"2019-05-09-09-24-24, 2019-05-08-19-05-24"
"2019-05-09-09-24-24, 2019-05-08-19-50-24"
"2019-05-09-09-24-24, 2019-05-09-00-09-24"
"2019-05-09-09-24-24, 2019-05-09-09-59-24"
"2019-05-09-09-24-24, 2019-05-09-10-49-24"
Is there a way to run them in sequence like below?
"2019-03-26-11-23-46, 2019-03-26-11-32-02"
"2019-03-27-08-35-00, 2019-03-27-08-35-00"
"2019-03-31-00-29-58, 2019-03-31-00-59-58"
"2019-03-31-01-44-58, 2019-03-31-01-59-58"
"2019-03-31-03-14-58, 2019-03-31-03-44-58"
"2019-03-31-04-34-58, 2019-03-31-04-39-58"
"2019-03-31-05-04-58, 2019-03-31-05-14-58"
"2019-03-31-05-44-58, 2019-03-31-05-44-58"
"2019-03-31-19-30-58, 2019-03-31-19-40-58"
I basically want:
"1,1"
"2,2"
"3,3"
"4,4"
instead of
"1,1"
"1,2"
"1,3"
"1,4"
"2,1"
"2,2"
"2,3"
"2,4"
This would be fairly simple using 'map2' from the purrr package.
finalsortedlist <- map2(t1, t2, ~fn[between(fn, .x, .y)])
Essentially, map2 will take the nth item from both t1 and t2, and pass them as .x and .y respectively to your function. The result will be a list containing the results for every iteration.
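The pairing behaviour can also be sketched in base R with Map(), which walks two vectors in lockstep in the same way. The file names and time stamps below are toy values chosen for illustration, and the inclusive comparison stands in for between().

```r
# Toy stand-ins (assumptions) for the photo list and the two CSV columns.
fn <- c("2019-03-26-11-20-00", "2019-03-26-11-23-46", "2019-03-26-11-30-00",
        "2019-03-26-11-32-02", "2019-03-27-08-35-00", "2019-03-28-00-00-00")
t1 <- c("2019-03-26-11-23-46", "2019-03-27-08-35-00")
t2 <- c("2019-03-26-11-32-02", "2019-03-27-08-35-00")

# Map() pairs the nth start with the nth end (1,1 then 2,2), never 1,2.
# The comparison is the inclusive range test that between() performs.
per_row <- Map(function(start, end) fn[fn >= start & fn <= end], t1, t2)

# Flatten the per-row results into one vector of photo names.
finalsortedlist <- unlist(per_row, use.names = FALSE)
```

The flattened vector is the form that would typically be handed to file.copy, once the .JPG extension is added back to each name.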

How to separate out letters in a sentence using R

I have a character vector that is a string of letters and punctuation. I want to create a data frame where each column is made up of a letter/character from this string.
e.g.
Character string = I WENT TO THE FAIR
Dataframe = | I | | W | E | N | T | | T | O | | T | H | E | | F | A | I | R |
I thought I could do this using a loop with substr, but I can't work out how to get R to write into separate columns, rather than just writing over the previous letter. I'm new to writing loops etc so struggling a bit to get my head around the way in which to compose what I need.
Thanks for any help and advice that you can offer.
Best wishes,
Natalie
This should get that result:
string <- "I WENT TO THE FAIR"
df <- as.data.frame(t(as.data.frame(strsplit(string,""))), row.names = "1")
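As a step-by-step sketch of what the one-liner does: strsplit with an empty pattern splits the string into single characters, and transposing that vector turns it into one row with one column per character. The default column names V1, V2, ... come from as.data.frame and are not part of the original answer.

```r
string <- "I WENT TO THE FAIR"

# Splitting on "" yields one element per character, spaces included.
chars <- strsplit(string, "")[[1]]

# t() turns the vector into a 1-row matrix, so each character gets a column.
df <- as.data.frame(t(chars), stringsAsFactors = FALSE)

dim(df)  # one row, one column per character
```

No loop with substr is needed, because the split already produces every character at once.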

Last matching date in spreadsheet function

I have a spreadsheet where dates are being recorded in regards to individuals, with additional data, as such:
Tom | xyz | 5/2/2012
Dick | foo | 5/2/2012
Tom | bar | 6/1/2012
On another sheet there is a line in which I want to be able to put in the name, such as Tom, and retrieve on the following cell through a formula the data for the LAST (most recent by date) entry in the first sheet. So the first sheet is a log, and the second sheet displays the most recent one. In the following example, the first cell is entered and the remaining are formulas displaying data from the first sheet:
Tom | bar | 6/1/2012
and so on, showing the latest dated entry in the log.
I'm stumped, any ideas?
If you only need to do a single lookup, you can do that by adding two new columns in your log sheet:
Sheet1
| A | B | C | D | E | F
1 | Tom | xyz | 6/2/2012 | | * | *
2 | Dick | foo | 5/2/2012 | | * | *
3 | Tom | bar | 6/1/2012 | | * | *
Sheet2
| A | B | C
1 | Tom | =Sheet1.E1 | =Sheet1.F1
*(E1) = =IF(AND($A1=Sheet2.$A$1;E2=0);B1;E2)
(i.e. paste the formula above in E1, then copy/paste it in the other cells with *)
Explanation: if A is not what you're looking for, go for the next; if it is, but there is a non-empty next, go for the next; otherwise, get it. This way you're selecting the last one corresponding to your search. I'm assuming you want the last entry, not "the one with the most recent date", since that's what you asked in your example. If I interpreted your question wrong, please update it and I can try to provide a better answer.
Update: If the log dates can be out of order, here's how you get the last entry:
*(F1) = =IF(AND($A1=Sheet2.$A$1;C1>=F2);C1;F2)
*(E1) = =IF(C1=F1;B1;E2)
Here I just replaced the test F2=0 (select next if non-empty) for C1>=F2 (select next if more recent) and, for the other column, select next if the first test also did so.
Disclaimer: I'm very inexperienced with spreadsheets, the solution above is ugly but gets the job done. For instance, if you wanted a 2nd row in Sheet2 to do another lookup, you'd need to add two more columns to Sheet1, etc.
