R, use newer data to update list - r

This question is very similar to one I asked previously, which has an answer. However, the problem I'm trying to solve has evolved, so I figured I should start fresh.
I have two data frames like so:
df1<-structure(list(protocol_no = c("study1", "study2", "study3",
"study4", "study5", "study6", "study7"), status = c("New", "Open",
"Closed", "New", "PI signoff", "Closed", "Open")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
df2<-structure(list(record_id = c(11, 12, 13, 14, 15, 16), protocol_no = c("study1",
"study2", "study3", "study4", "study5", "study6"), status = c("New",
"Closed", "Closed", "New", "PI signoff", "Closed"), form_1_complete = c(0,
0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
They pretty much reference the same data, but df1 will always be newer and have more rows, whereas df2 is older and has more columns. Also, they will have 20,000+ rows in real life.
I need to update df2 with the new information from df1. This might mean adding new rows that will need to be numbered (the record_id column), and it might mean updating the "status" column if it has changed.
For instance, in this example the row for study7 is new and needs to be added and given record_id = 17 (because 16 is where that list left off). Additionally, the status of study2 changed from Closed to Open (it's 'Open' in df1), so that needs to be changed.
Things that wouldn't work:
In the previous solution it used binding rows and distinct, but in this scenario since study2 has changed and needs to be updated, that would bind two copies of study2 and have trouble distinguishing which to get rid of.
Output I'm looking for:
A dataframe with all 4 columns, with record_id for everything, one row per protocol ('protocol_no'), and any status's that have changed updated to reflect df1. Like so:

If only the status needs updating, a join is enough (note that this update join does not add the new study7 row)
library(data.table)
setDT(df2)[as.data.table(df1), status := i.status, on = .(protocol_no)]
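Or use rows_upsert, which updates matching rows and appends new ones, and then number the new rows with the same code as in the other post (rowid() comes from data.table, loaded above)
library(dplyr)
library(tidyr)
rows_upsert(df2, df1, by = "protocol_no") %>% # update matches, append new protocols
fill(record_id) %>% # new rows temporarily inherit the last record_id
mutate(record_id = record_id + (rowid(record_id) - 1)) # then count up from it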
-output
record_id protocol_no status form_1_complete
1 11 study1 New 0
2 12 study2 Open 0
3 13 study3 Closed 0
4 14 study4 New 0
5 15 study5 PI signoff 0
6 16 study6 Closed 0
7 17 study7 Open NA
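For comparison, the same update can be sketched in base R with match(), numbering new rows from the current maximum record_id. This is only a minimal sketch: the data frames are rebuilt from the question's dput output, and the names idx, updated, new_rows and result are introduced here for illustration.

```r
# Example data rebuilt from the question
df1 <- data.frame(
  protocol_no = paste0("study", 1:7),
  status = c("New", "Open", "Closed", "New", "PI signoff", "Closed", "Open"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  record_id = c(11, 12, 13, 14, 15, 16),
  protocol_no = paste0("study", 1:6),
  status = c("New", "Closed", "Closed", "New", "PI signoff", "Closed"),
  form_1_complete = 0,
  stringsAsFactors = FALSE
)

# Position of each df1 protocol in df2 (NA when the protocol is new)
idx <- match(df1$protocol_no, df2$protocol_no)

# 1. Overwrite the status of protocols that already exist in df2
updated <- df2
updated$status[idx[!is.na(idx)]] <- df1$status[!is.na(idx)]

# 2. Build rows for protocols found only in df1, numbering on from the maximum
new_rows <- df1[is.na(idx), , drop = FALSE]
new_rows$record_id <- max(df2$record_id) + seq_len(nrow(new_rows))
new_rows$form_1_complete <- rep(NA_real_, nrow(new_rows))

# 3. Stack the two pieces, using df2's column order
result <- rbind(updated, new_rows[names(df2)])
rownames(result) <- NULL
result
```

Because the new ids are derived from max(record_id), this variant does not depend on the row order of df2.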

Related

Check to see if value from one column is present in two other columns in one dataframe R

I'd like to figure out a way to compare columns in the SAME data frame, but in such a way that I create a new column called STATUS for the output. I have 3 columns: 1) SNPs, 2) gained, and 3) lost. I want to know if the data in each cell in column 1 is present in either column 2 or 3. If the data from column 1 is present in column 2 then I would want the output to say GAINED, and if it's present in column 3 then the output would be LOST. If it's present in neither then the output will be NEUTRAL.
Here is what I would like:
SNPs GAINED LOST STATUS
1_752566 1_949654 6_30022061 NEUTRAL
1_776546 1_1045331 6_30314321 NEUTRAL
1_832918 1_832918 13_95612033 GAINED
1_914852 1_1247494 1_914852 LOST
I've tried this:
data_frame$status <- data.frame(lapply(data_frame[1], `%in%`, data_frame[2:3]))
but it produces 2 columns that all say NEUTRAL. I believe it's reading per row to see if it matches, but my data isn't organized in that manner such that it will find every match per row. Instead I'd like to search the entire column and have R find the matches in each cell instead of searching per row.
You don't need lapply or anything fancy like that.
data_frame$STATUS = with(data_frame,
ifelse(SNPs %in% GAINED, "GAINED",
ifelse(SNPs %in% LOST, "LOST", "NEUTRAL")
)
)
Note that the way this is written the GAINED condition is checked first so if it is present in both GAINED and LOST the result will be "GAINED".
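If a SNP that appears in both columns should be reported separately rather than as "GAINED", the same idea extends with one more condition. A minimal sketch, assuming toy vectors and a hypothetical "BOTH" label that is not part of the original question:

```r
# Toy vectors (hypothetical values, not the asker's data); "c" is in both lists
snps   <- c("a", "b", "c", "d")
gained <- c("x", "b", "c", "y")
lost   <- c("c", "z", "w", "d")

# %in% tests membership in the whole vector, not row by row
in_gained <- snps %in% gained
in_lost   <- snps %in% lost

# Check the most specific condition first
status <- ifelse(in_gained & in_lost, "BOTH",
          ifelse(in_gained, "GAINED",
          ifelse(in_lost, "LOST", "NEUTRAL")))
status
```

Storing the two membership tests first also makes it easy to inspect which SNPs matched where.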
Using a nested ifelse should work, and be fairly understandable if indented properly:
tbl$status <- ifelse(tbl$SNPs %in% tbl$GAINED, "GAINED",
ifelse(tbl$SNPs %in% tbl$LOST, "LOST", "NEUTRAL") )
> tbl
SNPs GAINED LOST STATUS status
1 1_752566 1_949654 6_30022061 NEUTRAL NEUTRAL
2 1_776546 1_1045331 6_30314321 NEUTRAL NEUTRAL
3 1_832918 1_832918 13_95612033 GAINED GAINED
4 1_914852 1_1247494 1_914852 LOST LOST
A Tidyverse approach with case_when
library(tidyverse)
df <-
structure(
list(
SNPs = c("1_752566", "1_776546", "1_832918", "1_914852"),
GAINED = c("1_949654", "1_1045331", "1_832918", "1_1247494"),
LOST = c("6_30022061", "6_30314321", "13_95612033", "1_914852")
),
row.names = c(NA,-4L),
spec = structure(list(
cols = list(
SNPs = structure(list(), class = c("collector_character",
"collector")),
GAINED = structure(list(), class = c("collector_character",
"collector")),
LOST = structure(list(), class = c("collector_character",
"collector"))
),
default = structure(list(), class = c("collector_guess",
"collector")),
delim = ","
), class = "col_spec"),
class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame")
)
df %>%
mutate(STATUS = case_when(
SNPs %in% GAINED ~ 'GAINED',
SNPs %in% LOST ~ 'LOST',
TRUE ~ 'NEUTRAL'
))
#> # A tibble: 4 × 4
#> SNPs GAINED LOST STATUS
#> <chr> <chr> <chr> <chr>
#> 1 1_752566 1_949654 6_30022061 NEUTRAL
#> 2 1_776546 1_1045331 6_30314321 NEUTRAL
#> 3 1_832918 1_832918 13_95612033 GAINED
#> 4 1_914852 1_1247494 1_914852 LOST
Created on 2022-12-01 with reprex v2.0.2

R, Look up values in other datatable to fill in values

The background
Question edited heavily for clarity
I have data like this:
df<-structure(list(fname = c("Linda", "Bob"), employee_number = c("00000123456",
"654321"), Calendar = c(0, 0), Protocol = c(0, 0), Subject = c(0,
0), CRA = c(0, 0), Regulatory = c(1, 1), Finance = c(0, 1), ResearchNurse = c(0,
0)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
In a previous question I asked on here, I mentioned that I needed to pivot this data from wide to long in order to export it elsewhere. Answers worked great!
Problem is, I discovered that some of the people in my dataset didn't fill out their surveys correctly and have all zeros in certain problematic columns. I.e. when they get pivoted and filtered to "1" values, they get dropped.
Luckily (depending on how you think about it) I can fix their mistakes. If they left those columns blank, I can populate what they should have based on their other columns. I.e. what they filled out under "CRA","Regulatory", "Finance" or "ResearchNurse" will determine whether they get 1's or 0's in "Calendar","Protocol" or "Subject"
To figure out what goes in those columns, we created this matrix of job responsibilities:
jobs<-structure(list(`Roles (existing)` = c("Calendar Build", "Protocol Management",
"Subject Management"), `CRA/ Manager/ Senior` = c(1, 1, 0), Regulatory = c(0,
1, 1), Finance = c(0, 0, 0), `Research Nurse` = c(1, 0, 1)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
So if you're following so far, no matter what "Bob" put in his columns for "Calendar", "Protocol" or "subject" (he currently has zeros), it will be overwritten based on what he put in other columns. So if Bob put a "1" in his 'Regulatory' column, based on that matrix I screenshotted, he should get a 1 in both the protocol and subject columns.
The specific question
So how do I tell R, "look at Bob's 'CRA, Regulatory, Finance, and ResearchNurse' columns, then cross-reference the 'jobs' dataframe, and overwrite his 'Calendar, Protocol, and Subject' columns"?
My expected output in this particular case would be:
One last little detail: I could see instances where (depending on the order), numbers would overwrite each other. I.e. if Bob should get a 1 in protocol because he's got a 1 in regulatory... but he's got a 1 in finance which would mean he should get a 0 in protocol.....
When in doubt, if a column is overwritten with a 1, it should never be turned back into a zero. I hope that makes sense.
I'd suggest converting your logic to ifelse statement(s):
df$Calendar <- ifelse(df$CRA == 1 | df$ResearchNurse == 1, 1, df$Calendar)
df$Protocol <- ifelse(df$CRA == 1 | df$Regulatory == 1, 1, df$Protocol)
df$Subject <- ifelse(df$Regulatory == 1 | df$ResearchNurse == 1, 1, df$Subject)
df
#> fname employee_number Calendar Protocol Subject CRA Regulatory Finance
#> 1 Linda 00000123456 0 1 1 0 1 0
#> 2 Bob 654321 0 1 1 0 1 1
#> ResearchNurse
#> 1 0
#> 2 0
data:
df <- structure(list(
fname = c("Linda", "Bob"),
employee_number = c("00000123456", "654321"),
Calendar = c(0, 0), Protocol = c(0, 0), Subject = c(0, 0),
CRA = c(0, 0), Regulatory = c(1, 1), Finance = c(0, 1),
ResearchNurse = c(0, 0)), row.names = c(NA, -2L), class = c("data.frame"))
Created on 2022-03-28 by the reprex package (v2.0.1)
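The hard-coded ifelse rules above can also be derived from the jobs table itself, so that a change to the matrix does not require rewriting the conditions. A minimal sketch, assuming the jobs columns are renamed to match df; jobs_mat, roles, implied and targets are names introduced here for illustration:

```r
df <- data.frame(
  fname = c("Linda", "Bob"),
  employee_number = c("00000123456", "654321"),
  Calendar = c(0, 0), Protocol = c(0, 0), Subject = c(0, 0),
  CRA = c(0, 0), Regulatory = c(1, 1), Finance = c(0, 1),
  ResearchNurse = c(0, 0),
  stringsAsFactors = FALSE
)

# The jobs table as a matrix: rows are responsibilities, columns are roles.
# Values are copied from the question, filled column by column.
jobs_mat <- matrix(
  c(1, 1, 0,   # CRA
    0, 1, 1,   # Regulatory
    0, 0, 0,   # Finance
    1, 0, 1),  # ResearchNurse
  nrow = 3,
  dimnames = list(
    c("Calendar", "Protocol", "Subject"),
    c("CRA", "Regulatory", "Finance", "ResearchNurse")
  )
)

# A person gets a responsibility if any of their roles grants it
roles   <- as.matrix(df[, colnames(jobs_mat)])
implied <- (roles %*% t(jobs_mat)) > 0

# pmax() keeps any existing 1, so a 1 is never turned back into a 0
targets <- rownames(jobs_mat)
df[targets] <- as.data.frame(pmax(as.matrix(df[targets]), implied * 1))
df
```

The matrix product counts how many of a person's roles grant each responsibility, which is why a role with a 0 (such as Finance for Protocol) cannot cancel a 1 granted by another role.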
Both tables need a common look up value.
So for example in your df table there is an employee_number column. Do you have the same field in the jobs table? If so this is easy to do with left_join() and then a case_when()
You will need to simplify your current jobs table to have some summary value of the logic you put in your post, e.g. (if Bob has a 1 in regulatory then he should get a 1 in protocol and subject columns). This can be done with some table manipulation functions. I can't tell you exactly which ones because I don't fully understand the logic.
Assuming that is clear to you and you know how to summarize that jobs table (and you have the unique employee_number) for each row then the below should work.
left_join(x=df,y=jobs,by="employee_number") %>%
mutate(new_col1=case_when(logic_1 ~ value1,
logic_2 ~ value2,
logic_3 ~ value3,
TRUE ~ default_value))
You can repeat the newcol logic for additional columns as required.
library(tidyverse)
First, by pivoting both df and jobs, the task should become much easier
(df_long <- df %>%
pivot_longer(
cols = -c(fname, employee_number), names_to = "term"
) %>%
filter(value == 1) %>%
select(-value))
#> # A tibble: 3 x 3
#> fname employee_number term
#> <chr> <chr> <chr>
#> 1 Linda 00000123456 Regulatory
#> 2 Bob 654321 Regulatory
#> 3 Bob 654321 Finance
Now, if I understand your question correctly, Bob should have added “Protocol”
and “Subject” in his survey because he works in “Regulatory”. Luckily, we can add
that information for him automatically. We pivot jobs and clean up the
names/terms to match those in df. This can be done like this:
(jobs_long <- jobs %>%
rename(
CRA = `CRA/ Manager/ Senior`, ResearchNurse = `Research Nurse`
) %>%
mutate(
roles = `Roles (existing)` %>% str_extract("^\\w+"),
.keep = "unused"
) %>%
pivot_longer(-roles, names_to = "term") %>%
filter(value == 1) %>%
select(-value))
#> # A tibble: 6 x 2
#> roles term
#> <chr> <chr>
#> 1 Calendar CRA
#> 2 Calendar ResearchNurse
#> 3 Protocol CRA
#> 4 Protocol Regulatory
#> 5 Subject Regulatory
#> 6 Subject ResearchNurse
Once in this shape, we can join the two tables, do some tidying, and then we
end up with the correct information. We could continue from here and wrangle
the data back into the wide shape, but it’s probably more useful like this
so that’s where I would stop.
df_long %>%
left_join(jobs_long, by = c("term" = "term")) %>%
pivot_longer(cols = c(term, roles), values_drop_na = TRUE) %>%
distinct(fname, employee_number, term = value)
#> # A tibble: 7 x 3
#> fname employee_number term
#> <chr> <chr> <chr>
#> 1 Linda 00000123456 Regulatory
#> 2 Linda 00000123456 Protocol
#> 3 Linda 00000123456 Subject
#> 4 Bob 654321 Regulatory
#> 5 Bob 654321 Protocol
#> 6 Bob 654321 Subject
#> 7 Bob 654321 Finance
Created on 2022-03-31 by the reprex package (v1.0.0)
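If the wide shape is needed after all, the long result can be spread back out with pivot_wider(). A minimal sketch, assuming the long table shown above is stored in a data frame called long (a name introduced here); the rows are copied from that output:

```r
library(dplyr)
library(tidyr)

# Long result copied from the output above
long <- data.frame(
  fname = c(rep("Linda", 3), rep("Bob", 4)),
  employee_number = c(rep("00000123456", 3), rep("654321", 4)),
  term = c("Regulatory", "Protocol", "Subject",
           "Regulatory", "Protocol", "Subject", "Finance"),
  stringsAsFactors = FALSE
)

wide <- long %>%
  mutate(value = 1) %>%                  # every listed term becomes a 1
  pivot_wider(
    names_from = term,
    values_from = value,
    values_fill = 0                      # terms a person lacks become 0
  )
wide
```

Terms that nobody has (Calendar, CRA and ResearchNurse in this example) do not appear as columns, so they would have to be added back separately if the export needs them.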

ggplot2 plotting alternate rows from 2 columns

I have a dataframe with 3 columns "ID", "on.tank", "on.mains". the data looks like:
ID: 1,2,3,4,5,6,7,8,9,10 (sequentially)
on.tank: 25,0,10,0,43,0,5
on.mains: 0,12,0,11,0,2,0
so columns 2 and 3 alternate between zero and a value, where when one is a zero the other has a value alternately.
I want to create one column that interleaves each value alternately and a second column which will be a factor on.main, on.tank, on.main, etc alternating as it represents days on tank, then days on mains, then days on tank, etc, etc alternately.
I tried using melt but it doesn't give me alternating it stacks the data so I get on.tank, on.tank, on.tank etc for 2000 rows and then on.mains, on.mains etc
> dput(head(data))
structure(list(ID = 1:6, on.tank = c(0, 56, 0, 1, 0, 97), on.main = c(-1,
0, -9, 0, -18, 0)), .Names = c("ID", "on.tank", "on.main"), row.names = c(NA,
6L), class = "data.frame")
Here's your data:
df <- data.frame(ID=1:7,
on.tank=c( 25,0,10,0,43,0,5),
on.mains=c(0,12,0,11,0,2,0))
Using base R:
df$On.which <- ifelse(df$on.tank > df$on.mains, "on.tank", "on.mains")
This will work unless any of your values are negative. If you have negative values use:
df$On.which <- ifelse(df$on.mains==0, "on.tank", "on.mains")
Does this do what you need? If you remove the quotes you can also use this method to get the values of the columns merged into 1.
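Removing the quotes, as suggested, returns the values themselves instead of the labels. A minimal sketch using the answer's example data; the column name days is an assumption introduced here:

```r
df <- data.frame(ID = 1:7,
                 on.tank  = c(25, 0, 10, 0, 43, 0, 5),
                 on.mains = c(0, 12, 0, 11, 0, 2, 0))

# Label column: which supply each row refers to
df$On.which <- factor(ifelse(df$on.mains == 0, "on.tank", "on.mains"))

# Value column: same test, but returning the columns instead of quoted names
df$days <- ifelse(df$on.mains == 0, df$on.tank, df$on.mains)
df
```

With both columns in place, days can be mapped to the y aesthetic and On.which to fill or colour in ggplot2, for example aes(x = ID, y = days, fill = On.which) with geom_col().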

Using values from two data files for one formula for calculation in R

I'm a trying-to-be R user. I never learned to code properly and have been just doing it by finding stuff online.
I've run into a problem that I need some expert help with.
I have two data files.
Particulate matter (PM) concentrations (~20000 observations)
Coefficient combinations to use with the particulate matter concentrations to calculate final concentrations.
For example..
Data set 1.
ID PM
1 5
2 10
... ...
1500 25
Data set 2.
alpha beta
5 6
1 2
... ...
I ultimately have to use all the coefficient combinations (alpha and beta) for each of the IDs from data set 1. For example, if I have 10 observations in data set 1, and 10 coefficient combinations in data set 2, my output table should have 100 different output values (10*10=100).
for (i in cmaq$FID) {
mean=cmaq$PM*IER$alpha*IER$beta
}
I used the above code to do what I'm trying to do, but it only gave me 10 output values rather than 100. I think using the split function first, and somehow use that with the second dataset would work, but have not figured out how...
It may be a very very simple problem, but after spending hours to figure it out, I thought it may be a better strategy to get some help from R experts.
Thank you in advance!!!
You can do:
df1 = data.frame(
ID = c(1, 2, 1500),
PM = c(5, 10, 25)
)
df2 = data.frame(
alpha = c(5, 6),
beta = c(1, 2)
)
library(tidyverse)
library(dplyr)
df1 %>%
group_by(ID) %>%
do(data.frame(result = .$PM * df2$alpha * df2$beta,
alpha = df2$alpha,
beta = df2$beta))
Look for the term 'cross join' or 'cartesian join' (eg, How to do cross join in R?).
If that doesn't address the issue, please see https://stackoverflow.com/help/mcve. I think there is a mistake inside the loop. beta is free-floating, and not connected to the IER data.frame
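A cross join can be sketched in base R with merge() and by = NULL, which returns the Cartesian product of the two data frames. The example data are the same small stand-ins used in the other answers, and the formula PM * alpha * beta is taken from the loop in the question; combos is a name introduced here:

```r
df1 <- data.frame(ID = c(1, 2, 1500), PM = c(5, 10, 25))
df2 <- data.frame(alpha = c(5, 6), beta = c(1, 2))

# by = NULL pairs every row of df1 with every row of df2
combos <- merge(df1, df2, by = NULL)

# One result per (ID, coefficient combination) pair
combos$result <- combos$PM * combos$alpha * combos$beta

# Sort so that each ID's combinations sit together
combos <- combos[order(combos$ID, combos$alpha), ]
rownames(combos) <- NULL
combos
```

With 3 observations and 2 coefficient combinations this yields 3 * 2 = 6 rows, so 10 observations and 10 combinations would yield the 100 rows asked for.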
We can do this with outer
data.frame(ID = rep(df1$ID, each = nrow(df2)), alpha = df2$alpha,
beta = df2$beta, result = c(t(outer(df1$PM, df2$alpha*df2$beta))))
# ID alpha beta result
#1 1 5 1 25
#2 1 6 2 60
#3 2 5 1 50
#4 2 6 2 120
#5 1500 5 1 125
#6 1500 6 2 300
data
df1 <- structure(list(ID = c(1, 2, 1500), PM = c(5, 10, 25)), .Names = c("ID",
"PM"), row.names = c(NA, -3L), class = "data.frame")
df2 <- structure(list(alpha = c(5, 6), beta = c(1, 2)), .Names = c("alpha",
"beta"), row.names = c(NA, -2L), class = "data.frame")

R - Check if element from vector exist in data.frame, and if not, add dummy values

Having a vector of campaigns:
campaignsTypes <- c("Social Media","Distribution","Nurture","Newsletter","Push")
and a data.frame with information about them:
out <- structure(list(Type = c("Distribution", "Newsletter", "Nurture",
"Social Media"), Pageviews = c(42, 880, 17, 84)), .Names = c("Type",
"Pageviews"), row.names = c(NA, -4L), class = "data.frame")
I want to check whether all elements of the vector campaignsTypes are included in the data.frame out, and if not, create a new row with dummy values for each missing campaign. So far, I can check whether a campaign type is missing. However, I'm having trouble assigning the missing element of the vector as the value of the first column of a manually inserted new row:
> ifelse(campaignsTypes %in% out$Type == FALSE,rbind(out, c(????,0)),"")
How to put the value of the missing campaign here?----------⤴
You can create a new data frame with the missing rows, and then stack the
two data frames.
rbind(out, data.frame(Type=setdiff(campaignsTypes, out$Type),
Pageviews=0L))
Result:
Type Pageviews
1 Distribution 42
2 Newsletter 880
3 Nurture 17
4 Social Media 84
5 Push 0
One way to do it,
output <- rbind(out, campaignsTypes[sapply(campaignsTypes, function(i) !(i %in% out$Type))])
output$Pageviews[output$Pageviews == output$Type] <- 0
output
# Type Pageviews
#1 Distribution 42
#2 Newsletter 880
#3 Nurture 17
#4 Social Media 84
#5 Push 0
