R: Look up values in another data table to fill in values

The background
Question edited heavily for clarity
I have data like this:
df <- structure(list(
  fname = c("Linda", "Bob"),
  employee_number = c("00000123456", "654321"),
  Calendar = c(0, 0), Protocol = c(0, 0), Subject = c(0, 0),
  CRA = c(0, 0), Regulatory = c(1, 1), Finance = c(0, 1),
  ResearchNurse = c(0, 0)),
  row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
In a previous question I asked on here, I mentioned that I needed to pivot this data from wide to long in order to export it elsewhere. Answers worked great!
The problem is, I discovered that some of the people in my dataset didn't fill out their surveys correctly and have all zeros in certain problematic columns, i.e. when their data gets pivoted and filtered to "1" values, they get dropped.
Luckily (depending on how you think about it), I can fix their mistakes. If they left those columns blank, I can populate what they should have entered based on their other columns, i.e. what they filled out under "CRA", "Regulatory", "Finance" or "ResearchNurse" will determine whether they get 1s or 0s in "Calendar", "Protocol" or "Subject".
To figure out what goes in those columns, we created this matrix of job responsibilities:
jobs <- structure(list(
  `Roles (existing)` = c("Calendar Build", "Protocol Management", "Subject Management"),
  `CRA/ Manager/ Senior` = c(1, 1, 0),
  Regulatory = c(0, 1, 1),
  Finance = c(0, 0, 0),
  `Research Nurse` = c(1, 0, 1)),
  row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
So if you're following so far: no matter what Bob put in his "Calendar", "Protocol" or "Subject" columns (he currently has zeros), they will be overwritten based on what he put in the other columns. For example, if Bob put a "1" in his Regulatory column, then based on the jobs matrix above he should get a 1 in both the Protocol and Subject columns.
The specific question
So how do I tell R: look at Bob's CRA, Regulatory, Finance and ResearchNurse columns, cross-reference the jobs data frame, and overwrite his Calendar, Protocol and Subject columns?
My expected output in this particular case would be for both Linda and Bob to end up with a 1 in their Protocol and Subject columns (with Calendar left at 0).
One last little detail: I could see instances where, depending on the order, numbers would overwrite each other, e.g. Bob should get a 1 in Protocol because he has a 1 in Regulatory, but he also has a 1 in Finance, which on its own would give him a 0 in Protocol.
When in doubt: once a column has been set to a 1, it should never be turned back into a zero. I hope that makes sense.

I'd suggest converting your logic to ifelse statement(s):
df$Calendar <- ifelse(df$CRA == 1 | df$ResearchNurse == 1, 1, df$Calendar)
df$Protocol <- ifelse(df$CRA == 1 | df$Regulatory == 1, 1, df$Protocol)
df$Subject <- ifelse(df$Regulatory == 1 | df$ResearchNurse == 1, 1, df$Subject)
df
#> fname employee_number Calendar Protocol Subject CRA Regulatory Finance
#> 1 Linda 00000123456 0 1 1 0 1 0
#> 2 Bob 654321 0 1 1 0 1 1
#> ResearchNurse
#> 1 0
#> 2 0
data:
df <- structure(list(
fname = c("Linda", "Bob"),
employee_number = c("00000123456", "654321"),
Calendar = c(0, 0), Protocol = c(0, 0), Subject = c(0, 0),
CRA = c(0, 0), Regulatory = c(1, 1), Finance = c(0, 1),
ResearchNurse = c(0, 0)), row.names = c(NA, -2L), class = c("data.frame"))
Created on 2022-03-28 by the reprex package (v2.0.1)
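If you would rather derive these rules from the jobs table than hard-code them, here is a rough generalisation (an editorial sketch, not part of the original answer; it assumes the jobs rows map, in order, to Calendar, Protocol and Subject, and that its role columns correspond to CRA, Regulatory, Finance and ResearchNurse). pmax() guarantees that an existing 1 is never turned back into a 0:
# cross-multiply the survey answers with the jobs matrix, then keep the larger
# of the existing and the derived value so a 1 is never lost
rules   <- as.matrix(jobs[, c("CRA/ Manager/ Senior", "Regulatory", "Finance", "Research Nurse")])
answers <- as.matrix(df[, c("CRA", "Regulatory", "Finance", "ResearchNurse")])
derived <- (answers %*% t(rules)) > 0   # rows = people, columns = Calendar/Protocol/Subject
df[, c("Calendar", "Protocol", "Subject")] <-
  as.data.frame(pmax(as.matrix(df[, c("Calendar", "Protocol", "Subject")]), derived * 1))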

Both tables need a common lookup value.
So, for example, in your df table there is an employee_number column. Do you have the same field in the jobs table? If so, this is easy to do with left_join() and then a case_when().
You will need to simplify your current jobs table to have some summary value of the logic you put in your post, e.g. if Bob has a 1 in Regulatory then he should get a 1 in the Protocol and Subject columns. This can be done with some table manipulation functions. I can't tell you exactly which ones because I don't fully understand the logic.
Assuming that is clear to you and you know how to summarize that jobs table (and you have a unique employee_number for each row), then the code below should work.
left_join(x = df, y = jobs, by = "employee_number") %>%
  mutate(new_col1 = case_when(logic_1 ~ value1,
                              logic_2 ~ value2,
                              logic_3 ~ value3,
                              TRUE ~ default_value))
You can repeat the newcol logic for additional columns as required.
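For illustration only (an editorial sketch, not from the answer above): with the data as posted there is no employee_number in jobs to join on, but the case_when() half could look something like this for one of the columns, using the role columns already present in df:
library(dplyr)
df %>%
  mutate(Protocol = case_when(CRA == 1 | Regulatory == 1 ~ 1,
                              TRUE ~ Protocol))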

library(tidyverse)
First, by pivoting both df and jobs, the task should become much easier
(df_long <- df %>%
   pivot_longer(
     cols = -c(fname, employee_number), names_to = "term"
   ) %>%
   filter(value == 1) %>%
   select(-value))
#> # A tibble: 3 x 3
#> fname employee_number term
#> <chr> <chr> <chr>
#> 1 Linda 00000123456 Regulatory
#> 2 Bob 654321 Regulatory
#> 3 Bob 654321 Finance
Now, if I understand your question correctly, Bob should have added "Protocol"
and "Subject" in his survey because he works in "Regulatory". Luckily, we can add
that information for him automatically. We pivot jobs and clean up the
names/terms to match those in df. This can be done like this:
(jobs_long <- jobs %>%
   rename(
     CRA = `CRA/ Manager/ Senior`, ResearchNurse = `Research Nurse`
   ) %>%
   mutate(
     roles = `Roles (existing)` %>% str_extract("^\\w+"),
     .keep = "unused"
   ) %>%
   pivot_longer(-roles, names_to = "term") %>%
   filter(value == 1) %>%
   select(-value))
#> # A tibble: 6 x 2
#> roles term
#> <chr> <chr>
#> 1 Calendar CRA
#> 2 Calendar ResearchNurse
#> 3 Protocol CRA
#> 4 Protocol Regulatory
#> 5 Subject Regulatory
#> 6 Subject ResearchNurse
Once in this shape, we can join the two tables, do some tidying, and then we
end up with the correct information. We could continue from here and wrangle
the data back into the wide shape, but it’s probably more useful like this
so that’s where I would stop.
df_long %>%
  left_join(jobs_long, by = c("term" = "term")) %>%
  pivot_longer(cols = c(term, roles), values_drop_na = TRUE) %>%
  distinct(fname, employee_number, term = value)
#> # A tibble: 7 x 3
#> fname employee_number term
#> <chr> <chr> <chr>
#> 1 Linda 00000123456 Regulatory
#> 2 Linda 00000123456 Protocol
#> 3 Linda 00000123456 Subject
#> 4 Bob 654321 Regulatory
#> 5 Bob 654321 Protocol
#> 6 Bob 654321 Subject
#> 7 Bob 654321 Finance
Created on 2022-03-31 by the reprex package (v1.0.0)
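If the wide shape is needed after all, one possible continuation (an editorial sketch; df_fixed is an assumed name for the result of the pipeline above) is to flag each remaining term with a 1 and spread the terms back into columns, filling the absent ones with 0:
df_fixed %>%
  mutate(value = 1) %>%   # flag every term the person should have
  pivot_wider(names_from = term, values_from = value, values_fill = 0)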

Related

How to pivot_wider the n unique values of variable A grouped_by variable B?

I am trying to pivot_wider() the column X of a data frame containing various persons' names. Within each group (group_by() on another variable Y of the df) there are always 2 of these names. I would like R to take the 2 unique values of X within each unique identifier of Y and put them into 2 new columns, ex_X_Name_1 and ex_X_Name_2.
My data frame looks like this:
df <- data.frame(Student = rep(c(17383, 16487, 17646, 2648, 3785), each = 2),
Referee = c("Paul Severe", "Cathy Nice", "Jean Exigeant", "Hilda Ehrlich", "John Rates",
"Eva Luates", "Fred Notebien", "Aldous Grading", "Hans Streng", "Anna Filaktic"),
Rating = format(round(x = sqrt(sample(15:95, 10, replace = TRUE)), digits = 3), nsmall = 3)
)
df
I would like to transform the Referee column into 2 new columns, Referee_1 and Referee_2, holding the 2 unique referees assigned to each student, and end up with this result:
# TRUE for the first row of each student (rows 1, 3, 5, ...), FALSE for the second
first_row_df <- as.logical(seq_len(length(df$Referee)) %% 2)
df_wanted <- tibble::tibble(
  Student = unique(df$Student),
  Referee_1 = df$Referee[first_row_df],
  Rating_Ref_1 = df$Rating[first_row_df],
  Referee_2 = df$Referee[!first_row_df],
  Rating_Ref_2 = df$Rating[!first_row_df]
)
df_wanted
I guess I could achieve this by subsetting unique rows of student/referee combinations and making joins, but is there a way to handle this in one call to pivot_wider()?
You should create a row id per group first:
library(dplyr)
library(tidyr)
df %>%
  group_by(Student) %>%
  mutate(row_n = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = "row_n", values_from = c("Referee", "Rating"))
# A tibble: 5 × 5
Student Referee_1 Referee_2 Rating_1 Rating_2
<dbl> <chr> <chr> <chr> <chr>
1 17383 Paul Severe Cathy Nice 9.165 7.810
2 16487 Jean Exigeant Hilda Ehrlich 5.196 6.557
3 17646 John Rates Eva Luates 7.211 5.568
4 2648 Fred Notebien Aldous Grading 4.000 8.124
5 3785 Hans Streng Anna Filaktic 7.937 6.325
using data.table
library(data.table)
setDT(df)
merge(df[, .SD[1], Student], df[, .SD[2], Student], by = "Student", suffixes = c("_1", "_2"))
# Student Referee_1 Rating_1 Referee_2 Rating_2
# 1: 2648 Fred Notebien 6.708 Aldous Grading 9.747
# 2: 3785 Hans Streng 6.245 Anna Filaktic 8.775
# 3: 16487 Jean Exigeant 7.681 Hilda Ehrlich 4.359
# 4: 17383 Paul Severe 4.583 Cathy Nice 7.616
# 5: 17646 John Rates 6.708 Eva Luates 8.246
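A possible variation (an editorial sketch, not part of the original answer): data.table's dcast() with rowid() avoids the self-merge and copes with any number of referees per student:
library(data.table)
setDT(df)
# rowid(Student) numbers the referees within each student, so the cast
# produces Referee_1/Referee_2 and Rating_1/Rating_2 columns automatically
dcast(df, Student ~ rowid(Student), value.var = c("Referee", "Rating"))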

Remove rows with two conditions in R

I have this following dataset:
df <- structure(list(Data = structure(c(1623888000, 1629158400, 1629158400
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Client = c("Client1",
"Client1", "Client1"), Fund = c("Fund1", "Fund1", "Fund2"), Nature = c("Application",
"Rescue", "Application"), Quantity = c(433.059697, 0, 171.546757
), Value = c(69800, -70305.67, 24875), `NAV Yesterday` = c(162.40991399996,
162.40991399996, 145.044589000056), `NAV in Application Date` = c(161.178702344125,
162.346370458944, 145.004198476337), `Var NAV` = c(0.00763879866215962,
0.00039140721678275, 0.000278547270652531), `Var * Value` = c(533.188146618741,
-27.5181466187465, 6.92886335748171), FinalValue = c(70333.1881466187,
-70333.1881466187, 24881.9288633575), `Rentability WRONG` = c(0.0210345899274819,
0.0210345899274819, 0.0210345899274819)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
What I need to do is:
If Quantity = 0 in a row, then remove all rows with the same Fund name as that row, but remove only the rows that have a Date less than or equal to the Date of the Quantity = 0 row.
What I did here is:
I grouped the data by Fund
Arranged each group by Data
Created a column zero_point that assigns 1 to the row where Quantity == 0 and NA otherwise
Filled the fields in zero_point that come before the actual "zero point" with the same value.
Filtered those rows out.
output <- df %>%
  group_by(Fund) %>%
  arrange(Data) %>%
  mutate(zero_point = case_when(Quantity == 0 ~ 1)) %>%
  fill(zero_point, .direction = "up") %>%
  filter(is.na(zero_point))
(On the condition that there is only one instance where Quantity is 0 per Fund group)
You can try -
library(dplyr)
df %>%
  filter({
    # Row indices where Quantity == 0
    inds = which(Quantity == 0)
    # Drop rows where the Data value is less than or equal to the Data value
    # at Quantity == 0 and the Fund is the same as the Fund at Quantity == 0.
    !(Data <= Data[inds] & Fund %in% Fund[inds])
  })
Here's a thought:
df %>%
  group_by(Fund) %>%
  filter(!any(Quantity == 0) | Data <= Data[which.min(Quantity)])
# # A tibble: 3 x 12
# # Groups: Fund [2]
# Data Client Fund Nature Quantity Value `NAV Yesterday` `NAV in Applica~ `Var NAV` `Var * Value` FinalValue `Rentability WR~
# <dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2021-06-17 00:00:00 Clien~ Fund1 Appli~ 433. 69800 162. 161. 0.00764 533. 70333. 0.0210
# 2 2021-08-17 00:00:00 Clien~ Fund1 Rescue 0 -70306. 162. 162. 0.000391 -27.5 -70333. 0.0210
# 3 2021-08-17 00:00:00 Clien~ Fund2 Appli~ 172. 24875 145. 145. 0.000279 6.93 24882. 0.0210
I'm assuming you meant "Data <= Data of the Quantity = 0 Fund", therefore using Data instead of Date (not found) or NAV in Application Date.
This filters nothing in this sample data, I'm hoping the logic is correct.
Testing for equality with floating-point (numeric) can be problematic at times (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). If you have some small near-zero numbers, then this will silently produce counter-intuitive results without warning or error. You might be more defensive to use something like:
df %>%
  group_by(Fund) %>%
  filter(all(abs(Quantity) > 0) | Data <= Data[which.min(Quantity)])
or even
df %>%
  group_by(Fund) %>%
  filter(all(abs(Quantity) > 0) |
           row_number() == which.min(Quantity) |
           Data < Data[which.min(Quantity)])
While the latter is a bit paranoid (and double-calculates which.min(.)), it should not succumb to problems with equality tests.
The only time this will fail is if all(is.na(Quantity)); that is, which.min(c(NA, NA)) returns integer(0), which will cause an error in dplyr::filter. One might choose to add a safeguard with something like filter(any(!is.na(Quantity)) & (...)).
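One way to spell out such a guard (an editorial sketch; because & is vectorised, wrapping the condition in an if ()/else per group avoids evaluating which.min() on an all-NA group at all):
df %>%
  group_by(Fund) %>%
  # keep all-NA groups untouched instead of erroring on which.min()
  filter(if (all(is.na(Quantity))) TRUE
         else all(abs(Quantity) > 0) | Data <= Data[which.min(Quantity)])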

How to use R for handling (pivoting?) social survey raw data?

We frequently ask scale questions in our social surveys; respondents provide their agreement with a statement (strongly agree, agree, neither nor, disagree, strongly disagree). The survey result usually comes in an aggregated format, i.e. for each question (variable) the answers are provided in a single column, where 5 = strongly agree, 1 = strongly disagree, etc.
Now we came across a new survey tool where the answers are partitioned into several columns for one question. For example, the Q1_1 column = Strongly agree for Q1 and the Q1_5 column = Strongly disagree. So for each question we receive 5 columns of answers: if a respondent answered Strongly Agree, the Q1_1 cell for that respondent is marked as 1, while the Q1_2 to Q1_5 cells are marked as 0.
Can anyone kindly share a solution to 'aggregate' the answers from the new survey tool, so that instead of having 5 columns for each question we would have one column per question, with values 1-5?
I'm new to R; I thought R could handle this instead of having to change it manually in Excel.
Try this reshaping approach, and next time follow the advice from @r2evans, as otherwise we have to type out the data ourselves. Here is the code:
library(dplyr)
library(tidyr)
#Data
df <- data.frame(Respondent=paste0('Respondent',1:10),
Q6_1=c(1,0,1,1,1,1,0,0,0,1),
Q6_2=c(0,1,0,0,0,0,1,1,0,1),
Q6_3=rep(0,10),
Q6_4=c(rep(0,8),1,0),stringsAsFactors = F
)
#Code
new <- df %>%
  pivot_longer(-Respondent) %>%
  separate(name, c('variable', 'answer'), sep = '_') %>%
  filter(value == 1) %>%
  select(-value) %>%
  filter(!duplicated(Respondent)) %>%
  pivot_wider(names_from = variable, values_from = answer)
Output:
# A tibble: 10 x 2
Respondent Q6
<chr> <chr>
1 Respondent1 1
2 Respondent2 2
3 Respondent3 1
4 Respondent4 1
5 Respondent5 1
6 Respondent6 1
7 Respondent7 2
8 Respondent8 2
9 Respondent9 4
10 Respondent10 1
I am only curious why your data for respondent 10 has two values of 1. Is it a typo, or is that possible?
We can use data.table methods
library(data.table)
dcast(
  unique(
    melt(setDT(df), id.var = 'Respondent')[
      , c('variable', 'answer') := tstrsplit(variable, '_', type.convert = TRUE)
    ][value == 1],
    by = "Respondent"
  ),
  Respondent ~ variable, value.var = 'answer'
)
-output
# Respondent Q6
# 1: Respondent1 1
# 2: Respondent10 1
# 3: Respondent2 2
# 4: Respondent3 1
# 5: Respondent4 1
# 6: Respondent5 1
# 7: Respondent6 1
# 8: Respondent7 2
# 9: Respondent8 2
#10: Respondent9 4
data
df <- structure(list(Respondent = c("Respondent1", "Respondent2", "Respondent3",
"Respondent4", "Respondent5", "Respondent6", "Respondent7", "Respondent8",
"Respondent9", "Respondent10"), Q6_1 = c(1, 0, 1, 1, 1, 1, 0,
0, 0, 1), Q6_2 = c(0, 1, 0, 0, 0, 0, 1, 1, 0, 1), Q6_3 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), Q6_4 = c(0, 0, 0, 0, 0, 0, 0, 0,
1, 0)), class = "data.frame", row.names = c(NA, -10L))

Is there a function within dcast that allows me to include additional conditions? [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 2 years ago.
I'm trying to create a wide format dataset that would include only some of the long format data. This is data from learners going through an online learning module in which they sometimes get "stuck" on a screen and therefore have multiple attempts recorded for that screen.
lesson_long <- data.frame(
  id = c(4256279, 4256279, 4256279, 4256279, 4256279, 4256279, 4256308, 4256308, 4256308, 4256308),
  screen = c("survey1", "survey1", "survey1", "survey1", "survey2", "survey2", "survey1", "survey1", "survey2", "survey2"),
  question_attempt = c(1, 1, 2, 2, 1, 1, 1, 1, 1, 1),
  variable = c("age", "country", "age", "country", "education", "course", "age", "country", "education", "course"),
  response = c(0, 5, 20, 5, 3, 2, 18, 5, 4, 1)
)
id screen question_attempt variable response
4256279 survey1 1 age 0
4256279 survey1 1 country 5
4256279 survey1 2 age 20
4256279 survey1 2 country 5
4256279 survey2 1 education 3
4256279 survey2 1 course 2
4256308 survey1 1 age 18
4256308 survey1 1 country 5
4256308 survey2 1 education 4
4256308 survey2 1 course 1
For my analyses I need to include only the response from their last attempt on each screen (i.e. the response on their max question_attempt; sometimes they have up to 8 or 9 attempts on a screen). All previous attempts should be dismissed, and I don't need to have the screen name in the final dataset. The final wide format would look like this:
id age country education course
4256279 20 5 3 2
4256308 18 5 4 1
I've been trying to do this with just dcast (unsuccessfully):
lesson_wide <- dcast(lesson_long, `id` ~ variable, value.var = "response", fun.aggregate = max("question_attempt"), fill=0)
The fun.aggregate is obviously not working, as I made it up... But is there a solution for this? Or perhaps I need an additional step to select the data before using dcast? And how would I do that if it's the solution?
Curious to see your answers. Thanks in advance!
You can order the data by id, screen and question_attempt and select the last value of each question_attempt.
library(data.table)
setDT(lesson_long)
dcast(lesson_long[order(id, screen, question_attempt)],
      id ~ variable, value.var = 'response', fun.aggregate = last, fill = NA)
# id age country course education
#1: 4256279 20 5 2 3
#2: 4256308 18 5 1 4
Similarly, using dplyr and tidyr :
library(dplyr)
lesson_long %>%
  arrange(id, screen, question_attempt) %>%
  tidyr::pivot_wider(names_from = variable, values_from = response,
                     id_cols = id, values_fn = last)

Apply rename_if predicate to column names

I am working with a set of Excel spreadsheets which have column names that are dates.
After reading in the data with readxl::read_xlsx(), these column names become Excel serial dates (i.e. integers representing days elapsed since 1899-12-30).
Is it possible to use dplyr::rename_if() or similar to rename all column names that are currently integers? I have written a function rename_func that I would like to apply to all such columns.
df %>% rename_if(is.numeric, rename_func) is not suitable, as is.numeric is applied to the data in the column, not the column name itself. I have also tried:
is.name.numeric <- function(x) is.numeric(names(x))
df %>% rename_if(is.name.numeric, rename_func)
which does not work and does not change any names (i.e. is.name.numeric returns FALSE for all columns).
Edit: here is a dummy version of my data:
df_badnames <- structure(list(Level = c(1, 2, 3, 3, 3), Title = c("AUSTRALIAN TOTAL",
"MANAGERS", "Chief Executives, Managing Directors & Legislators",
"Farmers and Farm Managers", "Hospitality, Retail and Service Managers"
), `38718` = c(213777.89, 20997.52, 501.81, 121.26, 4402.7),
`38749` = c(216274.12, 21316.05, 498.1, 119.3, 4468.67),
`38777` = c(218563.95, 21671.84, 494.08, 118.03, 4541.02),
`38808` = c(220065.05, 22011.76, 488.56, 116.24, 4609.28)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
and I would like:
df_goodnames <- structure(list(Level = c(1, 2, 3, 3, 3), Title = c("AUSTRALIAN TOTAL",
"MANAGERS", "Chief Executives, Managing Directors & Legislators",
"Farmers and Farm Managers", "Hospitality, Retail and Service Managers"
), Jan2006 = c(213777.89, 20997.52, 501.81, 121.26, 4402.7),
Feb2006 = c(216274.12, 21316.05, 498.1, 119.3, 4468.67),
Mar2006 = c(218563.95, 21671.84, 494.08, 118.03, 4541.02),
Apr2006 = c(220065.05, 22011.76, 488.56, 116.24, 4609.28)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
I understand that it is best practice to create a date column and change the shape of this df, but I need to join a few spreadsheets first, and having integer column names causes a lot of problems. I currently have a workaround, but the crux of my question (apply a rename_if predicate to a name, rather than a column) is still interesting.
Although the names look numeric, they are not:
class(names(df_badnames))
#[1] "character"
so they would not be caught by is.numeric or other similar functions.
One way to do this is to find out which names can be coerced to numeric and then convert them into the date format of our choice:
cols <- as.numeric(names(df_badnames))
names(df_badnames)[!is.na(cols)] <- format(as.Date(cols[!is.na(cols)],
origin = "1899-12-30"), "%b%Y")
df_badnames
# Level Title Jan2006 Feb2006 Mar2006 Apr2006
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 AUSTRALIAN TOTAL 213778. 216274. 218564. 220065.
#2 2 MANAGERS 20998. 21316. 21672. 22012.
#3 3 Chief Executives, Managing Directors & Legisla… 502. 498. 494. 489.
#4 3 Farmers and Farm Managers 121. 119. 118. 116.
#5 3 Hospitality, Retail and Service Managers 4403. 4469. 4541. 4609.
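To stay closer to the rename_if() idea in the question, a tidyverse-flavoured sketch (an editorial addition; it assumes dplyr >= 1.0.0, where rename_with() supersedes rename_if()/rename_at(), and excel_serial_to_label is a hypothetical helper standing in for rename_func):
library(dplyr)
# Hypothetical helper: convert Excel serial-number names to labels like "Jan2006"
excel_serial_to_label <- function(x) {
  format(as.Date(as.numeric(x), origin = "1899-12-30"), "%b%Y")
}
# Apply it only to columns whose names consist entirely of digits
df_goodnames <- df_badnames %>%
  rename_with(excel_serial_to_label, .cols = matches("^[0-9]+$"))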
