We frequently ask scale questions in our social surveys: respondents state their level of agreement with a statement (strongly agree, agree, neither agree nor disagree, disagree, strongly disagree). The survey results usually come in an aggregated format, i.e., for each question (variable) the answers are provided in a single column, where 5 = strongly agree, 1 = strongly disagree, etc.
Now we have come across a new survey tool where the answers to one question are split across several columns. For example, the Q1_1 column = Strongly agree for Q1 and the Q1_5 column = Strongly disagree. So for each question we receive 5 columns of answers: if a respondent answered Strongly agree, the Q1_1 cell in that row is marked 1, while the Q1_2 - Q1_5 cells for that respondent are marked 0.
Can anyone kindly share a solution to aggregate the answers from the new survey tool, so that instead of 5 columns per question we have one column per question, with values 1-5?
I'm new to R, and I thought R could handle this so that I don't have to change it manually in Excel.
Try this reshaping approach, and next time please follow the advice from #r2evans, since otherwise we have to type in the data ourselves. Here is the code:
library(dplyr)
library(tidyr)
#Data
df <- data.frame(Respondent=paste0('Respondent',1:10),
Q6_1=c(1,0,1,1,1,1,0,0,0,1),
Q6_2=c(0,1,0,0,0,0,1,1,0,1),
Q6_3=rep(0,10),
Q6_4=c(rep(0,8),1,0),stringsAsFactors = F
)
#Code
new <- df %>% pivot_longer(-Respondent) %>%
separate(name,c('variable','answer'),sep='_') %>%
filter(value==1) %>%
select(-value) %>%
filter(!duplicated(Respondent)) %>%
pivot_wider(names_from = variable,values_from=answer)
Output:
# A tibble: 10 x 2
Respondent Q6
<chr> <chr>
1 Respondent1 1
2 Respondent2 2
3 Respondent3 1
4 Respondent4 1
5 Respondent5 1
6 Respondent6 1
7 Respondent7 2
8 Respondent8 2
9 Respondent9 4
10 Respondent10 1
I am only curious why Respondent10 in your data has two values of 1. Is that a typo, or is it possible?
We can use data.table methods
library(data.table)
dcast(unique(melt(setDT(df), id.var = 'Respondent')[,
c('variable', 'answer') := tstrsplit(variable, '_',
type.convert = TRUE)][value == 1], by = "Respondent"),
Respondent ~ variable, value.var = 'answer')
-output
# Respondent Q6
# 1: Respondent1 1
# 2: Respondent10 1
# 3: Respondent2 2
# 4: Respondent3 1
# 5: Respondent4 1
# 6: Respondent5 1
# 7: Respondent6 1
# 8: Respondent7 2
# 9: Respondent8 2
#10: Respondent9 4
data
df <- structure(list(Respondent = c("Respondent1", "Respondent2", "Respondent3",
"Respondent4", "Respondent5", "Respondent6", "Respondent7", "Respondent8",
"Respondent9", "Respondent10"), Q6_1 = c(1, 0, 1, 1, 1, 1, 0,
0, 0, 1), Q6_2 = c(0, 1, 0, 0, 0, 0, 1, 1, 0, 1), Q6_3 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0), Q6_4 = c(0, 0, 0, 0, 0, 0, 0, 0,
1, 0)), class = "data.frame", row.names = c(NA, -10L))
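For comparison, here is a minimal base R sketch of the same collapse without any packages. It assumes the answer columns of a question share a prefix such as "Q6_" and that the suffix is the answer code; the toy data, the prefix pattern and the names q_cols, m and result are illustrative assumptions, not part of the original question. When a respondent has more than one 1, the first column is used, and a respondent with no 1 at all gets NA.

```r
# Toy data in the same shape as the question: one 0/1 column per answer option
df <- data.frame(
  Respondent = paste0("Respondent", 1:4),
  Q6_1 = c(1, 0, 0, 0),
  Q6_2 = c(0, 1, 0, 0),
  Q6_3 = c(0, 0, 0, 0),
  Q6_4 = c(0, 0, 1, 0)
)

# Columns belonging to question Q6 (the prefix is an assumption)
q_cols <- grep("^Q6_", names(df), value = TRUE)
m <- as.matrix(df[q_cols])

# Position of the first column holding the row maximum (i.e. the first 1)
answer <- max.col(m, ties.method = "first")
# A row of all zeros has no answer, so mark it as missing
answer[rowSums(m) == 0] <- NA

# Read the answer code from the column suffix rather than the column position
code <- as.integer(sub("^Q6_", "", q_cols))[answer]
result <- data.frame(Respondent = df$Respondent, Q6 = code)
result
```

With several questions, the same steps could be wrapped in a function and applied once per prefix.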
Related
I am currently working on microdata from a survey called SHARE. I want to use a variable for education, but the way it was coded makes that rather hard.
In the survey, households are asked which degrees they have. There is one column for each degree, and it takes the value 1 if the respondent has the degree and 0 otherwise. The issue is that I have two countries with different degrees that share the same columns, so I have to go to the user manual to find out, for each country, which degree each column corresponds to. I was able to do so and then translate the result into an international measure of education.
My idea was to sum the columns so that I end up with only one column per household. However, I wasn't able to proceed because some people have several degrees. I would like to get the highest degree of each household, and I would appreciate your help with this.
Here are tables of what I have and what I would like:
Let's imagine that in Germany the first diploma is equivalent to the first diploma in international standards, the second and the third German diplomas both correspond to the second international diploma, and the last German diploma corresponds to the third international one. In France we have first = first int., second = second int., third = third int., and there is no fourth diploma. Then I have the table:
country= c( "Germany", "Germany", "Germany", "France" , "France", "France")
degree_one= c( 1, 1, 1, 1 , 1, 1)
degree_two = c( 0, 1, 0, 1 , 1, 0)
degree_three= c( 1, 0, 1, 1 , 1, 0)
degree_four = c( 1, 0, 0, NA ,NA, NA)
f = data.frame(country,degree_one,degree_two,degree_three,degree_four)
Then I can translate and try to create my degree variable by summing everything:
f$degree_one = ifelse(f$country == "Germany" & f$degree_one == 1,1,f$degree_one)
f$degree_two = ifelse(f$country == "Germany" & f$degree_two == 1,2,f$degree_two)
f$degree_three = ifelse(f$country == "Germany" & f$degree_three == 1,2,f$degree_three)
f$degree_four = ifelse(f$country == "Germany" & f$degree_four == 1,3,f$degree_four)
f$degree_one = ifelse(f$country == "France" & f$degree_one == 1,1,f$degree_one)
f$degree_two = ifelse(f$country == "France" & f$degree_two == 1,2,f$degree_two)
f$degree_three = ifelse(f$country == "France" & f$degree_three == 1,3,f$degree_three)
f$degree_four = ifelse(f$country == "France" & is.na(f$degree_four),0,f$degree_four) # test for NA with is.na(), not == "NA"
f = replace(f, is.na(f), 0)
f2 = f %>% mutate(degree = degree_one + degree_two + degree_three + degree_four )
Unfortunately, it does not work. What I would like looks like this:
degree = c(3,2,2,3,3,1)
f3 = data.frame(f,degree)
I tried to do something with a while loop, but it did not work. Does anyone have an idea of how I can solve my problem? I tried to make it as clear as possible; I hope it is understandable and that someone has an idea of how to fix this.
Thanks :)
Here is an approach using data.table
library(data.table)
##
# create degree map by country
#
degreeMap <- data.table(country=c('France', 'Germany'))
degreeMap <- degreeMap[, .(degree=paste('degree', c('one', 'two', 'three', 'four'), sep='_')), by=.(country)]
degreeMap[country=='France', intlDegree:=c(1,2,3,NA)]
degreeMap[country=='Germany', intlDegree:=c(1,2,2,3)]
##
# process your data
#
setDT(f)
f[, indx:=1:.N] # need an index column to recover original order
f[, HH:=1:.N, by=.(country)] # need a HH column to distinguish different HH w/in country
maxDegree <- melt(f, id=c('country', 'HH', 'indx'), variable.name='degree', value.name = 'flag')
maxDegree <- maxDegree[flag > 0] # remove rows with flag=0 or NA
setorder(maxDegree, HH, degree)
maxDegree <- maxDegree[, .SD[.N], keyby=.(country, HH)]
maxDegree[degreeMap, intlDegree:=i.intlDegree, on=.(country, degree)]
setorder(maxDegree, indx)
maxDegree
## country HH indx degree flag intlDegree
## 1: Germany 1 1 degree_four 1 3
## 2: Germany 2 2 degree_two 1 2
## 3: Germany 3 3 degree_three 1 2
## 4: France 1 4 degree_three 1 3
## 5: France 2 5 degree_three 1 3
## 6: France 3 6 degree_one 1 1
So this converts your f to a data.table and adds an index column and a HH column to distinguish between HH within a country.
We then convert to long format using melt(...). In long format the four degree_ columns are reduced to two columns: a flag column indicating whether or not the degree applies, and a degree column indicating which degree.
Then we remove all rows with 0 or NA flags, and then extract the last remaining row (highest degree) for each country and HH.
Finally, we join to degreeMap to get the equivalent intlDegree.
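The same idea (map each national degree to its international level, then keep the highest level held) can also be sketched in base R as a row-wise maximum. This is only an illustration under the mapping described in the question; the lookup matrix intl and the names flags and levels_held are assumptions introduced here.

```r
f <- data.frame(
  country      = c("Germany", "Germany", "Germany", "France", "France", "France"),
  degree_one   = c(1, 1, 1, 1, 1, 1),
  degree_two   = c(0, 1, 0, 1, 1, 0),
  degree_three = c(1, 0, 1, 1, 1, 0),
  degree_four  = c(1, 0, 0, NA, NA, NA)
)

# International level of each national degree column, one row per country
# (France has no fourth degree, hence NA)
intl <- rbind(
  Germany = c(1, 2, 2, 3),
  France  = c(1, 2, 3, NA)
)

# 0/1 flags as a matrix, treating a missing flag as "does not hold the degree"
flags <- as.matrix(f[, -1])
flags[is.na(flags)] <- 0

# Multiply each flag by the international level for that row's country;
# degrees not held become 0, non-existent degrees stay NA
levels_held <- flags * intl[as.character(f$country), ]

# Highest international level per row, ignoring the NA entries
f$degree <- unname(apply(levels_held, 1, max, na.rm = TRUE))
f$degree
```

For the example data this reproduces the desired vector from the question, degree = c(3, 2, 2, 3, 3, 1).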
Change NAs to 0 and then sum degree columns:
f <- f %>%
mutate(
degree_one = ifelse(is.na(degree_one), 0, degree_one),
degree_two = ifelse(is.na(degree_two), 0, degree_two),
degree_three = ifelse(is.na(degree_three), 0, degree_three),
degree_four = ifelse(is.na(degree_four), 0, degree_four),
degree_sum = degree_one + degree_two + degree_three + degree_four
)
Or, if you want to get fancy with dplyr:
f <- f %>%
mutate(across(contains("degree"), \(x) {ifelse(is.na(x), 0, x)})) %>%
mutate(degree_sum = select(., contains("degree")) %>% rowSums())
The background
Question edited heavily for clarity
I have data like this:
df<-structure(list(fname = c("Linda", "Bob"), employee_number = c("00000123456",
"654321"), Calendar = c(0, 0), Protocol = c(0, 0), Subject = c(0,
0), CRA = c(0, 0), Regulatory = c(1, 1), Finance = c(0, 1), ResearchNurse = c(0,
0)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
In a previous question I asked on here, I mentioned that I needed to pivot this data from wide to long in order to export it elsewhere. Answers worked great!
The problem is that I discovered some of the people in my dataset didn't fill out their surveys correctly and have all zeros in certain problematic columns, i.e. when they get pivoted and filtered to "1" values, they get dropped.
Luckily (depending on how you think about it) I can fix their mistakes. If they left those columns blank, I can populate what they should have based on their other columns. I.e. what they filled out under "CRA","Regulatory", "Finance" or "ResearchNurse" will determine whether they get 1's or 0's in "Calendar","Protocol" or "Subject"
To figure out what goes in those columns, we created this matrix of job responsibilities:
jobs<-structure(list(`Roles (existing)` = c("Calendar Build", "Protocol Management",
"Subject Management"), `CRA/ Manager/ Senior` = c(1, 1, 0), Regulatory = c(0,
1, 1), Finance = c(0, 0, 0), `Research Nurse` = c(1, 0, 1)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
So if you're following so far, no matter what "Bob" put in his columns for "Calendar", "Protocol" or "Subject" (he currently has zeros), it will be overwritten based on what he put in the other columns. So if Bob put a "1" in his "Regulatory" column, then based on the matrix above he should get a 1 in both the Protocol and Subject columns.
The specific question
So how do I tell R: "look at Bob's CRA, Regulatory, Finance and ResearchNurse columns, cross-reference the jobs data frame, and overwrite his Calendar, Protocol and Subject columns"?
My expected output in this particular case would be:
One last little detail: I could see instances where (depending on the order), numbers would overwrite each other. I.e. if Bob should get a 1 in protocol because he's got a 1 in regulatory... but he's got a 1 in finance which would mean he should get a 0 in protocol.....
When in doubt, if a column is overwritten with a 1, it should never be turned back into a zero. I hope that makes sense.
I'd suggest converting your logic to ifelse statement(s):
df$Calendar <- ifelse(df$CRA == 1 | df$ResearchNurse == 1, 1, df$Calendar)
df$Protocol <- ifelse(df$CRA == 1 | df$Regulatory == 1, 1, df$Protocol)
df$Subject <- ifelse(df$Regulatory == 1 | df$ResearchNurse == 1, 1, df$Subject)
df
#> fname employee_number Calendar Protocol Subject CRA Regulatory Finance
#> 1 Linda 00000123456 0 1 1 0 1 0
#> 2 Bob 654321 0 1 1 0 1 1
#> ResearchNurse
#> 1 0
#> 2 0
data:
df <- structure(list(
fname = c("Linda", "Bob"),
employee_number = c("00000123456", "654321"),
Calendar = c(0, 0), Protocol = c(0, 0), Subject = c(0, 0),
CRA = c(0, 0), Regulatory = c(1, 1), Finance = c(0, 1),
ResearchNurse = c(0, 0)), row.names = c(NA, -2L), class = c("data.frame"))
Created on 2022-03-28 by the reprex package (v2.0.1)
Both tables need a common look up value.
So for example in your df table there is an employee_number column. Do you have the same field in the jobs table? If so, this is easy to do with left_join() and then a case_when().
You will need to simplify your current jobs table so that it holds some summary value of the logic you put in your post (e.g. if Bob has a 1 in Regulatory, then he should get a 1 in the Protocol and Subject columns). This can be done with some table manipulation functions. I can't tell you exactly which ones because I don't fully understand the logic.
Assuming that is clear to you and you know how to summarize that jobs table (and you have the unique employee_number) for each row then the below should work.
left_join(x=df,y=jobs,by="employee_number") %>%
mutate(new_col1=case_when(logic_1 ~ value1,
logic_2 ~ value2,
logic_3 ~ value3,
TRUE ~ default_value))
You can repeat the newcol logic for additional columns as required.
library(tidyverse)
First, by pivoting both df and jobs, the task should become much easier
(df_long <- df %>%
pivot_longer(
cols = -c(fname, employee_number), names_to = "term"
) %>%
filter(value == 1) %>%
select(-value))
#> # A tibble: 3 x 3
#> fname employee_number term
#> <chr> <chr> <chr>
#> 1 Linda 00000123456 Regulatory
#> 2 Bob 654321 Regulatory
#> 3 Bob 654321 Finance
Now, if I understand your question correctly, Bob should have added “Protocol”
and “Subject” in his survey because he works in “Regulatory”. Luckily, we can add
that information for him automatically. We pivot jobs and clean up the
names/terms to match those in df. This can be done like this:
(jobs_long <- jobs %>%
rename(
CRA = `CRA/ Manager/ Senior`, ResearchNurse = `Research Nurse`
) %>%
mutate(
roles = `Roles (existing)` %>% str_extract("^\\w+"),
.keep = "unused"
) %>%
pivot_longer(-roles, names_to = "term") %>%
filter(value == 1) %>%
select(-value))
#> # A tibble: 6 x 2
#> roles term
#> <chr> <chr>
#> 1 Calendar CRA
#> 2 Calendar ResearchNurse
#> 3 Protocol CRA
#> 4 Protocol Regulatory
#> 5 Subject Regulatory
#> 6 Subject ResearchNurse
Once in this shape, we can join the two tables, do some tidying, and then we
end up with the correct information. We could continue from here and wrangle
the data back into the wide shape, but it’s probably more useful like this
so that’s where I would stop.
df_long %>%
left_join(jobs_long, by = c("term" = "term")) %>%
pivot_longer(cols = c(term, roles), values_drop_na = TRUE) %>%
distinct(fname, employee_number, term = value)
#> # A tibble: 7 x 3
#> fname employee_number term
#> <chr> <chr> <chr>
#> 1 Linda 00000123456 Regulatory
#> 2 Linda 00000123456 Protocol
#> 3 Linda 00000123456 Subject
#> 4 Bob 654321 Regulatory
#> 5 Bob 654321 Protocol
#> 6 Bob 654321 Subject
#> 7 Bob 654321 Finance
Created on 2022-03-31 by the reprex package (v1.0.0)
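If the wide shape is needed in the end, one more option is to treat the jobs table as a 0/1 matrix and multiply it with the role columns. This is a hedged sketch rather than a drop-in solution: it assumes the jobs table has been rewritten by hand as a matrix whose row names match the task columns of df and whose column names match the role columns (the shortened names below are my assumption). It keeps any existing 1, so a 1 is never turned back into a 0.

```r
df <- data.frame(
  fname = c("Linda", "Bob"),
  employee_number = c("00000123456", "654321"),
  Calendar = c(0, 0), Protocol = c(0, 0), Subject = c(0, 0),
  CRA = c(0, 0), Regulatory = c(1, 1), Finance = c(0, 1),
  ResearchNurse = c(0, 0)
)

# Responsibility matrix from the question: rows are tasks, columns are roles.
# Names are shortened here so that they match the column names of df.
jobs <- rbind(
  Calendar = c(CRA = 1, Regulatory = 0, Finance = 0, ResearchNurse = 1),
  Protocol = c(CRA = 1, Regulatory = 1, Finance = 0, ResearchNurse = 0),
  Subject  = c(CRA = 0, Regulatory = 1, Finance = 0, ResearchNurse = 1)
)

# For each person and task, count how many of their roles imply that task
roles   <- as.matrix(df[, colnames(jobs)])
implied <- roles %*% t(jobs)

# Keep existing 1s and add the implied ones; nothing is reset to 0
tasks <- rownames(jobs)
df[tasks] <- pmax(as.matrix(df[tasks]), (implied > 0) * 1)
df
```

For the two example rows this gives Calendar = 0, Protocol = 1 and Subject = 1 for both Linda and Bob.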
In a related question I had some good help generating the possible combinations of a set of variables.
Assume the output of that process is
combo_tbl <- sdf_copy_to(sc = sc,
x = data.frame(
combo_id = c("combo1", "combo2", "combo3"),
selection_1 = c("Alice", "Alice", "Bob"),
selection_2 = c("Bob", "Cat", "Cat")
),
name = "combo_table")
This is a tbl reference to a spark data frame object with two columns, each representing a selection of 2 values from a list of 3 (Alice, Bob, Cat), that could be imagined as 3 members of a household.
Now there is also a spark data frame with a binary encoding indicating a 1 if the member of the house was in the house, and 0 where they were not.
obs_tbl <- sdf_copy_to(sc = sc,
x = data.frame(
obs_day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
Alice = c(1, 1, 0, 1, 0, 1, 0),
Bob = c(1, 1, 1, 0, 0, 0, 0),
Cat = c(0, 1, 1, 1, 1, 0, 0)
),
name = "obs_table")
I can relatively simply check if a specific pair were present in the house with this code:
obs_tbl %>%
group_by(Alice, Bob) %>%
summarise(n())
However, there are 2 flaws with this approach:
1. Each pair is being put in manually, when every combination I need to check is already in combo_tbl.
2. The output automatically contains the intersection of every combination, i.e. I get the count of rows where both Alice and Bob == 1, but also where Alice == 1 and Bob == 0, Alice == 0 and Bob == 1, etc.
The ideal end result would be an output like so:
Alice | Bob | 2
Alice | Cat | 2
Bob | Cat | 2
i.e. The count of co-habitation days per pair.
A perfect solution would allow simple modification to change the number of selection within the combination to increase. i.e. each combo_id may have 3 or greater selections, from a larger list than the one given.
So, is it possible on sparklyr to pass a vector of pairs that are iterated through?
How do I only check for where both of my selections are present? Instead of a vectorised group_by should I use a vectorised filter?
I've read about quosures and standard evaluation in the tidyverse. Is that the solution to this if running locally? And if so is this supported by spark?
For reference, I have a relatively similar solution using data.table that can be run on a single-machine, non-spark context. Some pseudo code:
combo_dt[, obs_dt[get(tolower(selection_1)) == "1" &
get(tolower(selection_2)) == "1"
, .N], by = combo_id]
This nested process effectively splits each combination into its own sub-table (by = combo_id), filters that sub-table to the rows where selection_1 and selection_2 are both 1, applies .N to count those rows, and then aggregates the output.
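To make the intended result concrete, here is a small local sketch in base R (plain data frames, not Spark and not sparklyr) of the count I am after. It treats every column whose name starts with "selection_" as a member of the combination, so it also covers combinations of 3 or more; the names combo, obs, sel_cols and n_days are only illustrative.

```r
combo <- data.frame(
  combo_id    = c("combo1", "combo2", "combo3"),
  selection_1 = c("Alice", "Alice", "Bob"),
  selection_2 = c("Bob", "Cat", "Cat"),
  stringsAsFactors = FALSE
)

obs <- data.frame(
  obs_day = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
  Alice   = c(1, 1, 0, 1, 0, 1, 0),
  Bob     = c(1, 1, 1, 0, 0, 0, 0),
  Cat     = c(0, 1, 1, 1, 1, 0, 0)
)

# All selection columns, however many there are
sel_cols <- grep("^selection_", names(combo), value = TRUE)

# For each combination, count the days on which every selected member is present
combo$n_days <- vapply(seq_len(nrow(combo)), function(i) {
  members <- as.character(unlist(combo[i, sel_cols]))
  sum(rowSums(obs[, members, drop = FALSE] == 1) == length(members))
}, integer(1))

combo[, c(sel_cols, "n_days")]
```

This yields 2 days for each of the three pairs, matching the ideal output above. Whether the same per-combination loop can be expressed efficiently on Spark is exactly what I am unsure about.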
I have a data frame called test.data with a column called Ethnicity. There are three ethnic groups (more in the actual data): Adygei, Balochi and Biaka_Pygmies. I want to subset this data frame so that it includes only two randomly chosen samples (rows) from each ethnic group. How can I do this in R?
test.data <- structure(list(Sample = c("1793102418_A", "1793102460_A", "1793102500_A",
"1793102576_A", "1749751113_A", "1749751187_A", "1749751189_A",
"1749751285_A", "1749751356_A", "1749751195_A", "1749751218_A",
"1775705355_A"), Ethnicity = c("Adygei", "Adygei", "Adygei",
"Adygei", "Balochi", "Balochi", "Balochi", "Balochi", "Balochi",
"Biaka_Pygmies", "Biaka_Pygmies", "Biaka_Pygmies"), Height = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("Sample", "Ethnicity",
"Height"), row.names = c("1793102418_A", "1793102460_A", "1793102500_A",
"1793102576_A", "1749751113_A", "1749751187_A", "1749751189_A",
"1749751285_A", "1749751356_A", "1749751195_A", "1749751218_A",
"1775705355_A"), class = "data.frame")
result
Sample Ethnicity Height
1793102418_A 1793102418_A Adygei 0
1793102460_A 1793102460_A Adygei 0
1749751189_A 1749751189_A Balochi 0
1749751285_A 1749751285_A Balochi 0
1749751195_A 1749751195_A Biaka_Pygmies 0
1775705355_A 1775705355_A Biaka_Pygmies 0
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(test.data)), grouped by 'Ethnicity', we sample the sequence of rows and subset the rows based on that.
setDT(test.data)[, .SD[sample(1:.N,2)], Ethnicity]
Or using tapply from base R
test.data[ with(test.data, unlist(tapply(seq_len(nrow(test.data)),
Ethnicity, FUN = sample, 2))), ]
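One caveat with passing sample directly: when a group has a single row, sample(x, 2) treats the lone row number x as the range 1:x, and groups with fewer than 2 rows cannot supply 2 rows anyway. A hedged base R sketch that samples row positions with sample.int and caps the sample at the group size is shown below; the helper name sample_per_group and the toy data are assumptions for illustration.

```r
# Draw up to n random rows from each group, returned in original row order
sample_per_group <- function(data, group, n) {
  rows_by_group <- split(seq_len(nrow(data)), data[[group]])
  keep <- lapply(rows_by_group, function(rows) {
    # Sample positions within the group, never more than the group holds
    rows[sample.int(length(rows), min(n, length(rows)))]
  })
  data[sort(unlist(keep, use.names = FALSE)), , drop = FALSE]
}

toy <- data.frame(
  Sample    = paste0("S", 1:6),
  Ethnicity = c("Adygei", "Adygei", "Adygei", "Balochi", "Balochi", "Biaka_Pygmies")
)

set.seed(42)  # only to make the draw repeatable
picked <- sample_per_group(toy, "Ethnicity", 2)
table(picked$Ethnicity)
```

Here the two larger groups contribute 2 rows each and the single-row group contributes its one row.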
I am new to R. I have daily data and want to separate the months whose mean is less than 1 from the rest of the data, and then do something with the daily data of the remaining months (those with a mean greater than 1). The important thing is not to touch the daily values of months whose monthly mean is less than 1.
I have used aggregate(file,as.yearmon,mean) to get the monthly means, but I am failing to grasp how to use them to exclude a specific month's daily values from the analysis. Any suggestion to get started would be highly appreciated.
I have reproduced data using a small subset of it and dput:
structure(list(V1 = c(0, 0, 0, 0.43, 0.24, 0, 1.06, 0, 0, 0, 1.57, 1.26, 1.34, 0, 0, 0, 2.09, 0, 0, 0.24)), .Names = "V1", row.names = c(NA, 20L), class = "data.frame")
A snippet of code I am using:
library(zoo)
file <- read.table("text.txt")
x_daily <- zooreg(file, start=as.Date("2000-01-01"))
x1_daily <- x_daily[]
con_daily <- subset(x1_daily, aggregate(x1_daily,as.yearmon,mean) > 1 )
Let's create some sample data:
feb2012 <- data.frame(year=2012, month=2, day=1:28, data=rnorm(28))
feb2013 <- data.frame(year=2013, month=2, day=1:28, data=rnorm(28) + 10)
jul2012 <- data.frame(year=2012, month=7, day=1:31, data=rnorm(31) + 10)
jul2013 <- data.frame(year=2013, month=7, day=1:31, data=rnorm(31) + 10)
d <- rbind(feb2012, feb2013, jul2012, jul2013)
You can get an aggregate of the data column by month like this:
> a <- aggregate(d$data, list(year=d$year, month=d$month), mean)
> a
year month x
1 2012 2 0.09704817
2 2013 2 9.93354271
3 2012 7 10.19073868
4 2013 7 9.78324133
Perhaps not the best way, but an easy way to filter the d data frame by the mean of the corresponding year and month is to work with a temporary data frame that merges d and a, like this:
work <- merge(d, a)
subset(work, x > 1)
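If you would rather avoid the merge, a similar result can be sketched with ave(), which attaches the group mean to every daily row. The toy data below (constant values for January and February 2000) and the column names date, V1, month and monthly_mean are assumptions made only for illustration.

```r
# Toy daily series: January 2000 has mean 0.5, February 2000 has mean 2
d <- data.frame(
  date = seq(as.Date("2000-01-01"), by = "day", length.out = 60),
  V1   = c(rep(0.5, 31), rep(2, 29))
)

# Label each day with its year-month and attach that month's mean to every row
d$month        <- format(d$date, "%Y-%m")
d$monthly_mean <- ave(d$V1, d$month, FUN = mean)

# Days from months with mean > 1 go on to the analysis; the rest stay untouched
to_process <- d[d$monthly_mean > 1, ]
untouched  <- d[d$monthly_mean <= 1, ]
```

After processing to_process, the two pieces can be put back together with rbind() and sorted by date.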
I hope this will help you get started!