Left join with multiple conditions in R

I'm trying to replace ids with their respective values. The problem is that each id maps to a different value depending on the type in the previous column, like this:
>df
type id
1 q1 1
2 q1 2
3 q2 1
4 q2 3
5 q3 1
6 q3 2
Here are the ids for each type with their values:
>q1
id value
1 1 yes
2 2 no
>q2
id value
1 1 one hour
2 2 two hours
3 3 more than two hours
>q3
id value
1 1 blue
2 2 yellow
I've tried something like this:
df <- left_join(subset(df, type %in% c("q1")), q1, by = "id")
But it removes the other values.
I'd like to know how to do this in one line (or close to it), because there are more than 20 of these lookup tables with type descriptions.
Any ideas on how to do it?
This is the df i'm expecting:
>df
type id value
1 q1 1 yes
2 q1 2 no
3 q2 1 one hour
4 q2 3 more than two hours
5 q3 1 blue
6 q3 2 yellow

You can join on more than one variable. The example df you give would actually make a suitable lookup table for this:
value_lookup <- data.frame(
type = c('q1', 'q1', 'q2', 'q2', 'q3', 'q3'),
id = c(1, 2, 1, 3, 1, 2),
value = c('yes', 'no', 'one hour', 'more than two hours', 'blue', 'yellow')
)
Then you just merge on both type and id:
df <- left_join(df, value_lookup, by = c('type', 'id'))
Usually when I need a lookup table like that I store it in a CSV rather than write it all out in the code, but do whatever suits you.
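If the per-type tables already exist as separate data frames, the combined lookup does not have to be typed out by hand. A minimal sketch in base R, assuming the tables are named q1, q2 and q3 as in the question (the list name `lookups` is made up for illustration):

```r
# Example data copied from the question
df <- data.frame(type = c("q1", "q1", "q2", "q2", "q3", "q3"),
                 id   = c(1, 2, 1, 3, 1, 2))
q1 <- data.frame(id = 1:2, value = c("yes", "no"))
q2 <- data.frame(id = 1:3,
                 value = c("one hour", "two hours", "more than two hours"))
q3 <- data.frame(id = 1:2, value = c("blue", "yellow"))

# Hypothetical named list holding every per-type table
lookups <- list(q1 = q1, q2 = q2, q3 = q3)

# Stack the tables into one lookup, tagging each row with its type
value_lookup <- do.call(rbind, lapply(names(lookups), function(nm) {
  cbind(type = nm, lookups[[nm]])
}))

# Left join on both keys (all.x = TRUE keeps every row of df)
result <- merge(df, value_lookup, by = c("type", "id"), all.x = TRUE)
print(result)
```

With more than 20 tables, only the `lookups` list needs to grow; the join itself stays the same.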

tempList = split(df, df$type)
do.call(rbind,
lapply(names(tempList), function(nm)
merge(tempList[[nm]], get(nm))))
# id type value
#1 1 q1 yes
#2 2 q1 no
#3 1 q2 one hour
#4 3 q2 more than two hours
#5 1 q3 blue
#6 2 q3 yellow

Collect the 'q\d+' data frames into a named list with mget, bind them into a single data frame with bind_rows (using .id to create a 'type' column from the list names), and then right_join with the dataset 'df':
library(tidyverse)
mget(paste0("q", 1:3)) %>%
bind_rows(.id = 'type') %>%
right_join(df)
# type id value
#1 q1 1 yes
#2 q1 2 no
#3 q2 1 one hour
#4 q2 3 more than two hours
#5 q3 1 blue
#6 q3 2 yellow

You can do it by a series of left joins:
df1 = left_join(df, q1, by='id') %>% filter(type=="q1")
> df1
type id value
1 q1 1 yes
2 q1 2 no
df2 = left_join(df, q2, by='id') %>% filter(type=="q2")
> df2
type id value
1 q2 1 one hour
2 q2 3 more than two hours
df3 = left_join(df, q3, by='id') %>% filter(type=="q3")
> df3
type id value
1 q3 1 blue
2 q3 2 yellow
> rbind(df1,df2,df3)
type id value
1 q1 1 yes
2 q1 2 no
3 q2 1 one hour
4 q2 3 more than two hours
5 q3 1 blue
6 q3 2 yellow
One liner would be:
rbind(left_join(df, q1, by='id') %>% filter(type=="q1"),
left_join(df, q2, by='id') %>% filter(type=="q2"),
left_join(df, q3, by='id') %>% filter(type=="q3"))
If you have more lookup tables, you should probably loop through their names, running left_join for each one and appending the result with bind_rows:
vecQs = c(paste("q", seq(1,3,1),sep="")) #Types of variables q1, q2 ...
result = tibble()
#Execute left_join for the types and store it in result.
for(i in vecQs) {
result = bind_rows(result, left_join(df,eval(as.symbol(i)) , by='id') %>% filter(type==!!i))
}
This will give:
> result
# A tibble: 6 x 3
type id value
<chr> <int> <chr>
1 q1 1 yes
2 q1 2 no
3 q2 1 one hour
4 q2 3 more than two hours
5 q3 1 blue
6 q3 2 yellow

Related

How to aggregate in R where the values become the column names [duplicate]

I'm fairly new to R and using the dplyr package currently. I have a dataframe that looks something like this simplified table:
year category
2009 A
2009 B
2009 B
2010 A
2010 B
2011 A
2011 C
2011 C
I want to count the categories for each year, so I used:
df %>% count(year, category)
and got
year category count
2009 A 1
2009 B 2
2010 A 1
2010 B 1
2011 A 1
2011 C 2
However I would like to use the year as column names, to get the following:
2009 2010 2011
A 1 1 1
B 2 1 0
C 0 0 2
What is an easy way to get this? I would like to get this in absolute numbers, and if possible as a normalized table (percentages of the total of each year).
I hope you guys can help me out!
df %>% count(year, category) %>%
pivot_wider(
id_cols = category,
names_from = year,
names_prefix = "year_",
values_from = n,
values_fill = 0
)
# A tibble: 3 x 4
category year_2009 year_2010 year_2011
<chr> <int> <int> <int>
1 A 1 1 1
2 B 2 1 0
3 C 0 0 2
Using reshape:
df2 = df %>% count(year, category)
df2 = reshape(df2, idvar='category', timevar='year', direction='wide')
rownames(df2) = df2$category
df2[is.na(df2)] = 0
df2 = df2[,c(2:4)]
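The question also asks for a normalized table (percentages of each year's total). A minimal base R sketch of that step, using the example data from the question; `counts` and `shares` are names chosen here for illustration:

```r
# Example data copied from the question
df <- data.frame(year = c(2009, 2009, 2009, 2010, 2010, 2011, 2011, 2011),
                 category = c("A", "B", "B", "A", "B", "A", "C", "C"))

# Cross-tabulate: categories as rows, years as columns (absolute counts)
counts <- table(df$category, df$year)

# Normalize within each year: margin = 2 makes every column sum to 1
shares <- prop.table(counts, margin = 2)

print(counts)
print(round(100 * shares, 1))  # percentages of each year's total
```

Combinations that never occur, such as category C in 2009, come out as 0 without any extra fill step.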

How to drop NA's out of the summarise(count = n()) function in R?

I have a dataset containing 4 organisation units (org_unit) with different number of participants and 2 Questions (Q1,Q2) on a 2-degree scale (1:2). I want to know how many people per unit answered the respective question with [1] and divide them by the total number of participants / unit.
Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(-9,-9,-9,-9,-9,-9,-9,-9,-9,-9)
The problem is, my Q2 only consists of [-9] which stands for non-response. I therefore assigned NA to [-9].
DF <- data.frame(Org_unit, Q1, Q2)
DF[DF == -9] <- NA
DF
Org_unit Q1 Q2
1 1 1 NA
2 1 2 NA
3 1 1 NA
4 1 2 NA
5 2 1 NA
6 2 2 NA
7 2 1 NA
8 3 2 NA
9 3 1 NA
10 4 2 NA
Next I calculated the proportion of people who answered Q1 with [1], which works fine.
prop_q1 <- DF %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q1 == 1))
prop_q1
# A tibble: 4 x 3
Org_unit count prop
<dbl> <int> <dbl>
1 1 4 0.5
2 2 3 0.667
3 3 2 0.5
4 4 1 0
When I run the same code for Q2, however, I get the same number of members per unit (count = c(4,3,2,1)), although nobody answered the question. I don't want them to be registered as participants, since they technically didn't participate in the study.
prop_q2 <- DF %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q2 == 1))
prop_q2
# A tibble: 4 x 3
Org_unit count prop
<dbl> <int> <dbl>
1 1 4 NA
2 2 3 NA
3 3 2 NA
4 4 1 NA
Is there a way to calculate the right number of members per unit when NAs (the former [-9] values) are present?
Thanks!
Would
prop_q2 <- DF %>%
filter(!is.na(Q2)) %>%
group_by(Org_unit) %>%
summarise(count = n(),
prop = mean(Q2 == 1))
do the job?
Given that you want to do this across multiple columns, I think that using across() within the dplyr verbs will be better for you. I explain the solution below.
Org_unit <- c(1,1,1,1,2,2,2,3,3,4)
Q1 <- c(1,2,1,2,1,2,1,2,1,2)
Q2 <- c(1,-9,-9,-9,-9,-9,-9,-9,-9,-9) #Note one response
df <- tibble(Org_unit, Q1, Q2)
df %>%
mutate(across(starts_with("Q"), ~na_if(., -9))) %>%
group_by(Org_unit) %>%
summarize(across(starts_with("Q"),
list(
N = ~sum(!is.na(.)),
prop = ~sum(. == 1, na.rm = TRUE)/sum(!is.na(.)))
))
# A tibble: 4 x 5
Org_unit Q1_N Q1_prop Q2_N Q2_prop
* <dbl> <int> <dbl> <int> <dbl>
1 1 4 0.5 1 1
2 2 3 0.667 0 NaN
3 3 2 0.5 0 NaN
4 4 1 0 0 NaN
First, we take the data frame (which I created as a tibble) and substitute NA for all values that equal -9 for all columns that start with a capital "Q". This converts all question columns to have NAs in place of -9s.
Second, we group by the organizational unit and then summarize using two functions. The first counts the responses to each question that are not NA; the suffix _N is appended to the names of these columns. The second calculates the proportion of 1s among the non-NA responses; these columns get the suffix _prop. A unit with no valid responses has N = 0 and a proportion of NaN (0/0).
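For comparison, the same two summaries can be sketched in base R with tapply. This uses the modified Q2 from the answer above (one valid response); the names `n_valid`, `prop_one` and `summary_q2` are illustrative only:

```r
Org_unit <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4)
Q2 <- c(1, -9, -9, -9, -9, -9, -9, -9, -9, -9)  # one real response

# Recode the non-response marker as NA
Q2[Q2 == -9] <- NA

# Number of valid (non-NA) responses per unit
n_valid <- tapply(!is.na(Q2), Org_unit, sum)

# Share of valid responses equal to 1; units with no valid response give NaN
prop_one <- tapply(Q2 == 1, Org_unit, mean, na.rm = TRUE)

summary_q2 <- data.frame(Org_unit = as.numeric(names(n_valid)),
                         N = as.vector(n_valid),
                         prop = as.vector(prop_one))
print(summary_q2)
```

Unlike filtering out the NA rows before grouping, this keeps every unit in the result, including units where nobody answered.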

Count the number of columns in a row with a specific value

I have a data set with numerical responses to several questions. I would like to know the number of times a person answers a question with a value of 1,2...
Here is an example of the data:
df=data.frame("Person"=c("person a", "person b"),
"Q1"=c(2,2),"Q2"=c(1,2),"Q3"=c(1,1))
Which looks like this:
Person Q1 Q2 Q3
person a 2 1 1
person b 2 2 1
I would like this and would prefer to use dplyr:
Person Q1 Q2 Q3 Total.1 Total.2
person a 2 1 1 2 1
person b 2 2 1 1 2
The base R approach suggested by @dww is quite simple and straightforward. However, if you prefer a dplyr approach, we can use rowwise and do to count the occurrences of 1 and 2 respectively.
library(dplyr)
df %>%
rowwise() %>%
do( (.) %>% as.data.frame %>%
mutate(Total.1 = sum(.==1),
Total.2 = sum(.==2)))
# Person Q1 Q2 Q3 Total.1 Total.2
# <fct> <dbl> <dbl> <dbl> <int> <int>
#1 person a 2 1 1 2 1
#2 person b 2 2 1 1 2
A base R approach using apply
df[c("Total.1", "Total.2")] <- t(apply(df, 1, function(x) c(sum(x==1), sum(x==2))))
df
# Person Q1 Q2 Q3 Total.1 Total.2
#1 person a 2 1 1 2 1
#2 person b 2 2 1 1 2
No need for dplyr. In base R it is quite simple
df = cbind(df, Total.1 = rowSums(df[,-1]==1), Total.2 = rowSums(df[,-1]==2))
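The rowSums idea extends to any set of answer values without writing one term per value. A sketch, assuming the answer columns are everything except the first column; the vector `values` is a hypothetical parameter:

```r
df <- data.frame(Person = c("person a", "person b"),
                 Q1 = c(2, 2), Q2 = c(1, 2), Q3 = c(1, 1))

answers <- df[, -1]   # question columns only
values  <- 1:2        # hypothetical: the answer values to count

# One column of per-row counts for each value
totals <- sapply(values, function(v) rowSums(answers == v))
colnames(totals) <- paste0("Total.", values)

result <- cbind(df, totals)
print(result)
```

Note that sapply only returns a matrix here because df has more than one row; a single-row data frame would need the result reshaped before cbind.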
Here is one option with tidyverse
library(tidyverse)
df %>%
mutate(Total = pmap(.[-1], ~
c(...) %>%
paste0("Total.", .) %>%
table %>%
as.list %>%
as_tibble)) %>%
unnest(Total)
# Person Q1 Q2 Q3 Total.1 Total.2
#1 person a 2 1 1 2 1
#2 person b 2 2 1 1 2
Or another way is
df %>%
mutate(Total = pmap(.[-1], ~
c(...) %>%
table %>%
toString)) %>%
separate(Total, into = c("Total.1", "Total.2"))
# Person Q1 Q2 Q3 Total.1 Total.2
#1 person a 2 1 1 2 1
#2 person b 2 2 1 1 2

identifying location of NA values in a data frame by ID (not row number) and column name

I have a survey where some questions were not answered by some participants. Here is a simplified version of my data
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA),
Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
df
I would like to see which ID numbers did not answer which questions. The following code is very close to the output I want but identifies the subject by row number - I would like the subject identified by ID number
table(data.frame(which(is.na(df), arr.ind=TRUE)))
Right now the output shows that rows 1, 3 and 5 did not answer at least one question, and it identifies the column with the missing value. I would like it to show me the same thing but with ID numbers 12, 14 and 16. It would be a bonus to have the column names (e.g. Q1, Q2, Q3) in the output as well, instead of the column numbers.
We can get the names of the columns that are NA in each row using apply, collapse them into a comma-separated string, and attach that to a new data frame along with its ID.
new_df <- data.frame(ID =df$ID, ques = apply(df, 1, function(x)
paste0(names(which(is.na(x))), collapse = ",")))
new_df
# ID ques
#1 12 Q3
#2 13
#3 14 Q2
#4 15
#5 16 Q1,Q2
Similar equivalent would be
new_df <- data.frame(ID = df$ID, ques = apply(is.na(df), 1, function(x)
paste0(names(which(x)), collapse = ",")))
In base R:
res <- df[!complete.cases(df),]
res[-1] <- as.numeric(is.na(res[-1]))
res
# ID Q1 Q2 Q3
# 12 12 0 0 1
# 14 14 0 1 0
# 16 16 1 1 0
If you wish to avoid apply type operations and continue from which(..., T), you can do something like the following:
tmp <- data.frame(which(is.na(df[, 2:4]), T))
# change to character
tmp[, 2] <- paste0('Q', tmp[, 2])
# gather column numbers together for each row number
tmp_split <- split(tmp[, 2], tmp[, 1])
# preallocate new column in df
df$missing <- vector('list', 5)
df$missing[as.numeric(names(tmp_split))] <- tmp_split
This produces
> df
ID Q1 Q2 Q3 missing
1 12 a a <NA> Q3
2 13 b a a NULL
3 14 a <NA> a Q2
4 15 a b a NULL
5 16 <NA> <NA> b Q1, Q2
You can convert the data to long format using tidyr::gather, filter for the missing answers, and finally summarise with toString:
library(tidyverse)
df %>% gather(Question, Ans, -ID) %>%
filter(is.na(Ans)) %>%
group_by(ID) %>%
summarise(NotAnswered = toString(Question))
# # A tibble: 3 x 2
# ID NotAnswered
# <int> <chr>
# 1 12 Q3
# 2 14 Q2
# 3 16 Q1, Q2
If the OP wants to include all IDs in the result, the solution can be:
df %>% gather(Question, Ans, -ID) %>%
group_by(ID) %>%
summarise(NotAnswered = toString(Question[is.na(Ans)])) %>%
as.data.frame()
# ID NotAnswered
# 1 12 Q3
# 2 13
# 3 14 Q2
# 4 15
# 5 16 Q1, Q2
How's this with tidyverse:
data:
library(tidyverse)
df <- data.frame(ID = c(12:16), Q1 = c("a","b","a","a",NA), Q2 = c("a","a",NA,"b",NA), Q3 = c(NA,"a","a","a","b"))
code:
x <- df %>% filter(is.na(Q1) | is.na(Q2) | is.na(Q3)) # keep only rows with at least one NA
y <- cbind(x %>% select(ID),
x %>% select(Q1, Q2, Q3) %>% sapply(., function(x) ifelse(is.na(x), 1, 0))
) # in 1/0 format
output:
x:
ID Q1 Q2 Q3
1 12 a a <NA>
2 14 a <NA> a
3 16 <NA> <NA> b
y:
ID Q1 Q2 Q3
1 12 0 0 1
2 14 0 1 0
3 16 1 1 0
My attempt is no better than any already offered, but it's a fun problem, so here's mine. Because why not?:
library( magrittr )
df$ques <- df %>%
is.na() %>%
apply( 1, function(x) {
x %>%
which() %>%
names() %>%
paste0( collapse = "," )
} )
df
# ID Q1 Q2 Q3 ques
# 1 12 a a <NA> Q3
# 2 13 b a a
# 3 14 a <NA> a Q2
# 4 15 a b a
# 5 16 <NA> <NA> b Q1,Q2
Most of the answer comes from your question:
df[which(is.na(df), arr.ind=TRUE)[,1],]
# ID Q1 Q2 Q3
# 5 16 <NA> <NA> b
# 3 14 a <NA> a
# 5.1 16 <NA> <NA> b
# 1 12 a a <NA>
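The row and column positions returned by which(..., arr.ind = TRUE) can also be translated directly into IDs and column names, which avoids the duplicated rows above. A minimal sketch using the data from the question; `pos` and `missing` are names chosen here:

```r
df <- data.frame(ID = 12:16,
                 Q1 = c("a", "b", "a", "a", NA),
                 Q2 = c("a", "a", NA, "b", NA),
                 Q3 = c(NA, "a", "a", "a", "b"))

# One (row, col) pair per missing cell
pos <- which(is.na(df), arr.ind = TRUE)

# Look up the ID for each row index and the column name for each column index
missing <- data.frame(ID = df$ID[pos[, "row"]],
                      question = names(df)[pos[, "col"]])
missing <- missing[order(missing$ID, missing$question), ]
rownames(missing) <- NULL

print(missing)
print(table(missing))  # same shape as the table() call in the question
```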

R table from rows

I have some data frames which hold the results of a survey. The first frame lists the question ids (q_id) for each question in the survey:
q_id
1 q1
2 q2
3 q3
The second data frame holds responses (res) for each subject (s_id) for every question that subject responded to. A subject can skip questions:
s_id q_id res
1 1 q1 a
2 2 q1 b
3 1 q2 b
What I want to generate is a table which shows the responses to each question, where the columns are the question ids and each row represents a subject. In the above examples, the table would look like this:
q1 q2 q3
1 a b NA
2 b NA NA
What is the best way to generate such a table?
Assuming that your question data.frame is DQ and your answers DT
You need to make sure that your q_id column in your answers has all the levels available
DT$q_id <- factor(as.character(DT$q_id), levels = unique(as.character(DQ$q_id))) # also works when DQ$q_id is a character column rather than a factor
then you can use reshape2 and dcast with drop = FALSE to cast as you wish
library(reshape2)
dcast(DT, s_id~q_id, value.var = 'res', drop = FALSE)
s_id q1 q2 q3
1 1 a b <NA>
2 2 b <NA> <NA>
> dat <- read.table(text=" s_id q_id res
+ 1 1 q1 a
+ 2 2 q1 b
+ 3 1 q2 b", header =TRUE, stringsAsFactors=FALSE)
# Create a dummy entry for each question:
> dat<- rbind(dat, data.frame(s_id=1,q_id=qdat$q_id, res= NA))
> dat
s_id q_id res
1 1 q1 a
2 2 q1 b
3 1 q2 b
4 1 q1 <NA>
5 1 q2 <NA>
6 1 q3 <NA>
> reshape(dat, timevar="q_id", idvar="s_id", direction ="wide")
s_id res.q1 res.q2 res.q3
1 1 a b <NA>
2 2 b <NA> <NA>
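The same table can also be sketched without reshaping by pre-allocating an NA matrix and filling it by name. This assumes the question data frame is called DQ and the responses DT, as in the first answer, and that each subject answers a question at most once:

```r
DQ <- data.frame(q_id = c("q1", "q2", "q3"))
DT <- data.frame(s_id = c(1, 2, 1),
                 q_id = c("q1", "q1", "q2"),
                 res  = c("a", "b", "b"))

subjects  <- as.character(sort(unique(DT$s_id)))
questions <- as.character(DQ$q_id)

# Start with every subject/question cell empty
wide <- matrix(NA_character_,
               nrow = length(subjects), ncol = length(questions),
               dimnames = list(subjects, questions))

# Fill the answered cells, indexing by (subject name, question name)
wide[cbind(as.character(DT$s_id), as.character(DT$q_id))] <- as.character(DT$res)

print(wide)
```

Because the columns come from DQ rather than from the responses, q3 appears even though nobody answered it.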
