I have a messy dataset (from a CATI survey). I am struggling to prepare and tidy it because it mixes interviewee/partner/child fields, and I have to deal with doublets (pairs of similar questions) spread across the columns.
For example, a chunk of the data for gender looks like this (1 = male, 2 = female):
# A tibble: 7 x 7
Household_size q_1 q_2 q_3 q_4 q_5 q_6
<int> <int> <int> <int> <int> <int> <int>
1 3 1 2 1 NA NA NA
2 2 2 1 NA NA NA NA
3 5 1 2 1 1 2 NA
4 3 2 2 1 NA NA NA
5 6 2 1 1 1 1 1
6 5 1 2 1 2 2 NA
7 3 1 2 2 NA NA NA
Metadata says :
q_1 is interviewee gender
q_2 is interviewee - partner gender (if there is any)
q_3:q_6 interviewee - kid gender (if there is any)
The data has the same format for education, occupation etc (pair of identical questions for interviewee /partner/kid).
How can I tidy up this dataset so that I can easily calculate summary statistics or make visualizations? I would like to have something like this (total number of males and females in the survey, regardless of age):
Male 15
Female 12
The table function in base R might be what you are looking for. It is a versatile option that counts every level it finds:
table(unlist(df1[,c(2:7)]))
Alter the dataframe name (df1) and the column numbers c(2:7) to suit your needs.
This replicates your example too:
df1 <- data.frame("v" = LETTERS[1:7], "q1" = c(1,2,1,2,2,1,1), "q2" = c(2,1,2,2,1,2,2), "q3" = c(1,NA,1,1,1,1,2), "q4" = c(NA, NA,1,NA,1,2,NA), "q5" = c(NA, NA,2,NA,1,2,NA), "q6" = c(NA, NA,NA,NA,1,NA,NA))
> table(unlist(df1[,c(2:7)]))
1 2
15 12
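If labelled counts are preferred over the raw codes 1 and 2, the same base R idea can be sketched by converting the codes to a factor first. This is only an illustration: the labels "Male" and "Female" come from the coding given in the question, and the names codes, gender and gender_table are made up for this example.

```r
# Same example data as above
df1 <- data.frame("v" = LETTERS[1:7],
                  "q1" = c(1, 2, 1, 2, 2, 1, 1),
                  "q2" = c(2, 1, 2, 2, 1, 2, 2),
                  "q3" = c(1, NA, 1, 1, 1, 1, 2),
                  "q4" = c(NA, NA, 1, NA, 1, 2, NA),
                  "q5" = c(NA, NA, 2, NA, 1, 2, NA),
                  "q6" = c(NA, NA, NA, NA, 1, NA, NA))

# Flatten the six gender columns into one vector of codes
codes <- unlist(df1[, 2:7])

# Attach labels to the codes (1 = male, 2 = female, as stated in the question)
gender <- factor(codes, levels = c(1, 2), labels = c("Male", "Female"))

# table() ignores NA by default, so missing partners/kids are not counted
gender_table <- table(gender)
print(gender_table)
```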
Some more examples:
df1 <- data.frame("v" = LETTERS[1:5], "q1" = c(1,2,6,1,1), "q2" = c("k","k","f","h","p"), "q3" = c(1,2,NA,1,NA))
> df1
v q1 q2 q3
1 A 1 k 1
2 B 2 k 2
3 C 6 f NA
4 D 1 h 1
5 E 1 p NA
table(unlist(df1[,c(2,4)]))
table(unlist(df1[,3]))
> table(unlist(df1[,c(2,4)]))
1 2 6
5 2 1
> table(unlist(df1[,3]))
f h k p
1 1 2 1
It's straightforward if you put the data into a long format, filter out the NAs, make gender into a factor, and tally up the counts. I'm using fct_recode from forcats (ships with tidyverse), but you can also change the labels of factor levels in base R.
library(tidyverse)
df %>%
gather(key = person, value = gender, -Household_size) %>%
filter(!is.na(gender)) %>%
mutate(gender_fct = as.factor(gender) %>% forcats::fct_recode("Male" = "1", "Female" = "2")) %>%
count(gender_fct)
#> # A tibble: 2 x 2
#> gender_fct n
#> <fct> <int>
#> 1 Male 15
#> 2 Female 12
Created on 2018-05-05 by the reprex package (v0.2.0).
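gather() has since been superseded by pivot_longer() in tidyr, so a rough equivalent of the pipeline above with the newer function might look like the sketch below. The data frame is retyped from the question, and the role column (interviewee / partner / kid) is an assumption based on the metadata described there, added only to show how per-role summaries could follow.

```r
library(dplyr)
library(tidyr)

# Example data retyped from the question
df <- tibble(
  Household_size = c(3L, 2L, 5L, 3L, 6L, 5L, 3L),
  q_1 = c(1L, 2L, 1L, 2L, 2L, 1L, 1L),
  q_2 = c(2L, 1L, 2L, 2L, 1L, 2L, 2L),
  q_3 = c(1L, NA, 1L, 1L, 1L, 1L, 2L),
  q_4 = c(NA, NA, 1L, NA, 1L, 2L, NA),
  q_5 = c(NA, NA, 2L, NA, 1L, 2L, NA),
  q_6 = c(NA, NA, NA, NA, 1L, NA, NA)
)

long <- df %>%
  # one row per person; values_drop_na removes absent partners/kids
  pivot_longer(cols = starts_with("q_"), names_to = "person",
               values_to = "gender", values_drop_na = TRUE) %>%
  mutate(
    # hypothetical role labels, following the metadata in the question
    role = case_when(person == "q_1" ~ "interviewee",
                     person == "q_2" ~ "partner",
                     TRUE ~ "kid"),
    gender = factor(gender, levels = c(1, 2), labels = c("Male", "Female"))
  )

# Overall counts, regardless of role
gender_counts <- count(long, gender)

# Counts split by role, if that breakdown is needed
by_role <- count(long, role, gender)
```

The same long table can be reused for education or occupation if those blocks follow the same q_1 ... q_6 layout, which is an assumption here.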
I'm trying to use the following function to iterate through a dataframe and return the counts from each row:
library(dplyr)
library(tidyr)
row_freq <- function(df_input,row_input){
print(df_input)
vec <- unlist(df_input %>%
select(-1) %>%
slice(row_input), use.names = FALSE)
r <- data.frame(table(vec)) %>%
pivot_wider(values_from = Freq, names_from = vec)
return(r)
}
This works fine if I use a single row from the dataframe:
sample_df <- data.frame(id = c(1,2,3,4,5), obs1 = c("A","A","B","B","B"),
obs2 = c("B","B","C","D","D"), obs3 = c("A","B","A","D","A"))
row_freq(sample_df, 1)
id obs1 obs2 obs3
1 1 A B A
2 2 A B B
3 3 B C A
4 4 B D D
5 5 B D A
# A tibble: 1 × 2
A B
<int> <int>
1 2 1
But when iterating over rows using purrr::map_dfr, it seems to reduce df_input to only the id column instead of using the entire dataframe as the argument, which I found quite strange:
purrr::map_dfr(sample_df, row_freq, 1:5)
[1] 1 2 3 4 5
Error in UseMethod("select") :
no applicable method for 'select' applied to an object of class "c('double', 'numeric')"
I'm looking for help with regards to 1) why this is happening, 2) how to fix it, and 3) any alternative approaches or functions that may already perform what I'm trying to do in a more efficient manner.
Pass the arguments in the correct order when they are not named:
purrr::map_dfr(1:5, ~ row_freq(sample_df, .x))
-output
# A tibble: 5 × 4
A B C D
<int> <int> <int> <int>
1 2 1 NA NA
2 1 2 NA NA
3 1 1 1 NA
4 NA 1 NA 2
5 1 1 NA 1
Or use a named argument
purrr::map_dfr(df_input = sample_df, .f = row_freq, .x = 1:5)
-output
# A tibble: 5 × 4
A B C D
<int> <int> <int> <int>
1 2 1 NA NA
2 1 2 NA NA
3 1 1 1 NA
4 NA 1 NA 2
5 1 1 NA 1
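The reason is that the first argument of map is .x
map(.x, .f, ...)
so if we supply sample_df first, it is taken as .x and map loops over the columns of the data (for a data.frame/tibble/data.table the unit of iteration is the column, as these are lists with additional attributes). The 1:5 is then passed on through ... to row_freq, which therefore receives a single column (here id) as df_input, and select() fails on it.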
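To illustrate the point about columns being the unit of iteration, and as one possible answer to the request for an alternative, here is a small sketch. It first shows that map visits each column of sample_df, then counts values per row without any explicit loop by reshaping to long format. The names col_classes and row_counts are made up for this example, and the reshaping approach is a suggestion, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)

sample_df <- data.frame(id = c(1, 2, 3, 4, 5),
                        obs1 = c("A", "A", "B", "B", "B"),
                        obs2 = c("B", "B", "C", "D", "D"),
                        obs3 = c("A", "B", "A", "D", "A"),
                        stringsAsFactors = FALSE)

# A data frame is a list of columns, so map_*() calls the function once per column
col_classes <- map_chr(sample_df, class)
print(col_classes)

# Loop-free alternative: go long, count (id, value) pairs, then go wide again
row_counts <- sample_df %>%
  pivot_longer(cols = -id, names_to = "obs", values_to = "value") %>%
  count(id, value) %>%
  pivot_wider(names_from = value, values_from = n)
print(row_counts)
```

Values that never occur in a row come out as NA, as in the map_dfr output above; pivot_wider's values_fill argument could replace them with 0 if that is preferred.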
R - Count unique/distinct values in two columns together
Hi everyone. I have a panel of electoral behaviour, but I am having trouble computing a new variable that captures the unique values (parties) across my two columns Party and Party2013 per group. The column Party2013 measures the vote in the 2013 election and Party measures voters' intentions after 2013. Every time I try n_distinct or length I get the count of unique values in each column separately, not the combined count.
ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
Based on the example above I normally get the count of 3 instead of the desired 2.
I've tried the following commands but only got the number of separate unique values:
data %>% group_by(ID) %>% distinct(Party, Party2013, .keep_all = TRUE) %>% dplyr::summarise(Party_Party2013 = n())
or
ddply(data, .(ID), mutate, count = length(unique(Party, Party2013)))
The expected outcome would be as follows:
ID Wave Party Party2013 Count
1 1 A A 2
1 2 A NA 2
1 3 B NA 2
1 4 B NA 2
2 1 A C 3
2 2 B NA 3
2 3 B NA 3
2 4 B NA 3
I would very much appreciate any advice on how to count the overall number of unique parties across the two columns per group and not the number of distinct values per each one. Thanks.
You can subset the data from cur_data() and unlist the data to get a vector. Use n_distinct to count number of unique values.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Count = n_distinct(unlist(select(cur_data(),
Party, Party2013)), na.rm = TRUE)) %>%
ungroup
# ID Wave Party Party2013 Count
# <int> <int> <chr> <chr> <int>
#1 1 1 A A 2
#2 1 2 A NA 2
#3 1 3 B NA 2
#4 1 4 B NA 2
#5 2 1 A C 3
#6 2 2 B NA 3
#7 2 3 B NA 3
#8 2 4 B NA 3
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Wave = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L), Party = c("A", "A", "B", "B", "A",
"B", "B", "B"), Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
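As a side note, cur_data() is deprecated as of dplyr 1.1.0 (pick() is its replacement there). A version-independent sketch that avoids both is to concatenate the two columns with c() inside mutate. This assumes both columns have the same type (character here); the name res is made up for the example.

```r
library(dplyr)

df <- data.frame(ID = rep(1:2, each = 4),
                 Wave = rep(1:4, times = 2),
                 Party = c("A", "A", "B", "B", "A", "B", "B", "B"),
                 Party2013 = c("A", NA, NA, NA, "C", NA, NA, NA),
                 stringsAsFactors = FALSE)

res <- df %>%
  group_by(ID) %>%
  # pool both columns into one vector per group, then count distinct non-NA values
  mutate(Count = n_distinct(c(Party, Party2013), na.rm = TRUE)) %>%
  ungroup()
print(res)
```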
In situations like this I always like to simplify the problem and change the data into the long format since it is easier to solve problems like this if all of your values are in one column. With pivot_longer() you can also use the argument values_drop_na = TRUE to drop NAs which were counted in your example:
library(tidyr)
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>% pivot_longer(cols = starts_with("Party"), values_drop_na = TRUE) %>% group_by(ID) %>%
summarise(Count = n_distinct(value)) %>% merge(data, .)
#> ID Wave Party Party2013 Count
#> 1 1 1 A A 2
#> 2 1 2 A <NA> 2
#> 3 1 3 B <NA> 2
#> 4 1 4 B <NA> 2
#> 5 2 1 A C 3
#> 6 2 2 B <NA> 3
#> 7 2 3 B <NA> 3
#> 8 2 4 B <NA> 3
Created on 2021-08-30 by the reprex package (v2.0.1)
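You can also do it this way. Note that paste(Party, Party2013) counts the distinct combinations of the two columns per ID rather than the distinct parties across both, so the counts below (3 and 2) differ from the expected output in the question (2 and 3):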
library(dplyr)
data <- read.table(text =
"ID Wave Party Party2013
1 1 A A
1 2 A NA
1 3 B NA
1 4 B NA
2 1 A C
2 2 B NA
2 3 B NA
2 4 B NA", header = TRUE)
data %>%
group_by(ID) %>%
mutate(Count = paste(Party, Party2013) %>%
unique %>% length() %>%
rep(length(Party)))
output
# A tibble: 8 x 5
# Groups: ID [2]
ID Wave Party Party2013 Count
<int> <int> <chr> <chr> <int>
1 1 1 A A 3
2 1 2 A NA 3
3 1 3 B NA 3
4 1 4 B NA 3
5 2 1 A C 2
6 2 2 B NA 2
7 2 3 B NA 2
8 2 4 B NA 2
I have the following data
df <- tibble(Type=c(1,2,2,1,1,2),ID=c(6,4,3,2,1,5))
Type ID
1 6
2 4
2 3
1 2
1 1
2 5
For each of the type 2 rows, I want to find the IDs of the type 1 rows just below and above them. For the above dataset, the output will be:
Type ID IDabove IDbelow
1 6 NA NA
2 4 6 2
2 3 6 2
1 2 NA NA
1 1 NA NA
2 5 1 NA
Naively, I can write a for loop to achieve this, but that would be too time consuming for the dataset I am dealing with.
One approach uses dplyr's lag and lead to get the previous and next values respectively, and data.table's rleid to create groups of consecutive Type values.
library(dplyr)
library(data.table)
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = rleid(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
# Type ID IDabove IDbelow
# <dbl> <dbl> <dbl> <dbl>
#1 1 6 NA NA
#2 2 4 6 2
#3 2 3 6 2
#4 1 2 NA NA
#5 1 1 NA NA
#6 2 5 1 NA
A dplyr only solution:
You could create your own rleid function, then apply the logic provided by Ronak (many thanks, upvoted).
library(dplyr)
my_func <- function(x) {
x <- rle(x)$lengths
rep(seq_along(x), times=x)
}
# this part is the same as provided by Ronak.
df %>%
mutate(IDabove = ifelse(Type == 2, lag(ID), NA),
IDbelow = ifelse(Type == 2, lead(ID), NA),
grp = my_func(Type)) %>%
group_by(grp) %>%
mutate(IDabove = first(IDabove),
IDbelow = last(IDbelow)) %>%
ungroup() %>%
select(-grp)
Output:
Type ID IDabove IDbelow
<dbl> <dbl> <dbl> <dbl>
1 1 6 NA NA
2 2 4 6 2
3 2 3 6 2
4 1 2 NA NA
5 1 1 NA NA
6 2 5 1 NA
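A different technique from the two answers above, sketched here only as an alternative, is to keep the ID of the type 1 rows, carry it downwards and upwards with tidyr::fill(), and then blank out the type 1 rows again. The helper column type1_id and the result name res are made up for this example.

```r
library(dplyr)
library(tidyr)

df <- tibble(Type = c(1, 2, 2, 1, 1, 2), ID = c(6, 4, 3, 2, 1, 5))

res <- df %>%
  # keep the ID only on type 1 rows; everything else becomes NA
  mutate(type1_id = if_else(Type == 1, ID, NA_real_),
         IDabove = type1_id,
         IDbelow = type1_id) %>%
  # carry the nearest type 1 ID from above down, and from below up
  fill(IDabove, .direction = "down") %>%
  fill(IDbelow, .direction = "up") %>%
  # only type 2 rows should keep the filled values
  mutate(IDabove = if_else(Type == 2, IDabove, NA_real_),
         IDbelow = if_else(Type == 2, IDbelow, NA_real_)) %>%
  select(-type1_id)
print(res)
```

This avoids the run-length grouping step entirely, at the cost of assuming ID is numeric (NA_real_ would need to change for other types).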
Given the following data:
test = data.frame(x = c(NA,1,1,2,3,4),
y = c(NA,1,2,3,4,4))
I want to perform some calculations and store these as new columns. The calculations, however, might result in a variable number of columns. E.g. suppose I want to store, for each row, the column index of the column(s) that contain the row minimum. In row 2, for instance, both columns contain the minimum, hence I need to create two columns.
Using the tidyverse approach, I know I can use the set_names argument when passing my function as a list. But this doesn't work when I don't know the number of columns my calculation will create. See also here: https://community.rstudio.com/t/how-to-handle-lack-of-names-with-unnest-wider/40496
My approach for the calculations:
library(tidyverse)
test %>%
rowwise() %>%
mutate(dist = min(c_across(everything())),
code = list(which(c_across(cols = c(everything(), -dist)) == dist))) %>%
ungroup() %>%
unnest_wider(code)
which automatically names the unnested columns with "...1" and "...2":
# A tibble: 6 x 5
x y dist ...1 ...2
<dbl> <dbl> <dbl> <int> <int>
1 NA NA NA NA NA
2 1 1 1 1 2
3 1 2 1 1 NA
4 2 3 2 1 NA
5 3 4 3 1 NA
6 4 4 4 1 2
But that's not what I want. I also tried to use the names_repair argument within unnest_wider, i.e. unnest_wider(code, names_repair = ~paste0("code", .x)), but this renames all columns.
Any ideas (preferably in the tidyverse approach)? Expected outcome:
# A tibble: 6 x 5
x y dist code_1 code_2
<dbl> <dbl> <dbl> <int> <int>
1 NA NA NA NA NA
2 1 1 1 1 2
3 1 2 1 1 NA
4 2 3 2 1 NA
5 3 4 3 1 NA
6 4 4 4 1 2
EDITED to add an example where one row contains only missings.
Edit 2: this is my current solution. But it is really ugly and requires stopping halfway through. The problem here is that the rename_with function doesn't recognize the on-the-fly generated "length_code" column when I put everything into one pipe.
test2 <- test %>%
rowwise() %>%
mutate(dist = min(c_across(everything())),
code = list(which(c_across(cols = c(everything(), -dist)) == dist)),
length_code = length(code)) %>%
ungroup() %>%
unnest_wider(code)
test3 <- test2 %>%
rename_with(.cols = starts_with("..."), .fn = ~paste0("code_", 1:max(test2$length_code)))
which gives:
# A tibble: 6 x 6
x y dist code_1 code_2 length_code
<dbl> <dbl> <dbl> <int> <int> <int>
1 NA NA NA NA NA 0
2 1 1 1 1 2 2
3 1 2 1 1 NA 1
4 2 3 2 1 NA 1
5 3 4 3 1 NA 1
6 4 4 4 1 2 2
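One possible way to get code_1, code_2, ... in a single pipe is the names_sep argument of unnest_wider(): as far as I know, when the list elements are unnamed, tidyr pastes the column name, the separator and a running index together. The sketch below assumes a tidyr version that supports names_sep in unnest_wider(), and it spells out c(x, y) in c_across() instead of everything() to keep the example simple; res is a made-up name.

```r
library(dplyr)
library(tidyr)

test <- data.frame(x = c(NA, 1, 1, 2, 3, 4),
                   y = c(NA, 1, 2, 3, 4, 4))

res <- test %>%
  rowwise() %>%
  mutate(dist = min(c_across(c(x, y))),
         # indices of the column(s) holding the row minimum; empty if all NA
         code = list(which(c_across(c(x, y)) == dist))) %>%
  ungroup() %>%
  # unnamed elements are numbered, giving code_1, code_2, ...
  unnest_wider(code, names_sep = "_")
print(res)
```

Because the number of code_* columns is driven by the longest list element, no separate length_code column or second pipe should be needed.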
I have a tibble containing time series of various blood parameters like CRP over the course of several days. The tibble is tidy, with each time series in one column, as well as a column for the day of measurement. The tibble contains another column with the day of infection. I want to replace each blood parameter with NA if the Day variable is greater than or equal to InfectionDay. Since I have a lot of variables, I'd like to have a function which accepts the column name dynamically and creates a new column name by appending "_censored" to the old one. I've tried the following:
censor.infection <- function(df, colname){
newcolname <- paste0(colname, "_censored")
return(df %>% mutate(!!newcolname := ifelse( Day < InfectionDay, !!colname, NA)))
}
data = tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1))
data = censor.infection(data, "CRP")
Running this, I expected
# A tibble: 5 x 4
Day InfectionDay CRP CRP_censored
<int> <dbl> <dbl> <chr>
1 1 3 3 3
2 2 3 2 2
3 3 3 5 NA
4 4 3 4 NA
5 5 3 1 NA
but I get
# A tibble: 5 x 4
Day InfectionDay CRP CRP_censored
<int> <dbl> <dbl> <chr>
1 1 3 3 CRP
2 2 3 2 CRP
3 3 3 5 NA
4 4 3 4 NA
5 5 3 1 NA
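You can wrap the column name in sym() inside mutate to convert the string to a symbol before it is unquoted. With !!colname alone, the string "CRP" itself is inserted into the expression, which is why the non-NA values in your output are the literal text "CRP".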
censor.infection <- function(df, colname){
newcolname <- paste0(colname, "_censored")
return(df %>% mutate(!!newcolname := ifelse( Day < InfectionDay, !! sym(colname), NA)))
}
data = tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1))
data = censor.infection(data, "CRP")
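Another option for string column names, sketched here as an alternative to sym(), is the .data pronoun: .data[[colname]] looks the column up by name inside mutate. The function name censor_infection and the use of replace() (instead of ifelse()) are choices made for this example, not part of the answer above.

```r
library(dplyr)

censor_infection <- function(df, colname) {
  newcolname <- paste0(colname, "_censored")
  df %>%
    # .data[[colname]] fetches the column whose name is stored in colname;
    # replace() keeps the column type and sets censored days to NA
    mutate(!!newcolname := replace(.data[[colname]], Day >= InfectionDay, NA))
}

data <- tibble(Day = 1:5, InfectionDay = 3, CRP = c(3, 2, 5, 4, 1))
res <- censor_infection(data, "CRP")
print(res)
```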
We can select columns on which we want to apply the function (cols) and use mutate_at which will also automatically rename the columns. Added an extra column in the data to show renaming.
library(dplyr)
cols <- c("CRP", "CRP1")
data %>%
mutate_at(cols, list(censored = ~replace(., Day >= InfectionDay, NA)))
# A tibble: 5 x 6
# Day InfectionDay CRP CRP1 CRP_censored CRP1_censored
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 3 3 3 3 3
#2 2 3 2 2 2 2
#3 3 3 5 5 NA NA
#4 4 3 4 4 NA NA
#5 5 3 1 1 NA NA
data
data <- tibble(Day=1:5, InfectionDay=3, CRP=c(3,2,5,4,1), CRP1 = c(3,2,5,4,1))
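mutate_at() has been superseded by across() since dplyr 1.0.0, so the same idea might be written as in the sketch below. The .names glue pattern "{.col}_censored" produces the new column names; res is a made-up name for the result.

```r
library(dplyr)

data <- tibble(Day = 1:5, InfectionDay = 3,
               CRP = c(3, 2, 5, 4, 1), CRP1 = c(3, 2, 5, 4, 1))
cols <- c("CRP", "CRP1")

res <- data %>%
  mutate(across(all_of(cols),
                # set values on or after the infection day to NA
                ~ replace(.x, Day >= InfectionDay, NA),
                .names = "{.col}_censored"))
print(res)
```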