I want to do a fairly common analysis of survey questions in R, but am stuck in the middle.
Imagine a survey where you are asked which brands you associate with certain features (e.g. "brands" could be PlayStation, XBox..., and features could be "speed", "graphics"..., where each brand can be checked for several features, i.e. multi-select). Something like this: https://www.harvestyourdata.com/fileadmin/images/question-type-screenshots/Grid-multi-select.jpg
You often refer to these questions as multi-select grid or matrix questions.
Anyway, from a data perspective, this kind of data is usually stored in wide format where each row*column combination is one variable, which is 0/1 coded (0 if the survey participant doesn't check the box, 1 otherwise).
Assuming we have 5 brands and 10 items, we would have 50 variables in total, ideally following a nice, structured naming scheme, e.g. item1_column1, item2_column1, item3_column1, [...], item1_column2 and so on.
Now, I want to analyze all of these variables (as a frequency table) in one go. I've already found the cross.multi.table function in the questionr package. However, it only analyzes all items against a single factor. What I need instead is to handle several columns at the same time.
Any ideas? Maybe I'm missing a function from another package, or this can easily be done with the tidyverse or even with the cross.multi.table function?
Using this data as test input:
dat = data.frame(item1_column1 = c(0,1,1,1),
item2_column1 = c(1,1,1,0),
item3_column1 = c(0,0,1,1),
item1_column2 = c(1,1,1,0),
item2_column2 = c(0,1,1,1),
item3_column2 = c(1,0,1,1),
item1_column3 = c(0,1,1,0),
item2_column3 = c(1,1,1,1),
item3_column3 = c(0,0,1,0))
I'd expect this output:
column1 column2 column3
item1 3 3 2
item2 3 3 4
item3 2 3 1
or ideally as proportions/percentages:
column1 column2 column3
item1 75% 75% 50%
item2 75% 75% 100%
item3 50% 75% 25%
One way could be to get data into long format using gather, separate columns based on _, group_by item and column and calculate the ratio of value column and spread the data to wide format.
library(dplyr)
library(tidyr)
dat %>%
gather(key, value) %>%
separate(key, into = c("item", "column"), sep = "_") %>%
group_by(item, column) %>%
summarise(prop = mean(value) * 100) %>%
spread(column, prop)
# item column1 column2 column3
# <chr> <dbl> <dbl> <dbl>
#1 item1 75 75 50
#2 item2 75 75 100
#3 item3 50 75 25
A bit shorter (Thanks to #M-M)
dat %>%
summarise_all(~mean(.) * 100) %>%
gather(key, value) %>%
separate(key, into = c("item", "column"), sep = "_") %>%
spread(column, value)
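The same reshape can also be sketched with pivot_longer() and pivot_wider(), the newer tidyr replacements for gather() and spread(). This is a minimal sketch that assumes every column name has exactly one underscore separating item from column; the result name res is only illustrative.

```r
library(dplyr)
library(tidyr)

# Test input from the question
dat <- data.frame(item1_column1 = c(0, 1, 1, 1),
                  item2_column1 = c(1, 1, 1, 0),
                  item3_column1 = c(0, 0, 1, 1),
                  item1_column2 = c(1, 1, 1, 0),
                  item2_column2 = c(0, 1, 1, 1),
                  item3_column2 = c(1, 0, 1, 1),
                  item1_column3 = c(0, 1, 1, 0),
                  item2_column3 = c(1, 1, 1, 1),
                  item3_column3 = c(0, 0, 1, 0))

res <- dat %>%
  # split "item1_column1" into item = "item1", column = "column1" while reshaping
  pivot_longer(everything(),
               names_to = c("item", "column"),
               names_sep = "_") %>%
  group_by(item, column) %>%
  # share of respondents who ticked the box, in percent
  summarise(prop = mean(value) * 100, .groups = "drop") %>%
  pivot_wider(names_from = column, values_from = prop)

res
```

Splitting the names inside pivot_longer() saves the separate() step.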
Using the data.table package, I summarize each column, convert the data to long format, split the name column in two (item and column), and finally convert back to wide format:
library(data.table)
dcast(setDT(melt(setDT(dat)[,100*colMeans(.SD),]),keep.rownames = T)[,
c("item", "column") := tstrsplit(rn, "_", fixed=TRUE)],
item ~ column, value.var = "value")
#> item column1 column2 column3
#> 1: item1 75 75 50
#> 2: item2 75 75 100
#> 3: item3 50 75 25
We can do this in base R, by creating a two column data.frame with the column names replicated, cbind with the unlisted values, and use xtabs to get the sum while pivoting to 'wide' format
out <- xtabs(val ~ ., cbind(read.table(text = names(dat)[col(dat)],
sep="_", header = FALSE), val = unlist(dat, use.names = FALSE)))
out
# V2
#V1 column1 column2 column3
# item1 3 3 2
# item2 3 3 4
# item3 2 3 1
Or as #GKi mentioned (a compact version would be) to split the column names by _, create a data.frame with that along with colSums (or colMeans - for percentage) and use xtabs for pivoting
xtabs(n ~ ., data.frame(do.call("rbind",
strsplit(colnames(dat), "_")), n=colSums(dat)))
Or to get the percentage
xtabs(val ~ ., aggregate(val ~ ., cbind(read.table(text = names(dat)[col(dat)],
sep="_", header = FALSE), val = unlist(dat, use.names = FALSE)), mean)) * 100
# V2
#V1 column1 column2 column3
# item1 75 75 50
# item2 75 75 100
# item3 50 75 25
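Another base R sketch of the same idea uses tapply() on the column means, indexed by the two halves of each column name. It assumes each item/column pair occurs exactly once, so sum() simply passes the single value through; parts and tab are illustrative names.

```r
# Test input from the question
dat <- data.frame(item1_column1 = c(0, 1, 1, 1),
                  item2_column1 = c(1, 1, 1, 0),
                  item3_column1 = c(0, 0, 1, 1),
                  item1_column2 = c(1, 1, 1, 0),
                  item2_column2 = c(0, 1, 1, 1),
                  item3_column2 = c(1, 0, 1, 1),
                  item1_column3 = c(0, 1, 1, 0),
                  item2_column3 = c(1, 1, 1, 1),
                  item3_column3 = c(0, 0, 1, 0))

# two-column character matrix: item part and column part of every name
parts <- do.call(rbind, strsplit(names(dat), "_", fixed = TRUE))

# percentage per variable, arranged as an item x column matrix
tab <- tapply(colMeans(dat) * 100,
              list(item = parts[, 1], column = parts[, 2]),
              sum)
tab
```

The result is a plain numeric matrix with dimnames, which is convenient if it is going to be passed on to barplot() or similar.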
Or inspired from #GKi, using enframe
library(dplyr)
library(tidyr)
library(tibble)
enframe(colSums(dat)) %>%
separate(name, into = c('name1', 'name2')) %>%
spread(name2, value)
# A tibble: 3 x 4
# name1 column1 column2 column3
# <chr> <dbl> <dbl> <dbl>
#1 item1 3 3 2
#2 item2 3 3 4
#3 item3 2 3 1
To get the percentage, just change the first line of code to
enframe(100 *colMeans(dat))
I was working on something I thought would be simple, but maybe today my brain isn't working. My data is like this:
df <- tibble(metric = c('income', 'income_low', 'income_upp', 'n', 'n_low', 'n_upp'),
             value = c(120, 100, 140, 10, 8, 12))
metric value
income 120
income_low 100
income_upp 140
n 10
n_low 8
n_upp 12
And I want to pivot_wider so it looks like this:
metric value value_low value_upp
income 120 100 140
n 10 8 12
I'm having trouble separating metrics, because pivot_wider as is, brings a dataframe that's too wide:
df %>% pivot_wider(names_from = 'metric', values_from = value)
How can I achieve this or should I pivot longer after the pivot wider?
Thanks!
I think if you convert metric into a column with "value", "value_upp" and "value_low" values, you can pivot_wider:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(param = case_when(str_detect(metric, "upp") ~ "value_upp",
str_detect(metric, "low") ~ "value_low",
TRUE ~ "value"),
metric = str_remove(metric, "_low|_upp")) %>%
pivot_wider(names_from = param, values_from = value)
I like to use separate() when I have text in a column like this. This function splits a column into multiple columns wherever a separator occurs in its values.
In particular in this example we would want to use the arguments sep="_" and into = c("metric", "state") to convert into columns with those names.
Then mutate() and pivot_wider() can be used as you had previously specified.
library(tidyverse)
df <- tribble(~metric, ~value,
"income", 120,
"income_low", 100,
"income_upp", 140,
"n", 10,
"n_low", 8,
"n_upp", 12)
df |>
separate(metric, sep = "_", into = c("metric", "state")) |>
mutate(state = ifelse(is.na(state), "value", state)) |>
pivot_wider(id_cols = metric, names_from = state, values_from = value, names_sep = "_")
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 4].
#> # A tibble: 2 × 4
#> metric value low upp
#> <chr> <dbl> <dbl> <dbl>
#> 1 income 120 100 140
#> 2 n 10 8 12
Created on 2022-12-21 with reprex v2.0.2
Note you can use the argument names_glue or names_prefix in pivot_wider() to add the "value" as a prefix to the column names.
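As an alternative to names_glue, the final column names can be built in mutate() before pivoting. The following is a minimal sketch, assuming the suffix is always the part after the single underscore; fill = "right" pads the missing piece with NA without the warning, and wide is only an illustrative name.

```r
library(dplyr)
library(tidyr)

df <- tribble(~metric,      ~value,
              "income",     120,
              "income_low", 100,
              "income_upp", 140,
              "n",          10,
              "n_low",      8,
              "n_upp",      12)

wide <- df %>%
  # rows without a suffix get state = NA instead of a warning
  separate(metric, into = c("metric", "state"), sep = "_", fill = "right") %>%
  # turn NA / "low" / "upp" into the target column names
  mutate(state = if_else(is.na(state), "value", paste0("value_", state))) %>%
  pivot_wider(names_from = state, values_from = value)

wide
```

This yields the columns metric, value, value_low and value_upp, matching the layout asked for in the question.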
A data.table approach (if you can live with the trailing underscore after value_):
library(data.table)
setDT(df)
# create some new columns based on metric
df[, c("first", "second") := tstrsplit(metric, "_")]
# metric value first second
# 1: income 120 income <NA>
# 2: income_low 100 income low
# 3: income_upp 140 income upp
# 4: n 10 n <NA>
# 5: n_low 8 n low
# 6: n_upp 12 n upp
# replace NA with ""
df[is.na(df)] <- ""
# now cast to wide, creating colnames on the fly
dcast(df, first ~ paste0("value_", second), value.var = "value")
# first value_ value_low value_upp
# 1: income 120 100 140
# 2: n 10 8 12
I have the dataframe below and I want to create a new dataframe based on this with 3 columns. The first will be named "Field" and will contain the column names of this dataframe (col1,col2,col3). The second will be named "Absolute" and will contain the absolute number of missing values of this column and the third will be named Percentage and will contain the percentage of the missing values of this column. The number of columns and rows in my real dataframe is bigger.
col1<-c("as","df",NA)
col2<-c("ds",NA,NA)
col3<-c(NA,NA,NA)
df<-data.frame(col1,col2,col3)
Try,
data.frame(Field = names(df),
           Absolute = colSums(is.na(df)),
           Percentage = 100 * (colSums(is.na(df)) / nrow(df)),
           row.names = NULL)
  Field Absolute Percentage
1 col1 1 33.33333
2 col2 2 66.66667
3 col3 3 100.00000
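Since the real data has a different number of rows than columns, here is a small sketch on a hypothetical data frame (df2, with 4 rows and 2 columns) showing the same summary. It uses colMeans(is.na(...)) for the percentage and row.names = NULL so that the result does not depend on the dimensions of the input.

```r
# Hypothetical data: 4 rows, 2 columns, mixed types
df2 <- data.frame(a = c(1, NA, 3, NA),
                  b = c("x", "y", NA, "z"))

na_summary <- data.frame(
  Field      = names(df2),
  Absolute   = colSums(is.na(df2)),         # number of NAs per column
  Percentage = 100 * colMeans(is.na(df2)),  # share of NAs per column, in percent
  row.names  = NULL                         # plain 1..n row names
)
na_summary
```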
Tidyverse approach (though I prefer #Sotos’ solution):
library(dplyr)
library(tidyr)
df %>%
summarize(across(
everything(),
list(
Absolute = ~ sum(is.na(.x)),
Percentage = ~ mean(is.na(.x)) * 100
)
)) %>%
pivot_longer(
everything(),
names_to = c("Field", ".value"),
names_sep = "_"
)
# A tibble: 3 × 3
Field Absolute Percentage
<chr> <int> <dbl>
1 col1 1 33.3
2 col2 2 66.7
3 col3 3 100
I have a dataframe that I want to gather so that it is in tall format, and then mutate on another column with values based on membership of a string from another column in a list of lists. For example, I have the following data frame and list of lists:
dummy_data <- data.frame("id" = 1:20,"test1_10" = sample(1:100, 20),"test2_11" = sample(1:100, 20),
"test3_12" = sample(1:100, 20),"check1_20" = sample(1:100, 20),
"check2_21" = sample(1:100, 20),"sound1_30" = sample(1:100, 20),
"sound2_31" = sample(1:100, 20),"sound3_32" = sample(1:100, 20))
dummylist <- list(c('test1_','test2_','test3_'),c('check1_','check2_'),c('sound1_','sound2_','sound3_'))
names(dummylist) <- c('shipments','arrivals','departures')
And then I gather the data frame like so:
dummy_data <- dummy_data %>%
gather("part", "number", 2:ncol(.))
What I want to do is add a column that has the name of the list found in dummylist where the string before the underscore in the part column is a member. And I can do that like this:
dummy_data <- dummy_data %>%
mutate(Group = case_when(
str_extract(part,'.*_') %in% dummylist[[1]] ~ names(dummylist[1]),
str_extract(part,'.*_') %in% dummylist[[2]] ~ names(dummylist[2]),
str_extract(part,'.*_') %in% dummylist[[3]] ~ names(dummylist[3])
))
However, this requires a separate str_extract line for each list/group within the dummylist. And my real data has way more than 3 lists/groups. So I'm wondering if there is a more efficient way to do this mutate step to get the names of the lists in?
Any help is much appreciated, thanks!
It may be easier with a regex_left_join after converting the 'dummylist' to a two column dataset
library(fuzzyjoin)
library(dplyr)
library(tidyr)
library(tibble)
dummy_data %>%
# // reshape to long format - pivot_longer instead of gather
pivot_longer(cols = -id, names_to = 'part', values_to = 'number') %>%
# // join with the tibble/data.frame converted dummylist
regex_left_join(dummylist %>%
enframe(name = 'Group', value = 'part') %>%
unnest(part)) %>%
rename(part = part.x) %>%
select(-part.y)
Output:
# A tibble: 160 × 4
id part number Group
<int> <chr> <int> <chr>
1 1 test1_10 72 shipments
2 1 test2_11 62 shipments
3 1 test3_12 17 shipments
4 1 check1_20 89 arrivals
5 1 check2_21 54 arrivals
6 1 sound1_30 39 departures
7 1 sound2_31 94 departures
8 1 sound3_32 95 departures
9 2 test1_10 77 shipments
10 2 test2_11 4 shipments
# … with 150 more rows
If you prepare your lookup table beforehand, you don't need any extra libraries other than dplyr and tidyr:
lookup <- sapply(
names(dummylist),
\(nm) { setNames(rep(nm, length(dummylist[[nm]])), dummylist[[nm]]) }
) |>
setNames(nm = NULL) |>
unlist()
lookup
# test1_ test2_ test3_ check1_ check2_ sound1_ sound2_ sound3_
# "shipments" "shipments" "shipments" "arrivals" "arrivals" "departures" "departures" "departures"
Now you just gsub on the fly and translate your parts within the usual mutate() verb:
dummy_data |>
pivot_longer(-id, names_to = 'part', values_to = 'number') |>
mutate(group = lookup[gsub('^(\\w+_).*$', '\\1', part)])
# # A tibble: 160 × 4
# id part number group
# <int> <chr> <int> <chr>
# 1 1 test1_10 91 shipments
# 2 1 test2_11 74 shipments
# 3 1 test3_12 46 shipments
# 4 1 check1_20 62 arrivals
# 5 1 check2_21 7 arrivals
# 6 1 sound1_30 35 departures
# 7 1 sound2_31 23 departures
# 8 1 sound3_32 84 departures
# 9 2 test1_10 59 shipments
# 10 2 test2_11 73 shipments
# # … with 150 more rows
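The lookup vector could also be built more directly with rep() and lengths(). This is a sketch under the same assumption as above, namely that the prefix is everything up to and including the first underscore; parts and groups are illustrative names.

```r
dummylist <- list(shipments  = c('test1_', 'test2_', 'test3_'),
                  arrivals   = c('check1_', 'check2_'),
                  departures = c('sound1_', 'sound2_', 'sound3_'))

# named character vector: names are the prefixes, values are the group names
lookup <- setNames(rep(names(dummylist), lengths(dummylist)),
                   unlist(dummylist, use.names = FALSE))

# a few example values of the 'part' column
parts <- c("test1_10", "check2_21", "sound3_32")

# keep the prefix (up to the first underscore) and translate it
groups <- unname(lookup[sub("_.*$", "_", parts)])
groups
```

Because the lookup is just a named vector, adding more groups to dummylist needs no change to the mutate() step.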
I have a large dataset from which I'm attempting to remove repeated rows based on several columns. The column headings and a sample entry are
count, freq, cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart
5036 0.0599 TGCAGTGCTAGAG CSARDPDR TRBV20-1 TRBD1 TRBJ1-5 15 17 43 21
There are several thousand rows, and for two rows to match all the values except for "count" and "freq" must be the same. I want to remove the repeated entries, but before that, I need to change the "count" value of the one repeated row with the sum of the individual repeated row "count" to reflect the true abundance. Then, I need to recalculate the frequency of the new "count" based on the sum of all the counts of the entire table.
For some reason, the script is not changing anything, and I know for a fact that the table has repeated entries.
Here's my script.
library(dplyr)
# Input sample replicate table.
dta <- read.table("/data/Sample/ci1371.txt", header=TRUE, sep="\t")
# combine rows with identical data. Recalculation of frequency values.
dta %>% mutate(total = sum(count)) %>%
group_by(cdr3nt, cdr3aa, v, d, j, VEnd, DStart, DEnd, JStart) %>%
summarize(count_new = sum(count), freq = count_new/mean(total))
dta_clean <- dta
Any help is greatly appreciated.
Preliminary step: convert to a data.table and store the column names other than count and freq
library(data.table)
setDT(df)
cols <- colnames(df)[3:ncol(df)]
(in your example, count and freq are in the first two positions)
To recompute count and freq:
df_agg <- df[, .(count = sum(count)), by = cols]
df_agg[, 'freq' := count/sum(count)]
If you want to keep unique values by all columns except count and freq
df_unique <- unique(df, by = cols)
Sample data, where grp1 and grp2 are intended to be all of your grouping variables.
set.seed(42)
dat <- data.frame(
grp1 = sample(1:2, size=20, replace=TRUE),
grp2 = sample(3:4, size=20, replace=TRUE),
count = sample(100, size=20, replace=TRUE),
freq = runif(20)
)
head(dat)
# grp1 grp2 count freq
# 1 2 4 38 0.6756073
# 2 2 3 44 0.9828172
# 3 1 4 4 0.7595443
# 4 2 4 98 0.5664884
# 5 2 3 44 0.8496897
# 6 2 4 96 0.1894739
Code:
library(dplyr)
dat %>%
group_by(grp1, grp2) %>%
summarize(count = sum(count)) %>%
ungroup() %>%
mutate(freq = count / sum(count))
# # A tibble: 4 x 4
# grp1 grp2 count freq
# <int> <int> <int> <dbl>
# 1 1 3 22 0.0206
# 2 1 4 208 0.195
# 3 2 3 383 0.358
# 4 2 4 456 0.427
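Note that in the question's script the result of the pipeline is never assigned, and dta_clean <- dta just copies the unchanged input, which would explain why nothing appears to change. A minimal sketch of the fix follows, using hypothetical toy rows and only two of the identifying columns; group_by(across(-c(count, freq))) groups by every column except count and freq.

```r
library(dplyr)

# Hypothetical toy rows: the first two are duplicates apart from count/freq
dta <- data.frame(count  = c(10, 5, 5),
                  freq   = c(0.5, 0.25, 0.25),
                  cdr3nt = c("AAA", "AAA", "CCC"),
                  v      = c("V1", "V1", "V2"))

dta_clean <- dta %>%
  group_by(across(-c(count, freq))) %>%           # all identifying columns
  summarise(count = sum(count), .groups = "drop") %>%
  mutate(freq = count / sum(count))               # recompute over the whole table

dta_clean
```

The key point is that the aggregated table is stored in dta_clean rather than printed and discarded.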
In my data, I have some lines that represent results from a repeated test. Only certain values are captured in the repeat. What I'd like to do is to create a new row with the repeat values but pulling from the initial test if the repeat values are NA or blank.
E.g. for,
Patient ID  Initial/Repeat  Value  Value 2  Accept/Reject
A1          Initial         95     NA       Reject
A1          Repeat          NA     80       Accept
A2          Initial         80     70       Accept
I'd like to transform this into:
Patient ID  Initial/Repeat  Value  Value 2  Accept/Reject
A1          Repeat          95     80       Accept
A2          Initial         80     70       Accept
Thank you.
Try this:
library(zoo)
library(dplyr)
df %>%
  group_by(Patient_ID) %>%
  # carry the last non-NA value forward within each patient
  mutate(across(everything(), ~ na.locf(.x, na.rm = FALSE))) %>%
  # keep only the last row per patient
  filter(row_number() == n())
Output:
# A tibble: 2 x 5
# Groups: Patient_ID [2]
Patient_ID Initial_Repeat Value Value2 Accept_Reject
<chr> <chr> <int> <int> <chr>
1 A1 Repeat 95 80 Accept
2 A2 Initial 80 70 Accept
Is it always a series of NA's with a single valid value? If yes, you could take the mean of the rows, throwing away any NA's. I do this using dplyr's grouping and summarising functionality:
# Sample data:
df = read.table(text="PatientID Initial_Repeat Value Value2 Accept_Reject
A1 Initial 95 NA Reject
A1 Repeat NA 80 Accept
A2 Initial 80 70 Accept", header = TRUE)
# My solution uses the dplyr package:
library(dplyr)
answer = df %>%
group_by(PatientID) %>%
summarise(Value = mean(Value, na.rm = TRUE), Value2 = mean(Value2, na.rm = TRUE))
answer:
# A tibble: 2 x 3
PatientID Value Value2
<fctr> <dbl> <dbl>
1 A1 95 80
2 A2 80 70
Without extra libraries:
df1 <- with(df, data.frame(PatientID=tapply(PatientID, PatientID,
function(x) x[length(x)])))
df1$Inital_Repeat <- with(df, tapply(Initial_Repeat, PatientID,
function(x) levels(Initial_Repeat)[x[length(x)]]))
for (v in c('Value', 'Value2'))
df1[[v]] <- tapply(df[[v]], df$PatientID, function(x) x[!is.na(x)][1])
df1$Accept_Reject <- with(df, tapply(Accept_Reject, PatientID,
function(x) levels(Accept_Reject)[x[length(x)]]))
Output:
PatientID Inital_Repeat Value Value2 Accept_Reject
A1 1 Repeat 95 80 Accept
A2 2 Initial 80 70 Accept
Note that Inital_Repeat and Accept_Reject are factors.
EDIT: PatientID is also a factor, which is why we have 1 and 2 for PatientID. To have "A1" and "A2", change x[length(x)] on line 2 to levels(x)[x[length(x)]]. Also, levels(Initial_Repeat) on line 4 can be replaced with levels(x), so can levels(Accept_Reject) on line 8.
I have found that tools within the tidyverse also accomplish the job. It's a little slower than zoo but offers better readability and requires fewer packages to be loaded.
library(tidyverse)
df <- df %>%
group_by(Patient_ID) %>%
  fill(everything(), .direction = "down") %>%
filter(row_number() == n())
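A further dplyr-only sketch collapses each patient to the last non-missing value of every column, without filling first. It assumes the rows are already ordered so that the repeat comes after the initial test, and it reuses the sample data and column names from the answer above; res is an illustrative name.

```r
library(dplyr)

df <- read.table(text = "PatientID Initial_Repeat Value Value2 Accept_Reject
A1 Initial 95 NA Reject
A1 Repeat NA 80 Accept
A2 Initial 80 70 Accept", header = TRUE, stringsAsFactors = FALSE)

res <- df %>%
  group_by(PatientID) %>%
  # for every non-grouping column, keep the last value that is not NA
  summarise(across(everything(), ~ last(.x[!is.na(.x)])))

res
```

If a column is NA in every row of a patient, last() on the empty vector returns NA, so such patients are kept rather than dropped.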