How can I cross tabulate multiple select and single select questions in R [duplicate] - r

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 5 years ago.
Problem description
I've run a survey with a multiple select question, where the output is separated by commas in one column, and a grouping question (e.g. sex). Now I want to cross tabulate those 2 variables.
Sample data
My data consists of 2 columns:
A multiple select question, which the survey software outputs as one column with commas separating the selection
A grouping variable, in this case male or female
dat <- data.frame(Multiple = c("A,B,C","B","A,C"), Sex = c("M","F","F"))
Desired output
I want to cross tabulate the multiple select options (without commas) with sex:
Multiple Sex Count
A M 1
B M 1
C M 1
A F 1
B F 1
C F 1
Attempted solution
This is a partial solution where I count the elements in the multiple select question only. My problem is that I don't know how to include the grouping variable sex into this function because I am using a regular expression to count the elements in the comma separated vector:
MSCount <- function(X) {
  # Function to count values in a comma separated vector
  X <- as.character(X)
  # Find the possible options from the data alone, e.g. "A", "B" etc.
  Answers <- sort(unique(unlist(strsplit(X, ","))))
  # Drop blank answers. Logical indexing is safe when there are no blanks,
  # whereas Answers[-which(Answers == "")] would then drop every answer.
  Answers <- Answers[Answers != ""]
  CountAnswers <- numeric(length(Answers)) # Initialise the counts
  # Loop round and count the rows with a match for the answer text
  for (i in seq_along(Answers)) {
    CountAnswers[i] <- sum(grepl(Answers[i], X))
  }
  SummaryAnswers <- data.frame(Answers, CountAnswers,
                               PropAnswers = 100 * CountAnswers / sum(!is.na(X)))
  return(SummaryAnswers)
}

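We can use separate_rows to split the comma-separated values into one row per option, then count each option within each sex:
library(tidyverse)
dat %>%
  separate_rows(Multiple, sep = ",") %>%    # one row per selected option
  count(Multiple, Sex, name = "Count") %>%  # tally each option by sex
  arrange(Sex, Multiple)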
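For comparison, here is a minimal base R sketch of the same idea as separate_rows, using the sample data from the question. The intermediate names parts, long, tab and counts are made up for illustration, and stringsAsFactors = FALSE is set only so the example behaves the same across R versions.

```r
# Sample data from the question
dat <- data.frame(Multiple = c("A,B,C", "B", "A,C"),
                  Sex = c("M", "F", "F"),
                  stringsAsFactors = FALSE)

# Split each multiple-select answer on commas
parts <- strsplit(dat$Multiple, ",", fixed = TRUE)

# One row per selected option, repeating Sex once per option
long <- data.frame(Multiple = unlist(parts),
                   Sex = rep(dat$Sex, lengths(parts)),
                   stringsAsFactors = FALSE)

# Cross tabulate option by sex, then reshape to the long layout
tab <- table(long$Multiple, long$Sex)
counts <- as.data.frame(tab, stringsAsFactors = FALSE)
names(counts) <- c("Multiple", "Sex", "Count")
counts
```

The table() object tab is already a cross tabulation; converting it with as.data.frame() is only needed if the long Multiple/Sex/Count layout is preferred.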

Related

Convert each list element of different length into a column in R [duplicate]

This question already has answers here:
How to cbind or rbind different lengths vectors without repeating the elements of the shorter vectors?
(6 answers)
How to convert a list consisting of vector of different lengths to a usable data frame in R?
(6 answers)
Closed last month.
I was wondering if there is an efficient way to convert each element (each with different length) in my List to a column in a data frame to achieve my Desired_output?
I tried the following without success:
dplyr::bind_cols(List)
Note: This data is toy, a functional answer is appreciated.
List <- list(`1000`=letters[1:2], `2000`=letters[1:3], `3000`=letters[1:4])
Desired_output <-
data.frame(`1000`= c(letters[1:2],"",""),
`2000`= c(letters[1:3],""),
`3000`= letters[1:4])
You could try this.
library(purrr)
l <- list(`1000`=letters[1:2], `2000`=letters[1:3], `3000`=letters[1:4])
# get the max length across all list items
max_len <- max(lengths(l))
# using purrr::modify, pad each item with empty strings until it
# has as many elements as the longest item
l <- modify(l, function(f) {
  c(f, rep("", max_len - length(f)))
})
as.data.frame(l)
  X1000 X2000 X3000
1     a     a     a
2     b     b     b
3           c     c
4                 d
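The same padding can be sketched in base R without purrr. The names padded and out below are illustrative, and check.names = FALSE is an optional choice made here to keep the original column names (1000, 2000, 3000) rather than the X-prefixed ones.

```r
List <- list(`1000` = letters[1:2], `2000` = letters[1:3], `3000` = letters[1:4])

# Length of the longest element
max_len <- max(lengths(List))

# Pad every element with empty strings up to max_len
padded <- lapply(List, function(x) c(x, rep("", max_len - length(x))))

# check.names = FALSE keeps "1000" instead of converting it to "X1000"
out <- data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)
out
```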

How to create a variable using another variable as an Index? [duplicate]

This question already has answers here:
Using row-wise column indices in a vector to extract values from data frame [duplicate]
(2 answers)
Closed 3 years ago.
I'm looking to create a new variable, d, which grabs the value from either a or b based on the variable c.
dat = data.frame(a=1:10,b=11:20,c=rep(1:2,5))
The result would be:
d = c(1,12,3,14,... etc)
We can use row/column indexing, where the row index is the sequence of rows and the column index is the 'c' column. cbind the two into a matrix and use it to extract the elements from the dataset.
dat$d <- dat[1:2][cbind(seq_len(nrow(dat)), dat$c)]
dat$d
#[1] 1 12 3 14 5 16 7 18 9 20
NOTE: This should also work when there are multiple column values to extract.
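A small sketch of the same matrix-indexing idea with three value columns. The data frame dat3 and its extra column e are made up for illustration; c now takes the values 1 to 3 and picks one of the three columns per row.

```r
# Hypothetical data: three value columns and an index column c
dat3 <- data.frame(a = 1:6, b = 11:16, e = 21:26, c = rep(1:3, 2))

# Two-column matrix of (row, column) pairs, one pair per row of dat3
idx <- cbind(seq_len(nrow(dat3)), dat3$c)

# Index the value columns with the matrix to pick one cell per row
dat3$d <- as.matrix(dat3[c("a", "b", "e")])[idx]
dat3$d
```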
You can do
dat$d <- ifelse(dat$c==1,dat$a,dat$b)
A dplyr variant
library(dplyr)
dat %>%
  mutate(d = case_when(c == 1 ~ a,
                       TRUE ~ b))

How to subset the first column (rownames) in R [duplicate]

This question already has answers here:
What is about the first column in R's dataset mtcars?
(4 answers)
Closed 3 years ago.
I have xy data for gene expression in multiple samples. I wish to subset the first column so I can order the genes alphabetically and perform some other filtering.
> setwd("C:/Users/Will/Desktop/BIOL3063/R code assignment");
> df = read.csv('R-assignments-dataset.csv', stringsAsFactors = FALSE);
Here is a simplified example of the dataset I'm working with, it has 270 columns (tissue samples) and 7065 rows (gene names).
The first column is a list of gene names (A2M, AAAS, AACS etc.) and each column is a different tissue sample, thus showing the gene expression in each tissue sample.
The question being asked is "Sort the gene names alphabetically (A-Z) and print out the first 20 gene names"
My thought process would be to subset the first column (gene names) and then perform order() to sort alphabetically, after which I can use head() to print the first 20.
However when I try
> genes <- df[1]
It simply subsets the first column that has data in it (TCGA-A6-2672_TissueA) rather than the one to its left.
Also
> genes <- df[,df$col1];
> genes;
data frame with 0 columns and 7065 rows
> order(genes);
integer(0)
This appears to create a list of gene names in RStudio's viewer, but I cannot perform any manipulation on it.
I am unable to correctly locate the first column in the data.frame, since it does not have a column header, and I also have the same problem when doing the same thing with row 1 (sample names) as well.
I'm a complete novice at R and this is part of an assignment I'm working on, it seems I'm missing something fundamental but I can not figure out what.
Cheers guys
Please include a sample of your text file as text instead of an image.
I have created a dataset similar to yours:
X Y
1 a b
2 c d
3 d g
Note that your tissue columns have a header but your gene names do not. Therefore these will be interpreted as rownames, see ?read.table:
If row.names is not specified and the header line has one less entry
than the number of columns, the first column is taken to be the row
names.
Reading it in R:
df <- read.table(text = ' X Y
1 a b
2 c d
3 d g')
So your gene names are not in df[1] but in rownames(df). To get them, use genes <- rownames(df); to add them to the existing df, use df$gene <- rownames(df).
There are numerous ways to convert your row names to a column see for example this question.
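As a sketch of how this applies to the sorting task, assume a tiny made-up expression table (the gene names, tissue names and values below are illustrative, not the real dataset). Because the header line has one fewer entry than the data lines, read.table takes the first column as row names.

```r
# Hypothetical file contents: header has 2 entries, data lines have 3
txt <- "TissueA TissueB
B2M 5.1 3.2
A2M 1.0 2.5
AACS 0.4 0.9"

expr <- read.table(text = txt, header = TRUE)

# Gene names live in the row names, so sort those
genes <- sort(rownames(expr))
head(genes, 20)  # first 20 gene names (only 3 exist in this toy table)

# Reorder the whole table by gene name
expr_sorted <- expr[order(rownames(expr)), ]
```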
If you are asking what I think you are asking, you just need to wrap the subset in data.frame() and give the column a "header", as you call it. Here it is called V1, the first variable of your new data frame.
genes <- data.frame(V1 = df[,1])
genes$V1
1 A
2 C
3 A
4 B
5 C
6 D
7 A
8 B
As per the comment below, the issue could be avoided if you remove the comma from your subsetting syntax. When you select columns from a data.frame, you only need to index the column, not the rows.
genes <- df[1]

How to count the number of occurrences of the first character of each string of a column in R [duplicate]

This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 4 years ago.
I have a data set which has a single column containing multiple names.
For example:
Alex
Brad
Chrisitne
Alexa
Brandone
There are almost 100 records like this. I want to display the result as
A 2
B 2
C 1
This means I need to show the frequencies from highest to lowest, and if there is a tie, the values should be shown in alphabetical order.
I have been trying to solve this but have not been able to.
Any pointers on this?
df <- data.frame(name = c("Alex", "Brad", "Brad"))
first_characters <- substr(df$name, 1, 1)
result <- sort(table(first_characters), decreasing = TRUE)
# from wide to long
data.frame(result)
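If the tie-breaking rule should be explicit rather than left to the sort, one option is to order by descending count and then by letter. This is a sketch using the five names from the question; names_vec, counts, letter and n are illustrative names, and toupper() is an added assumption so that lower-case initials would be grouped with upper-case ones.

```r
# The five example names from the question
names_vec <- c("Alex", "Brad", "Chrisitne", "Alexa", "Brandone")

# First character of each name
first_chars <- toupper(substr(names_vec, 1, 1))

# Frequency table as a data frame with plain character letters
counts <- as.data.frame(table(first_chars), stringsAsFactors = FALSE)
names(counts) <- c("letter", "n")

# Highest count first; ties broken alphabetically
counts <- counts[order(-counts$n, counts$letter), ]
rownames(counts) <- NULL
counts
```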

subset based on frequency level [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I want to generate a df that selects rows associated with an "ID" that in turn is associated with a variable called cutoff. For this example, I set the cutoff to 9, meaning that I want to select rows in df1 whose ID value is associated with more than 9 rows. The last line of my code generates a df that I don't understand. The correct df would have 24 rows, all with either a 4 or a 5 in the ID column. Can someone explain what my last line of code is actually doing and suggest a different approach?
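library(plyr)  # count(df1, "ID") with a "freq" column is plyr's count()
set.seed(123)
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
sub1 <- rnorm(45)
sub2 <- rnorm(45)
df1 <- data.frame(ID, sub1, sub2)
IDfreq <- count(df1, "ID")
cutoff <- 9
df2 <- subset(df1, subset = (IDfreq$freq > cutoff))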
df1[ df1$ID %in% names(table(df1$ID))[table(df1$ID) >9] , ]
This tests whether each df1$ID value belongs to a category with more than 9 rows. Where it does, the corresponding element of the returned logical vector is TRUE, and passing that vector as the "i" argument makes the [-function return the entire row, since the "j" argument is empty.
See:
?`[`
?'%in%'
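As for what the original subset() call is doing: IDfreq$freq > cutoff has only one element per ID (five in total), not one per row, so it is most likely being recycled across the 45 rows. The sketch below reproduces that effect without plyr; freq, keep_short, keep_recycled, wrong and right are illustrative names, and the column x stands in for sub1 and sub2.

```r
ID <- rep(c(1, 2, 3, 4, 5), times = c(5, 7, 9, 11, 13))
df1 <- data.frame(ID, x = seq_along(ID))
cutoff <- 9

# One frequency per ID (5 values), not one per row (45 values)
freq <- as.vector(table(df1$ID))
keep_short <- freq > cutoff  # FALSE FALSE FALSE TRUE TRUE

# Recycling repeats that 5-element pattern down the rows,
# so rows 4, 5, 9, 10, ... are kept regardless of their ID
keep_recycled <- rep(keep_short, length.out = nrow(df1))
wrong <- df1[keep_recycled, ]
nrow(wrong)

# Testing each row's own ID against the frequent IDs gives the intended rows
right <- df1[df1$ID %in% names(table(df1$ID))[table(df1$ID) > cutoff], ]
nrow(right)
```

Here wrong mixes rows from every ID, while right contains only the rows whose ID occurs more than cutoff times.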
Using dplyr
library(dplyr)
df1 %>%
  group_by(ID) %>%
  filter(n() > cutoff)
Maybe closer to what you had in mind is to create a vector of frequencies using ave:
subset(df1, ave(ID, ID, FUN = length) > cutoff)
