Recognizing 'select all that apply' answers as being separate in R

I'm working with a data frame from a survey I ran (n=108). One of the questions (columns) has four possible answers; another has 10. Here is my issue: when plotting either of these columns, it plots each level. The four-answer column is considered a factor with 13 levels and the 10-answer column a factor with 22 levels. Each time someone chose more than one answer, it counts as a separate level (i.e. "A,B", "A,B,C", etc.). My question is: how do I represent how many respondents chose "A" or "B", or a combination of "A" and "B" but not necessarily only "A,B", regardless of what other choices they made, if any?
I wish to plot these items correctly, as well as analyze the data by, say, how many Female respondents chose "A" versus how many Male respondents, and so on.
My issues:
1) plot(data$letter) plots 13 different bars whereas letter is a question with only 4 possible answers, select all that apply.
2) I can't show through analysis how many chose "A" if they also happened to choose another answer, because "A","C" isn't equivalent to "A".
Solutions I'm searching for:
1) When plot(data$letter), I want only four bars showing how many times each letter was chosen.
2) I need to work with all values of "A" in analysis, even if the respondent selected more than just "A"
Thank you!
Also, How to clean and re-code check-all-that-apply responses in R survey data? is a question I found before posting that explains it in totality, but the code is fairly advanced at my level of experience with R.

I can give you two ideas that might help for the two issues you mention. First, I create some sample data:
set.seed(175)
choices <- c("A", "B", "C", "A,B", "A,C", "B,C", "A,B,C")
data <- data.frame(respondent = 1:15,
letter = sample(choices, 15, replace = TRUE))
data
## respondent letter
## 1 1 A,C
## 2 2 B,C
## 3 3 A,B
## 4 4 C
## 5 5 B,C
## 6 6 A,B
## 7 7 B
## 8 8 A
## 9 9 B,C
## 10 10 C
## 11 11 A
## 12 12 B
## 13 13 C
## 14 14 A,C
## 15 15 A,B,C
For simplicity, I used only three levels.
1) The following function can be used to plot data$letter directly in the way you want:
plot_allapply <- function(choices) {
# convert to character
choices <- as.character(choices)
# split at comma and unlist
choices_split <- unlist(strsplit(choices, ","))
# convert back to factor and plot
plot(as.factor(choices_split))
}
plot_allapply(data$letter)
It works as follows: First, the data in letter needs to be converted from type factor to character. (I know that it is a factor, because otherwise you would not get a plot at all.) Then, each element of the character vector is split at the commas. (Run strsplit(as.character(data$letter), ",") to see how this works for your data and ?strsplit for more information.) Since this yields a list, it is converted to a character vector using unlist. The last line converts the result back to a factor (which is needed in order for plot to create the right kind of plot) and plots it.
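If only the counts are needed rather than a plot, the same split-and-unlist idea can feed table(). This is a minimal sketch on a small hand-made vector; the `answers` vector and the use of trimws() to tolerate entries such as "A, B" are assumptions for illustration and not part of the original data.

```r
# Hypothetical answers; note the stray space in "A, B"
answers <- factor(c("A", "A,B", "B,C", "A, B", "C", "A,B,C"))

# Split each answer at the commas, flatten, and strip surrounding whitespace
picked <- trimws(unlist(strsplit(as.character(answers), ",")))

# One count per individual choice, regardless of the combination it came from
counts <- table(picked)
counts

# barplot(counts) would then draw one bar per choice
```

Keeping the counts in a table also makes it easy to turn them into proportions with prop.table(counts) if that is more useful than raw numbers.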
2) There are many ways to work with the data in data$letter. If you want to know which respondents chose "B", you could do
grepl("B", data$letter)
This will return a logical vector that is TRUE whenever "B" is contained in a respondent's answer. Thus, all of these will give TRUE: "B", "A,B", "A,B,C".
Maybe it helps to add this information to your data frame. This could be done as follows:
data <- transform(data, isA = grepl("A", letter),
isB = grepl("B", letter),
isC = grepl("C", letter))
data
## respondent letter isA isB isC
## 1 1 A,C TRUE FALSE TRUE
## 2 2 B,C FALSE TRUE TRUE
## 3 3 A,B TRUE TRUE FALSE
## 4 4 C FALSE FALSE TRUE
## 5 5 B,C FALSE TRUE TRUE
## 6 6 A,B TRUE TRUE FALSE
## 7 7 B FALSE TRUE FALSE
## 8 8 A TRUE FALSE FALSE
## 9 9 B,C FALSE TRUE TRUE
## 10 10 C FALSE FALSE TRUE
## 11 11 A TRUE FALSE FALSE
## 12 12 B FALSE TRUE FALSE
## 13 13 C FALSE FALSE TRUE
## 14 14 A,C TRUE FALSE TRUE
## 15 15 A,B,C TRUE TRUE TRUE
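With indicator columns like these, comparing groups comes down to a cross-tabulation. The sketch below assumes a hypothetical `gender` column, which is not in the sample data above. It also builds the indicator with strsplit() and %in% rather than grepl(), which may be safer when one answer label is a substring of another (for example "Other" inside "Another").

```r
# Hypothetical survey data; `gender` is an assumed column
survey <- data.frame(
  gender = c("Female", "Male", "Female", "Male", "Female"),
  letter = c("A,C", "B", "A,B", "A", "C"),
  stringsAsFactors = FALSE
)

# Split once, then test membership per respondent (exact match per choice)
picked <- strsplit(survey$letter, ",")
survey$isA <- vapply(picked, function(p) "A" %in% p, logical(1))

# Rows: gender; columns: whether "A" was among the answers
tab <- table(survey$gender, survey$isA)
tab
```

Here tab["Female", "TRUE"] is the number of female respondents whose answer included "A", whatever else they selected.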

Related

How to use apply over a vector?

Suppose I have a data.frame like
a <- data.frame(col1=1:6,
col2=c('a','b',1,'c',2,3),
stringsAsFactors=F)
a
col1 col2
1 1 a
2 2 b
3 3 1
4 4 c
5 5 2
6 6 3
I want to have a vector saying which rows have col2 as a number. I'm trying something like
apply(a$col2,1,is.numeric)
or
apply(a$col2,FUN=is.numeric)
but it always says
Error in apply(a$col2, 1, is.numeric) :
dim(X) must have a positive length
If a$col2 (the X in apply) must be a matrix, then why does the help from the function say:
X: an array, including a matrix.
The help on arrays says:
An array in R can have one, two or more dimensions.
If an array can have only one dimension, then why can't a one-dimensional array be used in apply? What am I missing here?
(Beyond that, I still would like to know how to find the numeric rows in col2 without using a loop.)
First note that even the numbers in col2 are character since when combined with other elements which are character they get coerced to character.
str(a)
## 'data.frame': 6 obs. of 2 variables:
## $ col1: int 1 2 3 4 5 6
## $ col2: chr "a" "b" "1" "c" ...
1) grepl. Since col2 is character, we should use character processing like this:
grepl("^\\d+$", a$col2)
## [1] FALSE FALSE TRUE FALSE TRUE TRUE
grepl is already vectorized, so we don't need apply or a related function to iterate over the elements of col2.
2) (s)apply. These also work but seem unnecessarily involved given that grepl alone works:
sapply(a$col2, grepl, pattern = "^\\d+$")
## a b 1 c 2 3
## FALSE FALSE TRUE FALSE TRUE TRUE
apply(array(a$col2), 1, grepl, pattern = "^\\d+$")
## [1] FALSE FALSE TRUE FALSE TRUE TRUE
3) type.convert Another approach is to use type.convert which will convert to numeric if it can be represented as one. Then we can use is.numeric.
sapply(a$col2, function(x) is.numeric(type.convert(x, as.is = TRUE)))
## a b 1 c 2 3
## FALSE FALSE TRUE FALSE TRUE TRUE
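On the question of why apply rejects a$col2: a plain vector has no dim attribute at all, whereas a one-dimensional array does, which is why wrapping the vector in array() works. The following sketch illustrates the difference and shows vapply as one possible vector-oriented alternative; the variable names are chosen for illustration.

```r
x <- c("a", "b", "1", "c", "2", "3")

# A plain vector carries no dim attribute, which is what apply() complains about
is.null(dim(x))

# Wrapping it in array() gives it a single dimension of length 6
dim(array(x))

# vapply() iterates over a vector and checks the type of each result
is_num <- vapply(x, grepl, logical(1), pattern = "^\\d+$", USE.NAMES = FALSE)
is_num
```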

Filter dataframe based on presence of sample in a separate list

I want to filter a dataframe with 1212 rows so it only contains the samples listed in a separate list. The list has multiple values and I can't work out how to do this.
The df below is called RNASeq2
RNASeq2Norm_samples Substrng_RNASeq2Norm
1 TCGA-3C-AAAU-01A-11R-A41B-07 TCGA.3C.AAAU
2 TCGA-3C-AALI-01A-11R-A41B-07 TCGA.3C.AALI
3 TCGA-3C-AALJ-01A-31R-A41B-07 TCGA.3C.AALJ
4 TCGA-3C-AALK-01A-11R-A41B-07 TCGA.3C.AALK
5 TCGA-4H-AAAK-01A-12R-A41B-07 TCGA.4H.AAAK
6 TCGA-5L-AAT0-01A-12R-A41B-07 TCGA.5L.AAT0
7 TCGA-5L-AAT1-01A-12R-A41B-07 TCGA.5L.AAT1
8 TCGA-5T-A9QA-01A-11R-A41B-07 TCGA.5T.A9QA
.
.
.
1212
list = intersect_samples
intersect_samples: "TCGA.3C.AAAU" "TCGA.3C.AALI" "TCGA.3C.AALJ" "TCGA.3C.AALK" ... 1097
I have tried this code but returns all the original 1212 samples:
RNASeq_filtered <- RNASeq2[RNASeq2$Substrng_RNASeq2Norm %in% intersect_samples,]
Yet if I try
RNASeq_filtered <- RNASeq2[RNASeq2$Substrng_RNASeq2Norm %in% "TCGA.3C.AAAU",]
it will return the correct row
str(RNASeq2)
'data.frame': 1212 obs. of 2 variables:
$ RNASeq2 : Factor w/ 1212 levels "TCGA-3C-AAAU-01A-11R-A41B-07",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Substrng_RNASeq2Norm: Factor w/ 1093 levels "TCGA.3C.AAAU",..: 1 2 3 4 5 6 7 8 9 10 ...
str(intersect_samples)
chr [1:1093] "TCGA.3C.AAAU" "TCGA.3C.AALI" "TCGA.3C.AALJ" "TCGA.3C.AALK" "TCGA.4H.AAAK" ...
AFAIK R does not offer a convenience function to find a vector of search strings in a vector of strings using partial matching ("sub-strings").
%in% is not the right function if you want to find a sub-string within a string, since it compares only whole strings.
Instead use base R's grepl or the probably faster stri_detect_fixed function of the excellent stringi package.
Please note that I have abstracted the code and data (instead of using your code and data) for easier understanding.
library(stringi)
pattern = c("23", "45", "999")
data <- data.frame(row_num = 1:4,
string = c("123", "234", "345", "xyz"),
stringsAsFactors = FALSE)
# row_num string
# 1 1 123
# 2 2 234
# 3 3 345
# 4 4 xyz
string <- data$string # the column that contains the values to be filtered
# Iterate over each element in pattern and apply it to the string vector.
# Returns a logical vector of the same length as string (TRUE = found, FALSE = not found)
selected <- lapply(pattern, function(x) stri_detect_fixed(string, x))
# Or shorter:
# lapply(pattern, stri_detect_fixed, str = string)
selected # show the result (it is a list of logical vectors - one per search pattern element)
# [[1]]
# [1] TRUE TRUE FALSE FALSE
#
# [[2]]
# [1] FALSE FALSE TRUE FALSE
#
# [[3]]
# [1] FALSE FALSE FALSE FALSE
# "row-wise" reduce the logical vectors into one final vector using the logical "or" operator
# WARNING: Does not handle `NA`s correctly (FALSE | NA gives NA; only TRUE | NA stays TRUE)
selected.rows <- Reduce("|", selected)
# [1] TRUE TRUE TRUE FALSE
# To handle NAs correctly (if you have NAs) you can use this (slower) code:
selected.rows <- rowSums(as.data.frame(selected), na.rm = TRUE) > 0
# Use the logical vector as row selector (TRUE returns the row, FALSE ignores the row):
string[selected.rows]
# [1] "123" "234" "345"
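For readers without stringi, roughly the same filter can be written in base R. This is a sketch under the same abstracted data, using grepl() with fixed = TRUE so that each pattern is matched literally rather than as a regular expression. The last two lines show how the "or" operator treats missing values.

```r
pattern <- c("23", "45", "999")
string  <- c("123", "234", "345", "xyz")

# One logical vector per pattern; fixed = TRUE matches literally, not as a regex
hits <- lapply(pattern, grepl, x = string, fixed = TRUE)

# Combine element-wise with "or": TRUE if any pattern was found in that string
keep <- Reduce(`|`, hits)
string[keep]

# How `|` treats missing values
TRUE | NA    # TRUE
FALSE | NA   # NA
```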

R: duplicated() seems to select the wrong duplicates

I've noticed a couple of times now that when I'm using R to identify duplicates, sometimes it seems to identify the wrong cases.
Here's a data frame that has three columns, each of which may hold duplicate values. I want to isolate the cases that are duplicates of another case on all three variables.
set.seed(100)
test <- data.frame(id = sample(1:15, 20, replace = TRUE),
cat1 = sample(letters[1:2], 20, replace = TRUE),
cat2 = sample(letters[1:2], 20, replace = TRUE))
Which gives me:
id cat1 cat2
1 5 b a
2 4 b b
3 9 b b
4 1 b b
5 8 a b
6 8 a a
7 13 b b
8 6 b b
9 9 b a
10 3 a a
11 10 a a
12 14 b a
13 5 a a
14 6 b a
15 12 b b
16 11 b a
17 4 a a
18 6 b a
19 6 b b
20 11 a a
I've tried this a couple of ways, such as:
duplicated(test$id) & duplicated(test$cat1) & duplicated(test$cat2)
But this just results in the same as duplicated(test$id):
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[17] TRUE TRUE TRUE TRUE
So instead I tried duplicated(test$id, test$cat1, test$cat2), which produces different results:
[1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
[17] FALSE TRUE FALSE FALSE
But is still incorrect - if I call these cases from the data frame we get:
> test[which(duplicated(test$id, test$cat1, test$cat2)),]
id cat1 cat2
1 5 b a
2 4 b b
3 9 b b
5 8 a b
8 6 b b
14 6 b a
16 11 b a
18 6 b a
As you can see these are not the rows we should be getting (were it doing what I'd have thought it would do), which should be (as far as I can see):
18 6 b a
19 6 b b
Does anyone know why it's coming up with these results, and where I'm going wrong using it? Is there a simple (ideally non-verbose) way of doing this?
We need to apply duplicated on a data.frame or matrix or vector
i1 <- duplicated(test[c('id', 'cat1')])
i2 <- duplicated(cbind(test$id, test$cat1))
identical(i1, i2)
#[1] TRUE
and not on more than one data.frame or matrix or vector
i3 <- duplicated(test$id, test$cat1)
identical(i1, i3)
#[1] FALSE
This is specified in the documentation, ?duplicated
duplicated(x, incomparables = FALSE, ...)
where
x a vector or a data frame or an array or NULL.
and not 'x1', 'x2', etc..
As @Aaron mentioned in the comments, to subset the duplicates from the OP's data
test[duplicated(test),]
and if we wanted every row that has a duplicate, including its first occurrence, then
test[duplicated(test)|duplicated(test, fromLast = TRUE),]
Taking duplicates of columns separately is not the same as taking duplicates of a data frame or matrix. This example makes it more clear:
df = data.frame(x = c(1,2,1),
y = c(1,3,3))
df$dupe = duplicated(df$x) & duplicated(df$y)
df$dupe2 = duplicated(df[,c("x","y")])
df
Using your method, duplicated says "When I hit the third row, x already had a 1 so it's duplicated. y already had a 3 so it's duplicated." This doesn't mean that it already saw a row where x = 1 and y = 3.
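The difference can be checked directly. The sketch below reuses the three-row example and adds a second, hypothetical frame `d` that does contain a repeated row, to show what duplicated() and the fromLast variant return in that case.

```r
df <- data.frame(x = c(1, 2, 1), y = c(1, 3, 3))

# Column-wise: row 3 repeats an x value AND a y value, so this flags it
col_wise <- duplicated(df$x) & duplicated(df$y)

# Row-wise: no (x, y) pair occurs twice, so nothing is flagged
row_wise <- duplicated(df)

# Hypothetical frame with a genuinely repeated row (rows 1 and 3)
d <- data.frame(id   = c(6, 6, 6, 4),
                cat1 = c("b", "b", "b", "a"),
                cat2 = c("a", "b", "a", "a"))

later_copies <- duplicated(d)                                   # only the repeat
all_copies   <- duplicated(d) | duplicated(d, fromLast = TRUE)  # repeat and original
```

In this sketch col_wise flags the third row while row_wise flags none, and all_copies marks both the first and the third row of `d`.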

R - find all unique values among subsets of a data frame

I have a data frame with two columns. The first column defines subsets of the data. I want to find all values in the second column that only appear in one subset in the first column.
For example, from:
df=data.frame(
data_subsets=rep(LETTERS[1:2],each=5),
data_values=c(1,2,3,4,5,2,3,4,6,7))
data_subsets data_values
A 1
A 2
A 3
A 4
A 5
B 2
B 3
B 4
B 6
B 7
I would want to extract the following data frame.
data_subsets data_values
A 1
A 5
B 6
B 7
I have been playing around with duplicated but I just can't seem to make it work. Any help is appreciated. There are a number of topics tackling similar problems, I hope I didn't overlook the answer in my searches!
EDIT
I modified the approach from @Matthew Lundberg of counting the number of elements and extracting from the data frame. For some reason his approach was not working with the data frame I had, so I came up with this, which is less elegant but gets the job done:
counts=rowSums(do.call("rbind",tapply(df$data_subsets,df$data_values,FUN=table)))
extract=names(counts)[counts==1]
df[match(extract,df$data_values),]
First, find the count of each element in df$data_values:
x <- sapply(df$data_values, function(x) sum(as.numeric(df$data_values == x)))
> x
[1] 1 2 2 2 1 2 2 2 1 1
Now extract the rows:
> df[x==1,]
data_subsets data_values
1 A 1
5 A 5
9 B 6
10 B 7
Note that you missed "A 5" above. There is no "B 5".
You had the right idea with duplicated. The trick is to combine fromLast = TRUE and fromLast = FALSE options to get a full list of non-duplicated rows.
!duplicated(df$data_values,fromLast = FALSE)&!duplicated(df$data_values,fromLast = TRUE)
[1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
Indexing your data.frame with this vector gives:
df[!duplicated(df$data_values,fromLast = FALSE)&!duplicated(df$data_values,fromLast = TRUE),]
data_subsets data_values
1 A 1
5 A 5
9 B 6
10 B 7
A variant of P Lapointe's answer would be
u <- unique(df)
df[!df$data_values %in% u$data_values[duplicated(u$data_values)], ]
The unique() deals with the possibility (not in your test data) that some rows in the data may be identical and you want to keep them as long as the same data_values does not appear in a different data_subsets group (or with distinct values in other columns).
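Another way to express "appears in only one subset" is to count, for every row, how many distinct subsets its value occurs in. This is a sketch using ave() from base R, not one of the approaches proposed above; unlike the plain duplicated() approach it would also keep a value that is repeated within a single subset.

```r
df <- data.frame(
  data_subsets = rep(LETTERS[1:2], each = 5),
  data_values  = c(1, 2, 3, 4, 5, 2, 3, 4, 6, 7)
)

# For every row, count how many distinct subsets its value appears in
n_subsets <- ave(as.integer(factor(df$data_subsets)), df$data_values,
                 FUN = function(s) length(unique(s)))

# Keep the rows whose value is confined to a single subset
res <- df[n_subsets == 1, ]
res
```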
You can use the 'dplyr' and 'explore' library to overcome this problem.
library(dplyr)
library(explore)
df=data.frame(
data_subsets=rep(LETTERS[1:2],each=5),
data_values=c(1,2,3,4,5,2,3,4,6,7))
df %>% describe(data_subsets)
######## output ########
#variable = data_subsets
#type = character
#na = 0 of 10 (0%)
#unique = 2
# A = 5 (50%)
# B = 5 (50%)

Reshape data frame to convert factors into columns in R

I have a data frame where one particular column has a set of specific values (let's say, 1, 2, ..., 23). What I would like to do is to convert from this layout to the one, where the frame would have extra 23 (in this case) columns, each one representing one of the factor values. The data in these columns would be booleans indicating whether a particular row had a given factor value... To show a specific example:
Source frame:
ID DATE SECTOR
123 2008-01-01 1
456 2008-01-01 3
789 2008-01-02 5
... <more records with SECTOR values from 1 to 5>
Desired format:
ID DATE SECTOR.1 SECTOR.2 SECTOR.3 SECTOR.4 SECTOR.5
123 2008-01-01 T F F F F
456 2008-01-01 F F T F F
789 2008-01-02 F F F F T
I have no problem doing it in a loop but I hoped there would be a better way. So far reshape() didn't yield the desired result. Help would be much appreciated.
I would try to bind another column called "value" and set value = TRUE.
df <- data.frame(cbind(1:10, 2:11, 1:3))
colnames(df) <- c("ID","DATE","SECTOR")
df <- data.frame(df, value=TRUE)
Then do a reshape:
reshape(df, idvar=c("ID","DATE"), timevar="SECTOR", direction="wide")
The problem with using the reshape function is that the default for missing values is NA (in which case you will have to iterate and replace them with FALSE).
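The NA values do not have to be replaced in a loop; indexing the result with is.na() handles them in one assignment. A minimal sketch, assuming a small frame where each ID/DATE pair occurs once (the data here is made up for illustration):

```r
df <- data.frame(ID = 1:6, DATE = 2:7, SECTOR = rep(1:3, 2), value = TRUE)

wide <- reshape(df, idvar = c("ID", "DATE"), timevar = "SECTOR",
                direction = "wide")

# Logical-matrix indexing replaces every NA at once
wide[is.na(wide)] <- FALSE
wide
```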
Otherwise you can use cast out of the reshape package (see this question for an example), and set the default to FALSE.
library(reshape)
df.wide <- cast(df, ID + DATE ~ SECTOR, fill=FALSE)
> df.wide
ID DATE 1 2 3
1 1 2 TRUE FALSE FALSE
2 2 3 FALSE TRUE FALSE
3 3 4 FALSE FALSE TRUE
4 4 5 TRUE FALSE FALSE
5 5 6 FALSE TRUE FALSE
6 6 7 FALSE FALSE TRUE
7 7 8 TRUE FALSE FALSE
8 8 9 FALSE TRUE FALSE
9 9 10 FALSE FALSE TRUE
10 10 11 TRUE FALSE FALSE
Here's another approach using xtabs which may or may not be faster (if someone would try and let me know):
df <- data.frame(cbind(1:12, 2:13, 1:3))
colnames(df) <- c("ID","DATE","SECTOR")
foo <- xtabs(~ paste(ID, DATE) + SECTOR, df)
cbind(t(matrix(as.numeric(unlist(strsplit(rownames(foo), " "))), nrow=2)), foo)
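The indicator columns can also be built directly by comparing SECTOR against each possible value, without any reshaping. This sketch uses the three rows from the question and assumes the full set of sector codes is 1 to 5; the names `sectors`, `ind` and `wide` are chosen for illustration.

```r
df <- data.frame(
  ID     = c(123, 456, 789),
  DATE   = as.Date(c("2008-01-01", "2008-01-01", "2008-01-02")),
  SECTOR = c(1, 3, 5)
)

sectors <- 1:5  # assumed full set of sector codes

# One logical column per sector: TRUE where the row has that sector
ind <- sapply(sectors, function(s) df$SECTOR == s)
colnames(ind) <- paste0("SECTOR.", sectors)

# Attach the indicator matrix to the identifying columns
wide <- cbind(df[c("ID", "DATE")], ind)
wide
```

Because the set of sectors is given explicitly, codes that never occur in the data (2 and 4 here) still get an all-FALSE column.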

Resources