How to use apply over a vector?

Suppose I have a data.frame like
a <- data.frame(col1=1:6,
col2=c('a','b',1,'c',2,3),
stringsAsFactors=F)
a
col1 col2
1 1 a
2 2 b
3 3 1
4 4 c
5 5 2
6 6 3
I want to have a vector saying which rows have col2 as a number. I'm trying something like
apply(a$col2,1,is.numeric)
or
apply(a$col2,FUN=is.numeric)
but it always says
Error in apply(a$col2, 1, is.numeric) :
dim(X) must have a positive length
If a$col2 (the X in apply) must be a matrix, then why does the help from the function say:
X: an array, including a matrix.
The help on arrays says:
An array in R can have one, two or more dimensions.
If an array can have only one dimension, then why can't a one-dimensional array be used in apply? What am I missing here?
(Beyond that, I still would like to know how to find the numeric rows in col2 without using a loop.)

First note that even the numbers in col2 are character: when numbers are combined with character elements in c(), they are coerced to character.
str(a)
## 'data.frame': 6 obs. of 2 variables:
## $ col1: int 1 2 3 4 5 6
## $ col2: chr "a" "b" "1" "c" ...
1) grepl We should therefore use character processing, like this:
grepl("^\\d+$", a$col2)
## [1] FALSE FALSE TRUE FALSE TRUE TRUE
grepl is already vectorized, so we don't need apply or a related function to iterate over the elements of col2.
2) (s)apply These also work but seem unnecessarily involved given that grepl alone works:
sapply(a$col2, grepl, pattern = "^\\d+$")
## a b 1 c 2 3
## FALSE FALSE TRUE FALSE TRUE TRUE
apply(array(a$col2), 1, grepl, pattern = "^\\d+$")
## [1] FALSE FALSE TRUE FALSE TRUE TRUE
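3) type.convert Another approach is to use type.convert, which converts a value to numeric if it can be represented as one. Then we can use is.numeric. (Passing as.is = TRUE keeps non-numeric values as character and avoids the warning that recent versions of R give when the argument is omitted.)
sapply(a$col2, function(x) is.numeric(type.convert(x, as.is = TRUE)))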
## a b 1 c 2 3
## FALSE FALSE TRUE FALSE TRUE TRUE
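As for why apply fails on a$col2: a plain vector has no dim attribute at all, whereas a one-dimensional array does. The sketch below shows the difference on the question's data. It also shows one more way to find the numeric entries, under the assumption (mine, not the asker's) that anything as.numeric can parse should count as a number, including values like "1.5" or "-2" that the regex above would reject.

```r
a <- data.frame(col1 = 1:6,
                col2 = c("a", "b", 1, "c", 2, 3),
                stringsAsFactors = FALSE)

# A plain vector has no dim attribute, which is what apply() complains about
dim(a$col2)          # NULL
# Wrapping it in array() gives it a dim of length one, so apply() accepts it
dim(array(a$col2))   # 6

# Assumption: "is a number" means "as.numeric() can parse it"
is_num <- !is.na(suppressWarnings(as.numeric(a$col2)))
is_num          # FALSE FALSE TRUE FALSE TRUE TRUE
which(is_num)   # 3 5 6
```

Like grepl, as.numeric is vectorized, so no loop or apply call is needed.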

Related

How to match multiple columns without merge?

I have those two df's:
ID1 <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
Date1 <- c("2022-08-21","2022-03-22","2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
Data1 <- data.frame(ID1,Date1)
ID <- c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897","ERD895655", "FFF", "IUHG0" )
Date <- c("2022-09-22","2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22" )
Data2 <- data.frame(ID,Date)
I tried to get exact matches. An exact match is if ID and Date are the same in both df's, for example: "TRZ00897" "2022-08-21" is an exact match, because it is present in both df's
With the following line of code:
match(Data1$ID1, Data2$ID) == match(Data1$Date1, Data2$Date)
the output is:
TRUE TRUE NA NA TRUE FALSE
Obviously the last one should not be FALSE, because "FFF" "2022-09-22" is in both df's. It is FALSE because the Date "2022-09-22" already occurs in Data2 at index position 1, and match returns only the first matching position.
match(Data1$ID1, Data2$ID)
4 3 2 NA 1 6
match(Data1$Date1, Data2$Date)
4 3 NA 2 1 1
So for the last row the index positions are 6 and 1, which are not equal --> FALSE
How can I change this? Which function should I use to get the correct answer?
Note, I don't need to merge or join etc. I'm really looking for a function that can detect those patterns.
Combine the columns then match:
match(paste(Data1$ID1, Data1$Date1), paste(Data2$ID, Data2$Date))
# [1] 4 3 NA NA 1 6
To get logical output use %in%:
paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
Try match with asplit (since you have different column names for two dataframes, I have to manually remove the names using unname, which can be avoided if both of them have the same names)
> match(asplit(unname(Data1), 1), asplit(unname(Data2), 1))
[1] 4 3 NA NA 1 6
Another option, though a memory-expensive one, is using interaction
> match(interaction(Data1), interaction(Data2))
[1] 4 3 NA NA 1 6
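With mapply and %in% (note that this tests each column independently, so it can also return TRUE when the ID and the Date occur in different rows of Data2):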
apply(mapply(`%in%`, Data1, Data2), 1, all)
[1] TRUE TRUE FALSE FALSE TRUE TRUE
rowSums(mapply(`%in%`, Data1, Data2)) == ncol(Data1)
Edit: for a subset of columns:
idx <- c(1, 2)
apply(mapply(`%in%`, Data1[idx], Data2[idx]), 1, all)
#[1] TRUE TRUE FALSE FALSE TRUE TRUE
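To see how the column-wise %in% test differs from a true row-wise match, here is a small sketch with made-up data (d1 and d2 are hypothetical and are not the asker's data frames). The "|" separator is also an assumption: it should be a character that cannot occur in either column.

```r
# Hypothetical data: "B" and "2022-01-01" both occur in d2, but in different rows
d1 <- data.frame(ID = c("A", "B"), Date = c("2022-01-01", "2022-01-01"))
d2 <- data.frame(ID = c("A", "B"), Date = c("2022-01-01", "2022-02-02"))

# Column-wise test: each column is checked on its own
colwise <- apply(mapply(`%in%`, d1, d2), 1, all)
colwise   # TRUE TRUE

# Row-wise test: ID and Date must agree within the same row of d2
rowwise <- paste(d1$ID, d1$Date, sep = "|") %in% paste(d2$ID, d2$Date, sep = "|")
rowwise   # TRUE FALSE
```

The second row of d1 has no exact partner in d2, and only the paste-based test reports that.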

Vector Indexing using Logical vector

I am new to R. I have created an object a:
a <- c(2,4,6,8,10,12,14,16,18,20)
I have performed the following operation on the vector:
a[!c(10,0,8,6,0)]
and I get the output as 4 10 14 20
I do understand that !c(10,0,8,6,0) produces the output as FALSE TRUE FALSE FALSE TRUE
I don't understand how the final result comes out to be 4 10 14 20
Can someone help?
We get this result because the logical vector is recycled to the length of 'a' (its length is only 5, whereas length(a) is 10), i.e.
i1 <- rep(!c(10,0,8,6,0), length.out = length(a))
i1
[1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
If we use that vector
a[i1]
[1] 4 10 14 20
It is easier to understand if we pass a single TRUE or FALSE: TRUE is recycled to return all the elements, and FALSE is recycled to return none
a[TRUE]
[1] 2 4 6 8 10 12 14 16 18 20
a[FALSE]
numeric(0)
The recycling is mentioned in the documentation of ?Extract
For [-indexing only: i, j, ... can be logical vectors, indicating elements/slices to select. Such vectors are recycled if necessary to match the corresponding extent. i, j, ... can also be negative integers, indicating elements/slices to leave out of the selection.
When a numeric vector is coerced to logical in R (as in many other languages), 0 is treated as FALSE and every other value as TRUE. So when we negate, each 0 becomes TRUE and all other values become FALSE
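Putting the two steps together (coercion to logical, then recycling), a minimal sketch on the question's vector; the variable names neg and picked are only for illustration:

```r
a <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

# Numeric to logical: 0 becomes FALSE, anything non-zero becomes TRUE
as.logical(c(10, 0, 8, 6, 0))   # TRUE FALSE TRUE TRUE FALSE
neg <- !c(10, 0, 8, 6, 0)       # FALSE TRUE FALSE FALSE TRUE

# The length-5 index is reused twice to cover all 10 elements,
# so positions 2, 5, 7 and 10 are selected
picked <- which(rep(neg, length.out = length(a)))
picked      # 2 5 7 10
a[picked]   # 4 10 14 20

# A length-2 index is recycled five times and keeps every other element
a[c(TRUE, FALSE)]   # 2 6 10 14 18
```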

Filter dataframe based on presence of sample in a separate list

I want to filter a dataframe with 1212 rows so that it only contains the samples listed in a separate list. The list has multiple values and I can't work out how to do this.
The df below is called RNASeq2
RNASeq2Norm_samples Substrng_RNASeq2Norm
1 TCGA-3C-AAAU-01A-11R-A41B-07 TCGA.3C.AAAU
2 TCGA-3C-AALI-01A-11R-A41B-07 TCGA.3C.AALI
3 TCGA-3C-AALJ-01A-31R-A41B-07 TCGA.3C.AALJ
4 TCGA-3C-AALK-01A-11R-A41B-07 TCGA.3C.AALK
5 TCGA-4H-AAAK-01A-12R-A41B-07 TCGA.4H.AAAK
6 TCGA-5L-AAT0-01A-12R-A41B-07 TCGA.5L.AAT0
7 TCGA-5L-AAT1-01A-12R-A41B-07 TCGA.5L.AAT1
8 TCGA-5T-A9QA-01A-11R-A41B-07 TCGA.5T.A9QA
.
.
.
1212
list = intersect_samples
intersect_samples: "TCGA.3C.AAAU" "TCGA.3C.AALI" "TCGA.3C.AALJ" "TCGA.3C.AALK" ... 1097
I have tried this code but returns all the original 1212 samples:
RNASeq_filtered <- RNASeq2[RNASeq2$Substrng_RNASeq2Norm %in% intersect_samples,]
Yet if I try
RNASeq_filtered <- RNASeq2[RNASeq2$Substrng_RNASeq2Norm %in% "TCGA.3C.AAAU",]
it will return the correct row
str(RNASeq2)
'data.frame': 1212 obs. of 2 variables:
$ RNASeq2 : Factor w/ 1212 levels "TCGA-3C-AAAU-01A-11R-A41B-07",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Substrng_RNASeq2Norm: Factor w/ 1093 levels "TCGA.3C.AAAU",..: 1 2 3 4 5 6 7 8 9 10 ...
str(intersect_samples)
chr [1:1093] "TCGA.3C.AAAU" "TCGA.3C.AALI" "TCGA.3C.AALJ" "TCGA.3C.AALK" "TCGA.4H.AAAK" ...
AFAIK R does not offer a convenience function to find a vector of search strings in a vector of strings using partial matching ("sub-strings").
%in% is not the right function if you want to find a sub-string within a string since it compares only whole strings.
Instead use base R's grepl or the probably faster stri_detect_fixed function of the excellent stringi package.
Please note that I have abstracted the code and data (instead of using your code and data) for easier understanding.
library(stringi)
pattern = c("23", "45", "999")
data <- data.frame(row_num = 1:4,
string = c("123", "234", "345", "xyz"),
stringsAsFactors = FALSE)
# row_num string
# 1 1 123
# 2 2 234
# 3 3 345
# 4 4 xyz
string <- data$string # the column that contains the values to be filtered
# Iterate over each element in pattern and apply it to the string vector.
# Returns a logical vector of the same length as string (TRUE = found, FALSE = not found)
selected <- lapply(pattern, function(x) stri_detect_fixed(string, x))
# Or shorter:
# lapply(pattern, stri_detect_fixed, str = string)
selected # show the result (it is a list of logical vectors - one per search pattern element)
# [[1]]
# [1] TRUE TRUE FALSE FALSE
#
# [[2]]
# [1] FALSE FALSE TRUE FALSE
#
# [[3]]
# [1] FALSE FALSE FALSE FALSE
# "row-wise" reduce the logical vectors into one final vector using the logical "or" operator
# WARNING: Does not handle `NA`s cleanly (FALSE | NA is NA, so a row without any match can end up as NA)
selected.rows <- Reduce("|", selected)
# [1] TRUE TRUE TRUE FALSE
# To handle NAs correctly (if you have NAs) you can use this (slower) code:
selected.rows <- rowSums(as.data.frame(selected), na.rm = TRUE) > 0
# Use the logical vector as row selector (TRUE returns the row, FALSE ignores the row):
string[selected.rows]
# [1] "123" "234" "345"
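The same idea also works with base R's grepl, mentioned above but not shown. This is a sketch on the abstracted data with an NA added for illustration; fixed = TRUE is used so the patterns are treated as plain sub-strings rather than regular expressions.

```r
pattern <- c("23", "45", "999")
string  <- c("123", "234", "345", "xyz", NA)

# One logical vector per pattern; fixed = TRUE turns off regex interpretation
hits <- lapply(pattern, function(p) grepl(p, string, fixed = TRUE))

# Combine with "|"; grepl() returns FALSE for NA input, so no NA appears here
selected_rows <- Reduce(`|`, hits)
selected_rows           # TRUE TRUE TRUE FALSE FALSE
string[selected_rows]   # "123" "234" "345"
```

Because grepl gives FALSE rather than NA for a missing string, the NA caveat for Reduce does not arise in this variant.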

Recognizing 'select all that apply' answers as being separate in R

I'm working with a dataframe from a survey I took (n=108). One of the questions (columns) has four possible answers; another has 10. Here is my issue: when plotting either of these columns, it plots each level. The four-answer column is treated as a factor with 13 levels and the 10-answer column as a factor with 22 levels. Each time someone chose more than one answer, the combination counts as a separate level (i.e. "A,B", "A,B,C", etc.). My question is: how do I represent how many respondents chose "A" or "B" or a combination of "A" and "B" (but not necessarily only "A,B"), regardless of what other choices they made, if any?
I wish to plot these items correctly, as well as analyze the data by, say, how many Female respondents chose "A" versus how many Male respondents, and so on.
My issues:
1) plot(data$letter) plots 13 different bars whereas letter is a question with only 4 possible answers, select all that apply.
2) I can't show through analysis how many chose "A" if they also happened to choose another answer, because "A","C" isn't equivalent to "A".
Solutions I'm searching for:
1) When plot(data$letter), I want only four bars showing how many times each letter was chosen.
2) I need to work with all values of "A" in analysis, even if the respondent selected more than just "A"
Thank you!
Also, How to clean and re-code check-all-that-apply responses in R survey data? is a question I found before posting that explains it in totality, but the code is fairly advanced at my level of experience with R.
I can give you two ideas that might help for the two issues you mention. First, I create some sample data:
set.seed(175)
choices <- c("A", "B", "C", "A,B", "A,C", "B,C", "A,B,C")
data <- data.frame(respondent = 1:15,
letter = sample(choices, 15, replace = TRUE))
data
## respondent letter
## 1 1 A,C
## 2 2 B,C
## 3 3 A,B
## 4 4 C
## 5 5 B,C
## 6 6 A,B
## 7 7 B
## 8 8 A
## 9 9 B,C
## 10 10 C
## 11 11 A
## 12 12 B
## 13 13 C
## 14 14 A,C
## 15 15 A,B,C
For simplicity, I used only three levels.
1) The following function can be used to plot data$letter directly in the way you want:
plot_allapply <- function(choices) {
# convert to character
choices <- as.character(choices)
# split at comma and unlist
choices_split <- unlist(strsplit(choices, ","))
# convert back to factor and plot
plot(as.factor(choices_split))
}
plot_allapply(data$letter)
It works as follows: First, the data in letter needs to be converted from type factor to character. (I know that it is a factor, because otherwise you would not get a plot at all.) Then, each element of the character vector is split at the commas. (Run strsplit(as.character(data$letter), ",") to see how this works for your data and ?strsplit for more information.) Since this yields a list, it is converted to a character vector using unlist. The last line converts the result back to a factor (which is needed in order for plot to create the right kind of plot) and plots it.
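If you want the numbers behind the bars rather than only the plot, the same split-and-unlist step can be fed to table. A minimal sketch with hand-typed answers (hypothetical, fixed rather than sampled so the counts are reproducible):

```r
# Hypothetical answers, typed in by hand for illustration
letter <- factor(c("A,C", "B,C", "A,B", "C", "C", "A", "A,B,C"))

# Split each answer at the commas and count every single choice
counts <- table(unlist(strsplit(as.character(letter), ",", fixed = TRUE)))
counts
## A B C
## 4 3 5

# barplot(counts) would draw one bar per choice from these counts
```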
2) There are many ways to work with the data in data$letter. If you want to know which respondents chose "B", you could do
grepl("B", data$letter)
This will return a logical vector that is TRUE whenever "B" is contained in a respondent's answer. Thus, all of these will give TRUE: "B", "A,B", "A,B,C".
Maybe it helps to add this information to your data frame. This could be done as follows:
data <- transform(data, isA = grepl("A", letter),
isB = grepl("B", letter),
isC = grepl("C", letter))
data
## respondent letter isA isB isC
## 1 1 A,C TRUE FALSE TRUE
## 2 2 B,C FALSE TRUE TRUE
## 3 3 A,B TRUE TRUE FALSE
## 4 4 C FALSE FALSE TRUE
## 5 5 B,C FALSE TRUE TRUE
## 6 6 A,B TRUE TRUE FALSE
## 7 7 B FALSE TRUE FALSE
## 8 8 A TRUE FALSE FALSE
## 9 9 B,C FALSE TRUE TRUE
## 10 10 C FALSE FALSE TRUE
## 11 11 A TRUE FALSE FALSE
## 12 12 B FALSE TRUE FALSE
## 13 13 C FALSE FALSE TRUE
## 14 14 A,C TRUE FALSE TRUE
## 15 15 A,B,C TRUE TRUE TRUE
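For the Female versus Male comparison asked about in the question, an indicator column like these can be cross-tabulated with table. The sketch below uses made-up data (the survey data frame and its gender column are assumptions for illustration). It also uses an exact membership test through the hypothetical helper chose instead of grepl, since grepl("A", ...) would also match a longer answer label that merely contains an "A".

```r
# Hypothetical survey data; the gender column is an assumption for illustration
survey <- data.frame(
  gender = c("Female", "Male", "Female", "Male", "Female", "Male"),
  letter = c("A,C", "B", "A", "A,B", "C", "B,C"),
  stringsAsFactors = FALSE
)

# Exact membership test: split at the commas, then look for the choice
chose <- function(x, choice) {
  vapply(strsplit(as.character(x), ",", fixed = TRUE),
         function(parts) choice %in% trimws(parts), logical(1))
}

survey$choseA <- chose(survey$letter, "A")

# Rows are gender, columns say whether "A" was among the answers
tab <- table(survey$gender, survey$choseA)
tab
```

In this toy data two of the three Female respondents and one of the three Male respondents chose "A".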

How do I use grep on a data frame?

I have the following data frame:
> my.data
A.Seats B.Seats
1 14,15 14,15,16
2 7 7,8
3 12,13 16,17
4 <NA> 10,11
I would like to check if the string within any row in column "A.Seats" is found within the same row of column "B.Seats". So the output would look something like this:
A.Seats B.Seats Check
1 14,15 14,15,16 TRUE
2 7 7,8 TRUE
3 12,13 16,17 FALSE
4 <NA> 10,11 FALSE
But I don't know how to create this table. As a start, I tried using grep:
grep(my.data$A.Seats,my.data$B.Seats)
But I receive the following output
[1] 1
Warning message:
In grep(my.data$A.Seats, my.data$B.Seats) :
argument 'pattern' has length > 1 and only the first element will be used
...and I can't get past this error. Any ideas as to how I can get the intended result?
Many Thanks
The "stringi" library has some vectorized functions that might be useful for something like this. I would suggest the stri_detect() function. Here's an example with some reproducible sample data. Note the difference in the values in the first and last row, and the difference in the results according to whether a regex or fixed approach was taken:
my.data <- data.frame(
A.Seats = c("14,15", "7", "12,13", NA, "14,19"),
B.Seats = c("14,15,16", "7,8", "16,17", "10,11", "14,15,16"))
my.data
# A.Seats B.Seats
# 1 14,15 14,15,16
# 2 7 7,8
# 3 12,13 16,17
# 4 <NA> 10,11
# 5 14,19 14,15,16
library(stringi)
stri_detect(my.data$B.Seats, fixed = my.data$A.Seats)
# [1] TRUE TRUE FALSE NA FALSE
stri_detect(my.data$B.Seats, regex = gsub(",", "|", my.data$A.Seats))
# [1] TRUE TRUE FALSE NA TRUE
The first option above treats the values in my.data$A.Seats as a fixed string pattern. The second option treats it as a regular expression to match any of the values.
Note that this maintains NA as NA, but that can easily be changed to FALSE if you need to.
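Two simple ways to turn the NA into FALSE, sketched on the result vector typed in by hand so that the snippet does not depend on stringi:

```r
res <- c(TRUE, TRUE, FALSE, NA, FALSE)   # the first result shown above

# Option 1: overwrite the missing entries
res1 <- res
res1[is.na(res1)] <- FALSE
res1   # TRUE TRUE FALSE FALSE FALSE

# Option 2: `%in% TRUE` is TRUE only where the value is exactly TRUE
res2 <- res %in% TRUE
res2   # TRUE TRUE FALSE FALSE FALSE
```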
If you don't want to think too much about mapply, you can consider Vectorize to make a vectorized version of grepl. Something like the following should do it:
vGrepl <- Vectorize(grepl)
vGrepl(my.data$A.Seats, my.data$B.Seats) # pattern used as-is (it contains no regex metacharacters)
# [1] 1 1 0 NA 0
vGrepl(gsub(",", "|", my.data$A.Seats), my.data$B.Seats) # pattern is regex
# 14|15 7 12|13 <NA> 14|19
# 1 1 0 NA 1
as.logical(vGrepl(my.data$A.Seats, my.data$B.Seats)) # coerce to logical
# [1] TRUE TRUE FALSE NA FALSE
Because this calls grepl on each element in the vector, I don't think this will scale well though.
This is an approach to get what you need
> List <- lapply(my.data, function(x) strsplit(as.character(x), ","))
> transform(my.data, Check=sapply(mapply("%in%", List[[1]], List[[2]]), any))
A.Seats B.Seats Check
1 14,15 14,15,16 TRUE
2 7 7,8 TRUE
3 12,13 16,17 FALSE
4 <NA> 10,11 FALSE
Here's an alternative using grep
> transform(my.data,
Check=sapply(suppressWarnings(mapply("grep", List[[1]], List[[2]])), any))
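A self-contained variant of the split-and-compare idea, as a sketch. It compares whole seat numbers, so "7" cannot match inside "17" as it could with a sub-string search. The column names AnyMatch and AllMatch, the fifth row, and the choice to report FALSE for a missing A.Seats value are my assumptions about what is wanted.

```r
my.data <- data.frame(
  A.Seats = c("14,15", "7", "12,13", NA, "14,19"),
  B.Seats = c("14,15,16", "7,8", "16,17", "10,11", "14,15,16"),
  stringsAsFactors = FALSE
)

# Split both columns into lists of individual seat numbers
a_seats <- strsplit(my.data$A.Seats, ",", fixed = TRUE)
b_seats <- strsplit(my.data$B.Seats, ",", fixed = TRUE)

# any(): at least one A seat is in B; all(): every A seat is in B
my.data$AnyMatch <- mapply(function(x, y) any(x %in% y), a_seats, b_seats)
my.data$AllMatch <- mapply(function(x, y) !anyNA(x) && all(x %in% y),
                           a_seats, b_seats)
my.data
```

The last row shows the difference: "14,19" shares one seat with "14,15,16", so AnyMatch is TRUE while AllMatch is FALSE.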
