Create a true/false variable in R

I have one column that contains long string values made up of multiple words. I want to create a TRUE/FALSE column that reports TRUE if a certain value is detected within the column of interest.
I have tried a mutate function with an embedded str_detect.
Dataset <- Dataset %>%
  mutate(new_column = str_detect('column.of.interest', "abcd"))
My expected output was that every row in which my column of interest contained "abcd" would report TRUE in my new column. However, every row reports FALSE in my new column.

Base R version. First create a sample data set (questioner: you should have done this; answerers: you should always do this):
> Dataset = data.frame(ID=1:10, column.of.interest=c(NA,"This","abcd","Foo","the abcde",NA,"Me","my","mo","END"))
which looks like this:
> Dataset
   ID column.of.interest
1   1               <NA>
2   2               This
3   3               abcd
4   4                Foo
5   5          the abcde
6   6               <NA>
7   7                 Me
8   8                 my
9   9                 mo
10 10                END
Then do:
> Dataset$new_column <- grepl("abcd", Dataset$column.of.interest, ignore.case = T)
to get:
> Dataset
   ID column.of.interest new_column
1   1               <NA>      FALSE
2   2               This      FALSE
3   3               abcd       TRUE
4   4                Foo      FALSE
5   5          the abcde       TRUE
6   6               <NA>      FALSE
7   7                 Me      FALSE
8   8                 my      FALSE
9   9                 mo      FALSE
10 10                END      FALSE
You may or may not want ignore.case.

Here is one answer, based on a dataset from ggplot2 (note that str_detect comes from the stringr package, which also needs to be loaded):
library(ggplot2)
library(dplyr)
library(stringr)
diamonds %>% mutate(newCol = str_detect(clarity, "1"))
Original bad version of the answer (see the comments for why the version above is better):
diamonds %>% mutate(newCol = ifelse(str_detect(clarity, "1"), "TRUE", "FALSE"))
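A note on the original all-FALSE result: quoting the column name passes the literal string 'column.of.interest' to str_detect, which does not contain "abcd", so a single FALSE is recycled to every row. A minimal sketch of the likely fix, assuming the column really is named column.of.interest as in the sample data above, is to pass the column unquoted:
library(dplyr)
library(stringr)
Dataset <- Dataset %>%
  mutate(new_column = str_detect(column.of.interest, "abcd"))
# note: unlike grepl, str_detect returns NA (not FALSE) for NA values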

Related

How to check if rows in one column present in another column in R

I have a data set, data1, with id and emails as follows:
id emails
1 A,B,C,D,E
2 F,G,H,A,C,D
3 I,K,L,T
4 S,V,F,R,D,S,W,A
5 P,A,L,S
6 Q,W,E,R,F
7 S,D,F,E,Q
8 Z,A,D,E,F,R
9 X,C,F,G,H
10 A,V,D,S,C,E
I have another data set, data2, with check_email as follows:
check_email
A
D
S
V
I want to check whether each check_email value in data2 is present in the emails of data1, and keep only those id from data1 whose emails contain at least one check_email value.
My desired output will be:
id
1
2
4
5
7
8
10
I have written code using a for loop, but it is taking forever because my actual dataset is very large.
Any advice in this regard will be highly appreciated!
You can use a regular expression to subset your data. First collapse everything into one pattern:
paste(data2$check_email, collapse = "|")
# [1] "A|D|S|V"
Then find the row indices where the pattern matches the emails:
grep(paste(data2$check_email, collapse = "|"), data1$emails)
# [1] 1 2 4 5 7 8 10
And then combine everything:
data1[grep(paste(data2$check_email, collapse = "|"), data1$emails), ]
# id emails
# 1 1 A,B,C,D,E
# 2 2 F,G,H,A,C,D
# 3 4 S,V,F,R,D,S,W,A
# 4 5 P,A,L,S
# 5 7 S,D,F,E,Q
# 6 8 Z,A,D,E,F,R
# 7 10 A,V,D,S,C,E
# grepl each check_email value against the emails column; keep ids with at least one match
data1[rowSums(sapply(data2$check_email, function(x) grepl(x, data1$emails))) > 0, "id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10
We can split the character vector as.character(data1$emails) into substrings at the commas, then iterate over the resulting list with sapply, checking whether any of the substrings is contained in data2$check_email. Finally we extract the matching rows from data1:
> emails <- strsplit(as.character(data1$emails), ",")
> ind <- sapply(emails, function(emails) any(emails %in% as.character(data2$check_email)))
> data1[ind,"id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10
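Not part of the original answers, but for completeness: a tidyverse sketch of an exact-match alternative, assuming data1 and data2 as defined in the question. The regex approach above can also match partial strings (one address that is a substring of another), whereas splitting the emails into rows and using a semi join compares whole values:
library(dplyr)
library(tidyr)
data1 %>%
  mutate(emails = as.character(emails)) %>%                 # ensure character, not factor
  separate_rows(emails, sep = ",") %>%                      # one row per id/email pair
  semi_join(mutate(data2, check_email = as.character(check_email)),
            by = c("emails" = "check_email")) %>%           # keep rows whose email is in data2
  distinct(id)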

Combining two columns using shared values in first column

I am trying to adjust the formatting of a data set. My current set looks like this, in two columns. The first column is a "cluster" and the second column "name" contains values within each cluster:
Cluster Name
A 1
A 2
A 3
B 4
B 5
C 2
C 6
C 7
And I'd like a single column in which all the values from the second column are listed under the associated cluster from the first column:
Cluster A
1
2
3
Cluster B
4
5
Cluster C
2
6
7
I've been trying in R and Excel with no luck for the last few hours. Any ideas?
Using a trick with tidyr::nest:
library(dplyr)
library(tidyr)
df %>%
  mutate(Cluster = paste0("Cluster_", Cluster)) %>%
  nest(Name) %>%
  t %>%
  unlist %>%
  as.data.frame
# .
# 1 Cluster_A
# 2 1
# 3 2
# 4 3
# 5 Cluster_B
# 6 4
# 7 5
# 8 Cluster_C
# 9 2
# 10 6
# 11 7
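A base R sketch of the same reshaping, assuming df is the two-column data frame from the question: split Name by Cluster and put a header in front of each group before flattening (this version produces "Cluster A" style labels, as in the desired output):
blocks <- split(as.character(df$Name), df$Cluster)          # Name values per cluster
out <- unlist(Map(function(cl, vals) c(paste("Cluster", cl), vals),
                  names(blocks), blocks),
              use.names = FALSE)
data.frame(out)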

Finding the minimums of groups of observations in R

I'm relatively new to R and struggle with "vectorizing" all my code, even though I appreciate that's the proper way to do it.
I need to set a value in a data frame to be the minimum time for each ID.
ID isTrue RealTime MinTime
1 TRUE 16
1 FALSE 8
1 TRUE 10
2 TRUE 7
2 TRUE 30
3 FALSE 3
To be turned into:
ID isTrue RealTime MinTime
1 TRUE 16 10
1 FALSE 8
1 TRUE 10 10
2 TRUE 7 7
2 TRUE 30 7
3 FALSE 3
The following works perfectly. However, it takes 10 minutes to run, which isn't ideal:
for (i in 1:nrow(df)) {
  if (df[i, 'isTrue']) {
    prevTime <- sqldf(paste('Select min(MinTime) from df where ID =', df[i, 'ID'], sep = " "))[1, 1]
    if (is.na(prevTime) | is.na(df[i, 'MinTime']) | df[i, 'MinTime'] < prevTime) {
      df[i, 'MinTime'] <- df[i, 'RealTime']
    } else {
      df[i, 'MinTime'] <- prevTime
    }
  }
}
How should I do this properly? I take it that for loops are not the best way to do this in R. I've been looking at the apply() and aggregate.data.frame() functions but can't make sense of how to apply them here. Can someone point me in the right direction? Much appreciated!!
Here is a two-line base R solution using ave, pmax, and is.na.
# calculate minimum for each ID, excluding FALSE instances
df$MinTime <- ave(pmax(df$RealTime, (!df$isTrue) * max(df$RealTime)), df$ID, FUN=min)
# turn FALSE instances into NA
is.na(df$MinTime) <- (!df$isTrue)
which returns
df
ID isTrue RealTime MinTime
1 1 TRUE 16 10
2 1 FALSE 8 NA
3 1 TRUE 10 10
4 2 TRUE 7 7
5 2 TRUE 30 7
6 3 FALSE 3 NA
In the first line, pmax constructs a vector that keeps each observation where df$isTrue is TRUE and substitutes the maximum RealTime value in the data.frame where it is FALSE, so the FALSE rows cannot pull down the group minima. This new vector is used in the minimum calculation. The FALSE values are then set to NA in the second line.
data
df <- read.table(header=T, text="ID isTrue RealTime
1 TRUE 16
1 FALSE 8
1 TRUE 10
2 TRUE 7
2 TRUE 30
3 FALSE 3")
It should be far faster with a dplyr chain. Here we group the data frame by both ID and isTrue and get the minima at the group level. Then we can ungroup it again and simply blank out the FALSE minima.
library(dplyr)
df %>%
  group_by(ID, isTrue) %>%
  mutate(Min.all = min(RealTime)) %>%
  ungroup() %>%
  transmute(ID, isTrue, RealTime, MinTime = ifelse(isTrue == T, Min.all, ""))
Output:
# A tibble: 6 × 4
ID isTrue RealTime MinTime
<int> <lgl> <int> <chr>
1 1 TRUE 16 10
2 1 FALSE 8
3 1 TRUE 10 10
4 2 TRUE 7 7
5 2 TRUE 30 7
6 3 FALSE 3
I'd really recommend you get familiar with dplyr if you're going to be doing lots of data frame manipulation.
Someone suggested using the ave() function. The following works and is fast, although it returns a ton of warnings:
df$MinTime <- ave(df$RealTime, df$ID, df$isTrue, FUN = min)
df$MinTime <- ifelse(df$isTrue, df$MinTime, NA)
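A small variant, not from the original answer, that avoids those warnings (they come from min() being called on the empty ID/isTrue combinations) by computing the minima on the TRUE rows only:
df$MinTime <- NA
df$MinTime[df$isTrue] <- ave(df$RealTime[df$isTrue], df$ID[df$isTrue], FUN = min)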
The code in the question could be simplified by doing it all in SQL or all in R (appropriately vectorized) rather than half and half. There are already some R solutions so here is an SQL solution that shows that the problem amounts to aggregating a custom self-join.
library(sqldf)
sqldf("select a.*, min(b.RealTime) minRealTime
from df a
left join df b on a.ID = b.ID and a.isTRUE and b.isTRUE
group by a.rowid")
giving:
ID isTrue RealTime minRealTime
1 1 TRUE 16 10
2 1 FALSE 8 NA
3 1 TRUE 10 10
4 2 TRUE 7 7
5 2 TRUE 30 7
6 3 FALSE 3 NA

Smartest way to check if an observation in data.frame(x) exists also in data.frame(y) and populate a new column according with the result

Having two dataframes:
x <- data.frame(numbers=c('1','2','3','4','5','6','7','8','9'), coincidence="NA")
and
y <- data.frame(numbers=c('1','3','10'))
How can I check whether the observations in y (1, 3 and 10) also exist in x, and fill the column x["coincidence"] accordingly (for example with YES/NO or TRUE/FALSE)?
I would do the same in Excel with a formula combining IFERROR and VLOOKUP, but I don't know how to do it in R.
Note:
I am open to changing the data.frames to tables or using libraries. The data frame with the numbers to check (y) will never have more than 10-20 observations, while the other one (x) will never have more than 1K observations. Therefore, I could also iterate with an if, if it's necessary.
We can create the vector matching the desired output with a set membership test that returns TRUE and FALSE where appropriate. The operator %in% is a binary operator that checks whether each value on the left-hand side is contained in the set of values on the right:
x$coincidence <- x$numbers %in% y$numbers
# numbers coincidence
# 1 1 TRUE
# 2 2 FALSE
# 3 3 TRUE
# 4 4 FALSE
# 5 5 FALSE
# 6 6 FALSE
# 7 7 FALSE
# 8 8 FALSE
# 9 9 FALSE
Do numbers have to be factors, as you've set them up? (They're not numbers, but character.) If not, it's easy:
x <- data.frame(numbers=c('1','2','3','4','5','6','7','8','9'), coincidence="NA", stringsAsFactors=FALSE)
y <- data.frame(numbers=c('1','3','10'), stringsAsFactors=FALSE)
x$coincidence[x$numbers %in% y$numbers] <- TRUE
> x
numbers coincidence
1 1 TRUE
2 2 NA
3 3 TRUE
4 4 NA
5 5 NA
6 6 NA
7 7 NA
8 8 NA
9 9 NA
If they need to be factors, then you'll need to either set common levels or use as.character().
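A one-line sketch of the as.character() route just mentioned, assuming x and y as first defined (comparing on character values means the factor levels don't need to match):
x$coincidence <- as.character(x$numbers) %in% as.character(y$numbers)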

Removing same values in columns in some rows of a file in R

I have a file like this.
1 3
1 2
1 10
1 5
**5 5**
6 7
8 9
4 6
1 2
**10 10**
......
The file contains thousands of rows. I wanted to know: how can I remove the rows which contain the same value in both columns in R (the row containing 5 5 and the row containing 10 10)? I know how to remove duplicate columns or duplicate rows, but how do I go about selectively removing these? Thanks. :)
I would do this with indexing; here is an example with a small data frame:
myDf <- data.frame(a=c(3,5,8,6,9,4,3), b=c(3,3,5,8,9,6,4))
myDf <- myDf[myDf$a != myDf$b,]
I would consider writing a helper function like this:
indicator <- function(indf) {
  rowSums(vapply(indf, function(x) x == indf[, 1],
                 logical(nrow(indf)))) == ncol(indf)
}
Basically, the function compares each column in the data.frame with the first column, then checks which row sums equal the number of columns in the data.frame.
This basically creates a logical vector that can be used to subset your data.frame.
Example:
mydf <- data.frame(a = c(3,5,8,6,9,4,3),
                   b = c(3,3,5,8,9,6,4),
                   c = c(3,4,5,6,9,7,2))
indicator(mydf)
# [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE
mydf[!indicator(mydf), ]
# a b c
# 2 5 3 4
# 3 8 5 5
# 4 6 8 6
# 6 4 6 7
# 7 3 4 2
