Check if comma delimited column contains a value - r

I have an R data frame where one of the columns is a comma-delimited string. I want to add a new column to the dataset to show whether that column contains a particular value.
For example:
> data <- data.frame(a = 1:5, b = c("123", "6475,320", "475", "905,1204,543", "567,475"))
> data
a b
1 1 123
2 2 6475,320
3 3 475
4 4 905,1204,543
5 5 567,475
I want to create a new column to indicate whether b contains 475, which would leave me with:
a b has_475
1 1 123 FALSE
2 2 6475,320 FALSE
3 3 475 TRUE
4 4 905,1204,543 FALSE
5 5 567,475 TRUE

You can use word boundaries ('\b') to match the number as a whole token. This ensures that values such as 1475 or 24756 are not matched:
data$has_475 <- grepl('\\b475\\b', data$b)
data
a b has_475
1 1 123 FALSE
2 2 6475,320 FALSE
3 3 475 TRUE
4 4 905,1204,543 FALSE
5 5 567,475 TRUE
6 6 1475 FALSE
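One caveat is worth checking before relying on '\b'. The sketch below uses made-up strings that are not in the original data, and shows that a word boundary also sits next to a space or a period, so values such as "475.5" would be flagged too.

```r
# Hypothetical edge cases, not part of the original data
x <- c("567, 475", "475.5", "1475", "475")

# \b matches between a word character (digit, letter, underscore)
# and anything else, including spaces and periods
boundary_hit <- grepl("\\b475\\b", x)

data.frame(x, boundary_hit)
```

If the column only ever holds integers separated by commas, this difference should not matter.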

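You can use this regular expression, which matches 475 only at the start of the string or after a comma, and only when followed by a comma or the end of the string: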
data["has_475"] = grepl("(^|,)475(,|$)",data$b)
Output:
a b has_475
1 1 123 FALSE
2 2 6475,320 FALSE
3 3 475 TRUE
4 4 905,1204,543 FALSE
5 5 567,475 TRUE
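A regex-free alternative, sketched here under the assumption that the fields are separated by plain commas with no surrounding spaces, is to split each string and test for an exact match. The sixth row (1475) is the extra test value from the first answer.

```r
data <- data.frame(
  a = 1:6,
  b = c("123", "6475,320", "475", "905,1204,543", "567,475", "1475"),
  stringsAsFactors = FALSE  # keep b as character on R versions before 4.0
)

# Split each string on literal commas, then test for an exact field match
fields <- strsplit(data$b, ",", fixed = TRUE)
data$has_475 <- vapply(fields, function(f) "475" %in% f, logical(1))
data
```

This gives the same TRUE/FALSE pattern as the regular expressions above on this data; if the strings may contain spaces after the commas, the fields would need trimming first (for example with trimws()).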

Related

Roll condition ifelse in R data frame

I have a data frame with two columns in R and I want to create a third column that will roll by 2 in both columns and check if a condition is satisfied or not as described in the table below.
The condition is a rolling ifelse and goes like this :
IF -A1<B3<A1 TRUE ELSE FALSE
IF -A2<B4<A2 TRUE ELSE FALSE
IF -A3<B5<A3 TRUE ELSE FALSE
IF -A4<B6<A4 TRUE ELSE FALSE
A B CHECK
1 4 NA
2 5 NA
3 6 FALSE
4 1 TRUE
5 -4 FALSE
6 1 TRUE
How can I do it in R? Is there a base R function for this, or a way to do it within the dplyr framework?
Since R is vectorized, you can do that with one command, using for instance dplyr::lag:
library(dplyr)
df %>%
mutate(CHECK = -lag(A, n=2) < B & lag(A, n=2) > B)
A B CHECK
1 1 4 NA
2 2 5 NA
3 3 6 FALSE
4 4 1 TRUE
5 5 -4 FALSE
6 6 1 TRUE
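Since the question also asks about base R, here is a minimal sketch without dplyr. It assumes the data frame is called df with columns A and B as in the table, and builds the two-row lag by hand; lag2_A is just an illustrative name.

```r
df <- data.frame(A = 1:6, B = c(4, 5, 6, 1, -4, 1))

# Shift A down by two rows, padding the first two positions with NA
lag2_A <- c(NA, NA, head(df$A, -2))

# TRUE when B lies strictly between -A and A taken from two rows earlier
df$CHECK <- -lag2_A < df$B & df$B < lag2_A
df
```

The first two rows come out as NA because there is no value of A two rows back to compare against.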

Find all the duplicate records using duplicated() on R [duplicate]

I have one question in R.
I have the following example code for a question.
> exdata <- data.frame(a = rep(1:4, each = 3),
+ b = c(1, 1, 2, 4, 5, 3, 3, 2, 3, 9, 9, 9))
> exdata
a b
1 1 1
2 1 1
3 1 2
4 2 4
5 2 5
6 2 3
7 3 3
8 3 2
9 3 3
10 4 9
11 4 9
12 4 9
> exdata[duplicated(exdata), ]
a b
2 1 1
9 3 3
11 4 9
12 4 9
I tried to use the duplicated() function to find all the duplicate records in the exdata dataframe, but it only finds a part of the duplicated records, so it is difficult to confirm intuitively whether duplicates exist.
I'm looking for a solution that returns the following results
a b
1 1 1
2 1 1
7 3 3
9 3 3
10 4 9
11 4 9
12 4 9
Can I use the duplicated() function to get this result?
Or is there a way to use another function?
I would appreciate your help.
duplicated returns a logical vector with the same length as its argument, marking each element that repeats an earlier one (that is, the second and any later occurrence of a value). It has a method for data frames, duplicated.data.frame, that looks for duplicated rows (and so returns a logical vector of length nrow(exdata)). Subsetting with that logical vector returns exactly those rows that have already occurred at least once. It WON'T, however, return the first occurrence of those rows.
Look at the index vector you're using:
duplicated(exdata)
# [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
But you can combine it with fromLast = TRUE to get all of the occurrences of these rows:
exdata[duplicated(exdata) | duplicated(exdata, fromLast = TRUE),]
# a b
# 1 1 1
# 2 1 1
# 7 3 3
# 9 3 3
# 10 4 9
# 11 4 9
# 12 4 9
Look at the logical vector for duplicated(exdata, fromLast = TRUE), and its combination with duplicated(exdata), to convince yourself:
duplicated(exdata, fromLast = TRUE)
# [1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
duplicated(exdata) | duplicated(exdata, fromLast = TRUE)
# [1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
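The same idea can be wrapped in a small helper and tried on a plain vector, where the two scan directions are easier to see. This is only a sketch; all_duplicates is a hypothetical name, not a base R function.

```r
# Hypothetical helper: TRUE for every copy of a repeated element or row
all_duplicates <- function(x) {
  duplicated(x) | duplicated(x, fromLast = TRUE)
}

x <- c("a", "b", "a", "c", "b", "a")
duplicated(x)                    # flags later copies only
duplicated(x, fromLast = TRUE)   # flags earlier copies only
all_duplicates(x)                # flags every copy

exdata <- data.frame(a = rep(1:4, each = 3),
                     b = c(1, 1, 2, 4, 5, 3, 3, 2, 3, 9, 9, 9))
exdata[all_duplicates(exdata), ]
```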

If row meets criteria, then TRUE else FALSE in R

I have nested data that looks like this:
ID Date Behavior
1 1 FALSE
1 2 FALSE
1 3 TRUE
2 3 FALSE
2 5 FALSE
2 6 TRUE
2 7 FALSE
3 1 FALSE
3 2 TRUE
I'd like to create a column called counter in which for each unique ID the counter adds one to the next row until the Behavior = TRUE
I am expecting this result:
ID Date Behavior counter
1 1 FALSE 1
1 2 FALSE 2
1 3 TRUE 3
2 3 FALSE 1
2 5 FALSE 2
2 6 TRUE 3
2 7 FALSE
3 1 FALSE 1
3 2 TRUE 2
Ultimately, I would like to pull the minimum counter in which the observation occurs for each unique ID. However, I'm having trouble developing a solution for this current counter issue.
Any and all help is greatly appreciated!
I'd like to create a counter within each array of unique IDs and from there, ultimately pull the row level info - the question is how long on average does it take to reach a TRUE
I sense there might be an XY problem going on here. You can answer your latter question directly, like so:
> library(plyr)
> mean(daply(d, .(ID), function(grp)min(which(grp$Behavior))))
[1] 2.666667
(where d is your data frame.)
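If you would rather not load plyr, the same number can be obtained in base R. The sketch below rebuilds the example data as d and uses tapply() with match(), which returns the position of the first TRUE in each ID (or NA when an ID has none).

```r
d <- data.frame(
  ID       = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Date     = c(1, 2, 3, 3, 5, 6, 7, 1, 2),
  Behavior = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Position of the first TRUE within each ID
first_true <- tapply(d$Behavior, d$ID, function(b) match(TRUE, b))
first_true

# Average number of rows needed to reach a TRUE
mean(first_true)
```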
Here's a dplyr solution that finds the row number for each TRUE in each ID:
library(dplyr)
newdf <- yourdataframe %>%
group_by(ID) %>%
summarise(
ftrue = which(Behavior))
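# base R: number the rows of each ID up to the first TRUE, then pad with NA
do.call(rbind, by(df, list(df$ID), function(x) {
  n <- nrow(x)
  m <- which(x$Behavior)[1]  # row of the first TRUE within this ID
  data.frame(x, Counter = c(seq_len(m), rep(NA, n - m)))
}))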
ID Date Behavior Counter
1.1 1 1 FALSE 1
1.2 1 2 FALSE 2
1.3 1 3 TRUE 3
2.4 2 3 FALSE 1
2.5 2 5 FALSE 2
2.6 2 6 TRUE 3
2.7 2 7 FALSE NA
3.8 3 1 FALSE 1
3.9 3 2 TRUE 2
The data frame df used above was created with:
df = read.table(text = "ID Date Behavior
1 1 FALSE
1 2 FALSE
1 3 TRUE
2 3 FALSE
2 5 FALSE
2 6 TRUE
2 7 FALSE
3 1 FALSE
3 2 TRUE", header = TRUE)
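The counter column itself can also be added in place with ave(), which keeps the original row order and row names. This is a sketch that assumes the rows of each ID are already in date order; the data frame is rebuilt here so the example runs on its own.

```r
df <- data.frame(
  ID       = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Date     = c(1, 2, 3, 3, 5, 6, 7, 1, 2),
  Behavior = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# ave() passes the row indices of each ID to the function
df$counter <- ave(seq_len(nrow(df)), df$ID, FUN = function(i) {
  k <- seq_along(i)                      # 1, 2, 3, ... within the ID
  first <- match(TRUE, df$Behavior[i])   # position of the first TRUE
  ifelse(k <= first, k, NA_integer_)     # NA after the first TRUE
})
df
```

An ID with no TRUE at all would get NA in every row under this approach.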

R: Creating a column by adding values of two previous columns

I am working in R. I have typed in the command :
table(shoppingdata$Identifier, shoppingdata$Coupon)
I have the following data:
FALSE TRUE
197386 0 5
197388 0 2
197390 2 0
197392 0 3
197394 1 0
197397 0 1
197398 1 1
197400 0 4
197402 1 5
197406 0 5
First of all, I cannot rename the FALSE and TRUE columns to something else, e.g. couponused.
Most importantly, I want to create a third column which is the sum of FALSE + TRUE (coupon used + coupon not used = number of visits). The actual columns contain hundreds of entries.
The solution is not obvious at all.
You have stumbled into the abyss of R data types, through no fault of your own.
Assuming that shoppingdata is a data frame,
table(shoppingdata$Identifier, shoppingdata$Coupon)
creates an object of type "table". One would think that using, e.g.
as.data.frame(table(shoppingdata$Identifier, shoppingdata$Coupon))
would turn this into a data frame with the same format as in the printout, but, as the example below shows, it does not!
# example (sample() is random, so the counts will differ from run to run)
data <- data.frame(ID=rep(1:5,each=10),coupon=sample(c(TRUE,FALSE),50,replace=TRUE))
# creates "contingency table", not a data frame.
t <- table(data)
t
# coupon
# ID FALSE TRUE
# 1 5 5
# 2 3 7
# 3 4 6
# 4 6 4
# 5 3 7
as.data.frame(t) # not useful!!
# ID coupon Freq
# 1 1 FALSE 5
# 2 2 FALSE 3
# 3 3 FALSE 4
# 4 4 FALSE 6
# 5 5 FALSE 3
# 6 1 TRUE 5
# 7 2 TRUE 7
# 8 3 TRUE 6
# 9 4 TRUE 4
# 10 5 TRUE 7
# this works...
coupons <- data.frame(ID=rownames(t),not.used=t[,1],used=t[,2])
# add two columns to make a third
coupons$total <- coupons$used + coupons$not.used
# or, less typing
coupons$total <- with(coupons,not.used+used)
FWIW, I think yours is a perfectly reasonable question. The reason more people don't use R is that it has an extremely steep learning curve, and the documentation is not very good. On the other hand, once you've climbed that learning curve, R is astonishingly powerful.
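Another route, sketched here with a small made-up stand-in for shoppingdata, is as.data.frame.matrix(), which keeps the wide layout of the printed table. The column names not_used, used and total are only illustrative, and the renaming step assumes both FALSE and TRUE occur in the Coupon column.

```r
# Small made-up stand-in for shoppingdata
shoppingdata <- data.frame(
  Identifier = c(197386, 197386, 197386, 197390, 197390, 197398, 197398),
  Coupon     = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

tab <- table(shoppingdata$Identifier, shoppingdata$Coupon)

# Keep the wide layout of the printed table: one row per Identifier
visits <- as.data.frame.matrix(tab)
names(visits) <- c("not_used", "used")  # columns arrive in the order FALSE, TRUE
visits$total <- visits$not_used + visits$used
visits
```

The identifiers end up as row names; rownames(visits) can be copied into a regular column if that is more convenient.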

How can I find the first and last occurrences of an element in a data.frame?

I have searched exhaustively for a direct R translation for the FIRST. and LAST. pointers in SAS DATA steps but can't seem to find one. For those not familiar with SAS, FIRST. is a boolean that identifies the first appearance of a given element in a table and LAST. is a boolean that identifies the last appearance. For instance, consider the following sorted table:
V1 V2 V3
1 1 1
1 1 2
1 2 3
1 2 4
2 3 5
2 3 6
2 4 7
2 4 8
3 5 9
3 5 10
3 6 11
3 6 12
Because SAS DATA steps read tables line by line, I can use a statement like:
IF FIRST.V1 THEN DO ...
FIRST.V1 will return TRUE if and only if this is the first time the observation has been encountered in V1. In other words, it will return true for V1[1] (the first appearance of '1'), V1[5] (the first appearance of '2'), and V1[9] (the first appearance of '3'). The LAST. pointer functions in analogous fashion, but with the final appearance of that element.
Is there anything in R that emulates this?
You can do this with duplicated and rev (for LAST):
> v1=c(1,1,1,2,2,3,3,3,3,4,4,5)
> data.frame(v1,FIRST=!duplicated(v1),LAST=rev(!duplicated(rev(v1))))
v1 FIRST LAST
1 1 TRUE FALSE
2 1 FALSE FALSE
3 1 FALSE TRUE
4 2 TRUE FALSE
5 2 FALSE TRUE
6 3 TRUE FALSE
7 3 FALSE FALSE
8 3 FALSE FALSE
9 3 FALSE TRUE
10 4 TRUE FALSE
11 4 FALSE TRUE
12 5 TRUE TRUE
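As a variation on the answer above, duplicated() has a fromLast argument that scans from the end, so the double rev() can be dropped. A minimal sketch on the same vector:

```r
v1 <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 5)

first <- !duplicated(v1)                   # first appearance of each value
last  <- !duplicated(v1, fromLast = TRUE)  # last appearance of each value

data.frame(v1, FIRST = first, LAST = last)
```

For a BY group made of several columns, passing a data frame of those columns to duplicated() should work the same way, since it has a data frame method that compares whole rows.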

Resources