Count pattern with random characters in R

I would like to count multiple patterns that include * (any character).
Here is an example searching for Y*Y, YY*, and X*X simultaneously:
df <- data.frame(
V1 = c("A", "B", "C", "D"),
V2 = c("XXYYYYY", "XXYYXX" , "XYXXYX", "XYYXYX")
)
And here is my try:
library(stringr)
df$V3 <- str_count(df$V2, "Y+Y+")
df$V4 <- str_count(df$V2, "YY+")
df$V5 <- str_count(df$V2, "X+X+")
I am not sure how to specify a random character in a string and how to count two or more patterns at once.
Expected output:
V1 V2 V3 V4 V5
A XXYYYYY 1 1 1
B XXYYXX 1 1 2
C XYXXYX 2 0 3
D XYYXYX 2 1 3
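One possible sketch, assuming * stands for exactly one arbitrary character, which corresponds to . in a regular expression; several patterns can then be counted in one pass. Note that str_count() counts non-overlapping matches, so the counts can differ from the expected output above where matches overlap.
library(stringr)
# "." matches any single character; count each pattern in one pass
patterns <- c(V3 = "Y.Y", V4 = "YY.", V5 = "X.X")
df[names(patterns)] <- lapply(patterns, function(p) str_count(df$V2, p))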

Related

How to apply a custom function to every value in a dataset

I have a multipart function to convert characters in a specified column to numbers, as follows:
library(plyr) # revalue() comes from plyr
ccreate <- function(df, x){ revalue(df[[x]], c( "?"=1 , "D"=2 , "C"=3 , "B"=4 , "A"=5 )) }
I then use that function to create new columns in my original dataset of just the values using another function:
coladd <- function(df, x){ df[[paste(x, "_col", sep='' )]] <- ccreate(df,x); df }
Here is an example of the function:
col1 <- c("A", "B", "C", "D", "?")
col2 <- c("A", "A", "A", "D", "?")
col3 <- c("C", "B", "?", "A", "B")
test <- data.frame(col1, col2, col3)
test
coladd(test, "col1")
This works, but I have to feed each column name from my dataset into coladd() one at a time. Is there a way to apply the coladd() function to every column in a dataframe without having to type in each column name?
Thanks again, and sorry for any confusion, this is my first post here.
Using your functions, you can use Reduce.
ccreate <- function(df, x){ revalue(df[[x]], c( "?"=1 , "D"=2 , "C"=3 , "B"=4 , "A"=5 )) }
coladd <- function(df, x){ df[[paste(x, "_col", sep='' )]] <- ccreate(df,x); df }
Reduce(coladd, names(test), test)
# col1 col2 col3 col1_col col2_col col3_col
# 1 A A C 5 5 3
# 2 B A B 4 5 4
# 3 C A ? 3 5 1
# 4 D D A 2 2 5
# 5 ? ? B 1 1 4
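Reduce() folds coladd() over the column names, threading the growing data frame through as the accumulator; an equivalent explicit loop, just to illustrate what it does:
# Same result as Reduce(coladd, names(test), test)
out <- test
for (x in names(test)) out <- coladd(out, x)
out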
Here is how I would do it, though not using your functions.
library(dplyr)
# this is a named vector to serve as your lookup
recode_val <- c( "?"=1 , "D"=2 , "C"=3 , "B"=4 , "A"=5 )
# as.character() keeps the lookup safe even if the columns are factors;
# the list name "col" is what produces the "_col" suffix
test %>%
mutate(across(everything(), list(col = ~ recode_val[as.character(.)])))
# col1 col2 col3 col1_col col2_col col3_col
# 1 A A C 5 5 3
# 2 B A B 4 5 4
# 3 C A ? 3 5 1
# 4 D D A 2 2 5
# 5 ? ? B 1 1 4

Convert data frame into a matrix of "ranked lists" based on unique values in column

Let's say I have a data frame df that looks like this:
df = data.frame(c("A", "A", "B", "B", "C", "D", "D", "D", "E"),
c(0.1, 0.3, 0.1, 0.8, 0.4, 0.7, 0.5, 0.2, 0.1),
c("v1", "v2", "v1", "v3", "v4", "v2", "v3", "v4", "v2"))
colnames(df) = c("entry", "value", "point")
df = df[order(df$entry, -df$value),]
df
entry value point
2 A 0.3 v2
1 A 0.1 v1
4 B 0.8 v3
3 B 0.1 v1
5 C 0.4 v4
6 D 0.7 v2
7 D 0.5 v3
8 D 0.2 v4
9 E 0.1 v2
I would like to convert it eventually into a matrix of "ranked lists" that has the unique values of the entry column as rows; the number of columns should equal the maximum number of unique elements in the point column for any given entry (3 in this example). Each row should be populated with the corresponding values from the point column, sorted descendingly by the corresponding elements in value (e.g., row A should have v2 in the first column). If an entry has fewer points than the matrix has columns, the rest of the row should be filled with NAs.
So, the expected output should look something like this:
> df
1 2 3
A v2 v1 NA
B v3 v1 NA
C v4 NA NA
D v2 v3 v4
E v2 NA NA
So far I have tried to create some sort of contingency table using
with(df, table(df$point, df$entry))
but of course my actual data is on the order of millions of entries, and the above command consumes huge amounts of RAM even when subsetting to 100 entries with a couple of hundred unique points. I have also tried
xtabs(~ entry + point, data=df)
with the same results on my real data. Next I have tried to split it into ordered lists using
df = split(df$point, df$entry)
which works fine and is fast enough, but now I have problems converting it to the result matrix. Something along those lines, probably:
matrix(sapply(df, function(x) unlist(x)), nrow=length(df), ncol=max(sapply(df, length)))
or first initialize a matrix and do some rbind or something?
res = matrix(NA, nrow=length(df), ncol=max(sapply(df, length)))
rownames(res) = names(df)
....
Can you please assist?
With dplyr:
df %>%
group_by(entry) %>%
mutate(unq=rank(rev(value))) %>%
select(-value) %>%
tidyr::spread(unq,point)
# A tibble: 5 x 4
# Groups: entry [5]
entry `1` `2` `3`
<fct> <fct> <fct> <fct>
1 A v2 v1 NA
2 B v3 v1 NA
3 C v4 NA NA
4 D v2 v3 v4
5 E v2 NA NA
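As an aside, spread() has since been superseded in tidyr; a sketch of the same reshape with pivot_wider() (rank(-value) is used here so the result does not rely on the rows being pre-sorted):
library(dplyr)
df %>%
group_by(entry) %>%
mutate(unq = rank(-value)) %>% # 1 = largest value within each entry
select(-value) %>%
tidyr::pivot_wider(names_from = unq, values_from = point)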
Consider using by to split by entry and build the needed vectors. To give every row of the final matrix the same length, pad with NA as needed; the 3 below can be changed to however many columns are required.
vec_list <- by(df, df$entry, function(sub) {
vec <- as.character(sub[order(-sub$value),]$point)
c(vec, rep(NA, 3 - length(vec)))
})
final_matrix <- do.call(rbind, vec_list)
final_matrix
# [,1] [,2] [,3]
# A "v2" "v1" NA
# B "v3" "v1" NA
# C "v4" NA NA
# D "v2" "v3" "v4"
# E "v2" NA NA
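If the number of columns is not known up front, it can be derived from the data first; a small variation on the same approach:
# derive the padding width from the widest entry instead of hardcoding 3
max_len <- max(table(df$entry))
vec_list <- by(df, df$entry, function(sub) {
vec <- as.character(sub[order(-sub$value),]$point)
c(vec, rep(NA, max_len - length(vec)))
})
do.call(rbind, vec_list)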

R count and list unique rows for each column satisfying a condition

I have been going crazy with something basic...
I am trying to count each unique ID coming up in a data frame and list them in a comma-separated column, e.g.:
df<-data.frame(id = as.character(c("a", "a", "a", "b", "c", "d", "d", "e", "f")), x1=c(3,1,1,1,4,2,3,3,3),
x2=c(6,1,1,1,3,2,3,3,1),
x3=c(1,1,1,1,1,2,3,3,2))
> df
id x1 x2 x3
1 a 3 6 1
2 a 1 1 1
3 a 1 1 1
4 b 1 1 1
5 c 4 3 1
6 d 1 2 2
7 d 3 3 3
8 e 1 3 3
9 f 3 1 2
I am trying to get a count of the unique ids that satisfy a condition (value > 1):
res = data.frame(x1_counts =5, x1_names="a,c,d,e,f", x2_counts = 4, x2_names="a,c,d,f", x3_counts = 3, x3_names="d,e,f")
> res
x1_counts x1_names x2_counts x2_names x3_counts x3_names
1 5 a,c,d,e,f 4 a,c,d,f 3 d,e,f
I have tried with data.table but it seems very convoluted, i.e.
DT = as.data.table(df)
res <- DT[, list(x1= length(unique(id[which(x1>1)])), x2= length(unique(id[which(x2>1)]))), by=id]
But I can't get it right; I am not getting what I need to do with data.table, since it is not really a grouping I am looking for. Can you direct me on the right path, please? Thanks so much!
You can reshape your data to long format and then do the summary:
library(data.table)
(melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = list(unique(id))), variable])
# You can replace list() with toString() if you want the names as a single string instead of a list, as shown below
# variable counts names
#1: x1 5 a,c,d,e,f
#2: x2 4 a,c,d,e
#3: x3 3 d,e,f
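For instance, the toString() variant from the comment above:
(melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = toString(unique(id))), variable])
# variable counts names
#1: x1 5 a, c, d, e, f
#2: x2 4 a, c, d, e
#3: x3 3 d, e, f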
To get what you need, reshape it back to wide format:
res <- (melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = list(unique(id))), variable])
dcast(res, 1 ~ variable, value.var = c('counts', 'names'))
# . counts_x1 counts_x2 counts_x3 names_x1 names_x2 names_x3
# 1: . 5 4 3 a,c,d,e,f a,c,d,e d,e,f

Remove rows whose values across columns contain more than 2 of 4 unique characters

Hopefully the wording of the title makes sense. I have a data frame that consists of the values "A", "B", "C", "D", "", and "A/B". I want to identify which rows contain only 2 of "A", "B", "C", or "D". The frequency of each of these letters within the row does not matter. I just want to know if more than 2 of those 4 letters exist in the row.
Here is a sample data frame:
df.sample = as.data.frame(rbind(
c("A","B","A","A/B","B","B","B","B","","B"),
c("A","B","C","A","B","","","B","","B"),
c("A","B","D","D","B","B","B","B","","B"),
c("A","B","A","A","B","B","B","B","B","B")))
df.sample
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 A B A A/B B B B B B
2 A B C A B B B
3 A B D D B B B B B
4 A B A A B B B B B B
I want to apply a function to each row that determines how many of the 4 letters ("A","B","C","D") exist, not the frequency of each, but essentially just a 0 or 1 value for each of "A", "B", "C", and "D". If the sum of those 4 values is > 2, then I want to assign the index of that row to a new vector which will be used to remove those rows from the data frame.
myfun <- function(x){
#which rows contain > 2 different letters of A, B, C, or D.
#The number of times each letter occurs in a given row does not matter.
#What matters is if each row contains more than 2 of the 4 letters. Each row should only contain 2 of them. The combination does not matter.
out = which(something > 2)
}
row.indexes = apply(df.sample,1,function(x) myfun(x)) #Return a vector of row indexes that contain more than 2 of the 4 letters.
new.df.sample = df.sample[-row.indexes,] #create new data frame excluding rows containing more than 2 of the 4 letters.
In the df.sample above, rows 2 and 3 contain more than 2 of those 4 letters and thus should be indexed for removal. After running the df.sample through the function and removing rows in row.indexes, my new.df.sample data frame should look like this:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 A B A A/B B B B B B
4 A B A A B B B B B B
I have tried to think of this as a logical statement for each of the 4 letters which then assigns a 0 or 1 to each letter, sums them up, and then identifies which ones sum to > 2. For instance, I thought perhaps I could try 'grep()' and convert that to a logical for each letter, which was then converted to a 0 or 1 and summed. That seems too lengthy and didn't work with the way I tried it. Any ideas?
Here's a function for this task. It returns a logical value; TRUE indicates rows with more than two different letters:
myfun <- function(x) {
sp <- unlist(strsplit(x, "/"))
length(unique(sp[sp %in% c("A", "B", "C", "D")])) > 2
}
row.indexes <- apply(df.sample, 1, myfun)
# [1] FALSE TRUE TRUE FALSE
new.df.sample <- df.sample[!row.indexes, ] # negate the index with '!'
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 A B A A/B B B B B B
# 4 A B A A B B B B B B
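A quick check of how the strsplit() on "/" plays out: a cell like "A/B" contributes both letters before the unique count is taken, and empty cells drop out (splitting "" returns character(0)):
x <- c("A", "B", "A/B", "", "B")
unlist(strsplit(x, "/"))
# [1] "A" "B" "A" "B" "B"
myfun(x)
# [1] FALSE (only A and B occur, so the row would be kept)
myfun(c("A", "B", "C", "B"))
# [1] TRUE (three different letters, so the row would be removed)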

How to swap values between two columns

I have a data frame with three variables and 250K records. As an example consider
df <- data.frame(V1=c(1,2,4), V2=c("a","a","b"), V3=c(2,3,1))
V1 V2 V3
1 a 2
2 a 3
4 b 1
and want to swap values between V1 and V3 based on the value of V2 as follows:
if V2 == 'b' then V1 <- V3 and V3 <- V1
resulting in
V1 V2 V3
1 a 2
2 a 3
1 b 4
I tried a do loop but it takes forever. If I use Perl, it takes seconds. I believe this task can be done efficiently in R as well. Any suggestions are appreciated.
Try this; the right-hand side is evaluated in full before the assignment, so both columns are swapped in one step:
df <- data.frame(V1=c(1,2,4), V2=c("a","a","b"), V3=c(2,3,1))
df[df$V2 == "b", c("V1", "V3")] <- df[df$V2 == "b", c("V3", "V1")]
which yields:
> df
V1 V2 V3
1 1 a 2
2 2 a 3
3 1 b 4
You can use transform to do this.
df <- transform(df, V3 = ifelse(V2 == 'b', V1, V3), V1 = ifelse(V2 == 'b', V3, V1))
Edited: I got tripped up with column names, sorry. This works because transform() evaluates all of its arguments against the original data frame, so the second ifelse() still sees the pre-swap V3.
If you don't mind the rows ending up in different orders, this is kind of a 'cute' way to do this:
dat <- read.table(textConnection("V1 V2 V3
1 a 2
2 a 3
4 b 1"),sep = "",header = TRUE)
tmp <- dat[dat$V2 == 'b',3:1]
colnames(tmp) <- colnames(dat)
rbind(dat[dat$V2 != 'b',],tmp)
Basically, that's just grabbing the rows where V2 == 'b', reversing the columns, and slapping the result back together with everything else. This can be extended if you have more columns that don't need switching; you'd just use an integer index with those values transposed, rather than just 3:1, as sketched below.
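For instance, with a hypothetical extra column V4 that should stay put, the index just transposes the positions of V1 and V3:
# hypothetical wider frame: only columns 1 and 3 trade places in the index
dat2 <- data.frame(V1 = c(1, 2, 4), V2 = c("a", "a", "b"),
V3 = c(2, 3, 1), V4 = c("x", "y", "z"))
tmp <- dat2[dat2$V2 == 'b', c(3, 2, 1, 4)]
colnames(tmp) <- colnames(dat2)
rbind(dat2[dat2$V2 != 'b', ], tmp)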
