Count rows that do not contain a string - r

I have the following dataset:
#df
Factors  Transactions
a,c      1
b        0
c        0
d,a      0
a        1
a        0
b        1
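For reproducibility, here is one way this data might be constructed (a sketch; with stringsAsFactors = FALSE the as.character() call in the code below is a harmless no-op):
df <- data.frame(
  Factors = c("a,c", "b", "c", "d,a", "a", "a", "b"),
  Transactions = c(1, 0, 0, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)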
I'd like to know how many times we did not have a factor but did have a transaction. So, my desired output is as follows:
#desired output
Factors  count
a        1
b        2
c        2
d        3
For instance, only once did we not have a while we had a transaction (i.e. only in the last row).
There are many ways to count how many times we had each factor together with a transaction. For instance, I tried this one:
library(data.table)
setDT(df)[, .(Factors = unlist(strsplit(as.character(Factors), ","))),
          by = Transactions][, .(Transactions = sum(Transactions > 0)), by = Factors]
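Applied to the sample data, that pipeline counts how many transaction rows contain each factor - roughly this output (a sketch):
#   Factors Transactions
#1:       a            2
#2:       c            1
#3:       b            1
#4:       d            0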
But I wish to count how many times we didn't have a factor while we had a transaction.
Thanks in advance.

You can calculate the opposite, i.e., how many times each factor has a transaction; the difference between the total number of transactions and the count for each individual factor is then what you are looking for:
library(data.table)
total <- sum(df$Transactions > 0)
(setDT(df)[, .(Factors = unlist(strsplit(as.character(Factors), ","))), Transactions]
          [, total - sum(Transactions > 0), Factors])
#   Factors V1
#1:       a  1
#2:       c  2
#3:       b  2
#4:       d  3

We can also do this with cSplit from splitstackshape:
library(splitstackshape)
cSplit(df, "Factors", ",", "long")[, sum(df$Transactions > 0) - sum(Transactions > 0), Factors]
#   Factors V1
#1:       a  1
#2:       c  2
#3:       b  2
#4:       d  3
Or with dplyr/tidyr:
library(dplyr)
library(tidyr)
separate_rows(df, Factors) %>%
  group_by(Factors) %>%
  summarise(count = sum(df$Transactions > 0) - sum(Transactions > 0))
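Note that separate_rows splits on any non-alphanumeric character by default; to be explicit about the comma separator (a minimal sketch):
separate_rows(df, Factors, sep = ",") %>%
  group_by(Factors) %>%
  summarise(count = sum(df$Transactions > 0) - sum(Transactions > 0))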

Related

Is there a way to count values by presence per rows in R?

I want a way to count values in a dataframe based on their presence by row:
a = data.frame(c('a','b','c','d','f'),
               c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, the letter "a" appears in the first row and the third row, totaling two appearances. I've written this code to count a value when its presence in a row is TRUE, but I want the count attributed automatically for all the values present in the dataframe:
# count the value 'a' and attribute the count to the b dataframe
b = data.frame(unique(unlist(a)))
b$count = 0
for (i in 1:nrow(a)) {
  if (TRUE %in% apply(a[i, ], 2, function(x) x %in% 'a')) {
    b$count[1] = b$count[1] + 1
  }
}
b$count[1]
[1] 2
The problem is that I have to do this manually for every value, and I want a way to do it automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
It can be done in base R by taking the unique values separately for each column, unlisting to a vector, and getting the frequency count with table. If needed, convert the table object to a two-column data.frame with stack:
stack(table(unlist(lapply(a, unique))))[2:1]
Output:
#  ind values
#1   a      2
#2   b      2
#3   c      1
#4   d      2
#5   f      1
If it is based on rows, use apply with MARGIN = 1:
table(unlist(apply(a, 1, unique)))
Or group by row to get the unique values and count with table:
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or use a faster approach with dapply from collapse:
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
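If speed matters for your data size, the two row-based variants can be compared with microbenchmark (a sketch; actual timings depend on the data):
library(microbenchmark)
microbenchmark(
  base     = table(unlist(apply(a, 1, unique))),
  collapse = table(unlist(dapply(a, funique, MARGIN = 1)))
)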
Does this work?
library(dplyr)
library(tidyr)
a %>% pivot_longer(cols = everything()) %>% distinct() %>% count(value)
# A tibble: 5 x 2
  value     n
  <chr> <int>
1 a         2
2 b         2
3 c         1
4 d         2
5 f         1
Data used:
a
  let let2
1   a    a
2   b    b
3   c    a
4   d    b
5   f    d

Sub-setting rows/columns in a dataframe based on Similarity in R

I have a dataframe "df" as below:
   V1 V2
1   b  a
2   b  a
3   a  b
4   b  a
5   a  b
6   a  b
7   a  b
8   b  a
9   a  b
10  a  b
11  a  b
12  b  a
Is there a way that I can automate the following 3 steps in R?
Step 1:
R identifies that, across the 12 rows of the dataframe "df", the pattern "a b" repeats the majority of the time.
Step 2:
Based on the majority pattern from Step 1, R subsets the dataframe to only those rows which contain that pattern.
Step 3:
R outputs the new subsetted dataframe from Step 2.
Is there a package to do this or a function that I can build? Any guidance will be very valuable. Thank You
Do you need duplicates of the most common combination? If not, there's a really simple way to do it with data.table:
library("data.table")
#Create sample data, set seed to have the same output
set.seed(1)
df <- data.table(V1 = sample(c("a", "b", "c"),10 , replace = T),
V2 = sample(c("a", "b", "c"),10 , replace = T),
V3 = sample(c("a", "b", "c"),10 , replace = T))
#Subset
cols <- names(df)
df[, .N, by = cols][order(-N)][1,]
Output (N is the number of occurrences):
   V1 V2 V3 N
1:  b  c  b 2
With your updated Q, you may try this with data.table:
library(data.table)
setDT(df)
cols <- c("V1", "V2")
df[, .N, by = cols][N == max(N)][, N := NULL][df, on = cols, nomatch = 0]
#   V1 V2 id
#1:  a  b  3
#2:  a  b  5
#3:  a  b  6
#4:  a  b  7
#5:  a  b  9
#6:  a  b 10
#7:  a  b 11
Explanation
setDT(df) coerces the data.frame to a data.table without copying.
The relevant columns are defined to save typing.
The number of occurrences of each combination of the relevant columns is counted.
Only the combination with the highest number of occurrences is kept. This completes Step 1 of the Q, which asks to find the majority pattern. Note that in case of a tie, i.e., two or more combinations sharing the maximum number of occurrences, all of them are returned. The OP hasn't specified how he wants to handle this case.
The counts are removed as they are no longer needed.
Finally, the original df is joined so that all rows of df which match the majority pattern are selected. nomatch = 0 specifies an inner join. This completes Step 2 and perhaps also Step 3 (but this is unclear to me; see comment).
Note that the row numbers in the id column are kept in the result without any additional effort. This would be the case for any other additional column in df as well.
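To see the intermediate results, the chain can be broken up like this (a sketch; the object names are for illustration only):
counts <- df[, .N, by = cols]                 # count each combination
major  <- counts[N == max(N)]                 # keep the majority pattern(s)
major[, N := NULL]                            # drop the count column
result <- major[df, on = cols, nomatch = 0]   # inner join back to df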
Data
df <- fread("id V1 V2
1 b a
2 b a
3 a b
4 b a
5 a b
6 a b
7 a b
8 b a
9 a b
10 a b
11 a b
12 b a")
# EDIT: By request of the OP, the row number (id) should be kept
# df[, id := NULL]
What exactly is "similarity" in your query - just finding the most common row and keeping only that? If so, you just need to group by all variables and sort by occurrence.
If you're talking about similar, but not exactly matching, text in your columns, you need to look into edit distances (the stringdist package is good for this task: https://cran.r-project.org/web/packages/stringdist/stringdist.pdf).
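For instance, a minimal sketch with stringdist, where method = "lv" selects the Levenshtein edit distance:
library(stringdist)
stringdist("abc", "abd", method = "lv")
# [1] 1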

Group a data.table using a column which is a list

I have a really big problem: looping through my data.table to do what I want is too slow, so I am trying to get around looping. Let's assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
   i j   k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6   b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
right now I am getting the following error:
Error in `[.data.table`(a, , sum(i), by = k) :
The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to first group all the rows having "a" in column k and calculate sum(j), then all rows having "b", and so on. So the desired answer would be:
k V1
a  4
b  8
c  2
Any hint on how to do it efficiently? I can't melt column k by repeating the rows, since the resulting data.table would be too big for my case.
I think this might work:
a[, .(k = unlist(k)), by = .(i, j)][, sum(j), by = k]
   k V1
1: a  4
2: b  8
3: c  2
If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j), k]
#   k V1
#1: a  4
#2: b  8
#3: c  2
Or using the dplyr/tidyr pipes
unnest(a, k) %>%
  group_by(k) %>%
  summarise(V1 = sum(j))
# A tibble: 3 x 2
#      k    V1
#  <chr> <dbl>
#1     a     4
#2     b     8
#3     c     2
Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
   i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating the rows of columns i:j to match the unlisted k. The data should probably be kept in this long format instead of using a list column. From there, as in @MikeyMike's answer, we can do dat[, sum(j), by=k].
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
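Either way, the final aggregation on the long-format dat gives the desired result (a sketch):
dat[, sum(j), by = k]
#   k V1
#1: a  4
#2: b  8
#3: c  2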

How do I subset a data table row in R to get the rows unique to itself

I know this may be a simple question, but I can't seem to get it right.
I have two data tables, old_dt and new_dt. Both data tables have the same two columns. My goal is to get the rows from new_dt that are not in old_dt.
Here is an example. old_dt:
v1 v2
 1  a
 2  b
 3  c
 4  d
Here is new_dt:
v1 v2
 3  c
 4  d
 5  e
What I want is to get just the "5 e" row.
Using setdiff didn't work because my real data is more than 3 million rows. Using subset like this:
sub.cti <- subset(new_dt, old_dt$v1 != new_dt$v1 & old_dt$v2 != new_dt$v2)
only resulted in new_dt itself.
Using
sub.cti <- new_dt[, .(!old_dt$v1, !old_dt$v2)]
resulted in multiple rows of FALSEs.
Can somebody help me?
Thank you in advance
We can do a join (data from @giraffehere's post):
df2[!df1, on = "a"]
#   a  b
#1: 6 14
#2: 7 15
To get rows in 'df1' that are not in 'df2' based on the 'a' column:
df1[!df2, on = "a"]
#   a  b
#1: 4  9
#2: 5 10
In the OP's example we need to join on both columns:
new_dt[!old_dt, on = c("v1", "v2")]
#   v1 v2
#1:  5  e
NOTE: Here I assumed the 'new_dt' and 'old_dt' as data.tables.
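If they are still plain data.frames, convert them by reference first (a minimal sketch):
library(data.table)
setDT(old_dt)
setDT(new_dt)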
Of course, dplyr is a good package. For dealing with this problem, a shorter anti_join can be used:
library(dplyr)
anti_join(new_dt, old_dt)
#     v1    v2
#  (int) (chr)
#1     5     e
Or setdiff from dplyr, which works on data.frame, data.table, tbl_df, etc.:
setdiff(new_dt, old_dt)
#   v1 v2
#1:  5  e
However, the question is tagged as data.table.
dplyr helps a lot when you deal with tabular data in R - I'd recommend learning more about it.
library(dplyr)
library(magrittr) # this is just for shorter code with %<>%
# Create a sequence key that combines v1 & v2
old_dt %<>%
  mutate(sequence = paste0(v1, v2))
new_dt %<>%
  mutate(sequence = paste0(v1, v2))
# Filter new_dt to sequences that do not exist in old_dt
result <- new_dt %>%
  filter(!(sequence %in% old_dt$sequence)) %>%
  select(v1:v2)
v1 v2
 5  e
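One caveat: paste0 without a separator can create false matches (e.g. v1 = 1, v2 = "23" produces the same key as v1 = 12, v2 = "3"), so adding a separator is safer (a sketch):
old_dt %<>% mutate(sequence = paste(v1, v2, sep = "_"))
new_dt %<>% mutate(sequence = paste(v1, v2, sep = "_"))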
EDIT: I noticed the OP wanted to match on both columns, not just one. I'll keep the data initialization part of the solution here as it is referenced above by @akrun. However, use the top solution @akrun posted; it is more of the "data.table way".
df1 <- data.table(a = 1:5, b = 6:10)
df2 <- data.table(a = c(1, 2, 3, 6, 7), b = 11:15)
head(df1)
   a  b
1: 1  6
2: 2  7
3: 3  8
4: 4  9
5: 5 10
head(df2)
   a  b
1: 1 11
2: 2 12
3: 3 13
4: 6 14
5: 7 15
If column a has repeats, you could try this base R hack:
id.var1 <- paste(df1$a, df1$b, sep = "_")
id.var2 <- paste(df2$a, df2$b, sep = "_")
dfKeep  <- df2[!(id.var2 %in% id.var1), ]

R counting occurrence of string

I have read data from multiple xlsx sheets with the xlsx package in R. Currently my data frame looks like this:
firstcol SecondCol
A        abcd
B        bds
A        <NA>
A        asd
C        <NA>
B        adfdf
?        <NA>
C        adfd
From the above data I want to get the following output:
Firstcol FirstcolCount SecondCol
A        3 times       2        # we'll not count NA's
B        2 times       2
C        2 times       1
other    1 times       0
Is there any direct method that can do this? Any suggestions would be appreciated.
A data.table approach:
# load library
require(data.table)
# convert data.frame to data.table
setDT(df)
# make a new data.table with two columns: the first has the counts for each
# level of firstcol, the second has the count minus the number of NA cases
df[, .(FirstcolCount = .N,
       SecondCol = .N - sum(is.na(SecondCol))),
   by = firstcol]
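On the example data (assuming the <NA> entries are real NA values), this should give roughly:
#   firstcol FirstcolCount SecondCol
#1:        A             3         2
#2:        B             2         2
#3:        C             2         1
#4:        ?             1         0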
Though it's not quite clear what exactly you mean - something like this?
library(dplyr)
# Note: this assumes SecondCol stores the literal string "<NA>";
# if they are real NAs, use sum(is.na(SecondCol)) instead
df %>%
  group_by(firstcol) %>%
  summarise(FirstcolCount = n(), SecondCol = n() - sum(SecondCol == "<NA>"))
Source: local data frame [4 x 3]
  firstcol FirstcolCount SecondCol
1        ?             1         0
2        A             3         2
3        B             2         2
4        C             2         1
