Finding unique tuples in R but ignoring order

Since my data is much more complicated, I made a smaller sample dataset (I left the reshape in to show how I generated the data).
set.seed(7)
x = rep(seq(2010,2014,1), each=4)
y = rep(seq(1,4,1), 5)
z = matrix(replicate(5, sample(c("A", "B", "C", "D"))))
temp_df = cbind.data.frame(x,y,z)
colnames(temp_df) = c("Year", "Rank", "ID")
head(temp_df)
require(reshape2)
dcast(temp_df, Year ~ Rank)
which results in...
> dcast(temp_df, Year ~ Rank)
Using ID as value column: use value.var to override.
Year 1 2 3 4
1 2010 D B A C
2 2011 A C D B
3 2012 A B D C
4 2013 D A C B
5 2014 C A B D
Now I essentially want to use a function like unique, but ignoring order to find where the first 3 elements are unique.
Thus in this case:
I would have A,B,C in row 5
I would have A,B,D in rows 1&3
I would have A,C,D in rows 2&4
Also I need counts of these "unique" events
Also 2 more things. First, my values are strings, and I need to leave them as strings.
Second, if possible, I would have a column between year and 1 called Weighting, and then when counting these unique combinations I would include each's weighting. This isn't as important because all weightings will be small positive integer values, so I can potentially duplicate the rows earlier to account for weighting, and then tabulate unique pairs.

You could do something like this:
df <- dcast(temp_df, Year ~ Rank)
combos <- apply(df[, 2:4], 1, function(x) paste0(sort(x), collapse = ""))
combos
#     1     2     3     4     5
# "ABD" "ACD" "ABD" "ACD" "ABC"
For each row of the data frame, the values in columns 1, 2, and 3 (as labeled in the post) are sorted using sort, then concatenated using paste0. Since order doesn't matter, this ensures that identical cases are labeled consistently. (The output shown is for the table printed in the question.)
Note that the paste0 function is equivalent to paste(..., sep = ""). The collapse argument says to concatenate the values of a vector into a single string, with vector values separated by the value passed to collapse. In this case, we're setting collapse = "", which means there will be no separation between values, resulting in "ABC", "ACD", etc.
Then you can get the count of each combination using table:
table(combos)
# ABC ABD ACD
#   1   2   2
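The question also asks for weighted counts. A minimal sketch of one way to do that, assuming the wide table printed in the question plus a Weighting column whose values are invented here purely for illustration: sum the weights per sorted combination with tapply, and use split to see which years fall into each combination.

```r
# Wide table as printed in the question; the Weighting values are hypothetical.
df <- data.frame(
  Year = 2010:2014,
  Weighting = c(1, 2, 1, 3, 1),
  `1` = c("D", "A", "A", "D", "C"),
  `2` = c("B", "C", "B", "A", "A"),
  `3` = c("A", "D", "D", "C", "B"),
  `4` = c("C", "B", "C", "B", "D"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Order-insensitive key built from the first three ranks (values stay strings).
combos <- apply(df[, c("1", "2", "3")], 1,
                function(x) paste(sort(x), collapse = ""))

# Weighted count per combination, and the years belonging to each one.
weighted <- tapply(df$Weighting, combos, sum)
years_by_combo <- split(df$Year, combos)
```

With all weights equal to 1 this reduces to table(combos), so duplicating rows beforehand should not be necessary.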

This is the same solution as @Alex_A's, but using tidyverse functions:
library(reshape2)
library(purrr)
library(dplyr)
df <- dcast(temp_df, Year ~ Rank)
distinct(df, ID = pmap_chr(select(df, num_range("", 1:3)),
                           ~ paste0(sort(c(...)), collapse = "")))

Related

Is there any way to delete the rows of data which don't have all numeric values?

I have data that has two columns. Each column of data has numerical values in it but some of them don't have any numerical values. I want to remove the rows which don't have all values numerical. In reality, the data has 1000 rows but for simplification, I made the data file in smaller size here. Thanks!
a <- c(1, 2, 3, 4, "--")
b <- c("--", 2, 3, "--", 5)
data <- data.frame(a, b)
One base R option could be:
data[!is.na(Reduce(`+`, lapply(data, as.numeric))), ]
a b
2 2 2
3 3 3
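When creating or importing the data, use stringsAsFactors = FALSE; otherwise, in R versions before 4.0.0, the columns become factors and as.numeric returns the internal level codes rather than the values.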
Or using sapply():
data[!is.na(rowSums(sapply(data, as.numeric))), ]
An easier option is to convert with as.numeric and then check for NA. Any element that is not numeric becomes NA (with a warning), which is.na detects; filter_all can then use that to remove the rows
library(dplyr)
data %>%
filter_all(all_vars(!is.na(as.numeric(.))))
# a b
#1 2 2
#2 3 3
If we don't like the warnings, another option is to detect numeric-only elements with a regex: str_detect checks for one or more digits or dots ([0-9.]+) from the start (^) to the end ($) of the string
library(stringr)
data %>%
filter_all(all_vars(str_detect(., "^[0-9.]+$")))
# a b
#1 2 2
#2 3 3
If we have only -- as non-numeric, it is easier to remove
data[!rowSums(data == "--"),]
# a b
#2 2 2
#3 3 3
data
data <- data.frame(a,b, stringsAsFactors = FALSE)
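As a base R variant of the NA check above that avoids the coercion warnings, here is a small sketch. The helper name is_num is made up for this example; the data is the sample from the question.

```r
a <- c(1, 2, 3, 4, "--")
b <- c("--", 2, 3, "--", 5)
data <- data.frame(a, b, stringsAsFactors = FALSE)

# is_num is a hypothetical helper: TRUE where a value parses as a number.
# as.character() guards against factor columns; suppressWarnings() hides
# the "NAs introduced by coercion" warning.
is_num <- function(x) !is.na(suppressWarnings(as.numeric(as.character(x))))

# Keep only rows where every column is numeric.
keep <- Reduce(`&`, lapply(data, is_num))
clean <- data[keep, , drop = FALSE]

# Optionally convert the surviving columns to numeric.
clean[] <- lapply(clean, function(x) as.numeric(as.character(x)))
```

Using `&` rather than `+` inside Reduce keeps the intent explicit: a row survives only if every column passes the check.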

apply function by name of list

Imagine that I have a list
l <- list("a" = 1, "b" = 2)
and a data frame
id value
a 3
b 4
I want to match id with list names, and apply a function on that list with the value in data frame. For example, I want the sum of value in the data frame and corresponding value in the list, I get
id value
a 4
b 6
Does anyone have a clue?
Edit:
I want to expand the question a little. Now I have more than one value in every element of the list.
l <- list("a" = c(1, 2), "b" =c(1, 2))
I still want the sum
id value
a 6
b 7
We can match the names of the list with id of dataframe, unlist the list accordingly and add it to value
df$value <- unlist(l[match(df$id, names(l))]) + df$value
df
# id value
#1 a 4
#2 b 6
EDIT
If we have multiple entries in list we need to sum every list after matching. We can do
df$value <- df$value + sapply(l[match(df$id, names(l))], sum)
df
# id value
#1 a 6
#2 b 7
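You just need
df$value <- df$value + unlist(l)[as.character(df$id)]  # a named vector can be indexed by name; as.character() guards against a factor id
df
id value
1 a 4
2 b 6
Try it, as with Ronak's answer, on a list whose order differs from the data frame:
l <- list("b" = 2, "a" = 1)
unlist(l)[as.character(df$id)]  # as.character() in case id in df is a factor
a b
1 2
Update
df$value <- df$value + unlist(lapply(l, sum))[as.character(df$id)]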
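Putting the pieces above together, here is a self-contained sketch using the question's data frame and the reordered list from this answer. It assumes id may be a factor, so it is converted to character before being used as an index.

```r
l <- list(b = 2, a = 1)  # deliberately in a different order than df
df <- data.frame(id = c("a", "b"), value = c(3, 4), stringsAsFactors = TRUE)

# Collapse each list element to a single number; for length-one elements
# this is the value itself, for longer ones it is their sum.
totals <- vapply(l, sum, numeric(1))

# Look totals up by name, not by position, then add them to value.
df$value <- df$value + unname(totals[as.character(df$id)])
```

Indexing with a factor directly would use its integer codes, which only gives the right answer by coincidence when the list happens to be in level order.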

R regular expression for p#q#c#

What would the regular expression be to encompass variable names such as p3q10000c150 and p29q2990c98? I want to add all variables in the format of p-any number-q-any number-c-any number to a list in R.
Thanks!
I think you are looking for something like matches function in dplyr::select:
df = data.frame(1:10, 1:10, 1:10, 1:10)
names(df) = c("p3q10000c150", "V1", "p29q2990c98", "V2")
library(dplyr)
df %>%
select(matches("^p\\d+q\\d+c\\d+$"))
Result:
p3q10000c150 p29q2990c98
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
matches in select allows you to use regex to extract variables.
If your objective is to pull out the 3 numbers and put them in a 3 column data frame or matrix then any of these alternatives would do it.
The regular expression in #1 matches p and then one or more digits and then q and then one or more digits and then c and one or more digits. The parentheses form capture groups which are placed in the corresponding columns of the prototype data frame given as the third argument.
In #2 each non-digit ("\\D") is replaced with a space and then read.table reads in the data using the indicated column names.
In #3 we convert each element of the input to DCF format, namely c("\np: 3\nq: 10000\nc: 150", "\np: 29\nq: 2990\nc: 98"), then read it in using read.dcf and convert the columns to numeric. This creates a matrix whereas the prior two alternatives create data frames.
The second alternative seems simplest but the third one is more general in that it does not hard code the header names or the number of columns. (If we used col.names = strsplit(input, "\\d+")[[1]] in #2 then it would be similarly general.)
# 1
strcapture("p(\\d+)q(\\d+)c(\\d+)", input,
data.frame(p = character(), q = character(), c = character()))
# 2
read.table(text = gsub("\\D", " ", input), col.names = c("p", "q", "c"))
# 3
apply(read.dcf(textConnection(gsub("(\\D)", "\n\\1: ", input))), 2, as.numeric)
The first two above give this data.frame and the third one gives the corresponding numeric matrix.
p q c
1 3 10000 150
2 29 2990 98
Note: The input is assumed to be:
input <- c("p3q10000c150", "p29q2990c98")
Try:
x <- c("p3q10000c150", "p29q2990c98")
sapply(strsplit(x, "[pqc]"), function(i){
setNames(as.numeric(i[-1]), c("p", "q", "c"))
})
# [,1] [,2]
# p 3 29
# q 10000 2990
# c 150 98
I'll assume you have a data frame called df with variables names names(df). If you want to only retain the variables with the structure p<somenumbers>q<somenumbers>c<somenumbers> you could use the regex that Wiktor Stribiżew suggested in the comments like this:
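valid_vars <- grepl("p\\d+q\\d+c\\d", names(df))
df2 <- df[, valid_vars, drop = FALSE]
grepl() returns a vector of TRUE and FALSE values indicating which elements of names(df) follow the structure you described. You then use that vector to subset your data frame; drop = FALSE keeps the result a data frame even when only one column matches. Note that this pattern is unanchored, so it also accepts names that merely contain such a sequence; "^p\\d+q\\d+c\\d+$" requires the whole name to match.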
For clarity, observe:
var_names_test <- c("p3q10000c150", "p29q2990c98", "var1")
grepl("p\\d+q\\d+c\\d", var_names_test)
# [1] TRUE TRUE FALSE
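To see what anchoring changes, here is a small sketch comparing the unanchored pattern with the anchored one from the first answer. The name "xp1q2c3_old" is a made-up example of a variable that merely contains the sequence. The last few lines also pull the three numbers out with regexec and regmatches, as one more base R alternative.

```r
# The last name is hypothetical, added to show a partial match.
vars <- c("p3q10000c150", "p29q2990c98", "var1", "xp1q2c3_old")

loose  <- grepl("p\\d+q\\d+c\\d", vars)      # matches anywhere in the name
strict <- grepl("^p\\d+q\\d+c\\d+$", vars)   # whole name must match

# Extract the captured numbers for the strictly matching names.
hits  <- vars[strict]
parts <- regmatches(hits, regexec("^p(\\d+)q(\\d+)c(\\d+)$", hits))
nums  <- t(sapply(parts, function(m) as.numeric(m[-1])))  # drop the full match
colnames(nums) <- c("p", "q", "c")
```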

Removing rows in data.frame having columns subsumed in others

I am trying to achieve something similar to unique on a data.frame in which each element of one column is a vector. If the vector in one row is a subset of, or equal to, the vector in another row, I want to remove the row with the smaller number of elements. I can achieve this with a nested for loop, but since the data contains 400,000 rows the program is very inefficient.
Sample data
# Set the seed for reproducibility
set.seed(42)
# Create a random data frame
mydf <- data.frame(items = rep(letters[1:4], length.out = 20),
grps = sample(1:5, 20, replace = TRUE),
supergrp = sample(LETTERS[1:4], replace = TRUE))
# Aggregate items into a single column
temp <- aggregate(items ~ grps + supergrp, mydf, unique)
# Arrange by number of items for each grp and supergroup
indx <- order(lengths(temp$items), decreasing = TRUE)
temp <- temp[indx, ,drop=FALSE]
Temp looks like
grps supergrp items
1 4 D a, c, d
2 3 D c, d
3 5 D a, d
4 1 A b
5 2 A b
6 3 A b
7 4 A b
8 5 A b
9 1 D d
10 2 D c
Now you can see that the combinations of supergrp and items in the second and third rows are contained in the first row, so I want to delete the second and third rows from the result. Similarly, rows 5 to 8 are contained in row 4. Finally, rows 9 and 10 are contained in the first row, so I want to delete them as well.
Hence, my result would look like:
grps supergrp items
1 4 D a, c, d
4 1 A b
My implementation is as follows:
# Initialise the result data frame with the first row of the old data frame
newdf <- temp[1, ]
# For all rows in the original data
for (i in 1:nrow(temp)) {
  # Flag: TRUE until a row covering row i is found in the new data
  indx <- TRUE
  # Check if the items in the original data appear in the new data
  for (j in 1:nrow(newdf)) {
    if (all(c(temp$supergrp[[i]], temp$items[[i]]) %in%
            c(newdf$supergrp[[j]], newdf$items[[j]]))) {
      # Set indx to FALSE if a row with the same items and supergroup
      # as the old data is found in the new data
      indx <- FALSE
    }
  }
  # If no row in the new data contains the items and supergroup, append the row
  if (indx) {
    newdf <- rbind(newdf, temp[i, ])
  }
}
I believe there is an efficient way to implement this in R, maybe using the tidyverse and dplyr chains, but I am missing the trick. Apologies for the longish question. Any input would be highly appreciated.
I would try to get the items out of a list column and store them in a longer dataframe. Here is my somewhat hacky solution:
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
items <- temp$items %>%
  map(~str_split(., ",")) %>%
  map_df(~data.frame(.))
out <- bind_cols(temp[, c("grps", "supergrp")], items)
out %>%
  gather(item_name, item, -grps, -supergrp) %>%
  select(-item_name, -grps) %>%
  unique() %>%
  filter(!is.na(item))
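For comparison, here is a base R sketch that keeps the original rows and only compares sets within the same supergrp, which should shrink the inner loop considerably when there are many supergroups. The helper keep_maximal is made up for this example. The data is typed in from the temp table printed in the question rather than regenerated, since sample() can give different draws across R versions.

```r
# temp as printed in the question (items stored as a list column).
temp <- data.frame(
  grps = c(4, 3, 5, 1, 2, 3, 4, 5, 1, 2),
  supergrp = c("D", "D", "D", "A", "A", "A", "A", "A", "D", "D"),
  stringsAsFactors = FALSE
)
temp$items <- list(c("a", "c", "d"), c("c", "d"), c("a", "d"),
                   "b", "b", "b", "b", "b", "d", "c")

# Hypothetical helper: given sets ordered longest first, return the positions
# of those that are not a subset of (or equal to) an already kept set.
keep_maximal <- function(sets) {
  kept <- integer(0)
  for (i in seq_along(sets)) {
    covered <- any(vapply(kept, function(j) all(sets[[i]] %in% sets[[j]]),
                          logical(1)))
    if (!covered) kept <- c(kept, i)
  }
  kept
}

# Work within each supergrp separately, longest item sets first.
idx <- unlist(lapply(split(seq_len(nrow(temp)), temp$supergrp), function(rows) {
  rows <- rows[order(lengths(temp$items[rows]), decreasing = TRUE)]
  rows[keep_maximal(temp$items[rows])]
}))
result <- temp[sort(unname(idx)), ]
```

Each candidate is compared only against the sets already kept in its own supergroup, so the cost depends on the number of maximal sets per group rather than on the total number of rows.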

How do you test if a pair of elements is in a data frame?

Let's say I have this data frame A :
A = data.frame(first=c("a", "b","c", "d"), second=c(1, 2, 3, 4))
first second
1 a 1
2 b 2
3 c 3
4 d 4
And I have this data frame B :
B = data.frame(first=c("x", "a", "c"), second=c(1, 4, 3))
first second
1 x 1
2 a 4
3 c 3
I want to count the number of times a pair of the data frame B (B$first, B$second) is in the data frame A. The counting part is not the problem, I just can't find the function to determine whether a pair is in a data frame.
The result would be that only c("c",3) is an element of A, so it should be 1. both "a" and 4 are in data frame A, but the couple c("a", 4) does not exist in data frame A, so I don't want to count this. I want the exact match.
I'm looking for a function like %in% that could work for pairs.
Thanks for your help
Maybe something like this
apply(B, 1, function(r, A){ sum(A$first==r[1] & A$second==r[2]) }, A)
For every row r of B, the function checks which rows of A agree with r in both columns (A$first==r[1] & A$second==r[2]) and sums the resulting logicals, giving the number of rows of A that match r.
If you also want grouping, it can easily be done with dplyr like this
cbind(B, tmp) %>% group_by(first, second) %>% summarise(n = max(tmp))
where tmp is a variable holding the result of the aforementioned apply. (Early dplyr versions spelled the pipe %.%, which was deprecated in favour of %>%.)
Here's an alternative: rbind your data.frames together and use duplicated.
AB <- do.call(rbind, mget(c("A", "B")))
AB$ind <- as.numeric(duplicated(AB))
AB[grep("^B", rownames(AB)), ]
# first second ind
# B.1 x 1 0
# B.2 a 4 0
# B.3 c 3 1
You can also probably try to use "digest" to generate a hash for each row, but I'm not sure how efficient this would be:
library(digest)
Reduce(function(x, y) y %in% x,
lapply(mget(c("A", "B")), function(x)
apply(x, 1, digest)))
# [1] FALSE FALSE TRUE
An alternative is to paste each row into a single key, e.g. mB <- apply(B, 1, function(j) paste0(j[1], "_", j[2])), and similarly mA for A, at which point you can check mB[j] %in% mA for each j (or simply mB %in% mA, since %in% is vectorised).
Not that I would really recommend doing this :-)
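The key-pasting idea can be written without apply. A minimal sketch, assuming the separator "_" never appears in the data (otherwise distinct pairs could collide); the helper name key is made up here. An inner merge on both columns gives the same count as a cross-check, provided neither data frame contains duplicated pairs.

```r
A <- data.frame(first = c("a", "b", "c", "d"), second = c(1, 2, 3, 4),
                stringsAsFactors = FALSE)
B <- data.frame(first = c("x", "a", "c"), second = c(1, 4, 3),
                stringsAsFactors = FALSE)

# key is a hypothetical helper: one string per row, built from both columns.
key <- function(d) paste(d$first, d$second, sep = "_")

in_A <- key(B) %in% key(A)   # one logical per row of B
n_pairs <- sum(in_A)         # number of pairs of B found in A

# Cross-check: inner join on the shared columns first and second.
n_merge <- nrow(merge(A, B))
```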
