I have a data frame with 2 columns of factor variables, like this:
V1 <- c("A","B","C","Y","D","E","F","U","G","H","I","J","R")
V2 <- c("Z","Y","W","B","V","U","T","E","S","R","Q","P","H")
df <- cbind(V1,V2)
df
V1 V2
[1,] "A" "Z"
[2,] "B" "Y"
[3,] "C" "W"
[4,] "Y" "B"
[5,] "D" "V"
[6,] "E" "U"
[7,] "F" "T"
[8,] "U" "E"
[9,] "G" "S"
[10,] "H" "R"
[11,] "I" "Q"
[12,] "J" "P"
[13,] "R" "H"
Now I would like to count, using a function, all the cases where the combination of V1 and V2 equals the combination of V2 and V1, and return them. For df, for example, this count is 3:
y <-combinations_inver(df[,1],df[,2])
y$Combinations
"B""Y"= "Y""B"
"E""U"= "U""E"
"H""R"= "R""H"
y$Count
[1] 3 #because there are three occurrences (see $Combinations)
A simple way to do it would be:
forwards<-paste(V1,V2)
backwards<-paste(V2,V1)
The intersection of these two "sets" is what you are looking for, but intersect returns each mirrored pair in both orders (for example both "B Y" and "Y B"), so you need to divide the length by 2:
length(intersect(forwards, backwards))/2
We can use pmin and pmax to reorder the elements within each row, use duplicated to flag the rows that occur more than once, and then count the unique rows among them with nrow:
m1 <- cbind(pmin(df[,1], df[,2]), pmax(df[,1], df[,2]))
i1 <- duplicated(m1)|duplicated(m1, fromLast=TRUE)
nrow(unique(m1[i1,]))
#[1] 3
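Both answers return only the count, while the question also asks for the matching pairs. A minimal sketch that combines the two ideas could look like this. The function name and the Combinations/Count fields come from the question; the exact return shape (a two-column matrix) and the choice to ignore rows where V1 equals V2 are assumptions. It also assumes the values contain no spaces, since paste joins them with a space.

```r
# Sketch only: name and list fields follow the question's example call.
combinations_inver <- function(v1, v2) {
  v1 <- as.character(v1)
  v2 <- as.character(v2)
  forwards  <- paste(v1, v2)
  backwards <- paste(v2, v1)
  # A row is mirrored if its reversed pair also occurs in the data.
  # Rows with v1 == v2 are skipped (assumption): they mirror themselves.
  mirrored <- forwards %in% backwards & v1 != v2
  # Put the smaller element first so "B Y" and "Y B" collapse to one row.
  ordered <- cbind(pmin(v1, v2), pmax(v1, v2))
  pairs <- unique(ordered[mirrored, , drop = FALSE])
  list(Combinations = pairs, Count = nrow(pairs))
}

V1 <- c("A","B","C","Y","D","E","F","U","G","H","I","J","R")
V2 <- c("Z","Y","W","B","V","U","T","E","S","R","Q","P","H")
y <- combinations_inver(V1, V2)
y$Combinations
y$Count
#[1] 3
```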
With a simple vector like
x <- sample(letters[1:3], size=20, replace=T)
I would extract the most frequent letter with something like
y <- table(x)
print(names(y)[y==max(y)])
"b"
However, using the same technique over a multidimensional dataframe does not work:
set.seed(5)
x <- data.frame(c1=sample(letters[1:3], size=30, replace=T),
c2=sample(letters[4:5], size=30, replace=T),
c3=sample(letters[6:10], size=30, replace=T))
y <- table(x)
print(names(y)[y==max(y)])
NULL
How can I extract the levels of c1, c2, and c3 that have the highest value in the contingency table?
I know I could convert the table to a dataframe and find the row where the Freq column is highest, but given the number of dimensions and levels in my dataset, the resulting dataframe would not fit in RAM.
Edit: So my expected output in the second case would be c, d, j, as in:
z <- data.frame(y)
z[z$Freq==max(z$Freq), 1:3]
c1 c2 c3
27 c d j
But note that I cannot use the data.frame call on my data due to RAM issues.
You can use which with arr.ind = TRUE:
mapply("[",
dimnames(y),
as.data.frame(which(y == max(y), arr.ind = TRUE)))
# c1 c2 c3
#"c" "d" "j"
mapply("[",
dimnames(y),
as.data.frame(which(y == min(y), arr.ind = TRUE)))
# c1 c2 c3
# [1,] "a" "d" "f"
# [2,] "b" "d" "g"
# [3,] "c" "d" "g"
# [4,] "b" "e" "g"
# [5,] "a" "d" "h"
# [6,] "b" "d" "h"
# [7,] "c" "d" "h"
# [8,] "c" "e" "h"
# [9,] "a" "e" "i"
#[10,] "b" "e" "i"
#[11,] "c" "e" "i"
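If only one maximum is needed, a variation that avoids building the logical array y == max(y) is to take the linear index from which.max and convert it with arrayInd. This is a sketch on a small hand-made data frame (not the question's data), and it returns only the first maximum when there are ties.

```r
# Small deterministic example; the values are made up for illustration.
x <- data.frame(
  c1 = c("a", "b", "c", "c", "c", "a"),
  c2 = c("d", "e", "d", "d", "d", "e"),
  c3 = c("f", "g", "j", "j", "j", "f")
)
y <- table(x)

# Linear index of the first maximum, converted to one subscript per dimension.
idx <- arrayInd(which.max(y), dim(y))

# Look up the level name on each dimension.
top <- mapply(function(lev, i) lev[i], dimnames(y), idx[1, ])
top
#  c1  c2  c3
# "c" "d" "j"
```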
Given a variable x that can take values A,B,C,D
And three columns for variable x:
df1<-
rbind(c("A","B","C"),c("A","D","C"),c("B","A","C"),c("A","C","B"), c("B","C","A"), c("D","A","B"), c("A","B","D"), c("A","D","C"), c("A",NA,NA),c("D","A",NA),c("A","D",NA))
How do I make a column indicating the combination in the three preceding columns, such that permutations (ABC, ACB, BAC) are treated as the same combination ABC, and (AD, DA) as the same combination AD?
Pasting the three columns with apply(df1, 1, function(x) paste(x[!is.na(x)], collapse=", ")) -> df1$x4 and then using df1 %>% group_by(x4) %>% summarize(c = n()) would count AD and DA as different instead of the same.
Edited title
My desired result would be to get
a<-cbind(c("ABC",4),c("ACD",2),c("ABD",2),c("A",1),c("AD",2))
Someone already solved my question. Thanks
You can apply function paste after sorting each row vector.
df1 <-
cbind(df1, apply(df1, 1, function(x) paste(sort(x), collapse = "")))
df1
# [,1] [,2] [,3] [,4]
# [1,] "A" "B" "C" "ABC"
# [2,] "A" "D" "C" "ACD"
# [3,] "B" "A" "C" "ABC"
# [4,] "A" "C" "B" "ABC"
# [5,] "B" "C" "A" "ABC"
# [6,] "D" "A" "B" "ABD"
# [7,] "A" "B" "D" "ABD"
# [8,] "A" "D" "C" "ACD"
# [9,] "A" NA NA "A"
#[10,] "D" "A" NA "AD"
#[11,] "A" "D" NA "AD"
You can now simply table the column, with no need to load an external package or build more complex pipes.
table(df1[, 4])
#A ABC ABD ACD AD
#1 4 2 2 2
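Since the question refers to df1$x4, here is the same idea sketched on a data frame instead of a matrix. The column name x4 comes from the question; the name counts and the use of as.data.frame on the table are just one way to get the tallies into a two-column form.

```r
df1 <- as.data.frame(
  rbind(c("A","B","C"), c("A","D","C"), c("B","A","C"), c("A","C","B"),
        c("B","C","A"), c("D","A","B"), c("A","B","D"), c("A","D","C"),
        c("A",NA,NA), c("D","A",NA), c("A","D",NA)),
  stringsAsFactors = FALSE
)

# sort() drops NA by default, so missing entries do not appear in the key.
df1$x4 <- apply(df1[, 1:3], 1, function(x) paste(sort(x), collapse = ""))

# One row per combination with its frequency.
counts <- as.data.frame(table(x4 = df1$x4), stringsAsFactors = FALSE)
counts
#    x4 Freq
# 1   A    1
# 2 ABC    4
# 3 ABD    2
# 4 ACD    2
# 5  AD    2
```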
Given are two vectors, a and b
a = letters[1:6]
b = letters[7:11]
The goal is to sample a two column matrix using a and b. The first column should contain elements from a such that each element of a is repeated two times. The second column should contain elements from b such that each element of b is also repeated at least two times. One more condition is that the pairs have to be unique.
I have figured out how to sample the 12 pairs but have not figured out how I can ensure they will always be unique. For example, in the solution presented below, row 3 and row 11 are the same.
The desired output should have no duplicate rows.
set.seed(42)
m = cbind(sample(c(a, a)), sample(c(b, b, sample(b, 2, replace = TRUE))))
m
# [,1] [,2]
# [1,] "e" "g"
# [2,] "f" "k"
# [3,] "c" "k"
# [4,] "b" "h"
# [5,] "f" "j"
# [6,] "d" "i"
# [7,] "e" "h"
# [8,] "a" "g"
# [9,] "d" "h"
#[10,] "a" "i"
#[11,] "c" "k"
#[12,] "b" "j"
You can make it a function and throw replace in there, i.e.
f1 <- function(a, b){
m <- cbind(sample(c(a, a)), sample(c(b, b, sample(b, 2, replace = TRUE))))
m[,2] <-replace(m[,2], duplicated(m), sample(b[!b %in% m[duplicated(m),2]], 1))
return(m)
}
#which seems stable
sum(duplicated(f1(a, b)))
#[1] 0
sum(duplicated(f1(a, b)))
#[1] 0
sum(duplicated(f1(a, b)))
#[1] 0
sum(duplicated(f1(a, b)))
#[1] 0
Another way that doesn't require replacement
m = rbind(
c(1,1,0,0,0),
c(1,1,0,0,0),
c(0,0,1,1,0),
c(0,0,1,1,0),
c(0,0,0,0,1),
c(0,0,0,0,1)
)
# One "free" selection in each of the last two rows
m[5, sample(4,1)] = 1
m[6, sample(4,1)] = 1
# Scramble it while preserving row/column sums
m = m[sample(6), sample(5)]
as.matrix(expand.grid(a=a,b=b))[as.logical(m),]
# a b
# [1,] "a" "g"
# [2,] "b" "g"
# [3,] "e" "g"
# [4,] "c" "h"
# [5,] "d" "h"
# [6,] "f" "h"
# [7,] "d" "i"
# [8,] "f" "i"
# [9,] "b" "j"
#[10,] "c" "j"
#[11,] "a" "k"
#[12,] "e" "k"
Definitely not elegant, but would work.
a = letters[1:6]
b = letters[7:11]
asamp <- sample(c(a,a))
finished <- FALSE
while(!finished) {
bsamp <- sample(c(b, b, sample(b, 2, replace = TRUE)))
if(length(unique(paste(asamp,bsamp)))==12) finished <- TRUE
}
}
cbind(asamp,bsamp)
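To check any of the approaches above against all three conditions at once (each element of a exactly twice, each element of b at least twice, no repeated pair), a small checker can help. The names is_valid_sample, sample_unique_pairs and max_tries are hypothetical, and the sampler is just the retry loop wrapped in a function with an upper bound on the number of attempts.

```r
# Hypothetical helper: TRUE if m meets all three conditions from the question.
is_valid_sample <- function(m, a, b) {
  all(table(factor(m[, 1], levels = a)) == 2) &&    # each a exactly twice
    all(table(factor(m[, 2], levels = b)) >= 2) &&  # each b at least twice
    anyDuplicated(m) == 0                           # no repeated row
}

# Rejection sampling: redraw until the checker passes or max_tries is reached.
sample_unique_pairs <- function(a, b, max_tries = 1000) {
  for (i in seq_len(max_tries)) {
    m <- cbind(sample(c(a, a)),
               sample(c(b, b, sample(b, 2, replace = TRUE))))
    if (is_valid_sample(m, a, b)) return(m)
  }
  stop("no valid sample found in ", max_tries, " tries")
}

a <- letters[1:6]
b <- letters[7:11]
m <- sample_unique_pairs(a, b)
is_valid_sample(m, a, b)
#[1] TRUE
```

The same checker can be run repeatedly on the output of f1(a, b) instead of only counting duplicated rows.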
I have the following matrix:
V1 V2 V3 V4 V4
[1,] "a" "j" "d" "e" NA
[2,] "a" "b" "d" "e" NA
[3,] "a" "j" "g" "f" NA
[4,] "a" "g" "f" NA NA
I want to get:
V1 V2
[1,] "ajde"
[2,] "abde"
[3,] "ajgf"
[4,] "agf"
I know how to reduce a matrix to one column by using matrix(do.call(paste0, as.data.frame(M))) and how to remove the NA by row using m[!is.na(m[i,])]. I just do not know how to put the two together, as any time I try to use m[!is.na(m)] on the whole matrix, I end up with one large row.
We can use gsub to remove the NA
V1 <- gsub("NA+", "", do.call(paste0, as.data.frame(M)))
V1
#[1] "ajde" "abde" "ajgf" "agf"
matrix(V1, ncol=1)
Or we can use the traditional approach with apply
apply(M, 1, function(x) paste(x[!is.na(x)], collapse=""))
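For a reproducible run, here is the apply version on the matrix from the question, rebuilt by hand without column names. The second matrix, M2, is a made-up example that shows one caveat of the gsub route: after pasting, a missing value and the literal letters N and A look the same, so real text can be removed too.

```r
# Matrix from the question, rebuilt by hand (column names omitted).
M <- rbind(c("a", "j", "d", "e", NA),
           c("a", "b", "d", "e", NA),
           c("a", "j", "g", "f", NA),
           c("a", "g", "f", NA,  NA))

res <- apply(M, 1, function(x) paste(x[!is.na(x)], collapse = ""))
matrix(res, ncol = 1)

# Made-up example: the first row holds the letters "N" and "A" plus one NA.
M2 <- rbind(c("N", "A", NA),
            c("x", "y", "z"))
via_gsub  <- gsub("NA+", "", do.call(paste0, as.data.frame(M2)))
via_apply <- apply(M2, 1, function(x) paste(x[!is.na(x)], collapse = ""))
via_gsub   # the real "NA" text in row 1 is removed as well
#[1] ""    "xyz"
via_apply  # only the missing value is dropped
#[1] "NA"  "xyz"
```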
As a variant of this question
I have a vector with strings, each string has 2 to 4 characters.
Strng <- c("XDX", "GUV", "FQ", "ACUE", "HIT", "AYX", "NFD", "AHBW", "GKQ", "PYF")
I want to split it into a data frame with 4 columns, where each column contains one of the characters or 0 (for the case where the length of the string is less than 4). The zeros can also go in front; it doesn't matter.
So (probably) after applying this:
ss<-strsplit(Strng,"")
z<-lapply(ss,as.character)
I would like to have a dataframe like this:
>df
"X" "D" "X" "0"
"G" "U" "V" "0"
"F" "Q" "0" "0"
"A" "C" "U" "E"
"H" "I" "T" "0"
"A" "Y" "X" "0"
"N" "F" "D" "0"
"A" "H" "B" "W"
"G" "K" "Q" "0"
"P" "Y" "F" "0"
Any ideas?
Thank you,
Kalin
Here's an alternative with "data.table":
library(data.table)
setDT(tstrsplit(Strng, "", fill = "0"))[]
# V1 V2 V3 V4
# 1: X D X 0
# 2: G U V 0
# 3: F Q 0 0
# 4: A C U E
# 5: H I T 0
# 6: A Y X 0
# 7: N F D 0
# 8: A H B W
# 9: G K Q 0
# 10: P Y F 0
You could also use cSplit from my "splitstackshape" package, but it fills with NA and uses slightly unusual syntax:
library(splitstackshape)
cSplit(data.table(Strng), "Strng", "", stripWhite = FALSE)
We can use stri_list2matrix from stringi after we split the "Strng" to a list.
library(stringi)
stri_list2matrix(strsplit(Strng, ''), fill=0, byrow=TRUE)
# [,1] [,2] [,3] [,4]
# [1,] "X" "D" "X" "0"
# [2,] "G" "U" "V" "0"
# [3,] "F" "Q" "0" "0"
# [4,] "A" "C" "U" "E"
# [5,] "H" "I" "T" "0"
# [6,] "A" "Y" "X" "0"
# [7,] "N" "F" "D" "0"
# [8,] "A" "H" "B" "W"
# [9,] "G" "K" "Q" "0"
#[10,] "P" "Y" "F" "0"
Or a base R option would be (variant of the one described in the link)
read.fwf(file= textConnection(Strng),
widths = rep(1,max(nchar(Strng))))
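Another base R sketch, with no extra packages, pads every string with "0" up to the longest length before splitting, so the fill value is set explicitly. The names width and padded are just illustrative, and strrep needs R 3.3.0 or later.

```r
Strng <- c("XDX", "GUV", "FQ", "ACUE", "HIT", "AYX", "NFD", "AHBW", "GKQ", "PYF")

width <- max(nchar(Strng))
# Right-pad each string with "0" so all have the same number of characters.
padded <- paste0(Strng, strrep("0", width - nchar(Strng)))

# One row per string, one column per character.
df <- as.data.frame(do.call(rbind, strsplit(padded, "")),
                    stringsAsFactors = FALSE)
df
```

To put the zeros in front instead, swap the two arguments of paste0.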