I am trying to look at protein sequence homology using R, and I'd like to go through a data frame looking for identical pairs of Position and Letter. The data look similar to the frame below:
Letter <- c("A", "B", "C", "D", "D", "E", "G", "L")
Position <- c(1, 2, 3, 4, 4, 5, 6, 7)
data.set <- cbind(Position, Letter)
Which yields:
Position Letter
[1,] "1" "A"
[2,] "2" "B"
[3,] "3" "C"
[4,] "4" "D"
[5,] "4" "D"
[6,] "5" "E"
[7,] "6" "G"
[8,] "7" "L"
I'd like to loop through and find all identical observations (in this case, observations 4 and 5), but I'm having difficulty in discovering the best way to do it.
I'd like the resultant data frame to look like:
Position Letter
[1,] "4" "D"
[2,] "4" "D"
The ways I've tried to do this ended up yielding this code, but unfortunately it returns one value of TRUE because I realized that I am comparing two identical data frames:
> identical(data.set[1:nrow(data.set),1:2], data.set[1:nrow(data.set),1:2])
[1] TRUE
I'm not sure if looping through using the identical() function would be the best way? I'm sure there's a more elegant solution that I am missing.
Thanks for any help!
Try the unique function:
unique(data.set)
...
You can call duplicated twice, once in each direction (the second call with fromLast = TRUE), so that every copy of a repeated row is flagged and not just the later ones:
data.set[duplicated(data.set) | duplicated(data.set, fromLast = TRUE), ]
# Position Letter
#[1,] "4" "D"
#[2,] "4" "D"
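The question talks about a data frame, while cbind() actually builds a character matrix. As a minimal sketch, assuming the data are held in a data frame instead (the names df, is_dup and dups are only illustrative), the same two-direction check looks like this:

```r
# Rebuild the example as a data frame rather than a character matrix
Letter <- c("A", "B", "C", "D", "D", "E", "G", "L")
Position <- c(1, 2, 3, 4, 4, 5, 6, 7)
df <- data.frame(Position, Letter, stringsAsFactors = FALSE)

# Flag a row if it repeats an earlier row OR is repeated by a later row
is_dup <- duplicated(df) | duplicated(df, fromLast = TRUE)

# Keep every copy of each duplicated row
dups <- df[is_dup, ]
print(dups)
```

Here Position stays numeric, since a data frame does not coerce all columns to character the way cbind() does.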
I'm looking to write an R script that will search a column for a specific value and begin subsetting rows until a specific text value is reached.
Example:
X1 X2
[1,] "a" "1"
[2,] "b" "2"
[3,] "c" "3"
[4,] "d" "4"
[5,] "e" "5"
[6,] "f" "6"
[7,] "c" "7"
[8,] "k" "8"
What I'd like to do is search through X1 until the letter 'c' is found, and begin to subset rows until another letter 'c' is found, at which point the subset procedure would stop. Using the above example, the result should be a vector containing c(3,4,5,6,7).
Assume there will be no more than 2 rows where X1 equals 'c'
Any help is greatly appreciated.
You can look up where a value occurs with the function which, and use the result as an index to get the values you are looking for. If you want everything from the first to the second "c", it would look like this:
indices <- which(df$X1=='c')
range <- indices[1]:indices[2]
df$X2[range]
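For reference, here is a self-contained sketch of the same idea run on the example data. It assumes the data live in a data frame called df and, as the question states, that exactly two rows have X1 equal to "c"; the name rows is used instead of range only to avoid masking the base function of that name.

```r
# Example data from the question, as a data frame
df <- data.frame(X1 = c("a", "b", "c", "d", "e", "f", "c", "k"),
                 X2 = as.character(1:8),
                 stringsAsFactors = FALSE)

# Row numbers where X1 is "c" (assumed to be exactly two)
indices <- which(df$X1 == "c")

# All row numbers from the first "c" to the second "c", inclusive
rows <- seq(indices[1], indices[2])

print(rows)         # the row indices
print(df$X2[rows])  # the corresponding X2 values
```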
Context: I have a list of sports teams called teamNames, and I would like to generate their match-ups for each week. I'm not sure if permutations are even the right approach, but I feel like they would be. What I would ideally like is to pass a vector of team names to a function, and then have it give me a matrix where each row has that vector of team names in a different order, such that if I go through them in pairs, I'll get a unique set of match-ups for each row.
For example if my input is teamNames <- c("a", "b", "c", "d"), I want the output to be a matrix that says:
a b c d
a c b d
a d c b
Edit: Further clarification: in this case, the matrix has given me three "weeks" of matchups. First week: "a" vs. "b" and "c" vs. "d"
Second week: "a" vs. "c" and "b" vs. "d"
Third week: "a" vs. "d" and "b" vs. "c"
The closest I've gotten from reading other questions is to use the permutations function in the gtools package as follows:
permutations(length(teamNames), 2, teamNames)
This generates all the possible match-ups, but what it doesn't do is divide them into sets/weeks. combinations(length(teamNames), 4, teamNames) only gives me one set of matchups.
If I understand correctly, once 2 teams are chosen from the 4, the remaining two have to be matched with each other, so the problem reduces to selecting 2 out of 4. Permutations do not apply here because 'a vs b' == 'b vs a'. No extra package is necessary, as the utils package has combn().
> combn(teamNames, 2)
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "a" "a" "a" "b" "b" "c"
[2,] "b" "c" "d" "c" "d" "d"
The output above shows every way of selecting 2 teams from 4, and there is some duplication: selecting a and b is equivalent to selecting c and d. If one of each pair of duplicated cases is dropped, what remains can be used to set up a schedule.
Update
@Buckminster - I keep updating the code. In this update the remaining two teams are added, although there is still duplication. Also, among 4 teams, once 2 are determined the remaining two are determined as well (a similar idea to solving a system of equations in linear algebra). In other words, I'm not sure why the -1 was given, probably by you.
# Update
teamNames <- c("a", "b", "c", "d")
first <- combn(teamNames, 2, simplify = FALSE)
second <- lapply(first, function(x) teamNames[!teamNames %in% x])
bind <- rbind(do.call(cbind, first), do.call(cbind, second))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "a" "a" "a" "b" "b" "c"
[2,] "b" "c" "d" "c" "d" "d"
[3,] "c" "b" "b" "a" "a" "a"
[4,] "d" "d" "c" "d" "c" "b"
Let me check if duplication can be removed easily.
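One possible way to drop the duplicates, offered as a sketch rather than a general scheduler: every week must contain exactly one match involving the first team, so keeping only the columns whose first entry is teamNames[1] leaves one column per week. This relies on there being exactly 4 teams, because only then does the complement of a pair form a single match. The name weeks is illustrative.

```r
teamNames <- c("a", "b", "c", "d")

# All pairs, and for each pair the two teams left over
first <- combn(teamNames, 2, simplify = FALSE)
second <- lapply(first, function(x) teamNames[!teamNames %in% x])
bind <- rbind(do.call(cbind, first), do.call(cbind, second))

# Keep only the columns where the first team appears in the first match,
# then transpose so that each row is one week
weeks <- t(bind[, bind[1, ] == teamNames[1]])
print(weeks)
```

Reading each row in pairs gives a vs b and c vs d, then a vs c and b vs d, then a vs d and b vs c.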
I have a data frame in R. One column of this data frame takes values from 1 to 6. I have another data frame that maps the values in this column to labels. Is there some way to substitute the values in this column without using loops (using some function)?
[,1] [,2]
[1,] "1" "A"
[2,] "2" "B"
[3,] "3" "C"
[4,] "4" "D"
[1] 4 2 2 3 4 1 4 4
Is there a function that returns the vector below without using loops?
[1] D B B C D A D D
You can try
v1 <- c(4, 2, 2, 3, 4, 1, 4, 4)
LETTERS[v1]
#[1] "D" "B" "B" "C" "D" "A" "D" "D"
If the 2nd column of your mapping dataset is not just "LETTERS":
d1 <- data.frame(Col1=c(1,3,4,2), Col2=c('j1', 'c9', '19f', 'd18'),
stringsAsFactors=FALSE)
unname(setNames(d1[,2], d1[,1])[as.character(v1)])
#[1] "19f" "d18" "d18" "c9" "19f" "j1" "19f" "19f"
Or
d1$Col2[match(v1, d1$Col1)]
#[1] "19f" "d18" "d18" "c9" "19f" "j1" "19f" "19f"
To use this type of map/dictionary to create a new data frame column, you can also use the match method:
snape_gradebook_df <- data.frame(students = c("Harry", "Hermione", "Ron", "Neville", "Ginny", "Luna", "Draco", "Cho"),
numeric_grade = c(4, 2, 2, 3, 4, 1, 4, 4),
stringsAsFactors = FALSE)
grade_map <- data.frame(numbers = c(1, 2, 3, 4), letters = c("A", "B", "C", "D"),
stringsAsFactors = FALSE)
# match() finds each numeric grade's row in grade_map; use it to pull the letter
snape_gradebook_df$letter_grade <- grade_map$letters[match(snape_gradebook_df$numeric_grade, grade_map$numbers)]
I'm not sure of the advantages/disadvantages of setNames vs match - just thought I would share another possible solution.
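As a small sketch comparing the two lookups on the d1 mapping from the earlier answer: both give the same result, and both return NA for a value that has no entry in the map. The vector v2 and its unmapped value 7 are made up for illustration.

```r
d1 <- data.frame(Col1 = c(1, 3, 4, 2),
                 Col2 = c("j1", "c9", "19f", "d18"),
                 stringsAsFactors = FALSE)

# Hypothetical input: 7 does not appear in d1$Col1
v2 <- c(4, 2, 7)

# Lookup by position via match()
by_match <- d1$Col2[match(v2, d1$Col1)]

# Lookup by name via a named vector
by_names <- unname(setNames(d1$Col2, d1$Col1)[as.character(v2)])

print(by_match)
print(identical(by_match, by_names))
```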
I am trying to find a way to get a list in R of all the possible unique permutations of A,A,A,A,B,B,B,B,B.
Combinations was what was originally thought to be the method for obtaining a solution, hence the combinations answers.
I think this is what you're after. @bill was on the ball with the recommendation of combining unique and combn. We'll also use the apply family to generate ALL of the combinations. Since unique removes duplicate rows, we need to transpose the results from combn before applying unique to them. We then transpose them back before printing so that each column represents a unique answer.
#Daters
x <- c(rep("A", 4), rep("B",5))
#Generates a list with ALL of the combinations
zz <- sapply(seq_along(x), function(y) combn(x,y))
#Filter out all the duplicates
sapply(zz, function(z) t(unique(t(z))))
Which returns:
[[1]]
[,1] [,2]
[1,] "A" "B"
[[2]]
[,1] [,2] [,3]
[1,] "A" "A" "B"
[2,] "A" "B" "B"
[[3]]
[,1] [,2] [,3] [,4]
[1,] "A" "A" "A" "B"
[2,] "A" "A" "B" "B"
[3,] "A" "B" "B" "B"
...
EDIT Since the question is about permutations and not combinations, the answer above is not that useful. This post outlines a function to generate the unique permutations given a set of parameters. I have no idea if it could be improved upon, but here's one approach using that function:
fn_perm_list <-
function (n, r, v = 1:n)
{
if (r == 1)
matrix(v, n, 1)
else if (n == 1)
matrix(v, 1, r)
else {
X <- NULL
for (i in 1:n) X <- rbind(X, cbind(v[i], fn_perm_list(n -
1, r - 1, v[-i])))
X
}
}
zz <- fn_perm_list(9, 9)
#Turn into character matrix. This currently does not generalize well, but gets the job done
zz <- ifelse(zz <= 4, "A", "B")
#Returns 126 rows as indicated in comments
unique(zz)
There's no need to generate permutations and then pick out the unique ones.
Here's a much simpler way (and much, much faster as well): To generate all permutations of 4 A's and 5 B's, we just need to enumerate all possible ways of placing 4 A's among 9 possible locations. This is simply a combinations problem. Here's how we can do this:
x <- rep('B',9) # vector of 9 B's
a_pos <- combn(9,4) # all possible ways to place 4 A's among 9 positions
perms <- apply(a_pos, 2, function(p) replace(x,p,'A')) # all desired permutations
Each column of the 9x126 matrix perms is a unique permutation of 4 A's and 5 B's:
> dim(perms)
[1] 9 126
> perms[,1:4] ## look at first few columns
[,1] [,2] [,3] [,4]
[1,] "A" "A" "A" "A"
[2,] "A" "A" "A" "A"
[3,] "A" "A" "A" "A"
[4,] "A" "B" "B" "B"
[5,] "B" "A" "B" "B"
[6,] "B" "B" "A" "B"
[7,] "B" "B" "B" "A"
[8,] "B" "B" "B" "B"
[9,] "B" "B" "B" "B"
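A quick sanity check of this approach, as a sketch: the number of columns should equal choose(9, 4), and no two columns should be the same. Collapsing each column into a string (the name strings is illustrative) also gives a list-like result that is easier to read.

```r
x <- rep("B", 9)       # vector of 9 B's
a_pos <- combn(9, 4)   # every choice of 4 positions for the A's
perms <- apply(a_pos, 2, function(p) replace(x, p, "A"))

# One column per way of choosing the A positions
stopifnot(ncol(perms) == choose(9, 4))

# duplicated() on a matrix works on rows, so transpose to compare columns
stopifnot(!any(duplicated(t(perms))))

# Collapse each permutation into a single string
strings <- apply(perms, 2, paste, collapse = "")
print(head(strings, 4))
```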
Assume a table as below:
X =
col1 col2 col3
row1 "A" "0" "1"
row2 "B" "2" "NA"
row3 "C" "1" "2"
I select combinations of two rows, using the code below:
pair <- apply(X, 2, combn, m=2)
This returns a matrix of the form:
pair =
[,1] [,2] [,3]
[1,] "A" "0" "1"
[2,] "B" "2" NA
[3,] "A" "0" "1"
[4,] "C" "1" "2"
[5,] "B" "2" NA
[6,] "C" "1" "2"
I wish to iterate over pair, taking two rows at a time, i.e. first isolate [1,] and [2,], then [3,] and [4,] and finally, [5,] and [6,]. These rows will then be passed as arguments to regression models, i.e. lm(Y ~ row[i]*row[j]).
I am dealing with a large dataset. Can anybody advise how to iterate over a matrix two rows at a time, assign those rows to variables and pass as arguments to a function?
Thanks,
S ;-)
It is unnecessary to replicate the rows of your matrix like that, and with a large data set it might become problematic. Instead, just pick out the relevant rows for each pair. It is convenient to create the selection of row indices beforehand, something like this perhaps:
xselect <- combn(1:nrow(X),2)
To illustrate with your data (assuming you only use columns 2 and 3):
X <- matrix(c("A", "B", "C", 0,2,1,1,NA,2),3,3)
Y <- rnorm(2, 4, 2)
for (i in 1:ncol(xselect))
{
x1 <- as.numeric(X[xselect[1,i], c(2,3)])
x2 <- as.numeric(X[xselect[2,i], c(2,3)])
print(lm(Y ~ x1 * x2))
}
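As an alternative sketch of the same index-based idea, combn() accepts a FUN argument that is called once per combination, so the loop can be folded into the call. The function body here only collects the two rows into a list; in practice it would be whatever should be done with each pair. The names pairs, first and second are illustrative.

```r
X <- matrix(c("A", "B", "C", 0, 2, 1, 1, NA, 2), 3, 3)

# For each pair of row indices, pick out the two rows of X.
# simplify = FALSE returns a list with one element per pair.
pairs <- combn(nrow(X), 2, FUN = function(idx) {
  list(first = X[idx[1], ], second = X[idx[2], ])
}, simplify = FALSE)

print(length(pairs))
print(pairs[[1]])
```

This never builds the stacked matrix, which matters when the number of rows is large.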
I'm not sure exactly what you're trying to do with the linear models, but to iterate over X a pair of rows at a time, make a factor with one level for each pair, and then use by:
fac <- as.factor(sort(rep(1:(nrow(X)/2), 2)))
by(X, fac, FUN)
where FUN is whatever function you want to apply over the pairs of rows in X.
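A sketch of the same grouping idea using split() and lapply(), under the assumption that it is applied to the stacked six-row matrix pair from the question (rebuilt by hand here) rather than to the original three-row table. The names fac and chunks are illustrative.

```r
# The stacked matrix from the question: rows 1-2, 3-4 and 5-6 are the pairs
pair <- matrix(c("A", "B", "A", "C", "B", "C",
                 "0", "2", "0", "1", "2", "1",
                 "1", NA,  "1", "2", NA,  "2"), ncol = 3)

# Group label for each row: 1 1 2 2 3 3
fac <- rep(seq_len(nrow(pair) / 2), each = 2)

# Split the row numbers by group and pull out each 2-row block
chunks <- lapply(split(seq_len(nrow(pair)), fac),
                 function(i) pair[i, , drop = FALSE])

print(chunks[[2]])
```

Each element of chunks is a 2-row matrix, so any function of a pair of rows can be applied with another lapply() call.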