I have created a matrix like this:
> head(matrix)
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11
[1,] "0" "0" "1" "0" "1" "1" "0" "0" "0" "0" "NA"
[2,] "1" "0" "1" "0" "1" "1" "0" "0" "0" "0" "NA"
[3,] "0" "1" "1" "0" "1" "1" "0" "0" "0" "0" "NA"
[4,] "1" "1" "1" "0" "1" "1" "0" "0" "0" "0" "NA"
[5,] "0" "0" "2" "0" "1" "1" "0" "0" "0" "0" "NA"
[6,] "1" "0" "2" "0" "1" "1" "0" "0" "0" "0" "NA"
Now, I want to compare the matrix above with the following data frame:
> head(df)
cod Var11 Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var12
1 C000354 B 1 1 4 0 1 2 0 0 0 1 51520.72
2 C000404 A 1 0 1 0 4 4 0 0 1 1 21183.25
3 C000444 A 1 0 4 1 3 3 0 0 0 1 67504.74
4 C000480 A 1 1 2 0 2 3 0 0 1 1 26545.92
5 C000983 C 1 0 1 0 3 4 0 0 0 0 10379.37
6 C000985 C 1 0 3 1 3 4 0 0 0 0 18660.99
The matrix contains all possible combinations of the variables Var1 to Var10. When a row of df (columns Var1 to Var10 only) matches a row of the matrix and that df row has Var12 >= 90000, I would like "A" to be written in the corresponding Var11 cell of the matrix.
I have tried with this:
for (i in 1 : nrow(matrix)) {
for (j in 1 : 10) {
ifelse(matrix[i,j]==df[,(j+2)]
&& df$Var12[] >= 90000,
matrix[i,"Var11"] <- "A",
matrix[i,"Var11"] <- "NA")
}
}
But this writes NA in every row of the matrix.
Does anyone know why this happens or how to solve it?
Thanks in advance.
I don't understand why you used 1:10 and j+2 in your loop.
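A likely reason the original loop ends with "NA" everywhere: `&&` is meant for single values, so with the vectors in the question older R versions looked only at the first element of each side (R 4.3.0 and later raise an error instead), and every pass of the inner loop overwrites Var11 again. A minimal sketch of a per-row test, using made-up values (`row_mat`, `row_df` and `var12` are hypothetical names, not objects from the question):

```r
# Hypothetical single rows, for illustration only
row_mat <- c("1", "0", "2")   # one row of the character matrix
row_df  <- c(1, 0, 2)         # the same columns taken from one row of df
var12   <- 95000              # that df row's Var12 value

# all() collapses the element-wise comparison to a single TRUE/FALSE,
# so it can be combined safely with && and the scalar threshold test
row_matches <- all(row_mat == as.character(row_df))
is_hit <- row_matches && var12 >= 90000
is_hit
```

The point of the design is that the whole row is compared first and the result is reduced to one logical value before any assignment happens.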
#Some dummy data
col_to_match<-paste0("V",1:10)
set.seed(123)
mat <- cbind(matrix(sample(0:4, 100, replace=TRUE), ncol=10), "NA")
colnames(mat)<-c(col_to_match,"V11")
set.seed(123)
df<- data.frame("cod"=paste0("C",1:20), "V12"= runif(20,min=88000,max=95000))
set.seed(1)
df <- cbind(df, rbind(mat[3:10,col_to_match], matrix(sample(0:4, 120, replace=TRUE), ncol=10)) )
From the dummy data, we expect the matrix rows given by
c(3:10)[df[1:8,"V12"]>=90000] to have a match in df with V12 of at least 90000. Those are rows 3 4 5 6 7 9 10.
Run the following to check, for every row of the matrix, whether df contains a matching row whose V12 value is at least 90000.
for(i in 1:nrow(mat)){
hasMatch<-any(sapply(1:nrow(df), function(j) all( df[j,col_to_match] == mat[i, col_to_match] ) && df[j,"V12"]>=90000 ))
if(hasMatch) mat[i, "V11"]<-"A"
}
Output
> mat
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
[1,] "1" "4" "4" "4" "0" "0" "3" "3" "1" "0" "NA"
[2,] "3" "2" "3" "4" "2" "2" "0" "3" "3" "3" "NA"
[3,] "2" "3" "3" "3" "2" "3" "1" "3" "2" "1" "A"
[4,] "4" "2" "4" "3" "1" "0" "1" "0" "3" "3" "A"
[5,] "4" "0" "3" "0" "0" "2" "4" "2" "0" "1" "A"
[6,] "0" "4" "3" "2" "0" "1" "2" "1" "2" "0" "A"
[7,] "2" "1" "2" "3" "1" "0" "4" "1" "4" "3" "A"
[8,] "4" "0" "2" "1" "2" "3" "4" "3" "4" "0" "NA"
[9,] "2" "1" "1" "1" "1" "4" "3" "1" "4" "2" "A"
[10,] "2" "4" "0" "1" "4" "1" "2" "0" "0" "2" "A"
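If the nested row-by-row comparison turns out to be slow on a large matrix, one possible alternative is to build a text key per row and match the keys with `%in%`. This is only a sketch on a tiny made-up example (the three-row `mat`, the `df` values and the "_" separator are assumptions for illustration), and it relies on the values printing identically in both objects:

```r
# Tiny made-up inputs in the same shape as the question's objects
mat <- cbind(V1 = c("0", "1", "0"), V2 = c("0", "0", "1"), V11 = "NA")
df <- data.frame(V1 = c(1, 0, 1), V2 = c(0, 1, 0),
                 V12 = c(95000, 50000, 91000))
cols <- c("V1", "V2")

# One key per matrix row, e.g. "1_0"
key_mat <- apply(mat[, cols, drop = FALSE], 1, paste, collapse = "_")

# Keys only for the df rows that pass the threshold
hits <- df[df$V12 >= 90000, cols, drop = FALSE]
key_df <- do.call(paste, c(hits, sep = "_"))

# Flag every matrix row whose key appears among the qualifying df rows
mat[key_mat %in% key_df, "V11"] <- "A"
mat
```

Filtering df by the threshold first keeps the key vector small, which tends to matter when most rows fall below 90000.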
I'm not sure why I'm having difficulty finding an answer to what should be a simple problem. I have the following dataframe:
> test
Month Equity - Rank FI - Rank Alts - Rank Reweight
[1,] "2006-01-31" "1" "3" "2" "No"
[2,] "2006-02-28" "1" "3" "2" "No"
[3,] "2006-03-31" "1" "3" "2" "No"
[4,] "2006-04-28" "1" "3" "2" "No"
[5,] "2006-05-31" "2" "3" "1" "No"
[6,] "2006-06-30" "1" "3" "2" "No"
[7,] "2006-07-31" "2" "3" "1" "No"
[8,] "2006-08-31" "1" "3" "2" "No"
All I want to do is search by "Month" and return the values for that row.
I've been trying the following but keep getting no result back:
test[test$Month == '2006-03-31',]
# A tibble: 1 x 5
Month `Equity - Rank` `FI - Rank` `Alts - Rank` Reweight
<dttm> <dbl> <dbl> <dbl> <chr>
Can anyone help me figure out what I'm doing wrong? Thanks.
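The printed tibble header shows Month as <dttm> (a date-time), so comparing it with a plain string may fail depending on the session's time zone. One possible way around this is to compare calendar dates instead. The sketch below uses a small made-up data frame and assumes the date-times are stored in UTC:

```r
# Made-up stand-in for the question's data; UTC is an assumption
test <- data.frame(
  Month = as.POSIXct(c("2006-01-31", "2006-02-28", "2006-03-31"), tz = "UTC"),
  Reweight = c("No", "No", "No")
)

# Convert both sides to Date so the time-of-day and time zone drop out
wanted <- as.Date("2006-03-31")
hit <- test[as.Date(test$Month, tz = "UTC") == wanted, ]
hit
```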
Figured it out after more searching
> test[which(test2 == '2006-01-31'),]
Month Equity - Rank FI - Rank Alts - Rank Reweight
"2006-01-31" "1" "3" "2" "No"
Is it possible to generate 0/1 character matrices (like those shown below, right) from bifurcating phylogenetic trees (like those on the left)? A 1 in the matrix indicates the presence of a shared character that unites a clade.
This code generates nice random trees but I have no idea where to begin to turn the results into a character matrix.
library(ape) # Other package solutions are acceptable
forest <- rmtree(N = 2, n = 10, br = NULL)
plot(forest)
To be clear, I can use the following code to generate random matrices, and then plot the trees.
library(ape)
library(phangorn)
ntaxa <- 10
nchar <- ntaxa - 1
char_mat <- array(0, dim = c(ntaxa, ntaxa - 1))
for (i in 1:nchar) {
char_mat[,i] <- replace(char_mat[,i], seq(1, (ntaxa+1)-i), 1)
}
char_mat <- char_mat[sample.int(nrow(char_mat)), # Shuffle rows
sample.int(ncol(char_mat))] # and cols
# Ensure all branch lengths > 0
dist_mat <- dist.gene(char_mat) + 0.5
upgma_tree <- upgma(dist_mat)
plot.phylo(upgma_tree, "phylo")
What I want is to generate random trees, and then make the matrices from those trees. This solution does not make the right type of matrix.
Edit for clarity: I am generating binary character matrices that students can use to draw phylogenetic trees using simple parsimony. The 1 character represents homologies that unite taxa into clades. So, all rows must share one character (a 1 across all rows in one column) and some characters must be shared by only two taxa. (I'm discounting autapomorphies.)
Examples:
You can have a look at the rTraitDisc function in ape, which is pretty straightforward:
library(ape)
## You'll need to simulate branch length!
forest <- rmtree(N = 2, n = 10)
## Generate one character under an equal-rates (ER) model
(one_character <- rTraitDisc(forest[[1]], type = "ER", states = c(0,1)))
# t10 t7 t5 t9 t1 t4 t2 t8 t3 t6
# 0 0 0 1 0 0 0 0 0 0
# Levels: 0 1
## Generate a matrix of ten characters
(replicate(10, rTraitDisc(forest[[1]], type = "ER", states = c(0,1))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# t10 "0" "0" "0" "0" "1" "0" "0" "0" "0" "0"
# t7 "0" "0" "0" "0" "1" "0" "0" "0" "0" "0"
# t5 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
# t9 "0" "0" "1" "0" "0" "0" "0" "0" "0" "0"
# t1 "0" "0" "1" "0" "0" "0" "0" "0" "0" "0"
# t4 "0" "0" "1" "0" "0" "0" "0" "0" "0" "0"
# t2 "0" "0" "1" "0" "0" "0" "0" "0" "0" "0"
# t8 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
# t3 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
# t6 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
To apply it to multiple trees, the best approach is to wrap it in a function and call it with lapply, like so:
## Lapply wrapper function
generate.characters <- function(tree) {
return(replicate(10, rTraitDisc(tree, type = "ER", states = c(0,1))))
}
## Generate a matrix of 10 characters for each tree
lapply(forest, generate.characters)
# [[1]]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# t10 "0" "0" "0" "1" "0" "0" "0" "0" "0" "0"
# t7 "0" "0" "0" "1" "0" "0" "0" "0" "0" "0"
# t5 "0" "0" "0" "1" "0" "0" "0" "0" "0" "0"
# t9 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
# t1 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
# t4 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
# t2 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
# t8 "0" "0" "0" "1" "0" "1" "0" "0" "0" "1"
# t3 "0" "0" "0" "0" "0" "1" "0" "0" "0" "0"
# t6 "0" "0" "0" "0" "0" "1" "0" "0" "0" "0"
# [[2]]
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# t7 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
# t9 "1" "0" "0" "0" "0" "0" "0" "0" "0" "0"
# t5 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
# t2 "0" "0" "0" "0" "0" "0" "0" "0" "0" "0"
# t4 "0" "1" "0" "0" "1" "0" "0" "0" "0" "0"
# t6 "0" "1" "0" "0" "1" "0" "0" "0" "0" "0"
# t10 "0" "1" "1" "0" "1" "1" "0" "0" "0" "1"
# t8 "0" "1" "1" "0" "1" "0" "0" "0" "0" "0"
# t3 "0" "1" "0" "0" "0" "0" "0" "0" "0" "0"
# t1 "0" "1" "0" "0" "0" "0" "0" "0" "0" "0"
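Many of the simulated columns above are invariant (all "0") or have a single "1", which would not unite any clade in a parsimony exercise. A small base R sketch for keeping only the columns where at least two taxa share the "1" state is shown here, on a made-up matrix (`chars` and the threshold of two are assumptions for illustration):

```r
# Made-up character matrix: rows are taxa, columns are characters
chars <- cbind(c("0", "0", "1", "1"),
               c("0", "0", "0", "0"),
               c("1", "0", "0", "0"))
rownames(chars) <- paste0("t", 1:4)

# Count how many taxa carry state "1" in each character
n_shared <- colSums(chars == "1")

# Keep characters shared by at least two taxa
shared_chars <- chars[, n_shared >= 2, drop = FALSE]
shared_chars
```

Using drop = FALSE keeps the result a matrix even when only one column survives.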
Another option is to use sim.morpho from the dispRity package. This function reuses the rTraitDisc function but has a few more models implemented and allows the rates to be provided as distributions from which to sample. It also makes the characters look a bit more "realistic", without too much invariant data, and ensures that each generated character "looks" like a real morphological character (with the right amount of homoplasy, etc.).
library(dispRity)
## Your first tree
tree <- forest[[1]]
## Setting up the parameters
my_rates = c(rgamma, rate = 10, shape = 5)
my_substitutions = c(runif, 2, 2)
## HKY binary (15*50)
matrixHKY <- sim.morpho(tree, characters = 50, model = "HKY",
rates = my_rates, substitution = my_substitutions)
## Mk matrix (15*50) (for Mkv models)
matrixMk <- sim.morpho(tree, characters = 50, model = "ER", rates = my_rates)
## Mk invariant matrix (15*50) (for Mk models)
matrixMk <- sim.morpho(tree, characters = 50, model = "ER", rates = my_rates,
invariant = FALSE)
## MIXED model invariant matrix (15*50)
matrixMixed <- sim.morpho(tree, characters = 50, model = "MIXED",
rates = my_rates, substitution = my_substitutions, invariant = FALSE,
verbose = TRUE)
I suggest you read the sim.morpho help page for the proper references on how the models work, or the relevant section in the dispRity package manual.
I figured out how to make the matrix using Descendants from the phangorn package. I still have to tweak it with suitable node labels to match the example matrix in the original question, but the framework is there.
library(ape)
library(phangorn)
ntaxa <- 8
nchar <- ntaxa - 1
tree <- rtree(ntaxa, br = NULL)
# Gets descendants, but removes the first ntaxa elements,
# which are the individual tips
desc <- phangorn::Descendants(tree)[-seq(1, ntaxa)]
char_mat <- array(0, dim = c(ntaxa, nchar))
for (i in 1:nchar) {
# Mark every tip descended from internal node i with a 1
char_mat[desc[[i]], i] <- 1
}
rownames(char_mat) <- tree$tip.label
char_mat
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#> t6 1 1 0 0 0 0 0
#> t3 1 1 1 0 0 0 0
#> t7 1 1 1 1 0 0 0
#> t2 1 1 1 1 1 0 0
#> t5 1 1 1 1 1 0 0
#> t1 1 0 0 0 0 1 1
#> t8 1 0 0 0 0 1 1
#> t4 1 0 0 0 0 1 0
plot(tree)
Created on 2019-01-28 by the reprex package (v0.2.1)
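A similar matrix can probably be built with ape alone, since `prop.part()` lists the tips below each internal node. The following is a sketch under that assumption, not a tested replacement for the phangorn approach above; the seed and the object names are arbitrary:

```r
library(ape)

set.seed(42)                      # arbitrary seed, for repeatability only
tree <- rtree(8, br = NULL)

# One element per internal node: the indices of the tips in that clade
clades <- prop.part(tree)
labels <- attr(clades, "labels")

# One column per clade, 1 where the tip belongs to it
char_mat <- sapply(unclass(clades), function(tips) {
  as.integer(seq_along(labels) %in% tips)
})
rownames(char_mat) <- labels

# The root clade should give one column of all 1s
colSums(char_mat)
char_mat
```

Checking the column sums is a quick way to confirm the requirement from the question that one character is shared by every taxon.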
I have a dataframe like below sample:
A B C
[1,] "A1" "B3" "C1"
[2,] "A2" "B1" "C2"
[3,] "A3" "B3" "C3"
[4,] "A1" "B2" "C3"
[5,] "A3" "B3" "C2"
[6,] "A1" "B1" "C1"
I would like to reshape it like this: expand every unique value of the variables into its own column and mark 1/0 in the value field. The data frame above should be reshaped to this:
A B1 B2 B3 C1 C2 C3
[1,] "A1" "0" "0" "1" "1" "0" "0"
[2,] "A2" "1" "0" "0" "0" "1" "0"
[3,] "A3" "0" "0" "1" "0" "0" "1"
[4,] "A1" "0" "1" "0" "0" "0" "1"
[5,] "A3" "0" "0" "1" "0" "1" "0"
[6,] "A1" "1" "0" "0" "1" "0" "0"
The real data is huge (>100 thousand rows per day, with many more fields and unique values), so I need an efficient approach rather than a for loop.
I hope you can help. I am a beginner and only know for loops. :(
You can try this too (with base R):
# B and C must be factors for contrasts() to work
df <- cbind(as.character(df$A), model.matrix(~B+C+0,df,list(B=contrasts(df$B, contrasts=FALSE),
C=contrasts(df$C, contrasts=FALSE))))
dimnames(df) <- list(NULL, c('A', paste0('B',1:3), paste0('C',1:3)))
df
# A B1 B2 B3 C1 C2 C3
#[1,] "A1" "0" "0" "1" "1" "0" "0"
#[2,] "A2" "1" "0" "0" "0" "1" "0"
#[3,] "A3" "0" "0" "1" "0" "0" "1"
#[4,] "A1" "0" "1" "0" "0" "0" "1"
#[5,] "A3" "0" "0" "1" "0" "1" "0"
#[6,] "A1" "1" "0" "0" "1" "0" "0"
We can use
library(qdapTools)
cbind(df1[1], mtabulate(as.data.frame(t(df1[-1]))))
# A B3 C1 B1 C2 C3 B2
#V1 A1 1 1 0 0 0 0
#V2 A2 0 0 1 1 0 0
#V3 A3 1 0 0 0 1 0
#V4 A1 0 0 0 0 1 1
#V5 A3 1 0 0 1 0 0
#V6 A1 0 1 1 0 0 0
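For comparison, here is a base R sketch that avoids both the explicit loop over rows and the hard-coded column names. It assumes B and C are factors; `one_hot` is a hypothetical helper written for this example, not a function from any package:

```r
# The sample data from the question, with columns stored as factors
df <- data.frame(A = c("A1", "A2", "A3", "A1", "A3", "A1"),
                 B = c("B3", "B1", "B3", "B2", "B3", "B1"),
                 C = c("C1", "C2", "C3", "C3", "C2", "C1"),
                 stringsAsFactors = TRUE)

# Hypothetical helper: one 0/1 column per level of a factor
one_hot <- function(f) {
  m <- outer(as.character(f), levels(f), "==") * 1L
  colnames(m) <- levels(f)
  m
}

result <- data.frame(A = df$A, one_hot(df$B), one_hot(df$C))
result
```

Because the columns come from `levels()`, the same helper works when the real data has more unique values than the sample.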
I'm trying to apply a loop on a matrix (matrixExample, as described below) to retrieve Columns V1 to V6 on 3 rows each time.
matrixExample
ID CHR SEGNUM GENOSEG V1 V2 V3 V4 V5 V6
[1,] "CHAR8" "1" "1" "102505" "1" "0" "2" "5" "0" "5"
[2,] "LIMO9" "1" "1" "012505" "0" "1" "2" "5" "0" "5"
[3,] "SIM10" "1" "1" "122505" "1" "2" "2" "5" "0" "5"
[4,] "CHAR8" "1" "2" "111520" "1" "1" "1" "5" "2" "0"
[5,] "LIMO9" "1" "2" "221520" "2" "2" "1" "5" "2" "0"
[6,] "SIM10" "1" "2" "222520" "2" "2" "2" "5" "2" "0"
[7,] "CHAR8" "1" "3" "501111" "5" "0" "1" "1" "1" "1"
[8,] "LIMO9" "1" "3" "501100" "5" "0" "1" "1" "0" "0"
[9,] "SIM10" "1" "3" "502011" "5" "0" "2" "0" "1" "1"
[10,] "CHAR8" "2" "1" "102505" "1" "0" "2" "5" "0" "5"
[11,] "LIMO9" "2" "1" "012505" "0" "1" "2" "5" "0" "5"
[12,] "SIM10" "2" "1" "122505" "1" "2" "2" "5" "0" "5"
[13,] "CHAR8" "2" "2" "111520" "1" "1" "1" "5" "2" "0"
[14,] "LIMO9" "2" "2" "221520" "2" "2" "1" "5" "2" "0"
[15,] "SIM10" "2" "2" "222520" "2" "2" "2" "5" "2" "0"
[16,] "CHAR8" "2" "3" "501111" "5" "0" "1" "1" "1" "1"
[17,] "LIMO9" "2" "3" "501100" "5" "0" "1" "1" "0" "0"
[18,] "SIM10" "2" "3" "502011" "5" "0" "2" "0" "1" "1"
As an example, from the first 3 rows, I would like to do some matrix calculations using the submatrix:
"1" "0" "2" "5" "0" "5"
"0" "1" "2" "5" "0" "5"
"1" "2" "2" "5" "0" "5"
After this calculation, I need to go to rows 4 to 6 ...
I tried this code:
for(i in seq(1, dim(exampleDoubleSort)[1], 3))
{
print(matrixExample[(i:i+2),c(4:10)]) # using print only as an example
# do some matrix calculations using the subset matrix
}
It only prints the elements of one row rather than the block of 3 rows.
If I try outside the loop, I obtain the expected result.
print(matrixExample[(1:3),c(5:10)])
V1 V2 V3 V4 V5 V6
[1,] "1" "0" "2" "5" "0" "5"
[2,] "0" "1" "2" "5" "0" "5"
[3,] "1" "2" "2" "5" "0" "5"
Could you please give me some idea of how to read 3 rows at a time and retrieve a matrix subset for further calculations?
If I have 30 rows, I need to retrieve 10 submatrices and perform 10 calculations. The calculation will be implemented in a function, for example.
Thanks in advance!
Cheers!
RV
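The single-row output is consistent with operator precedence: in R the `:` operator binds more tightly than `+`, so `i:i+2` is read as `(i:i)+2`. A short sketch on a made-up numeric matrix (`m` and `blocks` are names chosen for this example) shows the difference and one way to collect the 3-row blocks:

```r
i <- 1
i:i + 2      # parsed as (i:i) + 2, i.e. the single value 3
i:(i + 2)    # the intended range 1 2 3

# Made-up 6 x 3 matrix standing in for columns V1..V3
m <- matrix(1:18, nrow = 6, dimnames = list(NULL, c("V1", "V2", "V3")))

# One submatrix per block of 3 consecutive rows
blocks <- lapply(seq(1, nrow(m), by = 3), function(i) {
  m[i:(i + 2), , drop = FALSE]
})
length(blocks)
```

Each element of `blocks` can then be passed to whatever calculation function is needed, for example with another `lapply` call.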
I am attempting to format a column of data into many binary columns to eventually use for association rule mining. I have had some success using a for loop and a simple triplet matrix, but I am unsure how to aggregate by the levels in the first column thereafter--similar to a group by statement in SQL. I have provided an example below, albeit with a much smaller data set--if successful my actual data set will be 4,200 rows by 3,902 columns, so any solution needs to be scalable. Any suggestions or alternative approaches would be greatly appreciated!
> data <- data.frame(a=c('sally','george','andy','sue','sue','sally','george'), b=c('green','yellow','green','yellow','purple','brown','purple'))
> data
a b
1 sally green
2 george yellow
3 andy green
4 sue yellow
5 sue purple
6 sally brown
7 george purple
library(slam) # provides simple_triplet_matrix
x <- data[,1]
for(i in as.numeric(2:ncol(data)))
x <- cbind(x, simple_triplet_matrix(i=1:nrow(data), j=as.numeric(data[,i]),
v = rep(1,nrow(data)), dimnames = list(NULL, levels(data[,i]))) )
##Looks like this:
> as.matrix(x)
name brown green purple yellow
[1,] "sally" "0" "1" "0" "0"
[2,] "george" "0" "0" "0" "1"
[3,] "andy" "0" "1" "0" "0"
[4,] "sue" "0" "0" "0" "1"
[5,] "sue" "0" "0" "1" "0"
[6,] "sally" "1" "0" "0" "0" ##Need to aggregate by Name
##Would like it to look like this:
name brown green purple yellow
[1,] "sally" "1" "1" "0" "0"
[2,] "george" "0" "0" "0" "1"
[3,] "andy" "0" "1" "0" "0"
[4,] "sue" "0" "0" "1" "1"
This should do the trick:
## Get a contingency table of counts
X <- with(data, table(a,b))
## Massage it into the format you're wanting
cbind(name = rownames(X), apply(X, 2, as.character))
# name brown green purple yellow
# [1,] "andy" "0" "1" "0" "0"
# [2,] "george" "0" "0" "1" "1"
# [3,] "sally" "1" "1" "0" "0"
# [4,] "sue" "0" "0" "1" "1"
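Note that `table()` returns counts, so a name/colour pair that occurs more than once would show a value above 1. If strictly binary columns are needed for the rule mining step, one option is to threshold the counts. This is a sketch using the sample data from the question; `binary` is a name chosen here:

```r
data <- data.frame(
  a = c("sally", "george", "andy", "sue", "sue", "sally", "george"),
  b = c("green", "yellow", "green", "yellow", "purple", "brown", "purple")
)

# Contingency table of counts, one row per name
X <- table(data$a, data$b)

# Turn counts into presence/absence (0/1) as a plain integer matrix
binary <- 1L * (unclass(X) > 0)
binary
```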