This answer: https://stackoverflow.com/a/11503439/651779 shows how to shuffle a dataframe row-wise and column-wise. I am interested in shuffling column-wise. From the original dataframe
> df1
a b c
1 1 1 0
2 1 0 0
3 0 1 0
4 0 0 0
shuffling column-wise gives
> df3=df1[,sample(ncol(df1))]
> df3
c a b
1 0 1 1
2 0 1 0
3 0 0 1
4 0 0 0
However, I would like to shuffle the values of each row across its columns, independently of the other rows, instead of shuffling complete columns, so that you could get something like
>df4
c a b
1 0 1 1
2 1 0 0
3 1 0 0
4 0 0 0
Currently I do that by looping over each row, shuffling the row, and putting it into a dataframe. Is there an easier way to do this?
Something like:
t(apply(df1, 1, function(x) { sample(x, length(x)) } ))
This will give you the result in matrix form. If you have factors, or a mix of numeric and character columns, be aware that this will coerce everything to character.
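A minimal sketch of the same idea on the question's `df1`, converted back to a data frame afterwards. The seed and the name `shuffled` are only illustrative, and it assumes all columns are numeric so nothing gets coerced:

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

df1 <- data.frame(a = c(1, 1, 0, 0),
                  b = c(1, 0, 1, 0),
                  c = c(0, 0, 0, 0))

# Shuffle the values within each row independently. apply() returns the
# shuffled rows as columns, hence the t().
shuffled <- t(apply(df1, 1, sample))

# Back to a data frame. The column names are set explicitly because any
# names apply() returns follow a shuffled order and carry no meaning.
df4 <- as.data.frame(shuffled)
names(df4) <- names(df1)
```

Each row of `df4` contains the same values as the corresponding row of `df1`, so the row totals are unchanged while the column totals generally are not.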
Related
I've got a dataset with event data in the below format
> order_df
# A tibble: 10 x 4
H M B FB
<int> <dbl> <dbl> <dbl>
1 1 1 0 0
2 1 1 0 0
3 1 0 1 0
4 1 0 0 1
5 0 1 0 0
6 0 0 0 0
7 0 1 1 1
8 0 0 1 0
9 0 1 0 0
10 0 0 0 0
I'd like to show it as a matrix of pairs, which I can achieve with the code below
> order_matrix = as.matrix(order_df)
> pair_matrix <- crossprod(order_matrix)
> pair_matrix
H M B FB
H 4 2 1 1
M 2 5 1 1
B 1 1 3 1
FB 1 1 1 2
However, the diagonal pairs (e.g. M:M) count all rows from the original dataframe that have a 1 in that column, but I'd like that value to count only the rows where that column is the ONLY one containing a 1.
In the example above I'd like the H:H pair to be 0, as all rows with H also included another column. The M:M pair would be 1, as only 1 row included only M.
I'm a little confused about the output here, since if the only rows that are counted are rows with a single 1 in them, then the resulting matrix will only have entries on the diagonal. In other words, it would be better to return a vector than the matrix.
You also say in your question that M should be 1, since it only appears on its own once. It actually appears on its own twice (row 5 and row 9).
You can get the result you need by keeping only the rows whose row sum is exactly one, then taking the column sums:
colSums(as.matrix(order_df[rowSums(order_df) == 1,]))
#> H M B FB
#> 0 2 1 0
and if you check carefully, this is correct.
If you really want the result in a matrix, just keep the rows with exactly one value and take the cross product of that:
crossprod(as.matrix(order_df[rowSums(order_df) == 1,]))
#> H M B FB
#> H 0 0 0 0
#> M 0 2 0 0
#> B 0 0 1 0
#> FB 0 0 0 0
Given a vector:
x = c(1,0,1,0)
its elements can be arranged in the following ways:
> m
row1: 1 1 0 0
row2: 1 0 1 0 # <- identical
row3: 0 1 1 0
row4: 1 0 0 1
row5: 0 1 0 1
row6: 0 0 1 1
I would like to calculate, for each row of m, how many deviations (changes) would be needed to turn that row back into the original vector x, i.e. a function with this signature and output:
result <- function(m,x)
> result
Var1 Var2 Var3 Var4 changes_from_x
1 1 0 0 1
1 0 1 0 0
0 1 1 0 1
1 0 0 1 1
0 1 0 1 2
0 0 1 1 1
This is different from just checking whether two vectors are the same ("Compare two vector in R", "Check whether two vectors contain the same (unordered) elements in R") or simply stating whether the order is wrong ("Test Match and Order between two vectors in R"), since the method should count how many deviations have occurred between each row of the matrix and the original vector.
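One possible reading of the table above, offered as a sketch rather than a definitive answer: if a "change" is taken to mean swapping one 1 with one 0 (an assumption inferred from the expected output, where a row differing from x in two positions counts as 1 change), then the count is half the number of mismatching positions. The names `mismatches` and `changes_from_x` are illustrative:

```r
x <- c(1, 0, 1, 0)
m <- rbind(c(1, 1, 0, 0),
           c(1, 0, 1, 0),
           c(0, 1, 1, 0),
           c(1, 0, 0, 1),
           c(0, 1, 0, 1),
           c(0, 0, 1, 1))

# Number of positions in which each row differs from x.
# sweep() compares every row of m against x, column by column.
mismatches <- rowSums(sweep(m, 2, x, FUN = "!="))

# Assumption: one change is one swap of a 1 and a 0, fixing two positions.
changes_from_x <- mismatches / 2

result <- data.frame(m, changes_from_x)
```

For the six rows shown, `changes_from_x` comes out as 1, 0, 1, 1, 2, 1, matching the expected column in the question. This only holds when every row is a permutation of x, so the number of mismatches is always even.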
This link answers a part of my question: How to randomize (or permute) a dataframe rowwise and columnwise?.
> df1
a b c
1 1 1 0
2 1 0 0
3 0 1 0
4 0 0 0
A column-wise shuffle gives me the output df3 below, which reorders the columns
> df3 <- df1[,sample(ncol(df1))]
> df3
c a b
1 0 1 1
2 0 1 0
3 0 0 1
4 0 0 0
What I want is for the column names to change instead. The row-wise and column-wise totals should remain the same; only the column names get reassigned. Something like df4. How can I achieve this?
> df4
c a b
1 1 1 0
2 1 0 0
3 0 1 0
4 0 0 0
PS: How do I keep the df in its rows-by-columns shape? When I post the question, the formatting collapses.
You might want to just sample the column-names. Something like:
names(df1) <- names(df1)[sample(ncol(df1))]
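A small sketch of what this does on the question's `df1`, working on a copy so the original is kept for comparison. The seed is arbitrary, and `sample(names(...))` is used as an equivalent shorthand for indexing the names with a shuffled column index:

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible

df1 <- data.frame(a = c(1, 1, 0, 0),
                  b = c(1, 0, 1, 0),
                  c = c(0, 0, 0, 0))

# Copy the data and reassign the column names in random order.
# The values themselves stay exactly where they were.
df4 <- df1
names(df4) <- sample(names(df4))
```

Because only the labels move, the row and column totals of `df4` are identical to those of `df1`; a given shuffle may also happen to leave the names in their original order.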
I am trying to make a contingency (frequency) table using table() in R for two integer variables, but the default option in table does not include all the values in the range for each. For example:
a=c(1,2,3,5)
b=c(1,1,2,3)
table(a,b)
returns:
1 2 3
1 1 0 0
2 1 0 0
3 0 1 0
5 0 0 1
I would like it to give:
1 2 3
1 1 0 0
2 1 0 0
3 0 1 0
4 0 0 0
5 0 0 1
This is a simple example where the value '4' isn't in one of the vectors. I know I could manipulate it into an array and add in a row of zeros, but I'm wondering if there's a simpler way to do this automatically when the variables might span hundreds of (sparse) integer values.
One way to get this is to convert both variables to factors whose levels span the full range:
a=factor(c(1,2,3,5), levels=1:5)
b=factor(c(1,1,2,3), levels=1:3)
table(a,b)
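For variables spanning many sparse integer values, the levels need not be typed by hand. A sketch, assuming the range of each variable should run from its observed minimum to its observed maximum (the names `fa`, `fb` and `tab` are illustrative):

```r
a <- c(1, 2, 3, 5)
b <- c(1, 1, 2, 3)

# Build the levels from the observed range of each variable, so values
# missing in between (here 4 in a) still get a row or column of zeros.
fa <- factor(a, levels = min(a):max(a))
fb <- factor(b, levels = min(b):max(b))

tab <- table(a = fa, b = fb)
tab
```

This gives a 5 x 3 table in which the row for 4 is all zeros. If both variables should share one common range, the same `levels` vector can be passed to both calls instead.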
I have a 39-column data frame (with upward of 100000 rows) whose last eleven columns look like this (the rest of the columns do not concern my question)
H3K27me3_gross_bin H3K4me3_gross_bin H3K4me1_gross_bin UtoP UtoM UPU UPP UPM UMU UMP UMM
cg00000029 3 3 6 1 1 0 0 0 0 0 0
cg00000321 6 1 5 1 0 0 1 0 0 0 0
cg00000363 6 1 1 1 0 1 0 0 0 0 0
cg00000622 1 2 1 0 0 0 0 0 0 0 0
cg00000714 2 5 6 1 0 0 0 0 0 0 0
cg00000734 2 6 2 0 0 0 0 0 0 0 0
I want to create a matrix that will:
a) count the number of rows in which the value of columns UPU, UPP or UPM is 1, grouped by each of the first three columns (H3K27me3_gross_bin, H3K4me3_gross_bin, H3K4me1_gross_bin)
b) sum these counts across the columns UPU, UPP, UPM for each of the first three columns
I came up with this incredibly cumbersome way of doing this:
UtoPFrac <- seq(6)
UtoPTotEvents <- seq(6)
for (j in 1:3) {
  y <- df[, 28 + j]  # one of the three *_gross_bin columns
  for (i in 1:3) {
    # per bin level, count the rows with a 1 in column 33 + i (UPU, UPP, UPM)
    UtoPFrac <- cbind(UtoPFrac, tapply(df[which(!is.na(y)), 33 + i], y[which(!is.na(y))], function(x) length(which(x == 1))))
  }
}
UtoPFrac <- UtoPFrac[, 2:10]
UtoPEvents <- cbind(rowSums(UtoPFrac[, 1:3]), rowSums(UtoPFrac[, 4:6]), rowSums(UtoPFrac[, 7:9]))
I am certain there is a more elegant way of doing this, probably by using aggregate() or ddply(), but I was unable to get this working.
I would appreciate any help doing this more efficiently.
Thanks in advance
Not tested:
library(plyr)
ddply(df, .(H3K27me3_gross_bin, H3K4me3_gross_bin, H3K4me1_gross_bin), summarize,
      UPUl = length(UPU[which(UPU == 1)]), UPPl = length(UPP[which(UPP == 1)]),
      UPMl = length(UPM[which(UPM == 1)]), mysum = sum(UPU + UPP + UPM))
P.S. If you dput the data and provide the expected output, I will test the above code
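Note that the ddply() call groups by every combination of the three bin columns, whereas the loop in the question groups by each bin column separately. A base R sketch of the latter reading, using aggregate() on the six example rows shown in the question. Only the columns needed here are reproduced, and the names `bins`, `events` and `counts` are illustrative:

```r
# The six example rows from the question (relevant columns only).
df <- data.frame(
  H3K27me3_gross_bin = c(3, 6, 6, 1, 2, 2),
  H3K4me3_gross_bin  = c(3, 1, 1, 2, 5, 6),
  H3K4me1_gross_bin  = c(6, 5, 1, 1, 6, 2),
  UPU = c(0, 0, 1, 0, 0, 0),
  UPP = c(0, 1, 0, 0, 0, 0),
  UPM = c(0, 0, 0, 0, 0, 0)
)

bins   <- c("H3K27me3_gross_bin", "H3K4me3_gross_bin", "H3K4me1_gross_bin")
events <- c("UPU", "UPP", "UPM")

# For each bin column separately: count the rows with a 1 in each event
# column per bin level, then add a total across the three event columns.
# aggregate() omits rows whose grouping value is NA.
counts <- lapply(setNames(bins, bins), function(b) {
  out <- aggregate(df[events] == 1, by = list(bin = df[[b]]), FUN = sum)
  out$total <- rowSums(out[events])
  out
})

counts$H3K27me3_gross_bin
```

The result is a list with one small data frame per bin column, which can be combined with cbind() or rbind() if a single matrix is needed.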