R: Shuffle dataframe columnwise

This link answers a part of my question: How to randomize (or permute) a dataframe rowwise and columnwise?.
> df1
a b c
1 1 1 0
2 1 0 0
3 0 1 0
4 0 0 0
A column-wise shuffle gives me the output df3 below, which reorders the columns together with their data:
> df3 <- df1[,sample(ncol(df1))]
> df3
c a b
1 0 1 1
2 0 1 0
3 0 0 1
4 0 0 0
What I want is for only the column names to be shuffled while the data stay in place. The row-wise and column-wise totals remain the same; just the column names get reassigned. Something like df4. How can I achieve this?
> df4
c a b
1 1 1 0
2 1 0 0
3 0 1 0
4 0 0 0
PS: How do I keep the df in its row-by-column shape? When I post the question, the formatting collapses.

You might want to just sample the column names. Something like:
df4 <- df1
names(df4) <- names(df4)[sample(ncol(df4))]
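For reference, a minimal self-contained sketch of this idea, assuming the example df1 from the question. The fixed seed and the same_values check are only there for illustration and are not part of the original answer.

```r
# Rebuild the example data from the question
df1 <- data.frame(a = c(1, 1, 0, 0),
                  b = c(1, 0, 1, 0),
                  c = c(0, 0, 0, 0))

set.seed(1)  # assumption: fixed seed only so the sketch is reproducible

# Copy the data, then permute only the column labels
df4 <- df1
names(df4) <- sample(names(df4))

# The cell values are untouched; only the labels may have moved
same_values <- identical(unname(as.matrix(df4)), unname(as.matrix(df1)))
same_values
```

Because only the labels are permuted, the row totals and the position-wise column totals of df4 are the same as those of df1.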

Related

Match combinations of row values between 2 different data frames

I have a data.frame with 16 different combinations of 4 different cell markers
combinations_df
FITC Cy3 TX_RED Cy5
a 0 0 0 0
b 1 0 0 0
c 0 1 0 0
d 1 1 0 0
e 0 0 1 0
f 1 0 1 0
g 0 1 1 0
h 1 1 1 0
i 0 0 0 1
j 1 0 0 1
k 0 1 0 1
l 1 1 0 1
m 0 0 1 1
n 1 0 1 1
o 0 1 1 1
p 1 1 1 1
I have my "main" data.frame with 10 columns and thousands of rows.
> main_df
a b FITC d Cy3 f TX_RED h Cy5 j
1 0 1 1 1 1 0 1 1 1 1
2 0 1 0 1 1 0 1 0 1 1
3 1 1 0 0 0 1 1 0 0 0
4 0 1 1 1 1 0 1 1 1 1
5 0 0 0 0 0 0 0 0 0 0
....
I want to use all the possible 16 combinations from combinations_df to compare with each row of main_df. Then I want to create a new vector to later cbind to main_df as column 11.
sample output
> phenotype
[1] "g" "i" "a" "p" "g"
I thought about doing a while loop within a for loop checking each combinations_df row through each main_df row.
Sounds like it could work, but I have close to 1 000 000 rows in main_df, so I wanted to see if anybody had a better idea.
EDIT: I forgot to mention that I want to compare combinations_df only to columns 3,5,7,9 from main_df. They have the same name, but it might not be that obvious.
EDIT: Changing the sample output, since no "t" should be present.
The dplyr solution is outrageously simple. First you need to put phenotype in combinations_df as an explicit variable like this:
# phenotype FITC Cy3 TX_RED Cy5
#1 a 0 0 0 0
#2 b 1 0 0 0
#3 c 0 1 0 0
#4 d 1 1 0 0
# etc
dplyr lets you join on multiple variables, so from here it's a one-liner to look up the phenotypes.
library(dplyr)
left_join(main_df, combinations_df, by=c("FITC", "Cy3", "TX_RED", "Cy5"))
# a b FITC d Cy3 f TX_RED h Cy5 j phenotype
#1 0 1 1 1 1 0 1 1 1 1 p
#2 0 1 0 1 1 0 1 0 1 1 o
#3 1 1 0 0 0 1 1 0 0 0 e
#4 0 1 1 1 1 0 1 1 1 1 p
#5 0 0 0 0 0 0 0 0 0 0 a
I originally thought you'd have to concatenate columns with tidyr::unite but this was not the case.
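As a rough, self-contained sketch of the preparation step described above: the lookup table can be rebuilt with expand.grid, with the phenotype stored as an ordinary column. The cut-down main_df below (column a plus the four marker columns from the first five rows of the question) is a hypothetical stand-in, and the join only runs if dplyr happens to be installed.

```r
# All 16 combinations of the four markers. expand.grid varies the first
# column fastest, which matches the a..p order shown in the question.
combinations_df <- expand.grid(FITC = 0:1, Cy3 = 0:1, TX_RED = 0:1, Cy5 = 0:1)
combinations_df$phenotype <- letters[1:16]
# If the table already exists with phenotypes as row names, this would be
# the equivalent step: combinations_df$phenotype <- rownames(combinations_df)

# Hypothetical cut-down main_df: first five rows from the question
main_df <- data.frame(a      = c(0L, 0L, 1L, 0L, 0L),
                      FITC   = c(1L, 0L, 0L, 1L, 0L),
                      Cy3    = c(1L, 1L, 0L, 1L, 0L),
                      TX_RED = c(1L, 1L, 1L, 1L, 0L),
                      Cy5    = c(1L, 1L, 0L, 1L, 0L))

# The join itself, guarded so the sketch still runs without dplyr
if (requireNamespace("dplyr", quietly = TRUE)) {
  joined <- dplyr::left_join(main_df, combinations_df,
                             by = c("FITC", "Cy3", "TX_RED", "Cy5"))
  print(joined$phenotype)
}
```

Rows of main_df whose marker pattern is missing from combinations_df would get NA as phenotype, which is a convenient way to spot unexpected values.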
It's not very elegant, but this method works just fine. There are no nested loops here, so it should run reasonably fast. You might try to match on the data frame rows directly and do away with the loops altogether, but this was the fastest way I could figure it out. You might also look at the packages plyr or data.table; they are very powerful for this kind of thing.
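# one text key per row of main_df, built from the marker columns 3, 5, 7, 9
main_text <- character(nrow(main_df))
for(i in seq_len(nrow(main_df))){
main_text[i]<-paste(main_df[i,3],main_df[i,5],main_df[i,7],main_df[i,9],sep="")
}
# one text key per row of combinations_df
comb_text <- character(nrow(combinations_df))
for(i in seq_len(nrow(combinations_df))){
comb_text[i]<-paste(combinations_df[i,1],combinations_df[i,2],combinations_df[i,3],combinations_df[i,4],sep="")
}
# look up each main_df key and return the matching row name (the phenotype)
rownames(combinations_df)[match(main_text,comb_text)]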
How about something like this? My results differ from yours because there is no "t" in combination_df. You could do it without assigning a new column if you wanted; the key column is mainly for illustrative purposes.
combination_df <- read.table("Documents/comb.txt.txt", header=T)
main_df <- read.table("Documents/main.txt", header=T)
main_df
combination_df
main_df$key <- do.call(paste0, main_df[,c(3,5,7,9)])
combination_df$key <- do.call(paste0, combination_df)
rownames(combination_df)[match(main_df$key, combination_df$key)]
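A different technique that avoids string keys altogether, offered only as a sketch: if the markers are strictly 0/1 and the combinations are labelled a to p in the binary counting order shown in the question, the phenotype can be computed arithmetically. The cut-down main_df below is a hypothetical stand-in holding the marker columns of the first five rows from the question.

```r
# Hypothetical cut-down main_df: marker columns of the first five rows
main_df <- data.frame(FITC   = c(1, 0, 0, 1, 0),
                      Cy3    = c(1, 1, 0, 1, 0),
                      TX_RED = c(1, 1, 1, 1, 0),
                      Cy5    = c(1, 1, 0, 1, 0))

# Treat the four markers as bits of a number from 0 to 15, then shift to 1..16
# (assumption: FITC is the lowest bit, as in the a..p table of the question)
idx <- 1 + main_df$FITC + 2 * main_df$Cy3 + 4 * main_df$TX_RED + 8 * main_df$Cy5

# Index into the phenotype labels a..p
phenotype <- letters[idx]
phenotype
```

This is fully vectorised, so it should scale to a million rows without trouble, but it silently depends on the ordering assumption above; the key-and-match approach does not.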

A Pivot table in r with binary output [duplicate]

I have the following dataset
#dataset
id attributes value
1 a,b,c 1
2 c,d 0
3 b,e 1
I want to make a pivot table out of it with binary values for the attributes: 1 if the attribute is present for that id, 0 otherwise. My ideal output would be the following:
#output
id a b c d e Value
1 1 1 1 0 0 1
2 0 0 1 1 0 0
3 0 1 0 0 1 1
Any tip is really appreciated.
We split the 'attributes' column by ',', get the frequency with mtabulate from qdapTools and cbind with the first and third column.
library(qdapTools)
cbind(df1[1], mtabulate(strsplit(df1$attributes, ",")), df1[3])
# id a b c d e value
#1 1 1 1 1 0 0 1
#2 2 0 0 1 1 0 0
#3 3 0 1 0 0 1 1
With base R:
attributes <- sort(unique(unlist(strsplit(as.character(df$attributes), split=','))))
cols <- as.data.frame(matrix(rep(0, nrow(df)*length(attributes)), ncol=length(attributes)))
names(cols) <- attributes
df <- cbind.data.frame(df, cols)
df <- as.data.frame(t(apply(df, 1, function(x){attributes <- strsplit(x['attributes'], split=','); x[unlist(attributes)] <- 1;x})))[c('id', attributes, 'value')]
df
id a b c d e value
1 1 1 1 1 0 0 1
2 2 0 0 1 1 0 0
3 3 0 1 0 0 1 1
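A somewhat shorter base R variant, as a sketch only: split the strings once, then test each row's attributes against the full set with %in%. The data frame is rebuilt from the question; the names parts, lev, ind and result are hypothetical.

```r
# Example data from the question
df <- data.frame(id = 1:3,
                 attributes = c("a,b,c", "c,d", "b,e"),
                 value = c(1, 0, 1),
                 stringsAsFactors = FALSE)

# Split each attribute string into its pieces
parts <- strsplit(as.character(df$attributes), ",", fixed = TRUE)

# All distinct attributes, in sorted order
lev <- sort(unique(unlist(parts)))

# One 0/1 indicator row per id (assumes at least two distinct attributes)
ind <- t(vapply(parts, function(p) as.integer(lev %in% p), integer(length(lev))))
colnames(ind) <- lev

# Put id first, the indicators in the middle, value last
result <- cbind(df["id"], ind, df["value"])
result
```

Unlike the apply-based version, this keeps id and value in their original types, since the data frame is never passed through a character matrix.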

Selecting specific columns from dataset

I have a dataset which looks like this:
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
3 1 1 0 0 1 1 0
6 3 0 1 0 1 0 1
2 3 1 0 0 1 1 0
10 5 0 1 1 0 1 0
0 0 1 0 1 0 0 1
I want a new data frame (df) which contains only the columns whose names end with .1 (1.1, 2.1 and so on), i.e.
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
0 1 0
1 1 1
0 1 0
1 0 0
0 0 1
I have shown only a few columns here, but the actual dataset contains more than 100. Therefore, kindly provide a solution that works for however many columns the dataset has.
Thanks in advance.
I guess the pattern is that the column name ends in ".1"; you may need to adapt the pattern if that is not the case.
The data I am using:
original_data
A B X50_TT_1.0 X50_TT_1.1 X60_DD_2.0 X60_DD_2.1 X100_L2V_7.0 X100_L2V_7.1
1 3 1 1 0 0 1 1 0
This matches every name ending in any single character followed by "1", because an unescaped "." in a regular expression matches any character:
df <- original_data[which(grepl(".1$", names(original_data)))]
To match only names ending in a literal ".1", escape the dot:
df <- original_data[which(grepl("\\.1$", names(original_data)))]
For original_data both gave me the same result:
df
X50_TT_1.1 X60_DD_2.1 X100_L2V_7.1
1 0 1 0
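To see where the two patterns differ, here is a small sketch with one extra, hypothetical column X70_QQ_11 that is not in the question. It ends in "11", so the unescaped pattern picks it up and the escaped one does not. endsWith() from base R is shown as an alternative that avoids regular expressions entirely.

```r
# One-row example; X70_QQ_11 is a made-up column to expose the difference
original_data <- data.frame(A = 3, X50_TT_1.0 = 1, X50_TT_1.1 = 0,
                            X70_QQ_11 = 5, check.names = FALSE)
nm <- names(original_data)

# Unescaped dot: any character followed by 1 at the end of the name
loose <- nm[grepl(".1$", nm)]

# Escaped dot: a literal ".1" at the end of the name
strict <- nm[grepl("\\.1$", nm)]

# Same selection without a regular expression
strict2 <- nm[endsWith(nm, ".1")]

# Subset the data frame by the selected names
df <- original_data[strict]
```

Here loose contains both X50_TT_1.1 and X70_QQ_11, while strict and strict2 contain only X50_TT_1.1.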

How to shuffle a dataframe column wise, but independent of rows?

In this answer: https://stackoverflow.com/a/11503439/651779 it is shown how to shuffle a dataframe row- and column wise. I am interested in shuffling column wise. From the original dataframe
> df1
a b c
1 1 1 0
2 1 0 0
3 0 1 0
4 0 0 0
shuffling column wise
> df3=df1[,sample(ncol(df1))]
> df3
c a b
1 0 1 1
2 0 1 0
3 0 0 1
4 0 0 0
However, I would like to shuffle each row on its columns independent of the other rows, instead of shuffling the complete columns, so that you could get something like
>df4
c a b
1 0 1 1
2 1 0 0
3 1 0 0
4 0 0 0
Now I do that by looping over each row, shuffling the row, and putting it in a dataframe. Is there an easy way to do this?
Something like:
t(apply(df1, 1, function(x) { sample(x, length(x)) } ))
This will give you the result in matrix form. If you have factors, a mix of numeric and characters etc, be aware that this will coerce everything to character.
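If a data frame is wanted rather than a matrix, the result can be converted back, as in this sketch built on the example df1 from the question. The seed is only for reproducibility. Resetting the names matters because apply labels the result with the shuffled names of the first row, which say nothing about the other rows.

```r
# Example data from the question
df1 <- data.frame(a = c(1, 1, 0, 0),
                  b = c(1, 0, 1, 0),
                  c = c(0, 0, 0, 0))

set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

# Shuffle the values within each row, independently of the other rows;
# apply returns the rows as columns, hence the transpose
shuffled <- as.data.frame(t(apply(df1, 1, function(x) sample(x, length(x)))))

# Restore the original column labels (they no longer track the values)
names(shuffled) <- names(df1)
shuffled
```

Row totals are preserved by construction, whereas column totals generally are not. The sketch assumes at least two columns, since sample() treats a single number differently from a vector.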

Binning and Naming New Columns with Mean of Binned Columns

This probably has been asked already, but I could not find it. I have a data set, where column names are numbers, and row names are sample names (see below).
"599.773" "599.781" "599.789" "599.797" "599.804" "599.812" "599.82" "599.828"
"A" 0 0 0 0 0 2 1 4
"B" 0 0 0 0 0 1 0 3
"C" 0 0 0 0 2 1 0 1
"D" 3 0 0 0 3 1 0 0
I want to bin the columns, say every 4 columns, by summation, and then name the new columns with the mean of the binned columns. For the above table I would end up with:
"599.785" "599.816"
"A" 0 7
"B" 0 4
"C" 0 4
"D" 3 4
The new column names, 599.785 and 599.816, are average of the column names that were binned. I think something like cut would work for a vector of numbers, but I am not sure how to implement it for large data frames. Thanks for any help!
colnames <- c("599.773", "599.781", "599.789", "599.797",
"599.804", "599.812" ,"599.82" ,"599.828" )
mat <- matrix(scan(), nrow=4, byrow=TRUE)
0 0 0 0 0 2 1 4
0 0 0 0 0 1 0 3
0 0 0 0 2 1 0 1
3 0 0 0 3 1 0 0
colnames(mat)=colnames
rownames(mat) = LETTERS[1:4]
## sum the rows of mat over the given columns
sRows <- function(mat, cols) rowSums(mat[, cols, drop=FALSE])
## first column of each bin of 4: 1, 5, ...
starts <- seq(1, ncol(mat), by=4)
## each bin covers columns base .. base+3
accum <- sapply(starts, function(base) sRows(mat, base:(base+3)) )
## name each new column with the mean of the binned column names
colnames(accum) <- sapply(starts,
function(base)
mean(as.numeric(colnames(mat)[ base:(base+3)] )) )
accum
#-------
599.785 599.816
A 0 7
B 0 4
C 0 4
D 3 4
First of all, using numeric values as column names is not good practice.
Even so, here is a solution that gives the output the OP asked for.
## read data without checking names
dt <- read.table(text='
"599.773" "599.781" "599.789" "599.797" "599.804" "599.812" "599.82" "599.828"
"A" 0 0 0 0 0 2 1 4
"B" 0 0 0 0 0 1 0 3
"C" 0 0 0 0 2 1 0 1
"D" 3 0 0 0 3 1 0 0',header=TRUE, check.names =FALSE)
cols <- as.numeric(colnames(dt))
## create a grouping vector for the columns (two halves, i.e. bins of 4 for these 8 columns)
ff <- rep(c(TRUE,FALSE),each=length(cols)/2)
## using tapply to group operations by ff
vals <- do.call(cbind,tapply(cols,ff,
function(x)
rowSums(dt[,paste0(x)])))
nn <- tapply(cols,ff,mean)
## name the columns with the group means
colnames(vals) <- nn[colnames(vals)]
vals
599.816 599.785
A 7 0
B 4 0
C 4 0
D 4 3
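For more than two bins, the grouping can be derived from the column positions instead of a TRUE/FALSE split. The following is a sketch only: bin_columns is a hypothetical helper, not part of either answer, and it assumes that all column names can be parsed as numbers.

```r
# Example matrix from the question
mat <- matrix(c(0, 0, 0, 0, 0, 2, 1, 4,
                0, 0, 0, 0, 0, 1, 0, 3,
                0, 0, 0, 0, 2, 1, 0, 1,
                3, 0, 0, 0, 3, 1, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(LETTERS[1:4],
                              c("599.773", "599.781", "599.789", "599.797",
                                "599.804", "599.812", "599.82", "599.828")))

# Hypothetical helper: sum every `width` adjacent columns and name each
# new column with the mean of the original (numeric) column names
bin_columns <- function(m, width = 4) {
  m <- as.matrix(m)
  # bin number of each column: 1,1,1,1,2,2,2,2,...
  bin <- ceiling(seq_len(ncol(m)) / width)
  # rowsum() adds up rows by group, so transpose before and after
  out <- t(rowsum(t(m), group = bin))
  colnames(out) <- as.character(tapply(as.numeric(colnames(m)), bin, mean))
  out
}

binned <- bin_columns(mat, width = 4)
binned
```

The bins stay in their original left-to-right order. If the number of columns is not a multiple of width, the last bin is simply summed over fewer columns.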

Resources