R Group dataframe according to certain conditions and each group has the same number of each condition - r

My dataframe has 324 different images with unique imageID. And there are 3*3 =9 conditions, each image belonging to one of the conditions. For example, Image 1 belongs to 1A condition and Image 5 belongs to 2B condition. What I try to achieve is to group images into 6 blocks randomly but in each block, there is the same number of each condition. Then, when group the dataframe by blokNo, they will be presented in a random order. And I want to generate multiple orders of presentation from the same dataframe.
My data frame looks like this:
ImageID Catagory1 Category2 BlokNo
1 1 A
4 1 A
6 1 A
5 2 B
8 2 B
3 2 B
14 3 C
12 3 C
17 3 C
I would like my data to look like this:
ImageID Catagory1 Category2 BlokNo
1 1 A 2
4 1 A 1
6 1 A 3
5 2 B 3
8 2 B 2
3 2 B 1
14 3 C 1
12 3 C 3
17 3 C 2
Below is the code I tried. It actually can realize part of my requirement, but since I actually have 3*3=9 conditions in total, I am wondering if there are other quick ways to do it. Thank you in advance!
Cond1 <- df %>% filter (Category1 == 1 & Category2 == A) #filter out one condition
Cond1$BlokNo <- sample(rep(1:6, each = ceiling(36/6))[1:36]) #randomly assign a number from 1:6 to each image in certain condition

Instead of filtering by each unique combinations, do a group_by on those 'Category2' columns and get the sample of row_number()
library(dplyr)
df <- df %>%
group_by(Category1, Category2) %>%
mutate(BlockNo = sample(row_number())) %>%
ungroup

Related

How to rearrange columns of a data frame based on values in a row

This is an R programming question. I would like to rearrange the order of columns in a data frame based on the values in one of the rows. Here is an example data frame:
df <- data.frame(A=c(1,2,3,4),B=c(3,2,4,1),C=c(2,1,4,3),
D=c(4,2,3,1),E=c(4,3,2,1))
Suppose I want to rearrange the columns in df based on the values in row 4, ascending from 1 to 4, with ties having the same rank. So the desired data frame could be:
df <- data.frame(B=c(3,2,4,1),D=c(4,2,3,1),E=c(4,3,2,1),
C=c(2,1,4,3),A=c(1,2,3,4))
although I am indifferent about the order of first three columns, all of which have the value 1 in column 4.
I could do this with a for loop, but I am looking for a simpler approach. Thank you.
We can use select - subset the row (4), unlist, order the values and pass it on select
library(dplyr)
df %>%
select(order(unlist(.[4, ])))
-output
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
Or may use
df %>%
select({.} %>%
slice_tail(n = 1) %>%
flatten_dbl %>%
order)
B D E C A
1 3 4 4 2 1
2 2 2 3 1 2
3 4 3 2 4 3
4 1 1 1 3 4
or in base R
df[order(unlist(tail(df, 1))),]

count unique combinations of variable values in an R dataframe column [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Count number of rows within each group
(17 answers)
Closed 2 years ago.
I want to count the unique combinations of a variable that appear per group.
For example:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,4,4,4,5,6,6,7,7,7),
status = c("a","b","c","a","b","c","b","c","b","c","d","b","b","c","b","c", "d"))
> df
id status
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
7 3 b
8 3 c
9 4 b
10 4 c
11 4 d
12 5 b
13 6 b
14 6 c
15 7 b
16 7 c
17 7 d
So that, for example, I can tally how many times a given combination of "status" appears.
By hand, for example, I see that "a,b,c" appears twice total (id's 1 and 2).
These seem to be similar questions, but I couldn't work out how to do it and with clearer explanation in R:
Counting unique combinations
Count of unique combinations despite order
The result I think I am looking for would be something like:
abc 2
bc 3
b 1
...
An option with tidyverse where group by 'id', paste the 'status' and get the count
library(dplyr)
library(stringr)
df %>%
group_by(id) %>%
summarise(status = str_c(status, collapse="")) %>%
count(status)
# A tibble: 4 x 2
# status n
# <chr> <int>
#1 abc 2
#2 b 1
#3 bc 2
#4 bcd 2
Here is a base R option via aggregate
> aggregate(.~status,rev(aggregate(.~id,df,paste0,collapse = "")),length)
status id
1 abc 2
2 b 1
3 bc 2
4 bcd 2
You can use the apply family of functions too with tapply and lapply to get there with table.
tap <- tapply(df$status, df$id ,FUN= function(x) unique(x))
lap <- lapply(tap,FUN = function(x) paste0(x,collapse=""))
status <- unlist(lap)
df1 <- data.frame(table(status))
> df1
status Freq
1 abc 2
2 b 1
3 bc 2
4 bcd 2

R: Filtering by two columns using "is not equal" operator dplyr/subset

This questions must have been answered before but I cannot find it any where. I need to filter/subset a dataframe using values in two columns to remove them. In the examples I want to keep all the rows that are not equal (!=) to both replicate "1" and treatment "a". However, either subset and filter functions remove all replicate 1 and all treatment a. I could solve it by using which and then indexing, but it is not the best way for using pipe operator. do you know why filter/subset do not filter only when both conditions are true?
require(dplyr)
#Create example dataframe
replicate = rep(c(1:3), times = 4)
treatment = rep(c("a","b"), each = 6)
df = data.frame(replicate, treatment)
#filtering data
> filter(df, replicate!=1, treatment!="a")
replicate treatment
1 2 b
2 3 b
3 2 b
4 3 b
> subset(df, (replicate!=1 & treatment!="a"))
replicate treatment
8 2 b
9 3 b
11 2 b
12 3 b
#solution by which - indexing
index = which(df$replicate==1 & df$treatment=="a")
> df[-index,]
replicate treatment
2 2 a
3 3 a
5 2 a
6 3 a
7 1 b
8 2 b
9 3 b
10 1 b
11 2 b
12 3 b
I think you're looking to use an "or" condition here. How does this look:
require(dplyr)
#Create example dataframe
replicate = rep(c(1:3), times = 4)
treatment = rep(c("a","b"), each = 6)
df = data.frame(replicate, treatment)
df %>%
filter(replicate != 1 | treatment != "a")
replicate treatment
1 2 a
2 3 a
3 2 a
4 3 a
5 1 b
6 2 b
7 3 b
8 1 b
9 2 b
10 3 b

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactor = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 1 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
pairs <- (ncol(df1)-1)/2
for(i in 1:pairs){
refs <- unique(c(df1[,2*i],df1[,2*i+1]))
df1[,2*i] <- match(df1[,2*i],refs)
df1[,2*i+1] <- match(df1[,2*i+1],refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
df_m <- melt(df, id.vars = "IID")
setDT(df_m)[, id := .GRP, by = .(gsub("(.*).","\\1", df_m$variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 1 5 9 8
This should also be easily expandable to multiple groups of columns. I.e. if you have L3. it should work with that as well.

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without subsetting and sampling the subsets?
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
dat[tapply(as.character(seq(nrow(dat))),dat$ID,FUN=sample,1),]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
The idea is reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

Resources