Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
1 10
2 5
2 8
3 15
4 7
4 9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this without manually subsetting the data and sampling each subset.
Thanks!

tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat),dat$ID,FUN=sample,1),]
# ID Value
#1 1 10
#3 2 8
#4 3 15
#6 4 9
If your data is truly a matrix and not a data.frame, dat$ID is not available and the rows may have no names, so select the ID column with [ and convert the sampled row numbers back to integers:
dat[as.integer(tapply(as.character(seq_len(nrow(dat))), dat[, "ID"], FUN=sample, 1)), ]
Don't be tempted to remove the as.character: when sample is passed a single number n, it samples from 1:n instead of returning n. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
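The single-value pitfall can also be avoided by sampling positions rather than values. A minimal sketch, using the resample helper pattern from the examples on R's ?sample help page; dat is rebuilt here from the question's example so the block runs on its own, and set.seed is only there for reproducibility:

```r
# Example data from the question
dat <- data.frame(ID = c(1, 2, 2, 3, 4, 4), Value = c(10, 5, 8, 15, 7, 9))

# Sample from the elements of x, even when x has length 1
resample <- function(x, ...) x[sample.int(length(x), ...)]

set.seed(1)
# One randomly chosen row number per ID group
rows <- as.vector(tapply(seq_len(nrow(dat)), dat$ID, FUN = resample, 1))
picked <- dat[rows, ]
picked
```

Because resample indexes into x, a group with a single row always returns that row, so no as.character conversion is needed.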

You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
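In dplyr 1.0.0 and later, sample_n() is marked as superseded in favour of slice_sample(). A sketch of the same idea with the newer verb, assuming dplyr >= 1.0.0 is installed; df is the question's example data:

```r
library(dplyr)

df <- data.frame(ID = c(1, 2, 2, 3, 4, 4), Value = c(10, 5, 8, 15, 7, 9))

one_per_id <- df %>%
  group_by(ID) %>%
  slice_sample(n = 1) %>%  # one random row from each ID group
  ungroup()
one_per_id
```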

The idea is to reorder the rows randomly and then keep only the first occurrence of each ID in that shuffled order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

Related

R Group dataframe according to certain conditions and each group has the same number of each condition

My dataframe has 324 different images, each with a unique ImageID. There are 3*3 = 9 conditions, and each image belongs to one of them. For example, Image 1 belongs to condition 1A and Image 5 belongs to condition 2B. What I am trying to achieve is to assign the images randomly to 6 blocks so that each block contains the same number of images from each condition. Then, when the dataframe is grouped by BlokNo, the images within each block will be presented in a random order. I also want to generate multiple orders of presentation from the same dataframe.
My data frame looks like this:
ImageID Category1 Category2 BlokNo
1 1 A
4 1 A
6 1 A
5 2 B
8 2 B
3 2 B
14 3 C
12 3 C
17 3 C
I would like my data to look like this:
ImageID Category1 Category2 BlokNo
1 1 A 2
4 1 A 1
6 1 A 3
5 2 B 3
8 2 B 2
3 2 B 1
14 3 C 1
12 3 C 3
17 3 C 2
Below is the code I tried. It achieves part of what I need, but since I have 3*3 = 9 conditions in total, I am wondering if there is a quicker way to do it. Thank you in advance!
Cond1 <- df %>% filter(Category1 == 1 & Category2 == "A") # keep the rows of one condition
Cond1$BlokNo <- sample(rep(1:6, each = ceiling(36/6))[1:36]) # randomly assign a block number from 1:6 to each image in that condition

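Instead of filtering each unique combination, do a group_by on both 'Category' columns and, within each group, assign a shuffled vector of evenly repeated block numbers. With 36 images per condition and 6 blocks, every block then receives 6 images from each condition. (A plain sample(row_number()) would number the rows 1 to 36 within each group, which only matches the small example above.)
library(dplyr)
df <- df %>%
  group_by(Category1, Category2) %>%
  mutate(BlokNo = sample(rep(1:6, length.out = n()))) %>%
  ungroup()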
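The balance requirement can be checked on simulated data. The sketch below uses base R only (ave instead of dplyr) and a hypothetical stand-in for the real data frame: the 324 rows, the ImageID values and the seed are made up for illustration, while the column names follow the question.

```r
set.seed(42)  # only for reproducibility

# Hypothetical stand-in for the real data: 9 conditions x 36 images = 324 rows
df <- expand.grid(rep = 1:36, Category1 = 1:3, Category2 = c("A", "B", "C"),
                  stringsAsFactors = FALSE)
df$ImageID <- sample(nrow(df))
df$rep <- NULL

# Within each condition, shuffle an evenly repeated vector of block numbers
df$BlokNo <- ave(seq_len(nrow(df)), df$Category1, df$Category2,
                 FUN = function(i) sample(rep(1:6, length.out = length(i))))

# Rows are blocks, columns are conditions; every cell should be 6
counts <- table(df$BlokNo, interaction(df$Category1, df$Category2))

# One random presentation order: blocks in sequence, images shuffled within a block
presentation <- df[order(df$BlokNo, runif(nrow(df))), ]
```

Re-running the last line without resetting the seed gives another presentation order from the same block assignment.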

Compare lists in dataframes based on personal code, shorten one list if longer

I have two separate dataframes each for one speaker of an interacting dyad. They have different amounts of talk-turns (rows) which is why I keep them in separate files for now.
In order to run my final analyses I need identical number of rows for each speaker.
So what I want to do is compare the rows for dyad_id 1 in both data frames and then shorten the longer of the two by deleting its trailing rows (across all columns), and likewise for every other dyad_id.
I prepared a data frame to illustrate what I already have.
So far, I tried to split the data frame by the dyad_id in both data sets to now compare the splits one after another and delete the unnecessary rows. As I have various conversations, I need to automate this to go through all dyad_ids one after another.
I hope someone can help me, I am completely lost.
dyad_id_A <- c(1,1,1,2,2,2,2,3,3,3,3,3)
fw_quantiles_a <- c(4,3,1,2,3,2,4,1,4,5,6,7)
df_A<- data.frame(dyad_id_A,fw_quantiles_a)
dyad_id_B <- c(1,1,1,1,2,2,2,3,3,3,3)
fw_quantiles_b <- c(3,1,2,1,2,4,1,3,3,4,5)
df_B <- data.frame(dyad_id_B,fw_quantiles_b)
Example of the dyad ids in the final dataset:
dyad_id_AB <- c(1,1,1,2,2,2,3,3,3,3)
What I tried so far:
split_conv_A = split(df_A, list(df_A$dyad_id_A))
split_conv_B = split(df_B, list(df_B$dyad_id_B))

Add a time counter within each dyad_id_x group and then merge together:
df_A$time <- ave(df_A$dyad_id_A, df_A$dyad_id_A, FUN=seq_along)
df_B$time <- ave(df_B$dyad_id_B, df_B$dyad_id_B, FUN=seq_along)
merge(
df_A, df_B,
by.x=c("dyad_id_A","time"), by.y=c("dyad_id_B","time")
)
# dyad_id_A time fw_quantiles_a fw_quantiles_b
#1 1 1 4 3
#2 1 2 3 1
#3 1 3 1 2
#4 2 1 2 2
#5 2 2 3 4
#6 2 3 2 1
#7 3 1 1 3
#8 3 2 4 3
#9 3 3 5 4
#10 3 4 6 5

Maybe we can use table to calculate the frequencies of the ids in both dataframes, assuming the same ids occur in both. Take the element-wise minimum of the two tables with pmin and repeat each id that many times.
tab <- pmin(table(df_A$dyad_id_A), table(df_B$dyad_id_B))
as.integer(rep(names(tab), tab))
# [1] 1 1 1 2 2 2 3 3 3 3
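The per-dyad minimum can also be used to trim each data frame separately, so the two stay in their own objects. A minimal sketch in base R, assuming both data frames contain the same dyad ids; the names n_common, turn_A, turn_B, df_A_trim and df_B_trim are introduced here for illustration:

```r
dyad_id_A <- c(1,1,1,2,2,2,2,3,3,3,3,3)
fw_quantiles_a <- c(4,3,1,2,3,2,4,1,4,5,6,7)
df_A <- data.frame(dyad_id_A, fw_quantiles_a)
dyad_id_B <- c(1,1,1,1,2,2,2,3,3,3,3)
fw_quantiles_b <- c(3,1,2,1,2,4,1,3,3,4,5)
df_B <- data.frame(dyad_id_B, fw_quantiles_b)

# Number of turns per dyad in each data frame, named by dyad id
n_A <- lengths(split(df_A$fw_quantiles_a, df_A$dyad_id_A))
n_B <- lengths(split(df_B$fw_quantiles_b, df_B$dyad_id_B))
n_common <- pmin(n_A, n_B)  # assumes the same dyad ids in both frames

# Running turn counter within each dyad
turn_A <- ave(seq_along(df_A$dyad_id_A), df_A$dyad_id_A, FUN = seq_along)
turn_B <- ave(seq_along(df_B$dyad_id_B), df_B$dyad_id_B, FUN = seq_along)

# Keep only the turns that exist for both speakers
df_A_trim <- df_A[turn_A <= n_common[as.character(df_A$dyad_id_A)], ]
df_B_trim <- df_B[turn_B <= n_common[as.character(df_B$dyad_id_B)], ]
```

Both trimmed data frames then have the same number of rows per dyad, with the surplus rows dropped from the end of each dyad.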

sum up certain variables (columns) by variable names

I want to sum up certain variables (columns in a data frame).
I would like to select those variables by parts of their names.
The complication is that I have several conditions, so a single contains() from dplyr does not work.
Here is an example:
ab_yy <- c(1:5)
bc_yy <- c(5:9)
cd_yy <- c(2:6)
de_xx <- c(3:7)
dat <- data.frame(ab_yy,bc_yy,cd_yy,de_xx)
dat
#   ab_yy bc_yy cd_yy de_xx
# 1     1     5     2     3
# 2     2     6     3     4
# 3     3     7     4     5
# 4     4     8     5     6
# 5     5     9     6     7
#sum up all variables that contain yy and certain extra conditions
#may look something like this: rowSums(select(dat, contains(("yy&ab")|("yy&bc")) ) )
desired result:
6 8 10 12 14

If you want to use dplyr, try using matches, which takes a regular expression:
library(dplyr)
dat %>%
  select(matches("yy$")) %>%       # names ending in yy
  select(matches("^(ab|bc)")) %>%  # of those, names starting with ab or bc
  rowSums()
# 6 8 10 12 14

I don't think it's the best way, but you can do it with grepl:
rowSums(dat[, grepl(pattern = "ab.*yy|bc.*yy", colnames(dat))])
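When the conditions pile up, it can be easier to build one logical vector per condition and combine them with & and |, instead of folding everything into a single regular expression. A small sketch in base R, using startsWith() and endsWith() on the column names; the helper names nm, keep and row_totals are made up for this example:

```r
dat <- data.frame(ab_yy = 1:5, bc_yy = 5:9, cd_yy = 2:6, de_xx = 3:7)

nm <- names(dat)
# Each condition is its own logical vector; combine with & (and) and | (or)
keep <- endsWith(nm, "yy") & (startsWith(nm, "ab") | startsWith(nm, "bc"))

row_totals <- rowSums(dat[keep])
unname(row_totals)
```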

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.

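Something like this would work. It assumes the columns after IID come in adjacent pairs:
pairs <- (ncol(df1)-1)/2
for(i in seq_len(pairs)){
  # lookup table of the unique values across both columns of the pair
  refs <- unique(c(df1[,2*i],df1[,2*i+1]))
  df1[,2*i] <- match(df1[,2*i],refs)
  df1[,2*i+1] <- match(df1[,2*i+1],refs)
}
df1
#   IID L1.1 L1.2 L2.1 L2.2
# 1   1    1    1    1    4
# 2   2    2    4    2    5
# 3   3    3    2    3    1
# 4   4    4    5    4    3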
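The same lookup idea can be driven by column names instead of column positions, which helps when there are many column groups or they are not adjacent. A sketch under the assumption that columns belonging together share the text before the last dot ("L1", "L2", ...); value_cols, prefix and refs are illustrative names, and df1 is rebuilt from the question's example:

```r
df1 <- data.frame(
  IID  = 1:4,
  L1.1 = c("38V1", "36V1", "37Y1", "38V2"),
  L1.2 = c("38V1", "38V2", "36V1", "36V2"),
  L2.1 = c("48V1", "50V1", "50V2", "52V1"),
  L2.2 = c("52V1", "48Y1", "48V1", "50V2"),
  stringsAsFactors = FALSE
)

value_cols <- setdiff(names(df1), "IID")
# Columns sharing the text before the last dot form one group
prefix <- sub("\\.[^.]*$", "", value_cols)

df2 <- df1
for (cols in split(value_cols, prefix)) {
  # Shared lookup table across all columns of the group
  refs <- unique(unlist(df1[cols], use.names = FALSE))
  df2[cols] <- lapply(df1[cols], match, table = refs)
}
df2
```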

You could reshape it to long format, assign the group ids and then recast it to wide:
library(data.table)
df_m <- melt(as.data.table(df1), id.vars = "IID")
# group by the column prefix (name without the trailing ".1", ".2", ...) and the value
df_m[, id := .GRP, by = .(sub("\\.\\d+$", "", variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 4 5 9 8
This should also be easily expandable to multiple groups of columns. For example, if you also have L3.1 and L3.2 columns, it should handle them as well.

R: Transposing from long to wide and aggregating rows with matching ID

This is something I've been working around for a while just making separate data frames and doing full_join but I think there's an easier way.
Overall, I'm wanting to calculate the differences between an individual ID's value from time 1 to time 2 by type from a long form data frame. This is one of the ways I think I could do it but if other people have other techniques or ideas I'd like to hear them too.
However, I'd also like to know how to address this transposing issue anyway because I'm curious.
Here's my issue.
I have a data frame in long form with 5 different measures for two different time periods. I want to convert this data frame from long form into a wide form so that instead of having a DF look like this (note, not all types are included -- just did 2 for sake of length):
(example df1)
ID Time Value Type
1 1 7 Type1
1 2 8 Type1
2 1 9 Type1
2 2 10 Type1
1 1 13 Type2
1 2 15 Type2
2 1 17 Type2
2 2 19 Type2
I want it to look more like this:
(example df 2)
ID Type1.1 Type1.2 Type2.1 Type2.2
1 7 8 13 15
2 9 10 17 19
I use:
library(dplyr)
library(tidyr)
df.new <- df %>%
  spread(Type, Value)
and get this from example df 1 which is on the right track:
(example df 3)
ID Time Type1 Type2
1 1 7 13
1 2 8 15
2 1 9 17
2 2 10 19
But now I want to spread the time for each type. When I do something like this on example df3:
newer.df <- df.new %>%
spread(Time, Type1)
I get this:
ID Type1.1 Type1.2
1 7 NA
1 NA 8
2 9 NA
2 NA 10
So, it's producing an NA for each row -- is there a way I can collapse rows on to each other by ID? I think I'm missing something.
Remember, in my example code I'm only using 2 types but in reality I have 5 types -- just wanted to give simplified code.

We can use dcast() from reshape2 package.
library(reshape2)
dcast(df, ID ~ Type + Time, value.var = "Value")
# ID Type1_1 Type1_2 Type2_1 Type2_2
#1 1 7 8 13 15
#2 2 9 10 17 19

Or using the original tidyr package, we could do this:
library(tidyr)
df$Type <- paste(df$Type, df$Time, sep="_")
df$Time <- NULL
spread(df, key=Type, value=Value)
#   ID Type1_1 Type1_2 Type2_1 Type2_2
# 1  1       7       8      13      15
# 2  2       9      10      17      19
Nulling the time column did the trick for me. It seems that spread considers all columns not used otherwise as what dcast would call id.vars. There might be a more elegant solution using tidyr, though.
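In tidyr 1.0.0 and later, pivot_wider() can build the column names from several columns at once, so pasting Type and Time together by hand is not needed. A sketch assuming tidyr >= 1.0.0 is installed; df is rebuilt from the question's example, and the Type1_diff column is only an illustration of the time 1 to time 2 difference the question mentions:

```r
library(tidyr)

df <- data.frame(
  ID    = c(1, 1, 2, 2, 1, 1, 2, 2),
  Time  = c(1, 2, 1, 2, 1, 2, 1, 2),
  Value = c(7, 8, 9, 10, 13, 15, 17, 19),
  Type  = rep(c("Type1", "Type2"), each = 4)
)

# Column names are built from Type and Time, joined by "_" by default
wide <- pivot_wider(df, names_from = c(Type, Time), values_from = Value)

# Change from time 1 to time 2 for one type
wide$Type1_diff <- wide$Type1_2 - wide$Type1_1
wide
```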
