How to randomly choose only one row in each group [duplicate] - r

This question already has answers here:
from data table, randomly select one row per group
(4 answers)
Closed 6 years ago.
Say I have a dataframe as follows:
df <- data.frame(Region = c("A","A","A","B","B","C","D","D","D","D"),
Combo = c(1,2,3,1,2,1,1,2,3,4))
> df
Region Combo
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 C 1
7 D 1
8 D 2
9 D 3
10 D 4
For each Region (A, B, C, D), I would like to randomly choose exactly one of the possible combos for that region.
If the chosen combination were indicated by a binary variable, the result might look like this:
Region Combo RandomlyChosen
1 A 1 1
2 A 2 0
3 A 3 0
4 B 1 0
5 B 2 1
6 C 1 1
7 D 1 0
8 D 2 0
9 D 3 1
10 D 4 0
I'm aware of the sample function, but just don't know how to choose only one combo within each region.
I regularly use data.table, so any solutions using that are welcome, though solutions not using data.table are equally welcome.
Thanks!

In plain R you can use sample() within tapply():
df$Chosen <- 0
df[-tapply(-seq_along(df$Region),df$Region, sample, size=1),]$Chosen <- 1
df
Region Combo Chosen
1 A 1 0
2 A 2 1
3 A 3 0
4 B 1 1
5 B 2 0
6 C 1 1
7 D 1 0
8 D 2 0
9 D 3 1
10 D 4 0
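Note the double negation. When sample() is given a single positive number n, it samples from 1 to n instead of returning n itself, so a group with only one row would pick the wrong row. Negating the row numbers before sampling, and negating the result again, avoids this.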
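Another way to avoid the single-row pitfall is to sample a position with sample.int() and then index into the row numbers of the group. The following is only a sketch in base R; the helper name pick_one and the set.seed() call are assumptions added for illustration and are not part of the answer above.

```r
set.seed(1)  # assumption: fixed seed, only so the sketch is repeatable

df <- data.frame(Region = c("A","A","A","B","B","C","D","D","D","D"),
                 Combo  = c(1,2,3,1,2,1,1,2,3,4))

# Hypothetical helper: pick one element of idx by position, which is
# safe even when idx has length 1 (sample.int(1, 1) always returns 1)
pick_one <- function(idx) idx[sample.int(length(idx), 1)]

# One randomly chosen row number per Region
chosen_rows <- tapply(seq_len(nrow(df)), df$Region, pick_one)

# Flag the chosen rows with 1 and all other rows with 0
df$Chosen <- as.integer(seq_len(nrow(df)) %in% chosen_rows)
df
```

Because the choice is random, the flagged rows differ between runs unless a seed is set; the only guarantee is one flagged row per Region.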

Related

How to repeat query on different parts of a dataset in R?

I want to repeat a particular query on a large dataset and I am sure the answer to my question is quite basic, but after reading various sources on 'for' loops, repeat and replicate functions for about 2 hours, I still can't find any examples which appear to do what I need to do.
The dataset contains survey data from particular sites which are split into plots and each plot contains multiple species entries so the data looks like this:
SITE PLOT SPECIES
1 1 a
1 1 b
1 2 a
1 2 c
1 3 b
1 3 c
1 3 d
1 4 a
1 5 a
1 5 b
2 1 b
2 1 c
2 3 a
2 3 b
2 4 b
2 4 c
2 4 d
2 5 e
The actual data is over 6500 rows as there are hundreds of sites and each should contain 20 plots - the issue is some plots are missing from some sites, so what I need to do is establish how many plots are missing in total. I can use the following code to query how many unique plots are on each site so in the example below I query how many unique plots are in site number 7:
NROW(unique(df$PLOT[df$SITE=="7"]))
[1] 20
But I have hundreds of sites, so is there a function that will allow me to query each site automatically without manually changing the site number each time?
Here is a base R way with tapply.
x <- '
SITE PLOT SPECIES
1 1 a
1 1 b
1 2 a
1 2 c
1 3 b
1 3 c
1 3 d
1 4 a
1 5 a
1 5 b
2 1 b
2 1 c
2 3 a
2 3 b
2 4 b
2 4 c
2 4 d
2 5 e'
df1 <- read.table(textConnection(x), header = TRUE)
num_plots <- with(df1, tapply(PLOT, SITE, \(x) length(unique(x))))
which(num_plots != max(num_plots))
#> 2
#> 2
Created on 2022-05-26 by the reprex package (v2.0.1)
Not quite sure what you're going for but does this help?
Using data.table:
df <- read.table(text='SITE PLOT SPECIES
1 1 a
1 1 b
1 2 a
1 2 c
1 3 b
1 3 c
1 3 d
1 4 a
1 5 a
1 5 b
2 1 b
2 1 c
2 3 a
2 3 b
2 4 b
2 4 c
2 4 d
2 5 e', header=TRUE)
library(data.table)
setDT(df)[, .(plots=uniqueN(PLOT)), by=.(SITE)]
## SITE plots
## 1: 1 5
## 2: 2 4
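Neither answer computes the total number of missing plots that the question asks for. A possible base R sketch is shown here, assuming (as the question states) that every site should contain 20 plots; the names expected_plots, missing_per_site and total_missing are introduced only for illustration.

```r
df <- read.table(text = "SITE PLOT SPECIES
1 1 a
1 1 b
1 2 a
1 2 c
1 3 b
1 3 c
1 3 d
1 4 a
1 5 a
1 5 b
2 1 b
2 1 c
2 3 a
2 3 b
2 4 b
2 4 c
2 4 d
2 5 e", header = TRUE)

expected_plots <- 20  # assumption taken from the question text

# Number of distinct plots recorded at each site
plots_per_site <- tapply(df$PLOT, df$SITE, function(p) length(unique(p)))

# Plots missing at each site, and across all sites
missing_per_site <- expected_plots - plots_per_site
total_missing <- sum(missing_per_site)

missing_per_site
total_missing
```

With the small example data, site 1 has 5 distinct plots and site 2 has 4, so the total is 15 + 16 = 31 missing plots. A site that has no rows at all would not appear in the result and would have to be counted separately.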

Putting back a missing column from a data.frame into a list of data.frames

My LIST of data.frames below is made from my data. However, this LIST is missing the scale column which is available in the original data.
I was wondering how to put the missing scale column back into LIST to achieve my DESIRED_LIST?
Reproducible data and code are below.
m3="
scale study outcome time ES bar
2 1 1 0 1 8
2 1 2 0 2 7
1 2 1 0 3 6
1 2 1 1 4 5
2 3 1 0 5 4
2 3 1 1 6 3
1 4 1 0 7 2
1 4 2 0 8 1"
data <- read.table(text = m3, h=T)
LIST <- list(data.frame(study=c(3,3) ,outcome=c(1,1) ,time=0:1),
data.frame(study=c(1,1) ,outcome=c(1,2) ,time=c(0,0)),
data.frame(study=c(2,2,4,4),outcome=c(1,1,1,2),time=c(0,1,0,0)))
DESIRED_LIST <- list(data.frame(scale=c(2,2) ,study=c(3,3) ,outcome=c(1,1) ,time=0:1),
data.frame(scale=c(2,2) ,study=c(1,1) ,outcome=c(1,2) ,time=c(0,0)),
data.frame(scale=c(1,1,1,1),study=c(2,2,4,4),outcome=c(1,1,1,2),time=c(0,1,0,0)))
In base R, you could do:
lapply(LIST, \(x) merge(x, data)[names(data)])
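Selecting names(data) after the merge keeps every column of data, including ES and bar, whereas DESIRED_LIST contains only scale, study, outcome and time. A sketch that restricts the merge to those columns could look like this; the names keys and restored are assumptions introduced for illustration.

```r
data <- read.table(text = "
scale study outcome time ES bar
2 1 1 0 1 8
2 1 2 0 2 7
1 2 1 0 3 6
1 2 1 1 4 5
2 3 1 0 5 4
2 3 1 1 6 3
1 4 1 0 7 2
1 4 2 0 8 1", header = TRUE)

LIST <- list(data.frame(study = c(3,3),     outcome = c(1,1),     time = 0:1),
             data.frame(study = c(1,1),     outcome = c(1,2),     time = c(0,0)),
             data.frame(study = c(2,2,4,4), outcome = c(1,1,1,2), time = c(0,1,0,0)))

keys <- c("study", "outcome", "time")  # columns shared by LIST and data

restored <- lapply(LIST, function(x) {
  # Look up scale for each key combination, ignoring ES and bar
  merged <- merge(x, data[c("scale", keys)], by = keys)
  # Put scale first, as in DESIRED_LIST
  merged[c("scale", keys)]
})
restored
```

This relies on each combination of study, outcome and time being unique in data, which holds for the example. Note that merge() sorts the rows by the key columns, so the row order can differ from the input when an element of LIST is not already sorted.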

Find minimal value for a multiple same keys in table [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 5 years ago.
I have a table which contains multiple rows of the different data for a key of multiple columns.
Table looks like this:
A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2
I have already found out how to remove duplicate rows with the unique command on multiple columns, so data duplication is not a problem.
For every key (columns A and B in the example), I would like to keep only the row with the minimum value in the third column (column C).
In the end, the table should look like this:
A B C
1 1 1 2
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
Thanks for any help; it is really appreciated.
If anything is unclear, feel free to ask.
con <- textConnection(" A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2")
df <- read.table(con, header = T)
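# Sort so that the smallest C comes first within each (A, B) key, and keep the sorted result
df <- df[with(df, order(A, B, C)), ]
# Keep the first row of each (A, B) key, i.e. the row with the minimum C
df[!duplicated(df[1:2]), ]
# A B C
# 1 1 1 2
# 4 1 2 4
# 3 2 1 4
# 5 2 2 3
# 6 2 3 1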
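The same result can also be computed without sorting, by asking for the minimum of C within each combination of A and B. The sketch below uses aggregate() from base R; the name min_by_key is an assumption introduced for illustration.

```r
df <- read.table(text = " A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2", header = TRUE)

# One row per (A, B) key, holding the smallest C found for that key
min_by_key <- aggregate(C ~ A + B, data = df, FUN = min)
min_by_key
```

Unlike the duplicated() approach, this returns fresh row names and only the grouping columns plus C, so it suits cases where no other columns of the original rows need to be carried along.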

How can I operate on elements of a data.frame in r, that creates a new column? [duplicate]

This question already has answers here:
Idiomatic R code for partitioning a vector by an index and performing an operation on that partition
(3 answers)
Closed 7 years ago.
Suppose I have a data.frame, df.
a b d
1 2 4
1 2 5
1 2 6
2 1 5
2 3 6
2 1 1
I'd like to operate on it so that for all places where a and b are equal, I compute the mean of d.
I found that using aggregate can do this,
aggregate(d ~ a + b, df, mean)
This gives me something reasonable
a b d
1 2 5
2 1 3
2 3 6
But I would ideally like to keep my original d column, and add a new column m, so that I get the original data.frame with a new column "m" that contains the averages like,
a b d m
1 2 4 5
1 2 5 5
1 2 6 5
2 1 5 3
2 3 6 6
2 1 1 3
Any ideas on how to do this "properly" in R?
library(dplyr)
df <- read.table(text = "a b d
1 2 4
1 2 5
1 2 6
2 1 5
2 3 6
2 1 1
" , header = T)
df %>%
group_by(a , b) %>%
mutate(m = mean(d))
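If dplyr is not available, the base function ave() does the same job: it computes a statistic per group and returns a vector as long as the input, so it can be assigned directly as a new column. This is a sketch of that alternative, not part of the answer above.

```r
df <- data.frame(a = c(1, 1, 1, 2, 2, 2),
                 b = c(2, 2, 2, 1, 3, 1),
                 d = c(4, 5, 6, 5, 6, 1))

# ave() applies mean (its default FUN) within each combination of a and b
# and repeats the group mean on every row of that group
df$m <- ave(df$d, df$a, df$b)
df
```

The original d column is left untouched, and the rows stay in their original order.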

subtract first value from each subset of dataframe

I want to subtract the smallest value in each subset of a data frame from each value in that subset i.e.
A <- c(1,3,5,6,4,5,6,7,10)
B <- rep(1:4, length.out=length(A))
df <- data.frame(A, B)
df <- df[order(B),]
Subtracting would give me:
A B
1 0 1
2 3 1
3 9 1
4 0 2
5 2 2
6 0 3
7 1 3
8 0 4
9 1 4
I think the output you show is not correct. In any case, from what you explain, I think this is what you want. It uses the base function ave:
within(df, { A <- ave(A, B, FUN=function(x) x-min(x))})
A B
1 0 1
5 3 1
9 9 1
2 0 2
6 2 2
3 0 3
7 1 3
4 0 4
8 1 4
Of course there are other alternatives such as plyr and data.table.
Echoing Arun's comment above, I think your expected output might be off. In any event, you can use tapply to calculate the minimum of each subset and then use match to line those minima up with the original values:
subs <- tapply(df$A, df$B, min)
df$A <- df$A - subs[match(df$B, names(subs))]
df
A B
1 0 1
5 3 1
9 9 1
2 0 2
6 2 2
3 0 3
7 1 3
4 0 4
8 1 4
