How to repeat a query on different parts of a dataset in R?

I want to repeat a particular query on a large dataset. I am sure the answer to my question is quite basic, but after about two hours of reading various sources on 'for' loops and the repeat and replicate functions, I still can't find any examples that do what I need.
The dataset contains survey data from particular sites which are split into plots and each plot contains multiple species entries so the data looks like this:
SITE PLOT SPECIES
1 1 a
1 1 b
1 2 a
1 2 c
1 3 b
1 3 c
1 3 d
1 4 a
1 5 a
1 5 b
2 1 b
2 1 c
2 3 a
2 3 b
2 4 b
2 4 c
2 4 d
2 5 e
The actual data has over 6,500 rows, as there are hundreds of sites and each should contain 20 plots. The issue is that some plots are missing from some sites, so what I need to do is establish how many plots are missing in total. I can use the following code to query how many unique plots are in a given site; in the example below I query site number 7:
NROW(unique(df$PLOT[df$SITE == "7"]))
[1] 20
But I have hundreds of sites, so is there a function that will allow me to query each site automatically without manually changing the site number each time?

Here is a base R way with tapply.
x <- '
SITE PLOT SPECIES
1 1 a
1 1 b
1 2 a
1 2 c
1 3 b
1 3 c
1 3 d
1 4 a
1 5 a
1 5 b
2 1 b
2 1 c
2 3 a
2 3 b
2 4 b
2 4 c
2 4 d
2 5 e'
df1 <- read.table(textConnection(x), header = TRUE)
num_plots <- with(df1, tapply(PLOT, SITE, \(x) length(unique(x))))
which(num_plots != max(num_plots))
#> 2
#> 2
Created on 2022-05-26 by the reprex package (v2.0.1)
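Since the question states that every site should contain 20 plots, the total number of missing plots can be read straight off num_plots; a small follow-up sketch building on the code above (it assumes 20 is the expected count for every site):
# Follow-up sketch: num_plots as computed above, expected count of 20 per site
sum(20 - num_plots)    # total number of missing plots across all sites
sum(num_plots < 20)    # how many sites are missing at least one plot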

Not quite sure what you're going for but does this help?
Using data.table:
df <- read.table(text='SITE PLOT SPECIES
1 1 a
1 1 b
1 2 a
1 2 c
1 3 b
1 3 c
1 3 d
1 4 a
1 5 a
1 5 b
2 1 b
2 1 c
2 3 a
2 3 b
2 4 b
2 4 c
2 4 d
2 5 e', header=TRUE)
library(data.table)
setDT(df)[, .(plots=uniqueN(PLOT)), by=.(SITE)]
## SITE plots
## 1: 1 5
## 2: 2 4
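The same follow-up can be done in data.table (a sketch; it assumes df has already been converted with setDT above and that 20 plots are expected per site):
missing_by_site <- df[, .(missing = 20 - uniqueN(PLOT)), by = SITE]
missing_by_site                    # missing plots per site
missing_by_site[, sum(missing)]    # total missing plots across all sites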

Related

R - Count duplicate values for each row

I'm working on a data frame for which I need to calculate Fleiss's Kappa for inter-rater agreement. I'm using the 'irr' package for that.
Besides that, I need to count, for each observation, how many raters are in agreement.
My data looks like this:
a b c
1 1 1 1
2 1 2 2
3 2 3 2
4 3 3 1
5 4 2 1
I'm expecting something like this, where count stands for the number of raters in agreement:
a b c count
1 1 1 1 3
2 1 2 2 2
3 2 3 2 2
4 3 3 1 2
5 4 2 1 0
Thanks a lot.
Alternative solution if your data is in a data frame called abc:
as.numeric(apply(abc, 1, function(x) {
  ux <- unique(x)                  # distinct ratings in this row
  tab <- tabulate(match(x, ux))    # how often each distinct rating occurs
  mode <- ux[tab == max(tab)]      # the most frequent rating(s)
  # count the raters matching the mode; NA when there is no unique mode
  ifelse(length(mode) == 1, length(which(x == mode)), NA_character_)
}))
When you run it, it gives:
[1] 3 2 2 2 NA
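A table()-based variant may be easier to follow (a sketch, not part of the original answer); it returns 0 rather than NA when all ratings in a row differ, matching the expected output in the question:
apply(abc, 1, function(x) {
  counts <- table(x)    # frequency of each rating in this row
  top <- max(counts)    # size of the largest group of agreeing raters
  if (top == 1) 0L else as.integer(top)    # 0 when no two raters agree
})
# [1] 3 2 2 2 0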

Find minimal value for multiple same keys in a table [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
(9 answers)
Closed 5 years ago.
I have a table that contains multiple rows of data for a key made up of multiple columns.
The table looks like this:
A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2
I have also discovered how to remove all of the duplicate elements using the unique command across multiple columns, so data duplication is not a problem.
I would like to know how, for every key (columns A and B in the example), to find only the minimum value in the third column (column C).
In the end, the table should look like this:
A B C
1 1 1 2
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
Thanks for any help, it is really appreciated. If you have any questions, feel free to ask.
con <- textConnection(" A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2")
df <- read.table(con, header = TRUE)
df <- df[with(df, order(A, B, C)), ]  # sort so the smallest C comes first within each A/B key
df[!duplicated(df[1:2]), ]            # keep the first (minimal) row of each A/B key
#   A B C
# 1 1 1 2
# 4 1 2 4
# 3 2 1 4
# 5 2 2 3
# 6 2 3 1
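If sorting first feels indirect, aggregate() can compute the group minima in one step (an alternative sketch, not from the original answer):
aggregate(C ~ A + B, data = df, FUN = min)
#   A B C
# 1 1 1 2
# 2 2 1 4
# 3 1 2 4
# 4 2 2 3
# 5 2 3 1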

How to randomly choose only one row in each group [duplicate]

This question already has answers here:
from data table, randomly select one row per group
(4 answers)
Closed 6 years ago.
Say I have a dataframe as follows:
df <- data.frame(Region = c("A","A","A","B","B","C","D","D","D","D"),
                 Combo = c(1,2,3,1,2,1,1,2,3,4))
> df
Region Combo
1 A 1
2 A 2
3 A 3
4 B 1
5 B 2
6 C 1
7 D 1
8 D 2
9 D 3
10 D 4
What I would like to do, is for each Region (A,B,C,D) randomly choose only one of the possible combos for that region.
If the chosen combination were indicated by a binary variable, it would look something potentially like this:
Region Combo RandomlyChosen
1 A 1 1
2 A 2 0
3 A 3 0
4 B 1 0
5 B 2 1
6 C 1 1
7 D 1 0
8 D 2 0
9 D 3 1
10 D 4 0
I'm aware of the sample function, but just don't know how to choose only one combo within each region.
I regularly use data.table, so any solutions using that are welcome, though solutions not using it are equally welcome.
Thanks!
In plain R you can use sample() within tapply():
df$Chosen <- 0
# negate the row numbers so that sample() never sees a single positive integer,
# then the outer minus turns the sampled values back into row indices
df$Chosen[-tapply(-seq_along(df$Region), df$Region, sample, size = 1)] <- 1
df
Region Combo Chosen
1 A 1 0
2 A 2 1
3 A 3 0
4 B 1 1
5 B 2 0
6 C 1 1
7 D 1 0
8 D 2 0
9 D 3 1
10 D 4 0
Note the -(-selected_row_number) trick, which avoids sample() drawing from 1 to n when a group contains only a single row number.
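Since the question mentions data.table, a possible equivalent there (a sketch) samples one row index per Region with .I and flags it:
library(data.table)
setDT(df)
df[, Chosen := 0]                                    # start with 0 for every row
picked <- df[, .I[sample(.N, 1)], by = Region]$V1    # one random row number per Region
df[picked, Chosen := 1]                              # flag the sampled rows
df[]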

Creating a new dataframe with missing values

I have a data frame structured like this:
time <- c(1,1,1,1,2,2)
group <- c('a','b','c','d','c','d')
number <- c(2,3,4,1,2,12)
df <- data.frame(time,group,number)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 c 2
6 2 d 12
In order to plot the data, I need it to contain the values for each group (a to d) at each time interval, even if they equal zero. So, a data frame looking like this:
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a 0
6 2 b 0
7 2 c 2
8 2 d 12
Any help?
You can use expand.grid and merge, like this:
> merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all = TRUE)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a NA
6 2 b NA
7 2 c 2
8 2 d 12
From there, it's just a simple matter of replacing NA with 0.
new <- merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all.y = TRUE)
new[is.na(new$number),"number"] <- 0
new
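If the tidyverse is an option, tidyr::complete() does the expand-and-fill in one call (an alternative sketch, assuming the tidyr package is available):
library(tidyr)
# adds the missing time/group combinations and fills number with 0
complete(df, time, group, fill = list(number = 0))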

Creating a Rolling Wall Count Variable in R

I have a dataset with around 21k observations and a categorical variable for each observation with options A, B and C. I'm looking to create an experience variable for countries that have previously taken option C in prior observations (case t-1, to put it simply). I've been told this is called a rolling wall count. I haven't been able to figure out how to go about this or which package is best to use. Any suggestions would be super helpful!
dispute=c("1","1","1","2","2","2","2","3","3","3")
partner=c("1","2","3","1","2","3","4","2","1","3")
position=c("A","C","C","B","C","A","C","B","C","C")
Currently my data looks something like this:
Dispute Partner Position
1 1 A
1 2 C
1 3 C
2 1 B
2 2 C
2 3 A
2 4 C
3 1 B
3 2 C
3 3 C
Ideally, I'd create a variable that cumulatively counts when each unique observation takes on the value C (generating an "experience" count for each unique "partner"):
Dispute Partner Position Experience
1 1 A NA
1 2 C 1
1 3 C 1
2 1 B NA
2 2 C 2
2 3 A NA
2 4 C 1
3 1 B NA
3 2 C 3
With data.table
library(data.table)
setDT(df)[, experience:=cumsum(position=="C")*(position=="C"), by=partner]
dispute partner position experience
1: 1 1 A 0
2: 1 2 C 1
3: 1 3 C 1
4: 2 1 B 0
5: 2 2 C 2
6: 2 3 A 0
7: 2 4 C 1
8: 3 2 B 0
9: 3 1 C 1
10: 3 3 C 2
With dplyr
library(dplyr)
df %>%
  group_by(partner) %>%
  mutate(experience = cumsum(position == "C") * (position == "C"))
dispute partner position experience
1 1 1 A 0
2 1 2 C 1
3 1 3 C 1
4 2 1 B 0
5 2 2 C 2
6 2 3 A 0
7 2 4 C 1
8 3 2 B 0
9 3 1 C 1
10 3 3 C 2
data
df <- data.frame(dispute=c("1","1","1","2","2","2","2","3","3","3"),
                 partner=c("1","2","3","1","2","3","4","2","1","3"),
                 position=c("A","C","C","B","C","A","C","B","C","C"))
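A base R equivalent of the answers above (a sketch, starting from the df defined in the data block) uses ave() to build the same cumulative count per partner:
is_c <- df$position == "C"
# running count of "C" within each partner, zeroed on rows where position is not "C"
df$experience <- ave(as.integer(is_c), df$partner, FUN = cumsum) * is_c
df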
