I have a dataset with around 21k observations and a categorical variable for each observation with options A, B and C. I'm looking to create an experience variable for countries that have previously taken option C in prior observations (at time t-1, to put it simply). I've been told this is called a rolling window count. I haven't been able to figure out how to go about this or which package is best to use. Any suggestions would be super helpful!
dispute=c("1","1","1","2","2","2","2","3","3","3")
partner=c("1","2","3","1","2","3","4","2","1","3")
position=c("A","C","C","B","C","A","C","B","C","C")
Currently my data looks something like this:
Dispute Partner Position
1 1 A
1 2 C
1 3 C
2 1 B
2 2 C
2 3 A
2 4 C
3 1 B
3 2 C
3 3 C
Ideally I would create a variable that cumulatively counts when each unique observation takes on the value C, generating an "experience" count for each unique "partner":
Dispute Partner Position Experience
1 1 A NA
1 2 C 1
1 3 C 1
2 1 B NA
2 2 C 2
2 3 A NA
2 4 C 1
3 1 B NA
3 2 C 3
With data.table
library(data.table)
setDT(df)[, experience:=cumsum(position=="C")*(position=="C"), by=partner]
dispute partner position experience
1: 1 1 A 0
2: 1 2 C 1
3: 1 3 C 1
4: 2 1 B 0
5: 2 2 C 2
6: 2 3 A 0
7: 2 4 C 1
8: 3 2 B 0
9: 3 1 C 1
10: 3 3 C 2
With dplyr
library(dplyr)
df %>%
  group_by(partner) %>%
  mutate(experience = cumsum(position == "C") * (position == "C"))
dispute partner position experience
1 1 1 A 0
2 1 2 C 1
3 1 3 C 1
4 2 1 B 0
5 2 2 C 2
6 2 3 A 0
7 2 4 C 1
8 3 2 B 0
9 3 1 C 1
10 3 3 C 2
data
df <- data.frame(dispute=c("1","1","1","2","2","2","2","3","3","3"),
partner=c("1","2","3","1","2","3","4","2","1","3"),
position=c("A","C","C","B","C","A","C","B","C","C"))
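For completeness, a base R sketch with ave() gives the same running count; the experience_lag line is a hypothetical extra showing a strict t-1 version, where a partner's first C scores 0 because only prior Cs are counted, which is closer to the wording of the question.
df$experience <- ave(df$position == "C", df$partner, FUN = cumsum) * (df$position == "C")
# hypothetical t-1 variant: subtract the current row so only prior Cs count
df$experience_lag <- (ave(df$position == "C", df$partner, FUN = cumsum) -
                        (df$position == "C")) * (df$position == "C")
df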
In my toy data, for each unique study, the numeric variables (sample and group) must be numbered consecutively starting from 1. But:
For example, in study 1 there are two unique sample values (1 & 3), so 3 must be replaced with 2.
For example, in study 2 there is only one unique group value (2), so it must be replaced with 1.
In study 3, both sample and group look fine, meaning their unique values are already 1 and 2 (no replacing needed).
For this toy data, my desired output is shown below. But I would appreciate a functional solution that can automatically renumber any number of numeric variables in a data.frame that have lost their order, just as shown in my toy data.
m="
study sample group outcome
1 1 1 A
1 1 1 B
1 1 2 A
1 1 2 B
1 3 1 A
1 3 1 B
1 3 2 A
1 3 2 B
2 1 2 A
2 1 2 B
2 2 2 A
2 2 2 B
2 3 2 A
2 3 2 B
3 1 1 A
3 1 1 B
3 1 2 A
3 1 2 B
3 2 1 A
3 2 1 B
3 2 2 A
3 2 2 B"
data <- read.table(text = m, header = TRUE)
Desired_output="
study sample group outcome
1 1 1 A
1 1 1 B
1 1 2 A
1 1 2 B
1 2 1 A
1 2 1 B
1 2 2 A
1 2 2 B
2 1 1 A
2 1 1 B
2 2 1 A
2 2 1 B
2 3 1 A
2 3 1 B
3 1 1 A
3 1 1 B
3 1 2 A
3 1 2 B
3 2 1 A
3 2 1 B
3 2 2 A
3 2 2 B"
You can do:
library(dplyr)
data %>%
  group_by(study) %>%
  mutate(across(tidyselect::vars_select_helpers$where(is.numeric),
                function(x) as.numeric(as.factor(x)))) %>%
  as.data.frame()
The resultant data frame looks like this:
study sample group outcome
1 1 1 1 A
2 1 1 1 B
3 1 1 2 A
4 1 1 2 B
5 1 2 1 A
6 1 2 1 B
7 1 2 2 A
8 1 2 2 B
9 2 1 1 A
10 2 1 1 B
11 2 2 1 A
12 2 2 1 B
13 2 3 1 A
14 2 3 1 B
15 3 1 1 A
16 3 1 1 B
17 3 1 2 A
18 3 1 2 B
19 3 2 1 A
20 3 2 1 B
21 3 2 2 A
22 3 2 2 B
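If you are on dplyr 1.0.0 or later (an assumption about your setup), where() can be used directly inside across(), so the same idea reads a little more cleanly. A sketch:
library(dplyr)
data %>%
  group_by(study) %>%
  mutate(across(where(is.numeric), ~ as.numeric(as.factor(.x)))) %>%
  as.data.frame()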
Here is an alternative (not as elegant as @Allan Cameron's, +1) dplyr solution:
library(dplyr)
data %>%
  group_by(study) %>%
  mutate(x = n() / length(unique(sample)),
         sample = rep(row_number(), each = x, length.out = n()),
         y = length(unique(group)),
         group = ifelse(y == 1, 1, group)) %>%
  select(-x, -y)
study sample group outcome
<int> <int> <dbl> <chr>
1 1 1 1 A
2 1 1 1 B
3 1 1 2 A
4 1 1 2 B
5 1 2 1 A
6 1 2 1 B
7 1 2 2 A
8 1 2 2 B
9 2 1 1 A
10 2 1 1 B
11 2 2 1 A
12 2 2 1 B
13 2 3 1 A
14 2 3 1 B
15 3 1 1 A
16 3 1 1 B
17 3 1 2 A
18 3 1 2 B
19 3 2 1 A
20 3 2 1 B
21 3 2 2 A
22 3 2 2 B
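A base R alternative, just a sketch with a hypothetical helper name renumber, which within each study renumbers every numeric column except the id by matching each value against the sorted unique values of its group:
renumber <- function(d, id = "study") {
  num_cols <- setdiff(names(d)[sapply(d, is.numeric)], id)
  d[num_cols] <- lapply(d[num_cols], function(x)
    ave(x, d[[id]], FUN = function(v) match(v, sort(unique(v)))))
  d
}
renumber(data)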
I am trying to merge two columns in data table 'A' with another column in another data table 'B', which holds the unique values of a column. I want to merge in such a way that, for every unique combination of the two variables in data table 'A', all unique values of the column in data table 'B' are repeated.
I tried merge, but it doesn't give me all the values. I also tried the automated recycling in data.table, but that doesn't give me the result either.
Input:
data.table A
X Y
1 1
1 2
1 3
2 1
3 1
4 4
4 5
5 6
data.table B
Z
1
2
Expected output
X Y Z
1 1 1
1 1 2
1 2 1
1 2 2
1 3 1
1 3 2
2 1 1
2 1 2
3 1 1
3 1 2
4 4 1
4 4 2
4 5 1
4 5 2
5 6 1
5 6 2
We can make use of crossing from tidyr
library(tidyr)
crossing(A, B)
# X Y Z
#1 1 1 1
#2 1 1 2
#3 1 2 1
#4 1 2 2
#5 1 3 1
#6 1 3 2
#7 2 1 1
#8 2 1 2
#9 3 1 1
#10 3 1 2
#11 4 4 1
#12 4 4 2
#13 4 5 1
#14 4 5 2
#15 5 6 1
#16 5 6 2
Or with merge from base R, but the order will be slightly different
merge(A, B)
To get the desired order, pass the arguments in reverse and then reorder the columns:
merge(B, A)[c(names(A), names(B))]
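Since the question mentions data.table, here is a data.table sketch of the same cross join (assuming A and B are already data.tables, as in the question): repeat each row of A once per row of B, then recycle B's Z values down the expanded table.
library(data.table)
out <- A[rep(seq_len(nrow(A)), each = nrow(B))]  # each A row repeated nrow(B) times
out[, Z := rep(B$Z, times = nrow(A))]            # recycle Z across the repeats
out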
I need to create multiple (several thousand) resampled datasets from a large database. I have three categorical variables: Site (S), Transect (T), Quadrat (Q). The response variable is Value (V), which is the result of the particular S, T & Q combination. Quadrats are nested along each transect at each site. I pasted an abbreviated dataset below.
S T Q V
A 1 1 8
A 1 2 5
A 1 3 0
A 2 1 0
A 2 2 15
A 2 3 0
A 3 1 0
A 3 2 25
A 3 3 0
B 1 1 0
B 1 2 1
B 1 3 0
B 2 1 33
B 2 2 1
B 2 3 2
B 3 1 0
B 3 2 207
B 3 3 0
C 1 1 0
C 1 2 1
C 1 3 0
C 2 1 45
C 2 2 33
C 2 3 0
C 3 1 0
C 3 2 1
C 3 3 0
The idea would be that, for a given site, the resampled dataset would contain ## quads from transects 1 to n, where ## is the number of quadrats (Q) per transect (T) per site (S). I am not trying to resample the dataset based on S, T & Q; rather, I would like to be able to resample a user-defined number of rows based on conditions I define. For example, if I chose to resample based on 2 quadrats (Q) per transect (T) per site (S), I envision the resampled dataset looking like the example below.
S T Q V
A 1 1 8
A 1 3 0
A 2 1 0
A 2 2 15
A 3 2 25
A 3 3 0
B 1 2 1
B 1 3 0
B 2 2 1
B 2 3 2
B 3 1 0
B 3 2 207
C 1 1 0
C 1 3 0
C 2 1 45
C 2 3 0
C 3 2 1
C 3 3 0
Please let me know if that doesn't make sense and I'll revise until it does. Thanks for any assistance!
Consider by to split the data frame by the Site and Transect factors and then sample random rows from each piece:
set.seed(444)
quads <- 2
# BUILD LIST OF SUBSETTED RANDOM SAMPLED DATAFRAMES
df_list <- by(df, df[c("S", "T")], FUN=function(df) df[sample(nrow(df), quads),])
# STACK ALL DATAFRAMES INTO ONE FINAL DF
sample_df <- do.call(rbind, df_list)
# SORT DATAFRAME BY S AND T
sample_df <- with(sample_df, sample_df[order(S, T),])
# RESET ROW NAMES
row.names(sample_df) <- NULL
sample_df
# S T Q V
# 1 A 1 1 8
# 2 A 1 3 0
# 3 A 2 2 15
# 4 A 2 1 0
# 5 A 3 1 0
# 6 A 3 3 0
# 7 B 1 2 1
# 8 B 1 1 0
# 9 B 2 3 2
# 10 B 2 1 33
# 11 B 3 1 0
# 12 B 3 2 207
# 13 C 1 1 0
# 14 C 1 2 1
# 15 C 2 1 45
# 16 C 2 3 0
# 17 C 3 3 0
# 18 C 3 2 1
Data
txt = '
S T Q V
A 1 1 8
A 1 2 5
A 1 3 0
A 2 1 0
A 2 2 15
A 2 3 0
A 3 1 0
A 3 2 25
A 3 3 0
B 1 1 0
B 1 2 1
B 1 3 0
B 2 1 33
B 2 2 1
B 2 3 2
B 3 1 0
B 3 2 207
B 3 3 0
C 1 1 0
C 1 2 1
C 1 3 0
C 2 1 45
C 2 2 33
C 2 3 0
C 3 1 0
C 3 2 1
C 3 3 0'
df = read.table(text=txt, header=TRUE)
To build many randomly generated data frames, simply extend quads into a vector and run it through lapply:
max_quads <- 3
quads <- replicate(1000, sample(1:max_quads, 1))
df_list <- lapply(quads, function(q) {
  by_list <- by(df, df[c("S", "T")], FUN=function(df) df[sample(nrow(df), q),])
  sample_df <- do.call(rbind, by_list)
  sample_df <- with(sample_df, sample_df[order(S, T),])
  row.names(sample_df) <- NULL
  return(sample_df)
})
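If you prefer dplyr (an assumption about your toolkit; slice_sample() needs dplyr >= 1.0.0), here is a sketch of the same per-group sampling for a fixed number of quadrats. n_quads is a hypothetical name for the user-defined count.
library(dplyr)
n_quads <- 2
df %>%
  group_by(S, T) %>%              # one group per Site x Transect
  slice_sample(n = n_quads) %>%   # sample n_quads rows within each group
  ungroup() %>%
  arrange(S, T)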
Suppose I have a data frame like this:
dfA<-data.frame(A=c(letters[1:3]),B=c(letters[4:6]),C=c(letters[7:9]))
>dfA
A B C
1 a d g
2 b e h
3 c f i
and another one like this:
dfB<-data.frame(replicate(12,sample(0:5,5,rep=T)))
colnames(dfB)<-sample(letters[1:9],12,rep=T)
> dfB
a a d d g e i c i a g h
1 0 3 3 2 2 1 2 4 1 2 4 0
2 2 2 3 0 0 0 4 4 1 5 2 1
3 4 5 0 3 2 4 3 5 1 4 2 3
4 0 1 0 4 4 3 2 2 1 2 3 1
5 4 0 2 1 2 4 0 5 5 0 5 1
How could I refer to all columns of dfB whose names are contained in column A of dfA?
I am quite new to R and have searched this forum a lot, but couldn't find the exact answer.
I tried something like this: sub <- subset(dfB, !colnames(dfB) %in% dfA$A), with unsatisfying results so far.
The output I'd want to get would be:
> sub
a a c a
1 0 3 4 2
2 2 2 4 5
3 4 5 5 4
4 0 1 2 2
5 4 0 5 0
Can anyone help?
As akrun pointed out in the comments,
subset(dfB, select=colnames(dfB) %in% dfA$A)
works perfectly.
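An equivalent base R sketch using logical column indexing; drop = FALSE keeps the result a data frame even if only one column name matches:
sub <- dfB[, colnames(dfB) %in% dfA$A, drop = FALSE]
sub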
I have a number of trials where one variable increases to a max of interest and then decreases back to a starting point. How would I go about retaining just the observations with the increasing values up to the max? Thanks.
For example
Trial A B C
1 2 4 1
1 4 3 2
1 3 7 3
1 3 3 2
1 4 1 1
2 4 1 1
2 6 2 2
2 3 1 3
2 1 1 2
2 7 3 1
...
So we would find the max of C for each trial and retain the rows up to it, as follows:
Trial A B C
1 2 4 1
1 4 3 2
1 3 7 3
2 4 1 1
2 6 2 2
2 3 1 3
...
Ultimately I'll have a low cut-off value, and I may vary what I mean by "max", but essentially the above is the aim.
Probably not the most efficient solution, but here is an attempt using data.table
library(data.table)
setDT(df)[, .SD[1:which.max(C)], by = Trial]
# Trial A B C
# 1: 1 2 4 1
# 2: 1 4 3 2
# 3: 1 3 7 3
# 4: 2 4 1 1
# 5: 2 6 2 2
# 6: 2 3 1 3
Or for some efficiency gain
indx <- setDT(df)[, .I[1:which.max(C)], by = Trial]
df[indx$V1]
Or with dplyr
library(dplyr)
df %>%
  group_by(Trial) %>%
  slice(1:which.max(C))
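Data
A sketch reconstructing only the rows shown above (the real data presumably continues past the "..."):
df <- read.table(text = "
Trial A B C
1 2 4 1
1 4 3 2
1 3 7 3
1 3 3 2
1 4 1 1
2 4 1 1
2 6 2 2
2 3 1 3
2 1 1 2
2 7 3 1", header = TRUE)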