How to identify duplicate items within a subset of data - r

I am trying to identify which trials, within a long form dataset, are repeated but only within certain blocks per participant. My data is structured something like this:
sub block trial item
1 1 1 A
1 1 2 B
1 2 1 A
1 2 2 B
1 3 1 B
1 3 2 C
2 1 1 A
2 1 2 B
2 2 1 A
2 2 2 B
2 3 1 B
2 3 2 C
I would like to add two new columns: dup, which indicates for each participant which items are repeated, and newtrial, which holds a new trial code. Both should apply only to items that are repeated in blocks 2 and 3. The result would look something like this:
sub block trial item dup newtrial
1 1 1 A FALSE 1
1 1 2 B FALSE 2
1 2 1 A FALSE 1
1 2 2 B FALSE 2
1 3 1 C FALSE 1
1 3 2 B TRUE 102
2 1 1 A FALSE 1
2 1 2 B FALSE 2
2 2 1 A FALSE 1
2 2 2 B FALSE 2
2 3 1 C FALSE 1
2 3 2 B TRUE 102
I have been able to identify duplicates across the whole dataset and add 100 to each trial number using the following code:
data$dup<-duplicated(data$item)
data$newtrial<-NA
data<-transform(data,
item=make.unique(as.character(item)),
newtrial=ifelse(duplicated(item),trial+100, trial))
What I have not been able to figure out is how to constrain the function to each individual subject and only certain blocks within each subject number.
Thanks!

Another option, using data.table:
library(data.table)
xt <- fread("sub block trial item
1 1 1 A
1 1 2 B
1 2 1 A
1 2 2 B
1 3 1 B
1 3 2 B
2 1 1 A
2 1 2 B
2 2 1 A
2 2 2 B
2 3 1 B
2 3 2 B")
xt[,
c("dup","ntrial") := {
dup <- duplicated(item)
tt <- ifelse(dup,trial+100L,trial)
list(dup,tt)
},"sub,block"]

You can do this with dplyr by grouping the observations by sub and block:
library(dplyr)
res <- data %>% group_by(sub,block) %>%
mutate(dup=duplicated(item)) %>%
ungroup %>%
mutate(newtrial=ifelse(dup,trial+100,trial))
We use mutate to create new columns dup and newtrial.
Data: I modified your data slightly to introduce a duplicate item for sub=1, block=3 and for sub=2, block=3:
data <- structure(list(sub = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), block = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L, 2L, 3L,
3L), trial = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L
), item = structure(c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L,
2L, 2L), .Label = c("A", "B"), class = "factor")), .Names = c("sub",
"block", "trial", "item"), class = "data.frame", row.names = c(NA,
-12L))
## sub block trial item
##1 1 1 1 A
##2 1 1 2 B
##3 1 2 1 A
##4 1 2 2 B
##5 1 3 1 B
##6 1 3 2 B
##7 2 1 1 A
##8 2 1 2 B
##9 2 2 1 A
##10 2 2 2 B
##11 2 3 1 B
##12 2 3 2 B
Using this data:
print(res)
### A tibble: 12 x 6
## sub block trial item dup newtrial
## <int> <int> <int> <fctr> <lgl> <dbl>
##1 1 1 1 A FALSE 1
##2 1 1 2 B FALSE 2
##3 1 2 1 A FALSE 1
##4 1 2 2 B FALSE 2
##5 1 3 1 B FALSE 1
##6 1 3 2 B TRUE 102
##7 2 1 1 A FALSE 1
##8 2 1 2 B FALSE 2
##9 2 2 1 A FALSE 1
##10 2 2 2 B FALSE 2
##11 2 3 1 B FALSE 1
##12 2 3 2 B TRUE 102
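Both answers above flag duplicates within each sub/block combination. If the goal is instead to treat blocks 2 and 3 of each participant as one pool, as the question's wording suggests, one possible base R sketch is shown below. It rebuilds the question's input table; the name in_scope is an assumption introduced here for illustration and does not come from the original posts.

```r
# Rebuild the input table from the question (12 rows, 2 participants).
data <- data.frame(
  sub   = rep(1:2, each = 6),
  block = rep(rep(1:3, each = 2), times = 2),
  trial = rep(1:2, times = 6),
  item  = rep(c("A", "B", "A", "B", "B", "C"), times = 2)
)

# Rows that take part in the duplicate check (hypothetical helper name).
in_scope <- data$block %in% c(2, 3)

# Default to FALSE, then flag repeats of the same item within the same
# participant, looking only at the in-scope rows.
data$dup <- FALSE
data$dup[in_scope] <- duplicated(data[in_scope, c("sub", "item")])

# Add 100 to the trial code of flagged rows only.
data$newtrial <- ifelse(data$dup, data$trial + 100L, data$trial)

print(data)
```

With the input as posted, this flags the block 3 occurrence of item B for each participant, because B already appeared in block 2, and leaves block 1 untouched.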

Related

Pasting values from a vector to a new column in a for loop with nested data

I have a dataframe that currently looks like this:
subjectID Trial
1 3
1 3
1 3
1 4
1 4
1 5
1 5
1 5
2 1
2 1
2 3
2 3
2 3
2 5
2 5
2 6
3 1
And so on, where trial number is nested under subject ID. I need to make a new column, "NewTrial", that simply gives the order in which each distinct trial appears within a subject. For example:
subjectID Trial NewTrial
1 3 1
1 3 1
1 3 1
1 4 2
1 4 2
1 5 3
1 5 3
1 5 3
2 1 1
2 1 1
2 3 2
2 3 2
2 3 2
2 5 3
2 5 3
2 6 4
3 1 1
So far, I have a for-loop written that looks like this:
for (myperson in unique(data$subjectID)) {
  # This line creates a vector indexing the unique trials per subject: for subject 1, c(1, 2, 3)
  triallength <- 1:length(unique(data$Trial[data$subjectID == myperson]))
}
I'm having trouble now finding a way to paste the numbers from the created triallength vector as a column in the dataframe. Does anyone know of a way to accomplish this? I am lacking some experience with for-loops and hoping to gain more. If anyone has a tidyverse/dplyr solution, however, I am open to that as well as an alternative to a for-loop. Thanks in advance, and let me know if any clarification is needed!
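Converting Trial to a factor whose levels are its unique values (in order of appearance), then applying as.numeric inside ave, should work nicely. Note that the \(x) lambda syntax requires R 4.1 or later.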
transform(dat, NewTrial=ave(Trial, subjectID, FUN=\(x) as.numeric(factor(x, levels=unique(x)))))
# subjectID Trial NewTrial
# 1 1 3 1
# 2 1 3 1
# 3 1 3 1
# 4 1 4 2
# 5 1 4 2
# 6 1 5 3
# 7 1 5 3
# 8 1 5 3
# 9 2 1 1
# 10 2 1 1
# 11 2 3 2
# 12 2 3 2
# 13 2 3 2
# 14 2 5 3
# 15 2 5 3
# 16 2 6 4
# 17 3 1 1
Data:
dat <- structure(list(subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L), Trial = c(3L, 3L, 3L, 4L,
4L, 5L, 5L, 5L, 1L, 1L, 3L, 3L, 3L, 5L, 5L, 6L, 1L)), class = "data.frame", row.names = c(NA,
-17L))
We could use match on the unique values after grouping by 'subjectID':
library(dplyr)
dat <- dat %>%
group_by(subjectID) %>%
mutate(NewTrial = match(Trial, unique(Trial))) %>%
ungroup()
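We could use rleid from data.table, which numbers runs of consecutive identical values (so it assumes that repeated trials are adjacent within each subject):
library(dplyr)
library(data.table)
dat %>%
group_by(subjectID) %>%
mutate(NewTrial = rleid(subjectID, Trial))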
# subjectID Trial NewTrial
# <int> <int> <int>
# 1 1 3 1
# 2 1 3 1
# 3 1 3 1
# 4 1 4 2
# 5 1 4 2
# 6 1 5 3
# 7 1 5 3
# 8 1 5 3
# 9 2 1 1
# 10 2 1 1
# 11 2 3 2
# 12 2 3 2
# 13 2 3 2
# 14 2 5 3
# 15 2 5 3
# 16 2 6 4
# 17 3 1 1
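Since the question asked specifically about finishing the for-loop, here is a minimal sketch of one way to do it. It assumes the data frame is called dat (as in the Data block above) and uses match() against the subject's unique trials; the variable name rows is introduced here for illustration only.

```r
# Example data with the same values as the question's table.
dat <- data.frame(
  subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L),
  Trial     = c(3L, 3L, 3L, 4L, 4L, 5L, 5L, 5L,
                1L, 1L, 3L, 3L, 3L, 5L, 5L, 6L, 1L)
)

# Create the column first so the loop can fill it in piece by piece.
dat$NewTrial <- NA_integer_

for (myperson in unique(dat$subjectID)) {
  # Logical index of the rows that belong to this subject.
  rows <- dat$subjectID == myperson
  # Position of each trial among the subject's unique trials,
  # in order of first appearance.
  dat$NewTrial[rows] <- match(dat$Trial[rows], unique(dat$Trial[rows]))
}

print(dat)
```

The key step is assigning into dat$NewTrial[rows], which writes the per-subject result back into only that subject's rows.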

How to refill a column with the help of two other columns?

I have data grouped by three variables: SAMPN, PERNO, and loop.
There are two columns, mode1 and mode2, and a column called int.
SAMPN PERNO loop mode1 mode2 int
1 1 1 1 2 NA
1 1 1 2 1 NA
1 1 1 3 2 0
1 2 1 3 2 NA
1 2 1 1 1 2
2 2 1 3 2 NA
2 2 1 1 3 NA
2 2 1 3 1 0
2 2 2 1 2 NA
2 2 2 3 1 2
SAMPN is the family index, PERNO is the index of a person within each family, and loop is the tour of each person. The last row of each loop for each person is 0 or 2, and the rest of the loop is NA. For each family, person, and loop, I want to copy mode1 into int if the last row of the loop is 0, and copy mode2 if the last row of the loop is 2.
Desired output:
SAMPN PERNO loop mode1 mode2 int
1 1 1 1 2 1
1 1 1 2 1 2
1 1 1 3 2 3
1 2 1 3 2 2
1 2 1 1 1 1
2 2 1 3 2 3
2 2 1 1 3 1
2 2 1 3 1 3
2 2 2 1 2 2
2 2 2 3 1 1
The first 3 rows are the loop of the first person in the first family; I filled that loop with mode1 because the third row was 0, and so on.
Here's a way using dplyr:
df <- read.table(header = TRUE, text = "SAMPN PERNO loop mode1 mode2 int
1 1 1 1 2 NA
1 1 1 2 1 NA
1 1 1 3 2 0
1 2 1 3 2 NA
1 2 1 1 1 2
2 2 1 3 2 NA
2 2 1 1 3 NA
2 2 1 3 1 0
2 2 2 1 2 NA
2 2 2 3 1 2")
library(dplyr)
df %>%
group_by(loop, SAMPN, PERNO) %>%
mutate(int = if(last(int) == 0) mode1 else mode2) %>%
ungroup()
#> # A tibble: 10 x 6
#> SAMPN PERNO loop mode1 mode2 int
#> <int> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 2 1
#> 2 1 1 1 2 1 2
#> 3 1 1 1 3 2 3
#> 4 1 2 1 3 2 2
#> 5 1 2 1 1 1 1
#> 6 2 2 1 3 2 3
#> 7 2 2 1 1 3 1
#> 8 2 2 1 3 1 3
#> 9 2 2 2 1 2 2
#> 10 2 2 2 3 1 1
If int can take more values than 0 and 2, switch could be a good alternative:
df %>%
group_by(loop, SAMPN, PERNO) %>%
mutate(int = switch(
as.character(last(int)),
`0` = mode1,
`2` = mode2)) %>%
ungroup()
# same output!
We can also use case_when
library(dplyr)
df %>%
group_by(loop, SAMPN, PERNO) %>%
mutate(int = case_when(rep(last(int) == 0, n()) ~ mode1, TRUE ~mode2))
# A tibble: 10 x 6
# Groups: loop, SAMPN, PERNO [4]
# SAMPN PERNO loop mode1 mode2 int
# <int> <int> <int> <int> <int> <int>
# 1 1 1 1 1 2 1
# 2 1 1 1 2 1 2
# 3 1 1 1 3 2 3
# 4 1 2 1 3 2 2
# 5 1 2 1 1 1 1
# 6 2 2 1 3 2 3
# 7 2 2 1 1 3 1
# 8 2 2 1 3 1 3
#9 2 2 2 1 2 2
#10 2 2 2 3 1 1
data
df <- structure(list(SAMPN = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), PERNO = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), loop = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), mode1 = c(1L, 2L, 3L, 3L,
1L, 3L, 1L, 3L, 1L, 3L), mode2 = c(2L, 1L, 2L, 2L, 1L, 2L, 3L,
1L, 2L, 1L), int = c(NA, NA, 0L, NA, 2L, NA, NA, 0L, NA, 2L)),
class = "data.frame", row.names = c(NA,
-10L))
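For readers who prefer to avoid extra packages, the same "look at the last value of each group" idea can be sketched in base R. This is only an illustration under the question's assumption that the last int of every loop is 0 or 2; the names grp and last_int are introduced here and are not from the original answers.

```r
# Same data as in the question.
df <- data.frame(
  SAMPN = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  PERNO = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  loop  = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L),
  mode1 = c(1L, 2L, 3L, 3L, 1L, 3L, 1L, 3L, 1L, 3L),
  mode2 = c(2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 1L),
  int   = c(NA, NA, 0L, NA, 2L, NA, NA, 0L, NA, 2L)
)

# One grouping factor per family/person/loop combination
# (drop = TRUE discards combinations that never occur).
grp <- interaction(df$SAMPN, df$PERNO, df$loop, drop = TRUE)

# Spread the last int of each group over all rows of that group.
last_int <- ave(df$int, grp, FUN = function(x) rep(x[length(x)], length(x)))

# Pick mode1 where the group's last int is 0, otherwise mode2.
df$int <- ifelse(last_int == 0, df$mode1, df$mode2)

print(df)
```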

Creating a new variable based on the orders of existing variables using R

I am hoping to create a new variable X based on three existing variables: "SubID", "Day", and "Time". I used to do this manually in Excel with three sorts: first by SubID, then by Day, and lastly by Time. X should run from 1 to the number of rows for each SubID, following the order of Day and Time.
SubID: assigned subject number
Day: each subject's day number (1,2,3...21)
Time: 1, 2, 3
X: the row's position within its SubID after sorting by Day and Time
SubID Day Time X
1 1 1 1
1 1 2 2
1 1 3 3
1 2 1 4
1 2 2 5
2 1 1 1
2 1 2 2
2 1 3 3
2 2 3 6
2 2 2 5
2 2 1 4
I have been doing this manually in Excel, and I am sure there must be a smarter way to do it in R, but I am new to R and don't know how. Thank you in advance!
Maybe with the data.table package. You will have to install it if you haven't already; the install command is commented out below.
# install.packages("data.table")
library(data.table)
We can generate some example data in the following way (it is random, so your values will differ from the output shown):
df <- data.frame(SubId=sample(1:2,10,replace=TRUE),
Day=sample(1:2,10,replace=TRUE),
Time=sample(1:2,10,replace=TRUE))
Then convert the data.frame into data.table.
setDT(df)
##> df
## SubId Day Time
## 1: 1 2 1
## 2: 1 1 1
## 3: 1 1 2
## 4: 2 2 1
## 5: 2 1 1
## 6: 1 2 2
## 7: 1 2 1
## 8: 1 2 2
## 9: 2 1 1
## 10: 2 1 2
Finally, we can order by SubId, Day, and Time. With the rows ordered as we want, we just have to number them from 1 to the number of observations within each SubId.
df[order(SubId,Day,Time),X:=1:.N,SubId]
##> df
## SubId Day Time X
## 1: 1 2 1 3
## 2: 1 1 1 1
## 3: 1 1 2 2
## 4: 2 2 1 4
## 5: 2 1 1 1
## 6: 1 2 2 5
## 7: 1 2 1 4
## 8: 1 2 2 6
## 9: 2 1 1 2
## 10: 2 1 2 3
Maybe this helps:
library(dplyr)
df1 %>%
group_by(SubID) %>%
mutate(X1 = row_number(as.numeric(paste0(Day, Time))))
# A tibble: 11 x 5
# Groups: SubID [2]
# SubID Day Time X X1
# <int> <int> <int> <int> <int>
# 1 1 1 1 1 1
# 2 1 1 2 2 2
# 3 1 1 3 3 3
# 4 1 2 1 4 4
# 5 1 2 2 5 5
# 6 2 1 1 1 1
# 7 2 1 2 2 2
# 8 2 1 3 3 3
# 9 2 2 3 6 6
#10 2 2 2 5 5
#11 2 2 1 4 4
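Or using order. Note that order() returns the sorting permutation rather than ranks, so it is applied twice to get each row's position in the sorted order:
df1 %>%
group_by(SubID) %>%
mutate(X1 = order(order(Day, Time)))
Or with data.table:
library(data.table)
setDT(df1)[, X1 := order(order(Day, Time)), by = SubID]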
data
df1 <- structure(list(SubID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), Day = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L),
Time = c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 3L, 2L, 1L), X = c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 6L, 5L, 4L)), class = "data.frame",
row.names = c(NA,
-11L))
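A single order() call happens to give the right answer on the example data, but in general it returns the row positions in sorted order, not each row's rank. The small sketch below, on made-up values that are not from the question, shows the difference and one base R way to get the within-subject position. The data frame d and the name perm are assumptions for illustration.

```r
# Made-up example where the sorting permutation and the ranks differ.
d <- data.frame(
  SubID = c(1L, 1L, 1L, 2L, 2L),
  Day   = c(2L, 1L, 1L, 1L, 1L),
  Time  = c(1L, 1L, 2L, 2L, 1L)
)

# For subject 1, order() says "row 2 comes first, then row 3, then row 1".
perm <- order(d$Day[1:3], d$Time[1:3])

# Applying order() twice turns that permutation into ranks; ave() does it
# separately for each subject by passing the row indices of each group.
d$X <- ave(seq_len(nrow(d)), d$SubID,
           FUN = function(i) order(order(d$Day[i], d$Time[i])))

print(perm)
print(d)
```

Here perm is 2 3 1 for subject 1, whereas the wanted positions for those rows are 3 1 2, which is what the double order() produces.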

Panel data sequence adding for a particular value

I am really new to R and Stack Overflow. Apologies in advance for this novice question.
I have a panel data set like the following table.
ID Choice
1 1
1 1
1 2
1 5
1 1
2 1
2 1
2 5
2 1
2 1
3 3
3 1
3 1
3 2
3 4
I want to add another column, as in the following table, that counts the occurrences of Choice 1 within each ID, with the first 1 counted as 0.
ID Choice BUS
1 1 0 (The first 1 will be considered as 0)
1 1 1
1 2 1
1 5 1
1 1 2
2 1 0
2 1 1
2 5 1
2 1 2
2 1 3
3 3 0
3 1 0
3 1 1
3 2 1
3 4 1
with(df, ave(Choice == 1, ID, FUN = cumsum))
This almost gives you what you want, but since you want the first 1 to count as 0, it needs some modification:
df$BUS <- with(df, ave(Choice == 1, ID, FUN = function(x) {
inds = cumsum(x)
ifelse(inds > 0, inds - 1, inds)
}))
df
# ID Choice BUS
#1 1 1 0
#2 1 1 1
#3 1 2 1
#4 1 5 1
#5 1 1 2
#6 2 1 0
#7 2 1 1
#8 2 5 1
#9 2 1 2
#10 2 1 3
#11 3 3 0
#12 3 1 0
#13 3 1 1
#14 3 2 1
#15 3 4 1
Here we subtract 1 from the cumulative sum once the first 1 has appeared; before that, the value stays at 0.
Using the same logic in dplyr:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(inds = cumsum(Choice == 1),
BUS = ifelse(inds > 0, inds - 1, inds)) %>%
select(-inds)
We can also use data.table
library(data.table)
setDT(df1)[, BUS := pmax(0, cumsum(Choice == 1)-1), ID]
df1
# ID Choice BUS
# 1: 1 1 0
# 2: 1 1 1
# 3: 1 2 1
# 4: 1 5 1
# 5: 1 1 2
# 6: 2 1 0
# 7: 2 1 1
# 8: 2 5 1
# 9: 2 1 2
#10: 2 1 3
#11: 3 3 0
#12: 3 1 0
#13: 3 1 1
#14: 3 2 1
#15: 3 4 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L), Choice = c(1L, 1L, 2L, 5L, 1L, 1L, 1L, 5L,
1L, 1L, 3L, 1L, 1L, 2L, 4L)), class = "data.frame", row.names = c(NA,
-15L))

Creating new columns from a data.frame

I have a dataset in long format in which measurements (Time) are nested in network partners (NP), which are nested in persons (ID). Here is an example of what it looks like (the real dataset has thousands of rows):
ID NP Time Outcome
1 11 1 4
1 11 2 3
1 11 3 NA
1 12 1 2
1 12 2 3
1 12 3 3
2 21 1 2
2 21 2 NA
2 21 3 NA
2 22 1 4
2 22 2 4
2 22 3 4
Now I would like to create 3 new variables:
a) The number of network partners (with no NA in the outcome at this measurement) a specific person (ID) has at Time 1
b) The number of network partners (with no NA in the outcome at this measurement) a specific person (ID) has at Time 2
c) The number of network partners (with no NA in the outcome at this measurement) a specific person (ID) has at Time 3
So I would like to create a dataset like this:
ID NP Time Outcome NP.T1 NP.T2 NP.T3
1 11 1 4 2 2 1
1 11 2 3 2 2 1
1 11 3 NA 2 2 1
1 12 1 2 2 2 1
1 12 2 3 2 2 1
1 12 3 3 2 2 1
2 21 1 2 2 1 1
2 21 2 NA 2 1 1
2 21 3 NA 2 1 1
2 22 1 4 2 1 1
2 22 2 4 2 1 1
2 22 3 4 2 1 1
I would really appreciate your help.
You can just create one variable rather than three. I am using ddply from the plyr package for that.
mydata<-structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), NP = c(11L, 11L, 11L, 12L, 12L, 12L, 21L, 21L, 21L,
22L, 22L, 22L), Time = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,
1L, 2L, 3L), Outcome = c(4L, 3L, NA, 2L, 3L, 3L, 2L, NA, NA,
4L, 4L, 4L)), .Names = c("ID", "NP", "Time", "Outcome"), class = "data.frame", row.names = c(NA,
-12L))
library(plyr)
# Count the non-missing outcomes within each ID/Time group
mydata1 <- ddply(mydata, .(ID, Time), transform, NP.T = sum(!is.na(Outcome)))
mydata1
ID NP Time Outcome NP.T
1 1 11 1 4 2
2 1 12 1 2 2
3 1 11 2 3 2
4 1 12 2 3 2
5 1 11 3 NA 1
6 1 12 3 3 1
7 2 21 1 2 2
8 2 22 1 4 2
9 2 21 2 NA 1
10 2 22 2 4 1
11 2 21 3 NA 1
12 2 22 3 4 1
Updated: You can also use interaction to create a single variable (comb) that combines ID and Time:
mydata1 <- ddply(mydata, .(ID, Time), transform, NP.T = sum(!is.na(Outcome)), comb = interaction(ID, Time))
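If the three separate columns from the question (NP.T1, NP.T2, NP.T3) are really needed, one possible base R sketch is to tabulate the non-missing outcomes by ID and Time and then look the counts up per row. The names counts and idx are introduced here for illustration; the column names follow the question's desired output.

```r
# Same data as in the question.
mydata <- data.frame(
  ID      = rep(1:2, each = 6),
  NP      = rep(c(11L, 12L, 21L, 22L), each = 3),
  Time    = rep(1:3, times = 4),
  Outcome = c(4L, 3L, NA, 2L, 3L, 3L, 2L, NA, NA, 4L, 4L, 4L)
)

# ID-by-Time matrix holding the number of non-missing outcomes.
counts <- tapply(!is.na(mydata$Outcome), list(mydata$ID, mydata$Time), sum)

# Row names of the matrix are the IDs as character strings.
idx <- as.character(mydata$ID)

# One new column per time point, filled with that person's count.
for (tm in colnames(counts)) {
  mydata[[paste0("NP.T", tm)]] <- unname(counts[idx, tm])
}

print(mydata)
```

On the example data this reproduces the counts in the question's desired table: 2, 2, 1 for ID 1 and 2, 1, 1 for ID 2.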

Resources