Combining Using Two Columns - r

A very basic question about merging interchangeable columns.
Say I have a table
subject stim1 stim2 Chosen Tchosen
<dbl> <int> <int> <int> <int>
1 1 1 2 1 4
2 1 1 2 2 15
3 1 1 3 1 2
4 1 1 3 3 13
5 1 2 1 1 2
6 1 2 1 2 13
7 1 2 3 2 3
where stim1 and stim2 are interchangeable (stim1 = 1, stim2 = 2 is equivalent to stim1 = 2, stim2 = 1).
What is the simplest way to merge the data so that Tchosen is summed across the two equivalent orderings, while Chosen and subject are kept distinct?
Desired output
subject stim1 stim2 Chosen Tchosen
<dbl> <int> <int> <int> <int>
1 1 1 2 1 6
2 1 1 2 2 28
3 1 1 3 1 4
4 1 1 3 3 28
5 1 2 3 2 3
6 1 2 3 3 12...
Thank you

Here is a data.table approach. I could not reproduce your desired output, since it contains values that are not in your sample data.
library(data.table)
DT <- fread(" subject stim1 stim2 Chosen Tchosen
1 1 2 1 4
1 1 2 2 15
1 1 3 1 2
1 1 3 3 13
1 2 1 1 2
1 2 1 2 13
1 2 3 2 3")
# Switch values of stim2 and stim1 if stim2 < stim1
DT[stim2 < stim1, `:=`(stim1 = stim2, stim2 = stim1)]
# Now summarise and sum
DT[, .(Tchosen = sum(Tchosen, na.rm = TRUE)), by = .(subject,stim1, stim2, Chosen)]
# subject stim1 stim2 Chosen Tchosen
# 1: 1 1 2 1 6
# 2: 1 1 2 2 28
# 3: 1 1 3 1 2
# 4: 1 1 3 3 13
# 5: 1 2 3 2 3

A base R option using merge + rowSums
transform(
merge(df,
df,
by.x = c("subject", "stim1", "stim2", "Chosen"),
by.y = c("subject", "stim2", "stim1", "Chosen"),
all.x = TRUE
),
Tchosen = rowSums(cbind(Tchosen.x, Tchosen.y), na.rm = TRUE)
)
which gives
subject stim1 stim2 Chosen Tchosen.x Tchosen.y Tchosen
1 1 1 2 1 4 2 6
2 1 1 2 2 15 13 28
3 1 1 3 1 2 NA 2
4 1 1 3 3 13 NA 13
5 1 2 1 1 2 4 6
6 1 2 1 2 13 15 28
7 1 2 3 2 3 NA 3
The NA values probably arise because the data in your post are incomplete: those rows have no counterpart with stim1 and stim2 swapped.
Data
> dput(df)
structure(list(subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), stim1 = c(1L,
1L, 1L, 1L, 2L, 2L, 2L), stim2 = c(2L, 2L, 3L, 3L, 1L, 1L, 3L
), Chosen = c(1L, 2L, 1L, 3L, 1L, 2L, 2L), Tchosen = c(4L, 15L,
2L, 13L, 2L, 13L, 3L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7"))

You can use pmin/pmax to sort stim1 and stim2 columns and calculate sum for each group.
aggregate(Tchosen~., transform(df, stim1 = pmin(stim1, stim2),
stim2 = pmax(stim1, stim2)), sum)
# subject stim1 stim2 Chosen Tchosen
#1 1 1 2 1 6
#2 1 1 3 1 2
#3 1 1 2 2 28
#4 1 2 3 2 3
#5 1 1 3 3 13
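The pmin/pmax idea can also be wrapped in a small function. The following is only a sketch in base R: the helper name sum_unordered_pairs is made up for illustration, and the data frame is the sample data from the question.

```r
# Sample data copied from the question
df <- data.frame(
  subject = rep(1L, 7),
  stim1   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  stim2   = c(2L, 2L, 3L, 3L, 1L, 1L, 3L),
  Chosen  = c(1L, 2L, 1L, 3L, 1L, 2L, 2L),
  Tchosen = c(4L, 15L, 2L, 13L, 2L, 13L, 3L)
)

# Hypothetical helper: put the smaller stimulus first, then sum per group
sum_unordered_pairs <- function(d) {
  # Compute both new columns from the ORIGINAL values before overwriting either
  lo <- pmin(d$stim1, d$stim2)
  hi <- pmax(d$stim1, d$stim2)
  d$stim1 <- lo
  d$stim2 <- hi
  aggregate(Tchosen ~ subject + stim1 + stim2 + Chosen, data = d, FUN = sum)
}

res <- sum_unordered_pairs(df)
res
```

Computing lo and hi up front matters if this is ported to dplyr: transform() evaluates every expression against the original columns, whereas mutate() evaluates them in sequence, so a second expression there would see the already overwritten stim1.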

Related

Pasting values from a vector to a new column in a for loop with nested data

I have a dataframe that currently looks like this:
subjectID Trial
1 3
1 3
1 3
1 4
1 4
1 5
1 5
1 5
2 1
2 1
2 3
2 3
2 3
2 5
2 5
2 6
3 1
Etc., where trial number is nested under subject ID. I need to make a new column, "NewTrial", that simply gives the order in which the trials now appear within each subject. For example:
subjectID Trial NewTrial
1 3 1
1 3 1
1 3 1
1 4 2
1 4 2
1 5 3
1 5 3
1 5 3
2 1 1
2 1 1
2 3 2
2 3 2
2 3 2
2 5 3
2 5 3
2 6 4
3 1 1
So far, I have a for-loop written that looks like this:
for (myperson in unique(data$subjectID)){
#This line creates a vector of the number of unique trials per subject: for subject 1, c(1, 2, 3)
triallength=1:length(unique(data$Trial[data$subID==myperson]))
I'm having trouble now finding a way to paste the numbers from the created triallength vector as a column in the dataframe. Does anyone know of a way to accomplish this? I am lacking some experience with for-loops and hoping to gain more. If anyone has a tidyverse/dplyr solution, however, I am open to that as well as an alternative to a for-loop. Thanks in advance, and let me know if any clarification is needed!
Converting to factor with unique values as levels, then as.numeric in an ave should be nice.
transform(dat, NewTrial=ave(Trial, subjectID, FUN=\(x) as.numeric(factor(x, levels=unique(x)))))
# subjectID Trial NewTrial
# 1 1 3 1
# 2 1 3 1
# 3 1 3 1
# 4 1 4 2
# 5 1 4 2
# 6 1 5 3
# 7 1 5 3
# 8 1 5 3
# 9 2 1 1
# 10 2 1 1
# 11 2 3 2
# 12 2 3 2
# 13 2 3 2
# 14 2 5 3
# 15 2 5 3
# 16 2 6 4
# 17 3 1 1
Data:
dat <- structure(list(subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L), Trial = c(3L, 3L, 3L, 4L,
4L, 5L, 5L, 5L, 1L, 1L, 3L, 3L, 3L, 5L, 5L, 6L, 1L)), class = "data.frame", row.names = c(NA,
-17L))
We could use match on the unique values after grouping by 'subjectID'
library(dplyr)
df1 <- df1 %>%
group_by(subjectID) %>%
mutate(NewTrial = match(Trial, unique(Trial))) %>%
ungroup
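Since the question asked about a for-loop, the same match idea can be dropped into one. This is only a sketch in base R: the data frame is named data as in the question, and the grouping column is assumed to be subjectID.

```r
# Sample data from the question
data <- data.frame(
  subjectID = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3),
  Trial     = c(3, 3, 3, 4, 4, 5, 5, 5, 1, 1, 3, 3, 3, 5, 5, 6, 1)
)

# Create the column first so the loop can fill it in piece by piece
data$NewTrial <- NA_integer_

for (myperson in unique(data$subjectID)) {
  rows   <- data$subjectID == myperson    # logical index of this subject's rows
  trials <- data$Trial[rows]
  # Position of each trial among this subject's unique trials, in order of appearance
  data$NewTrial[rows] <- match(trials, unique(trials))
}

data
```

The key step is assigning into data$NewTrial[rows], which writes the per-subject result back into the matching rows of the full data frame.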
We could use rleid:
library(dplyr)
library(data.table)
df %>%
group_by(subjectID) %>%
mutate(NewTrial = rleid(subjectID, Trial))
subjectID Trial NewTrial
<int> <int> <int>
1 1 3 1
2 1 3 1
3 1 3 1
4 1 4 2
5 1 4 2
6 1 5 3
7 1 5 3
8 1 5 3
9 2 1 1
10 2 1 1
11 2 3 2
12 2 3 2
13 2 3 2
14 2 5 3
15 2 5 3
16 2 6 4
17 3 1 1
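One caveat when choosing between these answers: a run-length id numbers consecutive runs, while match on unique values numbers distinct values. They agree on the sample data but differ if a trial number comes back after a different one. A small sketch of the difference, using base rle to emulate the run-length id so that no package is needed (the vector trial is made up for illustration):

```r
# Hypothetical trial sequence for one subject, where trial 3 reappears
trial <- c(3, 3, 4, 3)

# Numbering by first appearance of each distinct value
by_first_appearance <- match(trial, unique(trial))

# Numbering by consecutive run (what a run-length id produces)
r <- rle(trial)
by_run <- rep(seq_along(r$lengths), r$lengths)

by_first_appearance  # 1 1 2 1
by_run               # 1 1 2 3
```

Which numbering is wanted depends on whether a repeated trial number should count as the same trial or as a new one.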

How to add new rows conditionally on R

I have a df with
v1 t1 c1 o1
1 1 9 1
1 1 12 2
1 2 2 1
1 2 7 2
2 1 3 1
2 1 6 2
2 2 3 1
2 2 12 2
And I would like to add 2 rows each time that v1 changes its value, in order to get this:
v1 t1 c1 o1
1 1 1 1
1 1 1 2
1 2 9 1
1 2 12 2
1 3 2 1
1 3 7 2
2 1 1 1
2 1 1 2
2 2 3 1
2 2 6 2
2 3 3 1
2 3 12 2
So what I'm doing is that every time v1 changes its value I'm adding 2 rows of ones and adding a 1 to the values of t1. This is kind of tricky. I've been able to do it in Excel but I would like to scale to big files in R.
We may do the expansion in group_modify
library(dplyr)
df1 %>%
group_by(v1) %>%
group_modify(~ .x %>%
slice_head(n = 2) %>%
mutate(across(-o1, ~ 1)) %>%
bind_rows(.x) %>%
mutate(t1 = as.integer(gl(n(), 2, n())))) %>%
ungroup
-output
# A tibble: 12 × 4
v1 t1 c1 o1
<int> <int> <dbl> <int>
1 1 1 1 1
2 1 1 1 2
3 1 2 9 1
4 1 2 12 2
5 1 3 2 1
6 1 3 7 2
7 2 1 1 1
8 2 1 1 2
9 2 2 3 1
10 2 2 6 2
11 2 3 3 1
12 2 3 12 2
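Or do a grouped summarise (from dplyr 1.1.0 on, reframe() is the intended verb when a group returns more than one row, and summarise warns in that case)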
df1 %>%
group_by(v1) %>%
summarise(t1 = as.integer(gl(n() + 2, 2, n() + 2)),
c1 = c(1, 1, c1), o1 = rep(1:2, length.out = n() + 2),
.groups = 'drop')
-output
# A tibble: 12 × 4
v1 t1 c1 o1
<int> <int> <dbl> <int>
1 1 1 1 1
2 1 1 1 2
3 1 2 9 1
4 1 2 12 2
5 1 3 2 1
6 1 3 7 2
7 2 1 1 1
8 2 1 1 2
9 2 2 3 1
10 2 2 6 2
11 2 3 3 1
12 2 3 12 2
data
df1 <- structure(list(v1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), t1 = c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), c1 = c(9L, 12L, 2L, 7L, 3L, 6L,
3L, 12L), o1 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))
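For comparison, a base R sketch of the same expansion using split, lapply and rbind. The helper name add_leading_rows is made up for illustration, and it assumes that adding 1 to the existing t1 values is what is wanted, as the question describes.

```r
df1 <- data.frame(
  v1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  t1 = c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L),
  c1 = c(9L, 12L, 2L, 7L, 3L, 6L, 3L, 12L),
  o1 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)
)

# Hypothetical helper: prepend two rows of ones to one v1 group and shift t1 by 1
add_leading_rows <- function(g) {
  lead <- data.frame(v1 = g$v1[1], t1 = 1L, c1 = 1L, o1 = 1:2)
  g$t1 <- g$t1 + 1L
  rbind(lead, g)
}

# Apply the helper to each v1 group and stack the pieces back together
out <- do.call(rbind, lapply(split(df1, df1$v1), add_leading_rows))
rownames(out) <- NULL
out
```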

define an indicator when number of duplicate rows -1 is equal one of the column

I have some duplicate rows that match in some columns. I want to define an indicator that is 1 if the number of duplicate rows minus 1 equals the value of one of the columns.
example
SAMPN PERNO ARR_HR HHMEM
1 1 2 1
1 2 2 1
2 1 3 2
2 3 3 2
3 1 4 2
3 2 4 2
3 3 4 2
Rows are duplicates if they match in SAMPN, ARR_HR and HHMEM (PERNO may differ). I want the indicator to be 1 if the number of duplicate rows minus 1 equals HHMEM.
For example, the first two rows are duplicates, so 2 - 1 = 1 = HHMEM and the indicator is 1.
Output
SAMPN PERNO ARR_HR HHMEM indicator
1 1 2 1 1
1 2 2 1 1
2 1 3 2 0
2 3 3 2 0
3 1 4 2 1
3 2 4 2 1
3 3 4 2 1
After grouping by 'SAMPN' and other grouping variables (from OP's comments) create the 'indicator' by coercing the logical vector ((n()- 1) == HHMEM) into binary with as.integer
library(dplyr)
df1 %>%
group_by(SAMPN, ARR_HR, HHMEM) %>%
mutate(indicator = as.integer((n()-1) == HHMEM))
# A tibble: 7 x 5
# Groups: SAMPN, ARR_HR, HHMEM [3]
# SAMPN PERNO ARR_HR HHMEM indicator
# <int> <int> <int> <int> <int>
#1 1 1 2 1 1
#2 1 2 2 1 1
#3 2 1 3 2 0
#4 2 3 3 2 0
#5 3 1 4 2 1
#6 3 2 4 2 1
#7 3 3 4 2 1
NOTE: We don't need to create any additional column and then remove it later
Or the same logic in base R with ave
df1$indicator <- +(with(df1, HHMEM == ave(HHMEM, HHMEM, SAMPN,
ARR_HR, FUN = length)-1))
Or using duplicated with table
i1 <- table(cumsum(!duplicated(df1[c(1, 3, 4)])))
as.integer(rep(i1, i1) - 1 == df1$HHMEM)
data
df1 <- structure(list(SAMPN = c(1L, 1L, 2L, 2L, 3L, 3L, 3L), PERNO = c(1L,
2L, 1L, 3L, 1L, 2L, 3L), ARR_HR = c(2L, 2L, 3L, 3L, 4L, 4L, 4L
), HHMEM = c(1L, 1L, 2L, 2L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA,
-7L))
We can use add_count to get count and compare it with HHMEM.
library(dplyr)
df %>%
add_count(SAMPN, ARR_HR, HHMEM) %>%
mutate(indicator = as.integer(n - 1 == HHMEM)) %>%
select(-n)
# SAMPN PERNO ARR_HR HHMEM indicator
# <int> <int> <int> <int> <int>
#1 1 1 2 1 1
#2 1 2 2 1 1
#3 2 1 3 2 0
#4 2 3 3 2 0
#5 3 1 4 2 1
#6 3 2 4 2 1
#7 3 3 4 2 1

how refill a column with the help of 2 other column?

I have data grouped by three variables: SAMPN, PERNO and loop.
There are two columns, mode1 and mode2, and a column called int.
SAMPN PERNO loop mode1 mode2 int
1 1 1 1 2 NA
1 1 1 2 1 NA
1 1 1 3 2 0
1 2 1 3 2 NA
1 2 1 1 1 2
2 2 1 3 2 NA
2 2 1 1 3 NA
2 2 1 3 1 0
2 2 2 1 2 NA
2 2 2 3 1 2
SAMPN is the family index, PERNO is the index of each person within a family, and loop is the tour of each person. The last row of each loop for each person is 0 or 2, and the rest of the loop is NA. For each family, person and loop, I want to copy mode1 into int if the last row of the loop is 0, and mode2 if the last row of the loop is 2.
output
SAMPN PERNO loop mode1 mode2 int
1 1 1 1 2 1
1 1 1 2 1 2
1 1 1 3 2 3
1 2 1 3 2 2
1 2 1 1 1 1
2 2 1 3 2 3
2 2 1 1 3 1
2 2 1 3 1 3
2 2 2 1 2 2
2 2 2 3 1 1
The first 3 rows are the loop of the first person in the first family; I filled that loop with mode1 because the third row was 0, and so on.
Here's a way using dplyr
df <- read.table(h=T,text="SAMPN PERNO loop mode1 mode2 int
1 1 1 1 2 NA
1 1 1 2 1 NA
1 1 1 3 2 0
1 2 1 3 2 NA
1 2 1 1 1 2
2 2 1 3 2 NA
2 2 1 1 3 NA
2 2 1 3 1 0
2 2 2 1 2 NA
2 2 2 3 1 2")
library(dplyr)
df %>%
group_by(loop, SAMPN, PERNO) %>%
mutate(int = if(last(int) == 0) mode1 else mode2) %>%
ungroup()
#> # A tibble: 10 x 6
#> SAMPN PERNO loop mode1 mode2 int
#> <int> <int> <int> <int> <int> <int>
#> 1 1 1 1 1 2 1
#> 2 1 1 1 2 1 2
#> 3 1 1 1 3 2 3
#> 4 1 2 1 3 2 2
#> 5 1 2 1 1 1 1
#> 6 2 2 1 3 2 3
#> 7 2 2 1 1 3 1
#> 8 2 2 1 3 1 3
#> 9 2 2 2 1 2 2
#> 10 2 2 2 3 1 1
If you have more values than 0 or 2, switch could be a good alternative :
df %>%
group_by(loop, SAMPN, PERNO) %>%
mutate(int = switch(
as.character(last(int)),
`0` = mode1,
`2` = mode2)) %>%
ungroup()
# same output!
We can also use case_when
library(dplyr)
df %>%
group_by(loop, SAMPN, PERNO) %>%
mutate(int = case_when(rep(last(int) == 0, n()) ~ mode1, TRUE ~mode2))
# A tibble: 10 x 6
# Groups: loop, SAMPN, PERNO [4]
# SAMPN PERNO loop mode1 mode2 int
# <int> <int> <int> <int> <int> <int>
# 1 1 1 1 1 2 1
# 2 1 1 1 2 1 2
# 3 1 1 1 3 2 3
# 4 1 2 1 3 2 2
# 5 1 2 1 1 1 1
# 6 2 2 1 3 2 3
# 7 2 2 1 1 3 1
# 8 2 2 1 3 1 3
#9 2 2 2 1 2 2
#10 2 2 2 3 1 1
data
df <- structure(list(SAMPN = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), PERNO = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), loop = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), mode1 = c(1L, 2L, 3L, 3L,
1L, 3L, 1L, 3L, 1L, 3L), mode2 = c(2L, 1L, 2L, 2L, 1L, 2L, 3L,
1L, 2L, 1L), int = c(NA, NA, 0L, NA, 2L, NA, NA, 0L, NA, 2L)),
class = "data.frame", row.names = c(NA,
-10L))
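The same logic can be sketched in base R without any packages: broadcast the last int of each group to all of its rows, then pick mode1 or mode2 row by row. The name last_int is introduced here for illustration, and the sketch assumes the last value in each group is always 0 or 2, as the question states.

```r
df <- data.frame(
  SAMPN = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  PERNO = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  loop  = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L),
  mode1 = c(1L, 2L, 3L, 3L, 1L, 3L, 1L, 3L, 1L, 3L),
  mode2 = c(2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 1L),
  int   = c(NA, NA, 0L, NA, 2L, NA, NA, 0L, NA, 2L)
)

# Broadcast the last `int` of each SAMPN / PERNO / loop group to all of its rows
last_int <- ave(df$int, df$SAMPN, df$PERNO, df$loop,
                FUN = function(x) x[length(x)])

# Copy mode1 where the group's last value is 0, otherwise mode2
df$int <- ifelse(last_int == 0, df$mode1, df$mode2)

df
```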

Creating a new variable based on the orders of existing variables using R

Hoping to create the new variable X based on three existing variables: "SubID" "Day" and "Time". I used to have three sorting functions in excel to do this manually: first sort by the "SubID," and then sort by the "Day," and lastly sort by "Time." X should be from 1 to the largest number of rows for each SubID, based on the order of Day and Time.
SubID: assigned subject number
Day: each subject's day number (1,2,3...21)
Time: 1, 2, 3
X: the number of rows marked as the same SubID
SubID Day Time X
1 1 1 1
1 1 2 2
1 1 3 3
1 2 1 4
1 2 2 5
2 1 1 1
2 1 2 2
2 1 3 3
2 2 3 6
2 2 2 5
2 2 1 4
I have been doing this manually in excel and I am sure there must be a smarter way to do it in R, but I am new to R and don't know how. Thank you in advance!
Maybe with the data.table package. You will have to install it if you haven't already; I have commented out the install command.
# install.packages("data.table")
library(data.table)
We can generate data similar to yours in the following way.
df <- data.frame(SubId=sample(1:2,10,replace=TRUE),
Day=sample(1:2,10,replace=TRUE),
Time=sample(1:2,10,replace=TRUE))
Then convert the data.frame into data.table.
setDT(df)
##> df
## SubId Day Time
## 1: 1 2 1
## 2: 1 1 1
## 3: 1 1 2
## 4: 2 2 1
## 5: 2 1 1
## 6: 1 2 2
## 7: 1 2 1
## 8: 1 2 2
## 9: 2 1 1
## 10: 2 1 2
Finally we can order by SubId, Day and Time. As the table is then ordered as we wanted, we just have to number the rows from 1 to the number of observations in each SubId.
df[order(SubId,Day,Time),X:=1:.N,SubId]
##> df
## SubId Day Time X
## 1: 1 2 1 3
## 2: 1 1 1 1
## 3: 1 1 2 2
## 4: 2 2 1 4
## 5: 2 1 1 1
## 6: 1 2 2 5
## 7: 1 2 1 4
## 8: 1 2 2 6
## 9: 2 1 1 2
## 10: 2 1 2 3
Maybe this helps
library(dplyr)
df1 %>%
group_by(SubID) %>%
mutate(X1 = row_number(as.numeric(paste0(Day, Time))))
# A tibble: 11 x 5
# Groups: SubID [2]
# SubID Day Time X X1
# <int> <int> <int> <int> <int>
# 1 1 1 1 1 1
# 2 1 1 2 2 2
# 3 1 1 3 3 3
# 4 1 2 1 4 4
# 5 1 2 2 5 5
# 6 2 1 1 1 1
# 7 2 1 2 2 2
# 8 2 1 3 3 3
# 9 2 2 3 6 6
#10 2 2 2 5 5
#11 2 2 1 4 4
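Or using order. Note that order() returns the permutation that sorts the rows, not each row's rank, so it is applied twice here to get the position of every row
df1 %>%
group_by(SubID) %>%
mutate(X1 = order(order(Day, Time)))
Or with data.table
library(data.table)
setDT(df1)[, X1 := order(order(Day, Time)), by = SubID]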
data
df1 <- structure(list(SubID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), Day = c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L),
Time = c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 3L, 2L, 1L), X = c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 6L, 5L, 4L)), class = "data.frame",
row.names = c(NA,
-11L))
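A note on order() for this task: it returns the indices that would sort the data, which is not the same as the rank of each row. The two happen to coincide on the sample data above, but not in general. A minimal base R sketch with made-up Day and Time values shows the difference:

```r
# Hypothetical values for one subject, deliberately not in sorted order
Day  <- c(2, 1, 1)
Time <- c(1, 1, 2)

# Indices that would sort the rows: row 2 first, then row 3, then row 1
sort_idx <- order(Day, Time)

# Position of each row in the sorted order (the X the question asks for)
row_rank <- order(sort_idx)

sort_idx  # 2 3 1
row_rank  # 3 1 2
```

Applying order() to the result of order() inverts the permutation, which turns "which row comes k-th" into "which place does row k take".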

Resources