Transform subject ID across groups that vary in size

An MWE is as follows.
I have 3 groups with 2, 4, and 3 subjects, respectively. So I have:
library(dplyr)
Group <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
Subject_ID <- c(1, 2, 1, 2, 3, 4, 1, 2, 3)
df <- rbind(Group, Subject_ID)
Since subjects in different groups are different people, I want the subject ID to be unique for each subject in the dataset. What I did was as follows:
Num_Subjects <- c(length(unique(filter(df, Group == 1)$Subject)),
length(unique(filter(df, Group == 2)$Subject)),
length(unique(filter(df, Group == 3)$Subject))
)
# Then I defined a summation function to calculate how many subjects there are in all previous groups.
sumfun <- function(x,start,end){
return(sum(x[start:end]))
}
# Then I defined another function that generates a new subject ID for each subject in each group.
SubjIDFn <- function(x, i) {
x %>% filter(Session == i) %>% mutate(
Sujbect = Subject + sumfun(Num_Subjects, 1, i-1)
)
}
# Then I loop this from group 2 to group 3,
for (i in 2:3) {
df.Corruption.WithoutS1 <- SubjIDFn(df.Corruption.WithoutS1, i)
}
After that, the data set has zero observations. I don't know what went wrong, or what a smarter solution to this problem would be. Thanks for your help!

I think you're overcomplicating it a bit. If Subject_ID is unique within groups, you can simply use:
library(dplyr)
Group <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
Subject_ID <- c(1, 2, 1 ,2, 3, 4, 1, 2, 3)
df <- bind_cols(Group=Group, Subject_ID=Subject_ID)
df %>% mutate(unique_id = paste(Group, Subject_ID, sep="."))
# A tibble: 9 x 3
Group Subject_ID unique_id
<dbl> <dbl> <chr>
1 1 1 1.1
2 1 2 1.2
3 2 1 2.1
4 2 2 2.2
5 2 3 2.3
6 2 4 2.4
7 3 1 3.1
8 3 2 3.2
9 3 3 3.3
Note that I used bind_cols instead of rbind to have a dataframe instead of a matrix.
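If a numeric ID that runs consecutively across groups is preferred, which is what the offset approach in the question was aiming for, a base R sketch could look like the one below. It assumes Subject_ID runs 1..n within each group; the names n_per_group, offset, and unique_num are illustrative only. The loop in the question ends up empty because each pass overwrites the data with the result of filter(Session == i), so the second pass looks for group 3 in a table that only holds group 2.

```r
Group <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
Subject_ID <- c(1, 2, 1, 2, 3, 4, 1, 2, 3)
df <- data.frame(Group, Subject_ID)

# Number of distinct subjects in each group (named by group label)
n_per_group <- sapply(split(df$Subject_ID, df$Group),
                      function(s) length(unique(s)))

# Offset for each group = number of subjects in all earlier groups
offset <- c(0, head(cumsum(n_per_group), -1))
names(offset) <- names(n_per_group)

# Shift each within-group ID by its group's offset (assumes IDs are 1..n per group)
df$unique_num <- df$Subject_ID + unname(offset[as.character(df$Group)])
df$unique_num
```

Because the offsets are computed once and applied to the whole table, no rows are dropped along the way.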

Related

In R, How to make a function that finds if there is a matching pairs?

I want to make a function that can detect whether there are matching pairs of numbers. I want to simulate x and y many times and count the number of matches in each simulation using a function.
x<-sample(1:6,6)
y<-sample(1:6,6)
x;y
For example, I have x <- c(2, 5, 6, 4, 3, 1) and y <- c(2, 1, 6, 5, 4, 3). The numbers 2 and 6 match at the same positions, so there are 2 pairs. If there is no match between x and y, the result should be 0. I can use sum(x==y) to find the count for a single example of x and y.
How can I make a function that finds the number of identical pairs for many x and y?

You can just use
f<-function(n,k) {
sapply(1:k, \(i) sum(sample(n) == sample(n)))
}
where k is the number of iterations and n is the range (in your case 6). Note that the \(i) lambda shorthand requires R 4.1 or later.
Example Usage:
f(n=6, k=100)

In base R, the following function would do the trick. The length of each vector is given by the size argument, and the number of trials is given by n:
n_pairs <- function(size, n) {
colSums(replicate(n, sample(size)) == replicate(n, sample(size)))
}
So, for example we can see:
set.seed(1)
n_pairs(size = 6, n = 5)
#> [1] 2 0 1 1 1
hist(n_pairs(6, 100), breaks = 0:6)
mean(n_pairs(6, 1000))
#> [1] 1.013
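Note, though, that R already has the function rbinom, which gives a similar result. It is only an approximation: the mean is the same (1), but the positions in a pair of permutations are not independent, so the match counts are not exactly binomial (for instance, exactly size - 1 matches can never occur):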
rbinom(n, size, 1/size)
Created on 2022-04-26 by the reprex package (v2.0.1)
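One caveat is worth checking when swapping the simulation for rbinom: matches at different positions of two permutations are not independent, so the match count is only approximately binomial. The sketch below compares the two empirically; the seed, the number of draws, and the variable names are arbitrary choices for illustration.

```r
set.seed(42)
size <- 6

# Match counts from comparing two random permutations, 10,000 times
perm_matches <- replicate(10000, sum(sample(size) == sample(size)))

# Binomial draws with the same mean (size * 1/size = 1)
binom_draws <- rbinom(10000, size, 1 / size)

# Exactly size - 1 matches is impossible for permutations:
# if five of six positions agree, the sixth must agree too
any(perm_matches == size - 1)  # FALSE

# Compare the two empirical distributions side by side
table(factor(perm_matches, levels = 0:size))
table(factor(binom_draws, levels = 0:size))
```

If only the average number of matches matters, either approach should do; if the full distribution matters, simulating the permutations is the safer choice.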

Maybe this one (removed first answer):
x<- c(2, 5, 6, 4, 3, 1)
y<- c(2, 1, 6, 5, 4, 3)
lst = list(x,y)
pairs <- outer(lst,lst,Vectorize(function(x,y){x[x==y]}))
pairs[1,2]
[[1]]
[1] 2 6

A possible solution with the dplyr package:
require(tidyverse)
x <- c(2, 5, 6, 4, 3, 1)
y <- c(2, 1, 6, 5, 4, 3)
df <- tibble(x = x,
y = y) %>%
mutate(pair = case_when(x == y ~ "PAIR",
TRUE ~ "NOT"))
The dataset:
# A tibble: 6 x 3
x y pair
<dbl> <dbl> <chr>
1 2 2 PAIR
2 5 1 NOT
3 6 6 PAIR
4 4 5 NOT
5 3 4 NOT
6 1 3 NOT
Filtering:
df %>%
filter(pair == "PAIR")
Output:
# A tibble: 2 x 3
x y pair
<dbl> <dbl> <chr>
1 2 2 PAIR
2 6 6 PAIR

Will this give you what you want? Make a table out of the values that are paired.
table(x[x==y])
x <- sample(1:6,1000, TRUE)
y <- sample(1:6,1000, TRUE)
table(x[x==y])
# 1 2 3 4 5 6
# 37 26 32 28 30 33

using mutate_at with the in operator %in%

I have a data frame with a few variables to reverse code. I have a separate vector that has all the variables to reverse code. I'd like to use mutate_at(), or some other tidy way, to reverse code them all in one line of code. Here's the dataset and the vector of items to reverse
library(tidyverse)
mock_data <- tibble(id = 1:5,
item_1 = c(1, 5, 3, 5, 5),
item_2 = c(4, 4, 4, 1, 1),
item_3 = c(5, 5, 5, 5, 1))
reverse <- c("item_2", "item_3")
Here's what I want it to look like with only items 2 and 3 reverse coded:
library(tidyverse)
solution <- tibble(id = 1:5,
item_1 = c(1, 5, 3, 5, 5),
item_2 = c(2, 2, 2, 5, 5),
item_3 = c(1, 1, 1, 1, 5))
I've tried the code below. I know that the recode is correct because I've used it for other datasets, but I know something is off with the %in% operator.
library(tidyverse)
mock_data %>%
mutate_at(vars(. %in% reverse), ~(recode(., "1=5; 2=4; 3=3; 4=2; 5=1")))
Error: `. %in% reverse` must evaluate to column positions or names, not a logical vector
Any help would be appreciated!

You can give reverse directly to mutate_at, no need for vars(. %in% reverse). And I would simplify the reversing as 6 minus the current value.
mock_data %>% mutate_at(reverse, ~6 - .)
# # A tibble: 5 x 4
# id item_1 item_2 item_3
# <int> <dbl> <dbl> <dbl>
# 1 1 1 2 1
# 2 2 5 2 1
# 3 3 3 2 1
# 4 4 5 5 1
# 5 5 5 5 5
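If there's a possibility that reverse includes columns that are not in mock_data, and you want to skip those, use mutate_at(vars(one_of(reverse)), ...). In current versions of tidyselect, any_of(reverse) is the preferred replacement for one_of(reverse).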
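Since dplyr 1.0.0, the scoped verbs such as mutate_at() are superseded by across(). A minimal sketch of the same reverse coding in that style, assuming a 1 to 5 scale as in the question (the name reversed is illustrative):

```r
library(dplyr)

mock_data <- tibble(id = 1:5,
                    item_1 = c(1, 5, 3, 5, 5),
                    item_2 = c(4, 4, 4, 1, 1),
                    item_3 = c(5, 5, 5, 5, 1))
reverse <- c("item_2", "item_3")

# all_of() selects the columns named in the character vector `reverse`;
# 6 - .x reverse codes a 1-5 scale
reversed <- mock_data %>%
  mutate(across(all_of(reverse), ~ 6 - .x))

reversed
```

all_of() errors if a name in reverse is missing from the data, while any_of() would skip missing names silently.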

Per group, select first row and another which matches a condition

Let's say I have the following data.table:
x <- data.table(a = c(1, 3, 2, 2, 4, 3, 7, 10, 9, 8),
b = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3))
And, after grouping by b, I want to select rows which:
- are the first row of the group
- have the highest a in the group
If a single row satisfies both conditions, it should only be selected once (the group will only contain one row).
Each of these selections is trivial:
x[, .SD[1], by = b] # selects first row per group
# b a
# 1: 1 1
# 2: 2 2
# 3: 3 10
x[, .SD[which.max(a)], by = b] # selects row with the highest 'a' in the group
# b a
# 1: 1 3
# 2: 2 7
# 3: 3 10
But I can't figure out how to do both at once (obviously .SD[1 | which.max(a)] doesn't work). I could perform them separately and then rbindlist the final result, but I'd like to know if there's a simpler way.
For clarity, in the case above, the expected output would be (different order is also acceptable):
b a
1: 1 1
2: 1 3
3: 2 2
4: 2 7
5: 3 10

One option is to concatenate the index 1 (for the first row) with the index returned by which.max (also a numeric index), take the unique of that (in case which.max also returns 1), and use the result to subset the data.table (.SD):
x[, .SD[unique(c(1, which.max(a)))], by = b]
# b a
#1: 1 1
#2: 1 3
#3: 2 2
#4: 2 7
#5: 3 10
Or use .I
x[x[, .I[unique(c(1, which.max(a)))], by = b]$V1]

Here is how I would do it in dplyr:
library(dplyr)
x <- data.frame(a = c(1, 3, 2, 2, 4, 3, 7, 10, 9, 8),
b = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3))
x %>% group_by(b) %>% filter(row_number() == 1 | a == max(a))
Output
# a b
#1: 1 1
#2: 3 1
#3: 2 2
#4: 7 2
#5: 10 3
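One difference from the which.max-based answers may matter when the maximum is tied: a == max(a) keeps every tied row, while which.max(a) keeps only the first. The sketch below uses a small hypothetical data set (ties) to show both behaviours.

```r
library(dplyr)

# Hypothetical data with a tie for the maximum in group 1
ties <- data.frame(a = c(1, 3, 3, 2, 5),
                   b = c(1, 1, 1, 2, 2))

# a == max(a) keeps every row tied for the maximum
all_max <- ties %>%
  group_by(b) %>%
  filter(row_number() == 1 | a == max(a))

# which.max(a) returns only the first maximum, as in the data.table answers
first_max <- ties %>%
  group_by(b) %>%
  filter(row_number() %in% c(1, which.max(a)))

nrow(all_max)    # 5
nrow(first_max)  # 4
```

Which variant is right depends on whether tied maxima should all be reported.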

If you only have those two columns, just take the union of the two tables:
funion(
x[, lapply(.SD, max), by=b],
x[, lapply(.SD, first), by=b]
)
I guess max is more efficient than your which.max, since it is optimized (see ?GForce).

Summarize different Columns with different Functions

I have the following problem: I have a data frame with many rows and columns, the first column being the date. For each date I have more than one observation, and I want to summarize them.
My df looks like that (date replaced by ID for ease of use):
df:
ID Cash Price Weight ...
1 0.4 0 0
1 0.2 0 82 ...
1 0 1 0 ...
1 0 3.2 80 ...
2 0.3 1 70 ...
... ... ... ... ...
I want to group them by the first column and then summarize all rows BUT with different functions:
The function for Cash and Price should be sum, so I get the sum of Cash and Price for each ID. The function for Weight should be max, so I only get the maximum weight for each ID.
Because I have so many columns, I cannot write all the functions by hand, but only 2 columns should be summarized by max; the rest should be summarized by sum.
So I am looking for a function to group by ID, summarize all with sum except 2 different columns which I need the max value.
I tried to use the dplyr package with:
df %>% group_by(ID = tolower(ID)) %>% summarise_each(funs(sum))
But I additionally need to take the max rather than the sum of the 2 specified columns. Any ideas?
To be clear, the output of the example df should be:
ID Cash Price Weight
1 0.6 4.2 82
2 0.3 1 70

As of dplyr 1.0.0 you can use across():
tribble(
~ID, ~max1, ~max2, ~sum1, ~sum2, ~sum3,
1, 1, 1, 1, 2, 3,
1, 2, 3, 1, 2, 3,
2, 1, 1, 1, 2, 3,
2, 3, 4, 2, 3, 4,
3, 1, 1, 1, 2, 3,
3, 4, 5, 3, 4, 5,
3, NA, NA, NA, NA, NA
) %>%
group_by(ID) %>%
summarize(
across(matches("max1|max2"), ~ max(.x, na.rm = TRUE)),
across(!matches("max1|max2"), ~ sum(.x, na.rm = TRUE))
)
# ID max1 max2 sum1 sum2 sum3
# 1 2 3 2 4 6
# 2 3 4 3 5 7
# 3 4 5 4 6 8
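Applied to the column names from the question, the same idea might look like the sketch below. The data are the five rows visible in the question's example, and max_cols is an illustrative name for the vector of columns to summarize with max; everything else is summed.

```r
library(dplyr)

df <- tibble(ID     = c(1, 1, 1, 1, 2),
             Cash   = c(0.4, 0.2, 0, 0, 0.3),
             Price  = c(0, 0, 1, 3.2, 1),
             Weight = c(0, 82, 0, 80, 70))

# Columns to summarize with max; all other non-grouping columns use sum
max_cols <- c("Weight")

out <- df %>%
  group_by(ID) %>%
  summarise(
    across(all_of(max_cols), max),
    across(!all_of(max_cols), sum)
  )

out
```

Only max_cols has to be maintained by hand, which keeps this workable when there are many sum columns.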

We can use
df %>%
group_by(ID) %>%
summarise(Cash = sum(Cash), Price = sum(Price), Weight = max(Weight))
If we have many columns, one way would be to do this separately and then join the output together.
df1 <- df %>%
group_by(ID) %>%
summarise_each(funs(sum), Cash:Price)
df2 <- df %>%
group_by(ID) %>%
summarise_each(funs(max), Weight)
inner_join(df1, df2, by = "ID")
# ID Cash Price Weight
# (int) (dbl) (dbl) (int)
#1 1 0.6 4.2 82
#2 2 0.3 1.0 70

Or do it without grouping twice:
library(dplyr)
set.seed(1492)
df <- data.frame(id=rep(c(1,2), 3),
cash=rnorm(6, 0.5, 0.1),
price=rnorm(6, 0.5, 0.1)*6,
weight=sample(100, 6))
df
## id cash price weight
## 1 1 0.4410152 2.484082 10
## 2 2 0.4101343 3.032529 93
## 3 1 0.3375889 2.305076 58
## 4 2 0.6047922 3.248851 55
## 5 1 0.4721711 3.209930 34
## 6 2 0.5362493 2.331530 99
custom_summarise <- function(do_df) {
return(bind_cols(
summarise_each(select(do_df, -weight), funs(sum)),
summarise_each(select(do_df, weight), funs(max))
))
}
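# Note: id is not excluded from the sum, so the id column in the output
# below holds 3 and 6 (the summed ids) rather than 1 and 2
group_by(df, id) %>% do(custom_summarise(.))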
## Source: local data frame [2 x 4]
## Groups: id [2]
##
## id cash price weight
## (dbl) (dbl) (dbl) (int)
## 1 3 1.250775 7.999089 58
## 2 6 1.551176 8.612910 99

library(data.table)
setDT(df)
df[,.(Cash = sum(Cash),Price = sum(Price),Weight = max(Weight)),by=ID]
One way of doing this for 90+ columns can be:
max_col <- 'Weight'
# Exclude the grouping column as well, so that ID is not summed
sum_col <- setdiff(colnames(df), c('ID', max_col))
query_1 <- paste0(sum_col,' = sum(',sum_col,')')
query_2 <- paste0(max_col,' = max(',max_col,')')
query_3 <- paste(query_1,collapse=',')
query_4 <- paste(query_2,collapse=',')
query_5 <- paste(query_3,query_4,sep=',')
final_query <- paste0('df[,.(',query_5,'),by = ID]')
eval(parse(text = final_query))
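Building the call as a string works, but the same result can probably be had without eval(parse()) by aggregating the two column sets separately with .SDcols and merging on ID. A sketch under that assumption, using the five rows visible in the question's example (sums, maxs, and out are illustrative names):

```r
library(data.table)

df <- data.table(ID     = c(1, 1, 1, 1, 2),
                 Cash   = c(0.4, 0.2, 0, 0, 0.3),
                 Price  = c(0, 0, 1, 3.2, 1),
                 Weight = c(0, 82, 0, 80, 70))

max_col <- "Weight"
# Everything except the grouping column and the max columns is summed
sum_col <- setdiff(names(df), c("ID", max_col))

# Aggregate each set of columns with its own function, then join on ID
sums <- df[, lapply(.SD, sum), by = ID, .SDcols = sum_col]
maxs <- df[, lapply(.SD, max), by = ID, .SDcols = max_col]
out  <- merge(sums, maxs, by = "ID")

out
```

This costs two grouping passes instead of one, in exchange for avoiding code generation from strings.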

Here is a solution based on this comment on an issue in the dplyr repo. I think it is general enough to be applied to more complicated cases.
library(tidyverse)
df <- tribble(
~ID, ~Cash, ~Price, ~Weight,
#----------------------
'a', 4, 6, 8,
'a', 7, 3, 0,
'a', 7, 9, 0,
'b', 2, 8, 8,
'b', 5, 1, 8,
'b', 8, 0, 1,
'c', 2, 1, 1,
'c', 3, 8, 0,
'c', 1, 9, 1
)
out <- list(.vars=lst(vars(-Weight), vars(Weight)),
.funs=lst(sum, max))%>%
pmap(~df%>%group_by(ID)%>%summarise_at(.x, .y)) %>%
reduce(inner_join)
out
# A tibble: 3 x 4
# ID Cash Price Weight
# <chr> <dbl> <dbl> <dbl>
# 1 a 18 18 8
# 2 b 15 9 8
# 3 c 6 18 1
You should specify the variables in the first lst (e.g. vars(-Weight), vars(Weight)) and the respective functions to be applied in the second lst (sum, max). The .x in the summarise_at call refers to the elements of the variable lst, and .y refers to the elements of the function lst.

Row-wise sum for columns with certain names

I have a sample data:
SampleID a b d f ca k l cb
1 0.1 2 1 2 7 1 4 3
2 0.2 3 2 3 4 2 5 5
3 0.5 4 3 6 1 3 9 2
I need to find the row-wise sum of columns whose names have something in common, e.g. row-wise sum(a, ca) or row-wise sum(b, cb). The problem is that I have a large data.frame, so ideally I would specify what the column headers have in common and the code would pick only those columns to sum.
Thank you in advance for any assistance.

We can select the columns that have 'a' with grep, subset the columns and do rowSums and the same with 'b' columns.
rowSums(df1[grep('a', names(df1)[-1])+1])
rowSums(df1[grep('b', names(df1)[-1])+1])
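With many name patterns, the two calls above can be folded into one loop. The sketch below recreates the sample data as df1 and assumes the shared part is a suffix, so the pattern is anchored with "$"; keys, value_cols, and row_sums are illustrative names.

```r
df1 <- data.frame(SampleID = 1:3,
                  a  = c(0.1, 0.2, 0.5),
                  b  = c(2, 3, 4),
                  d  = c(1, 2, 3),
                  f  = c(2, 3, 6),
                  ca = c(7, 4, 1),
                  k  = c(1, 2, 3),
                  l  = c(4, 5, 9),
                  cb = c(3, 5, 2))

# Shared name endings to sum over (assumed to be suffixes)
keys <- c("a", "b")

# Leave out the ID column so that "SampleID" is never matched
value_cols <- names(df1)[-1]

# One column of row sums per key
row_sums <- sapply(keys, function(key) {
  cols <- value_cols[grepl(paste0(key, "$"), value_cols)]
  rowSums(df1[, cols, drop = FALSE])
})

row_sums
```

The result is a matrix with one column per key, which can be attached to the original data with cbind if needed.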

If you want the output as a data frame, try using dplyr:
# Recreating your sample data
df <- data.frame(SampleID = c(1, 2, 3),
a = c(0.1, 0.2, 0.5),
b = c(2, 3, 4),
d = c(1, 2, 3),
f = c(2, 3, 6),
ca = c(7, 4, 1),
k = c(1, 2, 3),
l = c(4, 5, 9),
cb = c(3, 5, 2))
Process the data
# load dplyr
library(dplyr)
# Sum across columns 'a' and 'ca' (sum(a, ca))
df2 <- df %>%
select(contains('a'), -SampleID) %>% # 'select' function to choose the columns you want
mutate(row_sum = rowSums(.)) # 'mutate' function to create a new column 'row_sum' with the sum of the selected columns. You can drop the selected columns by using 'transmute' instead.
df2 # have a look
a ca row_sum
1 0.1 7 7.1
2 0.2 4 4.2
3 0.5 1 1.5
