Organizing a data frame with multiple entries per sample - r

I have the following data frame with several entries per individual:
record_id <- c(21,21,21,15,15,15,2,2,2,2,3,3,3)
var <- c(0,0,0,1,0,0,1,1,0,0,1,1,0)
data <- data.frame(cbind(record_id, var))
I want to create a new data frame with just one row per record_id. The condition is that if an individual (record_id) has any row with data$var == 1, the output data frame must show 1 for that record_id.
So, the outcome would be like this:
record_id <- c(21,15,2,3)
var <- c(0,1,1,1)
data_sol <- data.frame(cbind(record_id, var))
I have tried this:
DF1 <- data %>%
  group_by(record_id) %>%
  mutate(class = ifelse(var == 1, 1, 0)) %>%
  ungroup()
I know it's not the best way; I was planning to take the unique values afterwards... but it did not do the trick.

If your 'var' is all zeroes or ones, you can also use max():
data %>%
  group_by(record_id) %>%
  summarise(new_var = max(var))
# A tibble: 4 x 2
  record_id new_var
      <dbl>   <dbl>
1         2       1
2         3       1
3        15       1
4        21       0
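A closely related option (my addition, not part of the answer above) is any(), which expresses the "contains a 1" condition directly; a minimal sketch, assuming var only needs to be tested against 1:
library(dplyr)
# TRUE if any row in the group has var == 1, coerced to 0/1
data %>%
  group_by(record_id) %>%
  summarise(var = as.integer(any(var == 1)))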

You can use mean() inside mutate() to detect whether any non-zero value exists within a group, like:
data %>%
  group_by(record_id) %>%
  mutate(var = ifelse(mean(var) != 0, 1, 0)) %>%
  distinct(record_id, var)
gives,
# A tibble: 4 x 2
# Groups:   record_id [4]
#   record_id   var
#       <dbl> <dbl>
# 1        21     0
# 2        15     1
# 3         2     1
# 4         3     1

We can do
library(dplyr)
data %>%
  group_by(record_id) %>%
  summarise(var = +(mean(var) != 0))
Or using slice_max():
data %>%
  group_by(record_id) %>%
  # with_ties = FALSE keeps a single row per group even when var is tied
  slice_max(order_by = var, n = 1, with_ties = FALSE)

Related

Creating counts of subset with dplyr

I'm trying to summarize a data set with not only total counts per group, but also counts of subsets. So starting with something like this:
df <- data.frame(
  Group = c('A','A','B','B','B'),
  Size = c('Large','Large','Large','Small','Small')
)
df_summary <- df %>%
  group_by(Group) %>%
  summarize(group_n = n())
I can get a summary of the number of observations for each group:
> df_summary
# A tibble: 2 x 2
  Group group_n
  <chr>   <int>
1 A           2
2 B           3
Is there any way I can add some sort of subsetting information to n() to get, say, a count of how many observations per group were Large in this example? In other words, ending up with something like:
  Group group_n Large_n
1     A       2       2
2     B       3       1
Thank you!
We could use count:
count(xyz) is shorthand for group_by(xyz) %>% summarise(n = n()).
library(dplyr)
df %>%
  count(Group, Size)
  Group  Size n
1     A Large 2
2     B Large 1
3     B Small 2
OR
library(dplyr)
library(tidyr)
df %>%
  count(Group, Size) %>%
  pivot_wider(names_from = Size, values_from = n)
  Group Large Small
  <chr> <int> <int>
1 A         2    NA
2 B         1     2
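If you prefer 0 instead of NA for combinations that never occur, pivot_wider() accepts a values_fill argument; a small variation on the code above:
library(dplyr)
library(tidyr)
df %>%
  count(Group, Size) %>%
  pivot_wider(names_from = Size, values_from = n, values_fill = 0)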
I approach this problem using an ifelse and a sum:
df_summary <- df %>%
  group_by(Group) %>%
  summarize(group_n = n(),
            Large_n = sum(ifelse(Size == "Large", 1, 0)))
The last line turns Size into a binary indicator taking the value 1 if Size == "Large" and 0 otherwise. Summing this indicator is equivalent to counting the number of rows with "Large".
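Since logicals are coerced to 0/1 when summed, the ifelse() can also be dropped; a terser equivalent (assuming Size contains no NAs):
library(dplyr)
df_summary <- df %>%
  group_by(Group) %>%
  summarize(group_n = n(),
            Large_n = sum(Size == "Large"))  # sum of a logical counts the TRUEs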
df_summary <- df %>%
  group_by(Group) %>%
  mutate(group_n = n()) %>%
  ungroup() %>%
  group_by(Group, Size) %>%
  mutate(Large_n = n()) %>%
  ungroup() %>%
  distinct(Group, .keep_all = TRUE)
# A tibble: 2 x 4
  Group Size  group_n Large_n
  <chr> <chr>   <int>   <int>
1 A     Large       2       2
2 B     Large       3       1

How to sum a set of columns grouped by one column

I have a data frame like so:
ID <- c('John', 'Bill', 'Alice','Paulina')
Type1 <- c(1,1,0,1)
Type2 <- c(0,1,1,0)
cluster <- c(1,2,3,1)
test <- data.frame(ID, Type1, Type2, cluster)
I want to group by cluster and sum the values in all the other columns apart from ID that should be dropped.
I achieved it through
test.sum <- test %>%
  group_by(cluster) %>%
  summarise(sum(Type1), sum(Type2))
However, I have thousands of types and I can't write out each column in summarise manually. Can you help me?
This is where across() and contains() come in incredibly useful to select the columns you want to summarise across:
test %>%
  group_by(cluster) %>%
  summarise(across(contains("Type"), sum))
  cluster Type1 Type2
    <dbl> <dbl> <dbl>
1       1     2     0
2       2     1     1
3       3     0     1
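If the columns to sum do not share a common prefix, other selection helpers work too; for example, summing every numeric column (a variant added here, assuming ID is the only non-numeric column):
library(dplyr)
test %>%
  group_by(cluster) %>%
  summarise(across(where(is.numeric), sum))  # grouping column cluster is excluded automatically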
Alternatively, pivoting the dataset into long and then back into wide means you can easily analyse all groups and clusters at once:
library(dplyr)
library(tidyr)
test %>%
  pivot_longer(-c(ID, cluster)) %>%
  group_by(cluster, name) %>%
  summarise(sum_value = sum(value)) %>%
  pivot_wider(names_from = "name", values_from = "sum_value")
  cluster Type1 Type2
    <dbl> <dbl> <dbl>
1       1     2     0
2       2     1     1
3       3     0     1
Base R
You can exploit split(), which is roughly equivalent to group_by(). This should give you what you are looking for, regardless of how many Types you have.
# Split the Type columns by cluster, then column-sum each piece
my_split <- split(subset(test, select = grep('^Ty', names(test))), test$cluster)
my_sums <- sapply(my_split, \(x) colSums(x))
my_sums <- data.frame(cluster = as.numeric(gsub("\\D", '', colnames(my_sums))),
                      t(my_sums))
Output
> my_sums
  cluster Type1 Type2
1       1     2     0
2       2     1     1
3       3     0     1
Note: use function(x) instead of \(x) if you use a version of R <4.1.0

Is there R code for the following data wrangling and transformation?

I have the following data set
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007,001,002,003,004,005,006,007,008,009,010,011,012,013,014,015,016,017,018,019,020,021,022,023,024,025,026,027,028,029)
dat1<-data.frame(id,s02)
Based on this dat1, I would like to create a new data set. The code should automatically create n columns named s02__0, s02__1, s02__2, s02__3, s02__4 (here n == 5). Then, based on the id in dat1, it should allocate each s02 value to the respective s02__0 to s02__4 column. The resulting rows are uniquely identified by another ID (id_2) created from the number of rows per id. If an id has fewer s02 values than fit in a row, the remaining cells should be filled with NA; if it has more than n, a new row with an incremented id_2 is created to hold the extra s02 values, and any remaining blank cells are again filled with NA.
From the dataset above, I would wish to have the following output
id <- c(1,2,3,3,4,4,4,4,4,4)
id_2 <- c(1,1,1,2,1,2,3,4,5,6)
s02__0 <- c(1,1,1,6,1,6,11,16,21,26)
s02__1 <- c(2,2,2,7,2,7,12,17,22,27)
s02__2 <- c(3,3,3,NA,3,8,13,18,23,28)
s02__3 <- c(4,4,4,NA,4,9,14,19,24,29)
s02__4 <- c(NA,5,5,NA,5,10,15,20,25,NA)
dat2 <- data.frame(id, id_2, s02__0, s02__1, s02__2, s02__3, s02__4)
This can produce what you want:
library(tidyverse)
# Data
id <- c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
s02 <- c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007)
dat1 <- data.frame(id, s02)
# Code
dat2 <- dat1 %>%
  group_by(id) %>%
  mutate(id2 = ifelse(s02 <= 5, 1, 2)) %>%
  ungroup() %>%
  group_by(id, id2) %>%
  mutate(val = 1:n() - 1, nid = cur_group_id()) %>%
  ungroup() %>%
  select(-id2) %>%
  mutate(id = paste0(id, '.', nid), val = paste0('s02', '.', val)) %>%
  select(-nid) %>%
  pivot_wider(names_from = c(val), values_from = s02) %>%
  mutate(id = gsub("\\..*", "", id)) %>%
  group_by(id) %>%
  mutate(id2 = 1:n()) %>%
  select(order(colnames(.)))
dat2
# A tibble: 4 x 7
# Groups:   id [3]
  id      id2 s02.0 s02.1 s02.2 s02.3 s02.4
  <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1         1     1     2     3     4    NA
2 2         1     1     2     3     4     5
3 3         1     1     2     3     4     5
4 3         2     6     7    NA    NA    NA
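The code above was written against the truncated data (ids 1 to 3) and hard-codes the chunk cutoff in ifelse(s02 <= 5, 1, 2). A more general sketch of the same idea (my addition, not part of the answer) that handles groups of any size, assuming chunks of n = 5 values per id:
library(dplyr)
library(tidyr)
n <- 5
dat2 <- dat1 %>%
  group_by(id) %>%
  mutate(id_2 = (row_number() - 1) %/% n + 1,                 # which chunk of n the row falls in
         col = paste0("s02__", (row_number() - 1) %% n)) %>%  # position within the chunk
  ungroup() %>%
  pivot_wider(id_cols = c(id, id_2), names_from = col, values_from = s02)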

using mutate with row and column indexing and group by

I want to create a variable using dplyr that takes in a value conditional on another variable.
See example below.
data.frame(list(group = c('a','a','b','b'),
                time = c(1,2,1,2),
                value = seq(1,4,1)))
I want to create a variable 'baseline' that takes the value of 'value' where time == 1, within each group. The desired output would be:
data.frame(list(group = c('a','a','b','b'),
                time = c(1,2,1,2),
                value = seq(1,4,1),
                baseline = c(1,1,3,3)))
I tried to run the following code with indexing but am clearly going wrong somewhere:
x <- data.frame(list(group = c('a','a','b','b'),
                     time = c(1,2,1,2),
                     value = seq(1,4,1)))
x %>% group_by(group) %>%
  mutate(baseline = .[[.$time==1,.$value]])
Thanks
We can use which.min():
library(dplyr)
df1 %>%
  group_by(group) %>%
  mutate(baseline = value[which.min(time)])
# A tibble: 4 x 4
# Groups:   group [2]
#   group  time value baseline
#   <chr> <dbl> <dbl>    <dbl>
# 1 a         1     1        1
# 2 a         2     2        1
# 3 b         1     3        3
# 4 b         2     4        3
and if it is already ordered by 'time', then simply use first():
df1 %>%
  group_by(group) %>%
  mutate(baseline = first(value))
data
df1 <- data.frame(group = c('a','a','b','b'),
                  time = c(1,2,1,2),
                  value = seq(1,4,1))
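Another option (my addition, not from the answer above) is to index on the condition itself, assuming each group has exactly one row with time == 1:
library(dplyr)
df1 %>%
  group_by(group) %>%
  mutate(baseline = value[time == 1][1])  # [1] guards against duplicated time == 1 rows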

rollsumr with window-length>1: filling missing values

My data frame looks something like the first two columns of the following
I want to add a third column, equal to the sum of the ID-group's last three observations for VAL.
Using the following command, I managed to get the output below:
library(dplyr)
library(zoo)
df %>%
  group_by(ID) %>%
  mutate(SUM = rollsumr(VAL, k = 3, fill = NA)) %>%
  ungroup()
ID VAL SUM
 1   2  NA
 1   1  NA
 1   3   6
 1   4   8
...
I am now hoping to fill the NAs that result in each group's first two rows.
ID VAL SUM
 1   2   2
 1   1   3
 1   3   6
 1   4   8
...
How do I do that?
I have tried doing the following
df %>%
  group_by(ID) %>%
  mutate(SUM = rollsumr(VAL, k = min(3, row_number()))) %>%
  ungroup()
and
df %>%
  group_by(ID) %>%
  mutate(SUM = rollsumr(VAL, k = 3), fill = "extend") %>%
  ungroup()
But both give me the same error, because I have groups of sizes <= 2.
Evaluation error: need at least two non-NA values to interpolate.
What do I do?
Alternatively, you can use rollapply() from the same package:
df %>%
  group_by(ID) %>%
  mutate(SUM = rollapply(VAL, width = 3, FUN = sum, partial = TRUE, align = "right"))
     ID   VAL   SUM
  <int> <int> <int>
1     1     2     2
2     1     1     3
3     1     3     6
4     1     4     8
Thanks to the argument partial = TRUE, rows that do not yet have a full window of three observations are summed over the values available so far.
Not a direct answer, but one way would be to replace the NA values with the cumulative sum (cumsum) of VAL:
library(dplyr)
library(zoo)
df %>%
  group_by(ID) %>%
  mutate(SUM = rollsumr(VAL, k = 3, fill = NA),
         SUM = ifelse(is.na(SUM), cumsum(VAL), SUM))
#     ID   VAL   SUM
#  <int> <int> <int>
# 1    1     2     2
# 2    1     1     3
# 3    1     3     6
# 4    1     4     8
Or, since you know the window size beforehand, you could check with row_number() as well:
df %>%
  group_by(ID) %>%
  mutate(SUM = rollsumr(VAL, k = 3, fill = NA),
         SUM = ifelse(row_number() < 3, cumsum(VAL), SUM))
