Calculate descriptives for a nested variable - R

I want to calculate the M, min, and max of a variable. Data were collected at different visits. My data look like this:
id visit V1
1 1 18
1 2 24
2 2 NA
2 3 5
2 4 6
I want it to look like this, where I have columns for the M, SD, min, and max for V1 for each participant.
id visit V1 M MIN MAX
1 1 18 21 18 24
2 2 3 4.67 3 6
In calculating the M, I want to take into account the number of visits (e.g., (18 + 24) / 2 for two visits). I tried this as a first step:
df %>%
group_by(id) %>%
mutate(M = mean(V1), MIN = min(V1), MAX = max(V1), na.rm = T)
When I try to handle the NAs by making sure they are not included, the na.rm = T results in a new column entitled "na.rm" with every value being true, which isn't what I want. Any thoughts on making this work?

The dplyr package makes this easy. You can group_by() a variable, and whatever you do after that only applies within the group. In dplyr notation, the %>% is a special operator that feeds the outcome of the function on the left into the first argument of the function on the right.
There are two ways to do it. The first way keeps all of the data, but your summary statistics are repeated in each row.
library(dplyr)
df %>%
group_by(id) %>%
mutate(M = mean(V1), MIN = min(V1), MAX = max(V1))
id visit V1 M MIN MAX
1 1 18 21 18 24
1 2 24 21 18 24
2 2 3 4.67 3 6
2 3 5 4.67 3 6
2 4 6 4.67 3 6
The second way provides only the summary statistics by the group.
library(dplyr)
df %>%
group_by(id) %>%
summarize(M = mean(V1), MIN = min(V1), MAX = max(V1))
id M MIN MAX
1 21 18 24
2 4.67 3 6
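Note that the question also asked about handling the NA: na.rm = TRUE has to go inside each summary function, not as a separate argument to mutate() (which is what created the extra na.rm column). A minimal sketch, rebuilding the question's table (including the NA for id 2 at visit 2) under an illustrative name df_na:
library(dplyr)
# Rebuilt from the question's table purely for illustration
df_na <- data.frame(
  id    = c(1, 1, 2, 2, 2),
  visit = c(1, 2, 2, 3, 4),
  V1    = c(18, 24, NA, 5, 6)
)
df_na %>%
  group_by(id) %>%
  mutate(M = mean(V1, na.rm = TRUE),   # na.rm belongs inside each function,
         MIN = min(V1, na.rm = TRUE),  # not as its own mutate() argument
         MAX = max(V1, na.rm = TRUE))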

You can try this dplyr approach, similar to @ThomasIsCoding's, which produces something close to what you want:
library(dplyr)
#Data
df <- structure(list(id = c(1L, 1L, 2L, 2L, 2L), visit = c(1L, 2L,
2L, 3L, 4L), V1 = c(18L, 24L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
The code:
df %>% group_by(id) %>% mutate(M=mean(V1),Min=min(V1),Max=max(V1),SD=sd(V1))
Output:
# A tibble: 5 x 7
# Groups: id [2]
id visit V1 M Min Max SD
<int> <int> <int> <dbl> <int> <int> <dbl>
1 1 1 18 21 18 24 4.24
2 1 2 24 21 18 24 4.24
3 2 2 3 4.67 3 6 1.53
4 2 3 5 4.67 3 6 1.53
5 2 4 6 4.67 3 6 1.53

Maybe you want something like below
transform(df,
M = ave(V1, id, FUN = mean),
MIN = ave(V1, id, FUN = min),
MAX = ave(V1, id, FUN = max)
)
which gives
id visit V1 M MIN MAX
1 1 1 18 21.000000 18 24
2 1 2 24 21.000000 18 24
3 2 2 3 4.666667 3 6
4 2 3 5 4.666667 3 6
5 2 4 6 4.666667 3 6
Data
> dput(df)
structure(list(id = c(1L, 1L, 2L, 2L, 2L), visit = c(1L, 2L,
2L, 3L, 4L), V1 = c(18L, 24L, 3L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-5L))
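If V1 contained NAs (as in the question's original table), ave() would propagate them through mean(); a small sketch, under that assumption, wraps each summary in an anonymous function so na.rm = TRUE is applied within each group:
# Sketch only: pass na.rm = TRUE inside ave() via anonymous functions
transform(df,
  M = ave(V1, id, FUN = function(v) mean(v, na.rm = TRUE)),
  MIN = ave(V1, id, FUN = function(v) min(v, na.rm = TRUE)),
  MAX = ave(V1, id, FUN = function(v) max(v, na.rm = TRUE))
)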

Related

Using dplyr to group the new calculations into one data frame

I have the following table and I have to obtain a standard deviation of y for each unique value of x.
ID x y
1 1 4
2 2 3
3 3 7
4 1 2
5 2 6
6 3 8
For example, for x = 1 I have y = 4 and y = 2, so the standard deviation will be:
x1 <- c(4,2)
sd(x1)
#output is 1.41
x2 <- c(3,6)
sd(x2)
#output is 2.12
x3 <- c(7,8)
sd(x3)
#output is 0.71
Instead of getting each output and putting it in a data frame the long way, is there a way to do it faster using dplyr and the pipe? I tried to use mutate and group_by, but it doesn't seem to work. I would like the result to look like the following, with count_y (the number of y values for each unique x):
x count_y Std_Dev
1 2 1.41
2 2 2.12
3 2 0.71
We don't need mutate (mutate creates or transforms columns). Here, the output needed is one row per group, which can be done with summarise:
library(dplyr)
df1 %>%
group_by(x) %>%
summarise(count_y = n(), Std_Dev = sd(y))
Output:
# A tibble: 3 × 3
x count_y Std_Dev
<int> <int> <dbl>
1 1 2 1.41
2 2 2 2.12
3 3 2 0.707
data
df1 <- structure(list(ID = 1:6, x = c(1L, 2L, 3L, 1L, 2L, 3L), y = c(4L,
3L, 7L, 2L, 6L, 8L)), class = "data.frame", row.names = c(NA,
-6L))
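For contrast, a mutate() version on the same df1 keeps all six rows and repeats the group statistics on each one, which is why summarise() is the better fit for the one-row-per-group output asked for here; a quick sketch:
library(dplyr)
# mutate() keeps every row and repeats the per-group statistics
df1 %>%
  group_by(x) %>%
  mutate(count_y = n(), Std_Dev = sd(y)) %>%
  ungroup()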

R aggregate() function: Sum and show missing values = 0

I want to sum the "value" column by group1 and by group2.
group2 can range from 1 to 5.
If there is no entry for group2, the sum should be 0.
Data:
group1 group2 value
a 1 100
a 2 200
a 3 300
b 1 10
b 2 20
I am using
aggregate(data$value, by=(list(data$group1, data$group2)), FUN = sum)
which gives
group1 group2 value
a 1 100
a 2 200
a 3 300
b 1 10
b 2 20
However, the result should look like
group1 group2 value
a 1 100
a 2 200
a 3 300
a 4 0
a 5 0
b 1 10
b 2 20
b 3 0
b 4 0
b 5 0
How can I address this using the aggregate function in R?
Thank you!
We can use complete from tidyr to complete missing combinations.
library(dplyr)
library(tidyr)
df %>%
group_by(group1, group2) %>%
summarise(value = sum(value)) %>%
complete(group2 = 1:5, fill = list(value = 0))
# group1 group2 value
# <fct> <int> <dbl>
# 1 a 1 100
# 2 a 2 200
# 3 a 3 300
# 4 a 4 0
# 5 a 5 0
# 6 b 1 10
# 7 b 2 20
# 8 b 3 0
# 9 b 4 0
#10 b 5 0
data
df <- structure(list(group1 = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), group2 = c(1L, 2L, 3L, 1L, 2L), value = c(100L,
200L, 300L, 10L, 20L)), class = "data.frame", row.names = c(NA, -5L))
You need, of course, to tell R that group2 can range from 1 to 5. The best approach is to merge the data with an expand.grid() of all the combinations and then aggregate, wrapped in with():
with(merge(expand.grid(group1=c("a", "b"), group2=1:5, value=0), data, all=TRUE),
aggregate(value, by=(list(group1, group2)), FUN=sum))
# Group.1 Group.2 x
# 1 a 1 100
# 2 b 1 10
# 3 a 2 200
# 4 b 2 20
# 5 a 3 300
# 6 b 3 0
# 7 a 4 0
# 8 b 4 0
# 9 a 5 0
# 10 b 5 0
Data:
data <- structure(list(group1 = c("a", "a", "a", "b", "b"), group2 = c(1L,
2L, 3L, 1L, 2L), value = c(100L, 200L, 300L, 10L, 20L)), row.names = c(NA,
-5L), class = "data.frame")
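As a side note (not part of the answers above), base R's xtabs() also fills absent combinations with 0 once group2 is declared as a factor covering 1 to 5; a rough sketch:
# Sketch: declare the full 1:5 range, then cross-tabulate the summed values
data$group2 <- factor(data$group2, levels = 1:5)
as.data.frame(xtabs(value ~ group1 + group2, data = data))  # sums land in the Freq column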

Getting rowSums for triplicate records and retaining only the one with highest value

I have a data frame with 163 observations and 65 columns with some animal data. The 163 observations are from 56 animals, and each was supposed to have triplicated records, but some information was lost so for the majority of animals, I have triplicates ("A", "B", "C") and for some I have only duplicates (which vary among "A" and "B", "A" and "C" and "B" and "C").
Columns 13:65 contain information I would like to sum across, retaining only the triplicate record with the highest rowSums value. So my data frame would be something like this:
ID Trip Acet Cell Fibe Mega Tera
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3
I am not sure if what I need is to write my own function, or a loop, or what the best alternative actually is - sorry I am still learning and unfortunately for me, I don't think like a programmer so that makes things even more challenging...
So what I want to know is how to keep only rows 2 and 6 (which have the highest rowSums among the triplicates for each animal), but for the whole data frame. The result I want is:
ID Trip Acet Cell Fibe Mega Tera
1 4 B 9 3 7 5 5
2 12 C 5 5 7 3 3
REALLY sorry if the question is poorly elaborated or if it doesn't make sense, this is my first time asking a question here and I have only recently started learning R.
We can compute the row sums separately and use ave to find, within each ID, the rows whose sum equals the group maximum. Then use the resulting logical vector to subset the rows of the dataset:
nm1 <- startsWith(names(df1), "V")
The OP updated the column names. In that case, either use an index
nm1 <- 3:7
Or select the value columns with setdiff
nm1 <- setdiff(names(df1), c("ID", "Trip"))
v1 <- rowSums(df1[nm1], na.rm = TRUE)
i1 <- with(df1, v1 == ave(v1, ID, FUN = max))
df1[i1,]
# ID Trip V1 V2 V3 V4 V5
#2 4 B 9 3 7 5 5
#6 12 C 5 5 7 3 3
data
df1 <- structure(list(ID = c(4L, 4L, 4L, 12L, 12L, 12L), Trip = structure(c(1L,
2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
V1 = c(2L, 9L, 1L, 4L, 6L, 5L), V2 = c(4L, 3L, 2L, 6L, 8L,
5L), V3 = c(9L, 7L, 4L, 7L, 1L, 7L), V4 = c(8L, 5L, 8L, 2L,
1L, 3L), V5 = c(3L, 5L, 6L, 3L, 2L, 3L)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Here is one way.
library(tidyverse)
dat2 <- dat %>%
mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
group_by(ID) %>%
filter(Sum == max(Sum)) %>%
select(-Sum) %>%
ungroup()
dat2
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
Here is another one. This method makes sure only one row is preserved even if multiple rows have a row sum equal to the maximum.
dat3 <- dat %>%
mutate(Sum = rowSums(select(dat, starts_with("V")))) %>%
arrange(ID, desc(Sum)) %>%
group_by(ID) %>%
slice(1) %>%
select(-Sum) %>%
ungroup()
dat3
# # A tibble: 2 x 7
# ID Trip V1 V2 V3 V4 V5
# <int> <fct> <int> <int> <int> <int> <int>
# 1 4 B 9 3 7 5 5
# 2 12 C 5 5 7 3 3
DATA
dat <- read.table(text = " ID Trip V1 V2 V3 V4 V5
1 4 A 2 4 9 8 3
2 4 B 9 3 7 5 5
3 4 C 1 2 4 8 6
4 12 A 4 6 7 2 3
5 12 B 6 8 1 1 2
6 12 C 5 5 7 3 3 ",
header = TRUE)

Subsetting and repetition of rows in a dataframe using R

Suppose we have the following data with column names "id", "time" and "x":
df<-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(20L, 6L, 7L, 11L, 13L, 2L, 6L),
x = c(1L, 1L, 0L, 1L, 1L, 1L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id has multiple observations for time and x. I want to extract the last observation for each id and form a new data frame that repeats these observations according to the number of observations per id in the original data. I am able to extract the last observation for each id using the following code:
library(dplyr)
df<-df%>%
group_by(id) %>%
filter( ((x)==0 & row_number()==n())| ((x)==1 & row_number()==n()))
What is left unresolved is the repetition aspect. The expected output would look like
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(7L, 7L, 7L, 13L, 13L, 6L, 6L),
x = c(0L, 0L, 0L, 1L, 1L, 0L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Thanks for your help in advance.
We can use ave to find the max row number for each ID and subset it from the data frame.
df[ave(1:nrow(df), df$id, FUN = max), ]
# id time x
#3 1 7 0
#3.1 1 7 0
#3.2 1 7 0
#5 2 13 1
#5.1 2 13 1
#7 3 6 0
#7.1 3 6 0
You can do this by using last() to grab the last row within each id.
df %>%
group_by(id) %>%
mutate(time = last(time),
x = last(x))
Because last(x) returns a single value, it gets expanded out to fill all the rows in the mutate() call.
This can also be applied to an arbitrary number of variables using mutate_at:
df %>%
group_by(id) %>%
mutate_at(vars(-id), ~ last(.))
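In current dplyr (version 1.0 or later), across() supersedes mutate_at(); a minimal sketch of the same idea, assuming that version:
# across() version (dplyr >= 1.0); grouping columns are excluded automatically
df %>%
  group_by(id) %>%
  mutate(across(everything(), last)) %>%
  ungroup()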
slice will be your friend in the tidyverse I reckon:
df %>%
group_by(id) %>%
slice(rep(n(),n()))
## A tibble: 7 x 3
## Groups: id [3]
# id time x
# <int> <int> <int>
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
In data.table, you could also use the mult= argument of a join:
library(data.table)
setDT(df)
df[df[,.(id)], on="id", mult="last"]
# id time x
#1: 1 7 0
#2: 1 7 0
#3: 1 7 0
#4: 2 13 1
#5: 2 13 1
#6: 3 6 0
#7: 3 6 0
And in base R, a merge will get you there too:
merge(df["id"], df[!duplicated(df$id, fromLast=TRUE),])
# id time x
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
Using data.table you can try
library(data.table)
setDT(df)[,.(time=rep(time[.N],.N), x=rep(x[.N],.N)), by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0
Following @thelatemai, to avoid naming the columns you can also try:
df[, .SD[rep(.N,.N)], by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0

Reshape dataframe by ID [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 4 years ago.
I have a data set like
id age edu blood
1 30-39 Primary 5.5
1 20-29 Secondary 8.7
1 30-39 Primary 10
2 30-39 Primary 11
2 20-29 Secondary 10
2 20-29 Secondary 9
I want id-wise output like this:
id age30_39count age20_29count edu_pri_count edu_sec_count blood_median
1 2 1 2 1 8.7
2 1 2 1 2 10
I have tried R code:
library(dplyr)
library(tidyr)
ddply(dat, "id", spread, age, age, edu, edu, blood, blood_median=median(blood))
But it is not showing the desired result. Could anybody help?
You mean like this?
> library(dplyr)
> library(tidyr)
> group_by(df,id,age) %>% gather(variable,value,age,edu) %>%
unite(tag,variable,value) %>%
mutate(medblood=median(blood)) %>%
spread(tag,id) %>% select(-blood) %>%
select(-medblood,medblood)
# A tibble: 6 x 5
`age_20-29` `age_30-39` edu_Primary edu_Secondary medblood
<int> <int> <int> <int> <dbl>
1 NA 1 1 NA 8.70
2 1 NA NA 1 8.70
3 2 NA NA 2 10.0
4 NA 1 1 NA 8.70
5 2 NA NA 2 10.0
6 NA 2 2 NA 10.0
That last select(-medblood, medblood) moves the median blood column to the far right. You might actually be wanting to do this, though:
> group_by(df,id,age) %>% gather(variable,value,age,edu) %>%
unite(tag,variable,value) %>%
mutate(medblood=median(blood)) %>%
count(medblood,id,tag) %>% spread(tag,n)
# A tibble: 2 x 6
# Groups: id [2]
id medblood `age_20-29` `age_30-39` edu_Primary edu_Secondary
<int> <dbl> <int> <int> <int> <int>
1 1 8.70 1 2 2 1
2 2 10.0 2 1 1 2
Here is the dput of the data df used for this example:
> dput(df)
structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L), age = structure(c(2L,
1L, 2L, 2L, 1L, 1L), .Label = c("20-29", "30-39"), class = "factor"),
edu = structure(c(1L, 2L, 1L, 1L, 2L, 2L), .Label = c("Primary",
"Secondary"), class = "factor"), blood = c(5.5, 8.7, 10,
11, 10, 9)), .Names = c("id", "age", "edu", "blood"), class = "data.frame", row.names = c(NA,
-6L))
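For completeness, gather() and spread() are superseded in current tidyr; a sketch of the same counting-and-widening idea with pivot_longer()/pivot_wider(), assuming a recent tidyr version:
library(dplyr)
library(tidyr)
# Same idea as the second pipeline above, written with the pivot_* verbs
df %>%
  pivot_longer(c(age, edu), names_to = "variable", values_to = "value",
               values_transform = list(value = as.character)) %>%
  unite(tag, variable, value) %>%
  group_by(id) %>%
  mutate(blood_median = median(blood)) %>%
  count(blood_median, tag) %>%
  pivot_wider(names_from = tag, values_from = n)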
