R: How to aggregate a dataframe using an index column?

I have a dataframe which looks as following:
head(test_df, n = 15)
# print the first 15 rows of the dataframe
value frequency index
1 -2.90267705917358 1 1
2 -2.90254878997803 1 1
3 -2.90252590179443 1 1
4 -2.90219354629517 1 1
5 -2.90201354026794 1 1
6 -2.9016375541687 1 1
7 -2.90107154846191 1 1
8 -2.90089440345764 1 1
9 -2.89996957778931 1 1
10 -2.89970088005066 1 1
11 -2.89928865432739 1 2
12 -2.89920520782471 1 2
13 -2.89907360076904 1 2
14 -2.89888191223145 1 2
15 -2.8988630771637 1 2
The dataframe has 3 columns and 61819 rows. To aggregate it, I want to get the mean of the columns 'value' and 'frequency' for all rows with the same 'index'.
I already found some useful links, see:
https://www.r-bloggers.com/2018/07/how-to-aggregate-data-in-r/
Which is the simplest way to aggregate rows (sum) by columns values the following type of data frame on R?
However, I have not been able to solve the problem yet.
test_df_ag <- stats::aggregate(test_df[1:2], by = test_df[3], FUN = 'mean')
# aggregate the dataframe based on the 'index' column (build the mean)
index value frequency
1 1 NA 1
2 2 NA 1
3 3 NA 1
4 4 NA 1
5 5 NA 1
6 6 NA 1
7 7 NA 1
8 8 NA 1
9 9 NA 1
10 10 NA 1
11 11 NA 1
12 12 NA 1
13 13 NA 1
14 14 NA 1
15 15 NA 1
Since I only get NA values for the column 'value', I wonder whether it might just be a data type issue. However, I also failed when I tried to convert the data type...
base::typeof(test_df$value)
# query the data type of the 'value' column
[1] "integer"

1. Here is a base R solution.
aggregate(cbind(value, frequency) ~ index, data = test_df, FUN = mean)
# index value frequency
#1 1 -2.901523 1
#2 2 -2.899062 1
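As an aside, the NA values in the original aggregate() attempt are consistent with 'value' being stored as a factor rather than as numeric: typeof() on a factor returns "integer" even though the printed values look like decimals. A minimal sketch of the conversion, assuming that diagnosis is correct:
# go through character first; as.numeric() on a factor directly
# would return the underlying level codes, not the printed values
test_df$value <- as.numeric(as.character(test_df$value))
aggregate(cbind(value, frequency) ~ index, data = test_df, FUN = mean)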
2. And a simple dplyr solution.
library(dplyr)
test_df %>%
  group_by(index) %>%
  summarize(across(1:2, mean))
# A tibble: 2 x 3
# index value frequency
#* <int> <dbl> <dbl>
#1 1 -2.90 1
#2 2 -2.90 1
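Here across(1:2, mean) applies mean() to the first two non-grouping columns; summarize(across(c(value, frequency), mean)) selects the same columns by name, which is more robust if the column order ever changes.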
Data
test_df <-
structure(list(value = c(-2.90267705917358, -2.90254878997803,
-2.90252590179443, -2.90219354629517, -2.90201354026794, -2.9016375541687,
-2.90107154846191, -2.90089440345764, -2.89996957778931, -2.89970088005066,
-2.89928865432739, -2.89920520782471, -2.89907360076904, -2.89888191223145,
-2.8988630771637), frequency = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), index = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"))

Using data.table
library(data.table)
setDT(test_df)[, lapply(.SD, mean), by = index, .SDcols = 1:2]
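setDT() converts test_df to a data.table by reference, by = index does the grouping, and .SDcols = 1:2 restricts .SD (the per-group subset of the data) to the first two columns, so lapply(.SD, mean) computes both means in one pass. The result should match the aggregate() output above, i.e. roughly:
# index value frequency
#1: 1 -2.901523 1
#2: 2 -2.899062 1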

Try tidyverse:
test_summary <- test_df %>%
  group_by(index) %>%
  summarise(n = n(),
            mean_value = mean(value, na.rm = TRUE),
            mean_frequency = mean(frequency, na.rm = TRUE))
And, of course, you should make sure you have checked the quality of your data and understand where any NAs in your data set come from.
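For instance, a quick sketch for counting NAs per column before aggregating:
colSums(is.na(test_df))
# returns one NA count per column (value, frequency, index)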

Related

How can I cut an amount of rows by group based on the number in one column?

I would like to cut rows from my data frame by group (column "Group") based on the number assigned in the column "Count".
Data looks like this
Group Count Result Result 2
<chr> <dbl> <dbl> <dbl>
1 Ane 3 5 NA
2 Ane 3 6 5
3 Ane 3 4 5
4 Ane 3 8 5
5 Ane 3 7 8
6 John 2 9 NA
7 John 2 2 NA
8 John 2 4 2
9 John 2 3 2
Expected results
Group Count Result Result 2
<chr> <dbl> <dbl> <dbl>
1 Ane 3 5 NA
2 Ane 3 6 5
3 Ane 3 4 5
6 John 2 9 NA
7 John 2 2 NA
Thanks!
We may use slice on the first value of 'Count' after grouping by 'Group'.
library(dplyr)
df1 %>%
  group_by(Group) %>%
  slice(seq_len(first(Count))) %>%
  ungroup()
Output:
# A tibble: 5 × 4
Group Count Result Result2
<chr> <int> <int> <int>
1 Ane 3 5 NA
2 Ane 3 6 5
3 Ane 3 4 5
4 John 2 9 NA
5 John 2 2 NA
Or use filter with row_number() to create a logical vector.
df1 %>%
  group_by(Group) %>%
  filter(row_number() <= Count) %>%
  ungroup()
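For reference, a sketch of the same logic in data.table, assuming the same df1 (Count[1] plays the role of first(Count)):
library(data.table)
# keep the first Count[1] rows of each Group
setDT(df1)[, head(.SD, Count[1]), by = Group]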
data
df1 <- structure(list(Group = c("Ane", "Ane", "Ane", "Ane", "Ane", "John",
"John", "John", "John"), Count = c(3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L, 2L), Result = c(5L, 6L, 4L, 8L, 7L, 9L, 2L, 4L, 3L), Result2 = c(NA,
5L, 5L, 5L, 8L, NA, NA, 2L, 2L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))

How to use case_when correctly when recoding values in columns under a condition?

I have a data frame with 2 variables:
Correct FACE.RESP
1 1 1
2 2 1
3 1 2
4 2 2
5 2 2
6 2 1
I would like to recode/replace the values in the 'FACE.RESP' column under the following condition: if the value in 'FACE.RESP' matches the value in 'Correct', the value in 'FACE.RESP' should be recoded to 1; if it does not match, it should be recoded to 0.
I tried the following code using mutate and case_when:
mutate(FACE.RESP = case_when(FACE.RESP == Correct ~ '1', FACE.RESP <= Correct ~ '0', FACE.RESP >= Correct ~ '1'))
but the result is this:
Correct FACE.RESP
5 2 2
6 2 1
7 1 NA
8 2 NA
9 2 NA
10 1 NA
Could anyone suggest how to achieve the required outcome and explain what is wrong with the above line of code?
You only need to check the single condition that FACE.RESP and Correct are equal, and assign 0 to all other rows via a TRUE ~ 0 fallback. case_when() returns NA for any row where no condition matches, which is the likely source of the NAs in your output; quoting the results as '1' and '0' would also turn the column into character rather than numeric.
library(dplyr)
df %>% mutate(FACE.RESP = case_when(FACE.RESP == Correct ~ 1, TRUE ~ 0))
# Correct FACE.RESP
#1 1 1
#2 2 0
#3 1 0
#4 2 1
#5 2 1
#6 2 0
However, a simpler approach is to compare the two columns and convert the resulting logical values to integers with a unary + at the beginning (TRUE becomes 1, FALSE becomes 0).
df$FACE.RESP <- +(df$Correct == df$FACE.RESP)
data
df <- structure(list(Correct = c(1L, 2L, 1L, 2L, 2L, 2L), FACE.RESP = c(1L,
1L, 2L, 2L, 2L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
We can use as.integer
df$FACE.RESP <- as.integer(df$Correct == df$FACE.RESP)
data
df <- structure(list(Correct = c(1L, 2L, 1L, 2L, 2L, 2L), FACE.RESP = c(1L,
1L, 2L, 2L, 2L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Here is another dplyr solution.
library(dplyr)
df %>% mutate(FACE.RESP = +(Correct == FACE.RESP))
# Correct FACE.RESP
#1 1 1
#2 2 0
#3 1 0
#4 2 1
#5 2 1
#6 2 0

Return corresponding variable for max value in grouped dataframe R [duplicate]

This question already has answers here:
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Closed 3 years ago.
I am looking to return the corresponding value for the max value of each group in a dataframe in R. Searching, I can only find solutions for Python and Excel.
I seem to get the right answers but in a strange format:
Example:
set.seed(423)
df = data.frame(week = c(rep(1, 7), rep(2, 7), rep(3, 7)),
                day = c(1:7, 1:7, 1:7),
                value = runif(21))
df
week day value
1 1 1 0.89368600
2 1 2 0.63863225
3 1 3 0.19254541
4 1 4 0.57557113
5 1 5 0.78458928
6 1 6 0.55080956
7 1 7 0.59388856
8 2 1 0.02040073
9 2 2 0.17663162
10 2 3 0.33647923
11 2 4 0.53304330
12 2 5 0.22939499
13 2 6 0.43232959
14 2 7 0.71889969
15 3 1 0.97318020
16 3 2 0.20320008
17 3 3 0.58991593
18 3 4 0.88450876
19 3 5 0.61154896
20 3 6 0.68123761
21 3 7 0.48162899
library('dplyr')
group_by(df, week) %>%
  summarize(max.day = .[which(value == max(value, na.rm = T)), 'day'])
week max.day$ NA NA
<dbl> <int> <int> <int>
1 1 1 7 1
2 2 NA NA NA
3 3 NA NA NA
The value for max.day (1, 7, 1) appear correct, as can be seen if you match the values from this code to the original df:
group_by(df, week) %>%
  summarise(value = max(value))
week value
<dbl> <dbl>
1 1 0.894
2 2 0.719
3 3 0.973
But what I want (and what I expected from the code) is a table that looks as follows:
week max.day
1 1 1
2 2 7
3 3 1
What am I doing wrong here?
Also, will this code work if I have a large dataset in which the max value might repeat within certain groups? Essentially, will my .[which(value == max(value, na.rm = T)), 'day'] be applied group-wise, or is it just looking at the entire vector?
We can use which.max. If there are ties for the max 'value', i.e. more than one max value within a 'week', which.max returns the index of the first max 'value'; use that to subset the corresponding 'day'.
library(dplyr)
df %>%
group_by(week) %>%
summarise(max.day = day[which.max(value)])
# A tibble: 3 x 2
# week max.day
# <int> <int>
#1 1 1
#2 2 7
#3 3 1
With ==, there is the possibility of matching multiple elements when there are ties, and summarise can return only a single row per group, which is what produces the malformed output and, ultimately, errors.
Another option is to either filter or slice the rows if the intention is to return the whole row; the slice() version is shown next, with a filter() sketch after it.
df %>%
  group_by(week) %>%
  slice(which.max(value)) %>%
  select(week, max.day = day)
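And a minimal filter() sketch of the same idea; unlike which.max, this keeps every row that ties for the maximum:
df %>%
  group_by(week) %>%
  filter(value == max(value, na.rm = TRUE)) %>%
  select(week, max.day = day)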
data
df <- structure(list(week = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), day = c(1L, 2L,
3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L,
5L, 6L, 7L), value = c(0.893686, 0.63863225, 0.19254541, 0.57557113,
0.78458928, 0.55080956, 0.59388856, 0.02040073, 0.17663162, 0.33647923,
0.5330433, 0.22939499, 0.43232959, 0.71889969, 0.9731802, 0.20320008,
0.58991593, 0.88450876, 0.61154896, 0.68123761, 0.48162899)),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21"))

R - sample and resample a person-period file

I am working with a gigantic person-period file, and I thought that a good way to deal with such a large dataset would be a sampling and resampling technique.
My person-period file look like this
id code time
1 1 a 1
2 1 a 2
3 1 a 3
4 2 b 1
5 2 c 2
6 2 b 3
7 3 c 1
8 3 c 2
9 3 c 3
10 4 c 1
11 4 a 2
12 4 c 3
13 5 a 1
14 5 c 2
15 5 a 3
I have actually two distinct issues.
The first issue is that I am having trouble simply sampling a person-period file.
For example, I would like to sample 2 id-sequences such as:
id code time
1 a 1
1 a 2
1 a 3
2 b 1
2 c 2
2 b 3
The following line of code works for sampling a person-period file:
dt[which(dt$id %in% sample(dt$id, 2)), ]
However, I would like to use a dplyr solution because I am interested in resampling, and in particular I would like to use replicate.
I am interested in doing something like replicate(100, sample_n(dt, 2), simplify = FALSE)
I am struggling with the dplyr solution because I am not sure what should be the grouping variable.
library(dplyr)
dt %>% group_by(id) %>% sample_n(1)
gives me an incorrect result because it does not keep the full sequence of each id.
Any clue how I could both sample and resample a person-period file?
data
dt = structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"
), class = "factor"), code = structure(c(1L, 1L, 1L, 2L, 3L,
2L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 1L), .Label = c("a", "b",
"c"), class = "factor"), time = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor")), .Names = c("id", "code", "time"), row.names = c(NA,
-15L), class = "data.frame")
I think the idiomatic way would probably look like
set.seed(1)
samp = dt %>% select(id) %>% distinct %>% sample_n(2)
left_join(samp, dt)
id code time
1 2 b 1
2 2 c 2
3 2 b 3
4 5 a 1
5 5 c 2
6 5 a 3
This extends straightforwardly to more grouping variables and fancier sampling rules.
If you need to do this many times...
nrep = 100
ng = 2
samps = dt %>% select(id) %>% distinct %>%
  slice(rep(1:n(), nrep)) %>%
  mutate(r = rep(1:nrep, each = n()/nrep)) %>%
  group_by(r) %>%
  sample_n(ng)
repdat = left_join(samps, dt)
# then do stuff with it:
repdat %>% group_by(r) %>% do_stuff
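If you prefer the literal replicate() idiom from the question, a minimal sketch built on the same join idea (each list element is one full person-period subsample):
samples <- replicate(
  100,
  dt %>% distinct(id) %>% sample_n(2) %>% left_join(dt, by = "id"),
  simplify = FALSE
)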
I imagine you are doing some simulations and may want to do the subsetting many times. You probably also want to try this data.table method and utilize the fast binary search feature on the key column:
library(data.table)
setDT(dt)
setkey(dt, id)
replicate(2, dt[list(sample(unique(id), 2))], simplify = F)
#[[1]]
# id code time
#1: 3 c 1
#2: 3 c 2
#3: 3 c 3
#4: 5 a 1
#5: 5 c 2
#6: 5 a 3
#[[2]]
# id code time
#1: 3 c 1
#2: 3 c 2
#3: 3 c 3
#4: 4 c 1
#5: 4 a 2
#6: 4 c 3
We can use filter with sample.
dt %>%
  filter(id %in% sample(unique(id), 2, replace = FALSE))
NOTE: The OP asked for a dplyr method, and this solution does use dplyr.
If we need to replicate, one option would be using map from purrr:
library(purrr)
dt %>%
  distinct(id) %>%
  replicate(2, .) %>%
  map(~ sample(., 2, replace = FALSE)) %>%
  map(~ filter(dt, id %in% .))
#$id
# id code time
#1 1 a 1
#2 1 a 2
#3 1 a 3
#4 4 c 1
#5 4 a 2
#6 4 c 3
#$id
# id code time
#1 4 c 1
#2 4 a 2
#3 4 c 3
#4 5 a 1
#5 5 c 2
#6 5 a 3

Deleting Rows per ID when value gets greater than... minus 2

I have the following data frame
id<-c(1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3)
time<-c(0,1,2,3,4,5,6,7,0,1,2,3,0,1,2,3)
value<-c(1,1,6,1,2,0,0,1,2,6,2,2,1,1,6,1)
d<-data.frame(id, time, value)
The value 6 appears only once for each id. For every id, I would like to remove all rows after the row containing the value 6, except the first two rows coming after it.
I've searched and found a similar problem, but I couldn't adapt the solution myself. I therefore used the code from that thread.
In the above case the final data frame should be
id time value
1 0 1
1 1 1
1 2 6
1 3 1
1 4 2
2 0 2
2 1 6
2 2 2
2 3 2
3 0 1
3 1 1
3 2 6
3 3 1
One of the solutions given seems to get very close to what I need, but I didn't manage to adapt it. Could you help me?
library(plyr)
ddply(d, "id", function(x) {
  if (any(x$value == 6)) {
    subset(x, time <= x[x$value == 6, "time"])
  } else {
    x
  }
})
Thank you very much.
We could use data.table. Convert the 'data.frame' to a 'data.table' (setDT(d)). Grouped by the 'id' column, we get the position of the 'value' that is equal to 6 and add 2 to it. We then take the min of that position and the number of rows in the group (.N), get the seq, and use that to subset the dataset. We also add an if/else condition to check whether there is any 6 in the 'value' column, and otherwise return .SD without any subsetting.
library(data.table)
setDT(d)[, if (any(value == 6)) .SD[seq(min(c(which(value == 6) + 2, .N)))]
           else .SD, by = id]
# id time value
# 1: 1 0 1
# 2: 1 1 1
# 3: 1 2 6
# 4: 1 3 1
# 5: 1 4 2
# 6: 2 0 2
# 7: 2 1 6
# 8: 2 2 2
# 9: 2 3 2
#10: 3 0 1
#11: 3 1 1
#12: 3 2 6
#13: 3 3 1
#14: 4 0 1
#15: 4 1 2
#16: 4 2 5
Or, as @Arun mentioned in the comments, we can use head to subset, which would be faster:
setDT(d)[, if (any(value == 6)) head(.SD, which(value == 6) + 2L) else .SD, by = id]
Or using dplyr, we group by 'id', get the position of the 'value' 6 with which, add 2, get the seq, and use that numeric index within slice to extract the rows; a guard returns groups without a 6 unchanged (otherwise the zero-length index would drop them).
library(dplyr)
d %>%
  group_by(id) %>%
  slice(if (any(value == 6)) seq(min(which(value == 6) + 2, n())) else row_number())
# id time value
#1 1 0 1
#2 1 1 1
#3 1 2 6
#4 1 3 1
#5 1 4 2
#6 2 0 2
#7 2 1 6
#8 2 2 2
#9 2 3 2
#10 3 0 1
#11 3 1 1
#12 3 2 6
#13 3 3 1
#14 4 0 1
#15 4 1 2
#16 4 2 5
data
d <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L), time = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 3L, 0L, 1L, 2L, 3L, 0L, 1L, 2L), value = c(1L, 1L, 6L, 1L,
2L, 2L, 6L, 2L, 2L, 1L, 1L, 6L, 1L, 1L, 2L, 5L)), .Names = c("id",
"time", "value"), class = "data.frame", row.names = c(NA, -16L))
