Rounding error when grouping by multiple categories - r

Why are the values for SE_daily wrong? At worst I expected the result to be rounded to the nearest integer (though I wanted a decimal); instead, the decimal answer is completely wrong. What did I miss?
csv<-csv%>%group_by(id_num)%>%group_by(Month)%>%group_by(Day)%>%mutate(SE_daily=mean(SelfEsteem, na.rm=T))
head(csv[,c(1:5,28,181)])
Source: local data frame [6 x 7]
Groups: Day [3]
X.1 X id_num Month Day SelfEsteem SE_daily
<int> <int> <int> <int> <int> <int> <dbl>
1 1 1 29 2 19 4 3.457944 #mean(4,4,3)= 4, expected answer= 3.66666666667
2 2 2 29 2 19 4 3.457944
3 3 3 29 2 19 3 3.457944
4 4 4 29 2 20 4 3.424242 #expected answer= 4
5 5 5 29 2 21 4 3.318182 #expected answer=4
6 6 6 29 2 21 4 3.318182
Output of dput() on the head of csv:
structure(list(X.1 = 1:6, X = 1:6,
id_num = c(29L, 29L, 29L, 29L, 29L, 29L),
Month = c(2L, 2L, 2L, 2L, 2L, 2L),
Day = c(19L, 19L, 19L, 20L, 21L, 21L),
SelfEsteem = c(4L, 4L, 3L, 4L, 4L, 4L),
SE_daily = c(3.45794392523365, 3.45794392523365, 3.45794392523365, 3.42424242424242, 3.31818181818182, 3.31818181818182)),
.Names = c("X.1", "X", "id_num", "Month", "Day", "SelfEsteem", "SE_daily"),
row.names = c(NA, -6L),
class = "data.frame")

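I got the expected output for SE_daily when I ran this on the data you posted. Piping several group_by() calls does not combine them: by default each call replaces the previous grouping, so your data ends up grouped by Day alone (note the "Groups: Day [3]" line in your output). Each mean is therefore taken over every id_num and Month that share that Day, assuming the posted data is only a subset of the full data set. Put all the grouping variables in a single group_by() call instead: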
library(dplyr)
csv %>%
  group_by(id_num, Month, Day) %>%
  mutate(SE_daily = mean(SelfEsteem, na.rm = TRUE))
output
Source: local data frame [6 x 7]
Groups: id_num, Month, Day [3]
X.1 X id_num Month Day SelfEsteem SE_daily
<int> <int> <int> <int> <int> <int> <dbl>
1 1 1 29 2 19 4 3.666667
2 2 2 29 2 19 4 3.666667
3 3 3 29 2 19 3 3.666667
4 4 4 29 2 20 4 4.000000
5 5 5 29 2 21 4 4.000000
6 6 6 29 2 21 4 4.000000
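The difference between chained and single group_by() calls can be seen on a small made-up data frame. This is only a sketch: `toy`, `chained` and `single` are hypothetical names, and the values are invented so that two ids share the same Month and Day.

```r
library(dplyr)

# Hypothetical toy data: two ids that share the same Month and Day
toy <- data.frame(
  id_num     = c(29L, 29L, 30L, 30L),
  Month      = c(2L, 2L, 2L, 2L),
  Day        = c(19L, 19L, 19L, 19L),
  SelfEsteem = c(4L, 3L, 1L, 2L)
)

# Chained calls: each group_by() replaces the grouping set by the previous one
chained <- toy %>%
  group_by(id_num) %>%
  group_by(Month) %>%
  group_by(Day)
group_vars(chained)   # only "Day" is left

# Single call: all three variables define the groups together
single <- toy %>% group_by(id_num, Month, Day)
group_vars(single)

# The daily mean mixes both ids in the chained version, but not in the single one
chained_means <- chained %>%
  mutate(SE_daily = mean(SelfEsteem, na.rm = TRUE)) %>%
  pull(SE_daily)
single_means <- single %>%
  mutate(SE_daily = mean(SelfEsteem, na.rm = TRUE)) %>%
  pull(SE_daily)
```

If the calls really have to stay separate, dplyr 1.0.0 and later accept `.add = TRUE` in group_by() to add to the existing grouping instead of replacing it.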

Related

Creating new columns in dataset as a lookup function in R?

Let's say I have a data table that consists of monthly stock returns:
Company Year return next years return
1 1 5
1 2 6
1 3 2
1 4 4
For a large dataset with multiple companies and years, how can I add a new column that contains next year's return? For example, the first row would hold the second year's return of 6%, and so on. In Excel I could simply use INDEX/MATCH, but I have no idea how this is done in R. I am not using Excel because it takes over 20 hours to compute everything, as INDEX/MATCH is extremely slow. The code needs to do this for all companies, so it has to find the correct company and year and then put the value into the new column.
You could group by the company and use lead() to get the next value:
library(dplyr)
df <- data.frame(
company = c(1L, 1L, 1L, 1L, 2L, 2L),
year = c(1L, 2L, 3L, 4L, 1L, 2L),
return_ = c(5L, 6L, 2L, 4L, 2L, 4L))
df
#> company year return_
#> 1 1 1 5
#> 2 1 2 6
#> 3 1 3 2
#> 4 1 4 4
#> 5 2 1 2
#> 6 2 2 4
df %>% group_by(company) %>%
mutate(next.years.return = lead(return_, order_by = year))
#> # A tibble: 6 × 4
#> # Groups: company [2]
#> company year return_ next.years.return
#> <int> <int> <int> <int>
#> 1 1 1 5 6
#> 2 1 2 6 2
#> 3 1 3 2 4
#> 4 1 4 4 NA
#> 5 2 1 2 4
#> 6 2 2 4 NA
Created on 2023-02-10 with reprex v2.0.2
To get the next year's return only when the following row really is the next year:
library(dplyr)
df %>%
  group_by(Company) %>%
  arrange(Company, Year) %>%
  mutate("next years return" =
           if_else(lead(Year) - Year == 1, lead(`return`), NA)) %>%
  ungroup()
# A tibble: 8 × 4
Company Year return `next years return`
<dbl> <dbl> <int> <int>
1 1 1 5 NA
2 1 3 2 4
3 1 4 4 6
4 1 5 6 NA
5 2 1 5 6
6 2 2 6 2
7 2 3 2 4
8 2 4 4 NA
Data
df <- structure(list(Company = c(1, 1, 1, 1, 2, 2, 2, 2), Year = c(1,
5, 3, 4, 4, 3, 2, 1), return = c(5L, 6L, 2L, 4L, 4L, 2L, 6L,
5L)), row.names = c("1", "2", "3", "4", "41", "31", "21", "11"
), class = "data.frame")
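For readers coming from INDEX/MATCH, the same lookup can also be written as a join rather than with lead(). This is a different technique from the answers above, sketched under the assumption that each Company/Year pair occurs only once; `lookup`, `result` and `next_years_return` are names made up for illustration.

```r
library(dplyr)

df <- data.frame(
  Company = c(1, 1, 1, 1, 2, 2, 2, 2),
  Year    = c(1, 5, 3, 4, 4, 3, 2, 1),
  return  = c(5L, 6L, 2L, 4L, 4L, 2L, 6L, 5L)
)

# Lookup table: label each return with the year before it,
# so that it lines up with the row it should be attached to
lookup <- df %>%
  transmute(Company, Year = Year - 1, next_years_return = `return`)

# Rows without a matching Company/Year in the lookup get NA
result <- df %>%
  left_join(lookup, by = c("Company", "Year")) %>%
  arrange(Company, Year)
```

Because the match is on the exact year, a gap in the years (Company 1 has no year 2 here) yields NA rather than the return of a later year.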

Wide to long form in R

Have a dataset for determining interrater reliability. Trying to restructure my data from wide to long form. Here is my data.
Subject Rater Item_1 Item_2
AB 1 6 4
AB 2 5 5
CD 1 4 5
CD 2 6 5
EF 1 4 4
EF 2 7 5
I want to restructure it so that it looks like this:
Subject Item Rater_1 Rater_2
AB 1 6 5
AB 2 4 5
CD 1 4 6
CD 2 5 5
EF 1 4 7
EF 2 4 5
I've tried pivot_longer but am unable to separate "rater" into two columns. Any ideas?
Get the data in long format and use a different key to get it in wide format again.
library(dplyr)
library(tidyr)
# Thanks to @Dan Adams for the `NA` trick.
df %>%
  pivot_longer(cols = starts_with('Item'),
               names_to = c(NA, 'Item'),
               names_sep = "_") %>%
  pivot_wider(names_from = Rater, values_from = value, names_prefix = "Rater_")
# Subject Item Rater_1 Rater_2
# <chr> <chr> <int> <int>
#1 AB 1 6 5
#2 AB 2 4 5
#3 CD 1 4 6
#4 CD 2 5 5
#5 EF 1 4 7
#6 EF 2 4 5
data
df <- structure(list(Subject = c("AB", "AB", "CD", "CD", "EF", "EF"
), Rater = c(1L, 2L, 1L, 2L, 1L, 2L), Item_1 = c(6L, 5L, 4L,
6L, 4L, 7L), Item_2 = c(4L, 5L, 5L, 5L, 4L, 5L)),
class = "data.frame", row.names = c(NA, -6L))
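To make the two steps easier to follow, the pipeline can be split so that the intermediate long table is visible. This sketch uses `names_prefix` instead of the `NA` trick to strip "Item_" from the names; `long` and `wide` are illustrative names.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Subject = c("AB", "AB", "CD", "CD", "EF", "EF"),
  Rater   = c(1L, 2L, 1L, 2L, 1L, 2L),
  Item_1  = c(6L, 5L, 4L, 6L, 4L, 7L),
  Item_2  = c(4L, 5L, 5L, 5L, 4L, 5L)
)

# Step 1: stack the Item_* columns; one row per Subject/Rater/Item
long <- df %>%
  pivot_longer(cols = starts_with("Item"),
               names_to = "Item",
               names_prefix = "Item_",
               values_to = "value")

# Step 2: spread again, this time with Rater as the key
wide <- long %>%
  pivot_wider(names_from = Rater, values_from = value, names_prefix = "Rater_")
```

Here `long` has 12 rows (6 rows times 2 items), and `wide` has one row per Subject/Item pair with a column for each rater.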
Here is a base R solution. You are really just transposing the data by group in this particular case.
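Map(\(s) {
  # Take the rows for one subject and transpose its item columns
  x <- subset(df, df$Subject == s)
  x[, c("Item_1", "Item_2")] <- t(x[, c("Item_1", "Item_2")])
  # After transposing, the old Rater column indexes items and the item columns hold raters
  colnames(x) <- c("Subject", "Item", "Rater_1", "Rater_2")
  x
}, unique(df$Subject)) |>
  do.call(what = rbind)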
#> # A tibble: 6 x 4
#> Subject Item Rater_1 Rater_2
#> * <chr> <dbl> <dbl> <dbl>
#> 1 AB 1 6 5
#> 2 AB 2 4 5
#> 3 CD 1 4 6
#> 4 CD 2 5 5
#> 5 EF 1 4 7
#> 6 EF 2 4 5

How to find the average of several lines with the same id in a big R dataframe? [duplicate]

I have a big data frame (more than 100,000 entries) that looks something like this:
ID Pre temp day
134 10 6 1
134 20 7 1
134 10 8 1
234 5 1 2
234 10 4 2
234 15 10 3
I want to reduce my data frame by finding the mean value of pre, temp and day for identical ID values.
At the end, my data frame would look something like this
ID Pre temp day
134 13.3 7 1
234 10 5 2.3
I'm not sure how to do it.
Thank you in advance!
With the dplyr package you can group_by your ID value and then use summarise to take the mean of each column:
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(Pre = mean(Pre),
            temp = mean(temp),
            day = mean(day))
# A tibble: 2 x 4
ID Pre temp day
<dbl> <dbl> <dbl> <dbl>
1 134 13.3 7 1
2 234 10 5 2.33
With dplyr, a solution looks like this:
textFile <- "ID Pre temp day
134 10 6 1
134 20 7 1
134 10 8 1
234 5 1 2
234 10 4 2
234 15 10 3"
data <- read.table(text = textFile,header=TRUE)
library(dplyr)
data %>% group_by(ID) %>%
summarise(.,Pre = mean(Pre),temp = mean(temp),day=mean(day))
...and the output:
ID Pre temp day
<int> <dbl> <dbl> <dbl>
1 134 13.3 7 1
2 234 10 5 2.33
You can try the following:
library(dplyr)
#Data
df <- structure(list(ID = c(134L, 134L, 134L, 234L, 234L, 234L), Pre = c(10L,
20L, 10L, 5L, 10L, 15L), temp = c(6L, 7L, 8L, 1L, 4L, 10L), day = c(1L,
1L, 1L, 2L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-6L))
#Code
df %>% group_by(ID) %>% summarise_all(mean,na.rm=T)
# A tibble: 2 x 4
ID Pre temp day
<int> <dbl> <dbl> <dbl>
1 134 13.3 7 1
2 234 10 5 2.33
There is no need to specify each variable individually.
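In dplyr 1.0.0 and later, the same "all columns at once" idea is usually written with across(), which supersedes summarise_all(). A minimal sketch on the example data, with `means` as an illustrative name:

```r
library(dplyr)

df <- data.frame(
  ID   = c(134L, 134L, 134L, 234L, 234L, 234L),
  Pre  = c(10L, 20L, 10L, 5L, 10L, 15L),
  temp = c(6L, 7L, 8L, 1L, 4L, 10L),
  day  = c(1L, 1L, 1L, 2L, 2L, 3L)
)

# everything() covers all non-grouping columns; ID is kept as the group key
means <- df %>%
  group_by(ID) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
```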

How to sum rows based on exact conditions on multiple columns and save edited rows in original dataset? [duplicate]

There are 3 parts to this problem:
1) I want to sum values in column b,c,d for any two adjacent rows which have the same values for columns(b,c,d)
2) I would like to keep values in other columns the same. (Some other column (eg. a) may contain character data.)
3) I would like to keep the changes by replacing the original value in columns b,c,d in the first row (of the 2 same rows) with the new values (the sums) and delete the second row(of the 2 same rows).
Time a b c d id
1 2014/10/11 A 40 20 10 1
2 2014/10/12 A 40 20 10 2
3 2014/10/13 B 9 10 9 3
4 2014/10/14 D 16 5 12 4
5 2014/10/15 D 1 6 5 5
6 2014/10/16 B 20 7 8 6
7 2014/10/17 B 20 7 8 7
8 2014/10/18 A 11 9 5 8
9 2014/10/19 C 31 20 23 9
Expected outcome:
Time a b c d id
1 2014/10/11 A 80 40 20 1 *
3 2014/10/13 B 9 10 9 3
4 2014/10/14 D 16 5 12 4
5 2014/10/15 D 1 6 5 5
6 2014/10/16 B 40 14 16 6 *
8 2014/10/18 A 11 9 5 8
9 2014/10/19 C 31 20 23 9
id 1 and 2 combined to become id 1; id 6 and 7 combined to become id 6.
Thank you. Any contribution is greatly appreciated.
This uses dplyr functions along with data.table::rleid. To detect adjacent rows with the same values in columns b, c and d, we paste the three values together and use rleid to assign a group id to each run of identical keys. Within each group we sum the values in columns b, c and d and keep only the first row.
library(dplyr)
df %>%
  mutate(temp_col = paste(b, c, d, sep = "-")) %>%
  group_by(group = data.table::rleid(temp_col)) %>%
  mutate_at(vars(b, c, d), sum) %>%
  slice(1L) %>%
  ungroup %>%
  select(-temp_col, -group)
# Time a b c d id
# <fct> <fct> <int> <int> <int> <int>
#1 2014/10/11 A 80 40 20 1
#2 2014/10/13 B 9 10 9 3
#3 2014/10/14 D 16 5 12 4
#4 2014/10/15 D 1 6 5 5
#5 2014/10/16 B 40 14 16 6
#6 2014/10/18 A 11 9 5 8
#7 2014/10/19 C 31 20 23 9
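To see what the run-length id does, it can be reproduced in base R without data.table. This is only a sketch of the grouping step, not the answer's code: `key`, `run_id` and `b_sums` are illustrative names, and only the b, c and d columns of the example data are used.

```r
# The b, c, d columns from the example data
df <- data.frame(
  b = c(40L, 40L, 9L, 16L, 1L, 20L, 20L, 11L, 31L),
  c = c(20L, 20L, 10L, 5L, 6L, 7L, 7L, 9L, 20L),
  d = c(10L, 10L, 9L, 12L, 5L, 8L, 8L, 5L, 23L)
)

# One key per row; equal keys mean equal b, c and d
key <- paste(df$b, df$c, df$d, sep = "-")

# A new run starts at row 1 and wherever the key differs from the previous row;
# the cumulative sum of those starts numbers the runs 1, 2, 3, ...
run_id <- cumsum(c(TRUE, key[-1] != key[-length(key)]))

# Sum b within each run, as the answer does for b, c and d
b_sums <- as.vector(tapply(df$b, run_id, sum))
```

Rows 1 and 2 share run 1 and rows 6 and 7 share run 5, which is why those pairs collapse into single rows with doubled values.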
data
df <- structure(list(Time = structure(1:9, .Label = c("2014/10/11",
"2014/10/12", "2014/10/13", "2014/10/14", "2014/10/15", "2014/10/16",
"2014/10/17", "2014/10/18", "2014/10/19"), class = "factor"),
a = structure(c(1L, 1L, 2L, 4L, 4L, 2L, 2L, 1L, 3L), .Label = c("A",
"B", "C", "D"), class = "factor"), b = c(40L, 40L, 9L, 16L,
1L, 20L, 20L, 11L, 31L), c = c(20L, 20L, 10L, 5L, 6L, 7L,
7L, 9L, 20L), d = c(10L, 10L, 9L, 12L, 5L, 8L, 8L, 5L, 23L
), id = 1:9), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9"))

new data set with combining some rows

This is not a question about long versus wide reshaping, so please don't mark it as a duplicate.
Suppose I have :
HouseholdID. PersonID. time. dur. age
1 1 3 4 19
1 2 3 4 29
1 3 5 5 30
1 1 5 5 18
2 1 21 30 18
2 2 21 30 30
In each household some people have the same time and dur. I want to combine only the rows that have the same HouseholdID, time and dur.
OUTPUT:
HouseholdID. PersonID. time. dur. age. HouseholdID. PersonID. time. dur. age
1 1 3 4 19 1 2 3 4 29
1 3 5 5 30 1 1 5 5 18
2 1 21 30 18 2 2 21 30 30
An option would be dcast from data.table which can take multiple value.var columns
library(data.table)
dcast(setDT(df1), HouseholdID. ~ rowid(HouseholdID.),
value.var = c("PersonID.", "time.", "dur.", "age"), sep="")
# HouseholdID. PersonID.1 PersonID.2 time.1 time.2 dur.1 dur.2 age1 age2
#1: 1 1 2 3 3 4 4 19 29
#2: 2 1 2 21 21 30 30 18 30
Or an option with pivot_wider from the devel version of tidyr
library(tidyr) # ‘0.8.3.9000’
library(dplyr)
df1 %>%
  group_by(HouseholdID.) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(id_cols = HouseholdID., names_from = rn,
              values_from = c(PersonID., time., dur., age), names_sep = "")
# A tibble: 2 x 9
# HouseholdID. PersonID.1 PersonID.2 time.1 time.2 dur.1 dur.2 age1 age2
# <int> <int> <int> <int> <int> <int> <int> <int> <int>
#1 1 1 2 3 3 4 4 19 29
#2 2 1 2 21 21 30 30 18 30
Update
With the new dataset, extend the id columns by including the 'time.' and 'dur.'
dcast(setDT(df2), HouseholdID. + time. + dur. ~ rowid(HouseholdID., time., dur.),
value.var = c("PersonID.", "age"), sep="")
If we need duplicate columns for 'time.' and 'dur.' (not clear why it is needed though)
dcast(setDT(df2), HouseholdID. + time. + dur. ~ rowid(HouseholdID., time., dur.),
value.var = c("PersonID.", "time.", "dur.", "age"), sep="")[,
c('time.', 'dur.') := NULL][]
# HouseholdID. PersonID.1 PersonID.2 time..11 time..12 dur..11 dur..12 age1 age2
#1: 1 1 2 3 3 4 4 19 29
#2: 1 3 1 5 5 5 5 30 18
#3: 2 1 2 21 21 30 30 18 30
Or with tidyverse
df2 %>%
  group_by(HouseholdID., time., dur.) %>%
  mutate(rn = row_number()) %>%
  pivot_wider(id_cols = c(HouseholdID., time., dur.), names_from = rn,
              values_from = c(PersonID., age), names_sep = "")
# A tibble: 3 x 7
# HouseholdID. time. dur. PersonID.1 PersonID.2 age1 age2
# <int> <int> <int> <int> <int> <int> <int>
#1 1 3 4 1 2 19 29
#2 1 5 5 3 1 30 18
#3 2 21 30 1 2 18 30
NOTE: duplicate column names are not recommended as it can lead to confusion in identification of columns.
data
df1 <- structure(list(HouseholdID. = c(1L, 1L, 2L, 2L), PersonID. = c(1L,
2L, 1L, 2L), time. = c(3L, 3L, 21L, 21L), dur. = c(4L, 4L, 30L,
30L), age = c(19L, 29L, 18L, 30L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(HouseholdID. = c(1L, 1L, 1L, 1L, 2L, 2L), PersonID. = c(1L,
2L, 3L, 1L, 1L, 2L), time. = c(3L, 3L, 5L, 5L, 21L, 21L), dur. = c(4L,
4L, 5L, 5L, 30L, 30L), age = c(19L, 29L, 30L, 18L, 18L, 30L)),
class = "data.frame", row.names = c(NA,
-6L))
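Both the dcast and the pivot_wider versions rely on a within-group counter (rowid() or row_number()) to decide which suffix each row's values receive. A small sketch of that counter on the second data set, with `with_rn` as an illustrative name:

```r
library(dplyr)

df2 <- data.frame(
  HouseholdID. = c(1L, 1L, 1L, 1L, 2L, 2L),
  PersonID.    = c(1L, 2L, 3L, 1L, 1L, 2L),
  time.        = c(3L, 3L, 5L, 5L, 21L, 21L),
  dur.         = c(4L, 4L, 5L, 5L, 30L, 30L),
  age          = c(19L, 29L, 30L, 18L, 18L, 30L)
)

# Number the rows within each HouseholdID./time./dur. combination;
# rn = 1 feeds the "...1" columns and rn = 2 feeds the "...2" columns
with_rn <- df2 %>%
  group_by(HouseholdID., time., dur.) %>%
  mutate(rn = row_number()) %>%
  ungroup()
```

If a combination had three matching people, rn would reach 3 and the wide result would gain a third set of columns, with NA for the groups that have fewer rows.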
