I have a dataframe in long format which is organised in this way:
help<- read.table(text="
ID Sodium H
1 140 31.9
1 138 29.6
1 136 30.6
2 145 35.9
2 137 33.3
3 148 27.9
4 139 30.0
4 128 32.4
4 143 35.3
4 133 NA", header = TRUE)
I need the worst value for Sodium and H within each subject (ID). The worst value for H is defined as the value furthest from the range 41-49, while the worst value for Sodium is defined as the value furthest from the range 134-154.
The end result should therefore become something like this:
help<- read.table(text="
ID Sodium H
1 136 29.6
2 137 33.3
3 148 27.9
4 128 30.0 ", header=TRUE)
What is the easiest way to do this? Using the aggregate function, dplyr, or something else? Thank you in advance!
Here's a tidy version, using 45 (the midpoint of the 41-49 range) as the reference point:
library(dplyr)
help %>%
  group_by(ID) %>%
  slice(which.max(abs(H - 45))) %>%
  ungroup()
# # A tibble: 4 x 3
#      ID Sodium     H
#   <int>  <int> <dbl>
# 1     1    138  29.6
# 2     2    137  33.3
# 3     3    148  27.9
# 4     4    139  30
If it's possible that an ID has no value out of limits, then the "worst" row might still be within limits. If this is not desired, you can add a filter to drop within-limit rows:
help %>%
  group_by(ID) %>%
  slice(which.max(abs(H - 45))) %>%
  ungroup() %>%
  filter(!between(H, 41, 49))
The premise for Sodium is the same, using abs and the difference between its value and the midpoint of the desired range (144 for 134-154):
help %>%
  group_by(ID) %>%
  slice(which.max(abs(Sodium - 144))) %>%
  ungroup()
# # A tibble: 4 x 3
#      ID Sodium     H
#   <int>  <int> <dbl>
# 1     1    136  30.6
# 2     2    137  33.3
# 3     3    148  27.9
# 4     4    128  32.4
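If the goal is the single table from the question, with the worst Sodium and the worst H for each ID even when they come from different rows, one option is to summarise each column separately. A minimal sketch, using the same 144 and 45 midpoints as above:

help %>%
  group_by(ID) %>%
  summarise(Sodium = Sodium[which.max(abs(Sodium - 144))],
            H = H[which.max(abs(H - 45))])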
I have a data frame with four columns:
year<- c(2000,2000,2000,2001,2001,2001,2002,2002,2002)
k<- c(12.5,11.5,10.5,-8.5,-9.5,-10.5,13.9,14.9,15.9)
pop<- c(143,147,154,445,429,430,178,181,211)
pop_obs<- c(150,150,150,440,440,440,185,185,185)
df<- data_frame(year,k,pop,pop_obs)
df
year k pop pop_obs
<dbl> <dbl> <dbl> <dbl>
1 2000 12.5 143 150
2 2000 11.5 147 150
3 2000 10.5 154 150
4 2001 -8.5 445 440
5 2001 -9.5 429 440
6 2001 -10.5 430 440
7 2002 13.9 178 185
8 2002 14.9 181 185
9 2002 15.9 211 185
What I want is, for each year, the k whose pop has the minimum difference from pop_obs. Finally, I want to keep the result as a data frame with year and k.
My expected output would be like this:
year k
<dbl> <dbl>
1 2000 11.5
2 2001 -8.5
3 2002 14.9
You could try with dplyr
df <- data.frame(year, k, pop, pop_obs)
library(dplyr)
df %>%
  mutate(diff = abs(pop_obs - pop)) %>%
  group_by(year) %>%
  filter(diff == min(diff)) %>%
  select(year, k)
#> # A tibble: 3 x 2
#> # Groups: year [3]
#> year k
#> <dbl> <dbl>
#> 1 2000 11.5
#> 2 2001 -8.5
#> 3 2002 14.9
Created on 2021-12-11 by the reprex package (v2.0.1)
Try the tidyverse way:
library(tidyverse)
data_you_want <- df %>%
  mutate(dif = abs(pop - pop_obs)) %>%
  group_by(year) %>%
  slice_min(dif, n = 1) %>%
  ungroup() %>%
  select(year, k)
Using base R, ave() flags the rows whose absolute difference equals the minimum within each year:
subset(df, as.logical(ave(abs(pop_obs - pop), year,
FUN = function(x) x == min(x))), select = c('year', 'k'))
# A tibble: 3 × 2
year k
<dbl> <dbl>
1 2000 11.5
2 2001 -8.5
3 2002 14.9
Let's say I have the following stored in p
los tti ID
1 1.002083333 23.516667 84
2 -0.007638889 2.633333 118
3 0.036805556 2.633333 118
4 0.134722222 2.716667 120
5 2.756250000 82.800000 132
6 1.066666667 17.933333 156
7 -2.496250000 12.830948 156
I want to filter out rows with negative values for p$los, but only if p$tti and p$ID are duplicated between the rows. E.g., rows 2 and 3 are duplicated on both p$tti and p$ID, and therefore row 2 should be omitted due to its negative value in p$los.
Rows 6 and 7 are duplicated with regard to p$ID, but not p$tti, and should therefore stay.
I am looking for a solution in dplyr.
p <- structure(list(los = c(1.00208333333333, -0.00763888888888889,
0.0368055555555556, 0.134722222222222, 2.75625, 1.06666666666667,
-0.00763888888888889, 4.84305555555556, 1.79375, 8.55694444444444
), tti = c(23.5166666666667, 2.63333333333333, 2.63333333333333,
2.71666666666667, 82.8, 17.9333333333333, 1.31666666666667, 69.2666666666667,
52.9833333333333, 36.0166666666667), ID = c(84L, 118L, 118L,
120L, 132L, 156L, 179L, 245L, 253L, 334L)), row.names = c(NA,
10L), class = "data.frame")
Depending on your measure, you may want to round your tti column (which is a numeric decimal) to some tolerance, e.g., 3 decimal places, as part of data processing, so that near-identical values are treated as duplicates.
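A minimal sketch of that rounding step (the 3-decimal tolerance is only an example):

library(dplyr)
# treat tti values that agree to 3 decimal places as duplicates
p <- p %>% mutate(tti = round(tti, 3))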
Using dplyr you could try something like:
p %>%
  group_by(tti, ID) %>%
  filter(n() == 1 | los >= 0)
This filters to keep rows where there are no duplicates by tti and ID (n() == 1 for the group) and, where duplicates do exist, rows where los is positive or zero (not negative).
Output
los tti ID
<dbl> <dbl> <int>
1 1.00 23.5 84
2 0.0368 2.63 118
3 0.135 2.72 120
4 2.76 82.8 132
5 1.07 17.9 156
6 -0.00764 1.32 179
7 4.84 69.3 245
8 1.79 53.0 253
9 8.56 36.0 334
If I understood correctly:
library(tidyverse)
df <- read.table(text = " los tti ID
1 1.002083333 23.516667 84
2 -0.007638889 2.633333 118
3 0.036805556 2.633333 118
4 0.134722222 2.716667 120
5 2.756250000 82.800000 132
6 1.066666667 17.933333 156
7 -2.496250000 12.830948 156", header = T)
df %>%
  group_by(ID) %>%
  filter((sd(tti, na.rm = T) + los) > 0 | is.na(sd(tti, na.rm = T))) %>%
  ungroup()
#> # A tibble: 6 x 3
#> los tti ID
#> <dbl> <dbl> <int>
#> 1 1.00 23.5 84
#> 2 0.0368 2.63 118
#> 3 0.135 2.72 120
#> 4 2.76 82.8 132
#> 5 1.07 17.9 156
#> 6 -2.50 12.8 156
Created on 2021-03-15 by the reprex package (v1.0.0)
I am trying to group_by a variable and then do operations per row within each group. I got lost when using ifelse vs case_when; there is something basic I am failing to understand about the usage of the two. I was assuming both would give me the same output, but that is not the case here: ifelse didn't give the expected output but case_when did, and I am trying to understand why.
Here is the example df:
structure(list(Pos = c(73L, 146L, 146L, 150L, 150L, 151L, 151L,
152L, 182L, 182L), Percentage = c(81.2, 13.5, 86.4, 66.1, 33.9,
48.1, 51.9, 86.1, 48, 52)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame")) -> foo
I am grouping by Pos and I want to round Percentage if the group's sum is 100. The following uses ifelse:
library(tidyverse)
foo %>%
  group_by(Pos) %>%
  mutate(sumn = n()) %>%
  mutate(Val = ifelse(sumn == 1, 100,
                      ifelse(sum(Percentage) == 100, unlist(map(Percentage, round)), 0)
                      # case_when(sum(Percentage) == 100 ~ unlist(map(Percentage, round)),
                      #           TRUE ~ 0
                      # )
                      ))
The output is:
# A tibble: 10 x 4
# Groups: Pos [6]
Pos Percentage sumn Val
<int> <dbl> <int> <dbl>
1 73 81.2 1 100
2 146 13.5 2 0
3 146 86.4 2 0
4 150 66.1 2 66
5 150 33.9 2 66
6 151 48.1 2 48
7 151 51.9 2 48
8 152 86.1 1 100
9 182 48 2 48
10 182 52 2 48
I don't want this; rather, I want the following, which I get using case_when:
foo %>%
  group_by(Pos) %>%
  mutate(sumn = n()) %>%
  mutate(Val = ifelse(sumn == 1, 100,
                      # ifelse(sum(Percentage) == 100, unlist(map(Percentage, round)), 0)
                      case_when(sum(Percentage) == 100 ~ unlist(map(Percentage, round)),
                                TRUE ~ 0)
                      ))
# A tibble: 10 x 4
# Groups: Pos [6]
Pos Percentage sumn Val
<int> <dbl> <int> <dbl>
1 73 81.2 1 100
2 146 13.5 2 0
3 146 86.4 2 0
4 150 66.1 2 66
5 150 33.9 2 34
6 151 48.1 2 48
7 151 51.9 2 52
8 152 86.1 1 100
9 182 48 2 48
10 182 52 2 52
What is ifelse doing differently?
According to ?ifelse
A vector of the same length and attributes (including dimensions and "class") as test and data values from the values of yes or no.
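A minimal illustration of that length rule, not from the help page: because the test has length 1, only the first element of yes is returned.

ifelse(TRUE, 1:3, 0)
#> [1] 1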
Here, the test sum(Percentage) == 100 has length 1 per group, so the inner ifelse returns a single value (the first rounded Percentage, or 0), which mutate then recycles across all rows of the group. If we replicate the test with rep() so its length matches the group size, it works:
foo %>%
  group_by(Pos) %>%
  mutate(sumn = n()) %>%
  mutate(Val = ifelse(sumn == 1, 100,
                      ifelse(rep(sum(Percentage) == 100, n()),
                             unlist(map(Percentage, round)), 0)
                      ))
# A tibble: 10 x 4
# Groups: Pos [6]
Pos Percentage sumn Val
<int> <dbl> <int> <dbl>
1 73 81.2 1 100
2 146 13.5 2 0
3 146 86.4 2 0
4 150 66.1 2 66
5 150 33.9 2 34
6 151 48.1 2 48
7 151 51.9 2 52
8 152 86.1 1 100
9 182 48 2 48
10 182 52 2 52
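Since the condition is constant within each group, a plain (non-vectorised) if/else inside the grouped mutate is another option; a sketch, not from the original answer:

foo %>%
  group_by(Pos) %>%
  mutate(Val = if (n() == 1) 100
               else if (sum(Percentage) == 100) round(Percentage)
               else 0) %>%
  ungroup()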
R is doing a very strange thing: it is not giving me an error message, but it is also not computing what I've asked it to compute. I'm attempting to find the standard error of a variable, yet the command produces NAs instead, and I cannot figure out why. Here's my code for getting the mean and standard error:
ReHo_mean_Esc_1 <- ReHo_Group_Esc_1 %>% group_by(Group) %>% summarise(Value=mean(Value), se=sd(Value)/sqrt(n()))
My variable of interest is called Value. Here's my dataframe:
ID Clu Group Value Esc Nal
422 1 LgA 3.26090 94 7.50
501 1 LgA 3.32376 139 15.25
503 1 LgA 2.76855 24 31.50
521 1 LgA 1.81475 -28 6.75
522 1 LgA 1.80966 58 13.00
523 1 LgA 3.97502 76 10.25
603 1 LgA 1.78573 76 18.00
604 1 LgA 3.70577 54 10.00
605 1 LgA 2.93304 51 18.00
613 1 LgA 3.68118 116 17.00
429 1 ShA 2.61634 -33 5.75
430 1 ShA 3.39848 13 12.75
431 1 ShA 3.40785 -33 9.75
432 1 ShA 4.38024 50 4.75
513 1 ShA 4.14605 8 10.50
514 1 ShA 3.86332 0 10.75
518 1 ShA 2.96312 0 13.00
519 1 ShA 2.82937 -33 7.50
610 1 ShA 5.07850 13 26.00
612 1 ShA 4.14895 56 4.00
614 1 ShA 3.83926 42 8.25
My summarize command has no issues producing the mean for each group but it gives me NAs for the standard error and I have no idea why. Any ideas?
Thanks!
Don't name your new variable Value. dplyr differs from base R in that newly created variables are immediately available to later expressions within the same call.
ReHo_Group_Esc_1 %>%
  group_by(Group) %>%
  summarise(mValue = mean(Value), se = sd(Value)/sqrt(n()))
# A tibble: 2 x 3
Group mValue se
<chr> <dbl> <dbl>
1 LgA 2.91 0.266
2 ShA 3.70 0.223
The issue is that by the time you calculate sd(Value), the Value column of length 21 has been converted into a column of length 1 (per group). Two clues:
sd of anything length 1 is NA;
Try replacing sd with length, and you'll see that it's getting just one value (errr, Value :-); this is a play on @CalumYou's comment:
ReHo_Group_Esc_1 %>%
  group_by(Group) %>%
  summarise(Value = mean(Value), se = length(Value))
# # A tibble: 2 x 3
# Group Value se
# <chr> <dbl> <int>
# 1 LgA 2.91 1
# 2 ShA 3.70 1
whereas if you swap the order of calculations, you'll see something different:
ReHo_Group_Esc_1 %>%
  group_by(Group) %>%
  summarise(se = length(Value), Value = mean(Value))
# # A tibble: 2 x 3
# Group se Value
# <chr> <int> <dbl>
# 1 LgA 10 2.91
# 2 ShA 11 3.70
Try calculating sd first:
ReHo_Group_Esc_1 %>%
  group_by(Group) %>%
  summarise(
    se = sd(Value)/sqrt(n()),
    Value = mean(Value)
  )
# # A tibble: 2 x 3
# Group se Value
# <chr> <dbl> <dbl>
# 1 LgA 0.266 2.91
# 2 ShA 0.223 3.70
You can try keeping the original Value column intact by giving the mean a different name (and adding na.rm in case of missing values):
ReHo_Group_Esc_1 %>%
  group_by(Group) %>%
  summarise(mean_Value = mean(Value, na.rm = TRUE),
            se = sd(Value, na.rm = TRUE)/sqrt(n()))
Question
I use time-series data regularly. Sometimes, I would like to transmute an entire data frame to obtain some data frame of growth rates, or shares, for example.
When using transmute this is relatively straightforward. But when I have a lot of columns to transmute and I want to keep the date column, I'm not sure if that's possible.
Below, using the economics data set, is an example of what I mean.
Example
library(dplyr)
economics %>%
  transmute(date,
            pce * 10,
            pop * 10,
            psavert * 10)
# A tibble: 574 x 4
date `pce * 10` `pop * 10` `psavert * 10`
<date> <dbl> <dbl> <dbl>
1 1967-07-01 5067 1987120 126
2 1967-08-01 5098 1989110 126
3 1967-09-01 5156 1991130 119
4 1967-10-01 5122 1993110 129
5 1967-11-01 5174 1994980 128
6 1967-12-01 5251 1996570 118
7 1968-01-01 5309 1998080 117
8 1968-02-01 5336 1999200 123
9 1968-03-01 5443 2000560 117
10 1968-04-01 5440 2002080 123
# ... with 564 more rows
Now, using transmute_at. The code below predictably drops date, since it is excluded in the .vars argument, but I haven't found a way to exclude date from the transformation while reintroducing it via .funs, so that the resulting data frame looks as it does above. Any ideas?
economics %>%
  transmute_at(.vars = vars(-c(date, uempmed, unemploy)),
               .funs = list("trans" = ~ . * 10))
# A tibble: 574 x 3
pce_trans pop_trans psavert_trans
<dbl> <dbl> <dbl>
1 5067 1987120 126
2 5098 1989110 126
3 5156 1991130 119
4 5122 1993110 129
5 5174 1994980 128
6 5251 1996570 118
7 5309 1998080 117
8 5336 1999200 123
9 5443 2000560 117
10 5440 2002080 123
# ... with 564 more rows
We can use if/else inside the function, so that non-numeric columns (here, date) are passed through unchanged.
library(dplyr)
library(ggplot2)
data(economics)
economics %>%
  transmute_at(vars(date:psavert), ~ if (is.numeric(.)) . * 10 else .)
# A tibble: 574 x 4
# date pce pop psavert
# <date> <dbl> <dbl> <dbl>
# 1 1967-07-01 5067 1987120 126
# 2 1967-08-01 5098 1989110 126
# 3 1967-09-01 5156 1991130 119
# 4 1967-10-01 5122 1993110 129
# 5 1967-11-01 5174 1994980 128
# 6 1967-12-01 5251 1996570 118
# 7 1968-01-01 5309 1998080 117
# 8 1968-02-01 5336 1999200 123
# 9 1968-03-01 5443 2000560 117
#10 1968-04-01 5440 2002080 123
# … with 564 more rows
If we need to change the column names selectively, we can do this after the transmute_at:
library(stringr)
economics %>%
  transmute_at(vars(date:psavert), ~ if (is.numeric(.)) . * 10 else .) %>%
  rename_at(vars(-date), ~ str_c(., '_trans'))
# A tibble: 574 x 4
# date pce_trans pop_trans psavert_trans
# <date> <dbl> <dbl> <dbl>
# 1 1967-07-01 5067 1987120 126
# 2 1967-08-01 5098 1989110 126
# 3 1967-09-01 5156 1991130 119
# 4 1967-10-01 5122 1993110 129
# 5 1967-11-01 5174 1994980 128
# 6 1967-12-01 5251 1996570 118
# 7 1968-01-01 5309 1998080 117
# 8 1968-02-01 5336 1999200 123
# 9 1968-03-01 5443 2000560 117
#10 1968-04-01 5440 2002080 123
# … with 564 more rows
If we want to change the column names of all the selected columns within transmute_at, use list(trans = ...):
economics %>%
  transmute_at(vars(date:psavert), list(trans = ~ if (is.numeric(.)) . * 10 else .))
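As a side note, the scoped verbs such as transmute_at are superseded by across() in dplyr 1.0.0 and later; a sketch of the same idea with across(), assuming the same economics columns:

economics %>%
  transmute(date,
            across(pce:psavert, ~ .x * 10, .names = "{.col}_trans"))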