R is doing something very strange: it is not giving me an error message, it is just not computing what I've told it to compute. I'm attempting to find the standard error of a variable, but the command produces NAs instead and I cannot figure out why. Here's my code for getting the mean and standard error:
ReHo_mean_Esc_1 <- ReHo_Group_Esc_1 %>% group_by(Group) %>% summarise(Value=mean(Value), se=sd(Value)/sqrt(n()))
My variable of interest is called Value. Here's my dataframe:
ID Clu Group Value Esc Nal
422 1 LgA 3.26090 94 7.50
501 1 LgA 3.32376 139 15.25
503 1 LgA 2.76855 24 31.50
521 1 LgA 1.81475 -28 6.75
522 1 LgA 1.80966 58 13.00
523 1 LgA 3.97502 76 10.25
603 1 LgA 1.78573 76 18.00
604 1 LgA 3.70577 54 10.00
605 1 LgA 2.93304 51 18.00
613 1 LgA 3.68118 116 17.00
429 1 ShA 2.61634 -33 5.75
430 1 ShA 3.39848 13 12.75
431 1 ShA 3.40785 -33 9.75
432 1 ShA 4.38024 50 4.75
513 1 ShA 4.14605 8 10.50
514 1 ShA 3.86332 0 10.75
518 1 ShA 2.96312 0 13.00
519 1 ShA 2.82937 -33 7.50
610 1 ShA 5.07850 13 26.00
612 1 ShA 4.14895 56 4.00
614 1 ShA 3.83926 42 8.25
My summarise command has no issues producing the mean for each group, but it gives me NAs for the standard error and I have no idea why. Any ideas?
Thanks!
Don't name your new variable Value. dplyr differs from base R in that it makes newly created variables immediately available within the same function:
ReHo_Group_Esc_1 %>%
group_by(Group) %>%
summarise(mValue=mean(Value), se=sd(Value)/sqrt(n()))
# A tibble: 2 x 3
Group mValue se
<chr> <dbl> <dbl>
1 LgA 2.91 0.266
2 ShA 3.70 0.223
The issue is that by the time you calculate sd(Value), the Value column of length 21 has been converted into a column of length 1 (per group). Two clues:
sd of anything length 1 is NA;
Try replacing sd with length, and you'll see that it's getting just one value (errr, Value :-); this is a play on @CalumYou's comment:
ReHo_Group_Esc_1 %>%
group_by(Group) %>%
summarise(Value=mean(Value), se=length(Value))
# # A tibble: 2 x 3
# Group Value se
# <chr> <dbl> <int>
# 1 LgA 2.91 1
# 2 ShA 3.70 1
whereas if you swap the order of calculations, you'll see something different:
ReHo_Group_Esc_1 %>%
group_by(Group) %>%
summarise(se=length(Value), Value=mean(Value))
# # A tibble: 2 x 3
# Group se Value
# <chr> <int> <dbl>
# 1 LgA 10 2.91
# 2 ShA 11 3.70
Try calculating sd first:
ReHo_Group_Esc_1 %>%
group_by(Group) %>%
summarise(
se = sd(Value)/sqrt(n()),
Value = mean(Value)
)
# # A tibble: 2 x 3
# Group se Value
# <chr> <dbl> <dbl>
# 1 LgA 0.266 2.91
# 2 ShA 0.223 3.70
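To see the first clue in isolation: the standard deviation of a single number is undefined (the n-1 denominator is zero), so R returns NA:
sd(3.26)
# [1] NA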
You can try the following; the new variable is named mValue so the original Value column is still intact when se is computed, and na.rm = TRUE guards against missing values:
ReHo_Group_Esc_1 %>% group_by(Group) %>%
  summarise(mValue = mean(Value, na.rm = TRUE), se = sd(Value, na.rm = TRUE)/sqrt(n()))
I have a dataset called PimaDiabetes.
PimaDiabetes <- read.csv("PimaDiabetes.csv")
PimaDiabetes[2:8][PimaDiabetes[2:8]==0] <- NA
mean_1 = 40.5
mean_0 = 30.7
p.tib <- PimaDiabetes %>%
as_tibble()
I'm trying to navigate the columns in such a way that I can group the dataset by Outcomes (so to select for Outcome 0 and 1), and impute a different value (the median of the respected groups) into columns depending on the outcomes.
So for instance, in the fifth column, Insulin, there are some NA values down the line where the Outcome is 1, and some where the Outcome is 0. I would like to place one value (mean_1, 40.5) into it when the value in a row is NA and the Outcome is 1, and mean_0 (30.7) when the value is NA and the Outcome is 0.
I've gotten advice prior to this and tried:
p.tib %>%
mutate(
p.tib$Insulin = case_when((p.tib$Outcome == 0) & (is.na(p.tib$Insulin)) ~ IN_0,
(p.tib$Outcome == 1) & (is.na(p.tib$Insulin) ~ IN_1,
TRUE ~ p.tib$Insulin))
However it constantly yields the following error:
Error: unexpected '=' in "p.tib %>% mutate(p.tib$Insulin ="
Could someone tell me where things are going wrong, please?
Setup
It appears this dataset is also in the pdp package in R, called pima. The only major difference between the R package data and yours is that the pima dataset's Outcome variable is simply called "diabetes" instead and is labeled "pos" and "neg" instead of 0/1. I have loaded that package and the tidyverse to help.
#### Load Libraries ####
library(pdp)
library(tidyverse)
First I transformed the data into a tibble so it was easier for me to read.
#### Reformat Data ####
p.tib <- pima %>%
as_tibble()
Printing p.tib, we can see that the insulin variable has a lot of NA values in the first rows, which will be quicker to visualize later than some of the other variables that have missing data. Therefore, I used that instead of glucose, but the idea is the same.
# A tibble: 768 × 9
pregnant glucose press…¹ triceps insulin mass pedig…² age diabe…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 6 148 72 35 NA 33.6 0.627 50 pos
2 1 85 66 29 NA 26.6 0.351 31 neg
3 8 183 64 NA NA 23.3 0.672 32 pos
4 1 89 66 23 94 28.1 0.167 21 neg
5 0 137 40 35 168 43.1 2.29 33 pos
6 5 116 74 NA NA 25.6 0.201 30 neg
7 3 78 50 32 88 31 0.248 26 pos
8 10 115 NA NA NA 35.3 0.134 29 neg
9 2 197 70 45 543 30.5 0.158 53 pos
10 8 125 96 NA NA NA 0.232 54 pos
# … with 758 more rows, and abbreviated variable names ¹pressure,
# ²pedigree, ³diabetes
# ℹ Use `print(n = ...)` to see more rows
Finding the Mean
After glimpsing the data, I checked the mean for each group who did and didn't have diabetes by first grouping by diabetes with group_by, then collapsing the data frame into a summary of each group's mean, thus creating the mean_insulin variable (which you can see removes NA values to derive the mean):
#### Check Mean by Group ####
p.tib %>%
group_by(diabetes) %>%
summarise(mean_insulin = mean(insulin,
na.rm=T))
The values we should be imputing are shown below. Here the groups are labeled "neg" (0 in your data) and "pos" (1 in your data). You can convert these groups into those numbers if you want, but I left them as is so the output is easier to read:
# A tibble: 2 × 2
diabetes mean_insulin
<fct> <dbl>
1 neg 130.
2 pos 207.
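If you do prefer 0/1, a quick recode works; this is just a sketch (diabetes is a factor in pima, and if you run it, the case_when below would then need 0/1 instead of "pos"/"neg"):
p.tib <- p.tib %>%
  mutate(diabetes = if_else(diabetes == "pos", 1, 0))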
Mean Imputation
From there, we use case_when as a vectorized ifelse. Inside mutate we transform insulin with three tests: if the group is negative and the value is NA, we substitute the group mean of 130; if the group is positive under the same condition, we use 207; and for all other values (the TRUE part), we keep the original insulin value. The & operator says the transformation only takes place if both tests are true, and what follows the ~ is the replacement value.
#### Impute Mean ####
p.tib %>%
mutate(
insulin = case_when(
(diabetes == "neg") & (is.na(insulin)) ~ 130,
(diabetes == "pos") & (is.na(insulin)) ~ 207,
TRUE ~ insulin
)
)
You will now notice that the first rows of insulin data are replaced with the mutation and the rest are left alone:
# A tibble: 768 × 9
pregnant glucose press…¹ triceps insulin mass pedig…² age diabe…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 6 148 72 35 207 33.6 0.627 50 pos
2 1 85 66 29 130 26.6 0.351 31 neg
3 8 183 64 NA 207 23.3 0.672 32 pos
4 1 89 66 23 94 28.1 0.167 21 neg
5 0 137 40 35 168 43.1 2.29 33 pos
6 5 116 74 NA 130 25.6 0.201 30 neg
7 3 78 50 32 88 31 0.248 26 pos
8 10 115 NA NA 130 35.3 0.134 29 neg
9 2 197 70 45 543 30.5 0.158 53 pos
10 8 125 96 NA 207 NA 0.232 54 pos
# … with 758 more rows, and abbreviated variable names ¹pressure,
# ²pedigree, ³diabetes
# ℹ Use `print(n = ...)` to see more rows
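Since the question actually mentions imputing each group's median, you could also skip the hardcoded constants and compute the statistic on the fly inside a grouped mutate. A sketch on the same p.tib (swap median for mean if you prefer):
p.tib %>%
  group_by(diabetes) %>%
  mutate(insulin = if_else(is.na(insulin),
                           median(insulin, na.rm = TRUE),
                           insulin)) %>%
  ungroup()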
I'd like to create a new velocity variable. In my data set:
library(dplyr)
library(tidyr)
day <- c(0,47,76,118,160,193,227,262,306,355,396,450)
AT <- c(0.14,0.48,0.83,0.83,0.94,0.94,0.94,0.94,0.94,11.93,12.81,29.36)
ClassType <- c("Class_0_1","Class_0_1","Class_0_1","Class_0_1","Class_0_1","Class_0_1",
"Class_0_1","Class_0_1","Class_0_1","Class_9_25","Class_9_25","Class_25_50")
ClassMax <-c(1,1,1,1,1,1,1,1,1,25,25,50)
my.ds <- data.frame(day,AT,ClassType,ClassMax)
my.ds
# day AT ClassType ClassMax
# 1 0 0.14 Class_0_1 1
# 2 47 0.48 Class_0_1 1
# 3 76 0.83 Class_0_1 1
# 4 118 0.83 Class_0_1 1
# 5 160 0.94 Class_0_1 1
# 6 193 0.94 Class_0_1 1
# 7 227 0.94 Class_0_1 1
# 8 262 0.94 Class_0_1 1
# 9 306 0.94 Class_0_1 1
# 10 355 11.93 Class_9_25 25
# 11 396 12.81 Class_9_25 25
# 12 450 29.36 Class_25_50 50
If ClassType changes, take the next AT value minus the current ClassType's last AT value and divide by the difference between the two corresponding dates. In my case:
(11.93-0.94) / (355-306)
#[1] 0.2242857
(12.81-11.93) / (396-355)
#[1] 0.02146341
(29.36-12.81) / (450-396)
#[1] 0.3064815
But if AT is in a new ClassType yet the class does not change based on ClassMax, then ignore it.
I have a min-to-max custom ordering: complete.cases <- c("Class_0_1","Class_1_3","Class_3_9","Class_9_25","Class_25_50","Class_50").
I'd like to repeat the last velocity value for any intermediate ClassType levels that are absent.
I tried this without success:
my.ds$velocity <- c(0,diff(my.ds$AT))/c(0,diff(my.ds$day))
final.ds <- %>%
group_by(nest,ClassType)%>%
summarize(velocity=mean(velocity)) %>%
complete(ClassType, tidyr:fill = list(velocity = NA)) %>%
fill(velocity, .direction = "downup")
}
My desired output should be:
final.ds
# ClassType velocity
# Class_0_1  0.224285714
# Class_1_3  0.224285714
# Class_3_9  0.224285714
# Class_9_25 0.224285714
# Class_9_25 0.021463415
# Class_9_25 0.306481481
Any help with this, please?
How about this:
my.ds %>%
group_by(ClassType) %>%
mutate(velocity = c(NA, diff(AT) / diff(day))) %>%
ungroup()
# # A tibble: 12 x 5
# day AT ClassType ClassMax velocity
# <dbl> <dbl> <chr> <dbl> <dbl>
# 1 0 0.14 Class_0_1 1 NA
# 2 47 0.48 Class_0_1 1 0.00723
# 3 76 0.83 Class_0_1 1 0.0121
# 4 118 0.83 Class_0_1 1 0
# 5 160 0.94 Class_0_1 1 0.00262
# 6 193 0.94 Class_0_1 1 0
# 7 227 0.94 Class_0_1 1 0
# 8 262 0.94 Class_0_1 1 0
# 9 306 0.94 Class_0_1 1 0
# 10 355 11.9 Class_9_25 25 NA
# 11 396 12.8 Class_9_25 25 0.0215
# 12 450 29.4 Class_25_50 50 NA
complete.cases <- c("Class_0_1","Class_1_3","Class_3_9", "Class_9_25","Class_25_50")
my.ds %>%
  group_by(ClassType = factor(ClassType, levels = complete.cases),
           grp = lag(match(ClassType, unique(ClassType)), default = 1)) %>%
  slice_tail(n = 1) %>%
  ungroup() %>%
  summarise(ClassType, velocity = c(NA, diff(AT)) / c(NA, diff(day))) %>%
  complete(ClassType) %>%
  fill(velocity, .direction = "updown")
# ClassType velocity
# <fct> <dbl>
# 1 Class_0_1 0.224
# 2 Class_1_3 0.224
# 3 Class_3_9 0.224
# 4 Class_9_25 0.224
# 5 Class_9_25 0.0215
# 6 Class_25_50 0.306
Below are the sample data and code. I have two issues. First, I need the indtotal column to be the sum by twodigit code, held constant across rows as shown below, so that I can do a simple calculation of one column divided by the other to arrive at the smbshare number. When I try the following,
second <- first %>%
group_by(twodigit,smb) %>%
summarize(indtotal = sum(employment))
it breaks it down by twodigit and smb.
The second issue is having it produce a 0 if the value does not exist. The best example is twodigit code 51 and smb = 4: when there are not 4 distinct smb values for a given twodigit, I am looking for it to produce a 0.
Note: smb is short for small business
naicstest <- c(512131,512141,521921,522654,512131,536978,541214,531214,621112,541213,551212,574121,569887,541211,523141,551122,512312,521114,522112)
employment <- c(11,130,315,17,190,21,22,231,15,121,19,21,350,110,515,165,12,110,111)
smb <- c(1,2,3,1,3,1,1,3,1,2,1,1,4,2,4,3,1,2,2)
first <- data.frame(naicstest,employment,smb)
first<-first %>% mutate(twodigit = substr(naicstest,1,2))
second <- first %>% group_by(twodigit) %>% summarize(indtotal = sum(employment))
Desired result is below
twodigit indtotal smb smbtotal        smbshare
51            343   1 23 (11+12)        23/343
51            343   2 130              130/343
51            343   3 190              190/343
51            343   4 0                  0/343
52           1068   1 17               17/1068
52           1068   2 221 (110+111)   221/1068
52           1068   3 315             315/1068
52           1068   4 515             515/1068
This gives you all the columns you need, but in a slightly different order. You could use select or relocate to get them in the order you want, I suppose:
first %>%
group_by(twodigit, smb) %>%
summarize(smbtotal = sum(employment)) %>%
ungroup() %>%
complete(twodigit, smb, fill = list('smbtotal' = 0)) %>%
group_by(twodigit) %>%
mutate(
indtotal = sum(smbtotal),
smbshare = smbtotal / indtotal
)
`summarise()` has grouped output by 'twodigit'. You can override using the `.groups` argument.
# A tibble: 32 × 5
# Groups: twodigit [8]
twodigit smb smbtotal indtotal smbshare
<chr> <dbl> <dbl> <dbl> <dbl>
1 51 1 23 343 0.0671
2 51 2 130 343 0.379
3 51 3 190 343 0.554
4 51 4 0 343 0
5 52 1 17 1068 0.0159
6 52 2 221 1068 0.207
7 52 3 315 1068 0.295
8 52 4 515 1068 0.482
9 53 1 21 252 0.0833
10 53 2 0 252 0
# … with 22 more rows
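As an aside, the `summarise()` message above is purely informational. If it bothers you, dplyr (>= 1.0) lets you pass .groups = "drop" to silence it and drop the grouping in one step, making the explicit ungroup() unnecessary; a sketch:
first %>%
  group_by(twodigit, smb) %>%
  summarize(smbtotal = sum(employment), .groups = "drop") %>%
  complete(twodigit, smb, fill = list(smbtotal = 0)) %>%
  group_by(twodigit) %>%
  mutate(
    indtotal = sum(smbtotal),
    smbshare = smbtotal / indtotal
  )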
I have a dataframe in long format which is organised in this way:
help<- read.table(text="
ID Sodium H
1 140 31.9
1 138 29.6
1 136 30.6
2 145 35.9
2 137 33.3
3 148 27.9
4 139 30.0
4 128 32.4
4 143 35.3
4 133 NA", header = TRUE)
I need the worst value for each subject (ID) for Sodium and H. The worst value for H is defined as the value furthest from the range 41-49, while the worst value for Sodium is defined as the value furthest from the range 134-154.
The end result should therefore become something like this:
help<- read.table(text="
ID Sodium H
1 136 29.6
2 137 33.3
3 148 27.9
4 128 30.0 ", header=TRUE)
What is the easiest way to do this? Using aggregate function or dplyr? Or something else? Thank you in advance!
Here's a tidy version:
library(dplyr)
help %>%
group_by(ID) %>%
slice(which.max(abs(H - 45))) %>%
ungroup()
# # A tibble: 4 x 3
#      ID Sodium     H
#   <int>  <int> <dbl>
# 1     1    138  29.6
# 2     2    137  33.3
# 3     3    148  27.9
# 4     4    139  30
If it's possible that an ID has nothing out of limits, then the "worst" might be a value that is within limits. If this is not desired, you can always add a filter to drop within-limits rows:
help %>%
group_by(ID) %>%
slice(which.max(abs(H - 45))) %>%
ungroup() %>%
filter(!between(H, 41, 49))
The premise for Sodium is the same, using abs and the difference between its value and the midpoint (144) of the desired range:
help %>%
group_by(ID) %>%
slice(which.max(abs(Sodium - 144))) %>%
ungroup()
# # A tibble: 4 x 3
#      ID Sodium     H
#   <int>  <int> <dbl>
# 1     1    136  30.6
# 2     2    137  33.3
# 3     3    148  27.9
# 4     4    128  32.4
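If you want the single table from the question (the worst Sodium and the worst H per ID, each chosen independently of the other), both rules fit in one summarise. A sketch using the same midpoints; note that which.max skips the NA in H:
help %>%
  group_by(ID) %>%
  summarise(
    Sodium = Sodium[which.max(abs(Sodium - 144))],
    H      = H[which.max(abs(H - 45))]
  )
# # A tibble: 4 x 3
#      ID Sodium     H
#   <int>  <int> <dbl>
# 1     1    136  29.6
# 2     2    137  33.3
# 3     3    148  27.9
# 4     4    128  30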
Question
I use time-series data regularly. Sometimes, I would like to transmute an entire data frame to obtain some data frame of growth rates, or shares, for example.
When using transmute this is relatively straight-forward. But when I have a lot of columns to transmute and I want to keep the date column, I'm not sure if that's possible.
Below, using the economics data set, is an example of what I mean.
Example
library(dplyr)
economics %>%
transmute(date,
pce * 10,
pop * 10,
psavert * 10)
# A tibble: 574 x 4
date `pce * 10` `pop * 10` `psavert * 10`
<date> <dbl> <dbl> <dbl>
1 1967-07-01 5067 1987120 126
2 1967-08-01 5098 1989110 126
3 1967-09-01 5156 1991130 119
4 1967-10-01 5122 1993110 129
5 1967-11-01 5174 1994980 128
6 1967-12-01 5251 1996570 118
7 1968-01-01 5309 1998080 117
8 1968-02-01 5336 1999200 123
9 1968-03-01 5443 2000560 117
10 1968-04-01 5440 2002080 123
# ... with 564 more rows
Now, using transmute_at. The below predictably removes date, since it is excluded in the .vars argument, but I haven't found a way of excluding date from the transformation while keeping it in the result, so that the data frame looks as it does above. Any ideas?
economics %>%
transmute_at(.vars = vars(-c(date, uempmed, unemploy)),
.funs = list("trans" = ~ . * 10))
# A tibble: 574 x 3
pce_trans pop_trans psavert_trans
<dbl> <dbl> <dbl>
1 5067 1987120 126
2 5098 1989110 126
3 5156 1991130 119
4 5122 1993110 129
5 5174 1994980 128
6 5251 1996570 118
7 5309 1998080 117
8 5336 1999200 123
9 5443 2000560 117
10 5440 2002080 123
# ... with 564 more rows
We can use if/else inside the function.
library(dplyr)
library(ggplot2)
data(economics)
economics %>%
transmute_at(vars(date:psavert), ~ if(is.numeric(.)) .* 10 else .)
# A tibble: 574 x 4
# date pce pop psavert
# <date> <dbl> <dbl> <dbl>
# 1 1967-07-01 5067 1987120 126
# 2 1967-08-01 5098 1989110 126
# 3 1967-09-01 5156 1991130 119
# 4 1967-10-01 5122 1993110 129
# 5 1967-11-01 5174 1994980 128
# 6 1967-12-01 5251 1996570 118
# 7 1968-01-01 5309 1998080 117
# 8 1968-02-01 5336 1999200 123
# 9 1968-03-01 5443 2000560 117
#10 1968-04-01 5440 2002080 123
# … with 564 more rows
If we need to change the column names selectively, we can do this after the transmute_at:
library(stringr)
economics %>%
transmute_at(vars(date:psavert), ~ if(is.numeric(.)) .* 10 else .) %>%
rename_at(vars(-date), ~ str_c(., '_trans'))
# A tibble: 574 x 4
# date pce_trans pop_trans psavert_trans
# <date> <dbl> <dbl> <dbl>
# 1 1967-07-01 5067 1987120 126
# 2 1967-08-01 5098 1989110 126
# 3 1967-09-01 5156 1991130 119
# 4 1967-10-01 5122 1993110 129
# 5 1967-11-01 5174 1994980 128
# 6 1967-12-01 5251 1996570 118
# 7 1968-01-01 5309 1998080 117
# 8 1968-02-01 5336 1999200 123
# 9 1968-03-01 5443 2000560 117
#10 1968-04-01 5440 2002080 123
# … with 564 more rows
If we are changing the column names of all the selected columns within transmute_at itself, use list(trans = ...):
economics %>%
transmute_at(vars(date:psavert), list(trans = ~if(is.numeric(.)) .* 10 else .))
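For what it's worth, the scoped _at/_all verbs are superseded in dplyr (>= 1.0). The same result can be written with across(), which leaves date untouched and handles the renaming via its .names argument; a sketch:
economics %>%
  transmute(date, across(pce:psavert, ~ .x * 10, .names = "{.col}_trans"))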