dplyr summarize output - how to save it - r

I need to calculate summary statistics for observations of bird breeding activity for each of 150 species. The data frame has the species (sCodef), the type of observation (codef; e.g. nest building), and the ordinal date (days since 1 January, since the data were collected over multiple years). Using dplyr I get exactly the result I want.
library(dplyr)
library(tidyr)
phenology %>% group_by(sCodef, codef) %>%
summarize(N=n(), Min=min(jdate), Max=max(jdate), Median=median(jdate))
# A tibble: 552 x 6
# Groups: sCodef [?]
sCodef codef N Min Max Median
<fct> <fct> <int> <dbl> <dbl> <dbl>
1 ABDU AY 3 172 184 181
2 ABDU FL 12 135 225 188
3 ACFL AY 18 165 222 195
4 ACFL CN 4 142 156 152.
5 ACFL FL 10 166 197 192.
6 ACFL NB 6 139 184 150.
7 ACFL NY 6 166 207 182
8 AMCO FL 1 220 220 220
9 AMCR AY 53 89 198 161
10 AMCR FL 78 133 225 166.
# ... with 542 more rows
How do I get these summary statistics into some sort of data object so that I can export them to use ultimately in a Word document? I have tried this and gotten an error. All of the many explanations of summarize I have reviewed just show the summary data on screen. Thanks
out3 <- summarize(N=n(), Min=min(jdate), Max=max(jdate), median=median(jdate))
Error: This function should not be called directly

The error occurs because summarize() was called on its own, without the grouped data frame as its first argument (normally supplied by the pipe). Assign the whole pipeline to a variable, then write it to a csv (note the argument is file, not filename):
summarydf <- phenology %>%
  group_by(sCodef, codef) %>%
  summarize(N = n(), Min = min(jdate), Max = max(jdate), Median = median(jdate))
write.csv(summarydf, file = "yourfilenamehere.csv", row.names = FALSE)
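Since the ultimate target is a Word document, a CSV can simply be opened and pasted in, but the table can also be rendered to .docx directly. A sketch, assuming the flextable package is installed (flextable() and save_as_docx() are that package's functions; the output path is a hypothetical name):

``` r
# Sketch: render the saved summary table straight to a Word file.
# Assumes the flextable package is installed; the .docx path is hypothetical.
library(flextable)

ft <- flextable(summarydf)   # summarydf from the pipeline above
save_as_docx(ft, path = "phenology_summary.docx")
```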

Related

Why is My Multiple Selection Not Working in R?

I have a dataset called PimaDiabetes.
PimaDiabetes <- read.csv("PimaDiabetes.csv")
PimaDiabetes[2:8][PimaDiabetes[2:8]==0] <- NA
mean_1 = 40.5
mean_0 = 30.7
p.tib <- PimaDiabetes %>%
as_tibble()
I'm trying to navigate the columns in such a way that I can group the dataset by Outcome (0 and 1) and impute a different value (the median of the respective group) into a column's NA values depending on the outcome.
So for instance, in the fifth column, Insulin, there are some NA values down the line where the Outcome is 1, and some where the Outcome is 0. I would like to place mean_1 (40.5) into a cell when its value is NA and the Outcome is 1, and mean_0 (30.7) when the value is NA and the Outcome is 0.
I've gotten advice prior to this and tried:
p.tib %>%
mutate(
p.tib$Insulin = case_when((p.tib$Outcome == 0) & (is.na(p.tib$Insulin)) ~ IN_0,
(p.tib$Outcome == 1) & (is.na(p.tib$Insulin) ~ IN_1,
TRUE ~ p.tib$Insulin))
However it constantly yields the following error:
Error: unexpected '=' in "p.tib %>% mutate(p.tib$Insulin ="
Can I know where things are going wrong, please?
Setup
It appears this dataset is also available in R's pdp package as pima. The only major difference from your data is that the pima dataset's outcome variable is called "diabetes" and is labeled "pos"/"neg" rather than 1/0. I have loaded that package and the tidyverse to help.
#### Load Libraries ####
library(pdp)
library(tidyverse)
First I transformed the data into a tibble so it was easier for me to read.
#### Reformat Data ####
p.tib <- pima %>%
as_tibble()
Printing p.tib, we can see that the insulin variable has a lot of NA values in the first rows, which will be quicker to visualize later than some of the other variables that have missing data. Therefore, I used that instead of glucose, but the idea is the same.
# A tibble: 768 × 9
pregnant glucose press…¹ triceps insulin mass pedig…² age diabe…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 6 148 72 35 NA 33.6 0.627 50 pos
2 1 85 66 29 NA 26.6 0.351 31 neg
3 8 183 64 NA NA 23.3 0.672 32 pos
4 1 89 66 23 94 28.1 0.167 21 neg
5 0 137 40 35 168 43.1 2.29 33 pos
6 5 116 74 NA NA 25.6 0.201 30 neg
7 3 78 50 32 88 31 0.248 26 pos
8 10 115 NA NA NA 35.3 0.134 29 neg
9 2 197 70 45 543 30.5 0.158 53 pos
10 8 125 96 NA NA NA 0.232 54 pos
# … with 758 more rows, and abbreviated variable names ¹​pressure,
# ²​pedigree, ³​diabetes
# ℹ Use `print(n = ...)` to see more rows
Finding the Mean
After glimpsing the data, I checked the mean for each group (those with and without diabetes) by grouping with group_by(diabetes), then collapsing the data frame into a summary of each group's mean, creating the mean_insulin variable (NA values are removed so the mean can be computed):
#### Check Mean by Group ####
p.tib %>%
group_by(diabetes) %>%
summarise(mean_insulin = mean(insulin,
na.rm=T))
The values we should be imputing are below. Here the groups are labeled "neg" (0 in your data) and "pos" (1 in your data). You can convert these labels into those numbers if you want, but I left them as is for readability:
# A tibble: 2 × 2
diabetes mean_insulin
<fct> <dbl>
1 neg 130.
2 pos 207.
Mean Imputation
From there, we will use case_when as a vectorized ifelse statement. First, we use mutate to transform insulin. Then we use case_when by setting up three tests. First, if the group is negative and the value is NA, we turn it into the mean value of 130. If the group is positive for the same condition, we use 207. For all other values (the TRUE part), we just use the normal value of insulin. The & operator here just says "this transformation can only take place if both of these tests are true". What follows the ~ is the transformation to take place.
#### Impute Mean ####
p.tib %>%
mutate(
insulin = case_when(
(diabetes == "neg") & (is.na(insulin)) ~ 130,
(diabetes == "pos") & (is.na(insulin)) ~ 207,
TRUE ~ insulin
)
)
You will now notice that the first rows of insulin data are replaced with the mutation and the rest are left alone:
# A tibble: 768 × 9
pregnant glucose press…¹ triceps insulin mass pedig…² age diabe…³
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 6 148 72 35 207 33.6 0.627 50 pos
2 1 85 66 29 130 26.6 0.351 31 neg
3 8 183 64 NA 207 23.3 0.672 32 pos
4 1 89 66 23 94 28.1 0.167 21 neg
5 0 137 40 35 168 43.1 2.29 33 pos
6 5 116 74 NA 130 25.6 0.201 30 neg
7 3 78 50 32 88 31 0.248 26 pos
8 10 115 NA NA 130 35.3 0.134 29 neg
9 2 197 70 45 543 30.5 0.158 53 pos
10 8 125 96 NA 207 NA 0.232 54 pos
# … with 758 more rows, and abbreviated variable names ¹​pressure,
# ²​pedigree, ³​diabetes
# ℹ Use `print(n = ...)` to see more rows
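As a hedged side note: rather than hardcoding 130 and 207, each group's own mean can be computed and imputed in a single grouped mutate. A minimal self-contained sketch on toy data (not the pima set), using if_else:

``` r
library(dplyr)

# Toy data standing in for pima: one NA per group.
toy <- tibble(
  diabetes = c("neg", "neg", "pos", "pos"),
  insulin  = c(100, NA, 200, NA)
)

# Within each group, replace NA with that group's own mean.
toy %>%
  group_by(diabetes) %>%
  mutate(insulin = if_else(is.na(insulin),
                           mean(insulin, na.rm = TRUE),
                           insulin)) %>%
  ungroup()
# The "neg" NA becomes 100 and the "pos" NA becomes 200.
```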

Counting the number of storms and listing their names?

The question is: Count the number of storms per year since 1975 and write their names
What I have so far is the code below:
storms %>%
count(name, year >= 1975, sort = TRUE)
I got this output:
name `year >= 1975` n
<chr> <lgl> <int>
1 Emily TRUE 217
2 Bonnie TRUE 209
3 Alberto TRUE 184
4 Claudette TRUE 180
5 Felix TRUE 178
6 Danielle TRUE 165
7 Josephine TRUE 165
8 Edouard TRUE 159
9 Gordon TRUE 158
10 Gabrielle TRUE 157
I think this is the correct output, but I just want to make sure.
Technically, the year >= 1975 part should go in filter, not in count. However, since the data start in 1975, your output is correct as well; you get the same result even if you remove the filter step from the code below.
library(dplyr)
storms %>% filter(year >= 1975) %>% count(name, sort = TRUE)
# A tibble: 214 × 2
# name n
# <chr> <int>
# 1 Emily 217
# 2 Bonnie 209
# 3 Alberto 184
# 4 Claudette 180
# 5 Felix 178
# 6 Danielle 165
# 7 Josephine 165
# 8 Edouard 159
# 9 Gordon 158
#10 Gabrielle 157
# … with 204 more rows
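One caveat: n here counts rows (one per observation time), not distinct storms. If the task is read literally as "count the number of storms per year and write their names", a sketch like the following (one row per year, with distinct storm names collapsed into a string) may be closer to what is asked:

``` r
library(dplyr)

storms %>%
  filter(year >= 1975) %>%
  distinct(year, name) %>%        # one row per storm per year
  group_by(year) %>%
  summarise(n_storms = n(),
            names = paste(name, collapse = ", "))
```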

How best to calculate relative shares of different columns in R?

Below are the sample data and code. I have two issues. First, I need the indtotal column to be the sum by the twodigit code and to stay constant as shown below, so that I can do a simple calculation of one column divided by the other to arrive at the smbshare number. When I try the following,
second <- first %>%
group_by(twodigit,smb) %>%
summarize(indtotal = sum(employment))
it breaks it down by twodigit and smb.
The second issue is having it produce a 0 if the value does not exist. The best example is twodigit code 51 with smb = 4: when there are not 4 distinct smb values for a given twodigit, I am looking for it to produce a 0.
Note: smb is short for small business
naicstest <- c (512131,512141,521921,522654,512131,536978,541214,531214,621112,541213,551212,574121,569887,541211,523141,551122,512312,521114,522112)
employment <- c(11,130,315,17,190,21,22,231,15,121,19,21,350,110,515,165,12,110,111)
smb <- c(1,2,3,1,3,1,1,3,1,2,1,1,4,2,4,3,1,2,2)
first <- data.frame(naicstest,employment,smb)
first<-first %>% mutate(twodigit = substr(naicstest,1,2))
second <- first %>% group_by(twodigit) %>% summarize(indtotal = sum(employment))
Desired result is below
twodigit indtotal smb smbtotal smbshare
51 343 1 23 (11+12) 23/343
51 343 2 130 130/343
51 343 3 190 190/343
51 343 4 0 0/343
52 1068 1 17 17/1068
52 1068 2 221 (110+111) 221/1068
52 1068 3 315 315/1068
52 1068 4 515 515/1068
This gives you all the columns you need, but in a slightly different order. You could use select or relocate to get them in the order you want I suppose:
first %>%
group_by(twodigit, smb) %>%
summarize(smbtotal = sum(employment)) %>%
ungroup() %>%
complete(twodigit, smb, fill = list('smbtotal' = 0)) %>%
group_by(twodigit) %>%
mutate(
indtotal = sum(smbtotal),
smbshare = smbtotal / indtotal
)
`summarise()` has grouped output by 'twodigit'. You can override using the `.groups` argument.
# A tibble: 32 × 5
# Groups: twodigit [8]
twodigit smb smbtotal indtotal smbshare
<chr> <dbl> <dbl> <dbl> <dbl>
1 51 1 23 343 0.0671
2 51 2 130 343 0.379
3 51 3 190 343 0.554
4 51 4 0 343 0
5 52 1 17 1068 0.0159
6 52 2 221 1068 0.207
7 52 3 315 1068 0.295
8 52 4 515 1068 0.482
9 53 1 21 252 0.0833
10 53 2 0 252 0
# … with 22 more rows
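Following the note above about column order, the pipeline can end with relocate to match the desired layout. A sketch (relocate() requires dplyr >= 1.0.0; complete() comes from tidyr):

``` r
library(dplyr)
library(tidyr)

# Same pipeline as above, with columns reordered to
# twodigit, indtotal, smb, smbtotal, smbshare.
first %>%
  group_by(twodigit, smb) %>%
  summarize(smbtotal = sum(employment), .groups = "drop") %>%
  complete(twodigit, smb, fill = list(smbtotal = 0)) %>%
  group_by(twodigit) %>%
  mutate(indtotal = sum(smbtotal),
         smbshare = smbtotal / indtotal) %>%
  relocate(twodigit, indtotal, smb, smbtotal, smbshare)
```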

Percentage change in values in r

Here is the df I am using:
Date Country City Specie count min max median variance
27 2020-03-25 IN Delhi pm25 797 6 192 92 12116.60
159 2020-03-25 IN Chennai pm25 96 27 89 57 1928.38
223 2020-03-25 IN Mumbai pm25 285 12 163 90 6275.41
412 2020-03-25 IN Bengaluru pm25 179 25 145 73 4890.82
419 2020-03-25 IN Kolkata pm25 260 6 168 129 10637.10
10 2020-04-10 IN Delhi pm25 835 2 393 137 24542.30
132 2020-04-10 IN Chennai pm25 87 5 642 53 87856.50
298 2020-04-10 IN Mumbai pm25 168 1 125 90 5025.35
358 2020-04-10 IN Bengaluru pm25 159 21 834 56 57091.10
444 2020-04-10 IN Kolkata pm25 219 4 109 64 2176.61
I want to calculate the percentage change between 'median' values of the data frame. For that I have used the following code:
pct_change_pm25 <- day %>%
arrange(City, .by_group = TRUE) %>%
mutate(pct_change = -diff(median) / median[-1] * 100)
But I am getting this error:
Error in arrange_impl(.data, dots) :
incorrect size (1) at position 2, expecting : 10
The number of rows that mutate creates is 9, which does not match the number of rows in the df.
I have followed this post on stackoverflow:
Calculate Percentage Change in R using dplyr
But, unfortunately, it didn't work for me.
Since diff returns a vector whose length is one less than the original, prepend an NA to the result. Also, you probably want to do this for each City separately, hence the grouping by City.
library(dplyr)
df %>%
arrange(City) %>%
group_by(City) %>%
mutate(pct_change = c(NA, -diff(median) / median[-1] * 100))
Another way to do the same calculation is using lag
df %>%
arrange(City) %>%
group_by(City) %>%
mutate(pct_change = (lag(median) - median)/median * 100)
# Date Country City Specie count min max median variance pct_change
# <fct> <fct> <fct> <fct> <int> <int> <int> <int> <dbl> <dbl>
# 1 2020-03-25 IN Bengaluru pm25 179 25 145 73 4891. NA
# 2 2020-04-10 IN Bengaluru pm25 159 21 834 56 57091. 30.4
# 3 2020-03-25 IN Chennai pm25 96 27 89 57 1928. NA
# 4 2020-04-10 IN Chennai pm25 87 5 642 53 87856. 7.55
# 5 2020-03-25 IN Delhi pm25 797 6 192 92 12117. NA
# 6 2020-04-10 IN Delhi pm25 835 2 393 137 24542. -32.8
# 7 2020-03-25 IN Kolkata pm25 260 6 168 129 10637. NA
# 8 2020-04-10 IN Kolkata pm25 219 4 109 64 2177. 102.
# 9 2020-03-25 IN Mumbai pm25 285 12 163 90 6275. NA
#10 2020-04-10 IN Mumbai pm25 168 1 125 90 5025. 0
With data.table, we can do
library(data.table)
setDT(df)[, pct_change := (shift(median) - median)/median * 100, City]
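One hedged note on direction: the formulas above compute the change relative to the newer value, which is why Delhi's rise from 92 to 137 shows as -32.8. If "percent change from the previous date" is intended, the more usual form divides by the lagged value instead:

``` r
library(dplyr)

# Sketch: conventional percent change relative to the previous value.
df %>%
  arrange(City) %>%
  group_by(City) %>%
  mutate(pct_change = (median - lag(median)) / lag(median) * 100)
# Delhi: (137 - 92) / 92 * 100 ≈ 48.9 rather than -32.8
```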

time series extraction by event onset

I'm looking for code to extract a time interval (500 ms) from a column (called time) at each trial onset, so that I can calculate a baseline from the first 500 ms of each trial.
The actual time in ms between two consecutive rows of the column varies, because the dataset is downsampled and only changes are reported, so I cannot just count a fixed number of rows to define the time interval.
I tried this:
baseline <- labchart %>%
dplyr::filter(time[1:(length(labchart$time)+500)]) %>%
dplyr::group_by(Participant, trialonset)
but only got error messages like:
Error: Argument 2 filter condition does not evaluate to a logical vector
And I am not sure if time[1:(length(labchart$Time)+500)] would really give me the first 500ms of each trial?
It's difficult to know exactly what you're asking here. I think what you're asking is how to group observations into 500ms periods given only time intervals between observations.
Suppose the data looks like this:
``` r
labchart <- data.frame(time = sample(50:300, 20, TRUE), data = rnorm(20))
labchart
#> time data
#> 1 277 -1.33120732
#> 2 224 -0.85356280
#> 3 80 -0.32012499
#> 4 255 0.32433366
#> 5 227 -0.49600772
#> 6 248 2.23246918
#> 7 138 -1.40170795
#> 8 115 -0.76525043
#> 9 159 0.14239351
#> 10 207 -1.53064873
#> 11 139 -0.82303066
#> 12 185 1.12473125
#> 13 239 -0.22491238
#> 14 117 -0.55809297
#> 15 147 0.83225435
#> 16 200 0.75178516
#> 17 170 -0.78484405
#> 18 208 1.21000589
#> 19 196 -0.74576650
#> 20 184 0.02459359
```
Then we can create a column for total elapsed time and which 500ms period the observation belongs to like this:
``` r
library(dplyr)
labchart %>%
  mutate(elapsed = lag(cumsum(time), 1, 0),
         period = 500 * (elapsed %/% 500))
#> time data elapsed period
#> 1 277 -1.33120732 0 0
#> 2 224 -0.85356280 277 0
#> 3 80 -0.32012499 501 500
#> 4 255 0.32433366 581 500
#> 5 227 -0.49600772 836 500
#> 6 248 2.23246918 1063 1000
#> 7 138 -1.40170795 1311 1000
#> 8 115 -0.76525043 1449 1000
#> 9 159 0.14239351 1564 1500
#> 10 207 -1.53064873 1723 1500
#> 11 139 -0.82303066 1930 1500
#> 12 185 1.12473125 2069 2000
#> 13 239 -0.22491238 2254 2000
#> 14 117 -0.55809297 2493 2000
#> 15 147 0.83225435 2610 2500
#> 16 200 0.75178516 2757 2500
#> 17 170 -0.78484405 2957 2500
#> 18 208 1.21000589 3127 3000
#> 19 196 -0.74576650 3335 3000
#> 20 184 0.02459359 3531 3500
```
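If the goal is a per-trial baseline (the first 500 ms after each trial onset) rather than fixed global bins, a grouped variant of the same idea may be closer. A sketch under assumptions: the frame has the Participant and trialonset columns mentioned in the question, time holds the interval to the previous row as above, and signal is a hypothetical name for the measured value:

``` r
library(dplyr)

# Sketch: per-trial elapsed time, then average the first 500 ms as a baseline.
# `signal` is a hypothetical column name for the measured variable.
labchart %>%
  group_by(Participant, trialonset) %>%
  mutate(elapsed = lag(cumsum(time), 1, 0)) %>%  # ms since trial onset
  filter(elapsed < 500) %>%                      # keep the first 500 ms
  summarise(baseline = mean(signal), .groups = "drop")
```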
