Having trouble summarising data in RStudio

I have a data set of species counts. When I try to take the average for each species, I get the mean of each individual observation instead, so nothing changes.
The species names run along the top row of my data, and the counts are in the column beneath each name. I am trying to summarise this data so that I can plot it, but instead of the mean of the entire column, I get the mean of each individual observation.
grassData <- read.csv("Dykebooke.csv", header = TRUE, sep = ",")
View(grassData)
summary.grass <- grassData %>% group_by(Cordgrass) %>%
summarise(mean = mean(Cordgrass), variance = var(Cordgrass))
summary.lavender <- grassData %>% group_by(Lavender) %>%
summarise(mean(Lavender), var(Lavender))
summary.goldenrod <- grassData %>% group_by(Goldenrod) %>%
summarise(mean(Goldenrod), var(Goldenrod))
summary.crab <- grassData %>% group_by(Crab) %>%
summarise(mean(Crab), var(Crab))
summary.iva <- grassData %>% group_by(Iva) %>%
summarise(mean(Iva), var(Iva))
summary.grasshopper <- grassData %>% group_by(Grasshopper) %>%
summarise(mean(Grasshopper), var(Grasshopper))
This is what I have done so far, but this is what it provides.
I have not used R in a few years, so I am very rusty. Any help is appreciated.
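A likely cause, sketched on made-up data: `group_by(Cordgrass)` creates one group per distinct count value, so the mean within each group is just that value. Making the species the grouping variable instead gives one mean and one variance per species. The data frame `grass_demo` and its values are invented for illustration, and the reshaping step assumes every column in the file is a numeric species count.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for grassData: one column per species, one row per plot.
grass_demo <- data.frame(
  Cordgrass = c(4, 6, 8),
  Lavender  = c(1, 3, 5),
  Crab      = c(0, 2, 10)
)

# Reshape to long form so that species becomes the grouping variable,
# then compute one mean and one variance per species.
summary_species <- grass_demo %>%
  pivot_longer(everything(), names_to = "species", values_to = "count") %>%
  group_by(species) %>%
  summarise(mean = mean(count), variance = var(count), .groups = "drop")

summary_species
```

The long-form result has one row per species, which is also a convenient shape for plotting.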

Related

Getting rid of NA values in R when trying to aggregate columns

I'm trying to aggregate this df by the last value in each corresponding country observation. For some reason, the last value that is added to the tibble is not correct.
aggre_data <- combined %>%
group_by(location) %>%
summarise(Last_value_vacc = last(people_vaccinated_per_hundred))
aggre_data
I believe it has something to do with all of the NA values throughout the df. However I did try:
aggre_data <- combined %>%
group_by(location) %>%
summarise(Last_value_vacc = last(people_vaccinated_per_hundred(na.rm = TRUE)))
aggre_data
Update:
combined %>%
group_by(location) %>%
arrange(date, .by_group = TRUE) %>% # or whatever
summarise(Last_value_vacc = last(na.omit( people_vaccinated_per_hundred)))
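To see why wrapping the column in `na.omit()` helps, here is a small sketch on invented data (the tibble `vacc_demo` and its values are hypothetical). `last()` on its own returns the final element of each group even when that element is `NA`; dropping the missing values first returns the last recorded value instead.

```r
library(dplyr)

# Hypothetical data: each location ends with a missing value.
vacc_demo <- tibble(
  location = c("A", "A", "A", "B", "B"),
  date = as.Date(c("2021-01-01", "2021-01-02", "2021-01-03",
                   "2021-01-01", "2021-01-02")),
  people_vaccinated_per_hundred = c(1.2, 3.4, NA, 0.5, NA)
)

last_vals <- vacc_demo %>%
  group_by(location) %>%
  arrange(date, .by_group = TRUE) %>%
  summarise(
    # Last element of the group, missing or not
    last_raw = last(people_vaccinated_per_hundred),
    # Last element after dropping missing values
    last_non_na = last(na.omit(people_vaccinated_per_hundred))
  )

last_vals
```

Sorting by date within each group matters here, because "last" only means "most recent" when the rows are in date order.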

Creating indices using means of variables in R

I am trying to create an index from a set of variables by taking the mean of the selected variables, using the following code:
data <- data %>%
group_by(country) %>%
# Standardize each component/measure
mutate(
std_var1 = standardize(var1, Z),
std_var2 = standardize(var2, Z),
std_var3 = standardize(var3, Z),
std_var4 = standardize(var4, Z)
) %>%
ungroup() %>%
dplyr::select(std_var1,
std_var2,
std_var3,
std_var4) %>%
# Average all z scores for an individual
mutate(index = pmap_dbl(., ~ mean(c(...), na.rm = T))) %>%
cbind(data, .) %>% unnest() %>%
I also use the idx_mean package that takes the following syntax:
mutate(data, idx_var = idx_mean(std_var1, std_var2, std_var3, std_var4))
and obtain similar but not exactly the same index values (not just a matter of rounding).
Is there one approach that seems more accurate here?
The 4th and 5th columns display index values created by the idx function (4th column) and the other approach (5th column.)
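One way to decide which values to trust is to recompute the row-wise mean by a third, independent route and compare. The sketch below does this on invented z-scores (`z_demo` is hypothetical), putting base `rowMeans()` next to the `pmap_dbl()` approach from the question. If another function disagrees with both, the difference may lie in how it treats missing values or in which columns it receives, though that is only a guess without seeing the data.

```r
library(dplyr)
library(purrr)

# Hypothetical standardized variables with a few missing values.
z_demo <- tibble(
  std_var1 = c(0.5, -1.0, NA),
  std_var2 = c(1.5,  0.0, 2.0),
  std_var3 = c(-0.5, 1.0, 1.0),
  std_var4 = c(0.5,  NA,  0.0)
)

index_check <- z_demo %>%
  mutate(
    # Row-wise mean via pmap_dbl(), ignoring missing values
    index_pmap = pmap_dbl(
      list(std_var1, std_var2, std_var3, std_var4),
      function(...) mean(c(...), na.rm = TRUE)
    ),
    # The same quantity via base rowMeans() as an independent check
    index_rowmeans = rowMeans(
      cbind(std_var1, std_var2, std_var3, std_var4),
      na.rm = TRUE
    )
  )

index_check
```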

Mapping over a list and taking the colMeans and rowMeans in R

I am trying to compute the column means and row means of some data I have.
It's similar to the following:
library(rsample)
library(tidyquant)
library(tidyverse)
library(tsibble)
aapl <- tq_get("AAPL", start_date = "2000-01-01")
aapl_monthly_nested <- aapl %>%
mutate(ym = yearmonth(date)) %>%
nest(-ym)
aapl_rolled <- aapl_monthly_nested %>%
rolling_origin(cumulative = FALSE)
map(aapl_rolled$splits, ~ analysis(.x)) %>%
head
I try using the summarise_all function once I have mapped over the data but I cannot seem to get the colMeans. I have replaced colMeans with mean without luck.
x <- map(aapl_rolled$splits, ~analysis(.x),
~map(data,
~summarise_all(.funs(colMeans))))
x[[1]]$data
I would like a single observation of the column means for each of the splits.
EDIT:
I think I got it. I believe I forgot to unnest the data after nesting it previously.
x <- map(aapl_rolled$splits, ~ analysis(.x) %>%
unnest() %>%
as_tibble(.) %>%
select(-year_month) %>%
summarise_all(mean))
If you have a better solution please let me know.
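A possible alternative, sketched on a plain list of data frames rather than on the `rsample` splits (the list `splits_demo` is invented for illustration): summarise each element to a single row of column means, then stack the rows so that each split contributes one observation.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the analysis sets extracted from each split.
splits_demo <- list(
  data.frame(open = c(1, 3),   close = c(2, 4)),
  data.frame(open = c(10, 20), close = c(30, 50))
)

col_means <- splits_demo %>%
  # One row of column means per list element
  map(~ summarise(.x, across(where(is.numeric), mean))) %>%
  # Stack the rows and record which split each one came from
  bind_rows(.id = "split")

col_means
```

Restricting `across()` to numeric columns avoids having to drop date or label columns by hand before averaging.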

How to assign mutate and distinct to another variable in R?

I have a huge data set with a reading every 30 seconds. First I take the mean to get hourly data, then sum the hourly values to get daily data, and sum again to get monthly data. I need to assign the result of the mutate pipeline to a new data set/variable called mE_131 for plotting the monthly values. I'm new to this, please help!
library(dplyr)
library(ggplot2)
attach(data)
data%>% #filtering 131 and 132
select(time,Column3,m_Pm) %>%
filter(data,Column3=="131")
filter(data,Column3=="132")
data_131<-filter(data,Column3=="131")
data_132<-filter(data,Column3=="132")
data_131%>%
mutate(datehour= format(time,"%Y-%m-%d %H"), date1= format(time,"%Y-%m-%d"), month=format(time,"%Y-%m")) %>%
group_by(datehour) %>% mutate(hourlyP=mean(m_Pm)) %>% distinct(datehour, .keep_all = TRUE) %>%
group_by(date1) %>% mutate(dailyP=sum(hourlyP)) %>% distinct(date1, .keep_all = TRUE) %>%
group_by(month) %>% summarise(monthlyP=sum(dailyP))
If your goal is to compare monthly data between Column3 == 131 and Column3 == 132, you don't need a separate dataset for each of them, although I will show how to create one at the end.
First, let's create the required summary for both 131 and 132:
data <- data %>%
  filter(Column3 == "131" | Column3 == "132") %>% # keep only the required data
  mutate(datehour = format(time, "%Y-%m-%d %H"), # build the hour, day and month keys
         date1 = format(time, "%Y-%m-%d"),
         month = format(time, "%Y-%m")) %>%
  group_by(Column3, datehour) %>% # Column3 stays in the grouping so the two series are kept apart
  mutate(hourlyP = mean(m_Pm)) %>%
  distinct(datehour, .keep_all = TRUE) %>%
  group_by(Column3, date1) %>%
  mutate(dailyP = sum(hourlyP)) %>%
  distinct(date1, .keep_all = TRUE) %>%
  group_by(Column3, month) %>%
  summarise(monthlyP = sum(dailyP))
Note: I have written each step on its own line for readability. It is essentially your code, except that Column3 is kept as a grouping variable so that it is still present in the monthly summary.
Now, let's do the plotting:
data %>%
  ggplot(aes(x = month, y = monthlyP, fill = factor(Column3))) +
  geom_col(position = "dodge") # geom_col() plots the y values as given; this will produce a plot similar to your example
If you insist on having a separate dataset for each value in Column3, you can simply use the assignment operator <- to create a new data frame as follows:
mE_131 <- data_131 %>%
mutate(datehour= format(time,"%Y-%m-%d %H"),
date1= format(time,"%Y-%m-%d"),
month=format(time,"%Y-%m")) %>%
group_by(datehour) %>%
mutate(hourlyP=mean(m_Pm)) %>%
distinct(datehour, .keep_all = TRUE) %>%
group_by(date1) %>%
mutate(dailyP=sum(hourlyP)) %>%
distinct(date1, .keep_all = TRUE) %>%
group_by(month) %>%
summarise(monthlyP=sum(dailyP))
Then do the same thing to create mE_132. However, I don't recommend this because it would be harder to plot them.
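The hourly, daily and monthly steps can also be written with successive `summarise()` calls, which collapse each group to one row without a separate `distinct()`. This is only a sketch on invented readings: `power_demo` and its values are hypothetical, and the column names follow the question.

```r
library(dplyr)

# Hypothetical readings: three in January (two within the same hour), one in February.
power_demo <- tibble(
  time = as.POSIXct(c("2021-01-01 00:00:00", "2021-01-01 00:30:00",
                      "2021-01-01 01:00:00", "2021-02-01 00:00:00"),
                    tz = "UTC"),
  Column3 = "131",
  m_Pm = c(2, 4, 10, 6)
)

monthly_demo <- power_demo %>%
  mutate(datehour = format(time, "%Y-%m-%d %H"),
         date1 = format(time, "%Y-%m-%d"),
         month = format(time, "%Y-%m")) %>%
  group_by(Column3, month, date1, datehour) %>%
  # Mean per hour; the result stays grouped by Column3, month, date1
  summarise(hourlyP = mean(m_Pm), .groups = "drop_last") %>%
  # Sum per day; the result stays grouped by Column3, month
  summarise(dailyP = sum(hourlyP), .groups = "drop_last") %>%
  # Sum per month; grouping is dropped
  summarise(monthlyP = sum(dailyP), .groups = "drop")

monthly_demo
```

Each `summarise()` peels off the innermost grouping level, so the order of variables in `group_by()` mirrors the hour, day, month hierarchy.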

Using replace_na for multiple data subsets

I'm trying to replace the NAs in multiple columns with randomly generated values based on each student_id's own subset of rows.
[Image: data snapshot]
For student 3, for example, systolic needs two NAs replaced. I used the min and max values of each variable within the student 3 subset to generate the random values.
library(dplyr)
library(tidyr)
library(tibble)
library(tidyverse)
dplyr::filter(exercise, student_id == "3") %>%
  replace_na(list(
    systolic = round(sample(runif(1000, 125, 130), 2), 0),
    diastolic = round(sample(runif(1000, 85, 85), 3), 0),
    heart_rate = round(sample(runif(1000, 79, 86), 2), 0),
    phys_score = round(sample(runif(1000, 8, 9), 2), 0)
  ))
However, it works only when a single NA needs replacing (image: successfully replaced systolic NA values). When I try to replace more than one NA, this error comes up:
Error: Replacement for `systolic` is length 2, not length 1
Is there a way to fix this? I tried converting the column variables to data frames instead of the vectors they are now, but it only returned the original data without any replacement changes.
Are there any simpler ways to do this? Any suggestions or comments would be appreciated. Thanks.
Here is a solution that makes things a little more automated, although it may be unnecessarily complex.
First, generate some grouped missing data from the mtcars dataset:
library(magrittr)
library(purrr)
library(dplyr)
library(stringr)
library(tidyr)
## Generate some missing data with a subset of car make
mtcars_miss <- mtcars %>%
as_tibble(rownames = "car") %>%
select(car) %>%
separate(car, c("make", "name"), " ") %>%
bind_cols(mtcars[, -1] %>%
map_df(~.[sample(c(TRUE, NA), prob = c(0.8, 0.2),
size = length(.), replace = TRUE)])) %>%
filter(make %in% c("Mazda", "Hornet", "Merc"))
Function to replace NA values in a given variable by sampling between the min and max of that variable within each group (here the group is make).
replace_na_sample <- function(df_miss, var, group = "make") {
  df_miss %>%
    group_by(across(all_of(group))) %>%
    # Draw one candidate per row between the group's observed min and max,
    # then use it only where the original value is missing.
    mutate({{ var }} := coalesce(
      {{ var }},
      round(runif(n(), min({{ var }}, na.rm = TRUE),
                  max({{ var }}, na.rm = TRUE)), 0)
    )) %>%
    ungroup()
}
Example replacing several missing values in multiple columns.
mtcars_replaced <- mtcars_miss %>%
replace_na_sample(cyl, group = "make") %>%
replace_na_sample(disp, group = "make") %>%
replace_na_sample(hp, group = "make")
