Getting rid of NA values in R when trying to aggregate columns


I'm trying to aggregate this df by the last value in each corresponding country observation. For some reason, the last value that is added to the tibble is not correct.
aggre_data <- combined %>%
group_by(location) %>%
summarise(Last_value_vacc = last(people_vaccinated_per_hundred))
aggre_data
I believe it has something to do with all of the NA values throughout the df. However I did try:
aggre_data <- combined %>%
group_by(location) %>%
summarise(Last_value_vacc = last(people_vaccinated_per_hundred(na.rm = TRUE)))
aggre_data

Update:
combined %>%
group_by(location) %>%
arrange(date, .by_group = TRUE) %>% # or whatever
summarise(Last_value_vacc = last(na.omit( people_vaccinated_per_hundred)))
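To see why this works, here is a minimal sketch on made-up data (the `toy` data frame is hypothetical; only the column names come from the question). `na.omit()` drops the missing values within each group before `last()` picks the final element, so a trailing NA no longer wins.

```r
library(dplyr)

# Hypothetical toy data standing in for `combined`
toy <- data.frame(
  location = c("A", "A", "A", "B", "B"),
  date = as.Date(c("2021-01-01", "2021-01-02", "2021-01-03",
                   "2021-01-01", "2021-01-02")),
  people_vaccinated_per_hundred = c(1.5, 2.0, NA, 0.7, NA)
)

last_vals <- toy %>%
  group_by(location) %>%
  arrange(date, .by_group = TRUE) %>%  # make "last" mean "latest date"
  summarise(Last_value_vacc = last(na.omit(people_vaccinated_per_hundred)))

last_vals
```

Both groups end in NA here, yet the summary returns the last observed value for each location. As far as I know, dplyr 1.1.0 and later also accept `last(x, na_rm = TRUE)`, which should give the same result.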

Related

Manipulating data.frame while using cycles and storing values in a list

I have 2 codes that manipulate and filter (by date) my data.frame and that work perfectly. Now I want to run the code for not only one day, but for every day in vector:
seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days') # .... 538 days
The code I want to run for all the days between 2020-03-02 and today is:
KOKOKO <- data.frame %>%
filter(DATE < '2020-03-02')%>%
summarize(DATE = '2020-03-02', CZK = sum(Objem.v.CZK, na.rm = TRUE))
STAVPTF <- data.frame %>%
filter (DATE < '2020-03-02')%>%
group_by(CP) %>%
summarize(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), DATE = '2020-03-02') %>%
select(DATE,CP,mnozstvi) %>%
rbind(KOKOKO)%>%
drop_na()
So instead of the fixed '2020-03-02' I want to substitute each day since '2020-03-02' in turn. The KOKOKO and STAVPTF results for each day should be saved as separate data.frames, all stored in a list.

We could use map to loop over the sequence and apply the code
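library(dplyr)
library(purrr)
library(tidyr) # for drop_na(), used in the function below
out <- map(s1, ~ data.frame %>%
filter(DATE < .x) %>%
summarize(DATE = .x, CZK = sum(Objem.v.CZK, na.rm = TRUE)))
Because the same steps are repeated for every date, wrapping them in a function makes this cleaner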
f1 <- function(dat, date_col, group_col, Objem_col, aktualni_col, date_val) {
filtered <- dat %>%
filter({{date_col}} < date_val)
KOKOKO <- filtered %>%
summarize({{date_col}} := date_val,
CZK = sum({{Objem_col}}, na.rm = TRUE))
STAVPTF <- filtered %>%
group_by({{group_col}}) %>%
summarize(mnozstvi = last({{aktualni_col}}),
{{date_col}} := date_val) %>%
select({{date_col}}, {{group_col}}, mnozstvi) %>%
bind_rows(KOKOKO)%>%
drop_na()
return(STAVPTF)
}
and call as
map(s1, ~ f1(data.frame, DATE, CP, Objem.v.CZK, AKTUALNI_MNOZSTVI_AKCIE, .x))
where
s1 <- seq(from=as.Date('2020-03-02'), to=Sys.Date(), by='days')

It would be easier to answer your question if you provided a minimal reproducible example. This is easily done with the tidyverse's reprex package.
However, your KOKOKO code can be rewritten as simple cumulative sum:
KOKOKO =
data.frame %>%
arrange(DATE) %>% # if necessary
group_by(DATE) %>%
summarise(CZK = sum(Objem.v.CZK, na.rm = TRUE), .groups = 'drop') %>% # summarise per DATE (if necessary)
mutate(CZK = cumsum(CZK) - CZK) # cumulative sum excluding current row (current DATE)
The STAVPTF code can probably also be rewritten without iteration. First find the last value of AKTUALNI_MNOZSTVI_AKCIE per CP and DATE. Then assign this value to the next DATE:
STAVPTF <-
data.frame %>%
group_by(CP, DATE) %>%
summarise(mnozstvi = last(AKTUALNI_MNOZSTVI_AKCIE), .groups='drop_last') %>%
arrange(DATE) %>% # if necessary
mutate(DATE = lead(DATE))
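The cumulative-sum idea for KOKOKO can be checked on a small example. This is only a sketch on made-up numbers (the `toy` data frame is hypothetical; the column names follow the question): `cumsum(CZK) - CZK` should equal the sum over all strictly earlier dates, which is what `filter(DATE < d)` followed by `sum()` computes for each day `d`.

```r
library(dplyr)

# Hypothetical toy data; column names follow the question
toy <- data.frame(
  DATE = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02", "2020-03-03")),
  Objem.v.CZK = c(100, 50, 20, 5)
)

# Vectorised version: running total excluding the current DATE
by_cumsum <- toy %>%
  arrange(DATE) %>%
  group_by(DATE) %>%
  summarise(CZK = sum(Objem.v.CZK, na.rm = TRUE), .groups = "drop") %>%
  mutate(CZK = cumsum(CZK) - CZK)

# Direct version of the filter from the question, for comparison
by_filter <- vapply(
  seq_along(by_cumsum$DATE),
  function(i) sum(toy$Objem.v.CZK[toy$DATE < by_cumsum$DATE[i]]),
  numeric(1)
)

all.equal(by_cumsum$CZK, by_filter)
```

One caveat: the vectorised version only produces rows for dates that occur in the data, so days without any records would need to be added separately (for example by joining against the full date sequence).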

r: combine filter with n_distinct in data frame

Simple question. Considering the data frame below, I want to count distinct IDs: once for all records and once after filtering on status. However, the %>% doesn't seem to work here. I just want a single value as output (so for total this should be 10, for closed it should be 5), not a data frame. Neither of the commented-out (#) lines works.
dat <- data.frame (ID = as.factor(c(1:10)),
status = as.factor(rep(c("open","closed"))))
total <- n_distinct(dat$ID)
#closed <- dat %>% filter(status == "closed") %>% n_distinct(dat$ID)
#closed <- dat %>% filter(status == "closed") %>% n_distinct(ID)

n_distinct expects a vector as input, but you are passing it a data frame. You can do:
library(dplyr)
dat %>%
filter(status == "closed") %>%
summarise(n = n_distinct(ID))
# n
#1 5
Or without using filter:
dat %>% summarise(n = n_distinct(ID[status == "closed"]))
You can add %>% pull(n) to above if you want a vector back and not a dataframe.
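A minimal sketch of the `pull()` variants, using the data frame from the question with the `rep()` count written out. The names `closed` and `closed_alt` are just illustrative.

```r
library(dplyr)

dat <- data.frame(ID = as.factor(1:10),
                  status = as.factor(rep(c("open", "closed"), 5)))

# Summarise, then pull the single column out as a plain vector
closed <- dat %>%
  filter(status == "closed") %>%
  summarise(n = n_distinct(ID)) %>%
  pull(n)

# Alternative: pull the ID vector first, so n_distinct() receives a vector
closed_alt <- dat %>%
  filter(status == "closed") %>%
  pull(ID) %>%
  n_distinct()
```

Both give a length-one integer rather than a data frame.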

An option with data.table
library(data.table)
setDT(dat)[status == "closed"][, .(n = uniqueN(ID))]

How to calculate p.value of each column in a data frame with NA values using shapiro.test in r?

This is what I have tried so far. It works, but it only tells me the p.value of the data that has no NA's. Much of my data has NA values in a few places up to 1/3rd of the data.
normal <- apply(cor_phys, 2, function(x) shapiro.test(x)$p.value)
I want to try adding na.rm to the function, but it's not working. Help?

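library(dplyr)
library(tidyr) # gather()
library(tibble) # rownames_to_column()
library(purrr) # map_chr()
# calculate the correlations between all variables
corres <- cor_phys %>% # cor_phys is my data
as.matrix %>%
cor(use="complete.obs") %>% # complete.obs drops every row that contains an NA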
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1)
#removes duplicates correlations
corres <- corres %>%
mutate(var_order = paste(var1, var2) %>%
strsplit(split = ' ') %>%
map_chr( ~ sort(.x) %>%
paste(collapse = ' '))) %>%
mutate(cnt = 1) %>%
group_by(var_order) %>%
mutate(cumsum = cumsum(cnt)) %>%
filter(cumsum != 2) %>%
ungroup %>%
select(-var_order, -cnt, -cumsum) #removes unneeded columns
I did not write this myself, but it is the answer that I used and worked for my needs. The link to the page I used is: How to compute correlations between all columns in R and detect highly correlated variables
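For the p-value question itself, a minimal sketch that removes the NAs column by column before testing is shown here. The `toy` data frame is a hypothetical stand-in for cor_phys. As far as I know, `shapiro.test()` already ignores missing values on its own and only requires between 3 and 5000 non-missing values per column, so the explicit removal is mainly there to make the intent visible.

```r
# Hypothetical data standing in for cor_phys, with NAs in some columns
set.seed(1)
toy <- data.frame(a = rnorm(30), b = runif(30), c = rexp(30))
toy$a[c(2, 7)] <- NA
toy$b[1:10] <- NA   # about a third of this column is missing

# One Shapiro-Wilk p-value per column, computed on the non-missing values
p_values <- vapply(
  toy,
  function(x) shapiro.test(x[!is.na(x)])$p.value,
  numeric(1)
)

p_values
```

Unlike `apply()`, `vapply()` over a data frame does not first convert it to a matrix, so every column is tested on its own values.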

How to assign mutate and distinct to another variable in R?

I have a huge data set with an observation every 30 seconds. First I take the mean to get hourly data, then sum that to get daily data, and sum again to get monthly data. I need to assign the result of the mutate pipeline below to a new data set / variable called mE_131, for plotting the monthly values. I'm new to this, please help!
library(dplyr)
library(ggplot2)
data_131 <- data %>% # filtering 131 and 132
select(time, Column3, m_Pm) %>%
filter(Column3 == "131")
data_132 <- data %>%
select(time, Column3, m_Pm) %>%
filter(Column3 == "132")
data_131%>%
mutate(datehour= format(time,"%Y-%m-%d %H"), date1= format(time,"%Y-%m-%d"), month=format(time,"%Y-%m")) %>%
group_by(datehour) %>% mutate(hourlyP=mean(m_Pm)) %>% distinct(datehour, .keep_all = TRUE) %>%
group_by(date1) %>% mutate(dailyP=sum(hourlyP)) %>% distinct(date1, .keep_all = TRUE) %>%
group_by(month) %>% summarise(monthlyP=sum(dailyP))

If your goal is to compare monthly data between Column3 == 131 and Column3 == 132, then you don't necessarily need to create a separate dataset for each of them, although I will show you how to do that at the end.
First, let's create the required summary for both 131 and 132:
data <- data %>%
filter(Column3 %in% c("131", "132")) %>% # keep only the required data
mutate(datehour = format(time, "%Y-%m-%d %H"), # derive the hour, day and month keys
date1 = format(time, "%Y-%m-%d"),
month = format(time, "%Y-%m")) %>%
group_by(Column3, datehour) %>% # Column3 stays in every grouping so it survives to the plot
mutate(hourlyP = mean(m_Pm)) %>%
distinct(Column3, datehour, .keep_all = TRUE) %>%
group_by(Column3, date1) %>%
mutate(dailyP = sum(hourlyP)) %>%
distinct(Column3, date1, .keep_all = TRUE) %>%
group_by(Column3, month) %>%
summarise(monthlyP = sum(dailyP), .groups = "drop")
Note: I have put each step on its own line for readability. The only change from your code is that Column3 is included in every grouping, so the two series are not mixed together and Column3 is still available as the fill variable in the plot.
Now, let's do the plotting:
data %>%
ggplot(aes(x = month, y = monthlyP, fill = factor(Column3))) +
geom_col(position = "dodge") # geom_col() plots the y values as given; this should produce a plot similar to your example
If you insist on having a separate dataset for each value in Column3, then you can simply use the assignment operator <- to create a new data frame as follows
mE_131 <- data_131 %>%
mutate(datehour= format(time,"%Y-%m-%d %H"),
date1= format(time,"%Y-%m-%d"),
month=format(time,"%Y-%m")) %>%
group_by(datehour) %>%
mutate(hourlyP=mean(m_Pm)) %>%
distinct(datehour, .keep_all = TRUE) %>%
group_by(date1) %>%
mutate(dailyP=sum(hourlyP)) %>%
distinct(date1, .keep_all = TRUE) %>%
group_by(month) %>%
summarise(monthlyP=sum(dailyP))
Then do the same thing to create mE_132. However, I don't recommend this because it would be harder to plot them.
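If separate data sets are wanted anyway, the repeated pipeline could be wrapped in a small function so it is written only once. This is a sketch, not the code from the answer: the helper name `monthly_power` and the toy data are hypothetical, and it uses nested `summarise()` calls in place of the `mutate()` + `distinct()` pairs, which should give the same monthly totals. It assumes `time` is a date-time column and `m_Pm` is numeric, as in the question.

```r
library(dplyr)

# Hypothetical helper: hourly mean -> daily sum -> monthly sum
monthly_power <- function(d) {
  d %>%
    mutate(datehour = format(time, "%Y-%m-%d %H"),
           date1 = format(time, "%Y-%m-%d"),
           month = format(time, "%Y-%m")) %>%
    group_by(month, date1, datehour) %>%
    summarise(hourlyP = mean(m_Pm), .groups = "drop_last") %>%  # one row per hour
    summarise(dailyP = sum(hourlyP), .groups = "drop_last") %>% # one row per day
    summarise(monthlyP = sum(dailyP), .groups = "drop")         # one row per month
}

# Made-up readings standing in for data_131
toy_131 <- data.frame(
  time = as.POSIXct(c("2021-01-01 00:00:00", "2021-01-01 00:00:30",
                      "2021-01-01 01:00:00", "2021-01-02 00:00:00",
                      "2021-02-01 00:00:00"), tz = "UTC"),
  m_Pm = c(2, 4, 5, 1, 7)
)

mE_131 <- monthly_power(toy_131)
mE_131
```

With the real data this would be called as `mE_131 <- monthly_power(data_131)` and `mE_132 <- monthly_power(data_132)`.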

transform() to add rows with dplyr()

I've got a data frame (df) with two variables, site and purchase.
I'd like to use dplyr() to group my data by site and purchase, and get the counts and percentages for the grouped data. I'd however also like the tibble to feature rows called ALLSITES, representing the data of all the sites grouped by purchase, so that I end up with a tibble looking similar to dfgoal.
The problem's that my current code doesn't get me the ALLSITES rows. I've tried adding a base R function into dplyr(), which doesn't work.
Any help would be much appreciated.
Starting point (df):
df <- data.frame(site=c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD","PAR","LON","MAD","LON","MAD","MAD","MAD"),purchase=c("a1","a2","a1","a1","a1","a1","a1","a1","a1","a2","a1","a2","a1","a2","a1"))
Desired outcome:
dfgoal <- data.frame(site=c("LON","LON","MAD","MAD","PAR","ALLSITES","ALLSITES"),purchase=c("a1","a2","a1","a2","a1","a1","a2"),bin=c(1,2,6,2,4,11,4),bin_per=c(33.33333,66.66667,75.00000,25.00000,100.00000,73.33333,26.66667))
Current code:
library(dplyr)
df %>%
group_by(site, purchase) %>%
summarize(bin = sum(purchase==purchase)) %>%
group_by(site) %>%
mutate(bin_per = (bin/sum(bin)*100))
df %>%
rbind(df, transform(df, site = "ALLSITES") %>%
group_by(site, purchase) %>%
summarize(bin = sum(purchase==purchase)) %>%
group_by(site) %>%
mutate(bin_per = (bin/sum(bin)*100))

We can start from the output of the first code block, stored here as 'df1'. Group it by 'purchase' and by a 'site' column set to the constant string 'ALLSITES', take the sum of 'bin', then compute 'bin_per'. Finally, row-bind the two datasets with bind_rows
df1 %>%
ungroup() %>%
group_by(site = 'ALLSITES', purchase) %>%
summarise(bin = sum(bin)) %>%
ungroup %>%
mutate(bin_per = 100*(bin/sum(bin))) %>%
bind_rows(df1, .)
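Putting the pieces together, here is a self-contained sketch using the data from the question. Two things are my own choices rather than part of the answer: `df1` is built with `count()` instead of `sum(purchase==purchase)`, and the intermediate name `all_sites` is only illustrative.

```r
library(dplyr)

df <- data.frame(
  site = c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD",
           "PAR","LON","MAD","LON","MAD","MAD","MAD"),
  purchase = c("a1","a2","a1","a1","a1","a1","a1","a1",
               "a1","a2","a1","a2","a1","a2","a1")
)

# Per-site counts and percentages (the "first output" in the question)
df1 <- df %>%
  count(site, purchase, name = "bin") %>%
  group_by(site) %>%
  mutate(bin_per = 100 * bin / sum(bin)) %>%
  ungroup()

# Totals across all sites, labelled with a constant site value
all_sites <- df1 %>%
  group_by(site = "ALLSITES", purchase) %>%
  summarise(bin = sum(bin), .groups = "drop") %>%
  mutate(bin_per = 100 * bin / sum(bin))

result <- bind_rows(df1, all_sites)
result
```

The ALLSITES rows are computed from the per-site counts rather than from the raw data, so the totals stay consistent with the rows above them.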