I'm trying to collapse the values of the variable Schulbildung that are less than 12 into one group and sum the corresponding values of n. I tried using the aggregate() function, but it didn't work. Does anybody have an idea?
Use mutate() with ifelse() to recode every value that is smaller than 12.
Then summarise with dplyr.
df <- data.frame(
  Education = c(18, 16, 15, 12, 10, 8),
  entries = c(200, 100, 50, 50, 10, 5)
)
You said Education is a grouping variable, so this is not the original data.frame, right?
library(dplyr)
df %>%
  ungroup() %>%
  mutate(Education = ifelse(Education < 12, "others", Education)) %>%
  group_by(Education) %>%
  summarise(entries = sum(entries))
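Since the question mentions aggregate(), here is a minimal base R sketch of the same idea. It assumes the example data frame from this answer, not the asker's real data, and the name res is only illustrative. Note that recoding to "others" turns Education into a character column.

```r
# Example data from the answer (not the asker's real data)
df <- data.frame(
  Education = c(18, 16, 15, 12, 10, 8),
  entries = c(200, 100, 50, 50, 10, 5)
)

# Recode everything below 12 to "others"; the column becomes character
df$Education <- ifelse(df$Education < 12, "others", as.character(df$Education))

# Sum entries within each Education value
res <- aggregate(entries ~ Education, data = df, FUN = sum)
res
```

The two rows with Education 10 and 8 end up in a single "others" row whose entries value is their sum.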
I have a continuous variable in R. Entries 1-30 need to stay the same. NAs are coded as 99 and 0 was coded as 88 for some reason. I'm trying to figure out how to recode 99s to NA and 88s to 0, but keep any variables 1-30 as is.
I have tried a few things, but I'm pretty new to R and coding in general. None of my attempts have come even close, and most of the examples I'm coming across in my search are about categorial variables, recoding continuous as categorical, or binning. I want to recode as continuous, just changing 88s and 99s only.
I tried using mutate in a few different ways, but none worked. Most of the outcomes were an error or the new MH variable with nothing actually changed.
With dplyr, you can use
recode()
df %>%
  mutate(y = recode(x, `88` = 0, `99` = NA_real_))
case_match()
df %>%
  mutate(y = case_match(x, 88 ~ 0, 99 ~ NA, .default = x))
case_when()
df %>%
  mutate(y = case_when(x == 88 ~ 0, x == 99 ~ NA, .default = x))
Using fcase
library(data.table)
setDT(df)[, y := fcase(!x %in% c(88, 99), x, x == 88, 0)]
You have a lot of options at your disposal with the tidyverse packages (e.g., dplyr, tidyr). One option is to use na_if to turn the 99s into NA and if_else to turn the 88s to 0.
I have created a fake dataset below, but if you have questions about your specific dataset, you should provide a reproducible example with your own data.
library(tidyverse)
a <- sample(x = c(1, 2, 3, 4, 99, 88), size = 30, replace = TRUE)
b <- sample(x = c(1, 2, 3, 4, 99, 88), size = 30, replace = TRUE)
c <- sample(x = c(1, 2, 3, 4, 99, 88), size = 30, replace = TRUE)
df <- data.frame(a, b, c)
df
df %>%
  mutate(across(everything(), ~na_if(., 99))) %>%
  mutate(across(everything(), ~if_else(. == 88, 0, .)))
We can update matching values in place with base R
df$y[df$y == 99] <- NA
df$y[df$y == 88] <- 0
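As a quick sanity check, the sketch below applies three of the approaches from these answers to a small made-up vector and confirms they agree. The vector x and the names via_* are only illustrative, and the .default argument of case_when() assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Made-up values: 88 should become 0, 99 should become NA, the rest stay as is
x <- c(1, 30, 88, 99, 15)

# case_when(): explicit conditions, everything else falls through to .default
via_case_when <- case_when(x == 88 ~ 0, x == 99 ~ NA_real_, .default = x)

# na_if() followed by if_else(): NA rows stay NA because the condition is NA
y <- na_if(x, 99)
via_na_if <- if_else(y == 88, 0, y)

# Base R: replace matching values in place (99 first, then 88)
via_base <- x
via_base[via_base == 99] <- NA
via_base[via_base == 88] <- 0

stopifnot(identical(via_case_when, via_base))
stopifnot(identical(via_na_if, via_base))
via_base
```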
I am currently using station data for my research in R, and I need to count the number of missing/null values for each month. The data is currently in daily measurements, and the monthly total of missing values would let me trim certain months out if they are not useful.
CUM00078310_df %>%
  dplyr::mutate(
    Month = month(Date),
    Mis = rowSums(is.na(.[, grepl("C", colnames(CUM00078310_df))]))
  ) %>%
  group_by(Month) %>%
  summarize(Sum = sum(Mis), Percentage = mean(Mis))
Here is an example. I'm not sure whether you want the data summarised or kept within the data frame; if you don't want it summarised, omit the final two lines of code. With your data, add the month grouping variable to group_by(). To keep only the NAs, add filter(is.na(x)) if needed.
df <- data.frame(x = c(NA, 2, 5, 10, 15, NA, 3, 5, 10, 15, NA, 4, 10, NA, 6, 15))
df <- df %>%
  group_by(x) %>%
  mutate(valueCount = n()) %>%
  arrange(desc(valueCount)) %>%
  group_by(x, valueCount) %>%
  summarise()
Unsummarized example:
df <- data.frame(x = c(NA, 2, 5, 10, 15, NA, 3, 5, 10, 15, NA, 4, 10, NA, 6, 15))
df <- df %>%
  group_by(x) %>%
  mutate(valueCount = n()) %>%
  arrange(desc(valueCount))
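For the monthly count the question asks about, a minimal sketch could look like the following. The data frame station and its column TMAX are hypothetical stand-ins for the real station data; only the Date column name is taken from the question.

```r
library(dplyr)

# Hypothetical daily data: two days in January, four in February
station <- data.frame(
  Date = as.Date("2020-01-30") + 0:5,
  TMAX = c(25, NA, NA, 27, NA, 26)
)

monthly <- station %>%
  mutate(Month = format(Date, "%Y-%m")) %>%       # year-month label per day
  group_by(Month) %>%
  summarise(
    n_missing = sum(is.na(TMAX)),                 # number of missing days
    pct_missing = 100 * mean(is.na(TMAX))         # share of missing days in percent
  )
monthly
```

Months whose pct_missing exceeds whatever threshold suits the analysis can then be dropped with filter().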
Good Evening,
I do have a problem, that I just cannot seem to get around.
Let's assume I'm working with a simplified dataset that looks like this
library(tidyverse)
data <- tribble(
  ~town, ~patients_aged_17, ~patients_aged_18, ~patients_aged_19,
  "newyork", 2, 3, 1,
  "berlin", 1, 1, 4
)
I would like to use the tidyverse summarise function to calculate the median age for each town.
data %>% group_by(town) %>% summarise(median_patient_age = median([problem]))
The median for newyork would be median(c(17, 17, 18, 18, 18, 19)), so simply using the median function won't yield the desired results.
The question is, how can I get R to calculate the median the correct way? I guess the answer is quite easy, however I just can't figure it out.
PS: I can't do it by hand as in the example, as there are way too many groups and "age-variables".
Any hints ?
Best wishes, David.
I think this will give the desired result: pivot to long format, pull the age out of each column name, repeat each age by its count, and take the median per town.
library(tidyverse)
data <- tribble(
  ~town, ~patients_aged_17, ~patients_aged_18, ~patients_aged_19,
  "newyork", 2, 3, 1,
  "berlin", 1, 1, 4
)
data %>%
  pivot_longer(cols = -town, names_to = "age_group", values_to = "count") %>%
  mutate(age = as.numeric(gsub("[^\\d]+", "", age_group, perl = TRUE))) %>%
  group_by(town) %>%
  # expand each age by its patient count, e.g. 17, 17, 18, 18, 18, 19 for newyork
  summarise(median_age = median(rep(age, times = count)))
# A tibble: 2 x 2
  town    median_age
  <chr>        <dbl>
1 berlin          19
2 newyork         18
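For the median as the question defines it (the middle of the expanded vector of ages), one way to sketch it is to give every patient a row of their own with tidyr::uncount() and then call median() per town. The result name medians is only illustrative, and the sketch assumes every age column follows the patients_aged_<age> naming pattern from the question.

```r
library(dplyr)
library(tidyr)

data <- tribble(
  ~town, ~patients_aged_17, ~patients_aged_18, ~patients_aged_19,
  "newyork", 2, 3, 1,
  "berlin", 1, 1, 4
)

medians <- data %>%
  pivot_longer(-town, names_to = "age_group", values_to = "count") %>%
  # strip the assumed "patients_aged_" prefix to recover the numeric age
  mutate(age = as.numeric(sub("patients_aged_", "", age_group))) %>%
  uncount(count) %>%                      # one row per patient
  group_by(town) %>%
  summarise(median_patient_age = median(age))
medians
```

For newyork the expanded ages are 17, 17, 18, 18, 18, 19, which matches the vector written out by hand in the question.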
I have a dataframe that looks something like this:
df <- data.frame(
  text = c(1:12),
  person = c(c(rep("John", 6)), c(rep("Jane", 6))),
  lemma = c("he", "he", "he", "his", "it", "she", "he",
            "she", "she", "his", "it", "she"),
  n = c(8, 8, 3, 7, 10, 4, 12, 9, 3, 4, 2, 8),
  total_words = c(20, 49, 19, 39, 40, 30, 13, 30, 20, 34, 33, 15))
What I'm trying to do is to get summary statistics, so that I can tell the relative frequency of each pronoun with all the texts produced by John and Jane respectively. If all I wanted was the counts, it would be easy:
library("dplyr")
library("tidyr")
df %>%
  group_by(person, lemma) %>%
  summarise_each(funs(sum), n) %>%
  spread(lemma, n)
However, as I said, I need relative frequency, so I need to divide the above results by the total number of words in all the texts produced by John and Jane respectively. Getting the percentages is also easy:
df %>%
  group_by(lemma) %>%
  summarise_each(funs(sum), n, total_words) %>%
  mutate(percentage = n / total_words)
What I want to do is replace the total counts in the first example with the percentages in the second example, and that is where I am stuck.
I asked this question over at the manipulaR Google group, and Brandon Hurr gave me an answer that I tweaked into the final form I wanted. Here it is, in case anyone else needs to do something similar:
wordPerson <- df %>%
  group_by(person) %>%
  summarise(sumWords = sum(total_words))
df %>%
  group_by(lemma, person) %>%
  summarise_each(funs(sum), n, total_words) %>%
  inner_join(., wordPerson, by = "person") %>%
  mutate(percentage = n / sumWords) %>%
  select(person, lemma, percentage) %>%
  spread(lemma, percentage)
In short, you need to do this in two stages.
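In current dplyr releases summarise_each() and funs() are deprecated, and tidyr's spread() is superseded by pivot_wider(). The sketch below is one possible way to write the same two-stage idea with current functions. It is not the original answer's code, and the name rel_freq is only illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  text = 1:12,
  person = c(rep("John", 6), rep("Jane", 6)),
  lemma = c("he", "he", "he", "his", "it", "she", "he",
            "she", "she", "his", "it", "she"),
  n = c(8, 8, 3, 7, 10, 4, 12, 9, 3, 4, 2, 8),
  total_words = c(20, 49, 19, 39, 40, 30, 13, 30, 20, 34, 33, 15)
)

rel_freq <- df %>%
  # Stage 1: total words per person, attached to every row of that person
  group_by(person) %>%
  mutate(sumWords = sum(total_words)) %>%
  # Stage 2: pronoun count per person and lemma, divided by that total
  group_by(person, lemma) %>%
  summarise(percentage = sum(n) / first(sumWords), .groups = "drop") %>%
  pivot_wider(names_from = lemma, values_from = percentage)
rel_freq
```

John wrote 197 words in total and used "he" 19 times, so his value in the he column is 19 / 197.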
I was wondering if there is a way to compute the mean excluding outliers using the dplyr package in R? I was trying to do something like this, but it did not work:
library(dplyr)
w = rep("months", 4)
value = c(1, 10, 12, 9)
df = data.frame(w, value)
output = df %>% group_by(w) %>% summarise(m = mean(value, na.rm = T, outlier = T))
So in above example, output should be 10.333 (mean of 10, 12, & 9) instead of 8 (mean of 1, 10, 12, 9)
Thanks!
One way would be something like this using the outliers package.
library(outliers) #containing function outlier
library(dplyr)
df %>%
  group_by(w) %>%
  filter(!value %in% c(outlier(value))) %>%
  summarise(m = mean(value, na.rm = TRUE))
#        w        m
#1 months 10.33333
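Keep in mind that outlier() returns the value farthest from the sample mean, so the filter above drops a value from every group whether or not it is really extreme. As a hedged alternative, the sketch below uses the common 1.5 * IQR rule instead. The helper is_iqr_outlier() is hypothetical (not part of dplyr or outliers), and the cutoff k = 1.5 is an assumption you may want to tune.

```r
library(dplyr)

df <- data.frame(w = rep("months", 4), value = c(1, 10, 12, 9))

# Hypothetical helper: TRUE for values outside the k * IQR fences
is_iqr_outlier <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

output <- df %>%
  group_by(w) %>%
  filter(!is_iqr_outlier(value)) %>%     # drop flagged values within each group
  summarise(m = mean(value, na.rm = TRUE))
output
```

With the example data the quartiles are 7 and 10.5, so the lower fence is 1.75 and only the value 1 is dropped. With so few observations the rule is sensitive to how the quartiles are computed.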