summarise_each for two variables - r

I have a dataframe that looks something like this:
df <- data.frame(
text = c(1:12),
person = c(c(rep("John", 6)), c(rep("Jane", 6))),
lemma = c("he", "he", "he", "his", "it", "she", "he",
"she", "she", "his", "it", "she"),
n = c(8, 8, 3, 7, 10, 4, 12, 9, 3, 4, 2, 8),
total_words = c(20, 49, 19, 39, 40, 30, 13, 30, 20, 34, 33, 15))
What I'm trying to do is get summary statistics, so that I can tell the relative frequency of each pronoun across all the texts produced by John and by Jane. If all I wanted was the counts, it would be easy:
library("dplyr")
library("tidyr")
df %>%
group_by(person, lemma) %>%
summarise_each(funs(sum), n) %>%
spread(lemma, n)
However, as I said, I need relative frequency, so I need to divide the above results by the total number of words in all the texts produced by John and Jane respectively. Getting the percentages is also easy:
df %>%
group_by(lemma) %>%
summarise_each(funs(sum), n, total_words) %>%
mutate(percentage = n / total_words)
What I want to do is replace the total counts in the first example with the percentages in the second example, and that is where I am stuck.

I asked this question over at the manipulatr Google group and Brandon Hurr gave me an answer that I tweaked into the final form I wanted. Here it is, in case anyone else needs to do something similar:
wordPerson <- df %>%
group_by(person) %>%
summarise(sumWords = sum(total_words))
df %>%
group_by(lemma, person) %>%
summarise_each(funs(sum), n, total_words) %>%
inner_join(., wordPerson, by = "person") %>%
mutate(percentage = n / sumWords) %>%
select(person, lemma, percentage) %>%
spread(lemma, percentage)
In short, you need to do this in two stages.
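With current versions of dplyr and tidyr, the same two stages can also be written as a single pipeline. The following is only a sketch, not the answer given on the mailing list: it assumes dplyr >= 1.0 and tidyr >= 1.0 (for `.groups` and `pivot_wider()`), and the column name `person_words` is made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  text = 1:12,
  person = rep(c("John", "Jane"), each = 6),
  lemma = c("he", "he", "he", "his", "it", "she", "he",
            "she", "she", "his", "it", "she"),
  n = c(8, 8, 3, 7, 10, 4, 12, 9, 3, 4, 2, 8),
  total_words = c(20, 49, 19, 39, 40, 30, 13, 30, 20, 34, 33, 15))

rel_freq <- df %>%
  # Stage 1: total words per person, attached to every row
  # (person_words is a hypothetical helper column)
  group_by(person) %>%
  mutate(person_words = sum(total_words)) %>%
  # Stage 2: pronoun counts per person and lemma, divided by that total
  group_by(person, lemma) %>%
  summarise(percentage = sum(n) / first(person_words), .groups = "drop") %>%
  # One column per lemma, as in the spread() version
  pivot_wider(names_from = lemma, values_from = percentage)

rel_freq
```

The per-person total still has to exist before the per-lemma sums are divided by it, so the logic remains two-stage; only the separate join is avoided.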

Related

How to obtain maximum counts by group

Using tidyverse, I would like to obtain the maximum count of events (e.g., dates) by group. Here is a minimum reproducible example:
Data frame:
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5),
event = c(12, 6, 1, 7, 13, 9, 4, 8, 2, 5, 11, 3, 10, 14))
The following code produces the desired output, but seems overly complicated:
df %>%
group_by(id) %>%
mutate(count = n()) %>%
ungroup() %>%
select(count) %>%
slice_max(count, n = 1, with_ties = FALSE)
Is there a simpler/better way? The following works, but top_n has been superseded by slice_max and it is recommended that the latter be used instead.
df %>%
count(id) %>%
distinct(n) %>% # to remove tied values
top_n(1)
Any suggestions?
If you want something with fewer steps, you could try base R table() to get the counts in a vector and then take the max(). By default it returns the max value only once even if it appears a few times in the vector.
max(table(df$id))
[1] 4
Or if you want it in tidyverse style
df$id %>%
table() %>%
max()
If you want the largest event value within each group (where id is the grouping variable), then:
df %>%
group_by(id) %>%
summarise(max_n_events = max(event))
If instead you do not care about the specific values in the event column and only look at the id column, the solution proposed by @Josh above can also be written as follows:
df %>% group_by(id) %>% count() %>% ungroup() %>% summarise(max(n))
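For completeness, a version that stays within dplyr and uses slice_max() instead of the superseded top_n() might look like the sketch below. It assumes dplyr >= 1.0; the column name `count` is chosen here to match the question's first attempt.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5),
                 event = c(12, 6, 1, 7, 13, 9, 4, 8, 2, 5, 11, 3, 10, 14))

max_count <- df %>%
  # one row per id with its number of events
  count(id, name = "count") %>%
  # keep a single row even when several ids share the maximum
  slice_max(count, n = 1, with_ties = FALSE) %>%
  select(count)

max_count
```

Because `with_ties = FALSE` already returns one row, the distinct() step from the top_n() version is not needed.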

Grouped sampling without duplication

I'm struggling to find a solution for the following problem. From a dataframe with 384 rows and 11 columns, 24 samples need to be drawn randomly, each containing 16 items.
Those 16 items correspond to the total number of combinations of factor levels, each of which must be represented in every sample.
There are 4 grouping factors:
Type, Valence, LT, Gender. Each has 2 levels. The dataframe looks essentially like this:
df2 <- data.frame(VNr=c(rep(1:8, 48)),
PId=c(rep(1:48, each = 8)),
Gender=rep(c("M", "F"), each=192),
Type=rep(c("E", "S"), each=4, times=48),
Valence=rep(c("P", "N"), each = 2, times=96),
LT=rep(c("L", "T"), each=1, times=192))
My former approach used dplyr to do the job:
library(dplyr)
library(purrr) # for map_dfr()
N <- 24
df3 <- map_dfr(seq_len(N), ~df2 %>%
group_by(Type, Valence, LT, Gender) %>%
slice_sample(n = 1) %>%
mutate(sample_no = .x) %>%
ungroup() %>%
mutate(resample = duplicated(PId)) %>%
rowwise())
Regarding the grouping, this works flawlessly. However, it produces duplicates, meaning the same PId appears more than once in a single sample, which is not acceptable.
How can this be avoided?
LMc proposed a workaround in "Sampling by Group in R with no replacement but the final result cannot contain any repeats as well".
Unfortunately, I have not been able to get it to work yet.
Any help on this issue is very much appreciated!
Thanks in advance!
-Marshal
Does this work?
library(tidyverse)
df2 <- tibble(
VNr=c(rep(1:8, 48)),
PId=c(rep(1:48, each = 8)),
Gender=rep(c("M", "F"), each=192),
Type=rep(c("E", "S"), each=4, times=48),
Valence=rep(c("P", "N"), each = 2, times=96),
LT=rep(c("L", "T"), each=1, times=192)
)
df2
df2 %>%
group_by(Type, Valence, LT, Gender) %>%
mutate(n_rows_initial = n()) %>%
slice_sample(n = 16, replace = FALSE) %>%
mutate(n_rows_sampled = n()) %>%
ungroup()
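If the requirement is that every sample contains each of the 16 factor combinations exactly once and no PId twice, one possible approach is to assign distinct PIds to the combinations first and then look up the matching rows. The sketch below is an illustration under assumptions, not a tested solution for the real 11-column data: it relies on every PId having one row for each Type/Valence/LT combination (true for the example df2), and `draw_sample()` and `cells` are hypothetical names.

```r
library(dplyr)

df2 <- data.frame(VNr = rep(1:8, 48),
                  PId = rep(1:48, each = 8),
                  Gender = rep(c("M", "F"), each = 192),
                  Type = rep(c("E", "S"), each = 4, times = 48),
                  Valence = rep(c("P", "N"), each = 2, times = 96),
                  LT = rep(c("L", "T"), times = 192))

# Hypothetical helper: draw one sample of 16 rows with unique PIds
draw_sample <- function(data, sample_no) {
  # the 16 combinations of factor levels
  cells <- distinct(data, Gender, Type, Valence, LT)
  cells$PId <- NA_integer_
  for (g in unique(cells$Gender)) {
    in_g <- cells$Gender == g
    ids <- unique(data$PId[data$Gender == g])
    # sample() without replacement, so no PId is assigned twice
    cells$PId[in_g] <- sample(ids, sum(in_g))
  }
  # pick the one row that matches each (combination, PId) pair
  out <- inner_join(cells, data,
                    by = c("Gender", "Type", "Valence", "LT", "PId"))
  out$sample_no <- sample_no
  out
}

set.seed(42)
N <- 24
df3 <- bind_rows(lapply(seq_len(N), function(i) draw_sample(df2, i)))
```

PIds are unique within each sample only. With 48 PIds and 24 samples of 16 rows, the same PId necessarily appears in several different samples.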

Is there a way to count the number of null/missing values for each month in a dataframe?

I am currently using station data for my research in R, and I need to count the number of missing/null values for each month. The data is currently in daily measurements, and the monthly total of missing values would let me trim certain months out if they are not useful.
CUM00078310_df %>%
dplyr::mutate(
Month=month(Date),
Mis = rowSums(is.na(.[,grepl("C",colnames(CUM00078310_df))]))
) %>%
group_by(Month) %>%
summarize(Sum=sum(Mis), Percentage=mean(Mis))
Here is an example. I'm not sure whether you want the data summarised or kept within the dataframe; if you do not want it summarised, omit the final two lines of code. With your data, add the month grouping variable to group_by(). To keep only the NAs, add filter(is.na(x)) if needed.
df<-data.frame(x = c(NA,2,5,10,15, NA, 3, 5, 10, 15, NA, 4, 10, NA, 6, 15))
df <- df %>%
group_by(x) %>%
mutate(valueCount = n()) %>%
arrange(desc(valueCount)) %>%
group_by(x, valueCount) %>%
summarise()
df<-data.frame(x = c(NA,2,5,10,15, NA, 3, 5, 10, 15, NA, 4, 10, NA, 6, 15))
Unsummarized example
df <- df %>%
group_by(x) %>%
mutate(valueCount = n()) %>%
arrange(desc(valueCount))
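To count missing values per month directly, a sketch along the following lines may be closer to what the question asks. The station data here is invented for illustration (the columns TMAX and PRCP and the positions of the NAs are assumptions), and the month is derived with base R's format() rather than lubridate's month() so that different years are kept apart.

```r
library(dplyr)

# Hypothetical daily station data, January to March 2020
station <- data.frame(
  Date = seq(as.Date("2020-01-01"), as.Date("2020-03-31"), by = "day"),
  TMAX = 20,
  PRCP = 0
)
station$TMAX[c(1, 2, 40)] <- NA  # two gaps in January, one in February
station$PRCP[c(5, 70)] <- NA     # one gap in January, one in March

missing_by_month <- station %>%
  mutate(Month = format(Date, "%Y-%m")) %>%
  group_by(Month) %>%
  summarise(
    days = n(),
    # number of NA values per measurement column
    across(c(TMAX, PRCP), ~ sum(is.na(.x)), .names = "{.col}_missing")
  )

missing_by_month
```

Dividing a `*_missing` column by `days` gives the share of missing days, which can then be used to decide which months to drop.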

R: How can I aggregate rows under conditions?

I'm trying to aggregate the values of the variable Schulbildung that are less than 12, and sum the corresponding values of n. I tried using the aggregate() function but it didn't work. Does anybody have an idea?
Use mutate with an ifelse statement to recode every value that is smaller than 12.
Then summarise with dplyr.
df <- data.frame(
Education = c(18, 16, 15, 12, 10, 8),
entries = c(200, 100, 50, 50, 10 ,5)
)
You said Education is a grouping variable, so this means this is not the original data.frame, right?
df %>%
ungroup() %>%
mutate(Education = ifelse(Education < 12, "others", Education)) %>%
group_by(Education) %>%
summarise(entries = sum(entries))
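Since the question mentions aggregate(), the same recode-then-sum idea can be sketched in base R as well. The original data is not shown, so this reuses the stand-in columns Education and entries from the answer above; the column name `group` is an assumption for illustration.

```r
df <- data.frame(
  Education = c(18, 16, 15, 12, 10, 8),
  entries = c(200, 100, 50, 50, 10, 5)
)

# Recode first: all values below 12 share the label "others"
df$group <- ifelse(df$Education < 12, "others", as.character(df$Education))

# Then sum entries within each label
res <- aggregate(entries ~ group, data = df, FUN = sum)
res
```

aggregate() on its own cannot merge the small categories; the recoding step has to come first, which may be why the original attempt did not work.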

R Create Custom Function with Group by and Mutate

I have a dataset on which I am running group_by and mutate.
However, I get errors when I do this inside a custom function with the column (such as Value_1 or Value_2) passed as an argument.
Please advise if I am missing something in the custom function.
Dataset:
library(dplyr)
df <- data.frame(
Date = c("2010-10-06", "2010-10-06", "2010-10-06", "2010-10-06",
"2010-10-06", "2010-10-06", "2010-10-06", "2010-10-06"),
Region = c("Central", "Central", "Central", "Central", "North", "North",
"North", "North"),
Value_1 = c(10, 2, 4, 12, 4, 4, 2, 15),
Value_2 = c(120, 45, 20, 20, 60, 50, 75, 80),
stringsAsFactors = FALSE)
Works Fine:
df %>%
group_by(Date, Region) %>%
mutate(Value_3 = sum(Value_1)) %>%
ungroup()
Error with Custom Function:
test_fn <- function(dataset, Col1) {
dataset <- dataset %>%
group_by(Date, Region) %>%
mutate(Value_3 = sum(Col1)) %>%
ungroup()
return(dataset)
}
df_3 <- test_fn(df, "Value_1")
Convert the string to a symbol with sym() and unquote it with !! inside mutate:
test_fn <- function(dataset, Col1) {
Col1 = sym(Col1)
dataset <- dataset %>%
group_by(Date, Region) %>%
mutate(Value_3 = sum(!!Col1)) %>%
ungroup()
return(dataset)
}
If you change sym(Col1) to enquo(Col1), then you don't need to pass Col1 as a string, i.e. you can call test_fn(df, Value_1).
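In more recent versions of dplyr (1.0 or later, with rlang 0.4 or later), the enquo() plus !! pattern can be shortened with the embrace operator `{{ }}`. The following is a minimal sketch of that variant, using the data from the question; the argument name `col` is chosen here for illustration.

```r
library(dplyr)

df <- data.frame(
  Date = rep("2010-10-06", 8),
  Region = rep(c("Central", "North"), each = 4),
  Value_1 = c(10, 2, 4, 12, 4, 4, 2, 15),
  Value_2 = c(120, 45, 20, 20, 60, 50, 75, 80),
  stringsAsFactors = FALSE)

test_fn <- function(dataset, col) {
  dataset %>%
    group_by(Date, Region) %>%
    # {{ col }} forwards the unquoted column name to mutate()
    mutate(Value_3 = sum({{ col }})) %>%
    ungroup()
}

# The column is passed unquoted, not as a string
df_3 <- test_fn(df, Value_1)
df_3
```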
Have a look at this version, which takes both the column and the grouping variables as strings. It helps to read up on standard vs. non-standard evaluation.
tfn <- function(data, col, groups) {
  data %>%
    ## group by the variables named in the character vector `groups`
    ## (older dplyr versions used group_by_(.dots = groups) for this)
    group_by(across(all_of(groups))) %>%
    ## mutate using the column whose name is stored in `col`
    mutate(Value_3 = sum(.data[[col]])) %>%
    ungroup()
}
tfn(df, "Value_1", c("Date", "Region"))
