Counting how often x occurs per y and visualizing it in R - r

I would like to count certain things in my dataset. I have panel data and ideally would like to count the number of activities per person.
people <- c(1,1,1,2,2,3,3,4,4,5,5)
activity <- c(1,1,1,2,2,3,4,5,5,6,6)
completion <- c(0,0,1,0,1,1,1,0,0,0,1)
So my output would tell me that person 4 has 2 tasks.
people 1
frequency activity 2
Would I need to group something? Ideally I would also like to visualize this as a histogram.
I have tried this:
## activity per person
cllw %>%
  ## group observations by people
  group_by(id_user) %>%
  ## count activities per person
and I am not sure how to create frequencies at all.

Like this?
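library(dplyr)
# Build the data frame from the vectors defined in the question
df <- data.frame(people, activity, completion)
df %>%
  group_by(people) %>%
  summarise("frequency activity" = n())
# A tibble: 5 x 2
  people `frequency activity`
   <dbl>                <int>
1      1                    3
2      2                    2
3      3                    2
4      4                    2
5      5                    2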
Or like this if you only want "active" tasks:
df %>%
  filter(completion != 1) %>%
  group_by(people) %>%
  summarise("frequency activity" = n())
# A tibble: 4 x 2
  people `frequency activity`
   <dbl>                <int>
1      1                    2
2      2                    1
3      4                    2
4      5                    1
Edit for unique tasks per person:
df %>%
  filter(completion != 1) %>%
  distinct(people, activity) %>%
  group_by(people) %>%
  summarise("frequency activity" = n())
# A tibble: 4 x 2
  people `frequency activity`
   <dbl>                <int>
1      1                    1
2      2                    1
3      4                    1
4      5                    1
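The question also asks for a plot, which the answer above does not cover. A minimal sketch, assuming the three vectors from the question and base R graphics; the names `counts` and `frequency_activity` are introduced here for illustration and are not from the original post. A bar chart shows the count for each person, while a histogram shows how many people share each count.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  people     = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  activity   = c(1, 1, 1, 2, 2, 3, 4, 5, 5, 6, 6),
  completion = c(0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1)
)

# One row per person with the number of rows recorded for that person
counts <- df %>%
  count(people, name = "frequency_activity")

# Bar chart: one bar per person
barplot(
  counts$frequency_activity,
  names.arg = counts$people,
  xlab = "person",
  ylab = "number of activities"
)

# Histogram: how many people have 1, 2, 3, ... activities
hist(
  counts$frequency_activity,
  breaks = seq(0.5, max(counts$frequency_activity) + 0.5, by = 1),
  main = "Activities per person",
  xlab = "number of activities"
)
```

To plot only unfinished or only distinct activities, the same `filter()` and `distinct()` steps shown above can be placed before `count()`.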

Related

Is there a way to do a group by and do a full count as well as a count based on filter in same table?

I have a dataset that looks like this
ID|Filter|
1 Y
1 N
1 Y
1 Y
2 N
2 N
2 N
2 Y
2 Y
3 N
3 Y
3 Y
I would like the final result to look like this: a summary with the total count and the count of rows where Filter is "Y".
ID|All Count|Filter Yes
1 4 3
2 5 2
3 3 2
If I do it like this, I only get the full count, but I also want the filter count as the next column:
df <- df %>%
  group_by(ID) %>%
  summarise(`All Count` = n())
df %>%
  group_by(ID) %>%
  summarise(`All Count` = n(),
            `Count Yes` = sum(Filter == "Y"))
# A tibble: 3 × 3
  ID    `All Count` `Count Yes`
  <chr>       <int>       <int>
1 1               4           3
2 2               5           2
3 3               3           2
We can use
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(`All Count` = n(),
            `Filter Yes` = sum(Filter == 'Y', na.rm = TRUE))
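For anyone who wants to run this end to end, here is a self-contained sketch. The data frame is typed in by hand from the table in the question (with `ID` assumed to be numeric), and `result` is a name chosen here for illustration.

```r
library(dplyr)

# Example data from the question, entered by hand
df <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  Filter = c("Y", "N", "Y", "Y", "N", "N", "N", "Y", "Y", "N", "Y", "Y")
)

result <- df %>%
  group_by(ID) %>%
  summarise(
    `All Count`  = n(),
    # Filter == "Y" is a logical vector; summing it counts the TRUE values
    `Filter Yes` = sum(Filter == "Y", na.rm = TRUE)
  )

print(result)
```

The `na.rm = TRUE` only matters when `Filter` contains missing values: without it, a single `NA` in a group would turn that group's filtered count into `NA`.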

Calculate ratio for subsets within subsets using dplyr

I have a set of data for many authors (AU), spanning multiple years (Year) and multiple topics (Topic). For each AU, Year, and Topic combination I want to calculate a ratio of the total FL by Topic / total FL for the year.
The data will look like this:
Data <- data.frame("AU" = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
"Year" = c(2010,2010,2010,2010,2010,2010,2011,2011,2011,2011,2010,2010,2010,2011,2011,2011,2011,2010,2011,2011),
"Topic" = c(1,1,1,2,2,2,1,1,2,2,2,2,2,1,1,1,1,1,1,1),
"FL" = c(1,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0,1,1,1))
I've been playing around with dplyr trying to figure out how to do this. I can group_by easily enough, but I'm not sure how to calculate the ratio using a "group" for the numerator and a total across all groups for the denominator.
Results <- Data %>%
  group_by(Year, AU) %>%
  summarise(ratio = ???) # Should be (Sum(FL) by Topic) / (Sum(FL) across all Topics)
If I understand your desired output correctly, you can calculate the total by Topic, Year, AU and the total by Year, AU separately, then join them together using left_join.
left_join(
  Data %>%
    group_by(AU, Year, Topic) %>%
    summarise(FL_topic = sum(FL)) %>%
    ungroup(),
  Data %>%
    group_by(AU, Year) %>%
    summarise(FL_total = sum(FL)) %>%
    ungroup(),
  by = c("AU", "Year")
) %>%
  mutate(ratio = FL_topic / FL_total)
# A tibble: 7 x 6
#      AU  Year Topic FL_topic FL_total ratio
#   <dbl> <dbl> <dbl>    <dbl>    <dbl> <dbl>
# 1     1  2010     1        2        4  0.5
# 2     1  2010     2        2        4  0.5
# 3     1  2011     1        0        2  0
# 4     1  2011     2        2        2  1
# 5     2  2010     1        1        4  0.25
# 6     2  2010     2        3        4  0.75
# 7     2  2011     1        4        4  1
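The same ratios can probably be obtained without a join by summarising to the Topic level first and then regrouping by AU and Year. This is a sketch, not the answerer's method, and it assumes dplyr 1.0.0 or later for the `.groups` argument.

```r
library(dplyr)

Data <- data.frame(
  AU    = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
  Year  = c(2010,2010,2010,2010,2010,2010,2011,2011,2011,2011,
            2010,2010,2010,2011,2011,2011,2011,2010,2011,2011),
  Topic = c(1,1,1,2,2,2,1,1,2,2,2,2,2,1,1,1,1,1,1,1),
  FL    = c(1,0,1,1,1,0,0,0,1,1,1,1,1,1,1,0,0,1,1,1)
)

Results <- Data %>%
  # Step 1: total FL per author, year and topic
  group_by(AU, Year, Topic) %>%
  summarise(FL_topic = sum(FL), .groups = "drop") %>%
  # Step 2: within each author and year, divide by the total across topics
  group_by(AU, Year) %>%
  mutate(FL_total = sum(FL_topic),
         ratio    = FL_topic / FL_total) %>%
  ungroup()

print(Results)
```

If an author has no FL at all in a year, `FL_total` is 0 and the ratio becomes `NaN`, so that case may need separate handling.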

R - how to sum each column of a df

I have this df
df <- read.table(text="
id month gas tickets
1 1 13 14
2 1 12 1
1 2 4 5
3 1 5 7
1 3 0 9
", header=TRUE)
What I would like to do is calculate the sum of gas and tickets (and another 50+ columns in my real df) for each month. Usually I would do something like
result <-
  df %>%
  group_by(month) %>%
  summarise(
    gas = sum(gas),
    tickets = sum(tickets)
  ) %>%
  ungroup()
But since I have a lot of columns in my data frame, I don't want to repeat myself by writing a sum call for each column. I'm wondering whether there is a more elegant way, a function or something similar, that sums every column except id and month, grouped by month.
You can use summarise_at() to ignore id and sum the rest:
df %>%
  group_by(month) %>%
  summarise_at(vars(-id), list(sum = sum))
# A tibble: 3 x 3
  month gas_sum tickets_sum
  <int>   <int>       <int>
1     1      30          22
2     2       4           5
3     3       0           9
You can use aggregate as markus recommends in the comments. If you want to stick to the tidyverse you could try something like this:
df %>%
  select(-id) %>%
  group_by(month) %>%
  summarise_if(is.numeric, sum)
#### OUTPUT ####
# A tibble: 3 x 3
  month   gas tickets
  <fct> <int>   <int>
1 1        30      22
2 2         4       5
3 3         0       9
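In newer dplyr versions the scoped verbs (summarise_at(), summarise_if()) are superseded by across(). A minimal sketch of the same summary, assuming dplyr 1.0.0 or later; `result` is just the name used in the question.

```r
library(dplyr)

df <- read.table(text = "
id month gas tickets
1 1 13 14
2 1 12 1
1 2 4 5
3 1 5 7
1 3 0 9
", header = TRUE)

result <- df %>%
  group_by(month) %>%
  # across(-id, sum) applies sum() to every non-grouping column except id;
  # the grouping column month is skipped automatically
  summarise(across(-id, sum))

print(result)
```

Because the selection is written as "everything except id", any additional numeric columns in the real data frame are picked up without changing the code.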

Dynamically Normalize all rows with first element within a group

Suppose I have the following data frame:
  year subject grade study_time
1    1       a    30         20
2    2       a    60         60
3    1       b    30         10
4    2       b    90        100
What I would like to do is be able to divide grade and study_time by their first record within each subject. I do the following:
df %>%
  group_by(subject) %>%
  mutate(RN = row_number()) %>%
  mutate(study_time = study_time / study_time[RN == 1],
         grade = grade / grade[RN == 1]) %>%
  select(-RN)
I would get the following output
  year subject grade study_time
1    1       a     1          1
2    2       a     2          3
3    1       b     1          1
4    2       b     3         10
It's fairly easy to do when I know what the variable names are. However, I'm trying to write a generalized function that can act on any data.frame/data.table/tibble where I may not know the names of the variables that I need to mutate; I'll only know the names of the variables not to mutate. I'm trying to get this done using tidyverse/data.table and I can't get anything to work.
Any help would be greatly appreciated.
We group by 'subject' and use mutate_at to change multiple columns by dividing the element by the first element
library(dplyr)
df %>%
  group_by(subject) %>%
  mutate_at(3:4, funs(./first(.)))
# A tibble: 4 x 4
# Groups:   subject [2]
#    year subject grade study_time
#   <int> <chr>   <dbl>      <dbl>
# 1     1 a           1          1
# 2     2 a           2          3
# 3     1 b           1          1
# 4     2 b           3         10
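Since the question asks for a function that only knows which columns not to touch, here is one possible sketch using across(), assuming dplyr 1.0.0 or later. The helper `normalise_by_first()` and its arguments `group` and `keep` are hypothetical names introduced for illustration.

```r
library(dplyr)

df <- data.frame(
  year       = c(1, 2, 1, 2),
  subject    = c("a", "a", "b", "b"),
  grade      = c(30, 60, 30, 90),
  study_time = c(20, 60, 10, 100)
)

# Hypothetical helper:
#   group - name(s) of the grouping column(s)
#   keep  - name(s) of the columns that must NOT be rescaled
normalise_by_first <- function(data, group, keep) {
  data %>%
    group_by(across(all_of(group))) %>%
    # Divide every remaining column by its first value within the group;
    # grouping columns are excluded from across() automatically
    mutate(across(-all_of(keep), ~ .x / first(.x))) %>%
    ungroup()
}

result <- normalise_by_first(df, group = "subject", keep = "year")
print(result)
```

"First" here means first in the current row order, so the data should be sorted (for example by year) before calling the helper if the order is not guaranteed.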

recoding categorical with no mapping values

I have a data frame with a lot of variables (82), many of which are used for further calculations. I've tried to convert them to numeric, but it is a huge amount of work to find the distinct values of every variable and then assign numbers.
I wonder if there's a more automated way of doing it, since I don't care which number is assigned to a value as long as it is not repeated.
My approach so far (for the sake of clarity, dummy data):
df <- data.frame(original.var1 = c("display","memory","software","display","disk","memory"),
                 original.var2 = c("skeptic","believer","believer","believer","skeptic","believer"),
                 original.var3 = c("round","square","triangle","cube","sphere","hexagon"),
                 original.var4 = c(10,20,30,40,50,60))
Given that this worked fine:
library(dplyr)
library(magrittr)
df$NEW1 <- as.numeric(interaction(df$original.var1, drop=TRUE))
I've tried to adapt it to dplyr and pipes this way:
df %<>% mutate(VAR1 = as.numeric(interaction(original.var1, drop=TRUE))) %>%
  mutate(VAR2 = as.numeric(interaction(original.var2, drop=TRUE))) %>%
  mutate(VAR3 = as.numeric(interaction(original.var2, drop=TRUE)))
but the results are wrong from the third VAR onward:
> df %>% dplyr::group_by(original.var1,VAR1) %>% tally()
# A tibble: 4 x 3
# Groups:   original.var1 [?]
  original.var1  VAR1     n
         <fctr> <dbl> <int>
1          disk     1     1
2       display     2     2
3        memory     3     2
4      software     4     1
> df %>% dplyr::group_by(original.var2,VAR2) %>% tally()
# A tibble: 2 x 3
# Groups:   original.var2 [?]
  original.var2  VAR2     n
         <fctr> <dbl> <int>
1      believer     1     4
2       skeptic     2     2
> df %>% dplyr::group_by(original.var3,VAR3) %>% tally()
# A tibble: 6 x 3
# Groups:   original.var3 [?]
  original.var3  VAR3     n
         <fctr> <dbl> <int>
1          cube     1     1
2       hexagon     1     1
3         round     2     1
4        sphere     2     1
5        square     1     1
6      triangle     1     1
Any approach or package to recode not having the mapping declared previously?
You can use mutate_if,
library(dplyr)
mutate_if(df, is.factor, funs(as.numeric(interaction(., drop = TRUE))))
which gives,
  original.var1 original.var2 original.var3 original.var4
1             2             2             3            10
2             3             1             5            20
3             4             1             6            30
4             2             1             1            40
5             1             2             4            50
6             3             1             2            60
Alternatively, you can read your data frame with stringsAsFactors = FALSE and use is.character instead of is.factor; the result is the same.
To address your comment: if you also want to keep your original columns, then
mutate_if(df, is.factor, funs(new = as.numeric(interaction(., drop = TRUE))))
Using purrr: keep only the factor columns and operate on them, then merge with the numeric columns at the end.
df %>%
  purrr::keep(is.factor) %>%
  mutate_all(funs(as.numeric(interaction(., drop = TRUE))))
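The answers above rely on funs(), which is deprecated in newer dplyr, and on is.factor, which no longer matches anything by default in R 4.0 and later because data.frame() keeps strings as character. A hedged sketch that should cover both cases, assuming dplyr 1.0.0 or later; it swaps as.numeric(interaction(., drop = TRUE)) for as.integer(factor(.x)), which assigns codes by sorted level in the same way, and `recoded` is a name chosen here for illustration.

```r
library(dplyr)

df <- data.frame(
  original.var1 = c("display", "memory", "software", "display", "disk", "memory"),
  original.var2 = c("skeptic", "believer", "believer", "believer", "skeptic", "believer"),
  original.var3 = c("round", "square", "triangle", "cube", "sphere", "hexagon"),
  original.var4 = c(10, 20, 30, 40, 50, 60)
)

recoded <- df %>%
  mutate(across(
    # Select character columns as well as factor columns
    where(is.character) | where(is.factor),
    # factor() sorts the distinct values; as.integer() returns their codes
    ~ as.integer(factor(.x))
  ))

print(recoded)
```

To keep the original columns alongside the recoded ones, across() accepts a `.names` argument, for example `.names = "{.col}_new"`.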
