I am new to R and very stuck on a problem which I've tried to solve in various ways.
I have data I want to plot as a graph showing Twitter engagements per day.
To do this, I need to merge all the 'created at' rows so there is only one date per row, with each date's 'total engagements' assigned to it.
So far, I've tried to do this, but can't seem to get the grouping to work.
I mutated the data to get a new 'total engage' column:
lgbthm_data_2 <- lgbthm_data %>%
  mutate(
    total_engage = favorite_count + retweet_count
  )
Then I've tried to merge the dates:
only_one_date <- lgbthm_data_2 %>%
  group_by(created_at) %>%
  summarise_all(na.omit)
But I have no idea how to proceed from here!
Any help would be great
Thanks
You are looking for:
library(dplyr)
only_one_date <- lgbthm_data_2 %>%
  group_by(created_at) %>%
  summarise(n = n())
And there is even a shorthand for this in dplyr:
only_one_date <- lgbthm_data_2 %>%
  count(created_at)
group_by + summarise can be used for many things that involve summarising all values in a group to one value, for example the mean, max and min of a column. Here I think you simply want to know how many rows each group has, i.e., how many tweets were created in one day. The special function n() tells you exactly that.
From experience with Twitter data, I also know that the created_at column is usually a date-time, not a date. In this case, it makes sense to use count(day = as.Date(created_at)) to convert it to a date first.
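For example, a minimal sketch (assuming created_at in lgbthm_data_2 is a POSIXct timestamp):

library(dplyr)
# Collapse the timestamps to calendar days and count tweets per day
tweets_per_day <- lgbthm_data_2 %>%
  count(day = as.Date(created_at))

And here is a full reprex for summing the engagements per day: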
library(tidyverse)
data <- tribble(
  ~created_at, ~favorite_count, ~retweet_count,
  "2022-02-01", 0, 2,
  "2022-02-01", 1, 3,
  "2022-02-02", 2, NA
)

summary_data <-
  data %>%
  type_convert() %>%
  group_by(created_at) %>%
  summarise(total_engage = sum(favorite_count, retweet_count, na.rm = TRUE))
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> created_at = col_date(format = "")
#> )
summary_data
#> # A tibble: 2 × 2
#> created_at total_engage
#> <date> <dbl>
#> 1 2022-02-01 6
#> 2 2022-02-02 2
qplot(created_at, total_engage, geom = "col", data = summary_data)
Created on 2022-04-04 by the reprex package (v2.0.0)
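Note that qplot() has since been deprecated in ggplot2 (as of 3.4.0); the equivalent ggplot() call is:

ggplot(summary_data, aes(created_at, total_engage)) +
  geom_col()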
The data can be found at: https://www.kaggle.com/tovarischsukhov/southparklines
SP = read.csv("/Users/michael/Desktop/stat 479 proj data/All-seasons.csv")
SP$Season = as.numeric(SP$Season)
SP$Episode = as.numeric(SP$Episode)
Clean.Boys = SP %>% select(Season, Episode, Character) %>%
  arrange(Season, Episode, Character) %>%
  filter(Character == "Kenny" | Character == "Cartman") %>%
  group_by(Season, Episode)
count = table(Clean.Boys)
count = as.data.frame(count)
Clean = count %>% pivot_wider(names_from = Character, values_from = Freq) %>% group_by(Episode)
  Season Episode Cartman Kenny
  <fct>  <fct>     <int> <int>
1 1      1            85     5
2 2      1             1     0
3 3      1            43    19
4 4      1            83     6
5 5      1            37     3
6 6      1            67     0
I am trying to use ggplot to make a single plot with two lines on it: one for the Cartman variable and one for the Kenny variable. My two questions are:
Is my data formatted correctly to make a plot with geom_line(), or would I have to pivot it longer?
I want to plot the x-axis as a continuous variable, similar to a date, but instead it is season and episode. For example, the first plotting point would be Season 1 Episode 1, then Season 1 Episode 2, and so on. I am stuck on how to do that with Season and Episode being in separate columns, and even if I combined them I'm not sure what the proper format would be.
In this example I've used readr::read_csv to read the file and set the variable types in the call to save doing this in separate lines of code.
The frequency count can be done with dplyr::summarise, within the piped workflow.
I'm not sure what you really mean by wanting to keep the season and episode data as a continuous variable - you'd have to be more explicit about how you want this to look. The approach I've taken is to provide a means of showing season and episode using minimal text:
Season and episode sort in numeric order by default, but once combined into a character string they have to be coerced back into numerical order using factor(). An alternative could be to facet by season; see the sketch after the code below.
ggplot likes to have data in long format, so there is no need to convert the data into wide format.
To keep the graph readable only the first 80 observations are shown.
library(readr)
library(dplyr)
library(ggplot2)
SP <- read_csv("...your file path.../All-seasons.csv", col_types = "nncc")
Clean.Boys <-
  SP %>%
  select(-Line) %>%
  arrange(Season, Episode, Character) %>%
  filter(Character == "Kenny" | Character == "Cartman") %>%
  group_by(Season, Episode, Character) %>%
  summarise(count = n(), .groups = "keep") %>%
  mutate(x_lab = factor(paste(Season, Episode, sep = "\n"))) %>%
  head(n = 80)
ggplot(Clean.Boys) +
  geom_line(aes(x_lab, count, group = Character, colour = Character)) +
  labs(x = "Season and episode")
Created on 2022-02-20 by the reprex package (v2.0.1)
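As a rough sketch of the faceting alternative mentioned above (same Clean.Boys data, with the numeric Episode on the x-axis and one panel per season):

ggplot(Clean.Boys) +
  geom_line(aes(Episode, count, group = Character, colour = Character)) +
  facet_wrap(~ Season) +
  labs(x = "Episode")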
The trick is to gather the columns you want to map as variables. As I don't know how you want to plot your graph (what goes on the x-axis and y-axis), I made a pseudo plot. For the continuous-variable part, you can convert your values to integer or numeric using as.integer() or as.numeric() and then use a continuous scale. You can check your variable structure by calling str(df), which will show the class of each variable; if one is a factor or character, convert it to a number.
#libraries
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.0.5
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
#> Warning: package 'tidyr' was built under R version 4.0.3
#your code
SP <- read.csv("C:/Users/saura/Desktop/All-seasons.csv")
SP$Season = as.numeric(SP$Season)
#> Warning: NAs introduced by coercion
SP$Episode = as.numeric(SP$Episode)
#> Warning: NAs introduced by coercion
Clean.Boys = SP %>% select(Season, Episode, Character) %>%
  arrange(Season, Episode, Character) %>%
  filter(Character == "Kenny" | Character == "Cartman") %>%
  group_by(Season, Episode)
count = table(Clean.Boys)
count = as.data.frame(count)
Clean = count %>% pivot_wider(names_from = Character, values_from = Freq) %>% group_by(Episode)
# here is the plotting code; as I don't know what you want on your axes
new_df <- Clean %>%
  gather(-Season, -Episode, key = "Views", value = "numbers")

ggplot(data = new_df, aes(
  as.numeric(Episode),
  numbers,
  color = Views,
  group = Views
)) +
  geom_path()
Created on 2022-02-19 by the reprex package (v2.0.1)
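As an aside, gather() is superseded in tidyr; with tidyr >= 1.0.0 the same reshape can be written with pivot_longer():

new_df <- Clean %>%
  pivot_longer(-c(Season, Episode), names_to = "Views", values_to = "numbers")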
I'm trying to read an array out of a JSON structure with tidyjson, as I'm trying to speed up my code.
My input data is of the structure
json <- "{\"key1\":\"test\",\"key2\":[\"abc\",\"def\"]}"
I want my output to be a data frame where key1 is one column and key2 is the second column in which all elements of the array are pasted together and separated by ";".
I tried something like
result <- json %>% spread_values(a = jstring("key1"), b = paste0(jstring("key2"), collapse = ";"))
I really have no idea how to get the array out of the JSON in the spread_values function.
I got what I want with
key2 <- json %>% enter_object("key2")
attributes(key2)$JSON %>% unlist() %>% paste0(collapse = ";")
but as I don't have unique keys I can't join it to the rest of my data and I think there must be a better way.
I'm glad you got something working! In case anyone else happens upon this question, there are definitely many ways to accomplish this task!
One is to use tidyjson to gather the data into a tall structure, then summarize:
library(tidyjson)
library(dplyr)
json <- "{\"key1\":\"test\",\"key2\":[\"abc\",\"def\"]}"
myj <- tidyjson::as.tbl_json(json)
myj %>%
  # make the data tall
  spread_values(key1 = jstring(key1)) %>%
  enter_object("key2") %>%
  gather_array("idx") %>%
  append_values_string("key2") %>%
  # now summarize
  group_by(key1) %>%
  summarize(key2 = paste(key2, collapse = ";"))
#> # A tibble: 1 x 2
#> key1 key2
#> <chr> <chr>
#> 1 test abc;def
Created on 2021-10-29 by the reprex package (v0.3.0)
Another way is to grab the json data directly with json_get_column() and mutate that:
library(tidyjson)
library(dplyr)
json <- "{\"key1\":\"test\",\"key2\":[\"abc\",\"def\"]}"
myj <- tidyjson::as.tbl_json(json)
myj %>%
  spread_values(key1 = jstring(key1)) %>%
  enter_object("key2") %>%
  json_get_column("array") %>%
  mutate(key2 = purrr::map_chr(array, ~ paste(.x, collapse = ";"))) %>%
  as_tibble() %>% # drop tbl_json structure
  select(key1, key2)
#> # A tibble: 1 x 2
#> key1 key2
#> <chr> <chr>
#> 1 test abc;def
Created on 2021-10-29 by the reprex package (v0.3.0)
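And if you don't need the tbl_json machinery at all, a minimal sketch with jsonlite (a separate package) gets the same result for this input:

library(jsonlite)
parsed <- fromJSON(json)  # named list: key1 = "test", key2 = c("abc", "def")
data.frame(key1 = parsed$key1, key2 = paste(parsed$key2, collapse = ";"))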
I am trying to create an R data frame including some numerical variables.
While doing this, I made a typing mistake whose result looks weird to me, and I would like to understand why (for sure I am missing something here).
I have tried to look around for possible explanations but haven't found what I am looking for.
library("dplyr")
library("tidyr")
data <-
  data.frame(FS = c(1), FS_name = c("Armenia"), Year = c(2015),
             class = c("class190"), area_1000ha = c(66.447)) %>%
  mutate(FS_name = as.character(FS_name)) %>%
  mutate(Year = as.integer(Year)) %>%
  mutate(class = as.character(class)) %>%
  tbl_df()

data

x <- data %>%
  group_by(FS, FS_name, Year, class) %>%
  dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = TRUE)) %>%
  ungroup()
As you can see, the mistake is
rm.na=
rather than
na.rm=
When I type it correctly, I get the right result for the area_1000ha variable (10.5).
If I don't, i.e. keeping rm.na=, I get 11.5 instead (+1, in fact).
What am I missing?
I think rm.na = TRUE is added to the sum: as TRUE is coerced to 1, it sums your intended total and 1.
If you change TRUE to 2, for example:
x <- data %>%
  group_by(FS_name, Year, class) %>%
  dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = 2)) %>%
  ungroup()
The result is
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 12.5
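You can see the same coercion in plain base R: sum() has signature sum(..., na.rm = FALSE), and any argument whose name does not match na.rm exactly falls into ... and is summed along with the other values:

sum(10.5, rm.na = TRUE)  # 11.5: rm.na is swallowed by ... and TRUE counts as 1
sum(10.5, na.rm = TRUE)  # 10.5: na.rm is matched as the real argument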
There is no argument named rm.na in sum(), so R treats it as just another value to be summed, and TRUE is coerced to 1.
Keep it as na.rm = TRUE and you will get the right result.
Even if you change the name of the variable
x <- data %>%
  group_by(FS, FS_name, Year, class) %>%
  dplyr::summarise(area_1000ha = sum(area_1000ha, tester = TRUE)) %>%
  ungroup()
Here I have replaced rm.na with a tester argument, and the result is the same.
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 11.5
I want to collapse the following data frame, using both summation and weighted averages, according to groups.
I have the following data frame
group_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 = rnorm(10, .5, .4)
var_percent_2 = rnorm(10, .5, .4)
weighting = sample.int(50, 10)
df_to_collapse = data.frame(group_id, var_1, var_2, var_percent_1, var_percent_2, weighting)
I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).
I create a vector of column names for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).
to_be_weighted =df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[,2:3]
to_be_weighted_2=colnames(to_be_weighted)
to_be_summed_2=colnames(to_be_summed)
And my goal is to simultaneously collapse my data using either sum or weighted average, according to the type of variable (i.e. if it's in percentage terms, I use a weighted average).
Here is my best attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_summed_2,to_be_weighted_2), .funs=c(sum, mean))
But, as you can see, it is not a weighted average.
I have tried many different ways of using the weighted.mean function, but have had no luck. Here is an example of one such attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_weighted_2,to_be_summed_2), .funs=c(weighted.mean(to_be_weighted_2, weighting), sum))
And the corresponding error:
Error in weighted.mean.default(to_be_weighted_2, weighting) :
'x' and 'w' must have the same length
Here's a way to do it by reshaping into long data, adding a dummy variable called type for whether it's a percentage (optional, but handy), applying a function in summarise based on whether it's a percentage, then spreading back to wide shape. If you can change column names, you could come up with a more elegant way of doing the type column, but that's really more for convenience.
The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values, when you really just need 1.
library(tidyverse)
set.seed(1234)
group_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 = rnorm(10, .5, .4)
var_percent_2 = rnorm(10, .5, .4)
weighting = sample.int(50, 10)
df_to_collapse <- data.frame(group_id, var_1, var_2, var_percent_1, var_percent_2, weighting)
df_to_collapse %>%
  gather(key = var, value = value, -group_id, -weighting) %>%
  mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
  group_by(group_id, var) %>%
  summarise(sum_or_avg = ifelse(type[1] == "percent", weighted.mean(value, weighting), sum(value))) %>%
  ungroup() %>%
  spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#> group_id var_1 var_2 var_percent_1 var_percent_2
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 26 31 0.269 0.483
#> 2 2 32 21 0.854 0.261
#> 3 3 29 49 0.461 0.262
Created on 2018-05-04 by the reprex package (v0.2.0).
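For what it's worth, with dplyr >= 1.0.0 the same collapse can be written without reshaping, using across() with the two column-name vectors from the question (a sketch, not a drop-in for older dplyr):

library(dplyr)
df_to_collapse %>%
  group_by(group_id) %>%
  summarise(
    across(all_of(to_be_summed_2), sum),                         # sum the level variables
    across(all_of(to_be_weighted_2), ~ weighted.mean(.x, weighting)),  # weight the percentages
    .groups = "drop"
  )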
I am struggling a little with dplyr because I want to do two things at once and wonder if it is possible.
I want to calculate the mean of the values and, at the same time, the mean of only the values that have a specific value in another column.
library(dplyr)
set.seed(1234)
df <- data.frame(id = rep(1:10, each = 14),
                 tp = letters[1:14],
                 value_type = sample(LETTERS[1:3], 140, replace = TRUE),
                 values = runif(140))
df %>%
  group_by(id, tp) %>%
  summarise(
    all_mean = mean(values),
    A_mean = mean(values),  # Only the values with value_type A
    value_count = sum(value_type == "A")
  )
So the A_mean column should be the mean of values where value_type == 'A'.
I would normally do two separate commands and merge the results later, but I guess there is a more handy way and I just don't get it.
Thanks in advance.
We can try
df %>%
  group_by(id, tp) %>%
  summarise(all_mean = mean(values),
            A_mean = mean(values[value_type == "A"]),
            value_count = sum(value_type == "A"))
You can do this with two summary steps:
df %>%
  group_by(id, tp, value_type) %>%
  summarise(A_mean = mean(values)) %>%
  summarise(all_mean = mean(A_mean),
            A_mean = sum(A_mean * (value_type == "A")),
            value_count = sum(value_type == "A"))
The first summarise calculates the means per value_type and the second keeps only the mean of value_type == "A". Note that all_mean here is a mean of the per-type means, which equals the overall mean only when each value_type occurs equally often in the group.
You can also give the following function a try:
?summarise_if
(the function family is summarise_all)
Example
The dplyr documentation gives quite a good example of this, I think:
# The _if() variants apply a predicate function (a function that
# returns TRUE or FALSE) to determine the relevant subset of
# columns. Here we apply mean() to the numeric columns:
starwars %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
#> # A tibble: 1 x 3
#> height mass birth_year
#> <dbl> <dbl> <dbl>
#> 1 174. 97.3 87.6
The interesting thing here is the predicate function. It represents the rule by which the columns that will be summarised are selected.
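Applied to the df from this question, a minimal sketch would be:

library(dplyr)
df %>%
  group_by(id, tp) %>%
  summarise_if(is.numeric, mean)  # values is the only numeric non-grouping column

Note that this only reproduces all_mean; the conditional A_mean still needs one of the approaches above.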