trying to make a ggplot with two lines - r

data can be found at: https://www.kaggle.com/tovarischsukhov/southparklines
SP = read.csv("/Users/michael/Desktop/stat 479 proj data/All-seasons.csv")
SP$Season = as.numeric(SP$Season)
SP$Episode = as.numeric(SP$Episode)
Clean.Boys = SP %>%
  select(Season, Episode, Character) %>%
  arrange(Season, Episode, Character) %>%
  filter(Character == "Kenny" | Character == "Cartman") %>%
  group_by(Season, Episode)
count = table(Clean.Boys)
count = as.data.frame(count)
Clean = count %>%
  pivot_wider(names_from = Character, values_from = Freq) %>%
  group_by(Episode)
  Season Episode Cartman Kenny
  <fct>  <fct>     <int> <int>
1 1      1            85     5
2 2      1             1     0
3 3      1            43    19
4 4      1            83     6
5 5      1            37     3
6 6      1            67     0
I am trying to use ggplot to make a single plot with two lines on it: one for the Cartman variable and one for the Kenny variable. My two questions are:
1. Is my data formatted correctly to make a plot with geom_line(), or would I have to pivot it longer?
2. I want to plot the x-axis as a continuous variable, similar to a date, but built from season and episode. For example, the first plotting point would be Season 1 Episode 1, then Season 1 Episode 2, and so on. I am stuck on how to do that with Season and Episode in separate columns, and even if I combined them I'm not sure what the proper format would be.

In this example I've used readr::read_csv to read the file and set the variable types in the call to save doing this in separate lines of code.
The frequency count can be done with dplyr::summarise, within the piped workflow.
I'm not sure what you really mean by wanting to keep the season and episode data as a continuous variable - you'd have to be more explicit about how you want this to look. The approach I've taken is to provide a means of showing season and episode using minimal text:
Season and episode are in numeric order by default, but once they are combined into a character string they have to be coerced back into numerical order by using factor. An alternative could be to facet by season (a sketch of that is shown after the code below).
ggplot likes to have data in long format, so there is no need to convert the data into wide format.
To keep the graph readable only the first 80 observations are shown.
library(readr)
library(dplyr)
library(ggplot2)
# col_types = "nncc": Season and Episode numeric, Character and Line character
SP <- read_csv("...your file path.../All-seasons.csv", col_types = "nncc")
Clean.Boys <-
  SP %>%
  select(-Line) %>%
  arrange(Season, Episode, Character) %>%
  filter(Character == "Kenny" | Character == "Cartman") %>%
  group_by(Season, Episode, Character) %>%
  summarise(count = n(), .groups = "keep") %>%
  mutate(x_lab = factor(paste(Season, Episode, sep = "\n"))) %>%
  head(n = 80)
ggplot(Clean.Boys) +
  geom_line(aes(x_lab, count, group = Character, colour = Character)) +
  labs(x = "Season and episode")
Created on 2022-02-20 by the reprex package (v2.0.1)
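As mentioned above, faceting by season is an alternative to pasting the labels together. A minimal sketch, assuming the same Clean.Boys data frame built above (Episode is numeric, so it sits on a continuous x-axis within each season panel):
ggplot(Clean.Boys) +
  geom_line(aes(Episode, count, group = Character, colour = Character)) +
  facet_wrap(~ Season, scales = "free_x") +
  labs(x = "Episode")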

The trick is to gather the columns you want to map as variables. As I don't know how you want to plot your graph (i.e., what should go on the x-axis and y-axis), I made a pseudo plot. For your continuous-variable part, you can convert your values to integer or numeric using as.integer() or as.numeric(), and then use a continuous scale. You can check your variable structure by calling str(df), which will show you the class of each variable; if it is a factor or character, convert it to numbers.
#libraries
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.0.5
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
#> Warning: package 'tidyr' was built under R version 4.0.3
#your code
SP <- read.csv("C:/Users/saura/Desktop/All-seasons.csv")
SP$Season = as.numeric(SP$Season)
#> Warning: NAs introduced by coercion
SP$Episode = as.numeric(SP$Episode)
#> Warning: NAs introduced by coercion
Clean.Boys = SP %>%
  select(Season, Episode, Character) %>%
  arrange(Season, Episode, Character) %>%
  filter(Character == "Kenny" | Character == "Cartman") %>%
  group_by(Season, Episode)
count = table(Clean.Boys)
count = as.data.frame(count)
Clean = count %>%
  pivot_wider(names_from = Character, values_from = Freq) %>%
  group_by(Episode)
# here is your code, but as I don't know what you want on your axes:
new_df <- Clean %>%
  gather(-Season, -Episode, key = "Views", value = "numbers")
ggplot(data = new_df, aes(
  as.numeric(Episode),
  numbers,
  color = Views,
  group = Views
)) +
  geom_path()
Created on 2022-02-19 by the reprex package (v2.0.1)
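If you do want a truly continuous x-axis that runs Season 1 Episode 1, Season 1 Episode 2, and so on, one option is to number the episodes consecutively across seasons and plot against that index. A sketch, assuming the new_df built above (Season and Episode are factors created by table(), so they are converted back to numbers first; the name ep_index is just illustrative):
new_df_cont <- new_df %>%
  mutate(Season = as.numeric(as.character(Season)),
         Episode = as.numeric(as.character(Episode))) %>%
  arrange(Season, Episode) %>%
  mutate(ep_index = dense_rank(Season * 100 + Episode))  # consecutive episode number across seasons
ggplot(new_df_cont, aes(ep_index, numbers, colour = Views, group = Views)) +
  geom_line() +
  labs(x = "Episode (consecutive across seasons)")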

Related

How to merge duplicate rows in R

I am new to R and very stuck on a problem which I've tried to solve in various ways.
I have data I want to plot to a graph that shows twitter engagements per day.
To do this, I need to merge all the 'created at' rows, so there is only one date per row, and each date has the 'total engagements' assigned to it.
This is the data:
So far, I've tried to do this, but can't seem to get the grouping to work.
I mutated the data to get a new 'total engage' column:
lgbthm_data_2 <- lgbthm_data %>%
  mutate(
    total_engage = favorite_count + retweet_count
  )
Then I've tried to merge the dates:
only_one_date <- lgbthm_data_2 %>%
  group_by(created_at) %>%
  summarise_all(na.omit)
But no idea!
Any help would be great
Thanks
You are looking for:
library(dplyr)
only_one_date <- lgbthm_data_2 %>%
group_by(created_at) %>%
summarise(n = n())
And there is even a shorthand for this in dplyr:
only_one_date <- lgbthm_data_2 %>%
count(created_at)
group_by + summarise can be used for many things that involve summarising all values in a group to one value, for example the mean, max and min of a column. Here I think you simply want to know how many rows each group has, i.e., how many tweets were created in one day. The special function n() tells you exactly that.
From experience with Twitter, I also know that the column created_at is usually a time, not a date format. In this case, it makes sense to use count(day = as.Date(created_at)) to convert it to a date first.
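For example, a minimal sketch of that idea, assuming lgbthm_data_2 has a POSIXct created_at column and the total_engage column created above (object names here are just illustrative):
tweets_per_day <- lgbthm_data_2 %>%
  count(day = as.Date(created_at))            # number of tweets per day
engage_per_day <- lgbthm_data_2 %>%
  group_by(day = as.Date(created_at)) %>%
  summarise(total_engage = sum(total_engage, na.rm = TRUE))  # summed engagements per day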
library(tidyverse)
data <- tribble(
  ~created_at, ~favorite_count, ~retweet_count,
  "2022-02-01", 0, 2,
  "2022-02-01", 1, 3,
  "2022-02-02", 2, NA
)
summary_data <-
  data %>%
  type_convert() %>%
  group_by(created_at) %>%
  summarise(total_engage = sum(favorite_count, retweet_count, na.rm = TRUE))
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> created_at = col_date(format = "")
#> )
summary_data
#> # A tibble: 2 × 2
#> created_at total_engage
#> <date> <dbl>
#> 1 2022-02-01 6
#> 2 2022-02-02 2
qplot(created_at, total_engage, geom = "col", data = summary_data)
Created on 2022-04-04 by the reprex package (v2.0.0)

Extracting JSON data with asymmetric content from a dataframe column in R

I loaded a table from a database which contains a column that has JSON data in each row.
The table looks something like the example below. (I was not able to replicate the data.frame I have, due to the format of the column data)
dataframe_example <- data.frame(id = c(1,2,3),
name = c("name1","name2","name3"),
JSON_col = c({"_inv": [10,20,30,40]}, "_person": ["_personid": "green"],
{"_inv": [15,22]}, "_person": ["_personid": "blue"],
{"_inv": []}, "_person": ["_personid": "red"]))
I have the following two issues:
Some of the items (e.g. "_inv") sometimes have the full 4 numeric entries, sometimes less, and sometimes nothing. Some of the other items (e.g. "_person") usually contain another header, but only one character data point.
My goal is to preserve the existing dataframe's columns (such as id and name) and spread the data in the JSON column such that I have new columns containing each point of information. The target dataframe would look a little like this:
data.frame(id = c(1,2,3),
           name = c("name1","name2","name3"),
           `_inv_1` = c(10,15,NA),
           `_inv_2` = c(20,22,NA),
           `_inv_3` = c(30,NA,NA),
           `_inv_4` = c(40,NA,NA),
           `_person_id` = c("green","blue","red"))
Please bear in mind that I have very little experience handling JSON data and no experience dealing with uneven JSON data.
Using purrr I got:
frame <- purrr::map(dataframe_example$JSON_col, jsonlite::fromJSON)
This gave me a large list with n elements, where n is the number of rows of the original dataframe. Each element [[1]] is itself a list whose items each have their own type of object, ranging from double to data.frame. The double objects contain up to four numeric observations (such as _inv); some of the items are lists themselves (such as _person), which further contain "_personid" and then a single entry. The data.frame items contain the datetime stamps for each observation in the JSON data (each _inv item has a timestamp).
Is there a way to obtain the solution above, either by extracting the data from my "frame" object, or an altogether different solution?
library(tidyverse)
library(jsonlite)
#>
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#>
#> flatten
dataframe_example <-
  data.frame(
    id = c(1, 2, 3),
    name = c("name1", "name2", "name3"),
    JSON_col = c(
      "{\"_inv\": [10,20,30,40], \"_person\": {\"_personid\": \"green\"}}",
      "{\"_inv\": [15,22], \"_person\": {\"_personid\": \"blue\"}}",
      "{\"_inv\": [], \"_person\": {\"_personid\": \"red\"}}"
    )
  )
dataframe_example %>%
  as_tibble() %>%
  mutate(
    JSON_col = JSON_col %>% map(parse_json)
  ) %>%
  unnest_wider(JSON_col) %>%
  unnest(`_inv`) %>%      # unnest twice: the first call unchops the list, the second flattens the values
  unnest(`_inv`) %>%
  unnest(`_person`) %>%
  unnest(`_person`) %>%
  group_by(id, name) %>%
  mutate(inv_id = row_number()) %>%
  pivot_wider(names_from = inv_id, values_from = `_inv`, names_prefix = "_inv_")
#> # A tibble: 2 x 7
#> # Groups: id, name [2]
#> id name `_person` `_inv_1` `_inv_2` `_inv_3` `_inv_4`
#> <dbl> <chr> <chr> <int> <int> <int> <int>
#> 1 1 name1 green 10 20 30 40
#> 2 2 name2 blue 15 22 NA NA
Created on 2021-11-25 by the reprex package (v2.0.1)
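Note that the unnest() calls drop the row whose _inv is empty, which is why id 3 ("red") is missing from the output above. A hedged sketch of one way to keep it, by replacing the empty list with NA before unnesting (assuming the same dataframe_example and packages loaded above):
dataframe_example %>%
  as_tibble() %>%
  mutate(JSON_col = map(JSON_col, parse_json)) %>%
  unnest_wider(JSON_col) %>%
  unnest_wider(`_person`) %>%                                    # gives a `_personid` column
  mutate(`_inv` = map(`_inv`, unlist),                           # flatten each _inv to a numeric vector
         `_inv` = map(`_inv`, ~ if (length(.x) == 0) NA_real_ else .x)) %>%
  unnest(`_inv`) %>%
  group_by(id, name) %>%
  mutate(inv_id = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = inv_id, values_from = `_inv`, names_prefix = "_inv_")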

Is there an R function that can convert an existing metric into a new logical metric?

I have a dataset derived from Pokemon statistics containing a lot of the numerical and categorical data. My end goal is to create a model or recommendation system that a user can input a list of Pokemon and the model finds similar Pokemon they may like. Currently the dataset looks something like this:
ID Name Type1 Type2 HP
001 Bulba.. Grass Poison 45
etc...
I understand the Type1/Type2 metric might be problematic. Is there a function that would let me create/modify new columns where, if a Pokemon had a particular type, it would add a logical value (0 for false, 1 for true) in that new column?
I apologize for a lackluster explanation, but what I want is for my dataset to look like this:
ID Name Grass Poison Water HP
001 Bulba.. 1 1 0 45
etc...
tidyr is a package for data reshaping. Here, we'll use pivot_longer() to put the data into a long format, where the type names (Type1, Type2) will reside in the column "name", while the values (Grass, Poison, etc.) will reside in the column "value". We filter out rows with is.na(value), because that means the Pokemon did not have a second type. We create an indicator variable that gets a 1; each Pokemon will then have indicator == 1 for the types it has. We drop the now-extraneous "name" column and use pivot_wider() to transform each unique value in "value" into its own column, which receives indicator's value as the cell value for each row. Finally, we mutate all numeric columns to replace missings with 0, since we know those Pokemon aren't those types.
A better solution than mutate_if(is.numeric, ...) would be to compute the unique values of the types and use mutate_at(vars(pokemon_types), ...). This would not affect other numeric columns unintentionally (a sketch of this is shown after the output below).
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
pokemon <- tibble(ID = c(1, 2), Name = c("Bulbasaur", "Squirtle"),
                  Type1 = c("Grass", "Water"),
                  Type2 = c("Poison", NA),
                  HP = c(40, 50))
pokemon %>%
  pivot_longer(
    starts_with("Type")
  ) %>%
  filter(!is.na(value)) %>%
  mutate(indicator = 1) %>%
  select(-name) %>%
  pivot_wider(names_from = value, values_from = indicator) %>%
  mutate_if(is.numeric, .funs = function(x) if_else(is.na(x), 0, x))
#> # A tibble: 2 x 6
#> ID Name HP Grass Poison Water
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 Bulbasaur 40 1 1 0
#> 2 2 Squirtle 50 0 0 1
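As noted above, here is a sketch of the mutate_at() alternative, assuming the same pokemon tibble: compute the set of type names first, so that only the type indicator columns are filled with 0 and other numeric columns such as ID and HP are left alone.
pokemon_types <- setdiff(unique(c(pokemon$Type1, pokemon$Type2)), NA)  # all distinct types, NA dropped
pokemon %>%
  pivot_longer(starts_with("Type")) %>%
  filter(!is.na(value)) %>%
  mutate(indicator = 1) %>%
  select(-name) %>%
  pivot_wider(names_from = value, values_from = indicator) %>%
  mutate_at(vars(all_of(pokemon_types)), ~ if_else(is.na(.), 0, .))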

Collapse data frame, by group, using lists of variables for weighted average AND sum

I want to collapse the following data frame, using both summation and weighted averages, according to groups.
I have the following data frame
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse = data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
I want to collapse my data according to the groups identified by group_id. However, in my data, I have variables in absolute levels (var_1, var_2) and in percentage terms (var_percent_1, var_percent_2).
I create two lists for each type of variable (my real data is much bigger, making this necessary). I also have a weighting variable (weighting).
to_be_weighted =df_to_collapse[, 4:5]
to_be_summed = df_to_collapse[,2:3]
to_be_weighted_2=colnames(to_be_weighted)
to_be_summed_2=colnames(to_be_summed)
And my goal is to simultaneously collapse my data using either sum or weighted average, according to the type of variable (i.e., if it's in percentage terms, I use the weighted average).
Here is my best attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_summed_2,to_be_weighted_2), .funs=c(sum, mean))
But, as you can see, it is not a weighted average.
I have tried many different ways of using the weighted.mean function, but have had no luck. Here is an example of one such attempt:
df_to_collapse %>% group_by(group_id) %>% summarise_at(.vars = c(to_be_weighted_2,to_be_summed_2), .funs=c(weighted.mean(to_be_weighted_2, weighting), sum))
And the corresponding error:
Error in weighted.mean.default(to_be_weighted_2, weighting) :
'x' and 'w' must have the same length
Here's a way to do it by reshaping into long data, adding a dummy variable called type for whether it's a percentage (optional, but handy), applying a function in summarise based on whether it's a percentage, then spreading back to wide shape. If you can change column names, you could come up with a more elegant way of doing the type column, but that's really more for convenience.
The trick for me was the type[1] == "percent"; I had to use [1] because everything in each group has the same type, but otherwise == operates over every value in the vector and gives multiple logical values, when you really just need 1.
library(tidyverse)
set.seed(1234)
group_id = c(1,1,1,2,2,3,3,3,3,3)
var_1 = sample.int(20, 10)
var_2 = sample.int(20, 10)
var_percent_1 =rnorm(10,.5,.4)
var_percent_2 =rnorm(10,.5,.4)
weighting =sample.int(50, 10)
df_to_collapse <- data.frame(group_id,var_1,var_2,var_percent_1,var_percent_2,weighting)
df_to_collapse %>%
  gather(key = var, value = value, -group_id, -weighting) %>%
  mutate(type = ifelse(str_detect(var, "percent"), "percent", "int")) %>%
  group_by(group_id, var) %>%
  summarise(sum_or_avg = ifelse(type[1] == "percent", weighted.mean(value, weighting), sum(value))) %>%
  ungroup() %>%
  spread(key = var, value = sum_or_avg)
#> # A tibble: 3 x 5
#> group_id var_1 var_2 var_percent_1 var_percent_2
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 26 31 0.269 0.483
#> 2 2 32 21 0.854 0.261
#> 3 3 29 49 0.461 0.262
Created on 2018-05-04 by the reprex package (v0.2.0).
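For what it's worth, newer versions of dplyr (1.0+) also make the original summarise_at idea workable without reshaping, by applying a different function to each set of columns with across(). A sketch, assuming the to_be_summed_2 and to_be_weighted_2 name vectors defined in the question:
df_to_collapse %>%
  group_by(group_id) %>%
  summarise(
    across(all_of(to_be_summed_2), sum),                               # level variables: plain sums
    across(all_of(to_be_weighted_2), ~ weighted.mean(.x, weighting)),  # percentage variables: weighted means
    .groups = "drop"
  )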

Can I combine pairwise_cor and pairwise_count to get the phi coefficient AND number of occurrences for each pair of words?

I'm new to R, and I'm using widyr to do text mining. I successfully used the methods found here to get a list of co-occurring words within each section of text and their phi coefficient.
Code as follows:
word_cors <- review_words %>%
group_by(word) %>%
pairwise_cor(word, title, sort = TRUE) %>%
filter(correlation > .15)
I understand that I can also generate a data frame with co-occurring words and the number of times they appear, using code like:
word_pairs <- review_words %>%
pairwise_count(word, title, sort = TRUE)
What I need is a table that has both the phi coefficient and the number of occurrences for each pair of words. I've been digging into pairwise_cor and pairwise_count but still can't figure out how to combine them. If I understand correctly, joins only take one column into account for matching, so I couldn't use a regular join reliably since there may be multiple pairs that have the same word in the item1 column.
Is this possible using widyr? If not, is there another package that will allow me to do this?
Here is the full code:
#Load packages
pacman::p_load(XML, dplyr, stringr, rvest, httr, xml2, tidytext, tidyverse, widyr)
#Load source material
prod_reviews_df <- read_csv("SOURCE SPREADSHEET.csv")
#Split into one word per row
review_words <- prod_reviews_df %>%
unnest_tokens(word, comments, token = "words", format = "text", drop = FALSE) %>%
anti_join(stop_words, by = c("word" = "word"))
#Find phi coefficient
word_cors <- review_words %>%
group_by(word) %>%
pairwise_cor(word, title, sort = TRUE) %>%
filter(correlation > .15)
#Write data to CSV
write.csv(word_cors, "WORD CORRELATIONS.csv")
I want to add in pairwise_count, but I need it alongside the phi coefficient.
Thank you!
If you are getting into using tidy data principles and tidyverse tools, I would suggest GOING ALL THE WAY :) and using dplyr to do the joins you are interested in. You can use left_join to connect the calculations from pairwise_cor() and pairwise_count(), and you can just pipe from one to the other, if you like.
library(dplyr)
library(tidytext)
library(janeaustenr)
library(widyr)
austen_section_words <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)
austen_section_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE) %>%
  left_join(austen_section_words %>%
              pairwise_count(word, section, sort = TRUE),
            by = c("item1", "item2"))
#> # A tibble: 154,842 x 4
#> item1 item2 correlation n
#> <chr> <chr> <dbl> <dbl>
#> 1 bourgh de 0.9508501 29
#> 2 de bourgh 0.9508501 29
#> 3 pounds thousand 0.7005808 17
#> 4 thousand pounds 0.7005808 17
#> 5 william sir 0.6644719 31
#> 6 sir william 0.6644719 31
#> 7 catherine lady 0.6633048 82
#> 8 lady catherine 0.6633048 82
#> 9 forster colonel 0.6220950 27
#> 10 colonel forster 0.6220950 27
#> # ... with 154,832 more rows
I discovered and used merge today, and it appears to have used both relevant columns to merge the data. I'm not sure how to check for accuracy, but I think it worked.
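For reference, a minimal sketch of that merge, assuming the word_cors and word_pairs data frames built above: base R's merge() joins on all shared column names by default, so both item1 and item2 are used for matching.
word_combined <- merge(word_cors, word_pairs, by = c("item1", "item2"), all.x = TRUE)  # all.x = TRUE keeps every row of word_cors, like left_join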
