Joined columns not visible in final result using dplyr - r

I am new to R. I have the following code, which does some aggregations on the MovieLens dataset using dplyr:
joined_data <- inner_join(ratings_data,movie_data,by="movie_id",copy=TRUE)
data <- joined_data %>% group_by(movie_id) %>% arrange(movie_id)
data1 <- data %>% select(movie_id,movie_title,rating) %>% summarize(count_ratings=n())
The data tbl has all the columns I want (movie_id, movie_title, rating, ...). I'm trying to select only three columns and summarize them, but the data1 tbl does not have movie_title, which came from the second table (movie_data). Why is this happening, and how do I get the columns I want in data1?
names(data)
[1] "user_id" "movie_id" "rating" "timestamp" "movie_title"
[6] "release_date" "video_release.date" "IMDb_URL" "unknown" "Action"
[11] "Adventure" "Animation" "Childrens" "Comedy" "Crime"
[16] "Documentary" "Drama" "Fantasy" "Film_Noir" "Horror"
[21] "Musical" "Mystery" "Romance" "Sci_Fi" "Thriller"
[26] "War" "Western"
But when I do this:
data1 <- data %>% select(movie_id,movie_title,user_id,rating) %>% summarize(count_users=n(),count_ratings=n())
names(data1)
[1] "movie_id" "count_users" "count_ratings"

The group_by(movie_id) in your second line is responsible for that: summarize() keeps only the grouping columns and the newly computed summary columns. Try grouping by both columns instead:
group_by(movie_id, movie_title)
This worked, as suggested by @AntoniosK.
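A minimal sketch of the suggested fix, using small made-up tables in place of the real MovieLens data (the values and titles below are assumptions for illustration only). It shows that a column survives summarize() once it is part of the grouping:

```r
library(dplyr)

# Toy stand-ins for the MovieLens tables (made-up values, not the real data)
ratings_data <- data.frame(
  user_id  = c(1, 2, 3, 1),
  movie_id = c(10, 10, 20, 20),
  rating   = c(4, 5, 3, 2)
)
movie_data <- data.frame(
  movie_id    = c(10, 20),
  movie_title = c("Toy Story", "GoldenEye")
)

joined_data <- inner_join(ratings_data, movie_data, by = "movie_id")

# Grouping by movie_title as well keeps it in the summarized result
data1 <- joined_data %>%
  group_by(movie_id, movie_title) %>%
  summarize(count_ratings = n(), .groups = "drop")

names(data1)
```

Because each movie_id maps to exactly one movie_title here, adding the title to the grouping does not change the number of groups.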

Related

Number of columns not matching

I'm trying to use the rbind function to create some data for a matching process, but I'm getting this error:
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
I've checked and adjusted the order so they definitely match, but I'm still getting the error. Any idea why?
This is my code:
# Match on the data from the year before treatment, matching on counttotal and countbrown
matchData <-
rbind(treat_firms_1year_prior[, -c(
grep("year_int_tx", colnames(treat_firms_1year_prior)),
grep("matchingyear", colnames(treat_firms_1year_prior)),
grep("flag", colnames(treat_firms_1year_prior))
)],
control_firms_year_int_tx)
These are the column names:
> colnames(treat_firms_1year_prior)
[1] "investor" "dealyear" "totalUSD" "counttotal" "greenUSD" "countgreen"
[7] "brownUSD" "countbrown" "signatory" "treatment" "firsttreat" "matchingyear"
[13] "country" "region" "yearest" "strategy" "capsources" "historicfunds"
[19] "eligible_treat_firm" "year_int_tx" "flag"
> colnames(control_firms_year_int_tx)
[1] "investor" "dealyear" "totalUSD" "counttotal" "greenUSD" "countgreen"
[7] "brownUSD" "countbrown" "signatory" "treatment" "firsttreat" "matchingyear"
[13] "country" "region" "yearest" "strategy" "capsources" "historicfunds"
[19] "eligible_treat_firm" "year_int_tx" "flag"
rbind requires the two data frames to have the same number of columns.
Yours do not, because three columns ("year_int_tx", "matchingyear", "flag") are dropped from treat_firms_1year_prior but not from control_firms_year_int_tx.
You'll need to either drop them from control_firms_year_int_tx as well, or keep them in treat_firms_1year_prior.
matchData <-
rbind(treat_firms_1year_prior[, -c(
grep("year_int_tx", colnames(treat_firms_1year_prior)),
grep("matchingyear", colnames(treat_firms_1year_prior)),
grep("flag", colnames(treat_firms_1year_prior))
)],
control_firms_year_int_tx[, -c(
grep("year_int_tx", colnames(control_firms_year_int_tx)),
grep("matchingyear", colnames(control_firms_year_int_tx)),
grep("flag", colnames(control_firms_year_int_tx))
)])
Or
excludeColumns <- c("year_int_tx", "matchingyear", "flag")
matchData <-
rbind(
treat_firms_1year_prior[ , !(names(treat_firms_1year_prior) %in% excludeColumns)],
control_firms_year_int_tx[ , !(names(control_firms_year_int_tx) %in% excludeColumns)]
)
Or
matchData <-
rbind(treat_firms_1year_prior, control_firms_year_int_tx)
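Before calling rbind, it can help to check which columns the two inputs actually differ on. The sketch below uses small made-up data frames (treat, control and their values are hypothetical stand-ins for the asker's tables) and binds on the shared columns only:

```r
# Toy data frames standing in for the treated and control firms
treat <- data.frame(investor = c("A", "B"), counttotal = c(3, 5),
                    flag = c(1, 0), stringsAsFactors = FALSE)
control <- data.frame(investor = c("C", "D"), counttotal = c(2, 7),
                      flag = c(0, 0), stringsAsFactors = FALSE)

# Drop a column from one side only, as in the question
treat_trimmed <- treat[, names(treat) != "flag"]

# Columns present in control but missing from the trimmed treat table
missing_cols <- setdiff(names(control), names(treat_trimmed))
missing_cols

# Bind on the columns both data frames share
common_cols <- intersect(names(treat_trimmed), names(control))
match_data <- rbind(treat_trimmed[, common_cols], control[, common_cols])
dim(match_data)
```

If setdiff() returns a non-empty vector in either direction, rbind on the full data frames will fail with the error from the question.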

Levels of a dataframe after filtering

I've been working on an assignment for a self-study course in R programming. I have a question about what happens to factors in a data frame once you filter it. I have a data frame with the columns (movie) Studio and Genre.
For the assignment I need to filter it. I succeeded, but when I check the levels of the filtered columns, all the original levels are still present, not only the ones that remain after filtering.
Why is this? Am I doing something wrong?
StudioTarget <- c("Buena Vista Studios","Fox","Paramount Pictures","Sony","Universal","WB")
GenreTarget <- c("action","adventure","animation","comedy","drama")
dftest <- df[df$Studio %in% StudioTarget & df$Genre %in% GenreTarget,]
> levels(dftest$Studio)
[1] "Art House Studios" "Buena Vista Studios" "Colombia Pictures"
[4] "Dimension Films" "Disney" "DreamWorks"
[7] "Fox" "Fox Searchlight Pictures" "Gramercy Pictures"
[10] "IFC" "Lionsgate" "Lionsgate Films"
[13] "Lionsgate/Summit" "MGM" "MiraMax"
[16] "New Line Cinema" "New Market Films" "Orion"
[19] "Pacific Data/DreamWorks" "Paramount Pictures" "Path_ Distribution"
[22] "Relativity Media" "Revolution Studios" "Screen Gems"
[25] "Sony" "Sony Picture Classics" "StudioCanal"
[28] "Summit Entertainment" "TriStar" "UA Entertainment"
[31] "Universal" "USA" "Vestron Pictures"
[34] "WB" "WB/New Line" "Weinstein Company"
You can use droplevels(dftest$Studio) to remove unused levels.
No, you're not doing anything wrong. A factor defines a fixed number of levels. These levels remain the same even if one or more of them are not present in the data. You've asked for the levels of your factor, not the values present after filtering.
Consider:
library(tidyverse)
mtcars %>%
mutate(cyl= as.factor(cyl)) %>%
filter(cyl == 4) %>%
distinct(cyl) %>%
pull(cyl)
[1] 4
Levels: 4 6 8
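The droplevels() suggestion above can be sketched as follows. The data frame is a made-up stand-in for the asker's df, so the studio and genre values are assumptions:

```r
# Made-up studio data, not the asker's df
df <- data.frame(
  Studio = factor(c("Fox", "Sony", "Disney", "WB")),
  Genre  = factor(c("action", "drama", "comedy", "action"))
)

# Filtering rows leaves the factor's level set untouched
dftest <- df[df$Studio %in% c("Fox", "WB"), ]
levels_before <- levels(dftest$Studio)
levels_before

# droplevels() on a data frame drops unused levels from every factor column
dftest <- droplevels(dftest)
levels(dftest$Studio)
levels(dftest$Genre)
```

Note that droplevels() returns a new object, so the result has to be assigned back for the change to persist.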
Welcome to SO. Next time, please try to provide a minimum working example. This post will help you construct one.

How to create and save subset dataframes for sequence of year-month

I would like to filter the observations for a given year-month out of a data frame, save them as a separate data frame, and name it with the respective year-month.
I would be grateful if someone could suggest more efficient code than the code below. Also, this code is not filtering the observations correctly.
data <- data.frame(year = c(rep(2012,12),rep(2013,12),rep(2014,12),rep(2015,12),rep(2016,12)),
month = rep(1:12,5),
info = seq(60)*100)
years <- 2012:2016
months <- 1:12
for(year in years){
for(month in months){
data_sel <- data %>%
filter(year==year & month==month)
if(month<10){
month_alt <- paste0("0",month) # months 1-9 should show up as 01-09
}
Newname <- paste0(year,month_alt,'_','data_sel')
assign(Newname, data_sel)
}
}
The output I am looking to get is below (separate objects containing data from a given year-month):
> ls()
[1] "201201_data_sel" "201202_data_sel" "201203_data_sel" "201204_data_sel"
[5] "201205_data_sel" "201206_data_sel" "201207_data_sel" "201208_data_sel"
[9] "201209_data_sel" "201301_data_sel" "201302_data_sel" "201303_data_sel"
[13] "201304_data_sel" "201305_data_sel" "201306_data_sel" "201307_data_sel"
[17] "201308_data_sel" "201309_data_sel" "201401_data_sel" "201402_data_sel"
[21] "201403_data_sel" "201404_data_sel" "201405_data_sel" "201406_data_sel"
[25] "201407_data_sel" "201408_data_sel" "201409_data_sel" "201501_data_sel"
[29] "201502_data_sel" "201503_data_sel" "201504_data_sel" "201505_data_sel"
[33] "201506_data_sel" "201507_data_sel" "201508_data_sel" "201509_data_sel"
[37] "201601_data_sel" "201602_data_sel" "201603_data_sel" "201604_data_sel"
[41] "201605_data_sel" "201606_data_sel" "201607_data_sel" "201608_data_sel"
[45] "201609_data_sel" "data" "data_sel" "month"
[49] "month_alt" "months" "Newname" "year"
[53] "years"
You could do:
library(dplyr)
g <- data %>%
mutate(month = sprintf("%02d", month)) %>%
group_by(year, month)
setNames(group_split(g), with(group_keys(g), paste0("data_sel_", year, month))) %>%
list2env(envir = .GlobalEnv)
A name that starts with a digit is not syntactically valid in R (such an object can only be referenced with backticks), so "data_sel_" comes first in paste0.
As mentioned in the comments, it might be better not to pipe to list2env and instead keep the output as a list with named elements.
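As for why the loop in the question misbehaves: inside filter(), year == year compares the column with itself, so every row passes, and month_alt is never updated for months 10 to 12. Keeping the pieces in a named list avoids both problems. A minimal base R sketch of that idea follows; the name by_month is an assumption, not something from the answer:

```r
# Same shape as the example data in the question
data <- data.frame(
  year  = rep(2012:2016, each = 12),
  month = rep(1:12, 5),
  info  = seq(60) * 100
)

# Build one key per row; %02d zero-pads months 1-9 to 01-09
key <- sprintf("data_sel_%d%02d", data$year, data$month)

# split() returns a named list with one data frame per year-month
by_month <- split(data, key)

length(by_month)
by_month[["data_sel_201203"]]
```

Individual months are then available as by_month[["data_sel_201203"]] and so on, and the whole list can be passed to lapply() without cluttering the global environment.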

Information lost by html_table

I'm looking to scrape the third table off of this website and store it as a data frame. Below is a reproducible example.
The third table is the one with "Isiah YOUNG" in the first row, third column.
library(rvest)
library(dplyr)
target_url <-
"https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"
table <- target_url %>%
read_html(options = c("DTDLOAD")) %>%
html_nodes("[id^=splitevents]") # this is the correct node
So far so good. Printing table[[1]] shows the contents I want.
table[[1]]
{html_node}
<table id="splitevents" class="sortable" align="center">
[1] <tr>\n<th class="sorttable_nosort" width="20">Pl</th>\n<th class="sorttable_nosort" width="20">Ln</th>\n<th ...
[2] <td>1</td>\n
[3] <td>6</td>\n
[4] <td></td>\n
[5] <td>Isiah YOUNG</td>\n
[6] <td></td>\n
[7] <td>NIKE</td>\n
[8] <td>20.28 Q</td>\n
[9] <td><b><font color="grey">0.184</font></b></td>
[10] <td>2</td>\n
[11] <td>7</td>\n
[12] <td></td>\n
[13] <td>Elijah HALL-THOMPSON</td>\n
[14] <td></td>\n
[15] <td>Houston</td>\n
[16] <td>20.50 Q</td>\n
[17] <td><b><font color="grey">0.200</font></b></td>
[18] <td>3</td>\n
[19] <td>9</td>\n
[20] <td></td>\n
...
However, passing this to html_table results in an empty data frame.
table[[1]] %>%
html_table(fill = TRUE)
[1] Pl Ln Athlete Affiliation Time
<0 rows> (or 0-length row.names)
How can I get the contents of table[[1]] (which clearly do exist) as a data frame?
The HTML is full of errors that trip up the parser, and I haven't seen an easy way to fix them.
An alternative, in this particular case, is to use the number of headers as the column count and derive the row count by dividing the total number of td cells by the number of columns; then reshape the cell text into a matrix and convert it to a data frame.
library(rvest)
library(dplyr)
target_url <- "https://flashresults.com/2017_Meets/Outdoor/06-22_USATF/004-2-02.htm"
table <- read_html(target_url) %>%
html_node("#splitevents")
tds <- table %>% html_nodes('td') %>% html_text()
ths <- table %>% html_nodes("th") %>% html_text()
num_col <- length(ths)
num_row <- length(tds) / num_col
df <- tds %>%
matrix(nrow = num_row, ncol = num_col, byrow = TRUE) %>%
data.frame() %>%
setNames(ths)
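The reshaping step can be tried without network access. The sketch below uses a simplified, hand-written set of header and cell strings (an assumption; the real page has more columns) and adds a divisibility check so that a mismatch between header count and cell count fails loudly instead of silently misaligning rows:

```r
# Simplified stand-ins for the scraped header and cell text
ths <- c("Pl", "Ln", "Athlete", "Time")
tds <- c("1", "6", "Isiah YOUNG", "20.28 Q",
         "2", "7", "Elijah HALL-THOMPSON", "20.50 Q")

num_col <- length(ths)

# Guard: the cells must fill whole rows, otherwise the layout guess is wrong
stopifnot(length(tds) %% num_col == 0)
num_row <- length(tds) / num_col

# byrow = TRUE because the cells arrive in row-major (reading) order
cells <- matrix(tds, nrow = num_row, ncol = num_col, byrow = TRUE)
df <- setNames(as.data.frame(cells, stringsAsFactors = FALSE), ths)
df
```

All columns come out as character, so numeric columns would still need converting afterwards.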

Error with R dplyr left_join

I've been trying to use left_join to add the columns of a new dataset to my main dataset (called employee).
I've double-checked the vector names and the cleaning that I've done, and nothing seems to work. Here is my code. I would appreciate any help.
job_codes <- read_csv("Quest_UMMS_JobCodes.csv")
job_codes <- job_codes %>%
clean_names() %>%
select(job_code, pos_desc = pos_des_desc)
job_codes$is_nurse <- str_detect(tolower(job_codes$pos_desc), "nurse")
employee <- employee %>%
left_join(job_codes, by = "job_code")
The error I keep getting:
Error in eval(substitute(expr), envir, enclos) :
'job_code' column not found in rhs, cannot join
Here are the results of names(job_codes) and names(employee):
> names(job_codes)
[1] "job_code" "pos_desc" "is_nurse"
> names(employee)
[1] "REC_NUM" "ZIP" "STATE"
[4] "SEX" "EEO_CLASS" "BIRTH_YEAR"
[7] "EMP_STATUS" "PROCESS_LEVEL" "DEPARTMENT"
[10] "JOB_CODE" "UNION_CODE" "SUPERVISOR"
[13] "DATE_HIRED" "R_SHIFT" "SALARY_CLASS"
[16] "EXEMPT_EMP" "PAY_RATE" "ADJ_HIRE_DATE"
[19] "ANNIVERS_DATE" "TERM_DATE" "NBR_FTE"
[22] "PENSION_PLAN" "PAY_GRADE" "SCHEDULE"
[25] "OT_PLAN_CODE" "DECEASED" "POSITION"
[28] "WORK_SCHED" "SUPERVISOR_IND" "FTE_TOTAL"
[31] "PRO_RATE_TOTAL" "PRO_RATE_A_SAL" "NEW_HIRE_DATE"
[34] "COUNTY" "FST_DAY_WORKED" "date_hired"
[37] "date_hired_adj" "term_date" "employment_duration"
[40] "current" "age" "emp_duration_years"
[43] "DESCRIPTION.x" "PAY_STATUS.x" "DESCRIPTION.y"
[46] "PAY_STATUS.y"
Now that the OP has added the column names of both tables to the question, it is evident that the columns to join on are spelled differently (upper vs. lower case): JOB_CODE in employee and job_code in job_codes.
If the column names are different, help("left_join") suggests:
To join by different variables on x and y use a named vector. For example, by = c("a" = "b") will match x.a to y.b.
So, in this case it should read
employee <- employee %>% left_join(job_codes, by = c("JOB_CODE" = "job_code"))
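A small self-contained sketch of the named-vector join, using made-up rows (the job codes and descriptions are assumptions, not the asker's data):

```r
library(dplyr)

# Toy tables with the same key spellings as in the question
employee <- data.frame(
  REC_NUM  = 1:3,
  JOB_CODE = c("N01", "A02", "N01")
)
job_codes <- data.frame(
  job_code = c("N01", "A02"),
  pos_desc = c("Staff Nurse", "Accountant")
)
job_codes$is_nurse <- grepl("nurse", tolower(job_codes$pos_desc))

# Left side of the named vector is the column in x, right side the column in y
joined <- left_join(employee, job_codes, by = c("JOB_CODE" = "job_code"))
names(joined)
```

The joined result keeps the key under the name used in the left table (JOB_CODE). An alternative would be to rename one of the key columns first so that a plain by = "job_code" works.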
