Levels of a dataframe after filtering - r

I've been doing an assignment for a self-study course in R programming, and I have a question about what happens to factors in a dataframe once you filter it. I have a dataframe with the columns (movie) Studio and Genre.
For the assignment I need to filter it. I succeeded, but when I check the levels of the newly filtered columns, all of the original factor levels are still present, not only the ones that survive the filter.
Why is this? Am I doing something wrong?
StudioTarget <- c("Buena Vista Studios","Fox","Paramount Pictures","Sony","Universal","WB")
GenreTarget <- c("action","adventure","animation","comedy","drama")
dftest <- df[df$Studio %in% StudioTarget & df$Genre %in% GenreTarget,]
> levels(dftest$Studio)
[1] "Art House Studios" "Buena Vista Studios" "Colombia Pictures"
[4] "Dimension Films" "Disney" "DreamWorks"
[7] "Fox" "Fox Searchlight Pictures" "Gramercy Pictures"
[10] "IFC" "Lionsgate" "Lionsgate Films"
[13] "Lionsgate/Summit" "MGM" "MiraMax"
[16] "New Line Cinema" "New Market Films" "Orion"
[19] "Pacific Data/DreamWorks" "Paramount Pictures" "Path_ Distribution"
[22] "Relativity Media" "Revolution Studios" "Screen Gems"
[25] "Sony" "Sony Picture Classics" "StudioCanal"
[28] "Summit Entertainment" "TriStar" "UA Entertainment"
[31] "Universal" "USA" "Vestron Pictures"
[34] "WB" "WB/New Line" "Weinstein Company"

You can use droplevels(dftest$Studio) to remove unused levels (remember to assign the result back, e.g. dftest$Studio <- droplevels(dftest$Studio)).

No, you're not doing anything wrong. A factor defines a fixed number of levels. These levels remain the same even if one or more of them are not present in the data. You've asked for the levels of your factor, not the values present after filtering.
Consider:
library(tidyverse)

mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  filter(cyl == 4) %>%
  distinct(cyl) %>%
  pull(cyl)
#> [1] 4
#> Levels: 4 6 8
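If the goal is for the filtered data to carry only the levels that remain, a minimal sketch using base R's droplevels() (assuming the dftest object from the question):
# drop unused levels from one factor column (assign the result back)
dftest$Studio <- droplevels(dftest$Studio)
# or drop unused levels from every factor column at once
dftest <- droplevels(dftest)
levels(dftest$Studio)
# only the studios that actually appear after filtering are listed now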
Welcome to SO. Next time, please try to provide a minimal working example. This post will help you construct one.

Related

R: scrape nested html table with links (table within cell)

For university research I am trying to scrape an FDA table (robots.txt allows scraping this content).
The table contains 19 rows and 2 columns:
https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181
The format I try to extract is:
col1 col2 url_of_col2
<chr> <chr> <chr>
1 Device Classificati~ distal transcutaneous electrical stimulator for treatm~ https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpcd/classification.cfm?s~
What I achieved:
I can easily extract the items of the first column:
# libraries
library(tidyverse)
library(xml2)
library(rvest)

# load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")

# select table of interest
html %>%
  html_nodes("table") -> tables
tables[[9]] -> table

# extract col 1 items
table %>%
  html_nodes("th") %>%
  html_text() %>%
  gsub("\n|\t|\r", "", .) %>%
  trimws()
#> [1] "Device Classification Name" "510(k) Number"
#> [3] "Device Name" "Applicant"
#> [5] "Applicant Contact" "Correspondent"
#> [7] "Correspondent Contact" "Regulation Number"
#> [9] "Classification Product Code" "Date Received"
#> [11] "Decision Date" "Decision"
#> [13] "Regulation Medical Specialty" "510k Review Panel"
#> [15] "summary" "Type"
#> [17] "Clinical Trials" "Reviewed by Third Party"
#> [19] "Combination Product"
Created on 2021-02-27 by the reprex package (v1.0.0)
Where I get stuck
Since some cells of column 2 contain a table, this approach does not give the same number of items:
# extract col 2 items
table %>%
  html_nodes("td") %>%
  html_text() %>%
  gsub("\n|\t|\r", "", .) %>%
  trimws()
#> [1] "distal transcutaneous electrical stimulator for treatment of acute migraine"
#> [2] "K203181"
#> [3] "Nerivio, FGD000075-4.7"
#> [4] "Theranica Bioelectronics ltd4 Ha-Omanutst. Poleg Industrial Parknetanya, IL4250574"
#> [5] "Theranica Bioelectronics ltd"
#> [6] "4 Ha-Omanutst. Poleg Industrial Park"
#> [7] "netanya, IL4250574"
#> [8] "alon ironi"
#> [9] "Hogan Lovells US LLP1735 Market StreetSuite 2300philadelphia, PA 19103"
#> [10] "Hogan Lovells US LLP"
#> [11] "1735 Market Street"
#> [12] "Suite 2300"
#> [13] "philadelphia, PA 19103"
#> [14] "janice m. hogan"
#> [15] "882.5899"
#> [16] "QGT  "
#> [17] "QGT  "
#> [18] "10/26/2020"
#> [19] "01/22/2021"
#> [20] "substantially equivalent (SESE)"
#> [21] "Neurology"
#> [22] "Neurology"
#> [23] "summary"
#> [24] "Traditional"
#> [25] "NCT04089761"
#> [26] "No"
#> [27] "No"
Created on 2021-02-27 by the reprex package (v1.0.0)
Moreover, I could not find a way to extract the URLs of column 2.
I found a good manual for reading html tables with cells spanning multiple rows. However, I think this approach does not work for nested tables.
There is a similar question regarding a nested table without links (How to scrape older html with nested tables in R?) which has not been answered yet. A comment suggested this question; unfortunately, I could not apply it to my html table.
There is the unpivotr package that aims to read nested html tables; however, I could not solve my problem with that package.
Yes, the tables within the rows of the parent table do make it more difficult. The key for this one is to find the 27 rows of the parent table and then parse each row individually.
library(rvest)
library(stringr)
library(dplyr)

# load html
html <- xml2::read_html("https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpmn/pmn.cfm?ID=K203181")

# select table of interest
tables <- html %>% html_nodes("table")
table <- tables[[9]]

# find all of the table's rows
trows <- table %>% html_nodes("tr")

# find the left column
leftside <- trows %>% html_node("th") %>% html_text() %>% trimws()

# find the right column (remove whitespace at the ends and in the middle)
rightside <- trows %>% html_node("td") %>% html_text() %>% str_squish() %>% trimws()

# get links
links <- trows %>% html_node("td a") %>% html_attr("href")

answer <- data.frame(leftside, rightside, links)
One will need to use something like paste0("https://www.accessdata.fda.gov", answer$links) on some of the links to obtain the full web address (paste0 avoids the space that paste inserts between its arguments by default).
The final dataframe does have several cells containing NA; these can be removed, and the table can be cleaned up some more depending on the final requirements. See tidyr::fill() as a good starting point.
Update
To reduce the answer down to the desired 19 original rows:
library(tidyr)

# replace NA with blanks
answer$links <- replace_na(answer$links, "")

# fill in the blanks in the first column to allow for grouping
answer <- fill(answer, leftside, .direction = "down")

# create the final results
finalanswer <- answer %>%
  group_by(leftside) %>%
  summarize(info = paste(rightside, collapse = " "), link = first(links))
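To get closer to the col1 / col2 / url_of_col2 format asked for in the question, one possible hedged finishing step (the new column names are taken from the question; the base URL is only prepended to site-relative links, which is an assumption about how the hrefs look):
library(dplyr)
library(stringr)

finalanswer <- finalanswer %>%
  rename(col1 = leftside, col2 = info, url_of_col2 = link) %>%
  mutate(url_of_col2 = if_else(
    str_starts(url_of_col2, "/"),  # only prefix site-relative links
    paste0("https://www.accessdata.fda.gov", url_of_col2),
    url_of_col2
  ))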

for loop: Select users that have used specific words more than x times in R

I have a dataframe (df) with user_names and text for each user. I have another data_frame with important words. I want to create a for loop that iterates over each user and counts how often the important words appear in their text.
Data:
important_words = c("marcus", "yesterday", "democrat", "republican", "trump", "hillary")
df$user_names
[1] "marc12"
[2] "jon"
[3] "67han"
[4] "XXmark"
[5] "mark"
[6] "mark"
df$text
[1] "hi my name is marcus and i am a republican"
[2] "i support hillary"
[3] "go trump!"
[4] "tomorrow i will vote democrat"
[5] "i don't think so"
[6] "yesterday was ok"
We can extract all the important_words for each user_name and count the number of unique important words each user has.
library(dplyr)
library(stringr)
df %>%
  group_by(user_names) %>%
  summarise(unique_imp_word = n_distinct(unlist(str_extract_all(
    tolower(text),
    str_c('\\b', tolower(important_words), '\\b', collapse = "|")
  ))))
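To get at the "more than x times" part of the title, a possible hedged extension counts the total matches per user and then filters on a threshold (the value of x and the column name n_imp_words are just illustrative):
library(dplyr)
library(stringr)

x <- 2  # illustrative threshold

pattern <- str_c('\\b', tolower(important_words), '\\b', collapse = "|")

df %>%
  group_by(user_names) %>%
  summarise(n_imp_words = sum(str_count(tolower(text), pattern))) %>%
  filter(n_imp_words > x)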

Joined columns not visible in final result using dplyr

I am a newbie in R. I have the following code for doing some aggregations on the MovieLens dataset using dplyr:
joined_data <- inner_join(ratings_data,movie_data,by="movie_id",copy=TRUE)
data <- joined_data %>% group_by(movie_id) %>% arrange(movie_id)
data1 <- data %>% select(movie_id,movie_title,rating) %>% summarize(count_ratings=n())
The data tbl has all the columns I want (movie_id, movie_title, rating, ...). I'm trying to select only 3 columns and summarize them, but the data1 tbl does not have movie_title, which came from the second table (movie_data). Any reason why this is happening? How do I get the columns I want in data1?
names(data)
[1] "user_id" "movie_id" "rating" "timestamp" "movie_title"
[6] "release_date" "video_release.date" "IMDb_URL" "unknown" "Action"
[11] "Adventure" "Animation" "Childrens" "Comedy" "Crime"
[16] "Documentary" "Drama" "Fantasy" "Film_Noir" "Horror"
[21] "Musical" "Mystery" "Romance" "Sci_Fi" "Thriller"
[26] "War" "Western"
But when I do this:
data1 <- data %>% select(movie_id,movie_title,user_id,rating) %>% summarize(count_users=n(),count_ratings=n())
names(data1)
[1] "movie_id" "count_users" "count_ratings"
group_by(movie_id) in your second line is responsible for that. Can you use
group_by(movie_id, movie_title)
and check again? This worked, as suggested by @AntoniosK.
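A minimal sketch of the corrected pipeline, assuming the joined_data object from the question (summarize drops every column that is neither a grouping variable nor a summary, so movie_title has to be part of the grouping):
data1 <- joined_data %>%
  group_by(movie_id, movie_title) %>%
  summarize(count_ratings = n()) %>%
  arrange(movie_id)

names(data1)
# movie_id, movie_title and count_ratings are all retained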

str_split on first and second occurrence of delimiter at different locations in character vector

I have a character list that has weather variables followed by "mean_#" where # is a number between 5 and 10. I want to subset the list to only have the weather variable names themselves. The mean weather variables look like this:
> mean_vars
[1] "dew_mean_10" "dew_mean_5" "dew_mean_6" "dew_mean_7"
[5] "dew_mean_8" "dew_mean_9" "humid_mean_10" "humid_mean_5"
[9] "humid_mean_6" "humid_mean_7" "humid_mean_8" "humid_mean_9"
[13] "rain_mean_10" "rain_mean_5" "rain_mean_6" "rain_mean_7"
[17] "rain_mean_8" "rain_mean_9" "soil_moist_mean_10" "soil_moist_mean_5"
[21] "soil_moist_mean_6" "soil_moist_mean_7" "soil_moist_mean_8" "soil_moist_mean_9"
[25] "soil_temp_mean_10" "soil_temp_mean_5" "soil_temp_mean_6" "soil_temp_mean_7"
[29] "soil_temp_mean_8" "soil_temp_mean_9" "solar_mean_10" "solar_mean_5"
[33] "solar_mean_6" "solar_mean_7" "solar_mean_8" "solar_mean_9"
[37] "temp_mean_10" "temp_mean_5" "temp_mean_6" "temp_mean_7"
[41] "temp_mean_8" "temp_mean_9" "wind_dir_mean_10" "wind_dir_mean_5"
[45] "wind_dir_mean_6" "wind_dir_mean_7" "wind_dir_mean_8" "wind_dir_mean_9"
[49] "wind_gust_mean_10" "wind_gust_mean_5" "wind_gust_mean_6" "wind_gust_mean_7"
[53] "wind_gust_mean_8" "wind_gust_mean_9" "wind_spd_mean_10" "wind_spd_mean_5"
[57] "wind_spd_mean_6" "wind_spd_mean_7" "wind_spd_mean_8" "wind_spd_mean_9"
And this is all I want at the end:
> var_names
"dew" "humid" "rain" "solar" "temp" "soil_moist" "soil_temp" "wind_dir" "wind_gust" "wind_spd"
Now, I figured out how to do it, but I feel my method is more convoluted than it needs to be due to a lack of ability with regular expressions. I will also have to repeat my process 20 times, substituting "mean" with other words.
var_names <- unique(str_split_fixed(mean_vars, "_", n = 3)[c(1:18,31:42),1])
var_names <- unlist(c(var_names, unique(unite(as_tibble(str_split_fixed(mean_vars, "_", n = 3)[c(19:30,43:60), 1:2])))))
I've been trying to stay within the realm of the tidyverse packages as much as possible so I was using stringr::str_split_fixed.
If you have a solution using this same function that would be ideal as I could continue the same programming style, but I'm open to all suggestions.
Thanks.
Use sub and unique. This is shorter and has no package dependencies (or use unique(str_replace(mean_vars, "_mean.*", "")) with stringr):
unique(sub("_mean.*", "", mean_vars))
giving:
[1] "dew" "humid" "rain" "soil_moist" "soil_temp"
[6] "solar" "temp" "wind_dir" "wind_gust" "wind_spd"
If for some reason you really want to use str_split then:
rmMean <- function(x) paste(head(x, -2), collapse = "_")
unique(sapply(str_split(mean_vars, "_"), rmMean))
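Since the same cleanup will be needed with words other than "mean", a small hedged generalisation (the helper name strip_stat is purely illustrative):
# remove a "_<word>_<number>" suffix for any summary word ("mean", "max", ...)
strip_stat <- function(x, word) {
  unique(sub(paste0("_", word, "_\\d+$"), "", x))
}

strip_stat(mean_vars, "mean")
# gives the same ten variable names as above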
Note
mean_vars <- c("dew_mean_10", "dew_mean_5", "dew_mean_6", "dew_mean_7", "dew_mean_8",
"dew_mean_9", "humid_mean_10", "humid_mean_5", "humid_mean_6",
"humid_mean_7", "humid_mean_8", "humid_mean_9", "rain_mean_10",
"rain_mean_5", "rain_mean_6", "rain_mean_7", "rain_mean_8", "rain_mean_9",
"soil_moist_mean_10", "soil_moist_mean_5", "soil_moist_mean_6",
"soil_moist_mean_7", "soil_moist_mean_8", "soil_moist_mean_9",
"soil_temp_mean_10", "soil_temp_mean_5", "soil_temp_mean_6",
"soil_temp_mean_7", "soil_temp_mean_8", "soil_temp_mean_9", "solar_mean_10",
"solar_mean_5", "solar_mean_6", "solar_mean_7", "solar_mean_8",
"solar_mean_9", "temp_mean_10", "temp_mean_5", "temp_mean_6",
"temp_mean_7", "temp_mean_8", "temp_mean_9", "wind_dir_mean_10",
"wind_dir_mean_5", "wind_dir_mean_6", "wind_dir_mean_7", "wind_dir_mean_8",
"wind_dir_mean_9", "wind_gust_mean_10", "wind_gust_mean_5", "wind_gust_mean_6",
"wind_gust_mean_7", "wind_gust_mean_8", "wind_gust_mean_9", "wind_spd_mean_10",
"wind_spd_mean_5", "wind_spd_mean_6", "wind_spd_mean_7", "wind_spd_mean_8",
"wind_spd_mean_9")

Error with R dplyr left_join

So I've been trying to use left_join to get the columns of a new dataset onto my main dataset (called employee).
I've double-checked the vector names and the cleaning that I've done, and nothing seems to work. Here is my code. Would appreciate any help.
job_codes <- read_csv("Quest_UMMS_JobCodes.csv")

job_codes <- job_codes %>%
  clean_names() %>%
  select(job_code, pos_desc = pos_des_desc)

job_codes$is_nurse <- str_detect(tolower(job_codes$pos_desc), "nurse")

employee <- employee %>%
  left_join(job_codes, by = "job_code")
The error I keep getting:
Error in eval(substitute(expr), envir, enclos) :
  'job_code' column not found in rhs, cannot join
Here are the results of names(job_codes):
> names(job_codes)
[1] "job_code" "pos_desc" "is_nurse"
And names(employee):
> names(employee)
[1] "REC_NUM" "ZIP" "STATE"
[4] "SEX" "EEO_CLASS" "BIRTH_YEAR"
[7] "EMP_STATUS" "PROCESS_LEVEL" "DEPARTMENT"
[10] "JOB_CODE" "UNION_CODE" "SUPERVISOR"
[13] "DATE_HIRED" "R_SHIFT" "SALARY_CLASS"
[16] "EXEMPT_EMP" "PAY_RATE" "ADJ_HIRE_DATE"
[19] "ANNIVERS_DATE" "TERM_DATE" "NBR_FTE"
[22] "PENSION_PLAN" "PAY_GRADE" "SCHEDULE"
[25] "OT_PLAN_CODE" "DECEASED" "POSITION"
[28] "WORK_SCHED" "SUPERVISOR_IND" "FTE_TOTAL"
[31] "PRO_RATE_TOTAL" "PRO_RATE_A_SAL" "NEW_HIRE_DATE"
[34] "COUNTY" "FST_DAY_WORKED" "date_hired"
[37] "date_hired_adj" "term_date" "employment_duration"
[40] "current" "age" "emp_duration_years"
[43] "DESCRIPTION.x" "PAY_STATUS.x" "DESCRIPTION.y"
[46] "PAY_STATUS.y"
Now that the OP has added the column names of both tables to the question, it is evident that the columns to join on are written differently (upper vs. lower case).
If the column names are different, help("left_join") suggests:
To join by different variables on x and y use a named vector. For example, by = c("a" = "b") will match x.a to y.b.
So, in this case it should read
employee <- employee %>% left_join(job_codes, by = c("JOB_CODE" = "job_code"))
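An alternative sketch, assuming the janitor package the OP already uses for job_codes: clean_names() lower-cases the employee columns as well, after which the default by = "job_code" match works:
library(janitor)
library(dplyr)

employee <- employee %>%
  clean_names() %>%                      # JOB_CODE becomes job_code, etc.
  left_join(job_codes, by = "job_code")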
