R - IMDb dataset files - how to merge lines per film

One of the files (title.principals) available from the IMDb datasets contains details about cast and crew.
I would like to extract the directors' details and merge them into a single line, as there can be several directors per film.
Is it possible?
# title.principals file download
url <- "https://datasets.imdbws.com/title.principals.tsv.gz"
tmp <- tempfile()
download.file(url, tmp)

# file load
title_principals <- readr::read_tsv(
  file = gzfile(tmp),
  col_names = TRUE,
  quote = "",
  na = "\\N",
  progress = FALSE
)
# name.basics file download
url <- "https://datasets.imdbws.com/name.basics.tsv.gz"
tmp <- tempfile()
download.file(url, tmp)

# file load
name_basics <- readr::read_tsv(
  file = gzfile(tmp),
  col_names = TRUE,
  quote = "",
  na = "\\N",
  progress = FALSE
)
# extract directors data
library(dplyr)
library(stringr)

df_directors <- title_principals %>%
  filter(str_detect(category, "director")) %>%
  select(tconst, ordering, nconst, category) %>%
  group_by(tconst)

# join on the shared nconst key to bring in the director's name
df_directors <- df_directors %>% left_join(name_basics, by = "nconst")
head(df_directors, 20)
I'm joining it with the name.basics file to get the director's name.
name.basics contains the name, birth and death year, and professions.
After this step, I would like to merge all directors per film into a single cell, separated by commas for example.
Is it somehow possible?

Please see this guide on how to make a minimal reproducible example. Setting up a simplified example with fake data that highlights the exact problem will help other people help you faster.
As I understand it, you want to take a file that has multiple rows per value of ID_tconst with different values of Director_Name, and collapse it to a file with one row per value of ID_tconst and a comma-separated list of Director_Names.
Here is a simple mock data set and solution. Note the use of the collapse argument in paste instead of sep: collapse joins all the elements of a vector into a single string.
library(tidyverse)

example <- tribble(
  ~ID_tconst, ~Director_Name,
  1, "Aaron",
  2, "Bob",
  2, "Cathy",
  3, "Doug",
  3, "Edna",
  3, "Felicity"
)

collapsed <- example %>%
  group_by(ID_tconst) %>%
  summarize(directors = paste(Director_Name, collapse = ","))
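Applied to the IMDb data from the question, the same pattern should work directly; a sketch, assuming the left_join above has brought in the primaryName column from name.basics:

# collapse all directors per film into one comma-separated cell
df_directors_collapsed <- df_directors %>%
  group_by(tconst) %>%
  summarize(directors = paste(primaryName, collapse = ", "))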

Related

How do I create / name dataframes in a for loop in R?

So I'm currently trying to scrape precinct results by county from JSON files on Virginia's Secretary of State site. I got code working that gets the data from a URL and creates a dataframe named after the county. To speed up the process, I tried to put the code inside a for loop that iterates through Virginia's counties (which I'm sourcing from a 2020 election-by-county CSV already on my computer that I constructed from this: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ), constructs the URL for the county's JSON file (since the format is consistent), and saves the result to a dataframe. My current code doesn't save the dataframes, though, so only the last county remains.
This is the code:
library(dplyr)
library(tidyverse)
library(jsonlite)

va <- filter(biden_margin, biden_margin$state_po == "VA")

# The spreadsheet uses spaces to separate "X" and "city" but the URL uses an underscore
va$county_name <- gsub(" ", "_", va$county_name)

# The URLs have "county" in the name but the spreadsheet doesn't; the spreadsheet does
# have "city" for the independent cities, like the URLs (and the independent cities are
# the observations with FIPS above 51199)
va$county_name <- if_else(va$county_fips > 51199, va$county_name, paste0(va$county_name, "_COUNTY"))

# I did this as a list but I realize this might be a bad idea
governor_data <- vector(mode = "list", length = nrow(va))

for (i in nrow(va)) {
  precincts <- paste0("https://results.elections.virginia.gov/vaelections/2021%20November%20General/Json/Locality/", va$county_name[i], "/Governor.json")
  name <- paste0(va$county_name[i], "_governor_2021")
  java_source <- stream_in(file(precincts))
  df <- as.data.frame(java_source$Precincts)
  df$county <- java_source$Locality$LocalityName
  df <- unnest(df, cols = c(Candidates))
  df <- subset(df, select = -c(PoliticalParty, BallotOrder))
  df <- pivot_wider(df, names_from = BallotName, values_from = c(Votes, Percentage))
  # tried append before this, got the same result
  governor_data[i] <- assign(name, df)
}
Any thoughts?
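Two things in the loop are worth a look: for (i in nrow(va)) iterates over a single value (the row count itself) rather than over 1 through nrow(va), and single-bracket assignment into a list doesn't store the data frame the way you want. A minimal sketch of the fix, keeping the rest of the loop body as written:

for (i in seq_len(nrow(va))) {
  # ... build df exactly as above ...
  governor_data[[i]] <- df  # double brackets store the whole data frame
  names(governor_data)[i] <- paste0(va$county_name[i], "_governor_2021")
}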

Reading googledrive contents from R

I'm aiming to get a list of all files in a Google Drive folder, as well as the associated metadata for those files. When I use drive_ls, it returns 3 columns {name, id, drive_resource}. drive_resource is structured like this: list(kind = "drive#file", id = "abc", ...). However, some of the list is not qualified by quotations, and commas are also occasionally used when they are not a separator.
Any ideas how I might properly unlist this? I can't find anything in the package that can handle this.
Using the package 'googledrive', I can get a list of all the files:
a <- drive_ls(path = "abc", recursive = TRUE)
The attempt below gets close, but it fails to get the column names and also splits some values at the wrong place when a comma is contained in the string.
a$drive_resource <- vapply(a$drive_resource, paste, collapse = ",", character(1L))
abcd <- a %>% separate(drive_resource, sep = ",", into = as.character(1:30))
You can try the following approach. It's an example with only four elements of the list (the selected names are specified in the function). The function maps each list contained in each row to a tibble, so you can unnest it:
require(googledrive)
require(dplyr)

f <- function(l) {
  l[c("version", "webContentLink", "viewedByMeTime", "mimeType")] %>% as_tibble()
}

dr_content <- drive_ls(path = "<path>", recursive = TRUE)
dr_content <- dr_content %>% mutate(drive_resource = purrr::map(drive_resource, f))
dr_content <- dr_content %>% tidyr::unnest(drive_resource)
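If you want every field of drive_resource rather than a hand-picked subset, tidyr::unnest_wider() may be a simpler route; a sketch, assuming tidyr >= 1.0.0 (nested elements such as capabilities end up as list columns):

require(googledrive)
require(tidyr)

a <- drive_ls(path = "abc", recursive = TRUE)
# Spread each drive_resource list into one column per element, named after the
# list element names, so nothing needs to be split on commas at all
a_wide <- unnest_wider(a, drive_resource)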

How to add column to multiple data frames based on information from another data frame

Apologies if this question is simple/has been answered elsewhere - I have looked, but as a newbie I can't seem to find what I need.
I have a data frame (Length) which contains a unique value that I need to add to different files.
View(Length)
                                                  File_name Transcript_length
1  sample15.fasta.out_alternative.out_contig.copynumber.csv          89229486
2  sample16.fasta.out_alternative.out_contig.copynumber.csv          70908644
3   sample2.fasta.out_alternative.out_contig.copynumber.csv          56017470
4  sample28.fasta.out_alternative.out_contig.copynumber.csv          94888762
5  sample30.fasta.out_alternative.out_contig.copynumber.csv         106260465
6  sample31.fasta.out_alternative.out_contig.copynumber.csv          91189772
I have imported and begun to manipulate these copynumber.csv files, but I need to add a new column containing the value corresponding to each file name.
Attempt 1:
# import copynumber data
import2 <- list.files(pattern = "*copynumber.csv", full.names = TRUE)
list2env(
  lapply(setNames(import2, make.names(gsub("$", "", import2))),
         read.csv, sep = ""),
  envir = .GlobalEnv)

CN_files <- lapply(import2, read.csv, sep = "")
names(CN_files) <- gsub("$", "", import2)

# then manipulate
for (f in 1:length(CN_files)) {
  names(CN_files[[f]]) <- c("Family", "Element", "Length", "Fragments", "Copies", "Solo_LTR", "Total_Bp", "Cover")
}
How do I then add the transcript length values in a new column, matched to the specific copynumber.csv file via the earlier data frame?
Any help greatly appreciated; again, I am new to this, so feel free to give more general advice on how to word an R question, etc.
I have worked out how to do it outside of the loop, like so:
CN_files[[1]] <- CN_files[[1]] %>% mutate(bp = Length$Transcript_length[1])
CN_files[[2]] <- CN_files[[2]] %>% mutate(bp = Length$Transcript_length[2])
CN_files[[3]] <- CN_files[[3]] %>% mutate(bp = Length$Transcript_length[3])
CN_files[[4]] <- CN_files[[4]] %>% mutate(bp = Length$Transcript_length[4])
CN_files[[5]] <- CN_files[[5]] %>% mutate(bp = Length$Transcript_length[5])
CN_files[[6]] <- CN_files[[6]] %>% mutate(bp = Length$Transcript_length[6])
CN_files[[7]] <- CN_files[[7]] %>% mutate(bp = Length$Transcript_length[7])
CN_files[[8]] <- CN_files[[8]] %>% mutate(bp = Length$Transcript_length[8])
CN_files[[9]] <- CN_files[[9]] %>% mutate(bp = Length$Transcript_length[9])
Nevertheless, this seems quite awkward and inefficient, so again, if anyone has any tips on how to approach this better, it would be greatly appreciated!
Note: it was known that the order of the files within the list was the same as in the 'Length' data frame.
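Given that note, the nine mutate lines collapse into a single Map() call; a sketch, pairing files with lengths by position just as above:

# add the transcript length as a bp column, pairing files with lengths by position
CN_files <- Map(function(df, len) dplyr::mutate(df, bp = len),
                CN_files, Length$Transcript_length[seq_along(CN_files)])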

Scraping from more than one aspx page with R

I am a linguistics student doing experiments in R. I have been looking at other questions and got a lot of help, but I am stuck at the moment, as I cannot adapt the example functions to my case and would love some help.
First, I would like to go through every semester from here: http://registration.boun.edu.tr/schedule.htm, and every department here: http://registration.boun.edu.tr/scripts/schdepsel.asp
It is actually fairly easy to generate the list, as the final link looks like this: http://registration.boun.edu.tr/scripts/sch.asp?donem=2017/2018-3&kisaadi=ATA&bolum=ATATURK+INSTITUTE+FOR+MODERN+TURKISH+HISTORY
Secondly, I need to select the code, name, days and hours of each course and tag the semester, which I did. (Probably I did it extremely poorly, but I did it nevertheless, yay!)
library("rvest")
library("dplyr")
library("magrittr")
# define the html
reg <- read_html("http://registration.boun.edu.tr/scripts/sch.asp?donem=2017/2018-3&kisaadi=ATA&bolum=ATATURK+INSTITUTE+FOR+MODERN+TURKISH+HISTORY")
# make the html a list of tables
regtable <- reg %>% html_table(fill = TRUE)
# tag their year
regtable[[4]][ ,15] <- regtable[[1]][1,2]
regtable[[4]][1,15] <- "Semester"
# Change the Days and Hours to sth usable, but how and to what?
# parse the dates, T and Th problem?
# parse the hour 10th hour problem?
# get the necessary info
regtable <- regtable %>% .[4] %>% as.data.frame() %>% select( . , X1 , X3 , X8 , X9 , V15)
# correct the names
names(regtable) <- regtable[1,]
regtable <- regtable[-1,]
View(regtable)
But the problem is that I want to write a function so that I can do this for more than 20 semesters and more than 50 departments. Any help would be great! I am doing this so that I can work on optimizing class hours for my department.
I guess I could do this better with the XML package, but I could not understand how to use it.
Thanks for any help,
Utku
Here is an answer building on what you have already done. There are likely more efficient solutions, but this should be a good start. You also don't state how you would like to store the data, so currently what I have made assigns each combination of semester and department to its own data frame, which creates a huge number of them for this many departments. It is not ideal, but I don't know how you plan to use the data after collection.
library("rvest")
library("dplyr")
library("magrittr")
# Create a Department list
dep_list <- read_html("http://registration.boun.edu.tr/scripts/schdepsel.asp")
# Take the read html and identify all objects of class menu2 and extract the
# href which will give you the final part of the url
dep_list <- dep_list %>%
html_nodes(xpath = '//*[#class="menu2"]') %>%
xml_attr("href")
department_list <- gsub("/scripts/sch.asp?donem=", "", dep_list, fixed = TRUE)
# Create a list for all of the semesters
sem_list <- read_html("http://registration.boun.edu.tr/schedule.htm")
sem_list <- sem_list %>% html_table(fill = TRUE)
# Extract the table from the list needed
semester_df <- sem_list[[2]]
# The website uses a table for the dropdown but the values are all in the second cell
# of the second column as a string
semester_list <- semester_df$X2[2]
# Separate the string into a list at the space characters
semester_list <- unlist(strsplit(semester_list, "\\s+"))
# Loop through the list of departments and within each department loop through the
# list of semesters to get the data you want
for(dep in department_list){
for(sem in semester_list){
url <- paste("http://registration.boun.edu.tr/scripts/sch.asp?donem=", sem, dep, sep = "")
reg <- read_html(url)
# make the html a list of tables
regtable <- reg %>% html_table(fill = TRUE)
# The data we want is in the 4th portion of the created list so extract that
regtable <- regtable[[4]]
# Rename the column headers to the values in the first row and remove the
# first row
regtable <- setNames(regtable[-1, ], regtable[1, ])
# Create semester column and select the variables we want
regtable <- regtable %>%
mutate(Semester = sem) %>%
select(Code.Sec, Name, Days, Hours, Semester)
# Assign the created table to a dataframe
# Could also save the file here instead
assign(paste("table", sem, gsub(" ", "_", dep), sep = "_"), regtable)
}
}
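If one combined data frame is easier to work with than hundreds of per-combination objects, a sketch of an alternative collection step (replacing the assign() call; the surrounding loops stay the same):

all_tables <- list()
for (dep in department_list) {
  for (sem in semester_list) {
    # ... scrape and clean regtable as above ...
    all_tables[[paste(sem, dep, sep = "_")]] <- regtable
  }
}
# one data frame, with a column recording which scrape each row came from
combined <- dplyr::bind_rows(all_tables, .id = "source")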
Thanks to @Amanda I was able to achieve what I wanted. The only thing left is to scrape the shortnames list and match them automatically, but I can do what I want by creating the list manually. Any further comments on doing this more elegantly are appreciated!
library("rvest")
library("dplyr")
library("magrittr")
# Create a Department list
dep_list <- read_html("http://registration.boun.edu.tr/scripts/schdepsel.asp")
dep_list <- dep_list %>% html_table(fill = TRUE)
# Select the table from the html that contains the data we want
department_df <- dep_list[[2]]
# Rename the columns with the value of the first row and remove row
department_df <- setNames(department_df[-1, ], department_df[1, ])
# Combine the two columns into a list
department_list <- c(department_df[, 1], department_df[, 2])
# Edit the department list
# We can choose accordingly.
department_list <- department_list[c(7,8,16,20,26,33,36,37,38,39)]
# Create a list for all of the semesters
sem_list <- read_html("http://registration.boun.edu.tr/schedule.htm")
sem_list <- sem_list %>% html_table(fill = TRUE)
# Extract the table from the list needed
semester_df <- sem_list[[2]]
# The website uses a table for the dropdown but the values are all in the second cell
# of the second column as a string
semester_list <- semester_df$X2[2]
# Separate the string into a list at the space characters
semester_list <- unlist(strsplit(semester_list, "\\s+"))
# Shortnames string
# We can add whichever we want.
shortname_list <- c("FLED", "HIST" , "PSY", "LL" , "PA" , "PHIL" , "YADYOK" , "SOC" , "TR" , "TKL" )
# Length
L = length(department_list)
# the function to get the schedule for the selected departments
for( i in 1:L){
for(sem in semester_list){tryCatch({
dep <- department_list[i]
sn <- shortname_list[i]
url_second_part <- interaction("&kisaadi=" , sn, "&bolum=", gsub(" ", "+", (gsub("&" , "%26", dep))), sep = "", lex.order = TRUE)
url <- paste("http://registration.boun.edu.tr/scripts/sch.asp?donem=", sem, url_second_part, sep = "")
reg <- read_html(url)
# make the html a list of tables
regtable <- reg %>% html_table(fill = TRUE)
# The data we want is in the 4th portion of the created list so extract that
regtable <- regtable[[4]]
# Rename the column headers to the values in the first row and remove the
# first row
regtable <- setNames(regtable[-1, ], regtable[1, ])
# Create semester column and select the variables we want
regtable <- regtable %>%
mutate(Semester = sem) %>%
select(Code.Sec, Name, Days, Hours, Semester)
# Assign the created table to a dataframe
# Could also save the file here instead
assign(paste("table", sem, gsub(" ", "_", dep), sep = "_"), regtable)
}, error = function(e){cat("ERROR : No information on this" , url , "\n" )})
}
}
### Maybe make Errors another dataset or list too.
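On that closing comment, a minimal sketch for collecting the failures instead of only printing them (the <<- writes to a vector defined outside the handler):

failed_urls <- character(0)  # before the loops
# ... then in tryCatch:
# error = function(e) { failed_urls <<- c(failed_urls, url) }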

Iterating through values in R

I'm new-ish to R and am having some trouble iterating through values.
For context: I have data on 60 people over time, and each person has their own dataset in a folder (I received the data with id #s 00-59). For each person, there are 2 values I need: time of response and picture response given (a number 1-16). I need to convert this data from wide to long format for each person, and then eventually append all of the datasets together.
My problem is that I'm having trouble writing a loop that will do this for each person (i.e. each dataset). Here's the code I have so far:
pam[x] <- fromJSON(file = "PAM_u[x].json")
pam[x]df <- as.data.frame(pam[x])

# Creating long dataframe for times
pam[x]_long_times <- gather(
  select(pam[x]df, starts_with("resp")),
  key = "time",
  value = "resp_times"
)

# Creating long dataframe for pic_nums (affect response)
pam[x]_long_pics <- gather(
  select(pam[x]df, starts_with("pic")),
  key = "picture",
  value = "pic_num"
)

# Combining the two long dataframes so that I have one df per person
pam[x]_long_fin <- bind_cols(pam[x]_long_times, pam[x]_long_pics) %>%
  select(resp_times, pic_num) %>%
  add_column(id = [x], .before = 1)
If you replace [x] in the above code with a person's id# (e.g. 00), the code will run and will give me the dataframe I want for that person. Any advice on how to do this so I can get all 60 people done?
Thanks!
EDIT
So, using library(jsonlite) rather than library(rjson) set up the files in the format I needed without having to do all of the manipulation. Thanks all for the responses, but the solution was apparently much easier than I'd thought.
I don't know the structure of your json files. If you are not in the same folder as the json files, try this:
library(jsonlite)

# setup - read files
json_folder <- "U:/test/" # adjust your folder here
files <- list.files(path = json_folder, pattern = "\\.json$", full.names = TRUE)

# import data
pam <- NULL
pam_df <- NULL
for (i in seq_along(files)) {
  pam[[i]] <- fromJSON(files[i])  # jsonlite::fromJSON takes the path as its first argument
  pam_df[[i]] <- as.data.frame(pam[[i]])
}
This first reads all the json file names in the folder, giving a vector of length 60.
Then you sequence along that vector and read each file.
I assume at the end you can do bind_rows, or add your code inside the for loop. But remember to set the data frames to NULL before the loop starts, e.g. pam_long_pics <- NULL.
Hope that helped? Let me know.
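For that final bind_rows step, a sketch assuming every element of pam_df ends up with the same columns:

library(dplyr)
# stack the 60 per-person data frames; .id records which list element each row came from
pam_all <- bind_rows(pam_df, .id = "person")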
Something along these lines could work:
# library("tidyverse")
# library("jsonlite")

file_list <- list.files(pattern = "*.json", full.names = TRUE)

Data_raw <- tibble(File_name = file_list) %>%
  mutate(File_contents = map(File_name, fromJSON)) %>% # This should result in a nested tibble
  mutate(File_contents = map(File_contents, as_tibble))

Data_raw %>%
  mutate(Long_times = map(File_contents, ~ gather(.x, key = "time", value = "resp_times", starts_with("resp"))),
         Long_pics = map(File_contents, ~ gather(.x, key = "picture", value = "pic_num", starts_with("pic")))) %>%
  unnest(Long_times, Long_pics) %>%
  select(File_name, resp_times, pic_num)
EDIT: you may or may not need to include as_tibble() after reading in the JSON files, depending on what your data looks like.
