Assigning observation name to a value when retrieving a variable - r

I want to create a data frame that contains > 100 observations on ~20 variables, based on a list of html files saved to my local folder. I would like to make sure that the correct value per variable is matched to each observation. Assuming that R goes through the files in the same order when constructing each variable AND does not skip values in case of errors or the like, this should happen automatically.
But is there a "safe way" to do this, meaning assigning observation names to each variable value when retrieving the info?
Take my sample code for extracting a variable to make this clearer:
library(rvest)

# Specifying the url for the desired website to be scraped
url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

# Reading the HTML code from the website
webpage <- read_html(url)

title_data_html <- html_text(html_nodes(webpage, '.lister-item-header a'))
rank_data_html <- html_text(html_nodes(webpage, '.text-primary'))
description_data_html <- html_text(html_nodes(webpage, '.ratings-bar+ .text-muted'))

df <- data.frame(title_data_html, rank_data_html, description_data_html)
This comes up with a list of rank and description data, but no reference to the observation name for each rank or description (before binding them in the df). Now, in my actual code one variable suddenly comes up with one value too many, so 201 descriptions but only 200 movies. Without a reference to which movie each description belongs to, it is very tough to see why that happens.
A colleague suggested extracting all variables for one observation at a time and extending the data frame row-wise (one observation at a time), instead of column-wise (one variable at a time), but spotting errors and cleaning up per variable seems far more time consuming that way.
Does anyone have a suggestion of what is the "best practice" in such a case?
Thank you!

I know it's not a satisfying answer, but there is no single strategy for solving this type of problem. This is the nature of web scraping: there is no guarantee that the html is structured the way you'd expect it to be.
You haven't shown us a reproducible example (something we can run on our own machines that reproduces the problem), so we can't help you troubleshoot why you ended up extracting 201 nodes from one call to html_nodes when you expected 200. Best practice here is the boring old advice to LOOK at the website you're scraping, LOOK at your data, and find the extra or duplicate description (or the missing movie). Perhaps there's an odd element with an attribute that also matches your selector text. Look at the website both as it appears in a browser and as source code. Right-clicking, CTRL + U (PC), or CMD + OPT + U (Mac) are some ways to pull up the source. Then use the search function to see what matches the selector text.
If the html document you're working with is like the example you used, you won't be able to use the exact strategy you're asking about (extracting the name of the movie together with the description): you're already extracting the names, but the names are not in the same elements as the descriptions.
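That said, a common way to keep fields aligned per observation is to select one container node per movie first, then extract each field within that node: html_node() (singular) returns a missing value for an absent child, so every movie keeps its slot and the columns stay the same length. A minimal sketch, assuming each movie sits in a .lister-item-content container (verify the class in the page source; in rvest >= 1.0 these functions are called html_elements()/html_element()):

library(rvest)

url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'
webpage <- read_html(url)

# One node per movie; each field below is extracted *within* its movie node,
# so a missing description becomes NA instead of shifting the whole column.
movies <- html_nodes(webpage, '.lister-item-content')

df <- data.frame(
  title       = html_text(html_node(movies, '.lister-item-header a')),
  rank        = html_text(html_node(movies, '.text-primary')),
  description = html_text(html_node(movies, '.ratings-bar+ .text-muted')),
  stringsAsFactors = FALSE
)

With this structure, a stray 201st description shows up as a duplicate or an NA on a specific row, which makes the offending movie easy to spot.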

Related

Name matching and correcting spelling error in r

I have a huge data table with millions of rows, consisting of merchandise codes with their descriptions. I want to assign a category to each group (based on the combination of code and description). The problem is that the description is spelled in different ways, and I want to convert all the similar names into a single one. Here is an illustrative example:
library(data.table)
dt <- data.table(code = c(rep(1,2), rep(2,2), rep(3,2)),
                 name = c('McDonalds','Mc Dnald','Macys','macy','Comcast','Com-cats'))
dt[, cat := NA_character_]
setkeyv(dt, c('code','name'))
dt[.(1,'McDonalds'), cat := 'Restaurant']
dt[.(1,'Mc Dnald'), cat := 'Restaurant']
dt[.(2,'Macys'), cat := 'Department Store']
Of course, in the real case it is impossible to go through all the spellings that refer to the same word and fix them manually.
Is there a way to detect all the similar words and convert them to a single (correct) spelling?
Thanks in advance
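One common approach is approximate string matching. A minimal sketch using base R's adist() (Levenshtein distance), continuing from the dt above: within each code group, map every spelling onto the group's most frequent one. The canonicalize() helper and the max_dist threshold are illustrative assumptions; the stringdist package offers more distance metrics if you need them.

# Map each spelling in a group onto the group's most frequent one,
# provided it is within max_dist edits of it.
canonicalize <- function(x, max_dist = 3) {
  canon <- names(sort(table(x), decreasing = TRUE))[1]
  d <- as.vector(adist(x, canon, ignore.case = TRUE))
  ifelse(d <= max_dist, canon, x)
}

dt[, name_clean := canonicalize(name), by = code]
dt[, .(code, name, name_clean)]

Tune max_dist carefully: too low leaves variants unmerged, too high merges genuinely different merchants.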

Football Data in R - Using if statement to check for missing attribute within element?

Straight to the situation:
I am working with football tracking data: I have a list of events, and each element records a different situation in the game. I am interested in the passes that were made. I have a for loop that iterates through the list and gets me each pass for a specific team.
for (p in 1:length(uruguay.passes)) {
  uruguay.pass.temp <- all.events[[uruguay.passes[p]]]
  possession <- uruguay.pass.temp$possession
  passer <- uruguay.pass.temp$player$id
  .
  .
  body.part <- uruguay.pass.temp$pass$body_part$name
Here comes the problem:
Sometimes the data for the passer or another attribute is missing, and that messes up the whole row.
Here comes the question:
How can I write an if statement that checks whether a specific attribute (e.g. passer) is present within the element and, if it is not present, populates the column with NA?
Thank you everyone!
After some tinkering I got there. Not sure this is the most performance-optimised version, but it gets the job done.
The code is:
body.part <- if ("body_part" %in% names(uruguay.pass.temp$pass)) {
  uruguay.pass.temp$pass$body_part$name
} else {
  NA
}
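If many attributes need the same guard, a small helper that walks the nested list and falls back to NA keeps the loop readable. A sketch under the same data layout (safe_pluck is a hypothetical name; purrr::pluck() with .default = NA does the same thing if you already use the tidyverse):

# Walk a nested list by key; return NA as soon as any level is missing.
safe_pluck <- function(x, ...) {
  for (key in c(...)) {
    if (!is.list(x) || is.null(x[[key]])) return(NA)
    x <- x[[key]]
  }
  x
}

passer    <- safe_pluck(uruguay.pass.temp, "player", "id")
body.part <- safe_pluck(uruguay.pass.temp, "pass", "body_part", "name")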

Is there a way to extract a substring from a cell in OpenOffice Calc?

I have tens of thousands of rows of unstructured data in csv format. I need to extract certain product attributes from a long string of text. Given a set of acceptable attributes, if there is a match, I need it to fill in the cell with the match.
Example data:
"[ROOT];Earrings;Brands;Brands>JeweleryExchange;Earrings>Gender;Earrings>Gemstone;Earrings>Metal;Earrings>Occasion;Earrings>Style;Earrings>Gender>Women's;Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Earrings>Occasion>Just to say: I Love You;Earrings>Style>Drop/Dangle;Earrings>Style>Fashion;Not Visible;Gifts;Gifts>Price>$500 - $1000;Gifts>Shop>Earrings;Gifts>Occasion;Gifts>Occasion>Christmas;Gifts>Occasion>Just to say: I Love You;Gifts>For>Her"
Look up table of values:
Zircon, Diamond, Pearl, Ruby
Output:
Zircon
I tried using the VLOOKUP() function, but it needs to match an entire cell and works better for translating acronyms. I haven't found a built-in function that accomplishes what I need. The data is totally unstructured and changes from row to row, with no consistency even within variations of the same product. Does anyone have an idea how to do this, or how to write an OpenOffice Calc function to accomplish it? I'm also open to other, better methods if anyone has experience or ideas on how to approach this...
OK, so I figured out how to do this on my own. I created many different columns, each with a keyword I was looking to extract as its header.
(Screenshot: spreadsheet solution for structured data extraction.)
Then I used this formula to extract the keywords into the correct row beneath the column header:
=IF(ISERROR(SEARCH(CF$1,$D769)),"",CF$1)
SEARCH() returns a number for the position of a search string and otherwise produces an error. I use ISERROR() to detect that error condition, and the IF() in such a way that on an error it leaves the cell blank, else it takes the value of the header. I had over 100 columns of specific information to extract, feeding one final column where I join all the previous cells in the row together for the final list. Worked like a charm. I recommend this approach to anyone who has a similar task.
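For what it's worth, the same lookup is a few lines in R, which may scale better than tens of thousands of SEARCH() formulas. A hedged sketch, where the file name products.csv and the column name categories are hypothetical stand-ins for your CSV:

# Return the first keyword found in a text string, or NA if none match.
extract_attribute <- function(text, keywords) {
  hits <- keywords[vapply(keywords, grepl, logical(1), x = text, fixed = TRUE)]
  if (length(hits) > 0) hits[1] else NA_character_
}

gemstones <- c("Zircon", "Diamond", "Pearl", "Ruby")

df <- read.csv("products.csv", stringsAsFactors = FALSE)  # hypothetical file
df$gemstone <- vapply(df$categories, extract_attribute,
                      character(1), keywords = gemstones)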

importxml could not fetch url after repeated attempts

I am trying to import the weather data for a number of dates, and one zip code, in Google Sheets. I am using importxml for this in the following base formula:
=importxml("https://www.almanac.com/weather/history/zipcode/89118/2020-01-21","//*")
When using this formula with certain zip codes and certain times, it returns the full text of the page which I then query for the mean temperature and mean dew point. However, with the above example and in many other cases, it returns "Could not fetch URL" and #N/A in the cells.
So the issue is that it works a number of times, but by the fifth date or so it throws the "Could not fetch URL" error. It also fails as I change zip codes. My only guess, based on reading many threads, is that because I'm requesting the URL so often from Sheets, it is eventually being blocked. Is there any other error anyone can see? I have to use the formula a few times to calculate relative humidity and other things, so I need it to work multiple times. Would there be a better way to get this working using a script? Or is there anything else that could cause this?
Here is the spreadsheet in question (just a work in progress, but the weather part is my issue): https://docs.google.com/spreadsheets/d/1WPyyMZjmMykQ5RH3FCRVqBHPSom9Vo0eaLlff-1z58w/edit?usp=sharing
The formulas that are throwing errors start at column N.
This Sheet contains many formulas using the above base formula, in case you want to see more examples of the problem.
Thanks!
After a great deal of trial and error, I found a solution to my own problem. I'm answering this in detail for anyone who needs to find weather info by zip code and date.
I switched to using IMPORTDATA, transposed it to speed up the query, and used a helper cell to hold the result for each date. I then have the other formulas search within the result in the helper cell, instead of calling import*** many times throughout. It is slow at times, but it works. This is the updated helper formula (where O3 contains the date in "YYYY-MM-DD" form, O5 contains the URL "https://www.almanac.com/weather/history/", and O4 contains the zip code):
=if(O3="",,query(transpose(IMPORTdata($O$5&$O$4&"/"&O3)),"select Col487 where Col487 contains 'Mean'"))
And then to get the temperature (where O3 contains the date and O8 contains the above formula):
=if(O3="",,iferror(text(mid(O$8,find("Mean Temperature",O$8)+53,4),"0.0° F"),"Loading..."))
And finally, to calculate the relative humidity:
=if(O3="",,iferror(if(now()=0,,exp(((17.625*243.04)*((mid(O$8,find("Mean Dew Point",O$8)+51,4)-32)/1.8-(mid(O$8,find("Mean Temperature",O$8)+53,4)-32)/1.8))/((243.04+(mid(O$8,find("Mean Temperature",O$8)+53,4)-32)/1.8)*(243.04+(mid(O$8,find("Mean Dew Point",O$8)+51,4)-32)/1.8)))),"Loading..."))
Most importantly, IMPORTDATA has not once thrown the "Could not fetch URL" error, so it appears to be a better fetch method for this particular site.
Hopefully this can help others who need to pull in historical weather data :)

How do I reference previous/next rows when iterating through a data frame in R?

I have a dataset that looks like this (I'm simplifying slightly here):
Column 1 has a user id
Column 2 has a url title
Column 3 has an actual url
The data is already ordered by user and time, so it's User 1 and all the URLs they visited in ascending order of time, then User 2 and the URLs they visited in ascending order of time, etc.
What I'm trying to do is loop through the dataset and look for "triplets" where the first row's url doesn't contain my keyword (something like google or facebook or nytimes or whatever), the second row's url does contain my keyword, and the third row's doesn't. Basically, checking which websites users visited before and after any specific website.
I've figured out I can look for the keyword using:
if(length(grep("facebook",url)) > 0)
But I haven't been able to figure out how to loop through the code and achieve what I'm trying to do.
If you could break your response into two parts, I would really appreciate it:
Part 1: Is there any way to loop through a dataframe and have access to all the columns? I was able to work on a single column with this code:
new_data <- data.frame(url = character())  # start empty
for (url in data$url) {
  if (length(grep("keyword", url)) > 0) {
    new_data <- rbind(new_data, data.frame(url = url))
  }
}
This approach is limited, though, because I can only reference a single column of my data frame. What's the better solution here? I tried:
for (row in data), and then referencing columns by row[column_number] and row['column_name'], to no avail.
I also tried for (i in 1:nrow(data)) and then referencing columns using data[i, column_number], and that didn't work either (that should have worked, right?). I figured that if this method worked, I could use i-1 and i+1 to access other rows! I know this isn't the traditional way of doing things in R, but if you could still offer an explanation of how to do it this way, I would really appreciate it.
Part 2: How do I accomplish my actual goal, as stated earlier? I'd like to learn to do it the "R way"; I imagine it's going to involve plyr or lapply, but I haven't managed to figure out how to use those functions even after extensive reading, let alone use them with references to previous/next rows.
Thanks in advance for your help, any guidance is appreciated!
Flag the keyword rows first, then loop over the interior rows and index the neighbours with i-1 and i+1:
penu <- nrow(df) - 1  # second-to-last row, so i+1 stays in bounds

df$ContainsKeyword <- FALSE
df$ContainsKeyword[grep("keyword", df$url)] <- TRUE

df$TripletFound <- NA
for (i in 2:penu) {
  # middle row matches the keyword, its neighbours do not
  df$TripletFound[i] <- df$ContainsKeyword[i] &
    !df$ContainsKeyword[i-1] & !df$ContainsKeyword[i+1]
}
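A vectorized alternative avoids the loop by shifting the flag vector with [-1] indexing. Since the data covers several users, you probably also want all three rows to belong to the same user; the user column name below is an assumption:

ck <- grepl("keyword", df$url)
n  <- length(ck)
prev <- c(NA, ck[-n])    # flag of the previous row
nxt  <- c(ck[-1], NA)    # flag of the next row
# previous and next rows must belong to the same user as the middle row
same_user <- c(NA, df$user[-n]) == df$user & df$user == c(df$user[-1], NA)
df$TripletFound <- !prev & ck & !nxt & same_user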
