Given URLs of the format:
/dashboard/app/Minecraft/info
/dashboard/app/Minecraft/players
/dashboard/app/Diablo/players
When I create a content group with extraction RegEx:
/app/(.*?)/
I would expect to see Minecraft & Diablo as extracted values, but no extracted app names are being shown; I get 1 row, which is (not set).
Am I doing something obviously wrong?
Answering my own question:
The regex was correct, but I was under the impression it would apply to historical data. Looking at the data coming in, the app names are being collected correctly.
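For reference, the pattern itself does capture the app names from the example paths; here is a quick sketch in R, just to illustrate the capture group (the surrounding .* is only there to turn the extraction RegEx into a full-string substitution):

paths <- c("/dashboard/app/Minecraft/info",
           "/dashboard/app/Minecraft/players",
           "/dashboard/app/Diablo/players")

# The first capture group of app/(.*?)/ holds the app name.
sub(".*app/(.*?)/.*", "\\1", paths, perl = TRUE)
# returns "Minecraft" "Minecraft" "Diablo"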
It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured; for example, the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (while writing the code I have restricted myself to just 3 years; the illustration of the structure is based on this test data). The number of years captured will change over time; generally it will increase.
The number of policies will fluctuate; I've just labelled them policy 1, 2, etc. for sensitivity reasons, and limited the number whilst testing the code to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a data frame, created a vector to be the column headers, and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column and the grouped data became the row, then expand the data to get the metrics repeated for each year. A friend of mine who does coding (but has never used R) has suggested using loops might be a better way forward. Again, I am not sure of the best approach, so I welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer, but this appears to be a summarise tool, and I am not trying to summarise the data, rather transform its structure.
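To make that structure concrete, here is a minimal sketch with made-up policy names, metric names, and values (the real ones are sensitive), assuming each policy's tibble has the columns described above; pivot_longer()/pivot_wider() reshape rather than summarise, so they can express the group-transpose-expand idea without loops:

library(tibble)
library(dplyr)
library(tidyr)

# Made-up stand-ins for the list of per-policy tibbles from the initial import.
policy_list <- list(
  policy_1 = tibble(metric = c("metric_a", "metric_b"),
                    value_2024 = c(10, 20), value_2030 = c(12, 22), value_2035 = c(14, 24)),
  policy_2 = tibble(metric = c("metric_a", "metric_b"),
                    value_2024 = c(30, 40), value_2030 = c(32, 42), value_2035 = c(34, 44))
)

# One row per policy x metric x year.
long <- bind_rows(policy_list, .id = "policy") |>
  pivot_longer(starts_with("value_"),
               names_to = "year", names_prefix = "value_", values_to = "value")

# Metrics as columns, one row per policy x year (if that is the target shape).
wide <- pivot_wider(long, names_from = metric, values_from = value)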
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks
I am trying to import the weather data for a number of dates, and one zip code, in Google Sheets. I am using importxml for this in the following base formula:
=importxml("https://www.almanac.com/weather/history/zipcode/89118/2020-01-21","//*")
When using this formula with certain zip codes and certain times, it returns the full text of the page which I then query for the mean temperature and mean dew point. However, with the above example and in many other cases, it returns "Could not fetch URL" and #N/A in the cells.
Thus, the issue is, it works a number of times, but by the fifth date or so, it throws the "Could not fetch URL" error. It also fails as I change zip codes. My only guess based on reading many threads is that because I'm requesting the URL so often from Sheets, it is eventually being blocked. Is there any other error anyone can see? I have to use the formula a few times to calculate relative humidity and other things, so I need it to work multiple times. Is it possible there would be a better way to get this working using a script? Or anything else that could cause this?
Here is the spreadsheet in question (just a work in progress, but the weather part is my issue): https://docs.google.com/spreadsheets/d/1WPyyMZjmMykQ5RH3FCRVqBHPSom9Vo0eaLlff-1z58w/edit?usp=sharing
The formulas that are throwing errors start at column N.
This Sheet contains many formulas using the above base formula, in case you want to see more examples of the problem.
Thanks!
After a great deal of trial and error, I found a solution to my own problem. I'm answering this in detail for anyone who needs to find weather info by zip code and date.
I switched to using importdata, transposed it to speed up the query, and used a helper cell to hold the result for each date. I then have the other formulas searching within the result in the helper cell, instead of calling import*** many times throughout. It is slow at times, but it works. This is the updated helper formula (where O3 contains the date in "YYYY-MM-DD" form, O5 contains the URL "https://www.almanac.com/weather/history/", and O4 contains the zip code):
=if(O3="",,query(transpose(IMPORTdata($O$5&$O$4&"/"&O3)),"select Col487 where Col487 contains 'Mean'"))
And then to get the temperature (where O3 contains the date and O8 contains the above formula):
=if(O3="",,iferror(text(mid(O$8,find("Mean Temperature",O$8)+53,4),"0.0° F"),"Loading..."))
And finally, to calculate the relative humidity:
=if(O3="",,iferror(if(now()=0,,exp(((17.625*243.04)*((mid(O$8,find("Mean Dew Point",O$8)+51,4)-32)/1.8-(mid(O$8,find("Mean Temperature",O$8)+53,4)-32)/1.8))/((243.04+(mid(O$8,find("Mean Temperature",O$8)+53,4)-32)/1.8)*(243.04+(mid(O$8,find("Mean Dew Point",O$8)+51,4)-32)/1.8)))),"Loading..."))
Most importantly, importdata has not once thrown the Could not fetch URL error, so it appears to be a better fetch method for this particular site.
Hopefully this can help others who need to pull in historical weather data :)
I've been using the REDCapR package to read in data from my survey form. I was reading in the data with no issue using redcap_read until I realized I needed to add a field restriction to one question on my survey. Initially it was a short answer field asking users how many of something they had, and people were doing expectedly annoying things like spelling out numbers or entering "a few" instead of a number. But all of that data read in fine. I changed the field to be a short answer field (same type as before) that requires the response to be an integer and now the data won't read into R using redcap_read.
When I run:
redcap_read(redcap_uri=uri, token=api_token)$data
I get the error message that:
Column [name of my column] can't be converted from numeric to character
I also noticed when I looked at the data that it read in the 1st and 6th records of that column (both zeros) just fine (out of 800+ records), but everything else is NA. Is there an inherent problem with trying to read in data from a text field restricted to an integer, or is there another way to do this?
Edit: it also reads the dates fine, which are text fields with a date field restriction. This seems to be very specific to reading in the validated numbers from the text field.
I also tried redcapAPI::exportRecords and it will continue to read in the rest of the dataset, but reads in NA for all values in the column with the field restriction.
Upgrade REDCapR to the version on GitHub, which stacks the batches on top of each other before determining the data type (see #257).
# install.packages("remotes") # Run this line if the 'remotes' package isn't installed already.
remotes::install_github(repo="OuhscBbmc/REDCapR")
In your case, I believe that the batches (of 200 records, by default) contain different data types (character & numeric, according to the error message), which won't stack on top of each other silently.
The REDCapR::redcap_read() function should work then. (If not, please create a new issue).
Two alternatives are:
calling redcap_read_oneshot with a large value of guess_max, or
calling redcap_read_oneshot with guess_type = FALSE, which returns every column as character so you can convert the types yourself.
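A minimal sketch of those two alternatives, assuming the same uri and api_token as above (and a made-up column name for the restricted field):

library(REDCapR)

# Alternative 1: guess column types from (up to) all rows instead of the default maximum.
ds1 <- redcap_read_oneshot(redcap_uri = uri, token = api_token, guess_max = 100000)$data

# Alternative 2: skip type guessing so every column comes back as character,
# then convert the restricted field yourself ("my_integer_field" is a made-up name).
ds2 <- redcap_read_oneshot(redcap_uri = uri, token = api_token, guess_type = FALSE)$data
ds2$my_integer_field <- as.integer(ds2$my_integer_field)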
I have a list of local election candidates and I would like to find out
(i) if these individuals have a Twitter account, and
(ii) if so, what their screen names/user names are.
search_users seemed to be the best option but it does not do a good job. Here is an example:
y1 <- search_users(q="suleyman kilinc", n=5, parse=TRUE)
This gives me a list of 5 users, and none of them is the one that I am looking for. This is often the case. But when I do the same search on Google with the keywords "suleyman+kilinc+twitter", the first option that Google offers is exactly what I need. This is true for 95% of the random names that I manually searched. Is there a good way to automate the name-to-username search through R, or a better option than the search_users function?
Any help is appreciated.
It is a very interesting question. The q parameter accepts a string, as indicated above. When you pass words separated by a space as the value of q, you are instructing the function to search for "suleyman" & "kilinc"; hence "suleyman kilinc" is the same as "suleyman AND kilinc". The Twitter REST API in this case will return any user matching both "suleyman" and "kilinc", regardless of the order.
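To see that in action, the two calls below should behave the same way (a sketch using rtweet, with the same example name as above):

library(rtweet)

# Both queries ask for users matching "suleyman" AND "kilinc"; the word order does not matter.
y1 <- search_users(q = "suleyman kilinc", n = 5, parse = TRUE)
y2 <- search_users(q = "kilinc suleyman", n = 5, parse = TRUE)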
I want to create a data frame that contains >100 observations on ~20 variables. This will be based on a list of HTML files saved to my local folder. I would like to make sure that R matches the correct values per variable to each observation. Assuming that R goes through the files in the same order when constructing each variable AND does not skip variables in case of errors or the like, this should happen automatically.
But is there a "safe way" to do this, meaning assigning observation names to each variable value when retrieving the info?
Take my sample code for extracting a variable to make it more clear:
#Specifying the URL for the desired website to be scraped
library(rvest)

url <- 'http://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature'

#Reading the HTML code from the website
webpage <- read_html(url)

title_data_html <- html_text(html_nodes(webpage, '.lister-item-header a'))
rank_data_html <- html_text(html_nodes(webpage, '.text-primary'))
description_data_html <- html_text(html_nodes(webpage, '.ratings-bar+ .text-muted'))

df <- data.frame(title_data_html, rank_data_html, description_data_html)
This would come up with a list of rank and description data, but no reference to the observation name for rank or description (before binding it in the df). Now, in my actual code one variable suddenly comes up with one value too many, so 201 descriptions when there are only 200 movies. Without a reference to which movie each description belongs to, it is very tough to see why that happens.
A colleague suggested extracting all variables for one observation at a time and extending the data frame row-wise (one observation at a time) instead of column-wise (one variable at a time), but spotting errors and clean-up needs per variable seems much more time-consuming this way.
Does anyone have a suggestion of what is the "best practice" in such a case?
Thank you!
I know it's not a satisfying answer, but there is not a single strategy for solving this type of problem. This is the work of web scraping. There is no guarantee that the html is going to be structured in the way you'd expect it to be structured.
You haven't shown us a reproducible example (something we can run on our own machine that reproduces the problem you're having), so we can't help you troubleshoot why you ended up extracting 201 nodes during one call to html_nodes when you expected 200. Best practice here is the boring old advice to LOOK at the website you're scraping, LOOK at your data, and see where the extra or duplicate description is (or where the missing movie is). Perhaps there's an odd element with an attribute that also matches your selector text. Look at the website both as it appears in a browser and as source code. Right-clicking and choosing View Source, Ctrl + U (Windows), or Option + Cmd + U (Mac) are some ways to pull up the source code. Use the search function to see what matches the selector text.
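As a concrete way to "look at your data", here is a short sketch that reuses the objects from the question's code (webpage, title_data_html, description_data_html are assumed to exist) to compare vector lengths and print the matched description nodes so the odd one out can be spotted:

length(title_data_html)        # how many titles were matched?
length(description_data_html)  # if this is 201 while titles are 200, one selector match is spurious

# Inspect the matched nodes directly to find the extra or duplicate description.
desc_nodes <- html_nodes(webpage, '.ratings-bar+ .text-muted')
head(html_text(desc_nodes, trim = TRUE), 10)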
If the html document you're working with is like the example you used, you won't be able to use the strategy you're looking for help with (extract the name of the movie together with the description). You're already extracting the names. The names are not in the same elements as the descriptions.