Problem with dplyr causing additional info to appear - r

I'm using dplyr in R (Microsoft R Open 3.5.3, to be precise) and having a slight problem whereby I sometimes see lots of additional information in the data frame I create. For example, for these lines of code:
claims_frame_2 <- left_join(claims_frame,
  select(new_policy_frame, c(Lookup_Key_4, Exposure_Year, RowName)),
  by = c("Accident_Year" = "Exposure_Year", "Lookup_Key_4" = "Lookup_Key_4")
)
claims_frame_3 <- claims_frame_2 %>% group_by(Claim.Number) %>% filter(RowName == max(RowName))
There is no problem with the left_join command, but when I run the second command (group_by/filter), the structure of the claims_frame_3 object differs from that of the claims_frame_2 object. It suddenly seems to have lots of attributes (something I know little about) attached to the RowName field. See the attached photo.
Does anyone know why this happens and how I can stop it?
I had hoped to put together a small chunk of reproducible code that demonstrated this happening, but so far I haven't been successful. I will keep trying. In the meantime, I'm hoping someone might see this code (from a real project) and immediately know why this is happening!
Grateful for any advice.
Thanks
Alan
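One plausible cause, which the post does not confirm, is that group_by() returns a grouped data frame and keeps its grouping metadata as attributes, which then show up when the object is inspected. The sketch below uses a made-up toy frame (the values are purely illustrative) to show the grouped class surviving filter() and being dropped again with ungroup():

```r
library(dplyr)

# Toy stand-in for claims_frame_2; the values are made up for illustration
claims <- data.frame(
  Claim.Number = c("A", "A", "B"),
  RowName      = c(1L, 3L, 2L)
)

latest <- claims %>%
  group_by(Claim.Number) %>%
  filter(RowName == max(RowName))

class(latest)              # still a grouped data frame after filter()
names(attributes(latest))  # grouping metadata is stored as attributes

# ungroup() drops the grouping; as.data.frame() also drops the tibble classes
plain <- latest %>% ungroup() %>% as.data.frame()
class(plain)
```

If the extra attributes seen in the real project come from somewhere else (for example, from the RowName column itself), this will not remove them, so it is worth comparing str() output before and after.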

Related

SpotifyR get_artist_audio_features() doesn't filter by market, is_uri() not found when editing

I'm trying to obtain audio features of an artist's tracks with spotifyr:
test <- get_artist_audio_features("california honeydrops", include_groups = c("album", "single", "appears_on", "compilation"), market = "US")
A quick check of the results reveals several repeats of the same track and album names with slightly different audio features, and unique(test$available_markets) reveals that these duplications occur because the function did not properly filter by market = "US". Replacing "US" with other country codes yields the same result. However, if include_groups is left as the default, which only returns tracks from albums, then the market filter does work as expected.
I thought I might make a quick fix by editing the source code for get_artist_audio_features() to force market = "US" in RStudio, but I get an error when copy-pasting and then trying to run the original function's code because R insists one of the functions used to make get_artist_audio_features(), spotify::is_uri(), is not part of the spotifyr package. However, it can be found in the package's help section, is part of the original function, and works fine when calling the original function.
Of course, I can filter these duplicates out after the fact, but for edification's sake, what gives? Can anyone provide a fix to the original function or explain why R can't find the is_uri() when I try to run a copy of the original function?
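A possible explanation for the "not found" error, assuming is_uri() is an internal (non-exported) helper of the package: a pasted copy of a function lives in the global environment and can only see exported names, whereas the original looks names up inside the package namespace. The sketch below illustrates the mechanism with Pillai, an internal helper in the stats namespace, as a stand-in; my_copy is a hypothetical name.

```r
# A pasted copy is defined in the global environment, so internal
# (non-exported) helpers of the package are invisible to it.
my_copy <- function() is.function(Pillai)

found_before <- tryCatch(my_copy(), error = function(e) FALSE)

# Pointing the copy's environment at the package namespace restores the lookup.
environment(my_copy) <- asNamespace("stats")
found_after <- tryCatch(my_copy(), error = function(e) FALSE)

c(before = found_before, after = found_after)
```

Under the same assumption, the edited copy of get_artist_audio_features() could be given asNamespace("spotifyr") as its environment, or the helper could be reached with the triple-colon operator (pkg:::name).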

Issue Knitting Lab Report - Coding Error?

I have been conducting some exercises from OpenIntro statistics to start getting familiar with R and RStudio.
I have completed all the exercises, and when I run my code in RStudio I get all of the tables and graphs without a problem.
However, when it is time to knit the document, I get an error (one that I believe I should not be getting, given that I was able to run my code in RStudio without any errors and my tables and graphs are generated accurately).
Knitting fails at exercise 3, where I am told to generate a plot of the proportion of boys born over time. Here is a sample of my code (lines 53 to 58):
```{r plot-prop-boys-arbuthnot}
mutate (arbuthnot, boy_ratio = boys / total)
ggplot(data = arbuthnot, aes(x = year, y = boy_ratio)) +
geom_line()
```
However, I then get a long error message that I do not understand. It says that total was not found. I tried defining total by inserting:
total <- boys + girls
or by inserting :
total <- arbuthnot$boys + arbuthnot$girls
It just does not seem to work no matter what I do. For instance, even if I successfully define total, knitting fails again with another error. I also tried changing the way I write the mutate code. For instance, I used
arbuthnot <- arbuthnot %>%
mutate(boy_ratio = boys / total)
However, even when I use this code in combination with the solutions I tried for defining the total, it still does not work.
I am not sure what to do at this point because the graph is displayed in RStudio. The ratio is accurate, it also shows up in a table that I have generated.
The variable total is in that table. I tried restarting and re-running all the chunks of code in R. All of my tables and graphs come out perfectly, and then when I try to knit my lab report again it fails at line 54.
I have been trying to solve this for 2 days now and I am not sure what I should do.
I hope the community here will be able to give me a couple of pointers on how to solve this problem :) ! If you need more information or a bit more code let me know :) !
Wishing everyone a wonderful day !
To help others help you, consider making a minimal working example (MWE), for example using the reprex package. Without more details, it is nearly impossible to know exactly what went wrong.
The error message states that there is no total in the environment and that arbuthnot does not contain a column total, so possibly the latter was created but not assigned. It may be that the variable exists in your environment when you run the code interactively because you created the column or the variable at some point (using the code you provided). However, note that the document compiles in a new environment from scratch when knitting the .Rmd file, in which case it cannot find the variable and aborts.
To debug your code, consider replacing the code chunk lines 53-58 by a print statement, like head(arbuthnot), to see what comes out in the output file and confirm that the tibble indeed contains total.
Alternatively, debug by running the code chunk by chunk until you get the error message in a new environment. In RStudio, try Ctrl + Shift + F10 (equivalent to Session > Restart R) to clear everything and start afresh.
The following code chunk should work:
library(openintro)
library(tidyverse)
data(arbuthnot)
arbuthnot <- arbuthnot %>%        # note the assignment (overwrites the data frame)
  mutate(total = boys + girls,    # define total first
         boy_ratio = boys / total)
ggplot(data = arbuthnot,
       mapping = aes(x = year, y = boy_ratio)) +
  geom_line()
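The point about a column being "created but not assigned" can be checked on a toy data frame. In this minimal sketch the counts are made up and births is a hypothetical name; only the column names follow the arbuthnot data.

```r
library(dplyr)

# Toy data with made-up counts; only the column names mirror arbuthnot
births <- data.frame(
  year  = 1629:1631,
  boys  = c(50, 60, 55),
  girls = c(48, 57, 56)
)

# Without assignment the result is only printed; births itself is unchanged
mutate(births, total = boys + girls)
has_total_before <- "total" %in% names(births)

# With assignment the new columns are kept for later chunks
births <- births %>%
  mutate(total = boys + girls, boy_ratio = boys / total)
has_total_after <- "total" %in% names(births)

c(before = has_total_before, after = has_total_after)
```

Because knitting starts from an empty environment, only columns created and assigned inside the .Rmd file itself are available to later chunks.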
Thank you @lbelzile for the great tips.
In the future, I will use a minimal working example to better inform other contributors on Stack Overflow. I thought that the evidence I had provided was sufficient.
That being said, thanks to the bits of code you sent me, I was able to solve the problem.
Following parts of your instructions, here is the code that worked :
head(arbuthnot)
library(tidyverse)
library(openintro)
data(arbuthnot)
arbuthnot <-arbuthnot %>%
mutate (total = boys + girls, boy_ratio = boys / total)
ggplot(data = arbuthnot, aes(x = year, y = boy_ratio)) +
geom_line()
After inserting this code, the file knitted successfully and my lab report was generated.
I would like to thank you for taking the time to help me :) !
Have a great week.

Completing web forms and ingesting the responses with R?

So, here's the current situation:
I have 2000+ lines of R code that produces a couple dozen text files. This code runs in under 10 seconds.
I then manually paste each of these text files into a website, wait ~1 minute for the website's response (they're big text files), then manually copy and paste the response into Excel, and finally save them as text files again. This takes hours and is prone to user error.
Another ~600 lines of R code then combines these dozens of text files into a single analysis. This takes a couple of minutes.
I'd like to automate step 2--and I think I'm close, I just can't quite get it to work. Here's some sample code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
The code runs and every time I've done it "balcoResults" comes back with "Status: 200". Success! EXCEPT the file size is 0...
I don't know where the problem is, but my best guess is that the text block isn't getting filled out before the form is submitted. If I go to the website (http://hess.ess.washington.edu/math/v3/v3_age_in.html) and manually submit an empty form, it produces a blank webpage: pure white, nothing on it.
The problem with this potential explanation (and me fixing the code) is that I don't know why the text block wouldn't be filled out. The results of set_values tells me that "text_block" has 120 characters in it. This is the correct length for textString. I don't know why these 120 characters wouldn't be pasted into the web form.
An alternative possibility is that R isn't waiting long enough to get a response from the website, but this seems less likely because a single sample (as here) runs quickly and the status code of the response is 200.
Yesterday I took the DataCamp course on "Working with Web Data in R." I've explored GET and POST from the httr package, but I don't know how to pick apart the GET response to modify the form and then have POST submit it. I've considered trying the package RSelenium, but according to what I've read, I'd have to download and install a "Selenium Server". This intimidates me, but I could probably do it -- if I was convinced that RSelenium would solve my problem. When I look on CRAN at the function names in the RSelenium package, it's not clear which ones would help me. Without firm knowledge for how RSelenium would solve my problem, or even if it would, this seems like a poor return on the time investment required. (But if you guys told me it was the way to go, and which functions to use, I'd be happy to do it.)
I've explored SO for fixes, but none of the posts that I've found have helped. I've looked here, here, and here, to list three.
Any suggestions?
After two days of thinking, I spotted the problem. I didn't assign the result of the set_values function to a variable (if that's the right R terminology).
Here's the corrected code:
library(xml2)
library(rvest)
textString <- "C2-Boulder1 37.79927 -119.21545 3408.2 std 3.5 2.78 0.98934 0.0001 2012 ; C2-Boulder1 Be-10 quartz 581428 7934 07KNSTD ;"
url <- "http://hess.ess.washington.edu/math/v3/v3_age_in.html"
balcoForm <- html_form(read_html(url))[[1]]
balcoForm <- set_values(balcoForm, summary = "no", text_block = textString)
balcoResults <- submit_form(html_session(url), balcoForm, submit = "text_block")
balcoResults
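The underlying behaviour is general to R: functions return a modified copy rather than changing their argument in place, so a result that is not assigned is lost. A minimal sketch in base R, where form and set_text are hypothetical stand-ins and not the real rvest structures:

```r
# Hypothetical stand-in for a form object; not the real rvest structure
form <- list(fields = list(text_block = ""))

# Hypothetical setter that, like most R functions, returns a modified copy
set_text <- function(form, value) {
  form$fields$text_block <- value
  form
}

set_text(form, "C2-Boulder1 ...")          # result is printed, then discarded
unchanged_len <- nchar(form$fields$text_block)

form <- set_text(form, "C2-Boulder1 ...")  # result is kept
updated_len <- nchar(form$fields$text_block)

c(unchanged = unchanged_len, updated = updated_len)
```

This matches the symptom in the question: the printed output of the unassigned call showed the filled-in field, while the object that was actually submitted was still empty.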

Issues in R with saving split dataframes as new files

I've searched a bit for answered questions related to this, but I still keep running into issues.
I have a dataframe with 1.4 million rows loaded into R, containing GPS route data for ~56 vehicles. I used the split() function to parse my data into smaller chunks by bus name (bus name example: '1367/E0007489'). I used the following line of code:
dfs <- split(sater001_paired, f=sater001_paired[, "vehicleName"])
Here sater001_paired is my dataframe, and vehicleName is the variable I split on. The number of rows in each chunk is uneven, given that this data was captured in real time.
The problem I'm facing now is attempting to save each of these chunks into their own .csv files. I tried using lapply as such:
lapply(names(dfs), function(x){write.table(dfs[[x]], file = paste("bus", x, sep = ""))})
But R returns an error message: "cannot open the connection". It's likely I'm missing something, as I'm very rusty on using the lapply function.
Any suggestions based off this?
MrFlick has helped me realize the issue I was having here.
So just to close this: the vehicleName column contained a forward slash halfway through each identification code. RStudio on Windows does not take kindly to these characters, and I did not realize this, as I have only recently switched over from primarily using Mac OS.
I replaced the slashes using gsub in the following code:
sater001_paired$vehicleName <- gsub('/', '-', sater001_paired$vehicleName)
The issue is now resolved. Thanks again to MrFlick for the help.
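For completeness, here is a minimal sketch of the whole split-and-save step on toy data. The data frame buses, its values, the second vehicle name, and the output directory are all made up for illustration. The slash is replaced only when building the file name, so the data itself stays unchanged.

```r
# Toy stand-in for sater001_paired; values are illustrative only
buses <- data.frame(
  vehicleName = c("1367/E0007489", "1367/E0007489", "1402/E0007512"),
  speed       = c(10, 12, 8),
  stringsAsFactors = FALSE
)

dfs <- split(buses, buses$vehicleName)
out_dir <- tempdir()  # assumption: any writable directory would do

paths <- vapply(names(dfs), function(nm) {
  safe <- gsub("/", "-", nm, fixed = TRUE)  # "/" would be read as a folder separator
  path <- file.path(out_dir, paste0("bus_", safe, ".csv"))
  write.csv(dfs[[nm]], path, row.names = FALSE)
  path
}, character(1))

basename(paths)
```

Using vapply() instead of lapply() returns the written paths as a character vector, which makes it easy to confirm that every file was created.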

Viewing more than 1000 rows in RStudio

In RStudio, when you use the View() function, it only allows you to see up to 1000 rows. Is there any way to see more than that? I know it is possible to subset the view and see rows 1000-2000, for example, but I would like to be able to see rows 1-2000. The best I could find was a comment from about a year ago saying that it wasn't possible at the time but that they were planning on fixing this.
Here's an example (note: I'm guessing you will have to run this in RStudio).
rstudio <- (1:2000)
View(rstudio)
The View command is specifically for the little helper window. You can easily view the full value in the actual console window. If you want the same layout, use cbind.
cbind(rstudio)
which in fact will even give you the same nice row-numbering setup.
And if that's too cumbersome:
pview <- function(x, rows = 100) {
  if (length(x) <= rows) {
    print(cbind(x))                  # short enough: print everything
  } else {
    print(cbind(head(x, rows / 2)))  # otherwise print the first half ...
    print(cbind(tail(x, rows / 2)))  # ... and the last half of the allowance
  }
}
pview(rstudio, 1998)
You will need to clean that up to get the row names to line up.
You can change this setting, for instance:
options(max.print=5000)
