R - Number Rounding error when copying to Row - r

Below is how I created an empty data frame I intend to populate 1 row at a time from a data source.
numTweets=31
finalDataFrame = as.data.frame( matrix(NA, numTweets-1, 23), stringsAsFactors=FALSE)
names(finalDataFrame) = c( "TweetID", "TweetTime", "Text", "Source",
"UserID", "Username", "Screenname", "FollowerCount", "FriendCount",
"Location", "Latitude", "Longitude", "ReplyTweetID", "ReplyUserID",
"ReplyScreenname", "RetweetID", "RetweetCreated", "RetweetUsername",
"RetweetScreename", "RetweetLocation", "RetweetFollowers", "RetweetFriends",
"RetweetSource" )
An example of a row I am inserting is below as well
print( thisRow, row.names=FALSE )
TweetID TweetTime Text
877010425019158529 Tue Jun 20 03:49:14 +0000 2017 #OmniDestiny I would recommend trying to find the facebook group for evergreen because i think their school facebook page got shut down.
Source UserID Username Screenname FollowerCount FriendCount Location Latitude Longitude ReplyTweetID ReplyUserID ReplyScreenname RetweetID
Twitter Web Client 843603187298779137 Albert HellhoyZ 4 72 Bellevue, WA 0 0 876742560328417281 4726147296 OmniDestiny NA
RetweetCreated RetweetUsername RetweetScreename RetweetLocation RetweetFollowers RetweetFriends RetweetSource
NA NA NA NA NA NA NA
So, this row looks perfectly fine, and the data frame I have created to store it in looks fine. However, when I try to copy it in...
## Z minus 1 since we started our loop at 2
finalDataFrame[z-1, ] = thisRow
Many values get weird. For example, thisRow displays the ReplyTweetID value (an int64 value) perfectly as 876742560328417281, but when I look at it in R in the finalDataFrame....
finalDataFrame[1, "ReplyTweetID" ]
[1] 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000046816189162956993
>
I'm not sure what can cause this drastic change. Any ideas?
EDIT: I'm pretty sure it has to be due to the value being an int64, and the Matrix not liking that. However, is there a way to prep the Matrix for that? I can alternatively do toString( IDVALUEHERE ) when I am making "thisRow" in the first place, but that seems like it shouldn't be necessary.
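One way to sidestep this, in the spirit of the toString() idea above, is to pre-allocate the ID columns as character instead of starting from a matrix of logical NA. A likely cause (an assumption, since the code that builds thisRow is not shown) is that the IDs are 64-bit integers from a package such as bit64, which keeps the bits in a double plus a class attribute; if that class is lost during the row assignment, the same bits are printed as an ordinary, tiny double. The sketch below uses base R only, with a hypothetical three-column layout standing in for the 23 real columns.

```r
# Hypothetical, shortened column set; the real frame has 23 columns.
numTweets <- 3
finalDataFrame <- data.frame(
  TweetID       = rep(NA_character_, numTweets),  # IDs stored as text
  ReplyTweetID  = rep(NA_character_, numTweets),
  FollowerCount = rep(NA_integer_, numTweets),
  stringsAsFactors = FALSE
)

# Assumes the IDs are converted to strings when the row is built.
thisRow <- data.frame(
  TweetID       = "877010425019158529",
  ReplyTweetID  = "876742560328417281",
  FollowerCount = 4L,
  stringsAsFactors = FALSE
)

finalDataFrame[1, ] <- thisRow
print(finalDataFrame[1, "ReplyTweetID"])
```

Tweet IDs are identifiers rather than quantities, so nothing is lost by keeping them as text, and every digit survives the copy.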

Related

How to efficiently send a dataframe with multiple rows via httr::PUT

Probably due to my limited knowledge of communicating with APIs (which I am trying to remedy :) ), I seem to be unable to execute a PUT request for more than 1 row of a data frame at a time. For example, if df_final consists of 1 row, the following code works. If there are multiple rows, it fails and I get a 400 status.
reqBody <- list(provName = df_final$Provider,site = df_final$Site,
monthJuly = df_final$July, monthAugust = df_final$August,
monthSeptember = df_final$September, monthOctober =df_final$October,
monthNovember = df_final$November ,
monthDecember = df_final$December, monthJanuary = df_final$January, monthFebruary = df_final$February,
monthMarch = df_final$March, monthApril = df_final$April, monthMay = df_final$May,
monthJune = df_final$June,
assumptions = paste("Monthly Volume:", input$Average, "; Baseline Seasonality:", input$Year, "; Trend:", input$Year_slopes),
rationale = as.character(input$Comments), fiscalYear = FY_SET, updateDtm = Sys.time())
r <- PUT(fullURL, body = reqBody, encode = "json", content_type_json())
Using with_verbose() I can see that the JSON being sent is formatted differently in the 2 cases. I haven't found anything in the documentation (https://cran.r-project.org/web/packages/httr/httr.pdf) that has been particularly helpful in overcoming this.
The format it appears to send in the first instance (1 row in the data frame) looks like this:
{"provName":"Name","site":"site","monthJuly":56,"monthAugust":71,"monthSeptember":65,"monthOctober":78,"monthNovember":75,"monthDecember":98,"monthJanuary":23,"monthFebruary":39,"monthMarch":38,"monthApril":42,"monthMay":57,"monthJune":54,"assumptions":"Monthly Volume: Last 3 Months of 2019 ; Baseline Seasonality: 2017 ; Trend: 2017","rationale":"","fiscalYear":2022,"updateDtm":"2023-02-03 15:19:40"}
and again, it works sans issues.
With 2 rows I get the following format:
{"provName":["Name","Name"],"site":["site","site"],"monthJuly":[56,56],"monthAugust": [71,71],"monthSeptember":[65,65],"monthOctober":[78,78],"monthNovember":[75,75],"monthDecember": [98,98],"monthJanuary":[23,23],"monthFebruary":[39,39],"monthMarch":[38,38],"monthApril": [42,42],"monthMay":[57,57],"monthJune":[54,54],"assumptions":["Monthly Volume: Last 3 Months of 2019 ; Baseline Seasonality: 2017 ; Trend: 2017","Monthly Volume: Last 3 Months of 2019 ; Baseline Seasonality: 2017 ; Trend: 2017"],"rationale":["",""],"17":2,"18":2}
And it fails with status 400.
I suppose I could use lapply and PUT for each row, however with thousands of rows in a dataframe, I think that would be less than ideal.
Anyone have any light to share on this?
Any help would be greatly appreciated!
PS: this didn't really answer my question: "R httr put requets"
and, as I mentioned, doing something like this is not ideal: "Convert each data frame row to httr body parameter list without enumeration"
Looks like you are using a list as the request body. Use a data frame instead.
Lists and data frames get serialized to JSON differently:
jsonlite::toJSON(list(x = 1:2, y = 3:4))
#> {"x":[1,2],"y":[3,4]}
jsonlite::toJSON(data.frame(x = 1:2, y = 3:4))
#> [{"x":1,"y":3},{"x":2,"y":4}]
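Applied to the request in the question, this could look roughly like the sketch below. The two-column df_final and the put_rows() helper are hypothetical stand-ins, and it is an assumption that the endpoint accepts a JSON array of row objects; if it only accepts a single object, the server side would need to change or the rows would still have to be sent one at a time.

```r
library(jsonlite)

# Hypothetical two-row stand-in for df_final, with only two of the fields.
df_final <- data.frame(
  provName  = c("Name", "Name"),
  monthJuly = c(56, 71),
  stringsAsFactors = FALSE
)

# A data frame serializes row-wise: one JSON object per row.
body_json <- toJSON(df_final)
print(body_json)

# Hypothetical helper (not called here): sends the pre-built JSON string
# as the request body and sets the Content-Type header to JSON.
put_rows <- function(url, df) {
  httr::PUT(
    url,
    body = as.character(jsonlite::toJSON(df)),
    httr::content_type_json()
  )
}
```

Serializing explicitly with toJSON() and passing the resulting string makes it easy to inspect exactly what will be sent before any request is made.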

Write values from R to a PostgreSQL table based on Row IDs

I have a PostgreSQL table Scores on a local server that looks like this:
ID Score_X Score_Y
1 NA NA
2 NA NA
3 NA NA
4 NA NA
I do a series of calculations in R that produces a dataframe Calc_Scores that looks like this:
ID Score_X Score_Y
1 0.53 0.81
4 0.75 0.95
I would like to write the scores that correspond with each ID from R to the PostgreSQL table such that the final PostgreSQL table should look like this:
ID Score_X Score_Y
1 0.53 0.81
2 NA NA
3 NA NA
4 0.75 0.95
I have a connection to the PostgreSQL database called connection, which I set up using the function dbConnect(). The actual tables are quite big. What code in R could I use to write these scores to the PostgreSQL table? I have been looking for a similar question but couldn't find anything. I have tried
dbWriteTable(connection, "Scores", value = Calc_Scores, overwrite=T, append = F, row.names = F)
However, the entire table gets overwritten. I want only the scores to be updated.
Thank you.
Creating a temporary table could be an option:
# Create temporary table
dbWriteTable(connection, "ScoresTmp", value = Calc_Scores, overwrite=T, append = F, row.names = F)
# Update main table
dbExecute(connection,"
UPDATE Scores
SET Score_X = ScoresTmp.Score_X,
Score_Y = ScoresTmp.Score_Y
FROM ScoresTmp
WHERE Scores.ID = ScoresTmp.ID
")
# Clean up
dbExecute(connection,"DROP TABLE ScoresTmp")
Note that you should be able to create a real temporary table using the temporary = TRUE option: according to @Sirius's comment below, it should work on a PostgreSQL database.
For users of a SQL Server database, this option doesn't work, but they can use the # prefix to create a temporary table.
In the example above, this would be:
dbWriteTable(connection, "#ScoresTmp", value = Calc_Scores, overwrite=T, append = F, row.names = F)
One way of doing this relies on the SQL 'update' and in essence you do
- open a connection to your database
- loop over your changeset and for each row
- form the update statement, i.e. for example via
cmd <- paste('update table set x=', Score_x, ', y=',
Score_y, ' where id=', id)
- submit the cmd via eg `dbSendQuery`
- close the connection
There are examples in RPostgreSQL.
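The loop described above can be sketched with DBI as follows. This is an illustration under assumptions, not the RPostgreSQL example itself: update_scores() is a hypothetical helper, the table and column names are taken from the question, and DBI::sqlInterpolate() is swapped in for paste() so that values are quoted by DBI rather than by hand. DBI::ANSI() is a dummy connection used here only to show the generated statement without a live database.

```r
library(DBI)

# Same shape as Calc_Scores in the question.
Calc_Scores <- data.frame(
  ID      = c(1, 4),
  Score_X = c(0.53, 0.75),
  Score_Y = c(0.81, 0.95)
)

update_sql <- "UPDATE Scores SET Score_X = ?x, Score_Y = ?y WHERE ID = ?id"

# Hypothetical helper (not called here): one UPDATE per row of the changeset.
update_scores <- function(conn, scores) {
  for (i in seq_len(nrow(scores))) {
    stmt <- sqlInterpolate(
      conn, update_sql,
      x = scores$Score_X[i], y = scores$Score_Y[i], id = scores$ID[i]
    )
    dbExecute(conn, stmt)
  }
  invisible(nrow(scores))
}

# Build the statement for the first row against a dummy ANSI connection.
example_stmt <- sqlInterpolate(
  ANSI(), update_sql,
  x = Calc_Scores$Score_X[1], y = Calc_Scores$Score_Y[1], id = Calc_Scores$ID[1]
)
print(example_stmt)
```

With a real connection the call would be update_scores(connection, Calc_Scores). For a large changeset, the temporary-table approach issues a single UPDATE and is likely to be faster than one statement per row.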

R: Replace all Values that are not equal to a set of values

All.
I've been trying to solve a problem on a large data set for some time and could use some of your wisdom.
I have a DF (1.3M obs) with a column called customer along with 30 other columns. Let's say it contains multiple instances of customers Customer1 thru Customer3000. I know that I have issues with 30 of those customers. I need to find all the customers that are NOT the customers I have issues with and replace the value in the 'customer' column with the text 'Supported Customer'. That seems like it should be a simple thing... if it weren't for the number of obs, I would have loaded it up in Excel, filtered all the bad customers out and copy/pasted the text 'Supported Customer' over what remained.
I've tried replace and str_replace_all using grepl and paste/paste0, but to no avail. My current code looks like this:
#All the customers that have issues
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127",
"Customer128", ..... , "Customer140")
#Look for everything that is NOT in the list above and replace with "Enabled"
orderData$customer <- str_replace_all(orderData$customer, paste0("[^", paste(out, collapse =
"|"), "]"), "Enabled Customers")
That code gets me this error:
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
In a character range [x-y], x is greater than y. (U_REGEX_INVALID_RANGE)
I've tried the inverse of this approach and pulled a list of all obs that dont match the list of out customers. Something like this:
enabled <- orderData %>% filter(!customer %in% out) %>% select(customer) %>%
distinct(customer)
This gets me a much larger list of customers that ARE enabled (~3,100). Using the str_replace_all and paste approach seems to have issues though. At this large number of patterns, paste no longer collapses using the "|" operator. instead I get a string that looks like:
"c(\"Customer1\", \"Customer2345\", \"Customer54\", ......)
When passed into str_replace_all, this does not match any patterns.
Anyways, there's got to be an easier way to do this. Thanks for any/all help.
Here is a data.table approach.
First, some example data since you didn't provide any.
customer <- sample(paste0("Customer",1:300),5000,replace = TRUE)
orderData <- data.frame(customer = sample(paste0("Customer",1:300),5000,replace = TRUE),stringsAsFactors = FALSE)
orderData <- cbind(orderData, matrix(runif(5000 * 30, min = 0, max = 100), ncol = 30))
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127", "Customer128","Customer140")
library(data.table)
setDT(orderData)
result <- orderData[!(customer %in% out),customer := gsub("Customer","Supported Customer ",customer)]
result
customer 1 2 3 4 5 6 7 8 9
1: Supported Customer 134 65.35091 8.57117 79.594166 84.88867 97.225276 84.563997 17.15166 41.87160 3.717705
2: Supported Customer 225 72.95757 32.80893 27.318046 72.97045 28.698518 60.709381 92.51114 79.90031 7.311200
3: Supported Customer 222 39.55269 89.51003 1.626846 80.66629 9.983814 87.122153 85.80335 91.36377 14.667535
4: Supported Customer 184 24.44624 20.64762 9.555844 74.39480 49.189537 73.126275 94.05833 36.34749 3.091072
5: Supported Customer 194 42.34858 16.08034 34.182737 75.81006 35.167769 23.780069 36.08756 26.46816 31.994756
---
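If the goal is literally to overwrite every non-problem customer with the fixed text 'Supported Customer', as the question describes, no regular expression is needed: a logical index built with %in% is enough. The sketch below is base R on small made-up data; the customer range and the amount column are assumptions for illustration only.

```r
set.seed(1)

# Small made-up stand-in for the 1.3M-row orderData.
orderData <- data.frame(
  customer = sample(paste0("Customer", 120:145), 50, replace = TRUE),
  amount   = runif(50, min = 0, max = 100),
  stringsAsFactors = FALSE
)

# Customers with issues; these keep their original names.
out <- paste0("Customer", 123:128)

# TRUE for rows that belong to a problem customer.
has_issue <- orderData$customer %in% out

# Overwrite everything else with the fixed label.
orderData$customer[!has_issue] <- "Supported Customer"

print(table(orderData$customer))
```

%in% does exact matching on whole values, so it avoids the character-class error and the question of how many patterns paste() can collapse. It is vectorized and should cope with 1.3M rows without trouble.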

Trouble merging two dataframes in R (VLOOKUP)

I need help merging two data frames with R. I'm a little desperate, since I have tried everything I could. Any help would be appreciated.
The thing is that I'm doing some daily web scraping, and I need to compare today's results with yesterday's results in order to detect if there have been any changes.
I only have two variables (title of the page and url) in two data frames (one for today and one for yesterday), and I want to merge them into one.
The possible changes are:
Changes in the name.
Changes in the url.
New programs (new name and new url).
Deleted programs.
I've tried with merge, cast & melt, ifelse, etc. etc. and I can't solve the problem. For example:
yesterday <- read.csv2("Yesterday.csv")
today <- read.csv2("Today.csv")
new <- merge(x = today, y = yesterday, all = TRUE, sort = TRUE)
But without the desired result. I'm attaching three files:
Today.csv, with the results of today scraping
Yesterday.csv, with the results of yesterday scraping
Results.xlsx with the desired output. A VLOOKUP in Excel, highlighting the changes I want to detect (in this case name changes).
I would need a solution for all four kinds of change. The output could be different, I don't care about that, but I need the comparison to be correct. Even if you find that this question is a duplicate, I would need the link to the other one, because I haven't been able to find it.
Thanks in advance.
Answer is updated in response to the comments below:
library(tidyverse)
bind_rows(
anti_join(today, yest) %>%
mutate(
label = ifelse(programa %in% yest$programa, 'changed', 'added')
),
anti_join(yest, select(today, programa)) %>% mutate(label = "deleted")
)
Which, while applying it to the whole data sets, returns following results:
# # A tibble: 6 x 3
# programa url label
# <chr> <chr> <chr>
# 1 Carrera de Derecho a distancia |~ https://universidadeuropea.es/onlin~ added
# 2 "Carrera de Criminolog\xeda a di~ https://universidadeuropea.es/onlin~ added
# 3 "Carrera Ingenier\xeda Inform\xe~ https://universidadeuropea.es/onlin~ added
# 4 Grado en Derecho a distancia | U~ https://universidadeuropea.es/onlin~ dele~
# 5 "Grado en Criminolog\xeda a dist~ https://universidadeuropea.es/onlin~ dele~
# 6 "Grado Ingenier\xeda Inform\xe1t~ https://universidadeuropea.es/onlin~ dele~
To check whether it is able to register changes in a program, we can do the following:
yest[22, 2] <- yest[23, 2]
Piping the changed data into the code above returns a table with an additional record, labelled as changed:
# # A tibble: 7 x 3
# programa url label
# <chr> <chr> <chr>
# 1 "M\xe1ster en Direcci\xf3n Hotel~ https://universidadeuropea.es/onlin~ chan~
# 2 Carrera de Derecho a distancia |~ https://universidadeuropea.es/onlin~ added
# 3 "Carrera de Criminolog\xeda a di~ https://universidadeuropea.es/onlin~ added
# 4 "Carrera Ingenier\xeda Inform\xe~ https://universidadeuropea.es/onlin~ added
# 5 Grado en Derecho a distancia | U~ https://universidadeuropea.es/onlin~ dele~
# 6 "Grado en Criminolog\xeda a dist~ https://universidadeuropea.es/onlin~ dele~
# 7 "Grado Ingenier\xeda Inform\xe1t~ https://universidadeuropea.es/onlin~ dele~
Explanation:
Everything enclosed inside bind_rows() is combined into a single tibble. Since we have two separate anti_join() statements here, and each of them returns its own tibble, we have to row-bind them into one;
anti_join() is a filtering join which, given two tables A and B, returns the rows of A that have no match in B. In other words, the result is the difference between A and B.
When we call anti_join(today, yest) we obtain the subset of today whose records are either not present in yest at all, or whose program or url has changed compared to yest. We pipe those results into a mutate() call and assign the value changed to label if the value of programa is the same as yesterday (programa %in% yest$programa) while the url value has changed. If programa %in% yest$programa is FALSE, the program name wasn't present in yest, so it is a new program, and we label it as added.
When we call anti_join() for a second time, we are looking for the difference between yest and today program names. In other words: 'Which programs present in yest are not present in today?' We achieve this by looking for the subset of yest with program names which are not among the program names of today (that's why you need select(today, programa)). If any such records are detected, they are labelled deleted.
Sorry if this explanation is somewhat clumsy, but I hope it will help you to navigate the code.
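To make the labelling logic easier to follow, here is the same pipeline run on a tiny made-up example. The program names and URLs below are invented for illustration, and the by arguments are spelled out explicitly (an addition to the code above) so that the joins do not rely on automatic column matching.

```r
library(dplyr)

# Made-up data: B gets a new url, C disappears, D is new.
yest  <- tibble(programa = c("A", "B", "C"), url = c("u/a", "u/b",  "u/c"))
today <- tibble(programa = c("A", "B", "D"), url = c("u/a", "u/b2", "u/d"))

changes <- bind_rows(
  # Rows of today with no exact (programa, url) match in yest.
  anti_join(today, yest, by = c("programa", "url")) %>%
    mutate(label = ifelse(programa %in% yest$programa, "changed", "added")),
  # Rows of yest whose program name no longer appears in today.
  anti_join(yest, select(today, programa), by = "programa") %>%
    mutate(label = "deleted")
)

print(changes)
```

With this logic, a program whose name changes while its url stays the same shows up as one added row and one deleted row, because the comparison is keyed on programa.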
Data:
tmp <- tempfile()
download.file(
"https://drive.google.com/uc?authuser=0&id=1scYdZrGYaSDr-TE8IZsy1tKSdLjMn7jt&export=download",
tmp
)
today <- read_delim(tmp, delim = ";")
download.file(
"https://drive.google.com/uc?authuser=0&id=1uJ-ThiKykTjoY1gc3jlBHoab8WAJD-wP&export=download",
tmp
)
yest <- read_delim(tmp, delim = ";")
file.remove(tmp)

Function to iterate over list, merging results into one data frame

I've completed the first couple of R courses on DataCamp and, in order to build up my skills, I've decided to use R to prep for fantasy football this season, so I have begun playing around with the nflscrapR package.
With the nflscrapR package, one can pull Game Information using the season_games() function which simply returns a data frame with the gameID, game date, the home and away team abbreviations.
Example:
games.2012 = season_games(2012)
head(games.2012)
GameID date home away season
1 2012090500 2012-09-05 NYG DAL 2012
2 2012090900 2012-09-09 CHI IND 2012
3 2012090908 2012-09-09 KC ATL 2012
4 2012090907 2012-09-09 CLE PHI 2012
5 2012090906 2012-09-09 NO WAS 2012
6 2012090905 2012-09-09 DET STL 2012
Initially I copy and pasted the original function and changed the last digit manually for each season, then rbinded all the seasons into one data frame, games.
games.2012 <- season_games(2012)
games.2013 <- season_games(2013)
games.2014 <- season_games(2014)
games.2015 <- season_games(2015)
games = rbind(games.2012, games.2013, games.2014, games.2015)
I'd like to write a function to simplify this process.
My failed attempt:
gameID <- function(years) {
for (i in years) {
games[i] = season_games(years[i])
}
}
With years = list(2012, 2013) for testing purposes, this produced the following:
Error in strsplit(headers, "\r\n") : non-character argument
Called from: strsplit(headers, "\r\n")
Thanks in advance!
While @Gregor has an apparent solution, he didn't run it because this wasn't a minimal example. I googled, found, and tried to use this code, and it doesn't work, at least not in a reasonable amount of time.
On the other hand, I took this code from Vivek Patil's blog.
library(XML)
weeklystats = as.data.frame(matrix(ncol = 14)) # Initializing our empty dataframe
names(weeklystats) = c("Week", "Day", "Date", "Blank",
"Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose",
"YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose",
"Year") # Naming columns
URLpart1 = "http://www.pro-football-reference.com/years/"
URLpart3 = "/games.htm"
#### Our workhorse function ####
getData = function(URLpart1, URLpart3) {
for (i in 2012:2015) {
URL = paste(URLpart1, as.character(i), URLpart3, sep = "")
tablefromURL = readHTMLTable(URL)
table = tablefromURL[[1]]
names(table) = c("Week", "Day", "Date", "Blank", "Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose", "YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose")
table$Year = i # Inserting a value for the year
weeklystats = rbind(table, weeklystats) # Appending happening here
}
return(weeklystats)
}
I posted this because it works, you might learn something about web scraping you didn't know, and it runs in 11 seconds.
system.time(weeklystats <- getData(URLpart1, URLpart3))
user system elapsed
0.870 0.014 10.926
You should probably take a look at some popular answers for working with lists, specifically "How do I make a list of data frames?" and "What's the difference between [ and [[?".
There's no reason to put your years in a list. They're just integers, so just do a normal vector.
years = 2012:2015
Then we can get your function to work (we'll need to initialize an empty list before the for loop):
gameID <- function(years) {
  games = list()
  # loop over positions 1..length(years), not over the year values themselves
  for (i in seq_along(years)) {
    games[[i]] = season_games(years[i])
  }
  return(games)
}
Read my link above for why we're using [[ with the list and [ with the vector. And we could run it like this:
game_list = gameID(2012:2015)
But this is such a simple function that it's easier to use lapply. Your function is just a wrapper around a for loop that returns a list, and that's precisely what lapply is too. But where your function has season_games hard-coded in, lapply can work with any function.
game_list = lapply(2012:2015, season_games)
# should be the same result as above
In either case, we have the list of data frames and want to combine it into one big data frame. The base R way is rbind with do.call, but dplyr and data.table have more efficient versions.
# pick your favorite
games = do.call(rbind, args = game_list) # base
games = dplyr::bind_rows(game_list)
games = data.table::rbindlist(game_list)
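Since season_games() needs network access, here is a self-contained sketch of the same lapply plus rbind pattern. fake_season_games() is a hypothetical stand-in that returns a small data frame per year; swapping in the real season_games should give the combined games table.

```r
# Hypothetical stand-in for nflscrapR::season_games(): two games per season.
fake_season_games <- function(year) {
  data.frame(
    GameID = year * 1e6 + c(90500, 90900),
    home   = c("NYG", "CHI"),
    away   = c("DAL", "IND"),
    season = year,
    stringsAsFactors = FALSE
  )
}

years <- 2012:2015

# One data frame per season, collected in a list.
game_list <- lapply(years, fake_season_games)

# Stack the list into a single data frame (base R).
games <- do.call(rbind, game_list)

print(nrow(games))
print(table(games$season))
```

Because every element of game_list has the same columns, rbind stacks them directly; dplyr::bind_rows() and data.table::rbindlist() accept the same list.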

Resources