I have created an R data frame that currently has 691,221 rows, and I want to keep adding to it without repeating rows or having to recreate the data frame every time; I just want to append the new data. The original data lives in an SQL database that I have to access, and this is my first time ever using the RODBC library.
#this was my initial query to get the first batch of data and create the 691000 df
locs <- sqlQuery(con, 'SELECT * FROM v_AllLocs', rows_at_time = 1)
Now, tomorrow for example, I want to append only the new data that comes in. Is there some command in the RODBC library that can recognize this from an R object and previous command lines? Or, since I have a date/time stamp as one of the columns, I thought I could reference that somehow. I was thinking something like:
lastloc <- max(locs$acq_time_ak)
new <- sqlQuery(con, 'SELECT * FROM v_AllLocs WHERE acq_time_ak > lastloc', rows_at_time = 1)
locs <- rbind(locs, new)
However, I don't think sqlQuery can recognize the R object inside the query string. Or is it that lastloc is a POSIXct and the database can't interpret it? Either way, it doesn't work. Also, this is really simplistic, because in reality I have subsets of information in which individual X may have a different latest time stamp than individual Y. But maybe just to get started... how can I get the latest data to append to the R object?
Or, regardless of what is already in R, can I ask the SQL database for all data it has received since a given date? So, no matter any other attribute in the database, I just know that as of November 16, 2021 any new data coming in would be selected. Then for subsequent queries I'd change the date, or something along those lines?
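A minimal sketch of one way this is commonly handled with RODBC, assuming the con, v_AllLocs and acq_time_ak names from above: sqlQuery() only sends literal SQL text to the server, so the R timestamp has to be formatted into the query string yourself (the date/time format string may need adjusting for your server):
# Build the WHERE clause from the latest timestamp already in R
lastloc <- max(locs$acq_time_ak)
qry <- sprintf("SELECT * FROM v_AllLocs WHERE acq_time_ak > '%s'",
               format(lastloc, "%Y-%m-%d %H:%M:%S"))
new  <- sqlQuery(con, qry, rows_at_time = 1)
locs <- rbind(locs, new)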
So I have a large number of CSVs that are dropped into a folder each week, and the file names and the number of CSVs change each week. The format is consistent, and the manipulation I need to do is consistent, but the inputs are dynamic. Each CSV is essentially a data table with a unique ID for a user, an email address, and a binary 1/0 for an action those users completed. Keep in mind, I'm a fairly novice R user, so mostly I've been poking around here for the answer.
Essentially what I'm hoping to do is load all the CSVs from this folder and then dynamically reference the tables and perform a uniform set of actions on them each time. Whether there are 4 or 40 tables, I need to do the same thing.
One of the actions needed is to remove the "ID" column from each table. I use the email to join to a table later on, and the ID column is not useful, so it's easier to just drop it. All these CSVs include this "ID" column, so essentially I just need to drop it from all of the tables.
Here's a quick preview and the manual way I've been going about this.
#sample CSV with long file name '2018_October_10_regional_users_action_x'
ID email address action x
1 365367 joe.schmoe#email.com 1
2 953164 fake.guy#email.com 0
#sample CSV with long file name '2018_October_10_regional_users_action_z'
ID email address action z
1 798842 Jill.fake#email.com 0
2 100321 madeup.j#email.com 1
#code I've been using
library(readr)  # read_csv
library(dplyr)  # select

setwd(choose.dir())
temp <- list.files(pattern = "\\.csv$")  # picking up all the CSVs from my folder
#I was inserting these all as separate objects, since that's the way I know
# how to do what I need, but it's super manual
list2env(
  lapply(setNames(temp, make.names(gsub("\\.csv$", "", temp))),
         read_csv), envir = .GlobalEnv)
# Manual way I was dropping my column
`2018_October_10_regional_users_action_x` <- select(`2018_October_10_regional_users_action_x`, -ID)
`2018_October_10_regional_users_action_z` <- select(`2018_October_10_regional_users_action_z`, -ID)
There are a few other things I'm doing, but they're all fairly simple and similar in nature to this column drop. So if I can figure out how to do this one, I can apply it through the rest of my code.
I tried using get and mget to put these into a list and then building a function or for loop, since manually grabbing the object names (which are usually really long) and building these functions one at a time is not scalable.
Does anyone know of a way I can 1) dynamically reference the tables I've read in and 2) dynamically reference them in functions or formulas to accomplish some basic manipulation, similar to the column drop above?
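A minimal sketch of one common pattern, keeping everything in a named list instead of separate global objects (the folder path is an assumption; the ID column name is taken from the samples above):
library(readr)
library(dplyr)
library(purrr)

# Read every CSV in the folder into one named list of data frames
files  <- list.files("path/to/folder", pattern = "\\.csv$", full.names = TRUE)
tables <- set_names(map(files, read_csv),
                    tools::file_path_sans_ext(basename(files)))

# Apply the same manipulation to every table, e.g. drop the ID column
tables <- map(tables, ~ select(.x, -ID))

# Pull out an individual table by name when needed
# tables[["2018_October_10_regional_users_action_x"]]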
I have a rather large table in MS SQL Server (120 million rows) that I would like to query. I also have a data frame in R with unique IDs that I would like to use as part of my query criteria. I am familiar with the dplyr package, but I'm not sure whether it's possible to have the R query execute on the MS SQL Server rather than bring all the data into my laptop's memory (which would likely crash my laptop).
Of course, the other option is to load the data frame into SQL as a table, which is what I am currently doing, but I would prefer not to.
Depending on what exactly you want to do, you may find value in the RODBCext package.
Let's say you want to pull columns from an MS SQL table where the IDs are in a vector that you have in R. You might try code like this:
library(RODBC)
library(RODBCext)
library(tidyverse)
dbconnect <- odbcDriverConnect('driver={SQL Server};
server=servername;database=dbname;trusted_connection=true')
v1 <- c(34,23,56,87,123,45)
qdf <- data_frame(idlist=v1)
sqlq <- "SELECT * FROM tablename WHERE idcol %in% ( ? )"
qr <- sqlExecute(dbconnect,sqlq,qdf,fetch=TRUE)
Basically, you want to put all the information you want to pass to the query into a data frame. Think of it as the variables or parameters for your query: for each parameter you want a column in the data frame. Then you write the query as a character string and store it in a variable. You put it all together with the sqlExecute function.
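To extend that pattern (with hypothetical table and column names): if the query has more than one ? placeholder, give the parameter data frame one column per placeholder, in the same order they appear in the SQL, and fetch = TRUE returns the combined results.
# Two placeholders, so two columns in the parameter data frame
qdf2  <- data.frame(idlist = c(34, 23, 56), region = c("N", "S", "N"))
sqlq2 <- "SELECT * FROM tablename WHERE idcol = ? AND regioncol = ?"
qr2   <- sqlExecute(dbconnect, sqlq2, qdf2, fetch = TRUE)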
I need to extract two columns, named Complaint Id and Date received, from my MongoDB in R. My CSV only has approximately 70k records, but when I insert the data and get the two columns above, I get > 400k records. Any idea why this might be happening?
Using the mongolite package, this is the code I used to extract the two columns:
ids <- as.data.frame(mongoDf$find( fields = '{"Complaint ID":1, "Date received":1, "_id":0}'))
If your MongoDB data contains arrays, mongolite will load them as lists. You will need to use the $unwind stage in an aggregation pipeline (the connection's $aggregate method rather than a plain $find) to expand those arrays. You may also need flatten() from the jsonlite package.
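A minimal sketch of what that could look like with mongolite, using a hypothetical array field called SomeArrayField (replace it with whichever field actually holds the arrays):
# mongoDf is the connection object from the question
ids <- mongoDf$aggregate('[
  {"$unwind": "$SomeArrayField"},
  {"$project": {"Complaint ID": 1, "Date received": 1, "_id": 0}}
]')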
Is it possible to implement the following scenario in Power BI Desktop?
Load data from Excel file to several tables
Make calculation with R script from several data sources
Store results of calculation to new table in Power BI (.pbix)
The idea is to use Power BI Desktop for solving the "transportation problem" with linear programming in R. Before the solver runs, we need to transform data from several data sources. I'm new to Power BI. I see that it is possible to apply R scripts for loading, transforming, and visualizing data, but I need the ability to save the results of the calculation for subsequent visualization by the regular means of Power BI. Is it possible?
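For context, a minimal sketch of the kind of R calculation meant here, using the lpSolve package (an assumption; any LP solver would do) with made-up costs, supply and demand:
library(lpSolve)

# 2 plants x 3 markets: unit transport costs, supply and demand (made up)
costs  <- matrix(c(4, 6, 9,
                   5, 3, 8), nrow = 2, byrow = TRUE)
supply <- c(100, 150)
demand <- c(80, 90, 80)

sol <- lp.transport(costs, "min",
                    row.signs = rep("<=", 2), row.rhs = supply,
                    col.signs = rep(">=", 3), col.rhs = demand)
sol$solution   # optimal shipment quantities as a matrix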
As I mentioned in my comment, this post would have solved most of your challenges. That approach replaces one of the tables with a new one after the R script, but you're specifically asking to produce a new table, presumably leaving the input tables untouched. I've recently written a post where you can do this using Python in the Power Query Editor. The only difference in your case would be the R script itself.
Here's how I would do it with an R script:
Data samples:
Table1
Date,Value1
2108-10-12,1
2108-10-13,2
2108-10-14,3
2108-10-15,4
2108-10-16,5
Table2
Date,Value2
2108-10-12,10
2108-10-13,11
2108-10-14,12
2108-10-15,13
2108-10-16,14
Power Query Editor:
With these tables loaded either from Excel or CSV files, you've got this setup in the Power Query Editor:
Now you can follow these steps to get a new table using an R script:
1. Change the data type of the Date Column to Text
2. Click Enter Data and click OK to get an empty table named Table3 by default.
3. Select the Transform tab and click Run R Script to open the Run R Script Editor.
4. Leave it empty and click OK.
5. Remove = R.Execute("# 'dataset' holds the input data for this script",[dataset=#"Changed Type"]) from the Formula Bar and insert this: = R.Execute("# R Script:",[df1=Table1, df2=Table2]).
6. If you're prompted to do so, click Edit Permission and Run.
7. Click the gear symbol next to Run R Script under APPLIED STEPS and insert the following snippet:
R script:
# Left-join Table2 onto Table1 by Date, then sum the two value columns
df3 <- merge(x = df1, y = df2, by = "Date", all.x = TRUE)
df3$Value3 <- df3$Value1 + df3$Value2
This snippet produces a new data frame, df3, by joining df1 and df2 and adding a new column, Value3. It's a very simple setup, but now you can do pretty much anything by just replacing the join and calculation methods:
8. Click Home > Close & Apply to get back to Power BI Desktop. (Consider changing the data type of the Date column in Table3 from Text back to Date before you do, depending on how you'd like your tables, charts and slicers to behave.)
9. Insert a simple table visual to make sure everything went smoothly.
I hope this was exactly what you were looking for. Let me know if not and I'll take another look at it.
I'm having an issue trying to format dates in R... I tried the following code:
rdate <- as.Date(dusted$time2, "%d/%m/%y") and also the recommendations in this Stack Overflow question, Changing date format in R, but I still couldn't get it to work.
geov<-dusted
geov$newdate <- strptime(as.character(geov$time2), "%d/%m/%Y")
All I'm getting is NA for the whole date column. These are daily values; I would love it if R could read them. Data available here: https://www.dropbox.com/s/awstha04muoz66y/dusted.txt?dl=0
To convert to Date: as long as you have already imported the data successfully into a data frame such as dusted or geov, and time2 holds dates as strings resembling 10-27-06, try:
geov$time2 = as.Date(geov$time2, "%m-%d-%y")
The equal sign = is used just to save typing; it is equivalent to <-, so you can still use <- if you prefer.
This stores the converted dates right back into geov$time2, overwriting it, instead of creating a new variable geov$newdate as in your original question; a new variable is not required for the conversion. But if for some reason you really need a new variable, feel free to use geov$newdate.
Similarly, you also didn't need to copy dusted to a new geov data frame just to convert. It does save time for testing purposes, though: if the conversion doesn't work, you can restart by copying dusted to geov again instead of re-importing the data from file into dusted.
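To illustrate why the original attempt returned NA (a small made-up example; the format string has to match how the strings are actually written):
x <- c("10-27-06", "11-03-06")   # month-day-year, separated by dashes
as.Date(x, "%d/%m/%Y")           # wrong separators and field order -> NA NA
as.Date(x, "%m-%d-%y")           # matching format -> "2006-10-27" "2006-11-03"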
Additional resources
help(strptime) for looking up date code references such as %y. On Linux, man date can reveal the date codes