We are using FastReport.NET with .NET Core and trying to loop through the first dataset and compare each column value with the data in another dataset. I am not able to generate the report when the second dataset is large. Say the first dataset has 10 records and the second has 10,000: for each of the 10 records we need to loop through all 10,000. We did the same thing in RDL, and with the lookupdata function the report was generated successfully. Is there any option available here to achieve this functionality?
Is there any way to compare two datasets efficiently?
I have a main df of 250k observations to which I want to add a set of variables that I had to compute in smaller dfs (5 different dfs of 50k observations each) due to the row-size limit of left_join/merge (2^31 - 1 observations).
I am now trying to use left_join or merge on the main df and the 5 smaller ones to add the columns for the new variables to the main df, 50k observations at a time.
library(dplyr)

# Join each subsample's new-variable columns onto the main frame in turn
mainFrame <- left_join(mainFrame, newVariablesFirstSubsample)
mainFrame <- left_join(mainFrame, newVariablesSecondSubsample)
mainFrame <- left_join(mainFrame, newVariablesThirdSubsample)
mainFrame <- left_join(mainFrame, newVariablesFourthSubsample)
mainFrame <- left_join(mainFrame, newVariablesFifthSubsample)
After the first left_join (which adds the new variables' values for the first 50k observations), R doesn't seem to fill in any values for the following groups of 50k observations when I run the second through fifth left_joins. I draw this conclusion from the summary stats for the respective columns after each left_join.
Any idea what I'm doing wrong, or which other functions I might use?
Data tables allow you to create "keys", which are R's version of SQL's indexes. They will help expedite the search on the columns R uses for merging or left-joining.
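A minimal sketch of that approach, assuming the frames share a join column named id (the column name is an assumption):

library(data.table)

# Convert to data.tables and key them on the join column; keyed joins use
# a binary search instead of a full scan
setDT(mainFrame)
setDT(newVariablesFirstSubsample)
setkey(mainFrame, id)
setkey(newVariablesFirstSubsample, id)

# X[Y] keeps all rows of Y, so this is a left join onto mainFrame
mainFrame <- newVariablesFirstSubsample[mainFrame]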
If I were you, I would just export all of them to csv files and work on them in SQL or using SSIS.
The problem I'm noting is that you are only spotting the error from the summary statistics. Have you tried reversing the order in which you join the tables, or explicitly stating the names of the columns used in your left join, as in the sketch below?
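For instance, a one-line sketch naming the join column explicitly (the column name id is again an assumption), so dplyr does not silently join on every shared column:

mainFrame <- left_join(mainFrame, newVariablesSecondSubsample, by = "id")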
Please let us know the outcome.
I need to sort my data by date. Previously I had one dataset and used select and filter to create two separate datasets, one with data from June 30 or earlier and the other with data from July 1 or later. My problem is that I seem to have lost some rows in the process: I went from 1390 rows in my original dataset to 1335 rows across the two new datasets, and I can't figure out what happened.
What I am trying to do now is use my original dataset, ethica_surveys, and create a new column called pre_post. I know how to create a new column, but I want to fill it based on my date parameters: rows dated June 30 or earlier should contain pre, and those dated July 1 or later should contain post. I am filtering on the variable response_time, but I am unsure how to code this in R.
Thanks in advance for any help you can provide.
This seemed to work after a lot of trial and error.
library(dplyr)
# >= makes July 1 itself count as "post", matching the stated cutoff
ethica_surveys$pre_post <- if_else(
  ethica_surveys$response_time >= as.Date("2018-07-01"),
  "post", "pre"
)
I have a Spark data frame with 10 million rows, where each row is an alphanumeric string representing the id of a user, for example:
602d38c9-7077-4ea1-bc8d-af5c965b4e85. My objective is to check whether another id, like aaad38c9-7087-4ef1-bc8d-af5c965b4e85, is present in the 10 million list.
I want to do this efficiently, without searching all 10 million records every single time a search happens. For example, could I sort my records alphabetically and ask SparkR to search only within records that begin with a, instead of the whole universe, to speed up the search and make it computationally efficient?
Solutions using SparkR are preferred; failing that, any Spark solution would be helpful.
You can use rlike, which does a regex search within a data frame column.
df.filter($"foo".rlike("regex"))
Or you can index the Spark data frame into Solr, which will definitely find your string within a few milliseconds.
https://github.com/lucidworks/spark-solr
I have a bunch of sales opportunities in various Excel files, broken down by region, type, etc. Each file is one column that simply lists the dollar amount of each opportunity. In R I have run a simulation to determine the likelihood of each opportunity closing with a sale, repeated 100,000 times. I know I can't pass the full results table back to Tableau because it has 100,000 rows, one total per simulation, while the data I'm pulling into Tableau has only the dollar value of each opportunity, so its length is just the number of opportunities of that type.
What I have in R is basically this first block of code, repeated a number of times with varying inputs and probabilities; the totals vectors are ultimately combined into a quarter-total vector.
APN <- ncol(APACPipelineNew)
# One Bernoulli draw per opportunity per simulation; drawing only APN values
# (as in the original rbinom(APN, ...)) would silently recycle them down the matrix
APNSales <- matrix(rbinom(100000 * APN, 1, 0.033), 100000, APN)
# Scale each column by that opportunity's dollar value
APNSales <- sweep(APNSales, 2, APACPipelineNew, '*')
# Total simulated sales for each of the 100,000 runs
APNTotals <- rowSums(APNSales)
...
Q1APACN <- APNTotals + ABNTotals + AFNTotals
...
Q1Total <- Q1APACT + Q1EMEAT + Q1NAMT
What I'd like to do is set this up as a Tableau dashboard that updates automatically each week, but I'm not sure how to pass the simulation results back into Tableau given the difference in data length.
Some suggestions:
For R you can use the Windows Task Scheduler to run a job at any given interval (or use the taskscheduleR package; see the sketch after these suggestions).
After you save the R data you can manually update your dashboard if it is on a desktop version (I do not know whether you can schedule an extract refresh with a desktop dashboard).
However, if your dashboard lives on a Tableau Server, you can schedule an extract refresh every week. Obviously, I would schedule the R update before the Tableau extract refresh.
If you only want the data to update when the number of rows differs from the previous weekly run, you can build that logic into R, although saving the R data and refreshing the extract with the same data and row count should not cause any problems.
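A hedged sketch of the taskscheduleR route; the script path, task name, and timing are all assumptions, and the script itself would end by writing the simulation summary (e.g. quantiles of Q1Total) to whatever file your Tableau extract reads:

library(taskscheduleR)

# Run the simulation script every Monday at 07:00, ahead of the extract refresh
taskscheduler_create(
  taskname  = "weekly_pipeline_sim",
  rscript   = "C:/scripts/run_simulation.R",
  schedule  = "WEEKLY",
  days      = "MON",
  starttime = "07:00"
)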
I'm new to R, but doing my best.
I'm trying to create a histogram from data in a .csv file. Imagine one column of 10,000 random numbers ranging from 1 to 5. I want a histogram showing how many times 1 occurs, how many times 2 occurs, and so on up to 5.
Is this possible in any way? Or should I do this in Excel and then bring the results into R to create the histogram? I don't seem to get any wiser from the video tutorials so far or the other questions asked on here.
Import data from csv into R first:
dat <- read.csv("c:\\documents\\file.csv")
Assuming you have a column called "col" in your csv file that has your data, run this:
hist(dat$col)
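With only the integers 1 to 5, hist's default breaks can lump neighbouring values into one bar; explicit breaks centred on each integer give one bar per value (a small sketch using the same column):

# One bin per integer value 1..5
hist(dat$col, breaks = seq(0.5, 5.5, by = 1), xlab = "Value")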
If you need to know how many times each value occurs, a more precise way is to make a table:
table(dat$col)
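And since table already gives the exact counts, plotting those counts directly is often clearer than a histogram for discrete values:

# Bar chart of the count of each value
barplot(table(dat$col), xlab = "Value", ylab = "Frequency")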