I have a main df of 250k observations to which I want to add a set of new variables. I had to compute these in smaller dfs (5 separate dfs of 50k observations each) because of the row-size limit I ran into with left_join/merge (2^31-1 observations).
I am now trying to use left_join or merge on the main df and the 5 smaller ones to add the new variables' columns to the main df, 50k observations at a time.
mainFrame <- left_join(mainFrame, newVariablesFirstSubsample)
mainFrame <- left_join(mainFrame, newVariablesSecondSubsample)
mainFrame <- left_join(mainFrame, newVariablesThirdSubsample)
mainFrame <- left_join(mainFrame, newVariablesFourthSubsample)
mainFrame <- left_join(mainFrame, newVariablesFifthSubsample)
After the first left_join (which adds the new variables' values for the first 50k observations), R does not seem to fill in any values for the following groups of 50k observations when I run the second through fifth left_joins. I conclude this from the summary statistics of the respective columns after each left_join.
Any idea what I am doing wrong, or which other functions I might use?
Data tables allow you to create keys, which are R's version of SQL indexes. Keys speed up the lookup of the columns that R uses for merging or left-joining.
If I were you, I would just export all of them to csv files and work on them in SQL or with SSIS.
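A minimal data.table sketch of the keyed-join idea; the key column name "id" is an assumption, since the question does not show the join columns:

library(data.table)

# convert the existing data frames to data.tables in place
setDT(mainFrame)
setDT(newVariablesFirstSubsample)

# set the join key on both tables ("id" is an assumed column name)
setkey(mainFrame, id)
setkey(newVariablesFirstSubsample, id)

# keyed left join: keeps every row of mainFrame and adds the new columns
mainFrame <- newVariablesFirstSubsample[mainFrame]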
The problem I'm noting is that you are only spotting the error from the summary statistics. Have you tried reversing the order in which you join the tables, or explicitly stating the names of the columns used in your left join?
Please let us know the outcome.
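For example, a hedged sketch of spelling out the join columns; the key column "id" is an assumption:

library(dplyr)

mainFrame <- left_join(mainFrame, newVariablesFirstSubsample,  by = "id")
mainFrame <- left_join(mainFrame, newVariablesSecondSubsample, by = "id")
# ... and so on for the third to fifth subsamples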
I want to extract some statistical measures from large Spark DataFrames (approx. 250K records, 250 columns), e.g. max, mean, stddev, etc., and I also want some non-standard measures such as skewness.
I am working on Databricks using the SparkR API. I know that basic statistics can be obtained via summary:
df <- SparkR::as.DataFrame(mtcars)
SparkR::summary(df, "min", "max", "50%", "mean") -> mySummary
head(mySummary)
However, this does not cover other measures such as skewness, so I am basically extracting them using
SparkR::select(df, SparkR::skewness(df$mpg)) -> mpg_Skewness
head(mpg_Skewness)
This works, yet I am looking for a more performant way of doing so. Whether I loop over the columns or use an apply function over them, Spark always executes one job per column, and the jobs run sequentially.
I also tried to execute this for all columns as one job, but this is even slower.
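For reference, a sketch of what that single-job attempt might look like, built as one agg() call containing a skewness expression per column; the use of agg() and the mtcars example are my assumptions:

library(SparkR)

df <- as.DataFrame(mtcars)

# one skewness expression per column, collected into a single agg() call
exprs <- lapply(columns(df), function(colname) skewness(df[[colname]]))
allSkewness <- do.call(agg, c(list(df), exprs))

head(allSkewness)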
Is it possible to force Spark to run these calculations for all columns separately but in parallel? What else could you propose to make such column-wise operations as performant as possible? Any hints are highly appreciated!
I know there are many threads about this, but either the advice in them hasn't worked or I haven't understood it.
I have read what was originally an SPSS file into R.
I cleaned some variables and added new ones.
By this point the file size is 1,000 MB.
I wanted to write it to a CSV to look at it more easily, but R just stops responding; the file is too big, I guess.
So instead I want to create a subset of only the variables I need. I tried a couple of things:
(besb <- bes[, c(1, 7, 8)])
data1 <- bes[,1:8]
I also tried referring to variables by name:
nf <- c(bes$approveGov, bes$politmoney)
All of these attempts return errors about the number of dimensions.
Could somebody therefore please explain how to create a reduced subset of variables, preferably using variable names?
An easy way to subset variables from a data.frame is with the dplyr package. You can select variables with their bare names. For example:
library(dplyr)
nf <- select(bes, approveGov, politmoney)
It's fast for large data frames too.
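If the goal is still to write the reduced table to a CSV, a small follow-on sketch; the output file name is made up, and fwrite() from the data.table package is typically much faster than write.csv() on large tables:

library(data.table)

# write only the reduced subset to disk ("bes_subset.csv" is a hypothetical name)
fwrite(nf, "bes_subset.csv")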
I currently do a lot of descriptive analysis in R. I always work with a data.table like the df below:
library(data.table)

net    <- seq(1, 20, by = 2)                          # 10 values
gross  <- seq(2, 20, by = 2)                          # 10 values
color  <- rep(c("green", "blue", "white"), length.out = 10)
height <- rep(c(170, 172, 180, 188), length.out = 10)

df <- data.table(net, gross, color, height)
In order to obtain results, I apply a lot of filters.
Sometimes I use one filter, sometimes I use a combination of multiple filters, e.g.:
df[color=="green" & height>175]
In my real data.table, I have 7 columns and all kind of filter-combinations.
Since I always address the same data.table, I'd like to find the most efficient way to filter the data.
So far, my files are organized like this (bottom-up):
execution level: multiple R scripts, each with a very specific job (no interaction between them), that calculate results and write them to an Excel file using XLConnect
source file: this file receives a pre-filtered data.table and sources all files from the execution level. It is necessary in case I add/remove files on the execution level.
filter files: these read the data.table and apply one or multiple filters, as shown above (e.g. a df_green_high built from df[color=="green" & height>175]). By filtering, the filter files create a new data.table and source the "source file" with this new filtered table.
I am currently struggling because I have too many filter files. With 7 variables there is such a large number of filter combinations that I will get lost sooner or later.
How can I make my analysis more efficient (i.e. reduce the number of "filter files")?
How can I conveniently name the exported files according to the filters used?
I have read Workflow for statistical analysis and report writing and some other similar questions. However, in this case I always refer to the same basic table, so there should be a more efficient way. I do not have a CS background, so any help is highly appreciated. On SO, I also read about creating a package, but I am not sure whether that is reasonable here.
I usually do it like this:
create a list called say "my_case_list"
filter data, do computation on the filtered data
add a column called "case" to each filtered dataset. Fill this column with a descriptive string, e.g. "case 1: color=="green" & height>175"
put this data to my_case_list
convert the list to a data.frame-like object (e.g. with rbindlist)
export results to sql server
import results from sql server to Excel Pivot table
make sense of results
Automate the process as much as possible.
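A minimal sketch of steps 1 to 5 above; the filter definitions, the mean_net computation, and the object names are just placeholders:

library(data.table)

# named, quoted filter expressions; the names double as case/file labels
filters <- list(
  "case 1: green & tall" = quote(color == "green" & height > 175),
  "case 2: blue"         = quote(color == "blue")
)

my_case_list <- lapply(names(filters), function(case_name) {
  sub <- df[eval(filters[[case_name]])]      # apply the filter
  res <- sub[, .(mean_net = mean(net))]      # placeholder computation
  res[, case := case_name]                   # label the case
  res
})

results <- rbindlist(my_case_list)           # list -> one data.table
# 'results' (or each element of my_case_list) can then be exported, and the
# case names can be reused to build the output file names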
I would like to use the Cluster PAM algorithm in R to cluster a dataset of around 6000 rows.
I want the PAM algorithm to ignore a column called "ID" (i.e. not use it in the clustering), but I do not want to delete that column. I want to use it later on to combine my clustered data with the original dataset.
Basically, what I want is to add a cluster column to the original dataset.
I want to use PAM as a data compression / variable reduction method. I have 220 variables, and I would like to cluster some of them and reduce the dimensionality of my dataset so that I can apply a classification algorithm (most likely a tree) to the problem I am trying to solve.
If anyone knows a way around this or a better approach, please let me know.
Thank you
# import data
library(cluster)
data <- read.table("sampleiris.txt")
# execution
result <- pam(data[2:4], 3, FALSE, "euclidean")
Here the subset [2:4] is used because the id is assumed to be the first column. The code below should fetch the cluster values from PAM; you can then add them as a column to your data.
result$silinfo[[1]][1:nrow(result$silinfo[[1]])]
There is a small problem in the above code.
You should not use the silhouette information because it re-orders the rows as a preparation for the plot.
If you want to extract the cluster assignment while preserving the original dataset order, and add just a column of cluster assignments, you should use $cluster instead. I tried it and it works like a charm.
This is the code:
library(cluster)

data <- swiss[, 4:6]         # use only columns 4 to 6 for the clustering
result <- pam(data, 3)
summary(result)

export <- result$cluster     # cluster assignment in the original row order
swiss[, "Clus"] <- export

View(export)
View(swiss)
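Transferred to the original question's setup, a hedged sketch; the data frame name mydata, the "ID" column, and k = 3 are assumptions:

library(cluster)

clustering_vars <- setdiff(names(mydata), "ID")   # everything except the ID column
result <- pam(mydata[, clustering_vars], 3)

mydata$cluster <- result$cluster                  # the ID column stays untouched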
Cheers
My R application reads input data from large txt files. It does not read the entire file in one shot. Users specify the name of the gene (3 or 4 at a time) and, based on the user input, the app goes to the appropriate row and reads the data.
File format: 32,000 rows (one gene per row; the first two columns contain the gene name and other info) and 35,000 columns with numerical data (decimal numbers).
I used read.table(filename, skip = 10000), etc., to go to the right row and then read the 35,000 columns of data. Then I do this again for the 2nd gene and 3rd gene (up to 4 genes max) and process the numerical results.
The file reading operations take about 1.5 to 2.0 minutes. I am experimenting with reading the entire file and then taking the data for the desired genes.
Is there any way to accelerate this? I can rewrite the gene data in another format (one-time processing) if that will speed up the reading operations in the future.
You can use the colClasses argument to read.table to speed things up, if you know the exact format of your files. For 2 character columns and 34,998 (?) numeric columns, you would use
colClasses = c(rep("character",2), rep("numeric",34998))
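In context, this might look like the following sketch; the file name, the delimiter, and the absence of a header row are assumptions:

# read one gene's row, skipping the rows before it and typing every column up front
gene <- read.table("genes.txt",
                   skip = 10000, nrows = 1, header = FALSE,
                   colClasses = c(rep("character", 2), rep("numeric", 34998)))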
This would be more efficient with a database interface. Several are available for R (RODBC, for example), but a particularly well-integrated-with-R option is the sqldf package, which by default uses SQLite. You would then be able to use the database's indexing capacity to look up the correct rows and read all the columns in one operation.
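A sketch of that idea spelled out with DBI/RSQLite (which is what sqldf uses under the hood); the file, table, and column names are assumptions:

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "genes.sqlite")

# one-time load: gene_data_frame stands for the table read once from the txt file
dbWriteTable(con, "genes", gene_data_frame)
dbExecute(con, "CREATE INDEX idx_gene_name ON genes (gene_name)")

# per-request lookup: only the requested genes' rows are pulled into R
res <- dbGetQuery(con, "SELECT * FROM genes WHERE gene_name IN ('gene1', 'gene2')")

dbDisconnect(con)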