I have the following task: I have 50M transaction rows. I couldn't export them to a .txt file, but I have a connection to Hive and I created a table with the transactions:
Transaction_id Item
1 A
1 B
1 C
2 A
2 A
I cannot use
order_trans <- read.transactions(
  file = "(...)/trans2019.csv",
  format = "single",
  header = TRUE,
  sep = ",",
  cols = c("trans_id", "item"),
  rm.duplicates = TRUE,
  encoding = "UTF-16LE")
because it truncates the transactions.
I would like to do the same, but instead of a file I would like to pass my data frame (trans_id, item); however, that doesn't work.
I also tried:
trans = as(data.frame,"transactions")
but then the apriori algorithm gives me wrong rules, such as
APPLE --> transaction_ID
Can anyone help me with this?
Here are the solutions from the manual page (see '? transactions'):
## example 4: creating transactions from a data.frame with
## transaction IDs and items (by converting it into a list of transactions first)
a_df3 <- data.frame(
  TID = c(1, 1, 2, 2, 2, 3),
  item = c("a", "b", "a", "b", "c", "b")
)
a_df3
trans4 <- as(split(a_df3[,"item"], a_df3[,"TID"]), "transactions")
trans4
inspect(trans4)
## Note: This is very slow for large datasets. It is much faster to
## read transactions using read.transactions() with format = "single".
## This can be done using an anonymous file.
write.table(a_df3, file = tmp <- file(), row.names = FALSE)
trans4 <- read.transactions(tmp, format = "single",
header = TRUE, cols = c("TID", "item"))
close(tmp)
inspect(trans4)
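Applied to your case, here is a minimal sketch of the same idea (trans_df is a placeholder for the data frame you pulled from Hive, e.g. via DBI::dbGetQuery(); below it just mirrors the example data from your question):
library(arules)
# Placeholder data frame standing in for the Hive query result.
trans_df <- data.frame(trans_id = c(1, 1, 1, 2, 2),
                       item     = c("A", "B", "C", "A", "A"))
# Drop duplicate (trans_id, item) pairs, mirroring rm.duplicates = TRUE,
# then split the items by transaction ID and coerce to "transactions".
trans_df <- unique(trans_df)
trans <- as(split(trans_df$item, trans_df$trans_id), "transactions")
inspect(trans)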
Related
I am having some difficulty summarizing data from my database in R. I am looking to pull the data and have it summarized by quarter.
Below is the code I am using to get a .txt output, but I am getting errors.
What do I need to change in the code so that it runs and the data is summarized by quarter?
library(data.table, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
################
## PARAMETERS ##
################
# Set path of major source folder for raw transaction data
in_directory <- "C:/Users/name/Documents/Raw Data/"
# List names of sub-folders (currently grouped by first two characters of CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ", "IA-IL", "IM-KZ", "LA-MI", "MJ-MS",
"MT-NV", "NW-OH", "OI-PZ", "QA-TN", "TO-UZ",
"VA-WA", "WB-ZZ")
# Set location for output
out_directory <- "C:/Users/name/Documents/YTD Master/"
out_filename <- "NEW.csv"
# Set beginning and end of date range to be collected - year-month-day format
date_range <- interval(as.Date("2018-01-01"), as.Date("2018-05-31"))
# Enable or disable filtering of raw files to only grab items bought within certain months to save space.
# If false, all files will be scanned for unique items, which will take longer and be a larger file.
date_filter <- TRUE
##########
## CODE ##
##########
starttime <- Sys.time()
mastertable <- NULL
for (j in 1:length(in_subfolders)) {
  subfolder <- in_subfolders[j]
  sub_directory <- paste0(in_directory, subfolder, "/")

  ## IMPORT DATA
  in_filenames <- dir(sub_directory, pattern = ".txt")

  for (i in 1:length(in_filenames)) {
    # Default value provided for when fast filtering is disabled.
    read_this_file <- TRUE

    # To fast filter the data, we choose to include or exclude an entire file
    # based on the date of its first line.
    # WARNING: This is only a valid method if filtering by entire months,
    # since that is the amount of data housed in each file.
    if (date_filter) {
      temptable <- fread(paste0(sub_directory, in_filenames[i]),
                         colClasses = c(CUSTOMER_TIER = "character"),
                         na.strings = "", nrows = 1)
      temptable[, INVOICE_DT := as.Date(INVOICE_DT)]

      # If date matches, set read flag to TRUE. If date does not match,
      # set read flag to FALSE.
      read_this_file <- temptable[, INVOICE_DT] %within% date_range
    }

    if (read_this_file) {
      print(Sys.time() - starttime)
      print(paste0("Reading in ", in_filenames[i]))
      temptable <- fread(paste0(sub_directory, in_filenames[i]),
                         colClasses = c(CUSTOMER_TIER = "character"),
                         na.strings = "")
      temptable <- temptable[, lapply(.SD, sum), by = quarter(INVOICE_DT),
                             .SDcols = c("INV_ITEM_ID", "Ext Sale", "Ext Total Cost",
                                         "CE100", "CE110", "CE120",
                                         "QTY_SOLD", "PACKSLIP_WHSL")]

      # Combine into full list
      mastertable <- rbindlist(list(mastertable, temptable), use.names = TRUE)

      # Release unneeded memory
      rm(temptable)
    }
  }
}
# Save Final table
print("Saving master table")
fwrite(mastertable, paste0(out_directory, out_filename))
rm(mastertable)
print(Sys.time()-starttime)
After running this script, below is the error message I receive.
Error in gsum(INV_ITEM_ID) :
Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1)
Here is the general approach with some generic data.
library(tidyverse)
library(lubridate)
data.frame(date = seq(as.Date('2010-01-12'), as.Date('2018-02-03'), by = 100),
           var = runif(30)) %>%
  group_by(quarter(date, with_year = T)) %>%
  summarize(average_var = mean(var))
You can leave out the with_year = T if you don't care about the differences between years.
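If you would rather stay in data.table, here is a minimal sketch of the same grouping with made-up data. The key detail behind the error in the question is that a character column such as INV_ITEM_ID cannot be summed, so it has to be left out of .SDcols (or used as a grouping key):
library(data.table)
# Made-up data standing in for one of the raw files in the question.
temptable <- data.table(
  INVOICE_DT  = as.Date(c("2018-01-15", "2018-02-20", "2018-04-03")),
  INV_ITEM_ID = c("A1", "B2", "C3"),   # character: cannot be summed
  QTY_SOLD    = c(1, 2, 3),
  `Ext Sale`  = c(10, 20, 30)
)
# Sum only the numeric columns, grouped by quarter of the invoice date.
temptable[, lapply(.SD, sum),
          by = .(qtr = quarter(INVOICE_DT)),
          .SDcols = c("QTY_SOLD", "Ext Sale")]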
How can I save 3 data frames of different dimensions to one CSV file, so that I can load them back afterwards into 3 different data frames?
E.g.
write.table(A, file = "di2.csv", row.names = FALSE, col.names = FALSE, sep=',')
write.table(B, file = "di2.csv", row.names = FALSE, col.names = FALSE, sep=',', append=TRUE)
write.table(C, file = "di2.csv", row.names = FALSE, col.names = FALSE, sep=',', append=TRUE)
or in a more elegant way
write.csv(rbind(A, B, C), "di2.csv")
How can I load this CSV back into 3 data frames, A, B and C?
This worked for me:
save(A_table, B_table, C_table, file="temp.Rdata")
load("temp.Rdata")
As mentioned in the comments, if your purpose is just to read them back into R later, then you could use save/load.
Another simple solution is dump/source:
A <- B <- C <- BOD # test input
dump(c("A", "B", "C"))
# read back
source("dumpdata.R")
Other multiple-object formats that you could consider would be HDF and an SQLite database (which is a single file).
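For the SQLite route, a minimal sketch using DBI/RSQLite (the file name temp.sqlite is just a placeholder):
library(DBI)
# One SQLite file holds all three data frames as separate tables.
con <- dbConnect(RSQLite::SQLite(), "temp.sqlite")
dbWriteTable(con, "A", A, overwrite = TRUE)
dbWriteTable(con, "B", B, overwrite = TRUE)
dbWriteTable(con, "C", C, overwrite = TRUE)
# Read them back later into three data frames.
A2 <- dbReadTable(con, "A")
B2 <- dbReadTable(con, "B")
C2 <- dbReadTable(con, "C")
dbDisconnect(con)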
On the other hand, if it is important that it be readable text, directly readable by Excel, and at least somewhat similar to a CSV file, then write the data frames out one after another with a blank line after each. Then, to read them back later, read the file and separate the input at the blank lines. st and en are the starting and ending line numbers of the chunks in Lines.
A <- B <- C <- BOD # test inputs
# write data frames to a single file
con <- file("out.csv", "w")
for (el in list(A, B, C)) {
  write.csv(el, con, row.names = FALSE)
  writeLines("", con)
}
close(con)
# read data.frames from file into a list L
Lines <- readLines("out.csv")
en <- which(Lines == "")
st <- c(1, head(en, -1))
L <- Map(function(st, en) read.csv(text = Lines[st:en]), st, en)
Note that there are some similarities between this question and Importing from CSV from a specified range of values
Somehow my loop is not working; it only keeps the last variable. Here's the code:
library(readxl)
library(readr)
library(plyr)
library(dplyr)
path = "C:/Users/benja/OneDrive/Studium/Bachelorarbeit/Ressourcen/Conference Calls/"
Enterprise = "ABB Ltd"
#Import Dictionary
Dictionary <- read_excel("C:/Users/benja/OneDrive/Studium/Bachelorarbeit/Ressourcen/LoughranMcDonald_MasterDictionary_2014.xlsx",
sheet = "Tabelle1")
for (File in c("2016 Q1.xml", "2016 Q2.xml", "2016 Q3.xml", "2016 Q4.txt")) {
  # Import text
  ABB_2016_Q4 <- read_delim(paste0(path, Enterprise, "/", File),
                            " ", escape_double = FALSE, col_names = FALSE,
                            trim_ws = TRUE)

  # Reformatting -> first transpose, then vector, lowercase, data frame
  ABB_2016_Q4 = data.frame(tolower(c(t(ABB_2016_Q4))))
  colnames(ABB_2016_Q4) = "Word"

  # Join the text with the dictionary
  Analyze_2016_Q4 = inner_join(Dictionary, ABB_2016_Q4)

  # Analysis
  Rating = sum(Analyze_2016_Q4$Rating)
}
If I try to test it with
print(File)
it prints the expected file names, but the loop still doesn't work. Also, how can I save the results after each iteration?
I want to have the Rating for each of the quarters displayed.
It looks like you're loading one 'master' file, then loading lots of individual files and trying to join these to the master. If that's the case, I'd take a more functional approach rather than use a for() loop.
Some example data:
master <- data.frame(
  key = letters,
  stringsAsFactors = FALSE
)

a <- data.frame(
  key = sample(letters, 13),
  dat = sample(1:100, 13),
  stringsAsFactors = FALSE
)

a$key
letters_reduced <- letters %in% a$key
letters_reduced <- letters[!letters_reduced]

b <- data.frame(
  key = sample(letters_reduced, 13),
  dat = sample(1:100, 13),
  stringsAsFactors = FALSE
)

readr::write_csv(a, "~/StackOverflow/BenjaminBerger/a.csv")
readr::write_csv(b, "~/StackOverflow/BenjaminBerger/b.csv")
So we have the master object in memory. To load in multiple files in R, assuming they're in the same directory, I'd use list.files() then iterate over the files with lapply() and read_csv():
files <- list.files("StackOverflow/BenjaminBerger", pattern = "*.csv",
full.names = TRUE)
df <- lapply(files, readr::read_csv)
You now have a list of data frames. There are many ways you could join these to your master object, but perhaps the simplest is to 'collapse' the list of data frames into one data frame, and do one join with this. This is as easy as:
df <- dplyr::bind_rows(df)
master <- dplyr::inner_join(master, df, by = "key")
Which gets you:
head(master)
# key dat
# 1 a 38
# 2 b 52
# 3 c 59
# 4 d 77
# 5 e 34
# 6 f 93
Your loop is probably working, but at the moment it's not returning anything. :)
You can, for instance, write your results to a list:
# Initiate result list
allResults <- list()

# Populate your file list; depending on your directory, you can also use list.files()
files <- c("2016 Q1.xml", "2016 Q2.xml", "2016 Q3.xml", "2016 Q4.txt")

# Iterate through your files
for (i in 1:length(files)) {
  # Import text
  ABB_2016_Q4 <- read_delim(paste0(path, Enterprise, "/", files[i]),
                            " ", escape_double = FALSE, col_names = FALSE,
                            trim_ws = TRUE)

  # Reformatting -> first transpose, then vector, lowercase, data frame
  ABB_2016_Q4 = data.frame(tolower(c(t(ABB_2016_Q4))))
  colnames(ABB_2016_Q4) = "Word"

  # Join the text with the dictionary
  Analyze_2016_Q4 = inner_join(Dictionary, ABB_2016_Q4)

  # Analyse & store results & add identifier:
  allResults[[i]] = data.frame(ID = paste0("Q", i),
                               result = sum(Analyze_2016_Q4$Rating),
                               stringsAsFactors = FALSE)
}

# Flatten result list to a data frame:
allResultsDf <- do.call(rbind, allResults)
My goal is to create a table in the database and fill it with data afterwards. This is my code:
library(ROracle)
# ... "con" is the connection string, created in an earlier stage!
# 1 create example
testdata <- data.frame(A = c(1,2,3), B = c(4,5,6))
# 2 create-statement
createTable <- paste0("CREATE TABLE TestTable(", paste(paste(colnames(testdata), c("integer", "integer")), collapse = ","), ")")
# 3 send and execute query
dbGetQuery(con, createTable)
# 4 write example data
dbWriteTable(con, "TestTable", testdata, row.names = TRUE, append = TRUE)
I already succeeded a few times: the table was created and filled.
Now step 4 doesn't work anymore, even though R returns TRUE after executing dbWriteTable. The table is still empty.
I know this is a vague question, but does anyone have an idea what could be wrong here?
I found the solution to my problem: after creating the table in step 3, you have to commit! After that, the data is written into the table.
library(ROracle)
# ... "con" is the connection string, created in an earlier stage!
# 1 create example
testdata <- data.frame(A = c(1,2,3), B = c(4,5,6))
# 2 create-statement
createTable <- paste0("CREATE TABLE TestTable(", paste(paste(colnames(testdata), c("integer", "integer")), collapse = ","), ")")
# 3 send and execute query
dbGetQuery(con, createTable)
# NEW LINE: COMMIT!
dbCommit(con)
# 4 write example data
dbWriteTable(con, "TestTable", testdata, row.names = TRUE, append = TRUE)
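To double-check that the rows actually arrived, a quick (optional) count on the same connection should return the three example rows:
# Verify the insert; 3 rows are expected for the example data.
dbGetQuery(con, "SELECT COUNT(*) FROM TestTable")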
I'm working with 12 large data files, all of which hover between 3 and 5 GB, so I was turning to RSQLite for import and initial selection. Giving a reproducible example in this case is difficult, so if you can come up with anything, that would be great.
If I take a small set of the data, read it in, and write it to a table, I get exactly what I want:
con <- dbConnect("SQLite", dbname = "R2")
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow=100, header=TRUE)
dbWriteTable(con, name = "Chr1test", value = data)
> dbListFields(con, "Chr1test")
[1] "row_names" "CHR_A" "BP_A" "SNP_A" "CHR_B" "BP_B" "SNP_B" "R2"
> dbGetQuery(con, "SELECT * FROM Chr1test LIMIT 2")
row_names CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
1 1 1 1579 SNP-1.578. 1 2097 SNP-1.1096. 0.07223050
2 2 1 1579 SNP-1.578. 1 2553 SNP-1.1552. 0.00763724
If I read all of my data directly into a table, though, my columns aren't separated correctly. I've tried both sep = " " and sep = "\t", but both give the same column separation.
dbWriteTable(con, name = "Chr1", value ="chr1.ld", header = TRUE)
> dbListFields(con, "Chr1")
[1] "CHR_A_________BP_A______________SNP_A__CHR_B_________BP_B______________SNP_B___________R
I can tell that it's clearly some sort of delimiter issue, but I've exhausted my ideas on how to fix it. Has anyone run into this before?
Edit, update:
It seems as though this works:
n <- 1000000
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow = n, header = TRUE)
con_data <- dbConnect("SQLite", dbname = "R2")
while (nrow(data) == n) {
  dbWriteTable(con_data, data, name = "ch1", append = TRUE, header = TRUE)
  data <- read.table(f, nrow = n, header = TRUE)
}
close(f)
if (nrow(data) != 0) {
  dbWriteTable(con_data, data, name = "ch1", append = TRUE)
}
I can't quite figure out why writing the table directly through SQLite is a problem, though. Possibly a memory issue.
I am guessing that your big file is causing a free memory issue (see Memory Usage under docs for read.table). It would have been helpful to show us the first few lines of chr1.ld (on *nix systems you just say "head -n 5 chr1.ld" to get the first five lines).
If it is a memory issue, then you might try sipping the file as a work-around rather than gulping it whole.
Determine or estimate the number of lines in chr1.ld (on *nix systems, say "wc -l chr1.ld").
Let's say your file has 100,000 lines.
sip.size <- 100
for (i in seq(0, 100000, sip.size)) {
  data <- read.table(f, nrow = sip.size, skip = i, header = TRUE)
  dbWriteTable(con, name = "SippyCup", value = data, append = TRUE)
}
You'll probably see warnings at the end but the data should make it through. If you have character data that read.table is trying to factor, this kludge will be unsatisfactory unless there are only a few factors, all of which are guaranteed to occur in every chunk. You may need to tell read.table not to factor those columns or use some other method to look at all possible factors so you can list them for read.table. (On *nix, split out one column and pipe it to uniq.)