Converting Table Data to Transaction Data in R

For a current project I am trying to find a way to convert large amounts of table data (300,000+ obs. of 19 variables) into transaction data for arules. A large number of the variables are formatted logically.
I've tried the following from library(arules): newdata <- read.transactions("olddata.csv", format = "basket", rm.duplicates = FALSE, skip = 1)
However I get the following error:
Error in asMethod(object) :
can not coerce list with transactions with duplicated items
I don't want to remove duplicates as I lose so much of my data because it removes every duplicate logical T/F after the first occurrence.
I figured I could try and accomplish my task using a for loop:
newdata <- ""
for (row in 1:nrow(olddata)) {
if (row !=1) {
newdata <- paste0(newdata, "\n")}
newdata <- paste0(newdata, row,",")
for (col in 2:ncol(olddata)) {
if (col !=2) {
newdata <- paste0(newdata, ",")}
newdata <- paste0(newdata, colnames(olddata),"=", olddata[row,col])}
}
write(newdata,"newdata.csv")
My goal was to have the value of each variable for each observation look as follows: columnnameA=TRUE, columnnameB=FALSE, etc. This would eliminate "duplicates" for the read.transactions function and retain all of the data.
However my output starts looking like this:
[1] "1,Recipient=Thu Feb 04 21:52:00 UTC 2016,Recipient=TRUE,Recipient=TRUE,Recipient=FALSE,Recipient=TRUE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE\n2,Recipient=Thu Feb 04 21:52:00 UTC 2016,Recipient=TRUE,Recipient=TRUE,Recipient=FALSE,Recipient=TRUE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE,Recipient=FALSE\n3
Just a note that Recipient is my first variable name in my olddata object. After it does every observation as Recipient=X it changes to the next variable name and repeats. I end up with a file that has over 5 million observations...oops! This is my first real stab at nested for loops. Not sure if this is the best approach or if there is a better one.
Thanks in advance for any thoughts or insights you might have.
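One possible direction, sketched with a made-up three-row stand-in for olddata (the column names Opened and Clicked are hypothetical). The repeated Recipient= labels appear to come from pasting colnames(olddata), which is the whole vector of names, where colnames(olddata)[col] would pick a single name. The nested loop can also be dropped, because paste0 is vectorised over rows. This is a base R sketch under those assumptions, not a definitive solution:

```r
# Toy stand-in for olddata; Opened and Clicked are hypothetical column names
olddata <- data.frame(
  Recipient = c("a@x.org", "b@x.org", "c@x.org"),
  Opened    = c(TRUE, FALSE, TRUE),
  Clicked   = c(FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Columns to turn into items (everything after the first column)
item_cols <- colnames(olddata)[-1]

# One character vector of "name=value" strings per column, vectorised over rows
pairs <- lapply(item_cols, function(nm) paste0(nm, "=", olddata[[nm]]))

# Glue the columns together row by row, comma separated
basket_lines <- do.call(paste, c(pairs, sep = ","))

# Write one transaction per line, ready for read.transactions(format = "basket")
out_file <- tempfile(fileext = ".csv")
writeLines(basket_lines, out_file)
```

Because every item now carries its column name, no item repeats within a line, so the "duplicated items" error should not be triggered. If the columns are converted to factors first, arules can, as far as I know, also coerce the data frame directly with as(olddata, "transactions"); check that against the arules documentation for your version.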

Related

How to convert EMG data of type 'list' to vector or data frame when unlist and as.data.frame don't work?

I have a really long list of EMG data that I need to convert to a vector or data frame before using the biosignal EMG package in R. It doesn't work with lists. The EMG data is .csv and is in the form shown in the picture.
I tried using the as.data.frame function, but it still gave me a list.
I also tried unlisting it, but it gave me an integer instead.
There are 2 columns and 647 rows.
I need to plot the data in the 2nd column and starting from row 8 till row 647.
How do I do this?
Below is the code I used:
library(biosignalEMG) # ReadCSV
MyEMGdata151517 <- read.csv(file="C:\\Users\\zyous\\OneDrive\\Desktop\\AsciiTraceDump_190211_151517.csv", header=TRUE, sep=",")
MyEMGdata151543<-read.csv(file="C:\\Users\\zyous\\OneDrive\\Desktop\\AsciiTraceDump_190211_151743.csv", header=TRUE, sep=",")
Rectified_EMG_151517<-rectification(MyEMGdata151517,rtype = "fullwave")
M<-as.data.frame.array(MyEMGdata151543)
is.data.frame(M)
#Rectifying 151517
Rectified_EMG_151517 <- rectification(MyEMGdata151517, rtype = "fullwave")
Rectified_plot_151517<-plot(MyEMGdata151517, main = "Rectified EMG")
When I try to rectify, I get this error: Error in rectification(MyEMGdata151517, rtype = "fullwave") : an object of class 'emg' is required.
I think that error occurs because my data is not a vector. But how do I convert it when unlist won't work? I want to see peaks like the kind you would get by plotting this in Excel.
It appears that you are not converting the variable created with read.csv() into an emg object. The documentation gives the following example:
x <- rnorm(10000, 0, 1)
emg1 <- emg(x, samplingrate=1000, units="mV", data.name="")
summary(emg1)
EMG Object
Total number of samples: 10000
Number of channels: 1
Duration (seconds): 10
Samplingrate (Hertz): 1000
Channel information:
Units: mV
plot(emg1, main="Simulated EMG")
Which yields a plot of the simulated EMG signal.
Based on the attached image in your question I think you should do something similar with emg2 <- emg(MyEMGdata151517$X8, samplingrate=1000, units="mV", data.name="")
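For the part about plotting column 2 from row 8 to row 647, a rough base R sketch follows. It assumes (I can't see your file) that the first seven rows are header text, so the column was read as character. The raw data frame here is simulated, not your data, and its column names are made up. Full-wave rectification is the absolute value of the signal, so abs() stands in for rectification():

```r
set.seed(1)

# Simulated stand-in for the CSV: 7 rows of header text, then 640 samples
raw <- data.frame(
  label = paste0("row", 1:647),
  value = c(rep("header", 7), as.character(rnorm(640))),
  stringsAsFactors = FALSE
)

# Keep rows 8 to 647 of the second column and convert to a plain numeric vector
signal <- as.numeric(raw[8:647, 2])

# Full-wave rectification: flip the negative half of the signal
rectified <- abs(signal)
```

plot(rectified, type = "l") would then draw the peaks, and a numeric vector like signal is the kind of input that emg() in the example above takes.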

How to Make a New Column in a Data Set with Values Corresponding to a Separate Data Set

I have two different csv files, one is called CA_Storms and one is called CA_adj. CA_Storms has many start and end dates/times for storm events (in one column), and CA_adj has a DateTime column that includes many thousand dates/times. I want to see if any of the dates/times in CA_adj correspond with any of the storm events in CA_Storms. To do this, I am trying to make a new column in CA_adj titled Storm_ID that will identify which storm it corresponds with based on the storm start and end times/dates in CA_Storms.
This is the process I have currently undergone:
#Make a value to which the csv files are attached
CA_Storms <- read.csv(file = "CA_Storms.csv", header = TRUE, stringsAsFactors = FALSE)
CA_adj <- read.csv(file = "CA_adj.csv", header = TRUE, stringsAsFactors = FALSE)
#strptime function (do this for both data sets)
CA_adj$DateTime1 <- strptime(CA_adj$DateTime, format = "%m/%d/%Y %H:%M")
CA_Storms$Start.time1 <- strptime(CA_Storms$Start.time, format = "%m/%d/%Y %H:%M")
CA_Storms$End.time1 <- strptime(CA_Storms$End.time, format = "%m/%d/%Y %H:%M")
#Make a new column into CA_adj that says Storm ID. Have it by
#default hold NAs.
CA_adj$Storm_ID <- NA
#Write a which statement to see if it meets the conditions of greater than
#or equal to start time or less than or equal to end time. Put this through a
#for loop to apply it to every row within CA_adj$DateTime1
for (i in nrow(CA_adj$DateTime1))
{
CA_adj$DateTime1[which(CA_adj$DateTime1 >= CA_Storms$Start.time1 | CA_adj$DateTime1 <= CA_Storms$End.time1), "Storm_ID"]
}
This is not giving me any errors, but it's also not replacing any of the values in the Storm_ID column that I have made. In my Global Environment under "Values" it now just says: i is NULL(empty). I am pretty sure what's missing is an i within the for loop, but I do not know where to put it. I also think the other issue is that it doesn't know what value to replace the NA's in the Storm_ID column with. I would like it to replace the NA's with the correct Storm ID that corresponds with the Storm dates (in CA_Storms$Start.time1 and in CA_Storms$End.Time1). For Dates/Times within CA_adj that do not apply to a storm date, I'd just want it to continue to say NA.
Any guidance on how to do this would be greatly appreciated. I'm new to R, and I've been trying to teach it to myself, which can make figuring out how to do these things on my own a bit difficult.
Thanks so much!
Why not have a look at the lubridate package? It lets you create date/time intervals, which can then be tested against a specific date/time with %within%. Your code should be simpler.
You do need to use the loop index, and you also need to make an assignment to CA_adj$Storm_ID. I'm not certain whether you could also have multiple CA_adj entries in one CA_Storms interval.
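# make a lubridate interval in CA_Storms
# make CA_adj$DateTime a lubridate date-time
# or stick with the longer code...
# loop through all CA_adj
for (i in seq_len(nrow(CA_adj))) {
  # storms whose interval contains this date/time
  hit <- which(CA_adj$DateTime[i] %within% CA_Storms$interval)
  # keep the first match; rows with no match stay NA
  if (length(hit) > 0) CA_adj$Storm_ID[i] <- CA_Storms$StormID[hit[1]]
}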
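If you would rather stay in base R, the same idea can be sketched by looping over the storms instead of over the rows of CA_adj. The data below is invented, and the column names StormID, Start and End are assumptions. Two details differ from the code in the question: the two conditions are combined with & (a time must be after the start and before the end), and as.POSIXct is used because the POSIXlt values returned by strptime are awkward to store in a data frame.

```r
# Invented storm table; StormID, Start and End are assumed column names
CA_Storms <- data.frame(
  StormID = c("S1", "S2"),
  Start   = as.POSIXct(c("2019-01-01 00:00", "2019-01-05 12:00"), tz = "UTC"),
  End     = as.POSIXct(c("2019-01-02 06:00", "2019-01-06 00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Invented observations
CA_adj <- data.frame(
  DateTime = as.POSIXct(c("2019-01-01 03:00", "2019-01-03 00:00",
                          "2019-01-05 18:00"), tz = "UTC")
)

# Default to NA, then fill in one storm at a time
CA_adj$Storm_ID <- NA_character_
for (s in seq_len(nrow(CA_Storms))) {
  inside <- CA_adj$DateTime >= CA_Storms$Start[s] &
            CA_adj$DateTime <= CA_Storms$End[s]
  CA_adj$Storm_ID[inside] <- CA_Storms$StormID[s]
}
```

Looping over storms keeps each comparison vectorised across all rows of CA_adj, which matters when CA_adj has many thousand rows. If storms overlap, later storms overwrite earlier ones in this sketch.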

How to pass an R function argument to subset a column

First I am new here, this is my first post so my apologies in advance if I am not doing everything correct. I did take the time to search around first but couldn't find what I am looking for.
Second, I am pretty sure I am breaking a rule in that this question is related to a 'coursera.org' R programming course I am taking (this was part of an assignment) but the due date has lapsed and I have failed for now, I will repeat the subject next month and try again but I am kind of now in damage control trying to find out what went wrong.
Basically below is my code:
What I am trying to do is read in data from a series of files. These files are four columns wide with the titles: Date, nitrate, sulfate and id and contain various rows of data.
The function I am trying to write should take the arguments of the directory of the files, the pollutant (so either nitrate or sulfate), and the set of numbered files, e.g. files 1 and 2, files 1 through to 4 etc. The return of the function should be the average value of the selected pollutant across the selected files.
I would call the function using a call like this
pollutantmean("datafolder", "nitrate", 1:3)
and the return should just be a number which is the average in this case of nitrate across data files 1 through to 3
OK, I hope I have provided enough information. Other stuff that may be useful is:
Operating system :Ubuntu
Language: R
Error message received:
Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
As I say, the data files are a series of files located in a folder and are four columns wide and vary as to the number of rows.
My function code is as follows:
pollutantmean <- function(directory, pollutant, id = 1:5) { #content of the function
#create a list of files, a vector I think
files_list <- dir(directory, full.names = TRUE)
# Now create an empty data frame
dat <- data.frame()
# Next step is to execute a loop to read all the selected data files into the dataframe
for (i in 1:5) {
dat <- rbind(dat, read.csv(files_list[i]))
}
#subsets the rows matching the selected monitor numbers
dat_subset <- dat[dat[, "ID"] == id, ]
#identify the median of the pollutant and ignore the NA values
median(dat_subset$pollutant, na.rm = TRUE)
}
OK, that is it. Through trial and error I am pretty sure the final line of code, median(dat_subset$pollutant, na.rm = TRUE), is the problem. I pass a pollutant argument to the function, which should be either sulfate or nitrate, but the dat_subset$pollutant part does not seem to work: somehow the passed pollutant argument is not being used in the function body. Ideally dat_subset$pollutant would be equivalent to either dat_subset$nitrate or dat_subset$sulfate, depending on the argument passed to the function.
You cannot subset with the $ operator if the column name is stored in an object, as in your example (where it is stored in pollutant). Subset with [] or [[]] instead; in your case that would be:
median(dat_subset[,pollutant], na.rm = TRUE)
or
median(dat_subset[[pollutant]], na.rm = TRUE)
Does that work?
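A small self-contained illustration of the difference, using toy data and a hypothetical helper named column_median (not the assignment's pollutantmean). It also swaps == for %in% when matching monitor IDs, which is my own suggestion rather than something from the question:

```r
# Toy data; the real files have Date, nitrate, sulfate and ID columns
dat <- data.frame(
  nitrate = c(1, NA, 3, 10),
  sulfate = c(2, 4, NA, 20),
  ID      = c(1, 1, 2, 3)
)

# Hypothetical helper: median of a column whose name arrives as a string
column_median <- function(dat, pollutant, id) {
  # %in% matches every requested monitor; == would recycle id against the rows
  dat_subset <- dat[dat$ID %in% id, ]
  # [[ ]] evaluates pollutant and looks up the column with that name
  median(dat_subset[[pollutant]], na.rm = TRUE)
}

# $ looks for a column literally named "pollutant", finds none, and gives NULL
literal_lookup <- dat$pollutant

result <- column_median(dat, "nitrate", 1:2)
```

That NULL from the $ lookup is what is.na() was complaining about in the warning message.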

Association analysis with duplicate transactions using arules package in R

I want to create a transaction object in basket format which I can call anytime for my analyses. The data contains comma separated items with 1001 transactions. The first 10 transactions look like this:
hering,corned_b,olives,ham,turkey,bourbon,ice_crea
baguette,soda,hering,cracker,heineken,olives,corned_b
avocado,cracker,artichok,heineken,ham,turkey,sardines
olives,bourbon,coke,turkey,ice_crea,ham,peppers
hering,corned_b,apples,olives,steak,avocado,turkey
sardines,heineken,chicken,coke,ice_crea,peppers,ham
olives,bourbon,coke,turkey,ice_crea,heineken,apples
corned_b,peppers,bourbon,cracker,chicken,ice_crea,baguette
soda,olives,bourbon,cracker,heineken,peppers,baguette
corned_b,peppers,bourbon,cracker,chicken,bordeaux,hering
...
I observed that there are duplicated transactions in the data and removed them but each time I tried to read the transactions, I get:
Error in asMethod(object) :
can not coerce list with transactions with duplicated items
Here is my code:
data <- read.csv("AssociationsItemList.txt",header=F)
data <- data[!duplicated(data),]
pop <- NULL
for(i in 1:length(data)){
pop <- paste(pop, data[i],sep="\n")
}
write(pop, file = "Trans", sep = ",")
transdata <- read.transactions("Trans", format = "basket", sep=",")
I'm sure there's something little yet important I've missed. Kindly offer your assistance.
The problem is not with duplicated transactions (the same row appearing twice)
but duplicated items (the same item appearing twice, in the same transaction --
e.g., "olives" on line 4).
read.transactions has an rm.duplicates argument to remove those duplicates.
read.transactions("Trans", format = "basket", sep=",", rm.duplicates=TRUE)
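To check which rows are affected before reading the file with arules, something like the following base R sketch may help. The baskets here are invented, and in practice they would come from splitting each line of the file on commas:

```r
# Invented baskets; the first one repeats "olives"
baskets <- list(
  c("olives", "ham", "olives"),
  c("soda", "coke")
)

# anyDuplicated() returns 0 when a vector has no repeated element
has_dupes <- vapply(baskets, anyDuplicated, integer(1)) > 0

# Drop repeated items within each basket, keeping first occurrences
clean <- lapply(baskets, unique)
```

which(has_dupes) gives the offending transaction numbers, and clean is a list of item vectors with the repeats removed, which is roughly what rm.duplicates=TRUE does for you.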
Vincent Zoonekynd is right: the problem is caused by duplicated items in a transaction. Here I can explain why arules requires transactions without duplicated items.
The transaction data is stored internally as an ngCMatrix object. Relevant source code:
setClass("itemMatrix",
representation(
data = "ngCMatrix",
...
setClass("transactions",
contains = "itemMatrix",
...
ngCMatrix is a sparse matrix class defined in the Matrix package. Its description from the official documentation:
The nsparseMatrix class is a virtual class of sparse “pattern” matrices, i.e., binary matrices conceptually with TRUE/FALSE entries. Only the positions of the elements that are TRUE are stored
It seems ngCMatrix stores the status of an element as a binary indicator, which means the transactions object in arules can only record whether an item exists in a transaction and cannot record its quantity. So...
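A tiny base R illustration of that point, with an invented basket and item list (this does not use arules or Matrix, it only mimics the presence/absence idea):

```r
# Invented basket in which "olives" was bought twice
basket <- c("olives", "ham", "olives")
items  <- c("ham", "olives", "soda")

# Quantities per item: this is the information a pattern matrix cannot hold
counts <- table(factor(basket, levels = items))

# Presence/absence per item: all that a TRUE/FALSE row can record
present <- items %in% basket
```

counts shows olives twice, while present only says TRUE for olives, so the second occurrence has nowhere to go.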
I just used the 'unique' function to remove duplicates. My data was a little different since I had a dataframe (data was too large for a CSV) and I had 2 columns: product_id and transaction_id. I know it's not your specific question, but I had to do this to create the transaction dataset and apply association rules.
data # > 1 Million Transactions
data <- unique(data[ , 1:2 ] )
trans <- as(split(data[,"product_id"], data[,"trans_id"]),"transactions")
rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.2))
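The unique-then-split step can be tried on its own in base R. The data frame below is invented, and I have used transaction_id as the column name throughout, which is an assumption about the real data:

```r
# Invented long-format data: product p1 appears twice in transaction 1
data <- data.frame(
  product_id     = c("p1", "p2", "p1", "p3", "p3"),
  transaction_id = c(1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Remove repeated (product, transaction) pairs
data <- unique(data[, c("product_id", "transaction_id")])

# One character vector of products per transaction
baskets <- split(data$product_id, data$transaction_id)
```

baskets is then a named list with no repeated items inside any element, which is the shape that as(..., "transactions") is given in the code above.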

How to allocate/append a large column of Date objects to a data-frame

I have a data-frame (3 cols, 12146637 rows) called tr.sql which occupies 184Mb.
(it's backed by SQL, it is the contents of my dataset which I read in via read.csv.sql)
Column 2 is tr.sql$visit_date. SQL does not allow natively representing dates as an R Date object, this is important for how I need to process the data.
Hence I want to copy the contents of tr.sql to a new data-frame tr
(where the visit_date column can be natively represented as Date (chron::Date?). Trust me, this makes exploratory data analysis easier, for now this is how I want to do it - I might use native SQL eventually but please don't quibble with that for now.)
Here is my solution (thanks to gsk and everyone) + workaround:
tr <- data.frame(customer_id=integer(N), visit_date=integer(N), visit_spend=numeric(N))
# fix up col2's class to be Date
class(tr[,2]) <- 'Date'
Then, as a workaround, copy tr.sql -> tr in chunks of (say) N/8 using a for-loop, so that the temporary involved in the str->Date conversion does not run out of memory, with a garbage-collect after each chunk:
for (i in 0:7) {
from <- floor(i*N/8)
to <- floor((i+1)*N/8) -1
if (i==7)
to <- N
print(c("Copying tr.sql$visit_date",from,to," ..."))
tr$visit_date[from:to] <- as.Date(tr.sql$visit_date[from:to])
gc()
}
rm(tr.sql)
memsize_gc() ... # only 321 Mb in the end! (was ~1Gb during copying)
The problem is allocating then copying the visit_date column.
Here is the dataset and code, I am having multiple separate problems with this, explanation below:
'training.csv' looks like...
customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52
and code:
# Read in as SQL (for memory-efficiency)...
library(sqldf)
tr.sql <- read.csv.sql('training.csv')
gc()
memory.size()
# Count of how many rows we are about to declare
N <- nrow(tr.sql)
# Declare a new empty data-frame with same columns as the source d.f.
# Attempt to declare N Date objects (fails due to bad qualified name for Date)
# ... does this allocate N objects the same as data.frame(colname = numeric(N)) ?
tr <- data.frame(visit_date = Date(N))
tr <- tr.sql[0,]
# Attempt to assign the column - fails
tr$visit_date <- as.Date(tr.sql$visit_date)
# Attempt to append (fails)
> tr$visit_date <- append(tr$visit_date, as.Date(tr.sql$visit_date))
Error in `$<-.data.frame`(`*tmp*`, "visit_date", value = c("14700", "14705", :
replacement has 12146637 rows, data has 0
The second line that tries to declare data.frame(visit_date = Date(N)) fails, I don't know the correct qualified name with namespace for Date object (tried chron::Date , Dates::Date? don't work)
Both the attempt to assign and append fail. Not even sure whether it is legal, or efficient, to use append on a single large column of a data-frame.
Remember these objects are big, so avoid using temporaries.
Thanks in advance...
Try this, ensuring that you are using the most recent version of sqldf (currently version 0.4-1.2).
(If you find you are running out of memory, try putting the database on disk by adding the dbname = tempfile() argument to the read.csv.sql call. If even that fails, then it's so large in relation to available memory that it's unlikely you are going to be able to do much analysis with it anyway.)
# create test data file
Lines <-
"customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52"
cat(Lines, file = "trainingtest.csv")
# read it back
library(sqldf)
DF <- read.csv.sql("trainingtest.csv", method = c("integer", "Date2", "numeric"))
It doesn't look to me like you've got a data.frame there (N is a vector of length 1). Should be simple:
tr <- tr.sql
tr$visit_date <- as.Date(tr.sql$visit_date)
Or even more efficient:
tr <- data.frame(colOne = tr.sql[,1], visit_date = as.Date(tr.sql$visit_date), colThree = tr.sql[,3])
As a side note, your title says "append" but I don't think that's the operation you want. You're making the data.frame wider, not appending them on to the end (making it longer). Conceptually, this is a cbind() operation.
Try this:
tr <- data.frame(visit_date= as.Date(tr.sql$visit_date, origin="1970-01-01") )
This will succeed if your format is YYYY-MM-DD or YYYY/MM/DD. If not one of those formats then post more details. It will also succeed if tr.sql$visit_date is a numeric vector equal to the number of days after the origin. E.g:
vdfrm <- data.frame(a = as.Date(c(1470, 1475, 1480), origin="1970-01-01") )
vdfrm
a
1 1974-01-10
2 1974-01-15
3 1974-01-20
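On the allocation question itself, a Date column can be preallocated in base R without chron, and then filled in chunks in the spirit of the workaround in the question. A minimal sketch on five made-up dates, with an arbitrary chunk size of 2 chosen only for illustration:

```r
# Made-up character dates standing in for tr.sql$visit_date
visit_date_chr <- c("2010-04-01", "2010-04-06", "2010-04-07",
                    "2010-04-09", "2010-04-12")
N <- length(visit_date_chr)

# Preallocate N missing Dates (base R class "Date", no extra package needed)
visit_date <- rep(as.Date(NA), N)

# Convert chunk by chunk so each temporary stays small
chunk_size <- 2
for (from in seq(1, N, by = chunk_size)) {
  to <- min(from + chunk_size - 1, N)
  visit_date[from:to] <- as.Date(visit_date_chr[from:to])
}
```

Starting the chunks at 1 and capping the last one with min() avoids the special case for the final chunk. Whether chunking actually helps with memory will depend on the R version and data size, so treat this as something to measure rather than a guarantee.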
