summary() of transactions is wrong for itemMatrix object

I am trying to do some market basket analysis using the arules package, but when I use the summary() function on an itemMatrix object to check which are the most frequent items, the numbers do not add up.
If I do:
library(arules)
x <- read.transactions("Supermarket2014-15.csv")
summary(x)
I get:
transactions as itemMatrix in sparse format with
5001 rows (elements/itemsets/transactions) and
997 columns (items) and a density of 0.003557162
most frequent items:
45 28 42 35 22 (Other)
503 462 444 440 413 15474
But if I check with a for loop, or even in Excel, the count for product 45 is 513, not 503. The same goes for 28, which should be 499, and so on.
The odd thing is that if I sum up all the totals (15474+413+440+444+462+503), I get the correct total of transacted products.
The data has several NA values and products are factors.
And here is the raw data (Day ranges from 1 to 28, Product ranges from 1 to 50):

If you look at the result of your str(x) call, you will see under @itemInfo and $labels that some items have labels like "1;1", etc. This means that the items were not correctly separated when the file was read in. The default separator in read.transactions() is white space, but your file seems to contain (some) semicolons. Try sep=";" in read.transactions().
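A minimal sketch of the fix (assuming the file is in the default basket format, just with semicolon-separated items):
library(arules)
# re-read the file, telling read.transactions() that items are separated by ";"
x <- read.transactions("Supermarket2014-15.csv", sep = ";")
summary(x)
# cross-check the per-item counts directly
itemFrequency(x, type = "absolute")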


Calculating row sums in data frame based on column names

I have a data frame with media spending for different media channels:
TV <- c(200,500,700,1000)
Display <- c(30,33,47,55)
Social <- c(20,21,22,23)
Facebook <- c(30,31,32,33)
Print <- c(50,51,52,53)
Newspaper <- c(60,61,62,63)
df_media <- data.frame(TV,Display,Social,Facebook, Print, Newspaper)
My goal is to calculate the row sums of specific columns based on their names.
For example: by definition, Facebook falls into the category Social, so I want to add the Facebook column to the Social column and have just the Social column left. The same goes for Newspaper, which should be added to Print, and so on.
The challenge is that the names and the number of columns that belong to one category change from data set to data set; e.g. the next data set could contain Social, Facebook and Instagram, which should all be summed up into Social.
There is a list of rules defining which media types (column names) belong together, but I have to admit that I'm a bit clueless and can only think of a long set of if statements right now; I hope there is a better solution.
I'm thinking about putting all the names that belong together in vectors and using them to find and sum the relevant columns, but I have no idea how to execute this.
Any help is appreciated.
You could do something along these lines, which allows columns to be absent from a given data set (via intersect and setdiff):
Define a set of rules, i.e. the columns that are going to be united/grouped together.
Create a vector d of the remaining columns.
Compute the rowSums of every subset of the data set defined in the rules.
Append the remaining columns.
cbind the columns of the list using do.call.
#Rules
rules = list(social = c("Social", "Facebook", "Instagram"),
printed = c("Print", "Newspaper"))
d <- setdiff(colnames(df_media), unlist(rules)) #columns that are not going to be united
#data frame
lapply(rules, function(x) rowSums(df_media[, intersect(colnames(df_media), x)])) |>
append(df_media[, d]) |>
do.call(cbind.data.frame, args = _)
social printed TV Display
1 50 110 200 30
2 52 112 500 33
3 54 114 700 47
4 56 116 1000 55
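Note that the underscore placeholder in the last step (args = _) requires R 4.2 or later; on older versions you can assign the appended list to a variable and call do.call(cbind.data.frame, that_list) directly instead of piping into it.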

Replace id numbers in rows based on the match between two columns

I am dealing with data on club membership where each row represents one of the 10 student clubs, and the number of non-empty columns represents the membership "size" of that club. Each non-empty cell of the data frame is filled with a "random number" denoting a student's membership in a club (random numbers were used to suppress their identities).
By default, each club has at least one member, but not all students are registered as club members (some have no involvement in any clubs). The data looks like this (only part of the data is displayed below):
club_id mem1 mem2 mem3 mem4 mem5 mem6 mem7
1 339 520 58
2 700
3 80 434
4 516 811 471
5 20
6 211 80 439 516 305
I want to replace those random numbers with student ids (without revealing their real names) based on the match between the random numbers assigned to them and their student ids; however, only some of the student ids are matched to the random numbers assigned to those students.
I compiled them into a data frame of 2 columns, which is available here and looks like:
match <- read.csv("https://www.dropbox.com/s/nc98i784r91ugin/match.csv?dl=1")
head(match)
id rn
1 1 700
2 2 339
3 3 540
4 4 58
5 5 160
6 6 371
where the column rn holds the random number.
So the tasks I am having trouble with are to
(1) match and replace the random numbers on the dataframe with their corresponding student ids
(2) set those unmatched random number as NA
It will be really appreciated if someone could enlighten me on this.
Not sure if I got the logic right. I replicated only a short version of your initial table and replaced the first number with 1000 (because that is a number that has no matching id).
club2 <- data.frame(club_id = 1:6, mem2 = c(1000, 700, 80, 516, 20, 211))
match <- read.csv("https://www.dropbox.com/s/nc98i784r91ugin/match.csv?dl=1")
Then, for the column mem2, I check whether each value exists in match$rn. If that is not the case, an NA is inserted. If it does exist, the corresponding match$id is inserted - the one at the position where match$rn equals the number in mem2.
club2$mem2 <- ifelse(club2$mem2 %in% match$rn, match$id[match(club2$mem2, match$rn)], NA)
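To apply the same replacement to every membership column at once, a sketch along these lines should work (assuming the full table sits in a data frame called club, a hypothetical name, with membership columns named mem1 through mem7 as in the question; empty cells come out as NA as well):
mem_cols <- grep("^mem", names(club), value = TRUE)
club[mem_cols] <- lapply(club[mem_cols], function(col) {
  # match() returns NA where no random number matches, so unmatched
  # entries become NA without an explicit ifelse()
  match$id[match(col, match$rn)]
})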

Rolling subset of data frame within for loop in R

The big picture is that I am trying to do a sliding window analysis on environmental data in R. I have PAR (photosynthetically active radiation) data for a select number of sequential dates (pre-determined based on other biological factors) for two years (2014 and 2015) with one value of PAR per day. See below the first few lines of the data frame (the data frame is named "rollingpar").
par14 par15
1356.3242 1306.7725
NaN 1232.5637
1349.3519 505.4832
NaN 1350.4282
1344.9306 1344.6508
NaN 1277.9051
989.5620 NaN
I would like to create a loop (or any other way possible) to subset the data frame (both columns!) into two-week windows (14 rows) from start to finish, sliding from one window to the next by a week (7 rows). So the first window would include rows 1 to 14 and the second window would include rows 8 to 21, and so forth. After subsetting, the data needs to be flipped in structure (currently using the melt function in the reshape2 package) so that the values of the PAR data are in one column and the variable of par14 or par15 is in the other column. Then I need to get rid of the NaN data and finally perform a Wilcoxon rank sum test on each window comparing PAR by the variable year (par14 or par15). Below is the code I wrote to prove the concept of what I wanted, and for the first subsetted window it gives me exactly what I want.
library(reshape2)
par.sub=rollingpar[1:14, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
wilcox.test(value~variable, par.sub)
#when melt flips a data frame the columns become value and variable...
#for this case value holds the PAR data and variable holds the year
#information
When I tried to write a for loop to iterate the process through the whole data frame (total rows = 139), I got errors every which way I ran it. Additionally, this loop doesn't even take the sliding-by-one-week aspect into account. I figured if I could just figure out how to get windows and run the analysis via a loop first, then I could try to work out the sliding part. Basically, I realize that what I explained I wanted and what I wrote this for loop to do are slightly different: the code below slides row by row, i.e. on a one-day basis. I would greatly appreciate it if the solution encompassed the sliding-by-a-week aspect. I am fairly new to R and do not have extensive experience with for loops, so I feel like there is probably an easy fix to make this work.
wilcoxvalues=data.frame(p.values=numeric(0))
Upar=rollingpar$par14
for (i in 1:length(Upar)){
par.sub=rollingpar[i:(i+13), ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
save.sub=wilcox.test(value~variable, par.sub)
for (j in 1:length(save.sub)){
wilcoxvalues$p.value[j]=save.sub$p.value
}
}
If anyone has a much better way to do this through a different package or function that I am unaware of, I would love to be enlightened. I did try rollapply but ran into problems finding a way to apply it to an entire data frame and not just one column. I have searched for assistance among the many other questions regarding subsetting, for loops, and rolling analysis, but can't quite seem to find exactly what I need. Any help would be appreciated by a frustrated grad student :) and if I did not provide enough information please let me know.
Consider an lapply() over a sequence of every 7th value through 364 days of the year (the last day is not included, to avoid a single-day final grouping), returning a list of data frames of Wilcoxon test p-values with a week indicator. Then row bind the list items into a final, single data frame:
library(reshape2)
slidingWindow <- seq(1,364,by=7)
slidingWindow
# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120 127
# [20] 134 141 148 155 162 169 176 183 190 197 204 211 218 225 232 239 246 253 260
# [39] 267 274 281 288 295 302 309 316 323 330 337 344 351 358
# LIST OF WILCOX P VALUES DFs FOR EACH SLIDING WINDOW (TWO-WEEK PERIODS)
wilcoxvalues <- lapply(slidingWindow, function(i) {
par.sub=rollingpar[i:(i+13), ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
data.frame(week=paste0("Week: ", i%/%7+1, "-", i%/%7+2),
p.values=wilcox.test(value~variable, par.sub)$p.value)
})
# SINGLE DF OF ALL P-VALUES
wilcoxdf <- do.call(rbind, wilcoxvalues)
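Since the data frame in the question has 139 rows rather than a full year, the window starts can also be derived from the data itself instead of being hard-coded (a sketch, using rollingpar as above):
# 14-row windows sliding by 7; keep only starts that leave a full two-week window
slidingWindow <- seq(1, nrow(rollingpar) - 13, by = 7)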

How does R know that I have no entries of a certain type

I have a table where one of the variables is country of registration.
table(df$reg_country)
returns:
AR BR ES FR IT
123 202 578 642 263
Now, if I subset the original table to exclude one of the countries
df_subset<-subset(df, reg_country!='AR')
table(df_subset$reg_country)
returns:
AR BR ES FR IT
0 202 578 642 263
This second result is very surprising to me, as R seems to somehow magically know that I have removed the entries from AR.
Why does that happen?
Does it affect the size of the second data frame (df_subset)? If 'yes', is there a more efficient way to subset in order to minimize the size?
df$reg_country is a factor variable, which stores all possible levels in its levels attribute. Check levels(df_subset$reg_country).
Factor levels only have a significant impact on data size if you have a huge number of them. I wouldn't expect that to be the case. However, you could use droplevels(df_subset$reg_country) to remove unused levels.
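For example (a sketch, assuming df as in the question):
df_subset <- subset(df, reg_country != 'AR')
# drop the now-unused 'AR' level so it no longer shows up in tables
df_subset$reg_country <- droplevels(df_subset$reg_country)
table(df_subset$reg_country)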

Looping within a loop in R

I'm trying to build quite a complex loop in R.
I have a set of data stored in an object called p_int (p_int is peak intensity).
For this example, the structure of p_int, i.e. str(p_int), is:
num [1:1599]
The size of p_int can vary i.e. [1:688], [1:1200] etc.
What I'm trying to do with p_int is to construct a complex loop to extract the monoisotopic peaks; these are peaks with certain characteristics, which will be extracted into a second object, mono_iso:
Search the first eight results in p_int. Of these eight, find the one with the greatest score (this score also needs to be above 50).
Once this result has been found, record it into mono_iso.
The loop will then fix on the position where this result is located within the large dataset. From this position it will then skip the next result along the dataset before doing the same for the next set of 8 results.
So something similar to this:
16 Results: 100 120 90 66 220 90 70 30 70 100 54 85 310 200 33 41
** So, to begin with, the loop would take the first 8 results:
100 120 90 66 220 90 70 30
**It would then decide which peak is the greatest:
220
**It would determine whether 220 was greater than 50
IF YES: It would record 220 into "mono_iso"
IF NO: It would move on to the next set of 8 results
**220 is greater than 50... so it is recorded into mono_iso
The loop would then place its position at 220; it would then skip the "90" and begin the same thing again for the next set of 8 results, beginning at the next data result in line, in this case at the 70:
70 30 70 100 54 85 310 200
It would then record the "310" value (highest value) and do the same thing again etc etc until the end of the set of data.
Hope this makes sense. If anyone could possibly help me make such a loop work in R, I'd very much appreciate it.
Use this:
mono_iso <- aggregate(p_int,
                      by = list(group = ((seq_along(p_int) - 1) %/% 8) + 1),
                      FUN = function(x) ifelse(max(x) > 50, max(x), NA))$x
This will put NA for groups such that max(...)<=50. If you want to filter those out, use this:
mono_iso <- mono_iso[!is.na(mono_iso)]
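Note that aggregate() works on fixed, non-overlapping blocks of 8 and does not skip the value after each recorded peak. If that skipping behavior matters, a plain while loop sketch of the logic described above (assuming p_int as in the question) could look like this:
mono_iso <- numeric(0)
i <- 1
while (i + 7 <= length(p_int)) {
  window <- p_int[i:(i + 7)]       # next set of 8 results
  peak <- max(window)
  if (peak > 50) {
    mono_iso <- c(mono_iso, peak)  # record the peak
    # jump to the peak's position, then skip the next result
    i <- i + which.max(window) + 1
  } else {
    i <- i + 8                     # move on to the next set of 8
  }
}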
