I have a matrix that I coerced from a realRatingMatrix in the recommenderlab package in R. The data contains predicted ratings between 0 and 1 for a number of products.
The matrix should contain customer numbers along the rows (row 2 down) so that column 1 header is row label, and product IDs along the columns in the first row from column 2 onwards. The problem I have is when I coerce to a matrix the data structure becomes messy:
EDIT: Link to Github repository www.github.com/APBuchanan/recommenderlab-model
str(wsratings)
num [1:43, 1:319] 0.192 0.44 0.262 0.161 0.239 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:319] "X011211" "X014227" "X014229" "X014235" ...
The first cell wsratings[1,1] should be labelled as "CustomerNumber" and the remainder of the columns in row 1 should contain the data that is currently held in the above $:chr, but should display as separate variables in the matrix.
From the code below you will see that I've been trying to go about this by inserting the data into two vectors, that I can then call in the dimnames function, but I'm getting something wrong:
setwd("location to pull in data")
#look at using XLConnect package to link straight to excel workbook
library(recommenderlab)
library(xlsx)
library(tidyr)
library(Matrix)
#library(stringer)
data=read.csv("WS1 & WS2 V3.csv",header=TRUE,row.names=1)
#remove rows where number of purchases is <10
df=data[rowSums(data[-1])>=10,]
df<-as.matrix(df)
data.matrix=as(df,"binaryRatingMatrix")
#image(data.matrix)
model=Recommender(data.matrix,method="UBCF")
predictions<-predict(model,data.matrix,n=5)
set.seed(100)
evaluation<-evaluationScheme(data.matrix,method="split",train=0.5,given=5)
Rec.ubcf <- Recommender(getData(evaluation, "train"), "UBCF")
predict.ubcf<-predict(Rec.ubcf,getData(evaluation,"known"),type="topNList")
pred.ubcfratings<-predict(Rec.ubcf,getData(evaluation,"known"),type="ratings")
error.ubcf<-calcPredictionAccuracy(predict.ubcf,getData(evaluation,"unknown"),given=5)
setwd("Location to output data from model")
wsratings<-as(pred.ubcfratings,"matrix")
ratingrows<-c(evaluation@runsTrain)
Where I've called colnames2<-c(wsratings[1,2:ncol(wsratings)]), I am expecting the data from column 2 to the last column in row 1 to be read into the vector. But when I print the results, it includes rating information as well, which is not what I'm after.
ratingrows<-c(evaluation@runsTrain) contains the customer numbers that I want to insert below the row label "CustomerNumber".
I'm guessing there's a way of sorting this out with tidyr package, but not so familiar with it. If anyone can provide some advice on how I can clean this all up, I'd be very grateful.
So with the data you gave, I whipped up a solution here.
You said "I need to extract the customer numbers from the test split of data and drop that into the first column of the matrix - that's my main issue". The way to extract that is either: colnames(wsratings) or dimnames(wsratings)[[2]].
Once you have this vector (length of 320), you want to "drop that into the first column". You're asking for a cbind(), but the matrix you want to bind it to has only 43 rows. You can't bind them together because the two lengths are neither equal nor multiples of each other.
Assuming you have the full dataset and their length matches, then the code would be:
customerid <-c("CustomerName", evaluation@runsTrain[[1]])
wsratings <- cbind(customerid, wsratings)
This is what I gathered you want.
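As a separate, minimal sketch of the labelling step (not the code from the question): a numeric matrix can keep its ratings numeric if the customer numbers go into a data frame column rather than being bound in with cbind(), which would coerce every value to character. The toy ratings and the customer_ids vector below are made up for illustration; in the real data the IDs would have to come from the evaluation scheme and match the number of rows.

```r
# Toy stand-in for wsratings: 3 customers x 2 products (values are made up)
ratings <- matrix(c(0.19, 0.44, 0.26, 0.16, 0.24, 0.31), nrow = 3,
                  dimnames = list(NULL, c("X011211", "X014227")))

# Hypothetical customer numbers, one per row of the matrix
customer_ids <- c("C001", "C002", "C003")

# The label vector must have exactly one entry per row
stopifnot(length(customer_ids) == nrow(ratings))

# Put the IDs in a first column called CustomerNumber; ratings stay numeric
ratings_df <- data.frame(CustomerNumber = customer_ids, ratings,
                         check.names = FALSE)

# Alternatively, keep the matrix and store the IDs as row names
rownames(ratings) <- customer_ids
```

Either form can then be written out with write.csv(); the data frame version gives a header row that starts with CustomerNumber followed by the product IDs.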
I've been trying to solve the following problem which I am sure is an easy one (I am just not able to find a solution). I am using the package vegan and want to perform a cca that shows the actual row names as labels (instead of the default "sit1", "sit2", ...).
I created a dataframe (ls_Treat1) with cast(), showing plot treatments (AB, DB, DL etc.) as row names and species occurrences. The dataframe looks as follows:
     species 1  species 2  species 3
AB   0          3          1
DB   1          6          0
DL   3          4          2
I created the data frame with the following code to set the treatments (AB, DB, DL, ...) as row names:
ls_Treat1 <- cast(fungi_ls, Treatment ~ species)
row.names(ls_Treat1)<- ls_Treat1$Treatment
ls_Treat1 <- ls_Treat1[,-1]
When I perform a cca with the following code:
ca <- cca(ls_Treat1)
plot(ca,display="sites")
R puts the default labels "sit1", "sit2", ... into the plot, instead of the actual row names, even though I have performed it this way before and the plots normally showed the right labels. Does this have anything to do with my creating the data frame? I tried to change the treatments (characters) into numbers (integers or factors) but still, the plot won't be labelled with my row names.
Can anyone help me with this?
Thank you very very much!!
The problem is that reshape::cast() does not produce a plain data.frame but something else. It claims to be a data.frame, but it is not. We do matrix algebra in cca and therefore convert the input to a matrix, which works for a standard data.frame but not for the object you supplied. In particular, when you remove the first column with ls_Treat1 <- ls_Treat1[,-1], you also remove the attributes that allow the names to be preserved; it would have worked without removing this column (if the reshape package was still loaded). Upgrading to the reshape2 package and using reshape2::acast() may be a solution.
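For illustration only, here is a base R sketch of building a plain data.frame with the treatments as row names, without going through reshape at all. The long-format layout of fungi_long (columns Treatment, species, count) is an assumption, since the structure of fungi_ls is not shown; the counts are the ones from the table in the question.

```r
# Assumed long format: one row per treatment/species combination
fungi_long <- data.frame(
  Treatment = rep(c("AB", "DB", "DL"), each = 3),
  species   = rep(c("sp1", "sp2", "sp3"), times = 3),
  count     = c(0, 3, 1,  1, 6, 0,  3, 4, 2)
)

# Cross-tabulate the counts: treatments become rows, species become columns
counts <- xtabs(count ~ Treatment + species, data = fungi_long)

# Convert the table to an ordinary data.frame that keeps the row names
ls_wide <- as.data.frame.matrix(counts)

# ls_wide could then be passed to vegan::cca(); not run here
```

Because ls_wide is a standard data.frame with row names AB, DB and DL, converting it to a matrix keeps those names.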
I have a data frame "data". I searched for a pattern using the grep function, and I would like to put the result back into the data frame so that it lines up with the other rows.
data$CleanDim<-data$RAW_MATERIAL_DIMENSION[grep("^BAC",data$RAW_MATERIAL_DIMENSION)]
I would like to put the result into a new column data$CleanDim, but I get the following error. Can someone please help me?
Error in `$<-.data.frame`(`*tmp*`, CleanDim, value = c(1393L, 1405L, 734L, : replacement has 2035 rows, data has 1881
grep() returns a vector of indices of entries that match the given criteria.
The only way that your code could work here is if the number of rows of data equals a whole multiple of the number of matches grep() finds.
Consider the following reproducible example:
data = data.frame(RAW_MATERIAL_DIMENSION = c("BAC","bBAC","aBAC","BACK","lbd"))
> data
RAW_MATERIAL_DIMENSION
1 BAC
2 bBAC
3 aBAC
4 BACK
5 lbd
> grep("^BAC",data$RAW_MATERIAL_DIMENSION)
[1] 1 4
data$CleanDim <- data$RAW_MATERIAL_DIMENSION[grep("^BAC",data$RAW_MATERIAL_DIMENSION)]
Error in `$<-.data.frame`(`*tmp*`, CleanDim, value = 1:2) :
replacement has 2 rows, data has 5
Note: this would work out ok (though it would be pretty weird) if the original data object just had its first four rows. In that case, you'd just get repeated values populated in your new column.
But, what you want to do here is to look at the results of grep("^BAC",data$RAW_MATERIAL_DIMENSION) and think about what is going to be sensible in your context. Your operation will only work if the length of this result equals that of your data object, or at least if your data object is a whole multiple of that length.
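One possible way to get a column of the right length, sketched under the assumption that non-matching rows should simply be NA, is to use grepl(), which returns one TRUE/FALSE per row instead of a shorter vector of indices:

```r
# Same example data, kept as character strings
data <- data.frame(RAW_MATERIAL_DIMENSION = c("BAC", "bBAC", "aBAC", "BACK", "lbd"),
                   stringsAsFactors = FALSE)

# grepl() gives a logical vector with one entry per row
matches <- grepl("^BAC", data$RAW_MATERIAL_DIMENSION)

# Keep the value where it matches, NA elsewhere, so the lengths agree
data$CleanDim <- ifelse(matches, data$RAW_MATERIAL_DIMENSION, NA_character_)
```

Whether NA is the right filler depends on what CleanDim is meant to hold; the point is only that the replacement has as many entries as data has rows.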
As below, dataframe factorizedss is the factorized version of a sourcedata dataframe ss.
ss <- data.frame(c('a','b','a'), c(1,2,1)); #There are string columns and number columns.
#So, I factorized them as below.
factorizedss <- data.frame(lapply(ss, as.factor)); #factorized version
indices <- data.frame(c(1,1,2,2), c(1,1,1,2)); #Now, given integer indices
With given indices, using factorizedss, is it possible to get corresponding element of the source dataframe as below? (The purpose is to access data frame element by integer number in factor level )
a 1
a 1
b 1
b 2
You can access the first column like this
factorizedss[indices[,1],][,1]
and the second in a similar way
factorizedss[indices[,2],][,2]
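It gets more difficult when trying to combine them; you might have to convert them back to native types (going through as.character first, because as.numeric on a factor returns the integer codes rather than the original values)
t(rbind(as.character(factorizedss[indices[,1],][,1]),as.numeric(as.character(factorizedss[indices[,2],][,2]))))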
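If the integers are meant as positions in each factor's levels (rather than row numbers), a lookup through levels() may be closer to the stated purpose. This is a sketch under that assumption; the column names letter and number are added here for readability and are not in the question.

```r
# Source data with a string column and a number column (names are illustrative)
ss <- data.frame(letter = c("a", "b", "a"), number = c(1, 2, 1))
factorizedss <- data.frame(lapply(ss, as.factor))
indices <- data.frame(c(1, 1, 2, 2), c(1, 1, 1, 2))

# For each column, pick the level labels at the given integer positions
lookup <- mapply(function(f, i) levels(f)[i], factorizedss, indices,
                 SIMPLIFY = FALSE)

# Level labels are character; convert the numeric column back explicitly
result <- data.frame(lookup, stringsAsFactors = FALSE)
result$number <- as.numeric(result$number)
```

In this small example the row-based indexing and the level-based lookup give the same values, but that is a coincidence of the data; they differ as soon as the row order and the level order diverge.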
I'm learning R from scratch right now and am trying to count the number of NA's within a given table, aggregated by the ID of the file it came from. I then want to output that information in a new data frame, showing just the ID and the sum of the NA lines contained within. I've looked at some similar questions, but they all seem to deal with very short datasets, whereas mine is comparably long (10k + lines) so I can't call out each individual line to aggregate.
Ideally, if I start with a data table called "Data" with a total of four columns, and one column called "ID", I would like to output a data frame that is simply:
[ID] [NA_Count]
1 500
2 352
3 100
Thanks in advance...
Something like the following should work, although I am assuming that Date is always there and Field 1 and Field 2 are numeric:
# placeholders: fill in the directory and the two field names
filePath <- "<filePath>"
fieldNames <- c("<field1Name>", "<field2Name>")
# get file names and initialize a vector for the counts
fileNames <- list.files(filePath)
missRowsVec <- integer(length(fileNames))
# loop through files, counting the rows with missing values in each
for (filePos in seq_along(fileNames)) {
  # read in the file
  temp <- read.csv(file.path(filePath, fileNames[filePos]), as.is = TRUE)
  # count the rows with a missing value in either field
  # (the second argument, 1, tells apply() to work row by row)
  missRowsVec[filePos] <- sum(apply(temp[, fieldNames], 1, anyNA))
} # end loop
# build data frame
myDataFrame <- data.frame(fileNames = fileNames, missCount = missRowsVec)
This may be a bit dense, but it should work more or less. Try small portions of it, like just some inner function, to see how stuff works.
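If everything is already in a single table with an ID column, as the question describes, the count can also be done without a loop. The following is a minimal sketch with made-up data; the column names Date, Field1 and Field2 are assumptions, and "NA lines" is read as rows containing at least one NA.

```r
# Made-up stand-in for the table called Data in the question
Data <- data.frame(
  ID     = c(1, 1, 2, 2, 3),
  Date   = c("d1", "d2", "d3", "d4", "d5"),
  Field1 = c(NA, 1, 2, NA, 3),
  Field2 = c(NA, 5, NA, 6, 7)
)

# TRUE for every row that contains at least one NA
has_na <- !complete.cases(Data)

# Sum the TRUE values within each ID
na_counts <- aggregate(has_na, by = list(ID = Data$ID), FUN = sum)
names(na_counts)[2] <- "NA_Count"
```

This works the same way for 10k+ rows, because nothing refers to individual lines.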
I am a relative newcomer to R. I have searched for the last two workdays trying to figure this out and failed. I have a list of factors generated by a function. I have 9 items in the list of different lengths.
>summary(list_dataframes)
Length Class Mode
[1,] 1757 factor numeric
[2,] 1776 factor numeric
[3,] 1737 factor numeric
[4,] 1766 factor numeric
[5,] 1783 factor numeric
[6,] 1751 factor numeric
[7,] 1744 factor numeric
[8,] 1749 factor numeric
[9,] 1757 factor numeric
Part of a sample of the data as it comes out:
list_dataframes
[[1]]
[1] 1776234_at 1779003_at 1776344_at 1777664_at 1772541_at 1774525_at
[[2]]
[1] 1771703_at 1776299_at 1772744_at 1780116_at 1775451_at 1778821_at
[7] 1774342_at
[[3]]
[1] 1780116_at 1776262_at 1775451_at 1780200_at 1775704_at
I am not sure why it says the Mode is "numeric". The individual entries are a mix of numbers and letters like "S35_at".
I would like to make this into a table of nine columns and 1783 rows without making duplicate values. (Hence I tried using do.call and it didn't work. I ended up with a mess full of duplicates.) The shorter ones can have NAs in the empty spaces or be blank.
I need to be able to eventually end up with something I can put into a spread sheet.
There has to be a way to do this. Thank you!
I guess I should add that it initially came out as data frames when I had four columns of data. But I only need one column, and when I changed the function that creates this list so that it returns only the one column I actually need, the result seems to no longer be a data frame.
dput(head(list_dataframes))
list(structure(c(3605L, 5065L, 3663L, 4349L, 1655L, 2700L, 5692L, plus many more
.Label = c("1769308_at",
"1769311_at", "1769312_at", "1769313_at", "1769314_at", "1769317_at", plus many more
this pattern is repeated nine more times
What I am trying to do is produce a table that would look like this:
a= xyz,tuv,efg,hij,def
b= xyz,tuv,efg
c= tuv,efg,hij,def
What I want to make is a table that is
a b c
xyz xyz tuv
tuv tuv efg
efg efg hij
hij NA def
def NA NA
NA could be blank as well.
After much reading of the manual section on lists, I determined that I had generated a buried list of lists. It had nine items, with the data I wanted buried two layers down, i.e. to see it I had to use [[1]]. It was further complicated by the fact that R turns a single-column data frame into a factor instead of leaving it as a data frame. To fix it (sort of), I added one step in my equation that changes that factor back into a data frame.
After that, when I used lapply to generate my result, at least the factor issue was resolved. I could then use the following steps to pull the data frames out.
first <- list_dataframes[[1]]
second <- list_dataframes[[2]]
third <- list_dataframes[[3]]
fourth <- list_dataframes[[4]]
fifth <- list_dataframes[[5]]
sixth <- list_dataframes[[6]]
seventh <- list_dataframes[[7]]
eighth <- list_dataframes[[8]]
ninth <- list_dataframes[[9]]
# cbindX() comes from the gdata package
all_results <- cbindX(first,second,third,fourth,fifth,sixth,seventh, eighth,ninth)
I could then write the csv file using write.csv and get the correct result I was after. So I guess I have my answer. I mean, it does work now.
However I still think I am missing something in making this work optimally even though it is now giving me the correct result I was after.
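One way this might be done more compactly in base R, sketched here on the small a/b/c example rather than the real data, is to convert each factor to character, pad the shorter vectors with NA up to the longest length, and build the data frame in one go:

```r
# Small stand-in for the list of factors of different lengths
list_factors <- list(
  a = factor(c("xyz", "tuv", "efg", "hij", "def")),
  b = factor(c("xyz", "tuv", "efg")),
  c = factor(c("tuv", "efg", "hij", "def"))
)

# Work with the labels, not the underlying integer codes
as_chars <- lapply(list_factors, as.character)

# Extending a vector's length pads it with NA
max_len <- max(lengths(as_chars))
padded <- lapply(as_chars, function(x) {
  length(x) <- max_len
  x
})

# One column per list element, all of equal length
all_results <- data.frame(padded, stringsAsFactors = FALSE)

# write.csv(all_results, "all_results.csv", row.names = FALSE, na = "")  # not run
```

The commented write.csv() call, with na = "", would leave the padded cells blank in the spreadsheet; the file name is only an example.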
The factor class variables are vectors of integer mode with an attached attribute that is a character vector specifying the labels to be used in displaying the integer values. I would think the safest way to bind these together would be to convert the factor columns to character class and then to merge with all=TRUE. Why not post a simple example with three dataframes or factors... I cannot actually discern the structure for sure from summary-output ... of length 10, 9 and 8 that has whatever level of complexity is in your data?
If you want to make them all factors with a common set of levels, then use this:
shared_levels <- unique( c( unlist( lapply(list_dataframes, levels) ) ) )
length(shared_levels)
new_list <- lapply(list_dataframes, factor, levels=shared_levels)
As stated in the comment, I still do not understand what sort of table you imagine being produced. Need a concrete example.