I am a relative newcomer to R. I have searched for the last two workdays trying to figure this out and failed. I have a list of factors generated by a function. I have 9 items in the list of different lengths.
>summary(list_dataframes)
Length Class Mode
[1,] 1757 factor numeric
[2,] 1776 factor numeric
[3,] 1737 factor numeric
[4,] 1766 factor numeric
[5,] 1783 factor numeric
[6,] 1751 factor numeric
[7,] 1744 factor numeric
[8,] 1749 factor numeric
[9,] 1757 factor numeric
Part of a sample of the data as it comes out:
list_dataframes
[[1]]
[1] 1776234_at 1779003_at 1776344_at 1777664_at 1772541_at 1774525_at
[[2]]
[1] 1771703_at 1776299_at 1772744_at 1780116_at 1775451_at 1778821_at
[7] 1774342_at
[[3]]
[1] 1780116_at 1776262_at 1775451_at 1780200_at 1775704_at
I am not sure why it says the Mode is "numeric". The individual entries are a mix of numbers and letter like "S35_at".
I would like to make this into a table of nine columns and 1783 rows without making duplicate values. (Hence I tried using do.call and it didn't work. I ended up with a mess full of duplicates.) The shorter ones can have NAs in the empty spaces or be blank.
I need to be able to eventually end up with something I can put into a spread sheet.
There has to be a way to do this. Thank you!
I guess I should add it initially was coming out as data frames when I had four columns of data but I only need one column of the data and when I subsetted the function that creates this list to create only the one column I actually needed it seems to no longer be a dataframe.
dput(head(list_dataframes))
list(structure(c(3605L, 5065L, 3663L, 4349L, 1655L, 2700L, 5692L, plus many more
.Label = c("1769308_at",
"1769311_at", "1769312_at", "1769313_at", "1769314_at", "1769317_at", plus many more
this pattern is repeated nine more times
What I am trying to do is produce a table that would look like this:
a= xyz,tuv,efg,hij,def
b= xyz,tuv,efg
c= tuv,efg,hij,def
What I want to make is a table that is
a b c
xyz xyz tuv
tuv tuv efg
efg efg hij
hij NA NA
NA NA NA
NA could be blank as well.
After much reading the manual section on lists I determined that I had generated a buried list of lists. It had nine items with the data I wanted buried two layers down i.e to see it I had to use [[1]]. Also because of something in R that results in a single column data frame becoming a factor instead of staying a data frame it was further complicated. To fix it (sort of) I added one step in my equation so that I changed that factor into a data frame.
After that, when I used lapply to generate my result, at least the factor issue was resolved. I could then use the following steps to pull the data frames out.
first <- list_dataframes[[1]]
second <- list_dataframes[[2]]
third <- list_dataframes[[3]]
fourth <- list_dataframes[[4]]
fifth <- list_dataframes[[5]]
sixth <- list_dataframes[[6]]
seventh <- list_dataframes[[7]]
eighth <- list_dataframes[[8]]
nineth <- list_dataframes[[9]]
all_results <- cbindX(first,second,third,fourth,fifth,sixth,seventh, eighth,nineth)
I could then write the csv file using write.csv and get the correct result I was after. SO I guess I have my answer. I mean it does work now.
However I still think I am missing something in making this work optimally even though it is now giving me the correct result I was after.
The factor class variables are vectors of integer mode with an attached attribute that is a character vector specifying the labels to be used in displaying the integer values. I would think the safest way to bind these together would be to convert the factor columns to character class and then to merge with all=TRUE. Why not post a simple example with three dataframes or factors... I cannot actually discern the structure for sure from summary-output ... of length 10, 9 and 8 that has whatever level of complexity is in your data?
If you want to make them all factors with a common set of levels, then use this:
shared_levels <- unique( c( unlist( lapply(list_dataframes) ) ) )
length(shared_levels)
new_list <- lapply(list_dataframes, factor, levels=shared_levels)
As stated in the comment, I still do not understand what sort of table you imagine being produced. Need a concrete example.
Related
I have a matrix I have coerced from a realRatingMatrix in recommenderlab package in R. The data contains predictions of ratings between 0-1 for a number of products.
The matrix should contain customer numbers along the rows (row 2 down) so that column 1 header is row label, and product IDs along the columns in the first row from column 2 onwards. The problem I have is when I coerce to a matrix the data structure becomes messy:
EDIT: Link to Github repository www.github.com/APBuchanan/recommenderlab-model
str(wsratings)
num [1:43, 1:319] 0.192 0.44 0.262 0.161 0.239 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:319] "X011211" "X014227" "X014229" "X014235" ...
The first cell wsratings[1,1] should be labelled as "CustomerNumber" and the remainder of the columns in row 1 should contain the data that is currently held in the above $:chr, but should display as separate variables in the matrix.
From the code below you will see that I've been trying to go about this by inserting the data into two vectors, that I can then call in the dimnames function, but I'm getting something wrong:
setwd("location to pull in data")
#look at using XLConnect package to link straight to excel workbook
library(recommenderlab)
library(xlsx)
library(tidyr)
library(Matrix)
#library(stringer)
data=read.csv("WS1 & WS2 V3.csv",header=TRUE,row.names=1)
#remove rows where number of purchases is <10
df=data[rowSums(data[-1])>=10,]
df<-as.matrix(df)
data.matrix=as(df,"binaryRatingMatrix")
#image(data.matrix)
model=Recommender(data.matrix,method="UBCF")
predictions<-predict(model,data.matrix,n=5)
set.seed(100)
evaluation<-evaluationScheme(data.matrix,method="split",train=0.5,given=5)
Rec.ubcf <- Recommender(getData(evaluation, "train"), "UBCF")
predict.ubcf<-predict(Rec.ubcf,getData(evaluation,"known"),type="topNList")
pred.ubcfratings<-predict(Rec.ubcf,getData(evaluation,"known"),type="ratings")
error.ubcf<-calcPredictionAccuracy(predict.ubcf,getData(evaluation,"unknown"),given=5)
setwd("Location to output data from model")
wsratings<-as(pred.ubcfratings,"matrix")
ratingrows<-c(evaluation#runsTrain)
where I've called colnames2<-c(wsratings[1,2:ncol(wsratings)]) I am expecting the the data from column 2 to the last column, in row 1 to be read into the vector. But when I print the results, it includes rating information as well which is not what I'm after.
ratingrows<-c(evaluation#runsTrain) contains the customer numbers that I want to insert below the row label "CustomerNumber".
I'm guessing there's a way of sorting this out with tidyr package, but not so familiar with it. If anyone can provide some advice on how I can clean this all up, I'd be very grateful.
So with the data you gave, I whipped up a solution here.
You said "I need to extract the customer numbers from the test split of data and drop that into the first column of the matrix - that's my main issue". The way to extract that is either: colnames(wsratings) or dimnames(wsratings)[[2]].
Once you have this vector (length of 320), you want to "drop that to the first column". You're asking for a cbind(), but the length of the data you want to bind it contains 43 row. You can't bind them together because the length of the two elements are not the same or multiples of each other.
Assuming you have the full dataset and their length matches, then the code would be:
customerid <-c("CustomerName", evaluation#runsTrain[[1]])
wsratings <- cbind(customerid, wsratings)
This is what I gathered you want, and it yields me the following:
I work with survey data, where missing values are the rule rather than the exception. My datasets always have lots of NAs, and for simple statistics I usually want to work with cases that are complete on the subset of variables required for that specific operation, and ignore the other cases.
Most of R's base functions return NA if there are any NAs in the input. Additionally, subsets using comparison operators will return a row of NAs for any row with an NA on one of the variables. I literally never want either of these behaviors.
I would like for R to default to excluding rows with NAs for the variables it's operating on, and returning results for the remaining rows (see example below).
Here are the workarounds I currently know about:
Specify na.rm=T: Not too bad, but not all functions support it.
Add !is.na() to all comparison operations: Works, but it's annoying and error-prone to do this by hand, especially when there are multiple variables involved.
Use complete.cases(): Not helpful because I don't want to exclude cases that are missing any variable, just the variables being used in the current operation.
Create a new data frame with the desired cases: Often each row is missing a few scattered variables. That means that every time I wanted to switch from working with one variable to another, I'd have to explicitly create a new subset.
Use imputation: Not always appropriate, especially when computing descriptives or just examining the data.
I know how to get the desired results for any given case, but dealing with NAs explicitly for every piece of code I write takes up a lot of time. Hopefully there's some simple solution that I'm missing. But complex or partial solutions would also be welcome.
Example:
> z<-data.frame(x=c(413,612,96,8,NA), y=c(314,69,400,NA,8888))
# current behavior:
> z[z$x < z$y ,]
x y
3 96 400
NA NA NA
NA.1 NA NA
# Desired behavior:
> z[z$x < z$y ,]
x y
3 96 400
# What I currently have to do in order to get the desired output:
> z[(z$x < z$y) & !is.na(z$x) & !is.na(z$y) ,]
x y
3 96 400
One trick for dealing with NAs in inequalities when subsetting is to do
z[which(z$x < z$y),]
# x y
# 3 96 400
The which() silently drops NA values.
Factors can help preventing some kinds of programming errors in R: You cannot perform equality check for factors that use different levels, and you are warned when performing greater/less than checks for unordered factors.
a <- factor(letters[1:3])
b <- factor(letters[1:3], levels=letters[4:1])
a == b
## Error in Ops.factor(a, b) : level sets of factors are different
a < a
## [1] NA NA NA
## Warning message:
## In Ops.factor(a, a) : < not meaningful for factors
However, contrary to my expectation, this check is not performed when merging data frames:
ad <- data.frame(x=a, a=as.numeric(a))
bd <- data.frame(x=b, b=as.numeric(b))
merge(ad, bd)
## x a b
## 1 a 1 4
## 2 b 2 3
## 3 c 3 2
Those factors simply seem to be coerced to characters.
Is a "safe merge" available somewhere that would do the check? Do you see specific reasons for not doing this check by default?
Example (real-life use case): Assume two spatial data sets with very similar but not identical subdivision in, say, communes. The data sets refer to slightly different points in time, and some of the communes have merged during that time span. Each data set has a "commune ID" column, perhaps even named identically. While the semantics of this column are very similar, I wouldn't want to (accidentally) merge the data sets over this commune ID column. Instead, I construct a matching table between "old" and "new" commune IDs. If the commune IDs are encoded as factors, a "safe merge" would give a correctness check for the merge operation at no extra (implementation) cost and very little computational cost.
The "safe guard" with merge is the by= parameter. You can set exactly which columns you think should match. If you match up two factor columns, R will use the the labels for those values to match them up. So "a" will match with "a" regardless of how the hidden inner working of factor have coded those values. That's what a user sees, so that's how it will be merged. It's just like with numeric values, you can choose to merge on columns that have complete different ranges (the first column has 1:10, the second has 100:1000). When the by value is set, R will do what it's asked. And if you don't explicitly set the by parameter, then R will find all shared column names in the two data.frames and use that.
And many times when merging, you don't always expect matches. Sometimes you're using all.x or all.y to specifically get unmatched records. In this case, depending on how the different data.frames were created, one may not know about the levels it doesn't have. So it's not at all unreasonable to to try to merge them.
So basically R is treating factors like characters during merging, be cause it assumes that you already know that two columns belong together.
Well, with much credit (and apologies to) MrFlick:
> attributes(ad$x)
$levels
[1] "a" "b" "c"
$class
[1] "factor"
> attributes(ad$a)
NULL
> attributes(ad$b)
NULL
> adfoo<-merge(ad,bd)
> attributes(adfoo$x)
$levels
[1] "a" "b" "c"
$class
[1] "factor"
So in fact the merged column $x is a factor, although only levels common to both ad and bd are merged. The other columns were coerced via as.numeric long ago.
So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
# ensure character data
LOG_MESSAGE <- as.character(LOG_MESSAGE)
CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
rep(replace(values,
nchar(values)==0,
values[nchar(values) != 0]),
lengths))
})
# LOG_MESSAGE CURRENT_EVENT
# 1 FIRST_EVENT FIRST_EVENT
# 2 FIRST_EVENT
# 3 SECOND_EVENT SECOND_EVENT
# 4 SECOND_EVENT
# 5 SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <-
transform(dat,
Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
"_"),
`[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code,
na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the last one carried forward part).
The result of 1. is then converted to a character string
strplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first elements of these vectors,
So I use sapply() to run the subsetting function '['() and extract the 1st element from each list component.
The whole thing is wrapped in transform() so i) I don;t need to refer to dat$ and so I can add the result as a new variable directly into the data dat.
The dataset named data has both categorical and continuous variables. I would like to the delete categorical variables.
I tried:
data.1 <- data[,colnames(data)[[3L]]!=0]
No error is printed, but categorical variables stay in data.1. Where are problems ?
The summary of "head(data)" is
id 1,2,3,4,...
age 45,32,54,23,...
status 0,1,0,0,...
...
(more variables like as I wrote above)
All variables are defined as "Factor".
What are you trying to do with that code? First of all, colnames(data) is not a list so using [[]] doesn't make sense. Second, The only thing you test is whether the third column name is not equal to zero. As a column name can never start with a number, that's pretty much always true. So your code translates to :
data1 <- data[,TRUE]
Not what you intend to do.
I suppose you know the meaning of binomial. One way of doing that is defining your own function is.binomial() like this :
is.binomial <- function(x,na.action=c('na.omit','na.fail','na.pass'){
FUN <- match.fun(match.arg(na.action))
length(unique(FUN(x)))==2
}
in case you want to take care of NA's. This you can then apply to your dataframe :
data.1 <- data[!sapply(data,is.binomial)]
This way you drop all binomial columns, i.e. columns with only two distinct values.
#Shimpei Morimoto,
I think you need a different approach.
Are the categorical variables defines in the dataframe as factors?
If so you can use:
data.1 <- data[,!apply(data,2,is.factor)]
The test you perform now is if the colname number 3L is not 0.
I think this is not the case.
Another approach is
data.1 <- data[,-3L]
works only if 3L is a number and the only column with categorical variables
I think you're getting there, with your last comment to #Mischa Vreeburg. It might make sense (as you suggest) to reformat your original data file, but you should also be able to solve the problem within R. I can't quite replicate the undefined columns error you got.
Construct some data that look as much like your data as possible:
X <- read.csv(textConnection(
"id,age,pre.treat,status
1,'27', 0,0
2,'35', 1,0
3,'22', 0,1
4,'24', 1,2
5,'55', 1,3
, ,yes(vs)no,"),
quote="\"'")
Take a look:
str(X)
'data.frame': 6 obs. of 4 variables:
$ id : int 1 2 3 4 5 NA
$ age : int 27 35 22 24 55 NA
$ pre.treat: Factor w/ 3 levels " 0"," 1","yes(vs)no": 1 2 1 2 2 3
$ status : int 0 0 1 2 3 NA
Define #Joris Mey's function:
is.binomial <- function(x,na.action=c('na.omit','na.fail','na.pass')) {
FUN <- match.fun(match.arg(na.action))
length(unique(FUN(x)))==2
}
Try it out: you'll see that it does not detect pre.treat as binomial, and keeps all the variables.
sapply(X,is.binomial)
X1 <- X[!sapply(X,is.binomial)]
names(X1)
## keeps everything
We can drop the last row and try again:
X2 <- X[-nrow(X),]
sapply(X2,is.binomial)
It is true in general that R does not expect "extraneous" information such as level IDs to be in the same column as the data themselves. On the one hand, you can do even better in the R world by simply leaving the data as their original, meaningful values ("no", "yes", or "healthy", "sick" rather than 0, 1); on the other hand the data take up slightly more space if stored as a text file, and, more important, it becomes harder to incorporate other meta-data such as units in the file along with the data ...