Transforming the latest MovieLens dataset (ratings) into a matrix including NAs - r

I'm trying to transform the latest MovieLens Dataset (small) into a matrix. Basically it's a 'list' with three columns: userID, movieID and ratings. I want the users to be the rows, the items to be the columns and the ratings should be the content of the matrix.
I already searched on Stack Overflow, and the closest approach I found was this one: Transforming Dataset into value matrix
That approach worked well, but the sparseMatrix function leaves no NAs in the matrix. A sparse matrix is of course a good way to save memory, but I need the NAs in the matrix, since I use them to compute things like the number of items two users have both rated.
The dataset looks like this:
|userId|movieId|rating|
|1     |1      |3.5   |
|1     |3      |2.5   |
|1     |5      |3.0   |
|1     |412    |2.5   |
|2     |13     |4.5   |
|3     |412    |5     |
and so on.
Now I want to transform this dataset into a matrix so that it looks like this:
1 | 2 | 3 | 4 | 5 | ...
1| 3,5 |NA |2,5| NA|3,0| ...
_____________________
2| NA |NA |NA | NA| NA| ...
_____________________
3| NA |NA |NA | NA|5,0| ...
______________________
I hope this visualization helps a little in understanding my problem. I am sorry if it does not look like a typical question on Stack Overflow, but I am very new here.
It would be awesome if one of you would have a solution for my problem! Many thanks in advance!
Kind regards
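A minimal sketch of one way to get such a matrix in base R, assuming the ratings sit in a data frame with the three columns shown above. The names ratings and rating_matrix are made up for illustration. The idea is to pre-fill a matrix with NA and then write the known ratings by (row, column) index pairs, so every unrated cell stays NA. Movies that nobody rated get no column in this sketch.

```r
# Toy data copied from the question; `ratings` is a hypothetical variable name
ratings <- data.frame(
  userId  = c(1, 1, 1, 1, 2, 3),
  movieId = c(1, 3, 5, 412, 13, 412),
  rating  = c(3.5, 2.5, 3.0, 2.5, 4.5, 5.0)
)

users  <- sort(unique(ratings$userId))
movies <- sort(unique(ratings$movieId))

# Start from an all-NA matrix so that missing ratings stay NA
rating_matrix <- matrix(
  NA_real_,
  nrow = length(users), ncol = length(movies),
  dimnames = list(as.character(users), as.character(movies))
)

# Index with a two-column (row, column) matrix to fill in the known ratings
idx <- cbind(match(ratings$userId, users), match(ratings$movieId, movies))
rating_matrix[idx] <- ratings$rating

rating_matrix
```

With the NAs in place, the number of items two users have both rated can be counted directly, e.g. sum(!is.na(rating_matrix["1", ]) & !is.na(rating_matrix["3", ])) gives 1 for the toy data (movie 412). Keep in mind that a dense matrix needs far more memory than the sparse representation.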

Related

Correlation between 3 independent samples

I performed an experiment in which participants from 3 countries filled in a questionnaire. I wish to know whether the results from one country can be predictive of the results from another country. I don't think regression is a good choice here: each observation comes from one country only, so I can't define an independent and a dependent variable that both have values for the same observation.
I thought maybe of correlation? Any other ideas?
The dataset looks like this:
| France |US |Israel|
| -0.3 |NA |NA |
|NA |-0.5|NA |
|NA |NA |0.7 |
I would appreciate any help to code this in R.

R: Filtering rows by “contains” text within row names

I am working with the mtcars dataset in R. My assignment is to print the rows corresponding to Honda and Toyota cars using knitr::kable(). I was directed to this for help: https://r4ds.had.co.nz/r-markdown.html#table
However that section only talks about printing the first 5 rows of the dataset, not about filtering it at all. I clicked to read more about the function but it was all foreign to me.
The best I can tell is that the make and model of the cars are the row names. So I need to filter the results to print only rows whose names contain “Honda” or “Toyota” and I need to do it using knitr::kable().
I have tried creating subsets, but unsure how to do so using row names. Also wouldn’t know how to search whether the row names contain the text “Honda” or “Toyota”.
This is my first day working with R and my only coding experience before today was some C# two+ years ago. This is just very frustrating to me because I could do this in Excel in less than 30 seconds. But R is like a foreign language and I don’t feel like the section of my textbook I was referred to explained the problem - especially for a brand new coder (and this class isn’t supposed to require any experience) . Appreciative of any help I can get!
Load needed library
First of all, load the tidyverse package, because the following code uses some helpful functions from it. Packages like rmarkdown or knitr may need to be installed as well.
library(tidyverse)
1. Filter rows by rownames in the index column
mtcars is a variable of type data frame and one of the built-in data sets in R. The Motor Trend Car Road Tests data set contains 11 variables for 32 cars, extracted from the 1974 Motor Trend US magazine (see table at the end of this post).
TL;DR summary
The rows of the data frame are filtered with filter and grepl, which tests each name in the row-name (index) column against the regular expression Honda|Toyota and keeps only the matches.
filtered_cars <- mtcars %>%
  filter(grepl("Honda|Toyota", rownames(mtcars)))
In-Depth Explanation:
The car data is piped into the function dplyr::filter() to extract the subset of rows that fulfill all the given conditions. The condition is given as a data-mask expression in the function argument. For this expression we use grepl and the regular expression Honda|Toyota, which returns TRUE for every row name containing either make. Since the names of the cars are not in a regular data column, they cannot be accessed with something like mtcars$gear. Therefore rownames(mtcars) must be used to get a vector of the names in the index column. The result of this piped expression is assigned to the filtered_cars variable.
Just by looking at the row count of the resulting data frame it is noticeable that only a few rows remain:
mtcars %>% count() # 32
filtered_cars %>% count() # 3
2. Print data frame as markdown formatted table
knitr::kable(
  filtered_cars,
  caption = "A table in a markdown document (subset of mtcars data set)"
)
Output:
|               |  mpg| cyl|  disp| hp| drat|    wt|  qsec| vs| am| gear| carb|
|:--------------|----:|---:|-----:|--:|----:|-----:|-----:|--:|--:|----:|----:|
|Honda Civic    | 30.4|   4|  75.7| 52| 4.93| 1.615| 18.52|  1|  1|    4|    2|
|Toyota Corolla | 33.9|   4|  71.1| 65| 4.22| 1.835| 19.90|  1|  1|    4|    1|
|Toyota Corona  | 21.5|   4| 120.1| 97| 3.70| 2.465| 20.01|  1|  0|    3|    1|
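For comparison, a sketch of keeping only the Honda and Toyota rows without the tidyverse, using plain base R subsetting. The variable names is_match and honda_toyota are made up for illustration.

```r
# Logical vector: TRUE where the row name contains "Honda" or "Toyota"
is_match <- grepl("Honda|Toyota", rownames(mtcars))

# Keep only the matching rows; all columns and the row names are retained
honda_toyota <- mtcars[is_match, ]

rownames(honda_toyota)
```

Passing honda_toyota to knitr::kable() should then print just those rows as a table.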

Using R to group values within different vectors so they can be plotted (ggplot2)

I have a question about how to group different vectors from a dataframe, in order to compare and analyse them. For example using ggplot2 to plot some graphs. To make it clearer I will provide the type of dataframe I am working with.
ID Date |X |Y |Z | BR
---------------------------------
6001-102| 2016-03| 1| 1| 1| 1.0
--------------------------------
6001-102| 2016-03| 1| 1| 1| 1.0
--------------------------------
6001-102| 2016-03| 1| 1| 1| 1.0
--------------------------------
6044-460| 2016-03| 2| 1| 4| 0.5
---------------------------------
The data columns I am focused on here are Date, Z and BR.
The dates are characters containing the month and years, for example 2016-03 and 2015-05, whilst Z is numeric and ranges from 1-8. I am finding this complicated for myself, because what I want R to do is to first group the results by the date (for example looking at only May 2015) and then get the average BR for each level of Z. Z represents different time groups, so if I was using ggplot I would see the average BR for each time group in May.
Can anyone show me a good example or maybe a previous question that is trying to accomplish the same as me? Hopefully with ggplot2? I haven't found one, but I am sorry if this is a duplicate question.
Thank you for your help!
Edit: Removed dput as question answered.
You can use mydf %>% group_by(Date_fill, y) %>% summarise(z = sum(z, na.rm=TRUE))
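Applied to the columns named in the question, this could look roughly like the sketch below. It assumes dplyr and ggplot2 are installed and that the data frame is called mydf with columns Date, Z and BR. The rows are made up for illustration, and mean() is used because the question asks for an average.

```r
library(dplyr)
library(ggplot2)

# Made-up rows in the shape of the question's data (Date, Z and BR only)
mydf <- data.frame(
  Date = c("2015-05", "2015-05", "2015-05", "2016-03", "2016-03"),
  Z    = c(1, 1, 4, 1, 4),
  BR   = c(1.0, 0.5, 0.5, 1.0, 0.5)
)

# Average BR for every combination of month and time group
br_summary <- mydf %>%
  group_by(Date, Z) %>%
  summarise(mean_BR = mean(BR, na.rm = TRUE), .groups = "drop")

# Plot a single month: one bar per time group Z
may_plot <- br_summary %>%
  filter(Date == "2015-05") %>%
  ggplot(aes(x = factor(Z), y = mean_BR)) +
  geom_col() +
  labs(x = "Z (time group)", y = "Mean BR", title = "2015-05")

may_plot
```

To compare all months at once instead of filtering, adding facet_wrap(~ Date) to a plot of the full br_summary is one option.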

How to Merge two column of ASPxGridView in DevExpress?

I want to merge two columns of the grid view and show them as a single column.
Example:
I have two different columns
--------------------
|Amount | Currency |
--------------------
| 1000 | INR |
--------------------
| 2000 | EUR |
--------------------
| 500 | USD |
--------------------
Result as one column
-----------
|Amount |
-----------
| 1000INR |
-----------
| 2000EUR |
-----------
| 500USD |
-----------
The two columns are separate fields from the database. I don't want to do it in a stored procedure; it needs to be done in the front end because I want to show a total for this amount column.
It sounds like you're looking for calculated columns.
Have a look at the following links:
devex unbound columns
devex calculated columns

using ggpairs with NA-containing data

ggpairs in the GGally package seems pretty useful, but it appears to fail when an NA is present anywhere in the data set:
#require(GGally)
data(tips, package="reshape")
pm <- ggpairs(tips[,1:3]) #works just fine
#introduce NA
tips[1,1] <- NA
ggpairs(tips[,1:3])
> Error in if (lims[1] > lims[2]) { : missing value where TRUE/FALSE needed
I don't see any documentation for dealing with NA values, and solutions like ggpairs(tips[,1:3], na.rm=TRUE) (unsurprisingly) don't change the error message.
I have a data set in which perhaps 10% of values are NA, randomly scattered throughout the dataset. Therefore na.omit(myDataSet) will remove much of the data. Is there any way around this?
Some GGally functions, such as ggparcoord(), handle NAs through a missing parameter (exclude, mean, median, min10 or random). Unfortunately, ggpairs() has no such parameter.
What you can do is replace the NAs yourself with a reasonable estimate, which is what you were expecting ggpairs() to do automatically for you. Common choices are row means, zeros, medians or the closest point.
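As a rough illustration of the replacement idea, here is a base R sketch that fills each NA with the median of its own column. The data frame df and the result name imputed are made up for this example, and the median is only one of the choices mentioned. Any such fill puts invented values into the plot.

```r
# Toy data frame with scattered NAs; `df` is a hypothetical name
df <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(10, 20, NA, 40),
  c = c(NA, 2, 2, 5)
)

# Replace each NA with the median of the non-missing values in its column
imputed <- as.data.frame(lapply(df, function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
}))

imputed
```

The NA-free imputed data frame can then be passed to ggpairs() in place of the original.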
I see that this is an old post. Recently I encountered the same problem but still could not find a solution on the Internet. So I provide my workaround below FYI.
I think the aim is to use pair-wise complete observations for plotting (i.e. in a manner that is specific to each panel/facet of the ggpairs grid plot), instead of using complete observations across all variables. The former will keep "useable" observations to the maximal extent, w/o introducing "artificial" data by imputing missing values. Up to date it seems that ggpairs still does not support this. My workaround for this is to:
1. Encode NA with another value not present in the data, e.g. for numerical variables, I replaced NA's with -666 for my dataset. For each dataset you can always pick something that is out of the range of its data values. BTW it seems that Inf doesn't work;
2. Then retrieve the pair-wise complete cases with user-created plotting functions. For example, for scatter plots of continuous variables in the lower triangle, I do something like:
scat.my <- function(data, mapping, ...) {
x <- as.character(unclass(mapping$x))[2] # my way of parsing the x variable name from `mapping`; there may be a better way
y <- as.character(unclass(mapping$y))[2] # my way of parsing the y variable name from `mapping`; there may be a better way
dat <- data.table(x=data[[x]], y=data[[y]])[x!=-666 & y!=-666] # I use the `data.table` package; assuming NA values have been replaced with -666
ggplot(dat, aes(x=x, y=y)) +
geom_point()
}
ggpairs(my.data, lower=list(continuous=scat.my), ...)
This can be similarly done for the upper triangle and the diagonal. It is somewhat labor-intensive as all the plotting functions need to be re-done manually with customized modifications as above. But it did work.
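To make the difference between pair-wise complete and fully complete observations concrete, here is a small base R sketch. The helper pairwise_complete is a made-up name, not part of GGally, and the toy data is invented. It works on real NA values, so whether it can replace the sentinel value inside a custom panel function depends on whether ggpairs passes the NAs through to that function, which I have not verified.

```r
# Toy data with NAs in different rows per column; `my.data` mirrors the name used above
my.data <- data.frame(
  u = c(1, 2, NA, 4, 5),
  v = c(NA, 2, 3, 4, 5),
  w = c(1, NA, 3, NA, 5)
)

# Rows usable for one panel: both plotted variables are non-missing
pairwise_complete <- function(data, x, y) {
  data[complete.cases(data[, c(x, y)]), c(x, y)]
}

# Rows available to the u-v panel
n_pair <- nrow(pairwise_complete(my.data, "u", "v"))

# Rows left if every variable must be complete at once
n_all <- nrow(na.omit(my.data))

c(pairwise = n_pair, all_complete = n_all)
```

In this toy data the u-v panel keeps three rows, while requiring complete rows across all three variables leaves only one.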
I'll take a shot at it with my own horrible workaround, because I think this needs stimulation. I agree with OP that filling in data based on statistical assumptions or a chosen hack is a terrible idea for exploratory analysis, and I think it's guaranteed to fail as soon as you forget how it works (about five days for me) and need to adjust it for something else.
Disclaimer
This is a terrible way to do things, and I hate it. It's useful for when you have a systematic source of NAs coming from something like sparse sampling of a high-dimensional dataset, which maybe the OP has.
Example
Say you have a small subset of some vastly larger dataset, making some of your columns sparsely represented:
| Sample (0:350)| Channel(1:118)| Trial(1:10)| Voltage|Class (1:2)| Subject (1:3)|
|---------------:|---------------:|------------:|-----------:|:-----------|--------------:|
| 1| 1| 1| 0.17142245|1 | 1|
| 2| 2| 2| 0.27733185|2 | 2|
| 3| 1| 3| 0.33203066|1 | 3|
| 4| 2| 1| 0.09483775|2 | 1|
| 5| 1| 2| 0.79609409|1 | 2|
| 6| 2| 3| 0.85227987|2 | 3|
| 7| 1| 1| 0.52804960|1 | 1|
| 8| 2| 2| 0.50156096|2 | 2|
| 9| 1| 3| 0.30680522|1 | 3|
| 10| 2| 1| 0.11250801|2 | 1|
require(data.table) # needs the latest rForge version of data.table for dcast
sample.table <- data.table(Sample = seq_len(10), Channel = rep(1:2,length.out=10),
Trial = rep(1:3, length.out=10), Voltage = runif(10),
Class = as.factor(rep(1:2,length.out=10)),
Subject = rep(1:3, length.out=10))
The example is hokey but pretend the columns are uniformly sampled from their larger subsets.
Let's say you want to cast the data to wide format along all channels to plot with ggpairs. Now, a canonical dcast back to wide format will not work, with an id column or otherwise, because the column ranges are sparsely (and never completely) represented:
wide.table <- dcast.data.table(sample.table, Sample ~ Channel,
value.var="Voltage",
drop=TRUE)
> wide.table
Sample 1 2
1: 1 0.1714224 NA
2: 2 NA 0.27733185
3: 3 0.3320307 NA
4: 4 NA 0.09483775
5: 5 0.7960941 NA
6: 6 NA 0.85227987
7: 7 0.5280496 NA
8: 8 NA 0.50156096
9: 9 0.3068052 NA
10: 10 NA 0.11250801
It's obvious in this case what id column would work because it's a toy example (sample.table[,index:=seq_len(nrow(sample.table)/2)]), but it's basically impossible in the case of a tiny uniform sample of a huge data.table to find a sequence of id values that will thread through every hole in your data when applied to the formula argument. This kludge will work:
setkey(sample.table,Class)
We'll need this at the end to ensure the ordering is fixed.
chan.split <- split(sample.table,sample.table$Channel)
That gets you a list of data.frames for each unique Channel.
cut.fringes <- min(sapply(chan.split,function(x) nrow(x)))
chan.dt <- cbind(lapply(chan.split, function(x){
x[1:cut.fringes,]$Voltage}))
There has to be a better way to ensure each data.frame has an equal number of rows, but for my application, I can guarantee they're only a few rows different, so I just trim off the excess rows.
chan.dt <- as.data.table(matrix(unlist(chan.dt),
ncol = length(unique(sample.table$Channel)),
byrow=TRUE))
This will get you back to a big data.table, with Channels as columns.
chan.dt[,Class:=
as.factor(rep(0:1,each=sampling.factor/2*nrow(original.table)/ncol(chan.dt))[1:cut.fringes])]
Finally, I rebind my categorical variable back on. The tables should be sorted by category already so this will match. This assumes you have the original table with all the data; there are other ways to do it.
ggpairs(data=chan.dt,
columns=1:length(unique(sample.table$Channel)), colour="Class",axisLabels="show")
Now it's plottable with the above.
As far as I can tell, there is no way around this with ggpairs(). Also, you are absolutely correct to not fill in with 'fake' data. If it is appropriate to suggest here, I would recommend using a different plotting method. For example
cor.data <- cor(data, use="pairwise.complete.obs") # correlation matrix, ignoring NAs pair-wise
chart.Correlation(data) # library(PerformanceAnalytics); expects the raw data, not the correlation matrix
or using code from here http://hlplab.wordpress.com/2012/03/20/correlation-plot-matrices-using-the-ellipse-library/
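A small base R sketch of what use="pairwise.complete.obs" does, on invented toy data (the data frame is named data only to match the snippet above). Each coefficient is computed from the rows where that particular pair of columns is complete, whereas "complete.obs" first drops every row that has an NA anywhere.

```r
# Toy numeric data with scattered NAs; values are made up for illustration
data <- data.frame(
  x = c(1, 2, 3, 4, NA, 6),
  y = c(2, 4, NA, 8, 10, 12),
  z = c(6, 5, 4, NA, 2, 1)
)

# Each coefficient uses every row where that particular pair is complete
cor_pairwise <- cor(data, use = "pairwise.complete.obs")

# For comparison: only rows that are complete in all columns are used
cor_complete <- cor(data, use = "complete.obs")

cor_pairwise
```

Here every pair of columns has four usable rows, while only three rows are complete across all columns, so the pair-wise version uses more of the data for each coefficient.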
