Create Multi-dimensional Data Mapping in R

If I want to represent a set of values in R that are keyed on 3 different dimensions, is there a simple/succinct way of generating this?
Say for example I have the following keys - each dimension must support having a different number of keys. In total the example below will reference 360 values (3*30*4):
rating <- c('AA','AAB','C')
timeInYears <- 1:30
monthsUntilStart <- c(1,3,6,12)
So I want to be able to access, for example, the value with a rating of AA, 7 years from now, starting in 12 months, using something like:
value <- data[rating=='AA',timeInYears==7,monthsUntilStart==12]
To start with I'd like to be able to provide sample generated values for every combination of keys.
In reality they will be read in from a database, but to get started it would be good to provide a dummy structure from a set of dummy values, that can simply be sequentially repeated over the structure.
So say we have
values <- c(2.30,2.32,1.98,2.18,2.29,2.22)
So each (x,y,z) key maps to one of these values.
Any hints or tips on how best to approach this would be much appreciated!
Thanks!
Phil.

You can use an array in R for this task.
First, we will create a data frame that includes all the possibilities. As desired, we will assign values that are cycled to the length of observations:
rating <- c('AA','AAB','C')
timeInYears <- 1:30
monthsUntilStart <- c(1,3,6,12)
data <- expand.grid(rating=rating, timeInYears=timeInYears, monthsUntilStart=monthsUntilStart)
data$value <- c(2.30,2.32,1.98,2.18,2.29,2.22) # cycles through
Next, we convert to an array:
dataarray <- unclass(by(data[["value"]], data[c("rating", "timeInYears", "monthsUntilStart")], identity))
Note that integers will be converted to character strings.
> dimnames(dataarray)
$rating
[1] "AA" "AAB" "C"
$timeInYears
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
$monthsUntilStart
[1] "1" "3" "6" "12"
You can access your desired element by its dimension names (this returns the value that was cycled into that position in this example).
> dataarray["AA", "7", "12"]
[1] 2.3
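As an alternative sketch (not part of the answer above), the same three-dimensional lookup can probably be built directly with array(), skipping the intermediate data frame. The name lookup is made up for illustration, and the sample values are recycled with rep() exactly as in the question.

```r
rating <- c("AA", "AAB", "C")
timeInYears <- 1:30
monthsUntilStart <- c(1, 3, 6, 12)
values <- c(2.30, 2.32, 1.98, 2.18, 2.29, 2.22)

# Total number of cells: 3 * 30 * 4 = 360
n_cells <- length(rating) * length(timeInYears) * length(monthsUntilStart)

# 'lookup' is a hypothetical name; the array is filled with the first
# dimension varying fastest, which matches the row order of expand.grid()
lookup <- array(
  rep(values, length.out = n_cells),
  dim = c(length(rating), length(timeInYears), length(monthsUntilStart)),
  dimnames = list(
    rating = rating,
    timeInYears = as.character(timeInYears),
    monthsUntilStart = as.character(monthsUntilStart)
  )
)

# Index by name; numeric keys must be passed as strings
lookup["AA", "7", "12"]
```

Note that lookup["AA", 7, 12] would be read as positions rather than keys, and position 12 does not exist in the third dimension, so wrapping numeric keys in as.character() is the safer habit.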
Edit
You can also just use the data frame itself, if you wish.
> subset(data, rating=='AA' & timeInYears==7 & monthsUntilStart==12)
rating timeInYears monthsUntilStart value
289 AA 7 12 2.3
> subset(data, rating=='AA' & timeInYears==7 & monthsUntilStart==12, value)
value
289 2.3
> subset(data, rating=='AA' & timeInYears==7 & monthsUntilStart==12)$value
[1] 2.3

Related

filter() (dplyr) does not distinguish between character and number?

I am using the function filter() (in library dplyr) with this dataset. It contains a variable called "depth_m", which is numeric. I converted it to character class with sapply (see code below) without any problems.
The variable is now character. However, when I filter the dataset on "depth_m" using either == "20" (a character) or == 20 (a number), I get the same result. Shouldn't I get an error when filtering by a number (== 20)?
Here is my code:
data <- read.table("env.txt", sep = "\t", header = TRUE)
class(data$depth_m)
Output:
[1] "integer"
# Variable transformation
data$depth_m <- sapply(data$depth_m, as.character)
class(data$depth_m)
Output:
[1] "character"
To check the values:
data$depth_m
Output:
[1] "1000" "500" "20" "1" "1000" "500" "20" "1" "1000" "320" "1" "20" "1"
[14] "20" "1" "120" "20" "20" "365" "20" "1" "375" "20" "1" "1000" "500"
[27] "20" "1" "200" "20" "1" "1000" "500" "25" "1" "1000" "500" "25" "1"
[40] "20" "300" "20" "1000" "20"
Here I'm filtering. I expected this code to return a subset, because the value "20" is a character and it exists in the dataset.
y <- filter(data, depth_m == "20") %>%
select(env_sample, depth_m)
head(y)
Output:
env_sample depth_m
1 Jan_B16_0020 20
2 Jan_B08_0020 20
3 Mar_M03_0020 20
4 Mar_M04_0020 20
5 Mar_M05_0020 20
6 Mar_M06_0020 20
Here I'm filtering again. I did not expect this code to return a subset, because the value 20 is a number and, as a number, it does not exist in the dataset.
y1 <- filter(data, depth_m == 20) %>%
select(env_sample, depth_m)
head(y1)
Output:
env_sample depth_m
1 Jan_B16_0020 20
2 Jan_B08_0020 20
3 Mar_M03_0020 20
4 Mar_M04_0020 20
5 Mar_M05_0020 20
6 Mar_M06_0020 20
Any comment will be helpful. Thank you.
In R, the expression 20 == "20" is valid, though some (from other programming languages) might consider that a little "sloppy". When that is evaluated, it up-classes the 20 to "20" for the comparison. This silent casting can be good (useful and flexible), but it can also cause unintended, undesired, and/or surprising results. (The fact that it's silent is what I dislike about it, but convenience is convenience.)
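A small base R sketch of that silent coercion, with two cases where it can surprise. The variable names are only for illustration.

```r
# The number is converted to a string before the comparison
same_value <- 20 == "20"            # TRUE

# identical() also compares the type, so it does not coerce
same_object <- identical(20, "20")  # FALSE

# The comparison is done on strings, so "20" and "20.0" differ
decimal_text <- 20 == "20.0"        # FALSE

# Ordering is also done on strings: "10" sorts before "9"
string_order <- 10 < "9"            # TRUE

c(same_value, same_object, decimal_text, string_order)
```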
If you want to be perfectly clear about your comparison, you can test for class as well. In your example, you show 20 which is numeric and not technically integer (which would be 20L), but you can shape the precision of the conditional to your own tastes:
filter(data, is.numeric(depth_m) & depth_m == 20)
This will still up-class the 20 to "20", but because the first portion, is.numeric(.), fails, the combination of the two fails as well. Realize that the specificity of that test is absolute: if the column is indeed character, then you will always get zero rows, which may not be what you want. If instead you want to remove non-20 rows only when the column is numeric, then perhaps
filter(data, !is.numeric(depth_m) | depth_m == 20)
This goes down the dizzying logic of "if it is not numeric, then it obviously cannot truly be 20, so keep it ... but if it is numeric, make sure it is definitely 20". Of course, we run into the premise here that there is no way that one portion of the column can be numeric while another cannot, so ... perhaps that's over-indulging the specificity of filtering.
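If the goal is to be told about a type mismatch rather than to get zero rows, one possible approach is to check the column type up front and stop with an error. The sketch below uses base R only; rows_at_depth is a hypothetical helper, and the two-row data frame is a made-up stand-in for the asker's file (only the sample name "Jan_B16_0020" comes from the question).

```r
# Made-up stand-in for the asker's data
data <- data.frame(
  env_sample = c("Jan_B16_0020", "Jan_B16_0500"),
  depth_m    = c("20", "500"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: refuse to compare unless the column is numeric
rows_at_depth <- function(df, depth) {
  if (!is.numeric(df$depth_m)) {
    stop("depth_m is ", class(df$depth_m)[1], ", expected numeric")
  }
  df[df$depth_m == depth, , drop = FALSE]
}

# On the character column this raises an error instead of matching silently
result <- tryCatch(rows_at_depth(data, 20),
                   error = function(e) conditionMessage(e))
result

# After an explicit conversion the comparison is number to number
data$depth_m <- as.integer(data$depth_m)
rows_at_depth(data, 20)
```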

Recommenderlab: Receiving Duplicate Predictions for Multiple Users

I am using Recommenderlab in R to build a recommendation system to provide craft-beer suggestions to new users.
However, upon running the model, I am receiving the same predictions per user for a majority of the training dataset, or receiving 'character(0)' as the output. How can I receive the predictions that are associated with each user and not duplicated?
The dataset I'm using can be found here: https://www.kaggle.com/rdoume/beerreviews/version/1
I have tried converting the data frame directly into a matrix, then into a realRatingMatrix.
In order to receive any recommendations, I need to use the 'dcast' function from the data.table library before converting the data frame into a matrix.
I have also tried removing the first column from the matrix to drop the user ids.
One thing to note is that when the data is sampled, there can be a few rows where the 'reviewer' is blank, but the rating and beer id are present.
library(dplyr)
library(tidyverse)
library(recommenderlab)
library(reshape2)
library(data.table)
beer <- read.csv('beer.csv', stringsAsFactors = FALSE)
#Take sample of data(1000)
beer_sample <- sample_n(beer, 1000)
#Select relevant columns & rename
beer_ratings <- select(beer_sample, reviewer = review_profilename, beerId = beer_beerid, rating = review_overall)
#Add unique id for reviewers
beer_ratings$userId <- group_indices_(beer_ratings, .dots = 'reviewer')
#Create ratings matrix
rating_matrix <- dcast(beer_ratings, userId ~ beerId, value.var = 'rating')
rating_matrix <- as.matrix(rating_matrix)
rating_matrix <- as(rating_matrix, 'realRatingMatrix')
#UBCF Model
recommender_model <- Recommender(rating_matrix, method = 'UBCF', param=list(method='Cosine',nn=10))
#Predict top 5 beers for first 10 users
recom <- predict(recommender_model, rating_matrix[1:10], n=5)
#Return top recommendations as a list
recom_list<- as(recom,'list')
recom_list
The above code will result in:
[[1]]
[1] "48542" "2042" "6" "10" "19"
[[2]]
[1] "10277" "2042" "6" "10" "19"
[[3]]
[1] "10277" "48542" "6" "10" "19"
[[4]]
[1] "10277" "48542" "2042" "6" "10"
[[5]]
[1] "10277" "48542" "2042" "6" "10"
[[6]]
[1] "10277" "48542" "2042" "6" "10"
Converting the data frame to a matrix then realRatingMatrix without casting first into a table results in the user's recommendation as:
`886093`
`character(0)`
Using the 'dcast' function first then converting the data frame into a matrix and removing the first column, then into a realRatingMatrix returns the same predictions for almost every user:
[[1]]
[1] "6" "7" "10" "12" "19"
[[2]]
[1] "6" "7" "10" "12" "19"
[[3]]
[1] "6" "7" "10" "12" "19"
Any help is greatly appreciated.

Converting factors with character names to numerics (after import from an .sav file)

So after I've imported a data.set via memisc (which worked very nicely btw! :)), I now have the problem that almost all of the data is converted to (non-ordered) factors, but the levels are not 1,2,3,4,5 (which is what it should be for calculations) but rather "fully agree" down to "don't agree at all".
This leads to the problem that I can't use as.numeric(levels(f))[f] to convert the factor into numerics.
To import my data I used this:
data <- as.data.set(spss.system.file("data.sav"))
dat <- as.data.frame(data)
However, the information seems to be there.
str(var1)
Factor w/ 5 levels "don't agree at all",..: NA 1 1 1 1 1 1 1 1 1 ...
labels(dat$var1)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12"
[13] "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24"
levels(dat$var1)
[1] "do not agree at all" ". ." ". . ."
[4] ". . . ." "fully agree"
Where are the values stored? I've tried labels(var1) and just var1, but neither works. However, using as.numeric(var1) gives me the information I need, BUT I don't think one should apply this, as stated in the R help for factors. Also after using dat[,1:ncol(dat)] <- lapply(dat[,1:ncol(dat)], function(x) as.numeric(x))
the variable is still being considered a factor and behaves exactly the same as before.
Edit: Reproducible example thanks to @jakub
var1 <- factor(c(1,2,3,4,5,5,4,3,2,1),
levels = as.character(1:5),
labels = c("Fully agree", "....", "...", "..", "Do not agree at all"))
You say:
as.numeric(var1) gives me the information I need, BUT I don't think one should apply this as stated in the R help for factors
If you refer to:
In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion.
then you are most likely confusing two issues. You either want the labels, or you want the levels.
If you have numerical values that happen to be labels of a factor, then indeed you have to convert to numeric using as.numeric(levels(f))[f]. An example:
var1 <- factor(c(1,2,3,1),
labels = c("123", "5", "-11"),
levels = as.character(1:3))
levels(var1)
# [1] "123" "5" "-11"
as.numeric(var1)
# [1] 1 2 3 1 #this indeed does not make much sense - the values are lost!
as.numeric(levels(var1))[var1]
#[1] 123 5 -11 123
But in your case, this does not apply, because (if I understood correctly), you don't want the labels, but the underlying integers. For you, it makes sense that Fully agree means 1. In such case, as.numeric(var1) is fine.
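As a minimal sketch of that last case, using the reproducible example from the question: as.integer() (or as.numeric()) returns the underlying codes, so "Fully agree" becomes 1. The two-column data frame dat below is a made-up stand-in for the imported data, and var2 is invented for illustration.

```r
var1 <- factor(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1),
               levels = as.character(1:5),
               labels = c("Fully agree", "....", "...", "..", "Do not agree at all"))

# The underlying integer codes follow the order of the levels
codes <- as.integer(var1)
codes

# Made-up data frame with two factor columns (var2 is var1 reversed)
dat <- data.frame(var1 = var1, var2 = rev(var1))

# dat[] keeps the data frame structure while replacing every column
dat[] <- lapply(dat, as.integer)
str(dat)
```

This only makes sense when the level order matches the intended scores, so it is worth checking levels(var1) before converting.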

adding tables to a list in R via loop

I am reading HTML tables, and can do that fine, but I am collecting tables from multiple years. Unfortunately, the columns and rows are different in each year, so I wanted to add them all recursively to a list, so I can later apply lapply and do some analysis.
I can download the table and manipulate it into a dataframe when I do it once, but then when I add it to a list, the list only accepts the first column.
library(XML)
#reg
r=readHTMLTable('http://www.nhl.com/stats/team?season=20132014&gameType=2&viewName=summary#',stringsAsFactors=FALSE)
r=as.data.frame(r[3])
for(i in 3:ncol(r)){
r[,i]=as.numeric(r[,i])
}
This gives me r as something I can manipulate. I want to add it to a list:
> l=as.list(NULL)
> l[1]=r
Warning message:
In l[1] = r :
number of items to replace is not a multiple of replacement length
> l
[[1]]
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
Does anyone know how I can add it into my list so I keep the dimensions
> dim(r)
[1] 30 25
The issue is, I have many other tables that I would like to add, and was able to add them, but each one that was added only included the first column/element.
Any ideas are greatly appreciated
Thanks!
A little more research, and I found an answer. I feel guilty about it, but here it is:
l[[1]]=r
adds the whole table r to the list as a single element, and the same assignment can be repeated (for example in a loop) for each further table
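A short sketch of how this might look in a loop. The function make_table and the season strings are hypothetical stand-ins for the readHTMLTable step, so that the example runs without a network connection.

```r
# Hypothetical stand-in for downloading and cleaning one season's table
make_table <- function(season) {
  data.frame(team = c("A", "B"),
             points = c(100, 90),
             season = season,
             stringsAsFactors = FALSE)
}

seasons <- c("20122013", "20132014")
tables <- list()

for (s in seasons) {
  # [[ ]] stores the whole data frame as one list element;
  # single brackets would try to spread its columns over list slots
  tables[[s]] <- make_table(s)
}

dim(tables[["20132014"]])
sapply(tables, nrow)
```

Because each element keeps its own dimensions, tables with different rows and columns per year can sit side by side, and lapply(tables, ...) can then be applied to all of them.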

How to edit "row.names" after split and cut2 in R?

I want to edit out some information from row.names that are created automatically once split and cut2 were used. See following code:
#Mock data
date_time <- as.factor(c('8/24/07 17:30','8/24/07 18:00','8/24/07 18:30',
'8/24/07 19:00','8/24/07 19:30','8/24/07 20:00',
'8/24/07 20:30','8/24/07 21:00','8/24/07 21:30',
'8/24/07 22:00','8/24/07 22:30','8/24/07 23:00',
'8/24/07 23:30','8/25/07 00:00','8/25/07 00:30'))
U. <- as.numeric(c('0.2355','0.2602','0.2039','0.2571','0.1419','0.0778','0.3557',
'0.3065','0.1559','0.0943','0.1519','0.1498','0.1574','0.1929'
,'0.1407'))
#Mock data frame
test_data <- data.frame(date_time,U.)
#To use cut2
library(Hmisc)
#Splitting the data into categories
sub_data <- split(test_data,cut2(test_data$U.,c(0,0.1,0.2)))
new_data <- do.call("rbind",sub_data)
test_data <- new_data
You will see that "test_data" would have an extra column "row.names" with values such as "[0.000,0.100).6", "[0.000,0.100).10", etc.
How do I remove "[0.000,0.100)" and keep the number after the "." such as 6 and 10 so that I can reference these rows by their original row number later?
Is there a better method to do this?
You could also set the names of sub_data to NULL.
names(sub_data) <- NULL
test_data <- do.call('rbind', sub_data)
row.names(test_data)
#[1] "6" "10" "5" "9" "11" "12" "13" "14" "15" "1" "2" "3" "4" "7" "8"
You could use a Regular Expression (Regex), as follows:
rownames(test_data) = gsub(".*[]\\)]\\.", "", rownames(test_data))
It's cryptic if you're not familiar with Regular Expressions, but it basically says: match any sequence of characters (.*) that is followed by either a closing square bracket or a closing parenthesis ([]\\)]) and then by a period (\\.), and remove all of it.
The double backslashes are "escapes" indicating that the character following the double-backslash should be interpreted literally, rather than in its special Regex meaning (e.g., . means "match any single character", but \\. means "this is really just a period").
Just for fun, you can also use regmatches
> Names <- rownames(test_data)
> ( rownames(test_data) <- regmatches(Names, regexpr("[0-9]+$", Names)) )
[1] "6" "10" "5" "9" "11" "12" "13" "14" "15" "1" "2" "3" "4" "7" "8"
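To see that the two pattern-based approaches agree, here is a small check on a few row names of the kind shown in the question. The third name, with a closing square bracket, is an assumed example of what the last cut2 interval could look like.

```r
# First two names are from the question; the third is an assumed example
x <- c("[0.000,0.100).6", "[0.000,0.100).10", "[0.200,0.356].1")

# Remove everything up to the "." that follows ")" or "]"
by_gsub <- gsub(".*[]\\)]\\.", "", x)

# Or extract the trailing run of digits
by_regmatches <- regmatches(x, regexpr("[0-9]+$", x))

by_gsub
identical(by_gsub, by_regmatches)

# Convert to integers if the result is used to index the original rows
as.integer(by_gsub)
```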
