I'm dealing with a categorical variable retrieved from a database and want to use factors to maintain the "fullness" of the data.
For instance, I have a table which stores colors and their associated numerical IDs:
ID | Color
------+-------
1 | Black
1805 | Red
3704 | White
So I'd like to use a factor to store this information in a data frame such as:
Car Model | Color
----------+-------
Civic | Black
Accord | White
Sentra | Red
where the Color column is a factor and the underlying data stored, rather than being a string, is actually c(1, 3704, 1805): the IDs associated with each color.
So I can create a custom factor by modifying the levels attribute of an object of the factor class to achieve this effect.
Unfortunately, as you can see in the example, my IDs are not consecutive. In my application, I have ~30 levels and the maximum ID for one level is ~9,000. Because the levels of a factor are stored in an array, that means I'm storing a vector of length ~9,000 with only 30 meaningful elements in it.
Is there any way to use a hash or list to accomplish this effect more efficiently? i.e. if I were to use a hash in the levels attribute of a factor, I could store all 30 elements with whatever indices I please without having to create an array of size max(ID).
Thanks in advance!
Well, I'm pretty sure you can't change how factors work. A factor always has level ids that are integer numbers 1..n where n is the number of levels.
...but you can easily have a translation vector to get to your color ids:
# The translation vector...
colorIds <- c(Black=1,Red=1805,White=3704)
# Create a factor with the correct levels
# (but with level ids that are 1,2,3...)
f <- factor(c('Red','Black','Red','White'), levels=names(colorIds))
as.integer(f) # 2 1 2 3
# Translate level ids to your color ids
colorIds[f] # 1805 1 1805 3704
Technically, colorIds does not need to define the names of the colors, but it makes it easier to have in one place since the names are used when creating the levels for the factor. You want to specify the levels explicitly so that the numbering of them matches even if the levels are not in alphabetical order (as yours happen to be).
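The same translation vector can also be used in the other direction, for data that arrives from the database as raw IDs. The following is only a sketch under that assumption; the `ids` vector is made up for illustration and is not from the question.

```r
# Translation vector from the question's color table
colorIds <- c(Black = 1L, Red = 1805L, White = 3704L)

# Hypothetical IDs as they might arrive from the database
ids <- c(1L, 3704L, 1805L)

# IDs -> factor: match each ID against the known IDs, label with the names
f <- factor(ids, levels = unname(colorIds), labels = names(colorIds))
print(f)

# factor -> IDs: look the labels up in the translation vector
back <- unname(colorIds[as.character(f)])
print(back)
```

The factor itself still stores the compact codes 1..3; the sparse IDs live only in the 3-element translation vector.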
EDIT: It is, however, possible to create a class deriving from factor that has the codes as an attribute. Let's call this new glorious class foo:
foo <- function(x = character(), levels, codes) {
  f <- factor(x, levels)
  attr(f, 'codes') <- codes
  class(f) <- c('foo', class(f))
  f
}
`[.foo` <- function(x, ...) {
  y <- NextMethod('[')
  attr(y, 'codes') <- attr(x, 'codes')
  y
}
as.integer.foo <- function(x, ...) attr(x,'codes')[unclass(x)]
# Try it out
set.seed(42)
f <- foo(sample(LETTERS[1:5], 10, replace=TRUE), levels=LETTERS[1:5], codes=101:105)
d <- data.frame(i=11:15, f=f)
# Try subsetting it...
d2 <- d[2:5,]
# Gets the codes, not the level ids...
as.integer(d2$f) # 105 102 105 104
You could then also fix print.foo etc...
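Applied to the color table from the question, the foo class might be used roughly as follows. This is a sketch, not a tested recipe: the definitions are repeated only so the block runs on its own, and the `cars` data frame is rebuilt by hand from the question's example.

```r
# Definitions repeated from the answer so the sketch runs on its own
foo <- function(x = character(), levels, codes) {
  f <- factor(x, levels)
  attr(f, "codes") <- codes
  class(f) <- c("foo", class(f))
  f
}
`[.foo` <- function(x, ...) {
  y <- NextMethod("[")
  attr(y, "codes") <- attr(x, "codes")
  y
}
as.integer.foo <- function(x, ...) attr(x, "codes")[unclass(x)]

# The question's car table, with Color stored as a foo factor
cars <- data.frame(Model = c("Civic", "Accord", "Sentra"))
cars$Color <- foo(c("Black", "White", "Red"),
                  levels = c("Black", "Red", "White"),
                  codes  = c(1L, 1805L, 3704L))

# Database IDs for every row, and for a subset (subsetting keeps the codes)
ids_all   <- as.integer(cars$Color)
id_accord <- as.integer(cars$Color[cars$Model == "Accord"])
print(ids_all)
print(id_accord)
```

Only the three codes are stored, regardless of how large the largest ID is.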
In thinking about it, the only feature that a "level" needs to implement in order to have a valid factor is the [ accessor. So any object implementing the [ accessor could be viewed as a vector from the standpoint of any interfacing function.
I looked into the hash class, but saw that it uses the normal R behavior (as seen in lists) of returning a slice of the original hash when using a single bracket (while extracting the actual value when using the double bracket). However, if I override this using setMethod(), I am able to get the desired behavior.
library(hash)
setMethod(
  '[',
  signature(x = "hash", i = "ANY", j = "missing", drop = "missing"),
  function(x, i, j, ..., drop) {
    # When indexed by a factor, presumably we are looking up the values
    # associated with the factor's underlying integer codes
    if (is.factor(i)) {
      i <- as.integer(i)
    }
    # Make keys and get each value from the hash's underlying environment
    toReturn <- NULL
    for (k in make.keys(i)) {
      toReturn <- c(toReturn, get(k, envir = x@.xData))
    }
    return(toReturn)
  }
)
as.character.hash <- function(h) {
  as.character(values(h))
}
print.hash <- function(h) {
  print(as.character(h))
}
h <- hash(1:26, letters)
df <- data.frame(ID=1:26, letter=26:1, stringsAsFactors=FALSE)
attributes(df$letter)$class <- "factor"
attributes(df$letter)$levels <- h
> df
ID letter
1 1 z
2 2 y
3 3 x
4 4 w
5 5 v
6 6 u
7 7 t
8 8 s
9 9 r
10 10 q
11 11 p
12 12 o
13 13 n
14 14 m
15 15 l
16 16 k
17 17 j
18 18 i
19 19 h
20 20 g
21 21 f
22 22 e
23 23 d
24 24 c
25 25 b
26 26 a
> attributes(df$letter)$levels
<hash> containing 26 key-value pair(s).
1 : a
10 : j
11 : k
12 : l
13 : m
14 : n
15 : o
16 : p
17 : q
18 : r
19 : s
2 : b
20 : t
21 : u
22 : v
23 : w
24 : x
25 : y
26 : z
3 : c
4 : d
5 : e
6 : f
7 : g
8 : h
9 : i
>
> df[1,2]
[1] z
Levels: a j k l m n o p q r s b t u v w x y z c d e f g h i
> as.integer(df$letter)
[1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
[26] 1
Any feedback on this? As best I can tell, everything's working. It looks like it works properly as far as printing, and the underlying data stored in the actual data.frame is untouched, so I don't feel like I'm jeopardizing anything there. I may even be able to get away with adding a new class into my package which just implements this accessor to avoid having to add a dependency on the hash class.
Any feedback or points on what I'm overlooking would be much appreciated.
I wrote the lines of code below.
I want to get the most frequent value in matrix:
matrix7 <- matrix(sample(1:36, 100, replace = TRUE), nrow = 1)
t <- table(matrix7)
print(t)
a <- which.max(table(matrix7))
print(unlist(a))
It prints this:
> matrix7 <- matrix(sample(1:36, 100, replace = TRUE), nrow = 1)
> t <- table(matrix7)
> print(t)
matrix7
1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 25 26 27 28 29 30 31 32 34 35 36
4 5 1 5 2 5 1 3 1 4 2 2 2 5 5 1 3 7 2 3 2 3 2 1 4 4 2 2 2 5 2 5 3
> a <- which.max(table(matrix7))
> print(unlist(a))
19
18
>
What type are my t and a variables,
and how can I get the most frequent value from the matrix?
To know the "type" of variable use:
class(t)
class(a)
Note that t is already a table, since you create it with t <- table(matrix7), while your variable a is an integer.
To get the most common element of your data (matrix7 in your case), sort the table of its values; the last name in the result is the most frequent value:
sort(table(as.vector(matrix7)))
In general, if you want to know the "type" (more properly called the class) of an object, use the function class:
> class(t)
[1] "table"
There are a few ways you can find the most frequent value. Given that you have already calculated the which.max, you can take the corresponding name of t:
> as.numeric(names(t)[a])
[1] 5 ## I have a different random number seed to you :)
Note that you can't just take t[a], since that returns the count stored in that cell of the table (with the value only as its name), not the value itself.
In your example, the object a is an integer vector of length one. The "data" is 18, and it has the "name" 19. Hence another and perhaps simpler way to get the most frequent value is to take names(a).
You can either use class() to get the class attribute of an R object or typeof() to get the type or storage mode.
Class and type of a are 'integer', the class of t is 'table' and the type is 'integer'.
Note that a is a named integer, this is why 2 values are printed. If you use names(a) it will only return the value (as a character) of a.
If you use which.max(tabulate(matrix7)) it will return the value without the need to change it further.
which.max(tabulate(matrix7))
[1] 16
(Side note: since no seed is set in your code, the result differs; you can set one using set.seed(x), where x is an integer.)
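The approaches from the answers above can be collected in one small function. This is a minimal sketch on a fixed toy matrix rather than the random one from the question; `most_frequent` is a made-up helper name, not a base R function.

```r
# Hypothetical helper: most frequent value of a numeric vector or matrix.
# Ties are resolved in favor of the smallest value, because table() sorts
# its names and which.max() returns the first maximum.
most_frequent <- function(x) {
  counts <- table(as.vector(x))
  as.numeric(names(counts)[which.max(counts)])
}

v <- matrix(c(3, 7, 7, 2, 7, 3), nrow = 1)
most_frequent(v)          # 7
which.max(tabulate(v))    # 7 as well; tabulate() only suits positive integers
```

The table() route works for any values that can be converted back from their names, while the tabulate() shortcut relies on the values being small positive integers, so that the bin index equals the value.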
I have two vectors that I would like to reference in a for loop, but they have different lengths.
n=1:50
m=letters[1:14]
I tried a single loop to read them:
for (i in c(11:22,24,25)){
cat (paste(n[i],m[i],sep='\t'),sep='\n')
}
and ended up with:
11 k
12 l
13 m
14 n
15 NA
16 NA
17 NA
18 NA
19 NA
20 NA
21 NA
22 NA
24 NA
25 NA
but I would like to obtain:
11 a
12 b
13 c
...
25 n
Is there a way to declare two loop variables at once?
for (i in c(11:22,24,25) and j in 1:14){
cat (paste(n[i],m[j],sep='\t'),sep='\n')
}
or something similar to get the result I want?
No, there isn't. But you can do this:
ind_j <- c(11:22,24,25)
ind_k <- 1:14
for (i in seq_along(ind_j)){
cat (paste(n[ind_j[i]],m[ind_k[i]],sep='\t'),sep='\n')
}
Of course, it's very probable that you shouldn't use a for loop for your actual problem.
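For this particular output, a loop-free version could look like the sketch below. It reuses the index vectors from the answer; the only added assumption is that both index vectors have the same length, which the code checks.

```r
n <- 1:50
m <- letters[1:14]

ind_j <- c(11:22, 24, 25)   # positions in n
ind_k <- 1:14               # positions in m

# Both index vectors must pair up one-to-one
stopifnot(length(ind_j) == length(ind_k))

# paste() is vectorized, so all lines are built in one call
out <- paste(n[ind_j], m[ind_k], sep = "\t")
cat(out, sep = "\n")
```

Indexing both vectors first and pasting once avoids the loop entirely and prints the pairs 11/a through 25/n.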
If you want m to start over when it has reached the end, you can take advantage of recycling in R.
cat(paste(n, m, sep='\t', collapse='\n'), '\n')
When the end of m is reached, it will start over until all elements of n have been iterated over. If you need this in a loop, replace cat with a for loop.
Your problem lies in assigning the values to i in for (i in c(11:22,24,25)): this assigns the values 11, 12, 13, 14, 15, ... to i.
Then you want to get the values of m[i].
But remember: m has only 14 items, so for index 15 and above you'll get NAs.
Maybe this is what you wanted. There are more robust answers here, and @Roland's is far better, but in my opinion this fixes your problem without changing your initial approach:
for (i in c(1:12,14,15)){cat (paste(n[i],m[i],sep='\t'),sep='\n')}
If you just subtract 10 from your sequence, the indexing problem will go away and you'll get:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
11 k
12 l
14 n
15 o
I have 2 data frames and I only want the numbers which are in both frames. I use this expression:
CH3[CH2[,2] %in% CH3[,2],]
The problem is that the data frames have different lengths, and the operation CH2[,2] %in% CH3[,2] delivers 1400 TRUE values. I have searched for a while now but could not find a solution. If I apply it to CH3, which is 1700 rows long, I also get all the remaining rows, which are not marked as FALSE. Is there a function parameter I can use, or do I need a workaround?
Edit1:
CH3<-read.table("unkown1.txt")
CH2<-read.table("unkown2.txt")
Now I only want the elements which are in both tables. I used:
CH3[CH3[,2] %in% CH2[,2],]
which only works if CH3 is larger.
merge is an option, depending on exactly what you are trying to do:
CH2 <- data.frame(a=letters[1:20], b=1:20)
CH3 <- data.frame(a=letters[15:26], b=15:26)
merge(CH2, CH3, by=2)
produces:
b a.x a.y
1 15 o o
2 16 p p
3 17 q q
4 18 r r
5 19 s s
6 20 t t
Another alternative is with intersect:
x <- intersect(CH2[, 2], CH3[, 2])
Where x is:
[1] 15 16 17 18 19 20
You can then do either of the following:
CH2[CH2[, 2] %in% x, ]
CH3[CH3[, 2] %in% x, ]
To get:
a b
15 o 15
16 p 16
17 q 17
18 r 18
19 s 19
20 t 20
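The point that seems to matter for data frames of different lengths is that the logical index is built from the same data frame that is being subset, so it has exactly one element per row. A small sketch using the toy CH2 and CH3 from this answer; `keep_common` is a hypothetical helper name introduced only for illustration.

```r
CH2 <- data.frame(a = letters[1:20], b = 1:20)
CH3 <- data.frame(a = letters[15:26], b = 15:26)

# Hypothetical helper: keep the rows of `df` whose column `col`
# also occurs in the same column of `other`
keep_common <- function(df, other, col = 2) {
  df[df[[col]] %in% other[[col]], , drop = FALSE]
}

in_CH2 <- keep_common(CH2, CH3)   # mask has nrow(CH2) elements
in_CH3 <- keep_common(CH3, CH2)   # mask has nrow(CH3) elements
print(in_CH2)
print(in_CH3)
```

Because the mask always comes from the first argument, it does not matter which of the two data frames is longer.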
I'm having some trouble using factors in functions, or just making use of them in basic calculations. I have a data frame something like this (but with as many as 6000 different factor levels).
df <- data.frame(p = runif(20)*100,
                 q = sample(1:100, 20, replace = TRUE),
                 tt = c("e","e","f","f","f","i","h","e","i","i","f","f","j","j","h","h","h","e","j","i"),
                 ta = c("a","a","a","b","b","b","a","a","c","c","a","b","a","a","c","c","b","a","c","b"))
# Note: this renames the last two columns, so the e/f/h/i/j column is called ta
# and the a/b/c column is the one called tt
colnames(df)<-c("p","q","ta","tt")
Now price = p and quantity = q are my variables, and tt and ta are different factors.
Now, I would first like to find the average price per unit of q by each different factor in tt
(p*q ) / sum(q) by tt
This would in this case give me a list of 3 different sums, by a, b and c (I have 6000 different factors so I need to do it smart :) ).
I have tried using split to make lists, and in this case I can get each individual tt factor to contain the prices and another for the quantity, but I can't seem to get them to, for example, make an average. I've also tried to use tapply, but again I can't see how I can incorporate factors into this.
EDIT: I can see I need to clarify:
I need to find 3 sums, the average price per q given each factor, so in this simplified case it would be:
a: (sum of p*q for rows 1, 2, 3, 7, 11, 13, 14, 18) / (sum of q for rows 1, 2, 3, 7, 11, 13, 14, 18)
So the result should be the average price for a, b and c, which is just 3 values.
I'd use plyr to do this:
library(plyr)
ddply(df, .(tt), mutate, new_col = (p*q) / sum(q))
p q ta tt new_col
1 73.92499 70 e a 11.29857879
2 58.49011 60 e a 7.66245932
3 17.23246 27 f a 1.01588711
4 64.74637 42 h a 5.93743967
5 55.89372 45 e a 5.49174103
6 25.87318 83 f a 4.68880732
7 12.35469 23 j a 0.62043207
8 1.19060 83 j a 0.21576367
9 84.18467 25 e a 4.59523322
10 73.59459 66 f b 10.07726727
11 26.12099 99 f b 5.36509998
12 25.63809 80 i b 4.25528535
13 54.74334 90 f b 10.22178577
14 69.45430 50 h b 7.20480246
15 52.71006 97 i b 10.60762667
16 17.78591 54 i c 5.16365066
17 0.15036 41 i c 0.03314388
18 85.57796 30 h c 13.80289670
19 54.38938 44 h c 12.86630433
20 44.50439 17 j c 4.06760541
plyr does have a reputation for being slow; data.table provides similar functionality with much higher performance.
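If only one value per level is wanted, as the question's edit describes (sum of p*q divided by sum of q within each level of tt), a base R sketch could look like this. The four-row data frame is a made-up toy example chosen so the result can be checked by hand; it is not the question's data.

```r
# Toy data (hypothetical): two levels of tt, two rows each
df <- data.frame(p  = c(10, 20, 30, 40),
                 q  = c(1, 3, 2, 2),
                 tt = c("a", "a", "b", "b"))

# One quantity-weighted average price per level of tt
avg_price <- sapply(split(df, df$tt),
                    function(d) sum(d$p * d$q) / sum(d$q))
print(avg_price)   # a: (10*1 + 20*3) / 4 = 17.5, b: (30*2 + 40*2) / 4 = 35

# The same quantity is what weighted.mean() computes
avg_price2 <- sapply(split(df, df$tt),
                     function(d) weighted.mean(d$p, d$q))
```

split() creates one small data frame per level, so this returns a named vector with one entry per level, however many levels there are.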
If I understood your problem correctly, this should be the answer. Give it a try and let me know, so that I can adjust it if needed.
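myRes <- function(tt) {
  out <- NULL
  for (var in tt) {
    # rows belonging to the current level of tt
    index <- which(df[, "tt"] == var)
    # quantity-weighted average price for this level: sum(p*q) / sum(q)
    out <- c(out, sum(df[index, "p"] * df[index, "q"]) / sum(df[index, "q"]))
  }
  names(out) <- tt
  return(out)
}
# unique() also works when tt is a character column rather than a factor
threeValue <- myRes(unique(as.character(df[, "tt"])))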
I have a few questions/suggestions concerning data.table.
R) X = data.table(x=c("q","q","q","w","w","e"),y=1:6,z=10:15)
R) X[,list(sum(y)),by=list(x)]
x V1
1: q 6
2: w 9
3: e 6
I think it is too bad that one has to write
R) X[,list(y=sum(y)),by=list(x)]
x y
1: q 6
2: w 9
3: e 6
It should default to keeping the same column name (i.e. y) when the function call involves only one column. This would be a massive gain in most cases, typically in finance, as we usually look at weighted sums or last time or...
=> Is there any variable I can set to default to this behaviour?
When doing a select, I might want to do a calculation on a few columns and apply another operation to all other columns.
I mean, it is too bad that when I want this:
R) X = data.table(x=c("q","q","q","w","w","e"),y=1:6,z=10:15,t=20:25,u=30:35)
R) X
x y z t u
1: q 1 10 20 30
2: q 2 11 21 31
3: q 3 12 22 32
4: w 4 13 23 33
5: w 5 14 24 34
6: e 6 15 25 35
R) X[,list(y=sum(y),z=last(z),t=last(t),u=last(u)),by=list(x)] #LOOOOOOOOOOONGGGG EXPR
x y z t u
1: q 6 12 22 32
2: w 9 14 24 34
3: e 6 15 25 35
I cannot write it like...
R) X[,list(sum(y)),by=list(x),defaultFn=last] #defaultFn would be applied to all remaining columns
=> Can I do this somehow (maybe by setting an option)?
Thanks
On part 1, that's not a bad idea. We already do that for expressions in by, and something close is already on the list for j :
FR#2286 Inferred naming could apply to j=colname[...]
Find max per group and return another column
But if we did do that it would probably need to be turned on via an option, to maintain backwards compatibility. I've added a link in that FR back to this question.
On the 2nd part, how about:
X[,c(y=sum(y),lapply(.SD,last)[-1]),by=x]
x y z t u
1: q 6 12 22 32
2: w 9 14 24 34
3: e 6 15 25 35
Please ask multiple questions separately, though. Each question on S.O. is supposed to be a single question.