R: Using factor in a function

I'm having some trouble using factors in functions, or just making use of them in basic calculations. I have a data frame something like this (but with as many as 6,000 different levels).
df <- data.frame(p  = runif(20) * 100,
                 q  = sample(1:100, 20, replace = TRUE),
                 tt = c("e","e","f","f","f","i","h","e","i","i","f","f","j","j","h","h","h","e","j","i"),
                 ta = c("a","a","a","b","b","b","a","a","c","c","a","b","a","a","c","c","b","a","c","b"))
# note: this renames the columns, so the column finally called "tt" holds the a/b/c labels
colnames(df) <- c("p","q","ta","tt")
Now price = p and quantity = q are my variables, and tt and ta are different factors.
Now, I would first like to find the average price per unit of q for each different level of tt:
sum(p*q) / sum(q), by tt
In this case that would give me three values, one each for a, b and c (I have 6,000 different levels, so I need to do it efficiently :) ).
I have tried using split to make lists, and I can get one list with the prices and another with the quantities for each individual tt level, but I can't seem to, for example, compute an average from them. I've also tried tapply, but again I can't see how to incorporate the factors into it.
EDIT: I can see I need to clarify:
I need to find 3 values, the average price per unit of q for each level, so in this simplified case it would be:
a: sum of p*q for rows (1, 2, 3, 7, 11, 13, 14, 18) / sum of q for rows (1, 2, 3, 7, 11, 13, 14, 18)
So the result should be the average price for a, b and c, which is just 3 values.

I'd use plyr to do this:
library(plyr)
ddply(df, .(tt), mutate, new_col = (p*q) / sum(q))
p q ta tt new_col
1 73.92499 70 e a 11.29857879
2 58.49011 60 e a 7.66245932
3 17.23246 27 f a 1.01588711
4 64.74637 42 h a 5.93743967
5 55.89372 45 e a 5.49174103
6 25.87318 83 f a 4.68880732
7 12.35469 23 j a 0.62043207
8 1.19060 83 j a 0.21576367
9 84.18467 25 e a 4.59523322
10 73.59459 66 f b 10.07726727
11 26.12099 99 f b 5.36509998
12 25.63809 80 i b 4.25528535
13 54.74334 90 f b 10.22178577
14 69.45430 50 h b 7.20480246
15 52.71006 97 i b 10.60762667
16 17.78591 54 i c 5.16365066
17 0.15036 41 i c 0.03314388
18 85.57796 30 h c 13.80289670
19 54.38938 44 h c 12.86630433
20 44.50439 17 j c 4.06760541
plyr does have a reputation for being slow; data.table provides similar functionality with much higher performance.
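Since the edited question asks for just three values (one per level), a grouped summary rather than a per-row mutate may be closer to what is wanted. A minimal data.table sketch, assuming the df defined in the question:
library(data.table)
setDT(df)
# one weighted average price per level of tt: sum(p*q) / sum(q)
df[, .(avg_price = sum(p * q) / sum(q)), by = tt]
The plyr equivalent would be ddply(df, .(tt), summarise, avg_price = sum(p*q)/sum(q)).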

If I understood your problem correctly, this should be the answer. Give it a try and respond, so that I can adjust it if needed.
myRes <- function(tt) {
  out <- numeric(0)
  for (var in tt) {
    index <- which(df[, "tt"] == var)
    # weighted average price for this level: sum(p*q) / sum(q)
    out <- c(out, sum(df[index, "p"] * df[index, "q"]) / sum(df[index, "q"]))
  }
  names(out) <- tt
  out
}
threeValue <- myRes(levels(factor(df[, "tt"])))
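Since the question also mentions tapply, here is a base-R one-liner along the same lines (a sketch, assuming the df defined in the question):
# per-level weighted average price: sum(p*q) / sum(q), computed group-wise
tapply(df$p * df$q, df$tt, sum) / tapply(df$q, df$tt, sum)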

Related

How to efficiently replace ranges with their median in a dataframe

Suppose I have a dataframe named score.master that looks like this:
school perc.prof num.tested
A 8 482
B 6-9 34
C 40-49 49
D GE50 81
E 80-89 26
Here, school A's percent proficient is 8%, and the number of students tested is 482. However, suppose that when num.tested falls below a certain number (in this case arbitrarily 100), data suppression is introduced. In most cases, ranges of perc.prof are given but in other cases a value such as "GE50" is given, indicating greater than or equal to 50.
My question is, in a much larger dataset, what is the best way to replace a range with its median? So for example I want the final dataset to look like this:
school perc.prof num.tested
A 8 482
B 8 34
C 44 49
D 75 81
E 85 26
I know this can be done manually like this:
score.master$perc.prof[score.master$perc.prof == "6-9"] <- round(median(6:9), 0)
But the actual dataset has many more range combinations. One way I thought of selecting the correct values is by length; all provided values are 1-2 characters long (no more than 99 percent proficient) whereas the range values are 3 or more characters long.
You can use stringr::str_split() to get the lower and upper bounds, then calculate the median. Codes like "GE50" don't fit this pattern, so you could use ifelse() to handle those special cases.
df <- data.frame(perc.prof = c('8', '6-9', '40-49', 'GE50', '80-89'))
df$lower.upper <- sapply(stringr::str_split(df$perc.prof, '-'), as.integer)
df$perc.prof.median <- sapply(df$lower.upper, median)
df$lower.upper <- NULL
> df
perc.prof perc.prof.median
1 8 8.0
2 6-9 7.5
3 40-49 44.5
4 GE50 NA
5 80-89 84.5
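For the "GE50" row, which the split-based approach leaves as NA, one option (as mentioned above) is to patch it afterwards with ifelse(); a sketch that assumes "GE50" should be read as the midpoint of 50-100:
# assumption: "GE50" means "greater than or equal to 50", taken here as 75
df$perc.prof.median <- ifelse(df$perc.prof == "GE50", 75, df$perc.prof.median)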
You could do the following to replace your ranges with their median. However, I did not handle the "GExx" or "LExx" situations, since their meaning isn't well defined enough.
Note that you would need the stringr package for my solution.
score.master$perc.prof <- sapply(score.master$perc.prof, function(x) {
  sep <- stringr::str_locate(x, "-")[, 1]
  if (is.na(sep)) {
    x
  } else {
    as.character(round(median(as.integer(stringr::str_sub(x, c(1L, sep + 1), c(sep - 1, -1L))))))
  }
})
Here's a tidyverse approach. First I replace "GE50" with its expected output, then use tidyr::separate to split perc.prof where possible. The last step either uses the given perc.prof for large schools, or the midpoint of the range for small schools.
library(tidyverse)
df %>%
  mutate(perc.prof = if_else(perc.prof == "GE50", "75", perc.prof)) %>%
  separate(perc.prof, c("low", "high"), remove = FALSE, convert = TRUE) %>%
  mutate(perc.prof.adj = if_else(num.tested > 100,
                                 as.numeric(perc.prof),
                                 rowMeans(select(., low, high), na.rm = TRUE)))
school perc.prof low high num.tested perc.prof.adj
1 A 8 8 NA 482 8.0
2 B 6-9 6 9 34 7.5
3 C 40-49 40 49 49 44.5
4 D 75 75 NA 81 75.0
5 E 80-89 80 89 26 84.5
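If there are several suppression codes ("GExx", "LExx"), one way to handle them before the separate() step is a case_when() lookup. A sketch, assuming perc.prof is stored as character and that GE50 means >= 50 and LE10 means <= 10 (both assumptions about the coding):
library(dplyr)
score.master %>%
  mutate(perc.prof = case_when(
    perc.prof == "GE50" ~ "75",  # assumed midpoint of 50-100
    perc.prof == "LE10" ~ "5",   # assumed midpoint of 0-10
    TRUE ~ perc.prof
  ))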

Arrange univariateTable output by values not by levels

I am trying to work around the following inconvenience when exporting a table built from factor levels. Here is the code to generate the sample data, and a table from it.
data <- c(sample('A',30,replace=TRUE), sample('B',120,replace=TRUE),
sample('C',180,replace=TRUE), sample('D',70,replace=TRUE))
library(Publish)
univariateTable(~data)
The default output of the univariateTable is by levels (From A through D):
Variable Levels Value
1 data A 30 (7.5)
2 B 120 (30.0)
3 C 180 (45.0)
4 D 70 (17.5)
How can I change this so that the output is ordered by the values instead? I mean, the first row being the largest number (and percentage) and the last row being the lowest, like this:
Variable Levels Value
1 data C 180 (45.0)
2 B 120 (30.0)
3 D 70 (17.5)
4 A 30 (7.5)
Assuming that the "Publish" package is the one installed from github, we extract the numbers before the ( using sub, order it and use it to order the "xlevels" and "summary.totals".
#library(devtools)
#install_github("TagTeam/Publish")
library(Publish)
Out <- univariateTable(~data)
i1 <- order(as.numeric(sub('\\s+.*', '', Out$summary.totals$data)),
            decreasing = TRUE)
Out$xlevels$data <- Out$xlevels$data[i1]
Out$summary.totals$data <- Out$summary.totals$data[i1]
Out
# Variable Level Total
#1 data C 180 (45.0)
#2 B 120 (30.0)
#3 D 70 (17.5)
#4 A 30 (7.5)
data
set.seed(24)
data <- c(sample('A',30,replace=TRUE), sample('B',120,replace=TRUE),
sample('C',180,replace=TRUE), sample('D',70,replace=TRUE))
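If this reordering has to be done for several variables, it can be wrapped in a small helper. A sketch (sortByTotal is a hypothetical name), assuming univariateTable objects keep one entry per variable in xlevels and summary.totals, as used above:
sortByTotal <- function(tab, var) {
  # order by the count extracted from strings like "180 (45.0)"
  i1 <- order(as.numeric(sub('\\s+.*', '', tab$summary.totals[[var]])),
              decreasing = TRUE)
  tab$xlevels[[var]] <- tab$xlevels[[var]][i1]
  tab$summary.totals[[var]] <- tab$summary.totals[[var]][i1]
  tab
}
sortByTotal(univariateTable(~data), "data")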

R - Create a new variable where each observation depends on another table and other variables in the data frame

I have the two following tables:
df <- data.frame(eth = c("A","B","B","A","C"),ZIP1 = c(1,1,2,3,5))
Inc <- data.frame(ZIP2 = c(1,2,3,4,5,6,7),A = c(56,98,43,4,90,19,59), B = c(49,10,69,30,10,4,95),C = c(69,2,59,8,17,84,30))
df:
  eth ZIP1
  A      1
  B      1
  B      2
  A      3
  C      5

Inc:
  ZIP2  A  B  C
     1 56 49 69
     2 98 10  2
     3 43 69 59
     4  4 30  8
     5 90 10 17
     6 19  4 84
     7 59 95 30
I would like to create a variable Inc in the df data frame where, for each observation, the value is taken from the Inc table at the intersection of that observation's eth and ZIP. In my example, this would lead to:
eth ZIP1 Inc
A 1 56
B 1 49
B 2 10
A 3 43
C 5 17
A loop or some other brute-force approach could solve it, but that takes too long on my dataset, so I'm looking for a more elegant way, maybe using data.table. It seems to me that this is a very standard question, and I apologize if it is; my inability to formulate a precise title for this problem (as you may have noticed) is maybe why I haven't found any similar question when searching the forum.
Thanks!
Sure, it can be done in data.table:
library(data.table)
setDT(df)
df[ melt(Inc, id.var="ZIP2", variable.name="eth", value.name="Inc"),
Inc := i.Inc
, on=c(ZIP1 = "ZIP2","eth") ]
The syntax for this "merge-assign" operation is X[i, Xcol := expression, on=merge_cols].
You can run the i = melt(Inc, id.var="ZIP2", variable.name="eth", value.name="Inc") part on its own to see how it works. Inside the merge, columns from i can be referred to with i.* prefixes.
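For reference, the first rows of that melted lookup table look like this (the values follow from the Inc defined in the question):
melt(Inc, id.var = "ZIP2", variable.name = "eth", value.name = "Inc")
#   ZIP2 eth Inc
# 1    1   A  56
# 2    2   A  98
# 3    3   A  43
# ...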
Alternately...
setDT(df)
setDT(Inc)
df[, Inc := Inc[.(ZIP1), eth, on="ZIP2", with=FALSE], by=eth]
This is built on a similar idea. The package vignettes are a good place to start for this sort of syntax.
We can use row/column indexing
df$Inc <- Inc[cbind(match(df$ZIP1, Inc$ZIP2), match(df$eth, colnames(Inc)))]
df
# eth ZIP1 Inc
#1 A 1 56
#2 B 1 49
#3 B 2 10
#4 A 3 43
#5 C 5 17
What about this?
library(reshape2)
merge(df, melt(Inc, id="ZIP2"), by.x = c("ZIP1", "eth"), by.y = c("ZIP2", "variable"))
ZIP1 eth value
1 1 A 56
2 1 B 49
3 2 B 10
4 3 A 43
5 5 C 17
Another option:
library(dplyr)
library(tidyr)
Inc %>%
gather(eth, value, -ZIP2) %>%
left_join(df, ., by = c("eth", "ZIP1" = "ZIP2"))
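In current tidyr, gather() is superseded by pivot_longer(); an equivalent sketch:
library(dplyr)
library(tidyr)
Inc %>%
  pivot_longer(-ZIP2, names_to = "eth", values_to = "Inc") %>%
  left_join(df, ., by = c("eth", "ZIP1" = "ZIP2"))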
My solution (which may seem awkward):
for (i in seq_along(df$eth)) {
  # row of Inc whose ZIP2 matches this observation's ZIP1, column named after its eth
  df$Inc[i] <- Inc[match(df$ZIP1[i], Inc$ZIP2), as.character(df$eth[i])]
}

getting from histogram counts to cdf

I have a dataframe where I have values, and for each value I have the counts associated with that value. So, plotting counts against values gives me the histogram. I have three types, a, b, and c.
value counts type
0 139648267 a
1 34945930 a
2 5396163 a
3 1400683 a
4 485924 a
5 204631 a
6 98599 a
7 53056 a
8 30929 a
9 19556 a
10 12873 a
11 8780 a
12 6200 a
13 4525 a
14 3267 a
15 2489 a
16 1943 a
17 1588 a
... ... ...
How do I get from this to a CDF?
So far, my approach is super inefficient: I first write a function that sums up the counts up to that value:
get_cumulative <- function(x) {
  result <- numeric(nrow(x))
  for (i in seq_along(result)) {
    # total count over all values up to and including this one
    result[i] <- sum(x[x$value <= x$value[i], ]$counts)
  }
  x$cumulative <- result
  x
}
Then I wrap this in a ddply that splits by the type. This is obviously not the best way, and I'd love any suggestions on how to proceed.
You can use ave and cumsum (assuming your data is in df and sorted by value):
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
Here is a toy example:
df <- data.frame(counts=sample(1:100, 10), type=rep(letters[1:2], each=5))
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
that produces:
counts type cdf
1 55 a 0.2750000
2 61 a 0.5800000
3 27 a 0.7150000
4 20 a 0.8150000
5 37 a 1.0000000
6 45 b 0.1836735
7 79 b 0.5061224
8 12 b 0.5551020
9 63 b 0.8122449
10 46 b 1.0000000
If your data is in a data frame DF, then the following should do:
do.call(rbind, lapply(split(DF, DF$type),
                      function(d) transform(d, cumulative = cumsum(d$counts))))
# divide by sum(d$counts) inside the function to get a CDF instead of cumulative counts
The HistogramTools package on CRAN has several functions for converting between Histograms and CDFs, calculating information loss or error margins, and plotting functions to help with this.
If you have a histogram h then calculating the Empirical CDF of the underlying dataset is as simple as:
library(HistogramTools)
h <- hist(runif(100), plot=FALSE)
plot(HistToEcdf(h))
If you first need to convert your input data of breaks and counts into an R Histogram object, then see the PreBinnedHistogram function first.
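If you would rather avoid the package dependency, here is a base-R sketch that builds a step-function CDF per type directly from the value/counts pairs (assuming the data frame is called df and has the columns value, counts, and type shown in the question):
cdf_by_type <- lapply(split(df, df$type), function(d) {
  d <- d[order(d$value), ]
  # right-continuous step function giving P(X <= x)
  approxfun(d$value, cumsum(d$counts) / sum(d$counts),
            method = "constant", yleft = 0, yright = 1)
})
cdf_by_type[["a"]](5)  # estimated P(X <= 5) for type "a"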

Hash or List-Backed Levels of a Factor

I'm dealing with a categorical variable retrieved from a database and want to use factors to maintain the "fullness" of the data.
For instance, I have a table which stores colors and their associated numerical ID
ID | Color
------+-------
1 | Black
1805 | Red
3704 | White
So I'd like to use a factor to store this information in a data frame such as:
Car Model | Color
----------+-------
Civic | Black
Accord | White
Sentra | Red
where the color column is a factor and the underlying data stored, rather than being a string, is actually c(1, 3704, 1805) -- the IDs associated with each color.
So I can create a custom factor by modifying the levels attribute of an object of the factor class to achieve this effect.
Unfortunately, as you can see in the example, my IDs are not consecutive. In my application, I have ~30 levels and the maximum ID for one level is ~9,000. Because a factor's levels are stored in an array, that means I'd be storing an integer vector of length ~9,000 with only 30 meaningful entries.
Is there any way to use a hash or list to accomplish this effect more efficiently? i.e. if I were to use a hash in the levels attribute of a factor, I could store all 30 elements with whatever indices I please without having to create an array of size max(ID).
Thanks in advance!
Well, I'm pretty sure you can't change how factors work. A factor always has level ids that are integer numbers 1..n where n is the number of levels.
...but you can easily have a translation vector to get to your color ids:
# The translation vector...
colorIds <- c(Black=1,Red=1805,White=3704)
# Create a factor with the correct levels
# (but with level ids that are 1,2,3...)
f <- factor(c('Red','Black','Red','White'), levels=names(colorIds))
as.integer(f) # 2 1 2 3
# Translate level ids to your color ids
colorIds[f] # 1805 1 1805 3704
Technically, colorIds does not need to define the names of the colors, but it makes it easier to have in one place since the names are used when creating the levels for the factor. You want to specify the levels explicitly so that the numbering of them matches even if the levels are not in alphabetical order (as yours happen to be).
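If you later need to map stored IDs back to color names, the same vector can be used in reverse; a small sketch:
ids <- colorIds[f]                     # 1805 1 1805 3704
names(colorIds)[match(ids, colorIds)]  # "Red" "Black" "Red" "White"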
EDIT: It is, however, possible to create a class deriving from factor that has the codes as an attribute. Let's call this new glorious class foo:
foo <- function(x = character(), levels, codes) {
f <- factor(x, levels)
attr(f, 'codes') <- codes
class(f) <- c('foo', class(f))
f
}
`[.foo` <- function(x, ...) {
y <- NextMethod('[')
attr(y, 'codes') <- attr(x, 'codes')
y
}
as.integer.foo <- function(x, ...) attr(x,'codes')[unclass(x)]
# Try it out
set.seed(42)
f <- foo(sample(LETTERS[1:5], 10, replace=TRUE), levels=LETTERS[1:5], codes=101:105)
d <- data.frame(i=11:15, f=f)
# Try subsetting it...
d2 <- d[2:5,]
# Gets the codes, not the level ids...
as.integer(d2$f) # 105 102 105 104
You could then also fix print.foo etc...
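For completeness, a possible print method (a sketch, not part of the original answer) that shows the labels together with their stored codes:
print.foo <- function(x, ...) {
  out <- as.character(x)
  names(out) <- as.integer(x)  # as.integer.foo returns the custom codes
  print(out)
  invisible(x)
}
print(f)  # prints the labels, named by their codes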
In thinking about it, the only feature that a "level" needs to implement in order to have a valid factor is the [ accessor. So any object implementing the [ accessor could be viewed as a vector from the standpoint of any interfacing function.
I looked into the hash class, but saw that it uses the normal R behavior (as seen in lists) of returning a slice of the original hash when using only a single bracket (while extracting the actual value when using the double bracket). However, when I overrode this using setMethod(), I was actually able to get the desired behavior.
library(hash)
setMethod(
  '[',
  signature(x = "hash", i = "ANY", j = "missing", drop = "missing"),
  function(x, i, j, ..., drop) {
    if (class(i) == "factor") {
      # presumably trying to look up the values associated with the ordered keys in this hash
      toReturn <- NULL
      for (k in make.keys(as.integer(i))) {
        toReturn <- c(toReturn, get(k, envir = x@.xData))
      }
      return(toReturn)
    }
    # default: just make keys and get from the environment
    toReturn <- NULL
    for (k in make.keys(i)) {
      toReturn <- c(toReturn, get(k, envir = x@.xData))
    }
    return(toReturn)
  }
)
as.character.hash <- function(h) {
  as.character(values(h))
}
print.hash <- function(h) {
  print(as.character(h))
}
h <- hash(1:26, letters)
df <- data.frame(ID=1:26, letter=26:1, stringsAsFactors=FALSE)
attributes(df$letter)$class <- "factor"
attributes(df$letter)$levels <- h
> df
ID letter
1 1 z
2 2 y
3 3 x
4 4 w
5 5 v
6 6 u
7 7 t
8 8 s
9 9 r
10 10 q
11 11 p
12 12 o
13 13 n
14 14 m
15 15 l
16 16 k
17 17 j
18 18 i
19 19 h
20 20 g
21 21 f
22 22 e
23 23 d
24 24 c
25 25 b
26 26 a
> attributes(df$letter)$levels
<hash> containing 26 key-value pair(s).
1 : a
10 : j
11 : k
12 : l
13 : m
14 : n
15 : o
16 : p
17 : q
18 : r
19 : s
2 : b
20 : t
21 : u
22 : v
23 : w
24 : x
25 : y
26 : z
3 : c
4 : d
5 : e
6 : f
7 : g
8 : h
9 : i
>
> df[1,2]
[1] z
Levels: a j k l m n o p q r s b t u v w x y z c d e f g h i
> as.integer(df$letter)
[1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
[26] 1
Any feedback on this? As best I can tell, everything's working. It looks like it works properly as far as printing, and the underlying data stored in the actual data.frame is untouched, so I don't feel like I'm jeopardizing anything there. I may even be able to get away with adding a new class into my package which just implements this accessor to avoid having to add a dependency on the hash class.
Any feedback or points on what I'm overlooking would be much appreciated.
