Is there a general inverse of the table() function? - r

I am aware that a little programming allows converting fixed-dimension frequency tables, as returned e.g. by table(), back into observation data. So the aim is to convert a frequency table such as this one...
(flower.freqs <- with(iris,table(Petal=cut(Petal.Width,2),Species)))
Species
Petal setosa versicolor virginica
(0.0976,1.3] 50 28 0
(1.3,2.5] 0 22 50
...back into a data.frame() whose number of rows equals the sum of the counts in the input table, with the cell values taken from the labels of the input dimensions:
Petal Species
1 (0.0976,1.3] setosa
2 (0.0976,1.3] setosa
3 (0.0976,1.3] setosa
# ... (150 rows) ...
With some tinkering I built a rough prototype that should also digest higher-dimensional inputs:
tableinv <- untable <- function(x) {
stopifnot(is.table(x))
obs <- as.data.frame(x)[rep(1:prod(dim(x)),c(x)),-length(dim(x))-1]
rownames(obs) <- NULL; obs
}
> head(tableinv(flower.freqs)); dim(tableinv(flower.freqs))
Petal Species
1 (0.0976,1.3] setosa
2 (0.0976,1.3] setosa
3 (0.0976,1.3] setosa
4 (0.0976,1.3] setosa
5 (0.0976,1.3] setosa
6 (0.0976,1.3] setosa
[1] 150 2
> head(tableinv(Titanic)); nrow(tableinv(Titanic))==sum(Titanic)
Class Sex Age Survived
1 3rd Male Child No
2 3rd Male Child No
3 3rd Male Child No
4 3rd Male Child No
5 3rd Male Child No
6 3rd Male Child No
[1] TRUE
I am obviously proud that this bricolage reconstructs multi-attribute data.frame()s from higher-dimensional frequency tables such as Titanic - but is there an established (built-in, battle-tested) general inverse to table(), ideally one that does not depend on a specific library, that knows how to handle unlabeled dimensions, that is optimized so that it will not choke on bulky inputs, and that reasonably deals with table inputs that would correspond to factor as well as non-factor observation inputs?

I believe that your solution is pretty good. In any case, the way I would address this question is quite similar:
tableinv <- function(x){
y <- x[rep(rownames(x),x$Freq),1:(ncol(x)-1)]
rownames(y) <- c(1:nrow(y))
return(y)}
survivors <- as.data.frame(Titanic)
surv.invtab <- tableinv(survivors)
which yields
> head(surv.invtab)
Class Sex Age Survived
1 3rd Male Child No
2 3rd Male Child No
3 3rd Male Child No
4 3rd Male Child No
5 3rd Male Child No
6 3rd Male Child No
Concerning the example with the flowers, using the function tableinv() as defined above, it would first be necessary to convert the data into a data frame:
flower.freqs <- with(iris,table(Petal=cut(Petal.Width,2),Species))
flower.freqs <- as.data.frame(flower.freqs)
flower.invtab <- tableinv(flower.freqs)
The result in this case is
> head(flower.invtab)
Petal Species
1 (0.0976,1.3] setosa
2 (0.0976,1.3] setosa
3 (0.0976,1.3] setosa
4 (0.0976,1.3] setosa
5 (0.0976,1.3] setosa
6 (0.0976,1.3] setosa
Hope this helps.
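Both versions above can be checked with a round trip: tabulating the expanded rows should give back the original counts. The following is only a sketch along the same lines, in base R; the function name untable_sketch and the column name .count are made up for illustration (a non-default responseName is used so that a dimension that happens to be called Freq would not clash).

```r
# Hypothetical helper: expand a table into one row per observation.
untable_sketch <- function(x) {
  stopifnot(is.table(x))
  # One row per cell; the counts go into a column called ".count".
  cells <- as.data.frame(x, responseName = ".count")
  # Repeat each cell row as often as its count and drop the count column.
  obs <- cells[rep(seq_len(nrow(cells)), cells$.count),
               names(cells) != ".count", drop = FALSE]
  rownames(obs) <- NULL
  obs
}

obs <- untable_sketch(Titanic)
nrow(obs) == sum(Titanic)      # row count matches the total count
all(table(obs) == Titanic)     # re-tabulating recovers every cell
```

The drop = FALSE keeps the result a data.frame even for one-dimensional tables.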

In the specific case where we deal with one-dimensional frequency data, there is an easy way. Let's take an example:
mytable = table(mtcars$cyl)
#### 4 6 8
#### 11 7 14
A simple function to retrieve expanded data:
InvTable = function(tb, random = TRUE){
output = rep(names(tb), tb)
if (random) { output <- base::sample(output, replace=FALSE) }
return(output)
}
InvTable(mytable, T)
#### [1] "4" "8" "8" "4" "4" "6" "6" ...
This is not exactly what the asker needs, but I think it could be very helpful in many similar cases.
Just beware that the result is in character format, which is not always what we need (so add an as.numeric if needed).
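For the as.numeric conversion mentioned above, a minimal sketch could look like this (the variable names tb and expanded are only illustrative, and the random shuffling is left out):

```r
tb <- table(mtcars$cyl)

# names(tb) are the character labels "4", "6", "8";
# repeat each label by its count, then convert back to numbers.
expanded <- as.numeric(rep(names(tb), times = as.vector(tb)))

# Same values as the original column, apart from the order.
identical(expanded, sort(mtcars$cyl))
```

Note that this only makes sense when the labels really are numbers; for factor-like labels the character result is usually what you want.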

Related

How do I delete all rows based on a loop in R

I am writing a for loop to delete rows in which all of the values in columns 5 through 8 are NA. However, it only deletes SOME of the rows. When I use a while loop, it deletes all of the rows, but I have to end it manually (i.e. it is an infinite loop, and I have no idea why).
The for/if loop:
for(i in 1:nrow(df)){
if(is.na(df[i,5]) && is.na(df[i,6]) &&
is.na(df[i,7]) && is.na(df[i,8])){
df<- df[-i,]
}
}
while loop (but it is infinite):
for(i in 1:nrow(df)){
while(is.na(df[i,5]) && is.na(df[i,6]) &&
is.na(df[i,7]) && is.na(df[i,8])){
df<- df[-i,]
}
}
Can someone help? Thanks!
What's happening here is that when you remove a row in this way, all the rows below it "move up" to fill the space left behind. When there are repeated rows that should be deleted, the second one gets skipped over. Imagine this table:
1 keep
2 delete
3 delete
4 keep
Now, you loop through a sequence from 1 to 4 (the number of rows) deleting rows that say delete:
i = 1, keep that row ...
i = 2, delete that row. Now, the data frame looks like this:
1 keep
2 delete
3 keep
i = 3, the 3rd row says keep, so keep it ... The final table is:
1 keep
2 delete
3 keep
In your example with while, however, the deletion step keeps running on row 2 until that row doesn't meet the conditions instead of moving on to i = 3 right away. So the process goes:
i = 1, keep that row ...
i = 2, delete that row. Now, the data frame looks like this:
1 keep
2 delete
3 keep
i = 2 (again), delete that row (again). Now, the data frame looks like this:
1 keep
2 keep
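i = 2 (again), this row says keep, so keep it and move on to i = 3
The loop never ends because the for sequence was fixed at the original number of rows. Once rows have been deleted, i eventually points past the last row; df[i,5] returns NA there, so the condition is always TRUE, while df[-i,] no longer removes anything.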
I'd be remiss to answer this question without mentioning that there are much better ways to do this in R such as square bracket notation (enter ?`[` in the R console), the filter function in the dplyr package, or the data.table package.
This question has many options: Filter data.frame rows by a logical condition
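As a minimal sketch of the square bracket approach, assuming a small made-up data frame whose columns 5 to 8 hold the values to check (all column names and values here are invented for illustration):

```r
df <- data.frame(
  id = 1:6, a = letters[1:6], b = 6:1, c = 1,
  v5 = c(1, NA, NA, 4, NA, 6),
  v6 = c(1, NA, NA, NA, NA, 6),
  v7 = c(NA, NA, NA, 4, NA, 6),
  v8 = c(1, NA, NA, 4, NA, 6)
)

block <- df[, 5:8]
# TRUE for rows in which every one of columns 5 to 8 is NA
all_na <- rowSums(is.na(block)) == ncol(block)

# Keep everything else in one step; no loop, so no shifting row numbers
cleaned <- df[!all_na, ]
cleaned$id
```

Because the logical vector is computed once on the untouched data frame, rows that sit next to each other are handled correctly.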
Store the row number in a vector and remove outside the loop.
test <- iris
test[1:5,2:4] <- NA
> head(test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 NA NA NA setosa
2 4.9 NA NA NA setosa
3 4.7 NA NA NA setosa
4 4.6 NA NA NA setosa
5 5.0 NA NA NA setosa
6 5.4 3.9 1.7 0.4 setosa
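x <- integer(0)  # row numbers to drop
for(i in seq_len(nrow(test))){
if(is.na(test[i,2]) && is.na(test[i,3]) &&
is.na(test[i,4])){
x <- c(x,i)
}
}
x
# test[-integer(0),] would return zero rows, so only subset when something matched
if(length(x) > 0) test <- test[-x,]
> head(test)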
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa

Calculation via factor results in a by-list - how to circumvent?

I have a data.frame as following:
Lot Wafer Voltage Slope Voltage_irradiated Slope_irradiated m_dist_lot
1 8 810 356.119 6.08423 356.427 6.13945 NA
2 8 818 355.249 6.01046 354.124 6.20855 NA
3 9 917 346.921 6.21474 346.847 6.33904 NA
4 (...)
120 9 914 353.335 6.15060 352.540 6.19277 NA
121 7 721 358.647 6.10592 357.797 6.17244 NA
122 (...)
My goal is simple to state but a bit tricky to achieve, and there are probably several ways to solve it:
I want to apply a function "func" to each row according to a factor, e.g. the factor "Lot". This is done via
m_dist_lot<- by(data.frame, data.frame$Lot,func)
This actually works but the result is a by-list:
data.frame$Lot: 7
354 355 363 367 378 419 426 427 428 431 460 477 836
3.5231249 9.4229589 1.4996504 7.2984485 7.6883170 1.2354754 1.8547674 3.1129814 4.4303001 1.9634573 3.7281868 3.6182559 6.4718306
data.frame$Lot: 8
1 2 11 15 17 18 19 20 21 22 24 25
2.1415352 4.6459868 1.3485551 38.8218984 3.9988686 2.2473563 6.7186047 2.6433790 0.5869746 0.5832567 4.5321623 1.8567318
The first row seems to be the row number of the initial data.frame that the data is taken from. The second row contains the calculated values.
My problem now is: How can I store these values properly in the original data.frame, in the correct rows?
For example in case of one certain calculation/row of the data frame:
m_dist_lot<- by(data.frame, data.frame$Lot,func)
results for the second row of the data.frame in
data.frame$Lot: 8
2
4.6459868
I want to store the value 4.6459868 in data.frame$m_dist_lot according to the correct row "2":
Lot Wafer Voltage Slope Voltage_irradiated Slope_irradiated m_dist_lot
1 8 810 356.119 6.08423 356.427 6.13945 NA
2 8 818 355.249 6.01046 354.124 6.20855 4.6459868
3 9 917 346.921 6.21474 346.847 6.33904 NA
4 (...)
120 9 914 353.335 6.15060 352.540 6.19277 NA
121 7 721 358.647 6.10592 357.797 6.17244 NA
122 (...)
but I don't know how. My best try actually is to use "unlist".
un<- unlist(m_dist_lot) results in
un[1]
6.354
3.523125
un[2]
6.355
9.422959
un[3]
(..)
But I still don't know how I can "separate" the information of "factor.row" and "calculated" value in such a way that the information is stored correctly in the data frame.
At least when using un<- unlist(m_dist_lot, use.names = FALSE) the factors are not present:
un[1]
3.523125
un[2]
9.422959
un[3]
1.49965
(..)
But now I lack the information of how to assign these values properly into the data.frame.
Using un<- do.call(rbind, lapply(m_dist_lot, data.frame, stringsAsFactors=FALSE)) results in
(...)
7.922 0.94130936
7.976 4.89560441
8.1 2.14153516
8.2 4.64598677
8.11 1.34855514
(...)
Here I still lack a proper assignment of calculated values <> data.frame.
I'm sure there must be a doable way. Do you know a good method?
Without reproducible data or an example of what you want func to do, I am guessing a bit here. However, I think that dplyr is going to be the answer for you.
First, I am going to use the pipe (%>%) from dplyr (exported from magrittr) to pass the builtin iris data through a series of functions. If what you are trying to calculate requires the full data.frame (and not just a column or two), you could modify this approach to do what you want (just write your function to take a data.frame, add the column(s) of interest, then return the full data.frame).
Here, I first split the iris data by Species (this creates a list, with a separate data.frame for each species). Next, I use lapply to run the function head on each element of the list. This returns a list of data.frames that now each only have three rows. (You could replace head with your function of interest here, as long as it returns a full data.frame.) Finally, I stitch each element of the list back together with bind_rows.
library(dplyr)  # provides %>% and bind_rows
topIris <-
iris %>%
split(.$Species) %>%
lapply(head, n = 3) %>%
bind_rows()
This returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7.0 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 6.3 3.3 6.0 2.5 virginica
8 5.8 2.7 5.1 1.9 virginica
9 7.1 3.0 5.9 2.1 virginica
Which I am going to use to illustrate the approach that I think will actually address your underlying problem.
The group_by function from dplyr allows a similar approach, but without having to split the data.frame. When a data.frame is grouped, any functions applied to it are applied separately by group. Here is an example in action, which ranks the sepal lengths within each species. This is obviously not terribly useful directly, but you could write a custom function which takes any number of columns as arguments (which are then passed in as vectors) and returns a vector of the same length (to create a new column or update an existing one). The select function at the end is only there to make it easier to see what I did.
topIris %>%
group_by(Species) %>%
mutate(rank_Sepal_Length = rank(Sepal.Length)) %>%
select(Species, rank_Sepal_Length, Sepal.Length)
Returns:
Species rank_Sepal_Length Sepal.Length
<fctr> <dbl> <dbl>
1 setosa 3 5.1
2 setosa 2 4.9
3 setosa 1 4.7
4 versicolor 3 7.0
5 versicolor 1 6.4
6 versicolor 2 6.9
7 virginica 2 6.3
8 virginica 1 5.8
9 virginica 3 7.1
I got a workaround with the help of the question "Force gsub to keep trailing zeros":
un<- do.call(rbind, lapply(list, data.frame, stringsAsFactors=FALSE))
un<- gsub(".*.","", un)
un<- regmatches(un, gregexpr("(?<=.).*", un, perl=TRUE))
rows<- data.frame(matrix(ncol = 1, nrow = lengths(un)))
colnames(rows)<- c("row_number")
rows["row_number"]<- sprintf("%s", rownames(un))
rows["row_number"]<- as.numeric(un[,1])
rows["row_number"]<- sub("^[^.]*[.]", "", format(rows[,1], width = max(nchar(rows[,1]))))
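For writing per-group results back into the correct rows, a base R alternative that avoids parsing the names of the unlisted result is split() followed by unsplit(). This is only a sketch under assumptions: the data frame wafers is made up, and dist_from_group_mean is a hypothetical stand-in for func, which is assumed to return one value per row of the group it receives.

```r
wafers <- data.frame(
  Lot   = c(8, 9, 8, 7, 9, 7),
  Slope = c(6.08, 6.21, 6.01, 6.10, 6.15, 6.30)
)

# Hypothetical stand-in for func: one value per row of the group
dist_from_group_mean <- function(d) abs(d$Slope - mean(d$Slope))

# Apply the function to each Lot separately ...
pieces <- lapply(split(wafers, wafers$Lot), dist_from_group_mean)

# ... and put the values back in the original row order
wafers$m_dist_lot <- unsplit(pieces, wafers$Lot)
wafers
```

unsplit() reverses split() for the same grouping variable, so each value lands in the row it was computed from, even when the lots are interleaved.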

Is there a package that I can use in order to get rules for a target outcome in R

For example, in the data set below I would like to get the best values of each variable that will yield a pre-set value of "percentage". Say I need the value of "percentage" to be >= 0.7; in this case the outcome should be something like:
birds >=5,1<wolfs<=3 , 2<=snakes <=4
Example data set:
dat <- read.table(text = "birds wolfs snakes percentage
3 8 7 0.50
1 2 3 0.33
5 1 1 0.66
6 3 2 0.80
5 2 4 0.74", header = TRUE)
I can't use decision trees because I have a large data frame and can't view the whole tree properly. I tried the *arules* package, but it requires all variables to be factors, whereas I have a mixed dataset of factor, logical and continuous variables, and I would like to keep the variables, including the independent variable, continuous. Also, "percentage" should be the only variable that I optimize.
The code that I wrote with *arules* package is this:
library(arules)
dat$birds<-as.factor(dat$birds)
dat$wolfs<-as.factor(dat$wolfs)
dat$snakes<-as.factor(dat$snakes)
dat$percentage<-as.factor(dat$percentage)
rules<-apriori(dat, parameter = list(minlen=2, supp=0.005, conf=0.8))
Thank you
I may have misunderstood the question but to get the maximum value of each variable with the restriction of percentage >= 0.7 you could do this:
lapply(dat[dat$percentage >= 0.7, 1:3], max)
$birds
[1] 6
$wolfs
[1] 3
$snakes
[1] 4
Edit after comment:
So perhaps this is more what you are looking for:
> as.data.frame(lapply(dat[dat$percentage >= 0.7,1:3], function(y) c(min(y), max(y))))
birds wolfs snakes
1 5 2 2
2 6 3 4
It will give the min and max values representing the ranges of the variables for which percentage >= 0.7.
If this is completely missing what you are trying to achieve, I may not be the right person to help you.
Edit #2:
> as.data.frame(lapply(dat[dat$percentage >= 0.7,1:3], function(y) c(min(y), max(y), length(y), length(y)/nrow(dat))))
birds wolfs snakes
1 5.0 2.0 2.0
2 6.0 3.0 4.0
3 2.0 2.0 2.0
4 0.4 0.4 0.4
Row 1: min
Row 2: max
Row 3: number of observations meeting the condition
Row 4: percentage of observations meeting the condition (relative to total observations)
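The min/max step can also be written with range(). This is just a compact sketch of the same idea on the example data from the question (the names keep and rng are illustrative); it only reports observed ranges and is not a rule-mining method.

```r
dat <- read.table(text = "birds wolfs snakes percentage
3 8 7 0.50
1 2 3 0.33
5 1 1 0.66
6 3 2 0.80
5 2 4 0.74", header = TRUE)

# Rows that reach the target outcome
keep <- dat$percentage >= 0.7

# One column per variable: first row is the minimum, second the maximum
rng <- sapply(dat[keep, 1:3], range)
rownames(rng) <- c("min", "max")
rng
```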

Per-group operation on multiple columns on data.frame [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
General problem I have very often: I want to perform some operation on a data.frame, which for each factor level will produce one number, and for this it uses information from multiple columns. How to write that in R?
I considered these functions:
tapply - doesn't operate on multiple columns
aggregate - the function is given the columns separately
ave - the result has the same number of rows as input, not as the number of factors' levels
by - this was the hottest candidate, but I hate the format returned - the list. I want data.frame as result, I know I can convert it but it is ugly, I prefer another solution!
A base R solution is to use a combination of lapply and split:
> data.frame(lapply(split(iris[,1:4], iris[,5]), colMeans))
setosa versicolor virginica
Sepal.Length 5.006 5.936 6.588
Sepal.Width 3.428 2.770 2.974
Petal.Length 1.462 4.260 5.552
Petal.Width 0.246 1.326 2.026
...or you could wrap that in do.call(rbind, ...) to get the output in a slightly different form:
> data.frame(do.call(rbind,lapply(split(iris[,1:4], iris[,5]), colMeans)))
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
...or use sapply if your data can be stored in a matrix:
> sapply(split(iris[,1:4], iris[,5]), colMeans)
setosa versicolor virginica
Sepal.Length 5.006 5.936 6.588
Sepal.Width 3.428 2.770 2.974
Petal.Length 1.462 4.260 5.552
Petal.Width 0.246 1.326 2.026
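The same split pattern also covers the case from the question where one number per factor level is computed from several columns at once. A minimal sketch, using the correlation of two iris columns purely as an example statistic (the names per_group, res and cor_sepal are illustrative):

```r
# Each piece is a full data.frame, so the function can use any columns
per_group <- sapply(split(iris, iris$Species),
                    function(d) cor(d$Sepal.Length, d$Sepal.Width))

# One row per level, as a data.frame rather than a list
res <- data.frame(Species = names(per_group),
                  cor_sepal = unname(per_group))
res
```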
The OP is asking for a general answer, so I think the 'plyr' package is the most appropriate. The 'plyr' package has limitations when approaching large data sets, but for everyday use (implied in the original post), the 'plyr' functions are wonderful assets for any R user.
Setup: Here is a quick data sample for us to work with.
data <- data.frame(id=1:50, group=sample(letters[1:3], 50, rep=TRUE), x_Value=sample(1:500, 50), y_Value=sample(2:5, 50, rep=TRUE)*100)
How to use plyr: I'm just going to address the basic uses here as an example to get things started. First, load up the package.
library(plyr)
Now, let's start calculating things. With the 'plyr' functions, you choose the first two letters of the function based on your input and output. In this example, I will be inputting a data frame (d) and outputting a data frame (d), so I will use the 'ddply' function.
The 'ddply' function uses this syntax:
ddply(
data_source,
.(grouping_variables),
function,
column_definitions)
First, let's quickly find out how many entries belong to groups a, b, and c:
ddply(
data,
.(group),
summarize,
N=length(id))
# group N
# 1 a 17
# 2 b 16
# 3 c 17
Here, we specified the data source first, and then specified that we wanted to group lines by the 'group' variable. We use the 'summarize' function to trash all of the columns except those in our grouping_variables and column_definitions. Using the 'length' function is basically just a count for this purpose.
Now, let's add a column to the data that shows the group means for the x and y values.
ddply(
data,
.(group),
mutate,
group_mean_x=mean(x_Value),
group_mean_y=mean(y_Value))
# id group x_Value y_Value group_mean_x group_mean_y
# 1 8 a 301 300 218.7059 394.1176
# 2 13 a 38 500 218.7059 394.1176
# 3 14 a 425 300 218.7059 394.1176
# .....................................................
# 17 47 a 191 300 218.7059 394.1176
# 18 5 b 411 500 235.1875 325.0000
# 19 6 b 121 400 235.1875 325.0000
# 20 11 b 151 200 235.1875 325.0000
# .....................................................
# 33 49 b 354 200 235.1875 325.0000
# 34 1 c 482 400 246.1765 400.0000
# 35 2 c 43 300 246.1765 400.0000
# .....................................................
# 50 50 c 248 500 246.1765 400.0000
I've truncated the results to make it shorter. Here, we used the same data source and grouping variable, but the 'mutate' function preserves all of the data in the data source while adding columns.
Now, let's do a two-step effort with the previous data. Let's show the means and the difference between the x and y mean values in a summary table.
ddply(
data,
.(group),
summarize,
group_mean_x=mean(x_Value),
group_mean_y=mean(y_Value),
difference=group_mean_x - group_mean_y)
# group group_mean_x group_mean_y difference
# 1 a 218.7059 394.1176 -175.4118
# 2 b 235.1875 325.0000 -89.8125
# 3 c 246.1765 400.0000 -153.8235
I show you this example, because there is something important going on... we're using columns that we just defined as part of a different column's definition. This is very, very useful when creating summary tables.
Finally, let's group by two factors: the group and the digit in the 10^2 place of the x value. Let's create a summary table that shows the mean x and y values for each group and 10^2 digit x value.
ddply(
data,
.(group, x_100=as.integer(x_Value/100)),
summarize,
mean_x=mean(x_Value),
mean_y=mean(y_Value))
# group x_100 mean_x mean_y
# 1 a 0 20.0000 425.0000
# 2 a 1 145.6667 333.3333
# 3 a 2 272.0000 400.0000
# 4 a 3 328.6667 433.3333
# 5 a 4 427.5000 350.0000
# 6 b 0 37.0000 200.0000
# 7 b 1 148.6667 383.3333
# 8 b 2 230.0000 325.0000
# 9 b 3 363.0000 200.0000
# 10 b 4 412.5000 400.0000
# 11 c 0 55.6000 360.0000
# 12 c 1 173.5000 350.0000
# 13 c 2 262.5000 450.0000
# 14 c 3 355.6667 400.0000
# 15 c 4 481.0000 433.3333
This example is important, because it shows us two things: we can create grouping columns using vectorized statements and we can group by more than one column by separating the list of columns with a comma.
This quick set of examples should be enough to get started using the 'plyr' package. More details can be found in help(plyr).
ddply from the plyr package splits a data.frame by one or more factors, performs a function for each of the splits and returns a data.frame as a result. You might want to look there.
Searching on SO will produce many answers, here's a simple example.
library(data.table)
dt = data.table(a = c(1:6), b = c(1,1,1,2,2,2), c = c(1,2,1,2,1,2))
dt
# a b c
#1: 1 1 1
#2: 2 1 2
#3: 3 1 1
#4: 4 2 2
#5: 5 2 1
#6: 6 2 2
dt[, sum(a), by = list(b, c)]
# b c V1
#1: 1 1 4
#2: 1 2 2
#3: 2 2 10
#4: 2 1 5
Even in this simple example one can see the advantages over plyr's ddply - easier (more human and shorter) syntax, preservation of grouping order and of course faster speed. (for reference the plyr version would be ddply(dt, .(b, c), summarize, sum(a)))

drawing a stratified sample in R

Designing my stratified sample
library(survey)
design <- svydesign(id=~1,strata=~Category, data=billa, fpc=~fpc)
So far so good, but how can I now draw a sample, the same way I could for simple random sampling?
set.seed(67359)
samplerows <- sort(sample(x=1:N, size=n.pre$n))
If you have a stratified design, then I believe you can sample randomly within each stratum. Here is a short algorithm to do proportional sampling in each stratum, using ddply:
library(plyr)
set.seed(1)
dat <- data.frame(
id = 1:100,
Category = sample(LETTERS[1:3], 100, replace=TRUE, prob=c(0.2, 0.3, 0.5))
)
sampleOne <- function(id, fraction=0.1){
sort(sample(id, round(length(id)*fraction)))
}
ddply(dat, .(Category), summarize, sampleID=sampleOne(id, fraction=0.2))
Category sampleID
1 A 21
2 A 29
3 A 72
4 B 13
5 B 20
6 B 42
7 B 58
8 B 82
9 B 100
10 C 1
11 C 11
12 C 14
13 C 33
14 C 38
15 C 40
16 C 63
17 C 64
18 C 71
19 C 92
Take a look at the sampling package on CRAN, and the strata function in particular.
This is a good package to know if you're doing surveys; there are several vignettes available from its page on CRAN.
The task view on "Official Statistics" includes several topics that are closely related to these issues of survey design and sampling - browsing through it and the packages recommended may also introduce other tools that you can use in your work.
You can draw a stratified sample using dplyr. First we group by the column or columns we are interested in, then we sample within each group. In our example, we draw 3 records of each Species.
library(dplyr)
set.seed(1)
iris %>%
group_by (Species) %>%
sample_n(., 3)
Output:
Source: local data frame [9 x 5]
Groups: Species
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.3 3.0 1.1 0.1 setosa
2 5.7 3.8 1.7 0.3 setosa
3 5.2 3.5 1.5 0.2 setosa
4 5.7 3.0 4.2 1.2 versicolor
5 5.2 2.7 3.9 1.4 versicolor
6 5.0 2.3 3.3 1.0 versicolor
7 6.5 3.0 5.2 2.0 virginica
8 6.4 2.8 5.6 2.2 virginica
9 7.4 2.8 6.1 1.9 virginica
here's a quick way to sample three records per distinct 'carb' value from the mtcars data frame without replacement
# choose how many records to sample per unique 'carb' value
records.per.carb.value <- 3
# draw the sample
your.sample <-
mtcars[
unlist(
tapply(
1:nrow( mtcars ) ,
mtcars$carb ,
sample ,
records.per.carb.value
)
) , ]
# print the results to the screen
your.sample
note that the survey package is mostly used for analyzing complex sample survey data, not creating it. @Iterator is right that you should check out the sampling package for more advanced ways to create complex sample survey data. :)
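For completeness, proportional sampling within each stratum can also be sketched in base R with split(). The data frame dat, the 20% fraction and the stratum sizes below are assumptions chosen for illustration only.

```r
set.seed(1)
dat <- data.frame(
  id = 1:100,
  Category = rep(c("A", "B", "C"), times = c(20, 30, 50))
)
fraction <- 0.2

# Draw round(n * fraction) ids without replacement inside each stratum.
# sample.int() on positions avoids the surprise of sample(x) when x has length 1.
picked <- unlist(lapply(split(dat$id, dat$Category), function(ids) {
  ids[sample.int(length(ids), round(length(ids) * fraction))]
}))

strat_sample <- dat[dat$id %in% picked, ]
table(strat_sample$Category)
```

Each stratum contributes the same share of its rows, so the sample keeps the proportions of the strata.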
