converting an ftable to a matrix - r

Take for example the following ftable
height <- c(rep('short', 7), rep('tall', 3))
girth <- c(rep('narrow', 4), rep('wide', 6))
measurement <- rnorm(10)
foo <- data.frame(height=height, girth=girth, measurement=measurement)
ftable.result <- ftable(foo$height, foo$girth)
I'd like to convert the above ftable.result into a matrix with row names and column names. Is there an efficient way of doing this? as.matrix() doesn't exactly work, since it won't attach the row names and column names for you.
You could do the following
ftable.matrix <- ftable.result
class(ftable.matrix) <- 'matrix'
rownames(ftable.matrix) <- unlist(attr(ftable.result, 'row.vars'))
colnames(ftable.matrix) <- unlist(attr(ftable.result, 'col.vars'))
However, it seems a bit heavy-handed. Is there a more efficient way of doing this?

It turns out that @Shane had originally posted (but quickly deleted) what is a correct answer with more recent versions of R.
Somewhere along the way, an as.matrix method was added for ftable (I haven't found it in the NEWS files I read through, though).
The as.matrix method for ftable lets you deal fairly nicely with "nested" frequency tables (which is what ftable creates quite nicely). Consider the following:
temp <- read.ftable(textConnection("breathless yes no
coughed yes no
age
20-24 9 7 95 1841
25-29 23 9 108 1654
30-34 54 19 177 1863"))
class(temp)
# [1] "ftable"
The head(as.table(...), Inf) trick doesn't work with such ftables because as.table would convert the result into a multi-dimensional array.
head(as.table(temp), Inf)
# [1] 9 23 54 95 108 177 7 9 19 1841 1654 1863
For the same reason, the second suggestion also doesn't work:
t <- as.table(temp)
class(t) <- "matrix"
# Error in class(t) <- "matrix" :
# invalid to set the class to matrix unless the dimension attribute is of length 2 (was 3)
However, with more recent versions of R, simply using as.matrix would be fine:
as.matrix(temp)
# breathless_coughed
# age yes_yes yes_no no_yes no_no
# 20-24 9 7 95 1841
# 25-29 23 9 108 1654
# 30-34 54 19 177 1863
class(.Last.value)
# [1] "matrix"
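Applied to the table from the question, the same call gives a plain matrix with the labels attached. This is a minimal sketch, assuming an R version that ships the as.matrix method for ftable; the name `m` is only illustrative, and the table is rebuilt directly from the `height` and `girth` vectors rather than from `foo`.

```r
# Rebuild the ftable from the question
height <- c(rep('short', 7), rep('tall', 3))
girth <- c(rep('narrow', 4), rep('wide', 6))
ftable.result <- ftable(height, girth)

# as.matrix() dispatches to the ftable method and carries the labels over
m <- as.matrix(ftable.result)
rownames(m)
colnames(m)

# Cells can now be addressed by name
m["short", "wide"]
```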
If you prefer a data.frame to a matrix, check out table2df from my "mrdwabmisc" package on GitHub.

I found 2 solutions on R-Help:
head(as.table(ftable.result), Inf)
Or
t <- as.table(ftable.result)
class(t) <- "matrix"

Related

R: replace columns of matrix if vector fulfills certain condition [duplicate]

I have a dataframe as follows
Me You They Him She
1 4 6 3 233
82 0 2 4 122
98 2 5 2 99
I want to get a new dataframe which only contains those columns where the colMeans are >30 so the result should look like
Me She
1 233
82 122
98 99
I tried something like
dfNew<-subset(df,colMeans(df[, 1:ncol(df)]>30))
but got the error
Error in subset.data.frame(df[, 1:ncol(df)]> :
'subset' must be logical
Clearly I don't know what I'm doing.
Try this:
df[,colMeans(df)>30]
I think this is what you are looking for.
This step just recreates your data.
Me <- c(1,82,98)
You <- c(4,0,2)
They <- c(6,2,5)
Him <- c(3,4,2)
She <- c(233,122,99)
df <- as.data.frame(cbind(Me, You, They, Him, She))
This is what you want.
df[, sapply(df, mean) > 30]
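A small sketch of the same idea, assuming the data from the question. The `drop = FALSE` argument is an addition not found in the answers above: it keeps the result a data frame even when only one column passes the threshold. The names `keep` and `dfNew` are illustrative.

```r
df <- data.frame(Me = c(1, 82, 98), You = c(4, 0, 2), They = c(6, 2, 5),
                 Him = c(3, 4, 2), She = c(233, 122, 99))

# Logical vector: TRUE for columns whose mean exceeds 30
keep <- colMeans(df) > 30

# drop = FALSE keeps a data frame even if a single column is selected
dfNew <- df[, keep, drop = FALSE]
names(dfNew)
```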

R lapply(): Change all columns within all data frames in a list to numeric, then convert all values to percentages

Question:
I am a little stumped as to how I can batch process as.numeric() (or any other function for that matter) for columns in a list of data frames.
I understand that I can view specific data frames or columns within this list by using:
> my.list[[1]]
# or columns within this data frame using:
> my.list[[1]][1]
But my trouble comes when I try to apply this into an lapply() function to change all of the data from integer to numeric.
# Example of what I am trying to do
> my.list[[each data frame in list]][each column in data frame] <-
as.numeric(my.list[[each data frame in list]][each column in data frame])
If you can assist me in any way, or know of any resources that can help me out I would appreciate it.
Background:
My data frames are structured as in the example below, where I have 5 habitat types and information on how much area an individual species' home range extends to:
# Example data
spp.1.data <- data.frame(Habitat.A = c(100,45,0,9,0), Habitat.B = c(0,0,203,45,89), Habitat.C = c(80,22,8,9,20), Habitat.D = c(8,59,77,83,69), Habitat.E = c(23,15,99,0,10))
I have multiple data frames with the above structure which I have assigned to a list object:
all.spp.data <- list(spp.1.data, spp.2.data, spp.1.data...n)
I am then trying to coerce all data frames to as.numeric() so I can create data frames of % habitat use i.e:
# data, which is now numeric as per Phil's code ;)
data.numeric <- lapply(data, function(x) {
x[] <- lapply(x, as.numeric)
x
})
> head(data.numeric[[1]])
Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E
1 100 0 80 8 23
2 45 0 22 59 15
3 0 203 8 77 99
4 9 45 9 83 0
5 0 89 20 69 10
EDIT: I would like to sum every row, in all data frames
# Add row at the end of each data frame populated by rowSums()
f <- function(i){
data.numeric[[i]]$Sums <- rowSums(data.numeric[[i]])
data.numeric[[i]]
}
data.numeric.SUM <- lapply(seq_along(data.numeric), f)
head(data.numeric.SUM[[1]])
Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E Sums
1 100 0 80 8 23 211
2 45 0 22 59 15 141
3 0 203 8 77 99 387
4 9 45 9 83 0 146
5 0 89 20 69 10 188
EDIT: This is the code I used to convert values within the data frames to % habitat used
# Used Phil's logic to convert all numbers in percentages
data.numeric.SUM.perc <- lapply(data.numeric.SUM,
function(x) {
x[] <- (x[]/x[,6])*100
x
})
Perc.Habitat.A Perc.Habitat.B Perc.Habitat.C Perc.Habitat.D Perc.Habitat.E
1 47 32 0 6 0
2 0 0 52 31 47
3 38 16 2 6 11
4 4 42 20 57 37
5 11 11 26 0 5
6 100 100 100 100 100
This is still not the most condensed way to do this, but it did the trick for me.
Thank you, Phil, Val and Leo P, for helping with this problem.
I'd do this a bit more explicitly:
all.spp.data <- lapply(all.spp.data, function(x) {
x[] <- lapply(x, as.numeric)
x
})
As a personal preference, this clearly conveys to me that I'm looping over each column in a data frame, and looping over each data frame in a list.
If you really want to do it all with lapply, here's a way to go:
lapply(all.spp.data, function(x) do.call(cbind, lapply(seq_len(ncol(x)), function(y) as.numeric(x[, y]))))
This uses a nested lapply call. The first one references the single data.frames to x. The second one references the column index for each x to y. So in the end I can reference each column by x[,y].
Since everything will be split up in single vectors, I'm calling do.call(cbind, ... ) to bring it back to a matrix. If you prefer you could add data.frame() around it to bring it back into the original type.
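Putting the pieces of this thread together, here is a minimal sketch that coerces every column to numeric and converts each row to percentages in one pass, assuming every column of every data frame is a habitat area. The helper `to.percent` and the two-element list are hypothetical; the second list entry simply reuses the example data.

```r
spp.1.data <- data.frame(Habitat.A = c(100,45,0,9,0), Habitat.B = c(0,0,203,45,89),
                         Habitat.C = c(80,22,8,9,20), Habitat.D = c(8,59,77,83,69),
                         Habitat.E = c(23,15,99,0,10))

# Hypothetical list; the second element just reuses the example data
all.spp.data <- list(spp.1.data, spp.1.data)

to.percent <- function(x) {
  x[] <- lapply(x, as.numeric)                         # coerce columns, keep the data frame shape
  totals <- rowSums(x)                                 # one total per row
  x[] <- lapply(x, function(col) col / totals * 100)   # each cell as % of its row total
  x
}

all.spp.perc <- lapply(all.spp.data, to.percent)
round(all.spp.perc[[1]], 1)
```

Assigning into `x[]` rather than `x` is what preserves the data frame class and column names, which is the same trick used in the answer above.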

getting from histogram counts to cdf

I have a dataframe where I have values, and for each value I have the counts associated with that value. So, plotting counts against values gives me the histogram. I have three types, a, b, and c.
value counts type
0 139648267 a
1 34945930 a
2 5396163 a
3 1400683 a
4 485924 a
5 204631 a
6 98599 a
7 53056 a
8 30929 a
9 19556 a
10 12873 a
11 8780 a
12 6200 a
13 4525 a
14 3267 a
15 2489 a
16 1943 a
17 1588 a
... ... ...
How do I get from this to a CDF?
So far, my approach is super inefficient: I first write a function that sums up the counts up to that value:
get_cumulative <- function(x) {
result <- numeric(nrow(x))
for (i in seq_along(result)) {
result[i] <- sum(x$counts[x$value <= x$value[i]])
}
x$cumulative <- result
x
}
Then I wrap this in a ddply that splits by the type. This is obviously not the best way, and I'd love any suggestions on how to proceed.
You can use ave and cumsum (assuming your data is in df and sorted by value):
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
Here is a toy example:
df <- data.frame(counts=sample(1:100, 10), type=rep(letters[1:2], each=5))
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
that produces:
counts type cdf
1 55 a 0.2750000
2 61 a 0.5800000
3 27 a 0.7150000
4 20 a 0.8150000
5 37 a 1.0000000
6 45 b 0.1836735
7 79 b 0.5061224
8 12 b 0.5551020
9 63 b 0.8122449
10 46 b 1.0000000
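The answer above assumes the rows are already sorted by value within each type. A minimal sketch that sorts first, using a small made-up data frame with the same `value`, `counts` and `type` columns as the question (the numbers are invented for illustration):

```r
# Made-up counts, deliberately out of order
df <- data.frame(value  = c(2, 0, 1, 1, 0),
                 counts = c(10, 60, 30, 25, 75),
                 type   = c("a", "a", "a", "b", "b"))

# Sort by type, then by value, so cumsum() accumulates in the right order
df <- df[order(df$type, df$value), ]

# Running share of the total count within each type
df$cdf <- ave(df$counts, df$type, FUN = function(x) cumsum(x) / sum(x))
df
```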
If your data is in a data.frame DF (sorted by value), then the following should do:
do.call(rbind, lapply(split(DF, DF$type), function(d) transform(d, cdf = cumsum(counts) / sum(counts))))
The HistogramTools package on CRAN has several functions for converting between Histograms and CDFs, calculating information loss or error margins, and plotting functions to help with this.
If you have a histogram h then calculating the Empirical CDF of the underlying dataset is as simple as:
library(HistogramTools)
h <- hist(runif(100), plot=FALSE)
plot(HistToEcdf(h))
If you first need to convert your input data of breaks and counts into an R Histogram object, then see the PreBinnedHistogram function first.

How to find the difference between 2 dataframes?

I have 2 dataframes which are "exactly" the same. The difference between them is that one has 676 observations (rows) and the second has 666 observations. I don't know which of those rows are missing from the second dataframe.
It would be easiest for me if someone could show me code that makes a third dataframe containing the 10 missing rows.
The name of dataframes:
- dataset1 (676)
- dataset2 (666)
Thx.
dataset1[tail(!duplicated(rbind(dataset2, dataset1)), nrow(dataset1)), ]
Here's an approach:
library(qdap)
## generate random problem
prob <- sample(1:nrow(mtcars), 1)
## remove the random problem row
mtcars2 <- mtcars[-prob, ]
## Throw it into a list of 2 dataframes so they're easier to work with
dat <- list(mtcars, mtcars2)
## Use qdap's `paste2` function to paste all columns together
dat2 <- lapply(dat, paste2)
## Find the shorter data set
wmn <- which.min(sapply(dat2, length))
## Add additional element to shorter one
dat2[[wmn]] <- c(dat2[[wmn]], NA)
## check each element of the 2 pasted data sets for equality
out <- mapply(identical, dat2[[1]], dat2[[2]])
## Which row's the problem
which(!out)[1]
which(!out)[1] == prob
If which(!out)[1] is NA, the problem is in the last row.
The first FALSE in out marks where the problem is located.
EDIT: removed the for loop
I would say try to use merge and then look for where the merge result has NA values.
Here's an example using dummy data:
set.seed(1)
df1 <- data.frame(x=rnorm(100),y=rnorm(100))
df2 <- df1[-sample(1:100,10),]
dim(df1)
# [1] 100 2
dim(df2)
# [1] 90 2
out <- merge(df1,df2,by='x',all.x=TRUE)
in1not2 <- which(is.na(out$y.y))
in1not2
# [1] 6 25 33 51 52 53 57 73 77 82
Then you can extract:
> df1[in1not2,]
x y
6 -0.8204684 1.76728727
25 0.6198257 -0.10019074
33 0.3876716 0.53149619
51 0.3981059 0.45018710
52 -0.6120264 -0.01855983
53 0.3411197 -0.31806837
57 -0.3672215 1.00002880
73 0.6107264 0.45699881
77 -0.4432919 0.78763961
82 -0.1351786 0.98389557
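One thing to keep in mind with merge: by default it sorts its result by the key column, so positions found in the merged result refer to rows of that result and need not line up with row positions in df1. A minimal sketch that avoids the issue by comparing whole rows directly, assuming the same dummy data as above; the names `dropped`, `key1`, `key2` and `missing.rows` are illustrative.

```r
set.seed(1)
df1 <- data.frame(x = rnorm(100), y = rnorm(100))
dropped <- sort(sample(1:100, 10))   # rows we remove, kept for checking
df2 <- df1[-dropped, ]

# Build one string key per row from all columns
key1 <- do.call(paste, c(df1, sep = "\r"))
key2 <- do.call(paste, c(df2, sep = "\r"))

# Rows of df1 whose key never appears in df2
missing.rows <- df1[!(key1 %in% key2), ]
rownames(missing.rows)
```

Because the comparison indexes df1 itself, the row names of `missing.rows` are the original row numbers of the dropped rows.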
