Obtaining index names from a by object by parsing its call (R)

I'm trying to create an as.data.frame.by method which basically melts the N-dimensional by object for use with latex.table.by.
Melting it is simple enough, since a by object is just a matrix, but then the variable names returned are the most un-descriptive "X"'s imaginable.
dat <- transform( ChickWeight, Time=cut(Time,3), Chick=cut(as.numeric(Chick),3) )
my.by <- by( dat, with(dat,list(Time,Chick,Diet)), function(x) sum(x$weight) )
Looking through attributes(my.by) doesn't reveal anywhere the index variable names are stored except the call. I'd like to default to something reasonably descriptive for the table.
So that leaves parsing the call:
> attr(my.by,"call")
by.data.frame(data = dat, INDICES = with(dat, list(Time, Chick,
Diet)), FUN = function(x) sum(x$weight))
> str(attr(my.by,"call"))
language by.data.frame(data = dat, INDICES = with(dat, list(Time, Chick, Diet)), FUN = function(x) sum(x$weight))
I just want the index names used, but I have no idea how to go about parsing this monster. Ideas?

If you make the call with named arguments you get dimnames as you expect:
> my.by <- with(dat, by( weight, list(Time=Time,Chick=Chick,Diet=Diet), sum ))
> str(my.by)
by [1:3, 1:3, 1:4] 3475 5969 8002 640 1596 ...
- attr(*, "dimnames")=List of 3
..$ Time : chr [1:3] "(-0.021,6.99]" "(6.99,14]" "(14,21]"
..$ Chick: chr [1:3] "(0.951,17.3]" "(17.3,33.7]" "(33.7,50]"
..$ Diet : chr [1:4] "1" "2" "3" "4"
- attr(*, "call")= language by.default(data = weight, INDICES = list(Time = Time, Chick = Chick, Diet = Diet), FUN = sum)
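When the INDICES are named like this, the index names can be read straight off the object with names(dimnames()), with no call parsing at all. A minimal sketch, using a small mock array standing in for the by result:

```r
# Mock array with the same dimnames shape a named-INDICES by() call produces
my.by <- array(1:8, dim = c(2, 2, 2),
               dimnames = list(Time  = c("(-0.021,6.99]", "(6.99,14]"),
                               Chick = c("(0.951,17.3]", "(17.3,33.7]"),
                               Diet  = c("1", "2")))
names(dimnames(my.by))  # the index variable names, directly
```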

This will work for the example given:
as.character(tail(as.list(attr(my.by, 'call')[['INDICES']]), 1)[[1]])[-1]
tail(..., 1)[[1]] grabs the list(Time, Chick, Diet) call, and [-1] drops the leading list symbol, leaving just the index names.

Hm, the wild guess of attr(my.by,"call")[["INDICES"]] seems to produce a language object.
And coercing that to character works surprisingly well:
> as.character(attr(my.by,"call")[["INDICES"]])
[1] "with" "dat" "list(Time, Chick, Diet)"
So I could probably grab it from there, although it will remain highly dependent on how the user specifies it. Better parsing ideas would be most appreciated.
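Pulling those two ideas together, here is a sketch of a small helper (idx_names is a hypothetical name, not part of any package) that handles the two call shapes shown above: INDICES written as with(data, list(a, b, c)) or as a bare list(a, b, c). Anything else falls back to deparsing the expression, so it remains dependent on how the user wrote the call.

```r
# Sketch: recover index names from the call recorded on a by object.
# Assumes INDICES was written as with(data, list(...)) or list(...).
idx_names <- function(by_obj) {
  ind <- attr(by_obj, "call")[["INDICES"]]
  # Unwrap a with(data, ...) wrapper: its third element is the expression
  if (is.call(ind) && identical(ind[[1]], as.name("with"))) ind <- ind[[3]]
  # For a list(...) call, deparse each argument (a symbol or expression)
  if (is.call(ind) && identical(ind[[1]], as.name("list")))
    return(vapply(as.list(ind)[-1], deparse, character(1)))
  deparse(ind)  # fallback: the whole expression as a string
}
```

On the example from the question, idx_names(my.by) returns "Time", "Chick", "Diet".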


Use purrr on a position of element in nested list?

Situation: I have a nested list in the image below. I want to use purrr to iterate over the second element of each nested list and apply a date conversion function.
Problem: I can easily write a for loop to iterate over it, but I want to do this with purrr. My nested-list attempts have not worked out: a flat list is fine, but indexing into a nested list by position is not.
Reproducible example code from Maurits Evers (Thank you!)
lst <- list(
  list("one", "12345", "2019-01-01"),
  list("two", "67890", "2019-01-02"))
Any assistance appreciated!
Please see the comment above to understand how to provide a reproducible example including sample data.
Since you don't provide sample data, let's create some minimal mock data similar to what is shown in your screenshot.
lst <- list(
  list("one", "12345", "2019-01-01"),
  list("two", "67890", "2019-01-02"))
To cast the third element of every list element to Date, we can then do
lst <- map(lst, ~{.x[[3]] <- as.Date(.x[[3]]); .x})
We can confirm that the third element of every list element is now of class Date:
str(lst)
#List of 2
# $ :List of 3
# ..$ : chr "one"
# ..$ : chr "12345"
# ..$ : Date[1:1], format: "2019-01-01"
# $ :List of 3
# ..$ : chr "two"
# ..$ : chr "67890"
# ..$ : Date[1:1], format: "2019-01-02"
Update
A more purrr/tidyverse-canonical approach would be to use modify_at (thanks @H1)
lst <- map(lst, ~modify_at(.x, 3, as.Date))
The result is the same as before.
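For comparison, the same transformation can be done in base R without any packages; this is a sketch equivalent to the map call above:

```r
# Base-R equivalent of map(lst, ~{.x[[3]] <- as.Date(.x[[3]]); .x})
lst <- list(
  list("one", "12345", "2019-01-01"),
  list("two", "67890", "2019-01-02"))
# Replace the third element of each sublist with its Date conversion
lst <- lapply(lst, function(x) { x[[3]] <- as.Date(x[[3]]); x })
```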

Access first line only when output has two lines in R

I am using a package in R called linkcomm and here's the documentation for it https://cran.r-project.org/web/packages/linkcomm/linkcomm.pdf
This is what I run so far
library(linkcomm)
g <- read.table("sample.txt", header = FALSE)
lc <- getLinkCommunities(g)
mc <- meta.communities(lc, hcmethod = "ward.D2", deepSplit = FALSE)
cc <- getCommunityCentrality(lc, type = "commconn")
tmp <- head(sort(cc, decreasing = TRUE))
print(tmp)
Output: 1e+14 5712365 12815415 511042 12815383 512594
3388.230 1493.165 1375.577 1350.684 1312.197 1302.445
Now the question is, how do I access the first row only in tmp, which is the actual nodes in the network data?
When I do tmp[1], it produces
1e+14
3388.23 where I only need 1e+14.
dput(a)
structure(c(3388.22995373249, 1493.16521374732, 1375.57742835837,
1350.68389440675, 1312.19704460178, 1302.44518389222), .Names = c("1e+14",
"5712365", "12815415", "511042", "12815383", "512594"))
You have a named numeric vector as you can see below when using str.
str(a)
Named num [1:6] 3388 1493 1376 1351 1312 ...
- attr(*, "names")= chr [1:6] "1e+14" "5712365" "12815415" "511042" ...
#To select the 1st element
a[1]
1e+14
3388.23
#To select the 1st element value without name
unname(a[1])
3388.23
#To select the 1st element name
names(a[1])
[1] "1e+14"
For all names/values in the vector, you can use names(a) / unname(a).
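Putting that together on a small mock vector mirroring the question's tmp (values shortened for illustration):

```r
# Named numeric vector: names are the node labels, values are centralities
a <- c("1e+14" = 3388.23, "5712365" = 1493.17, "12815415" = 1375.58)
names(a)     # all node labels
names(a)[1]  # first node label only: "1e+14"
unname(a)    # centrality values with the names stripped
```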

R package 'haven' read_spss: how to make it ignore value labels?

I have an SPSS file. I read it in using 'haven' package:
library(haven)
spss1 <- read_spss("SPSS_Example.sav")
I created a function that extracts the long labels (in SPSS - "Label"):
fix_labels <- function(x, TextIfMissing) {
  val <- attr(x, "label")
  if (is.null(val)) TextIfMissing else val
}
longlabels <- sapply(spss1, fix_labels, TextIfMissing = "NO LABEL IN SPSS")
Looks like a little bug in 'haven':
When I actually look at the attributes of one variable that has no
long label in SPSS but has Value Labels, I am getting:
attr(spss1$WAVE, "label")
NULL
But when I sapply my function longlabels to my data frame and ask it
to print the long labels for each column, for the same column "WAVE" I
am getting - instead of NULL:
NULL
VERY/SOMEWHAT FAMILIAR NOT AT ALL FAMILIAR
1 2
This is, of course, incorrect, because it grabs the next attribute
(which one?) and replaces NULL with it.
This function is supposed to create a vector of long labels and
usually it does, e.g.:
str(longlabels)
Named chr [1:64] "Serial number" ...
- attr(*, "names")= chr [1:64] "Respondent_Serial" "weight" "r7_1" "r7_2" ...
However, I just got an SPSS file with 92 columns and ran exactly the
same function on it. Now, I am getting not a vector, but a list
str(longlabels)
List of 92
$ VEHRATED : chr "VEHICLE RATED"
$ RESPID : chr "RESPONDENT ID"
$ RESPID8 : chr "8 DIGIT RESPONDENT NUMBER"
An observation about the structure of longlabels here: for those columns
that do NOT have a long label in SPSS but DO have value labels, my
function grabs their value labels, so the long label ends up recorded as
a named numeric vector, e.g.:
$ AWARE2 : Named num [1:2] 1 2
..- attr(*, "names")= chr [1:2] "VERY/SOMEWHAT FAMILIAR" "NOT AT ALL FAMILIAR"
Question: How could I avoid the extraction of the Value Labels for the
columns that have no long labels?
Here is the solution. The problem was partial matching in attr():
fix_labels <- function(x, TextIfMissing) {
  val <- attr(x, "label", exact = TRUE)
  if (is.null(val)) TextIfMissing else val
}

Sorting after aggregating in R

I first used aggregate to get the mean of one column in a data frame, per another column:
meanDemVoteHouseState <- aggregate(congress$X2012.House.Dem.vote,
                                   by = list(state = congress$state),
                                   FUN = mean)
I then wanted to print this in order. First I looked at the new data frame
str(meanDemVoteHouseState)
and got
'data.frame': 50 obs. of 2 variables:
$ state: chr "AK" "AL" "AR" "AZ" ...
$ x : num 0.29 0.34 0.29 0.462 0.566 ...
Apparently, the new variable is now called "x".
But when I tried to sort on that:
meanDemVoteHouseState[order(x),]
I got an error "object 'x' not found".
I tried a number of other things, but nothing worked.
What am I missing ?
You want
meanDemVoteHouseState[order(meanDemVoteHouseState[,"x"]),]
If you do it in two steps it becomes clearer:
myind <- order(meanDemVoteHouseState[,"x"]) # need 'x' fully qualified
meanDemVoteHouseState[myind, ]
Or use things like with() ...
It would probably be easier to just do
meanDemVoteHouseState <- aggregate(X2012.House.Dem.vote ~ state,
data = congress, FUN = mean)
This preserves the variable name (such as it is). You'd still need to sort, say with
ord <- with(meanDemVoteHouseState, order(X2012.House.Dem.vote))
meanDemVoteHouseState <- meanDemVoteHouseState[ord, ]
And at this point you may want to choose some shorter names for variables and objects.
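The whole workflow can be sketched end to end with mock data (the congress data frame below is invented for illustration, since the original isn't shown):

```r
# Mock stand-in for the question's `congress` data frame
congress <- data.frame(state = c("AK", "AL", "AK", "AL"),
                       X2012.House.Dem.vote = c(0.30, 0.34, 0.28, 0.34))

# Formula interface keeps the original column name in the result
m <- aggregate(X2012.House.Dem.vote ~ state, data = congress, FUN = mean)

# Sort rows by the aggregated column; order() needs the fully qualified name
m_sorted <- m[order(m$X2012.House.Dem.vote), ]
</imports>```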

Writing a Simple Triplet Matrix to a File?

I am using the tm package to compute a term-document matrix for a dataset. I now have to write the term-document matrix to a file, but when I use the write functions in R I get an error.
Here is the code which I am using and the error I am getting:
data("crude")
tdm <- TermDocumentMatrix(crude, control = list(weighting = weightTfIdf, stopwords = TRUE))
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTfIdf, stopwords = TRUE))
and this is the error while I use the write.table command on this data:
Error in cat(list(...), file, sep, fill, labels, append) : argument 1 (type 'list') cannot be handled by 'cat'
I understand that tdm is an object of class simple triplet matrix, but how can I write it to a plain text file?
I think I might be misunderstanding the question, but if all you want to do is export the term document matrix to a file, then how about this:
m <- inspect(tdm)
DF <- as.data.frame(m, stringsAsFactors = FALSE)
write.table(DF)
Is that what you're after mate?
Hope that helps a little,
Tony Breyal
Should the file be "human-readable"? If not, use dump, dput, or save. If so, convert your list into a data.frame.
Edit: You can convert your list into a matrix if each list element is equal length by doing matrix(unlist(list.name), nrow=length(list.name[[1]])) or something like that (or with plyr).
Why aren't you doing your SVM analysis in R (e.g. with kernlab)?
Edit 2: Ok, I looked at your data, and it isn't easy to convert into a matrix because the list elements aren't equal length:
> is.list(tdm)
[1] TRUE
> str(tdm)
List of 7
$ i : int [1:1475] 15 29 151 152 173 205 215 216 227 228 ...
$ j : int [1:1475] 1 1 1 1 1 1 1 1 1 1 ...
$ v : Named num [1:1475] 3.32 4.32 2.32 2 2.32 ...
..- attr(*, "names")= chr [1:1475] "1.50" "16.00" "barrel," "barrel." ...
$ nrow : int 985
$ ncol : int 20
$ dimnames :List of 2
..$ Terms: chr [1:985] "(bpd)" "(bpd)." "(gcc)" "(it) appears to be nearing a crossroads with regard to\nderegulation, both as it pertains to investments and imports," ...
..$ Docs : chr [1:20] "127" "144" "191" "194" ...
$ Weighting: chr [1:2] "term frequency - inverse document frequency" "tf-idf"
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
In order to convert this to a matrix, you will need to either take elements of this list (e.g. i, j) or else do some other manipulation.
Edit 3: Just to conclude my commentary here: these objects are intended to be used with the inspect function (see the package vignette).
As discussed, in order to use a function like write.table, you will need to convert your list into a matrix, which requires some manipulation of that list such that you have several vectors of equal length. Looking at the structure of these tm objects: this will be very difficult to do, and I suggest you work with the helper functions that are included with that package.
dtmMatrix <- as.matrix(dtm)
write.csv(dtmMatrix, 'mydata.csv')
This certainly does the job. However, when I tried it on a very large DTM (25,000 by 35,000), it failed with out-of-memory errors.
I used the following method:
dtm <- DocumentTermMatrix(corpus)
dtm1 <- removeSparseTerms(dtm,0.998) ##max allowed sparsity 0.998
m <- inspect(dtm1)
DF <- as.data.frame(m, stringsAsFactors = FALSE)
write.csv(DF,"mydata0.998sparse.csv")
This reduces the size of the document-term matrix considerably.
Here you can increase the max allowable sparsity (closer to 1) to include more terms in DF.
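If even the sparsity-reduced matrix is too large to densify, another option is to write out the sparse triplet representation itself: a simple_triplet_matrix stores row indices (i), column indices (j), and values (v), which map straight onto a three-column data frame. A sketch, with a plain list mocking the slots of the real tdm from the question:

```r
# Mock of a simple_triplet_matrix's slots (the real tdm has the same fields)
tdm <- list(i = c(1, 2), j = c(1, 1), v = c(0.5, 1.2),
            dimnames = list(Terms = c("oil", "crude"), Docs = c("127")))

# One row per nonzero entry: (term, doc, weight) -- never densified
triplets <- data.frame(term   = tdm$dimnames$Terms[tdm$i],
                       doc    = tdm$dimnames$Docs[tdm$j],
                       weight = tdm$v,
                       stringsAsFactors = FALSE)
write.csv(triplets, "tdm_triplets.csv", row.names = FALSE)
```

Memory use is proportional to the number of nonzero entries rather than nrow * ncol, so this scales to matrices that as.matrix() cannot handle.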
