Value labels (levels) are lost when modifying a memisc::data.set in R

I use memisc::data.set because I import data from SPSS. I can get the value labels (in the SPSS sense) from an object by calling levels(), and I use them to label the tick marks in a plot.
When I modify the data.set (as in the example below), levels() no longer works.
library('memisc')
# example dta
d <- data.set(a = sample(1:100))
d$a_strat <- cut(d$a, breaks=seq(1,100, by=10))
# "modify" the data.set
e <- d[,c('a_strat')]
# it is still a data.set, but "a_strat" changed its type
> class(e)
[1] "data.set"
attr(,"package")
[1] "memisc"
Now look at the different data types of a_strat in the two data.sets.
> str(d$a_strat)
Factor w/ 9 levels "(1,11]","(11,21]",..: 4 9 3 1 NA 9 5 4 9 9 ...
> str(e$a_strat)
$ Nmnl. item w/ 9 labels for 1,2,3,... int 4 9 3 1 NA 9 5 4 9 9 ...
The practical issue is that levels() no longer works on the second data.set.
> levels(e$a_strat)
NULL
But this works
> labels(e$a_strat)
Values and labels:
1 '(1,11]'
2 '(11,21]'
3 '(21,31]'
4 '(31,41]'
5 '(41,51]'
6 '(51,61]'
7 '(61,71]'
8 '(71,81]'
9 '(81,91]'
But when I use that for plotting in axis(..., labels=labels(e$a_strat)), the value labels (e.g. (31,41]) don't appear. Instead, the values (1, 2, ..., 9) appear at the tick marks.
I am not sure how to solve that.

The little helper here is as.factor().
So it could look like this:
axis(..., labels=levels(as.factor(e$a_strat)))
But please don't upvote this answer. ;) I still don't understand why the type of a_strat changes in my example.
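For reference, here is a minimal base-R sketch (no memisc, so it only illustrates the factor side of the problem) of why levels() is the call that yields usable tick labels. On a plain factor, labels() returns the element indices as strings, while levels() returns the interval labels. The variable names, the seed, and the null graphics device are assumptions made for illustration.

```r
# Base R only: a factor built the same way as a_strat in the question.
set.seed(1)
a <- sample(1:100)
a_strat <- cut(a, breaks = seq(1, 100, by = 10))

tick_labels  <- levels(a_strat)  # "(1,11]", "(11,21]", ...
index_labels <- labels(a_strat)  # "1", "2", "3", ... (element indices, not value labels)

counts <- table(a_strat)

# Draw on a null device so the sketch runs anywhere; drop the two
# device lines (pdf / dev.off) to see the plot interactively.
pdf(file = NULL)
plot(seq_along(counts), as.vector(counts), xaxt = "n",
     xlab = "a_strat", ylab = "count")
axis(1, at = seq_along(tick_labels), labels = tick_labels)
invisible(dev.off())
```

With a memisc item, the same idea would apply after converting it with as.factor() first.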

Related

Sorting dataframe in R in reverse order - Column name as a variable [duplicate]

I've looked and looked and the answer either does not work for me, or it's far too complex and unnecessary.
I have data, it can be any data, here is an example
chickens <- read.table(textConnection("
feathers beaks
2 3
6 4
1 5
2 4
4 5
10 11
9 8
12 11
7 9
1 4
5 9
"), header = TRUE)
I need to, very simply, sort the data by the first column in descending order. It's pretty straightforward, but the two approaches I found below both fail with this error:
"Error in order(var) : Object 'var' not found."
They are:
chickens <- chickens[order(-feathers),]
and
chickens <- chickens[sort(-feathers),]
I'm not sure what I'm doing wrong. I can get it to work if I put the df name in front of the variable name, but that doesn't work once I put a minus sign in front of the variable name to request a descending sort.
I'd like to do this as simply as possible, i.e. no boolean logic variables, nothing like that. Something akin to SPSS's
SORT BY varname (D)
The answer is probably right in front of me, I apologize for the basic question.
Thank you!
You need to use the data frame name as a prefix:
chickens[order(chickens$feathers),]
To reverse the order, use order()'s decreasing argument:
chickens[order(chickens$feathers, decreasing = TRUE),]
In base R, the syntax needs the data frame name as a prefix, as @dmi3kno has shown. You can also use with() to avoid repeating the data frame name and $ all the time, as mentioned by @joran.
However, you can also do this with data.table :
library(data.table)
setDT(chickens)[order(-feathers)]
#Also
#setDT(chickens)[order(feathers, decreasing = TRUE)]
# feathers beaks
# 1: 12 11
# 2: 10 11
# 3: 9 8
# 4: 7 9
# 5: 6 4
# 6: 5 9
# 7: 4 5
# 8: 2 3
# 9: 2 4
#10: 1 5
#11: 1 4
and dplyr :
library(dplyr)
chickens %>% arrange(desc(feathers))
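For completeness, a base-R sketch of the with() variant mentioned above, using the data from the question. It also shows why sort() cannot replace order() here: sort() returns the sorted values themselves, not row positions. The names sorted and sorted2 are only for illustration.

```r
chickens <- read.table(text = "
feathers beaks
2 3
6 4
1 5
2 4
4 5
10 11
9 8
12 11
7 9
1 4
5 9
", header = TRUE)

# with() evaluates the expression inside the data frame,
# so the bare column name 'feathers' is found.
sorted <- with(chickens, chickens[order(-feathers), ])

# Same sort with the explicit prefix and the 'decreasing' argument.
sorted2 <- chickens[order(chickens$feathers, decreasing = TRUE), ]

# sort() gives values (here the negated feathers), not row indices,
# so it is not suitable for indexing rows.
sorted_values <- sort(-chickens$feathers)
```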

Strings in datatable (imported from database) get coerced into integers? [duplicate]

This question already has answers here:
How to convert a factor to integer\numeric without loss of information?
(12 answers)
Closed 6 years ago.
I've imported a test file and tried to make a histogram
pichman <- read.csv(file="picman.txt", header=TRUE, sep="/t")
hist <- as.numeric(pichman$WS)
However, the numbers I get differ from the values in my dataset. Originally I thought this was because the column contained text, so I removed the text:
table(pichman$WS)
ws <- pichman$WS[pichman$WS!="Down" & pichman$WS!="NoData"]
However, I am still getting very high numbers. Does anyone have an idea why?
I suspect you are having a problem with factors. For example,
> x = factor(4:8)
> x
[1] 4 5 6 7 8
Levels: 4 5 6 7 8
> as.numeric(x)
[1] 1 2 3 4 5
> as.numeric(as.character(x))
[1] 4 5 6 7 8
Some comments:
You mention that your vector contains the strings "Down" and "NoData". What do you expect/want as.numeric to do with these values?
In read.csv, try using the argument stringsAsFactors=FALSE.
Are you sure it's sep="/t" and not sep="\t"?
Use the command head(pichman) to check the first few rows of your data.
Also, it's very tricky to guess what your problem is when you don't provide data. A minimal working example is always preferable. For example, I can't run the command pichman <- read.csv(file="picman.txt", header=TRUE, sep="/t") since I don't have access to the data set.
As csgillespie said, stringsAsFactors defaults to TRUE (in R versions before 4.0.0), which converts any text column to a factor. So even after deleting the text, you still have a factor in your data frame.
Now, regarding the conversion, there's a faster way to do it. I put it here as a reference:
> x <- factor(sample(4:8,10,replace=T))
> x
[1] 6 4 8 6 7 6 8 5 8 4
Levels: 4 5 6 7 8
> as.numeric(levels(x))[x]
[1] 6 4 8 6 7 6 8 5 8 4
To show it works.
The timings :
> x <- factor(sample(4:8,500000,replace=T))
> system.time(as.numeric(as.character(x)))
user system elapsed
0.11 0.00 0.11
> system.time(as.numeric(levels(x))[x])
user system elapsed
0 0 0
It's a big improvement, but not always a bottleneck. It gets important however if you have a big dataframe and a lot of columns to convert.
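Putting the suggestions from both answers together, a sketch could look like the following. The inline data is a hypothetical stand-in for picman.txt (the real file isn't available), and treating "Down" and "NoData" as missing values is an assumption about what the asker wants.

```r
# Hypothetical stand-in for picman.txt: tab-separated, with text markers in WS.
raw <- "WS\tDir\n3.2\tN\nDown\tS\n12.5\tE\nNoData\tW\n7.1\tN\n"

# Pitfall: as.numeric() on a factor returns the internal level codes.
ws_factor <- factor(c("3.2", "Down", "12.5", "NoData", "7.1"))
codes <- as.numeric(ws_factor)   # small integers, not the measurements

# Option 1: declare the markers as missing while reading, with the
# correct tab separator, so WS is read as numeric right away.
pichman <- read.csv(text = raw, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE,
                    na.strings = c("NA", "Down", "NoData"))
ws <- pichman$WS

# Option 2: convert an existing factor through its levels;
# non-numeric levels become NA (the coercion warning is suppressed).
ws_from_factor <- suppressWarnings(as.numeric(levels(ws_factor))[ws_factor])
```

After either option, hist(ws) would plot the actual values, with the missing entries ignored.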

Plot empty groups in boxplot

I want to plot a lot of boxplots in one particular style to compare them.
But when a group is empty, the group isn't plotted.
Let's say I have a data frame:
a b
1 1 5
2 1 4
3 1 6
4 1 4
5 2 9
6 2 8
7 2 9
8 3 NaN
9 3 NaN
10 3 NaN
11 4 2
12 4 8
and I use boxplot to plot it:
boxplot(b ~ a , df)
Then I get the plot without group 3
(which I can't show because I did not have "10 reputation")
I found some solutions for removing empty groups via Google but my problem is the other way around.
I also found a solution using at=c(1,2,4), but since I generate the R script with Python and different groups may be empty, I would prefer that the groups aren't dropped at all.
Also, I don't think I have the time to grapple with additional packages,
so I would be thankful for solutions without them.
You can get the group on the x-axis by
boxplot(b ~ a , df, na.action=na.pass)
Or
boxplot(b~factor(a), df)
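A variation on the factor() idea, as a sketch: declaring all expected groups as explicit factor levels should keep a slot for each of them, which may be convenient when the script is generated and it isn't known in advance which groups are empty. The data is taken from the question; the fixed set of levels 1:4 is an assumption.

```r
df <- data.frame(
  a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  b = c(5, 4, 6, 4, 9, 8, 9, NaN, NaN, NaN, 2, 8)
)

# Declare every expected group up front so none can be dropped.
df$a <- factor(df$a, levels = 1:4)

# plot = FALSE returns the box statistics without drawing, which is
# handy for checking that group 3 is still present.
bp <- boxplot(b ~ a, data = df, plot = FALSE)
bp$names  # group labels on the x-axis
bp$n      # number of non-missing observations per group
```

Calling boxplot(b ~ a, data = df) without plot = FALSE draws the same groups, with an empty position for group 3.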

R object of data.frame and data.table have same type?

I am still very new to R and recently came across something I don't understand. Do data.frame and data.table have the same type? Can an object have multiple types? After converting "cars" from data.frame to data.table, I obviously can't apply functions that apply to data.frames and not data.table, but class() shows that "cars" is still a data.frame. Does anyone know why?
> class(cars)
[1] "data.frame"
> cars<-data.table(cars)
> class(cars)
[1] "data.table" "data.frame"
It is not clear what you mean by your line "I obviously can't apply functions that apply to data.frames and not data.table".
Many functions work as you would expect, whether applied to a data.frame or to a data.table. In particular, if you read the help page to ?data.table, you would find this specific line in the first paragraph of the description:
Since a data.table is a data.frame, it is compatible with R functions and packages that only accept data.frame.
You can test this out yourself:
library(data.table)
CARS <- data.table(cars)
The following should all give you the same results. They aren't the "data.table" way of doing things, but I've listed a few things off the top of my head to show you that many (most?) functions can be used with data.table the same way that you would use them with data.frame (but at that point, you miss out on all the great stuff that data.table has to offer).
with(cars, tapply(dist, speed, FUN = mean))
with(CARS, tapply(dist, speed, FUN = mean))
aggregate(dist ~ speed, cars, as.vector)
aggregate(dist ~ speed, CARS, as.vector)
colSums(cars)
colSums(CARS)
as.matrix(cars)
as.matrix(CARS)
t(cars)
t(CARS)
table(cut(cars$speed, breaks=3), cut(cars$dist, breaks=5))
table(cut(CARS$speed, breaks=3), cut(CARS$dist, breaks=5))
cars[cars$speed == 4, ]
CARS[CARS$speed == 4, ]
However, there are some cases in which this won't work. Compare:
cars[cars$speed == 4, 1]
CARS[CARS$speed == 4, 1]
For a better understanding of that, I recommend reading the FAQs. In particular, a couple of relevant points have been summarized at this question: what you can do with data.frame that you can't in data.table.
If your question is, more generally, "Can an object have more than one class?", then you've seen from your own exploration that, yes, it can. For more about that, you can read this page from Hadley's devtools wiki.
Classes also affect things like how objects are printed and how they interact with other functions.
Consider the rle function. If you look at the class, it returns "rle", and if you look at its structure, it shows that it is a list.
> x <- rev(rep(6:10, 1:5))
> y <- rle(x)
> x
[1] 10 10 10 10 10 9 9 9 9 8 8 8 7 7 6
> y
Run Length Encoding
lengths: int [1:5] 5 4 3 2 1
values : int [1:5] 10 9 8 7 6
> class(y)
[1] "rle"
> str(y)
List of 2
$ lengths: int [1:5] 5 4 3 2 1
$ values : int [1:5] 10 9 8 7 6
- attr(*, "class")= chr "rle"
As the length of each list item is the same, you might expect that you can conveniently use data.frame() to convert it to a data.frame. Let's try:
> data.frame(y)
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ""rle"" to a data.frame
> unclass(y)
$lengths
[1] 5 4 3 2 1
$values
[1] 10 9 8 7 6
> data.frame(unclass(y))
lengths values
1 5 10
2 4 9
3 3 8
4 2 7
5 1 6
Or, let's add another class to the object and try:
> class(y) <- c(class(y), "list")
> y ## Printing is not affected
Run Length Encoding
lengths: int [1:5] 5 4 3 2 1
values : int [1:5] 10 9 8 7 6
> data.frame(y) ## But interaction with other functions is
lengths values
1 5 10
2 4 9
3 3 8
4 2 7
5 1 6
Data.table and data.frame are different classes, but they are related through inheritance. Data.table inherits from data.frame, and basically expands its capabilities. You can also see that after converting cars to the data.table class:
R> typeof(cars)
[1] "list" # similar to dataframe
R> mode(cars)
[1] "list" # idem
More information here or just google for "inheritance".
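The inheritance idea can be tried in base R without data.table. The sketch below adds a made-up class in front of "data.frame" and shows that S3 dispatch picks the most specific method first and then falls back along the class vector. The class name my_table and the generic describe() are hypothetical and exist only for this illustration.

```r
x <- cars
class(x) <- c("my_table", class(x))  # hypothetical class placed before "data.frame"

class(x)                    # "my_table" "data.frame"
inherits(x, "data.frame")   # TRUE: still usable wherever a data.frame is expected
typeof(x)                   # "list": the underlying storage is unchanged

# A hypothetical generic with one method per class.
describe <- function(obj, ...) UseMethod("describe")
describe.data.frame <- function(obj, ...) "data.frame method"
describe.my_table <- function(obj, ...) {
  # NextMethod() hands over to the next class in the vector.
  paste("my_table method, then:", NextMethod())
}

res_plain <- describe(cars)
res_sub   <- describe(x)
```

Here res_plain comes from the data.frame method alone, while res_sub runs the my_table method first and then the inherited one.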

R: Can't select a specific column in a data frame

I have a problem with a function to select a given column. I have a data frame called Volume from which I want to make a subset DateSearch:
DateSearch = subset(Volume,select=c("TRI",name))
For some reason it does not work. I have used browser(). I can select TRI or name individually, but I can't select both (either by name or by index). I have tried with and without quotes.
Does anyone know why that is?
Many thanks,
Vincent
I just did what (I think) you describe:
str(dfrm)
#'data.frame': 20 obs. of 8 variables:
# $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
# $ factor1: Factor w/ 4 levels "Not at all","To a small extent",..: 3 2 3 NA 3 NA 3 NA 4 1 ...
## <snip>
name = "factor1"
subset(dfrm, select=c("ID", name))
No error, .... results as expected.
Examine the spelling carefully. My guess is that you have a space at the beginning or end of the as.character result. Perhaps even a non-printing character? You can use nchar(name) to check.
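A sketch of that check, assuming the cause really is stray whitespace. The data frame contents and the column name price are made up, since the original Volume data isn't shown.

```r
# Hypothetical data; only the column TRI is taken from the question.
Volume <- data.frame(TRI = 1:3, price = c(10, 11, 12))
name <- " price"                 # note the leading space

found_before <- name %in% names(Volume)  # FALSE: no column is called " price"
n_chars <- nchar(name)                   # 6, one more than the 5 letters expected

name <- trimws(name)             # strip leading/trailing whitespace
found_after <- name %in% names(Volume)   # TRUE

DateSearch <- subset(Volume, select = c("TRI", name))
```

Note that trimws() only removes ordinary whitespace by default, so other non-printing characters would need a different cleanup.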
