Strings in datatable (imported from database) get coerced into integers? [duplicate] - r

This question already has answers here:
How to convert a factor to integer\numeric without loss of information?
(12 answers)
Closed 6 years ago.
I've imported a test file and tried to make a histogram
pichman <- read.csv(file="picman.txt", header=TRUE, sep="/t")
hist <- as.numeric(pichman$WS)
However, I get numbers that differ from the values in my dataset. Originally I thought this was because the column contained text, so I deleted the text:
table(pichman$WS)
ws <- pichman$WS[pichman$WS!="Down" & pichman$WS!="NoData"]
However, I am still getting very high numbers. Does anyone have an idea?

I suspect you are having a problem with factors. For example,
> x = factor(4:8)
> x
[1] 4 5 6 7 8
Levels: 4 5 6 7 8
> as.numeric(x)
[1] 1 2 3 4 5
> as.numeric(as.character(x))
[1] 4 5 6 7 8
Some comments:
You mention that your vector contains the characters "Down" and "NoData". What do you expect/want as.numeric to do with these values?
In read.csv, try using the argument stringsAsFactors=FALSE
Are you sure it's sep="/t" and not sep="\t"?
Use the command head(pichman) to check the first few rows of your data
Also, it's very tricky to guess what your problem is when you don't provide data. A minimal working example is always preferable. For example, I can't run the command pichman <- read.csv(file="picman.txt", header=TRUE, sep="/t") since I don't have access to the data set.
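The comments above can be combined into a small, self-contained sketch. The inline data is made up for illustration (the real picman.txt is not available), and treating "Down" and "NoData" as missing values through na.strings is an assumption about what the asker wants.

```r
# Hypothetical tab-separated data standing in for picman.txt
raw <- "WS\tDir\n3.5\tN\nDown\tS\n12.1\tE\nNoData\tW\n7\tN\n"

# Read with a real tab separator ("\t", not "/t"), keep strings as strings,
# and treat the two text markers as NA (an assumption, see above)
demo <- read.csv(text = raw, header = TRUE, sep = "\t",
                 stringsAsFactors = FALSE,
                 na.strings = c("Down", "NoData"))
str(demo$WS)  # numeric column, NA where the text markers were

# What goes wrong with a factor: as.numeric() returns the level codes
ws_factor <- factor(c("3.5", "Down", "12.1", "NoData", "7"))
codes <- as.numeric(ws_factor)

# Going through as.character() recovers the values; text becomes NA
values <- suppressWarnings(as.numeric(as.character(ws_factor)))
```

With the column read as numeric, hist(demo$WS) would plot the actual values rather than the factor codes.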

As csgillespie said: stringsAsFactors defaults to TRUE (in R versions before 4.0.0), which converts any text column to a factor. So even after deleting the text values, you still have a factor in your data frame.
Now regarding the conversion, there's a more efficient way to do it, so I put it here as a reference:
> x <- factor(sample(4:8,10,replace=T))
> x
[1] 6 4 8 6 7 6 8 5 8 4
Levels: 4 5 6 7 8
> as.numeric(levels(x))[x]
[1] 6 4 8 6 7 6 8 5 8 4
To show it works.
The timings :
> x <- factor(sample(4:8,500000,replace=T))
> system.time(as.numeric(as.character(x)))
user system elapsed
0.11 0.00 0.11
> system.time(as.numeric(levels(x))[x])
user system elapsed
0 0 0
It's a big improvement, although the conversion is not always the bottleneck. It does become important if you have a big data frame with a lot of columns to convert.
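For the many-columns case, the same idiom can be applied to every factor column at once. This is only a sketch: the helper name factor_to_numeric and the example data frame are made up for illustration, and it assumes every factor column holds numeric text.

```r
# Hypothetical helper wrapping the levels-based conversion shown above
factor_to_numeric <- function(f) as.numeric(levels(f))[f]

# Made-up data frame with two factor columns and one integer column
df <- data.frame(a  = factor(c("10", "20", "10")),
                 b  = factor(c("1.5", "2.5", "1.5")),
                 id = 1:3)

# Find the factor columns and convert only those
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], factor_to_numeric)
str(df)
```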

Related

R - Print list in file and recover list

I have a list that looks like this:
> indices
$`48-168`
$`48-168`$`1`
[1] 1 2 3 4 5 6 7 8 9 10
$`60-180`
$`60-180`$`1`
[1] 1 2 3 4 5 6 7 8 9 10
$`180-300`
$`180-300`$`1`
[1] 1 2
$`180-300`$`4`
[1] 4 5 6 7 8 9 10
$`180-300`$`3`
[1] 3
I want to print it somehow in a file and then recover the same list later.
I thought of printing the object given by unlist(as.relistable(obj)) and using relist later, but then I do not know how to recover the information from the file.
Given that your data is not particularly well structured, you might want to just use save() here, and save the original R list object:
save(indices, file="/path/to/your/file.txt")
When you want to load indices again, use the load() function:
load(file="/path/to/your/file.txt")
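A round trip can be checked with identical(). The sketch below uses a shortened, made-up version of the list and a temporary file instead of a real path. It also shows saveRDS()/readRDS(), which is an alternative to the save()/load() pair suggested above rather than part of that answer; it stores a single object and lets you choose the name on reading.

```r
# Shortened stand-in for the nested list from the question
indices <- list(`48-168`  = list(`1` = 1:10),
                `180-300` = list(`1` = 1:2, `4` = 4:10, `3` = 3L))

# save()/load(): the object comes back under its original name
rdata_path <- tempfile(fileext = ".RData")
save(indices, file = rdata_path)
env <- new.env()
load(rdata_path, envir = env)  # load into a separate environment for the check
same_after_load <- identical(env$indices, indices)

# saveRDS()/readRDS(): one object per file, assigned to any name
rds_path <- tempfile(fileext = ".rds")
saveRDS(indices, rds_path)
restored <- readRDS(rds_path)
same_after_rds <- identical(restored, indices)
```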

How does is.null work on list elements in R? [duplicate]

I found a very surprising and unpleasant feature of R - it completes list item names!!! See the following code:
a <- list(cov_spring = "spring")
a$cov <- c()
a$cov
# spring ## I expect it to be empty!!! I've set it empty!
a$co
# spring
a$c
I don't know what to do with that.... I need to be able to set $cov to NULL and have $cov_spring there at the same time!!! And use $cov separately!! This is annoying!
My question:
What is going on here? How is this possible, and what is the logic behind it?
Is there some easy fix to turn this completion off? I need to use list items cov_spring and cov independently as if they are normal variables. No damn completion please!!!
From help("$"):
'x$name' is equivalent to 'x[["name", exact = FALSE]]'
When you scroll back and read up on exact=:
exact: Controls possible partial matching of '[[' when extracting by
a character vector (for most objects, but see under
'Environments'). The default is no partial matching. Value
'NA' allows partial matching but issues a warning when it
occurs. Value 'FALSE' allows partial matching without any
warning.
So this provides you partial matching capability in both $ and [[ indexing:
mtcars$cy
# [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
mtcars[["cy"]]
# NULL
mtcars[["cy", exact=FALSE]]
# [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
There is no way that I can see to disable the exact=FALSE default for $ (unless you want to mess with formals, which I do not recommend for the sake of reproducibility and consistent behavior).
Programmatic use of frames and lists (for defensive purposes) should prefer [[ over $ for precisely this reason. (It's rare, but I have been bitten by this permissive behavior.)
Edit:
For clarity on that last point:
mtcars$cyl becomes mtcars[["cyl"]]
mtcars$cyl[1:3] becomes mtcars[["cyl"]][1:3]
mtcars[,"cy"] is not a problem, nor is mtcars[1:3,"cy"]
You can use [ or [[ instead.
a["cov"] will return a list with a NULL element.
a[["cov"]] will return the NULL element directly.
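Applied to the list from the question, the difference between $ and [[ can be sketched as follows. Two things here go beyond the answers above and are offered as suggestions only: assigning list(NULL) through single brackets to store an explicit NULL element, and the base R option warnPartialMatchDollar, which makes $ warn when it falls back to partial matching.

```r
a <- list(cov_spring = "spring")

partial <- a$cov                      # "$" partially matches cov_spring
exact   <- a[["cov"]]                 # "[[" matches exactly by default: NULL
opt_in  <- a[["cov", exact = FALSE]]  # partial matching only on request

# Store an explicit NULL under the name "cov" next to cov_spring
a["cov"] <- list(NULL)
names(a)
after <- a$cov                        # an exact match now exists and wins

# Ask R to warn whenever "$" resolves a name by partial matching
old <- options(warnPartialMatchDollar = TRUE)
warned <- tryCatch({ a$cov_s; FALSE }, warning = function(w) TRUE)
options(old)                          # restore the previous setting
```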

Value labels (levels) are lost when modifying a memisc::data.set in R

I use memisc::data.set because I import data from SPSS. I can get the value labels (in the SPSS sense) of an object by calling levels(). I use them as the tick-mark labels in a plot.
When I modify the data.set (as in the example below), levels() doesn't work anymore.
library('memisc')
# example data
d <- data.set(a = sample(1:100))
d$a_strat <- cut(d$a, breaks=seq(1,100, by=10))
# "modify" the data.set
e <- d[,c('a_strat')]
# it is still a data.set but "a_strat" changed its type
> class(e)
[1] "data.set"
attr(,"package")
[1] "memisc"
Now have a look at the different data types of a_strat in the two data.sets.
> str(d$a_strat)
Factor w/ 9 levels "(1,11]","(11,21]",..: 4 9 3 1 NA 9 5 4 9 9 ...
> str(e$a_strat)
$ Nmnl. item w/ 9 labels for 1,2,3,... int 4 9 3 1 NA 9 5 4 9 9 ...
The practical issue is that I cannot call levels() on the second data.set.
> levels(e$a_strat)
NULL
But this works
> labels(e$a_strat)
Values and labels:
1 '(1,11]'
2 '(11,21]'
3 '(21,31]'
4 '(31,41]'
5 '(41,51]'
6 '(51,61]'
7 '(61,71]'
8 '(71,81]'
9 '(81,91]'
But when I use that for plotting, as in axis(..., labels=labels(e$a_strat)), the value labels (e.g. (31,41]) don't appear. Instead, the values (1, 2, ..., 9) appear on the tick marks.
I am not sure how to solve that.
The little helper here is as.factor().
So it could look like this:
axis(..., labels=levels(as.factor(e$a_strat)))
But please don't upvote this answer. ;) I still can't understand why the type of a_strat changes in my example.

Loop over columns with similar column names

I have the following variable and dataframe
welltypes <- c("LC","HC")
qccast <- data.frame(
LC_mean=1:10,
HC_mean=10:1,
BC_mean=rep(0,10)
)
Now I only want to see the welltypes I selected (in this case LC and HC, but it could also be different ones).
for(i in 1:length(welltypes)){
qccast$welltypes[i]_mean
}
This does not work, I know.
But how do I loop over those columns?
It has to be driven by the variable, because welltypes is of unknown size.
The second argument to $ needs to be a column name of the first argument. I haven't run the code, but I would expect welltypes[i]_mean to be a syntax error. $ is similar to [[, so you can use paste to create the column name string and subset via [[.
For example:
qccast[[paste(welltypes[i],"_mean",sep="")]]
Depending on the rest of your code, you may be able to do something like this instead.
for(i in paste(welltypes,"_mean",sep="")){
qccast[[i]]
}
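Put together with the data from the question, the approach looks roughly like this. Taking the mean of each column is only a placeholder for whatever the loop body should do, and paste0(x, "_mean") is shorthand for paste(x, "_mean", sep="").

```r
welltypes <- c("LC", "HC")
qccast <- data.frame(LC_mean = 1:10, HC_mean = 10:1, BC_mean = rep(0, 10))

# Build the column names once and check that they exist
cols <- paste0(welltypes, "_mean")
stopifnot(all(cols %in% names(qccast)))

# Loop over the names and pull each column out with [[
col_means <- numeric(0)
for (col in cols) {
  col_means[col] <- mean(qccast[[col]])  # placeholder computation
}

# The same result without an explicit loop
col_means2 <- vapply(qccast[cols], mean, numeric(1))
```

Because cols is derived from welltypes, the code works unchanged for any number of selected well types.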
Here's another strategy:
qccast[ sapply(welltypes, grep, names(qccast)) ]
   LC_mean HC_mean
1        1      10
2        2       9
3        3       8
4        4       7
5        5       6
6        6       5
7        7       4
8        8       3
9        9       2
10      10       1
Another easy way to access the given welltypes:
qccast[,paste(welltypes, '_mean', sep = "")]
