How to convert factor to float without losing precision in R?

Here is what I have:
tmp[1,]
percentages percentages.1 percentages.2 percentages.3 percentages.4 percentages.5 percentages.6 percentages.7 percentages.8 percentages.9
0.0329489291598023 0.0391268533772652 0.0292421746293245 0.0354200988467875 0.0284184514003295 0.035831960461285 0.0308896210873147 0.0345963756177924 0.0366556836902801 0.0403624382207578
I try converting this to numeric, since the class is factor, but I get:
as.numeric(as.character(tmp[1,]))
[1] 35 36 35 36 31 32 31 34 36 34
Where did these integers come from?

Your problem is that indexing by rows of a data frame gives surprising results.
Reconstruct your object:
tmp <- read.csv(text=
"0.0329489291598023,0.0391268533772652,0.0292421746293245,0.0354200988467875,0.0284184514003295,0.035831960461285,0.0308896210873147,0.0345963756177924,0.0366556836902801,0.0403624382207578",
header=FALSE,colClasses=rep("factor",10))
Inspect:
str(tmp[1,])
## 'data.frame': 1 obs. of 10 variables:
## $ V1 : Factor w/ 1 level "0.0329489291598023": 1
## $ V2 : Factor w/ 1 level "0.0391268533772652": 1
## ... etc.
Converting via as.character() totally messes things up:
str(as.character(tmp[1,]))
## chr [1:10] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
On the other hand, this (converting to a matrix first) works fine:
as.numeric(as.matrix(tmp)[1,])
## [1] 0.03294893 0.03912685 0.02924217 0.03542010 0.02841845 0.03583196
## [7] 0.03088962 0.03459638 0.03665568 0.04036244
That said, I have to admit that I do not understand the particular magic that makes as.character() applied to a data frame drop the information about factor levels and convert everything first to the underlying numerical codes, and then to character -- I don't know where precisely you would go to read about this. (The bottom line is "don't extract rows of data frames if you can help it; convert them to matrices first if necessary.")
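One way to sidestep the row-extraction problem altogether is to convert each column on its own, since as.character() behaves as expected on a single factor. A minimal sketch, assuming a small two-column, two-row stand-in for tmp (the second row's values are made up for illustration):

```r
# Stand-in for tmp: two factor columns, two rows (second row is illustrative).
tmp <- data.frame(
  V1 = factor(c("0.0329489291598023", "0.5")),
  V2 = factor(c("0.0391268533772652", "0.25"))
)

# Convert column by column: factor -> label (character) -> number.
row1 <- vapply(tmp[1, ], function(col) as.numeric(as.character(col)), numeric(1))

# print() shows 7 significant digits by default; ask for more to check
# that the stored values were not truncated.
print(row1, digits = 15)
```

The shorter numbers in the printed output are only a display setting; the stored doubles keep the precision of the original strings, up to what a double can hold.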

As an alternative to converting to matrix, you can just transpose the dataframe row to a column:
as.numeric(as.character(t(tmp[1,])))
## [1] 0.03294893 0.03912685 0.02924217 0.03542010 0.02841845 0.03583196
## [7] 0.03088962 0.03459638 0.03665568 0.04036244
I think the integers seen by the OP
[1] 35 36 35 36 31 32 31 34 36 34
are the integer codes of the factor levels: the OP's data frame had multiple rows (36 or more), and these are the codes of the levels that appear in the first row.
ETA I see that t() converts a data frame to a matrix, so my solution is the same as Ben's.
Perhaps the reason as.character() doesn't work with a dataframe row is that the levels of the different columns may differ, so there isn't a common set of levels(). In these circumstances as.matrix() will convert to character, so it solves the problem.
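The difference between a factor's integer codes and its labels can be seen on a single factor. A small sketch with made-up values (not the OP's data):

```r
# Levels are sorted as strings: "0.25", "0.5", "10".
f <- factor(c("0.5", "0.25", "10"))

# Integer codes: the position of each element in levels(f).
codes <- as.integer(f)

# Two equivalent ways to recover the numbers the labels spell out.
values  <- as.numeric(as.character(f))
values2 <- as.numeric(levels(f))[f]
```

Here codes is 2 1 3, while values and values2 are both 0.5 0.25 10.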

Related

Extract all values from a vector of named numerics with the same name in R

I'm trying to handle a vector of named numerics for the first time in R. The vector itself is named p.values. It consists of p-values which are named after their corresponding variables. Through simulation I obtained a huge number of p-values that are always named after one of the five variables they correspond to. However, I'm interested in the p-values of only one variable and tried to extract them with p.values[["var_a"]], but that gives me only the p-value of var_a's last entry. p.values$var_a is invalid, and as.numeric(p.values) or unname(p.values) obviously gives me all values without names. Any idea how I can get R to give me the 1/5 of named numerics that are named var_a?
Short example:
p.values <- as.numeric(c(rep(1:5, each = 5)))
names(p.values) <- rep(letters[1:5], 5)
str(p.values)
Named num [1:25] 1 1 1 1 1 2 2 2 2 2 ...
- attr(*, "names")= chr [1:25] "a" "b" "c" "d" ...
I'd like to get R to show me all 5 numbers named "a".
Thanks for reading my first post here and I hope some more experienced R users know how to deal with named numerics and can help me with this issue.
You can subset p.values using [ with names(p.values) == "a" to show all values named a.
p.values[names(p.values) == "a"]
#a a a a a
#1 2 3 4 5
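If the values for every name are needed at once, split() can group them in one go. A minimal sketch on the same example data; the object names a.values and by.name are just illustrative:

```r
p.values <- as.numeric(rep(1:5, each = 5))
names(p.values) <- rep(letters[1:5], 5)

# Logical subsetting keeps every element whose name matches.
a.values <- p.values[names(p.values) == "a"]

# split() returns a list with one numeric vector per distinct name.
by.name <- split(unname(p.values), names(p.values))
by.name$a
```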

How to convert outcome of table function to a dataframe

df = data.frame(table(train$department, train$is_outcome))
Here department and is_outcome are both factors (is_outcome is binary). The table looks like the one below, containing only 2 variables (fields), while I want the department column to be part of the data frame, i.e. a data frame of 3 variables:
                       0     1
Analytics           4840   512
Finance             2330   206
HR                  2282   136
Legal                986    53
Operations         10325  1023
Procurement         6450   688
R&D                  930    69
Sales & Marketing  15627  1213
Technology          6370   768
One way I learnt was...
df = data.frame(table(train$department , train$is_outcome))
write.csv(df,"df.csv")
rm(df)
df = read.csv("df.csv")
colnames(df) = c("department", "outcome_0","outcome_1")
but I cannot write a file every time in my program.
Is there any way to do it directly?
When you are trying to create tables from a matrix in R, you end up with trial.table. The object trial.table looks exactly the same as the matrix trial, but it really isn’t. The difference becomes clear when you transform these objects to a data frame. Take a look at the outcome of this code:
> trial.df <- as.data.frame(trial)
> str(trial.df)
'data.frame': 2 obs. of 2 variables:
$ sick : num 34 11
$ healthy: num 9 32
Here you get a data frame with two variables (sick and healthy) with each two observations. On the other hand, if you convert the table to a data frame, you get the following result:
> trial.table.df <- as.data.frame(trial.table)
> str(trial.table.df)
'data.frame': 4 obs. of 3 variables:
$ Var1: Factor w/ 2 levels "risk","no_risk": 1 2 1 2
$ Var2: Factor w/ 2 levels "sick","healthy": 1 1 2 2
$ Freq: num 34 11 9 32
The as.data.frame() function converts a table to a data frame in a format that you need for regression analysis on count data. If you need to summarize the counts first, you use table() to create the desired table.
Now you get a data frame with three variables. The first two — Var1 and Var2 — are factor variables for which the levels are the values of the rows and the columns of the table, respectively. The third variable — Freq — contains the frequencies for every combination of the levels in the first two variables.
In fact, you also can create tables in more than two dimensions by adding more variables as arguments, or by transforming a multidimensional array to a table using as.table(). You can access the numbers the same way you do for multidimensional arrays, and the as.data.frame() function creates as many factor variables as there are dimensions.
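For the wide layout the question asks for (one row per department, one column per outcome), the table can be converted without going through a file. A minimal sketch, assuming a tiny made-up train data frame; the column names outcome_0 and outcome_1 follow the question:

```r
# Made-up stand-in for the asker's data.
train <- data.frame(
  department = c("HR", "HR", "Legal", "Legal", "Legal"),
  is_outcome = c(0, 1, 0, 0, 1)
)

tab <- table(train$department, train$is_outcome)

# as.data.frame.matrix() keeps the two-way layout instead of the long
# Var1/Var2/Freq format; the departments end up as row names.
wide <- as.data.frame.matrix(tab)

# Move the row names into a proper column and rename the count columns.
wide <- data.frame(
  department = rownames(wide),
  outcome_0  = wide[["0"]],
  outcome_1  = wide[["1"]],
  stringsAsFactors = FALSE
)
```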

ls() the column of a data frame

I'm a C developer trying to learn R, and there are a few things I can't wrap my head around.
I've tried doing something as simple as listing the elements from an int list stored in a data frame.
For this example, I'm using the data mpg from package ggplot2.
data(mpg, package="ggplot2")
Doing ls() on the mpg data frame lists the elements stored in it.
> ls(mpg)
[1] "class" "cty" "cyl" "displ" "drv" "fl" "hwy" "manufacturer"
[9] "model" "trans" "year"
Accessing a column can be done by giving its name as a string to the data frame.
> mpg["hwy"]
# A tibble: 234 x 1
hwy
<int>
1 29
2 29
3 31
4 30
5 26
6 26
7 27
8 26
9 25
10 28
# ... with 224 more rows
But using ls() on the column doesn't return the list of int stored in it.
> ls(mpg["hwy"])
[1] "hwy"
I'm really hitting a wall there. I'm trying to understand why it doesn't work the way I'm expecting to, but I can't find any information. That probably means that what I thought I've understood about R is wrong.
Can anyone please give me any directions about that?
Best regards.
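A minimal sketch of what seems to be going on, using a small made-up data frame rather than mpg: single brackets with a name return a one-column data frame, so ls() still reports the column name, while double brackets (or $) return the column's values as a plain vector.

```r
# Made-up stand-in for mpg with two integer columns.
cars <- data.frame(hwy = c(29L, 29L, 31L), cty = c(18L, 21L, 20L))

# names() lists the columns of a data frame.
col.names <- names(cars)

# Single brackets: still a data frame, just with one column.
one.col <- cars["hwy"]

# Double brackets (or cars$hwy): the integer vector stored in the column.
values <- cars[["hwy"]]
```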

How to determine column to be Quantitative or Categorical data?

If I have a file with many columns, and the data are all numbers, how can I know whether a specific column holds categorical or quantitative data? Is there an area of study for this kind of problem? If not, what are some heuristics that can be used to decide?
Some heuristics that I can think of:
Likely to be categorical data:
- the number of unique values is below some threshold
- the data are highly concentrated (low standard deviation)
- the unique values are sequential and start from 1
- all values in the column have a fixed length (may be an ID or a date)
- the column has a very small p-value when tested against Benford's Law
- the column has a very small p-value in a chi-square test against the result column
Likely to be quantitative data:
- the column contains floating-point numbers
- the column has sparse values
- the column has negative values
Other:
- quantitative columns may be more likely to sit next to other quantitative columns (and vice versa)
I am using R, but the question doesn't need to be R specific.
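A few of these heuristics can be sketched in a small helper. This is only an illustration: guess.type is a hypothetical function, and the threshold of 10 unique values is an arbitrary assumption, not an established rule.

```r
# Hypothetical helper: labels a numeric column using three of the
# heuristics above (few unique values, whole numbers, no negatives).
guess.type <- function(x, max.unique = 10) {
  x <- x[!is.na(x)]
  few.unique   <- length(unique(x)) <= max.unique
  whole        <- all(x == round(x))
  non.negative <- all(x >= 0)
  if (few.unique && whole && non.negative) {
    "likely categorical"
  } else {
    "likely quantitative"
  }
}

# Made-up data: a 3-level group code and a continuous measurement.
dat <- data.frame(
  group  = rep(1:3, times = 10),
  height = seq(150.5, 179.5, by = 1)
)

res <- sapply(dat, guess.type)
```

A guess like this can only suggest how to treat a column; it cannot confirm what the numbers actually mean.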
This assumes someone coded the data correctly.
Perhaps you are suggesting the data were not coded or labeled correctly, that it was all entered as numeric and some of it really is categorical. In that case, I do not know how one could tell with any certainty. Categorical data can have decimal places and can be negative.
The question I would ask myself in such a situation is what difference does it make how I treat the data?
If you are interested in the second scenario perhaps you should ask your question on Stack Exchange.
my.data <- read.table(text = '
aa bb cc dd
10 100 1000 1
20 200 2000 2
30 300 3000 3
40 400 4000 4
50 500 5000 5
60 600 6000 6
', header = TRUE, colClasses = c('numeric', 'character', 'numeric', 'character'))
my.data
# one way
str(my.data)
'data.frame': 6 obs. of 4 variables:
$ aa: num 10 20 30 40 50 60
$ bb: chr "100" "200" "300" "400" ...
$ cc: num 1000 2000 3000 4000 5000 6000
$ dd: chr "1" "2" "3" "4" ...
Here is a way to record the information:
my.class <- character(ncol(my.data))
for (i in seq_len(ncol(my.data))) {
  # class() can return several values; keep the first one
  my.class[i] <- class(my.data[, i])[1]
}
> my.class
[1] "numeric" "character" "numeric" "character"
EDIT
Here is a way to record class for each column without using a for-loop:
my.class <- sapply(my.data, class)

Reading csv file, having numbers and strings in one column

I am importing a 3 column CSV file. The final column is a series of entries which are either an integer, or a string in quotation marks.
Here are a series of example entries:
1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5
When I import this using read.csv, these are all just turned into factors.
How can I set it up such that these are read as integers and strings?
Thank you!
This is not possible, since a given vector can only have a single mode (e.g. character, numeric, or logical).
However, you could split the vector into two separate vectors, one with numeric values and the second with character values:
vec <- c("m", 20, "Canada", 4, 5)
vnum <- as.numeric(vec)
vchar <- ifelse(is.na(vnum), vec, NA)
vnum
[1] NA 20 NA 4 5
vchar
[1] "m" NA "Canada" NA NA
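The same split can be applied straight after import. A minimal sketch, assuming the example rows from the question are supplied as text; the column names V3.num and V3.chr are made up for illustration:

```r
csv.text <- '1,4,"m"\n1,5,20\n1,6,"Canada"\n1,7,4\n1,8,5'

# Read the mixed column as character rather than factor.
dat <- read.csv(text = csv.text, header = FALSE, stringsAsFactors = FALSE)

# Entries that are not numbers become NA; the coercion warning is expected.
dat$V3.num <- suppressWarnings(as.numeric(dat$V3))

# Keep the original string wherever the numeric conversion failed.
dat$V3.chr <- ifelse(is.na(dat$V3.num), dat$V3, NA)
```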
EDIT Despite the OP's decision to accept this answer, @Andrie's answer is the preferred solution. My answer is meant only to inform about some odd features of data frames.
As others have pointed out, the short answer is that this isn't possible. data.frames are intended to contain columns of a single atomic type. @Andrie's suggestion is a good one, but just for kicks I thought I'd point out a way to shoehorn this type of data into a data.frame.
You can convert the offending column to a list (this code assumes you've set options(stringsAsFactors = FALSE)):
dat <- read.table(textConnection("1,4,'m'
1,5,20
1,6,'Canada'
1,7,4
1,8,5"),header = FALSE,sep = ",")
tmp <- as.list(as.numeric(dat$V3))
tmp[c(1,3)] <- dat$V3[c(1,3)]
dat$V3 <- tmp
str(dat)
'data.frame': 5 obs. of 3 variables:
$ V1: int 1 1 1 1 1
$ V2: int 4 5 6 7 8
$ V3:List of 5
..$ : chr "m"
..$ : num 20
..$ : chr "Canada"
..$ : num 4
..$ : num 5
Now, there are all sorts of reasons why this is a bad idea. For one, lots of code that you'd expect to play nicely with data.frames will not like this and either fail, or behave very strangely. But I thought I'd point it out as a curiosity.
No. A data frame is a series of vectors pasted together (a list of vectors or matrices). Because each column is a vector, it cannot be classified as both integer and factor. It must be one or the other. You could split the vector apart into numeric and factor (a column for each), but I don't believe this is what you want.
