Organization of data with metadata - r

I have a dataframe that contains two columns, X-data and Y-data.
This represents some experimental data.
I also have a lot of additional information that I want to associate with this data, such as the temperatures, flow rates and so on at which the sample was recorded. I have this metadata in a second dataframe.
The data and metadata should always stay together, but I also want to be able to do calculations with the data.
As I have many of those data-metadata pairs (>100), I was wondering what people think is an efficient way to organize the data?
For now, I have the two dataframes in a list, but I find accessing the individual values or data-columns tedious (= a lot of code and brackets to write).

You can use an attribute:
dfr <- data.frame(x=1:3,y=rnorm(3))
meta <- list(temp="30C",date=as.Date("2013-02-27"))
attr(dfr,"meta") <- meta
dfr
##   x          y
## 1 1 -1.3580532
## 2 2 -0.9873850
## 3 3  0.3809447
attr(dfr,"meta")
## $temp
## [1] "30C"
##
## $date
## [1] "2013-02-27"
str(dfr)
## 'data.frame': 3 obs. of 2 variables:
##  $ x: int 1 2 3
##  $ y: num -1.358 -0.987 0.381
##  - attr(*, "meta")=List of 2
##   ..$ temp: chr "30C"
##   ..$ date: Date, format: "2013-02-27"
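With many data-metadata pairs, the attribute approach can be combined with a named list so that metadata and calculations are both one short call away. The following is only a sketch: the helper `make_run`, the run names and the numeric `temp` field are hypothetical, chosen for illustration.

```r
# Hypothetical helper: build one data frame and attach its metadata as an attribute
make_run <- function(temp) {
  d <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5))
  attr(d, "meta") <- list(temp = temp)
  d
}

# One named list holds every data-metadata pair
runs <- list(a = make_run(30), b = make_run(45))

# Pull one metadata field out of every run
temps <- vapply(runs, function(d) attr(d, "meta")$temp, numeric(1))

# Calculations on the data work as usual
means <- vapply(runs, function(d) mean(d$y), numeric(1))
```

Keeping the metadata on the object itself means a single `runs[["a"]]` carries both pieces, at the cost of having to remember that the attribute exists.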

Related

collapse data frame with embedded matrices [duplicate]

Under certain conditions, R generates data frames that contain matrices as elements. This requires some determination to do by hand, but happens e.g. with the results of an aggregate() call where the aggregation function returns multiple values:
set.seed(101)
d0 <- data.frame(g=factor(rep(1:2,each=20)), x=rnorm(20))
d1 <- aggregate(x~g, data=d0, FUN=function(x) c(m=mean(x), s=sd(x)))
str(d1)
## 'data.frame': 2 obs. of 2 variables:
## $ g: Factor w/ 2 levels "1","2": 1 2
## $ x: num [1:2, 1:2] -0.0973 -0.0973 0.8668 0.8668
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "m" "s"
This makes a certain amount of sense, but can make trouble for downstream processing code (for example, ggplot2 doesn't like it). The printed representation can also be confusing if you don't know what you're looking at:
d1
## g x.m x.s
## 1 1 -0.09731741 0.86678436
## 2 2 -0.09731741 0.86678436
I'm looking for a relatively simple way to collapse this object to a regular three-column data frame (either with names g, m, s, or with names g, x.m, x.s ...).
I know this problem won't arise with tidyverse (group_by + summarise), but am looking for a base-R solution.
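One base-R approach often suggested for this situation is to pass the data frame's columns back through `data.frame()`, which splits a matrix column into separate columns. The sketch below reuses the question's setup; `flat1` and `flat2` are names introduced here for illustration.

```r
set.seed(101)
d0 <- data.frame(g = factor(rep(1:2, each = 20)), x = rnorm(20))
d1 <- aggregate(x ~ g, data = d0, FUN = function(x) c(m = mean(x), s = sd(x)))

# data.frame() expands the embedded matrix; the names become g, x.m, x.s
flat1 <- do.call(data.frame, d1)

# Alternative that keeps the matrix's own column names: g, m, s
flat2 <- cbind(d1["g"], as.data.frame(d1$x))
```

Both results are ordinary three-column data frames, so `str()` no longer reports a matrix inside `x`.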

How to convert outcome of table function to a dataframe

df = data.frame(table(train$department , train$is_outcome))
Here department and is_outcome are both factors (is_outcome is binary), and the result looks like the output below.
It contains only 2 variables (fields), while I want the department column to be part of the data frame, i.e. a data frame of 3 variables.
0 1
Analytics 4840 512
Finance 2330 206
HR 2282 136
Legal 986 53
Operations 10325 1023
Procurement 6450 688
R&D 930 69
Sales & Marketing 15627 1213
Technology 6370 768
One way I learnt was...
df = data.frame(table(train$department , train$is_outcome))
write.csv(df,"df.csv")
rm(df)
df = read.csv("df.csv")
colnames(df) = c("department", "outcome_0","outcome_1")
but I cannot save a file every time in my program.
Is there any way to do it directly?
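A direct route that avoids the round trip through a CSV file is to convert the table with `as.data.frame.matrix()` and then move the row names into a column. This is a sketch on a small made-up `train` data frame, not the asker's data; the column names follow the ones used in the question.

```r
# Small stand-in for the asker's data (values are made up)
train <- data.frame(
  department = c("HR", "HR", "Legal", "Legal", "Legal"),
  is_outcome = c(0, 1, 0, 0, 1)
)

tab <- table(train$department, train$is_outcome)

# Keep the wide layout of the table: one row per department, one column per outcome
wide <- as.data.frame.matrix(tab)

# Turn the row names into a real column and rename the count columns
df <- data.frame(
  department = rownames(wide),
  outcome_0 = wide[["0"]],
  outcome_1 = wide[["1"]]
)
```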
When you are trying to create tables from a matrix in R, you end up with trial.table. The object trial.table looks exactly the same as the matrix trial, but it really isn’t. The difference becomes clear when you transform these objects to a data frame. Take a look at the outcome of this code:
> trial.df <- as.data.frame(trial)
> str(trial.df)
'data.frame': 2 obs. of 2 variables:
$ sick : num 34 11
$ healthy: num 9 32
Here you get a data frame with two variables (sick and healthy), each with two observations. On the other hand, if you convert the table to a data frame, you get the following result:
> trial.table.df <- as.data.frame(trial.table)
> str(trial.table.df)
'data.frame': 4 obs. of 3 variables:
$ Var1: Factor w/ 2 levels "risk","no_risk": 1 2 1 2
$ Var2: Factor w/ 2 levels "sick","healthy": 1 1 2 2
$ Freq: num 34 11 9 32
$ Freq: num 34 11 9 32
The as.data.frame() function converts a table to a data frame in a format that you need for regression analysis on count data. If you need to summarize the counts first, you use table() to create the desired table.
Now you get a data frame with three variables. The first two — Var1 and Var2 — are factor variables for which the levels are the values of the rows and the columns of the table, respectively. The third variable — Freq — contains the frequencies for every combination of the levels in the first two variables.
In fact, you also can create tables in more than two dimensions by adding more variables as arguments, or by transforming a multidimensional array to a table using as.table(). You can access the numbers the same way you do for multidimensional arrays, and the as.data.frame() function creates as many factor variables as there are dimensions.
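The contrast between the two conversions can be reproduced with a few lines. How `trial` was originally built is not shown in the text, so the construction below is an assumption that merely matches the printed counts and level names.

```r
# Assumed construction of the matrix; counts and names taken from the output above
trial <- matrix(c(34, 11, 9, 32), ncol = 2,
                dimnames = list(c("risk", "no_risk"), c("sick", "healthy")))
trial.table <- as.table(trial)

# Matrix -> data frame keeps the wide layout: 2 rows, 2 columns
trial.df <- as.data.frame(trial)

# Table -> data frame gives the long layout: one row per cell, plus Freq
trial.table.df <- as.data.frame(trial.table)
```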

R: A column in a dataframe from numeric to factor with paste0 (and vice versa)

Preface:
I have seen this post: How to convert a factor to an integer/numeric without a loss of information? However, it does not really apply to the issue I am having. It addresses converting a factor vector to numeric, but the issue I am having is larger than that.
Problem:
I am trying to convert a column in a dataframe from a factor to a numeric, while representing the dataframe using paste0. Here is an example:
aa=1:10
bb=rnorm(10)
dd=data.frame(aa,bb)
get(paste0("d","d"))[,2]=as.factor(get(paste0("d","d"))[,2])
(The actual code I am using requires me to use the paste0 function)
I get the error: target of assignment expands to non-language object
I am not sure how to do this, I think what is messing it up is the paste0 function.
First, this is not really a natural way to think about things or to code things in R. It can be done, but if you rephrase your question to give the bigger picture, someone can probably provide more natural ways of doing this in R. (Like the named lists @joran mentioned in the comment.)
With that said, to do this in R, you need to split apart the three steps you're trying to do in one line: get the data frame with the specified variable, make the desired column a factor, and then assign back to the variable name. Here I've wrapped this in a function, so the assignment needs to be made in pos=1 instead of the default, which would name it only within the function.
tof <- function(dfname, colnum) {
d <- get(dfname)
d[, colnum] <- factor(d[, colnum])
assign(dfname, d, pos=1)
}
dd <- data.frame(aa=1:10, bb=rnorm(10))
str(dd)
## 'data.frame': 10 obs. of 2 variables:
## $ aa: int 1 2 3 4 5 6 7 8 9 10
## $ bb: num -1.4824 0.7904 0.0258 1.2075 0.2455 ...
tof("dd", 2)
str(dd)
## 'data.frame': 10 obs. of 2 variables:
## $ aa: int 1 2 3 4 5 6 7 8 9 10
## $ bb: Factor w/ 10 levels "-1.48237228248052",..: 1 8 4 9 5 10 2 7 3 6
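For comparison, the named-list idea mentioned above might look like the sketch below. The list name `frames` is hypothetical; the point is only that `[[` accepts a string built with `paste0`, so the assignment works without `get()` or `assign()`.

```r
# Store data frames in a named list instead of as separate global variables
frames <- list(dd = data.frame(aa = 1:10, bb = rnorm(10)))

# The name can be built with paste0 and used directly as a list index
nm <- paste0("d", "d")
frames[[nm]][, 2] <- as.factor(frames[[nm]][, 2])
```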

How to determine column to be Quantitative or Categorical data?

If I have a file with many columns whose values are all numbers, how can I tell whether a specific column holds categorical or quantitative data? Is there an area of study for this kind of problem? If not, what heuristics can be used?
Some heuristics that I can think of:
Likely to be categorical data
- the number of unique values is below some threshold
- the data are highly concentrated (low standard deviation)
- the unique values are sequential and start from 1
- all values in the column have a fixed length (may be an ID or a date)
- the column has a very small p-value in a test against Benford's Law
- the column has a very small p-value in a chi-square test against the result column
Likely to be quantitative data
- the column has floating-point numbers
- the column has sparse values
- the column has negative values
Other
- quantitative columns may be more likely to sit next to other quantitative columns (and vice versa)
I am using R, but the question doesn't need to be R specific.
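A few of the heuristics above can be combined into a rough screening function. This is only a sketch: the function name `looks_categorical` and the threshold `max_unique = 10` are arbitrary choices made here, and the result is a guess, not a classification.

```r
# Rough guess: whole, non-negative numbers with few distinct values
looks_categorical <- function(x, max_unique = 10) {
  x <- x[!is.na(x)]
  whole <- all(x == round(x))               # no fractional part
  few <- length(unique(x)) <= max_unique    # small number of distinct values
  nonneg <- all(x >= 0)                     # no negative values
  whole && few && nonneg
}

dat <- data.frame(
  id_like = rep(1:3, times = 10),
  measure = seq(0.5, 15, by = 0.5)
)
guess <- sapply(dat, looks_categorical)
```

Here `id_like` is flagged and `measure` is not, but a column of small integer counts would be flagged as well, which is why such a check can only suggest candidates for manual review.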
This assumes someone coded the data correctly.
Perhaps you are suggesting the data were not coded or labeled correctly, that it was all entered as numeric and some of it really is categorical. In that case, I do not know how one could tell with any certainty. Categorical data can have decimal places and can be negative.
The question I would ask myself in such a situation is what difference does it make how I treat the data?
If you are interested in the second scenario perhaps you should ask your question on Stack Exchange.
my.data <- read.table(text = '
aa bb cc dd
10 100 1000 1
20 200 2000 2
30 300 3000 3
40 400 4000 4
50 500 5000 5
60 600 6000 6
', header = TRUE, colClasses = c('numeric', 'character', 'numeric', 'character'))
my.data
# one way
str(my.data)
'data.frame': 6 obs. of 4 variables:
$ aa: num 10 20 30 40 50 60
$ bb: chr "100" "200" "300" "400" ...
$ cc: num 1000 2000 3000 4000 5000 6000
$ dd: chr "1" "2" "3" "4" ...
Here is a way to record the information:
my.class <- rep('empty', ncol(my.data))
for(i in 1:ncol(my.data)) {
my.class[i] <- class(my.data[,i])
}
> my.class
[1] "numeric" "character" "numeric" "character"
EDIT
Here is a way to record class for each column without using a for-loop:
my.class <- sapply(my.data, class)

Reading csv file, having numbers and strings in one column

I am importing a 3 column CSV file. The final column is a series of entries which are either an integer, or a string in quotation marks.
Here are a series of example entries:
1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5
When I import this using read.csv, these are all just turned into factors.
How can I set it up such that these are read as integers and strings?
Thank you!
This is not possible, since a given vector can only have a single mode (e.g. character, numeric, or logical).
However, you could split the vector into two separate vectors, one with numeric values and the second with character values:
vec <- c("m", 20, "Canada", 4, 5)
vnum <- as.numeric(vec)
vchar <- ifelse(is.na(vnum), vec, NA)
vnum
[1] NA 20 NA 4 5
vchar
[1] "m" NA "Canada" NA NA
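The same split can be applied straight after reading the file. The sketch below feeds the example rows to `read.csv` through its `text` argument; the new column names `V3_num` and `V3_chr` are made up for illustration, and `suppressWarnings` hides the expected coercion warning.

```r
csv_text <- '1,4,"m"\n1,5,20\n1,6,"Canada"\n1,7,4\n1,8,5'
dat <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# Numeric entries; non-numeric strings become NA
dat$V3_num <- suppressWarnings(as.numeric(dat$V3))

# String entries; positions that parsed as numbers become NA
dat$V3_chr <- ifelse(is.na(dat$V3_num), dat$V3, NA)
```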
EDIT Despite the OP's decision to accept this answer, @Andrie's answer is the preferred solution. My answer is meant only to inform about some odd features of data frames.
As others have pointed out, the short answer is that this isn't possible. data.frames are intended to contain columns of a single atomic type. @Andrie's suggestion is a good one, but just for kicks I thought I'd point out a way to shoehorn this type of data into a data.frame.
You can convert the offending column to a list (this code assumes you've set options(stringsAsFactors = FALSE)):
dat <- read.table(textConnection("1,4,'m'
1,5,20
1,6,'Canada'
1,7,4
1,8,5"),header = FALSE,sep = ",")
tmp <- as.list(as.numeric(dat$V3))
tmp[c(1,3)] <- dat$V3[c(1,3)]
dat$V3 <- tmp
str(dat)
'data.frame': 5 obs. of 3 variables:
$ V1: int 1 1 1 1 1
$ V2: int 4 5 6 7 8
$ V3:List of 5
..$ : chr "m"
..$ : num 20
..$ : chr "Canada"
..$ : num 4
..$ : num 5
Now, there are all sorts of reasons why this is a bad idea. For one, lots of code that you'd expect to play nicely with data.frames will not like this and either fail, or behave very strangely. But I thought I'd point it out as a curiosity.
No. A data frame is a collection of vectors bound together (a list of vectors or matrices). Because each column is a vector, it cannot be classified as both integer and factor; it must be one or the other. You could split the vector into a numeric column and a factor column, but I don't believe this is what you want.

Resources