Accessing variables in a data frame in R

I am trying to open all the CSV files in my working directory and read all the tables into one large list of data frames. I found a similar solution on Stack Overflow, and it works. The code is:
load_data <- function(path)
{
files <- dir(path, pattern = '\\.csv', full.names = TRUE)
tables <- lapply(files, read.csv)
do.call(rbind, tables)
}
pollutantmean <- load_data("specdata")
However, I am confused about some of the steps. If I delete or omit do.call(rbind, tables), I cannot access the column variables by calling tables[index]$variable; it returns NULL in the console. When I print tables[index], I do not see the column names in the first row of the table. Can someone explain why the column names are missing and why NULL is returned?

To see why you are getting NULL let's create a reproducible example:
df1 <- head(mtcars)
df2 <- head(iris)
my_list <- list(df1, df2)
Test the subsetting with one bracket and two:
my_list[2]$Species
NULL
my_list[[2]]$Species
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
Subsetting with two brackets produces the desired output.
Further Explanation
Why doesn't one bracket work?
> my_list[2]
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
> my_list[[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
If someone couldn't tell the difference between the two outputs, I wouldn't blame them; they look alike. There is one small but important difference between using one bracket and two: the first returns a list, the second returns a data frame. To check, notice the [[1]] in the first line of the output of my_list[2]. That indicates that the output is a list. A list of length one has no element named Species, so $Species returns NULL. We must use two brackets to get back the data frame itself.
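The difference can also be checked directly with class(). The following is a minimal sketch that reuses my_list from the example above; first_cols is a name introduced here only for illustration.

```r
df1 <- head(mtcars)
df2 <- head(iris)
my_list <- list(df1, df2)

# Single bracket: a list of length one that wraps the data frame
class(my_list[2])    # "list"

# Double bracket: the data frame itself
class(my_list[[2]])  # "data.frame"

# To work on every element, let lapply() hand each data frame to a function.
# first_cols is an illustrative name: the first column of each data frame.
first_cols <- lapply(my_list, function(d) d[[1]])
length(first_cols)
```

Inside lapply() each element arrives already unwrapped, which is why the function can use d[[1]] or d$column without any extra brackets.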

Related

way to customize zScore function with r

I am new to R and have a question.
Create a function, zScore, that will take a vector of numbers (x) and convert them to a vector of z-scaled numbers (see code below). (Don't worry about NAs.)
#This creates the z-scaled numbers for sepal lengths
(iris$Sepal.Length - mean(iris$Sepal.Length))/sd(iris$Sepal.Length)
#This creates the z-scaled numbers for sepal widths
(iris$Sepal.Width - mean(iris$Sepal.Width))/sd(iris$Sepal.Width)
write a zScore function that is flexible.
thank you for any help you provide
You can use the following code:
# Z-score function
zscore <- function(x) {
(x - mean(x))/sd(x)
}
library(tidyverse)
iris %>%
mutate(zscore_sepal.length = zscore(Sepal.Length)) %>%
mutate(zscore_sepal.width = zscore(Sepal.Width))
Output (first five rows; z-scores rounded to three decimals):
Sepal.Length Sepal.Width Petal.Length Petal.Width Species zscore_sepal.length zscore_sepal.width
1 5.1 3.5 1.4 0.2 setosa -0.898 1.016
2 4.9 3.0 1.4 0.2 setosa -1.139 -0.132
3 4.7 3.2 1.3 0.2 setosa -1.381 0.327
4 4.6 3.1 1.5 0.2 setosa -1.501 0.098
5 5.0 3.6 1.4 0.2 setosa -1.018 1.245
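If the goal is a function flexible enough for any numeric column, the same zscore function can be applied to every numeric column at once in base R. This is only a sketch under that assumption; num_cols and iris_z are names introduced here for illustration.

```r
# Z-score function, as in the answer above
zscore <- function(x) {
  (x - mean(x)) / sd(x)
}

# Logical vector marking which columns of iris are numeric
num_cols <- vapply(iris, is.numeric, logical(1))

# Copy the data and replace each numeric column with its z-scores
iris_z <- iris
iris_z[num_cols] <- lapply(iris[num_cols], zscore)

# Sanity check: every scaled column should have mean ~0 and sd 1
round(colMeans(iris_z[num_cols]), 10)
sapply(iris_z[num_cols], sd)
```

The result should agree with base R's scale(), which performs the same centering and scaling by default.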

Is it possible to combine parameters to a subset function that is generated programmatically in R?

Before my question, here is a little background.
I am creating a general purpose data shaping and charting library for plotting survey data of a particular format.
As part of my scripts, I am using the subset function on my data frame. The way I am working is that I have a parameter file where I can pass this subsetting criteria into my functions (so I don't need to directly edit my main library). The way I do this is as follows:
subset_criteria <- expression(variable1 != "" & variable2 == TRUE)
(where variable1 and variable2 are columns in my data frame, for example).
Then in my function, I call this as follows:
my.subset <- subset(my.data, eval(subset_criteria))
This part works exactly as I want it to work. But now I want to augment that subsetting criteria inside the function, based on some other calculations that can only be performed inside the function. So I am trying to find a way to combine together these subsetting expressions.
Imagine inside my function I create some new column in my data frame automatically, and then I want to add a condition to my subsetting that says that this additional column must be TRUE.
Essentially, I do the following:
my.data$newcolumn <- with(my.data, ifelse(...some condition..., TRUE, FALSE))
Then I want my subsetting to end up being:
my.subset <- subset(my.data, eval(subset_criteria & newcolumn == TRUE))
But it does not seem like simply doing what I list above is valid. I get the wrong solution. So I'm looking for a way of combining these expressions using expression and eval so that I essentially get the combination of all the conditions.
Thanks for any pointers. It would be great if I can do this without having to rewrite how I do all my expressions, but I understand that might be what is needed...
Bob
You should probably avoid two things: using subset in a non-interactive setting (see the warning in its help page) and eval(parse()). That said, here we go.
You can convert the expression to a string and append whatever you want to it. The trick is to convert the string back into an expression, which is where the aforementioned parse comes in.
sub1 <- expression(Species == "setosa")
subset(iris, eval(sub1))
sub2 <- paste(sub1, '&', 'Petal.Width > 0.2')
subset(iris, eval(parse(text = sub2))) # your case
> subset(iris, eval(parse(text = sub2)))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
22 5.1 3.7 1.5 0.4 setosa
24 5.1 3.3 1.7 0.5 setosa
27 5.0 3.4 1.6 0.4 setosa
32 5.4 3.4 1.5 0.4 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
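For comparison, the two conditions can also be combined as language objects, without a round trip through strings. This is a sketch of an alternative, not what the answer above does: it uses bquote() to splice the stored call into a larger one and plain logical indexing instead of subset. The names combined, keep and result are introduced here for illustration.

```r
sub1 <- expression(Species == "setosa")

# sub1[[1]] is the unevaluated call held inside the expression vector;
# .() splices it into the new call built by bquote()
combined <- bquote(.(sub1[[1]]) & Petal.Width > 0.2)

# Evaluate the combined call with the columns of iris in scope
keep <- eval(combined, iris)

# Ordinary logical indexing; no subset() or parse() involved
result <- iris[keep, ]
nrow(result)  # 16, the same rows as listed above
```

Because nothing is pasted into a string, quoting inside the condition (such as "setosa") cannot be mangled along the way.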

Unable to modify an object returned by an R function

I wrote a function in R to return the name of the first column of a data frame:
my_func <- function(table, firstColumnOnly = TRUE){
if(firstColumnOnly)
return(colnames(table)[1])
else
return(colnames(table))
}
If I call the function like this:
my_func(fertility)<-"foo"
I get the following error:
Error in my_func(fertility, FALSE)[1] <- "foo" :
could not find function "my_func<-"
Why am I getting this error? I can do this without an error:
colnames(fertility)[1]<-"Country"
It seems like you are expecting that this:
my_func(fertility)<-"foo"
will be understood by R as:
colnames(table)[1] <- "foo" # if firstColumnOnly
or
colnames(table) <- "foo" # if !firstColumnOnly
It will not. One reason for this is that colnames() and `colnames<-`() are two distinct functions: the first returns the column names, the second assigns new ones. Your function can only return the names, not assign them.
One workaround would be to write your function using `colnames<-`, so that it returns a modified copy of the table:
my_func <- function(table, rep, firstColumnOnly = TRUE){
if(firstColumnOnly) colnames(table)[1] <- rep
else colnames(table) <- rep
return(table)
}
Test
head(my_func(iris,"foo"))
foo Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
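If the assignment syntax itself is wanted, R lets you define a replacement function, which is exactly what the error message was looking for. The sketch below assumes a hypothetical name, first_colname; it is not part of the original question or answer.

```r
# A replacement function is named `name<-` and its last argument
# must be called `value`. first_colname is a hypothetical name.
`first_colname<-` <- function(x, value) {
  colnames(x)[1] <- value
  x  # return the modified object
}

df <- head(iris)
first_colname(df) <- "foo"  # R rewrites this as df <- `first_colname<-`(df, value = "foo")
colnames(df)
```

Note that the left-hand side must be a variable that can be reassigned, since R stores the returned object back into it.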

splitting a data.table, then modifying by reference

I have a use-case where I need to split a data.table, then apply different modify-by-reference operations to each partition. However, splitting forces copying of each table.
Here's a toy example on the iris dataset:
#split the data
DT <- data.table(iris)
out <- split(DT, DT$Species)
#assign partitions to global environment
NAMES <- as.character(unique(DT$Species))
lapply(seq_along(out), function(x) {
assign(NAMES[x], out[[x]], envir=.GlobalEnv)})
#modify by reference, same function applied to different columns for different partitions
#would do this programmatically in real use case
virginica[ ,summ:=sum(Petal.Length)]
setosa[ ,summ:=sum(Petal.Width)]
#rbind all (again, programmatic)
do.call(rbind, list(virginica, setosa))
Then I get the following warning:
Warning message:
In `[.data.table`(out$virginica, , `:=`(cumPedal, cumsum(Petal.Width))) :
Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference.
I know this is related to putting data.tables in lists. Is there any workaround for this use case, or a way to avoid using split? Note that in the real case, I want to modify by reference programmatically, so hardcoding a solution won't work.
Here's an example of using .EACHI to achieve what it sounds like you're trying to do:
## Create a data.table that indicates the pairs of keys to columns
New <- data.table(
Species = c("virginica", "setosa", "versicolor"),
FunCol = c("Petal.Length", "Petal.Width", "Sepal.Length"))
## Set the key of your original data.table
setkey(DT, Species)
## Now use .EACHI
DT[New, temp := cumsum(get(FunCol)), by = .EACHI][]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species temp
# 1: 5.1 3.5 1.4 0.2 setosa 0.2
# 2: 4.9 3.0 1.4 0.2 setosa 0.4
# 3: 4.7 3.2 1.3 0.2 setosa 0.6
# 4: 4.6 3.1 1.5 0.2 setosa 0.8
# 5: 5.0 3.6 1.4 0.2 setosa 1.0
# ---
# 146: 6.7 3.0 5.2 2.3 virginica 256.9
# 147: 6.3 2.5 5.0 1.9 virginica 261.9
# 148: 6.5 3.0 5.2 2.0 virginica 267.1
# 149: 6.2 3.4 5.4 2.3 virginica 272.5
# 150: 5.9 3.0 5.1 1.8 virginica 277.6
## Basic verification
head(cumsum(DT["setosa", ]$Petal.Width), 5)
# [1] 0.2 0.4 0.6 0.8 1.0
tail(cumsum(DT["virginica", ]$Petal.Length), 5)
# [1] 256.9 261.9 267.1 272.5 277.6

Merging files (and file names) in R

I'm trying to merge a directory full of comma delimited text files using R, while also incorporating the file name of each file as a new variable in the data set.
I've been using the following:
library(plyr)
file_list <- list.files()
dataset <- ldply(file_list, read.table, header=FALSE, sep=",")
Can anyone shed any light on how I'd add the file name for each file read as a new variable within dataset?
Many thanks,
-Jon
You can just make a wrapper around the read.table() function that adds in your filename variable. Something like this should work:
read.data <- function(file){
dat <- read.table(file, header = FALSE, sep = ",")
dat$fname <- file
return(dat)
}
Once there you just need to apply that function across your data files. Since you didn't post any example data I'm not sure what it actually looks like, but for now I'll assume it's clean as can be and that rbind() is sufficient to join them together, in which case this example should illustrate that function in action:
> data(iris)
> write.csv(iris,file="iris1.csv",row.names=F)
> write.csv(iris,file="iris2.csv",row.names=F)
> dataset <- do.call(rbind, lapply(list.files(pattern="csv$"),read.data))
> head(dataset)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species fname
1 5.1 3.5 1.4 0.2 setosa iris1.csv
2 4.9 3.0 1.4 0.2 setosa iris1.csv
3 4.7 3.2 1.3 0.2 setosa iris1.csv
4 4.6 3.1 1.5 0.2 setosa iris1.csv
5 5.0 3.6 1.4 0.2 setosa iris1.csv
6 5.4 3.9 1.7 0.4 setosa iris1.csv
> table(dataset$fname)
iris1.csv iris2.csv
150 150
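The same idea also works when the files live in another directory, as long as the stored name is trimmed with basename(). The sketch below is self-contained and writes two small files to a temporary directory; tmp, files and tables are names introduced here for illustration, and read.csv stands in for read.table on the assumption that the files have a header row.

```r
# Set up two small CSV files in a temporary directory
tmp <- tempfile("csvdemo")
dir.create(tmp)
write.csv(head(iris), file.path(tmp, "iris1.csv"), row.names = FALSE)
write.csv(tail(iris), file.path(tmp, "iris2.csv"), row.names = FALSE)

# full.names = TRUE returns paths that can be read from any working directory
files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Read each file and record only the file name, not the whole path
tables <- lapply(files, function(f) {
  dat <- read.csv(f)
  dat$fname <- basename(f)
  dat
})

dataset <- do.call(rbind, tables)
table(dataset$fname)
```

Each file contributes six rows here, so the final table should report 6 for iris1.csv and 6 for iris2.csv.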
