Organizing data from physics experiments for ggplot2 - r

I am currently trying to use ggplot2 to visualize results from simple current-voltage experiments. I managed to get good results for a single dataset.
However, I have a number of current-voltage datasets, which I import into R recursively to get the following organisation (see minimal code):
data.frame(cbind(batch(string list), sample(string list), dataset(data.frame list)))
Edit: My data are stored in text files named batchname_samplenumber.txt, with voltage and current columns. The code I use to import them is:
require(plyr)
require(ggplot2)
#VARIABLES
regex <- "([[:alnum:]_]+).([[:alpha:]]+)"
regex2 <- "G5_([[:alnum:]]+)_([[:alnum:]]+).([[:alpha:]]+)"
#FUNCTIONS
getJ <- function(list, k) llply(list, function(i) llply(i, function(i, indix) getElement(i,indix), indix = k))
#FILES
files <- list.files("Data/",full.names= T)
#NAMES FOR FILES
paths <- llply(llply(files, basename),function(i) regmatches(i,regexec(regex,i)))
paths2 <- llply(llply(files, basename),function(i) regmatches(i,regexec(regex2,i)))
names <- llply(llply(getJ(paths, 2)),unlist)
batches <- llply(llply(getJ(paths2, 2)),unlist)
samples <- llply(llply(getJ(paths2, 3)),unlist)
#SETS OF DATA, NAMED
sets <- llply(files,function(i) read.table(i,skip = 0, header = F))
names(sets) <- names
for (i in as.list(names)) names(sets[[i]]) <- c("voltage","current")
df<-data.frame(cbind(batches,samples,sets))
And a minimal data can be generated via :
require(plyr)
batch <- list("A","A","B","B")
sample <- list(1,2,1,2)
set <- list(data.frame(voltage = runif(10), current = runif(10)),data.frame(voltage = runif(10), current = runif(10)),data.frame(voltage = runif(10), current = runif(10)),data.frame(voltage = runif(10), current = runif(10)))
df<-data.frame(cbind(batch,sample,set))
My question is : is it possible to use the data as is to plot using a code similar to the following (which does not work) ?
ggplot(data, aes(x = dataset$current, y = dataset$voltage, colour = sample)) + facet_wrap(~batch)
The more general version would be: is ggplot2 capable of handling raw physical data, as opposed to discrete statistical data (like diamonds, cars)?

With the newly-defined problem (two-column files named "batchname_samplenumber.txt"), I would suggest the following strategy:
read_custom <- function(f, ...) {
d <- read.table(f, ...)
names(d) <- c("V", "I")
## extract sample and batch from the base filename
ids <- strsplit(sub("\\.txt$", "", basename(f)), "_")
d$batch <- ids[[1]][1]
d$sample <- ids[[1]][2]
d
}
## list files to read
files <- list.files(pattern="\\.txt$")
## read them all in a single data.frame
m <- ldply(files, read_custom)
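To try this strategy without the real files, here is a minimal sketch that writes a few made-up two-column files to a temporary folder, reads them back with base R only (lapply plus rbind instead of ldply), and plots the result if ggplot2 is installed. The folder name, the fake data and the read_iv helper are assumptions for illustration only; the column names V and I follow the code above.

```r
# Hypothetical demo data: four small files named batchname_samplenumber.txt
demo_dir <- file.path(tempdir(), "iv_demo")
dir.create(demo_dir, showWarnings = FALSE)
set.seed(1)
for (id in c("A_1", "A_2", "B_1", "B_2")) {
  v <- seq(0, 1, length.out = 10)
  write.table(data.frame(v, v * runif(1, 1, 2)),
              file.path(demo_dir, paste0(id, ".txt")),
              row.names = FALSE, col.names = FALSE)
}

# Read one file and tag every row with the batch and sample
# taken from the base filename
read_iv <- function(f) {
  d <- read.table(f, col.names = c("V", "I"))
  ids <- strsplit(sub("\\.txt$", "", basename(f)), "_", fixed = TRUE)[[1]]
  d$batch <- ids[1]
  d$sample <- ids[2]
  d
}

# Stack all files into one long data.frame
files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)
m <- do.call(rbind, lapply(files, read_iv))

# One panel per batch, one coloured curve per sample
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(m, aes(x = V, y = I, colour = sample)) +
    geom_path() +
    facet_wrap(~batch)
  print(p)
}
```

Each row of m carries its own batch and sample labels, so ggplot2 can map them to facets and colours directly.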

It's not clear how the sample names are defined with respect to the dataset. The general idea for ggplot2 is that you should group all your data in the form of a melted (long format) data.frame.
library(ggplot2)
library(plyr)
library(reshape2)
l1 <- list(batch="b1", sample=paste("s", 1:4, sep=""),
dataset=data.frame(current=rnorm(10*4), voltage=rnorm(10*4)))
l2 <- list(batch="b2", sample=paste("s", 1:4, sep=""),
dataset=data.frame(current=rnorm(10*4), voltage=rnorm(10*4)))
l3 <- list(batch="b3", sample=paste("s", 1:4, sep=""),
dataset=data.frame(current=rnorm(10*4), voltage=rnorm(10*4)))
list_to_df <- function(l, n=10){
m <- l[["dataset"]]
m$batch <- l[["batch"]]
m$sample <- rep(l[["sample"]], each=n)
m
}
## list_to_df(l1)
m <- ldply(list(l1, l2, l3), list_to_df)
ggplot(m) + facet_wrap(~batch)+
geom_path(aes(current, voltage, colour=sample))
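If the data are already in R in the nested form from the question (parallel lists batch, sample and set), the same long format can be built without plyr. This is only a sketch under that assumption; the variable names follow the minimal example in the question and the name long is made up here.

```r
set.seed(42)
# Same structure as the minimal example in the question
batch  <- list("A", "A", "B", "B")
sample <- list(1, 2, 1, 2)
set    <- replicate(4,
                    data.frame(voltage = runif(10), current = runif(10)),
                    simplify = FALSE)

# Attach the batch and sample labels to every row of each dataset,
# then stack the four pieces into one long data.frame
long <- do.call(rbind, Map(function(b, s, d) {
  d$batch  <- b
  d$sample <- as.character(s)
  d
}, batch, sample, set))

str(long)
```

The result can then be passed to ggplot2 with aes(current, voltage, colour = sample) and facet_wrap(~batch), as in the answer above.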

Related

Threshold for using join over merge

My understanding of the difference between the merge() function (in base R) and the join() functions of plyr and dplyr is that join() is faster and more efficient when working with "large" data sets.
Is there some way to determine a threshold for when to use join() over merge(), without using a heuristic approach?
I am sure you will be hard pressed to find a "hard and fast" rule about when to switch from one function to the other. As others have mentioned, R has a set of tools to help you measure performance. object.size and system.time are two such functions; they report memory usage and execution time, respectively. One general approach is to measure the two functions directly over an arbitrarily growing data set. Below is one attempt at this. We will create a data frame with an 'id' column and a random set of numeric values, let the data frame grow, and measure how the timings change. I'll use inner_join here since you mentioned dplyr. We will measure time as "elapsed" time.
library(tidyverse)
set.seed(424)
#number of rows in a cycle
growth <- c(100,1000,10000,100000,1000000,5000000)
#empty lists
n <- 1
l1 <- c()
l2 <- c()
#test for inner join in dplyr
for(i in growth){
x <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
y <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
test <- inner_join(x,y, by = c('id' = 'id'))
l1[[n]] <- object.size(test)
print(system.time(test <- inner_join(x,y, by = c('id' = 'id')))[3])
l2[[n]] <- system.time(test <- inner_join(x,y, by = c('id' = 'id')))[3]
n <- n+1
}
#empty lists
n <- 1
l3 <- c()
l4 <- c()
#test for merge
for(i in growth){
x <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
y <- data.frame("id" = 1:i, "value" = rnorm(i,0,1))
test <- merge(x,y, by = c('id'))
l3[[n]] <- object.size(test)
# print(object.size(test))
print(system.time(test <- merge(x,y, by = c('id')))[3])
l4[[n]] <- system.time(test <- merge(x,y, by = c('id')))[3]
n <- n+1
}
#plotting output (some coercing may happen, so be it)
plot <- bind_rows(data.frame("size_bytes" = l3, "time_sec" = l4, "id" = "merge"),
data.frame("size_bytes" = l1, "time_sec" = l2, "id" = "inner_join"))
plot$size_MB <- plot$size_bytes/1000000
ggplot(plot, aes(x = size_MB, y =time_sec, color = id)) + geom_line()
merge seems to perform worse right out of the gate, and its run time really takes off at around 20 MB. Is this the final word on the matter? No. But such testing can give you an idea of how to choose a function.
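A single system.time call can be noisy, so a slightly more robust variant of the same idea is to repeat each measurement and keep the median. The sketch below is only an illustration: the helper names median_elapsed and make_pair, the sizes and the number of repetitions are all made up, and the dplyr timing is skipped when dplyr is not installed.

```r
# Median elapsed time (seconds) of a zero-argument function over a few runs
median_elapsed <- function(f, reps = 3) {
  median(vapply(seq_len(reps),
                function(k) system.time(f())[["elapsed"]],
                numeric(1)))
}

# Two data frames of n rows sharing an 'id' key
make_pair <- function(n) {
  list(x = data.frame(id = seq_len(n), value = rnorm(n)),
       y = data.frame(id = seq_len(n), value = rnorm(n)))
}

set.seed(424)
sizes <- c(1e3, 1e4, 1e5)
timings <- do.call(rbind, lapply(sizes, function(n) {
  d <- make_pair(n)
  out <- data.frame(
    rows = n, method = "merge",
    time_sec = median_elapsed(function() merge(d$x, d$y, by = "id")))
  if (requireNamespace("dplyr", quietly = TRUE)) {
    out <- rbind(out, data.frame(
      rows = n, method = "inner_join",
      time_sec = median_elapsed(
        function() dplyr::inner_join(d$x, d$y, by = "id"))))
  }
  out
}))
print(timings)
```

The numbers depend on the machine, the R version and the shape of the keys, so any threshold read from such a table applies only to data similar to what was tested.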

How to export data frame with vector entries into CSV file?

I have a data frame with some entries as lists. This was an import from a JSON file, where an entry might have multiple tags. I imported the JSON file using the jsonlite package with flatten=TRUE. An example entry from my tags column is:
list(tag = c("ethicaltheory", "gametheory"), raw_tag = c("ethical heory", "Game Theory"))
I filtered the table down and want to export it to a csv. When I tried the write.csv command, I hit an error when it hit the first entry with list:
"unimplemented type 'list' in 'EncodeElement'"
The question is: can I export this file as is, or did I make a mistake in importing it?
I'd be fine with converting entries to strings or something, but I'm not sure how to do that for the entire table.
Interesting problem. I did not know that a data.frame can store lists. This algorithm takes the variable-length lists stored in the rows of a data.frame, finds the maximum length, creates a matrix with the proper dimensions, and then saves the file.
x <- list(tag = letters[1:2], raw_tag = letters[1:3])
y <- list(tag = letters[1:2], raw_tag = letters[1:2])
z <- list(tag = letters[1:3], raw_tag = letters[1:4])
df <- data.frame(clmn = I(list(x, y, z)))
r <- apply(df, 1, unlist)
lm <- max(unlist(lapply(r, length)))
df <- data.frame(
  matrix(
    rep(0, (lm * nrow(df))),
    ncol = lm
  )
)
vals.v <- unlist(lapply(1:nrow(df), function(i) {
v <- unlist(r[i])
l <- length(v)
c(v, rep(0, lm - l))
}))
fin.res <- t(matrix(vals.v, ncol = nrow(df)))
# write.csv(fin.res, "res2.csv") # uncomment to save CSV file
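Since the question says converting the entries to strings would be fine, another option is to collapse every list entry into a single delimited string before writing. This is a sketch on made-up data (the id and tags columns and the "; " separator are assumptions), not the structure jsonlite necessarily returns.

```r
# Hypothetical data frame with one list column
df <- data.frame(id = 1:3)
df$tags <- list(
  list(tag = c("a", "b"), raw_tag = c("A", "B")),
  list(tag = "c", raw_tag = "C"),
  list(tag = character(0), raw_tag = character(0))
)

# Find the list columns and collapse each entry into one string
is_list_col <- vapply(df, is.list, logical(1))
df[is_list_col] <- lapply(df[is_list_col], function(col) {
  vapply(col,
         function(entry) paste(unlist(entry), collapse = "; "),
         character(1))
})

# Every column is now atomic, so write.csv no longer fails
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
```

This keeps one row per original entry, at the cost of having to split the strings again if the individual tags are needed later.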

Vectorization of a nested for-loop that inputs all paired combinations

I thought that the following problem must have been answered or a function must exist to do it, but I was unable to find an answer.
I have a nested loop that takes a row from one 3-col. data frame and copies it next to each of the other rows, to form a 6-col. data frame (with all possible combinations). This works fine, but with a medium sized data set (800 rows), the loops take forever to complete the task.
I will demonstrate on a sample data set:
Sdat <- data.frame(
x = c(10,20,30,40),
y = c(15,25,35,45),
ID =c(1,2,3,4)
)
compar <- data.frame(matrix(nrow=0, ncol=6)) # to contain all combinations
names(compar) <- c("x","y", "ID", "x","y", "ID")
N <- nrow(Sdat) # how many different points we have
for (i in 1:N)
{
for (j in 1:N)
{
Temp1 <- Sdat[i,] # data from 1st point
Temp2 <- Sdat[j,] # data from 2nd point
C <- cbind(Temp1, Temp2)
compar <- rbind(C,compar)
}
}
These loops provide exactly the output that I need for further analysis. Any suggestion for vectorizing this section?
You can do:
ind <- seq_len(nrow(Sdat))
grid <- expand.grid(ind, ind)
compar <- cbind(Sdat[grid[, 1], ], Sdat[grid[, 2], ])
A naive solution using rep (assuming you are happy with a data frame output):
compar <- data.frame(x = rep(Sdat$x, each = N),
y = rep(Sdat$y, each = N),
id = rep(Sdat$ID, each = N),
x1 = rep(Sdat$x, N),
y1 = rep(Sdat$y, N),
id_1 = rep(Sdat$ID, N))
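For reference, here is a self-contained sketch of the index-based approach on the sample data, with the second copy of the columns renamed so that the result has unique column names. The "_2" suffix and the names idx, first and second are just choices made for this example.

```r
Sdat <- data.frame(
  x  = c(10, 20, 30, 40),
  y  = c(15, 25, 35, 45),
  ID = c(1, 2, 3, 4)
)

# All ordered pairs (i, j) of row indices, including i == j
idx <- expand.grid(i = seq_len(nrow(Sdat)), j = seq_len(nrow(Sdat)))

first  <- Sdat[idx$i, ]
second <- Sdat[idx$j, ]
names(second) <- paste0(names(second), "_2")  # avoid duplicated names

compar <- cbind(first, second)
rownames(compar) <- NULL
head(compar)
```

The row order differs from the loop version, which prepends each new pair, but the set of N * N combinations is the same.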

Create longitudinal data from a list of igraph objects in R

I'm doing analysis on company networks in R and am trying to export my igraph results into a dataframe.
Here's a reproducible example:
library(igraph)
sample <- data.frame(ID = 1:8, org_ID = c(5,4,1,2,2,2,5,7), mon = c("199801", "199802","199802","199802","199904","199912","200001", "200012"))
create.graphs <- function(df){
g <- graph.data.frame(d = df, directed = TRUE)
g <- simplify(g, remove.multiple = FALSE, remove.loops = TRUE)
E(g)$weight <- count_multiple(g)
#calculate global values
g$centrality <- centralization.degree(g)
#calculate local values
g$indegree <- degree(g, mode = "in",
loops = FALSE, normalized = FALSE)
return(g)
}
df.list <- split(sample, sample$mon)
g <- lapply(df.list, create.graphs)
As you can see, I have graphs for multiple months. I want to export this to longitudinal data, where each row represents a month (per ID) and each column represents the corresponding network measures.
So far I've managed to create a data frame, but not how to run it through the list of graphs and put it into a fitting format. An additional problem could be that the graphs have different numbers of nodes (some have around 25, others more than 40), but that should theoretically just be recognised as missing by my regression model.
output <- data.frame(Centrality = g$`199801`$centrality,
Indegree = g$`199801`$indegree)
output
summary(output)
I tried writing a function similar to the one above for this, but unfortunately to no avail.
Thanks in advance for reading this, any help is greatly appreciated
I wanted to share how I solved it (thanks to Dave2e's suggestion).
Note that ci$monat defines my time periods in the original data, so one row for each point in time.
sumarTable <- data.frame(time = unique(ci$monat))
sumarTable$indegree <- lapply(g, function(x){x$indegree})
sumarTable$outdegree <- lapply(g, function(x){x$outdegree})
sumarTable$constraint <- lapply(g, function(x){x$constraint})
etc
edit:
in order to export these values, I had to "flatten" the lists:
sumarTable$indegree <- vapply(sumarTable$indegree, paste, collapse = ", ", character(1L))
sumarTable$outdegree <- vapply(sumarTable$outdegree, paste, collapse = ", ", character(1L))
sumarTable$constraint <- vapply(sumarTable$constraint, paste, collapse = ", ", character(1L))
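If a regression model needs one row per node per month rather than one collapsed string per month, the per-month vectors can be stacked into a long data frame instead. The sketch below uses made-up in-degree values in a plain named list so that it runs without igraph; in practice each element would come from something like degree(g[[month]], mode = "in"), which returns a numeric vector named by vertex when the vertices have names.

```r
# Hypothetical per-month results: one named numeric vector per month,
# with a different number of nodes in each month
indegree_by_month <- list(
  "199801" = c(n1 = 0, n5 = 1),
  "199802" = c(n2 = 0, n3 = 0, n4 = 2),
  "199904" = c(n5 = 0, n2 = 1)
)

# One row per (month, node); nodes absent in a month simply have no row
long <- do.call(rbind, lapply(names(indegree_by_month), function(mon) {
  v <- indegree_by_month[[mon]]
  data.frame(month = mon, node = names(v), indegree = unname(v),
             stringsAsFactors = FALSE)
}))
long
```

Further measures can be added as extra columns in the same data.frame call, as long as each is a vector with one value per node.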

R ggplot2 boxplot from 10 files

I have 4 files, named 0_X_cell.csv, 0_S_cell.csv, 15_X_cell.csv and 15_S_cell.csv, in the following format:
p U:0 U:1 U:2 Tracer Tracer_0 U_0:0
-34.014 0.15268 -3.7907 -0.20155 10.081 10.032 0.12454
-33.836 0.07349 -2.1457 -0.30531 27.706 27.278 0.076542
I'd like to create boxplots out of the values for Tracer/3600 and put them on the same graph using ggplot2 but I'm finding it not quite so straightforward. Any suggestions would be much appreciated:
I'm thinking it might be something like this:
Import data from all files into separate variables:
Extract Tracer from each one and put into a data.frame
Plot the boxplots of every column Tracer/3600. But each column will be called Tracer...
What would the correct procedure be?
Here's one way to do it (if I understood you correctly):
`0_X_cell.csv` <- `0_S_cell.csv` <- `15_X_cell.csv` <- `15_S_cell.csv` <- read.table(header=T, text="
p U:0 U:1 U:2 Tracer Tracer_0 U_0:0
-34.014 0.15268 -3.7907 -0.20155 10.081 10.032 0.12454
-33.836 0.07349 -2.1457 -0.30531 27.706 27.278 0.076542")
lst <- mget(grep("cell.csv", ls(), fixed=TRUE, value=TRUE))
df <- stack(lapply(lapply(lst, "[", "Tracer"), unlist))
df$ind <- sub("^(\\d+_[A-Z]).*$", "\\1", df$ind)
library(ggplot2)
ggplot(df, aes(ind, values/3600)) + geom_boxplot()
To read in the data from your dir:
z <- list.files(pattern = ".*cell\\.csv$")
z <- lapply(1:length(z), function(x) {chars <- strsplit(z[x], "_");
cbind(data.frame(Tracer = read.csv(z[x])$Tracer), time = chars[[1]][1], treatment = chars[[1]][2])})
z <- do.call(rbind, z)
Then plot it:
library(ggplot2)
ggplot(z, aes(y = Tracer/3600, x = factor(time))) +geom_boxplot(aes(fill = factor(treatment))) + ylab("Tracer")
