Transposing data frame based on factor column - r

Let's assume I have a dataframe in the following format, obtained from a .csv file:
Measurement Config Value
--------------------------- _
Time A 10 |
Object A 20 | Run 1
Nodes A 30 _|
Time A 8 |
Object A 18 | Run 2
Nodes A 29 _|
Time B 9 |
Object B 20 | Run 3
Nodes B 35 _|
...
There are a fixed number of Measurements that are taken during each run, and each run is run with a given Config.
The Measurements per run are fixed (e.g., every run consists of a Time, an Objects and a Nodes measurement in the example above), but there can be multiple runs for a single config (e.g., Config A was run two times in the example above, B only once)
My primary goal is to plot correlations (scatter plots) between two of those measurement types, e.g., plot Objects (x-axis) against Nodes (y-axis) and highlight different Configs (color)
I thought that this could be best achieved if the dataframe is in the following format:
Config Time Objects Nodes
--------------------------
A 10 20 30 <- Run 1
A 8 18 29 <- Run 2
B 9 20 35 <- Run 3
I.e., creating the columns based on the factor-values of the Measurement-column, and assigning the respective Value-value to the cells.
Is there an "easy" way in R to achieve that?

First create a run variable:
# option 1:
d$run <- ceiling(seq_along(d$Measurement)/3)
# option 2:
d$run <- 1 + (seq_along(d$Config)-1) %/% 3
Then reshape to wide format with the dcast function from reshape2 or data.table:
reshape2::dcast(d, Config + run ~ Measurement, value.var = 'Value')
You will then get:
  Config run Nodes Object Time
1      A   1    30     20   10
2      A   2    29     18    8
3      B   3    35     20    9
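If you would rather avoid extra packages, here is a minimal sketch of the same idea with base R's reshape(). The data frame d is rebuilt by hand from the example in the question, n_meas and wide are names made up for this illustration, and it assumes the rows are ordered run by run as shown in the question.

```r
# Example data copied from the question (long format)
d <- data.frame(
  Measurement = rep(c("Time", "Object", "Nodes"), times = 3),
  Config      = c(rep("A", 6), rep("B", 3)),
  Value       = c(10, 20, 30, 8, 18, 29, 9, 20, 35)
)

# Assumption: every run has the same number of measurements, in consecutive rows
n_meas <- length(unique(d$Measurement))
d$run  <- ceiling(seq_len(nrow(d)) / n_meas)

# Long to wide: one row per Config/run, one column per Measurement
wide <- reshape(d, idvar = c("Config", "run"),
                timevar = "Measurement", direction = "wide")

# reshape() names the new columns "Value.Time" etc.; drop the prefix
names(wide) <- sub("^Value\\.", "", names(wide))
print(wide)
```

From there, something like plot(wide$Object, wide$Nodes, col = factor(wide$Config)) should give the scatter plot coloured by Config.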

Related

Matching two datasets using different IDs

I have two datasets, one is longitudinal (following individuals over multiple years) and one is cross-sectional. The cross-sectional dataset is compiled from the longitudinal dataset, but uses a randomly generated ID variable which does not allow tracking someone across years. I need the panel/longitudinal structure, but the cross-sectional dataset has more variables available than the longitudinal one.
The combination of ID-year uniquely identifies each observation, but since the ID values are not the same across the two datasets (they are randomized in the cross-sectional one so that individuals cannot be tracked) I cannot match them based on this.
I guess I would need to find a set of variables that uniquely identify each observation, excluding ID, and match based on those. How would I go about doing that in R?
The long dataset looks like so
id year y
1 1 10
1 2 20
1 3 30
2 1 15
2 2 20
2 3 5
and the cross dataset like so
id year y x
912 1 10 1
492 2 20 1
363 3 30 0
789 1 15 1
134 2 25 0
267 3 5 0
Now, in actuality the data has 200-300 variables. So I would need a method to find the smallest set of variables that uniquely identifies each observation in the long dataset and then match based on these to the cross-sectional dataset.
Thanks in advance!
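One possible starting point, as a rough sketch only: try column combinations of increasing size until one has no duplicated rows, then merge on it. The helper find_key, its max_size argument, the extra column w and the toy data below are all made up for illustration and are not the data from the question.

```r
# Hypothetical helper: smallest set of candidate columns whose combined
# values are unique for every row (brute force, capped at max_size columns)
find_key <- function(df, candidates, max_size = 3) {
  for (k in seq_len(min(max_size, length(candidates)))) {
    for (cols in combn(candidates, k, simplify = FALSE)) {
      if (anyDuplicated(df[, cols, drop = FALSE]) == 0) return(cols)
    }
  }
  NULL  # no identifying combination found within max_size columns
}

# Toy data (illustrative only); w stands in for some other shared variable
long_df  <- data.frame(id = c(1, 1, 2, 2), year = c(1, 2, 1, 2),
                       y = c(10, 20, 15, 20), w = c(3, 4, 3, 5))
cross_df <- data.frame(id = c(912, 492, 789, 134), year = c(1, 2, 1, 2),
                       y = c(10, 20, 15, 20), w = c(3, 4, 3, 5),
                       x = c(1, 1, 1, 0))

# Search only among variables present in both datasets, excluding the IDs
shared <- setdiff(intersect(names(long_df), names(cross_df)), "id")
key <- find_key(long_df, shared)

# The key must also be unique in the cross-sectional data, or merge() duplicates rows
stopifnot(!is.null(key), anyDuplicated(cross_df[, key, drop = FALSE]) == 0)

# Attach the extra variable x to the longitudinal data via the key
merged <- merge(long_df, cross_df[, c(key, "x")], by = key)
print(merged)
```

With 200-300 variables the number of combinations grows quickly, so keeping max_size small (or pre-selecting promising candidates) is probably necessary.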

R: Plot several lines in the same plot: ggplot + data tables or frames vs matrices

My general problem: I tend to struggle using ggplot, because it's very data-frame-centric but the objects I work with seem to fit matrices better than data frames. Here is an example (adapted a little).
I have a quantity x that can assume values 0:5, and a "context" that can have values 0 or 1. For each context I have 7 different frequency distributions over the quantity x. (More generally I could have more than two "contexts", more values of x, and more frequency distributions.)
I can represent these 7×2 frequency distributions as a list freqs of two matrices, say:
> freqs
$`context0`
x0 x1 x2 x3 x4 x5
sample1 20 10 10 21 37 2
sample2 34 40 6 10 1 8
sample3 52 4 1 2 17 25
sample4 16 32 25 11 5 10
sample5 28 2 10 4 21 35
sample6 22 13 35 12 13 5
sample7 9 5 43 29 4 10
$`context1`
x0 x1 x2 x3 x4 x5
sample1 15 21 14 15 14 21
sample2 27 8 6 5 29 25
sample3 13 7 5 26 48 0
sample4 33 3 18 11 13 22
sample5 12 23 40 11 2 11
sample6 5 51 2 28 5 9
sample7 3 1 21 10 63 2
or a 3D array.
Or I could use a data.table tablefreqs like this one:
> tablefreqs
context x0 x1 x2 x3 x4 x5
1: 0 20 10 10 21 37 2
2: 0 34 40 6 10 1 8
3: 0 52 4 1 2 17 25
4: 0 16 32 25 11 5 10
5: 0 28 2 10 4 21 35
6: 0 22 13 35 12 13 5
7: 0 9 5 43 29 4 10
8: 1 15 21 14 15 14 21
9: 1 27 8 6 5 29 25
10: 1 13 7 5 26 48 0
11: 1 33 3 18 11 13 22
12: 1 12 23 40 11 2 11
13: 1 5 51 2 28 5 9
14: 1 3 1 21 10 63 2
Now I'd like to draw the following line plot (there's a reason why I need line plots and not, say, histograms or bar plots):
The 7 frequency distributions for context 0, with x as x-axis and the frequency as y-axis, all in the same line plot (with some alpha).
The 7 frequency distributions for context 1, again with x as x-axis and the frequency as y-axis, all in the same line plot (with alpha), but displayed upside-down below the plot for context 0.
Ggplot would surely do this very nicely, but it seems to require some acrobatics with data tables:
– If I use the data table tablefreqs it's not clear to me how to plot all its rows having context==0 in the same plot: ggplot seems to only think column-wise, not row-wise. I could use the six values of x as table rows, but then the "context" values would also end up in a row, and I'm not sure I can subset a data table by values in a row, rather than in a column.
– If I use the matrix freqs, I could create a mini-data-table having x as one column and one frequency distribution as another column, input that into ggplot+geom_line, then go over all 7 frequency distributions in a for-loop maybe. Not clear to me how to tell ggplot to keep the previous plots in this case. Then another for-loop over the two "contexts".
I'd be grateful for suggestions on how to approach this problem in particular, and more generally on what objects to choose for storing this kind of data: matrices? data tables, maybe with a different structure than shown here? some other formats?
I would suggest familiarizing yourself with the concept of Tidy Data, a set of principles for data handling and storage adopted by ggplot2 and a number of other packages.
You are free to use a matrix or list of matrices to store your data; however, you can certainly store the data as you describe it (and as I understand it) in a data frame or single table with the following columns:
context | sample | x | freq
I'll show you how I would convert the tablefreqs dataset you shared with us into that format, then how I would go about creating a plot as you are describing it in your question. I'm assuming in this case you only have the two values for context, although you allude to there being more. I'm going to try to interpret correctly what you stated in your question.
Create the Tidy Data frame
Your data frame as shown contains columns x0 through x5, so the values for x are spread across more than one column, when you really need them converted into the format shown above. This is called "gathering" your data, and you can do that with tidyr::gather().
First, I also need to replicate the naming of your samples according to the matrix dataset, so I'll do that and gather your data:
library(dplyr)
library(tidyr)
library(ggplot2)
# create the sample names
tablefreqs$sample <- rep(paste0('sample',1:7), 2)
# gather the columns together
df <- tablefreqs %>%
  gather(key='x', value='freq', -c(context, sample))
Note that in the gather() function, we have to specify to leave alone the two columns context and sample, as they are not part of the gathering effort. But now we are left with df$x containing character values ("x0", "x1", ...). We can't plot that, because we want them to be in the form of a number (at least... I'm assuming you do). For that, we'll convert using:
df$x <- as.numeric(gsub("[^[:digit:].]", "", df$x))
That extracts the number from each value in df$x and represents it as a number, not a character. We have the opposite issue with df$context, which is actually a discrete factor, and we should represent it as such in order to make plotting a bit easier:
df$context <- factor(df$context)
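If you start from the list of matrices freqs instead of tablefreqs, the same long layout can be reached in base R. This is only a sketch: to_long and df_long are names I made up, and just the first two samples of each context from your question are included to keep it short.

```r
# First two samples per context, copied from the question
x_names <- paste0("x", 0:5)
freqs <- list(
  context0 = matrix(c(20, 10, 10, 21, 37, 2,
                      34, 40,  6, 10,  1, 8),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("sample1", "sample2"), x_names)),
  context1 = matrix(c(15, 21, 14, 15, 14, 21,
                      27,  8,  6,  5, 29, 25),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("sample1", "sample2"), x_names))
)

# Hypothetical helper: one matrix -> long data frame (sample, x, freq, context)
to_long <- function(m, context) {
  long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
  names(long) <- c("sample", "x", "freq")
  long$x <- as.numeric(sub("^x", "", long$x))  # "x3" -> 3
  long$context <- context
  long
}

# Stack the per-context pieces into one tidy data frame
df_long <- do.call(rbind, Map(to_long, freqs, c(0, 1)))
df_long$context <- factor(df_long$context)
head(df_long)
```

The result has one row per context/sample/x combination, which is the shape ggplot2 works with most naturally.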
Create the Plot
Now we're ready to create the plot. From your description, I may not have this perfectly right, but it seems that you want a plot containing both context = 1 and context = 0, and when context = 1 the data should be "upside down". By that, I'm assuming you are talking about plotting df$freq when df$context == 0 and -df$freq when df$context == 1. We could do that using some fancy logic in the ggplot() call, but I find it's easier just to create a new column in your dataset to represent what we will be plotting on the y axis. We'll call this column df$freq_adj and use that for plotting:
df$freq_adj <- ifelse(df$context==1, -df$freq, df$freq)
Then we create the plot. I'll explain a bit below the result:
ggplot(df, aes(x=x, y=freq_adj)) +
  geom_line(
    aes(color=context, linetype=sample)
  ) +
  geom_hline(yintercept=0, color='gray50') +
  scale_x_continuous(expand=expansion(mult=0)) +
  theme_bw()
Without some clearer description or picture of what you were looking to do, I took some liberties here. I used color to discriminate between the two values for context, and I'm using linetype to discriminate the different samples. I also added a line at 0, since it seemed appropriate to do so here, and the scale_x_continuous() command is removing the extra white space that is put in place at the extreme ends of the data.
An alternative that is maybe closer to your description would be to draw two separate panels, with context = 0 on top and context = 1 directly below it.
Here's the code and plot:
ggplot(df, aes(x=x, y=freq_adj)) +
  geom_line(aes(group=sample), alpha=0.3) +
  facet_grid(context ~ ., scales='free_y') +
  scale_x_continuous(expand=expansion(mult=0)) +
  theme_bw()
Here the use of aes(group=sample) is quite important: I want all the lines to look the same (alpha setting and color), yet ggplot2 still needs to know that the points should be connected per sample. This is done using the group= aesthetic. The scales='free_y' argument on facet_grid() allows the y axis scale to shrink and fit the data in each facet.

Forecast using time and cluster as groups

I'm a relative newbie with R and I'm trying to figure out the R code to generate a table of forecast data that I can export to a CSV for multiple variables grouped by different slices.
My data looks like this:
Time Cluster X1 X2 X3 ...
2018-04-21 A 10 53 23 ...
2018-04-21 B 65 34 79 ...
2018-04-22 A 35 80 76 ...
2018-04-22 B 12 68 34 ...
I'd like to get a forecast by date per cluster for each X value in the table. The end goal is to combine all the forecasted values into a CSV for import into a DB. My initial dataset has 7 different cluster values and about 3 months of daily data. There are about 6 different values that need forecasts. I can do (and have done) this fairly easily in Excel, but the requirement going forward is R to a CSV to a DB.
Thanks in advance!
Brandon~
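As a rough sketch of the loop structure only (not a recommendation of a forecasting model): split by cluster, fit something per column, stack the predictions and write one CSV. Simple exponential smoothing via stats::HoltWinters is just a stand-in here, and the toy data, the 7-day horizon and the names forecast_one, forecasts and out_file are assumptions made for this example.

```r
# Toy data in the same layout as the question (values are random, illustrative only)
set.seed(1)
dates <- seq(as.Date("2018-04-21"), by = "day", length.out = 30)
dat <- expand.grid(Time = dates, Cluster = c("A", "B"), stringsAsFactors = FALSE)
dat$X1 <- round(runif(nrow(dat), 10, 90))
dat$X2 <- round(runif(nrow(dat), 10, 90))

horizon    <- 7               # assumed number of days to forecast
value_cols <- c("X1", "X2")

# Hypothetical helper: forecast one column for one cluster
forecast_one <- function(cluster_df, cluster, col) {
  cluster_df <- cluster_df[order(cluster_df$Time), ]
  # Simple exponential smoothing (no trend, no seasonality) as a placeholder model
  fit  <- HoltWinters(ts(cluster_df[[col]]), beta = FALSE, gamma = FALSE)
  pred <- predict(fit, n.ahead = horizon)
  data.frame(Time     = max(cluster_df$Time) + seq_len(horizon),
             Cluster  = cluster,
             Variable = col,
             Forecast = as.numeric(pred))
}

# Loop over every cluster/column pair and stack the results
pieces <- list()
for (cl in unique(dat$Cluster)) {
  for (col in value_cols) {
    pieces[[length(pieces) + 1]] <- forecast_one(dat[dat$Cluster == cl, ], cl, col)
  }
}
forecasts <- do.call(rbind, pieces)

# One long CSV: a row per date, cluster and variable
out_file <- file.path(tempdir(), "forecasts.csv")
write.csv(forecasts, out_file, row.names = FALSE)
```

Swapping in a different model should only require changing the two lines inside forecast_one that fit and predict.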

Referencing different columns as ranges between two data frames

I have one data frame/list that gives an ID and a number
1. 25
2. 36
3. 10
4. 18
5. 12
This first list is effectively a list of objects with the number of objects contained in each, e.g. bricks in a wall, so a list of walls with the number of bricks in each.
I have a second that contains a full list of the objects being referred to in the list above and a second attribute for each.
1. 3
2. 4
3. 2
4. 8
5. 5
etc.
In the weak example I'm stringing together, this would be a list of the weight of each brick in all walls.
So my first list gives me the ranges I would like to average in the second list; as an end result I would like a list of walls with the average brick weight per wall,
i.e. average the attributes of 1-25, 26-61 ... 90-101.
My idea so far was to create a data frame with two columns
1. 1 25
2. 26 61
3. n
4. n
5. 90 101
and then attempt to create a third column that uses the first two as x and y in a mean(table2$coloumn1[x:y]) type formula, but I can't get anything to work.
The end result could probably look something like this
1. 3.2
2. 6.5
3. 3
4. 7.9
5. 8.5
Is there a way to do it like this, or does anyone have a more elegant solution?
You could do something like this... set the low and high limits of your ranges and then use mapply to work out the mean over the appropriate rows of df2.
df1 <- data.frame(id=c(1,2,3,4,5),no=c(25,36,10,18,12))
df2 <- data.frame(obj=1:100,att=sample(1:10,100,replace=TRUE))
df1$low <- cumsum(c(1,df1$no[-nrow(df1)]))
df1$high <- pmin(cumsum(df1$no),nrow(df2))
df1$meanatt <- mapply(function(l,h) mean(df2$att[l:h]), df1$low, df1$high)
df1
id no low high meanatt
1 1 25 1 25 4.760000
2 2 36 26 61 5.527778
3 3 10 62 71 5.800000
4 4 18 72 89 5.111111
5 5 12 90 100 4.454545
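An alternative sketch that skips the explicit low/high columns: label every row of df2 with the wall it belongs to and average per label with tapply. The names wall and mean_by_wall are made up here, the seed is only there to make the example repeatable, and the mapply version from above is recomputed as a cross-check.

```r
set.seed(42)  # arbitrary seed so the random example data is repeatable
df1 <- data.frame(id = 1:5, no = c(25, 36, 10, 18, 12))
df2 <- data.frame(obj = 1:100, att = sample(1:10, 100, replace = TRUE))

# Repeat each wall id once per brick, then trim to the rows df2 actually has
# (the counts add up to 101, but df2 only has 100 rows)
wall <- rep(df1$id, times = df1$no)[seq_len(nrow(df2))]

# Average attribute per wall
mean_by_wall <- tapply(df2$att, wall, mean)
print(mean_by_wall)

# Cross-check against the range-based approach
low   <- cumsum(c(1, head(df1$no, -1)))
high  <- pmin(cumsum(df1$no), nrow(df2))
check <- mapply(function(l, h) mean(df2$att[l:h]), low, high)
stopifnot(isTRUE(all.equal(as.numeric(mean_by_wall), check)))
```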

Re-sample a data frame with panel dimension

I have a data set consisting of 2000 individuals. For each individual i = 1, ..., 2000, the data set contains n repeated situations. Letting d denote this data set, each row of d is indexed by i and n. Among other variables, d has a variable pid which takes the same value across all rows (situations) of an individual.
Taking into consideration the panel nature of the data, I want to re-sample d (as in bootstrap):
with replacement,
store each re-sample data as a data frame
I considered using the sample function but could not make it work. I am a new user of R and have no programming skills.
The data set consists of many variables, but all the variables have numeric values. The data set is as follows.
pid x y z
1 10 2 -5
1 12 3 -4.5
1 14 4 -4
1 16 5 -3.5
1 18 6 -3
1 20 7 -2.5
2 22 8 -2
2 24 9 -1.5
2 26 10 -1
2 28 11 -0.5
2 30 12 0
2 32 13 0.5
The first six rows are for the first person, for which pid=1, and the next six rows, with pid=2, are different observations for the second person.
This should work for you:
z <- replicate(100,
               d[d$pid %in% sample(unique(d$pid), 2000, replace=TRUE),],
               simplify = FALSE)
The result z will be a list of dataframes you can do whatever with.
EDIT: this is a little wordy, but will deal with duplicated rows. replicate has its obvious use of performing a set operation a given number of times (in the example below, 4). I then sample the unique values of pid (in this case 3 of those values, with replacement) and extract the rows of d corresponding to each sampled value. The combination of do.call with rbind and lapply deals with the duplicates that are not handled well by the above code: with %in%, an individual drawn more than once still appears only once. Thus, instead of generating dataframes with potentially different lengths, this code generates a dataframe for each sampled pid and then uses do.call("rbind",...) to stick them back together within each iteration of replicate.
z <- replicate(4,
               do.call("rbind", lapply(sample(unique(d$pid), 3, replace=TRUE),
                                       function(x) d[d$pid==x,])),
               simplify=FALSE)
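One thing the code above does not do is tell apart two copies of the same person within a re-sample. Here is a sketch of a variant that adds a fresh identifier per draw; the function resample_panel and the column boot_id are names I made up, and d is rebuilt from the twelve rows shown in the question.

```r
# Example data copied from the question
d <- data.frame(pid = rep(1:2, each = 6),
                x   = seq(10, 32, by = 2),
                y   = 2:13,
                z   = seq(-5, 0.5, by = 0.5))

# Hypothetical helper: draw individuals with replacement and keep all their rows
resample_panel <- function(d, id_col = "pid") {
  ids   <- unique(d[[id_col]])
  # sample.int avoids the surprise of sample() when only one id is present
  drawn <- ids[sample.int(length(ids), replace = TRUE)]
  blocks <- lapply(seq_along(drawn), function(i) {
    block <- d[d[[id_col]] == drawn[i], , drop = FALSE]
    block$boot_id <- i  # new id, so repeated draws of a person stay distinct
    block
  })
  do.call(rbind, blocks)
}

set.seed(123)  # arbitrary seed for repeatability
boot_samples <- replicate(4, resample_panel(d), simplify = FALSE)
str(boot_samples[[1]])
```

Each element of boot_samples is a data frame with as many individuals as the original, and boot_id can be used as the panel identifier in whatever model is fitted to a re-sample.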
