How to create range x values with basic R - r

I have just begun using R and have gone through multiple books and sources and they get more and more complex yet I still am unable to find a solution to what I think should be quite a basic process.
I have data with 3 columns, as shown below. (I am simplifying everything to try to get a clear answer that can be applied to multiple situations.)
min max value
1 5 23
8 15 9
33 35 30
I would like to plot this data on a graph.
By this I mean that every value between 1 and 5 on the x axis, for example, is equal to 23 on the y axis.
I have tried several things, including assigning the columns to vectors a, b, and c respectively,
and generating the correct number of y values with:
y <- rep(c, b - a + 1)
which works as expected.
The problem occurs with getting the appropriate x values. I tried:
x <- (a:b)
but the : operator only uses the first element of each vector, so it only produces the first range.
Now I can make this work by manually typing everything in like:
x <- c(1:5, 8:15, 33:35)
but I really need an automated way to do this because I am working with huge datasets of this structure.
I have seen that other people seem to have similar issues; however, the underlying principle always seems to be buried under vast datasets and entire scripts in the questions, so I have been unable to get to a good solution to this problem.
If anyone with a little more experience could clear up this issue I would be hugely grateful!

dat <- read.table(text=
"min max value
1 5 23
8 15 9
33 35 30",
header=TRUE)
I'm still not quite sure what you mean, but maybe:
newdat <- with(dat,data.frame(x=c(min,max),y=rep(value,2)))
newdat <- plyr::arrange(newdat,x)
plot(y~x,type="s",data=newdat)
It's not clear what you want to do between 5 and 8, 15 and 33 ... another possibility is to plot each bit as a separate segment:
plot(value~max,data=dat,xlim=range(c(dat$min,dat$max)),
     type="n",xlab="x")
# draw one horizontal segment per row, from min to max at height value
with(dat,segments(min,value,max,value))

How about this:
# your data.frame
df<-data.frame(min=c(1,8,33),max=c(5,15,35),value=c(23,9,30))
x<-unlist(apply(df,1,function(x)x[1]:x[2]))
y<-unlist(apply(df,1,function(x)rep(x[3],x[2]-x[1]+1)))
plotdata<-data.frame(x=x,y=y)
plotdata
x y
1 1 23
2 2 23
3 3 23
4 4 23
5 5 23
6 8 9
7 9 9
8 10 9
9 11 9
10 12 9
11 13 9
12 14 9
13 15 9
14 33 30
15 34 30
16 35 30
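The same x and y vectors can also be built without apply(). This is a minimal sketch, assuming the three-column layout from the question; rep() accepts a vector of counts, and Map() calls seq() once per row.

```r
# Data from the question
dat <- data.frame(min = c(1, 8, 33), max = c(5, 15, 35), value = c(23, 9, 30))

# One sequence per row (min[i] up to max[i]), flattened into a single vector
x <- unlist(Map(seq, dat$min, dat$max))

# Repeat each value once for every x in its range
y <- rep(dat$value, times = dat$max - dat$min + 1)

plotdata <- data.frame(x = x, y = y)
```

Unlike apply(), Map() always returns a list, so the result has the same shape even when every range happens to have the same length.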

Something like this?
a <- c(c(1:5), c(8:15), c(33:35))
b <- c(rep(23,5), rep(9,8), rep(30,3))
plot(a,b, type="l")

Related

R: Plot several lines in the same plot: ggplot + data tables or frames vs matrices

My general problem: I tend to struggle using ggplot, because it's very data-frame-centric but the objects I work with seem to fit matrices better than data frames. Here is an example (adapted a little).
I have a quantity x that can assume values 0:5, and a "context" that can have values 0 or 1. For each context I have 7 different frequency distributions over the quantity x. (More generally I could have more than two "contexts", more values of x, and more frequency distributions.)
I can represent these 7×2 frequency distributions as a list freqs of two matrices, say:
> freqs
$`context0`
x0 x1 x2 x3 x4 x5
sample1 20 10 10 21 37 2
sample2 34 40 6 10 1 8
sample3 52 4 1 2 17 25
sample4 16 32 25 11 5 10
sample5 28 2 10 4 21 35
sample6 22 13 35 12 13 5
sample7 9 5 43 29 4 10
$`context1`
x0 x1 x2 x3 x4 x5
sample1 15 21 14 15 14 21
sample2 27 8 6 5 29 25
sample3 13 7 5 26 48 0
sample4 33 3 18 11 13 22
sample5 12 23 40 11 2 11
sample6 5 51 2 28 5 9
sample7 3 1 21 10 63 2
or a 3D array.
Or I could use a data.table tablefreqs like this one:
> tablefreqs
context x0 x1 x2 x3 x4 x5
1: 0 20 10 10 21 37 2
2: 0 34 40 6 10 1 8
3: 0 52 4 1 2 17 25
4: 0 16 32 25 11 5 10
5: 0 28 2 10 4 21 35
6: 0 22 13 35 12 13 5
7: 0 9 5 43 29 4 10
8: 1 15 21 14 15 14 21
9: 1 27 8 6 5 29 25
10: 1 13 7 5 26 48 0
11: 1 33 3 18 11 13 22
12: 1 12 23 40 11 2 11
13: 1 5 51 2 28 5 9
14: 1 3 1 21 10 63 2
Now I'd like to draw the following line plot (there's a reason why I need line plots and not, say, histograms or bar plots):
The 7 frequency distributions for context 0, with x as x-axis and the frequency as y-axis, all in the same line plot (with some alpha).
The 7 frequency distributions for context 1, again with x as x-axis and the frequency as y-axis, all in the same line plot (with alpha), but displayed upside-down below the plot for context 0.
Ggplot would surely do this very nicely, but it seems to require some acrobatics with data tables:
– If I use the data table tablefreqs it's not clear to me how to plot all its rows having context==0 in the same plot: ggplot seems to only think column-wise, not row-wise. I could use the six values of x as table rows, but then the "context" values would also end up in a row, and I'm not sure I can subset a data table by values in a row, rather than in a column.
– If I use the matrix freqs, I could create a mini-data-table having x as one column and one frequency distribution as another column, input that into ggplot+geom_line, then go over all 7 frequency distributions in a for-loop maybe. Not clear to me how to tell ggplot to keep the previous plots in this case. Then another for-loop over the two "contexts".
I'd be grateful for suggestions on how to approach this problem in particular, and more generally on what objects to choose for storing this kind of data: matrices? data tables, maybe with a different structure than shown here? some other formats?
I would suggest familiarizing yourself with the concept known as Tidy Data: a set of principles for data handling and storage adopted by ggplot2 and a number of other packages.
You are free to use a matrix or a list of matrices to store your data; however, you can certainly store the data as you describe it (and as I understand it) in a data frame or single table with the following columns:
context | sample | x | freq
I'll show you how I would convert the tablefreqs dataset you shared with us into that format, then how I would go about creating a plot as you are describing it in your question. I'm assuming in this case you only have the two values for context, although you allude to there being more. I'm going to try to interpret correctly what you stated in your question.
Create the Tidy Data frame
Your data frame as shown contains columns x0 through x5, so the values for x are spread across more than one column, when you really need them in the format shown above. This is called "gathering" your data, and you can do that with tidyr::gather().
First, I also need to replicate the naming of your samples according to the matrix dataset, so I'll do that and gather your data:
library(dplyr)
library(tidyr)
library(ggplot2)
# create the sample names
tablefreqs$sample <- rep(paste0('sample',1:7), 2)
# gather the columns together
df <- tablefreqs %>%
  gather(key='x', value='freq', -c(context, sample))
Note that in the gather() function, we have to specify that the two columns context and sample should be left alone, as they are not part of the gathering effort. But now we are left with df$x containing character strings. We can't plot that, because we want them to be numbers (at least... I'm assuming you do). For that, we'll convert using:
df$x <- as.numeric(gsub("[^[:digit:].]", "", df$x))
That extracts the number from each value in df$x and represents it as a number, not a character. We have the opposite issue with df$context, which is actually a discrete factor, and we should represent it as such in order to make plotting a bit easier:
df$context <- factor(df$context)
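If you start from the list of matrices instead of tablefreqs, the same long layout can be built in base R. The following is only a sketch: matrix_to_long() is a hypothetical helper, not part of any package, and to keep it short it uses just the first two samples of each context from the question.

```r
# First two samples per context, copied from the freqs matrices in the question
dn <- list(c("sample1", "sample2"), paste0("x", 0:5))
freqs <- list(
  context0 = matrix(c(20, 10, 10, 21, 37, 2,
                      34, 40, 6, 10, 1, 8),
                    nrow = 2, byrow = TRUE, dimnames = dn),
  context1 = matrix(c(15, 21, 14, 15, 14, 21,
                      27, 8, 6, 5, 29, 25),
                    nrow = 2, byrow = TRUE, dimnames = dn)
)

# Hypothetical helper: one row per (sample, x) cell of the matrix
matrix_to_long <- function(m, context) {
  data.frame(
    context = context,
    sample  = rep(rownames(m), times = ncol(m)),
    x       = rep(as.numeric(sub("^x", "", colnames(m))), each = nrow(m)),
    freq    = as.vector(m)  # column-major order, matching the rep() calls above
  )
}

long <- do.call(rbind, Map(matrix_to_long, freqs, c(0, 1)))
long$context <- factor(long$context)
rownames(long) <- NULL
```

The result has the columns context, sample, x and freq, so it can be passed to ggplot() in the same way as df.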
Create the Plot
Now we're ready to create the plot. From your description, I may not have this perfectly right, but it seems that you want a plot containing both context = 1 and context = 0, and when context = 1 the data should be "upside down". By that, I'm assuming you are talking about plotting df$freq when df$context == 0 and -df$freq when df$context == 1. We could do that using some fancy logic in the ggplot() call, but I find it's easier just to create a new column in your dataset to represent what we will be plotting on the y axis. We'll call this column df$freq_adj and use that for plotting:
df$freq_adj <- ifelse(df$context==1, -df$freq, df$freq)
Then we create the plot. I'll explain a bit below the result:
ggplot(df, aes(x=x, y=freq_adj)) +
  geom_line(
    aes(color=context, linetype=sample)
  ) +
  geom_hline(yintercept=0, color='gray50') +
  scale_x_continuous(expand=expansion(mult=0)) +
  theme_bw()
Without some clearer description or picture of what you were looking to do, I took some liberties here. I used color to discriminate between the two values for context, and I'm using linetype to discriminate the different samples. I also added a line at 0, since it seemed appropriate to do so here, and the scale_x_continuous() command is removing the extra white space that is put in place at the extreme ends of the data.
An alternative that is maybe closer to your description would be to physically have a separation between the two plots, and represent context = 1 as a physically separate plot compared to context = 0, with one over top of the other.
Here's the code and plot:
ggplot(df, aes(x=x, y=freq_adj)) +
  geom_line(aes(group=sample), alpha=0.3) +
  facet_grid(context ~ ., scales='free_y') +
  scale_x_continuous(expand=expansion(mult=0)) +
  theme_bw()
There the use of aes(group=sample) is quite important, since I want all the lines for each sample to be the same (alpha setting and color), yet ggplot2 needs to know that the connections between the points should be based on "sample". This is done using the group= aesthetic. The scales='free_y' argument on facet_grid() allows the y axis scale to shrink and fit the data according to each facet.
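Since the question mentions that matrices feel like a more natural fit, here is a rough base-graphics sketch that works on the matrices directly. It is an alternative to the ggplot2 approach above, not part of it; matplot() draws one line per matrix column, so each matrix is transposed first. Only the first two samples of each context from the question are used.

```r
x <- 0:5
dn <- list(c("sample1", "sample2"), paste0("x", x))

# First two samples per context, copied from the question
m0 <- matrix(c(20, 10, 10, 21, 37, 2,
               34, 40, 6, 10, 1, 8),
             nrow = 2, byrow = TRUE, dimnames = dn)
m1 <- matrix(c(15, 21, 14, 15, 14, 21,
               27, 8, 6, 5, 29, 25),
             nrow = 2, byrow = TRUE, dimnames = dn)

y0 <- t(m0)    # one column per sample, context 0
y1 <- -t(m1)   # context 1, mirrored below the x axis

ylim <- c(-1, 1) * max(m0, m1)
line_col <- adjustcolor("black", alpha.f = 0.3)  # semi-transparent lines

matplot(x, y0, type = "l", lty = 1, col = line_col,
        ylim = ylim, xlab = "x", ylab = "frequency")
matlines(x, y1, lty = 1, col = line_col)
abline(h = 0, col = "gray50")
```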

How to make data in a single column (long) with multiple, nested group categories wide

I've got a mess of data and am trying to efficiently wrangle it into shape. Here's a simplified short sample of the general format of my data.frame right now. The main difference is that I have a few more data labels like Label1 for my sampling units - each has a set of data similar to the data.frame I'm including but in my situation they are all in the same data.frame. I don't think that will complicate the reformatting so I've just included the single sampling unit of mock data here. StatsType levels Ave, Max, and Min are effectively nested within MeasureType.
tastycheez<-data.frame(
Day=rep((1:3),9),
StatsType=rep(c(rep("Ave",3),rep("Max",3),rep("Min",3)),3),
MeasureType=rep(c("Temp","H2O","Tastiness"),each=9),
Data_values=1:27,
Label1=rep("SamplingU1",27))
Ultimately, I would like a data frame where for each sampling unit and each Day there are columns holding the Data_values for my categories, like this:
Day Label1 Ave.Temp Ave.H2O Ave.Tastiness Max.Temp ...
1 SamplingU1 1 10 19 4 ...
2 SamplingU1 2 11 20 5 ...
I think some combination of functions from reshape, dplyr, tidyr, and/or data.table could do the job, but I can't figure out how to code it. Here's what I've tried:
First, I spread the tastycheez (yum!), and that got me partway:
test<-spread(tastycheez,StatsType,Data_values)
Now I'm trying to spread it again or to cast, but with no luck:
test2<-spread(test,MeasureType,(Ave,Max,Min))
test2 <- recast(Day ~ MeasureType+c(Ave,Max,Min), data=test)
(I also tried melting the tastycheez but the results were a sticky, gooey mess and my tongue got burnt. that doesn't seem to be the right function for this.)
If you hate my puns please excuse them, I really can't figure this out!
Here are a couple related questions:
Combining two subgroups of data in the same dataframe
How can I spread repeated measures of multiple variables into wide format?
With reshape2, you could use dcast:
library(reshape2)
dcast(tastycheez,
Day + Label1 ~ paste(StatsType, MeasureType, sep="."),
value.var = "Data_values")
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
With tidyr (stealing @DavidArenburg's comment), here's the tidyr way:
library(tidyr)
tastycheez %>%
unite(temp, StatsType, MeasureType, sep = ".") %>%
spread(temp, Data_values)
which gives
Day Label1 Ave.H2O Ave.Tastiness Ave.Temp Max.H2O Max.Tastiness Max.Temp Min.H2O Min.Tastiness Min.Temp
1 1 SamplingU1 10 19 1 13 22 4 16 25 7
2 2 SamplingU1 11 20 2 14 23 5 17 26 8
3 3 SamplingU1 12 21 3 15 24 6 18 27 9
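For completeness, the same wide layout can be sketched in base R with reshape(). This is an alternative to the two answers above rather than part of them; the helper column key is a name chosen here for illustration, and the columns come out in order of first appearance rather than alphabetically.

```r
# Data from the question
tastycheez <- data.frame(
  Day = rep(1:3, 9),
  StatsType = rep(c(rep("Ave", 3), rep("Max", 3), rep("Min", 3)), 3),
  MeasureType = rep(c("Temp", "H2O", "Tastiness"), each = 9),
  Data_values = 1:27,
  Label1 = rep("SamplingU1", 27))

# Combine the two nested categories into a single key, e.g. "Ave.Temp"
tastycheez$key <- paste(tastycheez$StatsType, tastycheez$MeasureType, sep = ".")

# One row per Day/Label1, one column per key
wide <- reshape(tastycheez[, c("Day", "Label1", "key", "Data_values")],
                idvar = c("Day", "Label1"), timevar = "key",
                direction = "wide")

# reshape() prefixes the new columns with "Data_values."; strip that prefix
names(wide) <- sub("^Data_values\\.", "", names(wide))
```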

When printing large data frames to console, R starts working unbearably slowly - how can I fix it?

I'm running R on a Mac (OS X). I have a rather large data frame (imported from a csv-file) that I'm working with:
dim(mydf)
[1] 75848 9
I'm trying to analyse it and find ways of breaking it up into smaller parts, so I need to print at least parts of it out to get an overview from time to time.
However, when I have printed it, R (version 3.1.2) starts working extremely slowly to the point where I just have to give up and restart it. Then R works normally until I have printed something large to the console again.
I have tried gc() and rm(list = ls()), but it doesn't improve the speed. I guess it wouldn't, as it seems to be the printing to the console and not the size of the data frame that causes the slowness (clogging up memory?).
Is there anything I can do to prevent R from becoming so slow, or do I just have to choose between restarting frequently or giving up printing my data to the console?
Thanks!
Same as you, I wanted to get an overview of my data, but a little more than the head function would give me. So I wrote a small function that gives me the head, middle, and tail of a data set.
hmt <- function(x) { # head, middle, tail of a data frame
  # inherits() also accepts subclasses such as tibbles or data.tables
  stopifnot(inherits(x, "data.frame"))
  mid <- round(nrow(x) * 0.5)
  middle <- x[(mid - 3):(mid + 3), ]
  rbind(head(x), middle, tail(x))
}
hmt(cars)
And the result:
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
22 14 60
23 14 80
24 15 20
25 15 26
26 15 54
27 16 32
28 16 40
45 23 54
46 24 70
47 24 92
48 24 93
49 24 120
50 25 85
Hope this is of any help to you.
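A few base R calls give a similar overview without printing every row, and options(max.print = ...) caps how much print() will emit. The sketch below may help if the sheer volume of console output is the cause, which the question suspects but does not confirm; mydf here is a stand-in built from the built-in cars data, not the original data frame.

```r
# Stand-in for the large data frame: the built-in cars data repeated 100 times
mydf <- cars[rep(seq_len(nrow(cars)), times = 100), ]
rownames(mydf) <- NULL

dim(mydf)                       # size, without printing any rows
str(mydf)                       # one line per column: type and first values
head(mydf, 10)                  # first 10 rows only
mydf[sample(nrow(mydf), 10), ]  # 10 random rows

# Cap the number of entries print() will emit, then restore the old setting
old <- options(max.print = 200)
n_lines <- length(capture.output(print(mydf)))
options(old)
n_lines                         # far fewer lines than nrow(mydf)
```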

Frequency distribution with custom format data

I need help with an R plot, with a data format I have not worked with before. Please help if you know.
NUMBER FREQUENCY
10 1
11 1
12 3
10 45
11 2
12 3
I need a bar plot with the numbers on the x axis (continuous, not bins as in a histogram) and the frequency on the y axis, with the frequencies for each number combined,
like
10 46
11 3
12 6
it seems simple enough, but i have 10,000 rows and large numbers in real data so I am looking for a good solution in R without doing it manually.
What about:
## tapply splits dd$FREQUENCY by dd$NUMBER and sums each group
barplot(tapply(dd$FREQUENCY, dd$NUMBER, sum))
This assumes your data has been read in as dd:
dd = read.table(textConnection("NUMBER FREQUENCY
10 1
11 1
12 3
10 45
11 2
12 3"), header=TRUE)
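If you also want the combined table itself and not only the plot, aggregate() is one option. This is a small sketch using the same six rows as above; the name totals is chosen here for illustration.

```r
dd <- read.table(text = "NUMBER FREQUENCY
10 1
11 1
12 3
10 45
11 2
12 3", header = TRUE)

# Sum FREQUENCY within each distinct NUMBER
totals <- aggregate(FREQUENCY ~ NUMBER, data = dd, FUN = sum)
totals

# The same totals as a bar plot, labelled by NUMBER
barplot(totals$FREQUENCY, names.arg = totals$NUMBER)
```

For the sample data, totals contains 46, 3 and 6 for the numbers 10, 11 and 12, matching the desired result in the question.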

How to reorder a column in a data frame to be the last column

I have a data frame where columns are constantly being added to it. I also have a total column that I would like to stay at the end. I think I must have skipped over some really basic command somewhere but cannot seem to find the answer anywhere. Anyway, here is some sample data:
x=1:10
y=21:30
z=data.frame(x,y)
z$total=z$x+z$y
z$w=11:20
z$total=z$x+z$y+z$w
When I type z I get this:
x y total w
1 1 21 33 11
2 2 22 36 12
3 3 23 39 13
4 4 24 42 14
5 5 25 45 15
6 6 26 48 16
7 7 27 51 17
8 8 28 54 18
9 9 29 57 19
10 10 30 60 20
Note how the total column comes before the w, and obviously any subsequent columns. Is there a way I can force it to be the last column? I am guessing that I would have to use ncol(z) somehow. Or maybe not.
You can reorder your columns as follows:
z <- z[,c('x','y','w','total')]
To do this programmatically, after you're done adding your columns, you can retrieve their names like so:
nms <- colnames(z)
Then you can grab the ones that aren't 'total' like so:
nms[nms!='total']
Combined with the above:
z <- z[, c(nms[nms!='total'],'total')]
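The same idea can be wrapped in a small function so it can be reapplied each time a column is added. move_last() is a hypothetical helper name, and setdiff() is used here in place of the logical subsetting above.

```r
# Hypothetical helper: move one named column to the end of a data frame
move_last <- function(df, col) {
  df[, c(setdiff(names(df), col), col), drop = FALSE]
}

# Sample data from the question
z <- data.frame(x = 1:10, y = 21:30)
z$total <- z$x + z$y
z$w <- 11:20
z$total <- z$x + z$y + z$w

z <- move_last(z, "total")
names(z)  # "x" "y" "w" "total"
```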
You have a logic issue here. Whenever you add to a data.frame, it grows to the right.
Easiest fix: keep total a vector until you are done, and only then append it. It will then be the rightmost column.
(For critical applications, you would of course determine your width k beforehand, allocate k+1 columns and just index the last one for totals.)
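As a sketch of the "append total last" approach, the total can be computed in one step once all other columns are in place. This assumes every other column is numeric and should be included in the sum.

```r
z <- data.frame(x = 1:10, y = 21:30)
z$w <- 11:20   # keep adding columns as needed

# Compute the total only at the end, so it lands in the rightmost position
z$total <- rowSums(z[, setdiff(names(z), "total"), drop = FALSE])
```

Because setdiff() excludes any existing total column, the last line can be rerun after further columns are added. In that case total is updated in place, so it would still need to be moved to the end.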
