How to look at a specific subset of a dataset in R

I have a dataset that looks like:
foo bar
23 0
72 1
41 1
32 2
21 1
21 1
I want to draw a Q-Q plot and a histogram of the distribution of foo where bar equals 1. How would I do that?
I know about plot and qqnorm for the Q-Q plot, and about hist for the histogram.

Simply subset, as the other answer suggested.
> subset(df, bar==1)
or, in one line for the hist function (hist needs a numeric vector, so pull out the foo column):
> hist(subset(df, bar==1)$foo)

Just get all rows with bar==1. The following should work:
df1 = df[df$bar==1,]
df1
foo bar
2 72 1
3 41 1
5 21 1
6 21 1
plot(df1$foo, df1$bar)
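Putting the subsetting idea together with the plots the question asks for, a minimal base R sketch might look like the following. The data frame is rebuilt from the question's sample; the name foo1 is introduced here only for illustration.

```r
# Sample data copied from the question
df <- data.frame(foo = c(23, 72, 41, 32, 21, 21),
                 bar = c(0, 1, 1, 2, 1, 1))

# Keep only the foo values whose bar equals 1 (foo1 is an illustrative name)
foo1 <- df$foo[df$bar == 1]

# Histogram of the subset
hist(foo1, main = "foo where bar == 1", xlab = "foo")

# Normal Q-Q plot of the subset, with a reference line
qqnorm(foo1)
qqline(foo1)
```

With only four matching rows the plots are not very informative, but the same three calls apply unchanged to a larger dataset.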

Related

Select the same name with different [number]

I have column names like the following plot
Can I select all of the alpha columns at once instead of typing alpha[1], alpha[2]...alpha[9]?
How should I change the following code so that R returns results for every alpha?
t_alpha <- mcmc_trace(mcmc,pars="alpha")
Something like this perhaps?
library(dplyr)
library(magrittr)
df %>% select(matches("^alpha"))
# alpha.1. alpha.10.
# 1 55 43
# 2 97 20
# 3 80 84
# 4 24 60
# 5 27 21
# 6 98 70
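The same pattern match can be sketched in base R without dplyr. The toy data frame and its column names below are assumptions modelled on the question, not the asker's real data.

```r
# Toy data frame; the column names are assumptions modelled on the question
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
names(df) <- c("alpha[1]", "alpha[2]", "beta[1]", "sigma")

# Names that start with "alpha[" (the bracket is escaped because it is a regex metacharacter)
alpha_cols <- grep("^alpha\\[", names(df), value = TRUE)

# Keep only those columns; drop = FALSE keeps a data frame even for a single match
df_alpha <- df[, alpha_cols, drop = FALSE]
print(alpha_cols)
```

If mcmc_trace here is the function from the bayesplot package, its regex_pars argument may accept the same kind of pattern directly; that is worth checking against the package documentation.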

Dividing all possible rows within a given sub-data in R

My data looks like this:
set <- c(1,1,1,2,2,3,3,3,3,3,4,4)
density <- c(1,3,3,1,3,1,1,1,3,3,1,3)
counts <- c(100,2,4,76,33,12,44,13,54,36,65,1)
data <- data.frame(set,density,counts)
data$set <- as.factor(data$set)
data$density <- as.factor(data$density)
Within a given set there are two levels of density, "1" or "3". For a given set, I want to divide the counts of every density "1" row by the counts of every density "3" row, for all possible combinations. I then want to print the original counts associated with density "1", the ratio, and the set.
For example, the result for the first few rows should look like:
set counts ratio
1 100 50 #100/2
1 100 25 #100/4
2 76 2.3 #76/33
3 12 0.22 #12/54
3 12 0.33 #12/36
3 44 0.8148 #44/54
...
I thought I could achieve it with dplyr, but it seems a little too complicated for dplyr.
It looks like the comments get you most of the way there. Here's a dplyr solution. With left_join, each of the density 1 rows gets matched up with all density 3 rows in the same set, providing output in line with your specification.
# Edited below to use dplyr syntax; my base syntax had a typo
library(dplyr)
data_combined <- data %>% filter(density == 1) %>%
# Match each 1 w/ each 3 in the set
left_join(data %>% filter(density == 3), by = "set") %>%
mutate(ratio = counts.x / counts.y) %>%
select(set, counts.x, counts.y, ratio)
data_combined
# set counts.x counts.y ratio
#1 1 100 2 50.0000000
#2 1 100 4 25.0000000
#3 2 76 33 2.3030303
#4 3 12 54 0.2222222
#5 3 12 36 0.3333333
#6 3 44 54 0.8148148
#7 3 44 36 1.2222222
#8 3 13 54 0.2407407
#9 3 13 36 0.3611111
#10 4 65 1 65.0000000
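For comparison, the same pairing can be sketched in base R with merge(), which also matches every density 1 row with every density 3 row that shares a set. The names ones, threes, and combos, and the suffixes .d1 and .d3, are illustrative choices rather than anything from the original answer.

```r
# Data copied from the question
set <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4, 4)
density <- c(1, 3, 3, 1, 3, 1, 1, 1, 3, 3, 1, 3)
counts <- c(100, 2, 4, 76, 33, 12, 44, 13, 54, 36, 65, 1)
data <- data.frame(set, density, counts)

# Split into numerator rows (density 1) and denominator rows (density 3)
ones <- data[data$density == 1, c("set", "counts")]
threes <- data[data$density == 3, c("set", "counts")]

# Joining on "set" yields one row per (density 1, density 3) pair within a set
combos <- merge(ones, threes, by = "set", suffixes = c(".d1", ".d3"))
combos$ratio <- combos$counts.d1 / combos$counts.d3
print(combos)
```

The row order within a set may differ from the dplyr output, so sort explicitly if the order matters.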

Printing only certain panels in R lattice

I am drawing a quantile-quantile plot of some data. I would like to print only the panels that satisfy a condition I put in the panel function around panel.qq(x,y,...).
Here is an example. This is my code:
qq(y ~ x|cond,data=test.df,panel=function(x,y,subscripts,...){
if(length(unique(test.df[subscripts,2])) > 3 ){panel.qq(x,y,subscripts,...)}})
Here y is the factor and x is the variable whose quantiles are plotted on the x and y axes; cond is the conditioning variable. What I would like is to print only those panels that pass the condition in the panel function, which is
if(length(unique(test.df[subscripts,2])) > 3).
I hope this information helps. Thanks in advance.
Added Sample data,
y x cond
1 1 6 125
2 2 5 125
3 1 5 125
4 2 6 125
5 1 3 125
6 2 8 125
7 1 8 125
8 2 3 125
9 1 5 125
10 2 6 125
11 1 5 124
12 2 6 124
13 1 6 124
14 2 5 124
15 1 5 124
16 2 6 124
17 1 4 124
18 2 7 124
19 1 0 123
20 2 11 123
21 1 0 123
22 2 11 123
23 1 0 123
24 2 11 123
25 1 0 123
26 2 11 123
27 1 0 123
28 2 2 123
So this is the sample data. I would like no panel for 123, because 123 has only 3 unique values of x, while the others have 4. Thanks again.
Yeah, I think it is a subset problem, not a lattice one. You don't include an example, but it looks like you want to keep only rows where there are more than 3 rows for each value of whatever is in column 2 of your data frame. If so, here is a data.table solution.
library(data.table)
test.dt <- as.data.table(test.df)
test.dt.subset <- test.dt[,N:=.N,by=c2][N>3]
Where c2 is that variable in the second column. The last line of code first adds a variable, N, for the count of rows (.N) for each value of c2, then subsets for N>3.
UPDATE: And since a data table is also a data frame, you can use test.dt.subset directly as the data source in the call to qq (or other lattice function).
UPDATE 2: Here is one way to do the same thing without data.table:
d <- data.frame(x=1:15,y=1:15%%2, # example data frame
c2=c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
d$N <- 1 # create a column for count
split(d$N,d$c2) <- lapply(split(d$x,d$c2),length) # populate with count
d
d[d$N>3,] # subset
I did something very similar to DaveTurek.
My sample dataframe above is test.df
test.df.list <- split(test.df,test.df$cond,drop=F)
final.test.df <- do.call("rbind",lapply(test.df.list,function(r){
if(length(unique(r$x)) > 3){r}}))
So, here I am splitting test.df into a list of data frames by the conditioning variable. Next, inside lapply, I check the number of unique values in each subset data frame. If this number is greater than 3, the data frame is returned; if not, it is dropped (the function returns NULL, which rbind ignores). Finally, do.call binds all the data frames back into one big data frame to run the quantile-quantile plot on.
In case anyone wants to know the qq function call after getting the specific data, here it is:
trellis.device(postscript,file="test.ps",color=F,horizontal=T,paper='legal')
qq(y ~ x|cond,data=final.test.df,layout=c(1,1),pch=".",cex=3)
dev.off()
Hope this helps.
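The same filter can also be sketched without splitting and re-binding, by using ave() to attach the per-group count of unique x values to every row. The data frame below is rebuilt from the sample data in the question; n_unique and keep are illustrative names.

```r
# Sample data rebuilt from the question
test.df <- data.frame(
  y = rep(1:2, 14),
  x = c(6, 5, 5, 6, 3, 8, 8, 3, 5, 6,
        5, 6, 6, 5, 5, 6, 4, 7,
        0, 11, 0, 11, 0, 11, 0, 11, 0, 2),
  cond = rep(c(125, 124, 123), times = c(10, 8, 10))
)

# For each row, the number of distinct x values within that row's cond group
n_unique <- ave(test.df$x, test.df$cond, FUN = function(v) length(unique(v)))

# Keep only the groups with more than 3 distinct x values
keep <- test.df[n_unique > 3, ]
print(unique(keep$cond))
```

The resulting keep data frame can then be passed as the data argument of qq in place of final.test.df.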

How to make a spaghetti plot in R?

I have the following:
head(dataframe):
ID Result Days
1 70 0
1 80 23
2 90 15
2 89 30
2 99 40
3 23 24
etc...
What I am trying to do is create a spaghetti plot with the above dataset. What I use is this:
interaction.plot(dataframe$Days,dataframe$ID,dataframe$Result,xlab="Time",ylab="Results",legend=F)
but none of the patient lines are continuous, even when they are supposed to be one long line.
Also, I want to convert the above data frame to something like this:
ID Result Days
1 70 0
1 80 23
2 90 0
2 89 15
2 99 25
3 23 0
etc...
(I am trying to take the first (or minimum) day of each ID and have each patient's days start from zero and count up.) Also, in the spaghetti plot I want all patients to have the same color if a condition is met, and another color if the condition is not met.
Thank you for your time and patience.
How about this, using ggplot2 and data.table
# libs
library(ggplot2)
library(data.table)
# your data
df <- data.table(ID=c(1,1,2,2,2,3),
Result=c(70,80,90,89,99,23),
Days=c(0,23,15,30,40,24))
# adjust each ID to start at day 0, sort
df <- merge(df, df[, list(min_day=min(Days)), by=ID], by='ID')
df[, adj_day:=Days-min_day]
df <- df[order(ID, Days)]
# plot
ggplot(df, aes(x=adj_day, y=Result, color=factor(ID))) +
geom_line() + geom_point() +
theme_bw()
Contents of updated data.frame (actually a data.table):
ID Result Days min_day adj_day
1 70 0 0 0
1 80 23 0 23
2 90 15 15 0
2 89 30 15 15
2 99 40 15 25
3 23 24 24 0
You can handle the color coding easily using scale_color_manual()
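As a rough sketch of that color coding, the example below colors each patient's line by whether a condition holds. The condition itself (a patient's maximum Result reaching 80) and the column name met are invented here purely for illustration, since the question does not say what the condition is. The day adjustment is done with base R's ave() rather than data.table, to keep the sketch short.

```r
library(ggplot2)

# Data copied from the question
df <- data.frame(ID = c(1, 1, 2, 2, 2, 3),
                 Result = c(70, 80, 90, 89, 99, 23),
                 Days = c(0, 23, 15, 30, 40, 24))

# Shift each patient's days so that the first visit is day 0
df$adj_day <- df$Days - ave(df$Days, df$ID, FUN = min)

# Hypothetical per-patient condition: the maximum Result is at least 80
df$met <- ave(df$Result, df$ID, FUN = max) >= 80

# One line per patient (group = ID), colored by the condition
p <- ggplot(df, aes(x = adj_day, y = Result, group = ID, color = met)) +
  geom_line() + geom_point() +
  scale_color_manual(values = c("TRUE" = "steelblue", "FALSE" = "firebrick"),
                     name = "Condition met") +
  theme_bw()
print(p)
```

Mapping group to ID keeps one line per patient, while mapping color to the logical column gives every patient who meets the condition the same color.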

adding labels and colour to different points on a graph using R

Happy new year to you all!
I am plotting some graphs and would like to differentiate some plotted lines and points. This is an example of my data and the graph that I am trying to get:
anim <- c(1,2,3,4,5)
var1 <- c(32,36,40,38,39)
var2 <- c(30,31,34,36,38)
surv <- c(0,1,0,1,1)
mydf <- data.frame(anim,var1,var2,surv)
mydf
anim var1 var2 surv
1 1 32 30 0
2 2 36 31 1
3 3 40 34 0
4 4 38 36 1
5 5 39 38 1
lm.pos1 <- lm(var1~var2,data=mydf)
plot(mydf$var2,mydf$var1,xlab="ave.ear",ylab="rtemp",xlim=c(25,45),ylim=c(25,45))
abline(lm.pos1)
abline(h=37.6,v=0,col="gray10",lty=20)
abline(h=34,v=0,col="gray10",lty=20)
First, I would like to insert the label "37.6°C" on the top horizontal and continuous line and "34.0°C" on the bottom horizontal and broken line.
Second, I would like to colour those individuals (circles) as red if surv=0 (died) or green if surv=1.
Any help would be very much appreciated!
Baz
plot(mydf$var2, mydf$var1, xlab="ave.ear", ylab="rtemp",
xlim=c(25,45), ylim=c(25,45), col=c('red', 'green')[mydf$surv+1]) # surv=0 -> red, surv=1 -> green
abline(lm.pos1)
abline(h=37.6,v=0,col="gray10",lty=20)
text(25,38.1,parse(text='37.6*degree*C'),col='gray10')
abline(h=34,v=0,col="gray10",lty=20)
text(25,34.5,parse(text='"34.0"*degree*C'),col='gray10')
