Looping over dataframe to create scatterplots - r

Data frame
x <- data.frame(id = c("A","B","C"), x_predictor = c(5,6,7), x_depended = c(5.5, 6.5, 7.5), y_predictor=c(2,3,2), y_depended=c(3,3,2), z_predictor=c(12,10,12), z_depended=c(14,11,13))
> x
  id x_predictor x_depended y_predictor y_depended z_predictor z_depended
1  A           5        5.5           2          3          12         14
2  B           6        6.5           3          3          10         11
3  C           7        7.5           2          2          12         13
I would like to create a scatterplot for each level of id and for each pair of dependent and predictor columns.
I have created a for loop that loops over the unique levels of id, but how can I also loop over the pairs of dependent and predictor columns?
uni <- unique(x$id)
for (p in uni){
print(ggplot(x[x$id == p, ], aes(y = x_depended, x = x_predictor)) + geom_point())
}
I would like to plot dependent vs. predictor. The dependent column always directly follows its predictor column.

This code creates three scatter plots, one for each predictor/dependent column pair in your data frame.
library(ggplot2)
x_plots <- list()
uni <- unique(x$id)
uni_counter <- 0
i <- 0
for (colnum in seq(2, 6, 2)) {
  x_col <- names(x)[colnum]
  y_col <- names(x)[colnum + 1]
  # Increment our counters first: R vectors are 1-indexed, so uni[0] would be empty.
  uni_counter <- uni_counter + 1
  i <- i + 1
  # Retrieve the current uni.
  curr_uni <- uni[uni_counter]
  # Create the ggplot command,
  # the command is created dynamically so that we can iterate through
  # different columns in our data frame.
  # Note the comma in x[x$id == curr_uni, ] (row subset) and x_col on the x axis.
  ggplot_cmd <- paste0("x_plots[[i]] <- ggplot(x[x$id == curr_uni, ], aes(y = ", y_col, ", x = ", x_col, ")) + geom_point()")
  # Evaluate each plot.
  eval(parse(text = ggplot_cmd))
}
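If you want one plot for every combination of id and column pair (3 ids x 3 pairs here), a nested loop is one way to do it. The following is only a sketch: make_scatter is a hypothetical helper name, and it assumes a ggplot2 version that supports the .data pronoun inside aes(). Wrapping the ggplot() call in a function keeps the column names fixed for each plot, which avoids the usual problem of every plot in a loop ending up with the last column pair.

```r
library(ggplot2)

x <- data.frame(id = c("A", "B", "C"),
                x_predictor = c(5, 6, 7), x_depended = c(5.5, 6.5, 7.5),
                y_predictor = c(2, 3, 2), y_depended = c(3, 3, 2),
                z_predictor = c(12, 10, 12), z_depended = c(14, 11, 13))

# Hypothetical helper: builds one scatter plot from column names given as strings.
make_scatter <- function(data, x_col, y_col, title) {
  force(x_col); force(y_col)  # fix the names now, not when the plot is printed
  ggplot(data, aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_point() +
    labs(x = x_col, y = y_col, title = title)
}

x_plots <- list()
predictor_cols <- grep("_predictor$", names(x), value = TRUE)
for (p in unique(x$id)) {
  for (x_col in predictor_cols) {
    # The dependent column directly follows its predictor column.
    y_col <- names(x)[match(x_col, names(x)) + 1]
    key <- paste(p, x_col, sep = "_")
    x_plots[[key]] <- make_scatter(x[x$id == p, ], x_col, y_col, title = key)
  }
}
length(x_plots)  # 9 plots: 3 ids x 3 column pairs
```

Each element of x_plots can be printed on its own with print(x_plots[[1]]).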
You can then load the multiplot() function posted here to draw all the generated plots in one figure using:
multiplot(plotlist = x_plots)
Hope this helps.

Related

R creating 5 variables with column name suffix 1 through 5 using loop

I am trying to create an iterative function in R using a loop or array, which will create three variables and three data frames with the same 1-3 suffix. My current code is:
function1 <- function(b1,lvl1,lvl2,lvl3,b2,x) {
lo1 <- exp(b1*lvl1 + b2*x)
lo2 <- exp(b1*lvl2 + b2*x)
lo3 <- exp(b1*lvl3 + b2*x)
out1 <- t(c(lvl1,lo1))
out2 <- t(c(lvl2,lo2))
out3 <- t(c(lvl3,lo3))
out <- rbind(out1, out2, out3)
colnames(out) <- c("level","risk")
return(out)
}
function1(.18, 1, 2, 3, .007, 24)
However, I would like to iterate the same line of code three times to create lo1, lo2, lo3, and out1, out2 and out3. The syntax below is completely wrong because I don't know how to use two arguments in a for-loop, or nest a for loop within a function, but as a rough idea:
function1 <- function(b1,b2,x) {
for (i in 1:3) {
loi <- exp(b1*i + b2*x)
return(lo[i])
outi <- t(c(i, loi)
return(out[i])
}
out <- rbind(out1, out2, out3)
colnames(out) <- c("level","risk")
return(out)
}
function1(.18,.007,24)
The output should look like:
level risk
1 1.42
2 1.70
3 2.03
In R, explicit for loops are often slower and more verbose than vectorized code, especially when they grow an object one element at a time. A good practice is to vectorize as much as possible and to use the functions from the apply family where a loop is really needed. Here are some discussions about this.
For your work, you can simply do it with a data frame. Here is an example:
# The function
function1 <- function(b1,b2,level,x) {
# Create the dataframe with the level column
df = data.frame("level" = level)
# Add the risk column
df$risk = exp(b1*df$level + b2*x)
return(df)
}
# Your variables
b1 = .18
b2 = .007
level = c(1,2,3)
# Your process
function1(b1, b2, level, 24)
# level risk
# 1 1 1.416232
# 2 2 1.695538
# 3 3 2.029927
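If you still want the explicit loop from the question, here is a minimal sketch of how it could look. function1_loop is a hypothetical name; the idea is to preallocate the result and fill one row per iteration instead of creating separate lo1, lo2, lo3 variables and calling return() inside the loop.

```r
# Loop version: preallocate the result matrix, then fill one row per level.
function1_loop <- function(b1, b2, levels, x) {
  out <- matrix(NA_real_, nrow = length(levels), ncol = 2,
                dimnames = list(NULL, c("level", "risk")))
  for (i in seq_along(levels)) {
    out[i, ] <- c(levels[i], exp(b1 * levels[i] + b2 * x))
  }
  out  # a single return value, after the loop has finished
}

res <- function1_loop(.18, .007, 1:3, 24)
round(res, 2)
```

This gives the same numbers as the vectorized version above; the vectorized version is still the shorter way to write it.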

Avoiding empty and small groups when using pretty_breaks with cut2

I'm working with variables resembling the val vector created below:
# data --------------------------------------------------------------------
data("mtcars")
val <- c(mtcars$wt, 10.55)
I'm cutting this variable in the following manner:
# Cuts --------------------------------------------------------------------
cut_breaks <- pretty_breaks(n = 10, eps.correct = 0)(val)
res <- cut2(x = val, cuts = cut_breaks)
which produces the following results:
> table(res)
res
[ 1, 2) [ 2, 3) [ 3, 4) [ 4, 5) [ 5, 6)       6       7       8       9 [10,11]
      4       8      16       1       3       0       0       0       0       1
In the created output I would like to change the following:
I'm not interested in creating groups with one value. Ideally, I would like each group to have at least 3 or 4 values. Paradoxically, I can live with groups having 0 values, as those will be dropped later on when merging with my real data.
Any changes to the cutting mechanism have to work on a variable with integer values.
The cuts have to be pretty. I'm trying to avoid something like 1.23 - 2.35, even if those values would be most sensible considering the distribution.
In effect, what I'm trying to achieve is this: make more or less even, pretty groups, and if a really tiny group appears, merge it with the next group; do not worry about empty groups.
Full code
For convenience, the full code is available below:
# Libs --------------------------------------------------------------------
Vectorize(require)(package = c("scales", "Hmisc"),
character.only = TRUE)
# data --------------------------------------------------------------------
data("mtcars")
val <- c(mtcars$wt, 10.55)
# Cuts --------------------------------------------------------------------
cut_breaks <- pretty_breaks(n = 10, eps.correct = 0)(val)
res <- cut2(x = val, cuts = cut_breaks)
What I've tried
First approach
I tried to play with the eps.correct = 0 value in the pretty_breaks like in the code:
cut_breaks <- pretty_breaks(n = cuts, eps.correct = 0)(variable)
but none of the values gets me anywhere close.
Second approach
I've also tried using the m= 5 argument in the cut2 function but I keep on arriving at the same result.
Comment replies
My breaks function
I tried the mybreaks function, but I would have to put some work into it to get nice cuts for more bizarre variables. Broadly speaking, pretty_breaks cuts well for me; it is just the tiny groups that occur from time to time that are not desired.
> set.seed(1); require(scales)
> mybreaks <- function(x, n, r=0) {
+ unique(round(quantile(x, seq(0, 1, length=n+1)), r))
+ }
> x <- runif(n = 100)
> pretty_breaks(n = 5)(x)
[1] 0.0 0.2 0.4 0.6 0.8 1.0
> mybreaks(x = x, n = 5)
[1] 0 1
You could use the quantile() function as a relatively easy way to get similar numbers of observations in each of your groups.
For example, here's a function that takes a vector of values x, a desired number of groups n, and a desired rounding off point r for the breaks, and gives you suggested cut points.
mybreaks <- function(x, n, r=0) {
unique(round(quantile(x, seq(0, 1, length=n+1)), r))
}
cut_breaks <- mybreaks(val, 5)
res <- cut(val, cut_breaks, include.lowest=TRUE)
table(res)
 [2,3]  (3,4] (4,11]
     8     16      5
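Note that the table above counts 29 of the 33 values: the lowest break is the rounded minimum (2), so weights below 2 fall outside the breaks and become NA. A possible variant, sketched below, pushes the outer breaks out to the floor and ceiling of the data so every value lands in a group. covering_breaks is a hypothetical name, and this is only one way to handle the edges.

```r
# Quantile-based breaks whose outer limits always cover the whole range of x.
covering_breaks <- function(x, n, r = 0) {
  inner <- round(quantile(x, seq(0, 1, length.out = n + 1)), r)
  scale <- 10^r
  lo <- floor(min(x) * scale) / scale    # round the lower limit down
  hi <- ceiling(max(x) * scale) / scale  # round the upper limit up
  sort(unique(c(lo, inner[inner > lo & inner < hi], hi)))
}

data("mtcars")
val <- c(mtcars$wt, 10.55)
cut_breaks <- covering_breaks(val, 5)
res <- cut(val, cut_breaks, include.lowest = TRUE)
table(res, useNA = "ifany")
```

The breaks stay at round numbers because r = 0 is kept, and no observation is dropped.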

Assess each number in a column in dataframe with while loop

Let's say I have the code below:
data = data.frame(x=numeric(), y=numeric(), z=numeric(), ans=numeric())
x = rnorm(10,0,.01)
y = rnorm(10,0,.45)
z = rnorm(10,0,.8)
ans = x+y+z
data = rbind(data, data.frame(x=x, y=y, z=z, ans=ans))
example = function(ans) {
ans^2
}
data$result = example(data$ans)
I want to use a while loop to assess the ans column of the dataframe and if all of the numbers in the column are less than 0 perform the example function on the ans column. If they are not all below 0 I would like it to run again until they are. Any help is appreciated.
You can use any() to test whether any x values are still non-negative, and redraw only those values until every x is negative.
data = data.frame(x=rnorm(10,0,.01),
y = rnorm(10,0,.45),
z = rnorm(10,0,.8))
while(any(data$x >= 0)){
data$x[data$x >= 0] <- rnorm(length(data$x[data$x >= 0]),0,.01)
}
data$ans = data$x + data$y + data$z
print(data)
x y z ans
1 -0.0014348613 0.51931771 -0.4695617 1.2199625
2 -0.0037155145 -0.72322260 0.4650501 2.2660665
3 -0.0007619743 0.42842295 -0.3660313 0.2690932
4 -0.0068680912 0.36888855 1.4445536 0.6955025
5 -0.0134698425 -0.17174076 -1.2325956 0.7463931
6 -0.0029502825 -0.04208495 -1.4656484 -0.7020727
7 -0.0027566384 0.09476311 -0.1328970 -0.1437156
8 -0.0188576808 -0.25938843 -0.6648152 0.4843587
9 -0.0013769550 -0.00792926 0.4946057 -2.1885040
10 -0.0026376453 -0.15831996 -0.1263073 -0.2611772
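If the condition really has to hold for the ans column, as in the question, the same idea can be applied to the whole draw. This is a sketch under a few assumptions: draw_data is a hypothetical helper, the seed is arbitrary, and max_tries is a safety cap I added so the loop cannot run forever. Each ans value is negative with probability about one half, so ten negatives in a row usually take on the order of a thousand redraws.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

example <- function(ans) {
  ans^2
}

# Hypothetical helper: draw a fresh data frame and compute ans.
draw_data <- function() {
  d <- data.frame(x = rnorm(10, 0, .01),
                  y = rnorm(10, 0, .45),
                  z = rnorm(10, 0, .8))
  d$ans <- d$x + d$y + d$z
  d
}

max_tries <- 100000  # safety cap (assumption, not part of the question)
tries <- 1
data <- draw_data()
# Redraw until every value in ans is below 0.
while (!all(data$ans < 0) && tries < max_tries) {
  data <- draw_data()
  tries <- tries + 1
}

if (all(data$ans < 0)) {
  data$result <- example(data$ans)
}
```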

How get matrix of p-values for ks test and t test?

I am new to R. If you could help me, that would be great. My problem is as follows:
Let's say I have 5 groups, Group1, Group2, Group3, Group4 and Group5, each containing 100 data points.
Now I want to compare these groups with each other, using either a t-test or a ks-test, and generate a matrix of p-values. Essentially, there would be a 5x5 matrix of p-values. I have done a similar kind of work with correlations using the corr.mat function.
Here, the 5 groups are just for illustration; in the end I have to do it on almost 250 groups, so I have to generate a 250x250 matrix of p-values.
If any of you could help me achieve this, it would be very kind of you.
Things I know in R so far:
Load the data into R by loading .csv file:
my.data = read.csv(file.choose())
attach(my.data)
If you know how to compute an individual p-value,
you can just put that code in a loop.
# Sample data
d <- data.frame(
group = paste( "group", rep(1:5, each=100) ),
value = rnorm( 5*100 )
)
# Matrix to store the result
groups <- unique( d$group )
result <- matrix(NA, nc=length(groups), nr=length(groups))
colnames(result) <- rownames(result) <- groups
# Loop
for( g1 in groups ) {
for( g2 in groups ) {
result[ g1, g2 ] <- t.test(
d$value[ d$group == g1 ],
d$value[ d$group == g2 ]
)$p.value
}
}
result
# group 1 group 2 group 3 group 4 group 5
# group 1 1.0000000 0.6533393 0.7531349 0.6239723 0.6194475
# group 2 0.6533393 1.0000000 0.9047020 0.9985489 0.3316215
# group 3 0.7531349 0.9047020 1.0000000 0.8957871 0.4190027
# group 4 0.6239723 0.9985489 0.8957871 1.0000000 0.2833226
# group 5 0.6194475 0.3316215 0.4190027 0.2833226 1.0000000
You could also use outer:
groups <- unique( d$group )
outer(
groups, groups,
Vectorize( function(g1,g2) {
t.test(
d$value[ d$group == g1 ],
d$value[ d$group == g2 ]
)$p.value
} )
)
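For the ks test and for many groups, it may be worth computing each pair only once, since the two-sided p-value is the same in both directions. With 250 groups that is 250 * 249 / 2 = 31125 tests instead of 62500. Below is a sketch with the same kind of sample data; it assumes the diagonal can simply be set to 1 (a group compared with itself).

```r
set.seed(1)  # arbitrary seed for the sample data
d <- data.frame(
  group = paste("group", rep(1:5, each = 100)),
  value = rnorm(5 * 100)
)

groups <- unique(d$group)
values <- split(d$value, d$group)  # one numeric vector per group

# Start with 1 everywhere so the diagonal is already filled in (assumption).
result <- matrix(1, nrow = length(groups), ncol = length(groups),
                 dimnames = list(groups, groups))

pairs <- combn(groups, 2)  # every unordered pair of groups, one per column
for (k in seq_len(ncol(pairs))) {
  g1 <- pairs[1, k]
  g2 <- pairs[2, k]
  p <- ks.test(values[[g1]], values[[g2]])$p.value
  result[g1, g2] <- p
  result[g2, g1] <- p  # mirror into the lower triangle
}
result
```

Replacing ks.test with t.test in the loop gives the t-test version of the same matrix.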

In R, how to use for loop to create series of graphs

I have a wide-form table that looks like this:
ID Test_11 LVL11 Score_X_11 Score_Y_11 Test_12 LV12 Score_X_12 Score_Y_12
1  A       I     100        NA         NA      NA   100        100
2  A       II    90         100        B       II   90         85
3  NA      NA    NA         NA         B       II   90         NA
4  A       III   100        80         A       III  75         75
5  B       I     NA         90         NA      NA   60         50
6  B       I     70         100        NA      NA   NA         NA
7  B       II    85         NA         A       I    60         60
And a table used for sorting that looks like this
Test_11 A
Test_11 B
Test_12 A
Test_12 B
What this second table tells us is that for Test_11 there are two versions, A and B (same for Test_12).
I am trying to create a series of boxplots that graph the distribution of every combination of Test_11 and Test_12, and their respective versions (A, B). So, for Test_11==A the boxplot created would have three groups (I, II, III) and then the resulting graphical information from the subset where Test_11==A, and then the same for Test_11==B, Test_12==A, and Test_12==B. In total there should be, in this example, 4 graphs created.
What I have in R is:
z <- subset(df, df$Test_11=="A")
plot(z$LVL11, z$Score_X_11, varwidth = TRUE, notch = TRUE, xlab = 'LVL',
ylab = 'score')
What I would like, and haven't been able to figure out how to do, is to write a for loop that does the subsetting for me so that I could automate this for my actual data set which has a few dozen of these combinations.
Thanks for any help and guidance.
The "straight forward way"
Maybe you should save all your logical vectors in a data.frame or matrix before the loop:
selections <- matrix(nrow = nrow(df), ncol = 4)
selections[,1] <- df$Test_11 == "A"
selections[,2] <- df$Test_11 == "B"
selections[,3] <- df$Test_12 == "A"
selections[,4] <- df$Test_12 == "B"
# etc...
par(mfcol = c(2, 2)) # here you should customize at will...
for (i in 1:4) {
z <- subset(df, selections[,i])
plot(z$LVL11, z$Score_X_11, varwidth = TRUE,
notch = TRUE, xlab = 'LVL',
ylab = 'score')
}
You can change your code so instead of using z$Score_X_11, use z[,string]. The value of string should be constructed with paste (or other string manipulating functions). For example:
v <- c("X", "Y")
n <- c("11", "12")
for (i in 1:2) {
for (j in 1:2) {
string <- paste("Score", v[i], n[j], sep = "_")
print(string)
}
}
A similar reasoning would be used with the z$LVLXX values, so you should be able to figure out a way to accommodate for that.
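Putting these pieces together, one possible approach is to keep a small table with one row per plot and look up the columns by name with [[ ]]. The sketch below uses made-up toy data with the same column layout as the question; specs and its column names are my own choice, and the level columns are mapped by hand because they are named differently (LVL11 vs LV12).

```r
set.seed(1)  # toy data only; the values are random, not the data from the question
n <- 40
df <- data.frame(
  Test_11    = sample(c("A", "B"), n, replace = TRUE),
  LVL11      = factor(sample(c("I", "II", "III"), n, replace = TRUE)),
  Score_X_11 = round(runif(n, 50, 100)),
  Test_12    = sample(c("A", "B"), n, replace = TRUE),
  LV12       = factor(sample(c("I", "II", "III"), n, replace = TRUE)),
  Score_X_12 = round(runif(n, 50, 100))
)

# One row per plot: test number and version, plus the columns to use.
specs <- expand.grid(test = c("11", "12"), version = c("A", "B"),
                     stringsAsFactors = FALSE)
specs$test_col  <- paste0("Test_", specs$test)
specs$score_col <- paste0("Score_X_", specs$test)
specs$lvl_col   <- unname(c("11" = "LVL11", "12" = "LV12")[specs$test])

par(mfcol = c(2, 2))
for (k in seq_len(nrow(specs))) {
  s <- specs[k, ]
  # which() drops rows where the test column is NA
  z <- df[which(df[[s$test_col]] == s$version), ]
  boxplot(z[[s$score_col]] ~ z[[s$lvl_col]],
          xlab = s$lvl_col, ylab = s$score_col,
          main = paste(s$test_col, "==", s$version))
}
```

With a few dozen combinations you only need to extend the specs table; the loop itself stays the same.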
Alternative way, with ggplot2 & reshape2
I'm not very experienced with trellis graphics (like in the other answer), but I know a little ggplot2, so I decided to take the challenge and give it a try. It is not great, but at least it works:
# df <- read.table("data.txt", header = TRUE, na.string = "NA")
require(reshape2)
require(ggplot2)
# Melt your data.frame, using the scores as the "values":
mdf <- melt(df[,-1], id = c("LVL11", "LV12", "Test_11", "Test_12"))
# loop through level types:
for (lvl in c("LVL11", "LV12")) {
# looping through values of test11
for (test11 in c("A", "B")) {
# Note the use of subset before ggplot
p <- ggplot(subset(mdf, Test_11 == test11), aes_string(x=lvl, y="value"))
# I added the geom_jitter as in the example given there were only a few points
g <- p + geom_boxplot(aes(fill = variable)) + geom_jitter(aes(shape = variable))
print(g) # ggplot objects must be printed explicitly to be drawn inside a loop
# Finally, save each plot with a relevant name:
savePlot(paste0(lvl, "-t11", test11, ".png"))
# (note that savePlot has some problems with RStudio iirc)
}
# Same as before, but with test_12
for (test12 in c("A", "B")) {
p <- ggplot(subset(mdf, Test_12 == test12), aes_string(x=lvl, y="value"))
g <- p + geom_boxplot(aes(fill = variable)) + geom_jitter(aes(shape = variable))
print(g)
savePlot(paste0(lvl, "-t12", test12, ".png"))
}
}
If anyone knows how to use trellis graphics or maybe facet_grid in this case, so I can put all the graphics in one image, I would love to hear how.
cheers.
Classic plyr solution (HT to @hadleywickham)
require(plyr); require(lattice); require(gridExtra)
bplots <- dlply(dat, .(Test_11, Test_12), function(df){
bwplot(Score_X_11 ~ LVL11, data = df)
})
do.call('grid.arrange', bplots)
