R: iterate over columns and plot - r

My data looks like this:
Group Feature_A Feature_B Feature_C Feature_D
1 1 0 3 2 4
2 1 5 2 2 8
3 1 9 8 6 5
4 2 5 7 8 8
5 2 2 6 8 1
6 2 3 8 6 4
7 3 1 5 3 5
8 3 1 4 3 4
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), Feature_A = c(0L,
5L, 9L, 5L, 2L, 3L, 1L, 1L), Feature_B = c(3L, 2L, 8L, 7L, 6L,
8L, 5L, 4L), Feature_C = c(2L, 2L, 6L, 8L, 8L, 6L, 3L, 3L), Feature_D = c(4L,
8L, 5L, 8L, 1L, 4L, 5L, 4L)), .Names = c("Group", "Feature_A",
"Feature_B", "Feature_C", "Feature_D"), class = "data.frame", row.names = c(NA,
-8L))
For every Feature I want to generate a plot (e.g., a boxplot) that highlights the difference between Groups.
# Get the feature names and the unique Group values
Features <- unique(colnames(df[,-1]))
Group <- unique(df$Group)
But how can I do the rest?
Pseudo-code might look like this:
Select Feature from Data
Split Data according Group
Boxplot
for (i in 1:levels(df$Features)){
for (o in 1:length(Group)){
}}
How can I achieve this? Hope someone can help me.

I would put the data in long format. Then, using ggplot2, you can do some nice things.
library(reshape2)
library(ggplot2)
library(gridExtra)
## long format using Group as id (df is the data frame from the question)
dat.m <- melt(df, id.vars = 'Group')
## bar plot
p1 <- ggplot(dat.m) +
geom_bar(aes(x=Group,y=value,fill=variable),stat='identity')
## box plot
p2 <- ggplot(dat.m) +
geom_boxplot(aes(x=factor(Group),y=value,fill=variable))
## aggregate the 2 plots
grid.arrange(p1,p2)
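For readers unfamiliar with the long format, here is a minimal base R sketch of what the reshaping step does. It rebuilds the question's df with data.frame() so it runs on its own. Note that stack() names its output columns values and ind, which differs from the value and variable columns that melt() produces.

```r
# Rebuild the question's data so the sketch is self-contained
df <- data.frame(
  Group     = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  Feature_A = c(0L, 5L, 9L, 5L, 2L, 3L, 1L, 1L),
  Feature_B = c(3L, 2L, 8L, 7L, 6L, 8L, 5L, 4L),
  Feature_C = c(2L, 2L, 6L, 8L, 8L, 6L, 3L, 3L),
  Feature_D = c(4L, 8L, 5L, 8L, 1L, 4L, 5L, 4L)
)

# stack() puts all feature columns into one 'values' column and
# records the source column name in the factor 'ind'
long <- data.frame(
  Group = rep(df$Group, times = ncol(df) - 1),
  stack(df[-1])
)

head(long)
```

Each original row now appears once per feature, which is the shape that a single ggplot call with a fill or facet variable expects.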

This is easy to do; I do this all the time.
The code below generates the charts with ggplot2 and stores them as ch_Feature_A ....
You can wrap the loop in a pdf() call to send the charts to a PDF file as well.
library(ggplot2)
df$Group <- as.factor(df$Group)
for (i in 2:dim(df)[2]) {
ch <- ggplot(df,aes_string(x="Group",y=names(df)[i],fill="Group"))+geom_boxplot()
assign(paste0("ch_",names(df)[i]),ch)
}
Or, even simpler, if you do not want separate charts:
library(reshape2)
df1 <- melt(df, id.vars = "Group")
ggplot(df1,aes(x=Group,y=value,fill=Group))+geom_boxplot()+facet_grid(.~variable)
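The loop from the question's pseudo-code (select a feature, split by group, draw a boxplot) can also be sketched in base R without any packages. This is only one possible layout; the names features and group_medians are made up for the illustration.

```r
# Rebuild the question's data so the sketch is self-contained
df <- data.frame(
  Group     = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  Feature_A = c(0L, 5L, 9L, 5L, 2L, 3L, 1L, 1L),
  Feature_B = c(3L, 2L, 8L, 7L, 6L, 8L, 5L, 4L),
  Feature_C = c(2L, 2L, 6L, 8L, 8L, 6L, 3L, 3L),
  Feature_D = c(4L, 8L, 5L, 8L, 1L, 4L, 5L, 4L)
)

# Every column except Group is a feature
features <- names(df)[-1]

# One panel per feature; the formula splits the values by Group
op <- par(mfrow = c(1, length(features)))
for (feat in features) {
  boxplot(df[[feat]] ~ df$Group, main = feat, xlab = "Group", ylab = feat)
}
par(op)

# The group medians drawn as the thick line in each box
group_medians <- sapply(df[features], function(x) tapply(x, df$Group, median))
group_medians
```

There is no need for an inner loop over groups: the formula interface does the split for each feature.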

Related

How to write a function in R that can make out the average of the 3 best scores out of 4

I am given a data frame with 10 students, each with a score for 4 different tests. I must select each student's 3 best scores and compute the average of those 3.
noma interro1 interro2 interro3 interro4
1 836016120449 6 3 NA 3
2 596844884419 1 4 2 8
3 803259953398 2 2 9 1
4 658786759629 3 1 3 2
5 571155022756 4 9 1 4
6 576037886365 8 7 8 7
7 045086625199 9 6 7 6
8 621909979467 5 8 4 5
9 457029205538 7 5 6 9
10 402526220817 NA 10 5 10
This dataframe provides the scores for 4 tests for 10 students.
Write a function that calculates the average score for the 3 best tests.
Calculate this average score for the 10 students.
average <- function(t){
x <- sort(t, decreasing = TRUE)[1:3]
return(mean(x, na.rm=TRUE))
}
apply(interro2, 1, average)
Since I want the 3 best, I thought that sort() could be useful here. However, what I receive is:
In mean.default(x, na.rm = TRUE) :
argument is not numeric or logical: returning NA
I tried this one too:
average <- function(t){
rowMeans(sort(t, decreasing = TRUE, na.rm=TRUE)[1:3])
}
UPDATE: Solved. The data frame passed to apply had the wrong columns: I had to remove the first one, which contains the names of the students, so the version below works:
average <- function(t){
x <- sort(t, decreasing = TRUE)[1:3]
return(mean(x, na.rm=TRUE))
}
apply(interro2[-1], 1, average)
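The warning comes from how apply() works: it first converts the data frame to a matrix, and a matrix holds a single type, so the character noma column turns every score into a string. Below is a minimal sketch of this on a shortened three-student copy of the table (the name scores is made up here). It also swaps [1:3] for head(), an optional change that avoids NA padding when a student has fewer than three scores.

```r
scores <- data.frame(
  noma     = c("836016120449", "596844884419", "402526220817"),
  interro1 = c(6, 1, NA),
  interro2 = c(3, 4, 10),
  interro3 = c(NA, 2, 5),
  interro4 = c(3, 8, 10),
  stringsAsFactors = FALSE
)

# With the ID column included, the matrix that apply() sees is character
typeof(as.matrix(scores))       # "character"
# Dropping it leaves a numeric matrix
typeof(as.matrix(scores[-1]))   # "double"

# sort() drops NA by default, so head(..., 3) only ever sees real scores
average <- function(t) mean(head(sort(t, decreasing = TRUE), 3))

top3 <- apply(scores[-1], 1, average)
top3
```

For the first student the three best scores are 6, 3 and 3, so the average is 4.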
Try pivoting the scores to long format, then sort them by name and score and keep the top 3 scores per name. Finally, take the average, grouping by name:
library(dplyr)
library(tidyr)
data <- data.frame(
stringsAsFactors = FALSE,
noma = c("836016120449","596844884419",
"803259953398","658786759629","571155022756",
"576037886365","045086625199","621909979467","457029205538",
"402526220817"),
interro1 = c(6L, 1L, 2L, 3L, 4L, 8L, 9L, 5L, 7L, NA),
interro2 = c(3L, 4L, 2L, 1L, 9L, 7L, 6L, 8L, 5L, 10L),
interro3 = c(NA, 2L, 9L, 3L, 1L, 8L, 7L, 4L, 6L, 5L),
interro4 = c(3L, 8L, 1L, 2L, 4L, 7L, 6L, 5L, 9L, 10L)
)
data <- data %>% pivot_longer(!noma, names_to = "interro", values_to = "value") %>% replace_na(list(value=0))
data_new1 <- data[order(data$noma, data$value, decreasing = TRUE), ] # Order data descending
data_new1 <- Reduce(rbind, by(data_new1, data_new1["noma"], head, n = 3)) # Top N highest values by group
data_new1 <- data_new1 %>% group_by(noma) %>% summarise(Value_mean = mean(value))

How to make "For loop" based on column

I have been using for loops before this, but the loop variable is usually k, which refers to row numbers.
Example:
for (k in 1:n) {
expression
}
My question is, is it possible for the variable to be a certain column?
Example:
for ("column no" in 1:n) {
expression
}
I have tried several things by trial and error and am a bit stuck now. Here is my data:
date mold no
22-May 1.35436 1
23-May 0.88592 1
24-May 0.81316 1
25-May 0.80856 1
26-May 0.84646 1
27-May 0.81762 1
28-May 0.79828 1
03-Jan 1.09158 2
04-Jan 0.86661 2
05-Jan 0.81908 2
06-Jan 0.7555 2
07-Jan 0.66577 2
08-Jan 0.66706 2
09-Jan 0.67133 2
05-Feb 20.4366 3
06-Feb 5.77923 3
06-Feb 3.12323 3
05-Feb 2.25436 3
06-Feb 1.74551 3
06-Feb 1.52744 3
05-Feb 1.45483 3
28-Jul 1.55148 4
29-Jul 1.18882 4
30-Jul 1.10595 4
31-Jul 1.14101 4
01-Aug 1.1453 4
02-Aug 1.10113 4
03-Aug 1.09152 4
30-Nov 8.3254 5
01-Dec 4.03003 5
02-Dec 2.18026 5
03-Dec 1.40028 5
04-Dec 1.02901 5
05-Dec 0.85859 5
06-Dec 0.7776 5
I would like to ask R to sum up the values in the mold column for each group (1 to 5) in the no column. For example, for no=1, it would be
1.35436 + 0.88592 + 0.81316 + 0.80856 + 0.84646 + 0.81762 + 0.79828 = 6.32436
Then repeat for no = 2, 3, 4 etc.
We can loop through the unique elements, compare (==), and get the sum of the 'mold' elements that correspond to the resulting logical vector
un1 <- unique(df1$no)
v1 <- numeric(length(un1))
for(i in seq_along(v1)) v1[i] <- sum(df1$mold[df1$no== un1[i]])
v1
#[1] 6.32436 5.53693 36.32120 8.32521 18.60117
It is the same as rowsum
rowsum(df1$mold, df1$no)[,1]
# 1 2 3 4 5
# 6.32436 5.53693 36.32120 8.32521 18.60117
data
df1 <- structure(list(date = c("22-May", "23-May", "24-May", "25-May",
"26-May", "27-May", "28-May", "03-Jan", "04-Jan", "05-Jan", "06-Jan",
"07-Jan", "08-Jan", "09-Jan", "05-Feb", "06-Feb", "06-Feb", "05-Feb",
"06-Feb", "06-Feb", "05-Feb", "28-Jul", "29-Jul", "30-Jul", "31-Jul",
"01-Aug", "02-Aug", "03-Aug", "30-Nov", "01-Dec", "02-Dec", "03-Dec",
"04-Dec", "05-Dec", "06-Dec"), mold = c(1.35436, 0.88592, 0.81316,
0.80856, 0.84646, 0.81762, 0.79828, 1.09158, 0.86661, 0.81908,
0.7555, 0.66577, 0.66706, 0.67133, 20.4366, 5.77923, 3.12323,
2.25436, 1.74551, 1.52744, 1.45483, 1.55148, 1.18882, 1.10595,
1.14101, 1.1453, 1.10113, 1.09152, 8.3254, 4.03003, 2.18026,
1.40028, 1.02901, 0.85859, 0.7776), no = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L)),
class = "data.frame", row.names = c(NA,
-35L))
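Two other base R idioms give the same per-group totals without an explicit loop. The sketch below uses only the first two groups of the data above to stay short; the names sums_tapply and sums_agg are made up for the illustration.

```r
# First two groups of the question's data
df1 <- data.frame(
  mold = c(1.35436, 0.88592, 0.81316, 0.80856, 0.84646, 0.81762, 0.79828,
           1.09158, 0.86661, 0.81908, 0.7555, 0.66577, 0.66706, 0.67133),
  no   = rep(1:2, each = 7)
)

# tapply: a named vector of sums, one entry per value of 'no'
sums_tapply <- tapply(df1$mold, df1$no, sum)

# aggregate: the same totals returned as a data frame
sums_agg <- aggregate(mold ~ no, data = df1, FUN = sum)

sums_tapply
sums_agg
```

aggregate() is convenient when the result should stay a data frame, for example to merge it back onto the original rows.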

making a presence/absence timeline in r for multiple y objects

This is my first time using SO and I am an R newbie; sorry if this is a little basic or unclear (or if the question has already been answered... I'm struggling with coding and need pretty specific answers to understand)
I would like to produce an image similar to this one:
Except I would like it to be oriented horizontally on a timeline, and with two vertical lines drawn from the x-axis.
I can set the data up simply, and there are only two variables - date and Tag.
Tag Date
1 1 1/1/2014
2 3 1/1/2014
3 1 1/3/2014
4 2 1/3/2014
5 3 1/3/2014
6 5 1/3/2014
7 2 1/4/2015
8 3 1/4/2015
9 4 1/4/2015
10 6 1/4/2015
11 1 1/5/2014
12 2 1/5/2014
13 4 1/5/2014
14 6 1/5/2014
15 1 1/6/2014
16 2 1/6/2014
17 3 1/6/2014
18 4 1/6/2014
19 6 1/6/2014
20 2 1/7/2014
21 4 1/7/2014
22 1 1/8/2014
23 2 1/8/2014
24 6 1/8/2014
Here is a drawn image of what I would like to accomplish:
To recap: I want to take this data, which shows the dates on which animals were detected at a certain location, and plot it on a timeline with two vertical lines on two dates. If an animal (say, tag 2) was detected on consecutive days, I would like to connect those dates with a line; if a detection has no detection on an adjacent day, a simple dot will suffice. I imagine the y-axis stacked with each individual Tag and the x-axis as a date scale: for each date, if a tag ID was detected, its corresponding x,y coordinate is marked; if a tag was not detected on that date, the coordinate stays blank.
Here's a follow-up question:
I want to add a shaded background to some of the dates. I figured I could do this using geom_rect, but I keep getting the following error:
Error: Invalid input: date_trans works with objects of class Date only
Using the code you wrote, this is what I added that produces the error:
geom_rect(aes(xmin=16075, xmax=16078, ymin=-Inf, ymax=Inf), fill="red", alpha=0.25)
This code will plot, but the rectangle is not transparent, and so it becomes fairly useless:
geom_rect(xmin=16075, xmax=16078, ymin=-Inf, ymax=Inf)
You first need to change your date format into Date. Then you need to figure out if dates are consecutive. And finally you need to plot them. Below is a possible solution using the packages dplyr and ggplot2.
# needed packages
require(ggplot2)
require(dplyr)
# input your data (changed to 2014 dates)
dat <- structure(list(Tag = c(1L, 3L, 1L, 2L, 3L, 5L, 2L, 3L, 4L, 6L, 1L, 2L, 4L, 6L, 1L, 2L, 3L, 4L, 6L, 2L, 4L, 1L, 2L, 6L), Date = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 7L, 7L, 7L), .Label = c("1/1/2014", "1/3/2014", "1/4/2014", "1/5/2014", "1/6/2014", "1/7/2014", "1/8/2014"), class = "factor")), .Names = c("Tag", "Date"), class = "data.frame", row.names = c(NA, -24L))
# change date to Date format
dat[, "Date"] <- as.Date(dat[, "Date"], format='%m/%d/%Y')
# adding consecutive tag for first day of cons. measurements
dat <- dat %>% group_by(Tag) %>% mutate(consecutive=c(diff(Date), 2)==1)
# plotting command
ggplot(dat, aes(Date, Tag)) + geom_point() + theme_bw() +
geom_line(aes(alpha=consecutive, group=Tag)) +
scale_alpha_manual(values=c(0, 1), breaks=c(FALSE, TRUE), guide='none')
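For the follow-up about shading, one likely fix is to pass real Date values instead of bare numbers and to draw the rectangle once with annotate() rather than once per data row. The following is a sketch under those assumptions, not a tested drop-in for the plot above. The small dat table and the names shade_from and shade_to are made up for the illustration; 16075 and 16078 are read as days since 1970-01-01.

```r
library(ggplot2)

# Small made-up detection table with a proper Date column
dat <- data.frame(
  Tag  = c(1L, 1L, 2L, 2L, 3L),
  Date = as.Date(c("2014-01-01", "2014-01-03", "2014-01-03",
                   "2014-01-04", "2014-01-06"))
)

# Convert the day counts from the question into Date objects
shade_from <- as.Date(16075, origin = "1970-01-01")  # 2014-01-05
shade_to   <- as.Date(16078, origin = "1970-01-01")  # 2014-01-08

# annotate() draws a single rectangle, so alpha behaves as expected
p <- ggplot(dat, aes(Date, Tag)) +
  annotate("rect", xmin = shade_from, xmax = shade_to,
           ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.25) +
  geom_point() +
  theme_bw()
p
```

Adding the rectangle before geom_point() keeps the points drawn on top of the shading.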

Checking row format of csv

I am trying to import some data (below) and checking to see if I have the appropriate number of rows for later analysis.
repexample <- structure(list(QueueName = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c(" Overall", "CCM4.usci_retention_eng", "usci_helpdesk"
), class = "factor"), X8Tile = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), .Label = c(" Average", "1", "2", "3", "4", "5", "6", "7",
"8"), class = "factor"), Actual = c(508.1821504, 334.6994838,
404.9048759, 469.4068667, 489.2800416, 516.5744106, 551.7966176,
601.5103783, 720.9810622, 262.4622533, 250.2777778, 264.8281938,
272.2807882, 535.2466968, 278.25, 409.9285714, 511.6635101, 553,
641, 676.1111111, 778.5517241, 886.3666667), Calls = c(54948L,
6896L, 8831L, 7825L, 5768L, 7943L, 5796L, 8698L, 3191L, 1220L,
360L, 454L, 406L, 248L, 11L, 9L, 94L, 1L, 65L, 9L, 29L, 30L),
Pop = c(41L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 3L, 1L, 1L,
1L, 11L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), .Names = c("QueueName",
"X8Tile", "Actual", "Calls", "Pop"), class = "data.frame", row.names = c(NA,
-22L))
The data gives 5 columns and is one example of some data that I would typically import (via a .csv file). As you can see there are three unique values in the column "QueueName". For each unique value in "QueueName" I want to check that it has 9 rows, or the corresponding values in the column "X8Tile" ( Average, 1, 2, 3, 4, 5, 6, 7, 8). As an example the "QueueName" Overall has all of the necessary rows, but usci_helpdesk does not.
So my first priority is to at least identify if one of the unique values in "QueueName" does not have all of the necessary rows.
My second priority would be to remove all of the rows corresponding to a unique "QueueName" that does not meet the requirements.
Both these priorities are easily addressed using the Split-Apply-Combine paradigm, implemented in the plyr package.
Priority 1: Identify values of QueueName which don't have enough rows
require(plyr)
# Make a short table of the number of rows for each unique value of QueueName
rowSummary <- ddply(repexample, .(QueueName), summarise, numRows=length(QueueName))
print(rowSummary)
If you have lots of unique values of QueueName, you'll want to identify the values which are not equal to 9:
rowSummary[rowSummary$numRows !=9, ]
Priority 2: Eliminate rows for which QueueName does not have enough rows
repexample2 <- ddply(repexample, .(QueueName), transform, numRows=length(QueueName))
repexampleEdit <- repexample2[repexample2$numRows ==9, ]
print(repexampleEdit)
(I don't quite understand the meaning of 'check that it has 9 rows, or the corresponding values in the column "X8Tile"'.) You could edit the repexampleEdit line based on your needs.
This is an approach that makes some assumptions about how your data are ordered. It can be modified (or your data can be reordered) if the assumption doesn't fit:
## Paste together the values from your "X8Tile" column
## If all is in order, you should have "Average12345678"
## If anything is missing, you won't....
myMatch <- names(
which(with(repexample, tapply(X8Tile, QueueName, FUN=function(x)
gsub("^\\s+|\\s+$", "", paste(x, collapse = ""))))
== "Average12345678"))
## Use that to subset...
repexample[repexample$QueueName %in% myMatch, ]
# QueueName X8Tile Actual Calls Pop
# 1 Overall Average 508.1822 54948 41
# 2 Overall 1 334.6995 6896 6
# 3 Overall 2 404.9049 8831 5
# 4 Overall 3 469.4069 7825 5
# 5 Overall 4 489.2800 5768 5
# 6 Overall 5 516.5744 7943 5
# 7 Overall 6 551.7966 5796 5
# 8 Overall 7 601.5104 8698 5
# 9 Overall 8 720.9811 3191 5
# 14 CCM4.usci_retention_eng Average 535.2467 248 11
# 15 CCM4.usci_retention_eng 1 278.2500 11 2
# 16 CCM4.usci_retention_eng 2 409.9286 9 2
# 17 CCM4.usci_retention_eng 3 511.6635 94 2
# 18 CCM4.usci_retention_eng 4 553.0000 1 1
# 19 CCM4.usci_retention_eng 5 641.0000 65 1
# 20 CCM4.usci_retention_eng 6 676.1111 9 1
# 21 CCM4.usci_retention_eng 7 778.5517 29 1
# 22 CCM4.usci_retention_eng 8 886.3667 30 1
Similar approaches can be taken with aggregate+merge and similar tools.
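If the row order cannot be relied on, an order-independent check is another option. The sketch below is base R only and runs on a shortened, made-up stand-in for repexample (queue "a" is complete, queue "b" is not). The names x, required, is_complete and x_clean are all made up for the illustration.

```r
# Made-up stand-in for repexample: "a" has all 9 tiles, "b" has only 4
x <- data.frame(
  QueueName = c(rep("a", 9), rep("b", 4)),
  X8Tile    = c(" Average", 1:8, " Average", 1:3),
  stringsAsFactors = FALSE
)

required <- c("Average", as.character(1:8))

# For each queue, check that every required tile is present exactly once,
# ignoring leading or trailing spaces and the order of the rows
is_complete <- vapply(
  split(trimws(as.character(x$X8Tile)), x$QueueName),
  function(tiles) length(tiles) == length(required) && setequal(tiles, required),
  logical(1)
)

# Priority 1: which queues pass the check
is_complete

# Priority 2: keep only rows that belong to complete queues
x_clean <- x[x$QueueName %in% names(is_complete)[is_complete], ]
```

Unlike a plain row count, this also rejects a queue that has 9 rows but repeats one tile and misses another.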

R - Select rows for random sample of column values?

How can I select all of the rows for a random sample of column values?
I have a dataframe that looks like this:
tag weight
R007 10
R007 11
R007 9
J102 11
J102 9
J102 13
J102 10
M942 3
M054 9
M054 12
V671 12
V671 13
V671 9
V671 12
Z990 10
Z990 11
That you can replicate using...
weights_df <- structure(list(tag = structure(c(4L, 4L, 4L, 1L, 1L, 1L, 1L,
3L, 2L, 2L, 5L, 5L, 5L, 5L, 6L, 6L), .Label = c("J102", "M054",
"M942", "R007", "V671", "Z990"), class = "factor"), value = c(10L,
11L, 9L, 11L, 9L, 13L, 10L, 3L, 9L, 12L, 12L, 14L, 5L, 12L, 11L,
15L)), .Names = c("tag", "value"), class = "data.frame", row.names = c(NA,
-16L))
I need to create a dataframe containing all of the rows from the above dataframe for two randomly sampled tags. Let's say tags R007 and M942 get selected at random; my new dataframe needs to look like this:
tag weight
R007 10
R007 11
R007 9
M942 3
How do I do this?
I know I can create a list of two random tags like this:
library(plyr)
tags <- ddply(weights_df, .(tag), summarise, count = length(tag))
set.seed(5464)
tag_sample <- tags[sample(nrow(tags),2),]
tag_sample
Resulting in...
tag count
4 R007 3
3 M942 1
But I just don't know how to use that to subset my original dataframe.
Is this what you want?
subset(weights_df, tag%in%sample(levels(tag),2))
If your data.frame is named dfrm, then this will select 100 random tags
dfrm[ sample(NROW(dfrm), 100), "tag" ] # possibly with repeats
If, on the other hand, you want a dataframe with the same columns (possibly with repeats):
samp <- dfrm[ sample(NROW(dfrm), 100), ] # leave the col name entry blank to get all
A third possibility: you want 100 distinct tags at random, with the probability not weighted by how often each tag occurs:
samp.tags <- unique(dfrm$tag)[ sample(length(unique(dfrm$tag)), 100) ]
Edit: With the revised question, use one of these:
subset(dfrm, tag %in% c("R007", "M942") )
Or:
dfrm[dfrm$tag %in% c("R007", "M942"), ]
Or:
dfrm[grep("R007|M942", dfrm$tag), ]
