Controlling text when using add_tooltip in ggvis

I am trying to get more control over the text that appears when using add_tooltip in ggvis.
Say I want to plot 'totalinns' against 'avg' for this dataframe. Color points by 'country'.
The text I want to appear in the hover tooltip is: 'player', 'country', 'debutyear', 'avg'.
tmp:
# player totalruns totalinns totalno totalout avg debutyear country
# 1 AG Ganteaume 112 1 0 1 112.00000 1948 WI
# 2 DG Bradman 6996 80 10 70 99.94286 1928 Aus
# 3 MN Nawaz 99 2 1 1 99.00000 2002 SL
# 4 VH Stollmeyer 96 1 0 1 96.00000 1939 WI
# 5 DM Lewis 259 5 2 3 86.33333 1971 WI
# 6 Abul Hasan 165 5 3 2 82.50000 2012 Ban
# 7 RE Redmond 163 2 0 2 81.50000 1973 NZ
# 8 BA Richards 508 7 0 7 72.57143 1970 SA
# 9 H Wood 204 4 1 3 68.00000 1888 Eng
# 10 JC Buttler 200 3 0 3 66.66667 2014 Eng
I understand that I need to make a key/id variable as ggvis only takes information supplied to it. Therefore I need to refer back to the original data. I have tried changing my text inside of my paste0() command, but still can't get it right.
tmp$id <- 1:nrow(tmp)
all_values <- function(x) {
if(is.null(x)) return(NULL)
row <- tmp[tmp$id == x$id, ]
paste0(tmp$player, tmp$country, tmp$debutyear,
tmp$avg, format(row), collapse = "<br />")
}
tmp %>% ggvis(x = ~totalinns, y = ~avg, key := ~id) %>%
layer_points(fill = ~factor(country)) %>%
add_tooltip(all_values, "hover")
Code to reproduce the example:
tmp <- structure(list(player = c("AG Ganteaume", "DG Bradman", "MN Nawaz",
"VH Stollmeyer", "DM Lewis", "Abul Hasan", "RE Redmond", "BA Richards",
"H Wood", "JC Buttler"), totalruns = c(112L, 6996L, 99L, 96L,
259L, 165L, 163L, 508L, 204L, 200L), totalinns = c(1L, 80L, 2L,
1L, 5L, 5L, 2L, 7L, 4L, 3L), totalno = c(0L, 10L, 1L, 0L, 2L,
3L, 0L, 0L, 1L, 0L), totalout = c(1L, 70L, 1L, 1L, 3L, 2L, 2L,
7L, 3L, 3L), avg = c(112, 99.9428571428571, 99, 96, 86.3333333333333,
82.5, 81.5, 72.5714285714286, 68, 66.6666666666667), debutyear = c(1948L,
1928L, 2002L, 1939L, 1971L, 2012L, 1973L, 1970L, 1888L, 2014L
), country = c("WI", "Aus", "SL", "WI", "WI", "Ban", "NZ", "SA",
"Eng", "Eng")), .Names = c("player", "totalruns", "totalinns",
"totalno", "totalout", "avg", "debutyear", "country"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -10L))

I think this is closer:
all_values <- function(x) {
  if (is.null(x)) return(NULL)
  # Look up the hovered point's row via its key, then use only that row
  row <- tmp[tmp$id == x$id, ]
  paste(row$player, row$country, row$debutyear, row$avg, sep = "<br>")
}
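The tooltip function can be checked without drawing a plot, by calling it with a list that mimics what ggvis passes on hover. The following is a minimal sketch under that assumption: `players` is a cut-down stand-in for `tmp`, and `make_tooltip` is a hypothetical helper name, not part of ggvis.

```r
# Toy data mirroring a few columns of tmp from the question
players <- data.frame(
  id = 1:3,
  player = c("AG Ganteaume", "DG Bradman", "MN Nawaz"),
  country = c("WI", "Aus", "SL"),
  debutyear = c(1948L, 1928L, 2002L),
  avg = c(112, 99.9428571428571, 99),
  stringsAsFactors = FALSE
)

# Hypothetical helper: x is assumed to be the list ggvis hands to the
# tooltip function, carrying the key column as x$id
make_tooltip <- function(x, data = players) {
  if (is.null(x) || is.null(x$id)) return(NULL)
  row <- data[data$id == x$id, ]
  paste0(
    "<b>", row$player, "</b><br>",
    "Country: ", row$country, "<br>",
    "Debut: ", row$debutyear, "<br>",
    "Average: ", format(round(row$avg, 2), nsmall = 2)
  )
}

# Simulate hovering over the point whose key is 2
tooltip_html <- make_tooltip(list(id = 2))
```

In the plot itself, such a function would be passed as `add_tooltip(make_tooltip, "hover")`, as in the question.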

Related

How to aggregate a data frame by columns and rows?

I have the following data set:
Class Total AC Final_Coverage
A 1000 1 55
A 1000 2 66
B 1000 1 77
A 1000 3 88
B 1000 2 99
C 1000 1 11
B 1000 3 12
B 1000 4 13
B 1000 5 22
C 1000 2 33
C 1000 3 44
C 1000 4 55
C 1000 5 102
A 1000 4 105
A 1000 5 109
I would like to get the average of the AC and the Final_Coverage for the first three rows of each class. Then, I want to store the average values along with the class name in a new dataframe. To do that, I did the following:
dataset <- read_csv("/home/ad/Desktop/testt.csv")
classes <- unique(dataset$Class)
new_data <- data.frame(Class = character(0), AC = numeric(0), Coverage = numeric(0))
for(class in classes){
new_data$Class <- class
dataClass <- subset(dataset, Class == class)
tenRows <- dataClass[1:3,]
coverageMean <- mean(tenRows$Final_Coverage)
acMean <- mean(tenRows$AC)
new_data$Coverage <- coverageMean
new_data$AC <- acMean
}
Everything works fine except entering the average value into the new_data frame. I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "Class", value = "A") :
replacement has 1 row, data has 0
Do you know how to solve this?
This should get you the new dataframe by using dplyr.
library(dplyr)
dataset %>%
  group_by(Class) %>%
  slice(1:3) %>%
  summarise(AC = mean(AC), Coverage = mean(Final_Coverage))
Your loop fails because new_data is created with 0 rows, and new_data$Class <- class then tries to assign a single value to a 0-row column, which is exactly what the error message reports. Pre-allocating one row per class and filling each row by position works:
new_data <- data.frame(Class = classes, AC = NA, Coverage = NA)
for (class in classes) {
  # Class is already filled in above; reassigning new_data$Class here
  # would overwrite the whole column with the current class
  dataClass <- subset(dataset, Class == class)
  tenRows <- dataClass[1:3, ]
  coverageMean <- mean(tenRows$Final_Coverage)
  acMean <- mean(tenRows$AC)
  new_data$Coverage[classes == class] <- coverageMean
  new_data$AC[classes == class] <- acMean
}
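A base R variant without an explicit loop is sketched below. It numbers the rows within each class using ave(), keeps the first three, and averages them with aggregate(). The data frame is retyped from the question's table, and the names `rank_in_class`, `first3` and `first3_means` are illustrative.

```r
# Data retyped from the table in the question
df1 <- data.frame(
  Class = c("A", "A", "B", "A", "B", "C", "B", "B", "B", "C", "C", "C", "C", "A", "A"),
  AC = c(1, 2, 1, 3, 2, 1, 3, 4, 5, 2, 3, 4, 5, 4, 5),
  Final_Coverage = c(55, 66, 77, 88, 99, 11, 12, 13, 22, 33, 44, 55, 102, 105, 109),
  stringsAsFactors = FALSE
)

# Position of each row within its class (1, 2, 3, ...), in file order
rank_in_class <- ave(seq_len(nrow(df1)), df1$Class, FUN = seq_along)

# Keep only the first three rows of every class
first3 <- df1[rank_in_class <= 3, ]

# Average AC and Final_Coverage per class
first3_means <- aggregate(
  first3[, c("AC", "Final_Coverage")],
  by = list(Class = first3$Class),
  FUN = mean
)
```

Because the selection is by row position rather than by the value of AC, this still picks the first three rows when AC is not a running counter.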
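You could also use base R's aggregate(). Note that this filters on AC <= 3, which selects the first three rows of each class only because AC numbers the rows within each class in this data.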
> aggregate(df1[df1$AC <= 3, 3:4], by=list(Class=df1[df1$AC <= 3, 1]), FUN=mean)
Class AC Final_Coverage
1 A 2 69.66667
2 B 2 62.66667
3 C 2 29.33333
DATA
df1 <- structure(list(Class = structure(c(1L, 1L, 2L, 1L, 2L, 3L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"),
Total = c(1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L,
1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L),
AC = c(1L, 2L, 1L, 3L, 2L, 1L, 3L, 4L, 5L, 2L, 3L, 4L, 5L,
4L, 5L), Final_Coverage = c(55L, 66L, 77L, 88L, 99L, 11L,
12L, 13L, 22L, 33L, 44L, 55L, 102L, 105L, 109L)), class = "data.frame", row.names = c(NA,
-15L))

Delete rows if column values are equal

I want to delete rows whose values in the columns (YEAR, POL, CTY, ID, AMOUNT) are equal to those of another row. Please see the output table below.
Table:
YEAR POL CTY ID AMOUNT RAN LEGAL
2017 30408 11 36 3500 RANGE1 L0015N20W23
2017 30408 11 36 3500 RANGE1 L00210N20W24
2017 30408 11 36 3500 RANGE1 L00310N20W25
2017 30409 11 36 3500 RANGE1 L0015N20W23
2017 30409 11 35 3500 RANGE2 NANANA
2017 30409 11 35 3500 RANGE3 NANANA
2017 30409 11 35 3500 RANGE3 NANANA
Output:
YEAR POL CTY ID AMOUNT RAN LEGAL
2017 30408 11 35 3500 RANGE1 L0015N20W23
You can try this:
no_duplicate_cols <- c("YEAR", "POL", "CTY", "ID", "AMOUNT")
new_df <- df[!duplicated(df[, no_duplicate_cols]), ]
The data frame new_df keeps the first row of each YEAR/POL/CTY/ID/AMOUNT combination and drops the later duplicates, so every combination still appears once.
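If the goal is instead to drop every row whose key combination occurs more than once, so that only the single unmatched row remains, duplicated() can be applied from both directions. This is a sketch on data retyped from the question; `key_cols`, `in_dup_group` and `unique_rows` are illustrative names.

```r
# Data retyped from the table in the question
df <- data.frame(
  YEAR = rep(2017L, 7),
  POL = c(30408L, 30408L, 30408L, 30409L, 30409L, 30409L, 30409L),
  CTY = rep(11L, 7),
  ID = c(36L, 36L, 36L, 36L, 35L, 35L, 35L),
  AMOUNT = rep(3500L, 7),
  RAN = c("RANGE1", "RANGE1", "RANGE1", "RANGE1", "RANGE2", "RANGE3", "RANGE3"),
  LEGAL = c("L0015N20W23", "L00210N20W24", "L00310N20W25", "L0015N20W23",
            "NANANA", "NANANA", "NANANA"),
  stringsAsFactors = FALSE
)

key_cols <- c("YEAR", "POL", "CTY", "ID", "AMOUNT")

# duplicated() flags the 2nd, 3rd, ... occurrence; scanning again with
# fromLast = TRUE flags the earlier occurrences, so together they mark
# every row that belongs to a group of size > 1
in_dup_group <- duplicated(df[, key_cols]) |
  duplicated(df[, key_cols], fromLast = TRUE)

unique_rows <- df[!in_dup_group, ]
```

On this data only the row with POL 30409 and ID 36 survives.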
If I understood the question correctly then I think you can try this
library(dplyr)
df %>%
group_by(YEAR, POL, CTY, ID, AMOUNT) %>%
filter(n() == 1)
Output (the expected output in the original question appears to contain a typo):
# A tibble: 1 x 7
# Groups: YEAR, POL, CTY, ID, AMOUNT [1]
YEAR POL CTY ID AMOUNT RAN LEGAL
1 2017 30409 11 36 3500 RANGE1 L0015N20W23
#sample data
> dput(df)
structure(list(YEAR = c(2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L), POL = c(30408L, 30408L, 30408L, 30409L, 30409L, 30409L,
30409L), CTY = c(11L, 11L, 11L, 11L, 11L, 11L, 11L), ID = c(36L,
36L, 36L, 36L, 35L, 35L, 35L), AMOUNT = c(3500L, 3500L, 3500L,
3500L, 3500L, 3500L, 3500L), RAN = structure(c(1L, 1L, 1L, 1L,
2L, 3L, 3L), .Label = c("RANGE1", "RANGE2", "RANGE3"), class = "factor"),
LEGAL = structure(c(1L, 2L, 3L, 1L, 4L, 4L, 4L), .Label = c("L0015N20W23",
"L00210N20W24", "L00310N20W25", "NANANA"), class = "factor")), .Names = c("YEAR",
"POL", "CTY", "ID", "AMOUNT", "RAN", "LEGAL"), class = "data.frame", row.names = c(NA,
-7L))

Combining and Transforming Data Frames in R

I have a bunch of data frames that look like this in R:
print(output[2])
Button Intensity Acc Intensity RT Time tdelta SubjectID CoupleID PrePost
1: 0 30 0 0.0 0 83325.87 0.000 1531 153 Post
2: 1 30 1 13.5 0 83362.65 36.782 1531 153 Post
3: 1 30 1 15.0 0 83376.68 14.027 1531 153 Post
4: 1 30 1 6.0 0 83392.27 15.585 1531 153 Post
5: 1 30 1 15.0 0 83398.77 6.507 1531 153 Post
print(output[1])
[[1]]
Button Intensity Acc Intensity RT Time tdelta SubjectID CoupleID PrePost
1: 0 30 0 0.0 0 77987.93 0.000 1531 153 Pre
2: 1 30 1 13.5 0 78084.57 96.639 1531 153 Pre
3: 1 30 1 15.0 0 78098.62 14.054 1531 153 Pre
4: 1 30 1 6.0 0 78114.13 15.508 1531 153 Pre
5: 1 30 1 15.0 0 78120.67 6.537 1531 153 Pre
I want to combine them into one big data frame that has the following logic and format:
SubjectID CoupleID PrePost Miss1RT Miss2RT Miss3RT Hit1RT Hit2RT Hit3RT
1531 153 Post 0.00 NA NA NA 36.78 14.027
1531 153 Pre 0.00 NA NA NA 96.638 14.054
If Button == 0, it's a Miss; if Button == 1, it's a Hit. So, in pseudocode, it should be something like:
for row in output[i].rows:
    if Button == 0:
        Miss1RT = tdelta
    elif Button == 1:
        Miss1RT = NA
and then a flipped version where if Button is 1, Hit[i]RT is tdelta or else 'NA'.
There are 26 lines per data frame and each row is either a hit or a miss so there will be 26 Miss and 26 Hit columns and each SubjectID gets two rows - one for Pre and one for Post. So the column headers for the final output will be:
SubjectID CoupleID PrePost Miss1RT Miss2RT ...Miss26RT Hit1RT Hit2RT ... Hit26RT
I'm new to R and struggling with the proper syntax.
Something like this should work:
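library(reshape2) # provides melt() and dcast(); assumed from the hit_1RT-style names in the output
# Get data in the structure the OP has
output <- list(pre, post)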
output2 <- lapply(output, function(x) cbind(x, num = paste0(1:nrow(x), "RT")))
pre_post <- do.call("rbind", output2)
#Perform actual calculations
pre_post$miss <- ifelse(pre_post$Button == 0, pre_post$tdelta, NA)
pre_post$hit <- ifelse(pre_post$Button == 1, pre_post$tdelta, NA)
pre_post_melted <- melt(pre_post, id.vars = c("SubjectID", "CoupleID", "num", "PrePost"), measure.vars = c("hit","miss"))
pre_post_res <- dcast(pre_post_melted, SubjectID + CoupleID + PrePost ~ variable + num, sep = "")
pre_post_res
#SubjectID CoupleID PrePost hit_1RT hit_2RT hit_3RT hit_4RT hit_5RT miss_1RT miss_2RT miss_3RT miss_4RT miss_5RT
#1 1531 153 Post NA 36.782 14.027 15.585 6.507 0 NA NA NA NA
#2 1531 153 Pre NA 96.639 14.054 15.508 6.537 0 NA NA NA NA
We stack the two data frames with rbind so each step only has to be written once, then reshape from long to wide with melt and dcast so that the hit and miss columns are created dynamically rather than named by hand.
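The core of the hit/miss split can be seen on a single small vector in base R. The sketch below uses the first three Post rows from the question and builds the column names in the MissNRT / HitNRT form the question asks for; `button`, `tdelta` and `wide` are illustrative names.

```r
# First three trials of the Post block from the question
button <- c(0L, 1L, 1L)
tdelta <- c(0, 36.782, 14.027)

# A trial contributes its tdelta to exactly one of the two vectors
miss <- ifelse(button == 0, tdelta, NA)
hit  <- ifelse(button == 1, tdelta, NA)

# Name each value by trial position, then flatten into one wide row
wide <- c(
  setNames(miss, paste0("Miss", seq_along(miss), "RT")),
  setNames(hit,  paste0("Hit",  seq_along(hit),  "RT"))
)
```

Repeating this per data frame and binding the results with the SubjectID, CoupleID and PrePost values would give one row per subject and block.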
Data:
pre <- structure(list(Button = c(0L, 1L, 1L, 1L, 1L), Intensity = c(30L,
30L, 30L, 30L, 30L), Acc = c(0L, 1L, 1L, 1L, 1L), Intensity = c(0,
13.5, 15, 6, 15), RT = c(0L, 0L, 0L, 0L, 0L), Time = c(77987.93,
78084.57, 78098.62, 78114.13, 78120.67), tdelta = c(0, 96.639,
14.054, 15.508, 6.537), SubjectID = c(1531L, 1531L, 1531L, 1531L,
1531L), CoupleID = c(153L, 153L, 153L, 153L, 153L), PrePost = c("Pre",
"Pre", "Pre", "Pre", "Pre")), .Names = c("Button", "Intensity",
"Acc", "Intensity", "RT", "Time", "tdelta", "SubjectID", "CoupleID",
"PrePost"), row.names = c(NA, -5L), class = "data.frame")
post <- structure(list(Button = c(0L, 1L, 1L, 1L, 1L), Intensity = c(30L,
30L, 30L, 30L, 30L), Acc = c(0L, 1L, 1L, 1L, 1L), Intensity = c(0,
13.5, 15, 6, 15), RT = c(0L, 0L, 0L, 0L, 0L), Time = c(83325.87,
83362.65, 83376.68, 83392.27, 83398.77), tdelta = c(0, 36.782,
14.027, 15.585, 6.507), SubjectID = c(1531L, 1531L, 1531L, 1531L,
1531L), CoupleID = c(153L, 153L, 153L, 153L, 153L), PrePost = c("Post",
"Post", "Post", "Post", "Post")), .Names = c("Button", "Intensity",
"Acc", "Intensity", "RT", "Time", "tdelta", "SubjectID", "CoupleID",
"PrePost"), row.names = c(NA, -5L), class = "data.frame")

Find out the item first time shows in a data set

I have a data set ProductTable, and for each ProductsFamily I want to return the date it was ordered for the first time and for the very last time. Example:
ProductTable
OrderPostingYear OrderPostingMonth OrderPostingDate ProductsFamily Sales QTY
2008 1 20 R1 5234 1
2008 1 12 R2 223 2
2009 1 30 R3 34 1
2008 2 1 R1 1634 3
2010 4 23 R3 224 1
2009 3 20 R1 5234 1
2010 7 12 R2 223 2
The desired result is as follows:
OrderTime
ProductsFamily OrderStart OrderEnd SumSales
R1 2008/1/20 2009/3/20 12102
R2 2008/1/12 2010/7/12 446
R3 2009/1/30 2010/4/23 258
I have no idea how to do it. Any suggestions?
ProductTable <- structure(list(OrderPostingYear = c(2008L, 2008L, 2009L, 2008L,
2010L, 2009L, 2010L), OrderPostingMonth = c(1L, 1L, 1L, 2L, 4L,
3L, 7L), OrderPostingDate = c(20L, 12L, 30L, 1L, 23L, 20L, 12L
), ProductsFamily = structure(c(1L, 2L, 3L, 1L, 3L, 1L, 2L), .Label = c("R1",
"R2", "R3"), class = "factor"), Sales = c(5234L, 223L, 34L, 1634L,
224L, 5234L, 223L), QTY = c(1L, 2L, 1L, 3L, 1L, 1L, 2L)), .Names = c("OrderPostingYear",
"OrderPostingMonth", "OrderPostingDate", "ProductsFamily", "Sales",
"QTY"), class = "data.frame", row.names = c(NA, -7L))
We can also use dplyr/tidyr to do this. We arrange the rows by family and date, concatenate the year, month and day columns into a single 'Date' column with unite, group by 'ProductsFamily', and take the first and last 'Date' and the sum of 'Sales' within summarise.
library(dplyr)
library(tidyr)
ProductTable %>%
arrange(ProductsFamily, OrderPostingYear, OrderPostingMonth, OrderPostingDate) %>%
unite(Date,OrderPostingYear:OrderPostingDate, sep='/') %>%
group_by(ProductsFamily) %>%
summarise(OrderStart=first(Date), OrderEnd=last(Date), SumSales=sum(Sales))
# Source: local data frame [3 x 4]
# ProductsFamily OrderStart OrderEnd SumSales
# (fctr) (chr) (chr) (int)
# 1 R1 2008/1/20 2009/3/20 12102
# 2 R2 2008/1/12 2010/7/12 446
# 3 R3 2009/1/30 2010/4/23 258
You can first build the date in a new column, and then aggregate your data using the data.table package (taking the first and last date per ProductsFamily, as well as the sum of sales):
library(data.table)
# First build up the date
ProductTable$date = with(ProductTable,
as.Date(paste(OrderPostingYear,
OrderPostingMonth,
OrderPostingDate, sep = "." ),
format = "%Y.%m.%d"))
# In a second step, aggregate your data
setDT(ProductTable)[,list(OrderStart = sort(date)[1],
OrderEnd = sort(date)[.N],
SumSales = sum(Sales))
,ProductsFamily]
# ProductsFamily OrderStart OrderEnd SumSales
#1: R1 2008-01-20 2009-03-20 12102
#2: R2 2008-01-12 2010-07-12 446
#3: R3 2009-01-30 2010-04-23 258
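The same result can be sketched in base R without extra packages: build a real Date column, split by family, and take min(), max() and sum() per group. The data are retyped from the question; `by_family` is an illustrative name, and `OrderTime` follows the result name used in the question.

```r
# Data retyped from the table in the question
ProductTable <- data.frame(
  OrderPostingYear  = c(2008, 2008, 2009, 2008, 2010, 2009, 2010),
  OrderPostingMonth = c(1, 1, 1, 2, 4, 3, 7),
  OrderPostingDate  = c(20, 12, 30, 1, 23, 20, 12),
  ProductsFamily    = c("R1", "R2", "R3", "R1", "R3", "R1", "R2"),
  Sales             = c(5234, 223, 34, 1634, 224, 5234, 223),
  stringsAsFactors  = FALSE
)

# Combine year, month and day into a Date so min/max compare correctly
ProductTable$date <- as.Date(
  paste(ProductTable$OrderPostingYear,
        ProductTable$OrderPostingMonth,
        ProductTable$OrderPostingDate, sep = "-"),
  format = "%Y-%m-%d"
)

# One summary row per family, then bind the rows together
by_family <- split(ProductTable, ProductTable$ProductsFamily)
OrderTime <- do.call(rbind, lapply(by_family, function(d) {
  data.frame(
    ProductsFamily = d$ProductsFamily[1],
    OrderStart = min(d$date),
    OrderEnd = max(d$date),
    SumSales = sum(d$Sales)
  )
}))
rownames(OrderTime) <- NULL
```

Working with Date values rather than strings means the first and last order do not depend on the rows being sorted beforehand.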

R: Assign colors to values/color gradient palette

I have a sample dataframe which looks like this:
reg1 <- structure(list(REGION = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("REG1", "REG2"), class = "factor"),STARTYEAR = c(1959L, 1960L, 1961L, 1962L, 1963L, 1964L, 1965L, 1966L, 1967L, 1945L, 1946L, 1947L, 1948L, 1949L), ENDYEAR = c(1960L, 1961L, 1962L, 1963L, 1964L, 1965L, 1966L, 1967L, 1968L, 1946L, 1947L, 1948L, 1949L, 1950L), Y_START = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 2L, 2L, 2L, 2L), Y_END = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), COLOR_VALUE = c(-969L, -712L, -574L, -312L, -12L, 1L, 0L, -782L, -999L, -100L, 23L, 45L, NA, 999L)), .Names = c("REGION", "STARTYEAR", "ENDYEAR", "Y_START", "Y_END", "COLOR_VALUE"), class = "data.frame", row.names = c(NA, -14L))
REGION STARTYEAR ENDYEAR Y_START Y_END COLOR_VALUE
1 REG1 1959 1960 0 1 -969
2 REG1 1960 1961 0 1 -712
3 REG1 1961 1962 0 1 -574
4 REG1 1962 1963 0 1 -312
5 REG1 1963 1964 0 1 -12
6 REG1 1964 1965 0 1 1
7 REG1 1965 1966 0 1 0
8 REG1 1966 1967 0 1 -782
9 REG1 1967 1968 0 1 -999
10 REG2 1945 1946 2 3 -100
11 REG2 1946 1947 2 3 23
12 REG2 1947 1948 2 3 45
13 REG2 1948 1949 2 3 NA
14 REG2 1949 1950 2 3 999
I am creating a plot with the rect() function which works fine.
xx = unlist(reg1[, c(2, 3)])
yy = unlist(reg1[, c(4, 5)])
png(width=1679, height=1165, res=150)
if(any(xx < 1946)) {my_x_lim <- c(min(xx), 2014)} else {my_x_lim <- c(1946, 2014)}
plot(xx, yy, type='n', xlim = my_x_lim)
apply(reg1, 1, function(y)
rect(y[2], y[4], y[3], y[5]))
dev.off()
In my reg1 data I have a 6th column which contains values between +1000 and -1000. What I was wondering is if there is a method that I could colour the rectangles in my plot according to my color values. Low values should be blue, values around 0 should result in white and high values in red (if no value is present or NA, then grey should be plotted).
My question: How could I create a color palette that ranges from values 1000 to -1000 (from red over white to blue) and apply it to my plot so that each rectangle gets coloured according to the color value?
Here is how you get a color ramp and match it to the data frame.
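# my.colors is a function that interpolates n colors between blue, white and red
my.colors <- colorRampPalette(c("blue", "white", "red"))
# Generate 2001 colors, one for each integer value from -1000 to 1000
color.df <- data.frame(COLOR_VALUE = seq(-1000, 1000, 1), color.name = my.colors(2001),
                       stringsAsFactors = FALSE)
# all.x = TRUE keeps the row with COLOR_VALUE NA (its color.name is NA); note that merge() reorders rows
reg1.with.color <- merge(reg1, color.df, by = "COLOR_VALUE", all.x = TRUE)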
I can't help you with the rect() plotting, I've never used it
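An alternative that avoids merge() is to index the palette directly, which keeps the rows in their original order and makes the grey fallback for NA explicit. The following is a sketch assuming the values are integers in the range -1000 to 1000 (out-of-range values are not handled); `palette_2001`, `idx` and `fill` are illustrative names.

```r
# 2001 colors: position 1 is blue (-1000), 1001 is white (0), 2001 is red (1000)
palette_2001 <- colorRampPalette(c("blue", "white", "red"))(2001)

# A few of the COLOR_VALUE entries from the question, including the NA
color_value <- c(-1000, -969, 0, 45, NA, 1000)

# Shift values so that -1000 maps to index 1 and 1000 to index 2001
idx <- round(color_value) + 1001

# Missing values get grey; everything else gets its palette color
fill <- ifelse(is.na(idx), "grey", palette_2001[idx])
```

Since rect() accepts vectors for its coordinates and for col, a vector like `fill` computed from reg1$COLOR_VALUE could be passed as col together with the STARTYEAR, Y_START, ENDYEAR and Y_END columns in a single call.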
