I have a data frame in R which is structured like this:

year   region     age  population_count  cumulative_count*  middle_value*
2001   Region x     0                10                 10             50
2001   Region x     1                10                 20             50
2001   Region x     2                10                 30             50
2001   Region x     3                10                 40             50
2001   Region x     4                10                 50             50
2001   Region x     5                10                 60             50
...
2020   Region y     1                10                 20             50
For each year and region combination I have a discrete cumulative_count (derived from population_count by age) and a middle_value (derived from the cumulative_count), again discrete for each year and region combination.
I want to extract, for each region and year combination, the row where the cumulative_count is closest to the middle_value (in the example above this would be age 4 in Region x, where cumulative_count = 50 and middle_value = 50).
I have tried slice from dplyr:
slice(which.min(abs(table$cumulative_count - table$middle_value)))
but this only returns the first matching row overall, not one for each year and region combination.
group_by(year,region) doesn't return all the possible year and region combinations either.
I feel I should be looping through the data frame for all possible year and region combinations and then slicing out the rows that meet the criteria.
Any thoughts?
UPDATE
I used @Merijn van Tilborg's dplyr approach, as I needed only the first match.
Here's a screenshot of the output table; note that the variable column is the single year of age, increasing with age.
I suggest using rank, as it ranks from low to high: if you rank on the absolute difference, the grouped rank is by definition 1 for the smallest difference, so you can simply filter on that value. It also lets you control how ties are handled via ties.method.
Include ties:
dat %>%
group_by(year, region) %>%
filter(rank(abs(cumulative_count - middle_value), ties.method = "min") == 1)
# # A tibble: 6 x 6
# # Groups: year, region [4]
# year region age population_count cumulative_count middle_value
# <int> <chr> <int> <int> <int> <int>
# 1 2001 Region x 2 10 30 50
# 2 2002 Region x 2 10 30 50
# 3 2001 Region y 2 10 30 50
# 4 2002 Region y 0 10 30 50
# 5 2002 Region y 1 10 30 50
# 6 2002 Region y 2 10 30 50
Show the first one only:
dat %>%
group_by(year, region) %>%
filter(rank(abs(cumulative_count - middle_value), ties.method = "first") == 1)
# # A tibble: 4 x 6
# # Groups: year, region [4]
# year region age population_count cumulative_count middle_value
# <int> <chr> <int> <int> <int> <int>
# 1 2001 Region x 2 10 30 50
# 2 2002 Region x 2 10 30 50
# 3 2001 Region y 2 10 30 50
# 4 2002 Region y 0 10 30 50
other options include: rank(x, na.last = TRUE, ties.method = c("average", "first", "last", "random", "max", "min"))
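In recent dplyr (>= 1.0.0), slice_min() expresses the same idea more directly; a minimal sketch with toy data mirroring the question's structure (values assumed):

```r
library(dplyr)

# Toy data in the shape of the question (values assumed)
dat <- tibble(
  year = rep(c(2001L, 2002L), each = 3),
  region = "Region x",
  age = rep(0:2, 2),
  cumulative_count = rep(c(10L, 20L, 30L), 2),
  middle_value = 50L
)

# slice_min() keeps the row(s) with the smallest value per group;
# with_ties = TRUE keeps all ties, with_ties = FALSE keeps only the first.
dat %>%
  group_by(year, region) %>%
  slice_min(abs(cumulative_count - middle_value), with_ties = TRUE) %>%
  ungroup()
```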
using data.table instead of dplyr
library(data.table)
setDT(dat) # make dat a data.table
dat[, .SD[rank(abs(cumulative_count - middle_value), ties.method = "min") == 1], by = c("year", "region")]
data
dat <- structure(list(year = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L,
2001L, 2001L, 2001L, 2002L, 2002L, 2002L), region = c("Region x",
"Region x", "Region x", "Region x", "Region x", "Region x", "Region y",
"Region y", "Region y", "Region y", "Region y", "Region y"),
age = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L),
population_count = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L), cumulative_count = c(10L, 20L, 30L,
10L, 20L, 30L, 10L, 20L, 30L, 30L, 30L, 30L), middle_value = c(50L,
50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), class = "data.frame", row.names = c(NA,
-12L))
You could de-mean the groups and look where the difference is zero. You will probably have ties; depending on what you want, you could simply take the first one by subsetting with [1, ].
by(dat, dat[c('year', 'region')], \(x)
x[x$cumulative_count - mean(x$cumulative_count) == 0, ][1, ]) |>
do.call(what=rbind)
# year region age population_count cumulative_count middle_value
# 2 2001 Region x 1 10 20 50
# 5 2002 Region x 1 10 20 50
# 8 2001 Region y 1 10 20 50
# 10 2002 Region y 0 10 30 50
Note: R >= 4.1 used.
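If no row has a difference of exactly zero, the same by() pattern can pick the closest row per group with which.min (a sketch using an abridged version of dat):

```r
# Abridged version of dat from the answer
dat <- data.frame(
  year = rep(c(2001L, 2002L), each = 3),
  region = "Region x",
  age = rep(0:2, 2),
  population_count = 10L,
  cumulative_count = rep(c(10L, 20L, 30L), 2),
  middle_value = 50L
)

# Pick the row with the smallest absolute difference per group,
# rather than requiring an exact zero (still R >= 4.1 for \() and |>).
res <- by(dat, dat[c('year', 'region')], \(x)
  x[which.min(abs(x$cumulative_count - x$middle_value)), ]) |>
  do.call(what = rbind)
res
```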
Data: same dat as in the previous answer.
Given a dataframe as follows:
df <- structure(list(year = c(2001L, 2001L, 2001L, 2001L, 2002L, 2002L,
2002L, 2002L, 2003L, 2003L, 2003L, 2003L), quater = c(1L, 2L,
3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), value = c(4L, 23L, 14L,
12L, 6L, 22L, 45L, 12L, 34L, 15L, 3L, 40L)), class = "data.frame", row.names = c(NA,
-12L))
Out:
year quater value
0 2001 1 4
1 2001 2 23
2 2001 3 14
3 2001 4 12
4 2002 1 6
5 2002 2 22
6 2002 3 45
7 2002 4 12
8 2003 1 34
9 2003 2 15
10 2003 3 3
11 2003 4 40
How could I plot a chart similar to the plot below:
Please note the year and quater in this dataset correspond to the year and week in the plot above.
I need to first cut the value column by (0, 10], (10, 20], (20, 30], (30, 40], (40, 50] then plot them.
The code I have tried:
ggplot(df, aes(week, year, fill= value)) +
geom_tile() +
scale_fill_gradient(low="white", high="red")
Out:
As you can see, the legend is different to what I need.
Thanks for your help.
You should first use cut to get the classes (as Ronak Shah already mentioned) and then you can use scale_fill_brewer to change the color of the tiles.
library(tidyverse)
df %>%
mutate(class = cut(value, seq(0, 50, 10))) %>%
ggplot(aes(quater, year, fill = class) ) +
geom_tile() +
scale_fill_brewer(type = "seq",
direction = 1,
palette = "RdPu")
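For reference, this is roughly what cut() produces on the sample values, which is where the legend classes come from:

```r
# cut() bins each value into one of five right-closed intervals
value <- c(4, 23, 14, 12, 6, 22, 45, 12, 34, 15, 3, 40)
cls <- cut(value, breaks = seq(0, 50, 10))
levels(cls)
# [1] "(0,10]"  "(10,20]" "(20,30]" "(30,40]" "(40,50]"
table(cls)
```

If some intervals have no observations, adding drop = FALSE to the fill scale should keep them in the legend anyway.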
I have data where the dependent variable [1] is measured against a controlled independent variable [2]. The mean and SD are taken from [1].
(a) and this is the result of SD:
Year Species Pop_Index
1 1994 Corn Bunting 2.082483
5 1998 Corn Bunting 2.048155
10 2004 Corn Bunting 2.061617
15 2009 Corn Bunting 2.497792
20 1994 Goldfinch 1.961236
25 1999 Goldfinch 1.995600
30 2005 Goldfinch 2.101403
35 2010 Goldfinch 2.138496
40 1995 Grey Partridge 2.162136
(b) And the result of mean:
Year Species Pop_Index
1 1994 Corn Bunting 2.821668
5 1998 Corn Bunting 2.916975
10 2004 Corn Bunting 2.662797
15 2009 Corn Bunting 4.171538
20 1994 Goldfinch 3.226108
25 1999 Goldfinch 2.452807
30 2005 Goldfinch 2.954816
35 2010 Goldfinch 3.386772
40 1995 Grey Partridge 2.207708
(c) This is the Code for SD:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.0824833420524, 2.04815530904537,
2.06161673349657, 2.49779159320587, 1.96123572400404, 1.99559986715288,
2.10140285528351, 2.13849611018009, 2.1621364896722)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(d) This is the code for mean:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.82166841455814, 2.91697463618566,
2.66279663056763, 4.17153795031277, 3.22610845074252, 2.45280743991572,
2.95481600904799, 3.38677188055508, 2.20770835158744)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(e) And this is the code used to take the mean of mean Pop_Index over the years:
df2 <- aggregate(Pop_Index ~ Year, df1, mean)
(f) And this is the result:
Year Pop_Index
1 1994 3.023888
2 1995 2.207708
3 1998 2.916975
4 1999 2.452807
5 2004 2.662797
6 2005 2.954816
7 2009 4.171538
8 2010 3.386772
Now it wouldn't make sense for me to average the SDs by the same procedure as before, simply applying mean or sd to them.
I have looked online and found someone in a similar predicament with this data:
Month: January
Week 1 Mean: 67.3 Std. Dev: 0.8
Week 2 Mean: 80.5 Std. Dev: 0.6
Week 3 Mean: 82.4 Std. Dev: 0.8
And the response:
"With equal samples size, which is what you have, the standard deviation you are looking for is:
Sqrt [ (.64 + .36 + .64) / 3 ] = 0.739369"
How would I do this in R, or is there another way of doing it? I want to plot error bars, and the dataset plotted is like that of (f); it would be absurd to plot the SD of (a) against it because the vector lengths differ.
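The quoted formula is the root-mean-square of the SDs, which (for equal group sizes) can be computed directly in R. A sketch: the January example first, then the per-Year application via aggregate, where df_sd is an assumed name for the SD data frame from (a)/(c):

```r
# Pooled SD for equal sample sizes = sqrt of the mean variance
sds <- c(0.8, 0.6, 0.8)
sqrt(mean(sds^2))
# [1] 0.7393691  (matches the quoted 0.739369)

# Applied per Year to the SD frame from (a)/(c) (df_sd is assumed):
# aggregate(Pop_Index ~ Year, df_sd, FUN = function(s) sqrt(mean(s^2)))
```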
A sample from the original data.frame (a few columns and many rows not included):
structure(list(GRIDREF = structure(c(1L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L), .Label = c("SP8816", "SP9212", "SP9322",
"SP9326", "SP9440", "SP9513", "SP9632", "SP9939", "TF7133", "TF9437"
), class = "factor"), Lat = c(51.83568688, 51.83568688, 51.79908899,
51.88880822, 51.92476157, 52.05042795, 51.80757645, 51.97818159,
52.04057068, 52.86730817, 52.89542895), Long = c(-0.724233561,
-0.724233561, -0.667258035, -0.650074995, -0.648996758, -0.630626734,
-0.62349292, -0.603710436, -0.558026241, 0.538966197, 0.882597783
), Year = c(2006L, 2007L, 1999L, 2004L, 1995L, 2009L, 2011L,
2007L, 2011L, 1996L, 2007L), Species = structure(c(4L, 7L, 5L,
10L, 4L, 6L, 8L, 3L, 2L, 9L, 1L), .Label = c("Blue Tit", "Buzzard",
"Canada Goose", "Collared Dove", "Greenfinch", "Jackdaw", "Linnet",
"Meadow Pipit", "Robin", "Willow Warbler"), class = "factor"),
Pop_Index = c(0L, 0L, 2L, 0L, 1L, 0L, 1L, 4L, 0L, 0L, 8L)), row.names = c(1L,
100L, 1000L, 2000L, 3000L, 4000L, 5000L, 6000L, 10000L, 20213L,
30213L), class = "data.frame")
A look into this data.frame:
GRIDREF Lat Long Year Species Pop_Index TempJanuary
1 SP8816 51.83569 -0.7242336 2006 Collared Dove 0 2.128387
100 SP8816 51.83569 -0.7242336 2007 Linnet 0 4.233226
1000 SP9212 51.79909 -0.6672580 1999 Greenfinch 2 5.270968
2000 SP9322 51.88881 -0.6500750 2004 Willow Warbler 0 4.826452
3000 SP9326 51.92476 -0.6489968 1995 Collared Dove 1 4.390322
4000 SP9440 52.05043 -0.6306267 2009 Jackdaw 0 2.934516
5000 SP9513 51.80758 -0.6234929 2011 Meadow Pipit 1 3.841290
6000 SP9632 51.97818 -0.6037104 2007 Canada Goose 4 7.082580
10000 SP9939 52.04057 -0.5580262 2011 Buzzard 0 3.981290
20213 TF7133 52.86731 0.5389662 1996 Robin 0 3.532903
30213 TF9437 52.89543 0.8825978 2007 Blue Tit 8 7.028710
I have a data frame that contains the highest and lowest temperature in a given year by Climate Station - All.Stations dataset:
Station.Name Year Month Day TMAX TMIN
GRAND MARAIS 1942 7 28 82 60
GRAND MARAIS 1962 3 17 42 22
LEECH LAKE 1956 7 3 72 50
ALBERT LEA 3 SE 1998 1 25 25 15
TWO HARBORS 1933 5 20 77 42
ARGYLE 1922 9 13 NA NA
I also have a data frame of complete years by Climate Station (i.e., these are the years where I have data for every day in the year) - complete.years dataset:
Station.Name Year
DULUTH 1904
AGASSIZ REFUGE 1995
LEECH LAKE 1956
GRAND MARAIS 1942
LEECH LAKE 1994
I want to filter the first data frame to only the data where Station Name and Year exist and match in the second data frame.
The correct results would be:
Station.Name Year TMAX
GRAND MARAIS 1942 82
LEECH LAKE 1956 72
Here's what I've got so far, using dplyr:
Max.Tempurature <- All_Stations %>%
group_by(Station.Name, Year) %>%
select(Station.Name, Year, TMAX) %>%
filter(min_rank(desc(TMAX)) <= 1) %>%
filter((Year %in% complete.years$Year & Station.Name %in% complete.years$Station.Name))
I can filter by both Year and Station.Name, but that searches the whole data frame for matches.
How do I filter by Station.Name and Year existing in the same observation?
We can do an inner_join
library(dplyr)
inner_join(All.Stations[c(1, 2, 5)], complete.years)
# Station.Name Year TMAX
#1 GRAND MARAIS 1942 82
#2 LEECH LAKE 1956 72
data
All.Stations <- structure(list(Station.Name = c("GRAND MARAIS", "GRAND MARAIS",
"LEECH LAKE", "ALBERT LEA 3 SE", "TWO HARBORS", "ARGYLE"), Year = c(1942L,
1962L, 1956L, 1998L, 1933L, 1922L), Month = c(7L, 3L, 7L, 1L,
5L, 9L), Day = c(28L, 17L, 3L, 25L, 20L, 13L), TMAX = c(82L,
42L, 72L, 25L, 77L, NA), TMIN = c(60L, 22L, 50L, 15L, 42L, NA
)), class = "data.frame", row.names = c(NA, -6L))
complete.years <- structure(list(Station.Name = c("DULUTH",
"AGASSIZ REFUGE", "LEECH LAKE",
"GRAND MARAIS", "LEECH LAKE"), Year = c(1904L, 1995L, 1956L,
1942L, 1994L)), class = "data.frame", row.names = c(NA, -5L))
Or with merge
cols <- c('Station.Name', 'Year', 'TMAX')
merge(All.Stations[cols], complete.years, all.x = FALSE)
# Station.Name Year TMAX
#1 GRAND MARAIS 1942 82
#2 LEECH LAKE 1956 72
data: same All.Stations and complete.years as above.
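If the goal is purely to filter All.Stations without pulling in columns from the lookup table, dplyr's semi_join is another option (sketched here with abridged data):

```r
library(dplyr)

# Abridged versions of the question's data frames
All.Stations <- data.frame(
  Station.Name = c("GRAND MARAIS", "GRAND MARAIS", "LEECH LAKE", "TWO HARBORS"),
  Year = c(1942L, 1962L, 1956L, 1933L),
  TMAX = c(82L, 42L, 72L, 77L)
)
complete.years <- data.frame(
  Station.Name = c("DULUTH", "LEECH LAKE", "GRAND MARAIS"),
  Year = c(1904L, 1956L, 1942L)
)

# semi_join() keeps rows of All.Stations that have a match in
# complete.years on both keys; no columns are added from the second table.
semi_join(All.Stations, complete.years, by = c("Station.Name", "Year"))
```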
OK, I had this problem solved earlier:
combine data in depending on the value of one column
I have been trying to adapt that solution to a more complicated problem, but I have not been able to come up with the solution: instead of 2 columns I have 3.
df <- structure(list(year = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L), group = c(1L, 1L, 1L, 1L, 2L, 2L), sales = c(20L, 25L, 23L, 30L, 50L, 55L), expenses = c(19L, 19L, 20L, 15L, 27L, 30L)), .Names = c("year", "group", "sales", "expenses"), class = "data.frame", row.names = c(NA, -6L))
year group sales expenses
1 2000 1 20 19
2 2001 1 25 19
3 2002 1 23 20
4 2003 1 30 15
5 2001 2 50 27
6 2002 2 55 30
And I need the same output as in the first problem but instead of just the sales I also need to include the expenses in the json file
[{"group": 1, "sales":[[2000,20],[2001, 25], [2002,23], [2003, 30]], "expenses":[[2000,19],[2001, 19], [2002,20], [2003, 15]]},
{"group": 2, "sales":[[2001, 50], [2002,55]], "expenses":[[2001, 27], [2002,30]]}]
library(data.table)
library(jsonlite)
toJSON(setDT(df)[, list(sales = paste0('[', toString(sprintf('[%d,%d]', year, sales)), ']'),
                        expenses = paste0('[', toString(sprintf('[%d,%d]', year, expenses)), ']')),
                 by = group])
Try this. It's not different from akrun's answer to combine data in depending on the value of one column.
I am trying to get more control over the text that appears when using add_tooltip in ggvis.
Say I want to plot 'totalinns' against 'avg' for this dataframe. Color points by 'country'.
The text I want to appear in the hovering tooltip would be: 'player', 'country', 'debutyear' 'avg'
tmp:
# player totalruns totalinns totalno totalout avg debutyear country
# 1 AG Ganteaume 112 1 0 1 112.00000 1948 WI
# 2 DG Bradman 6996 80 10 70 99.94286 1928 Aus
# 3 MN Nawaz 99 2 1 1 99.00000 2002 SL
# 4 VH Stollmeyer 96 1 0 1 96.00000 1939 WI
# 5 DM Lewis 259 5 2 3 86.33333 1971 WI
# 6 Abul Hasan 165 5 3 2 82.50000 2012 Ban
# 7 RE Redmond 163 2 0 2 81.50000 1973 NZ
# 8 BA Richards 508 7 0 7 72.57143 1970 SA
# 9 H Wood 204 4 1 3 68.00000 1888 Eng
# 10 JC Buttler 200 3 0 3 66.66667 2014 Eng
I understand that I need to make a key/id variable, as ggvis only takes the information supplied to it, so I need to refer back to the original data. I have tried changing the text inside my paste0() call, but still can't get it right.
tmp$id <- 1:nrow(tmp)
all_values <- function(x) {
if(is.null(x)) return(NULL)
row <- tmp[tmp$id == x$id, ]
paste0(tmp$player, tmp$country, tmp$debutyear,
tmp$avg, format(row), collapse = "<br />")
}
tmp %>% ggvis(x = ~totalinns, y = ~avg, key := ~id) %>%
layer_points(fill = ~factor(country)) %>%
add_tooltip(all_values, "hover")
Find below code to reproduce example:
tmp <- structure(list(player = c("AG Ganteaume", "DG Bradman", "MN Nawaz",
"VH Stollmeyer", "DM Lewis", "Abul Hasan", "RE Redmond", "BA Richards",
"H Wood", "JC Buttler"), totalruns = c(112L, 6996L, 99L, 96L,
259L, 165L, 163L, 508L, 204L, 200L), totalinns = c(1L, 80L, 2L,
1L, 5L, 5L, 2L, 7L, 4L, 3L), totalno = c(0L, 10L, 1L, 0L, 2L,
3L, 0L, 0L, 1L, 0L), totalout = c(1L, 70L, 1L, 1L, 3L, 2L, 2L,
7L, 3L, 3L), avg = c(112, 99.9428571428571, 99, 96, 86.3333333333333,
82.5, 81.5, 72.5714285714286, 68, 66.6666666666667), debutyear = c(1948L,
1928L, 2002L, 1939L, 1971L, 2012L, 1973L, 1970L, 1888L, 2014L
), country = c("WI", "Aus", "SL", "WI", "WI", "Ban", "NZ", "SA",
"Eng", "Eng")), .Names = c("player", "totalruns", "totalinns",
"totalno", "totalout", "avg", "debutyear", "country"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -10L))
I think this is closer:
all_values <- function(x) {
  if (is.null(x)) return(NULL)
  # look up the hovered row by its id, then format its fields
  row <- tmp[tmp$id == x$id, ]
  paste(row$player, row$country, row$debutyear, row$avg, sep = "<br>")
}