I have a data frame in R that is structured like this (the starred columns are derived):

year   region     age   population_count   cumulative_count*   middle_value*
2001   Region x   0     10                 10                  50
2001   Region x   1     10                 20                  50
2001   Region x   2     10                 30                  50
2001   Region x   3     10                 40                  50
2001   Region x   4     10                 50                  50
2001   Region x   5     10                 60                  50
...
2020   Region y   1     10                 20                  50
For each year and region combination I have a discrete cumulative_count (derived from population_count by age) and a middle_value (derived from the cumulative_count).
I want to extract, for each region and year combination, the row where cumulative_count is closest to middle_value.
(In the example above this would be age 4 in Region x, where cumulative_count = 50 and middle_value = 50.)
I have tried slice from dplyr:

table %>% slice(which.min(abs(table$cumulative_count - table$middle_value)))

but this only returns the first row where there is a match, not one row per year and region combination.
group_by(year,region) doesn't return all the possible year and region combinations either.
I feel I should be looping through the data frame for all possible year and region combinations and then slicing out the rows that meet the criteria.
Any thoughts?
UPDATE
I used @Merijn van Tilborg's dplyr approach, as I needed only the first match.
Here's a screenshot of the output table (not shown); note that the variable column is the single year of age, and it is getting older over time.
I suggest using rank, as it ranks from low to high: if you rank on the absolute difference, the grouped rank is by definition 1 for the smallest difference, so you can simply filter on that value. It also lets you control how ties are handled via ties.method.
Include ties:
dat %>%
group_by(year, region) %>%
filter(rank(abs(cumulative_count - middle_value), ties.method = "min") == 1)
# # A tibble: 6 x 6
# # Groups: year, region [4]
# year region age population_count cumulative_count middle_value
# <int> <chr> <int> <int> <int> <int>
# 1 2001 Region x 2 10 30 50
# 2 2002 Region x 2 10 30 50
# 3 2001 Region y 2 10 30 50
# 4 2002 Region y 0 10 30 50
# 5 2002 Region y 1 10 30 50
# 6 2002 Region y 2 10 30 50
Show the first one only:
dat %>%
group_by(year, region) %>%
filter(rank(abs(cumulative_count - middle_value), ties.method = "first") == 1)
# # A tibble: 4 x 6
# # Groups: year, region [4]
# year region age population_count cumulative_count middle_value
# <int> <chr> <int> <int> <int> <int>
# 1 2001 Region x 2 10 30 50
# 2 2002 Region x 2 10 30 50
# 3 2001 Region y 2 10 30 50
# 4 2002 Region y 0 10 30 50
other options include: rank(x, na.last = TRUE, ties.method = c("average", "first", "last", "random", "max", "min"))
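A quick, self-contained sketch of how the two ties.method choices used above differ:

```r
# A vector with a tie for the smallest value
x <- c(3, 1, 1, 2)

rank(x, ties.method = "min")    # 4 1 1 3 -> both tied values get rank 1
rank(x, ties.method = "first")  # 4 1 2 3 -> only the first tied value gets rank 1
```

With "min", filtering on rank == 1 keeps every tied row; with "first", it keeps only the first occurrence.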
Using data.table instead of dplyr:
library(data.table)
setDT(dat) # make dat a data.table
dat[, .SD[rank(abs(cumulative_count - middle_value), ties.method = "min") == 1], by = c("year", "region")]
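On large tables, a sketch of the same idea using .I can be faster, since it computes row indices per group instead of materialising .SD (the cut-down dat below is illustrative, with only the columns the filter needs):

```r
library(data.table)

# Illustrative subset of dat with just the columns used in the filter
dat <- data.table(
  year   = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L),
  region = "Region x",
  cumulative_count = c(10L, 20L, 30L, 10L, 20L, 30L),
  middle_value = 50L
)

# .I returns global row numbers; V1 is the default name of the computed column
idx <- dat[, .I[rank(abs(cumulative_count - middle_value),
                     ties.method = "min") == 1],
           by = c("year", "region")]$V1
dat[idx]   # one row per year/region combination
```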
Data:
dat <- structure(list(year = c(2001L, 2001L, 2001L, 2002L, 2002L, 2002L,
2001L, 2001L, 2001L, 2002L, 2002L, 2002L), region = c("Region x",
"Region x", "Region x", "Region x", "Region x", "Region x", "Region y",
"Region y", "Region y", "Region y", "Region y", "Region y"),
age = c(0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L),
population_count = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L), cumulative_count = c(10L, 20L, 30L,
10L, 20L, 30L, 10L, 20L, 30L, 30L, 30L, 30L), middle_value = c(50L,
50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), class = "data.frame", row.names = c(NA,
-12L))
You could de-mean the groups and look where the difference is zero. You will probably have ties; depending on what you want, you could simply take the first one by subsetting with [1, ].
by(dat, dat[c('year', 'region')], \(x)
x[x$cumulative_count - mean(x$cumulative_count) == 0, ][1, ]) |>
do.call(what=rbind)
# year region age population_count cumulative_count middle_value
# 2 2001 Region x 1 10 20 50
# 5 2002 Region x 1 10 20 50
# 8 2001 Region y 1 10 20 50
# 10 2002 Region y 0 10 30 50
Note: R >= 4.1 used.
Data: the same `dat` as in the previous answer.
My dataset is constructed as follows:
# A tibble: 20 x 8
iso3 year Var1 Var1_imp Var2 Var2_imp Var1_type Var2_type
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 ATG 2000 NA 144 NA 277 imputed imputed
2 ATG 2001 NA 144 NA 277 imputed imputed
3 ATG 2002 NA 144 NA 277 imputed imputed
4 ATG 2003 NA 144 NA 277 imputed imputed
5 ATG 2004 NA 144 NA 277 imputed imputed
6 ATG 2005 NA 144 NA 277 imputed imputed
7 ATG 2006 NA 144 NA 277 imputed imputed
8 ATG 2007 144 144 277 277 observed observed
9 ATG 2008 45 45 NA 301 observed imputed
10 ATG 2009 NA 71.3 NA 325 imputed imputed
11 ATG 2010 NA 97.7 NA 349 imputed imputed
12 ATG 2011 NA 124 NA 373 imputed imputed
13 ATG 2012 NA 150. NA 397 imputed imputed
14 ATG 2013 NA 177. 421 421 imputed observed
15 ATG 2014 NA 203 434 434 imputed observed
16 ATG 2015 NA 229. 422 422 imputed observed
17 ATG 2016 NA 256. 424 424 imputed observed
18 ATG 2017 282 282 429 429 observed observed
19 ATG 2018 NA 282 435 435 imputed observed
20 EGY 2000 NA 38485 NA 146761 imputed imputed
I am new to R. For each country, I would like to create a line chart with the time series of Var1_imp and Var2_imp on the same chart (I have 193 countries in my database, with data from 2000 to 2018), using filled circles where data are observed and unfilled circles where they are imputed (based on Var1_type and Var2_type). Circles would be joined with solid lines if two subsequent data points are observed, and with dotted lines otherwise.
The main goal is to check country by country if the method used to impute missing data is good or bad, depending on whether there are outliers in time series.
I have tried the following:
ggplot(df, aes(x = year, y = Var1_imp, group = Var1_type)) +
  geom_point(size = 2, shape = 21) +  # shape = 21 for unfilled circles, shape = 19 for filled circles
  geom_line(linetype = "dashed")      # linetype = "dashed" for a dotted line; omit for a solid one
I am having difficulty figuring out:
1/ how to make a single chart per country per variable
2/ how to include both Var1_imp and Var2_imp on the same chart
3/ how to vary geom_point based on a condition (imputed versus observed in Var1_type)
4/ how to vary geom_line based on a condition (solid line between two subsequent observed data points, dotted otherwise).
Thank you very much for your help - I think this exercise is not easy and I would learn a lot from your inputs.
You can use the following code:
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2, Var1_type, Var2_type), values_to = "values", names_to = "variable") %>%
ggplot(aes(x=year, y=values, group=variable)) +
geom_point(size=2, shape=21) +
geom_line(linetype = "dashed") + facet_wrap(iso3~., scales = "free") +
xlab("Year") + ylab("Imp")
Better to use colour, like this:
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2, Var1_type, Var2_type), values_to = "values", names_to = "variable") %>%
ggplot(aes(x=year, y=values, colour=variable)) +
geom_point(size=2, shape=21) +
geom_line() + facet_wrap(iso3~., scales = "free") + xlab("Year") + ylab("Imp")
Update
df %>%
pivot_longer(cols = -c(sl, iso3, year, Var1, Var2),
names_to = c("group", ".value"),
names_pattern = "(.*)_(.*)") %>%
ggplot(aes(x=year, y=imp, shape = type, colour=group)) +
geom_line(aes(group = group, colour = group), size = 0.5) +
geom_point(aes(group = group, colour = group, shape = type),size=2) +
scale_shape_manual(values = c('imputed' = 21, 'observed' = 16)) +
facet_wrap(iso3~., scales = "free") + xlab("Year") + ylab("Imp")
Data
df = structure(list(sl = 1:20, iso3 = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L
), .Label = c("ATG", "EGY"), class = "factor"), year = c(2000L,
2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 2009L,
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L,
2000L), Var1 = c(NA, NA, NA, NA, NA, NA, NA, 144L, 45L, NA, NA,
NA, NA, NA, NA, NA, NA, 282L, NA, NA), Var1_imp = c(144, 144,
144, 144, 144, 144, 144, 144, 45, 71.3, 97.7, 124, 150, 177,
203, 229, 256, 282, 282, 38485), Var2 = c(NA, NA, NA, NA, NA,
NA, NA, 277L, NA, NA, NA, NA, NA, 421L, 434L, 422L, 424L, 429L,
435L, NA), Var2_imp = c(277L, 277L, 277L, 277L, 277L, 277L, 277L,
277L, 301L, 325L, 349L, 373L, 397L, 421L, 434L, 422L, 424L, 429L,
435L, 146761L), Var1_type = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L), .Label = c("imputed",
"observed"), class = "factor"), Var2_type = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 1L), .Label = c("imputed", "observed"), class = "factor")), class = "data.frame", row.names = c(NA,
-20L))
Plotting two variables at the same time in a meaningful way in a line chart is going to be a bit hard. It's easier if you use pivot_longer to create one column containing both the Var1_imp and Var2_imp values. You will then have a key column (holding "Var1_imp" or "Var2_imp") and a values column containing the values for those two. You can then plot using year as x and the new values column as y, with fill set to the key column. You'll then get two lines per country.
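The reshaping described above could be sketched like this (a tiny stand-in df with the question's column names; the real data has more columns):

```r
library(tidyr)
library(ggplot2)

# Minimal stand-in for the question's df
df <- data.frame(
  iso3 = "ATG", year = 2000:2003,
  Var1_imp = c(144, 144, 144, 144),
  Var2_imp = c(277, 277, 301, 325)
)

# One key column (Var1_imp / Var2_imp) and one values column
long <- pivot_longer(df, cols = c(Var1_imp, Var2_imp),
                     names_to = "key", values_to = "values")

# Two lines per country, one facet per country
p <- ggplot(long, aes(x = year, y = values, colour = key)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ iso3, scales = "free_y")
```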
However, looking for outliers in a line chart across 193 countries isn't a very good idea. Use

outlier_values <- boxplot.stats(airquality$Ozone)$out

to get the outliers in a column, or apply it over multiple columns with sapply. Outliers are normally defined as points more than 1.5 × IQR beyond the quartiles, so it's easy to identify which ones they are.
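The per-column idea could be sketched like this (column selection is illustrative; lapply is used instead of sapply so the result stays a list, since each column can have a different number of outliers):

```r
# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges;
# here applied over several numeric columns of the built-in airquality data
outliers_by_col <- lapply(airquality[c("Ozone", "Wind", "Temp")],
                          function(col) boxplot.stats(col)$out)

outliers_by_col$Ozone   # the flagged ozone values
```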
I have data where [1] the dependent variable is derived from a controlled variable and an independent variable, and [2] the independent variable itself. The mean and SD are taken from [1].
(a) and this is the result of SD:
Year Species Pop_Index
1 1994 Corn Bunting 2.082483
5 1998 Corn Bunting 2.048155
10 2004 Corn Bunting 2.061617
15 2009 Corn Bunting 2.497792
20 1994 Goldfinch 1.961236
25 1999 Goldfinch 1.995600
30 2005 Goldfinch 2.101403
35 2010 Goldfinch 2.138496
40 1995 Grey Partridge 2.162136
(b) And the result of mean:
Year Species Pop_Index
1 1994 Corn Bunting 2.821668
5 1998 Corn Bunting 2.916975
10 2004 Corn Bunting 2.662797
15 2009 Corn Bunting 4.171538
20 1994 Goldfinch 3.226108
25 1999 Goldfinch 2.452807
30 2005 Goldfinch 2.954816
35 2010 Goldfinch 3.386772
40 1995 Grey Partridge 2.207708
(c) This is the Code for SD:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.0824833420524, 2.04815530904537,
2.06161673349657, 2.49779159320587, 1.96123572400404, 1.99559986715288,
2.10140285528351, 2.13849611018009, 2.1621364896722)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(d) This is the code for mean:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.82166841455814, 2.91697463618566,
2.66279663056763, 4.17153795031277, 3.22610845074252, 2.45280743991572,
2.95481600904799, 3.38677188055508, 2.20770835158744)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(e) And this is the code used to take the mean of mean Pop_Index over the years:
df2 <- aggregate(Pop_Index ~ Year, df1, mean)
(f) And this is the result:
Year Pop_Index
1 1994 3.023888
2 1995 2.207708
3 1998 2.916975
4 1999 2.452807
5 2004 2.662797
6 2005 2.954816
7 2009 4.171538
8 2010 3.386772
Now it wouldn't make sense for me to take the average of the SDs by repeating the same procedure with mean or sd.
I have looked online and found someone in a similar predicament with this data:
Month: January
Week 1 Mean: 67.3 Std. Dev: 0.8
Week 2 Mean: 80.5 Std. Dev: 0.6
Week 3 Mean: 82.4 Std. Dev: 0.8
And the response:
"With equal samples size, which is what you have, the standard deviation you are looking for is:
Sqrt [ (.64 + .36 + .64) / 3 ] = 0.739369"
How would I do this in R, or is there another way of doing it? I want to plot error bars on the dataset in (f), and it would be absurd to plot the SDs from (a) against it because the vector lengths would differ.
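For what it's worth, the quoted pooled-SD calculation can be sketched in R like this (assuming equal sample sizes, as the quote does; df_sd is a hypothetical name for the SD data frame in (a)):

```r
# The three weekly SDs from the quoted example
sds <- c(0.8, 0.6, 0.8)

# With equal sample sizes, the pooled SD is the square root of the
# mean of the variances (SDs squared)
pooled_sd <- sqrt(mean(sds^2))
pooled_sd   # 0.739369, matching the quoted value

# Applied per year to the SD data frame from (a) (df_sd is a stand-in name):
# df2_sd <- aggregate(Pop_Index ~ Year, df_sd, function(x) sqrt(mean(x^2)))
```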
Sample from original data.frame with a few columns and many rows not included:
structure(list(GRIDREF = structure(c(1L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L), .Label = c("SP8816", "SP9212", "SP9322",
"SP9326", "SP9440", "SP9513", "SP9632", "SP9939", "TF7133", "TF9437"
), class = "factor"), Lat = c(51.83568688, 51.83568688, 51.79908899,
51.88880822, 51.92476157, 52.05042795, 51.80757645, 51.97818159,
52.04057068, 52.86730817, 52.89542895), Long = c(-0.724233561,
-0.724233561, -0.667258035, -0.650074995, -0.648996758, -0.630626734,
-0.62349292, -0.603710436, -0.558026241, 0.538966197, 0.882597783
), Year = c(2006L, 2007L, 1999L, 2004L, 1995L, 2009L, 2011L,
2007L, 2011L, 1996L, 2007L), Species = structure(c(4L, 7L, 5L,
10L, 4L, 6L, 8L, 3L, 2L, 9L, 1L), .Label = c("Blue Tit", "Buzzard",
"Canada Goose", "Collared Dove", "Greenfinch", "Jackdaw", "Linnet",
"Meadow Pipit", "Robin", "Willow Warbler"), class = "factor"),
Pop_Index = c(0L, 0L, 2L, 0L, 1L, 0L, 1L, 4L, 0L, 0L, 8L)), row.names = c(1L,
100L, 1000L, 2000L, 3000L, 4000L, 5000L, 6000L, 10000L, 20213L,
30213L), class = "data.frame")
A look into this data.frame:
GRIDREF Lat Long Year Species Pop_Index TempJanuary
1 SP8816 51.83569 -0.7242336 2006 Collared Dove 0 2.128387
100 SP8816 51.83569 -0.7242336 2007 Linnet 0 4.233226
1000 SP9212 51.79909 -0.6672580 1999 Greenfinch 2 5.270968
2000 SP9322 51.88881 -0.6500750 2004 Willow Warbler 0 4.826452
3000 SP9326 51.92476 -0.6489968 1995 Collared Dove 1 4.390322
4000 SP9440 52.05043 -0.6306267 2009 Jackdaw 0 2.934516
5000 SP9513 51.80758 -0.6234929 2011 Meadow Pipit 1 3.841290
6000 SP9632 51.97818 -0.6037104 2007 Canada Goose 4 7.082580
10000 SP9939 52.04057 -0.5580262 2011 Buzzard 0 3.981290
20213 TF7133 52.86731 0.5389662 1996 Robin 0 3.532903
30213 TF9437 52.89543 0.8825978 2007 Blue Tit 8 7.028710
OK, I had this problem, which has been solved: combine data depending on the value of one column.
I have been trying to adapt that solution to a more complicated problem, but I have not been able to come up with the solution: instead of 2 columns I have 3.
df <- structure(list(year = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L), group = c(1L, 1L, 1L, 1L, 2L, 2L), sales = c(20L, 25L, 23L, 30L, 50L, 55L), expenses = c(19L, 19L, 20L, 15L, 27L, 30L)), .Names = c("year", "group", "sales", "expenses"), class = "data.frame", row.names = c(NA, -6L))
year group sales expenses
1 2000 1 20 19
2 2001 1 25 19
3 2002 1 23 20
4 2003 1 30 15
5 2001 2 50 27
6 2002 2 55 30
And I need the same output as in the first problem but instead of just the sales I also need to include the expenses in the json file
[{"group": 1, "sales":[[2000,20],[2001, 25], [2002,23], [2003, 30]], "expenses":[[2000,19],[2001, 19], [2002,20], [2003, 15]]},
{"group": 2, "sales":[[2001, 50], [2002,55]], "expenses":[[2001, 27], [2002,30]]}]
library(data.table)
library(jsonlite)  # for toJSON

toJSON(setDT(df)[, list(sales    = paste0('[', toString(sprintf('[%d,%d]', year, sales)), ']'),
                        expenses = paste0('[', toString(sprintf('[%d,%d]', year, expenses)), ']')),
                 by = group])
Try this. It's not different from akrun's answer to combine data depending on the value of one column.
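As a sketch of an alternative, jsonlite can emit genuinely nested arrays (rather than pre-formatted strings) if you build a nested list per group first; the df constructed inline matches the question's data:

```r
library(jsonlite)

# The question's data, built inline so the sketch is self-contained
df <- data.frame(year     = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L),
                 group    = c(1L, 1L, 1L, 1L, 2L, 2L),
                 sales    = c(20L, 25L, 23L, 30L, 50L, 55L),
                 expenses = c(19L, 19L, 20L, 15L, 27L, 30L))

# One nested list per group: each [year, value] pair becomes a JSON array
out <- lapply(split(df, df$group), function(g) {
  list(group    = g$group[1],
       sales    = unname(Map(c, g$year, g$sales)),
       expenses = unname(Map(c, g$year, g$expenses)))
})

toJSON(unname(out), auto_unbox = TRUE)
```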
I have a similar problem. My database looks like this:
Province cases year month
Newyork 10 2000 1
Newyork 20 2000 2
Newyork 30 2000 3
Newyork 40 2000 4
Los Angeles 30 2000 1
Los Angeles 40 2000 2
Los Angeles 50 2000 3
Los Angeles 60 2000 4
It is a very large dataset, covering 20 years and many provinces. How can I reshape my data to get a time sequence like this:
Province cases.at.1.2000 cases.at.2.2000 cases.at.3.2000 cases.at.4.2000
Newyork 10 20 30 40
Los Angeles 30 40 50 60
Just use dcast from reshape2 package:
library(reshape2)
dcast(df, Province~month+year, value.var='cases')
# Province 1_2000 2_2000 3_2000 4_2000
#1 LosAngeles 30 40 50 60
#2 Newyork 10 20 30 40
Data:
df=structure(list(Province = c("Newyork", "Newyork", "Newyork",
"Newyork", "LosAngeles", "LosAngeles", "LosAngeles", "LosAngeles"
), cases = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L), year = c(2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L), month = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L)), .Names = c("Province", "cases",
"year", "month"), class = "data.frame", row.names = c(NA, -8L
))
Edit: if you have missing month/province, you can still use dcast:
# Province cases year month
#1 Newyork 10 2000 1
#2 Newyork 20 2000 2
#3 Newyork 30 2000 3
#4 Newyork 40 2000 4
#5 LosAngeles 30 2000 1
#6 LosAngeles 40 2000 2
#7 LosAngeles 50 2000 3
#8 LosAngeles 60 2000 4
#9 Newyork 99 2000 5
#10 SanDiego 99 2000 5
dcast(df, Province~month+year, value.var='cases')
# Province 1_2000 2_2000 3_2000 4_2000 5_2000
#1 LosAngeles 30 40 50 60 NA
#2 Newyork 10 20 30 40 99
#3 SanDiego NA NA NA NA 99
We can use reshape from base R after joining the 'month' and 'year' columns (paste(...))
reshape(
transform(df1, yearmonth=paste('at', month, year, sep="."))[,-(3:4)],
idvar='Province', timevar='yearmonth', direction='wide')
# Province cases.at.1.2000 cases.at.2.2000 cases.at.3.2000 cases.at.4.2000
# 1 Newyork 10 20 30 40
# 5 Los Angeles 30 40 50 60
data
df1 <- structure(list(Province = c("Newyork", "Newyork", "Newyork",
"Newyork", "Los Angeles", "Los Angeles", "Los Angeles", "Los Angeles"
), cases = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L), year = c(2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L), month = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L)), .Names = c("Province", "cases",
"year", "month"), class = "data.frame", row.names = c(NA, -8L))
Based on @Ananda Mahto's suggestion:
library(tidyr); library(dplyr)
df %>% mutate(month = paste0("cases.at.", month)) %>%
unite(key, month, year, sep=".") %>% spread(key, cases)
If you have missing month - year for some Province, use expand:
df %>% expand(Province, year, month) %>% left_join(df) %>%
mutate(month = paste0("cases.at.", month)) %>%
unite(key, month, year, sep=".") %>% spread(key, cases)
Data:
df=structure(list(Province = c("Newyork", "Newyork", "Newyork",
"Newyork", "LosAngeles", "LosAngeles", "LosAngeles", "LosAngeles", "SanDiego"),
cases = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L, 90L), year = c(2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L), month = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 4L)), .Names = c("Province", "cases",
"year", "month"), class = "data.frame", row.names = c(NA, -9L))