Reshape data frame from long to wide format - r

I have a data frame like this:
Province cases year month
Newyork 10 2000 1
Newyork 20 2000 2
Newyork 30 2000 3
Newyork 40 2000 4
Los Angeles 30 2000 1
Los Angeles 40 2000 2
Los Angeles 50 2000 3
Los Angeles 60 2000 4
The real data set is very large, covering 20 years and many provinces. How can I reshape it so that each province has one row with its cases in time order, like this:
Province cases.at.1.2000 cases.at.2.2000 cases.at.3.2000 cases.at.4.2000
Newyork 10 20 30 40
Los Angeles 30 40 50 60

Just use dcast from the reshape2 package:
library(reshape2)
dcast(df, Province~month+year, value.var='cases')
# Province 1_2000 2_2000 3_2000 4_2000
#1 LosAngeles 30 40 50 60
#2 Newyork 10 20 30 40
Data:
df=structure(list(Province = c("Newyork", "Newyork", "Newyork",
"Newyork", "LosAngeles", "LosAngeles", "LosAngeles", "LosAngeles"
), cases = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L), year = c(2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L), month = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L)), .Names = c("Province", "cases",
"year", "month"), class = "data.frame", row.names = c(NA, -8L
))
Edit: if some month/province combinations are missing, you can still use dcast; the gaps are filled with NA:
# Province cases year month
#1 Newyork 10 2000 1
#2 Newyork 20 2000 2
#3 Newyork 30 2000 3
#4 Newyork 40 2000 4
#5 LosAngeles 30 2000 1
#6 LosAngeles 40 2000 2
#7 LosAngeles 50 2000 3
#8 LosAngeles 60 2000 4
#9 Newyork 99 2000 5
#10 SanDiego 99 2000 5
dcast(df, Province~month+year, value.var='cases')
# Province 1_2000 2_2000 3_2000 4_2000 5_2000
#1 LosAngeles 30 40 50 60 NA
#2 Newyork 10 20 30 40 99
#3 SanDiego NA NA NA NA 99

We can use reshape from base R after joining the 'month' and 'year' columns (paste(...))
reshape(
transform(df1, yearmonth=paste('at', month, year, sep="."))[,-(3:4)],
idvar='Province', timevar='yearmonth', direction='wide')
# Province cases.at.1.2000 cases.at.2.2000 cases.at.3.2000 cases.at.4.2000
# 1 Newyork 10 20 30 40
# 5 Los Angeles 30 40 50 60
data
df1 <- structure(list(Province = c("Newyork", "Newyork", "Newyork",
"Newyork", "Los Angeles", "Los Angeles", "Los Angeles", "Los Angeles"
), cases = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L), year = c(2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L), month = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L)), .Names = c("Province", "cases",
"year", "month"), class = "data.frame", row.names = c(NA, -8L))

Based on @Ananda Mahto's suggestion:
library(tidyr); library(dplyr)
df %>% mutate(month = paste0("cases.at.", month)) %>%
unite(key, month, year, sep=".") %>% spread(key, cases)
If some month-year combinations are missing for a Province, use expand first:
df %>% expand(Province, year, month) %>% left_join(df) %>%
mutate(month = paste0("cases.at.", month)) %>%
unite(key, month, year, sep=".") %>% spread(key, cases)
Data:
df=structure(list(Province = c("Newyork", "Newyork", "Newyork",
"Newyork", "LosAngeles", "LosAngeles", "LosAngeles", "LosAngeles", "SanDiego"),
cases = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L, 90L), year = c(2000L,
2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L), month = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 4L)), .Names = c("Province", "cases",
"year", "month"), class = "data.frame", row.names = c(NA, -9L))
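In current tidyr releases spread() is superseded by pivot_wider(), which can build the wide column names in one step. The following is a minimal sketch, assuming tidyr 1.0.0 or later and the same df as in the Data block above (rebuilt here with data.frame() so the snippet is self-contained). Month-year combinations that are absent for a Province come out as NA.

```r
library(tidyr)

# Same values as the structure() call above, written more compactly
df <- data.frame(
  Province = c(rep("Newyork", 4), rep("LosAngeles", 4), "SanDiego"),
  cases    = c(10L, 20L, 30L, 40L, 30L, 40L, 50L, 60L, 90L),
  year     = 2000L,
  month    = c(1:4, 1:4, 4L),
  stringsAsFactors = FALSE
)

# names_from pastes month and year with names_sep,
# and names_prefix is put in front of every new column name
wide <- pivot_wider(
  df,
  id_cols      = Province,
  names_from   = c(month, year),
  values_from  = cases,
  names_prefix = "cases.at.",
  names_sep    = "."
)

print(wide)
```

The new columns appear in the order the month-year pairs first occur in df, so sorting the long data by year and month beforehand controls the column order.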


Cut values to intervals and plot a heatmap in ggplot2

Given a dataframe as follows:
df <- structure(list(year = c(2001L, 2001L, 2001L, 2001L, 2002L, 2002L,
2002L, 2002L, 2003L, 2003L, 2003L, 2003L), quater = c(1L, 2L,
3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), value = c(4L, 23L, 14L,
12L, 6L, 22L, 45L, 12L, 34L, 15L, 3L, 40L)), class = "data.frame", row.names = c(NA,
-12L))
Out:
year quater value
0 2001 1 4
1 2001 2 23
2 2001 3 14
3 2001 4 12
4 2002 1 6
5 2002 2 22
6 2002 3 45
7 2002 4 12
8 2003 1 34
9 2003 2 15
10 2003 3 3
11 2003 4 40
How could I plot a chart similar to the plot below:
Please note that year and quater in this dataset correspond to year and week in the plot above.
I first need to cut the value column into the intervals (0, 10], (10, 20], (20, 30], (30, 40], (40, 50] and then plot the result.
The code I have tried:
ggplot(df, aes(quater, year, fill= value)) +
geom_tile() +
scale_fill_gradient(low="white", high="red")
Out:
As you can see, the legend is different to what I need.
Thanks for your help.
You should first use cut to get the classes (as Ronak Shah already mentioned) and then you can use scale_fill_brewer to change the color of the tiles.
library(tidyverse)
df %>%
mutate(class = cut(value, seq(0, 50, 10))) %>%
ggplot(aes(quater, year, fill = class) ) +
geom_tile() +
scale_fill_brewer(type = "seq",
direction = 1,
palette = "RdPu")
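To see what the cut step in the answer above produces before it reaches ggplot, the classes can be inspected on their own. This is a small base R sketch using the value column from the question; value_class is just an illustrative name.

```r
value <- c(4L, 23L, 14L, 12L, 6L, 22L, 45L, 12L, 34L, 15L, 3L, 40L)

# Breaks 0, 10, ..., 50 give five right-closed intervals
value_class <- cut(value, breaks = seq(0, 50, 10))

levels(value_class)
# "(0,10]"  "(10,20]" "(20,30]" "(30,40]" "(40,50]"

# Number of observations per interval
table(value_class)
```

Because the intervals are open on the left, a value of exactly 0 would become NA; passing include.lowest = TRUE to cut would keep it in the first class.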

How to take an Average of + or - SD

I have data where the [1] dependent variable is taken from a controlled and independent variable [2] then independent variable. The mean and SD are taken from [1].
(a) and this is the result of SD:
Year Species Pop_Index
1 1994 Corn Bunting 2.082483
5 1998 Corn Bunting 2.048155
10 2004 Corn Bunting 2.061617
15 2009 Corn Bunting 2.497792
20 1994 Goldfinch 1.961236
25 1999 Goldfinch 1.995600
30 2005 Goldfinch 2.101403
35 2010 Goldfinch 2.138496
40 1995 Grey Partridge 2.162136
(b) And the result of mean:
Year Species Pop_Index
1 1994 Corn Bunting 2.821668
5 1998 Corn Bunting 2.916975
10 2004 Corn Bunting 2.662797
15 2009 Corn Bunting 4.171538
20 1994 Goldfinch 3.226108
25 1999 Goldfinch 2.452807
30 2005 Goldfinch 2.954816
35 2010 Goldfinch 3.386772
40 1995 Grey Partridge 2.207708
(c) This is the Code for SD:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.0824833420524, 2.04815530904537,
2.06161673349657, 2.49779159320587, 1.96123572400404, 1.99559986715288,
2.10140285528351, 2.13849611018009, 2.1621364896722)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(d) This is the code for mean:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.82166841455814, 2.91697463618566,
2.66279663056763, 4.17153795031277, 3.22610845074252, 2.45280743991572,
2.95481600904799, 3.38677188055508, 2.20770835158744)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(e) And this is the code used to take the mean of mean Pop_Index over the years:
df2 <- aggregate(Pop_Index ~ Year, df1, mean)
(f) And this is the result:
Year Pop_Index
1 1994 3.023888
2 1995 2.207708
3 1998 2.916975
4 1999 2.452807
5 2004 2.662797
6 2005 2.954816
7 2009 4.171538
8 2010 3.386772
Now it wouldn't make sense for me to average the SDs with the same procedure as before, passing mean or sd to aggregate.
I have looked online and found someone in a similar predicament with this data:
Month: January
Week 1 Mean: 67.3 Std. Dev: 0.8
Week 2 Mean: 80.5 Std. Dev: 0.6
Week 3 Mean: 82.4 Std. Dev: 0.8
And the response:
"With equal samples size, which is what you have, the standard deviation you are looking for is:
Sqrt [ (.64 + .36 + .64) / 3 ] = 0.739369"
How would I do this in R, or is there another way of doing this? Because I want to plot error bars and the dataset plotted is like that of (f), and it would be absurd to plot the SD of (a) against this because the vector lengths would differ.
Sample from original data.frame with a few columns and many rows not included:
structure(list(GRIDREF = structure(c(1L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L), .Label = c("SP8816", "SP9212", "SP9322",
"SP9326", "SP9440", "SP9513", "SP9632", "SP9939", "TF7133", "TF9437"
), class = "factor"), Lat = c(51.83568688, 51.83568688, 51.79908899,
51.88880822, 51.92476157, 52.05042795, 51.80757645, 51.97818159,
52.04057068, 52.86730817, 52.89542895), Long = c(-0.724233561,
-0.724233561, -0.667258035, -0.650074995, -0.648996758, -0.630626734,
-0.62349292, -0.603710436, -0.558026241, 0.538966197, 0.882597783
), Year = c(2006L, 2007L, 1999L, 2004L, 1995L, 2009L, 2011L,
2007L, 2011L, 1996L, 2007L), Species = structure(c(4L, 7L, 5L,
10L, 4L, 6L, 8L, 3L, 2L, 9L, 1L), .Label = c("Blue Tit", "Buzzard",
"Canada Goose", "Collared Dove", "Greenfinch", "Jackdaw", "Linnet",
"Meadow Pipit", "Robin", "Willow Warbler"), class = "factor"),
Pop_Index = c(0L, 0L, 2L, 0L, 1L, 0L, 1L, 4L, 0L, 0L, 8L)), row.names = c(1L,
100L, 1000L, 2000L, 3000L, 4000L, 5000L, 6000L, 10000L, 20213L,
30213L), class = "data.frame")
A look into this data.frame:
GRIDREF Lat Long Year Species Pop_Index TempJanuary
1 SP8816 51.83569 -0.7242336 2006 Collared Dove 0 2.128387
100 SP8816 51.83569 -0.7242336 2007 Linnet 0 4.233226
1000 SP9212 51.79909 -0.6672580 1999 Greenfinch 2 5.270968
2000 SP9322 51.88881 -0.6500750 2004 Willow Warbler 0 4.826452
3000 SP9326 51.92476 -0.6489968 1995 Collared Dove 1 4.390322
4000 SP9440 52.05043 -0.6306267 2009 Jackdaw 0 2.934516
5000 SP9513 51.80758 -0.6234929 2011 Meadow Pipit 1 3.841290
6000 SP9632 51.97818 -0.6037104 2007 Canada Goose 4 7.082580
10000 SP9939 52.04057 -0.5580262 2011 Buzzard 0 3.981290
20213 TF7133 52.86731 0.5389662 1996 Robin 0 3.532903
30213 TF9437 52.89543 0.8825978 2007 Blue Tit 8 7.028710
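The formula quoted in the question (square the SDs, average them, take the square root) can be applied per year with aggregate. The following is a sketch under the same assumption the quoted response makes, namely equal sample sizes behind every SD; sd_df is a hypothetical name for the SD table shown in (a).

```r
# Check against the quoted weekly example: sqrt((.64 + .36 + .64) / 3)
weekly_sd <- c(0.8, 0.6, 0.8)
combined_weekly <- sqrt(mean(weekly_sd^2))
combined_weekly  # about 0.739369, matching the quoted response

# SD table from (a); only Year and Pop_Index are needed here
sd_df <- data.frame(
  Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L, 2005L, 2010L, 1995L),
  Pop_Index = c(2.0824833420524, 2.04815530904537, 2.06161673349657,
                2.49779159320587, 1.96123572400404, 1.99559986715288,
                2.10140285528351, 2.13849611018009, 2.1621364896722)
)

# One combined SD per year: root of the mean of the squared SDs
sd_by_year <- aggregate(Pop_Index ~ Year, sd_df,
                        function(s) sqrt(mean(s^2)))
names(sd_by_year)[2] <- "Pop_SD"
sd_by_year
```

The result has one row per year, the same length as the table in (f), so it can be merged with the yearly means by Year for error bars. If the group sizes differ, the squared SDs would need to be weighted, which this sketch does not do.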

Filter a data frame by two conditions in R

I have a data frame that contains the highest and lowest temperature in a given year by Climate Station - All.Stations dataset:
Station.Name Year Month Day TMAX TMIN
GRAND MARAIS 1942 7 28 82 60
GRAND MARAIS 1962 3 17 42 22
LEECH LAKE 1956 7 3 72 50
ALBERT LEA 3 SE 1998 1 25 25 15
TWO HARBORS 1933 5 20 77 42
ARGYLE 1922 9 13 NA NA
I also have a data frame of complete years by Climate Station (i.e., these are the years where I have data for every day in the year) - complete.years dataset:
Station.Name Year
DULUTH 1904
AGASSIZ REFUGE 1995
LEECH LAKE 1956
GRAND MARAIS 1942
LEECH LAKE 1994
I want to filter the first data frame to only the data where Station Name and Year exist and match in the second data frame.
The correct results would be:
Station.Name Year TMAX
GRAND MARAIS 1942 82
LEECH LAKE 1956 72
Here's what I've got so far, using dplyr:
Max.Tempurature <- All.Stations %>%
group_by(Station.Name, Year) %>%
select(Station.Name, Year, TMAX) %>%
filter(min_rank(desc(TMAX)) <= 1) %>%
filter((Year %in% complete.years$Year & Station.Name %in% complete.years$Station.Name))
I can filter by both Year and Station.Name, but that searches the whole data frame for matches.
How do I filter by Station.Name and Year existing in the same observation?
We can do an inner_join
library(dplyr)
inner_join(All.Stations[c(1, 2, 5)], complete.years)
# Station.Name Year TMAX
#1 GRAND MARAIS 1942 82
#2 LEECH LAKE 1956 72
data
All.Stations <- structure(list(Station.Name = c("GRAND MARAIS", "GRAND MARAIS",
"LEECH LAKE", "ALBERT LEA 3 SE", "TWO HARBORS", "ARGYLE"), Year = c(1942L,
1962L, 1956L, 1998L, 1933L, 1922L), Month = c(7L, 3L, 7L, 1L,
5L, 9L), Day = c(28L, 17L, 3L, 25L, 20L, 13L), TMAX = c(82L,
42L, 72L, 25L, 77L, NA), TMIN = c(60L, 22L, 50L, 15L, 42L, NA
)), class = "data.frame", row.names = c(NA, -6L))
complete.years <- structure(list(Station.Name = c("DULUTH",
"AGASSIZ REFUGE", "LEECH LAKE",
"GRAND MARAIS", "LEECH LAKE"), Year = c(1904L, 1995L, 1956L,
1942L, 1994L)), class = "data.frame", row.names = c(NA, -5L))
Or with merge
cols <- c('Station.Name', 'Year', 'TMAX')
merge(All.Stations[cols], complete.years, all.x = FALSE)
# Station.Name Year TMAX
#1 GRAND MARAIS 1942 82
#2 LEECH LAKE 1956 72
data
All.Stations <- structure(list(Station.Name = c("GRAND MARAIS", "GRAND MARAIS",
"LEECH LAKE", "ALBERT LEA 3 SE", "TWO HARBORS", "ARGYLE"), Year = c(1942L,
1962L, 1956L, 1998L, 1933L, 1922L), Month = c(7L, 3L, 7L, 1L,
5L, 9L), Day = c(28L, 17L, 3L, 25L, 20L, 13L), TMAX = c(82L,
42L, 72L, 25L, 77L, NA), TMIN = c(60L, 22L, 50L, 15L, 42L, NA
)), .Names = c("Station.Name", "Year", "Month", "Day", "TMAX",
"TMIN"), class = "data.frame", row.names = c(NA, -6L))
complete.years <- structure(list(Station.Name = c("DULUTH", "AGASSIZ REFUGE", "LEECH LAKE",
"GRAND MARAIS", "LEECH LAKE"), Year = c(1904L, 1995L, 1956L,
1942L, 1994L)), .Names = c("Station.Name", "Year"), class = "data.frame", row.names = c(NA,
-5L))
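Another option in dplyr is semi_join, a filtering join that keeps the rows of the first table that have a match in the second without adding any columns. The sketch below uses a cut-down copy of the data from the answers plus one hypothetical row (LEECH LAKE, 1942) that is not in the original data; it is there only to show how the two independent %in% tests from the question differ from matching on the pair.

```r
library(dplyr)

All.Stations <- data.frame(
  Station.Name = c("GRAND MARAIS", "GRAND MARAIS", "LEECH LAKE",
                   "ALBERT LEA 3 SE", "LEECH LAKE"),
  Year = c(1942L, 1962L, 1956L, 1998L, 1942L),  # last row is hypothetical
  TMAX = c(82L, 42L, 72L, 25L, 70L),
  stringsAsFactors = FALSE
)
complete.years <- data.frame(
  Station.Name = c("DULUTH", "AGASSIZ REFUGE", "LEECH LAKE",
                   "GRAND MARAIS", "LEECH LAKE"),
  Year = c(1904L, 1995L, 1956L, 1942L, 1994L),
  stringsAsFactors = FALSE
)

# Independent tests: LEECH LAKE 1942 passes because the station and the
# year each occur somewhere in complete.years, just not together
loose <- filter(All.Stations,
                Year %in% complete.years$Year,
                Station.Name %in% complete.years$Station.Name)

# Pairwise match: only rows whose (Station.Name, Year) pair exists
matched <- semi_join(All.Stations, complete.years,
                     by = c("Station.Name", "Year"))

loose
matched
```

Unlike inner_join, semi_join never duplicates rows of the first table when complete.years contains the same pair more than once.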

No automatic column names with shift in a data.table - give.names = TRUE

Creating a lagged variable inside a data.table with shift should according to the documentation of shift create a column name based on the type and n options. That's apparently not working in my case and I would like to know why and how I can achieve that without resorting to the usage of variables to name the columns.
dt.quarter.test <- structure(list(Year = c(2000L, 2000L, 2000L, 2000L, 2001L, 2001L, 2001L, 2001L)
, City = c("New York","New York","New York","New York","Philadelphia","Philadelphia","Philadelphia","Philadelphia")
, Quarter = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L)
, Data.Year.to.Date = c(162, 405, 610, 938, 331, 1467, 1981, 2501)
)
, .Names = c("Year", "City", "Quarter", "Data.Year.to.Date"), class = c("data.table", "data.frame"), row.names = c(NA, -8L))
dt.quarter.test[, .(Quarter, Data.Year.to.Date, shift(Data.Year.to.Date, n = 1L, type = "lag", fill = NA, give.names = TRUE)), by = list(City)]
Edit:
The resulting data.table in my case is shown below. I also made sure that shift from data.table 1.9.6 is used by calling data.table::shift.
City Quarter Data.Year.to.Date V3
1: New York 1 162 NA
2: New York 2 405 162
3: New York 3 610 405
4: New York 4 938 610
5: Philadelphia 1 331 NA
6: Philadelphia 2 1467 331
7: Philadelphia 3 1981 1467
8: Philadelphia 4 2501 1981
This does not fit into a comment, and it seems to be a (rather ugly) solution:
library(data.table)
dt.quarter.test <- structure(list(Year = c(2000L, 2000L, 2000L, 2000L, 2001L, 2001L, 2001L, 2001L)
, City = c("New York","New York","New York","New York","Philadelphia","Philadelphia","Philadelphia","Philadelphia")
, Quarter = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L)
, Data.Year.to.Date = c(162, 405, 610, 938, 331, 1467, 1981, 2501)
)
, .Names = c("Year", "City", "Quarter", "Data.Year.to.Date"), class = c("data.table", "data.frame"), row.names = c(NA, -8L))
by_cols <- 'City'
shift_cols <- 'Data.Year.to.Date' # or e.g. c('Data.Year.to.Date','Quarter')
cbind(
dt.quarter.test,
dt.quarter.test[
,
shift(.SD, n = 1L, type = "lag", fill = NA, give.names = TRUE),
.SDcols = shift_cols,
by = by_cols][, .SD, .SDcols = !by_cols]
)
Result:
Year City Quarter Data.Year.to.Date Data.Year.to.Date_lag_1
1: 2000 New York 1 162 NA
2: 2000 New York 2 405 162
3: 2000 New York 3 610 405
4: 2000 New York 4 938 610
5: 2001 Philadelphia 1 331 NA
6: 2001 Philadelphia 2 1467 331
7: 2001 Philadelphia 3 1981 1467
8: 2001 Philadelphia 4 2501 1981
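A possible explanation, offered as an assumption rather than something the thread confirms: when shift is given a single vector and a single n, it appears to return a plain vector instead of a list, so there is nothing for give.names to attach a name to, and the column in j falls back to V3. If writing the name by hand is acceptable, a shorter route is to assign the lagged column with `:=`. The sketch below rebuilds the question's data with data.table(); the column name is typed explicitly to mimic the pattern seen in the result above.

```r
library(data.table)

dt <- data.table(
  Year    = rep(c(2000L, 2001L), each = 4),
  City    = rep(c("New York", "Philadelphia"), each = 4),
  Quarter = rep(1:4, times = 2),
  Data.Year.to.Date = c(162, 405, 610, 938, 331, 1467, 1981, 2501)
)

# Add the lag by reference, computed separately within each City
dt[, Data.Year.to.Date_lag_1 := shift(Data.Year.to.Date, n = 1L, type = "lag"),
   by = City]

print(dt)
```

This modifies dt in place, keeps all original columns, and avoids the cbind step.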

create list by groups and create json [duplicate]

I had a problem that was solved in "combine data in depending on the value of one column".
I have been trying to adapt that solution to a more complicated case but have not managed it: instead of 2 columns I now have 3.
df <- structure(list(year = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L), group = c(1L, 1L, 1L, 1L, 2L, 2L), sales = c(20L, 25L, 23L, 30L, 50L, 55L), expenses = c(19L, 19L, 20L, 15L, 27L, 30L)), .Names = c("year", "group", "sales", "expenses"), class = "data.frame", row.names = c(NA, -6L))
year group sales expenses
1 2000 1 20 19
2 2001 1 25 19
3 2002 1 23 20
4 2003 1 30 15
5 2001 2 50 27
6 2002 2 55 30
And I need the same output as in the first problem but instead of just the sales I also need to include the expenses in the json file
[{"group": 1, "sales":[[2000,20],[2001, 25], [2002,23], [2003, 30]], "expenses":[[2000,19],[2001, 19], [2002,20], [2003, 15]]},
{"group": 2, "sales":[[2001, 50], [2002,55]], "expenses":[[2001, 27], [2002,30]]}]
Try this. It is no different from akrun's answer to "combine data in depending on the value of one column":
library(data.table)
# toJSON() must come from a JSON package that is already loaded; the original does not say which one
toJSON(setDT(df)[, list(sales= paste0('[',toString(sprintf('[%d,%d]',year, sales)),']'),
expenses= paste0('[',toString(sprintf('[%d,%d]', year, expenses)),']')), by = group])
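Because the answer above pastes the pairs into character strings, a JSON encoder may emit them as quoted strings rather than nested arrays. An alternative sketch, assuming the jsonlite package (the thread does not name a JSON package), builds one list per group and lets the encoder produce the arrays; year_pairs is a hypothetical helper introduced only for this example.

```r
library(jsonlite)

df <- data.frame(
  year     = c(2000L, 2001L, 2002L, 2003L, 2001L, 2002L),
  group    = c(1L, 1L, 1L, 1L, 2L, 2L),
  sales    = c(20L, 25L, 23L, 30L, 50L, 55L),
  expenses = c(19L, 19L, 20L, 15L, 27L, 30L)
)

# Two-column matrix of (year, value); jsonlite writes a matrix
# as an array of row arrays
year_pairs <- function(d, col) unname(as.matrix(d[c("year", col)]))

groups <- lapply(split(df, df$group), function(d) {
  list(
    group    = d$group[1],
    sales    = year_pairs(d, "sales"),
    expenses = year_pairs(d, "expenses")
  )
})

# unname() so the groups become a JSON array instead of an object keyed by
# "1" and "2"; auto_unbox writes the length-1 group value as a scalar
json <- toJSON(unname(groups), auto_unbox = TRUE)
cat(json, "\n")
```

Adding a further measure only needs one more entry in the inner list.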
