How could I split a data.frame? - r

I have precipitation data from 50 synoptic stations covering 1986 to 2015.
I need to extract the records for the years 2007 to 2015 for each station separately. There are three variables:
the station's name
the specific year
the amount of precipitation
I need the result for each station separately.
Does anyone know how to use "split" for this purpose?
Could you please write the code from the beginning, starting with "read.table"?

If your task is simply to split the dataframe by year you can use split:
split(df, f = df$year)
Illustrative data:
set.seed(123)
df <- data.frame(
  station = sample(LETTERS[1:3], 10, replace = TRUE),
  year = paste0("201", sample(1:9, 10, replace = TRUE)),
  precipitation = sample(333:444, 10, replace = TRUE)
)
Result:
$`2011`
station year precipitation
5 C 2011 406
8 C 2011 399
$`2013`
station year precipitation
7 B 2013 393
9 B 2013 365
$`2015`
station year precipitation
2 C 2015 410
$`2016`
station year precipitation
4 C 2016 444
$`2017`
station year precipitation
3 B 2017 404
$`2019`
station year precipitation
1 A 2019 432
6 A 2019 412
10 B 2019 349
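The question asks for the 2007 to 2015 records per station rather than per year. A minimal sketch of that, assuming a data frame with columns named station, year and precipitation. The toy data below is made up and stands in for whatever read.table returns for your file; the file name and its arguments are not known here, so they are left out.

```r
# Hypothetical toy data standing in for the result of read.table(...)
df <- data.frame(
  station = rep(c("A", "B"), each = 4),
  year = rep(c(2005, 2007, 2010, 2015), times = 2),
  precipitation = c(310, 420, 380, 395, 290, 405, 415, 360)
)

# Keep only the years 2007 to 2015 (year must be numeric for this comparison)
recent <- subset(df, year >= 2007 & year <= 2015)

# One data frame per station, collected in a named list
by_station <- split(recent, recent$station)

names(by_station)
by_station$A
```

Each element of by_station can then be sorted or summarised on its own, for example with lapply.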

Related

How to rearrange daily stream discharge data into monthly format and rank the discharge values for each month using R

I have a data set of daily stream discharge values from a gauging station for approximately 50 years. The data is arranged into three columns, namely, "date", "month", "discharge".(Sample data shown here)
Date <- as.Date(c('1938-10-01','1954-10-27', '1967-06-16','1943-01-01','1945-01-14','1945-03-14','1954-05-04','1960-04-23','1960-05-09','1962-01-18','1968-12-19','1972-01-15','1977-08-15','1981-04-11','1986-06-20','1989-01-20','1992-03-29'))
Months <- c('Oct','Oct','Jun','Jan','Jan','Mar','May','Apr','May','Jan','Dec','Jan','Aug','Apr','Jun','Jan','Mar')
Dis <- c('1000','1200','400','255','450','215','360','120','145','1204','752','635','1456','154','154','1204','450')
Sampledata <- data.frame("Date" = Date, "Months" = Months, "Disch" = Dis)
print(Sampledata)
Date Months Disch
1 1938-10-01 Oct 1000
2 1954-10-27 Oct 1200
3 1967-06-16 Jun 400
4 1943-01-01 Jan 255
5 1945-01-14 Jan 450
6 1945-03-14 Mar 215
7 1954-05-04 May 360
8 1960-04-23 Apr 120
9 1960-05-09 May 145
10 1962-01-18 Jan 1204
11 1968-12-19 Dec 752
12 1972-01-15 Jan 635
13 1977-08-15 Aug 1456
14 1981-04-11 Apr 154
15 1986-06-20 Jun 154
16 1989-01-20 Jan 1204
17 1992-03-29 Mar 450
I want to calculate ranks for each month separately across all the years. For example: calculate the rank in ascending order for the month of January over the 50 years, with the same rank value assigned to duplicate discharge values. Desired output shown here:
Date Month Disch Rank
1 1943-01-01 Jan 255 1
2 1945-01-14 Jan 450 2
3 1962-01-18 Jan 1204 4
4 1972-01-15 Jan 635 3
5 1989-01-20 Jan 1204 4
Date Month Disch Rank
1 1945-03-14 Mar 215 1
2 1992-03-29 Mar 450 2
3 2001-03-19 Mar 450 2
Without using any packages, first convert columns 2 and 3 to numeric and then use ave and rank with the indicated ties method. Finally, order the result.
Note that the output shown in the question does not correspond to the input: for example, there are three Mar rows in the output but only two such rows in the input. The code below therefore corresponds to the input and will not be identical to the output shown.
Sampledata2 <- transform(Sampledata,
Disch = as.numeric(as.character(Disch)),
Months = as.numeric(format(Date, "%m")))
Rank <- function(x) rank(x, ties.method = "min")
Sampledata3 <- transform(Sampledata2,
Rank = ave(Disch, Months, FUN = Rank))
o <- with(Sampledata3, order(Months, Date))
Sampledata3[o, ]
An option would be to group by 'Month' and use one of the ranking functions (dense_rank, row_number, or min_rank, depending on your needs) to rank the 'Discharge' column:
library(dplyr)
df1 %>%
group_by(Month) %>%
mutate(Rank = dense_rank(Discharge))
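The three ranking functions differ only in how they treat ties. A small base R sketch of the difference, using the two tied Mar values from the question plus one made-up larger value (600) so that the gap after the tie is visible. The base R expressions are offered as equivalents of the dplyr functions for plain numeric input without missing values.

```r
# 215, 450, 450 come from the question's Mar rows; 600 is a made-up extra value
x <- c(215, 450, 450, 600)

# Like min_rank: ties share the lowest rank, and a gap follows the tie
min_rank_x <- rank(x, ties.method = "min")

# Like dense_rank: ties share a rank, and no gap follows
dense_rank_x <- match(x, sort(unique(x)))

# Like row_number: ties are broken by position, so every rank is distinct
row_number_x <- rank(x, ties.method = "first")

rbind(min_rank_x, dense_rank_x, row_number_x)
```

The desired output in the question (equal discharges get equal ranks) is compatible with either the min or the dense variant; they only differ in the rank given to values after a tie.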

How to calculate the average year

I have a 20-year monthly XTS time series
Jan 1990 12.3
Feb 1990 45.6
Mar 1990 78.9
..
Jan 1991 34.5
..
Dec 2009 89.0
I would like to get the average (12-month) year, or
Jan xx
Feb yy
...
Dec kk
where xx is the average of every January, yy of every February, and so on.
I have tried apply.yearly and lapply, but these return a single value, which is the 20-year overall average.
Would you have any suggestions? I appreciate it.
The lubridate package could be useful for you. I would use the functions year() and month() in conjunction with aggregate():
library(xts)
library(lubridate)
#set up some sample data
dates = seq(as.Date('2000/01/01'), as.Date('2005/01/01'), by="month")
df = data.frame(rand1 = runif(length(dates)), rand2 = runif(length(dates)))
my_xts = xts(df, dates)
#get the mean by year
aggregate(my_xts$rand1, by=year(index(my_xts)), FUN=mean)
This outputs something like:
2000 0.5947939
2001 0.4968154
2002 0.4941752
2003 0.5291211
2004 0.6631564
To find the mean for each month you can do:
#get the mean by month
aggregate(my_xts$rand1, by=month(index(my_xts)), FUN=mean)
which will output something like
1 0.5560279
2 0.6352220
3 0.3308571
4 0.6709439
5 0.6698147
6 0.7483192
7 0.5147294
8 0.3724472
9 0.3266859
10 0.5331233
11 0.5490693
12 0.4642588
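The same grouping idea can be sketched in base R without xts or lubridate, assuming the dates and values are available as plain vectors (for an xts object, index(my_xts) would supply the dates). The series below is made up so that the result is easy to check by hand.

```r
# Made-up monthly series over 3 years: value = month number + 0, 10 or 20 per year
dates  <- seq(as.Date("2000-01-01"), as.Date("2002-12-01"), by = "month")
values <- as.numeric(format(dates, "%m")) + rep(c(0, 10, 20), each = 12)

# Calendar month (1..12) of every observation
month_id <- as.integer(format(dates, "%m"))

# Average all Januaries, all Februaries, and so on
monthly_mean <- tapply(values, month_id, mean)
monthly_mean
```

Here every January is 1, 11 or 21, so the January mean is 11, and each later month is one higher.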

split-apply-combine R

I have a data table with several columns.
Let's say:
Location, which may include Los Angeles, etc.
age_Group, say (young, child, teenager), etc.
year = (2000, 2001, ..., 2015)
month = c(jan, ..., dec)
I would like to group_by them and see how many people have spent money
in certain intervals; let's say I have intervals interval_1 = (1, 100), (100, 1000), ..., interval_20 = (1000, infinity).
How should I proceed? What should I do after the following?
data %>% group_by(location, age_Group, year, month)
sample:
location age_gp year month spending
LA child 2000 1 102
LA teen 2000 1 15
LA teen 2000 10 9
NY old 2000 11 1000
NY old 2010 2 1000000
NY teen 2020 3 10
desired output
LA, child, 2000, jan interval_1
LA, child, 2000, feb interval_20
...
NY OLD 2015 Dec interval_1
The last column has to be determined by adding up the spending of all people belonging to the same city, age_group, year, and month.
You can first create a new column (spending_cat) using, for example, the cut function. Afterwards you can add the new variable as a grouping variable, and then you just need to count:
library(dplyr)
df <- data.frame(group = sample(letters[1:4], size = 1000, replace = TRUE),
spending = rnorm(1000))
df %>%
mutate(spending_cat = cut(spending, breaks = c(-5:5))) %>%
group_by(group, spending_cat) %>%
summarise(n_people = n())
# A tibble: 26 x 3
# Groups: group [?]
group spending_cat n_people
<fct> <fct> <int>
1 a (-3,-2] 6
2 a (-2,-1] 36
3 a (-1,0] 83
4 a (0,1] 78
5 a (1,2] 23
6 a (2,3] 10
7 b (-4,-3] 1
8 b (-3,-2] 4
9 b (-2,-1] 40
10 b (-1,0] 78
# … with 16 more rows
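The question also asks for the interval of the summed spending per location, age group, year and month. A base R sketch of that variant, using the sample rows from the question. The three interval edges and their labels are assumptions made for illustration, since the question does not list all 20 intervals.

```r
# Sample rows from the question
d <- data.frame(
  location = c("LA", "LA", "LA", "NY", "NY", "NY"),
  age_gp   = c("child", "teen", "teen", "old", "old", "teen"),
  year     = c(2000, 2000, 2000, 2000, 2010, 2020),
  month    = c(1, 1, 10, 11, 2, 3),
  spending = c(102, 15, 9, 1000, 1000000, 10)
)

# Total spending per location / age group / year / month
totals <- aggregate(spending ~ location + age_gp + year + month,
                    data = d, FUN = sum)

# Hypothetical interval edges; cut() is right-closed by default, e.g. (100, 1000]
totals$interval <- cut(totals$spending,
                       breaks = c(0, 100, 1000, Inf),
                       labels = c("interval_1", "interval_2", "interval_3"))
totals
```

In this sample every group has a single row, so the sums equal the original values; with real data, rows sharing all four grouping values are added together before the interval is assigned.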

R: How to spread, group_by, summarise and mutate at the same time

I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6. 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2. 0 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
spread(Year, Orders) %>%
group_by(CountryName) %>%
summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>%
mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
library(dplyr)
library(tidyr)
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2014:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
spread(Year, sum_orders) %>%
mutate(Pct = (`2014` - `2015`)/`2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2010:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
group_by(Country) %>%
arrange(Country, Year) %>%
mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
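The "sum first, then reshape" idea can also be sketched in base R, where tapply does the summing and the reshaping in one step. The data below is made up; the column names follow the answer's fake data.

```r
# Made-up orders for two countries and two years
dat <- data.frame(
  Country = c("A", "A", "A", "B", "B", "B"),
  Year    = c(2014, 2014, 2015, 2014, 2015, 2015),
  Orders  = c(10, 10, 15, 8, 2, 2)
)

# Sum first: the result is a matrix with countries as rows and years as columns
wide <- tapply(dat$Orders, list(dat$Country, dat$Year), sum)

# Same sign convention as the answer above: a positive value means orders fell
pct <- (wide[, "2014"] - wide[, "2015"]) / wide[, "2014"] * 100
pct
```

Country A goes from 20 to 15 orders (25) and country B from 8 to 4 (50). Because the sums are computed before anything is spread into columns, duplicate rows and non-numeric columns never get in the way.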
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1: You're likely getting the error "duplicate identifiers for rows" because of spread. spread wants to make N columns from your N unique values, but it needs to know which unique row to place those values in. If you have duplicate value combinations, for instance:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread cannot tell which row it should place the data in. The quick fix is to add a row identifier before spread: data %>% mutate(row=row_number()) %>% spread...
Error 2: You're likely getting the error "sum not meaningful for factors" because of summarise_all. summarise_all operates on all columns, but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`), `2015_Sum` = sum(`2015`)); the backticks are required because the column names start with a digit.

Aggregation on 2 columns while keeping two unique R

So I have this:
Staff Result Date Days
1 50 2007 4
1 75 2006 5
1 60 2007 3
2 20 2009 3
2 11 2009 2
And I want to get to this:
Staff Result Date Days
1 55 2007 7
1 75 2006 5
2 15 2009 5
I want to have the Staff ID and Date be unique in each row, but I want to sum 'Days' and mean 'Result'
I can't work out how to do this in R, I'm sure I need to do lots of aggregations but I keep getting different results to what I am aiming for.
Many thanks
The simplest way to do this is to group_by Staff and Date and summarise the results with the dplyr package:
require(dplyr)
df <- data.frame(Staff = c(1,1,1,2,2),
Result = c(50, 75, 60, 20, 11),
Date = c(2007, 2006, 2007, 2009, 2009),
Days = c(4, 5, 3, 3, 2))
df %>%
group_by(Staff, Date) %>%
summarise(Result = floor(mean(Result)),
Days = sum(Days)) %>%
data.frame
Staff Date Result Days
1 1 2006 75 5
2 1 2007 55 7
3 2 2009 15 5
You can aggregate on two variables by using a formula and then merge the two aggregates:
merge(aggregate(Result ~ Staff + Date, data=df, mean),
aggregate(Days ~ Staff + Date, data=df, sum))
Staff Date Result Days
1 1 2006 75.0 5
2 1 2007 55.0 7
3 2 2009 15.5 5
Here is another option with data.table
library(data.table)
setDT(df1)[, .(Result = floor(mean(Result)), Days = sum(Days)), .(Staff, Date)]
# Staff Date Result Days
#1: 1 2007 55 7
#2: 1 2006 75 5
#3: 2 2009 15 5
