Work With a Value in a Data Frame by Character - r

I have a data frame with a column of text formatted by NBA season as such:
Year
2014-15
2013-14
2012-13
...
1999-00
I need to reformat this by the second year. This is a small data set, and I don't mind manually fixing the 1999-00 value, but I can't figure out how to:
paste(data$Year[1:2],data$Year[6:7])
To get:
Year
2015
2014
2013
...
2000

I think it would be simplest to just extract the first year and add one:
as.numeric(substr(data$Year, 1, 4)) + 1
# [1] 2003 2002 2001 2000 1999
Data:
(data <- data.frame(Year=c("2002-03", "2001-02", "2000-01", "1999-00", "1998-99")))
# Year
# 1 2002-03
# 2 2001-02
# 3 2000-01
# 4 1999-00
# 5 1998-99
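The same idea can be sketched without relying on fixed character positions, by dropping everything from the hyphen onward and storing the result in a new column. The column name EndYear is an assumption made for illustration; it is not part of the original data.

```r
# Sample data from the answer above
data <- data.frame(
  Year = c("2002-03", "2001-02", "2000-01", "1999-00", "1998-99"),
  stringsAsFactors = FALSE
)

# Remove the hyphen and everything after it to keep the starting year
start_year <- as.integer(sub("-.*$", "", data$Year))

# The season's second year is the starting year plus one,
# which also handles the 1999-00 rollover without manual fixing
data$EndYear <- start_year + 1L  # EndYear is a hypothetical column name
data
```

Because the arithmetic is done on the four-digit starting year, the two-digit suffix is never parsed, so the century change needs no special case.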

Related

How to add a column by matching with previous year?

I have a data frame as following:
df1
ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007
I want to add a third column (let's call it infoyear) in my df that shows the year before for each data element that I have in my second column.
In other words I want to get the following result:
df1
ID closingdate infoyear
1 31/12/2005 2004
2 01/12/2009 2008
3 02/01/2002 2001
4 09/10/2000 1999
5 15/11/2007 2006
Normally, to add a column presenting only years I would use:
library(data.table)
setDT(df1)[, infoyear := year(as.IDate(closingdate, '%d/%m/%Y'))]
In my case, that produces the following:
df1
ID closingdate infoyear
1 31/12/2005 2005
2 01/12/2009 2009
3 02/01/2002 2002
4 09/10/2000 2000
5 15/11/2007 2007
But instead of this result for column infoyear, I would like 1 year before closingdate (as the result presented before).
How can I solve a problem like this in R? Thank you!
You can try
df1 <- read.table(header=TRUE, text="ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007")
setDT(df1)
df1[, closingdate:= as.IDate(closingdate,"%d/%m/%Y")]
df1[, infoyear:= year(closingdate)-1]
df1
#Output
ID closingdate infoyear
1: 1 2005-12-31 2004
2: 2 2009-12-01 2008
3: 3 2002-01-02 2001
4: 4 2000-10-09 1999
5: 5 2007-11-15 2006
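If data.table is not available, a base R sketch of the same calculation might look like the following. It assumes closingdate is stored as character in day/month/year form, as in the question.

```r
# Sample data from the question, with dates kept as character strings
df1 <- data.frame(
  ID = 1:5,
  closingdate = c("31/12/2005", "01/12/2009", "02/01/2002",
                  "09/10/2000", "15/11/2007"),
  stringsAsFactors = FALSE
)

# Parse the day/month/year strings into Date objects
parsed <- as.Date(df1$closingdate, format = "%d/%m/%Y")

# Extract the four-digit year and subtract one
df1$infoyear <- as.integer(format(parsed, "%Y")) - 1L
df1
```

Unlike the data.table version, this leaves closingdate in its original text form and only adds the infoyear column.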

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row with UN_$sector = Residual. The value of the residual will be (UN_$sector = Total) - (the sum of column UN for the sectors c("1", "2", "3", "4", "5")) for a given year AND country.
This is how it should look:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating residual first and then stack it with other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector!='Total'), sum),
aggregate(UN ~ country + year, data = subset(df, sector=='Total'), sum),
by=c("country", "year")),
{UN <- UN.y - UN.x
sector = 'Residual'})
# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector!='Total'),
agg[c("country", "year", "sector", "UN")],
subset(df, sector=='Total'))
# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)),])
row.names(final_df) <- NULL
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
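As a sanity check on the numbers above, the residual for a single country and year can be worked out by hand. This is a minimal sketch using only the AT 1990 rows from the question; the object names un, is_total and residual are assumptions made for illustration.

```r
# AT 1990 rows from the question
un <- data.frame(
  country = "AT",
  year    = 1990,
  sector  = c("1", "2", "3", "4", "5", "Total"),
  UN      = c(1.407555, 1.037137, 4.769618, 2.455139, 2.238618, 7.869005),
  stringsAsFactors = FALSE
)

# Residual = Total minus the sum of the individual sectors
is_total <- un$sector == "Total"
residual <- sum(un$UN[is_total]) - sum(un$UN[!is_total])
round(residual, 6)
```

The result agrees with the Residual row shown for AT 1990 in the output above.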
I think there are multiple ways you can do this. One I would recommend is the tidyverse suite of packages, which includes dplyr.
Without getting too far into what dplyr and the tidyverse can do, the functions that matter here are dplyr's group_by(...), summarise(...), arrange(...) and bind_rows(...). There are also plenty of good tutorials, cheat sheets, and documentation for all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame that contains all of the Residual values and then bind it back onto your original data frame.
Step 1: Calculating all residual values
For each country and year, we want the Total value minus the sum of the UN values for the other sectors. We can compute this with:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total'], na.rm = TRUE) - sum(UN[sector != 'Total'], na.rm = TRUE))
Step 2: Add sector column to res_UN with value 'Residual'
This yields a data frame containing country, year, and UN. We now need to add a sector column with the value 'Residual' to match your specification.
res_UN$sector = 'Residual'
Step 3: Add res_UN back to UN_ and order accordingly
res_UN and UN_ now share the columns country, year, sector and UN, so they can be bound together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it takes only a few lines!
TLDR:
library(dplyr)
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total'], na.rm = TRUE) - sum(UN[sector != 'Total'], na.rm = TRUE))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)

Canonical way to reduce number of ID variables in wide-format data

I have data organized by two ID variables, Year and Country, like so:
Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8
I'd like to keep Year as an ID variable, but create multiple columns for VarA and VarB, one for each value of Country (I'm not picky about column order), to make the following table:
Year VarA.Canada VarA.USA VarB.Canada VarB.USA
2014 0 NA 10 NA
2015 6 1 5 3
2016 7 2 8 2
I managed to do this with the following code:
require(data.table)
require(reshape2)
data <- as.data.table(read.table(header=TRUE, text='Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8'))
molten <- melt(data, id.vars=c('Year', 'Country'))
molten[,variable:=paste(variable, Country, sep='.')]
recast <- dcast(molten, Year ~ variable)
But this seems a bit hacky (especially editing the default-named variable field). Can I do it with fewer function calls? Ideally I could just call one function, specifying the columns to drop as IDs and the formula for creating new variable names.
Using dcast you can cast multiple value.vars at once (from data.table v1.9.6 on). Try:
dcast(data, Year ~ Country, value.var = c("VarA","VarB"), sep = ".")
# Year VarA.Canada VarA.USA VarB.Canada VarB.USA
#1: 2014 0 NA 10 NA
#2: 2015 6 1 5 3
#3: 2016 7 2 8 2
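For comparison, a similar wide table can probably be produced in base R with stats::reshape(), without data.table or reshape2. This is a sketch under the assumption that the data is a plain data frame; the column order differs from the dcast output, which the question says is acceptable.

```r
# Sample data from the question as a plain data frame
data <- data.frame(
  Year    = c(2015, 2016, 2014, 2015, 2016),
  Country = c("USA", "USA", "Canada", "Canada", "Canada"),
  VarA    = c(1, 2, 0, 6, 7),
  VarB    = c(3, 2, 10, 5, 8),
  stringsAsFactors = FALSE
)

# One row per Year; every other column is spread across the Country values
wide <- reshape(data, idvar = "Year", timevar = "Country", direction = "wide")

# reshape() keeps rows in order of first appearance, so sort by Year
wide <- wide[order(wide$Year), ]
row.names(wide) <- NULL
wide
```

Year and country combinations that are absent from the input, such as USA in 2014, come out as NA, just as in the dcast result.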

Make subset with specific values of a column with grep

I have the following data set:
usd year
1 65.09 1997
2 69.28 1998
3 71.18 1999Q1
4 72.12 1999Q2
5 70.68 1999Q3
6 71.01 1999Q4
7 71.45 2000Q1
8 72.02 2000Q2
9 72.29 2000Q3
10 71.12 2000Q4
I want to have the means of every year:
usd year
1 65.09 1997
2 69.28 1998
3 71.24 1999
7 71.72 2000
I know how I can do it if I only have years without the quarter. Is there a way to extract the years? Maybe with grep?
I have found a solution using the stringr package:
mydata <- data.frame(usd = c(65.09,69.28,71.18,72.12,70.68,71.01,71.45,72.02,72.29,71.12),
year = c("1997","1998","1999Q1","1999Q2","1999Q3","1999Q4",
"2000Q1","2000Q2","2000Q3","2000Q4"))
library(stringr)
mydata$year <- str_extract(mydata$year, "[[:digit:]]{4}")
mydata <- aggregate(usd ~ year, mydata, mean)
mydata
year usd
1 1997 65.0900
2 1998 69.2800
3 1999 71.2475
4 2000 71.7200
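Since the year always occupies the first four characters here, the same result can likely be obtained without stringr. The sketch below uses base R's substr() and assumes every value of year starts with a four-digit year.

```r
# Sample data from the answer above
mydata <- data.frame(
  usd  = c(65.09, 69.28, 71.18, 72.12, 70.68, 71.01, 71.45, 72.02, 72.29, 71.12),
  year = c("1997", "1998", "1999Q1", "1999Q2", "1999Q3", "1999Q4",
           "2000Q1", "2000Q2", "2000Q3", "2000Q4"),
  stringsAsFactors = FALSE
)

# Keep only the first four characters, dropping any quarter suffix
mydata$year <- substr(mydata$year, 1, 4)

# Average usd within each year
yearly <- aggregate(usd ~ year, data = mydata, FUN = mean)
yearly
```

If the year could appear elsewhere in the string, a regular expression such as the one in the stringr answer would be the safer choice.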

How can I aggregate data.table in quarterly frequency?

My data is available at monthly frequency and I'm trying to aggregate it to quarterly frequency. I'm working with data.table, a package I don't understand very well, to be honest.
X.DATA_BASE NOME_INSTITUICAO SALDO.x SALDO.y
1: 199407 ASB S/A - CFI 1694581 1124580
2: 199407 BANCO ARAUCARIA S.A. 40079517 6314782
3: 199407 BANCO ATLANTIS S.A. 200463907 9356445
4: 199407 BANCO BANKPAR 1078342 5770046
5: 199407 BANCO BBI 97812975 31112289
Each date is given by X.DATA_BASE, where 199407 means July 1994. I have several institutions with SALDO.x and SALDO.y values. I want to sum SALDO.x and SALDO.y for each institution in each quarter. One problem is that some institutions enter and leave the data over time. In the end I want mydata to have the same columns but at quarterly frequency.
How could I do that?
Here's an example of how to group and sum by quarter (with thanks to @eddi for his suggested improvement). First let's create some fake data:
library(data.table)
set.seed(1485)
dat = data.table(date=rep(c(199401:199412,199501:199512),2),
firm=rep(c("A","B"), each=24),
value1=rnorm(48,1000,10),
value2=rnorm(48,2000,100))
dat
date firm value1 value2
1: 199401 A 1009.8620 2054.251
2: 199402 A 1009.7180 2124.202
3: 199403 A 1014.3421 1919.251
...
46: 199510 B 992.9961 2079.517
47: 199511 B 997.9147 1968.676
48: 199512 B 1002.5993 2006.231
Now, summarize by firm, year, and quarter. To do this, we create year and quarter grouping variables from date (we use integer division (%/%) to create the years and mod (%%) plus integer division to create the quarters), and calculate the sum of value1 and value2 for each sub-group. This all assumes date is numeric. If you have it stored as character or factor, convert to numeric first:
dat.summary = dat[ , list(valueByQuarter = sum(value1) + sum(value2)),
by=list(firm,
year=date %/% 100,
quarter=(date %% 100 - 1) %/% 3 + 1)]
dat.summary
firm year quarter valueByQuarter
1: A 1994 1 9131.626
2: A 1994 2 8953.116
3: A 1994 3 8981.407
4: A 1994 4 9175.959
5: A 1995 1 9003.225
6: A 1995 2 8962.690
7: A 1995 3 8809.256
8: A 1995 4 8885.264
9: B 1994 1 9000.791
10: B 1994 2 8936.356
11: B 1994 3 8905.789
12: B 1994 4 8951.369
13: B 1995 1 8922.716
14: B 1995 2 9097.134
15: B 1995 3 8724.188
16: B 1995 4 9047.934
For dplyr fans, here's a dplyr approach:
library(dplyr)
dat %>%
group_by(firm, year=date %/% 100,
quarter=(date %% 100 - 1) %/% 3 + 1) %>%
summarise(valueByQuarter = sum(value1 + value2))
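To see how the year and quarter grouping variables come out of a YYYYMM number, the arithmetic can be tried on a few values in isolation. The sample dates below are made up for illustration.

```r
# A few illustrative YYYYMM values (hypothetical, not from the question's data)
date <- c(199401L, 199403L, 199404L, 199407L, 199412L, 199501L)

# Integer division by 100 drops the two month digits, leaving the year
year <- date %/% 100

# The remainder after dividing by 100 is the month (1 to 12)
month <- date %% 100

# Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on
quarter <- (month - 1) %/% 3 + 1

data.frame(date, year, month, quarter)
```

Subtracting one before the integer division is what makes months 3, 6, 9 and 12 fall into the quarter they end instead of the next one.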
