I have the following data set:
usd year
1 65.09 1997
2 69.28 1998
3 71.18 1999Q1
4 72.12 1999Q2
5 70.68 1999Q3
6 71.01 1999Q4
7 71.45 2000Q1
8 72.02 2000Q2
9 72.29 2000Q3
10 71.12 2000Q4
I want to have the means of every year:
usd year
1 65.09 1997
2 69.28 1998
3 71.24 1999
7 71.72 2000
I know how I can do it if I only have years without the quarter. Is there a way to extract the years? Maybe with grep?
I have found a solution using the stringr package:
mydata <- data.frame(usd = c(65.09,69.28,71.18,72.12,70.68,71.01,71.45,72.02,72.29,71.12),
year = c("1997","1998","1999Q1","1999Q2","1999Q3","1999Q4",
"2000Q1","2000Q2","2000Q3","2000Q4"))
library(stringr)
mydata$year <- str_extract(mydata$year, "[[:digit:]]{4}")
mydata <- aggregate(usd ~ year, mydata, mean)
mydata
year usd
1 1997 65.0900
2 1998 69.2800
3 1999 71.2475
4 2000 71.7200
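The same result can be had without stringr. Here is a minimal base R sketch, assuming every value in year starts with a four-digit year, as in the sample data. substr() keeps the first four characters; the sub() pattern in the comment is just one possible regex alternative.

```r
mydata <- data.frame(
  usd  = c(65.09, 69.28, 71.18, 72.12, 70.68, 71.01, 71.45, 72.02, 72.29, 71.12),
  year = c("1997", "1998", "1999Q1", "1999Q2", "1999Q3", "1999Q4",
           "2000Q1", "2000Q2", "2000Q3", "2000Q4")
)

# Keep only the first four characters, i.e. drop any "Q1".."Q4" suffix.
mydata$year <- substr(mydata$year, 1, 4)

# Possible regex alternative: sub("Q[1-4]$", "", mydata$year)

# Mean of usd per year.
yearly <- aggregate(usd ~ year, data = mydata, FUN = mean)
yearly
```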
I have a large data frame (d) with the dates stored as character strings. I want to extract only the year, so I have been trying to convert the column to a date first and then extract the year. However, I cannot get the conversion from character to work.
So I have two questions: how do I change this character date to a proper date, and then how do I extract the year?
I am looking for the easiest way so I can recreate with other data sets.
site<dbl> date<chr> conc<dbl>
2001 2/1/1980 0.006521739
2001 2/2/1980 0.008260870
2001 2/3/1980 0.005652174
2001 2/4/1980 0.007826087
2001 2/5/1980 0.001000000
2001 2/7/1980 0.002222222
2001 2/8/1980 0.008666667
2001 2/11/1980 0.017777778
2001 2/12/1980 0.016250000
2001 2/13/1980 0.015416667
Here is what I have tried:
d2 <- as.Date(d$date, format = "%m/%d/%Y)
I get this error message:
Error: Incomplete expression: d2 <- as.Date(d$date, format = "%m/%d/%Y)
With dplyr (year() comes from lubridate, so load that as well):
library(dplyr)
library(lubridate)
d <- read.table(text="
site date conc
2001 2/1/1980 0.006521739
2001 2/2/1980 0.008260870
2001 2/3/1980 0.005652174
2001 2/4/1980 0.007826087
2001 2/5/1980 0.001000000
2001 2/7/1980 0.002222222
2001 2/8/1980 0.008666667
2001 2/11/1980 0.017777778
2001 2/12/1980 0.016250000
2001 2/13/1980 0.015416667 ", header = TRUE)
d %>% mutate(date = as.Date(date, '%m/%d/%Y'),
year = year(date))
site date conc year
1 2001 1980-02-01 0.006521739 1980
2 2001 1980-02-02 0.008260870 1980
3 2001 1980-02-03 0.005652174 1980
4 2001 1980-02-04 0.007826087 1980
5 2001 1980-02-05 0.001000000 1980
6 2001 1980-02-07 0.002222222 1980
7 2001 1980-02-08 0.008666667 1980
8 2001 1980-02-11 0.017777778 1980
9 2001 1980-02-12 0.016250000 1980
10 2001 1980-02-13 0.015416667 1980
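The error in the question comes from the missing closing quote after %Y in the format string, so R never sees the end of the expression. With the quote closed, base R alone is enough. A minimal sketch on three of the rows above, assuming the dates are month/day/year:

```r
d <- data.frame(
  site = 2001,
  date = c("2/1/1980", "2/2/1980", "2/11/1980"),
  conc = c(0.006521739, 0.008260870, 0.017777778),
  stringsAsFactors = FALSE
)

# Note the closing quote after %Y; it is missing in the attempt from the question.
d$date <- as.Date(d$date, format = "%m/%d/%Y")

# format() returns the year as text, so convert it to an integer.
d$year <- as.integer(format(d$date, "%Y"))
d
```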
I have a data frame as following:
df1
ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007
I want to add a third column (let's call it infoyear) to my df that shows, for each row, the year before the one in the second column.
In other words I want to get the following result:
df1
ID closingdate infoyear
1 31/12/2005 2004
2 01/12/2009 2008
3 02/01/2002 2001
4 09/10/2000 1999
5 15/11/2007 2006
Normally, to add a column presenting only years I would use:
library(data.table)
setDT(df1)[, infoyear := year(as.IDate(closingdate, '%d/%m/%Y'))]
In my case, that produces the following:
df1
ID closingdate infoyear
1 31/12/2005 2005
2 01/12/2009 2009
3 02/01/2002 2002
4 09/10/2000 2000
5 15/11/2007 2007
But instead of this result for column infoyear, I would like 1 year before closingdate (as the result presented before).
How can I solve a problem like this in R? Thank you!
You can try
df1 <- read.table(header=TRUE, text="ID closingdate
1 31/12/2005
2 01/12/2009
3 02/01/2002
4 09/10/2000
5 15/11/2007")
setDT(df1)
df1[, closingdate:= as.IDate(closingdate,"%d/%m/%Y")]
df1[, infoyear:= year(closingdate)-1]
df1
#Output
ID closingdate infoyear
1: 1 2005-12-31 2004
2: 2 2009-12-01 2008
3: 3 2002-01-02 2001
4: 4 2000-10-09 1999
5: 5 2007-11-15 2006
I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is an excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is add another row, for each country and year, with UN_$sector = Residual. Its value should be the UN value of sector Total minus the sum of column UN over the sectors c("1", "2", "3", "4", "5"), for the given year AND country.
This is what it should look like:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating residual first and then stack it with other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector!='Total'), sum),
aggregate(UN ~ country + year, data = subset(df, sector=='Total'), sum),
by=c("country", "year")),
{UN <- UN.y - UN.x
sector = 'Residual'})
# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector!='Total'),
agg[c("country", "year", "sector", "UN")],
subset(df, sector=='Total'))
# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)),])
row.names(final_df) <- NULL
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
I think there are multiple ways you can do this. What I would recommend is taking advantage of the tidyverse suite of packages, which includes dplyr.
Without getting too far into what dplyr and the tidyverse can achieve, we can use dplyr's group_by(...), summarise(...), arrange(...) and bind_rows(...) functions. There are also plenty of tutorials, cheat sheets, and documentation for all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame that contains all of the Residual values and then bring it back into your original data frame.
Step 1: Calculating all residual values
For each country and year, we want the Total value minus the sum of the UN values of sectors 1 to 5. We can achieve this with
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total']) - sum(UN[sector != 'Total'], na.rm = TRUE))
Step 2: Add sector column to res_UN with value 'Residual'
This yields a data frame that contains country, year, and UN. We now need to add a column sector with the value 'Residual' to satisfy your specifications.
res_UN$sector = 'Residual'
Step 3: Add res_UN back to UN_ and order accordingly
res_UN and UN_ now share the columns country, year, sector and UN, so they can be added back together. Sorting by sector puts 'Residual' just before 'Total'.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together should answer your question, and it can be achieved in a couple of lines!
TLDR:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN[sector == 'Total']) - sum(UN[sector != 'Total'], na.rm = TRUE))
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
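To make the residual calculation concrete, here is a self-contained sketch on the AT 1990 rows from the question. It assumes the residual is Total minus the sum of the numbered sectors, as the question asks, and that each country and year has one Total row. The name result is only illustrative.

```r
library(dplyr)

# AT 1990 rows copied from the question.
UN_ <- data.frame(
  country = "AT",
  year    = 1990,
  sector  = c("1", "2", "3", "4", "5", "Total"),
  UN      = c(1.407555, 1.037137, 4.769618, 2.455139, 2.238618, 7.869005),
  stringsAsFactors = FALSE
)

# Residual = Total minus the sum of the numbered sectors, per country and year.
res_UN <- UN_ %>%
  group_by(country, year) %>%
  summarise(UN = sum(UN[sector == "Total"]) -
                 sum(UN[sector != "Total"], na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(sector = "Residual")

# Append the residual row and sort so that "Residual" lands just before "Total".
result <- bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
result
```

The residual here is negative because, in the sample data, the numbered sectors add up to more than the Total row.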
I have data organized by two ID variables, Year and Country, like so:
Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8
I'd like to keep Year as an ID variable, but create multiple columns for VarA and VarB, one for each value of Country (I'm not picky about column order), to make the following table:
Year VarA.Canada VarA.USA VarB.Canada VarB.USA
2014 0 NA 10 NA
2015 6 1 5 3
2016 7 2 8 2
I managed to do this with the following code:
require(data.table)
require(reshape2)
data <- as.data.table(read.table(header=TRUE, text='Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8'))
molten <- melt(data, id.vars=c('Year', 'Country'))
molten[,variable:=paste(variable, Country, sep='.')]
recast <- dcast(molten, Year ~ variable)
But this seems a bit hacky (especially editing the default-named variable field). Can I do it with fewer function calls? Ideally I could just call one function, specifying the columns to drop as IDs and the formula for creating new variable names.
Using dcast you can cast multiple value.vars at once (from data.table v1.9.6 on). Try:
dcast(data, Year ~ Country, value.var = c("VarA","VarB"), sep = ".")
# Year VarA.Canada VarA.USA VarB.Canada VarB.USA
#1: 2014 0 NA 10 NA
#2: 2015 6 1 5 3
#3: 2016 7 2 8 2
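If you would rather not depend on data.table at all, base R's reshape() can also do this in one call. This is a sketch of an alternative, not the method above; it assumes a plain data frame, and the column order differs from the dcast output (columns are grouped by country rather than by variable).

```r
data <- read.table(header = TRUE, text = "Year Country VarA VarB
2015 USA 1 3
2016 USA 2 2
2014 Canada 0 10
2015 Canada 6 5
2016 Canada 7 8")

# One row per Year; one VarA.<Country> and VarB.<Country> column per country.
wide <- reshape(data, idvar = "Year", timevar = "Country", direction = "wide")

# Sort by Year, since reshape() does not order the rows for us.
wide <- wide[order(wide$Year), ]
wide
```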
I have a data frame with a column of text formatted by NBA season as such:
Year
2014-15
2013-14
2012-13
...
1999-00
I need to reformat this to show only the second year. This is a small data set, and I don't mind manually fixing the 1999-00 value, but I can't figure out how to combine the first two and the last two characters of each value. I tried something like:
paste(data$Year[1:2],data$Year[6:7])
To get:
Year
2015
2014
2013
...
2000
I think it would be simplest to just extract the first year and add one:
as.numeric(substr(data$Year, 1, 4)) + 1
# [1] 2003 2002 2001 2000 1999
Data:
(data <- data.frame(Year=c("2002-03", "2001-02", "2000-01", "1999-00", "1998-99")))
# Year
# 1 2002-03
# 2 2001-02
# 3 2000-01
# 4 1999-00
# 5 1998-99