Sum column values that match year in another column in R - r

I have the following dataframe
y<-data.frame(c(2007,2008,2009,2009,2010,2010),c(10,13,10,11,9,10),c(5,6,5,7,4,7))
colnames(y)<-c("year","a","b")
I want a final data.frame that sums, within each year, the values in y$a into a new "a" column and the values in y$b into a new "b" column, so that it looks like this:
year a b
2007 10 5
2008 13 6
2009 21 12
2010 19 11
The following loop has done it for me,
years<- as.numeric(levels(factor(y$year)))
add.a<- numeric(length(y[,1]))
add.b<- numeric(length(y[,1]))
for(i in years){
ind<- which(y$year==i)
add.a[ind]<- sum(as.numeric(as.character(y[ind,"a"])))
add.b[ind]<- sum(as.numeric(as.character(y[ind,"b"])))
}
y.final<-data.frame(y$year,add.a,add.b)
colnames(y.final)<-c("year","a","b")
y.final<-subset(y.final,!duplicated(y.final$year))
but I just think there must be a faster command. Any ideas?
Kindest regards,
Marco

The aggregate function is a good choice for this sort of operation. Type ?aggregate for more information about it.
aggregate(cbind(a,b) ~ year, data = y, sum)
# year a b
#1 2007 10 5
#2 2008 13 6
#3 2009 21 12
#4 2010 19 11
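For comparison, the same sums can also be obtained with the non-formula interface of aggregate. This is only a minimal sketch that rebuilds the data frame y from the question; the result name y.sum is made up for illustration.

```r
# Data frame from the question
y <- data.frame(year = c(2007, 2008, 2009, 2009, 2010, 2010),
                a = c(10, 13, 10, 11, 9, 10),
                b = c(5, 6, 5, 7, 4, 7))

# Sum columns a and b within each year; the name used inside `by`
# becomes the name of the grouping column in the result
y.sum <- aggregate(y[c("a", "b")], by = list(year = y$year), FUN = sum)
y.sum
```

The formula version above is shorter; the `by` version can be handy when the grouping variable is not a column of the same data frame.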

Related

Selecting later date observation in panel data in R

I have the following panel data in R:
ID_column<- c("A","A","A","A","B","B","B","B")
Date_column<-c(20040131, 20041231,20051231,20061231, 20051231, 20061231, 20071231, 20081231)
Price_column<-c(12,13,17,19,35,38,39,41)
Data<- data.frame(ID_column, Date_column, Price_column)
#The data looks like this:
ID_column Date_column Price_column
1: A 20040131 12
2: A 20041231 13
3: A 20051231 17
4: A 20061231 19
5: B 20051231 35
6: B 20061231 38
7: B 20071231 39
8: B 20081231 41
My next aim is to convert the Date column, which is currently in a numeric YYYYMMDD format, into YYYY by simply taking the first four digits of each entry in the date column as follows:
Data$Date_column<- substr(Data$Date_column,1,4)
#The data then looks like:
ID_column Date_column Price_column
1 A 2004 12
2 A 2004 13
3 A 2005 17
4 A 2006 19
5 B 2005 35
6 B 2006 38
7 B 2007 39
8 B 2008 41
My ultimate goal is to use the plm package for panel data regression, but when I use pdata.frame to set the ID and Time variables as indices, I get error messages about duplicate ID/Time pairs (in this case rows 1 and 2, which would both be given the tag A,2004). To solve this, I would like to delete row 1 in the original data and keep only the newer observation from the year 2004. This would then give me unique ID/Time pairs across the whole data set.
Therefore I was hoping someone could help me with a loop or a package suggestion that keeps only the row with the later observation within a year whenever duplicates occur, and that also scales to larger data sets. I believe this involves a few conditional commands that I am currently having difficulty putting together. A loop that checks whether the first four digits of consecutive date observations are identical and then deletes the one with the "smaller" date (keeping the "larger" date) would do it, but my experience with loops is very limited.
Kind regards and thank you!
I'd recommend keeping Date_column as a reference for picking the later observation and creating a new column that holds only the year, since you want the latest observation in each year.
library(dplyr)
Data$year <- substr(Data$Date_column, 1, 4)
Data$Date_column <- lubridate::ymd(Data$Date_column)
# Sort newest first so that distinct() keeps the latest row per ID/year pair
Data %>% arrange(desc(Date_column)) %>%
  distinct(ID_column, year, .keep_all = TRUE) %>%
  arrange(Date_column)
ID_column Date_column Price_column year
1 A 2004-12-31 13 2004
2 A 2005-12-31 17 2005
3 B 2005-12-31 35 2005
4 A 2006-12-31 19 2006
5 B 2006-12-31 38 2006
6 B 2007-12-31 39 2007
7 B 2008-12-31 41 2008
Since we arranged by the actual date in descending order, distinct() keeps the newest row for each unique combination of ID and year, so the dropped rows are guaranteed to be the older ones. You can reverse the arrangement to keep the oldest occurrence instead.
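As a rough illustration of that reversed case, here is a base R sketch (no dplyr) that keeps the earliest observation per ID and year instead. It rebuilds the data from the question; the names sorted and earliest are made up for this example.

```r
# Panel data from the question
Data <- data.frame(
  ID_column    = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Date_column  = c(20040131, 20041231, 20051231, 20061231,
                   20051231, 20061231, 20071231, 20081231),
  Price_column = c(12, 13, 17, 19, 35, 38, 39, 41)
)
Data$year <- substr(Data$Date_column, 1, 4)

# Sort by ID and ascending date so the earliest row of each ID/year comes first
sorted <- Data[order(Data$ID_column, Data$Date_column), ]

# duplicated() flags every repeat of an ID/year pair after its first occurrence
earliest <- sorted[!duplicated(sorted[c("ID_column", "year")]), ]
earliest
```

Sorting with `order(Data$ID_column, -Data$Date_column)` instead would keep the latest row per pair, which matches the dplyr answer above.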

Transpose column and group dataframe [duplicate]

I'm trying to change a dataframe in R to group multiple rows by a measurement. The table has a location (km), a size (mm), a count of things in that size bin, a site and a year. I want to take the sizes, make a column from each one (2, 4 and 6 in this example), and place the corresponding count into the row for that location, site and year.
It seems like a combination of transposing and grouping, but I can't figure out a way to accomplish this in R. I've looked at t(), dcast() and aggregate(), but those aren't really close at all.
So I would go from something like this:
df <- data.frame(km=c(rep(32,3),rep(50,3)), mm=rep(c(2,4,6),2), count=sample(1:25,6), site=rep("A", 6), year=rep(2013, 6))
km mm count site year
1 32 2 18 A 2013
2 32 4 2 A 2013
3 32 6 12 A 2013
4 50 2 3 A 2013
5 50 4 17 A 2013
6 50 6 21 A 2013
To this:
km site year mm_2 mm_4 mm_6
1 32 A 2013 18 2 12
2 50 A 2013 3 17 21
Edit: I tried the solution in a suggested duplicate, but it did not work for me; I'm not really sure why. The answer below worked better.
As suggested in the comment above, we can use the sep argument in spread:
library(tidyr)
spread(df, mm, count, sep = "_")
km site year mm_2 mm_4 mm_6
1 32 A 2013 4 20 1
2 50 A 2013 15 14 22
As you mentioned dcast(), here is a method using it.
set.seed(1)
df <- data.frame(km=c(rep(32,3),rep(50,3)),
mm=rep(c(2,4,6),2),
count=sample(1:25,6),
site=rep("A", 6),
year=rep(2013, 6))
library(reshape2)
dcast(df, ... ~ mm, value.var="count")
# km site year 2 4 6
# 1 32 A 2013 13 10 20
# 2 50 A 2013 3 17 1
And if you want a bit of a challenge you can try the base function reshape().
df2 <- reshape(df, v.names="count", idvar="km", timevar="mm", ids="mm", direction="wide")
colnames(df2) <- sub("count.", "mm_", colnames(df2))
df2
# km site year mm_2 mm_4 mm_6
# 1 32 A 2013 13 10 20
# 4 50 A 2013 3 17 1
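In newer versions of tidyr (1.0.0 and later), spread() is superseded by pivot_wider(). A minimal sketch, assuming tidyr is installed and using the counts printed in the question's example table rather than a random sample:

```r
library(tidyr)

# Data as printed in the question (fixed counts instead of sample())
df <- data.frame(km = c(rep(32, 3), rep(50, 3)),
                 mm = rep(c(2, 4, 6), 2),
                 count = c(18, 2, 12, 3, 17, 21),
                 site = rep("A", 6),
                 year = rep(2013, 6))

# One column per mm value, prefixed with "mm_", filled with the counts
wide <- pivot_wider(df, names_from = mm, values_from = count,
                    names_prefix = "mm_")
wide
```

The result is a tibble with one row per km/site/year combination; wrap it in as.data.frame() if a plain data frame is needed.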

How to calculate the exponential in some columns of a dataframe in R?

I have a dataframe:
X Year Dependent.variable.1 Forecast.Dependent.variable.1
1 2009 12.42669703 12.41831191
2 2010 12.39309563 12.40043599
3 2011 12.36596964 12.38256006
4 2012 12.32067284 12.36468414
5 2013 12.303095 12.34680822
6 2014 NA 12.32893229
7 2015 NA 12.31105637
8 2016 NA 12.29318044
9 2017 NA 12.27530452
10 2018 NA 12.25742859
I want to calculate the exponential of the third and fourth columns. How can I do that?
In case your dataframe is called dfs, you can do the following:
dfs[c('Dependent.variable.1','Forecast.Dependent.variable.1')] <- exp(dfs[c('Dependent.variable.1','Forecast.Dependent.variable.1')])
which gives you:
X Year Dependent.variable.1 Forecast.Dependent.variable.1
1 1 2009 249371 247288.7
2 2 2010 241131 242907.5
3 3 2011 234678 238603.9
4 4 2012 224285 234376.5
5 5 2013 220377 230224.0
6 6 2014 NA 226145.1
7 7 2015 NA 222138.5
8 8 2016 NA 218202.9
9 9 2017 NA 214336.9
10 10 2018 NA 210539.5
In case you know the column numbers, this could then also simply be done by using:
dfs[,3:4] <- exp(dfs[,3:4])
which gives you the same result as above. I usually prefer to use the actual column names as the indices might change when the data frame is further processed (e.g. I delete columns, then the indices change).
Or you could do:
dfs$Dependent.variable.1 <- exp(dfs$Dependent.variable.1)
dfs$Forecast.Dependent.variable.1 <- exp(dfs$Forecast.Dependent.variable.1)
In case you want to store these columns in new variables (below they are called exp1 and exp2, respectively), you can do:
exp1 <- exp(dfs$Dependent.variable.1)
exp2 <- exp(dfs$Forecast.Dependent.variable.1)
In case you want to apply it to more than two columns and/or use more complicated functions, I highly recommend looking at apply/lapply.
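A minimal sketch of the lapply approach, using only the first three rows of the data frame from the question; the helper vector cols is just an illustrative name.

```r
# First three rows of the example data
dfs <- data.frame(
  Year = 2009:2011,
  Dependent.variable.1 = c(12.42669703, 12.39309563, 12.36596964),
  Forecast.Dependent.variable.1 = c(12.41831191, 12.40043599, 12.38256006)
)

# Columns to transform; extend this vector to cover more columns
cols <- c("Dependent.variable.1", "Forecast.Dependent.variable.1")

# lapply() applies exp() to each selected column and returns a list,
# which is assigned back into the same columns
dfs[cols] <- lapply(dfs[cols], exp)
dfs
```

Any function that takes a vector and returns a vector of the same length can replace exp here.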
Does that answer your question?

converting a dataframe in given format

Given data frame values are
Group year Value
A 2010 17
A 2011 18
F 2010 8
F 2011 9
I want to convert it into:
Year A F
2010 17 8
2011 18 9
Is there a simple way to do this?
library('reshape2')
df <- read.table(text=" Group year Value
A 2010 17
A 2011 18
F 2010 8
F 2011 9", header = TRUE)
# Name the value column explicitly so dcast() does not have to guess it
dfc <- dcast(df, year ~ Group, value.var = "Value")
Although the syntax can be confusing, I still find reshape in base R useful to know. Using the df provided by gauden:
reshape_df <- reshape(df, direction="wide", idvar="year", timevar="Group")
colnames(reshape_df) <- c("year","A","F")
This converts the data from "long" format to "wide". Usually, the time variable becomes the column name, but in this case we want "A" and "F" as the column names. Therefore, timevar is set to "Group".

writing the outcome of a nested loop to a vector object in R

I have the following data read into R as a data frame named "data_old":
yes year month
1 15 2004 5
2 9 2005 6
3 15 2006 3
4 12 2004 5
5 14 2005 1
6 15 2006 7
. . ... .
. . ... .
I have written a small loop which goes through the data and sums up the yes variable for each month/year combination:
year_f <- c(2004:2006)
month_f <- c(1:12)
for (i in year_f){
for (j in month_f){
x <- subset(data_old, month == j & year == i, select="yes")
if (nrow(x) > 0){
print(sum(x))
}
else{print("Nothing")}
}
}
My question is this: I can print the sum for each month/year combination in the terminal, but how do I store it in a vector? (The nested loop is giving me headaches trying to figure this out.)
Thomas
Another way,
library(plyr)
ddply(data_old,.(year,month),function(x) sum(x[1]))
year month V1
1 2004 5 27
2 2005 1 14
3 2005 6 9
4 2006 3 15
5 2006 7 15
Forget the loops, you want to use an aggregation function. There's a recent discussion of them in this SO question.
with(data_old, tapply(yes, list(year, month), sum))
is one of many solutions.
Also, you don't need to use c() when you aren't concatenating anything. Plain 1:12 is fine.
Just to add a third option:
aggregate(yes ~ year + month, FUN=sum, data=data_old)
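If you do want to keep the nested loop and store the sums in a vector, as the question literally asks, one possible sketch is to preallocate the vector and fill it with a running index. It uses only the six rows visible in the question; the names sums, labels and k are made up for illustration.

```r
# The six rows shown in the question (the remaining rows are elided there)
data_old <- data.frame(yes   = c(15, 9, 15, 12, 14, 15),
                       year  = c(2004, 2005, 2006, 2004, 2005, 2006),
                       month = c(5, 6, 3, 5, 1, 7))

year_f  <- 2004:2006
month_f <- 1:12

# One slot per year/month combination; NA marks combinations without data
sums   <- rep(NA_real_, length(year_f) * length(month_f))
labels <- character(length(sums))
k <- 0
for (i in year_f) {
  for (j in month_f) {
    k <- k + 1
    labels[k] <- paste(i, j, sep = "-")
    x <- data_old$yes[data_old$year == i & data_old$month == j]
    if (length(x) > 0) sums[k] <- sum(x)
  }
}
names(sums) <- labels

# Show only the combinations that actually occur in the data
sums[!is.na(sums)]
```

The aggregation functions above are still shorter and usually preferable; the loop version mainly shows where the assignment goes.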
