How to add variable from one dataframe to another dataframe (several conditions) - r

I had a read through the existing topics, but nothing I've read matched the thing I want to do.
dataframe 1: newdata (excerpt)
country year sector emissions
Austria 1990 Total 6.229223e+04
Austria 1990 Regulated 3.826440e+04
Austria 1990 Unregulated 2.402783e+04
Austria 1991 Total 6.589968e+04
Austria 1991 Regulated 3.931820e+04
Austria 1991 Unregulated 2.658148e+04
dataframe 2: EUETS (excerpt)
country year emissions
Austria 2005 164925659
Belgium 2005 282762153
Croatia 2005 0
Cyprus 2005 16021583
Czech Republic 2005 288986144
Denmark 2005 171815416
Estonia 2005 71336242
What I want to do:
Add information from EUETS$emissions to a new column newdata$EUETS
This insertion should be based on country and year: the value goes into the row for that country and year where newdata$sector = "Regulated".
Rows where newdata$sector = "Unregulated" or newdata$sector = "Total" need to receive NA and under no circumstances 0.
If there is no corresponding information in EUETS$country and/or EUETS$year, NA should be inserted into newdata$EUETS.
If there is information in EUETS$emissions but no matching year and/or country for it in newdata, new rows should be created for this information, filling in the values from EUETS as above but inserting NA into newdata$emissions for the new rows and into newdata$EUETS for the new Total and Unregulated rows.
This should look like this:
country year sector emissions EUETS
Austria 1990 Total 6.229223e+04 NA
Austria 1990 Regulated 3.826440e+04 2516843
Austria 1990 Unregulated 2.402783e+04 NA
Austria 1991 Total 6.589968e+04 NA
Austria 1991 Regulated 3.931820e+04 446656
Austria 1991 Unregulated 2.658148e+04 NA
Liechtenstein 2005 Total NA NA
Liechtenstein 2005 Regulated NA 654612641
Liechtenstein 2005 Unregulated NA NA
Liechtenstein was only in EUETS$country and didn't exist in newdata$country and was consequently added to the latter dataframe.
This may be several questions/post in one, but I hope this is appropriate to ask here. I tried myself a few things, but didn't manage especially when it comes to filling in the values into the existing columns in newdata (country and year).
I appreciate help with any part of this task.
Thanks so much in advance!
Nordsee

First, change the EUETS column names and sector to what you want them to show up as in the end:
names(EUETS)[3] = "EUETS"
EUETS$sector = "Regulated"
Make sure your original sector column is a character, not a factor:
newdata$sector = as.character(newdata$sector)
Merge the data
result = merge(newdata, EUETS, all = TRUE)
For adding unrepresented countries back into EUETS, I'm not sure what year and emissions values you want to add in, so I'll ignore that for now. But basically you want to use merge again.
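A minimal sketch of the merge above on toy data, plus one possible way to also create the Total and Unregulated rows for country/year pairs that only exist in EUETS. The Austria rows follow the question's excerpt and the EUETS values are the illustrative ones from the desired output; the helper objects sectors, keys, grid and full are names made up for this example, and the second step is an assumption about what the asker wants rather than part of the original answer.

```r
# Toy versions of the two data frames from the question.
newdata <- data.frame(
  country   = "Austria",
  year      = rep(c(1990, 1991), each = 3),
  sector    = rep(c("Total", "Regulated", "Unregulated"), times = 2),
  emissions = c(62292.23, 38264.40, 24027.83, 65899.68, 39318.20, 26581.48),
  stringsAsFactors = FALSE
)
EUETS <- data.frame(
  country   = c("Austria", "Austria", "Liechtenstein"),
  year      = c(1990, 1991, 2005),
  emissions = c(2516843, 446656, 654612641),
  stringsAsFactors = FALSE
)

# Steps from the answer: rename, tag every EUETS row as "Regulated", full merge.
names(EUETS)[3] <- "EUETS"
EUETS$sector <- "Regulated"
result <- merge(newdata, EUETS, by = c("country", "year", "sector"), all = TRUE)

# Assumed extra step: build every country/year x sector combination and
# left-join the merged data onto it, so new countries get all three rows.
sectors <- data.frame(sector = c("Total", "Regulated", "Unregulated"),
                      stringsAsFactors = FALSE)
keys <- unique(result[, c("country", "year")])
grid <- merge(keys, sectors, by = NULL)  # by = NULL gives the Cartesian product
full <- merge(grid, result, by = c("country", "year", "sector"), all.x = TRUE)
```

Because only the "Regulated" rows of EUETS exist, every Total and Unregulated row ends up with NA (not 0) in the EUETS column, and the Liechtenstein rows get NA in emissions.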

Related

Combine rows with two matching columns in R [duplicate]

I have a df that resembles this:
Year Country Sales($M)
2013 Australia 120
2013 Australia 450
2013 Armenia 80
2013 Armenia 175
2013 Armenia 0
2014 Australia 500
2014 Australia 170
2014 Armenia 0
2014 Armenia 100
I'd like to combine the rows that match Year and Country, adding the Sales column. The result should be:
Year Country Sales($M)
2013 Australia 570
2013 Armenia 255
2014 Australia 670
2014 Armenia 100
I'm sure I could write a long loop to check whether Year and Country are the same and then add the Sales from those rows, but this is R so there must be a simple function that I'm totally missing.
Many thanks in advance.
library(tidyverse)
df %>%
group_by(Year,Country) %>%
summarise(Sales = sum(Sales))
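For comparison, a base R sketch of the same grouped sum using aggregate(), run on the question's data. It assumes the sales column is simply named Sales; the object name totals is made up for this example.

```r
# The data from the question, with the sales column named Sales (an assumption).
df <- data.frame(
  Year    = c(2013, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014),
  Country = c("Australia", "Australia", "Armenia", "Armenia", "Armenia",
              "Australia", "Australia", "Armenia", "Armenia"),
  Sales   = c(120, 450, 80, 175, 0, 500, 170, 0, 100),
  stringsAsFactors = FALSE
)

# Sum Sales within each Year/Country combination.
totals <- aggregate(Sales ~ Year + Country, data = df, FUN = sum)
print(totals)
```

The result has one row per Year/Country pair, although the row order may differ from the order shown in the question.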

decompose() for yearly time series in R

I'm trying to perform analysis on a time series data of inflation rates from the year 1960 to 2015. The dataset is a yearly time series over 56 years with 1 real value per each year, which is the following:
Year Inflation percentage
1960 1.783264746
1961 1.752021563
1962 3.57615894
1963 2.941176471
1964 13.35403727
1965 9.479452055
1966 10.81081081
1967 13.0532972
1968 2.996404315
1969 0.574712644
1970 5.095238095
1971 3.081105573
1972 6.461538462
1973 16.92815855
1974 28.60169492
1975 5.738605162
1976 -7.63438068
1977 8.321619342
1978 2.517518817
1979 6.253164557
1980 11.3652609
1981 13.11510484
1982 7.887270664
1983 11.86886396
1984 8.32157969
1985 5.555555556
1986 8.730811404
1987 8.798689021
1988 9.384775808
1989 3.26256011
1990 8.971233545
1991 13.87024609
1992 11.78781925
1993 6.362038664
1994 10.21150033
1995 10.22488756
1996 8.977149075
1997 7.16425362
1998 13.2308409
1999 4.669821024
2000 4.009433962
2001 3.684807256
2002 4.392199745
2003 3.805865922
2004 3.76723848
2005 4.246353323
2006 6.145522388
2007 6.369996746
2008 8.351816444
2009 10.87739112
2010 11.99229692
2011 8.857845297
2012 9.312445605
2013 10.90764331
2014 6.353194544
2015 5.872426595
'stock1' contains my data where the first column stands for Year, and the second for 'Inflation.percentage', as follows:
stock1<-read.csv("India-Inflation time series.csv", header=TRUE, stringsAsFactors=FALSE, as.is=TRUE)
The following is my code for creating the time series object:
stock <- ts(stock1$Inflation.percentage,start=(1960), end=(2015),frequency=1)
Following this, I am trying to decompose the time series object 'stock' using the following line of code:
decom_add <- (decompose(stock, type ="additive"))
Here I get an error:
Error in decompose(stock, type = "additive") : time series has no
or less than 2 periods
Why is this so? I initially thought it has something to do with frequency, but since the data is annual, the frequency has to be 1 right? If it is 1, then aren't there definitely more than 2 periods in the data?
Why isn't decompose() working? What am I doing wrong?
Thanks a lot in advance!
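decompose() needs a seasonal series: the frequency has to be greater than 1, and the series has to cover at least two full periods. You could try frequency=2, but that imposes an artificial seasonal period on annual data and so changes your model. In my view the better way is to load data that also contains a month column, so that the frequency is 12.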
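A small sketch that reproduces the error on the first ten values from the question and, as one possible alternative for annual data, estimates a trend with a centred 5-year moving average. The moving average is an illustration chosen here, not something the answer above proposes, and the names infl, msg and trend are made up for this example.

```r
# First ten annual values from the question (1960-1969), rounded.
infl <- ts(c(1.783, 1.752, 3.576, 2.941, 13.354,
             9.479, 10.811, 13.053, 2.996, 0.575),
           start = 1960, frequency = 1)

# decompose() refuses a series with frequency 1: there is no seasonal period.
msg <- tryCatch(decompose(infl, type = "additive"),
                error = function(e) conditionMessage(e))
print(msg)

# Assumed alternative: a centred 5-year moving average as a simple trend estimate.
# The first and last two years have no full window and come out as NA.
trend <- stats::filter(infl, rep(1 / 5, 5), sides = 2)
print(trend)
```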

getting minimum value after tapply

I started learning R recently, and I am completely new. Sorry if my question will seem lame to some of you but I have spent more than an hour trying to research how to do this using indexing or subset but couldn't find anything.
So here it goes :
I have a file which has
temperature lower rain month yr
10.8 6.5 12.2 1 1987
10.5 4.5 1.3 1 1987
7.5 -1 0.1 1 1987
This file contains 6,940 lines of data.
I read the file into R and wanted to find the average rainfall per year, for which I used:
A <- tapply(temperature,yr,mean)
this function returned:
1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005
13.27014 13.79126 15.54986 15.62986 14.11945 14.61612 14.30984 15.12877 15.81260 13.98082 15.63918 15.02568 15.63736 14.94071 14.90849 15.47589 16.03260 15.25109 15.06000
Now the question is I need the year where the average rain is the min.
when I apply :
min(A)
It returns 13.27014, which corresponds to the year 1987, but how do I query for the year that corresponds to the minimum value?
And when I try :
A[,min(A)]
It returns an error
Sorry again for the lame question but this is driving me crazy
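One common way to do this, sketched on the first few values of the tapply() output shown above: tapply() returns a named vector, so which.min() gives the position of the minimum and names() recovers the matching year. The name min_year is made up for this example.

```r
# A named vector shaped like the tapply() result (first four years only).
A <- c("1987" = 13.27014, "1988" = 13.79126, "1989" = 15.54986, "1990" = 15.62986)

# which.min() returns the index of the smallest value, keeping its name.
min_year <- names(which.min(A))
print(min_year)

# Equivalent logical indexing; this form returns every year tied for the minimum.
print(names(A)[A == min(A)])
```

A is one-dimensional, so it takes a single index; the comma in A[, min(A)] asks for a second dimension that does not exist, which is why that attempt fails.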

How to get column mean for specific rows only?

I need to get the mean of one column (here: score) for specific rows (here: years). Specifically, I would like to know the average score for three periods:
period 1: year <= 1983
period 2: year >= 1984 & year <= 1990
period 3: year >= 1991
This is the structure of my data:
country year score
Algeria 1980 -1.1201501
Algeria 1981 -1.0526943
Algeria 1982 -1.0561565
Algeria 1983 -1.1274560
Algeria 1984 -1.1353926
Algeria 1985 -1.1734330
Algeria 1986 -1.1327666
Algeria 1987 -1.1263586
Algeria 1988 -0.8529455
Algeria 1989 -0.2930265
Algeria 1990 -0.1564207
Algeria 1991 -0.1526328
Algeria 1992 -0.9757842
Algeria 1993 -0.9714060
Algeria 1994 -1.1422258
Algeria 1995 -0.3675797
...
The calculated mean values should be added to the df in an additional column ("mean"), i.e. same mean value for years of period 1, for those of period 2 etc.
This is how it should look like:
country year score mean
Algeria 1980 -1.1201501 -1.089
Algeria 1981 -1.0526943 -1.089
Algeria 1982 -1.0561565 -1.089
Algeria 1983 -1.1274560 -1.089
Algeria 1984 -1.1353926 -0.839
Algeria 1985 -1.1734330 -0.839
Algeria 1986 -1.1327666 -0.839
Algeria 1987 -1.1263586 -0.839
Algeria 1988 -0.8529455 -0.839
Algeria 1989 -0.2930265 -0.839
Algeria 1990 -0.1564207 -0.839
...
Every possible path I tried got easily super complicated - and I have to calculate the mean scores for different periods of time for over 90 countries ...
Many many thanks for your help!
datfrm$mean <-
with (datfrm, ave( score, findInterval(year, c(-Inf, 1984, 1991, Inf)), FUN= mean) )
The title question is a bit different than the real question and would be answered by using logical indexing. If one wanted only the mean for a particular subset say year >= 1984 & year <= 1990 it would be done via:
mn84_90 <- with(datfrm, mean(score[year >= 1984 & year <= 1990]) )
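As a check, here is the ave()/findInterval() approach applied to the 1980-1990 rows of the question's excerpt; the two period means agree with the -1.089 and -0.839 shown in the expected output.

```r
# The 1980-1990 rows from the question.
datfrm <- data.frame(
  country = "Algeria",
  year    = 1980:1990,
  score   = c(-1.1201501, -1.0526943, -1.0561565, -1.1274560,
              -1.1353926, -1.1734330, -1.1327666, -1.1263586,
              -0.8529455, -0.2930265, -0.1564207),
  stringsAsFactors = FALSE
)

# findInterval() assigns period 1 to years before 1984, period 2 to 1984-1990
# and period 3 to 1991 onwards; ave() repeats each group's mean on every row.
datfrm$mean <- with(datfrm,
  ave(score, findInterval(year, c(-Inf, 1984, 1991, Inf)), FUN = mean))

print(round(unique(datfrm$mean), 3))
```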
Since findInterval requires year to be sorted (as it is in your example) I'd be tempted to use cut in case it isn't sorted [proved wrong, thanks @DWin]. For completeness, the data.table equivalent (which scales to large data) is:
require(data.table)
DT = as.data.table(DF) # or just start with a data.table in the first place
DT[, mean:=mean(score), by=cut(year,c(-Inf,1984,1991,Inf),right=FALSE)] # right=FALSE so that 1984 and 1991 start a new period
Alternatively, findInterval, as DWin used, is likely faster:
DT[, mean:=mean(score), by=findInterval(year,c(-Inf,1984,1991,Inf))]
If the rows are ordered by year, I think the easiest way to accomplish this would be:
m80_83 <- mean(dataframe[1:4,3]) #Finds the mean of the values of column 3 for rows 1 through 4
m84_90 <- mean(dataframe[5:11,3]) #Rows 5 through 11 are the years 1984 to 1990
#etc.
If the rows are not ordered by year, I would use tapply like this.
list.of.means <- c(tapply(dataframe$score, cut(dataframe$year, c(0, 1983.5, 1990.5, 3000)), mean))
Here, tapply takes three parameters:
First, the data you want to do stuff with (in this case, dataframe$score).
Second, a grouping factor, here produced by cut, that splits that data up into groups. In this case, it will cut the data into three groups based on the dataframe$year values. Group 1 will include all rows with dataframe$year values from 0 to 1983.5, Group 2 will include all rows with dataframe$year values from 1983.5 to 1990.5, and Group 3 will include all rows with dataframe$year values from 1990.5 to 3000.
Third, a function that is applied to each group. This function will apply to the data you selected as your first parameter.
So, list.of.means should be a named vector of the 3 values you are looking for.

Create a moving sum of past levels of a variable, summed over for each level of 3 other variables, in R

I have a data.frame of the following structure (panel data), with 16 levels of time(quarters) 14 levels of geo (countries) and 20 levels of citizen, each of them repeating accordingly in the dataframe.
time geo citizen X
2008Q1 Belgium Afghanistan 22
2008Q1 Belgium Armenia 10
2008Q1 Belgium Bangladesh 25
2008Q1 Belgium Democratic Republic of the Congo 55
2008Q1 Belgium China (including Hong Kong) 5
2008Q1 Belgium Eritrea 8
I would like to create a new column lets say MOVSUM where it will sum variable X for each level of citizen and geo and time for the previous 4 quarters, so that I would have for each quarter, t, how many X's of each citizen in each geo were available during t-4 to t-1 quarters.
Thanks in advance
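One possible base R sketch: sort the rows by time within each geo/citizen pair, then use ave() with a small helper that sums the four preceding observations. It assumes that every geo/citizen pair has one row per quarter with no gaps, that time labels such as "2008Q1" sort chronologically as text, and that quarters with fewer than four predecessors should get NA. The toy data and the helper name lag_sum4 are made up for this example.

```r
# Toy panel: one country, two citizenships, six consecutive quarters.
panel <- data.frame(
  time    = rep(c("2008Q1", "2008Q2", "2008Q3", "2008Q4", "2009Q1", "2009Q2"), times = 2),
  geo     = "Belgium",
  citizen = rep(c("Afghanistan", "Armenia"), each = 6),
  X       = c(1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each position i, sum the values at i-4 .. i-1;
# positions without four predecessors get NA.
lag_sum4 <- function(x) {
  vapply(seq_along(x), function(i) {
    if (i <= 4) NA_real_ else sum(x[(i - 4):(i - 1)])
  }, numeric(1))
}

# Order within groups so that "previous rows" means "previous quarters".
panel <- panel[order(panel$geo, panel$citizen, panel$time), ]
panel$MOVSUM <- ave(panel$X, panel$geo, panel$citizen, FUN = lag_sum4)
print(panel)
```

If some quarters are missing for a group, the window would silently span more than a year, so the data would need to be completed first.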
