multiplying column from data frame 1 by a condition found in data frame 2 - r

I have two separate data frames. For each row of data frame 1, I want to look up the same year in data frame 2 and multiply a column from data frame 1 by the number found there. For example, imagine my first data frame is:
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
year price
1 2001 1000
2 2003 1000
3 2001 1000
4 2004 1000
5 2006 1000
6 2007 1000
7 2008 1000
8 2008 1000
9 2001 1000
10 2009 1000
11 2001 1000
Now, I have a second data frame which holds the inflation conversion rates (code from #akrun):
library(dplyr)
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year <- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data <- inf_data %>%
  mutate(final_inf = cumprod(1 + ref_inf/100))
ref_year ref_inf final_inf
1 2010 2.0 1.020000
2 2009 3.0 1.050600
3 2008 1.0 1.061106
4 2007 2.2 1.084450
5 2006 1.3 1.098548
6 2005 1.5 1.115026
7 2004 1.9 1.136212
8 2003 1.8 1.156664
9 2002 1.9 1.178640
10 2001 1.9 1.201035
For example, the first row of data frame 1 has the year 2001, so I look up the conversion rate for 2001 in data frame 2, which is 1.201035, and multiply the price in data frame 1 by that rate.
So the result should look like this:
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
Is there any way to do this without using if and else statements?

We can do a join on 'year' with 'ref_year' and create the new column by assigning (:=) the product of 'price' and 'final_inf':
library(data.table)
setDT(df)[inf_data, after_conv := price * final_inf, on = .(year = ref_year)]
-output
df
# year price after_conv
# 1: 2001 1000 1201.035
# 2: 2003 1000 1156.664
# 3: 2001 1000 1201.035
# 4: 2004 1000 1136.212
# 5: 2006 1000 1098.548
# 6: 2007 1000 1084.450
# 7: 2008 1000 1061.106
# 8: 2008 1000 1061.106
# 9: 2001 1000 1201.035
#10: 2009 1000 1050.600
#11: 2001 1000 1201.035
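For comparison, the same lookup can be sketched in base R with match(), which also avoids if/else. This is an alternative to the join above rather than the answer's own method, and it assumes each year appears only once in inf_data. The data is rebuilt here so the snippet runs on its own.

```r
year <- c(2001, 2003, 2001, 2004, 2006, 2007, 2008, 2008, 2001, 2009, 2001)
price <- rep(1000, 11)
df <- data.frame(year, price)

ref_inf <- c(2, 3, 1, 2.2, 1.3, 1.5, 1.9, 1.8, 1.9, 1.9)
ref_year <- seq(2010, 2001)
inf_data <- data.frame(ref_year, ref_inf,
                       final_inf = cumprod(1 + ref_inf / 100))

# match() gives, for each df$year, the row position of that year in inf_data
idx <- match(df$year, inf_data$ref_year)

# Index the conversion rates by those positions and multiply element-wise
df$after_conv <- df$price * inf_data$final_inf[idx]
df
```

A year that is missing from inf_data would give NA in after_conv, since match() returns NA for it.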

Since the data is already being processed by dplyr, we can also solve this problem with dplyr. A dplyr based solution joins the data with the reference data by year and calculates after_conv.
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
library(dplyr)
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
rename(year = ref_year) %>%
left_join(df,.) %>%
mutate(after_conv = price * final_inf ) %>%
select(year,price,after_conv)
We use left_join() to keep the data ordered in the original order of df as well as ensure rows in inf_data only contribute to the output if they match at least one row in df. We use . to reference the data already in the pipeline as the right side of the join, merging in final_inf so we can use it in the subsequent mutate() function. We then select() to keep the three result columns we need.
...and the output:
Joining, by = "year"
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
We can save the result to the original df by writing the result of the pipeline to df.
inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
rename(year = ref_year) %>%
left_join(df,.) %>%
mutate(after_conv = price * final_inf ) %>%
select(year,price,after_conv) -> df
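As a variation on the pipeline above, the join keys can be named explicitly through the by argument of left_join(). This is only a sketch: it skips the rename() step and should suppress the "Joining, by" message, and the name result is chosen here for illustration.

```r
library(dplyr)

year <- c(2001, 2003, 2001, 2004, 2006, 2007, 2008, 2008, 2001, 2009, 2001)
price <- rep(1000, 11)
df <- data.frame(year, price)

ref_inf <- c(2, 3, 1, 2.2, 1.3, 1.5, 1.9, 1.8, 1.9, 1.9)
ref_year <- seq(2010, 2001)
inf_data <- data.frame(ref_year, ref_inf) %>%
  mutate(final_inf = cumprod(1 + ref_inf / 100))

# by = c("year" = "ref_year") matches df$year against inf_data$ref_year
result <- df %>%
  left_join(inf_data, by = c("year" = "ref_year")) %>%
  mutate(after_conv = price * final_inf) %>%
  select(year, price, after_conv)
result
```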

Related

Updating table with custom numbers

Below is my dataset, which contains four columns id, year, quarter, and price.
df <- data.frame(id = c(1,2,1,2),
year = c(2010,2010,2011,2011),
quarter = c("2010-q1","2010-q2","2011-q1","2011-q2"),
price = c(10,50,10,50))
Now I want to expand this dataset to cover 2012 and 2013. First, I want to copy the rows for 2010 and 2011 and append them below; then, in the copies, replace the years with 2012 and 2013 and the quarters with 2012-q1, 2012-q2, 2013-q1 and 2013-q2.
Can anybody help me prepare a table like the one below?
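library(dplyr)
# mutate() evaluates its arguments in order, so quarter is built from the already shifted year
df %>%
  mutate(year = year + 2, quarter = paste0(year, "-q", id)) %>%
  bind_rows(df, .)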
id year quarter price
1 1 2010 2010-q1 10
2 2 2010 2010-q2 50
3 1 2011 2011-q1 10
4 2 2011 2011-q2 50
5 1 2012 2012-q1 10
6 2 2012 2012-q2 50
7 1 2013 2013-q1 10
8 2 2013 2013-q2 50
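The pipeline above builds the quarter label from id, which works here because id happens to equal the quarter number. A base R sketch that reuses the existing "-q1"/"-q2" suffix instead, assuming every quarter value starts with a four-digit year (the names shifted and expanded are made up for this example):

```r
df <- data.frame(id = c(1, 2, 1, 2),
                 year = c(2010, 2010, 2011, 2011),
                 quarter = c("2010-q1", "2010-q2", "2011-q1", "2011-q2"),
                 price = c(10, 50, 10, 50))

# Copy the rows and shift the year by two
shifted <- df
shifted$year <- df$year + 2

# Strip the leading four-digit year and prepend the shifted one
shifted$quarter <- paste0(shifted$year, sub("^[0-9]{4}", "", df$quarter))

# Append the shifted copy below the original rows
expanded <- rbind(df, shifted)
expanded
```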

Rearranging data columns in R

I have an Excel file that contains two columns: Car_Model_Year and Cost.
Car_Model_Year Cost
2018 25000
2010 9000
2005 13000
2002 35000
1995 8000
I want to sort my data as follows:
Car_Model_Year Cost
1995 8000
2002 35000
2005 13000
2010 9000
2018 25000
So now, the Car_Model_Year are sorted in ascending order. I wrote the following R code, but I don't know how to rearrange the values of the variable Cost accordingly.
my_data <- read.csv2("data.csv")
my_data <- sort(my_data$Car_Model_Year, decreasing = FALSE)
Any help will be very appreciated!
Are you looking for this?
sorted_df <- df[order(df$Car_Model_Year, df$Cost),]
print(sorted_df)
# A tibble: 5 x 2
Car_Model_Year Cost
<dbl> <dbl>
1 1995 8000
2 2002 35000
3 2005 13000
4 2010 9000
5 2018 25000
Note that you can use a sign (+ or -) to indicate ascending or descending order:
# Sort by Car_Model_Year (descending) and Cost (ascending)
sorted_df <- df[order(-df$Car_Model_Year, df$Cost),]
Does the below approach work? To sort by two or more columns, you just add them to the order() - i.e. order(var1, var2,...)
my_data <- data.frame(Car_Model_Year=c(2018,2010,2005,2002,1995),
Cost=c(25000,9000,13000,35000,8000))
sorted <- my_data[order(my_data$Car_Model_Year, my_data$Cost),]
> print(sorted)
Car_Model_Year Cost
5 1995 8000
4 2002 35000
3 2005 13000
2 2010 9000
1 2018 25000
dplyr::arrange() makes it easy:
library(dplyr)
my_data %>% arrange(Car_Model_Year, Cost)
To sort Cost in descending order within each model year instead:
my_data %>% arrange(Car_Model_Year, desc(Cost))

Correlation matrix using a moving window in data

I'm trying to create correlation matrices using a 5-year moving window, so using data from 2000-2005, 2001-2006 etc.
Here's some example data:
d <- data.frame(v1=seq(2000,2015,1),
v2=rnorm(16),
v3=rnorm(16),
v4=rnorm(16))
v1 v2 v3 v4
1 2000 -1.0907101 -1.3697559 0.52841978
2 2001 -1.3143654 -0.6443144 -0.44653227
3 2002 -0.1762554 2.0513870 -1.07372405
4 2003 0.1668012 -1.6985891 -0.32962331
5 2004 0.6006146 -0.1843326 -0.56936906
6 2005 -1.3113762 -0.3854868 -1.61247953
7 2006 3.1914908 -0.2635004 0.04689692
8 2007 0.7935639 -1.0844792 -0.25895397
9 2008 1.4217089 1.9572254 1.27221568
10 2009 -0.4192379 -0.5451291 0.18891557
11 2010 -0.1304170 -1.4676465 0.17137507
12 2011 1.2212943 0.9523027 -0.39269076
13 2012 -0.4464840 -0.7117153 -0.71619199
14 2013 0.1879822 1.0693801 -0.44835571
15 2014 -0.5602422 -0.7036433 0.53531753
16 2015 1.4322259 1.5398703 1.00294281
I've created new columns start and end for each group using dplyr:
d<-d%>%
mutate(start=floor(v1),
end=ifelse(ceiling(v1)==start,start+5,ceiling(v1)))
I tried group_by(start,end) and then running the correlation, but that didn't work. Is there a quicker way than filtering the data to do this?
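This returns a list of correlation matrices, one per 5-year window (2000-2004, 2001-2005, ..., 2011-2015):
library(dplyr)
lapply(2000:2011, function(y) {
  d %>%
    filter(v1 >= y, v1 <= y + 4) %>%  # keep the five years y .. y+4
    select(-v1) %>%                   # drop the year column before correlating
    cor()
})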
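If the matrices need to be looked up later, one option is to keep them in a list named by window. The sketch below uses base R only; the seed, the name cors and the "start-end" labels are assumptions made for the example.

```r
set.seed(1)  # arbitrary seed so the example is reproducible
d <- data.frame(v1 = 2000:2015,
                v2 = rnorm(16),
                v3 = rnorm(16),
                v4 = rnorm(16))

starts <- 2000:2011

# One correlation matrix per window of five consecutive years
cors <- lapply(starts, function(y) {
  in_window <- d$v1 >= y & d$v1 <= y + 4
  cor(d[in_window, c("v2", "v3", "v4")])
})

# Label each matrix with its window, e.g. "2000-2004"
names(cors) <- paste(starts, starts + 4, sep = "-")

cors[["2000-2004"]]
```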

How to 'stretch' the cell of a column from a data frame in R

'stretch' may not be the most suitable way to put it, but I can't come up with any other word.
I have a data frame like this :
var1 <- c(rep(0, each=9),1999,rep(0, each=9),2000,rep(0, each=9),2001)
var2 <- c(rnorm(n=30))
df1 <- data.frame(var1,var2)
What I want to do is replace every 0 in the column var1 with the next non-zero number encountered further down the column. So I want something like:
var1 <- c(rep(1999, each=10),rep(2000, each=10),rep(2001, each=10))
var2 <- c(rnorm(n=30))
df2 <- data.frame(var1,var2)
With var2 having specific and ordered values I don't want to move around.
The thing is, the data frame is 500 000 rows long, so I would like not to find the row number of every var1 different from 0.
(it's likely that such question has been asked before, but since I couldn't find another word than 'stretch'...)
One way using na.locf from zoo:
library(zoo)
#convert zeros to NA in order to use na.locf afterwards
df1$var1[df1$var1 == 0] <- NA
#fromLast carries the observations backwards; na.rm = FALSE keeps the vector length unchanged if trailing NAs remain
df1$var1 <- na.locf(df1$var1, fromLast = TRUE, na.rm = FALSE)
Out:
> df1
var1 var2
1 1999 -0.04750614
2 1999 -0.35462388
3 1999 0.30700748
4 1999 1.09506443
5 1999 -0.61049306
6 1999 0.66687294
7 1999 0.54623236
8 1999 -0.04848903
9 1999 -0.56502719
10 1999 0.08067966
11 2000 -0.05474748
12 2000 0.27380898
13 2000 -0.21283353
14 2000 -0.89820808
15 2000 -0.18752047
16 2000 0.21827094
17 2000 0.56370895
18 2000 -1.21738551
19 2000 -0.61426847
20 2000 -1.34144736
21 2001 -0.52697208
22 2001 0.90209640
23 2001 -0.52040468
24 2001 -0.37432746
25 2001 -0.21218776
26 2001 0.88372231
27 2001 0.54274394
28 2001 0.06127087
29 2001 0.04263164
30 2001 0.52294204
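The same backward fill can be sketched without zoo by using vectorised base R indexing, which should scale to a few hundred thousand rows. This is an alternative technique, not what na.locf does internally; the helper names is_value, nz, remaining and filled are made up for the example.

```r
var1 <- c(rep(0, 9), 1999, rep(0, 9), 2000, rep(0, 9), 2001)

is_value <- var1 != 0        # TRUE where a real year is stored
nz <- which(is_value)        # positions of the non-zero entries

# For each row, count the non-zero entries at or after that row
remaining <- rev(cumsum(rev(is_value)))

# The next non-zero entry is the one 'remaining' places from the end of nz;
# rows after the last non-zero entry end up as NA
filled <- var1[nz[length(nz) - remaining + 1]]
filled
```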

Aggregate using a certain value

I'm trying to use the aggregate function in R to get the total EMISSIONS, organized by YEAR, but only for rows where FIPS is equal to 24510. The following code gives me the right result, but in addition it also adds the overall EMISSIONS, summed across all FIPS values. What am I missing here?
This is the function I'm using:
sum <- aggregate(NEI$Emissions, list(Year = NEI$year, NEI$fips == 24510), sum);
This is the output:
Year Group.2 x
1 1999 FALSE 7329692.557
2 2002 FALSE 5633326.582
3 2005 FALSE 5451611.723
4 2008 FALSE 3462343.556
5 1999 TRUE 3274.180
6 2002 TRUE 2453.916
7 2005 TRUE 3091.354
8 2008 TRUE 1862.282
This is the output that I would like:
Year x
1 1999 3274.180
2 2002 2453.916
3 2005 3091.354
4 2008 1862.282
Should I be using subset separately or can this be done with aggregate alone?
Using this sample
set.seed(15)
NEI <- data.frame(year=2000:2004, fips=rep(c(24510,57399), each=5), Emissions=rnorm(10))
you could use the command
mysum <- aggregate(Emissions~year, subset(NEI, fips == 24510), sum);
to get
year Emissions
1 2000 0.2588229
2 2001 1.8311207
3 2002 -0.3396186
4 2003 0.8971982
5 2004 0.4880163
(also, don't save a value to a variable named sum -- that will conflict with the base function sum())
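On the question of doing this with aggregate alone: the formula method of aggregate() accepts a subset argument, so the filter can be passed in the same call. A minimal sketch on the sample data above, with total_by_year as an illustrative name:

```r
set.seed(15)
NEI <- data.frame(year = 2000:2004,
                  fips = rep(c(24510, 57399), each = 5),
                  Emissions = rnorm(10))

# subset is evaluated inside NEI, so fips refers to the column
total_by_year <- aggregate(Emissions ~ year, data = NEI, FUN = sum,
                           subset = fips == 24510)
total_by_year
```

This should give the same table as wrapping the data in subset() first; which form reads better is a matter of taste.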
