How to 'stretch' the cell of a column from a data frame in R - r

'stretch' may not be the most suitable way to put it, but I can't come up with any other word.
I have a data frame like this :
var1 <- c(rep(0, each=9),1999,rep(0, each=9),2000,rep(0, each=9),2001)
var2 <- c(rnorm(n=30))
df1 <- data.frame(var1,var2)
What I want to do is to replace every 0 from the column var1 by the next number encountered in the column. Hence I want sthg like:
var1 <- c(rep(1999, each=10),rep(2000, each=10),rep(2001, each=10))
var2 <- c(rnorm(n=30))
df2 <- data.frame(var1,var2)
With var2 having specific and ordered values I don't want to move around.
The thing is, the data frame is 500 000 rows long, so I would like not to find the row number of every var1 different from 0.
(it's likely that such question has been asked before, but since I couldn't find another word than 'stretch'...)

One way using na.locf from zoo:
library(zoo)
#convert zeros to NA in order to use na.locf afterwards
df1$var1[df1$var1 == 0] <- NA
#fromLast carries the observations backwards
df1$var1 <- na.locf(df1$var1, fromLast = TRUE)
Out:
> df1
var1 var2
1 1999 -0.04750614
2 1999 -0.35462388
3 1999 0.30700748
4 1999 1.09506443
5 1999 -0.61049306
6 1999 0.66687294
7 1999 0.54623236
8 1999 -0.04848903
9 1999 -0.56502719
10 1999 0.08067966
11 2000 -0.05474748
12 2000 0.27380898
13 2000 -0.21283353
14 2000 -0.89820808
15 2000 -0.18752047
16 2000 0.21827094
17 2000 0.56370895
18 2000 -1.21738551
19 2000 -0.61426847
20 2000 -1.34144736
21 2001 -0.52697208
22 2001 0.90209640
23 2001 -0.52040468
24 2001 -0.37432746
25 2001 -0.21218776
26 2001 0.88372231
27 2001 0.54274394
28 2001 0.06127087
29 2001 0.04263164
30 2001 0.52294204

Related

multiplying column from data frame 1 by a condition found in data frame 2

I have two separate data frame and what I am trying to do is that for each year, I want to check data frame 2 (in the same year) and multiply a column from data frame 1 by the found number. So for example, imagine my first data frame is:
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
year price
1 2001 1000
2 2003 1000
3 2001 1000
4 2004 1000
5 2006 1000
6 2007 1000
7 2008 1000
8 2008 1000
9 2001 1000
10 2009 1000
11 2001 1000
Now, I have a second data frame which includes inflation conversion rate (code from #akrun)
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data<-inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100))
ref_year ref_inf final_inf
1 2010 2.0 1.020000
2 2009 3.0 1.050600
3 2008 1.0 1.061106
4 2007 2.2 1.084450
5 2006 1.3 1.098548
6 2005 1.5 1.115026
7 2004 1.9 1.136212
8 2003 1.8 1.156664
9 2002 1.9 1.178640
10 2001 1.9 1.201035
What I want to do is that for example for the first row of data frame 1, it's the year 2001, so I go and found a conversion for the year 2001 from data frame 2 which is 1.201035 and then multiply the price in a data frame 1 by this found conversion rate.
So the result should look like this:
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
is there any way to do this without using else and if commands?
We can do a join on the 'year' with 'ref_year' and create the new column by assigning (:=) the output of product of 'price' and 'final_inf'
library(data.table)
setDT(df)[inf_data, after_conv := price * final_inf, on = .(year = ref_year)]
-output
df
# year price after_conv
# 1: 2001 1000 1201.035
# 2: 2003 1000 1156.664
# 3: 2001 1000 1201.035
# 4: 2004 1000 1136.212
# 5: 2006 1000 1098.548
# 6: 2007 1000 1084.450
# 7: 2008 1000 1061.106
# 8: 2008 1000 1061.106
# 9: 2001 1000 1201.035
#10: 2009 1000 1050.600
#11: 2001 1000 1201.035
Since the data is already being processed by dplyr, we can also solve this problem with dplyr. A dplyr based solution joins the data with the reference data by year and calculates after_conv.
year <- c(2001,2003,2001,2004,2006,2007,2008,2008,2001,2009,2001)
price <- c(1000,1000,1000,1000,1000,1000,1000,1000,1000,1000,1000)
df <- data.frame(year, price)
library(dplyr)
ref_inf <- c(2,3,1,2.2,1.3,1.5,1.9,1.8,1.9,1.9)
ref_year<- seq(2010,2001)
inf_data <- data.frame(ref_year,ref_inf)
inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
rename(year = ref_year) %>%
left_join(df,.) %>%
mutate(after_conv = price * final_inf ) %>%
select(year,price,after_conv)
We use left_join() to keep the data ordered in the original order of df as well as ensure rows in inf_data only contribute to the output if they match at least one row in df. We use . to reference the data already in the pipeline as the right side of the join, merging in final_inf so we can use it in the subsequent mutate() function. We then select() to keep the three result columns we need.
...and the output:
Joining, by = "year"
year price after_conv
1 2001 1000 1201.035
2 2003 1000 1156.664
3 2001 1000 1201.035
4 2004 1000 1136.212
5 2006 1000 1098.548
6 2007 1000 1084.450
7 2008 1000 1061.106
8 2008 1000 1061.106
9 2001 1000 1201.035
10 2009 1000 1050.600
11 2001 1000 1201.035
We can save the result to the original df by writing the result of the pipeline to df.
inf_data %>%
mutate(final_inf = cumprod(1 + ref_inf/100)) %>%
rename(year = ref_year) %>%
left_join(df,.) %>%
mutate(after_conv = price * final_inf ) %>%
select(year,price,after_conv) -> df

How to create a loop for sum calculations which then are inserted into a new row?

I have tried to find a solution via similar topics, but haven't found anything suitable. This may be due to the search terms I have used. If I have missed something, please accept my apologies.
Here is a excerpt of my data UN_ (the provided sample should be sufficient):
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
AT 1990 Total 7.869005
AT 1991 1 1.484667
AT 1991 2 1.001578
AT 1991 3 4.625927
AT 1991 4 2.515453
AT 1991 5 2.702081
AT 1991 Total 8.249567
....
BE 1994 1 3.008115
BE 1994 2 1.550344
BE 1994 3 1.080667
BE 1994 4 1.768645
BE 1994 5 7.208295
BE 1994 Total 1.526016
BE 1995 1 2.958820
BE 1995 2 1.571759
BE 1995 3 1.116049
BE 1995 4 1.888952
BE 1995 5 7.654881
BE 1995 Total 1.547446
....
What I want to do is, to add another row with UN_$sector = Residual. The value of residual will be (UN_$sector = Total) - (the sum of column UN for the sectors c("1", "2", "3", "4", "5")) for a given year AND country.
This is how it should look like:
country year sector UN
AT 1990 1 1.407555
AT 1990 2 1.037137
AT 1990 3 4.769618
AT 1990 4 2.455139
AT 1990 5 2.238618
----> AT 1990 Residual TO BE CALCULATED
AT 1990 Total 7.869005
As I don't want to write many, many lines of code I'm looking for a way to automate this. I was told about loops, but can't really follow the concept at the moment.
Thank you very much for any type of help!!
Best,
Constantin
PS: (for Parfait)
country year sector UN ETS
UK 2012 1 190336512 NA
UK 2012 2 18107910 NA
UK 2012 3 8333564 NA
UK 2012 4 11269017 NA
UK 2012 5 2504751 NA
UK 2012 Total 580957306 NA
UK 2013 1 177882200 NA
UK 2013 2 20353347 NA
UK 2013 3 8838575 NA
UK 2013 4 11051398 NA
UK 2013 5 2684909 NA
UK 2013 Total 566322778 NA
Consider calculating residual first and then stack it with other pieces of data:
# CALCULATE RESIDUALS BY MERGED COLUMNS
agg <- within(merge(aggregate(UN ~ country + year, data = subset(df, sector!='Total'), sum),
aggregate(UN ~ country + year, data = subset(df, sector=='Total'), sum),
by=c("country", "year")),
{UN <- UN.y - UN.x
sector = 'Residual'})
# ROW BIND DIFFERENT PIECES
final_df <- rbind(subset(df, sector!='Total'),
agg[c("country", "year", "sector", "UN")],
subset(df, sector=='Total'))
# ORDER ROWS AND RESET ROWNAMES
final_df <- with(final_df, final_df[order(country, year, as.character(sector)),])
row.names(final_df) <- NULL
Rextester demo
final_df
# country year sector UN
# 1 AT 1990 1 1.407555
# 2 AT 1990 2 1.037137
# 3 AT 1990 3 4.769618
# 4 AT 1990 4 2.455139
# 5 AT 1990 5 2.238618
# 6 AT 1990 Residual -4.039062
# 7 AT 1990 Total 7.869005
# 8 AT 1991 1 1.484667
# 9 AT 1991 2 1.001578
# 10 AT 1991 3 4.625927
# 11 AT 1991 4 2.515453
# 12 AT 1991 5 2.702081
# 13 AT 1991 Residual -4.080139
# 14 AT 1991 Total 8.249567
# 15 BE 1994 1 3.008115
# 16 BE 1994 2 1.550344
# 17 BE 1994 3 1.080667
# 18 BE 1994 4 1.768645
# 19 BE 1994 5 7.208295
# 20 BE 1994 Residual -13.090050
# 21 BE 1994 Total 1.526016
# 22 BE 1995 1 2.958820
# 23 BE 1995 2 1.571759
# 24 BE 1995 3 1.116049
# 25 BE 1995 4 1.888952
# 26 BE 1995 5 7.654881
# 27 BE 1995 Residual -13.643015
# 28 BE 1995 Total 1.547446
I think there are multiple ways you can do this. What I may recommend is to take advantage of the tidyverse suite of packages which includes dplyr.
Without getting too far into what dplyr and tidyverse can achieve, we can talk about the power of dplyr's inline commands group_by(...), summarise(...), arrange(...) and bind_rows(...) functions. Also, there are tons of great tutorials, cheat sheets, and documentation on all tidyverse packages.
Although it is less and less relevant these days, we generally want to avoid for loops in R. Therefore, we will create a new data frame which contains all of the Residual values then bring it back into your original data frame.
Step 1: Calculating all residual values
We want to calculate the sum of UN values, grouped by country and year. We can achieve this by this value
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN, na.rm = T))
Step 2: Add sector column to res_UN with value 'residual'
This should yield a data frame which contains country, year, and UN, we now need to add a column sector which the value 'Residual' to satisfy your specifications.
res_UN$sector = 'Residual'
Step 3 : Add res_UN back to UN_ and order accordingly
res_UN and UN_ now have the same columns and they can now be added back together.
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)
Piecing this all together, should answer your question and can be achieved in a couple lines!
TLDR:
res_UN = UN_ %>% group_by(country, year) %>% summarise(UN = sum(UN, na.rm = T))`
res_UN$sector = 'Residual'
UN_ = bind_rows(UN_, res_UN) %>% arrange(country, year, sector)

How can I add new variable with MUTATE: growth rate?

I haven't coded for several months and now am stuck with the following issue.
I have the following dataset:
Year World_export China_exp World_import China_imp
1 1992 3445.534 27.7310 3402.505 6.2220
2 1993 1940.061 27.8800 2474.038 18.3560
3 1994 2458.337 39.6970 2978.314 3.3270
4 1995 4641.168 15.9790 5504.787 18.0130
5 1996 5680.688 74.1650 6939.291 25.1870
6 1997 7206.604 70.2440 8639.422 31.9030
7 1998 7069.725 99.6510 8530.293 41.5030
8 1999 5916.077 169.4593 6673.743 37.8139
9 2000 7331.588 136.2180 8646.253 47.3789
10 2001 7471.374 143.0542 8292.893 41.2899
11 2002 8074.975 217.4286 9092.341 46.4730
12 2003 9956.433 162.2522 11558.007 71.7753
13 2004 13751.671 282.8678 16345.452 157.0768
14 2005 15976.238 430.8655 16708.094 284.1065
15 2006 19728.935 398.6704 22344.856 553.6356
16 2007 24275.244 484.5276 28693.113 815.7914
17 2008 32570.781 613.3714 39381.251 1414.8120
18 2009 21282.228 173.9463 28563.576 1081.3720
19 2010 25283.462 475.7635 34884.450 1684.0839
20 2011 41418.670 636.5881 45759.051 2193.8573
21 2012 46027.529 432.6025 46404.382 2373.4535
22 2013 37132.301 460.7133 43022.550 2829.3705
23 2014 36046.461 640.2552 40502.268 2373.2351
24 2015 26618.982 781.0016 30264.299 2401.1907
25 2016 23537.354 472.7022 27609.884 2129.4806
What I need is simple: to compute growth rates of each variable, that is, find difference between two elements, divide it by first element and multiply by 100.
I'm trying to write a script, that ends up with error message:
trade_Ch %>%
mutate (
World_exp_grate = sapply(2:nrow(trade_Ch),function(i)((World_export[i]-World_export[i-1])/World_export[i-1]))
)
Error in mutate_impl(.data, dots) : Column World_exp_grate must
be length 25 (the number of rows) or one, not 24
although this piece of code gives me right values:
x <- sapply(2:nrow(trade_Ch),function(i)((trade_Ch$World_export[i]-trade_Ch$World_export[i-1])/trade_Ch$World_export[i-1]))
How can I correctly embedd the code into my MUTATE part from dplyr package?
OR
Is there is another elegant way to solve this issue?
library(dplyr)
df %>%
mutate_each(funs(chg = ((.-lag(.))/lag(.))*100), World_export:China_imp)
trade_Ch %>%
mutate(world_exp_grate = 100*(World_export - lag(World_export))/lag(World_export))
The problem is that you cannot calculate the World_exp_grate for your first row. Therefore you have to set it to NA.
One variant to solve this is
trade_Ch %>%
mutate (World_export_lag = lag(World_export),
World_exp_grate = (World_export - World_export_lag)/World_export_lag)) %>%
select(-World_export_lag)
lag shifts the vector by one position.
lag(1:5)
# [1] NA 1 2 3 4

Reshaping Data into panel form

I have data where the object name is a variable name like EPS, Profit etc. (around 25 such distinct objects)
The data is arranged like this :
EPS <- read.table(text = "
Year Microsoft Facebook
2001 12 20
2002 15 23
2003 16 19
", header = TRUE)
Profit <- read.table(text = "
Year Microsoft Facebook
2001 15 36
2002 19 40
2003 25 45
", header = TRUE)
I want output like this :
Year Co_Name EPS Profit
2001 Microsoft 12 15
2002 Microsoft 15 19
2003 Microsoft 16 25
2001 Facebook 20 36
2002 Facebook 23 40
2003 Facebook 19 45
How it can be done? Is there any way to arrange data of all variables as a single object? Data of each variable is imported into R from a csv file like EPS.csv, Profit.csv etc. Is there any way to create loop from importing to arranging data in a desired format?
Just for fun we can also achieve the same result using dplyr, tidyr and purrr.
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
list_of_csv <- list.files(path = ".", pattern = ".csv", full.names = TRUE)
file_name <- gsub(".csv", "", basename(list_of_csv))
list_of_csv %>%
map(~ read_csv(.)) %>%
map(~ gather(data = ., key = co_name, value = value, -year)) %>%
reduce(inner_join, by = c("year", "co_name")) %>%
setNames(., c("year", "co_name", file_name))
## Source: local data frame [6 x 4]
## year co_name eps profit
## (int) (fctr) (int) (int)
## 1 2001 microsoft 12 15
## 2 2002 microsoft 15 19
## 3 2003 microsoft 16 25
## 4 2001 facebook 20 36
## 5 2002 facebook 23 40
## 6 2003 facebook 19 45
We can get the datasets in a list. If we already created 'EPS', 'Profit' as objects, use mget to get those in a list, convert to a single data.table with rbindlist, melt to long format and reshape it back to 'wide' with dcast.
library(data.table)#v1.9.6+
DT <- rbindlist(mget(c('EPS', 'Profit')), idcol=TRUE)
DT1 <- dcast(melt(rbindlist(mget(c('EPS', 'Profit')), idcol=TRUE),
id.var=c('.id', 'Year'), variable.name='Co_Name'),
Year+Co_Name~.id, value.var='value')
DT1
# Year Co_Name EPS Profit
#1: 2001 Microsoft 12 15
#2: 2001 Facebook 20 36
#3: 2002 Microsoft 15 19
#4: 2002 Facebook 23 40
#5: 2003 Microsoft 16 25
#6: 2003 Facebook 19 45
If we need to arrange it, use order
DT1[order(factor(Co_Name, levels=unique(Co_Name)))]

Reducing rows and expanding columns of data.frame in R

I have this data.frame in R.
> a <- data.frame(year = c(2001,2001,2001,2001), country = c("Japan", "Japan","US","US"), type = c("a","b","a","b"), amount = c(35,67,39,45))
> a
year country type amount
1 2001 Japan a 35
2 2001 Japan b 67
3 2001 US a 39
4 2001 US b 45
How should I transform this into a data.frame that looks like this?
year country type.a type.b
1 2001 Japan 35 67
2 2001 US 39 45
Basically I want the number of rows to be the number of (year x country) pairs, and I want to create additional columns for each type.
base solution, but requires renaming columns and rows
reshape(a, v.names="amount", timevar="type", idvar="country", direction="wide")
year country amount.a amount.b
1 2001 Japan 35 67
3 2001 US 39 45
reshape2 solution
library(reshape2)
dcast(a, year+country ~ paste("type", type, sep="."), value.var="amount")
year country type.a type.b
1 2001 Japan 35 67
2 2001 US 39 45
Another way would be to use spread in the tidyr package and rename in the dplyr package to deliver the expected outcome.
library(dplyr)
library(tidyr)
spread(a,type, amount) %>%
rename(type.a = a, type.b = b)
# year country type.a type.b
#1 2001 Japan 35 67
#2 2001 US 39 45

Resources