Calculate range of data with breaks for missing values - R

An example of my data is as follows:
site  <- c("A", "B", "C", "D")
year1 <- c(1990, 1990, 1990, 1990)
year2 <- c("", 1991, 1991, 1991)
year3 <- c(1992, 1992, 1992, 1992)
year4 <- c(1993, "", 1993, "")
year5 <- c(1994, 1994, 1994, 1994)
dat <- data.frame(site, year1, year2, year3, year4, year5)
I would like to calculate the range of years for each row (each site in this example), but with breaks where values are missing.
So I want to create a column that resembles something like this:
dat$year_range <- c("1990, 1992-1994", "1990-1992, 1994", "1990-1994", "1990-1992, 1994")
Thanks.

Here's a proposal; I suppose it could be done in a simpler way:
dat$year_range <- apply(dat[-1], 1, function(x) {
  x <- as.integer(x)
  paste(tapply(x[!is.na(x)], cumsum(is.na(x))[!is.na(x)], function(y)
    paste(unique(range(y)), collapse = "-")), collapse = ", ")
})
#   site year1 year2 year3 year4 year5      year_range
# 1    A  1990        1992  1993  1994 1990, 1992-1994
# 2    B  1990  1991  1992        1994 1990-1992, 1994
# 3    C  1990  1991  1992  1993  1994       1990-1994
# 4    D  1990  1991  1992        1994 1990-1992, 1994
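To see what the inner function is doing, here is a walk-through of the first row (site A); the intermediate values below are my own annotation, not part of the original answer:
x <- as.integer(c("1990", "", "1992", "1993", "1994"))  # 1990 NA 1992 1993 1994 (coercion warning for "")
cumsum(is.na(x))   # 0 1 1 1 1 -> group 0 = {1990}, group 1 = {1992, 1993, 1994}
tapply(x[!is.na(x)], cumsum(is.na(x))[!is.na(x)],
       function(y) paste(unique(range(y)), collapse = "-"))
#           0           1
#      "1990" "1992-1994"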

Here's some regex-fu for you (read/try from inside out):
gsub(',+', ',',                # final cleanup of multiple commas
  gsub('(^,+|,+$)', '',        # cleanup of commas at the start or end
    # the meat - take runs of adjacent years and replace them with a '-'
    gsub('((?<=,,)|^)([0-9]+),([0-9]+,)+([0-9]+)((?=,,)|$)',
         ',\\2-\\4,',
         apply(dat[, -1], 1, paste, collapse = ","), perl = TRUE)))
#[1] "1990,1992-1994" "1990-1992,1994" "1990-1994" "1990-1992,1994"

Related

R: merge data with different years

I would like to merge two datasets using different years.
My data look like the example below, with more than 1,000 firms over a 20-year span.
I want to merge the data to examine the impact of firm A's ratio at year t on firm A's count at year t+1.
Data A
firm year ratio
A 1990 0.2
A 1991 0.3
...
B 1990 0.1
Data B
firm tyear count
A 1990 2
A 1991 6
...
B 1990 4
Expected Output
firm year ratio count
A 1990 0.2 6
Any suggestions for code to merge the data?
Thank you.
This should get you started on the dataset; just make sure you do the right lag/lead transformation on the table.
library(data.table)
dt.a.years <- data.table(Year = seq(from = 1990, to = 2010, by = 1L))
dt.b.years <- data.table(Year = seq(from = 1990, to = 2010, by = 1L))
dt.merged <- merge(x = dt.a.years,
                   y = dt.b.years[, .(Year, lag.Year = shift(Year, n = 1, fill = NA))],
                   by.x = "Year",
                   by.y = "lag.Year")
> dt.merged
   Year Year.y
1: 1990   1991
2: 1991   1992
3: 1992   1993
4: 1993   1994
5: 1994   1995
6: 1995   1996
7: 1996   1997
8: 1997   1998
9: 1998   1999
How about like this:
A$tyear <- A$year + 1
AB <- merge(A, B, by = c('firm', 'tyear'), all = FALSE)
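As a quick sanity check (my own reconstruction, using only the rows shown above; the elided ... rows are ignored):
A <- data.frame(firm = c("A", "A", "B"), year  = c(1990, 1991, 1990), ratio = c(0.2, 0.3, 0.1))
B <- data.frame(firm = c("A", "A", "B"), tyear = c(1990, 1991, 1990), count = c(2, 6, 4))
A$tyear <- A$year + 1
merge(A, B, by = c("firm", "tyear"), all = FALSE)
#   firm tyear year ratio count
# 1    A  1991 1990   0.2     6
This matches the expected output row: firm A's 1990 ratio paired with its 1991 count.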

Lagging a variable by adding up the previous 5 years?

I am working with data that look like this:
Country Year Aid
Angola 1995 416420000
Angola 1996 459310000
Angola 1997 354660000
Angola 1998 335270000
Angola 1999 387540000
Angola 2000 302210000
I want to create a lagged variable by adding up the previous five years in the data, so that the observation for 2000 looks like this:
Country Year Aid Lagged5
Angola 2000 302210000 1953200000
Which was derived by adding the Aid observations from 1995 to 1999 together:
416420000 + 459310000 + 354660000 + 335270000 + 387540000 = 1953200000
I will also need to group by country.
Thank you!
You could do:
library(dplyr)
df %>%
  group_by(Country) %>%
  mutate(Lagged5 = sapply(Year, function(x) sum(Aid[between(Year, x - 5, x - 1)])))
Output:
# A tibble: 6 x 4
# Groups:   Country [1]
  Country  Year       Aid    Lagged5
  <chr>   <int>     <int>      <int>
1 Angola   1995 416420000          0
2 Angola   1996 459310000  416420000
3 Angola   1997 354660000  875730000
4 Angola   1998 335270000 1230390000
5 Angola   1999 387540000 1565660000
6 Angola   2000 302210000 1953200000
Using the input DF shown reproducibly in the Note at the end, define a roll function that sums the prior 5 rows and use ave to run it for each Country. The width argument list(-seq(5)) to rollapplyr means to use offsets -1, -2, -3, -4, -5 in summing, i.e. the values in the prior 5 rows.
The question did not discuss what to do with the initial rows in each country, so we put in NA values; if you want partial sums instead, add the partial = TRUE argument to rollapplyr (see the sketch after the code). You can also change fill = NA to some other value if you wish, so it is quite flexible.
library(zoo)
roll <- function(x) rollapplyr(x, list(-seq(5)), sum, fill = NA)
transform(DF, Lag5 = ave(Aid, Country, FUN = roll))
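For instance, the partial-sum variant mentioned above would look like this (a sketch, not part of the original answer):
roll_partial <- function(x) rollapplyr(x, list(-seq(5)), sum, fill = NA, partial = TRUE)
transform(DF, Lag5 = ave(Aid, Country, FUN = roll_partial))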
Note
The input was assumed to be the following. We added a second country.
Lines <- "Country Year Aid
Angola 1995 416420000
Angola 1996 459310000
Angola 1997 354660000
Angola 1998 335270000
Angola 1999 387540000
Angola 2000 302210000"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE,
                 colClasses = c("character", "integer", "numeric"))
DF <- rbind(DF, transform(DF, Country = "Belize"))
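For reference, running the main code above on this DF should give output along these lines (my own rendering of the result; only the 2000 rows have five prior years to sum):
transform(DF, Lag5 = ave(Aid, Country, FUN = roll))
#    Country Year       Aid       Lag5
# 1   Angola 1995 416420000         NA
# 2   Angola 1996 459310000         NA
# 3   Angola 1997 354660000         NA
# 4   Angola 1998 335270000         NA
# 5   Angola 1999 387540000         NA
# 6   Angola 2000 302210000 1953200000
# 7   Belize 1995 416420000         NA
# 8   Belize 1996 459310000         NA
# 9   Belize 1997 354660000         NA
# 10  Belize 1998 335270000         NA
# 11  Belize 1999 387540000         NA
# 12  Belize 2000 302210000 1953200000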

Organizing Multidimensional Data in R

I am trying to organize multidimensional data in R. The data are read into R from a CSV file. My data, in an R data frame, look like the following:
Rank Arrangers  YearAmt
1994
1    JPM        6,605.00
2    UBS        7,806.00
3    RBS        1,167.34
1995
1    Citi       1,150.00
2    Scotiabank   483.33
3    ING          800.56
4    UniCredit    700.70
This is just toy data; the original dataset is large. I would like to subset the data by year (1994, 1995, etc.) so that I can conduct some analysis. I have tried to subset the data set by factor/level using sapply and subset, but I realized R is just treating 1994 and 1995 as data in a row. I am thinking of reformatting the original CSV file by creating Year as a separate column and then putting the corresponding year in that field for all rows.
I would appreciate any help in suggesting a way to organize the data in R. I am expecting an output like this:
Rank Arrangers  YearAmt  Year
1    JPM        6,605.00 1994
2    UBS        7,806.00 1994
3    RBS        1,167.34 1994
1    Citi       1,150.00 1995
2    Scotiabank   483.33 1995
3    ING          800.56 1995
4    UniCredit    700.70 1995
1) ave  Using cumsum(Rank == "") to create a grouping variable for years, this uses ave to create a Year column, producing within each group of year rows a Year consisting of NA followed by the year repeated. Finally, use na.omit to remove the rows with NA. No packages are used:
na.year <- function(x) c(NA, rep(x[1], length(x) - 1)) # c(NA, x[1], x[1], ..., x[1])
na.omit( transform(df1, Year = ave(YearAmt, cumsum(Rank == ""), FUN = na.year)) )
Using the input df1 reproducibly defined in the answer from @akrun, we get:
  Rank  Arrangers  YearAmt Year
2    1        JPM 6,605.00 1994
3    2        UBS 7,806.00 1994
4    3        RBS 1,167.34 1994
6    1       Citi 1,150.00 1995
7    2 Scotiabank   483.33 1995
8    3        ING   800.56 1995
9    4  UniCredit   700.70 1995
2) by  Using by, split df1 into year groups, applying addYear to each component of the split. Finally, put them back together. No packages are used.
addYear <- function(x) cbind(x[-1, ], Year = x[1, "YearAmt"])
do.call("rbind", by(df1, cumsum(df1$Rank == ""), addYear))
3) sqldf  Using the sqldf package, we can join each row of df1 with all prior rows of itself that have a zero-length Rank, taking the maximum YearAmt of those to form the Year. Then keep only those rows having a non-zero-length Rank.
library(sqldf)
sqldf("select b.*, max(a.YearAmt) Year
from df1 a join df1 b on a.rowid < b.rowid and a.Rank = ''
group by b.rowid
having b.Rank != ''")
We create a logical vector based on the blank elements in 'Rank' ('i1'), then subset the rows of 'df1' by removing all the blank rows using 'i1' (df1[!i1, ]), and transform the dataset to create the 'Year' column by replicating the 'YearAmt' values that correspond to the blanks in 'Rank', indexed by the cumulative sum of 'i1'.
i1 <- df1$Rank == ''
res <- transform(df1[!i1,], Year = df1$YearAmt[i1][cumsum(i1)[!i1]])
res
#   Rank  Arrangers  YearAmt Year
# 2    1        JPM 6,605.00 1994
# 3    2        UBS 7,806.00 1994
# 4    3        RBS 1,167.34 1994
# 6    1       Citi 1,150.00 1995
# 7    2 Scotiabank   483.33 1995
# 8    3        ING   800.56 1995
# 9    4  UniCredit   700.70 1995
Or as #G.Grothendieck mentioned in the comments, the transform step can be made compact by
res <- transform(df1, Year = YearAmt[i1][cumsum(i1)])[!i1, ]
row.names(res) <- NULL
NOTE: No external packages are needed. Only baseverse.
Or using dtverse/zooverse:
library(data.table)
library(zoo)
setDT(df1)[Rank=='', Year:= YearAmt][, Year := na.locf(Year)][Rank!='']
#    Rank  Arrangers  YearAmt Year
# 1:    1        JPM 6,605.00 1994
# 2:    2        UBS 7,806.00 1994
# 3:    3        RBS 1,167.34 1994
# 4:    1       Citi 1,150.00 1995
# 5:    2 Scotiabank   483.33 1995
# 6:    3        ING   800.56 1995
# 7:    4  UniCredit   700.70 1995
data
df1 <- structure(list(Rank = c("", "1", "2", "3", "", "1", "2", "3", "4"),
                      Arrangers = c("", "JPM", "UBS", "RBS", "", "Citi",
                                    "Scotiabank", "ING", "UniCredit"),
                      YearAmt = c("1994", "6,605.00", "7,806.00", "1,167.34",
                                  "1995", "1,150.00", "483.33", "800.56", "700.70")),
                 .Names = c("Rank", "Arrangers", "YearAmt"),
                 row.names = c(NA, -9L), class = "data.frame")
A tidyverse option:
library(dplyr)
library(tidyr)
df %>%
  # add Year column, with NAs where no year in row
  mutate(Year = ifelse(Rank == '' & Arrangers == '', YearAmt, NA)) %>%
  # fill year downwards
  fill(Year) %>%
  # chop out year rows
  filter(Rank != '', Arrangers != '')
##   Rank  Arrangers  YearAmt Year
## 1    1        JPM 6,605.00 1994
## 2    2        UBS 7,806.00 1994
## 3    3        RBS 1,167.34 1994
## 4    1       Citi 1,150.00 1995
## 5    2 Scotiabank   483.33 1995
## 6    3        ING   800.56 1995
## 7    4  UniCredit   700.70 1995

R equal sampling takes too long

I want to sample rows from different years given some constraints.
Say my dataset looks like this:
library(data.table)
dataset = data.table(ID=sample(1:21), Vintage=c(1989:1998, 1989:1998, 1992), Region.Focus=c("Europe", "US", "Asia"))
> dataset
    ID Vintage Region.Focus
 1:  7    1989       Europe
 2: 10    1990           US
 3: 20    1991         Asia
 4: 18    1992       Europe
 5:  4    1993           US
 6: 17    1994         Asia
 7: 13    1995       Europe
 8:  9    1996           US
 9: 12    1997         Asia
10:  3    1998       Europe
11: 11    1989           US
12: 14    1990         Asia
13:  8    1991       Europe
14: 16    1992           US
15: 19    1993         Asia
16:  1    1994       Europe
17:  5    1995           US
18: 15    1996         Asia
19:  6    1997       Europe
20: 21    1998           US
21:  2    1992         Asia
    ID Vintage Region.Focus
I want 1,000 draws of sample size 2 and of sample size 4 (separate from each other), spread across two consecutive years. E.g., for 1,000 draws of sample size 2, one draw could consist of the first and the second row. I also have the constraint that a sample must consist of rows with the same region focus. My solution is the code below, but it is way too slow.
library(dplyr)   # for mutate()/ungroup() used below

for (i in c(2, 4)) {
  simulate <- function(i) {
    repeat {
      start <- dataset[sample(nrow(dataset), 1, replace = TRUE), ]
      t <- start$Vintage:(start$Vintage + 1)
      # constraints: same region focus, vintages within the two-year window
      matches <- which(dataset$Vintage %in% t & dataset$Region.Focus == start$Region.Focus)
      DT <- dataset[matches, ]
      DT <- as.data.table(DT)
      x <- DT[, .SD[sample(.N, min(.N, i / length(t)))], by = Vintage]
      if (nrow(x) == i) {
        x <- as.data.frame(x)
        x <- x %>% mutate(EqualWeight = 1 / i) %>% mutate(RandomWeight = prop.table(runif(i)))
        x <- ungroup(x)
        return(x)
      } else {
        x <- 0
      }
    }
  }
  # now replicate the expression 1000 times
  r <- replicate(1000, simulate(i), simplify = FALSE)
  r <- rbindlist(r, idcol = "draw")
  f <- as.data.frame(r)
  write.csv(f, file = paste("Performance.fof.5", i, "csv", sep = "."))
  fof <- paste("fof.5", i, sep = ".")
  assign(fof, f)
}
This code is very slow. My initial intuition is that, because of the constraint, the approach rejects many candidate funds and keeps looping. I have 5,800 rows.
Is there a way to avoid the repeat construct and all the looping it causes? Perhaps there is another way of expressing the line DT[, .SD[sample(.N, min(.N, i/length(t)))], by = Vintage] to get rid of the repeat expression? Thank you in advance for any input!
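One possible way to avoid the rejection loop is sketched below; this is my own rough sketch rather than a tested answer, and it assumes the draw size i is even (2 or 4 as above). The idea is to precompute which (Region.Focus, start vintage) pairs have at least i/2 rows in both year t and year t+1, and then sample only from those feasible pairs. Note that this weights every feasible year pair equally, which differs slightly from picking a uniformly random starting row.
library(data.table)

simulate_fast <- function(dataset, i, n_draws = 1000) {
  per_year <- i / 2                                   # rows needed from each of the two years
  counts <- dataset[, .N, by = .(Region.Focus, Vintage)]
  # counts of the following vintage, shifted so a join on Vintage pairs year t with year t + 1
  nxt <- counts[, .(Region.Focus, Vintage = Vintage - 1L, N.next = N)]
  valid <- merge(counts, nxt, by = c("Region.Focus", "Vintage"))
  valid <- valid[N >= per_year & N.next >= per_year]  # pairs that can supply a full sample

  draws <- lapply(seq_len(n_draws), function(d) {
    g <- valid[sample(.N, 1)]                         # pick one feasible region/year pair
    x <- dataset[Region.Focus == g$Region.Focus &
                   Vintage %in% c(g$Vintage, g$Vintage + 1L)]
    x <- x[, .SD[sample(.N, per_year)], by = Vintage] # per_year rows from each of the two years
    x[, `:=`(EqualWeight = 1 / i, RandomWeight = prop.table(runif(i)))]
    x
  })
  rbindlist(draws, idcol = "draw")
}

# e.g. fof.5.2 <- as.data.frame(simulate_fast(dataset, 2))
#      fof.5.4 <- as.data.frame(simulate_fast(dataset, 4))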

ddply and adding columns

I have a data frame with columns year|country|growth_rate. I wanted to find the country with the highest growth rate in every year, which I did with:
ddply(data, .(year), summarise, highest = max(growth_rate))
and I got a data frame with 2 columns: year and highest.
I would like to add a third column here, which would show the country that had that max growth_rate, but I can't figure out how to do this.
R> data = data.frame(year = rep(1990:1993, 2), growth_rate = runif(8), country = rep(c("US", "FR"), each = 4))
R> data
  year growth_rate country
1 1990  0.82785327      US
2 1991  0.86724498      US
3 1992  0.84813164      US
4 1993  0.35884355      US
5 1990  0.92792399      FR
6 1991  0.08659153      FR
7 1992  0.26732516      FR
8 1993  0.37819132      FR
R> ddply(data, .(year), summarize, highest = max(growth_rate), country = country[which.max(growth_rate)])
  year   highest country
1 1990 0.9279240      FR
2 1991 0.8672450      US
3 1992 0.8481316      US
4 1993 0.3781913      FR
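For comparison, a roughly equivalent dplyr formulation would be (not part of the original answer):
library(dplyr)
data %>%
  group_by(year) %>%
  summarise(highest = max(growth_rate),
            country = country[which.max(growth_rate)])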
