Renaming Columns in R According to Repeating Sequence - r

I have a wide data frame in R and I am trying to rename the column names so that I can reshape it to a long format.
Currently, the data is structured like this:
long lat V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V477
I'd like to rename the columns so that they are:
long lat Jan_1979 Feb_1979 Mar_1979 Apr_1979 ... Sept_2018
I'm not sure how to go about doing this. Any help would be appreciated.

There are multiple ways you could do this.
One way in base R is to use seq to generate monthly dates and format to render them in the pattern you need. For example, you can create the first 10 names of the sequence starting from 1979-01-01 with:
format(seq(as.Date('1979-01-01'), length.out = 10, by = "1 month"), "%b_%Y")
#[1] "Jan_1979" "Feb_1979" "Mar_1979" "Apr_1979" "May_1979" "Jun_1979" "Jul_1979"
#[8] "Aug_1979" "Sep_1979" "Oct_1979"
For your case, where the 477 month columns sit in positions 3 to 479, this should work:
names(df)[3:479] <- format(seq(as.Date('1979-01-01'),
                               length.out = 477, by = "1 month"), "%b_%Y")
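One caveat worth noting: "%b" follows the session's locale, so the abbreviations may not be English on every machine. A minimal sketch of a locale-independent variant, assuming the same 477 monthly columns starting in January 1979, builds the names from R's built-in month.abb constant instead (n_months, idx and new_names are illustrative names, not part of the answer above):

```r
n_months <- 477                      # number of V columns in the question
idx <- seq_len(n_months) - 1         # 0-based month offset from Jan 1979

# month.abb is always English, regardless of the session locale
new_names <- paste(month.abb[idx %% 12 + 1], 1979 + idx %/% 12, sep = "_")

head(new_names, 3)
# names(df)[3:479] <- new_names      # apply to the real data frame
```

The modulo picks the month and the integer division picks the year, so no date parsing is involved.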

We can use expand.grid to get all month year combinations:
name_combn <- expand.grid(month.abb, 1979:2018)[1:477,]
names(df) <- c('long', 'lat', paste(name_combn$Var1, name_combn$Var2, sep = "_"))
Output:
> head(name_combn, 20)
Var1 Var2
1 Jan 1979
2 Feb 1979
3 Mar 1979
4 Apr 1979
5 May 1979
6 Jun 1979
7 Jul 1979
8 Aug 1979
9 Sep 1979
10 Oct 1979
11 Nov 1979
12 Dec 1979
13 Jan 1980
14 Feb 1980
15 Mar 1980
16 Apr 1980
17 May 1980
18 Jun 1980
19 Jul 1980
20 Aug 1980
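Since the stated goal is a long format, here is a rough base R sketch of that next step on a tiny made-up data frame. The values and the names stacked, coords, long_df, month and year are assumptions for illustration. It stacks the month columns and splits each name at the underscore:

```r
# Tiny hypothetical stand-in for the renamed wide data
df <- data.frame(long = c(10, 20), lat = c(1, 2),
                 Jan_1979 = c(0.1, 0.2), Feb_1979 = c(0.3, 0.4))

# stack() puts all month columns into 'values' and their names into 'ind'
stacked <- stack(df[-(1:2)])

# Repeat the coordinate rows once per month column, in the same order as stack()
coords <- df[rep(seq_len(nrow(df)), times = ncol(df) - 2), c("long", "lat")]
long_df <- cbind(coords, stacked)
rownames(long_df) <- NULL

# Split "Jan_1979" into a month and a year column
parts <- do.call(rbind, strsplit(as.character(long_df$ind), "_"))
long_df$month <- parts[, 1]
long_df$year <- as.integer(parts[, 2])
long_df$ind <- NULL
```

With tidyr installed, pivot_longer with names_sep = "_" is a common alternative to the manual split.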

Related

Applying custom function to a list of DFs, taking another list as an input - R

I have a list of dfs and a list of annual budgets.
Each df represents one business year, and each budget represents a total spend for that year.
# the business year starts from Feb and ends in Jan.
# the budget column is first populated with the % of annual budget allocation
df <- data.frame(monthly_budget=c(0.06, 0.13, 0.07, 0.06, 0.1, 0.06, 0.06, 0.09, 0.06, 0.06, 0.1, 0.15),
month=month.abb[c(2:12, 1)])
# dfs for 3 years
df2019_20 <- df
df2020_21 <- df
df2021_22 <- df
# budgets for 3 years
budget2019_20 <- 6000000
budget2020_21 <- 7000000
budget2021_22 <- 8000000
# into lists
df_list <- list(df2019_20, df2020_21, df2021_22)
budget_list <- list(budget2019_20, budget2020_21, budget2021_22)
I've written the following function to assign the right year to Jan and fill in the rest by deparsing the name of the respective df.
It works perfectly if I supply a single df and a single budget.
budget_func <- function(df, budget){
df_name <- deparse(substitute(df))
df <- df %>%
mutate(year=ifelse(month=="Jan",
as.numeric(str_sub(df_name, -2)) + 2000,
as.numeric(str_extract(df_name, "\\d{4}(?=_)")))
)
for (i in 1:12){
df[i,1] <- df[i,1] * budget
i <- i+1
}
return(df)
}
To speed things up I want to pass both lists as arguments to mapply. However I don't get the results I want - what am I doing wrong?
final_budgets <- mapply(budget_func, df_list, budget_list)
deparse(substitute(df)) works when a single dataset is passed directly, but inside mapply/Map the function receives an element of the list rather than the original object, so the object name is no longer available. Instead, we can add a new argument that carries the name. The list must then be named as well: either write list(df2019_20 = df2019_20, ...), use setNames, or, more simply, use dplyr::lst, which names each element after the object passed.
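library(dplyr)
library(stringr)
budget_func <- function(df, budget, nm1){
  # nm1 is the name of the data frame, e.g. "df2019_20"
  df_name <- nm1
  df <- df %>%
    mutate(year=ifelse(month=="Jan",
                       as.numeric(str_sub(df_name, -2)) + 2000,
                       as.numeric(str_extract(df_name, "\\d{4}(?=_)")))
    )
  # scale the allocation shares in column 1 by the annual budget
  for (i in 1:12){
    df[i,1] <- df[i,1] * budget
  }
  return(df)
}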
-testing
df_list <- dplyr::lst(df2019_20, df2020_21, df2021_22)
budget_list <- list(budget2019_20, budget2020_21, budget2021_22)
Map(budget_func, df_list, budget_list, names(df_list))
-output
$df2019_20
monthly_budget month year
1 360000 Feb 2019
2 780000 Mar 2019
3 420000 Apr 2019
4 360000 May 2019
5 600000 Jun 2019
6 360000 Jul 2019
7 360000 Aug 2019
8 540000 Sep 2019
9 360000 Oct 2019
10 360000 Nov 2019
11 600000 Dec 2019
12 900000 Jan 2020
$df2020_21
monthly_budget month year
1 420000 Feb 2020
2 910000 Mar 2020
3 490000 Apr 2020
4 420000 May 2020
5 700000 Jun 2020
6 420000 Jul 2020
7 420000 Aug 2020
8 630000 Sep 2020
9 420000 Oct 2020
10 420000 Nov 2020
11 700000 Dec 2020
12 1050000 Jan 2021
$df2021_22
monthly_budget month year
1 480000 Feb 2021
2 1040000 Mar 2021
3 560000 Apr 2021
4 480000 May 2021
5 800000 Jun 2021
6 480000 Jul 2021
7 480000 Aug 2021
8 720000 Sep 2021
9 480000 Oct 2021
10 480000 Nov 2021
11 800000 Dec 2021
12 1200000 Jan 2022
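To see why the original deparse/substitute version breaks under mapply or Map, the following base R sketch may help. show_name, a, b and the result variables are hypothetical names used only for this illustration:

```r
# Hypothetical helper that reports the expression it was called with
show_name <- function(x) deparse(substitute(x))

a <- 1
b <- 2

# Called directly, substitute() sees the symbol 'a'
direct <- show_name(a)

# Called through Map, it sees an internal indexing expression instead
via_map <- unlist(Map(show_name, list(a, b)))

# Naming the list and passing names(...) as an extra argument recovers the names
named <- list(a = a, b = b)
with_names <- unlist(Map(function(x, nm) nm, named, names(named)))
```

Here direct is "a", while via_map holds the text of the call that Map built internally, which is why the year extraction in the question fails.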

How to perform a calculation between chr and dbl columns

Let's say I run this code:
df_customer %>%
  separate(DOB,sep = "-",into = c("D", "M","Y")) %>%
  mutate(Age=2021)
Then this data frame comes out:
ID D M Y G C Age
<int> < chr > <dbl>
268408 02 01 1970 M 4 2021
269696 07 01 1970 F 8 2021
268159 08 01 1970 F 8 2021
270181 10 01 1970 F 2 2021
268073 11 01 1970 M 1 2021
273216 15 01 1970 F 5 2021
266929 15 01 1970 M 8 2021
275152 16 01 1970 M 4 2021
275034 18 01 1970 F 4 2021
273966 21 01 1970 M 8 2021
Next, I want to change what that mutate step computes.
How can I calculate something like 2021 - "Y" column?
2021 is dbl and Y is chr.
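Adding convert = TRUE to separate converts the new columns to numeric types where possible. You can also use as.numeric to convert a character column to numbers explicitly; the code below does both, so the as.numeric call is a safeguard rather than a necessity.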
library(dplyr)
library(tidyr)
df_customer %>%
separate(DOB,sep = "-",into = c("D", "M","Y"), convert = TRUE) %>%
mutate(Age=2021 - as.numeric(Y))
We could also do this in base R:
transform(cbind(df_customer, read.table(text = df_customer$DOB, sep = "-",
    col.names = c("D", "M", "Y"))), Age = 2021 - Y)
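The underlying issue can be reproduced in a few lines of base R. This is a sketch on a made-up two-row stand-in for df_customer; the DOB layout "dd-mm-yyyy" and the names year_chr and chr_attempt are assumptions:

```r
# Hypothetical two-row stand-in for df_customer (DOB assumed to be "dd-mm-yyyy")
df_customer <- data.frame(ID = c(268408L, 269696L),
                          DOB = c("02-01-1970", "07-01-1970"),
                          stringsAsFactors = FALSE)

# Keep only the text after the last "-"; this is still a character vector
year_chr <- sub(".*-", "", df_customer$DOB)

# Arithmetic on a character vector fails, so capture the error instead of stopping
chr_attempt <- tryCatch(2021 - year_chr, error = function(e) "error")

# Converting to numeric first makes the subtraction work
df_customer$Age <- 2021 - as.numeric(year_chr)
```

The subtraction only succeeds once both operands are numeric, which is what convert = TRUE or as.numeric takes care of in the answers above.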

RStudio: look up a value in a table (both vertically and horizontally), then use it as a variable in a loop

I am dealing with a dataset ("IndexTable") that has 3 million+ observations. Here are the first 6 observations:
Identity gender type amount Year Month
1 65 F W 31.88 1987 Jan
2 23 M P 29.21 1985 Mar
3 45 F W 44.70 1987 Jan
4 47 F W 72.64 1987 Jan
5 56 M P 28.92 1986 Jul
6 09 F W 34.32 1990 Jan
and the index table ("index") from which the value will be searched (part of the table):
year average Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1 1950 32.84210 33.19118 33.10321 33.01572 32.89977 32.81334 32.98665 32.98665 33.10321 32.89977 32.55677 32.41595 32.24857
2 1951 30.09866 31.94615 31.64936 31.43694 30.94371 30.19568 30.09866 29.64623 29.50617 29.29854 29.09382 28.98131 28.78098
3 1952 27.56470 28.28139 28.25313 28.11271 27.67259 27.67259 27.21981 27.24604 27.40444 27.45766 27.21981 27.24604 27.06353
4 1953 26.73099 27.08945 27.01183 26.83243 26.58025 26.68055 26.53038 26.53038 26.70575 26.75628 26.75628 26.68055 26.78162
5 1954 26.25941 26.73099 26.78162 26.53038 26.43120 26.50552 26.35730 25.92244 26.08984 26.13807 26.01783 25.89871 25.75718
6 1955 25.11668 25.66369 25.66369 25.66369 25.52472 25.57087 25.04994 24.96151 25.13901 24.98356 24.72149 24.33854 24.33854
For each observation in "IndexTable", I would like to find the value in "index" that matches its Year and Month, then multiply its amount by that value to get the adjusted amount.
Thanks in advance
Using the dplyr and tidyr packages:
library(dplyr)
library(tidyr)
index_long <- index %>%
  gather(Month, multiplier, Jan:Dec) %>%
  select(-average)
left_join(IndexTable, index_long, by = c("Year" = "year", "Month" = "Month")) %>%
  mutate(adjusted_amount = amount*multiplier)
First, I gather the month columns into a single Month column, with the values going into a multiplier column.
I drop the average column because it doesn't need to be joined to the other table. The left join then attaches to each row of IndexTable the multiplier with the matching year and month combination; rows without a match get NA.
Finally, I use the multiplier to create the new column adjusted_amount.
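The same lookup can be sketched in base R with matrix indexing, which avoids reshaping the index table. The two small tables below are made-up stand-ins (not the question's data), and month_cols, row_idx and col_idx are illustrative names:

```r
# Hypothetical miniature versions of the two tables
index <- data.frame(year = c(1986, 1987),
                    Jan = c(2.0, 1.9),
                    Feb = c(1.8, 1.7))
IndexTable <- data.frame(amount = c(10, 20, 30),
                         Year = c(1987, 1986, 1987),
                         Month = c("Jan", "Feb", "Feb"),
                         stringsAsFactors = FALSE)

# Month columns as a numeric matrix: rows are years, columns are months
month_cols <- as.matrix(index[, c("Jan", "Feb")])

# Vertical lookup (year -> row) and horizontal lookup (month -> column)
row_idx <- match(IndexTable$Year, index$year)
col_idx <- match(IndexTable$Month, colnames(month_cols))

# A two-column index matrix picks one cell per observation
IndexTable$multiplier <- month_cols[cbind(row_idx, col_idx)]
IndexTable$adjusted_amount <- IndexTable$amount * IndexTable$multiplier
```

Because match is vectorised, no explicit loop over the observations is needed. As a side note, tidyr's documentation marks gather as superseded by pivot_longer, although gather still works.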

Fill in missing year in ordered list of dates

I have collected some time series data from the web and the timestamp that I got looks like below.
24 Jun
21 Mar
20 Jan
10 Dec
20 Jun
20 Jan
10 Dec
...
The interesting part is that the year is missing from the data. However, all the records are ordered, so you can infer the year from each record's position and fill in the missing values. After imputing, the data should look like this:
24 Jun 2014
21 Mar 2014
20 Jan 2014
10 Dec 2013
20 Jun 2013
20 Jan 2013
10 Dec 2012
...
Before rolling up my sleeves and writing a for loop with nested logic: is there an easy way that might work out of the box in R to impute the missing year?
Thanks a lot for any suggestion!
Here's one idea
## Make data easily reproducible
df <- data.frame(day=c(24, 21, 20, 10, 20, 20, 10),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec"))
## Convert each month-day combo to its corresponding "julian date"
datestring <- paste("2012", match(df[[2]], month.abb), df[[1]], sep = "-")
date <- strptime(datestring, format = "%Y-%m-%d")
julian <- as.integer(strftime(date, format = "%j"))
## Transitions between years occur wherever julian date increases between
## two observations
df$year <- 2014 - cumsum(diff(c(julian[1], julian))>0)
## Check that it worked
df
# day month year
# 1 24 Jun 2014
# 2 21 Mar 2014
# 3 20 Jan 2014
# 4 10 Dec 2013
# 5 20 Jun 2013
# 6 20 Jan 2013
# 7 10 Dec 2012
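Once the year is filled in, the three parts can be combined into real Date values, which also gives a quick sanity check that the records are in descending order. This sketch repeats the result above as literal data and uses match with month.abb rather than "%b" so it does not depend on the locale; the date column name is an assumption:

```r
df <- data.frame(day = c(24, 21, 20, 10, 20, 20, 10),
                 month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec"),
                 year = c(2014, 2014, 2014, 2013, 2013, 2013, 2012),
                 stringsAsFactors = FALSE)

# Build ISO "YYYY-MM-DD" strings and convert them to Date
df$date <- as.Date(sprintf("%d-%02d-%02d",
                           as.integer(df$year),
                           match(df$month, month.abb),
                           as.integer(df$day)))

# Every record should be strictly earlier than the one before it
is_descending <- all(diff(as.numeric(df$date)) < 0)
```

If is_descending were FALSE, a year transition would have been missed somewhere in the sequence.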
The OP has requested to complete the years in descending order starting in 2014.
Here is an alternative approach which works without date conversion and fake dates. Furthermore, this approach can be modified to work with fiscal years which start on a different month than January.
# create sample dataset
df <- data.frame(
day = c(24L, 21L, 20L, 10L, 20L, 20L, 21L, 10L, 30L, 10L, 10L, 7L),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Jan", "Dec", "Jan",
"Jan", "Jan", "Jun"))
df$year <- 2014 - cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb)) + df$day) > 0))
df
day month year
1 24 Jun 2014
2 21 Mar 2014
3 20 Jan 2014
4 10 Dec 2013
5 20 Jun 2013
6 20 Jan 2013
7 21 Jan 2012
8 10 Dec 2011
9 30 Jan 2011
10 10 Jan 2011
11 10 Jan 2011
12 7 Jun 2010
Completion of fiscal years
Let's assume the business has decided to start its fiscal year on February 1. Thus, January lies in a different fiscal year than February or March of the same calendar year.
To handle fiscal years, we only need to shuffle the factor levels accordingly:
df$fy <- 2014 - cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb[c(2:12, 1)])) + df$day) > 0))
df
day month year fy
1 24 Jun 2014 2014
2 21 Mar 2014 2014
3 20 Jan 2014 2013
4 10 Dec 2013 2013
5 20 Jun 2013 2013
6 20 Jan 2013 2012
7 21 Jan 2012 2011
8 10 Dec 2011 2011
9 30 Jan 2011 2010
10 10 Jan 2011 2010
11 10 Jan 2011 2010
12 7 Jun 2010 2010
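Both variants can be wrapped in one small helper. This is only a sketch: fill_years and its arguments latest_year and first_month are hypothetical names, and it inherits the assumption that the records are sorted newest first and that consecutive records are less than a year apart.

```r
# Hypothetical helper: first_month = 1 gives calendar years,
# first_month = 2 gives fiscal years starting in February, and so on
fill_years <- function(day, month, latest_year, first_month = 1L) {
  # Rotate the month names so the year's first month gets level 1
  lev <- month.abb[(seq_len(12) + first_month - 2L) %% 12L + 1L]
  # Sortable key within a year: 100 * month position + day
  key <- 100L * match(month, lev) + as.integer(day)
  # Going back in time, an increase in the key means an earlier year
  latest_year - cumsum(c(0L, diff(key) > 0))
}

day <- c(24L, 21L, 20L, 10L, 20L, 20L, 21L, 10L, 30L, 10L, 10L, 7L)
month <- c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Jan", "Dec", "Jan",
           "Jan", "Jan", "Jun")

year <- fill_years(day, month, 2014)
fy <- fill_years(day, month, 2014, first_month = 2L)
```

Identical consecutive dates leave the key unchanged, so they stay in the same year, as in rows 10 and 11 of the sample data.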

How can I avoid having to loop through and search through this data frame?

I have a 1 million row data frame that contains monthly water usage data (HCF) for various accounts from 2003-2010:
> head(LeakyAccts)
ACCOUNT Date HCF
1 10114488 Oct 2010 25
2 10114488 Sep 2007 24
3 10114488 Nov 2006 11
4 10114488 Jun 2008 18
5 10114488 Aug 2003 6
6 10114488 Jan 2008 30
The dates are of class yearmon. I want to know how much each account used every month compared to the same month in the previous year. So for each row, I'd like to find the difference between the usage in that month (Date) and the usage in the same month of the previous year (Date - 1). In other words, I want this:
for(i in 1:nrow(LeakyAccts)) {
  # find the row for the same account one year earlier (yearmon - 1 is one year back)
  row <- which((LeakyAccts$ACCOUNT == LeakyAccts[i,]$ACCOUNT) & (LeakyAccts$Date == (LeakyAccts[i,]$Date - 1)))
  if (length(row) == 1) { # no previous year for 2003
    LeakyAccts[i,]$Difference <- LeakyAccts[i,]$HCF - LeakyAccts[row,]$HCF
  }
}
Needless to say, this loop takes hours to run and seems very un-R-like. How can I avoid using an ugly for loop and speed up the computation? Is there perhaps a way to do this using an apply function or a data.table?
I've reconfigured your data a little to give a complete example:
library(zoo)
dat <- structure(list(
  ACCOUNT = c(10114488L, 10114488L, 10114488L, 20114488L, 20114488L, 20114488L),
  Date = structure(c(2010.75, 2009.75, 2008.75, 2008, 2007, 2006), class = "yearmon"),
  HCF = c(25L, 24L, 11L, 18L, 6L, 30L)),
  .Names = c("ACCOUNT", "Date", "HCF"),
  row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
Which looks like:
ACCOUNT Date HCF
1 10114488 Oct 2010 25
2 10114488 Oct 2009 24
3 10114488 Oct 2008 11
4 20114488 Jan 2008 18
5 20114488 Jan 2007 6
6 20114488 Jan 2006 30
Since yearmon is essentially just a numeric value where a difference of 1 is a year's difference, you can get the matching differences from a year ago like:
dat$HCF - dat$HCF[match(dat$Date-1,dat$Date)]
#[1] 1 13 NA 12 -24 NA
...which you can also apply within each group like:
do.call(c,by(dat,dat$ACCOUNT,function(x) x$HCF - x$HCF[match(x$Date-1,x$Date)]))
#101144881 101144882 101144883 201144881 201144882 201144883
# 1 13 NA 12 -24 NA
Or using data.table like:
library(data.table)
dat <- as.data.table(dat)
dat[, Difference := HCF - HCF[match(Date-1,Date)], by=ACCOUNT]
dat
# ACCOUNT Date HCF Difference
#1: 10114488 Oct 2010 25 1
#2: 10114488 Oct 2009 24 13
#3: 10114488 Oct 2008 11 NA
#4: 20114488 Jan 2008 18 12
#5: 20114488 Jan 2007 6 -24
#6: 20114488 Jan 2006 30 NA
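If the dates are stored as plain year and month columns instead of yearmon, a self-join gives the same result without zoo. Here is a hedged base R sketch on the same six rows; the year/month split and the names prev, HCF_prev and out are assumptions for illustration. It shifts a copy of the data forward by one year and merges it back on account, year and month:

```r
dat <- data.frame(ACCOUNT = c(10114488L, 10114488L, 10114488L,
                              20114488L, 20114488L, 20114488L),
                  year  = c(2010L, 2009L, 2008L, 2008L, 2007L, 2006L),
                  month = c(10L, 10L, 10L, 1L, 1L, 1L),
                  HCF   = c(25L, 24L, 11L, 18L, 6L, 30L))

# Copy of the data relabelled as "next year", so it lines up with the later row
prev <- data.frame(ACCOUNT = dat$ACCOUNT, year = dat$year + 1L,
                   month = dat$month, HCF_prev = dat$HCF)

# Left join: rows without a previous year keep NA in HCF_prev
out <- merge(dat, prev, by = c("ACCOUNT", "year", "month"), all.x = TRUE)

# Restore the newest-first order within each account
out <- out[order(out$ACCOUNT, -out$year), ]
out$Difference <- out$HCF - out$HCF_prev
```

Joining on exact integer keys also sidesteps any concern about comparing fractional yearmon values.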

Resources