How to rename many columns in R

I have messy data in this format:
Brand <- c("Brand1","Brand2","Brand3")
Sold_quantity_this_week <- c(5,8,17)
Sold_dollar_amount_this_week <- c(150,350,780)
Sold_quantity_minus_1_week <- c(7,6,8)
Sold_dollar_amount_minus_1_week <- c(200,300,350)
Sold_quantity_minus_2_week <- c(8,9,10)
Sold_dollar_amount_minus_2_week <- c(220,400,420)
| Brand | Sold quantity(this week) | Sold $amount(this week) | Sold quantity(-1 week) | Sold $amount(-1 week) | Sold quantity(-2 week) | Sold $amount(-2 week) |
|--------|--------------------------|-------------------------|------------------------|-----------------------|------------------------|-----------------------|
| Brand1 | 5 | 150 | 7 | 200 | 8 | 220 |
| Brand2 | 8 | 350 | 6 | 300 | 9 | 400 |
| Brand3 | 17 | 780 | 8 | 350 | 10 | 420 |
This is just a simple case of my problem. I have weekly sales data covering 35 weeks, and I want to put dates into the column names so that I can rename all the columns with a few lines of code.
My goal is to set the name of column i to a Date and the name of column i+2 to that date minus 7 days, so the columns reflect the previous weeks. Then I would coerce the names back to character, add "quantity" to each name (and do the same for the dollar amount), and finally represent the data in long format.
How can I do it?
names(data)[2] <-"26.08.2018"
for(i in seq(2,72,2)){
names(data)[,i+2]=names(data)[,i]-7
}
My code here is not working, maybe because it is not possible to have Date-format column names, I guess. However, I do not want to rename all the columns manually and then make the long-format data. Can you please suggest possible solutions? Thanks.
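Column names in R are always plain character, so the date arithmetic has to happen on real Date objects before the results are pasted back into the names. A minimal sketch with tidyr/dplyr, under these assumptions: the data frame is called data, the newest week is 26.08.2018, and the columns alternate quantity/amount from newest to oldest:
library(dplyr)
library(tidyr)

# one date per week, newest first, stepping back 7 days at a time
n_weeks <- (ncol(data) - 1) / 2
week_dates <- seq(as.Date("2018-08-26"), by = -7, length.out = n_weeks)

# encode measure and date in each name, e.g. "quantity_2018-08-26"
names(data)[-1] <- paste(rep(c("quantity", "amount"), times = n_weeks),
                         rep(week_dates, each = 2), sep = "_")

# split the names back apart while reshaping to long format
long <- data %>%
  pivot_longer(-Brand, names_to = c("measure", "Date"), names_sep = "_") %>%
  mutate(Date = as.Date(Date))
With 35 weeks nothing changes, since n_weeks is computed from ncol(data).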

Related

R loop to assign variable based on dates in ref table

I'm trying to assign variables in a dataframe using a loop and referencing another table with dates. The loop would create a new variable (YRTR) in df1 using df2 as a reference.
The problem I'm running into is that some observations need to be assigned multiple YRTRs depending on the begin/end dates. So one observation may turn into multiple observations.
If an END_DATE is 9999-12-31 then the observation is current to today's date.
For example, obs. 1 in df1 would turn into 11 observations, 1 for each YRTR since 20201.
Obs. 2 in df1 would turn into 2 observations, 1 with a YRTR of 20221, and 1 with a YRTR of 20223.
Obs. 3 in df1 would turn into 5 observations, 1 for each YRTR since 20221.
df1 looks like this:
|ID| BEGIN_DATE | END_DATE |
|---------------------|---------------------|------------------|
|1| 2019-05-18 | 9999-12-31 |
|2| 2021-05-15 | 2021-12-17 |
|3| 2021-05-15 | 9999-12-31 |
|4| 2018-12-22 | 2019-05-18 |
The reference data frame (df2) looks like this:
|YRTR| BEGIN_DATE | END_DATE |
|---------------------|---------------------|------------------|
|20193| 8/27/2018 | 12/21/2018 |
|20195| 1/14/2019 | 5/17/2019 |
|20201| 6/3/2019 | 8/8/2019 |
|20203| 8/26/2019 | 12/20/2019 |
|20205| 1/13/2020 | 5/15/2020 |
|20211| 6/1/2020 | 8/6/2020 |
|20213| 8/24/2020 | 12/18/2020 |
|20215| 1/11/2021 | 5/14/2021 |
|20221| 6/1/2021 | 8/5/2021 |
|20223| 8/23/2021 | 12/17/2021 |
|20225| 1/10/2022 | 5/13/2022 |
|20231| 5/31/2022 | 8/5/2022 |
|20233| 8/22/2022 | 12/16/2022 |
I was trying to use for loops in R to solve this problem, but ended up using a much simpler data.table method:
library(data.table)
setDT(df1)
setDT(df2)
# foverlaps() finds every df2 interval that overlaps each df1 interval
setkey(df2, BEGIN_DATE, END_DATE)
warn_long <- foverlaps(df1, df2, nomatch = NULL)  # drop rows with no overlap
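One detail the snippet glosses over: foverlaps() needs real date columns of the same type on both tables, so character dates would have to be parsed after setDT() and before setkey(), e.g. (formats taken from the tables above):
df1[, `:=`(BEGIN_DATE = as.IDate(BEGIN_DATE),
           END_DATE   = as.IDate(END_DATE))]              # "9999-12-31" parses fine
df2[, `:=`(BEGIN_DATE = as.IDate(BEGIN_DATE, "%m/%d/%Y"),
           END_DATE   = as.IDate(END_DATE, "%m/%d/%Y"))]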

Merge two large datasets by ID and timestamp, and also get measurements for next n time intervals in R

I want to merge two files using a unique ID and timestamp, and also get the measurements for the next n intervals.
The first file has over 15,000 unique IDs. Each ID has measurements taken at 15-minute intervals from Jan 1, 00:00 to Dec 31, 23:45. The file is quite big (35 GB), with over 500 million rows, and looks something like this.
First file
| ID | Time | Measurement|
|:----:|:---------------:|:------:|
| 1 |2012-12-31 22:45| 61 |
| 1 |2012-12-31 23:00| 60 |
| 1 |2012-12-31 23:15| 61 |
| 1 |2012-12-31 23:30| 59 |
| 1 |2012-12-31 23:45| 59 |
| 2 |2012-01-01 0:00| 60 |
| 2 |2012-01-01 0:15| 61 |
| 2 |2012-01-01 0:30| 60 |
| 2 |2012-01-01 0:45| 62 |
The second file has unique IDs and a timestamp. The IDs in this file are a subset of the IDs in the first file. The file is relatively small (~50 MB) compared to the first file.
Second file
| ID | Time |
|:----:|:---------------:|
| 1 |2012-12-31 22:48|
| 1 |2012-12-31 23:48|
| 2 |2012-01-01 0:16|
I want to merge the two files such that the measurements are extracted for the current interval and the intervals that follow. I also want to be able to specify n and run the code dynamically.
The merged file should look like this for n = 3. Note that the measurements for the following intervals must come from the same ID; for example, in the second row they should not be derived from another ID.
After merge
| ID | Time | Measurement 1| Measurement 2| Measurement 3|
|:----:|:---------------:|:----:|:---------------:|:----:|
| 1 | 2012-12-31 22:48| 61| 60| 61 |
| 1 | 2012-12-31 23:48| 59| 59| 59 |
| 2 | 2012-01-01 0:16| 61| 60| 62 |
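A possible sketch with data.table, assuming the two files are loaded as file1 and file2 (hypothetical names): a rolling join finds the interval containing each timestamp, and since file1 is physically sorted after setkey(), the following readings are simply the next rows, guarded so they never cross into another ID.
library(data.table)

setDT(file1); setDT(file2)
file1[, Time := as.POSIXct(Time, tz = "UTC")]
file2[, Time := as.POSIXct(Time, tz = "UTC")]
setkey(file1, ID, Time)            # sorts file1 by ID, then Time

n <- 3                             # number of intervals; change as needed

# rolling join: for each file2 timestamp, the index of the file1 row whose
# 15-minute interval contains it (last observation rolled forward)
idx <- file1[file2, which = TRUE, roll = TRUE, on = .(ID, Time)]

# pull the current reading plus the next n - 1, blanking out anything
# that would spill over into a different ID
for (k in 0:(n - 1)) {
  m  <- file1$Measurement[idx + k]
  id <- file1$ID[idx + k]
  m[is.na(id) | id != file2$ID] <- NA
  set(file2, j = paste0("Measurement", k + 1), value = m)
}
Note this sketch fills NA when fewer than n readings remain for an ID, whereas the expected output above appears to repeat the last reading instead; the guard inside the loop is easy to adjust either way.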

How do I merge 2 dataframes without a corresponding column to match by?

I'm trying to use the merge() function in R. I have two tables with 5000+ rows each; they have the same number of rows, but there are no corresponding columns to merge by. However, the rows are in order and correspond: the first row of dataframe1 should merge with the first row of dataframe2, the 2nd row of dataframe1 with the 2nd row of dataframe2, and so on.
Here's an example of what they could look like:
Dataframe1(df1):
+-------------------------------------+
| Name | Sales | Location |
+-------------------------------------+
| Rod | 123 | USA |
| Kelly | 142 | CAN |
| Sam | 183 | USA |
| Joyce | 99 | NED |
+-------------------------------------+
Dataframe2(df2):
+---------------------+
| Sex | Age |
+---------------------+
| M | 23 |
| M | 33 |
| M | 31 |
| F | 45 |
+---------------------+
NOTE: this is a downsized example only.
I've tried to use the merge function in RStudio, here's what I've done:
DFMerged <- merge(df1, df2)
This, however, increases both the rows and the columns: with no common column names, merge() performs a Cartesian product, so this example returns 16 rows and 5 columns.
What am I missing from this function? I know there is a merge(x, y, by=) argument, but I have no column to match them on.
The output I would like is:
+----------------------------------------------------------+
| Name | Sales | Location | Sex | Age |
+----------------------------------------------------------+
| Rod | 123 | USA | M | 23 |
| Kelly | 142 | CAN | M | 33 |
| Sam | 183 | USA | M | 31 |
| Joyce | 99 | NED | F | 45 |
+----------------------------------------------------------+
I've considered making an extra row-number column in each dataframe and matching them by that.
You could use cbind:
cbind(df1, df2)
If you want to use merge you could use:
merge(df1, df2, by=0)
You could use:
cbind(df1,df2)
This will only work when the two data frames have the same number of rows.
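One caveat with the by = 0 route: it joins on row names, which compare as character ("1", "10", "11", "2", ...), so the result can come back reordered and gains a Row.names column. Restoring the original order afterwards:
DFMerged <- merge(df1, df2, by = 0)   # by = 0 joins on row names
DFMerged <- DFMerged[order(as.numeric(DFMerged$Row.names)), ]  # restore numeric row order
DFMerged$Row.names <- NULL            # drop the helper column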

Filter multiple occurrences based on group [duplicate]

This question already has answers here:
dplyr - filter by group size
(7 answers)
Keep only groups of data with multiple observations
(2 answers)
Closed 3 years ago.
I have a dataset like mentioned below:
df = data.frame(
  Supplier_id = c("1","2","7","7","7","4","5","8","12","7"),
  Supplier = c("Tian","Yan","Goldy","Goldy","Goldy","Amy","Lauren","Cassy","Shaan","Goldy"),
  Date = c("1/17/2019","4/30/2019","11/29/2018","11/29/2018","11/29/2018",
           "5/21/2018","5/23/2018","5/24/2018","6/15/2018","6/20/2018"),
  Buyer = c("Unclassified","Unclassified","Kelly","Kelly","Kelly","Kelly",
            "Amanda","Echo","Shao","Shao")
)
df$Supplier_id = as.numeric(as.character(df$Supplier_id))
Thus, df appears like below:
| Supplier_id | Supplier | Date | Buyer |
|-------------|----------|------------|--------------|
| 1 | Tian | 1/17/2019 | Unclassified |
| 2 | Yan | 4/30/2019 | Unclassified |
| 7 | Goldy | 11/29/2018 | Kelly |
| 7 | Goldy | 11/29/2018 | Kelly |
| 7 | Goldy | 11/29/2018 | Kelly |
| 4 | Amy | 5/21/2018 | Kelly |
| 5 | Lauren | 5/23/2018 | Amanda |
| 8 | Cassy | 5/24/2018 | Echo |
| 12 | Shaan | 6/15/2018 | Shao |
| 7 | Goldy | 6/20/2018 | Shao |
Now, I want to keep only the Supplier_ids that occur more than once per unique Buyer. For example, Supplier_ids '1' and '2' both belong to the 'Unclassified' buyer, but because they are different ids, I do not want them in my final output. The buyer 'Kelly', however, has two Supplier_ids, '7' and '4', where '7' occurs 3 times and '4' only once, so the output table should keep the records with Supplier_id '7'. The grouping must be based on Buyer: since Supplier_id '7' exists for both 'Kelly' and 'Shao', it should be counted separately for each of these buyers, not together.
The expected output should be:
| Supplier_id | Supplier | Date       | Buyer |
|-------------|----------|------------|-------|
| 7           | Goldy    | 11/29/2018 | Kelly |
| 7           | Goldy    | 11/29/2018 | Kelly |
| 7           | Goldy    | 11/29/2018 | Kelly |
I have tried using group_by and filter, but this does not work because there are distinct Supplier_ids for every buyer. I have also tried duplicated(), but I am not sure how to group the Supplier_ids within each buyer.
df <-df %>% group_by(Buyer) %>% filter(Supplier_id>1)
and also this
df2=df[duplicated(df[1]) | duplicated(df[1], fromLast=TRUE),]
EDIT: The original dataset has many such instances and there are n occurrences of different supplier_id for each buyer.
What could be other way to get the desired output?
I think you need -
df %>% group_by(Supplier_id, Buyer) %>% filter(n() > 1)
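Written out against the sample data, with the library call included:
library(dplyr)

result <- df %>%
  group_by(Supplier_id, Buyer) %>%  # count each supplier per buyer, not globally
  filter(n() > 1) %>%               # keep only pairs that occur more than once
  ungroup()
On this data only the Supplier_id 7 / Kelly pair occurs more than once, so exactly the three Goldy/Kelly rows survive; the 7/Shao row drops out because that pair is counted separately and occurs only once.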

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R with which I can subset my raw dataframe according to some specifications and then convert the subsetted dataframe into a proportion table.
Unfortunately, some of these subsets yield an empty dataframe, as I have no data for some particular specifications, and hence no proportion table can be calculated. What I would like to do is take the closest time step for which I have a non-empty subsetted dataframe and use it as input for the empty one.
Here some insights to my dataframe and function:
My raw dataframe looks +/- as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.22 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to the particular time (year and quarter) and area for which a number of individuals (no_individuals) were measured. For example, from the first row we get that in the first quarter of 2005, in area 24, I had 8 individuals belonging to a length class (lenCls) of 380 mm with age 3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # expand each row by its number of individuals
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)  # return() must come after plotting, or the plot never runs
}
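For reference, a call on the example data might look like this (assuming the dataframe is named df):
key <- LAK(df, Year = "2005", Quarter = "1", Area = "24", alkplot = FALSE)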
From the dataset example above, one can notice that for year=2005 & quarter=3 & area=21 I do not have any measured individuals. Yet, for the same area and year I have data for quarters 1 and 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here quarter 2, with the same area and year) and replace the NAs in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that in some cases I do not have data for a particular year at all! In the example above, one can see this by looking at area 24 in year 2007. In this case I cannot borrow the information from the nearest quarter and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here is my LAK function which I'm trying to update:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # in case of an empty subset (no rows, or only NA placeholder rows)
  if(nrow(sALK) == 0 || all(is.na(sALK$no_individuals))){
    warning("Empty subset combination; data will be subsetted based on the
            nearest timestep combination")
    # FIXME: include imputation rules here
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)
}
So, I finally came up with a partial solution to my problem and will include my function here in case it is of interest to someone:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  print(sALK)
  # empty combinations appear as a single all-NA placeholder row
  if(nrow(sALK)==1){
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    # all rows for the same year and area
    sALK2 <- subset(df, year==syear & area==sarea)
    vals <- as.data.frame(table(sALK2$time_comb))
    colnames(vals)[1] <- "time_comb"
    # pick a time combination that actually has data (more than one row);
    # take the first such combination
    idx <- which(vals$Freq>1)
    quarterId <- as.character(vals[idx[1], "time_comb"])
    imput <- subset(df, year==syear & area==sarea & time_comb==quarterId)
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_individuals), 1:ncol(imput)]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin=1), 3)
    print(key2)
    if(alkplot==TRUE){
      alkPlot(key2, "area", xlab="Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin=1), 3)
    print(key)
    if(alkplot==TRUE){
      alkPlot(key, "area", xlab="Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular Year & Area combination. Yet I'm still struggling to figure out how to deal with the case where I have no data for a particular Year & Area combination at all. In that case I need to borrow data from the closest year that contains data for the quarters of the same area.
For the example above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on and so forth.
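One possible way to handle that case is a small helper that steps back one year at a time until it finds a non-empty subset for the same quarter and area; find_nearest_year is a hypothetical name and max_back an arbitrary cap, not anything from FSA:
# hypothetical helper: walk back one year at a time until data exists
find_nearest_year <- function(df, Year, Quarter, Area, max_back = 5) {
  for (k in 0:max_back) {
    s <- subset(df, year == (as.numeric(Year) - k) & quarter == Quarter &
                  area == Area & !is.na(no_individuals))
    if (nrow(s) > 0) return(s)
  }
  stop("No data within ", max_back, " years for this quarter/area")
}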
I don't know if you have ever encountered MICE, but it is a pretty cool and comprehensive tool for variable imputation. It also allows you to see how the imputed data are distributed, so that you can choose the method most suited to your problem. Check this brief explanation and the original package description.
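If it helps, a minimal sketch of a mice run on the numeric columns might look like this (raw_df is a hypothetical name for the raw dataframe shown above; whether pmm is appropriate for this dataset is a separate question):
library(mice)

imp <- mice(raw_df[, c("no_individuals", "lenCls", "age")],
            m = 5, method = "pmm", seed = 1)  # 5 imputations, predictive mean matching
completed <- complete(imp, 1)                 # extract the first completed dataset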
