Deleting rows dynamically based on a certain condition in R

Problem description:
From the table below, I want to remove all the rows above the quarter value 2014-Q3, i.e. rows 1 and 2.
Also note that this is a dynamic data set: when we move on to the next quarter (2016-Q3), I want to remove all the rows above the quarter value 2014-Q4 automatically, through code, without any manual intervention
(and when we move to the next quarter, 2016-Q4, I want to remove all rows above 2015-Q1, and so on).
I have a variable that captures the first quarter I would like to see in my final data frame (in this case 2014-Q3), and this variable will change as we progress into the future.
QTR Revenue
1 2014-Q1 456
2 2014-Q2 3113
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
.. .. ..
10 2016-Q2 232
How do I code this?

Here is a semi-automated method:
myFunc <- function(df, year, quarter) {
  dropper <- paste(year, paste0("Q", quarter - 1), sep = "-")
  df[-(1:which(as.character(df$QTR) == dropper)), ]
}
myFunc(df, 2014, 3)
QTR Revenue
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
To subset, just assign the output:
dfNew <- myFunc(df, 2014, 3)
At this point, you can pretty easily change the year and quarter to perform a new subset.
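Since the question mentions a variable holding the first quarter to keep, the subset can also be driven directly off that variable; here is a minimal base R sketch (the variable name start_qtr and the inline sample data are assumptions for illustration):

```r
# Sample data mirroring the question's table
df <- data.frame(
  QTR = c("2014-Q1", "2014-Q2", "2014-Q3", "2014-Q4", "2015-Q1"),
  Revenue = c(456, 3113, 23, 173, 1670),
  stringsAsFactors = FALSE
)

# Hypothetical variable tracking the first quarter to keep;
# update it each quarter, no manual row counting needed
start_qtr <- "2014-Q3"

# Keep every row from the target quarter onward
dfNew <- df[which(df$QTR == start_qtr):nrow(df), ]
```

This avoids computing the quarter to drop: you name the quarter to keep and take everything from that row down.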

Thanks lmo
Was going through articles, and I think we can use the dplyr package to do this in a much simpler way:
library(dplyr)
df <- df %>% slice((nrow(df) - 7):nrow(df))
Get the result below:
>df
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
.. .. ..
10 2016-Q2 232
This acts dynamically too: once more rows are entered beyond 2016-Q2, the window of 8 selected rows is maintained by the nrow() calls.
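A hedged variant of this dplyr approach that avoids hard-coding the 8-row window: anchor the slice on the quarter label itself (start_qtr is a hypothetical tracking variable, and the sample data is made up for illustration):

```r
library(dplyr)

df <- data.frame(
  QTR = c("2014-Q1", "2014-Q2", "2014-Q3", "2014-Q4", "2015-Q1"),
  Revenue = c(456, 3113, 23, 173, 1670),
  stringsAsFactors = FALSE
)

start_qtr <- "2014-Q3"  # first quarter to keep

# Slice from the row matching start_qtr through the last row;
# n() gives the row count inside dplyr verbs
df_new <- df %>% slice(which(QTR == start_qtr):n())
```

Unlike a fixed window, this still works if the number of trailing quarters ever changes.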

Related

Subset a df by the last non-NA value in a column

My dataframe looks like this:
Year aquil_7 aquil_8 aquil_9
2018 NA 201 222
2019 192 145 209
2020 166 121 NA
2021 190 NA NA
I want to subset this data frame to include only those columns whose last non-NA year is equal to or less than 2020. In the example above, this means deleting the aquil_7 column, since its last non-NA year is 2021.
How could I do this?
A simple base R answer.
Explanation: column-wise iteration (hence the 2 as apply's second argument) to check the given condition on every column except the first; cbind-ing the result with TRUE so that the first column is kept.
df <- read.table(text = "Year aquil_7 aquil_8 aquil_9
2018 NA 201 222
2019 192 145 209
2020 166 121 NA
2021 190 NA NA", header = T)
df[c(TRUE, apply((!is.na(df[-1])) * df$Year, 2, function(x) max(x) < 2021))]
Year aquil_8 aquil_9
1 2018 201 222
2 2019 145 209
3 2020 121 NA
4 2021 NA NA
Not sure if there's a better way to implement this (I do hope so). In the meantime, you could e.g. do
library(tidyverse)
cols_to_keep <- df %>%
  pivot_longer(-Year) %>%
  group_by(name) %>%
  summarize(var = max(Year[!is.na(value)]) <= 2020) %>%
  filter(var) %>%
  pull(name)
df %>%
  select(Year, all_of(cols_to_keep))
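For comparison, the stated condition ("last non-NA year <= 2020") can also be tested directly in base R with vapply(); a minimal sketch:

```r
df <- read.table(text = "Year aquil_7 aquil_8 aquil_9
2018 NA 201 222
2019 192 145 209
2020 166 121 NA
2021 190 NA NA", header = TRUE)

# For each data column, find the largest Year holding a non-NA value
# and keep the column only if that year is <= 2020
keep <- vapply(df[-1], function(col) max(df$Year[!is.na(col)]) <= 2020, logical(1))
df_kept <- df[c(TRUE, keep)]
```

This reads closest to the question's wording, at the cost of an explicit anonymous function.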

Creating a variable and applying a function to another one

I have a data frame with a column Arrivo (formatted as a date) and a column Giorni (formatted as an integer) holding a number of days (e.g. 2, 3, 6, etc.).
I would like to duplicate each row as many times as the number in the column Giorni and, while duplicating these rows, create a new column called data.osservazione that starts equal to Arrivo and increases by one day with each duplicate.
From this:
No. Casa Anno Data Categoria Camera Arrivo Stornata.il Giorni
1 2.867 SEELE 2019 03/09/2019 CDV 316 28/03/2020 NA 3
2 148.000 SEELE 2020 20/01/2020 CDS 105 29/03/2020 NA 3
3 3.684 SEELE 2019 16/11/2019 CD 102 02/04/2020 NA 5
to this:
No. data.osservazione Casa Anno Data Categoria Camera Arrivo
1 2867 3/28/2020 SEELE 2019 03/09/2019 CDV 316 3/28/2020 0:00:00
2 2867 3/29/2020 SEELE 2019 03/09/2019 CDV 316 3/28/2020 0:00:00
3 2867 3/30/2020 SEELE 2019 03/09/2019 CDV 316 3/28/2020 0:00:00
4 148 3/29/2020 SEELE 2020 20/01/2020 CDS 105 3/29/2020 0:00:00
5 148 3/30/2020 SEELE 2020 20/01/2020 CDS 105 3/29/2020 0:00:00
6 148 3/31/2020 SEELE 2020 20/01/2020 CDS 105 3/29/2020 0:00:00
Stornata.il Giorni
1 #N/D 3
2 #N/D 3
3 #N/D 3
4 #N/D 3
I was able to duplicate the rows but I don't know how to concurrently create the new column with the values I need.
Please don't mind the date values in the columns, I'll fix them in the end.
Thanks in advance
Since I am a fan of the data.table package, I will propose a solution using it. You can install it by typing install.packages("data.table") in the console.
My approach was to create a separate data.table with an index running from 0 up to whatever number is in Giorni, by row from the original data, and then merge this new table with the original data. By virtue of the many-to-one matches on the specified key, the resulting data.table "expands" to the desired size, duplicating the rows where necessary.
For this I used seq_len(). If you do seq_len(3L), you get [1] 1 2 3, the sequence from 1L to whatever integer you've given in length.out (for length.out >= 1L). Thus seq_len() will produce a sequence that ends at whatever is in Giorni; the challenge is to do this by row, since length.out in seq_len() must be a vector of size 1. We use by in data.table syntax to accomplish this.
So let's get started, first you load data.table:
library(data.table) # load data.table
setDT(data) # data.frame into data.table
In your example it isn't clear whether Arrivo is in Date format; I am assuming it isn't, so I convert it to Date -- you will need this to add the days later.
# is `Arrivo` a Date? If not, convert to Date format
data[["Arrivo"]] <- as.Date(data[["Arrivo"]], format = "%d/%m/%Y")
The next bit is key: using seq_len() and by in data.table syntax, I create a separate data.table --which is always a data.frame, but not the other way around-- with the sequence for every single element of Giorni, thereby expanding the data to the desired size. I use by = "No." because I want to apply seq_len() to the value of Giorni associated with each No..
# create an index with the count from `Giorni`, subtract by 1 so the first day is 0.
d1 <- data[, seq_len(Giorni) - 1, by = "No."]
Check the result, you can see where I am going now:
> d1
No. V1
1: 2867 0
2: 2867 1
3: 2867 2
4: 148 0
5: 148 1
Lastly, you join d1 with the original data; I am using data.table join syntax here. Then you add the index V1 to Arrivo:
# merge with previous data
res <- d1[data, on = "No."]
# add days to `Arrivo`, creating column data.osservazione
res[, data.osservazione := V1 + Arrivo]
Result:
> res
No. V1 Casa Anno Data Categoria Camera Arrivo
1: 2867 0 SEELE 2019 03/09/2019 CDV 316 2020-03-28
2: 2867 1 SEELE 2019 03/09/2019 CDV 316 2020-03-28
3: 2867 2 SEELE 2019 03/09/2019 CDV 316 2020-03-28
4: 148 0 SEELE 2019 20/01/2020 CDS 105 2020-03-29
5: 148 1 SEELE 2019 20/01/2020 CDS 105 2020-03-29
Stornata.il Giorni data.osservazione
1: NA 3 2020-03-28
2: NA 3 2020-03-29
3: NA 3 2020-03-30
4: NA 2 2020-03-29
5: NA 2 2020-03-30
The next commands are just cosmetic, formatting dates and deleting columns:
# reformat `Arrivo` and `data.osservazione`
cols <- c("Arrivo", "data.osservazione")
res[, (cols) := lapply(.SD, function(x) format(x=x, format="%d/%m/%Y")), .SDcols=cols]
# remove index
res[, V1 := NULL]
Console:
> res
    No.  Casa Anno       Data Categoria Camera     Arrivo Stornata.il Giorni data.osservazione
1: 2867 SEELE 2019 03/09/2019       CDV    316 28/03/2020          NA      3        28/03/2020
2: 2867 SEELE 2019 03/09/2019       CDV    316 28/03/2020          NA      3        29/03/2020
3: 2867 SEELE 2019 03/09/2019       CDV    316 28/03/2020          NA      3        30/03/2020
4:  148 SEELE 2019 20/01/2020       CDS    105 29/03/2020          NA      2        29/03/2020
5:  148 SEELE 2019 20/01/2020       CDS    105 29/03/2020          NA      2        30/03/2020
Hi @JdeMello, and really thank you for the quick answer!
Indeed it was what I was looking for, but in the meantime I found a solution myself using lubridate, tidyverse, and purrr.
What I did was transform the variables from POSIXct to Date (revenue is my df):
revenue <- revenue %>%
  mutate(Data = as_date(Data),
         Arrivo = as_date(Arrivo),
         `Stornata il` = as_date(`Stornata il`),
         Partenza = as_date(Partenza))
Then I created another data frame but included variables id and data_obs:
revenue_1 <- revenue %>% mutate(data_obs = Arrivo, id = 1:nrow(revenue))
I created another data frame with the variable data_obs iterated by the number of Giorni:
revenue_2 <- revenue_1 %>%
  group_by(id, data_obs) %>%
  complete(Giorni = sequence(Giorni)) %>%
  ungroup() %>%
  mutate(data_obs = data_obs + Giorni - 1)
I extracted data_obs:
data_obs <- revenue_2$data_obs
I created another data frame to duplicate the rows:
revenue_3 <- revenue %>% map_df(., rep, .$Giorni)
And finally created the ultimate data frame that I needed:
revenue_finale <- revenue_3 %>% mutate(data_obs = data_obs)
I know it's kind of redundant to have created all those data frames, but I have very little knowledge of R at the moment and had to work around it.
I wanted to merge the data frames, but for reasons unknown to me it didn't work out.
What I find kind of fun is that you can play with many packages to get to your goal instead of using just one.
I've never used data.table, so your answer is very interesting and I'll try to memorize it.
So again, really, really thank you!!
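For reference, the whole expansion above can also be sketched in one pipeline with tidyr::uncount(), which duplicates each row Giorni times and records a copy index (a sketch with made-up sample data, not the method used in either answer):

```r
library(dplyr)
library(tidyr)

# Minimal sample data mirroring the question
revenue <- data.frame(
  No. = c(2867, 148),
  Arrivo = as.Date(c("2020-03-28", "2020-03-29")),
  Giorni = c(3, 2)
)

revenue_exp <- revenue %>%
  # duplicate each row Giorni times; keep Giorni, store copy index in d
  uncount(Giorni, .remove = FALSE, .id = "d") %>%
  # first copy keeps Arrivo, each further copy adds one day
  mutate(data_obs = Arrivo + d - 1)
```

The .id column plays the same role as the V1 index in the data.table answer.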

Reshaping data in R with multiple variable levels - "aggregate function missing" warning

I'm trying to use dcast in reshape2 to transform a data frame from long to wide format. The data is hospital visit dates and a list of diagnoses. (Dx.num lists the sequence of diagnoses in a single visit. If the same patient returns, this variable starts over and the primary diagnosis for the new visit starts at 1.) I would like there to be one row per individual (id). The data structure is:
id visit.date visit.id bill.num dx.code FY Dx.num
1 1/2/12 203 1234 409 2012 1
1 3/4/12 506 4567 512 2013 1
2 5/6/18 222 3452 488 2018 1
2 5/6/18 222 3452 122 2018 2
3 2/9/14 567 6798 923 2014 1
I'm imagining I would end up with columns like this:
id, date_visit1, date_visit2, visit.id_visit1, visit.id_visit2, bill.num_visit1, bill.num_visit2, dx.code_visit1_dx1, dx.code_visit1_dx2 dx.code_visit2_dx1, FY_visit1_dx1, FY_visit1_dx2, FY_visit2_dx1
Originally, I tried creating a visit_dx column like this one:
visit.dx
v1dx1 (visit 1, dx 1)
v2dx1 (visit 2, dx 1)
v1dx1 (...)
v1dx2
v1dx1
And used the following code, omitting "Dx.num" from the DF, as it's accounted for in "visit.dx":
wide <- dcast(
  setDT(long),
  id + visit.date + visit.id + bill.num ~ visit.dx,
  value.var = c("dx.code", "FY")
)
When I run this, I get the warning "Aggregate function missing, defaulting to 'length'" and new dataframe full of 0's and 1's. There are no duplicate rows in the dataframe, however. I'm beginning to think I should go about this completely differently.
Any help would be much appreciated.
The data.table package extended dcast with rowid() and support for multiple value.var columns, so...
library(data.table)
dcast(setDT(DF), id ~ rowid(id), value.var=setdiff(names(DF), "id"))
id visit.date_1 visit.date_2 visit.id_1 visit.id_2 bill.num_1 bill.num_2 dx.code_1 dx.code_2 FY_1 FY_2 Dx.num_1 Dx.num_2
1: 1 1/2/12 3/4/12 203 506 1234 4567 409 512 2012 2013 1 1
2: 2 5/6/18 5/6/18 222 222 3452 3452 488 122 2018 2018 1 2
3: 3 2/9/14 <NA> 567 NA 6798 NA 923 NA 2014 NA 1 NA
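A self-contained sketch to reproduce this answer from the sample data in the question:

```r
library(data.table)

# Rebuild the question's sample data
DF <- setDT(read.table(text = "id visit.date visit.id bill.num dx.code FY Dx.num
1 1/2/12 203 1234 409 2012 1
1 3/4/12 506 4567 512 2013 1
2 5/6/18 222 3452 488 2018 1
2 5/6/18 222 3452 122 2018 2
3 2/9/14 567 6798 923 2014 1", header = TRUE))

# rowid(id) numbers the rows within each id (1, 2, ...),
# and every non-id column becomes a value.var
wide <- dcast(DF, id ~ rowid(id), value.var = setdiff(names(DF), "id"))
```

Each value.var gets suffixed with the within-id row number, e.g. dx.code_1, dx.code_2.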

extracting values of a column into a string and replacing values in a data frame column

More than the programming, I am lost on the right approach for this problem. I have two data frames with a market name column. Unfortunately the names vary by a few characters/symbols between the columns, e.g. Albany.Schenectady.Troy = ALBANY, Boston.Manchester = BOSTON.
I want to standardize the market names in both data frames so I can perform merge operations later.
I thought of tackling the problem in two steps:
1) Create a vector of the unique market names from both tables and use that to create a look up table. Something that looks like:
Table 1 Markets > "Albany.Schenectady.Troy" , "Albuquerque.Santa.Fe", "Atlanta" . . . .
Table2 Markets > "SPOKANE" , "BOSTON" . . .
I tried marketnamesvector <- paste(unique(Table1$Market, sep = "", collapse = ",")) but that doesn't produce the desired output.
2) Change Market names in Table 2 to equivalent market names in Table 1. For any market name not available in Table 1, Table 2 should retain the same value in market name.
I know I could use a looping function like below but I still need a lookup table I think.
replacefunc <- function (data, oldvalue, newvalue) {
newdata <- data
for (i in unique(oldvalue)) newdata[data == i] <- newvalue[oldvalue == i]
newdata
}
Table 1: This table is 90 rows x 2 columns and has 90 unique market names.
Market Leads Investment Leads1 Leads2 Leads3
1 Albany.Schenectady.Troy NA NA NA NA NA
2 Albuquerque.Santa.Fe NA NA NA NA NA
3 Atlanta NA NA NA NA NA
4 Austin NA NA NA NA NA
5 Baltimore NA NA NA NA NA
Table 2 : This table is 150K rows x 20 columns and has 89 unique market names.
> df
Spot.ID Date Hour Time Local.Date Broadcast.Week Local.Hour Local.Time Market
2 13072765 6/30/14 0 12:40 AM 2014-06-29 1 21 9:40 PM SPOKANE
261 13072946 6/30/14 5 5:49 AM 2014-06-30 1 5 5:49 AM BOSTON
356 13081398 6/30/14 10 10:52 AM 2014-06-30 1 7 7:52 AM SPOKANE
389 13082306 6/30/14 11 11:25 AM 2014-06-30 1 8 8:25 AM SPOKANE
438 13082121 6/30/14 8 8:58 AM 2014-06-30 1 8 8:58 AM BOSTON
469 13081040 6/30/14 9 9:17 AM 2014-06-30 1 9 9:17 AM ALBANY
482 13080104 6/30/14 12 12:25 PM 2014-06-30 1 9 9:25 AM SPOKANE
501 13082120 6/30/14 9 9:36 AM 2014-06-30 1 9 9:36 AM BOSTON
617 13080490 6/30/14 13 1:23 PM 2014-06-30 1 10 10:23 AM SPOKANE
Assume that the data is in data frames df1, df2. The goal is to adjust the market names to be the same, they are currently slightly different.
First, list the markets; use the following commands to get the sorted unique names in each data frame.
mk1 <- sort(unique(df1$market))
mk2 <- sort(unique(df2$market))
dmk12 <- setdiff(mk1,mk2)
dmk21 <- setdiff(mk2,mk1)
Use dmk12 and dmk21 to identify the differing markets. Decide which names to use and how they match up; say we want to change df2's "Atlanta" to df1's "Atlanta, GA". Then use
df2[df2$market=="Atlanta","market"] = "Atlanta, GA"
The format is
df_to_change[df_to_change[,"column"]=="old data", "column"] = "new data"
If you only have 90 names to correct, I would write out 90 change lines like the one above.
After adjusting all the names, recompute the sorted unique market lists and use setdiff twice to confirm all the names are the same.
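The lookup-table idea from the question can also be sketched with a named vector, replacing only the names that have a mapping and leaving the rest (e.g. SPOKANE) untouched; the mappings here are hypothetical examples:

```r
# Hypothetical mapping from Table 2 names to Table 1 names
lookup <- c(ALBANY = "Albany.Schenectady.Troy",
            BOSTON = "Boston.Manchester")

markets <- c("SPOKANE", "BOSTON", "ALBANY", "BOSTON")

# Indexing the named vector gives NA for unmapped names;
# fall back to the original name in that case
mapped <- lookup[markets]
standardized <- unname(ifelse(is.na(mapped), markets, mapped))
```

This scales better than 90 individual replacement lines once the mapping vector is written down.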

count unique values in one column for specific values in another column

I have a data frame on bills that has (among other variables) a column for 'year', a column for 'issue', and a column for 'sub issue.' A simplified example df looks like this:
year issue sub issue
1970 4 20
1970 3 21
1970 4 22
1970 2 8
1971 5 31
1971 4 22
1971 9 10
1971 3 21
1971 4 22
Etc., for about 60 years. I want to count the unique values in the issue and sub issue columns for each year, and use those to create a new df- dat2. Using the df above, dat2 would look like this:
year issues sub issues
1970 3 4
1971 4 4
Wary of factors, I confirmed that the values in all columns are integers, in case that makes a difference. I am new to R (obviously), and I haven't been able to find relevant code for this specific purpose online. Thanks for any help!
That's a one-liner with aggregate:
with(d, aggregate(cbind(issue, subissue) ~ year, FUN = function(x) length(unique(x))))
returning:
year issue subissue
1 1970 3 4
2 1971 4 4
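An equivalent sketch with dplyr's n_distinct(), in case you prefer that syntax (the sample data is rebuilt inline from the question):

```r
library(dplyr)

d <- data.frame(
  year = c(1970, 1970, 1970, 1970, 1971, 1971, 1971, 1971, 1971),
  issue = c(4, 3, 4, 2, 5, 4, 9, 3, 4),
  subissue = c(20, 21, 22, 8, 31, 22, 10, 21, 22)
)

# Count distinct issues and sub-issues per year
dat2 <- d %>%
  group_by(year) %>%
  summarize(issues = n_distinct(issue), subissues = n_distinct(subissue))
```

n_distinct(x) is dplyr's shorthand for length(unique(x)), so this mirrors the aggregate call exactly.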
