I have a dataframe with a column Arrivo (formatted as date) and a column Giorni (formatted as integer) containing a number of days (e.g. 2, 3, 6, etc.).
I would like to apply two operations to these columns: duplicate each row as many times as the value in Giorni, and, while duplicating, create a new column called data.osservazione that starts equal to Arrivo and increases by one day with each duplicate.
From this:
No. Casa Anno Data Categoria Camera Arrivo Stornata.il Giorni
1 2.867 SEELE 2019 03/09/2019 CDV 316 28/03/2020 NA 3
2 148.000 SEELE 2020 20/01/2020 CDS 105 29/03/2020 NA 3
3 3.684 SEELE 2019 16/11/2019 CD 102 02/04/2020 NA 5
to this:
No. data.osservazione Casa Anno Data Categoria Camera Arrivo
1 2867 3/28/2020 SEELE 2019 03/09/2019 CDV 316 3/28/2020 0:00:00
2 2867 3/29/2020 SEELE 2019 03/09/2019 CDV 316 3/28/2020 0:00:00
3 2867 3/30/2020 SEELE 2019 03/09/2019 CDV 316 3/28/2020 0:00:00
4 148 3/29/2020 SEELE 2020 20/01/2020 CDS 105 3/29/2020 0:00:00
5 148 3/30/2020 SEELE 2020 20/01/2020 CDS 105 3/29/2020 0:00:00
6 148 3/31/2020 SEELE 2020 20/01/2020 CDS 105 3/29/2020 0:00:00
Stornata.il Giorni
1 #N/D 3
2 #N/D 3
3 #N/D 3
4 #N/D 3
I was able to duplicate the rows but I don't know how to concurrently create the new column with the values I need.
Please don't mind the date values in the columns, I'll fix them in the end.
Thanks in advance
Since I am a fan of the data.table package, I will propose a solution using data.table. You can install it by typing install.packages("data.table") in the console.
My approach is to create a separate data.frame with an index running from 0 up to whatever number is in Giorni, row by row, and then merge this new data.frame with your original data. By virtue of the many-to-one matches on the specified key, the resulting data.frame "expands" to the desired size, duplicating rows where necessary.
For this, I used seq_len(). If you do seq_len(3L), you get [1] 1 2 3, i.e. the sequence from 1L up to whatever integer you give as length.out (when length.out >= 1L). Thus seq_len() will produce a sequence that ends at whatever is in Giorni. The challenge is to do this by row, since length.out in seq_len() must be a vector of size 1; we use by in data.table syntax to accomplish this.
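As a side note, a small sketch of the length-1 restriction: base R also has sequence(), which is essentially a vectorized seq_len() and accepts a vector of lengths (the asker's own solution further down relies on it). Here we stick with seq_len() plus by:

```r
# seq_len() takes a single length
seq_len(3L)
# [1] 1 2 3

# sequence() accepts a vector of lengths and concatenates the per-element sequences
sequence(c(3L, 2L))
# [1] 1 2 3 1 2
```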
So let's get started, first you load data.table:
library(data.table) # load data.table
setDT(data) # data.frame into data.table
In your example it isn't clear whether Arrivo is already in Date format; I am assuming it isn't, so I convert it to Date first. You will need this to add the days later.
# is `Arrivo` a Date? If not, convert it
data[["Arrivo"]] <- as.Date(data[["Arrivo"]], format = "%d/%m/%Y")
The next bit is key: using seq_len() and by in data.table syntax, I create a separate data.table (which is always a data.frame, but not the other way around) with the sequence for every single element of Giorni, thereby expanding the data to the desired size. I use by = "No." because I want to apply seq_len() to the value of Giorni associated with each value of No..
# create an index with the count from `Giorni`, subtract by 1 so the first day is 0.
d1 <- data[, seq_len(Giorni) - 1, by = "No."]
Check the result, you can see where I am going now:
> d1
No. V1
1: 2867 0
2: 2867 1
3: 2867 2
4: 148 0
5: 148 1
Lastly, you join d1 with the original data; I am using data.table join syntax here. Then you add the index V1 to Arrivo:
# merge with previous data
res <- d1[data, on = "No."]
# add days to `Arrivo`, create column data.osservazione
res[ , data.osservazione := V1 + Arrivo]
Result:
> res
No. V1 Casa Anno Data Categoria Camera Arrivo
1: 2867 0 SEELE 2019 03/09/2019 CDV 316 2020-03-28
2: 2867 1 SEELE 2019 03/09/2019 CDV 316 2020-03-28
3: 2867 2 SEELE 2019 03/09/2019 CDV 316 2020-03-28
4: 148 0 SEELE 2019 20/01/2020 CDS 105 2020-03-29
5: 148 1 SEELE 2019 20/01/2020 CDS 105 2020-03-29
Stornata.il Giorni data.osservazione
1: NA 3 2020-03-28
2: NA 3 2020-03-29
3: NA 3 2020-03-30
4: NA 2 2020-03-29
5: NA 2 2020-03-30
The next commands are just cosmetic, formatting dates and deleting columns:
# reformat `Arrivo` and `data.osservazione`
cols <- c("Arrivo", "data.osservazione")
res[, (cols) := lapply(.SD, function(x) format(x=x, format="%d/%m/%Y")), .SDcols=cols]
# remove index
res[, V1 := NULL]
Console:
> res
    No.  Casa Anno       Data Categoria Camera     Arrivo Stornata.il Giorni data.osservazione
1: 2867 SEELE 2019 03/09/2019       CDV    316 28/03/2020          NA      3        28/03/2020
2: 2867 SEELE 2019 03/09/2019       CDV    316 28/03/2020          NA      3        29/03/2020
3: 2867 SEELE 2019 03/09/2019       CDV    316 28/03/2020          NA      3        30/03/2020
4:  148 SEELE 2019 20/01/2020       CDS    105 29/03/2020          NA      2        29/03/2020
5:  148 SEELE 2019 20/01/2020       CDS    105 29/03/2020          NA      2        30/03/2020
Hi @JdeMello, and really thank you for the quick answer!
Indeed it was what I was looking for, but in the meantime I found a solution myself using lubridate, tidyverse and purrr.
What I did was transform the variables from POSIXct to Date (revenue is my df):
revenue <- revenue %>%
  mutate(Data = as_date(Data),
         Arrivo = as_date(Arrivo),
         `Stornata il` = as_date(`Stornata il`),
         Partenza = as_date(Partenza))
Then I created another data frame but included variables id and data_obs:
revenue_1 <- revenue %>% mutate(data_obs = Arrivo, id = 1:nrow(revenue))
I created another data frame with the variable data_obs iterated by the number of Giorni:
revenue_2 <- revenue_1 %>% group_by(id, data_obs) %>%
complete(Giorni = sequence(Giorni)) %>%
ungroup() %>%
mutate(data_obs = data_obs + Giorni -1)
I extracted data_obs:
data_obs <- revenue_2$data_obs
I created another data frame to duplicate the rows:
revenue_3 <- revenue %>% map_df(.,rep, .$Giorni)
And finally created the ultimate data frame that I needed:
revenue_finale <- revenue_3 %>% mutate(data_obs = data_obs)
I know it's kind of redundant having created all those data frames, but I have very little knowledge of R at the moment and had to work around it.
I wanted to merge the data frames instead, but for reasons unknown to me it didn't work out.
What I found kinda fun is that you can play with many packages to get to your point instead of using just one.
I've never used data.table so your answer is very interesting and I'll try to memorize it.
So again, really really thank you!!
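For what it's worth, the whole multi-step workaround above can be collapsed into one pipeline with tidyr::uncount(), which repeats each row Giorni times and records the repeat index. A minimal sketch on made-up data, assuming Arrivo is already a Date:

```r
library(dplyr)
library(tidyr)

# toy version of the revenue data (made-up values for illustration)
revenue <- tibble(
  No.    = c(2867, 148),
  Arrivo = as.Date(c("2020-03-28", "2020-03-29")),
  Giorni = c(3L, 2L)
)

revenue_finale <- revenue %>%
  uncount(Giorni, .remove = FALSE, .id = "idx") %>%  # duplicate each row Giorni times
  mutate(data_obs = Arrivo + idx - 1L) %>%           # shift Arrivo by 0, 1, 2, ... days
  select(-idx)
```

This replaces the intermediate revenue_1/revenue_2/revenue_3 data frames with a single chain.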
I have a large data frame of data from across months and I want to select the
first number that is not NA in each row. For instance ID 895 would correspond to the value in Feb15, 687.
ID Jan15 Feb15 Mar15 Apr15
----- ------- ------- ------- -------
100 NA NA NA 625
113 451 586 NA NA
895 NA 687 313 17
454 NA 977 NA 146
It would be helpful to store them in a variable so I could perform further calculations by month.
apply(tempdat[,32:43],1, function(x) head(which(x>0),1))
This data frame contains thousands of rows, so is it possible to have all the numbers returned for each month stored in their own new variables, or in one new data frame by month?
In this case:
AggJan15 = 451
AggFeb15 = 687
AggMar15 = 0
AggApr15 = 625
The two answers below are based on different assumptions on what the question is saying.
1) In this answer we assume you want the first non-NA in each row. First find the index of the first non-NA in each row using max.col, giving ix. Then create an output data frame whose first column is ID, whose second is the first non-NA month for that row, and whose third column is the value in that month. The next line NAs out any month that does not have a non-NA value; it is not needed if you know that every row has at least one non-NA. Note that we convert month/year to class yearmon so that the months sort properly.
library(zoo)
DF1 <- DF[-1]
ix <- max.col(!is.na(DF1), "first")
out <- data.frame(ID = DF$ID,
month = as.yearmon(names(DF1)[ix], "%b%y"),
value = DF1[cbind(1:nrow(DF1), ix)])
out$month[is.na(out$value)] <- NA
## ID month value
## 1 100 Apr 2015 625
## 2 113 Jan 2015 451
## 3 895 Feb 2015 687
In a comment the poster says they want the sum by month, so in that case we first sum by month, giving ag, and then merge that with all months within the range to fill it out. The third line can be omitted if it is OK to have absent months filled in with NA; otherwise, use it and they will be filled with 0.
ag <- aggregate(value ~ month, out, sum)
m <- merge(ag, seq(min(ag$month), max(ag$month), 1/12), by = 1, all = TRUE)
m$value[is.na(m$value)] <- 0
## month value
## 1 Jan 2015 451
## 2 Feb 2015 687
## 3 Mar 2015 0
## 4 Apr 2015 625
2) Originally I thought you wanted the first non-NA in each column and this answer addresses that.
Assuming DF is as shown reproducibly in the Note at the end use na.locf specifying reverse order and take the first row.
library(zoo)
Agg <- na.locf(DF[-1], fromLast = TRUE)[1, ]
Agg
## Jan15 Feb15 Mar15 Apr15
## 1 451 586 313 625
Agg$Jan15
## [1] 451
Note
Lines <- "ID Jan15 Feb15 Mar15 Apr15
----- ------- ------- ------- -------
100 NA NA NA 625
113 451 586 NA NA
895 NA 687 313 17 "
DF <- read.table(text = Lines, header = TRUE, comment.char = "-")
I'm trying to use dcast in reshape2 to transform a data frame from long to wide format. The data is hospital visit dates and a list of diagnoses. (Dx.num lists the sequence of diagnoses in a single visit. If the same patient returns, this variable starts over and the primary diagnosis for the new visit starts at 1.) I would like there to be one row per individual (id). The data structure is:
id visit.date visit.id bill.num dx.code FY Dx.num
1 1/2/12 203 1234 409 2012 1
1 3/4/12 506 4567 512 2013 1
2 5/6/18 222 3452 488 2018 1
2 5/6/18 222 3452 122 2018 2
3 2/9/14 567 6798 923 2014 1
I'm imagining I would end up with columns like this:
id, date_visit1, date_visit2, visit.id_visit1, visit.id_visit2, bill.num_visit1, bill.num_visit2, dx.code_visit1_dx1, dx.code_visit1_dx2 dx.code_visit2_dx1, FY_visit1_dx1, FY_visit1_dx2, FY_visit2_dx1
Originally, I tried creating a visit_dx column like this one:
**visit.dx**
v1dx1 (visit 1, dx 1)
v2dx1 (visit 2, dx 1)
v1dx1 (...)
v1dx2
v1dx1
And used the following code, omitting "Dx.num" from the DF, as it's accounted for in "visit.dx":
wide <-
dcast(
setDT(long),
id + visit.date + visit.id + bill.num ~ visit.dx,
value.var = c(
"dx.code",
"FY"
)
)
When I run this, I get the warning "Aggregate function missing, defaulting to 'length'" and new dataframe full of 0's and 1's. There are no duplicate rows in the dataframe, however. I'm beginning to think I should go about this completely differently.
Any help would be much appreciated.
The data.table package extended dcast() with rowid() and support for multiple value.var columns, so...
library(data.table)
dcast(setDT(DF), id ~ rowid(id), value.var=setdiff(names(DF), "id"))
id visit.date_1 visit.date_2 visit.id_1 visit.id_2 bill.num_1 bill.num_2 dx.code_1 dx.code_2 FY_1 FY_2 Dx.num_1 Dx.num_2
1: 1 1/2/12 3/4/12 203 506 1234 4567 409 512 2012 2013 1 1
2: 2 5/6/18 5/6/18 222 222 3452 3452 488 122 2018 2018 1 2
3: 3 2/9/14 <NA> 567 NA 6798 NA 923 NA 2014 NA 1 NA
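In case it helps, here is a self-contained sketch reconstructing the question's data to show the call end to end:

```r
library(data.table)

# rebuild the long-format data from the question
DF <- data.table(
  id         = c(1, 1, 2, 2, 3),
  visit.date = c("1/2/12", "3/4/12", "5/6/18", "5/6/18", "2/9/14"),
  visit.id   = c(203, 506, 222, 222, 567),
  bill.num   = c(1234, 4567, 3452, 3452, 6798),
  dx.code    = c(409, 512, 488, 122, 923),
  FY         = c(2012, 2013, 2018, 2018, 2014),
  Dx.num     = c(1, 1, 1, 2, 1)
)

# rowid(id) numbers the rows within each id, so each record of a
# patient lands in its own _1, _2, ... column set
wide <- dcast(DF, id ~ rowid(id), value.var = setdiff(names(DF), "id"))
```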
I have projection data in a data.frame (the result of projecting all German weather station data onto a German shapefile). As a first step, I want to extract all rows whose begin date and end date fall within 01.01.1981 ~ 31.12.2014. I took a subset of the original merged data.frame, but the operation failed and I don't know why. I shared the original data in csv format (data is here). Basically, I want to keep all instances whose date interval falls between 01.01.1981 and 31.12.2014 (I need to analyze the most recent 35 years of German weather data). I was pretty sure my code would work with my data, but it still failed. Any quick solution? How can I make this happen in R? Can dplyr or data.table help with this? Any other thoughts? Thanks
Here is what the data look like (the original data source is linked above):
Stationsname Stations_ID ID__Index Station.Identification Width Length Station_Height River_Basin Federal_state
1 Aach 1 KL 02783 47.8410 8.8490 478 NA BW
2 Aach 1 RR 70191 47.8410 8.8490 478 NA BW
3 Aach/Hegau 10771 PE 10771 47.8500 8.8500 480 NA BW
4 Aachen 3 EB 02205 50.7827 6.0941 202 803100 NW
5 Aachen 3 FF 02205 50.7827 6.0941 202 803100 NW
6 Aachen 3 KL 02205 50.7827 6.0941 202 803100 NW
Begin End ID_0 ISO NAME_0 ID_1 NAME_1 ID_2 NAME_2 HASC_2 CCN_2 CCA_2
1 01.01.1937 30.06.1986 86 DEU Germany 1 Baden-Württemberg 22 Konstanz DE.BW.KN 0 8335
2 01.01.1912 30.06.1986 86 DEU Germany 1 Baden-Württemberg 22 Konstanz DE.BW.KN 0 8335
3 86 DEU Germany 1 Baden-Württemberg 22 Konstanz DE.BW.KN 0 8335
4 01.01.1951 31.03.2011 86 DEU Germany 10 Nordrhein-Westfalen 290 Städteregion Aachen DE.NW.AC 0 5334
5 01.01.1937 31.03.2011 86 DEU Germany 10 Nordrhein-Westfalen 290 Städteregion Aachen DE.NW.AC 0 5334
6 01.01.1891 31.03.2011 86 DEU Germany 10 Nordrhein-Westfalen 290 Städteregion Aachen DE.NW.AC 0 5334
TYPE_2 ENGTYPE_2 NL_NAME_2 VARNAME_2
1 Landkreis District NA
2 Landkreis District NA
3 Landkreis District NA
4 Kreis District NA
5 Kreis District NA
6 Kreis District NA
I read the dataset in as shown below:
joinedData <- read.csv(file = "~/joinedLayer_attrTabl.csv",sep = "," ,header = TRUE)
head(as.data.frame(joinedData)); tail(as.data.frame(joinedData))
This is my initial tryout:
dateInterval <- function(x,y){joinedData[joinedData$Begin >= x
& joinedData$End <= y,]}
DATE1 <- as.Date("01-01-1981")
DATE2 <- as.Date("31-12-2014")
res <- dateInterval(DATE1,DATE2)
Here is the error that raised by Rstudio:
> dateInterval <- function(x,y){joinedData[joinedData$Begin > x & joinedData$End < y, ]}
>
> DATE1 <- as.Date("01-01-1981")
> DATE2 <- as.Date("31-12-2014")
> res <- dateInterval(DATE1,DATE2)
Warning messages:
1: In `[.data.frame`(joinedData, joinedData$Begin > x & joinedData$End < :
Incompatible methods ("Ops.factor", "Ops.Date") for ">"
2: In `[.data.frame`(joinedData, joinedData$Begin > x & joinedData$End < :
Incompatible methods ("Ops.factor", "Ops.Date") for "<"
I also tried this down below:
joinedData[joinedData$Begin & joinedData$End %between% c("01.01.1981", "31.12.2014"),]
still, I didn't get my expected result. Why did this error happen to me? Any idea?
Expected output:
All rows whose begin and end dates fall within my specified date interval. Is there any way to fix the problem? How can I make this happen?
I can see a couple of problems in the OP's code.
Prob#1: The default format expected by as.Date() is "%Y-%m-%d" or "%Y/%m/%d", but the character strings used in the code are in "%d-%m-%Y" format (and the Begin/End columns are in "%d.%m.%Y"). Hence the default format in as.Date() will not work; the format argument must be provided explicitly.
The correct code to create DATE1 and DATE2 should be:
DATE1 <- as.Date("01-01-1981", format = "%d-%m-%Y")
DATE2 <- as.Date("31-12-2014", format = "%d-%m-%Y")
Prob#2: The Begin and End columns of the dataframe should be converted to Date as well before attempting the filter operations.
The format of those 2 columns can be changed as:
joinedData$Begin <- as.Date(joinedData$Begin, format = "%d.%m.%Y")
joinedData$End   <- as.Date(joinedData$End,   format = "%d.%m.%Y")
Now, the OP's initial approach should work.
Note: Personally I prefer using as.POSIXlt over as.Date
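Putting both fixes together, a minimal sketch on a toy version of the data (made-up rows, since the linked csv isn't reproduced here):

```r
# toy data with Begin/End stored as "%d.%m.%Y" strings, as in the csv
joinedData <- data.frame(
  Stationsname = c("Aach", "Aachen", "Aachen"),
  Begin        = c("01.01.1937", "01.01.1990", "01.01.1891"),
  End          = c("30.06.1986", "31.12.2000", "31.03.2011"),
  stringsAsFactors = FALSE
)

# convert the columns and the bounds with explicit formats
joinedData$Begin <- as.Date(joinedData$Begin, format = "%d.%m.%Y")
joinedData$End   <- as.Date(joinedData$End,   format = "%d.%m.%Y")
DATE1 <- as.Date("01-01-1981", format = "%d-%m-%Y")
DATE2 <- as.Date("31-12-2014", format = "%d-%m-%Y")

# now the comparison is Date vs Date and the subset works
res <- joinedData[joinedData$Begin >= DATE1 & joinedData$End <= DATE2, ]
```

Only the second row falls entirely inside the interval, so res contains just that row.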
I have two dfs as below
>codes1
Country State City Start No End No
IN Telangana Hyderabad 100 200
IN Maharashtra Pune (Bund Garden) 300 400
IN Haryana Gurgaon 500 600
IN Maharashtra Pune 700 800
IN Gujarat Ahmedabad (Vastrapur) 900 1000
Now I want to tag the numbers in my second df against these ranges:
>codes2
ID No
1 157
2 346
3 389
4 453
5 562
6 9874
7 98745
Now I want to tag the numbers in the codes2 df according to the range given in the codes1 df for the No column. The expected output is:
ID No Country State City
1 157 IN Telangana Hyderabad
2 346 IN Maharashtra Pune(Bund Garden)
.
.
.
Basically, I want to tag the No column in codes2 with codes1 according to the range (Start No and End No) that each No observation falls in.
Also, the rows in the codes2 df could be in any order.
You could use the non-equi join capability of the data.table package for that:
library(data.table)
setDT(codes1)
setDT(codes2)
codes2[codes1, on = .(No > StartNo, No < EndNo), ## (1)
`:=`(cntry = Country, state = State, city = City)] ## (2)
(1) obtains matching row indices in codes2 corresponding to each row in codes1, while matching on the condition provided to the on argument.
(2) updates codes2 values for those matching rows for the columns specified directly by reference (i.e., you don't have to assign the result back to another variable).
This gives:
codes2
# ID No cntry state city
# 1: 1 157 IN Telangana Hyderabad
# 2: 2 346 IN Maharashtra Pune (Bund Garden)
# 3: 3 389 IN Maharashtra Pune (Bund Garden)
# 4: 4 453 NA NA NA
# 5: 5 562 IN Haryana Gurgaon
# 6: 6 9874 NA NA NA
# 7: 7 98745 NA NA NA
If you're comfortable writing SQL, you might consider using the sqldf package to do something like:
library('sqldf')
result <- sqldf('select * from codes2 left join codes1 on codes2.No between codes1.StartNo and codes1.EndNo')
You may have to remove special characters and spaces from the column names of your dataframes beforehand.
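For reference, a self-contained sketch of the data.table approach above, with the column names de-spaced to StartNo/EndNo (an assumption, since the original headers contain spaces):

```r
library(data.table)

# rebuild the example data with space-free column names (assumed)
codes1 <- data.table(
  Country = "IN",
  State   = c("Telangana", "Maharashtra", "Haryana", "Maharashtra", "Gujarat"),
  City    = c("Hyderabad", "Pune (Bund Garden)", "Gurgaon", "Pune", "Ahmedabad (Vastrapur)"),
  StartNo = c(100, 300, 500, 700, 900),
  EndNo   = c(200, 400, 600, 800, 1000)
)
codes2 <- data.table(ID = 1:7, No = c(157, 346, 389, 453, 562, 9874, 98745))

# non-equi update join: rows of codes2 whose No falls strictly inside a range
# get the matching Country/State/City from codes1; non-matching rows stay NA
codes2[codes1, on = .(No > StartNo, No < EndNo),
       `:=`(Country = i.Country, State = i.State, City = i.City)]
codes2
```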
Problem description:
From the table below, I want to remove all the rows above the quarter value 2014-Q3, i.e. rows 1 and 2.
Also note that this is a dynamic data set: when we move on to the next quarter (2016-Q3), I want to remove all rows above the quarter value 2014-Q4 automatically, through code, without any manual intervention
(and when we move to 2016-Q4, remove all rows above 2015-Q1, and so on).
I have a variable which captures the first quarter I would like to see in my final data frame (in this case 2014-Q3), and this variable will change as we progress into the future.
QTR Revenue
1 2014-Q1 456
2 2014-Q2 3113
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
.. .. ..
10 2016-Q2 232
How do I code this?
Here is a semi-automated method using which:
myFunc <- function(df, year, quarter) {
  dropper <- paste(year, paste0("Q", quarter - 1), sep = "-")
  df[-(1:which(as.character(df$QTR) == dropper)), ]
}
myFunc(df, 2014, 3)
myFunc(df, 2014, 3)
QTR Revenue
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
To subset, you can just assign output
dfNew <- myFunc(df, 2014, 3)
At this point, you can pretty easily change the year and quarter to perform a new subset.
Thanks lmo
I was going through some articles and I think we can use the dplyr package to do this in a much simpler way:
df <- df %>% slice((nrow(df) - 7):nrow(df))
which gives the result below:
>df
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
.. .. ..
10 2016-Q2 232
This acts in a dynamic way too: once more rows are added beyond 2016-Q2, the window of 8 rows to be selected is maintained by nrow().
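Since the question mentions a variable holding the first quarter to keep, the selection can also be keyed to that variable directly. Because the "YYYY-Qn" labels are fixed-width, plain string comparison sorts them chronologically. A sketch, with first_qtr standing in for the variable the question describes:

```r
library(dplyr)

# toy version of the quarterly data
df <- data.frame(
  QTR     = c("2014-Q1", "2014-Q2", "2014-Q3", "2014-Q4", "2015-Q1"),
  Revenue = c(456, 3113, 23, 173, 1670),
  stringsAsFactors = FALSE
)

first_qtr <- "2014-Q3"  # the variable the question says is maintained elsewhere

# keep every row from first_qtr onwards; fixed-width "YYYY-Qn" compares correctly as text
df_new <- df %>% filter(as.character(QTR) >= first_qtr)
```

Unlike the fixed 8-row slice, this keeps working if the number of retained quarters ever changes.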