Sample data. We are expected to receive products every day, together with their total available counts.
TimeStamp              Product Desc  Product Code  Available count
2022-01-02T09:00:00Z   Berries       111           10
2022-01-02T09:00:00Z   Chocolate     222           20
2022-01-02T09:00:00Z   Mayo          333           30
2022-01-03T09:00:00Z   Berries       111           15
2022-01-03T09:00:00Z   Chocolate     222           22
2022-01-04T09:00:00Z   Berries       111           30
If no product is received on a particular day, I have to carry forward that product's last received record and show it as the current day's record. The expected output is:
TimeStamp              Product Desc  Product Code  Available count
2022-01-03T09:00:00Z   Berries       111           15
2022-01-03T09:00:00Z   Chocolate     222           22
2022-01-03T09:00:00Z   Mayo          333           30
2022-01-04T09:00:00Z   Berries       111           30
2022-01-04T09:00:00Z   Chocolate     222           22
2022-01-04T09:00:00Z   Mayo          333           30
datatable (['TimeStamp']:datetime,['Product Desc']:string,['Product Code']:int,['Available count']:int)
[
'2022-01-02T09:00:00Z' ,'Berries' ,111 ,10
,'2022-01-02T09:00:00Z' ,'Chocolate' ,222 ,20
,'2022-01-02T09:00:00Z' ,'Mayo' ,333 ,30
,'2022-01-03T09:00:00Z' ,'Berries' ,111 ,15
,'2022-01-03T09:00:00Z' ,'Chocolate' ,222 ,22
,'2022-01-04T09:00:00Z' ,'Berries' ,111 ,30
]
| summarize arg_max(['TimeStamp'], *) by ['Product Code']
Product Code   TimeStamp              Product Desc   Available count
333            2022-01-02T09:00:00Z   Mayo           30
222            2022-01-03T09:00:00Z   Chocolate      22
111            2022-01-04T09:00:00Z   Berries        30
Related
I have a dataframe containing location data of different animals. Each animal has a unique id and each observation has a time stamp and some further metrics of the location observation. See a subset of the data below. The subset contains the first two observations of each id.
> sub
id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
2 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
3 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
4 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
5 333 B -80.8211 24.8441 11625 6980 37 2018-12-17 20:45:05
6 333 3 -80.8137 24.8263 155 100 69 2018-12-17 21:00:43
7 444 3 -80.4535 25.0848 501 33 104 2019-10-20 19:44:16
8 444 1 -80.8086 24.8364 6356 126 87 2020-01-18 20:32:28
9 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17
10 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35
11 666 2 -77.7221 24.4902 1129 75 66 2020-07-12 21:09:02
12 666 2 -77.7097 24.4905 314 248 164 2020-07-12 21:11:37
13 777 3 -77.7133 24.4820 406 58 110 2020-06-20 11:18:18
14 777 3 -77.7218 24.4844 170 93 107 2020-06-20 11:51:06
15 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
16 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
However, I need to do some data housekeeping: I need to include the day/time and location at which each animal was released, and then filter out, for each animal, the observations that occurred before that animal's release.
I have an additional dataframe that contains the necessary release metadata:
> stack
id release lat lon
1 888 2017-11-27 14:53 25.69201 -79.31534
2 333 2019-01-31 16:09 25.68896 -79.31326
3 222 2019-02-02 15:55 25.70051 -79.31393
4 111 2019-04-02 10:43 25.68534 -79.31341
5 444 2020-03-13 15:04 24.42892 -77.69518
6 666 2020-10-27 09:40 24.58290 -77.69561
7 555 2020-01-21 14:38 24.43333 -77.69637
8 777 2020-06-25 08:54 24.42712 -77.76427
So my question is: how can I add the release information (time and lat/lon) to the dataframe for each id (while the columns a, b, and c can be NA)? And how can I then filter out the observations that occurred before each animal's release time? I have been looking into possibilities using dplyr but have not yet been able to resolve my issue.
You've not provided an easy way of obtaining your data (dput() is by far the best), and there are issues with your date-time values (release uses Y-M-D H:M without seconds, whereas date uses Y-M-D H:M:S), so for clarity I've included code to obtain the data frames I use at the end of this post.
First, the solution:
library(tidyverse)
library(lubridate)
sub %>%
  left_join(stack, by = "id") %>%
  mutate(
    # pad release to H:M:S so ymd_hms() can parse both columns
    release = ymd_hms(paste0(release, ":00")),
    date    = ymd_hms(date)
  ) %>%
  filter(date >= release)   # keep only observations on or after release
id lc lon.x lat.x a b c date release lat.y lon.y
1 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17 2020-01-21 14:38:00 24.43333 -77.69637
2 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35 2020-01-21 14:38:00 24.43333 -77.69637
As I indicated in comments.
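If the suffixed lat.x/lat.y and lon.x/lon.y columns produced by the join are a nuisance, one optional variation (just a sketch using the same libraries; the release_lat/release_lon names are my choice, not from the question) is to rename the release coordinates before joining:

sub %>%
  left_join(rename(stack, release_lat = lat, release_lon = lon), by = "id") %>%
  mutate(release = ymd_hms(paste0(release, ":00")),
         date    = ymd_hms(date)) %>%
  filter(date >= release)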
To obtain the data
sub <- read.table(textConnection("id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
2 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
3 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
4 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
5 333 B -80.8211 24.8441 11625 6980 37 '2018-12-17 20:45:05'
6 333 3 -80.8137 24.8263 155 100 69 '2018-12-17 21:00:43'
7 444 3 -80.4535 25.0848 501 33 104 '2019-10-20 19:44:16'
8 444 1 -80.8086 24.8364 6356 126 87 '2020-01-18 20:32:28'
9 555 3 -77.7211 24.4887 665 45 68 '2020-07-12 21:09:17'
10 555 3 -77.7163 24.4897 285 129 130 '2020-07-12 21:10:35'
11 666 2 -77.7221 24.4902 1129 75 66 '2020-07-12 21:09:02'
12 666 2 -77.7097 24.4905 314 248 164 '2020-07-12 21:11:37'
13 777 3 -77.7133 24.4820 406 58 110 '2020-06-20 11:18:18'
14 777 3 -77.7218 24.4844 170 93 107 '2020-06-20 11:51:06'
15 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'
16 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'"), header=TRUE)
stack <- read.table(textConnection("id release lat lon
1 888 '2017-11-27 14:53' 25.69201 -79.31534
2 333 '2019-01-31 16:09' 25.68896 -79.31326
3 222 '2019-02-02 15:55' 25.70051 -79.31393
4 111 '2019-04-02 10:43' 25.68534 -79.31341
5 444 '2020-03-13 15:04' 24.42892 -77.69518
6 666 '2020-10-27 09:40' 24.58290 -77.69561
7 555 '2020-01-21 14:38' 24.43333 -77.69637
8 777 '2020-06-25 08:54' 24.42712 -77.76427"), header=TRUE)
I'm trying to merge two datasets that have different structures. The first dataset contains indicators on a country level, per country and age class and looks like this:
country age_class nb_birth nb_singleton nb_twin prob_twin nb_under_5_dead nb_under_5_dead_singleton
1 AO7 15-19 28 28 0 0.000000000 2 2
2 AO7 20-24 1133 1123 10 0.008826125 107 101
3 AO7 25-29 3338 3256 82 0.024565608 327 302
4 AO7 30-34 4152 4059 93 0.022398844 425 402
5 AO7 35-39 4934 4784 150 0.030401297 545 509
6 AO7 40-44 4840 4647 193 0.039876033 726 660
The second datasets contains indicators on an individual level, per mother and looks like this:
caseid weight country age age_class region region_type education wealth parity
1 00010001 02 1.086089 AO7 38 35-39 benguela rural no education poorest 7
2 00010002 02 1.086089 AO7 40 40-44 benguela rural no education poorest 6
3 00010002 03 1.086089 AO7 16 15-19 benguela rural no education poorest 1
4 00010003 02 1.086089 AO7 43 40-44 benguela rural primary poorest 8
5 00010004 02 1.086089 AO7 25 25-29 benguela rural no education poorest 6
6 00010006 01 1.086089 AO7 26 25-29 benguela rural no education poorest 4
As you can see, the variables "country" and "age class" are shared. What I'm trying to do is assign indicators from the country dataset to every row in the mother dataset. I would like to end up with something like this:
caseid weight country age age_class [...] nb_births nb_singleton nb_twin prob_twin
1 00010001 02 1.086089 AO7 38 35-39 [...] 4934 4784 150 0.0304
2 00010002 02 1.086089 AO7 40 40-44 [...] 4840 4647 193 0.0399
3 00010002 03 1.086089 AO7 16 15-19 [...] 28 28 0 0.0000
In the end, all mothers from the same country and age class will have the same values for variables from the country dataset.
I am working with the dplyr package and have played with the join functions and mutate function. But I can't figure it out.
Do you have any idea how to solve this issue?
If I understand your question correctly, the only thing you need to do is a dplyr::left_join. df1 is the country dataset and df2 is the mothers dataset.
left_join(df2, df1)
## Joining, by = c("country", "age_class")
## caseid weight country age age_class [...] nb_singleton nb_twin prob_twin nb_under_5_dead nb_under_5_dead_singleton
## 1 00010001 02 1.086089 AO7 38 35-39 [...] 4784 150 0.03040130 545 509
## 2 00010002 02 1.086089 AO7 40 40-44 [...] 4647 193 0.03987603 726 660
## 3 00010002 03 1.086089 AO7 16 15-19 [...] 28 0 0.00000000 2 2
## 4 00010003 02 1.086089 AO7 43 40-44 [...] 4647 193 0.03987603 726 660
## 5 00010004 02 1.086089 AO7 25 25-29 [...] 3256 82 0.02456561 327 302
## 6 00010006 01 1.086089 AO7 26 25-29 [...] 3256 82 0.02456561 327 302
dplyr::left_join keeps all the rows from the first data.frame and adds matching rows from the second data.frame.
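Relying on the automatic detection of the shared columns works here; if you prefer to be explicit about the join keys, the same call can be spelled out as:

left_join(df2, df1, by = c("country", "age_class"))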
I have a table with rows that have similar or identical names, and I need to sum those rows into a single record. How can I do this in R? I tried a for loop, but it doesn't work.
for (i in length(df1$variante)) {
+ if (df1$variante == "INTERNACIONAL" | df1$variante == " INTERNAC" ){
+ rbind(df1$variante)
+ }
+
+
+ }
Warning message:
In if (df1$variante == "INTERNACIONAL" | df1$variante == " INTERNAC") { :
the condition has length > 1 and only the first element will be used
df_variante
x freq
1 259
2 INTERNACIONAL 844
3 GOLD 2164
4 GOLD EXECUTIVOS 2
5 GRAFITE 109
6 GRAFITE INTERNACIONA 231
7 INFINITE 546
8 INFINITE EXECUTIVOS 2
9 INTERNACIONAL 4660
10 NACIONAL 8390
11 NANQUIM 12
12 NANQUIM INTERNACIONA 57
13 PLATINUM 1407
14 AZUL 9
15 AZUL CARD 112
16 BLACK 775
17 GOLD 2872
18 IN INTERNAC 1
19 IN NACIONAL 7
20 INTERNACIONAL 6678
21 MC ELECTRONIC GOV BAHIA 5
22 NACIONAL 9692
23 PLATINUM 1383
24 TIGRE NACIONAL 207
25 TURISMO GOLD 337
26 TURISMO INTERNACION 528
27 TURISMO NACIONAL 841
28 TURISMO PLATINUM 90
29 TURISMO GOLD 322
30 TURISMO INTER 531
31 AMAZONIA GOLD 5
32 AMAZONIA INTERNACIONAL 14
33 AMAZONIA NACIONAL 19
34 EPIDEMIA CORINTHIANA PLATINUM 4
35 JCB UNICO 203
36 UNIVERSITARIO INTERNACION 92
37 UNIVERSITARIO INTER 262
If you want to sum freq over the rows where x contains, for example, 'INTER', you could maybe do:
sum(df_variante$freq[grep(df_variante$x,pattern='INTER')])
Here you are just using grep to search the x column for the pattern 'INTER'
I noticed there are some rows with something like 'IN NACIONAL'.
In this case you could do:
idxs=unique(c(grep(df_variante$x,pattern='INTER'),grep(df_variante$x,pattern='NACIONAL'),....)) #do this for all patterns of interest
df_sub=df_variante[idxs,]
BTW: the for loop in your question is not working because length(df1$variante) is a single number, so the loop runs only once; you want to loop through the rows:
for(i in 1:nrow(df_variante)){....}
But that is only if you still want to do it that way
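If the goal is really to collapse the similar spellings into a single record, one possible sketch (the grouping below is my assumption about which labels belong together; adjust the patterns to whatever is right for your data) is to normalise the label first and then sum freq per normalised label:

v   <- trimws(df_variante$x)                                   # drop stray leading/trailing spaces
grp <- ifelse(grepl("INTERNAC", v), "INTERNACIONAL",           # fold all INTERNAC* spellings together
        ifelse(grepl("NACIONAL", v), "NACIONAL", v))           # then the remaining NACIONAL spellings
aggregate(freq ~ grp, data = data.frame(grp, freq = df_variante$freq), FUN = sum)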
I have the following list, "dates", whose names are dates and whose values are numbers of hours:
$`2012-08-01 20:05:37`
214 216
$`2012-08-01 22:05:32`
211
I have a data frame, "data", in which each row contains two dates ("hour" and "time_point") and the difference in hours between the two ("diff"). I need to cycle through each element of "dates" and find the corresponding "time_point" in "data" associated with adding that element's hour values to its date. For example, the first value associated with the first element of "dates" should match the "time_point" "2012-08-10 17:53:16", because adding 214 hours to that date gives that time point. The output can be a list or data frame of the dates. Any idea how to do this?
diff hour time_point ID
70 214 2012-08-01 20:05:37 2012-08-10 17:53:16 18
71 215 2012-08-01 20:05:37 2012-08-10 18:53:21 18
72 216 2012-08-01 20:05:37 2012-08-10 19:53:16 18
73 217 2012-08-01 20:05:37 2012-08-10 20:53:21 18
74 218 2012-08-01 20:05:37 2012-08-10 21:54:51 18
75 219 2012-08-01 20:05:37 2012-08-10 22:53:31 18
218 206 2012-08-02 02:05:12 2012-08-10 15:53:16 24
316 200 2012-08-02 06:50:17 2012-08-10 14:53:16 28
490 53 2012-08-02 22:49:52 2012-08-05 03:50:18 44
491 54 2012-08-02 22:49:52 2012-08-05 04:50:48 44
If I'm understanding you correctly you're looking for this?
ll <- list("date1" = c(214, 216), "date2" = 211)         # hour offsets for each start date
f  <- function(x) d[d[, "diff"] %in% x, "time_point"]    # time_points whose diff matches the offsets
lapply(ll, f)
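A possible refinement (an untested sketch, assuming the names of the original list "dates" are the same strings as the hour column): if the same diff value could occur under different start dates, you can match on the start date as well as on the offsets:

g <- function(start, offsets) d[d$hour == start & d$diff %in% offsets, "time_point"]
Map(g, names(dates), dates)   # 'dates' is the asker's named list of offsets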
Data
d <- "diff hour time_point ID
214 '2012-08-01 20:05:37' '2012-08-10 17:53:16' 18
215 '2012-08-01 20:05:37' '2012-08-10 18:53:21' 18
216 '2012-08-01 20:05:37' '2012-08-10 19:53:16' 18
217 '2012-08-01 20:05:37' '2012-08-10 20:53:21' 18
218 '2012-08-01 20:05:37' '2012-08-10 21:54:51' 18
219 '2012-08-01 20:05:37' '2012-08-10 22:53:31' 18
206 '2012-08-02 02:05:12' '2012-08-10 15:53:16' 24
200 '2012-08-02 06:50:17' '2012-08-10 14:53:16' 28
53 '2012-08-02 22:49:52' '2012-08-05 03:50:18' 44
54 '2012-08-02 22:49:52' '2012-08-05 04:50:48' 44"
d <- read.table(text = d,header = T)
I have a huge dataset and I have to compute "Monthly Child Cost %" and "Monthly Parent Cost %". I am new to R and have tried my best, but not much luck. Please help.
In my original dataset, I have Parent/Child/Item/Month/Cost data. I have to compute 2 new columns...
Monthly child cost% = 100/(total Items cost in that particular month for that child) * Item cost
Example for 1st row: 100/100 * 70 = 70
Monthly Parent cost % = 100/(total cost of all Items for that Parent) * (total cost of that Item for that Parent)
Example for first row: 100/345 * 215 (Total Milk cost for that parent) = 62.3
Please note: it is OK to have duplicates in Monthly_Parent_Cost%; I can take distinct values by Parent and Item afterwards.
Parent Child Item Month Cost Monthly_Child_Cost% Monthly_Parent_Cost%
1001 22 Milk Jan 70 70 62.32
1001 22 Bread Jan 20 20 31.88
1001 22 Eggs Jan 10 10 5.8
1001 22 Milk Feb 60 60 62.32
1001 22 Bread Feb 40 40 31.88
1001 11 Milk Mar 40 40 62.32
1001 11 Bread Mar 50 50 31.88
1001 11 Eggs Mar 10 10 5.8
1001 11 Milk Apr 45 100 62.32
1002 44 Milk Jan 20 20 40.3
1002 44 Bread Jan 40 40 33.2
1002 44 Eggs Jan 40 40 26.3
1002 44 Milk Feb 34 34 40.3
1002 44 Bread Feb 66 66 33.2
1002 55 Milk Mar 20 20 40.3
1002 55 Bread Mar 20 20 33.2
1002 55 Eggs Mar 60 60 26.3
1002 55 Milk Apr 79 100 40.3
You can use the aggregate() function to aggregate the cost values by Child + Month + Item, and also by Parent + Month + Item. After this, you can merge the results back onto the original data and add the resulting vector as a new column.
# Aggregate
childCosts <- aggregate(x = ds$Cost, by=list(ds$Child, ds$Month, ds$Item), FUN=sum)
# modify column names for easy merge
colnames(childCosts) <- c("Child", "Month", "Item", "Monthly_child_total")
ds2 <- merge(ds, childCosts)
# Compute desired result
ds2$Monthly_Child_Cost_Pct <- ds2$Cost*100/(ds2$Monthly_child_total)
P.S. My formulae might not be correct, as I am unclear about exactly what you want to aggregate for the two columns. Adjust the code accordingly.
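For what it's worth, here is an alternative sketch with dplyr that should reproduce the percentages shown in the question, under the assumption (inferred from the example values, not stated explicitly) that the child percentage is taken over the child's total cost in that month, and the parent percentage is the parent's total cost for that item across all months over the parent's overall total:

library(dplyr)

ds %>%
  group_by(Child, Month) %>%
  mutate(Monthly_Child_Cost_Pct = 100 * Cost / sum(Cost)) %>%           # e.g. 100 * 70 / 100 = 70
  group_by(Parent) %>%
  mutate(parent_total = sum(Cost)) %>%                                  # e.g. 345 for parent 1001
  group_by(Parent, Item) %>%
  mutate(Monthly_Parent_Cost_Pct = 100 * sum(Cost) / parent_total) %>%  # e.g. 100 * 215 / 345 = 62.3
  ungroup() %>%
  select(-parent_total)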