I have a table with the following columns:

ID: identifies an imported document (think filename). This is unique for a combination of ImportId and ImportTime.
Value: a data column. In the real table there are more such columns.
ImportId: an ID in the format YYYY-MM-DD, e.g. "2022-05-14" (this is a string column)
ImportTime: the date and time the import was done (this is a string column)

The RowNum column is NOT part of the table; it is only used here to reference records/rows.
RowNum  ID  Value                             ImportId    ImportTime
1       A   Doc A content as of May 11, 2022  2022-05-11  2022-05-11 13:00
2       B   Doc B content as of May 11, 2022  2022-05-11  2022-05-11 13:00
3       A   Doc A content as of May 11, 2022  2022-05-11  2022-05-11 17:00
4       B   Doc B content as of May 11, 2022  2022-05-11  2022-05-11 17:00
5       A   Doc A content as of May 14, 2022  2022-05-14  2022-05-17 08:00
6       B   Doc B content as of May 14, 2022  2022-05-14  2022-05-17 08:00
7       A   Doc A content as of May 14, 2022  2022-05-14  2022-05-17 10:00
8       B   Doc B content as of May 14, 2022  2022-05-14  2022-05-17 10:00
9       A   Doc A content as of May 11, 2022  2022-05-11  2022-05-18 15:00
10      B   Doc B content as of May 11, 2022  2022-05-11  2022-05-18 15:00
In the table above there were three imports of the May 11 data (ImportId = "2022-05-11") and two imports of the May 14 data (ImportId = "2022-05-14").
The latest import run (ImportTime) was at 2022-05-18 15:00.
The latest ImportTime does not necessarily correlate with the latest import data. In my example above, someone ran an import on May 18 at 15:00 but imported the state of the catalog as it was on May 11 (ImportId = "2022-05-11").
Challenge:
I need to get the records with the newest ImportId (which would be "2022-05-14") and, within that ImportId, the latest ImportTime (which would be "2022-05-17 10:00").
For the example above, the result should contain the two rows with ImportId "2022-05-14" and ImportTime "2022-05-17 10:00" (row numbers 7 and 8).
What I tried:
Approach 1
I used arg_max() on ImportTime:
T
| summarize arg_max(ImportTime, *) by ID
This returns the last two rows (9 and 10), where ImportId is "2022-05-11". That's not what I'm after because the newest ImportId is "2022-05-14".
Approach 2
If I use arg_max(ImportId, *) by ID instead, I am getting the ones for "2022-05-14" (rows 5 and 6), but not the ones with the latest ImportTime.
Approach 3
I combined ImportId and ImportTime into an extended column and applied arg_max() on that. This seems to work, but I'm unsure whether it's correct in all cases.
T
| extend Combined = strcat(ImportId, ImportTime)
| summarize arg_max(Combined, *) by ID
This returns the expected rows 7 and 8 for "2022-05-14" at the import time of "2022-05-17 10:00".
Are there better options?
Check out the top-nested operator:
datatable(Value:string, ImportId:datetime, ImportTime:datetime)
[
"A", datetime(2022-05-11), datetime(2022-05-11 13:00),
"B", datetime(2022-05-11), datetime(2022-05-11 13:00),
"A", datetime(2022-05-11), datetime(2022-05-11 17:00),
"B", datetime(2022-05-11), datetime(2022-05-11 17:00),
"A", datetime(2022-05-14), datetime(2022-05-17 08:00),
"B", datetime(2022-05-14), datetime(2022-05-17 08:00),
"A", datetime(2022-05-14), datetime(2022-05-17 10:00),
"B", datetime(2022-05-14), datetime(2022-05-17 10:00),
"A", datetime(2022-05-11), datetime(2022-05-18 15:00),
"B", datetime(2022-05-11), datetime(2022-05-18 15:00)
]
| top-nested of Value by ignore=max(1),
top-nested 1 of ImportId by max(ImportId),
top-nested 1 of ImportTime by max(ImportTime)
| project Value, ImportId, ImportTime
Value  ImportId                     ImportTime
A      2022-05-14 00:00:00.0000000  2022-05-17 10:00:00.0000000
B      2022-05-14 00:00:00.0000000  2022-05-17 10:00:00.0000000
You can also try this approach, using the partition operator:
datatable(Value:string, ImportId:datetime, ImportTime:datetime)
[
"A", datetime(2022-05-11), datetime(2022-05-11 13:00),
"B", datetime(2022-05-11), datetime(2022-05-11 13:00),
"A", datetime(2022-05-11), datetime(2022-05-11 17:00),
"B", datetime(2022-05-11), datetime(2022-05-11 17:00),
"A", datetime(2022-05-14), datetime(2022-05-17 08:00),
"B", datetime(2022-05-14), datetime(2022-05-17 08:00),
"A", datetime(2022-05-14), datetime(2022-05-17 10:00),
"B", datetime(2022-05-14), datetime(2022-05-17 10:00),
"A", datetime(2022-05-11), datetime(2022-05-18 15:00),
"B", datetime(2022-05-11), datetime(2022-05-18 15:00)
]
| partition hint.strategy = native by Value
(
partition hint.strategy = native by ImportId
(
top 1 by ImportTime
)
| top 1 by ImportId
)
Related
I have something like the following, as a dataframe
ID Date Bug
3452 02/01/2020 A
3452 02/01/2020 A
6532 06/01/2020 D
8732 09/01/2020 C
3466 20/01/2020 A
3466 31/01/2020 A
What I wish to do is this: start on row 1, take that ID and compare it to the ID on the next row. Here the ID is the same, so move on to the next column, Date, and compare the first row's date with the second row's. If they are within a 14-day window, finally compare the Bug to check whether it is the same. If all conditions are met, repeat the same checks on the next row, until all consecutive rows meeting the conditions have been found. Then add a new column that classes them as one entity. What I have currently is this:
df1 <- structure(list(ID = c(3452L, 3452L, 6532L, 8732L, 3466L, 3466L),
Date = c("02/01/2020",
"02/01/2020", "06/01/2020", "09/01/2020", "20/01/2020", "31/01/2020"), Bug = c("A",
"A", "D", "C", "A", "A")), class = "data.frame", row.names = c(NA,
-6L))
library(dplyr)
library(lubridate)

df1 <- df1 %>%
  mutate(Date = dmy(Date)) %>%
  group_by(Bug) %>%
  mutate(Flag = c(FALSE, diff(Date) < 14)) %>%
  ungroup()

df1$Episode <- ifelse(df1$Flag == TRUE, "Yes", "No")
Which gives the following:
ID Date Bug Flag Episode
1 3452 2020-01-02 A FALSE No
2 3452 2020-01-02 A TRUE Yes
3 6532 2020-01-06 D FALSE No
4 8732 2020-01-09 C FALSE No
5 3466 2020-01-20 A FALSE No
6 3466 2020-01-31 A TRUE Yes
Now some of this gives the result I want, as you can see. However, when testing this framework with large datasets (which I cannot share due to confidentiality), many rows are classed incorrectly. I suspect the code is not comparing the Bug at the end. What would be a workaround for this? Additionally, the date appears to cause problems: e.g. if you change row 6's date to 01/02/2020, the algorithm no longer works. What am I missing here?
I think you want something like this?
lag() and lead() are ideal when comparing an element to the previous or next element in a vector/data frame. You can compare the Bug using Bug == lag(Bug):
library(dplyr)
library(lubridate)

df1 %>%
  mutate(Date = dmy(Date)) %>%
  mutate(Episode = ifelse(is.na(lag(ID)), "No",
                          ifelse(ID == lag(ID) & Date - lag(Date) < 14 & Bug == lag(Bug),
                                 "Yes", "No")))
ID Date Bug Episode
1 3452 2020-01-02 A No
2 3452 2020-01-02 A Yes
3 6532 2020-01-06 D No
4 8732 2020-01-09 C No
5 3466 2020-01-20 A No
6 3466 2020-02-01 A Yes
Changing the date in the last row also works fine with this code.
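If the goal is to label each whole chain of matching rows as one entity rather than flagging individual rows, a common pattern is to start a new episode whenever any condition breaks and then take a running count with cumsum(). This is a sketch, and the column names NewEpisode and EpisodeID are names I'm introducing for illustration:

```r
library(dplyr)
library(lubridate)

df1 <- structure(list(ID = c(3452L, 3452L, 6532L, 8732L, 3466L, 3466L),
                      Date = c("02/01/2020", "02/01/2020", "06/01/2020",
                               "09/01/2020", "20/01/2020", "31/01/2020"),
                      Bug = c("A", "A", "D", "C", "A", "A")),
                 class = "data.frame", row.names = c(NA, -6L))

df1 <- df1 %>%
  mutate(Date = dmy(Date)) %>%
  # a new episode starts on the first row, or whenever the ID changes,
  # the Bug changes, or the 14-day window is exceeded
  mutate(NewEpisode = is.na(lag(ID)) | ID != lag(ID) |
           Bug != lag(Bug) | as.numeric(Date - lag(Date)) >= 14,
         EpisodeID = cumsum(NewEpisode))
```

Rows sharing an EpisodeID then form one entity, which is easier to aggregate on later than a Yes/No flag.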
I have a dataset of football teams and their Win/Loss results from 2009-2017. Currently the Wins and Losses are in the same column, one after the other, and I want to create a new column for the losses.
A sample of the data looks like:
Football <- data.frame(
  Season = rep("2009", 10),
  Team = c("ARI", "ARI", "ATL", "ATL", "BAL", "BAL", "BUF", "BUF", "CAR", "CAR"),
  Value = c(10, 6, 7, 9, 7, 9, 6, 10, 8, 8)
)
I would like the final output to show:
Season Team Wins Losses
2009 ARI 10 6
2009 ATL 7 9
2009 BAL 7 9
and so on. There are also several other variables but the only one that changes for each Season/Team pair is the "Value".
I have tried several iterations of spread() and mutate() but they typically make many more columns (i.e. 2009.Wins, 2009.Losses, 2010.Wins, 2010.Losses) than I want.
Thanks for any help. I hope this post turns out alright; it's my first time posting.
Cheers, Jeremy
We create a "Winloss" column and then spread to 'wide' format:
library(tidyverse)
Football %>%
mutate(Winloss = rep(c("Win", "Loss"), n()/2)) %>%
spread(Winloss, Value)
# Season Team Loss Win
#1 2009 ARI 6 10
#2 2009 ATL 9 7
#3 2009 BAL 9 7
#4 2009 BUF 10 6
#5 2009 CAR 8 8
data
Football <- data.frame(
  Season = rep("2009", 10),
  Team = c("ARI", "ARI", "ATL", "ATL", "BAL", "BAL", "BUF", "BUF", "CAR", "CAR"),
  Value = c(10, 6, 7, 9, 7, 9, 6, 10, 8, 8)
)
Using reshape2 package
> Football$WL <- rep(c("Win", "Losses"), nrow(Football)/2)
>
> library(reshape2)
> dcast(Football, Season + Team ~ WL, value.var="Value")
Season Team Losses Win
1 2009 ARI 6 10
2 2009 ATL 9 7
3 2009 BAL 9 7
4 2009 BUF 10 6
5 2009 CAR 8 8
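Note that spread() and dcast() both order the new columns alphabetically. In current tidyr, spread() is superseded by pivot_wider(), which keeps the columns in order of first appearance, so "Wins" comes out before "Losses" as the question asked. A sketch (assuming tidyr >= 1.0; the WinLoss helper column is my own name):

```r
library(dplyr)
library(tidyr)

Football <- data.frame(
  Season = rep("2009", 10),
  Team = c("ARI", "ARI", "ATL", "ATL", "BAL", "BAL", "BUF", "BUF", "CAR", "CAR"),
  Value = c(10, 6, 7, 9, 7, 9, 6, 10, 8, 8)
)

wide <- Football %>%
  # rows alternate Wins, Losses within each Season/Team pair
  mutate(WinLoss = rep(c("Wins", "Losses"), n() / 2)) %>%
  pivot_wider(names_from = WinLoss, values_from = Value)
```

This relies on the same assumption as the other answers: the rows strictly alternate win value then loss value for each team.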
I am dealing with a data frame whose columns are company name, division name, and yearly production measures: all_production_2017, bad_production_2017, ... going back many years.
Now I am writing a function that takes a company name and a year as arguments and summarizes that company's production in that year, sorted in decreasing order of all_production for that year.
I have already converted the year to a string and filtered the required rows and columns. But how can I sort by a specific column? I do not know how to access that column name, because the year argument is only its suffix.
Here is a rough sketch of the structure of my data frame.
structure(list(company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"),
division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M", "CHANG2M"),
all_production_2000 = c(15, 25, 25, 10, 25, 18),
good_production_2000 = c(10, 24, 10, 8, 10, 10),
bad_production_2000 = c(2, 1, 2, 1, 3, 5)))
with data from 2000 to 2017
I want to write a function that, given a company name and a year, filters out the relevant company and year and sorts by all_production for that year in decreasing order.
Here is what I have done so far:
ExportCompanyYear <- function(company.name, year){
year.string <- toString(year)
x <- filter(company.data, company == company.name) %>%
select(company, division, contains(year.string))
}
I just do not know how to sort in decreasing order, because I do not know how to access the column name that contains the year argument.
You definitely need to reshape your data so that year values can be passed as a parameter.
To create a reproducible example, I have added another year 2001 in the data.
df = data.frame(
  company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"),
  division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M", "CHANG2M"),
  all_production_2000 = c(15, 25, 25, 10, 25, 18),
  good_production_2000 = c(10, 24, 10, 8, 10, 10),
  bad_production_2000 = c(2, 1, 2, 1, 3, 5),
  all_production_2001 = 2 * c(15, 25, 25, 10, 25, 18),
  good_production_2001 = 2 * c(10, 24, 10, 8, 10, 10),
  bad_production_2001 = 2 * c(2, 1, 2, 1, 3, 5)
)
Now you can reshape the data using the reshape function in R.
Here, the variables "all_production","good_production","bad_production" are varying with time, and year values are changing for those variables.
So we specify v.names = c("all_production","good_production","bad_production").
df2 = reshape(df,direction="long",
v.names = c("all_production","good_production","bad_production"),
varying = names(df)[3:8],
idvar = c("company","division"),
timevar = "year",times = c(2000,2001))
For your data.frame you can specify times=2000:2017 and varying=3:ncol(df)
> df2
                   company  division year all_production good_production bad_production
DLT.Marketing.2000     DLT Marketing 2000             15               2             10
DLT.CHANG1.2000        DLT    CHANG1 2000             25               1             24
DLT.CAHNG2.2000        DLT    CAHNG2 2000             25               2             10
MSF.MARKETING.2000     MSF MARKETING 2000             10               1              8
MSF.CHANG1M.2000       MSF   CHANG1M 2000             25               3             10
MSF.CHANG2M.2000       MSF   CHANG2M 2000             18               5             10
DLT.Marketing.2001     DLT Marketing 2001             30               4             20
DLT.CHANG1.2001        DLT    CHANG1 2001             50               2             48
DLT.CAHNG2.2001        DLT    CAHNG2 2001             50               4             20
MSF.MARKETING.2001     MSF MARKETING 2001             20               2             16
MSF.CHANG1M.2001       MSF   CHANG1M 2001             50               6             20
MSF.CHANG2M.2001       MSF   CHANG2M 2001             36              10             20
Now you can filter and sort like this:
library(dplyr)
somefunc <- function(company.name, yearval) {
  df2 %>%
    filter(company == company.name, year == yearval) %>%
    arrange(-all_production)
}
> somefunc("DLT", 2001)
  company  division year all_production good_production bad_production
1     DLT    CHANG1 2001             50               2             48
2     DLT    CAHNG2 2001             50               4             20
3     DLT Marketing 2001             30               4             20
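The same long reshape can also be done with tidyr's pivot_longer(), which matches both the measure name and the year with a single regex, so the measure/year pairing cannot get out of sync with the column order. A sketch, assuming the column names always follow the measure_YYYY pattern:

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"),
  division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M", "CHANG2M"),
  all_production_2000 = c(15, 25, 25, 10, 25, 18),
  good_production_2000 = c(10, 24, 10, 8, 10, 10),
  bad_production_2000 = c(2, 1, 2, 1, 3, 5),
  all_production_2001 = 2 * c(15, 25, 25, 10, 25, 18),
  good_production_2001 = 2 * c(10, 24, 10, 8, 10, 10),
  bad_production_2001 = 2 * c(2, 1, 2, 1, 3, 5)
)

long <- df %>%
  pivot_longer(cols = -c(company, division),
               # ".value" keeps the measure part as a column name;
               # the trailing four digits become the year column
               names_to = c(".value", "year"),
               names_pattern = "(.*)_(\\d{4})") %>%
  mutate(year = as.integer(year))

top_dlt_2001 <- long %>%
  filter(company == "DLT", year == 2001) %>%
  arrange(desc(all_production))
```

Because the regex names each piece explicitly, this avoids having to keep v.names and varying in the right relative order.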
It seems the OP has provided a very simplified sample that contains data for the year 2000 only.
A solution approach could be:
1. Convert the list to a data.frame.
2. Use gather from tidyr to arrange the data frame in a way where filter can be applied.
ll <- structure(list(company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"),
division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M",
"CHANG2M"), all_production_2000 = c(15, 25, 25, 10, 25, 18),
good_production_2000 = c(10, 24, 10, 8, 10, 10),
bad_production_2000 = c(2, 1, 2, 1, 3, 5)))
df <- as.data.frame(ll)
library(tidyr)
gather(df, key = "key", value = "value", -c("company", "division"))
#result:
# company division key value
#1 DLT Marketing all_production_2000 15
#2 DLT CHANG1 all_production_2000 25
#3 DLT CAHNG2 all_production_2000 25
#4 MSF MARKETING all_production_2000 10
#5 MSF CHANG1M all_production_2000 25
#6 MSF CHANG2M all_production_2000 18
#7 DLT Marketing good_production_2000 10
#8 DLT CHANG1 good_production_2000 24
#9 DLT CAHNG2 good_production_2000 10
#10 MSF MARKETING good_production_2000 8
#11 MSF CHANG1M good_production_2000 10
#12 MSF CHANG2M good_production_2000 10
#13 DLT Marketing bad_production_2000 2
#14 DLT CHANG1 bad_production_2000 1
#15 DLT CAHNG2 bad_production_2000 2
#16 MSF MARKETING bad_production_2000 1
#17 MSF CHANG1M bad_production_2000 3
#18 MSF CHANG2M bad_production_2000 5
Now, filter can be applied easily on the above data.frame.
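To finish the thought: once gathered, the key column can be split into a measure and a year, after which both the company and the year become ordinary filter arguments. A sketch building on the gather() output above (the measure/year column names and the renamed yearval argument are my own choices):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  company = c("DLT", "DLT", "DLT", "MSF", "MSF", "MSF"),
  division = c("Marketing", "CHANG1", "CAHNG2", "MARKETING", "CHANG1M", "CHANG2M"),
  all_production_2000 = c(15, 25, 25, 10, 25, 18),
  good_production_2000 = c(10, 24, 10, 8, 10, 10),
  bad_production_2000 = c(2, 1, 2, 1, 3, 5)
)

long <- df %>%
  gather(key = "key", value = "value", -c("company", "division")) %>%
  # split e.g. "all_production_2000" at the last "_" before the 4-digit year
  separate(key, into = c("measure", "year"), sep = "_(?=[0-9]{4}$)")

ExportCompanyYear <- function(company.name, yearval) {
  long %>%
    filter(company == company.name, year == toString(yearval),
           measure == "all_production") %>%
    arrange(desc(value))
}
```

ExportCompanyYear("DLT", 2000) then returns the DLT divisions for 2000, sorted by all_production in decreasing order, without ever constructing a column name by hand.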
I am looking at foreign powers intervening in civil wars, using RStudio. The unit of analysis in my first dataset is the conflict year, while in the second it is the conflict month. I need both of them at the conflict-year level so I can merge them.
Is there any command that allows you to do the opposite of expanding rows?
It's hard to give you specifics without a sample of your data so we know what the structure is. I'm assuming your month-level dataset stores the month as a character string that includes a year. You should be able to extract the year with separate from the tidyr package:
library(tidyverse)
month <- c("June 2015", "July 2015", "September 2016", "August 2016", "March 2014")
conflict <- c("A", "B", "C", "D", "E")
my.data <- data.frame(month, conflict)
my.data
month conflict
1 June 2015 A
2 July 2015 B
3 September 2016 C
4 August 2016 D
5 March 2014 E
my.data <- my.data %>%
separate(month, c("month", "year"), sep = " ")
> my.data
month year conflict
1 June 2015 A
2 July 2015 B
3 September 2016 C
4 August 2016 D
5 March 2014 E
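Once the year is extracted, actually collapsing the month rows down to one row per year (the "opposite of expanding rows") is a group_by/summarise step. A sketch, assuming each year should keep a month count and the list of conflicts seen; the n_months and conflicts column names are my own:

```r
library(dplyr)
library(tidyr)

month <- c("June 2015", "July 2015", "September 2016", "August 2016", "March 2014")
conflict <- c("A", "B", "C", "D", "E")
my.data <- data.frame(month, conflict)

yearly <- my.data %>%
  separate(month, c("month", "year"), sep = " ") %>%
  group_by(year) %>%
  # one row per year: how many conflict months, and which conflicts
  summarise(n_months = n(), conflicts = toString(conflict), .groups = "drop")
```

What goes into summarise() depends on the real variables: counts, sums, or first/last values per year, whichever the merge needs.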
If I had a dataframe with a string of comma-separated numbers in a column, how could I convert that string into an ordered set of unique values in another column?
Month String_of_Nums Converted
May 3,3,2 2,3
June 3,3,3,1 1,3
Sept 3,3,3, 3 3
Oct 3,3,3, 4 3,4
Jan 3,3,4 3,4
Nov 3,3,5,5 3,5
I tried splitting up the string of numbers to get unique to work
strsplit(df$String_of_Nums,",")
but I end up with spaces in the character list. Any ideas how to efficiently generate a Converted column? I also need to figure out how to operate on all elements of the column.
Try:
df1 <- read.table(text="Month String_of_Nums
May '3,3,2'
June '3,3,3,1'
Sept '3,3,3,3'
Oct '3,3,3,4'
Jan '3,3,4'
Nov '3,3,5,5'", header = TRUE)
df1$converted <- apply(read.csv(text=as.character(df1$String_of_Nums), header = FALSE), 1,
function(x) paste(sort(unique(x)), collapse = ","))
df1
Month String_of_Nums converted
1 May 3,3,2 2,3
2 June 3,3,3,1 1,3
3 Sept 3,3,3,3 3
4  Oct  3,3,3,4  3,4
5 Jan 3,3,4 3,4
6 Nov 3,3,5,5 3,5
I'd like to offer another way. As far as I can see, Jay's example has String_of_Nums as a factor. Given that you said strsplit() worked, I am assuming you have String_of_Nums as character; here I have the column as character as well. First, split each string (strsplit), find the unique characters (unique), sort them (sort), and paste them back together (toString). At this point you have a list, which you can convert to a vector using as_vector from the purrr package. Out of interest, I used microbenchmark to compare how the two approaches perform when creating the Converted column.
library(magrittr)
library(purrr)
lapply(strsplit(mydf$String_of_Nums, split = ","),
function(x) toString(sort(unique(x)))) %>%
as_vector(.type = "character") -> mydf$out
# Month String_of_Nums out
#1 May 3,3,2 2, 3
#2 June 3,3,3,1 1, 3
#3 Sept 3,3,3,3 3
#4 Oct 3,3,3,4 3, 4
#5 Jan 3,3,4 3, 4
#6 Nov 3,3,5,5 3, 5
library(microbenchmark)
microbenchmark(
jazz = lapply(strsplit(mydf$String_of_Nums, split = ","),
function(x) toString(sort(unique(x)))) %>%
as_vector(.type = "character"),
jay = apply(read.csv(text=as.character(df1$String_of_Nums), header = FALSE), 1,
function(x) paste(sort(unique(x)), collapse = ",")),
times = 10000)
# expr min lq mean median uq max neval
# jazz 358.913 393.018 431.7382 405.9395 420.1735 54779.29 10000
# jay 1099.587 1151.244 1233.5631 1167.0920 1191.5610 56871.45 10000
DATA
Month String_of_Nums
1 May 3,3,2
2 June 3,3,3,1
3 Sept 3,3,3,3
4 Oct 3,3,3,4
5 Jan 3,3,4
6 Nov 3,3,5,5
mydf <- structure(list(Month = c("May", "June", "Sept", "Oct", "Jan",
"Nov"), String_of_Nums = c("3,3,2", "3,3,3,1", "3,3,3,3", "3,3,3,4",
"3,3,4", "3,3,5,5")), .Names = c("Month", "String_of_Nums"), row.names = c(NA,
-6L), class = "data.frame")
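For completeness, a base-R sketch of the same idea with no purrr dependency: vapply() enforces a character result per element, and trimws() handles the stray spaces the question mentions. The Converted column name is taken from the question; note sort() here is lexicographic, which is fine for single digits but would need as.numeric() first for multi-digit values.

```r
# the question's data, including the stray spaces the OP ran into
mydf <- data.frame(
  Month = c("May", "June", "Sept", "Oct", "Jan", "Nov"),
  String_of_Nums = c("3,3,2", "3,3,3,1", "3,3,3, 3", "3,3,3, 4", "3,3,4", "3,3,5,5"),
  stringsAsFactors = FALSE
)

# split, trim the stray spaces, de-duplicate, sort, and re-join
mydf$Converted <- vapply(
  strsplit(mydf$String_of_Nums, ","),
  function(x) paste(sort(unique(trimws(x))), collapse = ","),
  character(1)
)
```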