Lag in R dataframe - r

I have the following sample dataset (below and/or as CSVs here: http://goo.gl/wK57T) which I want to transform as follows. For each person in a household I want to create two new variables, OrigTAZ and DestTAZ. DestTAZ should take the value in TripendTAZ. OrigTAZ should take the value of TripendTAZ from the previous row. For the first trip of every person in a household (Tripid = 1), OrigTAZ = hometaz. From the second trip onward, OrigTAZ = TripendTAZ_(n-1) and DestTAZ = TripendTAZ. The sample input and output data are shown below. I tried the suggestions shown here: Basic lag in R vector/dataframe, but have not had any luck. I am used to doing something like this in SAS.
Any help is appreciated.
TIA,
Krishnan
SAS Code Sample
if Houseid = lag(Houseid) then do;
if Personid = lag(Personid) then do;
DestTAZ = TripendTAZ;
if Tripid = 1 then OrigTAZ = hometaz;
else
OrigTAZ = lag(TripendTAZ);
end;
end;
INPUT DATA
Houseid,Personid,Tripid,hometaz,TripendTAZ
1,1,1,45,4
1,1,2,45,7
1,1,3,45,87
1,1,4,45,34
1,1,5,45,45
2,1,1,8,96
2,1,2,8,4
2,1,3,8,2
2,1,4,8,1
2,1,5,8,8
2,2,1,8,58
2,2,2,8,67
2,2,3,8,9
2,2,4,8,10
2,2,5,8,8
3,1,1,7,89
3,1,2,7,35
3,1,3,7,32
3,1,4,7,56
3,1,5,7,7
OUTPUT DATA
Houseid,Personid,Tripid,hometaz,TripendTAZ,OrigTAZ,DestTAZ
1,1,1,45,4,45,4
1,1,2,45,7,4,7
1,1,3,45,87,7,87
1,1,4,45,34,87,34
1,1,5,45,45,34,45
2,1,1,8,96,8,96
2,1,2,8,4,96,4
2,1,3,8,2,4,2
2,1,4,8,1,2,1
2,1,5,8,8,1,8
2,2,1,8,58,8,58
2,2,2,8,67,58,67
2,2,3,8,9,67,9
2,2,4,8,10,9,10
2,2,5,8,8,10,8
3,1,1,7,89,7,89
3,1,2,7,35,89,35
3,1,3,7,32,35,32
3,1,4,7,56,32,56
3,1,5,7,7,56,7

Work through the steps you outlined one at a time and it isn't so bad.
First I'll read in your data by copying it:
df <- read.csv(file('clipboard'))
Then I'll sort to make sure the data frame is ordered by houseid, then personid, then tripid:
# first sort so that it's ordered by Houseid, then Personid, then Tripid:
df <- with(df, df[order(Houseid,Personid,Tripid),])
Then follow the steps you specified:
# take value in TripendTAZ and put it in DestTAZ
df$DestTAZ <- df$TripendTAZ
# Set OrigTAZ = value from previous row
df$OrigTAZ <- c(NA,df$TripendTAZ[-nrow(df)])
# For the first trip of every person in a household (Tripid = 1),
# OrigTAZ = hometaz.
df$OrigTAZ[ df$Tripid==1 ] <- df$hometaz[ df$Tripid==1 ]
You'll notice that df is then what you're after.
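One caveat: the shifted vector above crosses person boundaries, and only the Tripid == 1 overwrite repairs those rows. If some person might lack a trip with Tripid = 1, a lag computed within each household/person group is safer. Here is a minimal base R sketch of that variation, assuming the same column names as the question; the grouping key grp and the small inline sample are only illustrative.

```r
# Small inline sample with the same columns as the question's data
df <- read.csv(text = "Houseid,Personid,Tripid,hometaz,TripendTAZ
1,1,1,45,4
1,1,2,45,7
1,1,3,45,87
2,1,1,8,96
2,1,2,8,4
2,2,1,8,58
2,2,2,8,67")

# Order by household, person, trip so that "previous row" is meaningful
df <- df[order(df$Houseid, df$Personid, df$Tripid), ]

df$DestTAZ <- df$TripendTAZ

# One key per household/person pair (hypothetical helper column)
grp <- paste(df$Houseid, df$Personid)

# Lag TripendTAZ within each group; the first trip of each group gets NA
df$OrigTAZ <- ave(df$TripendTAZ, grp,
                  FUN = function(x) c(NA, head(x, -1)))

# Fill the first trip of each person with hometaz
first <- is.na(df$OrigTAZ)
df$OrigTAZ[first] <- df$hometaz[first]

print(df)
```

Because the lag never looks outside its own group, the result does not depend on every person's first row being flagged with Tripid = 1.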

Related

Dataframe with all information in one row. How can I split them?

I have a dataframe where all the information is in one row; see the picture below.
untidy data
I need to change it to something like this:
tidy data
So the first value in the row (suffix_name) should become a variable name, and the second value (none) should become the first value of that new variable (suffix_name).
Please see the images.
With the following code you can split the information in the row.
You need the data.table library to subset the vars and vals columns.
library(data.table)
data_raw.dt <- data.table(
V2_00011 = "'SUFFIX_NAME'",
V2_00012 = 'NONE}}',
V2_00013 = "'PATIENT_ID'",
V2_00014 = "'CZMIl1844982497'",
V2_00015 = "'BIRTH_DATE'",
V2_00016 = "'1987-01-01'",
V2_00017 = "'GENDER'",
V2_00018 = "'Unknown'",
V2_00019 = "'OBSCURITY_LEVEL'",
V2_00020 = "'Normal'")
vars <- seq(1, ncol(data_raw.dt), by = 2)
vals <- seq(2, ncol(data_raw.dt), by = 2)
data_ref.dt <- data.table(matrix(data_raw.dt[, ..vals], ncol = length(vals)))
names(data_ref.dt) <- paste(data_raw.dt[, ..vars])
Here you can see the results.
print(data_raw.dt)
V2_00011 V2_00012 V2_00013 V2_00014 V2_00015 V2_00016 V2_00017 V2_00018 V2_00019 V2_00020
'SUFFIX_NAME' NONE}} 'PATIENT_ID' 'CZMIl1844982497' 'BIRTH_DATE' '1987-01-01' 'GENDER' 'Unknown' 'OBSCURITY_LEVEL' 'Normal'
print(data_ref.dt)
'SUFFIX_NAME' 'PATIENT_ID' 'BIRTH_DATE' 'GENDER' 'OBSCURITY_LEVEL'
NONE}} 'CZMIl1844982497' '1987-01-01' 'Unknown' 'Normal'
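The same name/value pairing can also be sketched in base R, which makes the odd/even logic explicit and strips the leftover quote and brace characters along the way. This is only a sketch under the assumption that names sit in the odd positions and values in the even ones; the column names V1..V6 and the object names raw, cells, clean and tidy are made up for the example.

```r
# One-row data frame with alternating name / value cells (shortened sample)
raw <- data.frame(
  V1 = "'SUFFIX_NAME'", V2 = "NONE}}",
  V3 = "'PATIENT_ID'",  V4 = "'CZMIl1844982497'",
  V5 = "'BIRTH_DATE'",  V6 = "'1987-01-01'",
  stringsAsFactors = FALSE
)

# Flatten the row to a character vector and drop stray quotes and braces
cells <- unlist(raw[1, ], use.names = FALSE)
clean <- gsub("['}]", "", cells)

# Odd positions hold variable names, even positions hold their values
is_name <- seq_along(clean) %% 2 == 1
tidy <- as.data.frame(as.list(setNames(clean[!is_name], clean[is_name])),
                      stringsAsFactors = FALSE)

print(tidy)
```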

Create and combine 2 grid.tables

I have created a grid.table object to display a dataframe in Power BI. Here is my code:
dataset <- data.frame(BDS_ID = c("001","002"),
PRIORITY = c("high","medium"),
STATUS = c("onair","onair"),
COMPANY = c("airfr","fly"))
my.result <- melt(dataset, id = c("BDS_ID"))
mytheme <- ttheme_default(base_size = 10,
core=list(fg_params=list(hjust=0, x=0.01),
bg_params=list(fill=c("white", "grey90"))))
for (i in 1:nrow(tg)) {
tg$grobs[[i]] <- editGrob(tg$grobs[[i]], gp=gpar(fontface="bold"))
}
grid.draw(tg)
and this is my output:
I would like to improve my output in the following way: the row headers should be unique, with a separate column for each value of each variable, and the column of row headers repeated before each value column.
I tried to do this using t(dataset), but I do not get the desired result because the row headers are not repeated.
I would like to get an output (still of class grob) similar to this:
**PRIORITY** high **PRIORITY** medium
**STATUS** onair **STATUS** onair
**COMPANY** airfr **COMPANY** fly
Does anyone know how to achieve this?
Thanks
I'm unable to reproduce the grob format you've shown based on the code you've provided, but I've got something similar:
library(grid)
library(gridExtra)
dataset <- data.frame(BDS_ID = c("001","002"),
PRIORITY = c("high","medium"),
STATUS = c("onair","onair"),
COMPANY = c("airfr","fly"))
# transpose so that each variable becomes a row
dataset <- data.frame(t(dataset))
# repeat the row headers once per value column
dataset$label1 <- rownames(dataset)
dataset$label2 <- rownames(dataset)
colnames(dataset) <- c("status1", "status2", "label1", "label2")
# drop the BDS_ID row and interleave label and value columns
dataset <- dataset[c(2:nrow(dataset)), c(3, 1, 4, 2)]
rownames(dataset) <- NULL
grid.draw(tableGrob(dataset))
The above code produces the following object. It doesn't look exactly like yours, but it's in the general structure you're looking for:
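If the data has more than two rows, the label/value interleaving can be built programmatically instead of hard-coding the column order. The following is a rough base R sketch of that idea; the names vars, pairs and wide are illustrative, and the final drawing step is left as a comment because it needs gridExtra.

```r
dataset <- data.frame(BDS_ID = c("001", "002"),
                      PRIORITY = c("high", "medium"),
                      STATUS = c("onair", "onair"),
                      COMPANY = c("airfr", "fly"),
                      stringsAsFactors = FALSE)

# Every column except the id becomes a row header
vars <- setdiff(names(dataset), "BDS_ID")

# One (label, value) pair of columns per row of the original data
pairs <- lapply(seq_len(nrow(dataset)), function(i) {
  data.frame(label = vars,
             value = unlist(dataset[i, vars], use.names = FALSE),
             stringsAsFactors = FALSE)
})

# Bind the pairs side by side and give the columns unique names
wide <- do.call(cbind, pairs)
names(wide) <- paste0(c("label", "value"), rep(seq_along(pairs), each = 2))

print(wide)
# To draw it (requires grid and gridExtra):
# grid::grid.draw(gridExtra::tableGrob(wide, rows = NULL))
```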

R & xml2: Locate elements by specific text value, store all children values in data.frame

I work with regularly refreshed XML reports and I would like to automate the munging process using R & xml2.
Here's a link to an entire example file.
Here's a sample of the XML:
<?xml version="1.0" ?>
<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
<includedFileHeader>
<outboundFileIdentifier>f2e55625-e70e-4f9d-8278-fc5de7c04d47</outboundFileIdentifier>
<cmsBatchIdentifier>RIP-2015-00096</cmsBatchIdentifier>
<cmsJobIdentifier>16220</cmsJobIdentifier>
<snapShotFileName>25032.BACKUP.D03152016T032051.dat</snapShotFileName>
<snapShotFileHash>20d887c9a71fa920dbb91edc3d171eb64a784dd6</snapShotFileHash>
<outboundFileGenerationDateTime>2016-03-15T15:20:54</outboundFileGenerationDateTime>
<interfaceControlReleaseNumber>04.03.01</interfaceControlReleaseNumber>
<edgeServerVersion>EDGEServer_14.09_01_b0186</edgeServerVersion>
<edgeServerProcessIdentifier>8</edgeServerProcessIdentifier>
<outboundFileTypeCode>RIDE</outboundFileTypeCode>
<edgeServerIdentifier>2800273</edgeServerIdentifier>
<issuerIdentifier>25032</issuerIdentifier>
</includedFileHeader>
<calendarYear>2015</calendarYear>
<executionType>P</executionType>
<includedInsuredMemberIdentifier>
<insuredMemberIdentifier>ARS001</insuredMemberIdentifier>
<memberMonths>12.13</memberMonths>
<totalAllowedClaims>1000.00</totalAllowedClaims>
<totalPaidClaims>100.00</totalPaidClaims>
<moopAdjustedPaidClaims>100.00</moopAdjustedPaidClaims>
<cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
<estimatedRIPayment>0.00</estimatedRIPayment>
<coinsurancePercentPayments>0.00</coinsurancePercentPayments>
<includedPlanIdentifier>
<planIdentifier>25032VA013000101</planIdentifier>
<includedClaimIdentifier>
<claimIdentifier>CADULT4SM00101</claimIdentifier>
<claimPaidAmount>100.00</claimPaidAmount>
<crossYearClaimIndicator>N</crossYearClaimIndicator>
</includedClaimIdentifier>
</includedPlanIdentifier>
</includedInsuredMemberIdentifier>
<includedInsuredMemberIdentifier>
<insuredMemberIdentifier>ARS002</insuredMemberIdentifier>
<memberMonths>9.17</memberMonths>
<totalAllowedClaims>0.00</totalAllowedClaims>
<totalPaidClaims>0.00</totalPaidClaims>
<moopAdjustedPaidClaims>0.00</moopAdjustedPaidClaims>
<cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
<estimatedRIPayment>0.00</estimatedRIPayment>
<coinsurancePercentPayments>0.00</coinsurancePercentPayments>
<includedPlanIdentifier>
<planIdentifier>25032VA013000101</planIdentifier>
<includedClaimIdentifier>
<claimIdentifier></claimIdentifier>
<claimPaidAmount>0</claimPaidAmount>
<crossYearClaimIndicator>N</crossYearClaimIndicator>
</includedClaimIdentifier>
</includedPlanIdentifier>
</includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>
I would like to:
1. Read the XML into R
2. Locate a specific insuredMemberIdentifier
3. Extract the planIdentifier and all claimIdentifier data associated with the member ID in (2)
4. Store all text and values for insuredMemberIdentifier, planIdentifier, claimIdentifier, and claimPaidAmount in a data.frame with a row for each unique claim ID (member ID to claim ID is a 1 to many)
So far, I have accomplished 1 and I'm in the ballpark on 2:
## Step 1 ##
library(xml2)
ride <- read_xml("/Users/temp/Desktop/RIDetailEnrolleeReport.xml")
## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
memID <- xml_find_all(ride, "//d1:insuredMemberIdentifier[text()='ARS001']", xml_ns(ride))
[I know that I can then use xml_text() to extract the text of the element.]
After the code in Step 2 above, I've tried using xml_parent() to locate the parent node of the insuredMemberIdentifier, saving that as a variable, and then repeating Step 2 for claim info on that saved variable node.
node <- xml_parent(memID)
xml_find_all(node, "//d1:claimIdentifier", xml_ns(ride))
But this just results in pulling all claimIdentifiers in the global file.
Any help/information on how to get to step 4, above, would be greatly appreciated. Thank you in advance.
Apologies for the late response, but for posterity, import data as above using xml2, then parse the xml file by ID, as hinted by har07.
# output object to collect all claims
res <- data.frame(
insuredMemberIdentifier = rep(NA, 1),
planIdentifier = NA,
claimIdentifier = NA,
claimPaidAmount = NA)
# vector of ids of interest
ids <- c('ARS001')
# indexing counter
starti <- 1
# loop through all ids
for (ii in seq_along(ids)) {
# find ii-th id
## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
memID <- xml_find_all(x = ride,
xpath = paste0("//d1:insuredMemberIdentifier[text()='", ids[ii], "']"))
# find the parent node of the matched member id
node <- xml_parent(memID)
# per har07's comment, search for claim ids relative to this node (note the leading ".")
cid <- xml_find_all(node, ".//d1:claimIdentifier", xml_ns(ride))
pid <- xml_find_all(node, ".//d1:planIdentifier", xml_ns(ride))
cpa <- xml_find_all(node, ".//d1:claimPaidAmount", xml_ns(ride))
# add invalid data handling if necessary
if (length(cid) != length(cpa)) {
warning(paste("cid and cpa do not match for", ids[ii]))
next
}
# collect outputs
res[seq_along(cid) + starti - 1, ] <- list(
ids[ii],
xml_text(pid),
xml_text(cid),
xml_text(cpa))
# adjust counter to add next id into correct row
starti <- starti + length(cid)
}
res
# insuredMemberIdentifier planIdentifier claimIdentifier claimPaidAmount
# 1 ARS001 25032VA013000101 CADULT4SM00101 100.00
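The key difference from the attempt in the question is the leading "." in the XPath. An expression starting with "//" searches from the document root no matter which node is passed in, while ".//" searches only below the current node. A small sketch with a cut-down inline document shows the effect; the root element name report and the claim ids C1 and C2 are invented for the example.

```r
library(xml2)

# Cut-down document using the same default namespace as the real report
doc <- read_xml('<report xmlns="http://vo.edge.fm.cms.hhs.gov">
  <includedInsuredMemberIdentifier>
    <insuredMemberIdentifier>ARS001</insuredMemberIdentifier>
    <includedClaimIdentifier><claimIdentifier>C1</claimIdentifier></includedClaimIdentifier>
  </includedInsuredMemberIdentifier>
  <includedInsuredMemberIdentifier>
    <insuredMemberIdentifier>ARS002</insuredMemberIdentifier>
    <includedClaimIdentifier><claimIdentifier>C2</claimIdentifier></includedClaimIdentifier>
  </includedInsuredMemberIdentifier>
</report>')

ns <- xml_ns(doc)  # the default namespace is exposed under the prefix d1

# Locate the member of interest and step up to its enclosing node
member <- xml_find_first(doc, "//d1:insuredMemberIdentifier[text()='ARS001']", ns)
node <- xml_parent(member)

# "//" is absolute: it returns claims for every member in the file
all_claims <- xml_text(xml_find_all(node, "//d1:claimIdentifier", ns))

# ".//" is relative: it returns only the claims under this member
own_claims <- xml_text(xml_find_all(node, ".//d1:claimIdentifier", ns))

print(all_claims)
print(own_claims)
```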

R Programming Random Stock Pick

I'm stuck on a problem in R.
My aim is to randomly select 2 stocks out of the Swiss Market Index, which consists of 30 stocks.
So far I have solved the random pick of the 2 stocks with the following code:
SMI_components <- cbind("ABB (ABBN.VX)", "ADECCO (ADEN.VX)", "ACTELION (ATLN.VX)", "JULIUS BAER GRP (BAER.VX)", "RICHEMONT (CFR.VX)", "CREDIT SUISSE (CSGN.VX)", "GEBERIT (GEBN.VX)", "GIVAUDAN (GIVN.VX)", "HOLCIM (HOLN.VX)", "NESTLE (NESN.VX)", "NOVARTIS (NOVN.VX)", "TRANSOCEAN (RIGN.VX)", "ROCHE HOLDING (ROG.VX)", "SWISSCOM (SCMN.VX)", "SGS (SGSN.VX)", "SWISS RE (SREN.VX)", "SYNGENTA (SYNN.VX)", "UBS (UBSG.VX)", "SWATCH GROUP (UHR.VX)", "ZURICH INSURANCE GROUP (ZURN.VX)")
for(i in 1:1){
print(sample(SMI_components, 2))
}
How do I continue my code if I want to download the historical data for these two randomly picked stocks?
For example, if the random selection is:
"NOVARTIS (NOVN.VX)" and "ZURICH INSURANCE GROUP (ZURN.VX)"
how do I get to the equivalent of ...
SMI_NOVARTIS <- yahooSeries ("NOVN.VX", from = "2005-01-01", to = "2015-07-30", frequency = "daily")
SMI_ZURICH <- yahooSeries ("ZURN.VX", from = "2005-01-01", to = "2015-07-30", frequency = "daily")
I would really appreciate your help
Regards
print outputs to the console but doesn't store anything. So the first thing to do is assign the output of sample into a variable.
my_picks <- sample(SMI_components, 2)
Extract ticker symbol between parens (courtesy the comment below):
my_picks <- sub(".*\\((.*)\\).*", "\\1", my_picks)
Then you can use lapply to call a function (yahooSeries) for each value in my_picks.
series_list <- lapply(my_picks, yahooSeries, from = "2005-01-01", to = "2015-07-30", frequency = "daily")
Then you'll get the output in a list: series_list[[1]] will have the output of yahooSeries for the first value of my_picks, and series_list[[2]] for the second.
Lastly, I'm not sure why you bothered with the single-iteration for loop, but you don't need it.
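The pick-and-extract part can be tried without any download. Here is a small sketch with a shortened component list; the name tickers and the call to set.seed (there only to make the pick repeatable) are additions for the example.

```r
# Shortened component list, as a plain character vector
SMI_components <- c("NOVARTIS (NOVN.VX)",
                    "ZURICH INSURANCE GROUP (ZURN.VX)",
                    "UBS (UBSG.VX)")

set.seed(1)  # optional: makes the random pick reproducible

# Store the pick instead of only printing it
my_picks <- sample(SMI_components, 2)

# Keep only the text between the parentheses, i.e. the ticker symbol
tickers <- sub(".*\\((.*)\\).*", "\\1", my_picks)

print(my_picks)
print(tickers)
```

The resulting tickers vector is what would be handed to lapply together with the download function.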

Avoid for-loop: Define blocks of actions within a time range

I need to define blocks of actions - so I want to group together all actions for a single id that take place less than 30 days since the last action. If it's more than 30 days since the last action, then I'd increment the label by one (so label 2, 3, 4...). Every new id would start at 1 again.
Here's the data:
dat = data.frame(cbind(
id = c(rep(1,2), rep(16,3), rep(17,24)),
##day_id is the action date in %Y%m%d format - I keep it as numeric but could potentially turn to a date.
day_id = c(20130702, 20130121, 20131028, 20131028, 20130531, 20140513, 20140509,
20140430, 20140417, 20140411, 20140410, 20140404,
20140320, 20140313, 20140305, 20140224, 20140213, 20140131, 20140114,
20130827, 20130820, 20130806, 20130730, 20130723,
20130719, 20130716, 20130620, 20130620, 20130614 ),
###diff is the # of days between actions/day_ids
diff =c(NA,162,NA,0,150,NA,4,9,13,6,1,6,15,7,8,9,11,13,17,140,7,14,
7,7,4,3,26,0,6),
###Just a flag to say whether it's a new id
new_id = c(1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
))
I've done it with a for loop and managed to avoid loops within loops (see below) but can't seem to get rid of that outer loop. Of course, it gets extremely slow with thousands of ids. In the example below, 'call_block' is what I'm trying to reproduce but without the for loop. Can anyone help me get this out of a loop??
max_days = 30
r = NULL
for(i in unique(dat$id)){
d = dat$diff[dat$id==i]
w = c(1,which(d>=max_days) , length(d)+1)
w2 = diff(w)
r = c(r,rep(1:(length(w)-1), w2))
}
dat$call_block = r
Thank you!
Posting @alexis_laz's answer here to close out the question
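library(data.table)
# Label blocks within one id: start at 1 and increment whenever the gap
# since the previous action is 30 days or more. The first diff of each id
# is NA, so it is dropped before taking the cumulative sum.
f = function(x) {
  c(1, cumsum((x >= 30)[-1]) + 1)
}
df = data.table(dat)
# apply f to the diff column separately for each id
df2 = df[, list(call_block = f(diff)), by = id]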
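The same cumulative-sum idea can be sketched in base R with ave, which avoids the extra package and writes the labels straight back onto the original rows. This is an illustrative variation, not the posted answer: the helper name block_id and the reduced sample data are made up, and an NA gap is treated as "no new block".

```r
# Reduced sample: diff is the gap in days since the previous action of that id
dat <- data.frame(
  id   = c(1, 1, 16, 16, 16, 17, 17, 17),
  diff = c(NA, 162, NA, 0, 150, NA, 4, 140)
)

max_days <- 30

# Within one id, a new block starts wherever the gap is max_days or more
block_id <- function(d) {
  new_block <- !is.na(d) & d >= max_days
  cumsum(new_block) + 1
}

# ave applies block_id per id and returns the labels in the original row order
dat$call_block <- ave(dat$diff, dat$id, FUN = block_id)

print(dat)
```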
