Accessing R variable in loop - r

I am trying to access R variables in a loop in the following way:
bes2 = data.frame("id"=c(1,2), "generalElectionVoteW1"=c("Labour","Bla"),
"generalElectionVoteW2"=c("x","t"))
general_names <- c("generalElectionVoteW1", "generalElectionVoteW2")
labour_w = bes2[bes2$general_names[1] == "Labour",]
This simply returns an empty result (a data frame with zero rows).
general_names is simply used to keep generalElectionVoteW1, ...W2 and many more saved for easier access in a loop.
However if I access them manually like labour_w = bes2[bes2$generalElectionVoteW1 == "Labour",] it works as desired. Where is my mistake?
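A likely explanation, sketched under the assumption that the data look exactly like the example in this question: `$` does not evaluate what follows it, so `bes2$general_names` looks for a column literally named `general_names`, finds none, and returns NULL. Comparing NULL to "Labour" gives a zero-length logical vector, which selects no rows. The `[[` operator does evaluate its argument, so it can take a column name stored in a variable. The name `labour_by_wave` below is made up for illustration.

```r
bes2 <- data.frame(id = c(1, 2),
                   generalElectionVoteW1 = c("Labour", "Bla"),
                   generalElectionVoteW2 = c("x", "t"))
general_names <- c("generalElectionVoteW1", "generalElectionVoteW2")

# `$` looks for a column literally called "general_names", which does not exist.
is.null(bes2$general_names)

# `[[` evaluates its argument, so the name stored in the vector is used.
labour_w <- bes2[bes2[[general_names[1]]] == "Labour", ]

# The same lookup in a loop over all stored column names (hypothetical helper list).
labour_by_wave <- lapply(general_names, function(nm) bes2[bes2[[nm]] == "Labour", ])
names(labour_by_wave) <- general_names
```

With many wave columns, iterating over `general_names` this way avoids typing each column name by hand.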
bes2:
  id generalElectionVoteW1 generalElectionVoteW2
1  1                Labour                     x
2  2                   Bla                     t
general_names:
"generalElectionVoteW1" "generalElectionVoteW2"

Related

How to deselect many variables without removing specific variables in dplyr

Say there is a data frame that has a structure like this:
df <- data.frame(x.1 = rnorm(n=100),
                 x.2 = rnorm(n=100),
                 x.3 = rnorm(n=100),
                 x.special = rnorm(n=100),
                 x.y.z = rnorm(n=100))
Inspecting the head, we get this output:
          x.1        x.2         x.3  x.special      x.y.z
1  1.01014580 -1.4047666  1.50374721 -0.8339784 -0.0831983
2  0.44307253 -0.4695634 -0.71951820  1.5758893  1.2163749
3 -0.87051845  0.1793721 -0.26838489 -1.0477929 -1.0813926
4 -0.28491936  0.4186763 -0.07494088 -0.2177471  0.3490200
5 -0.03769566 -0.3656822  0.12478667 -0.7975811 -0.4481193
6 -0.83808036  0.6842561  0.71231627 -0.3348798  1.7418141
Suppose I want to remove all the numbered variables but keep the x.special and x.y.z variables. I know that I can easily deselect with:
df %>%
  select(-x.1,
         -x.2,
         -x.3)
However, for something like 50 or 100 variables like this, it would become cumbersome. Similarly, I know I can pick patterns like so:
df %>%
  select(-contains("x."))
But this of course removes everything, because the names of the special variables also contain "x.". Is there a more intelligent way of picking these variables? I feel like there should be an option for matching on the number in the name.
# use a regex to flag the columns to keep (names without a digit)
colsBool <- !grepl(x=names(df), pattern="\\d")
Result:
> head(df[, colsBool])
   x.special      x.y.z
1  1.1145156 -0.4911891
2  0.7059937  0.4500111
3 -0.6566422  1.6085353
4 -0.6322514 -0.8017260
5  0.4785106  0.6014765
6 -0.8508830 -0.5078307
Regular expressions are your best friend in this situation.
For instance, to remove only the columns whose names end in a digit, use !grepl(pattern = "\\d$", ...); the $ at the end of the expression anchors the match to the end of the name. The ! in front of grepl() negates the result, so each TRUE becomes FALSE and vice versa.
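To make the difference between the two patterns concrete, here is a minimal sketch in base R. The data frame and the column `x2.total` are invented for illustration; they are not part of the question's data.

```r
# Hypothetical data frame; the column names are made up for illustration.
df <- data.frame(x.1 = 1:3, x.2 = 4:6, x2.total = 7:9,
                 x.special = 10:12, x.y.z = 13:15)

# Drop every column whose name contains a digit anywhere.
no_digit <- df[, !grepl("\\d", names(df)), drop = FALSE]
names(no_digit)

# Drop only columns whose name ends in a digit; x2.total is kept.
no_trailing_digit <- df[, !grepl("\\d$", names(df)), drop = FALSE]
names(no_trailing_digit)
```

The `drop = FALSE` argument keeps the result a data frame even if only one column survives.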

R: Replace all Values that are not equal to a set of values

All,
I've been trying to solve a problem on a large data set for some time and could use some of your wisdom.
I have a DF (1.3M obs) with a column called customer along with 30 other columns. Let's say it contains multiple instances of customers Customer1 through Customer3000. I know that I have issues with 30 of those customers. I need to find all the customers that are NOT the customers I have issues with and replace the value in the 'customer' column with the text 'Supported Customer'. That seems like it should be a simple thing; if it weren't for the number of obs, I would have loaded it up in Excel, filtered out all the bad customers, and copy/pasted the text 'Supported Customer' over what remained.
I've tried replace and str_replace_all using grepl and paste/paste0, but to no avail. My current code looks like this:
#All the customers that have issues
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127",
"Customer128", ..... , "Customer140")
#Look for everything that is NOT in the list above and replace with "Enabled"
orderData$customer <- str_replace_all(orderData$customer,
                                      paste0("[^", paste(out, collapse = "|"), "]"),
                                      "Enabled Customers")
That code gets me this error:
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
In a character range [x-y], x is greater than y. (U_REGEX_INVALID_RANGE)
I've tried the inverse of this approach and pulled a list of all obs that don't match the list of out customers. Something like this:
enabled <- orderData %>% filter(!customer %in% out) %>% select(customer) %>%
  distinct(customer)
This gets me a much larger list of customers that ARE enabled (~3,100). Using the str_replace_all and paste approach seems to have issues though. At this large number of patterns, paste no longer collapses using the "|" operator. Instead I get a string that looks like:
"c(\"Customer1\", \"Customer2345\", \"Customer54\", ......)
When passed into str_replace_all, this does not match any patterns.
Anyways, there's got to be an easier way to do this. Thanks for any/all help.
Here is a data.table approach.
First, some example data since you didn't provide any.
orderData <- data.frame(customer = sample(paste0("Customer", 1:300), 5000, replace = TRUE), stringsAsFactors = FALSE)
orderData <- cbind(orderData, matrix(runif(n = 5000 * 30, min = 0, max = 100), ncol = 30))
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127", "Customer128","Customer140")
library(data.table)
setDT(orderData)
result <- orderData[!(customer %in% out),customer := gsub("Customer","Supported Customer ",customer)]
result
customer 1 2 3 4 5 6 7 8 9
1: Supported Customer 134 65.35091 8.57117 79.594166 84.88867 97.225276 84.563997 17.15166 41.87160 3.717705
2: Supported Customer 225 72.95757 32.80893 27.318046 72.97045 28.698518 60.709381 92.51114 79.90031 7.311200
3: Supported Customer 222 39.55269 89.51003 1.626846 80.66629 9.983814 87.122153 85.80335 91.36377 14.667535
4: Supported Customer 184 24.44624 20.64762 9.555844 74.39480 49.189537 73.126275 94.05833 36.34749 3.091072
5: Supported Customer 194 42.34858 16.08034 34.182737 75.81006 35.167769 23.780069 36.08756 26.46816 31.994756
---
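Note that the data.table answer above prefixes the existing name rather than overwriting it. If the goal is to replace the whole value with the fixed text 'Supported Customer', as the question describes, a plain logical index may be enough. The sketch below uses base R and a small invented data frame; the `amount` column is hypothetical. It also sidesteps the regex problem: `[^...]` is a character class that excludes individual characters, not a negation of whole alternatives, which is probably why the pattern in the question was rejected.

```r
# Hypothetical example data; the real orderData has 1.3M rows and 30 more columns.
orderData <- data.frame(
  customer = c("Customer1", "Customer123", "Customer54", "Customer124", "Customer1"),
  amount = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)
out <- c("Customer123", "Customer124")

# Logical vector: TRUE where the customer is NOT in the problem list.
supported <- !(orderData$customer %in% out)

# Overwrite only those rows; problem customers keep their original names.
orderData$customer[supported] <- "Supported Customer"
orderData$customer
```

Because `%in%` compares whole values, no pattern has to be built or escaped, and the size of `out` does not matter.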

Substring (variable length) values in entire column of dataframe

I have looked for this tirelessly with no luck. I am coming from a Java background and new to R. (On a side note, I am loving R, but disliking string operations in it as well as the documentation - maybe that's just a Java bias.)
Anyhow, I have a dataframe with a single column; each value is composed of latitude and longitude numbers separated by colons, e.g. ROAD:_:-87.4968190989999:38.7414455360001
I would like to create 2 new data frames where each will have the separate lat and long numbers.
I have successfully written a piece of code where I use for loops (but I know this is inefficient - and that there has to be another way)
Here is a snippet of the inefficient code:
length <- length(fromLatLong)
for (i in 1:length){
  fromLat[i] <- strsplit(fromLatLong[i] ,":")[[1]][4]
}
for (i in 1:length){
  fromLong[i] <- strsplit(fromLatLong[i] ,":")[[1]][3]
}
for (i in 1:length){
  toLat[i] <- strsplit(toLatLong[i] ,":")[[1]][4]
}
for (i in 1:length){
  toLong[i] <- strsplit(toLatLong[i] ,":")[[1]][3]
}
Here is how I tried to optimize it using mutate, but I only get the first value copied over to all rows as such:
fromLat = mutate(fromLatLong, FROM_NODE_ID = (strsplit(as.character(fromLatLong$FROM_NODE_ID),":")[[1]][4]))
fromLong = mutate(fromLatLong, FROM_NODE_ID = (strsplit(fromLatLong$FROM_NODE_ID,":")[[1]][3]))
toLat = mutate(toLatLong, TO_NODE_ID = (strsplit(toLatLong$TO_NODE_ID,":")[[1]][4]))
toLong = mutate(toLatLong, TO_NODE_ID = (strsplit(toLatLong$TO_NODE_ID,":")[[1]][3]))
And here is the result:
  FROM_NODE_ID
1 38.7414455360001
2 38.7414455360001
3 38.7414455360001
4 38.7414455360001
5 38.7414455360001
6 38.7414455360001
7 38.7414455360001
8 38.7414455360001
9 38.7414455360001
I would appreciate your help on this. Thanks
You can use the map_chr function of the purrr package. For instance:
fromLat = mutate(fromLatLong, FROM_NODE_ID = map_chr(FROM_NODE_ID, ~ strsplit(as.character(.x),":")[[1]][4]))
The following expression will produce a data frame with each of the colon-delimited components as a separate column. You can then break this up into separate data frames or do whatever else you want with it.
as.data.frame(t(matrix(unlist(strsplit(fromLatLong$coords, ":", fixed=TRUE), recursive=FALSE), nrow=4)),stringsAsFactors=FALSE)
(Assuming the column name of your values in the data frame is coords.)
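The mutate attempt in the question repeats one value because `[[1]]` selects only the split of the first string, and that single result is then recycled down the column. A base R sketch of a per-row extraction follows; the second coordinate string is invented for illustration, and the column name `FROM_NODE_ID` is taken from the question.

```r
# Hypothetical input; the second value is made up in the format shown in the question.
fromLatLong <- data.frame(
  FROM_NODE_ID = c("ROAD:_:-87.4968190989999:38.7414455360001",
                   "ROAD:_:-86.1234:39.5678"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector per input string.
parts <- strsplit(fromLatLong$FROM_NODE_ID, ":", fixed = TRUE)

# Take the 4th and 3rd piece from every element, not just from parts[[1]].
fromLat  <- vapply(parts, function(p) p[4], character(1))
fromLong <- vapply(parts, function(p) p[3], character(1))
```

`vapply()` with `character(1)` checks that each element yields exactly one string, so a malformed row fails loudly instead of silently misaligning the output.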

R & xml2: Locate elements by specific text value, store all children values in data.frame

I work with regularly refreshed XML reports and I would like to automate the munging process using R & xml2.
Here's a link to an entire example file.
Here's a sample of the XML:
<?xml version="1.0" ?>
<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
<includedFileHeader>
<outboundFileIdentifier>f2e55625-e70e-4f9d-8278-fc5de7c04d47</outboundFileIdentifier>
<cmsBatchIdentifier>RIP-2015-00096</cmsBatchIdentifier>
<cmsJobIdentifier>16220</cmsJobIdentifier>
<snapShotFileName>25032.BACKUP.D03152016T032051.dat</snapShotFileName>
<snapShotFileHash>20d887c9a71fa920dbb91edc3d171eb64a784dd6</snapShotFileHash>
<outboundFileGenerationDateTime>2016-03-15T15:20:54</outboundFileGenerationDateTime>
<interfaceControlReleaseNumber>04.03.01</interfaceControlReleaseNumber>
<edgeServerVersion>EDGEServer_14.09_01_b0186</edgeServerVersion>
<edgeServerProcessIdentifier>8</edgeServerProcessIdentifier>
<outboundFileTypeCode>RIDE</outboundFileTypeCode>
<edgeServerIdentifier>2800273</edgeServerIdentifier>
<issuerIdentifier>25032</issuerIdentifier>
</includedFileHeader>
<calendarYear>2015</calendarYear>
<executionType>P</executionType>
<includedInsuredMemberIdentifier>
<insuredMemberIdentifier>ARS001</insuredMemberIdentifier>
<memberMonths>12.13</memberMonths>
<totalAllowedClaims>1000.00</totalAllowedClaims>
<totalPaidClaims>100.00</totalPaidClaims>
<moopAdjustedPaidClaims>100.00</moopAdjustedPaidClaims>
<cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
<estimatedRIPayment>0.00</estimatedRIPayment>
<coinsurancePercentPayments>0.00</coinsurancePercentPayments>
<includedPlanIdentifier>
<planIdentifier>25032VA013000101</planIdentifier>
<includedClaimIdentifier>
<claimIdentifier>CADULT4SM00101</claimIdentifier>
<claimPaidAmount>100.00</claimPaidAmount>
<crossYearClaimIndicator>N</crossYearClaimIndicator>
</includedClaimIdentifier>
</includedPlanIdentifier>
</includedInsuredMemberIdentifier>
<includedInsuredMemberIdentifier>
<insuredMemberIdentifier>ARS002</insuredMemberIdentifier>
<memberMonths>9.17</memberMonths>
<totalAllowedClaims>0.00</totalAllowedClaims>
<totalPaidClaims>0.00</totalPaidClaims>
<moopAdjustedPaidClaims>0.00</moopAdjustedPaidClaims>
<cSRMOOPAdjustment>0.00</cSRMOOPAdjustment>
<estimatedRIPayment>0.00</estimatedRIPayment>
<coinsurancePercentPayments>0.00</coinsurancePercentPayments>
<includedPlanIdentifier>
<planIdentifier>25032VA013000101</planIdentifier>
<includedClaimIdentifier>
<claimIdentifier></claimIdentifier>
<claimPaidAmount>0</claimPaidAmount>
<crossYearClaimIndicator>N</crossYearClaimIndicator>
</includedClaimIdentifier>
</includedPlanIdentifier>
</includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>
I would like to:
1. Read in the XML into R
2. Locate a specific insuredMemberIdentifier
3. Extract the planIdentifier and all claimIdentifier data associated with the member ID in (2)
4. Store all text and values for insuredMemberIdentifier, planIdentifier, claimIdentifier, and claimPaidAmount in a data.frame with a row for each unique claim ID (member ID to claim ID is a 1 to many)
So far, I have accomplished 1 and I'm in the ballpark on 2:
## Step 1 ##
library(xml2)
ride <- read_xml("/Users/temp/Desktop/RIDetailEnrolleeReport.xml")
## Step 2 -- assume the insuredMemberIdentifier of interest is 'ARS001' ##
memID <- xml_find_all(ride, "//d1:insuredMemberIdentifier[text()='ARS001']", xml_ns(ride))
[I know that I can then use xml_text() to extract the text of the element.]
After the code in Step 2 above, I've tried using xml_parent() to locate the parent node of the insuredMemberIdentifier, saving that as a variable, and then repeating Step 2 for claim info on that saved variable node.
node <- xml_parent(memID)
xml_find_all(node, "//d1:claimIdentifier", xml_ns(ride))
But this just results in pulling all claimIdentifiers in the global file.
Any help/information on how to get to step 4, above, would be greatly appreciated. Thank you in advance.
Apologies for the late response, but for posterity: import the data as above using xml2, then parse the XML file by ID, as hinted by har07.
# output object to collect all claims
res <- data.frame(
  insuredMemberIdentifier = rep(NA, 1),
  planIdentifier = NA,
  claimIdentifier = NA,
  claimPaidAmount = NA)
# vector of ids of interest
ids <- c('ARS001')
# indexing counter
starti <- 1
# loop through all ids
for (ii in seq_along(ids)) {
  # find the ii-th id (Step 2 of the question, generalised to any id)
  memID <- xml_find_all(x = ride,
                        xpath = paste0("//d1:insuredMemberIdentifier[text()='", ids[ii], "']"))
  # find the parent node of the matched id
  node <- xml_parent(memID)
  # as per har07's comment, search within this node only (note the leading ".")
  cid <- xml_find_all(node, ".//d1:claimIdentifier", xml_ns(ride))
  pid <- xml_find_all(node, ".//d1:planIdentifier", xml_ns(ride))
  cpa <- xml_find_all(node, ".//d1:claimPaidAmount", xml_ns(ride))
  # add invalid data handling if necessary
  if (length(cid) != length(cpa)) {
    warning(paste("cid and cpa do not match for", ids[ii]))
    next
  }
  # collect outputs
  res[seq_along(cid) + starti - 1, ] <- list(
    ids[ii],
    xml_text(pid),
    xml_text(cid),
    xml_text(cpa))
  # adjust counter to add next id into correct row
  starti <- starti + length(cid)
}
res
# insuredMemberIdentifier planIdentifier claimIdentifier claimPaidAmount
# 1 ARS001 25032VA013000101 CADULT4SM00101 100.00
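The key change from the question's attempt is the leading dot in the XPath. As a minimal sketch, assuming the xml2 package is installed, the example below uses a trimmed-down, invented version of the report (the claim id OTHERCLAIM is hypothetical) to show how `//` and `.//` behave when called on the member's parent node.

```r
library(xml2)

# Trimmed-down, hypothetical version of the report with two members.
ride <- read_xml('<riDetailEnrolleeReport xmlns="http://vo.edge.fm.cms.hhs.gov">
  <includedInsuredMemberIdentifier>
    <insuredMemberIdentifier>ARS001</insuredMemberIdentifier>
    <includedPlanIdentifier>
      <includedClaimIdentifier><claimIdentifier>CADULT4SM00101</claimIdentifier></includedClaimIdentifier>
    </includedPlanIdentifier>
  </includedInsuredMemberIdentifier>
  <includedInsuredMemberIdentifier>
    <insuredMemberIdentifier>ARS002</insuredMemberIdentifier>
    <includedPlanIdentifier>
      <includedClaimIdentifier><claimIdentifier>OTHERCLAIM</claimIdentifier></includedClaimIdentifier>
    </includedPlanIdentifier>
  </includedInsuredMemberIdentifier>
</riDetailEnrolleeReport>')
ns <- xml_ns(ride)

# Parent node of the member id of interest.
member <- xml_parent(
  xml_find_all(ride, "//d1:insuredMemberIdentifier[text()='ARS001']", ns))

# Absolute path: "//" starts at the document root, so both members' claims match.
all_claims <- xml_find_all(member, "//d1:claimIdentifier", ns)

# Relative path: ".//" starts at the context node, so only ARS001's claim matches.
own_claims <- xml_find_all(member, ".//d1:claimIdentifier", ns)
xml_text(own_claims)
```

This matches what the question observed: without the dot, the search ignores the node it was called on and scans the whole file.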

Ordering Merged data frames

As a fairly new R programmer, I seem to have run into a strange problem, probably due to my inexperience with R.
After reading and merging successive files into a single data frame, I find that order does not sort the data as expected.
I have multiple references in each file but each file refers to measurement data obtained at a different time.
Here's the code
library(reshape)
# Enter file name to Read & Save data
FileName=readline("Enter File name:\n")
# Find first occurrence of file
for ( round1 in 1 : 6) {
  ReadFile=paste(round1,"C_",FileName,"_Stats.csv", sep="")
  if (file.exists(ReadFile))
    break
}
x = data.frame(read.csv(ReadFile, header=TRUE),rnd=round1)
for ( round2 in (round1+1) : 6) {
  #
  ReadFile=paste(round2,"C_",FileName,"_Stats.csv", sep="")
  if (file.exists(ReadFile)) {
    y = data.frame(read.csv(ReadFile, header=TRUE),rnd = round2)
    if (round2 == (round1 +1))
      z=data.frame(merge(x,y,all=TRUE))
    z=data.frame(merge(y,z,all=TRUE))
  }
}
ordered = order(z$lab_id)
results = z[ordered,]
res = data.frame( lab=results[,"lab_id"],bw=results[,"ZBW"],wi=results[,"ZWI"],pf_zbw=0,pf_zwi=0,r = results[,"rnd"])
#
# Establish no of samples recorded
nsmpls = length(res[,c("lab")])
# Evaluate Z_scores for Between Lab Results
for ( i in 1 : nsmpls) {
  if (res[i,"bw"] > 3 | res[i,"bw"] < -3)
    res[i,"pf_zbw"]=1
}
# Evaluate Z_scores for Within Lab Results
for ( i in 1 : nsmpls) {
  if (res[i,"wi"] > 3 | res[i,"wi"] < -3)
    res[i,"pf_zwi"]=1
}
dd = melt(res, id=c("lab","r"), "pf_zbw")
b = cast(dd, lab ~ r)
If anyone could see why the ordering only works for about 55 of 70 records and could steer me in the right direction, I would be obliged.
Thanks very much
Check whether z$lab_id is a factor (with is.factor(z$lab_id)).
If it is, try
z$lab_id <- as.character(z$lab_id)
if it is supposed to be a character vector; or
z$lab_id <- as.numeric(as.character(z$lab_id))
if it is supposed to be a numeric vector.
Then order it again.
Ps. I had previously put these in the comments.
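To illustrate why the conversion goes through as.character() first, here is a small sketch with invented lab ids. It assumes the ids were read in as a factor, which read.csv() does for text columns when stringsAsFactors = TRUE (the default before R 4.0).

```r
# Hypothetical lab ids stored as a factor.
lab_id <- factor(c("10", "2", "33", "4"))

# Ordering the factor follows its levels, which are sorted as text.
as.character(lab_id[order(lab_id)])

# as.numeric() on a factor returns the internal level codes, not the labels.
as.numeric(lab_id)

# Converting through character first recovers the real numbers.
lab_num <- as.numeric(as.character(lab_id))
lab_num[order(lab_num)]
```

If the ids are not purely numeric, as.numeric() will produce NA with a warning, in which case the as.character() variant is the one to use.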
