Parsing a string efficiently - R

So I've got a column in my data frame that is essentially one long character string used to encode variables for each record. It might look something like this:
string <- c('001034002025003996', '001934002199004888')
But much longer.
The strings are structured so that every 6 characters form a pair. So you can look at the strings above like this:
001034 002025 003996
001934 002199 004888
The first three characters of each pair are a code corresponding to a certain variable and the next three correspond to the value of that variable. So the above can be broken down into columns that look like this:
  var001 var002 var003 var004
1    034    025    996     NA
2    934    199     NA    888
I need a way to parse this string and return a data frame with the expanded columns.
I wrote a nested loop that looks like this:
for (i in seq_along(string)) {
  text <- string[i]
  for (j in seq(1, nchar(text), 6)) {        # step through the string six characters at a time
    var <- substr(text, j, j + 2)            # three-character variable code
    var.value <- substr(text, j + 3, j + 5)  # three-character value
    index <- as.numeric(var)
    df[i, index] <- var.value
  }
}
where df is an empty data frame created to receive the data. This works, but is slow on larger amounts of data. Is there a better way to do this?

1) This one-liner produces a character matrix (which can easily be converted to a data.frame if need be). No packages are used.
read.dcf(textConnection(gsub("(...)(...)", "\\1: \\2\n", string)))
giving:
     001   002   003   004
[1,] "034" "025" "996" NA
[2,] "934" "199" NA    "888"
2) This alternative produces the same matrix. The read.table produces a long form data.frame and then tapply reshapes it to a wide matrix.
long <- read.table(text = gsub("(...)(...)", "\\1 \\2\n", string),
                   colClasses = "character", col.names = c("id", "var"))
tapply(long$var, list(gl(length(string), nchar(string[1])/6), long$id), c)
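For reference, the intermediate long data frame for the two example strings looks like this; gl(length(string), nchar(string[1])/6) then labels which original string each row came from:
long
#    id var
# 1 001 034
# 2 002 025
# 3 003 996
# 4 001 934
# 5 002 199
# 6 004 888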

Related

R: How do you subset all data-frames within a list?

I have a list of data-frames called WaFramesCosts. I want to simply subset it to show specific columns so that I can then export them. I have tried:
for (i in names(WaFramesCosts)) {
  WaFramesCosts[[i]][, c("Cost_Center", "Domestic_Anytime_Min_Used", "Department",
                         "Domestic_Anytime_Min_Used")]
}
but it returns the error of
Error in `[.data.frame`(WaFramesCosts[[i]], , c("Cost_Center", "Department", :
undefined columns selected
I also tried:
for (i in seq_along(WaFramesCosts)) {
  WaFramesCosts[[i]][, -which(names(WaFramesCosts[[i]]) %in%
                              c("Cost_Center", "Domestic_Anytime_Min_Used", "Department",
                                "Domestic_Anytime_Min_Used"))]
}
but I get the same error. Can anyone see what I am doing wrong?
Side Note: For reference, I used this:
for (i in seq_along(WaFramesCosts)) {
  t <- WaFramesCosts[[i]][, grepl("Domestic", names(WaFramesCosts[[i]]))]
  q <- subset(WaFramesCosts[[i]], select = c("Cost_Center", "Domestic_Anytime_Min_Used", "Department", "Domestic_Anytime_Min_Used"))
  WaFramesCosts[[i]] <- merge(q, t)
}
while attempting the same goal with a different approach, and it seemed to get closer.
Welcome back, Kootseeahknee. You are still incorrectly assuming that the last command of a for loop is implicitly returned at the end. If you want that behavior, perhaps you want lapply:
myoutput <- lapply(names(WaFramesCosts), function(i) {
  WaFramesCosts[[i]][, c("Cost_Center", "Domestic_Anytime_Min_Used", "Department", "Domestic_Anytime_Min_Used")]
})
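Note that lapply over names(WaFramesCosts) returns an unnamed list; if you want the frame names carried through to the output, one small optional tweak:
myoutput <- setNames(myoutput, names(WaFramesCosts))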
The undefined columns selected error tells me that your assumptions about the datasets are not correct: at least one of them is missing at least one of the columns. From your previous question (How to do a complex edit of columns of all data frames in a list?), I'm inferring that you want the columns that match, rather than assuming every column is present in every frame. For that, you could/should be using grep or some variant:
myoutput <- lapply(names(WaFramesCosts), function(i) {
  WaFramesCosts[[i]][, grep("(Cost_Center|Domestic_Anytime_Min_Used|Department)",
                            colnames(WaFramesCosts[[i]])), drop = FALSE]
})
This will match column names that contain any of those strings. You can be a lot more precise by ensuring whole strings or start/end matches occur by using regular expressions. For instance, changing from (Cost|Dom) (anything that contains "Cost" or "Dom") to (^Cost|Dom) means anything that starts with "Cost" or contains "Dom"; similarly, (Cost|ment$) matches anything that contains "Cost" or ends with "ment". If, however, you always want exact matches and just need those that exist, then something like this will work:
myoutput <- lapply(names(WaFramesCosts), function(i) {
  WaFramesCosts[[i]][, intersect(c("Cost_Center", "Domestic_Anytime_Min_Used", "Department"),
                                 colnames(WaFramesCosts[[i]])), drop = FALSE]
})
Note, in that last example, the difference between mtcars[,2] (returns a vector) and mtcars[,2,drop=FALSE] (returns a data.frame with 1 column). Defensive programming: if you think it at all possible that your filtering will return a single column, make sure you do not inadvertently convert it to a vector by appending ,drop=FALSE to your bracket-subsetting.
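A quick illustration of that difference with the built-in mtcars data:
class(mtcars[, 2])                # "numeric"    -- dropped to a plain vector
class(mtcars[, 2, drop = FALSE])  # "data.frame" -- still a one-column data.frame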
Based on your description, here is an example of using the dplyr package to combine a list of data frames for a given set of columns. This doesn't require all data frames to have identical columns. (Providing your data in a reproducible example would be better.)
# test data
df1 = read.table(text = "
c1 c2 c3
a 1 101
b 2 102
", header = TRUE, stringsAsFactors = FALSE)
df2 = read.table(text = "
c1 c2 c3
w 11 201
x 12 202
", header = TRUE, stringsAsFactors = FALSE)
# dfs is a list of data frames
dfs <- list(df1, df2)
# use dplyr::bind_rows
library(dplyr)
cols <- c("c1", "c3")
result <- bind_rows(dfs)[cols]
result
# c1 c3
# 1 a 101
# 2 b 102
# 3 w 201
# 4 x 202
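To see that mismatched columns are tolerated, a quick sketch with a hypothetical third frame that lacks c3; bind_rows pads the missing column with NA:
df3 <- data.frame(c1 = "z", c2 = 13, stringsAsFactors = FALSE)
bind_rows(list(df1, df3))[cols]
#   c1  c3
# 1  a 101
# 2  b 102
# 3  z  NA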

strsplit a character array and convert to dataframe simultaneously

I have what feels like a difficult data manipulation problem, and am hoping to get some guidance. Here is a test version of what my current array looks like, as well as what dataframe I hope to obtain:
dput(test)
c("<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>", "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>")
test
[1] "<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>"
[2] "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>"
desired_df
  quarter oncourt-id time-minutes time-seconds id
1       1         NA           12            0  1
2       2         NA           10           NA  1
There are a few problems I am dealing with:
the character array "test" has backslashes where there should be nothing, but I was having difficulty using gsub in this format: gsub("\", "", test).
not every element in test has the same number of entries; note in the example that the 2nd element doesn't have time-seconds, and so in the dataframe I would prefer it to return NA.
I have tried using strsplit(test, " ") to first split on spaces, which only exist between different column entries, but then I am returned with a list of lists that is just as difficult to deal with.
You've got XML there. You could parse it, then run rbindlist on the result. This will probably be a lot less hassle than trying to split the name-value pairs as strings.
dflist <- lapply(test, function(x) {
  df <- as.data.frame.list(XML::xmlToList(x))  # one row: each XML attribute becomes a column
  is.na(df) <- df == ""                        # turn empty attributes into NA
  df
})
data.table::rbindlist(dflist, fill = TRUE)     # fill = TRUE pads missing columns with NA
#    quarter oncourt.id time.minutes time.seconds id
# 1:       1         NA           12            0  1
# 2:       2         NA           10           NA  1
Note: You will need the XML and data.table packages for this solution.
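If you prefer the xml2 package over XML, an equivalent sketch of the same idea (assuming xml2 is installed; xml_attrs returns the root node's attributes as a named character vector):
dflist <- lapply(test, function(x) {
  attrs <- xml2::xml_attrs(xml2::read_xml(x))
  df <- as.data.frame(as.list(attrs), stringsAsFactors = FALSE)
  is.na(df) <- df == ""
  df
})
data.table::rbindlist(dflist, fill = TRUE)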

Crafty ways to make super efficient R vector processing?

I have a very simple assignment for a project that requires processing a large amount of information; my professor's first words were "this will take a while to run," so I figured it'd be a good opportunity to spend that running time making a super efficient program :P
Basically, I have an input file where each line is either a node or details. It might look something like:
#NODE1_length_17_2309482.2394832.2
val1 5 18
val2 6 21
val3 100 23
val4 9 6
#NODE2_length_1298_23948349.23984.2
val1 2 293
...
and so on. Basically, I want to know how I can efficiently use R to output, line by line, something like:
NODE1_length_17 val1 18
NODE1_length_17 val2 21
...
So, as you can see, I want the node name, the value, and the third column of the value line. I have implemented this using an ultra-slow for loop that calls strsplit a whole bunch of times, and obviously that is not ideal. My current implementation looks like:
nodevals <- which(substring(data, 1, 1) == "#")   # lines containing node headers
vallines <- which(substring(data, 1, 3) == "val") # lines containing values
out <- vector(mode = "character", length = length(vallines))
for (i in vallines) {
  line_ra <- strsplit(data[i], "\\s+")[[1]]
  # ... and so on, using a bunch of strsplits and pastes to reformat ...
  out[i] <- paste(node, val, value, sep = "\t")
}
Does anybody know how I can optimize this using data frames or crafty vector manipulations?
EDIT: I'm implementing vector-wise splitting for everything, and so far I've found that the main thing I can't get right is the names of each node. I'm trying to do something like,
names <- data[max(nodes[nodelines < vallines])]
where nodes are the names on each line containing a node, nodelines are the line numbers of those node lines, and vallines are the line numbers of each line containing a val. The return vector should have the same number of elements as vallines. The goal is to find, for each element of vallines, the maximum nodeline that is less than that line number. Any thoughts?
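That "largest node line below each value line" lookup is exactly what findInterval computes; a minimal sketch, assuming nodelines and vallines are sorted, increasing line numbers:
# for each value line, the index of the most recent preceding node line
names <- data[nodelines[findInterval(vallines, nodelines)]]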
I suggest using the data.table package - it has a very fast string-split function, tstrsplit.
library(data.table)
#read from file
data <- scan('data.txt', 'character', sep = '\n')
#create separate objects for nodes and values
dt <- data.table(data)
dt[, c('IsNode', 'NodeId') := list(IsNode <- substr(data, 1, 1) == '#', cumsum(IsNode))]
nodes <- dt[IsNode == TRUE, list(NodeId, data)]
values <- dt[IsNode == FALSE, list(data, NodeId)]
#split string and join back values and nodes
tmp <- values[, tstrsplit(data, '\\s+')]
values <- data.table(values[, list(NodeId)], tmp[, list(val = V1, value = V3)], key = 'NodeId')
res <- values[nodes]
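At this point res holds NodeId, val, value, and the raw node line in data; to get output like NODE1_length_17 val1 18 from the question, one more cleanup step is needed. A hedged sketch, assuming node names always follow the #NAME_length_N_... pattern in the example:
res[, node := sub('^#([^_]+_[^_]+_[^_]+).*', '\\1', data)]
res[, list(node, val, value)]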

Parse currency values from CSV, convert numerical suffixes for Million and Billion

I'm curious if there are any sort of out-of-the-box functions in R that can handle this.
I have a CSV file that I am reading into a data frame using read.csv. One of the columns in the CSV contains currency values in the format of
Currency
--------
$1.2M
$3.1B
N/A
I would like to convert those into more usable numbers that calculations can be performed against, so it would look like this:
Currency
----------
1200000
3100000000
NA
My initial thoughts were to somehow subset the dataframe into 3 parts based on rows that contain *M, *B, or N/A. Then use gsub to replace the $ and M/B, then multiply the remaining number by 1000000 or 1000000000, and finally rejoin the 3 subsets back into 1 data frame.
However I'm curious if there's a simpler way to handle this sort of conversion in R.
We could use gsubfn to replace the 'B', 'M' with 'e+9', 'e+6' and convert to numeric (as.numeric).
is.na(v1) <- v1=='N/A'
options(scipen=999)
library(gsubfn)
as.numeric(gsubfn('([A-Z]|\\$)', list(B='e+9', M='e+6',"$"=""),v1))
#[1] 1200000 3100000000 NA
EDIT: Modified based on #nicola's suggestion
data
v1 <- c('$1.2M', '$3.1B', 'N/A')
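If you'd rather avoid the gsubfn dependency, a vectorized base R sketch of the same idea (assuming only M and B suffixes ever occur):
v1 <- c('$1.2M', '$3.1B', 'N/A')
is.na(v1) <- v1 == 'N/A'
mult <- ifelse(grepl('B', v1), 1e9, ifelse(grepl('M', v1), 1e6, 1))
as.numeric(gsub('[$MB]', '', v1)) * mult
#[1] 1200000 3100000000 NA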
Another way is using a for-loop:
x <- c("1.2M", "2.5M", "1.6B", "N/A")
x <- ifelse(x == "N/A", NA, x)
num <- as.numeric(unlist(strsplit(x, "[^0-9.]+")))  # keep just the numeric part
for (i in 1:length(x)) {
  if (grepl('M', x[i]))
    print(prod(num[i], 1000000))
  else
    print(prod(num[i], 1000000000))
}
# [1] 1200000
# [1] 2500000
# [1] 1.6e+09
# [1] NA

Creating a list of data frames using indexes for start and stop, in R

In R, take any large data frame (for example, 300,000 rows and 30 columns). I want to create a list of data frames using start and stop index values that I have stored in another data frame (two columns: the first holds the start values and the second the stop values). The number of rows in the start-stop df is the number of data frames that end up in the list (in this small example, 6). It sounds like there should be an easy function for this, but I've only ever created lists of data frames with the split command or with conditional statements, and my research didn't turn up a solution. Also, I'm looping row by row below, which is not preferable. Any help greatly appreciated!
Example of start, stop data frame
> df
  headID tailID
1    688    704
2   2576   2583
3   4005   4018
4   4336   5761
5   5762   7201
6   7202   8641
So I'm thinking something like (pseudo-code):
subList <- list()
counter <- 1
for (j in 1:nrow(df)) {
  start.idx <- df$headID[j]          # first row of this chunk
  end.idx <- df$tailID[j]            # last row of this chunk
  subMat <- bigDF[start.idx:end.idx, ]
  subList[[counter]] <- subMat
  counter <- counter + 1
}
I would write a function and apply it...
f <- function(x, data) {
  data[x[1]:x[2], ]      # x is one row of df: the headID and tailID for this chunk
}
apply(df, 1, f, bigDF)   # one data frame per row of df, returned as a list
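If you'd rather avoid apply coercing df to a matrix (harmless here since df is all numeric, but a common gotcha with mixed-type data), an equivalent sketch using Map:
Map(function(h, t) bigDF[h:t, ], df$headID, df$tailID)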
