How to efficiently extract delimited strings from a data table in R - r

I have a data table in R with text columns of colon delimited data. I want to return a matrix/data table of results where one of the delimited values is returned for each cell.
The code pasted below demonstrates the problem and is a working solution. However, my actual data table is large (a few thousand rows and columns), and the pasted method takes on the order of a minute or two to complete.
I'm wondering if there is a more efficient way to perform this task? It appears that the sep2 option in fread will be very useful for this problem once implemented.
Thanks!
> # Set up data.table
> DT <- data.table(A = c("cat:1:meow", "dog:2:bark", "cow:3:moo"),
B = c("dog:3:meow", "dog:4:bark", "frog:3:croak"),
C = c("dingo:0:moo", "cat:8:croak", "frog:1:moo"))
> print(DT)
A B C
1: cat:1:meow dog:3:meow dingo:0:moo
2: dog:2:bark dog:4:bark cat:8:croak
3: cow:3:moo frog:3:croak frog:1:moo
# grab the second delimited value in each cell
> part_index <- 2
> f = function(x) {vapply(t(x), function(x) {unlist(strsplit(x, ":", fixed=T))[part_index]}, character(1))}
> sapply(DT, f)
A B C
[1,] "1" "3" "0"
[2,] "2" "4" "8"
[3,] "3" "3" "1"

1) sub Try this:
DT[, lapply(.SD, sub, pattern = ".*:(.*):.*", replacement = "\\1")]
giving:
A B C
1: 1 3 0
2: 2 4 8
3: 3 3 1
2) fread or using fread:
DT[, lapply(.SD, function(x) fread(paste(x, collapse = "\n"))$V2)]
3) matrix Note that similar code would work with plain character matrix without data.table:
m <- as.matrix(DT)
replace(m, TRUE, sub(".*:(.*):.*", "\\1", m))
giving:
A B C
[1,] "1" "3" "0"
[2,] "2" "4" "8"
[3,] "3" "3" "1"
3a) Even simpler (no regular expressions) would be:
replace(m, TRUE, read.table(text = m, sep = ":")$V2)
3b) or using fread from data.table:
replace(m, TRUE, fread(paste(m, collapse = "\n"))$V2)
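For comparison, here is a minimal base R sketch of the same idea without regular expressions or fread: call strsplit once per column instead of once per cell, then pick the wanted piece. The helper name extract_part and the use of a plain data.frame are assumptions for illustration, and I have not timed it against the approaches above.

```r
# Same example data as in the question, but as a plain data.frame.
DF <- data.frame(
  A = c("cat:1:meow", "dog:2:bark", "cow:3:moo"),
  B = c("dog:3:meow", "dog:4:bark", "frog:3:croak"),
  C = c("dingo:0:moo", "cat:8:croak", "frog:1:moo"),
  stringsAsFactors = FALSE
)

# extract_part is a hypothetical helper name, not part of any package.
extract_part <- function(x, part_index, sep = ":") {
  pieces <- strsplit(x, sep, fixed = TRUE)       # one vectorised call per column
  vapply(pieces, `[`, character(1), part_index)  # NA where a cell has too few pieces
}

# One column of results per input column, returned as a character matrix.
res <- vapply(DF, extract_part, character(nrow(DF)), part_index = 2)
res
```

Using fixed = TRUE avoids the regex engine, and vapply checks that every column yields a character vector of the expected length.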

Related

rbind named vector to matrix with different lengths

I am trying to bind a named vector onto a matrix. The named vector is shorter than the number of columns in the matrix:
> m <- matrix(data = c("1", "2", "3"),
nrow = 1, ncol = 3,
dimnames = list(c(),
c("column 1", "column 2", "column 3")))
> named_vec <- c("4", "5")
> names(named_vec) <- c("column 1", "column 2")
> rbind(m, named_vec)
I get the following:
Warning message:
In rbind(m, named_vec) :
number of columns of result is not a multiple of vector length (arg 2)
This has the undesired effect of repeating the shorter vector.
Also, plyr's rbind.fill function does not work here, since both arguments need to be data frames:
> plyr::rbind.fill(data.frame(m), data.frame(named_vec))
Error: All inputs to rbind.fill must be data.frames
My desired output is a matrix that fills in missing values with NA's instead of repeating the vector, like this:
column 1 column 2 column 3
[1,] "1" "2" "3"
[2,] "4" "5" NA
Below is a base R solution
do.call(rbind,lapply(u<-list(m,named_vec),`length<-`,max(lengths(u))))
such that
column 1 column 2 column 3
[1,] "1" "2" "3"
[2,] "4" "5" NA
If it is ok to convert the matrices to dataframe, you can use bind_rows.
dplyr::bind_rows(data.frame(m), data.frame(t(named_vec)))
# column.1 column.2 column.3
#1 1 2 3
#2 4 5 <NA>
We can use rbindlist
library(data.table)
rbindlist(list(as.data.frame(m), as.data.frame(t(named_vec))), fill = TRUE)
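If the result should stay a character matrix, another option is to fill the new row by column name rather than by position. This is only a sketch in base R; new_row is a name I made up, and it assumes every name in named_vec is also a column name of m.

```r
m <- matrix(c("1", "2", "3"), nrow = 1,
            dimnames = list(NULL, c("column 1", "column 2", "column 3")))
named_vec <- c("column 1" = "4", "column 2" = "5")

# Start from an all-NA row that has the matrix's column names ...
new_row <- setNames(rep(NA_character_, ncol(m)), colnames(m))
# ... and overwrite only the columns that the named vector provides.
new_row[names(named_vec)] <- named_vec

# deparse.level = 0 keeps rbind from labelling the new row "new_row".
res <- rbind(m, new_row, deparse.level = 0)
res
```

Matching by name also keeps working when the vector supplies, say, columns 1 and 3 instead of the first two.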

sapply not applying a function created to all rows in R dataframe

I have the following data frame in R and am trying to apply a string-splitting function to it to produce a different data frame
DF
A B C
"1,2,3" "1,2"
"2" "1"
The cells of the dataframe are filled with characters. The empty spaces are blank values. I have created the following function
sepfunc<-function(x){strsplit(as.character(x, split= ","))[[1]][1]}
The function works neatly when i use it on a single column
sapply(DF$A, sepfunc)
[1] "1" "2"
However, the following command yields only a single row
sapply(DF, sepfunc)
A B C
"1" NA "1"
The second row is not displayed. I know I must be missing something rudimentary. I request someone to help.
The expected output is
A B C
"1" NA "1"
"2" "1" "NA"
strsplit returns a list of vectors, with one list element per input string. If we subset that list with [[1]], we keep only the first element, which corresponds to the first row, and the rest are skipped. When sapply runs over a single column, it loops over each cell and calls strsplit on one string at a time, so taking [[1]] does no harm because the list has length 1. With sapply(DF, ...) the case is different: the function receives a whole column, so the number of list elements equals the number of rows. We therefore need to loop through the list as well, with either sapply or lapply (the former simplifies to a vector or matrix where it can, while the latter always returns a list)
sapply(DF, function(x) sapply(strsplit(as.character(x), ","), `[`, 1))
# A B C
#[1,] "1" NA "1"
#[2,] "2" "1" NA
Let's look at this more closely by splitting the code into chunks. For each column, strsplit gives a list of split vectors
lapply(DF, function(x) strsplit(as.character(x), ","))
#$A
#$A[[1]]
#[1] "1" "2" "3"
#$A[[2]]
#[1] "2"
#$B
#$B[[1]]
#[1] NA
#$B[[2]]
#[1] "1"
#$C
#$C[[1]]
#[1] "1" "2"
#$C[[2]]
#character(0)
When we do [[1]], the first element is extracted i.e. the first row of 'A', 'B', 'C'
lapply(DF, function(x) strsplit(as.character(x), ",")[[1]])
#$A
#[1] "1" "2" "3"
#$B
#[1] NA
#$C
#[1] "1" "2"
If we again subset on the above, i.e. the first element, the output will be 1 NA 1.
Instead we want to loop through the list and get the first element of each list
As you only want to extract the part before the first comma, you can also do
sapply(DF, function(x) gsub("^([^,]*),.*$", "\\1", x))
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
This extracts the first capture group (\\1), which is the part of the pattern in parentheses: ([^,]*)
Or with stringr:
library(stringr)
sapply(DF, function(x) str_extract(x, "^([^,]*)"))
Here is another version of this
lapply(X = DF, FUN = function(x) sapply(strsplit(x = as.character(x), split = ","), FUN = head, n=1))
First of all, notice that your sepfunc should always give an error:
sepfunc<-function(x){strsplit(as.character(x, split= ","))[[1]][1]}
split should go with strsplit, not as.character, so what you meant is probably:
sepfunc<-function(x){strsplit(as.character(x), split= ",")[[1]][1]}
Second, the question of data sanity. You have character variables stored as factors, and missing data stored as empty strings. I would recommend dealing with these issues before trying to do anything else. (Why do I say NA is more sensible here than an empty string? Because you told me so. You want NA's in the output, so I guess this means that if there are no numbers in the string, it means that something is missing. Missing = NA. There is also a technical reason which would take a bit longer to explain.)
So in the following, I'm just using an altered version of your DF:
DF <- data.frame(A=c("1,2,3", "2"), B=c(NA, "1"), C=c("1,2", NA), stringsAsFactors=FALSE)
(If DF comes from a file, then you could use read.csv("file", as.is=TRUE). And then DF[DF==""] <- NA.)
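As a rough sketch of that clean-up step on a data frame already in memory (DF_raw and DF_clean are made-up names, and I am assuming the blanks are empty strings and the columns are factors, as described in the question):

```r
# Hypothetical raw data: factor columns with "" standing in for missing values.
DF_raw <- data.frame(A = c("1,2,3", "2"), B = c("", "1"), C = c("1,2", ""),
                     stringsAsFactors = TRUE)

# Convert every factor column to character, keeping the data.frame shape.
DF_clean <- DF_raw
DF_clean[] <- lapply(DF_clean, as.character)

# Recode empty strings as proper missing values.
DF_clean[DF_clean == ""] <- NA

str(DF_clean)
```

After this, strsplit sees NA rather than "" for the missing cells, so it returns NA for them instead of character(0).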
The output of strsplit is a list so you'll need sapply to get something useful out from it. And another sapply to apply it to all columns in a data frame.
sapply(DF, function(x) sapply(strsplit(x, ","), head, 1))
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
Or step by step. Before you can sapply a function over all columns of a data frame, you need it to give meaningful results for all the columns. Let's try:
sf <- function(x) sapply(strsplit(x, ","), head, 1)
# and sepfunc as defined above:
sepfunc<-function(x){strsplit(as.character(x), split= ",")[[1]][1]}
sf(DF$A)
# [1] "1" "2"
# as expected
sepfunc(DF$A)
# [1] "1"
Notice that sepfunc uses only the first element (as you told it to!) of each column, and the rest is discarded. You need sapply or something similar to use all elements. So as a consequence, you get this:
sapply(DF, sepfunc)
# A B C
# "1" NA "1"
(It works, because we've redefined empty strings as NA. But you get the results only for the first row of each variable.)
sapply(DF, sf)
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA

Change values from categorical to nominal in R

I want to change all the values in categorical columns by rank. Rank can be decided using the index of the sorted unique elements in the column.
For instance,
> data[1:5,1]
[1] "B2" "C4" "C5" "C1" "B5"
then I want these entries in the column replacing categorical values
> data[1:5,1]
[1] "1" "4" "5" "3" "2"
Another column:
> data[1:5,3]
[1] "Verified" "Source Verified" "Not Verified" "Source Verified" "Source Verified"
Then the updated column:
> data[1:5,3]
[1] "3" "2" "1" "2" "2"
I used this code for this task but it is taking a lot of time.
for(i in 1:ncol(data)){
if(is.character(data[,i])){
temp <- sort(unique(data[,i]))
for(j in 1:nrow(data)){
for(k in 1:length(temp)){
if(data[j,i] == temp[k]){
data[j,i] <- k}
}
}
}
}
Please suggest me the efficient way to do this, if possible.
Thanks.
Here is a solution in base R. I create a helper function that converts each column to a factor using its sorted unique values as levels. This is similar to what you did, except that I use as.integer to get the ranking values.
rank_fac <- function(col1)
as.integer(factor(col1, levels = sort(unique(col1))))
Some data example:
dx <- data.frame(
col1= c("B2" ,"C4" ,"C5", "C1", "B5"),
col2=c("Verified" , "Source Verified", "Not Verified" , "Source Verified", "Source Verified")
)
Apply it without a for loop. It is better to use lapply here to avoid side effects.
data.frame(lapply(dx, rank_fac))
Results:
# col1 col2
# 1 1 3
# 2 4 2
# 3 5 1
# 4 3 2
# 5 2 2
using data.table syntax-sugar
library(data.table)
setDT(dx)[,lapply(.SD,rank_fac)]
# col1 col2
# 1: 1 3
# 2: 4 2
# 3: 5 1
# 4: 3 2
# 5: 2 2
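A simpler solution:
Using only as.integer (this works only when the columns are factors, for example a data.frame created with stringsAsFactors = TRUE; on character columns as.integer returns NA):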
setDT(dx)[,lapply(.SD,as.integer)]
Using match:
# df is your data.frame
df[] <- lapply(df, function(x) match(x, sort(unique(x))))
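To make the match approach concrete, here is a small self-contained sketch that only touches the character columns and leaves numeric ones alone. The amount column and the is_chr name are made up for illustration.

```r
dx <- data.frame(
  col1 = c("B2", "C4", "C5", "C1", "B5"),
  col2 = c("Verified", "Source Verified", "Not Verified",
           "Source Verified", "Source Verified"),
  amount = c(10, 20, 30, 40, 50),   # hypothetical numeric column, left untouched
  stringsAsFactors = FALSE
)

# Find the character columns once ...
is_chr <- vapply(dx, is.character, logical(1))

# ... and replace each value by its position among the sorted unique values.
dx[is_chr] <- lapply(dx[is_chr], function(x) match(x, sort(unique(x))))
dx
```

match is vectorised, so each column is handled in one call instead of the three nested loops in the question.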

Converting a list of lists of strings to a data frame of numbers in R

I have a list of lists of strings as follows:
> ll
[[1]]
[1] "2" "1"
[[2]]
character(0)
[[3]]
[1] "1"
[[4]]
[1] "1" "8"
The longest list is of length 2, and I want to build a data frame with 2 columns from this list. Bonus points for also converting each item in the list to a number or NA for character(0). I have tried using mapply() and data.frame to convert to a data frame and fill with NA's as follows.
# Find length of each list element
len = sapply(ll, length)
# Number of NAs to fill for column shorter than longest
len = 2 - len
df = data.frame(mapply( function(x,y) c( x , rep( NA , y ) ) , ll , len))
However, I do not get a data frame with 2 columns (and NA's as fillers) using the code above.
Thanks for the help.
We can use stri_list2matrix from stringi. As the list elements are all character vectors, it seems okay to use this function
library(stringi)
t(stri_list2matrix(ll))
# [,1] [,2]
#[1,] "2" "1"
#[2,] NA NA
#[3,] "1" NA
#[4,] "1" "8"
If we need to convert to data.frame, wrap it with as.data.frame
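For the bonus part (numbers instead of strings), a base R sketch without stringi could look like the following. It relies on the fact that indexing past the end of a vector gives NA; the names n, padded and df are my own.

```r
ll <- list(c("2", "1"), character(0), "1", c("1", "8"))

# Length of the longest element decides the number of columns.
n <- max(lengths(ll))

# Convert each element to numeric and pad it to length n;
# positions beyond the end of a vector come back as NA.
padded <- lapply(ll, function(x) as.numeric(x)[seq_len(n)])

# Stack the padded vectors as rows and convert to a data frame.
df <- as.data.frame(do.call(rbind, padded))
df
```

The columns get the default names V1 and V2, which can be changed afterwards with names(df) <- ... if needed.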

Splitting a Large Data File in R using Strsplit and R Connection

Hi I am trying to read in a large data file into R. It is a tab delimited file, however the first two columns are filled with multiple pieces of data separated by a "|". The file looks like:
A|1 B|2 0.5 0.4
C|3 D|4 0.9 1
I only care about the first value in each of the first two columns, plus the third and fourth columns. In the end I want a vector for each line that looks like:
A B 0.5 0.4
I am using a connection to read in the file:
con <- file("inputfile.txt", open = "r")
lines <- readLines(con)
which gives me:
lines[1]
[1] "A|1\tB|2\t0.5\t0.4"
then I am using strsplit to split the tab delimited file:
linessplit <- strsplit(lines, split="\t")
which gives me:
linessplit[1]
[[1]]
[1] "A|1" "B|2" "0.5" "0.4"
When I try the following to split "A|1" into "A" "1":
line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")
I get:
"Error in strsplit(line1[1], split = "|") : non-character argument"
Does anyone have a way in which I can fix this?
Thanks!
Since you provided an approach, I will explain the errors in your code, although you may want to consider a different approach for the overall problem.
Putting aside personal taste about code, the problems are:
You have to extract the first element of the list with double brackets:
line1[[1]]
The split argument accepts regular expressions. If you supply |, which is a metacharacter, it won't be read literally. You must escape it as \\| or (as suggested by @nongkrong) use the fixed = TRUE argument, which matches the string exactly as written, without its meaning as a metacharacter.
The final code is l1 <- strsplit(line1[[1]], split = "\\|")
As a final personal suggestion, you might consider an lapply solution:
lapply(linessplit, strsplit, split = "|", fixed = TRUE)
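Both points can be checked on a single line of input. This is just a small demonstration with the example line from the question typed in by hand, not the real file:

```r
lines <- "A|1\tB|2\t0.5\t0.4"
linessplit <- strsplit(lines, split = "\t")

# Single brackets keep the list wrapper; double brackets return the vector inside.
is.list(linessplit[1])         # TRUE, which is why strsplit complains
is.character(linessplit[[1]])  # TRUE

# As a regular expression, "|" matches the empty string,
# so the input is split into single characters.
as_regex <- strsplit("A|1", split = "|")[[1]]

# With fixed = TRUE the pipe is treated as a literal character.
as_fixed <- strsplit("A|1", split = "|", fixed = TRUE)[[1]]

as_regex
as_fixed
```

So the unescaped regex does not raise an error on a character vector; it quietly gives the wrong split, which is easy to miss.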
Here is my solution to your original problem, that is, to split the lines
"A|1\tB|2\t0.5\t0.4"
"C|3\tD|4\t0.9\t1"
into
A B 0.5 0.4
C D 0.9 1
Below is my code:
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines
library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1:4))
linessplit
split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
tmp <- t(colsplit(x, pattern=pat, names=nam))
tmp[sel,]
}
linessplit2 <- sapply(linessplit, split_n_select)
linessplit2
Let's break it down:
Read original data into lines
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines
Results:
[1] "A|1\tB|2\t0.5\t0.4" "C|3\tD|4\t0.9\t1" "E|5\tF|6\t0.7\t0.2"
Load reshape2 library to import function colsplit, then use it with pattern "\t" to split lines into 4 columns named 1,2,3,4.
library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1,2,3,4))
linessplit
Results:
1 2 3 4
1 A|1 B|2 0.5 0.4
2 C|3 D|4 0.9 1.0
3 E|5 F|6 0.7 0.2
Let's make a function that takes a row, splits each value on the pipe, and selects the piece we want.
Pass the first row of linessplit to colsplit
tmp <- colsplit(linessplit[1,], pattern="\\|", names=c(1:2))
tmp
Results:
1 2
1 A 1
2 B 2
3 0.5 NA
4 0.4 NA
Take transpose
tmp <- t(colsplit(linessplit[1,], pattern="\\|", names=c(1:2)))
tmp
Results:
[,1] [,2] [,3] [,4]
1 "A" "B" "0.5" "0.4"
2 " 1" " 2" NA NA
Select first row:
tmp[1,]
Results:
[1] "A" "B" "0.5" "0.4"
Make above steps a function split_n_select:
split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
tmp <- t(colsplit(x, pattern=pat, names=nam))
tmp[sel,]
}
Use sapply to apply function split_n_select to each column in linessplit
linessplit2 <- sapply(linessplit, split_n_select)
linessplit2
Results:
1 2 3 4
[1,] "A" "B" "0.5" "0.4"
[2,] "C" "D" "0.9" "1"
[3,] "E" "F" "0.7" "0.2"
You can also select the second row by adding sel=c(2)
linessplit2 <- sapply(linessplit, split_n_select, sel=c(2))
linessplit2
Results:
1 2 3 4
[1,] "1" "2" NA NA
[2,] "3" "4" NA NA
[3,] "5" "6" NA NA
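If you would rather not depend on reshape2, a base R sketch of the same result is to parse the tab-separated text with read.table and then strip everything from the first pipe onwards in the first two columns. The object name tab and the default column names V1 to V4 come from this sketch, not from the original post.

```r
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")

# Parse the tab-delimited text; columns are named V1..V4 by default.
tab <- read.table(text = lines, sep = "\t", stringsAsFactors = FALSE)

# In the first two columns, drop the pipe and everything after it.
tab[1:2] <- lapply(tab[1:2], function(x) sub("\\|.*$", "", x))
tab
```

For a real file, read.table("inputfile.txt", sep = "\t", stringsAsFactors = FALSE) should work the same way, and columns 3 and 4 stay numeric instead of being turned into strings.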
Change
line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")
to
line1 <- linessplit[[1]] # double brackets return the character vector, not a list
l1 <- strsplit(line1[1], split = "[|]") # square brackets make | a literal character
