Splitting a Large Data File in R using Strsplit and R Connection

Hi, I am trying to read a large data file into R. It is a tab-delimited file, but the first two columns contain multiple pieces of data separated by a "|". The file looks like:
A|1 B|2 0.5 0.4
C|3 D|4 0.9 1
I only care about the first value in each of the first two columns, as well as the third and fourth columns. In the end I want a vector for each line that looks like:
A B 0.5 0.4
I am using a connection to read in the file:
con <- file("inputfile.txt", open = "r")
lines <- readLines(con)
which gives me:
lines[1]
[1] "A|1\tB|2/t0.5\t0.4"
Then I am using strsplit to split on the tab delimiters:
linessplit <- strsplit(lines, split="\t")
which gives me:
linessplit[1]
[[1]]
[1] "A|1" "B|2" "0.5" "0.4"
When I try the following to split "A|1" into "A" "1":
line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")
I get:
"Error in strsplit(line1[1], split = "|") : non-character argument"
Does anyone have a way in which I can fix this?
Thanks!

Since you provided an approach, I will explain the errors in your code, even though another approach might suit your problem better.
Putting aside personal tastes about code, the problems are:
you have to extract the first element of the list with double brackets
line1[[1]]
the split argument accepts regular expressions. If you supply |, which is a metacharacter, it won't be read as is. You must escape it as \\| or (as suggested by @nongkrong) use the fixed = TRUE argument, which matches the string exactly as is (that is, without its meaning as a metacharacter).
The final code is l1 <- strsplit(line1[[1]], split = "\\|")
As a final personal consideration, you might take an lapply solution into consideration:
lapply(linessplit, strsplit, split = "|", fixed = TRUE)
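Putting the pieces together, here is a minimal end-to-end sketch of the whole task (assuming inputfile.txt has the layout shown in the question):
con <- file("inputfile.txt", open = "r")
lines <- readLines(con)
close(con)
# split each line on tabs
linessplit <- strsplit(lines, split = "\t", fixed = TRUE)
# in the first two fields of each line, keep only the part before the "|"
result <- lapply(linessplit, function(fields) {
  fields[1:2] <- vapply(strsplit(fields[1:2], split = "|", fixed = TRUE),
                        `[`, character(1), 1)
  fields
})
result[[1]]
# [1] "A"   "B"   "0.5" "0.4"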

Here is my solution to your original problem: split the lines
"A|1\tB|2\t0.5\t0.4"
"C|3\tD|4\t0.9\t1"
into
A B 0.5 0.4
C D 0.9 1
Below is my code:
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines
library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1:4))
linessplit
split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
  tmp <- t(colsplit(x, pattern=pat, names=nam))
  tmp[sel,]
}
linessplit2 <- sapply(linessplit, split_n_select)
linessplit2
Let's break it down:
Read original data into lines
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines
Results:
[1] "A|1\tB|2\t0.5\t0.4" "C|3\tD|4\t0.9\t1" "E|5\tF|6\t0.7\t0.2"
Load the reshape2 library to get the colsplit function, then use it with pattern "\t" to split lines into 4 columns named 1, 2, 3, 4.
library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1,2,3,4))
linessplit
Results:
1 2 3 4
1 A|1 B|2 0.5 0.4
2 C|3 D|4 0.9 1.0
3 E|5 F|6 0.7 0.2
Let's make a function that takes a row, splits it into rows, and selects the row we want.
Pass the first row of linessplit to colsplit:
tmp <- colsplit(linessplit[1,], pattern="\\|", names=c(1:2))
tmp
Results:
1 2
1 A 1
2 B 2
3 0.5 NA
4 0.4 NA
Take the transpose:
tmp <- t(colsplit(linessplit[1,], pattern="\\|", names=c(1:2)))
tmp
Results:
[,1] [,2] [,3] [,4]
1 "A" "B" "0.5" "0.4"
2 " 1" " 2" NA NA
Select first row:
tmp[1,]
Results:
[1] "A" "B" "0.5" "0.4"
Make the above steps into a function, split_n_select:
split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
  tmp <- t(colsplit(x, pattern=pat, names=nam))
  tmp[sel,]
}
Use sapply to apply split_n_select to each column of linessplit; sapply then assembles the results so that each original line becomes a row:
linessplit2 <- sapply(linessplit, split_n_select)
linessplit2
Results:
1 2 3 4
[1,] "A" "B" "0.5" "0.4"
[2,] "C" "D" "0.9" "1"
[3,] "E" "F" "0.7" "0.2"
You can also select the second row by adding sel=c(2)
linessplit2 <- sapply(linessplit, split_n_select, sel=c(2))
linessplit2
Results:
1 2 3 4
[1,] "1" "2" NA NA
[2,] "3" "4" NA NA
[3,] "5" "6" NA NA

Change
line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")
to
line1 <- linessplit[[1]] # double brackets, so line1 is a character vector
l1 <- strsplit(line1[1], split = "[|]") # I added square brackets
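This works because inside a character class | loses its special meaning, so no escaping is needed. A quick check:
strsplit("A|1", split = "[|]")
# [[1]]
# [1] "A" "1"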

Related

How to remove brackets and keep content inside from Data Frame

I have a dataframe pulled from a pdf that looks something like this:
library(tidyverse)
Name <- c("A","B","A","A","B","B","C","C","C")
Result <- c("ND","[0.5]","1.2","ND","ND","[0.8]","ND","[1.1]","22")
results2 <- data.frame(Name, Result)
results2
I am trying to get rid of the brackets and have tried using gsub, but failed. It looked like:
results2$RESULT <-gsub("\\[","",results2$RESULT)
and resulted in an 'unexpected symbol' error message. I would like to get rid of the brackets and turn the Result column into a numeric one.
We need to add an OR (|) together with \\] so that both the opening and the closing square brackets are matched. Also, the column name is lower case; as R is case-sensitive, your code refers to 'RESULT', which doesn't exist in the data:
gsub("\\[|\\]", "", results2$Result)
[1] "ND" "0.5" "1.2" "ND" "ND" "0.8" "ND" "1.1" "22"
As you are using library(tidyverse), you could do the following (the regex is from akrun):
library(dplyr)
library(stringr)
results2 %>%
  mutate(Result = str_replace_all(Result, "\\[|\\]", ""))
Name Result
1 A ND
2 B 0.5
3 A 1.2
4 A ND
5 B ND
6 B 0.8
7 C ND
8 C 1.1
9 C 22
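Since you also want a numeric column: after stripping the brackets, as.numeric turns the remaining "ND" entries into NA (with a warning), which is usually the desired behaviour for non-detects. A minimal follow-up sketch:
results2$Result <- gsub("\\[|\\]", "", results2$Result)
results2$Result <- as.numeric(results2$Result) # "ND" becomes NA, with a coercion warning
results2$Result
# [1]   NA  0.5  1.2   NA   NA  0.8   NA  1.1 22.0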

Storing unique values of each column (of a df) in list

It is straightforward to obtain the unique values of a column using unique. However, I am looking to do the same for multiple columns in a dataframe and store them in a list, all using base R. Importantly, it is not combinations I need but simply the unique values for each individual column. I currently have the below:
# dummy data
df = data.frame(a = LETTERS[1:4],
                b = 1:4)
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols) {
  x = unique(i)
  unique_values_by_col[[i]] = x
}
The problem comes when displaying unique_values_by_col, as it shows up empty. I believe the problem is that i is being passed to the loop as text, not as a variable.
Any help would be greatly appreciated. Thank you.
Why not avoid the for loop altogether using lapply:
lapply(df, unique)
Resulting in:
$a
[1] A B C D
Levels: A B C D

$b
[1] 1 2 3 4
You also have apply, which is specifically designed to be run over columns or rows:
apply(df,2,unique)
result:
> apply(df,2,unique)
a b
[1,] "A" "1"
[2,] "B" "2"
[3,] "C" "3"
[4,] "D" "4"
though if you want a list, lapply returns a list, so it may be the better choice
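To illustrate that caveat: apply first converts the data frame to a character matrix, and it only returns a matrix when every column happens to have the same number of unique values (as in the 4x2 example above). With unequal counts it falls back to a list, so the return type is not stable:
df2 <- data.frame(a = c("A", "A", "B"), b = 1:3)
apply(df2, 2, unique) # unequal unique counts, so a list comes back
# $a
# [1] "A" "B"
#
# $b
# [1] "1" "2" "3"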
Your for loop is almost right, just needs one fix to work:
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols) {
  x = unique(df[[i]])
  unique_values_by_col[[i]] = x
}
unique_values_by_col
# $a
# [1] A B C D
# Levels: A B C D
#
# $b
# [1] 1 2 3 4
i is just a character string (the name of a column within df), so unique(i) doesn't make sense: it just returns the column name itself.
Anyhow, the most standard way for this task is lapply() as shown by demirev.
Could this be what you're trying to do?
Map(unique,df)
Result:
$a
[1] A B C D
Levels: A B C D
$b
[1] 1 2 3 4

How to efficiently extract delimited strings from a data table in R

I have a data table in R with text columns of colon delimited data. I want to return a matrix/data table of results where one of the delimited values is returned for each cell.
The code pasted below demonstrates the problem and is a working solution. However, my actual data table is large (a few thousand rows and columns), and the pasted method takes on the order of a minute or two to complete.
I'm wondering if there is a more efficient way to perform this task? It appears that the sep2 option in fread will be very useful for this problem once implemented.
Thanks!
> # Set up data.table
> DT <- data.table(A = c("cat:1:meow", "dog:2:bark", "cow:3:moo"),
B = c("dog:3:meow", "dog:4:bark", "frog:3:croak"),
C = c("dingo:0:moo", "cat:8:croak", "frog:1:moo"))
> print(DT)
A B C
1: cat:1:meow dog:3:meow dingo:0:moo
2: dog:2:bark dog:4:bark cat:8:croak
3: cow:3:moo frog:3:croak frog:1:moo
# grab the second delimited value in each cell
> part_index <- 2
> f = function(x) {vapply(t(x), function(x) {unlist(strsplit(x, ":", fixed=T))[part_index]}, character(1))}
> sapply(DT, f)
A B C
[1,] "1" "3" "0"
[2,] "2" "4" "8"
[3,] "3" "3" "1"
1) sub: Try this:
DT[, lapply(.SD, sub, pattern = ".*:(.*):.*", replacement = "\\1")]
giving:
A B C
1: 1 3 0
2: 2 4 8
3: 3 3 1
2) fread: Or using fread:
DT[, lapply(.SD, function(x) fread(paste(x, collapse = "\n"))$V2)]
3) matrix: Note that similar code would work on a plain character matrix, without data.table:
m <- as.matrix(DT)
replace(m, TRUE, sub(".*:(.*):.*", "\\1", m))
giving:
A B C
[1,] "1" "3" "0"
[2,] "2" "4" "8"
[3,] "3" "3" "1"
3a) Even simpler (no regular expressions) would be:
replace(m, TRUE, read.table(text = m, sep = ":")$V2)
3b) Or using fread from data.table:
replace(m, TRUE, fread(paste(m, collapse = "\n"))$V2)
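4) tstrsplit: One more variant worth trying (a sketch, not benchmarked here): data.table's own tstrsplit() splits a character vector and transposes the result in one step, so the second delimited value is simply its second list element:
library(data.table)
DT[, lapply(.SD, function(x) tstrsplit(x, ":", fixed = TRUE)[[2]])]
#    A B C
# 1: 1 3 0
# 2: 2 4 8
# 3: 3 3 1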

Change values from categorical to nominal in R

I want to replace all the values in categorical columns by their rank, where the rank is given by the index of the value among the sorted unique elements of the column.
For instance,
> data[1:5,1]
[1] "B2" "C4" "C5" "C1" "B5"
then I want these entries to replace the categorical values:
> data[1:5,1]
[1] "1" "4" "5" "3" "2"
Another column:
> data[1:5,3]
[1] "Verified" "Source Verified" "Not Verified" "Source Verified" "Source Verified"
Then the updated column:
> data[1:5,3]
[1] "3" "2" "1" "2" "2"
I used this code for this task but it is taking a lot of time.
for(i in 1:ncol(data)){
  if(is.character(data[,i])){
    temp <- sort(unique(data[,i]))
    for(j in 1:nrow(data)){
      for(k in 1:length(temp)){
        if(data[j,i] == temp[k]){
          data[j,i] <- k
        }
      }
    }
  }
}
Please suggest an efficient way to do this, if possible.
Thanks.
Here is a solution in base R. I create a helper function that converts each column to a factor, using its sorted unique values as levels. This is similar to what you did, except I use as.integer to get the ranking values.
rank_fac <- function(col1)
  as.integer(factor(col1, levels = sort(unique(col1))))
Some example data:
dx <- data.frame(
  col1 = c("B2", "C4", "C5", "C1", "B5"),
  col2 = c("Verified", "Source Verified", "Not Verified", "Source Verified", "Source Verified")
)
Apply it without using a for loop; it is better to use lapply here to avoid side effects:
data.frame(lapply(dx, rank_fac))
Results:
# col1 col2
# [1,] 1 3
# [2,] 4 2
# [3,] 5 1
# [4,] 3 2
# [5,] 2 2
Using data.table syntactic sugar:
library(data.table)
setDT(dx)[,lapply(.SD,rank_fac)]
# col1 col2
# 1: 1 3
# 2: 4 2
# 3: 5 1
# 4: 3 2
# 5: 2 2
A simpler solution, using only as.integer (note: this assumes the columns are factors whose levels are already in sorted order, e.g. read in with stringsAsFactors = TRUE; on plain character columns as.integer would give NAs):
setDT(dx)[,lapply(.SD,as.integer)]
Using match:
# df is your data.frame
df[] <- lapply(df, function(x) match(x, sort(unique(x))))
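Applied to the example data dx from above, the match() one-liner reproduces the expected ranks (a quick check):
dx <- data.frame(
  col1 = c("B2", "C4", "C5", "C1", "B5"),
  col2 = c("Verified", "Source Verified", "Not Verified", "Source Verified", "Source Verified")
)
dx[] <- lapply(dx, function(x) match(x, sort(unique(x))))
dx
#   col1 col2
# 1    1    3
# 2    4    2
# 3    5    1
# 4    3    2
# 5    2    2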

How to ignore a "Level" in R?

Not sure how to do the following. Please refer to the picture in the link below:
https://i.stack.imgur.com/Kx79x.png
I have some blank spaces, and they are the missing values. I do not want this level to be read; I want R to ignore it, so that when I write a regression this empty category is not part of the model.
The data was read from a csv file. The variable takes the values "I", "II", ..., "IV", but there is an extra "" factor level because of the missing data. I want R to ignore this level. My question is how?
You can do the following (this assumes numbers was read in as a factor, e.g. with stringsAsFactors = TRUE):
df <- data.frame(letters=letters[1:5], numbers=c(1,2,3,"",5)) # my data frame
# letters numbers
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d
# 5 e 5
levels(df$numbers)
# "" "1" "2" "3" "5"
subdf <- subset(df, numbers != "") # data subset
subdf$numbers <- factor(subdf$numbers)
levels(subdf$numbers)
# "1" "2" "3" "5"
change the "" data to missing:
# generate sample data
df <- data.frame(x = sample(c("","I","II","III"),100, replace = T), stringsAsFactors = T)
option 1
df[df$x=="",'x'] <- NA
option 2
df$x <- factor(ifelse(df$x == "",NA,as.character(df$x)))
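As a side note, base R also has droplevels(), which is the idiomatic way to drop unused factor levels after subsetting (the same effect as re-calling factor() in the first answer):
subdf <- subset(df, x != "")
subdf$x <- droplevels(subdf$x) # the unused "" level is gone
levels(subdf$x)
# [1] "I"   "II"  "III"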
