Extracting from text file in R

I need to compare two .txt files with the following formats, using R:
rows in file1:
1-11!AIVDM,1,1,,B,11b4N?#P?w<tSF0l4Q#>4?wp1`Oo,0*3D
1347204643
2-12$GPRMC,153102,A,6300.774,N,05238.627,W,12.9,186,090912,30,W*79
1347204664
(Here, for some reason, the time (e.g. 1347204643) is on a separate row.)
rows in file2:
#1:1347204643:11!AIVDM,1,1,,B,11b4N?#P?w<tSF0l4Q#>4?wp1`Oo,0*3D
#2:1347204664:12$GPRMC,153102,A,6300.774,N,05238.627,W,12.9,186,090912,30,W*79
I am only interested in verifying whether the same ID, which is at the beginning of the row (e.g. 1 and 2 here), exists in both files (i.e. whether an ID that exists in file1 also exists in file2).
Can someone help me with this? Thank you very much in advance!

You can do something like this:
First, read the two files using readLines (textConnection stands in for the file paths here):
ll1 <- readLines(textConnection('#1:1347204643:11!AIVDM,1,1,,B,11b4N?#P?w<tSF0l4Q#>4?wp1`Oo,0*3D
#2:1347204664:12$GPRMC,153102,A,6300.774,N,05238.627,W,12.9,186,090912,30,W*79'))
ll2 <- readLines(textConnection('1-11!AIVDM,1,1,,B,11b4N?#P?w<tSF0l4Q#>4?wp1`Oo,0*3D
1347204643
2-12$GPRMC,153102,A,6300.774,N,05238.627,W,12.9,186,090912,30,W*79
1347204664'))
Then clean them up:
# Remove '#' from the lines of the first file
ll1 <- gsub('#','',ll1)
#Take only the odd lines from the second file
ll2 <- ll2[c(TRUE,FALSE)]
Extract the ID at the start of each line using substr (this takes only the first character, so it assumes single-digit IDs):
ll1 <- substr(ll1,1,1)
ll2 <- substr(ll2,1,1)
Now you have these two character vectors:
> ll1
[1] "1" "2"
> ll2
[1] "1" "2"
To compare them, you can use match, which returns, for each ID in ll1, its position in ll2 (or NA if it is absent):
match(ll1,ll2)
[1] 1 2
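Since substr(x, 1, 1) keeps only one character, IDs of 10 and above would be truncated. The following is a minimal sketch of a regex-based alternative. It assumes that in file1 the ID is the run of digits before the first "-", and that in file2 it sits between "#" and the first ":"; the variable names are illustrative, and the sample rows are inlined rather than read from disk.

```r
# Sample rows in the two formats from the question (inlined instead of readLines())
file1 <- c("1-11!AIVDM,1,1,,B,11b4N?#P?w<tSF0l4Q#>4?wp1`Oo,0*3D",
           "1347204643",
           "2-12$GPRMC,153102,A,6300.774,N,05238.627,W,12.9,186,090912,30,W*79",
           "1347204664")
file2 <- c("#1:1347204643:11!AIVDM,1,1,,B,11b4N?#P?w<tSF0l4Q#>4?wp1`Oo,0*3D",
           "#2:1347204664:12$GPRMC,153102,A,6300.774,N,05238.627,W,12.9,186,090912,30,W*79")

# Keep only the lines that look like "<digits>-..." (this skips the timestamp rows)
id_lines1 <- grep("^[0-9]+-", file1, value = TRUE)
ids1 <- sub("^([0-9]+)-.*$", "\\1", id_lines1)

# In file2 the ID sits between "#" and the first ":"
ids2 <- sub("^#([0-9]+):.*$", "\\1", file2)

# TRUE where an ID from file1 also occurs in file2
in_both <- ids1 %in% ids2
# IDs present in file1 but missing from file2
missing_in_file2 <- setdiff(ids1, ids2)
```

Unlike match, %in% returns a plain TRUE/FALSE per ID, which may be easier to read when the only question is whether each ID exists in the other file.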

Related

How to see the difference in two strings

I'm trying to find the difference between two columns in a CSV file, which I named Test.
I'd like to add a new column called 'Results' that contains the difference between Events_1 & Events_2. If there is no difference the Results can be blank.
This is a basic example of what I'm trying to accomplish; the real list contains hundreds of events in both columns.
Not tested with your data, but
vec2 <- c("hello,goodbye","hello,goodbye")
vec1 <- c("hello","hello,goodbye")
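# [[:space:]] is used rather than \\s, which base R's default regex engine
# does not treat as whitespace inside [ ]
Map(setdiff, strsplit(vec2, "[,[:space:]]+"), strsplit(vec1, "[,[:space:]]+"))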
# [[1]]
# [1] "goodbye"
# [[2]]
# character(0)
If you need them to be comma-delimited strings, then
mapply(function(a,b) paste(setdiff(a,b), collapse=","), strsplit(vec2, "[,[:space:]]+"), strsplit(vec1, "[,[:space:]]+"))
# [1] "goodbye" ""
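To connect this back to the question, here is a rough sketch that writes the differences into a Results column. Only the names Test, Events_1, Events_2 and Results come from the question; the sample rows are made up, and "difference" is assumed to mean events listed in Events_2 but not in Events_1 (swap the arguments of setdiff for the other direction).

```r
# Hypothetical stand-in for the CSV; only the column names come from the question
Test <- data.frame(
  Events_1 = c("hello", "hello,goodbye", "a, b"),
  Events_2 = c("hello,goodbye", "hello,goodbye", "a, b, c"),
  stringsAsFactors = FALSE
)

# Split each cell on commas and/or whitespace
split_events <- function(x) strsplit(x, "[,[:space:]]+")

# Events present in Events_2 but not in Events_1, re-joined with commas;
# rows without a difference get an empty string
Test$Results <- mapply(
  function(a, b) paste(setdiff(b, a), collapse = ","),
  split_events(Test$Events_1),
  split_events(Test$Events_2)
)
Test$Results
```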

How to vectorize a for loop in R for a large dataset

I'm relatively new to R and I have a question about data processing. The main issue is that the dataset is too big, and I want to write a vectorized function that's faster than a for loop, but I don't know how. The data is about movies and user ratings, and is formatted like this:
1:
5,3,2005-09-06
1,5,2005-05-13
3,4,2005-10-19
2:
2,4,2005-12-26
3,3,2004-05-03
5,3,2005-11-17
The 1: and 2: represent movies, while the other lines contain a user ID, the user's rating, and the date of the rating for that movie (in that order from left to right, separated by commas). I want to format the data as an edge list, like this:
Movie | User
1: | 5
1: | 1
1: | 3
2: | 2
2: | 3
2: | 5
I wrote the code below to do this. Basically, for every row, it checks whether it's a movie ID (containing ':') or user data. It then combines the movie ID and user ID as two columns for every movie and user, and row-binds the pair to a new data frame. At the same time, it only binds those users who rated a movie 5 out of 5.
el <- data.frame(matrix(ncol = 2, nrow = 0))
for (i in 1:nrow(data)) {
  if (grepl(':', data[i, ])) {
    # Movie header line: remember the current movie ID
    mid <- data[i, ]
  } else if (grepl(',', data[i, ])) {
    # Rating line: keep it only if the rating is 5
    if (grepl(',5,', data[i, ])) {
      uid <- unlist(strsplit(data[i, ], ','))[1]
      add <- c(mid, uid)
      el <- rbind(el, add)
    }
  }
}
However, I have about 100 million entries, and the for loop runs throughout the night without being able to complete. Is there a way to speed this up? I read about vectorization, but I can't figure out how to vectorize this function. Any help?
You can do this with a few regular expressions, for which I'll use the stringr package, as well as na.locf from the zoo package. (You'll have to install stringr and zoo first).
First we'll set up your data, which it sounds like is in a one-column data frame:
data <- read.table(textConnection("1:
5,3,2005-09-06
1,5,2005-05-13
3,4,2005-10-19
2:
2,4,2005-12-26
3,3,2004-05-03
5,3,2005-11-17
"))
You can then follow these steps (explanation in comments).
# Pull out the column as a character vector for simplicity
lines <- data[[1]]
library(stringr)
# Figure out which lines represent movie IDs, and extract IDs
movie_ids <- str_match(lines, "(\\d+):")[, 2]
# Fill the last observation carried forward (locf), to find out
# the most recent non-NA value
library(zoo)
movie_ids_filled <- na.locf(movie_ids)
# Extract the user IDs
user_ids <- str_match(lines, "(\\d+),")[, 2]
# For each line that has a user ID, match it to the movie ID
result <- cbind(movie_ids_filled[!is.na(user_ids)],
user_ids[!is.na(user_ids)])
This gets the result
     [,1] [,2]
[1,] "1"  "5"
[2,] "1"  "1"
[3,] "1"  "3"
[4,] "2"  "2"
[5,] "2"  "3"
[6,] "2"  "5"
The most important part of this code is the use of regular expressions, particularly the capturing groups in parentheses in "(\\d+):" and "(\\d+),". For more on using str_match with regular expressions, do check out this guide.
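The steps above keep every user, whereas the loop in the question keeps only 5-star ratings. Here is a base R sketch of the same carry-forward idea with that filter added. It assumes the first line is always a movie header and that every other line has the form user,rating,date; the names is_movie, fields, el and el5 are illustrative.

```r
# Same sample lines as above, as a plain character vector
lines <- c("1:", "5,3,2005-09-06", "1,5,2005-05-13", "3,4,2005-10-19",
           "2:", "2,4,2005-12-26", "3,3,2004-05-03", "5,3,2005-11-17")

is_movie <- grepl(":$", lines)

# cumsum() numbers each block of lines by the header that precedes it,
# so indexing the headers with it carries the movie ID forward
movie <- sub(":$", "", lines[is_movie])[cumsum(is_movie)]

# Split the rating lines into user / rating / date fields
fields <- do.call(rbind, strsplit(lines[!is_movie], ",", fixed = TRUE))

el <- data.frame(movie  = movie[!is_movie],
                 user   = fields[, 1],
                 rating = as.integer(fields[, 2]),
                 stringsAsFactors = FALSE)

# Keep only the 5-star ratings, as in the original loop
el5 <- el[el$rating == 5L, c("movie", "user")]
el5
```

Every step works on whole vectors, so nothing grows row by row the way rbind inside a loop does. On the sample data only user 1's rating of movie 1 survives the filter.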

Locate different patterns in a sequence

If I want to find two different patterns in a single sequence, how should I do it?
For example:
seq="ATGCAAAGGT"
the patterns are
pattern=c("ATGC","AAGG")
How am I supposed to find these two patterns simultaneously in the sequence?
I also want to find the locations of these patterns, for example that the patterns are located at 1,4 and 5,8.
Can anyone help me with this ?
Let's say your sequence file is just a vector of sequences:
seq.file <- c('ATGCAAAGGT','ATGCTAAGGT','NOTINTHISONE')
You can search for both motifs, and then return a true / false vector that identifies if both are present using the following one-liner:
grepl('ATGC', seq.file) & grepl('AAGG', seq.file)
[1] TRUE TRUE FALSE
Let's say the vector of sequences is a column within data frame d, which also contains a column of ID values:
id <- c('s1','s2','s3')
d <- data.frame(id,seq.file)
colnames(d) <- c('id','sequence')
You can append a column to this data frame, d, that identifies whether a given sequence matches with this one-liner:
d$match <- grepl('ATGC',d$sequence) & grepl('AAGG', d$sequence)
> print(d)
  id     sequence match
1 s1   ATGCAAAGGT  TRUE
2 s2   ATGCTAAGGT  TRUE
3 s3 NOTINTHISONE FALSE
The following for-loop can return a list of the positions of each of the patterns within the sequence:
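library(stringr)
pattern <- c("ATGC", "AAGG")  # the two patterns from the question
for (i in seq_along(d$sequence)) {
  # One start/end matrix per pattern for this sequence
  out <- str_locate_all(as.character(d$sequence[i]), pattern)
  first <- c(out[[1]])
  first.o <- paste(first[1], first[2], sep = ',')
  second <- c(out[[2]])
  second.o <- paste(second[1], second[2], sep = ',')
  print(c(first.o, second.o))
}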
[1] "1,4" "6,9"
[1] "1,4" "6,9"
[1] "NA,NA" "NA,NA"
You can try using the stringr library to do something like this:
seq = "ATGCAAAGGT"
library(stringr)
str_extract_all(seq, 'ATGC|AAGG')
[[1]]
[1] "ATGC" "AAGG"
Without knowing more specifically what output you are looking for, this is the best I can provide right now.
How about this using stringr to find start and end positions:
library(stringr)
seq <- "ATGCAAAGGT"
pattern <- c("ATGC","AAGG")
str_locate_all(seq, pattern)
#[[1]]
#     start end
#[1,]     1   4
#
#[[2]]
#     start end
#[1,]     6   9
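If stringr is not available, much the same information can be collected with base R. This sketch assumes only the first occurrence of each pattern per sequence is needed (gregexpr would be the call to reach for if all occurrences matter); the names seqs and hits are illustrative.

```r
seqs <- c("ATGCAAAGGT", "ATGCTAAGGT", "NOTINTHISONE")
pattern <- c("ATGC", "AAGG")

# For every pattern, locate its first match in each sequence with regexpr()
hits <- do.call(rbind, lapply(pattern, function(p) {
  m <- regexpr(p, seqs, fixed = TRUE)   # -1 where there is no match
  start <- as.integer(m)                # as.integer() drops the match attributes
  start[start == -1L] <- NA_integer_
  data.frame(sequence = seqs,
             pattern  = p,
             start    = start,
             end      = start + attr(m, "match.length") - 1L,
             stringsAsFactors = FALSE)
}))
hits
```

The result has one row per sequence/pattern pair, with NA in start and end where the pattern does not occur.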

In R, using results of rle (Run Length Encoding) including named row and column headers

I have a large matrix containing companies as row names, months as column names and data for each of the elements. Test data below:
testmatrix<-matrix(c(1,0,0,0,10,5,5,5,5,5,2,2,0,0,0,0,0,1,1,1),nrow=4,ncol=5,byrow=TRUE)
colnames(testmatrix)<-c("Jan","Feb","Mar","Apr","May")
rownames(testmatrix)<-c("Company1","Company2","Company3","Company4")
progression<-apply(testmatrix,1,rle)
progression
The progression object is the output of the rle function applied to each row of the matrix. The result is a list of objects of class 'rle', one per company, each with two components (lengths and values). I would like to:
1. Understand how to output (in R) a 4x3 (row by column) matrix of Company1 as follows:
Hence I'm struggling to understand how to deal with the output provided by progression.
2. Export progression to Excel for further analysis, preferably in the format in (1) above, including column and row headers (in the list output they're referred to as attr(*,"names")).
Your assistance is much appreciated!
This is not particularly elegant but this does the job:
format_rle <- function(rle, rn){
  l <- list(rle$lengths,
            names(rle$lengths),
            rle$values,
            names(rle$values))
  m <- as.matrix(do.call(rbind, l))
  colnames(m) <- NULL
  rownames(m) <- rep(rn, nrow(m))
  m
}
Try format_rle(progression[[1]], "foo") to get the idea:
    [,1]  [,2]  [,3]
foo "1"   "3"   "1"
foo "Feb" "May" ""
foo "1"   "0"   "10"
foo "Jan" "Apr" "May"
Then we apply this function to all elements in progression and save the result to individual csv files named according to the names in progression. You should have a bunch of .csv files in your working directory (getwd() to print it).
for (i in seq_along(progression))
  write.csv(format_rle(progression[[i]], names(progression)[i]),
            file = paste0(names(progression)[i], ".csv"))
Is this what you want?
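As a possible alternative layout, the runs could be collected into one long data frame, which may be easier to open in Excel than one file per company. This is only a sketch: the column names (company, value, length, from, to) and the output file name are made up for illustration.

```r
testmatrix <- matrix(c(1,0,0,0,10, 5,5,5,5,5, 2,2,0,0,0, 0,0,1,1,1),
                     nrow = 4, ncol = 5, byrow = TRUE,
                     dimnames = list(paste0("Company", 1:4),
                                     c("Jan", "Feb", "Mar", "Apr", "May")))

# One row per run: the company, the repeated value, and the months it spans
runs <- do.call(rbind, lapply(rownames(testmatrix), function(co) {
  r <- rle(unname(testmatrix[co, ]))
  last  <- cumsum(r$lengths)        # column index where each run ends
  first <- last - r$lengths + 1L    # column index where each run starts
  data.frame(company = co,
             value   = r$values,
             length  = r$lengths,
             from    = colnames(testmatrix)[first],
             to      = colnames(testmatrix)[last],
             stringsAsFactors = FALSE)
}))
runs
# write.csv(runs, "progression.csv", row.names = FALSE)  # hypothetical file name
```

For Company1 this gives three runs: value 1 for Jan only, value 0 from Feb to Apr, and value 10 for May.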

R: Make character string refer to an object

I have a large list of files (file1, file2, file3, etc.) and, for each analysis, I want to refer to two files from this list (e.g. function(file1,file2)). When I try to do this using paste0("file", pairs[1,x]), I get back the character string "file1" rather than the object file1.
How can I refer to the objects rather than create a character string?
Thank you very much!
Additional comment:
pairs is a 2xn matrix where each column is the combination of files for one analysis (e.g. pairs[1,1] = 1 and pairs[2,1] = 2 for the comparison between file1 and file2).
Are you looking for get()?
> a <- 1:5
> get("a")
[1] 1 2 3 4 5
How to get the variable from a string containing the variable name:
> a = 10
> string = "a"
> string
[1] "a"
> eval(parse(text = string))
[1] 10
> eval(parse(text = "a"))
[1] 10
Hope this helps.
Another alternative:
eval(as.name("file"))
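Applied to the pairs matrix from the question, this might look like the sketch below. The three small data frames and the summing "analysis" are placeholders made up for illustration. The second option, collecting the objects in a named list with mget, is often easier to work with than repeated get calls.

```r
# Hypothetical stand-ins for the loaded files
file1 <- data.frame(x = 1:3)
file2 <- data.frame(x = 4:6)
file3 <- data.frame(x = 7:9)

# Each column of pairs names one comparison, as described in the question
pairs <- combn(3, 2)

# Option 1: look the objects up by name with get()
x <- 1
a <- get(paste0("file", pairs[1, x]))
b <- get(paste0("file", pairs[2, x]))

# Option 2: collect the objects in a named list once, then index by name
files <- mget(paste0("file", 1:3))
results <- lapply(seq_len(ncol(pairs)), function(j) {
  f1 <- files[[paste0("file", pairs[1, j])]]
  f2 <- files[[paste0("file", pairs[2, j])]]
  sum(f1$x) + sum(f2$x)   # placeholder for the real analysis function
})
```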
