Ignoring specific characters in read table of list

I have the following list:
> str1<-'cor [1] 0.8832846 0.8880517 0.8881286 0.8845148 0.8832846 0.8880517 0.8818238 0.8767492 0.8876672 0.8822851 0.8854375 0.8850531 0.8835153
[14] 0.8832846 0.8908965 0.8803629'
I use the following command:
> df1 <- read.table(text=scan(text=str1, what='', quiet=TRUE), header=TRUE)
However, [1] and [14] are included in df1. What can I change so that every [x] (where x is a number) is ignored?

We can remove the square brackets, together with the numbers inside them, using gsub, and then use scan and read.table as in the OP's post.
read.table(text=scan(text=gsub('\\[\\d+\\]', '', str1),
what='', quiet=TRUE), header=TRUE)
# cor
#1 0.8832846
#2 0.8880517
#3 0.8881286
#4 0.8845148
#5 0.8832846
#6 0.8880517
#7 0.8818238
#8 0.8767492
#9 0.8876672
#10 0.8822851
#11 0.8854375
#12 0.8850531
#13 0.8835153
#14 0.8832846
#15 0.8908965
#16 0.8803629
Or without using scan, as @Richard Scriven mentioned
read.table(text=gsub('\\s+(\\[\\d+\\]\\s+)?', '\n', str1), header=TRUE)
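To see what each step contributes, here is a minimal sketch on a shortened, made-up input (the string x and its values are illustrative only, not the OP's data):

```r
# Shortened stand-in for str1; the values are made up for illustration
x <- "cor [1] 0.1 0.2\n[3] 0.3"

# Step 1: drop every index marker such as [1] or [3]
no_idx <- gsub("\\[\\d+\\]", "", x)

# Step 2: split on whitespace into character tokens
tokens <- scan(text = no_idx, what = "", quiet = TRUE)

# Step 3: each token is read as one line; the first token ("cor") is the header
df <- read.table(text = tokens, header = TRUE)
```

After step 2 the index markers are gone and only the header and the values remain, so read.table sees a single column.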

Related

In R, How to write from a list to file with a set amount of elements on each line?

Let's say I have a list of 23 elements.
ls <- list(1:23)
Which I want to write to a file which has 5 elements on each line, separated by a tab until not possible anymore:
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23
How would I go about doing this? I don't see any options for it in writeLines or write.table.
The code by @akrun works best:
cat(gsub("\\s*((\\d+\\s+){1,4}\\d+)", "\\1\n",
paste(unlist(ls), collapse="\t")), '\n', file = 'file1.txt')
It has one problem with decimal values: the pattern only matches runs of digits, so lines can break at a decimal point, and the resulting file1.txt looks like this:
0.0005862 0.0005983 0.0006225 0.0006637 0
.0006622 0.0006197 0.000599 0.0005983 0
.0006247 0.0006707 0.0006641 0.0006253 0
.0006087 0.0006234 0.0006807 0.0007485 0
.0007546 0.0007 0.000643 0.0006183 0
.0006264 0.0006819 0.000697 0.0006453 0

It can be done with cat and gsub: unlist the list, paste the elements into a single string, insert a newline (\n) after every block of up to 5 whitespace-separated numbers, and use cat to print the result to the console
cat(gsub("\\s*((\\d+\\s+){1,4}\\d+)", "\\1\n",
paste(unlist(ls), collapse="\t")), '\n')
#1 2 3 4 5
#6 7 8 9 10
#11 12 13 14 15
#16 17 18 19 20
#21 22 23
or write to a file
cat(gsub("\\s*((\\d+\\s+){1,4}\\d+)", "\\1\n",
paste(unlist(ls), collapse="\t")), '\n', file = 'file1.txt')
If the data are more complex (scientific notation etc.), we could split the vector into a list of chunks, pad the shorter chunks with NA at the end, and write the result line by line
v1 <- unlist(ls)
lst1 <- split(v1, (seq_along(v1)-1) %/% 5 + 1)  # chunks of 5 elements
mat1 <- do.call(rbind, lapply(lst1, `length<-`, max(lengths(lst1))))
# write() fills by column, so transpose to keep each chunk on one line
write(t(mat1), 'file2.txt', ncolumns = 5, sep = '\t')
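Because the regex approach only matches runs of digits, it cannot handle decimals. A sketch that avoids pattern matching altogether, assuming the list flattens to a plain numeric vector, is to let base R's write() do the line breaking through its ncolumns argument. The values are the first seven from the output shown in the question, and the temporary file is used only to keep the example self-contained:

```r
# First seven values from the question's decimal example, wrapped in a list
vals <- list(c(0.0005862, 0.0005983, 0.0006225, 0.0006637,
               0.0006622, 0.0006197, 0.000599))

# Temporary output file (illustrative; use any path you like)
out <- tempfile(fileext = ".txt")

# Five tab-separated values per line; the last line holds the remainder
write(unlist(vals), file = out, ncolumns = 5, sep = "\t")

written <- readLines(out)
```

Here written has two lines, the first with five values and the second with the remaining two, and no padding with NA is needed.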

You first need to define the chunks. I used BBmisc, which has a chunk function, to obtain chunks of N elements (five in your case).
Then you can use write.table, which has an append option.
library(BBmisc)
x <- list(1:20)
n <- 5
splitted <- chunk(x[[1]], chunk.size = n)
for (i in seq_along(splitted)) {
  # join one chunk with tabs and append it to the file as a single line
  line <- paste(splitted[[i]], collapse = "\t")
  write.table(line, file = "output.txt", sep = "\t",
              row.names = FALSE, col.names = FALSE, quote = FALSE, append = TRUE)
}
Regards
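If installing BBmisc is not an option, the same chunking can be sketched in base R with split(). The output path below is a temporary file chosen for illustration:

```r
x <- list(1:20)
n <- 5
v <- x[[1]]

# Group index 1,1,1,1,1,2,2,... assigns every n consecutive elements to one chunk
chunks <- split(v, ceiling(seq_along(v) / n))

# Collapse each chunk into one tab-separated line
lines <- vapply(chunks, paste, character(1), collapse = "\t")

# Write all lines in one call instead of appending inside a loop
out <- tempfile(fileext = ".txt")
writeLines(lines, out)
```

Writing everything with a single writeLines call also avoids the problem of appending to a file left over from an earlier run.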

splitting variable product code into letters and numbers

I have a product code variable like:
Product Code
RMMI001,
RMMI001,
CMCM009,
ASCMOT064,
ASPMOA023,
CMCM009,
CMCM012,
CMCM001,
ASCMBW001,
RMMI001,
TMHO002,
TMSP001,
TMHO002,
TMDMST003
I need to split these codes and put the letter part in another column.

You may try using sub here to remove all trailing numbers, leaving you with the character portion:
df <- data.frame(product_code=c("RMMI001", "RMMI001", "CMCM009"))
df$code <- sub("\\d*$", "", df$product_code)
df
product_code code
1 RMMI001 RMMI
2 RMMI001 RMMI
3 CMCM009 CMCM

What about something like this?
# Sample product codes
ss <- c("RMMI001", "RMMI001", "CMCM009", "ASCMOT064", "ASPMOA023", "CMCM009", "CMCM012", "CMCM001", "ASCMBW001", "RMMI001", "TMHO002", "TMSP001", "TMHO002", "TMDMST003")
# Separate code and numbers and store in data.frame
read.csv(text = gsub("^([a-zA-Z]+)(\\d+)$", "\\1,\\2", ss), header = F)
# V1 V2
#1 RMMI 1
#2 RMMI 1
#3 CMCM 9
#4 ASCMOT 64
#5 ASPMOA 23
#6 CMCM 9
#7 CMCM 12
#8 CMCM 1
#9 ASCMBW 1
#10 RMMI 1
#11 TMHO 2
#12 TMSP 1
#13 TMHO 2
#14 TMDMST 3
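Note that read.csv parses the second column as numbers, so leading zeros are dropped (001 becomes 1). If the digits need to stay exactly as written, a base R sketch along these lines keeps both parts as character strings. The column names code and number are made up for illustration:

```r
ss <- c("RMMI001", "ASCMOT064", "TMDMST003")

parts <- data.frame(
  product_code = ss,
  code   = sub("[0-9]+$", "", ss),     # strip the trailing digits
  number = sub("^[A-Za-z]+", "", ss),  # strip the leading letters
  stringsAsFactors = FALSE
)
```

This assumes every code is letters followed by digits, as in the sample data.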

You can use tidyr::extract as well; it works on data frames only.
tidyr::extract(data.frame(x =c("RMMI001", "CMCM009")),x, c("first", "second"), "([a-zA-Z]+)(\\d+)" )
Output:
# first second
#1 RMMI 001
#2 CMCM 009
This extracts the letters and the digits into separate columns. If you use "([a-zA-Z]+)\\d+" instead of "([a-zA-Z]+)(\\d+)", only the letters are extracted, as shown below. The difference is the capturing group, written with parentheses: each group captures its part of the match into its own column, so one group gives one column and two groups give two.
tidyr::extract(data.frame(x =c("RMMI001", "CMCM009")),x, c("first"), "([a-zA-Z]+)\\d+" )
# first
# 1 RMMI
# 2 CMCM

Load all R data files from specific folder

I've got a lot of Rdata files which I want to combine in one dataframe.
My files, as an example, are:
file1.RData
file2.RData
file3.RData
All the data files have the structure datafile$a and datafile$b. From each of the files above I would like to take the variable $a and add it to an already existing dataframe called md. My problem isn't loading the files into the global environment, but processing the data in the RData file.
My code so far, which obviously doesn't work.
library(dplyr)
files <- list.files("correct directory", pattern="*.RData")
This returns the correct list of files.
I also know I need to lapply over a function.
lapply(files, myFun)
My problem is in the function. What I've got at the moment:
myFun <- function(files) {
load(files)
df <- data.frame(datafile$a)
md <- bind_rows(md, df)
}
The code above doesn't work. Any idea how I can get this to work?

Try
library(dplyr)
bind_rows(lapply(files, myFun))
# a
#1 1
#2 2
#3 3
#4 4
#5 5
#6 1
#7 2
#8 3
#9 4
#10 5
#11 6
#12 7
#13 8
#14 9
#15 10
#16 11
#17 12
#18 13
#19 14
#20 15
where
myFun <- function(files) {
load(files)
df <- data.frame(a= datafile$a)
}
data
datafile <- data.frame(a=1:5, b=6:10)
save(datafile, file='file1.RData')
datafile <- data.frame(a=1:15, b=16:30)
save(datafile, file='file2.RData')
files <- list.files(pattern='file\\d+\\.RData$')
files
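The original myFun fails because md <- bind_rows(md, df) inside the function only creates a local md; the md outside the function is never changed. Returning the data frame and combining the results afterwards avoids this. The sketch below makes that explicit in base R. read_a is a hypothetical helper, the temporary files stand in for the real ones, and it assumes every file contains an object named datafile:

```r
# Hypothetical helper: load one file into its own environment and return column a
read_a <- function(path) {
  env <- new.env()
  load(path, envir = env)          # objects land in env, not the global environment
  data.frame(a = env$datafile$a)
}

# Stand-in files, created only so the example runs on its own
paths <- file.path(tempdir(), c("file1.RData", "file2.RData"))
datafile <- data.frame(a = 1:2, b = 3:4); save(datafile, file = paths[1])
datafile <- data.frame(a = 3:5, b = 6:8); save(datafile, file = paths[2])

# Existing data frame to extend (illustrative content)
md <- data.frame(a = 0L)

# Combine md with the pieces returned by the helper
md <- do.call(rbind, c(list(md), lapply(paths, read_a)))
```

Loading into a fresh environment also keeps each file's datafile from overwriting objects of the same name elsewhere.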

Stack columns row by row

I have a dataframe which contains 2 columns, such as
Name Seq
1 ENSE00000789668:ENSE00000789668 CTCAAAATTTGCTGCAGCAGAAATTACTGAGGCGATCCATTTTCTCAGCCTATTAAATTTC
2 ENSE00000789668:ENSE00000814448 CTCAAAATTTGCTGCAGCAGAAATTACTGAGTTTCAGCGGATGTTCTCTCCAGCTTTCAAC
3 ENSE00000789668:ENSE00000814452 CTCAAAATTTGCTGCAGCAGAAATTACTGAGGTTTTGCTGGGCCTGCGTGATACTAGCGAT
4 ENSE00000789668:ENSE00001021870 CTCAAAATTTGCTGCAGCAGAAATTACTGAGTGTCCCGTTTCCGGACCCGTCTCTATGGTG
5 ENSE00000789668:ENSE00001316145 CTCAAAATTTGCTGCAGCAGAAATTACTGAGATTCTCCTATGTGTGTCGTCTGCAGCCATC
6 ENSE00000789668:ENSE00001445604 CTCAAAATTTGCTGCAGCAGAAATTACTGAGCTGCTTGGCTTTGAGGAAGAGTGGCAGTAC
I wish to stack one column onto another, row by row, to give:
ENSE00000789668:ENSE00000789668
CTCAAAATTTGCTGCAGCAGAAATTACTGAGGCGATCCATTTTCTCAGCCTATTAAATTTC
ENSE00000789668:ENSE00000814448
CTCAAAATTTGCTGCAGCAGAAATTACTGAGTTTCAGCGGATGTTCTCTCCAGCTTTCAAC
ENSE00000789668:ENSE00000814452
CTCAAAATTTGCTGCAGCAGAAATTACTGAGGTTTTGCTGGGCCTGCGTGATACTAGCGAT
ENSE00000789668:ENSE00001021870
CTCAAAATTTGCTGCAGCAGAAATTACTGAGTGTCCCGTTTCCGGACCCGTCTCTATGGTG
ENSE00000789668:ENSE00001316145
CTCAAAATTTGCTGCAGCAGAAATTACTGAGATTCTCCTATGTGTGTCGTCTGCAGCCATC
ENSE00000789668:ENSE00001445604
CTCAAAATTTGCTGCAGCAGAAATTACTGAGCTGCTTGGCTTTGAGGAAGAGTGGCAGTAC
How do I do this?

You can try
data.frame(Col1=c(t(df)))
# Col1
#1 ENSE00000789668:ENSE00000789668
#2 CTCAAAATTTGCTGCAGCAGAAATTACTGAGGCGATCCATTTTCTCAGCCTATTAAATTTC
#3 ENSE00000789668:ENSE00000814448
#4 CTCAAAATTTGCTGCAGCAGAAATTACTGAGTTTCAGCGGATGTTCTCTCCAGCTTTCAAC
#5 ENSE00000789668:ENSE00000814452
#6 CTCAAAATTTGCTGCAGCAGAAATTACTGAGGTTTTGCTGGGCCTGCGTGATACTAGCGAT
#7 ENSE00000789668:ENSE00001021870
#8 CTCAAAATTTGCTGCAGCAGAAATTACTGAGTGTCCCGTTTCCGGACCCGTCTCTATGGTG
#9 ENSE00000789668:ENSE00001316145
#10 CTCAAAATTTGCTGCAGCAGAAATTACTGAGATTCTCCTATGTGTGTCGTCTGCAGCCATC
#11 ENSE00000789668:ENSE00001445604
#12 CTCAAAATTTGCTGCAGCAGAAATTACTGAGCTGCTTGGCTTTGAGGAAGAGTGGCAGTAC
Or
library(reshape2)
melt(t(df))[3]
Or maybe this too
data.frame(Col1=as.matrix(df)[c(matrix(seq(prod(dim(df))), nrow=2, byrow=TRUE))])
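The first approach works because t(df) turns the data frame into a character matrix with one column per original row, and c() reads a matrix column by column, which interleaves Name and Seq. A small sketch with made-up values, which also writes the stacked result to a temporary file:

```r
# Made-up two-row stand-in for the data frame in the question
df <- data.frame(Name = c("id1", "id2"),
                 Seq  = c("AAA", "CCC"),
                 stringsAsFactors = FALSE)

m <- t(df)        # 2 x 2 character matrix: row 1 = Name, row 2 = Seq
stacked <- c(m)   # column-major order: Name, Seq, Name, Seq

# One element per line (illustrative output path)
out <- tempfile(fileext = ".txt")
writeLines(stacked, out)
```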

R reading values of numeric field in file wrongly

R is reading the values from a file wrongly. One can check if this statement is true with the following example:
(1) Copy and paste the following 10 numbers into a test file (sample.csv)
1000522010609612
1000522010609613
1000522010609614
1000522010609615
1000522010609616
1000522010609617
971000522010609612
1501000522010819466
971000522010943717
1501000522010733490
(2) Read these contents into R using read.csv
X <- read.csv("./sample.csv", header=FALSE)
(3) Print the output
print(head(X, n=10), digits=22)
The output I got was
V1
1 1000522010609612.000000
2 1000522010609613.000000
3 1000522010609614.000000
4 1000522010609615.000000
5 1000522010609616.000000
6 1000522010609617.000000
7 971000522010609664.000000
8 1501000522010819584.000000
9 971000522010943744.000000
10 1501000522010733568.000000
The problem is that rows 7,8,9,10 are not correct (check the sample 10 numbers that we considered before).
What could be the problem? Is there some setting that I am missing with my R - terminal?

You could try
library(bit64)
x <- read.csv('sample.csv', header=FALSE, colClasses='integer64')
x
# V1
#1 1000522010609612
#2 1000522010609613
#3 1000522010609614
#4 1000522010609615
#5 1000522010609616
#6 1000522010609617
#7 971000522010609612
#8 1501000522010819466
#9 971000522010943717
#10 1501000522010733490
If bit64 is loaded, you can also try fread from data.table
library(data.table)
x1 <- fread('sample.csv')
x1
# V1
#1: 1000522010609612
#2: 1000522010609613
#3: 1000522010609614
#4: 1000522010609615
#5: 1000522010609616
#6: 1000522010609617
#7: 971000522010609612
#8: 1501000522010819466
#9: 971000522010943717
#10: 1501000522010733490
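The reason read.csv got rows 7 to 10 wrong is that it stores such values as double-precision numbers. A double has a 53-bit significand, so not every integer above 2^53 (9007199254740992) can be represented exactly. The first six values in the file are below that limit and the last four are above it. A base R sketch of the effect:

```r
limit <- 2^53

# Number of significand bits in a double on this machine
bits <- .Machine$double.digits

# Above the limit, adding 1 can be lost to rounding
same <- (limit + 1) == limit

# Below the limit, neighbouring integers are still distinct
below_ok <- (limit - 1) != limit

# One value from each group in the question
small <- as.numeric("1000522010609612") < limit
large <- as.numeric("971000522010609612") > limit
```

integer64 from bit64 avoids this by storing the value as a 64-bit integer instead of a double.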