I have a text file in the following format: elt1\telt2\t... with 1,000,000 elements.
Most of these elements are integers, but some are of the form number_number or chainOfCharacters. For example, 1\t2\t2_3\t4_44\t2\t'sap'\t34\t'stack' should output: 1 2 2_3 4_44 2 'sap' 34 'stack'. I tried to load this data in R using data <- read.table(file(fileName),row.names=0,sep='\t') but it is taking forever. Is it possible to speed this up?
You should use scan instead:
scan(fileName, character(), quote = "")
# Read 8 items
# [1] "1" "2" "2_3" "4_44" "2" "'sap'" "34" "'stack'"
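To try the scan approach on a small scale, the sketch below writes the example line to a temporary file and reads it back. The temporary file, the explicit sep = "\t", and the integer check with grepl are illustrative assumptions, not part of the original answer.

```r
# Write the example line to a temporary file (stand-in for the real fileName)
tmp <- tempfile(fileext = ".txt")
writeLines("1\t2\t2_3\t4_44\t2\t'sap'\t34\t'stack'", tmp)

# Read every tab-separated element as a character string;
# quote = "" keeps the single quotes around 'sap' and 'stack'
elts <- scan(tmp, what = character(), sep = "\t", quote = "", quiet = TRUE)

# Flag the elements that are plain integers, in case they need converting later
is_int <- grepl("^-?[0-9]+$", elts)
ints <- as.integer(elts[is_int])
```

Reading everything as character avoids the per-column type guessing that read.table performs, which is likely where much of the time goes with a very wide single row.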
I have hundreds of tab-separated tables saved as text files in a folder. I would like to transpose all those tables using R and then export the transposed tables as tab-separated text files. I'm using the following code:
files <- list.files()
for (i in files) {
x <- t(read.table(i, header = TRUE, sep = "\t"))
filename <- paste0("transposed_", i)
write.table(x, file = filename)
}
The above code works perfectly well except that, because the tables being transposed contain character strings, the function t() returns matrices with all values as strings. As a result, the transposed tables exported with write.table have all values within quotation marks "".
So, the question is: how could I transpose dataframes that contain character strings without getting all values converted into character strings? If someone can demonstrate this using the following dataset, I can replicate for my task.
# Hypothetical dataframe
data <- data.frame(dist = 1:5,
time = 6:10,
vel = 11:15,
pos = c("x","y","z","w","k"))
row.names(data) <- c("indA","indB","indC","indD","indE")
data
# dist time vel pos
# indA 1 6 11 x
# indB 2 7 12 y
# indC 3 8 13 z
# indD 4 9 14 w
# indE 5 10 15 k
t(data)
# indA indB indC indD indE
# dist "1" "2" "3" "4" "5"
# time " 6" " 7" " 8" " 9" "10"
# vel "11" "12" "13" "14" "15"
# pos "x" "y" "z" "w" "k"
# Even if I use as.data.frame(t(data)), all values remain as character strings
I've tried several solutions offered in other topics, but none worked. Also, I'd like (if possible) to perform this task using R base functions only.
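One possible base R sketch, not taken from the original thread: a matrix can hold only one type, so t() will always return strings for mixed data, but the quotation marks and the padding can be dropped when writing. The temporary output file and the use of trimws are assumptions made for illustration.

```r
data <- data.frame(dist = 1:5,
                   time = 6:10,
                   vel = 11:15,
                   pos = c("x", "y", "z", "w", "k"),
                   row.names = c("indA", "indB", "indC", "indD", "indE"))

m <- t(data)        # character matrix, numbers padded by format()
m[] <- trimws(m)    # remove the padding, keep the matrix shape

# Hypothetical output path; in the loop this would be paste0("transposed_", i)
out <- tempfile(fileext = ".txt")

# quote = FALSE suppresses the quotation marks; col.names = NA writes an
# empty header cell above the row names so the columns stay aligned
write.table(m, file = out, sep = "\t", quote = FALSE, col.names = NA)

transposed_lines <- readLines(out)
```

The values are still strings inside R; only the exported text file is free of quotes, which is usually what matters when the file is consumed elsewhere.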
crayon is a package for adding color to printed output, e.g.
library(crayon)
message(red('blue'), green('green'), blue('red'))
However, nchar used on its output is wrong:
# should be 4 characters
nchar(red('1234'))
# [1] 14
I tried all the different type= options for nchar, to no avail -- how can I get R to tell me the correct number of characters in this string (4)?
First, note that the output of red is just a plain string:
r = red('1234')
dput(r)
# "\033[31m1234\033[39m"
class(r)
# [1] "character"
The garbled-looking parts (\033[31m and \033[39m) are what are known as ANSI escape codes -- you can think of them here as signalling "start red" and "stop red". The program that turns the character object into printed characters in your terminal is aware of these codes and translates them, but nchar is not. nchar in fact sees 14 characters:
strsplit(r, NULL)[[1L]]
# [1] "\033" "[" "3" "1" "m" "1" "2" "3" "4" "\033" "["
# [12] "3" "9" "m"
To get the 4 we're after, crayon provides a helper function: col_nchar which first applies strip_style to get rid of the ANSI markup, then runs plain nchar:
strip_style(r)
# [1] "1234"
col_nchar(r)
# [1] 4
So you can either do nchar(strip_style(x)) yourself if you find that more readable, or use col_nchar.
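For illustration, the same idea can be sketched in base R without crayon. The helper strip_ansi_sgr below is hypothetical (it is not a crayon function) and only handles color/style codes of the form ESC [ ... m, which is an assumption about the input.

```r
# Hypothetical helper: remove ANSI color/style sequences such as "\033[31m"
strip_ansi_sgr <- function(x) {
  gsub("\033\\[[0-9;]*m", "", x)
}

# The same string that red('1234') produced above, written out literally
r <- "\033[31m1234\033[39m"

raw_count <- nchar(r)                    # counts the escape codes too
clean_count <- nchar(strip_ansi_sgr(r))  # counts only the visible characters
```

Here raw_count is 14 and clean_count is 4. In real code, crayon's own strip_style and col_nchar are the safer choice since they are maintained alongside the styling functions.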
I'm trying to extract values from a vector of strings. Each string in the vector, (there are about 2300 in the vector), follows the pattern of the example below:
"733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
What I'd like is to extract the numbers following the pattern "Sent. " and place them into a separate vector. For the example, I'd like to extract "1311531".
I'm having trouble using gsub to accomplish this.
library(stringr)
Data <- c("PASTE YOUR WHOLE STRING")
# For each "Sent. " entry, capture the digits inside the run of underscores
# (the rating), not the sentence index that directly follows "Sent. "
Matches <- str_match_all(Data, "Sent\\. .*?: _+(\\d+)")[[1]]
# Column 2 holds the captured group; glue the ratings into one string
YourNumbers <- paste(Matches[, 2], collapse = "")
YourNumbers # Returns "1311531" for the example string
We can use str_match_all from stringr to capture the number that follows each "Sent" label (here text holds the input string).
str_match_all(text, "Sent.*?_+(\\d+)")[[1]][, 2]
#[1] "1" "3" "1" "1" "5" "3" "1"
A base R option using strsplit and sub
lapply(strsplit(ss, "\\|"), function(x)
sub("Sent.+: _+(\\d+)_+", "\\1", x[grepl("^Sent", x)]))
#[[1]]
#[1] "1" "3" "1" "1" "5" "3" "1"
Sample data
ss <- "733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
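Since the real input is a vector of about 2300 such strings, the base R idea can be wrapped in a function that returns one collapsed string per element. This is a sketch: extract_ratings and the two short test strings are made up for illustration and only mimic the layout of the real data.

```r
# Hypothetical helper: one string of ratings (e.g. "1311531") per input element
extract_ratings <- function(x) {
  vapply(strsplit(x, "|", fixed = TRUE), function(parts) {
    # Keep only the fields that describe a sentence
    sent <- parts[startsWith(parts, "Sent. ")]
    # Pull the digits out of the underscore run and glue them together
    paste(sub("^Sent\\..*: _+(\\d+)_*$", "\\1", sent), collapse = "")
  }, character(1))
}

# Made-up, shortened strings in the same layout as the real data
demo <- c("1|Sent. 1 (A...): ___2___|Sent. 2 (B...): __5__|NA|",
          "2|Overall: __1__|Sent. 1 (C...): _3_|")
ratings <- extract_ratings(demo)
```

For demo this gives "25" and "3". Splitting on "|" first keeps the regex simple, because each field then contains exactly one rating.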
I would like to remove the constant (shared) parts of a set of strings automatically and retain the variable parts.
For example, I have a column with the following values:
D20181116_Basel-Take1_digital
D20181116_Basel-Take2_digital
D20181116_Basel-Take3_digital
D20181116_Basel-Take4_digital
D20181116_Basel-Take5_digital
D20181116_Basel-Take5a_digital
How can I automatically get the following for any similar column (here by removing "D20181116_Basel-Take" and "_digital")? The code should find the constant parts itself and remove them.
1
2
3
4
5
5a
I hope this is clear. Thank you very much.
You can do it with a regex that keeps only what sits between 'Take' and the next underscore, removing everything before and after:
vec<- c("D20181116_Basel-Take1_digital",
"D20181116_Basel-Take2_digital",
"D20181116_Basel-Take3_digital",
"D20181116_Basel-Take4_digital",
"D20181116_Basel-Take5_digital",
"D20181116_Basel-Take5a_digital")
sub(".*?Take(.*?)_.*", "\\1", vec)
# [1] "1" "2" "3" "4" "5" "5a"
With gsub(), assuming you have a data frame df and want to change the column named column:
df$column <- gsub("^D20181116_Basel-Take","",df$column)
df$column <- gsub("_digital$","",df$column)
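Both answers hard-code the constant text. A sketch of finding it automatically is to strip the longest common prefix and the longest common suffix. The function names below are hypothetical, and the approach assumes the shared parts sit only at the start and end of every string.

```r
# Length of the longest prefix shared by all strings in x
common_prefix_length <- function(x) {
  len <- 0L
  for (i in seq_len(min(nchar(x)))) {
    if (length(unique(substr(x, i, i))) > 1L) break
    len <- i
  }
  len
}

# Reverse each string so that a common suffix becomes a common prefix
reverse_strings <- function(x) {
  vapply(strsplit(x, ""), function(ch) paste(rev(ch), collapse = ""),
         character(1))
}

# Drop the shared prefix first, then the shared suffix of what remains
strip_constant_parts <- function(x) {
  x <- substring(x, common_prefix_length(x) + 1L)
  suffix_len <- common_prefix_length(reverse_strings(x))
  substring(x, 1L, nchar(x) - suffix_len)
}

vec <- c("D20181116_Basel-Take1_digital", "D20181116_Basel-Take2_digital",
         "D20181116_Basel-Take3_digital", "D20181116_Basel-Take4_digital",
         "D20181116_Basel-Take5_digital", "D20181116_Basel-Take5a_digital")
variable_parts <- strip_constant_parts(vec)
```

For vec this yields "1" "2" "3" "4" "5" "5a". One caveat: if the variable parts themselves happen to share leading or trailing characters (say "10" and "11"), those characters are treated as constant and removed too.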
I am writing a small piece of code in which numbers need to be printed horizontally in a text file, generated here as "note.txt":
for(n in 1:4)
{
write.table(n,"note.txt",append = TRUE)
}
I am getting output like
"x"
"1" 1
"x"
"1" 2
"x"
"1" 3
"x"
"1" 4
Whereas I want the output to be:
1 2 3 4
or
1,2,3,4
Please help me.
The paste function in R can combine the elements of a vector into a single string; use its collapse argument to specify the separator.
Suppose the vector v contains the numbers to be printed horizontally:
v <- 1:4
# collapse joins the elements into the single string "1,2,3,4"
write(paste(v, collapse = ","), "note.txt", append = TRUE)
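As a sketch of both requested layouts, the example below writes to temporary files rather than note.txt (an assumption made so it can be run without touching the working directory). It uses cat for the space-separated line and write with ncolumns for the comma-separated one.

```r
v <- 1:4

# Temporary files standing in for "note.txt"
out_space <- tempfile(fileext = ".txt")
out_comma <- tempfile(fileext = ".txt")

# Space-separated: build the line with paste, then write it with a newline
cat(paste(v, collapse = " "), "\n", sep = "", file = out_space)

# Comma-separated: ncolumns = length(v) keeps all values on a single line
write(v, file = out_comma, ncolumns = length(v), sep = ",")

space_line <- readLines(out_space)
comma_line <- readLines(out_comma)
```

The files then contain "1 2 3 4" and "1,2,3,4" respectively. The original loop produced the "x" headers because write.table treats each number as a one-cell table with row and column names.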