I have hundreds of tab-separated tables saved as text files in a folder. I would like to transpose all those tables using R and then export the transposed tables as tab-separated text files. I'm using the following code:
files <- list.files()
for (i in files) {
  x <- t(read.table(i, header = TRUE, sep = "\t"))
  filename <- paste0("transposed_", i)
  write.table(x, file = filename, sep = "\t")
}
The above code works perfectly well except that, because the tables being transposed contain character strings, t() returns matrices with all values coerced to character. As a result, the transposed tables exported with write.table have every value wrapped in quotation marks ("").
So, the question is: how can I transpose data frames that contain character strings without every value ending up converted to (and exported as) a quoted character string? If someone can demonstrate this using the following dataset, I can replicate it for my task.
# Hypothetical dataframe
data <- data.frame(dist = 1:5,
                   time = 6:10,
                   vel = 11:15,
                   pos = c("x","y","z","w","k"))
row.names(data) <- c("indA","indB","indC","indD","indE")
data
# dist time vel pos
# indA 1 6 11 x
# indB 2 7 12 y
# indC 3 8 13 z
# indD 4 9 14 w
# indE 5 10 15 k
t(data)
# indA indB indC indD indE
# dist "1" "2" "3" "4" "5"
# time " 6" " 7" " 8" " 9" "10"
# vel "11" "12" "13" "14" "15"
# pos "x" "y" "z" "w" "k"
# Even if I use as.data.frame(t(data)), all values remain as character strings
I've tried several solutions offered in other topics, but none worked. Also, I'd like (if possible) to perform this task using R base functions only.
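Not a full answer, but one base-R direction worth noting: the quotation marks come from write.table() rather than from t() itself (a transposed mixed-type table can only be a character matrix), so suppressing quoting on export may be all that is needed. A minimal sketch using the hypothetical data above (the output filename is just illustrative):
# Transpose and export without quotation marks: quote = FALSE controls the
# quoting in the file; col.names = NA keeps a blank header over the row names.
x <- t(data)
write.table(x, file = "transposed_data.txt", sep = "\t",
            quote = FALSE, col.names = NA)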
I have strings that contain dots here and there, and I would like to remove them (that part is done). After some other operations (also done), I would like to insert the dots back at their original positions, which is the part that is not done. How could I do that?
library(stringr)
stringOriginal <- c("abc.def","ab.cd.ef","a.b.c.d")
dotIndex <- str_locate_all(pattern ='\\.', stringOriginal)
stringModified <- str_remove_all(stringOriginal, "\\.")
I see that str_sub() may help, for example str_sub(stringModified[2], 3,2) <- "." gets me somewhere, but it is still far from the right place, and also I have no idea how to do it programmatically. Thank you for your time!
Update
stringOriginal <- c("11.123.100","11.123.200","1.123.1001")
stringOriginalF <- as.factor(stringOriginal)
dotIndex <- str_locate_all(pattern ='\\.', stringOriginal)
stringModified <- str_remove_all(stringOriginal, "\\.")
stringNumFac <- sort(as.numeric(stringModified))
stringi::stri_sub(stringNumFac[1:2], 3, 2) <- "."
stringi::stri_sub(stringNumFac[1:2], 7, 6) <- "."
stringi::stri_sub(stringNumFac[3], 2, 1) <- "."
stringi::stri_sub(stringNumFac[3], 6, 5) <- "."
factor(stringOriginal, levels = stringNumFac)
After this manipulation I am able to order the numbers, convert them back to strings, and use them later for plotting.
But since I won't know the positions of the dots in advance, I want to do this programmatically. Another approach to ordering the factor is also welcome, although I am still curious how to programmatically insert a character back into a string at the exact position it originally occupied.
This might be one of the cases for using base R's strsplit, which gives you a list, with a vector of substrings for each entry in your original vector. You can manipulate these with lapply or sapply very easily.
split_string <- strsplit(stringOriginal, "[.]")
#> split_string
#> [[1]]
#> [1] "11" "123" "100"
#>
#> [[2]]
#> [1] "11" "123" "200"
#>
#> [[3]]
#> [1] "1" "123" "1001"
Now you can do this to get the numbers
sapply(split_string, function(x) as.numeric(paste0(x, collapse = "")))
# [1] 11123100 11123200 11231001
And this to put the dots (or any replacement for the dots) back in:
sapply(split_string, paste, collapse = ".")
# [1] "11.123.100" "11.123.200" "1.123.1001"
And you could get the location of the dots within each element of your original vector like this:
lapply(split_string, function(x) cumsum(nchar(x) + 1))
# [[1]]
# [1] 3 7 11
#
# [[2]]
# [1] 3 7 11
#
# [[3]]
# [1] 2 6 11
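On the remaining point of putting a character back at the exact index it occupied before removal, here is a minimal base-R sketch. reinsert_at is just a made-up helper name; it works because inserting at the recorded positions in ascending order restores, at each step, the length the next position expects:
# Hypothetical helper: splice `ch` back into `modified` at the positions
# recorded from the original string (ascending order matters).
reinsert_at <- function(modified, positions, ch = ".") {
  for (p in sort(positions)) {
    modified <- paste0(substr(modified, 1, p - 1), ch,
                       substr(modified, p, nchar(modified)))
  }
  modified
}
stringOriginal <- c("abc.def", "ab.cd.ef", "a.b.c.d")
dotIndex <- gregexpr(".", stringOriginal, fixed = TRUE)   # original dot positions
stringModified <- gsub(".", "", stringOriginal, fixed = TRUE)
mapply(reinsert_at, stringModified, lapply(dotIndex, as.integer), USE.NAMES = FALSE)
# [1] "abc.def"  "ab.cd.ef" "a.b.c.d"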
I'm trying to extract values from a vector of strings. Each string in the vector, (there are about 2300 in the vector), follows the pattern of the example below:
"733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
What I'd like is to extract the numbers following the pattern "Sent. " and place them into a separate vector. For the example, I'd like to extract "1311531".
I'm having trouble using gsub to accomplish this.
library(tidyverse)
Data <- c("PASTE YOUR WHOLE STRING")
str_locate(Data, "Sent. ")
Reference <- str_locate_all(Data, "Sent. ") %>% as.data.frame()
Reference %>% names() #Returns [1] "start" "end"
Reference <- Reference %>% mutate(end = end +1)
YourNumbers <- substr(Data,start = Reference$end[1], stop = Reference$end[1])
for (i in 2:dim(Reference)[1]){
Temp <- substr(Data,start = Reference$end[i], stop = Reference$end[i])
YourNumbers <- paste(YourNumbers, Temp, sep = "")
}
YourNumbers #Returns "1234567"
We can use str_match_all from stringr to get all the numbers that follow "Sent".
str_match_all(ss, "Sent.*?_+(\\d+)")[[1]][, 2]
#[1] "1" "3" "1" "1" "5" "3" "1"
A base R option using strsplit and sub
lapply(strsplit(ss, "\\|"), function(x)
sub("Sent.+: _+(\\d+)_+", "\\1", x[grepl("^Sent", x)]))
#[[1]]
#[1] "1" "3" "1" "1" "5" "3" "1"
Sample data
ss <- "733|Overall (-2 to 2): _________2________________|How controversial is each sentence (1-5)?|Sent. 1 (ANALYSIS BY...): ________1__________|Sent. 2 (Bail is...): ____3______________|Sent. 3 (2) A...): _______1___________|Sent. 4 (3) A...): _______1___________|Sent. 5 (Proposition 100...): _______5___________|Sent. 6 (In 2006,...): _______3___________|Sent. 7 (That legislation...): ________1__________|Types of bias (check all that apply):|Pro Anti|X O Word use (bold, add alternate)|X O Examples (italicize)|O O Extra information (underline)|X O Any other bias (explain below)|Last sentence makes it sound like an urgent matter.|____________________________________________|NA|undocumented, without a visa|NA|NA|NA|NA|NA|NA|NA|NA|"
I need to convert over 100 image names into a format like SITE_T001_L001.jpg, where SITE is the site (CGS1), T is the tube, and L is the image number.
All those images are contained in a single folder named CGS1 (the site), subdivided into folders named according to their tube number. Within each tube folder the images are organised by date, and this order determines the image number: the first one is 1, the second one is 2 (the alphabetical order is not correct).
I found out how to do it manually in R:
file.rename("Snap_029.jpg",
paste("CGS1","T001","L003", ".jpg", sep = "_"))
but is there any way to automate it with a loop?
In more detail, as requested in the answer:
I have this series of input filenames (including the leading path), ordered by date of modification (this is important).
file_list
[1] "CGS1/1/Snap_001.jpg" "CGS1/1/Snap_002.jpg" "CGS1/1/Snap_005.jpg" "CGS1/2/Snap_006.jpg" "CGS1/2/Snap_007.jpg" "CGS1/2/Snap_082.jpg"
I am looking to rename each image using the main folder (CGS1), the subfolder mapped to T001, T002, ..., and the order of modification dates within each subfolder mapped to L001 through L003, giving these output filenames:
new_file_list
[1] "CGS1_T001_L001.jpg" "CGS1_T001_L002.jpg" "CGS1_T001_L003.jpg" "CGS1_T002_L001.jpg" "CGS1_T002_L002.jpg" "CGS1_T002_L003.jpg"
Try this:
file_list <- list.files(path = "...", recursive = TRUE, pattern = "\\.jpg$")
### for testing
file_list <- c(
"CGS1/1/Snap_001.jpg", "CGS1/1/Snap_005.jpg", "CGS1/1/Snap_002.jpg",
"CGS1/2/Snap_006.jpg", "CGS1/2/Snap_007.jpg", "CGS1/2/Snap_0082.jpg"
)
spl <- strsplit(file_list, "[/\\\\]")
# ensure that all files are exactly two levels down
stopifnot(all(lengths(spl) == 3))
m <- do.call(rbind, spl)
m
# [,1] [,2] [,3]
# [1,] "CGS1" "1" "Snap_001.jpg"
# [2,] "CGS1" "1" "Snap_005.jpg"
# [3,] "CGS1" "1" "Snap_002.jpg"
# [4,] "CGS1" "2" "Snap_006.jpg"
# [5,] "CGS1" "2" "Snap_007.jpg"
# [6,] "CGS1" "2" "Snap_0082.jpg"
From this, we'll update the second column to be what you expect (the third is discarded later).
# either one (not both), depending on whether you are guaranteed integers
m[,2] <- sprintf("T%03d", as.integer(m[,2]))
# ... or, if you may have non-numbers, pad element-wise with pmax
m[,2] <- paste0("T", strrep("0", pmax(0, 3 - nchar(m[,2]))), m[,2])
# since we really don't care about 'Snap_001.jpg' (etc), we can discard the third column
new_file_list <- apply(m[,1:2], 1, paste, collapse = "_")
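# At this point, new_file_list contains just the site/tube prefixes:
new_file_list
# [1] "CGS1_T001" "CGS1_T001" "CGS1_T001" "CGS1_T002" "CGS1_T002" "CGS1_T002"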
# back-street way of applying sequences to each CGS/T combination while preserving order
for (prefix in unique(new_file_list)) {
new_file_list[new_file_list == prefix] <- sprintf("%s_L%03d.jpg",
new_file_list[new_file_list == prefix],
seq_len(sum(new_file_list == prefix)))
}
new_file_list
# [1] "CGS1_T001_L001.jpg" "CGS1_T001_L002.jpg" "CGS1_T001_L003.jpg"
# [4] "CGS1_T002_L001.jpg" "CGS1_T002_L002.jpg" "CGS1_T002_L003.jpg"
Now it's a matter of renaming. Note that this will move all files into the current directory.
file.rename(file_list, new_file_list)
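One caveat: list.files() returns names in alphabetical order, whereas the question says the L numbering must follow the date of modification. A possible extra step, before building the new names, is to reorder file_list by folder and then by modification time, for example:
# Reorder by folder, then by modification time (only meaningful for real
# files on disk, not for the hard-coded test vector above).
info <- file.info(file_list)
file_list <- file_list[order(dirname(file_list), info$mtime)]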
This is a complete re-write of my original question in an attempt to clarify it and make it as answerable as possible. My objective is to write a function which takes a string as input and returns the information contained therein in tabular format. Two examples of the kind of character strings the function will face are the following
s1 <- " 9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167***\r"
s2 <- " 10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0\r"
(For those who had read my original question, these are smaller strings for simplicity.)
The required output would be:
Rank Code Name Club Class Time Points
9 9875 Γεωργίου Άγγελος Δημήτρης ΑΒ/Γ Π/Π Β 00:54:05 167
10 8954F Smith John ΔΕΖ N ΔΕΝ ΕΚΚΙΝΗΣΕ 0
I have managed to split the string based on where there's a blank space using:
strsplit(s1, " ")[[1]][strsplit(s1, " ")[[1]] != ""]
although a more elegant solution was given by G. Grothendieck in the comments below using:
unlist(strsplit(trimws(s1), " +"))
This results in
"9" "9875" "Γεωργίου" "Άγγελος" "Δημήτρης" "ΑΒ/Γ" "Π/Π" "Β" "00:54:05" "167***\r"
However, this is still problematic as "Γεωργίου" "Άγγελος" and "Δημήτρης" should be combined into "Γεωργίου Άγγελος Δημήτρης" (note that the number of elements could be two OR three) and the same applies to "Π/Π" "Β" which should be combined into "Π/Π Β".
The question
How can I use the additional information that I have, namely:
1. The order of the elements will always be the same.
2. The Name data will consist of two or three words.
3. The Club data (i.e. ΑΒ/Γ in s1 and ΔΕΖ in s2) will come from a pre-defined list of clubs (e.g. stored in a character vector named sClub).
4. The Class data (i.e. Π/Π Β in s1 and N in s2) will come from a pre-defined list of classes (e.g. stored in a character vector named sClass).
5. The Points data will always contain "\r" and won't contain any spaces.
to produce the required output above?
Defining
sClub <- c("ΑΒ/Γ", "ΔΕΖ")
sClass <- c("Π/Π Β", "N")
we may do
library(stringr)
myfun <- function(s) {
  pattern <- paste0("^\\s*(\\d+)\\s*?(\\w+)\\s*?([\\w ]+)\\s*(",
                    paste(sClub, collapse = "|"), ")\\s*(",
                    paste(sClass, collapse = "|"), ")(.*?)\\s*([^ ]*\r)")
  gsub("\\*", "", trimws(str_match(s, pattern)[, -1]))
}
sapply(list(s1, s2), myfun)
# [,1] [,2]
# [1,] "9" "10"
# [2,] "9875" "8954F"
# [3,] "Γεωργίου Άγγελος Δημήτρης" "Smith John"
# [4,] "ΑΒ/Γ" "ΔΕΖ"
# [5,] "Π/Π Β" "N"
# [6,] "00:54:05" "ΔΕΝ ΕΚΚΙΝΗΣΕ"
# [7,] "167" "0"
The way it works is simply to encode all of your additional information in one long regex; it finishes by erasing the * characters and trimming leading/trailing whitespace.
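To shape that matrix into the requested table, one possible final step (column names taken from the required output above):
res <- as.data.frame(t(sapply(list(s1, s2), myfun)))
names(res) <- c("Rank", "Code", "Name", "Club", "Class", "Time", "Points")
res
# gives the two-row Rank/Code/Name/Club/Class/Time/Points table shown above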
I have a text file in the following format: elt1\telt2\t... with 1,000,000 elements.
Most of these elements are integers, but some of them are of the form number_number or are character strings. For example, 1\t2\t2_3\t4_44\t2\t'sap'\t34\t'stack' should output: 1 2 2_3 4_44 2 'sap' 34 'stack'. I tried to load this data in R using data <- read.table(file(fileName), row.names = 0, sep = '\t'), but it is taking forever. Is it possible to speed this up?
You should use scan instead:
scan(fileName, character(), quote = "")
# Read 8 items
# [1] "1" "2" "2_3" "4_44" "2" "'sap'" "34" "'stack'"