I have a very large data frame with SNPs in rows (~50,000) and IDs in columns (~500); an extract looks something like this:
R015 R016 R007
cg158 0.81 0.90 0.87
cg178 0.91 0.80 0.58
Now I want to save this as a txt file, normally no problem with
write.table(example, "example.txt", col.names = TRUE, row.names = TRUE, quote = FALSE)
BUT I need a tab (\t) as the first entry of the header row, so in the txt file the data frame should look something like:
\t R015 R016 R007
cg158 0.81 0.90 0.87
cg178 0.91 0.80 0.58
(\t for the tab)
Can anyone tell me how to do this?
By the way, I also tried:
write.table(data.frame("\t" = rownames(example), example), "example.txt", row.names = FALSE)
It did not work, unfortunately...
Thanks!
This kind of works; just replace stdout() with the path to your output file:
data <- data.frame(x = sample(1:100, 3),
                   y = sample(1:100, 3),
                   z = sample(1:100, 3))
row.names(data) <- LETTERS[1:3]
lines <- c(paste(c(' ', names(data)), collapse = '\t'),
           sapply(seq_len(nrow(data)),
                  function(i) {
                    paste(c(row.names(data)[i], data[i, ]), collapse = '\t')
                  }))
writeLines(lines, con = stdout())
#> x y z
#> A 35 97 27
#> B 12 69 24
#> C 25 9 34
Or with spaces as separators and the tab you wished for in the first column:
data <- data.frame(x = sample(1:100, 3),
                   y = sample(1:100, 3),
                   z = sample(1:100, 3))
row.names(data) <- LETTERS[1:3]
lines <- c(paste(c('\t', names(data)), collapse = ' '),
           sapply(seq_len(nrow(data)),
                  function(i) {
                    paste(c(row.names(data)[i], data[i, ]), collapse = ' ')
                  }))
writeLines(lines, con = stdout())
#> x y z
#> A 3 30 11
#> B 62 69 70
#> C 93 55 73
Using a data frame like the following, where I've changed one row name to illustrate how to deal with cases of unequal length:
df <- read.table(text = "R015 R016 R007
cg158 0.81 0.90 0.87
cg178kdfj 0.91 0.80 0.58")
You could do something like this:
df <- format(as.matrix(df))
df <- cbind("\\t" = rownames(df), df)
df <- rbind(colnames(df), df)
df[,1] <- stringr::str_pad(df[,1], max(nchar(df[,1])), "right")
write.table(df,
            file = "example.txt",
            sep = " ",
            quote = FALSE,
            row.names = FALSE,
            col.names = FALSE)
Output:
\t        R015 R016 R007
cg158     0.81 0.90 0.87
cg178kdfj 0.91 0.80 0.58
I first converted the numeric values to character and formatted them so they all have the same number of digits; otherwise they won't line up. Then I turn the row names into a new variable named \\t, and turn the column names into a new row. I use stringr::str_pad() to account for row names of differing lengths. Finally, I write the data frame to a TXT file without row or column names.
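For completeness, a minimal sketch of a third approach that writes a real tab character (rather than the literal string \t) in the header: write the header line yourself with cat() on an open connection, then append the body with write.table(). This assumes a data frame named example as in the question:
con <- file("example.txt", "w")
# header row: a tab, then the column names, space-separated
cat(paste(c("\t", colnames(example)), collapse = " "), "\n", file = con, sep = "")
# body: row names in the first column, no quoting, no header
write.table(example, con, col.names = FALSE, row.names = TRUE, quote = FALSE)
close(con)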
Related
I have raw, messy data for a time series containing around 1400 observations. Here is a snippet of what it looks like:
[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null] ... etc
I want to pull the date and its respective value to form a tsibble in R. So, from the above values, it would be like
Date         y-variable
2021-08-24   1.67
2021-08-23   1.65
2021-08-22   1.62
Notice how only the first value is to be paired with its respective date; I don't need the other values. Right now, the raw data has been copied and pasted into a Word document and I am unsure how to approach the data wrangling needed to import it into R.
How could I achieve this?
#replace the text connection with a file connection if desired; the file should then be a .txt
input <- readLines(textConnection("[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"))
#insert line breaks
input <- gsub("],[", "\n", input, fixed = TRUE)
#remove "new Date"
input <- gsub("new Date", "", input, fixed = TRUE)
#remove parentheses and brackets
input <- gsub("[\\(\\)\\[\\]]", "", input, perl = TRUE)
#import cleaned data
DF <- read.csv(text = input, header = FALSE, quote = "'")
DF$V1 <- as.Date(DF$V1)
print(DF)
# V1 V2 V3 V4 V5
#1 2021-08-24 1.67 1.68 0.9 null
#2 2021-08-23 1.65 1.68 0.9 null
#3 2021-08-22 1.62 1.68 0.9 null
How is this?
text <- "[new Date('2021-08-24'),1.67,1.68,0.9,null],[new Date('2021-08-23'),1.65,1.68,0.9,null],[new Date('2021-08-22'),1.62,1.68,0.9,null]"
df <- read.table(text = unlist(strsplit(gsub('new Date\\(|\\)', '',
                                             gsub('^.(.*).$', '\\1', text)),
                                        "].\\[")),
                 sep = ",")
> df
V1 V2 V3 V4 V5
1 2021-08-24 1.67 1.68 0.9 null
2 2021-08-23 1.65 1.68 0.9 null
3 2021-08-22 1.62 1.68 0.9 null
Changing the column names and removing the unwanted columns is trivial from this point; a sketch follows below.
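For instance, a minimal sketch of those last steps, going on to the tsibble asked for in the question (this assumes the tsibble package is installed):
library(tsibble)
df <- df[, 1:2]                 # keep only the date and the first value
names(df) <- c("Date", "y")
df$Date <- as.Date(df$Date)
ts_df <- as_tsibble(df, index = Date)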
I have some results from a model in Python which I have saved as a .txt to render in RMarkdown.
The .txt looks like this:
             precision    recall  f1-score   support
          0       0.71      0.83      0.77      1078
          1       0.76      0.61      0.67       931
avg / total       0.73      0.73      0.72      2009
I read the file into R as
x <- read.table(file = 'report.txt', fill = T, sep = '\n')
but R stores the results in one column (V1) instead of 5 columns, as below:
V1
1 precision recall f1-score support
2 0 0.71 0.83 0.77 1078
3 1 0.76 0.61 0.67 931
4 avg / total 0.73 0.73 0.72 2009
I tried using strsplit() to split the columns, but it doesn't work:
strsplit(as.character(x$V1), split = "|", fixed = T)
Maybe strsplit() is not the right approach? How do I get around this so that I have a 4x5 data frame?
Thanks a lot.
Not very elegant, but this works. First we read the raw text, then we use regex to clean up, delete whitespace, and convert to a CSV-readable format. Then we read the CSV.
library(stringr)
library(magrittr)
library(purrr)
text <- str_replace_all(readLines("~/Desktop/test.txt"), "\\s(?=/)|(?<=/)\\s", "") %>%
  .[which(nchar(.) > 0)] %>%
  str_split(pattern = "\\s+") %>%
  map(., ~ paste(.x, collapse = ",")) %>%
  unlist
read.csv(textConnection(text))
#> precision recall f1.score support
#> 0 0.71 0.83 0.77 1078
#> 1 0.76 0.61 0.67 931
#> avg/total 0.73 0.73 0.72 2009
Created on 2018-09-20 by the reprex package (v0.2.0).
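For comparison, a hedged base-R sketch of the same idea: collapse "avg / total" into a single token, then let read.table() do the splitting. Because the header row has one fewer field than the data rows, read.table() uses the first column as row names, which should give something like:
txt <- gsub("avg / total", "avg/total", readLines("report.txt"), fixed = TRUE)
read.table(text = txt, header = TRUE)
#>           precision recall f1.score support
#> 0              0.71   0.83     0.77    1078
#> 1              0.76   0.61     0.67     931
#> avg/total      0.73   0.73     0.72    2009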
Since it is much simpler to have Python output the CSV, I am posting an alternative here, in case it is useful; even in Python it needs some work.
import pandas as pd

def report_to_csv(report, title):
    report_data = []
    lines = report.split('\n')
    # loop through the per-class lines
    for line in lines[2:-3]:
        if not line.strip():
            continue  # skip blank lines inside the report
        row = {}
        row_data = line.split()
        row['class'] = row_data[0]
        row['precision'] = float(row_data[1])
        row['recall'] = float(row_data[2])
        row['f1_score'] = float(row_data[3])
        row['support'] = float(row_data[4])
        report_data.append(row)
    df = pd.DataFrame.from_dict(report_data)
    # read the final summary line ('avg / total' plus four numbers)
    line_data = lines[-2].split()
    summary_dat = []
    row2 = {}
    row2['class'] = ' '.join(line_data[:-4])
    row2['precision'] = float(line_data[-4])
    row2['recall'] = float(line_data[-3])
    row2['f1_score'] = float(line_data[-2])
    row2['support'] = float(line_data[-1])
    summary_dat.append(row2)
    summary_df = pd.DataFrame.from_dict(summary_dat)
    # concatenate both data frames
    report_final = pd.concat([df, summary_df], axis=0)
    report_final.to_csv(title + 'cm_report.csv', index=False)
Function inspired by this solution.
I'm having trouble figuring out how to import my data, which has multiple delimiters. The following is what my computer automatically saves into a text file. The issue is that some of the results are printed with differently spaced delimiters: some of the delimiters are colons (:) and others are runs of spaces of inconsistent length.
Each letter (B: to Z:) codes for some unique variable. For example:
B: Number of responses
C: Number of seconds, etc.
However, below "Z: 0.000" the layout changes: there the variables get subscripted. So,
A:
0: value1 value2 value3 value4
is referenced as:
A(0) = value1 (e.g. number of responses in the first trial)
A(1) = value2 (e.g. number of responses in the second trial)
A(2) = value3 (e.g. number of responses in the third trial)
A(3) = value4 (e.g. number of responses in the fourth trial)
Here, there are 4 "A" variables that each can carry unique values too.
Example of Text File:
Start Date: 05/20/18
End Date: 05/20/18
Subject: 1
Start Time: 16:23:11
End Time: 17:26:24
B: 7.000
C: 12000.000
D: 9500.000
E: 1.000
Q: 203.000
T: 1200.100
U: 218.000
W: 7.000
X: 347.000
Y: 0.000
Z: 0.000
A:
0: 1.000 0.000 0.000 0.000
F:
0: 11500.000 9500.000 13500.000 7500.000 15500.000
5: 5500.000 17500.000
I've tried a few methods, but they get stuck because of the multiple-delimiter issue. Let's assume "data" is the text file.
# This is the closest - some of the values are still not separated properly
temp <- read.delim2(file = "data", quote = ":", sep = "")
# This one separates the information mostly correctly, but only for the top half
temp <- read.delim2(file = "data", sep = ":")
I eventually want a dataframe with labels in one column (StartDate, A(0), B, etc.) and values in the other (05/20/2018, 1, 7).
library(dplyr)
library(splitstackshape)
#read file
txt <- readLines("test.txt")
#Fix 'A:' rows
A_idx <- grep("A:", txt)
txt[A_idx] <- paste0(txt[A_idx], gsub("0:\\s+", "", txt[A_idx+1]))
txt <- txt[-(A_idx+1)]
#Fix 'F:' rows
F_idx <- grep("F:", txt)
txt[F_idx] <- paste0(txt[F_idx], paste(gsub("0:\\s+", "", txt[F_idx+1]),
                                       gsub("5:\\s+", "", txt[F_idx+2])))
txt <- txt[-c(F_idx+1, F_idx+2)]
Now txt is in DCF format, so it can be read using read.dcf.
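For intuition, here is a minimal illustration of read.dcf() on a hypothetical two-record input (a blank line separates records; each Field: value pair becomes a column):
demo <- c("Subject: 1", "B: 7.000",
          "",
          "Subject: 2", "B: 8.000")
read.dcf(textConnection(demo))
#>      Subject B
#> [1,] "1"     "7.000"
#> [2,] "2"     "8.000"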
df <- data.frame(read.dcf(textConnection(txt)), stringsAsFactors = F) %>%
  cSplit("A", " ") %>%
  cSplit("F", " ")
Output is:
df
Start.Date End.Date Subject Start.Time End.Time B C D E Q T
1: 05/20/18 05/20/18 1 16:23:11 17:26:24 7.000 12000.000 9500.000 1.000 203.000 1200.100
U W X Y Z A_1 A_2 A_3 A_4 F_1 F_2 F_3 F_4 F_5 F_6 F_7
1: 218.000 7.000 347.000 0.000 0.000 1 0 0 0 11500 9500 13500 7500 15500 5500 17500
Sample data: test.txt contains
Start Date: 05/20/18
End Date: 05/20/18
Subject: 1
Start Time: 16:23:11
End Time: 17:26:24
B: 7.000
C: 12000.000
D: 9500.000
E: 1.000
Q: 203.000
T: 1200.100
U: 218.000
W: 7.000
X: 347.000
Y: 0.000
Z: 0.000
A:
0: 1.000 0.000 0.000 0.000
F:
0: 11500.000 9500.000 13500.000 7500.000 15500.000
5: 5500.000 17500.000
Start Date: 05/20/18
End Date: 05/20/18
... another block of data
Edit: if you want the A and F column indices to start from 0:
#read DCF data (i.e 'txt') using read.dcf
df <- data.frame(read.dcf(textConnection(txt)), stringsAsFactors = F)
#convert column A into wide format by splitting it into multiple columns
A_df <- data.frame(do.call(rbind, strsplit(as.character(df$A),'\\s+')), stringsAsFactors = F)
colnames(A_df) <- paste("A", sequence(ncol(A_df))-1, sep = "_")
#convert column F into wide format by splitting it into multiple columns
F_df <- data.frame(do.call(rbind, strsplit(as.character(df$F),'\\s+')), stringsAsFactors = F)
colnames(F_df) <- paste("F", sequence(ncol(F_df))-1, sep = "_")
#final data
final_df <- cbind(df[, !names(df) %in% c("A", "F")], A_df, F_df)
which gives
final_df
# Start.Date End.Date Subject Start.Time End.Time B C D E Q T U
#1 05/20/18 05/20/18 1 16:23:11 17:26:24 7.000 12000.000 9500.000 1.000 203.000 1200.100 218.000
# W X Y Z A_0 A_1 A_2 A_3 F_0 F_1 F_2 F_3 F_4
#1 7.000 347.000 0.000 0.000 1.000 0.000 0.000 0.000 11500.000 9500.000 13500.000 7500.000 15500.000
# F_5 F_6
#1 5500.000 17500.000
The good news is that your file does NOT have different delimiters. It is "Debian Control File" (DCF) format, in which leading whitespace marks continuation lines; see ?read.dcf. Unfortunately, I cannot figure out whether there is a way to parse DCF including the semantics of continuation lines. But what the heck: once the data is in R, you can just clean it with library(tidyr).
x <- read.dcf("yoursourcefilename.txt")
y <- as.data.frame(x) # read.dcf reads in as matrix
z <- y %>%
  separate("A", into = c("drop", "A0"), sep = "0:") %>%
  separate("A0", into = c("drop", paste0("A0_val_", 1:4)), sep = "\\s{2,}") %>%
  separate("F", into = c("drop", "F0"), sep = "0:") %>%
  separate("F0", into = c("F0", "F5"), sep = "5:") %>%
  separate("F0", into = c("drop", paste0("F0_val_", 1:5)), sep = "\\s{2,}") %>%
  separate("F5", into = c("drop", paste0("F5_val_", 1:2)), sep = "\\s{2,}") %>%
  select(-drop) %>% t() %>% as.data.frame()
z$V1 <- trimws(z$V1) # clean whatever whitespace is left
This will yield you a long dataframe:
dim(z)
[1] 27 1
Like so:
> z
V1
Start Date 05/20/18
End Date 05/20/18
Subject 1
Start Time 16:23:11
End Time 17:26:24
B 7.000
C 12000.000
D 9500.000
E 1.000
Q 203.000
T 1200.100
U 218.000
W 7.000
X 347.000
Y 0.000
Z 0.000
F5_val_1 5500.000
F5_val_2 17500.000
F0_val_1 11500.000
F0_val_2 9500.000
F0_val_3 13500.000
F0_val_4 7500.000
F0_val_5 15500.000
A0_val_1 1.000
A0_val_2 0.000
A0_val_3 0.000
A0_val_4 0.000
I am not sure this is the most efficient way to work with the data (it is not a tidy format), but it sounds like this is what you wanted?
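If you would rather have the labels as an explicit column than as row names, a minimal sketch:
out <- data.frame(label = rownames(z), value = z$V1, row.names = NULL)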
I am wanting to convert several columns in a data.frame from chr to numeric and I would like to do it in a single line. Here is what I am trying to do:
items[,2:4] <- as.numeric(sub("\\$","",items[,2:4]))
But I get an error saying:
Warning message:
NAs introduced by coercion
If I do it column by column though it works:
items[,2:2] <- as.numeric(sub("\\$","",items[,2:2]))
items[,3:3] <- as.numeric(sub("\\$","",items[,3:3]))
items[,4:4] <- as.numeric(sub("\\$","",items[,4:4]))
What am I missing here? Why can't I specify this command for multiple columns? Is this some odd R idiosyncrasy that I am not aware of?
Example Data:
Name, Cost1, Cost2, Cost3, Cost4
A, $10.00, $15.50, $13.20, $45.45
B, $45.23, $34.23, $34.24, $23.34
C, $23.43, $45.23, $65.23, $34.23
D, $76.34, $98.34, $90.34, $45.09
Your problem is that gsub converts its x argument to character. If a list (a data.frame is in fact a list) is converted to character, something weird happens:
as.character(list(a = c("1", "1"), b = "1"))
# "c(\"1\", \"1\")" "1"
# and "c(\"1\", \"1\")" cannot be converted to a numeric
as.numeric("c(\"1\", \"1\")")
# NA
A one-line solution would be to unlist the x argument:
items[, 2:5] <- as.numeric(gsub("\\$", "", unlist(items[, 2:5])))
Yes, there is: apply is the command you are looking for.
items <- read.table(text = "Name, Cost1, Cost2, Cost3, Cost4
A, $10.00, $15.50, $13.20, $45.45
B, $45.23, $34.23, $34.24, $23.34
C, $23.43, $45.23, $65.23, $34.23
D, $76.34, $98.34, $90.34, $45.09", header = TRUE, sep = ",")
items[, 2:4] <- apply(items[, 2:4], 2, function(x) { as.numeric(gsub("\\$", "", x)) })
items
  Name Cost1 Cost2 Cost3  Cost4
1    A 10.00 15.50 13.20 $45.45
2    B 45.23 34.23 34.24 $23.34
3    C 23.43 45.23 65.23 $34.23
4    D 76.34 98.34 90.34 $45.09
(Note that apply was run over columns 2:4 only, which is why Cost4 still contains dollar signs; use items[, 2:5] to convert all four cost columns.)
A more efficient approach would be:
items[-1] <- lapply(items[-1], function(x) as.numeric(gsub("$", "", x, fixed = TRUE)))
items
# Name Cost1 Cost2 Cost3 Cost4
# 1 A 10.00 15.50 13.20 45.45
# 2 B 45.23 34.23 34.24 23.34
# 3 C 23.43 45.23 65.23 34.23
# 4 D 76.34 98.34 90.34 45.09
Some benchmarks of the answers so far:
fun1 <- function() {
  A[-1] <- lapply(A[-1], function(x) as.numeric(gsub("$", "", x, fixed = TRUE)))
  A
}
fun2 <- function() {
  A[, 2:ncol(A)] <- as.numeric(gsub("\\$", "", unlist(A[, 2:ncol(A)])))
  A
}
fun3 <- function() {
  A[, 2:ncol(A)] <- apply(A[, 2:ncol(A)], 2, function(x) { as.numeric(gsub("\\$", "", x)) })
  A
}
Here's some sample data and processing times
set.seed(1)
A <- data.frame(Name = sample(LETTERS, 10000, TRUE),
                matrix(paste0("$", sample(99, 10000 * 100, TRUE)),
                       ncol = 100))
system.time(fun1())
# user system elapsed
# 0.72 0.00 0.72
system.time(fun2())
# user system elapsed
# 5.84 0.00 5.85
system.time(fun3())
# user system elapsed
# 4.14 0.00 4.14
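For completeness, a hedged sketch of the same conversion with current dplyr (this assumes dplyr >= 1.0, which introduced across()):
library(dplyr)
items <- items %>%
  mutate(across(-Name, ~ as.numeric(gsub("$", "", .x, fixed = TRUE))))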
I have a tab-delimited file with 400 columns. Now I want to append text to the column names, i.e. if the column names are A and B, I want to change A to A.ovca and B to B.ctrls. Likewise I want to add the texts (ovca and ctrls) to all 400 columns, some column names with ovca and some with ctrls. All the columns are unique and contain more than 1000 rows. A sample of the delim file is given below:
X Y Z A B C
2.34 .89 1.4 .92 9.40 .82
6.45 .04 2.55 .14 1.55 .04
1.09 .91 4.19 .16 3.19 .56
5.87 .70 3.47 .80 2.47 .90
And I want the file to look like:
X.ovca Y.ctrls Z.ctrls A.ovca B.ctrls C.ovca
2.34 .89 1.4 .92 9.40 .82
6.45 .04 2.55 .14 1.55 .04
1.09 .91 4.19 .16 3.19 .56
5.87 .70 3.47 .80 2.47 .90
Please do help me
Regards
Thileepan
If your data.frame is called dat, you can access (and write to) the column names with colnames(dat).
Therefore:
cn <- colnames(dat)
cn <- sub("([AXC])","\\1.ovca",cn)
cn <- sub("([YZB])","\\1.ctrls",cn)
colnames(dat) <- cn
> cn
[1] "X.ovca" "Y.ctrls" "Z.ctrls" "A.ovca" "B.ctrls" "C.ovca"
The \\1 is a backreference within your regular expression: it is replaced with whatever the parenthesized group matched. Since the parentheses here contain a character class, the group matches any one of the letters inside it. In this case, "A" becomes "A.ovca" and "X" becomes "X.ovca".
If your variable names are more than one letter, it is easy enough to extend; just read up a bit on regexes (see the sketch below).
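For example, a hedged sketch with hypothetical multi-letter names, anchoring the pattern so only exact names are changed:
cn <- c("alpha", "beta", "gamma")
cn <- sub("^(alpha|gamma)$", "\\1.ovca", cn)
cn <- sub("^(beta)$", "\\1.ctrls", cn)
cn
#> [1] "alpha.ovca" "beta.ctrls" "gamma.ovca"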
Here is a two-liner using the stringr package.
library(stringr)
nam <- names(mydf)
names(mydf) <- ifelse(nam %in% c('X', 'A', 'C'),
                      str_c(nam, '.ovca'), str_c(nam, '.ctrls'))
How about this? You basically find the columns to which you want to append "ovca" or "ctrls" using %in%, and append the appropriate tag.
> (mydf <- data.frame(X = runif(10), Y = runif(10), Z = runif(10), A = runif(10), B = runif(10), C = runif(10)))
X Y Z A B C
1 0.81030594 0.1624974 0.3977381 0.9619541 0.9866498 0.4424760
2 0.92498687 0.2069429 0.6065115 0.9969835 0.2407364 0.2455184
3 0.11033869 0.2878640 0.5662793 0.7936232 0.6066735 0.8210634
> names(mydf)[names(mydf) %in% c("X", "A", "C")] <- paste(names(mydf)[names(mydf) %in% c("X", "A", "C")], "ovca", sep = ".")
> names(mydf)[names(mydf) %in% c("Y", "Z", "B")] <- paste(names(mydf)[names(mydf) %in% c("Y", "Z", "B")], "ctrls", sep = ".")
> mydf
X.ovca Y.ctrls Z.ctrls A.ovca B.ctrls C.ovca
1 0.81030594 0.1624974 0.3977381 0.9619541 0.9866498 0.4424760
2 0.92498687 0.2069429 0.6065115 0.9969835 0.2407364 0.2455184
3 0.11033869 0.2878640 0.5662793 0.7936232 0.6066735 0.8210634
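For what it's worth, a compact alternative is a named lookup vector, so the mapping is stated once (a sketch, assuming the same mydf):
suffix <- c(X = ".ovca", Y = ".ctrls", Z = ".ctrls", A = ".ovca", B = ".ctrls", C = ".ovca")
names(mydf) <- paste0(names(mydf), suffix[names(mydf)])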