data:
sam:
res = 0.25, res1 = 0.30
bad:
res = 0.30, res1 = 0.23
code:
write.table(sam, file = "C:\\Users\\data1.txt", append = F, sep = " ",
            row.names = TRUE, col.names = TRUE)
write.table(bad, file = "C:\\Users\\data1.txt", append = T, sep = " ",
            row.names = TRUE, col.names = TRUE)
output of data1:
"x"
"1" 0.25
"x"
"1" 0.3
In fact, I want the output in the text file data1 to be something like this:
res res1
sam 0.25 0.30
bad 0.3 0.23
Any idea is appreciated!
You can supply a named vector in write.table(). You'll just need to set col.names to FALSE in the second call, since the column names have already been written by the first one.
sam <- 0.25; bad <- 0.30
write.table(c(sam = sam), col.names = "res", file = "data")
write.table(c(bad = bad), col.names = FALSE, file = "data", append = TRUE)
## read it back in
read.table("data")
# res
# sam 0.25
# bad 0.30
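As an aside, the doubled "x" header in the question's data1 output comes from leaving col.names = TRUE on the appending call; when append = TRUE, write.table() warns about appending column names to the file for exactly this reason.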
In response to your comment, you can write a helper function to do the appending once the file is initialized. Then we can read it back in as a list, so we have a choice between returning a data frame or a matrix.
sam <- 0.25
bad <- 0.3
## initial file creation
write.table(cbind(sam, bad), "data", row.names = FALSE)
## function to append to 'data'
wtFun <- function(x) {
  write.table(x, "data", append = TRUE, col.names = FALSE, row.names = FALSE)
}
## new values
sam2 <- 0.99
bad2 <- 25
## append new values
wtFun(cbind(sam2, bad2))
## read the file as a list and set the names
res <- setNames(
  scan("data", what = list(0, 0), skip = 1L),
  scan("data", what = "", nlines = 1L)
)
## 'res' as a matrix
do.call(rbind, res)
# [,1] [,2]
# sam 0.25 0.99
# bad 0.30 25.00
## 'res' as a data frame
as.data.frame(res)
# sam bad
# 1 0.25 0.3
# 2 0.99 25.0
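For comparison, plain read.table() with header = TRUE gives the data frame form of the same file directly; the scan() route above is mainly useful when you want the intermediate list:
read.table("data", header = TRUE)
#    sam  bad
# 1 0.25  0.3
# 2 0.99 25.0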
This will work better if you write sam and bad as data.frames (or matrices) instead of atomic vectors. For example:
sam <- 0.25
bad <- 0.30
##
write.table(
  data.frame(res = sam, row.names = "sam"),
  file = "F:/temp/data1.txt",
  append = FALSE, sep = " ",
  col.names = TRUE,
  row.names = TRUE)
##
write.table(
  data.frame(res = bad, row.names = "bad"),
  file = "F:/temp/data1.txt",
  append = TRUE, sep = " ",
  col.names = FALSE,
  row.names = TRUE)
##
R> read.table("F:/temp/data1.txt",header=TRUE)
res
sam 0.25
bad 0.30
IMO though, it's not a good idea to force the row.names attribute like this: if you append an object whose row name already exists in the file, you will get an error when you try to read it back in, since the row.names attribute cannot contain duplicate values. You would be better off doing something like:
write.table(
  data.frame(name = "sam", res = sam),
  file = "F:/temp/data1.txt",
  append = FALSE, sep = " ",
  col.names = TRUE,
  row.names = FALSE)
##
write.table(
  data.frame(name = "bad", res = bad),
  file = "F:/temp/data1.txt",
  append = TRUE, sep = " ",
  col.names = FALSE,
  row.names = FALSE)
##
R> read.table("F:/temp/data1.txt",header=TRUE)
name res
1 sam 0.25
2 bad 0.30
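To see why the name-column layout tolerates duplicates, here is a small sketch (same hypothetical path as above) that appends a second "sam" record and reads everything back:
write.table(
  data.frame(name = "sam", res = 0.99),
  file = "F:/temp/data1.txt",
  append = TRUE, sep = " ",
  col.names = FALSE,
  row.names = FALSE)
##
R> read.table("F:/temp/data1.txt", header = TRUE)
  name  res
1  sam 0.25
2  bad 0.30
3  sam 0.99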
Rscript test.R ../Data/bam/a.bam:0 ../Data/bam/b.bam:0.1 ../Data/bam/c.bam:0.5 ../Data/bam/d.bam:1
I want to make a list of keys and values from the command-line arguments. I have used the following code.
#test.R
args <- commandArgs(trailingOnly = TRUE)
key_value_pairs <- strsplit(args, " ")
key_value_pairs <- lapply(key_value_pairs, function(x) strsplit(basename(x), ":")[[1]])
key_value_pairs <- as.data.frame(key_value_pairs, stringsAsFactors = FALSE)
colnames(key_value_pairs) <- c("key", "value")
key_value_pairs$value <- as.numeric(key_value_pairs$value)
print(key_value_pairs)
I got the following output:
key value
1 a.bam NA
2 0
NA
1 b.bam
2 0.1
NA
1 c.bam
2 0.5
NA
1 d.bam
2 1
NA
But I want output like:
key value
a.bam 0.0
b.bam 0.1
c.bam 0.5
d.bam 1
Can someone help me find the issue and how to solve it? Thanks!
Note that the command args are already separated by space, so you don't need to do that yourself. And you need a different strategy for creating your data.frame. This should work:
args <- commandArgs(trailingOnly = TRUE)
key_value_pairs <- lapply(args, function(x) strsplit(basename(x), ":")[[1]])
key_value_pairs <- as.data.frame(do.call("rbind", key_value_pairs), stringsAsFactors = FALSE)
colnames(key_value_pairs) <- c("key", "value")
key_value_pairs$value <- as.numeric(key_value_pairs$value)
key_value_pairs
# key value
# 1 a.bam 0.0
# 2 b.bam 0.1
# 3 c.bam 0.5
# 4 d.bam 1.0
An alternate way to do this would be:
args <- commandArgs(trailingOnly = TRUE)
key_value_pairs <- read.table(text = basename(args), sep = ":",
                              col.names = c("key", "value"))
That will pretty much do everything in one go.
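If you want to test this without invoking Rscript, you can simulate the arguments interactively; the vector below simply stands in for what commandArgs(trailingOnly = TRUE) would return:
args <- c("../Data/bam/a.bam:0", "../Data/bam/b.bam:0.1",
          "../Data/bam/c.bam:0.5", "../Data/bam/d.bam:1")
read.table(text = basename(args), sep = ":", col.names = c("key", "value"))
#     key value
# 1 a.bam   0.0
# 2 b.bam   0.1
# 3 c.bam   0.5
# 4 d.bam   1.0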
I would like to extract the column index of a variable in a data frame using the variable name.
Here is the df, for example:
>df
Mean Var Max
a 1 0.5 3
b 1.5 0.4 4
c 0.7 0.3 2.5
d 0.3 0.1 0.5
I want to "reverse" this:
> variable.names(df[2])
[1] "Var"
with something like this:
> variable.names(df$Var)
NULL
But getting "2" instead of "NULL"
Here is my entire problem:
my_fct <- function(data, v_cont, v_cat){
  for (i in 1:nlevels(as.factor(v_cat))){
    sub <- subset(data, v_cat == levels(as.factor(v_cat))[i])
    sub_stat <- c(levels(as.factor(v_cat))[i],
                  mean(sub[, COLINDEX(v_cat)], na.rm = TRUE))
    mat_stat <- rbind(mat_stat, sub_stat)
  }
}
sub[, COLINDEX(v_cat)] is what I need to solve. How do I select the initial variable in my newly created matrix?
Note: v_cat and v_cont have the following form: df$variable1, df$variable2.
Thanks for helping!
The situation is not entirely clear, but based on the function provided, it can be rewritten by passing the column name and subsetting with [[ instead of passing df$variable1 or df$variable2:
my_fct <- function(data, v_cont, v_cat){
  mat_stat <- NULL
  for (i in 1:nlevels(as.factor(data[[v_cat]]))){
    sub <- subset(data, data[[v_cat]] ==
                          levels(as.factor(data[[v_cat]]))[i])
    sub_stat <- c(levels(as.factor(data[[v_cat]]))[i],
                  mean(sub[, v_cat], na.rm = TRUE))
    mat_stat <- rbind(mat_stat, sub_stat)
  }
  return(mat_stat)
}
Testing:
my_fct(df, "variable1", "variable2")
With the OP's original function, if the input is df$variable1, df$variable2, an option is to use deparse(substitute()) to capture the argument, extract the column name with sub(), and use that as the column name:
my_fct <- function(data, v_cont, v_cat){
  nm1 <- sub(".*\\$", "", deparse(substitute(v_cat)))
  mat_stat <- NULL
  for (i in 1:nlevels(as.factor(v_cat))){
    sub <- subset(data, v_cat == levels(as.factor(v_cat))[i])
    sub_stat <- c(levels(as.factor(v_cat))[i],
                  mean(sub[, nm1], na.rm = TRUE))
    mat_stat <- rbind(mat_stat, sub_stat)
  }
  return(mat_stat)
}
Testing:
my_fct(df, df$variable1, df$variable2)
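To see what the deparse(substitute()) trick captures, a quick standalone illustration:
f <- function(x) deparse(substitute(x))
f(df$Var)
# [1] "df$Var"
sub(".*\\$", "", f(df$Var))
# [1] "Var"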
Similar to LMc's solution (+1), we could use grep():
df <- structure(list(Mean = c(1, 1.5, 0.7, 0.3), Var = c(0.5, 0.4,
0.3, 0.1), Max = c(3, 4, 2.5, 0.5)), class = "data.frame", row.names = c("a",
"b", "c", "d"))
grep("Var", colnames(df))
Output:
[1] 2
Use match:
match("Var", colnames(df))
This should do it using which():
df <- data.frame(Mean=c(1,1.5,0.7,0.3),Var=c(0.5,0.4,0.3,0.1),Max=c(3,4,2.5,0.5))
df
Mean Var Max
1 1.0 0.5 3.0
2 1.5 0.4 4.0
3 0.7 0.3 2.5
4 0.3 0.1 0.5
which(colnames(df)=="Var")
Output:
[1] 2
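Tying this back to the COLINDEX placeholder in the question, any of these one-liners could back a small helper function (the name col_index is hypothetical, not an existing function):
## hypothetical helper standing in for the question's COLINDEX
col_index <- function(data, name) which(colnames(data) == name)
col_index(df, "Var")
# [1] 2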
I am looking for a speed improvement for a function that imports several ".txt" files into one data frame (adding the file name). The number of ".txt" files is > 10,000, and all the files have the same structure and are located in one directory with several subdirectories. The size of all 10,000 files is around 800 MB in total. It currently takes a couple of hours to load them all into a df.
My PC: Toshiba P50t with 8GB RAM and 1TB HDD
Please see the code I am using.
I am happy to hear suggestions on how to improve the loading speed (I would prefer not to use an intermediary tool, such as loading the data into MS SQL and importing it back into R). I have tried using fread instead of read_csv, without a significant speed difference.
files_to_df_v01 <- function(directory, Output_file_name, What_stocks) {
  library(dplyr)
  library(stringr)
  library(readr)
  List <- data.frame(dir(directory, pattern = "*.txt", recursive = T))
  names(List)[1] <- "Path_file"
  List <- arrange(List, List$Path_file)
  List_wse_stocks <- filter(List, str_count(List$Path_file, pattern = What_stocks) > 0)
  rownumber = 1
  setwd(directory)
  ############## LOOP ################
  for (i in List_wse_stocks$Path_file) {
    if (file.info(i)$size != 0) {
      dat <- read_csv(i, col_types = cols(Ticker = col_character(),
                                          Date = col_date(format = "%Y-%m-%d"),
                                          Open = col_double(), High = col_double(),
                                          Low = col_double(), Close = col_double(),
                                          Volume = col_integer(),
                                          OpenInt = col_integer()))
      # file name without its directory path, then without the ".txt" extension
      L_ = str_locate_all(i, "/")
      File_name <- substr(i, sapply(L_, max) + 1, nchar(i))
      dat$Ticker <- substr(File_name, 1, nchar(File_name) - 4)
      datt = dat %>% select(Ticker, Date, Open, High, Low, Close, Volume, OpenInt)
      if (rownumber == 1) {
        rownumber = rownumber + 1
        GPW_wse_stocks <- datt
      } else {
        GPW_wse_stocks <- rbind(GPW_wse_stocks, datt)
      }
    }
  }
  ############## END of LOOP ################
  save(GPW_wse_stocks, file = Output_file_name)
  return(data.frame(GPW_wse_stocks))
}
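The main cost in this function is the rbind() inside the loop: it re-copies the accumulated data frame on every iteration, so the total work grows quadratically with the number of files. Collecting the per-file tables in a list and binding once avoids that. A minimal sketch of the pattern (files here is a hypothetical vector of file paths):
## grow a list, not a data frame; bind once at the end
pieces <- vector("list", length(files))
for (k in seq_along(files)) {
  pieces[[k]] <- read.csv(files[k])
}
result <- do.call(rbind, pieces)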
Using data.table I managed to get a roughly 4-times-faster solution:
# Creating test data:
dir.create("Test")
dd <- "Test/csvReadingTest2"
dir.create(dd)
dir.create(file.path(dd, "v1"))
dir.create(file.path(dd, "v2"))
n <- 3000
f <- function(x) sample(x, n, replace = T)
require(data.table)
set.seed(123)
d1 <- data.table(Ticker = f(LETTERS),
Date = f(seq.Date(as.Date("2016-01-01"), by = "month",
length.out = n/100)),
Open = f(c(1.2, 1.3)), High = f(c(1.2, 1.3)),
Low = f(c(1.2, 1.3)), Close = f(c(1.2, 1.3)),
Volume = f(1:10), OpenInt = f(1:10))
d1
# Ticker Date Open High Low Close Volume OpenInt
# 1: H 2203-04-01 1.2 1.3 1.2 1.2 6 4
# 2: N 2121-05-01 1.2 1.3 1.2 1.2 9 6
# 3: E 2060-04-01 1.3 1.2 1.2 1.3 1 3
# 4: V 2132-04-01 1.3 1.3 1.3 1.2 7 8
# 5: F 2253-04-01 1.2 1.3 1.3 1.2 3 10
# ---
# 2996: J 2027-05-01 1.3 1.3 1.2 1.2 7 6
# 2997: K 2177-05-01 1.2 1.3 1.2 1.2 5 4
# 2998: S 2200-03-01 1.2 1.2 1.2 1.2 6 2
# 2999: V 2110-05-01 1.3 1.3 1.3 1.2 4 3
# 3000: Q 2043-05-01 1.2 1.3 1.2 1.2 3 5
invisible(lapply(1:100, function(x) fwrite(d1, paste0(dd, "/v1/d", x, ".txt"))))
invisible(lapply(1:100, function(x) fwrite(d1, paste0(dd, "/v2/d", x, ".txt"))))
Your function, slightly modified:
################################################################################
yourFunction_modified <- function(directory, Output_file_name, What_stocks) {
# require(plyr)
require(dplyr)
require(stringr)
library(readr)
# List <- data.frame(dir(directory, pattern = "*.txt", recursive = T))
# names(List)[1] <- "Path_file"
# List <- arrange(List, List$Path_file)
# List_wse_stocks <- (filter(List , str_count(List$Path_file ,
# pattern = What_stocks ) > 0 ))
l <- list.files(directory, recursive = T, full.names = T, pattern = "*.txt")
l <- l[grepl(What_stocks, l)]
rownumber = 1
for (i in l) {
if (file.info(i)$size != 0) {
dat <- read_csv(i,
col_types = cols(Ticker = col_character(),
Date = col_date(format = "%Y-%m-%d"),
Open = col_double(), High = col_double(),
Low = col_double(), Close = col_double(),
Volume = col_integer(),
OpenInt = col_integer()))
L_ = (str_locate_all(i,"/"))
File_name <- substr(i,sapply(L_,max) + 1, nchar(i))
dat$Ticker <- substr(File_name,1,nchar(File_name) - 4)
datt = dat %>% select(Ticker, Date, Open, High, Low, Close,
Volume, OpenInt)
if (rownumber == 1) {
rownumber = rownumber + 1
GPW_wse_stocks <- datt
} else {
GPW_wse_stocks <- rbind(GPW_wse_stocks, datt)
}
}
}
save(GPW_wse_stocks, file = Output_file_name)
return(data.frame(GPW_wse_stocks))
}
system.time(
x <- yourFunction_modified(dd, file.path(dirname(dd), "csvReadingTest2.Rdat"),
"/d[0-9]")
)
# 25 - 18 sec
My function:
myFun <- function(directory, Output_file_name, What_stocks) {
require(data.table)
require(Hmisc)
l <- list.files(directory, recursive = T, full.names = T, pattern = "*.txt")
l <- l[grepl(What_stocks, l)]
l <- l[file.info(l)$size != 0]
dtList <- lapply(l, function(i) {
dat <- fread(i)
File_name <- basename(i)
dat$Ticker <- substr(File_name, 1, nchar(File_name) - 4)
necessary <- Cs(Ticker, Date, Open, High, Low, Close, Volume, OpenInt)
# Delete unnecesary columns:
for (ii in setdiff(colnames(dat), necessary)) {
set(dat, j = ii, value = NULL)
}
dat
})
dt <- rbindlist(dtList, use.names = T, fill = T, idcol = F)
require(fasttime)
dt[, Date := as.Date(fastPOSIXct(Date))]
save(dt, file = Output_file_name)
return(dt[])
}
system.time(
x2 <- myFun(dd, file.path(dirname(dd), "csvReadingTest2v2.Rdat"),
"/d[0-9]")
)
# 6 - 4 sec
all.equal(as.data.table(x), x2)
# [1] TRUE
rbindlist(lapply(files, fread)) is pretty quick, though if you have a large number of small files and you don't care about preserving the filename, you may be best off using the operating system directly.
Setting up data, because the OP didn't: 10,000 files of 100 rows each.
setwd(tempdir())
dir.create("48492154")
setwd("48492154")
dates <- as.character(seq.Date(as.Date("2012-01-01"),
as.Date(Sys.Date()),
length.out = 500))
library(data.table)
for (i in 1:1e4) {
DT <- data.table(Ticker = 1:100,
Date = sample(dates, size = 100),
Open = round(runif(100) + 100, 1),
Close = round(runif(100) + 100, 1),
Volume = sample(1:100),
OpenInt = 1:100)
cat(i, "of 10,000\r")
flush.console()
fwrite(DT, paste0(i, ".csv"), showProgress = FALSE)
}
Simple method (this also handles repeated headers and gets the colClasses nearer to the truth):
system.time({
res <- rbindlist(lapply(dir(pattern = "\\.csv"), fread))
})
#> user system elapsed
#> 5.46 3.17 8.62
Using the Windows system copy:
system.time({
# Windows only
shell("copy /b *.csv out.txt > dump.log")
new_res <- fread("out.txt")
# Delete the headers mixed in (whereas rbindlist() above
# handles this automatically -- and better)
for (j in names(new_res)) {
new_res <- new_res[.subset2(new_res, j) != j]
}
})
#> user system elapsed
#> 0.76 0.13 3.31
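On a Unix-alike, the same concatenation trick could presumably be done with cat instead of copy /b; an untested sketch along the same lines:
# Unix-alike equivalent of the Windows 'copy /b' trick (untested sketch)
system("cat *.csv > out.txt")
new_res <- data.table::fread("out.txt")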
I am new to R and will try to explain my problem as well as I can.
I am working with a data frame of 15,571 obs and 18,976 variables. The colnames and rownames are gene names, and most of them have an identical name match. The entries are all numeric correlation values. This is what it looks like:
          [GENE128] [GENE271] [GENE2983]
[GENE231]      0.71      0.98       0.32
[GENE128]      0.23      0.61       0.90
[GENE271]      0.87      0.95       0.63
What I am trying to do is write code that lists all the genes in the df where the value is > 0.8 AND the gene names (col- and rownames) are identical; with the example above, only "GENE271" would be TRUE.
Is there a way to do this?
Your example data as a data frame:
vec = c( 0.71,0.98,0.32,0.23,0.61,0.90,0.87,0.95,0.63)
mt = matrix(vec, 3, 3, byrow = T)
coln = c('GENE128', 'GENE271', 'GENE2983')
rown = c('GENE231', 'GENE128', 'GENE271')
df = data.frame(mt)
colnames(df) = coln
rownames(df) = rown
Use the rownames and colnames to build a new data frame and vectorize the values:
ndf = data.frame(coln = as.vector(sapply(coln, function(x) rep(x, ncol(df)))),
                 rown = rep(rown, ncol(df)),
                 data = as.vector(as.matrix(df)),
                 stringsAsFactors = F)
idx_true = sapply(1:nrow(ndf), function(x) ndf[x, 1] == ndf[x, 2])
subs_ndf = ndf[idx_true, ]
subs_ndf[which(ndf[idx_true, 'data'] > 0.8 ), ]
Output:
coln rown data
6 GENE271 GENE271 0.95
I'm sure someone has a better, faster way. This way will be slow, but it should work:
test <- data.frame(GENE128 = c(0.71,0.23,0.87), GENE271 = c(0.98,0.61,0.95),
GENE2983 = c(0.32,0.90,0.63))
row.names(test) <- c('GENE231', 'GENE128', 'GENE271')
gene.equal <- function(x, limit = 0.8){
  df <- c()
  for(i in 1:nrow(x)){
    row <- x[i,]
    indexes <- which(row.names(row) == colnames(x))
    if(length(indexes) > 0 && row[,indexes] > limit){
      row[,indexes] <- 'TRUE'
    }
    df <- rbind(df, row)
  }
  df
}
new.df <- gene.equal(x = test)
I made 'TRUE' text because otherwise it will be converted to '1.00' if you use TRUE (no quotes).
The following statement provides the desired result in two steps (df is your data frame).
> df <- df[which(row.names(df) %in% colnames(df) & df >= 0.8),]
> df
GENE128 GENE271 GENE2983
GENE271 0.87 0.95 0.63
NA NA NA NA
NA.1 NA NA NA
> na.omit(df)
GENE128 GENE271 GENE2983
GENE271 0.87 0.95 0.63
I have to use na.omit(df) to get rid of those NAs, but the solution provides accurate data without running complex code.
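A vectorized alternative that checks only the matching row/column pairs, rather than whole rows, is to index the matrix with a two-column cbind() of matched positions; a minimal sketch using the test data frame defined in the earlier answer:
common <- intersect(rownames(test), colnames(test))
vals <- as.matrix(test)[cbind(match(common, rownames(test)),
                              match(common, colnames(test)))]
setNames(vals > 0.8, common)
# GENE128 GENE271
#   FALSE    TRUE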
I am new to topic modeling and was testing the lda.collapsed.gibbs.sampler() method by trying to "characterize" some 98 CVs. I first tried to do it using a corpus (as it is easier to do filtering, etc.). However, this gave some unexpected results, probably because the lexicalize() function first converted it to an object with only 3 documents/objects.
# method 1
library(tm)
library(lda)
a <- Corpus(DirSource(doc.folder, pattern = ".txt$"), readerControl = list(language = "eng"))
a <- tm_map(a, content_transformer(removeNumbers))
a <- tm_map(a, content_transformer(removePunctuation))
a <- tm_map(a, content_transformer(stripWhitespace))
a <- tm_map(a, content_transformer(tolower))
lex <- lexicalize(a)
result <- lda.collapsed.gibbs.sampler(lex$documents, 8, lex$vocab, 30, 0.1, 0.1,
                                      initial = NULL, burnin = NULL,
                                      compute.log.likelihood = T)
length(a) # output: [1] 98
length(lex$documents) # output: [1] 3, even though I expect 98
dim(result$document_sums) # output: [1] 8 3 even though I expect 8 98
However, when I directly used the CV text as a vector, it gave the expected results:
# method 2
library(tools)  # for file_path_sans_ext()
filenames = list.files(path = doc.folder, pattern = ".txt$", full.names = T)
df <- data.frame(stringsAsFactors = FALSE)
for (filename in filenames){
  myfile = file(filename)
  df <- rbind(df, cbind(name = file_path_sans_ext(basename(filename)),
                        text = paste(readLines(myfile), collapse = " ")))
  close(myfile)
}
# the following avoids an error due to French words etc. being used
df[, "text"] <- sapply(df[, "text"], iconv, "WINDOWS-1252", "UTF-8")
lex <- lexicalize(df[,"text"])
result = lda.collapsed.gibbs.sampler(lex$documents, 8,lex$vocab, 30, 0.1,0.1, initial = NULL, burnin = NULL, compute.log.likelihood = T)
NROW(df) # output: [1] 98
length(lex$documents) # output: [1] 98 as expected
dim(result$document_sums) # output: [1] 8 98 as expected
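A plausible fix for method 1 is therefore to flatten the corpus to a plain character vector (one element per document) before calling lexicalize(), since lexicalize() expects character input rather than a tm corpus. A sketch, assuming tm's content() returns each document's text lines:
## flatten the tm corpus to one string per document before lexicalizing
docs <- unlist(lapply(a, function(d) paste(content(d), collapse = " ")))
lex <- lexicalize(docs)
length(lex$documents)  # expected: 98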