I have 40 CSV files with only 1 column each. I want to combine the data from all 40 files into 1 CSV file with 2 columns.
Data format is like this:
I want to split this column by space and combine all 40 CSV files into one file, preserving the number format.
I tried the code below, but the number format is not preserved and an extra third column gets added for negative numbers. Not sure why.
My code:
filenames <- list.files(path="C://R files", full.names=TRUE)
# Stack the single column from all 40 files into one data frame
merged <- data.frame(do.call("rbind", lapply(filenames, read.csv, header = FALSE)))
# Split each row on a single space into separate columns
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1)), " ", fixed=FALSE))
write.csv(data, "export1.csv", row.names=FALSE, na="NA")
The output I got is shown below. If you observe, the negative numbers are put into an extra column. I just want to split by space and put the values in 2 columns, in the exact number format as in the input.
R Output:
The problem is that the source data is delimited by:
one space when the second number is negative, and
two spaces when the second number is positive (space for the absent minus sign).
The trick is to split the string on one or more spaces:
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1))," +",fixed=FALSE))
I'm a bit OCD on charsets, unreliable files, etc., so I tend to use splitters such as "[[:space:]]+" instead, since they catch whitespace variants rather than just the plain space " " or tab "\t".
(In regex-speak, the + says "one or more". Other modifiers include ? as zero or one, and * as zero or more.)
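Putting it all together, here is a minimal sketch of the corrected pipeline (file path reused from the question). Reading with colClasses = "character" keeps the numbers exactly as they appear in the files, which should also address the "preserve the number format" requirement:
filenames <- list.files(path="C://R files", full.names=TRUE)
# Read everything as text so the numbers keep their original formatting
merged <- do.call(rbind, lapply(filenames, read.csv, header=FALSE, colClasses="character"))
# Split each row on one or more whitespace characters into two columns
data <- do.call(rbind, strsplit(trimws(merged$V1), "[[:space:]]+"))
write.csv(data, "export1.csv", row.names=FALSE, na="NA")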
I am trying to output a dataframe in R to a .txt file. I want the .txt file to ultimately mirror the dataframe output, with columns and rows all aligned. I found this post on SO which mostly gave me the desired output with the following (now modified) code:
gene_names_only <- select(deseq2_hits_table_df, Gene, L2F)
colnames(gene_names_only) <- c()
capture.output(
  print.data.frame(gene_names_only, row.names=F, col.names=F, print.gap=0, quote=F, right=F),
  file="all_samples_comparison_gene_list.txt"
)
The resultant output, however, does not align negative and positive values. See:
I ultimately want both positive and negative values to be properly aligned with one another. This means that for -0.00012 and 4.00046, the '-' character of the first number would be aligned with the '4' of the next number. How could I accomplish this?
Two other questions:
The output file has a blank line at the beginning of the output. How can I change this?
The output file also seems to put far more spaces between the left column and the right column than I would want. Is there any way I can change this?
Maybe try a finer-scale treatment of the printing using sprintf and a different format string for positive and negative numbers, e.g.:
> df = data.frame(x=c('PICALM','Luc','SEC22B'), y=c(-2.261085123,-2.235376098,2.227728912))
> sprintf('%-15s%.6f', df$x[1], df$y[1])
[1] "PICALM         -2.261085"
> sprintf('%-15s%.6f', df$x[2], df$y[2])
[1] "Luc            -2.235376"
> sprintf('%-15s%.7f', df$x[3], df$y[3])
[1] "SEC22B         2.2277289"
EDIT:
I don't think that write.table or similar functions accept custom format strings, so one option could be to create a data frame of formatted strings and then use write.table or writeLines to write to a file, e.g.
dfstr = data.frame(x = sprintf('%-15s', df$x),
                   y = sprintf(paste0('%.', 7 - 1*(df$y < 0), 'f'), df$y))
(The format string for y here is essentially what I previously proposed.) Next, write dfstr directly:
write.table(x = dfstr, file = 'filename.txt',
            quote = F, row.names = F, col.names = F)
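For reference, with the toy df above, filename.txt should come out looking roughly like this (negative values get 6 decimals and positive ones 7, so the digits line up under the minus signs):
PICALM          -2.261085
Luc             -2.235376
SEC22B          2.2277289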
I'm attempting to develop a specific string-count script. There's one step I can't seem to solve.
I have several files (tab-delimited tables); each file contains a data frame with over 1,000 strings, one per row. I'm trying to count the number of times a particular string in a row appears in another row as part of that row's string. Here's what I have so far: it yields each file name and the number of times a string appears in a row by itself or inside another string. I'm able to develop the concept, but right now I have to search for each string manually, which is impractical when dealing with thousands of strings of different lengths.
As you can see, the script iterates over each file in the folder. The result should only list those strings that appear in other rows, with the number of times each does so per file. Also, the files don't necessarily share the same list of strings, so each file should be checked separately.
Here’s a simple example of the data frame.
north.txt
1. abcd
2. bdcd
3. tabcdt
4. bdcad
I've been able to get the script to check for each word, but I have to input the word manually.
library(stringr)
library(tidyverse)

# Read all .txt files in folder.
files <- list.files(path="/Data/Processed_data/docs_by_name", pattern=".txt")

### Action on each file
# Select the column with the sequences-clones
for (i in files){
  print(i)
  # Select file
  data <- read.table(file = paste0("/Data/Processed_data/samples_by_name/", i), sep = '\t', header = TRUE)
  # Compare selected string with strings of other rows and count matches
  for (t in unique(data)){
    word <- deframe(data)
    number.word <- str_count(word, "abcd")
    repeats <- sum(number.word) - 1
    print(repeats)
  }
}
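For what it's worth, here is a sketch of how the hard-coded "abcd" might be generalized: count every string in a file against all rows and keep only those that also occur inside other rows (the helper name count_repeats is made up):
library(stringr)

count_repeats <- function(words) {
  # For each string, count occurrences across all rows; fixed() treats the
  # string literally rather than as a regex. Subtract 1 for the row itself.
  counts <- sapply(words, function(w) sum(str_count(words, fixed(w))) - 1)
  counts[counts > 0]  # keep only strings that also appear in other rows
}

# Toy check with the north.txt example rows
count_repeats(c("abcd", "bdcd", "tabcdt", "bdcad"))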
Here’s an example of what I’m hoping to get.
north.txt
abcd
2
bdca
1
south.txt
abcd…
I downloaded a file, in CSV format, in which every cell contains an item or is empty. When I run the code:
library(arules)   # read.transactions() comes from the arules package
groceries_data <- read.transactions("groceries.csv")
Surprisingly, I see this result:
summary(groceries_data)
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
8146 columns (items) and a density of 0.0004401248
But when I run the code:
groceries_data = read.transactions("groceries.csv",sep=",")
Then the result is:
summary(groceries_data)
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
which is the right result according to the book. But logically it seems it should work with the first command, not the second. What is going wrong here?
That function isn't intended to work with CSV by default. See help(read.transactions) - for the sep argument it states:
a character string specifying how fields are separated in the data file. The default ("") splits at whitespaces.
So unless you tell it to split on comma, it is splitting on every white space. If you've got spaces in many product names, then every word of every product name will become a column.
By specifying the sep argument as a comma, it's importing the CSV file correctly, as you wanted.
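A tiny illustration of the difference (this two-transaction basket file is made up):
library(arules)
writeLines(c("whole milk,yogurt",
             "rolls/buns,whole milk"), "tiny.csv")

# Default sep="" splits at whitespace: "whole" and "milk,yogurt" become separate items
summary(read.transactions("tiny.csv"))
# sep="," splits at the commas: "whole milk", "yogurt", "rolls/buns", as intended
summary(read.transactions("tiny.csv", sep = ","))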
I have data files that contain the following:
the first 10 columns are numbers and the last column is text, separated by spaces. The problem is that the text in the last column may also contain spaces. So when I used read.table() I got the following error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 21 did not have 11 elements
What's the easiest way of reading the first 10 columns into a data matrix and the last column into a string vector? Should I use readLines() first and then process it?
If you cannot re-export or recreate your data files with different, non-whitespace separators or with quotation marks around the last column, you can use read.table(..., fill = TRUE) to read in a file with unequal columns. Then combine columns 11 onwards with dat$col11 <- do.call(paste, c(dat[11:ncol(dat)], sep=" ")) (or something like that), drop the now unwanted columns with dat[11:(ncol(dat)-1)] <- NULL, and finally trim the whitespace from the end of the eleventh column with trimws(dat$col11).
Note that fill only considers the first five lines of your file, so you may need to find out the number of 'pseudo-columns' in the longest line manually and specify an appropriate number of col.names in read.table (see the linked answer).
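A sketch of that sequence, assuming fname holds the file path and that the first five lines already show the maximum width (see the caveat above):
dat <- read.table(fname, fill = TRUE)
# Combine pseudo-columns 11 through the last into one string
dat$col11 <- do.call(paste, c(dat[11:ncol(dat)], sep = " "))
# Drop the original pseudo-columns (col11 was appended at the end)
dat[11:(ncol(dat) - 1)] <- NULL
# Trim the padding that fill = TRUE added on shorter lines
dat$col11 <- trimws(dat$col11)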
Prompted by the useful fill = TRUE option of the read.table() function, I used the following to solve my problem:
dat <- read.table(fname, fill = T)
dat <- dat[subset(1:nrow(dat),!((1:nrow(dat)) %in% (which(dat[,11]=="No") + 1))),]
The fill = TRUE option puts everything after the first space of the 11th column onto a new row (redundant rows that the original data does not have). The code above removes those redundant rows based on three assumptions: (1) the 11th column contains no more than 11 space separators, so we know there is only one extra row of text after any line whose 11th column contains spaces (that's what the +1 does); (2) we know the word that such an 11th column starts with (in my case it is "No"); (3) keeping only the first word in the 11th column is sufficient (no ambiguity).
The following solved my problem:
nc <- max(count.fields(fname, sep = " "))
data <- read.table(fname, fill = T, col.names = paste0("V", seq_len(nc)), sep = " ", header = F)
Then the first 10 columns will be the numeric results I want and the remaining nc-10 columns can be combined into one string vector.
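That combining step might look like this (assuming the blanks padded in by fill arrive as empty strings, which trimws then cleans up):
nums <- as.matrix(data[, 1:10])                           # the numeric part
txt <- trimws(do.call(paste, c(data[11:nc], sep = " ")))  # the text part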
The most helpful post is:
How can you read a CSV file in R with different number of columns
You could reformat your file before reading it in R.
For example, using perl in a terminal:
perl -pe 's/(?<=[0-9]) /,/g' myfile.txt > myfile.csv
This replaces every space preceded by a digit with a comma.
Then read it into R using read.csv:
df = read.csv("myfile.csv")
I'm improving my R skills by rebuilding some of the amazing stuff they do on r-bloggers. Right now I'm trying to reproduce this:
http://wiekvoet.blogspot.nl/2015/06/deaths-in-netherlands-by-cause-and-age.html. The relevant dataset for this exercise can be found here:
http://statline.cbs.nl/Statweb/publication/?VW=D&DM=SLNL&PA=7052_95&D1=0-1%2c7%2c30-31%2c34%2c38%2c42%2c49%2c56%2c62-63%2c66%2c69-71%2c75%2c79%2c92&D2=0&D3=0&D4=0%2c10%2c20%2c30%2c40%2c50%2c60%2c63-64&HD=150710-0924&HDR=G1%2cG2%2cG3&STB=T
Diving into the code (to be found at the bottom of the first link), I'm running into this piece of code:
r1 <- read.csv(sep=';', header=FALSE,
               col.names=c('Causes','Causes2','Age','year','aantal','count'),
               na.strings='-', text=txtlines[3:length(txtlines)]) %>%
  select(., -aantal, -Causes2)
Could anybody help me separate the steps that are taken here?
Here is an explanation of what each line in the call to read.csv() is doing in your example. Note that the assignment of the last parameter, text, is complicated and depends on the script from the link you gave above. At a high level, it first reads in all lines from the file "Overledenen__doodsoo_170615161506.csv" which contain the string "Centraal", using only the third to final lines from that filtered set. There is an additional step applied to these lines as well.
r1 <- read.csv(
  # columns separated by semi-colon
  sep=';',
  # first row is data (i.e. is NOT a header)
  header=FALSE,
  # names of the six columns
  col.names=c('Causes','Causes2','Age','year','aantal','count'),
  # treat hyphen as NA
  na.strings='-',
  # read from the third line to the final line of the original input
  # Overledenen__doodsoo_170615161506.csv, after some
  # filtering has been applied
  text=txtlines[3:length(txtlines)]) %>% select(., -aantal, -Causes2)
read.csv reads the CSV file, separating columns at the separator ";",
so an input like a;b;c is split into: first column = a, second = b, third = c.
header=FALSE specifies that the original file has no header row.
col.names assigns the listed names to the columns in R.
na.strings='-' tells read.csv to read the value '-' as NA.
text=txtlines[3:length(txtlines)] uses lines 3 through the end as the input text.
%>% select(., -aantal, -Causes2) pipes the result into select(), which drops the columns aantal and Causes2 from the data frame.
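To see the whole call in action, here is a self-contained toy version (the two data lines are invented; the real txtlines come from the script at the link):
library(dplyr)
txtlines <- c("junk line 1",          # skipped by the [3:length(txtlines)] subset
              "junk line 2",
              "Accidents;A01;0-20;2010;-;12",
              "Cancer;C02;20-40;2010;-;34")
r1 <- read.csv(sep=';', header=FALSE,
               col.names=c('Causes','Causes2','Age','year','aantal','count'),
               na.strings='-', text=txtlines[3:length(txtlines)]) %>%
  select(., -aantal, -Causes2)
r1  # the four remaining columns: Causes, Age, year, count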