I have sixty text files, each representing a unique sample, with two columns headed 'Coverage' and 'Count' as shown below. The files differ in length by a few rows, because rows where the Count is zero are not printed. Each file is about 1000 rows long. The files are named "B001.BaseCovDist.txt" through "B060.BaseCovDist.txt", and in R I have them as "B001" to "B060".
How can I combine the data frames by Coverage? This is complicated by missing rows. I've tried various approaches in bash, base R, reshape(2), and dplyr.
How can I make a single graph of Count (y-axis) against Coverage (x-axis) with each unique sample as a different series? ggplot2 seems ideal, but I seem to need a loop or a list to add the series without typing out all of the names in full (which would be ridiculous).
One approach that seemed promising was to add a third column containing the unique sample name, because this creates a molten dataset. However, this didn't work in bash (awk) because the number of whitespace delimiters varies by row.
Any help would be very welcome.
Coverage Count
0 7089359
1 983611
2 658253
3 520767
4 448916
5 400904
A good starting point is to consider a long format for the data rather than a wide format. Since you mentioned reshape2 this should make sense, but check out tidyr as well; the docs for both explain the differences between long and wide.
Going with a long format, try the following:
allfiles <- lapply(list.files(pattern='\\.BaseCovDist\\.txt$'),
                   function(fname) cbind(fname=fname, read.table(fname, header=TRUE)))
dat <- dplyr::bind_rows(allfiles)  # rbind_all() is deprecated in current dplyr
dat
## fname Coverage Count
## 1 B001.BaseCovDist.txt 0 7089359
## 2 B001.BaseCovDist.txt 1 983611
## 3 B001.BaseCovDist.txt 2 658253
## 4 B001.BaseCovDist.txt 3 520767
## 5 B001.BaseCovDist.txt 4 448916
## 6 B001.BaseCovDist.txt 5 400904
ggplot(data=dat, aes(x=Coverage, y=Count, colour=fname)) + geom_line()
Just to add to your answer, r2evans: I added a gsub() call so that the filename suffix is removed from the added column (and also some mundane import modifiers).
allfiles <- lapply(list.files(pattern='[.]BaseCovDist[.]txt$'),
                   function(sample) cbind(sample=gsub("[.]BaseCovDist[.]txt$", "", sample),
                                          read.table(sample, header=TRUE, skip=3)))
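For completeness, the same pipeline can be written in base R alone (no dplyr needed), stacking the per-file data frames with do.call(rbind, ...). This is a sketch assuming the files follow the BXXX.BaseCovDist.txt naming from the question:

```r
# Base-R sketch: read every BaseCovDist file, tag each row with its
# sample name, and stack the results into one long data frame.
files <- list.files(pattern = "\\.BaseCovDist\\.txt$")

read_one <- function(fname) {
  d <- read.table(fname, header = TRUE)                 # whitespace-delimited
  d$sample <- sub("\\.BaseCovDist\\.txt$", "", fname)   # "B001", "B002", ...
  d
}

dat <- do.call(rbind, lapply(files, read_one))

# One line per sample, distinguished by colour:
# library(ggplot2)
# ggplot(dat, aes(Coverage, Count, colour = sample)) + geom_line()
```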
I am trying to convert the data which I have in txt file:
4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;...
to a column (table) where the values are separated by tab.
4.0945725440979
4.07999897003174
4.0686674118042...
So far I tried
mydata <- read.table("1.txt", header = FALSE)
separate_data<- strsplit(as.character(mydata), ";")
But it does not work: separate_data in this case consists of only one element:
[[1]]
[1] "1"
Based on the OP, it's not directly stated whether the raw data file contains multiple observations of a single variable or should be broken into n-tuples. Since the OP does state that read.table() results in a single row where they expect multiple rows, we can conclude that the correct technique is scan(), not read.table().
If the data in the raw data file represents a single variable, then the solution posted in comments by #docendo works without additional effort. Otherwise, additional work is required to tidy the data.
Here is an approach using scan() that reads the file into a vector, and breaks it into observations containing 5 variables.
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512"
value <- scan(textConnection(rawData), sep=";")
columns <- 5  # set desired number of columns
observations <- length(value) / columns
observation <- rep(1:observations, each=columns)
variable <- rep(1:columns, times=observations)
data.frame(observation,variable,value)
...and the output:
> data.frame(observation,variable,value)
observation variable value
1 1 1 4.094573
2 1 2 4.079999
3 1 3 4.068667
4 1 4 4.059601
5 1 5 4.052183
6 2 1 4.094573
7 2 2 4.079999
8 2 3 4.068667
9 2 4 4.059601
10 2 5 4.052183
At this point the data can be converted into a wide format tidy data set with reshape2::dcast().
Note that this solution requires that the number of data values in the raw data file is evenly divisible by the number of variables.
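If the wide layout is the end goal anyway, the observation/variable bookkeeping can be skipped entirely with base R's matrix(), which has the same evenly-divisible requirement. A minimal sketch on the same raw data (the resulting V1..V5 column names are just matrix() defaults):

```r
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512"
value <- scan(textConnection(rawData), sep = ";", quiet = TRUE)
columns <- 5

# byrow=TRUE fills row-by-row, so consecutive values in the file
# become one observation (one row) of 5 variables.
wide <- as.data.frame(matrix(value, ncol = columns, byrow = TRUE))
```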
I need to open a CSV file with the options shown in the figure below. I've added a link to my files; you can try with the file "20140313_Helix2_FP140_SC45.csv":
https://www.dropbox.com/sh/i5y8r8g7wymalw8/AABXsLkbpowxGObFpGHgv4m-a?dl=0
I have tried many options with read.table and read.csv, but I need a data frame with more than one column, with the data properly separated.
It looks like captured printer output. But it's not too messy:
# read it in as raw lines
lines <- readLines("20140313_Helix2_FP140_SC45.csv")
I'm assuming you want the "frequency point" data (it's the most prevalent) so we find the first one of those:
start <- which(grepl("^FREQUENCY POINTS:", lines))[1]
The rest of the file is "regular" enough to just look for lines beginning with a number (i.e. the PNT column) and read those in, giving them saner column names than the read.table defaults:
dat <- read.table(textConnection(grep("^[0-9]+",lines[start:length(lines)], value=TRUE)),
col.names=c("PNT", "FREQ", "MAGNITUDE"))
And, here's the result:
head(dat)
## PNT FREQ MAGNITUDE
## 1 1 0.800000 -19.033
## 2 2 0.800125 -19.038
## 3 3 0.800250 -19.071
## 4 4 0.800375 -19.092
## 5 5 0.800500 -19.137
## 6 6 0.800625 -19.167
nrow(dat)
## [1] 1601
The number of rows matches (from what I can tell) the number of frequency-point records.
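The grep-then-read.table pattern can be checked on a made-up snippet; the header lines below are invented stand-ins for the printer output, only the shape matters:

```r
# Toy reproduction of the extraction: non-data header lines are
# skipped because they don't start with a digit.
lines <- c("SOME PRINTER HEADER",
           "FREQUENCY POINTS: 3",
           "PNT   FREQ      MAG",
           "1   0.800000  -19.033",
           "2   0.800125  -19.038",
           "3   0.800250  -19.071")

start <- which(grepl("^FREQUENCY POINTS:", lines))[1]
dat <- read.table(textConnection(grep("^[0-9]+", lines[start:length(lines)], value = TRUE)),
                  col.names = c("PNT", "FREQ", "MAGNITUDE"))
```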
I am working with two CSV files. They are formatted like this:
File 1
able,2
gobble,3
highway,3
test,6
zoo,10
File 2
able,6
gobble,10
highway,3
speed,7
test,8
upper,3
zoo,10
In my program I want to do the following:
Create a keyword list by combining the values from two CSV files and keeping only unique keywords
Compare that keyword list to each individual CSV file to determine the maximum number of occurrences of a given keyword, then append that information to the keyword list.
The first step I have done already.
I am getting confused by R reading things as vectors/factors/data frames etc. and "coercion to lists". For example, in my files given above, the maximum occurrence for the word "gobble" should be 10 (its value is 3 in file 1 and 10 in file 2).
So basically two things need to happen. First, I need to create a column in "keywords" that holds information about the maximum number of occurrences of a word from the CSV files. Second, I need to populate that column with the maximum value.
Here is my code:
# Read in individual data sets
keywordset1=as.character(read.csv("set1.csv",header=FALSE,sep=",")$V1)
keywordset2=as.character(read.csv("set2.csv",header=FALSE,sep=",")$V1)
exclude_list=as.character(read.csv("exclude.csv",header=FALSE,sep=",")$V1)
# Sort, capitalize, and keep unique values from the two keyword sets
keywords <- sapply(unique(sort(c(keywordset1, keywordset2))), toupper)
# Keep keywords greater than 2 characters in length (basically exclude in at etc...)
keywords <- keywords[nchar(keywords) > 2]
# Keep keywords that are not in the exclude list
keywords <- setdiff(keywords, sapply(exclude_list, toupper))
# HERE IS WHERE I NEED HELP
# Compare the read keyword list to the master keyword list
# and keep the frequency column
key1=read.csv("set1.csv",header=FALSE,sep=",")
key1$V1=sapply(key1[[1]], toupper)
keywords$V2=key1[which(keywords[[1]] %in% key1$V1),2]
return(keywords)
The reason that your last command fails is that you try to use the $ operator on a vector. It only works on lists or data frames (which are a special case of lists).
A remark regarding toupper() (and many other functions in R): it is vectorized, so you don't need sapply. toupper(c(keywordset1, keywordset2)) is perfectly fine.
But I would like to propose an entirely different solution to your problem. First, I create the data as follows:
keywords1 <- read.table(text="able,2
gobble,3
highway,3
test,6
zoo,10",sep=",",stringsAsFactors=FALSE)
keywords2 <- read.table(text="gobble,10
highway,3
speed,7
test,8
upper,3
zoo,10",sep=",",stringsAsFactors=FALSE)
Note that I use stringsAsFactors=FALSE. This prevents read.table from converting characters to factors, such that there is no need to call as.character later.
The next steps are to capitalize the keyword columns in both tables. At the same time, I put both tables in a list. This is often a good way to simplify calculations in R, because you can use lapply to apply a function on all the list elements. Then I put both tables into a single table.
keyword_list <- lapply(list(keywords1,keywords2),function(kw)
transform(kw,V1=toupper(V1)))
keywords_all <- do.call(rbind,keyword_list)
The next step is to sort the data frame in decreasing order by the number in the second column:
keywords_sorted <- keywords_all[order(keywords_all$V2,decreasing=TRUE),]
keywords_sorted looks as follows:
V1 V2
5 ZOO 10
6 GOBBLE 10
11 ZOO 10
9 TEST 8
8 SPEED 7
4 TEST 6
2 GOBBLE 3
3 HIGHWAY 3
7 HIGHWAY 3
10 UPPER 3
1 ABLE 2
As you notice, some keywords appear only once, and for those that appear twice, the first appearance is the one you want to keep. R has a function to extract exactly these elements: duplicated() (run ?duplicated to learn more). It returns TRUE if an element has already appeared earlier in the vector; those are the elements you don't want. To flip TRUE and FALSE, use the ! operator. So the following gives your desired result:
keep <- !duplicated(keywords_sorted$V1)
keywords_max <- keywords_sorted[keep,]
V1 V2
5 ZOO 10
6 GOBBLE 10
9 TEST 8
8 SPEED 7
3 HIGHWAY 3
10 UPPER 3
1 ABLE 2
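The sort-then-deduplicate trick above is one idiom; base R's aggregate() expresses "maximum per keyword" more directly. A sketch using the full File 2 listing from the question (including the able,6 row that the answer's sample data omitted, so ABLE comes out as 6 here):

```r
keywords1 <- read.table(text = "able,2
gobble,3
highway,3
test,6
zoo,10", sep = ",", stringsAsFactors = FALSE)
keywords2 <- read.table(text = "able,6
gobble,10
highway,3
speed,7
test,8
upper,3
zoo,10", sep = ",", stringsAsFactors = FALSE)

keywords_all <- rbind(keywords1, keywords2)
keywords_all$V1 <- toupper(keywords_all$V1)

# One row per keyword, carrying the larger of its counts.
keywords_max <- aggregate(V2 ~ V1, data = keywords_all, FUN = max)
```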
I am reading in parameter estimates from some results files that I would like to compare side by side in a table. But I can't get the data frame into the structure I want (parameter name, values from file 1, values from file 2).
When I read in the files I get a wide data frame with each parameter in a separate column, which I would like to transform to "long" format using melt. But that gives only one column of values. Any idea how to get several value columns without using a for loop?
paraA <- c(1,2)
paraB <- c(6,8)
paraC <- c(11,9)
Source <- c("File1","File2")
parameters <- data.frame(paraA,paraB,paraC,Source)
wrong_table <- melt(parameters, by="Source")
You can use melt in combination with cast to get what you want. This is in fact the intended pattern of use, which is why the functions have the names they do:
m <- melt(parameters)
dcast(m, variable ~ Source)
# variable File1 File2
# 1 paraA 1 2
# 2 paraB 6 8
# 3 paraC 11 9
Converting #alexis's comment to an answer, transpose (t()) pretty much does what you want:
setNames(data.frame(t(parameters[1:3])), parameters[, "Source"])
# File1 File2
# paraA 1 2
# paraB 6 8
# paraC 11 9
I've used setNames above to conveniently rename the resulting data.frame in one step.
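Putting the question's data and the one-liner together as a runnable check (numeric columns transposed, then relabelled with the Source values):

```r
# Rebuild the question's example data frame.
paraA <- c(1, 2)
paraB <- c(6, 8)
paraC <- c(11, 9)
Source <- c("File1", "File2")
parameters <- data.frame(paraA, paraB, paraC, Source)

# Transpose the three parameter columns; each original row becomes
# a column, named after its Source entry.
out <- setNames(data.frame(t(parameters[1:3])), parameters[, "Source"])
```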
I want to (as ever) use code that performs better but functions equivalently to the following:
write.table(results.df[seq(1, ncol(results.df),2)],file="/path/file.txt", row.names=TRUE, sep="\t")
write.table(results.df[seq(2, ncol(results.df),2)],file="/path/file2.txt",row.names=TRUE, sep="\t")
results.df is a dataframe that looks something thus:
row.names 171401 171401 111201 111201
1 1 0.8320923 10 0.8320923
2 2 0.8510621 11 0.8510621
3 3 0.1009001 12 0.1009001
4 4 0.9796110 13 0.9796110
5 5 0.4178686 14 0.4178686
6 6 0.6570377 15 0.6570377
7 7 0.3689075 16 0.3689075
There is no consistent patterning in the column headers except that each one is repeated twice consecutively.
I want to create (1) one file with only odd-numbered columns of results.df and (2) another file with only even-numbered columns of results.df. I have one solution above, but was wondering whether there is a better-performing means of achieving the same thing.
IDEA UPDATE: I was thinking there may be some way of excising each processed column (deleting it from memory) rather than just copying it. This way the size of the data frame progressively decreases, which might improve performance?
The code is only slightly shorter but...
# Instead of
results.df[seq(1, ncol(results.df), 2)]
results.df[seq(2, ncol(results.df), 2)]
# you could use
results.df[c(TRUE, FALSE)]
results.df[c(FALSE, TRUE)]
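A quick check that the recycled logical index really selects the odd and even columns, on a throwaway data frame:

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# The length-2 logical vector is recycled across all columns.
odd  <- df[c(TRUE, FALSE)]   # columns 1 and 3: a, c
even <- df[c(FALSE, TRUE)]   # columns 2 and 4: b, d

identical(odd,  df[seq(1, ncol(df), 2)])  # TRUE
identical(even, df[seq(2, ncol(df), 2)])  # TRUE
```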