melt giving several value columns - r

I am reading in parameter estimates from some results files that I would like to compare side by side in a table, but I can't get the data frame into the structure I want (Parameter name, Values(file1), Values(file2)).
When I read in the files I get a wide data frame with each parameter in a separate column, which I would like to transform to "long" format using melt. But that gives only one column of values. Any idea how to get several value columns without using a for loop?
library(reshape2)

paraA <- c(1, 2)
paraB <- c(6, 8)
paraC <- c(11, 9)
Source <- c("File1", "File2")
parameters <- data.frame(paraA, paraB, paraC, Source)
wrong_table <- melt(parameters, id.vars = "Source")

You can use melt in combination with dcast to get what you want. This is in fact the intended pattern of use (melt, then cast), which is why the functions have the names they do:
m <- melt(parameters)
dcast(m, variable ~ Source)
#   variable File1 File2
# 1    paraA     1     2
# 2    paraB     6     8
# 3    paraC    11     9
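For comparison, the same reshape can be written with tidyr's pivot functions, which have largely superseded melt/dcast in current code. A minimal sketch, assuming tidyr >= 1.0 is available:

```r
library(tidyr)

parameters <- data.frame(paraA = c(1, 2), paraB = c(6, 8),
                         paraC = c(11, 9), Source = c("File1", "File2"))

# pivot_longer stacks the para* columns; pivot_wider spreads Source back out
long <- pivot_longer(parameters, cols = starts_with("para"),
                     names_to = "variable")
wide <- pivot_wider(long, names_from = Source, values_from = value)
wide
# # A tibble: 3 x 3
#   variable File1 File2
#   <chr>    <dbl> <dbl>
# 1 paraA        1     2
# 2 paraB        6     8
# 3 paraC       11     9
```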

Converting @alexis's comment to an answer, transpose (t()) does pretty much what you want:
setNames(data.frame(t(parameters[1:3])), parameters[, "Source"])
#       File1 File2
# paraA     1     2
# paraB     6     8
# paraC    11     9
I've used setNames above to conveniently rename the resulting data.frame in one step.
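If the parameter names are wanted as a real column rather than as row names (the (Parameter, File1, File2) shape from the question), a small base-R follow-up works. A sketch, reusing the question's data:

```r
parameters <- data.frame(paraA = c(1, 2), paraB = c(6, 8),
                         paraC = c(11, 9), Source = c("File1", "File2"))

# transpose the numeric columns, then promote the row names to a column
res <- setNames(data.frame(t(parameters[1:3])), parameters[, "Source"])
res <- cbind(Parameter = rownames(res), res)
rownames(res) <- NULL
res
#   Parameter File1 File2
# 1     paraA     1     2
# 2     paraB     6     8
# 3     paraC    11     9
```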

Related

Rename column R

I am trying to rename columns, but I do not know in advance whether a given column will be present in the dataset. I have a large data set, and if a certain column name is present I want to rename it. For example:
A B C D E
1 4 5 9 2
3 5 6 9 1
4 4 4 9 1
newNames <- data %>% rename(`1` = A, `2` = B, `3` = C, `4` = D, `5` = E)
This works to rename what is in the dataset, but I am looking for the flexibility to add more potential name changes without an error occurring.
newNames2 <- data %>% rename(`1` = A, `2` = B, `3` = C, `4` = D, `5` = E, `6` = F, `7` = G)
This will not work; it gives me an error because F and G are not in the data set.
Is there any way to write a code to ignore the column change if the name does not exist?
Thanks!
There can be plenty of ways to do this. One would be to create a named vector with the names and their corresponding 'new name' (as the vector's names) and use that, i.e.
# The vector v1 below uses LETTERS[1:7] as the old names and 1:7 as the new ones
v1 <- setNames(LETTERS[1:7], 1:7)
names(df) <- names(v1)[v1 %in% names(df)]  # assumes df's columns appear in the same order as in v1
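With newer dplyr (>= 1.0) the same "rename only what exists" behaviour is built in via the any_of() selector, which silently skips lookup entries that are absent. A sketch, using a hypothetical data frame matching the question's columns:

```r
library(dplyr)

# hypothetical data frame with the question's columns A..E
df <- data.frame(A = c(1, 3, 4), B = c(4, 5, 4), C = c(5, 6, 4),
                 D = c(9, 9, 9), E = c(2, 1, 1))

# named vector: new names on the left, old names on the right;
# F and G are deliberately absent from df
lookup <- c(`1` = "A", `2` = "B", `3` = "C", `4` = "D", `5` = "E",
            `6` = "F", `7` = "G")

renamed <- df %>% rename(any_of(lookup))   # missing F and G are skipped, no error
names(renamed)
# [1] "1" "2" "3" "4" "5"
```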

Replace semicolon-separated values to tab

I am trying to convert the data which I have in txt file:
4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;...
to a column (table) where the values are separated by tab.
4.0945725440979
4.07999897003174
4.0686674118042...
So far I tried
mydata <- read.table("1.txt", header = FALSE)
separate_data<- strsplit(as.character(mydata), ";")
But it does not work: separate_data in this case consists of only 1 element:
[[1]]
[1] "1"
The OP does not directly state whether the raw data file contains multiple observations of a single variable, or should be broken into n-tuples. Since the OP does state that read.table results in a single row where s/he expects multiple rows, we can conclude that the correct technique is scan(), not read.table().
If the data in the raw data file represents a single variable, then the solution posted in comments by @docendo works without additional effort. Otherwise, additional work is required to tidy the data.
Here is an approach using scan() that reads the file into a vector, and breaks it into observations containing 5 variables.
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512"
value <- scan(textConnection(rawData), sep = ";")
columns <- 5                              # set desired number of columns
observations <- length(value) / columns
observation <- rep(1:observations, each = columns)
variable <- rep(1:columns, times = observations)
data.frame(observation, variable, value)
...and the output:
> data.frame(observation, variable, value)
   observation variable    value
1            1        1 4.094573
2            1        2 4.079999
3            1        3 4.068667
4            1        4 4.059601
5            1        5 4.052183
6            2        1 4.094573
7            2        2 4.079999
8            2        3 4.068667
9            2        4 4.059601
10           2        5 4.052183
At this point the data can be converted into a wide format tidy data set with reshape2::dcast().
Note that this solution requires that the number of data values in the raw data file is evenly divisible by the number of variables.
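As a shortcut, when the only goal is the wide n-column layout, the vector from scan() can be folded directly with matrix(), skipping the long format entirely. A sketch under the same 5-column assumption, with shortened made-up values for brevity:

```r
# shortened hypothetical data in the same semicolon-separated shape
value <- scan(textConnection("4.09;4.07;4.06;4.05;4.04;4.19;4.17;4.16;4.15;4.14"),
              sep = ";")
columns <- 5

# byrow = TRUE fills one observation per row
wide <- as.data.frame(matrix(value, ncol = columns, byrow = TRUE))
wide
#     V1   V2   V3   V4   V5
# 1 4.09 4.07 4.06 4.05 4.04
# 2 4.19 4.17 4.16 4.15 4.14
```

This carries the same caveat: the number of values must be evenly divisible by the number of columns.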

Extract then row.bind data.frames from nested lists

I have a function that outputs a large matrix (Mat1) and a small data frame (Smalldf1); I store them in a list called "Result".
This function is run inside a loop, so I create many versions of my "Result" list until the loop ends, each time adding them to a list called "FINAL".
At the end of the loop I have a list called FINAL with many smaller "Result" lists inside, each containing a small data frame and a large matrix.
I want to rbind all of the small data frames together to form one larger data frame called DF1, but I'm not sure how to access them now that I'm dealing with a list within a list.
A similar question on here gave a solution like this:
DF1 <- do.call("rbind",lapply(FINAL, function(x) x["Smalldf1"]))
However this gives the output as a single column called "Smalldf1" containing a description of Smalldf1, i.e. list(X1="xxx", X2="xxx", X3="xxx") is literally printed in the column. I need this broken out to look like the original format, with three columns holding the information.
Any help would be great.
I make my comment into an answer. This could be your data:
df <- data.frame(X1 = 1:3, X2 = 4:6, X3 = 7:9)
FINAL <- list(Result = list(Smalldf1 = df, Mat1 = as.matrix(df)),
              Result = list(Smalldf1 = df + 1, Mat1 = as.matrix(df + 1)))
You can use lapply to extract the first (or Nth; just change the 1) element of each nested list, and then rbind the result, either with do.call or with dplyr:
#### # Doing it in base R:
do.call("rbind", lapply(FINAL, "[[", 1))
#### # Or doing it with dplyr:
library(dplyr)
lapply(FINAL, "[[", 1) %>% bind_rows()
####   X1 X2 X3
#### 1  1  4  7
#### 2  2  5  8
#### 3  3  6  9
#### 4  2  5  8
#### 5  3  6  9
#### 6  4  7 10
This should be your expected result.
WARNING:
The dplyr solution doesn't work with old versions of dplyr (I tested it on dplyr_0.5.0, but it returns an error on dplyr_0.2, for instance).
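With current tidyverse versions the extract-then-bind step collapses into one call: purrr's map functions accept a name as the extractor. A sketch assuming purrr is installed (map_dfr() also needs dplyr behind the scenes):

```r
library(purrr)

df <- data.frame(X1 = 1:3, X2 = 4:6, X3 = 7:9)
FINAL <- list(Result = list(Smalldf1 = df, Mat1 = as.matrix(df)),
              Result = list(Smalldf1 = df + 1, Mat1 = as.matrix(df + 1)))

# map over the outer list, pluck "Smalldf1" from each inner list, row-bind
DF1 <- map_dfr(FINAL, "Smalldf1")
DF1
#   X1 X2 X3
# 1  1  4  7
# 2  2  5  8
# 3  3  6  9
# 4  2  5  8
# 5  3  6  9
# 6  4  7 10
```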

merge and plot multiple text files

I have sixty text files, each with two columns as shown below, each representing a unique sample, and headed 'Coverage' and 'counts'. The length of each file differs by a few rows, because for some values of Coverage, the Count is zero, therefore not printed. Each file is about 1000 rows long. Each file is named in the format "B001.BaseCovDist.txt" to "B060.BaseCovDist.txt", and in R I have them as "B001" to "B060".
How can I combine the data frames by Coverage? This is complicated by missing rows. I've tried various approaches in bash, base R, reshape(2), and dplyr.
How can I make a single graph of the Counts (y-axis) against Coverage (x-axis) with each unique sample as a different series? ggplot2 seems ideal, but I seem to need a loop or a list to add the series without having to type out all of the names in full (which would be ridiculous).
One approach that seemed good was to add a third column that contains the unique sample name because this creates a molten dataset. However this didn't work in bash (awk) because the number of whitespace delimiters varies by row.
Any help would be very welcome.
  Coverage   Count
1        0 7089359
2        1  983611
3        2  658253
4        3  520767
5        4  448916
6        5  400904
A good starting point is to consider a long format for the data versus a wide format. Since you mentioned reshape2, this should make sense, but check out tidyr as well; the docs for both describe the differences between long and wide.
Going with a long format, try the following:
library(dplyr)
library(ggplot2)
allfiles <- lapply(list.files(pattern = 'BaseCovDist.txt$'),
                   function(fname) cbind(fname = fname,
                                         read.table(fname, header = TRUE)))
dat <- bind_rows(allfiles)   # bind_rows() replaces the older rbind_all()
dat
##                  fname Coverage   Count
## 1 B001.BaseCovDist.txt        0 7089359
## 2 B001.BaseCovDist.txt        1  983611
## 3 B001.BaseCovDist.txt        2  658253
## 4 B001.BaseCovDist.txt        3  520767
## 5 B001.BaseCovDist.txt        4  448916
## 6 B001.BaseCovDist.txt        5  400904
ggplot(data = dat, aes(x = Coverage, y = Count, group = fname)) + geom_line()
Just to add to your answer, r2evans: I added a gsub() call so that the filename suffix is removed from the added column (plus some routine import modifiers).
allfiles <- lapply(list.files(pattern = 'BaseCovDist.txt$'),
                   function(sample) cbind(sample = gsub("[.]BaseCovDist[.]txt", "", sample),
                                          read.table(sample, header = TRUE, skip = 3)))

Read multidimensional group data in R

I have done a lot of googling but didn't find a satisfactory solution to my problem.
Say we have data file as:
Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1
The first line is the header. The first column is the group id (the data have 3 groups: A, B, C) while the other columns are values.
I want to read this file in R so that I can apply different functions on the data.
For example I tried to read the file and tried to get column mean
dt <- read.table(file_name, header = TRUE)
apply(dt, 2, mean)   # gives NA NA NA, with warnings
I want to read this file and get the column means. Then I want to separate the data into 3 groups (according to Tag A, B, C) and calculate the mean (column-wise) for each group. Any help would be appreciated.
apply(dt,2,mean) doesn't work because apply coerces the first argument to an array via as.matrix (as is stated in the first paragraph of the Details section of ?apply). Since the first column is character, all elements in the coerced matrix object will be character.
Try this instead:
sapply(dt[-1], mean)   # works because data.frames are lists; drop the non-numeric Tag column
To calculate column means by groups:
# using base functions
grpMeans1 <- t(sapply(split(dt[,c("v1","v2","v3")], dt[,"Tag"]), colMeans))
# using plyr
library(plyr)
grpMeans2 <- ddply(dt, "Tag", function(x) colMeans(x[,c("v1","v2","v3")]))
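For comparison, the same group-wise means can also be had from base aggregate() with a formula interface. A minimal sketch with the data from the question:

```r
# the question's data, entered directly
dt <- data.frame(Tag = c("A", "B", "C", "A", "C"),
                 v1 = c(1, 1, 5, 9, 1),
                 v2 = c(2, 2, 6, 2, 0),
                 v3 = c(3, 2, 1, 7, 1))

# aggregate applies mean to every value column within each Tag group
grpMeans3 <- aggregate(cbind(v1, v2, v3) ~ Tag, data = dt, FUN = mean)
grpMeans3
#   Tag v1 v2 v3
# 1   A  5  2  5
# 2   B  1  2  2
# 3   C  3  3  1
```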
