Re-creating file extensions for easy loading in R

I am trying to read in a number of datasets (approximately 300), all with names similar to the following. I am not loading them all in at the same time, but I am looking for a generalised solution where I only change a few things at the beginning of the R file:
E:/Data/Academic/Year1/External/beer/beer_drug_1114_1165
E:/Data/Academic/Year1/External/beer/beer_groc_1114_1165
E:/Data/Academic/Year1/External/beer/beer_PANEL_DR_1114_1165.dat
E:/Data/Academic/Year1/External/beer/beer_PANEL_GR_1114_1165.dat
E:/Data/Academic/Year1/External/beer/beer_PANEL_MA_1114_1165
E:/Data/Academic/Year1/External/beer/Delivery_Stores
The only things that change are:
Year1 in E:/Data/Academic/Year1/External
beer in beer/beer_drug_1114_1165
1114_1165 at the end, and the extensions
So I am trying different combinations of paste0 in order to recreate the file paths.
I have something like the following, which isn't working so well.
file <- "E:/IRI Data/Academic Dataset External/Year1/External/"
product <- "/beer"
weeks <- "_1114_1165"
paste0(file, product, product, weeks)
But I would also like to be able to change /Year1/ in the middle of the path.
The files are read as follows:
drug <- read.table("E:/Data/Academic/Year1/External/beer/beer_drug_1114_1165", header = TRUE)
groc <- read.table("E:/Data/Academic/Year1/External/beer/beer_groc_1114_1165", header = TRUE)
PANEL_DR <- read.delim("E:/Data/Academic/Year1/External/beer/beer_PANEL_DR_1114_1165.dat", header = TRUE)
PANEL_GR <- read.delim("E:/Data/Academic/Year1/External/beer/beer_PANEL_GR_1114_1165.dat", header = TRUE)
PANEL_MA <- read.delim("E:/Data/Academic/Year1/External/beer/beer_PANEL_MA_1114_1165.dat", header = TRUE)
Delivery_Stores <- read.fwf("E:/Data/Academic/Year1/External/beer/Delivery_Stores",
                            widths = c(7, 3, 9, 21, 5, 4, 5, 9))
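One way to generalise this (a sketch, assuming the directory layout in the listing above; adjust the extensions to match your files, since the PANEL_MA entry appears both with and without .dat) is to build the directory once with file.path() and each file name with paste0(), so only the variables at the top change per dataset:
base_dir <- "E:/Data/Academic"
year     <- "Year1"
product  <- "beer"
weeks    <- "1114_1165"

# directory shared by all files for this product/year
folder <- file.path(base_dir, year, "External", product)

drug     <- read.table(file.path(folder, paste0(product, "_drug_", weeks)), header = TRUE)
groc     <- read.table(file.path(folder, paste0(product, "_groc_", weeks)), header = TRUE)
PANEL_DR <- read.delim(file.path(folder, paste0(product, "_PANEL_DR_", weeks, ".dat")), header = TRUE)
PANEL_GR <- read.delim(file.path(folder, paste0(product, "_PANEL_GR_", weeks, ".dat")), header = TRUE)
PANEL_MA <- read.delim(file.path(folder, paste0(product, "_PANEL_MA_", weeks, ".dat")), header = TRUE)
Delivery_Stores <- read.fwf(file.path(folder, "Delivery_Stores"),
                            widths = c(7, 3, 9, 21, 5, 4, 5, 9))
With this layout only base_dir, year, product, and weeks need editing at the top of the script; sprintf() works equally well if you prefer a template string.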

Related

How do I link code together in R such that sequential steps run?

I am trying to link code together so that each step runs sequentially. The input of each step requires the output of the previous one. For loops do not work, and neither do pipes. I am using the xcms package for pre-processing omics data.
## Initial preparation
library(xcms)   # the faahKO package only needs to be installed for the example files

cdfs <- dir(system.file("cdf", package = "faahKO"), full.names = TRUE,
            recursive = TRUE)[c(1, 2, 5, 6, 7, 8, 11, 12)]
phenodat <- data.frame(sample_name = sub(basename(cdfs), pattern = ".CDF",
                                         replacement = "", fixed = TRUE),
                       sample_group = c(rep("KO", 4), rep("WT", 4)),
                       stringsAsFactors = FALSE)

## Steps
peaky <- xcmsSet(files = cdfs, phenoData = phenodat, method = "centWave",
                 peakwidth = c(20, 80), snthresh = 10, noise = 5000,
                 prefilter = c(6, 5000))
alig <- retcor(peaky, method = "obiwarp", plottype = "deviation")
groupy <- group(alig, bw = 20, mzwid = 0.015)
fill <- fillPeaks(groupy)
Any suggestions for automatically linking these steps would be greatly appreciated. The aim is ultimately to submit one R script as a "batch" job on an HPC.
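One straightforward approach (a sketch based on the code above, not an xcms-specific feature): an R script already runs its statements in order, so wrapping the four steps in a single function keeps the input/output dependencies explicit and gives you one call to place in the script you submit with Rscript on the HPC:
# A sketch: each step consumes the previous step's result, so a plain
# function body runs them strictly in sequence.
run_xcms_pipeline <- function(cdfs, phenodat) {
  peaky  <- xcmsSet(files = cdfs, phenoData = phenodat, method = "centWave",
                    peakwidth = c(20, 80), snthresh = 10, noise = 5000,
                    prefilter = c(6, 5000))
  alig   <- retcor(peaky, method = "obiwarp", plottype = "deviation")
  groupy <- group(alig, bw = 20, mzwid = 0.015)
  fillPeaks(groupy)
}

fill <- run_xcms_pipeline(cdfs, phenodat)
saveRDS(fill, "filled_peaks.rds")  # hypothetical output file for the batch job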

Format PDF using layout()

I want to print two tables (3x2 and 2x3) at the top of a PDF page, next to each other. The following code prints them in the centre despite pagecentre = FALSE:
library(gridExtra)

tab2 <- tableGrob(df2)
tab3 <- tableGrob(df3)
pdf("file.pdf", height = 20, width = 15, pagecentre = FALSE)
grid.arrange(tab2, tab3, ncol = 2, nrow = 1)
dev.off()
How do I fix this using layout()? I looked at the function but can't understand how to set up the matrix.
I'd also like to add table titles. Do I do this with a data frame function or while writing the data frames to the PDF?
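One possible fix without layout() (a sketch, not tested against your data) is to stay with grid.arrange() but add an empty bottom row so the tables are pushed to the top of the page, and attach titles with arrangeGrob(top = ...):
library(gridExtra)

tab2 <- tableGrob(df2)   # df2, df3 as in your code
tab3 <- tableGrob(df3)

pdf("file.pdf", height = 20, width = 15)
grid.arrange(
  arrangeGrob(tab2, top = "Title for table 2"),
  arrangeGrob(tab3, top = "Title for table 3"),
  ncol = 2, nrow = 2,   # only the first row is filled; the second stays empty
  heights = c(1, 3)     # adjust the ratio so the top row fits the tables
)
dev.off()
If you prefer an explicit layout, layout_matrix = rbind(c(1, 2), c(NA, NA)) with the same heights does the same thing.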

R - IMDb dataset files - how to merge lines per film

One of the files (title.principals) available in the IMDb dataset files contains details about cast and crew.
I would like to extract the Directors' details and merge them into a single line, as there can be several Directors per film.
Is it possible?
# title.principals file download
url <- "https://datasets.imdbws.com/title.principals.tsv.gz"
tmp <- tempfile()
download.file(url, tmp)

# file load
title_principals <- readr::read_tsv(
  file = gzfile(tmp),
  col_names = TRUE,
  quote = "",
  na = "\\N",
  progress = FALSE
)
# name.basics file download
url <- "https://datasets.imdbws.com/name.basics.tsv.gz"
tmp <- tempfile()
download.file(url, tmp)

# file load
name_basics <- readr::read_tsv(
  file = gzfile(tmp),
  col_names = TRUE,
  quote = "",
  na = "\\N",
  progress = FALSE
)
# extract directors data
library(dplyr)
library(stringr)

df_directors <- title_principals %>%
  filter(str_detect(category, "director")) %>%
  select(tconst, ordering, nconst, category) %>%
  group_by(tconst)
df_directors <- df_directors %>% left_join(name_basics)
head(df_directors, 20)
I'm joining it with the name_basics file to get the Director's name.
name.basics contains the name, birth and death years, and professions.
After this step, I would like to merge all Directors per film into a single cell, separated by commas for example.
Is that somehow possible?
Please see this guide for a minimal reproducible example. Setting up a simplified example with fake data that highlights the exact problem will help other people help you faster.
As I understand it, you want to take a file that has multiple rows per value of ID_tconst with different values of Director_Name and collapse it to a file with one row per value of ID_tconst and a comma separated list of Director_Names.
Here is a simple mock data set and solution. Note the use of the collapse argument (rather than sep) in paste.
library(tidyverse)

example <- tribble(
  ~ID_tconst, ~Director_Name,
  1, "Aaron",
  2, "Bob",
  2, "Cathy",
  3, "Doug",
  3, "Edna",
  3, "Felicity"
)

collapsed <- example %>%
  group_by(ID_tconst) %>%
  summarize(directors = paste(Director_Name, collapse = ","))
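Applied to the df_directors built above (a sketch, assuming the director's name comes from the primaryName column that the IMDb name.basics file supplies), the same pattern gives one row per film:
directors_per_film <- df_directors %>%
  group_by(tconst) %>%
  summarize(directors = paste(primaryName, collapse = ", "))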

Dynamic Reporting in R

I am looking for help generating an 'rtf' report from R (a data frame).
I am trying to output data with many columns into an 'rtf' file using the following code:
library(rtf)

inp.data <- cbind(ChickWeight, ChickWeight, ChickWeight)
outputFileName <- "test.out"
rtf <- RTF(paste(".../", outputFileName, ".rtf"), width = 11, height = 8.5,
           font.size = 10, omi = c(.5, .5, .5, .5))
addTable(rtf, inp.data, row.names = F, NA.string = "-", col.widths = rep(1, 12),
         header.col.justify = rep("C", 12))
done(rtf)
The problem I face is that some of the columns are hidden (as you can see, the last two columns are cut off). I would like these columns to print on the next page (without reducing the column width).
Can anyone suggest packages/techniques for this scenario?
Thanks
Six years later, there is finally a package that can do exactly what you wanted. It is called reporter (small "r", no "s"). It will wrap columns to the next page if they exceed the available content width.
library(reporter)
library(magrittr)

# Prepare sample data
inp.data <- cbind(ChickWeight, ChickWeight, ChickWeight)

# Make unique column names
nm <- c("weight", "Time", "Chick", "Diet")
nms <- paste0(nm, c(rep(1, 4), rep(2, 4), rep(3, 4)))
names(inp.data) <- nms

# Create table
tbl <- create_table(inp.data) %>%
  column_defaults(width = 1, align = "center")

# Create report and add table to report
rpt <- create_report("test.rtf", output_type = "RTF", missing = "-") %>%
  set_margins(left = .5, right = .5) %>%
  add_content(tbl)

# Write the report
write_report(rpt)
The only thing is that you need unique column names, so I added a bit of code to do that.
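If you'd rather not build the names by hand, base R's make.unique() can also de-duplicate them (a small sketch):
inp.data <- cbind(ChickWeight, ChickWeight, ChickWeight)
names(inp.data) <- make.unique(names(inp.data))  # weight, Time, Chick, Diet, weight.1, Time.1, ...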
If docx format can replace rtf format, use package ReporteRs.
library(ReporteRs)

inp.data <- cbind(ChickWeight, ChickWeight, ChickWeight)
doc <- docx()
# uncomment the addSection lines if you want to change the page
# orientation to landscape
# doc <- addSection(doc, landscape = TRUE)
doc <- addFlexTable(doc, vanilla.table(inp.data))
# doc <- addSection(doc, landscape = FALSE)
writeDoc(doc, file = "inp.data.docx")

creating a data frame with lapply(and plyr package)

I have multiple data files that I'm interested in cleaning up and then obtaining means from, to run repeated-measures ANOVA on.
Here's example data; in the real data there are 4,500 rows and another column called Actresponse, which sometimes contains a 9 that I trim around: https://docs.google.com/file/d/0B20HmmYd0lsFVGhTQ0EzRFFmYXc/edit?pli=1
I have just discovered plyr and how awesome it is for manipulating data, but the way I'm using it right now looks rather stupid to me. There are four different things I'm interested in that I want to read into a data frame. I've read them into four separate data frames to start, and I'm wondering whether I can combine this and read all the means into one data frame (a row for each Reqresponse of each file) with fewer lines of code. Basically, can I achieve what I've done here without rewriting a lot of the same code four times?
library(plyr)

PMScoreframe <- lapply(list.files(pattern = '^[2-3].txt'), function(ff) {
  data <- read.table(ff, header = T, quote = "\"")
  data <- data[-c(seq(from = 1, to = 4001, by = 500), seq(from = 2, to = 4002, by = 500)), ]
  ddply(data[data$Reqresponse == 9, ], .(Condition, Reqresponse), summarise, Score = mean(Score))
})

PMRTframe <- lapply(list.files(pattern = '^[2-3].txt'), function(ff) {
  data <- read.table(ff, header = T, quote = "\"")
  data <- data[data$RT > 200, ]
  data <- ddply(data, .(Condition), function(x) x[!abs(scale(x$RT)) > 3, ])
  ddply(data[data$Reqresponse == 9, ], .(Condition, Reqresponse, Score), summarise, RT = mean(RT))
})

OtherScoreframe <- lapply(list.files(pattern = '^[2-3].txt'), function(ff) {
  data <- read.table(ff, header = T, quote = "\"")
  data <- data[-c(seq(from = 1, to = 4001, by = 500), seq(from = 2, to = 4002, by = 500)), ]
  select <- rep(TRUE, nrow(data))
  index <- which(data$Reqresponse == 9 | data$Actresponse == 9 | data$controlrepeatedcue == 1)
  select[unique(c(index, index + 1, index + 2))] <- FALSE
  data <- data[select, ]
  ddply(data[data$Reqresponse == "a" | data$Reqresponse == "b", ],
        .(Condition, Reqresponse), summarise, Score = mean(Score))
})

OtherRTframe <- lapply(list.files(pattern = '^[2-3].txt'), function(ff) {
  data <- read.table(ff, header = T, quote = "\"")
  data <- data[-c(seq(from = 1, to = 4001, by = 500), seq(from = 2, to = 4002, by = 500)), ]
  select <- rep(TRUE, nrow(data))
  index <- which(data$Reqresponse == 9 | data$Actresponse == 9 | data$controlrepeatedcue == 1)
  select[unique(c(index, index + 1, index + 2))] <- FALSE
  data <- data[select, ]
  data <- data[data$RT > 200, ]
  data <- ddply(data, .(Condition), function(x) x[!abs(scale(x$RT)) > 3, ])
  ddply(data[data$Reqresponse == "a" | data$Reqresponse == "b", ],
        .(Condition, Reqresponse, Score), summarise, RT = mean(RT))
})
I think this deals with what you're trying to do. Basically, you need to read all the data in once, then work with that data frame. There are several questions dealing with how to read it all in; here is how I would do it so that I keep a record of which file each row in the data frame comes from, which can also be used for grouping:
filenames <- list.files(".", pattern="^[2-3].txt")
import <- mdply(filenames, read.table, header = T, quote = "\"")
import$file <- filenames[import$X1]
Now import is a big dataframe with all your files in it (I'm assuming your pattern recognition etc for reading in files is correct). You can then do summaries based on whatever criteria you like.
I'm not sure what you're trying to achieve with the data[-c(seq(...)), ] row-removal step in your code above, but for the ddply below that, you just need to do:
ddply(import[import$Reqresponse == 9, ], .(Condition, Reqresponse, file), summarise, Score = mean(Score))
There's so much going on in the rest of your code that it's hard to make out exactly what you want.
I think the important thing is that to make this efficient, and easier to follow, you need to read your data in once, then work on that dataset - making subsets if necessary, doing summary stats or whatever else it is.
As an example of how you can work with this, here's an attempt to deal with your problem of removing trials (rows?) that have Reqresponse == 9, plus the following two. There are probably ways of doing this more efficiently, but this is loosely based on how you were doing it, to show briefly how to work with the larger data frame. It is now modified to also remove the first two trials of each file:
import.clean <- ddply(import, .(file), function(x) {
  index <- which(x$Reqresponse == 9)
  if (length(index) > 0) {
    index <- unique(c(index, index + 1, index + 2, 1, 2))
  } else {
    index <- c(1, 2)
  }
  x <- x[-index, ]
  return(x)
})
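From there, each of the original four summaries can be computed with a single ddply call on the combined data frame; for example (a sketch using the column names from the question, analogous to OtherScoreframe but without the Actresponse/controlrepeatedcue filtering):
other_scores <- ddply(import.clean[import.clean$Reqresponse %in% c("a", "b"), ],
                      .(Condition, Reqresponse, file),
                      summarise, Score = mean(Score))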
