Loop tasks by replacing one unique filename part by another - r

I am new to R and have just written some code, which works fine. I would like to loop it so that it also applies to the other 41 identically structured data frames.
The input files are called "weatherdata.. + UNIQUE NUMBER"; I would like to call the output files "df + UNIQUE NUMBER".
The code I have written currently applies only to the file weatherdata..5341. I could just press CTRL + F, replace every 5341, and run it again, which is easy to do. But could I also do this with some sort of loop? Or do you know a good tutorial that could teach me how to do this? I have seen a tutorial on the for loop, but I couldn't figure out how to apply it to my code.
A small part of the code is provided below! I think that if the loop works on the code given below it will also work for the rest of the code. All help appreciated! :)
#List of part of the datafiles just 4 out of 42 files
list.dat <- list(weatherdata..5341,weatherdata..5344, weatherdata..5347,
weatherdata..5350)
# add column with date (month) as a decimal number
weatherdata..5341$Month <- format( as.Date(weatherdata..5341$Date) , "%m")
# convert to date if not already
weatherdata..5341$Date <- as.Date(weatherdata..5341$Date, "%d-%m-%Y")
#Try rename columns
colnames(weatherdata..5341)[colnames(weatherdata..5341)=="Max.Temperature"] <- "TMPMX"
# store as a vector
v1 <- unlist(Tot1)
# store in outputfile dataframe
Df5341<- as.data.frame.list(v1)

You can create a list of all the data frames and then use sapply to loop through each one of them. Here is some sample code:
> v1 <- list(data.frame(x = c(1,2), y = c('a', 'b')), data.frame(x = c(3,4), y = c('c', 'd')))
> v1
[[1]]
  x y
1 1 a
2 2 b

[[2]]
  x y
1 3 c
2 4 d

> sapply(v1 , function(x){(x$x <- x$x/4)})
     [,1] [,2]
[1,] 0.25 0.75
[2,] 0.50 1.00
Then you can replace content inside the function. Hope this helps.
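Note that the sapply call above returns only the computed values, not the data frames themselves. If the goal is to get modified data frames back, one possible variant is to have the function return the whole data frame and to use lapply. This is a minimal sketch on the same toy data; the helper name `quarter_x` is made up for illustration.

```r
# Toy data, same shape as the example above
v1 <- list(
  data.frame(x = c(1, 2), y = c("a", "b")),
  data.frame(x = c(3, 4), y = c("c", "d"))
)

# Hypothetical helper: modify one data frame and return it
quarter_x <- function(df) {
  df$x <- df$x / 4
  df  # returning df keeps all columns, not just x
}

# lapply always returns a list, so each element is still a data frame
v2 <- lapply(v1, quarter_x)
v2[[2]]
```

The body of `quarter_x` is the place to put the real per-file steps.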

Something like this should work:
## Assuming that your files are CSV files and are alone in the folder
Fnames <- list.files() # this pulls the names of all the files
# in your working directory
Data <- lapply(Fnames, read.csv)
for (i in seq_along(Data)) {
  # Put your code in here, replacing the data frame names with Data[[i]]
  # for example:
  # convert to date if not already (done first, so the month is read from a proper Date)
  Data[[i]]$Date <- as.Date(Data[[i]]$Date, "%d-%m-%Y")
  # add column with date (month) as a decimal number
  Data[[i]]$Month <- format(Data[[i]]$Date, "%m")
  # try to rename columns
  colnames(Data[[i]])[colnames(Data[[i]]) == "Max.Temperature"] <- "TMPMX"
  # And so on..
}
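To carry the unique number through to the output names, one option is to name the list elements rather than create 42 separate objects. The following is a minimal sketch, assuming the inputs are named like `weatherdata..5341`; the toy data frames and the helper `process_one` are invented for illustration.

```r
# Toy stand-ins for two of the weatherdata..NNNN data frames
weather <- list(
  "weatherdata..5341" = data.frame(Date = c("01-02-2010", "15-03-2010"),
                                   Max.Temperature = c(3.1, 8.4)),
  "weatherdata..5344" = data.frame(Date = c("20-07-2011", "05-12-2011"),
                                   Max.Temperature = c(24.0, 1.5))
)

# Hypothetical helper holding the per-file steps
process_one <- function(df) {
  df$Date  <- as.Date(df$Date, "%d-%m-%Y")
  df$Month <- format(df$Date, "%m")
  colnames(df)[colnames(df) == "Max.Temperature"] <- "TMPMX"
  df
}

out <- lapply(weather, process_one)

# Replace the "weatherdata.." prefix with "df", keeping the unique number
names(out) <- sub("^weatherdata\\.\\.", "df", names(weather))

names(out)
out$df5341
```

Each result is then available as `out$df5341`, `out[["df5344"]]`, and so on.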

Related

I would like to automate the reading of PDF documents into R using pdf_text

I currently have code that extracts certain details from a PDF document. However, as I have thousands of other PDF documents to extract information from, I would like to automate this process. I am using the pdf_text function to read PDFs into R. My code looks something like this:
library(pdftools)
library(stringr) # provides str_split()
x <- pdf_text("Test.pdf")
y1 <- str_split(x, "\r")
#pdf output contains a total of 7 lists
a <- y1 [[4]]
b <- c(a[4],a[11:13]) #Obtain only rows 4, 11 to 13 from list 4
n2 <- y1[[3]]
n3 <- c(n2[3]) #Obtain only rows 3 from list 3
n <- y1[[5]]
n1 <- c(n[3]) #Obtain only rows 3 from list 5
c <- y1[[6]]
d <- c(c[4:18]) #Obtain only rows 4 to 18 from list 6
e <- c(n3,b,d,n1) #Combining all necessary information into one list
z <- substr(e[1:21], start = 15, stop = 200) #to remove white spaces between quotes
Name <- z[1]
InterestedParty <- z[2]
TotalOwnBefore <- substr(z[11], start = 97, stop = 120)
Ownership <- list(NM = Name, Party = InterestedParty, OwnBefore = TotalOwnBefore)
write.csv(Ownership, file="MyData.csv")
The above code allows me to output a file for a single company. However, I have thousands of other PDFs ("Test_1.pdf" to "Test_1000.pdf") to read. Is there a way to automate the reading of the PDF files into R with pdf_text? It would also be great if there were a way to store all results in a single file instead of one file per firm.
I have since managed to automate the process using a for loop as follows:
for (i in 1:1000) {
  x <- paste("Test_", i, ".pdf", sep = "")
  y <- pdf_text(print(x))
  total <- strsplit(y, "\r")
  print(total)
}
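To store all results in a single file, a common pattern is to wrap the per-file extraction in a function that returns a one-row data frame, and then bind the rows together. The sketch below only illustrates that pattern. The helper `extract_one`, the fake reader, and the extracted fields are hypothetical, and the reader is passed in as an argument so the example runs without any PDF files. In real use one would presumably pass `pdftools::pdf_text` and replace the body with the row and character positions from the question.

```r
# Hypothetical helper: turn one document into a one-row data frame.
# `read_pages` stands in for pdftools::pdf_text (one string per page).
extract_one <- function(path, read_pages) {
  pages <- read_pages(path)
  lines <- strsplit(pages, "\r?\n")   # one character vector per page
  data.frame(
    File      = basename(path),
    FirstLine = trimws(lines[[1]][1]),
    Pages     = length(pages),
    stringsAsFactors = FALSE
  )
}

# Fake reader so the sketch runs without any PDF files
fake_reader <- function(path) c(paste("Name:", path, "\nline 2"), "page 2")

files   <- paste0("Test_", 1:3, ".pdf")
results <- do.call(rbind, lapply(files, extract_one, read_pages = fake_reader))

# One CSV for all firms instead of one file per firm
out_file <- file.path(tempdir(), "AllData.csv")
write.csv(results, out_file, row.names = FALSE)
```

Because every call returns a data frame with the same columns, `rbind` stacks them into one table with one row per PDF.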

R: How to change data in a column across multiple files. Help understanding lapply

I have a folder with about 160 files that are formatted with three columns: onset time, variable1 'x', and variable 2 'y'. Onset is listed in R as a string, but it is a time variable which is Hour:Minute:Second:FractionalSecond. I need to remove the fractional second. If I could round that would be great, but it would be okay to just remove the fractional second using something like substr(file$onset,1,8).
My files are named in a format similar to File001 File002 File054 File1001
onset X Y
00:55:17:95 3 3
00:55:29:66 3 4
00:55:31:43 3 3
01:00:49:24 3 3
01:02:00:03
I am trying to use lapply. lapply seems simple, but I'm having a hard time figuring it out. The code written below returns an error that the final line doesn't have 3 elements. For my final output it is important that my last line only have the value for onset.
lapply(files, function(x) {
t <- read.table(x, header=T) # load file
t$onset<-substr(t$onset,1,8)
out <- function(t)
# write to file
write.table(out, "filepath", sep="\t", quote=F, row.names=F, col.names=T)
})
First read all the text files into one data frame; then you can apply the strptime and format functions to the onset column to remove the fractional second.
filelist <- list.files(pattern = "\\.txt")
alltxt.files <- list() # create a list to populate with table data (if you want to bind all the rows together)
count <- 1
for (file in filelist) {
  # fill = TRUE pads the incomplete last line with NA instead of raising an error
  dat <- read.table(file, header = TRUE, fill = TRUE)
  alltxt.files[[count]] <- dat # create a list of data frames from the txt files
  count <- count + 1
}
allfiles <- do.call(rbind.data.frame, alltxt.files)
allfiles$onset <- strptime(allfiles$onset, "%H:%M:%S")
allfiles$onset <- format(allfiles$onset, "%H:%M:%S")
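If each file should be rewritten separately rather than combined, the lapply approach from the question can be made to work along the following lines. This is a sketch on a made-up file in a temporary folder; the folder names are assumptions, and it truncates the fractional second with substr rather than rounding.

```r
# Create a small example file in a temporary folder (made-up data)
in_dir  <- file.path(tempdir(), "onset_in")
out_dir <- file.path(tempdir(), "onset_out")
dir.create(in_dir,  showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
writeLines(c("onset X Y",
             "00:55:17:95 3 3",
             "00:55:29:66 3 4",
             "01:02:00:03"),
           file.path(in_dir, "File001.txt"))

files <- list.files(in_dir, full.names = TRUE)

invisible(lapply(files, function(f) {
  # fill = TRUE pads the short last line with NA instead of raising an error
  d <- read.table(f, header = TRUE, fill = TRUE, stringsAsFactors = FALSE)
  d$onset <- substr(d$onset, 1, 8)   # keep HH:MM:SS only
  # na = "" writes the padded cells as empty fields
  write.table(d, file.path(out_dir, basename(f)),
              sep = "\t", quote = FALSE, row.names = FALSE, na = "")
}))

readLines(file.path(out_dir, "File001.txt"))
```

The last line of each output file then holds the onset value followed by two empty tab-separated fields. Writing to a separate output folder keeps the original files untouched.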

R: Repeat script n times, changing variables in each iteration

I have a script that I want to repeat n times, where some variables are changed by 1 each iteration. I'm creating a data frame consisting of the standard deviation of the difference of various vectors. My script currently looks like this:
standard.deviation <- data.frame(
  c(
    sd(diff(t1[,1])),
    sd(diff(t1[,2])),
    sd(diff(t1[,3])),
    sd(diff(t1[,4])),
    sd(diff(t1[,5]))
  ),
  c(
    sd(diff(t2[,1])),
    sd(diff(t2[,2])),
    sd(diff(t2[,3])),
    sd(diff(t2[,4])),
    sd(diff(t2[,5]))
  ),
  c(
    sd(diff(t3[,1])),
    sd(diff(t3[,2])),
    sd(diff(t3[,3])),
    sd(diff(t3[,4])),
    sd(diff(t3[,5]))
  )
)
I want to write the script creating the vector only once, and repeat it n times (n=3 in this example) so that I end up with n vectors. In each iteration, I want to add 1 to a variable (in this case: 1 -> 2 -> 3, so the number next to 't'). t1, t2 and t3 are all separate data frames, and I can't figure out how to loop a script with changing data frame names.
1) How to make this happen?
2) I would also like to divide each sd value in a row by the row number. How would I do this?
3) I will be using 140 data frames in total. Is there a way to call all of these with a simple function, rather than making a list and adding each of the 140 data frames individually?
Use functions to get a more readable code:
set.seed(123) # so you'll get the same number as this example
t1 <- t2 <- t3 <- data.frame(replicate(5,runif(10)))
# make a function for your sd of diff
sd.cols <- function(data) {
# loop over the df columns
sapply(data,function(x) sd(diff(x)))
}
# make a list of your data frames
dflist <- list(sdt1=t1,sdt2=t2,sdt3=t3)
# Loop over the list
result <- data.frame(lapply(dflist,sd.cols))
Which gives:
> result
sdt1 sdt2 sdt3
1 0.4887692 0.4887692 0.4887692
2 0.5140287 0.5140287 0.5140287
3 0.2137486 0.2137486 0.2137486
4 0.3856857 0.3856857 0.3856857
5 0.2548264 0.2548264 0.2548264
Assuming that you always want to use columns 1 to 5...
# some data
t3 <- t2 <- t1 <- as.data.frame(matrix(rnorm(100),10,10))
# script itself
lis=list(t1,t2,t3)
sapply(lis,function(x) sapply(x[,1:5],function(y) sd(diff(y))))
# [,1] [,2] [,3]
# V1 1.733599 1.733599 1.733599
# V2 1.577737 1.577737 1.577737
# V3 1.574130 1.574130 1.574130
# V4 1.158639 1.158639 1.158639
# V5 0.999489 0.999489 0.999489
The output is a matrix, so as.data.frame should fix that.
For completeness: As @Tensibai mentions, you can just use mget(ls(pattern="^t[0-9]+$")), which already returns a named list, assuming that all your variables are t followed by a number.
Edit: Thanks to @Tensibai for pointing out a missing step and improving the code, and the mget step.
You can iterate through a list of the data frames...
ans <- data.frame()
dats <- list(t1, t2, t3) # a list keeps each data frame intact; c() would flatten them into columns
for (k in dats) {
  temp <- c()
  for (k2 in 1:5) {
    temp <- c(temp, sd(diff(k[, k2])))
  }
  ans <- rbind(ans, temp)
}
rownames(ans) <- c("t1", "t2", "t3")
colnames(ans) <- c(1, 2, 3, 4, 5)
attr(ans, "title") <- "standard deviation"
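Points 2 and 3 of the question (dividing each value by its row number, and collecting many data frames without listing them by hand) can be combined in one short script. The following is a sketch under the assumption that the data frames are named t followed by a number; the toy data and the separate environment `env` are made up so the example does not touch the global workspace.

```r
# Made-up data frames named t1, t2, t3, kept in their own environment
env <- new.env()
set.seed(1)
for (i in 1:3) {
  assign(paste0("t", i), data.frame(replicate(5, runif(10))), envir = env)
}

# Point 3: collect every object called t<number> without listing them by hand
tnames <- ls(env, pattern = "^t[0-9]+$")
tnames <- tnames[order(as.integer(sub("^t", "", tnames)))]  # t2 before t10
dflist <- mget(tnames, envir = env)

# Point 1: sd of the differences, columns 1 to 5 of each data frame
sds <- sapply(dflist, function(d) sapply(d[, 1:5], function(col) sd(diff(col))))

# Point 2: divide each row by its row number
scaled <- sds / row(sds)
scaled
```

`row(sds)` is a matrix of the same shape as `sds` that holds the row index in every cell, so the division is done element by element. With objects in the global workspace, `mget(tnames)` without the `envir` argument should do the same job.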

Merge and name data frames in for loop

I have a bunch of DF named like: df1, df2, ..., dfN
and lt1, lt2, ..., ltN
I would like to merge them in a loop, something like:
for (X in 1:N){
outputX <- merge(dfX, ltX, ...)
}
But I have some trouble getting the names of output, dfX, and ltX to change in each iteration. I realize that plyr/data.table/reshape might offer an easier way, but I would like the for loop to work.
Perhaps I should clarify. The DF are quite large, which is why plyr etc. will not work (they crash). I would like to avoid copying.
The next step in the code is to save the merged DF.
This is why I prefer the for-loop approach, since I know what each merged DF is named in the environment.
You can combine data frames into lists and use mapply, as in the example below:
i <- 1:3
d1.a <- data.frame(i=i,a=letters[i])
d1.b <- data.frame(i=i,A=LETTERS[i])
i <- 11:13
d2.a <- data.frame(i=i,a=letters[i])
d2.b <- data.frame(i=i,A=LETTERS[i])
L1 <- list(d1.a, d2.a)
L2 <- list(d1.b, d2.b)
mapply(merge,L1,L2,SIMPLIFY=F)
# [[1]]
# i a A
# 1 1 a A
# 2 2 b B
# 3 3 c C
#
# [[2]]
# i a A
# 1 11 k K
# 2 12 l L
# 3 13 m M
If you'd like to save each of the resulting data frames in the global environment (I'd advise against it though), you could do:
result <- mapply(merge,L1,L2,SIMPLIFY=F)
names(result) <- paste0('output',seq_along(result))
which will give a name to every data frame in the list, and then:
sapply(names(result),function(s) assign(s,result[[s]],envir = globalenv()))
Please note that this is a base R solution that does essentially the same thing as your sample code.
If your data frames are in a list, writing a for loop is trivial:
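# lt = list(lt1, lt2, lt3, ...)
# if your data is very big, this may run you out of memory
# the anchored pattern matches only lt1, lt2, ... and not other names containing "lt"
lt = lapply(ls(pattern = "^lt[0-9]+$"), get)
merged_data = merge(lt[[1]], lt[[2]])
# seq_along(lt)[-(1:2)] is empty when there are only two data frames, unlike 3:length(lt)
for (i in seq_along(lt)[-(1:2)]) {
  merged_data = merge(merged_data, lt[[i]])
  save(merged_data, file = paste0("merging", i, ".rda"))
}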
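For the literal for loop asked for in the question, where dfX and ltX are looked up by name and each result is stored as outputX, base R's get and assign can be used. The sketch below uses made-up toy data, an `id` key column, and a separate environment `env`, all of which are assumptions for illustration. The caveat from the first answer about creating many objects by name still applies.

```r
# Made-up inputs named df1..df2 and lt1..lt2, kept in a separate environment
env <- new.env()
for (X in 1:2) {
  assign(paste0("df", X), data.frame(id = 1:3, a = letters[1:3 + X]), envir = env)
  assign(paste0("lt", X), data.frame(id = 1:3, b = LETTERS[1:3 + X]), envir = env)
}

N <- 2
for (X in seq_len(N)) {
  # get() looks an object up by its name; assign() creates one by name
  merged <- merge(get(paste0("df", X), envir = env),
                  get(paste0("lt", X), envir = env),
                  by = "id")
  assign(paste0("output", X), merged, envir = env)
  # Saving right away means each result can be removed from memory afterwards
  saveRDS(merged, file.path(tempdir(), paste0("output", X, ".rds")))
}

ls(env)
```

If memory is the main concern, the `assign` line could be dropped so that only the saved file keeps each merged result.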

In R: How to perform a str() on multiple files

How could I go about performing a str() function in R on all of these files loaded in the workspace at the same time? I simply want to export this information out, but in a batch-like process, to a .csv file. I have over 100 of them, and want to compare one workspace with another to help locate incongruities in data structure and avoid mismatches.
I came painfully close to a solution via UCLA's R Code Fragment, however, they failed to include the instructions for how to form the read.dta function which loops through the files. That is the part I need help on.
What I have so far:
#Define the file path
f <- file.path("C:/User/Datastore/RData")
#List the files in the path
fn <- list.files(f)
#loop through file list, return str() of each .RData file
#write to .csv file with 4 columns (file name, length, size, value)
EDIT
Here is an example of what I am after (the view from RStudio--it simply lists the Name, Type, Length, Size, and Value of all of the RData Files). I want to basically replicate this view, but export it out to a .csv. I am adding the tag to RStudio in case someone might know a way of exporting this table out automatically? I couldn't find a way to do it.
Thanks in advance.
I've actually written a function for this already. I also asked a question about it, and dealing with promise objects with the function. That post might be of some use to you.
The issue with the last column is that str is not meant to do anything but print a compact description of objects and therefore I couldn't use it (but that's been changed with recent edits). This updated function gives a description for the values similar to that of the RStudio table. The data frames and lists are tricky because their str output is more than one line. This should be good.
objInfo <- function(env = globalenv())
{
obj <- mget(ls(env), env)
out <- lapply(obj, function(x) {
vals1 <- c(
Type = toString(class(x)),
Length = length(x),
Size = object.size(x)
)
val2 <- gsub("|^\\s+|'data.frame':\t", "", capture.output(str(x))[1])
if(grepl("environment", val2)) val2 <- "Environment"
c(vals1, Value = val2)
})
out <- do.call(rbind, out)
rownames(out) <- seq_len(nrow(out))
noquote(cbind(Name = names(obj), out))
}
And then we can test it out on a few objects..
x <- 1:10
y <- letters[1:5]
e <- globalenv()
df <- data.frame(x = 1, y = "a")
m <- matrix(1:6)
l <- as.list(1:5)
objInfo()
# Name Type Length Size Value
# 1 df data.frame 2 1208 1 obs. of 2 variables
# 2 e environment 11 56 Environment
# 3 l list 5 328 List of 5
# 4 m matrix 6 232 int [1:6, 1] 1 2 3 4 5 6
# 5 objInfo function 1 24408 function (env = globalenv())
# 6 x integer 10 88 int [1:10] 1 2 3 4 5 6 7 8 9 10
# 7 y character 5 328 chr [1:5] a b c d e
Which is pretty close I guess. Here's the screen shot of the environment in RStudio.
I would write a function, something like the one below, and then loop over the files with it, so you basically write the code for a single dataset:
library(foreign)
giveSingleDataset <- function( oneFile ) {
#Read .dta file
df <- read.dta( oneFile )
#Give e.g. structure
s <- ls.str(df)
#Return what you want
return(s)
}
#Actually call the function
result <- lapply( fn, giveSingleDataset )
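Since the files in the question are .RData files rather than Stata .dta files, a variant of the same idea can use load() instead of read.dta. The following is a sketch on two made-up files in a temporary folder. The helper `describe_file`, the output file name, and the column set are assumptions, and the Value column is simply the first line of the str() output.

```r
# Create two small .RData files in a temporary folder (made-up contents)
f <- file.path(tempdir(), "rdata_demo")
dir.create(f, showWarnings = FALSE)
x <- 1:10
y <- letters[1:5]
save(x, y, file = file.path(f, "first.RData"))
df <- data.frame(x = 1, y = "a")
save(df, file = file.path(f, "second.RData"))

# Hypothetical helper: describe every object stored in one .RData file
describe_file <- function(path) {
  env  <- new.env()
  objs <- load(path, envir = env)   # load() returns the names it restored
  do.call(rbind, lapply(objs, function(nm) {
    obj <- get(nm, envir = env)
    data.frame(
      File   = basename(path),
      Name   = nm,
      Type   = toString(class(obj)),
      Length = length(obj),
      Size   = as.numeric(object.size(obj)),
      Value  = trimws(capture.output(str(obj))[1]),
      stringsAsFactors = FALSE
    )
  }))
}

# full.names = TRUE so the files are found regardless of the working directory
fn   <- list.files(f, pattern = "\\.RData$", full.names = TRUE)
info <- do.call(rbind, lapply(fn, describe_file))
write.csv(info, file.path(f, "structure.csv"), row.names = FALSE)
```

Loading each file into its own environment keeps objects with the same name in different files from overwriting each other, which matters when comparing one workspace with another.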

Resources