R: Repeat script n times, changing variables in each iteration - r

I have a script that I want to repeat n times, where some variables are changed by 1 each iteration. I'm creating a data frame consisting of the standard deviation of the difference of various vectors. My script currently looks like this:
standard.deviation <- data.frame(
  c(
    sd(diff(t1[,1])),
    sd(diff(t1[,2])),
    sd(diff(t1[,3])),
    sd(diff(t1[,4])),
    sd(diff(t1[,5]))
  ),
  c(
    sd(diff(t2[,1])),
    sd(diff(t2[,2])),
    sd(diff(t2[,3])),
    sd(diff(t2[,4])),
    sd(diff(t2[,5]))
  ),
  c(
    sd(diff(t3[,1])),
    sd(diff(t3[,2])),
    sd(diff(t3[,3])),
    sd(diff(t3[,4])),
    sd(diff(t3[,5]))
  )
)
I want to write the script creating the vector only once, and repeat it n times (n=3 in this example) so that I end up with n vectors. In each iteration, I want to add 1 to a variable (in this case: 1 -> 2 -> 3, so the number next to 't'). t1, t2 and t3 are all separate data frames, and I can't figure out how to loop a script with changing data frame names.
1) How to make this happen?
2) I would also like to divide each sd value in a row by the row number. How would I do this?
3) I will be using 140 data frames in total. Is there a way to call all of these with a simple function, rather than making a list and adding each of the 140 data frames individually?

Use functions to get more readable code:
set.seed(123) # so you'll get the same number as this example
t1 <- t2 <- t3 <- data.frame(replicate(5,runif(10)))
# make a function for your sd of diff
sd.cols <- function(data) {
# loop over the df columns
sapply(data,function(x) sd(diff(x)))
}
# make a list of your data frames
dflist <- list(sdt1=t1,sdt2=t2,sdt3=t3)
# Loop over the list
result <- data.frame(lapply(dflist,sd.cols))
Which gives:
> result
sdt1 sdt2 sdt3
1 0.4887692 0.4887692 0.4887692
2 0.5140287 0.5140287 0.5140287
3 0.2137486 0.2137486 0.2137486
4 0.3856857 0.3856857 0.3856857
5 0.2548264 0.2548264 0.2548264
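For the second part of the question (dividing each sd value by its row number), one possible sketch is below. It assumes the result is laid out as above, with one row per column of the original data frames. The toy data and the name scaled are illustrative, not from the question.

```r
# Toy data: three data frames with the same five columns (stand-ins for t1, t2, t3)
set.seed(1)
dflist <- list(
  t1 = data.frame(replicate(5, runif(10))),
  t2 = data.frame(replicate(5, runif(10))),
  t3 = data.frame(replicate(5, runif(10)))
)

# sd of the first differences, column by column
sd.cols <- function(data) sapply(data, function(x) sd(diff(x)))

result <- data.frame(lapply(dflist, sd.cols))

# Divide every value in row i by i.
# The divisor vector 1, 2, ..., nrow(result) is recycled down each column.
scaled <- result / seq_len(nrow(result))
```

Because the divisor has exactly one entry per row, every column is divided element by element by the same row numbers.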

Assuming that you always want to use columns 1 to 5...
# some data
t3 <- t2 <- t1 <- as.data.frame(matrix(rnorm(100),10,10))
# script itself
lis=list(t1,t2,t3)
sapply(lis,function(x) sapply(x[,1:5],function(y) sd(diff(y))))
# [,1] [,2] [,3]
# V1 1.733599 1.733599 1.733599
# V2 1.577737 1.577737 1.577737
# V3 1.574130 1.574130 1.574130
# V4 1.158639 1.158639 1.158639
# V5 0.999489 0.999489 0.999489
The output is a matrix, so as.data.frame should fix that.
For completeness: as @Tensibai mentions, you can build the list with mget(ls(pattern="^t[0-9]+$")), which returns a named list of the matching objects, assuming that all your variables are named t followed by a number.
Edit: Thanks to @Tensibai for pointing out a missing step and improving the code, and the mget step.
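A minimal sketch of the mget approach for collecting many data frames at once. It uses a private environment (env, a name made up for this example) so that it does not depend on what is in your workspace; in an interactive session you would drop the envir arguments. The numeric reordering is an extra step added here because ls() sorts names alphabetically.

```r
# Private environment holding a few toy objects
env <- new.env()
env$t1    <- data.frame(a = c(1, 3, 6), b = c(2, 2, 5))
env$t2    <- data.frame(a = c(0, 1, 0), b = c(4, 8, 9))
env$t10   <- data.frame(a = c(5, 5, 5), b = c(1, 0, 1))
env$total <- 99  # not matched by the pattern below

# Names that are exactly "t" followed by digits
nms <- ls(envir = env, pattern = "^t[0-9]+$")

# ls() sorts alphabetically (t1, t10, t2), so reorder by the numeric suffix
nms <- nms[order(as.integer(sub("^t", "", nms)))]

# Named list of the data frames
dflist <- mget(nms, envir = env)

# sd of the differences for every column of every data frame
sds <- sapply(dflist, function(d) sapply(d, function(x) sd(diff(x))))
```

The result is a matrix with one column per data frame, named after the original objects.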

You can iterate through a list of the ts...
ans <- data.frame()
# use list(), not c(): c() would flatten the data frames into their columns
dats <- list(t1, t2, t3)
for (k in dats){
  temp <- c()
  for (k2 in 1:5){
    # sd of the differences, as asked in the question
    temp <- c(temp , sd(diff(k[,k2])))
  }
  ans <- rbind(ans,temp)
}
rownames(ans) <- c("t1","t2","t3")
colnames(ans) <- 1:5
attr(ans,"title") <- "standard deviation"

Related

How do I subset a vector while retaining row names?

I am looking for differentially expressed genes in a data set. After using my function to determine fold change, I am given a vector that returns the gene names and fold change which looks like this:
df1
[,1]
gene1074 1.1135131
gene22491 1.0668137
gene15416 0.9840414
gene18645 1.1101060
gene4068 1.0055899
gene19043 1.1463878
I want to look for anything that has a greater than 2 fold change, so to do this I execute:
df2 <- subset(df1 >= 2)
which returns the following:
head(df2)
[,1]
gene1074 FALSE
gene22491 FALSE
gene15416 FALSE
gene18645 FALSE
gene4068 FALSE
gene19043 FALSE
and that is not what I'm looking for.
I've tried another subsetting method:
df2 <- df1[df1 >= 2]
which returns:
head(df2)
[1] 4.191129 127.309557 2.788121 2.090916 11.382345 2.186330
Now that is the values that are over 2, but I've lost the gene names that came along with them.
How would I go about subsetting my data so that it returns in the following format:
head(df2)
[,1]
genex 4.191129
geney 127.309557
genez 2.788121
genea 2.090916
geneb 11.382345
Or something at least approximating that format, where I am given the gene and its corresponding fold change value.
You are looking for subsetting like so:
df2 <- df1[df1[, 1] >= 2, ]
To show you on some data:
# Create some toy data
df1 <- data.frame(val = rexp(100))
rownames(df1) <- paste0("gene", 1:100)
head(df1)
# val
#gene1 0.9295632
#gene2 1.2090513
#gene3 0.1550578
#gene4 1.7934942
#gene5 0.7286462
#gene6 1.8424025
Now we take the first column of df1 and compare to 2 (df1[,1] > 2). The output of that (a logical vector) is used to select the rows which fulfill the criteria:
df2 <- df1[df1[,1] > 2, ]
head(df2)
#[1] 2.705683 3.410672 3.544905 3.695313 2.523586 2.229879
Using the drop = FALSE keeps the output as a data.frame:
df3 <- df1[df1[,1] > 2, ,drop = FALSE]
head(df3)
# val
#gene8 2.705683
#gene9 3.410672
#gene22 3.544905
#gene23 3.695313
#gene38 2.523586
#gene42 2.229879
The same can be achieved by
subset(df1, subset = val > 2)
or
subset(df1, subset = df1[,1] > 2)
The former of these two expressions does not work in your case as it appears you have not named the columns.
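The [,1] header in the question's printed output suggests that df1 may be a one-column matrix rather than a data frame. Under that assumption, a small sketch of both ways to keep the gene names (the values and the names fc, keep, big and named are made up for illustration):

```r
# One-column matrix with gene names as row names (toy values)
fc <- matrix(
  c(1.11, 4.19, 0.98, 127.31, 2.79),
  ncol = 1,
  dimnames = list(c("gene1074", "genex", "gene15416", "geney", "genez"), NULL)
)

# Logical vector: TRUE for rows with at least a 2-fold change
keep <- fc[, 1] >= 2

# drop = FALSE keeps the result as a one-column matrix with row names
big <- fc[keep, , drop = FALSE]

# Selecting the column explicitly gives a named numeric vector instead
named <- fc[keep, 1]
```

Indexing with fc[fc >= 2] loses the names because a logical matrix index treats fc as a plain vector; indexing rows keeps them.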
You can also compute the positions in the data that correspond to your predicate, and use them for indexing:
# create some test data
df <- read.csv(
textConnection(
"g, v
gene1074, 1.1135131
gene22491, 1.0668137
gene15416, 0.9840414
gene18645, 1.1101060
gene4068, 1.0055899
gene19043, 1.1463878"
))
# positions that match a given predicate
idx <- which(df$v > 1)
# indexing "as usual"
df[idx, ]
Output:
g v
1 gene1074 1.113513
2 gene22491 1.066814
4 gene18645 1.110106
5 gene4068 1.005590
6 gene19043 1.146388
I find this code reads quite nicely and is pretty intuitive, but that might just be my opinion.

Creating a data frame from looping through a list of data frames in R

I have a large data set that is organized as a list of 1044 data frames. Each data frame is a profile that holds the same data for a different station and time. I am trying to create a data frame that holds the output of my function fitsObs, but my current code only goes through a single data frame. Any ideas?
i=1
start=1
for(i in 1:1044){
station1 <- surveyCTD$stations[[i]]
df1 <- surveyCTD$data[[i]]
date1 <- surveyCTD$dates[[i]]
fitObs <- fitTp2(-df1$depth, df1$temp)
if(start==1){
start=0
dfout <- data.frame(
date=date1
,station=station1
)
names(fitObs) <- paste0(names(fitObs),"o")
dfout<-cbind(dfout, df1$temp, df1$depth)
dfout <- cbind(dfout, fitObs)
}
}
At first glance, I would try two things to debug it. First, print the head of a data frame inside the loop to understand what the loop is doing. Then check the scope of your variable dfout; it looks like it is only assigned inside the loop.
Also, setting i = 1 before the loop has no effect, because the for loop reassigns i on every iteration.
I have created a reproducible example of my best guess as to what you are asking. I also assume that you are able to adjust the concepts in this general example to suit your own problem. It's easier if you provide an example of your list in future.
First we create some reproducible data
a <- c(10,20,30,40)
b <- c(5,10,15,20)
c <- c(20,25,30,35)
df1 <- data.frame(x=a+1,y=b+1,z=c+1)
df2 <- data.frame(x=a,y=b,z=c)
ls1 <- list(df1,df2)
Which looks like this
print(ls1)
[[1]]
x y z
1 11 6 21
2 21 11 26
3 31 16 31
4 41 21 36
[[2]]
x y z
1 10 5 20
2 20 10 25
3 30 15 30
4 40 20 35
So we now have two dataframes within a single list. The following code goes through the columns of each dataframe in the list and applies mean() to each column. To work on rows instead, pass 1 rather than 2 as the second argument of apply().
df <- do.call("rbind", lapply(ls1, function(x) apply(x,2,mean)))
df <- as.data.frame(df)
print(df)
x y z
1 26 13.5 28.5
2 25 12.5 27.5
You should be able to replace mean() with whatever function you have written for your data. Let me know if this helps.
Consider building a generalized function to be called with Map (a wrapper around mapply, the multiple-argument, elementwise iterator of the apply family) to build a list of data frames, each with your fitObs output. Pass objects of equal length into the data.frame() constructor.
Then, outside the loop, run do.call for a final, single appended data frame of all 1,044 smaller data frames (assuming each has exactly the same column names and number of columns):
# GENERALIZED FUNCTION
add_fit_obs <- function(dt, st, df) {
fitObs <- fitTp2(-df$depth, df$temp)
names(fitObs) <- paste0(names(fitObs),"o")
tmp <- data.frame(
date = dt,
station = st,
depth = df$depth,
temp = df$temp,
fitObs
)
return(tmp)
}
# LIST OF DATA FRAMES
# arguments are passed in the function's (dt, st, df) order
df_list <- Map(add_fit_obs, surveyCTD$dates, surveyCTD$stations, surveyCTD$data)
# EQUIVALENTLY:
# df_list <- mapply(add_fit_obs, surveyCTD$dates, surveyCTD$stations, surveyCTD$data, SIMPLIFY=FALSE)
# SINGLE DATAFRAME
master_df <- do.call(rbind, df_list)
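Since fitTp2 and surveyCTD are not shown in the question, here is a self-contained sketch of the same Map plus do.call(rbind, ...) pattern. fit_stub is a hypothetical stand-in for fitTp2 that simply returns a named numeric vector, and survey is toy data with the same three-part layout assumed above.

```r
# Hypothetical stand-in for fitTp2(): returns a named numeric vector
fit_stub <- function(z, temp) c(min = min(temp), max = max(temp))

# Toy stand-in for surveyCTD: parallel lists of stations, dates and profiles
survey <- list(
  stations = list("A", "B"),
  dates    = list(as.Date("2020-01-01"), as.Date("2020-01-02")),
  data     = list(
    data.frame(depth = c(1, 2, 3), temp = c(10, 9, 7)),
    data.frame(depth = c(1, 2),    temp = c(12, 11))
  )
)

# One profile in, one data frame out
add_fit <- function(dt, st, df) {
  fit <- fit_stub(-df$depth, df$temp)
  names(fit) <- paste0(names(fit), "o")
  out <- data.frame(
    date    = rep(dt, nrow(df)),
    station = rep(st, nrow(df)),
    depth   = df$depth,
    temp    = df$temp
  )
  # Each fitted value becomes its own column, repeated on every row
  for (nm in names(fit)) out[[nm]] <- fit[[nm]]
  out
}

# Elementwise over the three lists, then stack the pieces
df_list <- Map(add_fit, survey$dates, survey$stations, survey$data)
master  <- do.call(rbind, df_list)
```

Every profile contributes as many rows as it has depth readings, so master here has five rows and the columns date, station, depth, temp, mino and maxo.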

Loop tasks by replacing one unique filename part by another

I am new to R and have just written a code, which works fine. I would like to loop this so that it also applies to the other identical 41 data.frames.
The inputfiles are called "weatherdata.. + UNIQUE NUMBER", the output files I would like to call "df + UNIQUE NUMBER".
The code I have written applies now only to the file weatherdata..5341. I could just press CTRL + F and replace all 5341 and run which is easy to do. But could I also do this with some sort of loop? or do you have a nice tutorial for me that could teach me how to do this? I have seen a tutorial with the for-loop but I couldn't figure out how to apply it for my code.
A small part of the code is provided below! I think that if the loop works on the code given below it will also work for the rest of the code. All help appreciated! :)
#List of part of the datafiles just 4 out of 42 files
list.dat <- list(weatherdata..5341,weatherdata..5344, weatherdata..5347,
weatherdata..5350)
# add colum with date(month) as a decimal number
weatherdata..5341$Month <- format( as.Date(weatherdata..5341$Date) , "%m")
# convert to date if not already
weatherdata..5341$Date <- as.Date(weatherdata..5341$Date, "%d-%m-%Y")
#Try rename columns
colnames(weatherdata..5341)[colnames(weatherdata..5341)=="Max.Temperature"] <- "TMPMX"
# store as a vector
v1 <- unlist(Tot1)
# store in outputfile dataframe
Df5341<- as.data.frame.list(v1)
You can create a list of all the data frames and then use sapply to loop through each one of them. Here is some sample code:
> v1 <- list(data.frame(x = c(1,2), y = c('a', 'b')), data.frame(x = c(3,4), y = c('c', 'd')))
> v1
[[1]]
x y
1 1 a
2 2 b
[[2]]
x y
1 3 c
2 4 d
> sapply(v1 , function(x){(x$x <- x$x/4)})
[,1] [,2]
[1,] 0.25 0.75
[2,] 0.50 1.00
Then you can replace content inside the function. Hope this helps.
Something like this should work:
## Assuming that your files are CSV files and are alone in the folder
Fnames <- list.files() # this pulls the names of all the files
# in your working directory
Data <- lapply(Fnames, read.csv)
for( i in seq_along(Data)){
# Put your code in here, replacing the Df names with Data[[i]]
# for example:
# add column with date(month) as a decimal number
Data[[i]]$Month <- format( as.Date(Data[[i]]$Date) , "%m")
# convert to date if not already
Data[[i]]$Date <- as.Date(Data[[i]]$Date, "%d-%m-%Y")
#Try rename columns
colnames(Data[[i]])[colnames(Data[[i]])=="Max.Temperature"] <- "TMPMX"
# And so on..
}
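To also get the "df + UNIQUE NUMBER" output names, one option is to put the per-file steps in a function and name the resulting list from the file names. The sketch below is self-contained: it writes two small CSV files to a temporary folder first, and it assumes the files are named weatherdata..NUMBER.csv with dates stored as dd-mm-yyyy. clean_one is a name made up for this example.

```r
# Create two toy input files in a temporary folder
tmp <- tempfile("weather")
dir.create(tmp)
for (id in c(5341, 5344)) {
  write.csv(
    data.frame(Date = c("01-02-2020", "15-03-2020"),
               Max.Temperature = c(5.1, 9.4)),
    file.path(tmp, paste0("weatherdata..", id, ".csv")),
    row.names = FALSE
  )
}

# All files that follow the naming scheme
files <- list.files(tmp, pattern = "^weatherdata\\.\\.[0-9]+\\.csv$",
                    full.names = TRUE)

# The steps that used to be written out for one data frame
clean_one <- function(path) {
  d <- read.csv(path, stringsAsFactors = FALSE)
  d$Date  <- as.Date(d$Date, "%d-%m-%Y")   # parse the dates first
  d$Month <- format(d$Date, "%m")          # then extract the month
  names(d)[names(d) == "Max.Temperature"] <- "TMPMX"
  d
}

out <- lapply(files, clean_one)

# Name each result "df" + the unique number taken from the file name
ids <- sub("^weatherdata\\.\\.([0-9]+)\\.csv$", "\\1", basename(files))
names(out) <- paste0("df", ids)
```

After this, out$df5341 or out[["df5341"]] gives the processed data for that file, so there is no need for 42 separately named objects.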

Merge and name data frames in for loop

I have a bunch of DF named like: df1, df2, ..., dfN
and lt1, lt2, ..., ltN
I would like to merge them in a loop, something like:
for (X in 1:N){
outputX <- merge(dfX, ltX, ...)
}
But I have some troubles getting the name of output, dfX, and ltX to change in each iteration. I realize that plyr/data.table/reshape might have an easier way, but I would like for loop to work.
Perhaps I should clarify. The DF are quite large, which is why plyr etc. will not work (they crash). I would like to avoid copying.
The next step in the code is to save the merged DF.
This is why I prefer the for-loop approach, since I know what each merged DF is named in the environment.
You can combine data frames into lists and use mapply, as in the example below:
i <- 1:3
d1.a <- data.frame(i=i,a=letters[i])
d1.b <- data.frame(i=i,A=LETTERS[i])
i <- 11:13
d2.a <- data.frame(i=i,a=letters[i])
d2.b <- data.frame(i=i,A=LETTERS[i])
L1 <- list(d1.a, d2.a)
L2 <- list(d1.b, d2.b)
mapply(merge,L1,L2,SIMPLIFY=FALSE)
# [[1]]
# i a A
# 1 1 a A
# 2 2 b B
# 3 3 c C
#
# [[2]]
# i a A
# 1 11 k K
# 2 12 l L
# 3 13 m M
If you'd like to save each of the resulting data frames in the global environment (I'd advise against it, though), you could do:
result <- mapply(merge,L1,L2,SIMPLIFY=FALSE)
names(result) <- paste0('output',seq_along(result))
which will give a name to every data frame in the list, and then:
sapply(names(result),function(s) assign(s,result[[s]],envir = globalenv()))
Note that this is a base R solution that does essentially the same thing as your sample code.
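If you would rather keep an explicit for loop, as the question asks, the changing names can be built with paste0 and looked up with get. The sketch below is only an illustration: it creates its toy dfX and ltX objects in a private environment (env), assumes they share a key column called id, and stores the merged results in a named list instead of separate variables.

```r
# Toy inputs: df1, df2 and lt1, lt2 in a private environment
env <- new.env()
N <- 2
for (x in seq_len(N)) {
  assign(paste0("df", x), data.frame(id = 1:3, a = x * (1:3)), envir = env)
  assign(paste0("lt", x), data.frame(id = 1:3, b = letters[1:3]), envir = env)
}

# Merge dfX with ltX for each X, storing the result under "outputX"
outputs <- list()
for (x in seq_len(N)) {
  left  <- get(paste0("df", x), envir = env)
  right <- get(paste0("lt", x), envir = env)
  outputs[[paste0("output", x)]] <- merge(left, right, by = "id")
}
```

Each merged data frame is then available as outputs$output1, outputs$output2 and so on, and could be saved inside the same loop.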
If your data frames are in a list, writing a for loop is trivial:
# lt = list(lt1, lt2, lt3, ...)
# if your data is very big, this may run you out of memory
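lt = lapply(ls(pattern = "^lt[0-9]+$"), get)  # anchored so that only lt1, lt2, ... match
merged_data = merge(lt[[1]], lt[[2]])
for (i in seq_along(lt)[-(1:2)]) {  # skipped when there are only two data frames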
merged_data = merge(merged_data, lt[[i]])
save(merged_data, file = paste0("merging", i, ".rda"))
}

Generate a sequence of Data frame from function

I searched but couldn't find a similar question, so apologies in advance if this is a duplicate. I am trying to generate data frames from within a for loop in R.
what I want to do:
Define each columns of each data frame by a function,
Generate n data frames (length of my sequence of data frame) using loop,
As example I will use n=100 :
n<-100
k<-8
d1 <- data.frame()
for(i in 1:(k)) {d1 <- rbind(d1,c(a="i+1",b="i-1",c="i/1"))}
d2 <- data.frame()
for(i in 1:(k+2)) {d2 <- rbind(d2,c(a="i+2",b="i-2",c="i/2"))}
...
d100 <- data.frame()
for(i in 1:(k+100)) {d100 <- rbind(d100,c(i+100, i-100, i/100))}
Clearly, it would be tedious to construct each data.frame one by one. I tried this:
d<-list()
for(j in 1:100) {
d[j] <- data.frame()
for(i in 1:(k+j)) {d[j] <- rbind(d[j] ,c(i+j, i-j, i/j))}
But I can't do anything with it; I run into an error:
Error in d[j] <- data.frame() : replacement has length zero
In addition: Warning messages:
1: In d[j] <- rbind(d[j], c(i + j, i - j, i/j)) :
number of items to replace is not a multiple of replacement length
And a few more remarks about your example:
the number of rows in each data frame are not the same : d1 has 8 rows, d2 has 10 rows, and d100 has 8+100 rows,
the algorithm should give us : D=(d1,d2, ... ,d100).
It would be great to get an answer using the same approach (rbind) and a more base like approach. Both will aid in my understanding. Of course, please point out where I'm going wrong if it's obvious.
Here's how to create an empty data.frame (and it's not what you are trying): see the question "Create an empty data.frame".
And you should not be creating 100 separate dataframes but rather a list of dataframes. I would not do it with rbind, since that would be very slow. Instead I would create them with a function that returns a dataframe of the required structure:
make_df <- function(n,var) {data.frame( a=(1:n)+var,b=(1:n)-var,c=(1:n)/var) }
mylist <- setNames(
lapply(1:100, function(n) make_df(n,n)) , # the dataframes
paste0("d_", 1:100)) # the names for access
head(mylist,3)
#---------------
$d_1
a b c
1 2 0 1
$d_2
a b c
1 3 -1 0.5
2 4 0 1.0
$d_3
a b c
1 4 -2 0.3333333
2 5 -1 0.6666667
3 6 0 1.0000000
Then if you want the "d_40" dataframe it's just:
mylist[[ "d_40" ]]
Or
mylist$d_40
If you want to perform the same operation or get a result from all of them at once, just use lapply:
lapply(mylist, nrow) # will be a list
Or:
sapply(mylist, nrow) #will be a vector because each value is the same length.
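For comparison, here is a sketch that follows the question's own loop more closely, assuming data frame j should have k + j rows with columns i + j, i - j and i / j (that row count is taken from the attempted loop, not confirmed by the asker). It shows a vectorised version and a corrected rbind version; the names make_d and D are made up for this example.

```r
k <- 8
n <- 100

# Vectorised: build data frame j in one call, with i running over 1..(k + j)
make_d <- function(j, k) {
  i <- seq_len(k + j)
  data.frame(a = i + j, b = i - j, c = i / j)
}
D <- setNames(lapply(seq_len(n), make_d, k = k), paste0("d", seq_len(n)))

# rbind version (slow, shown for the first three only).
# The key fix is d[[j]] instead of d[j]: [[ ]] refers to the element itself,
# while [ ] refers to a sub-list of length one.
d <- vector("list", 3)
for (j in 1:3) {
  d[[j]] <- data.frame()
  for (i in 1:(k + j)) {
    d[[j]] <- rbind(d[[j]], data.frame(a = i + j, b = i - j, c = i / j))
  }
}
```

Both versions produce the same values; the rbind loop copies the growing data frame on every iteration, which is why it becomes slow for large n.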
