Fairly new to R, so any guidance is appreciated.
GOAL: I'm trying to create hundreds of data frames in a short script. They follow a pattern, so I thought a for loop would suffice, but the assignment inside the loop seems to treat the loop variable as a literal name instead of using the value it holds. Here's an example:
# Defining some dummy variables for the sake of this example
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- c(1:338)
# (Theoretically) creating a separate dataframe for each of the terms in 'dfTitles'
for (i in dfTitles){
i <- data.frame(matrix(0, nrow = 4, ncol = 338, dimnames = list(Copes, Voxels)))
}
# Trying an alternative method
for (i in 1:length(dfTitles))
{dfTitles[i] <- data.frame(matrix(0, nrow = 4, ncol = 338, dimnames = list(Copes, Voxels)))}
This results in the creation of one dataframe named 'i', in the former, or a list of 4, in the case of the latter. Any ideas? Thank you!
PROBABLY UNNECESSARY BACKGROUND INFORMATION: We're using fMRI data to run an analysis which will run correlations across stimuli, brain voxels, brain regions, and participants. We're correlating whole matrices, so separating the values (aka COPEs) into separate dataframes by both Participant ID and Brain Region is going to make the next step much much easier. I already had tried the next step after having loaded and sorted the data into one large dataframe and it was a big pain in the butt.
rm(list = ls())
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- c(1:3)
# (Theoretically) creating a separate dataframe for each of the terms in 'dfTitles'
nr <- length(Voxels)
nc <- length(Copes)
N <- length(dfTitles) # Number of data frames, same as length of dfTitles
DF <- vector(N, mode="list")
for (i in 1:N){
DF[[i]] <- data.frame(matrix(rnorm(nr*nc), nrow = nr))
dimnames(DF[[i]]) <- list(Voxels, Copes)
}
names(DF) <- dfTitles
DF[1:2]
$C2000.AMY
Cope1 Cope2 Cope3 Cope4
1 -0.8293164 -1.813807 -0.3290645 -0.7730110
2 -1.1965588 1.022871 -0.7764960 -0.3056280
3 0.2536782 -0.365232 2.3949076 0.5672671
$C2000.ACC
Cope1 Cope2 Cope3 Cope4
1 -0.7505513 1.023325 -0.3110537 -1.4298174
2 1.2807725 1.216997 1.0644983 1.6374749
3 1.0047408 1.385460 0.1527678 0.1576037
When you create objects in a for loop, each one needs to be saved somewhere before the next iteration, or it gets overwritten.
One way to handle that is to create an empty list or vector (for example with list() or c()) before the loop begins, and append the output of each iteration to it.
Another way to handle it is to assign the object to your environment before moving on to the next iteration of the loop.
# Defining some dummy variables for the sake of this example
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- c(1:338)
# initialize a list to store the data.frame output
df_list <- list()
for (d in dfTitles) {
# create data.frame with the dfTitle, and 1 row per Copes observation
df <- data.frame(dfTitle = d,
Copes = Copes)
# append columns for Voxels
# setting to NA, can be reassigned later as needed
for (v in Voxels) {
df[[paste0("Voxel", v)]] <- NA
}
# store df in the list as the 'd'th element
df_list[[d]] <- df
# or, assign the object to your environment
# assign(d, df)
}
# data.frames can be referenced by name
names(df_list)
head(df_list$C2000.AMY)
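As a more compact variant of the list approach, the loop can be replaced with lapply() plus setNames(). This is only a sketch: the zero-filled 4 x 338 layout follows the question, and the "Voxel" prefix on the column names is my own assumption to keep the names syntactic.

```r
dfTitles <- c("C2000.AMY", "C2000.ACC", "C2001.AMY", "C2001.ACC")
Copes <- c("Cope1", "Cope2", "Cope3", "Cope4")
Voxels <- 1:338

# Build one zero-filled data.frame per title, then name the list elements
df_list <- setNames(
  lapply(dfTitles, function(title) {
    as.data.frame(matrix(0,
                         nrow = length(Copes),
                         ncol = length(Voxels),
                         dimnames = list(Copes, paste0("Voxel", Voxels))))
  }),
  dfTitles
)

# Each data.frame is then reachable by its title
dim(df_list[["C2001.ACC"]])
```

Keeping everything in one named list means the later correlation step can iterate with lapply() instead of looking up hundreds of separate variables.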
I do most of my projects with lists, downloading information from the web. Sometimes when I download 100 different sets of data, the website does not give me data for a few of them.
You can tell a set has no data because it prints:
A data.frame with 0 rows and 7 columns.
A good data frame prints something like this:
A data.frame with 245345 rows and 7 columns.
My script cannot handle the empty entries in the list. It stops my loop at that spot.
Thank you in advance.
library(data.table)  # provides fread()
final = list()
TOTALERRORS = list()
# Pulls all the active USGS gages for the URLs
GageList <- CDEC
gage <- as.character(GageList$GAGE_ID)
duration <- as.character(GageList$DURATION_CODE)
number <- as.character(GageList$SENSOR_CODE)
View(GageList)
# CDEC URL (built with paste0 so no line break ends up inside the query string)
urls <- sprintf(paste0("http://cdec.water.ca.gov/cgi-progs/querySHEF?",
"station_id=%s&dur_code=%s&sensor_num=%s&start_date=10/25/2019"),
gage, duration, number)
View(urls)
data <- suppressWarnings(lapply(urls, fread, header = TRUE))
It is difficult to answer without seeing the list, but from your description, here is an example:
# first here is a list of data.frame
l = list(data.frame(0,nrow = 2,ncol =5),
data.frame(0,nrow = 1, ncol=5))
# here I remove the only row of the second data.frame
l[[2]] = l[[2]][-1,]
l
# Build a data.frame flagging, for each data.frame, whether either dimension is 0
d2rm = data.frame(n = rep(1:length(l),each = 2),
empty =unlist(lapply(l,dim)) == 0)
# Remove every data.frame that has a 0 dimension (rows or columns);
# single-bracket indexing also works when more than one element is empty
l[unique(d2rm$n[d2rm$empty])] = NULL
l
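For what it's worth, the same filtering can be sketched in base R with Filter(), keeping only the data frames that have at least one row. The list `l` below is made-up test data, not the asker's downloads.

```r
# Made-up list: the second and fourth elements are empty data frames
l <- list(
  data.frame(x = 1:3, y = 4:6),
  data.frame(x = integer(0), y = integer(0)),
  data.frame(x = 7L, y = 8L),
  data.frame()
)

# Keep only the elements whose row count is greater than zero
non_empty <- Filter(function(d) nrow(d) > 0, l)
length(non_empty)
```

Filtering right after the download step means the later loop never sees a zero-row element.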
I need to generate a Data Frame in R from the below Excel Table.
Every time I modify one of the values from column Value the variable Score will have a different value (the cell is protected so I cannot see the formula).
The idea is to generate enough samples to check the main sources of variability, and perform some basic statistics.
I think the only way would be to manually modify the variables in the column Value and annotate the result from Score in the data frame.
The main issue is that I am not used to working with data in this format, so I am finding it difficult to visualize how I should structure the data frame.
I am getting stuck because the variable Score depends on 5 different Stages (where each one of them has 2 different variables) and a set of dimensions with 7 different variables.
I tried the way I usually create data frames, starting with the vectors, but it feels wrong and I cannot see how to represent the relationships between the different variables.
stage <- c('Inspection','Cut','Assembling','Test','Labelling','Dimensions')
variables <- c('Experience level', 'Equipement', 'User','Length','Wide','Length Body','Width Body','Tape Wing','Tape Body','Clip')
range <- c('b','m','a','UA','UB','UC') # ?? not sure what to do about the range ??
Could anybody help me with the logic on how this should be modelled?
As suggested by @Gregor, to resolve your main issue, consider building a data frame of all needed values in their respective columns, then run each row through the scoring step to produce Score.
Specifically, to build the needed data frame from the inputs in the Excel table, consider Map (a wrapper to mapply) and the data.frame constructor on equal-length lists or vectors of 17 items:
Excel Table Inputs
# VECTOR OF 17 CHARACTER ITEMS
stage_list <- c(rep("Inspection", 2),
rep("Cut", 2),
rep("Assembling", 2),
rep("Test", 2),
rep("Labelling", 2),
rep("Dimensions", 7))
# VECTOR OF 17 CHARACTER ITEMS
exp_equip <- c("Experience level", "Equipement")
var_list <- c(rep(exp_equip, 3),
c("User", "Equipement"),
exp_equip,
c("Length", "Wide", "Length body", "Width body",
"Tape wing", "Tape body", "Clip"))
# LIST OF 17 VECTORS
bma_range <- c("b", "m", "a")
noyes_range <- c("no", "yes")
range_list <- c(replicate(6, bma_range, simplify=FALSE),
list(c("UA", "UB", "UC")),
replicate(3, bma_range, simplify=FALSE),
list(seq(6.5, 9.5, by=0.1)),
list(seq(11.9, 12.1, by=0.1)),
list(seq(6.5, 9.5, by=0.1)),
list(seq(4, 6, by=1)),
replicate(3, noyes_range, simplify=FALSE))
Map + data.frame
df_list <- Map(function(s, v, r)
data.frame(Stage = s, Variable = v, Range = r, stringsAsFactors=FALSE),
stage_list, var_list, range_list, USE.NAMES = FALSE)
# APPEND ALL DFS
final_df <- do.call(rbind, df_list)
head(final_df)
# Stage Variable Range
# 1 Inspection Experience level b
# 2 Inspection Experience level m
# 3 Inspection Experience level a
# 4 Inspection Equipement b
# 5 Inspection Equipement m
# 6 Inspection Equipement a
Rextester demo
Score Calculation (using unknown score_function, assumed to take three non-optional args)
# VECTORIZED METHOD
final_df$Score <- score_function(final_df$Stage, final_df$Variable, final_df$Range)
# NON-VECTORIZED/LOOP ROW METHOD
final_df$Score <- sapply(1:nrow(final_df), function(i)
score_function(final_df$Stage[i], final_df$Variable[i], final_df$Range[i]))
# NON-VECTORIZED/LOOP ELEMENTWISE METHOD
final_df$Score <- mapply(score_function, final_df$Stage, final_df$Variable, final_df$Range)
I have a very simple assignment for a project that requires processing a large amount of information. My professor's first words were "this will take a while to run", so I figured it would be a good opportunity to spend the time I would otherwise be waiting on my program making a super efficient one :P
Basically, I have a input file where each line is either a node or details. It might look something like:
#NODE1_length_17_2309482.2394832.2
val1 5 18
val2 6 21
val3 100 23
val4 9 6
#NODE2_length_1298_23948349.23984.2
val1 2 293
...
and so on. Basically, I want to know how I can efficiently use R to output, line by line, something like:
NODE1_length_17 val1 18
NODE1_length_17 val2 21
...
So, as you can see, I want the node name, the value name, and the third column of the value line. I have implemented it using an ultra slow for loop that calls strsplit a whole bunch of times, and obviously this is not ideal. My current implementation looks like:
nodevals <- which(substring(data, 1, 1) == "#") # find lines with nodes
vallines <- which(substring(data, 1, 3) == "val")
out <- vector(mode="character", length=length(vallines))
for (i in vallines) {
line_ra <- strsplit(data[i], "\\s+")[[1]]
... and so on using a bunch of str splits and pastes to reformat
out[i] <- paste(node, val, value, sep="\t")
}
Does anybody know how I can optimize this using data frames or crafty vector manipulations?
EDIT: I'm implementing vector-wise splitting for everything, and so far I've found that the main thing I can't split correctly is the name of each node. I'm trying to do something like,
names <- data[max(nodes[nodelines < vallines])]
where nodes are the names of each line containing a node and vallines are the numbers of each line containing a val. The return vector should have the same number of elements as vallines. The goal is to find the maximum nodelines that is less than the line number of vallines for each vallines. Any thoughts?
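The lookup described in the edit (for each value line, the last node line that precedes it) can be sketched with base R's findInterval(). This is only an illustration on the sample lines from the question: the regular expression assumes the node label is the first three underscore-separated fields after the "#", and the code assumes every value line has a node line somewhere above it.

```r
data <- c("#NODE1_length_17_2309482.2394832.2",
          "val1 5 18", "val2 6 21", "val3 100 23", "val4 9 6",
          "#NODE2_length_1298_23948349.23984.2",
          "val1 2 293")

nodelines <- which(startsWith(data, "#"))    # line numbers of node headers
vallines  <- which(startsWith(data, "val"))  # line numbers of value lines

# For each value line, findInterval() gives the index of the last node line
# at or before it (nodelines is already sorted in increasing order)
owner <- nodelines[findInterval(vallines, nodelines)]

# Assumed label format: "#" followed by three underscore-separated fields
node <- sub("^#([^_]+_[^_]+_[^_]+)_.*$", "\\1", data[owner])

# Split all value lines in one call and pick fields 1 and 3
fields <- strsplit(data[vallines], "\\s+")
val    <- vapply(fields, `[`, character(1), 1)
value  <- vapply(fields, `[`, character(1), 3)

out <- paste(node, val, value, sep = "\t")
```

Everything here operates on whole vectors, so there is no per-line loop to slow things down.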
I suggest using the data.table package; it has a very fast string-splitting function, tstrsplit.
library(data.table)
#read from file
data <- scan('data.txt', 'character', sep = '\n')
#create separate objects for nodes and values
dt <- data.table(data)
# flag node lines, then number each block of lines by the node it belongs to
dt[, IsNode := substr(data, 1, 1) == '#']
dt[, NodeId := cumsum(IsNode)]
nodes <- dt[IsNode == TRUE, list(NodeId, data)]
values <- dt[IsNode == FALSE, list(data, NodeId)]
#split string and join back values and nodes
tmp <- values[, tstrsplit(data, '\\s+')]
values <- data.table(values[, list(NodeId)], tmp[, list(val = V1, value = V3)], key = 'NodeId')
res <- values[nodes]
I'm quite new to R, and I'm trying to use it to organize and extract info from some tables into different but similar tables. Instead of repeating the commands and changing only the table names:
#DvE, DvS, and EvS are dataframes
Sum.DvE <- data.frame(DvE$genes, DvE$FDR, DvE$logFC)
names(Sum.DvE) <- c("gene","FDR","log2FC")
Sum.DvS <- data.frame(DvS$genes, DvS$FDR, DvS$logFC)
names(Sum.DvS) <- c("gene","FDR","log2FC")
Sum.EvS <- data.frame(EvS$genes, EvS$FDR, EvS$logFC)
names(Sum.EvS) <- c("gene","FDR","log2FC")
I thought it would be easier to create a vector of the table names, and feed it into a for loop:
Sum.Comp <- c("DvE","DvS","EvS")
for(i in 1:3){
Sum.Comp[i] <- data.frame(i$genes, i$FDR, i$logFC)
names(Sum.Comp[i]) <- c("gene","FDR","log2FC")
}
But I get
>Error in i$genes : $ operator is invalid for atomic vectors
which I kind of expected because I was just trying it out, but can someone tell me if what I want to do can be done some other way, or if you have some suggestions for me, that would be much appreciated!
Clarification: Basically I'm trying to ask if there's a way to feed a dataframe name into a for loop through a vector, because I think I get the error because R doesn't realize "i" in the for loop stands for a dataframe name. This is a more simplified example:
DF1 <- data.frame(A=1:5, B=1:5, C=1:5, D=1:5)
DF2 <- data.frame(A=10:15, B=10:15, C=10:15, D=10:15)
DF3 <- data.frame(A=20:25, B=20:25, C=20:25, D=20:25)
DFs <- c("DF1", "DF2", "DF3")
for (i in 1:3){
New.i <- dataframe(i$A, i$D)
}
And I'd like it to make 3 new dataframes called "New.DF1", "New.DF2", "New.DF3" with example outputs like:
New.DF1
A D
1 1
2 2
3 3
4 4
5 5
New.DF2
A D
10 10
11 11
12 12
13 13
14 14
15 15
Thank you!
Not entirely sure I understand your problem, but the code below may do what you're asking. I've created simple values for the input data frames for testing.
DvE <- data.frame(genes=1:2, FDR=2:3, logFC=3:4)
DvS <- data.frame(genes=4, FDR=5, logFC=6)
EvS <- data.frame(genes=7, FDR=8, logFC=9)
df_names <- c("DvE","DvS", "EvS")
sum_df <- function(x) data.frame(gene=x$genes, FDR=x$FDR, log2FC=x$logFC)
for(df in df_names) {
assign(paste("Sum.",df,sep=""), do.call("sum_df", list(as.name(df)) ) )
}
Instead of operating on the names of variables, it would be easier to store the data frames you want to process in a list and then process them with lapply:
to.process <- list(DvE, DvS, EvS)
processed <- lapply(to.process, function(x) {
data.frame(gene=x$genes, FDR=x$FDR, log2FC=x$logFC)
})
Now you can access the new data frames with processed[[1]], processed[[2]], and processed[[3]].
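If the data frames already exist as separate variables, one possible way to combine the two ideas is to collect them by name with mget(), which returns a named list, and then apply the same function. The sketch below reuses the toy inputs from the earlier answer; the "Sum." prefix follows the question's naming.

```r
# Toy inputs, same shape as in the answer above
DvE <- data.frame(genes = 1:2, FDR = 2:3, logFC = 3:4)
DvS <- data.frame(genes = 4, FDR = 5, logFC = 6)
EvS <- data.frame(genes = 7, FDR = 8, logFC = 9)

df_names <- c("DvE", "DvS", "EvS")

# mget() looks the variables up by name and returns them as a named list
to.process <- mget(df_names)

processed <- lapply(to.process, function(x) {
  data.frame(gene = x$genes, FDR = x$FDR, log2FC = x$logFC)
})

# Rename the results to follow the question's "Sum.<name>" pattern
names(processed) <- paste0("Sum.", names(processed))
processed$Sum.DvE
```

With names on the list, results can be accessed as processed$Sum.DvE rather than by position.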
I have a list which contains list entries, and I need to transpose the structure.
The original structure is rectangular, but the names in the sub-lists do not match.
Here is an example:
ax <- data.frame(a=1,x=2)
ay <- data.frame(a=3,y=4)
bw <- data.frame(b=5,w=6)
bz <- data.frame(b=7,z=8)
before <- list( a=list(x=ax, y=ay), b=list(w=bw, z=bz))
What I want:
after <- list(w.x=list(a=ax, b=bw), y.z=list(a=ay, b=bz))
I do not care about the names of the resultant list (at any level).
Clearly this can be done explicitly:
after <- list(x.w=list(a=before$a$x, b=before$b$w), y.z=list(a=before$a$y, b=before$b$z))
but this is ugly and only works for a 2x2 structure. What's the idiomatic way of doing this?
The following piece of code will create a list with the i-th element of every list in before:
lapply(before, "[[", i)
Now you just have to do
n <- length(before[[1]]) # assuming all lists in before have the same length
lapply(1:n, function(i) lapply(before, "[[", i))
and it should give you what you want. It's not very efficient (it traverses every list many times), and you could probably make it more efficient by keeping pointers to the current list elements, so decide whether this is good enough for you.
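Applied to the example from the question, the nested lapply() can be checked like this. It is a small sketch that only assumes, as stated above, that every sub-list has the same length.

```r
ax <- data.frame(a = 1, x = 2)
ay <- data.frame(a = 3, y = 4)
bw <- data.frame(b = 5, w = 6)
bz <- data.frame(b = 7, z = 8)
before <- list(a = list(x = ax, y = ay), b = list(w = bw, z = bz))

n <- length(before[[1]])  # assumes all sub-lists have the same length

# For each position i, pull the i-th element out of every sub-list
after <- lapply(seq_len(n), function(i) lapply(before, "[[", i))

# The inner lists keep the outer names "a" and "b"
names(after[[1]])
```

The outer list of the result is unnamed, which matches the question's remark that the resulting names do not matter.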
The purrr package now makes this process really easy:
library(purrr)
before %>% transpose()
## $x
## $x$a
## a x
## 1 1 2
##
## $x$b
## b w
## 1 5 6
##
##
## $y
## $y$a
## a y
## 1 3 4
##
## $y$b
## b z
## 1 7 8
Here's a different idea - use the fact that data.table can store data.frame's (in fact, given your question, maybe you don't even need to work with lists of lists and could just work with data.table's):
library(data.table)
dt = as.data.table(before)
after = as.list(data.table(t(dt)))
While this is an old question, I found it while searching for the same problem, and the second hit on Google had a much more elegant solution in my opinion:
list_of_lists <- list(a=list(x="ax", y="ay"), b=list(w="bw", z="bz"))
new <- do.call(rbind, list_of_lists)
new is now a rectangular structure, a strange object: A list with a dimension attribute. It works with as many elements as you wish, as long as every sublist has the same length. To change it into a more common R-Object, one could for example create a matrix like this:
new.dims <- dim(new)
matrix(new,nrow = new.dims[1])
new.dims needed to be saved, as the matrix() function deletes the attribute of the list. Another way:
new <- do.call(c, new)
dim(new) <- new.dims
You can now for example convert it into a data.frame with as.data.frame() and split it into columns or do column wise operations. Before you do that, you could also change the dim attribute of the matrix, if it fits your needs better.
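To get from the list-matrix back to a transposed list of lists, one option is to pull out each column in turn. This is a sketch on the same toy input, not part of the solution quoted above.

```r
list_of_lists <- list(a = list(x = "ax", y = "ay"), b = list(w = "bw", z = "bz"))

# rbind() on lists yields a 2 x 2 list-matrix (a list with a dim attribute)
new <- do.call(rbind, list_of_lists)
dim(new)

# Each column of the list-matrix becomes one element of the transposed list
transposed <- lapply(seq_len(ncol(new)), function(j) new[, j])
```

Each element of transposed is itself a list holding the j-th entries of the original sub-lists.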
I found myself with this problem but I needed a solution that kept the names of each element. The solution I came up with should also work when the sub lists are not all the same length.
invertList = function(l){
elemnames = NULL
for (i in seq_along(l)){
elemnames = c(elemnames, names(l[[i]]))
}
elemnames = unique(elemnames)
res = list()
for (i in seq_along(elemnames)){
res[[elemnames[i]]] = list()
for (j in seq_along(l)){
if(exists(elemnames[i], l[[j]], inherits = FALSE)){
res[[elemnames[i]]][[names(l)[j]]] = l[[names(l)[j]]][[elemnames[i]]]
}
}
}
res
}