I'm an R beginner and I have a small problem. I want to create new dataframes by randomly selecting from existing dataframes.
I have 4 categories, each divided into 10 dataframes, and I want to create 10 new dataframes, each containing 1 dataframe from each category.
For example, these are my dataframes:
Cat_1_Data_1 Cat_2_Data_1 Cat_3_Data_1 Cat_4_Data_1
Cat_1_Data_2 Cat_2_Data_2 Cat_3_Data_2 Cat_4_Data_2
Cat_1_Data_3 Cat_2_Data_3 Cat_3_Data_3 Cat_4_Data_3
Cat_1_Data_4 Cat_2_Data_4 Cat_3_Data_4 Cat_4_Data_4
Cat_1_Data_5 Cat_2_Data_5 Cat_3_Data_5 Cat_4_Data_5
Cat_1_Data_6 Cat_2_Data_6 Cat_3_Data_6 Cat_4_Data_6
Cat_1_Data_7 Cat_2_Data_7 Cat_3_Data_7 Cat_4_Data_7
Cat_1_Data_8 Cat_2_Data_8 Cat_3_Data_8 Cat_4_Data_8
Cat_1_Data_9 Cat_2_Data_9 Cat_3_Data_9 Cat_4_Data_9
Cat_1_Data_10 Cat_2_Data_10 Cat_3_Data_10 Cat_4_Data_10
Creating new dataframes (this is how I do it at the moment):
new_data_1 <- rbind(Cat_1_Data_1, Cat_2_Data_1, Cat_3_Data_1, Cat_4_Data_1)
...
new_data_10 <- rbind(Cat_1_Data_10, Cat_2_Data_10, Cat_3_Data_10, Cat_4_Data_10)
But I want a random pick of the datasets, like:
new_data_1 <- rbind(Cat_1_Data_[Random 1-10], Cat_2_Data_[Random 1-10]... and so on)
...
new_data_10 <- rbind(Cat_1_Data_[Random 1-10], Cat_2_Data_[Random 1-10]... and so on)
Is there any way to solve this problem? I don't really know how to approach it :(
Here is one sampling strategy that would work.
Create lists of your data.frames, one per category, shuffling them as you go:
dflist.cat1 <- sample(list(Cat_1_Data_1, Cat_1_Data_2, ...))
dflist.cat2 <- sample(list(Cat_2_Data_1, Cat_2_Data_2, ...))
...
Run lapply to rbind the corresponding element of each list. This will result in a list of length 10:
dflist.new <- lapply(1:10, function(i){
rbind(dflist.cat1[[i]],
dflist.cat2[[i]],
dflist.cat3[[i]],
dflist.cat4[[i]])
})
You can access your data.frames using dflist.new[[1]] for the first one, and so on.
I am sure there is a more elegant way to do this with 2-dimensional list indices, but this works well for a small number of categories.
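If typing out all 40 names is the painful part, the same idea can be sketched with mget, which collects existing objects by name. This is only a sketch: the toy data frames and their columns (category, dataset, value) are made up so the example runs on its own, and new_data is a name I picked.

```r
set.seed(42)   # only so the shuffle is reproducible
n_cat  <- 4
n_data <- 10

# Toy stand-ins for the real Cat_k_Data_d objects (hypothetical contents).
for (k in seq_len(n_cat)) {
  for (d in seq_len(n_data)) {
    assign(sprintf("Cat_%d_Data_%d", k, d),
           data.frame(category = k, dataset = d, value = rnorm(3)))
  }
}

# One shuffled list of data frames per category.
shuffled <- lapply(seq_len(n_cat), function(k) {
  sample(mget(sprintf("Cat_%d_Data_%d", k, seq_len(n_data)), inherits = TRUE))
})

# The i-th new data frame stacks the i-th element of every shuffled list.
new_data <- lapply(seq_len(n_data), function(i) {
  do.call(rbind, lapply(shuffled, `[[`, i))
})
```

Note that this uses every original data frame exactly once, like the answer above. If the picks should be independent (so the same data frame may appear in several new ones), sample(..., replace = TRUE) would be the change to make.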
I have 4 datasets that contain the same variable, called "siteid_public". The ultimate goal is to see how many unique "siteid_public" values there are across these four datasets. I combine them and then use length(unique()) to get the number.
I currently do this in a very clumsy way. The code looks like this:
site1<-dflist[[1]]%>%
select(siteid_public)
site2<-dflist[[2]]%>%
select(siteid_public)
site3<-dflist[[3]]%>%
select(siteid_public)
site4<-dflist[[4]]%>%
select(siteid_public)
site<-c(site1$siteid_public, site2$siteid_public,site3$siteid_public,site4$siteid_public)
length(unique(site))
But now I want to make it more efficient.
So, first, I use this code to create a list called "sitelist" which contains 4 lists coming from the four datasets. (dflist[[i]] in the code is where I store these 4 datasets.) After I run the code below, each list has the same single variable, siteid_public. The code is here:
sitelist<-list()
for (i in 1:4){
sitelist[[i]]<-dflist[[i]]%>%
select(siteid_public)
}
Now I want to combine all 4 lists in sitelist into one list, and then use unique to see how many unique siteid_public values are in this combined list. Could someone help me continue this code and achieve that goal? Thanks a lot!
You can use lapply to iterate on a list of frames, either on the whole list or just as easily a subset (including one or zero).
Your site1 through site4 can be created as a list with
sites <- lapply(dflist[1:4], function(z) select(z, siteid_public))
and you can do your unique-counting with
length(unique(unlist(sites)))
This works as well with
sites <- lapply(dflist, ...) # all of it
sites <- lapply(dflist[3], ...) # singleton, note not the `[[` index operator
indices <- ... # integer or logical of indices to include
sites <- lapply(dflist[indices], ...)
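For completeness, here is a small base-R sketch of the whole count on toy data. The four data frames and the extra column x are invented for illustration; only siteid_public comes from the question.

```r
# Toy stand-in for the real dflist (hypothetical values).
dflist <- list(
  data.frame(siteid_public = c("A", "B", "C"), x = 1:3,  stringsAsFactors = FALSE),
  data.frame(siteid_public = c("B", "C", "D"), x = 4:6,  stringsAsFactors = FALSE),
  data.frame(siteid_public = c("D", "E"),      x = 7:8,  stringsAsFactors = FALSE),
  data.frame(siteid_public = c("A", "E", "F"), x = 9:11, stringsAsFactors = FALSE)
)

# Pull the column out of every frame as a plain vector, then flatten.
site_ids <- unlist(lapply(dflist, function(z) z[["siteid_public"]]))

# Number of distinct site ids across all frames.
n_unique <- length(unique(site_ids))
n_unique
```

Extracting with [[ gives plain vectors straight away, so there is no intermediate one-column data frame to unpack.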
I am trying to automatise some post-hoc analysis, but I will try to explain myself with a metaphor that I believe will illustrate what I am trying to do.
Suppose I have two lists of strings: the first one holds nouns and the other holds adjectives:
list1 <- c("apt", "farm", "basement", "lodge")
list2 <- c("tiny", "noisy")
Let's suppose also I have a data frame with a bunch of data that I have named something like this as they are the results of some previous linear analysis.
> head(df)
qt[apt_tiny,Intercept] qt[apt_noisy,Intercept] qt[farm_tiny,Intercept]
1 4.196321 -0.4477012 -1.0822793
2 3.231220 -0.4237787 -1.1433449
3 2.304687 -0.3149331 -0.9245896
4 2.768691 -0.1537728 -0.9925387
5 3.771648 -0.1109647 -0.9298861
6 3.370368 -0.2579591 -1.0849262
and so on...
Now, what I am trying to do is make some automatic operations where the strings in the previous lists dynamically change as they go in a for loop. I have made a list with all the distinct combinations and called it distinct. Now I am trying to do something like this:
for (i in 1:nrow(distinct)){
var1[[i]] <- list1[[i]]
var2[[i]] <- list2[[i]]
#this being the insertable name part for the rest of the variables and parts of variable,
#i'll put it inside %var[[i]]% for the sake of the explanation.
%var1[[i]]%_%var2[[i]]%_INT <- df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept]`+ df$`qt[%var1[[i]]%,Intercept]`
}
The difficult thing for me here is that %var1[[i]]% appears both inside a variable name and inside the name of a column of a data frame.
Any help would be much appreciated.
You cannot use $ to extract column values with a character variable. So df$`qt[%var1[[i]]%_%var2[[i]]%,Intercept]` will not work.
Create the name of the column using sprintf and use [[ to extract it. For example to construct "qt[apt_tiny,Intercept]" as column name you can do :
i <- 1
sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])
#[1] "qt[apt_tiny,Intercept]"
Now use [[ to subset that column from df
df[[sprintf('qt[%s_%s,Intercept]', list1[i], list2[i])]]
You can do the same for other columns.
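Putting it together, the loop could look roughly like the sketch below. The toy df, its values, and the results list are assumptions of mine; in particular I am guessing that columns such as `qt[apt,Intercept]` exist, because the question adds them. Results go into a named list instead of separate variables, which avoids building variable names on the fly.

```r
list1 <- c("apt", "farm")
list2 <- c("tiny", "noisy")

# Toy data frame with the same naming scheme (values are made up).
df <- data.frame(
  "qt[apt,Intercept]"        = c(1, 2),
  "qt[farm,Intercept]"       = c(10, 20),
  "qt[apt_tiny,Intercept]"   = c(0.5, 0.5),
  "qt[apt_noisy,Intercept]"  = c(-1, -1),
  "qt[farm_tiny,Intercept]"  = c(2, 2),
  "qt[farm_noisy,Intercept]" = c(3, 3),
  check.names = FALSE   # keep the brackets and commas in the column names
)

# All noun/adjective combinations, one per row.
combos <- expand.grid(var1 = list1, var2 = list2, stringsAsFactors = FALSE)

results <- list()
for (i in seq_len(nrow(combos))) {
  v1 <- combos$var1[i]
  v2 <- combos$var2[i]
  pair_col <- sprintf("qt[%s_%s,Intercept]", v1, v2)
  base_col <- sprintf("qt[%s,Intercept]", v1)
  # Store under a constructed name, e.g. results[["apt_tiny_INT"]].
  results[[sprintf("%s_%s_INT", v1, v2)]] <- df[[pair_col]] + df[[base_col]]
}
```

Afterwards results[["apt_tiny_INT"]] holds what the question calls apt_tiny_INT.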
I have the following problem:
I am creating a flexible number of data.frames (matrices would do the job as well). All of them are completely filled except the last one, which is not full and should therefore be treated differently.
I now want to refer to the n'th data.frame (plate n) and fill it with data afterwards. I would like to do something like
assign(paste0("plate", n)[x,y], %some data%)
where [x,y] is the position in the data.frame (or matrix) and n is the n'th data.frame.
However, it always gives me
incorrect number of dimensions
You can't. See ?assign:
‘assign’ does not dispatch assignment methods, so it cannot be
used to set elements of vectors, names, attributes, etc.
(if you can't set elements of vectors you can't do it for dataframes either).
It's probably better to create a list of dataframes, rather than manually labelling plate1, plate2 etc. For example, supposing that plates was a list such that plates[[1]] was plate 1, plates[[2]] was plate 2, etc:
n <- length(plates)
plates[[n]] # is the plateN dataframe
plates[[n]][x, y] <- some_data
In terms of constructing the plates list, it will depend on your particular example (how you are doing it at the moment), but probably a for loop will be helpful here, e.g.:
plates <- list()
# supposing you know `n`, the number of plates
for (i in 1:n) {
plates[[i]] <- code_that_constructs_the_dataframe
}
If you're feeling adventurous, lapply is quite nice, e.g.
# suppose each plate was constructed by reading in a CSV
list_of_csvs <- list.files(pattern = '\\.csv$') # pattern is a regular expression, not a glob
plates <- lapply(list_of_csvs, read.csv)
# then plates[[1]] is the dataframe from the first csv, etc
but as you seem new to R, perhaps it is best to stick with the for loops for now.
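To make the list approach concrete, here is a rough sketch that spreads a vector of values over as many plates as needed, leaving the last one partly empty. The 2 x 4 layout, the values vector and all variable names are hypothetical, and I use matrices since the question says those would do as well.

```r
values <- seq_len(20)   # hypothetical measurements to place on plates
n_rows <- 2             # hypothetical plate layout: 2 rows x 4 columns
n_cols <- 4
per_plate <- n_rows * n_cols
n_plates  <- ceiling(length(values) / per_plate)

# A list of empty plates; the last one will stay partly NA.
plates <- lapply(seq_len(n_plates), function(i) matrix(NA_real_, n_rows, n_cols))

for (k in seq_along(values)) {
  p   <- (k - 1) %/% per_plate + 1   # which plate
  pos <- (k - 1) %%  per_plate       # 0-based position within that plate
  x   <- pos %%  n_rows + 1          # row (filled column by column)
  y   <- pos %/% n_rows + 1          # column
  plates[[p]][x, y] <- values[k]     # element assignment works on list members
}

plates[[n_plates]]   # the partially filled last plate
```

Because plates[[p]][x, y] <- value is ordinary indexed assignment, no assign or name pasting is needed.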
I am looking for a hint about how to create a new nested list element from two existing nested list elements. In the current form of the script I am working on, I create a list called tardis that is n elements long, based on the number of elements in an input list. In the example below, that input list, dataLayers, is 2 elements long.
After creating tardis, the script populates it by reading in data from 1200 netCDF files. Each of the 12 elements in 'mean' and 'sd' in tardis are matrices of geographic data, tardis[['data']][[decade]][['mean']][[month]], for example, for the 12 calendar months. When the list is fully populated I would like to create some derived variables. For example, in the snippet below, I would like to create a variable TOTALPRECIP by adding SNOW and RAIN. In doing this, I would like to create TOTALPRECIP from SNOW + RAIN as a third list element in tardis with the exact nested structure as the other two elements (adding them together and preserving the structure).
Is this possible with apply or its related functions?
begin <- 1901
end <- 1991
dataLayers <- c("SNOW","RAIN")
tardis<-list()
for (i in 1:length(dataLayers)){
tardis[[dataLayers[i]]]<-list('longName'='timeLord','units'='theDr','data'=list())
for (j in seq(begin,end,10)){
tardis[[dataLayers[[i]]]][['data']][[as.character(j)]]<-list('mean'=vector(mode='list',length=12),'sd'=vector(mode='list',length=12))
}
}
#add SNOW AND RAIN
print(names(tardis))
>[1] "SNOW" "RAIN" "TOTALPRECIP"
Here are your for loops using replicate (Note that the expression value for each replicate is the same expression you have in the assignment portion of your for loop)
## This is your inner for-loop, using replicate
inds <- seq(begin, end, 10)
datas <- replicate(length(inds), list('mean'=vector(mode='list',length=12),'sd'=vector(mode='list',length=12))
, simplify=FALSE)
names(datas) <- inds
# This is your outer loop
tardis2 <- replicate(length(dataLayers), list('longName'='timeLord','units'='theDr','data'=datas)
, simplify=FALSE)
names(tardis2) <- dataLayers
# Compare Results
identical(tardis2, tardis)
# [1] TRUE
However, I'm not sure if lists are really the best structure for this. Have you considered data.frames?
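As for actually adding SNOW and RAIN once the list is populated, one possible sketch uses Map to walk both layers in parallel. Everything about the toy data here is assumed (2 decades, 3 "months", 2 x 2 matrices, the make_layer helper, the longName text). I only add the means; how the sd fields should be combined is a statistical decision I have left open by keeping them as empty slots.

```r
# Hypothetical helper that builds a small populated layer for illustration.
make_layer <- function(value) {
  decades <- c("1901", "1911")
  data <- lapply(decades, function(d) {
    list(mean = replicate(3, matrix(value, 2, 2), simplify = FALSE),
         sd   = replicate(3, matrix(0, 2, 2), simplify = FALSE))
  })
  names(data) <- decades
  list(longName = "timeLord", units = "theDr", data = data)
}

tardis <- list(SNOW = make_layer(1), RAIN = make_layer(2))

# Walk the decades of both layers together, then the months within each decade.
tardis[["TOTALPRECIP"]] <- list(
  longName = "total precipitation",   # assumed label
  units    = tardis$SNOW$units,
  data     = Map(function(snow, rain) {
    list(mean = Map(`+`, snow$mean, rain$mean),                   # matrix sums
         sd   = vector(mode = "list", length = length(snow$sd)))  # left empty
  }, tardis$SNOW$data, tardis$RAIN$data)
)

print(names(tardis))
```

Map keeps the decade names from the first layer, so the new element has the same nesting as the two it was built from.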
So I've created an object of 12 binary files. As part of the analysis that I want to do, I compare one of the 12 against the other 11, using functions to do some analysis.
i.e.
In loop one, object$1 is compared against object$2:12,
loop two, object$2 against object$1,3:12
...
loop 12, object$12 against object$1:11
I can do it on a small scale manually by specifying the file names. But as it involves comparing all 12 against each other, and I have many groups of 12 files (250 files in total) to work on, how do I automate this?
The eventual output is a data frame, so I'd like that to be created in each loop too (with the relevant file name, like object$1.csv or something).
firstbatch <-bams[1:12] #bams is character vector of the files
bedfile <- "filename.bed"
my.counts <- getBamCounts(bed.file = bedfile, bam.files = firstbatch) #creates object
my.test <- firstbatch$1
my.ref.samples <- firstbatch$2...firstbatch$12
series of functions comparing $1 against 2:12
Maybe you could use this procedure:
a <- combn(12, 2) # gives all possible pairs of sample indices, one pair per column
for (i in seq_len(ncol(a))) { # loops over all possible pairs
firstbatch[a[1, i]] # first sample name to compare
firstbatch[a[2, i]] # second sample name to compare against
...
}
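If what you need is each sample against the remaining 11 (as described in the question) instead of pairs, negative indexing gives you "everything except i". The sketch below uses made-up file names and a placeholder data frame where your real comparison functions would go; results is a name I chose.

```r
# Hypothetical file names standing in for bams[1:12].
firstbatch <- paste0("sample", 1:12, ".bam")

results <- lapply(seq_along(firstbatch), function(i) {
  my.test        <- firstbatch[i]    # the one sample under test
  my.ref.samples <- firstbatch[-i]   # the other 11 samples
  # Placeholder for the real comparison; returns one data frame per test sample.
  data.frame(test = my.test,
             n_refs = length(my.ref.samples),
             stringsAsFactors = FALSE)
})
names(results) <- firstbatch   # so results[["sample1.bam"]] is the first loop's output
```

Each element of results could then be written out with write.csv, and the whole thing wrapped in an outer loop over the groups of 12.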