I have a pretty basic loop question:
I have a matrix (365x20). So for twenty years I have daily rainfall data.
I need to slice the matrix in order to conduct the next steps of my analysis, which I did like this:
year1 <- as.vector(Rainfall_data$year1)
year2 <- as.vector(Rainfall_data$year2)
...
year20 <- as.vector(Rainfall_data$year20)
This gives me in total 20 single 1x365 vectors.
Now, I want to do the same for the transposed Rainfall data to obtain a vector containing the value of the same day for all twenty years. Since this would mean to do
as.vector(t_Rainfall_data$day1-365)
I wanted to write a loop. The columns are called day1 to day 365. t_Rainfall_data would be the transposed matrix. Main aim is to obtain in total 365 single 1x20 vectors.
I tried several ways, but failed them all.
The comments are right: anything you want to do with the vector day1 can just as well be done with t_Rainfall_data$day1 (or likely with Rainfall_data[1,]) and it's better practice to slice your dataframe when you're doing something with it rather than creating a lot of redundant vectors out of it. Similarly, even if you need a bunch of objects, it's almost always easier to deal with a list of objects than it is to create separate named objects. All that said, here's how to get what you're asking for:
As in comments, you can return a list of vectors with
lapply(seq_len(nrow(Rainfall_data)), function(i) Rainfall_data[i, ])
If you would prefer a loop, and to create the objects rather than return a list, you can do something like
for (i in seq_len(nrow(Rainfall_data))) {
  assign(paste0("day", i), as.vector(t_Rainfall_data[, paste0("day", i)]))
}
assign will create an object named after the string passed to it, that contains the second argument.
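As a rough illustration of the list-based alternative recommended above, here is a minimal sketch. The random data frame is a hypothetical stand-in for the real Rainfall_data, and the name days is an assumption made for this example only.

```r
# Hypothetical stand-in for the real data: 365 days (rows) x 20 years (columns)
set.seed(1)
Rainfall_data <- as.data.frame(
  matrix(runif(365 * 20), nrow = 365,
         dimnames = list(NULL, paste0("year", 1:20)))
)

# One numeric vector of length 20 per day, kept together in a named list
days <- lapply(seq_len(nrow(Rainfall_data)),
               function(i) unlist(Rainfall_data[i, ], use.names = FALSE))
names(days) <- paste0("day", seq_along(days))

length(days)         # number of day vectors
length(days$day1)    # number of years in each vector
```

Each day is then available as days$day1, days[["day200"]] and so on, and the whole set can be passed to lapply or sapply without creating 365 separate objects.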
I have this list of data that I created by using split on a dataframe:
dat_discharge = split(dat2,dat2$discharge_id)
I am trying to create a training and test set from this list of data by sampling in order to take into account the discharge id groups which are not at all equally distributed in the data.
I am trying to do this using lapply as I'd rather not have to individually sample each of the groups within the list.
trainlist<-lapply(dat_discharge,function(x) sample(nrow(x),0.75*nrow(x)))
trainL = dat_discharge[(dat_discharge %in% trainlist)]
testL = dat_discharge[!(dat_discharge %in% trainlist)]
I tried emulating this post (R removing items in a sublist from a list) in order to create the testing and training subsets however the training list is entirely empty, which I assume means that is not the correct way to do that for a list of dataframes?
Is what I am looking to do possible without selecting for the individual dataframes in the list like data_frame[[1]]?
You could use map_dfr from the purrr library instead of lapply. Bear in mind that you need to run install.packages("purrr") and then library(purrr) before doing the next steps, although you may already have it installed since it's a common package. Note that map_dfr also needs the dplyr package to be installed.
Then you could use the following code:
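library(purrr)  # provides map_dfr() and re-exports %>%

dat2$rowid <- seq_len(nrow(dat2))  # unique row id, used later to build the test set
dat_discharge <- split(dat2, dat2$discharge_id)
trainList <- dat_discharge %>% map_dfr(.f = function(x) {
  # sample 75% of the row indices within this group
  sampling <- sample(seq_len(nrow(x)), round(0.75 * nrow(x), 0))
  x[sampling, ]
})
testL <- dat2[!(dat2$rowid %in% trainList$rowid), ]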
To explain the above code: first of all, I added a unique rowid to dat2 so I know which rows I am sampling and which I am not. This is used in the last line of code to separate the test and train datasets, so that the train dataset doesn't contain any rowid that the test dataset has.
Then I do the split to create dat_discharge, as you did.
Then, to each dataframe inside the dat_discharge list, I apply the function passed to map_dfr. The map_dfr function works like lapply, except that it "concatenates" the outputs into a single dataframe instead of putting each output in a list as lapply does, provided that the output of each iteration is a dataframe with the same columns as the first iteration. Think of it as "Okay, I got this dataframe, I'm going to bind its rows to the previous dataframe result". So the result is just one big dataframe.
Inside that function you can see that I am doing the sampling a bit differently. I take 75% of the sequence of row numbers of the dataframe for that iteration, then I subset that dataframe with x[sampling,], which yields my sampled dataframe for that iteration (one of the dataframes from the dat_discharge list). The map_dfr call then automatically joins those sampled dataframes into a single, big dataframe instead of putting them in a list as lapply does.
So lastly, I just create the test set as all the rows from dat2 whose rowid is NOT present in the training set.
Hope this serves you well :)
Do note that, if you want to sample 75% of the observations for each id, then each id should have at least 4 observations for it to make sense. Imagine if you only had 1 observation for a particular id, yikes! This code would still work (it will simply select that observation), but you really need to think about that implication when you build your statistical model.
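If you would rather stay in base R, the same idea can be sketched with lapply followed by do.call(rbind, ...). This is only an illustration under assumptions: the toy dat2 below, its value column and the object names groups, train and test are all made up for the example.

```r
set.seed(42)

# Hypothetical example data; 'discharge_id' mirrors the question, 'value' is invented
dat2 <- data.frame(discharge_id = rep(c("a", "b", "c"), times = c(8, 4, 12)),
                   value = rnorm(24))
dat2$rowid <- seq_len(nrow(dat2))          # unique id for every row

groups <- split(dat2, dat2$discharge_id)   # one data frame per discharge id

# Sample 75% of the rows inside each group, then bind the pieces back together
train_parts <- lapply(groups, function(x) {
  x[sample(seq_len(nrow(x)), round(0.75 * nrow(x))), ]
})
train <- do.call(rbind, train_parts)

# Everything that was not sampled goes into the test set
test <- dat2[!(dat2$rowid %in% train$rowid), ]

table(train$discharge_id)   # rows kept per group
```

Because the sampling happens within each group, small groups keep their share of rows in the training set instead of being swamped by the large ones.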
I have the following MWE:
list <- c("Canada","USA","Brazil","China","Germany","Spain","France","UK","India","Iran",
"Italy","Japan","Mexico","Nederlands","Norway","NZ","Philippines","Poland","Russia","Sweden","Ukraine")
#What I want to do is something like this (although this doesn't do it):
avg_Canada <- sum(Canada[1:12], Canada>0) / length(Canada[1:12, Canada>0)
> Error in Canada[1:12, Canada > 0] : incorrect number of dimensions
#My idea for the loop
subsets <- c("1:12","13:24","25:36","37:48","49:60","61:72","73:84") #each year
for (c in seq_along(list)) {
avg_c <- rep(1:7) #create a new vector with 7 elements for each country
for (i in seq_along(subsets)) {
#here go through subsets and store 7 values in each of avg_c
}
}
I want to take each vector in list (they are num [1:84]) and create a new vector that takes averages of subsets of them. This is because the data contained within each is monthly and I need to convert it to annual data. Sadly I am not able to include the vectors in the MWE. When I try to create avg_Canada for example, I am only trying to do it for the first year as opposed to all years so I figured a loop would be appropriate to get at each year. Further, as you can tell, I have many countries.
My naming scheme is incorrect in the first nested loop, I have zero clue how to go about creating the avg_c variable within the second nested loop, but I believe my intuition is captured in the avg_Canada "attempt" if you could even call it that. Wondering if this is the way to go about this.
We can use mget to collect the country vectors named in list into a single list, and then take the mean of every 12 elements of each vector using by.
lapply(mget(list), function(x) as.numeric(by(x, rep(1:7, each = 12), mean)))
This will return you a list of length length(list) with every element a numeric vector of size 7.
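Since the question could not include the data, here is a small sketch of the same approach on made-up vectors. The two random vectors and the name countries are assumptions for illustration; tapply is used in place of by purely as a matter of taste and should give the same means.

```r
set.seed(1)

# Hypothetical monthly series: 84 values = 7 years x 12 months
Canada <- runif(84)
USA    <- runif(84)
countries <- c("Canada", "USA")

# Group index: 1 repeated 12 times, then 2 repeated 12 times, ... up to 7
year_id <- rep(1:7, each = 12)

# mget() looks the vectors up by name and returns them as a named list
annual <- lapply(mget(countries),
                 function(x) as.numeric(tapply(x, year_id, mean)))

str(annual)   # a list with one numeric vector of length 7 per country
```

For equally sized blocks like these, colMeans(matrix(x, nrow = 12)) is another way to get the same seven averages.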
I hope I phrased the question right, I'm not even sure how to word my question, which is probably part of why I'm having trouble finding the answer.
Consider a data.frame that has multiple string vectors. I would like to construct another variable that pair-wise combines the two vectors together, agnostic of their order.
For example, consider the following data.frame
df <- data.frame(var1 = c('string1', 'string2', 'string3'),
var2 = c('string3', 'string4', 'string1')
)
I'd like to have a variable that is identical for the first and 3rd element, like:
c('string1, string3', 'string2, string4', 'string1, string3')
I'm imagining that it might be best to make a variable/vector that's a list of the two component variables, but I'm obviously open to any solution. I tried to make a list variable that does what I want based on this question but with no luck:
Create a data.frame where a column is a list
If possible, I'd like to do this in a way that could extend to more than 2 columns and could efficiently run over millions of rows, especially if there is a data.table method.
Thanks for your help!
Edit: A crappy example of how I could do it with a forloop that doesn't quite work but you get the idea:
for (i in 1:nrow(df)) {
df$var.new[i] <- paste(sort( c(df$var1[i], df$var2[i])))
}
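One possible way to get the order-agnostic key described above is sketched here; it is not a tuned solution. The names cols, var.new2 are assumptions for this example. The row-wise apply version extends to any number of columns, while the pmin/pmax version only handles two columns but is vectorised.

```r
df <- data.frame(var1 = c("string1", "string2", "string3"),
                 var2 = c("string3", "string4", "string1"),
                 stringsAsFactors = FALSE)

# General version: sort the values within each row, then paste them together.
# Works for any number of columns listed in 'cols', but loops over rows.
cols <- c("var1", "var2")
df$var.new <- apply(as.matrix(df[, cols]), 1,
                    function(r) paste(sort(r), collapse = ", "))

# Two-column version: pmin()/pmax() pick the smaller/larger string element-wise,
# which avoids the row-by-row loop.
df$var.new2 <- paste(pmin(df$var1, df$var2), pmax(df$var1, df$var2), sep = ", ")

df
```

Rows 1 and 3 end up with the same key because sorting removes the dependence on which column each string came from.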
I have a df with over 30 columns and over 200 rows, but for simplicity will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data and generates a plot, with X1 giving the x-axis values and facets defined by the values in B and C.
library(reshape2)  # assuming melt() comes from reshape2
library(ggplot2)

plotdata <- function(l) {
  melted <- melt(df, id.vars = c("X1", "B", "C"), measure.vars = l)
  p <- ggplot(melted, aes(x = X1, y = value)) + geom_point()
  p2 <- p + facet_grid(B ~ C)
  ggsave(filename = paste("X_vs_", l, "_faceted.jpeg", sep = ""), plot = p2)
}
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type the column of interest into plotdata and then get the result, but this seems quite inelegant (and time consuming). I would prefer to be able to manually specify the columns of interest e.g. "Y1","Y3","Y4" and then write a loop function to do all those specified.
However I am new to writing for loops and can't find a way to loop in the specific column names that are required for my function to work. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop the user specified columns
Apologies if there is already an answer to this on Stack Overflow. I couldn't find it if there was.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric
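To make that point concrete without needing ggplot2, here is a tiny sketch in which a hypothetical helper, plot_file, stands in for plotdata and simply returns the file name that would be written. The names cols and files are also assumptions for this example.

```r
# Hypothetical stand-in for plotdata(): builds the output file name only
plot_file <- function(l) paste("X_vs_", l, "_faceted.jpeg", sep = "")

cols <- c("Y1", "Y3", "Y4")   # only the user-specified columns

files <- character(0)
for (col in cols) {           # 'col' takes each column name in turn
  files <- c(files, plot_file(col))
}

files
```

The loop variable is a character string on every pass, so it can be handed straight to any function that expects a column name.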
My dataframe (m*n) has a few hundred columns. I need to compare each column with all other columns (contingency table), perform a chisq test, and save the results for each column in a different variable.
It's working for one column at a time, like this:
s <- function(x) {
a <- table(x,data[,1])
b <- chisq.test(a)
}
c1 <- apply(data,2,s)
The results are stored in c1 for column 1, but how will I loop this over all columns and save result for each column for further analysis?
If you're sure you want to do this (I wouldn't, thinking about the multiple-testing problem), work with lists:
Data <- data.frame(
x=sample(letters[1:3],20,TRUE),
y=sample(letters[1:3],20,TRUE),
z=sample(letters[1:3],20,TRUE)
)
# Make a nice list of indices
ids <- combn(names(Data),2,simplify=FALSE)
# use the appropriate apply
my.results <- lapply(ids,
function(z) chisq.test(table(Data[,z]))
)
# use some paste voodoo to give the results the names of the column indices
names(my.results) <- sapply(ids,paste,collapse="-")
# select all values for y :
my.results[grep("y",names(my.results))]
Not harder than that. As I show you in the last line, you can easily get all tests for a specific column, so there is no need to make a list for each column. That just takes longer and takes more space, but gives the same information. You can write a small convenience function to extract the data you need :
extract <- function(col, l) {
  l[grep(col, names(l))]
}
# names look like "x-y", so match "y" anywhere in the name
extract("y", my.results)
This means you can even loop over the different column names of your dataframe and get a list of lists returned:
lapply(names(Data),extract,my.results)
I strongly suggest you get yourself acquainted with working with lists, they're one of the most powerful and clean ways of doing things in R.
PS : Be aware that you save the whole chisq.test object in your list. If you only need the value for Chi square or the p-value, select them first.
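As a sketch of that last point, the p-values can be pulled out of each test object with vapply. The data here is a larger random sample than in the example above (300 rows, an assumption made so the expected counts are not tiny), and pvals is a name invented for this example.

```r
set.seed(10)

# Same structure as the example data, but with more rows
Data <- data.frame(
  x = sample(letters[1:3], 300, TRUE),
  y = sample(letters[1:3], 300, TRUE),
  z = sample(letters[1:3], 300, TRUE)
)

ids <- combn(names(Data), 2, simplify = FALSE)
my.results <- lapply(ids, function(z) chisq.test(table(Data[, z])))
names(my.results) <- sapply(ids, paste, collapse = "-")

# Keep only the p-value of each test instead of the whole htest object
pvals <- vapply(my.results, function(r) r$p.value, numeric(1))
pvals
```

The test statistic can be extracted the same way with r$statistic.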
Fundamentally, you have a few problems here:
1. You're relying heavily on global variables rather than local ones. This makes the double usage of "data" confusing. Similarly, you rely on a hard-coded value (column 1) instead of passing it as an argument to the function.
2. You're not extracting the one value you need from chisq.test(). This means your result gets returned as a list.
3. You didn't provide any example data. So here's some:
m <- 10
n <- 4
mytable <- matrix(runif(m*n),nrow=m,ncol=n)
Once you fix the above problems, simply run a loop over various columns (since you've now avoided hard-coding the column) and store the result.
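A minimal sketch of what that loop might look like follows. Note the assumptions: it uses categorical example data rather than the runif matrix above (a chi-squared test on a contingency table needs discrete values), and the names mydata, chisq_p and pvals are invented for this illustration.

```r
set.seed(123)

# Hypothetical categorical data: m rows, n columns, three levels each
m <- 200
n <- 4
mydata <- as.data.frame(
  matrix(sample(letters[1:3], m * n, replace = TRUE), nrow = m)
)

# Both columns are passed as arguments, and only the p-value is returned
chisq_p <- function(d, i, j) {
  chisq.test(table(d[[i]], d[[j]]))$p.value
}

# Store one p-value per pair of columns; the diagonal stays NA
pvals <- matrix(NA_real_, nrow = n, ncol = n,
                dimnames = list(names(mydata), names(mydata)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) pvals[i, j] <- chisq_p(mydata, i, j)
  }
}

pvals
```

Row i of pvals then holds the results of comparing column i with every other column, which covers the "save the result for each column" part of the question in a single object.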