Write multiple MannKendall results to a data frame or CSV in R

I'm an R beginner, and I'm struggling to find a solution to something that's probably extremely straightforward. Help appreciated.
I'm evaluating sulfate trends in >1,000 groundwater wells using the MannKendall function from the Kendall package in R, and I've been storing the results as individual lists. I'd like to combine all the results into a single data frame so I can target wells with increasing concentrations, export the results to CSV, and share them with folks who don't know how to use R.
#Example:
library(Kendall)
w1<-c(4.3,5.7,2.4,9.8,6.7,3.9,8.3,9.6,4.7)
w2<-c(3.2,5.8,9.9,14.6,17.8,13.5,20.4,78.9,50.3)
w1mk<-MannKendall(w1)
w2mk<-MannKendall(w2)
#Next step: combine and store w1mk and w2mk results as data frame for analysis/export

#CBernhardt I'd break this down into small manageable steps to see what's going on with your data...
Starting with your example data...
library(Kendall)
w1<-c(4.3,5.7,2.4,9.8,6.7,3.9,8.3,9.6,4.7)
w2<-c(3.2,5.8,9.9,14.6,17.8,13.5,20.4,78.9,50.3)
First thing I would do is put all these individual vectors into one big data frame with one site (well) per column
groundwater <- data.frame(w1=w1,w2=w2)
groundwater
Then use a simple lapply command to run the test across each column (well/site)
allofthem <- lapply(groundwater, function(y) unlist(MannKendall(y)))
allofthem is now a list of the Mann Kendall results per site...
allofthem
#$w1
# tau sl S D varS
# 0.2222222 0.4655123 8.0000000 36.0000000 92.0000000
When you're sure that's working, put it all in a data frame
MKResults <- as.data.frame(allofthem)
MKResults
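To get one row per well, which tends to be easier to filter and share than one column per well, the per-well results can be row-bound and written out. This is a minimal sketch, assuming base R's cor.test(method = "kendall") as a stand-in for MannKendall so that it runs without extra packages. The names trend_stats and p_value and the 0.05 cutoff are illustrative choices, not part of the original answer. Swapping the body of trend_stats for unlist(MannKendall(y)) should give the tau, sl, S, D and varS columns instead, in which case the filter would use sl rather than p_value.

```r
# Example data from the question
groundwater <- data.frame(
  w1 = c(4.3, 5.7, 2.4, 9.8, 6.7, 3.9, 8.3, 9.6, 4.7),
  w2 = c(3.2, 5.8, 9.9, 14.6, 17.8, 13.5, 20.4, 78.9, 50.3)
)

# Stand-in for unlist(MannKendall(y)): Kendall correlation of the
# series against its time order (assumption: base R only)
trend_stats <- function(y) {
  ct <- cor.test(y, seq_along(y), method = "kendall")
  c(tau = unname(ct$estimate), p_value = ct$p.value)
}

# One named numeric vector per well, then one row per well
per_well <- lapply(groundwater, trend_stats)
results <- as.data.frame(do.call(rbind, per_well))
results$well <- rownames(results)
rownames(results) <- NULL

# Wells with a positive trend below an illustrative 0.05 cutoff
increasing <- results[results$tau > 0 & results$p_value < 0.05, ]

# Export for people who do not use R (temporary file for the example)
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE)
```

With >1,000 wells the same code applies unchanged, as long as every well is a column of groundwater.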

Not sure what you would like your data frame to look like, but either way, try this:
as.data.frame(rbind(unlist(w1mk),unlist(w2mk)))
or
library(dplyr)
bind_rows(unlist(w1mk),unlist(w2mk))

Related

Running multiple Kruskal Wallis test with lapply taking long. Easier solution?

I have a data frame named KWR with 90 observations and 124306 variables, all numeric. I want to run a Kruskal-Wallis test on every column, comparing groups. I added a column named "Group" after my variables that holds each observation's group. To check that it works, I tested one peptide (named X2461) with this code:
kruskal.test(X2461 ~ Group, data = KWR)
That worked fine and gave me a result instantly. However, I need all the variables to be analyzed. I used lapply after reading this post: How to loop Bartlett test and Kruskal tests for multiple columns in a dataframe?
cols <- names(KWR)[1:124306]
allKWR <- lapply(cols, function(x) kruskal.test(reformulate("Group", x), data = KWR))
However, after 2 hours of R working non stop, I quit the job. Is there any more efficient way of doing this?
Thanks in advance.
NB: first time poster, beginner in R
Take a look at kruskaltests in the Rfast package. For the KWR data.frame, it appears it would be something like:
allKWR <- Rfast::kruskaltests(as.matrix(KWR[,1:124306]), as.numeric(as.factor(KWR$Group)))
This was great - I got 50 columns and several hundred cases in 0.01 system time.
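If adding a package is not an option, a lighter base R variant is to call the default method of kruskal.test on each column and keep only the p-value, rather than building a formula and storing a full test object per column. This is a minimal sketch on made-up data (the 200 random columns and the three groups a, b, c are assumptions for illustration). It does not establish how long 124306 columns would take.

```r
set.seed(42)

# Made-up stand-in for KWR: 90 observations, 200 numeric variables
n_obs <- 90
n_vars <- 200
KWR <- as.data.frame(matrix(rnorm(n_obs * n_vars), nrow = n_obs))
KWR$Group <- rep(c("a", "b", "c"), length.out = n_obs)

# Every column except the grouping column
cols <- setdiff(names(KWR), "Group")
group <- factor(KWR$Group)

# kruskal.test(x, g) is the non-formula interface; keep only the p-value
p_values <- vapply(
  KWR[cols],
  function(col) kruskal.test(col, group)$p.value,
  numeric(1)
)

# Smallest p-values first
head(sort(p_values))
```

The result is a named numeric vector, so it can be turned into a data frame or written to CSV directly.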

How to fill dataframe rows for progressive files in a for loop in R

I'm trying to analyze some data acquired from experimental tests with several variables being recorded. I've imported a dataframe into R and I want to obtain some statistical information by processing these data.
In particular, I want to fill in an empty dataframe with the same variable names of the imported dataframe but with statistical features like mean, median, mode, max, min and quantiles as rows for each variable.
The input dataframes are something like 60 columns x 250k rows each.
I've already managed to do this using apply as in the following lines of code for a single input file.
df[1,] <- apply(mydata,2,mean,na.rm=T)
df[2,] <- apply(mydata,2,sd,na.rm=T)
...
Now I need to do this in a for loop for a number of input files mydata_1, mydata_2, mydata_3, ... in order to build several summary statistics dataframes, one for each input file.
I tried in several different ways, trying with apply and assign but I can't really manage to access each row of interest in the output dataframes cycling over the several input files.
I would like to do something like the code below (I know that this code does not work; it's just to give an idea of what I want to do).
The output df dataframes are already defined and empty.
for (xx in 1:number_of_mydata_files) {
df_xx[1,]<-apply(mydata_xx,2,mean,na.rm=T)
df_xx[2,]<-apply(mydata_xx,2,sd,na.rm=T)
...
}
I can't remember the error message this code gives, but the problem is that it does not run at all.
I'm quite a beginner in R, so I don't have much experience with the language. Is there a way to do this? Are there other functions that could be used instead of apply and assign?
EDIT:
I add here a simple table description that represents the input dataframes I’m using. Sorry for the poor data visualization right here. Basically the input dataframes I’m using are .csv imported files, looking like tables with the first row being the column description, aka the name of the measured variable, and the following rows being the acquired data. I have 250 000 acquisitions for each variable in each file, and I have something like 5-8 files like this being my input.
Current [A] | Force [N] | Elongation [%] | ...
—————————————————————————————————————
Value_a_1 | Value_b_1 | Value_c_1 | ...
I just want to obtain a data frame like this as an output, with the same variables name, but instead with statistical values as rows. For example, the first row, instead of being the first values acquired for each variable, would be the mean of the 250k acquisitions for each variable. The second row would be the standard deviation, the third the variance and so on.
I’ve managed to build empty dataframes for the output summary statistics, with just the columns and no rows yet. I just want to fill them and do this iteratively in a for loop.
Not sure what your data looks like but you can do the following where lst represents your list of data frames.
lst <- list(iris[,-5],mtcars,airquality)
lapply(seq_along(lst),
function(x) sapply(lst[[x]],function(x)
data.frame(Mean=mean(x,na.rm=TRUE),
sd=sd(x,na.rm=TRUE))))
Or as suggested by @G. Grothendieck simply:
lapply(lst, sapply, function(x)
data.frame(Mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE)))
If all your files are in the same directory, set the working directory to it and use list.files() to walk along the input files on disk, or ls() to walk along data frames that are already loaded in your workspace.
If they share the same column names, you can rbind the result into a single data set.
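To get the layout described in the question, with the same column names as the input and one row per statistic, the inner function can return a named vector of statistics. This is a minimal sketch using the same built-in data sets as above in place of the real files. The helper name summarise_columns and the particular set of statistics are assumptions for illustration.

```r
# Build one summary data frame: columns match the input, rows are statistics
summarise_columns <- function(d) {
  stats <- list(mean = mean, sd = sd, median = median, min = min, max = max)
  # Inner sapply: one value per statistic; outer sapply: one column per variable
  out <- sapply(d, function(col) {
    sapply(stats, function(f) f(col, na.rm = TRUE))
  })
  as.data.frame(out)
}

# Built-in data sets standing in for mydata_1, mydata_2, mydata_3
inputs <- list(iris = iris[, -5], mtcars = mtcars, airquality = airquality)

# One summary data frame per input, kept together in a named list
summaries <- lapply(inputs, summarise_columns)

summaries$mtcars["mean", "mpg"]  # mean of mpg in the mtcars summary
```

Keeping the inputs and outputs in lists avoids having to construct names like df_1, df_2 with assign inside a for loop.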

convert lists to data frames

I'm a beginner in R. I'm working with a data set to expand my knowledge, especially in data manipulation.
The task is to split my data set based on a parameter (column), then calculate the standard deviation for each group, then produce some graphs. I did split my data set into about 3000 lists, but I'm stuck on converting the lists into separate data sets so I can collect the SD for each one. Or is there an efficient way to do it all in one piece of code?
This is what I did so far.
library(dplyr) # provides select()
xx <- read.table("NetworkRail1.csv",sep=",",header=TRUE)
selected <- select(xx,ID, Location, Top70m)
splitNR <- split(selected, selected$Location %% 0.125)
If your problem is to save the different components of the list separately, you can try this:
list2env(splitNR, envir=.GlobalEnv)
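If the goal is only the standard deviation per group, it may be simpler to skip the separate data sets and aggregate directly. This is a minimal sketch on made-up data. The values, the three Location levels and the column name sd_Top70m are assumptions, and it groups by Location itself rather than by the modulo expression in the question.

```r
set.seed(1)

# Made-up stand-in for the selected columns of NetworkRail1.csv
selected <- data.frame(
  ID = 1:12,
  Location = rep(c(0.125, 0.250, 0.375), each = 4),
  Top70m = rnorm(12)
)

# One row per Location with the standard deviation of Top70m
sd_by_location <- aggregate(Top70m ~ Location, data = selected, FUN = sd)
names(sd_by_location)[2] <- "sd_Top70m"

sd_by_location
```

The result is a single data frame, which is usually easier to plot from than thousands of separate objects in the global environment.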

How can I see multiple variable's outlier in one boxplot using R?

I am a newbie to R and I have a question. To check a variable for outliers we generally use:
boxplot(train$rate)
Here rate is a variable in my data set and train is the name of the data set. But when I have many variables, say 100 or 150, checking them for outliers one by one is very time-consuming. Is there a function that shows the outliers of all 100 variables in one boxplot?
If so, which function can remove the outliers of all those variables at once instead of one by one? Please help me solve this problem.
Thanks in advance.
I agree with Rui Barradas that it is bad practice to remove outliers without further thought. As long as the value is valid you should keep it in your data or at least run two separate analyses with and without the influential value. You could use a for loop to apply a function to every variable in your dataset.
train2<-train # Copy old dataset
outvalue<-list() # Create two empty lists
outindex<-list()
for(i in 1:ncol(train2)){ # For every column in your dataset
outvalue[[i]]<-boxplot(train2[,i])$out # Plot and get the outlier values
outindex[[i]]<-which(train2[,i] %in% outvalue[[i]]) # Get the outlier indices
train2[outindex[[i]],i] <- NA # Replace the outliers with NA
}
This works and plots the data, but it is quite slow. If you don't want to plot the data but just want the outliers, you could look into other outlier functions. The extremevalues package has a function that takes a different approach to identifying outliers and doesn't require a plot.
This uses the getOutliers function from the extremevalues package
library(extremevalues) # provides getOutliers()
outRight<-list()
outLeft<-outRight
for(i in 1:ncol(train2)){
outRight[[i]]<-getOutliers(train2[,i])$iRight
outLeft[[i]]<-getOutliers(train2[,i])$iLeft
train2[outRight[[i]],i] <- NA
train2[outLeft[[i]],i] <- NA
}
The function boxplot returns a value. If you see the Value section of its help page you'll see that it's a list with named components, one of which is out. That's the one you seem to be looking for.
bp <- boxplot(train$rate)
bp$out
clean <- train$rate[!train$rate %in% bp$out] # to remove the outliers (also safe when there are none)
I also would not do that. Outliers are data, and normal/likely to occur. By eliminating them you are not taking into account the entirety of your data, a bad practice.
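For inspecting outliers across many variables without drawing a plot per column, boxplot.stats computes the same statistics that boxplot uses. This is a minimal sketch, using the built-in mtcars data as a stand-in for train. It only lists and counts the flagged values and leaves the data untouched.

```r
# Stand-in for the train data set (assumption: built-in mtcars)
train <- mtcars

# Keep numeric columns only; boxplot.stats needs numeric input
numeric_cols <- train[sapply(train, is.numeric)]

# Values flagged by the boxplot rule, one list element per column
outliers <- lapply(numeric_cols, function(col) boxplot.stats(col)$out)

# Number of flagged values per column, for a quick overview
n_outliers <- lengths(outliers)
n_outliers[n_outliers > 0]
```

Calling boxplot(numeric_cols) draws all columns side by side in one plot, though with 100 or more variables on different scales it may be hard to read.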

Extract r^2 from multiple models in plyr

I am hoping to efficiently combine my regressions using plyr functions. I have data frames with monthly data for multiple years in format yDDDD (so y2014, y2013, etc.)
Right now, I have the below code for one of those dfs, y2014. I am running the regressions by month, as desired within each year.
modelsm2= by(y2014,y2014$Date,function(x) lm(y~.,data=x))
summarym2=lapply(modelsm2,summary)
coefficientsm2=lapply(modelsm2,coef)
coefsm2v2=ldply(modelsm2,coef) #to get the coefficients into an exportable df
I have several things I'd like to do and I would really appreciate your help!
A. Extract the r^2 for each model. I know that for one model, you can do summary(model)$r.squared to get it, but I have not had luck with my construct.
B. Apply the same methodology in a loop-type structure to get the models to run for all of my data frames (y2013 and backwards)
C. Get the summary into an easily exportable (to Excel) format --> the ldply function does not work for the summaries.
Thanks again.
A. You need to subset out the r.squared values from your summaries:
lapply(summarym2,"[[","r.squared")
B. Put all your data into a list, and put another lapply around it, eg:
lmlist <- lapply(list(y2014,y2013,y2012), function(dat)
by(dat,dat$Date, function(x) lm(y~.,data=x))
)
You will then have a list of lists so for instance to extract the summaries, you would use:
lapply(lmlist,lapply,summary)
C. summary returns a fairly complex data structure that cannot be coerced into a data.frame. The result you see is a consequence of the print method for it. You can use capture.output to get a character vector with one element per line of the output, which you can then write to a file.
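For an Excel-friendly table, one option is to pull only the numbers of interest out of each summary and assemble them into a data frame. This is a minimal sketch on the built-in mtcars data, with split() plus lapply() standing in for by() and the cyl column standing in for Date. The formula mpg ~ wt and the name r2_table are assumptions for illustration.

```r
# One model per group (groups of cyl stand in for months)
models <- lapply(split(mtcars, mtcars$cyl), function(x) lm(mpg ~ wt, data = x))

# Extract r.squared from each model's summary
r2_table <- data.frame(
  cyl = names(models),
  r_squared = vapply(models, function(m) summary(m)$r.squared, numeric(1)),
  row.names = NULL
)

# Write a CSV that opens in Excel (temporary file for the example)
out_file <- tempfile(fileext = ".csv")
write.csv(r2_table, out_file, row.names = FALSE)
```

Wrapping this in a function and calling it with lapply over the list of yearly data frames would extend it to all years.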
