Formatting dataframes for statistical analyses - r

What I would like to do is to test the statistical relationship between one response and one explanatory variable. To do this, I assumed a one-way ANOVA was an effective procedure. However, my dataframe is not set up to do this. I have one column for a response variable (df1) but several columns that would be categorised into the explanatory variable I want (df2 and df3 below). As a crude example, df2 and df3 represent a season (summer) in 2 separate locations. How would I test the influence of summer on the response variable in this instance?
df1 <- as.data.frame(matrix(sample(0:1000, 36*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(matrix(sample(0:500, 36*10, replace=TRUE), ncol=1))
df3 <- as.data.frame(matrix(sample(0:200, 36*10, replace=TRUE), ncol=1))
Example <- cbind(df1,df2,df3)
Would this involve restructuring the dataframe so that df2 and df3 merge to become one long column and double the length of df1?
Thank you in advance for any help!

As suggested by Jaap and Andrew Taylor, the problem came down to formatting the data for a linear regression. This was achieved with the 'stack' and 'cbind' functions.
df1 <- as.data.frame(matrix(sample(0:1000, 36*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(matrix(sample(0:500, 36*10, replace=TRUE), ncol=1))
df3 <- as.data.frame(matrix(sample(0:200, 36*10, replace=TRUE), ncol=1))
Example <- cbind(df2,df3)
Stacked <- stack(Example)
Combined <- cbind(df1,Stacked)
colnames(Combined) <- c("Response","Explanatory","Variable")
# The response goes on the left-hand side of the formula
Linear <- lm(Response ~ Explanatory, data = Combined)
summary(Linear)
stack put all the explanatory variables (df2 and df3) into one column, whilst cbind combined this new column with the response values (df1). The response values were recycled so that both parts of the dataframe have the same number of rows, as per SabDeM's comment.
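The reshaping step can also be sketched with named columns, which keeps the location label readable in the stacked result. This is only an illustration under assumptions: the names SiteA, SiteB, Site, long and fit are hypothetical and do not appear in the original post, and the response is repeated explicitly with rep() instead of relying on cbind recycling.

```r
set.seed(1)  # only so the illustration is reproducible

# One response vector and two explanatory columns (one per location).
# SiteA and SiteB are hypothetical stand-ins for df2 and df3.
response <- sample(0:1000, 36 * 10, replace = TRUE)
sites <- data.frame(
  SiteA = sample(0:500, 36 * 10, replace = TRUE),
  SiteB = sample(0:200, 36 * 10, replace = TRUE)
)

# stack() returns two columns: 'values' (the data) and 'ind' (source column).
long <- stack(sites)
names(long) <- c("Explanatory", "Site")

# Repeat the response once per stacked column so the lengths match.
long$Response <- rep(response, times = ncol(sites))

# Response on the left-hand side of the formula, predictor on the right.
fit <- lm(Response ~ Explanatory, data = long)
summary(fit)
```

Because Site is kept as a factor, it could be added to the formula (for example Response ~ Explanatory + Site) if the two locations need to be distinguished.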

Related

Change factor levels for variable over multiple data frames

I have 4 data sets and would like to get the percentage of each group within each data set. This works fine using prop.table(table(df1$group)), then repeating for df2$group and so on, but I would like labels on my tables. So I converted the column to a factor and assigned appropriate levels; however, this involves assigning the levels to each data set separately.
I have tried using lapply but I end up with NAs for the factor levels.
Here is some data:
library(data.table)
df1 <- data.table(id=(1:100), group= sample(5,100, replace=T))
df2 <- data.table(id=(1:100), group= sample(5,100, replace=T))
df3 <- data.table(id=(1:100), group= sample(5,100, replace=T))
df4 <- data.table(id=(1:100), group= sample(5,100, replace=T))
df1$group <- as.factor(df1$group)
df2$group <- as.factor(df2$group)
df3$group <- as.factor(df3$group)
df4$group <- as.factor(df4$group)
What I have tried:
df <- list(df1,df2,df3,df4)
df <- lapply(df,function(x) x[,group:=factor(group, levels = c("A","B","C","D","E"))])
but this changes the levels and results in NAs.
The data are all in data.tables and I am interested in 5 factors per data.table. I would also be interested in changing the class of multiple variables across multiple data.tables but for simplicity this could be another question.
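The NAs appear because the values in group are 1 to 5, so none of them match the levels "A" to "E". We need to specify the levels present in the original data and the labels that correspond to them: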
lapply(df, function(x) x[, group := factor(group, levels = 1:5, labels = LETTERS[1:5])])
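The difference between levels and labels can be seen on a plain vector, without data.table. This is a minimal sketch; the vector group and the names wrong and right are made up for illustration.

```r
# Integer group codes, similar to what sample(5, ...) produces in the question.
group <- c(3, 1, 5, 2, 4, 1)

# 'levels' must name values that occur in the data; "A".."E" do not,
# so every element becomes NA.
wrong <- factor(group, levels = c("A", "B", "C", "D", "E"))

# Match the existing values with 'levels', then rename them with 'labels'.
right <- factor(group, levels = 1:5, labels = LETTERS[1:5])

print(wrong)
print(right)
prop.table(table(right))  # proportions with readable labels
```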

Split Apply Combine

I have a large list, and would like to apply the exact technique detailed in the answer here:
Create mutually exclusive dummy variables from categorical variable in R
However, my data is much larger, and I would like to split, apply and combine the operation to each individual row.
This code, which of course does not work, illustrates what I am trying to do:
id <- c(1,1,1,1)
time <- c(1,2,3,4)
time <- as.character(time)
unique.time <- as.character(unique(df$time))
df <- data.frame(id,time)
df1 <- split(df, row(df))
sapply(df1, (unique.time, function(x)as.numeric(df1$time == x)))
z <- unsplit(lapply(df1, row(df)), scale), x)
Thanks!
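One possible reading of the goal is sketched below: split the data frame into one-row pieces, build the 0/1 indicators for each piece, and bind the results back together. This is an assumption about the intent, not a confirmed answer; the names rows, dummies, result and vectorised, and the time_ prefix, are hypothetical.

```r
id <- c(1, 1, 1, 1)
time <- as.character(c(1, 2, 3, 4))
df <- data.frame(id, time, stringsAsFactors = FALSE)  # define df before using it
unique.time <- unique(df$time)

# Split: a list of one-row data frames.
rows <- split(df, seq_len(nrow(df)))

# Apply: for each row, a named 0/1 vector with one entry per unique time.
dummies <- lapply(rows, function(r) {
  setNames(as.numeric(r$time == unique.time), paste0("time_", unique.time))
})

# Combine: stack the vectors into a matrix and attach it to the original data.
result <- cbind(df, do.call(rbind, dummies))
result

# For comparison, the vectorised form works on all rows at once.
vectorised <- sapply(unique.time, function(x) as.numeric(df$time == x))
```

When the data fit in memory, the vectorised form is usually simpler than splitting by row; the split version is mainly useful if each piece has to be processed separately.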

Averaging values between paired columns across a large data frame

I have a dataframe consisting of a series of paired columns. Here is a small example.
df1 <- as.data.frame(matrix(sample(0:1000, 36*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(rep(1:12, each=30))
df3 <- as.data.frame(matrix(sample(0:500, 36*10, replace=TRUE), ncol=1))
df4 <- as.data.frame(c(rep(5:12, each=30),rep(1:4, each=30)))
df5 <- as.data.frame(matrix(sample(0:200, 36*10, replace=TRUE), ncol=1))
df6 <- as.data.frame(c(rep(8:12, each=30),rep(1:7, each=30)))
Example <- cbind(df1,df2,df3,df4,df5,df6)
What I would like to do is find an average value for the odd-numbered columns (df1, df3, df5) based on the values in the adjacent column, so in the example I would have three sets of averages, one for each value between 1 and 12. I have managed to apply a function to a specific pair of columns...
Example_two <- cbind(df1,df2)
colnames (Example_two) <- c("x","y")
tapply(Example_two$x, Example_two$y, mean)
However, the dataframe I will be looking at will be considerably larger, so some form of apply function would be ideal to perform this iteratively across each paired set. I have found a similar problem ("Is there a R function that applies a function to each pair of columns?"), but I can't seem to apply this to my own dataset.
Any help would be much appreciated, thank you in advance.
Try
mapply(function(x,y) tapply(x,y, FUN=mean) ,
Example[seq(1, ncol(Example), 2)], Example[seq(2, ncol(Example), 2)])
Or, instead of seq(1, ncol(Example), 2), just use c(TRUE, FALSE) for the first case and c(FALSE, TRUE) for the second, since a logical index is recycled across the columns.
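The logical-index variant can be checked on a small data frame. This is a sketch with made-up data; the column names x1, g1, x2, g2 and the objects values, groups and means are hypothetical.

```r
set.seed(42)

# Hypothetical paired layout: a value column followed by its grouping column.
Example <- data.frame(
  x1 = sample(0:100, 12, replace = TRUE),
  g1 = rep(1:3, each = 4),
  x2 = sample(0:100, 12, replace = TRUE),
  g2 = rep(c(2, 3, 1), each = 4)
)

# A length-2 logical index is recycled across all columns.
values <- Example[c(TRUE, FALSE)]  # columns 1, 3, ...
groups <- Example[c(FALSE, TRUE)]  # columns 2, 4, ...

# One tapply() per pair; results of equal length are simplified to a matrix.
means <- mapply(function(x, y) tapply(x, y, FUN = mean), values, groups)
means
```

Each column of means holds the group averages for one value column, with the group values as row names.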

R: Add columns to a data frame on the fly

New to R and programming in general. I have several binary matrices of presence/absence data for species (columns) and plots (rows). I'm trying to use them in several dissimilarity indices, which require that they all have the same dimensions. Although there are always 10 plots, there is a variable number of columns depending on which species were observed at that particular time. My attempt to add the 'missing' columns to each matrix so I can perform the analyses went as follows:
df1 <- read.csv('file1.csv', header=TRUE)
df2 <- read.csv('file2.csv', header=TRUE)
newCol <- unique(append(colnames(df1),colnames(df2)))
diff1 <- setdiff(newCol,colnames(df1))
diff2 <- setdiff(newCol,colnames(df2))
for (i in 1:length(diff1)) {
df1[paste(diff1[i])]
}
for (i in 1:length(diff2)) {
df2[paste(diff2[i])]
}
No errors are thrown, but df1 and df2 both remain unchanged. I suspect my issue is with my use of paste, but I couldn't find any other way to add columns to a data frame on the fly like that. When added, the new columns should have 0s in the matrix as well, but I think that's the default, so I didn't add anything to specify it.
Thanks all.
Using your code, you can generate the columns without the for loop by:
df1[, diff1] <- 0 #I guess you want `0` to fill those columns
df2[, diff2] <- 0
identical(sort(colnames(df1)), sort(colnames(df2)))
#[1] TRUE
Or if you want to combine the datasets into one, you could use rbindlist from data.table with fill=TRUE (columns missing from one dataset are then filled with NA):
library(data.table)
rbindlist(list(df1, df2), fill=TRUE)
data
set.seed(22)
df1 <- as.data.frame(matrix(sample(0:1, 10*6, replace=TRUE), ncol=6,
dimnames=list(NULL, sample(paste0("Species", 1:10), 6, replace=FALSE))))
set.seed(35)
df2 <- as.data.frame(matrix(sample(0:1, 10*8, replace=TRUE), ncol=8,
dimnames=list(NULL, sample(paste0("Species", 1:10),8 , replace=FALSE))))
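The key point is that a column only appears once something is assigned to it; an expression such as df1[name] on its own assigns nothing, so the data frame stays as it was. A base R sketch of the same idea follows, wrapped in a small helper. The helper add_missing and the toy data are hypothetical and only for illustration.

```r
# Two small presence/absence tables with partly different species.
df1 <- data.frame(Species1 = c(1, 0), Species3 = c(0, 1))
df2 <- data.frame(Species2 = c(1, 1), Species3 = c(0, 0))

all_cols <- union(colnames(df1), colnames(df2))

# Hypothetical helper: add any absent columns as 0 and fix the column order.
add_missing <- function(d, cols) {
  missing_cols <- setdiff(cols, colnames(d))
  if (length(missing_cols) > 0) {
    d[missing_cols] <- 0  # the assignment is what creates the columns
  }
  d[cols]  # return the columns in a common order
}

df1 <- add_missing(df1, all_cols)
df2 <- add_missing(df2, all_cols)
```

Returning the columns in a common order means the two tables line up position by position, which the sort() comparison in the answer only checks by name.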

randomize a data.frame based on a column and keeping proportions

I have a data.frame that looks like this (my real data.frame is bigger but the structure is similar):
df <- data.frame(ID=c(rep('A', 5), rep('B', 5), rep('C',5)), Score=c(1,1,0,0,0,1,1,1,0,0,1,1,1,0,0))
And I would like to obtain several randomized data.frames (e.g. 100) where column Score is randomized and column ID remains the same, but I need to keep the same number of zeros and ones in df$Score.
I've tried with:
df1 <- transform(df, Score=ave(Score, ID, FUN=function(b) sample(b, replace=T)))
but the proportions of 0s and 1s are not always kept.
Thanks
If you want to keep the 0-1 proportion within IDs, set replace=F (which is the default):
df1 <- transform(df, Score=ave(Score, ID, FUN=function(b) sample(b, replace=F)))
If you want to keep the overall 0-1 proportion, you can simply do this:
df1 <- data.frame(ID=df$ID, Score=sample(df$Score))
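To get the 100 randomized data frames mentioned in the question, either variant can be wrapped in replicate(). This is a sketch under assumptions: the object names within_id and overall are hypothetical, and set.seed is used only to make the illustration reproducible.

```r
df <- data.frame(
  ID = rep(c("A", "B", "C"), each = 5),
  Score = c(1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0)
)

set.seed(123)

# Shuffle Score within each ID: the 0/1 counts per ID stay the same.
within_id <- replicate(
  100,
  transform(df, Score = ave(Score, ID, FUN = sample)),
  simplify = FALSE
)

# Shuffle Score across all rows: only the overall 0/1 counts stay the same.
overall <- replicate(
  100,
  transform(df, Score = sample(Score)),
  simplify = FALSE
)

# Check one shuffled copy against the original counts per ID.
tapply(within_id[[1]]$Score, within_id[[1]]$ID, sum)
tapply(df$Score, df$ID, sum)
```

Sampling without replacement only permutes the existing values, which is why the counts are preserved. Note that sample() treats a single number of 1 or more as the range 1:n, so an ID with only one row would need separate handling.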
