I'm new to R and programming in general. I have several binary matrices of presence/absence data for species (columns) and plots (rows). I want to use them with several dissimilarity indices, which require that the matrices all have the same dimensions. There are always 10 plots, but the number of columns varies depending on which species were observed at that particular time. My attempt to add the 'missing' columns to each matrix so I can perform the analyses went as follows:
df1 <- read.csv('file1.csv', header=TRUE)
df2 <- read.csv('file2.csv', header=TRUE)
newCol <- unique(append(colnames(df1),colnames(df2)))
diff1 <- setdiff(newCol,colnames(df1))
diff2 <- setdiff(newCol,colnames(df2))
for (i in 1:length(diff1)) {
  df1[paste(diff1[i])]
}
for (i in 1:length(diff2)) {
  df2[paste(diff2[i])]
}
No errors are thrown, but df1 and df2 both remain unchanged. I suspect my issue is with my use of paste, but I couldn't find any other way to add columns to a data frame on the fly like that. The new columns should also be filled with 0s, but I think that's the default, so I didn't add anything to specify it.
Thanks all.
Using your code, you can generate the columns without the for loop (note that your loop as written never assigns anything back to df1 or df2, which is why they remain unchanged):
df1[, diff1] <- 0 #I guess you want `0` to fill those columns
df2[, diff2] <- 0
identical(sort(colnames(df1)), sort(colnames(df2)))
#[1] TRUE
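Since the dissimilarity indices will compare the matrices column by column, you will probably also want both data frames to have their columns in the same order; a minimal sketch:
df1 <- df1[, sort(colnames(df1))]
df2 <- df2[, sort(colnames(df2))]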
Or if you want to combine the datasets into one, you could use rbindlist from data.table with fill=TRUE:
library(data.table)
rbindlist(list(df1, df2), fill=TRUE)
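Note that fill=TRUE pads the missing columns with NA rather than 0, so for presence/absence data you would likely convert those afterwards; a sketch, assuming NA should mean 'absent':
combined <- as.data.frame(rbindlist(list(df1, df2), fill=TRUE))
combined[is.na(combined)] <- 0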
data
set.seed(22)
df1 <- as.data.frame(matrix(sample(0:1, 10*6, replace=TRUE), ncol=6,
dimnames=list(NULL, sample(paste0("Species", 1:10), 6, replace=FALSE))))
set.seed(35)
df2 <- as.data.frame(matrix(sample(0:1, 10*8, replace=TRUE), ncol=8,
dimnames=list(NULL, sample(paste0("Species", 1:10), 8, replace=FALSE))))
I would like to replace multiple variables with variables from a second dataframe in R.
df1$var1 <- df2$var1
df1$var2 <- df2$var2
# and so on ...
As you can see, the variable names are the same in both dataframes; however, the numeric values differ slightly, and the correct versions are in df2 but need to be in df1. I need to do this for many, many variables in a complex data set and wonder whether someone could help with a more efficient way to code this (possibly without referencing each column by name).
Here is some example data:
# dataframe 1
var1 <- c(1:10)
var2 <- c(1:10)
df1 <- data.frame(var1,var2)
# dataframe 2
var1 <- c(11:20)
var2 <- c(11:20)
df2 <- data.frame(var1,var2)
# assigning correct values
df1$var1 <- df2$var1
df1$var2 <- df2$var2
As Parfait has said, the current post seems a bit too simplified to give any immediate help, but I will try to summarize what you may need for something like this to work.
If the assumption is that df1 and df2 have the same number of rows AND that their orders are already matching, then you can achieve this really easily by the following subset notation:
df1[,c({column names df1}), drop = FALSE] <- df2[, c({column names df2}), drop = FALSE]
Let's say that df1 has columns a, b, and c, and you want to replace b and c with two columns of df2, whose columns are x, y, and z.
df1[,c("b","c"), drop = FALSE] <- df2[, c("y", "z"), drop = FALSE]
Here we are replacing b with y and c with z. The drop argument is just added protection when subsetting a data.frame, to ensure you don't end up with a vector.
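Since in your case the variable names are identical in both dataframes, you can avoid writing out the column names entirely; a minimal sketch, assuming matching row order:
common <- intersect(names(df1), names(df2))
df1[common] <- df2[common]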
If you do NOT know that the order is correct, or one data frame may have a different number of rows than the other, BUT there is a unique identifier shared between the two data.frames - then I would personally use a function that is designed for merging two data frames. Depending on your preference you can use merge from base R or the *_join functions from the dplyr package (my preference).
library(dplyr)
#assuming a and x are unique identifiers that can be matched.
new_df <- left_join(df1, df2, by = c("a"="x"))
I have been given instructions to do an analysis in R with the vegan package (concerning DCAs).
The instructions on a single dataframe are pretty straightforward, but I would like to apply the analysis on a set of dataframes.
I know this can be done with a for loop or with lapply or sapply, but I have trouble dealing with the fact that at each step of the analysis a new extension is added to the name of the dataframe.
An example below
Say I have a dataframe DF, then it goes as follows:
DF.t1 <- decostand(DF, "total")
DF.t2 <- decostand(DF.t1, "max")
DF.t2.dca <- decorana(DF.t2)
DF.t2.dca.DW <- decorana(DF.t2, iweigh=1)
names(DF.t2.dca)
summary(DF.t2.dca)
DF.t2.dca.taxonscores <- scores(DF.t2.dca, display=c("species"), choices=c(1,2))
DF.t2.dca.taxonscores <- DF.t2.dca$cproj[ ,1:2]
DF.t2.dca.samplescores <- scores(DF.t2.dca, display=c("sites"), choices=1)
What I want to achieve is to run several dataframes through this analysis without writing it all out separately.
Let's say I have a set of dataframes called "DF_1", "DF_2" & "DF_3" which I want to do this analysis on.
I probably need to put the dataframes in a list, and get all the steps in a for-loop or one of the apply methods.
But how do I approach the problem with the extensions added (.ra, .t1, .t2, .t2.dca, .t2.dca.DW etc.) to the dataframe names?
Edit: I need to retain the original dataframes after the analysis, in order to do follow-up analysis on them.
Unless you have a very limited number of data frames, I would not advise defining ca. 8 new objects for each data frame in the global environment, as this can become very messy.
One approach you might consider is creating a nested list where the first level is the data frame and the second level are the modified data frames.
# some example data sets
DF1 <- mtcars
DF2 <- mtcars*2
DF3 <- mtcars*3
all_dfs <- list(DF1 = DF1, DF2 = DF2, DF3 = DF3)
library(vegan)

some_stuff <- function(df) {
  DF.t1 <- decostand(df, "total")
  DF.t2 <- decostand(DF.t1, "max")
  DF.t2.dca <- decorana(DF.t2)
  DF.t2.dca.DW <- decorana(DF.t2, iweigh=1)
  DF.t2.dca.taxonscores <- scores(DF.t2.dca, display=c("species"), choices=c(1,2))
  DF.t2.dca.samplescores <- scores(DF.t2.dca, display=c("sites"), choices=1)
  return(list(DF.t1 = DF.t1, DF.t2 = DF.t2,
              DF.t2.dca = DF.t2.dca,
              DF.t2.dca.DW = DF.t2.dca.DW,
              DF.t2.dca.taxonscores = DF.t2.dca.taxonscores,
              DF.t2.dca.samplescores = DF.t2.dca.samplescores))
}
nested_list <- lapply(all_dfs, some_stuff)
# To obtain any of the objects for a specific data.frame you could, for example, run
nested_list$DF1$DF.t2.dca.DW
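And if you want to pull one piece of the output for every data frame at once, for example the taxon scores, something like this works:
all_taxonscores <- lapply(nested_list, `[[`, "DF.t2.dca.taxonscores")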
I have a data frame with factor columns. Here is a tiny example:
dat <- data.frame(one = factor(c("a", "b")), two = factor(c("c", "d")))
I can calculate the means of the numeric values that underlie the factor labels for each column:
mean(as.integer(dat$one))
#[1] 1.5
But since there are very many columns in my data frame, I would like to avoid having to calculate all the individual means and would rather do something like:
colMeans(dat)
which doesn't work, since the columns are factors, or
colMeans(as.integer(dat))
which doesn't work either.
So how can I easily calculate the means of all factor columns, without a loop or individually calculating them all?
Do I really have to change the class of all columns?
data.matrix is pretty much designed for such a task. It also leaves numeric and integer columns untouched, if present, which reduces memory usage, though the conversion to a matrix can sometimes add overhead. So as long as you don't have character columns, this should be pretty straightforward:
colMeans(data.matrix(dat))
# one two
# 1.5 1.5
We can use lapply
lapply(dat, function(x) mean(as.integer(x)))
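If you would rather have a named numeric vector than a list, sapply is a drop-in replacement:
sapply(dat, function(x) mean(as.integer(x)))
# one two
# 1.5 1.5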
Or with dplyr
library(dplyr)
dat %>%
  summarise_each(funs(mean(as.integer(.))))
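Note that summarise_each and funs are deprecated in current dplyr; a sketch of the modern equivalent:
dat %>%
  summarise(across(everything(), ~ mean(as.integer(.x))))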
For big datasets, it may be better to calculate the mean for each column separately, as converting to a matrix may also create memory issues.
Write a simple function that uses a for loop to collect all of the column means into a vector.
dat <- data.frame(one = c(1:10), two = c(1:10))
# named col_means to avoid masking base::colMeans
col_means <- function(tablename) {
  colmean <- numeric(ncol(tablename))
  for (i in seq_len(ncol(tablename))) {
    # as.integer() also handles factor columns, as in the question
    colmean[i] <- mean(as.integer(tablename[, i]))
  }
  return(colmean)
}
col_means(dat)
Hope this works
You can also use the data.table package, which is faster than data.frame. If your data is big, e.g. millions of observations, then data.table helps optimize run time.
Below is the code:
library(data.table)
dat <- data.table(one = factor(c("a", "b")), two = factor(c("c", "d")))
factorCols <- c("one", "two")
dat[, lapply(.SD, FUN=function(x) mean(as.integer(x))), .SDcols=factorCols]
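With the example dat above, this returns a one-row data.table of means:
#    one two
# 1: 1.5 1.5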
I am trying to subset my data to drop rows with certain values of certain variables. Suppose I have a data frame df with many columns and rows, I want to drop rows based on the values of variables G1 and G9, and I only want to keep rows where those variables take on values of 1, 2, or 3. In this way, I aim to subset on the same values across multiple variables.
I am trying to do this with few lines of code and in a manner that allows quick changes to the variables or values I would like to use. For example, assuming I start with data frame df and want to end with newdf, which excludes observations where G1 and G9 do not take on values of 1, 2, or 3:
# Naive approach (requires manually changing variables and values in each line of code)
newdf <- df[which(df$G1 %in% c(1,2,3)), ]
newdf <- newdf[which(newdf$G9 %in% c(1,2,3)), ]
# Better approach (requires manually changing variable names in each line of code)
vals <- c(1,2,3)
newdf <- df[which(df$G1 %in% vals), ]
newdf <- newdf[which(newdf$G9 %in% vals), ]
If I wanted to subset not only on G1 and G9 but on MANY variables, this manual approach would be time-consuming to modify. I want to simplify this even further by consolidating all of the code into a single line. I know the below is wrong, but I am not sure how to implement an alternative.
newdf <- c(1,2,3)
newdf <- c(df$G1, df$G9)
newdf <- df[which(df$vars %in% vals, ]
It is my understanding I want to use apply() but I am not sure how.
You do not need to use which with %in%; it already returns logical values. How about the below:
keepies <- (df$G1 %in% vals) & (df$G9 %in% vals)
newdf <- df[keepies, ]
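If you want this to scale to many variables without repeating yourself, you can build the logical index programmatically; a minimal sketch, assuming vals holds the values to keep:
vars <- c("G1", "G9") # add as many variable names as needed
keepies <- Reduce(`&`, lapply(df[vars], function(col) col %in% vals))
newdf <- df[keepies, ]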
Use data.table
First, melt your data
library(data.table)
DT <- melt(as.data.table(df)) # stacks all columns into variable/value pairs
Then split into lists
DTLists <- split(DT, DT$variable) # one list element per original column
Now you can operate on the list elements using lapply:
DTresult <- lapply(DTLists, function(x) {
  ...
})
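For instance, to average the value column within each list element (value is the column name melt creates; a hypothetical stand-in for your real analysis):
DTresult <- lapply(DTLists, function(x) x[, mean(value)])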
I have a dataframe consisting of a series of paired columns. Here is a small example.
df1 <- as.data.frame(matrix(sample(0:1000, 36*10, replace=TRUE), ncol=1))
df2 <- as.data.frame(rep(1:12, each=30))
df3 <- as.data.frame(matrix(sample(0:500, 36*10, replace=TRUE), ncol=1))
df4 <- as.data.frame(c(rep(5:12, each=30),rep(1:4, each=30)))
df5 <- as.data.frame(matrix(sample(0:200, 36*10, replace=TRUE), ncol=1))
df6 <- as.data.frame(c(rep(8:12, each=30),rep(1:7, each=30)))
Example <- cbind(df1,df2,df3,df4,df5,df6)
What I would like to do is find an average value for the odd-numbered columns (df1, df3, df5) based on the values in the adjacent column, so in the example I would have three sets of averages, each covering the values 1 to 12. I have managed to apply a function to a specific pair of columns...
Example_two <- cbind(df1,df2)
colnames (Example_two) <- c("x","y")
tapply(Example_two$x, Example_two$y, mean)
However, the dataframe I will be looking at will be considerably larger, so some form of apply function would be ideal to perform this iteratively across each paired set. I have found a similar problem, "Is there a R function that applies a function to each pair of columns?", but I can't seem to apply this to my own dataset.
Any help would be much appreciated, thank you in advance.
Try
mapply(function(x, y) tapply(x, y, FUN = mean),
       Example[seq(1, ncol(Example), 2)], Example[seq(2, ncol(Example), 2)])
Or, instead of seq(1, ncol(Example), 2), just use the recycled logical index c(TRUE, FALSE), and c(FALSE, TRUE) for the second case.
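Spelled out, that logical-index variant would look like this (a sketch with the same example data):
mapply(function(x, y) tapply(x, y, FUN = mean),
       Example[c(TRUE, FALSE)], Example[c(FALSE, TRUE)])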