order function only partially reordering dataframe - r

I have created a data frame using rbind() to append two data frames with the same column names. I am then trying to use the order() function to sort the rows alphabetically by a factor column. However, the result is still treated as two separate blocks: the rows from the first data frame are sorted alphabetically, followed by the rows from the second, sorted separately.
Example:
df1 <- data.frame(site=c("A", "F", "C"))
df2 <- data.frame(site=c("B", "G", "D"))
new.df <- rbind(df1, df2)
new.df <- new.df[order(new.df$site),]
outcome:
site
A
C
F
B
D
G
I have looked at other methods of reordering data, for example using the arrange function from package dplyr, but have not had any success. Any suggestions of how to fix this?
Any help much appreciated.
Thanks

Avoid creating factors by setting stringsAsFactors = FALSE (this is the default since R 4.0.0):
df1 <- data.frame(site=c("A", "F", "C"), stringsAsFactors = FALSE)
df2 <- data.frame(site=c("B", "G", "D"), stringsAsFactors = FALSE)
Then the rest of your code will sort as expected.

I'm guessing you're not doing quite what you think you're doing there: the resulting new.df isn't a data frame any more, it's a factor, because indexing a single-column data frame with [rows, ] drops it to a vector. The result of order is to put it in the order of the levels of the factor (see levels(new.df$site)). So, if you really want to do it this way (i.e., keeping it as a factor rather than a character vector), you will need to reorder the levels first.
new.df$site <- factor(new.df$site, levels = sort(levels(new.df$site)))
new.df[order(new.df$site), ]
[1] A B C D F G
Levels: A B C D F G
But unless you really need it to be a factor from the start, I think you would be best advised to do what @Uwe Block suggests and, if necessary, turn it into a factor after you've used rbind and done the sorting.
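A minimal sketch of why this happens and one way around it, assuming the older behaviour where strings become factors (forced here with stringsAsFactors = TRUE). The name sorted is only for illustration.

```r
# Reproduce the question's setup with factor columns
df1 <- data.frame(site = c("A", "F", "C"), stringsAsFactors = TRUE)
df2 <- data.frame(site = c("B", "G", "D"), stringsAsFactors = TRUE)
new.df <- rbind(df1, df2)

# rbind keeps the levels of df1 first and appends the new levels from df2,
# which is why order() appears to sort the two halves separately
levels(new.df$site)

# Convert to character before sorting; drop = FALSE keeps the result a
# data frame even though there is only one column
new.df$site <- as.character(new.df$site)
sorted <- new.df[order(new.df$site), , drop = FALSE]
sorted$site
```

If a factor is needed afterwards, factor(sorted$site) will create one with alphabetically sorted levels.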

Related

How to replace several variables with several variables from another dataframe in R using a loop?

I would like to replace multiple variables with variables from a second dataframe in R.
df1$var1 <- df2$var1
df1$var2 <- df2$var2
# and so on ...
As you can see the variable names are the same in both dataframes, however, numeric values are slightly different whereas the correct version is in df2 but needs to be in df1. I need to do this for many, many variables in a complex data set and wonder whether someone could help with a more efficient way to code this (possibly without using column references).
Here some example data:
# dataframe 1
var1 <- c(1:10)
var2 <- c(1:10)
df1 <- data.frame(var1,var2)
# dataframe 2
var1 <- c(11:20)
var2 <- c(11:20)
df2 <- data.frame(var1,var2)
# assigning correct values
df1$var1 <- df2$var1
df1$var2 <- df2$var2
As Parfait has said, the current post seems a bit too simplified to give any immediate help but I will try and summarize what you may need for something like this to work.
If the assumption is that df1 and df2 have the same number of rows AND that their rows are already in matching order, then you can achieve this really easily with the following subset notation:
df1[, c({column names df1})] <- df2[, c({column names df2}), drop = FALSE]
Let's say that df1 has columns a, b, and c and you want to replace b and c with two columns of df2, whose columns are x, y, z.
df1[, c("b","c")] <- df2[, c("y", "z"), drop = FALSE]
Here we are replacing b with y and c with z. The drop = FALSE argument on the right-hand side is just added protection: it ensures the subset of df2 stays a data.frame instead of collapsing to a vector when only one column is selected. It is not used on the left-hand side, because the data frame assignment method does not accept a drop argument.
If you do NOT know the order is correct or one data frame may have a differing size than the other BUT there is a unique identifier between the two data.frames - then I would personally use a function that is designed for merging two data frames. Depending on your preference you can use merge from base or use *_join functions from the dplyr package (my preference).
library(dplyr)
#assuming a and x are unique identifiers that can be matched.
new_df <- left_join(df1, df2, by = c("a"="x"))
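Since the question says the variable names are identical in both data frames, the columns to overwrite can also be worked out from the names instead of being typed by hand. The following is a rough sketch in base R, assuming a hypothetical key column called id and a helper data frame df3 invented for the example. The first assignment relies on the rows already being aligned, while the second uses match() to align rows by the key first.

```r
# Example data: df1 holds the outdated values, df2 the corrected ones
df1 <- data.frame(id = 1:5, var1 = 1:5, var2 = 1:5, keep = letters[1:5])
df2 <- data.frame(id = 1:5, var1 = 11:15, var2 = 21:25)

# Columns present in both data frames, excluding the (assumed) key column
cols <- setdiff(intersect(names(df1), names(df2)), "id")

# Case 1: rows are already in the same order
df1[cols] <- df2[cols]

# Case 2: rows may be in a different order, so align them by the key
df3 <- df2[c(3, 1, 5, 2, 4), ]      # same data, shuffled rows
idx <- match(df1$id, df3$id)        # position in df3 of each df1 id
df1[cols] <- df3[idx, cols]
```

Columns that exist only in df1 (keep in this example) are left untouched in both cases.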

R: Reprint column headers when using rbind

Is there a way to reprint column headers directly below the last row of the first data set (directly above the second data set) when using rbind to put two data sets together? I have searched and searched but haven't seen any examples like this. Thanks!
I generated some example data (it is easier if you provide this yourself when asking the question). Basically, you take the column names of the second data frame and convert them to a one-row data frame. You also need the setNames function to give each data frame that you want to rbind the same column names as the first data frame.
df1 <- data.frame(one=c("a", "b"), two=c("c", "d"))
df1
# one two
#1 a c
#2 b d
df2 <- data.frame(three=c("e", "f"), four=c("g", "g"))
df2
# three four
#1 e g
#2 f g
rbind(df1,
setNames(as.data.frame(t(colnames(df2))), names(df1)),
setNames(df2, names(df1)))
# one two
#1 a c
#2 b d
#3 three four
#4 e g
#5 f g
A less sophisticated approach that will also work for you:
Import both data frames with header = F, so that the headers are read in as ordinary rows.
After that, use
library(dplyr)
final <- bind_rows(df1,df2) ## this binds both data frames
names(final) <- as.character(unlist(final[1,])) ## this takes the 1st row as the column names (header)
final <- final[-1,] ## this removes the 1st row, which is no longer needed
This method will help you do your work.
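If the goal is only to display the second header in printed or exported text, and not to store it as a data row, one possible alternative is to capture the printed form of each data frame and concatenate the lines. This is a sketch under that assumption; the name out is invented for the example.

```r
df1 <- data.frame(one = c("a", "b"), two = c("c", "d"))
df2 <- data.frame(three = c("e", "f"), four = c("g", "g"))

# capture.output() returns the printed lines, header included, as a
# character vector, so the original column types are left untouched
out <- c(capture.output(print(df1)), capture.output(print(df2)))
writeLines(out)
```

Unlike the rbind approaches above, this leaves both data frames unchanged, so numeric columns are not converted to text by the inserted header row.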

How to assign the output of a sapply loop to the original columns in a data frame without losing other columns

I have a data frame with several columns containing string answers from different assessors, who used upper and lower case inconsistently. I want to convert everything to lower case. I have code that works as follows:
# Creating a reproducible data frame similar to what I am working with
dfrm <- data.frame(a = sample(names(islands))[1:20],
b = sample(unname(islands))[1:20],
c = sample(names(islands))[1:20],
d = sample(unname(islands))[1:20],
e = sample(names(islands))[1:20],
f = sample(unname(islands))[1:20],
g = sample(names(islands))[1:20],
h = sample(unname(islands))[1:20])
# This is how I did it originally by writing everything explicitly:
dfrm1 <- dfrm
dfrm1$a <- tolower(dfrm1$a)
dfrm1$c <- tolower(dfrm1$c)
dfrm1$e <- tolower(dfrm1$e)
dfrm1$g <- tolower(dfrm1$g)
head(dfrm1) #Works as intended
The problem is that as the number of assessors increases, I keep making copy-paste errors. I tried to simplify my code by writing a function for tolower and using sapply to loop over the columns, but the final data frame does not look like what I wanted:
# function and sapply:
dfrm2 <- dfrm
my_list <- c("a", "c", "e", "g")
my_low <- function(x){dfrm2[,x] <- tolower(dfrm2[,x])}
sapply(my_list, my_low) #Didn't work
# Alternative approach:
dfrm2 <- as.data.frame(sapply(my_list, my_low))
head(dfrm2) #Lost the numbers
What am I missing?
I know this must be a very basic concept that I'm not getting. There was this question and answer that I simply couldn't follow, and this one where my non-working solution simply seems to work. Any help appreciated, thanks!
Maybe you want to create a logical vector that selects the columns to change and run an apply function only over those columns.
# only choose non-numeric columns
changeCols <- !sapply(dfrm, is.numeric)
# change values of selected columns to lower case
dfrm[changeCols] <- lapply(dfrm[changeCols], tolower)
If you have other types of columns, say logical, you could also be more explicit about the types of columns that you want to change. For example, to select only factor and character columns, use:
changeCols <- sapply(dfrm, function(x) is.factor(x) | is.character(x))
For your first attempt, if you want the assignments to your data frame dfrm2 to stick, use the <<- assignment operator:
my_low <- function(x){ dfrm2[,x] <<- tolower(dfrm2[,x]) }
sapply(my_list, my_low)
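When the columns to change are already known by name, as with my_list in the question, the same lapply idea can be applied to just those columns without resorting to <<-. This is a minimal sketch on a small made-up data frame rather than on the islands data.

```r
# Small stand-in for the question's data: text columns a and c, numeric b
dfrm2 <- data.frame(a = c("Foo", "BAR"),
                    b = c(1.5, 2.5),
                    c = c("Baz", "QUX"),
                    stringsAsFactors = FALSE)
my_list <- c("a", "c")

# Assigning into dfrm2[my_list] replaces only the named columns;
# every other column keeps its values and its position
dfrm2[my_list] <- lapply(dfrm2[my_list], tolower)
dfrm2
```

The original sapply attempt lost the other columns because the function only modified a local copy of dfrm2 and the returned matrix contained just the lower-cased columns.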

Calculating the means of many factor columns in a data frame

I have a data frame with factor columns. Here is a tiny example:
dat <- data.frame(one = factor(c("a", "b")), two = factor(c("c", "d")))
I can calculate the means of the numeric values that underlie the factor labels for each column:
mean(as.integer(dat$one))
[1] 1.5
But since there are very many columns in my data frame, I would like to avoid having to calculate all the individual means and would rather do something like:
colMeans(dat)
which doesn't work, since the columns are factors, or
colMeans(as.integer(dat))
which doesn't work either.
So how can I easily calculate the means of all factor columns, without a loop or individually calculating them all?
Do I really have to change the class of all columns?
The data.matrix function is pretty much designed for such a task. It leaves numeric and integer columns as they are, if present, and hence reduces memory usage, though the conversion to a matrix can sometimes be an overhead. So as long as you don't have character columns, this should be pretty straightforward:
colMeans(data.matrix(dat))
# one two
# 1.5 1.5
We can use lapply
lapply(dat, function(x) mean(as.integer(x)))
Or with dplyr
library(dplyr)
dat %>%
summarise(across(everything(), ~ mean(as.integer(.x)))) # across() replaces the deprecated summarise_each()/funs()
For big datasets, it may be better to calculate the mean by each column separately as converting to matrix may also create memory issues.
Write a simple function that uses a for loop to write all of the column means into a vector.
dat <- data.frame(one = c(1:10), two = c(1:10))
# named col_means so that it does not mask base::colMeans
col_means <- function(tablename){
  colmean <- numeric(ncol(tablename))
  for(i in seq_len(ncol(tablename))){
    # as.integer() also turns factor columns into their underlying codes
    colmean[i] <- mean(as.integer(tablename[,i]))
  }
  return(colmean)
}
col_means(dat)
Hope this works
You can also use the data.table package, which is usually faster than data.frame. If your data is big, e.g. millions of observations, then data.table can help to reduce run time.
Below is the code:
library(data.table)
dat <- data.table(one = factor(c("a", "b")), two = factor(c("c", "d")))
factorCols <- c("one", "two")
dat[, lapply(.SD, FUN=function(x) mean(as.integer(x))), .SDcols=factorCols]
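For a data frame that mixes factor and non-factor columns, the lapply idea can be restricted to the factor columns and made to return a named numeric vector. The sketch below uses vapply for that. The extra column n and the names is_fac and fac_means are made up for the example.

```r
dat <- data.frame(one = factor(c("a", "b")),
                  two = factor(c("c", "d")),
                  n   = c(10, 20))

# Logical vector marking which columns are factors
is_fac <- vapply(dat, is.factor, logical(1))

# Mean of the underlying integer codes, for factor columns only;
# vapply checks that each result is a single number
fac_means <- vapply(dat[is_fac], function(x) mean(as.integer(x)), numeric(1))
fac_means
```

This avoids converting the whole data frame to a matrix and leaves the classes of the original columns unchanged.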

R - show only levels used in a subset of data frame

I have a rather large data frame with a factor that has a lot of levels (more than 4,000). I have another column in the same data frame that I'm using as a reference, and what I'd like to find is a subset of the levels whenever this reference column is NA.
The first step I'm using is subsetrows <- which(is.na(mydata$reference)) but after that I'm stuck. I want something like levels(mydata[subsetrows,mydata$factor]) but unfortunately, this command shows me all the levels and not just the ones existing in subsetrows. I suppose I could create a new vector outside of my data frame of only my subset rows and then drop any unused levels, but is there any easier/cleaner way to do this, possibly without copying my data outside the data frame?
As an example of what I want returned, if my data frame has factor levels from A to Z, but in my subset only P, R and Y appear, I want something that returns the levels P, R and Y.
You can certainly accomplish this with base functions. But my personal preference is to use dplyr with chained operations such as this:
library(dplyr)
d %>%
filter(is.na(ref)) %>%
select(field) %>%
distinct()
data
d <- data.frame(
field = c("A", "B", "C", "A", "B", "C"),
ref = c(NA, "a", "b", NA, "c", NA)
)
I modified a suggestion in the comments by Marat to use the function unique that seems to return the correct levels.
Solution:
subsetrows <- which(is.na(mydata$reference))
unique(as.character(mydata$factor[subsetrows]))
While I like learning new packages and functions, this solution seems better at this point since it's more compact and easier for me to understand if I need to revisit this code at some distant point in the future.
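If the result should stay a factor, or the remaining levels should be listed in level order and not in order of appearance, base R's droplevels() is another option. A small sketch with made-up data that mirrors the P, R, Y example from the question:

```r
# Factor with levels A to Z, of which only a few occur in the data
mydata <- data.frame(
  factor    = factor(c("P", "R", "Y", "A", "B", "P"), levels = LETTERS),
  reference = c(NA, NA, NA, "x", "y", NA)
)

# Subset the factor where reference is NA, then drop the unused levels
used <- levels(droplevels(mydata$factor[is.na(mydata$reference)]))
used
```

Only the subsetted vector is modified here; the factor column inside mydata keeps all of its levels.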
