I have a small problem with R.
I have merged two datasets and I need to compute simple ratios between their columns. The datasets are not that small (18 columns each), so I would like to avoid doing it by brute force.
To give an example:
df <- data.frame(a1= sample(1:100, 10), b1 = sample(1:100, 10), a2= sample(1:100, 10), b2 = sample(1:100,10))
Each ratio is simply one column divided by another, so in the example it would be c1 = a1/b1 and c2 = a2/b2. This could be implemented as:
mutate(df, c1=a1/b1, c2=a2/b2)
My question is whether there is a way to automate this and instruct R to perform the mutate without manually writing all the formulas, so that it computes c1, c2, c3, ..., c18.
I've tried setting up a for loop that subsets the columns, but I can't seem to make it work within the tidyverse.
Thank you in advance
One simple base R way would be to do something like:
for (i in 1:2) {
  df[paste0("c", i)] <- df[paste0("a", i)] / df[paste0("b", i)]
}
But it's dependent on what pattern your actual variable names have.
Another way using tidyverse tools (but there's probably a more elegant way of doing this):
library(tidyverse)
library(glue)
map_dfc(1:2, function(x) {
  transmute(df, "c{x}" := .data[[glue("a{x}")]] / .data[[glue("b{x}")]])
}) %>%
  bind_cols(df) %>%
  relocate(starts_with("c"), .after = last_col())
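If you'd rather not hard-code how many pairs there are, one possible variation (a sketch, assuming the columns really do follow the a1/b1, a2/b2, ... naming pattern; the suffixes, ratios, and df_out names are just illustrative) discovers the pairs from the column names and binds the results back on:
library(dplyr)
library(purrr)
library(glue)
# Discover the pair suffixes from the column names ("a1", "a2", ... -> "1", "2", ...)
suffixes <- sub("^a", "", grep("^a\\d+$", names(df), value = TRUE))
# Compute one ratio per suffix and name each result "c<suffix>"
ratios <- map(set_names(suffixes, glue("c{suffixes}")),
              function(i) df[[glue("a{i}")]] / df[[glue("b{i}")]])
df_out <- bind_cols(df, ratios)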
Related
This may well have an answer elsewhere but I'm having trouble formulating the words of the question to find what I need.
I have two dataframes, A and B, with A having many more rows than B. I want to look up a value from B based on a column of A, and add it to another column of A. Something like:
A$ColumnToAdd + B[ColumnToMatch == A$ColumnToMatch,]$ColumnToAdd
But I get, with a load of NAs:
Warning in `==.default`: longer object length is not a multiple of shorter object length
I could do it with a messy for loop, but I'm looking for something faster and more elegant.
Thanks
If I understood your question correctly, you're looking for a merge or a join, as suggested in the comments.
Here's a simple example for both using dummy data that should fit what you described.
library(tidyverse)
# Some dummy data
ColumnToAdd <- c(1,1,1,1,1,1,1,1)
ColumnToMatch <- c('a','b','b','b','c','a','c','d')
A <- data.frame(ColumnToAdd, ColumnToMatch)
ColumnToAdd <- c(1,2,3,4)
ColumnToMatch <- c('a','b','c','d')
B <- data.frame(ColumnToAdd, ColumnToMatch)
# Example using merge
A %>%
  merge(B, by = c("ColumnToMatch")) %>%
  mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
# Example using join
A %>%
  inner_join(B, by = c("ColumnToMatch")) %>%
  mutate(sum = ColumnToAdd.x + ColumnToAdd.y)
The advantages of the dplyr versions over merge are:
rows are kept in existing order
much faster
tells you what keys you're merging by (if you don't supply them)
also work with database tables.
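If the goal is simply to add the looked-up value onto A (keeping every row of A even when there is no match in B), a left_join followed by a mutate does that in one pipe. A rough sketch using the same dummy data; the suffix argument and the ".B" name are just illustrative choices:
A %>%
  left_join(B, by = "ColumnToMatch", suffix = c("", ".B")) %>%
  mutate(ColumnToAdd = ColumnToAdd + ColumnToAdd.B) %>%
  select(-ColumnToAdd.B)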
Thanks to all in advance.
I have the following data:
set.seed(123)
data <- data.frame(name = LETTERS[sample(1:26, 500, replace = TRUE)], present = sample(0:1, 500, replace = TRUE))
And I want to quickly calculate the percentage of present observations (1's) for each letter. I can do it manually, but I believe there is an easier way to do this:
library(dplyr)
A <- filter(data, name=="A" & present==1)
A2 <- filter(data, name=="A")
data$Percentage[data$name=="A"] <- nrow(A)/nrow(A2)
And so on until I arrive to "Z".
Can I do this task automatically without having to change the value of the "name" column manually?
Best regards,
We can use prop.table with table to get the proportion
prop.table(table(data), 1)[,2]
To add it as a column, we can expand it by matching against the 'name' values
data$Percentage <- prop.table(table(data), 1)[,2][as.character(data$name)]
Or, as @Lars Lau Raket suggested, we don't need to convert to character
prop.table(table(data), 1)[,2][data$name]
If we need to create a column
library(dplyr)
data %>%
  group_by(name) %>%
  mutate(Percentage = mean(present == 1))
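If a one-row-per-letter summary is all that's needed, rather than a Percentage column repeated on every row, summarise() gives that directly (a small variation on the code above):
data %>%
  group_by(name) %>%
  summarise(Percentage = mean(present == 1))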
I'm working with a dataframe (in R) that contains observations of animals in the wild (recording time/date, location, and species identification). I want to remove rows for a given species if there are fewer than x observations of that species in the whole dataframe. For now, I have managed to get it to work with the following code, but I know there must be a more elegant and efficient way to do it.
namelist <- names(table(ind.data$Species))
for (i in 1:length(namelist)) {
  if (table(ind.data$Species)[namelist[i]] <= 2) {
    while (namelist[i] %in% ind.data$Species) {
      j <- match(namelist[i], ind.data$Species)
      ind.data <- ind.data[-j, ]
    }
  }
}
The namelist vector contains all the species names in the data frame ind.data, and the if statement checks to see if the frequency of the ith name on the list is less than x (2 in this example).
I'm fully aware that this is not a very clean way to do it; I just threw it together at the end of the day yesterday to see if it would work. Now I'm looking for a better way to do it, or at least for how I could refine it.
You can do this with the dplyr package:
library(dplyr)
new.ind.data <- ind.data %>%
  group_by(Species) %>%
  filter(n() > 2) %>%
  ungroup()
An alternative using built-in functions is to use ave():
# count rows per species (seq_along keeps the result numeric even when Species is a character or factor column)
group_sizes <- ave(seq_along(ind.data$Species), ind.data$Species, FUN = length)
new.ind.data <- ind.data[group_sizes > 2, ]
We can use data.table
library(data.table)
setDT(ind.data)[, .SD[.N > 2], Species]
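For completeness, the same filter can also be written with the table() idea from the question itself; a base R sketch, assuming Species is a plain character or factor column (keep is just an illustrative name):
# species that appear more than twice
keep <- names(which(table(ind.data$Species) > 2))
new.ind.data <- ind.data[ind.data$Species %in% keep, ]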
I am currently using ddply to apply a function I have written to a data frame. The function evaluates each row based on the values in the columns and then applies a number of other functions to the data in that row. The result is a data frame with the same structure as the input data frame and an additional column with the result of the applied function for each row.
My problem is the data set is reasonably large and therefore using ddply takes a long time - too long for the purpose!
I have read a number of other SO questions and blog posts on replacements for ddply when speed matters. Most posts recommend either data.table or some combination of dplyr functions with do. While speed is the most important factor, I have never used data.table, so ease of use / intuitiveness is also important.
Similarly, while this question was very useful in explaining how to combine different dplyr functions with your own function, I also need to pass other objects to my function, which I am unsure how to do using the answer in that question.
I have created a simplified example below. My question then is how to replicate the below ddply function call with either dplyr or data table given my above points.
First, I set up some data to mimic the structure of the actual data
noObs <- 1e5
dataIn <- data.frame(One = rep(c("J", "K"), noObs/2),
                     Two = rep(c("ID", "BR", "LB", "OZ"), noObs/4),
                     Three = runif(noObs))
secondaryData <- data.frame(Two = c("ID", "BR", "LB", "OZ"), Size = c(300, 500, 250, 400))
A simplified example of my function is below (in practice, the function has more than two parameters and calls other functions within itself)
MyFunction <- function(dataIn, secondaryData) {
  groupNames <- c("BR", "LB")
  if (dataIn$One == "J") {
    if (!(dataIn$Two %in% groupNames)) {
      if (dataIn$Two == "ID") {
        idx <- match(dataIn$Two, secondaryData$Two)
        value <- secondaryData[idx, "Size"]
        dataIn$newCalc <- dataIn$Three * value
      } else {
        dataIn$newCalc <- dataIn$Three * 1000
      }
    } else {
      idx <- match(dataIn$Two, secondaryData$Two)
      value <- secondaryData[idx, "Size"]
      dataIn$newCalc <- dataIn$Three * value + 1
    }
  } else {
    idx <- match(dataIn$Two, secondaryData$Two)
    value <- secondaryData[idx, "Size"]
    dataIn$newCalc <- dataIn$Three * value
  }
  return(dataIn)
}
The ddply call looked like
dataOut <- ddply(dataIn, names(dataIn), MyFunction, secondaryData)
Finally, some examples of things I have tried (I have yet to try data.table)
dataIn %>% group_by(names(dataIn)) %>% do(MyFunction(dataIn, secondaryData))
dataIn %>% group_by(names(dataIn)) %>% MyFunction(dataIn, secondaryData)
dataIn %>% group_by(.dots = names(dataIn)) %>% MyFunction(secondaryData)
EDIT
I have found a way with dplyr that works, except it is even slower than ddply, and I can't figure out how to pass the column names to group_by programmatically. This doesn't seem right to me, as dplyr is meant to be faster.
In addition, I have been experimenting with data.table but haven't been able to get it to work. Again, I am looking for something that runs faster than ddply.
#Plyr
start <- proc.time()
dataOut <- ddply(dataIn, names(dataIn), MyFunction, secondaryData)
plyrTime <- proc.time() - start
#Dplyr
#Works
start <- proc.time()
res <- dataIn %>% group_by(One, Two, Three) %>% do(MyFunction(.,secondaryData))
dplyrTime <- proc.time() - start
#Doesn't work
res <- dataIn %>% group_by(.,names(dataIn)) %>% do(MyFunction(.,secondaryData))
#Data.table
dataInDT <- data.table(dataIn)
dataInDT[,.(MyFunction(.,secondaryData)), by=.(One, Two, Three)]
I found a solution using data.table. Notably, it performs the correct calculations for each row at a much faster speed. The format of the function is different, adapted to data.table's style. I'm sure there is an even better or more correct way to solve it with data.table, but the solution below works well.
dataInDT <- data.table(dataIn)
groupNames <- c("BR", "LB")
start <- proc.time()
dataInDT[, NewCalc := {
  if (One == "J") {
    if (!(Two %in% groupNames)) {
      if (Two == "ID") {
        Three * secondaryData[match(Two, secondaryData$Two), "Size"]
      } else {
        Three * 1000
      }
    } else {
      Three * secondaryData[match(Two, secondaryData$Two), "Size"] + 1
    }
  } else {
    Three * secondaryData[match(Two, secondaryData$Two), "Size"]
  }
}, by = .(One, Two, Three)]
datTableTime <- proc.time() - start
Comparing this to the old solution, you can see the speed is greatly improved:
start <- proc.time()
dataOut <- ddply(dataIn, names(dataIn), MyFunction, secondaryData)
plyrTime <- proc.time() - start
Of course, in practice the data.table function I used was even more intricate; in particular, the by section was much longer.
I was unable to find a solution using dplyr and am still curious to know how it would work.
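For what it's worth, one way a dplyr version might look (a sketch, not benchmarked against the data.table solution; dataOutDplyr is just an illustrative name) is to join secondaryData once and replace the nested if/else with a vectorised case_when(), which avoids grouping by every column at all:
library(dplyr)
dataOutDplyr <- dataIn %>%
  left_join(secondaryData, by = "Two") %>%
  mutate(newCalc = case_when(
    One == "J" & !(Two %in% c("BR", "LB")) & Two == "ID" ~ Three * Size,
    One == "J" & !(Two %in% c("BR", "LB"))               ~ Three * 1000,
    One == "J"                                           ~ Three * Size + 1,
    TRUE                                                 ~ Three * Size
  )) %>%
  select(-Size)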
I have a very large dataframe (265,874 x 30), with three sensible grouping variables: an age category (1-6), dates (5,479 distinct values), and geographic locality (4 total). Each record consists of a choice from each of these, plus 27 count variables. I want to group by each of the grouping variables, then take colSums over the 27 count variables within each sub-group. I've been trying to use dplyr (v0.2) to do it, because doing it manually ends up setting up a lot of redundant things (or resorting to a loop for iterating across the grouping options, for lack of an elegant solution).
Example code:
countData <- sample(0:10, 2000, replace = TRUE)
dates <- sample(seq(as.Date("2010/1/1"), as.Date("2010/01/30"), "days"), 200, replace = TRUE)
locality <- sample(1:2, 200, replace = TRUE)
ageCat <- sample(1:2, 200, replace = TRUE)
sampleDF <- data.frame(dates, locality, ageCat, matrix(countData, nrow = 200, ncol = 10))
then what I'd like to do is ...
library("dplyr")
sampleDF %.% group_by(locality, ageCat, dates) %.% do(colSums(.[, -(1:3)]))
but this doesn't quite work, as the results from colSums() aren't data frames. If I cast it, it works:
sampleDF %.% group_by(locality, ageCat, dates) %.% do(data.frame(matrix(colSums(.[, -(1:3)]), nrow = 1, ncol = 10)))
but the final do(...) bit seems very clunky.
Any thoughts on how to do this more elegantly or effectively? I guess the question comes down to: how best to use the do() function and the . operator to summarize a data frame via colSums.
Note: the do(.) operator only applies to dplyr 0.2, so you need to grab it from GitHub (link), not from CRAN.
Edit: results from suggestions
Three solutions:
My suggestion in the post: 146.765 seconds elapsed.
@joran's suggestion below: 6.902 seconds.
@eddi's suggestion in the comments, using data.table: 6.715 seconds.
I didn't bother to replicate, just used system.time() to get a rough gauge. From the looks of it, dplyr and data.table perform approximately the same on my data set, and both are significantly faster when used properly than the hack solution I came up with yesterday.
Unless I'm missing something, this seems like a job for summarise_each (a sort of analogue of plyr's colwise):
sampleDF %.% group_by(locality, ageCat, dates) %.% summarise_each(funs(sum))
The grouping columns are not included in the summarising function by default, and you can select only a subset of columns to apply the functions to, using the same technique as with select.
(summarise_each is in version 0.2 of dplyr but not in 0.1.3, as far as I know.)
The method summarise_each mentioned in joran's answer from 2014 has been deprecated.
Instead, please use summarize_all() or summarize_at().
The methods summarize_all and summarize_at mentioned in Hack-R's answer from 2018 have been superseded.
Instead, please use summarize()/summarise() combined with across().
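On a current dplyr (1.0 or later), the grouped column sums from this question can be written roughly like this; across(everything()) skips the grouping columns automatically, and .groups = "drop" just ungroups the result:
library(dplyr)
sampleDF %>%
  group_by(locality, ageCat, dates) %>%
  summarise(across(everything(), sum), .groups = "drop")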