I have a data frame with millions of rows and tens of columns to which I need to apply a rowwise operation. My solution below works using dplyr, but I hope a switch to data.table will speed things up. Any help converting the code below to a data.table version would be appreciated.
library(tidyverse)
library(trend)
df = structure(list(id = 1:2, var = c(3L, 9L), col1_x = c("[(1,2,3)]",
"[(100,90,80,70,60,50,40,30,20)]"), col2_x = c("[(2,4,6)]", "[(100,50,25,12,6,3,1,1,1)]"
)), class = "data.frame", row.names = c(NA, -2L))
df = df %>%
mutate(across(ends_with("x"),~ gsub("[][()]", "", .)))
x_cols = df %>%
select(ends_with("x")) %>%
names()
df = df %>%
rowwise() %>%
mutate(across(all_of(x_cols), ~ ifelse(var <= 4, 0, sens.slope(as.numeric(unlist(strsplit(., ','))))$estimates[[1]]))) %>%
ungroup()
While what @Ritchie Sacramento wrote is absolutely true, here's the information you asked for.
First, I want to start with set and :=. When you see the keyword set (which can just be part of a function name) or the := operator, you've told data.table not to make copies of the data. Without an assignment (that pesky = or <-), you've changed the data.table in place. This is one of the key ways this package avoids wasting memory.
Keep in mind that the environment pane in RStudio only updates when it registers an assignment operator (= or <-) creating something new. Since a replace-in-place uses neither, the pane may show stale information. You can click the refresh icon (top right of the pane), or print the object to the console to check. As soon as you declare anything the pane does register, everything in it is updated.
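A quick way to convince yourself that := really works in place is base R's tracemem(), which prints a message whenever an object gets copied. A minimal sketch with a toy table:
library(data.table)
dt <- data.table(a = 1:3)
tracemem(dt)            # ask R to report any copy of dt
dt[, b := a * 2]        # in place: tracemem stays silent, and no <- was needed
untracemem(dt)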
To change a data frame to a data.table (notice that keyword: set!), both of the lines below do the same thing. However, the second copies everything in memory and makes it again; naming the result the same thing does not prevent the copy.
setDT(df)
df <- data.table(df)
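You can verify that setDT() converts without copying by using data.table's address() helper, which returns an object's memory location. A small sketch:
library(data.table)
df <- data.frame(x = 1:3)
before <- address(df)
setDT(df)                          # converted in place
identical(before, address(df))
# [1] TRUE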
I'm not going to start with your first code block; I'm starting with the name extraction.
You wrote:
x_cols = df %>%
select(ends_with("x")) %>%
names()
# [1] "col1_x" "col2_x"
There are many ways to get this information. This is what I did. Note that this doesn't really have anything to do with data.table. I just used base R here. You could use a data frame the same way.
xcols <- names(df)[endsWith(names(df), 'x')]
# [1] "col1_x" "col2_x"
I'm going to use this object, xcols in the remaining examples. (Why keep reiterating the same declaration?)
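For reference, grep() with value = TRUE is an equivalent base R spelling:
grep("x$", names(df), value = TRUE)
# [1] "col1_x" "col2_x"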
You wrote the following to remove the brackets and parentheses.
df = df %>%
mutate(across(ends_with("x"),~ gsub("[][()]", "", .)))
# id var col1_x col2_x
# 1 1 3 1,2,3 2,4,6
# 2 2 9 100,90,80,70,60,50,40,30,20 100,50,25,12,6,3,1,1,1
There are several ways you could do this, whether in a data frame or a data.table. Here are a couple of methods you can use with data.table. These do the exact same thing as each other and your code.
Note the :=, which means the table is modified in place.
In the first example, I used .SD and .SDcols. These are data.table's column-selection tools. You use .SD in place of a column name when you want to work with more than one column, then use .SDcols to tell data.table which columns you mean. Wrapping the left-hand side in parentheses, (xcols), where xcols is my variable holding the column names, tells data.table to write the results back into those existing columns.
The difference between these two is just how I used lapply, which doesn't have anything to do with data.table. If you need more info on this function, you can ask me, or you can look through the many Q&As out there already (there's also a minimal demo after the two examples below).
df[,
(xcols) := lapply(.SD, function(k) gsub("[][()]", "", k)),
.SDcols = xcols]
df[,
(xcols) := lapply(.SD, gsub, pattern = "[][()]",
replacement = ""),
.SDcols = xcols]
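If lapply is unfamiliar, here is a minimal demo on a plain list, independent of data.table. A data.table's columns form a list, which is exactly why lapply(.SD, ...) works:
# lapply applies a function to each element of a list and returns a list
lapply(list(a = "(1,2)", b = "(3,4)"), gsub, pattern = "[][()]", replacement = "")
# $a
# [1] "1,2"
#
# $b
# [1] "3,4"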
Your last request was based on this code.
df %>%
rowwise() %>%
mutate(across(all_of(x_cols),
~ifelse(var <= 4, 0, sens.slope(
as.numeric(unlist(
strsplit(., ','))))$estimates[[1]]))) %>%
ungroup()
Since you used var to delineate when to apply this, I've used the by argument (as in dplyr's group_by). In terms of the other requirements, you'll see .SD and lapply again.
df[,
(xcols) := lapply(.SD,
function(k) {
ifelse(var <= 4, 0,
sens.slope(as.numeric(strsplit(k, ",")[[1]])
)$estimates[[1]])
}), by = var, .SDcols = xcols]
If you think about how these differ, you may find that in a lot of ways they aren't all that different. For example, this last translation has a close dplyr analogue:
df %>% group_by(var) %>%
mutate(across(all_of(x_cols),
~ifelse(var <= 4, 0, sens.slope(
as.numeric(unlist(
strsplit(., ','))))$estimates[[1]])))
I hope there's an easier way to do this, but I can't figure it out. I'm calculating trip lengths and have defined started and ended vectors for the calculation. I want to add a column that holds the result.
I'm using tidyverse and lubridate.
started <- cyc_trip_data$started_at
ended <- cyc_trip_data$ended_at
trip_duration <-
duration(as.double(ended - started))
While this gets me the data I want, when I add it to the data frame with bind_cols(), the new column is named "...n", where n is the column number.
cyc_trip_data <- cyc_trip_data %>%
bind_cols(trip_duration)
So far I have used rename() and relocate() to move the column and rename it. I hope someone might have an idea for a more elegant solution so I can save a step.
I want the new column named after the variable I define, e.g. "trip_duration" above. I know .name_repair is an argument that can be passed to bind_cols(), but I don't quite understand whether it even applies here; when I tried using it, I still ended up with "...n".
cyc_trip_data <- cyc_trip_data %>%
rename(trip_duration = ...12) %>%
relocate(trip_duration, .after = ended_at)
Both
cyc_trip_data <- cyc_trip_data %>%
bind_cols(trip_duration = trip_duration)
AND
cyc_trip_data <- cyc_trip_data %>%
mutate(trip_dur = duration(as.double(ended - started)))
did the trick. Exactly what I was looking for. Thank you!
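For anyone reproducing this, here is a self-contained sketch with made-up timestamps (the started_at/ended_at column names come from the question). Using as.duration() on the difftime sidesteps the unit ambiguity of as.double(), which yields seconds, minutes, or hours depending on the size of the gap:
library(dplyr)
library(lubridate)
# toy stand-in for cyc_trip_data
cyc_trip_data <- tibble(
  started_at = ymd_hms(c("2024-01-01 08:00:00", "2024-01-01 09:15:00")),
  ended_at   = ymd_hms(c("2024-01-01 08:25:00", "2024-01-01 10:05:00"))
)
cyc_trip_data %>%
  mutate(trip_duration = as.duration(ended_at - started_at)) %>%
  relocate(trip_duration, .after = ended_at)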
I have a large data frame (6 million rows, 20 columns) where data in one column corresponds to data in another column. I created a key that I now want to use to fix rows that have the wrong value. As a small example:
key = data.frame(animal = c('dog', 'cat', 'bird'),
sound = c('bark', 'meow', 'chirp'))
The data frame looks like this (minus the other columns of data):
df = data.frame(id = c(1, 2, 3, 4),
animal = c('dog', 'cat', 'bird', 'cat'),
sound = c('meow', 'bark', 'chirp', 'chirp'))
I swear I have done this before but can't remember my solution. Any ideas?
Using dplyr. If you want to fix sound according to animal,
library(dplyr)
df <- df %>%
mutate(sound = sapply(animal, function(x){key %>% filter(animal==x) %>% pull(sound)}))
should do the trick. If you want to fix animal according to sound:
df <- df %>%
mutate(animal = sapply(sound, function(x){key %>% filter(sound==x) %>% pull(animal)}))
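With 6 million rows, filtering the key once per row will be slow. A named-vector lookup does the same correction in one vectorized step; a sketch, assuming every animal in df appears in key:
lookup <- setNames(key$sound, key$animal)
df <- df %>% mutate(sound = unname(lookup[animal]))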
I'm not sure about relative efficiency, but it's simpler to replace the partially incorrect column completely. It may not even cost you very much time (since you have to look up values anyway to determine that an animal/sound pair is mismatched).
library(tidyverse)
df %>% select(-sound) %>% full_join(key, by = "animal")
For 6 million rows, you may be better off using data.table. Converting df and key to data tables (with as.data.table()) takes some up-front computational time but may speed up subsequent operations. You can use tidyverse verbs on data.table objects without any further modifications, but native data.table operations might be faster:
library(data.table)
dft <- as.data.table(df)
k <- as.data.table(key)
merge(dft[,-"sound"], k, by = "animal")
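A native data.table alternative to merge() here is an update join, which overwrites sound in place instead of building a new table (a sketch, assuming every animal in dft has a match in k):
dft[k, sound := i.sound, on = "animal"]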
I haven't bothered to do any benchmarking (one would need much larger examples to be able to measure any differences).
I'm doing a prediction with a classification tree from the "rpart" library. When I run predict(), I get a table of probabilities for each value/category the test data can take, and I want to extract the value/category with the highest probability. For example (once predict is done), the table I get is:
Table1
And I want to have this table:
Table2
Thanks in advance. I've tried a few things but haven't achieved much since I'm pretty new to R. Cheers!
One way to achieve your desired output could be:
identify your values in the vector pattern,
mutate across the relevant columns and use str_detect to check whether the values occur in each column; if true, use cur_column() to place the column name in the new column,
then do some tricks with .names and unite(),
and finally select().
library(dplyr)
library(tidyr)
library(stringr)
pattern <- c("0.85|0.5|0.6|0.8")
df %>%
mutate(across(starts_with("cat"), ~case_when(str_detect(., pattern) ~ cur_column()), .names = 'new_{col}')) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ') %>%
select(index, pred_category = New_Col)
index pred_category
<dbl> <chr>
1 1 cat2
2 2 cat1
3 3 cat3
4 4 cat3
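For completeness: if the table is the probability matrix that predict() returns for an rpart classification tree, base R's max.col() gets you there without reshaping. A sketch, where fit and test_data are hypothetical names for your model and test set:
probs <- predict(fit, newdata = test_data)   # matrix of class probabilities
data.frame(index = seq_len(nrow(probs)),
           pred_category = colnames(probs)[max.col(probs, ties.method = "first")])
# or let rpart pick the class directly:
predict(fit, newdata = test_data, type = "class")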
You didn't post your data so I just put it in a .csv and accessed it from my R folder on my C: drive.
There might be an easier way to do it, but this is the method I use when I have multiple different types (by column or row) I'd like to sort by. If you're new to R and don't have data.table or dplyr installed yet, you'll need to run the install commands further below in the console.
I left the values in but that can be fixed with the last line if you don't want them.
setwd("C:/R")
library(data.table)
library(dplyr)
Table <- read.csv("Table1.csv", check.names = FALSE, fileEncoding = 'UTF-8-BOM')
#Making the data long form makes it much easier to sort as your data gets more complex.
LongForm <- melt(setDT(Table), id.vars = c("index"), variable.name = "Category")
Table1 <- as.data.table(LongForm)  # melt(setDT(...)) already returns a data.table; kept for clarity
#This gets you what you want.
highest <- Table1 %>% group_by(index) %>% top_n(1, value)
#Then just sort it how you wanted it to look
Table2 <- highest[order(highest$index, decreasing = FALSE), ]
View(Table2)
If you don't have the right packages
install.packages("data.table")
and
install.packages("dplyr")
To get rid of the values column:
Table3 <- Table2[,1:2]
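A side note: in current dplyr, slice_max() supersedes top_n(), so the grouping step above can also be written as:
highest <- Table1 %>% group_by(index) %>% slice_max(value, n = 1)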
Consider the following dataframe:
df <- data.frame(replicate(5,sample(1:10, 10, rep=TRUE)))
If I want to divide each row by its sum (to make a probability distribution), I need to do something like this:
df %>% mutate(rs = rowSums(.)) %>% mutate_each(funs(. / rs), -rs) %>% select(-rs)
This really feels inefficient:
Create an rs column
Divide each value by its corresponding row sum from rowSums()
Remove the temporarily created column to clean up the original dataframe.
When working with existing columns, it feels much more natural:
df %>% summarise_each(funs(weighted.mean(., X1)), -X1)
Using dplyr, would there be a better way to work with temporary columns (created on the fly) than having to add and remove them after processing?
I'm also interested in how data.table would handle such a task.
As I mentioned in a comment above, I don't think it makes sense to keep that data in either a data.frame or a data.table, but if you must, the following will do it without converting to a matrix, and it illustrates how to create a temporary variable in the data.table j-expression:
dt = as.data.table(df)
dt[, names(dt) := {
  sums = Reduce(`+`, .SD)   # temporary variable, local to the j-expression
  lapply(.SD, '/', sums)    # divide every column by the row sums
}]
Why not consider base R as well:
as.data.frame(as.matrix(df)/rowSums(df))
Or just with your data.frame:
df/rowSums(df)
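Base R's sweep() spells out the same row-wise division explicitly, which some find more readable:
sweep(df, 1, rowSums(df), "/")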