I have a data.table with two text columns. I need to use column b to determine which letter in column a to replace with an "x".
I can do it with a for loop, as in the code below. However, my actual data set has 250,000+ rows, so the script takes ages. Is there a more efficient way to do this? I considered lapply but couldn't get my head around it.
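library(data.table)
DT <- data.table(a = c("ABCD", "ABCD", "ABCD", "ABCD"), b = c("A", "B", "C", "D"))
DT$c <- ""
# Replace the first match of b in a, one row at a time (slow for many rows)
for (i in seq_len(nrow(DT))) {
  DT[i]$c <- sub(DT[i, b], "x", DT[i, a])
}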
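Here is one approach using the tidyverse. str_replace_all is vectorised over both the string and the pattern, so no loop is needed; use str_replace instead if only the first match should be replaced.
library(tidyverse)
DT <- data.table::data.table(a = c("ABCD", "ABCD", "ABCD", "ABCD"), b = c("A", "B", "C", "D"))
DT %>%
  mutate(new_vec = str_replace_all(string = a, pattern = b, replacement = "x"))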
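If only the first occurrence should be replaced, as sub() does in the question, a vectorised data.table version might look like the sketch below. It assumes the stringi package is installed and that b should be matched literally rather than as a regular expression; the column name c is taken from the question.

```r
library(data.table)
library(stringi)

DT <- data.table(a = c("ABCD", "ABCD", "ABCD", "ABCD"),
                 b = c("A", "B", "C", "D"))

# stri_replace_first_fixed is vectorised over string, pattern and replacement,
# so row i of b is applied to row i of a without an explicit loop.
DT[, c := stri_replace_first_fixed(a, b, "x")]
print(DT)
```

Because the whole column is processed in one call, this avoids the per-row assignment that makes the original loop slow.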
I create a list of all unique combinations of two columns in a data.table.
For each unique combination in this list, I want to take samples from the data.table.
I have already written a function for this, and I know that I could use a for loop or a foreach loop.
How could the function below be used with apply or one of its variants?
Thank you very much :-)
MWE:
library(data.table)
library(dplyr)
dt <- data.table(filename = c("a", "b", "c", "c", "a"), class = c(1, 2, 1, 1, 4), var = c(1, 2, 3, 4, 5))
unique_combinations <- unique(dt[, c("filename", "class")])
# Arguments are named fname/cls so they do not clash with the column names inside dt[...]
take_samples <- function(dt, fname, cls, n) {
  dt %>%
    .[filename == fname & class == cls] %>%
    sample_n(size = n, replace = FALSE)
  #TBD: append result to other data.table
}
# How to do the following call automatically for every unique combination using apply?
# (R indexes from 1, so the first combination is element [1])
take_samples(dt, unique_combinations$filename[1], unique_combinations$class[1], 1)
I think you need to group by the two columns:
n <- 1
dt[, .SD[sample(.N, size = n, replace = TRUE)], by = .(filename, class)]
Explanation
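Grouping by .(filename, class) runs the expression once for each unique combination of the two columns.
.SD holds the subset of the data for the current group, and .N is the number of rows in that group.
Here is what the output looks like (the sampled row for filename c, class 1 can differ between runs):
   filename class var
1:        a     1   1
2:        b     2   2
3:        c     1   4
4:        a     4   5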
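The question samples with replace = FALSE, which fails when n is larger than a group. One possible variant, sketched here under the assumption that small groups should simply return all of their rows, caps the sample size at the group size:

```r
library(data.table)

dt <- data.table(filename = c("a", "b", "c", "c", "a"),
                 class    = c(1, 2, 1, 1, 4),
                 var      = c(1, 2, 3, 4, 5))

n <- 2
set.seed(42)  # only to make the example reproducible

# sample() draws without replacement by default; min(n, .N) keeps the
# request from exceeding the number of rows in the current group.
sampled <- dt[, .SD[sample(.N, size = min(n, .N))], by = .(filename, class)]
print(sampled)
```

With n = 2, the group (c, 1) contributes both of its rows and every other group contributes its single row.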
I have a column in a data frame df with values of the form 'name>year>format'. I want to split this column on > and put the pieces into new columns named name, year and format. How can I do this in R?
You can do that easily with the separate function from tidyr:
library(tidyr)
library(dplyr)
data <-
data.frame(
A = c("Joe>1993>student")
)
data %>%
separate(A, into = c("name", "year", "format"), sep = ">", remove = FALSE)
# A name year format
# Joe>1993>student Joe 1993 student
If you do not want the original column in the resulting data frame, set remove to TRUE (which is also the default).
Another option is read.table in base R:
cbind(df, read.table(text = as.character(df$column), sep=">",
header = FALSE, col.names = c("name", "year", "format")))
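A self-contained sketch of the read.table approach follows. The data frame and its column name (column) are placeholders, and the second row is made-up sample data added only for illustration.

```r
# Placeholder data; the second row is invented for the example
df <- data.frame(column = c("Joe>1993>student", "Ann>2001>teacher"),
                 stringsAsFactors = FALSE)

# read.table parses each string as one line of ">"-separated text
parts <- read.table(text = df$column, sep = ">", header = FALSE,
                    col.names = c("name", "year", "format"))

result <- cbind(df, parts)
print(result)
```

Note that read.table guesses column types, so year comes back as an integer column here rather than as text.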
If your data is large, data.table is a good choice because it is very fast.
If you know how many fields the combined column has (3 in this example):
library(data.table)
# the 1:3 should be replaced by 1:n, where n is the number of fields
dt1[, paste0("V", 1:3) := tstrsplit(y, split = ">", fixed = TRUE)]
If you DON'T know in advance how many fields the column has, the stringi package can count the separators to find the maximum number of fields:
library(data.table)
library(stringi)
maxFields <- dt2[, max(stri_count_fixed(y, ">")) + 1]
dt2[, paste0("V", 1:maxFields) := tstrsplit(y, split = ">", fixed = TRUE, fill = NA)]
Data used:
library(data.table)
dt1 <- data.table(x = c("A", "B"), y = c("letter>2018>pdf", "code>2020>Rmd"))
dt2 <- rbind(dt1, data.table(x = "C", y = "report>2019>html>pdf"))
I have a master data frame that stores, for each row, the name of another data frame (df2, df3) together with a column name and a row index. I want to use these to look up a value and populate the x column of master.
I think I need a for loop but don't know how to start, and I haven't used R for a while.
master <- data.frame(df = c("df2","df2","df3"), column =c("A","C","B"),row = c(1,2,3), x = c(1,1,1))
df2 <- data.frame(A = c(2,4,6), B = c(1,3,5),C = c(4,8,5))
df3 <- data.frame(A = c(12,14,16), B = c(11,13,15),C = c(24,28,25))
Thanks
If you want to use a for loop, the following may help:
for (k in seq_len(nrow(master))) {
  master$x[k] <- eval(parse(text = sprintf("%s$%s[%s]", master$df[k], master$column[k], master$row[k])))
}
Here sprintf builds a string such as "df2$A[1]", parse turns it into an expression, and eval evaluates it.
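The same lookup can also be done without building code from strings. The sketch below assumes the source data frames can be collected in a named list (sources is a hypothetical name, not part of the question) and indexes into that list by name.

```r
master <- data.frame(df = c("df2", "df2", "df3"),
                     column = c("A", "C", "B"),
                     row = c(1, 2, 3),
                     x = c(1, 1, 1),
                     stringsAsFactors = FALSE)
df2 <- data.frame(A = c(2, 4, 6), B = c(1, 3, 5), C = c(4, 8, 5))
df3 <- data.frame(A = c(12, 14, 16), B = c(11, 13, 15), C = c(24, 28, 25))

# Hypothetical helper: a named list mapping the names in master$df to the data
sources <- list(df2 = df2, df3 = df3)

# For each row of master, pick the data frame by name, then the cell by
# row index and column name.
master$x <- mapply(function(d, col, r) sources[[d]][r, col],
                   master$df, master$column, master$row,
                   USE.NAMES = FALSE)
print(master)
```

Looking values up in a list keeps the data frame names as plain data, which tends to be easier to debug than evaluating generated code.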
I want to substitute parts of the transform function with variable inputs.
I have created a df using subset with col1 from an existing table:
col1 = c('A','B','C')
The df looks something like this:
A = c(1, 3)
B = c(3, 1)
C = c(5, 2)
df = data.frame(A, B, C)
I now want to automate calculations which manually would look like this:
df <- transform(df, 'ABC' = (A + B + C))
where (A + B + C) refers to the columns of df. Because I have hundreds of 'col1's I can't do it by hand. I was trying to use something similar to %s (as available in Python 2.x), yet so far nothing has really worked, and I understand too little of R (is this related to eval()?) to get things working (I tried paste, as.formula, sprintf, substitute, etc.).
Using cv(col1) I'm trying to paste the output inside the transform function, yet the furthest I got was transform trying to grab values from the environment (not columns) when using as.formula.
cv = function(var){
output = paste('(', paste(var, collapse = ' + '), ')', sep = '')
return(output)
}
Would appreciate any hints or ideas!
You have maneuvered yourself into a strange corner. This is easy with R:
cols <- c("A", "B", "C")
df[, paste(cols, collapse = "")] <- rowSums(df[, cols])
#alternatively for other binary functions:
#Reduce("+", df[, cols])
# A B C ABC
#1 1 3 5 9
#2 3 1 2 6
You can get a similar effect using mutate from dplyr:
library(dplyr)
cols <- c("A", "B", "C")
df %>% mutate_(.dots = setNames(paste(cols, collapse = '+'),
'new_column_name'))
Here we tell mutate_ (note the trailing _) what to do via paste(), which yields "A+B+C", and use setNames to name the new column.
I acknowledge the syntax is somewhat convoluted; this comes from non-standard evaluation in dplyr. Note that mutate_ and the other underscore verbs have been deprecated since dplyr 0.7.0 in favor of tidy evaluation, so this form is best suited to older dplyr versions.
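With current dplyr, one way to express the same idea is tidy evaluation. The sketch below assumes the rlang package is available; new_name and sum_expr are illustrative names, and the column names in cols are assumed to be syntactically valid R names.

```r
library(dplyr)
library(rlang)

df <- data.frame(A = c(1, 3), B = c(3, 1), C = c(5, 2))
cols <- c("A", "B", "C")

new_name <- paste(cols, collapse = "")                  # "ABC"
sum_expr <- parse_expr(paste(cols, collapse = " + "))   # the expression A + B + C

# !! injects the stored name and expression into mutate();
# := is needed because the left-hand side is computed.
out <- df %>% mutate(!!new_name := !!sum_expr)
print(out)
```

For a plain row sum, the rowSums() approach shown earlier is simpler; parsing an expression is mainly useful when the operation between columns varies.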
I am trying to create a data.frame and then add some columns to it.
I tried the following code, but it does not work:
test.dim <- as.data.frame(matrix(nrow=0, ncol=4))
names <- c("A", "B", "C", "D")
colnames(test.dim) <- names
for (i in 1:4) {
  name = names[i]
  # do some calculations, and at the end get another data.frame named x.data
  mean.data <- apply(x.data, 1, mean)
  test.dim[, name] <- mean.data
}
Usually one would already have a data.frame (call it df) and simply add columns by calling df$newColName = values or df[, newColNames] = frame_of_values.
Your question indicates that you are separating the creation of your values from putting them in the data frame (which I do not recommend). But if you really want to start from an empty, zero-row frame, here are some options:
colnamesToAdd = LETTERS[1:4]
test.dim = data.frame(matrix(NA, nrow = 1, ncol = length(colnamesToAdd)))
colnames(test.dim) = colnamesToAdd
test.dim = test.dim[-1,]
Another option:
colnamesToAdd = LETTERS[1:4]
test.dim = data.frame("USELESS" = NA)
test.dim[,colnamesToAdd] = NA
test.dim = test.dim[-1,-1]
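An alternative that avoids the empty frame altogether is to collect each column in a list and build the data frame once at the end. This is only a sketch: the question does not show how x.data is produced, so a random matrix stands in for it here.

```r
col_names <- c("A", "B", "C", "D")
set.seed(1)  # only to make the example reproducible

# One list element per future column
means <- lapply(col_names, function(nm) {
  # Stand-in for the real calculation that yields x.data (an assumption)
  x.data <- matrix(runif(15), nrow = 5)
  rowMeans(x.data)  # same result as apply(x.data, 1, mean)
})

# Name the list elements and convert the list to a data frame in one step
test.dim <- as.data.frame(setNames(means, col_names))
print(test.dim)
```

The number of rows is then determined by the computed values rather than fixed in advance, which is what causes trouble when assigning into a zero-row frame.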
If you are looking to add a per-group mean to your table, repeated on every row of that group:
library(data.table)
test.dim = data.table("FACTOR" = sample(letters[1:4], 100, replace = TRUE), "VALUE" = runif(100), "MEAN" = NA)
means = test.dim[, list(AVG = mean(VALUE)), by = "FACTOR"]
# without data.table: by(test.dim$VALUE, test.dim$FACTOR, mean)
for (x in seq_len(nrow(means))) {
  # copy each group's mean onto the rows belonging to that group
  test.dim$MEAN[test.dim$FACTOR == means$FACTOR[x]] = means$AVG[x]
}
# normally I would use the foreach package instead of this last for loop
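In data.table the per-group mean can also be added in a single step with := and by, without a separate means table or a loop. A minimal sketch using the same column names:

```r
library(data.table)

set.seed(1)  # only to make the example reproducible
test.dim <- data.table(FACTOR = sample(letters[1:4], 100, replace = TRUE),
                       VALUE  = runif(100))

# := adds the column by reference; with by = FACTOR, mean(VALUE) is computed
# per group and recycled to every row of that group.
test.dim[, MEAN := mean(VALUE), by = FACTOR]
print(head(test.dim))
```

This keeps the original row order and avoids creating the MEAN column with NA first.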