I thought of simplifying my code a bit and tried to engage with the world of custom-built functions. I built the following dummy example:
library(dplyr)
##
string <- c("car", "train", 'bike', 'plain')
speed1 <- runif(4, min = 0, max = 10000)
speed2 <- runif(4, min = 0, max = 10000)
n1 <- sample(1:100, 4)
n2 <- sample(1:100, 4)
##
df <- data.frame(string, speed1, speed2, n1, n2)
df
Basically, I want to do simple operations of the following type:
df <- df%>%
dplyr::mutate(speed1_n1 = paste0(format(speed1, big.mark = ",") , '\n(' , format(n1, big.mark = ",") , ')'))
df
I can build a function:
my_fun <- function(dataf, V1, V2){
dataf <- dataf%>%
dplyr::mutate(speed1_n1 = paste0(format(V1, big.mark = ",") , '\n(' , format(V2, big.mark = ",") , ')'))}
df<-df%>%my_fun( df$speed1, df$n1)
df
However, the new variable name needs to be generated dynamically from the names of the variables passed in, similar to the question here, but within a function. I cannot manually specify mutate(speed1_n1 = ...). I need something like !!paste('speed1', 'n1', sep = "_") :=, but using the variable/column names themselves rather than quoted strings.
We can pass the column names unquoted and use {{ }} inside the function, both in the computation and in the glue-style name on the left of the := operator:
my_fun <- function(dataf, V1, V2){
dataf %>%
dplyr::mutate("{{V1}}_{{V2}}" := paste0(format({{V1}}, big.mark = ",") ,
'\n(' , format({{V2}}, big.mark = ",") , ')'))
}
Testing:
my_fun(df, speed1, n1)
string speed1 speed2 n1 n2 speed1_n1
1 car 7886.962 3218.585 37 83 7,886.962\n(37)
2 train 9534.978 5524.649 98 34 9,534.978\n(98)
3 bike 6984.790 9476.838 60 55 6,984.790\n(60)
4 plain 6543.198 2638.609 9 53 6,543.198\n( 9)
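If the column names arrive as character strings rather than unquoted symbols, a similar effect can be sketched with the .data pronoun and single-brace glue syntax. This is a minimal sketch, assuming dplyr 1.0 or later; my_fun_chr and the demo data frame are hypothetical names used only for illustration.

```r
library(dplyr)

# Hypothetical variant of my_fun that takes the column names as strings
my_fun_chr <- function(dataf, v1, v2) {
  dataf %>%
    mutate("{v1}_{v2}" := paste0(
      format(.data[[v1]], big.mark = ","),  # look the column up by its name
      "\n(",
      format(.data[[v2]], big.mark = ","),
      ")"
    ))
}

# Small made-up data frame for the demonstration
demo <- data.frame(speed1 = c(1234.5, 99), n1 = c(10L, 2000L))
out <- my_fun_chr(demo, "speed1", "n1")
names(out)
```

Here "{v1}_{v2}" is filled in from the string values of v1 and v2, whereas "{{V1}}_{{V2}}" in the answer above takes the names of the unquoted arguments.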
I have a df with multiple vars containing dates.
Among these vars some report multiple dates separated by formatting symbols.
For each cell in each of the relevant vars, I would like to split the string, convert the pieces to dates, and pick the latest date.
DATA
data <- data.frame(ex=c(1,2),date_1 = c("30/12/1997\n22/12/1998","15/12/1993"), date_2 = c("21/03/1997\n11/04/1996\n11/04/1996\n11/04/1996\n11/04/1996",NA))
expected <- data.frame(ex=c(1,2),date_1 = c("1998-12-22","1993-12-15"), date_2 = c("1997-03-21",NA))
CODE ATTEMPTED (1) ERROR: ALL ENTRIES GET THE VAR MAX VALUE NOT THE CELL MAX VALUE
data[grep("date",names(data),value = T)] <- lapply(data[grep("date",names(data),value = T)], function(x) max(as.Date(str_split(x,"\n")[[1]],format="%d/%m/%Y"), na.rm = T))
CODE ATTEMPTED (2) (NESTED LAPPLY) ERROR: CODE BREAKS DOWN SOMEWHERE
data[grep("date",names(data),value = T)] <- lapply(data[grep("date",names(data),value = T)], function(y) max(y, lapply(data[grep("date",names(data),value = T)], function(x)
as.Date(str_split(x,"\n")[[1]],format="%d/%m/%Y"), na.rm = T)))
CODE ATTEMPTED (3) (NESTED LAPPLY) ERROR: CODE BREAKS DOWN SOMEWHERE
data[grep("date",names(data),value = T)] <- lapply(data[grep("date",names(data),value = T)], function(y) max(y,function(x) as.Date(str_split(x,"\n")[[1]],format="%d/%m/%Y"), na.rm = T))
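We can use:
data[-1] <- lapply(data[-1], function(y) sapply(strsplit(as.character(y), "\n"),
  function(x) max(as.Date(x, "%d/%m/%Y"))))
# sapply drops the Date class, so convert the day counts back to dates
# (origin is required by as.Date for numeric input in older R versions)
data[-1] <- lapply(data[-1], as.Date, origin = "1970-01-01")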
data
# ex date_1 date_2
#1 1 1998-12-22 1997-03-21
#2 2 1993-12-15 <NA>
The logic is the same as you described: for every column except the first, we split the string on "\n", convert the pieces to dates, and return the maximum. The inner sapply loop returns the numeric representation of the dates, so a second lapply converts those numbers back to Date.
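The per-cell step can also be wrapped in a small helper, which makes the behaviour on NA cells easy to check. This is a minimal sketch; latest_date is a hypothetical name, and it assumes the dates are separated by "\n" and use the %d/%m/%Y format as in the question.

```r
# Hypothetical helper: return the latest date found in each cell of x
latest_date <- function(x, fmt = "%d/%m/%Y") {
  parts <- strsplit(as.character(x), "\n", fixed = TRUE)
  # vapply needs a plain numeric result, so take the day count of the maximum
  days <- vapply(parts,
                 function(p) as.numeric(max(as.Date(p, format = fmt))),
                 numeric(1))
  # Convert the day counts back to Date
  as.Date(days, origin = "1970-01-01")
}

cells <- c("30/12/1997\n22/12/1998", "15/12/1993", NA)
res <- latest_date(cells)
format(res)
```

It could then be applied to the date columns with something like data[-1] <- lapply(data[-1], latest_date).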
I am looking for the most frequent values (character strings) and their frequencies.
The intended result is a data frame with three columns:
char: the names of the original columns
mode: the most frequent value in each char
freq: the frequency of the modes
When there is a tie in frequencies, I want to put all of the qualified values in one cell, separated by a comma. -- Or is there any better representation?
Questions: I don't know how to deal with a tie.
I have used the table() function to get the frequency tables of each column.
clean <- read.xlsx("test.xlsx", sheet = "clean") %>% as_tibble()
freqtb <- apply(clean, 2, table)
Here is the second table I got in freqtb:
$休12
个 休 天 饿
1 33 2 1
Then I looped through the tables:
freq <- vector()
mode <- vector()
for (tb in freqtb) {
max = max(tb)
name = names(tb)[tb==max]
freq <- append(freq, max)
mode <- append(mode, name)
}
results <- data.frame(char = names(freqtb), freq = freq, mode=mode)
The mode vector is longer than the other vectors, so it cannot be attached to results. I bet it is due to ties.
How can I get the same length for this "mode" variable?
You can make some small modifications to the code here to get a Mode function that keeps ties. Then loop over the columns of your data frame with lapply and rbind the results together:
options(stringsAsFactors = F)
set.seed(2)
df.in <-
data.frame(
a = sample(letters[1:3], 10, T),
b = sample(1:3, 10, T),
c = rep(1:2, 5))
Mode <- function(x) {
ux <- unique(x)
tab <- tabulate(match(x, ux))
ind <- which(tab == max(tab))
data.frame(char = ux[ind], freq = tab[ind])
}
do.call(rbind, lapply(df.in, Mode))
# char freq
# a c 4
# b 1 4
# c.1 1 5
# c.2 2 5
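The question also asks for tied values in a single cell, separated by commas. A minimal sketch of that variant follows; mode_collapsed and the demo data are hypothetical and not part of the answer above.

```r
# Hypothetical helper: one row per column, with tied modes pasted together
mode_collapsed <- function(x) {
  tab <- table(x)                                 # frequency table (NA dropped)
  top <- names(tab)[as.vector(tab) == max(tab)]   # every value that hits the max
  data.frame(mode = paste(top, collapse = ", "),
             freq = as.integer(max(tab)),
             stringsAsFactors = FALSE)
}

# Made-up data: column a has a tie between "x" and "y"
demo <- data.frame(a = c("x", "x", "y", "y"),
                   b = c("p", "p", "q", "r"),
                   stringsAsFactors = FALSE)

per_col <- lapply(demo, mode_collapsed)
results <- data.frame(char = names(per_col),
                      do.call(rbind, per_col),
                      row.names = NULL,
                      stringsAsFactors = FALSE)
results
```

Because every column contributes exactly one row, char, mode and freq always have the same length.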
I have multiple data frames, each an individual sequence, all consisting of the same columns. I need to delete all the rows after a negative value is encountered in the column "OnsetTime". So not the row with the negative value itself, but the rows after it. All sequences have 16 rows in total.
I think this should be possible with a loop, but I have no experience with loops in R. I have 499 data frames, and I am currently deleting the rows of each sequence one by one, like this:
sequence_6 <- sequence_6[-c(11:16), ]
sequence_7 <- sequence_7[-c(11:16), ]
sequence_9 <- sequence_9[-c(6:16), ]
Is there a faster way of doing this? An example sequence can be seen here.
Regarding this example, I want to delete rows 7 to 16.
Data
Since the odd web configuration at work prevents me from accessing your data, I created three data frames based on random numbers:
set.seed(123); data_1 <- data.frame( value = runif(25, min = -0.1) )
set.seed(234); data_2 <- data.frame( value = runif(20, min = -0.1) )
set.seed(345); data_3 <- data.frame( value = runif(30, min = -0.1) )
First, you could create a list containing all your dataframes:
list_df <- list(data_1, data_2, data_3)
Now you can go through this list with a for loop. Since there are several steps, I find it convenient to use the package dplyr because it allows for a more readable notation:
library(dplyr)
for (i in seq_along(list_df)) {
  min_row <-
    list_df[[i]] %>%
    mutate(id = row_number()) %>%  # add a column with the row number
    filter(value < 0) %>%          # keep the rows with negative values
    slice(1) %>%                   # first such row (no rows if nothing is negative)
    pull(id)                       # extract the row number as a plain vector
  # keep rows 1 to min_row; leave the data frame unchanged if nothing is negative
  if (length(min_row) == 1) list_df[[i]] <- list_df[[i]] %>% slice(1:min_row)
}
Hope it helps!
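We can get the datasets into a list, assuming that the object names start with 'sequence' followed by an underscore and one or more digits. Then use lapply to loop over the list and subset the rows based on the condition:
lst1 <- lapply(mget(ls(pattern = "^sequence_\\d+$")), function(x) {
  # TRUE for every row that has a negative value in any column
  i1 <- Reduce(`|`, lapply(x, `<`, 0))
  # or use rowSums
  # i1 <- rowSums(x < 0) > 0
  # index of the first such row (NA if there is none)
  i2 <- which(i1)[1]
  # keep rows up to and including that row; keep everything if none is negative
  if (is.na(i2)) x else x[seq_len(i2), ]
})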
data
set.seed(42)
sequence_6 <- as.data.frame(matrix(sample(-1:10, 16 *5, replace = TRUE), nrow = 16))
sequence_7 <- as.data.frame(matrix(sample(-2:10, 16 *5, replace = TRUE), nrow = 16))
sequence_9 <- as.data.frame(matrix(sample(-2:10, 16 *5, replace = TRUE), nrow = 16))
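Since the question only cares about negative values in the "OnsetTime" column, the check can be restricted to that column. This is a minimal sketch; truncate_after_negative and the two demo data frames are hypothetical, and only the column name OnsetTime comes from the question.

```r
# Hypothetical helper: keep rows up to and including the first negative value
truncate_after_negative <- function(df, col = "OnsetTime") {
  first_neg <- which(df[[col]] < 0)[1]   # NA when no value is negative
  if (is.na(first_neg)) df else df[seq_len(first_neg), , drop = FALSE]
}

# Made-up sequences: one with a negative value, one without
demo <- list(
  sequence_a = data.frame(OnsetTime = c(5, 3, -1, 4, 2), trial = 1:5),
  sequence_b = data.frame(OnsetTime = c(1, 2, 3), trial = 1:3)
)

trimmed <- lapply(demo, truncate_after_negative)
sapply(trimmed, nrow)
```

With the real data, the list could be built with mget(ls(pattern = "^sequence_\\d+$")) as in the answer above.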
I'm interested in building a function that uses apply/sapply or Map to iterate over the columns in dta and replace the values in each column with matched values from a data frame. The replacement data frames are stored in a nameless list, and the index of each list item corresponds to the column number in the dta data frame.
Example
Given objects:
set.seed(1)
size <- 20
# Data set
dta <-
data.frame(
unitA = sample(LETTERS[1:4], size = size, replace = TRUE),
unitB = sample(letters[16:20], size = size, replace = TRUE),
unitC = sample(month.abb[1:4], size = size, replace = TRUE),
someValue = sample(1:1e6, size = size, replace = TRUE)
)
# Meta data
lstMeta <- list(
# Unit A definitions
data.frame(
V1 = c("A", "B", "D"),
V2 = c("Letter A", "Letter B", "Letter D")
),
# Unit B definitions
data.frame(
V1 = c("t", "q"),
V2 = c("small t", "small q")
),
# Unit C definitions
data.frame(
V1 = c("Mar", "Jan"),
V2 = c("March", "January")
)
)
Desired results
When applied on dta, the function should return a data.frame corresponding to the extract below:
unitA unitB unitC someValue
Letter B small t Apr 912876
Letter B small q March 293604
C s Apr 459066
Letter D p March 332395
Letter A small q March 650871
Letter D small q Apr 258017
Letter D p January 478546
C small q Feb 766311
C small t March 84247
Letter A small q March 875322
Letter A r Feb 339073
Letter A r Apr 839441
C r Feb 346684
Letter B p January 333775
Letter D small t January 476352
(...)
Existing approach
replaceLbls <- function(dataSet, lstDict) {
sapply(seq_along(dataSet), function(i) {
# Take corresponding metadata data frame
dtaDict <- lstDict[[i]]
# Replace values in selected column
# Where matches on V1 push corresponding values from V2
dataSet[,i][match(dataSet[,i], dtaDict[,1])] <- dtaDict[,2][match(dtaDict[,1], dataSet[,i])]
})
}
# Testing -----------------------------------------------------------------
replaceLbls(dataSet = dta, lstDict = lstMeta)
Of course the approach proposed above does not work as it will try to use NA in assignments; but it summarises what I want to achieve:
Error in x[...] <- m : NAs are not allowed in subscripted assignments
In addition: Warning message: In [<-.factor(*tmp*, match(dataSet[,
i], dtaDict[, 1]), value = c(NA, : invalid factor level, NA
generated
Additional remarks
Source data set
The key characteristics of the data are:
The list is nameless so subsetting has to be done by item numbers not by names
Item numbers correspond to column numbers
There is no full match between metadata data frames available in the list of data frames and unit columns available in the data
The someValue column also should be iterated over as it may contain labels that should be replaced
Solution
I'm not interested in dplyr/data.table/sqldf-based solutions.
I'm not interested in nested for-loops
I have a hacky solution that doesn't use for loops or other packages. I needed to convert the factors to characters for it to work but you might be able to improve my solution.
The solution works by only matching values that are found in your lstMeta by creating a vector of indices where matches are found. I also used the <<- operator. If you're better at R than me, you can probably improve this.
set.seed(1)
size <- 20
# Data set
dta <-
data.frame(
unitA = sample(LETTERS[1:4], size = size, replace = TRUE),
unitB = sample(letters[16:20], size = size, replace = TRUE),
unitC = sample(month.abb[1:4], size = size, replace = TRUE),
someValue = sample(1:1e6, size = size, replace = TRUE),
stringsAsFactors = F
)
# Meta data
lstMeta <- list(
# Unit A definitions
data.frame(
V1 = c("A", "B", "D"),
V2 = c("Letter A", "Letter B", "Letter D"),
stringsAsFactors = F
),
# Unit B definitions
data.frame(
V1 = c("t", "q"),
V2 = c("small t", "small q"),
stringsAsFactors = F
),
# Unit C definitions
data.frame(
V1 = c("Mar", "Jan"),
V2 = c("March", "January"),
stringsAsFactors = F
)
)
replaceLbls <- function(dataSet, lstDict) {
  # One dictionary per column, matched by position
  sapply(seq_along(lstDict), function(i) {
    # Take corresponding metadata data frame
    dtaDict <- lstDict[[i]]
    # Rows whose value has an entry in the dictionary
    myUniques <- which(dataSet[, i] %in% dtaDict[, 1])
    # Where matches on V1 push corresponding values from V2;
    # <<- modifies dataSet in the enclosing function, not a local copy
    dataSet[myUniques, i] <<- dtaDict[, 2][match(dataSet[myUniques, i], dtaDict[, 1])]
  })
  return(dataSet)
}
# Testing -----------------------------------------------------------------
replaceLbls(dataSet = dta, lstDict = lstMeta)
The following approach works well for the example data:
replaceLbls <- function(dataSet, lstDict) {
dataSet[seq_along(lstDict)] <- Map(function(x, lst) {
x <- as.character(x)
idx <- match(x, as.character(lst$V1))
replace(x, !is.na(idx), as.character(lst$V2)[na.omit(idx)])
}, dataSet[seq_along(lstDict)], lstDict)
dataSet
}
head(replaceLbls(dta, lstMeta))
# unitA unitB unitC someValue
# 1 Letter B small t Apr 912876
# 2 Letter B small q March 293604
# 3 C s Apr 459066
# 4 Letter D p March 332395
# 5 Letter A small q March 650871
# 6 Letter D small q Apr 258017
This assumes that you want to apply the changes to the first X columns of the data, where X is the length of the metadata list. You might want to include an extra step to convert back to factor, since this approach converts the adjusted columns to character class.
Another remark on factors: you could potentially speed things up by working only on the levels of any factor variables instead of the whole column. The general process would be similar but requires a few more steps to check classes etc.
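The level-based idea can be sketched as follows. This is a minimal sketch under the assumption that the column is already a factor and that the dictionary has the V1/V2 layout from the question; replace_levels is a hypothetical name.

```r
# Hypothetical helper: relabel a factor by editing its levels only
replace_levels <- function(f, dict) {
  stopifnot(is.factor(f))
  lv <- levels(f)
  idx <- match(lv, as.character(dict$V1))       # position of each level in the dictionary
  hit <- !is.na(idx)
  lv[hit] <- as.character(dict$V2)[idx[hit]]    # swap in the labels where a match exists
  levels(f) <- lv                               # unmatched levels stay as they were
  f
}

f <- factor(c("A", "C", "B", "A"))
dict <- data.frame(V1 = c("A", "B", "D"),
                   V2 = c("Letter A", "Letter B", "Letter D"),
                   stringsAsFactors = FALSE)

out <- replace_levels(f, dict)
as.character(out)
```

The matching runs once per level rather than once per row, which is where the potential saving on long columns comes from.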
You can also try this:
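# Return the label for a single value t, or t itself if there is no match
mapr <- function(t, meta) {
  ind <- match(t, meta$V1)
  if (!is.na(ind)) {
    return(meta$V2[ind])
  } else {
    return(t)
  }
}
Then apply it element-wise to the first three columns, each with its own dictionary:
# Assigning back into dta[1:3] keeps the column names and leaves someValue numeric
dta[1:3] <- lapply(1:3, function(i) unname(sapply(as.character(dta[[i]]), mapr, meta = lstMeta[[i]])))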
A couple of mapply calls can do the job:
f1 <- function(df, lst){
d1 <- setNames(data.frame(mapply(function(x, y) x$V2[match(y, x$V1)], lst, df[1:3]),
df$someValue, stringsAsFactors = FALSE),
names(df))
as.data.frame(mapply(function(x, y) replace(x, is.na(x), y[is.na(x)]), d1, df))
}