How to apply a custom function to every value in a dataset - r

I have a multipart function to convert characters in a specified column to numbers, as follows:
library(plyr)  # revalue() comes from plyr
ccreate <- function(df, x){ revalue(df[[x]], c( "?"=1 , "D"=2 , "C"=3 , "B"=4 , "A"=5 )) }
I then use that function to create new columns in my original dataset of just the values using another function:
coladd <- function(df, x){ df[[paste(x, "_col", sep='' )]] <- ccreate(df,x); df }
Here is an example of the function:
col1 <- c("A", "B", "C", "D", "?")
col2 <- c("A", "A", "A", "D", "?")
col3 <- c("C", "B", "?", "A", "B")
test <- data.frame(col1, col2, col3)
test
coladd(test, "col1")
This works, but I have to feed each column name from my dataset into coladd() one at a time. Is there a way to apply the coladd() function to every column in a dataframe without having to type in each column name?
Thanks again, and sorry for any confusion, this is my first post here.

Using your functions, you can use Reduce.
ccreate <- function(df, x){ revalue(df[[x]], c( "?"=1 , "D"=2 , "C"=3 , "B"=4 , "A"=5 )) }
coladd <- function(df, x){ df[[paste(x, "_col", sep='' )]] <- ccreate(df,x); df }
Reduce(coladd, names(test), test)
# col1 col2 col3 col1_col col2_col col3_col
# 1 A A C 5 5 3
# 2 B A B 4 5 4
# 3 C A ? 3 5 1
# 4 D D A 2 2 5
# 5 ? ? B 1 1 4
Here is how I would do it, though not using your functions.
library(dplyr)
# this is a named vector to serve as your lookup
recode_val <- c( "?"=1 , "D"=2 , "C"=3 , "B"=4 , "A"=5 )
test %>%
mutate(across(everything(), list(col = ~ recode_val[.])))
# col1 col2 col3 col1_col col2_col col3_col
# 1 A A C 5 5 3
# 2 B A B 4 5 4
# 3 C A ? 3 5 1
# 4 D D A 2 2 5
# 5 ? ? B 1 1 4
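If you would rather avoid extra packages, the same lookup idea can be sketched in base R. This is only a minimal sketch on the example data from the question; the as.character() and unname() calls are my own additions, there to guard against factor columns and to drop the lookup names from the result.

```r
# Example data from the question
test <- data.frame(
  col1 = c("A", "B", "C", "D", "?"),
  col2 = c("A", "A", "A", "D", "?"),
  col3 = c("C", "B", "?", "A", "B"),
  stringsAsFactors = FALSE
)

# Named lookup vector: names are the codes, values are the numbers
recode_val <- c("?" = 1, "D" = 2, "C" = 3, "B" = 4, "A" = 5)

# Look up every column by name in the lookup vector
recoded <- lapply(test, function(col) unname(recode_val[as.character(col)]))

# Append the recoded columns with a "_col" suffix
test[paste0(names(test), "_col")] <- recoded
test
```

Because the lookup is applied with lapply() over all columns, no column name has to be typed by hand.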


Count of number of elements between distinct elements in vector

Suppose I have a vector of values, such as:
A C A B A C C B B C C A A A B B B B C A
I would like to create a new vector that, for each element, contains the number of elements since that element was last seen. So, for the vector above,
NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
(where NA indicates that this is the first time the element has been seen).
For example, the first and second A are in positions 1 and 3 respectively, a difference of 2; the third and fourth A are in positions 5 and 12, a difference of 7, and so on.
Is there a pre-built pipe-compatible function that does this?
I hacked together this function to demonstrate:
# For reproducibility
set.seed(1)
# Example vector
x = sample(LETTERS[1:3], size = 20, replace = TRUE)
compute_lag_counts = function(x, first_time = NA){
# return vector to fill
lag_counts = rep(-1, length(x))
# values to match
vals = unique(x)
# find all positions of all elements in the target vector
match_list = grr::matches(vals, x, list = TRUE)
# compute the lags, then put them in the appropriate place in the return vector
for(i in seq_along(match_list))
lag_counts[x == vals[i]] = c(first_time, diff(sort(match_list[[i]])))
# return vector
return(lag_counts)
}
compute_lag_counts(x)
Although it seems to do what it is supposed to do, I'd rather use someone else's efficient, well-tested solution! My searching has turned up empty, which is surprising to me given that it seems like a common task.
Or, in base R, with ave():
ave(seq_along(x), x, FUN = function(x) c(NA, diff(x)))
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
We calculate the first difference of the indices for each group of x.
A data.table option, thanks to @Henrik:
library(data.table)
dt = data.table(x)
dt[ , d := .I - shift(.I), x]
dt
Here's a function that would work
compute_lag_counts <- function(x) {
seqs <- split(seq_along(x), x)
unsplit(Map(function(i) c(NA, diff(i)), seqs), x)
}
compute_lag_counts (x)
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
Basically you use split() to separate the indexes where values appear by each unique value in your vector. Then we use the difference between the indexes where they appear to calculate the distance to the previous occurrence. Then we use unsplit() to put those values back in the original order.
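To make the three steps concrete, here is a small trace on a shortened vector (the first five elements of the example). The intermediate names seqs, gaps and res are just for illustration.

```r
x <- c("A", "C", "A", "B", "A")

# Step 1: positions of each distinct value
seqs <- split(seq_along(x), x)
# seqs$A is 1 3 5, seqs$B is 4, seqs$C is 2

# Step 2: gap to the previous occurrence within each group (NA for the first)
gaps <- lapply(seqs, function(i) c(NA, diff(i)))
# gaps$A is NA 2 2, gaps$B is NA, gaps$C is NA

# Step 3: put the gaps back in the original order
res <- unsplit(gaps, x)
res
# NA NA 2 NA 2
```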
An option with dplyr by taking the difference of adjacent sequence elements after grouping by the original vector
library(dplyr)
tibble(v1) %>%
mutate(ind = row_number()) %>%
group_by(v1) %>%
mutate(new = ind - lag(ind)) %>%
pull(new)
#[1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
data
v1 <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C", "C", "A",
"A", "A", "B", "B", "B", "B", "C", "A")

R count and list unique rows for each column satisfying a condition

I have been going crazy with something basic...
I am trying to count and list in a comma separated column each unique ID coming up in a data frame, e.g.:
df<-data.frame(id = as.character(c("a", "a", "a", "b", "c", "d", "d", "e", "f")), x1=c(3,1,1,1,4,2,3,3,3),
x2=c(6,1,1,1,3,2,3,3,1),
x3=c(1,1,1,1,1,2,3,3,2))
> df
id x1 x2 x3
1 a 3 6 1
2 a 1 1 1
3 a 1 1 1
4 b 1 1 1
5 c 4 3 1
6 d 2 2 2
7 d 3 3 3
8 e 3 3 3
9 f 3 1 2
I am trying to get a count of unique id that satisfy a condition, >1:
res = data.frame(x1_counts =5, x1_names="a,c,d,e,f", x2_counts = 4, x2_names="a,c,d,e", x3_counts = 3, x3_names="d,e,f")
> res
x1_counts x1_names x2_counts x2_names x3_counts x3_names
1 5 a,c,d,e,f 4 a,c,d,e 3 d,e,f
I have tried with data.table but it seems very convoluted, i.e.
DT = as.data.table(df)
res <- DT[, list(x1= length(unique(id[which(x1>1)])), x2= length(unique(id[which(x2>1)]))), by=id]
But I can't get it right. I don't see what I need to do with data.table, since it is not really a grouping I am looking for. Can you point me in the right direction, please? Thanks so much!
You can reshape your data to long format and then do the summary:
library(data.table)
(melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = list(unique(id))), variable])
# You can replace list() with toString() if you want the names as a single string instead of a list
# variable counts names
#1: x1 5 a,c,d,e,f
#2: x2 4 a,c,d,e
#3: x3 3 d,e,f
To get what you need, reshape it back to wide format:
dcast(1~variable,
data = (melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = list(unique(id))), variable]),
value.var = c('counts', 'names'))
# . counts_x1 counts_x2 counts_x3 names_x1 names_x2 names_x3
# 1: . 5 4 3 a,c,d,e,f a,c,d,e d,e,f
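As a cross-check that does not depend on data.table, the same summary can be sketched in base R. This is only an illustrative sketch: the names value_cols, ids_over_1 and res_long are mine, and the result is left in long format (one row per column) rather than reshaped to wide.

```r
df <- data.frame(
  id = c("a", "a", "a", "b", "c", "d", "d", "e", "f"),
  x1 = c(3, 1, 1, 1, 4, 2, 3, 3, 3),
  x2 = c(6, 1, 1, 1, 3, 2, 3, 3, 1),
  x3 = c(1, 1, 1, 1, 1, 2, 3, 3, 2),
  stringsAsFactors = FALSE
)

# Every column except the id column is checked against the condition
value_cols <- setdiff(names(df), "id")

# For each column, the unique ids with at least one value > 1
ids_over_1 <- lapply(df[value_cols], function(col) unique(df$id[col > 1]))

# One row per column: how many ids, and which ones as a comma-separated string
res_long <- data.frame(
  variable = value_cols,
  counts = lengths(ids_over_1),
  names = vapply(ids_over_1, paste, character(1), collapse = ","),
  row.names = NULL,
  stringsAsFactors = FALSE
)
res_long
```

Here paste(..., collapse = ",") is used instead of toString() so the separator has no space, which matches the format asked for in the question.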

data.table : remove duplicate subset of rows for a given index value

I would like to improve my piece of code. Let's say you want to remove duplicate rows that have the same 'label' and 'id'. The way I do it is:
library(data.table)
dt <- data.table(label = c("A", "A", "B", "B", "C", "A", "A", "A"),
id = c(1, 1, 2, 2, 3, 4, 5, 5))
tmp = dt[label == 'A',]
tmp = unique(tmp, by = 'id')
dt = dt[label != 'A',]
dt = rbind(dt, tmp)
Is there a smarter/shorter way to accomplish that? If possible by reference?
This code looks very ugly and implies a lot of copies.
(Moreover I have to do this operation for a few labels, but not all of them. So this implies 4 lines for every label...)
Thanks !
Example:
label id
A 1
A 1
B 2
B 2
C 3
A 4
A 5
A 5
Would give :
label id
A 1
B 2
B 2
C 3
A 4
A 5
Note that rows 3 and 4 of the input stay duplicated, since their label is 'B' and not 'A'.
There is no need to create tmp and then rbind it again. You can simply use the duplicated function as follows:
dt[label != "A" | !duplicated(dt, by=c("label", "id"))]
# label id
# 1: A 1
# 2: B 2
# 3: B 2
# 4: C 3
# 5: A 4
# 6: A 5
If you want to do this over several labels:
dt[!label %in% c("A", "C") | !duplicated(dt, by=c("label", "id"))]
See ?duplicated to learn more about de-duplication functions in data.table.
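To see why the filter keeps the right rows, it can help to look at the two logical vectors separately. The sketch below uses a plain data.frame and base duplicated() purely to show the logic; the names dat_df, is_dup, keep and res are illustrative, and the data.table call above remains the actual solution.

```r
dat_df <- data.frame(
  label = c("A", "A", "B", "B", "C", "A", "A", "A"),
  id    = c(1, 1, 2, 2, 3, 4, 5, 5),
  stringsAsFactors = FALSE
)

# TRUE for every row that repeats an earlier (label, id) pair
is_dup <- duplicated(dat_df[c("label", "id")])
# FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE

# Keep a row if its label is not "A", or if it is a first occurrence
keep <- dat_df$label != "A" | !is_dup
# TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE

res <- dat_df[keep, ]
res
```

Row 4 is a duplicate but survives because its label is "B"; rows 2 and 8 are duplicates with label "A" and are dropped.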
This could also be done using an if/else condition:
dt[, if(all(label=='A')) .SD[1L] else .SD, by = id]
# id label
#1: 1 A
#2: 2 B
#3: 2 B
#4: 3 C
#5: 4 A
#6: 5 A

R return only one row for each value in vector

I have a dataframe with rows of repeating values, for example:
id
A
A
A
B
B
C
C
D
D
What I would like to achieve is a line of code that retains only one value for each value in another vector, for example in:
keeps <- c("A", "C")
The result should be this:
id
A
C
Try this:
df[df$id %in% c("A", "C") & !duplicated(df$id),,drop = FALSE]
# id
# 1 A
# 6 C
or this:
unique(df[df$id %in% c("A", "C"),,drop = FALSE])
# id
# 1 A
# 6 C
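Both versions hard-code c("A", "C"). A minimal sketch that uses the keeps vector from the question instead, with the example data rebuilt so it runs on its own (the name res is just for illustration):

```r
df <- data.frame(
  id = c("A", "A", "A", "B", "B", "C", "C", "D", "D"),
  stringsAsFactors = FALSE
)
keeps <- c("A", "C")

# Keep rows whose id is in keeps, and only the first occurrence of each id
res <- df[df$id %in% keeps & !duplicated(df$id), , drop = FALSE]
res
```

The drop = FALSE matters here because df has a single column; without it the result would collapse to a plain vector.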

Add new variable to specific position in dataframe

I have a DF where I want to add a new variable called "B" into the 2nd position.
A C D
1 1 5 2
2 3 3 7
3 6 2 3
4 6 4 8
5 1 1 2
Anyone have an idea?
The easiest way would be to add the columns you want and then reorder them:
dat$B <- 1:5
newdat <- dat[, c("A", "B", "C", "D")]
Another way:
newdat <- cbind(dat[1], B=1:5, dat[,2:3])
If you're concerned about overhead, perhaps a data.table solution? (With help from this answer):
library(data.table)
dattable <- data.table(dat)
dattable[,B:=1:5]
setcolorder(dattable, c("A", "B", "C", "D"))
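dat$B <- 1:5
ind <- c(1:which(names(dat) == "A"), ncol(dat), (which(names(dat) == "A") + 1):(ncol(dat) - 1))
dat <- dat[, ind]
Create the variable at the end of the data.frame, then use an index vector that says how to reorder the columns. ind is just a vector of column positions, here c(1, 4, 2, 3). Note the parentheses around ncol(dat) - 1: the : operator binds more tightly than -, so without them the sequence is built first and then shifted by one.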
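If you need this more than once, the reorder step can be wrapped in a small helper. This is only a sketch: insert_col_after() is a hypothetical name, not an existing function, and it assumes the new column name is not already present in the data frame.

```r
dat <- data.frame(
  A = c(1, 3, 6, 6, 1),
  C = c(5, 3, 2, 4, 1),
  D = c(2, 7, 3, 8, 2)
)

# Hypothetical helper: add `values` as column `name` right after column `after`
insert_col_after <- function(df, name, values, after) {
  pos <- match(after, names(df))
  stopifnot(!is.na(pos))
  # The new column is first added at the end
  df[[name]] <- values
  # append() slots the new name in right after position `pos`
  ord <- append(names(df)[-ncol(df)], name, after = pos)
  df[, ord, drop = FALSE]
}

newdat <- insert_col_after(dat, "B", 1:5, after = "A")
names(newdat)
# "A" "B" "C" "D"
```

Working with column names through append() avoids the position arithmetic, so there is no operator precedence to get wrong.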
