I have a data.frame df
df = data.frame(v = c('E', 'B', 'EB', 'RM'))
df$n= 100 / apply(df, 1, nchar)
where the letters in v represent the values E = 4, B = 3, R = 2, and M = 1.
I want to calculate a column like so:
v n idx
1 E 100 400
2 B 100 300
3 EB 50 350
4 RM 50 150
Where idx is n multiplied by the sum of the values of the letters in v. For example, for the first row 4 * 100 = 400, and for the last row (2 + 1) * 50 = 150.
I have something like this:
df$e = ifelse(grepl('E', df$v), 4, 0)
df$b = ifelse(grepl('B', df$v), 3, 0)
df$r = ifelse(grepl('R', df$v), 2, 0)
df$m = ifelse(grepl('M', df$v), 1, 0)
df$idx = df$n * (df$e + df$b + df$r + df$m)
But this becomes infeasible as the number of columns grows.
1) Define a lookup table, lookup, and a function Sum that takes a vector of single letters, looks up each one, and sums the corresponding values.
Then split v into a list of single-letter vectors and sapply over that list using Sum, multiplying the result by n.
lookup <- c(E = 4, B = 3, R = 2, M = 1)
Sum <- function(x) sum(lookup[x])
transform(df, idx = n * sapply(strsplit(as.character(v), ""), Sum))
giving:
v n idx
1 E 100 400
2 B 100 300
3 EB 50 350
4 RM 50 150
2) An alternative using lookup from above is the following, which applies lookup to each character in v via an anonymous function written in formula notation. This creates a list over which we sapply the sum, finally multiplying by n.
library(gsubfn)
transform(df, idx = n * sapply(strapply(as.character(v), ".", x ~ lookup[x]), sum))
3) A dplyr/tidyr solution using lookup from above is the following. We insert an id to uniquely identify each row and then use separate_rows to place each letter of v in a separate row. We then summarize all rows with the same id by looking up each letter and summing. Finally we remove id.
library(dplyr)
library(tidyr)
df %>%
  mutate(id = 1:n()) %>%
  separate_rows(v, sep = "(?<=.)(?=.)") %>%
  group_by(id, n) %>%
  summarize(idx = sum(n * lookup[v])) %>%
  ungroup %>%
  select(-id)
giving:
# A tibble: 4 x 2
      n   idx
  <dbl> <dbl>
1   100   400
2   100   300
3    50   350
4    50   150
One could avoid the complex regular expression by replacing the separate_rows statement with these two statements (the full pipeline is sketched below):
mutate(v = strsplit(as.character(v), "")) %>%
unnest %>%
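For reference, the full pipeline with that replacement might look like this (a sketch; shown here with unnest(v), which current tidyr expects, while older tidyr also accepts a bare unnest):
df %>%
  mutate(id = 1:n()) %>%
  mutate(v = strsplit(as.character(v), "")) %>%
  unnest(v) %>%
  group_by(id, n) %>%
  summarize(idx = sum(n * lookup[v])) %>%
  ungroup %>%
  select(-id)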
Make a look-up table with your values. Then match a split version of your df$v column (via strsplit) against it, sum the corresponding values, and do your multiplication calculation:
lkup <- data.frame(id = c("E", "B", "R", "M"), value = c(4, 3, 2, 1))
sapply(
  strsplit(as.character(df$v), ""),
  function(x) sum(lkup$value[match(x, lkup$id)])
) * df$n
#[1] 400 300 350 150
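If you want this stored on df as the idx column from the question, you could simply assign the result, e.g.:
df$idx <- sapply(
  strsplit(as.character(df$v), ""),
  function(x) sum(lkup$value[match(x, lkup$id)])
) * df$n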
I have two data frames and want to add a specific vector from one as a column to the other for multiplication purposes. In other words, how can I multiply data from one data frame by a specific vector from another?
Example
library(dplyr)
df <- data.frame(name = c("A", "B", "C", "D", "E"),
                 area = c(1, 2, 3, 4, 5),
                 yield = c(10, 20, 30, 40, 50))
df2 <- data.frame(application = c("test", "current", "future"),
                  number = c(5, 10, 15))
The intended result is, e.g., to get the value for "current" in df2 and create a new column on df named "number", which is then multiplied with the other columns in df to generate the column "calculation". The final df should look like the output shown in the answer below.
I tried
df$number <- df2 %>%
filter(application == "current") %>%
select(number)
But I get an error:
Error in `$<-.data.frame`(`*tmp*`, number, value = list(number = 10)) :
  replacement has 1 row, data has 5
I know that I could do
df$number <- df2[2,2]
But I want it to be specifically related to "current" (as I tried to do with dplyr). This is only an example; in reality, df2 is a big file and the order can change when people add more data.
Here is a base R approach -
df$number <- df2$number[df2$application == "current"]
df$calculation <- with(df, area * yield * number)
df
Or if you prefer dplyr -
library(dplyr)
df <- df %>%
  bind_cols(df2 %>%
              filter(application == "current") %>%
              select(number)) %>%
  mutate(calculation = area * yield * number)
df
# name area yield number calculation
#1 A 1 10 10 100
#2 B 2 20 10 400
#3 C 3 30 10 900
#4 D 4 40 10 1600
#5 E 5 50 10 2500
I have a data frame with two variables (start, end). I would like to create an identifier variable which grows in ascending order of start and, most importantly, is kept constant if the value of start coincides with the end of any other row in the data frame.
Below is a simple example of the data
toy_data <- data.frame(start = c(1,5,6,10,16),
end = c(10,9,11,15,17))
The output I would be looking for is the following:
output_data <- data.frame(start = c(1,10,5,6,16),
end = c(10,15,9,11,17),
NEW_VAR = c(1,1,2,3,4))
You could try adapting this answer to group by ranges that are adjacent to each other. Credit goes entirely to @r2evans.
In this case, you would use expand.grid to get combinations of start and end. Instead of labels you would have row numbers rn to reference.
In the end, you can number the groups based on which rows appear together in the list. The last few lines, starting with enframe, use tibble/tidyverse. To match the group numbers I re-sorted the results too.
I hope this might be helpful.
library(tidyverse)
toy_data <- data.frame(start = c(1,5,6,10,16),
end = c(10,9,11,15,17))
toy_data$rn = 1:nrow(toy_data)
eg <- expand.grid(a = seq_len(nrow(toy_data)), b = seq_len(nrow(toy_data)))
eg <- eg[eg$a < eg$b,]
together <- cbind(
setNames(toy_data[eg$a,], paste0(names(toy_data), "1")),
setNames(toy_data[eg$b,], paste0(names(toy_data), "2"))
)
together <- subset(together, end1 == start2)
groups <- split(together$rn2, together$rn1)
for (i in toy_data$rn) {
  ind <- (i == names(groups)) | sapply(groups, `%in%`, x = i)
  vals <- groups[ind]
  groups <- c(
    setNames(list(unique(c(i, names(vals), unlist(vals)))), i),
    groups[!ind]
  )
}
min_row <- as.numeric(sapply(groups, min))
ctr <- seq_along(groups)
lapply(ctr[order(match(min_row, ctr))], \(x) toy_data[toy_data$rn %in% groups[[x]], ]) %>%
  enframe() %>%
  unnest(col = value) %>%
  select(-rn)
Output
name start end
<int> <dbl> <dbl>
1 1 1 10
2 1 10 15
3 2 5 9
4 3 6 11
5 4 16 17
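If you want the exact shape of the requested output_data, one option is to replace the final select(-rn) with
  select(start, end, NEW_VAR = name)
which drops rn and renames name to NEW_VAR in a single step.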
The following function should give you the desired identifier variable NEW_VAR.
identifier <- \(df) {
  x <- array(0L, dim = nrow(df))
  count <- 0L
  my_seq <- seq_len(nrow(df))
  for (i in my_seq) {
    if (!df[i, ]$start %in% df$end) {
      x[i] <- my_seq[i] + count
    } else {
      x[i] <- my_seq[i] - 1L + count
      count <- count - 1L
    }
  }
  x
}
Examples
# your example
toy_data <- data.frame(start = c(1,10,5,6,16),
end = c(10,15,9,11,17))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 1 1 2 3 4
# other example
toy_data <- data.frame(start = c(1, 2, 2, 4, 16, 21, 18, 3),
end = c(16, 2, 21, 2, 2, 2, 3, 1))
toy_data$NEW_VAR <- identifier(toy_data)
# ---------------------
> toy_data$NEW_VAR
[1] 0 0 0 1 1 1 2 2
Suppose I have a data frame that looks like this:
dframe = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
# x y
# 1 1 4
# 2 2 5
# 3 3 6
And a vector of column names, one per row of the data frame:
colname = c('x', 'y', 'x')
For each row of the data frame, I would like to select the value from the corresponding column in the vector. Something similar to dframe[, colname] but for each row.
Thus, I want to obtain c(1, 5, 3) (i.e. row 1: col "x"; row 2: col "y"; row 3: col "x")
My favourite old trick, matrix indexing, will take care of this. Just pass a two-column matrix with the respective row/column indices:
rownames(dframe) <- seq_len(nrow(dframe))
dframe[cbind(rownames(dframe),colname)]
#[1] 1 5 3
Or, if you don't want to add rownames:
dframe[cbind(seq_len(nrow(dframe)), match(colname,names(dframe)))]
#[1] 1 5 3
One can use mapply to pass the row number (of dframe) and the corresponding column name (for each row) and return that specific value.
The solution using mapply is:
dframe = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
colname = c('x', 'y', 'x')
mapply(function(x,y)dframe[x,y],1:nrow(dframe), colname)
#[1] 1 5 3
The next option may not be very intuitive, but if someone wants a solution as a dplyr chain, one way using gather is:
library(tidyverse)
data.frame(colname = c('x', 'y', 'x'), stringsAsFactors = FALSE) %>%
  rownames_to_column() %>%
  left_join(dframe %>% rownames_to_column() %>%
              gather(colname, value, -rowname),
            by = c("rowname", "colname")) %>%
  select(rowname, value)
# rowname value
# 1 1 1
# 2 2 5
# 3 3 3
Basically, I want all unique combinations of columns i and j, expecting NA in column k where a combination is not available (similar to doing a group_by & summarise, but with all unique possible combinations).
Is there any existing function from the tidyverse or else that does the same thing as the one that I wrote below?
library(tidyverse)
df <- tibble(
i = c("a", "a", "b"),
j = c("x", "y", "x"),
k = c(100, 300, 20)
)
# I often write this chunk below after group_by & summarise
df %>%
spread(j, k) %>%
gather(j, k, -i)
You can use the complete function from the tidyverse:
http://tidyr.tidyverse.org/reference/complete.html
df %>% complete(i,j)
This gives you all the combinations of columns i and j, filling k with NA where a combination is missing.
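For the example df above, this should give something along these lines (printed as a tibble; exact formatting may vary by version):
df %>% complete(i, j)
# A tibble: 4 x 3
#   i     j         k
#   <chr> <chr> <dbl>
# 1 a     x       100
# 2 a     y       300
# 3 b     x        20
# 4 b     y        NA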
As it seems that cross_join() is still a tidyverse feature request on GitHub, expand.grid(unique(df$i), unique(df$j), stringsAsFactors = FALSE) can be used to create a data.frame with all unique possible combinations:
df %>%
right_join(expand.grid(unique(df$i), unique(df$j), stringsAsFactors = FALSE),
by = c("i" = "Var1", "j" = "Var2"))
i j k
1 a x 100
2 b x 20
3 a y 300
4 b y NA
So, OP's approach using spread() and gather() appears to be more concise.
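As an aside, newer versions of dplyr do ship a cross_join() (added in dplyr 1.1.0, as far as I know), so the expand.grid() helper could be replaced along these lines (a sketch, not verified on every version):
df %>%
  right_join(cross_join(distinct(df, i), distinct(df, j)),
             by = c("i", "j"))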
Personally, I do prefer the CJ() function from the data.table package:
library(data.table)
setDT(df)[CJ(i = i, j = j, unique = TRUE), on = .(i, j)]
i j k
1: a x 100
2: a y 300
3: b x 20
4: b y NA
CJ() can be used as a replacement for expand.grid() together with right_join():
df %>%
right_join(data.table::CJ(i = .$i, j = .$j, unique = TRUE))
i j k
1 a x 100
2 a y 300
3 b x 20
4 b y NA
I want to count the individual and combined occurrence of variables (1 represents presence and 0 represents absence). This can be obtained by multiple uses of the table function (see MWE below). Is there a more efficient approach to get the required output given below?
set.seed(12345)
A <- rbinom(n = 100, size = 1, prob = 0.5)
B <- rbinom(n = 100, size = 1, prob = 0.6)
C <- rbinom(n = 100, size = 1, prob = 0.7)
df <- data.frame(A, B, C)
table(A)
A
0 1
48 52
table(B)
B
0 1
53 47
table(C)
C
0 1
34 66
table(A, B)
B
A 0 1
0 25 23
1 28 24
table(A, C)
C
A 0 1
0 12 36
1 22 30
table(B, C)
C
B 0 1
0 21 32
1 13 34
table(A, B, C)
, , C = 0
B
A 0 1
0 8 4
1 13 9
, , C = 1
B
A 0 1
0 17 19
1 15 15
Required Output
I need something like the following:
A = 52
B = 47
C = 66
A + B = 24
A + C = 30
B + C = 34
A + B + C = 15
Expanding on Sumedh's answer, you can also do this dynamically without having to specify the filter every time. This will be useful if you have more than just 3 columns to combine.
You can do something like this:
lapply(seq_len(ncol(df)), function(i){
  # Generate all the combinations of i elements of the columns
  tmp_i = utils::combn(names(df), i)
  # Each column of tmp_i holds the elements of one combination
  apply(tmp_i, 2, function(x){
    dynamic_formula = as.formula(paste("~", paste(x, "== 1", collapse = " & ")))
    df %>%
      filter_(.dots = dynamic_formula) %>%
      summarize(Count = n()) %>%
      mutate(type = paste0(sort(x), collapse = ""))
  }) %>%
    bind_rows()
}) %>%
  bind_rows()
This will:
1) Generate all the combinations of the columns of df: first the combinations with one element (A, B, C), then the ones with two elements (AB, AC, BC), etc. This is the outer lapply.
2) For every combination, create a dynamic formula. For AB, for instance, the formula will be A == 1 & B == 1, exactly as Sumedh suggested. This is the dynamic_formula bit.
3) Filter the data frame with the dynamically generated formula and count the number of rows.
4) Bind everything together (the two bind_rows calls).
The output will be
Count type
1 52 A
2 47 B
3 66 C
4 24 AB
5 30 AC
6 34 BC
7 15 ABC
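Note that filter_() has since been deprecated in dplyr. A rough modern equivalent (a sketch, assuming a reasonably recent dplyr that provides if_all()) would filter with if_all() instead of building a formula:
lapply(seq_len(ncol(df)), function(i){
  # All combinations of i column names
  tmp_i <- utils::combn(names(df), i)
  apply(tmp_i, 2, function(x){
    df %>%
      # keep rows where every selected column equals 1
      filter(if_all(all_of(x), ~ .x == 1)) %>%
      summarize(Count = n()) %>%
      mutate(type = paste0(sort(x), collapse = ""))
  }) %>%
    bind_rows()
}) %>%
  bind_rows()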
EDITED TO ADD: I see now that you don't want to get the exclusive counts (i.e. A and AB should both include all As).
I got more than a little nerd-sniped by this today, particularly as I wanted to solve it using base R with no packages. The below should do that.
There is a very easy (in principle) solution that simply uses xtabs(), which I've illustrated below. However, to generalize it for any potential number of dimensions, and then to apply it to a variety of combinations, actually was harder. I strove to avoid using the dreaded eval(parse()).
set.seed(12345)
A <- rbinom(n = 100, size = 1, prob = 0.5)
B <- rbinom(n = 100, size = 1, prob = 0.6)
C <- rbinom(n = 100, size = 1, prob = 0.7)
df <- data.frame(A, B, C)
# Turn off automatic conversion of strings to factors
options(stringsAsFactors = FALSE)
# Obtain the n-way frequency table
# This table can be directly subset using []
# It is a little tricky to pass the arguments
# I'm trying to avoid eval(parse())
# But still give a solution that isn't bound to a specific size
xtab_freq <- xtabs(formula = formula(x = paste("~",paste(names(df),collapse = " + "))),
data = df)
# Demonstrating what I mean
# All A
sum(xtab_freq["1",,])
# [1] 52
# AC
sum(xtab_freq["1",,"1"])
# [1] 30
# Using lapply(), we pass names(df) to combn() with m values of 1, 2, and 3
# The output of combn() goes through list(), then is unlisted with recursive FALSE
# This gives us a list of vectors
# Each one being a combination in which we are interested
lst_combs <- unlist(lapply(X = 1:3,FUN = combn,x = names(df),list),recursive = FALSE)
# For nice output naming, I just paste the values together
names(lst_combs) <- sapply(X = lst_combs,FUN = paste,collapse = "")
# This is a function I put together
# Generalizes process of extracting values from a crosstab
# It does it in this fashion to avoid eval(parse())
uFunc_GetMargins <- function(crosstab,varvector,success) {
# Obtain the dimname-names (the names within each dimension)
# From that, get the regular dimnames
xtab_dnn <- dimnames(crosstab)
xtab_dn <- names(xtab_dnn)
# Use match() to get a numeric vector for the margins
# This can be used in margin.table()
tgt_margins <- match(x = varvector,table = xtab_dn)
# Obtain a margin table
marginal <- margin.table(x = crosstab,margin = tgt_margins)
# To extract the value, figure out which marginal cell contains
# all variables of interest set to success
# sapply() goes over all the elements of the dimname names
# Finds numeric index in that dimension where the name == success
# We subset the resulting vector by tgt_margins
# (to only get the cells in our marginal table)
# Then, use prod() to multiply them together and get the location
tgt_cell <- prod(sapply(X = xtab_dnn,
FUN = match,
x = success)[tgt_margins])
# Return as named list for ease of stacking
return(list(count = marginal[tgt_cell]))
}
# Doing a call of mapply() lets us get the results
do.call(what = rbind.data.frame,
args = mapply(FUN = uFunc_GetMargins,
varvector = lst_combs,
MoreArgs = list(crosstab = xtab_freq,
success = "1"),
SIMPLIFY = FALSE,
USE.NAMES = TRUE))
# count
# A 52
# B 47
# C 66
# AB 24
# AC 30
# BC 34
# ABC 15
I ditched the prior solution that used aggregate.
Using dplyr:
Occurrence of A:
library(dplyr)
df %>% filter(A == 1) %>% summarise(Total = nrow(.))
Occurrence of A and B:
df %>% filter(A == 1, B == 1) %>% summarise(Total = nrow(.))
Occurrence of A, B, and C:
df %>% filter(A == 1, B == 1, C == 1) %>% summarise(Total = nrow(.))
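With the example data these should return Totals of 52, 24, and 15 respectively, matching the table() results and the required output above, e.g.:
df %>% filter(A == 1, B == 1) %>% summarise(Total = nrow(.))
#   Total
# 1    24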