Pick a column to multiply with, contingent on value of other variables - r

I am still taking my first steps with R and have found SO to be a great tool for learning more and finding answers to my questions. For this one, though, I did not manage to find a good solution here.
I have a dataframe that can be simplified to this structure:
set.seed(10)
df <- data.frame(v1 = rep(1:2, times=3),
v2 = c("A","B","B","A","B","A"),
v3 = sample(1:6),
xA_1 = sample(1:6),
xA_2 = sample(1:6),
xB_1 = sample(1:6), xB_2 = sample(1:6))
df thus looks like this:
> df
v1 v2 v3 xA_1 xA_2 xB_1 xB_2
1 1 A 4 2 1 3 3
2 2 B 2 6 3 5 4
3 1 B 5 3 2 4 5
4 2 A 3 5 4 2 1
5 1 B 1 4 6 6 2
6 2 A 6 1 5 1 6
I now want R to create a fourth variable, which is dependent on the values of v1 and v2. I achieve this by using the following code:
library(data.table)
df <- data.table(df)
df[, v4 := ifelse(v1 == 1 & v2 == "A", v3*xA_1,
ifelse(v1 == 1 & v2 == "B", v3*xB_1,
ifelse(v1 == 2 & v2 == "A", v3*xA_2,
ifelse(v1 == 2 & v2 == "B", v3*xB_2, v3*1))))]
So v4 is created by multiplying v3 by the column whose name contains the v2 and v1 values
(e.g. for row 1: v1=1 and v2=A thus multiply v3=4 with xA_1=2 -> 8).
> df$v4
[1] 8 8 20 12 6 30
Obviously, my ifelse approach becomes tedious when v1 and v2 have many more distinct values than in this example. So I am looking for an efficient way to tell R: if v1 == y & v2 == z, multiply v3 by column xz_y.
I tried writing a for loop, writing a function that takes y and z as indices, and using the apply function. However, none of these worked as intended.
I appreciate any ideas!

Here's a base R option:
i <- paste0("x", df$v2, "_", df$v1)
df$v4 <- df$v3 * as.numeric(df[cbind(1:nrow(df), match(i, names(df)))])
For the sample data provided below, it creates a column v4 as:
> df$v4
[1] 25 12 2 6 3 10
Or if you want to include the "else" condition to multiply by 1 in case there's no matching column name:
i <- paste0("x", df$v2, "_", df$v1)
tmp <- as.numeric(df[cbind(1:nrow(df), match(i, names(df)))])
df$v4 <- df$v3 * ifelse(is.na(tmp), 1, tmp)
Sample data:
df <- structure(list(v1 = c(1L, 2L, 1L, 2L, 1L, 2L), v2 = structure(c(1L,
2L, 2L, 1L, 2L, 1L), .Label = c("A", "B"), class = "factor"),
v3 = c(5L, 4L, 1L, 6L, 3L, 2L), xA_1 = c(5L, 6L, 3L, 1L,
2L, 4L), xA_2 = c(6L, 4L, 2L, 1L, 3L, 5L), xB_1 = c(4L, 6L,
2L, 5L, 1L, 3L), xB_2 = c(5L, 3L, 2L, 4L, 1L, 6L)), .Names = c("v1",
"v2", "v3", "xA_1", "xA_2", "xB_1", "xB_2"), row.names = c(NA,
-6L), class = "data.frame")
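The matrix-indexing idea above can also be sketched without the as.numeric() round trip by indexing a numeric matrix built from the x columns only. This is a minimal sketch on the sample data above; the helper names x_cols, x_mat, target and multiplier are my own, not part of the original answer.

```r
# Sample data from the answer above (v2 kept as character here)
df <- data.frame(
  v1 = c(1L, 2L, 1L, 2L, 1L, 2L),
  v2 = c("A", "B", "B", "A", "B", "A"),
  v3 = c(5L, 4L, 1L, 6L, 3L, 2L),
  xA_1 = c(5L, 6L, 3L, 1L, 2L, 4L),
  xA_2 = c(6L, 4L, 2L, 1L, 3L, 5L),
  xB_1 = c(4L, 6L, 2L, 5L, 1L, 3L),
  xB_2 = c(5L, 3L, 2L, 4L, 1L, 6L)
)

# Numeric matrix holding only the candidate multiplier columns
x_cols <- grep("^x", names(df), value = TRUE)
x_mat  <- as.matrix(df[x_cols])

# Name of the column each row should use, e.g. "xA_1" for v2 = "A", v1 = 1
target <- paste0("x", df$v2, "_", df$v1)

# One (row, column) pair per row; rows without a matching column give NA
multiplier <- x_mat[cbind(seq_len(nrow(df)), match(target, x_cols))]
multiplier[is.na(multiplier)] <- 1   # the "else" case: multiply by 1

v4 <- df$v3 * multiplier
v4
# [1] 25 12  2  6  3 10
```

Because x_mat is numeric, no type conversion is needed even when v2 is a factor or character column.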

This is a standard "wide" table problem - what you want is harder to do as-is, but easy when the data is "melted":
dt = as.data.table(df)
melt(dt, id.vars = c('v1', 'v2', 'v3'))[variable == paste0('x', v2, '_', v1)
][dt, on = c('v1', 'v2', 'v3'), v3 * value]
#[1] 8 8 20 12 6 30

You can try this:
v4 <- c()
for(i in 1:nrow(df)){
col <- paste("x",df$v2[i],"_",df$v1[i],sep="")
v4 <- c(v4,df$v3[i]*df[i,col])
}
df$v4 <- v4
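The same row-by-row lookup could also be written with vapply(), which avoids growing v4 inside the loop. A rough sketch, assuming df is a plain data.frame (not a data.table) shaped like the sample data from the earlier answer:

```r
df <- data.frame(
  v1 = c(1L, 2L, 1L, 2L, 1L, 2L),
  v2 = c("A", "B", "B", "A", "B", "A"),
  v3 = c(5L, 4L, 1L, 6L, 3L, 2L),
  xA_1 = c(5L, 6L, 3L, 1L, 2L, 4L),
  xA_2 = c(6L, 4L, 2L, 1L, 3L, 5L),
  xB_1 = c(4L, 6L, 2L, 5L, 1L, 3L),
  xB_2 = c(5L, 3L, 2L, 4L, 1L, 6L)
)

# For each row, build the column name and multiply v3 by that column's value
df$v4 <- vapply(seq_len(nrow(df)), function(i) {
  col <- paste0("x", df$v2[i], "_", df$v1[i])
  as.numeric(df$v3[i] * df[[col]][i])
}, numeric(1))

df$v4
# [1] 25 12  2  6  3 10
```

This still visits every row, so for large data the vectorised matrix-indexing approach is likely to be faster.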

Related

Count values above 0 and count how many match a pattern in a row (in R)

I would like to count how many rows in each column are >0 and how many of those rows (that are >0) start with "mt-".
The result should also be in a data frame.
Here is an example.
df1
mt-abc 1 0 2
mt-dca 1 1 2
cla 0 2 0
dla 0 3 0
result
above0 2 3 2
mt 2 1 2
In base R you can do:
mat <- df[-1] > 0
rbind(above0 = colSums(mat),
mt = colSums(startsWith(df$V1, 'mt') & mat))
# V2 V3 V4
#above0 2 3 2
#mt 2 1 2
If the actual data has only numbers in the columns and the names in the row names, we can do:
mat <- df > 0
rbind(above0 = colSums(mat),
mt = colSums(startsWith(rownames(df), 'mt') & mat))
data
df <- structure(list(V1 = c("mt-abc", "mt-dca", "cla", "dla"), V2 = c(1L,
1L, 0L, 0L), V3 = 0:3, V4 = c(2L, 2L, 0L, 0L)), class = "data.frame",
row.names = c(NA, -4L))
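For the row-names variant, here is a small self-contained sketch. The data frame below is my own reconstruction of the example with the names moved into the row names, and it matches on the full "mt-" prefix from the question:

```r
# Example data with the names stored as row names (assumed layout)
df <- data.frame(
  V2 = c(1L, 1L, 0L, 0L),
  V3 = 0:3,
  V4 = c(2L, 2L, 0L, 0L),
  row.names = c("mt-abc", "mt-dca", "cla", "dla")
)

mat   <- df > 0                           # logical matrix: TRUE where value > 0
is_mt <- startsWith(rownames(df), "mt-")  # one flag per row

# is_mt is recycled down each column, so it lines up with the rows of mat
result <- as.data.frame(rbind(above0 = colSums(mat),
                              mt     = colSums(mat & is_mt)))
result
#        V2 V3 V4
# above0  2  3  2
# mt      2  1  2
```

Wrapping the rbind() in as.data.frame() gives the data frame the question asks for.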
I don't think this is the most elegant approach in the tidyverse, but just out of curiosity:
library(tidyverse)
my_df <- data.frame(
stringsAsFactors = FALSE,
var = c("mt-abc", "mt-dca", "cla", "dla"),
x = c(1L, 1L, 0L, 0L),
y = c(0L, 1L, 2L, 3L),
z = c(2L, 2L, 0L, 0L)
)
df_1 <- my_df %>%
summarize(across(.cols=x:z, .fns=~sum(.x > 0))) %>%
mutate(var="above0")
df_2 <- my_df %>%
filter(str_detect(var, "^mt")) %>%
summarise(across(.cols=x:z, .fns=~sum(.x > 0))) %>%
mutate(var="mt")
bind_rows(df_1, df_2)
#> x y z var
#> 1 2 3 2 above0
#> 2 2 1 2 mt
Created on 2020-12-04 by the reprex package (v0.3.0)

R loop/lapply, cumulative totals with group by

I am trying to create new variables in my data set that are cumulative totals which restart based on other variables (using group by)… I want these to be new columns in the data set and this is the part I am struggling with...
Using the data below, I want to create cumulative Sale and Profit columns that will restart for every Product and Product_Cat grouping.
The code below partly gives me what I need, but it overwrites the existing Sale/Profit columns instead of creating new variables. What am I getting wrong? I imagine this is simple, but I haven't found anything.
Note: I'm using lapply because my real data set has 40+ variables that I need to create calculations for.
DT <- setDT(Data)[,lapply(.SD, cumsum), by = .(Product,Product_Cat) ]
Data for example:
Product <- c('A','A','A','B','B','B','C','C','C')
Product_Cat <- c('S1','S1','S2','C1','C1','C1','D1','E1','F1')
Sale <- c(10,15,5,20,15,10,5,5,5)
Profit <- c(2,4,2,6,8,2,4,6,8)
Sale_Cum <- c(10,25,5,20,35,45,5,5,5)
Profit_Cum <- c(2,6,2,6,14,16,4,6,8)
Data <- data.frame(Product,Product_Cat,Sale,Profit)
Desired_Data <- data.frame(Product,Product_Cat,Sale,Profit,Sale_Cum,Profit_Cum)
This doesn't use the group by per se but I think it achieves what you're looking for in that it is easily extensible to many columns:
D2 <- data.frame(lapply(Data[,c(3,4)], cumsum))
names(D2) <- gsub("$", "_cum", names(Data[,c(3,4)]))
Data <- cbind(Data, D2)
If you have 40+ columns just change the c(3,4) to include all the columns you're after.
EDIT:
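I forgot that the OP wanted it to reset for each category. In that case, you can modify your original code (this relies on the rows of each group already being adjacent, as they are in the example, so that the row order matches):
DT <- setDT(Data)[, lapply(.SD, cumsum), by = .(Product, Product_Cat)]
names(DT)[c(-1,-2)] <- gsub("$", "_cum", names(Data)[c(-1,-2)])
cbind(Data, DT[, c(-1,-2), with = FALSE])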
library(data.table)
setDT(Data)
cols <- names(Data)[3:4]
Data[, paste0(cols, '_cumsum') := lapply(.SD, cumsum)
, by = .(Product, Product_Cat)
, .SDcols = cols]
Data:
structure(list(Product = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Product_Cat = structure(c(5L,
5L, 6L, 1L, 1L, 1L, 2L, 3L, 4L), .Label = c("C1", "D1", "E1",
"F1", "S1", "S2"), class = "factor"), Sale = c(10L, 15L, 5L,
20L, 15L, 10L, 5L, 5L, 5L), Profit = c(2L, 4L, 2L, 6L, 8L, 2L,
4L, 6L, 8L), Sale_Cum = c(10, 25, 5, 20, 35, 45, 5, 5, 5), Profit_Cum = c(2,
6, 2, 6, 14, 16, 4, 6, 8)), .Names = c("Product", "Product_Cat",
"Sale", "Profit", "Sale_Cum", "Profit_Cum"), row.names = c(NA,
-9L), class = "data.frame")
We can iteratively slice the data frame (called x here) by Product and Product_Cat and, for each slice, assign the output of cumsum() to Sale_Cum and Profit_Cum:
cols <- c('Sale', 'Profit')
for (column in cols){
x[, paste0(column, '_Cum')] <- 0
for(p in unique(x$Product)){
for (pc in unique(x$Product_Cat)){
x[x$Product == p & x$Product_Cat == pc, paste0(column, '_Cum')] <- cumsum(x[x$Product == p & x$Product_Cat == pc, column])
}
}
}
print(x)
# Product Product_Cat Sale Profit Sale_Cum Profit_Cum
# 1 A S1 10 2 10 2
# 2 A S1 15 4 25 6
# 3 A S2 5 2 5 2
# 4 B C1 20 6 20 6
# 5 B C1 15 8 35 14
# 6 B C1 10 2 45 16
# 7 C D1 5 4 5 4
# 8 C E1 5 6 5 6
# 9 C F1 5 8 5 8
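The nested loops above could probably be collapsed with base R's ave(), which applies a function within groups and returns the result in the original row order. A minimal sketch using the question's example data; the "_Cum" suffix follows the question's desired output:

```r
Data <- data.frame(
  Product     = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  Product_Cat = c("S1", "S1", "S2", "C1", "C1", "C1", "D1", "E1", "F1"),
  Sale        = c(10, 15, 5, 20, 15, 10, 5, 5, 5),
  Profit      = c(2, 4, 2, 6, 8, 2, 4, 6, 8)
)

cols <- c("Sale", "Profit")   # extend this vector for 40+ columns

# ave() runs cumsum() separately for each Product / Product_Cat combination
Data[paste0(cols, "_Cum")] <- lapply(Data[cols], function(v) {
  ave(v, Data$Product, Data$Product_Cat, FUN = cumsum)
})

Data$Sale_Cum
# [1] 10 25  5 20 35 45  5  5  5
Data$Profit_Cum
# [1]  2  6  2  6 14 16  4  6  8
```

Since ave() keeps the original row order, this does not depend on the groups being adjacent in the data.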
Here is some rather clumsy code that does everything step by step:
library(data.table)
#sample data
d<-sample(1:10)
f<-sample(1:10)
p<-c("f","f","f","f","q","q","q","w","w","w")
pc<-c("c","c","d","d","d","v","v","v","b","b")
cc<-data.table(p,pc,d,f)
#storing the values that are overwritten first.
three<-cc[,3]
four<- cc[,4]
#applying your function
dt<-setDT(cc)[,lapply(.SD,cumsum), by=.(p,pc)]
#binding the stored values to the result and renaming everything
#(d plays the role of sale, f the role of profit).
x<-cbind(dt,three,four)
colnames(x)[5]<-"sale"
colnames(x)[6]<-"profit"
colnames(x)[3]<-"CumSale"
colnames(x)[4]<-"CumProfit"
#reordering the columns
xx<-x[,c("p","pc","profit","sale","CumSale","CumProfit")]
xx

how to count and remove similar strings across columns

I have data with many columns. For example, this is a version with three columns:
df<-structure(list(V1 = structure(c(5L, 1L, 7L, 3L, 2L, 4L, 6L, 6L
), .Label = c("CPSIAAAIAAVNALHGR", "DLNYCFSGMSDHR", "FPEHELIVDPQR",
"IADPDAVKPDDWDEDAPSK", "LWADHGVQACFGR", "WGEAGAEYVVESTGVFTTMEK",
"YYVTIIDAPGHR"), class = "factor"), V2 = structure(c(5L, 2L,
7L, 3L, 4L, 6L, 1L, 1L), .Label = c("", "CPSIAAAIAAVNALHGR",
"GCITIIGGGDTATCCAK", "HVGPGVLSMANAGPNTNGSQFFICTIK", "LLELGPKPEVAQQTR",
"MVCCSAWSEDHPICNLFTCGFDR", "YYVTIIDAPGHR"), class = "factor"),
V3 = structure(c(4L, 3L, 2L, 4L, 3L, 1L, 1L, 1L), .Label = c("",
"AVCMLSNTTAIAEAWAR", "DLNYCFSGMSDHR", "FPEHELIVDPQR"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -8L))
- For the first column, we don't look at any other column; we just count how many strings there are and keep the unique ones.
- For the second column, we keep the unique strings and also remove those that were already in the first column.
- For the third column, we keep the unique strings and remove those that were in the first or second column.
This continues for as many columns as we have.
For example, for this data we will have the following:
Column 1               Column 2                     Column 3
LWADHGVQACFGR          LLELGPKPEVAQQTR              AVCMLSNTTAIAEAWAR
CPSIAAAIAAVNALHGR      GCITIIGGGDTATCCAK
YYVTIIDAPGHR           HVGPGVLSMANAGPNTNGSQFFICTIK
FPEHELIVDPQR           MVCCSAWSEDHPICNLFTCGFDR
DLNYCFSGMSDHR
IADPDAVKPDDWDEDAPSK
WGEAGAEYVVESTGVFTTMEK
Here is a solution via tidyverse,
library(tidyverse)
df1 <- df %>%
gather(var, string) %>%
filter(string != '' & !duplicated(string)) %>%
group_by(var) %>%
mutate(cnt = seq(n())) %>%
spread(var, string) %>%
select(-cnt)
Which gives (printed here before the final select(-cnt) drops the helper column):
# A tibble: 7 x 4
cnt V1 V2 V3
* <int> <chr> <chr> <chr>
1 1 LWADHGVQACFGR LLELGPKPEVAQQTR AVCMLSNTTAIAEAWAR
2 2 CPSIAAAIAAVNALHGR GCITIIGGGDTATCCAK <NA>
3 3 YYVTIIDAPGHR HVGPGVLSMANAGPNTNGSQFFICTIK <NA>
4 4 FPEHELIVDPQR MVCCSAWSEDHPICNLFTCGFDR <NA>
5 5 DLNYCFSGMSDHR <NA> <NA>
6 6 IADPDAVKPDDWDEDAPSK <NA> <NA>
7 7 WGEAGAEYVVESTGVFTTMEK <NA> <NA>
You can use colSums to get the number of strings,
colSums(!is.na(df1))
#V1 V2 V3
# 7 4 1
A similar approach via base R, that would save the strings in a list would be,
df[] <- lapply(df, as.character)
d1 <- stack(df)
d1 <- d1[d1$values != '' & !duplicated(d1$values),]
l1 <- unstack(d1, values ~ ind)
lengths(l1)
#V1 V2 V3
# 7 4 1
A base R solution. df2 is the final output.
# Convert to character
L1 <- lapply(df, as.character)
# Get unique string
L2 <- lapply(L1, unique)
# Remove ""
L3 <- lapply(L2, function(vec){vec <- vec[!(vec %in% "")]})
# Use for loop to remove non-unique string from previous columns
for (i in 2:length(L3)){
previous_vec <- unlist(L3[1:(i - 1)])
current_vec <- L3[[i]]
L3[[i]] <- current_vec[!(current_vec %in% previous_vec)]
}
# Get the maximum column length
max_num <- max(sapply(L3, length))
# Append "" to each column
L4 <- lapply(L3, function(vec){vec <- c(vec, rep("", max_num - length(vec)))})
# Convert L4 to a data frame
df2 <- as.data.frame(do.call(cbind, L4))
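The "remove anything already seen in an earlier column" rule can also be sketched with a running vector of seen strings. The toy data below is hypothetical (short letters instead of the peptide strings), and the names seen and result are my own:

```r
# Hypothetical toy data: "" marks an empty cell, as in the question
df <- data.frame(
  V1 = c("a", "b", "c", "c"),
  V2 = c("b", "d", "", ""),
  V3 = c("a", "e", "d", ""),
  stringsAsFactors = FALSE
)

seen   <- character(0)                   # strings kept so far
result <- vector("list", ncol(df))
names(result) <- names(df)

for (nm in names(df)) {
  vals <- unique(df[[nm]])                         # unique within the column
  vals <- vals[vals != "" & !(vals %in% seen)]     # drop blanks and earlier strings
  result[[nm]] <- vals
  seen <- c(seen, vals)
}

lengths(result)
# V1 V2 V3
#  3  1  1
```

Keeping the result as a list sidesteps padding; the columns can be padded with "" or NA afterwards if a rectangular data frame is needed. If the real columns are factors, convert them with as.character() first.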

delete the rows with duplicated ids

I want to delete the rows with duplicated ids
data
id V1 V2
1 a 1
1 b 2
2 a 2
2 c 3
3 a 4
The problem is that some people took the test several times, which generates multiple scores in V2. I want to delete the duplicated ids and retain one of the V2 scores at random.
output
id V1 V2
1 a 1
2 a 2
3 a 4
I tried this:
neu <- unique(neu$userid)
but it didn't work
Using dplyr:
library(dplyr)
set.seed(1)
df %>% sample_frac(., 1) %>% arrange(id) %>% distinct(id, .keep_all = TRUE)
Output:
id V1 V2
1 1 b 2
2 2 c 3
3 3 a 4
Data:
df <- structure(list(id = c(1L, 1L, 2L, 2L, 3L), V1 = structure(c(1L,
2L, 1L, 3L, 1L), .Label = c("a", "b", "c"), class = "factor"),
V2 = c(1L, 2L, 2L, 3L, 4L)), .Names = c("id", "V1", "V2"), class = "data.frame", row.names = c(NA,
-5L))
Creating the data frame based on your example:
df <- read.table(text =
"id V1 V2
1 a 1
1 b 2
2 a 2
2 c 3
3 a 4", h = T)
Since you want to remove rows randomly, first sort the rows of your data frame randomly:
df <- df[sample(nrow(df)),]
Then remove duplicates in order of appearance:
df <- df[!duplicated(df$id),]
Now sort your data frame back:
df <- df[with(df, order(id)),]
Remember to replace df with the name of your data frame.
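An alternative that picks one random row index per id directly, without shuffling and re-sorting, might look like the sketch below. It uses sample.int() rather than sample() on purpose: sample(x, 1) treats a single number x as the range 1:x, which would go wrong for ids that occur only once. The names keep and dedup are my own.

```r
df <- data.frame(
  id = c(1L, 1L, 2L, 2L, 3L),
  V1 = c("a", "b", "a", "c", "a"),
  V2 = c(1L, 2L, 2L, 3L, 4L)
)

set.seed(1)  # only to make the random choice repeatable

# For each id, choose one of its row numbers at random
keep <- vapply(split(seq_len(nrow(df)), df$id), function(rows) {
  rows[sample.int(length(rows), 1)]
}, integer(1))

dedup <- df[keep, ]
dedup
```

The result has one row per id, already ordered by id because split() orders the groups.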

Data roll up along with transpose

I want to roll up the data at the unique customer ID level, with each observation transposed against it, as shown below.
Below is a snapshot of my data:
basedata <- structure(list(customer = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L), .Label = c("a", "b", "d"), class = "factor"), obs = c(12L,
11L, 12L, 10L, 3L, 5L, 7L, 8L, 1L)), .Names = c("customer", "obs"
), class = "data.frame", row.names = c(NA, -9L))
Or
customer obs
a 12
a 11
a 12
a 10
b 3
b 5
b 7
d 8
d 1
I want to convert it in the following form
customer obs1 obs2 obs3 obs4
a 12 11 12 10
b 3 5 7 -
d 8 1 - -
I used the following code
basedata$shopping <- unlist(tapply(rawdata$customer, rawdata$customer,
function (x) seq(1, len = length(x))))
reshape(basedata, idvar = "customer", direction = "wide")
It gives the following error
Error in `[.data.frame`(data, , timevar) : undefined columns selected
How can I do it in R and excel?
Thank You
x <- structure(list(customer = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L), .Label = c("a", "b", "d"), class = "factor"), obs = c(12L,
11L, 12L, 10L, 3L, 5L, 7L, 8L, 1L)), .Names = c("customer", "obs"
), class = "data.frame", row.names = c(NA, -9L))
I chose to use a couple of extra packages (plyr and reshape2) because I find them easier and more general to use than reshape from the base package.
library(plyr)
library(reshape2)
## add observation number
x2 <- ddply(x,"customer",transform,num=1:length(customer))
## reshape
dcast(x2,customer~num,value.var="obs")
A base R way, assuming dat is the data,
> s <- split(dat$obs, dat$customer)
> df <- data.frame(do.call(rbind, lapply(s, function(x){ length(x) <- 4; x })))
> names(df) <- paste0('obs', seq(df))
> df
# obs1 obs2 obs3 obs4
# a 12 11 12 10
# b 3 5 7 NA
# d 8 1 NA NA
If you want the unique customer ID to be a column,
> df2 <- cbind(customer = rownames(df), df)
> rownames(df2) <- seq(nrow(df2))
> df2
# customer obs1 obs2 obs3 obs4
# 1 a 12 11 12 10
# 2 b 3 5 7 NA
# 3 d 8 1 NA NA
I'm assuming that "basedata" and "rawdata" are supposed to be the same (or at least copies of each other). If that's the case, you're simply missing specifying what the timevar argument for reshape should be.
Continuing from where you left off:
rawdata$shopping <- unlist(tapply(rawdata$customer, rawdata$customer,
function (x) seq(1, len = length(x))))
## rawdata$shopping <- with(rawdata, ave(obs, customer, FUN = seq_along))
Here's the actual reshaping step:
reshape(rawdata, idvar = "customer", timevar="shopping", direction = "wide")
# customer obs.1 obs.2 obs.3 obs.4
# 1 a 12 11 12 10
# 5 b 3 5 7 NA
# 8 d 8 1 NA NA
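Putting the two steps together, a self-contained version of the reshape() approach could look like this. It builds the per-customer counter with ave() on the integer obs column, and the data frame is rebuilt from the question's values:

```r
rawdata <- data.frame(
  customer = c("a", "a", "a", "a", "b", "b", "b", "d", "d"),
  obs      = c(12L, 11L, 12L, 10L, 3L, 5L, 7L, 8L, 1L)
)

# Running observation number within each customer: 1, 2, 3, ...
rawdata$shopping <- ave(rawdata$obs, rawdata$customer, FUN = seq_along)

# timevar tells reshape() which column distinguishes the repeated observations
wide <- reshape(rawdata, idvar = "customer", timevar = "shopping",
                direction = "wide")
wide
#   customer obs.1 obs.2 obs.3 obs.4
# 1        a    12    11    12    10
# 5        b     3     5     7    NA
# 8        d     8     1    NA    NA
```

Customers with fewer observations are padded with NA, which corresponds to the "-" placeholders in the desired output.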
