I am trying to do a row sum over the actuals columns. However, for certain observations I would like to include the values up to the quarter given in the UpTo column. Here is the data frame:
dat <- structure(list(Company = c("ABC", "DEF", "XYZ"), UpTo = c(NA,
"Q2", "Q3"), Actual.Q1 = c(100L, 80L, 100L), Actual.Q2 = c(50L,
75L, 50L), Forecast.Q3 = c(80L, 50L, 80L), Forecast.Q4 = c(90L,
80L, 100L)), .Names = c("Company", "UpTo", "Actual.Q1", "Actual.Q2",
"Forecast.Q3", "Forecast.Q4"), class = "data.frame", row.names = c("1",
"2", "3"))
Company UpTo Actual.Q1 Actual.Q2 Forecast.Q3 Forecast.Q4
1 ABC NA 100 50 80 90
2 DEF Q2 80 75 50 80
3 XYZ Q3 100 50 80 100
For company ABC, since there is no UpTo date, it will just be Actual.Q1 + Actual.Q2, which is 150.
For company DEF, since the UpTo date is Q2, it will be Actual.Q1 + Actual.Q2, which is 155.
For company XYZ, since the UpTo date is Q3, it will be Actual.Q1 + Actual.Q2 + Forecast.Q3, which is 230.
The resulting data frame would look like this:
Company UpTo Actual.Q1 Actual.Q2 Forecast.Q3 Forecast.Q4 SumRecent
1 ABC NA 100 50 80 90 150
2 DEF Q2 80 75 50 80 155
3 XYZ Q3 100 50 80 100 230
I have tried to use the rowSums function. However, it does not take the UpTo variable into account. Any help is appreciated. Thanks!
Here is a possibility:
dat$SumRecent <- sapply(1:nrow(dat), function(x) {
  last <- grep(dat[x, 2], colnames(dat))[1]    # index of the column matching the UpTo value (NA if none)
  sum(dat[x, 3:ifelse(is.na(last), 4, last)])  # no match / UpTo is NA: sum columns 3:4 only
})
# Company UpTo Actual.Q1 Actual.Q2 Forecast.Q3 Forecast.Q4 SumRecent
# 1 ABC <NA> 100 50 80 90 150
# 2 DEF Q2 80 75 50 80 155
# 3 XYZ Q3 100 50 80 100 230
With grep we look for a match of the value in the UpTo column (dat[x, 2]) among the column names of dat (colnames(dat)). If we find it, we sum up to that column; if we don't find it (or UpTo is NA), we just sum the values in columns 3 and 4.
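To see what the grep lookup does for a single row, take row 3 (XYZ, UpTo = "Q3"):
grep(dat[3, 2], colnames(dat))[1]
# [1] 5    ("Forecast.Q3" is the 5th column, so we sum columns 3:5)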
We can use binary weighted row sums.
UpTo <- as.character(dat$UpTo) ## in case you have factor column
UpTo[is.na(UpTo)] <- "Q2" ## replace `NA` with "Q2"
w <- outer(as.integer(substr(UpTo, 2, 2)), 1:4, ">=")
# [,1] [,2] [,3] [,4]
#[1,] TRUE TRUE FALSE FALSE
#[2,] TRUE TRUE FALSE FALSE
#[3,] TRUE TRUE TRUE FALSE
This is a logical matrix, but that does not affect the arithmetic, as TRUE is treated as 1 and FALSE as 0. Then we do the weighted row sums:
X <- data.matrix(dat[3:6])
dat$SumRecent <- rowSums(X * w)
# Company UpTo Actual.Q1 Actual.Q2 Forecast.Q3 Forecast.Q4 SumRecent
#1 ABC <NA> 100 50 80 90 150
#2 DEF Q2 80 75 50 80 155
#3 XYZ Q3 100 50 80 100 230
The advantage of this approach is its speed / efficiency, as it is fully vectorized. You can refer to the benchmark result in Fast way to create a binary matrix with known number of 1 each row in R.
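If speed matters, here is a rough sketch of how the vectorised approach scales, simply replicating the three example rows 100,000 times (the replication factor is arbitrary):
big <- dat[rep(1:3, 1e5), ]                      # 300,000 rows built from the example data
UpTo_big <- as.character(big$UpTo)
UpTo_big[is.na(UpTo_big)] <- "Q2"
w_big <- outer(as.integer(substr(UpTo_big, 2, 2)), 1:4, ">=")
system.time(rowSums(data.matrix(big[3:6]) * w_big))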
This should also work:
dat$UpTo <- as.character(dat$UpTo)
dat$SumRecent <- apply(dat, 1, function(x) ifelse(is.na(x[2]), sum(as.integer(x[3:4])),
                       sum(as.integer(x[3:(grep(x[2], names(dat)))]))))
dat
# Company UpTo Actual.Q1 Actual.Q2 Forecast.Q3 Forecast.Q4 SumRecent
#1 ABC <NA> 100 50 80 90 150
#2 DEF Q2 80 75 50 80 155
#3 XYZ Q3 100 50 80 100 230
Another approach, using data.table:
require(data.table)
dat <- fread('Company UpTo Actual.Q1 Actual.Q2 Forecast.Q3 Forecast.Q4
ABC NA 100 50 80 90
DEF Q2 80 75 50 80
XYZ Q3 100 50 80 100')
dat[, SumRecent := ifelse(is.na(UpTo), Actual.Q1 + Actual.Q2,
                          sum(.SD[, grepl(paste0("Q[1-", substring(UpTo, 2), "]$"),
                                          names(.SD)), with = FALSE])),
    by = Company]
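To see what the column-selecting pattern expands to, e.g. for UpTo = "Q3" (inside the call it is matched against names(.SD), i.e. the non-grouping columns):
paste0("Q[1-", substring("Q3", 2), "]$")
# [1] "Q[1-3]$"
grepl("Q[1-3]$", names(dat))
# [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE    (picks Actual.Q1, Actual.Q2, Forecast.Q3)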
I have a list of about a thousand data frames. All data frames have a column Z, which consists mostly of NA values, but whenever there is an actual value, it contains either "VALUE1" or "VALUE2". For example:
weight | height | Z
---------------------------
62 100 NA
65 89 NA
59 88 randomnumbersVALUE1randomtext
66 92 NA
64 90 NA
64 87 randomnumbersVALUE2randomtext
57 84 NA
The first actual value in column Z of each data frame should always contain "VALUE1", so in the example data frame above everything is as it should be. However, if the data frame looked like this:
weight | height | Z
---------------------------
62 100 NA
65 89 NA
59 88 randomnumbersVALUE2randomtext
66 92 NA
64 90 NA
64 87 randomnumbersVALUE1randomtext
57 84 NA
I would need to add a new row to the beginning of the data frame, with "VALUE1" in the Z column and 0 in the height and weight columns. How could I do this for my list of data frames (with the help of functions such as add_row and filter)?
If dfs is your list of data frames, you can do this:
dfs = lapply(dfs, function(x) {
  # look at the first non-NA value of Z; if it contains "VALUE2",
  # prepend a weight = 0, height = 0, Z = "VALUE1" row
  if (grepl("VALUE2", x[!is.na(x$Z), "Z"][1])) {
    rbind(data.frame(weight = 0, height = 0, Z = "VALUE1"), x)
  } else x
})
A dplyr alternative, applied here to the single data frame dat (a sketch for running it over the whole list follows the Data block below):
library(dplyr)
dat %>%
filter(!is.na(Z)) %>%
slice(1) %>%
mutate(across(weight:height, ~ 0)) %>%
filter(!grepl("VALUE1", Z)) %>%
mutate(Z = "VALUE1") %>%
bind_rows(., dat)
# weight height Z
# 1 0 0 VALUE1
# 2 62 100 <NA>
# 3 65 89 <NA>
# 4 59 88 randomnumbersVALUE2randomtext
# 5 66 92 <NA>
# 6 64 90 <NA>
# 7 64 87 randomnumbersVALUE1randomtext
# 8 57 84 <NA>
Data
dat <- structure(list(weight = c(62L, 65L, 59L, 66L, 64L, 64L, 57L), height = c(100L, 89L, 88L, 92L, 90L, 87L, 84L), Z = c(NA, NA, "randomnumbersVALUE2randomtext", NA, NA, "randomnumbersVALUE1randomtext", NA)), class = "data.frame", row.names = c(NA, -7L))
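Since the question is about a list of data frames, the pipeline above can be wrapped in a function and applied with lapply. A minimal sketch, assuming the list is called dfs (as in the first answer) and every element has the same weight, height and Z columns:
fix_first_value <- function(dat) {
  dat %>%
    filter(!is.na(Z)) %>%
    slice(1) %>%
    mutate(across(weight:height, ~ 0)) %>%
    filter(!grepl("VALUE1", Z)) %>%   # keep this row only if a fix is actually needed
    mutate(Z = "VALUE1") %>%
    bind_rows(., dat)                 # zero rows here simply returns dat unchanged
}
dfs <- lapply(dfs, fix_first_value)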
I have something like this:
  A  B  C
100 24
    18
    16
    21
    14
I am trying to write a function that calculates C = A - B for each row, then uses C + 20 as A for the next row, and repeats this step down the rows. At the end it should look like this:
  A  B  C
100 24 76
 96 18 78
 98 16 82
102 21 81
101 14 87
At the moment I am doing it manually, like df$C[1] = df$A[1] - df$B[1] and then
df$A[2] = df$C[1] + 20, repeating this for every row.
I would like to create a function instead of doing it this way. Any help would be appreciated.
Here is an approach using a for loop:
data
df <- data.frame(A=NA, B = c(24L, 18L, 16L, 21L, 14L),C=NA)
Initialize first row of df
df$A[1] <- 100
df$C[1] <- df$A[1]-df$B[1]
Populate the remaining rows of df
for (i in 1:(length(df$B)-1)){
df$C[i+1] <- df$C[i]-df$B[i+1]+20
df$A[i+1] <- df$C[i]+20
}
Output
df
A B C
1 100 24 76
2 96 18 78
3 98 16 82
4 102 21 81
5 101 14 87
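Since the question asks for a function, the loop can also be wrapped up for reuse. A minimal sketch (the name fill_AC and the start_A argument are just illustrative):
fill_AC <- function(B, start_A = 100) {
  n <- length(B)
  A <- numeric(n)
  C <- numeric(n)
  A[1] <- start_A
  C[1] <- A[1] - B[1]
  for (i in seq_len(n - 1)) {
    A[i + 1] <- C[i] + 20            # next A is the previous C plus 20
    C[i + 1] <- A[i + 1] - B[i + 1]
  }
  data.frame(A = A, B = B, C = C)
}
fill_AC(c(24L, 18L, 16L, 21L, 14L))  # reproduces the A, B, C table shown above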
We can start with only the B column and then calculate A and C from it.
start_value <- 100
df$A <- c(start_value, start_value - cumsum(df$B) + 20 * 1:nrow(df))[-(nrow(df) + 1)]
df$C <- df$A - df$B
df
# B A C
#1 24 100 76
#2 18 96 78
#3 16 98 82
#4 21 102 81
#5 14 101 87
data
df <- structure(list(B = c(24L, 18L, 16L, 21L, 14L)),
class = "data.frame", row.names = c(NA, -5L))
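Why this works: the recurrence A[k+1] = A[k] - B[k] + 20 telescopes to A[k+1] = A[1] - (B[1] + ... + B[k]) + 20*k, which is exactly what the cumsum line builds (it computes one element too many and drops it with [-(nrow(df) + 1)]). After running the lines above you can check:
all.equal(df$A[-1],
          100 - cumsum(df$B)[-nrow(df)] + 20 * seq_len(nrow(df) - 1))
# [1] TRUE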
I have a table in R like this one:
id v1 v2 v3
1 115 116 150
2 47 50 55
3 70 77 77
I would like to calculate the ratio between v2 and v1 as (v2/v1)-1, between v3 and v2 as (v3/v2)-1, and so on (I have around 55 variables), and need to get values like this:
id v1 v2 v3 rat1 rat2
1 115 116 150 0.01 0.29
2 47 50 55 0.06 0.10
3 70 77 77 0.10 0.00
Is there a workaround so I don't have to code each pair independently?
Thx!
It's essentially a loop over column i and column i+1, which you could write as a for loop. Or, in R speak, use a vectorised function like Map/mapply:
vars <- paste0("v",1:3)
outs <- paste0("rat",1:2)
dat[outs] <- mapply(`/`, dat[vars[-1]], dat[vars[-length(vars)]]) - 1
dat
# id v1 v2 v3 rat1 rat2
#1 1 115 116 150 0.008695652 0.2931034
#2 2 47 50 55 0.063829787 0.1000000
#3 3 70 77 77 0.100000000 0.0000000
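Since the question mentions roughly 55 columns, the same Map/mapply pattern scales by building the name vectors programmatically rather than typing them out. A sketch, assuming the value columns really are named v1, v2, ... and appear in order:
vars <- grep("^v[0-9]+$", names(dat), value = TRUE)   # "v1" "v2" "v3" ... in column order
outs <- paste0("rat", seq_len(length(vars) - 1))
dat[outs] <- mapply(`/`, dat[vars[-1]], dat[vars[-length(vars)]]) - 1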
As we remove an equal number of columns from the beginning and the end (the 'id' column is dropped in both cases), the two subsets have the same dimensions, so we can divide them directly:
dat[paste0("rat", 1:2)] <- dat[-(1:2)]/dat[-c(1, ncol(dat))] - 1
data
dat <- structure(list(id = 1:3, v1 = c(115L, 47L, 70L), v2 = c(116L,
50L, 77L), v3 = c(150L, 55L, 77L)), class = "data.frame", row.names = c(NA,
-3L))
I'm trying to create a new column, presumably using mutate, that will identify whether the row meets a few criteria. Basically, for each user, I want to identify the final row (by Time) for each applicable DataCode. Only some DataCodes are applicable (1000 and 2000 in the example below); others should return NA (3000 here). I've been trying to work this through in my head, and all I can think of is a really long mutate call with a number of if statements. Is there a more elegant way?
The IsFinal column below shows what the result should look like.
User Time DataCode Data IsFinal
101 10 1000 50 0
101 20 2000 300 1
101 30 3000 150 NA
101 40 1000 250 1
101 50 3000 300 NA
102 10 2000 50 0
102 20 1000 150 0
102 30 1000 150 0
102 40 2000 350 1
102 50 3000 150 NA
102 60 1000 50 1
This does what you need using merge and the dplyr package:
library(dplyr)
new.tab <- query.tab %>%
group_by(User, DataCode) %>%
arrange(Time) %>%
filter(DataCode != 3000) %>%
mutate(IsFinal = ifelse(row_number()==n(),1,0))
fin.tab <- merge(new.tab, query.tab, all.x = FALSE, all.y = TRUE)
If you want to do everything inside dplyr then this is your answer:
fin.tab <-
query.tab %>%
group_by(User, DataCode) %>%
arrange(User,Time) %>%
mutate(IsFinal = ifelse(DataCode == 3000 , NA,
ifelse(row_number()==n(),1,0)))
Both of these solutions will give:
> fin.tab
# User Time DataCode Data IsFinal
# 1 101 10 1000 50 0
# 2 101 20 2000 300 1
# 3 101 30 3000 150 NA
# 4 101 40 1000 250 1
# 5 101 50 3000 300 NA
# 6 102 10 2000 50 0
# 7 102 20 1000 150 0
# 8 102 30 1000 150 0
# 9 102 40 2000 350 1
# 10 102 50 3000 150 NA
# 11 102 60 1000 50 1
Data:
query.tab <- structure(list(User = c(101L, 101L, 101L, 101L, 101L, 102L, 102L,
102L, 102L, 102L, 102L), Time = c(10L, 20L, 30L, 40L, 50L, 10L,
20L, 30L, 40L, 50L, 60L), DataCode = c(1000L, 2000L, 3000L, 1000L,
3000L, 2000L, 1000L, 1000L, 2000L, 3000L, 1000L), Data = c(50L,
300L, 150L, 250L, 300L, 50L, 150L, 150L, 350L, 150L, 50L)), .Names = c("User",
"Time", "DataCode", "Data"), row.names = c(NA, -11L), class = "data.frame")
Is it feasible for you to make a vector of the approved codes? That would make the if statement much simpler.
# Can you obtain list of viable codes?
codes <- c("2000", "1000")
# Can you put them in order?
goodcodes <- codes[order(codes)]
# last item in ordered goodcodes should be the end code
endcode <- goodcodes[length(goodcodes)]
testcodes <- c("0500", "1000", "2000", "3000")
n <- length(testcodes)
IsFinal <- rep(0, n)
for (i in 1:n) {
  if (testcodes[i] %in% goodcodes) {
    if (testcodes[i] == endcode) IsFinal[i] <- 1
  } else {
    IsFinal[i] <- NA
  }
}
> IsFinal
[1] NA 0 1 NA
In base R, we can use ave along with duplicated and its fromLast argument to get the binary values, then replace the desired values with NA. This uses the data query.tab from masoud's answer above.
# get binary values for final DataCode by user
query.tab$IsFinal <- with(query.tab,
ave(DataCode, User, FUN=function(x) !duplicated(x, fromLast=TRUE)))
# Fill in NA values
is.na(query.tab$IsFinal) <- query.tab$DataCode %in% c(3000)
This returns
query.tab
User Time DataCode Data IsFinal
1 101 10 1000 50 0
2 101 20 2000 300 1
3 101 30 3000 150 NA
4 101 40 1000 250 1
5 101 50 3000 300 NA
6 102 10 2000 50 0
7 102 20 1000 150 0
8 102 30 1000 150 0
9 102 40 2000 350 1
10 102 50 3000 150 NA
11 102 60 1000 50 1
Note that this assumes that the data is ordered by user-time. This can be achieved with a call to order prior to using the code above.
query.tab <- query.tab[order(query.tab$User, query.tab$Time),]
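To see why !duplicated(x, fromLast = TRUE) flags the last occurrence of each value within a user, here it is on user 101's codes:
x <- c(1000, 2000, 3000, 1000, 3000)   # DataCode for user 101, in Time order
duplicated(x, fromLast = TRUE)         # TRUE where the same value occurs again later
# [1]  TRUE FALSE  TRUE FALSE FALSE
!duplicated(x, fromLast = TRUE)        # TRUE only at the last occurrence of each value
# [1] FALSE  TRUE FALSE  TRUE  TRUE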
How can I count frequencies occurring in two columns?
Sample data:
> sample <- dput(df)
structure(list(Nom_xp = c("A1FAA", "A1FAJ", "A1FBB", "A1FJA",
"A1FJR", "A1FRJ"), GB05.x = c(100L, 98L, NA, 100L, 102L, 98L),
GB05.1.x = c(100L, 106L, NA, 100L, 102L, 98L), GB18.x = c(175L,
173L, 177L, 177L, 173L, 177L), GB18.1.x = c(177L, 175L, 177L,
177L, 177L, 177L)), .Names = c("Nom_xp", "GB05.x", "GB05.1.x",
"GB18.x", "GB18.1.x"), row.names = c(NA, 6L), class = "data.frame")
Counting frequencies:
apply(sample[,2:5],2,table)
Now, how can I combine the counts by column-name prefix, or by every two columns? The expected output for the first four columns would be a list:
$GB05
98 100 102 106
3 4 2 1
$GB18
173 175 177
2 2 8
One way to get the count for the first two columns:
table(c(apply(sample[,2:3],2,rbind)))
98 100 102 106
3 4 2 1
But how can I apply this to the whole data.frame?
If you want to apply table to your whole data frame, you can use:
table(unlist(sample[,-1]))
Which gives:
98 100 102 106 173 175 177
3 4 2 1 2 2 8
If you want to group by column name prefix, for example the first 4 characters, you can do something like this:
cols <- names(sample)[-1]
groups <- unique(substr(cols,0,4))
sapply(groups, function(name) table(unlist(sample[,grepl(paste0("^",name),names(sample))])))
Which gives:
$GB05
98 100 102 106
3 4 2 1
$GB18
173 175 177
2 2 8
I would've said juba's answer was correct, but given you're looking for something else, perhaps it's this?
library(reshape2)
x <- melt( sample[ , 2:5 ] )
table( x[ , c( 'variable' , 'value' ) ] )
which gives
value
variable 98 100 102 106 173 175 177
GB05.x 2 2 1 0 0 0 0
GB05.1.x 1 2 1 1 0 0 0
GB18.x 0 0 0 0 2 1 3
GB18.1.x 0 0 0 0 0 1 5
Please provide an example of your desired output structure :)
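If the grouped counts (rather than the per-column cross-tab) are what is wanted, the rows of this table can also be collapsed by prefix. A small sketch building on the x object above, using base rowsum:
tab <- table(x[, c("variable", "value")])
rowsum(unclass(tab), substr(rownames(tab), 1, 4))   # sum rows sharing the same 4-character prefix
#      98 100 102 106 173 175 177
# GB05  3   4   2   1   0   0   0
# GB18  0   0   0   0   2   2   8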
Here is another answer that is sort of a hybrid between Anthony's answer and juba's answer.
The first step is to convert the data.frame into a "long" data.frame. I generally use stack when I can, but you can also do library(reshape2); df2 <- melt(sample) to get output similar to my df2 object below.
df2 <- data.frame(sample[1], stack(sample[-1]))
head(df2)
# Nom_xp values ind
# 1 A1FAA 100 GB05.x
# 2 A1FAJ 98 GB05.x
# 3 A1FBB NA GB05.x
# 4 A1FJA 100 GB05.x
# 5 A1FJR 102 GB05.x
# 6 A1FRJ 98 GB05.x
Next, we need to collapse ind to its column-name prefix. juba did that with substr, but I've done it here with gsub and a regular expression. We don't need to add it to our data.frame; we can compute it directly inside our other functions. The two functions that immediately come to mind are by and tapply, and both give you the output you are looking for.
by(df2$values,
list(ind = gsub("([A-Z0-9]+)\\..*", "\\1", df2$ind)),
FUN=table)
# ind: GB05
#
# 98 100 102 106
# 3 4 2 1
# ------------------------------------------------------------------------------
# ind: GB18
#
# 173 175 177
# 2 2 8
tapply(df2$values, gsub("([A-Z0-9]+)\\..*", "\\1", df2$ind), FUN = table)
# $GB05
#
# 98 100 102 106
# 3 4 2 1
#
# $GB18
#
# 173 175 177
# 2 2 8
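For reference, this is the grouping key that the gsub call builds from the stacked column names:
gsub("([A-Z0-9]+)\\..*", "\\1", levels(df2$ind))
# [1] "GB05" "GB05" "GB18" "GB18"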