I have a dataset like the one shown below:
library(tidyverse)
dat <- data.frame(col.1 = 1:16,
col.2 = c("B", "B", "B", "B", "B", "B", "A", "B",
"A", "A", "B", "A", "A", "A", "A", "A"),
col.3 = c(30, 60, 75, 105, 40, 80, -20, 60, -20, -60, 40,
-40,-105,-20,-20,-45),
col.4 = c(39.34775, 31.66806, 28.57107, 28.43085, 29.30417, 36.21187,
40.29794, 40.70641, 65.85152, 66.85943, 69.26766, 67.24402,
74.85330, 79.17230, 78.75405, 64.47038))
dat
I'm trying to compute a final column, col.Final, which should look like this:
dat.2 <- dat %>%
mutate(col.Final = c(1180.43, 1900.08, 2142.83, 2985.24, 1172.17,
2896.95, -629.63, 2442.38, -655.37, -1966.11,
2770.71, -1460.48, -3833.76, -730.24, -730.24,
-1643.04))
So far, I have used the mutate() function to get to this point:
dat.1 <- dat %>%
mutate(col.5 = col.3*col.4) %>%
mutate(col.6 = cumsum(col.3)) %>%
mutate(col.7 = if_else(col.2 == 'B', col.6, col.6 - col.3),
col.8 = col.3/col.7)
When I try to compute the final column, I don't get the same results:
dat.1 %>%
mutate(col.9 = if_else(col.2 == 'A', col.8*lag(cumsum(col.5)), col.5))
Note: the same calculation was done successfully using Excel's SUMIFS() function. I'm trying to get the same results with R instead.
I have looked at the Q&A for similar posts but am still stuck on the final calculation. In Excel, it felt as if the calculation iterated row by row: one condition was evaluated, and its result fed into the next. I'm not sure exactly what Excel did, but I think the same should be possible in R; I just can't figure out how.
Any help would be appreciated.
Update:
Values for col.5 and col.8 corresponding to col.2:
col.2 = c("B", "B", "B", "B", "B", "B", "A", "B",
"A", "A", "B", "A", "A", "A", "A", "A")
col.5 <- c(1180.4325, 1900.0836, 2142.8302, 2985.2393, 1172.1668,
2896.9496, -805.9588, 2442.3846, -1317.0304, -4011.5658,
2770.7064, -2689.7608, -7859.5965, -1583.4460, -1575.0810,
-2901.1671)
col.8 <-c(1.00000000, 0.66666667, 0.45454545, 0.38888889, 0.12903226,
0.20512821, -0.05128205, 0.13953488, -0.04651163, -0.14634146,
0.10256410,-0.10256410, -0.30000000, -0.08163265, -0.08888889,
-0.21951220)
Verifying the values by hand, using col.5 and col.8:
for "B" from the top:
1180.43 + 1900.08 + 2142.83 + 2985.24 + 1172.17 + 2896.95 = 12277.7020
for "A" after:
12277.7020 x -0.05128205 = -629.6266509 .... the 1st desired value for A
for "B" after:
12277.7020 - 629.6266509 = 11648.07535
11648.07535 + 2442.3846 = 14090.45995
for "A" after:
14090.45995 x -0.04651163 = -655.37026 ... 2nd desired Value for A
for "A" after:
14090.45995 - 655.37026 = 13435.08969
13435.08969 x -0.14634146 = -1966.110641 ... 3rd desired value for A
and so on....
I hope this explains it.
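One way to read the hand calculation above is as a running total: each "B" row contributes its col.5 value, each "A" row contributes col.8 times the total accumulated so far, and that contribution is then added to the total. The attempt with lag(cumsum(col.5)) accumulates the raw col.5 values of earlier "A" rows rather than their adjusted values, which would explain the mismatch. A minimal base R sketch under that reading follows. The names running and col.Final are illustrative, and the loop is an assumption about what the Excel sheet does, checked only against the values listed in the question.

```r
dat <- data.frame(
  col.2 = c("B", "B", "B", "B", "B", "B", "A", "B",
            "A", "A", "B", "A", "A", "A", "A", "A"),
  col.3 = c(30, 60, 75, 105, 40, 80, -20, 60, -20, -60, 40,
            -40, -105, -20, -20, -45),
  col.4 = c(39.34775, 31.66806, 28.57107, 28.43085, 29.30417, 36.21187,
            40.29794, 40.70641, 65.85152, 66.85943, 69.26766, 67.24402,
            74.85330, 79.17230, 78.75405, 64.47038)
)

# Intermediate columns, as in the question
col.5 <- dat$col.3 * dat$col.4
col.6 <- cumsum(dat$col.3)
col.7 <- ifelse(dat$col.2 == "B", col.6, col.6 - dat$col.3)
col.8 <- dat$col.3 / col.7

# Assumed recurrence: each row's result depends on the results before it
col.Final <- numeric(nrow(dat))
running <- 0  # sum of all col.Final values computed so far
for (i in seq_len(nrow(dat))) {
  if (dat$col.2[i] == "B") {
    col.Final[i] <- col.5[i]
  } else {
    col.Final[i] <- col.8[i] * running
  }
  running <- running + col.Final[i]
}

dat$col.Final <- round(col.Final, 2)
dat
```

Because each "A" value depends on the previously computed results, a plain cumsum() cannot express it. An explicit loop (or Reduce() with accumulate = TRUE) keeps that dependency visible.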
I have a massive data file that I am breaking down into day blocks by person. I then want to plot the events that occurred during each day and the duration of those events (either A, B or C).
The data is structured as below: intervals is the time between rows and period is the event variable. This example is for one individual for one day (the actual data is x days × x persons).
intervals <- c(0,5.1166667,6.2166667,3.5166667,0.06666667,3.0666667,6.3,
2.3833333,0.06666667,4.7,18.666667,17.383333,21.533333,
0.1,0.08333333,0.85)
period <- c("C", "B", "A", "B", "C", "B", "C", "B",
"C", "B", "C", "B", "C", "B", "C", "B")
i <- as.data.frame(intervals)
p <- as.data.frame(period)
d <- cbind(i,p)
Getting a bar plot is easy enough but it stacks all "periods" into blocks by day:
d$id<-1
e <- ggplot(d,aes(id))
e + geom_bar(aes(fill=period))
Simple aggregated stacked bar of time data:
However, I would like each "period" to be represented discretely and by its magnitude:
Periods as discrete stacked blocks example:
Thanks YBS, your method comes close, but the size of the periods is not correct. Any ideas? The first C = 5 is not drawn the same size as the first A = 5.
intervals <- c(5, 15, 5, 3,7,3,6, 2)
period <- c("C","B","A","B","C","B","C","B")
d <- data.frame(intervals,period)
colors=c("red","blue","green")
dc <- data.frame(period=unique(d$period),colors)
d2 <- d %>% mutate(nid = paste0(d$period,'_',row_number()))
d3 <- left_join(d2,dc, by="period")
d3$id<-1
e <- ggplot(d3,aes(x=id, y=intervals)) +
geom_col(aes(fill=nid))
e + scale_fill_manual(name='period', labels=d3$period, values=d3$colors )
The trick is to create a new id (nid) that is unique for each discrete block, and then map it back to the original period values via scale_fill_manual(). You can use coord_flip() to make the bar horizontal and change the legend position as necessary. Perhaps this is the desired output.
intervals <- c(0, 5.1166667, 6.2166667, 3.5166667,0.6666667,3.0666667,6.3, 2.3833333)
#,0.06666667 , 4.7,18.666667,17.383333,21.533333, 0.1,0.08333333,0.85)
period <- c("C", "B", "A", "B", "C", "B", "C", "B")
# ,"C", "B", "C", "B", "C", "B", "C", "B")
d <- data.frame(intervals,period)
colors=c("red", "blue","green")
dc <- data.frame(period=unique(d$period),colors)
d2 <- d %>% mutate(nid = paste0(d$period,'_',row_number()))
d3 <- left_join(d2,dc, by="period")
d3$id<-1
e <- ggplot(d3,aes(x=id, y=intervals)) +
geom_col(aes(fill=nid))
e + scale_fill_manual(name='period', labels=d3$period, values=d3$colors )
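One possible reason the block sizes look wrong in the follow-up comment: nid is a character column, so ggplot2 orders its levels alphabetically (A_3, B_2, ...), while values = d3$colors is unnamed and given in row order, so colours and labels can end up attached to the wrong blocks. The sketch below is an assumption about the cause, not a confirmed diagnosis. It fixes the level order to the row order and passes a named colour vector. The names palette and block_colors are illustrative.

```r
intervals <- c(5, 15, 5, 3, 7, 3, 6, 2)
period <- c("C", "B", "A", "B", "C", "B", "C", "B")
d <- data.frame(intervals, period, stringsAsFactors = FALSE)

# One unique id per row; factor levels follow the row order, not the alphabet
d$nid <- paste0(d$period, "_", seq_len(nrow(d)))
d$nid <- factor(d$nid, levels = d$nid)
d$id <- 1

# Named colour vector: each block id maps to the colour of its period
palette <- c(C = "red", B = "blue", A = "green")
block_colors <- setNames(unname(palette[as.character(d$period)]),
                         as.character(d$nid))

# Plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(x = id, y = intervals, fill = nid)) +
    geom_col() +
    scale_fill_manual(name = "period",
                      values = block_colors,
                      breaks = levels(d$nid),
                      labels = d$period)
  print(p)
}
```

By default the stack is drawn with the first level on top; position = position_stack(reverse = TRUE) in geom_col() flips that if the opposite order is wanted.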
Short question:
I can substitute certain variable values like this:
values <- c("a", "b", "a", "b", "c", "a", "b")
df <- data.frame(values)
What's the easiest way to replace with "x" every value of df$values that is neither "a" nor "b"?
Output should be:
c("a", "b", "a", "b", "x", "a", "b")
Your example is a bit unclear and not reproducible.
However, based on guessing what you actually want, I could suggest trying this option using the data.table package:
setDT(df)[!values %in% c("a", "b"), values := "x"]  # needs library(data.table); setDT() converts df in place
or the dplyr package:
df %>% mutate(values = ifelse(values %in% c("a", "b"), as.character(values), "x"))
What about:
df[!df[, 1] %in% c("a", "b"), ] <- "x"
values
1 a
2 b
3 a
4 b
5 x
6 a
7 b
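One caveat worth sketching: before R 4.0, data.frame() turned strings into factors by default, and assigning "x" to a factor that has no such level produces NA with a warning. A minimal base R version that converts the column to character first is shown below; stringsAsFactors = TRUE is used here only to mimic the old default.

```r
values <- c("a", "b", "a", "b", "c", "a", "b")
df <- data.frame(values, stringsAsFactors = TRUE)  # mimic the pre-4.0 default

# Convert to character so that "x" can be assigned safely
df$values <- as.character(df$values)

# Replace everything that is neither "a" nor "b"
df$values[!df$values %in% c("a", "b")] <- "x"
df$values
```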
I have a dataframe like this:
df <- data.frame(Patient.ID = rep(paste("Pat", seq(1:3), sep = ""), 2),
Gene = c(rep("Gene1", 3), rep("Gene2", 3)),
Ref = c("A", "C", "G", "T", "A", "T"),
Tum1 = c("A", "A", "T", "T", "A", "T"),
Tum2 = c("A", "C", "G", "G", "C", "C"))
What I would like to do is determine the change between the Ref column and either Tum column. In other words, if Tum1 or Tum2 differs from Ref, take the base that differs and store it, together with the Ref base, in a separate column as the change. The dataframe above would then become:
df <- data.frame(Patient.ID = rep(paste("Pat", seq(1:3), sep = ""), 2),
Gene = c(rep("Gene1", 3), rep("Gene2", 3)),
Ref = c("A", "C", "G", "T", "A", "T"),
Tum1 = c("A", "A", "T", "T", "A", "T"),
Tum2 = c("A", "C", "G", "G", "C", "C"),
BaseChange = c("NoCh", "C.A", "G.T", "T.G", "A.C", "T.C"))
I'm aware I could use a nested ifelse() statement like below (but extended) to solve this, but my actual dataframe has many more combinations and I figure there has to be a "safer" method of doing so.
df$BaseChange <- as.factor(ifelse(df$Ref == "C" & df$Tum1 == "A" | df$Ref== "C" & df$Tum2 == "A", "C.A",
ifelse((df$Ref == "G" & df$Tum1 == "T" | df$Ref == "G" & df$Tum2 == "T"), "G.T",...)))
Any help would be greatly appreciated.
It's not pretty, but it works:
df <- df %>%
  mutate(BaseChange2 = ifelse(
    as.character(Ref) == as.character(Tum1) & as.character(Ref) == as.character(Tum2),
    "NoCh",
    ifelse(as.character(Ref) == as.character(Tum1),
           paste(Ref, Tum2, sep = "."),
           paste(Ref, Tum1, sep = "."))))
It seems that you need to paste the unique values of Ref, Tum1 and Tum2 together, i.e.
apply(df[3:5], 1, function(i) paste0(unique(i), collapse = '.'))
#[1] "A" "C.A" "G.T" "T.G" "A.C" "T.C"
To replace the first A,
v2 <- apply(df[3:5], 1, function(i) paste0(unique(i), collapse = '.'))
replace(v2, nchar(v2) == 1, 'NoChange')
#[1] "NoChange" "C.A" "G.T" "T.G" "A.C" "T.C"
I have a subset of a bigger dataset with a total count (n) for each observation. Within each name, I want to keep only the row with the highest count, dropping the duplicates and the codes that appear less often. How would I go about that? For instance:
name = c("a", "a", "b", "b", "b", "c", "d", "e", "e", "e")
code = c(1,1,2,3,4,1,1,2,2,3)
n = c(1,10,2,3,5,4,8,100,90,40)
data = data.frame(name,code,n)
The end product would be left with these:
name = c("a", "b", "c", "d", "e")
code = c(1,4,1,1,2)
n = c(10,5,4,8,100)
data2 = data.frame(name,code,n)
If you can use dplyr, this should do the trick:
library(dplyr)
data %>%
group_by(name) %>%
filter(n == max(n)) %>%
ungroup()
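Note that filter(n == max(n)) keeps every row that ties for the maximum within a name. If exactly one row per name is wanted, one base R sketch is to sort by descending n within name and keep the first row of each group. The names ordered and result are illustrative.

```r
name <- c("a", "a", "b", "b", "b", "c", "d", "e", "e", "e")
code <- c(1, 1, 2, 3, 4, 1, 1, 2, 2, 3)
n <- c(1, 10, 2, 3, 5, 4, 8, 100, 90, 40)
data <- data.frame(name, code, n)

# Sort by name, then by n in descending order
ordered <- data[order(data$name, -data$n), ]

# Keep the first (highest-n) row for each name
result <- ordered[!duplicated(ordered$name), ]
result
```

With dplyr 1.0.0 or later, slice_max(n, n = 1, with_ties = FALSE) after group_by(name) should give the same single-row-per-group behaviour.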
I'm writing a function to aggregate a dataframe, and it needs to be generally applicable to a wide variety of datasets. One step in this function is dplyr's filter function, used to select from the data only the ad campaign types relevant to the task at hand. Since I need the function to be flexible, I want ad_campaign_types as an input, but this makes filtering kind of hairy, like so:
aggregate_data <- function(ad_campaign_types) {
raw_data %>%
filter(ad_campaign_type == ad_campaign_types) -> agg_data
agg_data
}
new_data <- aggregate_data(ad_campaign_types = c("campaign_A", "campaign_B", "campaign_C"))
I would think the above would work, but while it runs, oddly enough it returns only a small fraction of what the filtered dataset should be. Is there a better way to do this?
Another tiny example of reproducible code:
ad_types <- c("a", "a", "a", "b", "b", "c", "c", "c", "d", "d")
revenue <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
data <- as.data.frame(cbind(ad_types, revenue))
# Now, filtering to select only ad types "a", "b", and "d",
# which should leave us with only 7 values
new_data <- filter(data, ad_types == c("a", "b", "d"))
nrow(new_data)
[1] 3
For multiple criteria, use the %in% operator:
filter(data, ad_types %in% c("a", "b", "d"))
You can also use a "not in" criterion:
filter(data, !(ad_types %in% c("a", "b", "d")))
However, notice that %in% behaves a little differently from == when NA is involved:
> c(2, NA) == 2
[1] TRUE NA
> c(2, NA) %in% 2
[1] TRUE FALSE
Some find one of these more intuitive than the other, but you have to keep the difference in mind.
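To see why the == version in the question kept only three rows, it may help to spell out what == does with vectors of different lengths: the shorter vector is recycled element by element, so each row is compared with just one of the three values. The sketch below makes the recycling explicit with rep(); the names wanted, recycled, eq_mask and in_mask are illustrative.

```r
ad_types <- c("a", "a", "a", "b", "b", "c", "c", "c", "d", "d")
wanted <- c("a", "b", "d")

# What == effectively compares against: "a","b","d","a","b","d",...
recycled <- rep(wanted, length.out = length(ad_types))
eq_mask <- ad_types == recycled

# %in% asks, for each row, whether it matches any of the wanted values
in_mask <- ad_types %in% wanted

sum(eq_mask)  # 3 rows, only where the positions happen to line up
sum(in_mask)  # 7 rows, as intended
```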
To combine several different criteria, chain them with and/or operators (& and |):
filter(mtcars, cyl > 2 & wt < 2.5 & gear == 4)
Tim is correct for filtering a dataframe. However, if you want to make a function with dplyr, you need to follow the instructions at this webpage: https://rpubs.com/hadley/dplyr-programming.
The code I would suggest:
library(tidyverse)
ad_types <- c("a", "a", "a", "b", "b", "c", "c", "c", "d", "d")
revenue <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
df <- tibble(ad_types = as.factor(ad_types), revenue = revenue)  # data_frame() is deprecated
aggregate_data <- function(df, ad_types, my_list) {
ad_types = enquo(ad_types) # Make ad_types a quosure
df %>%
filter((!!ad_types) %in% my_list) # Unquote with !! (UQ() is deprecated)
}
new_data <- aggregate_data(df = df, ad_types = ad_types,
my_list = c("a", "b", "c"))
That should work!
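If the column name is fixed, as in the original aggregate_data() from the question, tidy evaluation is not needed at all: swapping == for %in% is enough. Below is a minimal base R sketch. The raw_data frame is a hypothetical stand-in for the asker's data, and passing it as an argument rather than reading it from the global environment is a design choice made here, not something taken from the question.

```r
# Hypothetical stand-in for the asker's raw_data
raw_data <- data.frame(
  ad_campaign_type = c("campaign_A", "campaign_A", "campaign_B",
                       "campaign_C", "campaign_D"),
  revenue = c(1, 2, 3, 4, 5)
)

# Keep only the rows whose campaign type is one of the requested types
aggregate_data <- function(raw_data, ad_campaign_types) {
  keep <- raw_data$ad_campaign_type %in% ad_campaign_types
  raw_data[keep, , drop = FALSE]
}

new_data <- aggregate_data(raw_data,
                           c("campaign_A", "campaign_B", "campaign_C"))
nrow(new_data)
```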