Having a rough time approaching this problem with a large dataset. Essentially there are multiple rows for the same item, but only one row per item contains the required cost. I need to copy that value to all rows for the matching item.
E.g. below, I need item 100 to have a cost of 1203 in every row.
df = data.frame("item" = c(100, 100, 100, 105, 105, 102, 102, 102),
                "cost" = c(1203, 0, 0, 66, 0, 1200, 0, 0))
> df
item cost
1 100 1203
2 100 0
3 100 0
4 105 66
5 105 0
6 102 1200
7 102 0
8 102 0
Like so:
df_wanted = data.frame("item" = c(100, 100, 100, 105, 105, 102, 102, 102),
                       "cost" = c(1203, 1203, 1203, 66, 66, 1200, 1200, 1200))
> df_wanted
item cost
1 100 1203
2 100 1203
3 100 1203
4 105 66
5 105 66
6 102 1200
7 102 1200
8 102 1200
Below is my attempt, which I suspect is an inefficient method:
# Walk down the rows and copy the value from the row above into each 0.
# This assumes rows are grouped by item with the non-zero cost first in
# each group, and it fails if row 1 has a cost of 0 (df$cost[0] is empty).
for (row in seq_along(df$cost)) {
  if (df$cost[row] == 0) {
    df$cost[row] <- df$cost[row - 1]
  }
}
Here is one option: after grouping by 'item', subset 'cost' to its non-zero values and select the first element.
library(dplyr)
df %>%
  group_by(item) %>%
  mutate(cost = first(cost[cost != 0]))
# A tibble: 8 x 2
# Groups: item [3]
# item cost
# <dbl> <dbl>
#1 100 1203
#2 100 1203
#3 100 1203
#4 105 66
#5 105 66
#6 102 1200
#7 102 1200
#8 102 1200
Looks like you want to group by item and then replace the zeros in cost with the last non-zero value. In each group, cummax(which(cost != 0)) gives the index of the last non-zero value seen so far.
library(dplyr)
df %>%
  group_by(item) %>%
  mutate(cost = cost[cummax(which(cost != 0))]) %>%
  ungroup()
# A tibble: 8 x 2
# item cost
# <dbl> <dbl>
#1 100 1203
#2 100 1203
#3 100 1203
#4 105 66
#5 105 66
#6 102 1200
#7 102 1200
#8 102 1200
Base R equivalent is
transform(df, cost = ave(cost, item, FUN = function(x) x[cummax(which(x != 0))]))
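If a group can contain more than one non-zero cost, a slightly more general sketch carries the last non-zero value forward; this assumes the first value of each group is non-zero:
# seq_along(cost) * (cost != 0) yields each row's own index where cost is
# non-zero and 0 otherwise; cummax() then repeats the latest such index.
df %>%
  group_by(item) %>%
  mutate(cost = cost[cummax(seq_along(cost) * (cost != 0))]) %>%
  ungroup()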
What I ended up going with after revisiting this problem was a left_join(), which makes more sense to me intuitively, though it may not be the best solution.
The original data frame is below.
df = tibble("item" = as.factor(c(100, 100, 100, 105, 105, 102, 102, 102)),
            "cost" = c(1203, 0, 0, 66, 0, 0, 1200, 0))
> df
# A tibble: 8 x 2
item cost
<fct> <dbl>
1 100 1203
2 100 0
3 100 0
4 105 66
5 105 0
6 102 0
7 102 1200
8 102 0
Create an 'index' of item-value pairs:
df_index <- df %>%
  group_by(item) %>%
  arrange(-cost) %>%
  slice(1)
> df_index
# A tibble: 3 x 2
# Groups: item [3]
item cost
<fct> <dbl>
1 100 1203
2 102 1200
3 105 66
Finally, join the dataframes by item to fill in the empty row values.
df_joined <- df %>%
  left_join(df_index, by = "item")
> df_joined
# A tibble: 8 x 3
item cost.x cost.y
<fct> <dbl> <dbl>
1 100 1203 1203
2 100 0 1203
3 100 0 1203
4 105 66 66
5 105 0 66
6 102 0 1200
7 102 1200 1200
8 102 0 1200
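Since the non-zero cost is also each group's maximum here (every other entry is 0), a shorter sketch, assuming costs are never negative, is simply to take the max per group:
df %>%
  group_by(item) %>%
  mutate(cost = max(cost)) %>%
  ungroup()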
I have the following data frame in R. For this experiment I was testing the survival of cells at several times, with 2 treatments and 2 replicates per treatment. I want to calculate the percentage of cells alive at each time for each treatment/replicate.
For example, for Treat 1 Rep 1 it would be 500/500, 470/500, 100/500, 20/500; for Treat 2 Rep 1 it would be 430/430, 420/430, 300/430, 100/430.
Thanks!
x <- data.frame("treatment" = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
                "rep" = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2),
                "Time" = c(0, 30, 60, 180, 0, 30, 60, 180, 0, 30, 60, 180, 0, 30, 60, 180),
                "cells_alive" = c(500, 470, 100, 20, 476, 310, 99, 2, 430, 420, 300, 100, 489, 451, 289, 4))
We can group by 'treatment' and 'rep', then calculate the proportion by dividing 'cells_alive' by the value of 'cells_alive' where 'Time' is 0.
library(dplyr)
x1 <- x %>%
  group_by(treatment, rep) %>%
  mutate(prop = cells_alive/cells_alive[Time == 0])
Output:
x1
# A tibble: 16 x 5
# Groups: treatment, rep [4]
# treatment rep Time cells_alive prop
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 0 500 1
# 2 1 1 30 470 0.94
# 3 1 1 60 100 0.2
# 4 1 1 180 20 0.04
# 5 1 2 0 476 1
# 6 1 2 30 310 0.651
# 7 1 2 60 99 0.208
# 8 1 2 180 2 0.00420
# 9 2 1 0 430 1
#10 2 1 30 420 0.977
#11 2 1 60 300 0.698
#12 2 1 180 100 0.233
#13 2 2 0 489 1
#14 2 2 30 451 0.922
#15 2 2 60 289 0.591
#16 2 2 180 4 0.00818
Or with match:
x %>%
  group_by(treatment, rep) %>%
  mutate(prop = cells_alive/cells_alive[match(0, Time)])
Or, if 'Time' is already ordered within each group:
x %>%
  group_by(treatment, rep) %>%
  mutate(prop = cells_alive/first(cells_alive))
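A base R equivalent with ave(), sketched under the same assumption that each group contains exactly one 'Time == 0' row:
# ave() passes the row indices of each (treatment, rep) group to FUN,
# so both Time and cells_alive can be used inside it.
x$prop <- ave(seq_len(nrow(x)), x$treatment, x$rep,
              FUN = function(i) x$cells_alive[i] / x$cells_alive[i][x$Time[i] == 0])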
This is my dataset:
df = structure(list(from = c(0, 0, 0, 0, 38, 43, 49, 54), to = c(43,
54, 56, 62, 62, 62, 62, 62), count = c(342, 181, 194, 386, 200,
480, 214, 176), group = c("keiner", "keiner", "keiner", "keiner",
"paid", "paid", "owned", "earned")), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -8L))
My problem is that the columns from and to need to be ranked jointly (the ranking has to be done across the two columns from and to together), since the visualisation library requires that, and the index also needs to start at 0.
That's why I built two vectors: one (ranking) with a ranking of each unique value of the two columns, the other (uniquevalues) with the original unique values of the dataset.
ranking <- dplyr::dense_rank(unique(c(df$from, df$to))) - 1 ### Start Index at 0, "recode" variables
uniquevalues <- unique(c(df$from, df$to))
Now I have to recode the original dataset. The columns to and from have to receive the values from ranking, according to the corresponding value of uniquevalues.
The only option I came up with was to create a dataframe of the two vectors and loop over each row, but I would really like to have a vectorised solution for this. Can anyone help me?
This:
from to count group
<dbl> <dbl> <dbl> <chr>
1 0 43 342 keiner
2 0 54 181 keiner
3 0 56 194 keiner
4 0 62 386 keiner
5 38 62 200 paid
6 43 62 480 paid
7 49 62 214 owned
8 54 62 176 earned
should become this:
from to count group
<dbl> <dbl> <dbl> <chr>
1 0 2 342 keiner
2 0 4 181 keiner
3 0 5 194 keiner
4 0 6 386 keiner
5 1 6 200 paid
6 2 6 480 paid
7 3 6 214 owned
8 4 6 176 earned
We could unlist the values and match them with uniquevalues
df[1:2] <- match(unlist(df[1:2]), uniquevalues) - 1
df
# from to count group
# <dbl> <dbl> <dbl> <chr>
#1 0 2 342 keiner
#2 0 4 181 keiner
#3 0 5 194 keiner
#4 0 6 386 keiner
#5 1 6 200 paid
#6 2 6 480 paid
#7 3 6 214 owned
#8 4 6 176 earned
Or using column names instead of index.
df[c("from", "to")] <- match(unlist(df[c("from", "to")]), uniquevalues) - 1
Another solution is converting to factor and back.
# Note: unique() happens to return the values in ascending order for this
# data; use sort(unique(...)) if that is not guaranteed.
f <- unique(unlist(df[1:2]))
df[1:2] <- lapply(df[1:2], function(x) {
  as.integer(as.character(factor(x, levels = f, labels = seq_along(f) - 1)))
})
df
# # A tibble: 8 x 4
# from to count group
# <int> <int> <dbl> <chr>
# 1 0 2 342 keiner
# 2 0 4 181 keiner
# 3 0 5 194 keiner
# 4 0 6 386 keiner
# 5 1 6 200 paid
# 6 2 6 480 paid
# 7 3 6 214 owned
# 8 4 6 176 earned
I would use the mapvalues function, like this:
library(plyr)
df[ , 1:2] <- mapvalues(unlist(df[ , 1:2]),
                        from = uniquevalues,
                        to = ranking)
df
# from to count group
# <dbl> <dbl> <dbl> <chr>
#1 0 2 342 keiner
#2 0 4 181 keiner
#3 0 5 194 keiner
#4 0 6 386 keiner
#5 1 6 200 paid
#6 2 6 480 paid
#7 3 6 214 owned
#8 4 6 176 earned
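For reference, the two helper vectors aren't strictly needed; here is a one-step base R sketch, using sort() so the rank order doesn't depend on the order unique() happens to return:
# Zero-based joint dense rank of 'from' and 'to' in one step
lev <- sort(unique(unlist(df[c("from", "to")])))
df[c("from", "to")] <- lapply(df[c("from", "to")], function(x) match(x, lev) - 1L)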
In R, I have a reference table (dataframe) with three columns. Below is an example:
reftable <- data.frame(
  X_lower = c(0, 101, 181, 231, 280, 300, 340, 390, 500),
  X_upper = c(100, 180, 230, 279, 299, 339, 389, 499, 600),
  Percentile = c(2, 3, 4, 6, 8, 11, 15, 20, 25))
# X_lower X_upper Percentile
# 0 100 2
# 101 180 3
# 181 230 4
# etc.
I have a separate dataframe, scores, with specific values for X, and I want to use the reference table to look up the percentile rank associated with each value.
scores <- data.frame(
  X = c(58, 127, 175, 245, 300, 90, 405, 284, 330),
  PercRank = NA)
# X PercRank
# 58 ?
# 127 ?
# 175 ?
# 245 ?
# etc.
I've tried using match or findInterval but can't find a solution. I've searched through existing questions; if this has been asked before, I must not be hitting on the right search terms.
You can try:
scores$PercRank <- sapply(scores$X, function(x) {
  i <- which(reftable$X_upper >= x)[1]  # >= so a value equal to an upper bound stays in its own band
  reftable$Percentile[i]
})
> scores
X PercRank
1 58 2
2 127 3
3 175 3
4 245 6
5 300 11
6 90 2
7 405 20
8 284 8
9 330 11
Because reftable is ordered, you only need to find the first upper bound at or above your X.
1) sqldf: An SQL left join can be used:
library(sqldf)
scores$PercRank <- NULL
sqldf("select s.X, r.Percentile as PercRank
from scores as s
left join reftable as r on s.X between r.X_lower and r.X_upper")
giving:
X PercRank
1 58 2
2 127 3
3 175 3
4 245 6
5 300 11
6 90 2
7 405 20
8 284 8
9 330 11
2) findInterval: A base alternative is findInterval:
transform(scores, PercRank = with(reftable, Percentile[ findInterval(X, X_lower) ]))
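Note that findInterval() only consults the lower bounds, so this relies on X_lower being sorted and the bands being contiguous. A quick sanity check for that assumption:
# TRUE here: every band starts exactly one above the previous band's end
with(reftable, all(head(X_upper, -1) + 1 == tail(X_lower, -1)))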
1) An option would be foverlaps from data.table (run the setup in the data section below first)
library(data.table)
scores$PercRank <- foverlaps(scores1, reftable)[order(rn)]$Percentile
scores$rn <- NULL
scores
# X PercRank
#1 58 2
#2 127 3
#3 175 3
#4 245 6
#5 300 11
#6 90 2
#7 405 20
#8 284 8
#9 330 11
2) Or use a non-equi join
setDT(scores)[reftable, PercRank := Percentile, on = .(X >= X_lower, X <= X_upper)]
scores
# X PercRank
#1: 58 2
#2: 127 3
#3: 175 3
#4: 245 6
#5: 300 11
#6: 90 2
#7: 405 20
#8: 284 8
#9: 330 11
3) Or with fuzzyjoin
library(fuzzyjoin)
library(dplyr)
fuzzy_left_join(scores, reftable, by = c("X" = "X_lower", "X" = "X_upper"),
                match_fun = list(`>=`, `<=`)) %>%
  select(X, Percentile)
# X Percentile
#1 58 2
#2 127 3
#3 175 3
#4 245 6
#5 300 11
#6 90 2
#7 405 20
#8 284 8
#9 330 11
data
scores <- data.frame(
  X = c(58, 127, 175, 245, 300, 90, 405, 284, 330))
scores$rn <- seq_len(nrow(scores))
scores1 <- data.table(X_lower = scores$X, X_upper = scores$X, rn = scores$rn)
setkeyv(scores1, c("X_lower", "X_upper"))
setDT(reftable)   # foverlaps() needs data.tables on both sides
setkeyv(reftable, c("X_lower", "X_upper"))
Sample data
library(dplyr)
df <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  day = c(3, 8, 14, 29, 4, 6, 8, 1, 4, 9),
  value = c(75, 101, 115, 120, 110, 106, 122, 100, 128, 140))
The idea behind the question:
Select the smallest day for each ID and multiply that row's value by 1.3 (ID 1: day 3, value 75; ID 2: day 4, value 110; ID 3: day 1, value 100).
Then compare that newly created value with the other values that have the same ID but a different day number.
For example:
The smallest day number for ID 1 is 3, so multiply the value of that row by 1.3 (75 * 1.3 = 97.5). Then compare the newly created value (97.5) with the values (101, 115, 120) that have the same ID of 1, and answer TRUE or FALSE for whether each of those values is greater than the new value.
Repeat that for ID 2 and ID 3 as well.
library(dplyr)
df <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  day = c(3, 8, 14, 29, 4, 6, 8, 1, 4, 9),
  value = c(75, 101, 115, 120, 110, 106, 122, 100, 128, 140))
df %>%
  group_by(ID) %>%
  mutate(v = value[day == min(day)] * 1.3,
         flag = value > v) %>%
  ungroup()
# # A tibble: 10 x 5
# ID day value v flag
# <dbl> <dbl> <dbl> <dbl> <lgl>
# 1 1 3 75 97.5 FALSE
# 2 1 8 101 97.5 TRUE
# 3 1 14 115 97.5 TRUE
# 4 1 29 120 97.5 TRUE
# 5 2 4 110 143 FALSE
# 6 2 6 106 143 FALSE
# 7 2 8 122 143 FALSE
# 8 3 1 100 130 FALSE
# 9 3 4 128 130 FALSE
#10 3 9 140 130 TRUE
If you want to flag IDs with at least one TRUE flag, you can create flag2 like this:
df %>%
  group_by(ID) %>%
  mutate(v = value[day == min(day)] * 1.3,
         flag = value > v,
         flag2 = max(flag)) %>%
  ungroup()
# # A tibble: 10 x 6
# ID day value v flag flag2
# <dbl> <dbl> <dbl> <dbl> <lgl> <int>
# 1 1 3 75 97.5 FALSE 1
# 2 1 8 101 97.5 TRUE 1
# 3 1 14 115 97.5 TRUE 1
# 4 1 29 120 97.5 TRUE 1
# 5 2 4 110 143 FALSE 0
# 6 2 6 106 143 FALSE 0
# 7 2 8 122 143 FALSE 0
# 8 3 1 100 130 FALSE 1
# 9 3 4 128 130 FALSE 1
#10 3 9 140 130 TRUE 1
Or extract the IDs as a vector:
df %>%
  group_by(ID) %>%
  mutate(v = value[day == min(day)] * 1.3,
         flag = value > v) %>%
  ungroup() -> df2
df2 %>%
  filter(flag == TRUE) %>%
  distinct(ID) %>%
  pull(ID)
#[1] 1 3
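A base R sketch of the same flags with ave(), assuming which.min() is an acceptable way to pick the earliest day (it takes the first row on ties):
# ave() hands FUN the row indices of each ID group, so both day and
# value can be used to find the baseline row and scale it by 1.3.
df$v <- ave(seq_len(nrow(df)), df$ID,
            FUN = function(i) df$value[i][which.min(df$day[i])] * 1.3)
df$flag <- df$value > df$v
unique(df$ID[df$flag])   # IDs with at least one TRUE
#[1] 1 3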
Say I have this dataset:
df <- data.frame(time = c(100, 101, 101, 101, 102, 102, 103, 105, 109, 109, 109),
                 val = c(1,3,1,2,3,1,2,3,1,2,1))
df
time val
1 100 1
2 101 3
3 101 1
4 101 2
5 102 3
6 102 1
7 103 2
8 105 3
9 109 1
10 109 2
11 109 1
We can identify duplicate times in the 'time' column like this:
df[duplicated(df$time),]
What I want to do is adjust the value of time (add 0.1) if it's a duplicate. I could do it like this:
df$time <- ifelse(duplicated(df$time),df$time+.1,df$time)
time val
1 100.0 1
2 101.0 3
3 101.1 1
4 101.1 2
5 102.0 3
6 102.1 1
7 103.0 2
8 105.0 3
9 109.0 1
10 109.1 2
11 109.1 1
The issue here is that we still have duplicate values, e.g. rows 3 and 4 (that they differ in the column 'val' is irrelevant). Rows 10 and 11 have the same problem. Rows 5 and 6 are fine.
Is there a way of doing this iteratively, i.e. adding 0.1 to the first duplicate, 0.2 to the second duplicate (of the same time value), etc.? This way row 4 would become 101.2 and row 11 would become 109.2. The number of duplicates per value is unknown but will never reach 10 (usually at most 4).
As in the top answer for the related question linked by @Henrik, this uses data.table::rowid.
library(data.table)
setDT(df)
df[, time := time + 0.1*(rowid(time) - 1)]
# time val
# 1: 100.0 1
# 2: 101.0 3
# 3: 101.1 1
# 4: 101.2 2
# 5: 102.0 3
# 6: 102.1 1
# 7: 103.0 2
# 8: 105.0 3
# 9: 109.0 1
# 10: 109.1 2
# 11: 109.2 1
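To see the mechanism, rowid() numbers the occurrences of each value in order of appearance, so rowid(time) - 1 is exactly the multiplier needed for the 0.1 offsets:
rowid(c(100, 101, 101, 101))
#[1] 1 1 2 3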
Here's a one-line solution using base R:
df <- data.frame(time = c(100, 101, 101, 101, 102, 102, 103, 105, 109, 109, 109),
                 val = c(1,3,1,2,3,1,2,3,1,2,1))
df$new_time <- df$time + duplicated(df$time) * 0.1 *
  (ave(seq_len(nrow(df)), df$time, FUN = seq_along) - 1)
df
# time val new_time
# 1 100 1 100.0
# 2 101 3 101.0
# 3 101 1 101.1
# 4 101 2 101.2
# 5 102 3 102.0
# 6 102 1 102.1
# 7 103 2 103.0
# 8 105 3 105.0
# 9 109 1 109.0
# 10 109 2 109.1
# 11 109 1 109.2
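The duplicated() factor is arguably redundant: the within-group counter is already 0 for each first occurrence, so this shorter sketch of the same idea should give identical results:
df$new_time <- df$time + 0.1 * (ave(seq_len(nrow(df)), df$time, FUN = seq_along) - 1)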
With dplyr:
library(dplyr)
df %>%
  group_by(time1 = time) %>%
  mutate(time = time + (0:(n() - 1)) * 0.1) %>%
  ungroup() %>%
  select(-time1)
or with row_number() (suggested by Henrik):
df %>%
  group_by(time1 = time) %>%
  mutate(time = time + (row_number() - 1) * 0.1) %>%
  ungroup() %>%
  select(-time1)
Output:
time val
1 100.0 1
2 101.0 3
3 101.1 1
4 101.2 2
5 102.0 3
6 102.1 1
7 103.0 2
8 105.0 3
9 109.0 1
10 109.1 2
11 109.2 1