How to generate variation identifier and bucket in r dataframe - r

I have below-mentioned dataframe in R.
ID First Value End Value First Grade Final Grade
I-1 150000 5000 100 -80
I-2 150000 5000 100 80
I-3 NA NA NA NA
I-4 1000 1500 75 100
By using the above dataframe, I want to create the following column based on certain condition.
Value Var - If End Value is is higher than first value then High, If end Value is lower than first value then Low if end and first value are same then No Diff and If end and first value has NA then Outlier.
Grade Var - The above logic to be followed for this as well.
Value % Diff - To derive this we need to substract first value from end value divided by first value
Grade % Diff - The above logic to be followed for this as well
Required Output<-

You can test each condition within case_when and assign values to new column.
library(dplyr)
df %>%
mutate(ValueVar = case_when(EndValue > FirstValue ~ 'High',
EndValue < FirstValue ~ 'Low',
EndValue == FirstValue ~ 'No diff',
TRUE ~ 'Outlier'),
GradeVar = case_when(FinalGrade > FirstGrade ~ 'High',
FinalGrade < FirstGrade ~ 'Low',
FinalGrade == FirstGrade ~ 'No diff',
TRUE ~ 'Outlier'),
ValueDiff = (EndValue - FirstValue)/FirstValue * 100,
GradeDiff = (FinalGrade - FirstGrade)/FirstGrade * 100,
across(ends_with('Diff'), ~case_when(is.na(.) ~ 'Outlier',
TRUE ~ paste0(round(., 2), '%'))))
# ID FirstValue EndValue FirstGrade FinalGrade ValueVar GradeVar ValueDiff GradeDiff
#1 I-1 150000 5000 100 -80 Low Low -96.67% -180%
#2 I-2 150000 5000 100 80 Low Low -96.67% -20%
#3 I-3 NA NA NA NA Outlier Outlier Outlier Outlier
#4 I-4 1000 1500 75 100 High High 50% 33.33%
data
df <- structure(list(ID = c("I-1", "I-2", "I-3", "I-4"), FirstValue = c(150000L,
150000L, NA, 1000L), EndValue = c(5000L, 5000L, NA, 1500L), FirstGrade = c(100L,
100L, NA, 75L), FinalGrade = c(-80L, 80L, NA, 100L)),
class = "data.frame", row.names = c(NA, -4L))

Related

How To return the true condition only of a result on a list in r?

I have a problem with my R code.
Here, I have a list named bought_list with lists of customer and checkout (checkout is a data frame),
And this how checkout lists looks like:
items price qty total
Milk 10 2 20
Dolls 15 10 150
Chocolate 5 5 25
Toys 50 1 50
I want to know which one is for play_purpose and date_purpose
So I made a variable of boolean
play_purpose <- Bought_list[["checkout"]][,"total"] >= 50 & Bought_list[["checkout"]][,"total"] <= 150
date_purpose <- Bought_list[["checkout"]][,"total"] > 0 & Bought_list[["checkout"]][,"total"] < 50
How to return the items name and total value of selected condition like this?
for play_purpose:
Dolls 150
Toys 50
for date_purpose :
Milk 20
Chocolate 25
I'm not clear on the structure of your data, but you could subset with your current code:
play_purpose <-
Bought_list[["checkout"]][Bought_list[["checkout"]][, "total"] >= 50 &
Bought_list[["checkout"]][, "total"] <= 150, c(1, 4)]
# items total
#2 Dolls 150
#4 Toys 50
date_purpose <-
Bought_list[["checkout"]][Bought_list[["checkout"]][, "total"] > 0 &
Bought_list[["checkout"]][, "total"] < 50, c(1, 4)]
# items total
#1 Milk 20
#3 Chocolate 25
Another option is to use dplyr:
Bought_list$checkout %>%
filter(total >= 50 & total <= 150) %>%
select(items, total)
Bought_list$checkout %>%
filter(total > 0 & total < 50) %>%
select(items, total)
Or if you are needing to applying this function to multiple dataframes in the list, then we could use map from purrr:
map(Bought_list, ~ .x %>%
filter(total >= 50 & total <= 150) %>%
select(items, total))
map(Bought_list, ~ .x %>%
filter(total > 0 & total < 50) %>%
select(items, total))
Data
Bought_list <- list(checkout = structure(list(items = c("Milk", "Dolls", "Chocolate",
"Toys"), price = c(10L, 15L, 5L, 50L), qty = c(2L, 10L, 5L, 1L
), total = c(20L, 150L, 25L, 50L)), class = "data.frame", row.names = c(NA,
-4L)))

Subtracting rows in R based on matching value

I am trying to substract two rows in my dataset from each other:
Name Period Time Distance Load
Tim A 01:06:20 6000 680
Max A 01:06:20 5000 600
Leo A 01:06:20 5500 640
Noa A 01:06:20 6500 700
Tim B 00:04:10 500 80
Max B 00:04:10 500 50
Leo B 00:04:10 400 40
I want to subtract the Time, Distance and Load values of Period B from Period A for matching Names.
eg. Subtract row 5 (Tim, Period B) from row 1 (Tim, Period A)
The new values should be written into a new table looking like this:
Name Period Time Distance Load
Tim C 01:02:10 5500 600
Max C 01:02:10 4500 550
Leo C 01:02:10 5100 600
Noa C 01:06:20 6500 700
The real dataset contains many more rows. I tried to play around with dplyr but could not get the result I am looking for.
Thanks in advance
There are so many answers already that this is just a bit of fun at this stage. I think this way is nice as it uses unnest_wider():
library(dplyr)
library(tidyr)
library(purrr)
diff <- function(data) {
if(apply(data[2, -1], 1, function(x) all(is.na(x)))) {
data[1, -1]
} else {
data[1, -1] - data[2, -1]
}
}
df %>% group_by(Name) %>% nest() %>%
mutate(diff = map(data, diff)) %>% unnest_wider(diff) %>%
mutate(Period = "C") %>% select(Period, Time, Distance, Load)
# A tibble: 4 x 5
Name Period Time Distance Load
<chr> <chr> <time> <dbl> <dbl>
1 Tim C 01:02:10 5500 600
2 Max C 01:02:10 4500 550
3 Leo C 01:02:10 5100 600
4 Noa C 01:06:20 6500 700
Apart from the diff() function (which can probably be made neater and 'exclusively' tidyverse), this way is also shorter.
DATA
library(readr)
# courtesy of #MartinGal
df <- read_table2("Name Period Time Distance Load
Tim A 01:06:20 6000 680
Max A 01:06:20 5000 600
Leo A 01:06:20 5500 640
Noa A 01:06:20 6500 700
Tim B 00:04:10 500 80
Max B 00:04:10 500 50
Leo B 00:04:10 400 40")
You could filter on the two periods and then join them together, thus facilitating the subtraction of columns.
library(dplyr)
inner_join(filter(df, Period=="A"), filter(df, Period=="B"), by="Name") %>%
mutate(Period="C",
Time=Time.x-Time.y,
Distance=Distance.x-Distance.y,
Load=Load.x-Load.y) %>%
select(Name, Period, Time, Distance, Load)
Name Period Time Distance Load
1 Tim C 1.036111 hours 5500 600
2 Max C 1.036111 hours 4500 550
3 Leo C 1.036111 hours 5100 600
It's basically the same idea as #Edward. You could use dplyr and tidyr:
df %>%
pivot_wider(names_from="Period", values_from=c("Time", "Distance", "Load")) %>%
mutate(Period = "C",
Time = coalesce(Time_A - Time_B, Time_A),
Distance = coalesce(Distance_A - Distance_B, Distance_A),
Load = coalesce(Load_A - Load_B, Load_A)
) %>%
select(-matches("_\\w"))
returns
# A tibble: 4 x 5
Name Period Time Distance Load
<chr> <chr> <time> <dbl> <dbl>
1 Tim C 01:02:10 5500 600
2 Max C 01:02:10 4500 550
3 Leo C 01:02:10 5100 600
4 Noa C 01:06:20 6500 700
Data
df <- read_table2("Name Period Time Distance Load
Tim A 01:06:20 6000 680
Max A 01:06:20 5000 600
Leo A 01:06:20 5500 640
Noa A 01:06:20 6500 700
Tim B 00:04:10 500 80
Max B 00:04:10 500 50
Leo B 00:04:10 400 40")
Here is a different approach which groups by Name to get the difference.
library(dplyr)
library(chron)
df <- structure(list(Name = structure(c(4L, 2L, 1L, 3L, 4L, 2L, 1L), .Label = c("Leo", "Max", "Noa", "Tim"), class = "factor"),
Period = structure(c(1L,1L, 1L, 1L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
Time = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("0:04:10", "1:06:20"), class = "factor"),
Distance = c(6000L, 5000L, 5500L, 6500L, 500L, 500L, 400L),
Load = c(680L, 600L, 640L, 700L, 80L, 50L, 40L)), class = "data.frame", row.names = c(NA, -7L))
df %>%
mutate(Time = times(Time)) %>%
group_by(Name) %>%
mutate(Time = lag(Time) - Time,
Distance = lag(Distance) - Distance,
Load = lag(Load) - Load,
Period = LETTERS[which(LETTERS == Period) + 1]) %>%
filter(!is.na(Time))
You can use data.table too.
dt <- data.table(Name = c('Tim', 'Max', 'Leo', 'Noa', 'Tim', 'Max', 'Leo'),
Period = c('A', 'A', 'A', 'A', 'B', 'B', 'B'),
Time = c('01:06:20', '01:06:20' , '01:06:20' , '01:06:20' , '00:04:10' , '00:04:10' , '00:04:10' ),
Distance = c(6000, 5000, 5500, 6500, 500, 500, 400 ),
Load = c(680, 600, 640, 700, 80, 50, 40))
Then the first thing to do is to convert the Time var:
dt[, Time := as.POSIXct(Time, format = "%H:%M:%S")]
sapply(dt, class)
Then you use dcast.data.table:
dtCast <- dcast.data.table(dt, Name ~ Period, value.var = c('Time', 'Distance', 'Load'))
And then you create a new object:
dtFinal <- dtCast[,list(Period = 'C',
Time = Time_A - Time_B,
Distance = Distance_A - Distance_B,
Load = Load_A - Load_B),
by = 'Name']
Mind that if you want to convert the Time to the same format as above, you need to do the following:
library(hms)
dtFinal[, Time := as_hms(Time)]

R: remove a subset of a dataframe from the original one with multiple conditions using for loop

I have a dataframe as follows:(dput of original table is quite big hence providing a small example)
Date Sales Depo
2020-01 100 ABC
2020-02 125 ABC
2020-03 0 ABC
2020-04 0 ABC
2020-01 0 BBC
2020-02 0 BBC
2020-03 0 BBC
2020-04 5 BBC
I want to remove all the records pertaining to BBC based on the following conditions
either the sum(cols) <= max(col_value) or rowcount with zero exceeds 80% of
total row count
The above rule should be applicable for each Depo.
So the resultant df would be
Date Sales Depo
2020-01 100 ABC
2020-02 125 ABC
2020-03 0 ABC
2020-04 0 ABC
My Approach:
df_final = data.frame(Date = NULL,Sales = NULL, Depo =NULL)
for (v in unique(df$Depo)){
temp <- subset(data,Depo==v)
temp_f <- temp[,colSums(Sales!=0) > 0]
df_final <-rbind(df_final,temp_f)
}
But the above gives me a NULL data frame
Can anybody throws any light?
How can I achieve the same?
Using dplyr :
library(dplyr)
df %>%
group_by(Depo) %>%
filter((sum(Sales) > max(Sales)) & (sum(Sales == 0) < (0.8 * n())))
#Opposite can be written as :
#filter(!((sum(Sales) <= max(Sales)) | (sum(Sales == 0) > (0.8 * n()))))
The same logic can also be implemented in base R :
subset(df, as.logical(ave(Sales, Depo, FUN = function(x)
(sum(x) > max(x)) & (sum(x == 0) < (0.8 * length(x))))))
and data.table :
library(data.table)
setDT(df)[, .SD[(sum(Sales) > max(Sales)) & (sum(Sales == 0) < (0.8 * .N))], Depo]
data
df <- structure(list(Date = c("2020-01", "2020-02", "2020-03", "2020-04",
"2020-01", "2020-02", "2020-03", "2020-04"), Sales = c(100L,
125L, 0L, 0L, 0L, 0L, 0L, 5L), Depo = c("ABC", "ABC", "ABC",
"ABC", "BBC", "BBC", "BBC", "BBC")), class = "data.frame", row.names =c(NA, -8L))

Percent change for grouped subjects at multiple timepoints R

id timepoint dv.a
1 baseline 100
1 1min 105
1 2min 90
2 baseline 70
2 1min 100
2 2min 80
3 baseline 80
3 1min 80
3 2min 90
I have repeated measures data for a given subject in long format as above. I'm looking to calculate percent change relative to baseline for each subject.
id timepoint dv pct.chg
1 baseline 100 100
1 1min 105 105
1 2min 90 90
2 baseline 70 100
2 1min 100 143
2 2min 80 114
3 baseline 80 100
3 1min 80 100
3 2min 90 113
df <- expand.grid( time=c("baseline","1","2"), id=1:4)
df$dv <- sample(100,12)
df %>% group_by(id) %>%
mutate(perc=dv*100/dv[time=="baseline"]) %>%
ungroup()
You're wanting to do something for each 'id' group, so that's the group_by, then you need to create a new column, so there's a mutate. That new variable is the old dv, scaled by the value that dv takes at the baseline - hence the inner part of the mutate. And finally it's to remove the grouping you'd applied.
Try creating a helper column, group and arrange on that. Then use the window function first in your mutate function:
df %>% mutate(clean_timepoint = str_remove(timepoint,"min") %>% if_else(. == "baseline", "0", .) %>% as.numeric()) %>%
group_by(id) %>%
arrange(id,clean_timepoint) %>%
mutate(pct.chg = (dv / first(dv)) * 100) %>%
select(-clean_timepoint)
in Base Ryou can do this
for(i in 1:(NROW(df)/3)){
df[1+3*(i-1),4] <- 100
df[2+3*(i-1),4] <- df[2+3*(i-1),3]/df[1+3*(i-1),3]*100
df[3+3*(i-1),4] <- df[3+3*(i-1),3]/df[1+3*(i-1),3]*100
}
colnames(df)[4] <- "pct.chg"
output:
> df
id timepoint dv.a pct.chg
1 1 baseline 100 100.0000
2 1 1min 105 105.0000
3 1 2min 90 90.0000
4 2 baseline 70 100.0000
5 2 1min 100 142.8571
6 2 2min 80 114.2857
7 3 baseline 80 100.0000
8 3 1min 80 100.0000
9 3 2min 90 112.5000
Base R solution: (assuming "baseline" always appears as first record per group)
data.frame(do.call("rbind", lapply(split(df, df$id),
function(x){x$pct.change <- x$dv/x$dv[1]; return(x)})), row.names = NULL)
Data:
df <- structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
timepoint = c(
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min"
),
dv = c(100L, 105L, 90L, 70L, 100L, 80L, 80L, 80L, 90L)
),
class = "data.frame",
row.names = c(NA,-9L)
)

Remove duplicates based on specific criteria

I have a dataset that looks something like this:
df <- structure(list(Claim.Num = c(500L, 500L, 600L, 600L, 700L, 700L,
100L, 200L, 300L), Amount = c(NA, 1000L, NA, 564L, 0L, 200L,
NA, 0L, NA), Company = structure(c(NA, 1L, NA, 4L, 2L, 3L, NA,
3L, NA), .Label = c("ATT", "Boeing", "Petco", "T Mobile"), class = "factor")), .Names =
c("Claim.Num", "Amount", "Company"), class = "data.frame", row.names = c(NA,
-9L))
I want to remove duplicate rows based on Claim Num values, but to remove duplicates based on the following criteria: df$Company == 'NA' | df$Amount == 0
In other words, remove records 1, 3, and 5.
I've gotten this far: df <- df[!duplicated(df$Claim.Num[which(df$Amount = 0 | df$Company == 'NA')]),]
The code runs without errors, but doesn't actually remove duplicate rows based on the required criteria. I think that's because I'm telling it to remove any duplicate Claim Nums which match to those criteria, but not to remove any duplicate Claim.Num but treat certain Amounts & Companies preferentially for removal. Please note that, I can't simple filter out the dataset based on specified values, as there are other records that may have 0 or NA values, that require inclusion (e.g. records 8 & 9 shouldn't be excluded because their Claim.Nums are not duplicated).
If you order your data frame first, then you can make sure duplicated keeps the ones you want:
df.tmp <- with(df, df[order(ifelse(is.na(Company) | Amount == 0, 1, 0)), ])
df.tmp[!duplicated(df.tmp$Claim.Num), ]
# Claim.Num Amount Company
# 2 500 1000 ATT
# 4 600 564 T Mobile
# 6 700 200 Petco
# 7 100 NA <NA>
# 8 200 0 Petco
# 9 300 NA <NA>
Slightly different approach
r <- merge(df,
aggregate(df$Amount,by=list(Claim.Num=df$Claim.Num),length),
by="Claim.Num")
result <-r[!(r$x>1 & (is.na(r$Company) | (r$Amount==0))),-ncol(r)]
result
# Claim.Num Amount Company
# 1 100 NA <NA>
# 2 200 0 Petco
# 3 300 NA <NA>
# 5 500 1000 ATT
# 7 600 564 T Mobile
# 9 700 200 Petco
This adds a column x to indicate which rows have Claim.Num present more than once, then filters the result based on your criteria. The use of -ncol(r) just removes the column x at the end.
Another way based on subset and logical indices:
subset(dat, !(duplicated(Claim.Num) | duplicated(Claim.Num, fromLast = TRUE)) |
(!is.na(Amount) & Amount))
Claim.Num Amount Company
2 500 1000 ATT
4 600 564 T Mobile
6 700 200 Petco
7 100 NA <NA>
8 200 0 Petco
9 300 NA <NA>

Resources