Subtracting rows in R based on matching value - r

I am trying to subtract two rows in my dataset from each other:
Name Period Time Distance Load
Tim A 01:06:20 6000 680
Max A 01:06:20 5000 600
Leo A 01:06:20 5500 640
Noa A 01:06:20 6500 700
Tim B 00:04:10 500 80
Max B 00:04:10 500 50
Leo B 00:04:10 400 40
I want to subtract the Time, Distance and Load values of Period B from Period A for matching Names.
E.g., subtract row 5 (Tim, Period B) from row 1 (Tim, Period A).
The new values should be written into a new table looking like this:
Name Period Time Distance Load
Tim C 01:02:10 5500 600
Max C 01:02:10 4500 550
Leo C 01:02:10 5100 600
Noa C 01:06:20 6500 700
The real dataset contains many more rows. I tried to play around with dplyr but could not get the result I am looking for.
Thanks in advance

There are so many answers already that this is just a bit of fun at this stage. I think this way is nice as it uses unnest_wider():
library(dplyr)
library(tidyr)
library(purrr)
diff <- function(data) {
if(apply(data[2, -1], 1, function(x) all(is.na(x)))) {
data[1, -1]
} else {
data[1, -1] - data[2, -1]
}
}
df %>% group_by(Name) %>% nest() %>%
mutate(diff = map(data, diff)) %>% unnest_wider(diff) %>%
mutate(Period = "C") %>% select(Period, Time, Distance, Load)
# A tibble: 4 x 5
Name Period Time Distance Load
<chr> <chr> <time> <dbl> <dbl>
1 Tim C 01:02:10 5500 600
2 Max C 01:02:10 4500 550
3 Leo C 01:02:10 5100 600
4 Noa C 01:06:20 6500 700
Apart from the diff() function (which can probably be made neater and 'exclusively' tidyverse), this way is also shorter.
DATA
library(readr)
# courtesy of @MartinGal
df <- read_table2("Name Period Time Distance Load
Tim A 01:06:20 6000 680
Max A 01:06:20 5000 600
Leo A 01:06:20 5500 640
Noa A 01:06:20 6500 700
Tim B 00:04:10 500 80
Max B 00:04:10 500 50
Leo B 00:04:10 400 40")

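You could filter on the two periods and then join them together, which makes it easy to subtract the columns. Note that inner_join keeps only names present in both periods, so Noa (who has no Period B row) is dropped from the result.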
library(dplyr)
inner_join(filter(df, Period=="A"), filter(df, Period=="B"), by="Name") %>%
mutate(Period="C",
Time=Time.x-Time.y,
Distance=Distance.x-Distance.y,
Load=Load.x-Load.y) %>%
select(Name, Period, Time, Distance, Load)
Name Period Time Distance Load
1 Tim C 1.036111 hours 5500 600
2 Max C 1.036111 hours 4500 550
3 Leo C 1.036111 hours 5100 600
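If names without a Period B row (such as Noa) should be kept, as in the desired output, a left join is one option. This is only a minimal sketch under a few assumptions: Time is stored as plain seconds instead of a time class to keep the example light, the suffixes .a and .b are arbitrary, and a missing Period B value is treated as zero.

```r
library(dplyr)

# Toy copy of the question's data; Time is in seconds (01:06:20 = 3980, 00:04:10 = 250)
df <- data.frame(
  Name     = c("Tim", "Max", "Leo", "Noa", "Tim", "Max", "Leo"),
  Period   = c("A", "A", "A", "A", "B", "B", "B"),
  Time     = c(3980, 3980, 3980, 3980, 250, 250, 250),
  Distance = c(6000, 5000, 5500, 6500, 500, 500, 400),
  Load     = c(680, 600, 640, 700, 80, 50, 40)
)

res <- left_join(
  filter(df, Period == "A"),
  filter(df, Period == "B"),
  by = "Name", suffix = c(".a", ".b")
) %>%
  mutate(
    Period   = "C",
    # coalesce() replaces the NA from a missing Period B row with 0
    Time     = Time.a - coalesce(Time.b, 0),
    Distance = Distance.a - coalesce(Distance.b, 0),
    Load     = Load.a - coalesce(Load.b, 0)
  ) %>%
  select(Name, Period, Time, Distance, Load)

res
```

Because left_join keeps every Period A row, Noa stays in the result with the Period A values unchanged.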

It's basically the same idea as @Edward's. You could use dplyr and tidyr:
df %>%
pivot_wider(names_from="Period", values_from=c("Time", "Distance", "Load")) %>%
mutate(Period = "C",
Time = coalesce(Time_A - Time_B, Time_A),
Distance = coalesce(Distance_A - Distance_B, Distance_A),
Load = coalesce(Load_A - Load_B, Load_A)
) %>%
select(-matches("_\\w"))
returns
# A tibble: 4 x 5
Name Period Time Distance Load
<chr> <chr> <time> <dbl> <dbl>
1 Tim C 01:02:10 5500 600
2 Max C 01:02:10 4500 550
3 Leo C 01:02:10 5100 600
4 Noa C 01:06:20 6500 700
Data
df <- read_table2("Name Period Time Distance Load
Tim A 01:06:20 6000 680
Max A 01:06:20 5000 600
Leo A 01:06:20 5500 640
Noa A 01:06:20 6500 700
Tim B 00:04:10 500 80
Max B 00:04:10 500 50
Leo B 00:04:10 400 40")

Here is a different approach which groups by Name to get the difference.
library(dplyr)
library(chron)
df <- structure(list(Name = structure(c(4L, 2L, 1L, 3L, 4L, 2L, 1L), .Label = c("Leo", "Max", "Noa", "Tim"), class = "factor"),
Period = structure(c(1L,1L, 1L, 1L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
Time = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("0:04:10", "1:06:20"), class = "factor"),
Distance = c(6000L, 5000L, 5500L, 6500L, 500L, 500L, 400L),
Load = c(680L, 600L, 640L, 700L, 80L, 50L, 40L)), class = "data.frame", row.names = c(NA, -7L))
df %>%
mutate(Time = times(Time)) %>%
group_by(Name) %>%
mutate(Time = lag(Time) - Time,
Distance = lag(Distance) - Distance,
Load = lag(Load) - Load,
Period = LETTERS[which(LETTERS == Period) + 1]) %>%
filter(!is.na(Time))

You can use data.table too.
library(data.table)
dt <- data.table(Name = c('Tim', 'Max', 'Leo', 'Noa', 'Tim', 'Max', 'Leo'),
Period = c('A', 'A', 'A', 'A', 'B', 'B', 'B'),
Time = c('01:06:20', '01:06:20' , '01:06:20' , '01:06:20' , '00:04:10' , '00:04:10' , '00:04:10' ),
Distance = c(6000, 5000, 5500, 6500, 500, 500, 400 ),
Load = c(680, 600, 640, 700, 80, 50, 40))
Then the first thing to do is to convert the Time var:
dt[, Time := as.POSIXct(Time, format = "%H:%M:%S")]
sapply(dt, class)
Then you use dcast.data.table:
dtCast <- dcast.data.table(dt, Name ~ Period, value.var = c('Time', 'Distance', 'Load'))
And then you create a new object:
dtFinal <- dtCast[,list(Period = 'C',
Time = Time_A - Time_B,
Distance = Distance_A - Distance_B,
Load = Load_A - Load_B),
by = 'Name']
Mind that if you want to convert the Time to the same format as above, you need to do the following:
library(hms)
dtFinal[, Time := as_hms(Time)]

Related

How to generate variation identifier and bucket in r dataframe

I have below-mentioned dataframe in R.
ID First Value End Value First Grade Final Grade
I-1 150000 5000 100 -80
I-2 150000 5000 100 80
I-3 NA NA NA NA
I-4 1000 1500 75 100
By using the above dataframe, I want to create the following column based on certain condition.
Value Var - If End Value is higher than First Value then High; if End Value is lower than First Value then Low; if End Value and First Value are the same then No Diff; and if End Value and First Value are NA then Outlier.
Grade Var - The same logic applies here as well.
Value % Diff - To derive this we need to subtract First Value from End Value and divide the result by First Value.
Grade % Diff - The same logic applies here as well.
Required Output<-
You can test each condition within case_when and assign values to new column.
library(dplyr)
df %>%
mutate(ValueVar = case_when(EndValue > FirstValue ~ 'High',
EndValue < FirstValue ~ 'Low',
EndValue == FirstValue ~ 'No diff',
TRUE ~ 'Outlier'),
GradeVar = case_when(FinalGrade > FirstGrade ~ 'High',
FinalGrade < FirstGrade ~ 'Low',
FinalGrade == FirstGrade ~ 'No diff',
TRUE ~ 'Outlier'),
ValueDiff = (EndValue - FirstValue)/FirstValue * 100,
GradeDiff = (FinalGrade - FirstGrade)/FirstGrade * 100,
across(ends_with('Diff'), ~case_when(is.na(.) ~ 'Outlier',
TRUE ~ paste0(round(., 2), '%'))))
# ID FirstValue EndValue FirstGrade FinalGrade ValueVar GradeVar ValueDiff GradeDiff
#1 I-1 150000 5000 100 -80 Low Low -96.67% -180%
#2 I-2 150000 5000 100 80 Low Low -96.67% -20%
#3 I-3 NA NA NA NA Outlier Outlier Outlier Outlier
#4 I-4 1000 1500 75 100 High High 50% 33.33%
data
df <- structure(list(ID = c("I-1", "I-2", "I-3", "I-4"), FirstValue = c(150000L,
150000L, NA, 1000L), EndValue = c(5000L, 5000L, NA, 1500L), FirstGrade = c(100L,
100L, NA, 75L), FinalGrade = c(-80L, 80L, NA, 100L)),
class = "data.frame", row.names = c(NA, -4L))
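Since the same comparison is written twice above, it could also be wrapped in a small helper. This is just a sketch; compare_values is a hypothetical name and not part of dplyr or the answer above.

```r
library(dplyr)

# Hypothetical helper: classify the change from `first` to `last`
compare_values <- function(first, last) {
  case_when(
    last > first  ~ "High",
    last < first  ~ "Low",
    last == first ~ "No diff",
    TRUE          ~ "Outlier"  # reached when either value is NA
  )
}

df <- data.frame(
  ID         = c("I-1", "I-2", "I-3", "I-4"),
  FirstValue = c(150000, 150000, NA, 1000),
  EndValue   = c(5000, 5000, NA, 1500),
  FirstGrade = c(100, 100, NA, 75),
  FinalGrade = c(-80, 80, NA, 100)
)

res <- df %>%
  mutate(ValueVar = compare_values(FirstValue, EndValue),
         GradeVar = compare_values(FirstGrade, FinalGrade))

res
```

The NA case works because a comparison with NA is neither TRUE nor FALSE, so case_when falls through to the final TRUE branch.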

Multiply multiple columns (e.g. 1A*1B and 2A*2B)

I'm looking for a fast way to multiply multiple columns with a specific ending.
I have this table
Revenue_1 Revenue_2 Sales_1 Sales_2
2,000 1,000 100 50
2,500 4,000 80 200
I want these additional columns by multiplying Revenue_N with Sales_N - For example, multiplying Revenue_1 * Sales_1 = 2,000 * 100 = 200,000
Revenue_1 Revenue_2 Sales_1 Sales_2 Sales_1 Sales_2
2,000 1,000 100 50 200,000 50,000
2,500 4,000 50 200 125,000 800,000
I was playing around with for-loops and some form of apply but couldn't figure it out. Any help?
Assuming dat as given reproducibly in the Note at the end, we can multiply the first two columns by the last two columns in base R like this. Note that in the question the new columns have the same names as existing columns, which is not a good idea because you then cannot reference them by name, so we have used new names here. You can always change the names if you don't like the ones we used.
If there were additional columns that should not participate then instead of using -ix we could use grep("Sales", names(dat)) .
We computed ix but it would be possible to simply hard code it as (1:2) and then omit the first line.
ix <- grep("Revenue", names(dat)) # 1:2
cbind(dat, prod = dat[ix] * dat[-ix])
## Revenue_1 Revenue_2 Sales_1 Sales_2 prod.Revenue_1 prod.Revenue_2
## 1 2000 1000 100 50 2e+05 5e+04
## 2 2500 4000 80 200 2e+05 8e+05
Note
dat <- structure(list(Revenue_1 = c(2000, 2500), Revenue_2 = c(1000,
4000), Sales_1 = c(100, 80), Sales_2 = c(50, 200)), row.names = c(NA,
-2L), class = "data.frame")
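For the case mentioned above where other columns are present and should not take part, here is a rough sketch. The Region column and the Prod_ prefix are made up for illustration, and the approach assumes the Revenue_N and Sales_N columns appear in the same suffix order.

```r
# Toy data with one extra, non-participating column (Region is hypothetical)
dat <- data.frame(
  Region    = c("north", "south"),
  Revenue_1 = c(2000, 2500),
  Revenue_2 = c(1000, 4000),
  Sales_1   = c(100, 80),
  Sales_2   = c(50, 200)
)

rev_ix   <- grep("^Revenue_", names(dat))  # positions of the Revenue columns
sales_ix <- grep("^Sales_", names(dat))    # positions of the Sales columns

# Element-wise product of the two column blocks; names come from the left block
prod <- dat[rev_ix] * dat[sales_ix]
names(prod) <- sub("^Revenue", "Prod", names(prod))

out <- cbind(dat, prod)
out
```

Selecting both blocks by pattern avoids relying on -ix, which would otherwise pull in Region as well.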
We could also do this with across in mutate
library(dplyr)
library(stringr)
df1 <- df %>%
mutate(across(starts_with('Sales'), ~ .* get(str_replace(cur_column(), 'Sales',
'Revenue')), .names = 'Revenue_{.col}'))
-output
df1
# Revenue_1 Revenue_2 Sales_1 Sales_2 Revenue_Sales_1 Revenue_Sales_2
#1 2000 1000 100 50 200000 50000
#2 2500 4000 80 200 200000 800000
data
df <- structure(list(Revenue_1 = c(2000L, 2500L), Revenue_2 = c(1000L,
4000L), Sales_1 = c(100L, 80L), Sales_2 = c(50L, 200L)),
class = "data.frame", row.names = c(NA,
-2L))
Try the base R option below
> cbind(df, do.call("*", rev(split.default(df, gsub("\\d", "", names(df))))))
Revenue_1 Revenue_2 Sales_1 Sales_2 Sales_1 Sales_2
1 2000 1000 100 50 200000 50000
2 2500 4000 80 200 200000 800000
Data
> dput(df)
structure(list(Revenue_1 = c(2000L, 2500L), Revenue_2 = c(1000L,
4000L), Sales_1 = c(100L, 80L), Sales_2 = c(50L, 200L)), class = "data.frame", row.names = c(NA,
-2L))
I think this is what you asked for
library(dplyr)
df <- data.frame(Revenue_1=c(2000,2500),Revenue_2=c(1000,4000),
Sales_1=c(100,80),Sales_2=c(50,200))
df1 <- df %>%
mutate(Sales__1 = Revenue_1*Sales_1,
Sales__2 = Revenue_2*Sales_2) %>%
format(scientific=FALSE) # note: format() converts every column to character
Result
Revenue_1 Revenue_2 Sales_1 Sales_2 Sales__1 Sales__2
1 2000 1000 100 50 200000 50000
2 2500 4000 80 200 200000 800000
The tidyverse solution:
library(tidyr)
library(dplyr)
dat <- structure(list(Revenue_1 = c(2000, 2500), Revenue_2 = c(1000,
4000), Sales_1 = c(100, 80), Sales_2 = c(50, 200)), row.names = c(NA,
-2L), class = "data.frame")
dat %>%
mutate(id = row_number()) %>%
pivot_longer(-id, names_to = c(".value", "category"), names_pattern = "^(.*)_(.)$") %>%
mutate(Revenue_Sales = Revenue * Sales) %>%
pivot_wider(names_from = category, values_from = c(Revenue, Sales, Revenue_Sales)) %>%
select(-id)
# A tibble: 2 x 6
Revenue_1 Revenue_2 Sales_1 Sales_2 Revenue_Sales_1 Revenue_Sales_2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000 1000 100 50 200000 50000
2 2500 4000 80 200 200000 800000

How to find out minimum value from various columns in data frame with R? [duplicate]

My data frame is:
Account id Fcast 1 Fcast 2 Fcast 3 Diff 1 Diff 2 Diff 3
101 4000 2000 1000 1000 3000 4000
201 2900 3300 5000 100 300 2000
301 -100 5500 -800 1700 7300 1000
401 5000 8000 7100 2500 500 400
501 9000 12000 2000 15000 12000 22000
The required result is the minimum value across the columns labeled Diff...
Account id Min
101 1000
201 100
301 1000
401 400
501 12000
Ideally, I also need another column filled with the name of the column from which the minimum value was taken.
We can use apply in row mode here:
data.frame(AccountId=df$AccountId,
Min=apply(df[names(df)[grepl("^Diff\\d", names(df))]], 1, FUN=min))
AccountId Min
1 101 1000
2 201 100
3 301 1000
4 401 400
5 501 12000
Data:
df <- data.frame(AccountId=c(101, 201, 301, 401, 501),
Fcast1=c(4000, 2900, -100, 5000, 9000),
Fcast2=c(2000, 3300, 5500, 8000, 12000),
Fcast3=c(1000, 5000, -800, 7100, 2000),
Diff1=c(1000, 100, 1700, 2500, 15000),
Diff2=c(3000, 300, 7300, 500, 12000),
Diff3=c(4000, 2000, 1000, 400, 22000))
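The question also asks for the name of the column holding the minimum. One possible vectorized sketch in base R uses pmin() for the value and max.col() on the negated matrix for the position; the names res and MinCol are arbitrary.

```r
df <- data.frame(AccountId = c(101, 201, 301, 401, 501),
                 Fcast1 = c(4000, 2900, -100, 5000, 9000),
                 Diff1 = c(1000, 100, 1700, 2500, 15000),
                 Diff2 = c(3000, 300, 7300, 500, 12000),
                 Diff3 = c(4000, 2000, 1000, 400, 22000))

diff_cols <- grep("^Diff", names(df), value = TRUE)
m <- as.matrix(df[diff_cols])

res <- data.frame(
  AccountId = df$AccountId,
  # pmin() takes the element-wise minimum across the Diff columns
  Min = do.call(pmin, df[diff_cols]),
  # the row-wise maximum of -m sits where the minimum of m is
  MinCol = diff_cols[max.col(-m, ties.method = "first")]
)

res
```

With ties.method = "first", the leftmost column wins when two Diff columns share the minimum.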
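Another option would be to use the apply function on the Diff columns only (taking 2:ncol(df) would also include the Fcast columns and return the wrong minimum for some rows):
res <- data.frame(AccountId = df$AccountId, min = apply(df[, grep("^Diff", names(df))], 1, min))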
Using dplyr :
library(dplyr)
cols <- grep('Diff', names(df), value = TRUE)
df %>%
group_by(Accountid) %>%
mutate(Min = min(c_across(all_of(cols))),
Min_name = cols[which.min(c_across(all_of(cols)))]) %>%
select(Accountid, Min, Min_name)
# Accountid Min Min_name
# <int> <int> <chr>
#1 101 1000 Diff1
#2 201 100 Diff1
#3 301 1000 Diff3
#4 401 400 Diff3
#5 501 12000 Diff2
data
df <- structure(list(Accountid = c(101L, 201L, 301L, 401L, 501L),
Fcast1 = c(4000L, 2900L, -100L, 5000L, 9000L), Fcast2 = c(2000L, 3300L, 5500L,
8000L, 12000L), Fcast3 = c(1000L, 5000L, -800L, 7100L, 2000L),
Diff1 = c(1000L, 100L, 1700L, 2500L, 15000L), Diff2 = c(3000L,
300L, 7300L, 500L, 12000L), Diff3 = c(4000L, 2000L, 1000L,
400L, 22000L)), class = "data.frame", row.names = c(NA, -5L))
A solution using data.table
dt[,`:=`(min_val=apply(.SD,1,min),
min_col=names(.SD)[apply(.SD,1,which.min)]),.SDcols=names(dt) %like% 'diff']
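Here, .SDcols chooses the subset of columns to work with, in this case the columns having the word diff in their name, hence the use of %like%. Note that the pattern is case-sensitive, so adjust it to match your column names (e.g. 'Diff').
.SD then behaves as a subsetted data.table containing only the diff columns.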
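Since dt is not defined in the answer above, here is a small runnable sketch of the same idea. The data.table below is a made-up subset of the question's data, and the columns are selected with grep() on "^Diff" rather than %like%.

```r
library(data.table)

# Hypothetical dt built from part of the question's data
dt <- data.table(AccountId = c(101, 201, 301),
                 Fcast1 = c(4000, 2900, -100),
                 Diff1 = c(1000, 100, 1700),
                 Diff2 = c(3000, 300, 7300),
                 Diff3 = c(4000, 2000, 1000))

diff_cols <- grep("^Diff", names(dt), value = TRUE)

# .SD contains only the columns listed in .SDcols
dt[, `:=`(min_val = apply(.SD, 1, min),
          min_col = names(.SD)[apply(.SD, 1, which.min)]),
   .SDcols = diff_cols]

dt
```

The two new columns are added by reference, so dt itself is modified and no reassignment is needed.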

Percent change for grouped subjects at multiple timepoints R

id timepoint dv.a
1 baseline 100
1 1min 105
1 2min 90
2 baseline 70
2 1min 100
2 2min 80
3 baseline 80
3 1min 80
3 2min 90
I have repeated measures data for a given subject in long format as above. I'm looking to calculate percent change relative to baseline for each subject.
id timepoint dv pct.chg
1 baseline 100 100
1 1min 105 105
1 2min 90 90
2 baseline 70 100
2 1min 100 143
2 2min 80 114
3 baseline 80 100
3 1min 80 100
3 2min 90 113
library(dplyr)
df <- expand.grid( time=c("baseline","1","2"), id=1:4)
df$dv <- sample(100,12)
df %>% group_by(id) %>%
mutate(perc=dv*100/dv[time=="baseline"]) %>%
ungroup()
You want to do something for each 'id' group, so that's the group_by. You then need to create a new column, so there's a mutate. The new variable is the old dv scaled by the value that dv takes at the baseline, hence the expression inside the mutate. Finally, ungroup() removes the grouping you applied.
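Applied to the question's own data, the same pattern could look like the sketch below. It computes both the value as a percentage of baseline (which is what the desired output shows) and the change relative to baseline; the column names pct.of.base and pct.chg are my own choice.

```r
library(dplyr)

df <- data.frame(
  id        = rep(1:3, each = 3),
  timepoint = rep(c("baseline", "1min", "2min"), times = 3),
  dv        = c(100, 105, 90, 70, 100, 80, 80, 80, 90)
)

res <- df %>%
  group_by(id) %>%
  mutate(
    # value expressed as a percentage of the subject's baseline
    pct.of.base = dv * 100 / dv[timepoint == "baseline"],
    # change relative to baseline, in percent
    pct.chg = (dv - dv[timepoint == "baseline"]) * 100 / dv[timepoint == "baseline"]
  ) %>%
  ungroup()

res
```

This assumes each id has exactly one baseline row; with zero or several, the subsetting inside mutate would no longer return a single value.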
Try creating a helper column, then group and arrange on it. Then use the window function first() in your mutate() call:
library(dplyr)
library(stringr)
df %>% mutate(clean_timepoint = str_remove(timepoint,"min") %>% if_else(. == "baseline", "0", .) %>% as.numeric()) %>%
group_by(id) %>%
arrange(id,clean_timepoint) %>%
mutate(pct.chg = (dv / first(dv)) * 100) %>%
select(-clean_timepoint)
In base R you can do this (assuming exactly three rows per id, with the baseline row first):
for(i in 1:(NROW(df)/3)){
df[1+3*(i-1),4] <- 100
df[2+3*(i-1),4] <- df[2+3*(i-1),3]/df[1+3*(i-1),3]*100
df[3+3*(i-1),4] <- df[3+3*(i-1),3]/df[1+3*(i-1),3]*100
}
colnames(df)[4] <- "pct.chg"
output:
> df
id timepoint dv.a pct.chg
1 1 baseline 100 100.0000
2 1 1min 105 105.0000
3 1 2min 90 90.0000
4 2 baseline 70 100.0000
5 2 1min 100 142.8571
6 2 2min 80 114.2857
7 3 baseline 80 100.0000
8 3 1min 80 100.0000
9 3 2min 90 112.5000
Base R solution (assuming "baseline" always appears as the first record per group):
data.frame(do.call("rbind", lapply(split(df, df$id),
function(x){x$pct.change <- x$dv/x$dv[1] * 100; return(x)})), row.names = NULL)
Data:
df <- structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
timepoint = c(
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min",
"baseline",
"1min",
"2min"
),
dv = c(100L, 105L, 90L, 70L, 100L, 80L, 80L, 80L, 90L)
),
class = "data.frame",
row.names = c(NA,-9L)
)
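If the baseline row is not guaranteed to come first, a base R sketch could look the baseline up by id with match() instead. The rows below are deliberately shuffled within each id to show that order does not matter; it still assumes one baseline row per id.

```r
df <- data.frame(
  id        = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  timepoint = c("1min", "baseline", "2min",
                "baseline", "1min", "2min",
                "2min", "1min", "baseline"),
  dv        = c(105, 100, 90, 70, 100, 80, 90, 80, 80)
)

# One row per id holding that subject's baseline value
baseline <- df[df$timepoint == "baseline", c("id", "dv")]

# match() finds each row's baseline by id, regardless of row order
df$pct.chg <- 100 * df$dv / baseline$dv[match(df$id, baseline$id)]

df
```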

How to merge two dataframes based on range value of one table

DF1
SIC Value
350 100
460 500
140 200
290 400
506 450
DF2
SIC1 AREA
100-200 Forest
201-280 Hospital
281-350 Education
351-450 Government
451-550 Land
Note: SIC1 is of class character, so it needs to be converted to a numeric range.
I am trying to get output like the one below.
Desired output:
DF3
SIC Value AREA
350 100 Education
460 500 Land
140 200 Forest
290 400 Education
506 450 Land
I first tried to convert the character class of SIC1 to numeric and then tried to merge, but had no luck. Can someone guide me on this?
An option can be to use tidyr::separate along with sqldf to join both tables on range of values.
library(sqldf)
library(tidyr)
DF2 <- separate(DF2, "SIC1",c("Start","End"), sep = "-", convert = TRUE) # convert = TRUE makes Start and End integer
sqldf("select DF1.*, DF2.AREA from DF1, DF2
WHERE DF1.SIC between DF2.Start AND DF2.End")
# SIC Value AREA
# 1 350 100 Education
# 2 460 500 Land
# 3 140 200 Forest
# 4 290 400 Education
# 5 506 450 Land
Data:
DF1 <- read.table(text =
"SIC Value
350 100
460 500
140 200
290 400
506 450",
header = TRUE, stringsAsFactors = FALSE)
DF2 <- read.table(text =
"SIC1 AREA
100-200 Forest
201-280 Hospital
281-350 Education
351-450 Government
451-550 Land",
header = TRUE, stringsAsFactors = FALSE)
We could do a non-equi join. Split (tstrsplit) the 'SIC1' column in 'DF2' to numeric columns and then do a non-equi join with the first dataset.
library(data.table)
setDT(DF2)[, c('start', 'end') := tstrsplit(SIC1, '-', type.convert = TRUE)]
DF2[, -1, with = FALSE][DF1, on = .(start <= SIC, end >= SIC),
mult = 'last'][, .(SIC = start, Value, AREA)]
# SIC Value AREA
#1: 350 100 Education
#2: 460 500 Land
#3: 140 200 Forest
#4: 290 400 Education
#5: 506 450 Land
Or, as @Frank mentioned, we can do a rolling join to extract the 'AREA' and add it to the first dataset by reference:
setDT(DF1)[, AREA := DF2[DF1, on=.(start = SIC), roll=TRUE, x.AREA]]
data
DF1 <- structure(list(SIC = c(350L, 460L, 140L, 290L, 506L), Value = c(100L,
500L, 200L, 400L, 450L)), .Names = c("SIC", "Value"),
class = "data.frame", row.names = c(NA, -5L))
DF2 <- structure(list(SIC1 = c("100-200", "201-280", "281-350", "351-450",
"451-550"), AREA = c("Forest", "Hospital", "Education", "Government",
"Land")), .Names = c("SIC1", "AREA"), class = "data.frame",
row.names = c(NA, -5L))
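For completeness, the lookup can also be sketched in base R with findInterval(). This assumes the ranges in DF2 are sorted by their lower bound and do not overlap; the column names start and end and the result name DF3 are my own choice.

```r
DF1 <- data.frame(SIC = c(350L, 460L, 140L, 290L, 506L),
                  Value = c(100L, 500L, 200L, 400L, 450L))
DF2 <- data.frame(SIC1 = c("100-200", "201-280", "281-350", "351-450", "451-550"),
                  AREA = c("Forest", "Hospital", "Education", "Government", "Land"),
                  stringsAsFactors = FALSE)

# Split "100-200" into numeric lower and upper bounds
bounds <- do.call(rbind, strsplit(DF2$SIC1, "-", fixed = TRUE))
DF2$start <- as.integer(bounds[, 1])
DF2$end   <- as.integer(bounds[, 2])

# Index of the last range whose lower bound is <= SIC (0 if below all ranges)
idx <- findInterval(DF1$SIC, DF2$start)
idx[idx == 0] <- NA

# Keep the match only if SIC also lies at or below the range's upper bound
in_range <- !is.na(idx) & DF1$SIC <= DF2$end[idx]

DF3 <- DF1
DF3$AREA <- ifelse(in_range, DF2$AREA[idx], NA_character_)
DF3
```

Values that fall outside every range, or into a gap between ranges, get NA for AREA.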
