I have a data frame in R that looks like this:
ID1 <- c(1,2,3,4,5,6,7,8,9)
Value1 <- c(2,3,5,2,5,8,17,3,5)
ID2 <- c(1,2,3,4,5,6,7,8,9)
Value2 <- c(4,6,3,5,8,1,2,8,10)
df <- as.data.frame(cbind(ID1,Value1,ID2,Value2))
Now I am searching for the minimum of the sum Value1 + Value2 under the condition that the corresponding ID1 + ID2 is less than or equal to 9. The two values do not need to come from the same row.
The result should point me to the combination of an x in Value1 and a y in Value2 that gives the lowest possible sum of values while ID1 + ID2 does not exceed 9.
Thanks in advance!
One possibility (note that this only considers pairs within the same row):
library(dplyr)
goodrow <- df %>%
  filter(ID1 + ID2 <= 9) %>%
  mutate(sumval = Value1 + Value2) %>%
  filter(sumval == min(sumval))
If I understand your question correctly, consider using the crossing() function from tidyr. This will compute all combinations of the rows of the two data frames:
library(dplyr)
library(tidyr)  # for crossing()
df <- data.frame(ID1, Value1)
df2 <- data.frame(ID2, Value2)
df_test <- crossing(df, df2)
goodrow <- df_test %>%
  filter(ID1 + ID2 <= 9) %>%
  mutate(sumval = Value1 + Value2) %>%
  filter(sumval == min(sumval))
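With the sample data this returns the pair ID1 = 1, Value1 = 2 and ID2 = 6, Value2 = 1 (sumval = 3), which agrees with the sqldf result shown further down.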
For your specific case (pairs within the same row), in base R (note that which.min() returns the row position within the filtered subset):
which.min(rowSums(df[rowSums(df[, c("ID1", "ID2")]) < 10, c("Value1", "Value2")]))
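Not part of the original answers, but since the question allows the two values to come from different rows, a base R sketch of the same cross-product idea (mirroring crossing() above) could use expand.grid() over row indices; i and j are hypothetical helper columns:
# build all pairs of row indices
combos <- expand.grid(i = seq_len(nrow(df)), j = seq_len(nrow(df)))
combos <- transform(combos,
                    ID1 = df$ID1[i], Value1 = df$Value1[i],
                    ID2 = df$ID2[j], Value2 = df$Value2[j])
# keep pairs satisfying the ID constraint, then take the minimal value sum
ok <- combos[combos$ID1 + combos$ID2 <= 9, ]
ok[which.min(ok$Value1 + ok$Value2), ]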
You can use a SQL query to answer the question with the sqldf package
library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite
df <- structure(list(ID1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9), Value1 = c(2,
3, 5, 2, 5, 8, 17, 3, 5), ID2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
Value2 = c(4, 6, 3, 5, 8, 1, 2, 8, 10)), class = "data.frame", row.names = c(NA,
-9L))
# just get min sum
sqldf('
select
min(a.Value1 + b.Value2) as min_sum
from
df a
join df b
on a.ID1 + b.ID2 <= 9
')
#> min_sum
#> 1 3
# show the rows where min sum occurs
sqldf('
select
a.Value1
, b.Value2
, a.ID1
, b.ID2
from
df a
join df b
on a.ID1 + b.ID2 <= 9
group by
1 = 1
having
a.Value1 + b.Value2 = min(a.Value1 + b.Value2)
')
#> Value1 Value2 ID1 ID2
#> 1 2 1 1 6
Created on 2021-11-15 by the reprex package (v2.0.1)
Another one-liner:
filter(transform(df, new = Value1 + Value2), ID1 + ID2 <= 9 & new == min(new))
Related
I have a data frame which is the result of a merge (all = TRUE) and looks like the one below, where the merge is conducted by GroupName, ObservationName and Date, and the two Treatment columns come from the x data frame:
A <- data.frame(GroupName = c(rep(c("A", "B", "C"), each = 6)),
ObservationName = c("alpha", "beta", "gamma", "alpha", "beta", "gamma", rep(c("delta", "epsilon"),3), rep(c("zeta", "eta", "theta"),2)),
Date = rep(rep(seq(as.Date("2010-1-1"), as.Date("2010-3-1"), by = "month"), each =3), 2),
Value = runif(n = 18, min = 1, max = 10),
Treatment1 = rep(NA, 18),
Treatment2 = rep(NA, 18))
A[c(1, 5, 6, 10, 12,13),5] <- 1
A[c(1, 5, 6, 10, 12,13),6] <- c(1, 3, 5, 7, 3, 4)
A[c( 7, 10 , 14), c(1,2,4)] <- NA
I would like to carry the values of Treatment1 and Treatment2 forward. Namely, I want to group my data frame by GroupName and ObservationName and order it by the Date column. If Treatment1 has a 1 in an earlier observation of that group, all later Treatment1 entries should be 1 as well. In Treatment2 the numbers shall cumulate. That means: rows 1, 2, 3 and 4 should be 1, row 5 should be 4 (since 1 + 3) and row 6 should be 9 (since 1 + 3 + 5), and so on. Thanks for help.
One of my tries with dplyr is:
A %>% group_by(GroupName, ObservationName) %>%
arrange(Date) %>%
mutate(Treatment1 = sum(Treatment1),
Treatment1cm = cummax(Treatment1)) %>%
ungroup()
but that does not overwrite the NAs.
The aim afterwards is to delete all rows where only Treatment1 and Treatment2 are given (i.e. Value is NA), since their information has then been carried over.
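The thread has no answer for this one; a minimal dplyr/tidyr sketch of the carry-forward step, assuming NA treatments can be treated as 0 via tidyr's replace_na(), might look like this (the expected numbers in the example suggest the cumulation may need to run over GroupName alone, so adjust the grouping keys as needed):
library(dplyr)
library(tidyr)
A_filled <- A %>%
  group_by(GroupName, ObservationName) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(Treatment1 = cummax(replace_na(Treatment1, 0)),     # a 1 carries forward once seen
         Treatment2 = cumsum(replace_na(Treatment2, 0))) %>% # running total of treatment values
  ungroup()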
Let's say I have the following data frame:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3,3),
col1 = c("a","a", "b", "c", "d", "e", "f", "g", "h", "g"),
start_day = c(NA,1,15, NA, 4, 22, 5, 11, 14, 18),
end_day = c(NA,2, 15, NA, 6, 22, 6, 12, 16, 21))
I want to create a data frame that has the following columns: id, start_day, end_day
such that for each unique id I only keep the minimum of the start_day column and the maximum of the end_day column. The final data frame should look as follows:
id start_day end_day
 1         1      15
 2         4      22
 3         5      21
To get this new data frame I wrote the following code:
df <- df[!(is.na(df$start_day)), ]
dt <- data.frame(matrix(ncol =3 , nrow = length(unique(df$id))))
colnames(dt) <- c("id", "start_day", "end_day")
dt$id <- unique(df$id)
st_day <- vector()
en_day <- vector()
for (elm in dt$id) {
d <- df[df$id == elm, ]
minimum <- min(d$start_day)
maximum <- max(d$end_day)
st_day <- c(st_day, minimum)
en_day <- c(en_day, maximum)
}
dt$start_day <- st_day
dt$end_day <- en_day
df <- dt
My code produces what I am looking for, but I am not happy with it. I would love to learn a better and cleaner way to do the same thing. Any ideas are very much appreciated.
You can try data.table like below
> library(data.table)
> na.omit(setDT(df))[, .(start_day = min(start_day), end_day = max(end_day)), id]
id start_day end_day
1: 1 1 15
2: 2 4 22
3: 3 5 21
This should do (with dplyr loaded):
library(dplyr)
df %>%
  group_by(id) %>%
  summarise(start_day = min(start_day, na.rm = TRUE),
            end_day = max(end_day, na.rm = TRUE))
Output:
id start_day end_day
<dbl> <dbl> <dbl>
1 1 1 15
2 2 4 22
3 3 5 21
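Not from the original answers, but for completeness, a base R sketch with aggregate() (whose formula interface drops NA rows by default) would be:
merge(aggregate(start_day ~ id, data = df, FUN = min),
      aggregate(end_day ~ id, data = df, FUN = max))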
Is it possible to do something like this in R (assuming both df1 and df2 have the same number of rows)?
if (df1$var1 = 8) df2$var1 = 1.
if (df1$var2 = 9) df2$var2 = 1.
This can be done in two lines with the base R ifelse() statement:
df1 <- data.frame(var1 = c(1:10), var2 = c(1:10))
df2 <- data.frame(var1 = c(1:10), var2 = c(1:10))
df2$var1 <- ifelse(df1$var1 == 8, 1,df2$var1)
df2$var2 <- ifelse(df1$var2 == 9, 1,df2$var2)
Here is one simple option in base R: we replicate the values 8 and 9 to match the dimensions and compare with the subset of columns of 'df1', resulting in a logical matrix. Then we subset 'df2' and assign those positions to 1.
nm1 <- c('var1', 'var2')
df2[nm1][df1[nm1] == c(8, 9)[col(df1[nm1])]] <- 1
df2
# var1 var2 var3
#1 5 1 1
#2 3 1 2
#3 1 3 3
#4 1 4 4
#5 4 2 5
Or this can be done in two steps
df2$var1[df1$var1 == 8] <- 1
df2$var2[df1$var2 == 9] <- 1
Or using Map
df2[nm1] <- Map(function(x, y, z) replace(x, y == z, 1),
df2[nm1], df1[nm1], c(8, 9))
An if/else statement can also be used, but it is not vectorized, i.e. it expects input of length 1, so it has to be wrapped in a loop (which would be inefficient in R):
vals <- c(8, 9)
for(i in seq_len(nrow(df1))) {
for(j in seq_along(nm1)) {
if(df1[[nm1[j]]][i] == vals[j]) df2[[nm1[j]]][i] <- 1
}
}
data
df1 <- data.frame(var1 = c(1, 3, 8, 5, 2), var2 = c(9, 3, 1, 8, 4),
var3 = 1:5)
df2 <- data.frame(var1 = c(5, 3, 2, 1, 4), var2 = c(3, 1, 3, 4, 2),
var3 = 1:5)
I am looking for the dplyr equivalent of the following SQL:
SELECT x
FROM ABT1
WHERE x IN (SELECT z FROM ABT2 WHERE q = ABT1.q)
I need this to be able to add a new column to a data frame based on values in another data frame. I might be doing this the wrong way (I hope you can tell me), but the idea I have is along the lines of:
ABT1 <- ABT1 %>% mutate(x = ifelse(ABT2 %>% filter(x = ABT1.x) %>% count() > 0, 0, 1))
The code above does not work, and I don't know how to finish it. ABT1 and ABT2 are both data frames.
Does anyone know how I can solve this?
With dplyr, we can do
library(dplyr)
inner_join(ABT1, select(ABT2, q, z), by = 'q') %>%
filter(x %in% z) %>%
select(x) %>%
distinct()
# x
#1 4
#2 3
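For what it's worth, dplyr's semi_join() expresses the SQL IN/semi-join directly in one step:
ABT1 %>%
  semi_join(ABT2, by = c("q", "x" = "z")) %>%
  select(x) %>%
  distinct()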
Testing with 'sqldf':
library(sqldf)
sqldf('SELECT x
FROM ABT1
WHERE x IN (SELECT z FROM ABT2 WHERE q = ABT1.q)')
# x
#1 4
#2 3
data
ABT1 <- data.frame(q = rep(letters[1:3], each = 2), x = c(1, 3, 5, 2, 4, 3))
ABT2 <- data.frame(q = rep(letters[2:4], each = 3),
z = c(4, 9, 12, 3, 1, 4, 10, 6, 5))
I have a dataset with hundreds of rows, structured like this:
User Date Value1 Value2
A 2012-01-01 4 3
A 2012-01-02 5 7
A 2012-01-03 6 1
A 2012-01-04 7 4
B 2012-01-01 2 4
B 2012-01-02 3 2
B 2012-01-03 4 9
B 2012-01-04 5 3
As the panel data has two indices (User = k, Date = t), I am struggling to run a regression in R where the dependent variable (Value1) is lagged only on the time index. The regression should be performed as follows:
Value1(k,t+1) ~ Value2(k,t)
or
Value1(k,t) ~ Value2(k,t-1)
Any suggestions?
For every user, you can do:
> df <- data.frame(User = c(rep("A", 4), rep("B", 4)),
+ Date = rep(seq.Date(as.Date("2012-01-01"), as.Date("2012-01-04"), by = "day"), 2),
+ Value1 = c(4, 5, 6, 7, 2, 3, 4, 5),
+ Value2 = c(3, 7, 1, 4, 4, 2, 9, 3))
>
> df_A <- df[df$User == "A", c("Value1", "Value2")]
> ts_A <- ts(df_A, start = c(2012, 1, 1), frequency = 365)
> ts_A <- ts.intersect(ts_A, lag(ts_A, -1))
> colnames(ts_A) <- c("Value1", "Value2", "Value1_t_1", "Value2_t_1")
>
> lm(Value1 ~ Value2_t_1, ts_A)
Call:
lm(formula = Value1 ~ Value2_t_1, data = ts_A)
Coefficients:
(Intercept) Value2_t_1
6.3929 -0.1071
>
Hope it helps.
Here's a solution using the dplyr package. You may notice that in the code below I explicitly reference the lag function from dplyr, as opposed to the base R one from stats; this is because dplyr's lag function does not require a time series input.
I would also note that the two formulas you list may produce different regression results, as you will be running them over different subsets of the data, i.e.
Value1(k,t+1) ~ Value2(k,t): run on the time period 2012-01-01 to 2012-01-03
Value1(k,t) ~ Value2(k,t-1): run on the time period 2012-01-02 to 2012-01-04
library("tidyverse")
df <- data.frame(User = c(rep("A", 4), rep("B", 4)),
Date = rep(seq.Date(as.Date("2012-01-01"), as.Date("2012-01-04"), by = "day"), 2),
Value1 = c(4, 5, 6, 7, 2, 3, 4, 5),
Value2 = c(3, 7, 1, 4, 4, 2, 9, 3))
df2 <- df %>%
  arrange(User, Date) %>%
  group_by(User) %>%
  mutate(lag_v2 = dplyr::lag(Value2),
         lead_v1 = dplyr::lead(Value1))
df3 <- df2[!is.na(df2$lag_v2), ]
df4 <- df2[!is.na(df2$lead_v1), ]
summary(lm(Value1 ~ lag_v2, data = df3))
summary(lm(lead_v1 ~ Value2, data = df4))
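Not covered in the original answers, but worth noting as a sketch: the plm package is designed for exactly this two-index setup, and its lag() method for panel series lags within each individual along the time index (assuming plm is installed):
library(plm)
pdf <- pdata.frame(df, index = c("User", "Date"))
# Value1(k, t) ~ Value2(k, t-1), pooled OLS across users
summary(plm(Value1 ~ lag(Value2), data = pdf, model = "pooling"))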