I have a dataset with hundreds of rows, structured like this:
User Date Value1 Value2
A 2012-01-01 4 3
A 2012-01-02 5 7
A 2012-01-03 6 1
A 2012-01-04 7 4
B 2012-01-01 2 4
B 2012-01-02 3 2
B 2012-01-03 4 9
B 2012-01-04 5 3
As the panel data has two indices (User = k, Date = t), I am struggling to run a regression in R where the dependent variable (Value1) is lagged only on the time index. The regression should be performed as follows:
Value1(k,t+1) ~ Value2(k,t)
or
Value1(k,t) ~ Value2(k,t-1)
Any suggestions?
For every user, you can do:
> df <- data.frame(User = c(rep("A", 4), rep("B", 4)),
+ Date = rep(seq.Date(as.Date("2012-01-01"), as.Date("2012-01-04"), by = "day"), 2),
+ Value1 = c(4, 5, 6, 7, 2, 3, 4, 5),
+ Value2 = c(3, 7, 1, 4, 4, 2, 9, 3))
>
> df_A <- df[df$User == "A", c("Value1", "Value2")]
> ts_A <- ts(df_A, start = c(2012, 1), frequency = 365)  # start = c(year, day within year)
> ts_A <- ts.intersect(ts_A, lag(ts_A, -1))
> colnames(ts_A) <- c("Value1", "Value2", "Value1_t_1", "Value2_t_1")
>
> lm(Value1 ~ Value2_t_1, ts_A)
Call:
lm(formula = Value1 ~ Value2_t_1, data = ts_A)
Coefficients:
(Intercept) Value2_t_1
6.3929 -0.1071
>
Hope it helps.
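To repeat this for every user rather than only for A, one option is to split the data by User and fit one model per piece. This is a minimal base R sketch, assuming each user's rows are consecutive days once sorted by Date; the names fits and Value2_t_1 are only illustrative, and the lag is built by shifting the vector instead of using ts objects.

```r
df <- data.frame(
  User   = rep(c("A", "B"), each = 4),
  Date   = rep(seq(as.Date("2012-01-01"), as.Date("2012-01-04"), by = "day"), 2),
  Value1 = c(4, 5, 6, 7, 2, 3, 4, 5),
  Value2 = c(3, 7, 1, 4, 4, 2, 9, 3)
)

# One regression per user: Value1 at t on Value2 at t - 1
fits <- lapply(split(df, df$User), function(d) {
  d <- d[order(d$Date), ]
  d$Value2_t_1 <- c(NA, head(d$Value2, -1))  # shift Value2 down by one row
  coef(lm(Value1 ~ Value2_t_1, data = d))    # lm drops the NA row by default
})

fits$A  # same coefficients as the ts-based fit for user A (6.3929 and -0.1071)
```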
Here's a solution using the dplyr package. In the code below I explicitly reference the lag function from dplyr rather than the one from base R (stats), because dplyr's lag works on plain vectors and does not require a time series input.
I would also note that the two formulas you list are indexed over different time periods, although on the sample data they pair the same observations (each Value1 with the previous day's Value2) and therefore give the same coefficients, i.e.
Value1(k,t+1) ~ Value2(k,t) : run on the time period 2012-01-01 to 2012-01-03
Value1(k,t) ~ Value2(k,t-1) : run on the time period 2012-01-02 to 2012-01-04
library(dplyr)  # only dplyr is needed: %>%, arrange(), group_by(), mutate(), lag(), lead()
df <- data.frame(User = c(rep("A", 4), rep("B", 4)),
Date = rep(seq.Date(as.Date("2012-01-01"), as.Date("2012-01-04"), by = "day"), 2),
Value1 = c(4, 5, 6, 7, 2, 3, 4, 5),
Value2 = c(3, 7, 1, 4, 4, 2, 9, 3))
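# Lag/lead within each user, after sorting by date
df2 <- df %>%
  arrange(User, Date) %>%
  group_by(User) %>%
  mutate(lag_v2 = dplyr::lag(Value2),
         lead_v1 = dplyr::lead(Value1)) %>%
  ungroup()
# Drop the rows that have no lagged / led value
df3 <- df2[!is.na(df2$lag_v2), ]
df4 <- df2[!is.na(df2$lead_v1), ]
summary(lm(Value1 ~ lag_v2, data = df3))
summary(lm(lead_v1 ~ Value2, data = df4))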
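The same grouped lag and lead can be sketched in base R with ave(), which applies a function within each User. This is only an illustration under the assumption that every user has complete, consecutive dates; the column names lag_v2 and lead_v1 follow the code above, and fit_lag and fit_lead are made-up names.

```r
df <- data.frame(
  User   = rep(c("A", "B"), each = 4),
  Date   = rep(seq(as.Date("2012-01-01"), as.Date("2012-01-04"), by = "day"), 2),
  Value1 = c(4, 5, 6, 7, 2, 3, 4, 5),
  Value2 = c(3, 7, 1, 4, 4, 2, 9, 3)
)
df <- df[order(df$User, df$Date), ]

# Shift within each user: the lag has NA in the first row, the lead in the last
df$lag_v2  <- ave(df$Value2, df$User, FUN = function(x) c(NA, head(x, -1)))
df$lead_v1 <- ave(df$Value1, df$User, FUN = function(x) c(tail(x, -1), NA))

# lm() drops rows with NA by default, so no manual filtering is needed
fit_lag  <- lm(Value1 ~ lag_v2, data = df)
fit_lead <- lm(lead_v1 ~ Value2, data = df)

# Both formulations pair Value1 with the previous day's Value2
all.equal(unname(coef(fit_lag)), unname(coef(fit_lead)))
```

On this sample data the two fits coincide, because both models use exactly the same (Value1, previous Value2) pairs.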
I have a data frame that is the result of a merge (all = TRUE) and looks like the one below (the merge is done by GroupName, ObservationName and Date; the two Treatment columns come from the x):
A <- data.frame(GroupName = c(rep(c("A", "B", "C"), each = 6)),
ObservationName = c("alpha", "beta", "gamma", "alpha", "beta", "gamma", rep(c("delta", "epsilon"),3), rep(c("zeta", "eta", "theta"),2)),
Date = rep(rep(seq(as.Date("2010-1-1"), as.Date("2010-3-1"), by = "month"), each =3), 2),
Value = runif(n = 18, min = 1, max = 10),
Treatment1 = rep(NA, 18),
Treatment2 = rep(NA, 18))
A[c(1, 5, 6, 10, 12,13),5] <- 1
A[c(1, 5, 6, 10, 12,13),6] <- c(1, 3, 5, 7, 3, 4)
A[c( 7, 10 , 14), c(1,2,4)] <- NA
I would like to carry the values of Treatment1 and Treatment2 forward. Namely, I want to group the data frame by GroupName and ObservationName and order it by the Date column. If Treatment1 has a 1 in an earlier observation of a group, all later observations of that group should have a 1 as well. In Treatment2 the numbers should accumulate. That means: rows 1, 2, 3 and 4 should be 1, row 5 should be 4 (since 1 + 3), row 6 should be 9 (since 1 + 3 + 5), and so on. Thanks for your help.
One of my tries with dplyr is:
A %>% group_by(GroupName, ObservationName) %>%
arrange(Date) %>%
mutate(Treatment1 = sum(Treatment1),
Treatment1cm = cummax(Treatment1)) %>%
ungroup()
but that does not overwrite the NAs.
The aim is then to delete all rows where only Treatment1 and Treatment2 are given (i.e. where Value is NA), once all of their information has been carried over.
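One possible direction for the carry-forward part is to treat NA as 0 and then take a running maximum (for the 0/1 flag) and a running sum (for the cumulating values) within each group. The sketch below uses a small made-up data frame d with a single grouping column grp, not the data above; it assumes the group keys themselves contain no NA, and the helper zero_na is hypothetical.

```r
# Small hypothetical example: two groups, treatments recorded only on some dates
d <- data.frame(
  grp  = c("A", "A", "A", "B", "B"),
  Date = as.Date(c("2010-01-01", "2010-02-01", "2010-03-01",
                   "2010-01-01", "2010-02-01")),
  Treatment1 = c(NA, 1, NA, NA, NA),
  Treatment2 = c(NA, 3, 5, NA, 2)
)
d <- d[order(d$grp, d$Date), ]

# cummax() and cumsum() propagate NA, so replace NA by 0 first
zero_na <- function(x) ifelse(is.na(x), 0, x)

# Once a 1 has been seen in a group, all later rows of that group stay at 1
d$Treatment1_cm <- ave(zero_na(d$Treatment1), d$grp, FUN = cummax)
# Running total of Treatment2 within each group
d$Treatment2_cs <- ave(zero_na(d$Treatment2), d$grp, FUN = cumsum)

d
```

Rows that only carried treatment information could be dropped afterwards with ordinary subsetting, once the carried-forward columns are in place.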
I have a data frame in R that looks like this:
ID1 <- c(1,2,3,4,5,6,7,8,9)
Value1 <- c(2,3,5,2,5,8,17,3,5)
ID2 <- c(1,2,3,4,5,6,7,8,9)
Value2 <- c(4,6,3,5,8,1,2,8,10)
df <- as.data.frame(cbind(ID1,Value1,ID2,Value2))
Now I am searching for the minimum of Value1 + Value2 subject to the sum of ID1 and ID2 being less than or equal to 9. That is, it should show me the minimum over all combinations of Value1 + Value2 (the two values need not come from the same row) without ID1 + ID2 exceeding 9.
The result should point me to the combination of x in Value1 and y in Value2 that together give the lowest possible sum under the condition ID1 + ID2 <= 9.
Thanks in advance!
One possibility:
library(dplyr)
goodrow <- filter(df, ID1 + ID2 <= 9) %>% mutate(sumval = Value1 + Value2) %>% filter(sumval == min(sumval))
If I understand your question correctly, consider using the crossing function from tidyr. It computes all combinations of ID1 and ID2:
library(dplyr)
library(tidyr)  # crossing() lives in tidyr, not dplyr
df <- as.data.frame(cbind(ID1,Value1))
df2 <- as.data.frame(cbind(ID2,Value2))
df_test <- crossing(df, df2)
goodrow <- filter(df_test, ID1 + ID2 <= 9) %>% mutate(sumval = Value1 + Value2) %>% filter(sumval == min(sumval))
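For comparison, the same all-combinations search can be sketched in base R, where merge() with by = NULL returns the Cartesian product of two data frames. The names pairs, ok and best are only illustrative.

```r
ID1 <- 1:9
Value1 <- c(2, 3, 5, 2, 5, 8, 17, 3, 5)
ID2 <- 1:9
Value2 <- c(4, 6, 3, 5, 8, 1, 2, 8, 10)

# Every (ID1, Value1) row combined with every (ID2, Value2) row: 9 x 9 = 81 rows
pairs <- merge(data.frame(ID1, Value1), data.frame(ID2, Value2), by = NULL)
pairs$sumval <- pairs$Value1 + pairs$Value2

# Apply the ID constraint first, then take the minimum over what remains
ok   <- pairs[pairs$ID1 + pairs$ID2 <= 9, ]
best <- ok[ok$sumval == min(ok$sumval), ]
best  # ID1 = 1 (Value1 = 2) with ID2 = 6 (Value2 = 1), sum 3
```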
For your specific case
which.min(rowSums(df[rowSums(df[,c("ID1","ID2")])<10,c("Value1","Value2")]))
You can use a SQL query to answer the question with the sqldf package
library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite
df <- structure(list(ID1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9), Value1 = c(2,
3, 5, 2, 5, 8, 17, 3, 5), ID2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
Value2 = c(4, 6, 3, 5, 8, 1, 2, 8, 10)), class = "data.frame", row.names = c(NA,
-9L))
# just get min sum
sqldf('
select
min(a.Value1 + b.Value2) as min_sum
from
df a
join df b
on a.ID1 + b.ID2 <= 9
')
#> min_sum
#> 1 3
# show the rows where min sum occurs
sqldf('
select
a.Value1
, b.Value2
, a.ID1
, b.ID2
from
df a
join df b
on a.ID1 + b.ID2 <= 9
group by
1 = 1
having
a.Value1 + b.Value2 = min(a.Value1 + b.Value2)
')
#> Value1 Value2 ID1 ID2
#> 1 2 1 1 6
Created on 2021-11-15 by the reprex package (v2.0.1)
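Another one-liner (using filter from dplyr). The ID condition is applied first, so that the minimum is taken only over the rows that satisfy it:
filter(filter(transform(df, new = Value1 + Value2), ID1 + ID2 <= 9), new == min(new))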
Is it possible to do something like this in R (assuming both df1 and df2 have the same number of rows)?
if (df1$var1 = 8) df2$var1 = 1.
if (df1$var2 = 9) df2$var2 = 1.
A simple two-line solution uses base R's ifelse function:
df1 <- data.frame(var1 = c(1:10), var2 = c(1:10))
df2 <- data.frame(var1 = c(1:10), var2 = c(1:10))
df2$var1 <- ifelse(df1$var1 == 8, 1,df2$var1)
df2$var2 <- ifelse(df1$var2 == 9, 1,df2$var2)
Here is one simple option in base R, where we replicate the values 8, 9 to make the lengths the same and compare them with the subset of columns of 'df1', resulting in a logical matrix. We then subset 'df2' with that matrix and assign 1 to the matching cells:
nm1 <- c('var1', 'var2')
df2[nm1][df1[nm1] == c(8, 9)[col(df1[nm1])]] <- 1
df2
# var1 var2 var3
#1 5 1 1
#2 3 1 2
#3 1 3 3
#4 1 4 4
#5 4 2 5
Or this can be done in two steps
df2$var1[df1$var1 == 8] <- 1
df2$var2[df1$var2 == 9] <- 1
Or using Map
df2[nm1] <- Map(function(x, y, z) replace(x, y == z, 1),
df2[nm1], df1[nm1], c(8, 9))
An if/else loop can also be used, but if is not vectorized, i.e. it expects a condition of length 1. With an explicit loop over rows and columns it can be done (though it would be inefficient in R):
vals <- c(8, 9)
for(i in seq_len(nrow(df1))) {
for(j in seq_along(nm1)) {
if(df1[[nm1[j]]][i] == vals[j]) df2[[nm1[j]]][i] <- 1
}
}
data
df1 <- data.frame(var1 = c(1, 3, 8, 5, 2), var2 = c(9, 3, 1, 8, 4),
var3 = 1:5)
df2 <- data.frame(var1 = c(5, 3, 2, 1, 4), var2 = c(3, 1, 3, 4, 2),
var3 = 1:5)
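As a quick sanity check on the sample data, the ifelse version and the logical-index assignment can be run side by side and compared. This is only a sketch; res_ifelse and res_index are made-up names.

```r
df1 <- data.frame(var1 = c(1, 3, 8, 5, 2), var2 = c(9, 3, 1, 8, 4), var3 = 1:5)
df2 <- data.frame(var1 = c(5, 3, 2, 1, 4), var2 = c(3, 1, 3, 4, 2), var3 = 1:5)

# Variant 1: ifelse() builds each whole column in one vectorized call
res_ifelse <- df2
res_ifelse$var1 <- ifelse(df1$var1 == 8, 1, df2$var1)
res_ifelse$var2 <- ifelse(df1$var2 == 9, 1, df2$var2)

# Variant 2: overwrite only the positions where the condition holds
res_index <- df2
res_index$var1[df1$var1 == 8] <- 1
res_index$var2[df1$var2 == 9] <- 1

# Both variants should agree
all.equal(res_ifelse, res_index)
```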
I have a list of lists containing multiple data frames. I would like to transpose the data frames and leave the lists structured as is.
The data is set up in this format (from: John McDonnell):
parent <- list(
a = list(
foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))
),
b = list(
foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))
)
)
This works when a single list of data frames is used, but not for a list of lists:
a_tran <- lapply(a, function(x) {
t(x)
})
Any thoughts on how to modify?
You could use modify_depth from purrr
library(purrr)
modify_depth(.x = parent, .depth = 2, .f = ~ as.data.frame(t(.)))
#$a
#$a$foo
# V1 V2 V3
#first 1 2 3
#second 4 5 6
#$a$bar
# V1 V2 V3
#first 1 2 3
#second 4 5 6
#$a$puppy
# V1 V2 V3
#first 1 2 3
#second 4 5 6
#$b
# ...
A base R option that @hrbrmstr initially posted in a comment would be
lapply(parent, function(x) lapply(x, function(y) as.data.frame(t(y))))
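If the nesting depth is not known in advance, the nested lapply can be generalised with a small recursive helper. This is a sketch only; transpose_nested is a hypothetical name, and it assumes that every data frame in the structure should be transposed and everything else left untouched.

```r
transpose_nested <- function(x) {
  if (is.data.frame(x)) {
    as.data.frame(t(x))            # a data frame is a leaf: transpose it
  } else if (is.list(x)) {
    lapply(x, transpose_nested)    # a plain list: recurse, keeping the names
  } else {
    x                              # anything else is returned unchanged
  }
}

parent <- list(
  a = list(
    foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
    bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))
  ),
  b = list(
    foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))
  )
)

res <- transpose_nested(parent)
str(res, max.level = 2)
```

The is.data.frame check has to come before is.list, because a data frame is itself a list.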
I have a data frame where some columns have the same data, but different column names. I would like to remove duplicated columns, but merge the column names. An example, where test1 and test4 columns are duplicates:
df
test1 test2 test3 test4
1 1 1 0 1
2 2 2 2 2
3 3 4 4 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
and I would like the result to be something like this:
df
test1+test4 test2 test3
1 1 1 0
2 2 2 2
3 3 4 4
4 4 4 4
5 5 5 5
6 6 6 6
Here is the data:
structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4,
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4,
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA,
-6L), class = "data.frame")
Please note that I do not simply want to remove duplicated columns. I also want to merge the column names of the duplicated columns, after the duplicates are removed.
I could do it manually for the simple table I posted, but I want to use this on large datasets, where I don't know in advance which columns are identical. I do not want to remove and rename columns manually, since I might have over 50 duplicated columns.
OK, improving on the above answer using the idea from here. Save the duplicate and non-duplicate columns into data frames. Check whether each non-duplicate matches any of the duplicates, and if so concatenate their column names. So this will now work if you have more than two duplicate columns.
Edited: Changed summary to digest. This helps with character data.
df <- structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4,
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4,
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA,
-6L), class = "data.frame")
library(digest)
nondups <- df[!duplicated(lapply(df, digest))]
dups <- df[duplicated(lapply(df, digest))]
# Append the name of every dropped duplicate to the column it duplicates
for (i in seq_along(nondups)) {
  for (j in seq_along(dups)) {
    if (identical(nondups[[i]], dups[[j]])) {
      names(nondups)[i] <- paste(names(nondups)[i], names(dups)[j], sep = "+")
    }
  }
}
nondups
Example 2, as a function.
Edited: Changed summary to digest and return non-duplicated and duplicated data frames.
age <- 18:29
height <- c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender <- c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe <- data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender, gender3 = gender)
dupcols <- function(df = testframe){
nondups <- df[!duplicated(lapply(df, digest))]
dups <- df[duplicated(lapply(df, digest))]
  # Append the name of every dropped duplicate to the column it duplicates
  for (i in seq_along(nondups)) {
    for (j in seq_along(dups)) {
      if (identical(nondups[[i]], dups[[j]])) {
        names(nondups)[i] <- paste(names(nondups)[i], names(dups)[j], sep = "+")
      }
    }
  }
return(list(df1 = nondups, df2 = dups))
}
dupcols(df = testframe)
Edited: This section is new.
Example 3: On a large data frame
#Creating a 1500 column by 15000 row data frame
dat <- do.call(data.frame, replicate(1500, rep(FALSE, 15000), simplify=FALSE))
names(dat) <- 1:1500
#Fill the data frame with LETTERS across the rows
#This part may take a while. Took my PC about 23 minutes.
start <- Sys.time()
fill <- rep(LETTERS, times = ceiling((15000*1500)/26))
j <- 0
for(i in 1:nrow(dat)){
dat[i,] <- fill[(1+j):(1500+j)]
j <- j + 1500
}
difftime(Sys.time(), start, units = "mins")
#Run the function on the created data set
#This took about 4 minutes to complete on my PC.
start <- Sys.time()
result <- dupcols(df = dat)
difftime(Sys.time(), start, units = "mins")
names(result$df1)
ncol(result$df1)
ncol(result$df2)
It's not completely automated, but the output of the loop will identify pairs of duplicate columns. You'll then have to remove one of the duplicate columns and then re-name based on what columns were duplicates.
df <- structure(list(test1 = c(1, 2, 3, 4, 5, 6), test2 = c(1, 2, 4,
4, 5, 6), test3 = c(0, 2, 4, 4, 5, 6), test4 = c(1, 2, 3, 4,
5, 6)), .Names = c("test1", "test2", "test3", "test4"), row.names = c(NA,
-6L), class = "data.frame")
# Compare each pair of columns once and print the indices of identical pairs
for (i in 1:(ncol(df) - 1)) {
  for (j in (i + 1):ncol(df)) {
    if (identical(df[[i]], df[[j]])) print(paste(i, j, sep = " + "))
  }
}
new <- df[,-4]
names(new)[1] <- paste(names(df[1]), names(df[4]), sep = "+")
new
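The manual drop-and-rename step can also be automated in base R without the digest package. The sketch below is one way to do it, assuming that "duplicate" means identical() column contents; merge_dup_names is a hypothetical helper name. It compares every pair of columns, so it may be slow on very wide data frames.

```r
merge_dup_names <- function(df) {
  n <- ncol(df)
  # first[j] = index of the earliest column that is identical to column j
  first <- seq_len(n)
  for (j in seq_len(n)) {
    for (i in seq_len(j - 1)) {
      if (identical(df[[i]], df[[j]])) {
        first[j] <- first[i]
        break
      }
    }
  }
  keep <- which(first == seq_len(n))   # columns that are not copies of an earlier one
  out <- df[keep]
  # Join the names of all columns that share the same first occurrence
  names(out) <- vapply(
    keep,
    function(k) paste(names(df)[first == k], collapse = "+"),
    character(1)
  )
  out
}

df <- data.frame(
  test1 = c(1, 2, 3, 4, 5, 6),
  test2 = c(1, 2, 4, 4, 5, 6),
  test3 = c(0, 2, 4, 4, 5, 6),
  test4 = c(1, 2, 3, 4, 5, 6)
)
res <- merge_dup_names(df)
names(res)  # "test1+test4" "test2" "test3"
```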