I have a dataframe like this
id v1 v2 v3 v4 v5 pos
1 11 12 11 10 10 3
2 17 11 22 40 23 4
1 11 22 50 10 10 2
I would like to change its values based on a condition related to pos to get:
id v1 v2 v3 v4 v5 pos
1 11 12 12 12 12 3
2 17 11 22 22 22 4
1 11 11 11 11 11 2
So basically, from the column indicated by pos onward, each value is replaced by the value in the preceding column (pos - 1).
Thx!
An approach using some indexing, which should be efficient in running time.
Not super efficient in terms of memory however, due to making a copy the same size as the input data:
vars <- paste0("v",1:5)
nv <- dat[vars][cbind(seq_len(nrow(dat)), dat$pos-1)]
ow <- col(dat[vars]) >= dat$pos
dat[vars][ow] <- nv[row(ow)[ow]]
# id v1 v2 v3 v4 v5 pos
#1 1 11 12 12 12 12 3
#2 2 17 11 22 22 22 4
#3 1 11 11 11 11 11 2
Explanation:
Get the variables of interest:
vars <- paste0("v",1:5)
Get the new values to overwrite for each row:
nv <- dat[vars][cbind(seq_len(nrow(dat)), dat$pos-1)]
Make a logical matrix of the cells to overwrite
ow <- col(dat[vars]) >= dat$pos
Overwrite the cells using a row identifier to pick the appropriate value.
dat[vars][ow] <- nv[row(ow)[ow]]
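As a small, self-contained illustration of the two indexing tricks used above (the toy matrix m and the vector pos below are made up for this sketch and are not the original data): indexing with a two-column matrix picks one cell per (row, column) pair, and comparing col(m) with a vector of length nrow(m) recycles that vector down each column, so every row is compared with its own pos.

```r
m <- matrix(1:12, nrow = 3)   # toy 3 x 4 matrix, filled column-wise
pos <- c(3, 4, 2)             # hypothetical per-row start positions

# One (row, column) pair per row: picks m[1, 2], m[2, 3], m[3, 1]
picked <- m[cbind(seq_len(nrow(m)), pos - 1)]
picked
# [1] 4 8 3

# Logical matrix: TRUE for the cells in columns >= pos of each row
ow <- col(m) >= pos

# row(ow)[ow] gives the row number of every TRUE cell,
# so each cell receives the value picked for its own row
m[ow] <- picked[row(ow)[ow]]
m
```

Note that this sketch, like the answer, assumes pos is at least 2 in every row; a pos of 1 would produce a column index of 0, which matrix indexing silently drops.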
Quick comparative timing using a larger dataset:
dat <- dat[rep(1:3,1e6),]
# indexing
# user system elapsed
# 1.36 0.31 1.68
# apply
# user system elapsed
# 77.30 0.83 78.41
# gather/spread
# user system elapsed
# 293.43 3.64 299.10
Here is one idea with gather and spread.
library(tidyverse)
dat2 <- dat %>%
rowid_to_column() %>%
gather(v, value, starts_with("v")) %>%
group_by(rowid) %>%
mutate(value = ifelse(row_number() >= (pos - 1), nth(value, (pos - 1)[[1]]), value)) %>%
spread(v, value) %>%
ungroup() %>%
select(names(dat))
dat2
# # A tibble: 3 x 7
# id v1 v2 v3 v4 v5 pos
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 11 12 12 12 12 3
# 2 2 17 11 22 22 22 4
# 3 1 11 11 11 11 11 2
DATA
dat <- read.table(text = "id v1 v2 v3 v4 v5 pos
1 11 12 11 10 10 3
2 17 11 22 40 23 4
1 11 22 50 10 10 2",
header = TRUE)
library(tidyverse)
Using apply from base R
data.frame(t(apply(dat, 1, function(x)
c(x[1:x["pos"]], rep(x[x["pos"]], ncol(dat) - x["pos"] - 1), x["pos"]))))
# X1 X2 X3 X4 X5 X6 X7
#1 1 11 12 12 12 12 3
#2 2 17 11 22 22 22 4
#3 1 11 11 11 11 11 2
Related
I am trying to identify column names with matching substrings, and then calculate the differences of the values in those columns.
Sample data:
V1_ABC <- c(1,2,3,4)
V2_ABC <- c(2,3,4,5)
V1_WXYZ <- c(10,11,12,13)
V2_WXYZ <- c(11,12,13,14)
Date <- c(2001,2002,2003,2004)
So df looks like:
df <- data.frame(Date, V1_ABC, V2_ABC, V1_WXYZ, V2_WXYZ)
Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ
1 2001 1 2 10 11
2 2002 2 3 11 12
3 2003 3 4 12 13
4 2004 4 5 13 14
I want to calculate V1 minus V2 for ABC and WXYZ. My original dataset is much larger, so I don't want to do this manually for each. I'd like to automate this so that R compares the column headers and finds which columns have the same ending substring (V1_ABC and V2_ABC, and V1_WXYZ and V2_WXYZ), then subtracts the V2_ from the V1_. Like this:
Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ dif_ABC dif_WXYZ
1 2001 1 2 10 11 -1 -1
2 2002 2 3 11 12 -1 -1
3 2003 3 4 12 13 -1 -1
4 2004 4 5 13 14 -1 -1
Most of the functions I have found such as grep or intersect either look for a specific string you input, or return the values where the vectors are the same.
Any ideas on how to automate pairing based on names/substrings?
You could stack V1 and V2 separately, calculate the differences, and reshape them back to the wide form. This approach can deal with any number of pairs of V1_xxx and V2_xxx.
library(tidyverse)
df %>%
pivot_longer(contains("_"), names_to = c(".value", "grp"), names_sep = "_") %>%
mutate(dif = V1 - V2) %>%
pivot_wider(names_from = grp, values_from = c(V1, V2, dif))
# # A tibble: 4 × 7
# Date V1_ABC V1_WXYZ V2_ABC V2_WXYZ dif_ABC dif_WXYZ
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2001 1 10 2 11 -1 -1
# 2 2002 2 11 3 12 -1 -1
# 3 2003 3 12 4 13 -1 -1
# 4 2004 4 13 5 14 -1 -1
Here is a base R solution. You mention that your data frame is large so this checks for columns where there are exactly 2 shared suffixes and only operates on those. It assumes that they are all of the format "V1_suffix" and "V2_suffix" but could be easily modified if they are in other formats.
suffixes <- unlist(regmatches(names(df), gregexpr("_.+", names(df))))
# Limit to suffixes where there are 2
suffixes <- names(table(suffixes)[table(suffixes) == 2])
diffs <- sapply(suffixes,
\(suffix) df[[paste0("V1", suffix)]] - df[[paste0("V2", suffix)]]
)
diff_df <- data.frame(diffs) |>
setNames(paste0("dif", suffixes))
cbind(df, diff_df)
# Date V1_ABC V2_ABC V1_WXYZ V2_WXYZ dif_ABC dif_WXYZ
# 1 2001 1 2 10 11 -1 -1
# 2 2002 2 3 11 12 -1 -1
# 3 2003 3 4 12 13 -1 -1
# 4 2004 4 5 13 14 -1 -1
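To illustrate how the base R approach might be adapted to other naming formats, here is a minimal sketch. The helper pair_diffs and its arguments a, b and sep are hypothetical names introduced for this example; it assumes that each pair shares a suffix after a known prefix and separator.

```r
# Hypothetical helper: a and b are the two prefixes, sep the separator
pair_diffs <- function(df, a = "V1", b = "V2", sep = "_") {
  a_prefix <- paste0(a, sep)
  # Columns that start with the first prefix, and their suffixes
  a_cols <- names(df)[startsWith(names(df), a_prefix)]
  suffixes <- substring(a_cols, nchar(a_prefix) + 1)
  # Keep only suffixes that also have a matching column for the second prefix
  suffixes <- suffixes[paste0(b, sep, suffixes) %in% names(df)]
  for (s in suffixes) {
    df[[paste0("dif", sep, s)]] <-
      df[[paste0(a, sep, s)]] - df[[paste0(b, sep, s)]]
  }
  df
}

df <- data.frame(Date = 2001:2004,
                 V1_ABC = 1:4, V2_ABC = 2:5,
                 V1_WXYZ = 10:13, V2_WXYZ = 11:14)
res <- pair_diffs(df)
names(res)
```

Using startsWith() and substring() instead of a regular expression avoids having to escape special characters in the prefixes.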
I have a dataframe
data <- data.frame(v=c(15,25,24), x_val=c(12,7,2), y_val=c(6,6,18))
I want the resulting data to look like this with the data repeated in rows a specified number of times (here 2 times).
v1 x1 y1 v2 x2 y2 v3 x3 y3
15 12 6 25 7 6 24 2 18
15 12 6 25 7 6 24 2 18
I managed to get the data all in one row with the right column names but I'm not sure how to extend the column to a specified length with the values repeated. Further, how can I do this without loops? I want to run this with a larger dataset which can be quite slow with loops.
My code is below which gives the values in a single row.
r=NULL
r<- as.data.frame(matrix(nrow=1, ncol=1))
n<-2
for (i in 1:nrow(data)){
datainarow <- data[i,]
r=cbind(r,as.data.frame(datainarow))
colnames(r)[n] <- paste0("v",i)
colnames(r)[n+1] <- paste0("x",i)
colnames(r)[n+2] <- paste0("y",i)
n <- n+3
}
Thank you!
You can use uncount in the tidyr package
If you already have your data in the single row format, just do:
n=4
data %>% tidyr::uncount(n)
# A tibble: 4 x 9
v1 v2 v3 x1 x2 x3 y1 y2 y3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 15 25 24 12 7 2 6 6 18
2 15 25 24 12 7 2 6 6 18
3 15 25 24 12 7 2 6 6 18
4 15 25 24 12 7 2 6 6 18
Here is one way to get that result from initial three row data frame
library(tidyverse)
n=4
data %>%
rename_all(~c("v","x","y")) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = v:y,names_sep = "") %>%
uncount(n)
This is a one-liner in base R
as.data.frame(t(as.vector(t(data))))[rep(1, 2),]
#> V1 V2 V3 V4 V5 V6 V7 V8 V9
#> 1 15 12 6 25 7 6 24 2 18
#> 1.1 15 12 6 25 7 6 24 2 18
Or if you wish to use the naming convention described, and have a more generalizable solution, you could use the following function:
expand_data <- function(data, reps) {
df <- as.data.frame(t(as.vector(t(data))))[rep(1, reps),]
names(df) <- paste(names(data), rep(seq(nrow(data)), each = ncol(data)), sep = "_")
rownames(df) <- NULL
df
}
which allows:
expand_data(data, 10)
v_1 x_val_1 y_val_1 v_2 x_val_2 y_val_2 v_3 x_val_3 y_val_3
1 15 12 6 25 7 6 24 2 18
2 15 12 6 25 7 6 24 2 18
3 15 12 6 25 7 6 24 2 18
4 15 12 6 25 7 6 24 2 18
5 15 12 6 25 7 6 24 2 18
6 15 12 6 25 7 6 24 2 18
7 15 12 6 25 7 6 24 2 18
8 15 12 6 25 7 6 24 2 18
9 15 12 6 25 7 6 24 2 18
10 15 12 6 25 7 6 24 2 18
Purpose
Suppose I have four variables: Two variables are original variables and the other two variables are the predictions of the original variables. (In actual data, there are a greater number of original variables)
I want to use for loop and mutate to create columns that compute the difference between the original and prediction variable. The sample data and the current approach are following:
Sample data
set.seed(10000)
id <- sample(1:20, 100, replace=T)
set.seed(10001)
dv.1 <- sample(1:20, 100, replace=T)
set.seed(10002)
dv.2 <- sample(1:20, 100, replace=T)
set.seed(10003)
pred_dv.1 <- sample(1:20, 100, replace=T)
set.seed(10004)
pred_dv.2 <- sample(1:20, 100, replace=T)
d <-
data.frame(id, dv.1, dv.2, pred_dv.1, pred_dv.2)
Current approach (with Error)
original <- d %>% select(starts_with('dv.')) %>% names(.)
pred <- d %>% select(starts_with('pred_dv.')) %>% names(.)
for (i in 1:length(original)){
d <-
d %>%
mutate(diff = original[i] - pred[i])
l <- length(d)
colnames(d[l]) <- paste0(original[i], '.diff')
}
Error: Problem with `mutate()` input `diff`.
x non-numeric argument to binary operator
ℹ Input `diff` is `original[i] - pred[i]`.
d %>%
mutate(
across(
.cols = starts_with("dv"),
.fns = ~ . - (get(paste0("pred_",cur_column()))),
.names = "diff_{.col}"
)
)
# A tibble: 100 x 7
id dv.1 dv.2 pred_dv.1 pred_dv.2 diff_dv.1 diff_dv.2
<int> <int> <int> <int> <int> <int> <int>
1 15 5 1 5 15 0 -14
2 13 4 4 5 11 -1 -7
3 12 20 13 6 13 14 0
4 20 11 8 13 3 -2 5
5 9 11 10 7 13 4 -3
6 13 3 3 6 17 -3 -14
7 3 12 19 6 17 6 2
8 19 6 7 11 4 -5 3
9 6 7 12 19 6 -12 6
10 13 10 15 6 7 4 8
# ... with 90 more rows
Subtraction can be applied on dataframes directly.
So you can create a vector of original column names and another vector of prediction column names and subtract them creating new columns.
orig_var <- grep('^dv', names(d), value = TRUE)
pred_var <- grep('pred', names(d), value = TRUE)
d[paste0(orig_var, '.diff')] <- d[orig_var] - d[pred_var]
d
# id dv.1 dv.2 pred_dv.1 pred_dv.2 dv.1.diff dv.2.diff
#1 15 5 1 5 15 0 -14
#2 13 4 4 5 11 -1 -7
#3 12 20 13 6 13 14 0
#4 20 11 8 13 3 -2 5
#5 9 11 10 7 13 4 -3
#...
#...
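For reference, the loop in the question fails because original[i] and pred[i] are character strings (column names), so the subtraction is attempted on two strings rather than on the columns they name. A minimal sketch of a loop that looks the columns up by name with [[ ]] follows; the three-row data frame is a made-up subset used only for illustration.

```r
# Small made-up data frame with the same column layout as in the question
d <- data.frame(id = c(15, 13, 12),
                dv.1 = c(5, 4, 20), dv.2 = c(1, 4, 13),
                pred_dv.1 = c(5, 5, 6), pred_dv.2 = c(15, 11, 13))

original <- grep("^dv\\.", names(d), value = TRUE)
pred <- paste0("pred_", original)

for (i in seq_along(original)) {
  # d[[name]] extracts the column itself, so the subtraction is numeric
  d[[paste0(original[i], ".diff")]] <- d[[original[i]]] - d[[pred[i]]]
}
d
```

The vectorised versions above do the same thing without a loop and are usually preferable.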
I have a function calculating something based on input variables, but I want to change the input variables depending on the result of the function. I feel that it is a very trivial question, but I have not found a working solution yet.
I have a dataset similar to this:
v1<-sample(5:12, 10, replace=T)
v2<-rep(100, 10)
v3<-rep(1,10)
v4<-sample(1:4, 10, replace=T)
t1<-sample(10:30, 10, replace=T)
And a function which uses the variables v1 to v4 to calculate something:
fun<-function(v1, v2, v3, v4){
#does complicated things
result<-((v2/(v1*v4))*v3)
return(result)
}
out<-fun(v1, v2, v3, v4)
df<-data.frame(v1, v2, v3, v4, t1, out)
I need t1 < out. For non-vectorized data this seems to work:
while (out < t1){
v3= v3 + 1
out<- fun(v1, v2, v3, v4)
}
Ideally the last value for v3 is stored somehow. The output I am looking for is a dataframe looking like this:
v1 v2 v3 v4 t1 out out_new v3_new
1 6 100 1 2 15 8.333333 16 2
2 12 100 1 2 17 4.166667 20 5
3 5 100 1 1 12 20.000000 20 1
4 10 100 1 4 26 2.500000 . .
5 8 100 1 2 15 6.250000 . .
6 6 100 1 3 18 5.555556
7 10 100 1 4 20 2.500000
8 11 100 1 2 12 4.545455
9 12 100 1 3 28 2.777778
10 6 100 1 2 25 8.333333
What I am looking for is a vectorized solution that works on a large (100.000 rows) dataframe/tibble. I have tried different approaches with dplyr (mutate with case_when, ifelse) but have failed to come up with a working solution.
There is a way to achieve this using dplyr::rowwise():
library(tidyverse)
v1<-sample(5:12, 10, replace=T)
v2<-rep(100, 10)
v3<-rep(1,10)
v4<-sample(1:4, 10, replace=T)
t1<-sample(10:30, 10, replace=T)
fun<-function(v1, v2, v3, v4){
#does complicated things
result<-(v1/v2)*v4*v3
return(result)
}
reallyWeirdFunction = function(v1, v2, v3, v4, t){
out<- fun(v1, v2, v3, v4)
while (out < t){
v3= v3 + 1
out<- fun(v1, v2, v3, v4)
}
out
}
df<-data.frame(v1, v2, v3, v4, t1)
df %>%
rowwise() %>%
mutate(
out=fun(v1, v2, v3, v4),
out2=reallyWeirdFunction(v1, v2, v3, v4, t1)
)
#> # A tibble: 10 x 7
#> # Rowwise:
#> v1 v2 v3 v4 t1 out out2
#> <int> <dbl> <dbl> <int> <int> <dbl> <dbl>
#> 1 7 100 1 3 17 0.21 17.0
#> 2 11 100 1 2 24 0.22 24.2
#> 3 12 100 1 1 11 0.12 11.0
#> 4 10 100 1 4 15 0.4 15.2
#> 5 10 100 1 2 22 0.2 22
#> 6 9 100 1 1 16 0.09 16.0
#> 7 5 100 1 2 24 0.1 24
#> 8 12 100 1 2 23 0.24 23.0
#> 9 8 100 1 3 30 0.24 30
#> 10 7 100 1 2 14 0.14 14.
Created on 2020-08-28 by the reprex package (v0.3.0)
However, it is usually a bad idea to use while loops in R, and rowwise() can be very slow on large datasets. You should definitely try to find a better algorithm than incrementing v3 like you are doing. I could not find one for your example though.
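If the real function is linear in v3, as the simplified fun written in the question is, the loop can in principle be replaced by a closed form: the smallest integer v3 with (v2/(v1*v4))*v3 >= t1 is ceiling(t1*v1*v4/v2). The sketch below assumes exactly that form of fun, positive inputs, and an integer starting v3; it will not carry over to a function that genuinely "does complicated things". The two rows of data are made up for illustration.

```r
# fun as written in the question (assumed linear in v3)
fun <- function(v1, v2, v3, v4) (v2 / (v1 * v4)) * v3

# Made-up rows for illustration
df <- data.frame(v1 = c(5, 9), v2 = 100, v3 = 1, v4 = c(3, 4), t1 = c(24, 12))

# Smallest integer v3 (never below the current v3) for which out >= t1,
# which is the condition under which the while loop stops
df$v3_new <- pmax(df$v3, ceiling(df$t1 * df$v1 * df$v4 / df$v2))
df$out_new <- fun(df$v1, df$v2, df$v3_new, df$v4)
df
```

Because this works on whole columns, it avoids both the while loop and rowwise(). Floating-point ties (where out equals t1 exactly) may be resolved differently from the loop, so it is worth checking a few such rows.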
With mapply() and some other definitions you can do in base R:
set.seed(42)
df <- data.frame(v1=sample(5:12, 10, replace=T), v2=rep(100, 10), v3=rep(1,10),
v4=sample(1:4, 10, replace=T), t1=sample(10:30, 10, replace=T))
fun<-function(v1, v2, v3, v4) {
#does complicated things
result <- ((v2/(v1*v4))*v3)
return(result)
}
fun2 <- function(v1, v2, v3, v4, t) {
out <- fun(v1, v2, v3, v4)
while (out < t){
v3 <- v3 + 1
out <- fun(v1, v2, v3, v4)
}
return(list(v3new=v3, out=out))
}
cbind(df, t(mapply(fun2, df$v1, df$v2, df$v3, df$v4, df$t1)))
# v1 v2 v3 v4 t1 v3new out
# 1 5 100 1 3 24 4 26.66667
# 2 9 100 1 4 12 5 13.88889
# 3 5 100 1 3 18 3 20
# 4 5 100 1 4 13 3 15
# 5 6 100 1 1 14 1 16.66667
# 6 8 100 1 1 22 2 25
# 7 6 100 1 2 14 2 16.66667
# 8 6 100 1 4 29 7 29.16667
# 9 5 100 1 2 11 2 20
# 10 12 100 1 2 17 5 20.83333
I really need your help regarding a problem which may seem easy to solve for you.
I am currently working on a project that involves some panel regressions. I have several large CSV files (up to 12 million entries per sheet) formatted as in the attached picture, where the columns (V1, V2) are individuals and the rows (1, 2, 3) are time identifiers.
In order to use the plm() function I need to convert all these files to the following data structure:
ID Time X1 X2
1 1 x1 x2
1 2 x1 x2
1 ... ... ...
2 1 x1 x2
2 2 ... ...
I really struggle with this transformation and I'm really frustrated right now i.e. where do I get the identifier and the time index from?
Would really appreciate if you could provide me with information how to solve this problem.
If my question is not clear to you, just ask.
Best regards and thanks in advance
The output should look as follows:
mydata<-structure(list(V1 = 10:13, V2 = 21:24, V3 = c(31L, 32L, 3L, 34L
)), .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA,
-4L))
> mydata
V1 V2 V3
1 10 21 31
2 11 22 32
3 12 23 3
4 13 24 34
The following code can be used for your data without changing anything. For illustration, I used just the above data. I used the base R reshape function
long <- reshape(mydata, idvar = "time", ids = row.names(mydata),
times = names(mydata), timevar = "id",
varying = list(names(mydata)),v.names="value", new.row.names = 1:((dim(mydata)[2])*(dim(mydata)[1])),direction = "long")
> long
id value time
1 V1 10 1
2 V1 11 2
3 V1 12 3
4 V1 13 4
5 V2 21 1
6 V2 22 2
7 V2 23 3
8 V2 24 4
9 V3 31 1
10 V3 32 2
11 V3 3 3
12 V3 34 4
long$id<-substr(long$id,2,4) # 4 is used to take into account your 416 variables
myout<-long[,c(1,3,2)]
> myout
id time value
1 1 1 10
2 1 2 11
3 1 3 12
4 1 4 13
5 2 1 21
6 2 2 22
7 2 3 23
8 2 4 24
9 3 1 31
10 3 2 32
11 3 3 3
12 3 4 34
Here is an alternative: Use Stacked from my "splitstackshape" package.
Here it is applied on @Metrics's sample data:
# install.packages("splitstackshape")
library(splitstackshape)
Stacked(cbind(id = 1:nrow(mydata), mydata),
id.vars="id", var.stubs="V", sep = "V")
# id .time_1 V
# 1: 1 1 10
# 2: 1 2 21
# 3: 1 3 31
# 4: 2 1 11
# 5: 2 2 22
# 6: 2 3 32
# 7: 3 1 12
# 8: 3 2 23
# 9: 3 3 3
# 10: 4 1 13
# 11: 4 2 24
# 12: 4 3 34
It is very fast even when your data are large. Here are the speeds for the 12MB dataset you linked to. The sorting is different but the data are the same.
It still isn't faster than stack though (but at some point, stack starts to slow down).
See the system.times below:
reshape()
system.time(out <- reshape(x, idvar = "time", ids = row.names(x),
times = names(x), timevar = "id",
varying = list(names(x)),
v.names="value",
new.row.names = 1:prod(dim(x)),
direction = "long"))
# user system elapsed
# 53.11 0.00 53.11
head(out)
# id value time
# 1 V1 0.003808635 1
# 2 V1 -0.018807416 2
# 3 V1 0.008875447 3
# 4 V1 0.001148695 4
# 5 V1 -0.019365004 5
# 6 V1 0.012436560 6
Stacked()
system.time(out2 <- Stacked(cbind(id = 1:nrow(x), x),
id.vars="id", var.stubs="V",
sep = "V"))
# user system elapsed
# 0.30 0.00 0.29
out2
# id .time_1 V
# 1: 1 1 0.003808635
# 2: 1 10 -0.014184635
# 3: 1 100 -0.013341843
# 4: 1 101 0.006784138
# 5: 1 102 0.006463707
# ---
# 963868: 2317 95 0.009569451
# 963869: 2317 96 0.002497771
# 963870: 2317 97 0.009202519
# 963871: 2317 98 0.017007545
# 963872: 2317 99 -0.002495842
stack()
system.time(out3 <- cbind(id = 1:nrow(x), stack(x)))
# user system elapsed
# 0.09 0.00 0.09
head(out3)
# id values ind
# 1 1 0.003808635 V1
# 2 2 -0.018807416 V1
# 3 3 0.008875447 V1
# 4 4 0.001148695 V1
# 5 5 -0.019365004 V1
# 6 6 0.012436560 V1
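To make explicit where the identifier and the time index come from, here is a minimal base R sketch that builds both by hand. It assumes, as in the question, that each column is one individual and each row is one time point; the name panel is introduced only for this example, and mydata is the small sample used earlier.

```r
mydata <- data.frame(V1 = 10:13, V2 = 21:24, V3 = c(31L, 32L, 3L, 34L))

panel <- data.frame(
  # individual = column number, repeated once per time point
  id = rep(seq_len(ncol(mydata)), each = nrow(mydata)),
  # time = row number, cycled once per individual
  time = rep(seq_len(nrow(mydata)), times = ncol(mydata)),
  # unlist() concatenates the columns one after another
  value = unlist(mydata, use.names = FALSE)
)
head(panel, 5)
```

The result has one row per (id, time) pair, sorted by id and then time, which is the layout plm() expects for its index.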