I'm struggling with something that might turn out to be super easy.
What i'd like is some short and efficient code to create a dataframe where each column is made up of V1, V1 * 2, V1 * 3... and so on until a set number of columns is reached.
For example, if my V1 is this:
V1=rep(10000,1000)
I'd like a code to automatically generate additional columns such as V2 and V3
V2=V1*2
V3=V1*3
and bind them together in a dataframe to give
d=data.frame(V1,V2,V3)
d
Should this be done with a loop? Tried a bunch of things but am not the best at looping and at the moment I feel rather stuck.
Ideally I'd like my vector V1 to be:
V1=rep(10000,10973)
and to form a dataframe with 17 columns.
Thanks!
Use sapply to create multiple columns. Here, I am creating 17 columns where 1 to 10 are sequentially multiplied by 1, 2, ..., 17. Use as.data.frame to convert to data.frame object.
sapply(1:17, function(x) x * 1:10) |>
as.data.frame()
output
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
2 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34
3 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51
4 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 68
5 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85
6 6 12 18 24 30 36 42 48 54 60 66 72 78 84 90 96 102
7 7 14 21 28 35 42 49 56 63 70 77 84 91 98 105 112 119
8 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128 136
9 9 18 27 36 45 54 63 72 81 90 99 108 117 126 135 144 153
10 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170
In your case, you would need:
sapply(1:17, function(x) x * rep(10000, 10973)) |>
as.data.frame()
We could use outer
as.data.frame( outer(1:10, 1:17))
Related
I'm managing with a data table. I have 13 * 2598893 data table, and I'm trying to make new column filled with character calculated based on another column.
So i made a function, and applied it to 'for in' loops, with those millions of rows. And it takes forever! I waited for some minutes, and I could not distinguish it from system down.
I tried it for just 10 rows, and the loops and function works well fastly. But when I extend it to other rows, it takes forever, again.
str(eco)
'data.frame': 2598893 obs. of 13 variables:
made function like this
check<-function(x){
if(x<=15){
return(1)
}
else{
return(0)
}
}
And applied loops like this.
for(x in c(1:nrow(eco))){eco[x,13]<-check(eco[x,4])}
And it continues and continues to work.
How can I shorten this work? Or is this just the limit of R that I should endure?
You should probably try to vectorize your operations (NB: for loops can often times be avoided in R). In addition, you could check out the data.table package to further improve efficiency:
library(data.table)
set.seed(1)
## create data.table
eco <- as.data.table(matrix(sample(1:100, 13 * 2598893, replace = TRUE), ncol = 13))
## update column
system.time(
set(eco, j = 13L, value = 1 * (eco[[4]] <= 15))
)
#> user system elapsed
#> 0.018 0.016 0.033
eco
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
#> 1: 68 74 55 62 82 51 42 18 16 12 50 73 0
#> 2: 39 97 53 61 21 25 79 71 85 19 54 30 0
#> 3: 1 89 62 42 5 90 33 77 31 1 59 26 0
#> 4: 34 22 27 4 36 74 65 45 46 67 74 34 1
#> 5: 87 57 88 4 42 26 9 13 64 32 16 15 1
#> ---
#> 2598889: 91 59 78 28 98 98 13 87 88 46 66 85 0
#> 2598890: 82 60 87 60 49 25 10 9 97 78 61 91 0
#> 2598891: 19 2 100 75 66 88 12 46 94 32 69 56 0
#> 2598892: 18 47 22 87 23 79 56 99 13 29 15 46 0
#> 2598893: 47 30 8 8 9 80 49 78 20 43 86 11 1
I have a dataframe and I would like to create a dataframe column based on the groupby on another column. The group by should be in increments of 50 on the column and the label should be the middle number in the group numbers. I am demonstrating this here with a reproducible example.
Here is the dataframe
das <- data.frame(val=1:27,
weigh=c(20,25,37,38,50,52,56,59,64,68,69,70,75,76,82,85,90,100,109,150,161,178,181,179,180,201,201))
val weigh
1 1 20
2 2 25
3 3 37
4 4 38
5 5 50
6 6 52
7 7 56
8 8 59
9 9 64
10 10 68
11 11 69
12 12 70
13 13 75
14 14 76
15 15 82
16 16 85
17 17 90
18 18 100
19 19 109
20 20 150
21 21 161
22 22 178
23 23 181
24 24 179
25 25 180
26 26 201
27 27 201
The desired output will be
val weigh label
1 1 20 45
2 2 25 45
3 3 37 45
4 4 38 45
5 5 50 45
6 6 52 45
7 7 56 45
8 8 59 45
9 9 64 45
10 10 68 45
11 11 69 45
12 12 70 45
13 13 75 95
14 14 76 95
15 15 82 95
16 16 85 95
17 17 90 95
18 18 100 95
19 19 109 95
20 20 150 145
21 21 161 145
22 22 178 195
23 23 181 195
24 24 179 195
25 25 180 195
26 26 201 195
27 27 201 195
Here the 45 is calculate by 20+ (20+50) /2 = 45, where 20 is where it start and 20+50 = 70 is where this group need to stop. And the label is the middle number between 20 and 70 which is 45.
Similarly with other labels
70+(70+50)/2= 95
120 + (170)/2= 145
170 + (220)/2 = 195
I am new to R and tried looking at many sources here, but I couldn't find anything that will do something like this. The closest I could find is grouping like this using cut2
df %>% mutate(label = as.numeric(cut2(weigh, g=5)))
library(dplyr)
# create your breaks
breaks = unique(c(seq(min(das$weigh), max(das$weigh)+1, 50), max(das$weigh)+1))
das %>%
group_by(group = cut(weigh, breaks, right=F)) %>% # group by intervals
mutate(group2 = as.numeric(group), # use the intervals as a number
label = (breaks[group2]+breaks[group2]+50)/2) %>% # call the corresponding break value and calculate your label
ungroup()
# # A tibble: 27 x 5
# val weigh group group2 label
# <int> <dbl> <fct> <dbl> <dbl>
# 1 1 20 [20,70) 1 45
# 2 2 25 [20,70) 1 45
# 3 3 37 [20,70) 1 45
# 4 4 38 [20,70) 1 45
# 5 5 50 [20,70) 1 45
# 6 6 52 [20,70) 1 45
# 7 7 56 [20,70) 1 45
# 8 8 59 [20,70) 1 45
# 9 9 64 [20,70) 1 45
#10 10 68 [20,70) 1 45
# # ... with 17 more rows
You can remove any unnecessary columns. I left them there just to make easier to understand how the process works.
Sample data
df <- data.frame(loc.id = rep(1:5, each = 6), day = sample(1:365,30),
ref.day1 = rep(c(20,30,50,80,90), each = 6),
ref.day2 = rep(c(10,28,33,49,67), each = 6),
ref.day3 = rep(c(31,49,65,55,42), each = 6))
For each loc.id, if I want to keep days that are >= then ref.day1, I do this:
df %>% group_by(loc.id) %>% dplyr::filter(day >= ref.day1)
I want to make 3 data frames, each whose rows are filtered by ref.day1, ref.day2,ref.day3 respectively
I tried this:
col.names <- c("ref.day1","ref.day2","ref.day3")
temp.list <- list()
for(cl in seq_along(col.names)){
col.sub <- col.names[cl]
columns <- c("loc.id","day",col.sub)
df.sub <- df[,columns]
temp.dat <- df.sub %>% group_by(loc.id) %>% dplyr::filter(day >= paste0(col.sub)) # this line does not work
temp.list[[cl]] <- temp.dat
}
final.dat <- rbindlist(temp.list)
I was wondering how to refer to columns by names and paste function in dplyr in order to filter it out.
The reason why your original code doesn't work is that your col.names are strings, but dplyr function uses non-standard evaluation which doesn't accept strings. So you need to convert the string into variables.rlang::sym() can do that.
Also, you can use map function in purrr package, which is much more compact:
library(dplyr)
library(purrr)
col_names <- c("ref.day1","ref.day2","ref.day3")
map(col_names,~ df %>% dplyr::filter(day >= UQ(rlang::sym(.x))))
#it will return you a list of dataframes
By the way I removed group_by() because they don't seem to be useful.
Returned result:
[[1]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 362 20 10 31
2 1 69 20 10 31
3 1 65 20 10 31
4 1 88 20 10 31
5 1 142 20 10 31
6 2 355 30 28 49
7 2 255 30 28 49
8 2 136 30 28 49
9 2 156 30 28 49
10 2 194 30 28 49
11 2 204 30 28 49
12 3 129 50 33 65
13 3 254 50 33 65
14 3 279 50 33 65
15 3 201 50 33 65
16 3 282 50 33 65
17 4 351 80 49 55
18 4 114 80 49 55
19 4 338 80 49 55
20 4 283 80 49 55
21 5 199 90 67 42
22 5 141 90 67 42
23 5 241 90 67 42
24 5 187 90 67 42
[[2]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 16 20 10 31
2 1 362 20 10 31
3 1 69 20 10 31
4 1 65 20 10 31
5 1 88 20 10 31
6 1 142 20 10 31
7 2 355 30 28 49
8 2 255 30 28 49
9 2 136 30 28 49
10 2 156 30 28 49
11 2 194 30 28 49
12 2 204 30 28 49
13 3 129 50 33 65
14 3 254 50 33 65
15 3 279 50 33 65
16 3 201 50 33 65
17 3 282 50 33 65
18 4 351 80 49 55
19 4 114 80 49 55
20 4 338 80 49 55
21 4 283 80 49 55
22 4 79 80 49 55
23 5 199 90 67 42
24 5 67 90 67 42
25 5 141 90 67 42
26 5 241 90 67 42
27 5 187 90 67 42
[[3]]
loc.id day ref.day1 ref.day2 ref.day3
1 1 362 20 10 31
2 1 69 20 10 31
3 1 65 20 10 31
4 1 88 20 10 31
5 1 142 20 10 31
6 2 355 30 28 49
7 2 255 30 28 49
8 2 136 30 28 49
9 2 156 30 28 49
10 2 194 30 28 49
11 2 204 30 28 49
12 3 129 50 33 65
13 3 254 50 33 65
14 3 279 50 33 65
15 3 201 50 33 65
16 3 282 50 33 65
17 4 351 80 49 55
18 4 114 80 49 55
19 4 338 80 49 55
20 4 283 80 49 55
21 4 79 80 49 55
22 5 199 90 67 42
23 5 67 90 67 42
24 5 141 90 67 42
25 5 241 90 67 42
26 5 187 90 67 42
You may also want to check these:
https://dplyr.tidyverse.org/articles/programming.html
Use variable names in functions of dplyr
I've a dataframe with 12 columns. For example, as below:
values <- matrix(1:120, nrow=10)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 11 21 31 41 51 61 71 81 91 101 111
2 12 22 32 42 52 62 72 82 92 102 112
3 13 23 33 43 53 63 73 83 93 103 113
4 14 24 34 44 54 64 74 84 94 104 114
5 15 25 35 45 55 65 75 85 95 105 115
6 16 26 36 46 56 66 76 86 96 106 116
7 17 27 37 47 57 67 77 87 97 107 117
8 18 28 38 48 58 68 78 88 98 108 118
9 19 29 39 49 59 69 79 89 99 109 119
Now, let's say I want to run a Linear Regression model lm() on a set of 3 columns each time (thus running the lm 4 times.. and each time giving a set of 3 columns as the dataframe for my each run of lm() function).
I was trying:
tapply(as.list(values), gl(ncol(values)/3, 3), lm())
and it said:
Error in terms.formula(formula, data = data):
argument is not a valid model
How do I solve this problem and run a linear model lm on a large dataframe by passing a specified number of columns as the input dataset for each run?
I have a data.table in R and I want to create a new column that finds the interval for every price of the respective year/month.
Reproducible example:
set.seed(100)
DT <- data.table(year=2000:2009, month=1:10, price=runif(5*26^2)*100)
intervals <- list(year=2000:2009, month=1:10, interval = sort(round(runif(9)*100)))
intervals <- replicate(10, (sample(10:100,100, replace=T)))
intervals <- t(apply(intervals, 1, sort))
intervals.dt <- data.table(intervals)
intervals.dt[, c("year", "month") := list(rep(2000:2009, each=10), 1:10)]
setkey(intervals.dt, year, month)
setkey(DT, year, month)
I have just tried:
merging the DT and intervals.dt data.tables by month/year,
creating a new intervalsstring column consisting of all the V* columns to
one column string, (not very elegant, I admit), and finally
substringing it to a vector, so as I can use it in findInterval() but the solution does not work for every row (!)
So, after:
DT <- merge(DT, intervals.dt)
DT <- DT[, intervalsstring := paste(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10)]
DT <- DT[, c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10") := NULL]
DT[, interval := findInterval(price, strsplit(intervalsstring, " ")[[1]])]
I get
> DT
year month price intervalsstring interval
1: 2000 1 30.776611 12 21 36 46 48 51 63 72 91 95 2
2: 2000 1 62.499648 12 21 36 46 48 51 63 72 91 95 6
3: 2000 1 53.581115 12 21 36 46 48 51 63 72 91 95 6
4: 2000 1 48.830599 12 21 36 46 48 51 63 72 91 95 5
5: 2000 1 33.066053 12 21 36 46 48 51 63 72 91 95 2
---
3376: 2009 10 33.635924 12 40 45 48 50 65 75 90 96 97 2
3377: 2009 10 38.993769 12 40 45 48 50 65 75 90 96 97 3
3378: 2009 10 75.065820 12 40 45 48 50 65 75 90 96 97 8
3379: 2009 10 6.277403 12 40 45 48 50 65 75 90 96 97 0
3380: 2009 10 64.189162 12 40 45 48 50 65 75 90 96 97 7
which is correct for the first rows, but not for the last (or other) rows.
For example, for the row 3380, the price ~64.19 should be in the 5th interval and not the 7th. I guess my mistake is that by my last command, finding Intervals relies only on the first row of intervalsstring.
Thank you!
You have to us the argument by = year to apply the function to all subsets:
DT[, interval := findInterval(price, intervals[as.character(year), ]), by = year]
year price interval
1: 2000 30.776611 4
2: 2001 25.767250 1
3: 2002 55.232243 4
4: 2003 5.638315 0
5: 2004 46.854928 2
---
3376: 2005 97.497761 10
3377: 2006 50.141227 5
3378: 2007 50.186270 7
3379: 2008 99.229338 10
3380: 2009 64.189162 8
Update (based on edited question):
DT[ , interval := findInterval(price,
unlist(intervals.dt[J(year[1], month[1]), 1:10])),
by = c("year", "month")]
year month price V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 interval
1: 2000 1 30.776611 12 21 36 46 48 51 63 72 91 95 2
2: 2000 1 62.499648 12 21 36 46 48 51 63 72 91 95 6
3: 2000 1 53.581115 12 21 36 46 48 51 63 72 91 95 6
4: 2000 1 48.830599 12 21 36 46 48 51 63 72 91 95 5
5: 2000 1 33.066053 12 21 36 46 48 51 63 72 91 95 2
---
3376: 2009 10 33.635924 12 40 45 48 50 65 75 90 96 97 2
3377: 2009 10 38.993769 12 40 45 48 50 65 75 90 96 97 3
3378: 2009 10 75.065820 12 40 45 48 50 65 75 90 96 97 8
3379: 2009 10 6.277403 12 40 45 48 50 65 75 90 96 97 0
3380: 2009 10 64.189162 12 40 45 48 50 65 75 90 96 97 7