I have a function and a for-loop. I would like to run the same for-loop 3 times, e.g. with for(i in 1:3){}, and save the output of each run in a list under different names such as df.1, df.2, and df.3. Many thanks in advance.
library(tibble)

df <- tibble(a = rnorm(10), b = rnorm(10))
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
for (i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])
}
df
Expected Answer
DF.1
A tibble: 10 x 2
a b
<dbl> <dbl>
1 1 0.624
2 0 0.421
3 0.551 1
4 0.320 0.466
5 0.266 0.247
6 0.0261 0.103
7 0.127 0.519
8 0.588 0.0623
9 0.489 0
10 0.556 0.540
DF.2
A tibble: 10 x 2
a b
<dbl> <dbl>
1 1 0.624
2 0 0.421
3 0.551 1
4 0.320 0.466
5 0.266 0.247
6 0.0261 0.103
7 0.127 0.519
8 0.588 0.0623
9 0.489 0
10 0.556 0.540
DF.3
A tibble: 10 x 2
a b
<dbl> <dbl>
1 1 0.624
2 0 0.421
3 0.551 1
4 0.320 0.466
5 0.266 0.247
6 0.0261 0.103
7 0.127 0.519
8 0.588 0.0623
9 0.489 0
10 0.556 0.540
Put the for-loop code in a function and repeat it n times using replicate:
apply_fun <- function(df) {
  for (i in seq_along(df)) {
    df[[i]] <- rescale01(df[[i]])
  }
  df
}
result <- replicate(3, apply_fun(df), simplify = FALSE)
result will be a list of data frames.
If you want them as separate data frames, name the list and use list2env.
names(result) <- paste0('df.', seq_along(result))
list2env(result, .GlobalEnv)
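As a rough sketch of the same idea without the explicit loop or list2env, the rescaling can be applied column-wise with lapply and the copies kept in a named list. This assumes a plain data.frame instead of a tibble so that it runs in base R; the name scaled is only illustrative.

```r
# rescale01 is the function from the question
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

set.seed(1)
df <- data.frame(a = rnorm(10), b = rnorm(10))

# Rescale every column; assigning into scaled[] keeps the data frame structure
scaled <- df
scaled[] <- lapply(df, rescale01)

# Three copies in a named list (rescale01 is deterministic, so they are identical)
result <- replicate(3, scaled, simplify = FALSE)
names(result) <- paste0("df.", seq_along(result))

# Access one element by name instead of exporting it to the global environment
result[["df.2"]]
```

Keeping the data frames in the list and indexing by name is usually easier to loop over later than having separate objects in the global environment.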
I want to run KNN on the following two data frames. Below is a preview of the data (already scaled). age and lr_scale are the features and euRefVoteAfter is the outcome variable.
head(training)
# A tibble: 6 x 3
age lr_scale euRefVoteAfter
<dbl> <dbl> <dbl+lbl>
1 -1.20 -0.808 0 [Rejoin the EU]
2 1.25 -1.29 1 [Stay out of the EU]
3 0.636 0.886 0 [Rejoin the EU]
4 0.0245 -0.324 1 [Stay out of the EU]
5 -1.26 0.402 0 [Rejoin the EU]
6 -0.770 0.402 0 [Rejoin the EU]
> head(testing)
# A tibble: 6 x 3
age lr_scale euRefVoteAfter
<dbl> <dbl> <dbl+lbl>
1 -1.20 -0.808 0 [Rejoin the EU]
2 1.25 -1.29 1 [Stay out of the EU]
3 0.636 0.886 0 [Rejoin the EU]
4 0.0245 -0.324 1 [Stay out of the EU]
5 -1.26 0.402 0 [Rejoin the EU]
6 -0.770 0.402 0 [Rejoin the EU]
I ran the following code:
y_pred <- knn(train = training[, -3],
test = testing[, -3],
cl = training[,3],
k = 3,
prob = FALSE)
I got the message: 'train' and 'class' have different lengths.
I found a suggested fix for this error and tried again as follows:
v1=training[,3]
y_pred <- knn(train = training[, -3],
test = testing[, -3],
cl = v1,
k = 3,
prob = FALSE)
But the same error message occurred.
I'm sure the lengths of the variables are the same:
> length(training$euRefVoteAfter)
[1] 26026
> length(training$age)
[1] 26026
> length(training$lr_scale)
[1] 26026
If someone can help me with this problem, I'd really appreciate it.
I have a question on how to add the slopes of fitted lines, computed by category, as a new column in a data frame.
d1 <-read.csv(file.choose(), header = T)
d2 <- d1 %>%
group_by(ID)%>%
mutate(Slope=sapply(split(df,df$ID), function(v) lm(x~y,v)$coefficients["y"]))
ID x y
1 3.429865279 2.431363764
1 3.595066124 2.681241237
1 3.735263469 2.352182518
1 3.316473584 2.51851394
1 3.285984642 2.380211242
1 3.860793029 2.62324929
1 3.397714117 2.819543936
1 3.452997088 2.176091259
1 3.718933278 2.556302501
1 3.518566578 2.537819095
1 3.689033452 2.40654018
1 3.349160923 2.113943352
1 3.658888644 2.556302501
1 3.251151343 2.342422681
1 3.911194909 2.439332694
1 3.432584505 2.079181246
1 4.031267043 2.681241237
1 3.168733129 1.544068044
1 4.032239897 3.084576278
1 3.663361648 2.255272505
1 3.582302046 2.62324929
1 3.606585565 2.079181246
1 3.541791347 2.176091259
4 3.844012861 2.892094603
4 3.608318477 2.767155866
4 3.588990218 2.883661435
4 3.607957917 2.653212514
4 3.306753044 2.079181246
4 4.002604841 2.880813592
4 3.195299837 2.079181246
4 3.512203238 2.643452676
4 3.66878494 2.431363764
4 3.598910385 2.511883361
4 3.721810134 2.819543936
4 3.352964661 2.113943352
4 4.008109343 3.084576278
4 3.584693332 2.556302501
4 4.019461819 3.084576278
4 3.359474563 2.079181246
4 3.950256012 2.829303773
I got an error message like 'replacement has 2 rows, data has 119'. I am sure that the error comes from mutate().
Best,
Once you call group_by, any function that follows operates on the columns within each group of the grouped data.frame; in your case, lm will only use the x and y values within each ID.
If you only want one coefficient per group, it goes like this:
df %>% group_by(ID) %>% summarize(coef=lm(x~y)$coefficients["y"])
# A tibble: 2 x 2
ID coef
<int> <dbl>
1 1 0.437
2 4 0.660
If you want the coefficient repeated on every row, that is, a vector as long as the data frame, use mutate:
df %>% group_by(ID) %>% mutate(coef=lm(x~y)$coefficients["y"])
# A tibble: 40 x 4
# Groups: ID [2]
ID x y coef
<int> <dbl> <dbl> <dbl>
1 1 3.43 2.43 0.437
2 1 3.60 2.68 0.437
3 1 3.74 2.35 0.437
4 1 3.32 2.52 0.437
5 1 3.29 2.38 0.437
6 1 3.86 2.62 0.437
7 1 3.40 2.82 0.437
8 1 3.45 2.18 0.437
9 1 3.72 2.56 0.437
10 1 3.52 2.54 0.437
# … with 30 more rows
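For comparison, here is a minimal base R sketch of the split/sapply approach from the question. It illustrates why the original attempt failed: sapply returns one slope per group (2 values), not one per row, so the slopes have to be looked up by ID before being assigned as a column. The data below are simulated for illustration and are not the asker's file.

```r
set.seed(42)
# Simulated stand-in for the question's data: two groups, x roughly 2 * y
d <- data.frame(ID = rep(c(1, 4), each = 10), y = rnorm(20))
d$x <- 2 * d$y + rnorm(20, sd = 0.1)

# One slope per group: a named vector with one entry per ID
slopes <- sapply(split(d, d$ID), function(v) coef(lm(x ~ y, data = v))[["y"]])
slopes

# Expand to one value per row by looking each row's ID up in the named vector
d$Slope <- unname(slopes[as.character(d$ID)])
head(d)
```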
Why am I getting this error:
'train' and 'class' have different lengths
even though both of them have the same length?
y_pred=knn(train=training_set[,1:2],
test=Test_set[,-3],
cl=training_set[,3],
k=5)
Their dimensions are given below:
> dim(training_set[,-3])
[1] 300 2
> dim(training_set[,3])
[1] 300 1
> head(training_set)
# A tibble: 6 x 3
Age EstimatedSalary Purchased
<dbl> <dbl> <fct>
1 -1.77 -1.47 0
2 -1.10 -0.788 0
3 -1.00 -0.360 0
4 -1.00 0.382 0
5 -0.523 2.27 1
6 -0.236 -0.160 0
> Test_set
# A tibble: 100 x 3
Age EstimatedSalary Purchased
<dbl> <dbl> <fct>
1 -0.304 -1.51 0
2 -1.06 -0.325 0
3 -1.82 0.286 0
4 -1.25 -1.10 0
5 -1.15 -0.485 0
6 0.641 -1.32 1
7 0.735 -1.26 1
8 0.924 -1.22 1
9 0.829 -0.582 1
10 -0.871 -0.774 0
It's because knn expects cl to be a vector and you are giving it a table with one column (here, a tibble). The check knn performs is whether nrow(train) == length(cl). If cl is a one-column table, length(cl) returns the number of columns (1), not the number of rows, so the check fails. Compare:
> length(data.frame(a=c(1,2,3)))
[1] 1
> length(c(1,2,3))
[1] 3
If you use cl=training_set$Purchased, which extracts the vector from the table, that should fix it.
This is a specific gotcha if you are moving from data.frame to data.table (tibbles behave similarly), because the default drop behaviour is different:
> dt <- data.table(a=1:3, b=4:6)
> dt[,2]
b
1: 4
2: 5
3: 6
> df <- data.frame(a=1:3, b=4:6)
> df[,2]
[1] 4 5 6
> df[,2, drop=FALSE]
b
1 4
2 5
3 6
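The difference can be reproduced end to end with a small sketch. It assumes the class package (where knn lives) is installed, and uses simulated data with hypothetical column names; a one-column data frame created with drop = FALSE stands in for what training_set[, 3] returns on a tibble.

```r
library(class)  # provides knn()

set.seed(1)
# Simulated training and test sets (illustrative only)
train <- data.frame(Age = rnorm(20), Salary = rnorm(20),
                    Purchased = factor(rep(c(0, 1), 10)))
test <- data.frame(Age = rnorm(5), Salary = rnorm(5))

cl_table  <- train[, 3, drop = FALSE]  # one-column table, like tibble[, 3]
cl_vector <- train$Purchased           # plain factor vector

length(cl_table)   # number of columns
length(cl_vector)  # number of rows, which is what knn compares to nrow(train)

# Passing the vector satisfies the length check
y_pred <- knn(train = train[, 1:2], test = test, cl = cl_vector, k = 3)
y_pred
```

Using training_set[[3]] or training_set$Purchased extracts the vector regardless of whether the table is a data.frame, a tibble, or a data.table.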
I think my question is fairly simple to answer but I'm learning R so I'd like to know the best way to do it.
I've a dataset looking like this:
> print(agg_df41367)
# A tibble: 72 x 3
# Groups: hour [24]
hour predicted y
1 0 Feeding 0.121
2 0 Foraging 0.632
3 0 Standing 0.300
4 1 Feeding 0.141
5 1 Foraging 0.727
6 1 Standing 0.183
7 2 Feeding 0.0932
8 2 Foraging 0.817
9 2 Standing 0.133
10 3 Feeding 0.214
I would like to run a GLM model, so I'd like my data to look like:
head(agg_df41361_GLM)
hour Foraging Standing Feeding
0 0.632 0.300 0.121
1 0.727 0.183 0.141
2 0.817 0.133 0.0932
3 etc. etc. 0.214
Any ideas of what is the most compact way to do this? Ideally, I would like to use a for-loop to compute this transformation for multiple datasets. All my datasets follow a name format agg_df4136*. Any input is appreciated!
Here's a way to reshape the dataset you posted.
library(tidyr)
# example data
dt = read.table(text = "
hour predicted y
1 0 Feeding 0.121
2 0 Foraging 0.632
3 0 Standing 0.300
4 1 Feeding 0.141
5 1 Foraging 0.727
6 1 Standing 0.183
7 2 Feeding 0.0932
8 2 Foraging 0.817
9 2 Standing 0.133
", header=T)
spread(dt, predicted, y)
# hour Feeding Foraging Standing
# 1 0 0.1210 0.632 0.300
# 2 1 0.1410 0.727 0.183
# 3 2 0.0932 0.817 0.133
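In current versions of tidyr, spread() is superseded by pivot_wider(). A minimal sketch of the equivalent call, assuming tidyr 1.0.0 or later and rebuilding the example data as a data.frame:

```r
library(tidyr)

# Same example data as above, built directly
dt <- data.frame(
  hour = rep(0:2, each = 3),
  predicted = rep(c("Feeding", "Foraging", "Standing"), times = 3),
  y = c(0.121, 0.632, 0.300, 0.141, 0.727, 0.183, 0.0932, 0.817, 0.133)
)

# names_from supplies the new column names, values_from fills the cells
wide <- pivot_wider(dt, names_from = predicted, values_from = y)
wide
```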
If you have multiple datasets it's better to create a list of them and apply the reshaping process to each one of them:
library(tidyverse)
# example of list of dataframes
l = list(dt, dt, dt)
map(l, ~spread(., predicted, y))
# [[1]]
# hour Feeding Foraging Standing
# 1 0 0.1210 0.632 0.300
# 2 1 0.1410 0.727 0.183
# 3 2 0.0932 0.817 0.133
#
# [[2]]
# hour Feeding Foraging Standing
# 1 0 0.1210 0.632 0.300
# 2 1 0.1410 0.727 0.183
# 3 2 0.0932 0.817 0.133
#
# [[3]]
# hour Feeding Foraging Standing
# 1 0 0.1210 0.632 0.300
# 2 1 0.1410 0.727 0.183
# 3 2 0.0932 0.817 0.133
Note that here I'm using the same dataset (dt) as my 3 list elements, but it will work with different datasets, as long as you have the same column names.
If you want to create a list of all your datasets that start with the name pattern you provided you can do this:
# get objects that start with this name pattern
input_names = ls()[grepl("^agg_df4136", ls())]
# get the data that match those names
list_datasets = map(input_names, get)
So, list_datasets is a list of all dataframes in your environment with a name that starts with "agg_df4136".
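A base R variant of the lookup, sketched under the assumption that the datasets live in the current environment: ls() accepts a pattern argument directly, and mget() returns a list that is already named after the objects, which makes it easier to tell the results apart later. The toy data frames below are placeholders.

```r
# Placeholder objects following the naming pattern from the question
agg_df41361 <- data.frame(hour = 0:1, y = c(0.1, 0.2))
agg_df41367 <- data.frame(hour = 0:1, y = c(0.3, 0.4))
other_df    <- data.frame(z = 1)  # should not be picked up

# Look up matching names and fetch the objects from the same environment
here <- environment()
input_names   <- ls(envir = here, pattern = "^agg_df4136")
list_datasets <- mget(input_names, envir = here)

names(list_datasets)
```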
I would like to calculate the mean Euclidean distance between each item and all other items in a group within a data frame. I'd like to do this within the tidyverse, but can't seem to get it to work how I want.
Example data:
library(tidyverse)
DF <- data.frame(Item = letters[1:20], Grp = rep(1:4, each = 5),
x = runif(20, -0.5, 0.5),
y = runif(20, -0.5, 0.5))
I think Euclidean distances are calculated using:
sqrt(((x[i] - x[i + 1]) ^ 2) + ((y[i] - y[i + 1]) ^ 2))
I've tried, without success, to do something with mutate.
DF %>%
group_by(Grp, Item) %>%
mutate(Dist = mean(sqrt(((x - lag(x, default = x[1])) ^ 2) +
(y - lag(y, default = y[1])) ^ 2)))
But it doesn't work and only returns NAs.
# A tibble: 20 x 5
# Groups: Grp, Item [20]
Item Grp x y Dist
<fct> <int> <dbl> <dbl> <dbl>
1 a 1 -0.212 0.390 NA
2 b 1 0.288 0.193 NA
3 c 1 -0.0910 0.141 NA
4 d 1 0.383 0.494 NA
5 e 1 0.440 0.156 NA
6 f 2 -0.454 0.209 NA
7 g 2 0.0281 0.0441 NA
8 h 2 0.392 0.0941 NA
9 i 2 0.0514 -0.211 NA
10 j 2 -0.0434 -0.353 NA
11 k 3 0.457 0.463 NA
12 l 3 -0.0467 0.402 NA
13 m 3 0.178 0.191 NA
14 n 3 0.0726 0.295 NA
15 o 3 -0.397 -0.475 NA
16 p 4 0.400 -0.0222 NA
17 q 4 -0.254 0.258 NA
18 r 4 -0.458 -0.284 NA
19 s 4 -0.172 -0.182 NA
20 t 4 0.455 -0.268 NA
If I understand lag correctly it would still be sequential (if it worked), rather than computing distances between all pairs within a group.
How can I get the mean of all 4 distances for each item in a group?
Does anyone have any suggestions?
DF %>% group_by(Grp) %>%
mutate(Dist = colMeans(as.matrix(dist(cbind(x, y)))))
# # A tibble: 20 x 5
# # Groups: Grp [4]
# Item Grp x y Dist
# <fctr> <int> <dbl> <dbl> <dbl>
# 1 a 1 -0.197904299 0.363086055 0.4659160
# 2 b 1 0.090540444 -0.006314185 0.2031230
# 3 c 1 0.101018893 -0.025062949 0.2011672
# 4 d 1 0.006358616 -0.149784267 0.2323359
# 5 e 1 0.219596250 -0.341440596 0.3605274
# 6 f 2 -0.493124602 -0.002935820 0.5155365
# ...
To see how it works, start with one data subset and go piece by piece:
# run these one line at a time and have a look at ?dist
dd = DF[DF$Grp == "1", c("x", "y")]
dist(dd)
as.matrix(dist(dd))
colMeans(as.matrix(dist(dd)))
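One detail worth checking: the full distance matrix has zeros on its diagonal (each item's distance to itself), so colMeans divides by the group size n rather than by the n - 1 other items. If the mean over only the other items is wanted, as the question's "mean of all 4 distances" suggests, a possible sketch is to divide the column sums by n - 1. The helper name mean_dist_to_others is hypothetical, and a group with a single item would give NaN.

```r
set.seed(1)
DF <- data.frame(Item = letters[1:20], Grp = rep(1:4, each = 5),
                 x = runif(20, -0.5, 0.5),
                 y = runif(20, -0.5, 0.5))

# Mean distance from each point to the other points (self-distance excluded)
mean_dist_to_others <- function(x, y) {
  m <- as.matrix(dist(cbind(x, y)))
  colSums(m) / (nrow(m) - 1)
}

# Apply per group in base R
DF$Dist <- NA_real_
for (g in unique(DF$Grp)) {
  rows <- DF$Grp == g
  DF$Dist[rows] <- mean_dist_to_others(DF$x[rows], DF$y[rows])
}
head(DF)
```

The same helper can be used inside a grouped mutate, e.g. mutate(Dist = mean_dist_to_others(x, y)) after group_by(Grp).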