Re-creating tibbles within purrr::map - r

I want to use map to apply a function to each column of a tibble.
However, I don't want the tibble columns to be simplified.
I could deal with that by re-creating tibbles with one column using imap.
However, how do I do that?
Let's use a super-simple function called test to see if it works.
Default behavior: columns simplified to vectors:
test<-function(data){
data
}
tibble(v1=1:20,v2=100:119) %>% map(test)
$v1
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
$v2
[1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
This doesn't give the names I want, because .y is taken literally as the column name:
tibble(v1=1:20,v2=100:119) %>% imap(~test(tibble(.y=.x))) %>% str
List of 2
$ v1: tibble [20 x 1] (S3: tbl_df/tbl/data.frame)
..$ .y: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
$ v2: tibble [20 x 1] (S3: tbl_df/tbl/data.frame)
..$ .y: int [1:20] 100 101 102 103 104 105 106 107 108 109 ...
So why does this not work?
tibble(v1=1:20,v2=100:119) %>% imap(~test(tibble(!!!(.y)=.x)))
Error: unexpected '=' in "tibble(v1=1:20,v2=100:119) %>% imap(~test(tibble(!!!(.y)="

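Change the = to the := operator and unquote .y with !! on the left-hand side of :=. R's parser does not accept an expression such as !!!(.y) on the left of =, which is why the attempt above fails with a syntax error.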
library(tibble)
library(purrr)
out <- tibble(v1=1:20,v2=100:119) %>%
imap( ~ test(tibble(!!.y := .x)))
Output:
str(out)
#List of 2
# $ v1: tibble [20 × 1] (S3: tbl_df/tbl/data.frame)
# ..$ v1: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
# $ v2: tibble [20 × 1] (S3: tbl_df/tbl/data.frame)
# ..$ v2: int [1:20] 100 101 102 103 104 105 106 107 108 109 ...
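To see the difference between = and := in isolation, here is a minimal sketch. The variable nm and the small three-row tibbles are made up for illustration; only tibble() and imap() come from the answer above.

```r
library(tibble)
library(purrr)

# A column name held in a variable (hypothetical `nm`)
nm <- "v1"

literal  <- tibble(nm = 1:3)      # the column is literally called "nm"
unquoted <- tibble(!!nm := 1:3)   # the column is called "v1"

names(literal)
names(unquoted)

# The same idea inside imap(), where .y holds each column's name
cols <- imap(tibble(v1 = 1:3, v2 = 4:6), ~ tibble(!!.y := .x))
names(cols$v2)
```

With =, the left-hand side is always read as a literal name; := exists so that the name can be computed.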

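You can also use setNames, wrapping each column in a one-column tibble first (calling setNames on the bare vector would name its elements rather than the column):
tibble::tibble(v1=1:20,v2=100:119) %>% purrr::imap(~test(tibble::tibble(.x)) %>% setNames(.y))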
#$v1
# A tibble: 20 x 1
# v1
# <int>
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
#...
#...
#$v2
# A tibble: 20 x 1
# v2
# <int>
# 1 100
# 2 101
# 3 102
# 4 103
# 5 104
# 6 105
#...
#...
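Another option, offered here only as a sketch and not taken from the answers above, is to skip the re-creation step and iterate over the column names instead. Single-bracket indexing on a tibble keeps the result a one-column tibble. The names df and one_col are made up for this example.

```r
library(tibble)

df <- tibble(v1 = 1:20, v2 = 100:119)

# Iterate over the names; df[nm] stays a tibble with its original column name
one_col <- lapply(
  setNames(names(df), names(df)),
  function(nm) df[nm]
)

str(one_col)
```

Each element can then be passed to a function such as test() without any renaming.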

Related

How to get rid of Qualitative predictors in Data set

I'm trying to apply range() to every quantitative predictor (7 total) while leaving out the 2 qualitative ones, instead of listing each quantitative column by hand.
require(ISLR)
data(Auto)
range(Auto$mpg)
range(Auto$cylinders)
range(Auto$displacement)
range(Auto$horsepower)
range(Auto$weight)
range(Auto$acceleration)
range(Auto$year)
Find out which columns are numeric with sapply, creating a logical index i. Then apply range to only those columns, again with sapply.
Also, there is no need to load an entire package just to access one of its data sets. The function data has a package argument that tells R where to find the data set to be loaded.
data(Auto, package = "ISLR")
str(Auto)
#> 'data.frame': 392 obs. of 9 variables:
#> $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
#> $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
#> $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
#> $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
#> $ weight : num 3504 3693 3436 3433 3449 ...
#> $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
#> $ year : num 70 70 70 70 70 70 70 70 70 70 ...
#> $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
i <- sapply(Auto, is.numeric)
t(sapply(Auto[i], range))
#> [,1] [,2]
#> mpg 9 46.6
#> cylinders 3 8.0
#> displacement 68 455.0
#> horsepower 46 230.0
#> weight 1613 5140.0
#> acceleration 8 24.8
#> year 70 82.0
#> origin 1 3.0
Created on 2023-02-05 with reprex v2.0.2
The following is a one-liner (output omitted).
t(sapply(Auto[sapply(Auto, is.numeric)], range))
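Note that origin appears in the output above because it is stored as a number, even though it is a qualitative predictor. A possible way to drop such a column is to switch it off in the logical index by name. The sketch below uses a small made-up data frame (auto_small) so that it runs without the ISLR package.

```r
# Hypothetical miniature of the Auto data: two quantitative columns,
# one numerically coded qualitative column, one factor
auto_small <- data.frame(
  mpg       = c(18, 15, 36),
  cylinders = c(8, 8, 4),
  origin    = c(1, 1, 3),
  name      = factor(c("a", "b", "c"))
)

num_cols <- vapply(auto_small, is.numeric, logical(1))
num_cols["origin"] <- FALSE   # numeric storage, but qualitative meaning

ranges <- t(vapply(auto_small[num_cols], range, numeric(2)))
ranges
```

vapply is used in place of sapply here only because it checks the shape of each result; sapply gives the same matrix.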

Appending/Growing List Elements Separately in r

I have a loop with an embedded function inside, which creates a list with the same elements but different length each time. I want the created list of elements to grow or merge in each loop. Here is a very simplified visualization of it:
# Code Body-------------------------------------------------------------
desiredList <- list()
for (i in 1:3){
  # "existingList" has three dataframes of A, B embedded
  # A is a dataframe with x1 and x2 columns
  # B is a vector
  # A, B values are calculated using "somefunction" &
  # in each loop their length differs
  existingList <- somefunction(variable[i])
  # "desiredList" should also have three dataframes of A, B
  # In each loop, generated Ai, Bi append to A, B elements of "desiredList"
  desiredList <- append(desiredList, (existingList))
}
# existingList in each loop--------------------------------------------
# i=1................................
A:'data.frame': 3 obs. of 2 variables:
..$ x1: num [1:3] 1 2 3
..$ x2: num [1:3] 13 26 39
B:'data.frame': 1 obs. of 1 variables:
..$ b: num [1:1] 2.6
# i=2................................
A:'data.frame': 2 obs. of 2 variables:
..$ x1: num [1:2] 4 5
..$ x2: num [1:2] 52 65
B:'data.frame': 3 obs. of 1 variables:
..$ b: num [1:3] 5.2 7.8 10.4
# i=3................................
A:'data.frame': 5 obs. of 2 variables:
..$ x1: num [1:5] 6 7 8 9 10
..$ x2: num [1:5] 78 91 104 117 130
B:'data.frame': 2 obs. of 1 variables:
..$ b: num [1:2] 13 15.6
# desiredList at the end of the loop
A:'data.frame': 10 obs. of 2 variables:
..$ x1: num [1:10] 1 2 3 4 5 6 7 8 9 10
..$ x2: num [1:10] 13 26 39 52 65 78 91 104 117 130
B:'data.frame': 6 obs. of 1 variables:
..$ b: num [1:6] 2.6 5.2 7.8 10.4 13 15.6
I have tried "append", "lapply", "Map", and a bunch of other functions, but none gives the desired result.
Your desired output requires "row-binding" the dataframes, as in:
library(tibble)
a1 <- tribble(
~A1, ~B1, ~C1,
1, 13 , 2.6,
2, 26 , 5.2,
3, 39 , 7.8,
7, 91 , 18.2,
8, 104, 20.8
)
a2 <- tribble(
~A2, ~B2,~C2,
4, 52, 10.4,
5, 65, 13,
6, 78, 15.6
)
a3 <- tribble(
~A3, ~B3, ~C3,
9, 117, 23.4,
10, 130, 26
)
out <- lapply(
X = list(a1, a2, a3),
FUN = function(x) `names<-`(x, substr(names(x), 1,1))
)
do.call("rbind", out)
#> # A tibble: 10 x 3
#> A B C
#> <dbl> <dbl> <dbl>
#> 1 1 13 2.6
#> 2 2 26 5.2
#> 3 3 39 7.8
#> 4 7 91 18.2
#> 5 8 104 20.8
#> 6 4 52 10.4
#> 7 5 65 13
#> 8 6 78 15.6
#> 9 9 117 23.4
#> 10 10 130 26
dplyr::bind_rows(out)
#> # A tibble: 10 x 3
#> A B C
#> <dbl> <dbl> <dbl>
#> 1 1 13 2.6
#> 2 2 26 5.2
#> 3 3 39 7.8
#> 4 7 91 18.2
#> 5 8 104 20.8
#> 6 4 52 10.4
#> 7 5 65 13
#> 8 6 78 15.6
#> 9 9 117 23.4
#> 10 10 130 26
Created on 2020-07-21 by the reprex package (v0.3.0)
If you have control of the data frame names and can avoid appending numbers to them, you can row bind them more easily. dplyr::bind_rows() is also an easy way to bind them.
One way to solve this is with the "Map" function. The only catch is that Map cannot combine an empty list with a list containing elements, so the first iteration needs special handling with an if-else statement. On the first pass, the desired list is set to the existing list; on the remaining passes, Map with rbind row-binds each new element onto the matching element of the desired list.
# Code Body-------------------------------------------------------------
desiredList <- list()
for (i in 1:3){
  # "existingList" generated using an external function
  existingList <- somefunction(variable[i])
  # "desiredList" generation
  if (i == 1){ # define the list in first loop
    desiredList <- existingList
  } else { # append the list
    desiredList <- Map(rbind, desiredList, existingList)
  }
}
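The same Map/rbind idea can be written without the first-iteration special case by collecting the per-iteration lists and folding them with Reduce. This is only a sketch: somefunction below is a made-up stand-in for the asker's function, and its values are arbitrary.

```r
# Hypothetical stand-in for the asker's somefunction():
# returns a list with data frames A and B whose lengths depend on k
somefunction <- function(k) {
  list(
    A = data.frame(x1 = seq_len(k), x2 = 13 * seq_len(k)),
    B = data.frame(b = 2.6 * seq_len(k))
  )
}

# One list per iteration
pieces <- lapply(1:3, somefunction)

# Reduce starts from the first piece, so no empty list is ever involved
desiredList <- Reduce(function(acc, new) Map(rbind, acc, new), pieces)

str(desiredList)
```

Because Reduce uses the first element as its starting value, the empty-list problem described above never arises.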

Error in ncol(xj) : object 'xj' not found when using R matplot()

Using matplot, I'm trying to plot the 2nd, 3rd and 4th columns of airquality data.frame after dividing these 3 columns by the first column of airquality.
However I'm getting an error
Error in ncol(xj) : object 'xj' not found
Why are we getting this error? The code below will reproduce this problem.
attach(airquality)
airquality[2:4] <- apply(airquality[2:4], 2, function(x) x /airquality[1])
matplot(x= airquality[,1], y= as.matrix(airquality[-1]))
You have managed to mangle your data in an interesting way. Start with airquality as it looks before you modify it. (And please don't attach() - it's unnecessary and sometimes dangerous/confusing.)
str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
After you do
airquality[2:4] <- apply(airquality[2:4], 2,
function(x) x /airquality[1])
you get
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R:'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 4.63 3.28 12.42 17.39 NA ...
$ Wind :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 0.18 0.222 1.05 0.639 NA ...
$ Temp :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 1.63 2 6.17 3.44 NA ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
or
sapply(airquality,class)
## Ozone Solar.R Wind Temp Month Day
## "integer" "data.frame" "data.frame" "data.frame" "integer" "integer"
that is, you have data frames embedded within your data frame!
rm(airquality) ## clean up
Now change one character and divide by the column airquality[,1] rather than airquality[1] (that is, divide by a vector, not a list of length one):
airquality[,2:4] <- apply(airquality[,2:4], 2,
function(x) x/airquality[,1])
matplot(x= airquality[,1], y= as.matrix(airquality[,-1]))
In general it's safer to use [, ...] indexing rather than [] indexing to refer to columns of a data frame unless you really know what you're doing ...
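The difference between the two indexing forms can be checked directly. The sketch below works on a copy named aq (a name chosen here for illustration) so the built-in data set is left untouched, and it uses [[1]] as one more way to get the column as a plain vector.

```r
aq <- datasets::airquality

cls_list <- class(aq[1])     # "data.frame": a one-column data frame
cls_vec  <- class(aq[, 1])   # "integer": the column as a plain vector

# Dividing a data frame by a vector of length nrow() works column by column
aq[2:4] <- aq[2:4] / aq[[1]]

sapply(aq, class)
```

After this, every column of aq is an ordinary numeric or integer vector, so as.matrix() and matplot() behave as expected.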

Find mean from subset of one column based on ranking in the top 50 of another column

I have a data frame that has the following columns:
> str(wbr)
'data.frame': 214 obs. of 12 variables:
$ countrycode : Factor w/ 214 levels "ABW","ADO","AFG",..: 1 2 3 4 5 6 7 8 9 10 ...
$ countryname : Factor w/ 214 levels "Afghanistan",..: 10 5 1 6 2 202 8 9 4 7 ...
$ gdp_per_capita : num 19913 35628 415 2738 4091 ...
$ literacy_female : num 96.7 NA 17.6 59.1 95.7 ...
$ literacy_male : num 96.9 NA 45.4 82.5 98 ...
$ literacy_all : num 96.8 NA 31.7 70.6 96.8 ...
$ infant_mortality : num NA 2.2 70.2 101.6 13.3 ...
$ illiteracy_female: num 3.28 NA 82.39 40.85 4.31 ...
$ illiteracy_mele : num 3.06 NA 54.58 17.53 1.99 ...
$ illiteracy_male : num 3.06 NA 54.58 17.53 1.99 ...
$ illiteracy_all : num 3.18 NA 68.26 29.42 3.15 ...
I would like to find the mean of illiteracy_all from the top 50 countries with the highest GDP.
Note that the data frame contains NA values, so to find a mean I have to write:
mean(wbr$illiteracy_all, na.rm=TRUE)
For a reproducible example, let's take:
data.df <- data.frame(x=101:120, y=rep(c(1,2,3,NA), times=5))
So how could I average the y values for e.g. the top 5 values of x?
> data.df
x y
1 101 1
2 102 2
3 103 3
4 104 NA
5 105 1
6 106 2
7 107 3
8 108 NA
9 109 1
10 110 2
11 111 3
12 112 NA
13 113 1
14 114 2
15 115 3
16 116 NA
17 117 1
18 118 2
19 119 3
20 120 NA
Any of the following would work:
mean(data.df[rank(-data.df$x)<=5,"y"], na.rm=TRUE)
mean(data.df$y[rank(-data.df$x)<=5], na.rm=TRUE)
with(data.df, mean(y[rank(-x)<=5], na.rm=TRUE))
To unpack why this works, note first that rank gives ranks in a different order to what you might expect, 1 being the rank of the smallest number not the largest:
> rank(data.df$x)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
We can get round that by negating the input:
> rank(-data.df$x)
[1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
So now ranks 1 to 5 are the "top 5". If we want a vector of TRUE and FALSE to indicate the position of the top 5 we can use:
> rank(-data.df$x)<=5
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14] FALSE FALSE TRUE TRUE TRUE TRUE TRUE
(In reality you might find you have some ties in your data set. This is only going to cause issues if the 50th position is tied. You might want to have a look at the ties.method argument for rank to see how you want to handle this.)
So let's grab the values of y in those positions:
> data.df[rank(-data.df$x)<=5,"y"]
[1] NA 1 2 3 NA
Or you could use:
> data.df$y[rank(-data.df$x)<=5]
[1] NA 1 2 3 NA
So now we know what to input into mean:
> mean(data.df[rank(-data.df$x)<=5,"y"], na.rm=TRUE)
[1] 2
Or:
> mean(data.df$y[rank(-data.df$x)<=5], na.rm=TRUE)
[1] 2
Or if you don't like repeating the name of the data frame, use with:
> with(data.df, mean(y[rank(-x)<=5], na.rm=TRUE))
[1] 2
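The remark about ties can be made concrete. In the sketch below (the vectors x and y are made up), the number of rows selected by rank depends on ties.method, whereas order always returns exactly the requested number of positions, at the cost of breaking ties by position.

```r
x <- c(10, 9, 9, 8)
y <- c(1, 2, 3, 4)

# Asking for the "top 2" when 2nd place is tied
n_average <- sum(rank(-x) <= 2)                       # ranks 1, 2.5, 2.5, 4 -> 1 row
n_min     <- sum(rank(-x, ties.method = "min") <= 2)  # ranks 1, 2, 2, 4     -> 3 rows

# order() always yields exactly two positions
top2 <- order(x, decreasing = TRUE)[1:2]
mean(y[top2], na.rm = TRUE)
```

Which behaviour is right depends on whether tied countries at the cutoff should all be included, all excluded, or split arbitrarily.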

Carc data from rda file to numeric matrix

I am trying to run KDA (kernel discriminant analysis) on the carc data, but when I call X<-data.frame(scale(X)); R shows this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
I tried as.numeric(as.matrix(carc)) and carc<-na.omit(carc), but neither helps.
library(ks);library(MASS);library(klaR);library(FSelector)
install.packages("klaR")
install.packages("FSelector")
library(ks);library(MASS);library(klaR);library(FSelector)
attach("carc.rda")
data<-load("carc.rda")
data
carc<-na.omit(carc)
head(carc)
class(carc) # check for its class
class(as.matrix(carc)) # change class, and
as.numeric(as.matrix(carc))
XX<-carc
X<-XX[,1:12];X.class<-XX[,13];
X<-data.frame(scale(X));
fit.pc<-princomp(X,scores=TRUE);
plot(fit.pc,type="line")
X.new<-fit.pc$scores[,1:5]; X.new<-data.frame(X.new);
cfs(X.class~.,cbind(X.new,X.class))
X.new<-fit.pc$scores[,c(1,4)]; X.new<-data.frame(X.new);
fit.kda1<-Hkda(x=X.new,x.group=X.class,pilot="samse",
bw="plugin",pre="sphere")
kda.fit1 <- kda(x=X.new, x.group=X.class, Hs=fit.kda1)
Can you help to resolve this problem and make this analysis?
Added: the car data set (Chambers, Cleveland, Kleiner & Tukey 1983)
> head(carc)
P M R78 R77 H R Tr W L T D G C
AMC_Concord 4099 22 3 2 2.5 27.5 11 2930 186 40 121 3.58 US
AMC_Pacer 4749 17 3 1 3.0 25.5 11 3350 173 40 258 2.53 US
AMC_Spirit 3799 22 . . 3.0 18.5 12 2640 168 35 121 3.08 US
Audi_5000 9690 17 5 2 3.0 27.0 15 2830 189 37 131 3.20 Europe
Audi_Fox 6295 23 3 3 2.5 28.0 11 2070 174 36 97 3.70 Europe
Here is a small data set with characteristics similar to what you describe,
which reproduces this error:
"Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric"
carc <- data.frame(type1=rep(c('1','2'), each=5),
type2=rep(c('5','6'), each=5),
x = rnorm(10,1,2)/10, y = rnorm(10),
stringsAsFactors = TRUE) # needed in R >= 4.0 to get factor columns
This should be similar to your data.frame
str(carc)
# 'data.frame': 10 obs. of 4 variables:
# $ type1: Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 2 2 2
# $ type2: Factor w/ 2 levels "5","6": 1 1 1 1 1 2 2 2 2 2
# $ x : num -0.1177 0.3443 0.1351 0.0443 0.4702 ...
# $ y : num -0.355 0.149 -0.208 -1.202 -1.495 ...
scale(carc)
# Similar error
# Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
Convert the factor columns to numeric using set() from data.table:
require(data.table)
DT <- data.table(carc)
cols_fix <- c("type1", "type2")
for (col in cols_fix) set(DT, j=col, value = as.numeric(as.character(DT[[col]])))
str(DT)
# Classes ‘data.table’ and 'data.frame': 10 obs. of 4 variables:
# $ type1: num 1 1 1 1 1 2 2 2 2 2
# $ type2: num 5 5 5 5 5 6 6 6 6 6
# $ x : num 0.0465 0.1712 0.1582 0.1684 0.1183 ...
# $ y : num 0.155 -0.977 -0.291 -0.766 -1.02 ...
# - attr(*, ".internal.selfref")=<externalptr>
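The as.character() step in the loop above matters. A short sketch with a made-up factor f shows what happens without it: as.numeric() on a factor returns the internal level codes, not the values printed as labels.

```r
f <- factor(c("5", "6", "5"))

codes  <- as.numeric(f)                 # level codes: 1 2 1
values <- as.numeric(as.character(f))   # label values: 5 6 5

codes
values
```

This only applies when the factor labels are genuinely numbers; for labels such as "US" or "Europe" the conversion gives NA with a warning.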
The first column(s) of your data set may be factors. Taking the data from corrgram:
library(corrgram)
carc <- auto
str(carc)
# 'data.frame': 74 obs. of 14 variables:
# $ Model : Factor w/ 74 levels "AMC Concord ",..: 1 2 3 4 5 6 7 8 9 10 ...
# $ Origin: Factor w/ 3 levels "A","E","J": 1 1 1 2 2 2 1 1 1 1 ...
# $ Price : int 4099 4749 3799 9690 6295 9735 4816 7827 5788 4453 ...
# $ MPG : int 22 17 22 17 23 25 20 15 18 26 ...
# $ Rep78 : num 3 3 NA 5 3 4 3 4 3 NA ...
# $ Rep77 : num 2 1 NA 2 3 4 3 4 4 NA ...
# $ Hroom : num 2.5 3 3 3 2.5 2.5 4.5 4 4 3 ...
# $ Rseat : num 27.5 25.5 18.5 27 28 26 29 31.5 30.5 24 ...
# $ Trunk : int 11 11 12 15 11 12 16 20 21 10 ...
# $ Weight: int 2930 3350 2640 2830 2070 2650 3250 4080 3670 2230 ...
# $ Length: int 186 173 168 189 174 177 196 222 218 170 ...
# $ Turn : int 40 40 35 37 36 34 40 43 43 34 ...
# $ Displa: int 121 258 121 131 97 121 196 350 231 304 ...
# $ Gratio: num 3.58 2.53 3.08 3.2 3.7 3.64 2.93 2.41 2.73 2.87 ...
So exclude them by trying this:
X<-XX[,3:14]
or this
X<-XX[,-(1:2)]
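If the positions of the non-numeric columns are not known in advance, one possible alternative is to select the numeric columns by type before scaling. The data frame carc_small below is made up for illustration and only loosely mimics the structure shown above.

```r
# Hypothetical miniature of the car data: one factor, two integer columns
carc_small <- data.frame(
  Origin = factor(c("A", "E", "J", "A")),
  Price  = c(4099L, 9690L, 6295L, 3799L),
  MPG    = c(22L, 17L, 23L, 22L)
)

# Keep only the columns that scale() can handle
is_num <- vapply(carc_small, is.numeric, logical(1))
X <- data.frame(scale(carc_small[is_num]))

names(X)
colMeans(X)
```

The class column can be kept separately (for example carc_small$Origin) and passed as the grouping variable later.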
