Extracting string from Named chr in R [duplicate]

I am looking for just the value of the B1 (newx) linear model coefficient, not the name. I just want the 1.0 value. I do not want the name "newx".
newx <- c(0.5,1.5,2.5)
newy <- c(2,3,4)
out <- lm(newy ~ newx)
out looks like:
Call:
lm(formula = newy ~ newx)
Coefficients:
(Intercept) newx
1.5 1.0
I arrived here. But now I am stuck.
out$coefficients["newx"]
newx
1.0

For a single element like this, use [[ rather than [. Compare:
coefficients(out)["newx"]
# newx
# 1
coefficients(out)[["newx"]]
# [1] 1
More generally, use unname():
unname(coefficients(out)[c("newx", "(Intercept)")])
# [1] 1.0 1.5
head(unname(mtcars))
# NA NA NA NA NA NA NA NA NA NA NA
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## etc.

If the goal is simply to remove the names, another way is to set them to NULL:
my_vec <- quantile(1:3, probs = c(0, 0.5, 1)) # quantile() returns a named vector
names(my_vec) <- NULL
my_vec
## [1] 1 2 3

An easy and rather direct way to do it is
as.numeric(out$coefficients["newx"])
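To see that the approaches suggested so far agree, here is a minimal sketch using the model from the question. It relies only on base R; the variable names (named, by_double_bracket and so on) are made up for illustration.

```r
newx <- c(0.5, 1.5, 2.5)
newy <- c(2, 3, 4)
out <- lm(newy ~ newx)

named             <- coef(out)["newx"]              # [ keeps the name "newx"
by_double_bracket <- coef(out)[["newx"]]            # [[ drops the name
by_unname         <- unname(coef(out)["newx"])      # unname() drops it explicitly
by_as_numeric     <- as.numeric(coef(out)["newx"])  # as.numeric() strips attributes, including names

names(named)              # the only one that still has a name
names(by_double_bracket)  # NULL
```

For a single coefficient, [[ is probably the most direct choice; unname() is more convenient when several values are extracted at once.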

Another way would be to use the broom package:
broom::tidy(out)$estimate[2] # row 1 is the intercept, row 2 is newx
# [1] 1

Related

Converting a character into a variable name

I need to convert a data frame into a matrix using the model.matrix function. The name of the original data frame is train, and the outcome variable of interest is called adequacy_ratio_total_percent. The below R code works.
X_train_matrix <- model.matrix(adequacy_ratio_total_percent ~ ., train)[, -1]
However, my outcome variable may vary, so I hoped to make it easy to swap by using the code below, which does not work.
list_outcome <- c("adequacy_ratio_total_percent")
X_train_matrix <- model.matrix(list_outcome ~ ., train)[, -1]
Error in model.frame.default(object, data, xlev = xlev) :
variable lengths differ (found for 'adequacy_ratio_total_percent')
I also tried the following, which does not work either.
list_outcome <- c("adequacy_ratio_total_percent")
X_train_matrix <- model.matrix(train$list_outcome ~ ., train)[, -1]
Error in model.frame.default(object, data, xlev = xlev) :
invalid type (NULL) for variable 'train$list_outcome'
Or the following:
list_outcome <- c("adequacy_ratio_total_percent")
X_train_matrix <- model.matrix(list_outcome[1] ~ ., train)[, -1]
Error in model.frame.default(object, data, xlev = xlev) :
variable lengths differ (found for 'adequacy_ratio_total_percent')
How can I extract the variable name from list_outcome and apply it to the model.matrix function? Thank you in advance for any advice!
Here's an answer that uses the same idea as @user20650, but with multiple possibilities for outcomes:
data(mtcars)
list_outcomes = c("qsec", "mpg")
Xmats <- lapply(list_outcomes, function(l){
model.matrix(reformulate(".", response=l), data=mtcars)
})
lapply(Xmats, head)
#> [[1]]
#> (Intercept) mpg cyl disp hp drat wt vs am gear carb
#> Mazda RX4 1 21.0 6 160 110 3.90 2.620 0 1 4 4
#> Mazda RX4 Wag 1 21.0 6 160 110 3.90 2.875 0 1 4 4
#> Datsun 710 1 22.8 4 108 93 3.85 2.320 1 1 4 1
#> Hornet 4 Drive 1 21.4 6 258 110 3.08 3.215 1 0 3 1
#> Hornet Sportabout 1 18.7 8 360 175 3.15 3.440 0 0 3 2
#> Valiant 1 18.1 6 225 105 2.76 3.460 1 0 3 1
#>
#> [[2]]
#> (Intercept) cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 1 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 1 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 1 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 1 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 1 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 1 6 225 105 2.76 3.460 20.22 1 0 3 1
Created on 2022-06-28 by the reprex package (v2.0.1)
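For the single-outcome case in the question, the underlying issue appears to be that a formula is not evaluated: list_outcome on the left-hand side is looked up as a variable, found to be a length-one character vector, and its length does not match the columns of train. A minimal sketch of building the formula from the string instead, with mtcars standing in for train and "mpg" standing in for the real outcome name (both are assumptions for illustration):

```r
outcome <- "mpg"  # stand-in for "adequacy_ratio_total_percent"

# Two base R ways to turn the string into the formula mpg ~ .
f1 <- reformulate(".", response = outcome)
f2 <- as.formula(paste(outcome, "~ ."))

# Same call as in the question, with the formula built from the string
X_train_matrix <- model.matrix(f1, data = mtcars)[, -1]  # drop the intercept column
colnames(X_train_matrix)
```

Changing the outcome then only requires changing the string stored in outcome.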

How do I create a new column indicating whether certain other columns contain a given value?

I'd like a new column in a data.frame to indicate whether, for each row, the number "2" appears in certain other columns. Here's a simple version that works for a small data.frame:
df <- data.frame(mycol.1 = 1:5, mycol.2= 5:1, other.col = -2:2)
df$mycols.contain.two <- df$mycol.1 ==2 | df$mycol.2 ==2
df
mycol.1 mycol.2 other.col mycols.contain.two
1 1 5 -2 FALSE
2 2 4 -1 TRUE
3 3 3 0 FALSE
4 4 2 1 TRUE
5 5 1 2 FALSE
Now suppose the data.frame has 50 columns, and I want the new column to indicate whether any of the columns beginning with "mycol" contain a "2" in each row, without having to use the "|" symbol 49 times. I assume there's an elegant dplyr answer using starts_with(), but I can't figure out the syntax.
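You could do the following. Note that it tests every column, including other.col (which is why row 5 is flagged), and stores the result as the strings "TRUE"/"FALSE" rather than as logical values: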
df <- data.frame(mycol.1 = 1:5, mycol.2= 5:1, other.col = -2:2)
df$TYPE <- ifelse(rowSums(ifelse(sapply(df, function (x){x == 2}), 1, 0)) > 0 , "TRUE", "FALSE")
# > df
# mycol.1 mycol.2 other.col TYPE
# 1 1 5 -2 FALSE
# 2 2 4 -1 TRUE
# 3 3 3 0 FALSE
# 4 4 2 1 TRUE
# 5 5 1 2 TRUE
You can achieve it by indexing. Let's take the mtcars data.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
After that, we can index any set of columns. Say we are interested in columns 8 to 11,
mtcars$new <- rowSums(mtcars[,8:11]==2)>0
gives,
mpg cyl disp hp drat wt qsec vs am gear carb new
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 FALSE
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 FALSE
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 FALSE
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 FALSE
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 TRUE
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 FALSE
You could use a simple apply() loop:
df <- data.frame(mycol.1 = 1:5, mycol.2= 5:1, other.col = -2:2)
df$mycols.contain.two <- apply(df, 1, function(x){any(x == 2)})
or, if you want to check only the first two columns:
df <- data.frame(mycol.1 = 1:5, mycol.2= 5:1, other.col = -2:2)
df$mycols.contain.two <- apply(df, 1, function(x){any(x[1:2] == 2)})
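To select the columns by name prefix rather than by position, which is what the question asks for, one option is base R's startsWith(). This is a minimal sketch on the question's data, not a dplyr solution; target_cols is a name chosen here for illustration.

```r
df <- data.frame(mycol.1 = 1:5, mycol.2 = 5:1, other.col = -2:2)

# Logical mask of the columns whose names begin with "mycol"
target_cols <- startsWith(names(df), "mycol")

# df[target_cols] == 2 is a logical matrix; a row sum above 0 means
# at least one of the selected columns equals 2 in that row
df$mycols.contain.two <- rowSums(df[target_cols] == 2) > 0
df
```

Because other.col is excluded by the mask, row 5 is FALSE here even though other.col equals 2 in that row. The same code works unchanged with 50 columns.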

problems with NA in data.table

I have a problem with missing values (NA) in data.table. When I compute mean(x) with by = z, I get NA for any group in which some observation has x = NA. How can I handle that?
As you have not provided any example data, it's hard to guess what you are trying to do. However, here is a small example that excludes the NA values from the calculation. Consider a data table dt:
library(data.table)
dt = data.table(mtcars)[1:6][2, mpg := NA][]
mpg cyl disp hp drat wt qsec vs am gear carb
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2: NA 6 160 110 3.90 2.875 17.02 0 1 4 4
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Here there is an NA in the second row of the first column. To calculate the mean of that column while ignoring the NA, use na.rm:
mean(dt$mpg, na.rm = TRUE)
#[1] 20.4
Or, when doing by-group calculations:
dt[, mean(mpg, na.rm = TRUE), by=cyl]
# cyl V1
# 1: 6 20.16667
# 2: 4 22.80000
# 3: 8 18.70000
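For comparison, the same behaviour can be reproduced in base R without data.table. This is a minimal sketch using tapply() on the first six rows of mtcars; the names d, with_na and na_removed are chosen here for illustration.

```r
d <- head(mtcars)
d$mpg[2] <- NA  # same missing value as in dt above

# Without na.rm, any group containing an NA gets an NA mean
with_na <- tapply(d$mpg, d$cyl, mean)

# Extra arguments to tapply() are passed on to mean(),
# so the NA is skipped within each group
na_removed <- tapply(d$mpg, d$cyl, mean, na.rm = TRUE)

with_na
na_removed
```

The group means with na.rm = TRUE match the by = cyl result shown above.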

lapply and data.frame in R

I am attempting to use R to accept as many user input files as required and to take those files and make one histogram per file of the values stored in the 14th column. I have gotten this far:
library("tcltk")
library("grid")
File.names<-(tk_choose.files(default="", caption="Choose your files", multi=TRUE, filters=NULL, index=1))
Num.Files<-NROW(File.names)
test<-sapply(1:Num.Files,function(x){readLines(File.names[x])})
data<-read.table(header=TRUE,text=test[1])
names(data)[14]<-'column14'
dat <- list(file1 = data.frame("column14"),
file2 = data.frame("column14"),
file3 = data.frame("column14"),
file4 = data.frame("column14"))
#Where the error comes up
tmp <- lapply(dat, `[[`, 2)
lapply(tmp, function(x) {hist(x, probability=TRUE, main=paste("Histogram of Coverage")); invisible()})
layout(1)
My code fails, though, on the line tmp <- lapply(dat, `[[`, 2)
The error that comes up is one of two things. If the line reads as above then the error is this:
Error in .subset2(x, i, exact = exact) : subscript out of bounds
Calls: lapply -> FUN -> [[.data.frame -> <Anonymous>
I did some research and found that it could be caused by the double brackets [[ ]], so I changed the line to tmp <- lapply(dat, `[`, 2) to see if it would do any good (as many tutorials said it might), but that just resulted in this error:
Error in `[.data.frame`(X[[1L]], ...) : undefined columns selected
Calls: lapply -> FUN -> [.data.frame
The input files all will follow this pattern:
Targ cov av_cov 87A_cvg 87Ag 87Agr 87Agr 87A_gra 87A%_1 87A%_3 87A%_5 87A%_10 87A%_20 87A%_30 87A%_40 87A%_50 87A%_75 87A%_100
1:028 400 0.42 400 0.42 1 1 2 41.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1:296 400 0.42 400 0.42 1 1 2 41.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Is this a common problem? Can anyone explain it to me? I am not too familiar with R but I hope to continue learning.
Thanks
EDIT:
For reproducibility, if I run:
head(test)
head(data)
x <- list(mtcars, mtcars, mtcars);lapply(x, head)
head(dat)
This is the result:
> head(test)
[,1]
[1,] "Targ cov av_cov 87A_cvg 87Ag 87Agr 87Agr 87A_gra 87A%_1 87A%_3 87A%_5 87A%_10 87A%_20 87A%_30 87A%_40\t87A%_50\t87A%_75\t87A%_100"
[2,] "1:028 400\t0.42\t400\t0.42\t1\t1\t2\t41.8\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0"
[3,] "1:296 400\t0.42\t400\t0.42\t1\t1\t2\t41.8\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0"
[4,] "1:453 1646\t8.11\t1646\t8.11\t7\t8\t13\t100.0\t100.0\t87.2\t32.0\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0"
[5,] "1:427 1646\t8.11\t1646\t8.11\t7\t8\t13\t100.0\t100.0\t87.2\t32.0\t0.0\t0.0\t0.0\t0.0\t0.0\t0.0"
[6,] "1:736 5105\t29.68\t5105\t29.68\t14\t29\t48\t100.0\t100.0\t100.0\t86.0\t65.7\t49.4\t35.5\t16.9\t0.0\t0.0"
> head(data)
[1] Targ cov av_cov X87A_cvg X87Ag X87Agr X87Agr.1
[8] X87A_gra X87A._1 X87A._3 X87A._5 X87A._10 X87A._20 X87A._30
[15] X87A._40 X87A._50 X87A._75 X87A._100
<0 rows> (or 0-length row.names)
> x <- list(mtcars, mtcars, mtcars);lapply(x, head)
[[1]]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
[[2]]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
[[3]]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> head(dat)
$file1
X.column14.
1 column14
$file2
X.column14.
1 column14
$file3
X.column14.
1 column14
$file4
X.column14.
1 column14
> tmp <- lapply(dat, `[`, 2)
Error in `[.data.frame`(X[[1L]], ...) : undefined columns selected
Calls: lapply -> FUN -> [.data.frame
Execution halted
What are you trying to do here?
tmp <- lapply(dat, `[[`, 2)
The lapply function is equivalent to
list(file1=dat[[1]][[2]],
file2=dat[[2]][[2]],
file3=dat[[3]][[2]],
file4=dat[[4]][[2]])
This doesn't work because you're trying to extract column 2 from a data frame that has only 1 column.
Redefine dat like this, and it will work:
dat <- list(file1 = data.frame("column14","iforgotcolumn2"),
file2 = data.frame("column14","iforgotcolumn2"),
file3 = data.frame("column14","iforgotcolumn2"),
file4 = data.frame("column14","iforgotcolumn2"))
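That redefinition only makes the error go away. To get column 14 of each input file, each list element would need to be the data frame read from that file, not a data frame built from the string "column14". Here is a minimal sketch with simulated data in place of real files; make_fake_file() is a hypothetical helper standing in for read.table(), and the 5 rows by 18 columns shape is an assumption based on the sample input.

```r
# Hypothetical stand-in for read.table(file, header = TRUE):
# 5 rows and 18 numeric columns, like the sample input
make_fake_file <- function(seed) {
  set.seed(seed)
  as.data.frame(matrix(runif(5 * 18), nrow = 5, ncol = 18))
}

dat <- list(file1 = make_fake_file(1), file2 = make_fake_file(2))
# With real files this would be something like:
# dat <- lapply(File.names, read.table, header = TRUE)

# `[[` with index 14 extracts the 14th column of each data frame as a vector
col14 <- lapply(dat, `[[`, 14)

# One histogram object per file; drop plot = FALSE to draw them
hists <- lapply(col14, hist, plot = FALSE)
```

Reading each file directly with read.table() also avoids the readLines() step from the question, where only the header line was passed on and data ended up with 0 rows.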
