Referencing numerical column names as variables in R

I have a data frame where the column names are numeric:
asd <- data.frame(`2021`=rnorm(3), `2`=head(letters,3), check.names=FALSE)
But when I reference the column names via a variable, it returns an error:
x = 2021
asd[x]
Error in `[.data.frame`(asd, x) : undefined columns selected
Expected output
x = 2021
asd[x]
2021
1 1.5570860
2 -0.8807877
3 -0.7627930

Reference it as a string:
x = "2021"
asd[,x]
[1] -0.2317928 -0.1895905 1.2514369

Or use deparse():
asd[,deparse(x)]
[1] 1.3445921 -0.3509493 0.5028844
asd[deparse(x)]
2021
1 1.3445921
2 -0.3509493
3 0.5028844
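A caveat: deparse() only yields the bare name when x holds a number. If x is already a character string, deparse() adds literal quote characters and the lookup fails:
deparse(2021)    # "2021", usable as a column name
deparse("2021")  # "\"2021\"", embedded quotes, so asd[deparse("2021")] errors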

A bit more detail: unquoted numbers are not syntactically valid names because they are parsed as numeric literals, so you cannot refer to such columns without quotes (or backticks).
You can force R to interpret a number as a column name by wrapping it in backticks (plain asd$2021 is a syntax error):
> asd$`2021`
[1] -0.634175 -1.612425 1.164135
Generally, you can protect yourself against syntactically invalid column names by repairing them:
#(in base R)
names(asd) <- make.names(names(asd))
names(asd)
[1] "X2021" "X2"
#(or in the tidyverse)
library(tibble)
asd <- as_tibble(asd, .name_repair="universal")
New names:
* `2021` -> ...2021
* `2` -> ...2
# A tibble: 3 x 2
...2021 ...2
<dbl> <chr>
1 -0.634 a
2 -1.61 b
3 1.16 c

If the value is numeric, just convert it to character with as.character(); column and row name attributes are always stored as character values:
asd[as.character(x)]
2021
1 -0.4438473
2 -0.8904154
3 -0.9319593
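The same idea works in the tidyverse: convert the value to character and pass it through all_of(). A minimal sketch, assuming dplyr is attached:
library(dplyr)
x <- 2021
asd %>% select(all_of(as.character(x)))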

Related

Same names in Columns in R

In R, I am using the read_excel function to import some files. The problem is that my files have some columns with the same name; is there any way to keep the duplicate names? (I know it's not good practice, but it's a very specific case.)
New names:
* `44228` -> `44228...4`
* `44229` -> `44229...5`
* `44230` -> `44230...6`
* `44231` -> `44231...7`
* `44232` -> `44232...8`
I need to apply a conversion to these column names, so I need to keep them with their original names; they are dates.
You can use the .name_repair argument of read_excel() to control, and turn off, the checks applied to column names by tibble(). So to allow duplicate names:
library("readxl")
library("writexl") # Only needed to generate an example xlsx file
x <- data.frame(a = 1:3, a = 1:3, a = 1:3, check.names = FALSE)
write_xlsx(x, "data.xlsx")
read_xlsx("data.xlsx", .name_repair = "minimal")
#> # A tibble: 3 x 3
#> a a a
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
#> 2 2 2 2
#> 3 3 3 3
Although do be aware that duplicate column names are closer to a syntax error than "bad practice", so the resulting object will behave in strange ways:
df <- read_xlsx("data.xlsx", .name_repair = "minimal")
df$a
#> [1] 1 2 3
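If you must keep the duplicated names, positional indexing is the only reliable way to reach the later columns, or you can deduplicate after import with make.unique(). A sketch under those assumptions:
df[[2]]                              # second column by position, regardless of its name
names(df) <- make.unique(names(df))  # repairs the names to "a", "a.1", "a.2"
df$a.1                               # now addressable by a unique name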

How to avoid number rounding when using as.numeric() in R?

I am reading well-structured textual data into R, and in the process of converting from character to numeric the numbers appear to lose their decimal places.
I have tried round(digits = 2), but it didn't work since I first had to apply as.numeric. At one point I set options(digits = 2) before the conversion, but that didn't work either.
Ultimately, I want a data.frame whose numbers look exactly the same as the character values I started with.
I looked for help here and did find answers like this, this, and this; however, none really helped me solve this issue.
How can I prevent number rounding when converting from character to numeric?
Here's a reproducible piece of code I wrote.
library(purrr)
my_char = c(" 246.00 222.22 197.98 135.10 101.50 86.45
72.17 62.11 64.94 76.62 109.33 177.80")
# Break characters between spaces
my_char = strsplit(my_char, "\\s+")
head(my_char, n = 2)
#> [[1]]
#> [1] "" "246.00" "222.22" "197.98" "135.10" "101.50" "86.45"
#> [8] "72.17" "62.11" "64.94" "76.62" "109.33" "177.80"
# Convert from characters to numeric.
my_char = map_dfc(my_char, as.numeric)
head(my_char, n = 2)
#> # A tibble: 2 x 1
#> V1
#> <dbl>
#> 1 NA
#> 2 246
# Delete first value because it's empty
my_char = my_char[-1,1]
head(my_char, n = 2)
#> # A tibble: 2 x 1
#> V1
#> <dbl>
#> 1 246
#> 2 222.
This is just how R displays data in a tibble.
The function map_dfc() is not rounding your data; the tibble print method simply abbreviates numbers for display.
If you want to print the data with the usual format, use as.data.frame, like this:
head(as.data.frame(my_char), n = 4)
V1
#>1 246.00
#>2 222.22
#>3 197.98
#>4 135.10
Showing that your data has not been rounded.
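Alternatively, if you want the tibble itself to print more digits, the pillar.sigfig option controls how many significant digits tibbles display (the default is 3, which is what produced the abbreviated 222.). A sketch:
options(pillar.sigfig = 7)  # tibbles now print up to 7 significant digits
head(my_char, n = 2)        # 222.22 prints in full instead of as 222.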
Hope this helps.

How to use select() only on columns of a certain type without losing columns of other types?

There are a some similar questions (like here, or here), but none with quite the answer I am looking for.
The question:
How to use select() only on columns of a certain type?
The select helper functions used in select_if() or select_at() may only reference the column name or index. In this particular case I want to select columns of a certain type (numeric) and then select a subset of them based on their column sum while not losing the columns of other types (character).
What I would like to do:
library(dplyr)
tibbly = tibble(x = c(1,2,3,4),
                y = c("a", "b","c","d"),
                z = c(9,8,7,6))
# A tibble: 4 x 3
x y z
<dbl> <chr> <dbl>
1 1 a 9
2 2 b 8
3 3 c 7
4 4 d 6
tibbly %>%
  select_at(is.numeric, colSums(.) > 12)
Error: `.vars` must be a character/numeric vector or a `vars()` object, not primitive
This doesn't work because select_at() doesn't recognize is.numeric as a proper function to select columns.
If I do something like:
tibbly %>%
  select_if(is.numeric) %>%
  select_if(colSums(.) > 12)
I manage to only select the columns with a sum > 12, but I also lose the character columns. I would like to avoid having to reattach the lost columns afterwards.
Is there a better way to select columns in a dplyr fashion, based on some properties other than their names / index?
Thank you!
Perhaps an option could be to create your own custom function, and use that as the predicate in the select_if function. Something like this:
check_cond <- function(x) is.character(x) | is.numeric(x) && sum(x) > 12
tibbly %>%
  select_if(check_cond)
y z
<chr> <dbl>
1 a 9
2 b 8
3 c 7
4 d 6
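Note that select_if() has since been superseded: in dplyr 1.0.0 and later the same predicate can be passed to where() inside a plain select(). A sketch reusing check_cond from above:
library(dplyr)
tibbly %>%
  select(where(check_cond))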

Select specific names from list of dataframes in R

Sample data:
df <- data.frame(names=letters[1:10],name1=rnorm(10,1,1),name2=rexp(10,2))
list <- list(df,df)
vec_name <- c("f","i","c") # desired row names
For each data frame in the list, I would like to select the rows given by the vec_name names:
Desired outcome:
[[1]]
names value1 value2
6 nd:f -1.6323952 0.3117470
9 nd:i 1.8270855 0.2475741
3 nd:c 0.6978422 0.4695581 # the ordering does matter; must be as seen in vec_name
[[2]]
names value1 value2
6 ad:f -1.6323952 0.3117470
9 ad:i 1.8270855 0.2475741
3 ad:c 0.6978422 0.4695581
Desired output 2 is a single data frame, which I believe is just do.call(rbind, list); however, the clean names from vec_name should be used instead:
names value1 value2
1 f -1.6323952 0.3117470
2 i 1.8270855 0.2475741
3 c 0.6978422 0.4695581
4 f -1.6323952 0.3117470
5 i 1.8270855 0.2475741
6 c 0.6978422 0.4695581
I have tried sapply; lapply ... for example:
lapply(list, function(x) x[grepl(vec_name,x$names),])
EDIT : PLEASE SEE THE EDITED QUESTION ABOVE.
You were almost there. The warning message was saying:
Warning messages:
1: In grepl(vec_name, x$names) :
argument 'pattern' has length > 1 and only the first element will be used
The reason is that you provided a vector to grepl, which expects a single regular expression (see ?regex). What you want instead is to match the contents:
lapply(list, function(x) x[match(vec_name,x$names),])
Which will give you a list of data.frame objects. If you want to combine them afterwards just use:
do.call(rbind, lapply(list, function(x) x[match(vec_name,x$names),]))
Or you use ldply from library(plyr):
library(plyr)
ldply(list, function(x) x[match(vec_name,x$names),])
# names name1 name2
# 1 f 2.01421228 0.4489627
# 2 i 0.28899891 0.8323940
# 3 c -0.01746007 1.5309936
# 4 f 2.01421228 0.4489627
# 5 i 0.28899891 0.8323940
# 6 c -0.01746007 1.5309936
And as a remark: avoid using base function names like list for your variables, to prevent unwanted masking effects.
Update
Taking the comments into account (vec_name does not completely match the names in the data.frame), you should first clean the names and then do the match. This assumes that your 'uncleaned' names contain the cleaned names with a prefix separated by a colon (':'); if this is not the case, adapt the regex in the gsub statement:
ldply(list, function(x) x[match(vec_name, gsub(".*:(.*)", "\\1", x$names)),])
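To see the cleaning step in isolation (these prefixed names are hypothetical, following the same colon convention as in the question):
gsub(".*:(.*)", "\\1", c("nd:f", "nd:i", "ad:c"))
# [1] "f" "i" "c"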
For the first output:
output1 <- lapply(list, function(elt){
  # find where each element of vec_name occurs in this data frame's names column
  resmatch <- sapply(vec_name, function(x) regexpr(x, elt$names))
  elt <- elt[apply(resmatch, 2, function(rg) which(rg > 0)), ]
  colnames(elt) <- c("names", "value1", "value2")
  return(elt)
})
>output1
[[1]]
names value1 value2
6 nd:f -0.2132962 0.7618105
9 nd:i -0.6580247 0.6010379
3 nd:c 0.9302625 0.1490061
[[2]]
names value1 value2
6 nd:f -0.2132962 0.7618105
9 nd:i -0.6580247 0.6010379
3 nd:c 0.9302625 0.1490061
For the second output, you can then do what you originally wanted:
output2<-do.call(rbind,output1)
> output2
names value1 value2
6 nd:f -0.2132962 0.7618105
9 nd:i -0.6580247 0.6010379
3 nd:c 0.9302625 0.1490061
61 nd:f -0.2132962 0.7618105
91 nd:i -0.6580247 0.6010379
31 nd:c 0.9302625 0.1490061
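If you also want the clean names from vec_name in the combined result (as in the second desired output), the same gsub can be applied after binding. A sketch building on output1:
output2 <- do.call(rbind, output1)
output2$names <- gsub(".*:(.*)", "\\1", output2$names)  # strip the "nd:" / "ad:" prefixes
rownames(output2) <- NULL                               # reset the 6, 9, 3, 61, ... row names
output2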

Extract data elements found in a single column

Here is what my data look like.
id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{
As you can see, there can be multiple codes concatenated into a single column, separated by {. It is also possible for a row to have no interest_string value at all.
How can I manipulate this data frame to extract the values into a format like this:
id interest
1 YI
1 Z0
1 ZI
2 ZO
3 <NA>
4 ZT
I need to complete this task with R.
Thanks in advance.
This is one solution:
out <- with(dat, strsplit(as.character(interest_string), "\\{"))
## or
# out <- with(dat, strsplit(as.character(interest_string), "{", fixed = TRUE))
out <- cbind.data.frame(id = rep(dat$id, times = sapply(out, length)),
                        interest = unlist(out, use.names = FALSE))
Giving:
R> out
id interest
1 1 YI
2 1 Z0
3 1 ZI
4 2 ZO
5 3 <NA>
6 4 ZT
Explanation
The first line of the solution simply splits each element of the interest_string factor in the data object dat, using \\{ as the split pattern. The { has to be escaped, and in R that requires two backslashes. (Actually it doesn't if you use fixed = TRUE in the call to strsplit.) The resulting object is a list, which looks like this for the example data:
R> out
[[1]]
[1] "YI" "Z0" "ZI"
[[2]]
[1] "ZO"
[[3]]
[1] "<NA>"
[[4]]
[1] "ZT"
We have almost everything we need in this list to form the output you require. The only thing we need external to this list is the id values that refer to each element of out, which we grab from the original data.
Hence, in the second line, we bind, column-wise (specifying the data frame method so we get a data frame returned) the original id values, each one repeated the required number of times, to the strsplit list (out). By unlisting this list, we unwrap it to a vector which is of the required length as given by your expected output. We get the number of times we need to replicate each id value from the lengths of the components of the list returned by strsplit.
A nice and tidy data.table solution:
library(data.table)
DT <- data.table( read.table( textConnection("id interest_string
1 YI{Z0{ZI{
2 ZO{
3 <NA>
4 ZT{"), header=TRUE))
DT$interest_string <- as.character(DT$interest_string)
DT[, {
  list(interest = unlist(strsplit(interest_string, "{", fixed = TRUE)))
}, by = id]
gives me
id interest
1: 1 YI
2: 1 Z0
3: 1 ZI
4: 2 ZO
5: 3 <NA>
6: 4 ZT
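For completeness, a tidyr-based sketch does the same reshaping: separate_rows() splits on the delimiter, and because each string ends in { it leaves empty pieces that need filtering. This assumes interest_string is a character column and that tidyr and dplyr are available:
library(tidyr)
library(dplyr)
dat %>%
  separate_rows(interest_string, sep = "\\{") %>%
  filter(interest_string != "") %>%   # drop the empty pieces left by the trailing {
  rename(interest = interest_string)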
