Read columns with readr using regular expressions

I need to import data files with varying numbers of columns. Eventually the code will be used by co-workers who are not very familiar with R, so it should be robust and preferably run without warning messages. The main problem is that the header always ends with an additional "," which does not appear in the data rows below. Apart from a whole bunch of unused columns, the required columns are always labelled the same way, i.e. a certain string always occurs within the column name, but the whole column name is not necessarily identical.
The example code is a very simple approximation of my files. First, I would like to get rid of the error message caused by the erroneous comma at the end of the header, something like skip_col = ncol(header). Second, I would like to read only the columns with "*des*" within the column name.
My approach looks simple in this stripped-down example but is not very satisfying in my more complex code.
library(tidyverse)
data <- read_csv("date,col1des,col1foo,col2des,col3des,col2foo,col3foo,
2015-10-23T22:00:00Z,0.6,-1.5,-1.3,-0.5,1.8,0
2015-10-23T22:10:00Z,-0.5,-0.6,1.5,0.1,-0.3,0.3
2015-10-23T22:20:00Z,0.1,0.2,-1.6,-0.1,-1.4,-0.4
2015-10-23T22:30:00Z,1.7,-1.2,-0.2,-0.4,0.3,0.3")
# keep only the "des" columns, plus the date column
if (length(grep("des", names(data))) > 0) {
  des <- data[grep("des", names(data))]
  des <- bind_cols(date = data$date, des)
}
So in my full code, I get the following warning messages:
1. Missing column names filled in: 'X184' [184]
2. Duplicated column names deduplicated: [long list of unrequired columns with duplicated names]
I would appreciate a solution within the tidyverse. As far as I can tell, it is not possible to use regular expressions directly within the read_csv call to specify column names, right? So perhaps the only way is to read the header first and build the cols() call out of that. But this exceeds my R knowledge.
Edit:
I wonder if something like this is possible:
headline <- "date,col1des,col1foo,col2des,col3des,col2foo,col3foo,"
head <- headline %>% strsplit(",") %>% unlist(use.names = FALSE)
head_des <- head[grep("des", head)]
data <- read_csv("mydata.csv", col_types = cols_only(head_des[1] = "d", head_des[2] = "d")) # pseudocode, not valid R
I would like to grep() the column names before reading the whole data.
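Maybe one can peek at just the header without parsing the file twice. A small sketch of what I mean, assuming read_csv() with n_max = 0 really does read only the column names:
library(tidyverse)
# parse no data rows, just the header line
hdr <- names(suppressWarnings(read_csv("mydata.csv", n_max = 0)))
head_des <- hdr[grep("des", hdr)]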

Edit no. 2
In reaction to your comment:
This works with your data string:
library(tidyverse)
yourData <- "date,col1des,col1foo,col2des,col3des,col2foo,col3foo,
2015-10-23T22:00:00Z,0.6,-1.5,-1.3,-0.5,1.8,0
2015-10-23T22:10:00Z,-0.5,-0.6,1.5,0.1,-0.3,0.3
2015-10-23T22:20:00Z,0.1,0.2,-1.6,-0.1,-1.4,-0.4
2015-10-23T22:30:00Z,1.7,-1.2,-0.2,-0.4,0.3,0.3"
data <- suppressWarnings(read_csv(yourData))
header <- names(data)
colList <- ifelse(str_detect(header, 'des'), 'c', '_') %>% as.list()
suppressWarnings(read_csv(yourData, col_types = do.call(cols_only, colList)))
#> # A tibble: 4 x 3
#>   col1des col2des col3des
#>   <chr>   <chr>   <chr>
#> 1 0.6     -1.3    -0.5
#> 2 -0.5    1.5     0.1
#> 3 0.1     -1.6    -0.1
#> 4 1.7     -0.2    -0.4
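Positional matching of an unnamed spec list can be fragile. As a small variation of my own (assuming the header names are unique), the list can be named after the header so each spec is matched to its column explicitly:
colSpec <- ifelse(str_detect(header, 'des'), 'c', '_') %>%
  as.list() %>%
  set_names(header)
suppressWarnings(read_csv(yourData, col_types = do.call(cols_only, colSpec)))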
EDIT
Trying to accommodate your edited wishes, and with the help of this post:
library(tidyverse)
header <- suppressWarnings(readLines('file.csv')[1]) %>%
  str_split(',', simplify = TRUE)
colList <- ifelse(str_detect(header, 'des'), 'c', '_') %>% as.list()
suppressWarnings(read_csv(file = 'file.csv', col_types = do.call(cols_only, colList)))
#> # A tibble: 4 x 3
#>   col1des col2des col3des
#>   <chr>   <chr>   <chr>
#> 1 0.6     -1.3    -0.5
#> 2 -0.5    1.5     0.1
#> 3 0.1     -1.6    -0.1
#> 4 1.7     -0.2    -0.4
This is the most robust, most tidyverse way I could come up with:
library(tidyverse)
# read the raw lines and split each one into its fields
lines <- suppressWarnings(readLines('file.csv')) %>%
  str_split(',')
# fields per line; the header has one extra field due to the trailing comma
dims <- lines %>% map_int(length)
if (any(dims != median(dims))) {
  # truncate every deviating line to the typical width
  for (i in which(dims != median(dims))) {
    lines[[i]] <- lines[[i]][1:median(dims)]
  }
}
# reassemble the cleaned text, parse it, and keep the "des" columns
data <- lines %>% map_chr(paste, collapse = ',') %>%
  paste(collapse = '\n') %>% read_csv()
(data <- data %>% select(matches('des')))
#> # A tibble: 4 x 3
#>   col1des col2des col3des
#>     <dbl>   <dbl>   <dbl>
#> 1     0.6    -1.3    -0.5
#> 2    -0.5     1.5     0.1
#> 3     0.1    -1.6    -0.1
#> 4     1.7    -0.2    -0.4
Where file.csv contains your data.
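To hand this to co-workers, the pieces above could be bundled into one helper. A sketch under my own assumptions (the function name and arguments are made up; "?" asks readr to guess each kept column's type, and the header names are assumed unique):
library(tidyverse)

read_matching_cols <- function(path, pattern = "des", keep = "date") {
  # peek at the header only; n_max = 0 parses no data rows
  hdr <- names(suppressWarnings(read_csv(path, n_max = 0)))
  wanted <- hdr[str_detect(hdr, pattern) | hdr %in% keep]
  # build a named cols_only() spec for just the wanted columns
  spec <- set_names(as.list(rep("?", length(wanted))), wanted)
  suppressWarnings(read_csv(path, col_types = do.call(cols_only, spec)))
}

data <- read_matching_cols("file.csv")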

Related

Access to the new column name from the function inside across

Is it possible to get the name of the new column from the across function?
For example
data.frame(a=1) %>%
  mutate(across(c(b=a, c=a), function(i) if("new column name"=="b") a+1 else a+0.5))
Expected result:
#>   a b   c
#> 1 1 2 1.5
Created on 2021-12-09 by the reprex package (v2.0.0)
I attempted to use cur_column(), but its return value appears to be "a" in this case.
I apologise that this example is too simple to achieve the desired result in other ways, but my actual code is a large nested data frame that is difficult to provide.
Interesting question. It seems that because you are defining b and c within the across call, they aren't available through cur_column().
data.frame(a=1) %>%
  mutate(across(c(b=a, c=a), function(i) print(cur_column())))
#> [1] "a"
#> [1] "a"
It works fine if they are already defined in the data.frame. Using tibble() here so I can refer to a in the constructor.
tibble(a=1, b=a, c=a) %>%
  mutate(across(-a, ~ if (cur_column() == "b") .x + 1 else .x + 0.5))
#> # A tibble: 1 × 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2   1.5
And it works similarly even within the same mutate() call, as long as b and c are defined prior to the across():
data.frame(a=1) %>%
  mutate(across(c(b=a, c=a), ~.x),
         across(-a, ~ if (cur_column() == "b") .x + 1 else .x + 0.5))
#>   a b   c
#> 1 1 2 1.5
I believe this happens because across() works across the RHS (a) and assigns the values to the LHS (b and c) only after the function inside across() has been applied. I'm not sure whether this is the expected behavior (although it does seem right); I actually don't know, so I will open a GitHub issue because I think it's an interesting example!
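As a workaround sketch of my own (not part of the answer above): building the new columns from a named list with purrr::imap() makes the target name available to the function as .y:
library(purrr)

df <- data.frame(a = 1)
# the names of this list become the new column names
new_cols <- list(b = df$a, c = df$a)
# .y is the element's name, so the function can branch on it
df <- cbind(df, imap(new_cols, ~ if (.y == "b") .x + 1 else .x + 0.5))
df
# expected: a = 1, b = 2, c = 1.5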

How to avoid number rounding when using as.numeric() in R?

I am reading well-structured textual data in R, and in the process of converting from character to numeric, the numbers lose their decimal places.
I tried using round(digits = 2), but it didn't work, since I first had to apply as.numeric. At one point I set options(digits = 2) before the conversion, but that didn't work either.
Ultimately, I want a data.frame whose numbers look exactly the same as the characters they came from.
I looked for help here and found similar answers; however, none really helped me solve this issue.
How can I prevent number rounding when converting from character to numeric?
Here's a reproducible piece of code I wrote.
library(purrr)
my_char = c(" 246.00 222.22 197.98 135.10 101.50 86.45
72.17 62.11 64.94 76.62 109.33 177.80")
# Break characters between spaces
my_char = strsplit(my_char, "\\s+")
head(my_char, n = 2)
#> [[1]]
#> [1] "" "246.00" "222.22" "197.98" "135.10" "101.50" "86.45"
#> [8] "72.17" "62.11" "64.94" "76.62" "109.33" "177.80"
# Convert from characters to numeric.
my_char = map_dfc(my_char, as.numeric)
head(my_char, n = 2)
#> # A tibble: 2 x 1
#> V1
#> <dbl>
#> 1 NA
#> 2 246
# Delete first value because it's empty
my_char = my_char[-1,1]
head(my_char, n = 2)
#> # A tibble: 2 x 1
#> V1
#> <dbl>
#> 1 246
#> 2 222.
That's just how R displays data in a tibble.
map_dfc() is not rounding your data; tibbles simply abbreviate numbers when printing.
If you want to print the data in the usual format, use as.data.frame(), like this:
head(as.data.frame(my_char), n = 4)
#>       V1
#> 1 246.00
#> 2 222.22
#> 3 197.98
#> 4 135.10
Showing that your data has not been rounded.
Hope this helps.
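If you would rather keep the tibble, an aside of my own (not part of the original answer): pillar's option for significant digits controls how tibbles print, so values like 222.22 display in full. Trailing zeros, as in 246.00, are still dropped, because they are not stored in a numeric value.
# raise the number of significant digits used when printing tibbles
options(pillar.sigfig = 6)
head(my_char, n = 4)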

Is it possible to order (dplyr arrange?) a skim_df object by mean?

I am using the package skimr to summarize data that are all logicals, so naturally I would like to order the result by the mean from largest to smallest.
I have already attempted to pipe the skim function into dplyr's arrange, but that didn't work.
We are simply using the skim function on a data frame that are all booleans/logicals.
I tried that, and it seems like everything works as intended. skim_df inherits from data.frame, so I don't see why dplyr functions would not work on it.
set.seed(123)
df <- data.frame(a = sample(c(T,F), 50, replace = TRUE),
                 b = c(rep(F,25), sample(c(T,F), 25, replace = TRUE)),
                 c = c(rep(T,25), sample(c(T,F), 25, replace = TRUE)))
sdf <- skimr::skim(df) %>%
  dplyr::filter(stat == "mean") %>%
  dplyr::arrange(desc(value))
sdf
Output
  variable type    stat  level value formatted
  <chr>    <chr>   <chr> <chr> <dbl> <chr>
1 c        logical mean  .all  0.8   0.8
2 a        logical mean  .all  0.5   0.5
3 b        logical mean  .all  0.26  0.26
I don't know what your problem is. Carefully check your code for obvious errors.
Here's an answer for v2. In v2 the skim object is no longer in the long format. Here select() turns the skim object into a regular tibble (focus() would have kept it as a skimr object).
skim(df) %>%
  dplyr::select(skim_variable, logical.mean) %>%
  dplyr::arrange(desc(logical.mean))
# A tibble: 3 x 2
  skim_variable logical.mean
  <chr>                <dbl>
1 c                     0.7
2 a                     0.6
3 b                     0.34
Alternatively
skim(df) %>%
  skimr::focus(skim_variable, logical.mean) %>%
  dplyr::arrange(desc(logical.mean)) %>%
  as.data.frame()
  skim_type skim_variable logical.mean
1 logical   c                     0.70
2 logical   a                     0.60
3 logical   b                     0.34
This leaves the two meta columns in place. The as.data.frame() is one way to keep the summary from printing, but you can also tell print() to exclude the summary:
skim(df) %>%
  skimr::focus(skim_variable, logical.mean) %>%
  dplyr::arrange(desc(logical.mean)) %>%
  print(include_summary = FALSE)
── Variable type: logical ────────────────────────────────────────────────────
  skim_variable  mean
1 c              0.7
2 a              0.6
3 b              0.34

Removing all columns with a name on the fly

I'm using read_excel for speed and simplicity to import an Excel file.
Unfortunately, there's as yet no capability of excluding from the data set selected columns that won't be needed; to save effort, I am naming such columns "x" with the col_names argument, which is easier than trying to keep track of x1, x2, and so on.
I'd then like to exclude such columns on the fly if possible to avoid an extra step of copying, so in pseudocode:
read_excel("data.xlsx", col_names = c("x", "keep", "x"))[ , !"x"]
We can use the sample data set included with the readxl package for illustration:
library(readxl)
DF <- read_excel(system.file("extdata/datasets.xlsx", package = "readxl"),
                 col_names = c("x", "x", "length", "width", "x"), skip = 1L)
The approaches I've seen that work don't exactly work on the fly, e.g., having stored DF, we can now do:
DF <- DF[ , -grep("^x$", names(DF))]
This works, but it requires making a copy of DF by storing it and then overwriting it; I'd prefer to remove the columns in the same command as read_excel so that DF is allocated properly ab initio.
Other similar approaches require declaring temporary variables, which I prefer to avoid if possible, e.g.,
col_names <- c("x", "x", "length", "width", "x")
DF <- read_excel(system.file("extdata/datasets.xlsx", package = "readxl"),
                 col_names = col_names, skip = 1L)[ , -grep("^x$", col_names)]
Is there a way to axe these columns without creating unnecessary temporary variables?
(I could convert to data.table, but am wondering if there's a way to do so without data.table)
There actually is a way to do this in readxl::read_excel, though it's a little hidden, and I have no idea whether the columns are read into memory [temporarily] regardless. The trick is to specify column types, putting in "blank" for those you don't want (newer readxl versions call this type "skip"):
readxl::read_excel(system.file("extdata/datasets.xlsx", package = "readxl"),
                   col_types = c('blank', 'blank', 'numeric', 'numeric', 'text'))
## # A tibble: 150 x 3
##    Petal.Length Petal.Width Species
##           <dbl>       <dbl> <chr>
##  1          1.4         0.2 setosa
##  2          1.4         0.2 setosa
##  3          1.3         0.2 setosa
##  4          1.5         0.2 setosa
##  5          1.4         0.2 setosa
##  6          1.7         0.4 setosa
##  7          1.4         0.3 setosa
##  8          1.5         0.2 setosa
##  9          1.4         0.2 setosa
## 10          1.5         0.1 setosa
## # ... with 140 more rows
The caveat is that you need to know the data types of all the columns you want, though I suppose you could always just start with text and clean up later with type.convert or the like.
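A quick sketch of that clean-up route (my own illustration, reusing the column choices above):
library(readxl)
# read the kept columns as text, then let type.convert() guess the types
raw <- read_excel(system.file("extdata/datasets.xlsx", package = "readxl"),
                  col_types = c('blank', 'blank', 'text', 'text', 'text'))
DF <- type.convert(as.data.frame(raw), as.is = TRUE)
str(DF)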
I don't see an easy way to avoid copying, but a one-liner is achievable using piping, with no temporary variables needed. E.g.:
library(readxl)
library(magrittr)
read_excel(system.file("extdata/datasets.xlsx", package = "readxl"),
           col_names = c("x", "x", "length", "width", "x"), skip = 1L) %>%
  extract(, -grep("^x$", names(.))) ->
  DF
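A similar one-liner can be written with dplyr; a sketch of my own (note the assumption: recent readxl repairs duplicated names to x...1, x...2, ..., hence matching on the prefix rather than on "^x$"):
library(readxl)
library(dplyr)
DF <- read_excel(system.file("extdata/datasets.xlsx", package = "readxl"),
                 col_names = c("x", "x", "length", "width", "x"), skip = 1L) %>%
  select(-starts_with("x"))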

Need help getting summary statistics for R data frame

This is my data (imagine I have 1050 rows like the ones shown below):
ID_one  ID_two  parameterX
111     aaa     23
222     bbb     54
444     ccc     39
My code will then divide the rows into groups of 100 (so there will be roughly 10 groups of 100 rows).
I then want to get the summary statistics per group (this is the part that is not working).
After that, I want to place the summary statistics in a data frame to plot them.
For example, put all 10 means for parameterX in one data frame together, put all 10 standard deviations for parameterX in the same data frame together, etc.
The following code is not working:
# assume data is available
dataframe_size <- nrow(thedata)
group_size <- 100
number_ofgroups <- round(dataframe_size / group_size)

# split data into groups of 100
split_dataframe_into_groups <- function(x, y)
  0:(x - 1) %% y
list1 <- split(thedata, split_dataframe_into_groups(nrow(thedata), group_size))

# print data in the first group
list1[[1]]$parameterX

# NOT WORKING!!! get summary stats for all 10 groups
# how to loop through all 10 groups?
list1_stat <- do.call(data.frame, list(mean = apply(list1[[1]]$parameterX, 2, mean),
                                       sd = apply(list1[[1]]$parameterX, 2, sd). . .))
The error message is always:
Error in apply(...): dim(X) must have a positive length
That makes no sense to me, because when I run the following there is clearly data with a positive length:
# print data in the first group
list1[[1]]$parameterX

# how to put all means in a data frame?
# how to put all standard deviations in the same data frame?
# e.g. df1 <- mean(2,2,3,4,7,2,4,,9,8,9),
#      sd(0.1, 3, 0.5, . . .)
dplyr is so good for this kind of thing. If you create a new column that assigns a 'group' ID based on row location, then you can summarize each group very easily. I use an index to assist in assigning group IDs.
install.packages('dplyr')
library(dplyr)

## Create an index of row positions
df$index <- 1:nrow(df)

## Assign group labels: every 100 rows form one group. Integer division is
## simpler and safer than building the label from the digits of the index.
df$group <- paste("Group", (df$index - 1) %/% 100)

## Get summaries per group
df <- group_by(df, group)
summaries <- summarise(df, avg = mean(parameterX),
                           minimum = min(parameterX),
                           maximum = max(parameterX),
                           med = median(parameterX))
## (base R's mode() returns the storage mode, not the statistical mode,
## so it is left out here)
... and so on.
Hope this helps.
I think this might be a good place to use tapply; there is an excellent summary of it here! One path forward might be an extension of the code below:
df <- data.frame(id = c(rep("AA", 10), rep("BB", 10)), x = runif(20))
do.call("rbind", tapply(df$x, df$id, summary))
I think this is what you want :
require(dplyr)
dt <- rbind(iris, iris, iris)
dataframe_size <- nrow(dt)
group_size <- 100
number_ofgroups <- round(dataframe_size / group_size)
df <- dt %>%
  # create the "bins" column: cut the row positions into equal intervals
  mutate(bins = cut(seq_len(dataframe_size), breaks = number_ofgroups)) %>%
  # aggregate the summary statistics by the bins variable
  group_by(bins) %>%
  # calculate the mean
  summarise(mean.Sepal.Length = mean(Sepal.Length))
head(dt)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
df
         bins mean.Sepal.Length
       (fctr)             (dbl)
1 (0.551,113]          5.597345
2   (113,226]          5.755357
3   (226,338]          5.919643
4   (338,450]          6.100885
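To also collect the standard deviations alongside the means, as asked in the question, the summarise() call could be extended (my own extension, not part of the original answer):
df <- dt %>%
  mutate(bins = cut(seq_len(dataframe_size), breaks = number_ofgroups)) %>%
  group_by(bins) %>%
  # one row per bin, with both requested statistics
  summarise(mean.Sepal.Length = mean(Sepal.Length),
            sd.Sepal.Length   = sd(Sepal.Length))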
