R tidyr regex: extract ordered numbers from character column - r

Suppose I have a data frame like this
df <- data.frame(x=c("This script outputs 10 visualizations.",
"This script outputs 1 visualization.",
"This script outputs 5 data files.",
"This script outputs 1 data file.",
"This script doesn't output any visualizations or data files",
"This script outputs 9 visualizations and 28 data files.",
"This script outputs 1 visualization and 1 data file."))
It looks like this
x
1 This script outputs 10 visualizations.
2 This script outputs 1 visualization.
3 This script outputs 5 data files.
4 This script outputs 1 data file.
5 This script doesn't output any visualizations or data files
6 This script outputs 9 visualizations and 28 data files.
7 This script outputs 1 visualization and 1 data file.
Is there a simple way, possibly using the Tidyverse to extract the number of visualizations and the number of files for each row? When there are no visualizations (or no data files, or both) I would like to extract 0. Essentially I would like the final result to be like this
viz files
1 10 0
2 1 0
3 0 5
4 0 1
5 0 0
6 9 28
7 1 1
I tried using stuff like
str_extract(df$x, "(?<=This script outputs )(.*)(?= visualizatio(n\\.$|ns\\.$))")
but I got so lost.

We can use a regex lookahead in str_extract to pull out one or more digits (\\d+) that are followed by a space and 'vis', or by a space and 'data file'/'data files', into two columns. replace_na comes from tidyr, so that package has to be loaded as well.
library(dplyr)
library(stringr)
library(tidyr)  # provides replace_na()
df %>%
  transmute(viz = as.numeric(str_extract(x, "\\d+(?= vis)")),
            files = as.numeric(str_extract(x, "\\d+(?= data files?)"))) %>%
  mutate_all(replace_na, 0)
# viz files
#1 10 0
#2 1 0
#3 0 5
#4 0 1
#5 0 0
#6 9 28
#7 1 1
In the first pattern, \\d+ matches one or more digits and the lookahead (?= vis) requires them to be followed by a space and the text 'vis', without including that text in the match. The second pattern works the same way for digits followed by a space and 'data file' or 'data files'. Rows with no match give NA, which is then replaced by 0.
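For readers without stringr, the same lookahead idea can be sketched in base R. This is only an illustration under assumptions: extract_count is a hypothetical helper name, the input is a shortened copy of the question's data, and perl = TRUE is needed because lookaheads are not supported by R's default regex engine.

```r
x <- c("This script outputs 10 visualizations.",
       "This script outputs 5 data files.",
       "This script doesn't output any visualizations or data files",
       "This script outputs 9 visualizations and 28 data files.")

# Hypothetical helper: first run of digits followed by `suffix`, or 0 if none.
extract_count <- function(text, suffix) {
  # regexpr() returns -1 for elements without a match
  m <- regexpr(paste0("\\d+(?=", suffix, ")"), text, perl = TRUE)
  out <- rep(0, length(text))
  # regmatches() returns only the matched substrings, in order
  out[m > 0] <- as.numeric(regmatches(text, m))
  out
}

viz   <- extract_count(x, " vis")
files <- extract_count(x, " data files?")
```

Starting from a vector of zeros avoids the separate NA replacement step.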

You could use the package unglue to get a readable solution as you have a limited amount of possible patterns, then replace NAs by 0 :
library(unglue)
patterns <-
c("This script outputs {viz} visualization{=s{0,1}} and {files} data file{=s{0,1}}.",
"This script outputs {viz} visualization{=s{0,1}}.",
"This script outputs {files} data file{=s{0,1}}.")
res <- unglue_unnest(df, x, patterns, convert = TRUE)
res[is.na(res)] <- 0
res
#> viz files
#> 1 10 0
#> 2 1 0
#> 3 0 5
#> 4 0 1
#> 5 0 0
#> 6 9 28
#> 7 1 1

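A base R approach: capture the digits that come directly before " visualization" and " data file". The lazy .*? (with perl = TRUE) keeps the leading part of the pattern from swallowing digits, so 28 is not cut down to 8. Rows without a match come back unchanged from sub(), become NA in as.numeric() (hence suppressWarnings()), and are then set to 0.
df$viz <- suppressWarnings(as.numeric(sub(".*?(\\d+) visualization.*", "\\1", df$x, perl = TRUE)))
df$files <- suppressWarnings(as.numeric(sub(".*?(\\d+) data file.*", "\\1", df$x, perl = TRUE)))
df[is.na(df)] <- 0
df
# x viz files
# 1 This script outputs 10 visualizations. 10 0
# 2 This script outputs 1 visualization. 1 0
# 3 This script outputs 5 data files. 0 5
# 4 This script outputs 1 data file. 0 1
# 5 This script doesn't output any visualizations or data files 0 0
# 6 This script outputs 9 visualizations and 28 data files. 9 28
# 7 This script outputs 1 visualization and 1 data file. 1 1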

Related

Looping over 16 numbers, but excluding one each time

Using the following expression, I want to compute the influence of each data point in an election forecast data set (see bottom). My idea is to loop through the expression 16 times and print the result, each time leaving one x_i out to see how each of them influences the result. But I have no idea how to make this loop in R.
The expression is:
$$ \hat{b} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$
And in R
betahat<- (sum(data$growth)-mean(data$growth))*data$vote/(sum(data$growth)-mean(data$growth))^2
print(betahat)
And the data is this
data <- read.table("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/ElectionsEconomy/data/hibbs.dat", header = TRUE)
Expected functioning:
0 1 2 0 x x 0 1 2
1 2 4 first loop 1 2 4 second loop 1 x x etc.
2 3 6 ---> 2 3 6 ---> 2 3 6 --->
3 4 8 3 4 8 3 4 8
4 5 10 4 5 10 4 5 10
The first output should be something like
[1] 1.566974 2.029337 1.753535 2.155116 1.742644 2.170927 1.719807 1.570487 2.078876
[10] 1.895125 1.635485 1.923232 1.766184 1.800264 1.627404 1.826965
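One possible way to write such a leave-one-out loop is sketched below. It assumes the slope formula given above is what should be recomputed each time. The data frame dat is a made-up stand-in for the hibbs data (so the block runs without a download), and slope, loo and influence are hypothetical names.

```r
# Toy data standing in for the election data set (values are made up).
dat <- data.frame(growth = c(0, 1, 2, 3, 4),
                  vote   = c(1, 3, 4, 8, 9))

# Slope estimate: sum((x - mean(x)) * y) / sum((x - mean(x))^2)
slope <- function(x, y) sum((x - mean(x)) * y) / sum((x - mean(x))^2)

# Estimate on the full data.
full <- slope(dat$growth, dat$vote)

# Drop row i, recompute the slope, and collect one estimate per row.
loo <- sapply(seq_len(nrow(dat)),
              function(i) slope(dat$growth[-i], dat$vote[-i]))

# How much the estimate moves when each point is left out.
influence <- loo - full
```

Negative indexing (x[-i]) removes the i-th observation, so sapply returns one slope per omitted row. With the real data this would give 16 values.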

Combining CSV files in R [duplicate]

I have multiple CSV files that I have already read into R. Now I want to append all of them into one file. I tried a few things but keep getting different errors. Can anyone please help me with this?
TRY 1:
mydata <- rbind(x1,x2,x3,x4,x5,x6,x7,x8)
where x1, x2, ..., x8 are the CSV files I read into R. The error I am getting is
ERROR 1: In [<-.factor(*tmp*, ri, value = c(NA, NA, NA, NA, NA, NA, NA, :
invalid factor level, NA generated
TRY 2: Then I tried another way:
mydata1 <- c(x1,x2,x3,x4,x5,x6,x7,x8)
mydata2 <- do.call('rbind',lapply(mydata1,read.table,header=T))
Error 2: in FUN(X[[i]], ...) :
'file' must be a character string or connection
Can anyone please tell me the right way to do this?
How to import all files from a single folder at once and bind them by row (assuming each file has the same format):
library(tidyverse)
list.files(path = "location_of/data/folder_you_want/",
           pattern = "\\.csv$",  # pattern is a regular expression, not a glob
           full.names = TRUE) %>%
  map_df(~read_csv(.))
If there is a file that you want to exclude, then
list.files(path = "location_of/data/folder_you_want/",
           pattern = "\\.csv$",
           full.names = TRUE) %>%
  .[ !grepl("data/folder/name_of_file_to_remove.csv", .) ] %>%
  map_df(~read_csv(.))
Sample CSV Files
Note
CSV files to be merged here have
- equal number of columns
- same column names
- same order of columns
- number of rows can be different
1st csv file abc.csv
A,B,C,D
1,2,3,4
2,3,4,5
3,4,5,6
1,1,1,1
2,2,2,2
44,44,44,44
4,4,4,4
4,4,4,4
33,33,33,33
11,1,11,1
2nd csv file pqr.csv
A,B,C,D
1,2,3,40
2,3,4,50
3,4,50,60
4,4,4,4
5,5,5,5
6,6,6,6
List FILENAMES of CSV Files
Note
The path below E:/MergeCSV/ has just the files to be merged. No other csv files. So in this path, there are only two csv files, abc.csv and pqr.csv
## List filenames to be merged.
filenames <- list.files(path="E:/MergeCSV/",pattern="\\.csv$")
## Print filenames to be merged
print(filenames)
## [1] "abc.csv" "pqr.csv"
FULL PATH to CSV Files
## Full path to csv filenames
fullpath=file.path("E:/MergeCSV",filenames)
## Print Full Path to the files
print(fullpath)
## [1] "E:/MergeCSV/abc.csv" "E:/MergeCSV/pqr.csv"
MERGE CSV Files
## Merge listed files from the path above
dataset <- do.call("rbind",lapply(filenames,FUN=function(files){ read.csv(files)}))
## Print the merged csv dataset; if it's large, use the `head()` function to get a glimpse of the merged dataset
dataset
# A B C D
# 1 1 2 3 4
# 2 2 3 4 5
# 3 3 4 5 6
# 4 1 1 1 1
# 5 2 2 2 2
# 6 44 44 44 44
# 7 4 4 4 4
# 8 4 4 4 4
# 9 33 33 33 33
# 10 11 1 11 1
# 11 1 2 3 40
# 12 2 3 4 50
# 13 3 4 50 60
# 14 4 4 4 4
# 15 5 5 5 5
# 16 6 6 6 6
head(dataset)
# A B C D
# 1 1 2 3 4
# 2 2 3 4 5
# 3 3 4 5 6
# 4 1 1 1 1
# 5 2 2 2 2
# 6 44 44 44 44
## Print dimension of merged dataset
dim(dataset)
## [1] 16 4
The accepted answer above generates the error shown in the comments because lapply() is given filenames (bare file names) instead of fullpath, so read.csv() looks for the files in the current working directory. Pass fullpath to read from the directory of your choice:
dataset <- do.call("rbind",lapply(fullpath,FUN=function(files){ read.csv(files)}))
You can use a combination of lapply(), and do.call().
## cd to the csv directory
setwd("mycsvs")
## read in csvs
csvList <- lapply(list.files("./", pattern = "\\.csv$"), read.csv, stringsAsFactors = FALSE)
## bind them all with do.call
csv <- do.call(rbind, csvList)
You can also use the fread() function from the data.table package together with rbindlist() for a performance boost.
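The pieces from the answers above can be put together in a small base R sketch. It is an illustration only. The folder name merge_csv_demo and the two tiny files are made up so that the block runs anywhere. Reading with stringsAsFactors = FALSE is one way to avoid the "invalid factor level" warning from the question, since rbind then stacks character columns rather than factors with different levels.

```r
# Write two small CSVs to a temporary folder so the sketch is self-contained.
dir <- file.path(tempdir(), "merge_csv_demo")   # hypothetical folder name
dir.create(dir, showWarnings = FALSE)
write.csv(data.frame(A = 1:3, B = c("x", "y", "z")),
          file.path(dir, "abc.csv"), row.names = FALSE)
write.csv(data.frame(A = 4:5, B = c("p", "q")),
          file.path(dir, "pqr.csv"), row.names = FALSE)

# full.names = TRUE returns paths that work from any working directory;
# the pattern is a regular expression anchored on the ".csv" suffix.
paths <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file with strings kept as character, then stack the rows.
parts  <- lapply(paths, read.csv, stringsAsFactors = FALSE)
merged <- do.call(rbind, parts)
```

Note that lapply() receives file paths here, not data frames. Data frames that are already in memory can be passed straight to do.call(rbind, list(x1, x2)).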

How to read one specific column from table() in R

From a 100000+ rows table I generated this small table with table() in R:
> TableName <- table(ProductID = test$ProductID,format(test$Dates, "%y%m%d"))
> TableName
ProductID 161024 161025 161026 161027 161028 161029 161030
1 1 2 4 1 2 3 5
2 4 4 7 3 8 1 8
3 1 1 1 0 0 0 0
6 1 1 1 0 0 0 0
8 3 9 8 6 1 7 3
Normally I can read one specific column with TableName$ColumnName, but that doesn't work with the table generated by table() unless I write the table to a .csv file first.
Is there any way to read one specific column without writing the table to a .csv file and reading that file back into R?
The $ operator does not work on a matrix or a table, so we need to index with [ instead
TableName[, '161024']

Split dataframe to multiple small dataframes in R [duplicate]

I have a large dataframe I would like to split into multiple small data frames, based on the value in the Name column.
head(DATAFILE)
# Age Site Name 1 2 3 4 5
# 10 1 Orange 0 2 1 0 1
# 10 1 Apple 2 5 4 0 2
# 10 1 Banana 0 0 0 0 2
# 20 2 Orange 0 2 1 0 0
# 20 2 Apple 0 2 0 7 1
# 20 2 Banana 0 4 1 3 6
And an example file of the desired output;
head(Orange)
# Age Site Name 1 2 3 4 5
# 10 1 Orange 0 2 1 0 1
# 20 2 Orange 0 2 1 0 0
I have tried
SPLIT.DATA <- split(DATAFILE, DATAFILE$Name, drop = FALSE)
But this returns a large list, and I would like individual files so that I can save them as .csv files. So I would like either a better way of dividing the original file, or a way to further divide the SPLIT.DATA file.
It is better to save the datasets directly from the list returned by split instead of creating individual objects in the global environment. We loop over the names of SPLIT.DATA and write each list element to its own csv file, building the file name by pasting the element's name to .csv in the write.csv call.
lapply(names(SPLIT.DATA), function(nm)
write.csv(SPLIT.DATA[[nm]], paste0(nm, ".csv"), row.names = FALSE, quote = FALSE))
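A runnable variant of the same idea is sketched here on made-up data, writing into a temporary folder so that nothing lands in the working directory. out_dir and paths are hypothetical names, and Map() is used only as an alternative way to pair each piece with its file path.

```r
# Made-up data with the same kind of Name column as in the question.
DATAFILE <- data.frame(Age   = c(10, 10, 20, 20),
                       Name  = c("Orange", "Apple", "Orange", "Apple"),
                       Count = c(0, 2, 1, 5))

# One data frame per distinct Name, returned as a named list.
SPLIT.DATA <- split(DATAFILE, DATAFILE$Name)

# Build one output path per list element.
out_dir <- tempdir()
paths <- file.path(out_dir, paste0(names(SPLIT.DATA), ".csv"))

# Write each piece to its path; invisible() hides the list of NULLs.
invisible(Map(function(d, p) write.csv(d, p, row.names = FALSE),
              SPLIT.DATA, paths))
```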

How to save all loop's results in a csv

Consider this toy data frame:
df <- read.table(text = "target birds wolfs
0 21 7
0 8 4
1 2 5
1 2 4
0 8 3
1 1 12
1 7 10
1 1 9 ",header = TRUE)
I would like to run a loop that calculates the mean of rows 2:5 for each variable and saves all the results as a CSV.
I wrote this line of code:
for(i in names(df)) {print(mean(df[2:5,i]))}
and got the following results:
[1] 0.5
[1] 5
[1] 4
But when I tried to export it to CSV using the code below, the file contained only the last result: [1] 4.
code:
for(i in names(df)) { j<-(mean(df[2:5,i]))
write.csv(j,"j.csv") }
How can I get in the same csv file a list of all the results?
In dplyr, you could use summarise() with across() to perform a computation on every column of your data frame (summarise_each() and funs() are deprecated).
library(dplyr)
j <- slice(df, 2:5) %>%                    # selects rows 2 to 5
  summarise(across(everything(), mean))    # computes mean of each column
The results are in a data.frame:
j
target birds wolfs
1 0.5 5 4
If you want each variable mean on a separate line, use t()
t(j)
[,1]
target 0.5
birds 5.0
wolfs 4.0
And to export the results:
write.csv(t(j),"j.csv")
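The loop in the question keeps only the last value because write.csv() overwrites j.csv on every iteration. A base R sketch that collects all the means first and writes once is shown below. means, result and out are hypothetical names, and the file goes to a temporary folder purely for illustration.

```r
df <- read.table(text = "target birds wolfs
0 21 7
0 8 4
1 2 5
1 2 4
0 8 3
1 1 12
1 7 10
1 1 9", header = TRUE)

# colMeans() computes one mean per column over rows 2 to 5.
means <- colMeans(df[2:5, ])

# One row per variable, then a single write at the end.
result <- data.frame(variable = names(means), mean = unname(means))
out <- file.path(tempdir(), "j.csv")
write.csv(result, out, row.names = FALSE)
```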
