I have a large dataframe I would like to split into multiple small data frames, based on the value in the Name column.
head(DATAFILE)
# Age Site Name 1 2 3 4 5
# 10 1 Orange 0 2 1 0 1
# 10 1 Apple 2 5 4 0 2
# 10 1 Banana 0 0 0 0 2
# 20 2 Orange 0 2 1 0 0
# 20 2 Apple 0 2 0 7 1
# 20 2 Banana 0 4 1 3 6
And an example of the desired output:
head(Orange)
# Age Site Name 1 2 3 4 5
# 10 1 Orange 0 2 1 0 1
# 20 2 Orange 0 2 1 0 0
I have tried
SPLIT.DATA <- split(DATAFILE, DATAFILE$Name, drop = FALSE)
But this returns a large list, and I would like individual data frames so that I can save them as .csv files. So I would like either a better way of dividing the original data frame, or a way to further divide the SPLIT.DATA list.
It is better to save the datasets directly from the list output of split instead of creating individual objects in the global environment. We loop over the names of 'SPLIT.DATA' and write each list element to its own csv file, building the file name by pasting the element's name onto ".csv" (with paste0) in the write.csv call.
lapply(names(SPLIT.DATA), function(nm)
write.csv(SPLIT.DATA[[nm]], paste0(nm, ".csv"), row.names = FALSE, quote = FALSE))
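As a rough, self-contained sketch of the same idea: the version below writes the files into a temporary folder and reads one back to check it. The folder name split_output, the cut-down copy of DATAFILE (count columns left out for brevity) and the variable names paths and orange are assumptions made for illustration, not part of the original question.

```r
# Cut-down stand-in for the DATAFILE in the question (count columns omitted)
DATAFILE <- data.frame(
  Age  = c(10, 10, 10, 20, 20, 20),
  Site = c(1, 1, 1, 2, 2, 2),
  Name = c("Orange", "Apple", "Banana", "Orange", "Apple", "Banana"),
  stringsAsFactors = FALSE
)

# Hypothetical output folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "split_output")
dir.create(out_dir, showWarnings = FALSE)

# One data frame per Name, then one csv file per list element
SPLIT.DATA <- split(DATAFILE, DATAFILE$Name)
paths <- file.path(out_dir, paste0(names(SPLIT.DATA), ".csv"))
invisible(Map(function(d, p) write.csv(d, p, row.names = FALSE),
              SPLIT.DATA, paths))

# Read one file back to confirm it holds only the Orange rows
orange <- read.csv(paths[names(SPLIT.DATA) == "Orange"])
orange
```

Building the paths with file.path keeps the output out of the working directory, and invisible() only hides the list of NULLs that the loop would otherwise print.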
Suppose I have a data frame like this
df <- data.frame(x=c("This script outputs 10 visualizations.",
"This script outputs 1 visualization.",
"This script outputs 5 data files.",
"This script outputs 1 data file.",
"This script doesn't output any visualizations or data files",
"This script outputs 9 visualizations and 28 data files.",
"This script outputs 1 visualization and 1 data file."))
It looks like this
x
1 This script outputs 10 visualizations.
2 This script outputs 1 visualization.
3 This script outputs 5 data files.
4 This script outputs 1 data file.
5 This script doesn't output any visualizations or data files
6 This script outputs 9 visualizations and 28 data files.
7 This script outputs 1 visualization and 1 data file.
Is there a simple way, possibly using the Tidyverse, to extract the number of visualizations and the number of files for each row? When there are no visualizations (or no data files, or both) I would like to extract 0. Essentially, I would like the final result to look like this
viz files
1 10 0
2 1 0
3 0 5
4 0 1
5 0 0
6 9 28
7 1 1
I tried using stuff like
str_extract(df$x, "(?<=This script outputs )(.*)(?= visualizatio(n\\.$|ns\\.$))")
but I got so lost.
We can use a regex lookahead in str_extract to extract one or more digits (\\d+) that are followed by a space and 'vis' (first column) or by a space and 'data file'/'data files' (second column)
library(dplyr)
library(stringr)
library(tidyr) # provides replace_na
df %>%
  transmute(viz = as.numeric(str_extract(x, "\\d+(?= vis)")),
            files = as.numeric(str_extract(x, "\\d+(?= data files?)"))) %>%
  mutate_all(replace_na, 0)
# viz files
#1 10 0
#2 1 0
#3 0 5
#4 0 1
#5 0 0
#6 9 28
#7 1 1
In the first case, the pattern matches one or more digits (\\d+) followed by a positive lookahead ((?= vis)), which requires a space and the text 'vis' to come next without including them in the match. For the second column, it extracts the digits that are followed by a space and 'data file' or 'data files'.
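The same lookahead idea can be tried without any packages. The sketch below is only an illustration in base R: the helper name extract_count is made up here, and it relies on regexpr with perl = TRUE, since lookaheads need the PCRE engine.

```r
x <- c("This script outputs 10 visualizations.",
       "This script outputs 1 visualization.",
       "This script outputs 5 data files.",
       "This script outputs 1 data file.",
       "This script doesn't output any visualizations or data files",
       "This script outputs 9 visualizations and 28 data files.",
       "This script outputs 1 visualization and 1 data file.")

# Hypothetical helper: first match of `pattern` per string as a number,
# or 0 when the string has no match
extract_count <- function(text, pattern) {
  m <- regexpr(pattern, text, perl = TRUE)  # -1 marks "no match"
  hit <- m != -1
  out <- rep(0, length(text))
  out[hit] <- as.numeric(regmatches(text, m))  # regmatches skips non-matches
  out
}

viz   <- extract_count(x, "\\d+(?= visualization)")
files <- extract_count(x, "\\d+(?= data file)")
data.frame(viz, files)
```

Starting from a vector of zeros avoids the separate NA-replacement step, because rows without a match are never assigned.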
You could use the package unglue to get a readable solution as you have a limited amount of possible patterns, then replace NAs by 0 :
library(unglue)
patterns <-
c("This script outputs {viz} visualization{=s{0,1}} and {files} data file{=s{0,1}}.",
"This script outputs {viz} visualization{=s{0,1}}.",
"This script outputs {files} data file{=s{0,1}}.")
res <- unglue_unnest(df, x, patterns, convert = TRUE)
res[is.na(res)] <- 0
res
#> viz files
#> 1 10 0
#> 2 1 0
#> 3 0 5
#> 4 0 1
#> 5 0 0
#> 6 9 28
#> 7 1 1
A base R approach: capture the digits that sit directly before each keyword. sub returns a row unchanged when the pattern does not match, so as.numeric turns those rows into NA (hence suppressWarnings), and the NAs are then replaced by 0.
df$viz <- suppressWarnings(as.numeric(sub(".* (\\d+) visualization.*", "\\1", df$x)))
df$files <- suppressWarnings(as.numeric(sub(".* (\\d+) data file.*", "\\1", df$x)))
df[is.na(df)] <- 0
df
# x viz files
# 1 This script outputs 10 visualizations. 10 0
# 2 This script outputs 1 visualization. 1 0
# 3 This script outputs 5 data files. 0 5
# 4 This script outputs 1 data file. 0 1
# 5 This script doesn't output any visualizations or data files 0 0
# 6 This script outputs 9 visualizations and 28 data files. 9 28
# 7 This script outputs 1 visualization and 1 data file. 1 1
I have several data frames, each holding a list of elements such as NAMES. The names differ between data frames, but most of them match. I'd like to combine all of them into one list in which I can see whether any names are missing from any of the data frames.
DATA sample for df1:
X x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5 RH_Type-Function-S
6 6 RH_REFERENT-S
and for df2
X x
1 1 rh_Structure/Focus_S
2 2 rh_Structure/Focus_C
3 3 lh_Structure/Focus_S
4 4 lh_Structure/Focus_C
5 5 UCZESTNIK
6 6 COACH
and expected result would be:
NAME. df1 df2
1 COACH NA 6
2 lh_Structure/Focus_C 4 4
3 lh_Structure/Focus_S 3 3
4 RH_REFERENT-S 6 NA
5 rh_Structure/Focus_C 2 2
6 rh_Structure/Focus_S 1 1
7 RH_Type-Function-S 5 NA
8 UCZESTNIK NA 5
I can do that with merge.data.frame(df1, df2, by = "x", all = TRUE),
but then I can't do it with more data frames of a similar structure. Any help would be appreciated.
It might be easier to work with this in a long form. Just rbind all the datasets below one another with a flag for which dataset they came from. Then it's relatively straightforward to get a tabulation of all the missing values (and as an added bonus, you can see if you have any duplicates in any of the source datasets):
dfs <- c("df1","df2")
dfall <- do.call(rbind, Map(cbind, mget(dfs), src=dfs))
table(dfall$x, dfall$src)
# df1 df2
# COACH 0 1
# lh_Structure/Focus_C 1 1
# lh_Structure/Focus_S 1 1
# RH_REFERENT-S 1 0
# rh_Structure/Focus_C 1 1
# rh_Structure/Focus_S 1 1
# RH_Type-Function-S 1 0
# UCZESTNIK 0 1
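If the wide layout from the question (one column of positions per data frame) is still wanted for more than two inputs, one possible sketch is to rename each position column after its data frame and fold the list with Reduce and merge. The third data frame df3 and the name NEW_NAME are invented here purely to show that the approach scales; df1 and df2 are retyped from the samples above.

```r
df1 <- data.frame(X = 1:6,
                  x = c("rh_Structure/Focus_S", "rh_Structure/Focus_C",
                        "lh_Structure/Focus_S", "lh_Structure/Focus_C",
                        "RH_Type-Function-S", "RH_REFERENT-S"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(X = 1:6,
                  x = c("rh_Structure/Focus_S", "rh_Structure/Focus_C",
                        "lh_Structure/Focus_S", "lh_Structure/Focus_C",
                        "UCZESTNIK", "COACH"),
                  stringsAsFactors = FALSE)
# Hypothetical third data frame, not from the question
df3 <- data.frame(X = 1:3,
                  x = c("COACH", "rh_Structure/Focus_S", "NEW_NAME"),
                  stringsAsFactors = FALSE)

dfs <- list(df1 = df1, df2 = df2, df3 = df3)

# Keep the name plus the position, and call the position column after its source
renamed <- Map(function(d, nm) setNames(d[c("x", "X")], c("NAME", nm)),
               dfs, names(dfs))

# Full outer join across the whole list; NA marks a name missing from a source
wide <- Reduce(function(a, b) merge(a, b, by = "NAME", all = TRUE), renamed)
wide
```

Adding another data frame then only means adding it to the dfs list.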
I have a table like this
table(mtcars$gear, mtcars$cyl)
I want to order the rows by the number of observations in the 4-cylinder column, largest first. E.g.
4 6 8
4 8 4 0
5 2 1 2
3 1 2 12
I have been playing with order/sort/rank without much success. How could I order tables output?
We can convert table to data.frame and then order by the column.
sort_col <- "4"
tab <- as.data.frame.matrix(table(mtcars$gear, mtcars$cyl))
tab[order(-tab[[sort_col]]), ]
# OR tab[order(tab[[sort_col]], decreasing = TRUE), ]
# 4 6 8
#4 8 4 0
#5 2 1 2
#3 1 2 12
If we don't want to convert it into data frame and want to maintain the table structure we can do
tab <- table(mtcars$gear, mtcars$cyl)
tab[order(-tab[,dimnames(tab)[[2]] == sort_col]),]
# 4 6 8
# 4 8 4 0
# 5 2 1 2
# 3 1 2 12
Could try this. Use sort for the relevant column, specifying decreasing=TRUE; take the names of the sorted rows and subset using those.
table(mtcars$gear, mtcars$cyl)[names(sort(table(mtcars$gear, mtcars$cyl)[, 1], decreasing = TRUE)), ]
4 6 8
4 8 4 0
5 2 1 2
3 1 2 12
Along the same lines as Milan's answer, but using the order() function instead of looking up the names() of a sort()-ed vector.
The [,1] selects the first column (4 cylinders here) as the one to order by.
table(mtcars$gear, mtcars$cyl)[order(table(mtcars$gear, mtcars$cyl)[,1], decreasing=T),]
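To avoid building the table twice, the ordering step can be wrapped in a small helper. This is just a sketch: the function name sort_rows_by is invented here, and it assumes a two-dimensional table whose columns are addressed by name.

```r
# Hypothetical helper: reorder the rows of a 2-d table by one of its columns
sort_rows_by <- function(tab, col, decreasing = TRUE) {
  tab[order(tab[, col], decreasing = decreasing), , drop = FALSE]
}

tab <- table(gear = mtcars$gear, cyl = mtcars$cyl)

by_four  <- sort_rows_by(tab, "4")  # most 4-cylinder cars first
by_eight <- sort_rows_by(tab, "8")  # most 8-cylinder cars first
by_four
```

Passing the column by name ("4") rather than by position keeps the call readable if the column order ever changes.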
From a table with 100000+ rows I generated this small table with table() in R:
> TableName <- table(ProductID = test$ProductID,format(test$Dates, "%y%m%d"))
> TableName
ProductID 161024 161025 161026 161027 161028 161029 161030
1 1 2 4 1 2 3 5
2 4 4 7 3 8 1 8
3 1 1 1 0 0 0 0
6 1 1 1 0 0 0 0
8 3 9 8 6 1 7 3
Normally I can read one specific column with TableName$ColumnName, but that doesn't work with the object generated by table() unless I write it to a .csv file first.
Is there any way to read one specific column without writing the table to a .csv file and reading that file back into R?
For a matrix or a table, $ will not work, so we need to index with [ and the column name:
TableName[, '161024']
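A small sketch of both options, using the built-in mtcars data as a stand-in because the test data from the question is not available: index the table with [ , or convert it with as.data.frame.matrix if $ access is really wanted. The variable names below are illustrative only.

```r
tab <- table(gear = mtcars$gear, cyl = mtcars$cyl)

# `$` is not defined for a table (an atomic array), so this attempt errors
dollar_fails <- inherits(try(tab$`4`, silent = TRUE), "try-error")

# Option 1: matrix-style indexing by column name gives a named vector
col_4 <- tab[, "4"]

# Option 2: convert to a wide data frame, after which `$` works
wide <- as.data.frame.matrix(tab)
wide$`4`
```

Column names that start with a digit need backticks when used with $, which is one more reason the [ form is often more convenient here.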
I have the following data frame:
id<-c(1,2,3,4,1,1,2,3,4,4,2,2)
period<-c("first","calib","valid","valid","calib","first","valid","valid","calib","first","calib","valid")
df<-data.frame(id,period)
typing
table(df)
results in
period
id calib first valid
1 1 2 0
2 2 0 2
3 0 0 2
4 1 1 1
however if I save it as a data frame 'df'
df<-data.frame(table(df))
the format of 'df' would be like
id period Freq
1 1 calib 1
2 2 calib 2
3 3 calib 0
4 4 calib 1
5 1 first 2
6 2 first 0
7 3 first 0
8 4 first 1
9 1 valid 0
10 2 valid 2
11 3 valid 2
12 4 valid 1
How can I avoid this, and how can I save the first output as it is into a data frame?
More importantly, is there any way to get the same result using 'dcast'?
Would this help?
> data.frame(unclass(table(df)))
calib first valid
1 1 2 0
2 2 0 2
3 0 0 2
4 1 1 1
To elaborate just a little bit. I've changed the ids in the example data.frame such that your ids are not 1:4, in order to prove that the ids are carried along into the table and are not a sequence of row counts.
id <- c(10,20,30,40,10,10,20,30,40,40,20,20)
period <- c("first","calib","valid","valid","calib","first","valid","valid","calib","first","calib","valid")
df <- data.frame(id,period)
Create the new data.frame one of two ways. rengis's answer is fine for 2-column data frames that have the id column first. It won't work so well if your data frame has more than 2 columns, or if the columns are in a different order.
Alternative would be to specify the columns and column order for your table:
df3 <- data.frame(unclass(table(df$id, df$period)))
The id column is included in the new data.frame as row.names(df3). To add it as a new column:
df3$id <- row.names(df3)
df3
calib first valid id
10 1 2 0 10
20 2 0 2 20
30 0 0 2 30
40 1 1 1 40
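A variation on the same idea, sketched with as.data.frame.matrix so that the column names are kept as they are and the id ends up as the first column. The name wide is only illustrative. On the dcast part of the question: assuming the reshape2 package is installed, dcast(df, id ~ period, fun.aggregate = length) is expected to give the same counts, but that line is left as a comment so the sketch runs in plain base R.

```r
id <- c(10, 20, 30, 40, 10, 10, 20, 30, 40, 40, 20, 20)
period <- c("first", "calib", "valid", "valid", "calib", "first",
            "valid", "valid", "calib", "first", "calib", "valid")
df <- data.frame(id, period)

# Cross-tabulate, then keep the wide layout as a data frame
wide <- as.data.frame.matrix(table(df$id, df$period))

# Move the ids from the row names into a regular first column
wide <- data.frame(id = as.numeric(rownames(wide)), wide, row.names = NULL)
wide

# Assumption, needs reshape2 and is not run here:
# reshape2::dcast(df, id ~ period, fun.aggregate = length)
```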