How to add column names after loading xlsx files in a for loop - r

I am looping to load multiple xlsx files, and that part works. But I have not managed to add the column names of the documents (the same names for all files).
library(dplyr)
library(readr)
library(openxlsx)
library(readxl)
setwd("C:/Users/MiguelAngel/Documents/R Miguelo/Guillermo Ahumada")
ldf <- list()
listxlsx <- dir(pattern = "*.xlsx")
for (k in 1:length(listxlsx)){
ldf[[k]] <-as.data.frame(read.xlsx(listxlsx[k]))
}
The result:
355 1500 1100 43831
1 190 850 600 43832
2 93 4000 3000 43833
3 114 4000 3000 43834
4 431 1000 700 43835
5 182 1000 700 43836
6 496 500 300 43837
7 254 500 300 43838
8 174 600 300 43839
9 397 1500 945 43840
10 198 1500 900 43841
11 271 1500 900 43842
12 94 3000 2000 43843
13 206 400 230 43844
14 305 1500 1100 43845
15 184 850 600 43846
16 90 4000 3000 43847
17 70 4000 3000 43848
18 492 1000 700 43849
19 168 1000 700 43850
20 530 500 300 43851
All the files load fine, but without the column names.
I need to add the column names:
list_file <- dir(pattern = "*.xlsx") %>%
lapply(read.xlsx) %>% # I tried stringsAsFactors here, but it raised an error.
bind_rows
but this is what I get:
list_file
(Image: the original column layout, identical in all files)
I need to assign these column names after running the for loop.
Thanks for your help, guys.

I cannot check this since I don't have Excel files to load, but I think this should work:
listxlsx <- list.files(path = "C:/Users/MiguelAngel/Documents/R Miguelo/Guillermo Ahumada", pattern = "\\.xlsx$", full.names = TRUE)
names(listxlsx) <- listxlsx
purrr::map_dfr(listxlsx, readxl::read_excel, .id = "Filename")
(The first line is a better practice to get the filenames than relying on setwd.)
When listxlsx is a named vector, map_dfr adds a column named Filename whose values are taken from the names of listxlsx.
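If the files turn out to have no header row, another option is to set the names after loading. The following is a minimal base-R sketch, assuming every data frame in the list has the same four columns. The names in col_names are hypothetical placeholders, and the two small data frames only stand in for the loaded files:

```r
# Stand-ins for the data frames read in the loop (no meaningful column names)
ldf <- list(
  data.frame(V1 = c(355, 190), V2 = c(1500, 850), V3 = c(1100, 600), V4 = c(43831, 43832)),
  data.frame(V1 = c(93, 114), V2 = c(4000, 4000), V3 = c(3000, 3000), V4 = c(43833, 43834))
)

# Hypothetical column names; replace with the real ones
col_names <- c("id", "price", "cost", "date_serial")

# Apply the same names to every data frame in the list
ldf <- lapply(ldf, setNames, col_names)

# Stack everything into one data frame
all_rows <- do.call(rbind, ldf)
print(all_rows)
```

When reading with openxlsx, passing colNames = FALSE to read.xlsx should keep the first row as data instead of treating it as a header, which is likely what happened in the output shown in the question.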

Related

for loop question in R with rbind or do.call

A VERY simplified example of my dataset:
HUC8 YEAR RO_MM
1: 10010001 1961 78.2
2: 10010001 1962 84.0
3: 10010001 1963 70.2
4: 10010001 1964 130.5
5: 10010001 1965 54.3
I found this code online which sort of, but not quite, does what I want:
#create a list of the files from your target directory
file_list <- list.files(path="~/Desktop/Rprojects")
#initiate a blank data frame, each iteration of the loop will append the data from the given file to this variable
allHUCS <- data.frame()
#I want to read each .csv from a folder named "Rprojects" on my desktop into one huge dataframe for further use.
for (i in 1:length(file_list)){
temp_data <- fread(file_list[i], stringsAsFactors = F)
allHUCS <- rbindlist(list(allHUCS, temp_data), use.names = T)
}
Question: I have read that one should not use rbindlist for a large dataset:
"You should never ever ever iteratively rbind within a loop: performance might be okay in the beginning, but with each call to rbind it makes a complete copy of the data, so with each pass the total data to copy increases. It scales horribly. Consider do.call(rbind.data.frame, file_list)." – @r2evans
I know this may seem simple but I'm unclear about how to use his directive. Would I write this for the last line?
allHUCS <- do.call(rbind.data.frame(allHUCS, temp_data), use.names = T)
Or something else? In my actual data, each .csv has 2099 objects with 3 variables (but I only care about the last two.) The total dataframe should contain 47,000,000+ objects of 2 variables. When I ran the original code I got these errors:
Error in rbindlist(list(allHUCS, temp_data), use.names = T) : Item 2
has 2 columns, inconsistent with item 1 which has 3 columns. To fill
missing columns use fill=TRUE.
In addition: Warning messages: 1: In fread(file_list[i],
stringsAsFactors = F) : Detected 1 column names but the data has 2
columns (i.e. invalid file). Added 1 extra default column name for the
first column which is guessed to be row names or an index. Use
setnames() afterwards if this guess is not correct, or fix the file
write command that created the file to create a valid file.
2: In fread(file_list[i], stringsAsFactors = F) : Stopped early on
line 20. Expected 2 fields but found 3. Consider fill=TRUE and
comment.char=. First discarded non-empty line: <<# mv *.csv .. ; >>
Except for the setnames() suggestion, I don't understand what I'm being told. I know it says it stopped early, but I don't even know how to see the entire dataset or to tell where it stopped.
I'm now reading that rbindlist and rbind are two different things and rbindlist is faster than do.call(rbind, data). But the suggestion is do.call(rbind.data.frame(allHUCS, temp_data). Which is going to be fastest?
Since the original post does not include a reproducible example, here is one that reads data from the Pokémon Stats data that I maintain on GitHub.
First, we download a zip file containing one CSV file for each generation of Pokémon, and unzip it to the ./pokemonData subdirectory of the R working directory.
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonData.zip",
"pokemonData.zip",
method="curl",mode="wb")
unzip("pokemonData.zip",exdir="./pokemonData")
Next, we obtain a list of files in the directory to which we unzipped the CSV files.
thePokemonFiles <- list.files("./pokemonData",
full.names=TRUE)
Finally, we load the data.table package, use lapply() with data.table::fread() to read the files, combine the resulting list of data tables with do.call(), and print the head() and tail() of the resulting data frame with all 8 generations of Pokémon stats.
library(data.table)
data <- do.call(rbind,lapply(thePokemonFiles,fread))
head(data)
tail(data)
...and the output:
> head(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk Sp. Def Speed
1: 1 Bulbasaur Grass Poison 318 45 49 49 65 65 45
2: 2 Ivysaur Grass Poison 405 60 62 63 80 80 60
3: 3 Venusaur Grass Poison 525 80 82 83 100 100 80
4: 4 Charmander Fire 309 39 52 43 60 50 65
5: 5 Charmeleon Fire 405 58 64 58 80 65 80
6: 6 Charizard Fire Flying 534 78 84 78 109 85 100
Generation
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
> tail(data)
ID Name Form Type1 Type2 Total HP Attack Defense Sp. Atk
1: 895 Regidrago Dragon 580 200 100 50 100
2: 896 Glastrier Ice 580 100 145 130 65
3: 897 Spectrier Ghost 580 100 65 60 145
4: 898 Calyrex Psychic Grass 500 100 80 80 80
5: 898 Calyrex Ice Rider Psychic Ice 680 100 165 150 85
6: 898 Calyrex Shadow Rider Psychic Ghost 680 100 85 80 165
Sp. Def Speed Generation
1: 50 80 8
2: 110 30 8
3: 80 130 8
4: 80 80 8
5: 130 50 8
6: 100 150 8
>
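To connect this back to the question about how to apply the do.call() advice: do.call() takes a function and a list of arguments, so the files are read into a list first and bound once at the end. Here is a minimal base-R sketch of that shape, using small temporary CSV files as stand-ins (the folder name huc_demo and the file contents are made up for illustration):

```r
# Write three small CSV files to a temporary folder (stand-ins for the real data)
csv_dir <- file.path(tempdir(), "huc_demo")
dir.create(csv_dir, showWarnings = FALSE)
for (i in 1:3) {
  demo <- data.frame(HUC8 = 10010000 + i, YEAR = 1961:1963, RO_MM = c(78.2, 84.0, 70.2) + i)
  write.csv(demo, file.path(csv_dir, paste0("huc_", i, ".csv")), row.names = FALSE)
}

# full.names = TRUE so the paths work regardless of the working directory
file_list <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file first; the result is a list of data frames
tables <- lapply(file_list, read.csv)

# Bind once: do.call passes the list elements as arguments to rbind
allHUCS <- do.call(rbind, tables)

# Keep only the last two columns
allHUCS <- allHUCS[, c("YEAR", "RO_MM")]
print(dim(allHUCS))
```

With data.table loaded, rbindlist(lapply(file_list, fread)) follows the same read-then-bind-once pattern. Restricting list.files() with a pattern also keeps stray non-CSV files in the folder from being read.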

How to merge tables with different column headers in loop [duplicate]

This question already has answers here:
Combine two data frames by rows (rbind) when they have different sets of columns
(14 answers)
Closed 3 years ago.
I have a for loop that goes through a specific column in different CSV files (all these different files are just different runs for a specific class) and retrieve the count of each value. For example, in the first file (first run):
0 1 67
101 622 277
In the second run:
0 1 67 68
109 592 297 2
In the third run:
0 1 67
114 640 246
Note that each run might result in different values (look at the second run that includes one more value that is 68). I would like to merge all these results in one list and then write it to a CSV file. To do that, I did the following:
files <- list.files("/home/adam/Desktop/runs", pattern="*.csv", recursive=TRUE, full.names=TRUE, include.dirs=TRUE)
all <- list()
col <- 14
for(j in 1:length(files)){
dataset <- read.csv(files[j])
uniqueValues <- table(dataset[,col]) #this generates the examples shown above
all <- rbind(uniqueValues)
}
write.table(all, "all.csv", col.names=TRUE, sep=",")
The result of all is:
0 1 67
114 640 246
How to solve that?
The expected results in:
0 1 67 68
101 622 277 0
109 592 297 2
114 640 246 0
I marked this as a potential duplicate (see the linked question). Using rbind.fill from plyr:
library(plyr)
df1 <- data.frame(A0 = c(101),
A1 = c(622),
A67 = c(277))
df2 <- data.frame(A0 = c(109),
A1 = c(592),
A67 = c(297),
A68= c(2))
df3 <- data.frame(A0 = c(114),
A1 = c(640),
A67 = c(246))
newds=rbind.fill(df1,df2,df3)
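Note that rbind.fill fills the missing columns with NA, while the expected result in the question shows 0. A base-R sketch that produces zeros directly is shown here, assuming the per-run counts are available as named vectors (the list runs stands in for the table() results collected in the loop):

```r
# Stand-ins for table(dataset[, col]) from three runs
runs <- list(
  c("0" = 101, "1" = 622, "67" = 277),
  c("0" = 109, "1" = 592, "67" = 297, "68" = 2),
  c("0" = 114, "1" = 640, "67" = 246)
)

# Every value that occurs in any run, in order of first appearance
all_values <- unique(unlist(lapply(runs, names)))

# For each run, start from a zero vector and fill in the counts it has
combined <- t(sapply(runs, function(r) {
  out <- setNames(numeric(length(all_values)), all_values)
  out[names(r)] <- r
  out
}))

print(combined)
```

In the original loop, the same idea means storing each table in a list (for example all[[j]] <- uniqueValues) instead of overwriting all on every pass, then combining after the loop.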

R: How to read selected columns from a RDS files?

How to read part of the data from very large files?
The sample data is generated as:
set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
replicate(10, stringi::stri_rand_strings(1000, 5)))
head(df)
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X1.1 X2.1 X3.1 X4.1 X5.1 X6.1 X7.1 X8.1 X9.1 X10.1
# 1 575 1843 1854 883 592 1362 1075 210 1526 1365 Qk8NP Xvw9z OYRa1 8BGIV bejiv CCoIE XDKJN HR7zc 2kKNY 1I5h8
# 2 1577 390 1861 912 277 636 758 1461 1978 1865 ZaHFl QLsli E7lbs YGq8u DgUAW c6JQ0 RAZFn Sc0Zt mif8I 3Ys6U
# 3 818 1076 147 1221 257 1115 759 1959 1088 1292 jM5Uw ctM3y 0HiXR hjOHK BZDOP ULQWm Ei8qS BVneZ rkKNL 728gf
# 4 1766 884 1331 1144 1260 768 1620 1231 1428 1193 r4ZCI eCymC 19SwO Ht1O0 repPw YdlSW NRgfL RX4ta iAtVn Hzm0q
# 5 1881 1851 1324 1930 1584 1318 940 1796 830 15 w8d1B qK1b0 CeB8u SlNll DxndB vaufY ZtlEM tDa0o SEMUX V7tLQ
# 6 91 264 1563 414 914 1507 1935 1970 287 409 gsY1u FxIgu 2XqS4 8kreA ymngX h0hkK reIsn tKgQY ssR7g W3v6c
saveRDS is used to save the file.
saveRDS(df, 'df.rds')
The file size is checked with the commands below:
file.info('df.rds')$size
# [1] 29935125
utils:::format.object_size(29935125, "auto")
# [1] "28.5 Mb"
The saved file is read using the below function.
readRDS('df.rds')
However, some of my files are several GB in size, and I need only a few columns for certain processing. Is it possible to read selected columns from RDS files?
Note: I already have RDS files, generated after considerably large amounts of processing. Now, I want to know the best possible way to read selected columns from the existing RDS files.
I don't think you can read only a portion of an rds or rda file.
An alternative would be to use feather. As an example, using a large-ish feather I'm working with:
library(feather)
file.info("../feathers/C1.feather")["size"]
# size
# ../feathers/C1.feather 498782328
system.time( c1whole <- read_feather("../feathers/C1.feather") )
# user system elapsed
# 0.860 0.856 5.540
system.time( c1dyn <- feather("../feathers/C1.feather") )
# user system elapsed
# 0 0 0
ls.objects()
# Type Size PrettySize Dim
# c1dyn feather 3232 3.2 Kb 2886147 x 36
# c1whole tbl_df 554158688 528.5 Mb 2886147 x 36
You can interact with both variables as full data.frames: though c1whole is already in memory (so may be a little faster), accessing c1dyn is still quite speedy.
NB: some functions (e.g., several within dplyr) do not work on feather as they do on data.frame or tbl_df. If your intent is solely to pick-and-choose specific columns, then you'll be fine.
SQLite is another common way to store tabular/matrix/dataframe data on your hard drive. It also allows the use of standard SQL commands or dplyr to query the data. Just be warned that SQLite does not have a date type, so any dates need to be converted to character before writing them to the database.
set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
replicate(10, stringi::stri_rand_strings(1000, 5)))
library(RSQLite)
conn <- dbConnect(RSQLite::SQLite(), dbname="myDB")
dbWriteTable(conn,"mytable",df)
alltables <- dbListTables(conn)
# Use sql queries to query data...
oneColumn <- dbGetQuery(conn,"SELECT X1 FROM mytable")
library(dplyr)
library(dbplyr)
my_db <- tbl(conn, "mytable")
my_db
# Use dplyr functions to query data...
my_db %>% select(X1)
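A further option, not taken from the answers above, stays within base R: since an RDS file has to be read whole, the data can be re-saved once as one RDS file per column, after which only the needed columns are read. This is a rough sketch under that assumption; read_columns and the folder name df_columns are hypothetical, and the one-time conversion still requires loading the full object once:

```r
# Small stand-in data frame
df <- data.frame(X1 = 1:5, X2 = 6:10, X3 = letters[1:5])

# One-time conversion: write each column to its own .rds file
col_dir <- file.path(tempdir(), "df_columns")
dir.create(col_dir, showWarnings = FALSE)
for (nm in names(df)) {
  saveRDS(df[[nm]], file.path(col_dir, paste0(nm, ".rds")))
}

# Hypothetical helper: read only the requested columns back
read_columns <- function(dir, cols) {
  parts <- lapply(cols, function(nm) readRDS(file.path(dir, paste0(nm, ".rds"))))
  names(parts) <- cols
  as.data.frame(parts, stringsAsFactors = FALSE)
}

subset_df <- read_columns(col_dir, c("X1", "X3"))
print(subset_df)
```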

Divide a data-frame into x roughly equal groups -- sequentially

I want to divide a df into x roughly equal groups, sequentially.
I was basically doing it like this:
df_1 <- df[1:10,]
df_2 <- df[11:21,]
df_3..
Is there a simpler way to do this, using split or slice? The important thing is, I want to maintain the order of the df, not sample from it.
Imagine I had 7000 observations, and I wanted 19 roughly equal groups.
Best!
I don't know if it counts for roughly equal, but you can do this:
nobs <- 7000
ngroups <- 17
df <- data.frame(x = sample(nobs))
set.seed(1)
df$grp <- sort(sample(1:ngroups,nobs,T)) # added the sort so the order of your df is maintained
table(df$grp)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
# 436 407 410 369 417 411 440 401 431 411 356 398 390 414 443 418 448
then split(df,df$grp)
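If the group sizes should differ by at most one row, a deterministic variant avoids sampling altogether. The sketch below assumes the 7000 observations and 19 groups from the question and assigns each row to a group by its position:

```r
nobs <- 7000
ngroups <- 19
df <- data.frame(x = seq_len(nobs))

# Row i goes to group ceiling(i * ngroups / nobs), so groups are
# consecutive blocks of rows and the original order is preserved
df$grp <- ceiling(seq_len(nobs) * ngroups / nobs)

groups <- split(df, df$grp)
sizes <- vapply(groups, nrow, integer(1))
print(range(sizes))
```

Because the group label never decreases from one row to the next, split() returns the blocks in their original sequence.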

Dividing a file in R and automatically creating notepad files

I have a file which is like this :
"1943" 359 1327 "t000000" 8
"1944" 359 907 "t000000" 8
"1946" 359 472 "t000000" 8
"1947" 359 676 "t000000" 8
"1948" 326 359 "t000000" 8
"1949" 359 585 "t000000" 8
"1950" 359 1157 "t000000" 8
"2460" 275 359 "t000000" 8
"2727" 22 556 "t000000" 8
"2730" 22 676 "t000000" 8
"479" 17 1898 "t0000000" 5
"864" 347 720 "t000s" 12
"3646" 349 691 "t000s" 7
"6377" 870 1475 "t000s" 14
"7690" 566 870 "t000s" 14
"7691" 870 2305 "t000s" 14
"8120" 870 1179 "t000s" 14
"8122" 44 870 "t000s" 14
"8124" 870 1578 "t000s" 14
"8125" 206 870 "t000s" 14
"8126" 870 1834 "t000s" 14
"6455" 1 1019 "t000t" 13
"4894" 126 691 "t00t" 9
"4896" 126 170 "t00t" 9
"560" 17 412 "t0t" 7
"130" 65 522 "tq" 18
"1034" 17 990 "tq" 10
"332" 3 138 "ts" 2
"2063" 61 383 "ts" 5
"2089" 127 147 "ts" 11
"2431" 148 472 "ts" 15
"2706" 28 43 "ts" 21
.....................
The first column is the random row number ( got after some sorting that I needed ), the fourth column contains the pattern for which I actually want different notepad files.
What I want is to get individual notepad files named, for example, f1.txt, f2.txt, f3.txt..., each containing all the rows for one value in column 4. For example, I get one file for "t000000", a different one for "t000s", a separate one for "t00t", and so on...
I did this,
list2env(split(sort, sort[,4]),envir=.GlobalEnv)
Here sort is the name of my data set and 4 is that column.
I can then use the write.table command, but since my file is huge, I get hundreds of objects like that, and calling write.table manually for each one is very difficult. Is there any way I can automate it?
Using the excellent data.table package:
library(data.table)
# get your source file
the_file <- fread('~/Desktop/file.txt') #replace with your file path
# vector of unique values of column 4 & the roots of your output filename
fl_names <- unique(the_file$V4)
# dump all the relevant subsets to files
for (f in fl_names) write.table(the_file[V4==f, ], paste0(f, '.txt'), row.names=FALSE)
You've already figured out split, but instead of list2env, which will make more work for you, just use lapply:
# Generally confusing to name a data.frame
# the same as a common function!
X <- split(sort, sort[, 4])
invisible(lapply(names(X), function(y)
write.csv(X[[y]], file = paste0(y, ".csv"))))
Proof of concept:
Dir <- getwd() # Won't be necessary in your actual script
setwd(tempdir()) # I just don't want my working directory filled
list.files(pattern=".csv") # with random csv files, so I'm using tempdir()
# character(0) # Note that there are no csv files presently
X <- split(sort, sort[, 4]) # You've already figured this step out
## invisible is just so you don't have to see an empty list
## printed in your console. The rest is pretty straightforward
invisible(lapply(names(X), function(y)
write.csv(X[[y]], file = paste0(y, ".csv"))))
list.files(pattern=".csv") # Check that the files are there
# [1] "t000000.csv" "t0000000.csv" "t000s.csv" "t000t.csv"
# [5] "t00t.csv" "t0t.csv" "tq.csv" "ts.csv"
setwd(Dir) # Won't be necessary for your actual script
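Both answers name the output files after the pattern itself. If the numbered names from the question (f1.txt, f2.txt, ...) are preferred, the same split can be written out by position. This is a small sketch with a stand-in data frame; dat, the V1 to V5 column names, and the split_demo folder are assumptions for illustration:

```r
# Small stand-in for the data set
dat <- data.frame(
  V1 = c(1943, 1944, 864, 3646, 130),
  V2 = c(359, 359, 347, 349, 65),
  V3 = c(1327, 907, 720, 691, 522),
  V4 = c("t000000", "t000000", "t000s", "t000s", "tq"),
  V5 = c(8, 8, 12, 7, 18)
)

out_dir <- file.path(tempdir(), "split_demo")
dir.create(out_dir, showWarnings = FALSE)

# One data frame per distinct value in column 4
pieces <- split(dat, dat$V4)
out_files <- file.path(out_dir, paste0("f", seq_along(pieces), ".txt"))

# Write each piece to its numbered file
for (i in seq_along(pieces)) {
  write.table(pieces[[i]], out_files[i], row.names = FALSE, col.names = FALSE)
}

# Lookup of which pattern went to which file
index <- data.frame(pattern = names(pieces), file = basename(out_files))
print(index)
```

Keeping the index data frame around makes it possible to tell later which numbered file holds which pattern.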

Resources