Importing multiple .csv files into R with different names

I have one working directory with 37 Locations.csv files and 37 Behavior.csv files.
As you can see below, some files share the same ID number, e.g. 111868-Behavior.csv and 111868-Behavior 2.csv; the same happens with the Locations.csv files.
# here are some of the csv files in the working directory
dir()
[1] "111868-Behavior 2.csv" "111868-Behavior.csv"
[3] "111868-Locations 2.csv" "111868-Locations.csv"
[5] "111869-Behavior.csv" "111869-Locations.csv"
[7] "111870-Behavior 2.csv" "111870-Behavior.csv"
[9] "111870-Locations 2.csv" "111870-Locations.csv"
[11] "112696-Behavior 2.csv" "112696-Behavior.csv"
[13] "112696-Locations 2.csv" "112696-Locations.csv"
I can't change the names of the files.
I want to import all 36 Locations and all 36 Behavior files, but I tried this:
library(plyr)   # for ldply
library(readr)  # for read_csv

# Create a list of all behavior files
bhv <- list.files(pattern = "*-Behavior.csv")
bhv2 <- list.files(pattern = "*-Behavior 2.csv")
# Read them all in
bhv_csv <- ldply(bhv, read_csv)
bhv_csv2 <- ldply(bhv2, read_csv)
# Join bhv_csv and bhv_csv2
b <- rbind(bhv_csv, bhv_csv2)

# Create a list of all location files
loc <- list.files(pattern = "*-Locations.csv")
loc2 <- list.files(pattern = "*-Locations 2.csv")
# Read them all in
loc_csv <- ldply(loc, read_csv)
loc_csv2 <- ldply(loc2, read_csv)
# Join loc_csv and loc_csv2
l <- rbind(loc_csv, loc_csv2)
It shows me only 28, not 36 as I expected:
length(unique(b$Ptt))
[1] 28
length(unique(l$Ptt))
[1] 28
This number, 28, counts all the Behavior.csv and Locations.csv files but leaves out the Behavior 2.csv and Locations 2.csv files (there are 8 files ending in "2" of each type).
I want to import all the Behavior files and all the Location files in a way that gives me all 36 of each. How can I do that?

You can use purrr::map to simplify some of your code:
library("tidyverse")
library("readr")
# Create two small csv files
write_lines("a,b\n1,2\n3,4", "file1.csv")
write_lines("a,c\n5,6\n7,8", "file2.csv")
list.files(pattern = "*.csv") %>%
  # `map` will cycle through the files and read each one
  map(read_csv) %>%
  # and then we can bind them all together
  bind_rows()
#> Parsed with column specification:
#> cols(
#>   a = col_double(),
#>   b = col_double()
#> )
#> Parsed with column specification:
#> cols(
#>   a = col_double(),
#>   c = col_double()
#> )
#> # A tibble: 4 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2    NA
#> 2     3     4    NA
#> 3     5    NA     6
#> 4     7    NA     8
Created on 2019-03-28 by the reprex package (v0.2.1)
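Applied to the question's files, the " 2" suffix can be folded into a single regular expression so one pass picks up both name variants. A sketch along the same lines (assuming all the files sit in the working directory):
library(tidyverse)

# list.files() takes a regular expression, so escape the literal dot and
# make the " 2" suffix optional; one pattern then matches both variants
bhv_files <- list.files(pattern = "-Behavior( 2)?\\.csv$")
b <- map(bhv_files, read_csv) %>% bind_rows()

loc_files <- list.files(pattern = "-Locations( 2)?\\.csv$")
l <- map(loc_files, read_csv) %>% bind_rows()
Note that if a "... 2.csv" file shares its Ptt values with its base file, length(unique(b$Ptt)) will still be lower than the file count even after all 36 files are read.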

Related

Make a loop to scrape a website to create multiple dataframes

I'm working on a project where I can see two ways to potentially solve my problem. I'm scraping a webpage, using a loop to save each page locally as an HTML file. The problem I'm having is that when I try to open the files in my local folder, they are basically blank pages, and I'm not sure why. I've used this same code on other sites for this project with success.
This is the code I'm using:
library(rvest)  # read_html()/write_html() come via rvest (from xml2)

# scrape playoff teams for multiple seasons and save the HTML to a local folder
for (i in 2002:2021) {
  playoff_url <- read_html(paste0("https://www.espn.com/nfl/stats/player/_/season/", i, "/seasontype/3"))
  playoff_stats <- playoff_url %>%
    write_html(paste0("playoff", i, ".HTML"))
}
My second option is to scrape individual seasons into a data frame, but I would like to do it in a loop rather than run the code 20 different times. I also don't want to re-scrape data from the site every time I run the code. It doesn't matter whether all the data ends up in 1 large data frame for all 20 seasons or in 20 separate ones; I can export the data to a local file and import it when I need it.
# read in playoff QB data from ESPN and add a year column
playoff_url <- read_html("https://www.espn.com/nfl/stats/player/_/season/2015/seasontype/3")
play_QB2015 <- playoff_url %>%
  html_nodes("table") %>%
  html_table()
# combine the two tables from the QB playoff data
play_QB2015 <- c(play_QB2015[[1]], play_QB2015[[2]])
# convert the list to a data frame
play_QB2015 <- data.frame(play_QB2015)
play_QB2015$Year <- 2015
Not sure what happens to your files, but first downloading and storing them with httr2 and then parsing the saved files with rvest works fine for me (apologies for the overused tidyverse):
library(fs)
library(dplyr)
library(httr2)
library(rvest)
library(purrr)
library(stringr)
dest_dir <- path_temp("playoffs")
dir_create(dest_dir)
years <- 2002:2012

# collect all years into a list
playoff_lst <- map(
  set_names(years),
  ~ {
    dest_file <- path(dest_dir, str_glue("{.x}.html"))
    # only download if a local copy is not present
    if (!file_exists(dest_file)) {
      request(str_glue("https://www.espn.com/nfl/stats/player/_/season/{.x}/seasontype/3")) %>%
        req_perform(dest_file)
    }
    read_html(dest_file) %>%
      html_elements("table") %>%
      html_table() %>%
      bind_cols()
  }
)
Results:
names(playoff_lst)
#> [1] "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011"
#> [11] "2012"
head(playoff_lst$`2002`)
#> # A tibble: 6 × 17
#> RK Name POS GP CMP ATT `CMP%` YDS AVG `YDS/G` LNG TD
#> <int> <chr> <chr> <int> <int> <int> <dbl> <int> <dbl> <dbl> <int> <int>
#> 1 1 Rich Gan… QB 3 73 115 63.5 841 7.3 280. 50 7
#> 2 2 Brad Joh… QB 3 53 98 54.1 670 6.8 223. 71 5
#> 3 3 Tommy Ma… QB 2 51 89 57.3 633 7.1 316. 40 5
#> 4 4 Steve Mc… QB 2 48 80 60 532 6.7 266 39 3
#> 5 5 Jeff Gar… QB 2 49 85 57.6 524 6.2 262 76 3
#> 6 6 Donovan … QB 2 46 79 58.2 490 6.2 245 42 1
#> # … with 5 more variables: INT <int>, SACK <int>, SYL <int>, QBR <lgl>,
#> # RTG <dbl>
dir_tree(dest_dir)
#> ... /RtmpcjLFJe/playoffs
#> ├── 2002.html
#> ├── 2003.html
#> ├── 2004.html
#> ├── 2005.html
#> ├── 2006.html
#> ├── 2007.html
#> ├── 2008.html
#> ├── 2009.html
#> ├── 2010.html
#> ├── 2011.html
#> └── 2012.html
Created on 2023-02-16 with reprex v2.0.2
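If you would rather end up with one large data frame than a list, the named list above can be collapsed with dplyr::bind_rows() and cached locally so the site isn't scraped on the next run. A sketch (it assumes the column types line up across seasons; if they don't, harmonize them before binding):
# collapse the per-season list into one data frame, keeping the season
# in a `year` column taken from the list names
playoff_df <- bind_rows(playoff_lst, .id = "year")

# cache the combined result so later runs can read it instead of scraping
write.csv(playoff_df, path(dest_dir, "playoffs.csv"), row.names = FALSE)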

How can I write a regex to order file paths in numeric order

I have hundreds of .wav files and imported them using list.files. The result looks something like this:
[1] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Balsam-poplar-English-0701.wav"
[2] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Birch-English-0700.wav"
[3] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Blueberry-English-0703.wav"
.......
[73] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Capercaillie-English-0069.wav"
[74] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fat-tail-scorpion-English-0082.wav"
[75] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fire-salamander-English-0067.wav"
I want to reorder the file paths so that the number in each subpath follows numeric order. I have tried the following:
filename<- file_list[order(as.numeric(stringr::str_extract(file_list,"[0-9]+(.*?)")) )]
The result is something like:
[1] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Capercaillie-English-0069.wav"
[2] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fat-tail-scorpion-English-0082.wav"
[3] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fire-salamander-English-0067.wav"
.......
[73] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Balsam-poplar-English-0701.wav"
[74] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Birch-English-0700.wav"
[75] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Blueberry-English-0703.wav"
I also want the last subpath to follow numeric order, e.g. English-0067 before English-0069. I tried repeating the matching for the last subpath, but that disorders the previous 3...10 ordering. How can I make all the numbers in the subpaths follow numeric order?
Another option:
# sort on two numeric keys: the leading directory number and the trailing file number
ord <- order(as.numeric(sub("(^\\d+)/.*$", "\\1", files)),
             as.numeric(sub("^.*-(\\d+)\\.wav", "\\1", files)))
files[ord]
#> [1] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fire-salamander-English-0067.wav"
#> [2] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Capercaillie-English-0069.wav"
#> [3] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fat-tail-scorpion-English-0082.wav"
#> [4] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Birch-English-0700.wav"
#> [5] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Balsam-poplar-English-0701.wav"
#> [6] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Blueberry-English-0703.wav"
Here's one approach:
vec <- c( "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Balsam-poplar-English-0701.wav",
"10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Birch-English-0700.wav",
"10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Blueberry-English-0703.wav",
"3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Capercaillie-English-0069.wav",
"3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fat-tail-scorpion-English-0082.wav",
"3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fire-salamander-English-0067.wav")
nums <- strcapture("^([0-9]+).*\\b([0-9]+)\\.[a-z]+$", vec, proto=list(a=0L,b=0L))
nums
# a b
# 1 10 701
# 2 10 700
# 3 10 703
# 4 3 69
# 5 3 82
# 6 3 67
do.call(order, nums)
# [1] 6 4 5 2 1 3
vec[do.call(order, nums)]
# [1] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fire-salamander-English-0067.wav"
# [2] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Capercaillie-English-0069.wav"
# [3] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fat-tail-scorpion-English-0082.wav"
# [4] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Birch-English-0700.wav"
# [5] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Balsam-poplar-English-0701.wav"
# [6] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Blueberry-English-0703.wav"
If you needed to also include the BL-0001 number in your ordering, all it would take is a small addition to the regex and an additional entry in proto=. The do.call(order, nums) call handles any number of columns.
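For instance, a sketch of that extension (the third capture group and proto entry are the only additions; shown here against the sample vec above):
# capture the directory number, the BL number, and the trailing file number
nums3 <- strcapture("^([0-9]+)/.*BL-([0-9]+).*\\b([0-9]+)\\.[a-z]+$", vec,
                    proto = list(a = 0L, b = 0L, c = 0L))
vec[do.call(order, nums3)]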
Note that if you over-tune your regex, rows that don't match both groups here will return NA for both; this means it'll sort the NA rows last. If you find that one or more filenames are misordered, check the regex and the intermediate nums entries for those filenames.
A tidyverse solution: structure the data as a table and use stringr::str_extract() to build numeric sort keys, arrange the rows, then pull the filenames back out.
vec <- c( "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Balsam-poplar-English-0701.wav",
"10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Birch-English-0700.wav",
"10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Blueberry-English-0703.wav",
"3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Capercaillie-English-0069.wav",
"3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fat-tail-scorpion-English-0082.wav",
"3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fire-salamander-English-0067.wav")
library(dplyr)
library(stringr)
vec_tib <- tibble(filename = vec)
vec_tib <- mutate(vec_tib,
                  num_1 = str_extract(filename, "\\d+"),
                  num_2 = str_extract(filename, "\\d+(?=(\\.wav))"))
head(vec_tib, 3)
#> # A tibble: 3 × 3
#> filename num_1 num_2
#> <chr> <chr> <chr>
#> 1 10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Balsa… 10 0701
#> 2 10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Birch… 10 0700
#> 3 10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Blueb… 10 0703
vec_tib <- mutate(vec_tib, across(starts_with("num"), as.numeric))
vec_tib |>
  arrange(num_1, num_2) |>
  pull(filename)
#> [1] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fire-salamander-English-0067.wav"
#> [2] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Capercaillie-English-0069.wav"
#> [3] "3/Project_English-3/BL-0002_Lesser-horseshoe-bat/Fat-tail-scorpion-English-0082.wav"
#> [4] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Birch-English-0700.wav"
#> [5] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Balsam-poplar-English-0701.wav"
#> [6] "10/Project_English-10/BL-0001_A-conifer-cone-contains-seeds/Blueberry-English-0703.wav"
Created on 2022-11-28 with reprex v2.0.2

readr::read_tsv() parsing failures due to trailing tabs

Issue/Question
I have a tab-delimited my-data.txt file with 51 columns. The col_names row has no trailing tabs, and readr::read_tsv() correctly detects 51 columns. However, the data rows all contain trailing tabs, which read_tsv() incorrectly interprets as a 52nd column. The code runs, but I get a warning that I would like to get rid of. Are there any read_tsv() arguments that can help handle this? Should I use a different readr function instead?
my-data.txt
PT AU BA CA GP RI OI BE Z2 TI X1 Y1 Z1 FT PN AE Z3 SO S1 SE BS VL IS SI MA BP EP AR DI D2 SU PD PY AB X4 Y4 Z4 AK CT CY SP CL TC Z8 ZB ZS Z9 SN BN UT PM
J Jacquelin, Sebastien; Straube, Jasmin; Cooper, Leanne; Vu, Therese; Song, Axia; Bywater, Megan; Baxter, Eva; Heidecker, Matthew; Wackrow, Brad; Porter, Amy; Ling, Victoria; Green, Joanne; Austin, Rebecca; Kazakoff, Stephen; Waddell, Nicola; Hesson, Luke B.; Pimanda, John E.; Stegelmann, Frank; Bullinger, Lars; Doehner, Konstanze; Rampal, Raajit K.; Heckl, Dirk; Hill, Geoffrey R.; Lane, Steven W. Jak2V617F and Dnmt3a loss cooperate to induce myelofibrosis through activated enhancer-driven inflammation BLOOD 132 26 2707 2721 10.1182/blood-2018-04-846220 DEC 27 2018 2018 10 WOS:000454429300003
J Renne, Julius; Gutberlet, Marcel; Voskrebenzev, Andreas; Kern, Agilo; Kaireit, Till; Hinrichs, Jan; Zardo, Patrick; Warnecke, Gregor; Krueger, Marcus; Braubach, Peter; Jonigk, Danny; Haverich, Axel; Wacker, Frank; Vogel-Claussen, Jens; Zinne, Norman Multiparametric MRI for organ quality assessment in a porcine Ex-Vivo lung perfusion system PLOS ONE 13 12 e0209103 10.1371/journal.pone.0209103 DEC 27 2018 2018 1 WOS:000454418200015
J Lau, Skadi; Eicke, Dorothee; Oliveira, Marco Carvalho; Wiegmann, Bettina; Schrimpf, Claudia; Haverich, Axel; Blasczyk, Rainer; Wilhelmi, Mathias; Figueiredo, Constanca; Boeer, Ulrike Low Immunogenic Endothelial Cells Maintain Morphological and Functional Properties Required for Vascular Tissue Engineering TISSUE ENGINEERING PART A 24 5-6 432 447 10.1089/ten.tea.2016.0541 MAR 2018 2018 4 WOS:000418327100001
Reprex
Note that I did some manual editing of the reprex because I needed to read in the .txt file to reproduce the issue, and this causes errors in reprex without my computer-specific path. See RStudio Community Topic 8773.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(readr)
my_data <- read_tsv("my-data.txt", quote = "")
#> Parsed with column specification:
#> cols(
#>   .default = col_logical(),
#>   PT = col_character(),
#>   AU = col_character(),
#>   TI = col_character(),
#>   SO = col_character(),
#>   VL = col_double(),
#>   IS = col_character(),
#>   BP = col_double(),
#>   EP = col_double(),
#>   AR = col_character(),
#>   DI = col_character(),
#>   PD = col_character(),
#>   PY = col_double(),
#>   TC = col_double(),
#>   UT = col_character()
#> )
#> See spec(...) for full column specifications.
#> Warning: 3 parsing failures.
#> row col   expected     actual          file
#>   1  -- 51 columns 52 columns 'my-data.txt'
#>   2  -- 51 columns 52 columns 'my-data.txt'
#>   3  -- 51 columns 52 columns 'my-data.txt'
problems(my_data)
#> # A tibble: 3 x 5
#>     row col   expected   actual     file
#>   <int> <chr> <chr>      <chr>      <chr>
#> 1     1 <NA>  51 columns 52 columns 'my-data.txt'
#> 2     2 <NA>  51 columns 52 columns 'my-data.txt'
#> 3     3 <NA>  51 columns 52 columns 'my-data.txt'
Created on 2020-04-01 by the reprex package (v0.3.0)
Thank you for taking the time to help me.
My favorite .tsv file reader is fread from data.table. It often works right out of the box. It might be worth a try.
library(data.table)
my_data <- fread("my-data.txt")
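If you would rather stay in readr and silence the warning at its source, another option is to strip the trailing tab from each line before parsing. A sketch, assuming readr >= 2.0 (where I() marks a string as literal data):
library(readr)
library(stringr)

# drop the trailing tab on each data line, then parse the cleaned text
lines <- read_lines("my-data.txt")
lines <- str_remove(lines, "\t$")
my_data <- read_tsv(I(paste(lines, collapse = "\n")), quote = "")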

R function to fix automatically formatted data

I am currently analyzing a baseball data set that includes the count data; however, some of the counts have been automatically formatted as dates.
I have already tried using as.numeric, but it does not help. I have provided a sample of the data below:
Count (Factor): 0-0 0-1 0-2 1-Feb 1-Jan 1-Mar 2-Feb 2-Jan 2-Mar Feb-00 Jan-00 Mar-00
I would like to remove the date format. For instance, I want to see 1-Feb as 1-2, 1-Jan as 1-1, 1-Mar as 1-3, Feb-00 as 2-0.
Does anyone have any suggestions on how to do so?
You can replace the abbreviated months with their calendar positions by referencing month.abb. Below I have written a general function using base R.
## function to apply
month_num <- function(x){
  if (!grepl('\\w{3}', x))
    return(x)
  gsub('/?\\w{3}',
       as.character(match(regmatches(x, regexpr('(\\w{3})', x)), month.abb)),
       x)
}
## vector
strings <- c( '0-0', '0-1' ,'0-2', '1-Feb', '1-Jan', '1-Mar', '2-Feb', '2-Jan', '2-Mar', 'Feb-00', '/Jan-00', 'Mar-00')
sapply(strings, month_num, USE.NAMES = FALSE)
#> [1] "0-0" "0-1" "0-2" "1-2" "1-1" "1-3" "2-2" "2-1" "2-3" "2-00"
#> [11] "1-00" "3-00"
## data.frame or matrix
tmp <- data.frame(
strings = c( '0-0', '0-1' ,'0-2', '1-Feb', '1-Jan', '1-Mar', '2-Feb', '2-Jan', '2-Mar', 'Feb-00', '/Jan-00', 'Mar-00')
)
tmp$strings <- apply(tmp, 1, month_num)
tmp
#> strings
#> 1 0-0
#> 2 0-1
#> 3 0-2
#> 4 1-2
#> 5 1-1
#> 6 1-3
#> 7 2-2
#> 8 2-1
#> 9 2-3
#> 10 2-00
#> 11 1-00
#> 12 3-00
## list
strings <- list( '0-0', '0-1' ,'0-2', '1-Feb', '1-Jan', '1-Mar', '2-Feb', '2-Jan', '2-Mar', 'Feb-00', '/Jan-00', 'Mar-00')
strings <- lapply(strings, month_num)
tail(strings)
#> [[1]]
#> [1] "2-2"
#>
#> [[2]]
#> [1] "2-1"
#>
#> [[3]]
#> [1] "2-3"
#>
#> [[4]]
#> [1] "2-00"
#>
#> [[5]]
#> [1] "1-00"
#>
#> [[6]]
#> [1] "3-00"
Created on 2019-02-12 by the reprex package (v0.2.1)
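An alternative sketch that avoids regex: split each count on "-" and swap month abbreviations for their positions with match(). It assumes the tokens are bare abbreviations, so the stray '/Jan-00' entry would need its slash stripped first:
month_num2 <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)
  sapply(parts, function(p) {
    hit <- match(p, month.abb)           # NA where the token is not a month
    p[!is.na(hit)] <- hit[!is.na(hit)]   # replace month tokens with their positions
    paste(p, collapse = "-")
  })
}
month_num2(c("0-0", "1-Feb", "Feb-00"))
#> [1] "0-0"  "1-2"  "2-00"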

Skipping rows until row with a certain value

I need to read a .txt file from a URL, but I would like to skip the rows until a row with a certain value. The URL is https://fred.stlouisfed.org/data/HNOMFAQ027S.txt and the data takes the following form:
"
... (number of rows)
... (number of rows)
... (number of rows)
DATE VALUE
1945-01-01 144855
1946-01-01 138515
1947-01-01 136405
1948-01-01 135486
1949-01-01 142455
"
I would like to skip all rows before the "DATE VALUE" row and start importing the data from that line onwards (including "DATE VALUE"). Is there a way to do this with data.table's fread() - or any other way, such as with dplyr?
Thank you very much in advance for your effort and your time!
Best,
c.
Here's a way to extract that info from those text files using readr::read_lines, dplyr, and string handling from stringr.
library(tidyverse)
library(stringr)
df <- data_frame(lines = read_lines("https://fred.stlouisfed.org/data/HNOMFAQ027S.txt")) %>%
  filter(str_detect(lines, "^\\d{4}-\\d{2}-\\d{2}")) %>%
  mutate(date = str_extract(lines, "^\\d{4}-\\d{2}-\\d{2}"),
         value = as.numeric(str_extract(lines, "[\\d-]+$"))) %>%
  select(-lines)
df
#> # A tibble: 286 x 2
#> date value
#> <chr> <dbl>
#> 1 1945-10-01 1245
#> 2 1946-01-01 NA
#> 3 1946-04-01 NA
#> 4 1946-07-01 NA
#> 5 1946-10-01 1298
#> 6 1947-01-01 NA
#> 7 1947-04-01 NA
#> 8 1947-07-01 NA
#> 9 1947-10-01 1413
#> 10 1948-01-01 NA
#> # ... with 276 more rows
I filtered for all the lines you want to keep using stringr::str_detect, then extracted the info you want from each line using stringr::str_extract and regexes.
Combining fread with unix tools:
> fread("curl -s https://fred.stlouisfed.org/data/HNOMFAQ027S.txt | sed -n -e '/^DATE.*VALUE/,$p'")
DATE VALUE
1: 1945-10-01 1245
2: 1946-01-01 .
3: 1946-04-01 .
4: 1946-07-01 .
5: 1946-10-01 1298
---
282: 2016-01-01 6566888
283: 2016-04-01 6741075
284: 2016-07-01 7022321
285: 2016-10-01 6998898
286: 2017-01-01 7448792
Using:
library(data.table)

file.names <- c('https://fred.stlouisfed.org/data/HNOMFAQ027S.txt',
                'https://fred.stlouisfed.org/data/DGS10.txt',
                'https://fred.stlouisfed.org/data/A191RL1Q225SBEA.txt')
text.list <- lapply(file.names, readLines)
skip.rows <- sapply(text.list, grep, pattern = '^DATE\\s+VALUE') - 1
# option 1
l <- Map(function(x, y) read.table(text = x, skip = y), x = text.list, y = skip.rows)
# option 2
l <- lapply(seq_along(text.list), function(i) fread(file.names[i], skip = skip.rows[i]))
will get you a list of data.frames (option 1) or data.tables (option 2).
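To keep track of which series is which, you could also name the list elements after the source files; a small sketch (the naming scheme is just one choice):
# name each element by the file stem, e.g. "HNOMFAQ027S"
names(l) <- sub("\\.txt$", "", basename(file.names))
str(l, max.level = 1)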
