Define multiple columns when reading a txt file into R [duplicate] - r

This question already has answers here:
Reading text file with multiple space as delimiter in R
(3 answers)
Closed 1 year ago.
I am trying to read wave height data into R using this website
https://www.ndbc.noaa.gov/download_data.php?filename=51201h2017.txt.gz&dir=data/historical/stdmet/
My code is:
buoy <- 51211
year <- 2017
one_yr <- paste0("https://www.ndbc.noaa.gov/view_text_file.php?filename=",
buoy, "h", year, ".txt.gz&dir=data/historical/stdmet/")
oneBuoy_oneYR.df <- read.csv(one_yr, fill = TRUE)
The result is a data frame with one column and 8985 observations. I have tried sep = " ", but some columns are separated by two spaces instead of one. I have also tried read.delim.
I'm sure there is an easy solution; I just haven't found it.

Use fread from data.table. fread will detect the separator and the column classes automatically for you.
library(data.table)
#> Warning: package 'data.table' was built under R version 4.0.4
buoy <- 51211
year <- 2017
one_yr <- paste0("https://www.ndbc.noaa.gov/view_text_file.php?filename=",
buoy, "h", year, ".txt.gz&dir=data/historical/stdmet/")
oneBuoy_oneYR.df <- fread(one_yr, fill = TRUE)
head(oneBuoy_oneYR.df)
#> #YY MM DD hh mm WDIR WSPD GST WVHT DPD APD MWD PRES ATMP WTMP DEWP
#> 1: #yr mo dy hr mn degT m/s m/s m sec sec degT hPa degC degC degC
#> 2: 2017 06 07 14 30 999 99.0 99.0 0.58 15.38 5.66 161 9999.0 999.0 26.4 999.0
#> 3: 2017 06 07 15 00 999 99.0 99.0 0.58 15.38 5.61 156 9999.0 999.0 26.4 999.0
#> 4: 2017 06 07 15 30 999 99.0 99.0 0.55 12.50 5.37 161 9999.0 999.0 26.3 999.0
#> 5: 2017 06 07 16 30 999 99.0 99.0 0.64 12.50 4.97 158 9999.0 999.0 26.3 999.0
#> 6: 2017 06 07 17 00 999 99.0 99.0 0.64 15.38 4.95 158 9999.0 999.0 26.3 999.0
#> VIS TIDE
#> 1: mi ft
#> 2: 99.0 99.00
#> 3: 99.0 99.00
#> 4: 99.0 99.00
#> 5: 99.0 99.00
#> 6: 99.0 99.00
Created on 2021-05-31 by the reprex package (v0.3.0)
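If you would rather stay in base R, a sketch along the following lines may also work. It assumes the file keeps the layout shown in the output above (two header lines starting with #, then whitespace-separated values). The inline sample lines are a shortened stand-in for the downloaded file, not the real data set. With sep = "" (the read.table default), any run of spaces counts as a single delimiter, which is what sep = " " cannot handle.

```r
# Shortened stand-in for the NDBC text file: a names line, a units line,
# then data rows. Note the double space in the last row.
txt <- c(
  "#YY  MM DD hh mm WDIR WSPD",
  "#yr  mo dy hr mn degT m/s",
  "2017 06 07 14 30 999 99.0",
  "2017 06 07 15 00 999  99.0"
)

# Column names come from the first line, minus the leading '#'.
col_names <- strsplit(sub("^#", "", txt[1]), "\\s+")[[1]]

# skip = 2 drops both header lines; sep = "" splits on any run of whitespace.
buoy_df <- read.table(text = txt, skip = 2, sep = "", col.names = col_names)

str(buoy_df)
```

For the real file, the same call could presumably take the URL or a downloaded copy through file = instead of text =.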


Apply a function across multiple columns and write to a csv in R

R beginner here, my question is: How do I change this function so that it can be used across all my time periods without copying and pasting the function over and over? The time periods are indicated in the function by the from = X$pre.start1[i] and to = X$pre.start2[i] arguments. I would like to have all the results end up in a single .csv file as well. Is that possible?
I know this function works and I have used it in the past by copying it and changing the time periods but with multiple spreadsheets with data like this it is tedious to apply this way. So I am looking to modify it so that I am not copying and pasting hundreds of times.
The function:
ADIanalyzeFUN <- function(X) {
  adianalyzeFUN <- function(X, i){
    r <- read_wave(X$sound.files[i], from = X$pre.start1[i], to = X$pre.start2[i])
    soundfile.adi <- acoustic_diversity(r)
    return(soundfile.adi$adi_left)
    return(soundfile.adi$adi_right)
  }
  output <- vector("logical", ncol(X))
  for (i in seq_along(X$sound.files)) {
    output[[i]] <- adianalyzeFUN(X, i)
  }
  X$adi.values.pre1to2 <- output
  write.csv(X, "/media/parks/Seagate Portable Drive 2 (2tb)/Parks/2021 Threat Experiment/ADI index values/ADI01.csv", row.names = TRUE)
}
Below is a sample of the data.
Each column is a list of times in seconds, and I am applying the function to the wave file between one time and the next, e.g. between pre.start1 and pre.start2.
pre.start1 pre.start2 pre.start3 pre.start4 pre.start5 pre.start6 pre.start7 pre.start8 pre.start9 pre.start10 pre.end duringpb.start1
1 2304 2364 2424 2484 2544 2604 2664 2724 2784 2844 2904 2964
2 1386 1446 1506 1566 1626 1686 1746 1806 1866 1926 1986 2046
3 1680 1740 1800 1860 1920 1980 2040 2100 2160 2220 2280 2340
4 1553 1613 1673 1733 1793 1853 1913 1973 2033 2093 2153 2213
5 1661 1721 1781 1841 1901 1961 2021 2081 2141 2201 2261 2321
6 1728 1788 1848 1908 1968 2028 2088 2148 2208 2268 2328 2388
duringpb.end1 duringpb.start2 duringpb.end2 duringpb.start3 duringpb.end3 duringpb.start4 duringpb.end4 duringpb.start5 duringpb.end5
1 3024 3084 3144 3204 3264 3324 3384 3444 3504
2 2106 2166 2226 2286 2346 2406 2466 2526 2586
3 2400 2460 2520 2580 2640 2700 2760 2820 2880
4 2273 2333 2393 2453 2513 2573 2633 2693 2753
5 2381 2441 2501 2561 2621 2681 2741 2801 2861
6 2448 2508 2568 2628 2688 2748 2808 2868 2928
Thanks for any help!
I would like the output to be something like:
X pre.start1-pre.start2 pre.start2-pre.start3 pre.start3-pre.start4
1 0.86 0.56 0.89
2 0.27 0.09 0.03
3 0.18 0.10 0.55
4 0.39 0.52 0.74
5 0.14 0.17 0.97
6 0.91 0.64 0.71
You could use the purrr package (part of the tidyverse) and its map variant map2_df().
Your example isn't easily reproducible, so here's an example that takes the first two columns of the iris dataset, computes their sum for each row, and collects the results in one data frame (a tibble in this case).
library(tidyverse)
iris %>% tibble
#> # A tibble: 150 × 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
map2_df(
.x = iris$Sepal.Length,
.y = iris$Sepal.Width,
.f = ~ tibble("sum" = sum(c(.x, .y)))
)
#> # A tibble: 150 × 1
#> sum
#> <dbl>
#> 1 8.6
#> 2 7.9
#> 3 7.9
#> 4 7.7
#> 5 8.6
#> 6 9.3
#> 7 8
#> 8 8.4
#> 9 7.3
#> 10 8
#> # … with 140 more rows
Created on 2021-09-04 by the reprex package (v2.0.1)
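Applied to the original question, the same idea (one function, many column pairs) might look like the base R sketch below. score_segment() is a hypothetical stand-in for the read_wave() and acoustic_diversity() step, so the example runs without sound files. The two-row X is made up, and the output file name is an assumption.

```r
# Hypothetical stand-in for the read_wave() + acoustic_diversity() step.
# It only returns the segment length so the sketch runs without sound files.
score_segment <- function(file, from, to) {
  to - from
}

# Made-up data in the same shape as the question's X.
X <- data.frame(
  sound.files = c("a.wav", "b.wav"),
  pre.start1 = c(2304, 1386),
  pre.start2 = c(2364, 1446),
  pre.start3 = c(2424, 1506),
  stringsAsFactors = FALSE
)

# Pair each time column with the one that follows it.
time_cols <- setdiff(names(X), "sound.files")
starts <- head(time_cols, -1)
ends <- tail(time_cols, -1)

# One result column per (start, end) pair, one value per row of X.
results <- Map(function(from_col, to_col) {
  mapply(score_segment, X$sound.files, X[[from_col]], X[[to_col]],
         USE.NAMES = FALSE)
}, starts, ends)
names(results) <- paste(starts, ends, sep = "_to_")

out <- cbind(X["sound.files"], as.data.frame(results))

# Everything ends up in a single CSV file (the path is an assumption).
out_file <- file.path(tempdir(), "adi_all_periods.csv")
write.csv(out, out_file, row.names = FALSE)
```

Replacing the body of score_segment() with the real wave-reading code should leave the rest unchanged, as long as it returns a single number per call.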

reshape untidy data frame, spreading rows to columns names [duplicate]

This question already has answers here:
Transpose a data frame
(6 answers)
Closed 2 years ago.
I have searched the threads but can't find a solution that I understand and that solves the problem with the data frame I have.
My current data frame (df):
# A tibble: 8 x 29
`Athlete` Monday...2 Tuesday...3 Wednesday...4 Thursday...5 Friday...6 Saturday...7 Sunday...8
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Date 29/06/2020 30/06/2020 43837.0 43868.0 43897.0 43928.0 43958.0
2 HR 47.0 54.0 51.0 56.0 59.0 NA NA
3 HRV 171.0 91.0 127.0 99.0 77.0 NA NA
4 Sleep Duration 9.11 7.12 8.59 7.15 8.32 NA NA
5 Sleep Efficien~ 92.0 94.0 89.0 90.0 90.0 NA NA
6 Recovery Score 98.0 66.0 96.0 72.0 46.0 NA NA
7 Life Stress NO NO NO NO NO NA NA
8 Sick NO NO NO NO NO NA NA
I have tried to use spread and pivot_wider, but I know they would require additional functions to get the desired output, which is beyond my level of understanding in R.
Do I need to u
Desired output:
Date HR HRV Sleep Duration Sleep Efficiency Recovery Score Life Stress Sick
29/06/2020 47.0 171.0 9.11
30/06/2020 54.0 91.0 7.12
43837.0 51.0 127.0 8.59
43868.0 56.0 99.0 7.15
43897.0 59.0 77.0 8.32
43928.0 NA NA NA
43958.0 NA NA NA
etc.
Thank you
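In base R you can do the following (df[[1]] extracts the first column as a plain vector, which also works when df is a tibble):
type.convert(setNames(data.frame(t(df[-1]), row.names = NULL), df[[1]]), as.is = TRUE)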
Date HR HRV Sleep Duration Sleep Efficien~ Recovery Score Life Stress Sick
1 29/06/2020 47 171 9.11 92 98 NO NO
2 30/06/2020 54 91 7.12 94 66 NO NO
3 43837.0 51 127 8.59 89 96 NO NO
4 43868.0 56 99 7.15 90 72 NO NO
5 43897.0 59 77 8.32 90 46 NO NO
6 43928 NA NA NA NA NA <NA> <NA>
7 43958 NA NA NA NA NA <NA> <NA>
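The transpose approach is easier to follow when split into steps. The sketch below uses a made-up three-row data frame in the same shape as the question's df (first column holds the variable names, the remaining columns hold one day each). The column names Monday and Tuesday are simplified assumptions.

```r
# Made-up miniature of the question's data: variables in rows, days in columns.
df <- data.frame(
  Athlete = c("Date", "HR", "Sick"),
  Monday  = c("29/06/2020", "47.0", "NO"),
  Tuesday = c("30/06/2020", "54.0", "NO"),
  stringsAsFactors = FALSE
)

# 1. Drop the label column and transpose, so each day becomes a row.
wide <- as.data.frame(t(df[-1]), stringsAsFactors = FALSE)

# 2. Use the old first column as the new column names.
names(wide) <- df[[1]]
rownames(wide) <- NULL

# 3. Everything is character after t(); let R guess better types.
wide <- type.convert(wide, as.is = TRUE)

str(wide)
```

Columns that mix dates and Excel serial numbers, like Date in the question, stay character and would still need separate cleaning.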

How to scrape tables inside a comment tag in html with R?

I am trying to scrape from http://www.basketball-reference.com/teams/CHI/2015.html using rvest. I used selectorgadget and found the tag to be #advanced for the table I want. However, I noticed it wasn't picking it up. Looking at the page source, I noticed that the tables are inside an html comment tag <!--
What is the best way to get the tables from inside the comment tags? Thanks!
Edit: I am trying to pull the 'Advanced' table: http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none
You can use the XPath comment() function to select comment nodes, then reparse their contents as HTML:
library(rvest)
# scrape page
h <- read_html('http://www.basketball-reference.com/teams/CHI/2015.html')
df <- h %>% html_nodes(xpath = '//comment()') %>%    # select comment nodes
    html_text() %>%    # extract comment text
    paste(collapse = '') %>%    # collapse to a single string
    read_html() %>%    # reparse to HTML
    html_node('table#advanced') %>%    # select the desired table
    html_table() %>%    # parse table
    .[colSums(is.na(.)) < nrow(.)]    # get rid of spacer columns
df[, 1:15]
## Rk Player Age G MP PER TS% 3PAr FTr ORB% DRB% TRB% AST% STL% BLK%
## 1 1 Pau Gasol 34 78 2681 22.7 0.550 0.023 0.317 9.2 27.6 18.6 14.4 0.5 4.0
## 2 2 Jimmy Butler 25 65 2513 21.3 0.583 0.212 0.508 5.1 11.2 8.2 14.4 2.3 1.0
## 3 3 Joakim Noah 29 67 2049 15.3 0.482 0.005 0.407 11.9 22.1 17.1 23.0 1.2 2.6
## 4 4 Aaron Brooks 30 82 1885 14.4 0.534 0.383 0.213 1.9 7.5 4.8 24.2 1.5 0.6
## 5 5 Mike Dunleavy 34 63 1838 11.6 0.573 0.547 0.181 1.7 12.7 7.3 9.7 1.1 0.8
## 6 6 Taj Gibson 29 62 1692 16.1 0.545 0.000 0.364 10.7 14.6 12.7 6.9 1.1 3.2
## 7 7 Nikola Mirotic 23 82 1654 17.9 0.556 0.502 0.455 4.3 21.8 13.3 9.7 1.7 2.4
## 8 8 Kirk Hinrich 34 66 1610 6.8 0.468 0.441 0.131 1.4 6.6 4.1 13.8 1.5 0.6
## 9 9 Derrick Rose 26 51 1530 15.9 0.493 0.325 0.224 2.6 8.7 5.7 30.7 1.2 0.8
## 10 10 Tony Snell 23 72 1412 10.2 0.550 0.531 0.148 2.5 10.9 6.8 6.8 1.2 0.6
## 11 11 E'Twaun Moore 25 56 504 10.3 0.504 0.273 0.144 2.7 7.1 5.0 10.4 2.1 0.9
## 12 12 Doug McDermott 23 36 321 6.1 0.480 0.383 0.140 2.1 12.2 7.3 3.0 0.6 0.2
## 13 13 Nazr Mohammed 37 23 128 8.7 0.431 0.000 0.100 9.6 22.3 16.1 3.6 1.6 2.8
## 14 14 Cameron Bairstow 24 18 64 2.1 0.309 0.000 0.357 10.5 3.3 6.8 2.2 1.6 1.1
Ok..got it.
library(stringi)
library(knitr)
library(rvest)
any_version_html <- function(x){
XML::htmlParse(x)
}
a <- 'http://www.basketball-reference.com/teams/CHI/2015.html#advanced::none'
b <- readLines(a)
c <- paste0(b, collapse = "")
d <- as.character(unlist(stri_extract_all_regex(c, '<table(.*?)/table>', omit_no_match = T, simplify = T)))
e <- html_table(any_version_html(d))
> kable(summary(e),'rst')
====== ========== ====
Length Class Mode
====== ========== ====
9 data.frame list
2 data.frame list
24 data.frame list
21 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
28 data.frame list
28 data.frame list
27 data.frame list
30 data.frame list
27 data.frame list
27 data.frame list
3 data.frame list
====== ========== ====
kable(e[[1]],'rst')
=== ================ === ==== === ================== === === =================================
No. Player Pos Ht Wt Birth Date  Exp College
=== ================ === ==== === ================== === === =================================
41 Cameron Bairstow PF 6-9 250 December 7, 1990 au R University of New Mexico
0 Aaron Brooks PG 6-0 161 January 14, 1985 us 6 University of Oregon
21 Jimmy Butler SG 6-7 220 September 14, 1989 us 3 Marquette University
34 Mike Dunleavy SF 6-9 230 September 15, 1980 us 12 Duke University
16 Pau Gasol PF 7-0 250 July 6, 1980 es 13
22 Taj Gibson PF 6-9 225 June 24, 1985 us 5 University of Southern California
12 Kirk Hinrich SG 6-4 190 January 2, 1981 us 11 University of Kansas
3 Doug McDermott SF 6-8 225 January 3, 1992 us R Creighton University
## Realized we should index with some names...but this is somewhat cheating as we know the start and end indexes for table titles..I prefer to parse-in-the-dark.
# Names are in h2-tags
e_names <- as.character(unlist(stri_extract_all_regex(c, '<h2(.*?)/h2>', simplify = T)))
e_names <- gsub("<(.*?)>","",e_names[grep('Roster',e_names):grep('Salaries',e_names)])
names(e) <- e_names
kable(head(e$Salaries), 'rst')
=== ============== ===========
Rk Player Salary
=== ============== ===========
1 Derrick Rose $18,862,875
2 Carlos Boozer $13,550,000
3 Joakim Noah $12,200,000
4 Taj Gibson $8,000,000
5 Pau Gasol $7,128,000
6 Nikola Mirotic $5,305,000
=== ============== ===========

Loading a csv file as a ts

Below are monthly prices of a particular stock;
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2008 46.09 50.01 48 48 50.15 43.45 41.05 41.67 36.66 25.02 22.98 22
2009 20.98 15 13.04 14.4 26.46 14.32 14.6 11.83 14 14.4 13.07 13.6
2010 15.31 15.71 18.97 15.43 13.5 13.8 14.21 12.73 12.35 13.17 14.59 15.01
2011 15.3 15.22 15.23 15 15.1 14.66 14.8 12.02 12.41 12.9 11.6 12.18
2012 12.45 13.33 12.4 14.16 13.99 13.75 14.4 15.38 16.3 18.02 17.29 19.49
2013 20.5 20.75 21.3 20.15 22.2 19.8 19.75 19.71 19.99 21.54 21.3 27.4
2014 23.3 20.5 20 22.7 25.4 25.05 25.08 24.6 24.5 21.2 20.52 18.41
2015 16.01 17.6 20.98 21.15 21.44 0 0 0 0 0 0 0
I want to decompose the data into seasonal and trend data but I am not getting a result.
How can I load the data as a "ts" class data so I can decompose it?
Here is a solution using tidyr, which is fairly accessible.
library(dplyr); library(tidyr)
data %>% gather(month, price, -Year) %>% # 1 row per year-month pair, name the value "price"
mutate(synth_date_txt= paste(month,"1,",Year), # combine month and year into a date string
date=as.Date(synth_date_txt,format="%b %d, %Y")) %>% # convert date string to date
select(date, price) # keep just the date and price
# date price
# 1 2008-01-01 46.09
# 2 2009-01-01 20.98
# 3 2010-01-01 15.31
# 4 2011-01-01 15.30
# 5 2012-01-01 12.45
This gives you an answer with date format (even though you didn't specify a date, just a month and year). It should work for your time series analysis, but if you really need a timestamp you can just use as.POSIXct(date)
Mike,
The program is R and below is the code I have tried.
sev=read.csv("X7UPM.csv")
se=ts(sev,start=c(2008, 1), end=c(2015,1), frequency=12)
se
se=se[,1]
S=decompose(se)
plot(se,col=c("blue"))
plot(decompose(se))
S.decom=decompose(se,type="mult")
plot(S.decom)
trend=S.decom$trend
trend
seasonal=S.decom$seasonal
seasonal
ts.plot(cbind(trend,trend*seasonal),lty=1:2)
plot(stl(se,"periodic"))
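One way to get a "ts" object straight from the wide Year-by-month layout is to flatten the month columns row by row. The sketch below is a minimal base R example that uses the first three years of the prices quoted in the question. The object names (wide, price_ts, parts) are assumptions.

```r
# First three years of the prices from the question, one row per year.
vals <- c(
  46.09, 50.01, 48.00, 48.00, 50.15, 43.45, 41.05, 41.67, 36.66, 25.02, 22.98, 22.00,
  20.98, 15.00, 13.04, 14.40, 26.46, 14.32, 14.60, 11.83, 14.00, 14.40, 13.07, 13.60,
  15.31, 15.71, 18.97, 15.43, 13.50, 13.80, 14.21, 12.73, 12.35, 13.17, 14.59, 15.01
)
wide <- data.frame(
  Year = 2008:2010,
  matrix(vals, nrow = 3, byrow = TRUE, dimnames = list(NULL, month.abb))
)

# Transpose so the values run Jan..Dec within each year, then flatten
# into one long vector and attach the monthly time index.
price_ts <- ts(
  as.vector(t(as.matrix(wide[-1]))),
  start = c(wide$Year[1], 1),
  frequency = 12
)

# decompose() needs at least two full periods; three years is enough here.
parts <- decompose(price_ts)
plot(parts)
```

Passing the whole data frame to ts(), as in the attempt above, treats each column (including Year) as a separate series, which is probably why decompose did not give a usable result. The trailing zeros in 2015 look like placeholders for months without data, so it may be worth trimming them or setting them to NA first.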

reading ascii file in R

I am trying to read a file (ascii) in R using read.table
The file looks like the following:
DAILY MAXIMUM TEMPARATURE
YEAR DAY MT DT LAT. 66.5 67.5 68.5 69.5 70.5
1969 001 01 01 6.5 99.90 99.90 31.90 99.90 99.90
1969 001 01 01 7.5 99.90 20.90 99.90 99.90 23.90
1969 001 01 01 8.5 99.90 99.90 30.90 99.90 18.90
.....
.....
YEAR DAY MT DT LAT. 66.5 67.5 68.5 69.5 70.5
1969 001 01 02 6.5 21.90 99.90 99.90 99.90 99.90
1969 001 01 02 7.5 99.90 33.90 99.90 99.90 99.90
1969 001 01 02 8.5 99.90 99.90 15.90 99.90 99.90
.....
.....
YEAR DAY MT DT LAT. 66.5 67.5 68.5 69.5 70.5
1969 001 01 03 6.5 99.90 99.90 99.90 99.90 99.90
1969 001 01 03 7.5 99.90 99.90 99.90 99.90 99.90
1969 001 01 03 8.5 99.90 99.90 99.90 99.90 99.90
.....
.....
I read it using:
inp=read.table("MAXT1969.TXT",skip=1,header=T)
The file has been read and the contents are in the variable inp.
I have 2 questions -
I. the command to see the first 5 columns gives some extra information along with the desired output,
for example, inp[1,5] gives the following output:
> inp[1,5]
"[1] 6.5
33 Levels: 10.5 11.5 12.5 13.5 14.5 15.5 16.5 17.5 18.5 19.5 20.5 21.5 ... LAT."
I don't want the extra info but only the value. Where am I going wrong?
II. After every 32 rows there is a header (YEAR DAY ....). How can I skip these repeated headers when reading the file?
Try comment.char="Y" which will make read.table ignore all the lines starting with Y.
stringsAsFactors=FALSE will avoid converting strings to factors.
inp <- read.table("MAXT1969.TXT", skip = 1, header=FALSE, comment.char="Y", stringsAsFactors=FALSE )
#Read just the first header row, as text, to get the column names
cols <- read.table("MAXT1969.TXT", header=FALSE, skip=1, nrows=1, colClasses="character")
names(inp) <- unlist(cols, use.names=FALSE)
inp
## YEAR DAY MT DT LAT. 66.5 67.5 68.5 69.5 70.5
## 1 1969 1 1 1 6.5 99.9 99.9 31.9 99.9 99.9
## 2 1969 1 1 1 7.5 99.9 20.9 99.9 99.9 23.9
## 3 1969 1 1 1 8.5 99.9 99.9 30.9 99.9 18.9
## 4 1969 1 1 2 6.5 21.9 99.9 99.9 99.9 99.9
## 5 1969 1 1 2 7.5 99.9 33.9 99.9 99.9 99.9
## 6 1969 1 1 2 8.5 99.9 99.9 15.9 99.9 99.9
## 7 1969 1 1 3 6.5 99.9 99.9 99.9 99.9 99.9
## 8 1969 1 1 3 7.5 99.9 99.9 99.9 99.9 99.9
## 9 1969 1 1 3 8.5 99.9 99.9 99.9 99.9 99.9
#Since stringsAsFactors = FALSE was used and the repeated header lines were skipped, the numbers were read correctly.
inp[1, 5]
## [1] 6.5
Question 1: This means that your value has been read as a factor, i.e. a categorical variable. Convert the column with as.numeric(as.character(x)); calling as.numeric directly on a factor returns the internal level codes, not the values. Alternatively, you can use the colClasses argument of read.table to specify the column types directly.
Question 2: You can read the lines using readLines, find the lines that start with YEAR using grep, delete those, and read the edited output into a data.frame using read.table(textConnection(edited_data)). I would use @geektrader's solution instead, but I wanted to add this for completeness' sake.
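The readLines route described for Question 2 can be sketched as follows. The sample lines are a shortened, made-up version of the file (two longitude columns instead of five), written to a temporary file so the example is self-contained. The variable names are assumptions.

```r
# Shortened stand-in for MAXT1969.TXT: a title line, then blocks of data
# that each start with a repeated header line.
raw_lines <- c(
  "DAILY MAXIMUM TEMPARATURE",
  "YEAR DAY MT DT LAT. 66.5 67.5",
  "1969 001 01 01 6.5 99.90 99.90",
  "1969 001 01 01 7.5 99.90 20.90",
  "YEAR DAY MT DT LAT. 66.5 67.5",
  "1969 001 01 02 6.5 21.90 99.90"
)
path <- tempfile(fileext = ".TXT")
writeLines(raw_lines, path)

# Read everything as text and drop the title line.
lines <- readLines(path)[-1]

# Flag every header line, keep the first one for the column names.
is_header <- grepl("^YEAR", lines)
col_names <- strsplit(lines[which(is_header)[1]], "\\s+")[[1]]

# Parse only the data lines; check.names = FALSE keeps names such as "66.5".
inp <- read.table(text = lines[!is_header], col.names = col_names,
                  check.names = FALSE)

inp[1, 5]
```

Because no header text reaches read.table, every column is parsed as a number and no factor conversion is needed afterwards.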
Another solution would be to introduce NAs and then omit them -
inp = as.data.frame(na.omit(apply(apply(inp, 2, as.character), 2, as.numeric)))
