readr::read_tsv() parsing failures due to trailing tabs - r

Issue/Question
I have a tab-delimited my-data.txt file with 51 columns. The col_names row has no trailing tab, so readr::read_tsv() correctly detects 51 columns. However, every data row ends in a trailing tab, which read_tsv() interprets as a 52nd column. The code runs, but I get a warning that I would like to get rid of. Are there any read_tsv() arguments that can help handle this? Should I use a different readr function instead?
my-data.txt
PT AU BA CA GP RI OI BE Z2 TI X1 Y1 Z1 FT PN AE Z3 SO S1 SE BS VL IS SI MA BP EP AR DI D2 SU PD PY AB X4 Y4 Z4 AK CT CY SP CL TC Z8 ZB ZS Z9 SN BN UT PM
J Jacquelin, Sebastien; Straube, Jasmin; Cooper, Leanne; Vu, Therese; Song, Axia; Bywater, Megan; Baxter, Eva; Heidecker, Matthew; Wackrow, Brad; Porter, Amy; Ling, Victoria; Green, Joanne; Austin, Rebecca; Kazakoff, Stephen; Waddell, Nicola; Hesson, Luke B.; Pimanda, John E.; Stegelmann, Frank; Bullinger, Lars; Doehner, Konstanze; Rampal, Raajit K.; Heckl, Dirk; Hill, Geoffrey R.; Lane, Steven W. Jak2V617F and Dnmt3a loss cooperate to induce myelofibrosis through activated enhancer-driven inflammation BLOOD 132 26 2707 2721 10.1182/blood-2018-04-846220 DEC 27 2018 2018 10 WOS:000454429300003
J Renne, Julius; Gutberlet, Marcel; Voskrebenzev, Andreas; Kern, Agilo; Kaireit, Till; Hinrichs, Jan; Zardo, Patrick; Warnecke, Gregor; Krueger, Marcus; Braubach, Peter; Jonigk, Danny; Haverich, Axel; Wacker, Frank; Vogel-Claussen, Jens; Zinne, Norman Multiparametric MRI for organ quality assessment in a porcine Ex-Vivo lung perfusion system PLOS ONE 13 12 e0209103 10.1371/journal.pone.0209103 DEC 27 2018 2018 1 WOS:000454418200015
J Lau, Skadi; Eicke, Dorothee; Oliveira, Marco Carvalho; Wiegmann, Bettina; Schrimpf, Claudia; Haverich, Axel; Blasczyk, Rainer; Wilhelmi, Mathias; Figueiredo, Constanca; Boeer, Ulrike Low Immunogenic Endothelial Cells Maintain Morphological and Functional Properties Required for Vascular Tissue Engineering TISSUE ENGINEERING PART A 24 5-6 432 447 10.1089/ten.tea.2016.0541 MAR 2018 2018 4 WOS:000418327100001
Reprex
Note that I did some manual editing of the reprex: I needed to read in the .txt file to reproduce the issue, which causes errors in reprex without my computer-specific path. See RStudio Community Topic 8773
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(readr)
my_data <- read_tsv("my-data.txt", quote = "")
#> Parsed with column specification:
#> cols(
#> .default = col_logical(),
#> PT = col_character(),
#> AU = col_character(),
#> TI = col_character(),
#> SO = col_character(),
#> VL = col_double(),
#> IS = col_character(),
#> BP = col_double(),
#> EP = col_double(),
#> AR = col_character(),
#> DI = col_character(),
#> PD = col_character(),
#> PY = col_double(),
#> TC = col_double(),
#> UT = col_character()
#> )
#> See spec(...) for full column specifications.
#> Warning: 3 parsing failures.
#> row col expected actual file
#> 1 -- 51 columns 52 columns 'my-data.txt'
#> 2 -- 51 columns 52 columns 'my-data.txt'
#> 3 -- 51 columns 52 columns 'my-data.txt'
problems(my_data)
#> # A tibble: 3 x 5
#> row col expected actual file
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 <NA> 51 columns 52 columns 'my-data.txt'
#> 2 2 <NA> 51 columns 52 columns 'my-data.txt'
#> 3 3 <NA> 51 columns 52 columns 'my-data.txt'
Created on 2020-04-01 by the reprex package (v0.3.0)
Thank you for taking the time to help me.

My favorite .tsv file reader is fread from data.table. It often works right out of the box. It might be worth a try.
library(data.table)
my_data <- fread("my-data.txt")
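If you want to stay within readr, one possible workaround (a sketch; the dummy column name EXTRA is arbitrary) is to read the header row yourself, append a 52nd name for the phantom column created by the trailing tab, and drop that column after import:
library(readr)
library(dplyr)
# Read the 51 real column names from the first line ...
hdr <- strsplit(read_lines("my-data.txt", n_max = 1), "\t")[[1]]
# ... then read the data rows against 52 names and drop the dummy.
my_data <- read_tsv("my-data.txt", quote = "", skip = 1,
                    col_names = c(hdr, "EXTRA")) %>%
  select(-EXTRA)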

Related

Revisiting "Can't rename columns that don't exist"

I tried to rename columns, which should be a very straightforward operation, but I keep getting errors. I tried two methods and neither of them works. Can anyone explain what needs to be done to rename columns without these strange errors? I tried several SO posts but none of them really helped.
library(pacman)
#> Warning: package 'pacman' was built under R version 4.2.1
p_load(dplyr, readr)
data = read_csv("https://raw.githubusercontent.com/srk7774/data/master/august_october_2020.csv",
                col_names = TRUE)
#> Rows: 16 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): X.1
#> dbl (2): Total Agree - August 2020, Total Agree - October 2020
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
column_recodes <- c(X.1 = "country",
                    august = "Total Agree - August 2020",
                    october = "`Total Agree - October 2020",
                    `Another non-existent column name` = "bar")
data %>% rename_with(~recode(., !!!column_recodes))
#> # A tibble: 16 × 3
#> country `Total Agree - August 2020` `Total Agree - October 2020`
#> <chr> <dbl> <dbl>
#> 1 Total 77 73
#> 2 India 87 87
#> 3 China 97 85
#> 4 South Korea 84 83
#> 5 Brazil 88 81
#> 6 Australia 88 79
#> 7 United Kingdom 85 79
#> 8 Mexico 75 78
#> 9 Canada 76 76
#> 10 Germany 67 69
#> 11 Japan 75 69
#> 12 South Africa 64 68
#> 13 Italy 67 65
#> 14 Spain 72 64
#> 15 United States 67 64
#> 16 France 59 54
data %>%
  rename(country = X.1,
         august = Total.Agree...August.2020,
         october = Total.Agree...October.2020)
#> Error in `chr_as_locations()`:
#> ! Can't rename columns that don't exist.
#> ✖ Column `Total.Agree...August.2020` doesn't exist.
Created on 2022-10-24 by the reprex package (v2.0.1)
Add backticks when using names with spaces:
data %>%
  rename(country = X.1,
         august = `Total Agree - August 2020`,
         october = `Total Agree - October 2020`)
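If the goal is renaming from a lookup table without tripping over missing columns, a named vector combined with any_of() also works inside rename() (a sketch; the new name flagged and the old name bar are made up to show that absent columns are silently skipped):
library(dplyr)
# Named vector in the form new_name = "old name"
lookup <- c(country = "X.1",
            august  = "Total Agree - August 2020",
            october = "Total Agree - October 2020",
            flagged = "bar")  # "bar" does not exist and is skipped
data %>% rename(any_of(lookup))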

read file from google drive

I have a spreadsheet uploaded as a CSV file in Google Drive, unlocked so users can read from it.
This is the link to the csv file:
https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/edit?usp=sharing
I am trying to read it from R but I am getting a long list of error messages. I am using:
id = "170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk"
read.csv(sprintf("https://docs.google.com/spreadsheets/d/uc?id=%s&export=download", id))
Could someone suggest how to read files from google drive directly into R?
I would try to publish the sheet as a CSV file (doc), and then read it from there.
It seems like your file is already published as a CSV. So, this should work. (Note that the URL ends with /pub?output=csv)
read.csv("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
To read the CSV file faster, you can use vroom, which is even faster than fread(). See here.
Now using vroom,
library(vroom)
vroom("https://docs.google.com/spreadsheets/d/170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk/pub?output=csv")
#> Rows: 387048 Columns: 14
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): StationCode, SampleID, WeatherCode, OrganismCode, race, race2
#> dbl (7): WaterTemperature, Turbidity, Velocity, ForkLength, Weight, Count, ...
#> date (1): SampleDate
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 387,048 × 14
#> StationCode SampleDate SampleID WeatherCode WaterTemperature Turbidity
#> <chr> <date> <chr> <chr> <dbl> <dbl>
#> 1 Gate 11 2000-04-25 116_00 CLD 13.1 2
#> 2 Gate 5 1995-04-26 117_95 CLR NA 2
#> 3 Gate 2 1995-04-21 111_95 W 10.4 12
#> 4 Gate 6 2008-12-13 348_08 CLR 49.9 1.82
#> 5 Gate 5 1999-12-10 344_99 CLR 7.30 1.5
#> 6 Gate 6 2012-05-25 146_12 CLR 55.5 1.60
#> 7 Gate 10 2011-06-28 179_11 RAN 57.3 3.99
#> 8 Gate 11 1996-04-25 116_96 CLR 13.8 21
#> 9 Gate 9 2007-07-02 183_07 CLR 56.6 2.09
#> 10 Gate 6 2009-06-04 155_09 CLR 58.6 3.08
#> # … with 387,038 more rows, and 8 more variables: Velocity <dbl>,
#> # OrganismCode <chr>, ForkLength <dbl>, Weight <dbl>, Count <dbl>,
#> # race <chr>, year <dbl>, race2 <chr>
Created on 2022-07-08 by the reprex package (v2.0.1)
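Alternatively, the googlesheets4 package can read the sheet by its ID without publishing it first (a sketch; gs4_deauth() skips authentication, which only works because the sheet is publicly readable):
library(googlesheets4)
gs4_deauth()  # no login needed for a public sheet
my_data <- read_sheet("170235QwbmgQvr0GWmT-8yBsC7Vk6p_dmvYxrZNfsKqk")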

Importing multiple .csv files into R with different names

I have one working directory with 37 Locations.csv files and 37 Behavior.csv files.
See below that some files share the same number, such as 111868-Behavior.csv and 111868-Behavior 2.csv; the same happens with the Locations.csv files.
#here some of the csv in the work directory
dir()
[1] "111868-Behavior 2.csv" "111868-Behavior.csv"
[3] "111868-Locations 2.csv" "111868-Locations.csv"
[5] "111869-Behavior.csv" "111869-Locations.csv"
[7] "111870-Behavior 2.csv" "111870-Behavior.csv"
[9] "111870-Locations 2.csv" "111870-Locations.csv"
[11] "112696-Behavior 2.csv" "112696-Behavior.csv"
[13] "112696-Locations 2.csv" "112696-Locations.csv"
I can't change the names of the files.
I want to import all 36 Locations and 36 Behaviors, but when I tried this
#Create list of all behaviors
bhv <- list.files(pattern="*-Behavior.csv")
bhv2 <- list.files(pattern="*-Behavior 2.csv")
#Throw them altogether
bhv_csv = ldply(bhv, read_csv)
bhv_csv2 = ldply(bhv2, read_csv)
#Join bhv_csv and bhv_csv2
b<-rbind(bhv_csv,bhv_csv2)
#Create list of all locations
loc <- list.files(pattern="*-Locations.csv")
loc2 <- list.files(pattern="*-Locations 2.csv")
#Throw them altogether
loc_csv = ldply(loc, read_csv)
loc_csv2 = ldply(loc2, read_csv)
#Join loc_csv and loc_csv2
l<-rbind(loc_csv,loc_csv2)
This shows me only 28, not 36 as I expected:
length(unique(b$Ptt))
[1] 28
length(unique(l$Ptt))
[1] 28
This number, 28, counts only the Behavior.csv and Locations.csv files, leaving out the "Behavior 2.csv" and "Locations 2.csv" files (there are 8 of each ending in "2").
I want to import all the Behavior files and all the Locations files so that all 36 of each show up. How can I do that?
You can use purrr::map to simplify some of your code:
library("tidyverse")
library("readr")
# Create two small csv files
write_lines("a,b\n1,2\n3,4", "file1.csv")
write_lines("a,c\n5,6\n7,8", "file2.csv")
list.files(pattern = "*.csv") %>%
  # `map` will cycle through the files and read each one
  map(read_csv) %>%
  # and then we can bind them all together
  bind_rows()
#> Parsed with column specification:
#> cols(
#> a = col_double(),
#> b = col_double()
#> )
#> Parsed with column specification:
#> cols(
#> a = col_double(),
#> c = col_double()
#> )
#> # A tibble: 4 x 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 1 2 NA
#> 2 3 4 NA
#> 3 5 NA 6
#> 4 7 NA 8
Created on 2019-03-28 by the reprex package (v0.2.1)
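Applied to the original files, a single pattern can pick up both the plain and the " 2" variants in one pass (a sketch, assuming the CSVs sit in the working directory; the escaped dot matters because list.files() patterns are regular expressions, not globs):
library(tidyverse)
# "( 2)?" makes the " 2" suffix optional, so "111868-Behavior.csv" and
# "111868-Behavior 2.csv" both match; "\\." matches a literal dot.
bhv_files <- list.files(pattern = "-Behavior( 2)?\\.csv$")
loc_files <- list.files(pattern = "-Locations( 2)?\\.csv$")
b <- map(bhv_files, read_csv) %>% bind_rows()
l <- map(loc_files, read_csv) %>% bind_rows()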

reading zip file directly with read_csv from readr producing weird results

I'm trying to read directly from a URL to grab a zip file that contains a pipe-delimited text file. If I download the file and then use read_csv to read it from disk, I have no problems. But if I try to use read_csv to read the URL directly, I get garbage in my resulting df. I can work around this by downloading first and then reading, but it seems like it should work directly. Any clues on what's going on here?
library(readr)
url <- "https://www.rma.usda.gov/data/sob/sccc/sobcov_2018.zip"
df <- read_delim(url, delim = '|',
                 col_names = c('year', 'stFips', 'stAbbr', 'coFips', 'coName',
                               'cropCd', 'cropName', 'planCd', 'planAbbr', 'coverCat',
                               'deliveryType', 'covLevel', 'policyCount', 'policyPremCount', 'policyIndemCount',
                               'unitsReportingPrem', 'indemCount', 'quantType', 'quantNet', 'companionAcres',
                               'liab', 'prem', 'subsidy', 'indem', 'lossRatio'))
#> Parsed with column specification:
#> cols(
#> .default = col_character()
#> )
#> See spec(...) for full column specifications.
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 1)
#> Warning: 7908 parsing failures.
#> # A tibble: 5 x 5
#>     row col   expected   actual        file
#>   <int> <chr> <chr>      <chr>         <chr>
#> 1     1 year  ""         embedded null 'https://www.rma.usda.gov/data/sob…
#> 2     1 <NA>  25 columns 1 columns     'https://www.rma.usda.gov/data/sob…
#> 3     2 <NA>  25 columns 4 columns     'https://www.rma.usda.gov/data/sob…
#> 4     3 <NA>  25 columns 2 columns     'https://www.rma.usda.gov/data/sob…
#> 5     4 year  ""         embedded null 'https://www.rma.usda.gov/data/sob…
#> See problems(...) for more details.
head(df)
#> # A tibble: 6 x 25
#> year stFips stAbbr coFips coName cropCd cropName planCd planAbbr
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 "PK\u00… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 "K\xe6\… "\xf5\x… "\xc5\… "\xfa\… <NA> <NA> <NA> <NA> <NA>
#> 3 "\xb0\x… "\xfd\x… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 "j`/Q\x… "\x96\x… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 "\xc0\x… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 "z\xe4\… "~y\xf5… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> # ... with 16 more variables: coverCat <chr>, deliveryType <chr>,
#> # covLevel <chr>, policyCount <chr>, policyPremCount <chr>,
#> # policyIndemCount <chr>, unitsReportingPrem <chr>, indemCount <chr>,
#> # quantType <chr>, quantNet <chr>, companionAcres <chr>, liab <chr>,
#> # prem <chr>, subsidy <chr>, indem <chr>, lossRatio <chr>
If I download first, I get the following output:
> url <- './data/sobcov_2018.zip'
> df <- read_delim(url, delim='|',
+ col_names = c('year','stFips','stAbbr','coFips','coName',
+ 'cropCd','cropName','planCd','planAbbr','coverCat',
+ 'deliveryType','covLevel','policyCount','policyPremCount','policyIndemCount',
+ 'unitsReportingPrem', 'indemCount','quantType', 'quantNet', 'companionAcres',
+ 'liab','prem','subsidy','indem', 'lossRatio'))
Parsed with column specification:
cols(
.default = col_integer(),
stFips = col_character(),
stAbbr = col_character(),
coFips = col_character(),
coName = col_character(),
cropCd = col_character(),
cropName = col_character(),
planCd = col_character(),
planAbbr = col_character(),
coverCat = col_character(),
deliveryType = col_character(),
covLevel = col_double(),
quantType = col_character(),
lossRatio = col_double()
)
See spec(...) for full column specifications.
> head(df)
# A tibble: 6 x 25
year stFips stAbbr coFips coName cropCd cropName planCd planAbbr coverCat deliveryType covLevel
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2018 02 AK 999 "All Other … 9999 "All Other C… 01 "YP … "A " RBUP 0.500
2 2018 02 AK 240 "Southeast … 9999 "All Other C… 90 "APH … "A " RBUP 0.500
3 2018 02 AK 240 "Southeast … 9999 "All Other C… 90 "APH … "A " RBUP 0.750
4 2018 02 AK 240 "Southeast … 9999 "All Other C… 90 "APH … "C " RCAT 0.500
5 2018 02 AK 240 "Southeast … 9999 "All Other C… 02 "RP … "A " RBUP 0.600
6 2018 02 AK 240 "Southeast … 9999 "All Other C… 02 "RP … "A " RBUP 0.750
# ... with 13 more variables: policyCount <int>, policyPremCount <int>, policyIndemCount <int>,
# unitsReportingPrem <int>, indemCount <int>, quantType <chr>, quantNet <int>, companionAcres <int>,
# liab <int>, prem <int>, subsidy <int>, indem <int>, lossRatio <dbl>
>
readr can only handle gz-compressed files as remote sources, since there is no analogue of base::gzcon() for the other compression algorithms. See this GitHub issue for a discussion and the improved documentation (also in ?readr::datasource).
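A minimal sketch of the download-then-read workaround, using a temporary file (readr can uncompress a local .zip, just not a remote one):
library(readr)
url <- "https://www.rma.usda.gov/data/sob/sccc/sobcov_2018.zip"
tmp <- tempfile(fileext = ".zip")
download.file(url, tmp, mode = "wb")  # binary mode matters on Windows
df <- read_delim(tmp, delim = "|", col_names = FALSE)  # or the full col_names vector above
unlink(tmp)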

use model object, e.g. panelmodel, to flag data used

Is it possible to use a fit object, specifically the regression object I get from a plm() model, to flag the observations in the data that were actually used in the regression? I realize this could be done by looking for complete observations in my original data, but I am curious whether there's a way to use the fit/reg object to flag the data.
Let me illustrate my issue with a minimal working example,
First some packages needed,
# install.packages(c("stargazer", "plm", "tidyverse"), dependencies = TRUE)
library(plm); library(stargazer); library(tidyverse)
Second, some data; this example draws heavily on Baltagi (2013), table 3.1, found in ?plm:
data("Grunfeld", package = "plm")
dta <- Grunfeld
Now I create some semi-random missing values in my data object, dta:
dta[c(3:13),3] <- NA; dta[c(22:28),4] <- NA; dta[c(30:33),5] <- NA
The final step in the data preparation is to create a data frame with an index attribute that describes its individual and time dimensions, using the tidyverse:
dta.p <- dta %>% group_by(firm, year)
Now to the regression
plm.reg <- plm(inv ~ value + capital, data = dta.p, model = "pooling")
The results, using stargazer:
stargazer(plm.reg, type="text") # stargazer(dta, type="text")
#> ============================================
#> Dependent variable:
#> ---------------------------
#> inv
#> ----------------------------------------
#> value 0.114***
#> (0.008)
#>
#> capital 0.237***
#> (0.028)
#>
#> Constant -47.962***
#> (9.252)
#>
#> ----------------------------------------
#> Observations 178
#> R2 0.799
#> Adjusted R2 0.797
#> F Statistic 348.176*** (df = 2; 175)
#> ===========================================
#> Note: *p<0.1; **p<0.05; ***p<0.01
Say I know my data has 200 observations, and I want to find the 178 that were used in the regression.
I am wondering if there's some vector in plm.reg that I can (easily) use to create a flag in my original data, dta, marking whether each observation was used or not, i.e. the semi-random missing values I created above. Maybe some broom-like tool.
I imagine something like,
dta <- dta %>% valid_reg_obs(plm.reg)
The desired outcome would look something like this; the new element is the vector plm.reg at the end:
dta %>% as_tibble()
#> # A tibble: 200 x 6
#> firm year inv value capital plm.reg
#> * <int> <int> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1935 318 3078 2.80 T
#> 2 1 1936 392 4662 52.6 T
#> 3 1 1937 NA 5387 157 F
#> 4 1 1938 NA 2792 209 F
#> 5 1 1939 NA 4313 203 F
#> 6 1 1940 NA 4644 207 F
#> 7 1 1941 NA 4551 255 F
#> 8 1 1942 NA 3244 304 F
#> 9 1 1943 NA 4054 264 F
#> 10 1 1944 NA 4379 202 F
#> # ... with 190 more rows
Update: I tried broom's augment(), which I had hoped would create some flag, but unfortunately it gave me this error message:
# install.packages(c("broom"), dependencies = TRUE)
library(broom)
augment(plm.reg, dta)
#> Error in data.frame(..., check.names = FALSE) :
#> arguments imply differing number of rows: 200, 178
The vector is plm.reg$residuals. Not sure of a nice broom solution, but this seems to work:
library(tidyverse)
dta.p %>%
  as.data.frame %>%
  rowid_to_column %>%
  mutate(plm.reg = rowid %in% names(plm.reg$residuals))
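For comparison, the complete-cases route mentioned in the question could look like this (a sketch that inspects the model variables directly instead of the fit object):
library(dplyr)
# Flag rows where inv, value and capital are all observed; these are
# exactly the rows plm() keeps after dropping incomplete cases.
dta %>%
  mutate(plm.reg = complete.cases(inv, value, capital)) %>%
  as_tibble()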
For people who use pdata.frame() to create an index attribute that describes the individual and time dimensions, you can use the following code; this is another Baltagi example from ?plm:
# == Baltagi (2013), pp. 204-205
data("Produc", package = "plm")
pProduc <- pdata.frame(Produc, index = c("state", "year", "region"))
form <- log(gsp) ~ log(pc) + log(emp) + log(hwy) + log(water) + log(util) + unemp
Baltagi_reg_204_5 <- plm(form, data = pProduc, model = "random", effect = "nested")
pProduc %>%
  mutate(reg.re = rownames(pProduc) %in% names(Baltagi_reg_204_5$residuals)) %>%
  as_tibble() %>%
  select(state, year, region, reg.re)
#> # A tibble: 816 x 4
#> state year region reg.re
#> <fct> <fct> <fct> <lgl>
#> 1 CONNECTICUT 1970 1 T
#> 2 CONNECTICUT 1971 1 T
#> 3 CONNECTICUT 1972 1 T
#> 4 CONNECTICUT 1973 1 T
#> 5 CONNECTICUT 1974 1 T
#> 6 CONNECTICUT 1975 1 T
#> 7 CONNECTICUT 1976 1 T
#> 8 CONNECTICUT 1977 1 T
#> 9 CONNECTICUT 1978 1 T
#> 10 CONNECTICUT 1979 1 T
#> # ... with 806 more rows
Finally, if you are running the first Baltagi example without index attributes, i.e. the unmodified example from the help file (with the fitted model stored as p), the code should be:
Grunfeld %>%
  rowid_to_column %>%
  mutate(plm.reg = rowid %in% names(p$residuals)) %>%
  as_tibble()

Resources