Trouble Reading Census Data into R (read.delim, read.table) - r

I am trying to read in Census regional data from the text file at https://www2.census.gov/geo/docs/reference/ua/ua_list_all.txt, but the columns are not being separated. I've tried leaving sep at its default and setting it to " " and "\t", but the result is either an error or everything crammed into a single column.
Here is the code I'm using:
read.delim("https://www2.census.gov/geo/docs/reference/ua/ua_list_all.txt", sep = "")
This is the error I receive:
Error in read.table(file = file, header = header, sep = sep, quote =
quote, : duplicate 'row.names' are not allowed
This is the output when I set sep = "\t":
UACE......NAME...................................................................POP............HU...........AREALAND..AREALANDSQMI..........AREAWATER.AREAWATERSQMI........POPDEN..LSADC
<chr>
00037 Abbeville, LA 19824 8460 29222871 11.28 300497 0.12 1757.0 76
00064 Abbeville, SC 5243 2578 11315197 4.37 19786 0.01 1200.1 76
00091 Abbotsford, WI 3966 1616 5363441 2.07 13221 0.01 1915.2 76
00118 Aberdeen, MS 4666 2050 7416616 2.86 52732 0.02 1629.4 76
00145 Aberdeen, SD 25977 12114 33002447 12.74 247597 0.10 2038.6 76
00172 Aberdeen, WA 29856 13139 39997951 15.44 1929689 0.75 1933.3 76
00199 Aberdeen--Bel Air South--Bel Air North, MD 213751 83721 339626464 131.13 9825290 3.79 1630.1 75
00226 Abernathy, TX 2785 1124 3051109 1.18 12572 0.00 2364.1 76
00253 Abilene, KS 7054 3238 8773263 3.39 1877 0.00 2082.4 76
00280 Abilene, TX 110421 46732 141756054 54.73 988193 0.38 2017.5 75
00334 Abingdon, IL 3389 1483 3731303 1.44 0 0.00 2352.4 76
00388 Ada, OH 5945 1906 4769036 1.84 0 0.00 3228.6 76
00415 Ada, OK 17400 8086 30913906 11.94 89140 0.03 1457.8 76
00450 Adams, NY 2542 1100 5107296 1.97 13914 0.01 1289.1 76
00469 Adel, GA 6986 2990 15634050 6.04 204861 0.08 1157.3 76
00496 Adel, IA 3170 1317 4624127 1.79 0 0.00 1775.5 76
...
1-16 of 3,601 rows

I'd propose a different solution, since the .txt file seems to use an unusual delimiter: how about downloading the .xls file in the same folder and using that instead?
The link to 'ua_list_all.xls' is here: https://www2.census.gov/geo/docs/reference/ua/
See code below:
library(readxl)
# read_excel() needs a local file, so download the .xls first (mode = "wb" keeps it intact as binary)
download.file("https://www2.census.gov/geo/docs/reference/ua/ua_list_all.xls", destfile = "ua_list_all.xls", mode = "wb")
test <- readxl::read_excel(path = "ua_list_all.xls")
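Alternatively, if you want to stay with the .txt file: it looks like a fixed-width layout rather than a delimited one (with sep = "", the spaces inside NAME give data rows more fields than the header, which is likely what triggers the duplicate 'row.names' error). A rough sketch with readr, assuming the ten columns shown in the header:
library(readr)
url <- "https://www2.census.gov/geo/docs/reference/ua/ua_list_all.txt"
# Guess the column boundaries from the file itself, skipping the header row,
# and supply the names from the header by hand.
positions <- fwf_empty(url, skip = 1,
                       col_names = c("UACE", "NAME", "POP", "HU",
                                     "AREALAND", "AREALANDSQMI",
                                     "AREAWATER", "AREAWATERSQMI",
                                     "POPDEN", "LSADC"))
ua <- read_fwf(url, col_positions = positions, skip = 1)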

Related

How can I extract specific data points from a wide-formatted text file in R?

I have datasheets with multiple measurements that look like the following:
FILE DATE TIME LOC QUAD LAI SEL DIFN MTA SEM SMP
20 20210805 08:38:32 H 1161 2.80 0.68 0.145 49. 8. 4
ANGLES 7.000 23.00 38.00 53.00 68.00
CNTCT# 1.969 1.517 0.981 1.579 1.386
STDDEV 1.632 1.051 0.596 0.904 0.379
DISTS 1.008 1.087 1.270 1.662 2.670
GAPS 0.137 0.192 0.288 0.073 0.025
A 1 08:38:40 31.66 33.63 34.59 39.13 55.86
1 2 08:38:40 -5.0e-006
B 3 08:38:48 25.74 20.71 15.03 2.584 1.716
B 4 08:38:55 0.344 1.107 2.730 0.285 0.265
B 5 08:39:02 3.211 5.105 13.01 4.828 1.943
B 6 08:39:10 8.423 22.91 48.77 16.34 3.572
B 7 08:39:19 12.58 14.90 18.34 18.26 4.125
I would like to read the entire datasheet and extract the values for 'QUAD' and 'LAI' only. For example, for the data above I would only be extracting a QUAD of 1161 and an LAI of 2.80.
In the past the datasheets were formatted as long data, and I was able to use the following code:
library(stringr)
QUAD <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^QUAD).*$")))
LAI <- as.numeric(str_trim(str_extract(data, "(?m)(?<=^LAI).*$")))
data_extract <- data.frame(
  QUAD = QUAD[!is.na(QUAD)],
  LAI = LAI[!is.na(LAI)]
)
data_extract
Unfortunately, this does not work because of the wide formatting in the current datasheet. Any help would be hugely appreciated. Thanks in advance for your time.
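One possible approach, sketched under the assumption that each measurement block has one "FILE DATE TIME ..." header row with its values on the line immediately beneath it ("datasheet.txt" is a placeholder for your file):
library(stringr)
lines <- readLines("datasheet.txt")
# Find every header row, then split it and the following value row on whitespace.
header_idx <- which(str_detect(lines, "^FILE\\s+DATE"))
header <- str_split(lines[header_idx[1]], "\\s+")[[1]]
value_rows <- str_split(lines[header_idx + 1], "\\s+")
# Pull out the fields sitting in the QUAD and LAI positions.
data_extract <- data.frame(
  QUAD = as.numeric(sapply(value_rows, `[`, match("QUAD", header))),
  LAI  = as.numeric(sapply(value_rows, `[`, match("LAI", header)))
)
data_extract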

Append new data to existing csv file in R

I am working on a project where I need to graph 10 days worth of data from remote sites.
I am downloading new data every 30 minutes from a remote computer via FTP (data is written every half hour also).
The local (onsite) file path changes every month, so I have a dynamic file path based on the current date.
eg.
/data/sitename/2020/July/data.csv
/data/sitename/2020/August/data.csv
My problem is that at each new month the csv I am downloading will be in a new folder, and when I FTP the new csv file it will only contain data from the new month, not the previous months.
I need to graph the last 10 days of data.
So what I'm hoping to do is download the new data every half hour and append only the newest records to the master data set. Or is there a better way altogether?
What I (think I) need to do is download the csv into R, and append only the new data to a master file and remove the oldest records so as to only contain 10 days worth of data in the csv.
I have searched everywhere but cannot seem to crack it.
This seems like it should be so easy, maybe I am using the wrong search terms.
I would like the following, pretty please (I've shown 10 lines of data; I'll need 480 for 10 days).
INITIAL DATA
DateTime Data1 Data2 Data3 Data4 Data5
641 2020-08-26T02:31:59.999+10:00 10.00 53.4 3.101 42 20.70
642 2020-08-26T03:01:59.999+10:00 11.11 52.0 2.778 44 20.70
643 2020-08-26T03:31:59.999+10:00 2.63 105.5 2.899 45 20.70
644 2020-08-26T04:01:59.999+10:00 11.11 60.5 2.920 45 20.70
645 2020-08-26T04:31:59.999+10:00 3.03 101.3 2.899 48 20.70
646 2020-08-26T05:01:59.999+10:00 2.86 125.2 2.899 49 20.65
647 2020-08-26T05:31:59.999+10:00 2.86 132.2 2.899 56 20.65
648 2020-08-26T06:01:59.999+10:00 3.23 113.9 2.963 61 20.65
649 2020-08-26T06:31:59.999+10:00 3.45 113.9 3.008 64 20.65
650 2020-08-26T07:01:59.999+10:00 3.57 108.3 3.053 66 20.65
NEW DATA
DateTime Data1 Data2 Data3 Data4 Data5
641 2020-08-26T02:31:59.999+10:00 10.00 53.4 3.101 42 20.70
642 2020-08-26T03:01:59.999+10:00 11.11 52.0 2.778 44 20.70
643 2020-08-26T03:31:59.999+10:00 2.63 105.5 2.899 45 20.70
644 2020-08-26T04:01:59.999+10:00 11.11 60.5 2.920 45 20.70
645 2020-08-26T04:31:59.999+10:00 3.03 101.3 2.899 48 20.70
646 2020-08-26T05:01:59.999+10:00 2.86 125.2 2.899 49 20.65
647 2020-08-26T05:31:59.999+10:00 2.86 132.2 2.899 56 20.65
648 2020-08-26T06:01:59.999+10:00 3.23 113.9 2.963 61 20.65
649 2020-08-26T06:31:59.999+10:00 3.45 113.9 3.008 64 20.65
650 2020-08-26T07:01:59.999+10:00 3.57 108.3 3.053 66 20.65
651 2020-08-26T07:31:59.999+10:00 3.85 109.7 3.125 70 20.65
REQUIRED DATA
DateTime Data1 Data2 Data3 Data4 Data5
642 2020-08-26T03:01:59.999+10:00 11.11 52.0 2.778 44 20.70
643 2020-08-26T03:31:59.999+10:00 2.63 105.5 2.899 45 20.70
644 2020-08-26T04:01:59.999+10:00 11.11 60.5 2.920 45 20.70
645 2020-08-26T04:31:59.999+10:00 3.03 101.3 2.899 48 20.70
646 2020-08-26T05:01:59.999+10:00 2.86 125.2 2.899 49 20.65
647 2020-08-26T05:31:59.999+10:00 2.86 132.2 2.899 56 20.65
648 2020-08-26T06:01:59.999+10:00 3.23 113.9 2.963 61 20.65
649 2020-08-26T06:31:59.999+10:00 3.45 113.9 3.008 64 20.65
650 2020-08-26T07:01:59.999+10:00 3.57 108.3 3.053 66 20.65
651 2020-08-26T07:31:59.999+10:00 3.85 109.7 3.125 70 20.65
This is where I am at...
library(RCurl)
library(readr)
library(ggplot2)
library(data.table)
# Get the date parts we need
Year <- format(Sys.Date(), format = "%Y")
Month <- format(Sys.Date(), format = "%B")
MM <- format(Sys.Date(), format = "%m")
# Create the file string and read (user, password and IP masked)
site <- glue::glue("ftp://user:passwd@99.99.99.99/path/{Year}/{Month}/site{Year}-{MM}.csv")
site <- read.csv(site, header = FALSE)
# Write table and create csv (write.table() is called for its side effect; its return value isn't useful)
write.table(site, "EP.csv", col.names = FALSE, row.names = FALSE)
EP <- fread("EP.csv", header = FALSE, select = c(1, 2, 3, 5, 6, 18))
write.table(EP, file = "output.csv", col.names = c("A", "B", etc), sep = ",", row.names = FALSE)  # etc: remaining names elided
# working up to here
# Append to master csv file
master <- read.csv("C:\\path\\master.csv")
You can convert the DateTime column to POSIXct class, combine the new and initial data, and keep only the rows from the last 10 days.
library(dplyr)
library(lubridate)
initial_data <- initial_data %>% mutate(DateTime = ymd_hms(DateTime))
new_data <- new_data %>% mutate(DateTime = ymd_hms(DateTime))
combined_data <- bind_rows(new_data, initial_data)
ten_days_data <- combined_data %>%
  filter(between(as.Date(DateTime), Sys.Date() - 10, Sys.Date()))
I'll try to answer this by combining the help from Ronak.
I am still hopeful that a better solution can be found where I can simply append the new data to the old data.
There were multiple parts to my question and Ronak provided a solution for the last 10 days problem:
ten_days_data <- combined_data %>%
  filter(between(as.Date(DateTime), Sys.Date() - 10, Sys.Date()))
The second part, combining the data, I found in another post: How to rbind new rows from one data frame to an existing data frame in R.
combined_data <- unique(rbindlist(list(initial_data, new_data)), by = "DateTime")
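Putting the pieces together, a rough sketch of the half-hourly update step (the file names are placeholders, and it assumes DateTime comes in as text; drop the ymd_hms() step if fread() has already parsed it):
library(data.table)
library(dplyr)
library(lubridate)
master <- fread("master.csv")
fresh  <- fread("EP.csv")
# Drop duplicate timestamps, keeping whichever copy appears first.
combined_data <- unique(rbindlist(list(master, fresh)), by = "DateTime")
# Trim to the last 10 days and write the result back out.
ten_days_data <- combined_data %>%
  mutate(DateTime = ymd_hms(DateTime)) %>%
  filter(between(as.Date(DateTime), Sys.Date() - 10, Sys.Date()))
fwrite(ten_days_data, "master.csv")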

How to break datasets in r into new data sets using empty rows

I am trying to "automize" the process of separating this data into datasets based on its trials. The original data is a list of values taken at several sites, and I want to break them into individual sets without having to use notation like c(1:312) because the number of rows in each trial vary. Each trail starts with a header, like d9, and ends with a blank row before the next header. How can I separate the data into new dataframes using the headers/empty rows?
This is for analyzing water data; Depth, Temperature, DO, and Salinity. The end goal is to create a graph of each trial showing the differences is Temperature across the trials.
Data Set (starting at row 1299)
1299 NA
1300 d4
1301 0.00
1302 0.18
1303 0.20
1304 0.31
1305 0.49
1306 0.76
1307 1.12
1308 1.51
1309 1.82
1310 1.92
1311 2.08
1312 2.35
1313 2.41
1314 2.48
1315 2.68
1316 2.97
1317 3.22
1318 3.33
1319 3.40
1320 3.55
1321 3.81
1322 4.05
1323 4.30
1324 4.41
1325 4.46
1326 4.56
1327 4.61
1328 4.62
1329 4.55
1330 4.54
1331 4.56
1332 4.49
1333 4.38
1334 4.38
1335 4.55
1336 4.71
1337 4.91
1338 5.14
1339 5.22
1340 5.26
1341 NA
1342 d11
1343 0.00
1344 0.22
1345 0.22
1346 0.27
D9 <- Data[3:314,]
D12 <- Data[317:517,]
D3 <- Data[520:703,]
D15 <- Data[706:795,]
D14 <- Data[798:853,]
D2 <- Data[856:939,]
D13 <- Data[942:975,]
D1 <- Data[978:1029,]
D6 <- Data[1032:1113,]
D5 <- Data[1116:1171,]
D7 <- Data[1174:1230,]
D8 <- Data[1233:1298,]
D4 <- Data[1301:1340,]
D11 <- Data[1343:1392,]
D10 <- Data[1395:1493,]
We can create a list using split along with grepl and cumsum
lst <- lapply(split.data.frame(x = df, f = cumsum(grepl('d\\d+', df$V2))),
              function(x) {
                names(x)[2] <- as.character(x[1, 'V2'])
                x[-1, ]
              })
data
df <- structure(list(V1 = 1299:1346, V2 = c(NA, "d4", "0.00", "0.18",
"0.20", "0.31", "0.49", "0.76", "1.12", "1.51", "1.82", "1.92",
"2.08", "2.35", "2.41", "2.48", "2.68", "2.97", "3.22", "3.33",
"3.40", "3.55", "3.81", "4.05", "4.30", "4.41", "4.46", "4.56",
"4.61", "4.62", "4.55", "4.54", "4.56", "4.49", "4.38", "4.38",
"4.55", "4.71", "4.91", "5.14", "5.22", "5.26", NA, "d11", "0.00",
"0.22", "0.22", "0.27")), class = "data.frame", row.names = c(NA, -48L))
Note: It's advised to keep your data frames in a list instead of assigning them into the global environment.
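As a quick check on the result (note that any rows before the first header land in their own list element):
names(lst)            # one element per cumsum() group
lapply(lst, head, 3)  # first few rows of each trial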

Why do I get duplicate data when I create a random sample from an existing dataset within R?

I wanted to understand what is wrong with my syntax when I try to sample data from my dataset, which has 5000 rows; I only want to randomly sample 500 rows from it.
reprex of dataset (xdata):
AccountId Street City State ZipCode CloseFactorPct OpenFactorPct ZipIncome ZipDegree
1 455697 3919 Birkdale Ln Se Olympia WA 98501 0.75 1.40 67060 0.17879866
2 490095 29174 Wagon Rd Agoura Hills CA 91301 0.85 2.50 115125 0.21376952
3 427399 301a Franklin Ave Princeton NJ 8540 0.80 2.25 124954 0.50428200
4 470678 1461 Woodsview Way Macedon NY 14502 0.80 2.50 67780 0.13772373
5 424824 616 Locust Ave Las Animas CO 81054 0.80 2.25 31343 0.02021198
6 437343 13 New Oxford Rd Conway AR 72034 0.80 2.25 51435 0.15904222
TotalOwed
1 0.0
2 185.1
3 1645.0
4 0.0
5 0.0
6 0.0
My code:
sample2 <- xdata[sample(nrow(xdata), "500", replace=T), sample(ncol(xdata), 10, replace=T)]
head(sample2)
ZipIncome City ZipIncome.1 TotalOwed Street OpenFactorPct ZipHhIncome.2
14470 41866 Columbus 41866 841.31 792 Dennison Avenue 0.85 41866
23502 55221 El Paso 55221 0.00 12949 Eastbrook Drive Apt 53 0.70 55221
7370 93373 Saddle Brook 93373 570.38 229 S Boulevard 0.70 93373
31627 61830 Choudrant 61830 1156.28 153 Jones Street 0.70 61830
29840 39697 Beckley 39697 0.00 2109 S Kanawha St 0.75 39697
14938 91313 Bradenton 91313 0.00 5007 Serata Dr 0.85 91313
ZipIncome.3 ClosedFactorPct ZipIncome.4
14470 41866 0.95 41866
23502 55221 0.80 55221
7370 93373 1.20 93373
31627 61830 0.80 61830
29840 39697 0.80 39697
14938 91313 1.30 91313
The output I receive gives me 4 duplicates of ZipIncome. Why does this happen? Can someone help me understand whether my syntax for pulling a random sample is incorrect, or whether I need to use set.seed()?
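The duplicates come from the column index: sample(ncol(xdata), 10, replace = T) draws column positions with replacement, so the same column (ZipIncome here) can be picked several times, and R disambiguates the repeats as ZipIncome.1, ZipIncome.2, and so on. set.seed() only makes the draw reproducible; it does not prevent repeats. A minimal sketch that samples 500 rows without replacement and keeps every column once:
set.seed(123)  # optional, only for reproducibility
sample2 <- xdata[sample(nrow(xdata), 500), ]
head(sample2)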

How can I read a file which doesn't have a filename extension in R?

I am dealing with a climate dataset in R: global yearly Temp/Precip observations downloaded from here: climate data archive. Example files are the yearly temperature data for all countries and the yearly precipitation data for all countries. However, the files have no filename extension; it seems to have been forgotten or is simply missing. I tried base::scan() to load them into R, but the output is not what I want: each file should have 14 fixed columns, while scan() just returns a flat vector (printed seven values per line). Is there a better function for reading a file without a filename extension? Any idea?
Here is what the list of climate data files looks like:
list.files("stella/data/air_temp_1980_2014/", recursive = TRUE)
[1] "air_temp.1980" "air_temp.1981" "air_temp.1982" "air_temp.1983"
[5] "air_temp.1984" "air_temp.1985" "air_temp.1986" "air_temp.1987"
[9] "air_temp.1988" "air_temp.1989" "air_temp.1990" "air_temp.1991"
[13] "air_temp.1992" "air_temp.1993" "air_temp.1994" "air_temp.1995"
[17] "air_temp.1996" "air_temp.1997" "air_temp.1998" "air_temp.1999"
[21] "air_temp.2000" "air_temp.2001" "air_temp.2002" "air_temp.2003"
[25] "air_temp.2004" "air_temp.2005" "air_temp.2006" "air_temp.2007"
[29] "air_temp.2008" "air_temp.2009" "air_temp.2010" "air_temp.2011"
[33] "air_temp.2012" "air_temp.2013" "air_temp.2014"
Here is the output scan() produces:
>scan(file = "stella/data/air_temp_1980_2014/air_temp.1980", sep = "", skip = 1)
[1] -179.75 68.75 -27.00 -28.20 -27.20 -21.60 -9.00
[8] 0.60 2.80 1.90 -0.20 -11.90 -22.70 -25.10
[15] -179.75 68.25 -27.80 -28.50 -27.50 -22.00 -9.50
[22] 0.40 3.00 1.80 -0.80 -12.70 -23.60 -26.80
[29] -179.75 67.75 -26.80 -26.60 -25.70 -20.50 -8.00
[36] 2.70 6.00 4.00 0.50 -12.20 -23.20 -27.30
[43] -179.75 67.25 -29.10 -28.40 -27.50 -22.30 -9.70
[50] 2.20 6.20 3.30 -1.30 -15.40 -26.40 -31.10
[57] -179.75 66.75 -25.40 -23.80 -22.90 -18.20 -6.10
[64] 3.80 8.60 6.00 1.10 -11.50 -22.30 -27.20
Desired output:
Long Lat Jan Feb Mar April May Jun Jul
1 -179.75 68.75 -27.0 -28.2 -27.2 -21.6 -9.0 0.6 2.8
2 -179.75 68.25 -27.8 -28.5 -27.5 -22.0 -9.5 0.4 3.0
3 -179.75 67.75 -26.8 -26.6 -25.7 -20.5 -8.0 2.7 6.0
4 -179.75 67.25 -29.1 -28.4 -27.5 -22.3 -9.7 2.2 6.2
5 -179.75 66.75 -25.4 -23.8 -22.9 -18.2 -6.1 3.8 8.6
6 -179.75 66.25 -21.5 -18.9 -17.2 -14.0 -2.3 3.4 9.2
7 -179.75 65.75 -20.2 -17.9 -17.1 -13.2 -2.2 4.3 10.1
8 -179.75 65.25 -20.0 -18.7 -17.4 -14.1 -2.4 4.3 10.5
9 -179.75 -16.75 27.4 28.3 27.9 27.2 25.7 24.9 24.7
10 -179.75 -84.75 -18.9 -27.9 -38.6 -41.5 -41.2 -44.4 -45.2
11 -179.75 -85.25 -23.9 -33.8 -45.1 -47.9 -47.7 -50.4 -51.5
12 -179.75 -85.75 -22.8 -33.5 -45.2 -48.1 -47.7 -49.9 -51.4
13 -179.75 -86.25 -24.3 -35.5 -47.7 -50.6 -50.2 -52.1 -53.8
14 -179.75 -86.75 -25.5 -37.1 -49.6 -52.6 -52.1 -53.8 -55.7
15 -179.75 -87.25 -26.2 -38.1 -50.9 -53.8 -53.2 -54.8 -56.8
16 -179.75 -87.75 -26.7 -39.0 -51.9 -54.8 -54.3 -55.7 -57.9
I want to read the whole list of files into R. How can I read the above data correctly, in the format I expect? Any idea?
The filename extension in itself doesn't mean much; it's only there to signal how the data in the file is organized. You should open the file in a text editor to figure out how it is represented.
From the looks of it, and according to the other answer, it might be a whitespace/tab-delimited text file. So the way to import it into R is to use the delimited-file input functions, like read.csv/read.table or data.table::fread.
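For instance, a minimal sketch under the whitespace-delimited assumption, naming the 14 columns (longitude, latitude, and one per month) ourselves:
# read.table() ignores the extension; sep = "" splits on any run of whitespace.
# Add skip = 1 if the files really do start with a header line, as the scan() call above assumed.
air_1980 <- read.table("stella/data/air_temp_1980_2014/air_temp.1980",
                       header = FALSE,
                       col.names = c("Long", "Lat", month.abb))
head(air_1980)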
It appears to be a whitespace/tab-delimited file (you can check by adding a .txt extension and opening it in a text editor). If you add a .csv extension to each of the files and then read them in explicitly using whitespace as the delimiter, it should work fine. This may be tedious, but it is likely your best option, as files without an appropriate extension are confusing in their own right.
Be cautious, though, because the column names are not preserved. To avoid the first row being stored as column names, you need to pass a vector of names to the function as well.
name_vector <- c("Long", "Lat", ... )
x <- read.csv("path/precip.1980.csv", sep = "", col.names = name_vector)
edit:
Since you've already listed the files, you can paste a ".csv" onto the end of each element of the file-list vector instead of renaming each file by hand; just remember to rename the files on disk to match. (Strictly speaking, read.csv() doesn't require an extension, but as noted above, extensionless files are confusing to work with.)
# store file list (full.names = TRUE keeps the directory in each path)
filelist <- list.files("stella/data/air_temp_1980_2014/", recursive = TRUE, full.names = TRUE)
# rename the files on disk, then update the vector to match
file.rename(filelist, paste0(filelist, ".csv"))
filelist <- paste0(filelist, ".csv")
Then you could iteratively read the files in with the code from above. Here's an example of a solution that could work:
dat <- lapply(filelist, function(x) {
  read.csv(x, sep = "", col.names = name_vector)
})
I haven't explicitly tested this solution, and it may still throw errors because of the column-names issue. If you'd provide a proper reprex, it would be much easier to troubleshoot these issues for you.
