I am working on a project where I need to graph 10 days' worth of data from remote sites.
I download new data every 30 minutes from a remote computer via FTP (data is also written every half hour).
The file path on the onsite computer changes every month, so the URL I download from has to be built from the current date.
eg.
/data/sitename/2020/July/data.csv
/data/sitename/2020/August/data.csv
My problem is that at each new month the CSV I am downloading will be in a new folder, and the new file will only contain data from the new month, not the previous months.
I need to graph the last 10 days of data.
So what I'm hoping to do is download the new data every half hour and append only the newest records to the master data set. Or is there a better way altogether?
What I (think I) need to do is read the CSV into R, append only the new records to a master file, and drop the oldest records so the master CSV only ever holds 10 days' worth of data.
I have searched everywhere but cannot seem to crack it.
This seems like it should be so easy; maybe I am using the wrong search terms.
I would like the following, please (I've shown 10 lines of data; I'll need 480 rows for 10 days).
INITIAL DATA
DateTime Data1 Data2 Data3 Data4 Data5
641 2020-08-26T02:31:59.999+10:00 10.00 53.4 3.101 42 20.70
642 2020-08-26T03:01:59.999+10:00 11.11 52.0 2.778 44 20.70
643 2020-08-26T03:31:59.999+10:00 2.63 105.5 2.899 45 20.70
644 2020-08-26T04:01:59.999+10:00 11.11 60.5 2.920 45 20.70
645 2020-08-26T04:31:59.999+10:00 3.03 101.3 2.899 48 20.70
646 2020-08-26T05:01:59.999+10:00 2.86 125.2 2.899 49 20.65
647 2020-08-26T05:31:59.999+10:00 2.86 132.2 2.899 56 20.65
648 2020-08-26T06:01:59.999+10:00 3.23 113.9 2.963 61 20.65
649 2020-08-26T06:31:59.999+10:00 3.45 113.9 3.008 64 20.65
650 2020-08-26T07:01:59.999+10:00 3.57 108.3 3.053 66 20.65
NEW DATA
DateTime Data1 Data2 Data3 Data4 Data5
641 2020-08-26T02:31:59.999+10:00 10.00 53.4 3.101 42 20.70
642 2020-08-26T03:01:59.999+10:00 11.11 52.0 2.778 44 20.70
643 2020-08-26T03:31:59.999+10:00 2.63 105.5 2.899 45 20.70
644 2020-08-26T04:01:59.999+10:00 11.11 60.5 2.920 45 20.70
645 2020-08-26T04:31:59.999+10:00 3.03 101.3 2.899 48 20.70
646 2020-08-26T05:01:59.999+10:00 2.86 125.2 2.899 49 20.65
647 2020-08-26T05:31:59.999+10:00 2.86 132.2 2.899 56 20.65
648 2020-08-26T06:01:59.999+10:00 3.23 113.9 2.963 61 20.65
649 2020-08-26T06:31:59.999+10:00 3.45 113.9 3.008 64 20.65
650 2020-08-26T07:01:59.999+10:00 3.57 108.3 3.053 66 20.65
651 2020-08-26T07:31:59.999+10:00 3.85 109.7 3.125 70 20.65
REQUIRED DATA
DateTime Data1 Data2 Data3 Data4 Data5
642 2020-08-26T03:01:59.999+10:00 11.11 52.0 2.778 44 20.70
643 2020-08-26T03:31:59.999+10:00 2.63 105.5 2.899 45 20.70
644 2020-08-26T04:01:59.999+10:00 11.11 60.5 2.920 45 20.70
645 2020-08-26T04:31:59.999+10:00 3.03 101.3 2.899 48 20.70
646 2020-08-26T05:01:59.999+10:00 2.86 125.2 2.899 49 20.65
647 2020-08-26T05:31:59.999+10:00 2.86 132.2 2.899 56 20.65
648 2020-08-26T06:01:59.999+10:00 3.23 113.9 2.963 61 20.65
649 2020-08-26T06:31:59.999+10:00 3.45 113.9 3.008 64 20.65
650 2020-08-26T07:01:59.999+10:00 3.57 108.3 3.053 66 20.65
651 2020-08-26T07:31:59.999+10:00 3.85 109.7 3.125 70 20.65
This is where I am at...
library(RCurl)
library(readr)
library(ggplot2)
library(data.table)
# Get the date parts we need
Year  <- format(Sys.Date(), format = "%Y")
Month <- format(Sys.Date(), format = "%B")
MM    <- format(Sys.Date(), format = "%m")
# Create the file string and read (credentials go before "@" in the URL)
site <- glue::glue("ftp://user:passwd@99.99.99.99/path/{Year}/{Month}/site{Year}-{MM}.csv")
site <- read.csv(site, header = FALSE)
# Write table and create csv
write.table(site, "EP.csv", col.names = FALSE, row.names = FALSE)
EP <- fread("EP.csv", header = FALSE, select = c(1, 2, 3, 5, 6, 18))
# column names below are placeholders -- swap in the real ones
write.table(EP, file = "output.csv", col.names = c("A", "B", "C", "D", "E", "F"), sep = ",", row.names = FALSE)
# working up to here
# Append to master csv file
master <- read.csv("C:\\path\\master.csv")
You can convert the DateTime column to POSIXct class, combine the new and initial data, and keep only the rows from the last 10 days.
library(dplyr)
library(lubridate)
initial_data <- initial_data %>% mutate(DateTime = ymd_hms(DateTime))
new_data <- new_data %>% mutate(DateTime = ymd_hms(DateTime))
combined_data <- bind_rows(new_data, initial_data)
ten_days_data <- combined_data %>%
  filter(between(as.Date(DateTime), Sys.Date() - 10, Sys.Date()))
I'll try to answer this by combining the help from Ronak.
I am still hopeful that a better solution can be found where I can simply append the new data to the old data.
There were multiple parts to my question and Ronak provided a solution for the last 10 days problem:
ten_days_data <- combined_data %>%
  filter(between(as.Date(DateTime), Sys.Date() - 10, Sys.Date()))
The second part, about combining the data, I found in another post: How to rbind new rows from one data frame to an existing data frame in R
library(data.table)
combined_data <- unique(rbindlist(list(initial_data, new_data)), by = "DateTime")
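Putting the two pieces together, the half-hourly update step could look something like the sketch below. The file names (EP.csv for the freshly downloaded data, EP_master.csv for the rolling 10-day file) are placeholders, and it assumes a master file with the same header (including a DateTime column) already exists; on the very first run you could simply copy the downloaded file.
library(data.table)
library(lubridate)
# Read both files with DateTime kept as plain text so the logger's own
# timestamp format is preserved (and deduplication stays a simple string match)
master   <- fread("EP_master.csv", colClasses = list(character = "DateTime"))
new_data <- fread("EP.csv",        colClasses = list(character = "DateTime"))
# Stack them and drop rows whose DateTime is already in the master
combined <- unique(rbindlist(list(master, new_data)), by = "DateTime")
# Keep only the last 10 days (parse a temporary copy of DateTime for the test)
combined <- combined[as.Date(ymd_hms(DateTime)) >= Sys.Date() - 10]
# Overwrite the master; it should now hold at most ~480 rows
fwrite(combined, "EP_master.csv")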
I am trying to read in Census regional data from the text file at https://www2.census.gov/geo/docs/reference/ua/ua_list_all.txt, but read.delim is not separating the columns. I've tried setting sep to the default, " ", and "\t", but the results are either errors or everything crammed into a single column.
Here is the code I'm using:
read.delim("https://www2.census.gov/geo/docs/reference/ua/ua_list_all.txt", sep = "")
This is the error I receive:
Error in read.table(file = file, header = header, sep = sep, quote =
quote, : duplicate 'row.names' are not allowed
This is the output when I set sep = "\t":
UACE......NAME...................................................................POP............HU...........AREALAND..AREALANDSQMI..........AREAWATER.AREAWATERSQMI........POPDEN..LSADC
<chr>
00037 Abbeville, LA 19824 8460 29222871 11.28 300497 0.12 1757.0 76
00064 Abbeville, SC 5243 2578 11315197 4.37 19786 0.01 1200.1 76
00091 Abbotsford, WI 3966 1616 5363441 2.07 13221 0.01 1915.2 76
00118 Aberdeen, MS 4666 2050 7416616 2.86 52732 0.02 1629.4 76
00145 Aberdeen, SD 25977 12114 33002447 12.74 247597 0.10 2038.6 76
00172 Aberdeen, WA 29856 13139 39997951 15.44 1929689 0.75 1933.3 76
00199 Aberdeen--Bel Air South--Bel Air North, MD 213751 83721 339626464 131.13 9825290 3.79 1630.1 75
00226 Abernathy, TX 2785 1124 3051109 1.18 12572 0.00 2364.1 76
00253 Abilene, KS 7054 3238 8773263 3.39 1877 0.00 2082.4 76
00280 Abilene, TX 110421 46732 141756054 54.73 988193 0.38 2017.5 75
00334 Abingdon, IL 3389 1483 3731303 1.44 0 0.00 2352.4 76
00388 Ada, OH 5945 1906 4769036 1.84 0 0.00 3228.6 76
00415 Ada, OK 17400 8086 30913906 11.94 89140 0.03 1457.8 76
00450 Adams, NY 2542 1100 5107296 1.97 13914 0.01 1289.1 76
00469 Adel, GA 6986 2990 15634050 6.04 204861 0.08 1157.3 76
00496 Adel, IA 3170 1317 4624127 1.79 0 0.00 1775.5 76
...
1-16 of 3,601 rows
I'd propose a different solution, since the .txt file seems to use an odd delimiter: how about downloading the .xls file in the same folder and using that?
The link to 'ua_list_all.xls' is here: https://www2.census.gov/geo/docs/reference/ua/
See code below:
library(readxl)
test <- readxl::read_excel(path = 'ua_list_all.xls')
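For that to work the workbook needs to be on disk first, since read_excel() does not read from URLs. A small sketch, with the full URL assembled from the folder link above:
library(readxl)
# fetch the Excel version of the same table, then read it locally
url  <- "https://www2.census.gov/geo/docs/reference/ua/ua_list_all.xls"
dest <- tempfile(fileext = ".xls")
download.file(url, destfile = dest, mode = "wb")  # "wb" keeps the binary file intact on Windows
ua <- readxl::read_excel(path = dest)
head(ua)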
I am trying to automate the process of separating this data into datasets based on its trials. The original data is a list of values taken at several sites, and I want to break it into individual sets without having to use notation like c(1:312), because the number of rows in each trial varies. Each trial starts with a header, like d9, and ends with a blank row before the next header. How can I separate the data into new data frames using the headers/empty rows?
This is for analyzing water data: Depth, Temperature, DO, and Salinity. The end goal is to create a graph of each trial showing the differences in Temperature across the trials.
Data Set (starting at row 1299)
1299 NA
1300 d4
1301 0.00
1302 0.18
1303 0.20
1304 0.31
1305 0.49
1306 0.76
1307 1.12
1308 1.51
1309 1.82
1310 1.92
1311 2.08
1312 2.35
1313 2.41
1314 2.48
1315 2.68
1316 2.97
1317 3.22
1318 3.33
1319 3.40
1320 3.55
1321 3.81
1322 4.05
1323 4.30
1324 4.41
1325 4.46
1326 4.56
1327 4.61
1328 4.62
1329 4.55
1330 4.54
1331 4.56
1332 4.49
1333 4.38
1334 4.38
1335 4.55
1336 4.71
1337 4.91
1338 5.14
1339 5.22
1340 5.26
1341 NA
1342 d11
1343 0.00
1344 0.22
1345 0.22
1346 0.27
D9 <- Data[3:314,]
D12 <- Data[317:517,]
D3 <- Data[520:703,]
D15 <- Data[706:795,]
D14 <- Data[798:853,]
D2 <- Data[856:939,]
D13 <- Data[942:975,]
D1 <- Data[978:1029,]
D6 <- Data[1032:1113,]
D5 <- Data[1116:1171,]
D7 <- Data[1174:1230,]
D8 <- Data[1233:1298,]
D4 <- Data[1301:1340,]
D11 <- Data[1343:1392,]
D10 <- Data[1395:1493,]
We can create a list using split along with grepl and cumsum
lst <- lapply(split.data.frame(x = df, cumsum(grepl('d\\d+', df$V2))),
              function(x) {
                # use the header row (e.g. "d4") as the name of the value column...
                names(x)[2] <- as.character(x[1, 'V2'])
                # ...then drop the header row itself
                x <- x[-1, ]
              })
data
df <- structure(list(V1 = 1299:1346, V2 = c(NA, "d4", "0.00", "0.18",
"0.20", "0.31", "0.49", "0.76", "1.12", "1.51", "1.82", "1.92",
"2.08", "2.35", "2.41", "2.48", "2.68", "2.97", "3.22", "3.33",
"3.40", "3.55", "3.81", "4.05", "4.30", "4.41", "4.46", "4.56",
"4.61", "4.62", "4.55", "4.54", "4.56", "4.49", "4.38", "4.38",
"4.55", "4.71", "4.91", "5.14", "5.22", "5.26", NA, "d11", "0.00",
"0.22", "0.22", "0.27")), class = "data.frame", row.names = c(NA, -48L))
Note: it's advisable to keep your data frames in a list instead of assigning them into the global environment; see here.
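For instance, a quick check of what the split produces with the example df above (a sketch; the list elements are named by the cumulative group counter, so "1" holds trial d4 and "2" holds trial d11):
str(lst, max.level = 1)   # one data frame per trial
head(lst[["1"]])          # its second column is now named "d4"
# the measurements come through as character, so convert before plotting
d4 <- lst[["1"]]
d4$d4 <- as.numeric(d4$d4)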
I'm trying to use complete.cases to clear out the NAs from a file.
I've been using help from this site but it isn't working and I'm no longer sure if what I'm trying to do is possible.
juulDataRaw <- read.csv(url("http://blah"));
juulDataRaw[complete.cases(juulDataRaw),]
I tried this (one of the examples from here)
dog <- structure(list(Sample = 1:6
        ,gene = c("ENSG00000208234","ENSG00000199674","ENSG00000221622","ENSG00000207604","ENSG00000207431","ENSG00000221312")
        ,hsap = c(0,0,0,0,0,0)
        ,mmul = c(NA,2,NA,NA,NA,1)
        ,mmus = c(NA,2,NA,NA,NA,2)
        ,rnor = c(NA,2,NA,1,NA,3)
        ,cfam = c(NA,2,NA,2,NA,2))
        ,.Names = c("Sample", "gene", "hsap", "mmul", "mmus", "rnor", "cfam"), class = "data.frame", row.names = c(NA, -6L))
dog[complete.cases(dog),]
and that works.
So can mine be done?
What is the difference between the two?
Aren't they both just data frames?
You have quotes around the numeric values so they are read in as factors. That makes the "NA" just another string rather than an R NA.
> juulDataRaw[] <- lapply(juulDataRaw, as.character)
> juulDataRaw[] <- lapply(juulDataRaw, as.numeric)
Warning messages:
1: In lapply(juulDataRaw, as.numeric) : NAs introduced by coercion
2: In lapply(juulDataRaw, as.numeric) : NAs introduced by coercion
3: In lapply(juulDataRaw, as.numeric) : NAs introduced by coercion
> juulDataRaw[complete.cases(juulDataRaw),]
age height igf1 weight
55 6.00 111.6 98 19.1
57 6.08 116.7 242 21.7
61 6.26 120.3 196 24.7
66 6.40 115.5 179 19.6
69 6.42 115.6 126 20.6
71 6.43 116.1 142 20.2
80 6.61 130.3 236 28.0
81 6.63 122.2 148 21.6
83 6.70 126.2 174 26.1
84 6.72 125.6 136 22.6
85 6.72 121.0 164 24.4
snipped remaining output.....
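Alternatively, the coercion can be avoided when the file is read in the first place; a sketch, reusing the placeholder URL from the question:
# treat "NA" (and empty fields) as missing and keep strings as strings,
# so the numeric columns arrive as numeric straight away
juulDataRaw <- read.csv(url("http://blah"),
                        na.strings = c("NA", ""),
                        stringsAsFactors = FALSE)
juulDataRaw[complete.cases(juulDataRaw), ]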
I have a data frame like the one you see here.
DRSi TP DOC DN date Turbidity Anions
158 5.9 3371 264 14/8/06 5.83 2246.02
217 4.7 2060 428 16/8/06 6.04 1632.29
181 10.6 1828 219 16/8/06 6.11 1005.00
397 5.3 1027 439 16/8/06 5.74 314.19
2204 81.2 11770 1827 15/8/06 9.64 2635.39
307 2.9 1954 589 15/8/06 6.12 2762.02
136 7.1 2712 157 14/8/06 5.83 2049.86
1502 15.3 4123 959 15/8/06 6.48 2648.12
1113 1.5 819 195 17/8/06 5.83 804.42
329 4.1 2264 434 16/8/06 6.19 2214.89
193 3.5 5691 251 17/8/06 5.64 1299.25
1152 3.5 2865 1075 15/8/06 5.66 2573.78
357 4.1 5664 509 16/8/06 6.06 1982.08
513 7.1 2485 586 15/8/06 6.24 2608.35
1645 6.5 4878 208 17/8/06 5.96 969.32
Before I got here I used the following code to remove the columns that had no values at all or contained NAs.
rem <- NULL
for (col.nr in 1:dim(E.3)[2]) {
  if (sum(is.na(E.3[, col.nr])) > 0 | all(is.na(E.3[, col.nr]))) {
    rem <- c(rem, col.nr)
  }
}
E.4 <- E.3[, -rem]
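(The same column filter can be written more compactly; a sketch:)
# drop every column that contains at least one NA
E.4 <- E.3[, colSums(is.na(E.3)) == 0]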
Now I need to remove the "date" column, not based on its column name but based on the fact that it is a character column.
I've already seen here (Remove an entire column from a data.frame in R) how to simply set it to NULL, among other options, but I want to select it by type instead.
First use is.character to find all columns of class character. However, make sure your date column really is a character column and not a Date or a factor; otherwise use is.Date or is.factor instead of is.character.
Then just subset the columns that are not character in the data.frame, e.g.
df[, !sapply(df, is.character)]
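For example, with the data frame from the question (a sketch, assuming date really was read in as character; E.5 is just a placeholder name for the result):
# keep only the non-character columns, which drops "date" here
E.5 <- E.4[, !sapply(E.4, is.character)]
str(E.5)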
I was having a similar problem, but the answer above didn't resolve it for Date columns (which is what I needed), so I found another solution:
df[, -grep("Date|factor|character", sapply(df, class))]
This returns your df without the Date, character and factor columns.
I have a data that looks like this.
Name|ID|p72|p78|p51|p49|c36.1|c32.1|c32.2|c36.2|c37
hsa-let-7a-5p|MIMAT0000062|9.1|38|12.7|185|8|4.53333333333333|17.9|23|63.3
hsa-let-7b-5p|MIMAT0000063|11.3|58.6|27.5|165.6|20.4|8.5|21|30.2|92.6
hsa-let-7c|MIMAT0000064|7.8|40.2|9.6|147.8|11.8|4.53333333333333|15.4|17.7|62.3
hsa-let-7d-5p|MIMAT0000065|4.53333333333333|27.7|13.4|158.1|8.5|4.53333333333333|14.2|13.5|50.5
hsa-let-7e-5p|MIMAT0000066|6.2|4.53333333333333|4.53333333333333|28|4.53333333333333|4.53333333333333|5.6|4.7|12.8
hsa-let-7f-5p|MIMAT0000067|4.53333333333333|4.53333333333333|4.53333333333333|78.2|4.53333333333333|4.53333333333333|6.8|4.53333333333333|8.9
hsa-miR-15a-5p|MIMAT0000068|4.53333333333333|70.3|10.3|147.6|4.53333333333333|4.53333333333333|21.1|30.2|100.8
hsa-miR-16-5p|MIMAT0000069|9.5|562.6|60.5|757|25.1|4.53333333333333|89.4|142.9|613.9
hsa-miR-17-5p|MIMAT0000070|10.5|71.6|27.4|335.1|6.3|10.1|51|51|187.1
hsa-miR-17-3p|MIMAT0000071|4.53333333333333|4.53333333333333|4.53333333333333|17.2|4.53333333333333|4.53333333333333|9.5|4.53333333333333|7.3
hsa-miR-18a-5p|MIMAT0000072|4.53333333333333|14.6|4.53333333333333|53.4|4.53333333333333|4.53333333333333|9.5|25.5|29.7
hsa-miR-19a-3p|MIMAT0000073|4.53333333333333|11.6|4.53333333333333|42.8|4.53333333333333|4.53333333333333|4.53333333333333|5.5|17.9
hsa-miR-19b-3p|MIMAT0000074|8.3|93.3|15.8|248.3|4.53333333333333|6.3|44.7|53.2|135
hsa-miR-20a-5p|MIMAT0000075|4.53333333333333|75.2|23.4|255.7|6.6|4.53333333333333|43.8|38|130.3
hsa-miR-21-5p|MIMAT0000076|6.2|19.7|18|299.5|6.8|4.53333333333333|49.9|68.5|48
hsa-miR-22-3p|MIMAT0000077|40.4|128.4|65.4|547.1|56.5|33.4|104.9|84.1|248.3
hsa-miR-23a-3p|MIMAT0000078|58.3|99.3|58.6|617.9|36.6|21.4|107.1|125.5|120.9
hsa-miR-24-1-5p|MIMAT0000079|4.53333333333333|4.53333333333333|4.53333333333333|9.2|4.53333333333333|4.53333333333333|4.53333333333333|4.9|4.53333333333333
hsa-miR-24-3p|MIMAT0000080|638.2|286.9|379.5|394.4|307.8|240.4|186|234.2|564
What I want to do is simply pick the rows where all the values are greater than 10.
But why does this code of mine only report the last one?
The data clearly shows that there are more rows that satisfy this condition.
> dat<-read.delim("http://dpaste.com/1215552/plain/",sep="|",na.strings="",header=TRUE,blank.lines.skip=TRUE,fill=FALSE)
Here is the filtering code and its output:
> dat[apply(dat[, -1], MARGIN = 1, function(x) all(x > 10)), ]
Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186 234.2 564
What is the right way to do it?
Update:
alexwhan's solution works, but I wonder how I can generalise his approach
so that it can handle data with missing values (NA):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
Since you're including your ID column (which is a factor) in the all(), apply() coerces each row to a character vector and the comparison is no longer numeric. Try:
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10)), ]
# Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
EDIT
For the case where you have NA, you can just use the na.rm argument of all(). Using your new data (from the comment):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10, na.rm = T)), ]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 7 hsa-miR-15a-5p MIMAT0000068 NA 70.3 10.3 147.6 NA NA 21.1 30.2 100.8
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
# 20 hsa-miR-25-3p MIMAT0000081 19.3 78.6 25.6 84.3 14.9 16.9 19.1 27.2 113.8
# 21 hsa-miR-26a-5p MIMAT0000082 NA 22.8 31.0 561.2 12.4 NA 67.0 55.8 48.9
Another idea is to transform your data to long (molten) format. I think this also makes it easier to avoid the missing-values problem:
library(reshape2)
dat.m <- melt(dat,id.vars=c('Name','ID'))
dat.m$value <- as.numeric(dat.m$value)
library(plyr)
res <- ddply(dat.m,.(Name,ID), summarise, keepme = all(value > 10))
res[res$keepme,]
# Name ID keepme
# 16 hsa-miR-22-3p MIMAT0000077 TRUE
# 17 hsa-miR-23a-3p MIMAT0000078 TRUE
# 19 hsa-miR-24-3p MIMAT0000080 TRUE
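To pull the matching rows back out of the original wide data frame, a small follow-up sketch:
keep <- res$Name[which(res$keepme)]   # which() also drops any NA flags
dat[dat$Name %in% keep, ]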