remove character columns from a numeric data frame - r

I have a data frame like the one you see here.
DRSi TP DOC DN date Turbidity Anions
158 5.9 3371 264 14/8/06 5.83 2246.02
217 4.7 2060 428 16/8/06 6.04 1632.29
181 10.6 1828 219 16/8/06 6.11 1005.00
397 5.3 1027 439 16/8/06 5.74 314.19
2204 81.2 11770 1827 15/8/06 9.64 2635.39
307 2.9 1954 589 15/8/06 6.12 2762.02
136 7.1 2712 157 14/8/06 5.83 2049.86
1502 15.3 4123 959 15/8/06 6.48 2648.12
1113 1.5 819 195 17/8/06 5.83 804.42
329 4.1 2264 434 16/8/06 6.19 2214.89
193 3.5 5691 251 17/8/06 5.64 1299.25
1152 3.5 2865 1075 15/8/06 5.66 2573.78
357 4.1 5664 509 16/8/06 6.06 1982.08
513 7.1 2485 586 15/8/06 6.24 2608.35
1645 6.5 4878 208 17/8/06 5.96 969.32
Before getting to this point, I used the following code to remove the columns that had no values at all or contained some NAs:
rem <- NULL
for (col.nr in 1:ncol(E.3)) {
  # flag the column if it contains any NA, or is entirely NA
  if (sum(is.na(E.3[, col.nr])) > 0 || all(is.na(E.3[, col.nr]))) {
    rem <- c(rem, col.nr)
  }
}
E.4 <- E.3[, -rem]
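(A side note of mine, not part of the original post: the same filter can be written without the loop, since any all-NA column also contains NAs.)

E.4 <- E.3[, colSums(is.na(E.3)) == 0, drop = FALSE]  # keep only columns with no NAs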
Now I need to remove the "date" column, not based on its column name, but based on the fact that it's a character string.
I've already seen here (Remove an entire column from a data.frame in R) how to simply set it to NULL, among other options, but I want to use a different approach.

First use is.character to find all columns with class character. However, make sure that your date column really is a character vector, not a Date or a factor. Otherwise use is.Date (from lubridate) or is.factor instead of is.character.
Then just subset the columns that are not characters in the data.frame, e.g.
df[, !sapply(df, is.character)]
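Applied to a small frame like the one in the question (a minimal sketch; the values are illustrative):

E.3 <- data.frame(DRSi = c(158, 217),
                  date = c("14/8/06", "16/8/06"),
                  Turbidity = c(5.83, 6.04),
                  stringsAsFactors = FALSE)

E.3[, !sapply(E.3, is.character)]  # drops the character "date" column
#   DRSi Turbidity
# 1  158      5.83
# 2  217      6.04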

I was having a similar problem, but the answer above didn't resolve it for Date columns (which is what I needed), so I found another solution:
df[, -grep("Date|factor|character", sapply(df, class))]
This returns df without its Date, factor, and character columns.
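One caveat (my note, not part of the original answer): if no column matches, grep() returns integer(0), and df[, -integer(0)] drops every column. Selecting positively avoids that edge case:

df[, sapply(df, is.numeric), drop = FALSE]  # keep only numeric columns
Filter(is.numeric, df)                      # base R equivalent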

Related

Append new data to existing csv file in R

I am working on a project where I need to graph 10 days worth of data from remote sites.
I am downloading new data every 30 minutes from a remote computer via FTP (data is written every half hour also).
The local (onsite) file path changes every month, so the path I download from is built dynamically from the current date, e.g.
/data/sitename/2020/July/data.csv
/data/sitename/2020/August/data.csv
My problem is that at each new month the csv I am downloading will be in a new folder, and when I FTP the new csv file, it will only contain data from the new month and not the previous months.
I need to graph the last 10 days of data.
So what I'm hoping to do is download the new data every half hour and append only the newest records to the master data set. Or is there a better way altogether?
What I (think I) need to do is download the csv into R, append only the new data to a master file, and remove the oldest records so the csv only ever contains 10 days' worth of data.
I have searched everywhere but cannot seem to crack it.
This seems like it should be so easy, maybe I am using the wrong search terms.
I would like the following, pretty please (I've shown 10 lines of data; I'll need 480 for 10 days).
INITIAL DATA
DateTime Data1 Data2 Data3 Data4 Data5
641 2020-08-26T02:31:59.999+10:00 10.00 53.4 3.101 42 20.70
642 2020-08-26T03:01:59.999+10:00 11.11 52.0 2.778 44 20.70
643 2020-08-26T03:31:59.999+10:00 2.63 105.5 2.899 45 20.70
644 2020-08-26T04:01:59.999+10:00 11.11 60.5 2.920 45 20.70
645 2020-08-26T04:31:59.999+10:00 3.03 101.3 2.899 48 20.70
646 2020-08-26T05:01:59.999+10:00 2.86 125.2 2.899 49 20.65
647 2020-08-26T05:31:59.999+10:00 2.86 132.2 2.899 56 20.65
648 2020-08-26T06:01:59.999+10:00 3.23 113.9 2.963 61 20.65
649 2020-08-26T06:31:59.999+10:00 3.45 113.9 3.008 64 20.65
650 2020-08-26T07:01:59.999+10:00 3.57 108.3 3.053 66 20.65
NEW DATA
DateTime Data1 Data2 Data3 Data4 Data5
641 2020-08-26T02:31:59.999+10:00 10.00 53.4 3.101 42 20.70
642 2020-08-26T03:01:59.999+10:00 11.11 52.0 2.778 44 20.70
643 2020-08-26T03:31:59.999+10:00 2.63 105.5 2.899 45 20.70
644 2020-08-26T04:01:59.999+10:00 11.11 60.5 2.920 45 20.70
645 2020-08-26T04:31:59.999+10:00 3.03 101.3 2.899 48 20.70
646 2020-08-26T05:01:59.999+10:00 2.86 125.2 2.899 49 20.65
647 2020-08-26T05:31:59.999+10:00 2.86 132.2 2.899 56 20.65
648 2020-08-26T06:01:59.999+10:00 3.23 113.9 2.963 61 20.65
649 2020-08-26T06:31:59.999+10:00 3.45 113.9 3.008 64 20.65
650 2020-08-26T07:01:59.999+10:00 3.57 108.3 3.053 66 20.65
651 2020-08-26T07:31:59.999+10:00 3.85 109.7 3.125 70 20.65
REQUIRED DATA
DateTime Data1 Data2 Data3 Data4 Data5
642 2020-08-26T03:01:59.999+10:00 11.11 52.0 2.778 44 20.70
643 2020-08-26T03:31:59.999+10:00 2.63 105.5 2.899 45 20.70
644 2020-08-26T04:01:59.999+10:00 11.11 60.5 2.920 45 20.70
645 2020-08-26T04:31:59.999+10:00 3.03 101.3 2.899 48 20.70
646 2020-08-26T05:01:59.999+10:00 2.86 125.2 2.899 49 20.65
647 2020-08-26T05:31:59.999+10:00 2.86 132.2 2.899 56 20.65
648 2020-08-26T06:01:59.999+10:00 3.23 113.9 2.963 61 20.65
649 2020-08-26T06:31:59.999+10:00 3.45 113.9 3.008 64 20.65
650 2020-08-26T07:01:59.999+10:00 3.57 108.3 3.053 66 20.65
651 2020-08-26T07:31:59.999+10:00 3.85 109.7 3.125 70 20.65
This is where I am at...
library(RCurl)
library(readr)
library(ggplot2)
library(data.table)

# Get the date parts we need
Year  <- format(Sys.Date(), format = "%Y")
Month <- format(Sys.Date(), format = "%B")
MM    <- format(Sys.Date(), format = "%m")

# Create the file string and read
site <- glue::glue("ftp://user:passwd@99.99.99.99/path/{Year}/{Month}/site{Year}-{MM}.csv")
site <- read.csv(site, header = FALSE)

# Write table and create csv, then re-read only the columns needed
write.table(site, "EP.csv", col.names = FALSE, row.names = FALSE)
EP <- fread("EP.csv", header = FALSE, select = c(1, 2, 3, 5, 6, 18))
write.table(EP, file = "output.csv", col.names = c("A", "B", etc), sep = ",", row.names = FALSE)
# working up to here

# Append to master csv file
master <- read.csv("C:\\path\\master.csv")
You can convert the DateTime column to POSIXct class, combine the new and initial data, and keep only the rows from the last 10 days.
library(dplyr)
library(lubridate)

initial_data <- initial_data %>% mutate(DateTime = ymd_hms(DateTime))
new_data <- new_data %>% mutate(DateTime = ymd_hms(DateTime))
combined_data <- bind_rows(new_data, initial_data)

ten_days_data <- combined_data %>%
  filter(between(as.Date(DateTime), Sys.Date() - 10, Sys.Date()))
I'll try to answer this, combining the help from Ronak.
I am still hopeful that a better solution can be found where I can simply append the new data to the old data.
There were multiple parts to my question, and Ronak provided a solution for the last-10-days problem:
ten_days_data <- combined_data %>%
  filter(between(as.Date(DateTime), Sys.Date() - 10, Sys.Date()))
The second part, combining the data, I found in another post: How to rbind new rows from one data frame to an existing data frame in R
library(data.table)
combined_data <- unique(rbindlist(list(initial_data, new_data)), by = "DateTime")
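Putting the two parts together, the half-hourly update step might look like the sketch below. The helper name update_master, the file names, deduplicating on DateTime, and lubridate being able to parse the timestamp format shown above are all my assumptions, not from the answers:

library(data.table)
library(lubridate)

update_master <- function(new_file, master_file = "master.csv") {
  new_data <- fread(new_file)
  new_data[, DateTime := as_datetime(DateTime)]

  combined <- if (file.exists(master_file)) {
    old <- fread(master_file)
    old[, DateTime := as_datetime(DateTime)]
    # keep each half-hourly record once, whichever download it came from
    unique(rbindlist(list(old, new_data)), by = "DateTime")
  } else {
    new_data
  }

  # trim to the trailing 10-day window, then persist
  combined <- combined[as.Date(DateTime) >= Sys.Date() - 10]
  fwrite(combined, master_file)
  invisible(combined)
}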

Read in, transpose, and merge dataframe from MANY two column data frames with first row in common

I have thousands of comma-separated .txt files, each with two columns: the first is named "wavelength" and holds the same wavelength values (the "x" values) in every file, and the second is named after the file itself and holds that file's response values (the observed "y" values).
If I read in a single file with readr, the format looks like this:
# A tibble: 2,151 x 2
Wavelength a1lm_00000.asd.ref.sco.txt ### [filename]
<dbl> <dbl>
1 350 0.0542
2 351 0.0661
3 352 0.0686
4 353 0.0608
5 354 0.0545
6 355 0.0589
7 356 0.0644
8 357 0.0587
9 358 0.0556
10 359 0.0519
...etc.
The end format I need is:
Filename "350" "351" "352" "353" etc.
a1lm_00000.asd.ref.sco.txt 0.0542 0.0661 0.0686 0.0608 etc.
a1lm_00001.asd.ref.sco.txt 0.0567 0.0680 0.0704 0.0627 etc.
...etc.
In other words, I need the first column as the file identifier, and each following column a spectral response with the associated spectral wavelength as the column name.
So, I need to read all these files in from a directory, and either:
a.) Create a third column containing the file name, rename each file's second column to something like "response", apply bind_rows to all files, then use "spread" from the tidyr package.
b.) Transpose each file as soon as it is read, so that the first column's values become the column names and the second column's name becomes a row identifier in a new first column, then row-bind the resulting rows.
Option b. seems preferable. Either option seems to call for lapply and possibly bind_rows or bind_cols, but I'm not sure how best to do so. There is a lot of data, and a few of the methods I've tried have caused my machine to run out of memory, so the more memory-efficient I can make it the better.
I recommend storing all data.frames in a list. Then it becomes a simple matter of merging data.frames, converting data from wide to long, and back to wide with a different key.
library(tidyverse)
reduce(lst, full_join) %>%
  gather(file, value, -Wavelength) %>%
  spread(Wavelength, value)
# file 350 351 352 353 354 355 356
#1 a1lm_00000.asd.ref.sco.txt 0.0542 0.0661 0.0686 0.0608 0.0545 0.0589 0.0644
#2 a1lm_00001.asd.ref.sco.txt 0.0542 0.0661 0.0686 0.0608 0.0545 0.0589 0.0644
# 357 358 359
#1 0.0587 0.0556 0.0519
#2 0.0587 0.0556 0.0519
Two more comments:
To store data.frames in a list, I would do something along the lines of map(file_names, ~read_csv2(.x)) (or in base R, lapply(file_names, function(x) read.csv(x))). Adjust file_names and the read_csv2/read.csv parameters as necessary; a fuller sketch follows these comments.
More generally, I would probably advise against such a format. It seems much easier to keep data in a list of long (and tidy) data.frames.
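As mentioned in the first comment, building lst from a directory of files might look like this (a sketch; the directory path and file pattern are assumptions):

library(tidyverse)

file_names <- list.files("spectra_dir", pattern = "\\.txt$", full.names = TRUE)
lst <- map(file_names, read_csv)  # one two-column data frame per file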
For completeness, the same can be achieved in base R using Reduce+merge to join data, and stack+reshape to convert from wide to long to wide.
df <- Reduce(merge, lst)
reshape(
  cbind(stack(df, select = -Wavelength), Wavelength = df$Wavelength),
  idvar = "ind", timevar = "Wavelength", direction = "wide")
# ind values.350 values.351 values.352 values.353
#1 a1lm_00000.asd.ref.sco.txt 0.0542 0.0661 0.0686 0.0608
#11 a1lm_00001.asd.ref.sco.txt 0.0542 0.0661 0.0686 0.0608
# values.354 values.355 values.356 values.357 values.358 values.359
#1 0.0545 0.0589 0.0644 0.0587 0.0556 0.0519
#11 0.0545 0.0589 0.0644 0.0587 0.0556 0.0519
Sample data
df1 <- read.table(text =
"Wavelength a1lm_00000.asd.ref.sco.txt
1 350 0.0542
2 351 0.0661
3 352 0.0686
4 353 0.0608
5 354 0.0545
6 355 0.0589
7 356 0.0644
8 357 0.0587
9 358 0.0556
10 359 0.0519", header = T)
df2 <- read.table(text =
"Wavelength a1lm_00001.asd.ref.sco.txt
1 350 0.0542
2 351 0.0661
3 352 0.0686
4 353 0.0608
5 354 0.0545
6 355 0.0589
7 356 0.0644
8 357 0.0587
9 358 0.0556
10 359 0.0519", header = T)
lst <- list(df1, df2)

dataframe subsetting using vector

I have a Data Frame
head(readDF1)
Date sulfate nitrate ID
279 2003-10-06 7.21 0.651 1
285 2003-10-12 5.99 0.428 10
291 2003-10-18 4.68 1.040 100
297 2003-10-24 3.47 0.363 200
303 2003-10-30 2.42 0.507 300
315 2003-11-11 1.43 0.474 332
Subsetting with the code below works correctly:
readDF1[readDF1$ID == 331, ]
but if I use
readDF1[readDF1$ID == 1:300, ]
it does not work. I want to subset the data frame to the rows whose ID is between 1 and 300 (assume that ID contains values from 1 to 1000, with repeats).
== is the wrong operator here. You aren't asking "which ID is equal to the sequence 1:300"; == recycles the shorter vector and compares element by element.
You want %in% (i.e. which ID values can be found in 1:300):
readDF1[readDF1$ID %in% 1:300, ]
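To see the difference on toy data (my own sketch):

ID <- c(1, 2, 3, 1, 2, 3)

ID == 1:3    # recycles 1:3 and compares position by position
# [1] TRUE TRUE TRUE TRUE TRUE TRUE    (misleadingly all TRUE)

ID %in% 1:2  # set membership, independent of position
# [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE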

R "complete.cases" Works on one not on the other?

I'm trying to use complete.cases to clear out the NAs from a file.
I've been using help from this site but it isn't working and I'm no longer sure if what I'm trying to do is possible.
juulDataRaw <- read.csv(url("http://blah"));
juulDataRaw[complete.cases(juulDataRaw),]
I tried this (one of the examples from here)
dog <- structure(list(gene = c("ENSG00000208234","ENSG00000199674","ENSG00000221622","ENSG00000207604","ENSG00000207431","ENSG00000221312")
    ,hsap = c(0,0,0,0,0,0)
    ,mmul = c(NA,2,NA,NA,NA,1)
    ,mmus = c(NA,2,NA,NA,NA,2)
    ,rnor = c(NA,2,NA,1,NA,3)
    ,cfam = c(NA,2,NA,2,NA,2))
    ,.Names = c("gene", "hsap", "mmul", "mmus", "rnor", "cfam"), class = "data.frame", row.names = c(NA, -6L))
dog[complete.cases(dog),]
and that works.
So can mine be done?
What is the difference between the two?
Aren't they both just data frames?
You have quotes around the numeric values so they are read in as factors. That makes the "NA" just another string rather than an R NA.
> juulDataRaw[] <- lapply(juulDataRaw, as.character)
> juulDataRaw[] <- lapply(juulDataRaw, as.numeric)
Warning messages:
1: In lapply(juulDataRaw, as.numeric) : NAs introduced by coercion
2: In lapply(juulDataRaw, as.numeric) : NAs introduced by coercion
3: In lapply(juulDataRaw, as.numeric) : NAs introduced by coercion
> juulDataRaw[complete.cases(juulDataRaw),]
age height igf1 weight
55 6.00 111.6 98 19.1
57 6.08 116.7 242 21.7
61 6.26 120.3 196 24.7
66 6.40 115.5 179 19.6
69 6.42 115.6 126 20.6
71 6.43 116.1 142 20.2
80 6.61 130.3 236 28.0
81 6.63 122.2 148 21.6
83 6.70 126.2 174 26.1
84 6.72 125.6 136 22.6
85 6.72 121.0 164 24.4
snipped remaining output.....
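If you want to check that diagnosis on your own file before coercing, a short sketch (my suggestion, compressing the steps above into one pass):

sapply(juulDataRaw, class)  # factor or character columns mean the numbers came in as text

# go through character first: as.numeric() on a factor returns level codes, not values
juulDataRaw[] <- lapply(juulDataRaw, function(col) as.numeric(as.character(col)))
juulDataRaw[complete.cases(juulDataRaw), ]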

K-fold cross-validation using cv.lm()

I am new to R and trying to do K-fold cross-validation using cv.lm().
Refer: http://www.statmethods.net/stats/regression.html
I am getting an error indicating that my variable lengths differ, yet when I verify with length(), the sizes are in fact the same.
Below is a minimal dataset to replicate the problem:
X Y
277 5.20
285 5.17
297 4.96
308 5.26
308 5.11
263 5.27
278 5.20
283 5.16
268 5.17
250 5.20
275 5.18
274 5.09
312 5.03
294 5.21
279 5.29
300 5.14
293 5.09
298 5.16
290 4.99
273 5.23
289 5.32
279 5.21
326 5.14
293 5.22
256 5.15
291 5.09
283 5.09
284 5.07
298 5.27
269 5.19
I used the code below to do the cross-validation:
# K-fold cross-validation, with K=10
sampledata <- read.table("H:/sample.txt", header = TRUE)
y.1 <- sampledata$Y
x.1 <- sampledata$X
fit <- lm(y.1 ~ x.1)
library(DAAG)
cv.lm(df = sampledata, fit, m = 10)
The error on the terminal,
Error in model.frame.default(formula = form, data = df[rows.in, ], drop.unused.levels = TRUE) :
variable lengths differ (found for 'x.1')
Verification,
> length(x.1)
[1] 30
> length(y.1)
[1] 30
The above confirms the lengths are the same.
> str(x.1)
int [1:30] 277 285 297 308 308 263 278 283 268 250 ...
> str(y.1)
num [1:30] 5.2 5.17 4.96 5.26 5.11 5.27 5.2 5.16 5.17 5.2 ...
> is(y.1)
[1] "numeric" "vector"
> is(x.1)
[1] "integer" "numeric" "vector" "data.frameRowLabels"
A further check, as above, shows that one vector is integer and the other numeric. But even after converting the numeric one to integer, or the integer one to numeric, the same error about variable lengths pops up.
Can you guide me on what I should do to correct the error?
I have been stuck on this for two days and have not found any good lead online.
Additional related query:
I see the fit works if I use the column names of the data set in the formula:
fit <- lm(Y ~ X, data = sampledata)
a) What is the difference between the above syntax and
fit1 <- lm(sampledata$Y ~ sampledata$X)
I thought they were the same. In the code below,
# fit1 works
fit1 <- lm(Y ~ X, data = sampledata)
cv.lm(df = sampledata, fit1, m = 10)

# fit2 does not work
fit2 <- lm(sampledata$Y ~ sampledata$X)
cv.lm(df = sampledata, fit2, m = 10)
The problem is with df = sampledata, as the column "sampledata$Y" does not exist there; only Y does. I tried changing the cv.lm call to the below, and it does not work either:
cv.lm(fit2, m = 10)
b) If I want to transform the variables, how do I use them in cv.lm()? For example:
y.1 <- sampledata$Y / sampledata$X
x.1 <- 1 / sampledata$X
# fit4 is the problem
fit4 <- lm(y.1 ~ x.1)
cv.lm(df = sampledata, fit4, m = 10)
Is there a way I could reference y.1 and x.1 instead of the column names Y and X in the function?
Thanks.
I'm not sure exactly why this happens, but I spotted that you do not specify the data argument for lm(), so this was my first guess:
fit <- lm(Y ~ X, data = sampledata)
Since the error is gone, this may be a sufficient answer.
UPD: The reason for the error is that y.1 and x.1 do not exist in sampledata, which is provided as the df argument to cv.lm, so the formula y.1 ~ x.1 makes no sense in the cv.lm environment.
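For part (b) of the question, one workaround (my sketch, not part of the original answer) is to store the transformed variables as columns of the data frame itself, so that both lm() and cv.lm() can find them by name:

library(DAAG)

sampledata$y.1 <- sampledata$Y / sampledata$X  # transformed response
sampledata$x.1 <- 1 / sampledata$X             # transformed predictor

fit4 <- lm(y.1 ~ x.1, data = sampledata)
cv.lm(df = sampledata, fit4, m = 10)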
