I have two variables, date and referencenumber. Both are extracted from a text string with a regular expression, and both have class character.
When I use the cbind.fill function to combine these variables into an already existing dataframe, the values are transformed to the numeric values 1 and 1 instead of "06-07-2016" and "123ABC". I use cbind.fill because sometimes only one of the variables is found, and that variable still has to be placed in the dataframe.
When I run the same code on a computer at school, it doesn't transform the values to numeric. So maybe it has something to do with my settings?
Why is this happening?
library(rowr)
dataframevariablen <- as.data.frame(matrix(nrow = 0, ncol = 2))
colnames(dataframevariablen) <- c("date", "refnr")
rulebased(dfgg$Text[i]) #returns the date and refnr as global variable
dataframevariablen[i,] <- cbind.fill(date,refnr, fill = NULL)
Does this work for you?
x <- c("6jul2016", "2jan1960", "31mar1960", "30jul1960")
date <- as.Date(x, "%d%b%Y")
refnr="123ABC" #returns the date and refnr as global variable
for (i in 1:length(date))
dataframevariablen[i,] <- data.frame(date[i],refnr,stringsAsFactors = F)
dataframevariablen$date=as.Date(dataframevariablen$date,origin="1970-01-01")
dataframevariablen
date refnr
1 2016-07-06 123ABC
2 1960-01-02 123ABC
3 1960-03-31 123ABC
4 1960-07-30 123ABC
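As for why you see 1 and 1: one plausible explanation (not verified on your machine) is that the character values are silently turned into factors somewhere along the cbind.fill/assignment chain, and a factor coerces to its integer level code rather than its text. The stringsAsFactors default also changed to FALSE in R 4.0.0, which could explain why the school computer behaves differently. A minimal sketch of that coercion:
# Sketch: a single-level factor coerces to its integer level code, not its text
as.numeric(factor("123ABC"))      # [1] 1
as.numeric(factor("06-07-2016"))  # [1] 1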
Related
I have df1:
ID Time
1 16:00:00
2 14:30:00
3 9:23:00
4 10:00:00
5 23:59:00
and would like to change the current 'character' column 'Time' into an 'integer' as below:
ID Time
1 1600
2 1430
3 923
4 1000
5 2359
We could replace the :'s, make numeric, divide by 100, and convert to integer like this:
df1$Time = as.integer(as.numeric(gsub(':', '', df1$Time))/100)
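For reference, a quick check of that one-liner on sample values (a hypothetical df1, assuming Time is character):
df1 <- data.frame(Time = c("16:00:00", "14:30:00", "9:23:00"), stringsAsFactors = FALSE)
as.integer(as.numeric(gsub(':', '', df1$Time)) / 100)
# [1] 1600 1430  923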
You want to use as.POSIXct().
Functions to manipulate objects of classes "POSIXlt" and "POSIXct" representing calendar dates and times.
R documentation: as.POSIXct()
So in the case of row 1: as.POSIXct("16:00:00", format = "%H:%M:%S")
Then use as.numeric(), or reformat as in the sketch below, if you need it to truly be an integer.
Creates or coerces objects of type "numeric".
R documentation: as.numeric()
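Putting it together, a sketch of getting an integer such as 1600 back out of the POSIXct value (format() to "HHMM", then coerce):
t <- as.POSIXct("16:00:00", format = "%H:%M:%S")  # today's date at 16:00
as.integer(format(t, "%H%M"))                     # [1] 1600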
df1 <- data.frame(Time = "16:00:00")
df1[, "Time"] <- as.numeric(paste0(substr(df1[, "Time"], 1, 2), substr(df1[, "Time"], 4, 5)))
print(df1)
# Time
# 1 1600
There are many ways to process this, but here's one example:
library(dplyr)
df1 <- mutate(df1, Time = gsub(":", "", Time))  # replace colons with blanks
df1 <- mutate(df1, Time = as.numeric(Time)/100) # coerce to numeric type, divide by 100
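Or, equivalently, in a single mutate() call (a sketch that ends with an integer, as the question asks):
library(dplyr)
df1 <- mutate(df1, Time = as.integer(as.numeric(gsub(":", "", Time)) / 100))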
Trying to use the %in% operator in R to write an equivalent of the SAS code below:
If weather in (2,5) then new_weather=25;
else if weather in (1,3,4,7) then new_weather=14;
else new_weather=weather;
The SAS code produces the variable "new_weather" with values 25, 14, or the original value from "weather".
R code:
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[newcol] = df[col]
df[df[newcol] %in% c(2,5)]= 25
df[df[newcol] %in% c(1,3,4,7)] = 14
return(df)
}
Result: the output values of "col" and "newcol" are the same when passing a data frame through the function "GS". Why is the syntax not picking up the second or further values for the variable "newcol"? I would appreciate your time explaining the reason and a possible fix.
Is this what you are trying to do?
df <- data.frame(A=seq(1:4), B=seq(1:4))
add_and_adjust <- function(df, copy_column, new_column_name) {
df[new_column_name] <- df[copy_column] # make copy of column
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(2,5), 25, df[,new_column_name])
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(1,3,4,7), 14, df[,new_column_name])
return(df)
}
Usage:
add_and_adjust(df, 'B', 'my_new_column')
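For reference, with the example df above this should print something like:
#   A B my_new_column
# 1 1 1            14
# 2 2 2            25
# 3 3 3            14
# 4 4 4            14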
df[newcol] is a data frame (with one column), df[[newcol]] or df[, newcol] is a vector (just the column). You need to use [[ here.
You also need to be assigning the result to df[[newcol]], not to the whole df. And to be perfectly consistent and safe you should probably test the col values, not the newcol values.
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[[newcol]] = df[[col]]
df[[newcol]][df[[col]] %in% c(2,5)] = 25
df[[newcol]][df[[col]] %in% c(1,3,4,7)] = 14
return(df)
}
GS(data.frame(x = 1:7), "x", "new")
# x new
# 1 1 14
# 2 2 25
# 3 3 14
# 4 4 14
# 5 5 25
# 6 6 6
# 7 7 14
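To see the difference between [ and [[ mentioned above in isolation, a small sketch:
d <- data.frame(x = 1:3)
class(d["x"])    # "data.frame" -- a one-column data frame
class(d[["x"]])  # "integer"    -- the bare vector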
@user9231640, before you invest too much time in writing your own function, you may want to explore some of the recode functions that already exist in packages like car and Hmisc.
Depending on how complex your recoding gets, your function will grow longer and longer to check various boundary conditions or to change data types.
Just based upon your example, you can do this in base R, and it will be more self-documenting and transparent at one level:
df <- data.frame(A=seq(1:30), B=seq(1:30))
df$my_new_column <- df$B
df$my_new_column <- ifelse(df$my_new_column %in% c(2,5), 25, df$my_new_column)
df$my_new_column <- ifelse(df$my_new_column %in% c(1,3,4,7), 14, df$my_new_column)
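If you do reach for one of the existing recode helpers mentioned above, one possibility is car::recode (a sketch; it assumes the car package is installed and uses car's recodes string syntax):
library(car)
# values not matched by either rule are left unchanged
df$my_new_column <- recode(df$B, "c(2,5)=25; c(1,3,4,7)=14")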
My hope is to collapse columns 14:18 into one column, "Type". I wanted each entry in this new column (for the matching observation) to hold the name of whichever of the 5 columns is 1 (because only one of them can be true). This is my best attempt at doing this in R (and I am beyond frustrated).
library(caret)
data("cars")
carSubset <- subset(cars)
head(carSubset)
# I want to convert the columns of carSubset with the following vector names
types <- c("convertible","coupe", "hatchback", "sedan", "wagon")
# into 1 column named Type, with the corresponding column name
carSubset$Type <- NULL
carSubset <- apply(carSubset[,types],
2,
function(each_obs){
hit_index <- which(each_obs == 1)
carSubset$Type <- types[hit_index]
})
head(carSubset) # output:
1 2 3 4 5
"sedan" "coupe" "convertible" "convertible" "convertible"
Which is what I wanted... however, I also wanted the rest of my data.frame to come along with it. I just wanted the new column "Type", but I cannot even access it with the following line of code...
head(carSubset$Type) # output: Error in carSubset$Type : $ operator is invalid for atomic vectors
Any help on how to add a new column dynamically while keeping the previously related observations attached to it?
I actually figured it out! Probably not the best way to do it, but hey, it works.
library(caret)
data("cars")
carSubset <- subset(cars)
head(carSubset)
# I want to convert the columns of carSubset with the following vector names
types <- c("convertible","coupe", "hatchback", "sedan", "wagon")
head(carSubset[,types])
carSubset[,types]
# into 1 column named Type, with the corresponding column name
carSubset$Type <- NULL
newSubset <- c()
newSubset <- apply(carSubset[,types],
1,
function(obs){
hit_index <- which(obs == 1)
newSubset <- types[hit_index]
})
newSubset
carSubset$Type <- cbind(Type = newSubset)
head(carSubset[, !(names(carSubset) %in% types)])
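A more compact base R alternative (a sketch; it assumes every row has exactly one of the dummy columns set to 1):
# max.col() gives, for each row, the column index of the maximum (here the single 1)
carSubset$Type <- types[max.col(carSubset[, types])]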
I have several .csv files containing hourly data. Each file represents data from a point in space. The start and end date is different in each file.
The data can be read into R using:
lstf1<- list.files(pattern=".csv")
lst2<- lapply(lstf1,function(x) read.csv(x,header = TRUE,stringsAsFactors=FALSE,sep = ",",fill=TRUE, dec = ".",quote = "\""))
head(lst2[[800]])
datetime precip code
1 2003-12-30 00:00:00 NA M
2 2003-12-30 01:00:00 NA M
3 2003-12-30 02:00:00 NA M
4 2003-12-30 03:00:00 NA M
5 2003-12-30 04:00:00 NA M
6 2003-12-30 05:00:00 NA M
datetime is YYYY-MM-DD HH:MM:SS, precip is the data value, and code can be ignored.
For each dataframe (df) in lst2 I want to select data for the period 2015-04-01 to 2015-11-30 based on the following conditions:
1) If precip in a df contains all NAs within this period, delete it (do not select it).
2) If precip is not all NAs, select it.
The desired output (lst3) contains the sub-setted data for the period 2015-04-01 to 2015-11-30.
All dataframes in lst3 should have equal length, with days and hours without precip denoted as NA.
Then I can write the files in lst3 to my directory using something like:
sapply(names(lst2),function (x) write.csv(lst3[[x]],file = paste0(names(lst2[x]), ".csv"),row.names = FALSE))
The link to a sample file can be found here (~200 KB)
It's a little hard to understand exactly what you are trying to do, but this example (using dplyr, which has nice filter syntax) on the file you provided should get you close:
library(dplyr)
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")
# Get the required date range and delete the NAs
df.sub <- filter(df, !is.na(precip),
datetime >= as.POSIXct("2015-04-01"),
datetime < as.POSIXct("2015-12-01"))
# Check if the subset has any rows left (it will be empty if it was full of NA for precip)
if (nrow(df.sub) > 0) {
df.result <- filter(df, datetime >= as.POSIXct("2015-04-01"),
datetime < as.POSIXct("2015-12-01"))
# Then add df.result to your list of data frames...
} # else, don't add it to your list
I think you are saying that you want to retain NAs in the data frame if there are also valid precip values--you only want to discard if there are NAs for the entire period. If you just want to strip all NAs, then just use the first filter statement and you are done. You obviously don't need to use POSIXct if you've already got your dates encoded correctly another way.
EDIT: w/ function wrapper so you can use lapply:
library(dplyr)
# Get some example data
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")
dfnull <- df
dfnull$precip <- NA
# list of 3 input data frames to test, 2nd one has precip all NA
df.list <- list(df, dfnull, df)
# Function to do the filtering; returns list of data frames to keep or null
filterprecip <- function(d) {
if (nrow(filter(d, !is.na(precip),
                datetime >= as.POSIXct("2015-04-01"),
                datetime < as.POSIXct("2015-12-01"))) > 0) {
return(filter(d, datetime >= as.POSIXct("2015-04-01"), datetime < as.POSIXct("2015-12-01")))
}
}
# Function to remove NULLS in returned list
# (Credit to Hadley Wickham: http://tolstoy.newcastle.edu.au/R/e8/help/09/12/8102.html)
compact <- function(x) Filter(Negate(is.null), x)
# Filter the list
results <- compact(lapply(df.list, filterprecip))
# Check that you got a list of 2 data frames in the right date range
str(results)
Based on what you've written, it sounds like you're just interested in subsetting your list of files if data exists in the precip column for this specific date range.
> valuesExist <- function(df, start="2015-04-01 0:00:00", end="2015-11-30 23:59:59"){
+   sub.df <- df[df$datetime >= start & df$datetime <= end,]
+   if(sum(is.na(sub.df$precip)) == nrow(sub.df)){return(FALSE)}else{return(TRUE)}
+ }
> lst2.bool <- sapply(lst2, valuesExist)
> lst2 <- lst2[lst2.bool]
> lst3 <- lapply(lst2, function(x) {x[x$datetime >= "2015-04-01 0:00:00" & x$datetime <= "2015-11-30 23:59:59",]})
> sapply(names(lst2), function (x) write.csv(lst3[[x]],file = paste0(names(lst2[x]), ".csv"),row.names = FALSE))
If you want to have a dynamic start and end time, pass variables with these values into the valuesExist function and replace the string timestamps in the lst3 assignment with those same variables.
If you wanted to combine the two lapply loops into one, be my guest, but I prefer having a boolean variable when I'm subsetting.
I use the following source() call and get an error:
>source("raw.githubusercontent.com/iembry-USGS/ie2misc/master/R/…)
Error in source("raw.githubusercontent.com/iembry-USGS/ie2misc/master/R/…) : raw.githubusercontent.com/iembry-USGS/ie2misc/master/R/…: unexpected input 1: ï»
Since I have to use this file, what is the error and how can I fix it?
Here is my code (the last line is the relevant command):
library(zoo)
library (xts)
library(data.table)
source("https://raw.githubusercontent.com/iembry-USGS/ie2misc/master/R/na.interp1.R")
Lines <- "D1,Diff
1,20/11/2014 16:00,0.01
2,20/11/2014 17:00,0.02
3,20/11/2014 19:00,0.03
4,21/11/2014 16:00,0.04
5,21/11/2014 17:00,0.06
6,21/11/2014 20:00,0.10"
z <- read.zoo(text = Lines, tz = "", format = "%d/%m/%Y %H:%M", sep = ",")
## Source 1 begins
startdate <- as.character((start(z)))
# set the start date/time as the 1st entry in the time series and make
# this a character vector.
start <- as.POSIXct(startdate)
# transform the character vector to a POSIXct object
enddate <- as.character((end(z)))
# set the end date/time as the last entry in the time series and make
# this a character vector.
end <- as.POSIXct(enddate)
# transform the character vector to a POSIXct object
gridtime <- seq(from = start, by = 3600, to = end)
# create a sequence beginning with the start date/time with a 60 minute
# interval ending at the end date/time
## Source 1 ends
## Source 2 begins
timeframe <- data.frame(rep(NA, length(gridtime)))
# create 1 NA column spaced out by the gridtime to complement the single
# column of z
timelength <- xts(timeframe, order.by = gridtime)
# create a xts time series object using timeframe and gridtime
zDate <- merge(timelength, z)
# merge the z zoo object and the timelength xts object
## Source 2 ends
Lines <- as.data.frame(zDate)
# to data.frame from zoo
Lines[, "D1"] <- rownames(Lines)
# create column named D1
Lines <- setDT(Lines)
# create data.table out of data.frame
setcolorder(Lines, c(3, 2, 1))
# set the column order as the 3rd column followed by the 2nd and 1st
# columns
Lines <- Lines[, 3 := NULL]
# remove the 3rd column
setnames(Lines, 2, "diff")
# change the name of the 2nd column to diff
Lines <- setDF(Lines)
# return to data.frame
rowsinterps1 <- which(is.na(Lines$diff == TRUE))
# index of rows of Lines that have NA (to be interpolated)
xi <- as.numeric(Lines[which(is.na(Lines$diff == TRUE)), 1])
# the Date-Times for diff to be interpolated in numeric format
interps1 <- na.interp1(as.numeric(Lines$Time), Lines$diff, xi = xi, na.rm = FALSE, maxgap = 3)
# the interpolated values where only gap sizes of 3 are filled
The package was updated; that's the reason the code didn't work.
I wish the people who downvoted would reverse their votes; the question was OK.