I have 50 data frames (each with a different name), all with the same 10 columns of climate data. The first 5 columns contain numbers, but their class is "character". The next 4 columns are already in the correct class (numeric), and the last one (named 'wind dir') is of class character, so it needs no change.
I tried two ways to convert the class of those 5 columns in all 50 data frames, but nothing worked.
1st way) Firstly I've created a vector with the names of those 50 data frames and I named it onomata.
Secondly I've created a vector col_numbers2 <- c(1:5) with the number of columns I would like to convert.
Then I wrote the following code:
for(i in onomata){
i[col_numbers2] <- sapply(i[col_numbers2], as.numeric)
}
Checking the class of those first five columns I saw that nothing changed. (No error report after executing the code)
2nd way) Then I tried to use the dplyr package with a for loop and the code is as follows:
for(i in onomata){
i <- i %>%
mutate_at(vars(-`wind_dir`),as.numeric)
}
In this case, I excluded the character column, and I applied the mutate function to the whole data frame, but I received an error message :
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "character"
What do you think I am doing wrong ?
Thank you
Original data table (what I get when I use read.table() for each txt file):
date      Time   Tdry  Humidity  Wind_velocity  Wind_direction  Wind_gust
02/01/15  02:00  2.4   77.0      6.4            WNW             20.9
02/01/15  03:00  2.3   77.0      11.3           NW              30.6
02/01/15  04:00  2.3   77.0      9.7            NW              20.9
02/01/15  05:00  2.3   77.0      11.3           NW              30.6
02/01/15  06:00  2.3   78.0      9.7            NW              19.3
02/01/15  07:00  2.2   79.0      12.9           NNW             35.4
02/01/15  08:00  2.4   79.0      8.0            NW              14.5
02/01/15  09:00  2.6   79.0      8.0            WNW             20.9
Data after I split data in columns 1 and 2 (date, time):
day  month  year  Hour  Minutes  Tdry  Humidity  Wind_velocity  Wind_direction  Wind_gust
02   01     15    02    00       2.4   77.0      6.4            WNW             20.9
02   01     15    03    00       2.3   77.0      11.3           NW              30.6
02   01     15    04    00       2.3   77.0      9.7            NW              20.9
02   01     15    05    00       2.3   77.0      11.3           NW              30.6
02   01     15    06    00       2.3   78.0      9.7            NW              19.3
02   01     15    07    00       2.2   79.0      12.9           NNW             35.4
02   01     15    08    00       2.4   79.0      8.0            NW              14.5
02   01     15    09    00       2.6   79.0      8.0            WNW             20.9
Here are two possible ways. Both rely on getting all your data frames into a list (called df_list in the examples below). To achieve this you could use mget() (e.g. mget(onomata)) or list.files().
Once this is done, you can use lapply (or mapply) to go through all your dataframes.
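As a minimal sketch of the mget() + lapply() pattern (the data frame names station_a and station_b and their columns are made up for illustration): in the question's loop, i is only the name of a data frame, i.e. a character string, which is why nothing was converted. Collecting the objects themselves into a list avoids that.

```r
# Two small stand-ins for the 50 data frames (hypothetical names and columns)
station_a <- data.frame(day = c("02", "03"), Tdry = c("2.4", "2.3"),
                        Wind_direction = c("WNW", "NW"),
                        stringsAsFactors = FALSE)
station_b <- data.frame(day = c("04", "05"), Tdry = c("2.6", "2.2"),
                        Wind_direction = c("NW", "NNW"),
                        stringsAsFactors = FALSE)

onomata <- c("station_a", "station_b")   # names of the data frames
cols_to_convert <- c("day", "Tdry")      # columns stored as character

# mget() looks the names up and returns the actual data frames in a named list
df_list <- mget(onomata)

# Convert the chosen columns in every data frame; return the modified copy
df_list <- lapply(df_list, function(df) {
  df[cols_to_convert] <- lapply(df[cols_to_convert], as.numeric)
  df
})

# Optional: write the converted data frames back under their original names
list2env(df_list, envir = environment())

sapply(df_list$station_a, class)
```

Keeping the data frames in df_list is usually more convenient than writing them back, since every later step can reuse lapply().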
Solution 1
To transform your data, I suggest you first convert it to POSIXct format and then extract the relevant elements to build the columns you want.
# create a custom function that transforms each dataframe the way you want
fun_split_datehour <- function(df){
df[, "datetime"] <- as.POSIXct(paste(df$date, df$hour), format = "%d/%m/%Y %H:%M") # create a POSIXct column with info on date and time
# Extract elements you need from the date & time column and store them in new columns
df[,"year"] <- as.numeric(format(df[, "datetime"], format = "%Y"))
df[,"month"] <- as.numeric(format(df[, "datetime"], format = "%m"))
df[,"day"] <- as.numeric(format(df[, "datetime"], format = "%d"))
df[,"hour"] <- as.numeric(format(df[, "datetime"], format = "%H"))
df[,"min"] <- as.numeric(format(df[, "datetime"], format = "%M"))
return(df)
}
# use this function on each dataframe of your list
lapply(df_list, FUN = fun_split_datehour)
Adapted from Split date data (m/d/y) into 3 separate columns (this answer)
Data:
# two dummy data frames; the date and hour format does not matter, because you can tell as.POSIXct what to expect using the format argument (see ?as.POSIXct)
df1 <- data.frame(date = c("02/01/2010", "03/02/2010", "10/09/2010"),
hour = c("05:32", "08:20", "15:33"))
df2 <- data.frame(date = c("02/01/2010", "03/02/2010", "10/09/2010"),
hour = c("05:32", "08:20", "15:33"))
# you can replace c("df1", "df2") with onomata: df_list <- mget(onomata)
df_list <- mget(c("df1", "df2"))
Outputs:
> lapply(df_list, FUN = fun_split_datehour)
$df1
date hour datetime year month day min
1 2010-01-02 5 2010-01-02 05:32:00 2010 1 2 32
2 2010-02-03 8 2010-02-03 08:20:00 2010 2 3 20
3 2010-09-10 15 2010-09-10 15:33:00 2010 9 10 33
$df2
date hour datetime year month day min
1 2010-01-02 5 2010-01-02 05:32:00 2010 1 2 32
2 2010-02-03 8 2010-02-03 08:20:00 2010 2 3 20
3 2010-09-10 15 2010-09-10 15:33:00 2010 9 10 33
And columns year, month, day, hour and min are numeric. You can check using str(lapply(df_list, FUN = fun_split_datehour)).
Note: looking at the question you asked before this one, you might find https://stackoverflow.com/a/24376207/10264278 useful. In addition, using POSIXct format will save you time if you want to make plots, arrange, etc.
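The dates in the question use a two-digit year ("02/01/15") and the time column is called Time, so the format string needs %y rather than %Y. A small sketch under those assumptions (the two rows are copied from the question's table; tz = "UTC" is my own choice to keep the result independent of the local time zone):

```r
# Example rows in the question's format (dd/mm/yy and HH:MM)
df <- data.frame(date = c("02/01/15", "02/01/15"),
                 Time = c("02:00", "03:00"),
                 stringsAsFactors = FALSE)

# %y reads a two-digit year; %Y would expect four digits
dt <- as.POSIXct(paste(df$date, df$Time),
                 format = "%d/%m/%y %H:%M", tz = "UTC")

# Extract the pieces as numeric columns
df$year    <- as.numeric(format(dt, "%Y"))
df$month   <- as.numeric(format(dt, "%m"))
df$day     <- as.numeric(format(dt, "%d"))
df$Hour    <- as.numeric(format(dt, "%H"))
df$Minutes <- as.numeric(format(dt, "%M"))

str(df)
```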
Solution 2
If you do not want to use POSIXct, you could do:
# Dummy data changed to match your situation, with the date already split
dfa <- data.frame(day = c("02", "03", "10"),
hour = c("05", "08", "15"))
dfb <- data.frame(day = c("02", "03", "10"),
hour = c("05", "08", "15"))
df_list <- mget(c("dfa", "dfb"))
# Same thing, use lapply() to go through each dataframe of the list and apply() to use as.numeric on the wanted columns
lapply(df_list, FUN = function(df){as.data.frame(apply(df[1:2], 2, as.numeric))}) # change df[1:2] to select columns you want to convert in your actual dataframes
Maybe the following code can help.
First, get the filenames with list.files. Second, read them all in with lapply. If read.table is not the appropriate function, read help("read.table"), it is the same page as for read.csv, read.csv2, etc. Then, coerce the first 5 columns of all data.frames to numeric in one go.
filenames <- list.files(path = "your_directory", pattern = "\\.txt$", full.names = TRUE) # full.names = TRUE so read.table gets the full path
onomata <- lapply(filenames, read.table)
onomata <- lapply(onomata, function(X){
X[1:5] <- lapply(X[1:5], as.numeric)
X
})
I have this large xts, aggregated monthly with apply.monthly function.
2011-07-31 269.8
2011-08-31 251.0
2011-09-30 201.8
2011-10-31 95.8
2011-11-30 NA
2011-12-31 49.3
2012-01-31 77.1
...
What I want is to calculate the average of Jan-Dec months for all the period. Something like this, but in xts form:
01 541.8
02 23.0
03 34.8
04 12.8
05 21.8
06 44.8
07 22.8
08 55.0
09 287.8
10 15.8
11 113
12 419.3
I want to avoid using dplyr functions like group_by. I think there must be a solution using split and lapply / do.call
I tried splitting the xts by year
xtsobject <- split(xtsobject, f = "years")
and then I don't know how to use the lapply function properly to calculate the 12 averages (Jan-Dec) over the whole period.
This question
Group by period.apply() in xts
is similar, but in my xts I don't have (or want) a new column; I think it can be done using the xts index.
Assuming the input data x, shown reproducibly in the Note at the end, use aggregate.zoo like this:
ag <- aggregate(x, cycle(as.yearmon(time(x))), mean, na.rm = TRUE)
ag
giving the following zoo series:
1 77.1
7 269.8
8 251.0
9 201.8
10 95.8
11 NaN
12 49.3
We could plot it like this:
plot(ag, type = "h")
Note
Lines <- "2011-07-31 269.8
2011-08-31 251.0
2011-09-30 201.8
2011-10-31 95.8
2011-11-30 NA
2011-12-31 49.3
2012-01-31 77.1"
library(xts)
z <- read.zoo(text = Lines)
x <- as.xts(z)
You can use the base::months function to extract the month before calculating the mean:
do.call(rbind, lapply(split(x, base::months(index(x))), mean, na.rm=TRUE))
output:
[,1]
April 165.1600
August 290.2444
December 106.8200
February 82.6300
January 62.9100
July 264.9889
June 246.4889
March 100.5500
May 246.3333
November 116.6400
October 151.3667
September 158.5667
It seems the index is a number and not a POSIXct object. You can convert it and use format to extract months and use it in tapply :
tapply(xtsobject[, 1], format(as.POSIXct(zoo::index(xtsobject),
origin = '1970-01-01'), '%m'), mean, na.rm = TRUE)
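The same grouping idea can be sketched in base R alone, with month numbers ("01" to "12") as labels, as the question asked. The dates and values below are copied from the question's sample; for a real xts object you would presumably pass zoo::index(x) and zoo::coredata(x) in their place.

```r
# Month-end dates and values copied from the sample in the question
dates  <- as.Date(c("2011-07-31", "2011-08-31", "2011-09-30", "2011-10-31",
                    "2011-11-30", "2011-12-31", "2012-01-31"))
values <- c(269.8, 251.0, 201.8, 95.8, NA, 49.3, 77.1)

# Group by two-digit month and average, ignoring NAs.
# A month whose values are all NA comes out as NaN.
monthly_means <- tapply(values, format(dates, "%m"), mean, na.rm = TRUE)
monthly_means
```

Because the labels are zero-padded strings, the result is already sorted from January to December, unlike month names, which sort alphabetically.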
I have a large dataframe in R and I want to plot the change in temperature over time. I've tried this before but since there is so much data the graph is really noisy and impossible to read.
I experimented with other plot types to try and get around this but they didn't really work. So I decided instead I will plot the mean temperature for each hour.
I've uploaded the data from a csv file and there are about 56k rows, an hour is about 720 rows give or take.
> head(wormData)
Time Date Day.of.Week Humidity.1 Temp.1 Vapor.Density.1 Base.Temp.1
1 0:18:44 1/7/2016 Friday 69.7 26.4 17.43 85.00
2 0:18:49 1/7/2016 Friday 69.7 26.4 17.43 27.44
3 0:18:54 1/7/2016 Friday 69.6 26.4 17.40 27.44
4 0:18:59 1/7/2016 Friday 69.6 26.4 17.40 27.44
5 0:19:05 1/7/2016 Friday 69.5 26.4 17.38 27.44
6 0:19:10 1/7/2016 Friday 69.5 26.4 17.38 27.44
The column I am interested in is Temp.1 so what I want to do is take the mean of every 720 values in the Temp.1 column, then put each of those mean values into a new dataframe so I can plot a cleaner graph.
I thought of just doing it by hand but that would be about 50 data points and I have many more csv files to do, so any help on how I could do this would be appreciated. I've tried subsetting the data or making vectors with the mean values as well as writing some loops, but I'm struggling to tell R that I want the mean of every 720 rows.
Thanks so much :)
A basic solution using a matrix:
set.seed(123)
x<-sample(1:10,(720*5),replace=TRUE) # generate dummy data
> str(x)
int [1:3600] 3 8 5 9 10 1 6 9 6 5 ...
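# Use wormData$Temp.1 instead of x for your actual data
z<-matrix(x,ncol=720,byrow=TRUE) # fill row by row so that each row holds 720 consecutive values
rowMeans(z) # 'loop' over each row to get the mean
This returns one mean per block of 720 consecutive values (5 means for the dummy data above). Note that matrix() fills column by column by default, so byrow=TRUE is needed to keep consecutive values in the same row.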
If your dataset is not a multiple of 720, you'll get a warning and the last point would be false (recycling of the vector to fill the last line).
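If the length is not a multiple of the block size, one way around the recycling problem is to build a block index and average within each block, so the last (shorter) block is averaged over the values it actually has. A minimal sketch with toy data (x and n here are placeholders):

```r
x <- 1:10   # toy data; use wormData$Temp.1 for the real series
n <- 4      # block size; 720 in the question

# Block id for each element: 1,1,1,1,2,2,2,2,3,3
block <- ceiling(seq_along(x) / n)

# Mean of each block; the last block only has two values here
block_means <- tapply(x, block, mean)
block_means
```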
Here is a solution with dplyr, assuming your row number is a multiple of 720. We create a grouping variable and then compute the mean by group.
library(dplyr)
n <- 2 # replace with n <- 720 with your actual data
mutate(d,group = rep(1:(nrow(d)/n), each=n)) %>%
group_by(group) %>%
summarize(mean=mean(Temp.1))
data
d <- read.table(text = " Time Date Day.of.Week Humidity.1 Temp.1 Vapor.Density.1 Base.Temp.1
1 0:18:44 1/7/2016 Friday 69.7 26.4 17.43 85.00
2 0:18:49 1/7/2016 Friday 69.7 26.4 17.43 27.44
3 0:18:54 1/7/2016 Friday 69.6 26.4 17.40 27.44
4 0:18:59 1/7/2016 Friday 69.6 26.4 17.40 27.44
5 0:19:05 1/7/2016 Friday 69.5 26.4 17.38 27.44
6 0:19:10 1/7/2016 Friday 69.5 26.4 17.38 27.44",stringsAsFactor=FALSE,head=TRUE)
Here is a more complete answer using dplyr. This uses the actual dates and times you have so that you aren't approximating 720 values per hour.
library(tidyverse)
worm_data <- tibble(time = c("0:18:44","0:18:49","2:18:54", # tibble() replaces the deprecated data_frame()
"0:18:59","0:19:05","2:19:10"),
date = c("2016-07-01","2016-07-01","2016-07-01",
"2016-07-02", "2016-07-02", "2016-07-02"),
temp_1 = c(25,27,290,30,20,2))
worm_data_test <- worm_data %>%
mutate(
date = paste(date, time),
date = as.POSIXct(date, tz="GMT", format="%Y-%m-%d %H:%M:%S")
) %>%
group_by(
datetime = as.POSIXct(cut(date, breaks='hour')) # creates a new variable
) %>%
summarize(
temp_1 = mean(temp_1, na.rm=T)
) %>%
ungroup()
In this case, you are grouping by the hour, then summarizing over those hours. I chose strange values and modified the dates and times to show that it works.
For more on datetime, I suggest: https://www.stat.berkeley.edu/~s133/dates.html
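For comparison, the same hourly grouping can be sketched in base R without dplyr. This uses the same made-up values as above (with the times zero-padded); the column name hour is my own choice.

```r
worm <- data.frame(
  time   = c("00:18:44", "00:18:49", "02:18:54",
             "00:18:59", "00:19:05", "02:19:10"),
  date   = c("2016-07-01", "2016-07-01", "2016-07-01",
             "2016-07-02", "2016-07-02", "2016-07-02"),
  temp_1 = c(25, 27, 290, 30, 20, 2),
  stringsAsFactors = FALSE
)

# Combine date and time into one POSIXct column
worm$datetime <- as.POSIXct(paste(worm$date, worm$time), tz = "GMT",
                            format = "%Y-%m-%d %H:%M:%S")

# Label each row with the start of its hour, then average within each label
worm$hour <- format(worm$datetime, "%Y-%m-%d %H:00")
hourly <- aggregate(temp_1 ~ hour, data = worm, FUN = mean)
hourly
```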
I have a dataset that has dates and interest rates in the same column. I need to split these two numbers into two separate columns, however when I use the following code:
Split <- str_split(df$Dates, "[ ]", n = 2)
Dates <- unlist(Split)[1]
Rates <- unlist(Split)[2]
It returns only the first "value" of each element, i.e., "1971-04-01" for Dates and "7.43" for Rates. I need it to return all values for the portion of the string split and the same for the second portion of the string split
Below is a portion of the dataset, total rows = 518.
1971-04-01 7.31
1971-05-01 7.43
1971-06-01 7.53
1971-07-01 7.60
1971-08-01 7.70
1971-09-01 7.69
1971-10-01 7.63
1971-11-01 7.55
1971-12-01 7.48
1972-01-01 7.44
Thanks
Could do
Split <- strsplit(as.character(df$Dates), " ", fixed = TRUE)
Dates <- sapply(Split, "[", 1)
Rates <- sapply(Split, "[", 2)
You can use reshape2::colsplit
library(reshape2)
colsplit(df$Dates, ' ', names = c('Dates','Rates'))
# Dates Rates
# 1 1971-04-01 7.31
# 2 1971-05-01 7.43
# 3 1971-06-01 7.53
# 4 1971-07-01 7.60
# 5 1971-08-01 7.70
# 6 1971-09-01 7.69
# 7 1971-10-01 7.63
# 8 1971-11-01 7.55
# 9 1971-12-01 7.48
# 10 1972-01-01 7.44
Perhaps I'm biased, but I would suggest my cSplit function for this problem.
First, I'm assuming we are starting with the following (single column) data.frame (where there are multiple spaces between the "date" value and the "rate" value).
df <- data.frame(
Date = c("1971-04-01 7.31", "1971-05-01 7.43", "1971-06-01 7.53",
"1971-07-01 7.60", "1971-08-01 7.70", "1971-09-01 7.69",
"1971-10-01 7.63", "1971-11-01 7.55", "1971-12-01 7.48",
"1972-01-01 7.44"))
Next, get the cSplit function from my GitHub Gist, and use it. You can split on a regular expression (here, multiple spaces).
cSplit(df, "Date", "\\s+", fixed = FALSE)
# Date_1 Date_2
# 1: 1971-04-01 7.31
# 2: 1971-05-01 7.43
# 3: 1971-06-01 7.53
# 4: 1971-07-01 7.60
# 5: 1971-08-01 7.70
# 6: 1971-09-01 7.69
# 7: 1971-10-01 7.63
# 8: 1971-11-01 7.55
# 9: 1971-12-01 7.48
# 10: 1972-01-01 7.44
Since the function converts a data.frame to a data.table, you have access to setnames which would let you rename your columns in place.
setnames(cSplit(df, "Date", "\\s+", fixed = FALSE), c("Dates", "Rates"))[]
# Dates Rates
# 1: 1971-04-01 7.31
# 2: 1971-05-01 7.43
# 3: 1971-06-01 7.53
# 4: 1971-07-01 7.60
# 5: 1971-08-01 7.70
# 6: 1971-09-01 7.69
# 7: 1971-10-01 7.63
# 8: 1971-11-01 7.55
# 9: 1971-12-01 7.48
# 10: 1972-01-01 7.44
Using @user2583119's data (please post minimal reproducible code including a data set):
library(qdap)
colsplit2df(data.frame(Split), sep = " ")
## X1 X2
## 1 1971-06-01 7.53
## 2 1971-05-01 7.43
## 3 1971-06-01 7.53
Also:
Split <- c("1971-06-01 7.53", "1971-05-01 7.43", "1971-06-01 7.53")
Your code selects only the first observation.
Str <- unlist(str_split(Split, "[ ]", n=2))
Str[1]
#[1] "1971-06-01"
If you look at the output of unlist(..), dates are followed by values. So, you can use a logical index.
Str[c(T,F)]
#[1] "1971-06-01" "1971-05-01" "1971-06-01"
as.numeric(Str[c(F,T)])
#[1] 7.53 7.43 7.53
You can convert to two columns of a dataframe from Split by using read.table
read.table(text=Split, header=F, sep="",stringsAsFactors=F)
# V1 V2
# 1 1971-06-01 7.53
# 2 1971-05-01 7.43
# 3 1971-06-01 7.53
df <- data.frame(
Date = c("1971-04-01 7.31", "1971-05-01 7.43", "1971-06-01 7.53",
"1971-07-01 7.60", "1971-08-01 7.70", "1971-09-01 7.69",
"1971-10-01 7.63", "1971-11-01 7.55", "1971-12-01 7.48",
"1972-01-01 7.44"))
do.call(rbind, strsplit(as.character(df$Date), split = '\\s+', fixed = FALSE))
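The do.call(rbind, ...) call returns a character matrix. A short sketch of turning it into a data frame with proper column types (the column names Dates and Rates follow the question; the three rows are a shortened copy of the sample data):

```r
df <- data.frame(
  Date = c("1971-04-01   7.31", "1971-05-01   7.43", "1971-06-01   7.53"),
  stringsAsFactors = FALSE
)

# Split on one or more whitespace characters; one matrix row per input string
parts <- do.call(rbind, strsplit(as.character(df$Date), split = "\\s+"))

# Convert each piece to a suitable class
result <- data.frame(Dates = as.Date(parts[, 1]),
                     Rates = as.numeric(parts[, 2]))
str(result)
```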
Try this:
Split <- c("1971-06-01 7.53", "1971-05-01 7.43", "1971-06-01 7.53")
df <- unlist(str_split(string = Split, pattern = "\\s"))
df
In R I have a data.frame that has several variables that have been measured monthly over several years. I would like to derive the monthly average (using all years) for each variable. Ideally these new variables would all be together in a new data.frame (carrying over the ID), below I am simply adding the new variable to the data.frame. The only way I know how to do this at the moment (below) seems quite laborious, and I was hoping there might be a smarter way to do this in R, that would not require typing out each month and variable as I did below.
# Example data.frame with only two years, two month, and two variables
# In the real data set there are always 12 months per year
# and there are at least four variables
df<- structure(list(ID = 1:4, ABC.M1Y2001 = c(10, 12.3, 45, 89), ABC.M2Y2001 = c(11.1,
34, 67.7, -15.6), ABC.M1Y2002 = c(-11.1, 9, 34, 56.5), ABC.M2Y2002 = c(12L,
13L, 11L, 21L), DEF.M1Y2001 = c(14L, 14L, 14L, 16L), DEF.M2Y2001 = c(15L,
15L, 15L, 12L), DEF.M1Y2002 = c(5, 12, 23.5, 34), DEF.M2Y2002 = c(6L,
34L, 61L, 56L)), .Names = c("ID", "ABC.M1Y2001", "ABC.M2Y2001","ABC.M1Y2002",
"ABC.M2Y2002", "DEF.M1Y2001", "DEF.M2Y2001", "DEF.M1Y2002",
"DEF.M2Y2002"), class = "data.frame", row.names = c(NA, -4L))
# list variable to average for ABC Month 1 across years
ABC.M1.names <- c("ABC.M1Y2001", "ABC.M1Y2002")
df <- transform(df, ABC.M1 = rowMeans(df[,ABC.M1.names], na.rm = TRUE))
# list variable to average for ABC Month 2 across years
ABC.M2.names <- c("ABC.M2Y2001", "ABC.M2Y2002")
df <- transform(df, ABC.M2 = rowMeans(df[,ABC.M2.names], na.rm = TRUE))
# and so forth for ABC
# ...
# list variables to average for DEF Month 1 across years
DEF.M1.names <- c("DEF.M1Y2001", "DEF.M1Y2002")
df <- transform(df, DEF.M1 = rowMeans(df[,DEF.M1.names], na.rm = TRUE))
# and so forth for DEF
# ...
Here's a solution using data.table development version v1.8.11 (which has melt and cast methods implemented for data.table):
require(data.table)
require(reshape2) # melt/cast builds on S3 generic from reshape2
dt <- data.table(df) # where df is your data.frame
dcast.data.table(melt(dt, id="ID")[, sum(value)/.N, list(ID,
gsub("Y.*$", "", variable))], ID ~ gsub)
ID ABC.M1 ABC.M2 DEF.M1 DEF.M2
1: 1 -0.55 11.55 9.50 10.5
2: 2 10.65 23.50 13.00 24.5
3: 3 39.50 39.35 18.75 38.0
4: 4 72.75 2.70 25.00 34.0
You can just cbind this to your original data.
Note that sum is a primitive whereas mean is an S3 generic. Therefore, using sum(.)/length(.) is better (if there are many groups, dispatching the right method with mean for every group could be quite a time-consuming operation). .N is a special variable in data.table that directly gives you the length of the group.
Here is a solution using reshape2 that is more automated when you have lots of data and uses regular expressions to extract the variable name and the month. This solution will give you a nice summary table.
# Load required package
require(reshape2)
# Melt your wide data into long format
mdf <- melt(df , id = "ID" )
# Extract relevant variable names from the variable column
mdf$Month <- gsub( "^.*\\.(M[0-9]{1,2}).*$" , "\\1" , mdf$variable )
mdf$Var <- gsub( "^(.*)\\..*" , "\\1" , mdf$variable )
# Aggregate by month and variable
dcast( mdf , Var ~ Month , mean )
# Var M1 M2
#1 ABC 30.5875 19.275
#2 DEF 16.5625 26.750
Or to be compatible with the other solutions, and return the table by ID as well...
dcast( mdf , ID ~ Var + Month , mean )
# ID ABC_M1 ABC_M2 DEF_M1 DEF_M2
#1 1 -0.55 11.55 9.50 10.5
#2 2 10.65 23.50 13.00 24.5
#3 3 39.50 39.35 18.75 38.0
#4 4 72.75 2.70 25.00 34.0
This is pretty straightforward in base R.
mean.names <- split(names(df)[-1], gsub('Y[0-9]{4}$', '', names(df)[-1]))
means <- lapply(mean.names, function(x) rowMeans(df[, x], na.rm = TRUE))
data.frame(df, means)
This gives you your original data.frame with the following four columns at the end:
ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 -0.55 11.55 9.50 10.5
2 10.65 23.50 13.00 24.5
3 39.50 39.35 18.75 38.0
4 72.75 2.70 25.00 34.0
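If you would rather have the averages in a new data.frame that only carries over the ID, as the question mentions, a small variation could look like this. It is a sketch on a cut-down copy of the example data (ABC columns only); the names value_cols, groups and monthly are my own.

```r
# Cut-down copy of the question's example data (ABC columns only)
df <- data.frame(ID = 1:4,
                 ABC.M1Y2001 = c(10, 12.3, 45, 89),
                 ABC.M2Y2001 = c(11.1, 34, 67.7, -15.6),
                 ABC.M1Y2002 = c(-11.1, 9, 34, 56.5),
                 ABC.M2Y2002 = c(12, 13, 11, 21))

# Group the value columns by their name with the year suffix removed
value_cols <- setdiff(names(df), "ID")
groups <- split(value_cols, sub("Y[0-9]{4}$", "", value_cols))

# Row-wise mean within each group; drop = FALSE keeps single-column groups working
means <- lapply(groups, function(cols) {
  rowMeans(df[, cols, drop = FALSE], na.rm = TRUE)
})

# New data frame: ID plus one averaged column per variable and month
monthly <- data.frame(ID = df$ID, means)
monthly
```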
You can use Reshape from package {splitstackshape} and then use plyr package or data.table or base R to perform mean.
library(splitstackshape) # Reshape
library(plyr) # ddply
kk<-Reshape(df,id.vars="ID",var.stubs=c("ABC.M1","ABC.M2","DEF.M1","DEF.M2"),sep="")
> kk
ID AE DB time ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 1 NA NA 1 10.0 11.1 14.0 15
2 2 NA NA 1 12.3 34.0 14.0 15
3 3 NA NA 1 45.0 67.7 14.0 15
4 4 NA NA 1 89.0 -15.6 16.0 12
5 1 NA NA 2 -11.1 12.0 5.0 6
6 2 NA NA 2 9.0 13.0 12.0 34
7 3 NA NA 2 34.0 11.0 23.5 61
8 4 NA NA 2 56.5 21.0 34.0 56
ddply(kk[,c(1,5:8)],.(ID),colwise(mean))
ID ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 1 -0.55 11.55 9.50 10.5
2 2 10.65 23.50 13.00 24.5
3 3 39.50 39.35 18.75 38.0
4 4 72.75 2.70 25.00 34.0