Create several new derived variables from existing variables in data.frame


In R I have a data.frame with several variables that were measured monthly over several years. I would like to derive the monthly average (across all years) for each variable. Ideally these new variables would all be together in a new data.frame (carrying over the ID); below I am simply adding each new variable to the existing data.frame. The only way I know how to do this at the moment (below) seems quite laborious, and I was hoping there might be a smarter way in R that would not require typing out each month and variable as I did below.
# Example data.frame with only two years, two months, and two variables
# In the real data set there are always 12 months per year
# and there are at least four variables
df<- structure(list(ID = 1:4, ABC.M1Y2001 = c(10, 12.3, 45, 89), ABC.M2Y2001 = c(11.1,
34, 67.7, -15.6), ABC.M1Y2002 = c(-11.1, 9, 34, 56.5), ABC.M2Y2002 = c(12L,
13L, 11L, 21L), DEF.M1Y2001 = c(14L, 14L, 14L, 16L), DEF.M2Y2001 = c(15L,
15L, 15L, 12L), DEF.M1Y2002 = c(5, 12, 23.5, 34), DEF.M2Y2002 = c(6L,
34L, 61L, 56L)), .Names = c("ID", "ABC.M1Y2001", "ABC.M2Y2001","ABC.M1Y2002",
"ABC.M2Y2002", "DEF.M1Y2001", "DEF.M2Y2001", "DEF.M1Y2002",
"DEF.M2Y2002"), class = "data.frame", row.names = c(NA, -4L))
# list variable to average for ABC Month 1 across years
ABC.M1.names <- c("ABC.M1Y2001", "ABC.M1Y2002")
df <- transform(df, ABC.M1 = rowMeans(df[,ABC.M1.names], na.rm = TRUE))
# list variable to average for ABC Month 2 across years
ABC.M2.names <- c("ABC.M2Y2001", "ABC.M2Y2002")
df <- transform(df, ABC.M2 = rowMeans(df[,ABC.M2.names], na.rm = TRUE))
# and so forth for ABC
# ...
# list variables to average for DEF Month 1 across years
DEF.M1.names <- c("DEF.M1Y2001", "DEF.M1Y2002")
df <- transform(df, DEF.M1 = rowMeans(df[,DEF.M1.names], na.rm = TRUE))
# and so forth for DEF
# ...

Here's a solution using data.table development version v1.8.11 (which has melt and cast methods implemented for data.table):
require(data.table)
require(reshape2) # melt/cast builds on S3 generic from reshape2
dt <- data.table(df) # where df is your data.frame
dcast.data.table(melt(dt, id="ID")[, sum(value)/.N, list(ID,
gsub("Y.*$", "", variable))], ID ~ gsub)
ID ABC.M1 ABC.M2 DEF.M1 DEF.M2
1: 1 -0.55 11.55 9.50 10.5
2: 2 10.65 23.50 13.00 24.5
3: 3 39.50 39.35 18.75 38.0
4: 4 72.75 2.70 25.00 34.0
You can just cbind this to your original data.
Note that sum is a primitive, whereas mean is an S3 generic. Therefore sum(.)/length(.) can be faster: with many groups, dispatching the right mean method for every group can be quite time-consuming. .N is a special variable in data.table that directly gives you the number of rows in the current group.
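For readers on a current data.table release, the same melt, aggregate, and cast steps can be sketched with the melt() and dcast() methods that data.table now exports itself. This is only an illustration under assumptions: the column names stub and avg are made up for the example, and the data is a reduced copy of the question's data.frame.

```r
library(data.table)

# Reduced version of the question's data (assumption: two variables, one month)
dt <- data.table(
  ID = 1:4,
  ABC.M1Y2001 = c(10, 12.3, 45, 89),
  ABC.M1Y2002 = c(-11.1, 9, 34, 56.5),
  DEF.M1Y2001 = c(14, 14, 14, 16),
  DEF.M1Y2002 = c(5, 12, 23.5, 34)
)

# Wide to long: one row per ID and original column
long <- melt(dt, id.vars = "ID", variable.name = "variable", value.name = "value")

# Strip the year suffix so that all years of one month share a label
long[, stub := sub("Y[0-9]{4}$", "", variable)]

# Average per ID and label; .N is the number of rows in each group
avg <- long[, .(avg = sum(value) / .N), by = .(ID, stub)]

# Long back to wide: one column per variable-month label
wide <- dcast(avg, ID ~ stub, value.var = "avg")
print(wide)
```

Naming the aggregated column explicitly (avg) and passing it as value.var avoids relying on the auto-generated V1 column name.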

Here is a solution using reshape2 that is more automated when you have lots of data and uses regular expressions to extract the variable name and the month. This solution will give you a nice summary table.
# Load required package
require(reshape2)
# Melt your wide data into long format
mdf <- melt(df , id = "ID" )
# Extract relevant variable names from the variable column
mdf$Month <- gsub( "^.*\\.(M[0-9]{1,2}).*$" , "\\1" , mdf$variable )
mdf$Var <- gsub( "^(.*)\\..*" , "\\1" , mdf$variable )
# Aggregate by month and variable
dcast( mdf , Var ~ Month , mean )
# Var M1 M2
#1 ABC 30.5875 19.275
#2 DEF 16.5625 26.750
Or to be compatible with the other solutions, and return the table by ID as well...
dcast( mdf , ID ~ Var + Month , mean )
# ID ABC_M1 ABC_M2 DEF_M1 DEF_M2
#1 1 -0.55 11.55 9.50 10.5
#2 2 10.65 23.50 13.00 24.5
#3 3 39.50 39.35 18.75 38.0
#4 4 72.75 2.70 25.00 34.0
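The regular-expression step above can also be done in one pass with base R's utils::strcapture(), which returns the captured pieces as typed columns. This is a small sketch, assuming every name follows the pattern <Var>.M<month>Y<year>; the column names Var, Month, and Year are chosen for the example.

```r
# Column names in the same style as the question (the last one uses a two-digit month)
nms <- c("ABC.M1Y2001", "ABC.M2Y2001", "DEF.M1Y2002", "DEF.M12Y2002")

# Each parenthesised group becomes one column, converted to the prototype's type
parts <- strcapture(
  "^(.*)\\.M([0-9]{1,2})Y([0-9]{4})$",
  nms,
  proto = data.frame(Var = character(), Month = integer(), Year = integer(),
                     stringsAsFactors = FALSE)
)
print(parts)
```

Having Month and Year as integers rather than strings makes later sorting and grouping behave numerically (month 12 sorts after month 2).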

This is pretty straightforward in base R.
mean.names <- split(names(df)[-1], gsub('Y[0-9]{4}$', '', names(df)[-1]))
means <- lapply(mean.names, function(x) rowMeans(df[, x], na.rm = TRUE))
data.frame(df, means)
This gives you your original data.frame with the following four columns at the end:
ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 -0.55 11.55 9.50 10.5
2 10.65 23.50 13.00 24.5
3 39.50 39.35 18.75 38.0
4 72.75 2.70 25.00 34.0
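Since the question asked for the averages in a new data.frame that carries over the ID, here is a minimal base R sketch of that variant. It uses a reduced copy of the question's data; the names value_cols, stubs, groups, and monthly are just illustrative.

```r
# Reduced version of the question's data (one variable, two months, two years)
df <- data.frame(
  ID = 1:4,
  ABC.M1Y2001 = c(10, 12.3, 45, 89),
  ABC.M2Y2001 = c(11.1, 34, 67.7, -15.6),
  ABC.M1Y2002 = c(-11.1, 9, 34, 56.5),
  ABC.M2Y2002 = c(12, 13, 11, 21)
)

# Group the measurement columns by their name without the year suffix
value_cols <- setdiff(names(df), "ID")
stubs <- sub("Y[0-9]{4}$", "", value_cols)
groups <- split(value_cols, stubs)

# One averaged column per group, collected next to the ID in a new data.frame
monthly <- data.frame(
  ID = df$ID,
  lapply(groups, function(cols) rowMeans(df[, cols, drop = FALSE], na.rm = TRUE))
)
print(monthly)
```

The drop = FALSE keeps the subset a data.frame even if a group happens to contain a single column, which rowMeans() would otherwise reject.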

You can use Reshape from the {splitstackshape} package and then use plyr, data.table, or base R to compute the means.
library(splitstackshape) # Reshape
library(plyr) # ddply
kk<-Reshape(df,id.vars="ID",var.stubs=c("ABC.M1","ABC.M2","DEF.M1","DEF.M2"),sep="")
> kk
ID AE DB time ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 1 NA NA 1 10.0 11.1 14.0 15
2 2 NA NA 1 12.3 34.0 14.0 15
3 3 NA NA 1 45.0 67.7 14.0 15
4 4 NA NA 1 89.0 -15.6 16.0 12
5 1 NA NA 2 -11.1 12.0 5.0 6
6 2 NA NA 2 9.0 13.0 12.0 34
7 3 NA NA 2 34.0 11.0 23.5 61
8 4 NA NA 2 56.5 21.0 34.0 56
ddply(kk[,c(1,5:8)],.(ID),colwise(mean))
ID ABC.M1 ABC.M2 DEF.M1 DEF.M2
1 1 -0.55 11.55 9.50 10.5
2 2 10.65 23.50 13.00 24.5
3 3 39.50 39.35 18.75 38.0
4 4 72.75 2.70 25.00 34.0

Related

Calculate daily mean of data frame in r

I have a data frame in R that contains readings taken every five minutes over a couple of months. I want to calculate the daily mean of Var3 (data frame below) and add it to this data frame as Var4.
Here is my df:
>df
timestamp Var1 Var2 Var3
1 2018-07-20 13:50:00 32.0358 28.1 3.6
2 2018-07-20 13:55:00 32.0358 28.0 2.5
3 2018-07-20 14:00:00 32.0358 28.1 2.2
I found this solution by searching the forum, but it raises an error.
Here is the solution I am applying:
aggregate(ts(df$var3[, 2], freq = 288), 1, mean)
This is the error I am getting:
Error in df$var3[, 2] : incorrect number of dimensions
I think this should work for my data frame too, but I am not able to get rid of this error. Please help.

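Here's an approach with dplyr and lubridate. Note that day() returns the day of the month (1 to 31), so readings taken on the same day number in different months end up in one group; for data spanning several months, group by as_date(ymd_hms(timestamp)) instead.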
library(dplyr)
library(lubridate)
df %>%
group_by(Day = day(ymd_hms(timestamp))) %>%
mutate(Var4 = mean(Var3))
## A tibble: 1,000 x 6
## Groups: Day [5]
# timestamp Var1 Var2 Var3 Day Var4
# <dttm> <dbl> <dbl> <dbl> <int> <dbl>
# 1 2018-07-20 13:55:30 32.2 22.9 2.35 20 2.99
# 2 2018-07-20 14:00:30 37.7 24.8 2.99 20 2.99
# 3 2018-07-20 14:05:30 38.7 29.6 3.47 20 2.99
# 4 2018-07-20 14:10:30 30.4 24.2 3.02 20 2.99
# 5 2018-07-20 14:15:30 32.0 28.4 2.95 20 2.99
## … with 995 more rows
Sample Data
df <- data.frame(timestamp = ymd_hms("2018-07-20 13:50:30") + 60*5 * 1:1000,
Var1 = runif(1000,30,40),
Var2 = runif(1000,20,30),
Var3 = runif(1000,2,4))
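The same per-day mean can be sketched in base R with ave(), grouping on the full calendar date so that the same day number in different months is not merged. The four readings below are made-up values for illustration, and the timestamps are assumed to be in UTC.

```r
# Made-up readings: two on 20 July and two on 20 August
timestamp <- as.POSIXct(
  c("2018-07-20 13:50:00", "2018-07-20 13:55:00",
    "2018-08-20 14:00:00", "2018-08-20 14:05:00"),
  tz = "UTC"
)
df <- data.frame(timestamp, Var3 = c(3.6, 2.5, 2.2, 1.8))

# Group on the calendar date (year, month, and day)
df$Day <- as.Date(df$timestamp, tz = "UTC")
df$Var4 <- ave(df$Var3, df$Day, FUN = mean)

# For comparison: grouping on day-of-month alone pools July and August
pooled <- ave(df$Var3, format(df$timestamp, "%d"), FUN = mean)

print(df)
print(pooled)
```

ave() returns a vector of the same length as its input, so the result can be assigned straight back as a new column.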

replace duplicate values with NA in time series data using dplyr

My data seems a bit different than other similar kind of posts.
box_num date x y
1-Q 2018-11-18 20.2 8
1-Q 2018-11-25 21.23 7.2
1-Q 2018-12-2 21.23 23
98-L 2018-11-25 0.134 9.3
98-L 2018-12-2 0.134 4
76-GI 2018-12-2 22.734 4.562
76-GI 2018-12-9 28 4.562
Here I would like to replace the repeated values with NA in both x and y columns.
The code I have tried using dplyr :
(1)df <- df %>% group_by(box_num) %>% arrange(box_num,date) %>%
mutate(df$x[duplicated(df$x),] <- NA)
It creates a new column with all NA's instead of just replacing a repeated value with NA
(2)df <- df %>% group_by(box_num) %>% arrange(box_num,date) %>%
distinct(x,.keep_all = TRUE)
The second one just gives the rows that are not duplicated(we are missing the time series)
Desired Output :
box_num date x y
1-Q 2018-11-18 20.2 8
1-Q 2018-11-25 21.23 7.2
1-Q 2018-12-2 NA 23
98-L 2018-11-25 0.134 9.3
98-L 2018-12-2 NA 4
76-GI 2018-12-2 22.734 4.562
76-GI 2018-12-9 28 NA

Using dplyr, we can group_by box_num and use mutate_at on the x and y columns to replace the duplicated values with NA.
library(dplyr)
df %>%
group_by(box_num) %>%
mutate_at(vars(x:y), funs(replace(., duplicated(.), NA)))
# box_num date x y
# <fct> <fct> <dbl> <dbl>
#1 1-Q 2018-11-18 20.2 8
#2 1-Q 2018-11-25 21.2 7.2
#3 1-Q 2018-12-2 NA 23
#4 98-L 2018-11-25 0.134 9.3
#5 98-L 2018-12-2 NA 4
#6 76-GI 2018-12-2 22.7 4.56
#7 76-GI 2018-12-9 28 NA
A base R option (which might not be the best in this case) would be :
cols <- c("x", "y")
df[cols] <- sapply(df[cols], function(x)
ave(x, df$box_num, FUN = function(x) replace(x, duplicated(x), NA)))
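In current dplyr releases, funs() is deprecated and mutate_at() is superseded by across(). A sketch of the equivalent call follows, assuming dplyr 1.0.0 or later; the data is retyped from the question's table with the date column left out for brevity.

```r
library(dplyr)

# Data retyped from the question (date column omitted)
df <- data.frame(
  box_num = c("1-Q", "1-Q", "1-Q", "98-L", "98-L", "76-GI", "76-GI"),
  x = c(20.2, 21.23, 21.23, 0.134, 0.134, 22.734, 28),
  y = c(8, 7.2, 23, 9.3, 4, 4.562, 4.562)
)

# Within each box_num, blank out every value that repeats an earlier one
out <- df %>%
  group_by(box_num) %>%
  mutate(across(c(x, y), ~ replace(.x, duplicated(.x), NA))) %>%
  ungroup()

print(out)
```

Because mutate() keeps all rows in their original order, the time series stays intact, unlike the distinct() attempt in the question.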

Here is an option with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), specify the columns of interest in .SDcols, replace the duplicated elements in those columns with NA, and update the columns by assigning (:=) the output back to them. duplicated(x) is used because it flags every repeated element, whereas anyDuplicated(x) returns only the index of the first repeat.
library(data.table)
setDT(df1)[, c('x', 'y') := lapply(.SD, function(x)
replace(x, duplicated(x), NA)), box_num, .SDcols= x:y]
df1
# box_num date x y
#1: 1-Q 2018-11-18 20.200 8.000
#2: 1-Q 2018-11-25 21.230 7.200
#3: 1-Q 2018-12-2 NA 23.000
#4: 98-L 2018-11-25 0.134 9.300
#5: 98-L 2018-12-2 NA 4.000
#6: 76-GI 2018-12-2 22.734 4.562
#7: 76-GI 2018-12-9 28.000 NA
data
df1 <- structure(list(box_num = c("1-Q", "1-Q", "1-Q", "98-L", "98-L",
"76-GI", "76-GI"), date = c("2018-11-18", "2018-11-25", "2018-12-2",
"2018-11-25", "2018-12-2", "2018-12-2", "2018-12-9"), x = c(20.2,
21.23, 20.2, 0.134, 0.134, 22.734, 28), y = c(8, 7.2, 23, 9.3,
4, 4.562, 4.562)), class = "data.frame",
row.names = c(NA, -7L))
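When choosing between duplicated() and anyDuplicated() for this kind of replacement, the difference only shows up once a value repeats more than once within a group. A short base R illustration with made-up vectors:

```r
x <- c(5, 7, 5, 5)

# duplicated() returns a logical vector marking every repeat
all_repeats <- replace(x, duplicated(x), NA)

# anyDuplicated() returns a single index: the position of the first repeat
first_repeat <- replace(x, anyDuplicated(x), NA)

# With no repeats anyDuplicated() is 0, and index 0 leaves the vector unchanged
no_dups <- replace(c(1, 2, 3), anyDuplicated(c(1, 2, 3)), NA)

print(all_repeats)   # 5 7 NA NA
print(first_repeat)  # 5 7 NA 5
print(no_dups)       # 1 2 3
```

On the sample data in this question each group has at most one repeat, so both functions happen to give the same result there.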

One-liner to find corresponding value in large dataframe r

I am looking for a simple one-liner that will help me find a corresponding value in a dataframe.
Data sample:
weather <-data.frame("date" = seq(as.Date("2000/1/1"), by ="days", length.out = 10), temp = runif(10))
weather
date temp
1 2000-01-01 0.08520875
2 2000-01-02 0.69003449
3 2000-01-03 0.85892903
4 2000-01-04 0.37790250
5 2000-01-05 0.04121786
6 2000-01-06 0.31550816
7 2000-01-07 0.86219597
8 2000-01-08 0.30844555
9 2000-01-09 0.96949855
10 2000-01-10 0.18851018
Let's say I now want to find the day on which the maximum temperature occurred:
max_temp <- max(weather$temp)
max_temp
[1] 0.9694985
Now there are a couple of ways that I can find the date of this temperature (i.e. the corresponding value that I am after):
weather[which(weather$temp == max_temp), which(colnames(weather) == "date")]
[1] "2000-01-09"
But this is kind of laborious. I could also use dplyr:
library(dplyr)
filter(weather, temp == max_temp) %>%
select(date)
date
1 2000-01-09
But again, a two liner in the console just to get this seems like overkill.
I can't help but feel that there must be something like:
function(df, name_of_known_variable, value_of_known_variable, character_vector_of_variables_of_interest)
So for this example this would look like (assuming the function is "correspond"):
correspond(weather, temp, max_temp, date)
1 2000-01-09
I have looked all over and can't seem to find something simple for this. Please note that I understand that I could use:
weather[which.max(weather$temp), 1]
[1] "2000-01-09"
But let's assume that I am not necessarily looking for the maximum temperature (let's imagine I just have a value of interest and I am trying to find the corresponding value). Let's also imagine I have a massive data frame with lots and lots of columns (so many as to make counting them laborious). Further, let's imagine that I want to return corresponding values from multiple columns.

Turning my comment into an answer, using Base R only:
Create data, adding two more columns to provide a broader perspective:
set.seed( 1110 )
weather <-data.frame( "date" = seq( as.Date("2000/1/1"), by = "days", length.out = 10),
temp = round( runif( 10 ), 2 ),
loc = round( runif( 10 ) * 10, 2 ),
speed = round( runif( 10 ) * 50, 1 ) )
> weather
date temp loc speed
1 2000-01-01 0.48 9.79 18.9
2 2000-01-02 0.79 9.20 18.6
3 2000-01-03 0.88 9.65 46.3
4 2000-01-04 0.58 0.59 5.3
5 2000-01-05 0.22 6.12 38.7
6 2000-01-06 0.09 3.05 42.6
7 2000-01-07 0.49 4.09 2.1
8 2000-01-08 0.99 8.60 31.9
9 2000-01-09 0.56 4.27 12.6
10 2000-01-10 0.36 6.02 42.7
Now we can select with a one-liner, based on column names rather than numbers, as required:
# The day with the maximum temperature
weather[ weather$temp == max( weather$temp ), "date" ]
[1] "2000-01-08"
But we can do a lot more:
# Speed and Location (order reversed) on the day with a temperature of 0.49
weather[ weather$temp == .49, c( "speed", "loc" ) ]
speed loc
7 2.1 4.09
# Date and speed, based upon two selection criteria (temperature or location)
# here we need to use which() to get the row indices
weather[ c( which( weather$temp == min( weather$temp ) ), which( weather$loc == 6.12 ) ), c( "date", "speed" ) ]
date speed
6 2000-01-06 42.6
5 2000-01-05 38.7
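The indexing idiom above can be wrapped into the kind of helper the question imagined. The function below is a hypothetical sketch, not an existing base R or package function; it takes column names as strings, and the small weather data.frame is made up for the example.

```r
# Hypothetical helper: rows where `known` equals `value`, restricted to `wanted` columns
correspond <- function(df, known, value, wanted) {
  hits <- which(df[[known]] == value)  # which() silently drops NA comparisons
  df[hits, wanted, drop = FALSE]
}

# Small made-up data set in the shape used above
weather <- data.frame(
  date  = seq(as.Date("2000-01-01"), by = "day", length.out = 5),
  temp  = c(0.48, 0.79, 0.88, 0.58, 0.22),
  speed = c(18.9, 18.6, 46.3, 5.3, 38.7)
)

res <- correspond(weather, "temp", max(weather$temp), c("date", "speed"))
print(res)
```

Exact equality on floating-point values is fragile: it works here because the lookup value is taken from the column itself, but for computed values a tolerance-based comparison would be safer.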

Use the data.table package. The syntax is simple:
a[variable == value_you_want]
a[variable == max(variable)]
a[variable == 0]

dplyr::slice is also a possibility here:
set.seed(1)
weather <-data.frame("date" = seq(as.Date("2000/1/1"), by ="days", length.out = 10), temp = runif(10))
library(dplyr)
weather %>% arrange(desc(temp)) %>% slice(1)
# A tibble: 1 x 2
date temp
<date> <dbl>
1 2000-01-07 0.9446753
And you can use dplyr::filter if you need to look for a specific value

Calculate average of month and replace values of other column

I have a dataframe as given below:
vdate=c("12-04-2015","13-04-2015","14-04-2015","15-04-2015","12-05-2015","13-05-2015","14-05-2015"
,"15-05-2015","12-06-2015","13-06-2015","14-06-2015","15-06-2015")
month=c(4,4,4,4,5,5,5,5,6,6,6,6)
col1=c(12,12.4,14.3,3,5.3,1.8,7.6,4.5,7.6,10.7,12,15.7)
df=data.frame(vdate,month,col1)
Below is the column which contains value based on some calculation:
pvar=c(8.4,2.4,12,14.4,2.3,3.5,7.8,5,16,5.4,18,18.4)
Now I want to replace a pvar value if it is less than the average value for that particular month.
For example,
for month 4,
Average value of pvar is 9.3 ((8.4+2.4+12+14.4)/4).
Then replace all the values in pvar that are less than the average for month 4 (that is, 8.4 and 2.4).
The pvar values would then be 9.3, 9.3, 12, 14.4.
I need to do this for all the values in pvar.

A base R solution would be to use ave. Note that we first need to convert the date column to actual date in order to extract the month (strsplit or regex can also do it but I prefer to have it set as a proper date), i.e.
df$vdate <- as.POSIXct(df$vdate, format = '%d-%m-%Y')
with(df, ave(pvar, format(vdate, '%m'), FUN = function(i) replace(i, i < mean(i), mean(i))))
#[1] 9.30 9.30 12.00 14.40 4.65 4.65 7.80 5.00 16.00 14.45 18.00 18.40
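Replacing every value below the group mean with that mean is the same as taking the element-wise maximum of the value and the mean, so the step can also be sketched with pmax(). This uses the month and pvar vectors from the question; month_mean, raised, and only_4_5 are illustrative names, and the last line assumes that only months 4 and 5 should be altered.

```r
month <- c(4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6)
pvar  <- c(8.4, 2.4, 12, 14.4, 2.3, 3.5, 7.8, 5, 16, 5.4, 18, 18.4)

# ave() defaults to FUN = mean and returns the group mean for every element
month_mean <- ave(pvar, month)

# Element-wise maximum: values below their month's mean are lifted to it
raised <- pmax(pvar, month_mean)

# Variant: apply the replacement to months 4 and 5 only, leave the rest as-is
only_4_5 <- ifelse(month %in% c(4, 5), raised, pvar)

print(raised)
print(only_4_5)
```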
As per your edit, I will use dplyr to tackle it as it might be more readable. There are actually two ways I came up with.
First: Create an extra grouping variable that will put all the months you need to alter the values in the same group and replace from there, i.e.
library(dplyr)
cbind(df, pvar) %>%
group_by(grp = cumsum(!month %in% c(4, 5))+1, month) %>%
mutate(pvar = replace(pvar, pvar < mean(pvar), mean(pvar))) %>%
ungroup() %>%
select(-grp)
Second: Filter the months you need, do the calculations. Then filter the months you don't need, create again the pvar but without changing anything (necessary for binding the rows) and bind the rows, i.e.
bind_rows(
cbind(df, pvar) %>%
filter(month %in% c(4, 5)) %>%
group_by(month) %>%
mutate(pvar = replace(pvar, pvar < mean(pvar), mean(pvar))),
cbind(df, pvar) %>%
filter(!month %in% c(4, 5))
)
Both the above give,
vdate month col1 pvar
<fct> <dbl> <dbl> <dbl>
1 12-04-2015 4. 12.0 9.30
2 13-04-2015 4. 12.4 9.30
3 14-04-2015 4. 14.3 12.0
4 15-04-2015 4. 3.00 14.4
5 12-05-2015 5. 5.30 4.65
6 13-05-2015 5. 1.80 4.65
7 14-05-2015 5. 7.60 7.80
8 15-05-2015 5. 4.50 5.00
9 12-06-2015 6. 7.60 16.0
10 13-06-2015 6. 10.7 5.40
11 14-06-2015 6. 12.0 18.0
12 15-06-2015 6. 15.7 18.4

A dplyr based solution could be :
#Additional condition has been added to check if month != 6
cbind(df, pvar) %>%
group_by(month) %>%
mutate(pvar = ifelse(pvar < mean(pvar) & month != 6, mean(pvar), pvar)) %>%
as.data.frame()
# vdate month col1 pvar
# 1 12-04-2015 4 12.0 9.30
# 2 13-04-2015 4 12.4 9.30
# 3 14-04-2015 4 14.3 12.00
# 4 15-04-2015 4 3.0 14.40
# 5 12-05-2015 5 5.3 4.65
# 6 13-05-2015 5 1.8 4.65
# 7 14-05-2015 5 7.6 7.80
# 8 15-05-2015 5 4.5 5.00
# 9 12-06-2015 6 7.6 16.00
# 10 13-06-2015 6 10.7 5.40
# 11 14-06-2015 6 12.0 18.00
# 12 15-06-2015 6 15.7 18.40
Data
vdate=c("12-04-2015","13-04-2015","14-04-2015","15-04-2015","12-05-2015",
"13-05-2015","14-05-2015","15-05-2015","12-06-2015","13-06-2015",
"14-06-2015","15-06-2015")
month=c(4,4,4,4,5,5,5,5,6,6,6,6)
col1=c(12,12.4,14.3,3,5.3,1.8,7.6,4.5,7.6,10.7,12,15.7)
df=data.frame(vdate,month,col1)
pvar=c(8.4,2.4,12,14.4,2.3,3.5,7.8,5,16,5.4,18,18.4)

Melt and Recast into a new dataframe in r

I just downloaded a lot of temperature data from one of our dataloggers. The dataframe gives me mean hourly observations of temperature for 1691 hours for 87 temperature sensors (so there is a lot of data here). This looks something like this
D1_A D1_B D1_C
13.43 14.39 12.33
12.62 13.53 11.56
11.67 12.56 10.36
10.83 11.62 9.47
I would like to reshape this dataset into a matrix that looks like this:
#create a blank matrix 5 columns 131898 rows
matrix1<-matrix(nrow=131898, ncol=5)
colnames(matrix1)<- c("year", "ID", "Soil_Layer", "Hour", "Temperature")
where:
year is always "2012"
ID corresponds to the header ID (e.g. D1)
Soil_Layer corresponds to the second bit of the header (e.g. A, B, or C)
Hour= 1:1691 for each sensor
and Temperature= the observed values in the original dataframe.
Can this be done with the reshape package in r? Does this need to be done as a loop? Any input on how to handle this dataset would be useful. Cheers!

I think this does what you want...you can take advantage of the colsplit() and melt() functions in package reshape2. It's not clear where you identify the Hour for the data, so I assumed it was ordered from the original dataset. If that's not the case, update your question:
library(reshape2)
#read in your data
x <- read.table(text = "
D1_A D1_B D1_C
13.43 14.39 12.33
12.62 13.53 11.56
11.67 12.56 10.36
10.83 11.62 9.47
9.98 10.77 9.04
9.24 10.06 8.65
8.89 9.55 8.78
9.01 9.39 9.88
", header = TRUE)
#add hour index, if data isn't ordered, replace this with whatever
#tells you which hour goes where
x$hour <- 1:nrow(x)
#Melt into long format
x.m <- melt(x, id.vars = "hour")
#Split into two columns
x.m[, c("ID", "Soil_Layer")] <- colsplit(x.m$variable, "_", c("ID", "Soil_Layer"))
#Add the year
x.m$year <- 2012
#Return the first 6 rows
head(x.m[, c("year", "ID", "Soil_Layer", "hour", "value")])
#----
year ID Soil_Layer hour value
1 2012 D1 A 1 13.43
2 2012 D1 A 2 12.62
3 2012 D1 A 3 11.67
4 2012 D1 A 4 10.83
5 2012 D1 A 5 9.98
6 2012 D1 A 6 9.24
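If adding a package is not desirable, the same long table can be sketched in base R by repeating the header parts and unlisting the columns. This assumes, as in the answer above, that the rows are already in hour order and that every header has the form <ID>_<layer>; the two-row data.frame is a cut-down sample.

```r
# Cut-down sample: two hours, three sensors
x <- data.frame(
  D1_A = c(13.43, 12.62),
  D1_B = c(14.39, 13.53),
  D1_C = c(12.33, 11.56)
)

n_hours <- nrow(x)

# unlist() walks the data.frame column by column, so the labels are
# repeated once per hour and the hour index once per column
long <- data.frame(
  year        = 2012,
  ID          = rep(sub("_.*$", "", names(x)), each = n_hours),
  Soil_Layer  = rep(sub("^.*_", "", names(x)), each = n_hours),
  Hour        = rep(seq_len(n_hours), times = ncol(x)),
  Temperature = unlist(x, use.names = FALSE)
)
print(long)
```

No loop is needed, and the result has nrow(x) * ncol(x) rows, one per hour and sensor.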
