Partially transpose a dataframe in R - r

Given the following set of data:
transect <- c("B","N","C","D","H","J","E","L","I","I")
sampler <- c(rep("J",5),rep("W",5))
species <- c("ROB","HAW","HAW","ROB","PIG","HAW","PIG","PIG","HAW","HAW")
weight <- c(2.80,52.00,56.00,2.80,16.00,55.00,16.20,18.30,52.50,57.00)
wingspan <- c(13.9, 52.0, 57.0, 13.7, 11.0,52.5, 10.7, 11.1, 52.3, 55.1)
week <- c(1,2,3,4,5,6,7,8,9,9)
# Warning to R newbs: Really bad idea to use this code
ex <- as.data.frame(cbind(transect,sampler,species,weight,wingspan,week))
What I'm trying to achieve is to transpose the species and its associated information on weight and wingspan. For a better idea of the expected result, please see below. My data set is about half a million lines long with approximately 200 different species, so it will be a very large dataframe.
transect sampler week ROBweight HAWweight PIGweight ROBwingspan HAWwingspan PIGwingspan
1 B J 1 2.8 0.0 0.0 13.9 0.0 0.0
2 N J 2 0.0 52.0 0.0 0.0 52.0 0.0
3 C J 3 0.0 56.0 0.0 0.0 57.0 0.0
4 D J 4 2.8 0.0 0.0 13.7 0.0 0.0
5 H J 5 0.0 0.0 16.0 0.0 0.0 11.0
6 J W 6 0.0 55.0 0.0 0.0 52.5 0.0
7 E W 7 0.0 0.0 16.2 0.0 0.0 10.7
8 L W 8 0.0 0.0 18.3 0.0 0.0 11.1
9 I W 9 0.0 52.5 0.0 0.0 52.3 0.0
10 I W 9 0.0 57.0 0.0 0.0 55.1 0.0

The main problem is that you don't currently have unique "id" variables, which will create problems for the usual suspects of reshape and dcast.
Here's a solution. I've used getanID from my "splitstackshape" package, but it's pretty easy to create your own unique ID variable using many different methods.
library(splitstackshape)
library(reshape2)
idvars <- c("transect", "sampler", "week")
ex <- getanID(ex, id.vars=idvars)
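As one illustration of a hand-rolled ID, the sketch below numbers the rows within each combination of the id columns using base R's ave. The data frame name ex_demo is made up for this example; the values are copied from the question.

```r
# Id columns copied from the question (ex_demo is a hypothetical name)
ex_demo <- data.frame(
  transect = c("B", "N", "C", "D", "H", "J", "E", "L", "I", "I"),
  sampler  = c(rep("J", 5), rep("W", 5)),
  week     = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 9),
  stringsAsFactors = FALSE
)

# Number the rows within each transect/sampler/week combination
ex_demo$.id <- ave(seq_len(nrow(ex_demo)),
                   ex_demo$transect, ex_demo$sampler, ex_demo$week,
                   FUN = seq_along)
ex_demo$.id
```

Only the two rows sharing transect "I", sampler "W" and week 9 should receive different ids; every other combination occurs once.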
From here, you have two options:
reshape from base R:
reshape(ex, direction = "wide",
idvar=c("transect", "sampler", "week", ".id"),
timevar="species")
melt and dcast from "reshape2"
First, melt your data into a "long" form.
exL <- melt(ex, id.vars=c(idvars, ".id", "species"))
Then, cast your data into a wide form.
dcast(exL, transect + sampler + week + .id ~ species + variable)
# transect sampler week .id HAW_weight HAW_wingspan PIG_weight PIG_wingspan ROB_weight ROB_wingspan
# 1 B J 1 1 NA NA NA NA 2.8 13.9
# 2 C J 3 1 56.0 57.0 NA NA NA NA
# 3 D J 4 1 NA NA NA NA 2.8 13.7
# 4 E W 7 1 NA NA 16.2 10.7 NA NA
# 5 H J 5 1 NA NA 16.0 11.0 NA NA
# 6 I W 9 1 52.5 52.3 NA NA NA NA
# 7 I W 9 2 57.0 55.1 NA NA NA NA
# 8 J W 6 1 55.0 52.5 NA NA NA NA
# 9 L W 8 1 NA NA 18.3 11.1 NA NA
# 10 N J 2 1 52.0 52.0 NA NA NA NA
A better option: "data.table"
Alternatively (and perhaps preferably), you can use the "data.table" package (at least version 1.8.11) as follows:
library(data.table)
library(reshape2) ## Also required here
packageVersion("data.table")
# [1] ‘1.8.11’
DT <- data.table(ex)
DT[, .id := sequence(.N), by = c("transect", "sampler", "week")]
DTL <- melt(DT, measure.vars=c("weight", "wingspan"))
dcast.data.table(DTL, transect + sampler + week + .id ~ species + variable)
# transect sampler week .id HAW_weight HAW_wingspan PIG_weight PIG_wingspan ROB_weight ROB_wingspan
# 1: B J 1 1 NA NA NA NA 2.8 13.9
# 2: C J 3 1 56.0 57.0 NA NA NA NA
# 3: D J 4 1 NA NA NA NA 2.8 13.7
# 4: E W 7 1 NA NA 16.2 10.7 NA NA
# 5: H J 5 1 NA NA 16.0 11.0 NA NA
# 6: I W 9 1 52.5 52.3 NA NA NA NA
# 7: I W 9 2 57.0 55.1 NA NA NA NA
# 8: J W 6 1 55.0 52.5 NA NA NA NA
# 9: L W 8 1 NA NA 18.3 11.1 NA NA
# 10: N J 2 1 52.0 52.0 NA NA NA NA
Add fill = 0 to either of the dcast versions to replace NA values with 0.
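The base R reshape route has no fill argument, so there the NA values would have to be replaced after reshaping. A minimal sketch, assuming a four-row subset of the question's data with numeric measurement columns and a precomputed .id (the name ex_small is made up here):

```r
# Four rows from the question, with .id already added (ex_small is hypothetical)
ex_small <- data.frame(
  transect = c("B", "N", "I", "I"),
  sampler  = c("J", "J", "W", "W"),
  week     = c(1, 2, 9, 9),
  .id      = c(1, 1, 1, 2),
  species  = c("ROB", "HAW", "HAW", "HAW"),
  weight   = c(2.8, 52.0, 52.5, 57.0),
  wingspan = c(13.9, 52.0, 52.3, 55.1),
  stringsAsFactors = FALSE
)

wide <- reshape(ex_small, direction = "wide",
                idvar = c("transect", "sampler", "week", ".id"),
                timevar = "species")

# Only the reshaped measurement columns contain NA, so a blanket replacement works here
wide[is.na(wide)] <- 0
wide
```

This blanket replacement is only reasonable when the id columns themselves contain no missing values.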

Related

Fill Timeseries with founded values or keep NA if no value was found yet

I had a long-format time series and turned it into a wide one for forecasting purposes. Currently the time series has the following structure:
Day   Value  Strength1  Strength2  Strength3
1/2   1.356  3          NA         NA
2/2   1.385  NA         NA         NA
3/2   1.385  NA         1.01       NA
4/2   1.4    NA         NA         10
5/2   1.6    NA         NA         NA
6/2   1.7    4          NA         NA
7/2   1.8    NA         1.05       NA
8/2   1.88   NA         NA         NA
9/2   1.98   NA         NA         11
10/2  1.8    NA         NA         NA
I want a function that, given a time series, loops through the columns and handles each cell as follows:
if the cell is NA and only NAs were found previously in the column, keep NA
if the cell is not NA, leave it as is
if the cell is NA but a non-NA value was found previously, change it to the previously found value
This would be the result:
Day   Value  Strength1  Strength2  Strength3
1/2   1.356  3          NA         NA
2/2   1.385  3          NA         NA
3/2   1.385  3          1.01       NA
4/2   1.4    3          1.01       10
5/2   1.6    3          1.01       10
6/2   1.7    4          1.01       10
7/2   1.8    4          1.05       10
8/2   1.88   4          1.05       10
9/2   1.98   4          1.05       11
10/2  1.8    4          1.05       11
I tried this function, but it isn't right:
filler <- function(df) {
col <- colnames(df)
one <- NA
for (i in col) {
for (a in i) {
if(!is.na(a)) {
one = a
}
if(!is.na(one) & is.na(a)) {
a = one
}
}
}
}
You may use tidyr::fill:
filler <- function(data) tidyr::fill(data, dplyr::everything())
filler(df)
# Day Value Strength1 Strength2 Strength3
#1 1/2 1.356 3 NA NA
#2 2/2 1.385 3 NA NA
#3 3/2 1.385 3 1.01 NA
#4 4/2 1.400 3 1.01 10
#5 5/2 1.600 3 1.01 10
#6 6/2 1.700 4 1.01 10
#7 7/2 1.800 4 1.05 10
#8 8/2 1.880 4 1.05 10
#9 9/2 1.980 4 1.05 11
#10 10/2 1.800 4 1.05 11
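If adding tidyr is not an option, the same carry-forward rule can be sketched in base R. The helper name fill_down and the small data frame ts_df are made up for this illustration; the values are the first four rows of two columns from the question.

```r
# Carry the last non-NA value forward; leading NAs stay NA
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values seen so far (0 = none yet)
  c(NA, x[!is.na(x)])[idx + 1]    # position 1 is the NA used before any value is found
}

ts_df <- data.frame(
  Day       = c("1/2", "2/2", "3/2", "4/2"),
  Strength1 = c(3, NA, NA, NA),
  Strength2 = c(NA, NA, 1.01, NA),
  stringsAsFactors = FALSE
)

# Apply the helper to every column while keeping the data frame structure
ts_df[] <- lapply(ts_df, fill_down)
ts_df
```

The cumulative count of non-NA values acts as an index into the observed values, which is why no explicit loop over cells is needed.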

Quarterly year-to-year changes

I have a quarterly time series. I am trying to apply a function that is supposed to calculate the year-to-year growth and the year-to-year difference, and to multiply a variable by (-1).
I already used a similar function for calculating quarter-to-quarter changes, and it worked.
I modified this function for yoy changes, but it does not have any effect on my data frame, and no error popped up.
Do you have any suggestions on how to modify the function, or how else to apply the yoy change function to a time series?
Here is the code:
Date <- c("2004-01-01","2004-04-01", "2004-07-01","2004-10-01","2005-01-01","2005-04-01","2005-07-01","2005-10-01","2006-01-01","2006-04-01","2006-07-01","2006-10-01","2007-01-01","2007-04-01","2007-07-01","2007-10-01")
B1 <- c(3189.30,3482.05,3792.03,4128.66,4443.62,4876.54,5393.01,5885.01,6360.00,6930.00,7430.00,7901.00,8279.00,8867.00,9439.00,10101.00)
B2 <- c(7939.97,7950.58,7834.06,7746.23,7760.59,8209.00,8583.05,8930.74,9424.00,9992.00,10041.00,10900.00,11149.00,12022.00,12662.00,13470.00)
B3 <- as.numeric(c("","","","",140.20,140.30,147.30,151.20,159.60,165.60,173.20,177.30,185.30,199.30,217.10,234.90))
B4 <- as.numeric(c("","","","",-3.50,-14.60,-11.60,-10.20,-3.10,-16.00,-4.90,-17.60,-5.30,-10.90,-12.80,-8.40))
df <- data.frame(Date,B1,B2,B3,B4)
The code will produce the following data frame:
Date B1 B2 B3 B4
1 2004-01-01 3189.30 7939.97 NA NA
2 2004-04-01 3482.05 7950.58 NA NA
3 2004-07-01 3792.03 7834.06 NA NA
4 2004-10-01 4128.66 7746.23 NA NA
5 2005-01-01 4443.62 7760.59 140.2 -3.5
6 2005-04-01 4876.54 8209.00 140.3 -14.6
7 2005-07-01 5393.01 8583.05 147.3 -11.6
8 2005-10-01 5885.01 8930.74 151.2 -10.2
9 2006-01-01 6360.00 9424.00 159.6 -3.1
10 2006-04-01 6930.00 9992.00 165.6 -16.0
11 2006-07-01 7430.00 10041.00 173.2 -4.9
12 2006-10-01 7901.00 10900.00 177.3 -17.6
13 2007-01-01 8279.00 11149.00 185.3 -5.3
14 2007-04-01 8867.00 12022.00 199.3 -10.9
15 2007-07-01 9439.00 12662.00 217.1 -12.8
16 2007-10-01 10101.00 13470.00 234.9 -8.4
And I want to apply the following changes to the variables:
# yoy absolute difference change
abs.diff = c("B1","B2")
# yoy percentage change
percent.change = c("B3")
# make the variable negative
negative = c("B4")
This is the function that I am trying to use for my data frame.
transformation = function(D,abs.diff,percent.change,negative)
{
TT <- dim(D)[1]
DData <- D[-1,]
nms <- c()
for (i in c(2:dim(D)[2])) {
# yoy absolute difference change
if (names(D)[i] %in% abs.diff)
{ DData[,i] = (D[5:TT,i]-D[1:(TT-4),i])
names(DData)[i] = paste('a',names(D)[i],sep='') }
# yoy percent. change
if (names(D)[i] %in% percent.change)
{ DData[,i] = 100*(D[5:TT,i]-D[1:(TT-4),i])/D[1:(TT-4),i]
names(DData)[i] = paste('p',names(D)[i],sep='') }
#CA.deficit
if (names(D)[i] %in% negative)
{ DData[,i] = (-1)*D[1:TT,i] }
}
return(DData)
}
This is what I would like to get :
Date pB1 pB2 aB3 B4
1 2004-01-01 NA NA NA NA
2 2004-04-01 NA NA NA NA
3 2004-07-01 NA NA NA NA
4 2004-10-01 NA NA NA NA
5 2005-01-01 39.33 -2.26 NA 3.5
6 2005-04-01 40.05 3.25 NA 14.6
7 2005-07-01 42.22 9.56 NA 11.6
8 2005-10-01 42.54 15.29 11.0 10.2
9 2006-01-01 43.13 21.43 19.3 3.1
10 2006-04-01 42.11 21.72 18.3 16.0
11 2006-07-01 37.77 16.99 22.0 4.9
12 2006-10-01 34.26 22.05 17.7 17.6
13 2007-01-01 30.17 18.3 19.7 5.3
14 2007-04-01 27.95 20.32 26.1 10.9
15 2007-07-01 27.04 26.1 39.8 12.8
16 2007-10-01 27.84 23.58 49.6 8.4
Group by the month, i.e. characters 6 and 7 of Date, using ave, and compute the percentage change within each group. With sapply we can loop over the columns.
f <- function(x) {
g <- substr(Date, 6, 7)
l <- length(unique(g))
o <- ave(x, g, FUN=function(x) 100/x * c(x[-1], NA) - 100)
c(rep(NA, l), head(o, -4))
}
cbind(df[1], sapply(df[-1], f))
# Date B1 B2 B3 B4
# 1 2004-01-01 NA NA NA NA
# 2 2004-04-01 NA NA NA NA
# 3 2004-07-01 NA NA NA NA
# 4 2004-10-01 NA NA NA NA
# 5 2005-01-01 39.32901 -2.259202 NA NA
# 6 2005-04-01 40.04796 3.250329 NA NA
# 7 2005-07-01 42.21960 9.560688 NA NA
# 8 2005-10-01 42.54044 15.291439 NA NA
# 9 2006-01-01 43.12655 21.434066 13.83738 -11.428571
# 10 2006-04-01 42.10895 21.720063 18.03279 9.589041
# 11 2006-07-01 37.77093 16.986386 17.58316 -57.758621
# 12 2006-10-01 34.25636 22.050356 17.26190 72.549020
# 13 2007-01-01 30.17296 18.304329 16.10276 70.967742
# 14 2007-04-01 27.95094 20.316253 20.35024 -31.875000
# 15 2007-07-01 27.03903 26.102978 25.34642 161.224490
# 16 2007-10-01 27.84458 23.577982 32.48731 -52.272727
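The function above applies the percentage change to every column. If each group of columns should get its own transformation, as listed in the question, one possible sketch is shown below. The helper names yoy_diff, yoy_pct and transform_yoy are made up here, the lag of 4 assumes complete quarterly data sorted by date, and df_sub holds rows 5 to 12 of the question's data frame.

```r
# Year-over-year helpers for quarterly data (lag = 4 observations per year)
yoy_diff <- function(x, lag = 4) c(rep(NA, lag), diff(x, lag = lag))
yoy_pct <- function(x, lag = 4) {
  100 * yoy_diff(x, lag) / c(rep(NA, lag), head(x, -lag))
}

# Hypothetical wrapper: a different transformation for each set of columns
transform_yoy <- function(D, abs.diff, percent.change, negative) {
  out <- D
  out[abs.diff] <- lapply(D[abs.diff], yoy_diff)
  out[percent.change] <- lapply(D[percent.change], yoy_pct)
  out[negative] <- lapply(D[negative], function(x) -x)
  out
}

# Rows 5 to 12 of the question's data (all four variables observed)
df_sub <- data.frame(
  Date = c("2005-01-01", "2005-04-01", "2005-07-01", "2005-10-01",
           "2006-01-01", "2006-04-01", "2006-07-01", "2006-10-01"),
  B1 = c(4443.62, 4876.54, 5393.01, 5885.01, 6360.00, 6930.00, 7430.00, 7901.00),
  B2 = c(7760.59, 8209.00, 8583.05, 8930.74, 9424.00, 9992.00, 10041.00, 10900.00),
  B3 = c(140.20, 140.30, 147.30, 151.20, 159.60, 165.60, 173.20, 177.30),
  B4 = c(-3.50, -14.60, -11.60, -10.20, -3.10, -16.00, -4.90, -17.60)
)

res <- transform_yoy(df_sub,
                     abs.diff = c("B1", "B2"),
                     percent.change = "B3",
                     negative = "B4")
res
```

Unlike the question's function, this sketch keeps all rows and the original column names; the first four rows of the differenced columns are NA because no value from a year earlier exists for them.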

Replace all duplicated with na

My question is similar to "replace duplicate values with NA in time series data using dplyr", but applied to other time series like the one below:
box_num date x y
6-WQ 2018-11-18 20.2 8
6-WQ 2018-11-25 500.75 7.2
6-WQ 2018-12-2 500.75 23
25-LR 2018-11-18 374.95 4.3
25-LR 2018-11-25 0.134 9.3
25-LR 2018-12-2 0.134 4
73-IU 2018-12-2 225.54 0.7562
73-IU 2018-12-9 28 0.7562
73-IU 2018-12-16 225.54 52.8
library(dplyr)
df %>%
group_by(box_num) %>%
mutate_at(vars(x:y), funs(replace(., duplicated(.), NA)))
The above code can identify duplicates and replace them with NA, but the underlying problem is that I want to replace all NA values with a linear trend in the next step, since this is a time series. For box_num 6-WQ, there is a large shift directly after 20.2, which suggests the repeated value is imputed, so I would like to replace both imputed values with NA. The other case is box_num 73-IU, where the imputed value was entered again after one week; I would like to replace those imputed values with NA as well.
Expected output :
box_num date x y
6-WQ 2018-11-18 20.2 8
6-WQ 2018-11-25 NA 7.2
6-WQ 2018-12-2 NA 23
25-LR 2018-11-18 374.95 4.3
25-LR 2018-11-25 NA 9.3
25-LR 2018-12-2 NA 4
73-IU 2018-12-2 NA NA
73-IU 2018-12-9 28 NA
73-IU 2018-12-16 NA 52.8
foo = function(x){
replace(x, ave(x, x, FUN = length) > 1, NA)
}
myCols = c("x", "y")
df1[myCols] = lapply(df1[myCols], foo)
df1
# box_num date x y
#1 6-WQ 2018-11-18 20.20 8.0
#2 6-WQ 2018-11-25 NA 7.2
#3 6-WQ 2018-12-2 NA 23.0
#4 25-LR 2018-11-18 374.95 4.3
#5 25-LR 2018-11-25 NA 9.3
#6 25-LR 2018-12-2 NA 4.0
#7 73-IU 2018-12-2 NA NA
#8 73-IU 2018-12-9 28.00 NA
#9 73-IU 2018-12-16 NA 52.8
#DATA
df1 = structure(list(box_num = c("6-WQ", "6-WQ", "6-WQ", "25-LR", "25-LR",
"25-LR", "73-IU", "73-IU", "73-IU"), date = c("2018-11-18", "2018-11-25",
"2018-12-2", "2018-11-18", "2018-11-25", "2018-12-2", "2018-12-2",
"2018-12-9", "2018-12-16"), x = c(20.2, 500.75, 500.75, 374.95,
0.134, 0.134, 225.54, 28, 225.54), y = c(8, 7.2, 23, 4.3, 9.3,
4, 0.7562, 0.7562, 52.8)), class = "data.frame", row.names = c(NA,
-9L))
With tidyverse you can do:
df %>%
group_by(box_num) %>%
mutate_at(vars(x:y), funs(ifelse(. %in% subset(rle(sort(.))$values, rle(sort(.))$length > 1), NA, .)))
box_num date x y
<fct> <fct> <dbl> <dbl>
1 6-WQ 2018-11-18 20.2 8.00
2 6-WQ 2018-11-25 NA 7.20
3 6-WQ 2018-12-2 NA 23.0
4 25-LR 2018-11-18 375. 4.30
5 25-LR 2018-11-25 NA 9.30
6 25-LR 2018-12-2 NA 4.00
7 73-IU 2018-12-2 NA NA
8 73-IU 2018-12-9 28.0 NA
9 73-IU 2018-12-16 NA 52.8
First, it sorts the values in "x" and "y" and computes the run length of equal values. Second, it creates a subset for those values that have a run length > 1. Finally, it compares whether the values in "x" and "y" are in the subset, and if so, they get NA.
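Note that the base R foo shown earlier counts repeats across the whole column, while the tidyverse version counts them within each box_num. On this data both give the same result, but they would differ if the same value occurred in two different boxes. A base R sketch of the grouped variant follows; the helper name na_dups_by_group is made up for this example.

```r
# Set every value that occurs more than once within its group to NA
na_dups_by_group <- function(x, g) {
  n <- ave(seq_along(x), g, x, FUN = length)  # occurrences of each value per group
  replace(x, n > 1, NA)
}

# Column x and the grouping column from the question's data
box_num <- c("6-WQ", "6-WQ", "6-WQ", "25-LR", "25-LR", "25-LR",
             "73-IU", "73-IU", "73-IU")
x <- c(20.2, 500.75, 500.75, 374.95, 0.134, 0.134, 225.54, 28, 225.54)

x_clean <- na_dups_by_group(x, box_num)
x_clean
```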

Aggregating data with NA values based on site

I am using the EPA NLA dataset to find the average temperature in the epilimnion for some lake data. The data set looks like this:
SITE DEPTH METALIMNION TEMP FIELD
1 0.0 NA 25.6
1 0.5 NA 25.1
1 0.8 T 24.9
1 1.0 NA 24.1
1 2.0 B 23.0
2 0.0 NA 29.0
2 0.5 T 28.0
"T" indicates the end of the epilimnion, and I want to average all corresponding temperature values including and above the "T" for each site. I have no idea where to even begin. (The "B" is irrelevant for this issue).
Thanks!
With base R you can do it like this.
I use ave twice. The first call determines, by group of SITE, where column METALIMNION has a "T". This is vector g.
The second call averages TEMP.FIELD by SITE and that vector g.
g <- with(NLA, ave(as.character(METALIMNION), SITE,
FUN = function(x) {
x[is.na(x)] <- ""
rev(cumsum(rev(x) == "T"))
}))
NLA$AVG <- ave(NLA$TEMP.FIELD, NLA$SITE, g)
NLA
# SITE DEPTH METALIMNION TEMP.FIELD AVG
#1 1 0.0 <NA> 25.6 25.20
#2 1 0.5 <NA> 25.1 25.20
#3 1 0.8 T 24.9 25.20
#4 1 1.0 <NA> 24.1 23.55
#5 1 2.0 B 23.0 23.55
#6 2 0.0 <NA> 29.0 28.50
#7 2 0.5 T 28.0 28.50
Assuming that there is only one 'T' for each value of site, using dplyr package:
library(dplyr)
data.frame(SITE=c(1,1,1,1,1,2,2),TEMP=c(25.6,25.1,24.9,24.1,23.0,29.0,28.0)) %>%
group_by(SITE) %>%
summarise(meanTemp=mean(TEMP))
Result:
# A tibble: 2 x 2
SITE meanTemp
<dbl> <dbl>
1 1 24.5
2 2 28.5
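The summarise call above averages every row in a site, including those below the "T". A hedged base R sketch that first keeps only the rows at or above the first "T" per site is given below. It assumes rows are sorted by depth within each site, and that a site without any "T" keeps all of its rows; the names is_T, above and epi are made up here.

```r
# Data from the question
NLA <- data.frame(
  SITE = c(1, 1, 1, 1, 1, 2, 2),
  DEPTH = c(0, 0.5, 0.8, 1, 2, 0, 0.5),
  METALIMNION = c(NA, NA, "T", NA, "B", NA, "T"),
  TEMP.FIELD = c(25.6, 25.1, 24.9, 24.1, 23.0, 29.0, 28.0),
  stringsAsFactors = FALSE
)

# TRUE where the row is marked "T"
is_T <- !is.na(NLA$METALIMNION) & NLA$METALIMNION == "T"

# Within each site, keep rows until a "T" has been seen on an earlier row
above <- ave(is_T, NLA$SITE,
             FUN = function(t) cumsum(c(FALSE, head(t, -1))) == 0)

# Mean temperature of the retained rows, one value per site
epi <- aggregate(TEMP.FIELD ~ SITE, data = NLA[above, ], FUN = mean)
epi
```

For this data the retained rows give means of 25.2 for site 1 and 28.5 for site 2, which agrees with the AVG column in the ave-based answer.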

Combine facet_wrap of line plot(1st dataset) with facet_wrap of point plot(2nd dataset) on single plot

I have two datasets: 1. data and 2. data_other.
1.data
data <- read.table(text = "Me EE PE DE TE DEE CE
1 1 1 4.5 2000 0.50 0.2547 0.69
2 1 2 2.4 3000 NA 0.5896 2.56
3 1 3 6.5 2345 15.24 NA 1.85
4 1 4 NA NA 18.23 1.2594 2.06
5 2 1 2.6 NA 12.25 1.5943 2.34
6 2 2 NA 3145 10.25 NA NA
7 2 3 2.7 4235 NA NA 2.90
8 2 4 NA NA 6.32 2.5990 3.18
9 3 1 3.5 NA 8.25 2.9339 3.46
10 3 2 3.8 NA NA NA NA
11 3 3 NA NA NA 3.6037 1.58
12 3 4 4.4 4325 NA 3.9386 4.30
13 4 1 4.7 NA 15.24 4.2735 4.58
14 4 2 NA 4325 6.66 NA NA
15 4 3 5.3 NA 25.20 NA NA
16 4 4 5.6 3256 NA 5.2782 5.42
17 5 1 NA 4351 25.36 5.6131 5.70
18 5 2 6.2 2345 NA 5.9480 NA
19 5 3 6.5 NA 19.36 NA NA
20 5 4 NA 4643 17.25 6.6178 6.54", header = T)
2.data_other
Me EE PE DE TE DEE CE
1 1 1.3 1452 12.5 0.2587 1.25
1 2 1.7 2458 10.8 1.5469 0.69
1 3 0.8 3524 11.96 0.5874 0.87
1 4 0.7 1905 13.58 0.9654 0.98
2 1 0.5 2941 12.54 0.2548 1.65
2 2 0.2 3183.5 13.134 1.3658 1.11
2 3 0.01 3426 13.42 0.48692 1.320666667
2 4 1.52 3668.5 13.706 1.9547 1.386095238
3 1 1.98 3911 13.992 0.36906 1.45152381
3 2 2.1 4153.5 14.278 0.31013 1.516952381
3 3 2.44 6674 15.96 0.564 0.2512
3 4 2.736 6671 16.5 1.85 1.5687
4 1 1.05 4881 15.136 0.13334 1.713238095
4 2 1.52 5123.5 15.422 0.07441 1.778666667
4 3 0.95 5366 15.708 0.01548 1.844095238
4 4 1.073 5608.5 15.994 1.2548 1.90952381
5 1 1.0233 5851 16.28 1.9658 1.974952381
5 2 0.9733 6093.5 16.566 2.2154 2.040380952
5 3 0.9233 6336 16.852 1.2587 2.105809524
5 4 2.45 6578.5 17.138 1.9657 2.171238095
I am plotting a point graph with data, using the following code:
data <- read.table("data.txt")
datad <- melt(data,id.vars = c("Me","EE"),measure.vars= c("PE","DE","TE","DEE","CE"))
ggplot(subset(datad, !is.na(EE)),aes(x=EE,y=value), as.table = T) +
geom_point(aes(color=factor(Me)), size = 3) +
facet_wrap(~variable, nrow=3, ncol=2, scales = "free_y")
I am plotting a line plot with data_other, using the following code:
data_other <- read.table("data_other.txt", header = T)
data_otherd <- melt(data_other, id.vars = c("Me","EE"), measure.vars = c("PE","DE","TE","DEE","CE"))
ggplot(subset(data_otherd, !is.na(EE)), aes(x=EE, y=value), as.table = T) +
geom_line(aes(color = factor (Me))) +
facet_wrap(~variable, nrow=3, ncol=2, scales = "free_y")
Result: [two figures: point graph of data; line graph of data_other]
I am trying to combine both datasets on the same plot (meaning the data_other line plot should be overlaid on the data point plot, or vice versa).
Both datasets have the same number of columns, and the column names are the same.
Without facet_wrap, I know I can use a different geom for each dataset and plot the line plot on top of the point graph. But with facet_wrap I don't know how to combine both datasets.
How can I wrap two datasets with different geoms onto a single plot?
Please help. Thanks in advance.
You need just one call to ggplot() with the first data frame, and then add geom_line() with data=data_otherd. As all variable names are the same, geom_line() will use the same mapping that is provided in aes() of ggplot() (also move color=factor(Me) into aes() of the ggplot() call).
ggplot(datad,aes(x=EE,y=value,color=factor(Me)))+
geom_point()+
geom_line(data=data_otherd)+
facet_wrap(~variable, nrow=3, ncol=2, scales = "free_y")
