copy as long as value does not change in R - r

I have a data frame which contains multiple rows; see the example.
df <- data.frame(rbind(c('1','CAR','Jan'),
c('2','3','4'),
c('5','6','7'),
c('8','CAR','Feb'),
c('9','10', '11'),
c('12','13','14')))
I would like to take the value that comes after CAR (Jan and Feb) and copy it into a new column X4 until CAR appears again. The number of rows between the CARs varies, but the number of columns is always the same.
The data should look like this
data.frame(rbind(c('1','CAR','Jan','Jan' ),
c('2','3','4','Jan'),
c('5','6','7','Jan'),
c('8','CAR','Feb','Feb'),
c('9','10','11','Feb'),
c('12','13','14','Feb')))
I have tried different options (ifelse, if, for loop), but none of them provides the right result.
Would you have any hints on how to solve this?
Thanks in advance
Eric

Here's another data.table solution
library(data.table)
setDT(df)[, X4 := X3[1L], by = cumsum(X2 == "CAR")]
df
# X1 X2 X3 X4
# 1: 1 CAR Jan Jan
# 2: 2 3 4 Jan
# 3: 5 6 7 Jan
# 4: 8 CAR Feb Feb
# 5: 9 10 11 Feb
# 6: 12 13 14 Feb
We could also do a similar thing using dplyr (but it will add an indx column too)
library(dplyr)
df %>%
group_by(indx = cumsum(X2 == "CAR")) %>%
mutate(X4 = X3[1L])

You can try
library(data.table)
library(zoo)
setDT(df)[X2=='CAR', X4:= X3][, X4:= na.locf(X4)]
# X1 X2 X3 X4
#1: 1 CAR Jan Jan
#2: 2 3 4 Jan
#3: 5 6 7 Jan
#4: 8 CAR Feb Feb
#5: 9 10 11 Feb
#6: 12 13 14 Feb

Here's an uglier, base-R version of David's answer:
df$X4 <- unlist(tapply(
df$X3,
cumsum(df$X2=="CAR"),
function(x) rep(as.character(x[1]), length(x)) # works for factor and character columns
))
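To see why grouping by cumsum(df$X2 == "CAR") works, here is a minimal base-R sketch that uses ave() instead of tapply(). It is an illustration under assumptions, not part of the answers above: the example data is rebuilt with explicitly named columns X1 to X3, and the columns are assumed to be character vectors (stringsAsFactors = FALSE is set to make that explicit).

```r
# Rebuild the example data with character columns (assumption: no factors)
df <- data.frame(X1 = c("1", "2", "5", "8", "9", "12"),
                 X2 = c("CAR", "3", "6", "CAR", "10", "13"),
                 X3 = c("Jan", "4", "7", "Feb", "11", "14"),
                 stringsAsFactors = FALSE)

# The running count of "CAR" rows increases by one at every CAR,
# so each block of rows gets its own group id: 1 1 1 2 2 2
grp <- cumsum(df$X2 == "CAR")

# Within each group, repeat the first X3 value (the one on the CAR row)
df$X4 <- ave(df$X3, grp, FUN = function(x) rep(x[1], length(x)))
df
```

Because ave() returns its result in the original row order, no unlist() step is needed.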

Related

To apply mutate with another line

I have a table and I would like to add a column that calculates the ratio compared to the previous line.
The calculation divides the value on line 1 by the value on line 2, and the result is written on line 2.
Example
month <- c(10,11,12,13,14,15)
sell <-c(258356,278958,287928,312254,316287,318999)
df <- data.frame(month, sell)
df %>% mutate(augmentation = sell[month]/sell[month+1])
month sell resultat
1 10 258356 NA
2 11 278958 0.9261466
3 12 287928 0.9688464
4 13 312254 0.9220955
5 14 316287 0.9872489
6 15 318999 0.9914984
dplyr
You can just use lag like this:
library(dplyr)
df %>%
mutate(resultat = lag(sell)/sell)
Output:
month sell resultat
1 10 258356 NA
2 11 278958 0.9261466
3 12 287928 0.9688464
4 13 312254 0.9220955
5 14 316287 0.9872489
6 15 318999 0.9914984
data.table
Another option is using shift:
library(data.table)
setDT(df)[, resultat:= shift(sell)/sell][]
Output:
month sell resultat
1: 10 258356 NA
2: 11 278958 0.9261466
3: 12 287928 0.9688464
4: 13 312254 0.9220955
5: 14 316287 0.9872489
6: 15 318999 0.9914984
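For comparison, the same calculation can be sketched in base R without any packages. This is only an illustration of what lag() and shift() do here; the helper vector name previous is made up for the example.

```r
month <- c(10, 11, 12, 13, 14, 15)
sell <- c(258356, 278958, 287928, 312254, 316287, 318999)
df <- data.frame(month, sell)

# Shift sell down by one row: the first row has no predecessor, so it gets NA
previous <- c(NA, head(df$sell, -1))

# Previous row divided by current row, as in lag(sell) / sell
df$resultat <- previous / df$sell
df
```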

Transpose column and group dataframe [duplicate]

I'm trying to change a dataframe in R to group multiple rows by a measurement. The table has a location (km), a size (mm), a count of things in that size bin, a site and a year. I want to take the sizes, make a column from each one (2, 4 and 6 in this example), and place the corresponding count into the row for that location, site and year.
It seems like a combination of transposing and grouping, but I can't figure out a way to accomplish this in R. I've looked at t(), dcast() and aggregate(), but those aren't really close at all.
So I would go from something like this:
df <- data.frame(km=c(rep(32,3),rep(50,3)), mm=rep(c(2,4,6),2), count=sample(1:25,6), site=rep("A", 6), year=rep(2013, 6))
km mm count site year
1 32 2 18 A 2013
2 32 4 2 A 2013
3 32 6 12 A 2013
4 50 2 3 A 2013
5 50 4 17 A 2013
6 50 6 21 A 2013
To this:
km site year mm_2 mm_4 mm_6
1 32 A 2013 18 2 12
2 50 A 2013 3 17 21
Edit: I tried the solution in a suggested duplicate, but it did not work for me; I'm not really sure why. The answer below worked better.
As suggested in the comment above, we can use the sep argument in spread:
library(tidyr)
spread(df, mm, count, sep = "_")
km site year mm_2 mm_4 mm_6
1 32 A 2013 4 20 1
2 50 A 2013 15 14 22
As you mentioned dcast(), here is a method using it.
set.seed(1)
df <- data.frame(km=c(rep(32,3),rep(50,3)),
mm=rep(c(2,4,6),2),
count=sample(1:25,6),
site=rep("A", 6),
year=rep(2013, 6))
library(reshape2)
dcast(df, ... ~ mm, value.var="count")
# km site year 2 4 6
# 1 32 A 2013 13 10 20
# 2 50 A 2013 3 17 1
And if you want a bit of a challenge you can try the base function reshape().
df2 <- reshape(df, v.names="count", idvar="km", timevar="mm", ids="mm", direction="wide")
colnames(df2) <- sub("count.", "mm_", colnames(df2))
df2
# km site year mm_2 mm_4 mm_6
# 1 32 A 2013 13 10 20
# 4 50 A 2013 3 17 1

R: Insert and fill missing periods in panel data

I'm trying to learn R coming from Stata, but have run into the following two problems which I cannot seem to find elegant solutions for in R:
1) I have a panel dataset with gaps in my time variable. I would like to expand my time variable to include the gaps despite having no observed data for these rows.
In Stata I would usually go about this by setting my ID and time variables with xtset and then expanding the dataset based on this with tsfill. Is there an equivalently elegant way in R?
2) I would like to fill some of the new, blank cells with data for constant variables.
In Stata I would do this by copying data from previous (relative to my time variable) observations using the l.-prefix; for example using replace Con = l.Con.
In other words I'm asking how to go from something like this:
ID Time Num Con
1 Jan 10 A
1 Feb 15 A
1 May 20 A
2 Feb 12 B
2 Mar 14 B
2 Jun 15 B
To something like this:
ID Time Num Con
1 Jan 10 A
1 Feb 15 A
1 Mar A
1 Apr A
1 May 20 A
2 Feb 12 B
2 Mar 14 B
2 Apr B
2 May B
2 Jun 15 B
Hopefully that makes sense. Thanks in advance.
You can try merge from base R or the data.table join
library(data.table)
DT2 <- setDT(df1)[, {tmp <- match(Time, month.abb)
list(Time=month.abb[min(tmp):max(tmp)])}, .(ID,Con)]
setkey(df1[, c(1,4,2,3), with=FALSE], ID, Con, Time)[DT2]
# ID Con Time Num
# 1: 1 A Jan 10
# 2: 1 A Feb 15
# 3: 1 A Mar NA
# 4: 1 A Apr NA
# 5: 1 A May 20
# 6: 2 B Feb 12
# 7: 2 B Mar 14
# 8: 2 B Apr NA
# 9: 2 B May NA
#10: 2 B Jun 15
NOTE: It may be better to keep missing values as NA
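The merge route mentioned above could look roughly like the following base-R sketch. It rests on some assumptions: the input is rebuilt here as df1 from the question's table, Time holds English month abbreviations that match month.abb, and Con is constant within each ID. The helper names m and full are made up for the illustration.

```r
df1 <- data.frame(ID = c(1, 1, 1, 2, 2, 2),
                  Time = c("Jan", "Feb", "May", "Feb", "Mar", "Jun"),
                  Num = c(10, 15, 20, 12, 14, 15),
                  Con = c("A", "A", "A", "B", "B", "B"),
                  stringsAsFactors = FALSE)

# Turn month abbreviations into month numbers so ranges can be built
df1$m <- match(df1$Time, month.abb)

# One row per ID for every month between its first and last observation
full <- do.call(rbind, lapply(split(df1, df1$ID), function(d) {
  data.frame(ID = d$ID[1], Con = d$Con[1], m = min(d$m):max(d$m),
             stringsAsFactors = FALSE)
}))

# Left join: months without observations keep NA in Num
res <- merge(full, df1[, c("ID", "m", "Num")], by = c("ID", "m"), all.x = TRUE)
res <- res[order(res$ID, res$m), ]
res$Time <- month.abb[res$m]
res <- res[, c("ID", "Time", "Num", "Con")]
rownames(res) <- NULL
res
```

Con is filled for the new rows because it is taken from the first row of each ID when the full grid is built, which only makes sense for variables that really are constant per ID.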

Clean way to calculate both group and overall statistics

I would like to calculate the median not only for different groups of my data, but also the median over all groups, and store the result in a single data.frame. While accomplishing each of these tasks separately is easy, I have not found a clean way to do both at the same time.
Right now, what I'm doing is calculate both statistics separately; then join the results; then tidy the data if necessary. Here's an example of what this may look like if I wanted to know the mean delay per day and per month:
library(dplyr)
library(hflights)
data(hflights)
# Calculate both statistics separately
per_day <- hflights %>%
group_by(Year, Month, DayofMonth) %>%
summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
mutate(Interval = "Daily")
per_month <- hflights %>%
group_by(Year, Month) %>%
summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
mutate(Interval = "Monthly", DayofMonth = NA)
# Join into a single data.frame
my_summary <- full_join(per_day, per_month,
by = c("Year", "Month", "DayofMonth", "Interval", "Delay"))
my_summary
# Source: local data frame [377 x 5]
# Groups: Year, Month
#
# Year Month DayofMonth Delay Interval
# 1 2011 1 1 10.067642 Daily
# 2 2011 1 2 10.509745 Daily
# 3 2011 1 3 6.038627 Daily
# 4 2011 1 4 7.970740 Daily
# 5 2011 1 5 4.172650 Daily
# 6 2011 1 6 6.069909 Daily
# 7 2011 1 7 3.907295 Daily
# 8 2011 1 8 3.070140 Daily
# 9 2011 1 9 17.254325 Daily
# 10 2011 1 10 11.040388 Daily
# .. ... ... ... ... ...
Are there better ways to do this?
(Note that in many cases one could easily progressively roll up summaries as pointed out in the Introduction to dplyr. However, this doesn't work for statistics like median, mean etc.)
As a one-off table. This is fairly straightforward in data.table:
require(data.table)
setDT(hflights)[,{
mo_del <- mean(ArrDelay,na.rm=TRUE)
.SD[,.(DailyDelay = mean(ArrDelay,na.rm=TRUE),MonthlyDelay = mo_del),by=DayofMonth]
},by=.(Year,Month)]
# Year Month DayofMonth DailyDelay MonthlyDelay
# 1: 2011 1 1 10.0676417 4.926065
# 2: 2011 1 2 10.5097451 4.926065
# 3: 2011 1 3 6.0386266 4.926065
# 4: 2011 1 4 7.9707401 4.926065
# 5: 2011 1 5 4.1726496 4.926065
# ---
# 361: 2011 12 14 1.0293610 5.013244
# 362: 2011 12 17 -0.1049822 5.013244
# 363: 2011 12 24 -4.1457490 5.013244
# 364: 2011 12 25 -2.2976827 5.013244
# 365: 2011 12 31 46.4846491 5.013244
How it works. The basic syntax is DT[i,j,by].
With by=.(Year,Month), all operations in j are done per "by group."
We can nest another "by group" using the data.table of the current Subset of Data, .SD.
To return columns in j we use .(colname1=col1,colname2=col2,...).
Creating new variables. Alternately, we could create new variables in hflights using := in j.
hflights[,DailyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month,DayofMonth)]
hflights[,MonthlyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month)]
Then we can view the summary table:
hflights[,.GRP,.(Year,Month,DayofMonth,DailyDelay,MonthlyDelay)]
# Year Month DayofMonth DailyDelay MonthlyDelay .GRP
# 1: 2011 1 1 10.0676417 4.926065 1
# 2: 2011 1 2 10.5097451 4.926065 2
# 3: 2011 1 3 6.0386266 4.926065 3
# 4: 2011 1 4 7.9707401 4.926065 4
# 5: 2011 1 5 4.1726496 4.926065 5
# ---
# 361: 2011 12 14 1.0293610 5.013244 361
# 362: 2011 12 17 -0.1049822 5.013244 362
# 363: 2011 12 24 -4.1457490 5.013244 363
# 364: 2011 12 25 -2.2976827 5.013244 364
# 365: 2011 12 31 46.4846491 5.013244 365
(Something needed to be put in j here, so I used the "by group" code, .GRP.)
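The idea of computing a statistic at two grouping levels and keeping both as columns can also be sketched in base R with ave(). This is an illustration on a tiny made-up data frame (the name flights and its values are invented for the example), not a replacement for the data.table code above.

```r
# Toy data: two days in month 1, one day in month 2 (values are invented)
flights <- data.frame(Month = c(1, 1, 1, 1, 2, 2),
                      Day   = c(1, 1, 2, 2, 1, 1),
                      Delay = c(10, 20, 30, 40, 5, 15))

# ave() applies mean() within each group and returns one value per row
flights$DailyDelay   <- ave(flights$Delay, flights$Month, flights$Day)
flights$MonthlyDelay <- ave(flights$Delay, flights$Month)

# Collapse to one row per day, keeping both statistics side by side
summary_tbl <- unique(flights[, c("Month", "Day", "DailyDelay", "MonthlyDelay")])
summary_tbl
```

Swapping in a different statistic is a matter of passing FUN = median (or another function) to each ave() call.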

Read Data into Time Series Object in R

My data looks as follows:
Month/Year;Number
01/2010; 1.0
02/2010;19.0
03/2010; 1.0
...
How can I read this into a ts(object) in R?
Try this (assuming your data is called df)
ts(df$Number, start = c(2010, 01), frequency = 12)
## Jan Feb Mar
## 2010 1 19 1
Edit: this will work only if you don't have missing dates and your data is in the correct order. For a more general solution see @Ananda's answer below
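If the data is still in the raw semicolon-separated form, one possible way to get from text to ts() is sketched below. It assumes the first three rows shown in the question (the elided rows are left out), inline text instead of a file, and complete, ordered months, as the note above requires. The variable names txt, first and x are made up for the example.

```r
txt <- "Month/Year;Number
01/2010; 1.0
02/2010;19.0
03/2010; 1.0"

# read.table() renames the header "Month/Year" to "Month.Year"
df <- read.table(text = txt, sep = ";", header = TRUE,
                 stringsAsFactors = FALSE)

# Split the first date into month and year to derive the start of the series
first <- as.integer(strsplit(df$Month.Year[1], "/")[[1]])  # c(month, year)

x <- ts(df$Number, start = c(first[2], first[1]), frequency = 12)
x
```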
I would recommend using zoo as a starting point. This will ensure that if there are any month/year combinations missing, they would be handled properly.
Example (notice that data for April is missing):
mydf <- data.frame(Month.Year = c("01/2010", "02/2010", "03/2010", "05/2010"),
Number = c(1, 19, 1, 12))
mydf
# Month.Year Number
# 1 01/2010 1
# 2 02/2010 19
# 3 03/2010 1
# 4 05/2010 12
library(zoo)
as.ts(zoo(mydf$Number, as.yearmon(mydf$Month.Year, "%m/%Y")))
# Jan Feb Mar Apr May
# 2010 1 19 1 NA 12

Resources