How to read .txt tables into R

I have a .txt file which I think was output from STATA, but I'm not sure. It is a list of tables formatted like this:
Q1 | Freq. Percent Cum.
------------+-----------------------------------
answer | 35 21.08 21.08
text | 4 2.41 23.49
words | 35 21.08 44.5
something | 38 22.89 67.47
blah | 54 32.53 100.00
------------+-----------------------------------
Total | 166 100.00
Q2 | Freq. Percent Cum.
------------------+-----------------------------------
foo | 1 0.60 0.60
blahblah | 11 6.63 7.23
etc | 26 15.66 22.89
more text | 82 49.40 72.29
answer | 7 4.22 76.51
survey response | 39 23.49 100.00
------------------+-----------------------------------
Total | 166 100.00
Q3 | Freq. Percent Cum.
------------+-----------------------------------
option | 7 4.22 4.22
text | 24 14.46 18.67
blahb | 25 15.06 33.73
more text | 82 49.40 83.13
etc | 28 16.87 100.00
------------+-----------------------------------
Total | 166 100.00
It continues for about 200 questions and their respective survey answers. Does anyone know how I can quickly read each survey question into separate data frames in R?

No need for scan():
txt <- " Q1 | Freq. Percent Cum.
------------+-----------------------------------
answer | 35 21.08 21.08
text | 4 2.41 23.49
words | 35 21.08 44.5
something | 38 22.89 67.47
blah | 54 32.53 100.00
------------+-----------------------------------
Total | 166 100.00

Q2 | Freq. Percent Cum.
------------------+-----------------------------------
foo | 1 0.60 0.60
blahblah | 11 6.63 7.23
etc | 26 15.66 22.89
more text | 82 49.40 72.29
answer | 7 4.22 76.51
survey response | 39 23.49 100.00
------------------+-----------------------------------
Total | 166 100.00

Q3 | Freq. Percent Cum.
------------+-----------------------------------
option | 7 4.22 4.22
text | 24 14.46 18.67
blahb | 25 15.06 33.73
more text | 82 49.40 83.13
etc | 28 16.87 100.00
------------+-----------------------------------
Total | 166 100.00"
library(purrr)
library(dplyr)
You can just as easily read from a file as from the text vector above. The main goal here is to remove cruft and get the data into a workable form, so we drop the dashed lines and the Total lines and convert the remaining runs of spaces to commas. This makes a big assumption about your data format, so it needs to be consistent.
readLines(textConnection(txt)) %>%
  discard(~grepl("(----|Total)", .)) %>%
  gsub("[[:space:]]*\\|[[:space:]]*", ",", .) %>%
  gsub("[[:space:]][[:space:]]+", ",", .) %>%
  gsub("^,", "", .) -> lines
There's a blank line between the tables. This is another assumption the code makes. We find the blank lines and extract the lines between them (including the start and end of the text), then read each chunk into a data frame with read.csv.
starts <- c(1, which(lines == "") + 1)
ends   <- c(which(lines == "") - 1, length(lines))
map2(starts, ends, function(start, end) {
  read.csv(textConnection(lines[start:end]), stringsAsFactors=FALSE)
})
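The start/end bookkeeping can be sanity-checked on a toy vector (assuming blank-line separators, as above):

```r
# Three blocks separated by blank lines.
lines <- c("a", "b", "", "c", "", "d", "e")

starts <- c(1, which(lines == "") + 1)              # 1 4 6
ends   <- c(which(lines == "") - 1, length(lines))  # 2 4 7

# Extract each block with the paired indices.
blocks <- Map(function(s, e) lines[s:e], starts, ends)
```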
This results in a list of data frames:
## [[1]]
## Q1 Freq. Percent Cum.
## 1 answer 35 21.08 21.08
## 2 text 4 2.41 23.49
## 3 words 35 21.08 44.50
## 4 something 38 22.89 67.47
## 5 blah 54 32.53 100.00
##
## [[2]]
## Q2 Freq. Percent Cum.
## 1 foo 1 0.60 0.60
## 2 blahblah 11 6.63 7.23
## 3 etc 26 15.66 22.89
## 4 more text 82 49.40 72.29
## 5 answer 7 4.22 76.51
## 6 survey response 39 23.49 100.00
##
## [[3]]
## Q3 Freq. Percent Cum.
## 1 option 7 4.22 4.22
## 2 text 24 14.46 18.67
## 3 blahb 25 15.06 33.73
## 4 more text 82 49.40 83.13
## 5 etc 28 16.87 100.00
But, I think this might be more useful as one big data frame:
map2_df(starts, ends, function(start, end) {
  df <- read.csv(textConnection(lines[start:end]), stringsAsFactors=FALSE)
  colnames(df) %>%
    tolower() %>%
    gsub("\\.", "", .) -> cols
  question <- cols[1]
  cols[1] <- "text"
  setNames(df, cols) %>%
    mutate(question=question) %>%
    mutate(n=1:nrow(.)) %>%
    select(question, n, text, freq, percent, cum) %>%
    mutate(percent=percent/100, cum=cum/100)
})
## question n text freq percent cum
## 1 q1 1 answer 35 0.2108 0.2108
## 2 q1 2 text 4 0.0241 0.2349
## 3 q1 3 words 35 0.2108 0.4450
## 4 q1 4 something 38 0.2289 0.6747
## 5 q1 5 blah 54 0.3253 1.0000
## 6 q2 1 foo 1 0.0060 0.0060
## 7 q2 2 blahblah 11 0.0663 0.0723
## 8 q2 3 etc 26 0.1566 0.2289
## 9 q2 4 more text 82 0.4940 0.7229
## 10 q2 5 answer 7 0.0422 0.7651
## 11 q2 6 survey response 39 0.2349 1.0000
## 12 q3 1 option 7 0.0422 0.0422
## 13 q3 2 text 24 0.1446 0.1867
## 14 q3 3 blahb 25 0.1506 0.3373
## 15 q3 4 more text 82 0.4940 0.8313
## 16 q3 5 etc 28 0.1687 1.0000

Related

How to split column with double information (abs. value and %) into 2 separated columns?

In the dataframe absolute values and percentages are combined, and I want to split them into 2 separated columns:
df <- data.frame (Sales = c("74(2.08%)",
"71(2.00%)",
"58(1.63%)",
"42(1.18%)"))
Sales
1 74(2.08%)
2 71(2.00%)
3 58(1.63%)
4 42(1.18%)
Expected output
Sales Share
1 74 2.08
2 71 2.00
3 58 1.63
4 42 1.18
In base R:
read.table(text=gsub("[()%]", ' ', df$Sales), col.names = c("Sales", "Share"))
Sales Share
1 74 2.08
2 71 2.00
3 58 1.63
4 42 1.18
Or with tidyr::separate:
library(tidyr)
df %>%
  separate(Sales, c("Sales", "Share"), sep='[()%]', extra = 'drop', convert = TRUE)
Sales Share
1 74 2.08
2 71 2.00
3 58 1.63
4 42 1.18
Using tidyr::extract you could split your column into separate columns using a regex:
library(tidyr)
df |>
  extract(Sales, into = c("Sales", "Share"), regex = "^(\\d+)\\((\\d+\\.\\d+)\\%\\)$", convert = TRUE)
#> Sales Share
#> 1 74 2.08
#> 2 71 2.00
#> 3 58 1.63
#> 4 42 1.18
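Yet another base-R route is to capture each piece with sub() (a sketch on the question's df):

```r
df <- data.frame(Sales = c("74(2.08%)", "71(2.00%)", "58(1.63%)", "42(1.18%)"))

# Pull the percentage out of the parentheses, then strip them from Sales.
df$Share <- as.numeric(sub(".*\\(([0-9.]+)%\\).*", "\\1", df$Sales))
df$Sales <- as.numeric(sub("\\(.*$", "", df$Sales))
```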

Combining categories of a categorical variable

I would like to combine some Brazilian political party names from a categorical variable (partido_pref) that was wrongly coded.
The categories that I would like to combine are "PC do B" and "PCdoB", and "PT do B" and "PTdoB". The parties with and without space are the same parties.
I would rather do it in Stata but I can also work on R.
Below you will find the list of political parties.
. tab partido_pref
partido_pref | Freq. Percent Cum.
---------------+-----------------------------------
DEM | 2,267 2.14 2.14
NA | 34,848 32.84 34.98
Não disponível | 2 0.00 34.98
Outra situação | 19 0.02 35.00
PAN | 6 0.01 35.00
PC do B | 260 0.25 35.25
PCB | 2 0.00 35.25
PCdoB | 7 0.01 35.26
PCO | 1 0.00 35.26
PDT | 3,933 3.71 38.97
PFL | 6,811 6.42 45.39
PHS | 194 0.18 45.57
PL | 2,525 2.38 47.95
PMDB | 14,833 13.98 61.93
PMN | 410 0.39 62.31
PP | 5,467 5.15 67.47
PPB | 1,661 1.57 69.03
PPL | 10 0.01 69.04
PPS | 2,493 2.35 71.39
PR | 1,861 1.75 73.14
PRB | 298 0.28 73.43
PRN | 9 0.01 73.43
PRONA | 26 0.02 73.46
PRP | 273 0.26 73.72
PRTB | 121 0.11 73.83
PSB | 2,905 2.74 76.57
PSC | 480 0.45 77.02
PSD | 816 0.77 77.79
PSDB | 11,316 10.66 88.45
PSDC | 121 0.11 88.57
PSL | 273 0.26 88.83
PSOL | 4 0.00 88.83
PST | 48 0.05 88.87
PSTU | 1 0.00 88.88
PT | 5,258 4.96 93.83
PT do B | 139 0.13 93.96
PTB | 5,383 5.07 99.03
PTC | 140 0.13 99.17
PTdoB | 10 0.01 99.18
PTN | 108 0.10 99.28
PV | 702 0.66 99.94
Recusa | 2 0.00 99.94
Sem partido | 62 0.06 100.00
---------------+-----------------------------------
Total | 106,105 100.00
Thank you in advance!
One option is fct_collapse from forcats (pick whatever collapsed level names you want):
library(forcats)
fct_collapse(df1$partido_pref, pc = c("PC do B", "PCdoB"),
             pt = c("PT do B", "PTdoB"))
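If you'd rather not add a package, base R can merge factor levels by assigning a duplicated level name; a sketch on made-up data, since the question's df1 isn't shown:

```r
partido <- factor(c("PC do B", "PCdoB", "PT do B", "PTdoB", "PT"))

# Point the spaced spelling at the space-free level; levels<- merges duplicates.
levels(partido)[levels(partido) == "PC do B"] <- "PCdoB"
levels(partido)[levels(partido) == "PT do B"] <- "PTdoB"

table(partido)
```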
If your problem is just getting rid of whitespace:
replace partido_pref = subinstr(partido_pref, " ", "")
See help string_functions for more options.
R is more flexible, but Stata can handle that level of simple text management.

create a data frame based on input values

I am new to R. Your help here will be appreciated.
I have inputs such as.
columnA <- 14 # USERINPUT
columnB <- 1 # Incremented from 1.2.3.etc
columnC <- columnA * columnB
columnD <- 25 # remains constant
columnE <- columnC / columnD
columnF <- 8 # remains constant
columnG <- columnE + columnF
mydf <- data.frame(columnA,columnB,columnC,columnD,columnE,columnF,columnG)
Based on the above data frame, I need to create a data frame such that in every subsequent row the value of columnB is incremented by 1 (1, 2, 3, etc.), and we stop creating rows before the value of columnG exceeds 600. I tried to do this in Excel. Below is the kind of output I would need.
+---------+--------+---------+---------+---------+---------+---------+
| columnA | columB | columnC | columnD | columnE | columnF | columnG |
+---------+--------+---------+---------+---------+---------+---------+
| 14 | 1 | 14 | 25 | 0.56 | 8 | 8.56 |
| 14 | 2 | 28 | 25 | 1.12 | 8 | 9.12 |
| 14 | 3 | 42 | 25 | 1.68 | 8 | 9.68 |
| 14 | 4 | 56 | 25 | 2.24 | 8 | 10.24 |
| 14 | 5 | 70 | 25 | 2.8 | 8 | 10.8 |
| 14 | 6 | 84 | 25 | 3.36 | 8 | 11.36 |
| 14 | 7 | 98 | 25 | 3.92 | 8 | 11.92 |
| 14 | 8 | 112 | 25 | 4.48 | 8 | 12.48 |
+---------+--------+---------+---------+---------+---------+---------+
The end result should be a data frame.
First you can compute the number of rows of the data.frame:
userinput <- 14
N <- (600 - 8) * 25 / userinput
Then, using dplyr you create the data.frame:
mydf <- data_frame(ColA = 14, ColB = 1:floor(N), ColD = 25, ColF = 8) %>%
  mutate(ColC = ColA * ColB, ColE = ColC/ColD, ColG = ColE + ColF)
If you need the columns in the correct order:
> mydf <- mydf %>% select(ColA, ColB, ColC, ColD, ColE, ColF, ColG)
> mydf
ColA ColB ColC ColD ColE ColF ColG
1: 14 1 14 25 0.56 8 8.56
2: 14 2 28 25 1.12 8 9.12
3: 14 3 42 25 1.68 8 9.68
4: 14 4 56 25 2.24 8 10.24
5: 14 5 70 25 2.80 8 10.80
---
1053: 14 1053 14742 25 589.68 8 597.68
1054: 14 1054 14756 25 590.24 8 598.24
1055: 14 1055 14770 25 590.80 8 598.80
1056: 14 1056 14784 25 591.36 8 599.36
1057: 14 1057 14798 25 591.92 8 599.92
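The same table can also be built with no packages at all; a sketch, keeping the question's cap of 600 on columnG:

```r
columnA <- 14; columnD <- 25; columnF <- 8

# Largest columnB for which columnA*columnB/columnD + columnF stays within 600.
maxB <- floor((600 - columnF) * columnD / columnA)

mydf <- data.frame(columnA, columnB = seq_len(maxB), columnD, columnF)
mydf$columnC <- mydf$columnA * mydf$columnB
mydf$columnE <- mydf$columnC / mydf$columnD
mydf$columnG <- mydf$columnE + mydf$columnF

# Put the columns in A..G order.
mydf <- mydf[, paste0("column", LETTERS[1:7])]
```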

Merging uneven Panel Data frames in R

I have two sets of panel data that I would like to merge. The problem is that, for each respective time interval, the variable which links the two data sets appears more frequently in the first data frame than the second. My objective is to add each row from the second data set to its corresponding row in the first data set, even if that necessitates copying said row multiple times in the same time interval. Specifically, I am working with basketball data from the NBA. The first data set is a panel of Player and Date while the second is one of Team (Tm) and Date. Thus, each Team entry should be copied multiple times per date, once for each player on that team who played that day. I could do this easily in excel, but the data frames are too large.
The result is 0 observations of 52 variables. I've experimented with bind, match, different versions of merge, and I've searched for everything I can think of; but, nothing seems to address this issue specifically. Disclaimer, I am very new to R.
Here is my code up until my road block:
HGwd = "~/Documents/Fantasy/Basketball"
library(plm)
library(mice)
library(VIM)
library(nnet)
library(tseries)
library(foreign)
library(ggplot2)
library(truncreg)
library(boot)
Pdata = read.csv("2015-16PlayerData.csv", header = T)
attach(Pdata)
Pdata$Age = as.numeric(as.character(Pdata$Age))
Pdata$Date = as.Date(Pdata$Date, '%m/%e/%Y')
names(Pdata)[8] = "OppTm"
Pdata$GS = as.factor(as.character(Pdata$GS))
Pdata$MP = as.numeric(as.character(Pdata$MP))
Pdata$FG = as.numeric(as.character(Pdata$FG))
Pdata$FGA = as.numeric(as.character(Pdata$FGA))
Pdata$X2P = as.numeric(as.character(Pdata$X2P))
Pdata$X2PA = as.numeric(as.character(Pdata$X2PA))
Pdata$X3P = as.numeric(as.character(Pdata$X3P))
Pdata$X3PA = as.numeric(as.character(Pdata$X3PA))
Pdata$FT = as.numeric(as.character(Pdata$FT))
Pdata$FTA = as.numeric(as.character(Pdata$FTA))
Pdata$ORB = as.numeric(as.character(Pdata$ORB))
Pdata$DRB = as.numeric(as.character(Pdata$DRB))
Pdata$TRB = as.numeric(as.character(Pdata$TRB))
Pdata$AST = as.numeric(as.character(Pdata$AST))
Pdata$STL = as.numeric(as.character(Pdata$STL))
Pdata$BLK = as.numeric(as.character(Pdata$BLK))
Pdata$TOV = as.numeric(as.character(Pdata$TOV))
Pdata$PF = as.numeric(as.character(Pdata$PF))
Pdata$PTS = as.numeric(as.character(Pdata$PTS))
PdataPD = plm.data(Pdata, index = c("Player", "Date"))
attach(PdataPD)
Tdata = read.csv("2015-16TeamData.csv", header = T)
attach(Tdata)
Tdata$Date = as.Date(Tdata$Date, '%m/%e/%Y')
names(Tdata)[3] = "OppTm"
Tdata$MP = as.numeric(as.character(Tdata$MP))
Tdata$FG = as.numeric(as.character(Tdata$FG))
Tdata$FGA = as.numeric(as.character(Tdata$FGA))
Tdata$X2P = as.numeric(as.character(Tdata$X2P))
Tdata$X2PA = as.numeric(as.character(Tdata$X2PA))
Tdata$X3P = as.numeric(as.character(Tdata$X3P))
Tdata$X3PA = as.numeric(as.character(Tdata$X3PA))
Tdata$FT = as.numeric(as.character(Tdata$FT))
Tdata$FTA = as.numeric(as.character(Tdata$FTA))
Tdata$PTS = as.numeric(as.character(Tdata$PTS))
Tdata$Opp.FG = as.numeric(as.character(Tdata$Opp.FG))
Tdata$Opp.FGA = as.numeric(as.character(Tdata$Opp.FGA))
Tdata$Opp.2P = as.numeric(as.character(Tdata$Opp.2P))
Tdata$Opp.2PA = as.numeric(as.character(Tdata$Opp.2PA))
Tdata$Opp.3P = as.numeric(as.character(Tdata$Opp.3P))
Tdata$Opp.3PA = as.numeric(as.character(Tdata$Opp.3PA))
Tdata$Opp.FT = as.numeric(as.character(Tdata$Opp.FT))
Tdata$Opp.FTA = as.numeric(as.character(Tdata$Opp.FTA))
Tdata$Opp.PTS = as.numeric(as.character(Tdata$Opp.PTS))
TdataPD = plm.data(Tdata, index = c("OppTm", "Date"))
attach(TdataPD)
PD = merge(PdataPD, TdataPD, by = "OppTm", all.x = TRUE)
attach(PD)
Any help on how to do this would be greatly appreciated!
EDIT
I've tweaked it a little from last night, but still nothing seems to do the trick. See the above, updated code for what I am currently using.
Here is the output for head(PdataPD):
Player Date Rk Pos Tm X..H OppTm W.L GS MP FG FGA FG. X2P
22408 Aaron Brooks 2015-10-27 817 G CHI CLE W 0 16 3 9 0.333 3
22144 Aaron Brooks 2015-10-28 553 G CHI # BRK W 0 16 5 9 0.556 3
21987 Aaron Brooks 2015-10-30 396 G CHI # DET L 0 18 2 6 0.333 1
21456 Aaron Brooks 2015-11-01 4687 G CHI ORL W 0 16 3 11 0.273 3
21152 Aaron Brooks 2015-11-03 4383 G CHI # CHO L 0 17 5 8 0.625 1
20805 Aaron Brooks 2015-11-05 4036 G CHI OKC W 0 13 4 8 0.500 3
X2PA X2P. X3P X3PA X3P. FT FTA FT. ORB DRB TRB AST STL BLK TOV PF PTS GmSc
22408 8 0.375 0 1 0.000 0 0 NA 0 2 2 0 0 0 2 1 6 -0.9
22144 3 1.000 2 6 0.333 0 0 NA 0 1 1 3 1 0 1 4 12 8.5
21987 2 0.500 1 4 0.250 0 0 NA 0 4 4 4 0 0 0 1 5 5.2
21456 6 0.500 0 5 0.000 0 0 NA 2 1 3 1 1 1 1 4 6 1.0
21152 3 0.333 4 5 0.800 0 0 NA 0 0 0 4 1 0 0 4 14 12.6
20805 5 0.600 1 3 0.333 0 0 NA 1 1 2 0 0 0 0 1 9 5.6
FPTS H.A
22408 7.50 H
22144 20.25 A
21987 16.50 A
21456 14.75 H
21152 24.00 A
20805 12.00 H
And for head(TdataPD):
OppTm Date Rk X Opp Result MP FG FGA FG. X2P X2PA X2P. X3P X3PA
2105 ATL 2015-10-27 71 DET L 94-106 240 37 82 0.451 29 55 0.527 8 27
2075 ATL 2015-10-29 41 # NYK W 112-101 240 42 83 0.506 32 59 0.542 10 24
2047 ATL 2015-10-30 13 CHO W 97-94 240 36 83 0.434 28 60 0.467 8 23
2025 ATL 2015-11-01 437 # CHO W 94-92 240 37 88 0.420 30 59 0.508 7 29
2001 ATL 2015-11-03 413 # MIA W 98-92 240 37 90 0.411 30 69 0.435 7 21
1973 ATL 2015-11-04 385 BRK W 101-87 240 37 76 0.487 29 54 0.537 8 22
X3P. FT FTA FT. PTS Opp.FG Opp.FGA Opp.FG. Opp.2P Opp.2PA Opp.2P. Opp.3P
2105 0.296 12 15 0.800 94 37 96 0.385 25 67 0.373 12
2075 0.417 18 26 0.692 112 38 93 0.409 32 64 0.500 6
2047 0.348 17 22 0.773 97 36 88 0.409 24 58 0.414 12
2025 0.241 13 14 0.929 94 32 86 0.372 18 49 0.367 14
2001 0.333 17 22 0.773 98 38 86 0.442 33 58 0.569 5
1973 0.364 19 24 0.792 101 36 83 0.434 31 62 0.500 5
Opp.3PA Opp.3P. Opp.FT Opp.FTA Opp.FT. Opp.PTS
2105 29 0.414 20 26 0.769 106
2075 29 0.207 19 21 0.905 101
2047 30 0.400 10 13 0.769 94
2025 37 0.378 14 15 0.933 92
2001 28 0.179 11 16 0.688 92
1973 21 0.238 10 13 0.769 87
If there is a way to truncate the output from dput(head(___)), I am not familiar with it. It appears that simply erasing the excess characters would remove entire variables from the dataset.
It would help if you posted your data (or a working subset of it) and a little more detail on how you are trying to merge, but if I understand what you are trying to do, you want each final data record to have individual stats for each player on a particular date followed by the player's team's stats for that date. In this case, you should have a team column in the Player table that identifies the player's team, and then join the two tables on the composite key Date and Team by setting the by= attribute in merge:
merge(PData, TData, by=c("Date", "Team"))
The fact that the data frames are of different lengths doesn't matter; this is exactly what join/merge operations are for.
For an alternative to merge(), you might check out the dplyr package join functions at https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html
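To see the many-to-one behaviour concretely, here is a toy version (made-up rows; the real frames just have more columns, and Tm is the player's own team as in the question's player table):

```r
Pdata <- data.frame(Player = c("Aaron Brooks", "Pau Gasol", "Kyle Lowry"),
                    Tm = c("CHI", "CHI", "TOR"),
                    Date = as.Date("2015-10-27"),
                    PTS = c(6, 12, 20))
Tdata <- data.frame(Tm = c("CHI", "TOR"),
                    Date = as.Date("2015-10-27"),
                    TeamPTS = c(97, 106))

# Both CHI players pick up the same CHI team row for that date.
PD <- merge(Pdata, Tdata, by = c("Tm", "Date"))
```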

Manipulating Data in R

I have data a data frame in the following structure
transaction | customer | week | amount
12551 | ieeamo | 32 | €23.54
12553 | ieeamo | 33 | €17.00
I would like to get it in the following structure (for all weeks)
week | customer | activity last week | activity 2 weeks ago
32 | ieeamo | €0.00 | €0.00
33 | ieeamo | €23.54 | €0.00
34 | ieeamo | €17.00 | €23.54
35 | ieeamo | €0.00 | €17.00
Essentially, I am trying to convert transactional data to relative data.
My thought is that the best way to do this is to use loops to generate many data frames and then rbind them all at the end. However, this approach does not seem efficient, and I'm not sure it will scale to the data I am using.
Is there a more proper solution?
Rbinding is a bad idea for this, since each rbind creates a new copy of the data frame in memory. We can get to the answer more quickly with a mostly vectorized approach, using loops only to make code more concise. Props to the OP for recognizing the inefficiency and searching for a solution.
Note: The following solution will work for any number of customers, but would require minor modification to work with more lag columns.
Setup: First we need to generate some data to work with. I'm going to use two different customers with a few weeks of transactional data each, like so:
data <- read.table(text="
transaction customer week amount
12551 cOne 32 1.32
12552 cOne 34 1.34
12553 cTwo 34 2.34
12554 cTwo 35 2.35
12555 cOne 36 1.36
12556 cTwo 37 1.37
", header=TRUE)
Step 1: Calculate some variables and initialize new data frame. To make the programming really easy, we first want to know two things: how many customers and how many weeks? We calculate those answers like so:
customer_list <- unique(data$customer)
# cOne cTwo
week_span <- min(data$week):max(data$week)
# 32 33 34 35 36 37
Next, we need to initialize the new data frame based on the variables we just calculated. In this new data frame, we need an entry for every week, not just the weeks in the data. This is where our 'week_span' variable comes in useful.
new_data <- data.frame(
  week=sort(rep(week_span, length(customer_list))),
  customer=customer_list,
  activity_last_week=NA,
  activity_2_weeks_ago=NA)
# week customer activity_last_week activity_2_weeks_ago
# 1 32 cOne NA NA
# 2 32 cTwo NA NA
# 3 33 cOne NA NA
# 4 33 cTwo NA NA
# 5 34 cOne NA NA
# 6 34 cTwo NA NA
# 7 35 cOne NA NA
# 8 35 cTwo NA NA
# 9 36 cOne NA NA
# 10 36 cTwo NA NA
# 11 37 cOne NA NA
# 12 37 cTwo NA NA
You'll notice we repeat the week list for each customer and sort it, so we get a list resembling 1,1,2,2,3,3,4,4...n,n with a number of repetitions equal to the number of customers in the data. This makes it so we can specify the 'customer' data as just the list of customers, since the list will repeat to fill up the space. The lag columns are left as NA for now.
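The rep/sort trick on its own, assuming three weeks and two customers:

```r
week_span <- 1:3
customer_list <- c("cOne", "cTwo")

# One slot per customer per week: 1 1 2 2 3 3
weeks <- sort(rep(week_span, length(customer_list)))
```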
Step 2: Fill in the lag values. Now, things are pretty simple. We just need to grab the subset of rows for each customer and find out if there were any transactions for each week. We do this by using the 'match' function to pull out values for every week. Where data does not exist, we'll get an NA value and need to replace those with zeros (assuming no activity means a zero transaction). Then, for the lag columns, we just offset the values with NA depending on the number of weeks we are lagging.
# Loop through the customers.
for (i in 1:length(customer_list)){
  # Select the next customer's data.
  subset <- data[data$customer==customer_list[i],]
  # Extract the data values for each week.
  subset_amounts <- subset$amount[match(week_span, subset$week)]
  # Replace NA with zero.
  subset_amounts <- ifelse(is.na(subset_amounts), 0, subset_amounts)
  # Loop through the lag columns.
  for (lag in 1:2){
    # Write in the data values with the appropriate
    # number of offsets according to the lag.
    # Truncate the extra values.
    new_data[new_data$customer==customer_list[i], (2 + lag)] <-
      c(rep(NA, lag), subset_amounts[1:(length(subset_amounts) - lag)])
  }
}
# week customer activity_last_week activity_2_weeks_ago
# 1 32 cOne NA NA
# 2 32 cTwo NA NA
# 3 33 cOne 1.32 NA
# 4 33 cTwo 0.00 NA
# 5 34 cOne 0.00 1.32
# 6 34 cTwo 0.00 0.00
# 7 35 cOne 1.34 0.00
# 8 35 cTwo 2.34 0.00
# 9 36 cOne 0.00 1.34
# 10 36 cTwo 2.35 2.34
# 11 37 cOne 1.36 0.00
# 12 37 cTwo 0.00 2.35
In other situations... If you have a series of ordered time data where no rows are missing, this sort of task becomes incredibly simple with the 'embed' function. Let's say we have some data that looks like this:
data <- data.frame(week=1:20, value=1:20+(1:20/100))
# week value
# 1 1 1.01
# 2 2 2.02
# 3 3 3.03
# 4 4 4.04
# 5 5 5.05
# 6 6 6.06
# 7 7 7.07
# 8 8 8.08
# 9 9 9.09
# 10 10 10.10
# 11 11 11.11
# 12 12 12.12
# 13 13 13.13
# 14 14 14.14
# 15 15 15.15
# 16 16 16.16
# 17 17 17.17
# 18 18 18.18
# 19 19 19.19
# 20 20 20.20
We could make a lagged data set in no time, like so:
new_data <- data.frame(week=data$week[3:20], embed(data$value,3))
names(new_data)[2:4] <- c("this_week", "last_week", "2_weeks_ago")
# week this_week last_week 2_weeks_ago
# 1 3 3.03 2.02 1.01
# 2 4 4.04 3.03 2.02
# 3 5 5.05 4.04 3.03
# 4 6 6.06 5.05 4.04
# 5 7 7.07 6.06 5.05
# 6 8 8.08 7.07 6.06
# 7 9 9.09 8.08 7.07
# 8 10 10.10 9.09 8.08
# 9 11 11.11 10.10 9.09
# 10 12 12.12 11.11 10.10
# 11 13 13.13 12.12 11.11
# 12 14 14.14 13.13 12.12
# 13 15 15.15 14.14 13.13
# 14 16 16.16 15.15 14.14
# 15 17 17.17 16.16 15.15
# 16 18 18.18 17.17 16.16
# 17 19 19.19 18.18 17.17
# 18 20 20.20 19.19 18.18
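For completeness, dplyr and tidyr (not used in the answer above) reduce the whole task to a complete-then-lag pipeline; a sketch, with df standing in for the question's transaction table:

```r
library(dplyr)
library(tidyr)

df <- data.frame(customer = "ieeamo", week = c(32, 33), amount = c(23.54, 17.00))

result <- df %>%
  # One row per customer per week, zero-filling weeks with no transactions.
  complete(customer, week = min(week):(max(week) + 2), fill = list(amount = 0)) %>%
  group_by(customer) %>%
  arrange(week, .by_group = TRUE) %>%
  mutate(activity_last_week   = lag(amount, 1, default = 0),
         activity_2_weeks_ago = lag(amount, 2, default = 0)) %>%
  ungroup()
```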
