R script to format a data table to exactly 2 decimal places

I have made a data table "Event_Table" with 46 rows and 6 columns. At some point I export this to a text file and would like some fields in the output to show exactly 2 decimal places.
Event_Table[1:34,3:6]=round(Event_Table[1:34,3:6])
Event_Table[36:39,3:6]=format(round(Event_Table[36:39,3:6],2), nsmall=2)
Event_Table[41:46,3:6]=format(round(Event_Table[41:46,3:6],2), nsmall=2)
Lines 1 and 2 produce the desired result, but subsequently running line 3 throws an error:
Error in Math.data.frame(list(CO = c("0", "0", "0.786407766990291", "0", :
  non-numeric variable in data frame: CO, NCO, NA, Total
Why? If I remove line 2, then line 3 runs fine. So something about setting the formatting in one part of the table is affecting the entire table and prevents a second format command from being possible (even though the formatting is only applied to discrete parts of the table). Any ideas how to avoid this, or how to achieve what is required in a different way?
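For reference, here is a minimal sketch (with made-up numbers, not the real table) of what appears to be happening: format() returns character values, and writing those into any rows of a data frame turns the affected columns into character columns, so a later round() on the frame fails:
df <- data.frame(CO = c(1, 2.3456), Total = c(3, 4.5678))
sapply(df, class)                                # "numeric" "numeric"
df[2, ] <- format(round(df[2, ], 2), nsmall = 2) # format() returns character
sapply(df, class)                                # "character" "character"
round(df, 2)                                     # Error in Math.data.frame(...):
                                                 #   non-numeric variable in data frame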
EDIT:
I should perhaps add that the following code is not quite sufficient:
Event_Table[36:46,3:6]=round(Event_Table[36:46,3:6], digits=2)
Trailing zeros are dropped, i.e. a value of 1 is displayed as "1", not as "1.00". The latter is what is required.
EDIT2:
Here is the table:
ChrSize Chr CO NCO NA Total
1 230218 1 4.00 1.00 0 5.00
2 813184 2 6.00 6.00 0 12.00
3 316620 3 2.00 3.00 0 5.00
4 1531933 4 13.00 20.00 0 33.00
5 576874 5 3.00 8.00 0 11.00
6 270161 6 4.00 2.00 0 6.00
7 1090940 7 11.00 5.00 0 16.00
8 562643 8 5.00 9.00 0 14.00
9 439888 9 6.00 3.00 0 9.00
10 745751 10 10.00 6.00 0 16.00
11 666816 11 3.00 7.00 0 10.00
12 1078177 12 11.00 13.00 1 25.00
13 924431 13 7.00 12.00 0 19.00
14 784333 14 5.00 6.00 1 12.00
15 1091291 15 6.00 17.00 0 23.00
16 948066 16 7.00 6.00 0 13.00
17 12071326 TOTAL 103.00 124.00 2 229.00
18 NA Event Lengths: NA NA NA NA
19 NA Min Len 0.00 22.00 0 0.00
20 NA Max Len 14745.00 12524.00 0 14745.00
21 NA Mean Len 2588.00 1826.00 0 2153.00
22 NA Median Len 1820.00 1029.00 0 1322.00
23 NA Chromatids: NA NA NA NA
24 NA 1_chrom 0.00 98.00 2 100.00
25 NA 2_chrom 81.00 22.00 0 103.00
26 NA 3_chrom 14.00 4.00 0 18.00
27 NA 4_chrom 8.00 0.00 0 8.00
28 NA Classe: NA NA NA NA
29 NA 1_1brin 0.00 55.00 0 55.00
30 NA 1_2brins 0.00 43.00 2 45.00
31 NA 2_nonsis 81.00 15.00 0 96.00
32 NA 2_sis 0.00 7.00 0 7.00
33 NA classe_3 14.00 4.00 0 18.00
34 NA classe_4 8.00 0.00 0 8.00
35 NA Fraction of Chromatids: NA NA NA NA
36 NA 1_chrom 0.00 0.79 1 0.44
37 NA 2_chrom 0.79 0.18 0 0.45
38 NA 3_chrom 0.14 0.03 0 0.08
39 NA 4_chrom 0.08 0.00 0 0.03
40 NA Fraction of each Classe: NA NA NA NA
41 NA 1_1brin 0.00 0.44 0 0.24
42 NA 1_2brins 0.00 0.35 1 0.20
43 NA 2_nonsis 0.79 0.12 0 0.42
44 NA 2_sis 0.00 0.06 0 0.03
45 NA classe_3 0.14 0.03 0 0.08
46 NA classe_4 0.08 0.00 0 0.03
I require rows 1-34 formatted without decimals.
And rows 36-46 formatted with precisely 2 decimal places for all values.
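A minimal sketch of one way to achieve this (untested against the real files; it assumes rows 1-34 hold whole numbers after rounding): keep Event_Table numeric throughout, and convert to text only in a separate copy made immediately before the export, so a later round() never meets a character column.
out <- Event_Table # a copy used only for export
out[1:34, 3:6] <- lapply(Event_Table[1:34, 3:6],
                         function(x) formatC(x, format = "d"))              # no decimals
out[36:46, 3:6] <- lapply(Event_Table[36:46, 3:6],
                          function(x) formatC(x, format = "f", digits = 2)) # exactly 2 decimals
write.table(out, "Event_Table.txt", sep = "\t", quote = FALSE, row.names = FALSE)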
EDIT3: The initial data is read sequentially into tables called "data", then a derivative output table "Event_Table" is generated, in which I insert summaries of various aspects of each "data" table (i.e. totals, means, medians, etc.). I then sequentially export the "Event_Tables", since these contain the required summary information for each "data" table.
Here is the start of the code:
# FIRST SET WORKING DIRECTORY WHERE INPUT FILES ARE!
files = list.files(pattern="Events_") # Import file names containing "Events_" into variable "files"
files1 = length(files) # Count number of files
files2 = read.table(text = files, sep = "_", as.is = TRUE) #Split file names by "_" separator and create table "files2"
for (j in 1:files1)
{data <- read.table(files[j], header=TRUE) # Import the data table from file number j
# Making derivative dataframes:
Event_Table <- data.frame(matrix(NA, nrow = 46, ncol = 6)) # Creates dataframe of arbitrary size full of NAs
names(Event_Table) <- c("ChrSize","Chr","CO","NCO","NA","Total") # Adds column names to dataframe
Event_Table ["Chr"] = c(1:16, "TOTAL","Event Lengths:","Min Len", "Max Len","Mean Len","Median Len","Chromatids:","1_chrom","2_chrom","3_chrom","4_chrom","Classe:","1_1brin","1_2brins","2_nonsis","2_sis","classe_3","classe_4","Fraction of Chromatids:","1_chrom","2_chrom","3_chrom","4_chrom","Fraction of each Classe:","1_1brin","1_2brins","2_nonsis","2_sis","classe_3","classe_4") # Inserts vector 1:16 (numbers 1 to 16) in column 1 of dataframe
Event_Table[1:16, "ChrSize"] = c(230218,813184,316620,1531933,576874,270161,1090940,562643,439888,745751,666816,1078177,924431,784333,1091291,948066)
Event_Table[17, "ChrSize"] = sum(Event_Table[1:16, "ChrSize"])
nE = nrow(data) # Total number of events
Event_Table[17, "Total"] = nE
Event_Table[19, "Total"] = min(data[, "len"])
Event_Table[20, "Total"] = max(data[, "len"])
Event_Table[21, "Total"] = mean(data[, "len"]) # extract the column as a vector; mean() on a one-column data frame does not work
Event_Table[22, "Total"] = median(data[, "len"])
#More stuff here, etc, then close j loop }
So the Event_Table is set up as a data.frame (built from a matrix) filled with NAs.
I then fill it manually with relevant info in relevant grid positions.
I then simply want to format the visual appearance of these fields.
If I am going about this all wrong, then please can you suggest a better way to do this! Thanks

Here is a proof of concept using 2 rather different data frames:
DF1 <- data.frame(x = rnorm(10), person = rep(LETTERS[1:2], 5))
DF2 <- data.frame(y = 1:10L, result = rep(LETTERS[3:4], 5), alt = rep(letters[3:4], 5))
write.table(DF1, file = "example.csv", sep = ",")
write.table(DF2, file = "example.csv", sep = ",", append = TRUE)
This issues a warning (about column names - no problem) and gives:
x person
1 0.796933543 A
2 1.495800567 B
3 0.359153458 A
4 2.105378598 B
5 0.175455314 A
6 -1.850171347 B
7 -0.87197177 A
8 2.682650638 B
9 1.040676847 A
10 -0.086197042 B
y result alt
1 1 C c
2 2 D d
3 3 C c
4 4 D d
5 5 C c
6 6 D d
7 7 C c
8 8 D d
9 9 C c
10 10 D d
From here you can control the formatting as desired. You may wish to suppress the column names or give more informative ones, and you probably don't want the row numbering either. See ?write.table for all the options.
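For instance, a short sketch of those options (row.names and col.names are standard write.table arguments):
write.table(DF1, file = "example.csv", sep = ",",
            row.names = FALSE, col.names = FALSE)
write.table(DF2, file = "example.csv", sep = ",",
            row.names = FALSE, col.names = FALSE, append = TRUE) # no warning this time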

It could be a problem similar to Error in Math.data.frame ... non-numeric variable in data frame. Maybe you have commas in your data. If that is not the case, could you show what is in your table?

Related

How do I apply an ifelse function to all cells in a data frame?

I am trying to apply an ifelse statement to all the cells in my data frame. I'm pretty sure I am overthinking this but would appreciate some help/guidance!
I have a dataframe of (slightly modified) percent cover of vegetation from a number of sites, where the site names and the vegetation types are the row names and column names, respectively (i.e. the data frame should consist only of numeric values):
dwarf shrub equisetum forb fungi graminoid lichen moss shrub-forb tall shrub tree
site1 33.25 0 21.25 1.0 35.25 3.25 60.00 0.00 34.25 0.25
site2 30.25 0 15.00 0.0 25.75 7.50 62.25 1.50 26.75 0
site3 50.00 0 10.00 0.5 23.50 3.25 65.00 6.75 18.50 0
site4 46.00 0 7.75 0.0 32.75 2.25 33.75 4.50 11.25 0.75
site5 28.00 0 11.00 0.0 40.00 6.00 30.00 0.00 38.00 0
site6 40.25 0 10.50 0.0 5.75 6.25 7.25 3.25 8.75 1.25
I am trying to round the numbers to the nearest whole number such that the round() function is used when the value is greater than 1 and the ceiling() function is used when the value is less than 1.
Here is the code I have written to try do this:
new.df <- if (old.df > 1) {
  round(old.df, digits = 0)
} else {
  ceiling(old.df)
}
I have also tried without the ceiling function:
new.df <- if (old.df > 1) {
  round(old.df, digits = 0)
} else {
  old.df == 1
}
I have not been successful in applying the second half of the statement (ceiling()). I get this warning:
Warning message:
In if (old.df > 1) { :
the condition has length > 1 and only the first element will be used
Any assistance would be much appreciated, thank you!
You mentioned ifelse; I think it's straightforward enough to apply this to each column using lapply. (I'll add the isnum check in case there are non-numeric columns in the data; feel free to ignore it if your data is always numeric.)
isnum <- sapply(dat, is.numeric)
dat[isnum] <- lapply(dat[isnum], function(x) ifelse(x > 1, round(x, 0), ceiling(x)))
dat
# site dwarf_shrub equisetum forb fungi graminoid lichen moss shrub_forb tall_shrub tree
# 1 site1 33 0 21 1 35 3 60 0 34 1
# 2 site2 30 0 15 0 26 8 62 2 27 0
# 3 site3 50 0 10 1 24 3 65 7 18 0
# 4 site4 46 0 8 0 33 2 34 4 11 1
# 5 site5 28 0 11 0 40 6 30 0 38 0
# 6 site6 40 0 10 0 6 6 7 3 9 1
Data: I had to rename some of the columns, since column names with spaces and hyphens do not read in easily.
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
site dwarf_shrub equisetum forb fungi graminoid lichen moss shrub_forb tall_shrub tree
site1 33.25 0 21.25 1.0 35.25 3.25 60.00 0.00 34.25 0.25
site2 30.25 0 15.00 0.0 25.75 7.50 62.25 1.50 26.75 0
site3 50.00 0 10.00 0.5 23.50 3.25 65.00 6.75 18.50 0
site4 46.00 0 7.75 0.0 32.75 2.25 33.75 4.50 11.25 0.75
site5 28.00 0 11.00 0.0 40.00 6.00 30.00 0.00 38.00 0
site6 40.25 0 10.50 0.0 5.75 6.25 7.25 3.25 8.75 1.25")
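For completeness, the same per-cell logic written with dplyr (an alternative sketch, not part of the original answer; it needs dplyr >= 1.0 for across()):
library(dplyr)
dat <- dat %>%
  mutate(across(where(is.numeric), ~ ifelse(.x > 1, round(.x, 0), ceiling(.x))))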

Finding max of column by group with condition

I have a data frame (shown as an image) with the columns Gill, Diametre and Time.
For each gill, I would like to find the maximum time for which the diameter is different from 0. I have tried to use the aggregate function and the dplyr package, but this did not work. A combination of for, if and aggregate would probably work, but I did not find how to do it.
I'm not sure of the best way to approach this. I'd appreciate any help.
After grouping by 'Gill', subset the 'Time' where 'Diametre' is not 0 and get the max (assuming 'Time' is numeric class)
library(dplyr)
df1 %>%
  group_by(Gill) %>%
  summarise(Time = max(Time[Diametre != 0]))
Here is how you can use aggregate:
> df<- data.frame(
Gill = rep(1:11, each = 2),
diameter = c(0,0,1,0,0,0,73.36, 80.08,1,25.2,53.48,61.21,28.8,28.66,71.2,80.25,44.55,53.50,60.91,0,11,74.22),
time = 0.16
)
> df
Gill diameter time
1 1 0.00 0.16
2 1 0.00 0.16
3 2 1.00 0.16
4 2 0.00 0.16
5 3 0.00 0.16
6 3 0.00 0.16
7 4 73.36 0.16
8 4 80.08 0.16
9 5 1.00 0.16
10 5 25.20 0.16
11 6 53.48 0.16
12 6 61.21 0.16
13 7 28.80 0.16
14 7 28.66 0.16
15 8 71.20 0.16
16 8 80.25 0.16
17 9 44.55 0.16
18 9 53.50 0.16
19 10 60.91 0.16
20 10 0.00 0.16
21 11 11.00 0.16
22 11 74.22 0.16
> # Remove diameter == 0 before aggregate
> dfnew <- df[df$diameter != 0, ]
> aggregate(dfnew$time, list(dfnew$Gill), max )
Group.1 x
1 2 0.16
2 4 0.16
3 5 0.16
4 6 0.16
5 7 0.16
6 8 0.16
7 9 0.16
8 10 0.16
9 11 0.16
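One caveat worth noting (an aside, not from the original answer): gills whose diameter is never non-zero (here gills 1 and 3) drop out of this result entirely, because their rows are removed before aggregating. The dplyr version above instead returns -Inf with a warning for such groups, since max() of an empty vector is -Inf:
max(numeric(0)) # -Inf, with a warning: no non-missing arguments to max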
I would use a different approach from the elegant solution that akrun suggested. This method can also be used to create the MaxTime column that you show in your image.
#This will split your df into a list of data frames for each gill.
list.df <- split(df1, df1$Gill)
Then you can use lapply to find the maximum of Time for each Gill and then make that value a new column called MaxTime.
list.df <- lapply(list.df, function(x) mutate(x, MaxTime = max(x$Time[x$Diametre != 0])))
Then you can combine these split dataframes back together using bind_rows()
df1 = bind_rows(list.df)

Create a Custom Function that Extracts Certain Rows

head(MYK)
X Analyte Subject Cohort DayNominal HourNominal Concentration uniqueID FS EF VTI deltaFS deltaEF deltaVTI HR
2 MYK-461 005-010 1 1 0.25 31.00 005-0100.25 31.82 64.86 0.00 3 -1 -100 58
3 MYK-461 005-010 1 1 0.50 31.80 005-0100.5 NA NA NA NA NA NA NA
4 MYK-461 005-010 1 1 1.00 9.69 005-0101 26.13 69.11 0.00 -15 6 -100 55
5 MYK-461 005-010 1 1 1.50 8.01 005-0101.5 NA NA NA NA NA NA NA
6 MYK-461 005-010 1 1 2.00 5.25 005-0102 NA NA NA NA NA NA NA
7 MYK-461 005-010 1 1 3.00 3.26 005-0103 29.89 60.99 23.49 -3 -7 9 55
105 MYK-461 005-033 2 1 0.25 3.4 005-0330.25 30.18 68.59 23.22 1 0 16 47
106 MYK-461 005-033 2 1 0.50 12.4 005-0330.5 NA NA NA NA NA NA NA
107 MYK-461 005-033 2 1 0.75 27.1 005-0330.75 NA NA NA NA NA NA NA
108 MYK-461 005-033 2 1 1.00 23.5 005-0331 32.12 69.60 21.06 7 2 5 43
109 MYK-461 005-033 2 1 1.50 16.8 005-0331.5 NA NA NA NA NA NA NA
110 MYK-461 005-033 2 1 2.00 15.8 005-0332 NA NA NA NA NA NA NA
organize = function(x, y) {
  g1 = subset(x, Cohort == y)
  g1 = aggregate(x[,'Concentration'], by=list(x[,'HourNominal']), FUN=mean)
  g1 = setNames(g1, c('HourNominal', 'Concentration'))
  g2 = aggregate(x[,'Concentration'], by=list(x[,'HourNominal']), FUN=sd)
  g2 = setNames(g2, c('HourNominal', 'SD'))
  g1[,'SD'] = g2$SD
  g1$top = g1$Concentration + g1$SD
  g1$bottom = g1$Concentration - g1$SD
  return(g1)
}
I have a dataframe here, along with some code to subset the dataframe based on a certain Cohort, and to aggregate the Concentration based on Hour. However, all of the dataframes look the same.
CA1 = organize(MYK, 1)
CA2 = organize(MYK, 2)
Yet whenever I use these two commands, the two datasets are identical.
I want a dataset that looks like
HourNominal Concentration SD top bottom
1 0.25 27.287500 25.112204 52.399704 2.1752958
2 0.50 41.989722 32.856013 74.845735 9.1337094
3 0.75 49.866667 22.485254 72.351921 27.3814122
4 1.00 107.168889 104.612098 211.780987 2.5567908
5 1.50 191.766389 264.375466 456.141855 -72.6090774
6 1.75 319.233333 290.685423 609.918757 28.5479100
7 2.00 226.785278 272.983234 499.768512 -46.1979560
8 2.25 341.145833 301.555769 642.701602 39.5900645
9 2.50 341.145833 319.099679 660.245512 22.0461542
10 3.00 195.303333 276.530533 471.833866 -81.2271993
11 4.00 107.913889 140.251991 248.165880 -32.3381024
12 6.00 50.174167 64.700785 114.874952 -14.5266184
13 8.00 38.132639 47.099796 85.232435 -8.9671572
14 12.00 31.404444 39.667850 71.072294 -8.2634051
15 24.00 33.488583 41.267392 74.755975 -7.7788087
16 48.00 29.304833 38.233776 67.538609 -8.9289422
17 72.00 7.322792 6.548898 13.871690 0.7738932
18 96.00 7.002833 6.350251 13.353085 0.6525821
19 144.00 6.463875 5.612630 12.076505 0.8512452
20 216.00 5.007792 4.808156 9.815948 0.1996353
21 312.00 3.964727 4.351626 8.316353 -0.3868988
22 480.00 2.452857 3.220947 5.673804 -0.7680897
23 648.00 1.826625 2.569129 4.395754 -0.7425044
The problem is that even when I try to separate the values by Cohort, the two dataframes have the same content. They should not be identical.
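A likely explanation, offered as a sketch since the full data set is not shown: inside organize(), the subset is stored in g1, but the aggregate() calls still operate on the full input x, so the y argument never influences the result. Aggregating the subset instead should make CA1 and CA2 differ:
organize <- function(x, y) {
  g0 <- subset(x, Cohort == y) # keep only the requested cohort
  g1 <- aggregate(g0[,'Concentration'], by=list(g0[,'HourNominal']), FUN=mean)
  g1 <- setNames(g1, c('HourNominal', 'Concentration'))
  g2 <- aggregate(g0[,'Concentration'], by=list(g0[,'HourNominal']), FUN=sd)
  g2 <- setNames(g2, c('HourNominal', 'SD'))
  g1[,'SD'] <- g2$SD
  g1$top <- g1$Concentration + g1$SD
  g1$bottom <- g1$Concentration - g1$SD
  return(g1)
}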

Categorical Survey Analysis - data structure problems

I am trying to run a probability table for an entire survey. I want to then export these statistics into a csv where each column represents a single question. Each question in my original data is its own column, like so:
print(InternalSurveyPercent)
Q1 Q2 Q3 Q4
1 3 2 Mazda
2 3 4 Ford
3 5 2 Toyota
9 3 2 Hyundai
I'd like the results to look like this, but for each column.
InternalSurveyPercent$Q1
Q1
1 25%
2 25%
3 25%
4 0%
5 0%
9 25%
I use this function to generate the list (is lapply the right way to do this?)
InternalSurveyPercent = lapply(InternalSurvey, function(x) prop.table(table(x)))
Then I multiply by 100 because it makes graphing my data easier.
InternalSurveyPercent = sapply(InternalSurveyPercent, "*", 100)
I'm not really sure where to go from here. I'm very confused about how the data is being structured at this point.
str(InternalSurveyPercent)
List of 4
$ Q1: table [1:5(1d)] 25.00 25.00 25.00 0.00 0.00 25.00
..- attr(*, "dimnames")=List of 1
.. ..$ x: chr [1:5] "1" "2" "3" "4" ...
Why is it returning a list? Why not a data frame with 4 variables (columns)? Thoughts on where I am going wrong/getting lost?
Thank you!
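A quick note on the str() output above (an aside, not from the answers below): sapply() only simplifies when all results have the same length. Here each question has a different set of observed values, so the per-question tables are ragged and the result stays a list:
sapply(list(Q1 = c(1, 2, 3, 4), Q2 = c(1, 2)), "*", 100) # ragged lengths, stays a list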
Folks seem to have different interpretations of the desired output, so I'd suggest re-framing the question and the desired output with more clarity. Anyhow, here's a data.table solution based on how far I understand the question.
# the data
df <- read.table(text="Q1 Q2 Q3 Q4
1 3 2 Mazda
2 3 4 Ford
3 5 2 Toyota
9 3 2 Hyundai", header=T, as.is=T)
library(data.table)
# one liner to get the %
setDT(df)[,lapply(.SD, function(x) prop.table(table(x))*100)][]
# Q1 Q2 Q3 Q4
# 1: 25 75 75 25
# 2: 25 25 25 25
# 3: 25 75 75 25
# 4: 25 25 25 25
# If you prefer to stitch the result table onto the original, you could:
df2 <- setDT(df)[,lapply(.SD, function(x) prop.table(table(x))*100)]
df[, paste0("Q", 1:4, "%") := df2][]
# Q1 Q2 Q3 Q4 Q1% Q2% Q3% Q4%
# 1: 1 3 2 Mazda 25 75 75 25
# 2: 2 3 4 Ford 25 25 25 25
# 3: 3 5 2 Toyota 25 75 75 25
# 4: 9 3 2 Hyundai 25 25 25 25
This may be helpful. I am guessing that you have six options in Q1-Q3 (i.e., 1, 2, 3, 4, 5, and 9). But Q4 is a different kind of question, in that it does not share the same options. Therefore, you will see ten rows in the outcome.
devtools::install_github("hadley/tidyr")
library(tidyr)
# I am following your idea with data provided by #LyzandeR
ana <- lapply(InternalSurvey, function(x) prop.table(table(x)))
bob <- data.frame(t(unnest(lapply(ana, as.data.frame.list))), stringsAsFactors = FALSE)
bob <- replace(bob, is.na(bob), 0)
colnames(bob) <- gsub("X", "Q", colnames(bob))
# Q1 Q2 Q3 Q4
#X1 0.25 0.00 0.00 0.00
#X2 0.25 0.00 0.75 0.00
#X3 0.25 0.75 0.00 0.00
#X9 0.25 0.00 0.00 0.00
#X5 0.00 0.25 0.00 0.00
#X4 0.00 0.00 0.25 0.00
#Ford 0.00 0.00 0.00 0.25
#Hyundai 0.00 0.00 0.00 0.25
#Mazda 0.00 0.00 0.00 0.25
#Toyota 0.00 0.00 0.00 0.25

Manipulating Data in R

I have data a data frame in the following structure
transaction | customer | week | amount
12551 | ieeamo | 32 | €23.54
12553 | ieeamo | 33 | €17.00
I would like to get it in the following structure (for all weeks)
week | customer | activity last week | activity 2 weeks ago
32 | ieeamo | €0.00 | €0.00
33 | ieeamo | €23.54 | €0.00
34 | ieeamo | €17.00 | €23.54
35 | ieeamo | €0.00 | €17.00
Essentially, I am trying to convert transactional data to relative data.
My thoughts are that the best way to do this is to use loops to generate many data frames and then rbind them all at the end. However, this approach does not seem efficient, and I'm not sure it will scale to the data I am using.
Is there a more proper solution?
Rbinding is a bad idea for this, since each rbind creates a new copy of the data frame in memory. We can get to the answer more quickly with a mostly vectorized approach, using loops only to make code more concise. Props to the OP for recognizing the inefficiency and searching for a solution.
Note: The following solution will work for any number of customers, but would require minor modification to work with more lag columns.
Setup: First we need to generate some data to work with. I'm going to use two different customers with a few weeks of transactional data each, like so:
data <- read.table(text="
transaction customer week amount
12551 cOne 32 1.32
12552 cOne 34 1.34
12553 cTwo 34 2.34
12554 cTwo 35 2.35
12555 cOne 36 1.36
12556 cTwo 37 1.37
", header=TRUE)
Step 1: Calculate some variables and initialize new data frame. To make the programming really easy, we first want to know two things: how many customers and how many weeks? We calculate those answers like so:
customer_list <- unique(data$customer)
# cOne cTwo
week_span <- min(data$week):max(data$week)
# 32 33 34 35 36 37
Next, we need to initialize the new data frame based on the variables we just calculated. In this new data frame, we need an entry for every week, not just the weeks in the data. This is where our 'week_span' variable comes in useful.
new_data <- data.frame(
  week=sort(rep(week_span,length(customer_list))),
  customer=customer_list,
  activity_last_week=NA,
  activity_2_weeks_ago=NA)
# week customer activity_last_week activity_2_weeks_ago
# 1 32 cOne NA NA
# 2 32 cTwo NA NA
# 3 33 cOne NA NA
# 4 33 cTwo NA NA
# 5 34 cOne NA NA
# 6 34 cTwo NA NA
# 7 35 cOne NA NA
# 8 35 cTwo NA NA
# 9 36 cOne NA NA
# 10 36 cTwo NA NA
# 11 37 cOne NA NA
# 12 37 cTwo NA NA
You'll notice we repeat the week list for each customer and sort it, so we get a list resembling 1,1,2,2,3,3,4,4...n,n with a number of repetitions equal to the number of customers in the data. This makes it so we can specify the 'customer' data as just the list of customers, since the list will repeat to fill up the space. The lag columns are left as NA for now.
Step 2: Fill in the lag values. Now, things are pretty simple. We just need to grab the subset of rows for each customer and find out if there were any transactions for each week. We do this by using the 'match' function to pull out values for every week. Where data does not exist, we'll get an NA value and need to replace those with zeros (assuming no activity means a zero transaction). Then, for the lag columns, we just offset the values with NA depending on the number of weeks we are lagging.
# Loop through the customers.
for (i in 1:length(customer_list)){
  # Select the next customer's data.
  subset <- data[data$customer==customer_list[i],]
  # Extract the data values for each week.
  subset_amounts <- subset$amount[match(week_span, subset$week)]
  # Replace NA with zero.
  subset_amounts <- ifelse(is.na(subset_amounts),0,subset_amounts)
  # Loop through the lag columns.
  for (lag in 1:2){
    # Write in the data values with the appropriate
    # number of offsets according to the lag.
    # Truncate the extra values.
    new_data[new_data$customer==customer_list[i], (2+lag)] <- c(rep(NA,lag), subset_amounts[1:(length(subset_amounts)-lag)])
  }
}
# week customer activity_last_week activity_2_weeks_ago
# 1 32 cOne NA NA
# 2 32 cTwo NA NA
# 3 33 cOne 1.32 NA
# 4 33 cTwo 0.00 NA
# 5 34 cOne 0.00 1.32
# 6 34 cTwo 0.00 0.00
# 7 35 cOne 1.34 0.00
# 8 35 cTwo 2.34 0.00
# 9 36 cOne 0.00 1.34
# 10 36 cTwo 2.35 2.34
# 11 37 cOne 1.36 0.00
# 12 37 cTwo 0.00 2.35
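For what it's worth, here is a sketch of the same reshaping using dplyr and tidyr (not part of the original answer; it assumes the 'data' frame from the setup above and fills missing weeks with zero rather than NA):
library(dplyr)
library(tidyr)
new_data2 <- data %>%
  # Add a row for every customer/week combination, with amount 0 where absent.
  complete(customer, week = full_seq(week, 1), fill = list(amount = 0)) %>%
  group_by(customer) %>%
  arrange(week, .by_group = TRUE) %>%
  mutate(activity_last_week   = lag(amount, 1, default = 0),
         activity_2_weeks_ago = lag(amount, 2, default = 0)) %>%
  ungroup() %>%
  select(week, customer, activity_last_week, activity_2_weeks_ago)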
In other situations... If you have a series of ordered time data where no rows are missing, this sort of task becomes incredibly simple with the 'embed' function. Let's say we have some data that looks like this:
data <- data.frame(week=1:20, value=1:20+(1:20/100))
# week value
# 1 1 1.01
# 2 2 2.02
# 3 3 3.03
# 4 4 4.04
# 5 5 5.05
# 6 6 6.06
# 7 7 7.07
# 8 8 8.08
# 9 9 9.09
# 10 10 10.10
# 11 11 11.11
# 12 12 12.12
# 13 13 13.13
# 14 14 14.14
# 15 15 15.15
# 16 16 16.16
# 17 17 17.17
# 18 18 18.18
# 19 19 19.19
# 20 20 20.20
We could make a lagged data set in no time, like so:
new_data <- data.frame(week=data$week[3:20], embed(data$value,3))
names(new_data)[2:4] <- c("this_week", "last_week", "2_weeks_ago")
# week this_week last_week 2_weeks_ago
# 1 3 3.03 2.02 1.01
# 2 4 4.04 3.03 2.02
# 3 5 5.05 4.04 3.03
# 4 6 6.06 5.05 4.04
# 5 7 7.07 6.06 5.05
# 6 8 8.08 7.07 6.06
# 7 9 9.09 8.08 7.07
# 8 10 10.10 9.09 8.08
# 9 11 11.11 10.10 9.09
# 10 12 12.12 11.11 10.10
# 11 13 13.13 12.12 11.11
# 12 14 14.14 13.13 12.12
# 13 15 15.15 14.14 13.13
# 14 16 16.16 15.15 14.14
# 15 17 17.17 16.16 15.15
# 16 18 18.18 17.17 16.16
# 17 19 19.19 18.18 17.17
# 18 20 20.20 19.19 18.18
