Manipulating Data in R

I have a data frame in the following structure
transaction | customer | week | amount
12551 | ieeamo | 32 | €23.54
12553 | ieeamo | 33 | €17.00
I would like to get it in the following structure (for all weeks)
week | customer | activity last week | activity 2 weeks ago
32 | ieeamo | €0.00 | €0.00
33 | ieeamo | €23.54 | €0.00
34 | ieeamo | €17.00 | €23.54
35 | ieeamo | €0.00 | €17.00
Essentially, I am trying to convert transactional data to relative data.
My thoughts are that the best way to do this is to use loops to generate many data frames and then rbind them all at the end. However, this approach does not seem efficient, and I'm not sure it will scale to the data I am using.
Is there a better solution?

Rbinding is a bad idea for this, since each rbind creates a new copy of the data frame in memory. We can get to the answer more quickly with a mostly vectorized approach, using loops only to make code more concise. Props to the OP for recognizing the inefficiency and searching for a solution.
Note: The following solution will work for any number of customers, but would require minor modification to work with more lag columns.
Setup: First we need to generate some data to work with. I'm going to use two different customers with a few weeks of transactional data each, like so:
data <- read.table(text="
transaction customer week amount
12551 cOne 32 1.32
12552 cOne 34 1.34
12553 cTwo 34 2.34
12554 cTwo 35 2.35
12555 cOne 36 1.36
12556 cTwo 37 1.37
", header=TRUE)
Step 1: Calculate some variables and initialize new data frame. To make the programming really easy, we first want to know two things: how many customers and how many weeks? We calculate those answers like so:
customer_list <- unique(data$customer)
# cOne cTwo
week_span <- min(data$week):max(data$week)
# 32 33 34 35 36 37
Next, we need to initialize the new data frame based on the variables we just calculated. In this new data frame, we need an entry for every week, not just the weeks in the data. This is where our 'week_span' variable comes in useful.
new_data <- data.frame(
  week = sort(rep(week_span, length(customer_list))),
  customer = customer_list,
  activity_last_week = NA,
  activity_2_weeks_ago = NA)
# week customer activity_last_week activity_2_weeks_ago
# 1 32 cOne NA NA
# 2 32 cTwo NA NA
# 3 33 cOne NA NA
# 4 33 cTwo NA NA
# 5 34 cOne NA NA
# 6 34 cTwo NA NA
# 7 35 cOne NA NA
# 8 35 cTwo NA NA
# 9 36 cOne NA NA
# 10 36 cTwo NA NA
# 11 37 cOne NA NA
# 12 37 cTwo NA NA
You'll notice we repeat the week list for each customer and sort it, so we get a sequence resembling 1,1,2,2,3,3,4,4...n,n with a number of repetitions equal to the number of customers in the data. This lets us specify the 'customer' column as just the list of customers, since that list recycles to fill up the space. The lag columns are left as NA for now.
Step 2: Fill in the lag values. Now, things are pretty simple. We just need to grab the subset of rows for each customer and find out if there were any transactions for each week. We do this by using the 'match' function to pull out values for every week. Where data does not exist, we'll get an NA value and need to replace those with zeros (assuming no activity means a zero transaction). Then, for the lag columns, we just offset the values with NA depending on the number of weeks we are lagging.
# Loop through the customers.
for (i in 1:length(customer_list)) {
  # Select the next customer's data.
  subset <- data[data$customer == customer_list[i], ]
  # Extract the data values for each week.
  subset_amounts <- subset$amount[match(week_span, subset$week)]
  # Replace NA with zero.
  subset_amounts <- ifelse(is.na(subset_amounts), 0, subset_amounts)
  # Loop through the lag columns.
  for (lag in 1:2) {
    # Write in the data values with the appropriate number of offsets
    # according to the lag. Truncate the extra values.
    new_data[new_data$customer == customer_list[i], (2 + lag)] <-
      c(rep(NA, lag), subset_amounts[1:(length(subset_amounts) - lag)])
  }
}
# week customer activity_last_week activity_2_weeks_ago
# 1 32 cOne NA NA
# 2 32 cTwo NA NA
# 3 33 cOne 1.32 NA
# 4 33 cTwo 0.00 NA
# 5 34 cOne 0.00 1.32
# 6 34 cTwo 0.00 0.00
# 7 35 cOne 1.34 0.00
# 8 35 cTwo 2.34 0.00
# 9 36 cOne 0.00 1.34
# 10 36 cTwo 2.35 2.34
# 11 37 cOne 1.36 0.00
# 12 37 cTwo 0.00 2.35
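For comparison (my addition, not part of the original answer), here is a sketch of the same result using dplyr and tidyr; it assumes both packages are installed and reproduces the NA-before-first-week behaviour shown above:
library(dplyr)
library(tidyr)
new_data2 <- data %>%
  group_by(customer) %>%
  # insert a zero-amount row for every missing week in the full span
  complete(week = min(data$week):max(data$week), fill = list(amount = 0)) %>%
  arrange(customer, week) %>%
  mutate(activity_last_week = lag(amount, 1),
         activity_2_weeks_ago = lag(amount, 2)) %>%
  ungroup() %>%
  select(week, customer, activity_last_week, activity_2_weeks_ago) %>%
  arrange(week, customer)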
In other situations... If you have a series of ordered time data where no rows are missing, this sort of task becomes incredibly simple with the 'embed' function. Let's say we have some data that looks like this:
data <- data.frame(week=1:20, value=1:20+(1:20/100))
# week value
# 1 1 1.01
# 2 2 2.02
# 3 3 3.03
# 4 4 4.04
# 5 5 5.05
# 6 6 6.06
# 7 7 7.07
# 8 8 8.08
# 9 9 9.09
# 10 10 10.10
# 11 11 11.11
# 12 12 12.12
# 13 13 13.13
# 14 14 14.14
# 15 15 15.15
# 16 16 16.16
# 17 17 17.17
# 18 18 18.18
# 19 19 19.19
# 20 20 20.20
We could make a lagged data set in no time, like so:
new_data <- data.frame(week=data$week[3:20], embed(data$value,3))
names(new_data)[2:4] <- c("this_week", "last_week", "2_weeks_ago")
# week this_week last_week 2_weeks_ago
# 1 3 3.03 2.02 1.01
# 2 4 4.04 3.03 2.02
# 3 5 5.05 4.04 3.03
# 4 6 6.06 5.05 4.04
# 5 7 7.07 6.06 5.05
# 6 8 8.08 7.07 6.06
# 7 9 9.09 8.08 7.07
# 8 10 10.10 9.09 8.08
# 9 11 11.11 10.10 9.09
# 10 12 12.12 11.11 10.10
# 11 13 13.13 12.12 11.11
# 12 14 14.14 13.13 12.12
# 13 15 15.15 14.14 13.13
# 14 16 16.16 15.15 14.14
# 15 17 17.17 16.16 15.15
# 16 18 18.18 17.17 16.16
# 17 19 19.19 18.18 17.17
# 18 20 20.20 19.19 18.18

Related

For Loop Alternative on large data frame that runs a different filter with each iteration

I'm running a loop that takes the ranking from R1[i] and filters a data frame to all rankings in a specified range around it, while at the same time filtering a second column against R2[i] to find the opponent's ranking. I end up with a new data frame that only includes matches involving players in those specific ranking ranges, so that I can find the mean of a column over only those matches.
For example: Player 1 is ranked 10th and Player 2 is ranked 34th. The following code takes every match including players ranked between 5 and 15 (10 ± 5, widened by ±20% of 10) and players ranked between 29 and 39 (34 ± 5, widened by ±20% of 34).
It then finds the mean of Data_Dif, writes it back into row [i] of the initial data frame, and does so for every row.
This code works fine, but it's a bit messy, and it takes 4 hours to run over 57,000 matches. Does anyone have a faster solution, please? I have to run this every day.
library(dplyr)
# opening of the loop implied by the closing brace in the original post
for (i in 1:nrow(Data)) {
  Rank <- Data %>%
    filter(between(R1, Data$R1[i] - 5 - (Data$R1[i] * 0.2), Data$R1[i] + 5 + (Data$R1[i] * 0.2)) |
           between(R1, Data$R2[i] - 5 - (Data$R2[i] * 0.2), Data$R2[i] + 5 + (Data$R2[i] * 0.2))) %>%
    filter(between(R2, Data$R1[i] - 5 - (Data$R1[i] * 0.2), Data$R1[i] + 5 + (Data$R1[i] * 0.2)) |
           between(R2, Data$R2[i] - 5 - (Data$R2[i] * 0.2), Data$R2[i] + 5 + (Data$R2[i] * 0.2)))
  Rank_Difference <- Data$Rank_Dif[i]
  Rank <- Rank %>% filter(Rank_Dif >= Rank_Difference - 5)
  Data$Rank_Adv[i] <- mean(Rank$Data_Dif)
}
Data
R1 R2 Rank_Dif Data_Dif Rank_Adv
1 2 1 1 -0.272 0.037696970
2 10 34 24 0.377 0.146838617
3 10 29 19 0.373 0.130336232
4 2 5 3 0.134 0.076242424
5 34 17 17 -0.196 0.094226519
6 1 18 17 0.144 0.186158879
7 17 25 8 0.264 0.036212219
8 42 18 24 0.041 0.102343915
9 5 13 8 -0.010 0.091952381
10 34 21 13 -0.226 0.060790576
11 2 14 12 0.022 0.122350649
12 10 158 148 0.330 0.184901961
13 11 1 10 -0.042 0.109918367
14 29 52 23 0.463 0.054469108
15 10 1000 990 0.628 0.437600000
16 17 329 312 0.445 0.307750000
17 11 20 9 0.216 0.072621875
18 417 200 217 -0.466 0.106737401
19 5 53 48 0.273 0.243890710
20 14 7 7 -0.462 0.075739414
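No answer was recorded for this question, but here is a sketch of one vectorized approach (my own; column names taken from the question). Precomputing all window bounds once and using plain logical indexing inside vapply() avoids rebuilding a dplyr pipeline on every iteration, which should cut the runtime substantially:
# Precompute the lower/upper bounds of both ranking windows for every row.
lo1 <- Data$R1 - 5 - Data$R1 * 0.2;  hi1 <- Data$R1 + 5 + Data$R1 * 0.2
lo2 <- Data$R2 - 5 - Data$R2 * 0.2;  hi2 <- Data$R2 + 5 + Data$R2 * 0.2
Data$Rank_Adv <- vapply(seq_len(nrow(Data)), function(i) {
  # Rows whose R1 and R2 both fall inside either of row i's windows.
  in1 <- (Data$R1 >= lo1[i] & Data$R1 <= hi1[i]) | (Data$R1 >= lo2[i] & Data$R1 <= hi2[i])
  in2 <- (Data$R2 >= lo1[i] & Data$R2 <= hi1[i]) | (Data$R2 >= lo2[i] & Data$R2 <= hi2[i])
  keep <- in1 & in2 & Data$Rank_Dif >= Data$Rank_Dif[i] - 5
  mean(Data$Data_Dif[keep])
}, numeric(1))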

Why is the same category giving different frequencies in R

Process_Table = Process_Table[order(-Process_Table$Process, -Process_Table$Freq),]
#output
Process Freq Percent
17 Other Airport Services 45 15.46
5 Check-in 35 12.03
23 Ticket sales and support channels 35 12.03
11 Flight and inflight 33 11.34
19 Pegasus Plus 23 7.90
24 Time Delays 16 5.50
7 Other 13 4.47
14 Other 13 4.47
22 Other 13 4.47
25 Other 13 4.47
16 Other 11 3.78
20 Other 6 2.06
26 Other 6 2.06
3 Other 5 1.72
13 Other 5 1.72
18 Other 5 1.72
21 Other 4 1.37
1 Other 2 0.69
2 Other 1 0.34
4 Other 1 0.34
6 Other 1 0.34
8 Other 1 0.34
9 Other 1 0.34
10 Other 1 0.34
12 Other 1 0.34
15 Other 1 0.34
As you can see, it is giving different frequencies for the same level, whereas if I print the levels of that feature I get the following output:
levels(Process_Table$Process)
[1] "Check-in" "Flight and inflight"
[3] "Other" "Other Airport Services"
[5] "Pegasus Plus" "Ticket sales and support channels"
[7] "Time Delays"
What I want is the combined frequency of the "Other" category. Can anyone help me out with this?
Edit: this code was used to derive the first set of output:
Process_Table$Percent = round(Process_Table$Freq/sum(Process_Table$Freq) * 100, 2)
Process_Table$Process = as.character(Process_Table$Process)
low_list = Process_Table %>%
  filter(Percent < 5.50) %>%
  select(Process)
Process_Table$Process = ifelse(Process_Table$Process %in% low_list$Process, 'Other', Process_Table$Process)
as.data.frame(Process_Table)
Process_Table$Process = as.factor(Process_Table$Process)
Your Process_Table should undergo another aggregation step. Add the following after your final data-aggregation step:
Process_Table <- Process_Table %>% group_by(Process) %>% summarize(Freq = sum(Freq), Percent = sum(Percent))
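If dplyr is not loaded at that point, the same collapsing can be done in base R (a sketch of an equivalent using stats::aggregate):
# Sum Freq and Percent within each Process level.
aggregate(cbind(Freq, Percent) ~ Process, data = Process_Table, FUN = sum)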

Categorizing Hospital Admittance Data in Excel or R

I have patient data that looks like this:
ID DATE DUR
82 29/08/2014 10.32
82 29/08/2014 0.32
82 12/09/2014 13.35
82 12/09/2014 0.16
82 12/09/2014 0.24
82 12/09/2014 0.31
82 22/12/2014 100.39
82 22/12/2014 0.1
219 31/11/2012 -300.32
219 31/11/2012 0.23
219 12/01/2013 80.20
219 12/01/2013 0.20
In the first column is a patient ID, in the second a date and time (the time is not shown but is in the data), and the third is just the duration difference (which I have been using to determine distinct admissions of patients). Each row is a check-up on the patient, but they may have come in again at a later date (not within the same time frame).
Basically, what I want to do is relabel each patient's ID so that on a second admission the ID becomes "82a", on a third "82b", and so on. It wouldn't have to be alphabetic; it could be any such indicator. Some patients can have as many as 50 different admissions (separate-occasion admissions). After this, I want it to look something like:
ID DATE DUR
82 29/08/2014 10.32
82 29/08/2014 0.32
82a 12/09/2014 13.35
82a 12/09/2014 0.16
82a 12/09/2014 0.24
82a 12/09/2014 0.31
82b 22/12/2014 100.39
82b 22/12/2014 0.1
219 31/11/2012 -300.32
219 31/11/2012 0.23
219a 12/01/2013 80.20
219a 12/01/2013 0.20
I have been working in Excel for the time being and at first had used
=IF(AND(ABS(C3)>1,A3=A2),1,0)
Just to indicate when an ID is repeated on a new admission date; then I did this again to indicate the 3rd admission, and began drawing out columns for the 4th, 5th, and 6th, planning to merge them. This is simply not an efficient solution, especially with a large data set. I am familiar with R and think that might be a better way to do the manipulation, but I am stuck on how to do this for the entire data set, continually adding a new indicator every time the same patient is admitted again. I am not even sure how to express it in pseudocode. Perhaps something like this:
Pseudo-Code
-> Run through ID Column
-> IF Dur is > 1 (it will always be > 1 for a new admission)
ANDIF ID already exists above with DUR > 1 = a, or if DUR > 1 TWICE for
same ID = b, or if DUR > THREE TIMES = c, and so on....
Any help would be great
In R, you have a lot of options. Your data has issues, however; since November only has 30 days, converting the DATE column to an actual date format will introduce NAs. (You could, of course, just leave it as character, but date formats are easier to work with.)
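Both versions below assume the sample data is already in a data frame called df; a minimal reconstruction from the question's table:
df <- read.table(text = "
ID DATE DUR
82 29/08/2014 10.32
82 29/08/2014 0.32
82 12/09/2014 13.35
82 12/09/2014 0.16
82 12/09/2014 0.24
82 12/09/2014 0.31
82 22/12/2014 100.39
82 22/12/2014 0.1
219 31/11/2012 -300.32
219 31/11/2012 0.23
219 12/01/2013 80.20
219 12/01/2013 0.20
", header = TRUE, stringsAsFactors = FALSE)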
With dplyr:
library(dplyr)
df %>% mutate(DATE = as.Date(DATE, '%d/%m/%Y')) %>% # parse date data
group_by(ID) %>% # group data by ID
mutate(visit = as.integer(factor(DATE))) # make an integer factor of DATE
# Source: local data frame [12 x 4]
# Groups: ID [2]
#
# ID DATE DUR visit
# (int) (date) (dbl) (int)
# 1 82 2014-08-29 10.32 1
# 2 82 2014-08-29 0.32 1
# 3 82 2014-09-12 13.35 2
# 4 82 2014-09-12 0.16 2
# 5 82 2014-09-12 0.24 2
# 6 82 2014-09-12 0.31 2
# 7 82 2014-12-22 100.39 3
# 8 82 2014-12-22 0.10 3
# 9 219 <NA> -300.32 NA
# 10 219 <NA> 0.23 NA
# 11 219 2013-01-12 80.20 1
# 12 219 2013-01-12 0.20 1
Base R has a lot of options, including ave and tapply, but to keep it simple so you can see what happens step by step in a split-apply-combine model: split by the grouping variable, lapply across the list, and use do.call(rbind, ...) to reassemble:
df$DATE <- as.Date(df$DATE, '%d/%m/%Y')
df <- do.call(rbind, lapply(split(df, df$ID), function(x) {
  data.frame(x, visit = as.integer(factor(x$DATE)))
}))
rownames(df) <- NULL # delete useless rownames
df
# ID DATE DUR visit
# 1 82 2014-08-29 10.32 1
# 2 82 2014-08-29 0.32 1
# 3 82 2014-09-12 13.35 2
# 4 82 2014-09-12 0.16 2
# 5 82 2014-09-12 0.24 2
# 6 82 2014-09-12 0.31 2
# 7 82 2014-12-22 100.39 3
# 8 82 2014-12-22 0.10 3
# 9 219 <NA> -300.32 NA
# 10 219 <NA> 0.23 NA
# 11 219 2013-01-12 80.20 1
# 12 219 2013-01-12 0.20 1
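If you need the letter-suffix IDs from the question rather than an integer visit counter, the visit column converts directly; a sketch (ID2 is my name for the new column, not from the original answer):
# Visit 1 keeps the bare ID, visit 2 becomes "82a", visit 3 becomes "82b",
# and so on; NA visits (the unparseable dates) keep the bare ID.
# Note letters[] runs out after 27 visits; patients with ~50 admissions
# would need a numeric suffix instead.
df$ID2 <- ifelse(is.na(df$visit) | df$visit == 1,
                 as.character(df$ID),
                 paste0(df$ID, letters[df$visit - 1]))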

R script to format datatable to exactly 2 decimal places

I have made a data table "Event_Table" with 46 rows and 6 columns. At some point I export this to a text file and would like the output of some fields to be truncated to exactly 2 decimal places.
Event_Table[1:34,3:6]=round(Event_Table[1:34,3:6])
Event_Table[36:39,3:6]=format(round(Event_Table[36:39,3:6],2), nsmall=2)
Event_Table[41:46,3:6]=format(round(Event_Table[41:46,3:6],2), nsmall=2)
Lines 1 and 2 produce the desired result, but subsequently running line 3 throws an error:
Error in Math.data.frame(list(CO = c("0", "0", "0.786407766990291", "0", :
non-numeric variable in data frame: CONCONATotal
Why? If I remove line 2, then line 3 runs fine. So something about setting the formatting in one part of the table is affecting the entire table and prevents a second format command from being possible (even though the formatting is only applied to discrete parts of the table). Any ideas how to avoid this, or how to achieve what is required in a different way?
EDIT:
I should perhaps add that the following code is not quite sufficient:
Event_Table[36:46,3:6]=round(Event_Table[36:46,3:6], digits=2)
Trailing zeros are truncated, i.e. a value of 1 is displayed as "1", not as "1.00", the latter being what is required.
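The distinction is between a numeric value and its printed representation; a quick illustration (my addition):
round(1.001, 2)        # 1      -- still numeric, trailing zeros vanish in print
sprintf("%.2f", 1.001) # "1.00" -- character, trailing zeros preserved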
EDIT2:
Here is the table:
ChrSize Chr CO NCO NA Total
1 230218 1 4.00 1.00 0 5.00
2 813184 2 6.00 6.00 0 12.00
3 316620 3 2.00 3.00 0 5.00
4 1531933 4 13.00 20.00 0 33.00
5 576874 5 3.00 8.00 0 11.00
6 270161 6 4.00 2.00 0 6.00
7 1090940 7 11.00 5.00 0 16.00
8 562643 8 5.00 9.00 0 14.00
9 439888 9 6.00 3.00 0 9.00
10 745751 10 10.00 6.00 0 16.00
11 666816 11 3.00 7.00 0 10.00
12 1078177 12 11.00 13.00 1 25.00
13 924431 13 7.00 12.00 0 19.00
14 784333 14 5.00 6.00 1 12.00
15 1091291 15 6.00 17.00 0 23.00
16 948066 16 7.00 6.00 0 13.00
17 12071326 TOTAL 103.00 124.00 2 229.00
18 NA Event Lengths: NA NA NA NA
19 NA Min Len 0.00 22.00 0 0.00
20 NA Max Len 14745.00 12524.00 0 14745.00
21 NA Mean Len 2588.00 1826.00 0 2153.00
22 NA Median Len 1820.00 1029.00 0 1322.00
23 NA Chromatids: NA NA NA NA
24 NA 1_chrom 0.00 98.00 2 100.00
25 NA 2_chrom 81.00 22.00 0 103.00
26 NA 3_chrom 14.00 4.00 0 18.00
27 NA 4_chrom 8.00 0.00 0 8.00
28 NA Classe: NA NA NA NA
29 NA 1_1brin 0.00 55.00 0 55.00
30 NA 1_2brins 0.00 43.00 2 45.00
31 NA 2_nonsis 81.00 15.00 0 96.00
32 NA 2_sis 0.00 7.00 0 7.00
33 NA classe_3 14.00 4.00 0 18.00
34 NA classe_4 8.00 0.00 0 8.00
35 NA Fraction of Chromatids: NA NA NA NA
36 NA 1_chrom 0.00 0.79 1 0.44
37 NA 2_chrom 0.79 0.18 0 0.45
38 NA 3_chrom 0.14 0.03 0 0.08
39 NA 4_chrom 0.08 0.00 0 0.03
40 NA Fraction of each Classe: NA NA NA NA
41 NA 1_1brin 0.00 0.44 0 0.24
42 NA 1_2brins 0.00 0.35 1 0.20
43 NA 2_nonsis 0.79 0.12 0 0.42
44 NA 2_sis 0.00 0.06 0 0.03
45 NA classe_3 0.14 0.03 0 0.08
46 NA classe_4 0.08 0.00 0 0.03
I require rows 1-34 formatted without decimals.
And rows 36-46 formatted with precisely 2 decimal places for all values.
EDIT3: The initial data is read sequentially into tables called "data", then a derivative output table "Event_Table" is generated in which I am inserting summaries of various aspects of each "data" table (i.e. totals, means, medians etc). I then sequentially export the "Event_Tables" since these contain the required summary informations for each "data" table.
Here is the start of the code:
# FIRST SET WORKING DIRECTORY WHERE INPUT FILES ARE!
files = list.files(pattern="Events_") # import files names with "Event_" string into variable "files"
files1 = length(files) # Count number of files
files2 = read.table(text = files, sep = "_", as.is = TRUE) #Split file names by "_" separator and create table "files2"
for (j in 1:files1) {
  data <- read.table(files[j], header=TRUE) # Import the data table from file j
# Making derivative dataframes:
Event_Table <- data.frame(matrix(NA, nrow = 46, ncol = 6)) # Creates dataframe of arbitrary size full of NAs
names(Event_Table) <- c("ChrSize","Chr","CO","NCO","NA","Total") # Adds column names to dataframe
Event_Table ["Chr"] = c(1:16, "TOTAL","Event Lengths:","Min Len", "Max Len","Mean Len","Median Len","Chromatids:","1_chrom","2_chrom","3_chrom","4_chrom","Classe:","1_1brin","1_2brins","2_nonsis","2_sis","classe_3","classe_4","Fraction of Chromatids:","1_chrom","2_chrom","3_chrom","4_chrom","Fraction of each Classe:","1_1brin","1_2brins","2_nonsis","2_sis","classe_3","classe_4") # Inserts vector 1:16 (numbers 1 to 16) in column 1 of dataframe
Event_Table [1:16,"ChrSize"] = c(230218,813184,316620,1531933,576874,270161,1090940,562643,439888,745751,666816,1078177,924431,784333,1091291,948066)
Event_Table [17,"ChrSize"] =sum(Event_Table [1:16,"ChrSize"])
nE = nrow(data) # Total number of events
Event_Table [17,"Total"] = nrow(data)
Event_Table [19,"Total"] = min(data ["len"])
Event_Table [20,"Total"] = max(data ["len"])
Event_Table [21,"Total"] = mean(data[["len"]]) # mean() needs a vector, not a one-column data frame
Event_Table [22,"Total"] = median(data [1:nrow(data),"len"])
#More stuff here, etc, then close j loop }
So the Event_Table is set up as a data.frame of type matrix filled with NAs.
I then fill it manually with relevant info in relevant grid positions.
I then simply want to format the visual appearance of these fields.
If I am going about this all wrong, then please can you suggest a better way to do this! Thanks
Here is a proof of concept using 2 rather different data frames:
DF1 <- data.frame(x = rnorm(10), person = rep(LETTERS[1:2], 5))
DF2 <- data.frame(y = 1:10L, result = rep(LETTERS[3:4], 5), alt = rep(letters[3:4], 5))
write.table(DF1, file = "example.csv", sep = ",")
write.table(DF2, file = "example.csv", sep = ",", append = TRUE)
This issues a warning (about column names - no problem) and gives:
x person
1 0.796933543 A
2 1.495800567 B
3 0.359153458 A
4 2.105378598 B
5 0.175455314 A
6 -1.850171347 B
7 -0.87197177 A
8 2.682650638 B
9 1.040676847 A
10 -0.086197042 B
y result alt
1 1 C c
2 2 D d
3 3 C c
4 4 D d
5 5 C c
6 6 D d
7 7 C c
8 8 D d
9 9 C c
10 10 D d
From here you can control the formatting as desired. You may wish to suppress the column names or give more informative ones, and you probably don't want the row numbering either. See ?write.table for all the options.
It could be a similar problem to Error in Math.data.frame.....non-numeric variable in data frame:. Here the likely cause is that format() returns character strings, so line 2 coerces columns 3:6 of the data frame to character; when line 3 then calls round() on those columns, Math.data.frame fails on the non-numeric data (commas in the data would produce the same symptom).
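One way to sidestep the coercion entirely is to keep Event_Table numeric and build a character copy only at export time. A sketch (row and column indices taken from the question; rows 35 and 40 are the label rows):
out <- Event_Table
for (j in 3:6) {
  col <- Event_Table[[j]]
  fmt <- character(length(col))
  fmt[1:34]  <- sprintf("%.0f", col[1:34])   # rows 1-34: no decimals
  fmt[36:46] <- sprintf("%.2f", col[36:46])  # rows 36-46: exactly 2 decimals
  fmt[c(35, 40)] <- NA                       # label rows carry no numbers
  out[[j]] <- fmt
}
write.table(out, "Event_Table.txt", sep = "\t", quote = FALSE, row.names = FALSE)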

function to return suitably lagged and iterated divided value in R

I have time series data, and I want to use a function to return suitably lagged and divided values.
Data:
ID Temperature value
1 -1.1923333
2 -0.2123333
3 -0.593
4 -0.7393333
5 -0.731
6 -0.4976667
7 -0.773
8 -0.6843333
9 -0.371
10 0.754
11 1.798
12 3.023
13 3.8233333
14 4.2456667
15 4.599
16 5.078
17 4.9133333
18 3.5393333
19 2.0886667
20 1.8236667
21 1.2633333
22 0.6843333
23 0.7953333
24 0.6883333
The function should work like this:
The new value for ID 23 = value(24)/value(23), for ID 22 = value(23)/value(22), for ID 21 = value(22)/value(21), and so forth.
Expected Results:
ID New Temperature value
1 0.17
2 2.79
3 1.24
4 0.98
5 0.68
6 1.55
7 0.885
8 0.54
9 -2.03
10 2.38
11 1.68
12 1.264
13 1.11
14 1.083
15 1.104
16 0.967
17 0.72
18 0.59
19 0.873
20 0.69
21 0.541
22 1.16
23 0.86
24 NaN
To divide each element of a vector x into the element that follows it (i.e., compute value(i+1)/value(i) at position i), use:
x[-1] / x[-length(x)]
This will return a vector with a length of length(x) - 1. If you really need the NaN value at the end, add it by hand via c(x[-1] / x[-length(x)], NaN).
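Applied to the question's data (a sketch, assuming the data frame is called df and the temperature column is named value):
# Successor divided by current element, with NaN appended for the last ID.
df$new_value <- c(df$value[-1] / df$value[-length(df$value)], NaN)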
