Calculation in columns using previous row value without loops - r

I have data in columns on which I need to do calculations. Is it possible to do this using previous row values without using a loop? E.g. if the value in the first column is 139, calculate the mean of the last 5 values in Data and the percent change between the value 5 rows above and the value in the current row?
ID   Data   PF
135     5   123
136     4   141
137     5   124
138     6   200
139     1   310
140     2   141
141     4   141
So for this dataset you would:
Find 139 in the ID column
Return the average of the last 5 rows of Data (gives 4.2)
Return the percent change in PF from the value 5 rows above to the current value (gives 152%)
If I were to do this with a loop, it would look like this:
for (i in 1:nrow(data)) {
  if (data$ID[i] == 139 && i >= 5) {
    data$New_column[i] <- data[i, "PF"] / data[i - 4, "PF"] - 1
  }
}
The problem is that the loop takes too long because there are too many data points. The ID 139 will appear several times in the dataset.
Many thanks.
Carlos

As pointed out by Tutuchacn and Sotos, use the package zoo to get the mean of the Data in the last N rows (inclusive of the row) you are querying (assuming your data is in the data frame df):
library(zoo)
ind <- which(df$ID==139) ## this is the row you are querying
N <- 5 ## here, N is 5
res <- rollapply(df$Data, width=N, mean)[ind-(N-1)]
print(res)
## [1] 4.2
rollapply(..., mean) returns the rolling mean of the windowed data of width=N. Note that the index used to query the output from rollapply is lagged by N-1 because the rolling mean is applied forward in the series.
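If the manual N-1 index shift feels error-prone, zoo also provides rollapplyr() (rollapply() with align = "right"); combined with fill = NA it keeps the output aligned with the original rows, so no shift is needed. A sketch of the same computation:
res <- rollapplyr(df$Data, width = N, FUN = mean, fill = NA)[ind]
## [1] 4.2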
To get the percent performance from PF as you specified:
percent.performance <- function(x) {
  z <- zoo(x)                          ## create a zoo series
  ## zoo's lag() follows the stats::lag() convention, so a conventional lag
  ## needs a negative k; na.pad keeps the result aligned with the input rows
  lz <- lag(z, k = -4, na.pad = TRUE)  ## the value 4 rows above
  return(z / lz - 1)
}
res <- as.numeric(percent.performance(df$PF)[ind])
print(res)
## [1] 1.520325
Here, we define a function percent.performance that returns what you want for all rows of df for which the computation makes sense. We then extract the row we want using ind and convert it to a number.
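Since which() returns every matching row, ind is a vector whenever ID 139 occurs several times, and both computations above then return one value per occurrence. If you want them together, a hypothetical wrapper (summarise_id is just an illustrative name, and the sketch assumes every match has at least N - 1 rows above it):
summarise_id <- function(df, id, N = 5) {
  ind <- which(df$ID == id)  ## all rows where this ID occurs
  data.frame(row  = ind,
             mean = rollapplyr(df$Data, N, mean, fill = NA)[ind],
             perf = df$PF[ind] / df$PF[ind - (N - 1)] - 1)
}
summarise_id(df, 139)  ## one row per occurrence: row 5, mean 4.2, perf 1.5203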
Hope this helps.

Is that what you want?
ntest <- 139
sol <- lapply(5:nrow(df), function(ii) {
  tdf <- df[(ii - 4):ii, ]  # window: the current row plus the 4 rows above it
  if (tdf[5, 1] == ntest)
    c(row = ii,
      average = mean(tdf[, "Data"]),
      performance = round(100 * (tdf[5, "PF"] / tdf[1, "PF"] - 1), 0))
})
sol <- sol[!sapply(sol, is.null)]  # remove NULLs for the rows that don't match
sol
[[1]]
        row     average performance
        5.0         4.2       152.0

This could be a decent start:
mytext = "ID,Data,PF
135,5,123
136,4,141
137,5,124
138,6,200
139,1,310
140,2,141
141,4,141"
mydf <- read.table(text=mytext, header = T, sep = ",")
do.call(rbind, lapply(mydf$ID[which(mydf$ID == 139):nrow(mydf)], function(x) {
  tempdf <- mydf[1:which(mydf$ID == x), ]  # all rows up to and including this ID
  data.frame(ID = x,
             Data = mean(tempdf$Data),
             PF = 100 * (tempdf[nrow(tempdf), "PF"] - tempdf[nrow(tempdf) - 4, "PF"]) /
                  tempdf[nrow(tempdf) - 4, "PF"])
}))
ID Data PF
139 4.200000 152.03252
140 3.833333 0.00000
141 3.857143 13.70968
The idea here is: you take the IDs from 139 to the end and apply lapply to each of them, generating a temporary data.frame that holds all the rows above that particular ID (including the ID itself). Then you take the mean of the Data column and the rate of change (what you call performance) of the PF column. Note that this takes the mean over all rows up to the ID rather than just the last five; see the adjustment sketched below.
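If you want the Data mean over just the last five rows (as in the original spec) rather than over everything above the ID, a small adjustment of the same idea, sketched with zoo's rolling mean:
library(zoo)
idx <- which(mydf$ID == 139):nrow(mydf)
data.frame(ID   = mydf$ID[idx],
           Data = rollapplyr(mydf$Data, 5, mean, fill = NA)[idx],  ## mean of the last 5 rows only
           PF   = 100 * (mydf$PF[idx] / mydf$PF[idx - 4] - 1))
The PF column comes out the same as above; only the Data column changes.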


Is there an R function equivalent to Excel's $ for "keep reference cell constant" [duplicate]

This question already has answers here: Divide each data frame row by vector in R (5 answers). Closed 2 years ago.
I'm new to R and I've done my best googling for the answer to the question below, but nothing has come up so far.
In Excel you can keep a specific column or row constant when using a reference by putting $ before the row number or column letter. This is handy when performing operations across many cells when all cells are referring to something in a single other cell. For example, take a dataset with grades in a course: Row 1 has the total number of points per class assignment (each column is an assignment), and Rows 2:31 are the raw scores for each of 30 students. In Excel, to calculate percentage correct, I take each student's score for that assignment and refer it to the first row, holding row constant in the reference so I can drag down and apply that operation to all 30 rows below Row 1. Most importantly, in Excel I can also drag right to do this across all columns, without having to type a new operation.
What is the most efficient way to perform this operation--holding a reference row constant while performing an operation to all other rows, then applying this across columns while still holding the reference row constant--in R? So far I had to slice the reference row to a new dataframe, remove that row from the original dataframe, then type one operation per column while manually going back to the new dataframe to look up the reference number to apply for that column's operation. See my super-tedious code below.
For reference, each column is an assignment, and Row 1 had the number of points possible for that assignment. All subsequent rows were individual students and their grades.
# Extract number of points possible
library(dplyr)  # for slice()
outof <- slice(grades, 1)
# Now remove that row (Row 1)
grades <- grades[-c(1),]
# Turn number correct into percentage. The divided by
# number is from the sliced Row 1, which I had to
# look up and type one-by-one. I'm hoping there is
# code to do this automatically in R.
grades$ExamFinal <- (grades$ExamFinal / 34) * 100
grades$Exam3 <- (grades$Exam3 / 26) * 100
grades$Exam4 <- (grades$Exam4 / 31) * 100
grades$q1.1 <- grades$q1.1 / 6
grades$q1.2 <- grades$q1.2 / 10
grades$q1.3 <- grades$q1.3 / 6
grades$q2.2 <- grades$q2.2 / 3
grades$q2.4 <- grades$q2.4 / 12
grades$q3.1 <- grades$q3.1 / 9
grades$q3.2 <- grades$q3.2 / 8
grades$q3.3 <- grades$q3.3 / 12
grades$q4.1 <- grades$q4.1 / 13
grades$q4.2 <- grades$q4.2 / 5
grades$q6.1 <- grades$q6.1 / 5
grades$q6.2 <- grades$q6.2 / 6
grades$q6.3 <- grades$q6.3 / 11
grades$q7.1 <- grades$q7.1 / 7
grades$q7.2 <- grades$q7.2 / 8
grades$q8.1 <- grades$q8.1 / 7
grades$q8.3 <- grades$q8.3 / 13
grades$q9.2 <- grades$q9.2 / 13
grades$q10.1 <- grades$q10.1 / 8
grades$q12.1 <- grades$q12.1 / 12
You can use sweep
100*sweep(grades, 2, outof, "/")
# ExamFinal EXam3 EXam4
#1 100.00 76.92 32.26
#2 88.24 84.62 64.52
#3 29.41 100.00 96.77
Data:
grades
ExamFinal EXam3 EXam4
1 34 20 10
2 30 22 20
3 10 26 30
outof
[1] 34 26 31
grades <- data.frame(ExamFinal = c(34, 30, 10),
                     EXam3 = c(20, 22, 26),
                     EXam4 = c(10, 20, 30))
outof <- c(34,26,31)
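For reference, sweep(x, MARGIN, STATS, FUN) applies FUN between each slice of x along MARGIN (2 = columns) and the corresponding element of STATS. A hand-rolled equivalent of the call above, just to make explicit what it does:
pct <- grades
for (j in seq_along(outof)) {
  pct[[j]] <- 100 * grades[[j]] / outof[j]  ## divide column j by its total
}
pct  ## identical to 100*sweep(grades, 2, outof, "/")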
You can use mapply on the original grades dataframe (don't remove the first row) to divide rows by the first row. Then convert the result back to a dataframe.
as.data.frame(mapply("/", grades[2:31, ], grades[1, ]))
The easiest way is to use some type of loop. In this case I am using the sapply function to divide all of the elements in each column by the corresponding total score.
# Example data
outof <- data.frame(q1 = c(3), q2 = c(5))
grades <- data.frame(q1 = c(1, 2, 3), q2 = c(4, 4, 5))

answermatrix <- sapply(1:ncol(grades), function(i) {
  # grades[, i] / outof[i]  # use this if "outof" is a vector
  grades[, i] / outof[, i]
})
answermatrix
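One caveat with looping over column positions like this: the result drops the column names. They are easy to restore afterwards:
colnames(answermatrix) <- names(grades)  ## sapply over 1:ncol() loses the names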
A loop would probably be your best bet.
First you want to extract the total points possible, as listed in the first row, and then use that number to calculate the percentage in the remaining rows of each column:
for (i in 1:ncol(df)) {
  a <- df[1, i]                # the total points possible for this column
  j <- 2                       # start at row 2; row 1 holds the totals
  while (j <= nrow(df)) {      # walk down the student rows
    b <- df[j, i]              # the raw score for this student
    df[j, i] <- (b / a) * 100  # replace it with the percentage
    j <- j + 1                 # go to the next row
  }
}
The only drawback to this approach is that a data frame modified inside a function isn't copied back to the global environment, but that can be worked around by introducing a function like so:
f1 <- function(x, y) {  # x: the data frame; y: the name you want for the result
  for (i in 1:ncol(x)) {
    a <- x[1, i]
    j <- 2
    while (j <= nrow(x)) {
      x[j, i] <- (x[j, i] / a) * 100
      j <- j + 1
    }
  }
  arg_name <- deparse(substitute(y))       # gets the argument name
  var_name <- paste(arg_name)              # constructs the name
  assign(var_name, x, envir = .GlobalEnv)  # produces the global data frame
}
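As a design note, writing into .GlobalEnv with assign() is generally discouraged in R; the more idiomatic pattern is to return the modified data frame and let the caller name it. A minimal sketch, assuming all columns are numeric and row 1 holds the totals:
f2 <- function(x) {
  totals <- as.numeric(x[1, ])  ## row 1 holds the points possible
  ## divide each remaining row by the matching total, column by column
  as.data.frame(Map(function(col, tot) 100 * col / tot, x[-1, ], totals))
}
grades_pct <- f2(grades)  ## the caller chooses the result's name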

Function to return the first five values in R

I wrote a function in R which is supposed to return the first five developers who made the most input:
developer.busy <- function(x){
  bus.dev <- sort(table(x$devf), decreasing = TRUE)
  return(bus.dev)
}
developer.busy(test2)
ericb shields mdejong cabbey lord elliott-oss jikesadmin coar
3224 1432 998 470 241 179 77 1
At the moment it prints out all developers sorted in decreasing order. I just want the first 5 to be shown. How can I make this possible? Any suggestion is welcome.
If we want the first five, either index with [ or use head. The function below is modified to take three inputs: the data object, the column name ('colnm'), and the number of elements to extract ('n'):
developer.busy <- function(data, colnm, n){
  sort(table(data[[colnm]]), decreasing = TRUE)[seq_len(n)]
  # or another option is
  head(sort(table(data[[colnm]]), decreasing = TRUE), n)
}
developer.busy(test2, "devf", n = 5)
Using a reproducible example with the mtcars dataset:
data(mtcars)
developer.busy(mtcars, 'carb', 5)
# 2 4 1 3 6
#10 10 7 3 1
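A small difference between the two variants is worth knowing when choosing one: head() truncates silently when n exceeds the number of distinct values, while [seq_len(n)] pads with NAs. For example:
x <- sort(table(mtcars$carb), decreasing = TRUE)  ## 6 distinct carb values
length(head(x, 10))     ## 6: head() stops at the end of the table
length(x[seq_len(10)])  ## 10: positional indexing pads with NAs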

Extract data distributed equally from a dataframe - R

I have a data.frame (df) with some number of rows (numElement) and I wish to extract X elements (numExtract) distributed equally across the df and store them in a new data frame (extractData). When I use the script below, I sometimes get a different number of elements in extractData (bigger by one than numExtract). How can I fix it?
Script:
numElement <- 400
df <- data.frame(seq(1:numElement))
numExtract <- 5
extractData <- df[seq(1, nrow(df), by = round(nrow(df)/numExtract)), ]

numElement <- 400
df <- data.frame(seq(1:numElement))
numExtract <- 7
extractData <- df[seq(1, nrow(df), by = round(nrow(df)/numExtract)), ]
I cannot comment yet, but round without extra arguments rounds to the nearest integer.
In your first case you step by exactly every 80th element, so you get 5; in the second case round(400/7) gives a step of 57, which yields the elements with indices 1 58 115 172 229 286 343 400 (8 indices in total, one more than numExtract).
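A simple fix, assuming what you need is exactly numExtract equally spaced indices: let seq() place the positions with length.out instead of stepping by a rounded increment, and round the positions themselves:
## always returns exactly numExtract rows, with the first and last included
extractData <- df[round(seq(1, nrow(df), length.out = numExtract)), ]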
Custom function
Use cut to obtain more intuitive intervals, and extract the left break point of each. It uses gsubfn::strapply for the substring extraction:
library(gsubfn)
myfun <- function(maxval, numbreaks) {
  require(gsubfn)
  x <- unique(cut(1:maxval, numbreaks - 1))
  A <- sapply(x, function(Z) {
    round(as.numeric(strapply(as.character(Z), "^[(](\\S+)[,]", perl = TRUE)))
  })
  A <- c(A, maxval)
  return(A)
}
Output
myfun(400, 5)
# 1 101 200 300 400
myfun(400, 7)
# 1 68 134 200 267 334 400

R programming Sum data frame

I have some R code and I want to sum values from a data frame (df$number is the unlisted result in 'res').
The values whose sum I want are: [1] 1 3 5 7 9 20 31 42
digits <- function(x) { as.integer(substring(x, seq(nchar(x)), seq(nchar(x)))) }
generated <- function(x) { x + sum(digits(x)) }
digitadition <- function(x, N) { c(x, replicate(N - 1, x <<- generated(x))) }

res <- NULL
for (i in 1:50) {  # generators 1 to 50, matching generator = rep(1:50, each = 49) below
  for (j in 2:50) {
    tmp <- digitadition(i, j)
    IND <- 50*(i-1) + (j-1) - (i-1)  # to index results
    res[IND] <- tmp[length(tmp)]
  }
}
df <- data.frame(number = unlist(res), generator=rep(1:50, each=49), N=2:50)
total <- table(df$number)[as.numeric(names(table(df$number)))<=50]
setdiff(1:50, as.numeric(names(total)))
sum(total)
I'm using sum(total), but the result is 155, which is not the right answer; the right answer is 118.
What is the specific code to sum these values?
Thank you.
I ran your code and I think you may be confused about what you want to sum.
Your setdiff contains the values 1 3 5 7 9 20 31 42, whose sum is 118.
So, if you do sum(setdiff(1:50, as.numeric(names(total)))), you'll get the 118 you are looking for.
Your total variable is different from this. Let me explain what you are doing and what I think you should do.
Your code: total <- table(df$number)[as.numeric(names(table(df$number))) <= 50]
When you call table(), you get each unique value from the vector together with a count of how many times it appears.
And when you take the names() of this table, you get each of those unique values as a character string; that's why you need as.numeric.
But the function unique() does this job for you: it extracts the unique values from a vector.
Here's what you can do: total <- unique(df$number[which(df$number <= 50)])
Where which() gets the indices of the values <= 50, and unique() extracts the unique values at those indices.
And finally: sum(setdiff(1:50, total)), which sums all the values from 1 to 50 that are not in your total vector.
Note that the argument order of setdiff() matters here: setdiff(1:50, total) gives the values of 1:50 that are missing from total, which is what you want; reversing the arguments would instead give the values of total that are not in 1:50.
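For what it's worth, the same 118 can be reached without the nested loops: a value in 1:50 is missing from the generated numbers exactly when no x produces it in a single step. A compact sketch (digitsum and one.step are just illustrative names):
digitsum <- function(x) sum(as.integer(strsplit(as.character(x), "")[[1]]))
one.step <- sapply(1:50, function(x) x + digitsum(x))  ## every value reachable in one step
sum(setdiff(1:50, one.step))
## [1] 118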

Including all permutations when using data.table[,,by=...]

I have a large data.table that I am collapsing to the month level using by.
There are 5 by vars, with numbers of levels c(4, 3, 106, 3, 1380); the 106 is months, the 1380 is a geographic unit. As it turns out some cells are empty: certain combinations of the by vars have no rows. by drops these, but I'd like it to keep them with a count of 0.
Reproducible example:
require(data.table)
set.seed(1)
n <- 1000
s <- function(n, l = 5) sample(letters[seq(l)], n, replace = TRUE)
dat <- data.table(x = runif(n), g1 = s(n), g2 = s(n), g3 = s(n, 25))
datCollapsed <- dat[, list(nv = .N), by = list(g1, g2, g3)]
datCollapsed[, prod(dim(table(g1, g2, g3)))]  # how many groups there should be: 5*5*25 = 625
nrow(datCollapsed)                            # how many there are
Is there an efficient way to fill in these missing values with 0's, so that all permutations of the by vars are in the resultant collapsed data.table?
I'd also go with a cross-join, but would use it in the i-slot of the original call to [.data.table:
keycols <- c("g1", "g2", "g3") ## Grouping columns
setkeyv(dat, keycols) ## Set dat's key
ii <- do.call(CJ, sapply(dat[, ..keycols], unique)) ## CJ() to form index
datCollapsed <- dat[ii, list(nv=.N)] ## Aggregate
## Check that it worked
nrow(datCollapsed)
# [1] 625
table(datCollapsed$nv)
# 0 1 2 3 4 5 6
# 135 191 162 82 39 13 3
This approach is referred to as a "by-without-by" and, as documented in ?data.table, it is just as efficient and fast as passing the grouping instructions in via the by argument:
Advanced: Aggregation for a subset of known groups is
particularly efficient when passing those groups in i. When
i is a data.table, DT[i,j] evaluates j for each row
of i. We call this by without by or grouping by i.
Hence, the self join DT[data.table(unique(colA)),j] is
identical to DT[,j,by=colA].
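One caveat if you are on a newer version of data.table: since 1.9.4 (if memory serves) this implicit by-without-by is no longer the default, and grouping by the rows of i has to be requested explicitly with by = .EACHI, so the aggregation line becomes:
datCollapsed <- dat[ii, list(nv = .N), by = .EACHI]  ## explicit grouping by each row of i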
Make a Cartesian join of the unique values, and use that to join back to your results:
dat.keys <- dat[,CJ(g1=unique(g1), g2=unique(g2), g3=unique(g3))]
setkey(datCollapsed, g1, g2, g3)
nrow(datCollapsed[dat.keys]) # effectively a left join of datCollapsed onto dat.keys
# [1] 625
Note that the missing values are NA right now, but you can easily change that to 0s if you want.
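For completeness, one way to do that NA-to-0 replacement, using data.table's update-by-reference:
full <- datCollapsed[dat.keys]  ## left join onto the complete key grid
full[is.na(nv), nv := 0L]       ## replace the NA counts with zeros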
