I have a whole bunch of data.frames with irregular time spacing.
I would like to make a new data.frame on a regular time grid and join the others to it, picking from each data.frame being joined the latest value at or before each time in the new data.frame.
For example, listOfDataFrames below contains a list of data.frames, each of which has a time column in seconds. I find the total range, integer-divide it by 60 and build a sequence from the result to obtain an increasing sequence of full minutes. Now I need to left-join the list of data.frames onto this new sequence, e.g. if the value in mypoints is 60, the value joined to it should be the latest value with Time <= 60.
xrange <- range(lapply(listOfDataFrames,function(x) range(x$Time)))
mypoints <- 60*do.call(seq,as.list(xrange%/%60))
I believe this is sometimes called an asof join.
Is there a simple procedure to do this?
Thanks
EDIT: this is what I currently use
xrange <- range(lapply(listOfDataFrames,function(x) range(x$Time)))
mypoints <- 60*seq(xrange[1]%/%60,1+xrange[2]%/%60)
result <- data.frame(Time=mypoints)
for(index in 1:length(listOfDataFrames))
{
x<-listOfDataFrames[[index]]
indices <- which(sort(c(mypoints,x$Time)) %in% mypoints) - 1:length(mypoints)
indices[indices==0] <- NA
newdf<-data.frame(new=x$Result[indices])
colnames(newdf)<-paste("S",index,sep="")
result <- cbind(result,newdf)
}
EDIT: full example
AsOfJoin <- function (listOfDataFrames) {
xrange <- range(lapply(listOfDataFrames,function(x) range(x$Time)))
mypoints <- 60*seq(xrange[1]%/%60,1+xrange[2]%/%60)
result <- data.frame(Time=mypoints)
for(index in 1:length(listOfDataFrames))
{
x<-listOfDataFrames[[index]]
indices <- which(sort(c(mypoints,x$Time)) %in% mypoints) - 1:length(mypoints)
indices[indices==0] <- NA
newdf<-data.frame(new=x$Result[indices])
colnames(newdf)<-paste("S",index,sep="")
result <- cbind(result,newdf)
}
result[is.na(result)]<-0
result
}
a<-data.frame(Time=c(28947.5,28949.6,29000),Result=c(10,15,9))
b<-data.frame(Time=c(28947.8,28949.5),Result=c(14,19))
listOfDataFrames <- list(a,b)
result<-AsOfJoin(listOfDataFrames)
> a
Time Result
1 28947.5 10
2 28949.6 15
3 29000.0 9
> b
Time Result
1 28947.8 14
2 28949.5 19
> result
Time S1 S2
1 28920 0 0
2 28980 15 19
3 29040 9 19
data.table provides very fast asof joins out of the box.
See also This post for an example
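For readers without data.table, the same as-of lookup can be sketched in base R with findInterval(), which for each grid point returns the index of the last time that is less than or equal to it (0 if there is none). This is only a sketch under the question's assumptions (columns named Time and Result, a 60-second grid). The name asof_join and the step argument are made up for illustration, and unlike the asker's AsOfJoin it leaves unmatched grid points as NA instead of 0.

```r
# Sketch of an as-of join in base R; asof_join is a hypothetical helper name.
asof_join <- function(frames, step = 60) {
  # Overall time range across all data frames
  rng <- range(unlist(lapply(frames, function(x) range(x$Time))))
  # Regular grid of full minutes covering that range
  grid <- step * seq(rng[1] %/% step, 1 + rng[2] %/% step)
  cols <- lapply(frames, function(x) {
    x <- x[order(x$Time), ]              # findInterval() needs sorted times
    idx <- findInterval(grid, x$Time)    # index of the last Time <= grid point
    idx[idx == 0] <- NA                  # no earlier observation available
    x$Result[idx]
  })
  names(cols) <- paste0("S", seq_along(frames))
  data.frame(Time = grid, cols)
}

a <- data.frame(Time = c(28947.5, 28949.6, 29000), Result = c(10, 15, 9))
b <- data.frame(Time = c(28947.8, 28949.5), Result = c(14, 19))
res <- asof_join(list(a, b))
res
```

On the example data this gives S1 = NA, 15, 9 and S2 = NA, 19, 19 at times 28920, 28980 and 29040, matching the question's result apart from NA in place of 0.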
See my edit for answer. Apparently the best way.
I'm new to R and I've done my best googling for the answer to the question below, but nothing has come up so far.
In Excel you can keep a specific column or row constant when using a reference by putting $ before the row number or column letter. This is handy when performing operations across many cells when all cells are referring to something in a single other cell. For example, take a dataset with grades in a course: Row 1 has the total number of points per class assignment (each column is an assignment), and Rows 2:31 are the raw scores for each of 30 students. In Excel, to calculate percentage correct, I take each student's score for that assignment and refer it to the first row, holding row constant in the reference so I can drag down and apply that operation to all 30 rows below Row 1. Most importantly, in Excel I can also drag right to do this across all columns, without having to type a new operation.
What is the most efficient way to perform this operation--holding a reference row constant while performing an operation to all other rows, then applying this across columns while still holding the reference row constant--in R? So far I had to slice the reference row to a new dataframe, remove that row from the original dataframe, then type one operation per column while manually going back to the new dataframe to look up the reference number to apply for that column's operation. See my super-tedious code below.
For reference, each column is an assignment, and Row 1 had the number of points possible for that assignment. All subsequent rows were individual students and their grades.
# Extract number of points possible
outof <- slice(grades, 1)
# Now remove that row (Row 1)
grades <- grades[-c(1),]
# Turn number correct into percentage. The divided by
# number is from the sliced Row 1, which I had to
# look up and type one-by-one. I'm hoping there is
# code to do this automatically in R.
grades$ExamFinal <- (grades$ExamFinal / 34) * 100
grades$Exam3 <- (grades$Exam3 / 26) * 100
grades$Exam4 <- (grades$Exam4 / 31) * 100
grades$q1.1 <- grades$q1.1 / 6
grades$q1.2 <- grades$q1.2 / 10
grades$q1.3 <- grades$q1.3 / 6
grades$q2.2 <- grades$q2.2 / 3
grades$q2.4 <- grades$q2.4 / 12
grades$q3.1 <- grades$q3.1 / 9
grades$q3.2 <- grades$q3.2 / 8
grades$q3.3 <- grades$q3.3 / 12
grades$q4.1 <- grades$q4.1 / 13
grades$q4.2 <- grades$q4.2 / 5
grades$q6.1 <- grades$q6.1 / 5
grades$q6.2 <- grades$q6.2 / 6
grades$q6.3 <- grades$q6.3 / 11
grades$q7.1 <- grades$q7.1 / 7
grades$q7.2 <- grades$q7.2 / 8
grades$q8.1 <- grades$q8.1 / 7
grades$q8.3 <- grades$q8.3 / 13
grades$q9.2 <- grades$q9.2 / 13
grades$q10.1 <- grades$q10.1 / 8
grades$q12.1 <- grades$q12.1 / 12
You can use sweep
100*sweep(grades, 2, outof, "/")
# ExamFinal EXam3 EXam4
#1 100.00 76.92 32.26
#2 88.24 84.62 64.52
#3 29.41 100.00 96.77
Data:
grades
ExamFinal EXam3 EXam4
1 34 20 10
2 30 22 20
3 10 26 30
outof
[1] 34 26 31
grades <- data.frame(ExamFinal=c(34,30,10),
EXam3=c(20,22,26),
EXam4=c(10,20,30))
outof <- c(34,26,31)
You can use mapply on the original grades dataframe (don't remove the first row) to divide rows by the first row. Then convert the result back to a dataframe.
as.data.frame(mapply("/", grades[2:31, ], grades[1, ]))
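To connect this with the layout described in the question (totals in Row 1, students below), here is a small sketch using sweep(). The data and column names are invented for illustration. It assumes every column is numeric and that the first row really holds the points possible.

```r
# Toy data: row 1 holds the points possible, rows 2-4 are students (made-up values)
grades <- data.frame(ExamFinal = c(34, 34, 30, 10),
                     Exam3     = c(26, 20, 22, 26))

outof <- unlist(grades[1, ])              # named numeric vector of totals
scores <- grades[-1, ]                    # student rows only
pct <- 100 * sweep(scores, 2, outof, "/") # divide each column by its own total
pct
```

Passing 2 as the margin plays the role of Excel's row-locked reference: each column is divided by its own entry in outof, so no per-column typing is needed.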
The easiest way is to use some type of loop. In this case I am using the sapply function to divide all of the elements in each column by the corresponding total score.
#Example data
outof<-data.frame(q1=c(3), q2=c(5))
grades<-data.frame(q1=c(1,2,3), q2=c(4,4, 5))
answermatrix <-sapply(1:ncol(grades), function(i) {
#grades[,i]/outof[i] #use this if "outof" is a vector
grades[,i]/outof[ ,i]
})
answermatrix
A loop would probably be your best bet.
First, extract the maximum number of points possible for each column, as listed in the first row, then use that number to calculate the percentage in the remaining rows of that column:
for (i in 1:ncol(df)) {
  a <- df[1, i]             # total points possible for this column
  j <- 2                    # start at row 2, the first student row
  while (j <= nrow(df)) {
    b <- df[j, i]           # this student's score in this column
    df[j, i] <- (b / a) * 100   # replace the score with the percentage
    j <- j + 1              # go to the next row
  }
}
If you wrap this loop in a function, the modified data frame is not copied back to the global environment automatically. One workaround is to assign it there explicitly, like so:
# x: the data frame; y: the (unquoted) name you want the completed data frame to be called
f1 <- function(x, y) {
  for (i in 1:ncol(x)) {
    a <- x[1, i]
    j <- 2
    while (j <= nrow(x)) {
      b <- x[j, i]
      x[j, i] <- (b / a) * 100
      j <- j + 1
    }
  }
  arg_name <- deparse(substitute(y)) #gets argument name
  var_name <- paste(arg_name) #construct the name
  assign(var_name, x, envir = .GlobalEnv) #produces global dataframe
}
I have a large data set that is organized as a list of 1044 data frames. Each data frame is a profile that holds the same data for a different station and time. I am trying to create a data frame that holds the output of my function fitsObs, but my current code only goes through a single data frame. Any ideas?
i=1
start=1
for(i in 1:1044){
station1 <- surveyCTD$stations[[i]]
df1 <- surveyCTD$data[[i]]
date1 <- surveyCTD$dates[[i]]
fitObs <- fitTp2(-df1$depth, df1$temp)
if(start==1){
start=0
dfout <- data.frame(
date=date1
,station=station1
)
names(fitObs) <- paste0(names(fitObs),"o")
dfout<-cbind(dfout, df1$temp, df1$depth)
dfout <- cbind(dfout, fitObs)
}
}
From a first look I would try two ways to debug it. First, print the head of a data frame inside the loop to understand how the loop behaves. Then check the scope of your variable dfout; it looks like it is only ever assigned inside the if block.
Moreover, setting i before the loop has no effect on the loop itself.
I have created a reproducible example of my best guess as to what you are asking. I also assume that you are able to adjust the concepts in this general example to suit your own problem. It's easier if you provide an example of your list in future.
First we create some reproducible data
a <- c(10,20,30,40)
b <- c(5,10,15,20)
c <- c(20,25,30,35)
df1 <- data.frame(x=a+1,y=b+1,z=c+1)
df2 <- data.frame(x=a,y=b,z=c)
ls1 <- list(df1,df2)
Which looks like this
print(ls1)
[[1]]
x y z
1 11 6 21
2 21 11 26
3 31 16 31
4 41 21 36
[[2]]
x y z
1 10 5 20
2 20 10 25
3 30 15 30
4 40 20 35
So we now have two dataframes within a single list. The following code should then go through the columns within each dataframe of the list and apply the mean() function to the data in each column. You can apply it over rows instead by passing 1 rather than 2 as the margin.
df <- do.call("rbind", lapply(ls1, function(x) apply(x,2,mean)))
df <- as.data.frame(df)
print(df)
x y z
1 26 13.5 28.5
2 25 12.5 27.5
You should be able to replace mean() with whatever function you have written for your data. Let me know if this helps.
Consider building a generalized function to be called within Map (a wrapper to mapply, the multiple, elementwise iterator member of the apply family) to build a list of data frames, each with your fitObs output. Pass all equal-length objects into the data.frame() constructor.
Then, outside the loop, run do.call for a final, single appended dataframe of all 1,044 smaller dataframes (assuming each has exactly the same names and number of columns):
# GENERALIZED FUNCTION
add_fit_obs <- function(dt, st, df) {
fitObs <- fitTp2(-df$depth, df$temp)
names(fitObs) <- paste0(names(fitObs),"o")
tmp <- data.frame(
date = dt,
station = st,
depth = df$depth,
temp = df$temp,
fitObs
)
return(tmp)
}
# LIST OF DATA FRAMES
# (arguments are passed in the order the function expects: date, station, data)
df_list <- Map(add_fit_obs, surveyCTD$dates, surveyCTD$stations, surveyCTD$data)
# EQUIVALENTLY:
# df_list <- mapply(add_fit_obs, surveyCTD$dates, surveyCTD$stations, surveyCTD$data, SIMPLIFY=FALSE)
# SINGLE DATAFRAME
master_df <- do.call(rbind, df_list)
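Since fitTp2 and surveyCTD are not available here, the Map / do.call(rbind, ...) pattern can be tried on toy data. In this sketch everything is invented for illustration: the profiles, the station and date values, and summarise_profile, whose min/max summary is only a stand-in for the real fit. It assumes one output row per profile.

```r
# Toy stand-ins for surveyCTD$data, $stations and $dates (all values invented)
profiles <- list(
  data.frame(depth = c(1, 5, 10), temp = c(12.0, 11.5, 9.0)),
  data.frame(depth = c(2, 6),     temp = c(13.0, 10.0))
)
stations <- c("A", "B")
dates    <- c("2020-01-01", "2020-01-02")

# Hypothetical per-profile function; the min/max summary replaces fitTp2()
summarise_profile <- function(dt, st, df) {
  fit <- list(tmin = min(df$temp), tmax = max(df$temp))
  data.frame(date = dt, station = st, fit)   # one row per profile
}

# Map walks the three inputs in parallel and returns a list of data frames
rows <- Map(summarise_profile, dates, stations, profiles)
out  <- do.call(rbind, rows)                 # bind them once, at the end
out
```

Binding once at the end avoids growing a data frame inside the loop, and every profile contributes a row instead of only the first one.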
I have some R code and I want to sum a result derived from a data frame (df$number is the unlisted result of 'res').
The values I want to sum are: [1] 1 3 5 7 9 20 31 42
digits <- function(x){as.integer(substring(x, seq(nchar(x)), seq(nchar(x))))}
generated <- function(x){ x + sum(digits(x))}
digitadition <- function(x,N) { c(x, replicate(N-1, x <<- generated(x))) }
res <- NULL
for(i in 0:50){
for(j in 2:50){
tmp <- digitadition(i,j)
IND <- 50*(i-1) + (j-1) - (i-1) #to index results
res[IND] <- tmp[length(tmp)]
}
}
df <- data.frame(number = unlist(res), generator=rep(1:50, each=49), N=2:50)
total <- table(df$number)[as.numeric(names(table(df$number)))<=50]
setdiff(1:50, as.numeric(names(total)))
sum(total)
I'm using sum(total), but it returns 155, which is not the right answer; the right answer is 118.
What is the specific code to sum the 'total'?
Thank you.
I ran your code and I think you may be confused about what you want to sum.
Your setdiff() call returns the values 1 3 5 7 9 20 31 42, which sum to 118.
So, if you do sum(setdiff(1:50, as.numeric(names(total)))), you'll get the 118 you are looking for.
Your total variable is different from this. Let me explain what you are doing and what I think you should do.
Your code: total <- table(df$number)[as.numeric(names(table(df$number)))<=50]
When you call table(), you get each unique value from the vector, together with the number of times that value appears in your vector.
And when you take the names() of this table, you get each of these unique values as a character string, which is why you are applying as.numeric.
But the function unique() does this job for you: it extracts the unique values from a vector.
Here's what you can do: total <- unique(df$number[which(df$number <= 50)])
Here which() gets the positions of the values <= 50, and unique() extracts the unique values at those positions.
And finally: sum(setdiff(1:50, total)) sums all the values from 1 to 50 that are not in your total vector.
Note that the order of the arguments matters: setdiff(total, 1:50) would instead return the values of total that are not in 1:50.
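The difference between summing a table and summing the missing values can be seen on a small made-up vector. The numbers below are illustrative only and are not taken from the question's data.

```r
number <- c(2, 4, 4, 6, 8, 10, 10, 60)   # invented values for illustration

# table() counts how often each value occurs, so summing it adds up counts
counts <- table(number)[as.numeric(names(table(number))) <= 10]
n_occurrences <- sum(counts)             # 7: occurrences of values <= 10

# unique() keeps the values themselves
seen <- unique(number[number <= 10])     # 2 4 6 8 10

# setdiff(x, y) returns the elements of x that are not in y
missing_vals <- setdiff(1:10, seen)      # 1 3 5 7 9
total_missing <- sum(missing_vals)       # 25
```

Summing counts answers "how many rows had a value in range", while summing the setdiff() answers "which values never appeared", which is the quantity the question is after.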
I have a list of stocks in an index sorted by date, and I'm trying to remove all rows in which the previous row has the same stock code. This will give a dataframe of the initial index and all dates on which there was a change to the index.
In my working example, I'll use names instead of the date column, and some numbers.
At first, I thought I could remove the rows by using subset() and !duplicated
name <- c("Joe","Mary","Sue","Frank","Carol","Bob","Kate","Jay")
num <- c(1,2,2,1,2,2,2,3)
num2 <- c(1,1,1,1,1,1,1,1)
df <- data.frame(name,num,num2)
dfnew <- subset(df, !duplicated(df[,2]))
However, this might not work in the case where a stock is removed from the list and then later replaced. So, in my working example, the desired output are the rows of Joe, Mary, Frank, Carol and Jay.
Next I created a function to tell if the index changes. The input of the function is row number:
#------ function to tell if there is a change in the row subset-----#
df2 <- as.matrix(df)
ChangeDay <- function(x){
Current <- df2[x,2:3]
Prev <- df2[x-1,2:3]
if (length(Current) != length(Prev))
NewList <- TRUE
else
NewList <- length(which(Current==Prev))!=length(Current)
return(NewList)
}
Finally, I attempt to create a loop to remove the desired rows. I'm new to programming, and I struggle with loops. I'm not sure what the best way is to pre-allocate memory when the dimensions of my final output is unknown. All the books I've looked at only give trivial loop examples. Here is my latest attempt:
result <- matrix(data=NA,nrow=nrow(df2),ncol=3) #pre allocate memory
tmp <- as.numeric(df2) #store the original data
changes <- 1
for (i in 2:nrow(df2)){ #always keep row 1, thus the loop starts at row 2
if(ChangeDay(i)==TRUE){
result[i,] <-tmp[i] #store the row in result if ChangeDay(i)==TRUE
changes <- changes + 1 #increment counter
}
}
result <- result[1:changes,]
Thanks for your help, and any additional general advice on loops is appreciated!
It is not clear what you want to do. But I guess :
df[c(1,diff(df$num)) !=0,]
name num num2
1 Joe 1 1
2 Mary 2 1
4 Frank 1 1
5 Carol 2 1
8 Jay 3 1
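diff() only works when the column is numeric. If the stock codes are character strings, one possible variant is to compare each element with the one before it. This is a sketch, not part of the answer above; first_of_run is a hypothetical helper name, and it assumes a non-empty vector without NA values.

```r
# TRUE where an element differs from the previous one; the first is always kept
first_of_run <- function(x) c(TRUE, x[-1] != x[-length(x)])

name <- c("Joe", "Mary", "Sue", "Frank", "Carol", "Bob", "Kate", "Jay")
num  <- c(1, 2, 2, 1, 2, 2, 2, 3)
df   <- data.frame(name, num)

kept <- df[first_of_run(df$num), ]
kept

# The same helper works on character codes (made-up tickers)
codes <- c("AAA", "BBB", "BBB", "AAA")
first_of_run(codes)
```

Because the comparison is vectorised, no loop or pre-allocation is needed; the logical vector selects the rows directly.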
How should I modify my code to update variables within a loop?
Specifically, I want to do something like the following:
myMatrix1 <- read.table(someFile)
myMatrix2 <- read.table(someFile2)
for (i in nrow(myMatrix2))
{
myMatrix3 <- myMatrix1[which(doSomeTest),]
myMatrix4 <- rep(myMatrix2$header1,nrow(myMatrix1))
myMatrix5 <- rep(myMatrix2$header2, nrow(myMatrix1))
myMatrix6 <- cbind(myMatrix3, myMatrix4, myMatrix5)
# *see question
}
How can I get myMatrix6 to be updated instead of reassigned the product of cbind(myMatrix3, myMatrix4, myMatrix5)? In other words, if the first iteration (i = 1) gave a myMatrix6 of:
> 1 1 1 1
> 2 2 2 2
and the second iteration (i = 2) gave myMatrix 6 of:
> 3 3 3 3
> 4 4 4 4
how do I get a dataframe(?) of:
> 1 1 1 1
> 2 2 2 2
> 3 3 3 3
> 4 4 4 4
UPDATE:
I have - thanks to DWin and Timo's suggestions - got the following. However, the following code has taken me about 2 hours to run on my datasets. Are there any ways to make it run any faster??? (without using a more powerful computer I may add)
# create empty matrix for sedimentation
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
# create empty matrix for bore
myMatrix7 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix7) <- letters[1:4]
for (i in 1:nrow(myMatrix2))
{
# create matrix that has the value of myMatrix1$begin being
# situated between the values of myMatrix2begin[i] and myMatrix2finish[i]
myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin < myMatrix2$finish[i])),]
myMatrix4 <- rep(myMatrix2$sedimentation, nrow(myMatrix3))
if (is.na(myMatrix2$boreWidth[i])) {
myMatrix5 <- rep(NA, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] == 0) {
myMatrix5 <- rep(TRUE, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] > 0) {
myMatrix5 <- rep(FALSE, nrow(myMatrix3))
}
myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}
Instead, initialize myMatrix6 to an empty data.frame before the loop and rbind each iteration's result onto it (which may be inefficient). If efficiency is a concern, pre-allocate the data.frame to the size you want and fill in its rows by indexing.
# Method # 1 code
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
for (i in 1:nrow(myMatrix2)) {
myMatrix3 <- myMatrix1[which(doSomeTest),]
myMatrix4 <- rep(myMatrix2$header1,nrow(myMatrix1))
myMatrix5 <- rep(myMatrix2$header2, nrow(myMatrix1))
myMatrix6 <- rbind( myMatrix6, cbind(myMatrix3, myMatrix4, myMatrix5) )
}
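When the final number of rows is not known in advance, a common middle ground is to store each iteration's piece in a pre-allocated list and call rbind once at the end. The sketch below uses invented data; obs, intervals and their columns only mirror the names in the question's update and are assumptions.

```r
# Invented stand-ins for myMatrix1 (obs) and myMatrix2 (intervals)
obs <- data.frame(begin = c(1, 4, 6, 9))
intervals <- data.frame(begin = c(0, 5), finish = c(5, 10),
                        sedimentation = c("low", "high"))

pieces <- vector("list", nrow(intervals))   # one slot per iteration
for (i in seq_len(nrow(intervals))) {
  # rows of obs whose begin lies strictly inside interval i
  hit <- obs$begin > intervals$begin[i] & obs$begin < intervals$finish[i]
  pieces[[i]] <- cbind(obs[hit, , drop = FALSE],
                       sedimentation = rep(intervals$sedimentation[i], sum(hit)))
}
combined <- do.call(rbind, pieces)          # a single rbind instead of one per pass
combined
```

Growing a data frame with rbind inside the loop copies all accumulated rows on every pass, so collecting the pieces first usually scales much better.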
In your code, you are not dealing with matrices (in the sense of R), but data frames, as read.table returns a data frame.
Either way, you can append one matrix/data frame to another (assuming the column names match) with the rbind command.
For example, if
> a = data.frame(x=c(1,2,3),y=c(4,5,6),z=c(7,8,9))
> b = data.frame(x=c(4,5),y=c(5,6),z=c(6,7))
then
> rbind(a,b)
x y z
1 1 4 7
2 2 5 8
3 3 6 9
4 4 5 6
5 5 6 7
There are other gotchas in the code you provide. For example
for (i in length(someVector))
should be
for (i in 1:length(someVector))
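A quick way to see the difference is to count the iterations. This sketch also uses seq_along(), which behaves like 1:length(x) but yields no iterations for an empty vector; the variable names are illustrative.

```r
v <- c(10, 20, 30)

count_once <- 0
for (i in length(v)) count_once <- count_once + 1    # iterates over the single value 3

count_all <- 0
for (i in seq_along(v)) count_all <- count_all + 1   # iterates over 1, 2, 3

empty <- numeric(0)
count_empty <- 0
for (i in seq_along(empty)) count_empty <- count_empty + 1   # body never runs

c(count_once, count_all, count_empty)
```

The first loop runs its body only once, which is why a loop written as for (i in nrow(x)) appears to process just a single row.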
R has many functions for iterating over data.frames, vectors etc and can do all kinds of data transformations. Most of the time one does not need to write a for loop.
If you would provide more details about what you are trying to do, maybe we can find a simpler solution.
EDIT:
It seems from your post update that you are trying to do some sort of conversion between 'wide' and 'long' format and filter out some lines that fail a test. Correct me, if I am wrong.
Anyway, if that is the case, you should check out the reshape command. There is also a reshape package containing the extremely useful commands melt and cast, which can do that kind of transformation quite efficiently, and a merge command for doing certain "join" operations on data frames. I'm quite sure your problem could be solved by using a combination of the above commands, but it depends on the exact details.
For filtering rows/columns by some criteria, check out the subset command.