Insert row every nth value from one dataframe to another in R

I am trying to insert many rows from one dataframe to another. I managed to do it once, but I have to do the same 3500 times.
I have the following two dataframes with the same headers:
dataframe a with 850561 rows and 121 columns
dataframe b with 854001 rows and 121 columns
And I used the following code to insert a row from b to a:
a <- rbind(a[1:244,], b[245,], a[-(1:244),])
This works perfectly because it inserts a row from b in between rows 244 and 245 of a.
The problem is that I have to do the same every 243 rows, for example the next would be something like this:
a <- rbind(a[246:489,], b[489,], a[-(246:489),])
I have tried with the following for loop, but it does not work:
i <- 244
j <- 1
for (val in a) {
  a <- rbind(a[j:i,], b[i+1,], a[-(j:i),])
  i <- i + 243
  j < j + 245
}
I would much appreciate your help.

One possible approach is to rbind(a, b) into a single table together with a new variable that, when sorted on, puts the rows in the correct order.
This new variable would take the values 1:(dim(a)[1]) for the existing rows of a, and in-between values such as 244.5 for the rows of b, so that each inserted row lands between the two rows of a it should separate. Then, sorting on it makes the rows of b appear in the correct places.
(Edit: I've just realised this is slightly different to your situation, but it should be easily adaptable; all you would need to do is select the appropriate rows of b first.)
Here's a toy example:
# Create a and b
a <- data.frame(x = runif(15), source = "a")
b <- data.frame(x = runif(3), source = "b")
# Suppose I want b interleaved with a, such that a row of b occurs after every 5 rows of a:
a$sortorder <- 1:dim(a)[1]
b$sortorder <- (1:dim(b)[1]) * 5 + 0.5
dat <- rbind(a, b)
dat[order(dat$sortorder), -3]
x source
1 0.10753897 a
2 0.24482683 a
3 0.72236241 a
4 0.03273494 a
5 0.54934999 a
16 0.20578103 b
6 0.68280868 a
7 0.30024090 a
8 0.38877446 a
9 0.73244615 a
10 0.96249642 a
17 0.13878037 b
11 0.76059617 a
12 0.58167316 a
13 0.46287299 a
14 0.35630962 a
15 0.38392182 a
18 0.38908214 b
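Adapting the same sort-key idea to the original problem, here is a sketch with small stand-in tables. Which rows of b get pulled is an assumption (every (k+1)-th row); adjust b_rows to whatever selection the real data needs:

```r
# Stand-ins for the real 850561-row a and 854001-row b
a <- data.frame(x = 1:12,    source = "a")
b <- data.frame(x = 101:115, source = "b")

k        <- 4                             # 243 in the real data
n_insert <- nrow(a) %/% k                 # 3500 in the real data
b_rows   <- seq_len(n_insert) * (k + 1)   # assumed selection from b

a$sortorder <- seq_len(nrow(a))
b_sel <- b[b_rows, ]
b_sel$sortorder <- seq_len(n_insert) * k + 0.5   # lands after rows k, 2k, ...

out <- rbind(a, b_sel)
out <- out[order(out$sortorder), names(out) != "sortorder"]
```

With k set to 243 and the appropriate b_rows, this replaces all 3500 rbind calls with one rbind and one sort.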

Related

Insert row/s of NAs on the Nth row to list of data.frames with N from list

After numerous hours I find myself unable to solve the following issue:
I have a list of dataframes. I want to insert (not replace) one or more rows of NAs (always at least one row) into every DF respectively. The numbers of NAs to insert are stored in a separate list.
To illustrate, I have the following two lists:
#list of dataframes
listDF <- list(data.frame(1:10),data.frame(1:9))
#list of row-indexes
listRI<- list(1,c(3,5))
My task, hence, is to insert one row of NAs at the first row of the first dataframe of listDF, and two rows of NAs (rows 3 and 5) into the second dataframe of listDF.
From Add new row to dataframe, at specific row-index, not appended?, answer 156, I have made the following function:
insertRow <- function(df, rowindex) {
  df[seq(rowindex + 1, nrow(df) + 1), ] <- df[seq(rowindex, nrow(df)), ]
  df[rowindex, ] <- rep(NA, ncol(df))
  df
}
After this, I'm not sure how to proceed. Looking around SO and other pages, I figure that the Map function might help me. It works as long as there is only one row to add to each df. For instance, this works fine:
#Example with insert of single row in both dataframes
Map(function(x, y) { insertRow(x, y) }, x = listDF, y = list(1, 5))
This inserts one row of NAs at the first row of the first df and a row of NAs at the fifth row of the second df. However, if I use:
#Example with insert of multiple rows in a dataframe
Map(function(x, y) { insertRow(x, y) }, x = listDF, y = listRI)
the function does not work (since the second element of listRI has length > 1). What I miss, if I have understood it correctly, is a for-loop that updates those dfs of listDF where I want to insert several rows of NAs. Can I get some input on how to solve my issue?
As always, please let me know if I need to be clearer. Best/John
Edit:
I edited the example code to not only include first number/numbers of row indexes.
Edit (again):
In case someone runs into this code and intends to use it: I found a problem with the insertRow function when the new row is to be appended after the last row of the dataframe. I solved this by editing the function as follows:
insertRow <- function(df, rowindex) {
  if (rowindex <= nrow(df)) {
    df[seq(rowindex + 1, nrow(df) + 1), ] <- df[seq(rowindex, nrow(df)), ]
    df[rowindex, ] <- rep(NA, ncol(df))
    return(df)
  }
  if (rowindex >= nrow(df) + 1) {
    df[nrow(df) + 1, ] <- rep(NA, ncol(df))
    return(df)
  }
}
You can add a for loop to go over listRI.
Map(function(x, y) { for (i in y) { x <- insertRow(x, i) }; x }, x = listDF, y = listRI)
#[[1]]
# X1.10
#1 NA
#2 1
#3 2
#4 3
#5 4
#6 5
#7 6
#8 7
#9 8
#10 9
#11 10
#
#[[2]]
# X1.9
#1 1
#2 2
#3 NA
#4 3
#5 NA
#6 4
#7 5
#8 6
#9 7
#10 8
#11 9
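The same result can be had without the explicit for loop by letting Reduce() thread each dataframe through insertRow once per index (the edited insertRow is repeated here for self-containment; as above, this assumes the indexes in each listRI element are given in increasing order):

```r
insertRow <- function(df, rowindex) {
  if (rowindex <= nrow(df)) {
    df[seq(rowindex + 1, nrow(df) + 1), ] <- df[seq(rowindex, nrow(df)), ]
    df[rowindex, ] <- rep(NA, ncol(df))
    return(df)
  }
  df[nrow(df) + 1, ] <- rep(NA, ncol(df))
  df
}

listDF <- list(data.frame(1:10), data.frame(1:9))
listRI <- list(1, c(3, 5))

# Reduce(insertRow, idx, init = df) computes insertRow(insertRow(df, idx[1]), idx[2]), ...
res <- Map(function(df, idx) Reduce(insertRow, idx, init = df), listDF, listRI)
```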

find var in a data frame and return next element in same row

Ultimately, I need to search column 1 of my data frame and, if I find a match to var, get the element next to it in the same row (the next column over).
I thought I could use match or %in% to find the index in a vector I got from the data frame, but I keep getting NA.
One example I looked at is
Is there an R function for finding the index of an element in a vector?, and I don't understand why my code is any different from the answers.
So in my code below, if I find "b" in column 1, I eventually want to get 20 from the data frame.
What am I doing wrong? I would prefer to stick to base R if possible.
> df = data.frame(names = c("a","b","c"),weight = c(10,20,30),height=c(5,10,15))
> df
names weight height
1 a 10 5
2 b 20 10
3 c 30 15
> vect = df[1]
> vect
names
1 a
2 b
3 c
> match("b", vect)
[1] NA
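The NA happens because df[1] is a one-column data.frame (a list of length 1), not a vector, so match() compares "b" against that single list element. Extracting the column with [[ or $ gives the underlying vector, and the match then works; a minimal base-R sketch:

```r
df <- data.frame(names  = c("a", "b", "c"),
                 weight = c(10, 20, 30),
                 height = c(5, 10, 15))

# df[1] is still a data.frame; df[[1]] (or df$names) is the underlying vector
i <- match("b", df[[1]])
found <- df$weight[i]   # 20
# or, generically, "the next column over" from the matched column:
also <- df[i, which(names(df) == "names") + 1]
```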

R enumerate duplicates in a dataframe with unique value

I have a dataframe containing a set of parts and test results. The parts are tested on 3 sites (North, Centre and South). Sometimes those parts are re-tested. I want to eventually create some charts that compare the results from the first time that a part was tested with the second (or third, etc.) time that it was tested, e.g. to look at tester repeatability.
As an example, I've come up with the below code. I've explicitly removed the "Experiment" column from the morley data set, as this is the column I'm effectively trying to recreate. The code works, however it seems that there must be a more elegant way to approach this problem. Any thoughts?
Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).
New example:
part <- as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site <- as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result <- c(17,20,25,51,50,49,43,45,47,52,51,56)
data <- data.frame(part, site, result)
data$index <- 1
repeat {
  if (!anyDuplicated(data[, c("part", "site", "index")])) {
    break
  }
  data$index <- ifelse(duplicated(data[, 1:2]), data$index + 1, data$index)
}
data
part site result index
1 A N 17 1
2 A C 20 1
3 A S 25 1
4 B C 51 1
5 B N 50 1
6 B S 49 1
7 A N 43 2
8 A C 45 2
9 A S 47 2
10 C N 52 1
11 C S 51 1
12 C C 56 1
Old example:
# Generate a trial data frame from the morley dataset
df <- morley[, c(2, 3)]
# Set up an iterative variable:
# create the index column and initialise to 1
df$index <- 1
# Loop through the dataframe looking for duplicate pairs of
# Runs and Indices and increment the index if it's a duplicate
repeat {
  if (!anyDuplicated(df[, c(1, 3)])) {
    break
  }
  df$index <- ifelse(duplicated(df[, c(1, 3)]), df$index + 1, df$index)
}
# Check - the below vector should all be TRUE
df$index == morley$Expt
We may use diff and cumsum on the 'Run' column to get the expected output. In this method we are not creating a column of 1s (i.e. 'index'), and we assume that the sequence in 'Run' is ordered as shown in the OP's example.
indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE
Or we can use ave
indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE
Update
Using the new example
with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
Or we can use getanID from library(splitstackshape)
library(splitstackshape)
getanID(data, c('part', 'site'))[]
I think this is a job for make.unique, with some manipulation.
index <- 1L + as.integer(sub("\\d+(\\.)?","",make.unique(as.character(morley$Run))))
index <- ifelse(is.na(index),1L,index)
identical(index,morley$Expt)
#[1] TRUE
Details of your actual data.frame may matter. However, a couple of options working with your example:
# This works if each group starts with 1:
df$index <- cumsum(df$Run == 1)
# This is maybe more general, with data.table
require(data.table)
dt <- as.data.table(df)
dt[, index := seq_along(Speed), by = Run]
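Applied to the new part/site example, the same data.table idiom gives the wanted index directly (seq_len(.N) is the per-group counter; this assumes the data.table package is available):

```r
library(data.table)

part <- c("A","A","A","B","B","B","A","A","A","C","C","C")
site <- c("N","C","S","C","N","S","N","C","S","N","S","C")
result <- c(17,20,25,51,50,49,43,45,47,52,51,56)
DT <- data.table(part, site, result)

# number the occurrences within each (part, site) group, in row order
DT[, index := seq_len(.N), by = .(part, site)]
DT$index
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
```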

R getting count of unique values in a table

I have data as below. I want to get a list of distinct values and their counts for the entire matrix. What is an efficient way to do that?
I thought about putting one column below another (concatenating columns), creating a single vector of 9 elements, and then running the table command. But I feel that there must be a better way to do this. What are my options?
sm <- matrix(c(51,43,22,"a",51,21,".",22,9),ncol=3,byrow=TRUE)
expected output
distinct value: count
51:2
43:1
22:2
a:1
21:1
.:1
9:1
The table() command works just fine across a matrix (tab is used as the variable name here to avoid masking the base function t()):
tab <- table(sm)
tab
# sm
# .  21 22 43 51 9  a
# 1  1  2  1  2  1  1
If you want to reshape the results, you can do:
cat(paste0(names(tab), ":", tab, collapse="\n"), "\n")
# .:1
# 21:1
# 22:2
# 43:1
# 51:2
# 9:1
# a:1
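If the counts should come out largest-first, sort() works directly on the table (the order among the values tied at count 1 is locale-dependent):

```r
sm <- matrix(c(51, 43, 22, "a", 51, 21, ".", 22, 9), ncol = 3, byrow = TRUE)

# tabulate across the whole matrix, then sort by decreasing count
tab <- sort(table(sm), decreasing = TRUE)
cat(paste0(names(tab), ":", tab, collapse = "\n"), "\n")
# 22:2
# 51:2
# (the five singletons follow, each with count 1)
```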

Find consecutive values with sliding window

I have a dataframe. For simplicity, I am leaving out many columns and rows:
Distance Type
1 162 A
2 27182 A
3 212 C
4 89 B
5 11 C
I need to find 6 consecutive rows in the dataframe such that the average distance is 1000, and such that the only types considered are A or B. Just for clarification: one might think to filter out all Type C rows and then proceed, but rows that were not originally consecutive would become consecutive upon filtering, and that's no good.
For example, if I filtered out rows 3 and 5 above, I would be left with 3 rows. And if I had provided more rows, that might produce a faulty result.
Maybe a solution with the data.table library?
For reproducibility, here is a data sample, based on what you wrote.
library(data.table)
# data orig (with row numbers...)
DO <- "Distance Type
1 162 A
2 27182 A
3 212 C
4 89 B
5 11 C
6 1234 A"
# data: fields separated by semicolons
DS <- gsub('[[:blank:]]+', ';', DO)
# data.frame
DF <- read.table(textConnection(DS), header = TRUE, sep = ';', stringsAsFactors = FALSE)
# data.table
DT <- as.data.table(DF)
Then, make a function that increments a counter each time the value changes, so each run of identical values gets its own group number:
# function to set sequential group number
mkGroupRep <- function(x) {
  cnt <- 1L
  grp <- 1L
  lx <- length(x)
  ne <- x[-lx] != x[-1L]  # next element not equal
  for (i in seq_along(ne)) {
    if (ne[i]) cnt <- cnt + 1L
    grp[i + 1] <- cnt
  }
  grp
}
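As an aside: since data.table is already loaded here, mkGroupRep can be replaced by the built-in rleid(), which assigns the same run-length group numbers (as integers):

```r
library(data.table)

x <- c("A", "A", "C", "B", "C", "A")
# rleid() gives each run of identical consecutive values its own id
rleid(x)
#[1] 1 1 2 3 4 5
```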
And use it with data.table's 'multiple assignment by reference':
# update data: set group number based on sequential type
DT[, grp := mkGroupRep(Type)]
# calc mean of distance and number of items in group, by group
DT[, `:=`(
  distMean = mean(Distance),
  grpLength = .N
), by = grp]
# filter what you want:
DT[Type != 'C' & distMean > 100 & grpLength == 2 | grpLength == 3]
Output :
Distance Type grp distMean grpLength
1: 162 A 1 13672 2
2: 27182 A 1 13672 2
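The grouping above handles contiguity but not the fixed window size. For the literal goal (a window of consecutive rows, all of type A or B, whose mean Distance clears a threshold), here is a base-R sliding-window sketch; the window length and threshold are toy values standing in for the real 6 and 1000:

```r
DF <- data.frame(Distance = c(162, 27182, 212, 89, 11, 1234),
                 Type     = c("A", "A", "C", "B", "C", "A"))

k      <- 2     # window length (6 in the real data)
thresh <- 100   # mean-distance threshold (1000 in the real data)
ok <- DF$Type %in% c("A", "B")

# start indexes of windows whose rows are all A/B and whose mean Distance > thresh
starts <- Filter(function(i) {
  w <- i:(i + k - 1)
  all(ok[w]) && mean(DF$Distance[w]) > thresh
}, seq_len(nrow(DF) - k + 1))
starts
#[1] 1
```

Because any window containing a Type C row is rejected outright, rows never become "consecutive" by accident the way they would after filtering.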
