How can I merge dataframes of unequal length but known chunk length? - r

I have a model where individuals can die and reproduce. I record information from the model at set intervals. I know the identity of the individuals and the iteration number I sampled from:
df1<-data.frame(
who= c(1,2,3,4,1,2,3,3,5),
iteration = c(1,1,1,1,2,2,2,3,3)
)
df1
But each of the individuals has a list of numbers associated with it that I want to track. Because each individual has more than one number associated with it, I get two data frames of unequal sample size.
df2 <- data.frame(values=c(1,1, # id = 1
1,2, # id = 2
2,1, # id = 3
0,0, # id = 4
1,1, # id = 1
1,2, # id = 2
2,1, # id = 3
2,1, # id = 3
0,0)) # id = 5
df2
I want to bind them so the 'who' variable is matched up with its value. I did the following to split the values up into the right sized chunks but now I'm stuck.
df3 <- split(df2$values, ceiling(seq_along(df2$values)/2))
I should get something out that looks like this:
who iteration value1 value2
1 1 1 1
2 1 1 2
3 1 2 1
4 1 0 0
1 2 1 1
2 2 1 2
3 2 2 1
3 3 2 1
5 3 0 0

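Here, we split the 'values' column into a list of vectors using a grouping index built with %% (odd positions go to the first vector, even positions to the second). Each list element is then padded with NA at the end, in case some elements are shorter than others, by using "length<-" to set its length to that of the longest element. Finally, the vectors are bound together as columns and attached to df1.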
lst <- split(df2$values, (seq_along(df2$values)-1) %% 2 +1)
m1 <- do.call(cbind, lapply(lst, "length<-", max(lengths(lst))))
cbind(df1, m1)
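As an alternative, here is a minimal sketch that reshapes the values with matrix() instead of split(). It assumes every individual contributes exactly `chunk` consecutive values; `chunk`, `vals` and `res` are names introduced here for illustration, and the column names value1/value2 follow the desired output in the question.

```r
df1 <- data.frame(who = c(1, 2, 3, 4, 1, 2, 3, 3, 5),
                  iteration = c(1, 1, 1, 1, 2, 2, 2, 3, 3))
df2 <- data.frame(values = c(1, 1, 1, 2, 2, 1, 0, 0, 1, 1, 1, 2, 2, 1, 2, 1, 0, 0))

chunk <- 2  # assumed number of values per individual

# Guard: the reshape only makes sense if the lengths line up
stopifnot(nrow(df2) == chunk * nrow(df1))

# Fill row by row so each row holds one individual's values
vals <- matrix(df2$values, ncol = chunk, byrow = TRUE)
colnames(vals) <- paste0("value", seq_len(chunk))

res <- cbind(df1, vals)
res
```

Filling by row keeps each individual's values together, so no NA padding is needed as long as the length check passes.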

Related

Sort vector into repeating sequence when sequential values are missing R

I would like to take a vector such as this:
x <- c(1,1,1,2,2,2,2,3,3)
and sort this vector into a repeating sequence maintaining the hierarchical order of 1, 2, 3 when values are absent.
return: c(1,2,3,1,2,3,1,2,2)
We can create the order based on the sequence of 'x'
x[order(ave(x, x, FUN = seq_along))]
#[1] 1 2 3 1 2 3 1 2 2
Or with rowid from data.table
library(data.table)
x[order(rowid(x))]
#[1] 1 2 3 1 2 3 1 2 2
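To make the intermediate step visible, here is a small sketch that stores the per-value occurrence counter before ordering. Adding x as a second sort key is an assumption on my part, so that the 1, 2, 3 order within each round also holds when the input is not already sorted; `occ` and `res` are illustrative names.

```r
x <- c(1, 1, 1, 2, 2, 2, 2, 3, 3)

# Occurrence number of each element within its own value:
# the first 1 gets 1, the second 1 gets 2, and so on
occ <- ave(x, x, FUN = seq_along)

# Sort by round (occurrence number) first, then by value within the round
res <- x[order(occ, x)]
res
```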

For loop to paste rows to create new dataframe from existing dataframe

New to SO, but can't figure out how to get this code to work. I have a dataframe that is very large, and is set up like this:
Number Year Type Amount
1 1 A 5
1 2 A 2
1 3 A 7
1 4 A 1
1 1 B 5
1 2 B 11
1 3 B 0
1 4 B 2
This continues for multiple numbers. I want to take this dataframe and make a new dataframe that has two of the rows together, but it would be nested (for example, row 1 and row 2, row 1 and row 3, row 1 and row 4, row 2 and row 3, row 2 and row 4) where each combination of each year is together within types and numbers.
Example output:
Number Year Type Amount Number Year Type Amount
1 1 A 5 1 2 A 2
1 1 A 5 1 3 A 7
1 1 A 5 1 4 A 1
1 2 A 2 1 3 A 7
1 2 A 2 1 4 A 1
1 3 A 7 1 4 A 1
I thought that I would do a for loop to loop within number and type, but I do not know how to make the rows paste from there, or how to ensure that I am only getting the combinations of the rows once. For example:
for(i in 1:n_number){
for(j in 1:n_type){
....}}
Any tips would be appreciated! I am relatively new to coding, so I don't know if I should be using a for loop at all. Thank you!
df <- data.frame(Number= rep(1,8),
Year = rep(c(1:4),2),
Type = rep(c('A','B'),each=4),
Amount=c(5,2,7,1,5,11,0,2))
My interpretation is that you want to create a dataframe with all row combinations, where Number and Type are the same and Year is different.
First suggestion - join on Number and Type, then remove rows that have the same Year. I added an index to prevent redundant matches (1 with 2 and 2 with 1).
df$index <- 1:nrow(df)
out <- merge(df,df,by=c("Number","Type"))
# keep each pair once, with the earlier row on the left as in the example output
out <- out[out$index.x < out$index.y & out$Year.x != out$Year.y, ]
Second suggestion - if you want to see a version using a loop.
out2 <- NULL
for (i in c(1:(nrow(df)-1))){
for (j in c((i+1):nrow(df))){
if (df[i,"Year"] != df[j,"Year"] && df[i,"Number"] == df[j,"Number"] && df[i,"Type"] == df[j,"Type"]) {
out2 <- rbind(out2,cbind(df[i,],df[j,]))
}
}
}
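Another possible sketch uses combn() to enumerate each pair of rows once within every Number/Type group. The helper `pair_rows` and the names `groups` and `pairs` are made up for this illustration, and the code assumes each Year appears at most once per group.

```r
df <- data.frame(Number = rep(1, 8),
                 Year = rep(1:4, 2),
                 Type = rep(c("A", "B"), each = 4),
                 Amount = c(5, 2, 7, 1, 5, 11, 0, 2))

# Hypothetical helper: all unordered row pairs within one group
pair_rows <- function(g) {
  if (nrow(g) < 2) return(NULL)   # nothing to pair up
  idx <- combn(nrow(g), 2)        # 2 x k matrix of row positions, i < j
  data.frame(g[idx[1, ], ], g[idx[2, ], ], row.names = NULL)
}

groups <- split(df, list(df$Number, df$Type), drop = TRUE)
pairs <- do.call(rbind, lapply(groups, pair_rows))
rownames(pairs) <- NULL
pairs
```

data.frame() renames the repeated columns of the right-hand row to Number.1, Year.1, Type.1 and Amount.1, so both halves of each pair remain addressable.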

for each column, calculate the difference between it and the max of the others

Let's say I have a dataframe:
x <- data.frame(a=c(1,2,3), b=c(2,3,2), c=c(4,5,1))
# a b c
#1 1 2 4
#2 2 3 5
#3 3 2 1
For each column, I would like to calculate the difference between that and the max of the other columns:
# Desired result:
# a b c
#1 -3 -2 2
#2 -3 -2 2
#3 1 -1 -2
For example, the (1,1) entry is -3 because for the first row, a = 1, and max(b,c) = 4, so 1 - 4 = -3.
Note that I don't necessarily know the number of columns in the dataframe up front, so there could be arbitrarily many columns.
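This should work on any number of columns (drop = FALSE keeps the two-column case working, where x[,-i] would otherwise collapse to a vector):
sapply(seq_along(x), function (i) {
# column i minus the row-wise max of all other columns
x[,i] - do.call(pmax, x[, -i, drop = FALSE])
})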
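If you want a dplyr solution with a bit of row/column indexing, you can use transmute to generate a new data frame, or mutate to add to your existing dataframe. Note that the maximum has to be taken row-wise with pmax; max would return a single value over all rows.
library(dplyr)
x <- data.frame(a=c(1,2,3), b=c(2,3,2), c=c(4,5,1))
x %>% transmute(a = a - do.call(pmax, x[,-1]),
b = b - do.call(pmax, x[,-2]),
c = c - do.call(pmax, x[,-3]))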
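If the column names should be kept and the number of columns is not known in advance, one possible sketch works on a matrix copy of the data. It assumes all columns are numeric; `m` and `res` are illustrative names.

```r
x <- data.frame(a = c(1, 2, 3), b = c(2, 3, 2), c = c(4, 5, 1))
m <- as.matrix(x)

# For each column: its values minus the row-wise max of the other columns
res <- sapply(colnames(m), function(nm) {
  others <- m[, colnames(m) != nm, drop = FALSE]
  m[, nm] - apply(others, 1, max)
})
res
```

Iterating over the column names rather than positions means sapply() labels the result columns a, b and c.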

remove duplicate row based only of previous row

I'm trying to remove duplicate rows from a data frame, based only on the previous row. The duplicated and unique functions will remove all duplicates, leaving you only with unique rows, which is not what I want.
I've illustrated the problem here with a loop. I need to vectorize this because my actual data set is much too large to use a loop on.
x <- c(1,1,1,1,3,3,3,4)
y <- c(1,1,1,1,3,3,3,4)
z <- c(1,2,1,1,3,2,2,4)
xy <- data.frame(x,y,z)
xy
x y z
1 1 1 1
2 1 1 2
3 1 1 1
4 1 1 1 #this should be removed
5 3 3 3
6 3 3 2
7 3 3 2 #this should be removed
8 4 4 4
# loop that produces desired output
toRemove <- NULL
for (i in 2:nrow(xy)){
test <- as.vector(xy[i,] == xy[i-1,])
if (!(FALSE %in% test)){
toRemove <- c(toRemove, i) #build a vector of rows to remove
}
}
xy[-toRemove,] #exclude rows
x y z
1 1 1 1
2 1 1 2
3 1 1 1
5 3 3 3
6 3 3 2
8 4 4 4
I've tried using dplyr's lag function, but it only works on single columns, when I try to run it over all 3 columns it doesn't work.
ifelse(xy[,1:3] == lag(xy[,1:3],1), NA, xy[,1:3])
Any advice on how to accomplish this?
Looks like we want to remove if the row is same as above:
# make an index, if cols not same as above
ix <- c(TRUE, rowSums(tail(xy, -1) == head(xy, -1)) != ncol(xy))
# filter
xy[ix, ]
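The same idea can be sketched with explicit negative indexing, which makes it clear that row i is compared with row i - 1. This assumes the data contain no NA values, since a comparison involving NA would yield NA; `same_as_prev` and `res` are names introduced here.

```r
xy <- data.frame(x = c(1, 1, 1, 1, 3, 3, 3, 4),
                 y = c(1, 1, 1, 1, 3, 3, 3, 4),
                 z = c(1, 2, 1, 1, 3, 2, 2, 4))
n <- nrow(xy)

# Rows 2..n compared cell by cell with rows 1..(n-1);
# a row repeats its predecessor when every column matches
same_as_prev <- c(FALSE,
                  rowSums(xy[-1, , drop = FALSE] == xy[-n, , drop = FALSE]) == ncol(xy))

res <- xy[!same_as_prev, ]
res
```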
Why don't you just iterate over the list while keeping track of the previous row, so you can compare it to the next row?
If they match at some point, remember that row position, remove the row from the list, then start iterating again from the beginning of the list.
Don't delete a row while iterating, because you will get a concurrent modification error.

R grouping with Select the rows with Limited

The sample data frame
grp = c(1,1,1, 1,1,2,2,2,2,2, 2,2)
val = c(2,1,5,NA,3,NA,1)
dta = data.frame(grp=grp, val=val)
The results should look like this:
# The max number of count is 3
grp count
1 3
1 2
2 3
2 3
2 1
Here's a way with base R. We first count the run lengths with rle. Then we use a custom function that repeats 3 for every full chunk and appends the remainder of the division when it is nonzero. Finally we combine the pieces to form a new data frame:
grp = c(1,1,1, 1,1,2,2,2,2,2,2,2)
# chunk sizes for a run of length x: full chunks of 3, plus any leftover
fun3 <- function(x) c(rep(3, x %/% 3), if (x %% 3 > 0) x %% 3)
runs <- rle(grp)
ans <- lapply(runs$lengths, fun3)
cbind.data.frame(grp=rep(runs$values, lengths(ans)), count=unlist(ans))
# grp count
# 1 1 3
# 2 1 2
# 3 2 3
# 4 2 3
# 5 2 1
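A different sketch assigns every row a chunk number within its group and then counts rows per chunk. Unlike rle, this counts per group value rather than per consecutive run, which is an assumption about what is wanted; `cap`, `pos`, `chunk` and `res` are illustrative names.

```r
grp <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)
cap <- 3  # assumed maximum count per output row

# Position of each row within its group: 1, 2, 3, ...
pos <- ave(grp, grp, FUN = seq_along)

# Rows 1..3 fall in chunk 0, rows 4..6 in chunk 1, and so on
chunk <- (pos - 1) %/% cap

# Count the rows in every (group, chunk) combination
res <- aggregate(list(count = grp), by = list(grp = grp, chunk = chunk), FUN = length)
res <- res[order(res$grp, res$chunk), c("grp", "count")]
rownames(res) <- NULL
res
```

Changing `cap` is enough to use a different maximum count.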
