input sequential numbers without specific end in a data frame's column in r - r

I would like to give a sequence of numbers to a new column to a data frame. But this sequence will repeat several times based on a value in another column. (i.e It starts from 1 until that specific value will be changed to other value).
My problem is how to define the ending point for each sequence in r.
A part of my data frame with the column "V2" which I intend to add:
V1 V2(new added column with sequential numbers)
12 1
12 2
12 3
12 4
12 5
13 1
13 2
13 3
13 4
13 5
13 6
14 1
14 2
14 3
14 4
I tried to use the following code, which was not working!
count <- table(df$V1)
c <- as.integer(names(count)[df$V1==12])
repeat{
df$V2<- seq(1,c, by=1)
if(df$V1!=12){
break
}
}

It sounds like you might be looking for rle since you're interested in any time the "V1" variable changes.
Try the following:
> sequence(rle(df$V1)$lengths)
[1] 1 2 3 4 5 1 2 3 4 5 6 1 2 3 4

rle is a very good solution but you could also have used ave:
tab$V2 <- ave(tab$V1, tab$V1, FUN=seq_along)
hth

Well Ananda beats my effort:
vec = numeric(0)
for(i in unique(df$V1)){
n = length(df$V1[df$V1 == i])
vec = c(vec, 1:n)
}

Related

For Loop Adding Extra Rows to The Data Frame

Hello I am very new to the programming world and data science as well, and I am trying to work my way through it.
I am trying to assign values to the column in a data frame and using for loop such that the data frame is divided into ten groups and every row in every group is assigned a rank, such that row 1 to 10 is assigned as rank 1 and row 11 to 20 is assigned as rank 2 and so on. The original dimension of subset data set is 100 * 6
My data frame looks like
Data Frame
The codes I have written are:
x <- round(nrow(subset) / 10)
a=1
for(j in 1:10){
for(i in a:x){
subset[i, "rank"] = j
}
j = j + 1
a = x + 1
x = x * j
}
However, the loop runs infinitely and keeps on adding additional rows to the data frame. I had to manually stop the loop and the resulting dimension of the subset data frame was 17926 * 6.
Please help me understand where am I going wrong in writing the loop.
P.S. subset is a data frame name and not the subset function in R
Thanks in Advance !!
It might be better for you to start working with vectorized calculations instead of loops. This will help you in the future.
For example:
df <- data.frame(x = 1:100)
df$rank <- (df$x-1)%/%10 + 1
df
results in:
x rank
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 2
12 12 2
13 13 2
14 14 2
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 3
22 22 3
23 23 3
24 24 3
25 25 3
How about something like this:
subset$Rank <- ceiling(as.numeric(rownames(subset))/10)
The as.numeric converts the rowname into a number, dividing it by 10 and rounding up should give you what you need? Let me know if I've misunderstood.

Using two grouping designations to create one 'combined' grouping variable

Given a data.frame:
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))
#> df
# grp1 grp2
#1 1 1
#2 1 2
#3 1 3
#4 2 3
#5 2 4
#6 2 5
#7 3 6
#8 3 7
#9 3 8
#10 4 6
#11 4 9
#12 4 10
Both coluns are grouping variables, such that all 1's in column grp1 are known to be grouped together, and so on with all 2's, etc. Then the same goes for grp2. All 1's are known to be the same, all 2's the same.
Thus, if we look at the 3rd and 4th row, based on column 1 we know that the first 3 rows can be grouped together and the second 3 rows can be grouped together. Then since rows 3 and 4 share the same grp2 value, we know that all 6 rows, in fact, can be grouped together.
Based off the same logic we can see that the last six rows can also be grouped together (since rows 7 and 10 share the same grp2).
Aside from writing a fairly involved set of for() loops, is there a more straight forward approach to this? I haven't been able to think one one yet.
The final output that I'm hoping to obtain would look something like:
# > df
# grp1 grp2 combinedGrp
# 1 1 1 1
# 2 1 2 1
# 3 1 3 1
# 4 2 3 1
# 5 2 4 1
# 6 2 5 1
# 7 3 6 2
# 8 3 7 2
# 9 3 8 2
# 10 4 6 2
# 11 4 9 2
# 12 4 10 2
Thank you for any direction on this topic!
I would define a graph and label nodes according to connected components:
gmap = unique(stack(df))
gmap$node = seq_len(nrow(gmap))
oldcols = unique(gmap$ind)
newcols = paste0("node_", oldcols)
df[ newcols ] = lapply(oldcols, function(i) with(gmap[gmap$ind == i, ],
node[ match(df[[i]], values) ]
))
library(igraph)
g = graph_from_edgelist(cbind(df$node_grp1, df$node_grp2), directed = FALSE)
gmap$group = components(g)$membership
df$group = gmap$group[ match(df$node_grp1, gmap$node) ]
grp1 grp2 node_grp1 node_grp2 group
1 1 1 1 5 1
2 1 2 1 6 1
3 1 3 1 7 1
4 2 3 2 7 1
5 2 4 2 8 1
6 2 5 2 9 1
7 3 6 3 10 2
8 3 7 3 11 2
9 3 8 3 12 2
10 4 6 4 10 2
11 4 9 4 13 2
12 4 10 4 14 2
Each unique element of grp1 or grp2 is a node and each row of df is an edge.
One way to do this is via a matrix that defines links between rows based on group membership.
This approach is related to #Frank's graph answer but uses an adjacency matrix rather than using edges to define the graph. An advantage of this approach is it can deal immediately with many > 2 grouping columns with the same code. (So long as you write the function that determines links flexibly.) A disadvantage is you need to make all pair-wise comparisons between rows to construct the matrix, so for very long vectors it could be slow. As is, #Frank's answer would work better for very long data, or if you only ever have two columns.
The steps are
compare rows based on groups and define these rows as linked (i.e., create a graph)
determine connected components of the graph defined by the links in 1.
You could do 2 a few ways. Below I show a brute force way where you 2a) collapse links, till reaching a stable link structure using matrix multiplication and 2b) convert the link structure to a factor using hclust and cutree. You could also use igraph::clusters on a graph created from the matrix.
1. construct an adjacency matrix (matrix of pairwise links) between rows
(i.e., if they in the same group, the matrix entry is 1, otherwise it's 0). First making a helper function that determines whether two rows are linked
linked_rows <- function(data){
## helper function
## returns a _function_ to compare two rows of data
## based on group membership.
## Use Vectorize so it works even on vectors of indices
Vectorize(function(i, j) {
## numeric: 1= i and j have overlapping group membership
common <- vapply(names(data), function(name)
data[i, name] == data[j, name],
FUN.VALUE=FALSE)
as.numeric(any(common))
})
}
which I use in outer to construct a matrix,
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
2a. collapse 2-degree links to 1-degree links. That is, if rows are linked by an intermediate node but not directly linked, lump them in the same group by defining a link between them.
One iteration involves: i) matrix multiply to get the square of A, and
ii) set any non-zero entry in the squared matrix to 1 (as if it were a first degree, pairwise link)
## define as a function to use below
lump_links <- function(A) {
A <- A %*% A
A[A > 0] <- 1
A
}
repeat this till the links are stable
oldA <- 0
i <- 0
while (any(oldA != A)) {
oldA <- A
A <- lump_links(A)
}
2b. Use the stable link structure in A to define groups (connected components of the graph). You could do this a variety of ways.
One way, is to first define a distance object, then use hclust and cutree. If you think about it, we want to define linked (A[i,j] == 1) as distance 0. So the steps are a) define linked as distance 0 in a dist object, b) construct a tree from the dist object, c) cut the tree at zero height (i.e., zero distance):
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
In practice you can encode steps 1 - 2 in a single function that uses the helper lump_links and linked_rows:
lump <- function(df) {
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
oldA <- 0
while (any(oldA != A)) {
oldA <- A
A <- lump_links(A)
}
df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
}
This works for the original df and also for the structure in #rawr's answer
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))
lump(df)
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
13 5 11 1
14 5 3 1
15 6 12 3
16 7 3 1
17 8 6 2
18 9 12 3
PS
Here's a version using igraph, which makes the connection with #Frank's answer more clear:
lump2 <- function(df) {
rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df))
cluster_A <- igraph::clusters(igraph::graph.adjacency(A))
df$combinedGrp <- cluster_A$membership
df
}
Hope this solution helps you a bit:
Assumption: df is ordered on the basis of grp1.
## split dataset using values of grp1
split_df <- split.default(df$grp2,df$grp1)
parent <- vector('integer',length(split_df))
## find out which combinations have values of grp2 in common
for (i in seq(1,length(split_df)-1)){
for (j in seq(i+1,length(split_df))){
inter <- intersect(split_df[[i]],split_df[[j]])
if (length(inter) > 0){
parent[j] <- i
}
}
}
ans <- vector('list',length(split_df))
index <- which(parent == 0)
## index contains indices of elements that have no element common
for (i in seq_along(index)){
ans[[index[i]]] <- rep(i,length(split_df[[i]]))
}
rest_index <- seq(1,length(split_df))[-index]
for (i in rest_index){
val <- ans[[parent[i]]][1]
ans[[i]] <- rep(val,length(split_df[[i]]))
}
df$combinedGrp <- unlist(ans)
df
grp1 grp2 combinedGrp
1 1 1 1
2 1 2 1
3 1 3 1
4 2 3 1
5 2 4 1
6 2 5 1
7 3 6 2
8 3 7 2
9 3 8 2
10 4 6 2
11 4 9 2
12 4 10 2
Based on https://stackoverflow.com/a/35773701/2152245, I used a different implementation of igraph because I already had an adjacency matrix of sf polygons from st_intersects():
library(igraph)
library(sf)
# Use example data
nc <- st_read(system.file("shape/nc.shp", package="sf"))
nc <- nc[-sample(1:nrow(nc),nrow(nc)*.75),] #drop some polygons
# Find intersetions
b <- st_intersects(nc, sparse = F)
g <- graph.adjacency(b)
clu <- components(g)
gr <- groups(clu)
# Quick loop to assign the groups
for(i in 1:nrow(nc)){
for(j in 1:length(gr)){
if(i %in% gr[[j]]){
nc[i,'group'] <- j
}
}
}
# Make a new sfc object
nc_un <- group_by(nc, group) %>%
summarize(BIR74 = mean(BIR74), do_union = TRUE)
plot(nc_un['BIR74'])

Convert a full length column to one variable in a row in R

I was wondering if it is possible to convert 1 column into 1 variable next to eachother
i.e.:
d <- data.frame(y = 1:10)
> d
y
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
Convert this column into:
> d
1 2 3 4 5 6 7 8 9 10
We don't know how are you going to use the numbers, but I think it is unnecessary to make any transformation. You can use d$y to get the numbers applied to any map of colors. See for example.
d <- data.frame(y = 1:7)
library(RColorBrewer)
mypalette<-brewer.pal(4,"Greens")
mycol <-palette()#rainbow(7)
heatmap(matrix(1:28,ncol=4),col=mypalette[d$y[1:4]],xlab="Greens (sequential)",
ylab="",xaxt="n",yaxt="n",bty="n",RowSideColors=mycol[d$y])
Not sure what is the prupose of:
1 variable next to eachother
But there are few ways to get the desired result (again, depends on the objective). You can do either:
d$y
unname(unlist(d)) #suggested by agstudy
or, better yet, to convert your dataframe's column into a vector, do this:
v <- as.vector(d[,1])
as string:
args <- paste(d$y, sep=" ")
args<-noquote(args)
now you'll have
[1] 1 2 3 4 5 6 7 8 9 10

recursive replacement in R

I am trying to clean some data and would like to replace zeros with values from the previous date. I was hoping the following code works but it doesn't
temp = c(1,2,4,5,0,0,6,7)
temp[which(temp==0)]=temp[which(temp==0)-1]
returns
1 2 4 5 5 0 6 7
instead of
1 2 4 5 5 5 6 7
Which I was hoping for.
Is there a nice way of doing this without looping?
The operation is called "Last Observation Carried Forward" and usually used to fill data gaps. It's a common operation for time series and thus implemented in package zoo:
temp = c(1,2,4,5,0,0,6,7)
temp[temp==0] <- NA
library(zoo)
na.locf(temp)
#[1] 1 2 4 5 5 5 6 7
You could use essentially your same logic except you'll want to apply it to the values vector that results from using rle
temp = c(1,2,4,5,0,0,6,0)
o <- rle(temp)
o$values[o$values == 0] <- o$values[which(o$values == 0) - 1]
inverse.rle(o)
#[1] 1 2 4 5 5 5 6 6

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result will have to live in two columns, as columns in data frames only support simple data types. (Unless you want to resort to complex numbers.) head with a negative argument cuts the negated value of the argument from the tail, try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it's not there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Resources