Swap rows, two rows at a time - r

Is it possible to swap two adjacent rows with each other, then move on to the next two rows and swap those as well? i.e. swap the col1 value in row 1 with the col1 value in row 2; swap the col1700 value in row 87 with the col1700 value in row 88.
sample data:
col1 col2
row1 a b
row2 b b
row3 c a
row4 d c
My real data has many rows and columns and the data changes each time I go through a loop, so I need a way where I don't refer to specific column names and row names.
The desired result would look like:
col1 col2
row1 b b
row2 a b
row3 d c
row4 c a

Add 1 to the index of the first row in each pair and subtract 1 from the second:
dat[seq_len(nrow(dat)) + c(1,-1),]
# col1 col2
#row2 b b
#row1 a b
#row4 d c
#row3 c a
This works because of the vector recycling in R:
1:10 + c(1,-1)
#[1] 2 1 4 3 6 5 8 7 10 9
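Note this indexing assumes an even number of rows; with an odd count the last position runs past the data and pulls in an NA row. A small guard (a sketch; the helper name swap_pairs is made up here):
swap_pairs <- function(d) {
  # pairwise row swap: 2,1,4,3,...; rep_len avoids a recycling warning
  idx <- seq_len(nrow(d)) + rep_len(c(1, -1), nrow(d))
  if (nrow(d) %% 2 == 1) idx[nrow(d)] <- nrow(d)  # leave a trailing odd row in place
  d[idx, ]
}
swap_pairs(dat)  # same result as dat[seq_len(nrow(dat)) + c(1,-1),] when nrow is even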

Another way is to create two sequences, one of even and one of odd numbers, combine them so they alternate, and use the result as row indices.
df[c(rbind(seq(2, nrow(df), 2), seq(1, nrow(df), 2))),]
# col1 col2
#row2 b b
#row1 a b
#row4 d c
#row3 c a
where
seq(2, nrow(df), 2)
#[1] 2 4
generates the even-numbered sequence and
seq(1, nrow(df), 2)
#[1] 1 3
generates the odd-numbered sequence.
We then use rbind and c to interleave the indices from the two vectors.
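For the 4-row example this interleaving gives:
c(rbind(seq(2, nrow(df), 2), seq(1, nrow(df), 2)))
#[1] 2 1 4 3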


select row based on value of another row in R

EDIT: to make myself clear, I know how to select individual rows and I know there are many different ways of doing it. I want code that works no matter what the actual values in the rows are, so it scales to a larger dataframe without me changing the code based on its content. So instead of saying "select row 1, then row 3", it should say "select row 1, then the row whose number is the value in row 1, column Z, then the row whose number is the value in column Z of the row just selected", and so on. My question is how to tell R to read that value as a row number.
I'm trying to figure out how to select and save a row based on a value in another row, so that I get a new df with row 1 (a, A), then go to row 3 and save it (c, C), then go to row 2, etc.
X Y Z
a A 3
b B 5
c C 2
d D 1
e E NA
Knowing the row number, I can use rbind which will give me the following
rbind(df[1, ], df[3, ])
a A 3
c C 2
But I want R to extract the number 3 from the column itself rather than me explicitly telling it which row to pick - how do I do that?
Thanks
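For reference, the example data as a data frame (a sketch; exact column types are assumed):
df <- data.frame(X = letters[1:5],
                 Y = LETTERS[1:5],
                 Z = c(3L, 5L, 2L, 1L, NA))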
You can use a while loop to keep selecting rows until an NA occurs or all the rows in the dataframe have been selected.
all_rows <- 1                      # start from row 1
next_row <- df$Z[all_rows]         # row number stored in column Z
while (!is.na(next_row) && length(all_rows) < nrow(df)) {
  all_rows <- c(all_rows, next_row)
  next_row <- df$Z[next_row]       # follow the chain: Z of the row just selected
}
result <- df[all_rows, ]
# X Y Z
#1 a A 3
#3 c C 2
#2 b B 5
#5 e E NA
If you already know which values of the key column identify the rows you want, you can use:
df <- read.table(textConnection('X Y Z
a A 3
b B 5
c C 2
d D 1
e E NA'),
header=T)
desired_rows <- c('a','c')
df2 <- df[df$X %in% desired_rows,]
df2
Output:
X Y Z
<fct> <fct> <int>
1 a A 3
2 c C 2

R - Insert rows of another dataframe after every row of a dataframe?

I have 2 dataframes, mydata1 and mydata2, both 222x80.
I want to create a new dataframe in which every row of mydata1 is followed by the row with the same index from mydata2.
I tried with the transform function, but the output just duplicates rows of the same dataframe.
I can't simply overwrite all the column values.
If someone has a suggestion, thank you! My attempt:
insert.mydataFeat <- transform(mydata1, colnames(mydata1)=colnames(mydata2))
out.mydataFeat <- rbind(mydata1, insert.mydataFeat)
#reorder the rows
n <- nrow(mydata1)
out.mydataFeat<-out.mydataFeat[kronecker(1:n, c(0, n), "+"), ]
out.mydataFeat
You can use an indexing trick after combining the data with rbind.
mydata1 <- data.frame(col1 = 1:5, col2 = 'A')
mydata2 <- data.frame(col1 = 6:10, col2 = 'B')
combine_df <- rbind(mydata1, mydata2)
combine_df <- combine_df[rbind(1:(nrow(combine_df)/2),
                               ((nrow(combine_df)/2) + 1):nrow(combine_df)), ]
# col1 col2
#1 1 A
#6 6 B
#2 2 A
#7 7 B
#3 3 A
#8 8 B
#4 4 A
#9 9 B
#5 5 A
#10 10 B
where
rbind(1:(nrow(combine_df)/2), ((nrow(combine_df)/2) +1):nrow(combine_df))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 1 2 3 4 5
#[2,] 6 7 8 9 10
The above creates a two-row matrix whose first row holds the row numbers from the first dataframe and whose second row holds the row numbers from the second; reading it column by column gives the interleaved order used to subset combine_df.
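As a side note, the same interleaving can be written in one step with order() (a sketch on the same toy data, relying on both data frames having the same number of rows):
# stack the two data frames, then reorder so row i of mydata1
# is immediately followed by row i of mydata2
rbind(mydata1, mydata2)[order(rep(seq_len(nrow(mydata1)), 2)), ]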

Add list of columns above a certain threshold

Say I have a dataframe:
df <- data.frame(rbind(c(10,1,5,4), c(6,0,3,10), c(7,1,10,10)))
colnames(df) <- c("a", "b", "c", "d")
df
a b c d
10 1 5 4
6 0 3 10
7 1 10 10
And a vector of numbers (which correspond to the four column names a,b,c,d)
threshold <- c(7,1,5,8)
I need to compare each row in the data frame to the vector. When the value in the data frame meets or exceeds that in the vector, I need to return the column name. The output would be:
a b c d cols
10 1 5 4 a,b,c #10>=7, 1>=1, 5>=5
6 0 3 10 d #10>=8
7 1 10 10 a,b,c,d #7>=7, 1>=1, 10>=5, 10>=8
The column cols can be a string that simply lists the columns where the value is exceeded.
Is there any clever way to do this? I'm migrating an old Excel function and I can write a loop or something, but I thought there almost had to be a better way.
You do not need which; toString() produces the comma-separated values the desired output asks for (this answer assumes an extra id column in the first position, hence df[-1]):
df$cols <- apply(df[-1], 1, function(x) toString(names(df)[-1][x >= threshold]))
df
id a b c d cols
1 aa 10 1 5 4 a, b, c
2 bb 6 0 3 10 d
3 cc 7 1 10 10 a, b, c, d
We can also try
i1 <- which(df >=threshold[col(df)], arr.ind=TRUE)
df$cols <- unname(tapply(names(df)[i1[,2]], i1[,1], toString))
df$cols
#[1] "a, b, c" "d" "a, b, c, d"
You can also try this (it again assumes an id column in position 1, so the value columns are 2:5; note it stores the matching column names as a list per row rather than a single string):
df$cols <- apply(df[, 2:5], 1, function(x) names(df[, 2:5])[which(x >= threshold)])
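Another base-R sketch (assuming df holds only the four numeric columns a:d, as in the question): build the logical comparison matrix explicitly, then collapse the matching names per row.
hits <- as.matrix(df[c("a", "b", "c", "d")]) >=
  matrix(threshold, nrow(df), length(threshold), byrow = TRUE)
df$cols <- apply(hits, 1, function(r) toString(colnames(hits)[r]))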

Find the indices of rows in a data frame that contain an element of a string vector

If I have a data.frame like this
df <- data.frame(col1 = c(letters[1:4],"a"),col2 = 1:5,col3 = letters[10:14])
df
col1 col2 col3
1 a 1 j
2 b 2 k
3 c 3 l
4 d 4 m
5 a 5 n
I want to get the indices of the rows that contain one of the elements in c("a", "k", "n"); in this example, the result should be 1, 2, 5.
If you have a large data frame and you wish to check all columns, try this
x <- c("a", "k", "n")
Reduce(union, lapply(x, function(a) which(rowSums(df == a) > 0)))
# [1] 1 5 2
and of course you can sort the end result.
s <- c('a','k','n');
which(df$col1%in%s|df$col3%in%s);
## [1] 1 2 5
Here's another solution. This one works on the entire data.frame and happens to capture the search strings as element names (you can get rid of those via unname()); note the trailing [1] keeps only the first matching row for each search string:
sapply(s,function(s) which(apply(df==s,1,any))[1]);
## a k n
## 1 2 5
A second solution:
sort(unique(rep(1:nrow(df),ncol(df))[as.matrix(df)%in%s]));
## [1] 1 2 5
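A further sketch that scans every column at once (apply coerces df to a character matrix, which is fine for this lookup):
s <- c("a", "k", "n")
unname(which(apply(df, 1, function(r) any(r %in% s))))
#[1] 1 2 5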

R - using values in row to determine what column to select to fill a field

For example:
I have a data frame named table:
Cn c1 c2 c3 c4
c3 1 3 5 6
c2 4 6 7 9
I want to create a new column whose value in each row is taken from the column named in Cn, so it'll look like:
Cn c1 c2 c3 c4 NewCol
c3 1 3 5 6 5
c2 4 6 7 9 6
My attempt was table$NewCol <- table[, table$Cn]
However, instead of returning one value per row, table$NewCol[1] ends up holding a vector (5, 3), corresponding to (c3, c2) in the Cn column; in other words, for each row every entry of Cn is looked up and stored in the new variable.
I know I can use loops, but I'm dealing with a 7 million+ record data frame, and looping is very slow.
Anyone have any ideas how to deal with this?
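For reference, the example built as a data frame (a sketch; Cn as character is assumed, and the answers below call the object d or dat rather than table):
d <- data.frame(Cn = c("c3", "c2"),
                c1 = c(1, 4), c2 = c(3, 6),
                c3 = c(5, 7), c4 = c(6, 9),
                stringsAsFactors = FALSE)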
Use mapply to apply [.data.frame as you move along each row and the corresponding entry of d$Cn:
d$NewCol <- mapply(i = seq_along(d[['Cn']]),
                   j = as.character(d[['Cn']]),  # as.character() guards against Cn being a factor
                   FUN = function(i, j, x) x[i, j, drop = TRUE],
                   MoreArgs = list(x = d))
If speed and efficiency are a concern, use data.table and set() (this loop is efficient because set() updates the table by reference):
library(data.table)
setDT(d)
for (i in seq_len(nrow(d))) {
  set(d, j = 'newCol', i = i, value = d[[d[['Cn']][i]]][i])
}
Use matrix indexing of the desired row and column values to extract.
I used dat as your data.frame name.
dat[-1][cbind(seq_along(dat$Cn),match(as.character(dat$Cn),names(dat[-1])))]
#[1] 5 6
As in:
sel <- cbind(seq_along(dat$Cn),match(as.character(dat$Cn),names(dat[-1])))
sel
# row col
# [,1] [,2]
#[1,] 1 3
#[2,] 2 2
dat[-1][sel]
#[1] 5 6
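Assigning it back gives the NewCol asked for above:
dat$NewCol <- dat[-1][sel]
dat
#  Cn c1 c2 c3 c4 NewCol
#1 c3  1  3  5  6      5
#2 c2  4  6  7  9      6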
Timing on 7M rows and your 4 column example is about 0.4 seconds.
dat2 <- dat[sample(1:2,7e6,replace=TRUE),]
nrow(dat2)
#[1] 7000000
system.time({
sel <- cbind(seq_along(dat2$Cn),match(as.character(dat2$Cn),names(dat2[-1])))
dat2$newcol <- dat2[-1][sel]
})
# user system elapsed
# 0.33 0.07 0.39
