To count how many times one row is equal to a value

I have a df here:
df <- data.frame('v1'=c(1,2,3,4,5),
'v2'=c(1,2,1,1,2),
'v3'=c(NA,2,1,4,'1'),
'v4'=c(1,2,3,NaN,5),
'logical'=c(1,2,3,4,5))
I would like to know, for each row, how many times its values equal the value of the variable 'logical', and store that in a new variable 'count'.
I wrote a for loop like this:
attach(df)
df$count <- 0
for(i in colnames(v1:v4)){
if(df$logical == i){
df$count <- df$count+1}
}
but it doesn't work; the new variable 'count' is still all 0.
Please help me fix it.
The desired result should look like this:
df <- data.frame('v1'=c(1,2,3,4,5),
'v2'=c(1,2,1,1,2),
'v3'=c(NA,2,1,4,'1'),
'v4'=c(1,2,3,NaN,5),
'logical'=c(1,2,3,4,5),
'count'=c(3,4,2,2,2))
Many thanks from a beginner.

We can use rowSums after creating a logical matrix
df$count <- rowSums(df[1:4] == df$logical, na.rm = TRUE)
df$count
#[1] 3 4 2 2 2
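For reference (an intermediate step, not part of the answer itself), the logical matrix being summed should look roughly like this; rowSums() counts the TRUEs in each row and na.rm = TRUE skips the NA values:
df[1:4] == df$logical
#      v1    v2    v3    v4
# 1  TRUE  TRUE    NA  TRUE
# 2  TRUE  TRUE  TRUE  TRUE
# 3  TRUE FALSE FALSE  TRUE
# 4  TRUE FALSE  TRUE    NA
# 5  TRUE FALSE FALSE  TRUE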

Personally, I think the solution by @akrun above is the most elegant and efficient way to add the count column.
Another way to "attach" the count column to the end of df (though I am not sure it has the elegance you are looking for) is to use within, i.e.,
df <- within(df, count <- rowSums(df[1:4] == logical, na.rm = TRUE))
such that you will get
> df
  v1 v2   v3  v4 logical count
1  1  1 <NA>   1       1     3
2  2  2    2   2       2     4
3  3  1    1   3       3     2
4  4  1    4 NaN       4     2
5  5  2    1   5       5     2

R: loop matrix sort columns individually for specific rows

I want to sort my Matrix (U) columnwise for the rows, which have the same name. My (very large) matrix looks similar to this:
   1  2
1  5  6
1 -4  4
1  6 -2
2  7 -2
2 -2  3
Now I want to loop through the matrix looking for rows with the same row name and then sort each column within those rows, resulting in this matrix:
   1  2
1 -4 -2
1  5  4
1  6  6
2 -2 -2
2  7  3
My code so far looks like this.
The first step was the row count, which works:
z <- 1
for(i in (1:nrow(U))){
if(row.names(U)[i] != row.names(U)[i-1]){
z = (sum(row.names(U) == row.names(U)[i]))+1}}
Next I wanted to add a sorting step after the row count, and I tried this for the first set of rows manually:
x <- 1
for(x in (1:ncol(U))){
U[1:3, x] <- U[do.call(order, lapply(x:NCOL(U), function(x) U[1:3, x])), x]
}
However, this loop is very slow, and it also only fills in the first column correctly.
Do you have a recommendation for how I could improve the sorting while taking the performance issues into account?
EDIT: I guess this was confusing in my first version. The first "column" of my matrix holds the row names, so in this example I have a 5x2 matrix.
Here's an approach which just uses order() first by row name, then by each column in turn. Is this what you're after?
U <- matrix(c(5,6,-4,4,6,-2,7,-2,-2,3), byrow=TRUE, ncol=2, dimnames=list(c(1,1,1,2,2), c(1,2)))
apply(U, 2, function(j) j[order(rownames(U), j)])
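On the example matrix this should reproduce the target shown in the question, i.e. each column sorted within the blocks of identical row names:
#    1  2
# 1 -4 -2
# 1  5  4
# 1  6  6
# 2 -2 -2
# 2  7  3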
We can use data.table: convert to a data.table, group by the first column ('U'), then loop through the remaining columns and sort each one within the group.
library(data.table)
as.data.table(m1)[, lapply(.SD, sort), by = U]
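Note that m1 is not defined above; assuming it is meant to be the matrix from the question with the grouping values stored as a regular first column named 'U', it could be built like this (a hypothetical setup, reusing the U matrix defined in the previous answer):
library(data.table)
m1 <- cbind(U = as.numeric(rownames(U)), U)  # grouping values become column 'U'
as.data.table(m1)[, lapply(.SD, sort), by = U]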
An alternative using dplyr
df = read.table(textConnection("U 1 2
1 5 6
1 -4 4
1 6 -2
2 7 -2
2 -2 3"), header= TRUE)
library(dplyr)
df %>% group_by(U) %>% transmute(sort(X1),sort(X2))
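With more recent dplyr versions (1.0 or later), the same idea can be written with across(); a sketch of that variant, reusing df and the dplyr attach from above:
df %>% group_by(U) %>% mutate(across(everything(), sort)) %>% ungroup()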

Remove duplicate rows based only on the previous row

I'm trying to remove duplicate rows from a data frame, based only on the previous row. The duplicated and unique functions remove all duplicates, leaving you only with unique rows, which is not what I want.
I've illustrated the problem here with a loop. I need to vectorize this because my actual data set is much too large to use a loop on.
x <- c(1,1,1,1,3,3,3,4)
y <- c(1,1,1,1,3,3,3,4)
z <- c(1,2,1,1,3,2,2,4)
xy <- data.frame(x,y,z)
xy
  x y z
1 1 1 1
2 1 1 2
3 1 1 1
4 1 1 1   # this should be removed
5 3 3 3
6 3 3 2
7 3 3 2   # this should be removed
8 4 4 4
# loop that produces desired output
toRemove <- NULL
for (i in 2:nrow(xy)){
test <- as.vector(xy[i,] == xy[i-1,])
if (!(FALSE %in% test)){
toRemove <- c(toRemove, i) #build a vector of rows to remove
}
}
xy[-toRemove,] #exclude rows
  x y z
1 1 1 1
2 1 1 2
3 1 1 1
5 3 3 3
6 3 3 2
8 4 4 4
I've tried using dplyr's lag function, but it only works on single columns; when I try to run it over all 3 columns it doesn't work.
ifelse(xy[,1:3] == lag(xy[,1:3],1), NA, xy[,1:3])
Any advice on how to accomplish this?
Looks like we want to remove a row if it is the same as the row above:
# build an index: TRUE where a row differs from the previous one (the first row is always kept)
ix <- c(TRUE, rowSums(tail(xy, -1) == head(xy, -1)) != ncol(xy))
# filter
xy[ix, ]
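On the example data this keeps rows 1, 2, 3, 5, 6 and 8, the same result as the loop above:
#   x y z
# 1 1 1 1
# 2 1 1 2
# 3 1 1 1
# 5 3 3 3
# 6 3 3 2
# 8 4 4 4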
Why don't you just iterate over the rows while keeping track of the previous row and comparing it to the current one?
When they match, remember that row's position and remove it after the loop has finished.
Don't delete rows while iterating, because the row indices will shift under you.

Fill in-between entries in an ID vector

Looking for a quick-and-easy solution to a problem which I have only been able to solve inelegantly, by looping. I have an ID vector which looks something like this:
id<-c(NA,NA,1,1,1,NA,1,NA,2,2,2,NA,3,NA,3,3,3)
The NA's that fall in between a sequence of a single number (id[6], id[14]) need to be replaced by that number. However, the NA's that don't meet this condition (those between sequences of two different numbers) need to be left alone (i.e., id[1], id[2], id[8], id[12]). The target vector is therefore:
id.target<-c(NA,NA,1,1,1,1,1,NA,2,2,2,NA,3,3,3,3,3)
This is not difficult to do by looping through each value, but I am looking to do this to many very long vectors, and was hoping for a neater solution. Thanks for any suggestions.
This seems to work. The idea is to use zoo::na.locf to fill the NAs forward and then re-insert NAs where they fall between different numbers.
id.target <- zoo::na.locf(id, na.rm = FALSE)
id.target[(c(diff(id.target), 1L) > 0L) & is.na(id)] <- NA
id.target
## [1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3
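For reference, the intermediate forward-filled vector (before the NAs between different numbers are re-inserted) is:
zoo::na.locf(id, na.rm = FALSE)
## [1] NA NA 1 1 1 1 1 1 2 2 2 2 3 3 3 3 3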
Here is a base R option
d1 <- do.call(rbind, lapply(split(seq_along(id), id), function(x) {
  i1 <- min(x):max(x)
  data.frame(val = unique(id[x]), i1)
}))
id[seq_along(id) %in% d1$i1] <- d1$val
id
#[1] NA NA 1 1 1 1 1 NA 2 2 2 NA 3 3 3 3 3

Assigning values to correlative series in r

I hope you can help me with this issue I have.
I have a big data frame; to simplify it, it looks like this:
df <- data.frame(radius = c (2,3,5,7,4,6,9,8,3,7,8,9,2,4,5,2,6,7,8,9,1,10,8))
df$num <- c(1,2,3,4,5,6,7,8,9,10,11,1,12,13,1,14,15,16,17,18,19,1,1)
df
The column 'num' contains correlative (consecutive) series: 1-11, 1, 12-13, 1, 14-19, 1, 1.
I would like to assign a (sorted) value to each correlative series, as a new column. The outcome should look like this:
df$outcome <- c(1,1,1,1,1,1,1,1,1,1,1,2,3,3,4,5,5,5,5,5,5,6,7)
df
thanks a lot!
A.
We can get the difference between adjacent elements in 'num' using diff and check whether it is not equal to 1. This logical output is one element shorter than the 'num' vector, so we pad it with TRUE at the start and take the cumsum to get the expected output.
df$outcome <- cumsum(c(TRUE,diff(df$num)!=1))
df$outcome
#[1] 1 1 1 1 1 1 1 1 1 1 1 2 3 3 4 5 5 5 5 5 5 6 7
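For reference, the starting position of each series can be inspected directly; these are the places where the padded logical vector is TRUE:
which(c(TRUE, diff(df$num) != 1))
#[1]  1 12 13 15 16 22 23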

Create a vector listing run length of original vector with same length as original vector

This problem seems trivial but I'm at my wits' end after hours of reading.
I need to generate a vector of the same length as the input vector that lists, for each value of the input vector, the total count of that value. So, by way of example, I would want to generate the last column of this data frame:
> df
   customer.id transaction.count total.transactions
1            1                 1                  4
2            1                 2                  4
3            1                 3                  4
4            1                 4                  4
5            2                 1                  2
6            2                 2                  2
7            3                 1                  3
8            3                 2                  3
9            3                 3                  3
10           4                 1                  1
I realise this could be done two ways, either by using run lengths of the first column, or grouping the second column using the first and applying a maximum.
I've tried both tapply:
> tapply(df$transaction.count, df$customer.id, max)
And rle:
> rle(df$customer.id)
But both return a vector of shorter length than the original:
[1] 4 2 3 1
Any help gratefully accepted!
You can do it without creating a transaction counter with:
df$total.transactions <- with(df,
  ave(transaction.count, customer.id, FUN = length))
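For the example data this fills in the expected column:
df$total.transactions
#[1] 4 4 4 4 2 2 3 3 3 1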
You can use rle with rep to get what you want:
x <- rep(1:4, 4:1)
> x
[1] 1 1 1 1 2 2 2 3 3 4
> rep(rle(x)$lengths, rle(x)$lengths)
[1] 4 4 4 4 3 3 3 2 2 1
For performance purposes, you could store the rle object separately so it is only called once.
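For example, storing it once and reusing it:
r <- rle(x)
rep(r$lengths, r$lengths)
#[1] 4 4 4 4 3 3 3 2 2 1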
Or as Karsten suggested with ddply from plyr:
require(plyr)
#Expects data.frame
dat <- data.frame(x = rep(1:4, 4:1))
ddply(dat, "x", transform, total = length(x))
You are probably looking for a split-apply-combine approach; have a look at ddply in the plyr package or the split function in base R.
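A minimal base R sketch of that split-apply-combine idea, using split() and unsplit() on the data frame from the question:
counts <- lapply(split(df$transaction.count, df$customer.id),
                 function(v) rep(length(v), length(v)))
unsplit(counts, df$customer.id)
#[1] 4 4 4 4 2 2 3 3 3 1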
