Deleting rows that have mostly zero values [duplicate]

This question already has an answer here:
Count number of zeros per row, and remove rows with more than n zeros
(1 answer)
Closed 3 years ago.
File-
i j k l m n
a 0 0 0 1 0 1
b 8 6 34 1 0 0
c 0 9 12 0 8 0
d 7 9 3 7 0 5
e 0 0 0 1 0 0
f 2 3 9 6 8 9
g 0 1 0 3 1 5
h 0 9 0 8 4 0
I want to delete the rows that have the value 0 in 3 or more cells.
Expected output-
i j k l m n
b 8 6 34 1 0 0
d 7 9 3 7 0 5
f 2 3 9 6 8 9
g 0 1 0 3 1 5

We can use rowSums
df[rowSums(df == 0) < 3, ]
# i j k l m n
#b 8 6 34 1 0 0
#d 7 9 3 7 0 5
#f 2 3 9 6 8 9
#g 0 1 0 3 1 5
We can also use apply and count row-wise number of 0's and then subset
df[apply(df == 0, 1, sum) < 3, ]
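As a self-contained sketch of the rowSums approach, the snippet below rebuilds the example data and keeps the rows with fewer than 3 zeros. The read.table call and the names keep and result are assumptions added for illustration; they are not part of the original answer.

```r
# Rebuild the example data. The header has one fewer field than the data
# rows, so read.table uses the first column (a..h) as row names.
df <- read.table(text = "
i j k l m n
a 0 0 0 1 0 1
b 8 6 34 1 0 0
c 0 9 12 0 8 0
d 7 9 3 7 0 5
e 0 0 0 1 0 0
f 2 3 9 6 8 9
g 0 1 0 3 1 5
h 0 9 0 8 4 0
", header = TRUE)

# df == 0 is a logical matrix; rowSums counts the TRUE cells (zeros) per row.
keep <- rowSums(df == 0) < 3
result <- df[keep, ]
print(result)
```

Changing the threshold (here 3) is all that is needed to make the filter stricter or looser.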

Related

Count consecutive numbers

I have a time series with a value of 0 or 1 for each date. For example:
date value
1 0
2 0
3 1
4 1
5 1
6 0
7 1
8 1
I want to count the consecutive 1s: for dates 3-5 the count should be 3, and counting should start again at date 7. If a count is below 6, the 1s in that run should be changed to 0s.

library(dplyr)
data.frame(
  date = 1:8,
  value = c(0,0,1,1,1,0,1,1)
) %>%
  mutate(
    count = rle(value) %>%
      {list(.$lengths * .$values, .$lengths)} %>%
      {rep(x = .[[1]], times = .[[2]])},
    count_1 = ifelse(count < 6, 0, count)
  )
gives:
date value count count_1
1 1 0 0 0
2 2 0 0 0
3 3 1 3 0
4 4 1 3 0
5 5 1 3 0
6 6 0 0 0
7 7 1 2 0
8 8 1 2 0

I would first create a grouping variable and then use this to aggregate the dataset.
d = data.frame("date"=1:12,
               "value"=c(1,1,0,0,1,1,1,1,0,0,1,0))
d$group = 1
for(i in 2:dim(d)[1]){
  if(d$value[i]==d$value[i-1]){
    d$group[i]=d$group[i-1]
  } else {
    d$group[i]=d$group[i-1]+1
  }
}
nd = data.frame("Group"=unique(d$group),
                "Start"=aggregate(d$date~d$group,FUN=min)[,2],
                "End"=aggregate(d$date~d$group,FUN=max)[,2],
                "Count"=aggregate(d$value~d$group,FUN=sum)[,2])
The output for this data would be:
> d ## Input data
date value
1 1 1
2 2 1
3 3 0
4 4 0
5 5 1
6 6 1
7 7 1
8 8 1
9 9 0
10 10 0
11 11 1
12 12 0
> nd ## All groups
Group Start End Count
1 1 1 2 2
2 2 3 4 0
3 3 5 8 4
4 4 9 10 0
5 5 11 11 1
6 6 12 12 0
> nd[nd$Count>0,] ## Just the groups with 1 in them:
Group Start End Count
1 1 1 2 2
3 3 5 8 4
5 5 11 11 1
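The grouping variable can probably also be built without a loop. The sketch below is an alternative to the loop in this answer, not the answer's own code: it starts a new group wherever the value differs from the previous row, using cumsum over diff, and summarises each group with tapply.

```r
d <- data.frame(date = 1:12,
                value = c(1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0))

# TRUE marks the first row of each run; cumsum turns the marks into run ids.
d$group <- cumsum(c(TRUE, diff(d$value) != 0))

# One row per run: first date, last date, and number of 1s in the run.
nd <- data.frame(
  Group = unique(d$group),
  Start = as.vector(tapply(d$date, d$group, min)),
  End   = as.vector(tapply(d$date, d$group, max)),
  Count = as.vector(tapply(d$value, d$group, sum))
)
print(nd[nd$Count > 0, ])
```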

Another solution, which looks like what you expected:
d = data.frame("date"=1:20,"value"=c(1,1,0,0,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,0))
repl <- rle(d$value)
# Multiply by the run values so that runs of 0s stay 0, whatever their length
rep_lengths <- rep(repl$lengths * repl$values, repl$lengths)
rep_lengths[rep_lengths < 6] <- 0
d$value <- rep_lengths
returns
> d
date value
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
10 10 0
11 11 0
12 12 0
13 13 7
14 14 7
15 15 7
16 16 7
17 17 7
18 18 7
19 19 7
20 20 0

You can use rle to count the consecutive values and use ifelse to set those lower than 6 to 0:
y <- rle(x$value)
y[[2]] <- y[[1]] * y[[2]]
y[[2]] <- ifelse(y[[2]] < 6, 0, y[[2]])
inverse.rle(y)
#[1] 0 0 0 0 0 0 0 0
Data:
x <- data.frame(date = 1:8, value = c(0,0,1,1,1,0,1,1))
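The rle and inverse.rle steps can be wrapped in a small helper so the threshold is a parameter. This is only a sketch: the function name zero_short_runs and the argument min_len are hypothetical, and it assumes the input contains only 0s and 1s.

```r
# Replace each run of 1s by its length, or by 0 if the run is shorter
# than min_len. Runs of 0s always stay 0.
zero_short_runs <- function(value, min_len = 6) {
  runs <- rle(value)
  long_run_of_ones <- runs$values == 1 & runs$lengths >= min_len
  runs$values <- ifelse(long_run_of_ones, runs$lengths, 0)
  inverse.rle(runs)
}

short <- zero_short_runs(c(0, 0, 1, 1, 1, 0, 1, 1))
long  <- zero_short_runs(c(1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0,
                           1, 1, 1, 1, 1, 1, 1, 0))
print(short)
print(long)
```

With the 8-row example no run reaches length 6, so every value becomes 0; in the 20-value vector only the run of seven 1s survives.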

R- Include starting point in cumsum function

I have this data.frame:
a b
[1,] 1 0
[2,] 2 0
[3,] 3 0
[4,] 4 0
[5,] 5 0
[6,] 6 1
[7,] 7 2
[8,] 8 3
[9,] 9 4
[10,] 10 5
I want to apply cumsum on column a only when its corresponding value on column b is different from 0.
I tried this below but it doesn't include a starting condition on the cumsum:
df_cumsum <- cbind(c(1:10), c(0,0,0,0,0,1,2,3,4,5),
as.data.frame(ave(A[,1], A[,2] != 0, FUN=cumsum)))
Unfortunately, the cumsum is also computed over the rows where b is 0:
a b c
1 1 0 1
2 2 0 3
3 3 0 6
4 4 0 10
5 5 0 15
6 6 1 6
7 7 2 13
8 8 3 21
9 9 4 30
10 10 5 40
I would like to obtain:
a b c
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
6 6 1 6
7 7 2 13
8 8 3 21
9 9 4 30
10 10 5 40
Thanks for help!

Assuming the input is df as shown reproducibly in the Note at the end, try this. It zeros out any a value for which b is 0.
transform(df, cum = cumsum((b > 0) * a))
giving:
a b cum
1 1 0 0
2 2 0 0
3 3 0 0
4 4 0 0
5 5 0 0
6 6 1 6
7 7 2 13
8 8 3 21
9 9 4 30
10 10 5 40
Note
We assume this input shown in reproducible form:
Lines <- "
a b
1 0
2 0
3 0
4 0
5 0
6 1
7 2
8 3
9 4
10 5"
df <- read.table(text = Lines, header = TRUE)
Update
a and b had been reversed. Have fixed.

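It would be better to create an index and update only the matching rows (initialize the column with 0 first, otherwise the other rows end up as NA)
df1$c <- 0
i1 <- df1$b > 0
df1$c[i1] <- with(df1, cumsum(a[i1]))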
Or in a single line
df1$c <- with(df1, cumsum(a * (b > 0)))
df1$c
#[1] 0 0 0 0 0 6 13 21 30 40

I really like how clean the other answers are using a * (b > 0), but that can sometimes be a bit confusing for newer programmers. As an alternative to this syntax you can use the vectorized ifelse function.
df <- data.frame(a=c(1:10), b=c(0,0,0,0,0,1,2,3,4,5))
# One way
df$c <- cumsum(ifelse(df$b>0,df$a,0))
# Another way
df$d <- with(df,cumsum(ifelse(b>0,a,0)))
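To see why the multiplication and ifelse forms agree, here is a short sketch that computes both and compares them. The names mask, via_mult and via_ifelse are made up for illustration. It uses b != 0 to match the question's wording ("different from 0"); with the non-negative b in this example, b > 0 gives the same mask.

```r
df <- data.frame(a = 1:10, b = c(0, 0, 0, 0, 0, 1, 2, 3, 4, 5))

# Logical mask: TRUE where b is non-zero. In arithmetic, TRUE counts as 1
# and FALSE as 0, so a * mask zeros out a wherever b is 0.
mask <- df$b != 0

via_mult   <- cumsum(df$a * mask)
via_ifelse <- cumsum(ifelse(mask, df$a, 0))

print(via_mult)
print(all(via_mult == via_ifelse))
```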

Finding variance of columns from 2 dataframes

I have 2 dataframes
DataFrame A and Dataframe B.
A <- data.frame(a=c(1,2,3,4,5),b=c(2,4,6,8,10),c=c(3,6,9,12,15),x=c(4,8,12,16,20),y=c(5,10,15,20,25))
B <- data.frame(a=c(1,2,3,4,5),b=c(2,4,6,8,10),c=c(3,6,9,12,15),x=c(4,8,12,16,20),y=c(5,10,15,20,25))
A
a b c x y
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
B
a b c x y
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
Expected Output:
C
a b c x y
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
Both have a key column which is alpha-numeric.
Both dataframes have 260 columns in all out of which 250 are float.
Is there an easier way to compute the variance of each of the 250 columns and store it in another dataframe?

I think you want the difference between the respective columns of the two dataframes
temp = names(A)
data.frame(A["a"], do.call(cbind, lapply(temp[!temp %in% "a"], function(x) A[x] - B[x])))
# a b c x y
#1 1 0 0 0 0
#2 2 0 0 0 0
#3 3 0 0 0 0
#4 4 0 0 0 0
#5 5 0 0 0 0

We can use Map/mapply to find the difference between the corresponding columns of 'A' and 'B'
cbind(A[1], mapply(`-`, A[-1], B[names(A)[-1]]))
# a b c x y
#1 1 0 0 0 0
#2 2 0 0 0 0
#3 3 0 0 0 0
#4 4 0 0 0 0
#5 5 0 0 0 0
Or just
cbind(A[1], A[-1] - B[-1])
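The one-liners above rely on both dataframes having their rows in the same order. Since the question mentions an alphanumeric key column, a hedged sketch of aligning the rows by key first might look like this. The toy data, the key values k1..k3 and the names idx and num_cols are assumptions for illustration, and it assumes every key in A also appears exactly once in B.

```r
# Toy data: same keys, but B's rows are in a different order.
A <- data.frame(a = c("k1", "k2", "k3"), b = c(2, 4, 6), c = c(3, 6, 9))
B <- data.frame(a = c("k3", "k1", "k2"), b = c(6, 2, 4), c = c(9, 3, 6))

# For each key in A, find the position of the same key in B.
idx <- match(A$a, B$a)

# Subtract the numeric columns after reordering B to A's row order.
num_cols <- setdiff(names(A), "a")
C <- cbind(A["a"], A[num_cols] - B[idx, num_cols, drop = FALSE])
print(C)
```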

Using R to filter special rows

I have a question that has bothered me for a long time. I have a data frame as below...
ll <- data.frame(id=1:10,
A=c(rep(0,5),3,4,5,0,2),
B=c(1,4,2,0,3,0,3,24,0,0),
C=c(0,3,4,5,0,4,0,6,0,5),
D=c(0,1,2,0,42,4,0,3,8,0))
> ll
id A B C D
1 1 0 1 0 0
2 2 0 4 3 1
3 3 0 2 4 2
4 4 0 0 5 0
5 5 0 3 0 42
6 6 3 0 4 4
7 7 4 3 0 0
8 8 5 24 6 3
9 9 0 0 0 8
10 10 2 0 5 0
I want to filter out some special rows which have more than one "0" such as...
id A B C D
1 1 0 1 0 0
I want the final output as...
id A B C D
2 2 0 4 3 1
3 3 0 2 4 2
6 6 3 0 4 4
8 8 5 24 6 3

You can just use rowSums:
> ll[rowSums(ll == 0) <= 1, ]
id A B C D
2 2 0 4 3 1
3 3 0 2 4 2
6 6 3 0 4 4
8 8 5 24 6 3
If there are any columns that shouldn't be included, you can drop them in the rowSums step. For example, I assume "id" would not be included. If that's the case, then you can do:
ll[rowSums(ll[-1] == 0) <= 1, ]
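If the columns to check are better identified by name than by position, a sketch along these lines may be easier to maintain. The names value_cols, zero_count and filtered are hypothetical. The na.rm = TRUE argument is an added assumption about how missing values should be treated: NA cells are simply not counted as zeros.

```r
ll <- data.frame(id = 1:10,
                 A = c(rep(0, 5), 3, 4, 5, 0, 2),
                 B = c(1, 4, 2, 0, 3, 0, 3, 24, 0, 0),
                 C = c(0, 3, 4, 5, 0, 4, 0, 6, 0, 5),
                 D = c(0, 1, 2, 0, 42, 4, 0, 3, 8, 0))

# Count zeros only in the named value columns, ignoring the id column.
value_cols <- c("A", "B", "C", "D")
zero_count <- rowSums(ll[value_cols] == 0, na.rm = TRUE)

# Keep rows with at most one zero.
filtered <- ll[zero_count <= 1, ]
print(filtered)
```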

cumulative counter in dataframe R

I have a dataframe with many rows, but the structure looks like this:
year factor
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 1
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 1
18 0
19 0
20 0
I need to add a counter as a third column. It should count the consecutive cells that contain zero and reset to zero whenever the value 1 is encountered. The result should look like this:
year factor count
1 0 0
2 0 1
3 0 2
4 0 3
5 0 4
6 0 5
7 0 6
8 0 7
9 1 0
10 0 1
11 0 2
12 0 3
13 0 4
14 0 5
15 0 6
16 0 7
17 1 0
18 0 1
19 0 2
20 0 3
I would be glad to do it in a quick way, avoiding loops, since I have to do the operations for hundreds of files.
You can copy my dataframe, pasting the dataframe in "..." here:
dt <- read.table(text = "...", header = TRUE)

Perhaps a solution like this with ave would work for you:
A <- cumsum(dt$factor)
ave(A, A, FUN = seq_along) - 1
# [1] 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3
Original answer:
(Missed that the first value was supposed to be "0". Oops.)
x <- rle(dt$factor == 1)
y <- sequence(x$lengths)
y[dt$factor == 1] <- 0
y
# [1] 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 0 1 2 3
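For reference, a self-contained sketch of the cumsum and ave approach on the question's data. The data frame is rebuilt here with data.frame instead of read.table, which is an assumption for illustration.

```r
# Rebuild the example: 1s at years 9 and 17, 0s elsewhere.
dt <- data.frame(year = 1:20,
                 factor = c(rep(0, 8), 1, rep(0, 7), 1, rep(0, 3)))

# cumsum increases by one at every 1, so it labels the block each row is in.
block <- cumsum(dt$factor)

# Number the rows within each block, starting at 0.
dt$count <- ave(block, block, FUN = seq_along) - 1
print(dt)
```

Both steps are vectorised, so the same two lines can be applied to each file without an explicit loop over rows.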
