Using R to filter special rows - r

I have a question which has bothered me for a long time. I have a data frame as below...
ll <- data.frame(id=1:10,
A=c(rep(0,5),3,4,5,0,2),
B=c(1,4,2,0,3,0,3,24,0,0),
C=c(0,3,4,5,0,4,0,6,0,5),
D=c(0,1,2,0,42,4,0,3,8,0))
> ll
id A B C D
1 1 0 1 0 0
2 2 0 4 3 1
3 3 0 2 4 2
4 4 0 0 5 0
5 5 0 3 0 42
6 6 3 0 4 4
7 7 4 3 0 0
8 8 5 24 6 3
9 9 0 0 0 8
10 10 2 0 5 0
I want to filter out some special rows which have more than one "0" such as...
id A B C D
1 1 0 1 0 0
I want the final output as...
id A B C D
2 2 0 4 3 1
3 3 0 2 4 2
6 6 3 0 4 4
8 8 5 24 6 3

You can just use rowSums:
> ll[rowSums(ll == 0) <= 1, ]
id A B C D
2 2 0 4 3 1
3 3 0 2 4 2
6 6 3 0 4 4
8 8 5 24 6 3
If there are any columns that shouldn't be included, you can drop them in the rowSums step. For example, I assume "id" would not be included. If that's the case, then you can do:
ll[rowSums(ll[-1] == 0) <= 1, ]
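As a quick check of the approach above, the sketch below rebuilds the question's data frame and applies the filter while skipping the id column. The na.rm = TRUE argument is an addition that is not in the original answer; it assumes that missing values should not be counted as zeros.

```r
ll <- data.frame(id = 1:10,
                 A = c(rep(0, 5), 3, 4, 5, 0, 2),
                 B = c(1, 4, 2, 0, 3, 0, 3, 24, 0, 0),
                 C = c(0, 3, 4, 5, 0, 4, 0, 6, 0, 5),
                 D = c(0, 1, 2, 0, 42, 4, 0, 3, 8, 0))

# Count zeros per row in every column except id (column 1).
# na.rm = TRUE is an assumption: NA cells are simply not counted.
zeros_per_row <- rowSums(ll[-1] == 0, na.rm = TRUE)

# Keep rows with at most one zero
res <- ll[zeros_per_row <= 1, ]
res$id
```

Storing the per-row count in its own vector makes it easy to change the threshold later, for example to zeros_per_row == 0 for rows with no zeros at all.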

Related

Pasting a string of variables into a function is not working

I was looking at this question: Find how many times duplicated rows repeat in R data frame, which provides the following code:
library(plyr)
ddply(df,.(a,b),nrow)
However, I have a dataset with many variables, so I can't type them out like a,b in this case. I've tried using names(data) with the paste function, but it doesn't seem to work. I tried this:
var_names=paste(names(data),collapse=",")
ddply(data,.(paste(a)),nrow)
It instead gives this output:
However, if I manually type them out, I get the proper output:
What do I need to do differently here?
Instead of pasting the names into a string and evaluating it, use count from dplyr, which accepts multiple columns through across and the select helper everything():
library(dplyr)
df %>%
count(across(everything()))
A reproducible example with mtcars dataset
data(mtcars)
df <- mtcars %>%
select(vs:carb)
count(df, across(everything()))
vs am gear carb n
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
Also, in ddply, we can just pass a vector of column names i.e. no need to create a single string
library(plyr)
ddply(df, names(df), nrow)
vs am gear carb V1
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
Or, if we are creating a single string from the names, paste the whole expression and then evaluate it (which is not recommended, as there are standard ways of dealing with this)
eval(parse(text = paste('ddply(df, .(', toString(names(df)), '), nrow)')))
vs am gear carb V1
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
You can use aggregate by grouping on all the columns and counting the length of each group.
aggregate(1:nrow(df)~., df, length)
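The pasted string fails in the question because a single comma-separated string is not split back into separate column names. For comparison, here is a base R sketch of the same count that needs neither plyr nor dplyr. The helper column n is a name introduced here for illustration, and the column subset mirrors the mtcars example used above.

```r
# Same four columns as the mtcars example above
df <- mtcars[, c("vs", "am", "gear", "carb")]

# Hypothetical helper column: every row contributes a count of 1
df$n <- 1L

# Group by all remaining columns (the dot) and sum the helper column
counts <- aggregate(n ~ ., data = df, FUN = sum)
counts
```

The row order can differ from the dplyr and plyr output, since aggregate sorts the groups in its own way, but the counts per combination should be the same.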

What is the R function for detecting successive differences in a data frame?

I use the following code in R and it works very well. More precisely, I compare each time cluster_id with the last cluster_ref to see when they differ 2 periods in a row (data is organized by fund_numbers). However, I would like to adapt it to 5 periods. But it is impossible to make it work. Do you have any idea how I can modify this code to solve my problem?
library(dplyr)
library(purrr)
# mon is the Cluster_id column and ref the Cluster_ref column of one fund
get_output <- function(mon, ref){
  differs <- !is.na(mon) & !map2_lgl(mon, last(ref), identical)
  as.integer(differs & lag(differs, default = FALSE))
}
df %>%
arrange(Fund_number, rolling_window) %>%
group_by(Fund_number) %>%
mutate(Deviation = get_output(Cluster_id, Cluster_ref)) %>%
ungroup()
rolling_window Fund_number Cluster_id Cluster_ref Expected_output
1 1 10 10 0
2 1 10 10 0
3 1 8 9 0
4 1 8 8 0
5 1 7 7 0
6 1 8 8 0
7 1 8 NA 1
8 1 7 NA 1
9 1 7 10 1
10 1 10 10 0
1 2 NA NA 0
2 2 NA 3 0
3 2 3 3 0
4 2 2 5 0
5 2 2 NA 0
6 2 2 4 0
7 2 2 4 1
8 2 5 5 0
9 2 4 5 0
10 2 3 5 0
This is what I want.
So as you can see, the data is organized by fund_number. Then I look at the last cluster_ref for each fund (so every 8 rows) and compare it to each cluster_id for each fund. As soon as it is different at least 5 periods in a row I have 1 if not 0. So for each fund, I compare the 8th cluster_ref and the cluster_id of rows 1 to 8.
The code above makes this but with 2 time periods.
Thank you very much,
Vanie
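In data.table, we can use rleid on the comparison between Cluster_id and the last Cluster_ref of each fund, and then flag every row that is at least the fifth in its run.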
library(data.table)
setDT(df)[, temp := rleid(last(Cluster_ref) != Cluster_id), Fund_number]
df[, output := +(seq_along(Cluster_ref) >= 5), .(Fund_number, temp)]
df[, temp := NULL]
df
# rolling_window Fund_number Cluster_id Cluster_ref Expected_output output
# 1: 1 1 10 10 0 0
# 2: 2 1 10 10 0 0
# 3: 3 1 8 9 0 0
# 4: 4 1 8 8 0 0
# 5: 5 1 7 7 0 0
# 6: 6 1 8 8 0 0
# 7: 7 1 8 NA 1 1
# 8: 8 1 7 NA 1 1
# 9: 9 1 7 10 1 1
#10: 10 1 10 10 0 0
#11: 1 2 NA NA 0 0
#12: 2 2 NA 3 0 0
#13: 3 2 3 3 0 0
#14: 4 2 2 5 0 0
#15: 5 2 2 NA 0 0
#16: 6 2 2 4 0 0
#17: 7 2 2 4 1 1
#18: 8 2 5 5 0 0
#19: 9 2 4 5 0 0
#20: 10 2 3 5 0 0
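One thing to watch with a run-based approach is that a run of five or more rows that match the reference is also a run. The base R sketch below flags only runs where the values actually differ. The function name flag_run and its arguments are hypothetical, the data frame is typed in from the question's table, and NA in Cluster_id is assumed never to count as a difference.

```r
# Flag rows where `id` has differed from `ref_last` for at least
# `n` consecutive rows (hypothetical helper, not from the original answer).
flag_run <- function(id, ref_last, n = 5) {
  # Assumption: NA values never count as a difference
  differs <- !is.na(id) & !is.na(ref_last) & id != ref_last
  # Position of each row within its run of identical TRUE/FALSE values
  pos <- sequence(rle(differs)$lengths)
  as.integer(differs & pos >= n)
}

df <- data.frame(
  Fund_number = rep(1:2, each = 10),
  Cluster_id  = c(10, 10, 8, 8, 7, 8, 8, 7, 7, 10,
                  NA, NA, 3, 2, 2, 2, 2, 5, 4, 3),
  Cluster_ref = c(10, 10, 9, 8, 7, 8, NA, NA, 10, 10,
                  NA, 3, 3, 5, NA, 4, 4, 5, 5, 5)
)

# Apply per fund, comparing against the last Cluster_ref of that fund
parts <- lapply(split(df, df$Fund_number), function(g) {
  flag_run(g$Cluster_id, g$Cluster_ref[nrow(g)], n = 5)
})
df$output <- unsplit(parts, df$Fund_number)
df$output
```

Changing n switches between the two-period and five-period versions without touching the rest of the code.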

How to generate Classification Analysis tables in R?

So far I have done the discriminant analysis. I generated the posterior probabilities, structure loadings, and group centroids.
I have 1 grouping variable : history
I have 3 discriminant variables : mhpg, exercise, and control
here is the code so far
library(MASS)  # provides lda()
td <- read.delim("H:/Desktop/TAB DATA.txt")
td$history<-factor(td$history)
fit<-lda(history~mhpg+exercise+control, data=td)
git<-predict(fit)
xx<-subset(td, select=c(mhpg, control, exercise))
cor(xx,git$x)
aggregate(git$x~history,data=td,FUN=mean)
tst<-lm(cbind(mhpg,control,exercise)~history,data=td)
Basically, the above code is for discriminant analysis.
Now I want to generate frequency classification and percent classification tables for the classification analysis.
My attempted code (which I borrowed from someone else, to no avail) is:
td[6] <- git$class
td$V6<-factor(td$V6)
ftab<-table(td$history,dt$V6)
prop.table(ftab,1)
Where column 6 is my grouping variable history.
I get the following error when trying to make td$V6 a categorical variable with factor
Error in `$<-.data.frame`(`*tmp*`, "V6", value = integer(0)) :
replacement has 0 rows, data has 50
Can anyone steer me in the right direction? I really don't know why the sample code used a capital V out of nowhere. Below is the data. Column 6 is the grouping variable, history. Column 5 is the discriminant variable, control. Column 7 is the discriminant variable, exercise. Column 8 is the discriminant variable, mhpg.
1 3 6 0 2 0 4 2 4 3 0 6 0
1 4 5 0 0 1 2 5 4 6 1 4 1
1 4 4 0 2 1 1 8 6 7 1 2 1
2 4 9 0 2 1 0 6 7 8 1 4 1
2 4 3 1 4 1 2 6 6 6 1 4 1
2 5 7 0 1 1 3 6 7 7 1 1 1
2 5 8 0 1 1 1 6 6 7 1 5 1
2 6 7 0 1 1 0 9 8 8 1 3 1
2 6 4 1 2 1 2 5 7 6 1 5 1
3 4 10 0 1 1 1 8 5 7 1 4 1
3 4 4 0 1 1 1 8 9 8 1 3 1
3 4 7 0 1 0 1 6 3 4 0 8 0
3 5 4 1 4 1 2 5 4 5 0 5 1
3 5 7 0 2 1 1 7 5 7 1 4 1
3 5 6 0 0 1 0 10 9 10 1 3 1
3 5 6 0 2 1 1 9 10 9 1 2 1
3 5 5 1 2 1 2 5 4 4 0 9 1
3 6 2 1 4 1 3 6 4 4 0 7 1
3 6 3 1 2 1 2 7 5 5 0 6 1
3 6 5 1 2 1 2 6 7 6 1 6 1
3 6 7 1 3 1 3 5 4 4 0 8 1
3 6 5 1 2 1 2 5 3 3 0 10 1
3 7 8 0 0 1 1 7 6 7 1 5 1
3 7 5 1 2 1 1 5 5 5 0 6 1
3 7 6 1 2 0 4 3 1 2 0 9 0
3 8 6 1 2 1 1 6 5 5 0 7 1
3 8 9 0 0 1 0 7 5 6 1 3 1
4 5 5 1 2 1 1 5 6 5 0 6 1
4 5 5 1 2 0 2 3 3 4 0 8 0
4 6 8 0 0 1 2 8 7 7 1 4 1
4 6 6 1 3 1 2 5 4 4 0 7 0
4 6 5 1 3 1 2 4 3 2 0 8 0
4 7 2 0 3 0 4 3 6 6 1 4 1
4 7 4 1 3 0 3 4 2 1 0 7 0
4 7 7 1 3 0 4 4 5 5 0 7 0
4 7 6 1 3 0 3 3 6 5 0 4 0
5 7 5 1 1 0 4 1 7 4 0 7 1
5 8 1 1 3 0 3 4 8 7 1 5 0
5 8 3 1 3 0 3 4 5 6 1 5 1
5 9 4 1 4 0 3 2 7 5 0 5 1
5 9 6 1 4 0 3 4 6 6 1 7 0
5 10 4 1 3 0 3 4 2 3 0 6 0
1 1 8 0 1 0 2 5 6 5 0 6 1
1 2 7 0 1 1 1 7 8 9 1 5 0
1 2 7 0 1 1 0 7 5 6 1 5 1
1 3 5 0 1 1 2 7 8 8 1 5 0
2 3 3 1 2 1 2 6 7 6 1 6 0
2 3 6 1 1 1 2 7 6 4 0 7 0
2 4 6 1 3 1 3 6 5 5 0 6 0
2 5 4 1 3 1 3 4 4 3 0 6 0
Try:
tbl <- table(td$history,git$class)
tbl
# 0 1
# 0 13 2
# 1 1 34
prop.table(tbl)
# 0 1
# 0 0.26 0.04
# 1 0.02 0.68
These are the classification tables.
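Note that prop.table(tbl) gives each cell as a share of the overall total. If the percent table should instead show how each actual group was classified, row percentages may be closer to what is wanted. The sketch below re-enters the counts printed above by hand, so it runs without the data file; the dimension labels actual and predicted are added here for readability.

```r
# Counts copied from the frequency table above
# (rows: td$history, columns: git$class)
tbl <- as.table(matrix(c(13, 1, 2, 34), nrow = 2,
                       dimnames = list(actual = c("0", "1"),
                                       predicted = c("0", "1"))))

# Percent of each actual group assigned to each predicted class
row_pct <- round(100 * prop.table(tbl, margin = 1), 1)
row_pct

# Overall share of correctly classified cases
accuracy <- sum(diag(tbl)) / sum(tbl)
accuracy
```

With the real objects, the same two lines would apply to table(td$history, git$class) directly.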
Regarding why your "borrowed" code does not run, there are too many possibilities.
First, if you import the data set you provided without column names, R will assign names Vn where n is 1,2,3, etc. But if this was the case none of your code would run as you refer to columns history, control, etc. So at least those must be named properly.
Second, in the line:
ftab<-table(td$history,dt$V6)
you refer to dt$V6. AFAICT there is no dt (is this a typo?).
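One plausible reading of the error itself, assuming the columns of td are named as in the question: td[6] <- git$class overwrites the existing sixth column and keeps its name, so no column called V6 is ever created, and factor(td$V6) then has length zero. The toy data frame below, with made-up column names a and b, reproduces that behaviour.

```r
# Toy data frame with named columns (names are made up for illustration)
d <- data.frame(a = 1:3, b = 4:6)

# Assigning by position replaces column 2 but keeps its name "b"
d[2] <- c("x", "y", "z")
names(d)

# No column V2 exists, so d$V2 is NULL and factor(NULL) has length 0
is.null(d$V2)

# Assigning a zero-length value to a column of a 3-row data frame fails
res <- tryCatch({
  d$V2 <- factor(d$V2)
  "ok"
}, error = function(e) conditionMessage(e))
res
```

If that is what happened, it would also mean the original history column was replaced by the predicted classes, so re-reading the data before building the tables is probably safest.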

cumulative counter in dataframe R

I have a dataframe with many rows, but the structure looks like this:
year factor
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 1
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 1
18 0
19 0
20 0
I need to add a counter as a third column. It should count the consecutive cells that contain zero, and reset to zero each time the value 1 is encountered. The result should look like this:
year factor count
1 0 0
2 0 1
3 0 2
4 0 3
5 0 4
6 0 5
7 0 6
8 0 7
9 1 0
10 0 1
11 0 2
12 0 3
13 0 4
14 0 5
15 0 6
16 0 7
17 1 0
18 0 1
19 0 2
20 0 3
I would be glad to do it in a quick way, avoiding loops, since I have to do the operations for hundreds of files.
You can copy my data frame by pasting it in place of "..." here:
dt <- read.table(text = "...", header = TRUE)
Perhaps a solution like this with ave would work for you:
A <- cumsum(dt$factor)
ave(A, A, FUN = seq_along) - 1
# [1] 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3
Original answer:
(Missed that the first value was supposed to be "0". Oops.)
x <- rle(dt$factor == 1)
y <- sequence(x$lengths)
y[dt$factor == 1] <- 0
y
# [1] 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 0 1 2 3
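For a version of the cumsum and ave approach that runs on its own, the sketch below rebuilds the example data from the question and attaches the counter as a third column. The grouping vector grp is a name introduced here.

```r
# Example data typed in from the question
dt <- data.frame(year = 1:20,
                 factor = c(rep(0, 8), 1, rep(0, 7), 1, rep(0, 3)))

# Each 1 starts a new group, so cumsum labels the stretches between 1s
grp <- cumsum(dt$factor)

# Number the rows within each group, starting from 0
dt$count <- ave(seq_along(grp), grp, FUN = seq_along) - 1L
dt$count
```

Both steps are vectorised, so the same lines can be applied to each file in turn without an explicit row loop.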

Cut value in creating table

I have following type of data:
mydata <- data.frame (yvar = rnorm(200, 15, 5), xv1 = rep(1:5, each = 40),
xv2 = rep(1:10, 20))
table(mydata$xv1, mydata$xv2)
1 2 3 4 5 6 7 8 9 10
1 4 4 4 4 4 4 4 4 4 4
2 4 4 4 4 4 4 4 4 4 4
3 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4
5 4 4 4 4 4 4 4 4 4 4
I want to tabulate again by yvar categories. The following is the cut key.
cutkey :
< 10 - group 1
10-12 - group 2
12-16 - group 3
>16 - group 4
Thus we will have a table like the one above for each element of the cut key. I want to have margin sums every time.
< 10 - group 1
1 2 3 4 5 6 7 8 9 10
1 4 4 4 4 4 4 4 4 4 4
2 4 4 4 4 4 4 4 4 4 4
3 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4
5 4 4 4 4 4 4 4 4 4 4
10-12 - group 2
1 2 3 4 5 6 7 8 9 10
1 4 4 4 4 4 4 4 4 4 4
2 4 4 4 4 4 4 4 4 4 4
3 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4
5 4 4 4 4 4 4 4 4 4 4
and so on for all groups
(the numbers will definitely be different)
Is there an easy way to do it?
Yes, using cut, dlply (plyr package) and addmargins:
library(plyr)
mydata$yvar1 <- cut(mydata$yvar,breaks = c(-Inf,10,12,16,Inf))
dlply(mydata,.(yvar1),function(x) addmargins(table(x$xv1,x$xv2)))
$`(-Inf,10]`
1 2 3 4 5 6 7 8 9 10 Sum
1 0 0 0 0 0 0 2 0 1 0 3
2 1 1 0 1 0 0 0 0 2 0 5
3 0 1 0 0 1 1 0 2 0 0 5
4 0 0 2 0 1 1 0 1 0 0 5
5 0 1 1 0 1 1 1 0 0 2 7
Sum 1 3 3 1 3 3 3 3 3 2 25
$`(10,12]`
1 2 3 4 6 7 8 9 10 Sum
1 0 0 0 1 2 0 0 0 0 3
2 0 0 1 0 0 1 0 0 1 3
3 0 1 0 1 1 2 0 0 1 6
4 0 1 0 0 0 0 0 0 0 1
5 1 0 1 1 1 0 1 1 2 8
Sum 1 2 2 3 4 3 1 1 4 21
$`(12,16]`
1 2 3 4 5 6 7 8 9 10 Sum
1 2 3 1 1 1 2 0 3 0 2 15
2 0 1 0 1 3 3 2 0 0 1 11
3 3 1 3 1 0 0 0 2 4 1 15
4 3 2 1 2 2 0 1 1 4 1 17
5 3 1 1 2 0 1 1 1 1 0 11
Sum 11 8 6 7 6 6 4 7 9 5 69
$`(16, Inf]`
1 2 3 4 5 6 7 8 9 10 Sum
1 2 1 3 2 3 0 2 1 3 2 19
2 3 2 3 2 1 1 1 4 2 2 21
3 1 1 1 2 3 2 2 0 0 2 14
4 1 1 1 2 1 3 3 2 0 3 17
5 0 2 1 1 3 1 2 2 2 0 14
Sum 7 7 9 9 11 7 10 9 7 9 85
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
yvar1
1 (-Inf,10]
2 (10,12]
3 (12,16]
4 (16, Inf]
You can adjust the breaks argument to cut to get the values just how you want them. (Although the margin sums you display in your question don't look like margin sums at all.)
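A base R variant of the same idea is sketched below, in case plyr is not available. It assumes the default right-closed intervals of cut are acceptable, uses a fixed seed only so that the run is repeatable, and wraps xv1 and xv2 in factor() with explicit levels so that every table keeps all 5 rows and 10 columns even when a category has no observations in a group. The labels "group 1" to "group 4" follow the cut key in the question.

```r
set.seed(1)  # arbitrary seed, only for repeatability
mydata <- data.frame(yvar = rnorm(200, 15, 5),
                     xv1 = rep(1:5, each = 40),
                     xv2 = rep(1:10, 20))

# Bin yvar according to the cut key (intervals are right-closed by default)
mydata$grp <- cut(mydata$yvar, breaks = c(-Inf, 10, 12, 16, Inf),
                  labels = paste("group", 1:4))

# One table per group, with fixed levels and margin sums
tabs <- lapply(split(mydata, mydata$grp), function(d) {
  addmargins(table(factor(d$xv1, levels = 1:5),
                   factor(d$xv2, levels = 1:10)))
})
names(tabs)
tabs[["group 1"]]
```

The grand totals in the bottom-right corners of the four tables add up to the 200 rows of mydata, which is a convenient sanity check.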
