Count occurrences of unique variables - r

I have a data frame of variables, some occur more than once, e.g.:
a, b, b, b, c, c, d, e, f
I would then like to get an output (in two columns) like this:
a 1; b 3; c 2; d 1; e 1; f 1.
Bonus question: if a variable appears fewer than n times in the counted column, I'd like it to be collapsed under a single name (e.g. 'other' for fewer than 2 occurrences).

Tabulating and collapsing
Your example vector is
vec <- letters[c(1,2,2,2,3,3,4,5,6)]
To get a tabulation, use
tab <- table(vec)
To collapse infrequent items (say, with counts below two), use
res <- c(tab[tab>=2],other=sum(tab[tab<2]))
# b c other
# 3 2 4
Displaying in two columns
resdf <- data.frame(count=res)
# count
# b 3
# c 2
# other 4
Technically, the "first column" here is the row labels, accessible with rownames(resdf).
Similar options include:
stack(res) for two actual columns
data.frame(count=sort(res,decreasing=TRUE)) to sort
In all of these, tab or c(tab) can be used in place of res.
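Putting the pieces together, here is a minimal sketch with the threshold as a parameter (collapse_rare is a hypothetical helper name, not from the original answer):

```r
# Hypothetical helper: tabulate a vector and collapse levels with
# fewer than n occurrences into a single "other" category.
collapse_rare <- function(vec, n = 2) {
  tab <- table(vec)
  res <- c(tab[tab >= n], other = sum(tab[tab < n]))
  data.frame(variable = names(res), count = as.vector(res),
             row.names = NULL)
}

vec <- letters[c(1, 2, 2, 2, 3, 3, 4, 5, 6)]
collapse_rare(vec, n = 2)
#   variable count
# 1        b     3
# 2        c     2
# 3    other     4
```

Note that with this sketch an `other` row with count 0 appears even when no level falls below the threshold; drop it with a filter if that matters.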

Related

How to find the longest sequence of non-NA rows in R?

I have an ordered dataframe with many variables, and am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column. Is there an easy way to do this? I have tried the na.contiguous() function but my data is not formatted as a time series.
My intuition is to create a running counter that checks whether each row has an NA, counting the number of consecutive rows without one and restarting every time an NA is encountered. That would give a data frame with the lengths of every sequence of non-NAs, from which I could find the longest such sequence. This seems very inefficient, so I'm wondering if there is a better way!
If I understand this phrase correctly:
[I] am looking to extract the data from all columns associated with the longest sequence of non-NA rows for one particular column
You have a column of interest, call it WANT, and are looking to isolate all columns from the single row with the highest running count of consecutive non-NA values in WANT.
Example data
df <- data.frame(A = LETTERS[1:10],
B = LETTERS[1:10],
C = LETTERS[1:10],
WANT = LETTERS[1:10],
E = LETTERS[1:10])
set.seed(123)
df[sample(1:nrow(df), 2), 4] <- NA
# A B C WANT E
#1 A A A A A
#2 B B B B B
#3 C C C <NA> C
#4 D D D D D
#5 E E E E E
#6 F F F F F
#7 G G G G G
#8 H H H H H
#9 I I I I I # want to isolate this row (#9) since it ends the longest non-NA run in WANT
#10 J J J <NA> J
Here you would want all the I values, since row 9 ends the longest run of non-NA values in WANT.
If my interpretation of your question is correct, we can extend the excellent answer found here to your situation. This creates a data frame with a running tally of consecutive non-NA values for each column.
The benefit of this approach is that it counts consecutive non-NA runs across all columns (of any type, i.e. character or numeric); you can then index on whichever column you want using which.max().
# from #jay.sf at https://stackoverflow.com/questions/61841400/count-consecutive-non-na-items
res <- as.data.frame(lapply(lapply(df, is.na), function(x) {
  r <- rle(x)                                 # runs of NA / non-NA
  s <- sapply(r$lengths, seq_len)             # running counter within each run
  s[r$values] <- lapply(s[r$values], `*`, 0)  # zero out the NA runs
  unlist(s)
}))
# index using which.max()
want_data <- df[which.max(res$WANT), ]
#> want_data
# A B C WANT E
#9 I I I I I
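If you instead want every row of the longest run, not just the row that ends it, a hedged sketch applying rle() directly to the NA indicator of the WANT column (the NA positions are hard-coded here to match the example above, instead of relying on sampling):

```r
df <- data.frame(A = LETTERS[1:10], B = LETTERS[1:10], C = LETTERS[1:10],
                 WANT = LETTERS[1:10], E = LETTERS[1:10])
df$WANT[c(3, 10)] <- NA                 # same NA positions as the example

r   <- rle(!is.na(df$WANT))             # runs of non-NA (TRUE) and NA (FALSE)
i   <- which.max(r$lengths * r$values)  # index of the longest TRUE run
end <- cumsum(r$lengths)[i]             # last row of that run
df[(end - r$lengths[i] + 1):end, ]      # rows 4 through 9 here
```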
If this isn't correct, please edit your question for clarity.

How to transpose a long data frame every n rows

I have a data frame like this:
x=data.frame(type = c('a','b','c','a','b','a','b','c'),
value=c(5,2,3,2,10,6,7,8))
Every item has attributes a, b and c, but some items may have missing records, i.e. only a and b.
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
We can use dcast
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
Create a grouping vector g assuming each occurrence of a starts a new group, use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternate definition of the grouping vector, slightly more complex but needed if some groups are missing type a, is the following. It assumes that x lists the type values in order of their levels within each group, so that a level lower than the prior one must mark the start of a new group.
g <- cumsum(c(-1, diff(as.numeric(x$type))) < 0)
Note that ultimately there must be some restriction on missingness; otherwise, the problem is ambiguous. For example, if one group can have b and c missing and the next group can have a missing, then it is not determinable whether b and c in the second group actually form a second group or are part of the first.
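A quick check of the level-based grouping on data where one group is missing type a (note that as.numeric() on the type column only works if it is a factor; under R >= 4.0, data.frame() no longer converts strings to factors by default, so the factor() call below is needed):

```r
# Group 2 here has no 'a' row, so cumsum(x2$type == "a") would merge it
# into group 1; the level-based grouping still splits correctly.
x2 <- data.frame(type  = factor(c("a", "b", "c", "b", "c")),
                 value = c(5, 2, 3, 10, 8))
g2 <- cumsum(c(-1, diff(as.numeric(x2$type))) < 0)
g2
# 1 1 1 2 2
as.data.frame(with(x2, tapply(value, list(g2, type), c)))
#    a  b c
# 1  5  2 3
# 2 NA 10 8
```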

Want to delete column that has lowest quantity of data after breaking up a categorical value

I am writing a function that expands a categorical variable into n-1 indicator columns, but I want the dropped category to be the one with the lowest count in the data. How can I do that?
BreakCad = function(dataf, catog) # takes the parent data frame and the categorical variable
{
result <- model.matrix(~ factor(catog)-1) # breaks a categorical variable into n indicator columns with value 0 or 1
result <- result[,-1] # this removes the first column; I want to remove the column with the lowest count instead
result = as.data.frame(result)
result = cbind(dataf, result)
return(result)
}
y= data.frame(Decision=sample(c("yes","no","cant decide"),40,replace=TRUE ),point1=sample(1:10, 40, replace=TRUE))
fd=BreakCad(y,y$Decision)
min(table(y$Decision))
You may use the relevel() function to put the selected level of catog first. A toy example:
(catog<-factor(rep(letters[1:4], 4:1)))
[1] a a a a b b b c c d
Levels: a b c d
Find level with lowest count:
(wmin<-which.min(table(catog)))
d
4
Set it as first level:
(catog<-relevel(catog, wmin))
[1] a a a a b b b c c d
Levels: d a b c
Then your code should work.
result<-model.matrix(~factor(catog)-1)
result<-result[,-1]
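Folding that into the original function gives the sketch below (same BreakCad name as the question; drop = FALSE guards the two-level case, where result[, -1] would otherwise drop to a vector):

```r
BreakCad <- function(dataf, catog) {
  catog <- factor(catog)
  # put the least-frequent level first so that dropping column 1
  # drops the category with the lowest count
  catog  <- relevel(catog, names(which.min(table(catog))))
  result <- model.matrix(~ catog - 1)          # one 0/1 column per level
  result <- as.data.frame(result[, -1, drop = FALSE])
  cbind(dataf, result)
}

y  <- data.frame(Decision = sample(c("yes", "no", "cant decide"), 40,
                                   replace = TRUE),
                 point1   = sample(1:10, 40, replace = TRUE))
fd <- BreakCad(y, y$Decision)
```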

Counting repetition in r

I want to count the number of specific repetitions in my dataframe. Here is a reproducible example
df <- data.frame(Date= c('5/5', '5/5', '5/5', '5/6', '5/7'),
First = c('a','b','c','a','c'),
Second = c('A','B','C','D','A'),
Third = c('q','w','e','w','q'),
Fourth = c('z','x','c','v','z'))
This gives:
Date First Second Third Fourth
1 5/5 a A q z
2 5/5 b B w x
3 5/5 c C e c
4 5/6 a D w v
5 5/7 c A q z
I read a big file that holds 400,000 instances and I want to know different statistics about specific attributes. For example, I'd like to know how many times a happens on 5/5. I tried using sum(df$Date == '5/5' & df$First == 'a', na.rm=TRUE), which gave me the right result here, but when I run it on the big data set the numbers are not accurate.
Any idea why?
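One way to cross-check a literal comparison like the one above is to tabulate every Date/First combination in one pass; a minimal sketch (the trimws() calls are an assumption on my part, guarding against stray whitespace, which is a common cause of undercounts when == is used on data read from large files):

```r
df <- data.frame(Date  = c("5/5", "5/5", "5/5", "5/6", "5/7"),
                 First = c("a", "b", "c", "a", "c"))
# tabulate every Date/First combination at once, normalizing
# whitespace before comparing
with(df, table(trimws(Date), trimws(First)))
#       a b c
#   5/5 1 1 1
#   5/6 1 0 0
#   5/7 0 0 1
```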

Replacing value of one df column only in specific rows

I have a vector index that corresponds to the rows of a df I want to modify in one specific column:
index <- c(1,3,6)
dm
  one two three four
1   x   y     z    k
2   a   b     c    r
3   s   e     t    f
4   e   d     f    a
5   a   e     d    r
6   q   t     j    i
Now I want to modify column "three" only for rows 1, 3 and 6 replacing whatever value in it with "A".
Should I use apply?
There is no need for apply. You could simply use the following:
dm$three[index] <- "A"
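A self-contained check of that one-liner, using a small stand-in for dm built with the same values as the question:

```r
dm <- data.frame(one   = c("x", "a", "s", "e", "a", "q"),
                 two   = c("y", "b", "e", "d", "e", "t"),
                 three = c("z", "c", "t", "f", "d", "j"),
                 four  = c("k", "r", "f", "a", "r", "i"))
index <- c(1, 3, 6)
dm$three[index] <- "A"   # overwrite column "three" only at those rows
dm$three
# "A" "c" "A" "f" "d" "A"
```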
