Counting number of unique rows that have repeated records in one column - r

This is what my dataframe looks like:
a <- c(1,1,4,4,5)
b <- c(1,2,3,3,5)
c <- c(1,4,4,4,5)
d <- c(2,2,4,4,5)
e <- c(1,5,3,3,5)
df <- data.frame(a,b,c,d,e)
I'd like to write something that returns all unique instances of vectors a,b,c,d that have a repeated value in vector e.
For example:
a b c d e
1 1 1 1 2 1
2 1 2 4 2 5
3 4 3 4 4 3
4 4 3 4 4 3
5 5 5 5 5 5
Rows 3 and 4 are exactly the same till vector d (having a combination of 4344) so only one instance of those should be returned, but they have 2 repeated values in vector e. I would want to get a count on those - so the combination of 4344 has 2 repeated values in vector e.
The expected output would be how many times a certain combination such as 4344 had repeated values in vector e. So in this case it would be something like:
a b c d e
4 3 4 4 2
Both R and SQL work, whatever does the job.

Again, see my comments above, but I believe the following gives you a start on your first question. First, create a "key" variable (in this case named key_abcd which uses tidyr::unite to unite columns a, b, c, and d). Then, count up e by this key_abcd variable. The group_by is implicit.
library(tidyr)
library(dplyr)
df <- data.frame(a,b,c,d,e)
df %>%
unite(key_abcd, a, b, c, d) %>%
count(key_abcd, e)
# key_abcd e n
# (chr) (dbl) (int)
# 1 1_1_1_2 1 1
# 2 1_2_4_2 5 1
# 3 4_3_4_4 3 2
# 4 5_5_5_5 5 1
It appears from how you've worded the question, you are only interested in "more than one" combinations, therefore, you could add %>% filter(n > 1) to the above code.
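As a minimal sketch of that last step, the version below counts directly on the original columns instead of a united key, so a, b, c and d stay as separate columns in the result. It assumes the example data from the question and a dplyr version that provides count(); the name `repeated` is only illustrative.

```r
library(dplyr)

# Example data from the question
a <- c(1, 1, 4, 4, 5)
b <- c(1, 2, 3, 3, 5)
c <- c(1, 4, 4, 4, 5)
d <- c(2, 2, 4, 4, 5)
e <- c(1, 5, 3, 3, 5)
df <- data.frame(a, b, c, d, e)

# Count identical (a, b, c, d, e) rows, then keep only combinations seen more than once
repeated <- df %>%
  count(a, b, c, d, e) %>%
  filter(n > 1)

repeated
```

For this data only the 4, 3, 4, 4 combination remains, with n equal to 2.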

Related

count characters based on the order they appear

How does one count the characters based on the order they appear in a single string? Below is a minimal example:
x <- "abbccdddaab"
First thought was this but it only counts them irrespective of order:
table(unlist(strsplit(x, "\\b")))
a b c d
3 3 2 3
But the desired output is:
a b c d a b
1 2 2 3 2 1
I would imagine the solution would require a for loop?
We can use rle instead of table. rle checks whether adjacent elements are the same and returns a list with the run values and their lengths:
out <- rle(strsplit(x, "\\b")[[1]])
setNames(out$lengths, out$values)
# a b c d a b
# 1 2 2 3 2 1
Using data.table::rleid :
x <- "abbccdddaab"
tmp <- strsplit(x, "\\b")[[1]]
table(data.table::rleid(tmp))
#1 2 3 4 5 6
#1 2 2 3 2 1
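The rleid table above loses the characters themselves, since it is indexed by run number. A small base R sketch that keeps both the character and its run length in one data frame might look like this. It splits with an empty pattern instead of "\\b" (an assumption made here for clarity, not something the answers above use), and the names `chars`, `runs` and `run_counts` are illustrative.

```r
x <- "abbccdddaab"

# Split the string into single characters
chars <- strsplit(x, "")[[1]]

# Run-length encode: one entry per run of identical adjacent characters
runs <- rle(chars)

# One row per run, in order of appearance
run_counts <- data.frame(char = runs$values, n = runs$lengths)
run_counts
```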

How to skip not completely empty rows in R

So, I'm trying to read an Excel file. Some of the rows are empty for some of the columns but not for all of them. I want to skip all the rows that are not complete, i.e., that don't have information in all of the columns. For example:
In this case I would like to skip the lines 1,5,6,7,8 and so on.
There is probably a more elegant way of doing it, but a possible solution is to count the number of elements per row that are not NA and keep only the rows where that number equals the number of columns.
Using this dummy example:
df <- data.frame(A = LETTERS[1:6],
B = c(sample(1:10,5),NA),
C = letters[1:6])
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
6 F NA f
Using apply, you can count the number of non-NA elements in each row:
v <- apply(df,1, function(x) length(na.omit(x)))
[1] 3 3 3 3 3 2
Then keep only the rows where the number of elements equals the number of columns (which corresponds to complete rows):
df1 <- df[v == ncol(df),]
A B C
1 A 5 a
2 B 9 b
3 C 1 c
4 D 3 d
5 E 4 e
Does this answer your question?
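For comparison, base R's complete.cases() returns a logical vector marking the rows with no NA, which may be the more direct way to get the same subset. The sketch below assumes empty Excel cells have already been read in as NA, and uses fixed values for column B rather than sample() so the result is reproducible.

```r
# Dummy data with one incomplete row (row 6)
df <- data.frame(A = LETTERS[1:6],
                 B = c(5, 9, 1, 3, 4, NA),
                 C = letters[1:6])

# TRUE for rows that have no missing values in any column
keep <- complete.cases(df)

# Keep only the complete rows
df1 <- df[keep, ]
df1
```

na.omit(df) drops the same rows in one call.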

Count of unique values across all columns in a data frame

We have a data frame as below:
raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
I need a result data frame in the following format:
result<-data.frame(v1=c("A","B","C","D"), v2=c(3,2,2,3))
I used the following code to get the count across one particular column:
count_raw<-sqldf("SELECT DISTINCT(v1) AS V1, COUNT(v1) AS count FROM raw GROUP BY v1")
This would return count of unique values across an individual column.
Any help would be highly appreciated.
Use this
table(unlist(raw))
Output
A B C D
3 2 2 3
For data frame type output wrap this with as.data.frame.table
as.data.frame.table(table(unlist(raw)))
Output
Var1 Freq
1 A 3
2 B 2
3 C 2
4 D 3
If you want a total count,
sapply(unique(raw[!is.na(raw)]), function(i) length(which(raw == i)))
#A B C D
#3 2 2 3
We can use apply with MARGIN = 1
cbind(raw[1], v2=apply(raw, 1, function(x) length(unique(x[!is.na(x)]))))
If it is for each column
sapply(raw, function(x) length(unique(x[!is.na(x)])))
Or if we need the count based on all the columns, convert to matrix and use the table
table(as.matrix(raw))
# A B C D
# 3 2 2 3
If your data frame contains only character values, as in your example, you can unlist it and use unique to get the distinct values, or count (from plyr) to get their frequencies:
> library(plyr)
> raw<-data.frame(v1=c("A","B","C","D"),v2=c(NA,"B","C","A"),v3=c(NA,"A",NA,"D"),v4=c(NA,"D",NA,NA))
> unique(unlist(raw))
[1] A B C D <NA>
Levels: A B C D
> count(unlist(raw))
x freq
1 A 3
2 B 2
3 C 2
4 D 3
5 <NA> 6

Counting the times a value in a vector is different per combination of 4 other vectors

This is what my dataframe looks like:
a <- c(1,1,4,4,5)
b <- c(1,2,3,3,5)
c <- c(1,4,4,4,5)
d <- c(2,2,4,4,5)
e <- c(1,5,3,2,5)
df <- data.frame(a,b,c,d,e)
I'd like to write something that returns all unique instances of vectors a,b,c,d that have a different value in vector e.
For example:
a b c d e
1 1 1 1 2 1
2 1 2 4 2 5
3 4 3 4 4 3
4 4 3 4 4 2
5 5 5 5 5 5
Rows 3 and 4 are exactly the same till vector d (having a combination of 4344) so only one instance of those should be returned, but they have 2 different values in vector e. I would want to get a count on those - so the combination of 4344 has 2 different values in vector e.
The expected output would tell me how many times a certain combination such as 4344 had different values in vector e. So in this case it would be something like:
a b c d e
4 3 4 4 2
So far I have something like this:
library(tidyr)
library(dplyr)
df %>%
unite(key_abcd, a, b, c, d) %>%
count(key_abcd, e)
But this will count the times e has been repeated per combination of a,b,c,d. I would like to instead count the times e is different per combination of a,b,c,d.
NOTE: There are both repeated combinations of values in vectors a,b,c,d and repeated values in vector e. I would like to return only the count of unique values in e for unique combinations of a,b,c,d.
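You could try adding a little dplyr on, using n_distinct() so that repeated values of e within a combination are counted only once:
library(tidyr)
library(dplyr)
df %>%
  unite(key_abcd, a, b, c, d) %>%
  group_by(key_abcd) %>%
  # number of different e values per combination of a, b, c, d
  summarise(e = n_distinct(e)) %>%
  filter(e > 1)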
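If the result should keep a, b, c and d as separate columns, as in the expected output of the question, a base R sketch with aggregate() is one option. It counts the distinct values of e per combination, so repeated e values within a combination are counted once; the name `res` is illustrative.

```r
# Example data from the question
a <- c(1, 1, 4, 4, 5)
b <- c(1, 2, 3, 3, 5)
c <- c(1, 4, 4, 4, 5)
d <- c(2, 2, 4, 4, 5)
e <- c(1, 5, 3, 2, 5)
df <- data.frame(a, b, c, d, e)

# Number of distinct e values for each combination of a, b, c, d
res <- aggregate(e ~ a + b + c + d, data = df,
                 FUN = function(x) length(unique(x)))

# Keep only combinations with more than one distinct e value
res <- res[res$e > 1, ]
res
```

For this data the only remaining combination is 4, 3, 4, 4, with e equal to 2.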

Generating random number by length of blocks of data in R data frame

I am trying to simulate the measuring order n times and see how measuring order affects my study subject. To do this I want to generate random integers in a new column of a data frame. I have a big data frame and I would like to add a column that contains random integers based on the number of observations in each block.
Example of data(each row is an observation):
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5))
A B C
1 1 x 1
2 1 b 2
3 1 c 4
4 2 g 1
5 2 h 5
6 3 g 7
7 3 g 1
8 3 u 2
9 3 l 5
What I'd like to do is add a D column and generate random integer numbers according to the length of each block. Blocks are defined in column A.
Result should look something like this:
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5),
D=c(2,1,3,2,1,4,3,1,2))
> df
A B C D
1 1 x 1 2
2 1 b 2 1
3 1 c 4 3
4 2 g 1 2
5 2 h 5 1
6 3 g 7 4
7 3 g 1 3
8 3 u 2 1
9 3 l 5 2
I have tried to use R's sample() function to generate random numbers, but my problem is splitting the data according to block length and adding the new column. Any help is greatly appreciated.
It can be done easily with ave
df$D <- ave( df$A, df$A, FUN = function(x) sample(length(x)) )
(you could replace length() with max(), or whatever, but length will work even if A is not numbers matching the length of their blocks)
This is really easy with ddply from plyr.
library(plyr)
ddply(df, .(A), transform, D = sample(length(A)))
The longer manual version is:
Use split to split the data frame by the first column.
split_df <- split(df, df$A)
Then call sample on each member of the list.
split_df <- lapply(split_df, function(df) {
  # random permutation of 1..(number of rows in this block)
  df$D <- sample(nrow(df))
  df
})
Then recombine with
df <- do.call(rbind, split_df)
One simple way:
df$D <- 0
counts <- table(df$A)
for (i in seq_along(counts)) {
  # fill each block with a random permutation of 1..(block size)
  df$D[df$A == names(counts)[i]] <- sample(counts[i])
}
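The same per-block permutation can also be sketched with dplyr, which none of the answers above use, so treat it as an alternative rather than the suggested method. Inside mutate(), n() is the size of the current group, so sample(n()) yields a random ordering of 1 to the block length. The seed and the name `out` are only there to make the example reproducible.

```r
library(dplyr)

df <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 B = c("x", "b", "c", "g", "h", "g", "g", "u", "l"),
                 C = c(1, 2, 4, 1, 5, 7, 1, 2, 5))

set.seed(42)  # arbitrary seed, only for reproducibility

out <- df %>%
  group_by(A) %>%
  mutate(D = sample(n())) %>%  # random permutation of 1..block size
  ungroup()

out
```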
