Let's say I have:
v = rep(c(1, 2, 2, 2), 25)
Now, I want to count the number of times each unique value appears. unique(v) returns the unique values themselves, but not how many times each one appears.
> unique(v)
[1] 1 2
I want something that gives me
length(v[v==1])
[1] 25
length(v[v==2])
[1] 75
but as a more general one-liner :) Something close (but not quite) like this:
#<doesn't work right> length(v[v==unique(v)])
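For reference, the attempt above fails because v == unique(v) recycles the two unique values across v, so each element is only compared with one of them in turn rather than with all of them. A general one-liner in the same spirit (a sketch, not necessarily the idiomatic route) would be:
sapply(unique(v), function(u) sum(v == u))
# [1] 25 75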
Perhaps table is what you are after?
dummyData = rep(c(1, 2, 2, 2), 25)
table(dummyData)
# dummyData
# 1 2
# 25 75
## or another presentation of the same data
as.data.frame(table(dummyData))
# dummyData Freq
# 1 1 25
# 2 2 75
If you have multiple factors (i.e., several grouping columns in a data frame), you can use the dplyr package to count unique values in each combination of factors:
library("dplyr")
data %>% group_by(factor1, factor2) %>% summarize(count = n())
It uses the pipe operator %>% to chain function calls on the data frame data.
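Since data, factor1, and factor2 are placeholders in the snippet above, here is a minimal self-contained sketch with made-up columns:
library(dplyr)
data <- data.frame(factor1 = c("a", "a", "b", "b"),
                   factor2 = c("x", "y", "x", "x"))
data %>% group_by(factor1, factor2) %>% summarize(count = n())
# factor1 factor2 count
# a       x           1
# a       y           1
# b       x           2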
Here is a one-line approach using aggregate:
> aggregate(data.frame(count = v), list(value = v), length)
value count
1 1 25
2 2 75
length(unique(df$col)) is the simplest way I can see.
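Note that this gives the number of distinct values, not how often each occurs; with the vector from the question it returns a single count:
length(unique(v))
# [1] 2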
The table() function is a good way to go, as Chase suggested.
If you are analyzing a large dataset, an alternative is to use .N from the data.table package.
Make sure you have installed the data.table package with
install.packages("data.table")
Code:
# Import the data.table package
library(data.table)
# Generate a data table object, which draws a number 10^7 times
# from 1 to 10 with replacement
DT <- data.table(x = sample(1:10, 1e7, TRUE))
# Count the frequency of each level
DT[, .N, by = x]
To get an un-dimensioned integer vector that contains the counts of unique values, use c().
dummyData = rep(c(1, 2, 2, 2), 25) # Chase's reproducible data
c(table(dummyData)) # get un-dimensioned integer vector
1 2
25 75
str(c(table(dummyData))) # confirm structure
Named int [1:2] 25 75
- attr(*, "names")= chr [1:2] "1" "2"
This may be useful if you need to feed the counts of unique values into another function, and it is shorter and more idiomatic than the t(as.data.frame(table(dummyData))[,2]) posted in a comment to Chase's answer. Thanks to Ricardo Saporta, who pointed this out to me here.
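As a quick illustration of feeding the counts into another function (weighted.mean here is just an arbitrary consumer, not part of the original answer):
counts <- c(table(dummyData))
weighted.mean(as.numeric(names(counts)), counts)
# [1] 1.75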
This works for me. Take your vector v:
length(summary(as.factor(v), maxsum = 50000))
Note: set maxsum large enough to capture the number of unique values.
Or, with the magrittr package:
v %>% as.factor %>% summary(maxsum = 50000) %>% length
Making the values categorical and calling summary() also works.
> v = rep(as.factor(c(1,2, 2, 2)), 25)
> summary(v)
1 2
25 75
You can also try the tidyverse:
library(tidyverse)
dummyData %>%
  as_tibble() %>%  # as.tibble() is deprecated in current tibble versions
  count(value)
# A tibble: 2 x 2
value n
<dbl> <int>
1 1 25
2 2 75
If you need the count of each value as an additional column in the data frame containing your values (a column which may represent sample size, for example), plyr provides a neat way:
data_frame <- data.frame(v = rep(c(1, 2, 2, 2), 25))
library("plyr")
data_frame <- ddply(data_frame, .(v), transform, n = length(v))
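For reference, transform keeps every row, so n is repeated within each group; the first rows would look like this:
head(data_frame, 3)
#   v  n
# 1 1 25
# 2 1 25
# 3 1 25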
You can also try dplyr::count
df <- tibble(x=c('a','b','b','c','c','d'), y=1:6)
dplyr::count(df, x, sort = TRUE)
# A tibble: 4 x 2
x n
<chr> <int>
1 b 2
2 c 2
3 a 1
4 d 1
If you want to run unique on a data.frame (e.g., train.data), and also get the counts (which can be used as the weight in classifiers), you can do the following:
unique.count = function(train.data, all.numeric=FALSE) {
# first convert each row in the data.frame to a string
train.data.str = apply(train.data, 1, function(x) paste(x, collapse=','))
# use table to index and count the strings
train.data.str.t = table(train.data.str)
# get the unique data string from the row.names
train.data.str.uniq = row.names(train.data.str.t)
weight = as.numeric(train.data.str.t)
# convert the unique data string to data.frame
if (all.numeric) {
train.data.uniq = as.data.frame(t(apply(cbind(train.data.str.uniq), 1,
function(x) as.numeric(unlist(strsplit(x, split=","))))))
} else {
train.data.uniq = as.data.frame(t(apply(cbind(train.data.str.uniq), 1,
function(x) unlist(strsplit(x, split=",")))))
}
names(train.data.uniq) = names(train.data)
list(data=train.data.uniq, weight=weight)
}
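A hypothetical usage sketch (train.data here is a small made-up frame, not from the original answer):
train.data <- data.frame(x = c(1, 1, 2), y = c(3, 3, 4))
res <- unique.count(train.data, all.numeric = TRUE)
res$data    # the two unique rows: (1, 3) and (2, 4)
res$weight  # [1] 2 1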
I know there are many other answers, but here is another way, using the sort and rle functions. rle stands for Run Length Encoding; it counts runs of equal values in a vector (see the R man page for rle), which is exactly what we need after sorting.
test.data = rep(c(1, 2, 2, 2), 25)
rle(sort(test.data))
## Run Length Encoding
## lengths: int [1:2] 25 75
## values : num [1:2] 1 2
If you capture the result, you can access the lengths and values as follows:
## rle returns a list with two items.
result.counts <- rle(sort(test.data))
result.counts$lengths
## [1] 25 75
result.counts$values
## [1] 1 2
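If you prefer the same named-vector shape that table() produces, the two components combine directly (a small convenience, not part of the original answer):
setNames(result.counts$lengths, result.counts$values)
##  1  2
## 25 75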
For completeness, a hand-rolled counter that walks the input once:
count_unique_words <- function(wlist) {
  ucountlist <- list()
  unamelist <- c()
  for (i in wlist) {
    if (is.element(i, unamelist)) {
      # seen before: increment its count
      ucountlist[[i]] <- ucountlist[[i]] + 1
    } else {
      # first occurrence: initialise the count
      ucountlist[[i]] <- 1
      unamelist <- c(unamelist, i)
    }
  }
  ucountlist
}
expt_counts <- count_unique_words(population)
for(i in names(expt_counts))
cat(i, expt_counts[[i]], "\n")
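population is the answerer's own data; with a made-up character vector the function behaves like this:
words <- c("apple", "pear", "apple")
counts <- count_unique_words(words)
counts$apple  # 2
counts$pear   # 1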
Example data to copy
df <- data.frame(
AA = c(100, 200, 300, 400),
X1 = c(2, 1, 3, 1),
X2 = c(1, 3, 4, 1)
)
Based on the index of AA and its values, I would like to calculate, for every row, the sum of indicators of the condition df$AA[i] > df[df$X1[i], c('AA')] (here for X1) over a fluctuating number of variables.
My probably naive approach is to use a for loop, which works perfectly for a fixed number of variables (columns), in the given example X1, X2. My problem is that I do not know the number of variables beforehand. Theoretically, any number 1, 2, 3, ... is possible.
for (i in 1:nrow(df)) {
df$index[i] <- sum(df$AA[i] > df[df$X1[i], c('AA')],
df$AA[i] > df[df$X2[i], c('AA')])
}
Which gives the desired output for a fixed number of variables X1, X2:
df
#> AA X1 X2 index
#> 1 100 2 1 0
#> 2 200 1 3 1
#> 3 300 3 4 0
#> 4 400 1 1 2
Is there a smooth base R approach which translates my approach to a flexible number of variables X1, ..., Xn?
Note, the reason why I am interested in a base R approach is my aim to extend an existing package, which is fully written in base R. So I would like to keep it like that.
Loops or *apply-family approaches are both very welcome.
I am aware that operations on data frames are often considered slower. Since all variables AA, X1, ... are of the same length, a solution which does not rely on a data frame structure would also be great!
Created on 2022-04-06 by the reprex package (v2.0.1)
You don't need to loop through rows. You can use Reduce.
Reduce(`+`, lapply(df[-1], function(x) df$AA > df$AA[x]))
#> [1] 0 1 0 2
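Spelled out: lapply builds one logical indicator vector per Xi column, and Reduce adds them element-wise, so the whole thing can be stored back in one line (assuming df does not yet contain the index column):
df$index <- Reduce(`+`, lapply(df[-1], function(x) df$AA > df$AA[x]))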
Does this correspond to what you're looking for?
df$index <- apply(df, 1, function(x){sum(x[1] > df$AA[x[-1]])})
assuming that AA is column 1 and the Xi are all the other columns.
The following one-liner works precisely because df is a data frame:
df$index <- rowSums( # To sum over a non-specified number of columns
mapply(
df[,- which(names(df) == "AA")], # Everything except AA
df[,"AA", drop = FALSE], # Only AA, but in a data-frame
FUN = function(index, aa) aa[index] < aa)) # Compare
I am working on a large dataset in R and need to calculate a single value from it. I believe cumsum and cumprod could work, but I don't know how.
county_id <- c(1,1,1,1,2,2,2,3,3)
res <- c(2,3,2,4,2,4,3,3,2)
I need a function that gives me a single value per county_id, and then the total over all counties.
For example, for county_id = 1 the total for res is calculated manually as
2*(3+2+4) + 3*(2+4) + 2*4 = 44
For county_id = 2:
2*(4+3) + 4*3 = 26
For county_id = 3:
3*2 = 6
Then it sums all of these into a single value:
44 + 26 + 6 = 76
NB: my county_id values run from 1 to 47, and each county_id can have up to 200 res values.
Thank you
You can use aggregate with cumsum like this:
x <- aggregate(res, list(county_id)
, function(x) sum(rev(cumsum(rev(x[-1])))*x[-length(x)]))
#Group.1 x
#1 1 44
#2 2 26
#3 3 6
sum(x[,2])
#[1] 76
You can sum the products of the pairwise combinations:
library(dplyr)
dat %>%
group_by(county_id) %>%
summarise(x = sum(combn(res, 2, FUN = prod)))
# A tibble: 3 x 2
county_id x
<dbl> <dbl>
1 1 44
2 2 26
3 3 6
Base R:
aggregate(res ~ county_id, dat, FUN = function(x) sum(combn(x, 2, FUN = prod)))
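Since the sum over all pairwise products has the closed form (sum(x)^2 - sum(x^2)) / 2, the same result is available without combn (an algebraic equivalent, not part of the original answer):
aggregate(res ~ county_id, dat, FUN = function(x) (sum(x)^2 - sum(x^2)) / 2)
#   county_id res
# 1         1  44
# 2         2  26
# 3         3   6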
Here is one way to do this using tidyverse functions.
For each county_id, we multiply the current res value by the sum of the res values after it.
library(dplyr)
library(purrr)
df1 <- df %>%
group_by(county_id) %>%
summarise(result = sum(map_dbl(row_number(),
~res[.x] * sum(res[(.x + 1):n()])), na.rm = TRUE))
df1
# county_id result
# <dbl> <dbl>
#1 1 44
#2 2 26
#3 3 6
To get the total sum you can then do:
sum(df1$result)
#[1] 76
data
county_id <- c(1,1,1,1,2,2,2,3,3)
res <- c(2,3,2,4,2,4,3,3,2)
df <- data.frame(county_id, res)
Another option is to use SPSS syntax
// You need to count the number of variables with valid responses
count x1=var1 to var4(1 thr hi).
execute.
// 1st, declare a variable that will hold your cumulative sum.
// Declare your variables in terms of a vector.
// Then loop twice: the 1st loop runs from the 1st variable to the number of
// variables with data (x1); the 2nd loop runs from the 1st variable to the
// variable at (1st-loop index - 1), for all variables with data.
// Lastly, accumulate the sum according to the formula.
// This syntax can be replicated in other software.
compute index1=0.
vector x=var1 to var4.
loop #i=1 to x1.
loop #j=1 to #i-1 if not missing(x(#i)).
compute index1=index1+(x(#j)*sum(x(#i))).
end loop.
end loop.
execute.
I want to remove duplicate rows from a data frame, for specific columns only. That can be obtained with distinct:
data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5,4,5,5))
filtered_data <- data %>% distinct(a, b, .keep_all = T)
dim(filtered_data)
# [1] 3 3
This is (almost) what I need. Yet my problem is that the column names I need to use with distinct will change. So I have a character vector gen that contains the names of the columns I want to use with the distinct function. They need to get unquoted to be useful in the pipe. I found suggestions to use as.name() or eval(parse()). This, however, gives me a different result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(eval(parse(text = gen)), .keep_all = T)
dim(filtered_data)
# [1] 2 4
The eval seems to do something funny with the number of times the data is filtered (and it adds an extra column; I could live with that, though). So, how can I obtain the same result as if I had used a, b, but using a variable instead?
Additional information
I actually obtain gen by reading the column names of a data frame: gen <- colnames(data)[1:2]. The solution suggested by @gymbrane would be perfect if I had a way to transform gen into c(a, b). The whole point is to avoid hardcoding the column names. I tried things like gen <- noquote(gen), which does not give an error in the rm_dup_rows function suggested below, but it does give a different result, with the same sort of repeated filtering as I started with...
Fixed
I think I got it working. It might be inelegant, and I'm not sure every step is necessary for the result, but it seems to work by combining the function provided by @gymbrane below with ensym and quos in a for loop, adding the results to a list (edit: the global environment isn't necessary):
unquote_string <- function(string) {
out <- list()
i <- 1
for (s in string) {
t <- ensym(s)
out[i] <-dplyr::quos(!!t)
i <- i+1
}
return(out)
}
gen_quo <- unquote_string(gen)
filtered_data <- rm_dup_rows(data, gen_quo)
dim(filtered_data)
# [1] 3 3
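For the record, rlang's syms() collapses the helper above into a single call; a sketch assuming the same gen vector of column-name strings:
library(rlang)
gen_quo <- syms(gen)  # c("a", "b") becomes a list of symbols
filtered_data <- data %>% distinct(!!!gen_quo, .keep_all = TRUE)
dim(filtered_data)
# [1] 3 3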
How about creating a function and using quosures? Perhaps something like this is what you are looking for...
rm_dup_rows <- function(data, ...){
vars = dplyr::quos(...)
data %>% distinct(!!! vars, .keep_all = T)
}
I believe this returns what you are asking for
rm_dup_rows(data = data, a, b)
# A tibble: 3 x 3
      a     b     z
  <dbl> <dbl> <dbl>
1     1     3     5
2     2     3     5
3     2     4     5
rm_dup_rows(data, b, z)
# A tibble: 3 x 3
      a     b     z
  <dbl> <dbl> <dbl>
1     1     3     5
2     1     3     4
3     2     4     5
Additional
You could modify rm_dup_rows just slightly and construct your vector with quos. Something like this...
rm_dup_rows <- function(data, vars){
data %>% distinct(!!! vars, .keep_all = T)
}
# quos your column name vector
gen <- quos(a,z)
rm_dup_rows(data, gen)
# A tibble: 3 x 3
      a     b     z
  <dbl> <dbl> <dbl>
1     1     3     5
2     1     3     4
3     2     3     5
I have an integer column that I would like to split into multiple separate integer vectors.
Creating a list of data frames using split() doesn't work for my later purposes.
df <- as.data.frame(runif(n = 10000, min = 1, max = 10))
split() creates a list of data frames, which I can't use for my further purposes; I need each chunk as a separate integer vector ("Values"):
map.split <- split(df, (as.numeric(rownames(df)) - 1) %/% 250) # this is not the trick
My goal is to split the column into separate integer vectors (saved in the Global Environment under "Values", not "Data").
This would be the slow way:
VecList1 <- df[1:250,]
VecList2 <- df[251:500,]
with
str(VecList1)
Int [1:250] 1 1 10 5 3 ....
Any advice welcome
If I'm interpreting correctly (not clear to me), here's a reduced problem and what I think you're asking for.
set.seed(2)
df <- data.frame(x = runif(10, min = 1, max = 10))
df$Values <- (seq_len(nrow(df))-1) %/% 4
df
# x Values
# 1 2.663940 0
# 2 7.321366 0
# 3 6.159937 0
# 4 2.512467 0
# 5 9.494554 1
# 6 9.491275 1
# 7 2.162431 1
# 8 8.501039 1
# 9 5.212167 2
# 10 5.949854 2
If all you need is that Values column as its own object, then you can just change df$Values <- ... to Values <- ....
Here's one way of doing this (although it's probably better to figure out a way where you don't need a series of separate vectors, but rather work with columns in a single matrix):
df <- data.frame(a=runif(n = 10000, min = 1, max = 10))
mx <- matrix(df$a, nrow = 250)
for (i in 1:NCOL(mx)) {
  assign(paste0("VecList", i), mx[, i])
}
Note: using assign is generally not advisable. Whatever it is you're trying to achieve, there's probably a better way of doing it without creating a series of new vectors in the global environment.
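As a sketch of that better way, split() into a single named list keeps every chunk in one object instead of flooding the global environment:
VecList <- split(df$a, ceiling(seq_along(df$a) / 250))
length(VecList)      # 40 chunks of 250 values each
str(VecList[["1"]])  # num [1:250] ...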