Trying to count categorical variables in R using rowSums

I'm trying to get a count for each of the observation categories per row.
In the example data below, the top line (photo, 2, 3, 4, 5, 6) is the header row and the line beneath it contains the observations.
I would do it in Excel with COUNTIF, but the dataset is huge and this is only a tiny sample. Plus screw Excel :)
photo 2 3 4 5 6
30001004501 SINV_SPO_V SINV_HYD LSUB_SAND Unc SINV_SPO_V
I want to create a new column for each observation category I count. For example, to determine the frequency of "Unc", there would be an Unc column holding the number of times "Unc" appears in each row.
The code below is one of the things I've tried over the last couple of days, along with variations of the count and length commands, but with no success:
data$Unc <-rowSums(data[,3:52] == "Unc", na.rm = F)
I'm trying to get R to only count the columns between 3 and 52
Thanks in advance for any help. This is getting incredibly frustrating, as I know it should be really easy.
I hope this makes sense

If I understood your request correctly, here is a data.table solution. For your task you can use 3:52 in measure.vars. Note that this only works if photo is a unique id variable; if it isn't, create one yourself and use that instead.
library(data.table)
# create example data.table
dt <- data.table(photo = 1:6,
x1 = c("a", "b", "a", "c", "a", "d"),
x2 = c("c", "c", "a", "c", "a", "d"),
x3 = c("c", "c", "a", "c", "a", "d"))
# Melt data.table, select which columns you need
dt_melt <- melt(dt, id.vars = 'photo', measure.vars = 2:3, variable.name = 'column')
# Get a resulting data.table with pairs of photo and observation
result_dt <- dt_melt[, .N, by = c('photo', 'value')]
photo value N
1: 1 a 1
2: 2 b 1
3: 3 a 2
4: 4 c 2
5: 5 a 2
6: 6 d 2
7: 1 c 1
8: 2 c 1
# For wide representation
dcast(result_dt, photo ~ value, value.var = 'N', fill = 0)
photo a b c d
1: 1 1 0 1 0
2: 2 0 1 1 0
3: 3 2 0 0 0
4: 4 0 0 2 0
5: 5 2 0 0 0
6: 6 0 0 0 2

I think that a way to solve your problem is to use the table function:
col1 <- c('a','b','b','b','a','c','b','a','c')
col2 <- c('d','e','d','d','d','d','d','d','e')
data = data.frame(col1,col2)
table(col1)
table(col2)
tab = table(data)
tab
margin.table(tab,1)
margin.table(tab,2)
table(col1) will give you the frequencies for the categorical variables of col1, and this gives the same result as margin.table(tab,1). So it depends if you prefer to work on the data.frame or on the columns directly.
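For the per-row counts the question asks about, the rowSums() idea from the question can be applied once per category. The following is only a minimal base R sketch: the data frame df, its column names X2 to X6, and the second row of values are made up for illustration, and obs_cols would be 3:52 for the real data.

```r
# Hypothetical stand-in for the asker's data (second row is invented)
df <- data.frame(
  photo = c("30001004501", "30001004502"),
  X2 = c("SINV_SPO_V", "Unc"),
  X3 = c("SINV_HYD", "Unc"),
  X4 = c("LSUB_SAND", NA),
  X5 = c("Unc", "SINV_HYD"),
  X6 = c("SINV_SPO_V", "LSUB_SAND"),
  stringsAsFactors = FALSE
)

# Columns that hold observations (assumption: 2:6 here, 3:52 in the real data)
obs_cols <- 2:6
obs <- as.matrix(df[, obs_cols])

# Every category that occurs anywhere in the observation columns
categories <- sort(unique(as.vector(obs[!is.na(obs)])))

# One column of per-row counts for each category; NAs are ignored
counts <- sapply(categories, function(k) rowSums(obs == k, na.rm = TRUE))

result <- cbind(df["photo"], counts)
result
```

Each category becomes its own column, so result$Unc holds the number of times "Unc" appears in each row.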

Related

Count of number of elements between distinct elements in vector

Suppose I have a vector of values, such as:
A C A B A C C B B C C A A A B B B B C A
I would like to create a new vector that, for each element, contains the number of elements since that element was last seen. So, for the vector above,
NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
(where NA indicates that this is the first time the element has been seen).
For example, the first and second A are in positions 1 and 3 respectively, a difference of 2; the third and fourth A are in positions 5 and 12, a difference of 7, and so on.
Is there a pre-built pipe-compatible function that does this?
I hacked together this function to demonstrate:
# For reproducibility
set.seed(1)
# Example vector
x = sample(LETTERS[1:3], size = 20, replace = TRUE)
compute_lag_counts = function(x, first_time = NA){
# return vector to fill
lag_counts = rep(-1, length(x))
# values to match
vals = unique(x)
# find all positions of all elements in the target vector
match_list = grr::matches(vals, x, list = TRUE)
# compute the lags, then put them in the appropriate place in the return vector
for(i in seq_along(match_list))
lag_counts[x == vals[i]] = c(first_time, diff(sort(match_list[[i]])))
# return vector
return(lag_counts)
}
compute_lag_counts(x)
Although it seems to do what it is supposed to do, I'd rather use someone else's efficient, well-tested solution! My searching has turned up empty, which is surprising to me given that it seems like a common task.
One option in base R is ave():
ave(seq_along(x), x, FUN = function(i) c(NA, diff(i)))
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
We calculate the first difference of the indices for each group of x.
A data.table option, thanks to @Henrik:
library(data.table)
dt = data.table(x)
dt[ , d := .I - shift(.I), x]
dt
Here's a function that would work
compute_lag_counts <- function(x) {
seqs <- split(seq_along(x), x)
unsplit(Map(function(i) c(NA, diff(i)), seqs), x)
}
compute_lag_counts (x)
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
Basically you use split() to separate the indexes where values appear by each unique value in your vector. Then we use the difference between the indexes where they appear to calculate the distance to the previous occurrence. Then we use unsplit() to put those values back in the original order.
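To make the three steps concrete, here is a small sketch that runs them one at a time on a shorter vector. The vector x_small and the intermediate names are made up for illustration.

```r
# Shortened example vector (hypothetical, not the 20-element vector above)
x_small <- c("A", "C", "A", "B", "A", "C")

# Step 1: positions at which each distinct value occurs
positions <- split(seq_along(x_small), x_small)
# positions$A is 1 3 5, positions$B is 4, positions$C is 2 6

# Step 2: gap to the previous occurrence, NA for the first occurrence
gaps <- lapply(positions, function(i) c(NA, diff(i)))

# Step 3: put the gaps back in the order of the original vector
result <- unsplit(gaps, x_small)
result
# NA NA 2 NA 2 4
```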
An option with dplyr by taking the difference of adjacent sequence elements after grouping by the original vector
library(dplyr)
tibble(v1) %>%
mutate(ind = row_number()) %>%
group_by(v1) %>%
mutate(new = ind - lag(ind)) %>%
pull(new)
#[1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
data
v1 <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C", "C", "A",
"A", "A", "B", "B", "B", "B", "C", "A")

How do you turn categorical rows into columns?

I am attempting to convert the rows of org_type into columns. Is there a way to do this? I tried spread, but have not had success. Does anyone know how?
You can do it using the mltools and data.table R packages. First convert the data frame to a data.table, then use the one_hot() function from mltools. The example below uses dummy data.
# demo data frame
df <- data.frame(
org_name = c("A", "B", "C", "D", "E", "F"),
org_type = c("Tech", "Tech", "Business", "Business", "Bank", "Bank")
)
df
# load the libraries
library(data.table)
library(mltools)
# convert df to data.table
df_dt <- as.data.table(df)
# use one_hot specify the column name in cols
df_one_hot <- one_hot(df_dt, cols = "org_type")
df_one_hot
Output before:
org_name org_type
1 A Tech
2 B Tech
3 C Business
4 D Business
5 E Bank
6 F Bank
Output after:
org_name org_type_Bank org_type_Business org_type_Tech
1: A 0 0 1
2: B 0 0 1
3: C 0 1 0
4: D 0 1 0
5: E 1 0 0
6: F 1 0 0
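If installing mltools is not an option, a similar indicator layout can be sketched in base R with model.matrix(). This is an alternative to the one_hot() call above, not the same function; the sub() call only renames the columns so they resemble the output shown, and df_wide is a name chosen for illustration.

```r
# Same demo data as above
df <- data.frame(
  org_name = c("A", "B", "C", "D", "E", "F"),
  org_type = c("Tech", "Tech", "Business", "Business", "Bank", "Bank"),
  stringsAsFactors = FALSE
)

# "- 1" drops the intercept so every level gets its own 0/1 column
dummies <- model.matrix(~ org_type - 1, data = df)

# model.matrix names the columns org_typeBank etc.; insert an underscore
colnames(dummies) <- sub("^org_type", "org_type_", colnames(dummies))

df_wide <- cbind(df["org_name"], dummies)
df_wide
```

The result is a plain data frame with one indicator column per org_type level.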

R count and list unique rows for each column satisfying a condition

I have been going crazy with something basic...
I am trying to count and list in a comma separated column each unique ID coming up in a data frame, e.g.:
df<-data.frame(id = as.character(c("a", "a", "a", "b", "c", "d", "d", "e", "f")), x1=c(3,1,1,1,4,2,3,3,3),
x2=c(6,1,1,1,3,2,3,3,1),
x3=c(1,1,1,1,1,2,3,3,2))
> df
id x1 x2 x3
1 a 3 6 1
2 a 1 1 1
3 a 1 1 1
4 b 1 1 1
5 c 4 3 1
6 d 2 2 2
7 d 3 3 3
8 e 3 3 3
9 f 3 1 2
I am trying to get, for each column, a count and a list of the unique ids that satisfy a condition (value > 1):
res = data.frame(x1_counts =5, x1_names="a,c,d,e,f", x2_counts = 4, x2_names="a,c,d,e", x3_counts = 3, x3_names="d,e,f")
> res
x1_counts x1_names x2_counts x2_names x3_counts x3_names
1 5 a,c,d,e,f 4 a,c,d,e 3 d,e,f
I have tried with data.table but it seems very convoluted, i.e.
DT = as.data.table(df)
res <- DT[, list(x1= length(unique(id[which(x1>1)])), x2= length(unique(id[which(x2>1)]))), by=id)
But I can't get it right, I am going not getting what I need to do with data.table since it is not really a grouping I am looking for. Can you direct me in the right path please? Thanks so much!
You can reshape your data to long format and then do the summary:
library(data.table)
(melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = list(unique(id))), variable])
# You can replace list with toString if you want the names as a single string instead of a list
# variable counts names
#1: x1 5 a,c,d,e,f
#2: x2 4 a,c,d,e
#3: x3 3 d,e,f
To get what you need, reshape it back to wide format:
dcast(1~variable,
data = (melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = list(unique(id))), variable]),
value.var = c('counts', 'names'))
# . counts_x1 counts_x2 counts_x3 names_x1 names_x2 names_x3
# 1: . 5 4 3 a,c,d,e,f a,c,d,e d,e,f
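For comparison, the same summary can be sketched in base R without reshaping, by looping over the value columns. The names value_cols, summary_list and res_long are made up for illustration, and the ids are joined with paste() to get the comma-separated string the question asks for.

```r
# Same data as in the question
df <- data.frame(
  id = c("a", "a", "a", "b", "c", "d", "d", "e", "f"),
  x1 = c(3, 1, 1, 1, 4, 2, 3, 3, 3),
  x2 = c(6, 1, 1, 1, 3, 2, 3, 3, 1),
  x3 = c(1, 1, 1, 1, 1, 2, 3, 3, 2),
  stringsAsFactors = FALSE
)

value_cols <- c("x1", "x2", "x3")

# For each column: the unique ids whose value exceeds 1
summary_list <- lapply(value_cols, function(col) {
  ids <- unique(df$id[df[[col]] > 1])
  data.frame(
    variable = col,
    counts = length(ids),
    names = paste(ids, collapse = ","),
    stringsAsFactors = FALSE
  )
})

res_long <- do.call(rbind, summary_list)
res_long
```

This yields one row per column (the long layout from the first data.table snippet) rather than the single wide row.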

data.table : remove duplicate subset of rows for a given index value

I would like to improve my piece of code. Let's say you want to remove duplicate rows that have the same 'label' and 'id'. The way I do it is:
library(data.table)
dt <- data.table(label = c("A", "A", "B", "B", "C", "A", "A", "A"),
id = c(1, 1, 2, 2, 3, 4, 5, 5))
tmp = dt[label == 'A',]
tmp = unique(tmp, by = 'id')
dt = dt[label != 'A',]
dt = rbind(dt, tmp)
Is there a smarter/shorter way to accomplish that? If possible by reference?
This code looks very ugly and implies a lot of copies.
(Moreover I have to do this operation for a few labels, but not all of them. So this implies 4 lines for every label...)
Thanks !
Example:
label id
A 1
A 1
B 2
B 2
C 3
A 4
A 5
A 5
Would give :
label id
A 1
B 2
B 2
C 3
A 4
A 5
Note that line 3 and 4 stay duplicated since the label is equal to 'B' and not to 'A'.
There is no need to create tmp and then rbind it again. You can simply use the duplicated function as follows:
dt[label != "A" | !duplicated(dt, by=c("label", "id"))]
# label id
# 1: A 1
# 2: B 2
# 3: B 2
# 4: C 3
# 5: A 4
# 6: A 5
If you want to do this over several labels:
dt[!label %in% c("A", "C") | !duplicated(dt, by=c("label", "id"))]
See ?duplicated to learn more about de-duplication functions in data.table.
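The filter keeps a row when either its label is not one of the targeted labels, or it is the first occurrence of its (label, id) pair. A base R sketch of the same condition on a plain data frame shows the two logical vectors separately. The names df, is_dup, is_target and deduped are made up for illustration, and this version copies the data rather than working by reference.

```r
df <- data.frame(
  label = c("A", "A", "B", "B", "C", "A", "A", "A"),
  id = c(1, 1, 2, 2, 3, 4, 5, 5),
  stringsAsFactors = FALSE
)

# TRUE for every repeat of a (label, id) pair after its first occurrence
is_dup <- duplicated(df[c("label", "id")])

# TRUE for rows whose label should be de-duplicated
is_target <- df$label %in% c("A")

# Drop a row only when it is both targeted and a repeat
deduped <- df[!(is_target & is_dup), ]
deduped
```

By De Morgan's law, !(is_target & is_dup) is the same test as label != "A" | !duplicated(...) used above.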
This could be also done using an if/else condition
dt[, if(all(label=='A')) .SD[1L] else .SD, by = id]
# id label
#1: 1 A
#2: 2 B
#3: 2 B
#4: 3 C
#5: 4 A
#6: 5 A

Observation number by group [duplicate]

In R I have a data frame with observations described by several values, one of which is a factor. I have sorted the dataset by this factor and would like to add a column that numbers the observations within each level of the factor, e.g.
factor obsnum
a 1
a 2
a 3
b 1
b 2
b 3
b 4
c 1
c 2
...
In SAS I do it with something like:
data logs.full;
set logs.full;
count + 1;
by cookie;
if first.cookie then count = 1;
run;
How can I achieve that in R?
Thanks,
Use rle (run length encoding) and sequence:
x <- c("a", "a", "a", "b", "b", "b", "b", "c", "c")
data.frame(
x=x,
obsnum = sequence(rle(x)$lengths)
)
x obsnum
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
Here is the ddply() solution
dataset <- data.frame(x = c("a", "a", "a", "b", "b", "b", "b", "c", "c"))
library(plyr)
ddply(dataset, .(x), function(z){
data.frame(obsnum = seq_along(z$x))
})
One solution using base R, assuming your data is in a data.frame named dfr:
dfr$cnt<-do.call(c, lapply(unique(dfr$factor), function(curf){
seq(sum(dfr$factor==curf))
}))
There are likely better solutions (e.g. employing package plyr and its ddply), but it should work.
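Both the rle() approach and the do.call() approach rely on the data already being sorted by the grouping variable, as it is in the question. If that cannot be guaranteed, a base R sketch with ave() numbers the rows within each group regardless of order. The vector x below is made up and deliberately unsorted.

```r
# Hypothetical unsorted grouping variable
x <- c("b", "a", "b", "a", "a")

# Within each group of x, replace the row indexes by 1, 2, 3, ...
obsnum <- ave(seq_along(x), x, FUN = seq_along)

data.frame(x = x, obsnum = obsnum)
```

Here the two "b" rows are numbered 1 and 2 and the three "a" rows 1, 2 and 3, in their order of appearance.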
