Suppose I have a vector of values, such as:
A C A B A C C B B C C A A A B B B B C A
I would like to create a new vector that, for each element, contains the number of elements since that element was last seen. So, for the vector above,
NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
(where NA indicates that this is the first time the element has been seen).
For example, the first and second A are in positions 1 and 3 respectively, a difference of 2; the third and fourth A are in positions 5 and 12, a difference of 7; and so on.
Is there a pre-built pipe-compatible function that does this?
I hacked together this function to demonstrate:
# For reproducibility
set.seed(1)
# Example vector
x = sample(LETTERS[1:3], size = 20, replace = TRUE)
compute_lag_counts = function(x, first_time = NA){
  # return vector to fill
  lag_counts = rep(-1, length(x))
  # values to match
  vals = unique(x)
  # find all positions of all elements in the target vector
  match_list = grr::matches(vals, x, list = TRUE)
  # compute the lags, then put them in the appropriate place in the return vector
  for(i in seq_along(match_list))
    lag_counts[x == vals[i]] = c(first_time, diff(sort(match_list[[i]])))
  # return vector
  return(lag_counts)
}
compute_lag_counts(x)
Although it seems to do what it is supposed to do, I'd rather use someone else's efficient, well-tested solution! My searching has turned up empty, which is surprising to me given that it seems like a common task.
Or, in base R, with ave():
ave(seq.int(x), x, FUN = function(x) c(NA, diff(x)))
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
We calculate the first difference of the indices for each group of x.
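Since the question asks for something pipe-compatible, this one-liner can also be wrapped in a small helper and dropped into a dplyr pipeline (a quick sketch; lag_count is just an illustrative name, not an existing function):
library(dplyr)
# Illustrative wrapper around the ave() one-liner above
lag_count <- function(x) ave(seq_along(x), x, FUN = function(i) c(NA, diff(i)))
tibble(x = x) %>%
  mutate(lag = lag_count(x))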
A data.table option, thanks to @Henrik:
library(data.table)
dt = data.table(x)
dt[ , d := .I - shift(.I), x]
dt
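For the example vector x from the question, the new d column should match the expected output (assuming the same x as above):
dt$d
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6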
Here's a function that would work
compute_lag_counts <- function(x) {
seqs <- split(seq_along(x), x)
unsplit(Map(function(i) c(NA, diff(i)), seqs), x)
}
compute_lag_counts(x)
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
Basically you use split() to separate the indices at which each unique value of your vector appears. Then the difference between consecutive indices within each group gives the distance to the previous occurrence, and unsplit() puts those values back in the original order.
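For the example vector, the intermediate split() step groups the positions at which each value occurs:
split(seq_along(x), x)
# $A
# [1]  1  3  5 12 13 14 20
#
# $B
# [1]  4  8  9 15 16 17 18
#
# $C
# [1]  2  6  7 10 11 19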
An option with dplyr by taking the difference of adjacent sequence elements after grouping by the original vector
library(dplyr)
tibble(v1) %>%
mutate(ind = row_number()) %>%
group_by(v1) %>%
mutate(new = ind - lag(ind)) %>%
pull(new)
#[1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
data
v1 <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C", "C", "A",
"A", "A", "B", "B", "B", "B", "C", "A")
I need to assign some values to strings in my dataset. My dataframe looks like:
Network1 Network2
A A
A C
B D
I would like the values to be consistent, so if A = 1 in Network1 it should also be 1 in Network2.
I tried the following:
data$network1<-as.numeric(as.factor(data$network1))
data$network2<-as.numeric(as.factor(data$network2))
But the values that are attached do not match except for a few cases.
Is there any way I could just do this globally for both columns at the same time so the values are consistent? I would like the desired output to be:
Network1 Network2
1 1
1 3
2 4
Thanks for any help.
unlist the data frame, convert it to a factor, then to numeric, and assign it back in its original shape:
df[] <- as.numeric(factor(unlist(df)))
df
# Network1 Network2
#1 1 1
#2 1 3
#3 2 4
You can save all the levels of the data frame first:
df <- data.frame(Network1 = c("A", "A", "B"), Network2 = c("A", "C", "D"))
lvls <- unique(unlist(df))
df$Network1 <- as.numeric(factor(df$Network1, levels = lvls))
df$Network2 <- as.numeric(factor(df$Network2, levels = lvls))
df
Network1 Network2
1 1 1
2 1 3
3 2 4
Could also try:
strings <- unique(unlist(df))
matchdf <- data.frame(strings, as.numeric(as.factor(strings)))
as.data.frame(sapply(df, function(x) match(x, matchdf$strings)))
Output:
Network1 Network2
1 1 1
2 1 3
3 2 4
This will apply the logic to all columns at once.
I'm trying to get a count for each of the observation categories per row.
In the example of the data below, the top line containing photo, 2, 3, 4, 5, 6 is the header row, and the line beneath it contains the observations.
I would do it in Excel with COUNTIF; however, the dataset is huge, with this being only a tiny sample. Plus, screw Excel :)
photo 2 3 4 5 6
30001004501 SINV_SPO_V SINV_HYD LSUB_SAND Unc SINV_SPO_V
I'm trying to do it so that it creates a new column for each observation category I count, i.e. if I were counting the frequency of "Unc", it would have its own column with how many times "Unc" appears in each row.
The code below is one of the things I've tried over the last couple of days, as well as variations of count and length commands, but with no success:
data$Unc <- rowSums(data[, 3:52] == "Unc", na.rm = F)
I'm trying to get R to only count the columns between 3 and 52
Thanks in advance for any help. It's getting incredibly frustrating, as I know it should be really easy.
I hope this makes sense
So if I understood your request correctly, this is a data.table solution to your problem; you can use 3:52 in measure.vars for your task (a sketch applying that follows the example below). Also, this only works if photo is a unique id variable; if it isn't, you should create one yourself and use that instead.
library(data.table)
# create example data.table
dt <- data.table(photo = 1:6,
x1 = c("a", "b", "a", "c", "a", "d"),
x2 = c("c", "c", "a", "c", "a", "d"),
x3 = c("c", "c", "a", "c", "a", "d"))
# Melt data.table, select which columns you need
dt_melt <- melt.data.table(dt, id.vars = 'photo', measure.vars = 2:3, variable.name = 'column')
# Get a resulting data.table with pairs of photo and observation
result_dt <- dt_melt[, .N, by = c('photo', 'value')]
photo value N
1: 1 a 1
2: 2 b 1
3: 3 a 2
4: 4 c 2
5: 5 a 2
6: 6 d 2
7: 1 c 1
8: 2 c 1
# For wide representation
dcast(result_dt, photo ~ value, value.var = 'N', fill = 0)
photo a b c d
1: 1 1 0 1 0
2: 2 0 1 1 0
3: 3 2 0 0 0
4: 4 0 0 2 0
5: 5 2 0 0 0
6: 6 0 0 0 2
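Applied to the real data, the same pattern with measure.vars = 3:52 would look roughly like this (a sketch, assuming your data frame is called data, as in the question, and that photo uniquely identifies rows):
setDT(data)
data_melt <- melt.data.table(data, id.vars = 'photo', measure.vars = 3:52, variable.name = 'column')
result <- data_melt[, .N, by = c('photo', 'value')]
dcast(result, photo ~ value, value.var = 'N', fill = 0)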
I think that a way to solve your problem is to use the table function:
col1 <- c('a','b','b','b','a','c','b','a','c')
col2 <- c('d','e','d','d','d','d','d','d','e')
data = data.frame(col1,col2)
table(col1)
table(col2)
tab = table(data)
tab
margin.table(tab,1)
margin.table(tab,2)
table(col1) will give you the frequencies for the categorical variables of col1, and this gives the same result as margin.table(tab,1). So it depends if you prefer to work on the data.frame or on the columns directly.
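For the example vectors above, table(col1) gives the per-category counts directly:
table(col1)
# col1
# a b c
# 3 4 2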
I have been going crazy with something basic...
I am trying to count, and list in a comma-separated column, each unique ID coming up in a data frame, e.g.:
df<-data.frame(id = as.character(c("a", "a", "a", "b", "c", "d", "d", "e", "f")), x1=c(3,1,1,1,4,2,3,3,3),
x2=c(6,1,1,1,3,2,3,3,1),
x3=c(1,1,1,1,1,2,3,3,2))
> df
id x1 x2 x3
1 a 3 6 1
2 a 1 1 1
3 a 1 1 1
4 b 1 1 1
5 c 4 3 1
6 d 2 2 2
7 d 3 3 3
8 e 3 3 3
9 f 3 1 2
I am trying to get a count of the unique ids that satisfy a condition (value > 1), together with the ids themselves:
res = data.frame(x1_counts = 5, x1_names = "a,c,d,e,f", x2_counts = 4, x2_names = "a,c,d,e", x3_counts = 3, x3_names = "d,e,f")
> res
x1_counts x1_names x2_counts x2_names x3_counts x3_names
1 5 a,c,d,e,f 4 a,c,d,e 3 d,e,f
I have tried with data.table but it seems very convoluted, i.e.
DT = as.data.table(df)
res <- DT[, list(x1 = length(unique(id[which(x1 > 1)])), x2 = length(unique(id[which(x2 > 1)]))), by = id]
But I can't get it right; I'm not getting what I need to do with data.table, since it is not really a grouping that I am looking for. Can you point me in the right direction, please? Thanks so much!
You can reshape your data to long format and then do the summary:
library(data.table)
(melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = list(unique(id))), variable])
# You can replace list with toString if you want the names as a single string instead of a list
# variable counts names
#1: x1 5 a,c,d,e,f
#2: x2 4 a,c,d,e
#3: x3 3 d,e,f
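As a sketch of that toString() variant (same melt step, only the names column changes):
melt(setDT(df), id.vars = "id")[value > 1][
  , .(counts = uniqueN(id), names = toString(unique(id))), variable]
#    variable counts         names
# 1:       x1      5 a, c, d, e, f
# 2:       x2      4    a, c, d, e
# 3:       x3      3       d, e, f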
To get what you need, reshape it back to wide format:
dcast(1~variable,
data = (melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = list(unique(id))), variable]),
value.var = c('counts', 'names'))
# . counts_x1 counts_x2 counts_x3 names_x1 names_x2 names_x3
# 1: . 5 4 3 a,c,d,e,f a,c,d,e d,e,f
I want to create a matrix with 3 columns and many rows, assigning 1 or 0 depending on whether the condition is satisfied.
I have data stored in 3 variables:
df1 <- data.frame(names=c("A","B","C","D","E","F"))
df2 <- data.frame(names=c("A","B","C","F"))
df3 <- data.frame(names=c("E","F","H"))
The output should be:
df1 df2 df3
A 1 1 0
B 1 1 0
C 1 1 0
D 1 0 0
E 1 0 1
F 1 1 1
H 0 0 1
In the first row, if A is present in a dataset then I assign 1 under that dataset's column, and 0 if A is not present.
Here is what I have tried
DF <- rbind(df1,df2,df3)
for (i in DF) {
  for (j in 1:length(df1$names)) {
    if (i == df1$names[j]) {
      A3 <- data.frame(paste0("", i), paste0(1), paste0(0), paste0(0))
      names(A3) <- NULL
    } else {
      A3 <- data.frame(paste0("", i), paste0(0), paste0(0), paste0(0))
    }
  }
}
I have written this code only for df1, but it's very slow because I have more than 1500 rows in my original data set. What would be the fastest way to do it?
Add a grouping variable to each dataframe:
df1 <- data.frame(names=c("A","B","C","D","E","F"),group="df1")
df2 <- data.frame(names=c("A","B","C","F"),group="df2")
df3 <- data.frame(names=c("E","F","H"),group="df3")
DF <- rbind(df1,df2,df3)
Then do this:
res <- table(DF)
> res
group
names df1 df2 df3
A 1 1 0
B 1 1 0
C 1 1 0
D 1 0 0
E 1 0 1
F 1 1 1
H 0 0 1
Or if you want a dataframe:
library(reshape2)
dcast(names~group, data=DF,fun.aggregate = length)
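For the example data this should produce essentially the same result as the table above, but as a data frame:
#   names df1 df2 df3
# 1     A   1   1   0
# 2     B   1   1   0
# 3     C   1   1   0
# 4     D   1   0   0
# 5     E   1   0   1
# 6     F   1   1   1
# 7     H   0   0   1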
When using the idcol parameter in rbindlist of the data.table package, there is no need to create a grouping column for each dataframe separately:
library(data.table) # I used v1.9.5 for this
DT <- rbindlist(list(df1, df2, df3), idcol="id")
dcast(DT[, .N , by=.(id,names)], names ~ id, fill=0)
which gives:
names 1 2 3
1: A 1 1 0
2: B 1 1 0
3: C 1 1 0
4: D 1 0 0
5: E 1 0 1
6: F 1 1 1
7: H 0 0 1
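If you would rather have df1/df2/df3 as column headers instead of 1/2/3, naming the list passed to rbindlist carries those names into the id column (a small variation on the same idea, assuming the original one-column data frames):
DT <- rbindlist(list(df1 = df1, df2 = df2, df3 = df3), idcol = "id")
dcast(DT[, .N, by = .(id, names)], names ~ id, fill = 0)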
The %in% operator lets you check whether a string is present in a vector of strings. It is also vectorised, so it works quite quickly:
x=c(LETTERS[c(1:6,8)])
df=data.frame(x=x,df1=as.numeric(x %in% df1$names),
df2=as.numeric(x %in% df2$names),
df3=as.numeric(x %in% df3$names))
df
If speed is crucial, the {data.table} package gives a little speed boost with the %chin% operator:
library(data.table)
x=c(LETTERS[c(1:6,8)])
dt=data.table(x=x,df1=as.numeric(x %chin% as.character(df1$names)),
df2=as.numeric(x %chin% as.character(df2$names)),
df3=as.numeric(x %chin% as.character(df3$names)))
dt
The code below is slightly more general than the other answers. Also, I think it's useful to know how to dynamically create commands...
I use the data frames as you prepared them:
df1 <- data.frame( names = c( "A", "B", "C", "D", "E", "F") )
df2 <- data.frame( names = c( "A", "B", "C", "F") )
df3 <- data.frame( names = c( "E", "F", "H") )
DF <- rbind( df1, df2, df3 )
nDF <- unique( DF ) #we don't want to duplicate tests.
Then the main loop is just like this:
n_ <- 3
for (ii in 1:n_) {
  nDF[paste0("df", ii)] <- as.logical(NA)  # dynamically creates a new variable in your data frame
  cmnd <- paste0("nDF$names %in% df", ii, "$names")  # dynamically creates the appropriate command (in this case you want to test e.g. whether "nDF$names %in% df1$names")
  nDF[paste0("df", ii)] <- eval(parse(text = cmnd))  # evaluates the dynamically created command and saves the result into the previously created variable
}
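For the example data, nDF should end up looking something like this (logical columns; wrap them in as.integer() if you want 0/1, and note the row names carry over from unique()):
nDF
#    names   df1   df2   df3
# 1      A  TRUE  TRUE FALSE
# 2      B  TRUE  TRUE FALSE
# 3      C  TRUE  TRUE FALSE
# 4      D  TRUE FALSE FALSE
# 5      E  TRUE FALSE  TRUE
# 6      F  TRUE  TRUE  TRUE
# 13     H FALSE FALSE  TRUE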
This should be relatively fast. But if you don't have duplicates in your data, then heroka's suggestion to this question is probably the way to go.