This is what my data.table looks like. V3 is my desired column; it is not yet available to me.
library(data.table)
dt <- fread('
Level V1 V2
0 10 2
1 0 3
1 0 2
1 0 2 ')
I am trying to calculate V3 based on prior values of V3. The V3 formula is:
New Value of V3 = ((Prior Value of V3 + Prior Value of V3 * V2) * Level) + V1
1st Row V3 = (NA + NA*2)*0 + 10 = 10 (Level = 0 zeroes out the prior term)
2nd Row V3 = (10 + 10*3)*1 + 0 = 40
3rd Row V3 = (40 + 40*2)*1 + 0 = 120
4th Row V3 = (120 + 120*2)*1 + 0 = 360
The output should look like this.
Level V1 V2 V3
0 10 2 10
1 0 3 40
1 0 2 120
1 0 2 360
I was trying:
dt[,V3:= (cumsum(V3+V3*V2)*Level)+V1]
I reworked your efforts in the comments to get the desired result:
dt[,V3:=cumprod( c(V1[1] ,(Level*(1 + V2))[-1]) ) ]
dt
Level V1 V2 V3
1: 0 10 2 10
2: 1 0 3 40
3: 1 0 2 120
4: 1 0 2 360
I didn't actually get an error (only a warning) with dt[, V3 := V1[1] * cumprod((Level*(1 + V2))[-1])]. Dropping the first element with [-1] makes the cumprod one element shorter than the column, so the result was recycled to fit.
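The cumprod trick works because V1 is zero after the first row; if you ever need to follow the recursion literally, base R's Reduce() with accumulate = TRUE can carry the prior V3 forward row by row. A minimal sketch on the same toy data:

```r
library(data.table)

dt <- data.table(Level = c(0, 1, 1, 1),
                 V1    = c(10, 0, 0, 0),
                 V2    = c(2, 3, 2, 2))

# Carry the prior V3 through each row: new = (prior + prior*V2)*Level + V1.
# init = 0 stands in for the missing prior on row 1 (Level = 0 drops it anyway).
dt[, V3 := Reduce(function(prior, i) (prior + prior * V2[i]) * Level[i] + V1[i],
                  seq_len(.N), init = 0, accumulate = TRUE)[-1]]

dt$V3
# [1]  10  40 120 360
```

With accumulate = TRUE and an init value, Reduce returns .N + 1 values (init first), so [-1] drops the seed.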
Within data.table (note this relies on the V3 column created above):
dt[,{ lag.V3=c(0, V3[-.N]) ; V3 = (lag.V3 + lag.V3 * V2 )* Level + V1 }]
Output
[1] 10 40 120 360
Here is one way to do it in dplyr
dt %>%
mutate(V4=lag(V3) + lag(V3)*V2 + V1,
V4=ifelse(is.na(V4), 0, V4))
Level V1 V2 V3 V4
1 0 10 2 10 0
2 1 0 3 40 40
3 1 0 2 120 120
4 1 0 2 360 360
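Note that lag(V3) above only works because V3 already exists in the table; to build the column from scratch in the tidyverse, purrr::accumulate() can thread the prior value through the rows. A sketch, assuming the same toy data:

```r
library(dplyr)
library(purrr)

df <- tibble(Level = c(0, 1, 1, 1),
             V1    = c(10, 0, 0, 0),
             V2    = c(2, 3, 2, 2))

df <- df %>%
  mutate(V3 = accumulate(2:n(), .init = V1[1],
                         # .x is the accumulated prior V3, .y the current row index
                         ~ (.x + .x * V2[.y]) * Level[.y] + V1[.y]))

df$V3
# [1]  10  40 120 360
```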
Related
I have a data set as follows (I took a simple example, but the real data set is much bigger):
V1 V2 V3 V4
1 1 0 0 1
2 0 1 1 0
3 0 0 1 0
4 1 1 1 1
5 0 1 1 0
6 1 0 0 1
7 0 0 0 1
8 0 1 1 1
9 1 0 1 0
10 0 1 1 0
...
where V1, V2, V3, ..., Vn are items and 1, 2, 3, 4, ..., 1000 are transactions. I want to partition these items into k clusters such that each cluster contains the items that appear most frequently together in the same transactions.
To determine the number of times each pair of items appears together, I tried a crosstable and got the following results:
V1 V2 V3 V4
V1 4 1 2 3
V2 1 5 5 2
V3 2 5 7 2
V4 3 2 2 5
For this small example, if I want to create 2 clusters (k = 2) such that each cluster must contain 2 items (to maintain the balance between clusters), I will get:
Cluster1={V1,V4}
Cluster2={V2,V3}
because:
1) V1 appears most frequently with V4: (V1,V4) = 3 > (V1,V3) > (V1,V2), and the same holds for V4.
2) V2 appears most frequently with V3: (V2,V3) = 5 > (V2,V4) > (V2,V1), and the same holds for V3.
How can I do this partition in R, and for a bigger data set?
I think you are asking about clustering. It is not quite the same as what you are doing above, but you could use hclust to look for similarity between variables with a suitable distance measure.
For example
plot(hclust(dist(t(df),method="binary")))
produces a dendrogram grouping similar variables together (plot not shown).
You should look at ?dist to see if this distance measure is meaningful in your context, and ?hclust for other things you can do once you have your dendrogram (such as identifying clusters).
Or you could use your crosstab as a distance matrix (perhaps take the reciprocal of the values, and then as.dist).
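That crosstab idea can be sketched as follows, taking the reciprocal of the co-occurrence counts (an assumption: any monotone decreasing transform would do) so that frequently co-occurring items end up close together:

```r
# Co-occurrence counts from the question's crosstable
ct <- matrix(c(4, 1, 2, 3,
               1, 5, 5, 2,
               2, 5, 7, 2,
               3, 2, 2, 5),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("V", 1:4), paste0("V", 1:4)))

# Reciprocal turns "co-occurs often" into "small distance"
d  <- as.dist(1 / ct)
hc <- hclust(d)           # default complete linkage
cutree(hc, k = 2)         # two clusters: {V1, V4} and {V2, V3}
```

With these counts, V2 and V3 merge first (distance 1/5), then V1 and V4 (1/3), reproducing the expected partition.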
library(data.table)
data:
df<-
fread("
V1 V2 V3
1 1 0 0
2 0 0 1
3 0 0 1
4 1 1 1
5 0 0 1
6 1 0 0
7 0 0 0
8 0 1 1
9 1 0 1
10 0 1 1
")[,-1]
code:
setDT(df)
sapply(names(df),function(x){
df[get(x)==1,lapply(.SD,sum,na.rm=T),.SDcols=names(df)]
})
result:
V1 V2 V3
V1 4 1 2
V2 1 3 3
V3 2 3 7
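For larger data, the same co-occurrence matrix (diagonal = item frequencies, off-diagonal = pairwise counts) can be computed in one step with a matrix cross-product. A sketch on the answer's 3-column data:

```r
m <- matrix(c(1,0,0,
              0,0,1,
              0,0,1,
              1,1,1,
              0,0,1,
              1,0,0,
              0,0,0,
              0,1,1,
              1,0,1,
              0,1,1),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("V1", "V2", "V3")))

# For a 0/1 matrix, t(m) %*% m counts, for every pair of columns,
# the rows where both are 1; crossprod() does this efficiently.
crossprod(m)
#    V1 V2 V3
# V1  4  1  2
# V2  1  3  3
# V3  2  3  7
```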
df <- read.table(text="
ID V1 V2 V3
1 1 0 0
2 0 0 1
3 0 0 1
4 1 1 1
5 0 0 1
6 1 0 0
7 0 0 0
8 0 1 1
9 1 0 1
10 0 1 1
", header = TRUE)
k = 3 # number of clusters
library(dplyr)
df %>%
# group and count on all except the first id column
group_by_at(2:ncol(df)) %>%
# get the counts, and collect all the transaction ids
summarize(n = n(), tran_ids = paste(ID, collapse = ',')) %>%
ungroup() %>%
# grab the top k summarizations
top_n(k, n)
# V1 V2 V3 n tran_ids
# <int> <int> <int> <int> <chr>
# 1 0 0 1 3 2,3,5
# 2 0 1 1 2 8,10
# 3 1 0 0 2 1,6
You can transpose your table and use standard methods of clustering. Thus, you will cluster the items. The features are the transactions.
Geometrical approaches can be used like kmeans. Alternatively, you can use mixture models which provide information criteria (like BIC) for selecting the number of clusters. Here is an R script
require(VarSelLCM)
my.data <- as.data.frame(t(df))
# To consider Gaussian mixture
# Alternatively Poisson mixture can be considered by converting each column into integer.
for (j in 1:ncol(my.data)) my.data[,j] <- as.numeric(my.data[,j])
## Clustering by considering all the variables as discriminative
# Number of clusters is between 1 and 6
res.all <- VarSelCluster(my.data, 1:6, vbleSelec = FALSE)
# partition
res.all@partitions@zMAP
# shiny application
VarSelShiny(res.all)
## Clustering with variable selection
# Number of clusters is between 1 and 6
res.selec <- VarSelCluster(my.data, 1:6, vbleSelec = TRUE)
# partition
res.selec@partitions@zMAP
# shiny application
VarSelShiny(res.selec)
I want to make every element in the dataframe (except for the ID column) become a 0 if it is any number other than 1.
I have:
ID A B C D E
abc 5 3 1 4 1
def 4 1 3 2 5
I want:
ID A B C D E
abc 0 0 1 0 1
def 0 1 0 0 0
I am having trouble figuring out how to apply this to every entry in every column and row.
Here is my code:
apply(dat.lec, 2 , function(y)
if(!is.na(y)){
if(y==1){y <- 1}
else{y <-0}
}
else {y<- NA}
)
Thank you for your help!
No need for implicit or explicit looping.
# Sample data
set.seed(2016);
df <- as.data.frame(matrix(sample(10, replace = TRUE), nrow = 2));
df <- cbind.data.frame(id = sample(letters, 2), df);
df;
# id V1 V2 V3 V4 V5
#1 k 2 9 5 7 1
#2 g 2 2 2 9 1
# Replace all entries != 1 with 0's
df[, -1][df[, -1] != 1] <- 0;
df;
# id V1 V2 V3 V4 V5
#1 k 0 0 0 0 1
#2 g 0 0 0 0 1
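If NAs must be preserved (as the question's attempted apply() suggests), comparing against 1 keeps them, since NA == 1 is NA. A minimal sketch with a hypothetical NA added to the question's data:

```r
df <- data.frame(ID = c("abc", "def"),
                 A = c(5, 4), B = c(3, 1), C = c(1, 3),
                 D = c(4, NA), E = c(1, 5))

# (df[-1] == 1) yields TRUE/FALSE/NA; unary + converts to 1/0/NA
df[-1] <- +(df[-1] == 1)
df
#    ID A B C  D E
# 1 abc 0 0 1  0 1
# 2 def 0 1 0 NA 0
```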
Data set cwm looks like this
V1 V2 V3
1 2 ?
3 5 ?
4 4 ?
#NA 9 ?
#NA #NA ?
I want to create a dummy variable V3 that is 1 if V1 = V2, 0 otherwise, and produces #NA in any case where #NA is involved.
After I have done a similar thing for equivalent columns V3 and V4 to produce dummy variable V5, I need to create a continuous variable, V6, where 1 means neither V3 nor V5 = 1, 2 means either V3 or V5 = 1, and 3 means both V3 and V5 = 1.
V3 V5 V6
1 0 ?
1 0 ?
0 0 ?
1 1 ?
If done correctly, V3 = {0,0,1,#NA,#NA} and V6 = {2,2,1,3}
Best approach?
df = read.table(text="V1 V2
1 2
3 5
4 4
NA 9
NA NA",
header = TRUE, na.strings="NA")
V3 = as.numeric(df$V1 == df$V2)
V3
[1] 0 0 1 NA NA
df2 = read.table(text="V3 V5
1 0
1 0
0 0
1 1",
header = TRUE)
V6 = df2$V3 + df2$V5 + 1
V6
[1] 2 2 1 3
I have a data frame such as this:
df <- data.frame(
ID = c('123','124','125','126'),
Group = c('A', 'A', 'B', 'B'),
V1 = c(1,2,1,0),
V2 = c(0,0,1,0),
V3 = c(1,1,0,3))
which returns:
ID Group V1 V2 V3
1 123 A 1 0 1
2 124 A 2 0 1
3 125 B 1 1 0
4 126 B 0 0 3
and I would like to return a table that indicates if a variable is represented in the group or not:
Group V1 V2 V3
A 1 0 1
B 1 1 1
This would let me count the number of distinct variables represented in each group.
We can do this with base R
aggregate(.~Group, df[-1], function(x) as.integer(sum(x)>0))
# Group V1 V2 V3
#1 A 1 0 1
#2 B 1 1 1
Or using rowsum from base R
+(rowsum(df[-(1:2)], df$Group)>0)
# V1 V2 V3
#A 1 0 1
#B 1 1 1
Or with by from base R
+(do.call(rbind, by(df[3:5], df['Group'], FUN = colSums))>0)
# V1 V2 V3
#A 1 0 1
#B 1 1 1
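For completeness, a dplyr sketch of the same summary (assuming dplyr >= 1.0 for across()):

```r
library(dplyr)

df <- data.frame(ID    = c("123", "124", "125", "126"),
                 Group = c("A", "A", "B", "B"),
                 V1 = c(1, 2, 1, 0),
                 V2 = c(0, 0, 1, 0),
                 V3 = c(1, 1, 0, 3))

# any(.x > 0) flags whether the variable appears at all in the group;
# unary + converts the logical to 0/1
df %>%
  group_by(Group) %>%
  summarise(across(V1:V3, ~ +any(.x > 0)))
# Group V1 V2 V3
# A      1  0  1
# B      1  1  1
```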
Have you tried unique(group_by(mtcars, cyl)$cyl)?
Output: [1] 6 4 8
This question already has answers here:
Table by row with R
(4 answers)
Imagine a group of three machines (a, b, c) that capture data in a series of tests. I need to count, per test, how many times each possible outcome occurred.
Using this test data and sample output, how might you solve it? (Assume the test results may be numeric or alphabetic.)
tests <- data.table(
a = c(1,2,2,3,0),
b = c(1,2,3,0,3),
c = c(2,2,3,0,2)
)
sumry <- data.table(
V0 = c(0,0,0,2,1),
V1 = c(2,0,0,0,0),
V2 = c(1,3,1,0,1),
V3 = c(0,0,2,1,1),
v4 = c(0,0,0,0,0)
)
tests
sumry
The output from sumry shows a column for each possible outcome/value (prefixed with V as in 'value' measured). Note: the sumry output indicates that there is the potential for a value of 4 but that is not observed in any of the test data here and therefore is always zero.
> tests
a b c
1: 1 1 2
2: 2 2 2
3: 2 3 3
4: 3 0 0
5: 0 3 2
> sumry
V0 V1 V2 V3 v4
1: 0 2 1 0 0
2: 0 0 3 0 0
3: 0 0 1 2 0
4: 2 0 0 1 0
5: 1 0 1 1 0
The V0 column from sumry indicates how many times the value zero is observed from any machine in a given test. For this set of test data, zero is only observed in the 4th and 5th tests. The same holds true for V1-V4.
I'm sure there's a simple name for this.
Here's one solution built around tabulate():
res <- suppressWarnings(do.call(rbind,apply(tests+1L,1L,tabulate)));
colnames(res) <- paste0('V',seq(0L,len=ncol(res)));
res;
## V0 V1 V2 V3
## [1,] 0 2 1 0
## [2,] 0 0 3 0
## [3,] 0 0 1 2
## [4,] 2 0 0 1
## [5,] 1 0 1 1
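The warning (and the reliance on rbind() recycling) can be avoided by fixing the bin count with tabulate()'s nbins argument, so every row tabulates to the same length. A sketch:

```r
library(data.table)

tests <- data.table(a = c(1, 2, 2, 3, 0),
                    b = c(1, 2, 3, 0, 3),
                    c = c(2, 2, 3, 0, 2))

# Shift values up by 1 so 0 lands in bin 1, then tabulate each row
# into a fixed number of bins.
nb  <- max(unlist(tests)) + 1L    # use 5L here to force the V4 column
res <- t(apply(tests + 1L, 1L, tabulate, nbins = nb))
colnames(res) <- paste0("V", seq_len(nb) - 1L)
res
#      V0 V1 V2 V3
# [1,]  0  2  1  0
# [2,]  0  0  3  0
# [3,]  0  0  1  2
# [4,]  2  0  0  1
# [5,]  1  0  1  1
```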