Calculate a Lagged column on itself - r

I'm certain there is an easier way to accomplish this. I have the following dataframe:
B <- c(1, 1, 1, 0, 1, 2, 2, 0, 0, 0)
A <- c(1:10)
df <- as.data.frame(cbind(A,B))
What I would like to do is add a third column (C) that applies column B, unless column B is 0, in which case apply the percent change in column A to the previous result of column C.
Here is what I did:
library(Hmisc)
df$New <- ifelse(df$B!=0, df$B, df$A/Lag(df$A, shift=1)*Lag(df$B, shift=1))
df$New2 <- ifelse(df$New !=0, df$New, df$A/Lag(df$A, shift=1)*Lag(df$New, shift=1))
df$New3 <- ifelse(df$New2 !=0, df$New2, df$A/Lag(df$A, shift=1)*Lag(df$New2, shift=1))
df$C <- pmax(df$New, df$New2, df$New3)
df<- df[c(1,2,6)]
Essentially, each value in the new column depends on the previously calculated result in that same column, so maybe sapply would work, but I'm not sure.
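One possible approach, sketched here under the assumption that the rule is "C equals B, except where B is 0, in which case C is the previous C scaled by A[i] / A[i - 1]", is a plain forward loop. Each step needs the value computed one row earlier, so the loop handles any number of consecutive zeros, whereas the three chained ifelse calls above appear to resolve at most three in a row. The variable names follow the question; nothing beyond base R is needed.

```r
# Same example data as in the question.
A <- 1:10
B <- c(1, 1, 1, 0, 1, 2, 2, 0, 0, 0)
df <- data.frame(A, B)

# Start from B, then walk forward so that each zero can use the C value
# already computed for the previous row.
C <- df$B
for (i in seq_along(C)[-1]) {
  if (df$B[i] == 0) {
    C[i] <- C[i - 1] * df$A[i] / df$A[i - 1]
  }
}
df$C <- C
df
```

Note that a zero in the very first row has no previous value to build on, so this sketch leaves it at 0.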

Related

Creating a variable to count number of zero values across variables occurring in each observation- R

I am trying to figure out a way to do this in R and for the life of me can't figure it out. Let's say I have a df consisting of the following.
v1<- c(0, 0, 2, 0, 1)  # commas restored; trimmed to five values to match v2..v4 (assumption)
v2<- c(1, 0, 8, 1 ,0)
v3<- c(0, 1, 3, 0, 0)
v4<- c(0, 0, 0, 0, 0)
df<- data.frame(v1, v2,v3, v4)
I want to create a new variable, say num_zeros, that counts the number of 0s for each observation in v1 to v3. Is there a quick way to do this? Any help would be greatly appreciated!
We can use rowSums on a logical matrix to get the count of 0 values and assign it to the 'num_zeros' column:
df$num_zeros <- rowSums(df[c('v1', 'v2', 'v3')] == 0)
Or another option is
df$num_zeros <- (df$v1 == 0) + (df$v2 == 0) + (df$v3 == 0)
NOTE: Both methods are vectorized and efficient.
We can use apply row-wise:
cols <- paste0('v', 1:3)
df$num_zeros <- apply(df[cols] == 0, 1, sum)
Or with lapply :
df$num_zeros <- Reduce(`+`, lapply(df[cols], `==`, 0))
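As a quick check of the approaches above, here is a small sketch on a hypothetical five-row data frame (the values are chosen only for illustration). It also shows one caveat: a comparison with NA yields NA, so rows containing missing values get an NA count unless na.rm = TRUE is passed to rowSums.

```r
# Hypothetical five-row example; values are for illustration only.
df <- data.frame(
  v1 = c(0, 0, 2, 0, 1),
  v2 = c(1, 0, 8, 1, 0),
  v3 = c(0, 1, 3, 0, 0),
  v4 = c(0, 0, 0, 0, 0)
)
cols <- c("v1", "v2", "v3")

# Two of the approaches from the answers; they should agree.
by_rowsums <- rowSums(df[cols] == 0)
by_reduce  <- Reduce(`+`, lapply(df[cols], `==`, 0))

# With a missing value, NA == 0 is NA; na.rm = TRUE skips it when counting.
df_na <- df
df_na$v1[1] <- NA
with_na <- rowSums(df_na[cols] == 0, na.rm = TRUE)
```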

How to keep row names for hierarchical clustering on imported csv files

I would like to produce a hierarchical clustering analysis of data imported from a .csv file into R. I'm having trouble retaining the first column of row names, so my dendrogram tips end up with no names, which is useless for downstream analyses and linking with meta-data.
When I import the .csv file, if I use the dataframe including the first column of row names for the dist function I get a warning:
"Warning message:
In dist(as.matrix(df)) : NAs introduced by coercion".
I found a previous Stack Overflow question which addressed this:
"NAs introduced by coercion" during Cluster Analysis in R
The solution offered was to remove the row names. But this also removes the tip labels from the resulting distance matrix, which I need for making sense of the dendrogram and linking to meta-data downstream (e.g. to add colour to dendrogram tips or a heat map based on other variables).
# Generate dataframe with example numbers
Samples <- c('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D', 'Sample_E')
Variable_A <- c(0, 1, 1, 0, 1)
Variable_B <- c(0, 1, 1, 0, 1)
Variable_C <- c(0, 0, 1, 1, 1)
Variable_D <- c(0, 0, 1, 1, 0)
Variable_E <- c(0, 0, 1, 1, 0)
df = data.frame(Samples, Variable_A, Variable_B, Variable_C, Variable_D, Variable_E, row.names=c(1))
df
# generate distance matrix
d <- dist(as.matrix(df))
# apply hierarchical clustering
hc <- hclust(d)
# plot dendrogram
plot(hc)
That all works fine. But let's say I want to import my real data from a file...
# writing the example dataframe to file
write.csv(df, file = "mock_df.csv")
# importing a file
df_import <- read.csv('mock_df.csv', header=TRUE)
I no longer get the original row names using the same code as above:
# generating distance matrix for imported file
d2 <- dist(as.matrix(df_import))
# apply hierarchical clustering
hc2 <- hclust(d2)
# plot dendrogram
plot(hc2)
Everything works fine with the df created in R, but I lose row names with the imported data. How do I solve this?
Samples <- c('Sample_A', 'Sample_B', 'Sample_C', 'Sample_D', 'Sample_E')
Variable_A <- c(0, 1, 1, 0, 1)
Variable_B <- c(0, 1, 1, 0, 1)
Variable_C <- c(0, 0, 1, 1, 1)
Variable_D <- c(0, 0, 1, 1, 0)
Variable_E <- c(0, 0, 1, 1, 0)
df = data.frame(Samples, Variable_A, Variable_B, Variable_C, Variable_D, Variable_E, row.names=c(1))
df
d <- dist(as.matrix(df))
hc <- hclust(d)
plot(hc)
df
write.csv(df, file = "mock_df.csv",row.names = TRUE)
df_import <- read.table('mock_df.csv', header=TRUE,row.names=1,sep=",")
d2 <- dist(as.matrix(df_import))
hc2 <- hclust(d2)
plot(hc2)
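In other words, pass row.names=1 when importing so that the first column becomes the row names. read.csv accepts the same argument, so switching to read.table is not strictly required: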
df_import <- read.table('mock_df.csv', header=TRUE,row.names=1,sep=",")
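A minimal round-trip sketch of the same idea, using read.csv with row.names = 1 and a temporary file. The shortened data frame and the tempfile path are stand-ins for illustration, not the original data.

```r
# Small stand-in for the question's data frame, with sample IDs as row names.
df <- data.frame(
  Variable_A = c(0, 1, 1, 0, 1),
  Variable_B = c(0, 1, 1, 0, 1),
  Variable_C = c(0, 0, 1, 1, 1),
  row.names = paste0("Sample_", LETTERS[1:5])
)

# Write to a temporary file (the path is only for illustration).
path <- tempfile(fileext = ".csv")
write.csv(df, file = path, row.names = TRUE)

# row.names = 1 tells read.csv to use the first column as row names,
# so it is not treated as a data column by dist().
df_import <- read.csv(path, header = TRUE, row.names = 1)

hc2 <- hclust(dist(as.matrix(df_import)))
hc2$labels  # the sample names carried through to the clustering
```

Calling plot(hc2) afterwards should then label the dendrogram tips with the sample names.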

Conditional if/else statement in R

I am learning to improve my coding in R. I have this code:
data$score[testA == 1] <- testA_score
data$score[testB==1] <- testB_score
So basically I have four columns that I want to combine into one: testA=1 indicates that the student took version A of the test, and testA_score is their score; testB=1 indicates that the student took version B of the test, and testB_score is their score. I want to combine this information into a new column, score.
Also, suppose I had testA, testB through testH. All values are 0 or 1. How can I make a new column test_complete which is = 1 if any of the tests are = 1?
Basically, as a former Stata user, I am looking for the R equivalents of the egen rowtotal and egen rowfirst commands. Thanks so much.
You can take the max across all tests: since the values are only 1 or 0, the max will equal 1 if at least one test is completed.
testA <- c(1,0, 0, 1,0,0,0)
testB <- c(0, 1,0, 0, 1,0,1)
testC <- c(0, 0, 0,1, 0, 1, 0)
df <- as.data.frame(cbind(testA, testB, testC))
df$completed <- apply(df[, 1:3], 1, max)
So if I understand correctly, taking the maximum value by row should give what you need:
binary <- c(0,1)
df <- data.frame(
  score1 = sample(binary, 20, replace = TRUE),
  score2 = sample(binary, 20, replace = TRUE),
  score3 = sample(binary, 20, replace = TRUE)
)
df$passed <- apply(df, 1, max)
head(df)
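The answers above cover test_complete; for the first part of the question (combining the scores), one possible sketch uses a vectorized ifelse. The data frame below is hypothetical, and the sketch assumes each student took at most one version. The rowSums line is a vectorized alternative to apply(..., 1, max) and is roughly what Stata's rowtotal does.

```r
# Hypothetical data: each student took at most one test version.
data <- data.frame(
  testA       = c(1, 0, 1, 0, 0),
  testA_score = c(80, NA, 65, NA, NA),
  testB       = c(0, 1, 0, 1, 0),
  testB_score = c(NA, 90, NA, 70, NA)
)

# Take the version A score where version A was taken, otherwise version B's.
data$score <- ifelse(data$testA == 1, data$testA_score, data$testB_score)

# 1 if any of the indicator columns is 1, otherwise 0.
test_cols <- c("testA", "testB")
data$test_complete <- as.integer(rowSums(data[test_cols]) > 0)
data
```

With testA through testH, only test_cols would need to change; for more than two score columns, nested ifelse calls or a row-wise lookup would be needed instead.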

R - Averaging specific matrix indices over matrix

I have two matrices. The first, m1, is 100x100 and contains numbers with decimal places and the other, m2, is 300x100 and is sparsely populated with integers, like so:
m1 <- matrix(rexp(10000, rate = .1), ncol = 100)
m2 <- matrix(sample(c(rep(0, 1000), rep(1, 10), rep(2, 1)), 300 * 100, replace = TRUE), 300, 100)
Each row of m1 corresponds to the column of the same number in m2. Each entry of m2 gives the number of occurrences of the corresponding row of m1 for that observation.
For each row of m2, I want the column means of the rows of m1 that appear in that row of m2, with each row of m1 weighted by how many times it appears. The result should be a 300x100 matrix. I want to know the most efficient way of doing this.
It's a complex operation but hopefully you understand what I mean. If you need any clarification I can give it. If it helps, what I'm trying to do is to get a document features matrix from a word feature matrix and a document-term matrix.
dtm <- matrix(c(0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0), ncol = 4)
wvm <- matrix(c(27.305102, 9.095906, 3.792833, 17.561222, 32.06434, 4.719152, 8.367996, 0.0568822), ncol = 2)
dtm
wvm
t(apply(dtm, 1, function(dtm_row) {
  vs <- wvm[dtm_row > 0, ] * dtm_row[dtm_row > 0]
  if (is.matrix(vs)) { colMeans(vs) } else vs
}))
Solved my own problem. But if anyone wants to improve my method, I'll mark their answer as the correct one.
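One possible improvement, offered as a sketch rather than a definitive answer: the row-by-row apply can be written as a single matrix product. dtm %*% wvm sums the count-weighted word vectors for each document, and dividing by the number of distinct words present reproduces the colMeans step. The names by_apply and by_matmul are introduced here for illustration only.

```r
dtm <- matrix(c(0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0), ncol = 4)
wvm <- matrix(c(27.305102, 9.095906, 3.792833, 17.561222,
                32.06434, 4.719152, 8.367996, 0.0568822), ncol = 2)

# Row-by-row version from the post, kept for comparison.
by_apply <- t(apply(dtm, 1, function(dtm_row) {
  vs <- wvm[dtm_row > 0, ] * dtm_row[dtm_row > 0]
  if (is.matrix(vs)) colMeans(vs) else vs
}))

# Sum of count-weighted word vectors per document, divided by the
# number of distinct words that occur in that document.
by_matmul <- (dtm %*% wvm) / rowSums(dtm > 0)

all.equal(by_apply, by_matmul, check.attributes = FALSE)
```

A document with no words at all gives 0 / 0 = NaN in this version. If the mean should instead be taken over the total word count, dividing by rowSums(dtm) would be the natural change.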

variable limit to define values

I have a simple question, so let's take some basic data:
a <- rnorm(100, mean=1, sd = 0.1)
b <- rnorm(100, mean=5, sd = 2)
c <- data.frame(a,b)
Now I want to redefine c$b such that if a value is below a limit, it takes a new value defined manually by the user, and if it is at or above the limit, it keeps its previous value:
c$b <- with(c, ifelse(b < 2, 1, # leave as existing value #))
So when b < 2, we want to assign a value of 1; otherwise, keep the existing value.
If we are using ifelse, try
c$b <- with(c, ifelse(b < 2, 1, b))
This doesn't even require ifelse. We can build a logical index of the values less than 2 in the 'b' column (c$b < 2) and assign 1 to those positions.
c$b[c$b<2] <- 1
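Since the question mentions a user-defined limit and replacement, here is a small sketch that keeps both as variables. The names limit, new_value and dat are hypothetical, and dat is used in place of c to avoid shadowing the base function c(). It compares ifelse with base R's replace(), which returns a modified copy rather than assigning in place.

```r
set.seed(42)  # only to make the illustration reproducible
dat <- data.frame(a = rnorm(100, mean = 1, sd = 0.1),
                  b = rnorm(100, mean = 5, sd = 2))

limit     <- 2  # hypothetical user-chosen threshold
new_value <- 1  # hypothetical value assigned below the threshold

# Two equivalent ways to apply the rule without touching the original column.
b_ifelse  <- ifelse(dat$b < limit, new_value, dat$b)
b_replace <- replace(dat$b, dat$b < limit, new_value)

identical(b_ifelse, b_replace)
```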
