I am trying to run a cumsum on a data frame on two separate columns. They are essentially tabulation of events for two different variables. Only one variable can have an event recorded per row in the data frame. The way I attacked the problem was to create a new variable, holding the value ‘1’, and create two new columns to sum the variables totals. This works fine, and I can get the correct total amount of occurrences, but the problem I am having is that in my current ifelse statement, if the event recorded is for variable “A”, then variable “B” is assigned 0. But, for every row, I want to have the previous variable’s value assigned to the current row, so that I don’t end up with gaps where it goes from 1 to 2, to 0, to 3.
I don't want to run summarize on this either, I would prefer to keep each recorded instance and run new columns through mutate.
CURRENT DF:
Event Value Variable Total.A Total.B
1 1 A 1 0
2 1 A 2 0
3 1 B 0 1
4 1 A 3 0
DESIRED RESULT:
Event Value Variable Total.A Total.B
1 1 A 1 0
2 1 A 2 0
3 1 B 2 1
4 1 A 3 1
Thanks!
You can use the property of booleans that you can sum them as ones and zeroes. Therefore, you can use the cumsum-function:
DF$Total.A <- cumsum(DF$variable=="A")
Or as a more general approach, provided by #Frank you can do:
uv = unique(as.character(DF$Variable))
DF[, paste0("Total.",uv)] <- lapply(uv, function(x) cumsum(DF$V == x))
If you have many levels to your factor, you can get this in one line by dummy coding and then cumsuming the matrix.
X <- model.matrix(~Variable+0, DF)
apply(X, 2, cumsum)
Related
I have been continuing to learn r to transition away from excel and I am wondering what the best way to approach the following problem is, or at least what tools are available to me:
I have a large data set (100K+ rows) and several columns that I could generate a signal off of and each value in the vectors can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals using the state of each cell in the four columns then see what each of the composite signals tell me about the returns in a time series. For this question the scope is only generating the combinations.
So for example, one composite signal would be when all four cells in the vectors = 0. I could generate a new column that reads TRUE when that case is true and false in each other case, then go on to figure out how that effects the returns from the rest of the data frame.
The thing is I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and on and on, which is quite a few. With the extent of my knowledge of r, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think that you could paste the columns together to get unique combinations, then just turn this to dummy variables:
library(dplyr)
library(dummies)
# Create sample data
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
sig2 = c(1,1,0,0,0,1,1),
sig3 = c(2,2,0,1,1,2,1))
# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1,sig2,sig3))
# Generate dummmies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))
# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
I am trying to find the total of rows that have a column value of 3 or 4. That being said, the first row has only one value of 3 so if I create a new column
currentdx_count1$TotalDiagnoses
That new column called TotalDiagnoses should only have a value of 1 under it for the first row. I have tried
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[2:32])
This doesn't give me what I need as expected because it literally sums up the whole row. That being said, is there an existing function that does what I want to do or will I have to make one? Could I specify more in rowSums for it to work as I need it to?
Thanks for any and all help.
Edit: I'm trying to adapt a method I use earlier in my script that works for a similar purpose
findtotal <- endsWith(names(currentdx_count1), 'Current')
findtotal <- lapply(findtotal, `>`, 2)
findtotal <- unlist(findtotal)
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
I get an error which I have never seen before (an error in view?!)
So I tried just this
findtotal <- endsWith(names(currentdx_count1), 'Current')
currentdx_count1$TotalDiagnoses <- currentdx_count1[c(findtotal)]
Gets me closer but it is finding the total count for each column separately which is not what I need. I want a single column to encompass counts for each SID.
You can compare the dataframe with the value of 3 or 4 and then use rowSums to count :
currentdx_count1$TotalDiagnoses <- rowSums(currentdx_count1[-1] == 3 |
currentdx_count1[-1] == 4)
currentdx_count1$TotalDiagnoses
#[1] 1 2 2 2 1 1 1 1 1 1 1 1 1 2
I have the following matrix:
x=c(0,0,0,1,1,1,2,2,2,0,1,2,0,1,2,0,1,2)
M=matrix(x,9,2)
The matrix M is:
> M
0 0
0 1
0 2
1 0
1 1
1 2
2 0
2 1
2 2
How do I find that the number of (0,0), (0,1), (0,2), ... (that is the first row, the second, the third and so on) in the whole rows are equal to 1?
If we need to get the frequency, use the table,
tbl <- table(paste(M[,1], M[,2], sep="_"))
This can be converted to a 3 column data.frame by splitting the names of 'tbl' into two columns and cbinding the value of 'tbl'
cbind(read.table(text=names(tbl), sep="_", header = FALSE), value = as.vector(tbl))
If you want to check if every row appears a single time you can use
duplicated(data.frame(M))
If any of the resulting values is TRUE then you know some rows appear more than one time (and you know where they are).
Assuming my dataframe has one column, I wish to add another column to indicate if my ith element is unique within the first i elements. The results I want is:
c1 c2
1 1
2 1
3 1
2 0
1 0
For example, 1 is unique in {1}, 2 is unique in {1,2}, 3 is unique in {1,2,3}, 2 is not unique in {1,2,3,2}, 1 is not unique in {1,2,3,2,1}.
Here is my code, but is runs extremely slow given I have nearly 1 million rows.
for(i in 1:nrow(df)){
k <- sum(df$C1[1:i]==df$C1[i]))
if(k>1){df[i,"C2"]=0}
else{df[i,"C2"]=1}
}
Is there a quicker way of achieving this?
The following works:
x$c2 = as.numeric(! duplicated(x$c1))
Or, if you prefer more explicit code (I do, but it’s slower in this case):
x$c2 = ifelse(duplicated(x$c1), 0, 1)
I have two sets of data, which correspond to different experiment tasks that I want to merge for analysis. The problem is that I need to search and match up certain rows for particular stimuli and for particular participants. I'd like to use a script to save some trouble. This is probably quite simple, but I've never done it before.
Here's my problem more specifically:
In the first data set, each row corresponds to a two-alternative forced choice task where two stimuli are presented at a time and the participant selects one. In the second data set, each row corresponds to a single item task where the participants are asked if they have ever seen the stimulus before. The stimuli in the second task match the stimuli in the pairs on the first task (twice as many rows). I want to be able to match up and add two columns to the first dataset--one that states if the leftside item was recognized later and one for the rightside stimulus.
I assume this could be done with nested loops, but I'm not sure if there is a elegant way to do this or perhaps a package.
As I understand it, your first dataset looks something like this:
(dat1 <- data.frame(person=1:2, stim1=1:2, stim2=3:4))
# person stim1 stim2
# 1 1 1 3
# 2 2 2 4
This would mean person 1 got stimuli 1 and 3 and person 2 got stimuli 2 and 4. Then your second dataset looks something like this:
(dat2 <- data.frame(person=c(1, 1, 2, 2), stim=c(1, 3, 4, 2), responded=c(0, 1, 0, 1)))
# person stim responded
# 1 1 1 0
# 2 1 3 1
# 3 2 4 0
# 4 2 2 1
This gives information about how each person responded to each stimulus they were given.
You can merge these two by matching person/stimulus pairs with the match function:
dat1$response1 <- dat2$responded[match(paste(dat1$person, dat1$stim1), paste(dat2$person, dat2$stim))]
dat1$response2 <- dat2$responded[match(paste(dat1$person, dat1$stim2), paste(dat2$person, dat2$stim))]
dat1
# person stim1 stim2 response1 response2
# 1 1 1 3 0 1
# 2 2 2 4 1 0
Another option (starting from the original dat1 and dat2) would be to merge twice with the merge function. You have a little less control on the names of the output columns, but it requires a bit less typing:
merged <- merge(dat1, dat2, by.x=c("person", "stim1"), by.y=c("person", "stim"))
merged <- merge(merged, dat2, by.x=c("person", "stim2"), by.y=c("person", "stim"))