Doubts about ddply function in R - r

I'm trying to do the equivalent of a SQL GROUP BY summary in R using the plyr function ddply. I have a data frame with three columns (say id, period and event). I'd like to count the number of times each id appears in the data frame (count(*) ... group by id in SQL) and get the last value of the event column for each id.
Here is an example of what I have and what I'm trying to obtain:
id period event #original data frame
1 1 1
2 1 0
2 2 1
3 1 1
4 1 1
4 1 0
id t x #what I want to obtain
1 1 1
2 2 1
3 1 1
4 2 0
This is the simple code I've been using for that:
teachers.pp<-read.table("http://www.ats.ucla.edu/stat/examples/alda/teachers_pp.csv", sep=",", header=T) # whole data frame
datos=ddply(teachers.pp,.(id),function(x) c(t=length(x$id), x=x[length(x$id),3])) #This is working fine.
Now, I've been reading The Split-Apply-Combine Strategy for Data Analysis, which gives an example using a syntax equivalent to the one below:
datos2=ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3]) #using summarise but the result is not what I want.
This is the data frame I get using datos2
id t x
1 1 1
2 2 0
3 1 1
4 1 1
So, my question is: why is this result different from the one I get using the first piece of code (datos)? What am I doing wrong?
It is not clear to me when I should use summarise and when transform. Could you tell me the correct syntax for the ddply function?

When you use summarise, stop referencing the original data frame. Instead, just write expressions in terms of the column names.
You tried this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3])
when what you probably wanted was something more like this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=tail(event,1))
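To see why the two calls differ, here is a minimal base R sketch on the example data from the question (it does not need plyr; groups, t_count, wrong and right are just illustrative names). Inside summarise, length(id) is the group size, so teachers.pp[length(id), 3] picks row 1 or row 2 of the whole data frame rather than the last row of the group.

```r
# Example data copied from the question as posted
df <- data.frame(id     = c(1, 2, 2, 3, 4, 4),
                 period = c(1, 1, 2, 1, 1, 1),
                 event  = c(1, 0, 1, 1, 1, 0))

groups <- split(df, df$id)   # one piece per id, similar to ddply(df, .(id), ...)

# Group size: what t = length(id) computes
t_count <- sapply(groups, nrow)

# What x = df[length(id), 3] does: indexes the FULL data frame by the group size
wrong <- sapply(groups, function(g) df[nrow(g), 3])

# What x = tail(event, 1) does: last event within the group
right <- sapply(groups, function(g) tail(g$event, 1))

data.frame(id = names(groups), t = t_count, wrong = wrong, right = right)
```

The right column matches the result the question asks for; the wrong column only agrees with it when the group's size happens to point at a row with the same event value.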

Related

Procedural way to generate signal combinations and their output in r

I have been continuing to learn R to transition away from Excel, and I am wondering what the best way to approach the following problem is, or at least what tools are available to me:
I have a large data set (100K+ rows) with several columns that I could generate a signal from. Each value in these columns can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals using the state of each cell in the four columns, then see what each composite signal tells me about the returns in a time series. For this question the scope is only generating the combinations.
So, for example, one composite signal would be when all four cells in a row equal 0. I could generate a new column that reads TRUE when that is the case and FALSE otherwise, then go on to figure out how that affects the returns from the rest of the data frame.
The thing is, I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and so on, which is quite a few. With my current knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!

I think you could paste the columns together to get the unique combinations, then turn the result into dummy variables:
library(dplyr)
library(dummies)
# Create sample data
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
                   sig2 = c(1,1,0,0,0,1,1),
                   sig3 = c(2,2,0,1,1,2,1))
# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1,sig2,sig3))
# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))
# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
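If you also want columns for combinations that never occur in the data (the question asks about all of 0000, 0001, and so on), a base R sketch along these lines may work. It assumes three signal columns taking values 0 to 3, as in the sample data above; g, all_combos, key and ind are illustrative names, and it does not use the dummies package.

```r
# Sample data from the answer above
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
                   sig2 = c(1,1,0,0,0,1,1),
                   sig3 = c(2,2,0,1,1,2,1))

# Every possible combination of three signals with values 0..3 (4^3 = 64)
g <- expand.grid(s1 = 0:3, s2 = 0:3, s3 = 0:3)
all_combos <- sort(paste0(g$s1, g$s2, g$s3))

# Combination observed in each row, e.g. "112"
key <- paste0(data$sig1, data$sig2, data$sig3)

# Logical matrix: one row per observation, one column per possible combination
ind <- outer(key, all_combos, "==")
colnames(ind) <- paste0("sig_", all_combos)

# How often each combination occurs (zero for combinations never seen)
combo_counts <- colSums(ind)
```

cbind(data, ind) would attach the indicator columns to the data frame. With four columns the same idea gives 256 columns, so it may be more practical to keep only the key column and group on it.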

Like dcast but without sum of data

I have data organized for the R survival package, but want to export it to work in Graphpad Prism, which uses a different structure.
#Example data
Treatment<-c("A","A","A","A","A","B","B","B","B","B")
Time<-c(3,4,5,5,5,1,2,2,3,5)
Status<-c(1,1,0,0,0,1,1,1,1,1)
df<-data.frame(Treatment,Time,Status)
The R survival package data structure looks like this
Treatment Time Status
A 3 1
A 4 1
A 5 0
A 5 0
A 5 0
B 1 1
B 2 1
B 2 1
B 3 1
B 5 1
The output I need organizes each treatment as one column, and then sorts by time. Each individual is then recorded as a 1 or 0 according to its Status. The output should look like this:
Time A B
1      1
2      1
2      1
3    1 1
4    1
5    0 1
5    0
5    0
dcast() does something similar to what I want, but it sums up the Status values and merges them into one cell for all individuals with matching Time values.
Thanks for any help!

I ran into a weird issue when trying to apply Sotos' code to my actual data. I got the error:
Error in Math.factor(var) : ‘abs’ not meaningful for factors
Which is weird, because Sotos' code works for the example. When I checked the example data frame using sapply() it gave me the result:
> sapply(df,class)
Treatment Time Status
"factor" "numeric" "numeric"
My issue, as far as I could tell, was that my Status variable was read as numeric in my example, but as an integer in my real data:
> sapply(df,class)
Treatment Time Status
"factor" "numeric" "integer"
I loaded my data from a .csv, so maybe that's what caused the change in variable class. I ended up converting my Status variable with as.numeric(), and then re-generating the data frame.
Status<-as.numeric(df$Status)
df<-data.frame(Treatment, Time, Status)
And was able to apply Sotos' code to the new dataframe.
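Sotos' code is not reproduced in this thread, so the following is only a base R sketch of one way to get the layout asked for in the question, not that answer's method. It numbers repeated (Treatment, Time) pairs so that duplicates stay on separate rows, then reshapes to wide; the helper column k and the name wide are illustrative.

```r
# Example data from the question
df <- data.frame(Treatment = c("A","A","A","A","A","B","B","B","B","B"),
                 Time      = c(3,4,5,5,5,1,2,2,3,5),
                 Status    = c(1,1,0,0,0,1,1,1,1,1))

# Running index within each (Treatment, Time) pair: 1, 2, 3, ...
df$k <- ave(df$Status, df$Treatment, df$Time, FUN = seq_along)

# One column per treatment; (Time, k) identifies a row of the output
wide <- reshape(df, idvar = c("Time", "k"), timevar = "Treatment",
                direction = "wide")

# Sort by time; cells with no individual are NA
wide <- wide[order(wide$Time, wide$k), ]
wide
```

The result has columns Time, k, Status.A and Status.B. Dropping k and writing with write.csv(..., na = "") should give blank cells in place of NA for import elsewhere.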

Mutate Cumsum with Previous Row Value

I am trying to run a cumsum on a data frame for two separate columns. They are essentially tabulations of events for two different variables. Only one variable can have an event recorded per row in the data frame. The way I attacked the problem was to create a new variable, holding the value ‘1’, and create two new columns to hold each variable's running total. This works fine, and I can get the correct total number of occurrences, but the problem is that in my current ifelse statement, if the event recorded is for variable “A”, then variable “B” is assigned 0. Instead, for every row, I want the other variable's previous total carried forward to the current row, so that I don’t end up with gaps where it goes from 1 to 2, to 0, to 3.
I don't want to run summarize on this either, I would prefer to keep each recorded instance and run new columns through mutate.
CURRENT DF:
Event Value Variable Total.A Total.B
1 1 A 1 0
2 1 A 2 0
3 1 B 0 1
4 1 A 3 0
DESIRED RESULT:
Event Value Variable Total.A Total.B
1 1 A 1 0
2 1 A 2 0
3 1 B 2 1
4 1 A 3 1
Thanks!

You can exploit the fact that logical values sum as ones and zeroes, and take the cumulative sum of the comparison with cumsum:
DF$Total.A <- cumsum(DF$Variable=="A")
Or, as a more general approach provided by @Frank, you can do:
uv = unique(as.character(DF$Variable))
DF[, paste0("Total.",uv)] <- lapply(uv, function(x) cumsum(DF$Variable == x))
If you have many levels in your factor, you can get this in one line by dummy coding and then taking the cumulative sum of each column of the resulting matrix.
X <- model.matrix(~Variable+0, DF)
apply(X, 2, cumsum)
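Applied to the example from the question, the general approach might look like this (a sketch; DF is rebuilt here from the table in the question, without the original Total columns):

```r
# Rebuild the example: four events, three for A and one for B
DF <- data.frame(Event = 1:4,
                 Value = 1,
                 Variable = c("A", "A", "B", "A"))

# One running total per distinct value of Variable
uv <- unique(as.character(DF$Variable))
DF[paste0("Total.", uv)] <- lapply(uv, function(x) cumsum(DF$Variable == x))

DF
```

Total.A comes out as 1, 2, 2, 3 and Total.B as 0, 0, 1, 1, which is the desired result: rows where the other variable fires keep the previous total instead of dropping to 0.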

Reshaping data - is this an operation for tidyr::spread?

I'm trying to reshape a data frame so that each unique value in a column becomes a binary column.
I've been provided data that looks like this:
df <- data.frame(id = c(1,1,2),
                 value = c(200,200,1000),
                 feature = c("A","B","C"))
print(df)
##id,value,feature
##1,200,A
##1,200,B
##2,1000,C
I'm trying to reshape it into this:
##trying to get here
##id,value,A,B,C
##1,200,1,1,0
##2,1000,0,0,1
spread(df,id,feature) fails because ids repeat.
I want to reshape the data to facilitate modeling - I'm trying to predict value from the presence or absence of features.

There is a way to do it with tidyr::spread though, using a helper variable that is always equal to one.
library(dplyr)
library(tidyr)
mutate(df,v=1) %>%
spread(feature,v,fill=0)
id value A B C
1 1 200 1 1 0
2 2 1000 0 0 1
As in my previous comment:
You need dcast from the reshape2 package here, because spread on its own works well only for data that have already been processed and/or are consistent with tidy data principles. Your "spreading" is a little different (and more complicated), unless of course you combine spread with other functions.
library(reshape2)
dcast(df, id + value ~ ..., length)
id value A B C
1 1 200 1 1 0
2 2 1000 0 0 1
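For comparison, here is a base R sketch that needs neither tidyr nor reshape2. It is only one possible alternative, not either answer's method: it dummy-codes feature with model.matrix and then collapses rows per id/value pair with aggregate (X and wide are illustrative names).

```r
df <- data.frame(id = c(1,1,2),
                 value = c(200,200,1000),
                 feature = c("A","B","C"))

# One 0/1 column per feature level; columns are named featureA, featureB, ...
X <- model.matrix(~ feature + 0, df)
colnames(X) <- sub("^feature", "", colnames(X))

# Collapse to one row per (id, value); max gives 1 if the feature is present
wide <- aggregate(X, by = df[c("id", "value")], FUN = max)
wide
```

Using FUN = sum instead of max would give counts rather than presence/absence if a feature can repeat within an id.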

How to cross-tabulate two variables in R?

This seems to be basic, but I can't get it to work. I am trying to compute a frequency table in R for the data below:
1 2
2 1
3 1
I want to export the two-way frequencies as CSV output, whose rows are all the unique entries in column A of the data, whose columns are all the unique entries in column B, and whose cell values are the number of times each pair of values occurs. I have explored some constructs like table, but I am not able to output the values correctly in CSV format.
Output of sample data:
"","1","2"
"1",0,1
"2",1,0
"3",1,0
The data:
df <- read.table(text = "1 2
2 1
3 1")
Calculate frequencies using table:
(If your object is a matrix, you could convert it to a data frame using as.data.frame before using table.)
tab <- table(df)
V2
V1 1 2
1 0 1
2 1 0
3 1 0
Write data with the function write.csv:
write.csv(tab, "tab.csv")
The resulting file:
"","1","2"
"1",0,1
"2",1,0
"3",1,0
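A quick way to check the round trip is to write to a temporary file and read it back. This is a sketch; out and back are illustrative names, and tempfile() is used here instead of "tab.csv" so nothing is left in the working directory.

```r
df <- read.table(text = "1 2
2 1
3 1")

# Two-way frequency table of the two columns (V1 by V2)
tab <- table(df)

# Write the table in wide form, then read it back with the first column as row names
out <- tempfile(fileext = ".csv")
write.csv(tab, out)
back <- read.csv(out, row.names = 1, check.names = FALSE)
back
```

Note that write.csv(as.data.frame(tab), ...) would give the long format instead (one row per V1/V2 pair with a Freq column), which is not the layout asked for here.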
