I'm trying to reshape a data frame so that each unique value in a column becomes a binary column.
I've been provided data that looks like this:
df <- data.frame(id = c(1,1,2),
                 value = c(200,200,1000),
                 feature = c("A","B","C"))
print(df)
##id,value,feature
##1,200,A
##1,200,B
##2,1000,C
I'm trying to reshape it into this:
##trying to get here
##id,value,A,B,C
##1,200,1,1,0
##2,1000,0,0,1
spread(df,id,feature) fails because ids repeat.
I want to reshape the data to facilitate modeling - I'm trying to predict value from the presence or absence of features.
There is a way to do it with tidyr::spread, though, using a temporary indicator variable that is always equal to one.
library(dplyr)
library(tidyr)
# Add an indicator column v = 1, then spread feature into columns,
# filling feature/id combinations that never occur with 0
mutate(df, v = 1) %>%
  spread(feature, v, fill = 0)
id value A B C
1 1 200 1 1 0
2 2 1000 0 0 1
As noted in my earlier comment:
You can use dcast from the reshape2 package, because spread works best on data that have already been processed and/or are consistent with tidy-data principles. Your "spreading" is a little different (and more complicated), unless of course you combine spread with other functions.
library(reshape2)
# "..." in the formula stands for all remaining variables (here, feature);
# length counts how often each feature occurs per id/value pair
dcast(df, id + value ~ ..., length)
id value A B C
1 1 200 1 1 0
2 2 1000 0 0 1
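On tidyr 1.0 or later, where spread has been superseded, the same reshape can also be written with pivot_wider; a minimal sketch:
library(tidyr)

# One row per id/value pair, one 0/1 column per feature
pivot_wider(df,
            id_cols     = c(id, value),
            names_from  = feature,
            values_from = feature,
            values_fn   = length,  # 1 where the feature occurs
            values_fill = 0)       # 0 where it does not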
I have been continuing to learn R to transition away from Excel, and I am wondering what the best way to approach the following problem is, or at least what tools are available to me:
I have a large data set (100K+ rows) and several columns that I could generate a signal from, where each value in those columns can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals using the state of each cell in the four columns, then see what each composite signal tells me about the returns in a time series. For this question the scope is only generating the combinations.
So, for example, one composite signal would be when all four cells in the columns equal 0. I could generate a new column that reads TRUE in that case and FALSE in every other case, then go on to figure out how that affects the returns from the rest of the data frame.
The thing is, I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and so on, which is quite a few. With my current knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think you could paste the columns together to get the unique combinations, then just turn those into dummy variables:
library(dplyr)
library(dummies)
# Create sample data
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
                   sig2 = c(1,1,0,0,0,1,1),
                   sig3 = c(2,2,0,1,1,2,1))
# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1,sig2,sig3))
# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))
# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
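If you would rather not depend on the dummies package, a minimal base-R sketch of the same idea, starting from the sample data above:
# Paste the signal columns into one key, then expand that key into
# indicator columns with model.matrix (the -1 drops the intercept,
# so there is one column per unique combination)
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
                   sig2 = c(1,1,0,0,0,1,1),
                   sig3 = c(2,2,0,1,1,2,1))
data$sig_tot <- paste0(data$sig1, data$sig2, data$sig3)
ind <- model.matrix(~ sig_tot - 1, data = data) == 1  # logical matrix
data <- cbind(data, ind)
data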
Assume I have an original dataset containing a complete set of "texts" (a string variable), and a second dataset that only contains those "texts" for which the new variable "value" takes a certain value (0, 1, or NA).
Now I would like to merge them back together so that the resulting dataset contains the full range of "texts" from the first dataset but also includes "value", which should be 0 both where it was coded 0 and where the text is only present in the original dataset.
dat1<-data.frame(text=c("a","b","c","d","e","f","g","h")) # original dataset
dat2<-data.frame(text=c("e","f","g","h"), value=c(0,NA,1,1)) # second version
The final dataset should look like this:
> dat3
text value
1 a 0
2 b 0
3 c 0
4 d 0
5 e 0
6 f NA
7 g 1
8 h 1
However, what Base-R's merge() does is to introduce NAs where I want 0s instead:
dat3<-merge(dat1, dat2, by=c("text"), all=T)
Is there a way to define a default input for when the variable by which datasets are merged is only present in one but not the other dataset? In other words, how can I define 0 as standard input value instead of NA?
I am aware of the fact that I could temporarily change the coded NAs in the second dataset to something else to distinguish later on between "real" NAs and NAs that just get introduced, but I would really like to refrain from doing so, if there's another, cleaner way. Ideally, I would like to use merge() or plyr::join() for that purpose but couldn't find anything in the manual(s).
I know this is not ideal either, but it is something to consider:
library(dplyr)
dat3 <- dplyr::left_join(dat1, dat2, by = "text")
dat3$value[!dat3$text %in% dat2$text] <- 0  # texts absent from dat2 get 0
Or, wrapping it in a function so it can be called as a one-liner:
merge_NA <- function(dat1, dat2){
  dat3 <- dplyr::left_join(dat1, dat2, by = "text")
  dat3$value[!dat3$text %in% dat2$text] <- 0
  return(dat3)
}
Now, you only call:
merge_NA(dat1,dat2)
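The same idea also works with base merge(), using dat1 and dat2 from the question; a minimal sketch:
# Left-merge, then set value to 0 for texts that never appear in dat2
dat3 <- merge(dat1, dat2, by = "text", all.x = TRUE)
dat3$value[!dat3$text %in% dat2$text] <- 0
dat3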
I am a new user of RStudio, and I have a quite simple problem, but unfortunately I am not able to solve it.
I just want to aggregate the rows of my data frame by the words contained in its first column.
The data frame has five columns:
the first one contains words;
the second, third, fourth, and fifth contain numeric values.
For example, if the data were:
SecondWord X Y Z Q
NO 1 2 2 1
NO 0 0 1 0
YES 1 1 1 1
I expect to see a result like:
SecondWord X Y Z Q
NO 1 2 3 1
YES 1 1 1 1
How could I do this?
I have tried the following method:
test <- read.csv2("test.csv")
df <- aggregate(. ~ Secondword, data = test, FUN = sum, na.rm = TRUE)
But the values were not the ones I expected to see.
Thank you in advance for your help, and sorry for the "simple" question.
You can also use the tidyverse:
library(tidyverse)
df <- test %>%
  group_by(SecondWord) %>%
  summarize_each(funs(sum))
df
# SecondWord X Y Z Q
# NO 1 2 3 1
# YES 1 1 1 1
ddply should work as well.
For example, something like:
library(plyr)
grouped <- ddply(test, "Secondword", numcolwise(sum))
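On current dplyr (1.0+), summarise_each() and funs() are deprecated; a minimal sketch of the same aggregation with across(), assuming the grouping column is named SecondWord as shown in the example data:
library(dplyr)

# Sum every numeric column within each group
test %>%
  group_by(SecondWord) %>%
  summarise(across(where(is.numeric), sum), .groups = "drop")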
I am sure this is a super simple thing, but I cannot find a really quick and easy solution.
I have patient data with a lot of columns in a format like this:
patID disease category ...
1 1 A
2 0 B
3 1 C
4 1 B
How can I quickly produce a summary table that counts the observations for each value of each column/variable in the data frame? The result should be something like this:
VARIABLE Number of rows
disease:1 3
disease:0 1
category:A 1
category:B 2
category:C 1
...
I know I can do this for a single variable by just using table(data$column). But how can I produce something similar for all columns in a dataframe?
Using tidyr and dplyr:
gather(data, variable, value, -patID) %>%
count(variable, value)
(Thanks @Frank for reminding me about tally and count.)
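On recent tidyr, gather() has been superseded by pivot_longer(); a minimal sketch of the same count, coercing the mixed column types to character so they can share one value column:
library(dplyr)
library(tidyr)

data %>%
  pivot_longer(-patID, names_to = "variable", values_to = "value",
               values_transform = list(value = as.character)) %>%
  count(variable, value)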
I'm trying to do the equivalent of a SQL GROUP BY summary in R through the plyr function ddply. I have a data frame with three columns (say id, period and event). I'd like to count the number of times each id appears in the data frame (count(*) ... group by id in SQL) and, for each id, get the last element of the event column.
Here is an example of what I have and what I'm trying to obtain:
id period event #original data frame
1 1 1
2 1 0
2 2 1
3 1 1
4 1 1
4 1 0
id t x #what I want to obtain
1 1 1
2 2 1
3 1 1
4 2 0
This is the simple code I've been using for that:
teachers.pp<-read.table("http://www.ats.ucla.edu/stat/examples/alda/teachers_pp.csv", sep=",", header=T) # whole data frame
datos=ddply(teachers.pp,.(id),function(x) c(t=length(x$id), x=x[length(x$id),3])) #This is working fine.
Now, I've been reading The Split-Apply-Combine Strategy for Data Analysis, and it gives an example that uses syntax equivalent to the one I put below:
datos2=ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3]) #using summarise, but the result is not what I want.
This is the data frame I get for datos2:
id t x
1 1 1
2 2 0
3 1 1
4 1 1
So, my question is: why is this result different from the one I get using the first piece of code, i.e. datos? What am I doing wrong?
It is not clear to me when I should use summarise and when to use transform. Could you tell me the correct syntax for the ddply function?
When you use summarise, stop referencing the original data frame. Instead, just write expressions in terms of the column names: inside summarise, length(id) is the size of the current group, so teachers.pp[length(id),3] looks up row 1 or 2 of the whole data frame rather than the last row of the group.
You tried this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3])
when what you probably wanted was something more like this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=tail(event,1))
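For comparison, a minimal dplyr sketch of the same per-id summary, assuming the last value of event within each id is what you want:
library(dplyr)

teachers.pp %>%
  group_by(id) %>%
  summarise(t = n(),          # number of rows per id
            x = last(event))  # last event value within each id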