I'd like to create a dataframe from a dataframe that I created before. My first dataframe is:
Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2
Then I want to create a dataframe like the one below, with one column for every motif-chromosome combination (96 motifs * 24 chromosomes):
Sample CT-G.A,chr1 TA-C.C,chr1 TC-G.C,chr1 CG-A.T,chr1 CA-G.T,chr1 CT-G.A,chr2 TA-C.C,chr2 TC-G.C,chr2 CG-A.T,chr2 CA-G.T,chr2
1 1 1 0 0 0 0 0 1 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 1 1
Here is a possible solution using dplyr and tidyr.
We add a column value that marks each observed motif-chromosome pair, then complete the data.frame so that there is a row for every motif-chromosome-Sample combination, with missing combinations getting a 0 in the value column. We build a key from the motif and chromosome columns, sum value within each Sample-key pair (so duplicated rows are counted), and finally reshape the data.frame from long to wide to get your desired format. Hope this helps!
df = read.table(text="Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2
2 CA-G.T 2",header=T)
library(tidyr)
library(dplyr)
df %>% mutate(value=1) %>% complete(motif,chromosome,Sample,fill=list(value=0)) %>%
mutate(key=paste0(motif,',chr',chromosome)) %>%
group_by(Sample,key) %>%
summarize(value = sum(value)) %>%
spread(key,value) %>%
as.data.frame
Output:
Sample CA-G.T,chr1 CA-G.T,chr2 CG-A.T,chr1 CG-A.T,chr2 CT-G.A,chr1 CT-G.A,chr2 TA-C.C,chr1 TA-C.C,chr2 TC-G.C,chr1 TC-G.C,chr2
1 1 0 0 0 0 1 0 1 0 0 1
2 2 0 2 0 1 0 0 0 0 0 0
This seems to be a classic case of when you would want to use factors and ensure that the empty factor levels aren't dropped (which dcast and other functions might do unless explicitly told not to).
Using @Florian's sample data, you can try:
library(data.table)
cols <- c("motif", "chromosome")
setDT(df)[, (cols) := lapply(.SD, factor), .SDcols = cols][
, dcast(unique(.SD)[, value := 1L],
Sample ~ motif + chromosome, value.var = "value",
fill = 0L, drop = FALSE)]
# Sample CA-G.T_1 CA-G.T_2 CG-A.T_1 CG-A.T_2 CT-G.A_1 CT-G.A_2 TA-C.C_1 TA-C.C_2 TC-G.C_1 TC-G.C_2
# 1 1 0 0 0 0 1 0 1 0 0 1
# 2 2 0 1 0 1 0 0 0 0 0 0
I've defined "cols" outside of the transformation to save some typing and make things look a little more tidy.
Using the "tidyverse", I'd take a slightly different approach from @Florian, perhaps something like:
library(tidyverse)
df %>%
mutate_at(c("motif", "chromosome"), factor) %>%
mutate(value = 1) %>%
distinct() %>%
mutate(key = interaction(motif, chromosome)) %>%
select(-motif, -chromosome) %>%
spread(key, value, fill = 0, drop = FALSE)
Benchmarks
Benchmarks for these approaches and @Florian's can be found at this Gist. They were run on 10,000 rows with 20 resulting columns.
This will work for you. I have used the tidyr and dplyr packages: expand.grid from base R builds every combination, left_join fills in the observed rows, unite builds the column key, and spread does the final reshaping.
df <- read.table(text = "Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2", header = TRUE)
#add a column to represent presence of chromosome
df$val <- 1
library(tidyr)
library(dplyr)
#Complete missing rows
df_complete <- left_join(
expand.grid(Sample=unique(df$Sample), motif=unique(df$motif),
chromosome=unique(df$chromosome)),
df, by = c("Sample", "motif", "chromosome"), copy = TRUE)
#Additional rows should have val = 0
df_complete$val[is.na(df_complete$val)] <- 0
df_complete %>%
unite(motif, c("motif", "chromosome"), sep = ",chr" ) %>%
spread(motif, val)
#Result
Sample CA-G.T,chr1 CA-G.T,chr2 CG-A.T,chr1 CG-A.T,chr2 CT-G.A,chr1 CT-G.A,chr2 TA-C.C,chr1 TA-C.C,chr2 TC-G.C,chr1 TC-G.C,chr2
1 1 0 0 0 0 1 0 1 0 0 1
2 2 0 1 0 1 0 0 0 0 0 0
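For comparison, the same completion step can also be sketched in base R alone: table() keeps empty factor levels, so every motif-chromosome pair gets a column even when no sample has it. This is only a sketch, assuming that counts (rather than strict 0/1 flags) are acceptable; the names key, counts and wide are made up for the illustration.

```r
# Same sample data as in the question
df <- read.table(text = "Sample motif chromosome
1 CT-G.A 1
1 TA-C.C 1
1 TC-G.C 2
2 CG-A.T 2
2 CA-G.T 2", header = TRUE)

# interaction() builds a factor with one level per motif-chromosome pair,
# including pairs that never occur in the data
key <- interaction(df$motif, paste0("chr", df$chromosome), sep = ",")

# table() counts occurrences per Sample and key; unused levels stay at 0
counts <- table(Sample = df$Sample, key = key)

# Turn the contingency table into a wide data.frame (row names = Sample)
wide <- as.data.frame.matrix(counts)
wide
```

If a strict presence/absence indicator is needed, +(wide > 0) gives a 0/1 matrix with the same row and column names.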
I want to classify the rows of a wide data frame in an added column, based on whether any of several columns exceeds a threshold (> 0). Previous examples here on SO require complete column names and ifelse() statements with > and ==, but I need to use grep() or contains() to pick out columns that share a common string.
Input dataframe:
library(tidyverse)
df <- data.frame(
"ID" = c("asdf","vfdkun", "seifu", "seijd", "qweri"),
"elephant_zoo" = c(1,1,1,2,0), #Should not be useful there
"rhino_zoo" = c(1,2,3,1,0), #Should not be useful there
"hippo_zoo" = c(1,1,0,0,0),
"elephant_wild_A" = c(0,0,1,1,3),
"rhino_wild_A" = c(0,0,4,3,1),
"elephant_wild_B" = c(0,0,0,0,0),
"rhino_wild_C" = c(0,0,1,5,7),
"hippo_wild_B" = c(0,0,0,0,0)) %>%
column_to_rownames(var = "ID")
df
In reality, this has many many more columns and rows!
The desired output data frame has the rows classified in two columns (ZOO and WILD), plus a column that combines them (CLASSIFICATION).
df_goal <- data.frame(
"ID" = c("asdf","vfdkun", "seifu", "seijd", "qweri"),
"elephant_zoo" = c(1,1,1,2,0), #Should not be useful there
"rhino_zoo" = c(1,2,3,1,0), #Should not be useful there
"hippo_zoo" = c(1,1,0,0,0),
"elephant_wild_A" = c(0,0,1,1,3),
"rhino_wild_A" = c(0,0,4,3,1),
"elephant_wild_B" = c(0,0,0,0,0),
"rhino_wild_C" = c(0,0,1,5,7),
"hippo_wild_B" = c(0,0,0,0,0)) %>%
column_to_rownames(var = "ID") %>%
add_column(ZOO = c("zoo", "zoo", "zoo", "zoo", "")) %>%
add_column(WILD = c("", "", "wild", "wild", "wild")) %>%
add_column(CLASSIFICATION = c("zoo only", "zoo only", "both", "both", "wild only"))
df_goal
I am hoping to use a combination of mutate() and case_when(), but I cannot get the selection of multiple columns right.
Examples of attempts:
# using an if else statement
df %>%
mutate(ZOO = ifelse(select(contains("zoo")) > 0, "zoo", "F"))
# using mutate and case_when
df %>%
mutate(ZOO = case_when(
select(contains("zoo")) > 0 ~ "zoo",
TRUE ~ ""))
My actual data frame has many more categories, so I need to be able to break it down into ZOO vs WILD first and then follow up with the compiled column.
You can try using reduce from the purrr package. An intermediate function any_cols makes the code clearer, and it can be combined with across:
library(tidyverse)
any_cols <- function(df) reduce(df, `|`)
df %>%
mutate(ZOO = ifelse(any_cols(across(contains("zoo"), ~`>`(.,0))), "zoo", "F"))
elephant_zoo rhino_zoo hippo_zoo elephant_wild_A rhino_wild_A elephant_wild_B rhino_wild_C hippo_wild_B ZOO
1 1 1 1 0 0 0 0 0 zoo
2 1 2 1 0 0 0 0 0 zoo
3 1 3 0 1 4 0 1 0 zoo
4 2 1 0 1 3 0 5 0 zoo
5 0 0 0 3 1 0 7 0 F
df %>%
mutate(ZOO =
case_when(any_cols(across(contains("zoo"), ~`>`(.,0))) ~ "zoo",
TRUE ~ "F"))
elephant_zoo rhino_zoo hippo_zoo elephant_wild_A rhino_wild_A elephant_wild_B rhino_wild_C hippo_wild_B ZOO
1 1 1 1 0 0 0 0 0 zoo
2 1 2 1 0 0 0 0 0 zoo
3 1 3 0 1 4 0 1 0 zoo
4 2 1 0 1 3 0 5 0 zoo
5 0 0 0 3 1 0 7 0 F
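The answer above only builds the ZOO column. As a sketch of how the remaining WILD and CLASSIFICATION columns might be added, the version below uses if_any(), which assumes dplyr 1.0.4 or later. The "neither" label and the choice to keep ID as a regular column are assumptions made for the illustration, not part of the question.

```r
library(dplyr)

# Same toy data as in the question, with ID kept as a regular column
df <- data.frame(
  ID              = c("asdf", "vfdkun", "seifu", "seijd", "qweri"),
  elephant_zoo    = c(1, 1, 1, 2, 0),
  rhino_zoo       = c(1, 2, 3, 1, 0),
  hippo_zoo       = c(1, 1, 0, 0, 0),
  elephant_wild_A = c(0, 0, 1, 1, 3),
  rhino_wild_A    = c(0, 0, 4, 3, 1),
  elephant_wild_B = c(0, 0, 0, 0, 0),
  rhino_wild_C    = c(0, 0, 1, 5, 7),
  hippo_wild_B    = c(0, 0, 0, 0, 0)
)

classified <- df %>%
  mutate(
    # if_any() is TRUE for a row when at least one selected column is > 0.
    # The selectors include the underscore so that the new ZOO and WILD
    # columns themselves are never picked up (matching ignores case).
    ZOO  = if_else(if_any(ends_with("_zoo"),  ~ .x > 0), "zoo", ""),
    WILD = if_else(if_any(contains("_wild_"), ~ .x > 0), "wild", ""),
    # Combine the two flags into a single label
    CLASSIFICATION = case_when(
      ZOO != "" & WILD != "" ~ "both",
      ZOO != ""              ~ "zoo only",
      WILD != ""             ~ "wild only",
      TRUE                   ~ "neither"
    )
  )

classified[, c("ID", "ZOO", "WILD", "CLASSIFICATION")]
```

With more categories, the same pattern repeats: one if_any() flag per category, then one case_when() over the flags.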
I have a data.frame with a column like this:
Column_1
AAA
B
BBB
AAA_FACE
CCC
BBB_AAA
I want to spread the column into new columns, but not for all my unique values (that would give far too many columns), only for the values containing a specific pattern: "AAA".
After spreading the values, I want to make them binary, so ideally my new data.frame looks like this:
AAA AAA_FACE BBB_AAA
1 0 0
0 0 0
0 0 0
0 1 0
0 0 0
0 0 1
I tried tidyr's spread() function, but it spreads the data into many, many columns (instead of only the columns containing the 'AAA' pattern).
One option with tidyverse would be
library(tidyverse)
df1 %>%
mutate(i1 = as.integer(str_detect(Column_1, "AAA")),
rn = row_number()) %>%
spread(Column_1, i1, fill = 0) %>%
select(matches("AAA"))
# AAA AAA_FACE BBB_AAA
#1 1 0 0
#2 0 0 0
#3 0 0 0
#4 0 1 0
#5 0 0 0
#6 0 0 1
It can be made a bit more efficient by replacing the other values with NA before doing the spread:
df1 %>%
mutate(i1 = as.integer(str_detect(Column_1, "AAA")),
Column_1 = replace(Column_1, !i1, NA),
rn = row_number()) %>%
spread(Column_1, i1, fill = 0) %>%
select(matches("AAA"))
Using base R:
Your data
db<-data.frame(Column_1=c("AAA","B","BBB","AAA_FACE","CCC","BBB_AAA"))
Identify "AAA" pattern
AAA_names<-as.character(db[grep("AAA",db$Column_1),"Column_1"])
Output dataframe creation:
out<-data.frame(lapply(AAA_names, function(x, y) x == y, y=as.character(db$Column_1)))
colnames(out)<-AAA_names
out[,AAA_names] <- lapply(out[,AAA_names], as.numeric)
Your output
out
AAA AAA_FACE BBB_AAA
1 1 0 0
2 0 0 0
3 0 0 0
4 0 1 0
5 0 0 0
6 0 0 1
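The base R steps above can also be condensed. The following is a minimal sketch, assuming Column_1 is stored as character; keep is a name made up for the illustration.

```r
db <- data.frame(
  Column_1 = c("AAA", "B", "BBB", "AAA_FACE", "CCC", "BBB_AAA"),
  stringsAsFactors = FALSE
)

# Distinct values that contain the pattern; these become the new columns
keep <- unique(grep("AAA", db$Column_1, value = TRUE))

# One 0/1 column per kept value; sapply() names the columns after `keep`
out <- as.data.frame(sapply(keep, function(v) as.integer(db$Column_1 == v)))
out
```

Because only the matching values are ever turned into columns, the result stays narrow no matter how many other distinct values Column_1 holds.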
I am attempting to do two things to a dataset I currently have:
ID IV1 DV1 DV2 DV3 DV4 DV5 DV6 DV7
1 97330 3 0 0 0 0 0 1 0
2 118619 0 0 0 0 0 1 1 0
3 101623 2 0 0 0 0 0 0 0
4 202626 0 0 0 0 0 0 0 0
5 182925 1 1 0 0 0 0 0 0
6 179278 1 0 0 0 0 0 0 0
1. Find the number of unique column combinations of the 7 binary independent variables (DV1 - DV7).
2. Find the sum of an independent count variable (IV1) for each unique group.
I have been able to determine the number of unique column combinations by using the following:
uniq <- unique(dat[,c('DV1','DV2','DV3','DV4','DV5','DV6','DV7')])
This indicates there are 101 unique combinations present in the dataset. What I haven't been able to figure out is how to sum the variable "IV1" within each unique group. I've been reading around on this site, and I'm fairly certain there is an easy dplyr answer to this, but it's eluded me so far.
NOTE: I'm essentially trying to find an R solution to perform a "conjunctive analysis" which is displayed in this paper. There is sample code for SPSS, SAS and STATA at the end of the paper.
library(dplyr)
group_by(dat, DV1, DV2, DV3, DV4, DV5, DV6, DV7) %>%
summarize(sumIV1 = sum(IV1))
The number of rows in the result is the number of unique combinations present in your data. The sumIV1 column, of course, has the group-wise sum of IV1.
Thanks to Frank in the comments, we can use strings with group_by_ to simplify:
group_by_(dat, .dots = paste0("DV", 1:7)) %>%
summarize(sumIV1 = sum(IV1))
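group_by_() is deprecated in current dplyr releases. A sketch of the same idea with across() and all_of(), assuming dplyr 1.0.0 or later, is shown below on the six rows printed in the question; the extra n column and the name dv_cols are additions for the illustration.

```r
library(dplyr)

# The six rows shown in the question
dat <- data.frame(
  ID  = c(97330, 118619, 101623, 202626, 182925, 179278),
  IV1 = c(3, 0, 2, 0, 1, 1),
  DV1 = c(0, 0, 0, 0, 1, 0),
  DV2 = 0, DV3 = 0, DV4 = 0,
  DV5 = c(0, 1, 0, 0, 0, 0),
  DV6 = c(1, 1, 0, 0, 0, 0),
  DV7 = 0
)

# Grouping columns given as strings
dv_cols <- paste0("DV", 1:7)

res <- dat %>%
  group_by(across(all_of(dv_cols))) %>%
  summarise(n = n(), sumIV1 = sum(IV1), .groups = "drop")

res        # one row per unique DV1-DV7 combination
nrow(res)  # number of unique combinations
```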
Here's a reproducible example:
library(data.table)
DT <- data.table(X = c(1, 1, 1 , 1), Y = c(2, 2 , 3 , 4), Z = c(1,1,3,1))
Where X, Y ... are your columns.
Then use the Reduce function:
DT[, join_grp := Reduce(paste,list(X,Y,Z))]
This gives:
DT
X Y Z join_grp
1: 1 2 1 1 2 1
2: 1 2 1 1 2 1
3: 1 3 3 1 3 3
4: 1 4 1 1 4 1
And we can find:
unique(DT[, join_grp])
[1] "1 2 1" "1 3 3" "1 4 1"
For the sums:
DT[ , sum(X), by = join_grp]
Just put whatever column you want to sum in place of the X
Concise Solution
DT[, join_grp := Reduce(paste,list(X,Y,Z))][ , sum(X), by = join_grp]
or
DT[ , sum(X), by = list(Reduce(paste,list(X,Y,Z)))]
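As a side note, data.table can group on several columns directly, so pasting them into one key is optional. The sketch below adds a hypothetical value column V (not in the example above) to sum, because a column that is also used in by has length one inside each group.

```r
library(data.table)

# X, Y, Z as in the example above; V is a made-up column to be summed
DT <- data.table(X = c(1, 1, 1, 1), Y = c(2, 2, 3, 4), Z = c(1, 1, 3, 1),
                 V = c(10, 20, 30, 40))

# One row per unique (X, Y, Z) combination, with group size and group sum
res <- DT[, .(n = .N, total = sum(V)), by = .(X, Y, Z)]
res

# Number of unique combinations
nrow(res)
```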
So I have been able to achieve my desired output, but I am sure there is a more efficient way to write the code, for example by working on the values as strings.
Let's play with this data:
set.seed(123)
A <- 1:100
type.a <- rnorm(100, mean=5000, sd=1433)
type.b <- rnorm(100, mean=5000, sd=1425)
type.c <- rnorm(100, mean=5000, sd=1125)
type.d <- rnorm(100, mean=5000, sd=1233)
df1 <- data.frame(A, type.a, type.b, type.c, type.d)
Now we want to create a new variable in df1 that identifies whether any of type.a to type.d begins with the digit 1. So I have used this code:
df1$Type_1 <- with(df1, ifelse((type.a < 2000 & type.a > 999)|(type.b < 2000 & type.b > 999)|
(type.c < 2000 & type.c > 999)|(type.d < 2000 & type.d > 999), 1,0))
Or similarly, this:
df1$type_1 <- with(df1, ifelse(type.a < 2000 & type.a > 999, 1,
ifelse(type.b < 2000 & type.b > 999, 1,
ifelse(type.c < 2000 & type.c > 999, 1,
ifelse(type.d < 2000 & type.d > 999, 1,0)))))
Now my question has two parts:
1. How can you use a string operation that looks only at the first digit of type(a:d) to test whether it meets the constraint (in this instance, equal to 1)?
2. I have more than four columns of data, so I don't think it is efficient to specify the column names each time. Can [,x:y] be used instead?
The code would then be used to create 9 new columns (i.e. type_1, type_2, ..., type_9), as the first digit of type(a:d) ranges from 1 to 9.
We can use substr to extract the first character of a string. As there are four columns whose names start with type, we can use grep to get their numeric indices, loop over those columns with lapply, and check whether the first character is equal to 1. To know whether at least one value in a column meets the condition, we wrap the comparison in any. lapply returns a list with one length-1 element per column. As we need a binary (0/1) result instead of a logical (FALSE/TRUE) one, we apply the unary + to coerce the logical to its integer representation.
indx <- grep('^type', colnames(df1))
lapply(df1[indx], function(x) +(any(substr(x, 1, 1)==1)))
If we need a vector output
vapply(df1[indx], function(x) +(any(substr(x, 1, 1)==1)), 1L)
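The calls above give one flag per column. If a flag per row is wanted instead, as in the type_1 ... type_9 columns the question describes, the same substr idea can be applied row-wise. This is a rough sketch on a small hand-made data frame (the values are invented so the result is easy to check by eye); it assumes all values are positive, at least 1, and printed without scientific notation.

```r
# Small made-up example with three type columns
df1 <- data.frame(
  A      = 1:4,
  type.a = c(1500.2, 5000.1, 2300.5, 9100.4),
  type.b = c(4200.7, 1999.9, 2750.3, 3000.6),
  type.c = c(6100.3, 7420.8, 2001.1, 1234.5)
)

# Columns whose names start with "type."
indx <- grep("^type\\.", colnames(df1))

# Character matrix holding the first digit of every value
first_digit <- sapply(df1[indx], function(x) substr(x, 1, 1))

# For each digit 1..9, flag the rows where at least one column starts with it
for (d in 1:9) {
  df1[[paste0("type_", d)]] <- +(rowSums(first_digit == as.character(d)) > 0)
}

df1[, c("A", "type_1", "type_2", "type_9")]
```

Selecting the columns by pattern means the loop does not change when more type columns are added.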
Great and elegant answer by @akrun. I was interested in the second part of your question, specifically how you're going to use the first part to create the 9 new columns you mention. I don't know if I'm missing something, but instead of checking each time whether the first digit matches 1, 2, 3, etc., you can simply capture that first digit. Something like this:
library(dplyr)
library(tidyr)
set.seed(123)
A <- 1:100
type.a <- rnorm(100, mean=5000, sd=1433)
type.b <- rnorm(100, mean=5000, sd=1425)
type.c <- rnorm(100, mean=5000, sd=1125)
type.d <- rnorm(100, mean=5000, sd=1233)
df1 <- data.frame(A, type.a, type.b, type.c, type.d)
df1 %>%
group_by(A) %>%
mutate(across(everything(), ~ substr(., 1, 1))) %>% # keep first digit (replaces the deprecated mutate_each()/funs())
ungroup %>%
gather(variable, type, -A) %>% # create combinations of rows and digits
select(-variable) %>%
mutate(type = paste0("type_",type),
value = 1) %>%
group_by(A,type) %>%
summarise(value = sum(value)) %>% # count how many times the row belongs to each type
ungroup %>%
spread(type, value, fill=0) %>% # create the new columns
inner_join(df1, by="A") %>% # join back initial info
select(A, starts_with("type."), starts_with("type_")) # order columns
# A type.a type.b type.c type.d type_1 type_2 type_3 type_4 type_5 type_6 type_7 type_8 type_9
# 1 1 4196.838 3987.671 7473.662 4118.106 0 0 1 2 0 0 1 0 0
# 2 2 4670.156 5366.059 6476.465 4071.935 0 0 0 2 1 1 0 0 0
# 3 3 7233.629 4648.464 4701.712 3842.782 0 0 1 2 0 0 1 0 0
# 4 4 5101.039 4504.752 5611.093 3702.251 0 0 1 1 2 0 0 0 0
# 5 5 5185.269 3643.944 4533.868 4460.982 0 0 1 2 1 0 0 0 0
# 6 6 7457.688 4935.835 4464.222 5408.344 0 0 0 2 1 0 1 0 0
# 7 7 5660.493 3881.511 4112.822 2516.478 0 1 1 1 1 0 0 0 0
# 8 8 3187.167 2623.183 4331.056 5261.372 0 1 1 1 1 0 0 0 0
# 9 9 4015.740 4458.177 6857.271 6524.820 0 0 0 2 0 2 0 0 0
# 10 10 4361.366 6309.570 4939.218 7512.329 0 0 0 2 0 1 1 0 0
# .. .. ... ... ... ... ... ... ... ... ... ... ... ... ...
Example when we have column A and B in the beginning:
library(dplyr)
library(tidyr)
set.seed(123)
A <- 1:100
B <- 101:200
type.a <- rnorm(100, mean=5000, sd=1433)
type.b <- rnorm(100, mean=5000, sd=1425)
type.c <- rnorm(100, mean=5000, sd=1125)
type.d <- rnorm(100, mean=5000, sd=1233)
df1 <- data.frame(A,B, type.a, type.b, type.c, type.d)
# work by grouping on A and B
df1 %>%
group_by(A,B) %>%
mutate(across(everything(), ~ substr(., 1, 1))) %>% # replaces the deprecated mutate_each()/funs()
ungroup %>%
gather(variable, type, -c(A,B)) %>%
select(-variable) %>%
mutate(type = paste0("type_",type),
value = 1) %>%
group_by(A,B,type) %>%
summarise(value = sum(value)) %>%
ungroup %>%
spread(type, value, fill=0) %>%
inner_join(df1, by=c("A","B")) %>%
select(A,B, starts_with("type."), starts_with("type_"))
# A B type.a type.b type.c type.d type_1 type_2 type_3 type_4 type_5 type_6 type_7 type_8 type_9
# 1 1 101 4196.838 3987.671 7473.662 4118.106 0 0 1 2 0 0 1 0 0
# 2 2 102 4670.156 5366.059 6476.465 4071.935 0 0 0 2 1 1 0 0 0
# 3 3 103 7233.629 4648.464 4701.712 3842.782 0 0 1 2 0 0 1 0 0
# 4 4 104 5101.039 4504.752 5611.093 3702.251 0 0 1 1 2 0 0 0 0
# 5 5 105 5185.269 3643.944 4533.868 4460.982 0 0 1 2 1 0 0 0 0
# 6 6 106 7457.688 4935.835 4464.222 5408.344 0 0 0 2 1 0 1 0 0
# 7 7 107 5660.493 3881.511 4112.822 2516.478 0 1 1 1 1 0 0 0 0
# 8 8 108 3187.167 2623.183 4331.056 5261.372 0 1 1 1 1 0 0 0 0
# 9 9 109 4015.740 4458.177 6857.271 6524.820 0 0 0 2 0 2 0 0 0
# 10 10 110 4361.366 6309.570 4939.218 7512.329 0 0 0 2 0 1 1 0 0
# .. .. ... ... ... ... ... ... ... ... ... ... ... ... ... ...
However, in this case you should notice that you have one A value for each line. So, B isn't really needed in order to define your rows (in a unique way). Therefore, you can work exactly as before (when B wasn't there) and just join B to your result:
df1 %>%
select(-B) %>%
group_by(A) %>%
mutate(across(everything(), ~ substr(., 1, 1))) %>% # replaces the deprecated mutate_each()/funs()
ungroup %>%
gather(variable, type, -A) %>%
select(-variable) %>%
mutate(type = paste0("type_",type),
value = 1) %>%
group_by(A,type) %>%
summarise(value = sum(value)) %>% # count how many times the row belongs to each type
ungroup %>%
spread(type, value, fill=0) %>%
inner_join(df1, by="A") %>%
mutate(B=B) %>%
select(A,B, starts_with("type."), starts_with("type_"))
Assume a data structure like the following
MemberID <- c(123,123,234,234)
nbin <- 4
imatrix <- matrix(sample(c(0,1), size=nbin * length(MemberID), replace=TRUE),
nrow=length(MemberID))
colnames(imatrix) <- letters[1:nbin]
years <- c("Y1","Y2","Y1","Y2")
mydf <- data.frame(cbind(MemberID, years, imatrix))
How can I make a similar data structure such that I have an indicator for each level of years for each of a, b, c, d?
I would like to have a 2 x 9 data frame with columns MemberID, a.Y1, a.Y2, b.Y1, b.Y2, ...
Ideally, I would like to do this with spread or cast as I have been using those tools a bit and would like to learn more about how to use them.
Using tidyr/dplyr
library(dplyr)
library(tidyr)
gather(mydf, Var, Val, a:d) %>%
unite(yearsVar, years, Var) %>%
spread(yearsVar, Val)
Base R solution:
reshape(mydf, timevar = "years", idvar= "MemberID", direction = "wide")
MemberID a.Y1 b.Y1 c.Y1 d.Y1 a.Y2 b.Y2 c.Y2 d.Y2
1 123 0 0 1 0 0 1 0 0
3 234 1 0 0 0 0 0 1 0
Solution using reshape2 (and magrittr):
library(reshape2); library(magrittr)
mydf %>% melt(c('MemberID','years')) %>% dcast(MemberID~...)
MemberID Y1_a Y1_b Y1_c Y1_d Y2_a Y2_b Y2_c Y2_d
1 123 0 0 1 0 0 1 0 0
2 234 1 0 0 0 0 0 1 0
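In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same reshape with those functions follows; the indicator values are fixed by hand here (the question draws them at random), and Var, Val and wide are names chosen for the illustration.

```r
library(dplyr)
library(tidyr)

# Fixed values instead of the random draw in the question
mydf <- data.frame(
  MemberID = c(123, 123, 234, 234),
  years    = c("Y1", "Y2", "Y1", "Y2"),
  a = c(0, 0, 1, 0),
  b = c(0, 1, 0, 0),
  c = c(1, 0, 0, 1),
  d = c(0, 0, 0, 0)
)

wide <- mydf %>%
  # Long format: one row per MemberID / years / indicator
  pivot_longer(a:d, names_to = "Var", values_to = "Val") %>%
  # Wide format: one column per indicator-year pair, named like a.Y1
  pivot_wider(id_cols = MemberID,
              names_from = c(Var, years),
              values_from = Val,
              names_sep = ".")
wide
```

Putting Var before years in names_from yields the a.Y1 style of names requested in the question; swapping the two gives Y1.a instead.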