I am struggling to move one of the columns in my data into a row, as an observation. Below is a representative example of what my data looks like:
library(tidyverse)
test_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo"),
pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
ivv1=rep(11860),cv1=c(25875,386665,23006,10409),
abstention=rep(52199))
As seen above, the abstention column exists at the end of my data frame, and I would like my data to look like the following:
library(tidyverse)
desired_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo","abstention"),
pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
ivv1=rep(11860),cv1=c(25875,386665,23006,10409,52199))
Here, abstentions are treated like a candidate, in the can column. Thus, the rest of the data is maintained, and the abstention values are their own observation in the cv1 column.
I have tried using pivot_wider, but I am unsure how to use its arguments to get what I want. I have also considered using t() to transpose the column into a row, but I am having a hard time slotting it back into my data. Any help is appreciated! Thanks!
Here's a strategy that will work if you have multiple unit_names:
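test_df %>%
  group_split(unit_name) %>%
  map(function(group_data) {
    # build the abstention row from the first row of the group
    slice(group_data, 1) %>%
      mutate(can = "abstention", cv1 = abstention) %>%
      # append it after the group's original rows
      add_row(group_data, .) %>%
      select(-abstention)
  }) %>%
  bind_rows()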
Basically, we split the data up by unit_name, take the first row of each group, and overwrite can and cv1 with the abstention values. We append that as a new row to each group and then recombine all the groups.
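As a rough alternative sketch (not the code from the answer above), the same idea can be written without splitting into a list: take one row per unit_name, relabel it as the abstention row, and bind it back onto the original data. The names abst_rows and result are made up for illustration, and the final arrange() assumes it is acceptable for the units to end up sorted by name.

```r
library(dplyr)

# Example data copied from the question
test_df <- tibble(unit_name = "Chungcheongbuk-do", unit_n = 2,
                  can = c("Cho Bong-am", "Lee Seung-man", "Lee Si-yeong", "Shin Heung-woo"),
                  pev1 = 510014, vot1 = 457815, vv1 = 445955,
                  ivv1 = 11860, cv1 = c(25875, 386665, 23006, 10409),
                  abstention = 52199)

# One row per unit, relabelled so that abstention looks like a candidate
abst_rows <- test_df %>%
  group_by(unit_name) %>%
  slice(1) %>%
  ungroup() %>%
  mutate(can = "abstention", cv1 = abstention)

# Append the new rows, group each unit's rows together, drop the old column
result <- bind_rows(test_df, abst_rows) %>%
  arrange(unit_name) %>%
  select(-abstention)
```

With a single unit, as in the example, the abstention row simply ends up as the last row.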
I have an HMMSCAN result file of protein domains with 10 columns. Please see the link for the CSV file:
https://docs.google.com/spreadsheets/d/10d_YQwD41uj0q5pKinIo7wElhDj3BqilwWxThfIg75s/edit?usp=sharing
But I want it to look like this:
1BVN:P|PDBID|CHAIN|SEQUENCE Alpha-amylase Alpha-amylase_C A_amylase_inhib
3EF3:A|PDBID|CHAIN|SEQUENCE Cutinase
3IP8:A|PDBID|CHAIN|SEQUENCE Amdase
4Q1U:A|PDBID|CHAIN|SEQUENCE Arylesterase
4ROT:A|PDBID|CHAIN|SEQUENCE Esterase
5XJH:A|PDBID|CHAIN|SEQUENCE DLH
6QG9:A|PDBID|CHAIN|SEQUENCE Tannase
The repeated entries of column 3 should be grouped, and the corresponding values of column 1, which are in different rows, should be arranged in separate columns.
This is what I have written so far:
df <- read.csv ("hydrolase_sorted.txt" , header = FALSE, sep ="\t")
new <- df %>% select (V1,V3) %>% group_by(V3) %>% spread(V1, V3)
I hope I am clear with the problem statement. Thanks in advance!!
Your input data set has two irregular rows. Apart from that, the approach in your solution is right, but one more step is required:
library(dplyr)
df %>% select(V3, V1) %>% group_by(V3) %>% summarise(x = paste(V1, collapse = " "))
What we did here is simply concatenate the V1 strings for each V3, giving one row per V3. Before running the code above, you should preprocess the data and fix a few improper rows manually (the TIM, Dannase, and DLH rows). To do that, you can use the Text to Columns feature in Excel.
The required steps are shown below, with the problematic columns highlighted in yellow:
Sorry for the non-English interface of my Excel, but the steps are self-explanatory.
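To make the concatenation step concrete, here is a minimal sketch on a tiny made-up data frame. The three rows are invented for illustration and only borrow identifiers from the desired output in the question; the real file has 10 columns, of which only V1 and V3 matter here.

```r
library(dplyr)

# Toy stand-in for the HMMSCAN table (values are illustrative, not the real file)
df <- data.frame(
  V1 = c("Alpha-amylase", "Alpha-amylase_C", "Cutinase"),
  V3 = c("1BVN:P|PDBID|CHAIN|SEQUENCE",
         "1BVN:P|PDBID|CHAIN|SEQUENCE",
         "3EF3:A|PDBID|CHAIN|SEQUENCE"),
  stringsAsFactors = FALSE
)

# One row per V3, with all of its V1 values pasted together
out <- df %>%
  group_by(V3) %>%
  summarise(x = paste(V1, collapse = " "))
```

Each sequence identifier then appears once, followed by a space-separated list of its domains.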
library(tidyverse)
Using the sample data at the bottom, I want to find counts of the Gender and FP variables, then spread these variables using tidyr::spread(). I'm attempting to do this by creating a list of dataframes, one for the Gender counts, and one for FP counts. The reason I'm doing this is to eventually cbind both dataframes. However, I'm having trouble incorporating the tidyr::spread into my function.
The function below creates a list of two dataframes with counts for Gender and FP, but the counts are not "spread."
group_by_quo=quos(Gender,FP)
DF2<-map(group_by_quo,~DF%>%
group_by(Code,!!.x)%>%
summarise(n=n()))
If I add tidyr::spread, it doesn't work. I'm not sure how to incorporate this since each dataframe in the list has a different variable.
group_by_quo=quos(Gender,FP)
DF2<-map(group_by_quo,~DF%>%
group_by(Code,!!.x)%>%
summarise(n=n()))%>%
spread(!!.x,n)
Any help would be appreciated!
Sample Code:
Subject<-c("Subject1","Subject2","Subject1","Subject3","Subject3","Subject4","Subject2","Subject1","Subject2","Subject4","Subject3","Subject4")
Code<-c("AAA","BBB","AAA","CCC","CCC","DDD","BBB","AAA","BBB","DDD","CCC","DDD")
Code2<-c("AAA2","BBB2","AAA2","CCC2","CCC2","DDD2","BBB2","AAA2","BBB2","DDD2","CCC2","DDD2")
Gender<-c("Male","Male","Female","Male","Female","Female","Female","Male","Male","Male","Male","Male")
FP<-c("F","P","P","P","F","F","F","F","F","F","F","F")
DF<-data_frame(Subject,Code,Code2,Gender,FP)
I think you misplaced the closing parenthesis. This code works for me:
library(tidyverse)
Subject<-c("Subject1","Subject2","Subject1","Subject3","Subject3","Subject4","Subject2","Subject1","Subject2","Subject4","Subject3","Subject4")
Code<-c("AAA","BBB","AAA","CCC","CCC","DDD","BBB","AAA","BBB","DDD","CCC","DDD")
Code2<-c("AAA2","BBB2","AAA2","CCC2","CCC2","DDD2","BBB2","AAA2","BBB2","DDD2","CCC2","DDD2")
Gender<-c("Male","Male","Female","Male","Female","Female","Female","Male","Male","Male","Male","Male")
FP<-c("F","P","P","P","F","F","F","F","F","F","F","F")
DF<-tibble(Subject,Code,Code2,Gender,FP)  # data_frame() is deprecated in favour of tibble()
group_by_quo <- quos(Gender, FP)
DF2 <- map(group_by_quo,
~DF %>%
group_by(Code,!!.x) %>%
summarise(n=n()) %>%
spread(!!.x,n))
This last part is a bit more concise using count:
DF2 <- map(group_by_quo,
~DF %>%
count(Code,!!.x) %>%
spread(!!.x,n))
And by using count the unnecessary grouping information is removed as well.
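Since spread() has been superseded by pivot_wider() in tidyr 1.0 and later, the same step could be sketched as follows. This is only one possible variant, not the answer's code: it assumes the grouping columns are given as strings (the vector group_cols is a made-up name), and it fills missing combinations with 0 and joins the two tables by Code, which goes slightly beyond what the original code does.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Sample data from the question
Subject <- c("Subject1","Subject2","Subject1","Subject3","Subject3","Subject4","Subject2","Subject1","Subject2","Subject4","Subject3","Subject4")
Code <- c("AAA","BBB","AAA","CCC","CCC","DDD","BBB","AAA","BBB","DDD","CCC","DDD")
Gender <- c("Male","Male","Female","Male","Female","Female","Female","Male","Male","Male","Male","Male")
FP <- c("F","P","P","P","F","F","F","F","F","F","F","F")
DF <- tibble(Subject, Code, Gender, FP)

group_cols <- c("Gender", "FP")  # hypothetical: column names as strings

# One wide count table per grouping column
DF2 <- map(group_cols, function(col) {
  DF %>%
    count(Code, !!rlang::sym(col)) %>%
    pivot_wider(names_from = all_of(col), values_from = n, values_fill = 0)
})

# Join the tables on Code instead of cbind-ing them
combined <- reduce(DF2, left_join, by = "Code")
```

Joining on Code keeps the rows aligned even if one of the tables were to lack a Code value, which a plain cbind would not guarantee.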
cols <- data %>% names()
data %>% dplyr::filter_(is.na(cols[1]))
returns zero rows although it should output some. Alternatively, calling
data %>% dplyr::filter(is.na("colName"))
outputs rows.
So dynamic filtering is not working. Any idea what the alternative is?
dplyr::filter(data, is.na(data[[cols[1]]]))
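The reason the original attempt returns nothing is that is.na(cols[1]) tests the column name itself, which is a non-missing string, so the condition is always FALSE. As a small sketch of an alternative, the .data pronoun lets filter() look a column up by its name held in a variable. The tibble below is a made-up stand-in for the asker's data.

```r
library(dplyr)

# Hypothetical data with some missing values
data <- tibble(a = c(1, NA, 3), b = c("x", "y", NA))
cols <- names(data)

# Keep the rows where the column named by cols[1] is NA
out <- data %>%
  filter(is.na(.data[[cols[1]]]))
```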
I have a data.frame:
set.seed(1L)
vector <- data.frame(patient=rep(1:5,each=2),medicine=rep(1:3,length.out=10),prob=runif(10))
I want to get the mean of the "prob" column while grouping by patient. I do this with the following code:
vector %>%
group_by(patient) %>%
summarise(average=mean(prob))
This code works perfectly. However, I need to get the same values without using the word "prob" on the "summarise" line. I tried the following code, but it gives me a data.frame in which the column "average" is a vector with 5 identical values, which is not what I want:
vector %>%
group_by(patient) %>%
summarise(average=mean(vector[,3]))
PS: to explain why I need this: I have another data frame with multiple columns with complex names that need to be "summarised", which is why I can't list them one by one in the summarise command. What I want is to put a vector there to calculate the probs of each column grouped by patient.
It appears you want summarise_each:
vector %>%
group_by(patient) %>%
summarise_each(funs(mean), vars= matches('prob'))
Using data.table, you could do:
library(data.table)
setDT(vector)[, lapply(.SD, mean), by = patient, .SDcols = 'prob']
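The attempt with mean(vector[,3]) returns five identical values because vector[,3] refers to the whole data frame rather than to the current group. In current dplyr versions, summarise_each() and funs() are deprecated, so a sketch of the same idea with across() might look like the following. The variable target_cols is a made-up name that would hold whichever column names need summarising.

```r
library(dplyr)

# Data from the question
set.seed(1L)
vector <- data.frame(patient = rep(1:5, each = 2),
                     medicine = rep(1:3, length.out = 10),
                     prob = runif(10))

target_cols <- "prob"  # hypothetical: a character vector of columns to summarise

# Mean of every selected column within each patient
out <- vector %>%
  group_by(patient) %>%
  summarise(across(all_of(target_cols), mean))
```

Because the columns are selected through a character vector, no column name has to be typed inside summarise().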