I have a large table, which starts like this
Essentially, it's a table with multiple samples ("samp_id") showing the number ("least") of "taxon" present in each.
I want to transpose/pivot the table to look like this:
i.e. with "taxon" as the top row, and each of the 90 samples in "data" following as a row of "least" values, named by its "samp_id". That way you can see what each sample is, as well as the value of "least" for each "taxon" in that sample (the taxa may not be identical across the 90 samples).
Previously, I separated the data into multiple tibbles based on "samp_id", selected "taxon" and "least", renamed "least" to the "samp_id", combined the individual tibbles on "taxon" with full_join using something like the code below, and then transposed the combined table:
ACLOD_11 = data %>%
  filter(samp_id == "ACLOD_11") %>%
  select(taxon, least) %>%
  rename("ACLOD_11" = least)
ACLOD_12 = data ... #as above, but different samp_id
data_final = list(ACLOD_11, ACLOD_12, ...) %>%
  reduce(full_join, by = "taxon")
I have more data tables to follow after this one with 90 samples, so I want to be able to do this without having to separate the data into 100s of tibbles individually and manually input each "samp_id" before joining.
I have currently split the data into 90 separate tibbles based on "samp_id" (there are 90 samples in "data")
data_split = data %>%
group_split(samp_id)
but I am unsure whether this is the best way to do it, or what I should do next.
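We can split the data by samp_id, rename least to each sample's ID with imap(), and combine the pieces with full_join: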
library(dplyr)
library(purrr)
data %>%
  split(.$samp_id) %>%
  imap(~ .x %>%
    select(taxon, least) %>%
    rename(!!.y := least)) %>%
  reduce(full_join, by = 'taxon')
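As a different technique from the split-and-join above, tidyr's pivot_wider() can probably produce the samples-as-rows layout directly, without a transpose step. This is only a sketch: the toy `data` tibble and the name `wide` are made up for illustration, and it assumes each samp_id/taxon pair occurs at most once.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the real `data` table
data <- tibble(
  samp_id = c("ACLOD_11", "ACLOD_11", "ACLOD_12"),
  taxon   = c("A", "B", "A"),
  least   = c(3, 5, 2)
)

# One row per sample, one column per taxon; taxa absent from a sample become NA
wide <- data %>%
  select(samp_id, taxon, least) %>%
  pivot_wider(names_from = taxon, values_from = least)

print(wide)
```

If a samp_id/taxon pair can occur more than once, pivot_wider() will produce list-columns unless a `values_fn` (for example `sum`) is supplied.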
I have a question on how to take a random sample while keeping together the items that belong to the same group. What I'm really trying to do is sample groups, where each sampled group has to include every one of its items.
Here is a method of sampling from mtcars. Using this, I get two random rows,
(sampled_df <- mtcars[sample(nrow(mtcars), 2), ])
I can take mtcars and then number it as though there are groups. mtcars has 32 observations. Here I'm saying that there are eight groups with four items each.
library(dplyr)
mtcars %>%
mutate(number = rep(1:8,each=4)) %>%
group_by(number) %>%
sample_n(2)
The last two lines of code aren't doing what I was hoping they would. I'm trying to get eight rows as output: all four of the observations from two of the groups.
I'm really working with invoice data and I want to be able to make the data frame smaller while making sure that I'm keeping the basket sizes the same.
What you might want is:
mtcars %>%
mutate(number = rep(1:8,each=4)) %>%
filter(number %in% sample(1:8, 2))
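The same idea can be sketched without hard-coding the group labels, by sampling from whatever group IDs are present. The names `grouped`, `keep` and `sampled` are made up for this illustration, and set.seed() is there only to make the sketch reproducible.

```r
library(dplyr)

set.seed(1)  # only so the sketch is reproducible

# mtcars with eight artificial groups of four rows each
grouped <- mtcars %>%
  mutate(number = rep(1:8, each = 4))

# Sample two group IDs, then keep every row belonging to those groups
keep <- sample(unique(grouped$number), 2)
sampled <- grouped %>%
  filter(number %in% keep)

nrow(sampled)
```

For invoice data, the grouping column would presumably be the invoice ID, so every line of a sampled invoice is kept and basket sizes are preserved.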
I have an issue with merging many columns by the same ID. I know this is possible for two lists, but I need to combine all the species columns into one, so that the first column is species (combined), followed by w, w.1, w.2, w.3, w.4... The species columns all contain the same species, but not in the same order, so I can't just drop every other column: the w values would no longer be associated with the right species. This is an extremely large dataset of 10000 rows and 2000 columns, so this would need to be automated. I need the w values to stay associated with the corresponding species. Dataset attached.
Thank you for any help
If your data is in a frame called dt, you can use lapply() along with bind_rows() like this:
library(dplyr)
library(tidyr)
bind_rows(
  lapply(seq(1,ncol(dt),2), function(x) {
    dt[,c(x,x+1)] %>%
      rename_with(~c("Species", "value")) %>%
      mutate(w = colnames(dt)[x+1])
  })
) %>%
  pivot_wider(id_cols = Species, names_from = w)
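To see what this does, here is a small sketch on made-up data. The frame `dt` below is hypothetical: it assumes the columns alternate species, value, species, value, with the species listed in a different order in each pair. `values_from` is spelled out here for clarity.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy frame: species/value column pairs, species in different orders
dt <- data.frame(
  species   = c("a", "b", "c"),
  w         = c(1, 2, 3),
  species.1 = c("c", "a", "b"),
  w.1       = c(30, 10, 20)
)

wide <- bind_rows(
  # Stack each (species, value) pair, remembering which value column it came from
  lapply(seq(1, ncol(dt), 2), function(x) {
    dt[, c(x, x + 1)] %>%
      rename_with(~ c("Species", "value")) %>%
      mutate(w = colnames(dt)[x + 1])
  })
) %>%
  # Spread back out: one row per species, one column per original value column
  pivot_wider(id_cols = Species, names_from = w, values_from = value)

print(wide)
```

Each w column ends up aligned on Species regardless of the row order in the original column pairs.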
I am currently struggling to transition one of my columns in my data to a row as an observation. Below is a representative example of what my data looks like:
library(tidyverse)
test_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo"),
pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
ivv1=rep(11860),cv1=c(25875,386665,23006,10409),
abstention=rep(52199))
As seen above, the abstention column exists at the end of my data frame, and I would like my data to look like the following:
library(tidyverse)
desired_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo","abstention"),
pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
ivv1=rep(11860),cv1=c(25875,386665,23006,10409,52199))
Here, abstentions are treated like a candidate, in the can column. Thus, the rest of the data is maintained, and the abstention values are their own observation in the cv1 column.
I have tried using pivot_wider, but I am unsure how to use the arguments to get what I want. I have also considered t() to transpose the column into a row, but I am having a hard time slotting it back into my data. Any help is appreciated! Thanks!
Here's a strategy that will work if you have multiple unit_names:
test_df %>%
  group_split(unit_name) %>%
  map( function(group_data) {
    slice(group_data, 1) %>%
      mutate(can="abstention", cv1=abstention) %>%
      add_row(group_data, .) %>%
      select(-abstention)
  }) %>%
  bind_rows()
Basically we split the data up by unit_name, then we grab the first row for each group and move the values around. Append that as a new row to each group, and then re-combine all the groups.
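A variant of the same idea, sketched without the explicit split/map step: build one abstention row per unit, then append those rows with bind_rows(). This is not the code above, just an alternative under the assumption that the other columns (pev1, vot1, and so on) are constant within each unit_name; the names `abstention_rows` and `result` are made up for illustration.

```r
library(dplyr)

test_df <- tibble(
  unit_name = "Chungcheongbuk-do", unit_n = 2,
  can = c("Cho Bong-am", "Lee Seung-man", "Lee Si-yeong", "Shin Heung-woo"),
  pev1 = 510014, vot1 = 457815, vv1 = 445955, ivv1 = 11860,
  cv1 = c(25875, 386665, 23006, 10409),
  abstention = 52199
)

# First row of each unit, rewritten so that abstention is treated as a candidate
abstention_rows <- test_df %>%
  group_by(unit_name) %>%
  slice(1) %>%
  ungroup() %>%
  mutate(can = "abstention", cv1 = abstention)

# Append the new rows and drop the now-redundant column
result <- bind_rows(test_df, abstention_rows) %>%
  select(-abstention) %>%
  arrange(unit_name)

print(result)
```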
I have an HMMSCAN result file of protein domains with 10 columns. Please see the link for the CSV file.
https://docs.google.com/spreadsheets/d/10d_YQwD41uj0q5pKinIo7wElhDj3BqilwWxThfIg75s/edit?usp=sharing
But I want it to look like this:
1BVN:P|PDBID|CHAIN|SEQUENCE Alpha-amylase Alpha-amylase_C A_amylase_inhib
3EF3:A|PDBID|CHAIN|SEQUENCE Cutinase
3IP8:A|PDBID|CHAIN|SEQUENCE Amdase
4Q1U:A|PDBID|CHAIN|SEQUENCE Arylesterase
4ROT:A|PDBID|CHAIN|SEQUENCE Esterase
5XJH:A|PDBID|CHAIN|SEQUENCE DLH
6QG9:A|PDBID|CHAIN|SEQUENCE Tannase
The repeated entries of column 3 should get grouped and its corresponding values of column 1, which are in different rows, should be arranged in separate columns.
This is what I have written so far:
df <- read.csv ("hydrolase_sorted.txt" , header = FALSE, sep ="\t")
new <- df %>% select (V1,V3) %>% group_by(V3) %>% spread(V1, V3)
I hope I am clear with the problem statement. Thanks in advance!!
Your input data set has two irregular rows. The approach in your solution is right, but one more step is required:
library(dplyr)
df %>%
  select(V3, V1) %>%
  group_by(V3) %>%
  summarise(x = paste(V1, collapse = " "))
What we do here is simply concatenate the V1 strings for each V3, giving one row per V3. Before running the code above, you should preprocess the input and fix the improper rows manually (the TIM, Tannase, and DLH rows). To do that, you can use the "Text to Columns" feature in Excel.
Required steps defined are below. Problematic columns highlighted yellow:
Sorry for the non-English interface of my Excel, but the steps should be self-explanatory.
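If the domains really need to sit in separate columns rather than in one space-separated string, a possible sketch is to number the domains within each V3 and then pivot. This uses tidyr's pivot_wider(), which is a different technique from the paste() approach above. The toy `df` and the `domain_` column prefix are made up for illustration; V1 and V3 are the column names from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy input: V1 = domain name, V3 = sequence identifier
df <- tibble(
  V1 = c("Alpha-amylase", "Alpha-amylase_C", "Cutinase"),
  V3 = c("1BVN:P|PDBID|CHAIN|SEQUENCE",
         "1BVN:P|PDBID|CHAIN|SEQUENCE",
         "3EF3:A|PDBID|CHAIN|SEQUENCE")
)

out <- df %>%
  group_by(V3) %>%
  # Number the domains within each sequence: domain_1, domain_2, ...
  mutate(col = paste0("domain_", row_number())) %>%
  ungroup() %>%
  # One row per sequence, one column per domain position (NA where absent)
  pivot_wider(names_from = col, values_from = V1)

print(out)
```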
I have two datasets:
DS1 - contains a list of subjects, with columns for name, ID number and employment status.
DS2 - contains the same list of subject names and ID numbers, but some of the ID numbers are missing in this second data set.
It also contains a third column for Education Level.
I want to merge the Education column onto the first dataset. I have done this using the merge function sorting by ID number but because some of the ID numbers are missing on the second data set I want to merge the remaining Education level by name as a secondary option. Is there a way to do this using dplyr/tidyverse?
There are two ways you can do this. Choose the one based on your preference.
1st option:
#here I left join twice and select columns each time to ensure there is no duplication like '.x' '.y'
finalDf = DS1 %>%
  dplyr::left_join(DS2 %>%
    dplyr::select(ID,EducationLevel1=EducationLevel),by=c('ID')) %>%
  dplyr::left_join(DS2 %>%
    dplyr::select(Name,EducationLevel2=EducationLevel),by=c('Name')) %>%
  dplyr::mutate(FinalEducationLevel = ifelse(is.na(EducationLevel1),EducationLevel2,EducationLevel1))
2nd option:
#first find the IDs which are present in the 2nd dataset
commonIds = DS1 %>%
  dplyr::inner_join(DS2 %>%
    dplyr::select(ID,EducationLevel),by=c('ID'))
#now the records where ID was not present in DS2
idsNotPresent = DS1 %>%
  dplyr::filter(!ID %in% commonIds$ID) %>%
  dplyr::left_join(DS2 %>%
    dplyr::select(Name,EducationLevel),by=c('Name'))
#bind these two dfs to get the final df
finalDf = bind_rows(commonIds,idsNotPresent)
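A compact variant of the first option, sketched on made-up data, uses dplyr::coalesce() to take the ID-matched value where it exists and fall back to the name-matched one. The toy DS1 and DS2 below and the names `by_id`, `by_name` and `final` are hypothetical; the column names ID, Name and EducationLevel are assumed from the answer above, and names are assumed to be unique.

```r
library(dplyr)

# Hypothetical toy data: Bob's ID is missing in DS2
DS1 <- tibble(Name = c("Ann", "Bob", "Cat"),
              ID = c(1, 2, 3),
              Employment = c("Full", "Part", "None"))
DS2 <- tibble(Name = c("Ann", "Bob", "Cat"),
              ID = c(1, NA, 3),
              EducationLevel = c("BSc", "MSc", "PhD"))

final <- DS1 %>%
  # Primary match on ID (rows of DS2 without an ID cannot match here)
  left_join(DS2 %>% filter(!is.na(ID)) %>% select(ID, by_id = EducationLevel),
            by = "ID") %>%
  # Secondary match on Name
  left_join(DS2 %>% select(Name, by_name = EducationLevel),
            by = "Name") %>%
  # Prefer the ID match, fall back to the Name match
  mutate(EducationLevel = coalesce(by_id, by_name)) %>%
  select(-by_id, -by_name)

print(final)
```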
Let me know if this works.
The second option in makeshift-programmer's answer worked for me. Thank you so much. I had to play around with it for my actual data sets, but the basic structure worked very well and it was easy to adapt.