Splitting a long row into multiple shorter rows in R

I have a really wide data frame (1 obs. of 696 variables) and I want to split its single row into multiple rows, one for every 10 columns.
I think it would be too confusing to post the final data frame because it is so wide, so instead here is the code I used to create it:
library(tidyverse)

Vol_cil <- function(r, h) {
  vol <- (pi * (r^2)) * h
  return(vol)
}

vec <- Vol_cil(625, 0:695) / 1000000
df <- data.frame(vec)

stckovrflw <- df %>%
  mutate("mm" = 0:695) %>%
  pivot_wider(names_from = mm, values_from = vec)
I want the columns to go from 0 to 9 and the rows from 0 to 69, keeping the data in this data frame (stckovrflw). I searched for a way to do this but came up empty, and ended up exporting the data to Excel and splitting it by hand.
I'd appreciate any help.
If I haven't made myself clear, please feel free to ask me anything.

Here is one way to do it. It starts by putting stckovrflw back into long format, so if that is what you actually have, you can take that step out. It works by creating columns for the row and column number, then spreading by column number.
stckovrflw %>%
  pivot_longer(everything(), names_to = 'mm') %>%
  mutate(row = rep(1:70, each = 10)[1:696],
         col = rep(1:10, 70)[1:696]) %>%
  select(-mm) %>%
  pivot_wider(names_from = 'col', values_from = 'value')
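If you don't mind leaving the tidyverse, a base R sketch along the same lines (padding with NA is an assumption here, since 696 values don't fill a 70 x 10 grid exactly):

vals <- unlist(stckovrflw)
# pad to a multiple of 10 so matrix() doesn't recycle values
vals <- c(vals, rep(NA, 10 * ceiling(length(vals) / 10) - length(vals)))
as.data.frame(matrix(vals, ncol = 10, byrow = TRUE))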

Related

Transpose my R Dataset for association analysis

I am sort of a newbie with R and data manipulation, and I am trying to transpose the UCI words dataset. The default dataset is currently structured as follows.
The first column is the document number, the second column is the word number (referencing another text file), and the last column is the number of times the word occurs in the document. (For now we can ignore the third column; I know how to drop it from the dataset.)
What I am trying to do is transpose the dataset so that each document's words are in one row. A simple example would look like this.
I tried using the t() function, but it transposes the entire dataset, which is not what I want. I looked into the dplyr package for the data manipulation but am not getting any solid leads. If you have any sources or a particular direction to nudge me towards, that would be helpful.
Thank you!
Here's a solution using the tidyverse package (which includes dplyr). The trick is to first add another column to differentiate entries with the same value in the first column (document number) and then just change the data to wide format using pivot_wider.
library(tidyverse)

# Your data
df <- read.csv(text = "num word
1 61
2 76
1 89
3 211
3 296", sep = " ")

df %>%
  # Group by num
  group_by(num) %>%
  # Add a row number to differentiate entries with the same first column value
  mutate(rownum = row_number()) %>%
  # Change data to wide format (note: the argument is id_cols, not id)
  pivot_wider(id_cols = num,
              names_from = rownum,
              values_from = word)
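For the small example above, this should return one row per document, with NA where a document has fewer words than the widest one:

    num   `1`   `2`
1     1    61    89
2     2    76    NA
3     3   211   296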
So I was able to figure out how to accomplish this task. Hopefully it helps other data scientists in the future.
library(dplyr)

data <- read.table("docword.kos.txt", sep = " ")
data <- data %>% select(V1, V2)

trans <- data %>%
  group_by(V1) %>%
  summarise(words = paste(V2, collapse = ","))

trans <- trans %>% select(words)
What I ended up doing is using dplyr to group my dataset by the first column and collapse each group's words into one string. Then I exported the dataset, made some slight adjustments in Notepad (removing the quotation marks the generated CSV file puts around each string), and re-imported it.

write.csv(trans, "~/trend.csv", row.names = FALSE)
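As an aside, the Notepad step can likely be skipped entirely: write.csv() accepts quote = FALSE, which suppresses the quotation marks in the output.

write.csv(trans, "~/trend.csv", row.names = FALSE, quote = FALSE)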

Subset a distance matrix in R by values

I have a very large distance matrix (3678 x 3678), currently encoded as a data frame. Columns are named "1", "2", "3" and so on, and likewise for rows. What I need to do is find the values that are less than 26 and different from 0, and collect the results in a second data frame with two columns: the first with the index and the second with the value. For example:

        value
318-516 22.70601
...

where 318 is the row index and 516 is the column index.
OK, I'm trying to recreate your situation (note: if you can, it's always helpful to include a few lines of your data, e.g. via dput()).
You should be able to use filter() and some simple tidyverse commands (if you don't know how they work, run them line by line, always selecting the commands up to the next %>%, to check what they are doing):
library(tidyverse)
library(tidylog) # gives you additional output on what each command does

# Creating some data that looks similar
data <- matrix(rnorm(25, mean = 26), ncol = 5)
data <- as_tibble(data)
data <- setNames(data, c(1:5))

data %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "column", values_to = "values", names_prefix = "V") %>%
  # depending on what your column names look like, you might need a separate() command first
  filter(values > 0 & values < 26) %>%
  # if you want, you can create an index column as well
  mutate(index = paste0(row, "-", column)) %>%
  # then you can get rid of row and column
  select(-row, -column) %>%
  # move index to the front
  relocate(index)
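For a matrix this size, a base R sketch can also work and avoids reshaping entirely (assuming the data frame is purely numeric, so it can be converted to a matrix):

m <- as.matrix(data)
# arr.ind = TRUE returns the row/column position of every match
idx <- which(m > 0 & m < 26, arr.ind = TRUE)
result <- data.frame(index = paste0(idx[, "row"], "-", idx[, "col"]),
                     value = m[idx])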

How to identify number of duplicate rows in R (and remove)

I have a large data frame in R (1.3M rows, 51 columns). I am not sure whether there are any duplicate rows, but I want to find out. I tried using the duplicated() function, but it took too long and ended up freezing my RStudio. I don't need to know which entries are duplicates; I just want to delete the ones that are.
Does anyone know how to do this without it taking 20+ minutes and eventually failing to finish?
Thanks
I don't know how you used the duplicated() function. This way should be relatively quick even if the data frame is large (I've tested it on a data frame with 1.4M rows and 32 columns: it took less than 2 minutes):

df[!duplicated(df), ]

The logical form is also safer than df[-which(duplicated(df)), ], which returns zero rows when there are no duplicates at all.
Two more options. The first extracts every row whose key appears more than once (duplicates, triples, and so on); the second keeps only the rows whose key appears exactly once:

duplication <- df %>% group_by(col) %>% filter(n() > 1)
unique_df <- df %>% group_by(col) %>% filter(n() == 1)
You can do the same in base R (note the parentheses around the whole condition before negating):

dup <- df[duplicated(df$col) | duplicated(df$col, fromLast = TRUE), ]
uni_df <- df[!(duplicated(df$col) | duplicated(df$col, fromLast = TRUE)), ]
If you want to check duplicates across the whole data frame (all columns at once), you can use this:

df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)
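For 1.3M rows, if these are still slow, two likely faster alternatives (both dedupe on all columns): dplyr's distinct(df), or the data.table package, which is usually the fastest option at this size.

library(data.table)
df_unique <- unique(as.data.table(df))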

Changing a Column to an Observation in a Row in R

I am currently struggling to move one of my columns into a row, as a new observation. Below is a representative example of what my data looks like:
library(tidyverse)
test_df <- tibble(
  unit_name = rep("Chungcheongbuk-do"),
  unit_n = rep(2),
  can = c("Cho Bong-am", "Lee Seung-man", "Lee Si-yeong", "Shin Heung-woo"),
  pev1 = rep(510014),
  vot1 = rep(457815),
  vv1 = rep(445955),
  ivv1 = rep(11860),
  cv1 = c(25875, 386665, 23006, 10409),
  abstention = rep(52199)
)
As seen above, the abstention column exists at the end of my data frame, and I would like my data to look like the following:
library(tidyverse)
desired_df <- tibble(
  unit_name = rep("Chungcheongbuk-do"),
  unit_n = rep(2),
  can = c("Cho Bong-am", "Lee Seung-man", "Lee Si-yeong", "Shin Heung-woo", "abstention"),
  pev1 = rep(510014),
  vot1 = rep(457815),
  vv1 = rep(445955),
  ivv1 = rep(11860),
  cv1 = c(25875, 386665, 23006, 10409, 52199)
)
Here, abstentions are treated like a candidate in the can column: the rest of the data is carried over, and the abstention value becomes its own observation in the cv1 column.
I have tried pivot_wider, but I am unsure how to set its arguments to get what I want. I have also considered t() to transpose the column into a row, but I am having a hard time slotting the result back into my data. Any help is appreciated! Thanks!
Here's a strategy that will work even if you have multiple unit_names:
test_df %>%
  group_split(unit_name) %>%
  map(function(group_data) {
    # take the group's first row and turn it into the "abstention" candidate
    slice(group_data, 1) %>%
      mutate(can = "abstention", cv1 = abstention) %>%
      # append it to the group (bind_rows is safer here than add_row,
      # which expects name-value pairs rather than a data frame)
      bind_rows(group_data, .) %>%
      select(-abstention)
  }) %>%
  bind_rows()
Basically, we split the data up by unit_name, grab the first row of each group, and move the values around. We append that as a new row to its group and then re-combine all the groups.
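A shorter equivalent, if you prefer to skip the explicit split (a sketch under the same assumptions; distinct() keeps the first row per unit_name, and with several unit_names you may want to arrange() afterwards):

test_df %>%
  bind_rows(
    test_df %>%
      distinct(unit_name, .keep_all = TRUE) %>%
      mutate(can = "abstention", cv1 = abstention)
  ) %>%
  select(-abstention)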

How to arrange, group and concatenate string values of repeated keys into different columns using R

I have an HMMSCAN result file of protein domains with 10 columns. Please see the link for the CSV file.
https://docs.google.com/spreadsheets/d/10d_YQwD41uj0q5pKinIo7wElhDj3BqilwWxThfIg75s/edit?usp=sharing
But I want it to look like this:-
1BVN:P|PDBID|CHAIN|SEQUENCE Alpha-amylase Alpha-amylase_C A_amylase_inhib
3EF3:A|PDBID|CHAIN|SEQUENCE Cutinase
3IP8:A|PDBID|CHAIN|SEQUENCE Amdase
4Q1U:A|PDBID|CHAIN|SEQUENCE Arylesterase
4ROT:A|PDBID|CHAIN|SEQUENCE Esterase
5XJH:A|PDBID|CHAIN|SEQUENCE DLH
6QG9:A|PDBID|CHAIN|SEQUENCE Tannase
The repeated entries of column 3 should be grouped, and their corresponding values from column 1, which sit in different rows, should be arranged in separate columns.
This is what I have written so far:
df <- read.csv("hydrolase_sorted.txt", header = FALSE, sep = "\t")
new <- df %>% select(V1, V3) %>% group_by(V3) %>% spread(V1, V3)
I hope I am clear with the problem statement. Thanks in advance!!
Your input data set has two irregular rows; however, the approach in your attempt is right, and only one more step is required:

library(dplyr)

df %>%
  select(V3, V1) %>%
  group_by(V3) %>%
  summarise(x = paste(V1, collapse = " "))

What we did here is simply concatenate the strings by V3 (summarise() gives one row per V3). Before running the code above you should preprocess and manually fix the improper rows (TIM, Dannase, and DLH). To do that you can use the Text to Columns feature of Excel.
(The original answer included Excel screenshots of the Text to Columns steps, with the problematic columns highlighted in yellow.)
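If you need each domain in its own column rather than a single space-separated string, a sketch building on the same df (the domain_ prefix is just an illustrative name):

library(tidyr)

df %>%
  select(V3, V1) %>%
  group_by(V3) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = n, values_from = V1, names_prefix = "domain_")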
