I am sort of a newbie with R and data manipulation and I am trying to transpose the UCI words dataset. The default dataset is currently structured as so.
Where the first column is the document number, the second column is the word number referencing another text file and the last column is the number of times the word occurs in the document. (For now, we can forget about the third column and I know how to drop it from the dataset.)
What I am trying to do is to transpose the dataset so that I can have each document's words in one row. So a simple example would be like this.
I tried using the t() function, but it transposes the entire dataset at once, which is not what I want. I looked into using the dplyr package to help with the data manipulation, but I am not getting any solid leads. If you have any sources or a particular direction you can nudge me towards, that would be helpful.
Thank you!
Here's a solution using the tidyverse package (which includes dplyr). The trick is to first add another column to differentiate entries with the same value in the first column (document number) and then just change the data to wide format using pivot_wider.
library(tidyverse)
# Your data
df <- read.csv(text = "num word
1 61
2 76
1 89
3 211
3 296", sep = " ")
df %>%
  # Group by num
  group_by(num) %>%
  # Add a row number to differentiate entries for the same first column value
  mutate(rownum = row_number()) %>%
  # Change data to wide format (the argument is id_cols, not id)
  pivot_wider(id_cols = num,
              names_from = rownum,
              values_from = word)
So I was able to figure out how to accomplish this task. Hopefully, it helps other DS's in the future.
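library(dplyr)
data <- read.table("docword.kos.txt", sep = " ")
# Keep only the document number and word number
data <- data %>% select(V1, V2)
# Collapse each document's word numbers into one comma-separated string
trans <- data %>%
  group_by(V1) %>%
  summarise(words = paste(V2, collapse = ","))
trans <- trans %>% select(words)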
What I ended up doing is using the dplyr package to perform some data wrangling and group my dataset by the first column. Then I exported the dataset and re-uploaded it after making some slight adjustments in Notepad (I removed the " characters from the generated csv file).
write.csv(trans, "~/trend.csv", row.names = FALSE)
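The manual quote removal in Notepad can probably be skipped by asking write.csv not to quote strings. A minimal sketch, assuming a small stand-in for trans and a temporary file instead of the real output path (both are placeholders, not the original data):

```r
# Hypothetical stand-in for the `trans` data frame built above
trans <- data.frame(words = c("61,89", "76", "211,296"))

# A temporary file is used here; substitute your own path
out_file <- tempfile(fileext = ".csv")

# quote = FALSE stops write.csv from wrapping each string in double quotes
write.csv(trans, out_file, row.names = FALSE, quote = FALSE)

# Inspect the raw lines that were written
written <- readLines(out_file)
print(written)
```

Note that without quotes the commas inside each string act as field separators, so rows will have different numbers of fields. That seems to be the layout wanted here, but it is worth checking before reading the file back in.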
[R] I am trying to modify the format of my data frame (df) so that the column name is appended to each observation within that column within R. For example:
Soccer_Brand    Basketball_Brand
Adidas          Nike
Nike            Under Armour
And want to get it to look like
Soccer_Brand           Basketball_Brand
Adidas_Soccer_Brand    Nike_Basketball_Brand
Nike_Soccer_Brand      Under_Armour_Basketball_Brand
I'm attempting a market basket analysis and need to remove the column names eventually. However, I will lose the information on which sport the brand belongs to unless I append the column names to the observations themselves. Essentially, I won't be able to tell whether a 'nike' entry belongs to soccer or basketball.
I've used Excel formulas to hack together a solution so far, but I want my R script to be self-contained. I haven't found any solutions for this in R.
You can paste a column's name onto its contents. Just iterate through all the columns. Doing so with lapply allows the one-liner:
df[] <- lapply(seq_along(df),\(i) paste(df[[i]], names(df)[i], sep = "_"))
resulting in
df
#> Soccer_Brand Basketball_Brand
#> 1 Adidas_Soccer_Brand Nike_Basketball_Brand
#> 2 Nike_Soccer_Brand Under Armour_Basketball_Brand
Data from question in reproducible format
df <- data.frame(Soccer_Brand = c("Adidas", "Nike"),
Basketball_Brand = c("Nike", "Under Armour"))
Or using an option in tidyverse
library(dplyr)
library(stringr)
df <- df %>%
mutate(across(everything(), ~ str_c(.x, cur_column(), sep = "_")))
-output
df
Soccer_Brand Basketball_Brand
1 Adidas_Soccer_Brand Nike_Basketball_Brand
2 Nike_Soccer_Brand Under Armour_Basketball_Brand
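Both outputs above keep the space in "Under Armour", while the desired output in the question shows Under_Armour_Basketball_Brand. A minimal base R sketch, assuming the same two-row example data, that replaces spaces with underscores before appending the column name:

```r
df <- data.frame(Soccer_Brand = c("Adidas", "Nike"),
                 Basketball_Brand = c("Nike", "Under Armour"))

# For each column: swap spaces for underscores, then append the column name
df[] <- Map(function(col, nm) paste(gsub(" ", "_", col), nm, sep = "_"),
            df, names(df))

print(df)
```

Map walks over the columns and their names in parallel, and assigning into df[] keeps the data frame structure and column names.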
I am having trouble combining multiple rows into one row. Below is my current data:
I want one row of symptoms for each VAERS_ID. However, because the number of rows for each VAERS_ID is inconsistent, I am having trouble.
I have tried this:
test= data %>%
select(VAERS_ID, SYMPTOM1, SYMPTOM2, SYMPTOM3, SYPMTOM4, SYMPTOM5) %>%
group_by(VAERS_ID) %>%
mutate(Grp = paste0(SYMPTOM1,SYMPTOM1, SYMPTOM2, SYMPTOM3, SYPMTOM4, SYMPTOM5, collapse
= ",")) %>%
distinct(VAERS_ID, Grp, .keep_all = TRUE)
This gives me the original data, plus another column labeled Grp containing all of the symptoms for each VAERS_ID pasted together, with a comma between each set.
Any help would be appreciated.
Your approach seems right, but since the data cannot be copied and tested, I am not able to reproduce your error. Here are some suggested changes you can try.
You want all symptoms in one place for each VAERS_ID, which is a common real-world use case that I face often. If you don't need the original data in the output, simply use this:
data%>%
group_by(VAERS_ID) %>%
summarise("Symptoms" = paste0(SYMPTOM1,SYMPTOM1, SYMPTOM2, SYMPTOM3, SYPMTOM4, SYMPTOM5, collapse = ","))
With mutate you get the original data back, since it only adds a new column, whereas summarise returns one row per group.
To address the warning about ungrouping, add %>% ungroup() at the end, or pass .groups = "drop" inside summarise.
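To make the mutate versus summarise difference concrete, here is a minimal sketch on made-up data. The toy values and the two-column layout are assumptions, since the original data was not posted; na.omit() is an extra step added here so that missing symptoms are not pasted as "NA".

```r
library(dplyr)

# Hypothetical toy data standing in for the VAERS symptoms table
data <- data.frame(
  VAERS_ID = c(1, 1, 2),
  SYMPTOM1 = c("Fever", "Chills", "Headache"),
  SYMPTOM2 = c("Cough", NA, "Nausea")
)

symptoms <- data %>%
  group_by(VAERS_ID) %>%
  summarise(
    # Pool the symptom columns within each ID, drop NAs, join with commas
    Symptoms = paste(na.omit(c(SYMPTOM1, SYMPTOM2)), collapse = ","),
    .groups = "drop"  # return an ungrouped result
  )

print(symptoms)
```

The result has one row per VAERS_ID no matter how many rows each ID had in the input.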
I'm making predictions with a classification tree using the "rpart" library. When I call "predict", I get a table of probabilities for each value/category the test data can take, and I want to get the value/category with the highest probability. For example, once predict is done, the table I get is:
Table1
And I want to have this table:
Table2
Thanks in advance. I've tried a few things but haven't achieved much since I'm pretty new to R. Cheers!
One way to achieve your desired output could be:
1. Store the values to look for in the string pattern.
2. mutate across the relevant columns and use str_detect to check whether one of the values is in the column. If true, use cur_column() to place the column name in a new column.
3. Then do some tricks with .names and unite to combine the new columns.
4. Finally, select the columns you need.
library(dplyr)
library(tidyr)
library(stringr)
pattern <- c("0.85|0.5|0.6|0.8")
df %>%
mutate(across(starts_with("cat"), ~case_when(str_detect(., pattern) ~ cur_column()), .names = 'new_{col}')) %>%
unite(New_Col, starts_with('new'), na.rm = TRUE, sep = ' ') %>%
select(index, pred_category = New_Col)
index pred_category
<dbl> <chr>
1 1 cat2
2 2 cat1
3 3 cat3
4 4 cat3
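The approach above depends on knowing the winning values in advance. A more general sketch in base R uses max.col() to find, for each row, the column holding the largest probability. The probabilities below are made up to match the output shown, since the original table was only posted as an image:

```r
# Hypothetical probability table, shaped like the output of predict()
probs <- data.frame(
  index = 1:4,
  cat1 = c(0.10, 0.50, 0.10, 0.15),
  cat2 = c(0.85, 0.30, 0.30, 0.05),
  cat3 = c(0.05, 0.20, 0.60, 0.80)
)

# Columns that hold probabilities (everything except the index)
prob_cols <- setdiff(names(probs), "index")

# For each row, the position of the column with the highest value
best <- max.col(as.matrix(probs[prob_cols]), ties.method = "first")

result <- data.frame(index = probs$index, pred_category = prob_cols[best])
print(result)
```

If the model really is an rpart classification tree, calling predict with type = "class" on the fitted model should return the predicted category directly, which avoids this post-processing.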
You didn't post your data, so I put it in a .csv file and read it from my R folder on my C: drive.
There might be an easier way to do it, but this is the method I use when I have multiple different types (by column or row) I'd like to sort by. If you're new to R and don't have data.table or dplyr installed yet, you'll need to run the install commands further down in the console first.
I left the values in, but the last line removes them if you don't want them.
setwd("C:/R")
library(data.table)
library(dplyr)
Table <- read.csv("Table1.csv", check.names = FALSE, fileEncoding = 'UTF-8-BOM')
#Making the data long form makes it much easier to sort as your data gets more complex.
LongForm <- melt(setDT(Table), id.vars = c("index"), variable.name = "Category")
Table1 <- as.data.table(LongForm)
#This gets you what you want.
highest <- Table1 %>% group_by(index) %>% top_n(1, value)
#Then just sort it how you wanted it to look
Table2 <- highest[order(highest$index, decreasing = FALSE), ]
View(Table2)
If you don't have the right packages
install.packages("data.table")
and
install.packages("dplyr")
To get rid of the numbers
Table3 <- Table2[,1:2]
I have an HMMSCAN result file of protein domains with 10 columns. Please see the link for the CSV file.
https://docs.google.com/spreadsheets/d/10d_YQwD41uj0q5pKinIo7wElhDj3BqilwWxThfIg75s/edit?usp=sharing
But I want it to look like this:
1BVN:P|PDBID|CHAIN|SEQUENCE Alpha-amylase Alpha-amylase_C A_amylase_inhib
3EF3:A|PDBID|CHAIN|SEQUENCE Cutinase
3IP8:A|PDBID|CHAIN|SEQUENCE Amdase
4Q1U:A|PDBID|CHAIN|SEQUENCE Arylesterase
4ROT:A|PDBID|CHAIN|SEQUENCE Esterase
5XJH:A|PDBID|CHAIN|SEQUENCE DLH
6QG9:A|PDBID|CHAIN|SEQUENCE Tannase
The repeated entries of column 3 should be grouped, and the corresponding values of column 1, which are in different rows, should be arranged in separate columns.
This is what I have written so far:
df <- read.csv ("hydrolase_sorted.txt" , header = FALSE, sep ="\t")
new <- df %>% select (V1,V3) %>% group_by(V3) %>% spread(V1, V3)
I hope I am clear with the problem statement. Thanks in advance!!
Your input data set has two irregular rows. The approach in your solution is right, but one more step is required:
library(dplyr)
df %>% select(V3,V1) %>% group_by(V3) %>% mutate(x = paste(V1,collapse=" ")) %>% select(V3,x)
What we did here is simply concatenate the strings by V3. Before running the code above, you should preprocess the data and fix some improper rows manually (the rows for TIM, Dannase, and DLH). To do that, you can use the Text to Columns feature in Excel.
The required steps are shown below, with the problematic columns highlighted in yellow:
Sorry for the non-English interface of my Excel but the way is self-explanatory.
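One thing to watch with the mutate version is that it keeps every input row, so each sequence ID appears once per domain with the same concatenated string. A minimal sketch using summarise instead, on a made-up miniature of the table (V1 as the domain name and V3 as the sequence ID, following the question):

```r
library(dplyr)

# Hypothetical miniature of the HMMSCAN table: V1 = domain, V3 = sequence ID
df <- data.frame(
  V1 = c("Alpha-amylase", "Alpha-amylase_C", "A_amylase_inhib", "Cutinase"),
  V3 = c(rep("1BVN:P|PDBID|CHAIN|SEQUENCE", 3), "3EF3:A|PDBID|CHAIN|SEQUENCE")
)

# One row per sequence ID, with its domains joined by spaces
domains <- df %>%
  group_by(V3) %>%
  summarise(x = paste(V1, collapse = " "), .groups = "drop")

print(domains)
```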
I have a really wide data frame (1 obs. of 696 variables) and I want to split its single row into multiple rows of 10 columns each.
I think it would be too confusing to post just the final data because it is too wide, so I'm giving the code I used to create it.
library(tidyverse)
Vol_cil <- function(r, h) {
vol <- (pi*(r^2))*h
return(vol)}
vec <- Vol_cil(625, 0:695)/1000000
df <- data.frame(vec)
stckovrflw <- df %>%
mutate("mm" = 0:695) %>%
pivot_wider(names_from = mm, values_from = vec)
I want the columns to go from 0 to 9 and the rows from 0 to 69, using the data in this data frame (stckovrflw). I tried to find a way to do this on the internet but couldn't, and ended up exporting it to Excel and doing it by hand.
I'd appreciate any help.
If I wasn't able to make myself understood, please feel free to ask me anything.
Here is one way to do it. It starts by putting stckovrflw back to long format so if that is actually what you have you can take out that step. It works by creating columns for the row and column number, then spreading by column number.
stckovrflw %>% pivot_longer(everything(), names_to='mm') %>%
mutate(row=rep(1:70, each=10)[1:696], col=rep(1:10, 70)[1:696]) %>%
select(-mm) %>%
pivot_wider(names_from='col', values_from='value')
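If the starting point is the plain vector vec rather than the wide data frame, a base R matrix may be a simpler route. This is a sketch under that assumption: 696 values do not fill 70 rows of 10, so the last four cells are padded with NA, and the rows and columns are labelled 0 to 69 and 0 to 9 as asked in the question.

```r
Vol_cil <- function(r, h) {
  (pi * (r^2)) * h
}
vec <- Vol_cil(625, 0:695) / 1000000

# Pad to a multiple of 10 so the matrix fills cleanly (70 * 10 = 700 cells)
padded <- c(vec, rep(NA, 70 * 10 - length(vec)))

# Fill row by row; row label is the tens part, column label the units digit
m <- matrix(padded, nrow = 70, ncol = 10, byrow = TRUE,
            dimnames = list(0:69, 0:9))

print(m[1:3, ])
```

With this layout, the value for a height of 15 mm sits in row "1", column "5". as.data.frame(m) converts the result back to a data frame if one is needed.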