R: Separate Text String by Space and Remove Tabs, Line Breaks, Etc.

After reading an HTML table, my name column appears with records as follows:
\n\t\t\t\t\t\t\t\t\t\t\t\t\tMike Moon\n\t\t\t\t\t\t\t\t
The following code fails to generate the correct values in the First and Last name columns
separate(data=nametable, col = Name, into = c("First","Last"), sep= " ")
Curiously, the First column is blank, while the Last column contains only the person's first name.
How can I correctly split this column into the desired First and Last columns, i.e.:
First Last
Mike Moon
Data example per recommendation of @r2evans, as it appears in the answer code below:
nametable <- data.frame(Name="\n\t\t\t\t\t\t\t\t\t\t\t\t\tMike Moon\n\t\t\t\t\t\t\t\t", stringsAsFactors=FALSE)

It might help to trim whitespace from the field before moving on. trimws removes "leading and/or trailing whitespace from character strings" (from ?trimws).
Data:
nametable <- data.frame(Name="\n\t\t\t\t\t\t\t\t\t\t\t\t\tMike Moon\n\t\t\t\t\t\t\t\t", stringsAsFactors=FALSE)
library(dplyr)
nametable %>% mutate(Name = trimws(Name))
# Name
# 1 Mike Moon
I infer that you are using dplyr as well as tidyr, so I'm using it here. It is also really straight-forward to do nametable$Name <- trimws(nametable$Name) without the dplyr usage.
From here, it's as you initially coded:
nametable %>%
  mutate(Name = trimws(Name)) %>%
  tidyr::separate(col = Name, into = c("First", "Last"))
# First Last
# 1 Mike Moon
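For readers who would rather avoid the dplyr/tidyr dependency, the same idea can be sketched in base R. This is only an illustration under the assumption that every record holds exactly a first and a last name separated by whitespace; the shortened sample string is hypothetical.

```r
# Hypothetical sample mirroring the question's Name column (fewer tabs for brevity)
nametable <- data.frame(
  Name = "\n\t\t\t\tMike Moon\n\t\t\t",
  stringsAsFactors = FALSE
)

# Strip leading/trailing spaces, tabs and line breaks
clean <- trimws(nametable$Name)

# Split on runs of whitespace; each list element holds the name parts of one row
parts <- strsplit(clean, "\\s+")

# Assumption: the first part is the first name, the last part is the last name
nametable$First <- vapply(parts, function(p) p[1], character(1))
nametable$Last  <- vapply(parts, function(p) p[length(p)], character(1))
nametable$Name  <- NULL
```

Names with more than two parts (middle names, for example) would need a different rule than "first part / last part".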

Related

Is there a way to append a column name to the end of each observation within respective columns?

[R] I am trying to modify the format of my data frame (df) so that the column name is appended to each observation within that column within R. For example:
Soccer_Brand    Basketball_Brand
Adidas          Nike
Nike            Under Armour
And I want it to look like this:
Soccer_Brand           Basketball_Brand
Adidas_Soccer_Brand    Nike_Basketball_Brand
Nike_Soccer_Brand      Under_Armour_Basketball_Brand
I'm attempting a market basket analysis and will eventually need to remove the column names. However, I will lose the information on which sport a brand belongs to unless I append the column names to the observations themselves. Essentially, I won't be able to tell whether a 'nike' entry belongs to soccer or basketball.
I've used Excel formulas to hack a solution thus far but want my R script to be self contained. I haven't found any solutions out there for this in R.
You can paste a column's name onto its contents. Just iterate through all the columns. Doing so with lapply allows the one-liner:
df[] <- lapply(seq_along(df),\(i) paste(df[[i]], names(df)[i], sep = "_"))
resulting in
df
#> Soccer_Brand Basketball_Brand
#> 1 Adidas_Soccer_Brand Nike_Basketball_Brand
#> 2 Nike_Soccer_Brand Under Armour_Basketball_Brand
Data from question in reproducible format
df <- data.frame(Soccer_Brand = c("Adidas", "Nike"),
Basketball_Brand = c("Nike", "Under Armour"))
Or using an option in tidyverse
library(dplyr)
library(stringr)
df <- df %>%
mutate(across(everything(), ~ str_c(.x, cur_column(), sep = "_")))
-output
df
Soccer_Brand Basketball_Brand
1 Adidas_Soccer_Brand Nike_Basketball_Brand
2 Nike_Soccer_Brand Under Armour_Basketball_Brand
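Both outputs above keep the space in "Under Armour", whereas the question asked for "Under_Armour_Basketball_Brand". A minimal base R sketch that also replaces spaces inside the values, assuming a plain space is the only character that needs replacing:

```r
df <- data.frame(
  Soccer_Brand = c("Adidas", "Nike"),
  Basketball_Brand = c("Nike", "Under Armour")
)

# For each column: replace spaces in the values, then append the column name
df[] <- Map(
  function(values, col_name) paste(gsub(" ", "_", values), col_name, sep = "_"),
  df,
  names(df)
)
```

Map() walks over the columns and their names in parallel, and assigning into df[] keeps the data frame structure and column names.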

How to break Char Vectors up at defined terms?

I am scraping data off a website using the rvest package for the first time.
It gave me a character vector that I am trying to split and convert to a data frame with columns.
How do you turn this vector:
char.vector <- c("John DoeTeacherSpeaksEnglishJapaneseRateUSD 10Video Intro","JaneTutorSpeaksJapaneseFrenchRateUSD 15Video Intro")
...into this data frame with columns:
Name      Role     English  Japanese  French  Rate_USD
John Doe  Teacher  1        1         0       10
Jane      Tutor    0        1         1       15
Splitting on spaces or character position is problematic. Is there maybe a way to create a vector of all the words I want to split at and use it as the split argument?
split.vector <- c("Teacher", "Tutor", "Speaks", "English", "Japanese", "French", "Rate", "Video")
My code and url:
EN.char <- read_html("https://www.italki.com/teachers/english") %>%
html_nodes(".teacher-card") %>%
html_text()
EN.char
So, as there are potentially different numbers of languages for each entry, I shamelessly took the approach used by @akrun in their answer here, whereby read.dcf is used to map out all the languages present and put NA where a language is not present for a given entry. After reading this article I saw that:
The DCF rules as implemented in R are:
1. A database consists of one or more records, each with one or more named fields. Not every record must contain each field, a field may appear only once in a record.
2. Regular lines start with a non-whitespace character.
3. Regular lines are of form tag:value, i.e., have a name tag and a value for the field, separated by : (only the first : counts). The value can be empty (=whitespace only).
4. Lines starting with whitespace are continuation lines (to the preceding field) if at least one character in the line is non-whitespace.
5. Records are separated by one or more empty (=whitespace only) lines.
In order to resolve a malformed line error, I needed to fix rule 3 and insert ":" so as to have tag:value pairs.
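To make the tag:value idea concrete, here is a small self-contained sketch of read.dcf on hand-written input. The language lines are hypothetical stand-ins for the scraped text, and the blank string separates two records.

```r
# Hypothetical input: one "language:1" line per language, "" separates records
lines <- c("English:1", "Japanese:1", "", "Japanese:1", "French:1")

con <- textConnection(lines)
m <- read.dcf(con)  # character matrix; NA where a record lacks a field
close(con)

langs <- as.data.frame(m, stringsAsFactors = FALSE)
langs[is.na(langs)] <- "0"            # missing language means "not spoken"
langs[] <- lapply(langs, as.integer)  # cast every column to integer
```

Each distinct tag becomes a column, so a language that appears in only some records still gets its own column, with 0 for the records that lack it.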
Following @akrun's example, I wrapped the read.dcf call in as.data.frame. I kept the if statement, which is unnecessary in this case, as a safeguard against missing entries; I doubt languages would be missing given the nature of the site.
I then used replace_na() to replace NA with "0" and cast all the language columns to integer.
I then generated another data frame with all the other details required, which has only one entry per field in each record (row). As I was sure the rows matched and were in the same order, I then used cbind() to join the data frames column-wise.
N.B. Given my location, the USD values are actually showing as GBP.
R:
library(purrr)
library(dplyr)
library(stringr)
library(rvest)
page <- read_html("https://www.italki.com/teachers/english")
entries <- page %>% html_elements(".teacher-card")
# One row per teacher card, one column per language (NA when not listed)
language_df <- map_dfr(entries, ~ {
  new <- .x %>%
    html_elements("p + div > div") %>%
    html_text() %>%
    paste0(., ":1") # turn each language into a tag:value pair for read.dcf
  if (length(new) > 0) {
    as.data.frame(read.dcf(textConnection(new)))
  } else {
    NULL
  }
}) %>%
  # read.dcf returns character columns, so replace NA with "0" before casting
  mutate(across(.cols = everything(), ~ as.integer(tidyr::replace_na(.x, "0"))))
# One row per teacher card with the single-valued fields
details_df <- map_dfr(entries, ~
  data.frame(
    name = .x %>% html_element("div > .overflow-hidden") %>% html_text(),
    role = .x %>% html_element("span + div > .text-tiny") %>% html_text(),
    rate_usd = .x %>% html_element(".flex-1:nth-child(1) > div > span") %>% html_text()
  ))
results <- cbind(details_df, language_df)
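If only the already-flattened character vector from the question is available, a regex-based sketch in base R is another option. It assumes that the role is always one of a known set of keywords, that the languages sit between "Speaks" and "Rate", and that the rate follows "RateUSD "; the `roles` and `languages` vectors are illustrative, not taken from the site.

```r
char.vector <- c(
  "John DoeTeacherSpeaksEnglishJapaneseRateUSD 10Video Intro",
  "JaneTutorSpeaksJapaneseFrenchRateUSD 15Video Intro"
)

# Assumed keyword lists (hypothetical; extend as needed)
roles <- c("Teacher", "Tutor")
languages <- c("English", "Japanese", "French")
role_pattern <- paste(roles, collapse = "|")

# Name: everything before the first role keyword
name <- sub(paste0("(", role_pattern, ").*$"), "", char.vector)
# Role: the first role keyword found in each string
role <- regmatches(char.vector, regexpr(role_pattern, char.vector))
# Languages: the text between "Speaks" and "Rate"
spoken <- sub("^.*Speaks(.*)Rate.*$", "\\1", char.vector)
# Rate: the digits following "RateUSD "
rate <- as.integer(sub("^.*RateUSD ([0-9]+).*$", "\\1", char.vector))

# One 0/1 column per language
lang_flags <- sapply(languages, function(l) as.integer(grepl(l, spoken, fixed = TRUE)))

result <- data.frame(Name = name, Role = role, lang_flags, Rate_USD = rate,
                     stringsAsFactors = FALSE)
```

This breaks if a name itself contains one of the role keywords, which is why selecting the individual HTML elements, as in the answer above, is generally more robust.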

Transpose my R Dataset for association analysis

I am sort of a newbie with R and data manipulation and I am trying to transpose the UCI words dataset. The default dataset is currently structured as so.
Where the first column is the document number, the second column is the word number referencing another text file and the last column is the number of times the word occurs in the document. (For now, we can forget about the third column and I know how to drop it from the dataset.)
What I am trying to do is to transpose the dataset so that I can have each document's words in one row. So a simple example would be like this.
I tried using the t() function, but it transposes the entire dataset at once, which is not what I want. I looked into using the dplyr package to help with the data manipulation, but I am not getting any solid leads. If you have any sources or a particular direction you can nudge me towards, that would be helpful.
Thank you!
Here's a solution using the tidyverse package (which includes dplyr). The trick is to first add another column to differentiate entries with the same value in the first column (document number) and then just change the data to wide format using pivot_wider.
library(tidyverse)
# Your data
df <- read.csv(text = "num word
1 61
2 76
1 89
3 211
3 296", sep = " ")
df %>%
# Group by num
group_by(num) %>%
# Add a rownumber to differentiate entries for the same first column value
mutate(rownum = row_number()) %>%
# Change data to wide format
pivot_wider(id_cols = num,
            names_from = rownum,
            values_from = word)
So I was able to figure out how to accomplish this task. Hopefully, it helps other DS's in the future.
library(dplyr)
data <- read.table("docword.kos.txt", sep = " ")
data <- data %>% select(V1, V2)
trans <- data %>%
group_by(V1) %>%
summarise(words = paste(V2, collapse = ","))
trans <- trans %>% select(words)
What I ended up doing is using the dplyr package to perform some data wrangling and group my dataset by the first column. Then I exported and re-uploaded the dataset after making some slight adjustments in Notepad (I removed the " characters from the generated CSV file).
write.csv(trans, "~/trend.csv", row.names = FALSE)
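The same grouping can also be sketched without any packages. The small data frame below is a hypothetical stand-in for the first two columns of the dataset; `aggregate()` builds one comma-separated string per document, and `split()` returns a list with one vector of word ids per document, which may be a more convenient starting point for association analysis.

```r
# Hypothetical sample: document number and word id
df <- data.frame(num = c(1, 2, 1, 3, 3), word = c(61, 76, 89, 211, 296))

# One comma-separated string of word ids per document
trans <- aggregate(word ~ num, data = df, FUN = paste, collapse = ",")

# Alternative: a named list with the word ids of each document
baskets <- split(df$word, df$num)
```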

How to fill a section of a column with already existing values corresponding to another column in R?

I'm working on some cleaning data for some flight trajectories and 'callsign' is a required field that I need to have filled in.
Section of the csv I am working with
The data I'm working with has almost 300000 rows and this issue of blank callsigns is quite repetitive. Is there any way I can fill these callsigns in based on their corresponding icao24 identification numbers?
I've tried using a tapply() function for sectioning off the data on the basis of their icao24 number and applying a function to each chunk ie.
tapply(myDF$callsign, myDF$icao24, ...)
But I can't seem to understand what 'function' I would be applying to each section because they are named differently. Would I need to use some sort of loop iterating over each section with a tapply() applied to each section?
If the values are blank (""), then do a group_by 'icao24' and replace the elements that are "" with the first element of non-blank 'callsign'
library(dplyr)
df2 <- df1%>%
group_by(icao24) %>%
mutate(callsign = replace(callsign, callsign == "",
first(callsign[callsign != ""])))
Another option is fill after converting the blank to NA
library(tidyr)
df2 <- df1 %>%
mutate(callsign = na_if(callsign, "")) %>%
group_by(icao24) %>%
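fill(callsign, .direction = "downup") # the default "down" would leave blanks above the first known callsign unfilled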
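Since the question started from tapply(), it may help to see the base R function that fits this job: ave() applies a function per group and returns a vector of the same length as its input. The sample data and the helper `fill_blank` are hypothetical, and the sketch assumes each icao24 maps to a single callsign.

```r
# Hypothetical sample; the real data has many more rows and columns
myDF <- data.frame(
  icao24   = c("a1", "a1", "a1", "b2", "b2"),
  callsign = c("", "UAL123", "", "DAL45", ""),
  stringsAsFactors = FALSE
)

# Replace blanks in one group with that group's first non-blank value
fill_blank <- function(x) {
  known <- x[x != ""]
  if (length(known) > 0) x[x == ""] <- known[1]
  x
}

# ave() applies fill_blank within each icao24 group and keeps the row order
myDF$callsign <- ave(myDF$callsign, myDF$icao24, FUN = fill_blank)
```

Groups that contain no callsign at all are left blank.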

How to arrange, group and concatenate string values of repeated keys in different column using R

I have an HMMSCAN result file of protein domains with 10 columns. Please see the link for the CSV file.
https://docs.google.com/spreadsheets/d/10d_YQwD41uj0q5pKinIo7wElhDj3BqilwWxThfIg75s/edit?usp=sharing
But I want it to look like this:-
1BVN:P|PDBID|CHAIN|SEQUENCE Alpha-amylase Alpha-amylase_C A_amylase_inhib
3EF3:A|PDBID|CHAIN|SEQUENCE Cutinase
3IP8:A|PDBID|CHAIN|SEQUENCE Amdase
4Q1U:A|PDBID|CHAIN|SEQUENCE Arylesterase
4ROT:A|PDBID|CHAIN|SEQUENCE Esterase
5XJH:A|PDBID|CHAIN|SEQUENCE DLH
6QG9:A|PDBID|CHAIN|SEQUENCE Tannase
The repeated entries of column 3 should get grouped and its corresponding values of column 1, which are in different rows, should be arranged in separate columns.
This is what I have written so far:
df <- read.csv ("hydrolase_sorted.txt" , header = FALSE, sep ="\t")
new <- df %>% select (V1,V3) %>% group_by(V3) %>% spread(V1, V3)
I hope I am clear with the problem statement. Thanks in advance!!
Your input data set has some irregular rows. The approach in your solution is right, but one more step is required:
library(dplyr)
df %>% select(V3, V1) %>% group_by(V3) %>% summarise(x = paste(V1, collapse = " "))
What we did here is simply concatenate the strings by V3. Before running the code above, you should preprocess the data and fix the improper rows manually (the rows TIM, Dannase, and DLH). To do that, you can use the Text to Columns feature in Excel.
Required steps defined are below. Problematic columns highlighted yellow:
Sorry for the non-English interface of my Excel but the way is self-explanatory.
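For comparison, the grouping step can be sketched in base R with tapply(). The rows below are hypothetical and use the identifiers from the desired output; V1 is assumed to hold the domain name and V3 the sequence identifier, as in the question's code.

```r
# Hypothetical rows: V1 = domain name, V3 = sequence identifier
df <- data.frame(
  V1 = c("Alpha-amylase", "Alpha-amylase_C", "A_amylase_inhib", "Cutinase"),
  V3 = c("1BVN:P|PDBID|CHAIN|SEQUENCE", "1BVN:P|PDBID|CHAIN|SEQUENCE",
         "1BVN:P|PDBID|CHAIN|SEQUENCE", "3EF3:A|PDBID|CHAIN|SEQUENCE"),
  stringsAsFactors = FALSE
)

# Paste all domain names that share the same identifier into one string
domains <- tapply(df$V1, df$V3, paste, collapse = " ")

out <- data.frame(V3 = names(domains), domains = as.vector(domains),
                  stringsAsFactors = FALSE)
```

The result has one row per identifier, with the domain names separated by spaces in a single column.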
