Strip off tracking from URL using R

I have a dataframe with a column of URLs from which I want to remove everything after the first question mark. Some URLs have no question mark, and I want these to remain unchanged. In short, I want to strip off all the tracking. This is a sample URL.
https://www.dummy.com/2017/11/29/four-questions-we-have-about-stuff/?utm_source=exacttarget&utm_medium=newsletter&utm_term=dummydotcom-dummycomnewsletter&utm_content=na-readblog-blogpost&utm_campaign=dummy
This is the result I'm looking for.
https://www.dummy.com/2017/11/29/four-questions-we-have-about-stuff/

Assuming your dataframe is called df and it has a column in it named url:
df$url <- sub('\\?.*', '', df$url)
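As a quick check on mixed input, here is a small sketch; the data frame contents are made up for illustration. URLs without a question mark pass through unchanged, because sub() returns its input when the pattern does not match.

```r
# Hypothetical data: one URL with tracking parameters, one without
df <- data.frame(
  url = c("https://www.dummy.com/a/?utm_source=x&utm_medium=y",
          "https://www.dummy.com/b/"),
  stringsAsFactors = FALSE
)

# Remove the first "?" and everything after it
df$url <- sub("\\?.*", "", df$url)
df$url
```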

With strsplit:
url <- "https://www.dummy.com/2017/11/29/four-questions-we-have-about-stuff/?utm_source=exacttarget&utm_medium=newsletter&utm_term=dummydotcom-dummycomnewsletter&utm_content=na-readblog-blogpost&utm_campaign=dummy"
result <- strsplit(url, "\\?")[[1]][1]
Output:
> result
[1] "https://www.dummy.com/2017/11/29/four-questions-we-have-about-stuff/"
And here is an example of using it on a vector rather than a single string:
strings <- c("here?string", "another?string", "stringnoquestion", "one?more")
> sapply(strsplit(strings, "\\?"), function(x) x[1])
[1] "here" "another" "stringnoquestion" "one"
strsplit returns a list because it is written to work for vectors as well as singular elements. So in the first example the [[1]] was accessing the first element of the list and then the [1] was accessing the first element of that, the url before the ?.
Here is the first example broken out into steps:
# Returns a list of length one
> strsplit(url, "\\?")
[[1]]
[1] "https://www.dummy.com/2017/11/29/four-questions-we-have-about-stuff/"
[2] "utm_source=exacttarget&utm_medium=newsletter&utm_term=dummydotcom-dummycomnewsletter&utm_content=na-readblog-blogpost&utm_campaign=dummy"
# Each element of the list is a vector
> strsplit(url, "\\?")[[1]]
[1] "https://www.dummy.com/2017/11/29/four-questions-we-have-about-stuff/"
[2] "utm_source=exacttarget&utm_medium=newsletter&utm_term=dummydotcom-dummycomnewsletter&utm_content=na-readblog-blogpost&utm_campaign=dummy"
# The first element of that vector
> strsplit(url, "\\?")[[1]][1]
[1] "https://www.dummy.com/2017/11/29/four-questions-we-have-about-stuff/"
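If you want the vector version to fail loudly when a result is not a single string, vapply() may be a safer choice than sapply(). This is only a sketch; the input vector is made up, and the empty string is included to show one edge case: it splits to character(0), so its first element comes back as NA.

```r
# Made-up input, including an empty string as an edge case
strings <- c("here?string", "another?string", "stringnoquestion", "")

# vapply() checks that every result is a single character value
first_part <- vapply(strsplit(strings, "\\?"), function(x) x[1], character(1))
first_part

# The empty string has nothing before a "?", so the result is NA
is.na(first_part[4])
```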

Related

How to see the difference in two strings

I'm trying to find the difference between two columns in a CSV file, which I named Test.
I'd like to add a new column called 'Results' that contains the difference between Events_1 & Events_2. If there is no difference the Results can be blank.
This is a basic example of what I'm trying to accomplish; the real list contains hundreds of events in both columns.
Not tested with your data, but
vec2 <- c("hello,goodbye","hello,goodbye")
vec1 <- c("hello","hello,goodbye")
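# [[:space:]] rather than \\s: R's default regex engine does not expand \\s inside [...]
Map(setdiff, strsplit(vec2, "[,[:space:]]+"), strsplit(vec1, "[,[:space:]]+"))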
# [[1]]
# [1] "goodbye"
# [[2]]
# character(0)
If you need them to be comma-delimited strings, then
mapply(function(a,b) paste(setdiff(a,b), collapse=","), strsplit(vec2, "[,[:space:]]+"), strsplit(vec1, "[,[:space:]]+"))
# [1] "goodbye" ""
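To get the result into the data frame as a new column, the same idea can be applied to the columns directly. This is a sketch only. It assumes the data frame is called Test with character columns Events_1 and Events_2, as described in the question, and that Results should hold the events in Events_2 that are missing from Events_1. The sample values and the helper split_events are made up for illustration. The split pattern uses [[:space:]] for whitespace.

```r
# Hypothetical stand-in for the CSV data described in the question
Test <- data.frame(
  Events_1 = c("hello", "hello,goodbye"),
  Events_2 = c("hello,goodbye", "hello,goodbye"),
  stringsAsFactors = FALSE
)

# Helper (hypothetical name): split on commas and/or whitespace
split_events <- function(x) strsplit(x, "[,[:space:]]+")

# Events present in Events_2 but not in Events_1, pasted back together;
# rows with no difference end up as an empty string
Test$Results <- mapply(
  function(a, b) paste(setdiff(a, b), collapse = ","),
  split_events(Test$Events_2),
  split_events(Test$Events_1)
)
Test
```

If the difference should go the other way, swap the two arguments passed to mapply().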

How can I give sequential names to items in a list?

I have a character string that I have split into a list of smaller strings using strsplit. For example:
> full.seq <- "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
> full.seq
[1] "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
> sequences <- strsplit(full.seq, "cg")
> sequences
[[1]]
[1] "FZp" "K3VdAQzEFZpcAVdV8QM8ZpsEFZpa" "GKi3VdVSQzEFZp"
[4] "GKAVdVRpEFKGIZpg13"
I would like to give each of these new strings a unique, sequential name that I can still use to identify that they were from the same original string (for a later analysis I will do on these strings). For example, "ID.seq1", "ID.seq2", "ID.seq3" etc. I have tried doing this manually but receive this error:
> names(sequences) <- c("ID.seq1", "ID.seq2", "ID.seq3", "ID.seq4")
Error in names(sequences) <- c("ID.seq1", "ID.seq2", "ID.seq3", "ID.seq4") :
'names' attribute [4] must be the same length as the vector [1]
I would also like an automated way of doing this though, as I will need to label up to 30 new strings from a number of original strings. Any suggestions?
First of all, if you want a character vector, you will have to subset the list, because strsplit returns a list. After doing that, you can easily assign names to that vector of terms.
full.seq <- "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
sequences <- strsplit(full.seq, "cg")[[1]]
names(sequences) <- paste0("ID.seq", seq_along(sequences))
sequences
ID.seq1 ID.seq2
"FZp" "K3VdAQzEFZpcAVdV8QM8ZpsEFZpa"
ID.seq3 ID.seq4
"GKi3VdVSQzEFZp" "GKAVdVRpEFKGIZpg13"
Tim's answer works well. I just want to add that, if you want to keep your list and name the elements inside it, you can do this:
full.seq <- "FZpcgK3VdAQzEFZpcAVdV8QM8ZpsEFZpacgGKi3VdVSQzEFZpcgGKAVdVRpEFKGIZpg13"
full.seq
sequences <- strsplit(full.seq, "cg")
names(sequences[[1]]) <- paste0("ID.seq", seq_along(sequences[[1]]))
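For several original strings at once, the naming can be automated along these lines. This is a sketch under assumptions: the input is a named character vector, and the names ID1 and ID2 as well as the short sequences are invented for illustration. Each piece gets a name built from the ID of the string it came from, so the origin stays identifiable.

```r
# Hypothetical input: a named vector of original strings
full.seqs <- c(ID1 = "AAcgBBcgCC", ID2 = "DDcgEE")
ids <- names(full.seqs)

# Split every string on "cg"; the result is a list with one vector per string
pieces <- strsplit(unname(full.seqs), "cg")

# Name each piece <ID>.seq1, <ID>.seq2, ... regardless of how many pieces there are
named <- Map(
  function(parts, id) setNames(parts, paste0(id, ".seq", seq_along(parts))),
  pieces, ids
)
names(named) <- ids
named
```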

Name lists in R (not the elements)

I am trying to name a nested list. This would be one of the several lists in my nested list:
paths_list[i]
[[1]]
[[1]]$CLASS
[1] "Signal Transduction (Saccharomyces cerevisiae)"
[[1]]$GENES
[1] "YPR165W"
[[1]]$ORGANISM
[1] "Saccharomyces cerevisiae"
Basically, I want to give each list an ID, for example R-SCE-198203, as its main name, so that the name R-SCE-198203 appears above $CLASS. In other words, paths_list[i] should have the name R-SCE-198203.
I want this:
paths_list[i]
[[1]]R-SCE-198203
[[1]]$CLASS
[1] "Signal Transduction (Saccharomyces cerevisiae)"
[[1]]$GENES
[1] "YPR165W"
[[1]]$ORGANISM
[1] "Saccharomyces cerevisiae"
I have searched, and the closest I have found uses lapply, but you end up with this:
setNames(lapply(tabs, setNames, varB), varA)
#$varA1
#$varA1$varB1
#[1] "integer"
#
#$varA1$varB2
#[1] "integer"
# ...
I want to avoid the main ID appearing in every element of the list (I do not want $varA1 repeated every time).
Is that possible?
Thanks in advance
I think your lapply approach already gives you what you want. The "true" names of the sub-elements do not include the $ signs. When you print a full list object, the console shows these signs to help you read the data; however, if you access an individual sub-element via [[]], its names do not have these signs. Maybe the following code example helps. Check the outputs.
a_list <- list("dummy")
names(a_list) <- "dummyname"
a_list <- list(a_list)
names(a_list) <- "name"
a_list
#$name
#$name$dummyname
#[1] "dummy"
names(a_list)
#[1] "name"
a_list[[1]]
#$dummyname
#[1] "dummy"
names(a_list[[1]])
#[1] "dummyname"
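Applied to the list from the question, that might look like the sketch below. It assumes paths_list is a plain list whose elements are the per-pathway lists shown above, and it rebuilds only the one element given in the question.

```r
# One element of the nested list, rebuilt from the question
paths_list <- list(
  list(CLASS = "Signal Transduction (Saccharomyces cerevisiae)",
       GENES = "YPR165W",
       ORGANISM = "Saccharomyces cerevisiae")
)

# One ID per top-level element; here there is only one
names(paths_list) <- "R-SCE-198203"

# The ID is stored once, at the top level
names(paths_list)

# The inner names are unchanged
names(paths_list[["R-SCE-198203"]])
paths_list[["R-SCE-198203"]]$GENES
```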

a list of multiple lists of 2 for synonyms

I want to read synonyms from a CSV file, where the first word in each record is the "main" word and the rest of the words in the same record are its synonyms.
I basically want to create a list like the one I would write by hand in R:
synonyms <- list(
list(word="ss", syns=c("yy","yyss")),
list(word="ser", syns=c("sert","sertyy","serty"))
)
This gives me a list as
synonyms
[[1]]
[[1]]$word
[1] "ss"
[[1]]$syns
[1] "yy" "yyss"
[[2]]
[[2]]$word
[1] "ser"
[[2]]$syns
[1] "sert" "sertyy" "serty"
which is essentially a list of lists of "word" and "syns".
How do I go about creating a similar list while reading the words and synonyms from a CSV file?
Any pointers would help! Thanks
This process should return what you want.
# read in data using readLines
myStuff <- readLines(textConnection(temp))
This will return a character vector with one element per line in the file. Note that textConnection is not necessary for reading in files; just supply the file path. Now, split each vector element into a vector of words using strsplit, which returns a list.
myList <- strsplit(myStuff, split=" ")
Now, separate the first element from the remaining elements for each vector within the list.
result <- lapply(myList, function(x) list(word=x[1], synonyms=x[-1]))
This returns the desired result. We use lapply to move through the list items. For each list item, we return a named list where the first element, named word, corresponds to the first element of the vector that is the list item and the remaining elements of this vector are placed in a second list element called synonyms.
result
[[1]]
[[1]]$word
[1] "ss"
[[1]]$synonyms
[1] "yy" "yyss"
[[2]]
[[2]]$word
[1] "ser"
[[2]]$synonyms
[1] "sert" "sertyy" "serty"
[[3]]
[[3]]$word
[1] "at"
[[3]]$synonyms
[1] "ate" "ater" "ates"
[[4]]
[[4]]$word
[1] "late"
[[4]]$synonyms
[1] "lated" "lates" "latee"
data
temp <-
"ss yy yyss
ser sert sertyy serty
at ate ater ates
late lated lates latee"
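The example data above is space-separated. If the file really is comma-separated, as the question says, the same steps might look like the sketch below. It assumes one record per line with a varying number of fields and no quoted commas. The csv_text string stands in for the file, and the element names follow the question (word, syns).

```r
# Stand-in for the CSV file; with a real file, pass its path to readLines()
csv_text <- "ss,yy,yyss\nser, sert, sertyy, serty"
con <- textConnection(csv_text)
lines <- readLines(con)
close(con)

# Split each line on commas and trim stray whitespace around the fields
fields <- lapply(strsplit(lines, ",", fixed = TRUE), trimws)

# First field is the main word, the rest are its synonyms
synonyms <- lapply(fields, function(x) list(word = x[1], syns = x[-1]))
synonyms
```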

How to extract elements from a list with mixed elements

I have a list in R with the following elements:
[[812]]
[1] "" "668" "12345_s_at" "667" "4.899777748"
[6] "49.53333333" "10.10930207" "1.598228663" "5.087437057"
[[813]]
[1] "" "376" "6789_at" "375" "4.899655078"
[6] "136.3333333" "27.82508792" "2.20223398" "5.087437057"
[[814]]
[1] "" "19265" "12351_s_at" "19264" "4.897730912"
[6] "889.3666667" "181.5874908" "1.846451572" "5.087437057"
I know I can access them with something like list_elem[[814]][3] in case that I want to extract the third element of the position 814.
I need to extract the third element of all the list, for example 12345_s_at, and I want to put them in a vector or list so I can compare their elements to another list later on. Below is my code:
elem<-(c(listdata))
lp<-length(elem)
for (i in 1:lp)
{
newlist<-c(listdata[[i]][3]) ###maybe to put in a vector
print(newlist)
}
When I print the results I get the third element, but like this:
[1] "1417365_a_at"
[1] "1416336_s_at"
[1] "1416044_at"
[1] "1451201_s_at"
so I cannot traverse them with an index like newlist[3], because it returns NA. Where is my mistake?
If you want to extract the third element of each list element you can do:
List <- list(c(1:3), c(4:6), c(7:9))
lapply(List, '[[', 3) # This returns a list with only the third element
unlist(lapply(List, '[[', 3)) # This returns a vector with the third element
Using your example and taking into account @GSee's comment, you can do:
yourList <- list(c("","668","12345_s_at","667", "4.899777748","49.53333333",
"10.10930207", "1.598228663","5.087437057"),
c("","376", "6789_at", "375", "4.899655078","136.3333333",
"27.82508792", "2.20223398", "5.087437057"),
c("", "19265", "12351_s_at", "19264", "4.897730912",
"889.3666667", "181.5874908","1.846451572","5.087437057" ))
sapply(yourList, '[[', 3)
[1] "12345_s_at" "6789_at" "12351_s_at"
Next time you can provide some data using dput on a portion of your dataset so we can reproduce your problem easily.
With purrr you can extract elements and ensure data type consistency:
library(purrr)
listdata <- list(c("","668","12345_s_at","667", "4.899777748","49.53333333",
"10.10930207", "1.598228663","5.087437057"),
c("","376", "6789_at", "375", "4.899655078","136.3333333",
"27.82508792", "2.20223398", "5.087437057"),
c("", "19265", "12351_s_at", "19264", "4.897730912",
"889.3666667", "181.5874908","1.846451572","5.087437057" ))
map_chr(listdata, 3)
## [1] "12345_s_at" "6789_at" "12351_s_at"
There are other map_ functions that enforce the type consistency as well and a map_df() which can finally help end the do.call(rbind, …) madness.
In case you wanted to use the code you typed in your question, below is the fix:
listdata <- list(c("","668","12345_s_at","667", "4.899777748","49.53333333",
"10.10930207", "1.598228663","5.087437057"),
c("","376", "6789_at", "375", "4.899655078","136.3333333",
"27.82508792", "2.20223398", "5.087437057"),
c("", "19265", "12351_s_at", "19264", "4.897730912",
"889.3666667", "181.5874908","1.846451572","5.087437057" ))
v <- character() #creates empty character vector
list_len <- length(listdata)
for(i in seq_len(list_len))
v <- c(v, listdata[[i]][3]) #fills the vector with list elements (not efficient, but works fine)
print(v)
[1] "12345_s_at" "6789_at" "12351_s_at"
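Since the goal was to compare the extracted IDs with another list, here is a base R sketch of that last step. The list is shortened to three fields per element, and other_ids is a made-up vector standing in for the second list. vapply() is used so that a missing or non-character third element raises an error rather than silently changing the result type.

```r
# Shortened version of the example data (only the first three fields)
listdata <- list(c("", "668", "12345_s_at"),
                 c("", "376", "6789_at"),
                 c("", "19265", "12351_s_at"))

# Third element of each vector, guaranteed to be one string each
ids <- vapply(listdata, function(x) x[3], character(1))

# Hypothetical IDs to compare against
other_ids <- c("6789_at", "99999_at")

# Which extracted IDs also occur in the other set
ids %in% other_ids
intersect(ids, other_ids)
```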
