Split string using alternating delimiter occurrences - R

I have a list of urls like this:
mydata <- read.table(header=TRUE, text="
Id
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrickpattern%3ADecorative%2FArt+Deco%3Abrickpattern%3AFloral%3Abrickpattern%3AGeometric%3Abrickpattern%3AGraphic%3Abrickpattern%3ATropical%3Aprice%3A300%2C10500&page=7&gridValue=4
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Averticalsizegroupformat%3AIN%2040%3Averticalcolorfamily%3ABlack%3Averticalcolorfamily%3ABlue%3Averticalcolorfamily%3AWhite
https://www.example.com/dp/c/830316016?q=%3Arelevance%3Averticalcolorfamily%3AWhite&gclid=CjwKEAjw9_jJBRCXycSarr3csWcSJABthk07W_H0RxQtOPZX7VdD9CSmK4S01BMYdXbtc0XxC0OeChoCky_w_wcB
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AFLYING%20MACHINE%3Abrand%3AMUFTI%3Abrand%3AUNITED%20COLORS%20OF%20BENETTON
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Averticalsizegroupformat%3AIN%2038%3Averticalsizegroupformat%3AIN%2039%3Averticalsizegroupformat%3AIN%20M%3Averticalsizegroupformat%3AUK%2039%3Averticalsizegroupformat%3AUK%20M%3Averticalsizegroupformat%3AUK%20S%3Averticalsizegroupformat%3AUS%20M%3Averticalsizegroupformat%3AUS%20S%3Abrickpattern%3ASolid%3Averticalcolorfamily%3ABlack%3Averticalcolorfamily%3AWhite
https://www.example.com/dp/c/830216013?q=%3Aprce-asc%3Abricksleeve%3AShort%3Aprice%3A300%2C10500&page=2&gridValue=4
https://www.example.com/dp/c/830216013??q=%3Aprce-asc%3Abrand%3AUS+POLO%3Abricksleeve%3AShort%3Aprice%3A300%2C10500
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3AAJIO%3Abrand%3ABASICS%3Abrand%3ACelio%3Abrand%3ADNMX%3Abrand%3AGAS%3Abrand%3ALEVIS%3Abrand%3ANETPLAY%3Abrand%3ASIN%3Abrand%3ASUPERDRY%3Abrand%3AUS%20POLO%3Abrand%3AVIMAL%3Abrand%3AVIMAL%20APPARELS%3Abrand%3AVOI%20JEANS
https://www.example.com/dp/c/830216013?q=%3Arelevance%3Abrand%3ABritish+Club%3Abrand%3ACelio%3Abrand%3AFLYING+MACHINE%3Aprice%3A300%2C10500&page=1&gridValue=4
")
I need to pull out the values of parameters like brand, verticalcolorfamily, q=, etc. from the URLs. These parameters are the filters applied on the website.
The output I am looking for is a data frame with three columns: parameter, value, and the frequency of occurrence of the value. For example:
parameter | value | frequency
----------|----------------|----------
brand | FLYING+MACHINE | 2
q= | relevance | 5
price | 300%2C10500 | 2
brand | BASICS | 1
The only approach I can currently think of is to split each URL into a character vector using alternating occurrences of "%3A" as the delimiter: [q=%3Arelevance, brickpattern%3ADecorative%2FArt+Deco, brickpattern%3AFloral, brickpattern%3AGeometric, brickpattern%3AGraphic, brickpattern%3ATropical, price%3A300%2C10500].
Then I would place each element in a column of a data frame, split again by '%3A', and do a group by.
Suggestions on another approach would be really appreciated.
Also, if I am to use this approach, I am unaware of how to split on alternating occurrences of '%3A'.

urltools looks like an awesome package for what you want to do. Here's a hacked answer in the meantime. Starting with your data.frame:
# Convert to character vector
# Strip the URL prefix up to and including "?"
# Split by "%3A" and convert to a "long" vector
L <- as.character(mydata$Id)
L <- gsub("https://www\\.example\\.com/dp/c/\\d+\\?+", "", L) # \\d+ covers both product ids (830216013, 830316016); \\?+ covers the doubled "??"
L <- unlist(strsplit(L, "%3A"))
head(L)
[1] "q=" "relevance" "brickpattern"
[4] "Decorative%2FArt+Deco" "brickpattern" "Floral"
Then:
# Convert to 2-column data frame
# Count unique parameter:value pairs
library(dplyr)
df <- data.frame(parameter = L[seq(1, length(L), 2)],
                 value     = L[seq(2, length(L), 2)]) %>%
  group_by(parameter, value) %>%
  summarize(frequency = sum(!is.na(value)))
Showing only the entries where frequency >= 2:
# Show only entries with frequency >= 2
filter(df, frequency >= 2)
            parameter     value frequency
               <fctr>    <fctr>     <int>
1               brand     Celio         2
2         bricksleeve     Short         2
3                  q=  prce-asc         2
4                  q= relevance         7
5 verticalcolorfamily     Black         2
6 verticalcolorfamily     White         2
Note that brand:FLYING+MACHINE does not reach 2 because the brand occurs once as FLYING%20MACHINE and once as FLYING+MACHINE. Similarly, verticalcolorfamily:White is 2 rather than 3 because one URL carries a trailing &gclid=... fragment on the value; a more robust solution would URL-decode the values and strip trailing &... parameters.
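For completeness, here is a sketch of what the urltools route could look like (my addition; hedged, not tested against every URL above, and the URL with the doubled "??" may need cleaning first). param_get() pulls the q parameter regardless of product id or trailing &page=... parameters, and url_decode() turns %3A back into ":":
library(urltools)
library(dplyr)

q <- param_get(as.character(mydata$Id), "q")$q        # extract the q= parameter from each URL
tokens <- strsplit(paste0("q=", url_decode(q)), ":")  # decoded q starts with ":", so "q=" pairs with the sort value

result <- lapply(tokens, function(x) {
  data.frame(parameter = x[seq(1, length(x), 2)],     # odd tokens are parameter names
             value     = x[seq(2, length(x), 2)])     # even tokens are values
}) %>%
  bind_rows() %>%
  count(parameter, value, name = "frequency")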

Related

How do I create a new column that contains part of a string based on a pattern in R

Apologies if this has been solved before; I haven't been able to find a solution to this problem. I am trying to pull out the letter "N" from a string, including the positions at -1 and +1, and report it in a new column. There may be more than one instance of N in the string, and I would like it to report all of them. I can filter the peptides containing N using
dt_contains_N <-dt[str_detect(dt$Peptide, "N"),]
but I'm not sure how to extract it. I was thinking something like
dt_N_motif <- dt[substring(dt$Peptide, regexpr("N", dt$Peptide) + 1)]
but I'm not sure how to use the N-position column information to extract the N-1, N, and N+1 positions.
For example a simplified view of my data table looks like:
dt <- data.frame(Peptide= c("GESNEL", "SADNNEW", "SADNNEW"), N_pos=c(4,4,5))
peptide | N pos
--------|------
GESNEL  | 4
SADNNEW | 4
SADNNEW | 5
and I would like it to look like this:
peptide | N pos | Motif
--------|-------|------
GESNEL  | 4     | SNE
SADNNEW | 4     | DNN
SADNNEW | 5     | NNE
Any help would be great,
Thanks!
Use substr/substring to extract the characters between positions N_pos - 1 and N_pos + 1.
transform(dt, Motif = substr(Peptide, N_pos - 1, N_pos + 1))
# Peptide N_pos Motif
#1 GESNEL 4 SNE
#2 SADNNEW 4 DNN
#3 SADNNEW 5 NNE
Using the tidyverse:
library(dplyr)
library(stringr)
dt %>%
  mutate(Motif = str_sub(Peptide, N_pos - 1, N_pos + 1))
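If the N_pos column is not available (the question mentions there may be more than one N per peptide), a hedged base-R sketch that finds every N and extracts the surrounding motif:
# Locate every "N" with gregexpr() and extract the motif around each match.
# Note: an N at position 1 or at the very end yields a truncated motif.
all_motifs <- lapply(as.character(dt$Peptide), function(p) {
  pos <- as.integer(gregexpr("N", p)[[1]])  # every position of "N" in the peptide
  substring(p, pos - 1, pos + 1)            # motif around each match
})
# all_motifs[[2]] is c("DNN", "NNE") for "SADNNEW"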

Saving the output of a str_which loop in R

I work with a sheet of data that lists a variety of scientific publications. Rows are publications; columns are a variety of metrics describing each publication (author name and position, Pubmed IDs, date, etc.).
I want to filter for publications by each author and extract parts of them. The caveat is the format: all author names (5-80 per cell) are lumped together in one cell for each row.
I managed to solve this with str_which, saving the coordinates for each author and extracting later. This works only for manual use. When I try to automate the process using a loop that draws on a list of authors, I fail to save the output.
I am at a bit of a loss on how to store the results without overwriting previous ones.
sampleDat <-
data.frame(var1 = c("Doe J, Maxwell M, Kim HE", "Cronauer R, Carst W, Theobald U", "Theobald U, Hey B, Joff S"),
var2 = c(1:3),
var3 = c("2016-01", "2016-03", "2017-05"))
The list of names that I want the coordinates for:
namesOfInterest <-
list(c("Doe J", "Theobald U"))
The manual extraction, requiring me to type the exact name and output object:
Doe <- str_which(sampleDat$var1, "Doe J")
Theobald <- str_which(sampleDat$var1, "Theobald U")
One of many attempts that does not replicate the manual version:
results <- c()
for (i in namesOfInterest) {
results[i] <- str_which(sampleDat$var1, i)
}
The for loop is set up incorrectly (it needs to be something like for (i in 1:n) {do something}). Also, even if you fix that, you'll get an error related to the fact that str_which returns a vector of varying length, indicating the position of each of the matches it makes (and it can make multiple matches). Thus, indexing a vector in a loop won't work here: whenever an author has multiple matches, more than one entry will be saved to a single element, throwing an error.
Solve this by working with lists, because lists can hold vectors of arbitrary length. Index the list with double bracket notation: [[.
library(stringr)
sampleDat <-
data.frame(var1 = c("Doe J, Maxwell M, Kim HE", "Cronauer R, Carst W, Theobald U", "Theobald U, Hey B, Joff S"),
var2 = c(1:3),
var3 = c("2016-01", "2016-03", "2017-05"))
# no need for list here. a simple vector will do
namesOfInterest <- c("Doe J", "Theobald U")
# initialize list
results <- vector("list", length = length(namesOfInterest))
# loop over list, saving output of `str_which` in each list element.
# seq_along(x) is similar to 1:length(x)
for (i in seq_along(namesOfInterest)) {
results[[i]] <- str_which(sampleDat$var1, namesOfInterest[i])
}
which returns:
> results
[[1]]
[1] 1
[[2]]
[1] 2 3
The way to understand the output above is that the ith element of the list, results[[i]] contains the output of str_which(sampleDat$var1, namesOfInterest[i]), where namesOfInterest[i] is always exactly one author. However, the length of results[[i]] can be longer than one:
> sapply(results, length)
[1] 1 2
indicating that a single author can be mentioned multiple times. In the example above, sapply counts the length of each vector along the list results, showing that namesOfInterest[1] has one paper and namesOfInterest[2] has two.
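A small usage note (my addition): naming the list elements makes lookups read like the manual version:
names(results) <- namesOfInterest
results[["Theobald U"]]
# [1] 2 3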
Here is another approach for you. If you want to know which scholar is in which publication, you can do the following as well. First, assign unique IDs to the publications. Then, split the authors and create a long-format data frame. Group by author and aggregate the publication IDs (pub_id) as a string (character). If you need to extract specific authors, you can use this data frame (foo) and subset its rows.
library(tidyverse)
mutate(sampleDat, pub_id = 1:n()) %>%
separate_rows(var1, sep = ",\\s") %>%
group_by(var1) %>%
summarize(pub_id = toString(pub_id)) -> foo
var1 pub_id
<chr> <chr>
1 Carst W 2
2 Cronauer R 2
3 Doe J 1
4 Hey B 3
5 Joff S 3
6 Kim HE 1
7 Maxwell M 1
8 Theobald U 2, 3
filter(foo, var1 %in% c("Doe J", "Theobald U"))
var1 pub_id
<chr> <chr>
1 Doe J 1
2 Theobald U 2, 3
If you want to have index as numeric, you can twist the idea above and do the following. You can subset rows with targeted names with filter().
mutate(sampleDat, pub_id = 1:n()) %>%
separate_rows(var1, sep = ",\\s") %>%
group_by(var1) %>%
summarize(pub_id = list(pub_id)) %>%
unnest(pub_id)
var1 pub_id
<chr> <int>
1 Carst W 2
2 Cronauer R 2
3 Doe J 1
4 Hey B 3
5 Joff S 3
6 Kim HE 1
7 Maxwell M 1
8 Theobald U 2
9 Theobald U 3

how to find the frequency of a tag or a word in r

I'm working on a Stack Overflow data dump .csv file and I need to find:
The top 8 most frequent tags in the dataset.
To do this, I look at the set of tags associated with each row in the data1.PostTypeId column. The frequency of a tag is equal to the number of questions that have that tag (i.e., the number of rows containing that tag).
Note 1: The file is large; it has over 1 million rows.
Note 2: I'm a beginner in R, so I need the simplest way. My attempt was to use the table function, but what I got was a list of tags and I couldn't figure out the top ones.
A sample of the table I'm using is below.
Let's say, for example, that "java" has the highest frequency (because it appears the most among all the rows), that "python-3.x" has the second highest frequency, and so on. Basically, I need to go over the second column of the table and find the top 8 tags there.
Using base R with (optional) magrittr pipes for readability:
library(magrittr)
# Make a vector of all the tags present in data
tags_sep <- tags %>%
strsplit("><") %>%
unlist()
# Clean out the remaining < and >
tags_sep <- gsub("<|>", "", tags_sep)
# Frequency table sorted
tags_table <- tags_sep %>%
table() %>%
sort(decreasing = TRUE)
# Print the top 10 tags
tags_table[1:10]
java android amazon-ec2 amazon-web-services android-mediaplayer
4 2 1 1 1
antlr antlr4 apache-kafka appium asp.net
1 1 1 1 1
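Since the question asks for the top 8 specifically, the same table gives them directly:
# Print the top 8 tags
head(tags_table, 8)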
Data
tags <- c(
"<java><android><selenium><appium>",
"<java><javafx><javafx-2>",
"<apache-kafka>",
"<java><spring><eclipse><gradle><spring-boot>",
"<c><stm32><led>",
"<asp.net>",
"<python-3.x><python-2.x>",
"<http><server><Iocalhost><ngrok>",
"<java><android><audio><android-mediaplayer>",
"<antlr><antlr4>",
"<ios><firebase><swift3><push-notification>",
"<amazon-web-services><amazon-ec2><terraform>",
"<xamarin.forms>",
"<gnuplot>",
"<rx-java><rx-android><rx-binding>",
"<vim><vim-plugin><syntastic>",
"<plot><quantile>",
"<node.js><express-handlebars>",
"<php><html>"
)
If I understood correctly, this should solve your problem
library(stringr)
library(data.table)
# some dummy data
dat = data.table(id = 1:3, tags = c("<java><android><selenium>",
"<java><javafx>",
"<apache><android>"))
tags = apply(str_split(dat$tags, pattern = "><", simplify = T),
             2, function(x) str_replace_all(x, "<|>", "")) # one tag per column; str_replace_all removes both "<" and ">" when a cell holds a single tag like "<apache-kafka>"
foo = cbind(dat[, .(id)], tags) # add the separated tags to the data
foo[foo==""] = NA # substitute empty strings with NA
foo = melt.data.table(foo, id.vars = "id") # transform to long format
foo = foo[, .N, by = value] # calculate frequency
foo = foo[order(-N)] # sort by frequency, descending
head(foo, n = 3) # change the value of "n" to the number you want (e.g. n = 8)
value N
1: java 2
2: android 2
3: NA 2
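For comparison (my addition, hedged): the tidyverse idiom used earlier in this thread also works here, without building the wide intermediate. Using the dat defined above:
library(dplyr)
library(tidyr)

dat %>%
  separate_rows(tags, sep = "><") %>%       # one tag per row
  mutate(tags = gsub("<|>", "", tags)) %>%  # strip the remaining < and >
  count(tags, sort = TRUE) %>%              # frequency per tag, descending
  head(8)                                   # top 8 tags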

R - how to index rank and accordingly display a data frame?

I have a data frame that lists the names of some individuals and their monetary transactions carried out in USD. The table lists data by district and the valid transactions made by either cash or credit card, like so:
X Dist transact.cash transact.card
a 1 USD USD
b 1 USD USD
Here X is an individual (his/her transactions are recorded over a fixed period of time) and Dist is the district where he/she resides. There are over 4000 observations in total, approx. 80-100 rows per Dist. So far, sorting, slicing and everything else have been simple operations, with dat.cash and dat.card being tables subsetted by mode of transaction; but I'm having problems extracting information by rank. For this, I have written a function where I specify a rank, and the function should show the rows starting from that rank:
rankdat <- function(transact, numb) {
# Truncated
valid.nums = c('highest', 'lowest', 1:nrow(dat.cash)) # for cash subset
if (transact == 'cash' && numb == 'highest') { # This is easy
sort <- dat.cash[order(dat.cash[, 3], decreasing = T), ]# For sorting only cash data set
} else if (transact == 'cash' && numb == 1:nrow(dat.cash)) {
sort <- dat.cash[order(dat.cash[, 3], decreasing = T) == numb, ] } # Not getting results here
}
The last line returns NULL instead of a ranked transaction with all its rows. Replacing == with %in% still gives NULL, and using rank() doesn't change anything. For the highest and lowest values it's not a big deal, since that involves only simple sorting. If I specify rankdat('cash', 10), the function should return values starting from the 10th highest transaction and decreasing, irrespective of Dist, similar to:
X Dist transact.cash
b 1 10th highest
h 2 11th highest
p 1 12th highest
and so on
This function is able to do that:
rankdat <- function(df,rank.by,num=10,method="top",decreasing=T){
# ------------------------------------------------------
# RANKDAT
# ------------------------------------------------------
# ARGUMENT
# ========
# df Input dataFrame [d.f]
# num Selected row [num]
# rank.by Name of column(s) used to rank dataFrame
# method Method used to extract rows
# top - to select top rank (e.g. 10 first rows)
# specific - to select specific row
# ------------------------------------------------------
eval(parse(text=paste("sort=df[with(df,order(",rank.by,", decreasing=",decreasing,")),]",sep=""))) # order dataFrame by rank.by; note decreasing= must sit inside order(), not with()
if(method %in% "top"){
return(sort[1:num,])
}else if(method %in% "specific"){
return(sort[num,])
}else{
stop("Please select method used to extract data !!!")
}
}
Suppose that you have the following data.frame:
df=data.frame(X=c(rep('A',2),rep('B',3),rep('A',3),rep('B',2)),
Dist=c(rep(1,5),rep(0,5)),
transact.cash=c(rep('USD',5),rep('€',5)),
transact.card=c(rep('USD',5),rep('€',5)))
We obtain:
X Dist transact.cash transact.card
1 A 1 USD USD
2 A 1 USD USD
3 B 1 USD USD
4 B 1 USD USD
5 B 1 USD USD
6 A 0 € €
7 A 0 € €
8 A 0 € €
9 B 0 € €
10 B 0 € €
If you would like to sort a dataframe by multiple columns (transact.cash and/or transact.card), you can use the approaches in stackoverflow: How to sort a dataframe by column(s). In your example, you only specified dat.cash, thus:
sort = df[order(df$transact.cash, decreasing=T),] # Order your dataFrame with transact.cash column
If you want to extract rows which satisfy a specific condition, you need to use which() together with == for numeric, double, or logical matches, or %in% for string matches. For example:
XA = df[which(df$X %in% "A"),] # Select row by user
XDist = df[which(df$Dist == 1),] # Select row by District
Finally, if you would like to select the first five row after ordering:
sort[1:5,] # Select first five rows
sort[1:numb,] # Select first numb rows
With that you can write a simple function to easily extract data from your dataframe.
Hope it will help you
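A side note (my addition): the eval(parse()) indirection can be avoided by indexing the single ranking column with [[, which is easier to read and debug. A minimal sketch under the same arguments:
rankdat2 <- function(df, rank.by, num = 10, method = "top", decreasing = TRUE) {
  sorted <- df[order(df[[rank.by]], decreasing = decreasing), ]  # order by the named column
  if (method == "top") {
    sorted[1:num, ]    # first `num` rows after sorting
  } else if (method == "specific") {
    sorted[num, ]      # the single row at rank `num`
  } else {
    stop("method must be 'top' or 'specific'")
  }
}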

Reshape Panel Data Wide Format to Long Format

I am struggling with transformation of a Panel Dataset from wide to long format. The Dataset looks like this:
ID | KP1_430a | KP1_430b | KP1_430c | KP2_430a | KP2_430b | KP2_430c | KP1_1500a | ...
1 ....
2 ....
KP1, KP2 up to KP7 describe the waves.
a, b up to f describe a specific item (e.g., left-to-right placement of party a).
I would like to have this data in long format. Like this:
ID | Party | Wave | 430 | 1500
1 1 1 .. ..
1 2 1 .. ..
. . .
1 1 2 .. ..
. . .
2 1 1 .. ..
I tried to use the reshape function. But I had problems reshaping it over time and over the parties simultaneously.
Here is a small data.frame example.
data <- data.frame(matrix(rnorm(10),2,10))
data[,1] <- 1:2
names(data) <- c("ID","KP1_430a" , "KP1_430b" , "KP1_430c" , "KP2_430a" , "KP2_430b ", "KP2_430c ", "KP1_1500a" ,"KP1_1500b", "KP1_1500c")
And this is how far I got.
data_long <- reshape(data, varying=list(names(data)[2:4], names(data)[5:7], names(data)[8:10]),
                     v.names=c("KP1_430","KP2_430","KP1_1500"),
                     direction="long", timevar="Party")
The question remains: how can I get the time-varying variables in long format as well? And is there a more elegant way to reshape this data? In the code above I would have to enter the names (names(data)[2:4]) for each wave and variable. With this small data.frame that is OK, but the dataset is a lot larger.
EDIT: How this transformation could be done by hand (I have actually done this, which leaves me with a page-long code file):
First, bind KP1_430a and KP1_1500a with the IDs, Time=1 and Party=1, column-wise. Second, create the same object for all parties [b-f], changing the party index respectively, and append it row-wise. Do steps one and two for the rest of the waves [2-7], respectively changing the party and time variables, and append them row-wise.
It is usually easier to proceed in two steps: first use melt to put your data into a "tall" format (unless that is already the case) and then use dcast to convert it to a wider format.
library(reshape2)
library(stringr)
# Tall format
d <- melt(data, id.vars="ID")
# Process the column containing wave and party
d1 <- str_match_all(
as.character( d$variable ),
"KP([0-9])_([0-9]+)([a-z])"
)
d1 <- do.call( rbind, d1 )
d1 <- d1[,-1]
colnames(d1) <- c("wave", "number", "party")
d1 <- as.data.frame( d1)
d <- cbind( d, d1 )
# Convert to the desired format
d <- dcast( d, ID + wave + party ~ number )
At the moment your Wave data is in your variable names and you need to extract it with some string processing. I had no trouble with melt
mdat <- melt(data, id.vars="ID")
mdat$wave=sub("KP", "", sub("_.+$", "", mdat$variable)) # remove the other stuff
mdat
Your description is too sketchy (so far) for me to figure out the rule for deriving a "Party" variable, so perhaps you can edit your question to show how that might be done by a human being .... and then we can show the computer how to do it.
EDIT: If the last lower-case letter in the original column names is Party as Vincent thinks, then you could trim the trailing spaces in those names and extract:
mdat$var <- sub("\\s", "", (as.character(mdat$variable)))
mdat$party=substr( mdat$var, nchar(mdat$var), nchar(mdat$var))
#--------------
> mdat
ID variable value wave party var
1 1 KP1_430a 0.7220627 1 a KP1_430a
2 2 KP1_430a 0.9585243 1 a KP1_430a
3 1 KP1_430b -1.2954671 1 b KP1_430b
4 2 KP1_430b 0.3393617 1 b KP1_430b
5 1 KP1_430c -1.1477627 1 c KP1_430c
6 2 KP1_430c -1.0909179 1 c KP1_430c
<snipped output>
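For readers on current tidyr (my addition, hedged: assumes tidyr >= 1.0), pivot_longer() can do the melt-and-parse in one step via names_pattern, reusing the same regex as the first answer; pivot_wider() then spreads the item number back out. The \s* allows for the stray trailing spaces in the example's column names:
library(dplyr)
library(tidyr)

data_long <- data %>%
  pivot_longer(
    cols = -ID,
    names_pattern = "KP([0-9])_([0-9]+)([a-z])\\s*",  # wave, item number, party
    names_to = c("wave", "number", "party")
  ) %>%
  pivot_wider(names_from = number, values_from = value)  # one column per item (430, 1500)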
