How to select columns where the first row equals TRUE in R?

I am using the following code:
summary.out$which[which.max(summary.out$adjr2),]
This gives me the column names along with one row of TRUE or FALSE values.
What should I add to this code in order to get only the column names where the first row is TRUE?
Additionally, how can I combine all the column names of this output into one string delimited with "+" signs? (one string reading "Column1 + Column2 + Column3")

Since the OP did not provide a minimal reproducible example, we'll assume that summary.out is a data frame, and therefore we can use the colnames() function to extract the column names once we know which elements of the first row have the value TRUE.
Here is a reproducible example illustrating how to use TRUE / FALSE values in the first row of a data frame to extract its column names.
textData <- "v1,v2,v3,v4
TRUE,FALSE,TRUE,TRUE
FALSE,TRUE,FALSE,TRUE
TRUE,TRUE,TRUE,TRUE"
data <- read.csv(text=textData)
# select column names where first row = TRUE
colnames(data)[data[1,]==TRUE]
The preceding line of code uses the [ form of the extract operator to extract the required elements from the result of the colnames() function.
...and the output:
> colnames(data)[data[1,]==TRUE]
[1] "v1" "v3" "v4"
We can print the names as a single object separated by + signs with the paste() function and its collapse = argument.
# separated by + sign
paste(colnames(data)[data[1,]==TRUE],collapse = " + ")
...and the output:
> # separated by + sign
> paste(colnames(data)[data[1,]==TRUE],collapse = " + ")
[1] "v1 + v3 + v4"
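Since the adjr2 in the OP's code suggests these names are meant to feed a regression formula (an assumption on our part), the "+"-separated string can be passed straight to as.formula. A minimal sketch, where the response name y is hypothetical:

```r
# Rebuild the example data
textData <- "v1,v2,v3,v4
TRUE,FALSE,TRUE,TRUE
FALSE,TRUE,FALSE,TRUE"
data <- read.csv(text = textData)

# right-hand side of the formula, joined with " + "
rhs  <- paste(colnames(data)[data[1, ] == TRUE], collapse = " + ")
# `y` is a hypothetical response variable; substitute your own
form <- as.formula(paste("y ~", rhs))
form
# y ~ v1 + v3 + v4
```

Alternatively, reformulate(colnames(data)[data[1, ] == TRUE], response = "y") builds the same formula without the manual paste().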

Related

How can I paste a comma (,) in a string of numbers in R Statistics?

I'm quite a newbie at R Statistics. I have a vector with multiple objects (numbers) inside, and I want to put a comma between the first and second digits of every one of them.
x gives this result:
[8] -8196110 -7681989 -8042092 -8196660 -7606310 -7217828 -7634887
[15] -7401244 -7211947 -7636932 -7606444 -7598894 -7398965
My question is how to automatically put a comma between the first and second digits of each of those numbers. The desired output would be:
[1] -8,385772 -7,390682 -8,019960 -8,300000 -8,069984 -8,786782 -7,414995
[8] -8,196110 -7,681989 -8,042092 -8,196660 -7,606310 -7,217828 -7,634887
[15] -7,401244 -7,211947 -7,636932 -7,606444 -7,598894 -7,398965
We can use sub to capture the first digit (after an optional minus sign) at the start (^) of the string and replace it with the backreference (\\1) followed by a comma:
sub("^(-?\\d)", "\\1,", x)
...and the output:
[1] "-8,196110" "-7,681989" "-8,042092" "-8,196660" "-7,606310" "-7,217828" "-7,634887" "-7,401244" "-7,211947" "-7,636932" "-7,606444" "-7,598894" "-7,398965"
data
x <- c(-8196110, -7681989, -8042092, -8196660, -7606310, -7217828,
-7634887, -7401244, -7211947, -7636932, -7606444, -7598894, -7398965
)
We can use strsplit to split each number into individual characters, then pass the result into an sapply call that inserts a comma in the right spot. Note that for negative numbers the first character is the minus sign, so we locate the first digit explicitly:
x_split = strsplit(as.character(x), split = '')
sapply(x_split, function(k) { i <- which(k %in% 0:9)[1]  # position of the first digit
  paste0(c(k[1:i], ',', k[(i + 1):length(k)]), collapse = '') })

How to apply regex in the Quanteda package in R to remove consecutively repeated tokens(words)

I am currently working on a text mining project, and after running my ngrams model I realize I have sequences of repeated words. I would like to remove the repeated words while keeping their first occurrence. An illustration of what I intend to do is demonstrated with the code below. Thanks!
textfun <- "This this this this analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence"
textfun <- corpus(textfun)
textfuntoks <- tokens(textfun)
textfunRef <- tokens_replace(textfuntoks, pattern = **?**, replacement = **?**, valuetype ="regex")
The desired result is "This analysis should remove all of the duplicated or repeated words and return only their first occurrence". I am only interested in consecutive repetitions.
My main problem is in coming up with values for the "pattern" and "replacement" arguments of the "tokens_replace" function. I have tried different patterns, some of which were adapted from sources on here, but none seems to work. [Image: a 5-gram frequency distribution showing repeated tokens for words like "swag", "pleas", "gas", "books", "chicago", and "happi"]
You can split the data at each word, use rle to find consecutive occurrences, and paste the resulting values together.
textfun <- "This this this this analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence"
paste0(rle(tolower(strsplit(textfun, '\\s+')[[1]]))$values, collapse = ' ')
#[1] "this analysis should remove all of the duplicated or repeated words and return only their first occurrence"
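If the casing of the first occurrence should be preserved (the desired output starts with "This", while rle on the lowercased tokens returns "this"), one option is to compare each token with its predecessor case-insensitively. A sketch, assuming simple whitespace tokenization is acceptable:

```r
textfun <- "This this this this analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence"
toks <- strsplit(textfun, '\\s+')[[1]]
# keep a token when it differs (case-insensitively) from the one before it
keep <- c(TRUE, tolower(toks[-1]) != tolower(toks[-length(toks)]))
out <- paste(toks[keep], collapse = ' ')
out
# [1] "This analysis should remove all of the duplicated or repeated words and return only their first occurrence"
```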
Interesting challenge. To do this within quanteda, you can create a dictionary mapping each repeat sequence into its single occurrence.
library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
corp <- corpus("This this this this will analysis analysis analysis should should remove remove remove all all all all all of of the the the the duplicated duplicated or or or repeated repeated repeated words words words and and return return return return return only their their first first first occurrence")
toks <- tokens(corp)
ngrams <- tokens_tolower(toks) %>%
tokens_ngrams(n = 5:2, concatenator = " ") %>%
as.character()
# choose only the ngrams that are all the same word
ngrams <- ngrams[lengths(sapply(strsplit(ngrams, split = " "), unique, simplify = TRUE)) == 1]
# remove duplicates
ngrams <- unique(ngrams)
head(ngrams, n = 3)
## [1] "all all all all all" "return return return return return"
## [3] "this this this this"
So this provides a vector of all (lowercased) repeated values. (To avoid lowercasing, remove the tokens_tolower() line.)
Now we create a dictionary where each sequence is a "value", and each unique token is the "key". Multiple identical keys will exist in the list from which dict is built, but the dictionary() constructor automatically combines them. Once this is created, then the sequences can be converted to the single token using tokens_lookup().
dict <- dictionary(
structure(
# this causes each ngram to be treated as a single "value"
as.list(ngrams),
# each dictionary key will be the unique token
names = sapply(ngrams, function(x) strsplit(x, split = " ")[[1]][1], simplify = TRUE, USE.NAMES = FALSE)
)
)
# convert the sequence to their keys
toks2 <- tokens_lookup(toks, dict, exclusive = FALSE, nested_scope = "dictionary", capkeys = FALSE)
print(toks2, max_ntoken = -1)
## Tokens consisting of 1 document.
## text1 :
## [1] "this" "will" "analysis" "should" "remove"
## [6] "all" "of" "the" "duplicated" "or"
## [11] "repeated" "words" "and" "return" "only"
## [16] "their" "first" "occurrence"
Created on 2021-04-08 by the reprex package (v1.0.0)

Remove character from string in R

I have a data frame as given below:
data$Latitude
"+28.666428" "+28.666470" "+28.666376" "+28.666441" "+28.666330" "+28.666391"
str(data$Latitude)
Factor w/ 1368 levels "+10.037451","+10.037457",..
I want to remove the "+" character from each of the Latitude values.
I tried using gsub()
data$Latitude<-gsub("+","",as.character(factor(data$Latitude)))
This isn't working.
You can use a combination of sapply, substring and regexpr to achieve the result.
regexpr(<pattern>, <vector>) gives you the index of the matched character in each element; use fixed = TRUE so that + is matched literally rather than treated as a regex quantifier.
Using the value as the start index for substring, the rest of the string can be separated.
sapply allows you to loop through the values.
Here is the data.
d <- c("+28.666428","+28.666470","+28.666376","+28.666441","+28.666330","+28.666391")
Here is the logic.
v <- sapply(d, FUN = function(d){substring(d, regexpr('+', d, fixed = TRUE) + 1, nchar(d))}, USE.NAMES = FALSE)
Here is the output
> v
[1] "28.666428" "28.666470" "28.666376" "28.666441" "28.666330" "28.666391"
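For the original gsub() attempt, the underlying problem is that + is a regex quantifier, so the pattern "+" is not a valid literal match. Matching it literally, with fixed = TRUE or by escaping it, makes that one-liner work:

```r
d <- c("+28.666428", "+28.666470", "+28.666376")
gsub("+", "", d, fixed = TRUE)   # treat "+" as a literal character
# [1] "28.666428" "28.666470" "28.666376"
gsub("\\+", "", d)               # or escape it in the pattern
```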

Data frame from character vector that contains three comma-separated values in each row

How can I create the columns of a data frame from a long character vector that contains three comma-separated values in each row? The first element contains the names of the data frame columns.
Not every row has three values; in some places there is just a trailing comma:
> string.split.cols[1] #This row is the .names
[1] "Acronym,Full form,Remarks"
> string.split.cols[2]
[1] "AC,Actual Cost, "
> string.split.cols[3]
[1] "ACWP,Actual Cost of Work Performed,Old term for AC"
> string.split.cols[4]
[1] "ADM,Arrow Diagramming Method,Rarely used now"
> string.split.cols[5]
[1] "ADR,Alternative Dispute Resolution, "
> string.split.cols[6]
[1] "AE,Apportioned Effort, "
The output should be a df with three columns; I'm only interested in the first two columns and will throw out the third.
This is the original string; some values contain unescaped commas, but that isn't a big deal.
string.cols <- "Acronym,Full form,Remarks\nAC,Actual Cost, \nACWP,Actual Cost of Work Performed,Old term for AC\nADM,Arrow Diagramming Method,Rarely used now\nADR,Alternative Dispute Resolution, \nAE,Apportioned Effort, \nAOA,Activity-on-Arrow,Rarely used now\nAON,Activity-on-Node, \nARMA,Autoregressive Moving Average, \nBAC,Budget at Completion, \nBARF,Bought-into, Approved, Realistic, Formal,from Rita Mulcahy's PMP Exam Prep\nBCR,Benefit Cost Ratio, \nBCWP,Budgeted Cost of Work Performed,Old term for EV\nBCWS,Budgeted Cost of Work Scheduled,Old term for PV\nCA,Control Account, \nCBR,Cost Benefit Ratio, \nCBT,Computer-Based Test, \n..."
Have you tried the text input for read.csv?
df <- read.csv(text = string.split.cols, header = TRUE)
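As a quick check with a few sample rows from the question (reconstructed here): the text = argument also accepts a character vector of lines, and the unwanted third column can simply be dropped afterwards.

```r
# reconstructed sample of the question's vector, one row per element
string.split.cols <- c("Acronym,Full form,Remarks",
                       "AC,Actual Cost, ",
                       "ACWP,Actual Cost of Work Performed,Old term for AC")
# check.names = FALSE keeps "Full form" from becoming "Full.form"
df <- read.csv(text = string.split.cols, header = TRUE, check.names = FALSE)
df[, 1:2]   # a two-column data frame: Acronym and Full form
```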
I found this routine to be very fast for splitting a string and converting to a data frame.
slist<-strsplit(mylist,",")
x<-sapply(slist, FUN= function(x) {x[1]})
y<-sapply(slist, FUN= function(x) {x[2]})
df<-data.frame(Column1Name=x, Column2Name=y, stringsAsFactors = FALSE)
where mylist is your vector of strings to split.
You can use rbind.data.frame to do this, after splitting the string:
x <- do.call(rbind.data.frame, strsplit(string.split.cols[-1], ','))
names(x) <- strsplit(string.split.cols[1], ',')[[1]]
x
## Acronym Full form Remarks
## 1 AC Actual Cost
## 2 ACWP Actual Cost of Work Performed Old term for AC
## ...
As a one-liner:
setNames(do.call(rbind.data.frame,
                 strsplit(string.split.cols[-1], ',')
         ),
         strsplit(string.split.cols[1], ',')[[1]]
)

Match unlist output against set of column names

Here is sample data:
main.data <- c("id","num","open","close","char","gene","valid")
data.step.1 <- list(id="12",num="00",open="01-01-2015",char="yes",gene="1234",valid="NA")
match.step.1 <- unlist(data.step.1)
The main.data are the column names of all possible columns.
I have a loop that streams data step by step, and a step could have a missing column (list name).
I would like to match each step (data.step.n) against the master column names (main.data).
Desired output:
id num open close char gene valid
"12" "00" "01-01-2015" "" "yes" "1234" "NA"
How can I unlist the data and match it against the names, so that if an entry is missing (like close in this case) it is filled with an empty string?
Try
v1 <- setNames(rep('', length(main.data)), main.data)
v1[main.data %in% names(match.step.1)] <- match.step.1
Or use match
v1[match(names(match.step.1), main.data)] <- match.step.1
Or just use [
v2 <- setNames(match.step.1[main.data], main.data)
v2[is.na(v2)] <- ''
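Putting the [-based variant together with the question's sample data (reconstructed here) as a self-contained check:

```r
main.data <- c("id", "num", "open", "close", "char", "gene", "valid")
data.step.1 <- list(id = "12", num = "00", open = "01-01-2015",
                    char = "yes", gene = "1234", valid = "NA")
match.step.1 <- unlist(data.step.1)

# index by the master names; the missing `close` comes back as NA
v2 <- setNames(match.step.1[main.data], main.data)
v2[is.na(v2)] <- ''
v2   # `close` is now "" while the other entries keep their values
```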
