Selecting/subsetting blank and non-blank rows with a vector [duplicate] - r

I'd like to use a vector featuring blank ("") and non-blank character strings to subset rows so that I end up with a result like in dfgoal.
I've tried using dplyr::select(), but I get an error message (Error: Strings must match column names. Unknown columns: tooth, , head, foot).
I realise I've got a problem in that I want to keep some "" and get rid of others, but I don't know how to resolve it.
Thanks for any help!
# Data
df <- data.frame(avar = c("tooth","","","head","","foot","",""), bvar = 1:8)
# Vector
veca <- c("tooth","foot")
vecb <- c("")
vecc <- as.vector(rbind(veca,vecb))
vecc <- unique(vecc)
# Attempt
library(dplyr)
df <- df %>% dplyr::select(vecc)
# Goal
dfgoal <- data.frame(avar = c("tooth","","","foot","",""), bvar = c(1,2,3,6,7,8))

I'm not entirely clear on what you're trying to do. I assume you're asking how to select rows where avar %in% veca including subsequent blank ("") rows.
Perhaps something like this using tidyr::fill?
library(tidyverse)
veca <- c("tooth","foot")
df %>%
  mutate(tmp = ifelse(avar == "", NA, as.character(avar))) %>%
  fill(tmp) %>%
  filter(tmp %in% veca) %>%
  select(-tmp)
#   avar bvar
#1 tooth    1
#2          2
#3          3
#4  foot    6
#5          7
#6          8
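If you prefer base R, the same carry-forward idea can be sketched without tidyr (a minimal sketch, assuming the first value of avar is non-blank):
grp <- as.character(df$avar)
grp[grp == ""] <- NA
# carry the last non-blank label forward (last observation carried forward)
filled <- na.omit(grp)[cumsum(!is.na(grp))]
df[filled %in% veca, ]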

Related

R: Counting frequency of words from predefined dictionary

I have a very large dataset that looks like this: one column contains names, the second column contains their respective (very long) texts. I also have a pre-defined dictionary that contains at least 20 terms. How can I count the number of times these key words occur in each row of my data frame? I have tried str_detect, grep(l), and %like%, and looped over each row, but the problem seems to be that I want to detect too many terms, and these functions stop working when I use 15+ terms or so.
Would be sooo happy if anyone could help me out with this!
col1<- c("Henrik", "Joseph", "Lucy")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2)```
dict <- c("groceries", "going", "me") #but my actual dictionary is much larger
Create a unique identifier for your rows. Split col2 into words, one per row. Filter for only the words in your dict. Then count by row. Finally, join back to the original df and replace NA with zero for rows that don't contain any words from your dict.
library(dplyr)
col1 <- c("A","B","A")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2, stringsAsFactors = FALSE)
dict <- c("groceries", "going", "me")
df <- df %>% mutate(row = row_number()) %>% select(row, everything())
counts <- df %>%
  tidyr::separate_rows(col2) %>%
  filter(col2 %in% dict) %>%
  group_by(row) %>%
  count(name = "counts")
final <- left_join(df, counts, by = "row") %>% tidyr::replace_na(list(counts = 0L))
final
#>   row col1                         col2 counts
#> 1   1    A I am going to get groceries      2
#> 2   2    B         He called me at six.      1
#> 3   3    A               No, he did not      0
Here is a base R option using gregexpr
dfout <- within(
  df,
  counts <- sapply(
    gregexpr(paste0(dict, collapse = "|"), col2),
    function(x) sum(x > 0)
  )
)
or
dfout <- within(
  df,
  counts <- sapply(
    regmatches(col2, gregexpr("\\w+", col2)),
    function(v) sum(v %in% dict)
  )
)
which gives
> dfout
  col1                         col2 counts
1    1 I am going to get groceries      2
2    2         He called me at six.      1
3    3               No, he did not      0
Data
structure(list(col1 = 1:3, col2 = c("I am going to get groceries",
"He called me at six.", "No, he did not")), class = "data.frame", row.names = c(NA,
-3L))
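One caveat with the alternation pattern: gregexpr counts substring hits, so a short term like "me" would also match inside longer words (e.g. "home"). If that matters, a hedged tweak is to wrap the dictionary in word boundaries:
pat <- paste0("\\b(", paste(dict, collapse = "|"), ")\\b")
dfout <- within(df, counts <- sapply(gregexpr(pat, col2), function(x) sum(x > 0)))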
I think my solution gives you the output you want: for each word in your dict vector, you can see how many times it appears in each sentence. Each row is an entry in df$col2, i.e. a sentence; dict is your vector of terms to match. We loop over the vector, and for each entry we count how many times it appears in each row/sentence using stringr::str_count. Note the syntax for str_count: str_count(string being searched, pattern you're trying to match).
str_count returns a vector showing how many times the word appears in each row. I create a data frame from these vectors, which contains one row per entry in dict. Then you can cbind dict to that data frame to see how many times each word is used in each sentence. I adjust the column names at the very end so you can match the words to the sentence numbers. Note that if you want to calculate row means you'll need to drop the dict column of the final data frame first, because it's character.
library(stringr)
col1<- c("Henrik", "Joseph", "Lucy")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he
did not")
df <- data.frame(col1, col2)
dict <- c("groceries", "going", "me")
word_matches <- data.frame()
for (i in dict) {
  word_tot <- str_count(df$col2, i)
  word_matches <- rbind(word_matches, word_tot)
}
word_matches
colnames(word_matches) <- paste("Sentence", 1:ncol(word_matches))
cbind(dict,word_matches)
       dict Sentence 1 Sentence 2 Sentence 3
1 groceries          1          0          0
2     going          1          0          0
3        me          0          1          0
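As a quick consistency check, the column sums of word_matches recover the per-sentence totals that the earlier answers computed directly:
colSums(word_matches)
# Sentence 1 Sentence 2 Sentence 3
#          2          1          0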

How do I change all the character values of a column that start with specific characters?

I have a dataset with millions of observations.
One of the columns of this dataset uses 4 or 5 characters to classify these observations.
My goal is to merge this classification into smaller groups, for example, I want to replace all the values of the column that STARTS with "AA" (e.g., "AABC" or "AAUCC") for just "A". How can I do this?
To illustrate:
Considering that my data is labeled "f2016" and the column that I'm interested in is "SECT16", I've been using the following code to replace values:
f2016$SECT16[f2016$SECT16 == "AABB"] <- "A"
But I cannot do this to all combinations of letters that I have in the dataset. Is there a way that I can do the same replacement holding the first two letters constant?
Here is another base R solution:
f2016$SECT16[startsWith(f2016$SECT16, "AA")] <- "A"
#   SECT16
# 1      A
# 2      A
# 3 ABBBBC
# 4  DDDDE
# 5   BABA
This replaces the values whose prefix matches, in this case "AA". An excerpt from help(startsWith):
startsWith() is equivalent to, but much faster than
substring(x, 1, nchar(prefix)) == prefix
or also
grepl(paste0("^", prefix), x)
where prefix is not to contain special regular expression characters.
Data
f2016 <- data.frame(SECT16 = c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA"), stringsAsFactors = F)
We can use grep/grepl
f2016$SECT16[grep("^AA", f2016$SECT16)] <- "A"
#f2016$SECT16[grepl("^AA", f2016$SECT16)] <- "A"
Consider this dataset
df <- data.frame(A = c("ABCD", "AACD", "DASDD", "AABB"), stringsAsFactors = FALSE)
df
# A
#1 ABCD
#2 AACD
#3 DASDD
#4 AABB
df$A[grep("^AA", df$A)] <- "A"
df
# A
#1 ABCD
#2 A
#3 DASDD
#4 A
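A vectorized base R alternative that skips the intermediate index (a sketch, assuming every value beginning with "AA" should collapse to "A"); sub() returns non-matching values unchanged, so no subsetting is needed:
df$A <- sub("^AA.*", "A", df$A)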
You can use stringr and dplyr.
Modify all columns:
df <- df %>% mutate_all(function(x) stringr::str_replace(x, "^AA.+", "A"))
Modify specific columns:
df <- df %>% mutate_at(1, function(x) stringr::str_replace(x, "^AA.+", "A"))
Data
df <- data.frame(SECT16 = c("AABC", "AABB"),
SECT17 = c("AADD", "AAEE"))
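In current dplyr (1.0+), mutate_all() and mutate_at() are superseded by across(); a hedged equivalent of the two calls above:
# all columns
df <- df %>% mutate(across(everything(), ~ stringr::str_replace(.x, "^AA.+", "A")))
# first column only
df <- df %>% mutate(across(1, ~ stringr::str_replace(.x, "^AA.+", "A")))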

R: show ALL rows with duplicated elements in a column [duplicate]

Does a function like this exist in any package?
isdup <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)
My intention is to use it with dplyr to display all rows with duplicated values in a given column. I need the first occurrence of the duplicated element to be shown as well.
In this data.frame for instance
dat <- as.data.frame(list(l = c("A", "A", "B", "C"), n = 1:4))
> dat
l n
1 A 1
2 A 2
3 B 3
4 C 4
I would like to display the rows where column l is duplicated, i.e. those with the value A, by doing:
library(dplyr)
dat %>% filter(isdup(l))
returns
l n
1 A 1
2 A 2
dat %>% group_by(l) %>% filter(n() > 1)
I don't know if it exists in any package, but since you can implement it easily, I'd say just go ahead and implement it yourself.
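A related dplyr idiom that avoids both the helper function and a grouped result is add_count(); naming the count column explicitly avoids a clash with the existing n column (a sketch, assuming a dplyr version where the name argument exists):
dat %>% add_count(l, name = "dup_n") %>% filter(dup_n > 1) %>% select(-dup_n)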

finding similar element between two data

I asked a question before which was complicated, and I did not get any help. So I tried to simplify the question and the input/output.
I have tried many ways but none worked. For example, here are some of them:
# 1
for (i in ncol(mydata)) {
  corsA = grep(colnames(mydata)[i], colnames(mysecond))
  mydata[, corsA] %in% mysecond[, i]
}
# here if I get TRUE it means they match
## 2
are.cols.identical <- function(col1, col2) identical(mydata[, col1], mysecond[, col2])
res <- outer(colnames(mydata), colnames(mysecond), FUN = Vectorize(are.cols.identical))
cut <- apply(res, 1, function(x) match(TRUE, x))
### 3
(mydata$Rad) %in% (mysecond$Ro5_P1_A5)
#### 4
which(mydata %in% mysecond)
#### 5
match(mydata$sus., mysecond$R5_P1_A5)
or
which(mydata$sus. %in% mysecond$RP1_A5)
matches <- sapply(mydata,function(x) sapply(mysecond,identical,x))
and a few others, but none led me to an answer.
Here is another solution using regex:
rows <- mapply(grep, mysecond, mydata)
The step above will return a list with the matched rows in each column:
rows
If you would like to see how many rows were matched you can do this:
lapply(rows,length)
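A base R shorthand for the same check (available since R 3.2.0): lengths() returns the element-wise lengths directly as an integer vector.
lengths(rows)  # equivalent to unlist(lapply(rows, length))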
Now we can go ahead and get the rows of interest in mydata. But rows is a list, so we need to unlist(); we might also have duplicate rows, and we don't want them to appear twice in the output, so we use the unique() function:
rows <- unique(unlist(rows))
mydata[rows,]
#View(mydata[rows,])
require(plyr)
dat <- strsplit(as.character(mydata$subunits..UniProt.IDs.), ',')
dat <- data.frame(mydata[, 1],
                  rbind.fill(lapply(dat, function(y) as.data.frame(t(y), stringsAsFactors = FALSE))))
mydata[unlist(apply(dat, 2, function(x) which(x %in% mysecond[, 2]))), ]

Using dplyr, Remove all strings from a data frame

I have a data frame with 300 columns which has a string variable somewhere that I am trying to remove. I found the solution below on Stack Overflow, using lapply, which does what I want, but I'd like to do it with the dplyr package. I have tried the mutate_each function but can't seem to make it work.
"If your data frame (df) is really all integers except for NAs and garbage then then the following converts it.
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
You'll have a warning about NAs introduced by coercion but that's just all those non numeric character strings turning into NAs.
dplyr 0.5 now includes a select_if() function.
For example:
person <- c("jim", "john", "harry")
df <- data.frame(matrix(c(1:9,NA,11,12), nrow=3), person)
library(dplyr)
df %>% select_if(is.numeric)
# X1 X2 X3 X4
#1 1 4 7 NA
#2 2 5 8 11
#3 3 6 9 12
Of course you could add further conditions if necessary.
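For instance, select_if() also accepts a purrr-style lambda, so you could keep only numeric columns that are complete (a hedged sketch):
df %>% select_if(~ is.numeric(.x) && !anyNA(.x))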
If you want to use this line of code:
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
with dplyr (by which I assume you mean "using pipes") the easiest would be
df2 = df %>%
  lapply(function(x) as.numeric(as.character(x))) %>%
  as.data.frame
To "translate" this into the mutate_each idiom:
mutate_each(df, funs(as.numeric(as.character(.))))
This function will, of course, convert all columns to character, then to numeric. To improve efficiency, don't bother doing two conversions on columns that are already numeric:
mutate_each(df, funs({
  if (is.numeric(.)) return(.)
  as.numeric(as.character(.))
}))
Data for testing:
df = data.frame(v1 = 1:10, v2 = factor(11:20))
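In current dplyr, mutate_each() and funs() are deprecated; a hedged equivalent of the conversion above using across() (dplyr 1.0+):
df2 <- df %>% mutate(across(everything(), ~ as.numeric(as.character(.x))))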
mutate_all works here; simply wrap the gsub in a function. (I also assume you aren't necessarily string hunting so much as trawling for non-integers.)
StrScrub <- function(x) {
  as.integer(gsub("^\\D+$", NA, x))
}
ScrubbedDF <- mutate_all(data, funs(StrScrub))
Example dataframe:
library(dplyr)
options(stringsAsFactors = F)
data = data.frame("A" = c(2:5),"B" = c(5,"gr",3:2), "C" = c("h", 9, "j", "1"))
with reference/help from Tony Ladson
