I have a 10-row data frame of tweets about potatoes and need to flag them based on the punctuation each tweet contains (question marks or exclamation points). The grep function returns the row numbers where these characters appear:
grep("\\?", potatoes$tweet)
grep("!", potatoes$tweet)
I've tried to create the flag variable question with mutate in dplyr as shown...
potatoes$question <- NA
potatoes <- mutate(potatoes, question = +row_number(grep("\\?", potatoes$tweet)))
Error in mutate_impl(.data, dots) :
Column `question` must be length 10 (the number of rows) or one, not 3
I'm also happy to consider more elegant solutions than conditioning on the output of grep. Any help appreciated!
We can use grepl instead of grep. grep returns the index/position of each match, whereas grepl returns a logical vector in which TRUE marks a matching element and FALSE a non-matching one. That logical vector can be used directly as a flag:
i1 <- grepl("\\?", potatoes$tweet)
and if we need to change to row numbers,
potatoes$question <- i1 * seq_len(nrow(potatoes))
Similarly, grep with row index can be used for assignment
i2 <- grep("\\?", potatoes$tweet)
potatoes$question[i2] <- seq_len(nrow(potatoes))[i2]
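As a minimal sketch of the flag approach for both punctuation marks: the tweets below are a hypothetical stand-in for the asker's data, and the `exclamation` column name is an assumption. With `fixed = TRUE` the pattern is matched literally, so the question mark needs no escaping.

```r
# Hypothetical stand-in for the asker's data; the real tweets are not shown.
potatoes <- data.frame(
  tweet = c("Who likes potatoes?", "Potatoes are great!",
            "Mashed or fried?!", "Just potatoes"),
  stringsAsFactors = FALSE
)

# fixed = TRUE matches the literal character, so "?" needs no escaping.
potatoes$question    <- grepl("?", potatoes$tweet, fixed = TRUE)
potatoes$exclamation <- grepl("!", potatoes$tweet, fixed = TRUE)

# Each flag has one value per row, which is what mutate() also expects:
# mutate(potatoes, question = grepl("?", tweet, fixed = TRUE))
print(potatoes)
```

Because grepl always returns one value per row, the length mismatch reported in the question cannot occur.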
I am trying to subset a data frame df.1 based on two conditions:
observations in Accession variable should contain ;
observations in kinase.or.not should be kinase
Below is the code I used. But it seems that the first condition grep(";", df.1$Accession) is ignored. Why is that? Thanks!
df.2 <- df.1[grep(";", df.1$Accession) & df.1$kinase.or.not == "Kinase",]
We need grepl instead of grep. grep returns numeric position indices, whereas grepl returns a logical vector with one element per row, which can be combined with the second condition using &:
df.1[grepl(";", df.1$Accession) & df.1$kinase.or.not == "Kinase",]
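To see why the first condition appeared to be ignored, here is a small sketch on made-up data (the values in `df.1` below are hypothetical). With grep, `&` treats every non-zero position as TRUE and recycles the short vector, so only the second condition has any effect.

```r
# Hypothetical data shaped like the asker's df.1.
df.1 <- data.frame(
  Accession     = c("P1;P2", "P3", "P4;P5", "P6"),
  kinase.or.not = c("Kinase", "Kinase", "Other", "Kinase"),
  stringsAsFactors = FALSE
)

is_kinase <- df.1$kinase.or.not == "Kinase"

# grep() gives positions (1 and 3); `&` treats every non-zero number as TRUE
# and recycles the short vector, so the first condition filters nothing.
wrong <- grep(";", df.1$Accession) & is_kinase

# grepl() gives one TRUE/FALSE per row, so `&` combines the conditions row by row.
right <- grepl(";", df.1$Accession) & is_kinase

print(wrong)
print(right)
print(df.1[right, ])
```

In this example `wrong` is identical to `is_kinase`, while `right` keeps only the first row.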
I have a dataframe with multiple columns that I want to group according to their names. When several column names match the same pattern, I want them combined into a single column holding the sum of the group.
colnames(dataframe)
[1] "Départements" "01...3" "01...4" "01...5" "02...6" "02...7" "02...8" "02...9" "02...10" "03...11"
[11] "03...12" "03...13" "04...14" "04...15" "05...16" "05...17" "05...18" "06...19" "06...20" "06...21"
So I use this bit of code, which works fine when every column is numeric. The first column is character, though, so I hit an error. How can I exclude the first column from the code?
#Group columns by pattern, look for a pattern and loop through
patterns <- unique(substr(names(dataframe_2012), 1, 3)) #store patterns in a vector
dataframe <- sapply(patterns, function(xx) rowSums(dataframe[,grep(xx, names(dataframe)), drop=FALSE]))
#loop through
This is the error code I get
Error in rowSums(DEPTpolicedata_2012[, grep(xx, names(DEPTpolicedata_2012)), :
'x' must be numeric
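You can simply remove the first column from the data frame (patterns is a plain character vector, so the column has to be dropped from the data frame itself) using
dataframe$Départements <- NULL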
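If the first column should be kept in the result rather than dropped, one possible sketch is to sum only the remaining columns and bind the label column back afterwards. The data below is a hypothetical miniature of the asker's table, and two assumptions differ from the question's code: the prefix is taken as the first two characters, and it is anchored with ^.

```r
# Hypothetical miniature of the asker's data: one character column plus
# numeric columns whose names start with a two-digit prefix.
dataframe <- data.frame(
  dept = c("A", "B"),
  a = c(1, 2), b = c(3, 4), c = c(5, 6),
  stringsAsFactors = FALSE
)
names(dataframe) <- c("Départements", "01...3", "01...4", "02...6")

num_cols <- dataframe[, -1, drop = FALSE]           # everything except the first column
patterns <- unique(substr(names(num_cols), 1, 2))   # "01", "02" (assumed prefix length)

# Anchor with ^ so a prefix only matches at the start of a column name.
sums <- sapply(patterns, function(p) {
  rowSums(num_cols[, grep(paste0("^", p), names(num_cols)), drop = FALSE])
})

# Put the untouched first column back in front of the grouped sums.
result <- data.frame(dataframe[1], sums, check.names = FALSE)
print(result)
```

Indexing with `-1` excludes the first column by position, so the code does not depend on its name.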
I have a simple data frame
d <- data.frame(var1=c(5,5,5),var1_c=c(5,NA,6),var2 =c(6,6,6),var2_c = c(8,6,NA))
with a lot of rows and a lot of variables, all labeled "varXXX" and "varXXX_c". Every time there is an NA in a varXXX_c column, I want to replace it with the value from the matching varXXX column.
In short, I want to do :
d[is.na(d$var1_c),"var1_c"] <- d$var1[is.na(d$var1_c)]
but I am trying to find a better way to do this than copying, pasting, and changing "1" to the number of each variable.
I would prefer a solution in base R or dplyr, but would be grateful for any help!
We can use grepl to build a logical index of the column names that consist of var followed by one or more digits (\\d+), then _ and c. A second logical index marks the names made of var followed by one or more digits (\\d+) up to the end of the string ($). We then subset the columns with each index and replace the NA values (is.na(d[i1])) with the corresponding elements of d[i2].
i1 <- grepl("var\\d+_c", names(d))
i2 <- grepl('var\\d+$', names(d))
d[i1][is.na(d[i1])] <- d[i2][is.na(d[i1])]
NOTE: This is based on the assumption that the columns are in the same order.
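If the column order cannot be relied on, one alternative sketch pairs each varXXX_c column with its varXXX column by name instead of by position. This is a different technique from the index-based answer above; the loop variable names are illustrative only.

```r
d <- data.frame(var1 = c(5, 5, 5), var1_c = c(5, NA, 6),
                var2 = c(6, 6, 6), var2_c = c(8, 6, NA))

# Loop over the "_c" columns and look up each partner column by name.
for (c_name in grep("^var\\d+_c$", names(d), value = TRUE)) {
  base_name <- sub("_c$", "", c_name)        # "var1_c" -> "var1"
  missing   <- is.na(d[[c_name]])            # rows to fill in this column
  d[[c_name]][missing] <- d[[base_name]][missing]
}
print(d)
```

This assumes every varXXX_c column has a matching varXXX column in the data frame.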
I have a column named subcat_id in which the values are stored as comma separated lists. I need to count the number of values and store the counts in a new column. The lists also have Null values that I want to get rid of.
I would like to store the counts in the n column.
We can try
nchar(gsub('[^,]+', '', gsub(',(?=,)|(^,|,$)', '',
gsub('(Null){1,}', '', df1$subcat_id), perl=TRUE)))+1L
#[1] 6 4
Or
library(stringr)
str_count(df1$subcat_id, '[0-9.]+')
#[1] 6 4
data
df1 <- data.frame(subcat_id = c('1,2,3,15,16,78',
'1,2,3,15,Null,Null'), stringsAsFactors=FALSE)
You can do
sapply(strsplit(subcat_id,","),FUN=function(x){length(x[x!="Null"])})
strsplit(subcat_id,",") will return a list of each item in subcat_id split on commas. sapply will apply the specified function to each item in this list and return us a vector of the results.
Finally, the function that we apply will take just the non-null entries in each list item and count the resulting sublist.
For example, if we have
subcat_id <- c("1,2,3","23,Null,4")
Then running the above code returns c(3, 2), since the Null entry in the second item is not counted. You can assign this result to your column.
If running this from a dataframe, it is possible that the character column has been interpreted as a factor, in which case the error non-character argument will be thrown. To fix this, we need to force interpretation as a character vector with the as.character function, changing the command to
sapply(strsplit(as.character(frame$subcat_id),","),FUN=function(x){length(x[x!="Null"])})
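As a sketch of the full round trip, the counts can be written straight into the n column mentioned in the question. The example reuses the df1 sample data from the earlier answer; vapply is swapped in for sapply here only so that the result type is guaranteed to be integer.

```r
df1 <- data.frame(subcat_id = c("1,2,3,15,16,78", "1,2,3,15,Null,Null"),
                  stringsAsFactors = FALSE)

# Split each list on literal commas; as.character() guards against factors.
parts <- strsplit(as.character(df1$subcat_id), ",", fixed = TRUE)

# Count the entries that are not the string "Null" and store them in n.
df1$n <- vapply(parts, function(x) sum(x != "Null"), integer(1))
print(df1)
```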
I am trying to remove duplicated rows by one column (e.g the 1st column) in an R matrix. How can I extract the unique set by one column from a matrix? I've used
x_1 <- x[unique(x[,1]),]
While the size is correct, all of the values are NA. So instead, I tried
x_1 <- x[-duplicated(x[,1]),]
But the dimensions were incorrect.
I think you're confused about how subsetting works in R. unique(x[,1]) will return the set of unique values in the first column. If you then try to subset using those values R thinks you're referring to rows of the matrix. So you're likely getting NAs because the values refer to rows that don't exist in the matrix.
Your other attempt runs afoul of the fact that duplicated returns a boolean vector, not a vector of indices. So putting a minus sign in front of it converts it to a vector of 0's and -1's, which again R interprets as trying to refer to rows.
Try replacing the '-' with a '!' in front of duplicated, which is the boolean negation operator. Something like this:
m <- matrix(runif(100),10,10)
m[c(2,5,9),1] <- 1
m[!duplicated(m[,1]),]
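To make the difference between - and ! concrete, here is a small sketch on a made-up matrix (the values are hypothetical). Negating the logical vector with - yields zeros and minus ones, which R reads as "drop row 1", whereas ! keeps the first occurrence of each value.

```r
# Hypothetical matrix: column 1 holds 1, 2, 1, 3, 2.
x <- matrix(c(1, 2, 1, 3, 2,
              10, 20, 30, 40, 50), ncol = 2)

dup <- duplicated(x[, 1])
print(dup)     # logical: marks the repeats of an earlier value
print(-dup)    # numeric: 0 for FALSE, -1 for TRUE

# Index c(0, 0, -1, 0, -1): zeros are ignored, so only row 1 is removed.
dropped_first <- x[-dup, , drop = FALSE]

# Logical negation keeps the first row for each value in column 1.
unique_rows <- x[!dup, , drop = FALSE]
print(unique_rows)
```

This is why the dimensions in the question looked wrong: the result loses just one row instead of all the duplicates.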
As you need the indices of the unique rows, use duplicated as you tried. The problem was using - instead of !, so try:
x[!duplicated(x[,1]),]