Count of comma-separated values in R

I have a column named subcat_id in which the values are stored as comma separated lists. I need to count the number of values and store the counts in a new column. The lists also have Null values that I want to get rid of.
I would like to store the counts in the n column.

We can try
nchar(gsub('[^,]+', '', gsub(',(?=,)|(^,|,$)', '',
gsub('(Null){1,}', '', df1$subcat_id), perl=TRUE)))+1L
#[1] 6 4
Or
library(stringr)
str_count(df1$subcat_id, '[0-9.]+')
#[1] 6 4
data
df1 <- data.frame(subcat_id = c('1,2,3,15,16,78',
'1,2,3,15,Null,Null'), stringsAsFactors=FALSE)

You can do
sapply(strsplit(subcat_id,","),FUN=function(x){length(x[x!="Null"])})
strsplit(subcat_id,",") will return a list of each item in subcat_id split on commas. sapply will apply the specified function to each item in this list and return us a vector of the results.
Finally, the function that we apply will take just the non-null entries in each list item and count the resulting sublist.
For example, if we have
subcat_id <- c("1,2,3","23,Null,4")
Then running the above code returns c(3,2) (the second item has only two non-Null entries), which you can assign to your column.
If running this from a dataframe, it is possible that the character column has been interpreted as a factor, in which case the error non-character argument will be thrown. To fix this, we need to force interpretation as a character vector with the as.character function, changing the command to
sapply(strsplit(as.character(frame$subcat_id),","),FUN=function(x){length(x[x!="Null"])})
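The steps above can be put together on the sample data frame from the first answer. This is only a sketch: the helper name count_non_null is made up for illustration, and it assumes the missing entries are always the literal string "Null".

```r
# Sample data from the first answer
df1 <- data.frame(subcat_id = c("1,2,3,15,16,78",
                                "1,2,3,15,Null,Null"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: count entries that are neither "Null" nor empty
count_non_null <- function(x) {
  # as.character() guards against the column being a factor
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  vapply(parts, function(p) sum(p != "Null" & nzchar(p)), integer(1))
}

# Store the counts in the new column n
df1$n <- count_non_null(df1$subcat_id)
df1$n
```

vapply is used instead of sapply here so that the result is always an integer vector, even for an empty input.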

Related

What is happening during assignment to a dataframe by lapply

Given a dataframe df and a function f which is applied to df:
df[] <- lapply(df, f)
What is the magic R is performing to replace columns in df with collection of vectors in the list from lapply? I see that the result from lapply is a list of vectors having the same names as the dataframe df. I assume some magic mapping is being done to map the vectors to df[], which is the collection of columns in df (methinks). Just works? Trying to better understand so that I remember what to use the next time.
A data.frame is merely a list of vectors having the same length. You can see it using is.list(a_data_frame). It will return TRUE.
[] can have a different meaning or action depending on the object it is applied to. It can even be redefined, since it is in fact a function.
[] lets you subset or replace the columns of a data.frame.
df[1] returns the first column (as a one-column data.frame)
df[1] <- 2 replaces the first column with 2 (repeated in order to have the same length as the other columns)
df[] returns the whole data.frame
df[] <- list(c1,c2,c3) sets the content of the data.frame, replacing its current content
There are also many other ways to access or set data in a data.frame (by column name, by subset of rows, of columns, ...)
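A small sketch of the difference between assigning to df[] and assigning to df may help. The data frame and the function f below are made up for illustration.

```r
# Toy data frame and a hypothetical f that doubles each column
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))
f <- function(x) x * 2

# lapply() on a data.frame returns a plain named list, one element per column
as_list <- lapply(df, f)

# Assigning into df[] replaces the columns but keeps the data.frame itself
replaced <- df
replaced[] <- lapply(replaced, f)

# Assigning to the bare name replaces the whole object with the list
overwritten <- df
overwritten <- lapply(overwritten, f)

class(replaced)     # "data.frame"
class(overwritten)  # "list"
```

The empty brackets select every column, so each list element from lapply is written into the matching column while the class and row names of the data.frame stay in place.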

Exclude one single column from sapply

I have a dataframe with multiple columns that I want to group according to their names. When several columns names respond to the same pattern, I want them grouped in a single column and that column is the sum of the group.
colnames(dataframe)
[1] "Départements" "01...3" "01...4" "01...5" "02...6" "02...7" "02...8" "02...9" "02...10" "03...11"
[11] "03...12" "03...13" "04...14" "04...15" "05...16" "05...17" "05...18" "06...19" "06...20" "06...21"
So I use this bit of code, which works just fine when every column is numeric, but the first one is character and therefore I hit an error. How can I exclude the first column from the code?
#Group columns by pattern: look for each pattern and loop through
patterns <- unique(substr(names(dataframe_2012), 1, 3)) #store patterns in a vector
dataframe <- sapply(patterns, function(xx) rowSums(dataframe[,grep(xx, names(dataframe)), drop=FALSE]))
#loop through
This is the error code I get
Error in rowSums(DEPTpolicedata_2012[, grep(xx, names(DEPTpolicedata_2012)), :
'x' must be numeric
You can simply remove the first column from the data frame before computing the patterns, using
dataframe$Départements <- NULL
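If the character column should be kept, one option is to set it aside, sum the numeric columns by prefix, and bind it back afterwards. The sketch below uses a small made-up data frame (the column name dept and the values are assumptions) and matches prefixes literally with startsWith, so the "." in a prefix is not treated as a regex wildcard.

```r
# Made-up data in the same shape: one character column, numeric columns
# whose names share a 3-character prefix
dataframe <- data.frame(
  dept     = c("Ain", "Aisne"),
  `01...2` = c(1, 2),
  `01...3` = c(3, 4),
  `02...4` = c(5, 6),
  check.names = FALSE
)

# Work only on the numeric columns (everything except the first)
numeric_part <- dataframe[-1]

# Unique prefixes of the numeric column names
patterns <- unique(substr(names(numeric_part), 1, 3))

# One summed column per prefix
sums <- sapply(patterns, function(p) {
  rowSums(numeric_part[startsWith(names(numeric_part), p)])
})

# Put the character column back in front of the sums
result <- data.frame(dataframe[1], sums, check.names = FALSE)
result
```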

R: Converting a column of IDs toString is grouping them into comma separated values instead of turning the field type to string

I have data (1000 obs. of 9 variables) that has a field called Employee containing 5 number values for each record. When I run the below code to convert the chr type to string, the ID field groups all of my IDs in the dataset into comma separated values and puts them in each row, with the data still having 1000 obs. and 9 variables. I want to group my data by Employee, which is why I am converting it toString.
Data$Employee <- toString(Data$Employee)
Column before converting toString, when data type is character
Column after converting toString with above code
The wording of your question is a little confusing, but I think you are trying to convert a numeric vector into a character vector and are running into trouble. If that is correct, I have an answer. Otherwise, feel free to stop reading here!
The function toString() in R collapses a vector into a single character string, with the elements separated by commas. For example:
toString(c(1, 2, 3))
gives "1, 2, 3".
If, however, you want to turn a numeric vector into a character vector in R, you want to use the function as.character(), which converts each element separately.
All that to say,
Data$Employee <- as.character(Data$Employee)
should give you what you're looking for!
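The difference is easy to see on a tiny made-up data frame (the values below are invented for illustration):

```r
# Hypothetical three-row data frame with numeric IDs
Data <- data.frame(Employee = c(10234, 20456, 30789))

# toString() collapses the whole vector into ONE string
collapsed <- toString(Data$Employee)
length(collapsed)   # 1

# as.character() converts element by element, keeping one value per row
converted <- as.character(Data$Employee)
length(converted)   # 3

# Assigning the single collapsed string to the column recycles it into every
# row, which is the behaviour described in the question
recycled <- Data
recycled$Employee <- collapsed
```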

'dictionary' list to data.table columns

I am converting output from an API call to a bibliography database, that returns content in RIS form. I would then like to get a data.table object, with a row for each database item, and a column for each field of the RIS output.
I will explain more about RIS later, but I am stuck in the following:
I would like to get a data.table using something like:
PubDB <- as.data.table(list(TY = "txtTY",TI = "txtTI"))
which returns:
PubDB
      TY    TI
1: txtTY txtTI
However, what I have is a string (actually a vector of strings returned from API call: PubStr is one element)
PubStr
## [1] "TY = \"txtTY\",TI = \"txtTI\" "
How can I convert this string to the list needed inside the as.data.table command above?
More specifically, following the first steps of my code, after resp<-GET(url), rawToChar(resp$content) and as.data.table() after some string manipulation, I have a data table with rows for each publication, and one column called PubStr that has the string as above. How to convert this string to many columns, for each row of the data.table. Note: some rows have more or fewer fields.
I am unsure of the RIS format, but if the fields in each string are separated by commas, and within each field the column name is separated from its value by an equals sign, then here is a quick and dirty function that uses base R and data.table:
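library(data.table)  # provides tstrsplit, rbindlist, setnames

RIS_parser_fn <- function(x) {
  # split each string on "," into fields, then each field on "=" into name/value;
  # gsub strips every non-word character (quotes, spaces) from both
  string_parse_list <- lapply(lapply(x,
                                     function(i) tstrsplit(i, ",")),
                              function(j) lapply(tstrsplit(j, "="),
                                                 function(k) t(gsub("\\W", "", k))))
  # stack names over values, use the first row as column names, then bind with fill
  datatable_format <- rbindlist(lapply(lapply(string_parse_list,
                                              function(i) data.table(Reduce("rbind", i))),
                                       function(j) setnames(j, unlist(j[1, ]))[-1]), fill = TRUE)
  return(datatable_format)
}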
The first statement creates a list of lists, each containing two matrices. The outer list has as many elements as the initial vector of strings. Each inner list has exactly two one-row matrices whose number of columns equals the number of fields in that string, as determined by the ',' separator. The first matrix holds the column headers (the text before each '=' sign) and the second holds the values they are equal to. The last gsub simply removes any non-word characters remaining in the matrices. You may need to modify this if you want non-alphanumeric characters to be kept in the values; there were none in your example.
The second statement converts these lists into one data.table object. Reduce rbinds the two matrices of each inner list, and the result is converted to a data.table, so there is now a single list holding one data.table per initial string element. The "j" lapply function sets the column names to the first row and then removes that row from the data.table. The final rbindlist call combines the data.tables, which have varying numbers of columns. Setting fill to TRUE allows them to be combined, with NA assigned to cells that do not have that particular field.
I added a second string element with one more field to test the code:
PubStr<-c("TY = \"txtTY1\",TI = \"txtTI1\"","TY = \"txtTY2\",TI = \"txtTI2\" ,TF = \"txtTF2\"")
RIS_parser_fn(PubStr)
Returns this:
       TY     TI     TF
1: txtTY1 txtTI1   <NA>
2: txtTY2 txtTI2 txtTF2
Hopefully this will help you out and/or stimulate some ideas for more efficient code. Best of luck!
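As one possible simplification, the same key/value parsing can be sketched in base R alone. This is not the function above but an alternative under stated assumptions: the helper name parse_fields is made up, and values are assumed to contain no commas or equals signs. Unlike the gsub("\\W", ...) step, it only strips the surrounding quotes and whitespace, so punctuation inside a value survives.

```r
# Test strings from the answer above
PubStr <- c("TY = \"txtTY1\",TI = \"txtTI1\"",
            "TY = \"txtTY2\",TI = \"txtTI2\" ,TF = \"txtTF2\"")

# Hypothetical helper: turn one string into a named character vector
parse_fields <- function(s) {
  parts <- strsplit(s, ",", fixed = TRUE)[[1]]   # "TY = \"txtTY1\"", ...
  kv    <- strsplit(parts, "=", fixed = TRUE)    # key / value pairs
  keys  <- trimws(vapply(kv, `[`, character(1), 1))
  vals  <- vapply(kv, `[`, character(1), 2)
  vals  <- gsub('^\\s*"|"\\s*$', "", vals)       # drop surrounding quotes
  setNames(vals, keys)
}

rows     <- lapply(PubStr, parse_fields)
all_keys <- unique(unlist(lapply(rows, names)))

# Indexing by name yields NA for fields a row does not have
mat <- do.call(rbind, lapply(rows, function(r) unname(r[all_keys])))
out <- as.data.frame(mat, stringsAsFactors = FALSE)
names(out) <- all_keys
out
```

The result is a plain data.frame; it can be passed to as.data.table() if a data.table is needed.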

Using grepl() to remove values from a dataframe in R

I have a data.frame with 1 column, and a nondescript number of rows.
This column contains strings, and some strings contain a substring, let's say "abcd".
I want to remove any strings from the database that contain that substring. For example, I may have five strings that are "123 abcd", and I want those to be removed.
I am currently using grepl() to try and remove these values, but it is not working. I am trying:
data.frame[!grepl("abcd", dataframe)]
but it returns an empty data frame.
We can use grepl to get a logical vector, negate (!) it, and use that to subset the 'data'
data[!grepl("abcd", data$Col),,drop = FALSE]
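A short example on made-up data (the column name Col follows the answer above; the strings are invented):

```r
# One-column data frame, some rows containing the substring "abcd"
data <- data.frame(Col = c("123 abcd", "xyz", "abcd 456", "hello"),
                   stringsAsFactors = FALSE)

# grepl() must be given the column, not the whole data frame; it returns one
# TRUE/FALSE per row. fixed = TRUE treats "abcd" as a literal string.
keep <- !grepl("abcd", data$Col, fixed = TRUE)

# Row subsetting; drop = FALSE keeps the result a data frame even though
# there is only one column
filtered <- data[keep, , drop = FALSE]
filtered
```

In the attempt from the question, the logical vector sits in the column position (no comma) and grepl is applied to the data frame as a whole, which is likely why an empty data frame came back.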
