Grouping occurrences of a string into a row - r

tl;dr
Is there a way to group together a large number of values to a single column without truncation of those values?
I am working with a data frame of 48,178 entries in RStudio. The data frame has 2 columns: the first contains unique numeric values, and the second contains repeated strings.
----------
id name
1 forest
2 forest
3 park
4 riverbank
.
.
.
.
.
48178 water
----------
I would like to group together all entries on the basis of the unique entries in the 2nd column. I have used ddply (from the plyr package) to achieve the result. I now have the following derived table:
----------
type V1
forest forest,forest,forest
park park,park,park,park
riverbank riverbank,riverbank,
water water,water,water,water
----------
However, on applying the str function to the derived data frame, I find that the column contains truncated values, and not every instance of each string.
The output of str is:
'data.frame': 4 obs. of 2 variables:
$ type: chr "forest" "park" "riverbank" "water"
$ V1 : chr "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ "park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,pa"| __truncated__ "riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverba"| __truncated__ "water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,w"| __truncated__
How do I group together same strings and push them to a row, without truncation?

Extending the answer of HubertL, the str() function does exactly what it is supposed to but is perhaps the wrong choice for what you intend to do.
From the (rather limited) information you have given in your question, it seems that you have already achieved what you are looking for, i.e., concatenating all strings of the same type.
However, it appears that you are stuck on the output of the str() function.
Please refer to the help page ?str.
From the Description section:
Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’ structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object.
str() has a parameter nchar.max which defaults to 128.
nchar.max maximal number of characters to show for character strings. Longer strings are truncated, see longch example below.
The longch example in the Examples section illustrates the effect of this parameter:
nchar(longch <- paste(rep(letters,100), collapse = ""))
#[1] 2600
str(longch)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvw"| __truncated__
str(longch, nchar.max = 52)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxy"| __truncated__
Maximum length of a character string
According to ?"Memory-limits", the number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9. Given the number of rows in your data frame and the length of name, the concatenated strings won't exceed 0.6*10^6 characters, which is far from the limit.
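One way to confirm this is to measure the strings instead of printing them with str(). The sketch below is only an illustration: it uses simulated data as a stand-in for the asker's data frame (the column names id and name follow the question; the sampled values and the seed are assumptions).

```r
# Simulated stand-in for the asker's data: 48,178 rows, four categories
set.seed(1)
n <- 48178
df <- data.frame(
  id = seq_len(n),
  name = sample(c("forest", "park", "riverbank", "water"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# One comma-separated string per category, as a named character vector
combined <- vapply(split(df$name, df$name), paste, character(1), collapse = ",")

# Full length of each string in characters (far more than str() prints)
print(nchar(combined))

# Number of items recovered from each string; the total equals nrow(df)
items <- lengths(strsplit(combined, ","))
print(items)
print(sum(items) == nrow(df))

# str() only shortens what it displays; nchar.max controls how much is shown
str(combined[["forest"]], nchar.max = 40)
```

If the item counts add up to the number of rows, nothing was lost in the concatenation; only the display was abbreviated.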

Try storing the results in a list using the base R split() function (the grouping column of the original data frame is called name):
new.list <- split(df, f = df$name)
This will split the data frame into a list of data frames, one per unique value, each of which can be accessed using double square brackets, for example new.list[["forest"]]. The character strings are never combined, so nothing is truncated: the records continue to be preserved in separate cells.
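A minimal sketch of this approach, assuming a small made-up data frame with the id and name columns from the question (the six rows are invented for illustration):

```r
# Tiny hypothetical sample with the same columns as in the question
df <- data.frame(
  id = 1:6,
  name = c("forest", "forest", "park", "riverbank", "water", "park"),
  stringsAsFactors = FALSE
)

# One data frame per unique value of name
groups <- split(df, f = df$name)

print(names(groups))         # the group labels
print(groups[["forest"]])    # all rows whose name is "forest"
print(sapply(groups, nrow))  # number of rows per group
```

Because each group keeps its rows as separate cells, the per-group counts come from nrow() instead of from parsing a concatenated string.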

Your strings are not really truncated; only their display by str() is truncated:
size <- 48000
df <- data.frame(1:size,
type=sample(c("forest", "park", "riverbank", "water" ),
size, replace = TRUE),
stringsAsFactors = FALSE)
res <- by(df$type , df$type, paste, collapse=",")
str(res)
'by' chr [1:4(1d)] "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ ...
- attr(*, "dimnames")=List of 1
..$ df$type: chr [1:4] "forest" "park" "riverbank" "water"
- attr(*, "call")= language by.default(data = df$type, INDICES = df$type, FUN = paste, collapse = ",")
lengths( strsplit(res, ','))
forest park riverbank water
11993 12017 11953 12037
sum(lengths( strsplit(res, ',')))
[1] 48000

If all you want is a count of occurrences, then why not simply use table()?
df <- read.table(header = TRUE, text = "id name
1 forest
2 forest
3 park
4 riverbank")
df
df1<- as.data.frame(table(df$name))
#will give you number of times the word occurs
# if for some reason you want a repetition,then
x<- mapply(rep,df1$Var1,df1$Freq)
y<- sapply(x,paste, collapse=",")
data.frame(type=df1$Var1, V1=y)

Related

R Delete all rows in dataframe based on index

I have a list of indices that I know I want to remove from my data frame.
Normally I can do this easily by just writing out the names, but I don't understand why the following command only works when I want to keep the rows I am trying to delete:
str(data)
'data.frame': 180 obs. of 624 variables:
$ Sites : chr "SS0501_1" "SS0570_1" "SS0609_1" "SS0645_1" ...
$ LandUse : chr "Urban" "Urban" "Urban" "Urban" ...
.
.
.
f_pattern <- "SS2371|SS1973|SS1908|SS1815|SS1385|SS1304" # find index names in data frame using partial site names
get_full_id <- data[grep(f_pattern, rownames(data)),] # get the full site names (these are indices in the data frame)
data <- data[!get_full_id$Sites,] # DOES NOT WORK
Error in !check$Sites : invalid argument type
However, it does work if I pull these sites out.
data <- data[get_full_id$Sites,] # Works fine, I get a dataframe with 6 rows...the ones I don't want to keep.
str(data)
'data.frame': 6 obs. of 624 variables:
$ Sites : chr "SS1908_1" "SS1973_1" "SS1304_2" "SS1385_2" ...
$ LandUse : chr "Urban" "Rural" "Rural" "Urban" ...
.
.
I don't understand why the reverse with "!" won't work at all?
If the dataset has row names, ! cannot be applied to 'Sites' directly, because negation works on a logical vector and get_full_id$Sites is a character vector (hence the "invalid argument type" error). A unary - does not help either, as it only negates numeric positions. If we want to use !, create a logical vector that flags the rows whose names appear in the 'Sites' column
data[!row.names(data) %in% get_full_id$Sites,]
This works only if there is an exact match between the row names and 'Sites' (not clear, as the row names are not shown)
Also, this can be done directly
data[-grep(f_pattern, rownames(data)),]
Or use invert = TRUE
data[grep(f_pattern, rownames(data), invert = TRUE),]
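One difference between these variants shows up when the pattern matches nothing: -grep(...) then indexes with an empty vector and returns zero rows, whereas invert = TRUE or a logical vector from grepl() keeps every row. A small sketch, using an invented four-row data frame (the site names and the SS9999 pattern are assumptions for illustration):

```r
# Hypothetical data whose row names are the site names
data <- data.frame(
  Sites = c("SS0501_1", "SS1908_1", "SS1973_1", "SS0609_1"),
  LandUse = c("Urban", "Urban", "Rural", "Urban"),
  stringsAsFactors = FALSE
)
rownames(data) <- data$Sites

f_pattern <- "SS1908|SS1973"

# Logical negation with grepl(): drops the two matching rows
kept <- data[!grepl(f_pattern, rownames(data)), ]
print(kept)

# A pattern that matches no row name
no_match <- "SS9999"

# -grep() yields an empty index here, so no rows are returned
rows_minus_grep <- nrow(data[-grep(no_match, rownames(data)), ])

# grepl() and invert = TRUE both keep all rows
rows_grepl  <- nrow(data[!grepl(no_match, rownames(data)), ])
rows_invert <- nrow(data[grep(no_match, rownames(data), invert = TRUE), ])

print(c(rows_minus_grep, rows_grepl, rows_invert))
```

For that reason the grepl() or invert = TRUE forms are the safer choice whenever the pattern might not match.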

Extract all values from a vector of named numerics with the same name in R

I'm trying to handle a vector of named numerics for the first time in R. The vector itself is named p.values. It consists of p-values which are named after their corresponding variables. Through simulation I obtained a huge number of p-values that are always named after one of the five variables they correspond to. However, I'm interested in the p-values of only one variable and tried to extract them with p.values[["var_a"]], but that gives me only a single p-value for var_a. p.values$var_a is invalid, and as.numeric(p.values) or unname(p.values) obviously gives me all values without names. Any idea how I can get R to give me the 1/5 of named numerics that are named var_a?
Short example:
p.values <- as.numeric(c(rep(1:5, each = 5)))
names(p.values) <- rep(letters[1:5], 5)
str(p.values)
Named num [1:25] 1 1 1 1 1 2 2 2 2 2 ...
- attr(*, "names")= chr [1:25] "a" "b" "c" "d" ...
I'd like to get R to show me all 5 numbers named "a".
Thanks for reading my first post here and I hope some more experienced R users know how to deal with named numerics and can help me with this issue.
You can subset p.values using [ with names(p.values) == "a" to show all values named a.
p.values[names(p.values) == "a"]
#a a a a a
#1 2 3 4 5
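If the values for every name are needed at once, one possible variation is to split the vector by its names. This is only a sketch built on the short example above; the by_name object is a name introduced here for illustration.

```r
# Same short example as in the question
p.values <- as.numeric(rep(1:5, each = 5))
names(p.values) <- rep(letters[1:5], 5)

# Logical subsetting keeps every element named "a"
only_a <- p.values[names(p.values) == "a"]
print(only_a)

# [[ ]] with a name returns a single element, not all matches
single <- p.values[["a"]]
print(length(single))

# split() groups all values by name in one step
by_name <- split(unname(p.values), names(p.values))
print(by_name[["a"]])
```

The list returned by split() has one element per variable, so by_name[["a"]] holds the same five values as the logical subset.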

writing first row (column headers) to a vector

I have a matrix I have coerced from a realRatingMatrix in recommenderlab package in R. The data contains predictions of ratings between 0-1 for a number of products.
The matrix should contain customer numbers along the rows (row 2 down) so that column 1 header is row label, and product IDs along the columns in the first row from column 2 onwards. The problem I have is when I coerce to a matrix the data structure becomes messy:
EDIT: Link to Github repository www.github.com/APBuchanan/recommenderlab-model
str(wsratings)
num [1:43, 1:319] 0.192 0.44 0.262 0.161 0.239 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:319] "X011211" "X014227" "X014229" "X014235" ...
The first cell wsratings[1,1] should be labelled as "CustomerNumber" and the remainder of the columns in row 1 should contain the data that is currently held in the above $:chr, but should display as separate variables in the matrix.
From the code below you will see that I've been trying to go about this by inserting the data into two vectors, that I can then call in the dimnames function, but I'm getting something wrong:
setwd("location to pull in data")
#look at using XLConnect package to link straight to excel workbook
library(recommenderlab)
library(xlsx)
library(tidyr)
library(Matrix)
#library(stringer)
data=read.csv("WS1 & WS2 V3.csv",header=TRUE,row.names=1)
#remove rows where number of purchases is <10
df=data[rowSums(data[-1])>=10,]
df<-as.matrix(df)
data.matrix=as(df,"binaryRatingMatrix")
#image(data.matrix)
model=Recommender(data.matrix,method="UBCF")
predictions<-predict(model,data.matrix,n=5)
set.seed(100)
evaluation<-evaluationScheme(data.matrix,method="split",train=0.5,given=5)
Rec.ubcf <- Recommender(getData(evaluation, "train"), "UBCF")
predict.ubcf<-predict(Rec.ubcf,getData(evaluation,"known"),type="topNList")
pred.ubcfratings<-predict(Rec.ubcf,getData(evaluation,"known"),type="ratings")
error.ubcf<-calcPredictionAccuracy(predict.ubcf,getData(evaluation,"unknown"),given=5)
setwd("Location to output data from model")
wsratings<-as(pred.ubcfratings,"matrix")
ratingrows<-c(evaluation@runsTrain)
Where I've called colnames2<-c(wsratings[1,2:ncol(wsratings)]), I am expecting the data from column 2 to the last column in row 1 to be read into the vector. But when I print the results, it includes rating information as well, which is not what I'm after.
ratingrows<-c(evaluation@runsTrain) contains the customer numbers that I want to insert below the row label "CustomerNumber".
I'm guessing there's a way of sorting this out with tidyr package, but not so familiar with it. If anyone can provide some advice on how I can clean this all up, I'd be very grateful.
So with the data you gave, I whipped up a solution here.
You said "I need to extract the customer numbers from the test split of data and drop that into the first column of the matrix - that's my main issue". The way to extract that is either: colnames(wsratings) or dimnames(wsratings)[[2]].
Once you have this vector (length of 320), you want to "drop that to the first column". You're asking for a cbind(), but the data you want to bind it to contains 43 rows. You can't bind them together because the lengths of the two objects are neither equal nor multiples of each other.
Assuming you have the full dataset and their length matches, then the code would be:
customerid <-c("CustomerName", evaluation@runsTrain[[1]])
wsratings <- cbind(customerid, wsratings)
This is what I gathered you want, and it yields me the following:
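As a separate sketch of the labelling step (not the output referred to above): a matrix holds a single type, so binding character IDs onto numeric ratings with cbind() turns the ratings into strings. Storing the IDs as row names, or moving to a data frame, keeps the ratings numeric. Everything here is hypothetical: the 2 x 3 matrix, the customer IDs, and the object names are invented for illustration, and only the column labels are borrowed from the str() output in the question.

```r
# Hypothetical 2 x 3 ratings matrix with product IDs as column names
set.seed(100)
ratings <- matrix(runif(6), nrow = 2,
                  dimnames = list(NULL, c("X011211", "X014227", "X014229")))

# Hypothetical customer IDs, one per row
customer_ids <- c("C001", "C002")

# cbind() with a character vector coerces the whole matrix to character
bound <- cbind(customer_ids, ratings)
print(typeof(bound))

# Option 1: keep the matrix numeric and store the IDs as row names
rownames(ratings) <- customer_ids
print(ratings)

# Option 2: a data frame with a CustomerNumber column and numeric ratings
ratings_df <- data.frame(CustomerNumber = customer_ids, ratings,
                         check.names = FALSE)
str(ratings_df)
```

Either option requires exactly one ID per row of the matrix, which is the length condition discussed in the answer.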

Build column of data frame with character vectors of different length?

I want to create a data frame in R.
To make an easy 2x2 example of my problem:
Assume the first column is a simple vector:
first <- c(1:2)
The second column is for every row a character vector (but of different length), for example:
c('A') for the first row and c('B','C') for the second.
How can I build this data frame?
If you want to store vectors of different sizes in each row of a certain column, you will need to use a list. The problem is that (from ?data.frame)
If a list or data frame or matrix is passed to data.frame it is as if
each component or column had been passed as a separate argument
Thus you will need to wrap it in I() in order to protect your desired structure, e.g.
df <- data.frame(first = 1:2, Second = I(list("A", c("B", "C"))))
str(df)
# 'data.frame': 2 obs. of 2 variables:
# $ first : int 1 2
# $ Second:List of 2
# ..$ : chr "A"
# ..$ : chr "B" "C"
# ..- attr(*, "class")= chr "AsIs"
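For working with such a column afterwards, here is a short sketch. It assumes the list column is added by assignment after the data frame exists, which is an alternative to the I() wrapper shown above and not part of the original answer.

```r
# Create the data frame first, then assign the list as a new column
df <- data.frame(first = 1:2)
df$Second <- list("A", c("B", "C"))

# Each cell is a character vector; [[ ]] extracts one of them
second_row <- df$Second[[2]]
print(second_row)

# Number of elements stored in each row of the list column
cell_sizes <- lengths(df$Second)
print(cell_sizes)
```

Functions that expect atomic columns may not handle a list column, so it is worth checking with is.list(df$Second) before passing the data frame on.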

Reading csv file, having numbers and strings in one column

I am importing a 3 column CSV file. The final column is a series of entries which are either an integer, or a string in quotation marks.
Here are a series of example entries:
1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5
When I import this using read.csv, these are all just turned into factors.
How can I set it up such that these are read as integers and strings?
Thank you!
This is not possible, since a given vector can only have a single mode (e.g. character, numeric, or logical).
However, you could split the vector into two separate vectors, one with numeric values and the second with character values:
vec <- c("m", 20, "Canada", 4, 5)
vnum <- as.numeric(vec)
vchar <- ifelse(is.na(vnum), vec, NA)
vnum
[1] NA 20 NA 4 5
vchar
[1] "m" NA "Canada" NA NA
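The same split can be applied right after import. The sketch below assumes the five example rows from the question, reads the mixed column as character, and uses a regular expression to decide which entries are integers; the names csv_text, V3_int and V3_chr are introduced here for illustration.

```r
# The example rows from the question as inline CSV text
csv_text <- '1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5'

# Keep strings as character instead of converting them to factors
dat <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# Flag entries that consist only of digits (optionally with a leading minus)
is_num <- grepl("^-?[0-9]+$", dat$V3)

# Integer entries go to one column, everything else to another
dat$V3_int <- as.integer(ifelse(is_num, dat$V3, NA))
dat$V3_chr <- ifelse(is_num, NA_character_, dat$V3)

str(dat)
```

Testing with a pattern before converting avoids the coercion warning that as.numeric() gives on non-numeric strings.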
EDIT Despite the OP's decision to accept this answer, @Andrie's answer is the preferred solution. My answer is meant only to inform about some odd features of data frames.
As others have pointed out, the short answer is that this isn't possible. data.frames are intended to contain columns of a single atomic type. @Andrie's suggestion is a good one, but just for kicks I thought I'd point out a way to shoehorn this type of data into a data.frame.
You can convert the offending column to a list (this code assumes you've set options(stringsAsFactors = FALSE)):
dat <- read.table(textConnection("1,4,'m'
1,5,20
1,6,'Canada'
1,7,4
1,8,5"),header = FALSE,sep = ",")
tmp <- as.list(as.numeric(dat$V3))
tmp[c(1,3)] <- dat$V3[c(1,3)]
dat$V3 <- tmp
str(dat)
'data.frame': 5 obs. of 3 variables:
$ V1: int 1 1 1 1 1
$ V2: int 4 5 6 7 8
$ V3:List of 5
..$ : chr "m"
..$ : num 20
..$ : chr "Canada"
..$ : num 4
..$ : num 5
Now, there are all sorts of reasons why this is a bad idea. For one, lots of code that you'd expect to play nicely with data.frames will not like this and either fail, or behave very strangely. But I thought I'd point it out as a curiosity.
No. A data frame is a series of vectors pasted together (a list of vectors or matrices). Because each column is a vector, it cannot be classified as both integer and factor. It must be one or the other. You could split the vector apart into numeric and factor (a column for each), but I don't believe this is what you want.
