tl;dr
Is there a way to group together a large number of values to a single column without truncation of those values?
I am working in RStudio on a data frame with 48,178 entries. The data frame has two columns: the first contains unique numeric values, and the second contains repeated strings.
----------
id name
1 forest
2 forest
3 park
4 riverbank
.
.
.
.
.
48178 water
----------
I would like to group together all entries on the basis of the unique entries in the 2nd column. I have used ddply (from the plyr package) to achieve the result. I now have the following derived table:
----------
type V1
forest forest,forest,forest
park park,park,park,park
riverbank riverbank,riverbank,
water water,water,water,water
----------
However, on applying the str function to the derived data frame, I find that the column contains truncated values, and not every instance of each string.
The output of str is:
'data.frame': 4 obs. of 2 variables:
$ type: chr "forest" "park" "riverbank" "water"
$ V1 : chr "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ "park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,pa"| __truncated__ "riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverba"| __truncated__ "water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,w"| __truncated__
How do I group together same strings and push them to a row, without truncation?
Extending the answer of HubertL: the str() function does exactly what it is supposed to do, but it is perhaps the wrong choice for what you intend to do.
From the (rather limited) information you have given in your question, it seems that you have already achieved what you are looking for, i.e., concatenating all strings of the same type.
However, it appears that you are stuck on the output of the str() function.
Please refer to the help page ?str.
From the Description section:
Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’ structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object.
str() has a parameter nchar.max which defaults to 128.
nchar.max maximal number of characters to show for character strings. Longer strings are truncated, see longch example below.
The longch example in the Examples section illustrates the effect of this parameter:
nchar(longch <- paste(rep(letters,100), collapse = ""))
#[1] 2600
str(longch)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvw"| __truncated__
str(longch, nchar.max = 52)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxy"| __truncated__
Maximum length of a character string
According to ?"Memory-limits", the number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9. Given the number of rows in your data frame and the length of name, the concatenated strings won't exceed 0.6*10^6 characters, which is far from the limit.
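To check that the stored strings are complete, one option is to compare nchar() of each concatenated string with the length it should have. The following is only a sketch: the data frame is a small simulated stand-in for yours, and tapply() is used in place of ddply purely for illustration.

```r
# Illustrative data: a smaller stand-in for the 48,178-row data frame
set.seed(1)
df <- data.frame(
  id = 1:1000,
  name = sample(c("forest", "park", "riverbank", "water"), 1000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Concatenate all strings of the same name (one long string per group)
res <- tapply(df$name, df$name, paste, collapse = ",")

# Length each string should have: every name plus the separating commas
counts <- table(df$name)
expected <- as.vector(counts) * nchar(names(counts)) + as.vector(counts) - 1

# nchar() reports the full stored length, regardless of how str() displays it
actual <- as.vector(nchar(res))
all(actual == expected)

# Raise nchar.max to make str() display more of each string
str(res, nchar.max = 300)
```

If the comparison holds, nothing was lost in the data itself; only the str() display is shortened, and nchar.max controls how much of each string is shown.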
Try storing the results in a list using the base R split() function:
new.list <- split(df, f=df$type)
This will split the data frame into multiple data frames that can be accessed using square brackets. It keeps the character strings from being combined and truncated as the records continue to be preserved in separate cells.
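As a rough sketch of how the resulting list can be used (assuming the grouping column is called name, as in the question's sample, and using a six-row toy data frame):

```r
# Toy data frame shaped like the sample in the question
df <- data.frame(
  id = 1:6,
  name = c("forest", "forest", "park", "riverbank", "water", "water"),
  stringsAsFactors = FALSE
)

# One data frame per distinct name
groups <- split(df, f = df$name)

names(groups)        # the group labels
groups[["forest"]]   # all rows whose name is "forest"
sapply(groups, nrow) # number of rows in each group
```

Each element keeps the original rows intact, so the ids belonging to a group remain available alongside the names.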
Your strings are not really truncated; only their display by str is truncated:
size <- 48000
df <- data.frame(1:size,
type=sample(c("forest", "park", "riverbank", "water" ),
size, replace = TRUE),
stringsAsFactors = FALSE)
res <- by(df$type , df$type, paste, collapse=",")
str(res)
'by' chr [1:4(1d)] "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ ...
- attr(*, "dimnames")=List of 1
..$ df$type: chr [1:4] "forest" "park" "riverbank" "water"
- attr(*, "call")= language by.default(data = df$type, INDICES = df$type, FUN = paste, collapse = ",")
lengths( strsplit(res, ','))
forest park riverbank water
11993 12017 11953 12037
sum(lengths( strsplit(res, ',')))
[1] 48000
If all you want is a count of occurrences, then why not simply use table()?
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "id name
1 forest
2 forest
3 park
4 riverbank")
df
# table() gives the number of times each word occurs
df1 <- as.data.frame(table(df$name), stringsAsFactors = FALSE)
# if for some reason you want the repetition, then
x <- mapply(rep, df1$Var1, df1$Freq, SIMPLIFY = FALSE)
y <- sapply(x, paste, collapse = ",")
data.frame(type = df1$Var1, V1 = y)
I have a data frame with a "Calculation" column, which could be reproduced by the following code:
a <- data.frame(Id = c(1:3), Calculation = c('[489]/100','[4771]+[4777]+[5127]+[5357]+[5597]+[1044])/[463]','[1044]/[463]'))
> str(a)
'data.frame': 3 obs. of 2 variables:
$ Id : int 1 2 3
$ Calculation: Factor w/ 3 levels "[1044]/[463]",..: 3 2 1
Please note that there are two types of numbers in the "Calculation" column: most of them are surrounded by brackets, but some (in this case the number 100) are not (this has a meaning in my application).
What I would like to do is to extract all the distinct numbers that appear in Calculation column to return a vector with the union of these numbers. Ideally, I would like to be able to distinguish between the numbers that are between brackets and the numbers that are not. This step is not so important (if it makes it complicated) since the numbers that are NOT between the brackets are few and I can manually detect them. So the desired output in this case would be:
b = c(489,4771,4777,5127,5357,5597,1044,463)
Thanks in advance
We can use str_extract_all from library(stringr). Using a regex lookbehind ((?<=\\[)), we match the numbers (\\d+) that are preceded by [, extract them into a list, unlist to convert it to a vector, change the character values to numeric (as.numeric), and get the unique elements.
library(stringr)
unique(as.numeric(unlist(str_extract_all(a$Calculation, '(?<=\\[)\\d+'))))
#[1] 489 4771 4777 5127 5357 5597 1044 463
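To also separate the numbers that are not in brackets, one possible approach is a second pattern with a negative lookbehind and lookahead. This is a base R sketch using regmatches() and gregexpr() with perl = TRUE; it assumes the column has first been converted to character (for example with as.character(a$Calculation)), and the variable names calc, bracketed and bare are only illustrative.

```r
# Character version of the Calculation column from the question
calc <- c("[489]/100",
          "[4771]+[4777]+[5127]+[5357]+[5597]+[1044])/[463]",
          "[1044]/[463]")

# Helper: all matches of a Perl-style pattern, as unique numerics
extract_numbers <- function(x, pattern) {
  unique(as.numeric(unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))))
}

# Numbers immediately preceded by "["
bracketed <- extract_numbers(calc, "(?<=\\[)\\d+")

# Numbers neither preceded by "[" or a digit, nor followed by "]" or a digit
bare <- extract_numbers(calc, "(?<![\\[\\d])\\d+(?![\\]\\d])")

bracketed
bare
```

The digit checks in the second pattern stop it from matching a fragment of a bracketed number, such as the "89" inside "[489]".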
I have a function in R that takes several parameters, each of which can be given either as a numeric (1) or as NULL. The default is NULL.
I want to apply the function using all possible combinations of parameters, so I used expand.grid to try and create a dataframe which stores these. However, I am running into problems with creating an object that contains both numerics and characters in one column.
This is what I've tried:
comb<-expand.grid(c("NULL",1),c("NULL",1),stringsAsFactors=FALSE), which returns:
comb
Var1 Var2
1 NULL NULL
2 1 NULL
3 NULL 1
4 1 1
with all entries characters:
class(comb[1,1])
[1] "character"
If I now try and insert a numeric into a specific spot, I still receive a character:
comb[2,1]<-as.numeric(1)
class(comb[2,1])
[1] "character"
I've also tried it using stringsAsFactors=TRUE, or using expand.grid(c(0,1),c(0,1)) and then switching out the 0 for NULL but always have the exact same problem: whenever I do this, I do not get a numeric 1.
Manually creating an object using cbind and then inserting the NULL as a character also does not help. I'd be grateful for a pointer, or a work-around to running the function with all possible combinations of parameters.
As you have been told, generally speaking columns of data frames need to be a single type. It's hard to solve your specific problem, because it is likely that the solution is not really "putting multiple types into a single column" but rather re-organizing your other unseen code to work within this restriction.
As I suggested, it probably will be better to use the built in NA value as expand.grid(c(NA,1),c(NA,1)) and then modify your function to use NA as an input. Or, of course, you could just use some "special" numeric value, like -1, or -99 or something.
The related issue that I mentioned is that you really should avoid using the character string "NULL" to mean anything, since NULL is a special value in R, and confusion will ensue.
These sorts of strategies would all be preferable to mixing types, and using character strings of reserved words like NULL.
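As a rough sketch of the NA approach: build the grid with NA, then drop the NA entries before calling the function, so that its NULL defaults apply. The function f and its arguments a and b are hypothetical stand-ins, since the real function was not shown.

```r
# Hypothetical stand-in for the function in the question; both defaults are NULL
f <- function(a = NULL, b = NULL) {
  paste0("a=", if (is.null(a)) "NULL" else a,
         ", b=", if (is.null(b)) "NULL" else b)
}

# All combinations, with NA marking "use the default"
comb <- expand.grid(a = c(NA, 1), b = c(NA, 1))

results <- vapply(seq_len(nrow(comb)), function(i) {
  args <- as.list(comb[i, , drop = FALSE])
  # Drop the NA entries so the function falls back to its NULL defaults
  args <- args[!vapply(args, is.na, logical(1))]
  do.call(f, args)
}, character(1))

results
```

This keeps every column of comb numeric, and the translation from NA to "argument not supplied" happens in one place.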
All that said, it technically is possible to get around this, but it is awkward, and not a good idea.
> d <- data.frame(x = 1:5)
> d$y <- list("a",1,2,3,"b")
> d
x y
1 1 a
2 2 1
3 3 2
4 4 3
5 5 b
> str(d)
'data.frame': 5 obs. of 2 variables:
$ x: int 1 2 3 4 5
$ y:List of 5
..$ : chr "a"
..$ : num 1
..$ : num 2
..$ : num 3
..$ : chr "b"
I'm beginning to analyse data for my thesis. I first need to count consecutive occurrences of strings as one. Here's a sample vector:
test <- c("vv","vv","vv","bb","bb","bb","","cc","cc","vv","vv")
I would like to simply extract the unique consecutive values, as the unix command uniq does. So the expected output would be a vector such as:
"vv","bb","cc","vv"
I looked at the rle function, which seems to be fine, but how would I get the output of rle as a vector? I don't seem to understand the rle class...
> rle(test)
Run Length Encoding
lengths: int [1:5] 3 3 1 2 2
values : chr [1:5] "vv" "bb" "" "cc" "vv"
How do I get one vector of the values output by rle and another one for the lengths? Hope I'm making myself clear...
Thanks again for any help!
rle() returns a two-element list of class "rle"; as @gsk points out, you can use ordinary list-indexing constructs to access the component vectors.
Also, try this, to put the results of rle into a more familiar format:
as.data.frame(rev(unclass(rle(test))))
# values lengths
# 1 vv 3
# 2 bb 3
# 3 1
# 4 cc 2
# 5 vv 2
Source: http://www.sigmafield.org/2009/09/22/r-function-of-the-day-rle
Solution: rle(test)$values
They use coin.rle <- rle(coin) and then coin.rle$values, so rle(test)$values should work.
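Note that rle(test)$values still contains the empty string, while the expected output in the question does not. A minimal sketch, assuming the empty strings should simply be discarded before the runs are computed:

```r
test <- c("vv", "vv", "vv", "bb", "bb", "bb", "", "cc", "cc", "vv", "vv")

# Drop the empty strings, then compute the run-length encoding
runs <- rle(test[test != ""])

runs$values   # one entry per run of identical strings
runs$lengths  # how long each run is
```

Both components are ordinary vectors, so they can be assigned to separate variables or combined in a data frame.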