Build column of data frame with character vectors of different length? - r

I want to create a data frame in R.
To make an easy 2x2 example of my problem:
Assume the first column is a simple vector:
first <- c(1:2)
The second column is for every row a character vector (but of different length), for example:
c('A') for the first row and c('B','C') for the second.
How can I build this data frame?

If you want to store vectors of different lengths in each row of a column, you will need to use a list. The problem is that (from ?data.frame):
If a list or data frame or matrix is passed to data.frame it is as if
each component or column had been passed as a separate argument
Thus you will need to wrap the list in I() to protect your desired structure, e.g.
df <- data.frame(first = 1:2, Second = I(list("A", c("B", "C"))))
str(df)
# 'data.frame': 2 obs. of 2 variables:
# $ first : int 1 2
# $ Second:List of 2
# ..$ : chr "A"
# ..$ : chr "B" "C"
# ..- attr(*, "class")= chr "AsIs"
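As a minimal sketch of an alternative, assuming you are free to build the data frame in two steps: assigning a list with $<- also stores it as a list column, without I(). The name df2 is only illustrative.

```r
# Build the simple column first, then attach the list column.
df2 <- data.frame(first = 1:2)
df2$Second <- list("A", c("B", "C"))  # one list element per row

lengths(df2$Second)  # number of strings stored in each row
df2$Second[[2]]      # the character vector stored in row 2
```

Each cell is then retrieved with [[ ]], as for any list.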

Related

R Delete all rows in dataframe based on index

I have a list of indices that I know I want to remove from my data frame.
Normally I can do this easily by writing out the names, but I don't understand why the following command works only when I want to keep the rows I intend to delete:
str(data)
'data.frame': 180 obs. of 624 variables:
$ Sites : chr "SS0501_1" "SS0570_1" "SS0609_1" "SS0645_1" ...
$ LandUse : chr "Urban" "Urban" "Urban" "Urban" ...
.
.
.
f_pattern <- "SS2371|SS1973|SS1908|SS1815|SS1385|SS1304" # find index names in data frame using partial site names
get_full_id <- data[grep(f_pattern, rownames(data)),] # get the full site names (these are indices in the data frame)
data <- data[!get_full_id$Sites,] # DOES NOT WORK
Error in !check$Sites : invalid argument type
However, it does work if I pull these sites out.
data <- data[get_full_id$Sites,] # Works fine, I get a dataframe with 6 rows...the ones I don't want to keep.
str(data)
'data.frame': 6 obs. of 624 variables:
$ Sites : chr "SS1908_1" "SS1973_1" "SS1304_2" "SS1385_2" ...
$ LandUse : chr "Urban" "Rural" "Rural" "Urban" ...
.
.
I don't understand why the reverse with "!" won't work at all?
The ! operator negates a logical vector, and get_full_id$Sites is a character vector, hence the error. Unary - fails on character values as well, because negative indexing needs row positions. If the data frame has row names that match the 'Sites' column exactly (this is not clear, as the row names are not shown), look up the positions first:
data[-match(get_full_id$Sites, rownames(data)),]
If we want to use !, create a logical vector that flags the rows whose names appear in the 'Sites' column:
data[!row.names(data) %in% get_full_id$Sites,]
This also works only if there is an exact match
Also, this can be done directly
data[-grep(f_pattern, rownames(data)),]
Or use invert = TRUE
data[grep(f_pattern, rownames(data), invert = TRUE),]
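A small sketch on made-up data may help compare these options. The four-row data frame and the pattern "NO_SUCH_SITE" are assumptions for illustration only. It also shows one caveat of the -grep() form: when nothing matches, grep() returns integer(0), and indexing with -integer(0) drops every row rather than none. A logical index from grepl() avoids this.

```r
# Toy data (hypothetical): row names are the full site names.
data <- data.frame(
  Sites   = c("SS0501_1", "SS1908_1", "SS1973_1", "SS0609_1"),
  LandUse = c("Urban", "Urban", "Rural", "Urban"),
  stringsAsFactors = FALSE
)
rownames(data) <- data$Sites

f_pattern <- "SS1973|SS1908"

# Logical index: TRUE for rows to drop, negated with !
drop_rows <- grepl(f_pattern, rownames(data))
kept <- data[!drop_rows, ]

# Caveat: a pattern that matches nothing
all_kept <- data[!grepl("NO_SUCH_SITE", rownames(data)), ]  # keeps all rows
emptied  <- data[-grep("NO_SUCH_SITE", rownames(data)), ]   # keeps no rows
```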

Extract all values from a vector of named numerics with the same name in R

I'm trying to handle a vector of named numerics for the first time in R. The vector itself is named p.values. It consists of p-values which are named after their corresponding variables. Through simulation I obtained a huge number of p-values, each named after one of the five variables it corresponds to. However, I'm interested in the p-values of only one variable and tried to extract them with p.values[["var_a"]], but that gives me only the p-value of var_a's last entry. p.values$var_a is invalid, and as.numeric(p.values) or unname(p.values) obviously gives me all values without names. Any idea how I can get R to give me the 1/5 of named numerics that are named var_a?
Short example:
p.values <- as.numeric(c(rep(1:5, each = 5)))
names(p.values) <- rep(letters[1:5], 5)
str(p.values)
Named num [1:25] 1 1 1 1 1 2 2 2 2 2 ...
- attr(*, "names")= chr [1:25] "a" "b" "c" "d" ...
I'd like to get R to show me all 5 numbers named "a".
Thanks for reading my first post here and I hope some more experienced R users know how to deal with named numerics and can help me with this issue.
You can subset p.values using [ with names(p.values) == "a" to show all values named a.
p.values[names(p.values) == "a"]
#a a a a a
#1 2 3 4 5
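If you need the values for every name rather than just one, a possible sketch (reusing the example vector from the question) is to split the vector by its names once and then pick groups from the resulting list. The name by_name is only illustrative.

```r
p.values <- as.numeric(c(rep(1:5, each = 5)))
names(p.values) <- rep(letters[1:5], 5)

# One list element per distinct name, each holding all values with that name
by_name <- split(p.values, names(p.values))

by_name[["a"]]          # all five values named "a"
unname(by_name[["a"]])  # the same values without names
```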

collapse data frame with embedded matrices [duplicate]

Under certain conditions, R generates data frames that contain matrices as elements. This requires some determination to do by hand, but happens e.g. with the results of an aggregate() call where the aggregation function returns multiple values:
set.seed(101)
d0 <- data.frame(g=factor(rep(1:2,each=20)), x=rnorm(20))
d1 <- aggregate(x~g, data=d0, FUN=function(x) c(m=mean(x), s=sd(x)))
str(d1)
## 'data.frame': 2 obs. of 2 variables:
## $ g: Factor w/ 2 levels "1","2": 1 2
## $ x: num [1:2, 1:2] -0.0973 -0.0973 0.8668 0.8668
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "m" "s"
This makes a certain amount of sense, but can make trouble for downstream processing code (for example, ggplot2 doesn't like it). The printed representation can also be confusing if you don't know what you're looking at:
d1
## g x.m x.s
## 1 1 -0.09731741 0.86678436
## 2 2 -0.09731741 0.86678436
I'm looking for a relatively simple way to collapse this object to a regular three-column data frame (either with names g, m, s, or with names g, x.m, x.s ...).
I know this problem won't arise with tidyverse (group_by + summarise), but am looking for a base-R solution.
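Two base-R idioms that are commonly used for this, given here as a sketch rather than a definitive answer: do.call(data.frame, d1) re-passes each column to data.frame(), which splits the matrix into separate columns named x.m and x.s, while cbind() on the remaining columns and the extracted matrix gives the names g, m, s. The names flat and flat2 are illustrative, and the second form assumes the matrix is the last column.

```r
set.seed(101)
d0 <- data.frame(g = factor(rep(1:2, each = 20)), x = rnorm(20))
d1 <- aggregate(x ~ g, data = d0, FUN = function(x) c(m = mean(x), s = sd(x)))

# Option 1: columns named g, x.m, x.s
flat <- do.call(data.frame, d1)

# Option 2: columns named g, m, s (matrix column assumed to be last)
flat2 <- cbind(d1[-ncol(d1)], d1[[ncol(d1)]])

str(flat)
str(flat2)
```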

Grouping occurrences of a string to a row

tl;dr
Is there a way to group together a large number of values to a single column without truncation of those values?
I am working on a data frame with 48,178 entries on RStudio. The data frame has 2 columns of which the first one contains unique numeric values, and the other contains repeated strings.
----------
id name
1 forest
2 forest
3 park
4 riverbank
.
.
.
.
.
48178 water
----------
I would like to group together all entries on the basis of unique entries in the 2nd column. I have used the ddply function from the plyr package to achieve the result. I now have the following derived table:
----------
type V1
forest forest,forest,forest
park park,park,park,park
riverbank riverbank,riverbank,
water water,water,water,water
----------
However, on applying str function on the derived data frame, I find that the column contains truncated values, and not every instance of each string.
The output to the str is:
'data.frame': 4 obs. of 2 variables:
$ type: chr "forest" "park" "riverbank" "water"
$ V1 : chr "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ "park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,pa"| __truncated__ "riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverba"| __truncated__ "water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,w"| __truncated__
How do I group together same strings and push them to a row, without truncation?
Extending the answer of HubertL: the str() function does exactly what it is supposed to do, but is perhaps the wrong choice for what you intend.
From the (rather limited) information you have given in your question, it seems that you have already achieved what you are looking for, i.e., concatenating all strings of the same type.
However, it appears that you are stuck with the output of the str() function.
Please, refer to the help page ?str.
From the Description section:
Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’ structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object.
str() has a parameter nchar.max which defaults to 128.
nchar.max maximal number of characters to show for character strings. Longer strings are truncated, see longch example below.
The longch example in the Examples section illustrates the effect of this parameter:
nchar(longch <- paste(rep(letters,100), collapse = ""))
#[1] 2600
str(longch)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvw"| __truncated__
str(longch, nchar.max = 52)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxy"| __truncated__
Maximum length of a character string
According to ?"Memory-limits", the number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9. Given the number of rows in your data frame and the length of the names, the concatenated strings won't exceed 0.6*10^6 characters, which is far from the limit.
Try storing the results in a list using base R split() function:
new.list <- split(df, f=df$type)
This will split the data frame into multiple data frames that can be accessed using square brackets. It keeps the character strings from being combined and truncated as the records continue to be preserved in separate cells.
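A minimal sketch of what split() returns, using a made-up six-row data frame (the values and the column name type are assumptions chosen to match the call above):

```r
# Hypothetical small data frame with a repeated 'type' column
df <- data.frame(
  id   = 1:6,
  type = c("forest", "forest", "park", "riverbank", "park", "forest"),
  stringsAsFactors = FALSE
)

new.list <- split(df, f = df$type)

names(new.list)         # one element per distinct type
new.list[["forest"]]    # the rows whose type is "forest"
sapply(new.list, nrow)  # number of rows in each group
```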
Your strings are not really truncated; only their display by str() is truncated:
size <- 48000
df <- data.frame(1:size,
type=sample(c("forest", "park", "riverbank", "water" ),
size, replace = TRUE),
stringsAsFactors = FALSE)
res <- by(df$type , df$type, paste, collapse=",")
str(res)
'by' chr [1:4(1d)] "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ ...
- attr(*, "dimnames")=List of 1
..$ df$type: chr [1:4] "forest" "park" "riverbank" "water"
- attr(*, "call")= language by.default(data = df$type, INDICES = df$type, FUN = paste, collapse = ",")
lengths( strsplit(res, ','))
forest park riverbank water
11993 12017 11953 12037
sum(lengths( strsplit(res, ',')))
[1] 48000
If all you want is a count of occurrences, then why not simply use table()?
df <- read.table(header = TRUE, text = "id name
1 forest
2 forest
3 park
4 riverbank")
df
# number of times each word occurs
df1 <- as.data.frame(table(df$name))
# if for some reason you want the repetitions, then
# (SIMPLIFY = FALSE keeps a list even when all counts are equal)
x <- mapply(rep, df1$Var1, df1$Freq, SIMPLIFY = FALSE)
y <- sapply(x, paste, collapse = ",")
data.frame(type = df1$Var1, V1 = y)

Reading csv file, having numbers and strings in one column

I am importing a 3 column CSV file. The final column is a series of entries which are either an integer, or a string in quotation marks.
Here are a series of example entries:
1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5
When I import this using read.csv, these are all just turned into factors.
How can I set it up such that these are read as integers and strings?
Thank you!
This is not possible, since a given vector can only have a single mode (e.g. character, numeric, or logical).
However, you could split the vector into two separate vectors, one with numeric values and the second with character values:
vec <- c("m", 20, "Canada", 4, 5)
vnum <- as.numeric(vec)
vchar <- ifelse(is.na(vnum), vec, NA)
vnum
[1] NA 20 NA 4 5
vchar
[1] "m" NA "Canada" NA NA
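Applied to the CSV from the question, the same idea could look like the sketch below. It assumes the column is read as character (stringsAsFactors = FALSE); the column names V3_num and V3_chr are made up for illustration, and suppressWarnings() only hides the expected "NAs introduced by coercion" warning.

```r
# Read the example rows; V3 comes in as a character column
dat <- read.csv(
  text = '1,4,"m"\n1,5,20\n1,6,"Canada"\n1,7,4\n1,8,5',
  header = FALSE, stringsAsFactors = FALSE
)

# Numeric entries where conversion succeeds, NA elsewhere
dat$V3_num <- suppressWarnings(as.numeric(dat$V3))
# Character entries where conversion failed, NA elsewhere
dat$V3_chr <- ifelse(is.na(dat$V3_num), dat$V3, NA)

dat
```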
EDIT Despite the OP's decision to accept this answer, @Andrie's answer is the preferred solution. My answer is meant only to inform about some odd features of data frames.
As others have pointed out, the short answer is that this isn't possible. data.frames are intended to contain columns of a single atomic type. @Andrie's suggestion is a good one, but just for kicks I thought I'd point out a way to shoehorn this type of data into a data.frame.
You can convert the offending column to a list (this code assumes you've set options(stringsAsFactors = FALSE), which is the default behaviour from R 4.0.0 onwards):
dat <- read.table(textConnection("1,4,'m'
1,5,20
1,6,'Canada'
1,7,4
1,8,5"),header = FALSE,sep = ",")
tmp <- as.list(as.numeric(dat$V3))
tmp[c(1,3)] <- dat$V3[c(1,3)]
dat$V3 <- tmp
str(dat)
'data.frame': 5 obs. of 3 variables:
$ V1: int 1 1 1 1 1
$ V2: int 4 5 6 7 8
$ V3:List of 5
..$ : chr "m"
..$ : num 20
..$ : chr "Canada"
..$ : num 4
..$ : num 5
Now, there are all sorts of reasons why this is a bad idea. For one, lots of code that you'd expect to play nicely with data.frames will not like this and either fail, or behave very strangely. But I thought I'd point it out as a curiosity.
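To give a flavour of the extra care a mixed list column needs, here is a short sketch that rebuilds an equivalent data frame directly (rather than via read.table) and picks out the numeric cells before doing arithmetic. The name is_num is illustrative.

```r
# Equivalent structure to the data frame above, built by hand
dat <- data.frame(V1 = rep(1L, 5), V2 = 4:8)
dat$V3 <- list("m", 20, "Canada", 4, 5)

# Each cell must be inspected individually
is_num <- vapply(dat$V3, is.numeric, logical(1))

# Arithmetic only on the numeric cells
sum(unlist(dat$V3[is_num]))
```

Vectorised operations such as sum(dat$V3) fail on a list column, which is one of the reasons this structure is awkward.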
No. A data frame is a list of equal-length vectors (or matrices) bound together as columns. Because each column is a single vector, it cannot be both integer and factor; it must be one or the other. You could split the vector into a numeric column and a factor column, but I don't believe this is what you want.
