R Delete all rows in dataframe based on index - r

I have a list of indices that I know I want to remove from my data frame.
Normally I can do this easily by writing out the names, but I don't understand why the following command fails when I try to drop those rows, yet works when I keep only the rows I want to delete:
str(data)
'data.frame': 180 obs. of 624 variables:
$ Sites : chr "SS0501_1" "SS0570_1" "SS0609_1" "SS0645_1" ...
$ LandUse : chr "Urban" "Urban" "Urban" "Urban" ...
.
.
.
f_pattern <- "SS2371|SS1973|SS1908|SS1815|SS1385|SS1304" # find index names in data frame using partial site names
get_full_id <- data[grep(f_pattern, rownames(data)),] # get the full site names (these are indices in the data frame)
data <- data[!get_full_id$Sites,] # DOES NOT WORK
Error in !check$Sites : invalid argument type
However, it does work if I pull these sites out.
data <- data[get_full_id$Sites,] # Works fine, I get a dataframe with 6 rows...the ones I don't want to keep.
str(data)
'data.frame': 6 obs. of 624 variables:
$ Sites : chr "SS1908_1" "SS1973_1" "SS1304_2" "SS1385_2" ...
$ LandUse : chr "Urban" "Rural" "Rural" "Urban" ...
.
.
I don't understand why the reverse with "!" won't work at all?

The ! operator negates a logical vector, but get_full_id$Sites is a character vector, which is why !get_full_id$Sites fails with "invalid argument type". Unary minus does not help here either: - drops rows only when it is applied to numeric positions, and it raises an error on a character vector.
To use !, first create a logical vector that flags the rows whose names appear in the 'Sites' column, then negate it (the row names are not shown in the question, so this assumes they hold the full site names):
data[!row.names(data) %in% get_full_id$Sites,]
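This works only if the row names match the Sites values exactly.
The rows can also be dropped directly by position. Be aware that if the pattern matches nothing, grep returns an empty index and the negative subscript then drops every row: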
data[-grep(f_pattern, rownames(data)),]
Or use invert = TRUE, which returns the positions of the non-matching rows and keeps every row when nothing matches
data[grep(f_pattern, rownames(data), invert = TRUE),]
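A minimal sketch of the logical-vector approach, using a small made-up data frame (the four sites and the shortened pattern are illustrative assumptions, not the asker's data). It also shows the zero-match pitfall of a negative grep() index:

```r
# Toy data frame; the row names are assumed to be the full site IDs.
data <- data.frame(
  Sites   = c("SS0501_1", "SS1908_1", "SS1973_1", "SS0609_1"),
  LandUse = c("Urban", "Urban", "Rural", "Urban"),
  stringsAsFactors = FALSE
)
rownames(data) <- data$Sites

f_pattern <- "SS1973|SS1908"

# Logical vector: TRUE for rows whose name matches the pattern.
drop <- grepl(f_pattern, rownames(data))

# Negating a logical vector keeps all rows when nothing matches.
kept <- data[!drop, ]
rownames(kept)

# Pitfall: a pattern with no match gives -integer(0), which selects no rows.
none_left <- data[-grep("no_such_site", rownames(data)), ]
nrow(none_left)
```

Because grepl() always returns one TRUE/FALSE per row, the ! form is usually the safer choice inside scripts where the pattern may not match anything.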

Related

Subset dataframe by factor variable

Newbie here. I'm sure this is easy and has been answered before, but I've spent more than an hour looking for the answer and can't find it.
I have a dataframe with 3 variables:
> str(statement)
'data.frame': 16464206 obs. of 3 variables:
$ statement_type_cd: Factor w/ 428 levels "A00001","A00002"...
$ statement_text : Factor w/ 9894526 levels...
$ serial_no : int 60146682 60149828 70011210...
I'd like to extract the statement_text observations whose statement_type_cd has the form GSXXXX, where each X is any digit.
In other words, how do I subset the dataframe by any observation that begins with GS in the statement_type_cd variable?
Thanks :)
We can use grepl to create a logical vector by matching the pattern 'GS' from the start (^) of the string and use it to subset the dataset
statementsub <- subset(statement, grepl("^GS", statement_type_cd))
Or with tidyverse
library(dplyr)
statementsub <- statement %>%
filter(grepl("^GS", statement_type_cd))
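As a hedged illustration of the same idea in base R, the sketch below uses a four-row stand-in for the statement data frame (the text values and the last serial number are invented). The stricter pattern "^GS[0-9]+$" is an assumption about what GSXXXX means, and droplevels() is added because subsetting a factor column otherwise keeps all of its original levels:

```r
# Small stand-in for the real 16-million-row data frame.
statement <- data.frame(
  statement_type_cd = factor(c("GS0001", "A00001", "GS0123", "D00000")),
  statement_text    = factor(c("text a", "text b", "text c", "text d")),
  serial_no         = c(60146682L, 60149828L, 70011210L, 70011211L)
)

# grepl() coerces the factor to character before matching.
# "^GS[0-9]+$" requires digits only after the GS prefix.
is_gs <- grepl("^GS[0-9]+$", statement$statement_type_cd)

# Keep the matching rows and discard factor levels that no longer occur.
statementsub <- droplevels(statement[is_gs, ])

as.character(statementsub$statement_text)
nlevels(statementsub$statement_type_cd)
```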

Grouping occurrences of a string to a row

tl;dr
Is there a way to group together a large number of values to a single column without truncation of those values?
I am working on a data frame with 48,178 entries in RStudio. The data frame has 2 columns: the first contains unique numeric values, and the second contains repeated strings.
----------
id name
1 forest
2 forest
3 park
4 riverbank
.
.
.
.
.
48178 water
----------
I would like to group together all entries on the basis of unique entries in the 2nd column. I have used the ddply function from the plyr package to achieve the result. I now have the following derived table:
----------
type V1
forest forest,forest,forest
park park,park,park,park
riverbank riverbank,riverbank,
water water,water,water,water
----------
However, when I apply the str function to the derived data frame, I find that the column contains truncated values, and not every instance of each string.
The output to the str is:
'data.frame': 4 obs. of 2 variables:
$ type: chr "forest" "park" "riverbank" "water"
$ V1 : chr "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ "park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,park,pa"| __truncated__ "riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverbank,riverba"| __truncated__ "water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,water,w"| __truncated__
How do I group together same strings and push them to a row, without truncation?
Extending the answer of HubertL: the str() function does exactly what it is supposed to do, but it is perhaps the wrong choice for what you intend.
From the (rather limited) information in your question, it seems that you have already achieved what you are looking for, i.e., concatenating all strings of the same type.
What is tripping you up is only the output of the str() function.
Please refer to the help page ?str.
From the Description section:
Compactly display the internal structure of an R object, a diagnostic function and an alternative to summary (and to some extent, dput). Ideally, only one line for each ‘basic’ structure is displayed. It is especially well suited to compactly display the (abbreviated) contents of (possibly nested) lists. The idea is to give reasonable output for any R object.
str() has a parameter nchar.max which defaults to 128.
nchar.max maximal number of characters to show for character strings. Longer strings are truncated, see longch example below.
The longch example in the Examples section illustrates the effect of this parameter:
nchar(longch <- paste(rep(letters,100), collapse = ""))
#[1] 2600
str(longch)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvw"| __truncated__
str(longch, nchar.max = 52)
# chr "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxy"| __truncated__
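To check that only the display is shortened, one can compare what str() prints with what is actually stored. The following is a small sketch with an invented string (5000 copies of "forest"), not the asker's data:

```r
# Build one long comma-separated string, similar to the grouped column.
long_string <- paste(rep("forest", 5000), collapse = ",")

# The full string is stored: 5000 * 6 letters plus 4999 commas.
nchar(long_string)

# str() shortens only what it prints; nchar.max controls the cut-off.
str(long_string)
str(long_string, nchar.max = 40)

# Splitting the string recovers every entry.
pieces <- strsplit(long_string, ",", fixed = TRUE)[[1]]
length(pieces)
```

To print a value in full, cat() or print() can be used instead of str().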
Maximum length of a character string
According to ?"Memory-limits", the number of bytes in a character string is limited to 2^31 - 1 ~ 2*10^9. Given the number of rows in your data frame and the length of the name values, the concatenated strings won't exceed about 0.6*10^6 characters, which is far below the limit.
Try storing the results in a list using base R split() function:
new.list <- split(df, f=df$type)
This will split the data frame into multiple data frames that can be accessed using square brackets. It keeps the character strings from being combined and truncated as the records continue to be preserved in separate cells.
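A brief sketch of what split() returns, using a six-row toy data frame with the column names from the question (id and name; the rows themselves are made up):

```r
df <- data.frame(
  id   = 1:6,
  name = c("forest", "forest", "park", "riverbank", "water", "park"),
  stringsAsFactors = FALSE
)

# One data frame per distinct value of name, collected in a named list.
new.list <- split(df, f = df$name)

names(new.list)
new.list[["forest"]]        # rows belonging to "forest"
sapply(new.list, nrow)      # number of rows per group
```

Each group stays a regular data frame, so nothing is pasted into one long string.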
Your strings are not really truncated; only their display by str is truncated:
size <- 48000
df <- data.frame(1:size,
type=sample(c("forest", "park", "riverbank", "water" ),
size, replace = TRUE),
stringsAsFactors = FALSE)
res <- by(df$type , df$type, paste, collapse=",")
str(res)
'by' chr [1:4(1d)] "forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,forest,f"| __truncated__ ...
- attr(*, "dimnames")=List of 1
..$ df$type: chr [1:4] "forest" "park" "riverbank" "water"
- attr(*, "call")= language by.default(data = df$type, INDICES = df$type, FUN = paste, collapse = ",")
lengths( strsplit(res, ','))
forest park riverbank water
11993 12017 11953 12037
sum(lengths( strsplit(res, ',')))
[1] 48000
If all you want is a count of occurrences, then why not simply use table?
df<- read.table(header=TRUE, text="id name
1 forest
2 forest
3 park
4 riverbank")
df
df1<- as.data.frame(table(df$name))
#will give you number of times the word occurs
# if for some reason you want a repetition,then
x<- mapply(rep,df1$Var1,df1$Freq)
y<- sapply(x,paste, collapse=",")
data.frame(type=df1$Var1, V1=y)

Build column of data frame with character vectors of different length?

I want to create a data frame in R.
To make an easy 2x2 example of my problem:
Assume the first column is a simple vector:
first <- c(1:2)
The second column is for every row a character vector (but of different length), for example:
c('A') for the first row and c('B','C') for the second.
How can I build this data frame?
If you want to store vectors of different lengths in each row of a column, you will need to use a list. The problem is that (from ?data.frame)
If a list or data frame or matrix is passed to data.frame it is as if
each component or column had been passed as a separate argument
Thus you will need to wrap it in I() in order to protect your desired structure, e.g.
df <- data.frame(first = 1:2, Second = I(list("A", c("B", "C"))))
str(df)
# 'data.frame': 2 obs. of 2 variables:
# $ first : int 1 2
# $ Second:List of 2
# ..$ : chr "A"
# ..$ : chr "B" "C"
# ..- attr(*, "class")= chr "AsIs"
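As a hedged alternative sketch, a list column can also be attached after the data frame exists by assigning with $<-, which stores the list as given without the "AsIs" class. The column names follow the answer above:

```r
df <- data.frame(first = 1:2)

# Assigning a list whose length equals nrow(df) creates a list column.
df$Second <- list("A", c("B", "C"))

is.list(df$Second)
df$Second[[2]]        # the character vector stored in row 2
lengths(df$Second)    # number of elements per row
```

Elements are read back with [[ ]], since each cell holds a whole vector rather than a single value.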

R convert matrix to list

I have a large matrix
> str(distMatrix)
num [1:551, 1:551] 0 6 5 Inf 5 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:551] "+" "ABRAHAM" "ACTS" "ADVANCE" ...
..$ : chr [1:551] "+" "ABRAHAM" "ACTS" "ADVANCE" ...
which contains numeric values. I need to gather all numeric values into ONE long list (for acquiring distribution). Currently what I have:
for(i in 1:dim(distMatrix)[[1]]){
for (j in 1:dim(distMatrix)[[2]]){
distances[length(distances)+1] <- distMatrix[i,j]
}
}
However, that takes forever. Can anyone suggest a faster way?
To turn a matrix into a list, the length of which is the same as the number of elements in the matrix, you can simply do
as.list(distMatrix)
This goes down the columns, but you can use the transpose
as.list(t(distMatrix))
to make it go across the rows. Since your matrix is 551x551 it should be sufficiently efficient.
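If the goal is the distribution of the values, a plain numeric vector may be more convenient than a list. The sketch below uses a 3x3 stand-in for distMatrix (values invented) and assumes that the Inf entries should be left out of the distribution:

```r
# Small stand-in for the 551x551 distance matrix.
labels <- c("+", "ABRAHAM", "ACTS")
distMatrix <- matrix(c(0, 6, Inf,
                       6, 0, 5,
                       Inf, 5, 0),
                     nrow = 3, dimnames = list(labels, labels))

# Flatten in column-major order; no loop is needed.
distances <- as.vector(distMatrix)

# Assumption: infinite distances are excluded before summarising.
finite <- distances[is.finite(distances)]

length(distances)
table(finite)
```

The loop in the question is slow mainly because the vector is grown one element at a time; as.vector() copies the values in a single step.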

Reading csv file, having numbers and strings in one column

I am importing a 3 column CSV file. The final column is a series of entries which are either an integer, or a string in quotation marks.
Here are a series of example entries:
1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5
When I import this using read.csv, these are all just turned into factors.
How can I set it up such that these are read as integers and strings?
Thank you!
This is not possible, since a given vector can only have a single mode (e.g. character, numeric, or logical).
However, you could split the vector into two separate vectors, one with numeric values and the second with character values:
vec <- c("m", 20, "Canada", 4, 5)
vnum <- as.numeric(vec)
vchar <- ifelse(is.na(vnum), vec, NA)
vnum
[1] NA 20 NA 4 5
vchar
[1] "m" NA "Canada" NA NA
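The same split can be applied at import time. This is a sketch under the assumption that the third column is read as character via colClasses and converted afterwards; the names csv_text, V3_num and V3_chr are illustrative:

```r
# The example rows from the question, supplied inline instead of a file.
csv_text <- '1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5'

# Read the mixed column as character so nothing is turned into a factor.
dat <- read.csv(text = csv_text, header = FALSE,
                colClasses = c("integer", "integer", "character"))

# Non-numeric entries become NA; the coercion warning is expected here.
dat$V3_num <- suppressWarnings(as.numeric(dat$V3))

# Keep the text only where the numeric conversion failed.
dat$V3_chr <- ifelse(is.na(dat$V3_num), dat$V3, NA)

dat
```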
EDIT Despite the OP's decision to accept this answer, @Andrie's answer is the preferred solution. My answer is meant only to inform about some odd features of data frames.
As others have pointed out, the short answer is that this isn't possible. data.frames are intended to contain columns of a single atomic type. @Andrie's suggestion is a good one, but just for kicks I thought I'd point out a way to shoehorn this type of data into a data.frame.
You can convert the offending column to a list (this code assumes you've set options(stringsAsFactors = FALSE)):
dat <- read.table(textConnection("1,4,'m'
1,5,20
1,6,'Canada'
1,7,4
1,8,5"),header = FALSE,sep = ",")
tmp <- as.list(as.numeric(dat$V3))
tmp[c(1,3)] <- dat$V3[c(1,3)]
dat$V3 <- tmp
str(dat)
'data.frame': 5 obs. of 3 variables:
$ V1: int 1 1 1 1 1
$ V2: int 4 5 6 7 8
$ V3:List of 5
..$ : chr "m"
..$ : num 20
..$ : chr "Canada"
..$ : num 4
..$ : num 5
Now, there are all sorts of reasons why this is a bad idea. For one, lots of code that you'd expect to play nicely with data.frames will not like this and either fail, or behave very strangely. But I thought I'd point it out as a curiosity.
No. A data frame is a series of vectors pasted together (a list of vectors or matrices). Because each column is a vector, it cannot be classified as both integer and factor. It must be one or the other. You could split the vector apart into numeric and factor (a column for each), but I don't believe this is what you want.
