Subset dataframe by factor variable - r

Newbie here. I'm sure this is easy and has been answered before, but I've spent more than an hour looking for the answer and can't find it.
I have a dataframe with 3 variables:
> str(statement)
'data.frame': 16464206 obs. of 3 variables:
$ statement_type_cd: Factor w/ 428 levels "A00001","A00002"...
$ statement_text : Factor w/ 9894526 levels...
$ serial_no : int 60146682 60149828 70011210...
I'd like to extract the statement_text observations whose statement_type_cd has the form GSXXXX, where each X is any digit.
In other words, how do I subset the dataframe to the observations whose statement_type_cd begins with GS?
Thanks :)

We can use grepl to create a logical vector that is TRUE where the string starts (^) with the pattern 'GS', and use it to subset the dataset:
statementsub <- subset(statement, grepl("^GS", statement_type_cd))
Or with tidyverse
library(dplyr)
statementsub <- statement %>%
filter(grepl("^GS", statement_type_cd))
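If the codes must match the full GSXXXX shape rather than just the GS prefix, the pattern can be tightened. The following is a minimal sketch on a small made-up data frame (the rows and values are assumptions for illustration, not the asker's data). It also calls droplevels(), since a subset of a factor column otherwise keeps all of the original levels.

```r
# Toy stand-in for the asker's data frame (values are invented for illustration)
statement <- data.frame(
  statement_type_cd = factor(c("A00001", "GS0011", "GS0012", "D10000")),
  statement_text    = c("a", "b", "c", "d"),
  serial_no         = c(1L, 2L, 3L, 4L)
)

# TRUE only for codes that are exactly "GS" followed by four digits
keep <- grepl("^GS[0-9]{4}$", statement$statement_type_cd)

# Subset the rows, then drop factor levels that no longer occur
statementsub <- droplevels(statement[keep, ])
```

Whether to anchor the pattern with $ depends on how strict the match should be; "^GS" alone keeps anything that starts with GS.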

Related

R Delete all rows in dataframe based on index

I have a list of indices that I know I want to remove from my data frame.
Normally I can do this easily by writing out the names, but I don't understand why the following command only works when I keep the rows I want to delete:
str(data)
'data.frame': 180 obs. of 624 variables:
$ Sites : chr "SS0501_1" "SS0570_1" "SS0609_1" "SS0645_1" ...
$ LandUse : chr "Urban" "Urban" "Urban" "Urban" ...
.
.
.
f_pattern <- "SS2371|SS1973|SS1908|SS1815|SS1385|SS1304" # find index names in data frame using partial site names
get_full_id <- data[grep(f_pattern, rownames(data)),] # get the full site names (these are indices in the data frame)
data <- data[!get_full_id$Sites,] # DOES NOT WORK
Error in !check$Sites : invalid argument type
However, it does work if I pull these sites out.
data <- data[get_full_id$Sites,] # Works fine, I get a dataframe with 6 rows...the ones I don't want to keep.
str(data)
'data.frame': 6 obs. of 624 variables:
$ Sites : chr "SS1908_1" "SS1973_1" "SS1304_2" "SS1385_2" ...
$ LandUse : chr "Urban" "Rural" "Rural" "Urban" ...
.
.
I don't understand why the reverse with "!" won't work at all?
Neither ! nor - can be applied to a character vector such as get_full_id$Sites: ! negates a logical vector, and - drops rows by numeric position. To drop rows by name, create a logical vector first. Here, we ask for the rows whose row names do not appear in the 'Sites' column:
data[!row.names(data) %in% get_full_id$Sites,]
This works only if the row names match the 'Sites' values exactly (not clear, as the row names are not shown).
Also, this can be done directly
data[-grep(f_pattern, rownames(data)),]
Or use invert = TRUE
data[grep(f_pattern, rownames(data), invert = TRUE),]
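The same idea can be written with grepl, which returns a logical vector that ! can negate. This is a small sketch on an invented four-row data frame; the site names and the shortened pattern are assumptions chosen only to make the example self-contained.

```r
# Toy data with site names as row names (values invented for illustration)
data <- data.frame(
  Sites   = c("SS0501_1", "SS1908_1", "SS1973_1", "SS0609_1"),
  LandUse = c("Urban", "Urban", "Rural", "Urban"),
  stringsAsFactors = FALSE
)
rownames(data) <- data$Sites

f_pattern <- "SS1973|SS1908"

# Logical vector: TRUE for rows whose name matches the pattern
to_drop <- grepl(f_pattern, rownames(data))

# Keep everything that did not match
kept <- data[!to_drop, ]
```

One reason to prefer the logical form: when nothing matches, -grep(...) becomes -integer(0), which selects zero rows rather than all of them, whereas !grepl(...) keeps every row.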

How to convert outcome of table function to a dataframe

df = data.frame(table(train$department , train$outcome))
Here department and outcome are both factors. is_outcome is binary. The result looks like the table below, with only 2 variables (fields), while I want the department column to be part of the dataframe, i.e. a dataframe of 3 variables:
                      0     1
Analytics          4840   512
Finance            2330   206
HR                 2282   136
Legal               986    53
Operations        10325  1023
Procurement        6450   688
R&D                 930    69
Sales & Marketing 15627  1213
Technology         6370   768
One way I learnt was...
df = data.frame(table(train$department , train$is_outcome))
write.csv(df,"df.csv")
rm(df)
df = read.csv("df.csv")
colnames(df) = c("department", "outcome_0","outcome_1")
but I cannot write a file to disk every time in my program.
Is there any way to do it directly?
Suppose you convert a matrix, trial, into a table, trial.table (for example with as.table()). The object trial.table looks exactly the same as the matrix trial, but it really isn't. The difference becomes clear when you transform these objects to a data frame. Take a look at the outcome of this code:
> trial.df <- as.data.frame(trial)
> str(trial.df)
'data.frame': 2 obs. of 2 variables:
$ sick : num 34 11
$ healthy: num 9 32
Here you get a data frame with two variables (sick and healthy), each with two observations. On the other hand, if you convert the table to a data frame, you get the following result:
> trial.table.df <- as.data.frame(trial.table)
> str(trial.table.df)
'data.frame': 4 obs. of 3 variables:
$ Var1: Factor w/ 2 levels "risk","no_risk": 1 2 1 2
$ Var2: Factor w/ 2 levels "sick","healthy": 1 1 2 2
$ Freq: num 34 11 9 32
Now you get a data frame with three variables. The first two (Var1 and Var2) are factor variables for which the levels are the values of the rows and the columns of the table, respectively. The third variable (Freq) contains the frequencies for every combination of the levels in the first two variables.
The as.data.frame() function converts a table to a data frame in a format that you need for regression analysis on count data. If you need to summarize the counts first, you use table() to create the desired table.
In fact, you also can create tables in more than two dimensions by adding more variables as arguments, or by transforming a multidimensional array to a table using as.table(). You can access the numbers the same way you do for multidimensional arrays, and the as.data.frame() function creates as many factor variables as there are dimensions.
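Applied to the question, the long format described above is what as.data.frame() gives for a table. If the wide layout from the question is wanted instead (one row per department, one column per outcome), one possible sketch uses as.data.frame.matrix() and then moves the row names into a column. The toy train data below is invented for illustration; only the column names department and is_outcome come from the question.

```r
# Toy stand-in for the asker's training data (values invented for illustration)
train <- data.frame(
  department = factor(c("HR", "HR", "Legal", "Legal", "Legal")),
  is_outcome = factor(c(0, 1, 0, 0, 1))
)

tab <- table(train$department, train$is_outcome)

# Long format: one row per (department, outcome) pair with columns Var1, Var2, Freq
long_df <- as.data.frame(tab)

# Wide format: keep the cross-tab layout, then turn row names into a real column
wide_counts <- as.data.frame.matrix(tab)
wide_df <- data.frame(
  department = rownames(wide_counts),
  outcome_0  = wide_counts[["0"]],
  outcome_1  = wide_counts[["1"]]
)
```

This avoids the round trip through write.csv() and read.csv(); which of the two shapes is more convenient depends on what the counts are used for next.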

Extracting the numbers from the data frame

I have a data frame with a "Calculation" column, which could be reproduced by the following code:
a <- data.frame(Id = c(1:3), Calculation = c('[489]/100','[4771]+[4777]+[5127]+[5357]+[5597]+[1044])/[463]','[1044]/[463]'))
> str(a)
'data.frame': 3 obs. of 2 variables:
$ Id : int 1 2 3
$ Calculation: Factor w/ 3 levels "[1044]/[463]",..: 3 2 1
Please note that there are two types of numbers in the "Calculation" column: most of them are surrounded by brackets, but some (in this case the number 100) are not (this has a meaning in my application).
What I would like to do is to extract all the distinct numbers that appear in Calculation column to return a vector with the union of these numbers. Ideally, I would like to be able to distinguish between the numbers that are between brackets and the numbers that are not. This step is not so important (if it makes it complicated) since the numbers that are NOT between the brackets are few and I can manually detect them. So the desired output in this case would be:
b = c(489,4771,4777,5127,5357,5597,1044,463)
Thanks in advance
We can use str_extract_all from library(stringr). Using a regex lookbehind ((?<=\\[)), we match the digits (\\d+) that are preceded by [, extract them into a list, unlist to convert it to a vector, change the character values to numeric (as.numeric), and get the unique elements.
library(stringr)
unique(as.numeric(unlist(str_extract_all(a$Calculation, '(?<=\\[)\\d+'))))
#[1] 489 4771 4777 5127 5357 5597 1044 463
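The question also asks, optionally, to tell bracketed numbers apart from bare ones. A possible base R sketch uses regmatches() and gregexpr() with perl = TRUE; the helper name extract_nums and both patterns are assumptions for illustration, not part of the answer above.

```r
a <- data.frame(
  Id = 1:3,
  Calculation = c("[489]/100",
                  "[4771]+[4777]+[5127]+[5357]+[5597]+[1044])/[463]",
                  "[1044]/[463]"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all distinct numeric matches of a Perl-style pattern
extract_nums <- function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  unique(as.numeric(unlist(hits)))
}

# Digits enclosed in [ and ]
bracketed <- extract_nums(a$Calculation, "(?<=\\[)\\d+(?=\\])")

# Digit runs that are neither preceded by [ or a digit nor followed by ] or a digit
bare <- extract_nums(a$Calculation, "(?<![\\[\\d])\\d+(?![\\]\\d])")
```

Here bracketed holds the eight numbers from the desired output and bare holds 100.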

How to transform a subset columns of an all string data frame in to numeric?

All of my data comes in character format. When I try transforming a subset of the data in to numeric using apply it doesn't seem to work.
df2 <- as.data.frame(matrix(as.character(1:9),3,3))
df2[,-2] <- apply(df2[,-2], 2, as.numeric)
apply(df2, 2, class)
Could somebody point me out what I am doing wrong in the example above?
Thanks
As commented above, apply() coerces its input to a matrix, and a matrix in R can only hold values of the same type in all columns. You cannot change some of the values to numeric and leave others as characters. If you want different data types, you can use a data.frame, but even then, you can only have one data type per column.
For your example case:
df2 <- as.data.frame(matrix(as.character(1:9),3,3))
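will create a data.frame with factors in each column (in R versions before 4.0.0, where stringsAsFactors defaulted to TRUE; from R 4.0.0 on, the columns are character). If you want to convert the second column to numeric, you can do: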
df2$V2 <- as.numeric(levels(df2$V2))[df2$V2]
Or
df2$V2 <- as.numeric(as.character(df2$V2))
So you don't need to use apply in this case.
str(df2)
#'data.frame': 3 obs. of 3 variables:
# $ V1: Factor w/ 3 levels "1","2","3": 1 2 3
# $ V2: num 4 5 6
# $ V3: Factor w/ 3 levels "7","8","9": 1 2 3
If you wanted to convert all columns to numeric, you can do:
# if the columns were factors before:
df2[] <- lapply(df2, function(i) as.numeric(levels(i))[i])
Or
# if the columns were characters before:
df2[] <- lapply(df2, as.numeric)
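To convert only a subset of columns and then check the result, a sketch like the following may help. It assumes character columns (stringsAsFactors = FALSE is set explicitly so the behavior does not depend on the R version). It also shows why the check in the question is misleading: apply(df2, 2, class) converts the data frame to a matrix before calling class.

```r
df2 <- as.data.frame(matrix(as.character(1:9), 3, 3), stringsAsFactors = FALSE)

# Convert every column except the second, one column at a time
df2[, -2] <- lapply(df2[, -2], as.numeric)

# Inspect classes column by column, without matrix coercion
col_classes <- sapply(df2, class)

# apply() first turns the mixed data frame into a character matrix,
# so every column is reported as "character" regardless of its real class
apply_classes <- apply(df2, 2, class)
```

str(df2) is another way to see the real column types.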

Reading csv file, having numbers and strings in one column

I am importing a 3 column CSV file. The final column is a series of entries which are either an integer, or a string in quotation marks.
Here are a series of example entries:
1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5
When I import this using read.csv, these are all just turned into factors.
How can I set it up such that these are read as integers and strings?
Thank you!
This is not possible, since a given vector can only have a single mode (e.g. character, numeric, or logical).
However, you could split the vector into two separate vectors, one with numeric values and the second with character values:
vec <- c("m", 20, "Canada", 4, 5)
vnum <- as.numeric(vec)
vchar <- ifelse(is.na(vnum), vec, NA)
vnum
[1] NA 20 NA 4 5
vchar
[1] "m" NA "Canada" NA NA
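The same split can be done right after import. This is a sketch under the assumption that the file content matches the five example rows from the question; it reads them from a string via the text argument of read.csv() instead of a file, and the column names V3_num and V3_chr are made up for illustration.

```r
# The example rows from the question, supplied as text instead of a file
csv_text <- '1,4,"m"\n1,5,20\n1,6,"Canada"\n1,7,4\n1,8,5'

# Read the third column as character rather than factor
dat <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# Entries that are not integers become NA; the coercion warning is expected
dat$V3_num <- suppressWarnings(as.integer(dat$V3))

# Keep the original string only where the integer conversion failed
dat$V3_chr <- ifelse(is.na(dat$V3_num), dat$V3, NA_character_)
```

Each row then has a value in exactly one of the two new columns.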
EDIT Despite the OP's decision to accept this answer, @Andrie's answer is the preferred solution. My answer is meant only to inform about some odd features of data frames.
As others have pointed out, the short answer is that this isn't possible. data.frames are intended to contain columns of a single atomic type. @Andrie's suggestion is a good one, but just for kicks I thought I'd point out a way to shoehorn this type of data into a data.frame.
You can convert the offending column to a list (this code assumes you've set options(stringsAsFactors = FALSE)):
dat <- read.table(textConnection("1,4,'m'
1,5,20
1,6,'Canada'
1,7,4
1,8,5"),header = FALSE,sep = ",")
tmp <- as.list(as.numeric(dat$V3))
tmp[c(1,3)] <- dat$V3[c(1,3)]
dat$V3 <- tmp
str(dat)
'data.frame': 5 obs. of 3 variables:
$ V1: int 1 1 1 1 1
$ V2: int 4 5 6 7 8
$ V3:List of 5
..$ : chr "m"
..$ : num 20
..$ : chr "Canada"
..$ : num 4
..$ : num 5
Now, there are all sorts of reasons why this is a bad idea. For one, lots of code that you'd expect to play nicely with data.frames will not like this and either fail, or behave very strangely. But I thought I'd point it out as a curiosity.
No. A dataframe is a series of pasted-together vectors (a list of vectors or matrices). Because each column is a vector, it cannot be classified as both integer and factor. It must be one or the other. You could split the vector apart into numeric and factor (a column for each), but I don't believe this is what you want.
