Subsetting in R using setDT to remove values [duplicate]

This question already has answers here:
Drop unused factor levels in a subsetted data frame
(16 answers)
Closed 5 years ago.
Hello, I am using RStudio to filter out varieties of wine that appear fewer than 5000 times in a dataset.
I have run the following code:
#create a new data frame with varieties appearing more than 5000 times
wineVar <- setDT(wineNew)[, if(.N > 5000) .SD, by = variety]
#list the unique varieties to show there are only 5
unique(wineVar$variety)
However, when I check how many levels there are, I still get all 632 original values.
[1] Cabernet Sauvignon Pinot Noir Chardonnay
[4] Bordeaux-style Red Blend Red Blend
632 Levels: Žilavka Agiorgitiko Aglianico Aidani Airen Albana Albarín ... Zweigelt
Is there a way to remove these levels completely? They are causing issues with my training set: the training set still sees the dropped varieties as possible values, but there is no data for them.

I think what you are looking for is this. You're almost there:
wineVar <- setDT(wineNew)
wineVar <- wineVar[, .SD[.N > 5000], by = variety]
wineVar[, variety := as.factor(as.character(variety))]
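An alternative is to subset first and then drop the unused factor levels with droplevels(). A minimal sketch, where the toy wineNew below stands in for the question's data:

```r
library(data.table)

# Toy stand-in for the question's data: one variety above the threshold, one below
wineNew <- data.frame(
  variety = factor(rep(c("Chardonnay", "Aglianico"), c(6000, 100))),
  points  = 90
)

wineVar <- setDT(wineNew)[, if (.N > 5000) .SD, by = variety]
# droplevels() discards factor levels that no longer occur in the subset
wineVar[, variety := droplevels(variety)]
levels(wineVar$variety)
# [1] "Chardonnay"
```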


R: Applying factor values from one column to another

I am trying to process municipal information in R, and it seems that factors (to be exact, factor()) are the best way to achieve my goal. I am only starting to get the hang of R, so I imagine my problem is probably very simple.
I have the following example dataframe to share (a tiny portion of Finnish municipalities):
municipality <- c("Espoo", "Oulu", "Tampere", "Joensuu", "Seinäjoki", "Kerava")
region <- c("Uusimaa", "Pohjois-Pohjanmaa", "Pirkanmaa", "Pohjois-Karjala", "Etelä-Pohjanmaa", "Uusimaa")
myData <- cbind(municipality, region)
myData <- as.data.frame(myData)
By default R converts my character columns into factors, which can be verified with str(myData). Now to the part where my beginner-level R skills end: I can't seem to find a way to apply the factor levels from column region to column municipality.
Let me demonstrate. Instead of having the original result
as.numeric(factor(myData$municipality))
[1] 1 4 6 2 5 3
I would like to get this, the factors from myData$region applied to myData$municipality.
as.numeric(factor(myData$municipality))
[1] 5 4 2 3 1 5
I welcome any help with open arms. Thank you.
To better understand the use of factor in R have a look here.
If you want to add factor levels, you have to do something like this in your dataframe:
> levels(myData$region)
[1] "Etelä-Pohjanmaa" "Pirkanmaa" "Pohjois-Karjala" "Pohjois-Pohjanmaa" "Uusimaa"
> levels(myData$municipality)
[1] "Espoo" "Joensuu" "Kerava" "Oulu" "Seinäjoki" "Tampere"
> levels(myData$municipality)<-c(levels(myData$municipality),levels(myData$region))
> levels(myData$municipality)
[1] "Espoo" "Joensuu" "Kerava" "Oulu" "Seinäjoki"
[6] "Tampere" "Etelä-Pohjanmaa" "Pirkanmaa" "Pohjois-Karjala" "Pohjois-Pohjanmaa"
[11] "Uusimaa"
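For the numeric mapping the question actually asks for (each municipality coded by its region's factor level), note that those codes come from the region factor itself, so a minimal sketch is:

```r
municipality <- c("Espoo", "Oulu", "Tampere", "Joensuu", "Seinäjoki", "Kerava")
region <- c("Uusimaa", "Pohjois-Pohjanmaa", "Pirkanmaa", "Pohjois-Karjala",
            "Etelä-Pohjanmaa", "Uusimaa")
# stringsAsFactors = TRUE makes the conversion explicit (required in R >= 4.0)
myData <- data.frame(municipality, region, stringsAsFactors = TRUE)

# The integer code of each row's region, in municipality order
as.numeric(myData$region)
# [1] 5 4 2 3 1 5
```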

removing variables containing certain string in r [duplicate]

This question already has answers here:
Remove Rows From Data Frame where a Row matches a String
(6 answers)
Delete rows containing specific strings in R
(7 answers)
Closed 4 years ago.
I have hundreds of observations and I'd like to remove the ones that contain the string "english basement". I can't seem to find the right syntax to do so; I can only figure out how to keep observations with that string. For instance, I used the code below to get only observations containing the string, and it worked perfectly:
eng_base <- zdata %>%
  filter(str_detect(ListingDescription, "english basement"))
Now I want a data set, top_10mpEB, that excludes observations containing "english basement". Your help is greatly appreciated.
I do not know what your data looks like, but maybe this example helps you. I think you just need to negate the logical vector returned by str_detect:
library(dplyr)
library(stringr)
zdata <- data.frame(ListingDescription = c(rep("english basement, etc",3), letters[1:2] ))
zdata
# ListingDescription
#1 english basement, etc
#2 english basement, etc
#3 english basement, etc
#4 a
#5 b
zdata %>%
filter(!str_detect(ListingDescription, "english basement"))
# ListingDescription
#1: a
#2: b
Or using data.table package (no need of stringr::str_detect):
library(data.table)
setDT(zdata)
zdata[! ListingDescription %like% "english basement"]
# ListingDescription
#1: a
#2: b
You can do this using grepl():
x <- data.frame(ListingDescription = c('english basement other words description continued',
'great fireplace and an english basement',
'no basement',
'a house with a sauna!',
'the pool is great... and wait till you see the english basement!',
'new listing...will go fast'),
rent = c(3444, 23444, 346, 9000, 1250, 599))
x_english_basement <- x[grepl('english basement',
x$ListingDescription)==FALSE, ]
You can use dplyr to easily filter your dataframe. Note that == would only drop rows that exactly equal the string; to drop rows that merely contain it, negate str_detect:
library(dplyr)
library(stringr)
new_data <- data %>%
  filter(!str_detect(ListingDescription, "english basement"))
The ! became my best friend once I realized it means "not".

Testing row by row for equality in two columns (factor/character data)

I am trying to test for equality in two columns, row by row. However, my data is not numeric. The data set I am working with was merged from two data sets, and going through it I noticed that columns that should be identical are actually different. This is an extremely large data set (approx. 300K observations), so I am trying to solve it with code rather than by hand.
E.g. Source.x is from the 1st data set in the merge function and Source.y is from the 2nd data set in the merge function.
RightID Source.x Source.y
1000 Ground Unnamed Stream
1001 Ground Ground
1002 Stream Stream
1003 Bear Creek Ground
I would like to return a new data frame containing just observations 1000 and 1003, since those are the rows where the two columns differ. I have tried the following code...
lapply(rights, rights$Source.x == rights$Source.y)
filter(rights, rights$Source.x == rights$Source.y)
filter(rights, identical(Source.x, Source.y))
However, because the data is in factor/character format with different levels due to the variability in the source names, none of my code has worked. Source.x has 6743 levels and Source.y has 6457. As far as I can tell, there is no posted solution that elaborates on this levels issue. Any suggestions would be much appreciated.
You can specify the levels for each of the factors so that they are consistent. Just create a unique list of levels using both columns, then compare with != to keep the rows where the two sources differ:
levels <- sort(unique(unlist(rights[, c('Source.x', 'Source.y')])))
rights$Source.x <- factor(rights$Source.x, levels = levels)
rights$Source.y <- factor(rights$Source.y, levels = levels)
result <- rights[rights$Source.x != rights$Source.y, ]
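If you only need the mismatching rows (not aligned factors), a simpler route is to compare the columns as character, which sidesteps the differing level sets entirely. A minimal sketch with the example data from the question:

```r
rights <- data.frame(
  RightID  = 1000:1003,
  Source.x = factor(c("Ground", "Ground", "Stream", "Bear Creek")),
  Source.y = factor(c("Unnamed Stream", "Ground", "Stream", "Ground"))
)

# Coerce both factors to character so the comparison ignores the level sets
mismatches <- rights[as.character(rights$Source.x) != as.character(rights$Source.y), ]
mismatches$RightID
# [1] 1000 1003
```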

Criteria for deciding which character columns should be converted to factors

I have been working through the book "Analyzing Baseball Data with R" by Marchi and Albert and am wondering about an issue which they don't address.
Many of the datasets I need to import are fairly large (though not really "Big" in the sense of "Big Data"). For example, the Retrosheet Game Logs have 1 csv file per year dating back to 1871 where each file has a row for each game played that year, and 161 columns. When I read it into a dataframe using read.csv() using the default setting on stringsAsFactors fully 75 of the 161 columns become factors. Some of these columns conceptually are factors (such as one containing "D" or "N" for day or night games) but others are probably better left as strings (many of the columns contain names of starting pitchers, closers, etc.) I know how to convert columns from factors to strings or vice versa, but I don't want to have to scan through 161 columns, making an explicit decision for 75 of them.
The reason I think it important is that I've noticed that conceptually small dataframes obtained by subsetting these game logs are surprisingly large given the need to retain the full factor information. For example, given the dataframe GL2016 obtained from downloading, unzipping and the reading in the file, object.size(GL2016) is about 2.8 MB, and when I use:
df <- with(GL2016,GL2016[V7 == "CLE" & V13 == "D",])
to extract the home day games played by the Cleveland Indians in 2016, I get a df with 26 rows. 26/2428 (where 2428 is the number of rows in the whole dataframe) is slightly more than 1%, but object.size(df) is around 1.3 MB, which is far more than 1% of the size of GL2016.
I came up with an ad-hoc solution. I first defined a function:
big.factor <- function(v,k){is.factor(v) && length(levels(v)) > k}
And then used mutate_if from dplyr like thus:
GL2016 %>% mutate_if(function(v){big.factor(v,30)},as.character) -> GL2016
30 is the number of teams in the MLB and I somewhat arbitrarily decided that any factor with more than 30 levels should probably be treated as a string.
After this code has been run, the number of factor variables has been reduced from 75 to 12. It works in the sense that even though now GL2016 is around 3.2 MB (slightly larger than before), if I now subset the dataframe to pull out the Cleveland day games, the resulting dataframe is just 0.1 MB.
Questions:
1) What criteria (hopefully less ad-hoc than what I used above) are relevant for deciding which character columns should be converted to factors when importing a large data set?
2) I am aware of the cost in terms of memory footprint of converting all character data to factors, but am I incurring any hidden costs (say in processing time) when I convert most of these factors back into strings?
Essentially, I think what you need to do is:
df <- with(GL2016,GL2016[V7 == "CLE" & V13 == "D",])
df <- droplevels(df)
The droplevels() function will remove all the unused factor levels, and thus reduce the size of df immensely.
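As for the first question, one somewhat less ad-hoc criterion (still a heuristic, and the 0.1 threshold below is my own assumption, not a standard) is to read everything as character and convert only those columns whose number of distinct values is small relative to the row count. Sketched with a toy data frame standing in for the real game logs:

```r
# Toy stand-in for a game-log data frame read with stringsAsFactors = FALSE
logs <- data.frame(
  day_night = rep(c("D", "N"), 50),       # 2 distinct values in 100 rows
  pitcher   = paste0("pitcher_", 1:100),  # all values distinct
  stringsAsFactors = FALSE
)

# Convert a character column to factor only if its distinct-value ratio is low
should_be_factor <- function(v, max_ratio = 0.1) {
  is.character(v) && length(unique(v)) / length(v) <= max_ratio
}

cols <- vapply(logs, should_be_factor, logical(1))
logs[cols] <- lapply(logs[cols], factor)

str(logs)  # day_night is now a factor; pitcher stays character
```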

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt", header = TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store each column in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to compute:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
With the first option generally being preferred (from what I've seen).
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
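Putting the column-access rules together for the question's file (read.csv(text = ...) is used here only to make the sketch self-contained; with a real file you would pass the filename):

```r
tbl <- read.csv(text = "area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10", header = TRUE)

# Extract each column as a plain numeric vector
area <- tbl$area
energy <- tbl$energy
is.numeric(area)
# [1] TRUE
```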
