Create Column in Data Frame that Indicates Repeated Value in Another Column

Say I have a data table like this in R:
[screenshot of the example data table]
And I want to add a column to this table that indicates whether the person switched majors (e.g. "Y" for switched, "N" for didn't switch). How would I do that? I've tried using the count and unique functions but don't know how to proceed.

You can simply add a column IsSwitched by using the by clause of data.table:
DT[, IsSwitched := ifelse(.N > 1, "Y", "N"), by = Id]
Where DT is your data.table.
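For example, a minimal sketch with made-up data (the Id and Major column names are assumptions; the question's table is only shown as a screenshot):
library(data.table)
# hypothetical data: person 1 switched majors, person 2 did not
DT <- data.table(Id = c(1, 1, 2), Major = c("Math", "Physics", "Biology"))
DT[, IsSwitched := ifelse(.N > 1, "Y", "N"), by = Id]
DT
#    Id   Major IsSwitched
# 1:  1    Math          Y
# 2:  1 Physics          Y
# 3:  2 Biology          N
Note that .N counts rows per Id, so this assumes each row is one recorded major.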

Related

standardize column names - search for a set of column names and then use the one available for the new standardized columns

I am preparing and standardizing data. Part of that is standardizing column names. I use data.table.
What I want is this: I want to create a new, self-defined standardized column name, and set up my code so that it searches a specified vector of column names in the original data; if it finds any of these columns, it uses that one to fill in the standardized column.
I appreciate it might not be clear, so here is an example. In the below, I want to create the new standardized column name WEIGHT. I want to search the column names in dat for any of these (wt, WT, WTBL) and, if it finds one of them, use it for the new column WEIGHT.
library(data.table)
library(car)
dat <- as.data.table(mtcars)
dat[, WEIGHT := wt] # this is what we do normally - but I want to make it semi-automatic, so that I search a vector of column names and use the one that is available to fill in the WEIGHT column
dat[, WEIGHT := colnames(dat %in% c('wt','WT','WTBL'))] # this is wrong and is where I need help!
There's probably a simpler construction of this, but here's an attempt. The mget() call attempts to grab each value in order, returning NULL for any name not found.
Then the first non-NULL value is used to fill the new column:
dat[, WEIGHT := {
  m <- mget(c("WTBL", "wt", "WT"), ifnotfound = list(NULL))
  m[!sapply(m, is.null)][1]
}]
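A rough alternative sketch (not from the original answer): pick the first candidate name actually present in the data with intersect(), assuming the same dat and candidate vector:
# candidates present in dat, kept in preference order
cand <- intersect(c("WTBL", "wt", "WT"), names(dat))
if (length(cand) > 0) dat[, WEIGHT := get(cand[1])]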

Is there any function to split data in R, depending on a certain value?

I want to create a new table with these conditions: all the rows where transition is "13-14" and size_age is "j".
You can use dplyr::filter with conditions to get your desired subset:
dplyr::filter(df, transition =="13-14", size_age=="j")
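For comparison, a base R sketch of the same subset (assuming df has transition and size_age columns as described):
subset(df, transition == "13-14" & size_age == "j")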

Filter a chunk of rows based on a specific value in another column in R

[screenshot: example of constructed data table]
I would like to be able to filter rows of data based on whether a particular value exists in another column. The rows I would like to extract all share the same Material number. In the example I provided, Material #U83231036 has the value "ZHLB (ConAgra Semifinished prod)" in one of its two rows in the Material_Type_Comp column. I want to be able to extract the two rows of data related to that Material # because that value exists in the Material_Type_Comp column for one of the rows.
What is the best way to go about doing this?
One option is to do a filter by group:
library(dplyr)
df1 %>%
  group_by(Material) %>%
  filter("ZHLB (ConAgra Semifinished prod)" %in% Material_Type_Comp)
# or use any() with `==`
# filter(any(Material_Type_Comp == "ZHLB (ConAgra Semifinished prod)"))
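A quick usage sketch with made-up data, since df1 is only shown as a screenshot in the question (the second Material number here is invented):
df1 <- data.frame(
  Material = c("U83231036", "U83231036", "U99999999"),
  Material_Type_Comp = c("ZHLB (ConAgra Semifinished prod)", "ROH", "ROH")
)
df1 %>%
  group_by(Material) %>%
  filter("ZHLB (ConAgra Semifinished prod)" %in% Material_Type_Comp)
# keeps both rows for Material U83231036; the other group is dropped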

R update table column based on search string from another table

I am trying to update cell B in a table based on the value of cell A in the same table. To filter the rows I want to update, I am using grepl to compare cell A to a list of character strings from a list/table/vector or some other external source. For all rows where cell A matches the search criteria, I want to update cell B to say "xxxx". I need to do this for all rows in my table.
So far I have something like this where cat1 is a list of some sort that has strings to search for.
library(dplyr)
library(magrittr) # for %<>%
for (x in seq_along(cat1)) {
  data %<>% mutate(Cat = ifelse(grepl(cat1[x], ItemName), "xxx", Cat))
}
I am open to any better way of accomplishing this. I've tried for loops with dataframes and I'm open to a data.table solution.
Thank you.
To avoid the loop you can collapse the character vector with | and then use it as a single pattern in grepl. For example you can try:
cat1_collapsed <- paste(cat1, collapse = "|")
data %>% mutate(Cat = ifelse(grepl(cat1_collapsed, ItemName),"xxx", Cat))
Or the equivalent using data.table (or base R of course).
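For instance, a sketch of that data.table equivalent (assuming the same data, ItemName, and Cat names):
library(data.table)
setDT(data)
data[grepl(cat1_collapsed, ItemName), Cat := "xxx"]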
Use the following code, assuming you have a data frame called "data" with columns "A" and "B" and that "cat1" is a vector of the desired strings, as described (note that %in% matches whole strings exactly, rather than grepl-style patterns):
library(data.table)
setDT(data)
data[A %in% cat1, B := "XXXX"]

Fastest way/algorithm to find count of unique rows of a sorted file

I currently use .N to find the number of unique rows in a file using by= ... .
For example, to find the count of unique rows of col1 and col2 in a data table, dt, the query would be:
dt[, .N, by="col1,col2"]
For very large files this could take a very long time. If the table is sorted, is there a faster way to do this? Basically, you could keep a counter and increment it once each time a new unique row is encountered, touching each row only a single time. I can't use a for loop, as that would take forever.
unique.data.table is very different from base R's unique in the sense that unique.data.table fetches unique values based on only the key columns of the data.table if a key is set. To explain this with an example, try this:
dt <- data.table(x=c(1,1,1,2,2), y=c(5,6,6,7,8))
unique(dt) # no key set, similar to 'unique.data.frame' output
# set key now
setkey(dt, "x")
unique(dt) # unique based on just column x
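Note that in current data.table versions (1.9.8 and later), unique() defaults to using all columns even when a key is set, so to reproduce the key-based behaviour described above you would pass by explicitly:
unique(dt, by = key(dt)) # unique based on just column x in data.table >= 1.9.8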
If you want just the total number of unique rows, try the following:
setkeyv(dt, c("col1", "col2"))
nrow(unique(dt))
On your question:
dt[, .N, by="col1,col2"]
does not actually give you the number of unique rows, while either of these two does:
dt[, .N, by="col1,col2"][, .N] # data.table solution
nrow(dt[, .N, by="col1,col2"]) # data.frame syntax applied to data.table
My answer to your question:
A core feature of the data.table package is to work with a key. On p. 2 of the short introduction to the data.table package it reads:
Furthermore, the rows are sorted by the key. Therefore, a data.table
can have at most one key, because it cannot be sorted in more than one
way.
Thus, unless you have a column defining the sort order that you can set as the key, the fact that your data are sorted will be of no advantage. You thus need to set the key. For your purpose (large data files, thus presumably many columns), you would want to include all of the columns in your dataset when setting the key:
setkeyv(dt, names(dt)) # use key(dt) to check whether this went as expected
unique(dt)[, .N] # or nrow(unique(dt))
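For reference, newer data.table versions also provide uniqueN(), which counts unique rows directly; a minimal sketch:
library(data.table)
dt <- data.table(col1 = c(1, 1, 2), col2 = c("a", "a", "b"))
uniqueN(dt, by = c("col1", "col2")) # 2 unique (col1, col2) combinations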
PS: please provide us with a reproducible dataset, so we can assess what you consider fast or slow.
