Filter table by column in R

I would like to filter a table when the column name is stored in a variable. I tried the code below, but it did not work. dat is a data frame, the column is named Name, and I would like to keep the rows where it equals "John".
colname <- "Name"
dat[dat$colname %in% "John",]
It works fine if I do not use a variable for the column name (the code below works):
dat[dat$"Name" %in% "John",]

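You may use the bracket functions [[ or [. Unlike $, they evaluate colname, whereas dat$colname looks for a column that is literally named colname.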
colname <- "Name"
dat[dat[[colname]] %in% "John", ]
dat[dat[, colname] %in% "John", ] # or
# Name X1 X2
# 8 John 0.8646536 1.2688507
# 9 John -1.7201559 -0.3125515
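The reason the original attempt returns no rows can be checked directly: $ does not evaluate its argument, so dat$colname asks for a column literally called colname and yields NULL. A minimal sketch on made-up data (toy and filter_by are hypothetical names, not part of the question):

```r
# Toy data; `toy` and `filter_by` are illustrative names only
toy <- data.frame(Name = c("Mary", "John", "Linda", "John"),
                  X1 = c(1.5, 2.5, 3.5, 4.5))
colname <- "Name"

# `$` takes the name literally: there is no column called "colname"
is.null(toy$colname)

# `[[` evaluates the variable, so this selects the Name column
filter_by <- function(df, col, value) {
  df[df[[col]] %in% value, , drop = FALSE]
}
john_rows <- filter_by(toy, colname, "John")
john_rows$X1
```

Wrapping the lookup in a small function like this keeps the column name a plain string argument, which is usually what is wanted when the name comes from user input or a loop.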
Data
dat <- structure(list(Name = structure(c(3L, 3L, 2L, 4L, 4L, 2L, 3L,
1L, 1L, 2L), .Label = c("John", "Linda", "Mary", "Olaf"), class = "factor"),
X1 = c(0.758396178001042, -1.3061852590117, -0.802519568703793,
-1.79224083446114, -0.0420324540227439, 2.15004261784474,
-1.77023083820321, 0.864653594565389, -1.72015589816109,
0.134125668141181), X2 = c(-0.0758265646523722, 0.85830054437592,
0.34490034810227, -0.582452690107777, 0.786170375925402,
-0.692099286413293, -1.18304353631275, 1.26885070606311,
-0.31255154601115, 0.0305712590978896)), class = "data.frame", row.names = c(NA,
-10L))

An approach with dplyr using non-standard evaluation. Using @jay.sf's data
library(dplyr)
dat %>% filter(!!sym(colname) == "John")
# Name X1 X2
#1 John 0.864654 1.268851
#2 John -1.720156 -0.312552
In data.table, we can use get
library(data.table)
setDT(dat)[get(colname) == "John"]
Since we have only one value to compare we can use == here instead of %in%.
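One caveat on == versus %in%, sketched on made-up data (people_na is a hypothetical name): == propagates NA, while %in% returns FALSE for NA. With base R bracket indexing, an NA in the logical index produces a row of NAs; filter() and data.table's i drop such rows, so there the two operators behave alike for a single value.

```r
# Hypothetical data with a missing name
people_na <- data.frame(Name = c("John", NA, "Mary"), id = 1:3)

eq  <- people_na$Name == "John"    # NA stays NA
inn <- people_na$Name %in% "John"  # NA becomes FALSE

# Base R indexing keeps an all-NA row for the NA entry in `eq`
rows_eq <- people_na[eq, ]
rows_in <- people_na[inn, ]
nrow(rows_eq)
nrow(rows_in)
```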

With data.table, we can use eval with as.symbol
library(data.table)
setDT(dat)[eval(as.symbol(colname)) == "John"]
# Name X1 X2
#1: John 0.8646536 1.2688507
#2: John -1.7201559 -0.3125515

Related

remove double quotes from factors in a dataframe

I got a dataframe to work on where I have a bunch of variables as factors in quotation marks like ""x1"".
str(df) gives me something like this:
$ x : Factor w/ 10 Levels "\"\"x1\"\"",..: 1 7 9 ...
I tried to get rid of the quotation marks with the gsub() function, but that didn't work, probably because I don't know what to use as the pattern. It would be great if somebody could solve this puzzle and maybe explain to me whether "\"\"x1\"\"" is the solution to this.
An example for the dataframe would look like this:
structure(list(Sent = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("\"\"Opted out\"\"",
"\"\"Yes\"\""), class = "factor"), Responded = structure(c(2L,
2L, 2L, 2L, 2L), .Label = c("\"\"Complete\"\"", "\"\"No\"\"",
"\"\"Partial\"\""), class = "factor")), row.names = c(NA, -5L
), class = c("tbl_df", "tbl", "data.frame"), .Names = c("Sent",
"Responded"))
Thanks in advance!
vec = c('""x1""', '""x2""', '""x3""')
vec = factor(vec)
levels(vec) <- gsub('["\\]', "", levels(vec))
#> vec
#[1] x1 x2 x3
#Levels: x1 x2 x3
Note that I use ' as the string delimiter when I want to use " inside a string.
Another reason it didn't work for you is probably that you applied gsub() to the factor variable itself rather than to its levels attribute.
Factor variables are stored internally as integer codes (1, 2, 3, ...) that index the levels.
As you now have provided data, you can use: (df1 being your data with the factor columns)
df1[] <- lapply(df1, function(vec){ levels(vec) <- gsub('["\\]',"",levels(vec)); vec})
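To make the difference between the factor and its levels concrete, here is a small sketch on made-up data (f and direct are hypothetical names). It uses fixed = TRUE, which is my own choice here to avoid regex escaping, and it removes only the quote characters:

```r
# Hypothetical factor with doubled quotes in its labels
f <- factor(c('""Yes""', '""No""', '""Yes""'))

# gsub() on the factor itself returns a plain character vector
direct <- gsub('"', "", f, fixed = TRUE)
class(direct)

# Editing the levels keeps the object a factor and cleans the labels
levels(f) <- gsub('"', "", levels(f), fixed = TRUE)
levels(f)
```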

Count values in column then reset [duplicate]

I am trying to create a column that numbers the rows for each name and restarts from 1 each time the name changes, like this:
NAME ID
PIERRE 1
PIERRE 2
PIERRE 3
PIERRE 4
JACK 1
ALEXANDRE 1
ALEXANDRE 2
Reproducible data
structure(list(NAME = structure(c(3L, 3L, 3L, 3L, 2L, 1L, 1L), .Label =
c("ALEXANDRE",
"JACK", "PIERRE"), class = "factor")), class = "data.frame", row.names
= c(NA,
-7L))
You could build a sequence along the elements in each group (= Name):
ave(1:nrow(df), df$NAME, FUN = seq_along)
Or, if names may reoccur later on, and it should still count as a new group (= Name-change), e.g.:
groups <- cumsum(c(FALSE, df$NAME[-1]!=head(df$NAME, -1)))
ave(1:nrow(df), groups, FUN = seq_along)
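The two variants only differ when a name comes back after another name. A short sketch on made-up data (df_demo, per_name and per_run are hypothetical names) shows both side by side:

```r
# Hypothetical data in which "A" reappears after "B"
df_demo <- data.frame(NAME = c("A", "A", "B", "A", "A"),
                      stringsAsFactors = FALSE)
idx <- seq_len(nrow(df_demo))

# Variant 1: one counter per name, continued when the name reappears
per_name <- ave(idx, df_demo$NAME, FUN = seq_along)

# Variant 2: a new group starts at every change of name
groups  <- cumsum(c(FALSE, df_demo$NAME[-1] != head(df_demo$NAME, -1)))
per_run <- ave(idx, groups, FUN = seq_along)

cbind(df_demo, per_name, per_run)
```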
Using dplyr and data.table:
library(dplyr)
library(data.table) # for rleid()
df %>%
  group_by(ID_temp = rleid(NAME)) %>%
  mutate(ID = seq_along(ID_temp)) %>%
  ungroup() %>%
  select(-ID_temp)
Or just data.table:
setDT(df)[, ID := seq_len(.N), by=rleid(NAME)]
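rleid() gives each run of identical consecutive values its own id. If data.table is not available, the same idea can be sketched in base R with rle(); run_id below is a hypothetical helper, not a data.table function, and it assumes there are no NA values in the input:

```r
# Hypothetical base R stand-in for a run-length id
run_id <- function(x) {
  r <- rle(as.character(x))
  rep(seq_along(r$lengths), times = r$lengths)
}

names_demo <- c("PIERRE", "PIERRE", "JACK", "PIERRE")
ids <- run_id(names_demo)

# Counter that restarts at 1 for every run
counter <- sequence(rle(names_demo)$lengths)
data.frame(NAME = names_demo, run = ids, ID = counter)
```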
Here's a quick way to do it.
First you can set up your data:
mydata <- data.frame("name"=c("PIERRE", "ALEX", "PIERRE", "PIERRE", "JACK", "PIERRE", "ALEX"))
Next, I add a dummy column of 1s that makes the solution inelegant:
mydata$placeholder <- 1
Finally, I add up the placeholder column (cumulative sum), grouped by the name column:
mydata$ID <- ave(mydata$placeholder, mydata$name, FUN=cumsum)
Since I started with unsorted names, my dataframe is currently unsorted, but that can be fixed with:
mydata <- mydata[order(mydata$name, mydata$ID),]

arranging strings from one data frame based on another one

I have a data frame like this one
df1<- structure(list(V1 = structure(c(8L, 4L, 5L, 7L, 6L, 3L, 9L, 1L,
2L), .Label = c("A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4", "A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920",
"C1P641;C1P640;A0A061AD21;G5EEV6", "O16276", "O16520-2", "O17323-2",
"O17395", "O17403", "Q22501;A0A061AE05"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-9L))
My second data frame looks like this
df2<- structure(list(From = structure(c(12L, 10L, 11L, 8L, 7L, 1L,
9L, 15L, 2L, 5L, 13L, 3L, 16L, 6L, 4L, 14L), .Label = c("A0A061AD21",
"A0A061AE05", "A0A061AJ82", "A0A061AJK8", "A0A061AKW6", "A0A061AL89",
"C1P640", "C1P641", "G5EEV6", "O16276", "O17395", "O17403", "Q19219",
"Q21920", "Q22501", "Q7JLR4"), class = "factor"), To = structure(c(4L,
8L, 1L, 5L, 5L, 5L, 5L, 6L, 6L, 2L, 2L, 2L, 2L, 3L, 3L, 7L), .Label = c("aat-3",
"CELE_F08G5.3", "CELE_R11A8.7", "cpsf-2", "epi-1", "pps-1", "R11A8.7",
"ugt-61"), class = "factor")), .Names = c("From", "To"), class = "data.frame", row.names = c(NA,
-16L))
df2 was derived from df1, but some information was added and some was removed. I want to reshape df2 to follow df1 and arrange the To column accordingly.
So the output should look like this:
From To
O17403 cpsf-2
O16276 ugt-61
O16520-2 -
O17395 aat-3
O17323-2 -
C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
Q22501;A0A061AE05 pps-1
A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
That is, O17403 is in df2 and is the only string in its row of df1, so it stays the same. O16276 is also the only string in its row of df1, so it stays the same as well.
O16520-2 is in df1 but not in df2, so its entry in the To column is a hyphen.
The same goes for the rest, until C1P641;C1P640;A0A061AD21;G5EEV6: these are all in the same row of df1 and their To is the same, so we keep them together as in df1 and add a single epi-1.
Probably the best approach is to use df1 as a template and fill in To: rows with matches in df2 get the matching To values, and rows without any get only a hyphen.
It is very complicated, and I could not even think of how to do it. I will appreciate any help.
To solve this, I split the semicolon-delimited strings and created a nested for-for-if-if loop.
Here's the logic behind the loop, which runs over the data.frame of split strings (tmp):
Fix the data classes (i.e. change factor to character to avoid conflicting level sets) and append a temporary To column to tmp.
For each row and column of tmp, check whether the cell contains a valid string for matching and has a matching value in df2$To; if not, go to the next iteration.
If it does, look at the matching To value from df2 and check whether tmp$To already holds it (if so, go to the next iteration).
If it is a new value, put it in the corresponding cell of tmp$To, appending it to any preceding matches with a semicolon if it is not the first match for that row.
df1$V1 <- as.character(df1$V1)
df2$From <- as.character(df2$From)
df2$To <- as.character(df2$To)
library(stringr)
tmp <- as.data.frame(str_split_fixed(df1$V1, ";", n = 5), stringsAsFactors = FALSE)
tmp$To <- as.character(NA)
for(j in 1:nrow(tmp)){
  for(i in 1:ncol(tmp)){
    if(length(df2$To[df2$From == tmp[j,i]]) == 0 || is.null(tmp[j,i])){
      # no match in df2 for this cell (this also covers empty padding cells)
      next
    } else if(length(df2$To[df2$From == tmp[j,i]]) == 1 && !is.na(tmp[j,i])){
      if(is.na(tmp$To[j]) || tmp$To[j] == df2$To[df2$From == tmp[j,i]]){
        # first match for this row, or the same value as already stored
        tmp$To[j] <- df2$To[df2$From == tmp[j,i]]
      } else{
        # a different value: append it after a semicolon
        tmp$To[j] <- paste(tmp$To[j], ";", df2$To[df2$From == tmp[j,i]], sep = "")
      }
    } else{
      next
    }
  }
}
df1 <- data.frame(From=df1$V1, To=tmp$To)
df1
From To
1 O17403 cpsf-2
2 O16276 ugt-61
3 O16520-2 <NA>
4 O17395 aat-3
5 O17323-2 <NA>
6 C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
7 Q22501;A0A061AE05 pps-1
8 A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
9 A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
One way of doing this is to use the splitstackshape package (use cSplit). I converted the factors to character strings to simplify (and get rid of warnings).
library(dplyr)
library(data.table) # cSplit from 'splitstackshape' returns a 'data.table'.
library(splitstackshape)
### Remove the factors for convenience of manipulation
df1 <- df1 %>% mutate(From = as.character(V1))
df2 <- df2 %>% mutate(From = as.character(From), To = as.character(To))
### 'cSplit' will split on ';' and create a new row for each item. The
### original 'From' column is kept around as cSplit removes the split column.
### 'rn' (row number) is used for ordering later.
cSplit(df1 %>% mutate(rn = row_number(), From_temp = From),
"From_temp", sep = ";", direction = "long", drop = FALSE, type.convert = FALSE) %>%
left_join(df2, by = c(From_temp = 'From')) %>% # Join to 'df2' to get the 'To' column
group_by(From, rn) %>% # Group by original 'From' column.
summarise(To = paste(sort(unique(na.omit(To))), collapse = ';'), # Create 'To' by joining 'To' Values
To = ifelse(To=='', '-', To)) %>% # Set empty values to '-'
ungroup %>%
arrange(rn) %>% # Sort by original row number and
select(-rn) # remove 'rn' column.
## From To
## <chr> <chr>
## 1 O17403 cpsf-2
## 2 O16276 ugt-61
## 3 O16520-2 -
## 4 O17395 aat-3
## 5 O17323-2 -
## 6 C1P641;C1P640;A0A061AD21;G5EEV6 epi-1
## 7 Q22501;A0A061AE05 pps-1
## 8 A0A061AKW6;Q19219;A0A061AJ82;Q7JLR4 CELE_F08G5.3
## 9 A0A061AL89;A0A061AJK8;Q21920-2;Q21920-7;Q21920 CELE_R11A8.7;R11A8.7
There may be a cleaner way to do this with dplyr that doesn't require splitstackshape.
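For comparison, the same split, look up and collapse idea can be sketched in base R without extra packages. This is only a sketch on made-up data: map_ids, ids_demo and lookup_demo are hypothetical names, and it assumes each From value appears at most once in the lookup table, since match() returns only the first hit.

```r
# Hypothetical toy inputs mirroring the shape of df1$V1 and df2
ids_demo <- c("O17403", "O16520-2", "C1P641;C1P640", "A1;A2")
lookup_demo <- data.frame(
  From = c("O17403", "C1P641", "C1P640", "A1", "A2"),
  To   = c("cpsf-2", "epi-1", "epi-1", "x-1", "x-2"),
  stringsAsFactors = FALSE
)

map_ids <- function(ids, lookup) {
  parts <- strsplit(ids, ";", fixed = TRUE)
  vapply(parts, function(p) {
    # To values for every id in this row, in order of appearance
    hits <- unique(lookup$To[match(p, lookup$From)])
    hits <- hits[!is.na(hits)]
    # "-" when nothing matched, otherwise the distinct values joined by ";"
    if (length(hits) == 0) "-" else paste(hits, collapse = ";")
  }, character(1))
}

result_demo <- data.frame(From = ids_demo,
                          To = map_ids(ids_demo, lookup_demo),
                          stringsAsFactors = FALSE)
result_demo
```

Unlike the dplyr pipeline above, this keeps the To values in the order in which the ids appear in each row rather than sorting them.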

data.table weird behaviour when used in a function

I have a data.frame as follows.
data <- structure(list(V1 = structure(1:3, .Label = c("S01", "S02", "S03"), class = "factor"), V2 = structure(c(1L, 3L, 2L), .Label = c("Alan", "Bruce", "Jay"), class = "factor"), V3 = structure(c(3L, 1L, 2L), .Label = c("Barry", "Dick", "Hal"), class = "factor"), V4 = structure(c(1L, 3L, 2L), .Label = c("Guy", "Jean-Paul", "Wally"), class = "factor"), V5 = structure(c(3L, 1L, 2L), .Label = c("Bart", "Damien", "John"), class = "factor")), .Names = c("V1", "V2", "V3", "V4", "V5"), class = "data.frame", row.names = c(NA, -3L))
It is not a data.table
is.data.table(data)
[1] FALSE
I have a function foo for example which utilizes data.table for doing some manipulations in the data.frame as follows.
foo <- function(df) {
if(!is.data.frame(df)) stop('"df" is not a data.frame')
setDT(df)
setkey(df, V1)
df[, "NEW" := paste0(V3, V4), with = FALSE]
setDF(df)
return(df)
}
However when I run the function with the data.frame data (not a data.table), the output out is a data.frame (because of setDF(df)).
out <- foo(data)
is.data.table(out)
[1] FALSE
But now the original data.frame data is a data.table.
is.data.table(data)
[1] TRUE
I understand this is because data.table works by reference. But how should I deal with this when data.table is used inside a function? I don't want to inadvertently change any data.frame in the environment. Should I always force a copy with copy or <- instead of using setDT whenever data.table is used in a function, or is there another way?
With regard to
is there another way?
Instead of setDT() inside the function, you could use as.data.table()
foo <- function(df) {
if(!is.data.frame(df)) stop('"df" is not a data.frame')
df <- as.data.table(df)
setkey(df, V1)
df[, NEW := paste0(V3, V4)]
setDF(df)
return(df)
}
foo(data)
# V1 V2 V3 V4 V5 NEW
# 1 S01 Alan Hal Guy John HalGuy
# 2 S02 Jay Barry Wally Bart BarryWally
# 3 S03 Bruce Dick Jean-Paul Damien DickJean-Paul
is.data.table(data)
# [1] FALSE
For some examples of functions that turn the input data frame into a data table but do not change the original data frame at all, I'd definitely recommend looking at source code for the functions in package splitstackshape.
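The copy route mentioned in the question can be sketched as well: take a deep copy with copy() first and convert only that copy by reference. foo_copy and demo are hypothetical names, and the block is guarded so that it only runs when data.table is installed.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  # Hypothetical variant of foo() that works on a deep copy
  foo_copy <- function(df) {
    if (!is.data.frame(df)) stop('"df" is not a data.frame')
    dt <- copy(df)   # deep copy, so the caller's object is not touched
    setDT(dt)        # convert the copy by reference
    dt[, NEW := paste0(V3, V4)]
    setDF(dt)        # hand back a plain data.frame
    dt
  }

  demo <- data.frame(V3 = c("Hal", "Barry"), V4 = c("Guy", "Wally"))
  out_demo <- foo_copy(demo)
  is.data.table(demo)  # the input is still a plain data.frame
}
```

Either way the input is copied once; as.data.table() does the copy and the conversion in one call, while copy() followed by setDT() makes the copy explicit.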

Check whether value in one dataframe is in another (larger) dataframe

I'm struggling to come up with a vectorised solution to the following problem. I have two dataframes:
> people <- data.frame(name = c('Fred', 'Bob'), profession = c('Builder', 'Baker'))
> people
name profession
1 Fred Builder
2 Bob Baker
> allowed <- data.frame(name = c('Fred', 'Fred', 'Bob', 'Bob'), profession = c('Builder', 'Baker', 'Barman', 'Biker'))
> allowed
name profession
1 Fred Builder
2 Fred Baker
3 Bob Barman
4 Bob Biker
That is to say, I want to check every person in people has a permitted profession, and return any names which do not.
For instance, Fred can be a Builder or a Baker, and so he is fine. However, Bob can be a Barman or a Biker, but not a Baker (note: there are only ever two permitted professions in my use case).
I would like to return a data frame of those names which do not have a permitted profession:
name profession permitted
1 Bob Baker Biker
2 Bob Baker Barman
Thanks for the help
Simple base-only solution. I'm sure someone can come up with something better.
out <- allowed[!allowed$name %in% merge(people, allowed)$name, ]
This gets you the desired people, along with their permitted professions. If you also want their actual professions:
names(out)[2] <- "permitted"
out <- merge(people, out, all.y=TRUE)
Here's a slightly more readable data.table solution. You can do the last step on the same line as well to make it a one-liner, if you consider that readable.
# load library, convert people to a data.table and set a key
library(data.table)
people = data.table(people, key = "name,profession")
# compute
result = data.table(allowed, key = "name")[people[!allowed]]
setnames(result, "profession.1", "permitted")
result
# name profession permitted
#1: Bob Barman Baker
#2: Bob Biker Baker
Probably there's another way, but this should work. I added a third person with an unpermitted profession to show you how to apply the function to the entire dataset.
currentprof <-structure(list(name = structure(c(2L, 1L, 3L), .Label = c("Bob",
"Fred", "Jan"), class = "factor"), profession = structure(c(3L,
2L, 1L), .Label = c("Analyst", "Baker", "Builder"), class = "factor")), .Names = c("name",
"profession"), class = "data.frame", row.names = c(NA, -3L))
allowed <- structure(list(name = structure(c(2L, 2L, 1L, 1L, 3L, 3L), .Label = c("Bob",
"Fred", "Jan"), class = "factor"), profession = structure(c(4L,
1L, 2L, 3L, 6L, 5L), .Label = c("Baker", "Barman", "Biker", "Builder",
"Driver", "Teacher"), class = "factor")), .Names = c("name",
"profession"), class = "data.frame", row.names = c(NA, -6L))
checkprof <- function(name){
allowedn <- allowed[allowed$name == name,]
currentprofn <- currentprof[currentprof$name==name,]
if(!currentprofn$profession %in% allowedn$profession)
{result <- merge(currentprofn, allowedn, by = "name", all.x=TRUE)} else
{result <-data.frame(col1=character(),
col2=character(),
col3=character(),
stringsAsFactors=FALSE)}
colnames(result) <- c("name","profession","permitted")
return(result)
}
do.call(rbind,lapply(levels(allowed$name),checkprof))
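This is my take on it. It may need some more testing, and I'd be open to suggestions myself. It works with your example, but I am not sure whether it generalizes.
# Caution: allowed$name == people$name recycles people$name along allowed$name,
# so names are compared position by position rather than matched per person.
people$check <- ifelse(people$profession %in% allowed[which(allowed$name == people$name),"profession"], TRUE,FALSE)
# check == TRUE marks people whose profession was found; use !people$check for the others
people_select <- people[people$check == TRUE,]
EDIT: just to clarify, in case this is holding you back from voting: ifelse is vectorized and will run very fast.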
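A vectorised check that does not depend on row positions can match on name and profession together. The following is a sketch in base R on the question's example data; people_demo, allowed_demo and pair_key are hypothetical names, and the key assumes that names and professions never contain a tab character.

```r
people_demo <- data.frame(name = c("Fred", "Bob"),
                          profession = c("Builder", "Baker"),
                          stringsAsFactors = FALSE)
allowed_demo <- data.frame(name = c("Fred", "Fred", "Bob", "Bob"),
                           profession = c("Builder", "Baker", "Barman", "Biker"),
                           stringsAsFactors = FALSE)

# Combine name and profession into one key per row
pair_key <- function(d) paste(d$name, d$profession, sep = "\t")

# TRUE where the (name, profession) pair is listed in allowed_demo
ok <- pair_key(people_demo) %in% pair_key(allowed_demo)
violators <- people_demo[!ok, ]

# Attach the permitted professions of each violator
permitted <- setNames(allowed_demo, c("name", "permitted"))
report <- merge(violators, permitted, by = "name")
report
```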

Resources