I have a huge data set in R (1 million+ rows, 51 columns). One of my columns is "StateFIPS", another is "CountyFIPS", and another is "event type". The rest I do not care about.
Is there an easy way to take that data frame, pull out all the rows that have "StateFIPS" = 3 AND "CountyFIPS" = 4 AND "event type" = Tornado, and put those rows into a new data frame?
Thanks!
We can use subset
df2 <- subset(df1, StateFIPS == 3 & CountyFIPS == 4 & `event type` == "Tornado")
It is quite easy. This should do it (supposing your data.frame is named "data_set"):
new_data <- data_set[(data_set$CountyFIPS == 4) &
(data_set[["event type"]] == "Tornado") &
(data_set$StateFIPS == 3), ]
Sure,
You can use the which() function; see https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/which
You can then use any logical conditions and combine them with & (and) and | (or).
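A minimal sketch of the which() approach on a small made-up data frame (the toy values and the name df1 are assumptions for illustration). which() returns the indices of the rows where the combined condition is TRUE, and it silently skips rows where the condition evaluates to NA.

```r
# Toy data standing in for the real data frame (values are invented).
df1 <- data.frame(
  StateFIPS    = c(3, 3, 5, 3),
  CountyFIPS   = c(4, 4, 4, 7),
  `event type` = c("Tornado", "Hail", "Tornado", "Tornado"),
  check.names  = FALSE  # keep the space in the "event type" column name
)

# Indices of the rows where all three conditions hold.
idx <- which(df1$StateFIPS == 3 &
             df1$CountyFIPS == 4 &
             df1[["event type"]] == "Tornado")

# New data frame containing only the matching rows.
df2 <- df1[idx, ]
```

Here only the first row satisfies all three conditions, so df2 has a single row.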
I want to check if the values of several columns follow a condition, the columns have similar names, and what I've tried is this
filter(df.w.meth.mean, cov.CD34.1 > 4 & cov.CD34.2 > 4 & cov.CD34.4 >4 & cov.CD34.5 >4 & cov.CD34.6 > 4)
How can I simplify this?
I was thinking of using grep to keep the columns that match the 'cov' pattern, but it is not working.
Can you help?
Using dplyr::filter_at() you can do:
library(dplyr)
df.w.meth.mean %>%
filter_at(vars(starts_with("cov.CD34")), ~ . > 4)
In base R, we can use grep to find the columns whose names start with "cov". We then subset those columns and select the rows where all the values are greater than 4.
cols <- grep("^cov", names(df.w.meth.mean))
df.w.meth.mean[rowSums(df.w.meth.mean[cols] > 4) == length(cols),]
I would like to make a new column in my data frame by using a conditional statement that would say "If Column_y contains Column_x then 1 else 0"
For example:
Event Name Winner Loser New Column
1 James James,Bob John,Steve 1
1 Bob James,Bob John,Steve 1
1 John James,Bob John,Steve 0
1 Steve James,Bob John,Steve 0
I want to have New Column<- "If Winner contains Name then 1 else 0"
Keep in mind this is for 100,000 rows and probably 700 unique names. When I try things like
df$NewColumn<-ifelse(grepl(df$Name,df$Winner)==TRUE,1,0)
or variations I get the "pattern has a length > 1" error.
I think you just want to compare the Name column against the Winner column:
df$NewColumn <- ifelse(df$Name == df$Winner, 1, 0)
Note that because df$Name == df$Winner is actually a boolean expression, you might also be able to simplify to:
df$NewColumn <- df$Name == df$Winner
Exact string matching only works when Winner holds a single name. In your example Winner is a comma-separated list, so I am assuming exact matching does not hold for your data.
Implementing the "contains" condition would be something like this:
library(dplyr)
library(purrr)
df = df %>%
dplyr::mutate(NewColumn = purrr::map2_dbl(.x=Winner,.y=Name,~ifelse(grepl(.y,.x),1,0)))
Adding an alternate solution with stringr:
library(stringr)
df = df %>%
dplyr::mutate(NewColumn = ifelse(str_detect(Winner, Name), 1, 0))
Let me know if this works.
P.S.: str_detect is faster.
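For reference, the "pattern has length > 1" message appears because grepl() uses only the first element of its pattern argument, so it cannot pair each Name with its own Winner. A base R sketch that does the pairing row by row follows. It assumes the names in Winner are separated by commas with no surrounding spaces, as in the example; the helper name name_in_list is hypothetical. Splitting the string and testing with %in% also avoids partial matches such as "Bob" inside "Bobby".

```r
# Toy data mirroring the example in the question.
df <- data.frame(
  Name   = c("James", "Bob", "John", "Steve"),
  Winner = rep("James,Bob", 4),
  Loser  = rep("John,Steve", 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if `name` is one of the comma-separated entries.
name_in_list <- function(name, list_string) {
  name %in% strsplit(list_string, ",", fixed = TRUE)[[1]]
}

# mapply() walks Name and Winner in parallel, one pair per row.
df$NewColumn <- as.integer(mapply(name_in_list, df$Name, df$Winner))
```

On 100,000 rows this is likely slower than the vectorised str_detect() call, so it is mainly worth considering when whole-name matching matters.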
I am a medical researcher. I have a very large administrative database where the diagnoses are included in columns with headers dx1 - dx15 (dx = diagnosis). These columns contain numbers/letter codes which are in character form in R. I have written code to run through these dx columns, but would like to rewrite the code in the form of an array. I can do that easily in SAS, but am finding it difficult to do the same in R.
I am attaching the code that I use here:
a <- as.character(c("4578","4551")) # here I identify initially the codes for the diagnosis that I am interested in.
Then I create a new variable (cm_cancer) in my data frame df and use this code to identify patients with cancer. The new variable df$cm_cancer will be either 0 or 1 depending on the diagnosis.
The code works, but as you can see, it is not tidy or elegant at all.
df$cm_cancer <- with(df, ifelse((dx3 %in% a | dx4 %in% a | dx5 %in% a |
dx6 %in% a | dx7 %in% a | dx8 %in% a | dx9 %in% a |
dx10 %in% a | dx11 %in% a | dx12 %in% a | dx13 %in% a |
dx14 %in% a | dx15 %in% a), 1, 0))
With SAS, I can do the same with this elegant piece of code:
data df2;
set df;
cancer = 0;
array dgn[15] dx1 - dx15;
do i = 1 to 15;
if dgn[i] in ("4578","4551") then
cancer = 1;
end;
drop i;
run;
I refuse to believe that SAS has better answers for this than R; I just accept that I am still a novice in R.
Any help is welcome; believe me, I have tried googling arrays in R, loops in R, anything that would help me rewrite this code better.
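One possible base R counterpart to the SAS array loop, offered as a sketch rather than a definitive answer: build the column names with paste0(), test each column for membership in the code list, and combine the per-column results with a logical OR. The toy data frame below uses three dx columns instead of fifteen and invented values; the names dx_cols and hits are assumptions for illustration.

```r
a <- c("4578", "4551")  # diagnosis codes of interest

# Toy data with 3 diagnosis columns instead of 15 (values are invented).
df <- data.frame(
  dx1 = c("4578", "1000", "2000"),
  dx2 = c("3000", "4551", "2001"),
  dx3 = c("3001", "1001", NA),
  stringsAsFactors = FALSE
)

# Column names to scan; for the real data this could be paste0("dx", 1:15).
dx_cols <- paste0("dx", 1:3)

# One logical vector per column: TRUE where the code is in `a`.
# %in% returns FALSE (not NA) for missing values.
hits <- lapply(df[dx_cols], function(col) col %in% a)

# A patient is flagged if any of the dx columns matched.
df$cancer <- as.integer(Reduce(`|`, hits))
```

With this layout, changing the range of columns only requires editing dx_cols.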
I'm trying to subset a data frame based on a variable I'm passing into it. My goal is to form a column name inside a function using some values I am passing into it and filter on that newly formed column name.
Here's a reproducible example:
var_as_col_name <- function(df, col_var, filter_var) {
subset(df, col_var == filter_var)
}
# this should return what subset(df, cty == 18) would return
var_as_col_name(mpg,"cty", 18)
# this should return what subset(df, cyl == 4) would return
var_as_col_name(mpg,"cyl", 4)
Also, apart from the filters on mpg$cty and mpg$cyl above, I might have another filter that is hardcoded, which I don't want to change, i.e. my requirement should hold for more than one filter. Is there a better approach without using subset (since it is meant for interactive use)?
I am doing this because I have some columns in my dataset like t_1, t_2, t_3...t_24 and I need to filter on either of them and another flag column, so I'm doing:
df_1 <- subset(my_df, flag == 0 & t_1 > 0 & t_1 < 1)  # when I want data after filtering on t_1
df_2 <- subset(my_df, flag == 1 & t_2 > 0 & t_2 < 1)  # when I want data after filtering on t_2
...
Instead of this I was thinking of writing a function that takes:
n from 1 to 24, filters on that t_n
takes 1 or 0 for the flag.
and then returns the subsetted dataframe that I want.
Let me know if you need clarification on the question and thanks for your help...
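A sketch of one way to do this without subset(), assuming the columns follow the t_1 ... t_24 naming described above. The original attempt fails because subset() compares the literal string "cty" with 18 instead of looking up the column; indexing with [[ ]] accepts a column name stored in a variable. The helper name filter_t and its prefix, lower, and upper parameters are hypothetical, and the built-in mtcars data set stands in for mpg so that ggplot2 is not required.

```r
# Version of the function from the question, using [[ ]] instead of subset().
var_as_col_name <- function(df, col_var, filter_var) {
  df[which(df[[col_var]] == filter_var), , drop = FALSE]
}

# Hypothetical helper: keep rows where flag == flag_value and the column
# "<prefix><n>" lies strictly between `lower` and `upper`.
filter_t <- function(df, n, flag_value, prefix = "t_", lower = 0, upper = 1) {
  col <- paste0(prefix, n)        # build the column name, e.g. "t_2"
  stopifnot(col %in% names(df))   # fail early on an unknown column
  x <- df[[col]]
  keep <- df$flag == flag_value & x > lower & x < upper
  df[which(keep), , drop = FALSE] # which() drops rows where the test is NA
}

# Toy data (invented values).
my_df <- data.frame(
  flag = c(0, 1, 0, 1),
  t_1  = c(0.5, 0.2, 1.5, 0.9),
  t_2  = c(0.1, 0.7, 0.3, 2.0)
)

four_cyl <- var_as_col_name(mtcars, "cyl", 4)  # like subset(mtcars, cyl == 4)
df_1 <- filter_t(my_df, 1, flag_value = 0)     # like subset(my_df, flag == 0 & t_1 > 0 & t_1 < 1)
df_2 <- filter_t(my_df, 2, flag_value = 1)     # like subset(my_df, flag == 1 & t_2 > 0 & t_2 < 1)
```

Any additional hardcoded filter can be added to the keep expression with another & term.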
I have a data frame with 10 items and I want to negate the even numbered rows. I came up with this monstrosity:
change_even <- data.frame(val=runif(10))
change_even$val[row( as.matrix(change_even[,'val']) ) %% 2 == 0 ] <- -change_even$val[row( as.matrix(change_even[,'val']) ) %% 2 == 0 ]
Is there a better way?
You can simply use recycling:
change_even$val*c(1,-1)
#[1] 0.1077468 -0.5418167 0.8319609 -0.7230043 0.6649786 -0.7232669
#[7] 0.2677659 -0.4035824 0.6880934 -0.5600653
(values are not reproducible since seed was not set; however the alternating sign can be seen clearly).
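A reproducible variant of the recycling idea, as a sketch: the seed value and the row count of 11 are arbitrary choices for illustration. Plain recycling of c(1, -1) raises a warning when the number of rows is odd, so rep_len() is used here to build a sign vector of exactly the right length, and the result is assigned back to the column.

```r
set.seed(42)  # fixed seed so the values are reproducible
change_even <- data.frame(val = runif(11))  # odd row count on purpose

# rep_len() stretches the c(1, -1) pattern to exactly nrow() elements,
# so no recycling warning is raised when the row count is odd.
signs <- rep_len(c(1, -1), nrow(change_even))

# Assign back so the data frame itself is updated.
change_even$val <- change_even$val * signs
```

After this, the even-numbered rows are negative and the odd-numbered rows are unchanged.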
You can simply do,
change_even[c(FALSE,TRUE),] <- change_even[c(FALSE,TRUE),]*(-1)
With a data.table you can do something similar to the data.frame approach. See also "Selecting multiple odd or even columns/rows for dataframe in R".
library(data.table)
change_even <- data.table(val=runif(10))
even_indexes<-seq(2,nrow(change_even),2)
change_even <- change_even[even_indexes,val:=val*-1]
Use the remainder operator to find the even numbered rows, then simply negate
change_even <- data.frame(val=runif(10))
change_even[seq(nrow(change_even)) %% 2 != 1,] = -change_even[seq(nrow(change_even)) %% 2 != 1,]
This is what I came up with:
change_even$val = change_even$val * c(rep(-1,nrow(change_even))^((row(change_even)+1)))
Another one:
(-1)^(0:(nrow(change_even)-1))*change_even$val