Removing rows from a table in R

I used the following code to scrape a table into R.
player.offense.201702050atl = comments.201702050atl[31] %>%
  html_text() %>%
  read_html() %>%
  html_node("#player_offense") %>%
  html_table()
Then I changed the column labels using:
colnames(player.offense.201702050atl) = c("Player", "Tm",
  "Cmp.Passing", "Att.Passing", "Yds.Passing", "TD.Passing", "Int.Passing",
  "Sk.Passing", "Yds.Sk.Passing", "Lng.Passing", "Rate.Passing",
  "Att.Rushing", "Yds.Rushing", "TD.Rushing", "Lng.Rushing",
  "Tgt.Receiving", "Rec.Receiving", "Yds.Receiving", "TD.Receiving", "Lng.Receiving",
  "Fmb.Fumbles", "FL.Fumbles")
Next I need to eliminate rows 1, 11, and 12.
I could use:
player.offense.201702050atl.a = player.offense.201702050atl[2:10, ]
player.offense.201702050atl.b = player.offense.201702050atl[13:20, ]
player.offense.201702050atl.c = rbind(player.offense.201702050atl.a, player.offense.201702050atl.b)
However, I have multiple tables that need similar manipulation, and the rows I intend to eliminate vary with each one. The criterion for a row I want eliminated is:
All rows for which the value in column 3 is either "Cmp" or "Passing".
Is there a way to run a function that will parse the table, identify the rows that meet the above criteria, and eliminate them?
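In other words, something like this hypothetical helper (the function and argument names here are only illustrative):
# Sketch of a reusable helper: drop any row whose third column holds a repeated header value
drop_header_rows <- function(tbl, col = 3, bad = c("Cmp", "Passing")) {
  tbl[!tbl[[col]] %in% bad, ]
}
# e.g. player.offense.201702050atl <- drop_header_rows(player.offense.201702050atl)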

df <- data.frame(x = c('a', 'b', 'c'), y = c('ca', 'cb', 'cc'), z = c('da', 'db', 'dc'))
#   x  y  z
# 1 a ca da
# 2 b cb db
# 3 c cc dc
df[-union(which(df$y == 'cc'),which(df$y == 'ca')),]
Result:
x y z
2 b cb db
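One caveat with the negative which() idiom: if nothing matches, which() returns integer(0), and df[-integer(0), ] keeps zero rows rather than all of them. A small sketch of the pitfall and a safer logical test:
idx <- which(df$y == 'zz')           # no match, so integer(0)
df[-idx, ]                           # returns zero rows, not the full data frame
df[!(df$y %in% c('cc', 'ca')), ]     # a logical condition avoids the problem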

Regarding the criterion "All rows for which the value in column 3 is either 'Cmp' or 'Passing'":
df <- data.frame(col1 = 1:3, col2 = c('Cmp', 'Passing', 'other'))
df[!df$col2 %in% c('Cmp', 'Passing'), ]
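Applied to the scraped table from the question, a sketch along the same lines (it assumes the repeated header rows carry "Cmp" or "Passing" in the third column):
keep <- !player.offense.201702050atl[[3]] %in% c("Cmp", "Passing")
player.offense.201702050atl <- player.offense.201702050atl[keep, ]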

Related

Filter rows in dataset for distinct words in R

Goal: To filter rows in the dataset so that only distinct words remain. At the moment, I have used inner_join to retain rows present in 2 datasets, which has duplicated rows in this dataset.
Attempt 1: I have tried to use distinct to retain only those rows which are unique, but this has not worked. I may be using it incorrectly.
This is my code so far; output attached in png format:
# join warriner emotion lemmas by `word` column in collocations data frame to see how many word matches there are
warriner2 <- dplyr::inner_join(warriner, coll, by = "word") # join data; retain only rows in both sets (works both ways)
warriner2 <- distinct(warriner2)
warriner2
coll2 <- dplyr::semi_join(coll, warriner, by = "word") # keep all rows in coll that have a match in warriner
# There are 8166 lemma matches (including double-ups)
# There are XXX unique lemma matches
You can try:
library(dplyr)
warriner2 <- inner_join(warriner, coll, by = "word") %>%
  distinct(word, .keep_all = TRUE)
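To fill in the counts mentioned in the comments above, a sketch assuming warriner and coll both carry a word column:
matches <- inner_join(warriner, coll, by = "word")
nrow(matches)               # total lemma matches, including double-ups
n_distinct(matches$word)    # unique lemma matches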
To further clarify Ronak's answer, here is an example with some mock data. Note that you can just use distinct() at the end of the pipe to keep only distinct rows if that's what you want. Your error might very well have occurred because you performed two operations and assigned the result to the same name both times (warriner2).
library(dplyr)
# Here's a couple sample tibbles
name <- c("cat", "dog", "parakeet")
df1 <- tibble(
x = sample(5, 99, rep = TRUE),
y = sample(5, 99, rep = TRUE),
name = rep(name, times = 33))
df2 <- tibble(
x = sample(5, 99, rep = TRUE),
y = sample(5, 99, rep = TRUE),
name = rep(name, times = 33))
# It's much less confusing if you do this in one pipe
p <- df1 %>%
inner_join(df2, by = "name") %>%
distinct()

How would I automate dropping a column in R based on summary data for that column?

I have a dataset that is being used to create an automated dashboard. Essentially it's looking at the relationship between certain conditions and the cost of care on a month by month basis for a health care institution. What I want to be able to do is in pseudocode:
dataset %>% select(-c("columns where the average value is lower than X"))
No amount of googling seems to be getting me close.
We can use select_if
library(dplyr)
val <- 10
dataset %>%
  select_if(~ is.numeric(.) && mean(.) < val)
Or using base R
dataset[, names(which(colMeans(dataset[sapply(dataset, class) == "numeric"]) < val)),
        drop = FALSE]
# col3
#1 3
#2 4
#3 7
data
dataset <- data.frame(col1 = c('A', 'B', 'C'), col2 = c(10, 8, 15),
                      col3 = c(3, 4, 7), stringsAsFactors = FALSE)
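In current dplyr versions select_if() is superseded; the same selection can be written with where() inside select(). A sketch using the data above:
library(dplyr)
val <- 10
dataset %>%
  select(where(~ is.numeric(.x) && mean(.x) < val))
#   col3
# 1    3
# 2    4
# 3    7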

Select data frame values row-wise using a variable of column names

Suppose I have a data frame that looks like this:
dframe = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
# x y
# 1 1 4
# 2 2 5
# 3 3 6
And a vector of column names, one per row of the data frame:
colname = c('x', 'y', 'x')
For each row of the data frame, I would like to select the value from the corresponding column in the vector. Something similar to dframe[, colname] but for each row.
Thus, I want to obtain c(1, 5, 3) (i.e. row 1: col "x"; row 2: col "y"; row 3: col "x")
My favourite old matrix-indexing will take care of this. Just pass a 2-column matrix with the respective row/column index:
rownames(dframe) <- seq_len(nrow(dframe))
dframe[cbind(rownames(dframe),colname)]
#[1] 1 5 3
Or, if you don't want to add rownames:
dframe[cbind(seq_len(nrow(dframe)), match(colname,names(dframe)))]
#[1] 1 5 3
One can use mapply to pass the row number (of dframe) and the corresponding column name (for each row) and return the specific column value.
The solution using mapply is:
dframe = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
colname = c('x', 'y', 'x')
mapply(function(x, y) dframe[x, y], 1:nrow(dframe), colname)
#[1] 1 5 3
The next option may not be very intuitive, but if someone wants a solution in a dplyr chain, one way using gather is:
library(tidyverse)
data.frame(colname = c('x', 'y', 'x'), stringsAsFactors = FALSE) %>%
  rownames_to_column() %>%
  left_join(dframe %>% rownames_to_column() %>%
              gather(colname, value, -rowname),
            by = c("rowname", "colname")) %>%
  select(rowname, value)
# rowname value
# 1 1 1
# 2 2 5
# 3 3 3
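Since gather() is superseded, here is a sketch of the same join written with pivot_longer() (assuming dplyr, tidyr and tibble are loaded, e.g. via tidyverse):
data.frame(colname = c('x', 'y', 'x'), stringsAsFactors = FALSE) %>%
  rownames_to_column() %>%
  left_join(dframe %>% rownames_to_column() %>%
              pivot_longer(-rowname, names_to = "colname", values_to = "value"),
            by = c("rowname", "colname")) %>%
  select(rowname, value)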

Adding unique rows from one data frame to another

I have a data frame which comprises a subset of the records contained in a 2nd data frame. I would like to add to the first data frame those rows of the 2nd data frame that are not already present in the first... Thank you.
If you want all unique rows from both dataframes, this would work:
df1 <- data.frame(X = c('A','B','C'), Y = c(1,2,3))
df2 <- data.frame(X = 'A', Y = 1)
df <- rbind(df1,df2)
no.dupes <- df[!duplicated(df),]
no.dupes
# X Y
#1 A 1
#2 B 2
#3 C 3
But it won't work if there are duplicate rows in either data frame that you want to preserve.
You should look at dplyr's distinct() and bind_rows() functions.
Better yet, provide dummy data to work on and the expected output.
Suppose you have two data frames a and b, and you want to merge the unique rows of data frame a into data frame b:
a = data.frame(
  x = c(1, 2, 3, 1, 4, 3),
  y = c(5, 2, 3, 5, 3, 3)
)
b = data.frame(
  x = c(6, 2, 2, 3, 3),
  y = c(19, 13, 12, 3, 1)
)
library(dplyr)
distinct(a) %>% bind_rows(.,b)
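For the original goal, i.e. appending only those rows of the second data frame that are missing from the first, an anti_join() before binding is one option (a sketch using the a and b frames above):
new_rows <- anti_join(b, a, by = c("x", "y"))  # rows of b with no exact match in a
bind_rows(a, new_rows)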

Case usage in R: Count number of events from Table 2 when cases in Table 1 satisfy specific restrictions

The DF for Table 1 is like this:
df1 <- data.frame(ID = c('001', '001', '002', '003', '003', '003'),
                  date = c('2015-05-23', '2015-07-29', '2015-08-08', '2015-06-10', '2015-10-12', '2015-11-15'),
                  date_last = c('2015-01-20', '2015-05-23', '2015-05-15', '2015-01-20', '2015-06-10', '2015-10-12'))
And the DF for Table 2 is like this:
df2 <- data.frame(Event = c('A', 'B', 'C', 'D', 'E'),
                  Event_date = c('2015-01-21', '2015-01-21', '2015-03-29', '2015-08-12', '2015-10-12'))
What I want is this: for each row, when df1$date_last < df2$Event_date < df1$date, count that Event as 1 and sum up how many events fall within the time period. The ideal result I want is like the following:
df3 <- data.frame(ID = c('001', '001', '002', '003', '003', '003'),
                  date = c('2015-05-23', '2015-07-29', '2015-02-08', '2015-06-10', '2015-10-12', '2015-11-15'),
                  date_last = c('2015-01-20', '2015-05-23', '2015-05-15', '2015-01-20', '2015-06-10', '2015-10-12'),
                  number_of_events = c(3, 1, 0, 3, 1, 0))
Anyone know the R code for this? Thank you so much!
Make sure that all your dates are of class Date. You can simply do this by wrapping the columns in as.Date() when creating the data frames.
First define a function with x being a vector with end and start date respectively, and y being a vector with dates that should be checked.
nr_events_in_between <- function(x, y) sum(x[2] < y & x[1] > y)
Apply this to all rows in df1 and you get the number_of_events column.
apply(df1[ ,c('date', 'date_last')], 1, nr_events_in_between, df2[,'Event_date'])
(Note that for the second row the value is 0, not 1 as stated in the example for df3.)
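Putting it together, a sketch that converts the columns with as.Date() first and then adds the count column to df1:
df1$date       <- as.Date(df1$date)
df1$date_last  <- as.Date(df1$date_last)
df2$Event_date <- as.Date(df2$Event_date)
nr_events_in_between <- function(x, y) sum(x[2] < y & x[1] > y)
df1$number_of_events <- apply(df1[, c('date', 'date_last')], 1,
                              nr_events_in_between, df2$Event_date)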
