efficiently match values and average column where TRUE - r

Having trouble efficiently matching values and taking the average of a column where those values match in R. Essentially I have a chess table that I have pulled data out of, and I want to get, for each player, the average pre-game rating of the players they faced.
If I have a dataframe:
number <- c(1:10) #number assigned to each player
rating <- c(1000,1200,1210,980,1000,1001,1100,1300,1100,1250) #rating of the player
df <- data.frame(number= number, rating = rating)
p1_games <- c(1,2,3,4,5) # the players appearing in player 1's games (player 1 plus opponents 2,3,4,5)
What I essentially want to do is check whether the values in p1_games match a number in the table and, where they match, average the corresponding values in the rating column.
I just want to return one value and so I've had trouble trying to make ifelse() work:
avg_rate <- ifelse(p1_games %in% df$number, sum(df$rating)/length(p1_games)) #not working
I would like to avoid looping if possible, but if there's no other efficient way, that's fine. I just can't figure out what's going wrong here. Ideally I'd like to apply this logic over many p*_games vectors.
If p1_games is in df$number, sum each corresponding rating and divide by the number of ratings. So the output for p1_games in this case would be 1078. I feel like this is really simple, but I can't quite make it work.

%in% is great at this kind of thing
> mean(df[df$number %in% p1_games, "rating"])
[1] 1078
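Since the goal is to apply this over many p*_games vectors, one sketch is to collect them in a named list and sapply over it (p2_games and p3_games below are made-up examples, not from the question):

p2_games <- c(2, 6, 7)     # hypothetical: player 2's games
p3_games <- c(3, 8, 9, 10) # hypothetical: player 3's games
games <- list(p1 = p1_games, p2 = p2_games, p3 = p3_games)

# one average per vector: mean rating of the listed player numbers
sapply(games, function(g) mean(df$rating[df$number %in% g]))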

An alternate answer using data.table, which may be of use with larger data sets (although since p1_games isn't a column, I'm not sure):
> setDT(df)
> df[number %in% p1_games, mean(rating)]
[1] 1078

Related

Subset based on one value in multiple columns

I have a dataset with the weekly number of lucky days. For some of those weekly values I have values greater than 7, which must be a mistake.
Therefore, what I want to do is delete rows that have a value greater than 7 in any of several columns, namely columns 21 to 68. What I have tried so far is this:
new_df <- subset(df, 21:68 <= 7)
This leaves me with a completely empty new_df.
I know there is an option that goes like this:
new_df <- subset(df, b != 7 & d != 7)
But I feel like there must be a more elegant way than naming every single column I want to refer to. Do I need to use square brackets or something like that?
There is no error message when running the above command.
The values in question are numeric.
Can someone help?
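One approach (a sketch, assuming columns 21 to 68 are all numeric): count, for each row, how many of those columns exceed 7, and keep only the rows where that count is zero:

# keep rows where none of columns 21:68 exceeds 7
new_df <- df[rowSums(df[, 21:68] > 7, na.rm = TRUE) == 0, ]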

Comparing dates in different columns to isolate certain within-group entries in R

I have a data frame with an ID column that includes duplicates. There is a column called type that takes the values "S" or "N". There are two additional date columns: admission date and discharge date. My question is a bit similar to comparing two data frames and isolating rows based on certain date differences, but not quite. If needed, I could separate my data into two data frames, but I'm wondering if I can accomplish what I want without the extra steps.
Here is a small example of what the data for two patients looks like in R:
example <- data.frame(ID = c(22,22,22,52,52,52),
                      admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
                      discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
                      type = c('S','S','N','S','S','N'))
What I want to do is compare within patients, entries that take the value "N" and entries that take the value "S" in the type variable. Based on the discharge date for entries with the value "S," I would like to find entries with the value "N" that have an admission date within 5 days of the former's discharge date (the discharge date with value "S" should be before the admission date with value "N").
So in the example data frame, the only two entries that should be retained are rows 2 and 3, and not rows 5 and 6, since there the difference between the admission date and the discharge date is greater than 5 days.
Does anyone have any suggestions of how I can filter this data? Any help is greatly appreciated.
This was an interesting challenge. One reason is that iterating over rows is less intuitive than iterating over columns (see this question for lots of suggestions: For each row in an R dataframe).
Now, I know vectorized solutions are preferred over for loops, but one of the challenges with this problem is that instead of just applying a function to each row, we're comparing each iterated row to other rows and deleting some rows as we go. I expect there's a better solution out there, and I hope someone posts one to help me learn.
One minor note before I begin: "example" isn't a great name for an object, because it's also a function in base R. Additionally, the solution would be much easier if we were only dealing with alternating rows of "S" and "N"; when many S's precede an N, only the lowest (most recent) S might be within 5 days of that N. Nonetheless, it was worth the effort to attack the more challenging case.
Ultimately I ended up solving this as a 2-stage problem, each solved with a for loop. First, I took out all the S rows which weren't within 5 days of the corresponding N rows. Then I took out those N rows which didn't have any appropriate S companions. All of this is implemented in base R.
So to begin:
example_df <- data.frame(ID = c(22,22,22,52,52,52),
                         admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
                         discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
                         type = c('S','S','N','S','S','N'))
example_df$admission_date <- as.numeric(as.Date(example_df$admission_date))
example_df$discharge_date <- as.numeric(as.Date(example_df$discharge_date))
The first thing I did was to take the date columns (which were characters) and convert them to numeric based on date. Originally I was doing mathematical operations with date objects, but this became complicated with the subsetting operations I ended up using.
Here's the first for loop:
del_vec <- vector("integer")
for (i in 1:nrow(example_df)) {
  if (example_df[i, "type"] == "S") {
    next
  }
  if (example_df[i, "type"] == "N") {
    add_on <- which(
      example_df["type"] == "S" &
        example_df["ID"] == example_df[i, "ID"] &
        example_df["discharge_date"] < (example_df[i, "admission_date"] - 5)
    )
  }
  del_vec <- append(del_vec, add_on)
}
example_df_new <- example_df[-c(del_vec),]
rownames(example_df_new) <- 1:nrow(example_df_new)
example_df_new
What I did here is start by creating a vector which will contain the row numbers that we delete. To get rid of the inappropriate S rows we need to actually work on the N rows, so I have the loop skip the S rows. Then when the loop encounters an N row, we find the rows which meet the following conditions:
have type S
have the same ID as the N row in question
have a discharge date that is more than 5 days before the admission date of the N row in question
Using which() captures the row numbers that meet these criteria. Now I add these row numbers to the (initially empty) vector and remove those rows from the original df. I also renumber the rows of the new df, which gives the following output for example_df_new:
ID admission_date discharge_date type
1 22 16140 16145 S
2 22 16145 16157 N
3 52 16241 16251 N
So we've preserved the 2 rows you wanted to keep, but now we have this bottom row that we want to get rid of. I do this in the second loop which iterates over the rows in the new reduced df:
del_vec2 <- vector()
for (i in 1:nrow(example_df_new)) {
  if (example_df_new[i, "type"] == "S") {
    next
  }
  if (example_df_new[i, "type"] == "N") {
    add_on_two <- which(example_df_new["type"] == "S" &
                          example_df_new["ID"] == example_df_new[i, "ID"])
  }
  if (length(add_on_two) != 0) {
    next
  } else {
    del_vec2 <- append(del_vec2, i)
  }
}
example_df_3 <- example_df_new[-c(del_vec2), ]
example_df_3
Again, we tell the loop to skip the S rows — whichever ones made the first cut should stay in. Now when the loop encounters an N row we ask the loop to look for rows that meet the following criteria:
is type S
has the same ID as the N row in question
Again I use which() to save the positions of these rows. If these criteria are met, then we skip ahead: we want to keep all the N's that have an appropriate S companion. If not, then we add i, the row number of the N in question, to our vector of rows to delete.
We then delete those rows and end up with the desired output:
ID admission_date discharge_date type
1 22 16140 16145 S
2 22 16145 16157 N
At this point you can change the date columns back to a date format.
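For example:

# convert the numeric day counts back to Date objects (days since 1970-01-01)
example_df_3$admission_date <- as.Date(example_df_3$admission_date, origin = "1970-01-01")
example_df_3$discharge_date <- as.Date(example_df_3$discharge_date, origin = "1970-01-01")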
Again, while this may be the first solution, I expect it's not the best one. I hope to see an improved solution, as the problem is trickier than it first appears.
After attempting to filter within the same data frame, I decided to separate the data into two tables: one containing only the rows of type "S" and the other containing only the rows of type "N". Then I joined them, matching on the ID column. While this creates a greater number of rows than before, I was then able to compare the two dates of interest. The resulting data frame contains only one row: the entry of a patient whose type-"N" admission date falls within 5 days of a type-"S" discharge date.
The code in R is as follows:
library(dplyr)
example_df <- data.frame(ID = c(22,22,22,52,52,52),
                         admission_date = c("2013-10-03","2014-03-11","2014-03-16","2012-02-08","2014-06-10","2014-06-20"),
                         discharge_date = c("2013-10-11","2014-03-16","2014-03-28","2012-02-13","2014-06-12","2014-06-30"),
                         type = c('S','S','N','S','S','N'))
N_only <- example_df %>%
  filter(type == "N")
S_only <- example_df %>%
  filter(type == "S")
example_df_merged <- merge(N_only, S_only, by = "ID")
example_df_merged$admission_date.x <- as.Date(as.character(example_df_merged$admission_date.x), format = "%Y-%m-%d")
example_df_merged$discharge_date.y <- as.Date(as.character(example_df_merged$discharge_date.y), format = "%Y-%m-%d")
# days from the S discharge (.y) to the N admission (.x); we want 0 to 5,
# since the S discharge should come before the N admission
example_df_merged$dateDiff <- example_df_merged$admission_date.x - example_df_merged$discharge_date.y
example_df_final <- example_df_merged %>%
  filter(dateDiff <= 5 & dateDiff >= 0)
For clearer variable names, I would have renamed the variables ending in ".x" and ".y", but that is not necessary.
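For what it's worth, the same idea can be written as a single dplyr pipeline (a sketch; the explicit suffixes and inner_join() are my substitutions, not part of the original answer):

example_df_final <- inner_join(
  filter(example_df, type == "N"),
  filter(example_df, type == "S"),
  by = "ID", suffix = c("_N", "_S")
) %>%
  mutate(dateDiff = as.Date(admission_date_N) - as.Date(discharge_date_S)) %>%
  filter(dateDiff >= 0, dateDiff <= 5)  # N admission 0-5 days after S discharge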

Is there an R function where I can get the names within a specific column in my dataset

Edit: with the aid of one of the users, I was able to use table(ArrestData$CHARGE), yet, since there are over 2400 entries, many of the entries are being omitted. I am looking for the top 5 charges; is there code for this? Additionally, I am looking at a particular council district (another variable, titled "CITY_COUNCIL_DIST"). I want to see which are the top 5 charges given out within a specific council district. Is there code for this?
Thanks for the help!
Original post follows
Just like how I can use "names(MyData)" to see the names of my variables, I am wondering if I can use a code to see the names/responses/data points of a specific column.
In other words, I am attempting to see the names in the rows of a specific column of data. I would like to see which names are cumulatively being used.
After I find this, I would like to know how many times each name within the rows is being used, whether that's a count or a percentage. After this, I would like to see how many times each name within the rows is being used, with the condition that it meets a numeric value of another column/variable.
Apologies if this, in any way, is confusing.
To go further in depth, I am playing around with the Los Angeles Police Data that I got via the Office of the Mayor's website. From 2017-2018, I am attempting to see what charges and the amount of each specific charge were given out in Council District 5. CHARGE and CITY_COUNCIL_DIST are the two variables I am looking at.
Any and all help will be appreciated.
To get all the distinct values, you can use the unique function, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> unique(x)
[1] 1 2 3 4 5 6
To count the number of occurrences of each distinct value, you can use table, as in:
> x <- c(1,1,2,3,3,4,5,5,5,6)
> table(x)
x
1 2 3 4 5 6
2 1 2 1 3 1
The first row gives you the distinct values and the second row the counts for each of them.
EDIT
This edit is aimed to answer your second question following with my previous example.
In order to look for the top five most repeated values of a variable we can use base R. To do so, I would first create a dataframe from your table of frequencies:
df <- as.data.frame(table(x))
Having this, now you just have to order the column Freq in descending order and keep the first five rows:
head(df[order(-df$Freq), ], 5)
In order to look for the top five most repeated values of a variable within a group, however, we need to go beyond base R. I would use dplyr to create an augmented dataframe with frequencies for each value of the variable of interest, let it be count_variable:
library(dplyr)
x_or <- x %>%
  group_by(group_variable, count_variable) %>%
  summarise(freq = n())
where x is your original dataframe, group_variable is the variable defining your groups, and count_variable is the variable whose values you want to count. Now you just have to order the object so that, within each group, the frequencies of count_variable come out sorted:
x_or %>%
  arrange(group_variable, desc(freq))
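Applied to the question's actual data (a sketch: ArrestData, CHARGE, CITY_COUNCIL_DIST, and district 5 are the names given in the question; count() is dplyr's shorthand for group_by() plus summarise(n())), getting the top five charges in one district might look like:

ArrestData %>%
  filter(CITY_COUNCIL_DIST == 5) %>%  # keep one council district
  count(CHARGE, sort = TRUE) %>%      # frequency of each charge, descending
  head(5)                             # top five charges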

For each value in a column count occurrences of that value in another column

I have a simple but large data frame (lateness_tbl) consisting of three columns (Days, Due_Date, End_Date). I need to see how many times each Due_Date is matched in End_Date. I'm currently doing something like this:
x <- c()
for (i in 1:length(lateness_tbl$Due_Date)) {
  x[i] <- sum(lateness_tbl$Due_Date[i] == lateness_tbl$End_Date)
}
The only problem is I have more than 2 million records to compare and am looking for help from the community to speed this up. Any tips, tricks, or corrections would be awesome. Thanks
There is a simple solution if what you need is a row-by-row comparison: define a new vector storing the differences between Due_Date and End_Date, and then count the entries of this vector that equal zero.
differences <- lateness_tbl$Due_Date - lateness_tbl$End_Date
length(which(differences == 0))
If Due_Date and End_Date are dates (and not integers), you can use the difftime function as shown here and apply the same strategy as above.
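Note that the difference vector answers a same-row comparison. If the goal is the loop's original semantics (for each Due_Date, counting how many times that value appears anywhere in End_Date), a vectorized sketch using a frequency table might look like this:

# count every End_Date value once, then look each Due_Date up
end_counts <- table(lateness_tbl$End_Date)
x <- as.integer(end_counts[as.character(lateness_tbl$Due_Date)])
x[is.na(x)] <- 0  # Due_Dates that never appear in End_Date get 0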

Eliminating dates by characteristics in time series

I'm analyzing absenteeism data for schools and am seeking a bit of help. For every day, I have 360 rows (classrooms) containing that day's date (column 1) and the numbers of absent (column 2) and non-absent students (column 3).
Some days (holidays) only have, say, 20 classes reporting, because the other 340 classes did not have class. I want to ELIMINATE those rows from my dataset. In other words, I want to eliminate all entries for which the total number of entries for that date is less than a certain number: all rows with date x if the total number of rows containing date x is less than 200.
Here's what I've got so far:
for (i in c(min(df$date):max(df$date))) {
  b <- df[df$date == i, ]
  z <- as.vector(ifelse(nrow(b[which(b$date == i), ]) < 200, "FALSE", "TRUE"))
  print(z)
  df$newcolumn <- z
}
This prints z, which goes day by day telling me whether that day meets my condition, but I cannot figure out a way to incorporate z back into the dataframe's tens of thousands of rows. Instead, my df$newcolumn is simply populated with all TRUEs.
Any help would be greatly appreciated.
Hard to do for real without a reproducible example, but doesn't something like df[!df$date %in% z, ] work, where z holds the dates to drop?
%in% returns a logical vector saying whether each element exists in the other vector, ! negates it (so it's TRUE for the dates with at least 200 rows), and the [rowselector, ] form selects rows from the data.frame.
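Concretely, a sketch of building that z from a frequency table of the dates:

# dates that appear fewer than 200 times
date_counts <- table(df$date)
z <- names(date_counts)[date_counts < 200]

# keep only rows whose date occurs at least 200 times
df_kept <- df[!(as.character(df$date) %in% z), ]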
