Using data table or dplyr instead of loops - r

Here I have the code with a for loop:
for (i in 1:length(mc_1$code)) {
  cmc1 = mc_1$code[i]
  cmc2 = mc_1[mc_1$code == cmc1,]
  cmc3 = cmc2[order(cmc2[ ,2], cmc2[ ,3]),]
  mc_1[mc_1$code == cmc1,]$region = last(cmc3$region)
}
For each value of the variable "code", mc_1 has a different number of rows. mc_1 also has year and month columns (columns 2 and 3) and another column, say, region. "region" can differ for the same "code" across months and years.
For each "code", I want to select only the most recent region by year and month (that's why I use "order") and assign that region to every row for that code.
This for loop works, but for efficiency and brevity, how can I rewrite it using something like data.table or dplyr?

You can try this using the dplyr package and the fact that n() returns the number of rows in each group:
library(dplyr)
mc_1 %>%
  group_by(code) %>%
  arrange(year, month) %>%
  mutate(region = region[n()])
Hope it helps!
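
Since the question also mentions data.table, here is a minimal sketch of the same idea in that package. The toy mc_1 below is made up for illustration; only the column names (code, year, month, region) come from the question.

```r
library(data.table)

# Hypothetical toy data; only the column names come from the question
mc_1 <- data.table(
  code   = c("a", "a", "a", "b", "b"),
  year   = c(2019, 2020, 2020, 2018, 2017),
  month  = c(5, 1, 3, 2, 11),
  region = c("north", "south", "east", "west", "central")
)

# Sort by reference so the most recent row of each code comes last
setorder(mc_1, code, year, month)

# .N is the number of rows in the current group, so region[.N] is the
# most recent region; := writes it to every row of that code
mc_1[, region := region[.N], by = code]
```

Note that setorder reorders mc_1 in place, so if the original row order matters it is safer to work on copy(mc_1).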

Related

How to mutate new columns in R based on earliest and latest dates for other variables

In a dataset where each patient had multiple test administrations and a score on each test date, I have to identify the earliest and latest test dates, then compute the difference between the scores on those two dates. I think I've identified the first and last dates through dplyr, creating new columns for those:
SplitDates <- SortedDates %>%
group_by(PatientID) %>%
mutate(EarliestTestDate = min(AdministrationDate),
LatestTestDate = max(AdministrationDate)) %>%
arrange(desc(PatientID))
The score column is TotalScore.
Now how do I extract the scores from these two dates (for each patient) to create new columns of earliest and latest scores? I haven't been able to figure out a mutate with case_when or if_else that creates a score based on the record with a certain date.
Have you tried a join verb, such as left_join?
SplitDates <- SortedDates %>%
  group_by(PatientID) %>%
  mutate(EarliestTestDate = min(AdministrationDate),
         LatestTestDate = max(AdministrationDate)) %>%
  ungroup() %>%
  left_join(select(SortedDates, PatientID, AdministrationDate, EarliestScore = TotalScore),
            by = c("PatientID", "EarliestTestDate" = "AdministrationDate")) %>% # picking the score of EarliestTestDate
  left_join(select(SortedDates, PatientID, AdministrationDate, LatestScore = TotalScore),
            by = c("PatientID", "LatestTestDate" = "AdministrationDate")) %>% # picking the score of LatestTestDate
  arrange(desc(PatientID)) # now you can do the mutate that you need
I suggest you take a look at the dplyr cheatsheet.
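
As an alternative to the joins, the scores can be picked directly inside a grouped summarise. This is only a sketch on made-up data: the column names follow the question, while the values, the ScoreChange object, and the Difference column are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data; column names follow the question
SortedDates <- tibble(
  PatientID = c(1, 1, 1, 2, 2),
  AdministrationDate = as.Date(c("2021-03-01", "2021-01-15", "2021-06-30",
                                 "2020-05-05", "2020-02-02")),
  TotalScore = c(10, 7, 15, 20, 25)
)

ScoreChange <- SortedDates %>%
  group_by(PatientID) %>%
  summarise(
    # first()/last() with order_by pick the score at the earliest/latest date
    EarliestScore = first(TotalScore, order_by = AdministrationDate),
    LatestScore   = last(TotalScore, order_by = AdministrationDate),
    Difference    = LatestScore - EarliestScore
  )
```

If a patient has two records on the same earliest or latest date, this picks one of them, so ties may need separate handling.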

Counting the rows based on two other column values, and manipulate the value in a loop through one of these column values in R

There are three columns: website, Date ("%Y %m"), click_tracking (T/F). I would like to add a variable giving, for each month, the number of websites whose click_tracking is T divided by the number of all websites in that month.
I thought the steps would be something like:
aggregate(sum(df$click_tracking = TRUE), by=list(Category=df$Date), FUN = sum)
as.data.frame(table(Date))
Then somehow loop through Date and divide the two variables above which would have been already grouped by Date. How can I achieve this? Many thanks!
If we are creating a column, then group by 'Date' and get the sum of 'click_tracking' (assuming it is a logical column, TRUE/FALSE) in mutate
library(dplyr)
df %>%
group_by(Date) %>%
mutate(countTRUE = sum(click_tracking))
If the column is a factor, convert it to logical with as.logical
df %>%
group_by(Date) %>%
mutate(countTRUE = sum(as.logical(click_tracking)))
If the goal is a summarised output instead
df %>%
group_by(Date) %>%
summarise(countTRUE = sum(click_tracking))
In the OP's code, = (assignment) is used instead of == in sum(df$click_tracking = TRUE), and there is no need to compare a logical column with TRUE in the first place
aggregate(cbind(click_tracking = as.logical(click_tracking)) ~ Date, df, FUN = sum)
The following gives the proportion of websites with click tracking (out of all websites) per month, because the mean of a logical column is the share of TRUE values.
aggregate(data=df, click_tracking ~ Date, mean)
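
For comparison, the same proportion can be sketched with dplyr. The toy df is made up; only the column names come from the question, and the output names n_tracking, n_sites, and prop_tracking are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; only the column names come from the question
df <- data.frame(
  website        = c("a", "b", "c", "a", "b"),
  Date           = c("2020 01", "2020 01", "2020 01", "2020 02", "2020 02"),
  click_tracking = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

monthly <- df %>%
  group_by(Date) %>%
  summarise(
    n_tracking    = sum(click_tracking),   # websites with click tracking
    n_sites       = n(),                   # all websites in that month
    prop_tracking = mean(click_tracking)   # equals n_tracking / n_sites
  )
```

Replacing summarise with mutate would attach the same three values to every row of df instead of collapsing it to one row per month.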

Aggregating two rows based on condition of different ID in R

I am dealing with a dataset of player statistics for a sport. There is an error in the data: in one week, a player who doesn't exist has been attributed data that belongs to a real player. I need to aggregate the two players' data and delete the false player's row.
I need to adjust my preprocessing code to handle this, so that when I scrape future weeks' data I don't have to make manual adjustments.
df <- data.frame(Name = c("Bob","Ben","Bill"),
                 Team = c("Dogs","Cats","Birds"),
                 Runs = c(6, 4, 2))
I'd like to do something along the lines of aggregating the two rows based on their df$Name e.g. when df$Name == "Bob" & df$Name == "Bill" aggregate columns [3:40] -- these are my columns with numeric statistics, [1:2] have df$Name and df$Team.
It would depend on the type of aggregation you are trying to do. This looks like a perfect use of the group_by from the dplyr package. Consider the CO2 data set.
library(dplyr)
CO2 %>%
group_by(Plant) %>%
summarise(
n = n(), #Calculate number of rows in each group
meanUptake = mean(uptake) # Aggregate data and take mean for each group
) %>%
ungroup()
Here the data are grouped by Plant; in your case the grouping variable would be Name. If you want extra information (like Team) in the output, compute it inside the summarise call.
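
Applied to the example data, the idea could look like the sketch below. It assumes "Bill" is the non-existent player and "Bob" the real one, which the question does not state; real_name, fake_name, and merged are hypothetical names.

```r
library(dplyr)

df <- data.frame(Name = c("Bob", "Ben", "Bill"),
                 Team = c("Dogs", "Cats", "Birds"),
                 Runs = c(6, 4, 2),
                 stringsAsFactors = FALSE)

# Assumption: "Bill" is the false player whose stats belong to "Bob"
real_name <- "Bob"
fake_name <- "Bill"
real_team <- df$Team[df$Name == real_name][1]

merged <- df %>%
  # Relabel the false player's row so it falls into the real player's group
  mutate(Team = if_else(Name == fake_name, real_team, Team),
         Name = if_else(Name == fake_name, real_name, Name)) %>%
  group_by(Name, Team) %>%
  # Sum every numeric statistic column within each player
  summarise(across(where(is.numeric), sum), .groups = "drop")
```

Because across(where(is.numeric), sum) covers every numeric column, the same code should carry over to the wider table with columns 3 to 40, provided summing is the right aggregation for each statistic.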

How might I summarize the sum of all columns in a filtered dataset using dplyr?

I'm having trouble getting the sum of a column from a filtered dataset. Would someone be able to show me where I am going wrong? This summarize method worked before, but now I get an error. Thank you,
select("STNAME", "CTYNAME", "YEAR", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
save(popSample, file="./datafiles/popSample.rdata" )
load("./datafiles/popSample.rdata")
# We want to see Total Population for all years and all age groups
set1filter <- popSample %>%
filter(AGEGRP == 0) %>%
summarize(set1filter, set1 = sum(TOT_POP))
set1
There is an extra %>% at the end of the filter line when creating set1filter. Either remove it and summarise in a separate step, or drop set1filter from the summarize call if we keep everything in the same chain
library(dplyr)
popSample %>%
filter(AGEGRP == 0) %>%
summarise(set1 = sum(TOT_POP))
We can't refer to an object inside summarize that has not been created yet

Dividing values in each cell by the group average in R

I am trying to generate a new column with values derived from the original data. I would like to calculate the group average for the same hotel and the same date first, then divide the original sales by that group average.
Here is my code. I tried to calculate the group average using group_by and summarise from the dplyr package, but it did not produce the result I expected.
hotel = c(rep("Hilton",3), rep("Caesar",3))
date1 = c(rep('2018-01-01',2), '2018-01-02', rep('2018-01-01',3))
dba = c(2,0,1,3,2,1)
sales = c(3,5,7,5,2,3)
df = data.frame(cbind(hotel, date1, dba, sales))
df1 = df %>%
group_by(date1, hotel) %>%
dplyr::summarise(avg = mean(sales)) %>%
acast(., date1~hotel)
Any suggestion would be highly appreciated!
Instead of summarise, we can use mutate. After grouping by 'date1', 'hotel', divide the 'sales' by the mean of 'sales' to create a new column
library(tidyverse)
df %>%
group_by(date1, hotel) %>%
mutate(SalesDividedByMean = sales/mean(sales))
NOTE: When the columns have different types, cbind returns a matrix, and a matrix can hold only a single type, so one character vector turns the whole data into character. Wrapping that matrix in data.frame then carries the change over as either factor columns (stringsAsFactors = TRUE, the default before R 4.0.0) or character columns
data
df <- data.frame(hotel, date1, dba, sales)
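
To make the note concrete, here is a small sketch using the vectors from the question. The object names bad, good, and res are made up for illustration.

```r
library(dplyr)

hotel <- c(rep("Hilton", 3), rep("Caesar", 3))
date1 <- c(rep("2018-01-01", 2), "2018-01-02", rep("2018-01-01", 3))
dba   <- c(2, 0, 1, 3, 2, 1)
sales <- c(3, 5, 7, 5, 2, 3)

# cbind() builds a character matrix first, so sales is no longer numeric
bad <- data.frame(cbind(hotel, date1, dba, sales))
is.numeric(bad$sales)   # FALSE

# Passing the vectors directly keeps each column's own type
good <- data.frame(hotel, date1, dba, sales)
is.numeric(good$sales)  # TRUE

res <- good %>%
  group_by(date1, hotel) %>%
  mutate(SalesDividedByMean = sales / mean(sales)) %>%
  ungroup()
```

With the cbind version, mean(sales) would be applied to a non-numeric column, so the division cannot give the intended numbers.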
