counting occurrences of patterns in data frame columns by groups in R

I have a data frame that contains one column with strings coding events and one column with participant numbers. Every participant has multiple rows of events. Event codes contain keywords separated by underscores. I would like to count the occurrences of specific keywords per participant and put this in a new data frame with one row per participant.
I have tried to do this by using grepl to find the keywords and then group_by and summarise to create a new data frame. Here is a minimal example:
library(dplyr)

part = rep(1:5, 4)
events = c("black_white", "black_blue", "black_yellow", "black_white", "black_blue",
           "black_yellow", "black_white", "black_blue", "black_yellow", "black_white",
           "black_blue", "black_yellow", "black_white", "black_blue", "black_yellow",
           "black_white", "black_blue", "black_yellow", "black_white", "black_blue")
data = data.frame(part, events)
data_sum = data %>%
  group_by(part) %>%
  summarise(
    black = sum(grepl("black", data$event)),
    black_yellow = sum(grepl("black_yellow", data$event))
  )
However, if I run this, the counts are not grouped by participant but are the overall counts, and therefore the same for everyone.
Does anyone have any tips on what I'm doing wrong?

You can use this code. The fix is to refer to the bare column name inside summarise(): data$event always extracts the full, ungrouped column from the original data frame, whereas the bare name events is evaluated within each group.
data_sum = data %>%
  group_by(part) %>%
  summarise(
    black = sum(grepl("black", events)),
    black_yellow = sum(grepl("black_yellow", events))
  )
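Note that grepl("black", events) matches every code in this example (each one contains "black"), so black is simply each participant's row count; use a more specific pattern if that is not intended. With the sample data above, the grouped result looks like this (a sketch of the expected output):

#   part black black_yellow
# 1    1     4            1
# 2    2     4            1
# 3    3     4            2
# 4    4     4            1
# 5    5     4            1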

Related

How can I create a new column in a data frame that contains a new assigned ID number, 1 through 33, for the long ID numbers assigned in the dataset?

I'd like to keep the column of original Id numbers, but I'd like to create a new Id number for each participant in my dataset, so when I create a geom_bar graph it's not using decimals and looking strange.
This is the current R code I have written.
library(dplyr)

unique_ids <- daily_activity %>%
  group_by(Id) %>%
  summarize(days_used = n_distinct(ActivityDate))
This is the current data frame:
https://i.stack.imgur.com/T96hM.png
As you can see, it has the Id number, and when I pass this to geom_bar, it becomes these skinny bars because the long Id numbers are rendered in scientific notation, i.e. 7e+09.
I'd like to create a new column in this data frame that assigns a new numeric Id to each of the long Id numbers. That way, I have a short unique identifier for each super long Id. I'm curious if there is a way to auto-assign numbers starting at 1 and going up to whatever the last number needs to be, positive integers only. I'll then use a note on my graph that says, "See table for Id pairings" or something...
Does any of this make sense? I'm very new to R, coding, graphing, analysis... Any suggestions of ideas I can try? Thoughts?
I recommend creating a separate lookup table for converting old IDs to new IDs. Something like this:
lookup_table = input_table %>%
  select(old_IDs = IDs) %>%
  distinct()
lookup_table$new_IDs = 1:nrow(lookup_table)
You can then join the lookup table to the original table:
output_table = input_table %>%
  inner_join(lookup_table, by = c("IDs" = "old_IDs"))
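If you only need the numbering, a one-step sketch (assuming your column is named Id, as in the screenshot) is to match each value against the vector of unique values:

library(dplyr)

daily_activity <- daily_activity %>%
  mutate(new_Id = match(Id, unique(Id)))  # first distinct Id becomes 1, the second 2, ...

Plotting factor(new_Id) on the x axis then avoids the scientific-notation labels entirely.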

manipulating all the tables in an R group_split list

I have a large table, which starts like this:
Essentially, it's a table with multiple samples ("samp_id") showing the number ("least") of each "taxon" present in each sample.
I want to transpose/pivot the table to look like this:
i.e. with "taxon" as the top row, and each of the 90 samples in "data" following as a row based on the "least" column, renamed with its "samp_id". That way you can see what each sample is, as well as the value of "least" for each sample in the different "taxon" (which may not be identical across the 90 samples).
Previously, I separated the data into multiple tibbles based on "samp_id", selected "taxon" and "least", renamed "least" to the "samp_id", then combined the individual tibbles on "taxon" with full_join using something like the code below, and finally transposed the combined table:
ACLOD_11 = data %>%
  filter(samp_id == "ACLOD_11") %>%
  select(taxon, least) %>%
  rename("ACLOD_11" = least)
ACLOD_12 = data ... # as above, but different samp_id
data_final = list(ACLOD_11, ACLOD_12, ...) %>%
  reduce(full_join, by = "taxon")
As more data tables will follow after this one with 90 samples, I want to be able to do this without individually separating the data into hundreds of tibbles and manually typing in each "samp_id" before joining.
I have currently split the data into 90 separate tibbles based on "samp_id" (there are 90 samples in "data"):
data_split = data %>%
  group_split(samp_id)
but am unsure if this is the best way to do this, or what I should do next.
We can use
library(dplyr)
library(purrr)

data %>%
  split(.$samp_id) %>%               # named list of tibbles, one per samp_id
  imap(~ .x %>%
         select(taxon, least) %>%
         rename(!!.y := least)) %>%  # .y is the list element's name, i.e. the samp_id
  reduce(full_join, by = 'taxon')
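A more direct sketch with tidyr (assuming each samp_id/taxon pair occurs only once, so no values need aggregating) produces the samples-as-rows layout in a single call:

library(dplyr)
library(tidyr)

data %>%
  pivot_wider(id_cols = samp_id, names_from = taxon, values_from = least)

This gives one row per sample and one column per taxon, which is the transposed form the question asks for, without any splitting or joining.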

Changing a Column to an Observation in a Row in R

I am currently struggling to transition one of my columns in my data to a row as an observation. Below is a representative example of what my data looks like:
library(tidyverse)

test_df <- tibble(unit_name = rep("Chungcheongbuk-do"), unit_n = rep(2),
                  can = c("Cho Bong-am", "Lee Seung-man", "Lee Si-yeong", "Shin Heung-woo"),
                  pev1 = rep(510014), vot1 = rep(457815), vv1 = rep(445955),
                  ivv1 = rep(11860), cv1 = c(25875, 386665, 23006, 10409),
                  abstention = rep(52199))
As seen above, the abstention column exists at the end of my data frame, and I would like my data to look like the following:
library(tidyverse)

desired_df <- tibble(unit_name = rep("Chungcheongbuk-do"), unit_n = rep(2),
                     can = c("Cho Bong-am", "Lee Seung-man", "Lee Si-yeong", "Shin Heung-woo", "abstention"),
                     pev1 = rep(510014), vot1 = rep(457815), vv1 = rep(445955),
                     ivv1 = rep(11860), cv1 = c(25875, 386665, 23006, 10409, 52199))
Here, abstentions are treated like a candidate, in the can column. Thus, the rest of the data is maintained, and the abstention values are their own observation in the cv1 column.
I have tried using pivot_wider, but I am unsure how to use the arguments to get what I want. I have also considered t() to transpose the column into a row, but am having a hard time slotting it back into my data. Any help is appreciated! Thanks!
Here's a strategy that will work if you have multiple unit_names:
test_df %>%
  group_split(unit_name) %>%
  map(function(group_data) {
    slice(group_data, 1) %>%
      mutate(can = "abstention", cv1 = abstention) %>%
      add_row(group_data, .) %>%
      select(-abstention)
  }) %>%
  bind_rows()
Basically we split the data up by unit_name, then we grab the first row for each group and move the values around. Append that as a new row to each group, and then re-combine all the groups.
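An equivalent sketch with group_modify() keeps the grouping bookkeeping inside dplyr instead of splitting and re-binding by hand (column names as in the example):

library(dplyr)

test_df %>%
  group_by(unit_name) %>%
  group_modify(~ bind_rows(.x, .x %>%
                             slice(1) %>%
                             mutate(can = "abstention", cv1 = abstention))) %>%
  ungroup() %>%
  select(-abstention)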

Aggregating two rows based on condition of different ID in R

I am dealing with a dataset of player statistics for a sport. There is an error in the data: one week, a player who doesn't exist was attributed data that belongs to a real player. I need to aggregate the two players' data and delete the false player's row.
I need to adjust my preprocessing code to accommodate this, so that when I scrape future weeks' data I don't need to make manual adjustments.
df <- data.frame(Name = c("Bob", "Ben", "Bill"),
                 Team = c("Dogs", "Cats", "Birds"),
                 Runs = c(6, 4, 2))
I'd like to do something along the lines of aggregating the two rows based on df$Name, e.g. when df$Name %in% c("Bob", "Bill"), aggregate columns [3:40] -- these are my columns with numeric statistics; [1:2] have df$Name and df$Team.
It would depend on the type of aggregation you are trying to do. This looks like a perfect use of group_by from the dplyr package. Consider the CO2 data set.
library(dplyr)

CO2 %>%
  group_by(Plant) %>%
  summarise(
    n = n(),                   # number of rows in each group
    meanUptake = mean(uptake)  # aggregate the data and take the mean for each group
  ) %>%
  ungroup()
Here we aggregate within each group; in your case above the grouping variable would be Name. If you wish to include extra information (like Team) in the result, include it within the summarise.
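Applied to the example df, a sketch that folds the false player into the real one and then sums every numeric column (assuming "Bill" is the false entry whose statistics belong to "Bob"):

library(dplyr)

df %>%
  mutate(Name = ifelse(Name == "Bill", "Bob", Name)) %>%  # remap the false player's name
  group_by(Name) %>%
  summarise(Team = first(Team),
            across(where(is.numeric), sum),  # sums Runs here; would cover columns 3:40 in the real data
            .groups = "drop")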

R: Adding a column from one dataset to another based on matching multiple columns

I have two datasets:
DS1 contains a list of subjects, with columns for name, ID number, and employment status.
DS2 contains the same list of subjects' names and ID numbers, but some of the ID numbers are missing on the second dataset. It also contains a third column for education level.
I want to merge the Education column onto the first dataset. I have done this using the merge function, matching by ID number, but because some of the ID numbers are missing on the second dataset, I want to merge the remaining education levels by name as a secondary option. Is there a way to do this using dplyr/tidyverse?
There are two ways you can do this. Choose the one based on your preference.
1st option:
# here I left join twice, selecting/renaming columns each time so there is no '.x'/'.y' duplication
finalDf = DS1 %>%
  dplyr::left_join(DS2 %>%
                     dplyr::select(ID, EducationLevel1 = EducationLevel),
                   by = c('ID')) %>%
  dplyr::left_join(DS2 %>%
                     dplyr::select(Name, EducationLevel2 = EducationLevel),
                   by = c('Name')) %>%
  dplyr::mutate(FinalEducationLevel = ifelse(is.na(EducationLevel1), EducationLevel2, EducationLevel1))
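The last mutate can also be written with dplyr::coalesce(), which takes the first non-NA value across its arguments (a small sketch, same column names as above):

finalDf = finalDf %>%
  dplyr::mutate(FinalEducationLevel = dplyr::coalesce(EducationLevel1, EducationLevel2))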
2nd option:
# first find the IDs which are present in the 2nd dataset
commonIds = DS1 %>%
  dplyr::inner_join(DS2 %>%
                      dplyr::select(ID, EducationLevel),
                    by = c('ID'))
# now the records where ID was not present in DS2
idsNotPresent = DS1 %>%
  dplyr::filter(!ID %in% commonIds$ID) %>%
  dplyr::left_join(DS2 %>%
                     dplyr::select(Name, EducationLevel),
                   by = c('Name'))
# bind these two dfs to get the final df
finalDf = bind_rows(commonIds, idsNotPresent)
Let me know if this works.
The second option in makeshift-programmer's answer worked for me. Thank you so much. I had to play around with it for my actual data sets, but the basic structure worked very well and was easy to adapt.
