Hi, I have two columns. Every row represents one student. The first column gives the student's class. The second column indicates whether the student passed an exam.
Here is a reproducible example.
This is the data I have now:
a=c("A","A","A","A","B","B","B","C","C")
b=c(0,0,1,0,0,0,0,1,1)
mydata=data.frame(a,b)
names(mydata)=c("CLASS","PASSED")
This is the output I want:
a1=c("A","B","C")
b1=c(4,3,2)
c1=c(1,0,2)
mydataWANT=data.frame(a1,b1,c1)
names(mydataWANT)=c("CLASS","SIZE","PASSED")
Here is my attempt using the dplyr package:
library(dplyr)
mydataWANT <- data.frame(mydata %>%
  group_by(CLASS, PASSED) %>%
  summarise(N = n()))
but it does not yield the desired output.
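One possible approach, offered as a sketch rather than a definitive answer: grouping by CLASS and PASSED together produces one row per class/outcome combination, whereas the desired table has one row per class. Grouping by CLASS alone and computing both summaries in a single summarise() call should give the wanted shape. This assumes PASSED is coded as 0/1, as in the example data, so that sum() counts the passes.

```r
library(dplyr)

# Example data from the question
mydata <- data.frame(
  CLASS  = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
  PASSED = c(0, 0, 1, 0, 0, 0, 0, 1, 1)
)

# One row per class: number of students and number of passes
mydataWANT <- mydata %>%
  group_by(CLASS) %>%
  summarise(
    SIZE   = n(),          # rows (students) in the class
    PASSED = sum(PASSED)   # passes, assuming 0/1 coding
  )
```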
I'm struggling to get window functions working in R to rank groups by the number of rows.
Here's my sample code:
library(tidyverse)
data <- read_csv("https://dq-content.s3.amazonaws.com/498/book_reviews.csv")
data %>%
  group_by(state) %>%
  mutate(num_books = n(),
         state_rank = dense_rank(num_books)) %>%
  arrange(num_books)
The expected output is the original data with a new column giving each row (book, state, price and review) the rank of its state: rows for the state with the most book reviews would have a state_rank of 1, rows for the state with the second most would have 2, and so on.
Manually I can get the output like this:
manual_ranks <- data %>%
count(state) %>%
mutate(state_rank = rank(state))
desired_output <- data %>%
left_join(manual_ranks)
In other words, I want the last column of this table:
data %>%
count(state) %>%
mutate(state_rank = rank(state))
added to each row of the original table (without having to create this table and then using left_join by state; that's the point of window functions).
Anyway, with the original code, you'll see that all state_rank just say 1, when I would expect states with the most book reviews to be ranked 1, second most reviews would have 2, etc.
My goal is to then be able to filter by, say, state_rank <= 4. That is, I want to keep all the rows in the original data for the top 4 states with the most book reviews.
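A likely explanation, with a sketch of one way around it: inside group_by(state), num_books is constant within each group, so dense_rank() only ever sees one distinct value per group and returns 1. The ranking has to happen on ungrouped data. The example below uses add_count(), which appends the per-state count without leaving the data grouped, and then ranks across all rows. The small reviews data frame is a hypothetical stand-in for the book reviews file, and the cutoff of 2 is only there to suit the toy data.

```r
library(dplyr)

# Hypothetical stand-in for the book reviews data
reviews <- data.frame(
  book  = c("a", "b", "c", "d", "e", "f"),
  state = c("NY", "NY", "NY", "TX", "TX", "FL"),
  stringsAsFactors = FALSE
)

ranked <- reviews %>%
  add_count(state, name = "num_books") %>%           # per-state count, data stays ungrouped
  mutate(state_rank = dense_rank(desc(num_books)))   # rank 1 = state with the most rows

# Keep every row belonging to the top 2 states
top_states <- ranked %>%
  filter(state_rank <= 2)
```

Because dense_rank() leaves no gaps, states tied on the number of reviews share a rank, so a filter such as state_rank <= 4 can keep more than four states when there are ties.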
I have a data frame that contains one column with strings coding events and one column with participant numbers. Every participant has multiple rows of events. Event codes contain keywords separated by underscores. I would like to count the occurrence of specific keywords per participant and put this in a new dataframe with one row per participant.
I have tried to do this by using grepl to find the keywords and then group_by and summarise to create a new data frame. Here is a minimal example:
part = rep(1:5,4)
events = c("black_white","black_blue","black_yellow","black_white","black_blue","black_yellow","black_white","black_blue","black_yellow","black_white","black_blue","black_yellow","black_white","black_blue","black_yellow","black_white","black_blue","black_yellow","black_white","black_blue")
data = data.frame(part,events)
data_sum = data %>%
  group_by(part) %>%
  summarise(
    black = sum(grepl("black",data$event)),
    black_yellow = sum(grepl("black_yellow",data$event))
  )
However, when I run this, the counts are not computed per participant; every participant gets the same overall counts.
Does anyone have any tips on what I'm doing wrong?
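Refer to the column as events rather than data$event. Inside summarise, data$event pulls the whole ungrouped column from the original data frame, so every participant gets the overall count. You can use this code: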
data_sum = data %>%
  group_by(part) %>%
  summarise(
    black = sum(grepl("black",events)),
    black_yellow = sum(grepl("black_yellow",events))
  )
I am struggling to turn one of the columns in my data into a row, so that it becomes its own observation. Below is a representative example of what my data looks like:
library(tidyverse)
test_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
                  can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo"),
                  pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
                  ivv1=rep(11860),cv1=c(25875,386665,23006,10409),
                  abstention=rep(52199))
As seen above, the abstention column exists at the end of my data frame, and I would like my data to look like the following:
library(tidyverse)
desired_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
                     can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo","abstention"),
                     pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
                     ivv1=rep(11860),cv1=c(25875,386665,23006,10409,52199))
Here, abstentions are treated like a candidate in the can column. The rest of the data is kept as is, and the abstention value becomes its own observation in the cv1 column.
I have tried using pivot_wider, but I am unsure how to use the arguments to get what I want. I have also considered t() to transpose the column into a row, but I am having a hard time slotting it back into my data. Any help is appreciated! Thanks!
Here's a strategy that will work if you have multiple unit_names:
test_df %>%
  group_split(unit_name) %>%
  map( function(group_data) {
    slice(group_data, 1) %>%
      mutate(can="abstention", cv1=abstention) %>%
      add_row(group_data, .) %>%
      select(-abstention)
  }) %>%
  bind_rows()
Basically we split the data up by unit_name, then we grab the first row for each group and move the values around. Append that as a new row to each group, and then re-combine all the groups.
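The same idea can probably be expressed without splitting and mapping. The variant below is a sketch, not the answerer's code: it takes the first row of each unit_name, rewrites it as the abstention row, binds it back onto the original data, and sorts by unit_name. It relies on arrange() keeping tied rows in their existing order, so each abstention row lands after the candidates of its unit, and it assumes the abstention value is constant within a unit, as in the example.

```r
library(dplyr)

# Example data from the question
test_df <- tibble::tibble(
  unit_name = "Chungcheongbuk-do", unit_n = 2,
  can = c("Cho Bong-am", "Lee Seung-man", "Lee Si-yeong", "Shin Heung-woo"),
  pev1 = 510014, vot1 = 457815, vv1 = 445955,
  ivv1 = 11860, cv1 = c(25875, 386665, 23006, 10409),
  abstention = 52199
)

# One abstention row per unit, built from that unit's first row
abstention_rows <- test_df %>%
  group_by(unit_name) %>%
  slice(1) %>%
  ungroup() %>%
  mutate(can = "abstention", cv1 = abstention)

# Append, keep each unit's rows together, and drop the now redundant column
result <- bind_rows(test_df, abstention_rows) %>%
  arrange(unit_name) %>%
  select(-abstention)
```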
I have two datasets:
DS1 - contains a list of subjects with columns for name, ID number and employment status.
DS2 - contains the same subjects' names and ID numbers, but some of the ID numbers are missing.
It also contains a third column for education level.
I want to merge the education column onto the first dataset. I have done this with the merge function, matching on ID number, but because some ID numbers are missing in the second dataset, I want to match the remaining education levels by name as a fallback. Is there a way to do this using dplyr/tidyverse?
There are two ways you can do this. Choose whichever you prefer.
1st option:
# Left join twice, selecting columns each time so that no duplicated '.x'/'.y' columns appear
finalDf = DS1 %>%
  dplyr::left_join(DS2 %>%
                     dplyr::select(ID,EducationLevel1=EducationLevel),by=c('ID')) %>%
  dplyr::left_join(DS2 %>%
                     dplyr::select(Name,EducationLevel2=EducationLevel),by=c('Name')) %>%
  dplyr::mutate(FinalEducationLevel = ifelse(is.na(EducationLevel1),EducationLevel2,EducationLevel1))
2nd option:
# First find the IDs that are present in the 2nd dataset
commonIds = DS1 %>%
  dplyr::inner_join(DS2 %>%
                      dplyr::select(ID,EducationLevel),by=c('ID'))
# Now the records whose ID was not present in DS2, matched by name instead
idsNotPresent = DS1 %>%
  dplyr::filter(!ID %in% commonIds$ID) %>%
  dplyr::left_join(DS2 %>%
                     dplyr::select(Name,EducationLevel),by=c('Name'))
# Bind these two data frames to get the final data frame
finalDf = dplyr::bind_rows(commonIds,idsNotPresent)
Let me know if this works.
The second option in makeshift-programmer's answer worked for me. Thank you so much. I had to play around with it for my actual datasets, but the basic structure worked very well and was easy to adapt.
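As a possible variation on the first option, dplyr::coalesce() can take the place of the ifelse() call: it returns the first non-missing value across its arguments, which is exactly the "ID match first, name match as fallback" rule. The sketch below uses made-up toy data; the names in it, the Employment column, and the EduByID/EduByName helper columns are all hypothetical. It also assumes names are unique within DS2, since duplicated names would duplicate rows in the join.

```r
library(dplyr)

# Hypothetical toy data; Bob's ID is missing in DS2
DS1 <- data.frame(
  Name = c("Ann", "Bob", "Cara"),
  ID = c(1, 2, 3),
  Employment = c("Employed", "Unemployed", "Employed"),
  stringsAsFactors = FALSE
)
DS2 <- data.frame(
  Name = c("Ann", "Bob", "Cara"),
  ID = c(1, NA, 3),
  EducationLevel = c("BSc", "MSc", "PhD"),
  stringsAsFactors = FALSE
)

# Lookup tables: one keyed by ID (missing IDs dropped), one keyed by name
by_id   <- DS2 %>% filter(!is.na(ID)) %>% select(ID, EduByID = EducationLevel)
by_name <- DS2 %>% select(Name, EduByName = EducationLevel)

finalDf <- DS1 %>%
  left_join(by_id, by = "ID") %>%
  left_join(by_name, by = "Name") %>%
  mutate(EducationLevel = coalesce(EduByID, EduByName)) %>%  # ID match wins, name is the fallback
  select(-EduByID, -EduByName)
```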
Using the openfda package, I have this code:
# devtools::install_github("ropenhealth/openfda")
library("openfda")
drugs = fda_query("/drug/event.json") %>%
  fda_api_key("MY_KEY") %>%
  fda_count("patient.drug.medicinalproduct.exact") %>%
  fda_exec()
This gives me a list of drugs with a count on each line (from fda_count). I would also like to add a column to the "drugs" data frame with the company name corresponding to each drug. How can I add the data from "patient.drug.openfda.manufacturer_name.exact" as a third column?