Finding counts of each occurrence in R

I am trying to find the number of occurrences of each string within a certain row of a data frame in R. I assume I would use the unique() function.
For example, if I wanted a count of how many times each type of dog showed up within a data frame, how would I go about this?
Thanks!

It would be best if you gave a reproducible example, but...
sum(df[row_num, ] %in% c("Golden Retriever"))
would give the number of occurrences of "Golden Retriever" in row row_num. Iterating with a for loop would cover the whole data frame.
Using the dplyr package you can do a rowwise operation to populate a new column with the count, e.g.
df %>% rowwise() %>% mutate(gold_count = sum(c(col_name1, col_name2, ...) %in% "Golden Retriever"))
You can do the same for all the other types of dog as well.
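To count every distinct value at once rather than one type at a time, a minimal base R sketch might look like the following. The data frame dogs and its column names are made up for illustration, and the sketch assumes the columns are character vectors rather than factors.

```r
# Hypothetical example data: each cell holds a dog breed (character columns).
dogs <- data.frame(
  pet1 = c("Golden Retriever", "Poodle"),
  pet2 = c("Poodle", "Poodle"),
  pet3 = c("Golden Retriever", "Beagle"),
  stringsAsFactors = FALSE
)

row_num <- 1

# Count of one specific value in a single row.
golden_in_row <- sum(dogs[row_num, ] == "Golden Retriever")

# Counts of every distinct value in that row.
row_counts <- table(unlist(dogs[row_num, ]))

# Counts of every distinct value across the whole data frame.
all_counts <- table(unlist(dogs))

print(row_counts)
print(all_counts)
```

table() returns one count per distinct value, so there is no need to call unique() first.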

Related

Trying to use ddply to subset a dataframe by two column variables, then find the maximum of a third column in r?

I have a dataframe called data with variables for date, time, temperature, and a group number called Box #. I'm trying to subset the data to find the maximum temperature for each day, for each box, along with the time that temperature occurred at. Ideally I could place this into a new dataframe with the date, the maximum temperature, and the time it occurred at.
I tried using ddply, but the code only returns one line of output:
ddply(data, .('Box #', 'Date'), summarize, max('Temp'))
I was able to find the maximum temperatures for each day using tapply on separate dataframes that only contain the values for individual groups
mx_day_2 <- tapply(box2$Temp, box2$Date, max)
I was unable to apply this to the larger dataframe with all groups and cannot figure out how to also get time from this code.
Is it possible to have ddply subset by both Box # and Date, then return two separate outputs of both maximum temperature and time, or do I need to use a different function here?
Edit: I managed to get the maximum temperatures using a version of the code in the answer below, but still haven't figured out how to find the time at which the max occurs in the same data. The code that worked for the first part was
max_data <- data %>%
group_by(data$'Box #', data$'Date')
max_values <- summarise(max_data, max_temp=max(Temp, na.rm=TRUE))
I would use dplyr/tidyverse instead of plyr; dplyr is the updated successor to that package. I would also clean the column names with janitor, since names with spaces are difficult to work with (it changes 'Box #' to box_number).
library(tidyverse)
library(janitor)
mx_day2 <- data %>%
  clean_names() %>%
  group_by(date, box_number) %>%
  summarise(max_temp = max(temp, na.rm = TRUE))
I found a solution that pulls full rows from the initial dataframe into a new dataframe, keeping only the rows that hold each group's max value. Full code for the solution is below:
max_data_v2 <- data %>%
group_by(data$'Box #', data$'Date') %>%
filter(Temp == max(Temp, na.rm=TRUE))
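The same "keep the rows that hold each group's maximum" idea can also be sketched in base R with ave(), which avoids any package dependency. The data frame temps and its column names (box, date, time, temp) are hypothetical stand-ins for the asker's data, not the real columns.

```r
# Hypothetical stand-in for the asker's data.
temps <- data.frame(
  box  = c(1, 1, 1, 2, 2, 2),
  date = c("2020-01-01", "2020-01-01", "2020-01-02",
           "2020-01-01", "2020-01-01", "2020-01-02"),
  time = c("08:00", "14:00", "09:00", "08:00", "14:00", "10:00"),
  temp = c(10, 15, 12, 11, 9, 14),
  stringsAsFactors = FALSE
)

# For every row, compute the maximum temperature of its (box, date) group.
group_max <- ave(temps$temp, temps$box, temps$date,
                 FUN = function(v) max(v, na.rm = TRUE))

# Keep only the rows whose temperature equals their group's maximum,
# so the time column comes along with the maximum.
max_rows <- temps[!is.na(temps$temp) & temps$temp == group_max, ]
rownames(max_rows) <- NULL

print(max_rows)
```

If two readings in the same box and day tie for the maximum, both rows are kept, which matches the behaviour of the filter() version.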

Create new dataframe column in R that conditions on row values without iterating?

So let's say I have the following dataframe "df":
names <- c("Bob","Mary","Ben","Lauren")
number <- c(1:4)
age <- c(20,33,34,45)
df <- data.frame(names,number,age)
Let's say I have another dataframe ("df2") with thousands of people and I want to sum the income of people in that other dataframe that have the given name, number and age of each row in "df". That is, for each row "i" of "df", I want to create a fourth column "TotalIncome" that is the sum of the income of all the people with the given name, age and number in dataframe "df2". In other words, for each row "i":
df$TotalIncome[i] <- sum(
  df2$Income[df2$Name == df$names[i] &
             df2$Numbers == df$number[i] &
             df2$Age == df$age[i]], na.rm=TRUE)
Is there a way to do this without having to iterate in a for loop for each row "i" and perform the above code? Is there a way to use apply() to calculate this for the entire vector rather than only iterating each line individually? The actual dataset I am working with is huge and iterating takes quite a while and I am hoping there is a more efficient way to do this in R.
Thanks!
Have you considered using the dplyr package? Its SQL-style grammar makes this job quick and easy.
The code will be something like
library(dplyr)
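# Assumes df2 has columns Name, Numbers, Age and Income, as in the question
df %>% left_join(df2, by = c("names" = "Name", "number" = "Numbers", "age" = "Age")) %>%
  group_by(names, number, age) %>%
  summarize(TotalIncome = sum(Income, na.rm = TRUE))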
I suggest looking at the cheat sheets available on the dplyr site, or at the Wickham and Grolemund book.
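For comparison, the same join-then-sum idea can be sketched in base R with aggregate() and merge(), with no explicit loop. The contents of df2 below are invented for illustration, and its column names (Name, Numbers, Age, Income) are taken from the question's snippet and may not match the real data.

```r
df <- data.frame(
  names  = c("Bob", "Mary", "Ben", "Lauren"),
  number = 1:4,
  age    = c(20, 33, 34, 45),
  stringsAsFactors = FALSE
)

# Hypothetical second data frame; the rows are made up.
df2 <- data.frame(
  Name    = c("Bob", "Bob", "Mary", "Ben", "Zed"),
  Numbers = c(1L, 1L, 2L, 3L, 9L),
  Age     = c(20, 20, 33, 34, 50),
  Income  = c(100, 250, 300, NA, 999),
  stringsAsFactors = FALSE
)

# Sum income once per (Name, Numbers, Age) combination.
# The formula interface drops rows with a missing Income.
totals <- aggregate(Income ~ Name + Numbers + Age, data = df2, FUN = sum)

# Attach the totals to df, keeping every row of df.
merged <- merge(df, totals,
                by.x = c("names", "number", "age"),
                by.y = c("Name", "Numbers", "Age"),
                all.x = TRUE)
names(merged)[names(merged) == "Income"] <- "TotalIncome"

# People with no matching income get 0, as sum(..., na.rm = TRUE) would give.
merged$TotalIncome[is.na(merged$TotalIncome)] <- 0

print(merged)
```

Note that merge() sorts the result by the key columns, so the row order can differ from the original df.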

Percentage from age data in a column

I am using R for a project for University. I imported a csv file and created a df. Everything was going smoothly until I had to gather the percentages of age groups in the "Age" column. There are 3,000 rows of information in my df. How do I only sample information from rows 50-200 to find the percentages of people ages 15-20, 21-25, 26-30, and 31-35?
You can try creating another df that only contains rows 50-200 using the slice function, e.g. my_data %>% slice(1:6) gives rows 1-6, so slice(50:200) gives rows 50-200. In case you didn't know, this function comes with the tidyverse, which you can load using library(tidyverse). For filtering by particular age groups, you can again use the tidyverse filter function, e.g. my_data %>% filter(...).
If your goal is to draw a random sample, rather than slicing specific rows you can use the function sample_n.
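A base R sketch of the whole task, taking rows 50-200 and then computing the share of each age group, could look like this. The data frame my_data is simulated here, and the column name Age is assumed from the question.

```r
# Simulated stand-in for the imported CSV: 3,000 rows with an Age column.
set.seed(1)
my_data <- data.frame(Age = sample(15:35, 3000, replace = TRUE))

# Keep rows 50 to 200 only.
subset_rows <- my_data[50:200, , drop = FALSE]

# Bin ages into 15-20, 21-25, 26-30 and 31-35.
# Each interval is closed on the right, e.g. (14, 20] covers ages 15 to 20.
age_group <- cut(subset_rows$Age,
                 breaks = c(14, 20, 25, 30, 35),
                 labels = c("15-20", "21-25", "26-30", "31-35"))

# Share of each group, as proportions and as percentages.
proportions <- prop.table(table(age_group))
percentages <- round(100 * proportions, 1)

print(percentages)
```

Ages outside 15-35 would become NA in age_group and are left out of the table by default.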

Changing a Column to an Observation in a Row in R

I am currently struggling to transition one of my columns in my data to a row as an observation. Below is a representative example of what my data looks like:
library(tidyverse)
test_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo"),
pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
ivv1=rep(11860),cv1=c(25875,386665,23006,10409),
abstention=rep(52199))
As seen above, the abstention column exists at the end of my data frame, and I would like my data to look like the following:
library(tidyverse)
desired_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo","abstention"),
pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
ivv1=rep(11860),cv1=c(25875,386665,23006,10409,52199))
Here, abstentions are treated like a candidate, in the can column. Thus, the rest of the data is maintained, and the abstention values are their own observation in the cv1 column.
I have tried using pivot_wider, but I am unsure how to use the arguments to get what I want. I have also considered t() to transpose the column into a row, but also having a hard time slotting it back into my data. Any help is appreciated! Thanks!
Here's a strategy that will work if you have multiple unit_names:
test_df %>%
  group_split(unit_name) %>%
  map( function(group_data) {
    slice(group_data, 1) %>%
      mutate(can="abstention", cv1=abstention) %>%
      add_row(group_data, .) %>%
      select(-abstention)
  }) %>%
  bind_rows()
Basically we split the data up by unit_name, then we grab the first row for each group and move the values around. Append that as a new row to each group, and then re-combine all the groups.
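The same steps (take one row per unit, relabel it as an abstention row, append it, drop the abstention column) can be sketched in base R as well. This is an illustrative alternative rather than the answer's own code, and it uses a plain data.frame in place of the tibble.

```r
# The question's example data as a plain data.frame.
test_df <- data.frame(
  unit_name = "Chungcheongbuk-do", unit_n = 2,
  can = c("Cho Bong-am", "Lee Seung-man", "Lee Si-yeong", "Shin Heung-woo"),
  pev1 = 510014, vot1 = 457815, vv1 = 445955, ivv1 = 11860,
  cv1 = c(25875, 386665, 23006, 10409),
  abstention = 52199,
  stringsAsFactors = FALSE
)

# One row per unit_name, turned into an "abstention" candidate row.
abst_rows <- test_df[!duplicated(test_df$unit_name), ]
abst_rows$can <- "abstention"
abst_rows$cv1 <- abst_rows$abstention

# Append the new rows, keep each unit's rows together, drop the old column.
combined <- rbind(test_df, abst_rows)
combined <- combined[order(combined$unit_name), names(combined) != "abstention"]
rownames(combined) <- NULL

print(combined)
```

Because order() leaves ties in their original order, the abstention row ends up after the candidates of its unit.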

In R, how to sum multiple columns based on value of another column?

In R, I have a dataframe with one variable holding the name of a country, a number of numeric variables (population, number of cars, etc.), and a final column that represents the region.
I would like to sum the variables (1, 2, ...) based on the value of the last column, region. I think this should be possible with dplyr and summarise each, but I cannot get it to work.
Would someone be able to help me please? Thanks a lot.
Reading your question (although this may change if you can put together some of your dataframe):
library(dplyr)
summarized_df <- df %>%
  group_by(region) %>%
  summarise(var1=sum(variable1), var2=sum(variable2), var3=sum(variable3))
If this doesn't seem to work, maybe you can post your code and the errors even if you can't post the dataframe.
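When there are many numeric columns, naming each one inside summarise() gets tedious. A base R sketch that sums every numeric column per region is shown below; the data frame countries and its columns are invented for illustration.

```r
# Hypothetical data: one name column, some numeric columns, and a region.
countries <- data.frame(
  country    = c("A", "B", "C", "D"),
  population = c(10, 20, 30, 40),
  cars       = c(1, 2, 3, 4),
  region     = c("North", "North", "South", "South"),
  stringsAsFactors = FALSE
)

# Pick out the numeric columns without naming them one by one.
numeric_cols <- names(countries)[sapply(countries, is.numeric)]

# Sum each numeric column within each region.
region_totals <- aggregate(countries[numeric_cols],
                           by = list(region = countries$region),
                           FUN = sum)

print(region_totals)
```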
