I have the following data frame in R (actual data frame is millions of rows with thousands of unique Column A values):
Row  Column A  Column B
1    130077    65
2    130077    65
3    130077    65
4    200040    10
5    200040    10
How can I sum Column B values grouped by Column A without counting the duplicated rows? The correct output would be:
130077 65
200040 10
........
I have tried using filter and group_by with no success: the output does sum Column B by Column A, but the duplicated rows are still included in the sum.
An option is to get the distinct rows, then group by 'ColumnA' and sum 'ColumnB':
library(dplyr)
df1 %>%
  distinct(ColumnA, ColumnB) %>% # The example gives the expected output here
  group_by(ColumnA) %>%
  summarise(ColumnB = sum(ColumnB))
Or in base R with unique and aggregate
aggregate(ColumnB ~ ColumnA, unique(df1[c("ColumnA", "ColumnB")]), sum)
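Both approaches can be checked on a toy df1 built from the example above (ColumnA/ColumnB are the column names the answer assumes):

```r
library(dplyr)

# Toy data matching the example: each (ColumnA, ColumnB) pair is duplicated
df1 <- data.frame(
  ColumnA = c(130077, 130077, 130077, 200040, 200040),
  ColumnB = c(65, 65, 65, 10, 10)
)

# dplyr: drop the duplicated pairs first, then sum per group
res_dplyr <- df1 %>%
  distinct(ColumnA, ColumnB) %>%
  group_by(ColumnA) %>%
  summarise(ColumnB = sum(ColumnB))

# base R equivalent
res_base <- aggregate(ColumnB ~ ColumnA, unique(df1[c("ColumnA", "ColumnB")]), sum)
res_base
#   ColumnA ColumnB
# 1  130077      65
# 2  200040      10
```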
I'm new to R and have a data.frame with 100 columns. Each column is character data and I am trying to make a summary of how many times a character shows up in each column. I would like to be able to make a summary of all the columns at once without having to type in code for each column. I've tried
occurrences <- table(unlist(my_df))
but this table gives me a summary of all the columns combined (not a summary for each column).
When I make a summary for one column my output looks how I want but only for that one column:
BG_occurrences <- table(unlist(my_df$G))
1 na SOME
17 20 1
Is there a way to code and get a summary of all data in each column all at once? I want the output to look something like this:
1 na SOME
BG: 17 20 1
sBG: 23 10 5
BX: 18 20 0
NG: 21 11 6
We can use lapply/sapply to loop over the columns and apply table to each one:
lapply(my_df, table)
Or it can be done in a vectorized way
table(c(col(my_df)), unlist(my_df))
Or with tidyverse
library(dplyr)
library(tidyr)
my_df %>%
  pivot_longer(cols = everything()) %>%
  count(name, value)
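A minimal sketch of all three forms on toy data (the column names BG/sBG are invented for illustration):

```r
library(dplyr)
library(tidyr)

# Toy data: two character columns with repeated values
my_df <- data.frame(
  BG  = c("1", "na", "na", "1"),
  sBG = c("SOME", "1", "1", "na")
)

# One frequency table per column, returned as a named list
lapply(my_df, table)

# Vectorized: a single table whose rows are the column positions of my_df
table(c(col(my_df)), unlist(my_df))

# tidyverse: reshape to long format, then count (name, value) pairs
my_df %>%
  pivot_longer(cols = everything()) %>%
  count(name, value)
```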
I am trying to create a character vector that stores the names of countries with 10 or more total medals in R. I am getting back an 'Error: attempt to apply non-function' when running the code.
Here is the code:
biggest_winners=olympic_df$name(sum(olympic_df$medals.gold & olympic_df$medals.silver & olympic_df$medals.bronze >=10))
Here is a picture of the first 10 rows with column headers for reference.
Try this:
library("dplyr")
# Sum the three medal columns for each row
medal_count <- olympic_df %>%
  select(medals.gold, medals.silver, medals.bronze) %>%
  rowSums()
# Keep the rows whose medal total is 10 or more, and return the distinct names
olympic_df[medal_count >= 10, "name"] %>% unique()
You can create a new column with the row sums using rowSums and then filter for values of 10 or more:
olympic_df$Sum <- rowSums(olympic_df[,3:5])
olympic_df[olympic_df$Sum >= 10,]
Or in a single line:
olympic_df[rowSums(olympic_df[3:5]) >= 10,]
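Both answers assume the medal counts sit in columns 3 to 5. A toy olympic_df with that layout (the data values are invented) makes the logic concrete:

```r
# Toy data with the assumed layout: medal counts in columns 3:5
olympic_df <- data.frame(
  name          = c("CountryA", "CountryB", "CountryC"),
  year          = c(2016, 2016, 2016),
  medals.gold   = c(5, 1, 4),
  medals.silver = c(4, 2, 3),
  medals.bronze = c(2, 0, 3)
)

# Row-wise totals, then keep the names with 10 or more medals
totals <- rowSums(olympic_df[, 3:5])
biggest_winners <- unique(olympic_df[totals >= 10, "name"])
biggest_winners
# [1] "CountryA" "CountryC"
```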
I have a data frame where the value of b ranges from 1 to 31, and alpha_1, alpha_2 and alpha_3 can only take the values 0 and 1. For each b value I have 1000 observations, so 31,000 observations in total. I want to group the entire dataset by b and count the alpha values only when they equal 1. The end result should have 31 rows (one per unique b value from 1:31) with the count of 1s in each alpha column.
How do I do this in R? I have tried pipe methods in dplyr and nothing seems to be working.
We can use sum: since the alpha columns are 0/1, summing them counts the 1s in each group.
library(dplyr)
df1 %>%
  group_by(b) %>%
  summarise_at(vars(starts_with("alpha")), sum)
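In current dplyr, across() supersedes summarise_at(); a runnable sketch on invented 0/1 data:

```r
library(dplyr)

# Toy data: 3 b groups of 4 rows each, alpha columns are 0/1 flags
df1 <- data.frame(
  b       = rep(1:3, each = 4),
  alpha_1 = c(1, 0, 1, 1,  0, 0, 1, 0,  1, 1, 1, 1),
  alpha_2 = c(0, 0, 0, 1,  1, 1, 1, 1,  0, 0, 0, 0)
)

# Summing a 0/1 column counts its 1s per group
counts <- df1 %>%
  group_by(b) %>%
  summarise(across(starts_with("alpha"), sum))
counts
```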
Sample data
mysample <- data.frame(ID = 1:100, kWh = rnorm(100))
I'm trying to automate the process of returning the rows in a data frame that contain the 5 highest values in a certain column. In the sample data, the 5 highest values in the "kWh" column can be found using the code:
(tail(sort(mysample$kWh), 5))
which in my case returns:
[1] 1.477391 1.765312 1.778396 2.686136 2.710494
I would like to create a table that contains rows that contain these numbers in column 2.
I am attempting to use this code:
mysample[mysample$kWh == (tail(sort(mysample$kWh), 5)),]
This returns:
ID kWh
87 87 1.765312
I would like it to return the rows that contain the figures above in the "kWh" column. I'm sure I've missed something basic but I can't figure it out.
The == comparison recycles the 5 highest values against the whole column, so it only matches where the positions happen to line up. We can use rank instead:
mysample$Rank <- rank(-mysample$kWh)
head(mysample[order(mysample$Rank),],5)
If we don't need to create a column, use order directly (as @Jaap mentioned, in three alternative ways):
#order descending and get the first 5 rows
head(mysample[order(-mysample$kWh),],5)
#order ascending and get the last 5 rows
tail(mysample[order(mysample$kWh),],5)
#or use the first 5 positions of the ordering as the row index
mysample[order(-mysample$kWh)[1:5],]
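The order-based and rank-based forms can be verified against each other on the sample data:

```r
set.seed(1)  # make the random sample reproducible
mysample <- data.frame(ID = 1:100, kWh = rnorm(100))

# Top 5 rows by kWh: order() returns row indices sorted by value
top5 <- mysample[order(-mysample$kWh)[1:5], ]

# Same rows via rank()
mysample$Rank <- rank(-mysample$kWh)
top5_rank <- head(mysample[order(mysample$Rank), ], 5)

# Both approaches select identical rows
identical(top5$ID, top5_rank$ID)
# [1] TRUE
```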
I have a dataframe with 23000 rows and 8 columns
I want to subset it using only the unique identifiers in column 1. I do this with:
total_res2 <- unique(total_res['Entrez.ID'])
This produces 17,000 rows with only the information from column 1.
I am wondering how to extract the unique rows, based on this column and also take the information from the other 7 columns using only these unique rows.
This returns the rows of total_res containing the first occurrences of each Entrez.ID value:
subset(total_res, ! duplicated( Entrez.ID ) )
Or, if you meant you only want rows whose Entrez.ID is not duplicated at all:
subset(total_res, ave(seq_along(Entrez.ID), Entrez.ID, FUN = length) == 1 )
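Toy data makes the difference between the two subsets visible (the logFC column is invented for illustration):

```r
# Toy data: Entrez.ID 10 occurs twice, 20 occurs once
total_res <- data.frame(
  Entrez.ID = c(10, 10, 20),
  logFC     = c(1.5, 1.5, -0.7)
)

# First occurrence of each ID: keeps one row for both 10 and 20
first_occ <- subset(total_res, !duplicated(Entrez.ID))

# Only IDs occurring exactly once: drops ID 10 entirely
singletons <- subset(total_res,
                     ave(seq_along(Entrez.ID), Entrez.ID, FUN = length) == 1)

first_occ$Entrez.ID   # 10 20
singletons$Entrez.ID  # 20
```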
Next time please provide test data and expected output.