Using R to count character occurrences in multiple columns of a data.frame

I'm new to R and have a data.frame with 100 columns. Each column is character data and I am trying to make a summary of how many times a character shows up in each column. I would like to be able to make a summary of all the columns at once without having to type in code for each column. I've tried
occurrences <- table(unlist(my_df))
but this table gives me a summary of all columns combined (not a summary for each column).
When I make a summary for one column my output looks how I want but only for that one column:
BG_occurrences <- table(unlist(my_df$G))
   1   na SOME
  17   20    1
Is there a way to code and get a summary of all data in each column all at once? I want the output to look something like this:
        1   na SOME
BG:    17   20    1
sBG:   23   10    5
BX:    18   20    0
NG:    21   11    6

We can use lapply/sapply to loop over the columns and apply table to each one
lapply(my_df, table)
Or it can be done in a vectorized way (the rows of the result are labeled by column position, not by column name)
table(c(col(my_df)), unlist(my_df))
Or with tidyverse
library(dplyr)
library(tidyr)
my_df %>%
  pivot_longer(cols = everything()) %>%
  count(name, value)
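
To get the one-row-per-column layout the question asks for, a minimal base R sketch is shown below. The toy my_df and the helper name lvls are made up for illustration. The idea is to count every column against the same set of values, so that a value missing from a column shows up as 0 instead of being dropped.

```r
# Hypothetical toy data standing in for my_df (the real one has 100 columns)
my_df <- data.frame(
  BG  = c("1", "na", "na", "SOME"),
  sBG = c("1", "1", "na", "na"),
  stringsAsFactors = FALSE
)

# Every value that occurs anywhere in the data frame
lvls <- sort(unique(unlist(my_df)))

# Count each column against the shared levels, then transpose so that
# there is one row per column of my_df and one column per value
counts <- t(sapply(my_df, function(x) table(factor(x, levels = lvls))))
counts
```

Without the shared levels, sapply(my_df, table) returns a list rather than a matrix whenever the columns do not all contain the same set of values.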

Related

Is there an R function to split a row into three different rows based on the values in a column?

I have a data frame that was created by software that populated my data with SNP IDs based on chromosome/position. Some chromosome/positions have more than one SNP ID associated with them, and these were output like row 35. Is there a way in RStudio to split rows like row 35 into 3 separate rows, so that it would look like:
22:16490252:CT:C rs1175903481 22 22:16490252:CT:C;rs1175903481;rs541536337;rs763066770
...
22:16490252:CT:C rs541536337 22 22:16490252:CT:C;rs1175903481;rs541536337;rs763066770
22:16490252:CT:C rs763066770 22 22:16490252:CT:C;rs1175903481;rs541536337;rs763066770
Essentially every other column needs to be the same for the three SNPs, but instead of all being in the same row, they need to be separated.
The attached data frame is a sample of the full data frame.
library(dplyr)
library(tidyr) # unnest() comes from tidyr; stringr is not needed here
If SNP names (rs1175903481;rs541536337;rs763066770) are in column V2 (for example) then you could do something like this:
splitSNP <- df %>%
  mutate(V2 = strsplit(as.character(V2), ";")) %>%
  unnest(V2)
That should split the row into three new rows, with the same information in all of the columns but only one SNP name.
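
The same idea can be sketched in base R without extra packages. The data frame snps, its column names V1 to V3, and the second row are made-up stand-ins (the real data was only shown as an attachment), so treat this as an illustration under those assumptions.

```r
# Hypothetical sample; column names and the second row are invented
snps <- data.frame(
  V1 = c("22:16490252:CT:C", "22:1:A:G"),
  V2 = c("rs1175903481;rs541536337;rs763066770", "rs0000001"),
  V3 = c(22, 22),
  stringsAsFactors = FALSE
)

# Split the SNP column on ";" into a list with one character vector per row
parts <- strsplit(snps$V2, ";", fixed = TRUE)

# Repeat each row once per SNP ID, then overwrite V2 with the single IDs
splitSNP <- snps[rep(seq_len(nrow(snps)), lengths(parts)), ]
splitSNP$V2 <- unlist(parts)
rownames(splitSNP) <- NULL
splitSNP
```

Rows with a single SNP ID pass through unchanged, and every other column is copied as-is into the new rows.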

Combining df in R and renaming columns from df 2 - 12 to the column names of df 1

I have 12 dataframes that I need to combine. It appears that they all have the same column names, but it's possible there are slight differences in spacing. I want to rename the columns in df 2 - 12 to the column name of df 1 so that they can properly full_join. Each dataframe has 7 columns. What is the best way to go about doing this?
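
One possible approach, sketched in base R under the assumption that the columns appear in the same order in every data frame: put the data frames in a list and overwrite each one's names with those of the first. The objects df_1 to df_3 and the name target are hypothetical stand-ins for the 12 real data frames.

```r
# Three small stand-ins; the names differ only in stray spaces
df_1 <- data.frame(site = 1:2, value = c(10, 20))
df_2 <- data.frame(` site` = 3:4, `value ` = c(30, 40), check.names = FALSE)
df_3 <- data.frame(site = 5:6, ` value` = c(50, 60), check.names = FALSE)

dfs <- list(df_1, df_2, df_3)

# Column names of the first data frame are the reference
target <- names(dfs[[1]])

# Rename by position; stop early if a data frame has a different width
dfs <- lapply(dfs, function(d) {
  stopifnot(ncol(d) == length(target))
  setNames(d, target)
})

# Stack the renamed data frames on top of each other
combined <- do.call(rbind, dfs)
combined
```

If the only differences are leading or trailing spaces, names(d) <- trimws(names(d)) is a gentler fix that does not depend on column order. Note also that stacking rows (rbind, or dplyr's bind_rows) and full_join do different things: full_join matches rows on key columns instead of appending them.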

Adding values not including duplicates

I have the following data frame in R (actual data frame is millions of rows with thousands of unique Column A values):
Row Column A Column B
1 130077 65
2 130077 65
3 130077 65
4 200040 10
5 200040 10
How can I add up Column B values grouped by Column A values without including duplicated Column A values? Correct output would be:
130077 65
200040 10
........
I have tried using filter and group_by with no success since the final output does sum values by Column A values but includes duplicated values.
An option is to get the distinct rows, then do a group by 'ColumnA' and get the sum of 'ColumnB'
library(dplyr)
df1 %>%
  distinct(ColumnA, ColumnB) %>% # The example gives the expected output here
  group_by(ColumnA) %>%
  summarise(ColumnB = sum(ColumnB))
Or in base R with unique and aggregate
aggregate(ColumnB ~ ColumnA, unique(df1[c("ColumnA", "ColumnB")]), sum)

Calculating amount of observations for grouped variable and attach result to dataframe

Similar to "dplyr: put count occurrences into new variable", "Putting rowwise counts of value occurences into new variables, how to do that in R with dplyr?", "Count observations of distinct values per group and add a new column of counts for each value" or "Create columns from factors and count".
I am looking for a way to summarize observations under a certain grouping variable and attach the result to my dataframe (preferably with dplyr, but any solution is appreciated).
Here is the structure of my data:
id team teamsize apr16 may16
123 A 13 0 1
124 B 8 1 1
125 A 13 1 0
126 A 13 0 1
I would like R to group my data according to the teams, count the number of 1s for apr16 (so just add up the 1s for each team) and attach a new variable to my dataframe with the result. In a next step, I would like to use this result (it represents the number of users of a certain policy) to calculate the share of users for each team (that's what I need teamsize for), but I haven't gotten there yet.
I have tried several ways to just calculate the number of users, but it did not work out.
DF <- DF %>%
  group_by(team, apr16) %>%
  mutate(users_0416 = n()) %>%
  ungroup
DF <- DF %>%
  group_by(team, apr16) %>%
  mutate(users_0416 = sum()) %>%
  ungroup
I feel like this is a really easy thing to do but I am just not moving on and really looking forward to any help.
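
A minimal sketch of one way to do this, using the sample rows from the question. Grouping by team only (not by team and apr16) and summing the 0/1 column gives the number of users per team. The column name share_0416 is made up for illustration, and base R's ave() is used here so the example runs without extra packages.

```r
# The four sample rows from the question
DF <- data.frame(
  id = 123:126,
  team = c("A", "B", "A", "A"),
  teamsize = c(13, 8, 13, 13),
  apr16 = c(0, 1, 1, 0),
  may16 = c(1, 1, 0, 1)
)

# Sum apr16 within each team and attach the result to every row of that team
DF$users_0416 <- ave(DF$apr16, DF$team, FUN = sum)

# Share of users per team (share_0416 is a hypothetical column name)
DF$share_0416 <- DF$users_0416 / DF$teamsize

# dplyr version of the same idea:
# DF %>% group_by(team) %>% mutate(users_0416 = sum(apr16)) %>% ungroup()
DF
```

Grouping by both team and apr16, as in the attempts above, splits each team into a 0 group and a 1 group, which is why n() did not return the per-team count of 1s.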

Find last values by condition [duplicate]

This question already has answers here:
Select the first and last row by group in a data frame
I have a very large data frame that I need to subset by last values. I know that the data.table library includes the last() function which returns the last value of an array, but what I need is to subset foo by the last value in id for every separate value in track. Values in id are consecutive integers, but the last values will be different for every track.
> head(foo)
track id coords.x coords.y
1 0 0 -79.90732 43.26133
2 0 1 -79.90733 43.26124
3 0 2 -79.90733 43.26124
4 0 3 -79.90733 43.26124
5 0 4 -79.90725 43.26121
6 0 5 -79.90725 43.26121
The output would look something like this.
track id coords.x coords.y
1 0 57 -79.90756 43.26123
2 1 98 -79.90777 43.26231
3 2 61 -79.90716 43.26200
... and so on
How would one apply the last() function (or another function like tail()) to produce this output?
We can try with dplyr, grouping by track and selecting only the last row of every group.
library(dplyr)
df %>%
  group_by(track) %>%
  filter(row_number() == n())
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'track', get the last row with tail
library(data.table)
setDT(df1)[, tail(.SD, 1), by = track]
As the OP also mentioned that the values in 'id' are consecutive, we can also build a logical index with diff that is TRUE wherever the next 'id' is not the current one plus 1, and subset the rows with it. This assumes the rows are sorted by 'track' and 'id' and that 'id' restarts at each new track.
setDT(df1)[c(diff(id) != 1, TRUE)]
Or we can do this using base R itself
df1[!duplicated(df1$track, fromLast=TRUE),]
Or another option is dplyr
library(dplyr)
df1 %>%
  group_by(track) %>%
  slice(n())
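
As a quick check of the base R option and of the id-based idea, here is a small sketch on made-up data. The toy foo below is hypothetical and assumes the rows are sorted by track and id, with id restarting at 0 for each track.

```r
# Hypothetical toy data: three tracks with 3, 2 and 4 points
foo <- data.frame(
  track = c(0, 0, 0, 1, 1, 2, 2, 2, 2),
  id    = c(0, 1, 2, 0, 1, 0, 1, 2, 3)
)

# Last row per track: keep rows that are not followed by the same track
last_by_track <- foo[!duplicated(foo$track, fromLast = TRUE), ]

# Same rows via the id logic: a row is last when the next id is not id + 1;
# the final row of the data is always a last row
last_by_id <- foo[c(diff(foo$id) != 1, TRUE), ]

last_by_track
```

The duplicated() version only needs the rows of each track to be in order. The diff() version additionally relies on id restarting between tracks, so it is the more fragile of the two.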
