Add a ColSum to a vector in R using dplyr

I am trying to add the sum of all the counts in a specific vector to my data frame in R. Specifically, I want to keep all the counts and then append a total at the end. In Excel, you would do =SUM(A1:A5232). Additionally, I don't know the length of the vector in advance. See below:
# summarize by column name
NewDepartment <- List %>%
  group_by(NewDepartment) %>%
  tally(sort = TRUE)
The above code will give me the following:
NewDepartment n
<chr> <int>
1 <NA> 709
2 Collections 454
3 Telesales 281
4 Operations Control Management 93
5 Underwriting 92
I want a total count at the end like this:
NewDepartment n
<chr> <int>
1 <NA> 709
2 Collections 454
3 Telesales 281
4 Operations Control Management 93
5 Underwriting 92
6 Total Sum 1721
How do I get row 6 above?

Try this:
NewDepartment <- rbind(NewDepartment,
  data.frame(NewDepartment = "Total Sum", n = sum(NewDepartment$n)))
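If you want to stay in dplyr, here is a sketch of the same idea using bind_rows and summarise (assuming the tally result is stored in NewDepartment, as above):
library(dplyr)
# summarise the whole table into a one-row total, then bind it on
NewDepartment <- NewDepartment %>%
  bind_rows(summarise(., NewDepartment = "Total Sum", n = sum(n)))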

Related

how to find which rows are related by mathematical difference of x in R

I have a data frame with about 20k IDs of chemical compounds and the corresponding molecular weights, something like this:
ID <- c(1,2,3,4,5)
MASS <- c(324,162,508,675,670)
d <- data.frame(ID, MASS)
ID MASS
1 1 324
2 2 162
3 3 508
4 4 675
5 5 670
I would like to find a way to loop over the rows of the MASS column to find which masses are related by a difference (positive or negative) of 162 ± 0.5. Then I would like a new column (d$DIFF) that reports the IDs linked by such a MASS difference, and 0 for the IDs where the condition is not met. In this example it would be something like this:
ID MASS DIFF
1 1 324 1&2
2 2 162 1&2
3 3 508 3&5
4 4 675 0
5 5 670 3&5
Thanks in advance for any help
Here's a base R solution using outer:
# compare every mass against every other mass, flag pairs whose absolute
# difference lies within 162 +/- 0.5, then collect the matching row indices
d$DIFF <- unlist(lapply(
  apply(outer(d$MASS, d$MASS,
              function(x, y) abs(abs(x - y) - 162) < 0.5),
        1, which),
  function(x) if (length(x) == 0) "0" else paste(x, collapse = " & ")
))
This gives the result:
d
#> ID MASS DIFF
#> 1 1 324 2
#> 2 2 162 1
#> 3 3 508 5
#> 4 4 675 0
#> 5 5 670 3
Note that in your example data there is at most a single match for each row, but if you apply this technique to your real data you should get multiple hits for some rows, separated by " & " as requested.
You should also note that, whichever way you do this on your real data, you will have to make approximately 20K * 20K (400 million) comparisons, so it may take some time to complete and may run into memory issues depending on your set-up.
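If memory does become a problem, one workaround (a sketch, not part of the answer above; block_size is an arbitrary tuning choice) is to run the same outer() comparison block by block, so the full 20K x 20K logical matrix is never held in memory at once:
block_size <- 1000
hits <- vector("list", nrow(d))
for (start in seq(1, nrow(d), by = block_size)) {
  rows <- start:min(start + block_size - 1, nrow(d))
  # logical block: TRUE where two masses differ by 162 +/- 0.5
  cmp <- abs(abs(outer(d$MASS[rows], d$MASS, "-")) - 162) < 0.5
  hits[rows] <- lapply(seq_len(nrow(cmp)), function(i) which(cmp[i, ]))
}
d$DIFF <- vapply(hits, function(x)
  if (length(x) == 0) "0" else paste(x, collapse = " & "), character(1))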

read.csv - to separate information stored in .csv based on the presence or absence of a duplicate value

First of all - apologies, I'm new to all of this, so I may write things in a confusing way.
I have multiple .csv files that I need to read, and to save a lot of time I am looking to find an automated way of doing this.
I am looking to read different rows of the .csv and store the information as two separate files, based on the information stored in the last column.
My data is specifically areas and slices of a 3D image, which I will use to compile volumes. If two rows have the same slice, then I need to separate them, as the area in row 1 corresponds to a different structure from the area in row 2 on the same slice.
Eg:
Row,area,slice
1,50,180
2,52,180
3,49,181
4,53,181
5,65,182
6,60,183
So structure 1 has an area at slice 180 (area = 50) and at slice 181 (area = 49), whereas structure 2 has an area at each slice from 180 to 183.
I want to store one structure's rows (in this example, rows 2, 4, 5, and 6) in one .csv, and all the other data in another .csv.
There may be .csv files with more or fewer overlapping slice values, adding complexity to this.
Thank you for the help, please let me know if I need to clarify anything.
Use duplicated:
dat <- read.csv(text="
Row,area,slice
1,50,180
2,52,180
3,49,181
4,53,181
5,65,182
6,60,183")
dat[duplicated(dat$slice),]
# Row area slice
# 2 2 52 180
# 4 4 53 181
dat[!duplicated(dat$slice),]
# Row area slice
# 1 1 50 180
# 3 3 49 181
# 5 5 65 182
# 6 6 60 183
(Whether you write each of these last two frames to files or store them for later use is up to you.)
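For example, a minimal sketch of writing the two subsets out (the file names here are placeholders):
write.csv(dat[!duplicated(dat$slice), ], "structure1.csv", row.names = FALSE)
write.csv(dat[duplicated(dat$slice), ], "structure2.csv", row.names = FALSE)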
duplicated normally returns TRUE for the second and subsequent occurrences of the field(s). Your logic of rows 2, 4, 5, 6 is more along the lines of "last of the dupes" or "no dupes", which is a little different.
library(dplyr)
dat %>%
  group_by(slice) %>%
  slice(-n()) %>%
  ungroup()
# # A tibble: 2 x 3
# Row area slice
# <int> <int> <int>
# 1 1 50 180
# 2 3 49 181
dat %>%
  group_by(slice) %>%
  slice(n()) %>%
  ungroup()
# # A tibble: 4 x 3
# Row area slice
# <int> <int> <int>
# 1 2 52 180
# 2 4 53 181
# 3 5 65 182
# 4 6 60 183
Similarly, with data.table:
library(data.table)
as.data.table(dat)[, .SD[.N,], by = .(slice)]
# slice Row area
# 1: 180 2 52
# 2: 181 4 53
# 3: 182 5 65
# 4: 183 6 60
as.data.table(dat)[, .SD[-.N,], by = .(slice)]
# slice Row area
# 1: 180 1 50
# 2: 181 3 49

How to assign ID to multiple rows based on a value in 1 column in 1 row duplicating a value in a DIFFERENT column in a different row in R?

When a call is placed to an emergency line, it is given a CallNo (a number unique to the event); however, sometimes multiple calls are placed and different call takers accidentally assign them different call numbers. Later, the CallNo of the other call (the DupCallNo) is appended to EACH call.
I have two columns, CallNo and DupCallNo, plus many other variables:
CallNo DupCallNo Priority Unit
123 255 A Bravo12
255 123 A Bravo44
366 476 B Xray22
476 366 A Xray109
512 366 A Xray116
How can I assign one unique ID to the first two rows and another to the remaining three rows?
I have found several questions and answers about making a unique ID based on values in the same column, but not for two different rows with different columns. In this case, if column A in row 1 equals column B in row 2, how do I assign rows 1 and 2 a unique ID?
Thanks so much, from an R novice.
P.S. Here is an example of what I would like to end up with:
CallNo DupCallNo Priority Unit UNIQUE_ID
123 255 A Bravo12 call1
255 123 A Bravo44 call1
366 476 B Xray22 call2
476 366 A Xray109 call2
512 366 A Xray116 call2
How about creating a unique ID from the two columns:
library(tidyverse)
df %>%
  rowwise() %>%
  mutate(Combined = paste0(min(CallNo, DupCallNo, na.rm = TRUE),
                           max(CallNo, DupCallNo, na.rm = TRUE)))
# A tibble: 4 x 5
# Groups: Combined [2]
CallNo DupCallNo Priority Unit Combined
<int> <int> <fct> <fct> <chr>
1 123 255 A Bravo12 123255
2 255 123 A Bravo44 123255
3 366 476 B Xray22 366476
4 476 366 A Xray109 366476
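To turn that combined key into sequential labels like call1 and call2, one further sketch (assuming the result above is saved as df2; match() against the unique keys yields a group index):
df2 %>%
  ungroup() %>%
  mutate(UNIQUE_ID = paste0("call", match(Combined, unique(Combined))))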

Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

I have a three-column dataframe object recording the bilateral trade data between 161 countries. The data are in dyadic format, containing 19687 rows and three columns: reporter (rid), partner (pid), and their bilateral trade flow (TradeValue) in a given year. rid and pid each take a value from 1 to 161, and a country is assigned the same rid and pid. For any given pair (rid, pid) in which rid != pid, TradeValue(rid, pid) = TradeValue(pid, rid).
The data (run in R) look like this:
#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
The data were sourced from the UN Comtrade database; each rid is paired with multiple pids to get their bilateral trade data. As can be seen, though, not every pid has a numeric id value, because I only assigned a rid or pid to a country if a list of relevant economic indicators for that country is available. This is why there are NAs in the data even though a TradeValue exists between that country and the reporting country (rid). The same applies when a country becomes a "reporter": in that situation, the country did not report any TradeValue with partners, and its id number is absent from the rid column. (Hence the rid column begins with 2, because country 1, i.e. Afghanistan, did not report any bilateral trade data with partners.) A quick check with summary statistics helps confirm this:
length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
Most countries report bilateral trade data with partners, and those that don't tend to be small economies. Hence, I want to preserve the complete list of 161 countries and transform this example_data dataframe into a 161 x 161 adjacency matrix in which:
for those countries that are absent from the rid column (e.g., rid == 1), create each of them a row and set the entire row (in the 161 x 161 matrix) to 0.
for those countries (pid) that do not share TradeValue entries with a particular rid, set those cells to 0.
For example, suppose in a 5 x 5 adjacency matrix, country 1 did not report any trade statistics with partners, the other four reported their bilateral trade statistics with other (except country 1). The original dataframe is like
rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82
from which I want to convert it to a 5 x 5 adjacency matrix (of data.frame format), the desired output should look like this
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 223 13 9
3 0 223 0 57 28
4 0 13 57 0 82
5 0 9 28 82 0
And I want to use the same method on example_data to create a 161 x 161 adjacency matrix. However, after a couple of rounds of trial and error with reshape and other methods, I still could not work out such a conversion, not even the first step.
It would be really appreciated if anyone could enlighten me on this.
I cannot read the dropbox file but have tried to work off of your 5-country example dataframe -
library(reshape2)  # for dcast(); data.table's dcast also works
country_num = 5
# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
pid_missing = setdiff(1:country_num, example_data$pid)
pid_miss = if (length(pid_missing) == 0) 1 else pid_missing
# create dummy dataframe with the missing rid and pid
add_data = as.data.frame(do.call(cbind, list(rid_miss, pid_miss, NA)))
colnames(add_data) = colnames(example_data)
# add dummy dataframe to original
example_data = rbind(example_data, add_data)
# the dcast now takes missing rid and pid into account
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")
# can remove first column without setting colnames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])
# fill in upper triangular matrix with missing values of lower triangular
# matrix and vice-versa, since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]
# change NAs to 0 according to preference - would keep as NA to differentiate
# from actual zeros
mat[is.na(mat)] = 0
Does this help?
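For comparison, a more compact route (a sketch, not from the answer above) is xtabs(), which builds the full n x n matrix directly and fills absent pairs with 0; rows whose pid is NA are dropped by the factor conversion:
n <- 161
example_data$rid <- factor(example_data$rid, levels = 1:n)
example_data$pid <- factor(example_data$pid, levels = 1:n)
m <- unclass(xtabs(TradeValue ~ rid + pid, data = example_data))
m <- pmax(m, t(m))       # enforce TradeValue(rid, pid) == TradeValue(pid, rid)
adj <- as.data.frame(m)  # the question asks for a data.frame object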

Counting Attempts of an event in R

I'm relatively new to R and learning. I have the following data frame, data:
ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016
I am looking to count the number of people (in this case only two unique individuals) who passed their tests after multiple attempts (passing is defined as 65 or over). So the final product would return a list of unique IDs for people who needed multiple attempts before their test score hit 65. This would inform me that approx. 66% of the clients in this data frame require multiple test sessions before getting a passing grade.
Below is my idea or concept, more or less, framed as an if statement:
If ID appears twice
count how often it appears, until TEST GRADE >= 65
ifelse(duplicated(data$ID), count(ID), NA)
I'm struggling with the second piece, where I want to say: count the occurrences of ID until grade >= 65.
The other option I see is some sort of loop. Below is my attempt:
for (i in data$ID) {
  duplicated(data$ID)
  count(data$ID)
  # here is where something would say "until grade >= 65"
}
Again the struggle comes in how to tell R to stop counting when grade hits 65.
Appreciate the help!
You can use data.table:
library(data.table)
dt <- fread(" ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")
# count the number of tries per ID, then keep only the passing attempts
dt <- dt[, N := .N, by = ID][grade >= 65]
# proportion of successful testers who tried more than once
length(dt[N>1]$ID)/length(dt$ID)
[1] 0.6666667
Another option, though the other two work just fine:
library(dplyr)
dat2 <- dat %>%
  group_by(ID) %>%
  summarize(
    multiattempts = n() > 1 & any(grade < 65),
    maxgrade = max(grade)
  )
dat2
# Source: local data frame [3 x 3]
# ID multiattempts maxgrade
# <int> <lgl> <int>
# 1 1 TRUE 73
# 2 2 TRUE 76
# 3 3 FALSE 66
sum(dat2$multiattempts) / nrow(dat2)
# [1] 0.6666667
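As an aside, if you literally want the number of attempts up to and including the first pass for each person, a small dplyr sketch (this assumes the rows are already sorted by test date within each ID):
dat %>%
  group_by(ID) %>%
  mutate(attempt = row_number()) %>%  # running attempt count per person
  filter(grade >= 65) %>%
  slice(1)                            # first passing row per ID, with its attempt count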
Here is a method using the aggregate function and subsetting that returns the maximum score for testers who took the test more than once, starting from their second test.
multiTestMax <- aggregate(grade~ID, data=df[duplicated(df$ID),], FUN=max)
multiTestMax
ID grade
1 1 73
2 2 76
To get the number of rows, you can use nrow:
nrow(multiTestMax)
2
or the proportion of all test takers:
nrow(multiTestMax) / length(unique(df$ID))
data
df <- read.table(header=T, text="ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")
