how to find which rows are related by mathematical difference of x in R - r

i have a data frame with about 20k IDs of chemical compounds and the corresponding molecular weights, something like this:
ID <- c(1,2,3,4,5)
MASS <- c(324,162,508,675,670)
d <- data.frame(ID, MASS)
ID MASS
1 1 324
2 2 162
3 3 508
4 4 675
5 5 670
I would like to find a way to loop over the rows of the column MASS to find which masses are related by having a difference (positive or negative) of 162∓0.5. Then I would like to have a new column (d$DIFF) where the IDs that are linked by a MASS difference of 162∓0.5 are reported, while get 0 for those IDs when the condition is not met, in this example it would be something like this:
ID MASS DIFF
1 1 324 1&2
2 2 162 1&2
3 3 508 3&5
4 4 675 0
5 5 670 3&5
Thanks in advance for any help

Here's a base R solution using outer:
d$DIFF <- unlist(lapply(apply(outer(d$MASS, d$MASS,
function(x, y) abs((abs(x - y) - 162)) < 0.5), 1, which),
function(x) if(length(x) == 0)
return("0")
else
return(paste(x, collapse = " & "))))
This gives the result:
d
#> ID MASS DIFF
#> 1 1 324 2
#> 2 2 162 1
#> 3 3 508 5
#> 4 4 675 0
#> 5 5 670 3
Note that in your example data, there is at most a single match to other rows, but if you apply this technique to your real data you should get multiple hits for some rows separated by "&" as requested.
You should also note that whatever way you do this in your real data, you will have to make approximately 20K * 20K (400 million) comparisons, so it may take some time to complete, and may result in memory issues depending on your set-up.

Related

Calculation within a pipe between different rows of a data frame

I have a tibble with a column of different numbers. I wish to calculate for every one of them how many others before them are within a certain range.
For example, let's say that range is 200 ; in the tibble below the result for the 5th number would be 2, that is the cardinality of the list {816, 705} whose numbers are above 872-1-200 = 671 but below 872.
I have thought of something along the lines of :
for every theRow of the tibble, do calculate the vector theTibble$number_list between(X,Y) ;
summing the boolean returned vector.
I have been told that using loops is less efficient.
Is there a clean way to do this within a pipe without using loops?
Not the way you asked for it, but you can use a bit of linear algebra. Should be more efficient and more simple than a loop.
number_list <- c(248,650,705,816,872,991,1156,1157,1180,1277)
m <- matrix(number_list, nrow = length(number_list), ncol = length(number_list))
d <- (t(m) - number_list)
cutoff <- 200
# I used setNames to name the result, but you do not need to
# We count inclusive of 0 in case of ties
setNames(colSums(d >= 0 & d < cutoff) - 1, number_list)
Which gives you the following named vector.
248 650 705 816 872 991 1156 1157 1180 1277
0 0 1 2 2 2 1 2 3 3
Here is another way that is pipe-able using rollapply().
library(zoo)
cutoff <- 200
df %>%
mutate(count = rollapply(number_list,
width = seq_along(number_list),
function(x) sum((tail(x, 1) - head(x, -1)) <= cutoff),
align = "right"))
Which gives you another column.
# A tibble: 10 x 2
number_list count
<int> <int>
1 248 0
2 650 0
3 705 1
4 816 2
5 872 2
6 991 2
7 1156 1
8 1157 2
9 1180 3
10 1277 3

Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

I have a three-column dataframe object recording the bilateral trade data between 161 countries, the data are of dyadic format containing 19687 rows, three columns (reporter (rid), partner (pid), and their bilateral trade flow (TradeValue) in a given year). rid or pid takes a value from 1 to 161, and a country is assigned the same rid and pid. For any given pair of (rid, pid) in which rid =/= pid, TradeValue(rid, pid) = TradeValue(pid, rid).
The data (run in R) look like this:
#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
The data were sourced from UN Comtrade database, each rid is paired with multiple pid to get their bilateral trade data, but as can be seen, not every pid has a numeric id value because I only assigned a rid or pid to a country if a list of relevant economic indicators of that country are available, which is why there are NA in the data despite TradeValue exists between that country and the reporting country (rid). The same applies when a country become a "reporter," in that situation, that country did not report any TradeValue with partners, and its id number is absent from the rid column. (Hence, you can see rid column begins with 2, because country 1 (i.e., Afghanistan) did not report any bilateral trade data with partners). A quick check with summary statistics helps confirm this
length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
Since most countries report bilateral trade data with partners and for those who don't, they tend to be small economies. Hence, I want to preserve the complete list of 161 countries and transform this example_data dataframe into a 161 x 161 adjacency matrix in which
for those countries that are absent from the rid column (e.g., rid == 1), create each of them a row and set the entire row (in the 161 x 161 matrix) to 0.
for those countries (pid) that do not share TradeValue entries with a particular rid, set those cells to 0.
For example, suppose in a 5 x 5 adjacency matrix, country 1 did not report any trade statistics with partners, the other four reported their bilateral trade statistics with other (except country 1). The original dataframe is like
rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82
from which I want to convert it to a 5 x 5 adjacency matrix (of data.frame format), the desired output should look like this
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 223 13 9
3 0 223 0 57 28
4 0 13 57 0 82
5 0 9 28 82 0
And using the same method on the example_data to create a 161 x 161 adjacency matrix. However, after a couple trial and error with reshape and other methods, I still could not get around with such conversion, not even beyond the first step.
It will be really appreciated if anyone could enlighten me on this?
I cannot read the dropbox file but have tried to work off of your 5-country example dataframe -
country_num = 5
# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
pid_miss = ifelse(length(setdiff(1:country_num, example_data$pid) == 0),
1, setdiff(1:country_num, example_data$pid))
# create dummy dataframe with missing rid and pid
add_data = as.data.frame(do.call(cbind, list(rid_miss, pid_miss, NA)))
colnames(add_data) = colnames(example_data)
# add dummy dataframe to original
example_data = rbind(example_data, add_data)
# the dcast now takes missing rid and pid into account
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")
# can remove first column without setting colnames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])
# fill in upper triangular matrix with missing values of lower triangular matrix
# and vice-versa since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]
# change NAs to 0 according to preference - would keep as NA to differentiate
# from actual zeros
mat[is.na(mat)] = 0
Does this help?

Working with repeates values in rows

I am working with a df of 46216 observation where the units are homes and people, where each home may have any number of integrants, like:
enter image description here
and this for another almost 18000 homes.
What i need to do is to get the mean of education years for every home, for what i guess i will need a variable that computes the number of people of each home.
What i tried to do is:
num_peopl=by(df$person_number, df$home, max), for each home I take the highest person number with the total number of people who live there, but when I try to cbind this with the df i get:
"arguments imply differing number of rows: 46216, 17931"
It is like it puts the number of persons only for one row, and leaves the others empty.
How can i do this? Is there a function?
I think aggregate and join may be what your looking for. Aggregate does the same thing that you did, but puts it into a data frame that I'm more familiar with at least.
Then I used dplyr left_join, joining the home number's together:
library(tidyverse)
df<-data.frame(home_number = c(1,1,1,2,2,3),
person_number = c(1,2,3,1,2,1),
age = c(20,21,1,54,50,30),
sex = c("m","f","f","m","f","f"),
salary = c(1000,890,NA,900,500,1200),
years_education = c(12,10,0,8,7,14))
df2<-aggregate(df$person_number, by = list(df$home_number), max)
df_final<-df%>%
left_join(df2, by = c("home_number" = "Group.1"))
home_number person_number age sex salary years_education x
1 1 1 20 m 1000 12 3
2 1 2 21 f 890 10 3
3 1 3 1 f NA 0 3
4 2 1 54 m 900 8 2
5 2 2 50 f 500 7 2
6 3 1 30 f 1200 14 1

Find a function to return value based on condition using R

I have a table with values
KId sales_month quantity_sold
100 1 0
100 2 0
100 3 0
496 2 6
511 2 10
846 1 4
846 2 6
846 3 1
338 1 6
338 2 0
now i require output as
KId sales_month quantity_sold result
100 1 0 1
100 2 0 1
100 3 0 1
496 2 6 1
511 2 10 1
846 1 4 1
846 2 6 1
846 3 1 0
338 1 6 1
338 2 0 1
Here, the calculation has to go as such if quantity sold for the month of march(3) is less than 60% of two months January(1) and February(2) quantity sold then the result should be 1 or else it should display 0. Require solution to perform this.
Thanks in advance.
If I understand well, your requirement is to compare sold quantity in month t with the sum of quantity sold in months t-1 and t-2. If so, I can suggest using dplyr package that offer the nice feature of grouping rows and mutating columns in your data frame.
resultData <- group_by(data, KId) %>%
arrange(sales_month) %>%
mutate(monthMinus1Qty = lag(quantity_sold,1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
group_by(KId, sales_month) %>%
mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
mutate(result = ifelse(quantity_sold/previous2MonthsQty >= 0.6,0,1)) %>%
select(KId,sales_month, quantity_sold, result)
The result is as below:
Adding
select(KId,sales_month, quantity_sold, result)
at the end let us display only columns we care about (and not all these intermediate steps).
I believe this should satisfy your requirement. NA is the result column are due to 0/0 division or no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add year column and adjust group_by() arguments appropriately.
For more information on dplyr package, follow this link

Assign industry codes according to ranges in R

I would like to assign overall industry/parent codes to a data.frame (df below) containing more detailed/child codes (called ChildCodes below). The following data serves to illustrate my data.frame containing the detailed codes:
> df <- as.data.frame(cbind(c(1,2,3,4,5,6),c(110,101,200,2041,3651,2102)))
> names(df) <- c('Id','ChildCodes')
> df
Id ChildCodes
1 1 110
2 2 101
3 3 200
4 4 2041
5 5 3651
6 6 2102
The industry/parent codes are in the .csv file here: https://www.dropbox.com/s/5qtb7ysys1ar0lj/IndustryCodes.csv
The problem for me is the format of the .csv file. The file shows the parent/industry code in column 1 and ranges of child/detailed codes in the next 2 columns. Here is a subset:
> IndustryCodes <- as.data.frame(cbind(c(1,1,2,5,6),c(100,200,2040,2100,3650),c(199,299,2046,2199,3651)))
> names(IndustryCodes) <- c('IndustryGroup','LowerRange','UpperRange')
> IndustryCodes
IndustryGroup LowerRange UpperRange
1 1 100 199
2 1 200 299
3 2 2040 2046
4 5 2100 2199
5 6 3650 3651
So that ChildCode 110 corresponds industry group 1, 2041 to industry code 2 etc. How do best assign the industry/parent codes (IndustryGroup) to df in R?
Thanks!
You can use sapply to get the Industry code for every child code:
sapply(df$ChildCodes,
function(x) IndustryCodes$IndustryGroup[IndustryCodes$LowerRange <= x &
x <= IndustryCodes$UpperRange])
# [1] 1 1 1 2 6 5

Resources