Assign industry codes according to ranges in R - r

I would like to assign overall industry/parent codes to a data.frame (df below) containing more detailed/child codes (called ChildCodes below). The following data serves to illustrate my data.frame containing the detailed codes:
> df <- as.data.frame(cbind(c(1,2,3,4,5,6),c(110,101,200,2041,3651,2102)))
> names(df) <- c('Id','ChildCodes')
> df
Id ChildCodes
1 1 110
2 2 101
3 3 200
4 4 2041
5 5 3651
6 6 2102
The industry/parent codes are in the .csv file here: https://www.dropbox.com/s/5qtb7ysys1ar0lj/IndustryCodes.csv
The problem for me is the format of the .csv file. The file shows the parent/industry code in column 1 and ranges of child/detailed codes in the next 2 columns. Here is a subset:
> IndustryCodes <- as.data.frame(cbind(c(1,1,2,5,6),c(100,200,2040,2100,3650),c(199,299,2046,2199,3651)))
> names(IndustryCodes) <- c('IndustryGroup','LowerRange','UpperRange')
> IndustryCodes
IndustryGroup LowerRange UpperRange
1 1 100 199
2 1 200 299
3 2 2040 2046
4 5 2100 2199
5 6 3650 3651
So ChildCode 110 corresponds to industry group 1, 2041 to industry group 2, etc. How do I best assign the industry/parent codes (IndustryGroup) to df in R?
Thanks!

You can use sapply to get the Industry code for every child code:
sapply(df$ChildCodes,
function(x) IndustryCodes$IndustryGroup[IndustryCodes$LowerRange <= x &
x <= IndustryCodes$UpperRange])
# [1] 1 1 1 2 6 5
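The sapply call above returns a plain vector only when every child code falls in exactly one range. A minimal sketch of a variant that returns NA for unmatched codes is given below; the helper name lookup_group and the extra child code 9999 are assumptions added for illustration and are not part of the original data.

```r
# Example data from the question, plus one hypothetical code (9999)
# that falls outside every range.
df <- data.frame(Id = 1:7,
                 ChildCodes = c(110, 101, 200, 2041, 3651, 2102, 9999))
IndustryCodes <- data.frame(IndustryGroup = c(1, 1, 2, 5, 6),
                            LowerRange    = c(100, 200, 2040, 2100, 3650),
                            UpperRange    = c(199, 299, 2046, 2199, 3651))

# Hypothetical helper: first matching group for one code, NA if none.
lookup_group <- function(x) {
  hit <- which(IndustryCodes$LowerRange <= x & x <= IndustryCodes$UpperRange)
  if (length(hit) == 0) NA_real_ else IndustryCodes$IndustryGroup[hit[1]]
}

# vapply guarantees a numeric vector of the same length as the input.
df$IndustryGroup <- vapply(df$ChildCodes, lookup_group, numeric(1))
df
```

Using vapply rather than sapply means the result is a numeric vector even if a code matches no range, so it can be assigned directly as a column.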

Related

how to find which rows are related by mathematical difference of x in R

I have a data frame with about 20k IDs of chemical compounds and the corresponding molecular weights, something like this:
ID <- c(1,2,3,4,5)
MASS <- c(324,162,508,675,670)
d <- data.frame(ID, MASS)
ID MASS
1 1 324
2 2 162
3 3 508
4 4 675
5 5 670
I would like to find a way to loop over the rows of the column MASS to find which masses are related by a difference (positive or negative) of 162 ± 0.5. I would then like a new column (d$DIFF) reporting the IDs that are linked by a MASS difference of 162 ± 0.5, with 0 for IDs where the condition is not met. In this example it would be something like this:
ID MASS DIFF
1 1 324 1&2
2 2 162 1&2
3 3 508 3&5
4 4 675 0
5 5 670 3&5
Thanks in advance for any help
Here's a base R solution using outer:
d$DIFF <- unlist(lapply(apply(outer(d$MASS, d$MASS,
function(x, y) abs((abs(x - y) - 162)) < 0.5), 1, which),
function(x) if(length(x) == 0)
return("0")
else
return(paste(x, collapse = " & "))))
This gives the result:
d
#> ID MASS DIFF
#> 1 1 324 2
#> 2 2 162 1
#> 3 3 508 5
#> 4 4 675 0
#> 5 5 670 3
Note that in your example data, there is at most a single match to other rows, but if you apply this technique to your real data you should get multiple hits for some rows separated by "&" as requested.
You should also note that whatever way you do this in your real data, you will have to make approximately 20K * 20K (400 million) comparisons, so it may take some time to complete, and may result in memory issues depending on your set-up.
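If the full 20K by 20K matrix built by outer is too large for memory, one option is to compare one row at a time. The sketch below is an assumption about how that could look, not part of the original answer; it still makes the same number of comparisons, but only holds one vector of differences at a time. The names target and tol are introduced here for illustration.

```r
d <- data.frame(ID = 1:5, MASS = c(324, 162, 508, 675, 670))

target <- 162   # mass difference of interest
tol    <- 0.5   # allowed deviation from the target

# For each row, find the rows whose mass differs by target +/- tol.
d$DIFF <- vapply(seq_len(nrow(d)), function(i) {
  hits <- which(abs(abs(d$MASS - d$MASS[i]) - target) < tol)
  if (length(hits) == 0) "0" else paste(d$ID[hits], collapse = " & ")
}, character(1))

d
```

On the example data this should reproduce the DIFF column shown above.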

How to sort a data frame by column?

I want to sort a data frame by the values of a column (the first column, called Initial). My data frame is called t2:
Initial Final Changes
1 1 200
1 3 500
3 1 250
24 25 175
21 25 180
1 5 265
3 3 147
I am trying with code:
t2 <- t2[order(t2$Initial, t2$Final, decreasing=False),]
But, the result is of the type:
Initial Final Changes
3 1 250
3 3 147
21 25 180
24 25 175
1 5 265
1 1 200
1 3 500
And when I try with code:
t2 <- t2[order(t2$Initial, t2$Final, decreasing=TRUE),]
The result is:
Initial Final Changes
1 5 265
1 1 200
1 3 500
24 25 175
21 25 180
3 1 250
3 3 147
I don't understand what is happening.
Can you help me, please?
The columns may be factors. If so, convert them to numeric and the sort should work:
library(dplyr)
t2 %>%
arrange_at(1:2, ~ desc(as.numeric(as.character(.))))
Or with base R
t2[1:2] <- lapply(t2[1:2], function(x) as.numeric(as.character(x)))
t2[do.call(order, c(t2[1:2], decreasing = TRUE)), ]
The OP's own code should also work once the columns are numeric.
Note that the first attempt used decreasing = False (possibly a typo). R's logical constants are upper case, so it must be FALSE:
t2[order(t2$Initial, t2$Final, decreasing=FALSE),]
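To illustrate why the conversion goes through as.character, here is a minimal sketch that assumes the first two columns were read in as factors (the question does not confirm this). Calling as.numeric directly on a factor returns the internal level codes, not the printed values.

```r
# Assumption: Initial and Final are factors built from text.
t2 <- data.frame(
  Initial = factor(c("1", "1", "3", "24", "21", "1", "3")),
  Final   = factor(c("1", "3", "1", "25", "25", "5", "3")),
  Changes = c(200, 500, 250, 175, 180, 265, 147)
)

# as.numeric() on a factor gives level codes, so convert via character.
t2[1:2] <- lapply(t2[1:2], function(x) as.numeric(as.character(x)))

# Ascending sort by Initial, then Final.
sorted <- t2[order(t2$Initial, t2$Final), ]
sorted
```

After the conversion the rows are ordered by numeric value (1, 1, 1, 3, 3, 21, 24) rather than by factor level.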

Creating a table of results over multiple variables in R

I am using a large dataset with multiple variables that contain similar information. The variables range from PR1 through PR25, and each contains a procedure code. In short, the dataframe looks like this:
Obs PR1 PR2 PR3
1 527 1422 222
2 1600 527 569
3 341 222 341
4 222 569 1422
5 569 341 1660
Where PR1 through PR25 values are factors.
I am looking for a way to make a table of information across all of these variables. For instance, I would like to make a table that shows a count of total number of value "527" for PR1:PR25. I would like to do this for multiple values of interest.
For instance
PR Tot
#222 3
#341 3
#527 2
#569 3
#1600 1
#1660 1
However, I only want to retrieve the frequency for a very specific set of values such as only extracting the frequency of 527 or 1600.
I have initially tried using a simple function like length(which(PR1=="527")), which works but is tedious.
I used the method suggested by Soren using:
library(plyr)
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result[which(result$codes %in% c("527", "5251", "5252", "5253", "5259",
"526", "521", "529", "8512", "8521", "344", "854", "8523", "8541", "8546",
"8542", "8547" , "8544", "8545", "8543", "639",
"064","065","063","0650","0651", "0652", "062", "066", "4040", "4041",
"4042", "0721", "0712","0701", "0702", "070", "0741", "435","436", "4399",
"439", "438", "437", "4381", "4391", "4342", "5122", "5121", "5124", "5123",
"518", "519", "503", "5022", "5012")),]
And got the following output (abbreviated):
codes count
92 062 5
95 064 8
96 0650 2
769 526 8
770 527 8
However, I had a feeling that was incorrect. When I checked it against the output from sapply(df, function(PR1) length(which(PR1 == "527")))
I get the following:
PR1 PR2 PR3 PR4 PR5 PR6 PR7 PR8 ...
1152 36 6 1 2 1 1 1
Which is the correct number of "527" cases in the dataframe. Any suggestions why the first method is giving incorrect sums of factor levels?
Thanks for any help, and let me know if I can provide more info
You can use the sapply() or lapply() function to count a given value in every column.
Create data frame df
df <- data.frame(A = 1:4, B = c(4,4,4,4), C = c(2,3,4,4), D = 9:12)
df
# A B C D
# 1 1 4 2 9
# 2 2 4 3 10
# 3 3 4 4 11
# 4 4 4 4 12
Frequency of value "4" in each column A, B, C, and D using sapply() function
sapply(df, function(x) length(which(x == 4)))
# A B C D
# 1 4 2 0
Frequency of value "4" in each column A, B, C, and D using lapply() function
lapply(df, function(x) length(which(x == 4)))
# $A
# [1] 1
# $B
# [1] 4
# $C
# [1] 2
# $D
# [1] 0
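Since the question asks for totals across all columns for several values of interest, the per-column counts can be extended along these lines. This is a sketch on the toy data frame above; the vector values_of_interest is an assumption introduced for illustration.

```r
df <- data.frame(A = 1:4, B = c(4, 4, 4, 4), C = c(2, 3, 4, 4), D = 9:12)

# Per-column counts of the value 4, equivalent to the sapply() call above.
per_column <- colSums(df == 4)

# Total count over the whole data frame for several values of interest.
values_of_interest <- c(4, 2, 9)
total_counts <- vapply(values_of_interest,
                       function(v) sum(df == v),
                       integer(1))
names(total_counts) <- values_of_interest

per_column
total_counts
```

Here df == v produces a logical matrix, so sum() counts every matching cell regardless of column.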
The following takes your example and returns an output that may be generalized across all 25 columns. The "plyr" library is used to create the aggregated counts
Scripted as follows:
library(plyr)
df <- data.frame(PR1=c("527","1600","341","222","569"),PR2=c("1422","527","222","569","341"),PR3=c("222","569","341","1422","1660"),stringsAsFactors = T)
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result[which(result$codes %in% c('527','222')),]
Explained as follows:
Create the data frame as specified above. As OP noted values are factors, stringsAsFactors is set to TRUE
df <- data.frame(
PR1=c("527","1600","341","222","569"),
PR2=c("1422","527","222","569","341"),
PR3=c("222","569","341","1422","1660"),
stringsAsFactors = T)
Reviewing results of df
df
PR1 PR2 PR3
1 527 1422 222
2 1600 527 569
3 341 222 341
4 222 569 1422
5 569 341 1660
The OP asks to combine all the codes across PR1:PR25, so these are unified into a single list by using lapply to loop across all the columns. Because the columns are factors, and the interest seems to be in the level value of each factor rather than its underlying integer representation, lapply(df,levels) returns these values. Note that levels() lists each distinct code once per column, however often it occurs in that column. unlist() merges PR1:PR25 into a single vector, and since the column names are not useful here, use.names is set to FALSE. Finally, a data.frame is created with a single column called codes, which is later fed into the ddply() function to get the counts.
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
all_codes
codes
1 1600
2 222
3 341
4 527
5 569
6 1422
7 222
8 341
9 527
10 569
11 1422
12 1660
13 222
14 341
15 569
Using ddply() to split the data.frame on the all_codes$codes value and then take the length() of each vector returned by the split:
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result
Reviewing the result gives the PR1:PR25 aggregated count of all the level values of each factor in the original data.frame
codes count
1 1422 2
2 1600 1
3 1660 1
4 222 3
5 341 3
6 527 2
7 569 3
And since we're only interested in specific values (527 is given in the OP, but two values of interest are shown here, 527 and 222):
result[which(result$codes %in% c('527','222')),]
codes count
4 222 3
6 527 2
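One likely explanation for the mismatch the OP reports is that levels() contributes each code once per column, so the approach above counts the number of columns in which a code appears rather than the number of rows. The sketch below contrasts the two on a small made-up data frame (an assumption for illustration, with 527 repeated inside PR1) and counts actual occurrences by converting the factors to character first.

```r
# Hypothetical data: "527" occurs twice in PR1 and once in PR2.
df <- data.frame(PR1 = c("527", "527", "341"),
                 PR2 = c("222", "527", "569"),
                 stringsAsFactors = TRUE)

# Level-based: each code is counted once per column it appears in.
level_based <- table(unlist(lapply(df, levels), use.names = FALSE))

# Value-based: every occurrence in every row is counted.
all_values  <- unlist(lapply(df, as.character), use.names = FALSE)
value_based <- table(all_values)

# Restrict to the codes of interest; absent codes get a count of 0.
codes_of_interest <- c("527", "222", "999")
wanted <- table(factor(all_values, levels = codes_of_interest))

level_based["527"]
value_based["527"]
wanted
```

On this data the level-based count for 527 is 2 while the value-based count is 3, which is the kind of undercount described in the question.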

Printing only certain panels in R lattice

I am plotting a quantile-quantile plot for a certain data that I have. I would like to print only certain panels that satisfy a condition that I put in for panel.qq(x,y,...).
Let me give you an example. The following is my code,
qq(y ~ x|cond,data=test.df,panel=function(x,y,subscripts,...){
if(length(unique(test.df[subscripts,2])) > 3 ){panel.qq(x,y,subscripts,...)}})
Here y is the factor and x is the variable that will be plotted on the x and y axes; cond is the conditioning variable. What I would like is for only those panels to be printed that pass the condition in the panel function, which is
if(length(unique(test.df[subscripts,2])) > 3).
I hope this information helps. Thanks in advance.
Added Sample data,
y x cond
1 1 6 125
2 2 5 125
3 1 5 125
4 2 6 125
5 1 3 125
6 2 8 125
7 1 8 125
8 2 3 125
9 1 5 125
10 2 6 125
11 1 5 124
12 2 6 124
13 1 6 124
14 2 5 124
15 1 5 124
16 2 6 124
17 1 4 124
18 2 7 124
19 1 0 123
20 2 11 123
21 1 0 123
22 2 11 123
23 1 0 123
24 2 11 123
25 1 0 123
26 2 11 123
27 1 0 123
28 2 2 123
So this is the sample data. What I would like is to not have a panel for 123, as the number of unique x values for 123 is 3, while for the others it's 4. Thanks again.
Yeah, I think it is a subset problem, not a lattice one. You don't include an example, but it looks like you want to keep only rows where there are more than 3 rows for each value of whatever is in column 2 of your data frame. If so, here is a data.table solution.
library(data.table)
test.dt <- as.data.table(test.df)
test.dt.subset <- test.dt[,N:=.N,by=c2][N>3]
Where c2 is that variable in the second column. The last line of code first adds a variable, N, for the count of rows (.N) for each value of c2, then subsets for N>3.
UPDATE: And since a data table is also a data frame, you can use test.dt.subset directly as the data source in the call to qq (or other lattice function).
UPDATE 2: Here is one way to do the same thing without data.table:
d <- data.frame(x=1:15,y=1:15%%2, # example data frame
c2=c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
d$N <- 1 # create a column for count
split(d$N,d$c2) <- lapply(split(d$x,d$c2),length) # populate with count
d
d[d$N>3,] # subset
I did something very similar to DaveTurek.
My sample dataframe above is test.df
test.df.list <- split(test.df,test.df$cond,drop=F)
final.test.df <- do.call("rbind",lapply(test.df.list,function(r){
if(length(unique(r$x)) > 3){r}}))
Here I split test.df into a list of data.frames by the conditioning variable. Then, in the lapply, I check the number of unique values of x in each subset. If this number is greater than 3, the data.frame is returned; otherwise it is dropped (the function returns NULL, which rbind ignores). Finally, do.call binds the remaining data.frames back into one data.frame on which to run the quantile-quantile plot.
In case anyone wants to know the qq function call after getting the specific data, it is:
trellis.device(postscript,file="test.ps",color=F,horizontal=T,paper='legal')
qq(y ~ x|cond,data=final.test.df,layout=c(1,1),pch=".",cex=3)
dev.off()
Hope this helps.
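The same filtering can also be sketched in base R without splitting and re-binding, using ave() to attach the per-group count of unique x values to every row. This is an alternative under the assumption that the sample data above is representative; the variable name n_unique is introduced here for illustration.

```r
# Sample data from the question.
test.df <- data.frame(
  y    = rep(1:2, times = 14),
  x    = c(6, 5, 5, 6, 3, 8, 8, 3, 5, 6,       # cond 125
           5, 6, 6, 5, 5, 6, 4, 7,             # cond 124
           0, 11, 0, 11, 0, 11, 0, 11, 0, 2),  # cond 123
  cond = rep(c(125, 124, 123), times = c(10, 8, 10))
)

# Number of distinct x values within each cond group, repeated per row.
n_unique <- ave(test.df$x, test.df$cond,
                FUN = function(v) length(unique(v)))

# Keep only groups with more than 3 distinct x values.
final.test.df <- test.df[n_unique > 3, ]
unique(final.test.df$cond)
```

The resulting final.test.df keeps the 125 and 124 groups and can be passed to qq() as in the call above.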

recoding using R

I have a data set with dam, sire, plus other variables, but I need to recode my dam and sire IDs. The dam column is sorted and each animal appears only once. The sire column, on the other hand, is unsorted and some animals appear more than once.
I would like to start my numbering of dams from 50,000 such that the first animal will get 50001, the second animal 50002 and so on. I have this script that numbers each dam from 1 to N and I am wondering if it can be modified to begin from 50,000.
mydf$dam2 <- as.numeric(factor(paste(mydf$dam,sep="")))
*EDITED
my data set is similar to this but more variables
dam <- c("1M521","1M584","1M790","1M871","1M888","1M933")
sire <- c("1X057","1T456","1W865","1W209","1W209","1W648")
wt <- c(369,300,332,351,303,314)
p2 <- c(NA,16,18,NA,NA,15)
mydf <- data.frame(dam,sire,wt,p2)
For the sire column, I would like to start numbering from 10,000.
Any help would be very much appreciated.
Baz
At the moment, those sire and dam columns are factor variables, but in this case that means you can just add the as.numeric() results to your base number:
> mydf$dam_n <- 50000 +as.numeric(mydf$dam)
> mydf$sire_n <- 10000 +as.numeric(mydf$sire)
> mydf
dam sire wt p2 dam_n sire_n
1 1M521 1X057 369 NA 50001 10005
2 1M584 1T456 300 16 50002 10001
3 1M790 1W865 332 18 50003 10004
4 1M871 1W209 351 NA 50004 10002
5 1M888 1W209 303 NA 50005 10002
6 1M933 1W648 314 15 50006 10003
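The answer above relies on dam and sire being factors. From R 4.0.0 onward, data.frame() keeps strings as character by default, so the sketch below wraps the columns in factor() explicitly. It assumes that numbering by sorted ID is acceptable, as in the output above; the exact level order can depend on the locale's collation.

```r
dam  <- c("1M521", "1M584", "1M790", "1M871", "1M888", "1M933")
sire <- c("1X057", "1T456", "1W865", "1W209", "1W209", "1W648")
wt   <- c(369, 300, 332, 351, 303, 314)
p2   <- c(NA, 16, 18, NA, NA, 15)
mydf <- data.frame(dam, sire, wt, p2)

# factor() assigns integer codes 1..N in sorted level order;
# repeated sires share the same code.
mydf$dam_n  <- 50000 + as.integer(factor(mydf$dam))
mydf$sire_n <- 10000 + as.integer(factor(mydf$sire))
mydf
```

If numbering by order of first appearance is preferred instead, match(x, unique(x)) is one way to obtain such codes.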
Why not use:
names(mydf$dam2) <- 50000:whatEverYourLengthIs
I am not sure if I understood your data structures completely, but usually the names function is used to set names.
EDIT:
You can use dimnames to name columns and rows.
Like:
[,1] [,2]
a 1 2
b 4 5
c 7 8
and
dimnames(mymatrix) <- list(c("Jan", "Feb", "Mar"), c("2005", "2006"))
yields
2005 2006
Jan 1 2
Feb 4 5
Mar 7 8
