Recoding using R

I have a data set with dam, sire, plus other variables, but I need to recode my dam and sire IDs. The dam column is sorted and each animal appears only once. The sire column, on the other hand, is unsorted and some animals appear more than once.
I would like to start my numbering of dams from 50,000 such that the first animal will get 50001, the second animal 50002 and so on. I have this script that numbers each dam from 1 to N, and I am wondering whether it can be modified to begin from 50,000.
mydf$dam2 <- as.numeric(factor(paste(mydf$dam,sep="")))
Edit: my data set is similar to this, but with more variables:
dam <- c("1M521","1M584","1M790","1M871","1M888","1M933")
sire <- c("1X057","1T456","1W865","1W209","1W209","1W648")
wt <- c(369,300,332,351,303,314)
p2 <- c(NA,16,18,NA,NA,15)
mydf <- data.frame(dam,sire,wt,p2)
For the sire column, I would like to start numbering from 10,000.
Any help would be very much appreciated.
Baz

At the moment, those sire and dam columns are factor variables, which in this case means you can just add the as.numeric() results to your base number:
> mydf$dam_n <- 50000 +as.numeric(mydf$dam)
> mydf$sire_n <- 10000 +as.numeric(mydf$sire)
> mydf
    dam  sire  wt p2 dam_n sire_n
1 1M521 1X057 369 NA 50001  10005
2 1M584 1T456 300 16 50002  10001
3 1M790 1W865 332 18 50003  10004
4 1M871 1W209 351 NA 50004  10002
5 1M888 1W209 303 NA 50005  10002
6 1M933 1W648 314 15 50006  10003
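Note that the snippet above relies on dam and sire being factors. Since R 4.0.0, data.frame() no longer converts strings to factors by default, so as.numeric() on character columns would return NA. A minimal sketch that should work in either case, assuming the same example data, wraps each column in factor() first:

```r
# Example data from the question, kept as plain character columns
mydf <- data.frame(
  dam  = c("1M521", "1M584", "1M790", "1M871", "1M888", "1M933"),
  sire = c("1X057", "1T456", "1W865", "1W209", "1W209", "1W648"),
  wt   = c(369, 300, 332, 351, 303, 314),
  p2   = c(NA, 16, 18, NA, NA, 15),
  stringsAsFactors = FALSE
)

# factor() assigns one integer code per distinct ID (in sorted order),
# so repeated sires share a code; the offset shifts the numbering.
mydf$dam_n  <- 50000 + as.integer(factor(mydf$dam))
mydf$sire_n <- 10000 + as.integer(factor(mydf$sire))
mydf
```

Because the codes follow the sorted order of the IDs, the new numbers only follow row order when the column is already sorted, as the dam column is here.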

Why not use:
names(mydf$dam2) <- 50000:whatEverYourLengthIs
I am not sure I understood your data structures completely, but usually the names() function is used to set names.
EDIT:
You can use dimnames to name columns and rows.
For example, given a matrix mymatrix:
[,1] [,2]
a 1 2
b 4 5
c 7 8
and
dimnames(mymatrix) <- list(c("Jan", "Feb", "Mar"), c("2005", "2006"))
yields
2005 2006
Jan 1 2
Feb 4 5
Mar 7 8

Related

How can I calculate the inter-pair correlation of a variable according to id in the whole dataframe?

I have a twin-dataset, in which there is one column called wpsum, another column is family-id, which is the same for corresponding twin pairs.
        wpsum  family-id
twin 1     14        220
twin 2     18        220
I want to calculate the correlation between the wpsum values of those with the same family-id. There are also some single family IDs, where one twin did not take part in the re-survey. family-id is a character.
There's no correlation between wpsum of those with the same family-id, as you put it, mainly because there's no third variable with which to correlate wpsum within the family-id groups (see my comment), but you can get the difference in wpsum scores within the groups. Maybe that's what you meant by correlation. Here's how to get those (I changed and expanded your example):
dat <- data.frame(wpsum = c(14, 18, 20, 5, 10, NA, 1),
family_id = c("220","220","221","221","222","222","223"))
dat
wpsum family_id
1 14 220
2 18 220
3 20 221
4 5 221
5 10 222
6 NA 222
7 1 223
diffs <- by(dat, dat$family_id, function(x) abs(x$wpsum[1] - x$wpsum[2]))
diffs
dat$family_id: 220
[1] 4
------------------------------
dat$family_id: 221
[1] 15
------------------------------
dat$family_id: 222
[1] NA
------------------------------
dat$family_id: 223
[1] NA
You can make a data.frame with this new variable of differences like so:
diff.frame <- data.frame(diffs = as.numeric(diffs), family_id = names(diffs))
diff.frame
diffs family_id
1 4 220
2 15 221
3 NA 222
4 NA 223
Note that neither missing values nor missing observations are a (coding) problem here - they just result in missing differences without error. If you started having more than two observations within each family ID, though, then you'd need to do something different.
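If what is wanted is instead the correlation between the two twins' scores across families (one possible reading of the question), a sketch is to put each family on one row and correlate the two resulting columns. The twin column, the wide data frame and pair_cor are names introduced here for illustration, and the toy data has too few complete pairs for the result to be meaningful:

```r
dat <- data.frame(wpsum = c(14, 18, 20, 5, 10, NA, 1),
                  family_id = c("220", "220", "221", "221", "222", "222", "223"))

# Number the twins within each family: 1 for the first row seen, 2 for the second
dat$twin <- ave(seq_along(dat$family_id), dat$family_id, FUN = seq_along)

# One row per family, with columns wpsum.1 and wpsum.2
wide <- reshape(dat, idvar = "family_id", timevar = "twin", direction = "wide")

# Correlate twin 1 with twin 2, using only families where both scores exist
pair_cor <- cor(wide$wpsum.1, wide$wpsum.2, use = "complete.obs")
```

Families with a single twin or a missing score end up with NA in one column and are dropped by use = "complete.obs".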

Creating a table of results over multiple variables in R

I am using a large dataset with multiple variables that contain similar information. The variables range from PR1 through PR25. Each contains information regarding a procedure code. In short, the dataframe looks like this:
Obs  PR1   PR2   PR3
1    527   1422  222
2    1600  527   569
3    341   222   341
4    222   569   1422
5    569   341   1660
Where PR1 through PR25 values are factors.
I am looking for a way to make a table of information across all of these variables. For instance, I would like to make a table that shows a count of total number of value "527" for PR1:PR25. I would like to do this for multiple values of interest.
For instance
PR Tot
#222 3
#341 3
#527 2
#569 3
#1600 1
#1660 1
However, I only want to retrieve the frequency for a very specific set of values such as only extracting the frequency of 527 or 1600.
I have initially tried using a simple function like length(which(PR1=="527")), which works but is tedious.
I used the method suggested by Soren using:
library(plyr)
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result[which(result$codes %in% c("527", "5251", "5252", "5253", "5259",
"526", "521", "529", "8512", "8521", "344", "854", "8523", "8541", "8546",
"8542", "8547" , "8544", "8545", "8543", "639",
"064","065","063","0650","0651", "0652", "062", "066", "4040", "4041",
"4042", "0721", "0712","0701", "0702", "070", "0741", "435","436", "4399",
"439", "438", "437", "4381", "4391", "4342", "5122", "5121", "5124", "5123",
"518", "519", "503", "5022", "5012")),]
And got the following output (abbreviated):
codes count
92 062 5
95 064 8
96 0650 2
769 526 8
770 527 8
However, I had a feeling that was incorrect. When I checked it against the output from sapply(df, function(PR1) length(which(PR1 == "527")))
I get the following:
PR1 PR2 PR3 PR4 PR5 PR6 PR7 PR8 ...
1152 36 6 1 2 1 1 1
Which is the correct number of "527" cases in the dataframe. Any suggestions why the first method is giving incorrect sums of factor levels?
Thanks for any help, and let me know if I can provide more info
You can use the sapply() or lapply() function to count a given value in every column.
Create data frame df
df <- data.frame(A = 1:4, B = c(4,4,4,4), C = c(2,3,4,4), D = 9:12)
df
# A B C D
# 1 1 4 2 9
# 2 2 4 3 10
# 3 3 4 4 11
# 4 4 4 4 12
Frequency of value "4" in each column A, B, C, and D using sapply() function
sapply(df, function(x) length(which(x == 4)))
A B C D
1 4 2 0
Frequency of value "4" in each column A, B, C, and D using lapply() function
lapply(df, function(x) length(which(x == 4)))
# $A
# [1] 1
# $B
# [1] 4
# $C
# [1] 2
# $D
# [1] 0
The following takes your example and returns an output that may be generalized across all 25 columns. The "plyr" library is used to create the aggregated counts
Scripted as follows:
library(plyr)
df <- data.frame(PR1=c("527","1600","341","222","569"),PR2=c("1422","527","222","569","341"),PR3=c("222","569","341","1422","1660"),stringsAsFactors = T)
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result[which(result$codes %in% c('527','222')),]
Explained as follows:
Create the data frame as specified above. As OP noted values are factors, stringsAsFactors is set to TRUE
df <- data.frame(
PR1=c("527","1600","341","222","569"),
PR2=c("1422","527","222","569","341"),
PR3=c("222","569","341","1422","1660"),
stringsAsFactors = T)
Reviewing results of df
df
PR1 PR2 PR3
1 527 1422 222
2 1600 527 569
3 341 222 341
4 222 569 1422
5 569 341 1660
As the OP asks to combine all the codes across PR1:PR25, these are unified into a single list by using lapply to loop across all the columns. Since these are factors, and the interest seems to be in the level value of the factor rather than its underlying numeric representation, lapply(df,levels) returns these values. To merge PR1:PR25 into a single vector, unlist() is used, and since the column names are not useful in this case, use.names is set to FALSE. Finally, a data.frame is created with a single column called codes, which is later fed into the ddply() function to get the counts.
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
all_codes
codes
1 1600
2 222
3 341
4 527
5 569
6 1422
7 222
8 341
9 527
10 569
11 1422
12 1660
13 222
14 341
15 569
Using ddply() to split the data.frame on the all_codes$codes value and then take the length() of each resulting vector:
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result
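Reviewing the result gives, for each level value, the number of PR columns in which it occurs. Because levels() lists each distinct value once per column, this equals the total number of occurrences only when no value repeats within a column, as is the case in this small example.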
codes count
1 1422 2
2 1600 1
3 1660 1
4 222 3
5 341 3
6 527 2
7 569 3
And since we're only interested in specific values (527 is given in the OP, but here two values of interest are used, 527 and 222):
result[which(result$codes %in% c('527','222')),]
codes count
4 222 3
6 527 2
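As for why the levels-based count can disagree with the sapply() check in the question: levels() returns each distinct value once per column, so a code that repeats within a column is counted only once for that column. A hedged alternative, shown on a small made-up data frame in which 527 repeats inside PR1, is to flatten the actual cell values with as.character() and tabulate only the codes of interest (codes_of_interest is a name chosen here for illustration):

```r
# Made-up data: "527" occurs twice in PR1 and once in PR2
df <- data.frame(PR1 = c("527", "527", "341"),
                 PR2 = c("527", "222", "341"),
                 stringsAsFactors = TRUE)

# Levels-based count: one entry per column in which the code appears
level_counts <- table(unlist(lapply(df, levels), use.names = FALSE))

# Value-based count: every cell contributes, so repeats are counted
codes_of_interest <- c("527", "222")
all_values <- unlist(lapply(df, as.character), use.names = FALSE)
counts <- table(factor(all_values, levels = codes_of_interest))
counts
```

Passing the codes as factor levels restricts the table to those codes and reports a zero for any code that never occurs.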

How to subtract data frame column from another data frame column if condition is met?

I have two simple data frames containing both the columns "word" and "n" for how often a certain word occurred. Here is an example:
df1 <- data.frame(word=c("beautiful","nice","like","good"),n=c(400,378,29,10))
df2 <- data.frame(word=c("beautiful","nice","like","good","wonderful","awesome","sad","happy"),n=c(6000,20,5,150,300,26,17,195))
Besides the words of df1, df2 contains many more words, so df1 is only a small subset of df2.
I found the words that are contained in both df1 and df2. Now I would like to subtract the word counts of df1 from those of df2 if the specific word is contained in df2, meaning I would like to do the following:
Subtract word counting: df2$n - df1$n
only IF df1$word is contained in df2$word
I hope that my problem is clear.
I already found all the words from df1 that are also contained in df2
df1 %>% filter(df1$word %in% df2$word)
However, I am struggling with the subtracting command based on the condition that the words in df1 must be also in df2 and then only subtract df2$n - df1$n
Thank you for your help!
Using merge:
> df.tmp <- merge(df1, df2, by="word", all=TRUE)
> df.tmp$result <- df.tmp$n.y - df.tmp$n.x
> df.tmp
word n.x n.y result
1 beautiful 400 6000 5600
2 good 10 150 140
3 like 29 5 -24
4 nice 378 20 -358
5 awesome NA 26 NA
6 happy NA 195 NA
7 sad NA 17 NA
8 wonderful NA 300 NA
If you only want matched words
> df.tmp <- merge(df1, df2, by="word")
> df.tmp$result <- df.tmp$n.y - df.tmp$n.x
> df.tmp
word n.x n.y result
1 beautiful 400 6000 5600
2 good 10 150 140
3 like 29 5 -24
4 nice 378 20 -358
require(dplyr)
df1 %>%
inner_join(df2, by = 'word') %>%
mutate(diff = n.y - n.x) %>%
select(word, diff)
Gives
word diff
1 beautiful 5600
2 nice -358
3 like -24
4 good 140
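Here is a quick solution using a for loop and match(), which looks up each df2 word in df1 regardless of row order.
df2$diff <- NA
for (i in seq_len(nrow(df2))) {
  j <- match(df2$word[i], df1$word)  # row of this word in df1, NA if absent
  if (!is.na(j)) {
    df2$diff[i] <- df2$n[i] - df1$n[j]
  }
}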
df2
Output:
> df2
word n diff
1 beautiful 6000 5600
2 nice 20 -358
3 like 5 -24
4 good 150 140
5 wonderful 300 NA
6 awesome 26 NA
7 sad 17 NA
8 happy 195 NA
Here's a vectorized base solution where Boolean multiplication replaces the if-then construct used in Rob's for loop:
df2$n.adjusted <- df2$n - (df2$word %in% df1$word)* # zero if no match
df1$n[ match(df1$word, df2$word) ] # gets order correct
> df2
word n n.adjusted
1 beautiful 6000 5600
2 nice 20 -358
3 like 5 -24
4 good 150 140
5 wonderful 300 300
6 awesome 26 26
7 sad 17 17
8 happy 195 195
Here's the example I used to test where the order of the df1 words was not the same as the order in df2 and the lengths were not an even multiple:
> df1 <-data.frame(word=c("nice","beautiful","like","good"),n=c(378,400,29,10))
> df2 <- data.frame(word=c("beautiful","nice","like","good","wonderful","awesome","sad"),n=c(6000,20,5,150,300,26,17))
>
> df1
word n
1 nice 378
2 beautiful 400
3 like 29
4 good 10
> df2
word n
1 beautiful 6000
2 nice 20
3 like 5
4 good 150
5 wonderful 300
6 awesome 26
7 sad 17
> df2$n.adjusted <- df2$n - (df2$word %in% df1$word)*df1$n[match(df1$word, df2$word)]
Warning message:
In (df2$word %in% df1$word) * df1$n[match(df1$word, df2$word)] :
longer object length is not a multiple of shorter object length
> df2
word n n.adjusted
1 beautiful 6000 5600
2 nice 20 -358
3 like 5 -24
4 good 150 140
5 wonderful 300 300
6 awesome 26 26
7 sad 17 17
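The recycling warning above suggests that this expression depends on the matching words sitting in the first rows of df2. A sketch of a variant that looks up each df2 word in df1 directly, and so should not depend on row order or on equal lengths, is given below (idx is a name introduced here):

```r
df1 <- data.frame(word = c("nice", "beautiful", "like", "good"),
                  n = c(378, 400, 29, 10))
df2 <- data.frame(word = c("beautiful", "nice", "like", "good",
                           "wonderful", "awesome", "sad"),
                  n = c(6000, 20, 5, 150, 300, 26, 17))

# For every row of df2, the row of the same word in df1 (NA if absent)
idx <- match(df2$word, df1$word)

# Subtract the df1 count where there is a match, and 0 otherwise
df2$n.adjusted <- df2$n - ifelse(is.na(idx), 0, df1$n[idx])
df2
```

Here match() is called with df2$word first, so idx has one entry per row of df2 and no recycling takes place.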

Assign industry codes according to ranges in R

I would like to assign overall industry/parent codes to a data.frame (df below) containing more detailed/child codes (called ChildCodes below). The following data serves to illustrate my data.frame containing the detailed codes:
> df <- as.data.frame(cbind(c(1,2,3,4,5,6),c(110,101,200,2041,3651,2102)))
> names(df) <- c('Id','ChildCodes')
> df
Id ChildCodes
1 1 110
2 2 101
3 3 200
4 4 2041
5 5 3651
6 6 2102
The industry/parent codes are in the .csv file here: https://www.dropbox.com/s/5qtb7ysys1ar0lj/IndustryCodes.csv
The problem for me is the format of the .csv file. The file shows the parent/industry code in column 1 and ranges of child/detailed codes in the next 2 columns. Here is a subset:
> IndustryCodes <- as.data.frame(cbind(c(1,1,2,5,6),c(100,200,2040,2100,3650),c(199,299,2046,2199,3651)))
> names(IndustryCodes) <- c('IndustryGroup','LowerRange','UpperRange')
> IndustryCodes
IndustryGroup LowerRange UpperRange
1 1 100 199
2 1 200 299
3 2 2040 2046
4 5 2100 2199
5 6 3650 3651
So ChildCode 110 corresponds to industry group 1, 2041 to industry group 2, etc. How do I best assign the industry/parent codes (IndustryGroup) to df in R?
Thanks!
You can use sapply to get the Industry code for every child code:
sapply(df$ChildCodes,
function(x) IndustryCodes$IndustryGroup[IndustryCodes$LowerRange <= x &
x <= IndustryCodes$UpperRange])
# [1] 1 1 1 2 6 5
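One caveat with the sapply() call is that it returns a list rather than a vector as soon as some child code falls in no range (or in more than one). A minimal sketch that always returns one value per code, assuming that unmatched codes should become NA and that the first matching range wins, could look like this (lookup_group is a hypothetical helper name):

```r
df <- data.frame(Id = 1:6,
                 ChildCodes = c(110, 101, 200, 2041, 3651, 2102))
IndustryCodes <- data.frame(IndustryGroup = c(1, 1, 2, 5, 6),
                            LowerRange = c(100, 200, 2040, 2100, 3650),
                            UpperRange = c(199, 299, 2046, 2199, 3651))

# Return the group of the first range containing `code`, or NA if none does
lookup_group <- function(code, ranges) {
  hit <- which(ranges$LowerRange <= code & code <= ranges$UpperRange)
  if (length(hit) == 0) NA_real_ else ranges$IndustryGroup[hit[1]]
}

# vapply() guarantees a numeric vector with one entry per child code
df$IndustryGroup <- vapply(df$ChildCodes, lookup_group, numeric(1),
                           ranges = IndustryCodes)
df
```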

How to combine datasets without looping where one has multiple values?

Given the basic tools I know now (which, order, if, %in%, etc.), I frequently run into one problem I call "the uniqueness problem".
The problem basically looks like this...
I have a matrix A I want filled out from another raw matrix, B.
A:
[upc] [day1] [day2] ... day52
[1] 123 NA NA NA
[2] 456 NA NA NA
[3] 789 NA NA NA
B is mega huge row wise, so looping is out of the question.
[upc] [quantity] [day]
[1] 123 11 1
[2] 123 2 1
[3] 789 5 1
[4] 456 10 1
[5] 789 6 1
I want to fill up day1 for each UPC in matrix A with the quantities in matrix B. The problem is that there are multiple instances of each UPC in B, and I can't loop over them to get the total quantity to put next to each upc.
So what I WANT is this.. (which would be filled out TOTALLY, i.e. days 2-52 ..by looping over the other days, which is small and thus manageable)
A:
[upc] [day1] [day2] ... day52
[1] 123 13 NA NA
[2] 456 10 NA NA
[3] 789 11 NA NA
Do you know any functions that can accomplish this without looping?
If you convert your original matrices to data.frames, you can employ aggregate, merge and reshape to get there:
Make some data including multiple days for the added id of 999:
A <- data.frame(upc=c(123,456,789,999))
B <- data.frame(
upc=c(123,123,789,456,789,999,999,999),
quantity=c(11,2,5,10,6,10,3,3),
day=c(1,1,1,1,1,1,2,2)
)
Aggregate the quantities by id and day, then merge and reshape:
mrgd <- merge(A,aggregate(quantity ~ upc + day ,data=B, sum),by="upc")
final <- reshape(mrgd,idvar="upc",timevar="day",direction="wide",sep="")
names(final) <- gsub("quantity","day",names(final))
Which gives:
final
# upc day1 day2
#1 123 13 NA
#2 456 10 NA
#3 789 11 NA
#4 999 10 6
You can create a matrix A using the tapply function:
> B <- data.frame(
+ upc=c(123,123,789,456,789,999,999,999),
+ quantity=c(11,2,5,10,6,10,3,3),
+ day=c(1,1,1,1,1,1,2,2)
+ )
> tapply( B$quantity, B[,c('upc','day')], FUN=sum )
day
upc 1 2
123 13 NA
456 10 NA
789 11 NA
999 10 6
>
If the B matrix is really huge then you might consider saving it as an ff object (ff package) then using ffrowapply to do it in chunks.
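To get from the tapply() totals back to the pre-allocated matrix A from the question, one option is to index A by row and column names. This is only a sketch, assuming A carries the UPCs as row names and columns named day1 to day52; rows is a name introduced here:

```r
B <- data.frame(upc = c(123, 123, 789, 456, 789),
                quantity = c(11, 2, 5, 10, 6),
                day = c(1, 1, 1, 1, 1))

# Empty target matrix: one row per UPC, one column per day
upcs <- c(123, 456, 789)
A <- matrix(NA_real_, nrow = length(upcs), ncol = 52,
            dimnames = list(as.character(upcs), paste0("day", 1:52)))

# Total quantity per UPC and day, without an explicit loop
totals <- tapply(B$quantity, B[, c("upc", "day")], sum)

# Copy the totals into the matching cells of A, skipping UPCs not in A
rows <- intersect(rownames(totals), rownames(A))
A[rows, paste0("day", colnames(totals))] <- totals[rows, , drop = FALSE]
A[, 1:2]
```

Days that never occur in B are simply left as NA in A.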

Resources