R: How does the summary(factor) function order the values? - r

I have a large database, and I wanted to print out the number of elements at each factor level with this code:
summary(factor(Bond$Cf$ISIN, levels = Bond$ISIN), maxsum = 1000)
I got the output below. My question is: how does this order the values? At first glance I thought it was ordered by the number of values at each level, but then I saw NL0009054907. It has 7 elements, yet it appears among the levels with 4 elements.
DE000A1X3LN0 XS1002977103 XS1014670233 DE000CZ40JH0 XS1016720853 XS0973623514 DE000HSH4QN3 XS0997333223 DE000A13SWD8 DE000HSH4XP4
3 3 3 3 3 3 3 3 3 3
XS1033018158 XS1369254310 XS1041793123 XS1196748757 XS1043150462 DE000CZ40K31 XS0187339600 XS0413584672 XS1046237431 DE000CB83CE3
3 3 3 3 3 3 3 3 3 4
XS1050665386 XS1385935769 FR0010743070 FR0011233337 XS0418053152 XS0914402887 NL0009054907 XS0984200617 XS0993272862 XS0423989267
4 4 4 4 4 4 7 4 4 4
XS0296306078 DE000CZ40KW7 XS1070100257 XS0996755350 ES03136793B0 XS0432092137 XS0429192767 DE000HV2AKV6 XS1077629225 XS1078760813
4 4 4 4 4 4 4 4 4 4
DE000HSH28Z5 DE000HSH2893 XS1080952960 DE000DB7UQ89 XS1084838496 XS1236611684 DE000HSH41B9 DE000CZ291M8 DE000HSH3AM1 DE000HSH28J9
4 4 4 4 4 4 4 4 4 4

summary.factor() reports counts in the order of the factor's levels, not by frequency; with levels = Bond$ISIN, that is the order the ISINs appear in Bond$ISIN. Compare:
x <- factor(rep(letters[1:3], 4:6))
summary(x)
# a b c
# 4 5 6
x <- factor(x, levels = letters[3:1])
summary(x)
# c b a
# 6 5 4

If x is a column of a data.frame, however, summary.data.frame sorts the levels by frequency and lumps the tail into (Other):
x <- factor(rep(letters[1:10], 1:10), levels = letters[1:10])
summary(data.frame(x = x))
# x
# j :10
# i : 9
# h : 8
# g : 7
# f : 6
# e : 5
# (Other):10
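If the goal is simply the counts sorted by frequency, regardless of level order, sorting the summary (or a plain table()) does it. A minimal sketch with toy data:

```r
# counts per level, sorted from most to least frequent
x <- factor(rep(letters[1:3], c(4, 6, 5)))
sort(summary(x), decreasing = TRUE)
# b c a
# 6 5 4
# equivalently: sort(table(x), decreasing = TRUE)
```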


Attempting to remove a row in R using variable names

I am trying to remove some rows in a for loop in R. The condition involves comparing each row to the row below it, so I can't filter within the brackets.
I know that I can remove a row when a constant is specified: dataframe[-2, ]. I just want to do the same with a variable: dataframe[-x, ]. Here's the full loop:
for (j in 1:(nrow(referrals) - 1)) {
k <- j + 1
if (referrals[j, "Client ID"] == referrals[k, "Client ID"] &
referrals[j, "Provider SubCode"] == referrals[k, "Provider SubCode"]) {
referrals[-k, ]
}
}
The code runs without complaint, but no rows are removed (and I know some should be). Of course, if I test it with a constant, it works fine: referrals[-2, ].
You need to add a reproducible example for people to work with. I don't know the structure of your data, so I can only guess if this will work for you. Note also that referrals[-k, ] on its own just computes a new data frame and discards it; nothing changes unless you assign the result. I would not use a loop, for the reasons pointed out in the comments. I would identify the rows to remove first, and then remove them in one step. Consider:
set.seed(4499) # this makes the example exactly reproducible
d <- data.frame(Client.ID = sample.int(4, 20, replace=T),
Provider.SubCode = sample.int(4, 20, replace=T))
d
# Client.ID Provider.SubCode
# 1 1 1
# 2 1 4
# 3 3 2
# 4 4 4
# 5 4 1
# 6 2 2
# 7 2 2 # redundant
# 8 3 1
# 9 4 4
# 10 3 4
# 11 1 3
# 12 1 3 # redundant
# 13 3 4
# 14 1 2
# 15 3 2
# 16 4 4
# 17 3 4
# 18 2 2
# 19 4 1
# 20 3 3
redundant.rows <- with(d, Client.ID[1:(nrow(d) - 1)] == Client.ID[2:nrow(d)] &
                          Provider.SubCode[1:(nrow(d) - 1)] == Provider.SubCode[2:nrow(d)])
d[-c(which(redundant.rows)+1),]
# Client.ID Provider.SubCode
# 1 1 1
# 2 1 4
# 3 3 2
# 4 4 4
# 5 4 1
# 6 2 2
# 8 3 1 # 7 is missing
# 9 4 4
# 10 3 4
# 11 1 3
# 13 3 4 # 12 is missing
# 14 1 2
# 15 3 2
# 16 4 4
# 17 3 4
# 18 2 2
# 19 4 1
# 20 3 3
Using all information given by you, I believe this could be a good alternative:
duplicated.rows <- duplicated(referrals)
Then, if you want the duplicated results run:
referrals.double <- referrals[duplicated.rows, ]
However, if you want the non-duplicated results run:
referrals.not.double <- referrals[!duplicated.rows, ]
If you prefer to go step by step (maybe it's interesting for you):
duplicated.rows.Client.ID <- duplicated(referrals$"Client ID")
duplicated.rows.Provider.SubCode <- duplicated(referrals$"Provider SubCode")
referrals.not.double <- referrals[!duplicated.rows.Client.ID, ]
referrals.not.double <- referrals.not.double[!duplicated.rows.Provider.SubCode, ]
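Note that checking each column separately can drop rows where only one of the two values repeats. A safer variant marks duplicates on the pair of columns jointly (a sketch; the data here is made up, only the column names come from the question):

```r
# hypothetical data with the question's column names
referrals <- data.frame(`Client ID` = c(1, 1, 2, 2),
                        `Provider SubCode` = c("A", "B", "B", "B"),
                        check.names = FALSE)
# TRUE for rows that repeat an earlier (Client ID, Provider SubCode) pair
dup <- duplicated(referrals[, c("Client ID", "Provider SubCode")])
referrals[!dup, ]  # keeps 3 of the 4 rows
```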

R: How to make sequence (1,1,1,2,3,3,3,4,5,5,5,6,7,7,7,8)

Title says it all: how would I code such a repeating sequence, where the base repeat unit is the vector c(1,1,1,2), repeated 4 times, incrementing the values in the vector by 2 each time?
I've tried various combinations of rep, times, each, and seq, but can't get the desired result.
c(1,1,1,2) + rep(seq(0, 6, 2), each = 4)
# [1] 1 1 1 2 3 3 3 4 5 5 5 6 7 7 7 8
The rep function allows for a vector of the same length as x to be used in the times argument. We can extend the desired pattern with the super secret rep_len.
rep(1:8, rep_len(c(3, 1), 8))
#[1] 1 1 1 2 3 3 3 4 5 5 5 6 7 7 7 8
I'm not sure if I get it right, but what's wrong with something as simple as this:
unit <- c(1, 1, 1, 2)  # named `unit` to avoid masking base::rep
step <- 2
vec <- c(unit, step + unit, 2*step + unit, 3*step + unit)
I accepted luke's answer as it is the easiest for me to understand (and closest to what I was already trying, but failing with!)
I have used this final form:
> c(1,1,1,2)+rep(c(0,2,4,6),each=4)
[1] 1 1 1 2 3 3 3 4 5 5 5 6 7 7 7 8
You could do:
pattern <- rep(c(3, 1), len = 50)
unlist(lapply(1:8, function(x) rep(x, pattern[x])))
[1] 1 1 1 2 3 3 3 4 5 5 5 6 7 7 7 8
This lets you adjust the length of the pattern via rep(len = X) and avoids the addition step that some of the other answers rely on.
How about:
input <- c(1,1,1,2)
n <- 4
increment <- 2
sort(rep.int(seq.int(from = 0, by = increment, length.out = n), length(input))) + input
[1] 1 1 1 2 3 3 3 4 5 5 5 6 7 7 7 8
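The approaches above can be wrapped in a small helper parameterised by unit, repeat count, and increment (a sketch; make_seq is my own name, not from any answer):

```r
# build: unit, unit + step, unit + 2*step, ... `times` copies in total
make_seq <- function(unit, times, step) {
  # outer() gives one column per offset; as.vector() flattens column-wise
  as.vector(outer(unit, step * (seq_len(times) - 1), `+`))
}
make_seq(c(1, 1, 1, 2), times = 4, step = 2)
# [1] 1 1 1 2 3 3 3 4 5 5 5 6 7 7 7 8
```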

R - Subset dataframe to include only subjects with more than 1 record

I'd like to subset a dataframe to include all records for subjects that have >1 record, and exclude those subjects with only 1 record.
Let's take the following dataframe;
mydata <- data.frame(subject_id = factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10)),
variable = rnorm(15))
The code below gives me the subjects with >1 record using duplicated();
duplicates <- mydata[duplicated(mydata$subject_id),]$subject_id
But I want to retain in my subset all records for each subject with >1 record, so I tried;
mydata[mydata$subject_id==as.factor(duplicates),]
Which does not return the result I'm expecting.
Any ideas?
A data.table solution
set.seed(20)
subject_id <- as.factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10))
variable <- rnorm(15)
mydata<-as.data.frame(cbind(subject_id, variable))
library(data.table)
setDT(mydata)[, .SD[.N > 1], by = subject_id] # #Thanks David.
# subject_id variable
# 1: 4 -1.3325937
# 2: 4 -0.4465668
# 3: 5 0.5696061
# 4: 5 -2.8897176
# 5: 6 -0.8690183
# 6: 6 -0.4617027
# 7: 9 -0.1503822
# 8: 9 -0.6281268
# 9: 9 1.3232209
A simple alternative is to use dplyr:
library(dplyr)
dfr <- data.frame(a=sample(1:2,10,rep=T), b=sample(1:5,10, rep=T))
dfr <- group_by(dfr, b)
dfr
# Source: local data frame [10 x 2]
# Groups: b
#
# a b
# 1 2 4
# 2 2 2
# 3 2 5
# 4 2 1
# 5 1 2
# 6 1 3
# 7 2 1
# 8 2 4
# 9 1 4
# 10 2 4
filter(dfr, n() > 1)
# Source: local data frame [8 x 2]
# Groups: b
#
# a b
# 1 2 4
# 2 2 2
# 3 2 1
# 4 1 2
# 5 2 1
# 6 2 4
# 7 1 4
# 8 2 4
Here you go (I changed your variable to var <- rnorm(15)):
set.seed(11)
subject_id<-as.factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10))
var<-rnorm(15)
mydata<-as.data.frame(cbind(subject_id,var))
x1 <- c(names(table(mydata$subject_id)[table(mydata$subject_id) > 1]))
x2 <- which(mydata$subject_id %in% x1)
mydata[x2,]
subject_id var
4 4 0.3951076
5 4 -2.4129058
6 5 -1.3309979
7 5 -1.7354382
8 6 0.4020871
9 6 0.4628287
12 9 -2.1744466
13 9 0.4857337
14 9 1.0245632
Try:
> mydata[mydata$subject_id %in% mydata[duplicated(mydata$subject_id),]$subject_id,]
subject_id variable
4 4 -1.3325937
5 4 -0.4465668
6 5 0.5696061
7 5 -2.8897176
8 6 -0.8690183
9 6 -0.4617027
12 9 -0.1503822
13 9 -0.6281268
14 9 1.3232209
I had to edit your data frame a little bit:
set.seed(20)
subject_id <- as.factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10))
variable <- rnorm(15)
mydata<-as.data.frame(cbind(subject_id, variable))
Now to get all the rows for subjects that appear more than once:
mydata[duplicated(mydata$subject_id)
| duplicated(mydata$subject_id, fromLast = TRUE), ]
# subject_id variable
# 4 4 -1.3325937
# 5 4 -0.4465668
# 6 5 0.5696061
# 7 5 -2.8897176
# 8 6 -0.8690183
# 9 6 -0.4617027
# 12 9 -0.1503822
# 13 9 -0.6281268
# 14 9 1.3232209
Edit: this would also work, using your duplicates vector:
mydata[mydata$subject_id %in% duplicates, ]
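Another base-R route is to count each subject's rows with table() and keep the subjects whose count exceeds 1 (a sketch using the question's data):

```r
mydata <- data.frame(subject_id = factor(c(1,2,3,4,4,5,5,6,6,7,8,9,9,9,10)),
                     variable = rnorm(15))
counts <- table(mydata$subject_id)
# keep all rows whose subject appears more than once
mydata[mydata$subject_id %in% names(counts)[counts > 1], ]
```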

Performing calculations on binned counts in R

I have a dataset stored in a text file in the format of bins of values followed by counts, like this:
var_a 1:5 5:12 7:9 9:14 ...
indicating that var_a took the value 1 five times in the dataset, the value 5 twelve times, and so on. Each variable is on its own line in that format.
I'd like to be able to perform calculations on this dataset in R, like quantiles, variance, and so on. Is there an easy way to load the data from the file and calculate these statistics? Ultimately I'd like to make a box-and-whisker plot for each variable.
Cheers!
You could use readLines to read in the data file
.x <- readLines(datafile)
I will create some dummy data, as I don't have the file. This should be the equivalent of the output of readLines
## dummy
.x <- c("var_a 1:5 5:12 7:9 9:14", 'var_b 1:5 2:12 3:9 4:14')
I split on spaces to separate the variable name from its value:count pairs.
#split by space
space_split <- strsplit(.x, ' ')
# get the variable names (first in each list)
variable_names <- lapply(space_split,'[[',1)
# get the variable contents (everything but the first element in each list)
variable_contents <- lapply(space_split,'[',-1)
# a function to do the appropriate replicates
do_rep <- function(x){rep.int(x[1],x[2])}
# recreate the variables
variables <- lapply(variable_contents, function(x){
.list <- strsplit(x, ':')
unlist(lapply(lapply(.list, as.numeric), do_rep))
})
names(variables) <- variable_names
You could get the variance for each variable using
lapply(variables, var)
## $var_a
## [1] 6.848718
##
## $var_b
## [1] 1.138462
or get boxplots
boxplot(variables)
Not knowing the actual form that your data is in, I would probably use something like readLines to get each line in as a vector, then do something like the following:
# Some sample data
temp = c("var_a 1:5 5:12 7:9 9:14",
"var_b 1:7 4:9 3:11 2:10",
"var_c 2:5 5:14 6:6 3:14")
# Extract the names
NAMES = gsub("[0-9: ]", "", temp)
# Extract the data
temp_1 = strsplit(temp, " |:")
temp_1 = lapply(temp_1, function(x) as.numeric(x[-1]))
# "Expand" the data
temp_1 = lapply(1:length(temp_1),
function(x) rep(temp_1[[x]][seq(1, length(temp_1[[x]]), by=2)],
temp_1[[x]][seq(2, length(temp_1[[x]]), by=2)]))
names(temp_1) = NAMES
temp_1
# $var_a
# [1] 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9
#
# $var_b
# [1] 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2
#
# $var_c
# [1] 2 2 2 2 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 3 3 3 3 3 3 3 3 3 3 3 3 3 3
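For large counts it can be cheaper to compute moments from the bins directly instead of expanding them, using the weighted formulas mean = sum(v*n)/sum(n) and the analogous sum of squared deviations for the variance (a sketch; binned_stats is my own name):

```r
# weighted mean and sample variance straight from value:count pairs,
# without materialising the expanded vector
binned_stats <- function(values, counts) {
  n <- sum(counts)
  m <- sum(values * counts) / n
  v <- sum(counts * (values - m)^2) / (n - 1)  # matches var() on expanded data
  c(mean = m, var = v)
}
binned_stats(c(1, 5, 7, 9), c(5, 12, 9, 14))
#     mean      var
# 6.350000 6.848718   (agrees with the $var_a variance computed above)
```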

Find repeated data from index and string it together

Using R, if I have a 2 column data frame:
meta <- c(1,2,2,3,4,4,4,5)
value <- c("a","b","c","d","e","f","g","h")
df <- data.frame(meta,value)
df
meta value
1 1 a
2 2 b
3 2 c
4 3 d
5 4 e
6 4 f
7 4 g
8 5 h
How would I go about combining "value" with a delimiter (like ||) by repeated "meta" such that the resulting data frame would look like:
meta value
1 1 a
2 2 b||c
3 3 d
4 4 e||f||g
5 5 h
Thanks!
Slightly different, fairly lean, and in base:
y <- split(df$value, df$meta)
data.frame(meta=names(y), value=sapply(y, paste, collapse="||"))
or even simpler:
aggregate(value~meta, df, paste, collapse="||")
Using the plyr package the following works
library(plyr)
> ldply(split(df,meta),function(x){paste(x$value,collapse="||")})
.id V1
1 1 a
2 2 b||c
3 3 d
4 4 e||f||g
5 5 h
or
> ddply(df,.(meta),function(x){c(value=paste(x$value,collapse="||"))})
meta value
1 1 a
2 2 b||c
3 3 d
4 4 e||f||g
5 5 h
Use the ddply version if you want to preserve the column names.
I hope you don't dislike one-liners:
> data.frame(meta=unique(df$meta), value=sapply(unique(df$meta), function(m){ paste(df$value[which(df$meta==m)], collapse="||") }))
meta value
1 1 a
2 2 b||c
3 3 d
4 4 e||f||g
5 5 h
Here is another way...
uni.meta <- unique(df$meta)
idx <- lapply(seq_along(uni.meta), function(x) which(df$meta == uni.meta[x]))  # `idx` avoids masking base::list
new.value <- unlist(lapply(seq_along(idx), function(x) paste(df$value[idx[[x]]], collapse="||")))
new.df <- data.frame(uni.meta,new.value)
new.df
uni.meta new.value
1 1 a
2 2 b||c
3 3 d
4 4 e||f||g
5 5 h
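For completeness, base tapply() gives the collapsed values in one call (a sketch using the question's df):

```r
df <- data.frame(meta = c(1,2,2,3,4,4,4,5),
                 value = c("a","b","c","d","e","f","g","h"))
# one concatenated string per meta group
collapsed <- tapply(df$value, df$meta, paste, collapse = "||")
data.frame(meta = names(collapsed), value = as.vector(collapsed))
#   meta   value
# 1    1       a
# 2    2    b||c
# 3    3       d
# 4    4 e||f||g
# 5    5       h
```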
