Replace value in a column based on a Frequency Count using R - r

I have a dataset with multiple columns. Many of these columns contain over 32 factors, so to run a Random Forest (for example), I want to replace values in the column based on their Frequency Count.
One of the column reads like this:
$ country
: Factor w/ 92 levels "China","India","USA",..: 30 39 39 20 89 30 16 21 30 30 ...
What I would like to do is only retain the top N (where N is a value between 5 and 20) countries, and replace the remaining values with "Other".
I know how to calculate the frequency of the values using the table function, but I can't seem to find a solution for replacing values on the basis of such a rule. How can this be done?

Some example data:
set.seed(1)
x <- factor(sample(1:5,100,prob=c(1,3,4,2,5),replace=TRUE))
table(x)
# 1 2 3 4 5
# 4 26 30 13 27
Replace all the levels other than the top 3 (Levels 2/3/5) with "Other":
levels(x)[rank(table(x)) < 3] <- "Other"
table(x)
#Other 2 3 5
# 17 26 30 27

Related

How to extract rows with similar names into a submatrix?

I am building an asymmetrical matrix of values with the rows being coefficient names and the column the value of each coefficient:
# Set up Row and Column Names.
rows = c("Intercept", "actsBreaks0", "actsBreaks1","actsBreaks2","actsBreaks3","actsBreaks4","actsBreaks5","actsBreaks6",
"actsBreaks7","actsBreaks8","actsBreaks9","tBreaks0","tBreaks1","tBreaks2","tBreaks3", "unitBreaks0", "unitBreaks1",
"unitBreaks2","unitBreaks3", "covgBreaks0","covgBreaks1","covgBreaks2","covgBreaks3","covgBreaks4","covgBreaks5",
"covgBreaks6","yearBreaks2016","yearBreaks2015","yearBreaks2014","yearBreaks2013","yearBreaks2011",
"yearBreaks2010","yearBreaks2009","yearBreaks2008","yearBreaks2007","yearBreaks2006","yearBreaks2005",
"yearBreaks2004","yearBreaks2003","yearBreaks2002","yearBreaks2001","yearBreaks2000","yearBreaks1999",
"yearBreaks1998","plugBump0","plugBump1","plugBump2","plugBump3")
cols = c("Value")
# Build Matrix
matrix1 <- matrix(c(1:48), nrow = 48, ncol = 1, byrow = TRUE, dimnames = list(rows,cols))
output:
> matrix1
Value
Intercept 1
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
actsBreaks6 8
actsBreaks7 9
actsBreaks8 10
actsBreaks9 11
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
unitBreaks0 16
unitBreaks1 17
unitBreaks2 18
unitBreaks3 19
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
covgBreaks6 26
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
yearBreaks2009 33
yearBreaks2008 34
yearBreaks2007 35
yearBreaks2006 36
yearBreaks2005 37
yearBreaks2004 38
yearBreaks2003 39
yearBreaks2002 40
yearBreaks2001 41
yearBreaks2000 42
yearBreaks1999 43
yearBreaks1998 44
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48
and I wish to extract certain rows that share row names (i.e. all rows with "unitBreaks'x'") into a submatrix.
I tried this
est_actsBreaks <- est_coef_mtrx[c("actsBreaks0","actsBreaks1","actsBreaks2","actsBreaks3",
"actsBreaks4","actsBreaks5","actsBreaks6","actsBreaks7",
"actsBreaks8","actsBreaks9"),c("Value")]
but it returns a vector and I need a matrix. I have seen other questions concerning similar procedures but their columns and rows all had identical names and/or values. Is there a way to do the operation I have in mind, such as grep()?
Welcome to StackOverflow.
As usual in R, there would probably be many ways to do what you request.
EDIT: I realized that my solution was going a little bit too far, sorry about that.
To extract only the rows that contain the pattern "unitBreaks" followed by several numbers, and still keep a matrix structure, you can run the following code. In a nutshell, grep is going to look for the pattern that you need and the argument drop = FALSE is going to make sure that you get a matrix as a result and not a vector.
uniBreakLines <- grep("unitBreaks[0-9]*", rows)
matrix1[uniBreakLines, , drop = FALSE]
Below is the first version of my answer.
First, I create a vector that describes the groups of rows. For this, I remove the numbers at the end of the row names.
grp <- gsub("[0-9]+$", "", rows)
Then, I transform the data matrix into a data-frame (why I do that is explained a little bit later).
dat1 <- data.frame(matrix1)
Finally, I use "split" on the data-frame, with the groups defined earlier. Using split on the data-frame will keep the structure: the result will be a list of data-frames, even though there is only one column.
dat1.split <- split(dat1, grp)
The result is a list of data-frames.
lapply(dat1.split, head)
$actsBreaks
Value
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
$covgBreaks
Value
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
$Intercept
Value
Intercept 1
$plugBump
Value
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48
$tBreaks
Value
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
$unitBreaks
Value
unitBreaks0 16
unitBreaks1 17
unitBreaks2 18
unitBreaks3 19
$yearBreaks
Value
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
After that, if you still need matrices, you can convert them with the function as.matrix in an "lapply":
matrix1.split <- lapply(dat1.split, as.matrix)
You might want to consider combining your data in a "tibble" with the "grouping" column. You will then be able to use these groups with the group_by function or other functions from the dplyr package (or other packages from the tidyverse).
For example:
library(dplyr)
tib1 <- tibble(rows, simpler_rows, value = 1:48)
And an example on how to use the grouping variable:
tib1 %>%
group_by(simpler_rows) %>%
summarize(sum(value))
EDIT bis: what if I don't know the pattern?
I played around a little bit with your example to answer the question (that nobody asked, but still, it's fun!): "what if I don't know the pattern?"
In this case, I would use a distance between the row names. This distance would look like this:
... and would be the output of the following lines of code
library(stringdist)
library(pheatmap)
strdist <- stringdistmatrix(rows)
pheatmap(strdist, border_color = "white", cluster_rows = F, cluster_cols = FALSE, cellwidth = 10, cellheight = 10, labels_row = rows, fontsize_row = 7)
After that, I only need to get the number of cluster, which can be done with a silhouette plot (similar to this one), that tells me that there are 8 clusters of words, which seems about right:
The cluster can be extracted then with the function used to create the silhouette plot (I used hclust and cutree).
Here a solution with dplyr and stringr to extract rownames that contain a certain string.
At the end change back to matrix:
library(dplyr)
library(stringr)
df1 <- df %>%
filter(!str_detect(rownames(df), "unitBreaks"))
df1 <- as.matrix(df1)
Value
Intercept 1
actsBreaks0 2
actsBreaks1 3
actsBreaks2 4
actsBreaks3 5
actsBreaks4 6
actsBreaks5 7
actsBreaks6 8
actsBreaks7 9
actsBreaks8 10
actsBreaks9 11
tBreaks0 12
tBreaks1 13
tBreaks2 14
tBreaks3 15
covgBreaks0 20
covgBreaks1 21
covgBreaks2 22
covgBreaks3 23
covgBreaks4 24
covgBreaks5 25
covgBreaks6 26
yearBreaks2016 27
yearBreaks2015 28
yearBreaks2014 29
yearBreaks2013 30
yearBreaks2011 31
yearBreaks2010 32
yearBreaks2009 33
yearBreaks2008 34
yearBreaks2007 35
yearBreaks2006 36
yearBreaks2005 37
yearBreaks2004 38
yearBreaks2003 39
yearBreaks2002 40
yearBreaks2001 41
yearBreaks2000 42
yearBreaks1999 43
yearBreaks1998 44
plugBump0 45
plugBump1 46
plugBump2 47
plugBump3 48

Percentile rank of column values - R

I am looking for a percentage rank for each value in a column.
It is quite easy in Excel, for example:
=RANK.EQ(A1,$A$1:$A$100,1)/COUNT($A$1:$A$100)
Returns a percent value in a new column that ranks the column I referred to above.
I have no problem finding quantile in R, but have not been able to find anything that accurately gives percentile for every single column value.
Try this using the data in your picture:
> Cost.Per.Kilo <- c(rep(c(6045170, 5412330, 3719760, 3589220), each=2),
3507400)
> Cost.Per.Kilo
[1] 6045170 6045170 5412330 5412330 3719760 3719760 3589220 3589220 3507400
> CPK.rank <- rank(Cost.Per.Kilo, ties.method="min")
> CPK.rank
[1] 8 8 6 6 4 4 2 2 1
> round(CPK.rank/length(CPK.rank) * 100)
[1] 89 89 67 67 44 44 22 22 11
In your picture you seem to have divided the ranks by 10, but there are only 9 values. That is why these percentages do not match.

R: Randomly sampling (with replacement) each column of a data frame independently

I am trying to create a new data frame by randomly sampling an existing data frame. Specifically, I want create a data frame that is the same size as the original data frame, but each column of the new data frame is a random sample (with replacement) of the corresponding column in the original data frame. My first attempt looked like this:
# Create toy data set
data.set <- as.data.frame(matrix(1:50, ncol = 5))
# Change names
colnames(data.set) <- c("Stuff", "Things", "Foo", "Bar", "Guff")
# Try to create randomly sampled data frame
data.set %>% sample_n(replace = TRUE, size = nrow(data.set))
The problem here is that it just randomly samples rows, but not elements within each column individually. For example, here is some output.
Stuff Things Foo Bar Guff
2 2 12 22 32 42
10 10 20 30 40 50
2.1 2 12 22 32 42
3 3 13 23 33 43
5 5 15 25 35 45
3.1 3 13 23 33 43
8 8 18 28 38 48
9 9 19 29 39 49
1 1 11 21 31 41
6 6 16 26 36 46
Notice that the first and third rows are exactly the same, as are the fourth and sixth rows. What I would like is for each and every column to be randomly sampled independently. So, I tried this.
apply(data.set, MARGIN = 2, sample_n, replace = TRUE, size = nrow(data.set))
which produced the following error:
Error: Don't know how to sample from objects of class integer
although, I don't see what I did incorrectly. Can anyone offer a concise way of achieving my goal?
First, the apply function should have argument. In this case we use columns since the margin is 2.
apply(df, MARGIN = 2, function(x) sample(x, replace = TRUE, size = length(x)))

Sort list on numeric values stored as factor

I have 4 data frames with data from different experiments, where each row represents a trial. The participant's id (SID) is stored as a factor. Each one of the data frames look like this:
Experiment 1:
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059
I want to make a new data frame with the id's of the participants in each of the experiments, for example:
Exp1 Exp2 Exp3 Exp4
5402 22081 22160 25434
25403 22069 22179 25439
25485 22115 22141 25408
25457 22120 22185 25445
28041 22448 22239 25473
29514 22492 22291 25489
I want each column to be ordered as numbers, that is, 2 comes before 10.
I used unique() to extract the participant id's (SID) in each data frame, but I am having problems ordering the columns.
I tried using:
data.frame(order(unique(df1$SID)),
order(unique(df2$SID)),
order(unique(df3$SID)),
order(unique(df4$SID)))
and I get (without the column names):
38 60 16 32 15
2 9 41 14 41
3 33 5 30 62
4 51 11 18 33
I'm sorry if I am missing something very basic, I am still very new to R.
Thank you for any help!
Edit:
I tried the solutions in the comments, and now I have:
x<-cbind(sort(as.numeric(unique(df1$SID)),decreasing = F),
sort(as.numeric(unique(df2$SID)),decreasing = F),
sort(as.numeric(unique(df3$SID)),decreasing = F),
sort(as.numeric(unique(df4$SID)),decreasing = F) )
Still does not work... I get:
V1 V2 V3 V4
8 6 5 2
2 9 35 11 3
3 10 37 17 184
4 13 38 91 185
5 15 39 103 186
The subject id's are 3 to 5 digit numbers...
If your data looks like this:
df <- read.table(text="
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059",
header=TRUE, colClasses = c("factor","integer","numeric"))
I would do something like this:
df <- df[order(as.numeric(as.character(df$SID)), trial),] # sort df on SID (numeric) & trial
split(df$SID, df$trial) # breaks the vector SID into a list of vectors of SID for each trial
If you were worried about unique values you could do:
lapply(split(df$SID, df$trial), unique) # breaks SID into list of unique SIDs for each trial
That will give you a list of participant IDs for each trial, sorted by numeric value but maintaining their factor property.
If you really wanted a data frame, and the number of participants in each experiment were equal, you could use data.frame() on the list, as in: data.frame(split(df$SID, df$trial))
Suppose x and y represent the Exp1 SID and Exp2 SID. You can create a ordered list of unique values as shown below:
x<-factor(x = c(2,5,4,3,6,1,4,5,6,3,2,3))
y<-factor(x = c(2,3,4,2,4,1,4,5,5,3,2,3))
list(exp1=sort(x = unique(x),decreasing = F),y=sort(x = unique(y),decreasing = F))

Avoid using a loop to get sum of rows in R, where I want to start and stop the sum on different columns for each row

I am relatively new to R from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, stop value, and 100+ columns of numerical values. The goal is to get the sum of each row from the column that corresponds to the start value to the column that corresponds to the stop value. This is direct enough to do in a loop, that looks like this (data.frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
df$out[i] <- rowSums(df[i,df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55
If you are dealing with values of all the same types, you typically want to do things in matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=T)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=T))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row, here the
# function sums each row excluding the first two values
# from the value in the start column to the value in the
# end column
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run the apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that your are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that by generating your answer (the $out column), which you can then cbind back to your data frame (and that means you modify your data frame just once).

Resources