Converting values between a certain range to a letter - r

Is there an easy way to map values within a certain range to a letter? In the following example, how would I convert all values in df so that:
Values less than or equal to 1 = A.
Values greater than 1 and less than or equal to 5 = B.
Values greater than 5 = C.
A small example dataset:
df1 <- rnorm (100, mean = 1, sd = 0.3)
df2 <- rnorm (100, mean = 5, sd = 1.6)
df <- cbind(df1,df2)

as.data.frame(apply(df,2, function(x) cut(x, c(-Inf,1,5,Inf), labels=c('A','B','C'))))
# df1 df2
# 1 A C
# 2 A C
# 3 B B
# 4 A C
# 5 A C
# 6 A B
# 7 A C
# 8 B B
# 9 B C
# 10 A C
Remember to use -Inf and Inf as the outermost cut points so that every value falls into a bin. The cut call is wrapped in apply so that it is repeated over each column.
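Here is a minimal sketch of the same idea that keeps df as a data frame with factor columns instead of going through a character matrix. The helper name to_letter and the seed are made up for illustration. It also checks the boundaries: cut() uses right-closed intervals by default, so exactly 1 maps to A and exactly 5 maps to B.

```r
set.seed(42)  # arbitrary seed, only so the example is reproducible
df <- data.frame(df1 = rnorm(100, mean = 1, sd = 0.3),
                 df2 = rnorm(100, mean = 5, sd = 1.6))

# Hypothetical helper: bins (-Inf, 1], (1, 5], (5, Inf) labelled A, B, C
to_letter <- function(x) {
  cut(x, breaks = c(-Inf, 1, 5, Inf), labels = c("A", "B", "C"))
}

# Boundary check: values equal to a break fall into the lower bin
boundary <- as.character(to_letter(c(0.5, 1, 1.0001, 5, 6)))

# df[] <- lapply(...) replaces every column but keeps the data frame
df[] <- lapply(df, to_letter)
head(df)
```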


How to rbind two dataframes in R when one has more columns than the other [duplicate]

I want to merge three dataframes together, appending them as rows to the bottom of the previous, but I want their columns to match. Each dataframe has a different number of columns, but they share column names. EX:
Dataframe A    Dataframe B    Dataframe C
A B Y Z        A B C X Y Z    A B C D W X Y Z
# # # #        # # # # # #    # # # # # # # #
In the end, I want them to look like:
Dataframe_Final
A B C D W X Y Z
# #         # #
# # #     # # #
# # # # # # # #
How can I merge these dataframes together in this way? Again, there's no ID for the rows that is unique (ascending, etc) across the dataframes.
Thanks!
A base R option might be Reduce + merge
out <- Reduce(function(x,y) merge(x,y,all = TRUE),list(dfA,dfB,dfC))
out <- out[order(names(out))]
which gives
A B C D W X Y Z
1 1 2 NA NA NA NA 3 4
2 1 2 3 NA NA 4 5 6
3 1 2 3 4 5 6 7 8
Dummy Data
dfA <- data.frame(A = 1, B = 2, Y = 3, Z = 4)
dfB <- data.frame(A = 1, B = 2, C = 3, X = 4, Y = 5, Z = 6)
dfC <- data.frame(A = 1, B = 2, C = 3, D = 4, W = 5, X = 6, Y = 7, Z = 8)
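One caveat with the merge approach: merge() performs a join, so rows that happen to agree on every shared column would be combined into one row rather than appended. If a plain row-append is wanted, a rough base R sketch is to add the missing columns as NA and then rbind. The helper name bind_fill is made up here; it is not a base R function.

```r
dfA <- data.frame(A = 1, B = 2, Y = 3, Z = 4)
dfB <- data.frame(A = 1, B = 2, C = 3, X = 4, Y = 5, Z = 6)
dfC <- data.frame(A = 1, B = 2, C = 3, D = 4, W = 5, X = 6, Y = 7, Z = 8)

# Hypothetical helper: pad each data frame with NA columns, then stack them
bind_fill <- function(...) {
  dfs <- list(...)
  all_cols <- sort(unique(unlist(lapply(dfs, names))))
  padded <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) {
      d[[col]] <- NA            # column missing from this data frame
    }
    d[all_cols]                 # same column order everywhere
  })
  do.call(rbind, padded)
}

out <- bind_fill(dfA, dfB, dfC)
out
```

Packages such as dplyr (bind_rows) and data.table (rbindlist with fill = TRUE) offer the same behaviour if you would rather not write the helper.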

R help - change the maximum value of each row in a certain condition

I am a novice in R. I have a dataframe with columns 1:n. Excluding columns 1 and n, I want to change the maximum value of each row if the row has a specific value in a different column, AND set the remaining values (excluding columns 1 and n) to zero. I have about 300,000 cases and 40 columns in my real data; however, the example below illustrates what I am trying to achieve:
A <- c(1,1,5,5,10)
B <- rnorm(5)
C <- rnorm(5)
D <- rnorm(5)
E <- c(10,15,100,100,100)
df <- data.frame(A,B,C,D,E)
df
A B C D E
1 1 0.74286670 0.3222136 0.9381296 10
2 1 -0.03352498 0.5262685 0.1225731 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100
Here, for each row where column A equals 1, I want to replace the row's maximum value (among columns B, C and D) with the value of column E, and set the other values in columns B, C and D to 0.
So, the result should be like this:
A B C D E
1 1 0 0 10 10
2 1 0 15 0 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100
I tried to do this for two days. Thanks.
Try this out and see what happens :)
df <- read.table(text = "A B C D E
1 1 0.74286670 0.3222136 0.9381296 10
2 1 -0.03352498 0.5262685 0.1225731 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100", stringsAsFactors = FALSE)
# find the max in columns B,C,D
z <- apply(df[df$A == 1, 2:4], 1, max)
# substitute the maximum value of each row for columns B,C,D where A == 1
# with the value of column E. Assign 0 to the others
y <- ifelse(df[df$A == 1, 2:4] == z, df$E[df$A == 1], 0)
# Change the values in your dataframe
df[df$A == 1, 2:4] <- y
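For the full 300,000-row data it may be worth avoiding the row-wise apply. The following is only a sketch of one alternative, using max.col() and matrix indexing; the variable names rows, cols and peak are made up for illustration, and ties are broken in favour of the first column.

```r
df <- data.frame(
  A = c(1, 1, 5, 5, 10),
  B = c(0.74286670, -0.03352498, -0.17689629, 0.48329153, 0.13117277),
  C = c(0.3222136, 0.5262685, -0.8949740, 1.1574834, -0.2068736),
  D = c(0.9381296, 0.1225731, -1.4376567, -1.1116581, 0.4841806),
  E = c(10, 15, 100, 100, 100)
)

rows <- which(df$A == 1)   # rows that should be modified
cols <- 2:4                # columns B, C, D

m <- as.matrix(df[rows, cols])
peak <- max.col(m, ties.method = "first")  # column position of each row's maximum

m[] <- 0                                       # zero everything ...
m[cbind(seq_along(rows), peak)] <- df$E[rows]  # ... then put E where the maximum was
df[rows, cols] <- m
df
```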

R: Grouping data within certain range

I have a data frame with two columns, let's call them X and Y. Here's an example of it:
df <- data.frame(X = LETTERS[1:8],
Y = c(14, 12, 12, 11, 9, 6, 4, 1),
stringsAsFactors = FALSE)
which produces this:
X Y
A 14
B 12
C 12
D 11
E 9
F 6
G 4
H 1
Note that the data frame will always be sorted in descending order of Y. I want to group together cases whose Y values lie within a certain threshold of each other, while updating the X column to reflect the grouping. For example, if the threshold is 2, I would like the final output to be:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F G 5.00000
H 1.00000
Let me explain how I got that. From the starting df data frame, the closest values were B and C. Joining them would result in:
X new_Y
A 14
B C 12
D 11
E 9
F 6
G 4
H 1
The new_Y value for cases B and C is the average of the original values for B and C i.e. 12. From this second data frame, B C are within 2 from D so they are the next to be grouped together:
X new_Y
A 14.00000
B C D 11.66667
E 9.00000
F 6.00000
G 4.00000
H 1.00000
Note that the Y value for B C D is 11.67 because the original values of B, C and D were 12, 12 and 11 respectively and their average is 11.667. I wouldn't want the code to return the average Y from the previous iteration (which in this case would be 11.5).
Finally, F and G can also be grouped together, producing the final output stated above.
I'm not sure of the code needed to achieve this. My only thoughts were to calculate the distance from the previous and following element, look for the minimum and check whether it exceeds the threshold value (of 2 in the example above). Based on where that minimum appears, join the X column while averaging the Y values from the original table. Repeat this until the minimum becomes larger than the threshold.
But I'm not sure how to write the necessary code to achieve this or whether there's a more efficient solution to the algorithm I'm suggesting above. Any help will be much appreciated.
P.S. I forgot to mention that if the distances to the previous and to the following Y value are the same, then the grouping should be done towards the larger Y value. So
X Y
A 10
B 8
C 6
would be returned as
X new_Y
A B 9
C 6
Thanks in advance for your patience. My apologies if I didn't explain this very well.
This sounds like hierarchical agglomerative clustering.
To get the groups, use dist, hclust and cutree.
Note that centroid clustering with hclust expects the distances as the square of the Euclidean distance.
df <- data.frame(X = LETTERS[1:8],
Y = c(14, 12, 12, 11, 9, 6, 4, 1),
stringsAsFactors = FALSE)
dCutoff <- 2
d2 <- dist(df$Y)^2
hc <- hclust(d2, method = "centroid")
group_id <- cutree(hc, h = dCutoff^2)
group_id
#> [1] 1 2 2 2 3 4 4 5
To munge the original table, we can use dplyr.
library('dplyr')
df %>%
group_by(group_id = group_id) %>%
summarise(
X = paste(X, collapse = ' '),
Y = mean(Y))
#> # A tibble: 5 x 3
#> group_id X Y
#> <int> <chr> <dbl>
#> 1 1 A 14.00000
#> 2 2 B C D 11.66667
#> 3 3 E 9.00000
#> 4 4 F G 5.00000
#> 5 5 H 1.00000
This version returns the average from the previous iteration, though. In any case, I hope it helps.
library(data.table)
df <- data.table(X = LETTERS[1:8],
                 Y = c(14, 12, 12, 11, 9, 6, 4, 1),
                 stringsAsFactors = FALSE)
differences <- c(diff(df$Y), NA) # NA for the last element
df$difference <- abs(differences) # gaps between consecutive elements (works because Y is sorted)
minimum <- min(df$difference[1:(length(df$difference)-1)]) # get the minimum
# "<=" so that rows exactly 2 apart (such as F and G) are grouped as well
while (minimum <= 2){
  # see where the minimum occurs; on ties take the first one (the larger Y value)
  index <- which(df$difference == minimum)[1]
  check <- FALSE
  # the last row has no difference because there is no element after it,
  # so we need to see whether it has the minimum difference with the previous row;
  # if it does not, we set it aside and paste it back later
  if(df[nrow(df)-1, difference] != minimum){
    last_row <- df[nrow(df)]
    df <- df[-nrow(df)]
    check <- TRUE
  }
  tmp <- df[(index:(index+1))]
  df <- df[-(index:(index+1))]
  to_bind <- data.table(X = paste0(tmp$X, collapse = " "))
  to_bind$Y <- mean(tmp$Y)
  df <- rbind(df[, .(X, Y)], to_bind)
  if(check){
    df <- rbind(df, last_row[, .(X, Y)])
  }
  setorder(df, -Y)
  differences <- c(diff(df$Y), NA) # NA for the last element
  df$difference <- abs(differences) # gaps between consecutive elements
  minimum <- min(df$difference[1:(length(df$difference)-1)]) # get the minimum
}
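The iterative procedure described in the question (merge the closest pair of neighbours, average the original Y values of the merged group, repeat until the smallest gap exceeds the threshold) can also be sketched in base R. This is only an illustration under the question's assumption that Y is sorted in descending order; the function name group_within and its arguments are made up here.

```r
# Hypothetical helper: labels and y are the X and Y columns, y sorted descending
group_within <- function(labels, y, threshold) {
  groups <- as.list(seq_along(y))   # each group = indices of the original rows
  repeat {
    if (length(groups) < 2) break
    means <- vapply(groups, function(g) mean(y[g]), numeric(1))
    gaps <- abs(diff(means))        # gap between neighbouring groups
    i <- which.min(gaps)            # first minimum, so ties go to the larger Y
    if (gaps[i] > threshold) break
    groups[[i]] <- c(groups[[i]], groups[[i + 1]])  # merge neighbours
    groups[[i + 1]] <- NULL
  }
  data.frame(
    X = vapply(groups, function(g) paste(labels[g], collapse = " "), character(1)),
    new_Y = vapply(groups, function(g) mean(y[g]), numeric(1)),  # mean of ORIGINAL values
    stringsAsFactors = FALSE
  )
}

res <- group_within(LETTERS[1:8], c(14, 12, 12, 11, 9, 6, 4, 1), threshold = 2)
res

# Tie example from the P.S.: A and B are merged, C stays alone
tie <- group_within(c("A", "B", "C"), c(10, 8, 6), threshold = 2)
```

Because each group stores the indices of its original rows, the mean for B C D comes out as 11.66667 rather than the 11.5 of the previous iteration.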

How to reshape a data frame in R, conditioned on a maximum value?

I'm having some difficulty re-shaping my data frame in R. I have 5 individuals: A, B, C, D, and E. Some individuals have 1 observation and some have 2. I have measured 3 values for each observation: X, Y, and Z. I would like to transform my data frame from long to wide format, generating one row per individual and two sets of columns labeled X, Y, and Z. But, I want to condition on the value of X such that the set of observations with the maximum value of X appears first. Thus, for a given observation, the values of X, Y, and Z must remain grouped together, but whether the values from observation 1 or 2 appear first depends on which has the maximum value of X.
df = data.frame(
indiv = c("A","A","B","C","C","D","D","E"),
observ = c(1,2,1,1,2,1,2,1),
X = c(rnorm(8, mean = 10, sd = 6)),
Y = c(rnorm(8, mean = 0, sd = 2)),
Z = c(rnorm(8, mean = 4, sd = 4))
)
indiv observ X Y Z
1 A 1 9.959043 1.785043 10.134511
2 A 2 14.122006 -2.257666 5.799366
3 B 1 11.562801 -1.394951 4.988923
4 C 1 12.955644 -4.330272 8.870165
5 C 2 13.582154 -1.727224 -7.5617
6 D 1 4.053437 1.815233 1.789157
7 D 2 12.990071 -1.989307 3.67696
8 E 1 2.820895 -3.754263 3.001725
Below is what I would like my wide data frame to look like. For individual A, X was greater in observation 2, so that set of values (X,Y,Z) appears first. By contrast, for individuals C and D, X was greater in observation 1, so that set appears first. I think it should be some variation on the reshape function, but I'm not sure how to condition on the maximum value of X. Thanks in advance!
indiv observ X Y Z observ X Y Z
1 A 2 18.797087 0.3247862 4.774446 1 8.547868 0.3203667 6.729975
2 B 1 1.646638 0.7986036 6.938825 NA NA NA NA
3 C 1 17.354905 -2.399272 8.357045 2 6.856093 0.6493722 2.420827
4 D 1 16.058101 -1.2370024 4.045489 2 7.641576 3.0820116 4.232615
5 E 1 13.625998 -0.1953445 -5.627932 NA NA NA NA
I would just order the rows before casting. The following uses data.table, since that package also provides the dcast function; the same could be done with a plain data.frame and reshape.
library(data.table)
set.seed(1)
df = data.frame(
indiv = c("A","A","B","C","C","D","D","E"),
observ = c(1,2,1,1,2,1,2,1),
X = c(rnorm(8, mean = 10, sd = 6)),
Y = c(rnorm(8, mean = 0, sd = 2)),
Z = c(rnorm(8, mean = 4, sd = 4))
)
setDT(df)
df <- df[order(indiv,-X)] #orders your frame
df
indiv observ X Y Z
1: A 2 11.101860 -0.61077677 7.775345
2: A 1 6.241277 1.15156270 3.935239
3: B 1 4.986228 3.02356234 7.284885
4: C 1 19.571685 0.77968647 6.375605
5: C 2 11.977047 -1.24248116 7.675909
6: D 2 12.924574 2.24986184 4.298260
7: D 1 5.077190 -4.42939977 7.128545
8: E 1 14.429948 -0.08986722 -3.957407
df[, observ := as.numeric(1:.N), by = indiv] #reset observ based on new order
df
indiv observ X Y Z
1: A 1 11.101860 -0.61077677 7.775345
2: A 2 6.241277 1.15156270 3.935239
3: B 1 4.986228 3.02356234 7.284885
4: C 1 19.571685 0.77968647 6.375605
5: C 2 11.977047 -1.24248116 7.675909
6: D 1 12.924574 2.24986184 4.298260
7: D 2 5.077190 -4.42939977 7.128545
8: E 1 14.429948 -0.08986722 -3.957407
Now cast normally:
dcast(df, indiv ~ observ, value.var = c("X","Y","Z"))
indiv X_1 X_2 Y_1 Y_2 Z_1 Z_2
1: A 11.101860 6.241277 -0.61077677 1.151563 7.775345 3.935239
2: B 4.986228 NA 3.02356234 NA 7.284885 NA
3: C 19.571685 11.977047 0.77968647 -1.242481 6.375605 7.675909
4: D 12.924574 5.077190 2.24986184 -4.429400 4.298260 7.128545
5: E 14.429948 NA -0.08986722 NA -3.957407 NA
To get the column order you want, I think you need to melt and then cast:
dcast(melt(df, id.vars = c("indiv","observ")), indiv ~ observ + variable)
indiv 1_X 1_Y 1_Z 2_X 2_Y 2_Z
1: A 11.101860 -0.61077677 7.775345 6.241277 1.151563 3.935239
2: B 4.986228 3.02356234 7.284885 NA NA NA
3: C 19.571685 0.77968647 6.375605 11.977047 -1.242481 7.675909
4: D 12.924574 2.24986184 4.298260 5.077190 -4.429400 7.128545
5: E 14.429948 -0.08986722 -3.957407 NA NA NA
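As mentioned above, this could also be done with a plain data.frame and reshape. A rough base R sketch, assuming the same seed as in the data.table example: sort by indiv and descending X, number the rows within each individual, and use that number as the time variable. The column name rank is made up for illustration; observ is kept as a value column, so the original observation number stays next to each X, Y, Z set.

```r
set.seed(1)
df <- data.frame(
  indiv = c("A", "A", "B", "C", "C", "D", "D", "E"),
  observ = c(1, 2, 1, 1, 2, 1, 2, 1),
  X = rnorm(8, mean = 10, sd = 6),
  Y = rnorm(8, mean = 0, sd = 2),
  Z = rnorm(8, mean = 4, sd = 4)
)

# Largest X first within each individual
df <- df[order(df$indiv, -df$X), ]

# Position within each individual after sorting (1 = largest X)
df$rank <- ave(df$X, df$indiv, FUN = seq_along)

# One row per individual; observ, X, Y, Z repeated for each rank
wide <- reshape(df, idvar = "indiv", timevar = "rank", direction = "wide")
wide
```

The resulting columns are named observ.1, X.1, Y.1, Z.1, observ.2, X.2, Y.2, Z.2, with NA in the second set for individuals that have only one observation.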

Groupby bins and aggregate in R

I have data like (a,b,c)
a b c
1 2 1
2 3 1
9 2 2
1 6 2
where the range of 'a' is divided into n (say 3) equal parts, an aggregate function (say max) is applied to the b values in each part, and the result is also grouped by 'c'.
So the output looks like
a_bin b_m(c=1) b_m(c=2)
1-3 3 6
4-6 NaN NaN
7-9 NaN 2
This is an MxN table, where M = number of a bins and N = number of unique c values (or the full range of c).
How do I approach this? Can any R package help me through?
A combination of aggregate, cut and reshape seems to work
df <- data.frame(a = c(1,2,9,1),
b = c(2,3,2,6),
c = c(1,1,2,2))
breaks <- c(0, 3, 6, 9)
# Aggregate data
ag <- aggregate(df$b, FUN=max,
                by=list(a=cut(df$a, breaks, include.lowest=TRUE), c=df$c))
# Reshape data
res <- reshape(ag, idvar="a", timevar="c", direction="wide")
There may be easier ways.
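One possibly simpler base R variant is tapply with two grouping factors, which returns the bins-by-c matrix directly and keeps empty bins as NA. This is only a sketch; the bin labels and the name a_bin are chosen here to match the desired output.

```r
df <- data.frame(a = c(1, 2, 9, 1),
                 b = c(2, 3, 2, 6),
                 c = c(1, 1, 2, 2))

# Bins (0,3], (3,6], (6,9], labelled as in the question
a_bin <- cut(df$a, breaks = c(0, 3, 6, 9), labels = c("1-3", "4-6", "7-9"))

# Rows = bins of a, columns = values of c, cells = max of b (NA where empty)
res <- tapply(df$b, list(a_bin = a_bin, c = df$c), max)
res
```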
If your dataset is dat
res <- sapply(split(dat[, -3], dat$c), function(x) {
  a_bin <- with(x, cut(a, breaks = c(1, 3, 6, 9), include.lowest = TRUE,
                       labels = c("1-3", "4-6", "7-9")))
  c(by(x$b, a_bin, FUN = max))
})
res1 <- setNames(data.frame(row.names(res), res),
c("a_bin", "b_m(c=1)", "b_m(c=2)"))
row.names(res1) <- 1:nrow(res1)
res1
a_bin b_m(c=1) b_m(c=2)
1 1-3 3 6
2 4-6 NA NA
3 7-9 NA 2
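I would use data.table, which is optimized for speed (no loops from the apply family). When this was written, the cast step went through reshape2 and dcast.data.table; current data.table versions provide their own dcast, which is what is used below. The data is assumed to be in a data frame called temp.
The output won't return the unused bins.
library(data.table)
temp <- data.frame(a = c(1, 2, 9, 1), b = c(2, 3, 2, 6), c = c(1, 1, 2, 2)) # example data from the question
v <- c(1, 4, 7, 10) # creating bins
temp$int <- findInterval(temp$a, v)
temp <- setDT(temp)[, list(b_m = max(b)), by = c("c", "int")]
temp <- dcast(temp, int ~ c, value.var = "b_m")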
## colnames(temp) <- c("a_bin", "b_m(c=1)", "b_m(c=2)") # Optional for prettier table
## temp$a_bin<- c("1-3", "7-9") # Optional for prettier table
## a_bin b_m(c=1) b_m(c=2)
## 1 1-3 3 6
## 2 7-9 NA 2
