Look up/match values within the same dataframe column in R

Given data.frame(code=c(10, 20, 21, 22, 23, 31, 32, 40, 50), label=c("a", "b", "c", "d", "e", "f", "g", "h", "i")), I'd like c("", "", "b", "b", "b", "", "", "", "").
If the value is not a multiple of 10, assign the label of the immediately previous multiple of 10 if it is listed. If the immediately previous multiple of 10 is not listed, assign blank. If the value is a multiple of 10, assign blank. (Unlike this dummy example, multiple sequences of non-multiples of 10 may occur in the data and the values may not be ordered.)
Ideally, I'd like to do this as a vector operation in base R, for speed and parsimony.
EDIT: I was trying to simplify my question as much as possible, but maybe it was misleading, so here is the final output I'm aiming for: data.frame(code=c(10, 20, 21, 22, 23, 31, 32, 40, 50), label=c("a", "b", "b c", "b d", "b e", "f", "g", "h", "i")). That is, prepend the intermediate output to the label column.

This may look like overkill, but it seems to work:
library(dplyr)
library(tidyr)

df %>%
  # arrange the data by code
  arrange(code) %>%
  # get the previous multiple of 10
  mutate(multiple10 = floor(code / 10) * 10,
         # if exactly divisible by 10, keep the label, otherwise NA
         result = ifelse(code %% 10 == 0, label, NA)) %>%
  # for each multiple of 10
  group_by(multiple10) %>%
  # fill NA with the most recent non-NA value in the group
  fill(result) %>%
  ungroup() %>%
  # turn NA into blank, along with values that are exact multiples of 10
  mutate(result = replace(result, code == multiple10 | is.na(result), ""))
# code label multiple10 result
# <dbl> <chr> <dbl> <chr>
#1 10 a 10 ""
#2 20 b 20 ""
#3 21 c 20 "b"
#4 22 d 20 "b"
#5 23 e 20 "b"
#6 31 f 30 ""
#7 32 g 30 ""
#8 40 h 40 ""
#9 50 i 50 ""
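Since the question asks for a base R vector operation, here is a minimal base R sketch (my addition, not part of the answer above), assuming the question's data frame is stored as df as in the answer: match() looks up the row of the previous multiple of 10, and everything else is blanked out.
# Minimal base R sketch; assumes df holds the code/label columns from the question
lab    <- as.character(df$label)
base10 <- floor(df$code / 10) * 10           # previous multiple of 10
idx    <- match(base10, df$code)             # its row in df, NA if not listed
res    <- ifelse(df$code %% 10 == 0 | is.na(idx), "", lab[idx])
res
# [1] ""  ""  "b" "b" "b" ""  ""  ""  ""

# For the edited goal, prepend the intermediate output to the label column
trimws(paste(res, lab))
# [1] "a"   "b"   "b c" "b d" "b e" "f"   "g"   "h"   "i"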

Related

Using 3 different data to output 4th dataframe

I’m having trouble working with 3 different sets of data (df1, df2, vec1) to output a third dataframe df3. I have 2 dataframes df1 and df2. In df1, each letter in X1 corresponds to a value in X2. In df2, X3 represents a numerical value found in vec1 and X4 represents a letter or multiple letters from df1$X1. I’m looking to scan the letters found in df2$X4 and see if there is a sequential order of N values determined from df2$X3 in vec1, and then remove any letters that do not fit this criterion.
For example, in df2[1, ], the letters are “A, B, D” and the value is 3. Looking at vec1, the max sequential order that includes the value 3 is “2, 3, 4, 5”, meaning df2[1, 2] should be replaced with “A, D” instead of “A, B, D”. The final output should look like df3. Any ideas would be greatly appreciated.
df1 <- data.frame(c("A", "B", "C", "D"), c(4, 8, 1, 3))
colnames(df1) <- c("X1", "X2")
df2 <- data.frame(c(3, 21, 27, 34, 35, 46), c("A, B, D", "A, C", NA, "B", "B, D", "C"))
colnames(df2) <- c("X3", "X4")
vec1 <- c(2, 3, 4, 5, 21, 22, 23, 27, 33, 34, 35, 36, 37, 38, 39, 46)
df3 <- data.frame(c(3, 21, 27, 34, 35, 46), c("A, D", "C", NA, NA, "D", NA))
This is not elegant but it may do what you need it to do.
First, split vec1 into a list of runs of consecutive integers:
vec1_seq <- split(vec1, cumsum(c(0, diff(vec1) > 1)))
$`0`
[1] 2 3 4 5
$`1`
[1] 21 22 23
$`2`
[1] 27
$`3`
[1] 33 34 35 36 37 38 39
$`4`
[1] 46
Then, for each row, find which element of the list contains X3 and take that element's length. Keep only the letters whose df1$X2 value does not exceed that length:
cbind(df2,
      X5 = apply(df2, 1, function(x) {
        # length of the run of consecutive values in vec1 that contains X3
        l <- length(unlist(vec1_seq[sapply(seq_along(vec1_seq), function(i) {
          as.numeric(x[["X3"]]) %in% vec1_seq[[i]]
        })]))
        # keep only the letters whose df1$X2 value is at most that length
        toString(na.omit(as.vector(sapply(trimws(unlist(strsplit(x[["X4"]], ","))), function(i) {
          ifelse(i == df1[["X1"]] & df1[["X2"]] <= l, i, NA)
        }))))
      }))
It seems that "C" should remain for row 6; if that is incorrect let me know.
Output
X3 X4 X5
1 3 A, B, D A, D
2 21 A, C C
3 27 <NA>
4 34 B
5 35 B, D D
6 46 C C
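For comparison, here is a more direct sketch of the same idea (my addition, reusing vec1_seq, df1 and df2 from above): look up the run length for each X3, then keep only the letters whose df1$X2 value fits within it (rows with a missing X4 stay NA).
# Sketch only: assumes vec1_seq, df1 and df2 as defined above
run_len <- sapply(df2$X3, function(v) {
  hit <- which(sapply(vec1_seq, function(s) v %in% s))
  if (length(hit)) length(vec1_seq[[hit]]) else 0
})
df2$X5 <- mapply(function(letters, l) {
  if (is.na(letters)) return(NA_character_)
  keep <- trimws(strsplit(as.character(letters), ",")[[1]])
  keep <- keep[df1$X2[match(keep, df1$X1)] <= l]
  if (length(keep)) toString(keep) else ""
}, df2$X4, run_len)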

How to scale segments of a column in an R data frame?

I have a data frame with a numeric value and a category. I need to scale the numeric value, but only with respect to those observations of its own category (hopefully without splitting up the dataframe into pieces and then using rbind to stitch it back up).
Here is the example:
df <- data.frame(x = c(1, 2, 3, 4, 5, 20, 22, 24, 25, 27, 12, 13, 12, 15, 17),
                 y = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C"))
This would scale the whole column, but I want the scaling done only within each category (i.e. A, B, and C).
df$z <- scale(df$x)
Appreciate the help!
Apply the same function (scale) by group.
In base R
df$z <- with(df, ave(x, y, FUN = scale))
df
# x y z
#1 1 A -1.26491
#2 2 A -0.63246
#3 3 A 0.00000
#4 4 A 0.63246
#5 5 A 1.26491
#6 20 B -1.33242
#7 22 B -0.59219
#8 24 B 0.14805
#9 25 B 0.51816
#10 27 B 1.25840
#11 12 C -0.83028
#12 13 C -0.36901
#13 12 C -0.83028
#14 15 C 0.55352
#15 17 C 1.47605
Using dplyr
library(dplyr)
df %>% group_by(y) %>% mutate(z = scale(x))
Or data.table
library(data.table)
setDT(df)[, z:= scale(x), y]
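One caveat worth adding (my note, not part of the answer): scale() returns a one-column matrix, so the dplyr version stores z as a matrix column. Wrapping it in as.numeric() keeps z a plain numeric vector:
df %>% group_by(y) %>% mutate(z = as.numeric(scale(x)))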

Find out the biggest value element for every title [duplicate]

This question already has answers here:
How to select the rows with maximum values in each group with dplyr? [duplicate]
I have the following data:
library(tidyverse)
df1 <- tibble(
  title = c("AA", "AA", "AA", "B", "C", "D", "D"),
  rate = c(100, 100, 100, 95, 92, 90, 90),
  name = c("G", "N", "E", "T", "O", "W", "L"),
  pos = c(10, 1, 2, 2, 3, 5, 4)
)
title rate name pos
<chr> <dbl> <chr> <dbl>
AA 100 G 10
AA 100 N 1
AA 100 E 2
B 95 T 2
C 92 O 3
D 90 W 5
D 90 L 4
I want to find out, for every title, which name has the biggest pos value.
So, for title AA, it would be G, for title B, it would be T, for title C it would be O and for title D it would be W.
For B it should be "T"?
df1 %>% group_by(title) %>% top_n(1,pos) %>% pull(name)
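A side note (my addition): in recent dplyr versions top_n() is superseded, and slice_max() is the recommended equivalent; both keep ties by default, so AA still returns only G here because 10 is its single largest pos.
# Equivalent with slice_max() (assumes dplyr >= 1.0.0)
df1 %>% group_by(title) %>% slice_max(pos, n = 1) %>% pull(name)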

Why does spread() create a NA-only column?

I'm still an R beginner, so I hope this question is not redundant, but I couldn't find a satisfying answer to my problem. Although this question seems to be very similar, I still wonder whether my observation represents the standard case. Using the function tidyr::spread results in awkward behaviour when I try to spread three unique observations from a column that also contains NAs. The result is a tibble with three new columns (as expected) but also with an additional fourth column named "NA" which is completely filled with NAs.
Here is my example dataframe:
test <- data.frame("Country" = c("A", "A", "A", "A", "A", "A", "A", "A"),
                   "Column1" = c(1, 1, 1, 1, 1, 1, 2, 2),
                   "Column2" = c(3, 3, 3, 4, 4, 4, 5, 5),
                   "Column3" = c("B", "M", "F", "B", "M", "F", "B", NA),
                   "Column4" = c(50, 74, 31, 53, 79, 33, 51, NA))
test1 <- spread(test, key = "Column3", value = "Column4")
test1
Is this normal when my tibble contains missing values? And if so, why? The creation of an additional column being completely filled with missing values as a standard behaviour seems strange to me. Or am I missing something obvious (probably)?
Any help would be much appreciated!
spread is behaving as expected, though the repeated presence of NA, both as a column name and as values in the data frames, might make the behavior unclear. Let's change the data frame to use the string 'NA' in "Column3" and a dummy value of 999 in "Column4":
test <- data.frame("Country" = c("A", "A", "A", "A", "A", "A", "A", "A"),
                   "Column1" = c(1, 1, 1, 1, 1, 1, 2, 2),
                   "Column2" = c(3, 3, 3, 4, 4, 4, 5, 5),
                   "Column3" = c("B", "M", "F", "B", "M", "F", "B", "NA"),
                   "Column4" = c(50, 74, 31, 53, 79, 33, 51, 999))
Country Column1 Column2 Column3 Column4
1 A 1 3 B 50
2 A 1 3 M 74
3 A 1 3 F 31
4 A 1 4 B 53
5 A 1 4 M 79
6 A 1 4 F 33
7 A 2 5 B 51
8 A 2 5 NA 999
And now the spread operation:
test1 <- spread(test, key = "Column3", value = "Column4")
Country Column1 Column2 B F M NA
1 A 1 3 50 31 74 NA
2 A 1 4 53 33 79 NA
3 A 2 5 51 NA NA 999
spread has correctly placed the 999 value in the new "NA" column (again, new column names taken from the old values in "Column3"), and aligned this value with matching values from the original data frame. Because 999 only appears once in the original data frame, it only has 1 matching row in the new data frame, and all other rows in the new "NA" column are therefore filled with NA (again, somewhat confusingly here).
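If the extra "NA" column is simply unwanted, one option (my addition, assuming the original test data frame from the question with a real NA in "Column3") is to drop the rows whose key is missing before reshaping; tidyr::pivot_wider(), which has since replaced spread(), works the same way here.
library(tidyr)
# Drop rows with a missing key, then reshape
pivot_wider(test[!is.na(test$Column3), ],
            names_from = Column3, values_from = Column4)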

Summation of variables by Groups in R

I have a data frame, and I'd like to create a new column that gives the sum of a numeric variable grouped by factors. So something like this:
BEFORE:
data1 <- data.frame(month = c(1, 1, 2, 2, 3, 3),
                    sex = c("m", "f", "m", "f", "m", "f"),
                    value = c(10, 20, 30, 40, 50, 60))
AFTER:
data2 <- data.frame(month = c(1, 1, 2, 2, 3, 3),
                    sex = c("m", "f", "m", "f", "m", "f"),
                    value = c(10, 20, 30, 40, 50, 60),
                    sum = c(30, 30, 70, 70, 110, 110))
In Stata you can do this with the egen command quite easily. I've tried the aggregate function and the ddply function, but they create entirely new data frames, and I just want to add a column to the existing one.
You are looking for ave
> data2 <- transform(data1, sum=ave(value, month, FUN=sum))
month sex value sum
1 1 m 10 30
2 1 f 20 30
3 2 m 30 70
4 2 f 40 70
5 3 m 50 110
6 3 f 60 110
data1$sum <- ave(data1$value, data1$month, FUN=sum) is useful if you don't want to use transform
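As a further note (my addition): ave() also accepts a different grouping vector, or several of them, which is handy when you need more than one grouping factor.
# Sum by sex instead of month
data1$sum_by_sex <- ave(data1$value, data1$sex, FUN = sum)
# Several grouping vectors can be supplied, e.g. ave(value, month, sex, FUN = sum)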
Also data.table is helpful
library(data.table)
DT <- data.table(data1)
DT[, sum:=sum(value), by=month]
UPDATE
We can also use a tidyverse approach which is simple, yet elegant:
> library(tidyverse)
> data1 %>%
group_by(month) %>%
mutate(sum=sum(value))
# A tibble: 6 x 4
# Groups: month [3]
month sex value sum
<dbl> <fct> <dbl> <dbl>
1 1 m 10 30
2 1 f 20 30
3 2 m 30 70
4 2 f 40 70
5 3 m 50 110
6 3 f 60 110
