Replace numeric values with string values - r

In a data table, all the cells are numeric, and what I want to do is replace each number with a string, like this:
Numbers in [0,2]: replace them with the string "Bad"
Numbers in [3,4]: replace them with the string "Good"
Numbers > 4 : replace them with the string "Excellent"
Here's an example of my original table called "data.active":
My attempt to do that is this:
x <- c("churches","resorts","beaches","parks","Theatres",.....)
for(i in x){
data.active$i <- as.character(data.active$i)
data.active$i[data.active$i <= 2] <- "Bad"
data.active$i[data.active$i >2 && data.active$i <=4] <- "Good"
data.active$i[data.active$i >4] <- "Excellent"
}
But it doesn't work. Is there any other way to do this?
EDIT
Here's the link to my dataset GoogleReviews_Dataset and here's how I got the table in the image above:
library(FactoMineR)
library(factoextra)
data<-read.csv2(file.choose())
data.active <- data[1:10, 4:8]

You can use the tidyverse's mutate-across combination to condition on the ranges:
library(tidyverse)
df <- tibble(
x = 1:5,
y = c(1L, 2L, 2L, 2L, 3L),
z = c(1L,3L, 3L, 3L, 2L),
a = c(1L, 5L, 6L, 4L, 8L),
b = c(1L, 3L, 4L, 7L, 1L)
)
df %>% mutate(
  across(
    .cols = everything(),
    .fns = ~ case_when(
      .x <= 2 ~ 'Bad',
      (.x > 2) & (.x <= 4) ~ 'Good',
      (.x > 4) ~ 'Excellent',
      TRUE ~ as.character(.x)  # fallback, e.g. for NA
    )
  )
)
The .x above stands for the column currently being processed (purrr-style lambda notation). This results in
# A tibble: 5 x 5
  x         y     z     a         b
  <chr>     <chr> <chr> <chr>     <chr>
1 Bad       Bad   Bad   Bad       Bad
2 Bad       Bad   Good  Excellent Good
3 Good      Bad   Good  Excellent Good
4 Good      Bad   Good  Good      Excellent
5 Excellent Good  Bad   Excellent Bad
For changing only select columns, use a selection in your .cols parameter for across:
df %>% mutate(
  across(
    .cols = c('a', 'x', 'b'),
    .fns = ~ case_when(
      .x <= 2 ~ 'Bad',
      (.x > 2) & (.x <= 4) ~ 'Good',
      (.x > 4) ~ 'Excellent',
      TRUE ~ as.character(.x)  # fallback, e.g. for NA
    )
  )
)
This yields
# A tibble: 5 x 5
  x             y     z a         b
  <chr>     <int> <int> <chr>     <chr>
1 Bad           1     1 Bad       Bad
2 Bad           2     3 Excellent Good
3 Good          2     3 Excellent Good
4 Good          2     3 Good      Excellent
5 Excellent     3     2 Excellent Bad

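A base R alternative is to bin the selected columns with cut(), which returns factors:
cols <- c('x','y','z')
# (-Inf,2] -> Bad, (2,4] -> Good, (4,Inf) -> Excellent
df[cols] <- lapply(df[cols], function(v)
  cut(v, breaks=c(-Inf,2,4,Inf), labels=c('Bad','Good','Excellent')))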
Data
df<-structure(list(x = 1:5, y = c(1L, 2L, 2L, 2L, 3L), z = c(1L,3L, 3L, 3L, 2L),
a = c(1L, 5L, 6L, 4L, 8L),b = c(1L, 3L, 4L, 7L, 1L)),
class = "data.frame", row.names = c(NA, -5L))
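As a side note on why the loop in the question fails: data.active$i looks up a column literally named i rather than the column whose name is stored in i, and && is not vectorized, so it cannot test a whole column element by element. Below is a minimal sketch of a corrected loop using [[i]] and nested ifelse(). The two-column data.active is a made-up stand-in for the real table, and treating values between 2 and 3 as "Good" is an assumption, since the question leaves that gap open.

```r
# Hypothetical stand-in for the question's table (not the real dataset)
data.active <- data.frame(churches = c(0, 3, 5), resorts = c(2, 4, 1))
cols <- c("churches", "resorts")

for (i in cols) {
  # [[i]] uses the value stored in i; $i would look for a column named "i"
  v <- data.active[[i]]
  # ifelse() is vectorized; the thresholds are checked in order
  data.active[[i]] <- ifelse(v <= 2, "Bad", ifelse(v <= 4, "Good", "Excellent"))
}
data.active
```

Reading the numeric column into v before overwriting it avoids comparing already-converted strings against numbers.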

Related

How to add column reporting sum of couple of subsequent rows

I have the following dataset
structure(list(Var1 = structure(c(1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("0", "1"), class = "factor"), Var2 = structure(c(1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("congruent", "incongruent"
), class = "factor"), Var3 = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), .Label = c("spoken", "written"), class = "factor"),
Freq = c(8L, 2L, 10L, 2L, 10L, 2L, 10L, 2L)), class = "data.frame", row.names = c(NA,
-8L))
I would like to add another column reporting sum of coupled subsequent rows. Thus the final result would look like this:
I have proceeded like this
Table = as.data.frame(table(data_1$unimodal,data_1$cong_cond, data_1$presentation_mode)) %>%
mutate(Var1 = factor(Var1, levels = c('0', '1')))
row = Table %>% #is.factor(Table$Var1)
summarise(across(where(is.numeric),
~ .[Var1 == '0'] + .[Var1 == '1'],
.names = "{.col}_sum"))
column = c(rbind(row$Freq_sum,rep(NA, 4)))
Table$column = column
But I am looking for the quickest way possible, ideally a single expression rather than several separate steps. I have used the dplyr package here, but if you know other approaches (map(), a for loop, or whatever method you consider best), please let me know.
This should do:
df$column <-
rep(colSums(matrix(df$Freq, 2)), each=2) * c(1, NA)
If you are fine with no NAs in the data frame, you can use dplyr:
df %>%
group_by(Var2, Var3) %>%
mutate(column = sum(Freq))
# A tibble: 8 × 5
# Groups: Var2, Var3 [4]
Var1 Var2 Var3 Freq column
<fct> <fct> <fct> <int> <int>
1 0 congruent spoken 8 10
2 1 congruent spoken 2 10
3 0 incongruent spoken 10 12
4 1 incongruent spoken 2 12
5 0 congruent written 10 12
6 1 congruent written 2 12
7 0 incongruent written 10 12
8 1 incongruent written 2 12
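If the NA on every second row is wanted but the grouped dplyr style is preferred, the two ideas can be combined. The following is only a sketch: it assumes each (Var2, Var3) group should show its total on its first row, and it rebuilds the question's data with data.frame() so the block runs on its own.

```r
library(dplyr)

# Same values as the dput() in the question
df <- data.frame(
  Var1 = factor(rep(c("0", "1"), 4)),
  Var2 = factor(rep(rep(c("congruent", "incongruent"), each = 2), 2)),
  Var3 = factor(rep(c("spoken", "written"), each = 4)),
  Freq = c(8L, 2L, 10L, 2L, 10L, 2L, 10L, 2L)
)

out <- df %>%
  group_by(Var2, Var3) %>%
  # group total on the first row of each group, NA on the remaining rows
  mutate(column = if_else(row_number() == 1, sum(Freq), NA_integer_)) %>%
  ungroup()

out$column
```

Unlike the matrix-based one-liner, this does not depend on the rows being ordered in adjacent pairs.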

Using filter and sample in a grouped dataframe

I would like to get two IDs randomly sampled from a predefined set of IDs.
However, using sample() with dplyr::filter() on a grouped data frame returns an unexpected number of IDs: with sample(x, 2) I sometimes get 2 IDs and sometimes a different number.
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L,
5L, 5L, 6L, 6L), Sub = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L,
4L, 4L, 4L, 5L, 5L, 6L, 6L), .Label = c("a", "b", "c", "d", "f",
"g"), class = "factor")), class = "data.frame", row.names = c(NA,
-14L))
samp.vec <- c(1,2,3,4,5)
library(dplyr)
set.seed(123)
#Return Different sample size, Not working
df %>% group_by(ID)%>%filter(ID %in% sample(samp.vec,2)) %>% count(ID)
df %>% group_by(ID)%>%filter(ID %in% sample(samp.vec,2)) %>% count(ID)
set.seed(123)
#Return one sample size, Working
df %>% group_by(ID)%>% ungroup() %>% filter(ID %in% sample(samp.vec,2)) %>% count(ID)
df %>% group_by(ID)%>% ungroup() %>% filter(ID %in% sample(samp.vec,2)) %>% count(ID)
One solution is to use ungroup() before filter. Does anyone know why this is happening?
When a data frame is grouped, the operation is carried out once per group. So you don't have just one pair of IDs, as you would with a fixed ID %in% c(2, 3). To make this clearer, let's omit filter and look at the results of sample(samp.vec, 2):
df %>%
group_by(ID) %>%
mutate(v1 = toString(sample(samp.vec, 2)))
# A tibble: 14 x 3
# Groups: ID [6]
# ID Sub v1
# <int> <fct> <chr>
# 1 1 a 2, 3
# 2 1 a 2, 3
# 3 1 a 2, 3
# 4 2 b 1, 4
# 5 2 b 1, 4
# 6 3 c 3, 1
# 7 3 c 3, 1
# 8 4 d 4, 5
# 9 4 d 4, 5
#10 4 d 4, 5
#11 5 f 4, 2
#12 5 f 4, 2
#13 6 g 2, 4
#14 6 g 2, 4
So filter draws a fresh pair for every group, and a group is kept only if its own ID happens to be in the pair drawn for it. The number of IDs that survive therefore varies from run to run instead of always being 2.
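One way to sidestep the per-group evaluation, sketched here under the assumption that exactly two IDs should be kept overall, is to draw the sample once and store it before filtering. The name keep is introduced only for this illustration.

```r
library(dplyr)

# Same values as the dput() in the question
df <- data.frame(
  ID  = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 6L, 6L),
  Sub = factor(c("a", "a", "a", "b", "b", "c", "c",
                 "d", "d", "d", "f", "f", "g", "g"))
)
samp.vec <- c(1, 2, 3, 4, 5)

set.seed(123)
keep <- sample(samp.vec, 2)  # drawn exactly once, outside any grouping

res <- df %>%
  filter(ID %in% keep) %>%   # the same two IDs apply to every row
  count(ID)
res
```

Because keep is a fixed vector by the time filter runs, grouping the data frame beforehand no longer changes the result.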

Rowwise median for multiple columns using dplyr

Given the following dataset, I want to compute for each row the median of the columns M1,M2 and M3. I am looking for a solution where the final column is added to the dataframe under the name 'Median'. The column names (M1:M3) should not be used directly (in the original dataset, there are many more columns, not just 3).
# A tibble: 8 x 5
I1 M1 M2 I2 M3
<int> <int> <int> <int> <int>
1 3 4 5 3 5
2 2 2 2 2 1
3 2 2 2 2 2
4 3 1 3 3 1
5 2 1 3 3 1
6 3 2 4 4 3
7 3 1 3 4 1
8 2 1 3 2 3
You can load the dataset using:
df = structure(list(I1 = c(3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L), M1 = c(4L,
2L, 2L, 1L, 1L, 2L, 1L, 1L), M2 = c(5L, 2L, 2L, 3L, 3L, 4L, 3L,
3L), I2 = c(3L, 2L, 2L, 3L, 3L, 4L, 4L, 2L), M3 = c(5L, 1L, 2L,
1L, 1L, 3L, 1L, 3L)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -8L), .Names = c("I1", "M1", "M2", "I2",
"M3"))
I know that several similar questions have already been asked. However, most solutions posted use rowMeans or rowSums. I'm looking for a solution where:
1. no 'row-function' can be used.
2. the solution is a simple dplyr solution.
The reason for (2) is that I am teaching the 'tidyverse' to total beginners.
We could use rowMedians
library(matrixStats)
library(dplyr)
df %>%
mutate(Median = rowMedians(as.matrix(.[grep('M\\d+', names(.))])))
Or if we need to use only tidyverse functions, convert it to 'long' format with gather, summarize by row and get the median of the 'value' column
df %>%
rownames_to_column('rn') %>%
gather(key, value, starts_with('M')) %>%
group_by(rn) %>%
summarise(Median = median(value)) %>%
ungroup %>%
select(-rn) %>%
bind_cols(df, .)
Or another option is rowwise() from dplyr (hope the row is not a problem)
df %>%
rowwise() %>%
mutate(Median = median(c(!!! rlang::syms(grep('M', names(.), value=TRUE)))))
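With dplyr 1.0.0 or later, the rowwise() variant can be written without rlang by using c_across(), which accepts tidyselect helpers. This is a sketch that assumes the columns of interest are exactly those whose names start with "M"; the data is rebuilt with tibble() so the block is self-contained.

```r
library(dplyr)

# Same values as the dput() in the question
df <- tibble(
  I1 = c(3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L),
  M1 = c(4L, 2L, 2L, 1L, 1L, 2L, 1L, 1L),
  M2 = c(5L, 2L, 2L, 3L, 3L, 4L, 3L, 3L),
  I2 = c(3L, 2L, 2L, 3L, 3L, 4L, 4L, 2L),
  M3 = c(5L, 1L, 2L, 1L, 1L, 3L, 1L, 3L)
)

res <- df %>%
  rowwise() %>%
  # c_across() gathers the selected columns of the current row into a vector
  mutate(Median = median(c_across(starts_with("M")))) %>%
  ungroup()
res
```

The final ungroup() drops the row-wise grouping so that later verbs operate on whole columns again.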
Given a dataframe df with some numeric values:
df <- structure(list(X0 = c(0.82046171427112, 0.836224720981912, 0.842547521493854,
0.848014287631906, 0.850943494153631, 0.85425398956647, 0.85616876970771,
0.856855792247478, 0.857471048654811, 0.857507363153284, 0.874487063791594,
1.70684558846347, 1.95711031206168, 6.84386713155156), X1 = c(0.755674148966666,
0.765242580861224, 0.774422478168495, 0.776953642833977, 0.778128315184819,
0.778611604461183, 0.778624581647491, 0.778454002430202, 1.52708579075974,
13.0356519295685, 18.0590093408357, 21.1371199340156, 32.4192814934364,
33.2355314147089), X2 = c(0.772236670327724, 0.788112332251601,
0.797695511542613, 0.804257521548174, 0.809815828400878, 0.816592605516508,
0.819421106011397, 0.821734473885381, 0.822561946509595, 0.822334970491528,
0.822404634095793, 2.66875340820162, 1.40412743557514, 6.33377768022403
), X3 = c(0.764363881671609, 0.788288196346034, 0.79927498357549,
0.805446784334039, 0.810604881970155, 0.814634331592811, 0.817002594424753,
0.818129844752095, 0.818572101954132, 0.818630700031836, 3.06323952591121,
6.4477868357554, 11.4657041958038, 9.27821049066848)), class = "data.frame", row.names = c(NA,
-14L))
One can easily compute the row-wise median with sapply (the %>% pipe used here comes from magrittr, which dplyr re-exports, so load dplyr first):
df$median <- sapply(
seq(nrow(df)),
function(i) df[i, 1:4] %>% unlist %>% median
)
Above I select the columns manually with a numeric range, but to satisfy the dplyr requirement you can use dplyr::select() to choose your columns:
df$median <- sapply(
df %>% nrow %>% seq,
function(i) df[i, ] %>%
dplyr::select(X1, X2) %>%
unlist %>% median
)
I like this method because you don't have to search for different functions to calculate anything.
For example, standard deviation:
df$sd <- sapply(
df %>% nrow %>% seq,
function(i) df[i, ] %>%
dplyr::select(X1, X2) %>%
unlist %>% sd
)

how to count and remove similar strings across columns

I have data with many columns. For example, this is a version with three columns:
df<-structure(list(V1 = structure(c(5L, 1L, 7L, 3L, 2L, 4L, 6L, 6L
), .Label = c("CPSIAAAIAAVNALHGR", "DLNYCFSGMSDHR", "FPEHELIVDPQR",
"IADPDAVKPDDWDEDAPSK", "LWADHGVQACFGR", "WGEAGAEYVVESTGVFTTMEK",
"YYVTIIDAPGHR"), class = "factor"), V2 = structure(c(5L, 2L,
7L, 3L, 4L, 6L, 1L, 1L), .Label = c("", "CPSIAAAIAAVNALHGR",
"GCITIIGGGDTATCCAK", "HVGPGVLSMANAGPNTNGSQFFICTIK", "LLELGPKPEVAQQTR",
"MVCCSAWSEDHPICNLFTCGFDR", "YYVTIIDAPGHR"), class = "factor"),
V3 = structure(c(4L, 3L, 2L, 4L, 3L, 1L, 1L, 1L), .Label = c("",
"AVCMLSNTTAIAEAWAR", "DLNYCFSGMSDHR", "FPEHELIVDPQR"), class = "factor")), .Names = c("V1",
"V2", "V3"), class = "data.frame", row.names = c(NA, -8L))
- For the first column, we don't look at any other column; we just count how many strings there are and keep the unique ones.
- For the second column, we keep the unique strings and also remove those that were already in the first column.
- For the third column, we keep the unique strings and remove those that were in the first and second columns.
- This continues for as many columns as we have.
For example, for this data we would have the following:
Column 1               Column 2                     Column 3
LWADHGVQACFGR
CPSIAAAIAAVNALHGR      LLELGPKPEVAQQTR              AVCMLSNTTAIAEAWAR
YYVTIIDAPGHR           GCITIIGGGDTATCCAK
FPEHELIVDPQR           HVGPGVLSMANAGPNTNGSQFFICTIK
DLNYCFSGMSDHR          MVCCSAWSEDHPICNLFTCGFDR
IADPDAVKPDDWDEDAPSK
WGEAGAEYVVESTGVFTTMEK
Here is a solution via tidyverse,
library(tidyverse)
df1 <- df %>%
gather(var, string) %>%
filter(string != '' & !duplicated(string)) %>%
group_by(var) %>%
mutate(cnt = seq(n())) %>%
spread(var, string) %>%
select(-cnt)
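Which gives the following (the printout appears to have been taken before the final select(-cnt) step, since the helper column cnt is still shown):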
# A tibble: 7 x 4
cnt V1 V2 V3
* <int> <chr> <chr> <chr>
1 1 LWADHGVQACFGR LLELGPKPEVAQQTR AVCMLSNTTAIAEAWAR
2 2 CPSIAAAIAAVNALHGR GCITIIGGGDTATCCAK <NA>
3 3 YYVTIIDAPGHR HVGPGVLSMANAGPNTNGSQFFICTIK <NA>
4 4 FPEHELIVDPQR MVCCSAWSEDHPICNLFTCGFDR <NA>
5 5 DLNYCFSGMSDHR <NA> <NA>
6 6 IADPDAVKPDDWDEDAPSK <NA> <NA>
7 7 WGEAGAEYVVESTGVFTTMEK <NA> <NA>
You can use colSums to get the number of strings,
colSums(!is.na(df1))
#V1 V2 V3
# 7 4 1
A similar approach via base R, which saves the strings in a list, would be:
df[] <- lapply(df, as.character)
d1 <- stack(df)
d1 <- d1[d1$values != '' & !duplicated(d1$values),]
l1 <- unstack(d1, values ~ ind)
lengths(l1)
#V1 V2 V3
# 7 4 1
A base R solution. df2 is the final output.
# Convert to character
L1 <- lapply(df, as.character)
# Get unique string
L2 <- lapply(L1, unique)
# Remove ""
L3 <- lapply(L2, function(vec){vec <- vec[!(vec %in% "")]})
# Use for loop to remove non-unique string from previous columns
for (i in 2:length(L3)){
previous_vec <- unlist(L3[1:(i - 1)])
current_vec <- L3[[i]]
L3[[i]] <- current_vec[!(current_vec %in% previous_vec)]
}
# Get the maximum column length
max_num <- max(sapply(L3, length))
# Append "" to each column
L4 <- lapply(L3, function(vec){vec <- c(vec, rep("", max_num - length(vec)))})
# Convert L4 to a data frame
df2 <- as.data.frame(do.call(cbind, L4))
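The same left-to-right filtering can be sketched more compactly with setdiff(), which both removes duplicates and drops anything already seen. The names seen and kept are introduced here for illustration only, and the data is rebuilt as plain character columns so the block runs on its own.

```r
# Same strings as the dput() in the question, stored as character columns
df <- data.frame(
  V1 = c("LWADHGVQACFGR", "CPSIAAAIAAVNALHGR", "YYVTIIDAPGHR", "FPEHELIVDPQR",
         "DLNYCFSGMSDHR", "IADPDAVKPDDWDEDAPSK", "WGEAGAEYVVESTGVFTTMEK",
         "WGEAGAEYVVESTGVFTTMEK"),
  V2 = c("LLELGPKPEVAQQTR", "CPSIAAAIAAVNALHGR", "YYVTIIDAPGHR",
         "GCITIIGGGDTATCCAK", "HVGPGVLSMANAGPNTNGSQFFICTIK",
         "MVCCSAWSEDHPICNLFTCGFDR", "", ""),
  V3 = c("FPEHELIVDPQR", "DLNYCFSGMSDHR", "AVCMLSNTTAIAEAWAR",
         "FPEHELIVDPQR", "DLNYCFSGMSDHR", "", "", ""),
  stringsAsFactors = FALSE
)

seen <- character(0)  # strings kept from earlier columns
kept <- list()
for (nm in names(df)) {
  # setdiff() returns the unique values of the column that are
  # neither empty nor present in any earlier column
  kept[[nm]] <- setdiff(df[[nm]], c(seen, ""))
  seen <- c(seen, kept[[nm]])
}
lengths(kept)
```

The result is a named list, so columns of different lengths need no padding unless a rectangular data frame is required afterwards.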

Finding the max number of occurrences from the available result

I have a dataframe which looks like -
Id Result
A 1
B 2
C 1
B 1
C 1
A 2
B 1
B 2
C 1
A 1
B 2
Now I need to count how many 1's and 2's there are for each Id and then select the value that occurs most frequently:
Id Result
A 1
B 2
C 1
How can I do that? I have tried using the table function in various ways but have not been able to use it effectively. Any help would be appreciated.
Here you can use aggregate in one step:
df <- structure(list(Id = structure(c(1L, 2L, 3L, 2L, 3L, 1L, 2L, 2L,
3L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor"),
Result = c(1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L)),
.Names = c("Id", "Result"), class = "data.frame", row.names = c(NA, -11L)
)
res <- aggregate(Result ~ Id, df, FUN=function(x){which.max(c(sum(x==1), sum(x==2)))})
res
Result:
Id Result
1 A 1
2 B 2
3 C 1
With data.table you can try (df is your data.frame):
require(data.table)
dt<-as.data.table(df)
dt[,list(times=.N),by=list(Id,Result)][,list(Result=Result[which.max(times)]),by=Id]
# Id Result
#1: A 1
#2: B 2
#3: C 1
Using dplyr, you can try
library(dplyr)
df %>% group_by(Id, Result) %>% summarize(n = n()) %>% group_by(Id) %>%
filter(n == max(n)) %>% summarize(Result = Result)
Id Result
1 A 1
2 B 2
3 C 1
An option using table and ave
subset(as.data.frame(table(df)),ave(Freq, Id, FUN=max)==Freq, select=-3)
# Id Result
# 1 A 1
# 3 C 1
# 5 B 2
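Note that the aggregate answer above returns the position from which.max(), which coincides with the value only because the results happen to be 1 and 2. A sketch of a more general variant, assuming Result holds arbitrary integer codes, looks the winning value up from the names of table(). The helper most_common is a name made up for this example; ties go to the smallest value.

```r
# Same values as the dput() in the answer above
df <- data.frame(
  Id     = c("A", "B", "C", "B", "C", "A", "B", "B", "C", "A", "B"),
  Result = c(1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L)
)

# Hypothetical helper: the most frequent value of x
most_common <- function(x) {
  counts <- table(x)
  as.integer(names(counts)[which.max(counts)])
}

res <- aggregate(Result ~ Id, df, FUN = most_common)
res
```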
