Change the value of a low frequency column to a desired value

In my data below, I want to replace any value in a column (excluding the first column) that occurs fewer than two times (e.g. 'greek' in column L1 and 'german' in column L2) with "others".
I have tried the following, but I don't get the desired output. Is there a short and efficient way to do this in R?
data <- data.frame(study=c('a','a','b','c','c','d'),
L1= c('arabic','turkish','greek','arabic','turkish','turkish'),
L2= c(rep('english',5),'german'))
# I tried the following without success:
dd[-1] <- lapply(names(dd)[-1], function(i) ifelse(table(dd[[i]]) < 2,"others",dd[[i]]))

forcats has a specific function for this:
dd = data
dd[-1] = lapply(dd[-1], forcats::fct_lump_min, min = 2, other_level = "others")
dd
# study L1 L2
# 1 a arabic english
# 2 a turkish english
# 3 b others english
# 4 c arabic english
# 5 c turkish english
# 6 d turkish others
Your approach fails because ifelse() returns a vector the same length as its test. In your case the test is the table, which has one entry per distinct value, but you are assigning to the whole column, so the result needs to be the same length as the column.
We can fix it like this:
dd[-1] <- lapply(names(dd)[-1], function(i) {
  tt = table(dd[[i]])
  drop = names(tt)[tt < 2]  # values that occur fewer than two times
  ifelse(dd[[i]] %in% drop, "others", dd[[i]])
})
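To see the length mismatch concretely, here is a minimal base R sketch. The helper name lump_rare and its arguments are made up for illustration; it indexes the table by value so that there is one count per row before ifelse() is called:

```r
data <- data.frame(
  study = c("a", "a", "b", "c", "c", "d"),
  L1 = c("arabic", "turkish", "greek", "arabic", "turkish", "turkish"),
  L2 = c(rep("english", 5), "german"),
  stringsAsFactors = FALSE
)

# table() has one entry per distinct value, not one per row
length(table(data$L1))  # 3
nrow(data)              # 6

# hypothetical helper: replace values occurring fewer than min_n times
lump_rare <- function(x, min_n = 2, other = "others") {
  x <- as.character(x)
  counts <- table(x)
  per_row <- as.vector(counts[x])  # look up each row's count by value
  ifelse(per_row < min_n, other, x)
}

dd <- data
dd[-1] <- lapply(dd[-1], lump_rare)
dd
```

Unlike forcats::fct_lump_min, this sketch returns plain character columns rather than factors.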

Related

Combine two strings with commas in R

I haven't been able to find an answer to this, but I am guessing this is because I am not phrasing my question properly.
I want to combine two strings containing several comma-separated values into one string, alternating the inputs from each original string.
x <- '1,2'
y <- 'R,L'
# fictitious function
z <- combineSomehow(x,y)
# desired value of z: '1R, 2L'
EDIT: Adding a data frame to better describe my issue. I would like to be able to accomplish the above, ideally within a mutate.
df <- data.frame(
x = c('1','2','1,1','2','1'),
y = c('R','L','R,L','L','R'),
desired_result = c('1R','2L','1R,1L','2L','1R')
)
df:
x y desired_result
1 1 R 1R
2 2 L 2L
3 1,1 R,L 1R,1L
4 2 L 2L
5 1 R 1R
Final Edit/Answer: Based on @akrun's comment/response below, and after removing the error originally in df, this ended up being the tidyverse answer:
# requires dplyr, purrr and stringr; map2_chr returns a character column rather than a list column
df %>%
  mutate(desired_result = map2_chr(.x = strsplit(x, ','), .y = strsplit(y, ','),
                                   ~ str_c(.x, .y, collapse = ',')))

It can be done with strsplit and paste
combineSomehow <- function(x, y) {
do.call(paste0, c(strsplit(c(x,y),","), collapse=", "))
}
combineSomehow(x,y)
#[1] "1R, 2L"
Without modifying the function, we can Vectorize it to apply it to multiple elements:
df$desired_result2 <- Vectorize(combineSomehow)(df$x, df$y)
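If the separator should match desired_result in the data frame (a plain comma, no space), an mapply() variant is one option. This is only a sketch: pair_up is a hypothetical name, and it assumes x and y are character columns with the same number of comma-separated pieces in each row.

```r
df <- data.frame(
  x = c("1", "2", "1,1", "2", "1"),
  y = c("R", "L", "R,L", "L", "R"),
  stringsAsFactors = FALSE
)

# hypothetical helper: split both inputs on commas, paste the pieces
# pairwise, then collapse each row back into a single string
pair_up <- function(x, y, sep = ",") {
  mapply(
    function(a, b) paste0(a, b, collapse = sep),
    strsplit(x, ",", fixed = TRUE),
    strsplit(y, ",", fixed = TRUE),
    USE.NAMES = FALSE
  )
}

df$desired_result <- pair_up(df$x, df$y)
df$desired_result
# "1R"    "2L"    "1R,1L" "2L"    "1R"
```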

Standardize group names using a vector of possible matches

I need to standardize how subgroups are referred to in a data set. To do this I need to identify when a variable matches one of several strings and then set a new variable with the standardized name. I am trying to do that with the following:
df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
for (i in TestVector) {
df$grpl <- grepl(paste0(i), df$b)
df[ which(df$grpl == TRUE),]$standard <- "male"
}
The test vector will frequently have multiple elements. The grepl works (I was going to deal with the male/female match confusion later but I'll take suggestions on that) but the subsetting and setting a new variable doesn't. It would be better (and work) if I could transform the grepl output directly into the standard name variable.

Your only real issue is that you need to initialize the standard column. But we can simplify your code a bit:
df <- data.frame(a = c(1,2,3,4), b = c("depression_male", "depression_female", "depression_hsgrad", "depression_collgrad"))
TestVector <- "male"
df$standard <- NA
for (i in TestVector) {
df[ grepl(i, df$b), "standard"] <- "male"
}
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female male
# 3 3 depression_hsgrad <NA>
# 4 4 depression_collgrad <NA>
Then you've got the issue that the "male" pattern matches "female" as well.
Perhaps you're looking for sub instead? It works like find/replace:
df$standard = sub(pattern = "depression_", replacement = "", df$b)
df
# a b standard
# 1 1 depression_male male
# 2 2 depression_female female
# 3 3 depression_hsgrad hsgrad
# 4 4 depression_collgrad collgrad
It's hard to generalize what will be best in your case without more example input/output pairs. If all your data is of the form "depression_" this will work well. Or maybe the standard name is always after an underscore, so you could use pattern = ".*_" to replace everything before the last underscore. Or maybe something else... Hopefully these ideas give you a good start.
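As a rough illustration of those two ideas, here is a small sketch using the example strings from the question. The patterns vector and its names are assumptions made up for this example, not something from your data:

```r
b <- c("depression_male", "depression_female",
       "depression_hsgrad", "depression_collgrad")

# greedy ".*_" removes everything up to and including the last underscore
sub(".*_", "", b)
# "male" "female" "hsgrad" "collgrad"

# anchoring on the underscore and the end of the string keeps
# "male" from also matching "female"
grepl("_male$", b)
# TRUE FALSE FALSE FALSE

# hypothetical lookup: standardized name -> pattern that identifies it
patterns <- c(male = "_male$", female = "_female$")
standard <- rep(NA_character_, length(b))
for (nm in names(patterns)) {
  standard[grepl(patterns[[nm]], b)] <- nm
}
standard
# "male" "female" NA NA
```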

Using %in% operator in R for categorical variables

I am trying to use the %in% operator in R to write an equivalent of the SAS code below:
If weather in (2,5) then new_weather=25;
else if weather in (1,3,4,7) then new_weather=14;
else new_weather=weather;
The SAS code produces a variable "new_weather" with the value 25 or 14 where a rule applies, and the original value of "weather" otherwise.
R code:
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[newcol] = df[col]
df[df[newcol] %in% c(2,5)]= 25
df[df[newcol] %in% c(1,3,4,7)] = 14
return(df)
}
Result: the output values of "col" and "newcol" are the same when I pass a data frame through the function "GS". The syntax does not seem to pick up the replacement values for "newcol". I would appreciate an explanation of the reason and a possible fix.

Is this what you are trying to do?
df <- data.frame(A=seq(1:4), B=seq(1:4))
add_and_adjust <- function(df, copy_column, new_column_name) {
df[new_column_name] <- df[copy_column] # make copy of column
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(2,5), 25, df[,new_column_name])
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(1,3,4,7), 14, df[,new_column_name])
return(df)
}
Usage:
add_and_adjust(df, 'B', 'my_new_column')

df[newcol] is a data frame (with one column), df[[newcol]] or df[, newcol] is a vector (just the column). You need to use [[ here.
You also need to be assigning the result to df[[newcol]], not to the whole df. And to be perfectly consistent and safe you should probably test the col values, not the newcol values.
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[[newcol]] = df[[col]]
df[[newcol]][df[[col]] %in% c(2,5)] = 25
df[[newcol]][df[[col]] %in% c(1,3,4,7)] = 14
return(df)
}
GS(data.frame(x = 1:7), "x", "new")
# x new
# 1 1 14
# 2 2 25
# 3 3 14
# 4 4 14
# 5 5 25
# 6 6 6
# 7 7 14
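For comparison, the [ versus [[ difference can be checked directly, and the SAS if / else if / else chain can also be written as a nested ifelse(). This is just a sketch on a made-up data frame with a weather column:

```r
df <- data.frame(weather = 1:7)

# single brackets keep the data frame, double brackets extract the column
is.data.frame(df["weather"])  # TRUE
is.numeric(df[["weather"]])   # TRUE

# nested ifelse() reads like the SAS if / else if / else chain;
# every test looks at the original column, so the order of the rules
# cannot change the outcome
df$new_weather <- ifelse(df$weather %in% c(2, 5), 25,
                  ifelse(df$weather %in% c(1, 3, 4, 7), 14, df$weather))
df$new_weather
# 14 25 14 14 25  6 14
```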

@user9231640, before you invest too much time in writing your own function, you may want to explore some of the recode functions that already exist in packages like car and Hmisc.
Depending on how complex your recoding gets, your function will get longer and longer as it checks various boundary conditions or changes data types.
Based just on your example, you can do this in base R, and it will be more self-documenting and transparent at one level:
df <- data.frame(A=seq(1:30), B=seq(1:30))
df$my_new_column <- df$B
df$my_new_column <- ifelse(df$my_new_column %in% c(2,5), 25, df$my_new_column)
df$my_new_column <- ifelse(df$my_new_column %in% c(1,3,4,7), 14, df$my_new_column)

R: go through a sequence of numbers and choose the highest one

I am scraping data from a website, and in this context, data tidying gets kind of hard.
What I have right now is a vector of numbers that form repeated sequences, let's say
a<-c(1,2,3,1,2,3,4,5,1,2,3,4)
The first value that I'm looking for is 3, the second one is 5, and the third one will be 4.
So basically, I want to go through the sequence 1:5 and choose the highest value, to have the final output as
a<-c(3,4,5)
I thought about choosing the maximum values, such as
a<-sort(a, decreasing = T)
a<-a[1:3]
But this won't work, because the final product is:
[1] 5 4 4
where the smaller values are left out. Any ideas on whether this is possible?

Not entirely sure if this is what you're asking for. I think what you want is to see which of your search values occur in your vector.
Try this:
a<-c(1,2,3,1,2,3,4,5,1,2,3,4)
search_values = 3:5
# unique values
search_values = a[a %in% search_values]
unique(search_values)
# counts of values
table(search_values)

sort(unlist(lapply(split(a, cumsum(c(1, diff(a)) != 1)), max), use.names = FALSE))
#[1] 3 4 5
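The one-liner above can be unpacked step by step. The intermediate names (new_run, run_id, run_max) are only for illustration; the idea is to start a new group whenever the value does not increase by exactly 1, then take the maximum of each group:

```r
a <- c(1, 2, 3, 1, 2, 3, 4, 5, 1, 2, 3, 4)

# TRUE wherever the value does not continue the previous run by +1
new_run <- c(1, diff(a)) != 1

# running count of run starts gives a group id per element
run_id <- cumsum(new_run)
run_id
# 0 0 0 1 1 1 1 1 2 2 2 2

# maximum of each run, in order of appearance
run_max <- unname(vapply(split(a, run_id), max, numeric(1)))
run_max
# 3 5 4

sort(run_max)
# 3 4 5

# same values without grouping: keep each element that is followed
# by a drop, plus the last element
a[c(diff(a) < 0, TRUE)]
# 3 5 4
```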

Sounds like you want something like this?
a <- c(1,2,3,1,2,3,4,5,1,2,3,4) # Data input
a <- unique(a) # Keep unique numbers
a <- sort(a, decreasing = FALSE) # Sort ascending
tail(a, 3) # Last three numbers in set
Gives:
[1] 3 4 5
In one line:
tail(sort(unique(a), decreasing = FALSE), 3)

Find similar strings and reconcile them within one dataframe

Another question for me as a beginner. Consider this example here:
n = c(2, 3, 5)
s = c("ABBA", "ABA", "STING")
b = c(TRUE, "STING", "STRING")
df = data.frame(n,s,b)
n s b
1 2 ABBA TRUE
2 3 ABA STING
3 5 STING STRING
How can I search within this dataframe for similar strings, i.e. ABBA and ABA as well as STING and STRING and make them the same (doesn't matter whether ABBA or ABA, either fine) that would not require me knowing any variations? My actual data.frame is very big so that it would not be possible to know all the different variations.
I would want something like this returned:
> n = c(2, 3, 5)
> s = c("ABBA", "ABBA", "STING")
> b = c(TRUE, "STING", "STING")
> df = data.frame(n,s,b)
> print(df)
n s b
1 2 ABBA TRUE
2 3 ABBA STING
3 5 STING STING
I have looked around for agrep, or stringdist, but those refer to two data.frames or are able to name the column which I can't since I have many of those.
Anyone an idea? Many thanks!
Best regards,
Steffi

This worked for me, but there might be a better solution.
The idea is to use a recursive function, special, built on agrepl, which is the logical version of approximate grep (https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/agrep). Note that agrepl lets you specify the 'error tolerance' used to group similar strings. Using agrepl, I split the rows with similar strings off into x, set their s column to the first-occurring string, and add a grouping variable grp. The remaining rows, which were not included in the ith group, are stored in y and passed recursively through the function until y is empty.
You need the dplyr package: install.packages("dplyr")
library(dplyr)
desired <- NULL
grp <- 1
special <- function(x, y, grp) {
if (nrow(y) < 1) { # if y is empty return data
return(x)
} else {
similar <- agrepl(y$s[1], y$s) # find similar occurring strings
x <- rbind(x, y[similar,] %>% mutate(s=head(s,1)) %>% mutate(grp=grp))
y <- setdiff(y, y[similar,])
special(x, y, grp+1)
}
}
desired <- special(desired,df,grp)
To change the stringency of string similarity, change max.distance like agrepl(x,y,max.distance=0.5)
Output
n s b grp
1 2 ABBA TRUE 1
2 3 ABBA STING 1
3 5 STING STRING 2
To remove the grouping variable
withoutgrp <- desired %>% select(-grp)
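Since the question mentions many columns, the same grouping idea can also be sketched in base R as a function that works on a single character vector and is then applied to each column of interest. The name reconcile is hypothetical, and the result depends on agrepl's approximate (substring) matching, so treat it as a starting point rather than a finished solution:

```r
df <- data.frame(
  n = c(2, 3, 5),
  s = c("ABBA", "ABA", "STING"),
  b = c("TRUE", "STING", "STRING"),
  stringsAsFactors = FALSE
)

# hypothetical helper: walk through the vector and overwrite every
# not-yet-assigned value that approximately matches the current one
reconcile <- function(x, max.distance = 0.1) {
  x <- as.character(x)
  done <- rep(FALSE, length(x))
  for (i in seq_along(x)) {
    if (done[i] || is.na(x[i])) next
    hit <- !done & !is.na(x) &
      agrepl(x[i], x, max.distance = max.distance, fixed = TRUE)
    x[hit] <- x[i]    # reconcile to the first-occurring spelling
    done[hit] <- TRUE
  }
  x
}

# apply to every column that should be reconciled
cols <- c("s", "b")
df[cols] <- lapply(df[cols], reconcile)
df
```

Each column is reconciled independently here, so a spelling chosen in s does not influence b.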
