Recode with a variable number of cases in R - r

I am creating a function that takes a list of user-specified words and then labels them as a number depending on the order of the number in the list. The user can specify different list lengths.
For example:
myNotableWords<-c("No_IM","IM","LGD","HGD","T1a")
aa<-c("No_IM","IM","No_IM","HGD","T1a","HGD","T1a","IM","LGD")
aa<-data.frame(aa,stringsAsFactors=FALSE)
Intended Output
new<-(1,2,1,4,5,4,5,2,3)
Is there a way of maybe getting the index of the original list and then looking up where the each element of the target list is in that index and replacing it with the index number?

Why not just use the factor functionality of R?
A "factor data type" stores an integer that references a "level" (= character string) via the index number:
myNotableWords<-c("No_IM","IM","LGD","HGD","T1a")
aa<-c("No_IM","IM","No_IM","HGD","T1a","HGD","T1a","IM","LGD")
aa <- as.integer(factor(aa, myNotableWords, ordered = TRUE))
aa
# [1] 1 2 1 4 5 4 5 2 3

new <- c()
for (item in aa) {
new <- c(new, which(myNotableWords == item))
}
print(new)
#[1] 1 2 1 4 5 4 5 2 3

You can do this using data.frame; the syntax shouldn't change. I prefer using data.table though.
library(data.table)
myWords <- c("No_IM","IM","LGD","HGD","T1a")
myIndex <- data.table(keywords = myWords, word_index = seq(1, length(myWords)))
The third line simply adds an index to the vector myWords.
aa <- data.table(keywords = c("No_IM","IM","No_IM","HGD","T1a",
"HGD","T1a","IM","LGD"))
aa <- merge(aa, myIndex, by = "keywords", all.x = TRUE)
And now you have a table that shows the keyword and its unique number.

Related

Change the value of a low frequency column to a desired value

In my data below, I want to replace any value in a column (excluding the first column) that occurs less than two times (ex. 'greek' in column L1, and 'german' in column L2) to "others".
I have tried the following, but don't get the desired output. Is there a short and efficient way to do this in R?
data <- data.frame(study=c('a','a','b','c','c','d'),
L1= c('arabic','turkish','greek','arabic','turkish','turkish'),
L2= c(rep('english',5),'german'))
# I tried the following without success:
dd[-1] <- lapply(names(dd)[-1], function(i) ifelse(table(dd[[i]]) < 2,"others",dd[[i]]))
forcats has specific function for this:
dd = data
dd[-1] = lapply(dd[-1], forcats::fct_lump_min, min = 2, other_level = "others")
dd
# study L1 L2
# 1 a arabic english
# 2 a turkish english
# 3 b others english
# 4 c arabic english
# 5 c turkish english
# 6 d turkish others
Your approach fails because ifelse() returns a vector the same length as the test, which in your case is the table, but the way you are using it you are assigning to the whole column so it needs to return something the same length as the whole column.
We can fix it like this:
dd[-1] <- lapply(names(dd)[-1], function(i) {
tt = table(dd[[i]])
drop = names(tt)[tt <= 2]
ifelse(dd[[i]] %in% drop, "others", dd[[i]])
})

How to assign the column name to the variable dynamically

I am currently developing an application and I need to loop through the columns of the data frame. For instance, if the data frame has the columns
char_set <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(char_set) <- c("a","b","c","d")
If the input is given as "a", then the column name "b" should be assigned to the variable, say promote.
It throws an error Error in[.data.frame(char_set, i + 1) : undefined columns selected. Is there any solution?
char_name <- "a"
char_set <- data.frame(character(),character(),character(),character(),stringsAsFactors = FALSE)
names(char_set) <- c("a","b","c","d")
for (i in 1:ncol(char_set)) {
promote <- ifelse(names(char_set) == char_name,char_set[i+1], "-")
print(promote)
}
Thanks in advance!!!
This is actually quite interesting. I would suggest doing something on those lines:
char_name <- "a"
char_set <- data.frame(
a = 1:2,
b = 3:4,
c = 5:6,
d = 8:9,
stringsAsFactors = FALSE
)
res_dta <- data.frame(matrix(nrow = 2, ncol = 3))
for (i in wrapr::seqi(1, NCOL(char_set) - 1)) {
print(i)
if (names(char_set)[i] == char_name) {
res_dta[i] <- char_set[i + 1]
} else {
res_dta[i] <- char_set[i]
}
}
Results
char_set
a b c d
1 1 3 5 8
2 2 4 6 9
res_dta
X1 X2 X3
1 3 3 5
2 4 4 6
There are few generic points:
When you are looping through columns be mindful not fall outside data frame dimensions; running i + 1 on i = 4 will give you column 5 which will return an error for data frame with four columns. You may then decide to run to one column less or break for a specific i value
Not sure if I got your request right, for column names a you want to take values of column b; then column b stays as it was?
Broadly speaking, I'm of a view that this names(char_set)[i] == char_name requires more thought but you have a start with this answer. Updating your post with desired results would help to design a solution.
The problem in your code is that you are looping from 1 to the number of columns of the char_set df, then you are calling the variable char_set[i+1].
This, when the i index takes the maximum value, the instruction char_set[i+1] returns an error because there is no element with that index.
You can try with this solution:
char_name<-"a"
promote<-ifelse((which(names(char_set)==char_name)+1)<ncol(char_set),names(char_set)[which(names(char_set)==char_name)+1],"-")
promote
> [1] "b"
char_name<-"d"
promote<-ifelse((which(names(char_set)==char_name)+1)<ncol(char_set),names(char_set)[which(names(char_set)==char_name)+1],"-")
promote
> [1] "-"
However. when the variable char_name takes the value a, the variable promote will take the value that the set char_set has at the position after the element named a, which matches char_name.
I suggest you to think about the case in which the variable char_name takes the value d and you don't have any values in the char_set after d.

Find similar strings and reconcile them within one dataframe

Another question for me as a beginner. Consider this example here:
n = c(2, 3, 5)
s = c("ABBA", "ABA", "STING")
b = c(TRUE, "STING", "STRING")
df = data.frame(n,s,b)
n s b
1 2 ABBA TRUE
2 3 ABA STING
3 5 STING STRING
How can I search within this dataframe for similar strings, i.e. ABBA and ABA as well as STING and STRING and make them the same (doesn't matter whether ABBA or ABA, either fine) that would not require me knowing any variations? My actual data.frame is very big so that it would not be possible to know all the different variations.
I would want something like this returned:
> n = c(2, 3, 5)
> s = c("ABBA", "ABBA", "STING")
> b = c(TRUE, "STING", "STING")
> df = data.frame(n,s,b)
> print(df)
n s b
1 2 ABBA TRUE
2 3 ABBA STING
3 5 STING STING
I have looked around for agrep, or stringdist, but those refer to two data.frames or are able to name the column which I can't since I have many of those.
Anyone an idea? Many thanks!
Best regards,
Steffi
This worked for me but there might be a better solution
The idea is to use a recursive function, special, that uses agrepl, which is the logical version of approximate grep, https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/agrep. Note that you can specify the 'error tolerance' to group similar strings with agrep. Using agrepl, I split off rows with similar strings into x, mutate the s column to the first-occurring string, and then add a grouping variable grp. The remaining rows that were not included in the ith group are stored in y and recursively passed through the function until y is empty.
You need the dplyr package, install.packages("dplyr")
library(dplyr)
desired <- NULL
grp <- 1
special <- function(x, y, grp) {
if (nrow(y) < 1) { # if y is empty return data
return(x)
} else {
similar <- agrepl(y$s[1], y$s) # find similar occurring strings
x <- rbind(x, y[similar,] %>% mutate(s=head(s,1)) %>% mutate(grp=grp))
y <- setdiff(y, y[similar,])
special(x, y, grp+1)
}
}
desired <- special(desired,df,grp)
To change the stringency of string similarity, change max.distance like agrepl(x,y,max.distance=0.5)
Output
n s b grp
1 2 ABBA TRUE 1
2 3 ABBA STING 1
3 5 STING STRING 2
To remove the grouping variable
withoutgrp <- desired %>% select(-grp)

Multiple one-to-many matching between vectors in R

I want to update a dataframe with values from a table of new values where there is a one-to-many relationship between the dataframe and table of new values. This code illustrates the intent:
df = data.frame(x=rep(letters[1:4],5,rep=T), y=1:20)
and new values..
eds = data.frame(x=c('c','d'), val=c(101, 102))
For a one-to-one relationship the following should work:
df$x[match(eds$x, df$x)] = eds$x[match(df$x, eds$x)]
But match only works with first match, so this throws the error number of items to replace is not a multiple of replacement length. Grateful for any tips on the most efficient way to approach this. I'm guessing some sapply wrapper but I can't think of the method.
Thanks in advance.
tmp <- eds$val[match(df$x, eds$x)] # Matching indices (with NAs for no match)
df$y <- ifelse(is.na(tmp), df$y, tmp) # Values at matches (leaving alone for NAs)
head(df, 5)
# x y
# 1 a 1
# 2 b 2
# 3 c 101
# 4 d 102
# 5 a 5
Not that this not a very robust solution. It depends on your exact data structure here (repeating 'c', 'd' pattern) but it works for this case:
df[df[["x"]] %in% eds[["x"]], "y"] = eds[[2]]

Does column exist and how to rearrange columns in R data frame

How do I add a column in the middle of an R data frame? I want to see if I have a column named "LastName" and then add it as the third column if it does not already exist.
One approach is to just add the column to the end of the data frame, and then use subsetting to move it into the desired position:
d$LastName <- c("Flim", "Flom", "Flam")
bar <- d[c("x", "y", "Lastname", "fac")]
1) Testing for existence: Use %in% on the colnames, e.g.
> example(data.frame) # to get 'd'
> "fac" %in% colnames(d)
[1] TRUE
> "bar" %in% colnames(d)
[1] FALSE
2) You essentially have to create a new data.frame from the first half of the old, your new column, and the second half:
> bar <- data.frame(d[1:3,1:2], LastName=c("Flim", "Flom", "Flam"), fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim C
2 1 2 Flom A
3 1 3 Flam A
>
Of the many silly little helper functions I've written, this gets used every time I load R. It just makes a list of the column names and indices but I use it constantly.
##creates an object from a data.frame listing the column names and location
namesind=function(df){
temp1=names(df)
temp2=seq(1,length(temp1))
temp3=data.frame(temp1,temp2)
names(temp3)=c("VAR","COL")
return(temp3)
rm(temp1,temp2,temp3)
}
ni <- namesind
Use ni to see your column numbers. (ni is just an alias for namesind, I never use namesind but thought it was a better name originally) Then if you want insert your column in say, position 12, and your data.frame is named bob with 20 columns, it would be
bob2 <- data.frame(bob[,1:11],newcolumn, bob[,12:20]
though I liked the add at the end and rearrange answer from Hadley as well.
Dirk Eddelbuettel's answer works, but you don't need to indicate row numbers or specify entries in the lastname column. This code should do it for a data frame named df:
if(!("LastName" %in% names(df))){
df <- cbind(df[1:2],LastName=NA,df[3:length(df)])
}
(this defaults LastName to NA, but you could just as easily use "LastName='Smith'")
or using cbind:
> example(data.frame) # to get 'd'
> bar <- cbind(d[1:3,1:2],LastName=c("Flim", "Flom", "Flam"),fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim A
2 1 2 Flom B
3 1 3 Flam B
I always thought something like append() [though unfortunate the name is] should be a generic function
## redefine append() as generic function
append.default <- append
append <- `body<-`(args(append),value=quote(UseMethod("append")))
append.data.frame <- function(x,values,after=length(x))
`row.names<-`(data.frame(append.default(x,values,after)),
row.names(x))
## apply the function
d <- (if( !"LastName" %in% names(d) )
append(d,values=list(LastName=c("Flim","Flom","Flam")),after=2) else d)

Resources