How can I erase duplicate data from my dataframe - r

My code so far looks like this. I have been trying to eliminate the letters that appear in both a new and an old vector. The letters represent emails. I have tried using the unique and distinct functions, but they keep one of the duplicate values when I need to remove all of them. This is the vector I would like as a result:
c(b,c,e,f,t,r,w,u,p,q)
library(dplyr) # provides transmute()
new <- c("a","b","c","d","e","f","t")
old <- c("r","w","u","a","d","p","q")
num <- c(1:7)
df_new <- data.frame(num, new)
df_old <- data.frame(num, old)
df_new <- transmute(df_new, num, emails = new)
df_old <- transmute(df_old, num, emails = old)
all_emails <- merge(df_new, df_old, all = TRUE)

From what you show, you are complicating things unnecessarily by putting them in a data frame. Try this:
new <- c("a","b","c","d","e","f","t")
old <- c("r","w","u","a","d","p","q")
x = c(new, old)
result = x[!duplicated(x) & !duplicated(x, fromLast = TRUE)]
result
# [1] "b" "c" "e" "f" "t" "r" "w" "u" "p" "q"
Another method, if both your vectors are individually unique and you just need to drop everything that is in both new and old:
result = setdiff(union(new, old), intersect(new, old))
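As a quick check of the caveat above, here is a small sketch comparing the two approaches. The helper names keep_singletons and sym_diff are made up for this illustration, and new2/old2 are hypothetical inputs chosen so that one vector repeats a value internally:

```r
# Hypothetical helper names, used only for this comparison
keep_singletons <- function(x) x[!duplicated(x) & !duplicated(x, fromLast = TRUE)]
sym_diff <- function(a, b) setdiff(union(a, b), intersect(a, b))

new <- c("a", "b", "c", "d", "e", "f", "t")
old <- c("r", "w", "u", "a", "d", "p", "q")

# Both vectors are individually unique, so the two methods agree
same_result <- identical(keep_singletons(c(new, old)), sym_diff(new, old))
same_result
# [1] TRUE

# Made-up input where "a" repeats inside one vector only
new2 <- c("a", "a", "b")
old2 <- c("c")
keep_singletons(c(new2, old2)) # drops every repeated value: "b" "c"
sym_diff(new2, old2)           # only drops values shared by both: "a" "b" "c"
```

So the duplicated() version is the one to use if a value repeated within a single vector should also disappear.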

Related

create new dataframes from a master database in R

I have a database of different notifiable diseases.
I want to extract a dataframe for each disease in that database so that I can make an automated report from a template in R Markdown.
I created a function for creating the dataframe.
NMC is the master database.
The database lists all conditions reported
I created a list of those conditions
conditions <- list(unique(NMC$Condition))
I then created a function to create a new dataframe based on the condition
newdf <- function(data, var){
var <- data %>% filter(data$Condition %in% paste0(var))
var
}
Now I want to run my function to create a number of new dataframes from the master database. I thought of doing a for loop:
for (df in conditions){
df <- newdf(NMC, "df")
}
Which runs but doesn't give me anything.
So I found split(), but this hasn't perfectly solved my problem as I still need to type out all the conditions to get each df to apply to the r template.
NMC <- split(NMC, factor(NMC$Condition), drop= FALSE)
# then to get a specific df (which is laborious)
rubella <- NMC$congenitalrubellasyndrome
# How can I get the dataframes per condition into my environment, or access them easily, maybe with the %>% function?
My end goal is to then apply an R template to each data frame so that I have a standard epicurve/descriptive stats for each disease.
Thanks
> df <- data.frame(a = rep(letters[1:10], each = 3), x = 1:30)
> for (i in df$a) {
+ assign(i, df[df$a == i, ])
+ }
> ls()
[1] "a" "b" "c" "d" "df" "e" "f" "g" "h" "i" "j"
> a
  a x
1 a 1
2 a 2
3 a 3
But see my comment above.
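If the goal is only to run the same report on each condition, it may be simpler to keep the pieces in the named list that split() returns and loop over that list, rather than creating one object per condition. A minimal sketch, assuming a made-up stand-in for NMC with a hypothetical Cases column; the summary inside lapply() is just a placeholder for whatever the report template needs:

```r
# Made-up stand-in for the master database (the Cases column is hypothetical)
NMC <- data.frame(
  Condition = c("measles", "measles", "rubella", "pertussis", "rubella"),
  Cases = c(3, 5, 1, 2, 4),
  stringsAsFactors = FALSE
)

# One data frame per condition, kept together in a named list
by_condition <- split(NMC, NMC$Condition)

# Apply the same step to every piece; replace the body with the real report code
summaries <- lapply(by_condition, function(d) {
  data.frame(Condition = d$Condition[1], n_rows = nrow(d), total_cases = sum(d$Cases))
})

names(by_condition)    # the condition names, no typing required
summaries[["rubella"]] # access one piece by name
```

Each element can be reached by name with [[ ]], so nothing has to be typed out by hand and the global environment stays uncluttered.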

Removing values, stored in character vector, from a list

I want to remove certain values from a list to create a refined list. I have all of the values I want to remove stored in a character vector named remove. The values in remove correspond to the first column of the list. I've run the following code:
refined_list = list
for (i in length(list)){
if (refined_list[i,1] %in% remove){
refined_list = refined_list[-i,]
}
else{
refined_list = refined_list
}
}
Only the initialization of refined_list seems to register. There are no errors, but refined_list is identical to list. It's a mystery to me.
It doesn't seem like you're actually talking about lists, since a list cannot be subset as you are proposing (i.e. list[, 1]). But if you're looking for a solution for a data.frame, here's one:
# Set up some test data
dd <- data.frame(letters = letters[1:10], stringsAsFactors = FALSE)
remove <- letters[c(1, 4, 6)]
# Shed values that are in remove
dd[!(dd[, 1] %in% remove), 1, drop = FALSE]
#> a one-column data frame holding "b" "c" "e" "g" "h" "i" "j"
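As for why the original loop appeared to do so little: for (i in length(x)) visits only the single value length(x), not 1 through length(x), and for a data frame length() is the number of columns. A small sketch with made-up data (the id and value columns are hypothetical) showing this, followed by the same vectorised filter keeping all columns:

```r
# Made-up data: the first column holds the keys to filter on
dd <- data.frame(id = c("a", "b", "c", "d"), value = 1:4, stringsAsFactors = FALSE)
remove <- c("a", "d")

# length() of a data frame counts columns, and the loop sees only that one number
visited <- c()
for (i in length(dd)) visited <- c(visited, i)
visited
# [1] 2

# Vectorised filter: keep the rows whose first column is not in remove
refined <- dd[!(dd[, 1] %in% remove), , drop = FALSE]
refined$id
# [1] "b" "c"
```

If a loop is really needed, seq_len(nrow(dd)) gives the row indices, but deleting rows while iterating shifts the remaining indices, which is another reason to prefer the vectorised form.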

Evaluate dataframe$column expression stored as a string value

Can a string of the form below be evaluated so that it is equivalent to the same "literal" expression?
Example data and code:
df.name = data.frame(col1 = 1:5, col2 = LETTERS[seq(1:5)], col3 = letters[seq(1:5)], stringsAsFactors = FALSE)
col.name = "col2"
row.num = "4"
var1 = str_c("df.name$", col.name,"[",row.num,"]")
> var1
[1] "df.name$col2[4]"
The literal works as expected
> df.name$col2[4]
[1] "D"
get() is not equivalent:
get(var1)
## Error in get(var1) : object 'df.name$col2[4]' not found
This form of get() "works" but does not solve the problem
get("df.name")$col2[4]
[1] "D"
Per other posts I've tried eval(parse()) and eval(parse(text())) without success.
I'm trying to create a function that will search (subset) df.name using the col.name passed to the function. I want to avoid writing a separate function for each column name, though that will work since I can code df.name$col2[row.num] as a "literal".
EDIT
The example code should have shown the row.num as type numeric / integer, i.e., row.num = 4
You are almost there:
> eval(parse(text = var1))
[1] "D"
Because parse() expects a file by default, you need to name the text argument explicitly.
Quoting the question: "I'm trying to create a function that will search (subset) df.name using the col.name passed to the function."
Set up data:
df.name = data.frame(col1 = 1:5, col2 = LETTERS[1:5], ## seq() is unnecessary
col3 = letters[1:5],
stringsAsFactors = FALSE)
col.name = "col2"
row.num = "4"
Solving your ultimate question (index the data frame by column name) rather than your proximal one (figure out how to use get()/eval() etc.): as @RichardScriven points out,
f <- function(col.name, row.num, data = df.name) {
  return(data[[col.name]][as.numeric(row.num)])
}
should work. It would probably be more idiomatic if you specified the row number as numeric rather than character, if possible ...
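A self-contained sketch of the same idea, using the example data from the question. The name get_cell is hypothetical; the point is that [[ accepts the column name as a string, so no eval(parse()) is needed:

```r
df.name <- data.frame(col1 = 1:5, col2 = LETTERS[1:5], col3 = letters[1:5],
                      stringsAsFactors = FALSE)

# Hypothetical helper: look up one cell by column name and row number
get_cell <- function(data, col.name, row.num) {
  data[[col.name]][as.numeric(row.num)] # as.numeric() tolerates "4" as well as 4
}

cell_chr <- get_cell(df.name, "col2", "4")
cell_num <- get_cell(df.name, "col1", 4)
cell_chr
# [1] "D"
cell_num
# [1] 4
```

Passing the data frame as an argument, rather than relying on a global df.name, keeps the function reusable for other data frames.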

how to apply a function on every column of a data?

I asked a question and I received a great answer which solved my problem. However, I want to modify the code (here is my previous question).
finding similar strings in each row of two different data frame
I will try to explain the problem again, along with how I tried to deal with it.
The answer by Karsten W. gave me normalised data (each string in each element is assigned the number of its position) as follows (I did not change it):
normalize <- function(x, delim) {
x <- gsub(")", "", x, fixed=TRUE)
x <- gsub("(", "", x, fixed=TRUE)
idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
names <- unlist(strsplit(as.character(x), delim))
return(setNames(idx, names))
}
The second part was to apply the above function to each column separately, so if I need to do that on 1000 columns it is very time consuming. Instead of the commented-out lines below, I tried to use lapply.
# s1 <- normalize(df1[,1], ";")
# s2 <- normalize(df1[,2], ";")
I do it like this:
myS <- lapply(df1, normalize,";")
I keep the other part as it is
lookup <- normalize(df2[,1], ",")
Then, to check between the two, I modified the function to keep only the row numbers of df2 (I removed s[found] from it):
process <- function(s) {
lookup_try <- lookup[names(s)]
found <- which(!is.na(lookup_try))
pos <- lookup_try[names(s)[found]]
return(paste(pos, sep=""))
}
Then, whatever I do, I cannot get the output:
process(myS$sample1) ...
At the end I need to have the data in a txt file or something which I can read. I used write.table but this does not work.
Is there any better way to do this? How to do it automatically?
It is a typo: use process(myS$sample_1) instead of process(myS$sample1).
I get:
> process(myS$sample_1)
[1] "4" "1" "4"
and
> lapply(myS, process)
$sample_1
[1] "4" "1" "4"
$sample_2
[1] "4" "15" "16"
IMHO for the function process() it would be better to return an integer vector:
process <- function(s) {
lookup_try <- lookup[names(s)]
found <- which(!is.na(lookup_try))
pos <- lookup_try[names(s)[found]]
names(pos) <- NULL
pos
}
For putting the result in a dataframe:
r <- lapply(myS, process)
m <- max(sapply(r, length))
r.matrix <- matrix(NA, m, length(r))
for (j in seq_along(r)) {
x <- r[[j]]
length(x) <- m
r.matrix[,j] <- x
}
colnames(r.matrix) <- names(r)
r.df <- as.data.frame(r.matrix)
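For the last part of the question (getting the result into a text file), here is a minimal sketch under a few assumptions: r is replaced by a made-up list shaped like the output of lapply(myS, process), with one extra hypothetical value so the vectors have unequal lengths, and the file is written to a temporary path rather than a real one:

```r
# Made-up stand-in for lapply(myS, process); the fourth value in sample_2 is hypothetical
r <- list(sample_1 = c(4L, 1L, 4L), sample_2 = c(4L, 15L, 16L, 2L))

# Pad every vector with NA up to the longest length so they fit in one data frame
m <- max(lengths(r))
r.df <- as.data.frame(lapply(r, function(x) {
  length(x) <- m
  x
}))

# Write a tab-separated text file, then read it back to confirm the round trip
out <- tempfile(fileext = ".txt")
write.table(r.df, out, sep = "\t", quote = FALSE, row.names = FALSE)
back <- read.table(out, header = TRUE, sep = "\t")
back
```

write.table() needs a rectangular object, which is why the padding step comes first; passing the ragged list directly is a likely reason it "does not work".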

Maintaining the order of a vector when applying it to setNames of a list

The following dataframe:
df <- data.frame(matrix(rnorm(9*9), ncol=9))
names(df) <- c("c_1", "d_1", "e_1", "a_p", "b_p", "c_p", "1_o1", "2_o1", "3_o1")
row.names(df) <- names(df)
...is split by rownames according to the common indices found after "_", and I release the dataframes from the list into the global environment:
list_all <- split(df,sub(".+_","",rownames(df)))
list2env(list_all,envir=.GlobalEnv)
Many of my dataframes now have numeric names and cannot be addressed easily, so I want to change their names. I'd like to add "df_" to every name, but since I don't know how to do that, I was told make.names could be useful. I create a vector of all unique indices and convert it to a factor, which I think maintains the original order of the indices:
indx <- gsub(".*_", "", names(df))
indx1 <- factor(indx, levels=unique(indx))
new.names <- make.names(unique(indx1))
new.names
[1] "X1" "p" "o1"
new.names is in the order I want it to be. I apply the new names to the list and release it into the environment:
list_all <- setNames(list_all, new.names)
list2env(list_all,envir=.GlobalEnv)
Now the numeric names have a leading X added (nice!), but the order of the dataframes has changed and the names have been wrongly assigned (dataframe p now contains all rows with "o1" and vice versa).
Questions:
Is there an easy way to add strings to the names of objects of the same class in a workspace?
If I am going to do it via the make.names route, how can I make absolutely sure that the elements of list_all are named in the same order as in new.names?
Thank you!
Why not simply use this, right after creating list_all:
names(list_all) = paste0("df_", names(list_all))
list2env(list_all,envir=.GlobalEnv)
#> df_1
# c_1 d_1 e_1 a_p b_p c_p 1_o1 2_o1 3_o1
#c_1 1.10388982 -0.2329471 -0.3330288 -2.0477186 -1.4576052 1.5411154 -0.9529714 0.289516457 -0.01017546
#d_1 -1.02420662 -0.1002591 -0.7884373 1.5021531 0.3551084 0.7755127 0.7679464 -0.002950944 -0.69849456
#e_1 -0.02004774 -0.1873947 -0.3674220 0.7321503 0.9076226 -0.4997974 -0.2915408 -1.376529597 -1.43563284
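The mismatch in the question appears to come from split() ordering its pieces by sorted factor levels ("1", "o1", "p"), while new.names was built in order of first appearance ("X1", "p", "o1"). A minimal sketch of the make.names route that derives the new names from names(list_all) itself, so the positions cannot drift apart (the set.seed value is arbitrary):

```r
set.seed(1) # arbitrary seed, only so the random data is reproducible
df <- data.frame(matrix(rnorm(9 * 9), ncol = 9))
names(df) <- c("c_1", "d_1", "e_1", "a_p", "b_p", "c_p", "1_o1", "2_o1", "3_o1")
row.names(df) <- names(df)

list_all <- split(df, sub(".+_", "", rownames(df)))
original_order <- names(list_all) # sorted by split(): "1" "o1" "p"

# Rename in place, element by element, instead of supplying a separately built vector
names(list_all) <- make.names(names(list_all))
names(list_all)

rownames(list_all$p) # still the "_p" rows
```

The same reasoning applies to the paste0("df_", names(list_all)) line above: because the new names are computed from the existing ones, each data frame keeps its own label.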
Here's a function that I think does what you want:
# dummy data:
x <- numeric(0)
y <- numeric(0)
z <- numeric(0)
df1 <- data.frame()
df2 <- data.frame()
df3 <- data.frame()
df4 <- data.frame()
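renameObjects <- function(env = .GlobalEnv, class, pfx) {
  objs <- ls(envir = env) # names of all objects in env
  # TRUE for objects that inherit from the requested class
  classes <- vapply(objs, function(x) inherits(get(x, envir = env), class), logical(1))
  for (obj in objs[classes]) {
    # copy each matching object to its prefixed name
    assign(paste0(pfx, obj), get(obj, envir = env), envir = env)
  }
  rm(list = objs[classes], envir = env) # drop the old names
}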
# run the function
renameObjects(class='data.frame', pfx = 'my_prefix_')
Results
> ls()
[1] "df1" "df2" "df3" "df4"
[5] "renameObjects" "x" "y" "z"
> renameObjects(class='data.frame', pfx = 'my_prefix_')
> ls()
[1] "my_prefix_df1" "my_prefix_df2" "my_prefix_df3" "my_prefix_df4"
[5] "renameObjects" "x" "y" "z"
