R remove duplicate data from each column - r

I get CSVs with hundreds of different columns and would like to output a new file with the duplicate values removed from each column. Everything that I have seen and tried works on one specific column. I just need each column to contain only its unique values.
For example, my data:
df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
df
A B C
1 1 1 Mr.
2 2 0 Mr.
3 3 1 Mrs.
4 4 0 Miss
5 5 0 Mr.
6 6 1 Mrs.
I would like:
  A B    C
1 1 1  Mr.
2 2 0 Mrs.
3 3   Miss
4 4
5 5
6 6
Then I can:
write.csv(df, file = file.path(df, "df_No_Dupes.csv"), na="")
So I can use it as a reference for my next task.

read.csv and write.csv work best with tabular data. Your desired output is not a good example of this (not every row has the same number of values).
You can easily get all the unique values for each of your columns with
vals <- sapply(df, unique)
Because the columns have different numbers of unique values, vals is a list here. You'd be better off saving this object with save() (and restoring it with load()) to preserve the list as an R object.
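If a CSV is still needed, one possible sketch in base R is to pad each vector of unique values with NA up to a common length and write the NAs as empty cells. The names vals, n_max, padded, df_unique and out_file are made up for this illustration, and the file is written to a temporary directory as an assumption.

```r
df <- data.frame(A = c(1, 2, 3, 4, 5, 6),
                 B = c(1, 0, 1, 0, 0, 1),
                 C = c("Mr.", "Mr.", "Mrs.", "Miss", "Mr.", "Mrs."),
                 stringsAsFactors = FALSE)

# Unique values per column; lapply always returns a list
vals <- lapply(df, unique)

# Pad every vector with NA up to the length of the longest one
n_max <- max(lengths(vals))
padded <- lapply(vals, function(v) {
  length(v) <- n_max  # extending a vector fills the new slots with NA
  v
})

# Equal-length vectors can be combined into a data frame again
df_unique <- as.data.frame(padded, stringsAsFactors = FALSE)

# na = "" writes the NA padding as empty cells (hypothetical output location)
out_file <- file.path(tempdir(), "df_No_Dupes.csv")
write.csv(df_unique, out_file, na = "", row.names = FALSE)
```

Column names are kept automatically because the list returned by lapply is named after the original columns.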

Code snippet that works with any number of columns, removes duplicate values from each column, and preserves column names:
require(rowr)
df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
#get the number of columns in the dataframe
n <- ncol(df)
#loop through the columns
for(i in 1:ncol(df)){
#replicate column i without duplicates, fill blanks with NAs
df <- cbind.fill(df,unique(df[,1]), fill = NA)
#rename the new column
colnames(df)[n+1] <- colnames(df)[1]
#delete the old column
df[,1] <- NULL
}

df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
for(i in 1:ncol(df)){
assign(paste("df_",i,sep=""), unique(df[,i]))
}
require(rowr)
df <- cbind.fill(df_1,df_2,df_3, fill = NA)
V1 V1 V1
1 1 1 Mr.
2 2 0 Mrs.
3 3 NA Miss
4 4 NA <NA>
5 5 NA <NA>
6 6 NA <NA>
or you could do
require(rowr)
df <- cbind.fill(df_1,df_2,df_3, fill = "")
df
  V1 V1   V1
1  1  1  Mr.
2  2  0 Mrs.
3  3    Miss
4  4
5  5
6  6
If you want to avoid typing the name of each intermediate dataframe you can just use ls(pattern="df_") and get the objects named in that vector or use another loop.
If you want to change the column names back to their original values you can use:
colnames(output_df) <- colnames(input_df)
Then you can save the results however you like, e.g.
saveRDS()
save()
or write it to a file.
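As a rough illustration of the saveRDS() option, the following sketch stores the list of unique values as a single R object and reads it back. The object names and the temporary file location are assumptions made for the example.

```r
df <- data.frame(A = c(1, 2, 3, 4, 5, 6),
                 B = c(1, 0, 1, 0, 0, 1),
                 C = c("Mr.", "Mr.", "Mrs.", "Miss", "Mr.", "Mrs."),
                 stringsAsFactors = FALSE)

# List of unique values, one element per column
vals <- lapply(df, unique)

# saveRDS() stores one object; readRDS() returns it under any name you choose
rds_file <- file.path(tempdir(), "unique_vals.rds")  # hypothetical path
saveRDS(vals, rds_file)
restored <- readRDS(rds_file)
```

Unlike the CSV route, this keeps the differing lengths and the column types without any padding.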
Putting it all together:
df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
for(i in 1:ncol(df)){
assign(paste("df_",i,sep=""), unique(df[,i]))
}
require(rowr)
files <- ls(pattern="df_")
df_output <- data.frame()
for(i in files){
df_output <- cbind.fill(df_output, get(i), fill = "")
}
df_output <- df_output[,2:4] # fix extra colname from initialization
colnames(df_output) <- colnames(df)
write.csv(df_output, "df_out.csv",row.names = F)
verify_it_worked <- read.csv("df_out.csv")
verify_it_worked
A B C
1 1 1 Mr.
2 2 0 Mrs.
3 3 Miss
4 4
5 5
6 6

Related

How to write a for loop to create multiple new variables in R?

Suppose I have this example dataset df with the following two variables.
dx_order1<-c(1, 1, NA, 1, 1)
dx_order2<-c(2, 2, 2, 2, NA)
Suppose that these variables are numeric.
I want to recode the variables. For the dx_order1 variable, I want to recode 1 as 1 and everything else as 0. Similarly, for the dx_order2 variable, I want to recode 2 as 1 and everything else as 0. Say that the new variables are called diag_order1 and diag_order2.
I know how to do this one by one in a manual fashion. The code below will do the job:
df$diag_order1 <- ifelse(is.na(df$dx_order1), 0, 1)
df$diag_order2 <- ifelse(is.na(df$dx_order2), 0, 1)
I was wondering how I can achieve the same outcome with a for loop. If I have a lot of similar variables, this type of manual coding is not practical. So any advice on how to use a loop to speed up the process would be appreciated.
You don't need a loop in this instance; you can do this by setting the non-NA values to 1 and the NA values to 0 using is.na. For example:
Data
df <- data.frame(dx_order1 = c(1,1, NA, 1, 1),
dx_order2 = c(2, 2, 2, 2, NA))
df[!is.na(df)] <- 1
df[is.na(df)] <- 0
Or, if other columns also contain NA but you only want to recode certain columns, you can do it by specifying those columns:
df2 <- data.frame(letter_col = c(NA, letters[1:4]),
dx_order1 = c(1,1, NA, 1, 1),
dx_order2 = c(2, 2, 2, 2, NA))
# any columns starting with dx
cols <- names(df2)[grepl("^dx", names(df2))]
df2[, cols][!is.na(df2[, cols])] <- 1
df2[, cols][is.na(df2[, cols])] <- 0
You can use across with mutate in dplyr like this
library(dplyr)
library(stringr) # for str_extract()
df2 <- data.frame(letter_col = c(NA, letters[1:4]),
dx_order1 = c(1,1, NA, 1, 1),
dx_order2 = c(2, 2, 2, 2, NA))
> df2
letter_col dx_order1 dx_order2
1 <NA> 1 2
2 a 1 2
3 b NA 2
4 c 1 2
5 d 1 NA
df2 %>% mutate(across(starts_with("dx"), ~case_when(. == as.numeric(str_extract(cur_column(), "\\d$")) ~ 1,
is.na(.) ~ 0,
TRUE ~ 0), .names = "diag_{.col}"))
letter_col dx_order1 dx_order2 diag_dx_order1 diag_dx_order2
1 <NA> 1 2 1 1
2 a 1 2 1 1
3 b NA 2 0 1
4 c 1 2 1 1
5 d 1 NA 1 0
This assumes that each dx column can contain the value matching its numeric suffix, NA, or other values, as described in your question; everything other than the suffix value is recoded to 0.
You can coerce the logical vector from is.na to integer. is.na works with the dataframe.
df <- data.frame(dx_order1 = c(1,1, NA, 1, 1),
dx_order2 = c(2, 2, 2, 2, NA))
df[] <- +!is.na(df)
df
# dx_order1 dx_order2
#1 1 1
#2 1 1
#3 0 1
#4 1 1
#5 1 0
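Since the question asks specifically for a for loop that creates new variables, here is a minimal base R sketch of that approach. It assumes the source columns are all named dx_order followed by the value to match, and that the new columns should be named diag_order with the same suffix; dx_cols, target and new_name are names invented for the example.

```r
df <- data.frame(dx_order1 = c(1, 1, NA, 1, 1),
                 dx_order2 = c(2, 2, 2, 2, NA))

# Columns to recode, selected by name prefix (assumed naming scheme)
dx_cols <- grep("^dx_order", names(df), value = TRUE)

for (col in dx_cols) {
  # Value to match is taken from the numeric suffix of the column name
  target <- as.numeric(sub("^dx_order", "", col))
  new_name <- sub("^dx_", "diag_", col)
  # 1 where the value equals the target, 0 for NA and anything else
  df[[new_name]] <- as.integer(!is.na(df[[col]]) & df[[col]] == target)
}
```

The original dx_ columns are left untouched, so the loop can be rerun safely.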

Convert NAs to a separate level in each variable using mutate()

I have 12 variables that contain NA values as well. I need to convert the NAs to a separate level. The level value differs for some variables. Following is the code:
Replace_NAs <- function(colindex, na_level){
cname <- colnames(tr[colindex])
tr <- tr %>% mutate(cname = as.factor(replace(cname, is.na(cname), na_level)))
return(tr)
}
for (i in 1:12) {
if(i == 5){
na_level <- 3;
tr <- Replace_NAs(i,na_level);
}
else if(i == 11){
na_level <- 5;
tr <- Replace_NAs(i,na_level);
}
else if(i == 4|6|8){
na_level <- 1;
tr <- Replace_NAs(i,na_level);
}
else{
na_level <- 20;
tr <- Replace_NAs(i,na_level);
}
}
Please help me. Thanks.
As Johan mentioned in the comments, you should include a reproducible example. Without that, we're left guessing at what exactly you want.
That said, here's my guess at what'll help you:
library(dplyr)
library(tidyr) # for replace_na()
df %>%
mutate_at(vars(5), ~ replace_na(., 3)) %>%
mutate_at(vars(11), ~ replace_na(., 5)) %>%
mutate_at(vars(4, 6, 8), ~ replace_na(., 1)) %>%
mutate_at(vars(-c(4, 5, 6, 8, 11)), ~ replace_na(., 20))
Again, please provide a reproducible example with data and your desired output. A more robust answer to your question would explore applying a list of intended switches to your dataframe, but that would be overkill here.
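A side note on the loop in the question: the condition i == 4|6|8 does not test whether i is one of 4, 6 or 8. Because == binds more tightly than |, R evaluates it as (i == 4) | 6 | 8, and the non-zero numbers count as TRUE. The short sketch below illustrates this; the variable names are made up for the example.

```r
i <- 2

# Parsed as (i == 4) | 6 | 8; 6 and 8 are coerced to TRUE,
# so the result is TRUE for every value of i
always_true <- i == 4 | 6 | 8

# %in% expresses the intended membership test
in_set <- i %in% c(4, 6, 8)
```

With the original condition, the final else branch of the loop can never be reached.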
Here's a way to do this using a for loop.
Consider this example :
tr <- data.frame(a = c(NA, 2, NA, 3), b = c(2, 3, NA, 4),
c = c(5, 6, NA, NA), d = c(1, 2, 3, NA))
tr
# a b c d
#1 NA 2 5 1
#2 2 3 6 2
#3 NA NA NA 3
#4 3 4 NA NA
Now prepare a list of column indices and a vector of replacement values
cols <- list(1, c(2, 3))
vals <- c(3, 5)
Use a for loop to replace the columns with the values
for(i in seq_along(cols)) {
tr[cols[[i]]][is.na(tr[cols[[i]]])] <- vals[i]
}
For remaining columns
f_cols <- setdiff(seq_len(ncol(tr)), unlist(cols))
tr[f_cols][is.na(tr[f_cols])] <- 20
tr
# a b c d
#1 3 2 5 1
#2 2 3 6 2
#3 3 5 5 3
#4 3 4 5 20
Notice how the NAs in column 1 are replaced with 3, the NAs in columns 2 and 3 are replaced with 5, and the NAs in the remaining column are replaced with 20.
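A possible variant keys the replacement values by column name and also converts each column to a factor, which is what the question ultimately asks for. This is only a sketch: na_levels and default_level are hypothetical names, and the values mirror the example above rather than the asker's real data.

```r
tr <- data.frame(a = c(NA, 2, NA, 3), b = c(2, 3, NA, 4),
                 c = c(5, 6, NA, NA), d = c(1, 2, 3, NA))

# Replacement value per column; columns not listed get the default
na_levels <- c(a = 3, b = 5, c = 5)
default_level <- 20

for (col in names(tr)) {
  fill <- if (col %in% names(na_levels)) na_levels[[col]] else default_level
  tr[[col]][is.na(tr[[col]])] <- fill
  # Convert after filling so the replacement value becomes a regular level
  tr[[col]] <- factor(tr[[col]])
}
```

Looking values up by name avoids having to keep column positions and replacement values in sync by hand.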

Subtracting columns in a loop

I've got a data frame like this:
df:
A B C
1 1 2 3
2 2 2 4
3 2 2 3
I would like to subtract from each column the one before it (A-0, B-A, C-B). So my results should look like this:
df:
A B C
1 1 1 1
2 2 0 2
3 2 0 1
I tried the following loop, but it didn't work.
for (i in 1:3) {
j <- data[,i+1] - data[,i]
}
Try this
df - cbind(0, df[-ncol(df)])
# A B C
# 1 1 1 1
# 2 2 0 2
# 3 2 0 1
Data
df <- data.frame(A = c(1, 2, 2), B = c(2, 2, 2), C = c(3, 4, 3))
We can also subtract the data frame without its last column from the data frame without its first column
df[-1] <- df[-1]-df[-length(df)]
data
df <- data.frame(A = c(1, 2, 2), B = c(2, 2, 2), C = c(3, 4, 3))
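For comparison, the loop from the question can be made to work with two changes: start at the second column, and store each difference in a result instead of overwriting j on every pass. The following is a minimal sketch; result is a name chosen for the example.

```r
df <- data.frame(A = c(1, 2, 2), B = c(2, 2, 2), C = c(3, 4, 3))

# Work on a copy so every difference uses the original values
result <- df

# Column 1 stays as is (A - 0); start from the second column
for (i in seq_len(ncol(df))[-1]) {
  result[[i]] <- df[[i]] - df[[i - 1]]
}
```

Reading from df while writing to result matters: updating df in place would make later differences use columns that have already been changed.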

Best way for looping in a dataframe in R

I am trying to write a program that iterates through an R data table. I want to avoid for loops because, as far as I know, they are slow.
#creation of the data table
col <- c(0, 1, 0, 1, 0, 1)
Priority <- c(1,2,3,4,5,6) #1 highest, 6 lowest
IEC_category <- c("a","b","c","d","e","f")
eventlog_overlap.dt <- data.table(col,Priority, IEC_category)
#comparison and assignation of the priority
if (eventlog_overlap.dt$col == 1){
if (eventlog_overlap.dt$Priority <= shift(eventlog_overlap.dt$Priority,1)){
eventlog_overlap.dt$AlarmaPrior <- eventlog_overlap.dt$IEC_category #write the actual category
}
else{
eventlog_overlap.dt$AlarmaPrior <- shift(eventlog_overlap.dt$IEC_category,1) #write the previous category
}
} else{ eventlog_overlap.dt$AlarmaPrior <- NA
}
Please provide the desired result. A dplyr attempt:
library(dplyr)
library(hablar)
col <- c(0, 1, 0, 1, 0, 1)
Priority <- c(1,2,3,4,5,6) #1 highest, 6 lowest
IEC_category <- c("a","b","c","d","e","f")
df <- data.frame(col,Priority, IEC_category)
df %>%
mutate(AlarmaPrior = if_else_(col == 1,
if_else_(Priority <= lag(Priority),
IEC_category,
lag(IEC_category)), NA))
gives you:
col Priority IEC_category AlarmaPrior
1 0 1 a <NA>
2 1 2 b a
3 0 3 c <NA>
4 1 4 d c
5 0 5 e <NA>
6 1 6 f e
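The same logic can also be sketched without extra packages, using vectorised ifelse() and a manually shifted copy of each column. The helper vectors prev_priority and prev_category are assumptions introduced for this illustration.

```r
df <- data.frame(col = c(0, 1, 0, 1, 0, 1),
                 Priority = c(1, 2, 3, 4, 5, 6),  # 1 highest, 6 lowest
                 IEC_category = c("a", "b", "c", "d", "e", "f"),
                 stringsAsFactors = FALSE)

# Previous row's values; the first row has no predecessor, so NA
prev_priority <- c(NA, head(df$Priority, -1))
prev_category <- c(NA, head(df$IEC_category, -1))

# ifelse() is evaluated element-wise, so no explicit loop is needed
df$AlarmaPrior <- ifelse(df$col == 1,
                         ifelse(df$Priority <= prev_priority,
                                df$IEC_category,
                                prev_category),
                         NA)
```

The if () statements in the question only look at a single condition value, which is why a vectorised function such as ifelse() is needed for a row-wise decision.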

Is there a function to know how many times a column has the best value?

I have a data.frame like this :
A B C
4 8 2
1 3 5
5 7 6
It could have more columns and rows.
So what I'd like to know is, for each column, how many times it holds the lowest value of its row (in my example the result should be 2 for A and 1 for C).
d = data.frame(a = c(4, 1, 5), b = c(8, 3, 7), c = c(2, 5, 6))
row_mins = apply(d, 1, min)
# alternately, slightly more efficient
row_mins = do.call(pmin, d)
colSums(d == row_mins)
# a b c
# 2 0 1
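An alternative sketch finds the position of the minimum in each row with which.min() and tabulates the resulting column names. Note the difference in tie handling: which.min() credits only the first column holding the row minimum, whereas the colSums() approach above counts every tied column. The names min_col and counts are made up for the example.

```r
d <- data.frame(a = c(4, 1, 5), b = c(8, 3, 7), c = c(2, 5, 6))

# Name of the column holding the (first) minimum of each row
min_col <- names(d)[apply(d, 1, which.min)]

# Use a factor with all column names so columns that never win show up as 0
counts <- table(factor(min_col, levels = names(d)))
```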
