I have tree data serialized like the following:
Relationship: P to C is "one-to-many", and C to P is "one-to-one". So column P may have duplicate values, but column C has unique values.
P, C
1, 2
1, 3
3, 4
2, 5
4, 6
# in data.frame
df <- data.frame(P=c(1,1,3,2,4), C=c(2,3,4,5,6))
1. How do I efficiently implement a function func so that:
func(df, val) returns the full path from the root (1 in this case) to val, as a vector.
For example:
func(df, 3) returns c(1,3)
func(df, 5) returns c(1,2,5)
func(df, 6) returns c(1,3,4,6)
2. Alternatively, quickly transforming df to a lookup table like this also works for me:
C, Paths
2, c(1,2)
3, c(1,3)
4, c(1,3,4)
5, c(1,2,5)
6, c(1,3,4,6)
Here is a solution using igraph:
library(igraph)
# build a directed graph from the parent/child edge list
g <- graph_from_data_frame(df)
# for each child C, the simple path from the root (1) to C gives the full path
df <- within(df,
             Path <- sapply(match(as.character(C), names(V(g))),
                            function(k) toString(names(unlist(all_simple_paths(g, 1, k))))))
such that
> df
P C Path
1 1 2 1, 2
2 1 3 1, 3
3 3 4 1, 3, 4
4 2 5 1, 2, 5
5 4 6 1, 3, 4, 6
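For comparison, here is a minimal base-R sketch of func that repeatedly looks up the parent with match(), assuming every node eventually reaches the root:
func <- function(df, val) {
  path <- val
  repeat {
    p <- df$P[match(path[1], df$C)]  # parent of the current head of the path
    if (is.na(p)) break              # no parent found, so the head is the root
    path <- c(p, path)
  }
  path
}
func(df, 6)
# [1] 1 3 4 6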
I want to match vector 1 against vector 2 to see whether items in vector 1 are found in vector 2. Then I want to create two new vectors: a subset of vector 1 with the values found in both vectors, and a subset of vector 1 with the values not found in vector 2. The match() function followed by which(is.na()) works great for small data sets, but I have a data set with 1000 elements.
Data1 <- c(1, 2, 3, 4, 5)
Data2 <- c(1, 3, 5, 6, 7)
#Match vector1 to vector2
A <- match(Data1, Data2)
[1] 1 NA 2 NA 3
# to obtain positions of non-matching elements
x <- which(is.na(A))
[1] 2 4
Data1[c(2,4)]
# to obtain positions of matching elements
y <- which(A >= 1)
[1] 1 3 5
Data1[c(1,3,5)]
Try this so you do not have to deal with the NAs from match():
Data1 <- c(1, 2, 3, 4, 5)
Data2 <- c(1, 3, 5, 6, 7)
# Values of Data1 in Data2
A <- Data1[Data1 %in% Data2]
A
# output:
# > A
# [1] 1 3 5
# create not in function
'%ni%' <- Negate('%in%')
# Values of Data1 not in Data2
B <- Data1[Data1 %ni% Data2]
B
# output:
# > B
# [1] 2 4
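Base R's set operations express the same idea, though note that they also drop duplicate values:
intersect(Data1, Data2)  # values of Data1 that are also in Data2: 1 3 5
setdiff(Data1, Data2)    # values of Data1 that are not in Data2: 2 4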
I would like to replace NAs in my data frame with values from another column. For example:
a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)
df <- as.data.frame(cbind(a1, b1, c1, a2, b2, c2))
df
> df
a1 b1 c1 a2 b2 c2
1 1 3 NA 2 1 3
2 2 NA 3 3 2 3
3 4 4 3 5 4 2
4 NA 4 4 5 5 3
5 2 4 2 3 6 4
6 NA  3  3  4  3  3
I would like to replace the NAs in df$a1 with the values from the corresponding rows in df$a2, the NAs in df$b1 with the values from df$b2, and the NAs in df$c1 with the values from df$c2, so that the new data frame looks like:
> df
a1 b1 c1
1 1 3 3
2 2 2 3
3 4 4 3
4 5 4 4
5 2 4 2
6 4 3 3
How can I do this? I have a large data frame with many columns, so it would be great to find an efficient way to do this (I've already seen Replace missing values with a value from another column). Thank you!
An extensible option:
df2 <- df[c('a1','b1','c1')]
df2[] <- mapply(function(x, y) ifelse(is.na(x), y, x),
                df[c('a1','b1','c1')], df[c('a2','b2','c2')],
                SIMPLIFY = FALSE)
df2
# a1 b1 c1
# 1 1 3 3
# 2 2 2 3
# 3 4 4 3
# 4 5 4 4
# 5 2 4 2
# 6 4 3 3
It's easy enough to extend this to arbitrary column pairs: the first column of the first subset (df[c('a1','b1','c1')]) is paired with the first column of the second subset, the second column with the second, and so on. It can even be generalized with df[grepl('1$',colnames(df))] and df[grepl('2$',colnames(df))], assuming the two subsets line up, as sketched below.
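A sketch of that generalization, assuming every column ending in 1 has a matching column ending in 2 in the same order:
firsts  <- df[grepl('1$', colnames(df))]
seconds <- df[grepl('2$', colnames(df))]
firsts[] <- mapply(function(x, y) ifelse(is.na(x), y, x),
                   firsts, seconds, SIMPLIFY = FALSE)
firsts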
coalesce in dplyr is meant to do exactly this (replace NAs in a first vector with the non-NA elements of a later one), e.g.
coalesce(df$a1,df$a2)
[1] 1 2 4 5 2 4
It can be used with sapply to do the whole dataset in an efficient and easily extensible manner:
sapply(c("a","b","c"),function(x) coalesce(df[,paste0(x,1)],df[,paste0(x,2)]))
a b c
[1,] 1 3 3
[2,] 2 2 3
[3,] 4 4 3
[4,] 5 4 4
[5,] 2 4 2
[6,] 4 3 3
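Note that sapply() returns a matrix here; wrap the call in as.data.frame() if you need a data frame back.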
dfnew <- ifelse(is.na(df$a1), df$a2, df$a1)
as.data.frame(dfnew)
This is just for the a1 column; you'll have to run it for each of a, b, and c and cbind() the results. If there are too many columns, running a loop is the best option, in my opinion (see the sketch below).
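A minimal sketch of that loop, assuming the paired columns follow the a1/a2 naming pattern of the example:
for (v in c("a", "b", "c")) {
  old <- paste0(v, "1")
  new <- paste0(v, "2")
  df[[old]] <- ifelse(is.na(df[[old]]), df[[new]], df[[old]])
}
df[c("a1", "b1", "c1")]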
You can use hutils::coalesce. It should be slightly faster, especially if it can 'cheat' -- if any columns have no NAs and so don't need to change, coalesce will skip them:
a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)
s <- function(x) {
  sample(x, size = 1e6, replace = TRUE)
}
df <- as.data.frame(cbind(a1 = s(a1), b1 = s(b1), c1 = s(c1),
                          a2 = s(a2), b2 = s(b2), c2 = s(c2)))
library(microbenchmark)
library(hutils)
library(data.table)
dt <- as.data.table(df)
old <- paste0(letters[1:3], "1") # you will need to specify
new <- paste0(letters[1:3], "2")
dplyr_coalesce <- function(df) {
  ans <- df
  for (j in seq_along(old)) {
    o <- old[j]
    n <- new[j]
    ans[[o]] <- dplyr::coalesce(ans[[o]], df[[n]])
  }
  ans
}

hutils_coalesce <- function(df) {
  ans <- df
  for (j in seq_along(old)) {
    o <- old[j]
    n <- new[j]
    ans[[o]] <- hutils::coalesce(ans[[o]], df[[n]])
  }
  ans
}

microbenchmark(dplyr = dplyr_coalesce(df),
               hutils = hutils_coalesce(df))
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> dplyr 45.78123 61.76857 95.10870 69.21561 87.84774 1452.0800 100 b
#> hutils 36.48602 46.76336 63.46643 52.95736 64.53066 252.5608 100 a
Created on 2018-03-29 by the reprex package (v0.2.0).
I get CSVs with hundreds of different columns and would like to be able to output a new file with the duplicate values removed from each column. Everything I have seen and tried works on one specific column; I just need each column reduced to its unique values.
For Example My Data:
df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
df
A B C
1 1 1 Mr.
2 2 0 Mr.
3 3 1 Mrs.
4 4 0 Miss
5 5 0 Mr.
6 6 1 Mrs.
I would like:
A B C
1 1 1 Mr.
2 2 0 Mrs.
3 3 Miss
4 4
5 5
6 6
Then I can:
write.csv(df, file = "df_No_Dupes.csv", na = "")
So I can use it as a reference for my next task.
read.csv and write.csv work best with tabular data. Your desired output is not a good example of this (every row does not have the same number of columns).
You can easily get all the unique value for your columns with
vals <- sapply(df, unique)
Then you'd be better off saving this object with save() and load() to preserve the list as an R object.
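For example (the file name here is just a placeholder):
vals <- sapply(df, unique)             # a list here, since the columns have different numbers of unique values
save(vals, file = "unique_vals.RData")
# later, in a fresh session:
load("unique_vals.RData")              # restores the list as `vals`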
Code snippet to work with a flexible number of columns, remove duplicate values within each column, and preserve the column names:
require(rowr)
df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
# get the number of columns in the data frame
n <- ncol(df)
# loop through the columns
for (i in 1:ncol(df)) {
  # append the current first column without its duplicates, filling the gap with NAs
  df <- cbind.fill(df, unique(df[, 1]), fill = NA)
  # rename the new column
  colnames(df)[n + 1] <- colnames(df)[1]
  # delete the old column (so the next column to process moves to the front)
  df[, 1] <- NULL
}
df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
for (i in 1:ncol(df)) {
  assign(paste("df_", i, sep = ""), unique(df[, i]))
}
require(rowr)
df <- cbind.fill(df_1,df_2,df_3, fill = NA)
V1 V1 V1
1 1 1 Mr.
2 2 0 Mrs.
3 3 NA Miss
4 4 NA <NA>
5 5 NA <NA>
6 6 NA <NA>
or you could do
require(rowr)
df <- cbind.fill(df_1,df_2,df_3, fill = "")
df
V1 V1 V1
1 1 1 Mr.
2 2 0 Mrs.
3 3 Miss
4 4
5 5
6 6
If you want to avoid typing the name of each intermediate data frame, you can use ls(pattern = "df_") and get() the objects named in that vector, or use another loop.
If you want to change the column names back to their original values you can use:
colnames(output_df) <- colnames(input_df)
Then you can save the results however you like, e.g.
saveRDS()
save()
or write it to a file.
Putting it all together:
df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
for (i in 1:ncol(df)) {
  assign(paste("df_", i, sep = ""), unique(df[, i]))
}
require(rowr)
files <- ls(pattern = "df_")
df_output <- data.frame()
for (i in files) {
  df_output <- cbind.fill(df_output, get(i), fill = "")
}
df_output <- df_output[, 2:4] # drop the extra column left over from initializing with an empty data frame
colnames(df_output) <- colnames(df)
write.csv(df_output, "df_out.csv",row.names = F)
verify_it_worked <- read.csv("df_out.csv")
verify_it_worked
A B C
1 1 1 Mr.
2 2 0 Mrs.
3 3 Miss
4 4
5 5
6 6
My objective is to sum groups of consecutive rows, where the group sizes vary. Maybe a loop would help.
I used this code (rollapply() is from the zoo package):
irr <- rollapply(irr, width = 1, by = n, align = "left", FUN = sum)
Example:
V1
3
2
4
7
5
So if n = 2, the first two rows are summed together:
Results:
V1
5
4
7
5
The problem is that I have multiple values of n in another data.frame variable (2 5 3), and I want n to change as it goes: say, once it finishes summing the first two rows, the next n is 3.
Results:
5 16
This is my first time using R, so please pardon any mistakes I made and the question being hard to understand. Thanks.
You can split the data frame according to n and then sum over each element of the resulting list.
As an example,
v1 <- data.frame(X = c(3,2,4,7,5, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4))
n <- data.frame(Y = c(2, 3, 2, 4, 1,4))
unlist(lapply(split(v1$X, rep(1:nrow(n), n$Y)), sum))
# 1 2 3 4 5 6
# 5 16 5 22 8 10
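Note that this assumes sum(n$Y) equals nrow(v1) (here both are 16); if they disagree, the grouping vector built by rep() will not line up with the rows.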