I have a sample dataset, which has missing values in it

I have a sample dataset which has missing values in it. I want to create a new column containing a message that says, for each row, which columns have missing values.
Example:
Dataset:
A  B  C  D
1  NA 2  4
NA 4  4  NA
4  1  NA NA
NA NA NA NA
3  NA 2  3
The possible combinations of missing columns are:
"a", "b", "c", "d", "a, b", "a, c", "a, d", "b, c", "b, d", "c, d", "a, b, c", "a, b, d", "a, c, d", "b, c, d", "a, b, c, d"
Result:
A  B  C  D  Message
1  NA 2  4  Column B is missing
NA 4  4  NA Columns A and D are missing
4  1  NA NA Columns C and D are missing
NA NA NA NA All column values are missing
3  NA 2  3  Column B is missing
Any suggestions would be really appreciated.

Here's a way using apply from base R:
set.seed(4)
df <- data.frame(matrix(sample(c(1:5, NA), 15, replace = TRUE), ncol = 3))
names(df) <- LETTERS[1:3]
# For each row, paste together the names of the columns that are NA
df$msg <- apply(df, 1, function(x) {
  if (anyNA(x)) {
    paste0(paste0(names(x)[which(is.na(x))], collapse = " "), " missing")
  } else {
    "No missing"
  }
})
df
  A  B  C        msg
1 4  2  5 No missing
2 1  5  2 No missing
3 2 NA  1  B missing
4 2 NA NA B C missing
5 5  1  3 No missing
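
To get messages closer to the wording asked for in the question, the same row-wise idea can be sketched as below. The helper name missing_msg and the exact message wording are assumptions made for illustration, and the NA positions in the sample data are inferred from the question's expected messages.

```r
# Hypothetical helper: describe which columns are missing in one row.
# row_is_na is a logical vector, col_names the matching column names.
missing_msg <- function(row_is_na, col_names) {
  miss <- col_names[row_is_na]
  if (length(miss) == 0) return("No missing values")
  if (length(miss) == length(col_names)) return("All column values are missing")
  if (length(miss) == 1) return(paste("Column", miss, "is missing"))
  paste("Columns", paste(miss, collapse = " and "), "are missing")
}

# Sample data with NA positions inferred from the question
df <- data.frame(A = c(1, NA, 4, NA, 3),
                 B = c(NA, 4, 1, NA, NA),
                 C = c(2, 4, NA, NA, 2),
                 D = c(4, NA, NA, NA, 3))

cols <- names(df)
na_mat <- is.na(df)  # logical matrix, computed before the new column is added
df$Message <- apply(na_mat, 1, missing_msg, col_names = cols)
df
```

Working on is.na(df) rather than on df itself avoids apply coercing mixed column types to character before the NA check.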

Related

R help - change the maximum value of each row in a certain condition

I am a novice in R. I have a data frame with columns 1:n. Excluding columns 1 and n, I want to change the maximum value of each row if the row has a specific value in a different column, and set the remaining values (again excluding columns 1 and n) to zero. I have about 300,000 cases and 40 columns in my real data, but the example below illustrates what I am trying to achieve:
A <- c(1,1,5,5,10)
B <- rnorm(5)
C <- rnorm(5)
D <- rnorm(5)
E <- c(10,15,100,100,100)
df <- data.frame(A,B,C,D,E)
df
A B C D E
1 1 0.74286670 0.3222136 0.9381296 10
2 1 -0.03352498 0.5262685 0.1225731 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100
Here, for each row where column A equals 1, I want to replace the maximum of columns B, C and D with the value of column E, and set the other two of those columns to 0.
So the result should look like this:
A B C D E
1 1 0 0 10 10
2 1 0 15 0 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100
I tried to do this for two days. Thanks.

Try this out and see what happens :)
df <- read.table(text = "A B C D E
1 1 0.74286670 0.3222136 0.9381296 10
2 1 -0.03352498 0.5262685 0.1225731 15
3 5 -0.17689629 -0.8949740 -1.4376567 100
4 5 0.48329153 1.1574834 -1.1116581 100
5 10 0.13117277 -0.2068736 0.4841806 100", stringsAsFactors = FALSE)
# find the max in columns B,C,D
z <- apply(df[df$A == 1, 2:4], 1, max)
# substitute the maximum value of each row for columns B,C,D where A == 1
# with the value of column E. Assign 0 to the others
y <- ifelse(df[df$A == 1, 2:4] == z, df$E[df$A == 1], 0)
# Change the values in your dataframe
df[df$A == 1, 2:4] <- y
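
For the real data (about 40 columns), a variant that does not hard-code the column positions can be sketched with max.col and matrix indexing. This is an alternative to the answer above, not the answerer's method; the variable names mid, rows, top and new are illustrative, and ties are resolved by taking the first maximum, which is an assumption about the desired behavior.

```r
df <- data.frame(
  A = c(1, 1, 5, 5, 10),
  B = c(0.74286670, -0.03352498, -0.17689629, 0.48329153, 0.13117277),
  C = c(0.3222136, 0.5262685, -0.8949740, 1.1574834, -0.2068736),
  D = c(0.9381296, 0.1225731, -1.4376567, -1.1116581, 0.4841806),
  E = c(10, 15, 100, 100, 100)
)

mid  <- 2:(ncol(df) - 1)       # every column except the first and the last
rows <- which(df$A == 1)       # rows that meet the condition

# Position of the row maximum within the middle columns (first one on ties)
top <- max.col(as.matrix(df[rows, mid]), ties.method = "first")

# Start from all zeros, then put E at the position of each row maximum
new <- matrix(0, nrow = length(rows), ncol = length(mid))
new[cbind(seq_along(rows), top)] <- df$E[rows]

df[rows, mid] <- new
df
```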

Calculate Euclidean distances between elements of two data sets

I have two data sets of the same size:
>df1
c d e
a 2 3 4
b 5 1 3
>df2
h i j
f 1 1 2
g 0 4 3
I need to calculate Euclidean distances between the corresponding elements of these data sets to get:
c d e
a 1 2 2
b 5 3 0
I have tried using dist(rbind(df1, df2)), but the result gave only one entry.
I have to perform this operation with numerous data sets, that's why your help will be really appreciated.

The following will work if the data frames are all numeric and have the same dimensions. For single values, the Euclidean distance reduces to the absolute difference, since sqrt((x - y)^2) = |x - y|.
df3 <- abs(df1 - df2)
df3
# c d e
# a 1 2 2
# b 5 3 0
DATA
df1 <- read.table(text = " c d e
a 2 3 4
b 5 4 3",
header = TRUE, stringsAsFactors = FALSE, row.names = 1)
df2 <- read.table(text = " h i j
f 1 1 2
g 0 1 3",
header = TRUE, stringsAsFactors = FALSE, row.names = 1)

Given your update, the solution is to take the absolute value (abs) of the difference:
abs(df1 - df2)
And you could wrap it in a function if you want to repeat the process a lot:
myfunc1 <- function(x1, x2) {
  abs(x1 - x2)
}
myfunc1(df1, df2)
The output looks as intended:
[,1] [,2] [,3]
[1,] 1 2 2
[2,] 5 3 0
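
Since the question mentions numerous data sets, one way to repeat the operation is to keep the pairs in two lists and use Map. This is a minimal sketch under assumptions: the function name elementwise_dist and the lists firsts and seconds are made up for illustration, and the data are the values from the question.

```r
df1 <- data.frame(c = c(2, 5), d = c(3, 1), e = c(4, 3), row.names = c("a", "b"))
df2 <- data.frame(h = c(1, 0), i = c(1, 4), j = c(2, 3), row.names = c("f", "g"))

# Hypothetical helper: element-wise distance between two equally sized tables
elementwise_dist <- function(x, y) {
  stopifnot(identical(dim(x), dim(y)))       # fail early on mismatched shapes
  out <- abs(as.matrix(x) - as.matrix(y))
  dimnames(out) <- dimnames(x)               # label the result like the first input
  out
}

# Apply the helper to each pair of data sets
firsts  <- list(df1, df1)
seconds <- list(df2, df2)
results <- Map(elementwise_dist, firsts, seconds)
results[[1]]
```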

How to remove duplicated concatenated string in R

I have the following dataset
path value
1 b,b,a,c 3
2 c,b 2
3 a 10
4 b,c,a,b 0
5 e,f 0
6 a,f 1
df <- data.frame(path = c("b,b,a,c", "c,b", "a", "b,c,a,b", "e,f", "a,f"), value = c(3, 2, 10, 0, 0, 1))
and I wish to remove the duplicated elements within each entry of column path. When I use this code, the format of the data changes:
df$path <- sapply(strsplit(as.character(df$path), split = ","),
                  function(x) unique(x))
and each entry becomes a character vector instead of a single string:
path value
1 c("b", "a", "c") 3
2 c("c", "b") 2
...
However, I wish to have data like this:
path value
1 b, a, c 3
2 c, b 2
3 a 10
4 b, c, a 0
5 e, f 0
6 a, f 1

Replace unique(x) with paste(unique(x), collapse = ', '), or with toString(unique(x)) as Frank suggested.
df <- data.frame(
  path = c("b,b,a,c", "c,b", "a", "b,c,a,b", "e,f", "a,f"),
  value = c(3, 2, 10, 0, 0, 1))
df$path <- sapply(strsplit(as.character(df$path), split = ","),
                  function(x) paste(unique(x), collapse = ', '))
# or
df$path <- sapply(strsplit(as.character(df$path), split = ","),
                  function(x) toString(unique(x)))
df
# path value
# 1 b, a, c 3
# 2 c, b 2
# 3 a 10
# 4 b, c, a 0
# 5 e, f 0
# 6 a, f 1
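
A slightly more defensive variant of the same idea is sketched below. It uses vapply so the result is guaranteed to be a character vector, and trims stray spaces around the elements before deduplicating; the helper name dedupe_path is an assumption for illustration.

```r
df <- data.frame(
  path = c("b,b,a,c", "c,b", "a", "b,c,a,b", "e,f", "a,f"),
  value = c(3, 2, 10, 0, 0, 1),
  stringsAsFactors = FALSE)

# Hypothetical helper: split one string, trim spaces, drop repeats, re-join
dedupe_path <- function(p) {
  parts <- trimws(strsplit(p, ",", fixed = TRUE)[[1]])
  toString(unique(parts))
}

# vapply fails loudly if any result is not a single string
df$path <- vapply(df$path, dedupe_path, character(1), USE.NAMES = FALSE)
df
```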

Set value of data frame new field equal to another field based on condition on a third field in R

I want to add a field to a given data frame and set it equal to an existing field in the same data frame, based on a condition on a different (existing) field.
I know this works:
is.even <- function(x) x %% 2 == 0
df <- data.frame(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
df$test[is.even(df$a)] <- as.character(df[is.even(df$a), "b"])
> df
a b test
1 1 A NA
2 2 B B
3 3 C NA
4 4 D D
5 5 E NA
6 6 F F
But I have this feeling it can be done a lot better than this.

Using data.table it's quite easy (is.even is the function defined in the question):
library(data.table)
dt = data.table(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
dt[is.even(a), test := b]
> dt
a b test
1: 1 A NA
2: 2 B B
3: 3 C NA
4: 4 D D
5: 5 E NA
6: 6 F F
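
If adding a package is not an option, a base R alternative can be sketched with ifelse. This is a different technique from the data.table answer above; it assumes column b is stored as character (hence stringsAsFactors = FALSE).

```r
is.even <- function(x) x %% 2 == 0

df <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                 b = c("A", "B", "C", "D", "E", "F"),
                 stringsAsFactors = FALSE)

# Copy b where a is even, otherwise leave a character NA
df$test <- ifelse(is.even(df$a), df$b, NA_character_)
df
```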

R: reshape data frame when one column has unequal number of entries

I have a data frame x with 2 character columns:
x <- data.frame(a = numeric(), b = I(list()))
x[1:3,"a"] = 1:3
x[[1, "b"]] <- "a, b, c"
x[[2, "b"]] <- "d, e"
x[[3, "b"]] <- "f"
x$a = as.character(x$a)
x$b = as.character(x$b)
x
str(x)
The entries in column b are comma-separated strings of characters.
I need to produce this data frame:
1 a
1 b
1 c
2 d
2 e
3 f
I know how to do it when I loop row by row. But is it possible to do without looping?
Thank you!

Have you checked out the splitstackshape package?
> library(splitstackshape)
> cSplit(x, "b", ",", direction = "long")
a b
1: 1 a
2: 1 b
3: 1 c
4: 2 d
5: 2 e
6: 3 f

> s <- strsplit(as.character(x$b), ", ", fixed = TRUE)
> data.frame(value = rep(x$a, sapply(s, FUN = length)), b = unlist(s))
value b
1 1 a
2 1 b
3 1 c
4 2 d
5 2 e
6 3 f

There you go, this should be very fast:
library(data.table)
x <- data.table(x)
x[, strsplit(b, ", ", fixed = TRUE), by = a]
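
For completeness, a base R sketch that tolerates any amount of whitespace around the commas is shown below. It is not one of the posted answers; the names parts and long are illustrative, and x is rebuilt directly as a plain character data frame rather than through the list column used in the question.

```r
x <- data.frame(a = c("1", "2", "3"),
                b = c("a, b, c", "d, e", "f"),
                stringsAsFactors = FALSE)

# Split on commas, then strip the surrounding whitespace from every piece
parts <- lapply(strsplit(x$b, ",", fixed = TRUE), trimws)

# Repeat each value of a once per piece and flatten the pieces
long <- data.frame(a = rep(x$a, lengths(parts)),
                   b = unlist(parts),
                   stringsAsFactors = FALSE)
long
```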
