How to delete a duplicate row in R - r

I have the following data
x y z
1 2 a
1 2
data[2,3] is a factor, but nothing shows when it is printed.
The data has many rows like this. How can I delete a row when z is empty?
I mean deleting rows such as the second row.
The output should be:
x y z
1 2 a

OK. Stabbing a little bit in the dark here.
Imagine the following dataset:
mydf <- data.frame(
x = c(.11, .11, .33, .33, .11, .11),
y = c(.22, .22, .44, .44, .22, .44),
z = c("a", "", "", "f", "b", ""))
mydf
# x y z
# 1 0.11 0.22 a
# 2 0.11 0.22
# 3 0.33 0.44
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
From the combination of your title and your description (neither of which seems to fully describe your problem), I would decode that you want to drop rows 2 and 3, but not row 6. In other words, you want to first check whether the row is duplicated (presumably only the first two columns), and then, if the third column is empty, drop that row. By those instructions, row 5 should remain (column "z" is not blank) and row 6 should remain (the combination of columns 1 and 2 is not a duplicate).
If that's the case, here's one approach:
# Copy the data.frame, "sorting" by column "z"
mydf2 <- mydf[rev(order(mydf$z)), ]
# Subset according to your conditions
mydf2 <- mydf2[duplicated(mydf2[1:2]) & mydf2$z %in% "", ]
mydf2
# x y z
# 3 0.33 0.44
# 2 0.11 0.22
^^ Those are the data that we want to remove. One way to remove them is using setdiff on the rownames of each dataset:
mydf[setdiff(rownames(mydf), rownames(mydf2)), ]
# x y z
# 1 0.11 0.22 a
# 4 0.33 0.44 f
# 5 0.11 0.22 b
# 6 0.11 0.44
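The same rows can also be picked out with a single logical index, without sorting first. This is only a sketch, and it assumes the intent is: drop a blank-z row when its x/y pair also occurs in some row with a non-blank z. The names key and has_label are made up for illustration.

```r
mydf <- data.frame(
  x = c(.11, .11, .33, .33, .11, .11),
  y = c(.22, .22, .44, .44, .22, .44),
  z = c("a", "", "", "f", "b", ""))

# Build one key per row from the first two columns
key <- paste(mydf$x, mydf$y)

# TRUE where the same x/y pair appears in at least one row with a non-blank z
has_label <- key %in% key[mydf$z != ""]

# Drop blank rows that are covered by a labelled row; keep everything else
result <- mydf[!(mydf$z == "" & has_label), ]
result
```

On the example data this keeps rows 1, 4, 5 and 6. Unlike the sort-and-duplicated approach, it would keep repeated blank rows whose x/y pair never has a label, so pick whichever rule matches your data.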

Some example data:
df = data.frame(x = runif(100),
y = runif(100),
z = sample(c(letters[1:10], ""), 100, replace = TRUE))
> head(df)
x y z
1 0.7664915 0.86087017 a
2 0.8567483 0.83715022 d
3 0.2819078 0.85004742 f
4 0.8241173 0.43078311 h
5 0.6433988 0.46291916 e
6 0.4103120 0.07511076
Spot row six with the missing value. You can subset using a logical vector (TRUE, FALSE):
df[df$z != "",]
And as @AnandaMahto commented, you can even check against multiple conditions:
df[!df$z %in% c("", " "),]
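If the "empty" values might also be NA or padded with several spaces, a slightly more defensive filter can help. The following is a minimal sketch on made-up data; z_chr, keep and cleaned are illustrative names, not part of the original answer.

```r
df <- data.frame(x = 1:5,
                 z = c("a", "", " ", NA, "b"))

# Convert to character first so the test also works if z is a factor
z_chr <- trimws(as.character(df$z))

# Keep rows whose z is neither NA nor empty after trimming whitespace
keep <- !is.na(z_chr) & z_chr != ""
cleaned <- df[keep, ]
cleaned
```

Only the rows with "a" and "b" remain here.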

Related

R cran: package 'Diagram' link width

I am trying to show a transition matrix. In addition to the values of the matrix itself, I have another matrix that gives me the 'confidence' for the score in the first matrix. I want to draw the diagram such that the line width reflects the 'confidence' matrix, even though the values shown are from the transition matrix. For example:
library(diagram)
set.seed(1234)
# transition matrix
mat1 <- matrix(round(abs(rnorm(16)),2),4,4)
rownames(mat1) <- colnames(mat1) <- letters[1:4]
# confidence score matrix
mat2 <- matrix(c(rep(1,9),rep(2,2),rep(3,5)),4,4)
rownames(mat2) <- colnames(mat2) <- letters[1:4]
plotmat(mat1,box.size = 0.04,lwd=mat2)
> mat1
a b c d
a 1.21 0.43 0.56 0.78
b 0.28 0.51 0.89 0.06
c 1.08 0.57 0.48 0.96
d 2.35 0.55 1.00 0.11
> mat2
a b c d
a 1 1 1 3
b 1 1 2 3
c 1 1 2 3
d 1 1 3 3
The line widths look ok (e.g. 'd' to 'a', 'd' to 'b','c' to 'b'), except where the link is from a node, back to itself (e.g. 'd' to 'd', or 'c' to 'c'). In these self-looping cases, the line width does not appear to work.
Is there something else that I need to do?
thanks!

How to forward fill from numeric column based on condition of logical column

I have a column of pdf values and a conditional column. I am attempting to create a third column that forward fills in values from the pdf column based on the conditional column. If the condition is TRUE then I would like the corresponding row to restart the pdf column from the beginning.
I've seen the question "R: fill new columns in data.frame based on row values by condition?", and it is close, but I would like a dplyr solution to retain my pipe structure.
Very Simple Example Data:
library(tidyverse)
dat <- tibble(pdf = c(.025, .05, .10, .15, .175, .20, .29, .01),
cond = c(F, F, T, F, F, F, T, F),
expected = c(.025, .05, .025, .05, .10, .15, .025, .05))
The expected result is shown in the expected column of the data frame above. (Note that my real data does not have the expected column.)
Thank you in advance.
Here's a way by creating a reference using ave.
The output of cumsum(cond) produces a grouping and ave uses this grouping and creates a sequence along each group using seq_along. This sequence is then used as reference for pulling the appropriate pdf value.
dat %>%
mutate(
ref = ave(pdf, cumsum(cond), FUN = seq_along),
expected2 = pdf[ref]
)
# A tibble: 8 x 5
pdf cond expected ref expected2
<dbl> <lgl> <dbl> <dbl> <dbl>
1 0.025 FALSE 0.025 1 0.025
2 0.05 FALSE 0.05 2 0.05
3 0.1 TRUE 0.025 1 0.025
4 0.15 FALSE 0.05 2 0.05
5 0.175 FALSE 0.1 3 0.1
6 0.2 FALSE 0.15 4 0.15
7 0.290 TRUE 0.025 1 0.025
8 0.01 FALSE 0.05 2 0.05
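To make the intermediate steps visible, here is the same idea in plain base R on the example vectors. It is a sketch rather than a replacement for the pipe above; pdf_col, grp, ref and filled are illustrative names.

```r
pdf_col <- c(.025, .05, .10, .15, .175, .20, .29, .01)
cond    <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)

# Each TRUE starts a new group: 0 0 1 1 1 1 2 2
grp <- cumsum(cond)

# Position of each row inside its group: 1 2 1 2 3 4 1 2
ref <- ave(seq_along(pdf_col), grp, FUN = seq_along)

# Use that position to look up values from the top of the pdf column
filled <- pdf_col[ref]
filled
```

Because ref restarts at 1 whenever cond is TRUE, the lookup restarts from the first pdf value at those rows.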

Conditionally changing only some cells in a data frame - ifelse() failure?

I am trying to conditionally change some items when cleaning survey data.
I've got two questions, Question X and Question Y. If they respond 1 or 2 for Question X, they go on to answer Question Y. If they answer 3 or 4 for Question X, they skip Question Y.
If they answer X with 1 or 2 and then skip Y, I want to record their 'NULL!' entries as NA - they just didn't answer the question when they should have.
If they answer X with 3 or 4 and then skip Y, I want to record their 'NULL!' entries as 0 - they weren't supposed to answer the question, so they didn't.
Here's a reproducible dataset I made:
set.seed(1)
df <- data.frame(
X = as.factor(sample(c("1.00", "2.00", "3.00", "4.00"), 10, replace = TRUE)),
Y = as.factor(sample(c("1.00", "2.00", "#NULL!"), 10, replace = TRUE))
)
df
I'm trying to replace the aforementioned 'NULL!' fields with either NA or 0 respectively. I've been trying it with ifelse() and have had little luck - it appears to return anything that is 1.00 or 2.00 as NA and 3.00 or 4.00 as 0. Is there a better way to do this? What am I doing wrong?
levels(df$Y) <- c(levels(df$Y), 0)
df$Y <- ifelse(df$X == '3.00'| df$X == '4.00', df$Y[df$y == 'NULL!'] <- 0, df$Y[df$Y == '#NULL!'] <- NA)
df
Thank you for your help!
You're doing a couple of things the hard way. First, using factors constrains you to the levels that already exist in a particular factor, which may not be what you want. Second, you have a level of "#NULL!" but are attempting (unsuccessfully) to test for a level of "NULL!". I'm guessing you wanted them to be the same level. Third, you are attempting to use "<-" inside the second and third arguments of ifelse. That will not succeed in the manner you intended: ifelse only sees the value such an expression returns (its right-hand side, here 0 or NA), not the subset named on the LHS.
You can instead use nested ifelse:
df$Y <- ifelse( (df$X == '3.00'| df$X == '4.00') & df$Y == "#NULL!", 0,
ifelse( df$Y == "#NULL!", NA, df$Y) ) # only mess with "Nulls"
df
X Y
1 2.00 1.00
2 2.00 1.00
3 3.00 0
4 4.00 2.00
5 1.00 <NA>
6 4.00 2.00
7 4.00 0
8 3.00 0
9 3.00 2.00
10 1.00 <NA>
To prevent the missing levels problem which you handled by adding a "0" level, I instead made my dataframe so it contained character vectors:
set.seed(1)
df <- data.frame(X = sample(c("1.00", "2.00", "3.00", "4.00"), 10, replace = TRUE),
Y = sample(c("1.00", "2.00", "#NULL!"), 10, replace = TRUE),
stringsAsFactors=FALSE)
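With character columns, the same recoding can also be written as two logical-index assignments instead of nested ifelse calls. This is a sketch on a small hand-made vector (not the sampled data above); skipped, should_skip and Y_clean are illustrative names.

```r
X <- c("1.00", "2.00", "3.00", "4.00", "3.00")
Y <- c("#NULL!", "1.00", "#NULL!", "2.00", "1.00")

# Rows where Question Y was left unanswered
skipped <- Y == "#NULL!"

# Rows where the respondent was routed past Question Y
should_skip <- X %in% c("3.00", "4.00")

Y_clean <- Y
Y_clean[skipped & should_skip]  <- "0"  # legitimately skipped
Y_clean[skipped & !should_skip] <- NA   # should have answered but did not
Y_clean
```

Answers that are not "#NULL!" are left untouched, whatever the value of X.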
Earlier tidyverse code:
library(tidyverse)
df %>% mutate(Y = case_when(
X %in% c("3.00", "4.00") & Y == "#NULL!" ~ "0",
Y == "#NULL!" ~ NA_character_,
TRUE ~ as.character(Y)))
How about this?
set.seed(1)
df <- data.frame(
X = as.factor(sample(c("1.00", "2.00", "3.00", "4.00"), 10, replace = TRUE)),
Y = as.factor(sample(c("1.00", "2.00", "#NULL!"), 10, replace = TRUE))
)
df$X <- as.character(df$X)
df$Y <- as.character(df$Y)
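# Parentheses are needed because & binds more tightly than |
df$Y <- ifelse((df$X=="1.00" | df$X=="2.00") & df$Y == "#NULL!", NA, df$Y)
# Only recode the skipped entries; real answers are kept
df$Y <- ifelse((df$X=="3.00" | df$X=="4.00") & df$Y == "#NULL!", "0.00", df$Y)
df
X Y
1 2.00 1.00
2 2.00 1.00
3 3.00 0.00
4 4.00 2.00
5 1.00 <NA>
6 4.00 2.00
7 4.00 0.00
8 3.00 0.00
9 3.00 2.00
10 1.00 <NA>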

Create factor labels in a DF using a sequence of numbers

I have a data.frame containing numerics. I want to create a new column within that data.frame that will house factor labels using (letters[]). I want these factor labels to be built from a sequence of numbers that I have, and can change every time.
For example, my original DF has 1 column x containing numerics, I then have a sequence of numbers (3,7,9). So I need the new FLABEL column to populate according to the number sequence, i.e. first 3 lines are a, next 4 lines b and so on.
x FLABEL
0.23 a
0.21 a
0.19 a
0.27 b
0.25 b
0.22 b
0.15 b
0.09 c
0.32 c
0.19 d
0.17 d
I'm struggling with how to do this. I'm assuming some form of for-loop, given that my number sequence can vary in length every time I run it, so I could be populating letters a & b, or many more.
Based on the comment by @scoa, I suggest the following modified approach:
series <- c(3, 7, 9)
series <- c(series, nrow(DF)) # This ensures that the sequence extends to the last row of DF
series2 <- c(series[1] ,diff(series))
DF$FLABEL <- rep(letters[1:length(series2)], series2)
#> DF
# x FLABEL
#1 0.23 a
#2 0.21 a
#3 0.19 a
#4 0.27 b
#5 0.25 b
#6 0.22 b
#7 0.15 b
#8 0.09 c
#9 0.32 c
#10 0.19 d
#11 0.17 d
By using diff() the length of each sequence is calculated based on the index numbers in the input vector series. In this case, the index values 3, 7, 9 are converted into the number of repetitions of subsequent letters up to the last row of the data frame and stored in series2: 3, 4, 2, 2.
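The conversion from end positions to run lengths can be checked in isolation. Below is a self-contained sketch using the x values from the question; breaks, ends and run_lengths are illustrative names, and prepending 0 before diff() is just a compact variant of the c(series[1], diff(series)) step.

```r
DF <- data.frame(x = c(0.23, 0.21, 0.19, 0.27, 0.25, 0.22,
                       0.15, 0.09, 0.32, 0.19, 0.17))
breaks <- c(3, 7, 9)

# Last row of each block, with the final block ending at the last row of DF
ends <- c(breaks, nrow(DF))

# Differences between consecutive end positions give the block sizes
run_lengths <- diff(c(0, ends))

# Repeat one letter per block, as many times as the block is long
DF$FLABEL <- factor(rep(letters[seq_along(run_lengths)], times = run_lengths))
DF
```

No loop is needed, and the number of letters follows from the length of breaks.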
data
text <- "x FLABEL
0.23 x
0.21 x
0.19 x
0.27 x
0.25 x
0.22 x
0.15 x
0.09 x
0.32 x
0.19 x
0.17 x"
DF <- read.table(text = text, header = TRUE)

How do you subset a data frame in R based on a minimum sample size

Let's say you have a data frame with two levels of factors that looks like this:
Factor1 Factor2 Value
A 1 0.75
A 1 0.34
A 2 1.21
A 2 0.75
A 2 0.53
B 1 0.42
B 2 0.21
B 2 0.18
B 2 1.42
etc.
How do I subset this data frame ("df", if you will) based on the condition that the combination of Factor1 and Factor2 (Fact1*Fact2) has more than, say, 2 observations? Can you use the length argument in subset to do this?
library(data.table)
dt = data.table(your_df)
dt[, if(.N > 2) .SD, list(Factor1, Factor2)]
# Factor1 Factor2 Value
#1: A 2 1.21
#2: A 2 0.75
#3: A 2 0.53
#4: B 2 0.21
#5: B 2 0.18
#6: B 2 1.42
Assuming your data.frame is called mydf, you can use ave to create a logical vector to help subset:
mydf[with(mydf, as.logical(ave(Factor1, Factor1, Factor2,
FUN = function(x) length(x) > 2))), ]
# Factor1 Factor2 Value
# 3 A 2 1.21
# 4 A 2 0.75
# 5 A 2 0.53
# 7 B 2 0.21
# 8 B 2 0.18
# 9 B 2 1.42
Here's ave counting up your combinations. Notice that ave returns an object the same length as the number of rows in your data.frame (this makes it convenient for subsetting).
> with(mydf, ave(Factor1, Factor1, Factor2, FUN = length))
[1] "2" "2" "3" "3" "3" "1" "3" "3" "3"
The next step is to compare that length to your threshold. For that we need an anonymous function for our FUN argument.
> with(mydf, ave(Factor1, Factor1, Factor2, FUN = function(x) length(x) > 2))
[1] "FALSE" "FALSE" "TRUE" "TRUE" "TRUE" "FALSE" "TRUE" "TRUE" "TRUE"
Almost there... but since the first item was a character vector, our output is also a character vector. We want it as.logical so we can directly use it for subsetting.
ave doesn't work on objects of class factor, in which case you'll need to do something like:
mydf[with(mydf, as.logical(ave(as.character(Factor1), Factor1, Factor2,
FUN = function(x) length(x) > 2))),]
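The character and factor detours can be sidestepped by giving ave a numeric first argument, since only the grouping variables matter for counting. A minimal sketch, rebuilding the example data by hand; n and big are illustrative names.

```r
mydf <- data.frame(
  Factor1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Factor2 = c(1, 1, 2, 2, 2, 1, 2, 2, 2),
  Value   = c(0.75, 0.34, 1.21, 0.75, 0.53, 0.42, 0.21, 0.18, 1.42))

# Count the rows in each Factor1/Factor2 combination; a numeric first
# argument means the counts come back as numbers, not strings
n <- ave(seq_along(mydf$Value), mydf$Factor1, mydf$Factor2, FUN = length)

# Keep only the combinations with more than 2 observations
big <- mydf[n > 2, ]
big
```

This keeps the A/2 and B/2 groups, the same rows as the ave answer above.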
You can use interaction and table to see the number of observations for each interaction (mydata is your data) and then use %in% to subset the data.
mydata$inter<-with(mydata,interaction(Factor1,Factor2))
table(mydata$inter)
A.1 B.1 A.2 B.2
2 1 3 3
mydata[!mydata$inter %in% c("A.1","B.1"), ]
Factor1 Factor2 Value inter
3 A 2 1.21 A.2
4 A 2 0.75 A.2
5 A 2 0.53 A.2
7 B 2 0.21 B.2
8 B 2 0.18 B.2
9 B 2 1.42 B.2
Updated as per @Ananda's comment: you can use the following one-liner after creating the interaction variable.
mydata[mydata$inter %in% names(which(table(mydata$inter) > 2)), ]
