Filtering Rows with duplicate column values


I was cleaning a dataset for class and noticed that it contains some negative values. Some of those rows also have the same id in two columns (columns 2 and 3).
I'm stumped. I'm trying to draft some code but am unsure where to start, and I couldn't find a similar question. Any advice would be appreciated.
Below is a sample table similar to the table I have.
df <- data.frame(A=c(1,2,4,7,8), B=c(2,2,4,9,9), C=c(0,1,5,3,4))
Should I use ifelse() nested inside filter()? I want to filter the data table so that it excludes rows that have the same value in columns A and B. Using the table above as an example, what code would return rows 1, 4 and 5?
(Sorry, the example above keeps rendering as code rather than a table.)

Up to now, the question has not received a proper answer.
If I understand correctly, the OP wants to know how to remove (filter out) the rows where columns A and B have identical values or, in other words, how to keep the rows where A and B differ.
This is a basic question for which different approaches are available in R:
base R
df[df$A != df$B, ]
or
subset(df, A != B)
dplyr
as already mentioned in Martin Gal's comment
dplyr::filter(df, A != B)
data.table
as the question was tagged with data.table
data.table::setDT(df)[A != B]
All return rows 1, 4, and 5, e.g.,
  A B C
1 1 2 0
4 7 9 3
5 8 9 4
There is no ifelse() required.
Data
df <- data.frame(
A = c(1, 2, 4, 7, 8),
B = c(2, 2, 4, 9, 9),
C = c(0, 1, 5, 3, 4)
)
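One caveat the approaches above do not spell out: if A or B can contain NA, the comparison A != B yields NA for those rows. subset(), dplyr::filter() and data.table drop such rows, while df[df$A != df$B, ] returns an all-NA row for each of them. A minimal sketch, assuming a hypothetical df_na with a missing value in B (not part of the original data):

```r
# Hypothetical variant of the sample data with one missing value in B
df_na <- data.frame(
  A = c(1, 2, 4, 7, 8),
  B = c(2, 2, NA, 9, 9),
  C = c(0, 1, 5, 3, 4)
)

# Plain logical indexing keeps an all-NA row where the comparison is NA
with_na_row <- df_na[df_na$A != df_na$B, ]

# which() drops NA comparisons, so only rows where A and B truly differ remain
keep <- which(df_na$A != df_na$B)
clean <- df_na[keep, ]
```

Whether a row with a missing value should count as "different" is a decision about the data; the sketch simply excludes it.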

Related

How do I add percentages beside frequency values in a one-way frequency table?

I just recently started using R, and it's also my first time posting here (sorry if I miss a few details in my question). I have a dataset that contains around 22 questions/statements in 5-point Likert-scale format. Each statement has a dedicated column with the respective answers under it.
Here is a sample data frame of what it looks like (but with only 3 columns instead of 22):
q1 = c(1, 2, 2, 1, 3, 4, 3, 5, 2, 2)
q2 = c(2, 3, 5, 5, 4, 5, 1, 1, 5, 3)
q3 = c(4, 4, 2, 3, 2, 1, 1, 1, 5, 5)
data <- data.frame(q1, q2, q3)
colnames(data) = c("This is statement 1.", "This is statement 2.", "This is statement 3.")
data
I have a few specific requirements for it:
It needs to be horizontal, so that each statement and its responses form one row. It would be too long if I set it up vertically for each of the 22 questions.
If possible, it should be compatible with knitr::kable (meaning it looks decent when knit in R Markdown).
Each element in the data frame should have a corresponding row percentage enclosed in parentheses.
The header of the table should be 1, 2, 3, 4, 5 (since they are all 5-point Likert-scale questions).
Here is a screenshot similar to what I want to achieve (taken from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3490724/):
In the reference picture, each question and its responses (yes/no, with their percentages in parentheses) take up one row, making it compact. Each frequency has a corresponding row percentage in parentheses. I want the same thing, but instead of yes and no, the headers should be 1, 2, 3, 4, 5.
I tried using the table function, but I couldn't figure out how to add the percentage values to each element. Thanks in advance!

We could use tidyr to arrange the data, janitor to create the table and format the percentages, and knitr/gt to make it prettier ;)
library(tidyr)
library(janitor)
library(gt)
data |>
  pivot_longer(everything(),
               names_to = "Question") |>
  tabyl(Question, value) |>
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 0) |>
  adorn_ns(position = "front") |>
  gt() # or knitr::kable()

In base R, just do
tbl1 <- table(stack(data)[2:1] )
tbl2 <- proportions(tbl1, 1) * 100
tbl1[] <- sprintf('%d (%d%%)', tbl1, tbl2)
-output
tbl1
                      values
ind                    1       2       3       4       5
  This is statement 1. 2 (20%) 4 (40%) 2 (20%) 1 (10%) 1 (10%)
  This is statement 2. 2 (20%) 1 (10%) 2 (20%) 1 (10%) 4 (40%)
  This is statement 3. 3 (30%) 2 (20%) 1 (10%) 2 (20%) 2 (20%)
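The base R version above relies on every percentage being a whole number, because `%d` rejects non-integer values. A hedged variant, assuming you want percentages rounded to whole numbers and all five scale points shown even when nobody picked one of them; the short column names and the objects long, counts, pct, out and out_df are illustrative:

```r
q1 <- c(1, 2, 2, 1, 3, 4, 3, 5, 2, 2)
q2 <- c(2, 3, 5, 5, 4, 5, 1, 1, 5, 3)
q3 <- c(4, 4, 2, 3, 2, 1, 1, 1, 5, 5)
data <- data.frame(q1, q2, q3)

long <- stack(data)                               # columns: values, ind
long$values <- factor(long$values, levels = 1:5)  # keep all five scale points

counts <- table(long$ind, long$values)            # one row per question
pct <- prop.table(counts, 1) * 100                # row percentages

# %.0f rounds the percentage, so fractional values do not raise an error
out <- counts
out[] <- sprintf("%d (%.0f%%)", as.vector(counts), as.vector(pct))

# A plain data frame is easier to hand to knitr::kable()
out_df <- as.data.frame.matrix(out)
```

Passing out_df to knitr::kable() should then give one row per statement with the headers 1 to 5.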

Create a new column of cumulative value based on multiple columns in data table

This is my first post after days of searching for an answer. I'm transitioning from R data frames to data.table and having difficulties.
What I want to achieve is a cumulative count based on an indicator computed from multiple columns/variables.
I can do that quite easily with a data frame:
DF = data.frame(
a1 = c(1, 2, 3, 4, 5),
a2 = c(1, 2, 3, 4, 5),
a3 = c(1, 2, 3, 4, NA)
)
DF$b1 <- as.numeric(0)
for (i in 1:3) {
  DF$b1 <- as.numeric(DF[i] > 0) + DF$b1
}
However, it is not so straightforward to me in data.table. What I have done is the following:
DT<-setDT(DF)
DT[,b1:= as.numeric(DT[,1]>0)+as.numeric(DT[,2]>0)+as.numeric(DT[,3]>0)]
The code above works, but it isn't user friendly if I want to increase the number of columns analyzed to, say, 10. With a data frame, I can just change the index from 1:3 to 1:10.
I'd appreciate any comments on how to improve the data.table code above. It would also be very helpful if you could share good resources or documentation on this type of practical problem: referencing column indices in a loop for a data.table. Thanks.

You can try rowSums after converting your table to logical via .SD > 0, i.e.
DT[, b1 := rowSums(.SD > 0)][]
#    a1 a2 a3 b1
# 1:  1  1  1  3
# 2:  2  2  2  3
# 3:  3  3  3  3
# 4:  4  4  4  3
# 5:  5  5 NA NA
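To address the scaling concern in the question, the columns can be selected explicitly with .SDcols, which also keeps an existing b1 column out of the sum. A minimal sketch, assuming the indicator columns all share the prefix "a"; b2 with na.rm = TRUE is an illustrative extra, not something the question asked for:

```r
library(data.table)

DT <- data.table(
  a1 = c(1, 2, 3, 4, 5),
  a2 = c(1, 2, 3, 4, 5),
  a3 = c(1, 2, 3, 4, NA)
)

# Pick the columns to analyse by name pattern (assumed prefix "a")
cols <- grep("^a", names(DT), value = TRUE)

# Count positive values per row; NA propagates as in the answer above
DT[, b1 := rowSums(.SD > 0), .SDcols = cols]

# Same count, but ignoring missing values instead of returning NA
DT[, b2 := rowSums(.SD > 0, na.rm = TRUE), .SDcols = cols]
```

Going from 3 to 10 columns then only changes what ends up in cols, not the expression itself.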

filter the same values in a row add the same values together

In the sample dataset below, I want to remove the icase_id values that appear more than twice (>2) or fewer than twice (<2), i.e. keep only the values that appear exactly twice.
icase_id 2, 2, 3, 3, 3, 1, 4, 4
summ 2, 3, 1, 2, 3, 4, 2, 1
After that, I want to total summ for each remaining icase_id and keep a single entry per icase_id, like this:
icase_id 2, 4
summ 5, 3
I need help accomplishing this. Thanks in advance.

Pretty basic stuff with dplyr:
library(dplyr)
df <- data.frame(icase_id = c(2, 2, 3, 3, 3, 1, 4, 4), summ = c(2, 3, 1, 2, 3, 4, 2, 1))
df %>%
  group_by(icase_id) %>%
  filter(n() == 2) %>%          # keep ids that occur exactly twice
  summarise(summ = sum(summ))   # total summ per remaining id
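For readers without dplyr, roughly the same result can be sketched in base R. This assumes "appears exactly twice" is the intended filter, as in the answer above; n and res are illustrative names:

```r
df <- data.frame(
  icase_id = c(2, 2, 3, 3, 3, 1, 4, 4),
  summ     = c(2, 3, 1, 2, 3, 4, 2, 1)
)

# Size of the group each row belongs to
n <- ave(df$icase_id, df$icase_id, FUN = length)

# Keep ids that occur exactly twice, then total summ per id
res <- aggregate(summ ~ icase_id, data = df[n == 2, ], FUN = sum)
res
```

With the sample data this leaves icase_id 2 and 4 with totals 5 and 3, matching the desired output in the question.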

How return the count of number of occurrences of an integer in a vector, in a new vector using R [duplicate]

New to R. I have seen a lot of similar questions where tables are used to count the number of occurrences, but I want something slightly different: for each integer in vector_1 (e.g. 1 through 10), count how many times it occurs in vector_2 and return those counts in a third vector, vector_3.
Desired Result:
vector_1 <- c(1:10)
vector_2 <- c(3, 4, 4, 5, 7, 9, 10)
vector_3 <- c(0, 0, 1, 2, 1, 0, 1, 0, 1, 1)
I have tried using for loops such as:
for (i in 1:10) {
for (j in vector_2) {
print(i) <- vector_3
}
}
Obviously this code doesn't work, but I can't find a good way to sum the occurrences between the vectors. Any guidance or alternative approaches would be welcome.
*Edit: almost all answers I have seen to similar questions use tables to count the occurrences within vector_2; I haven't come across questions that compare the two vectors and then output the result.

Your code doesn't make sense to me. Anyway, you can easily compare each value in vector 1 with each value in vector 2 using outer. rowSums then can give you the required counts.
vector_1 <- c(1:10)
vector_2 <- c(3, 4, 4, 5, 7, 9, 10)
rowSums(outer(vector_1, vector_2, "=="))
#[1] 0 0 1 2 1 0 1 0 1 1

Also you can create a factor variable:
vector_2 <- c(3, 4, 4, 5, 7, 9, 10)
vector_2 <- factor(vector_2,levels = 1:10)
table(vector_2)
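table() returns a named table rather than the plain vector_3 the question asks for. Two hedged ways to get a bare integer vector, assuming vector_1 is exactly 1:10 as in the question (tabulate() only counts positive integers from 1 up to nbins):

```r
vector_1 <- 1:10
vector_2 <- c(3, 4, 4, 5, 7, 9, 10)

# tabulate() counts occurrences of 1..nbins and returns a plain integer vector
vector_3 <- tabulate(vector_2, nbins = length(vector_1))

# Element-wise version: for each value in vector_1, count matches in vector_2
vector_3_loop <- vapply(vector_1, function(v) sum(vector_2 == v), integer(1))
```

The vapply() form is closer to the loop attempted in the question and also works when vector_1 is not a run of consecutive integers starting at 1.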

Conditional Replacement Column Content--many ids to be updated

Thinking I could take the easy way out, I was going to use ifelse to replace id codes in an entire dataset. I have a dataset with an id column. I have to replace the old ids with updated ids, but there are 50k+ rows with 270 unique ids. So, I first tried:
df$id<- ifelse(df$id== 2, 1,
ifelse(df$id== 3, 5,
ifelse(df$id == 4, 5,
ifelse(df$id== 6, NA,
ifelse(df$id== 7, 7,
ifelse(df$id== 285, NA,
ifelse(df$id== 8, 10,.....
ifelse(df$id == 200, 19, df$id)
While this would have worked, I am limited to 51 nested calls, and I cannot split them up because each piece would cover only about a quarter of the set. Updates from the first part would then interfere with later ones, because the old and new codes overlap.
I then tried
df$id[df$id== 2] <- 1
and I was going to do that for every code. However, if I update all 2s to 1, a later step maps the old 1 to some other number X, and it would then change both the old and the new 1s, when I only want the old 1 to become X. I think this actually rules out the ifelse approach too, even if 51 were not the limit. Is there a function similar to VLOOKUP in Excel? Any ideas?
Thanks!
An older thread about replacing cell contents, which does not work in my case:
Replace contents of factor column in R dataframe

partial example
df <- data.frame(id=seq(1, 10))
old.id <- c(2, 3, 4, 6)
new.id <- c(1, 5, 5, NA)
df$id[df$id %in% old.id] <- new.id[unlist(sapply(df$id, function(x) which(old.id==x)))]
output
> df
   id
1   1
2   1
3   5
4   5
5   5
6  NA
7   7
8   8
9   9
10 10
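The same lookup can also be written with match(), which behaves much like VLOOKUP: every id is looked up against the original values in a single pass, so an already updated id is never remapped a second time. A minimal sketch using the partial mapping from the example above; idx is an illustrative name:

```r
df <- data.frame(id = seq(1, 10))
old.id <- c(2, 3, 4, 6)
new.id <- c(1, 5, 5, NA)

# Position of each id in old.id, or NA when the id has no replacement
idx <- match(df$id, old.id)

# Replace only the ids found in the lookup; leave the others untouched
df$id <- ifelse(is.na(idx), df$id, new.id[idx])
```

For the full set of 270 ids, old.id and new.id could come from a two-column lookup table instead of being typed by hand.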
