How to create new columns of duplicate and unique values in R

I'm trying to use a .csv file with 2 columns (e.g. col1 and col2) to generate 3 new columns: a column composed of all the values in common (duplicates) that col1 and col2 have (col3), a column composed of the values unique to col1 (col4) and a column composed of the values unique to col2 (col5). I'd then like to export the file.
For this purpose I hoped to use dplyr with the functions mutate(), pipes %>%, as well as the other functions duplicated(), unique() and write_csv(). However, nothing I've tried so far has worked.
My data isn't practical to paste here, as it is tens of thousands of rows of categorical values (names), so I can provide a simple example of the type of data I am working with:
col1 <- data.frame(c("a", "b", "c", "d", "e"))
col2 <- data.frame(c("a", "c", "e", "f", "g"))
If the code is correct, I would like to create 3 new objects (columns: col3 for duplicates, col4 and col5 for unique values of each original column), which return:
col3 = a, c, e
col4 = b, d
col5 = f, g
and then save this file as a .csv, for example:
write_csv(col_umns, path = "data/col_umns.csv")
Many thanks for any help!
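One possible approach, sketched in base R: treat the two columns as plain character vectors and use the set functions intersect() and setdiff(). The three results have different lengths, so this sketch pads the shorter ones with NA before combining them into one data frame. The object name col_umns is taken from the question; the output location (a temporary file) and the NA padding are assumptions, not something the question specifies.

```r
# Example data from the question, as plain character vectors
col1 <- c("a", "b", "c", "d", "e")
col2 <- c("a", "c", "e", "f", "g")

col3 <- intersect(col1, col2)  # values present in both columns
col4 <- setdiff(col1, col2)    # values only in col1
col5 <- setdiff(col2, col1)    # values only in col2

# Columns of a data frame must have equal length, so pad with NA (assumption)
cols <- list(col1 = col1, col2 = col2, col3 = col3, col4 = col4, col5 = col5)
n <- max(lengths(cols))
padded <- lapply(cols, function(x) c(x, rep(NA, n - length(x))))
col_umns <- as.data.frame(padded, stringsAsFactors = FALSE)

# Write to a temporary file here; replace with a real path such as
# "data/col_umns.csv" once that folder exists
out_file <- tempfile(fileext = ".csv")
write.csv(col_umns, out_file, row.names = FALSE, na = "")
```

If you prefer readr, write_csv(col_umns, file = out_file) should do the same job; as far as I know, recent readr versions name that argument file rather than path.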

Related

Assigning complex values to character elements of data frame in R

There are three character columns in my data frame, containing "A", "B", and "C" (this order can vary between data frames). I want to assign values to them: A = 1+0i, B = 2+3i and C = 3+2i. I use as.complex(factor(col1)), and the same for columns two and three, but that makes all three columns equal to 1+0i!
col1 <- c("A","A", "A")
col2 <- c("B", "B","B")
col3 <- c("C","C","C")
df <- data.frame(col1,col2,col3)
print(df)
A= 1+0i
B=2+3i
C=3+2i
df2<- transform(df, col1=as.complex(as.factor(col1)),col2=as.complex(as.factor(col2)),col3=as.complex(as.factor(col3)))
sapply(df2,class)
View(df2)
So this is a weird thing you're doing. You have a column of strings, letters like "A" and "B". Then you have objects with the same names, A = 1 + 0i, etc. Normally we don't treat object names as "data", but you're sort of mixing the two here. The solution I'd propose is to make everything data: combine your A, B, and C values into a vector, and give the vector names accordingly. Then we can replace the values in the data frame with the corresponding values from our named vector:
vec = c(A, B, C)
names(vec) = c("A", "B", "C")
df[] = lapply(df, \(x) vec[x])
df
#   col1 col2 col3
# 1 1+0i 2+3i 3+2i
# 2 1+0i 2+3i 3+2i
# 3 1+0i 2+3i 3+2i
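For completeness, here is a self-contained sketch of the same idea that also shows why the original attempt returned 1+0i everywhere: each column holds a single distinct value, so its factor has one level, and the integer code of that level is 1. The names lookup, codes and df2 are made up for this illustration.

```r
df <- data.frame(col1 = rep("A", 3),
                 col2 = rep("B", 3),
                 col3 = rep("C", 3),
                 stringsAsFactors = FALSE)

# A one-level factor has integer code 1 for every element,
# which is what as.complex() ends up converting
codes <- as.integer(factor(df$col2))

# Named lookup vector (hypothetical name): letter -> complex value
lookup <- c(A = 1+0i, B = 2+3i, C = 3+2i)

# Index the lookup by the strings in each column; unname() drops the
# letter names that indexing would otherwise carry over
df2 <- df
df2[] <- lapply(df, function(x) unname(lookup[as.character(x)]))

sapply(df2, class)
```

The as.character() call is there so the lookup goes by label rather than by integer code if the columns happen to be factors.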

Finding values in column "a" that have different values in column "b" across two data sets

The data contain multiple columns and 3000 rows.
The two data frames share the same OrderNo values, but some Ordertype values differ.
I want to get every OrderNo whose Ordertype differs between the two data frames.
I have isolated the two columns from each data frame and sorted them in ascending order. Then I tried to use cbind to combine the columns and find the values missing from one of them.
xxx <- data.frame( orderNo = c(1:10), Ordertype = c("a", "b", "c", "d", "a", "b", "c", "d", "e", "f"))
yyy <- data.frame( orderNo = c(1:10), Ordertype = c("a", "b", "c", "d", "a", "b", "e", "d", "e", "f"))
In the above example, OrderNo 7 corresponds to "c" in one data frame and "e" in the other. I want my output to be the set of all such order numbers whose Ordertype differs.
It sounds like you want a data frame that contains differences between two data frames, matched by (and including) orderNo. Is that correct?
One possibility is:
res <- merge(xxx, yyy, by = "orderNo")
res[res[,2] != res[,3], ]
#   orderNo Ordertype.x Ordertype.y
# 7       7           c           e
Using dplyr and anti_join you can do the following to find differences:
library(dplyr)
inner_join(anti_join(xxx, yyy), anti_join(yyy, xxx), by='orderNo')
#   orderNo Ordertype.x Ordertype.y
# 1       7           c           e
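If only the order numbers themselves are needed, a base R sketch with match() may be enough. It assumes each orderNo appears at most once per data frame; the names idx, differs and changed are made up for this example.

```r
xxx <- data.frame(orderNo = 1:10,
                  Ordertype = c("a", "b", "c", "d", "a", "b", "c", "d", "e", "f"),
                  stringsAsFactors = FALSE)
yyy <- data.frame(orderNo = 1:10,
                  Ordertype = c("a", "b", "c", "d", "a", "b", "e", "d", "e", "f"),
                  stringsAsFactors = FALSE)

# Position of each xxx order number in yyy (NA if it is absent there)
idx <- match(xxx$orderNo, yyy$orderNo)

# TRUE where the order exists in both frames but the types disagree
differs <- !is.na(idx) & xxx$Ordertype != yyy$Ordertype[idx]

# Order numbers whose Ordertype differs between the two data frames
changed <- xxx$orderNo[differs]
changed
```

Because the lookup goes through match(), the two data frames do not have to be sorted the same way first.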

How to count the number of rows in a data frame where two cells match each other

I have two columns, one with predicted values and one with real values (both strings), and I want to count the rows in which the real value matches the predicted value in the same row.
Is it possible to do something like that with R?
# create sample dataset
df <- data.frame(
col1 = c("a", "b", "c", "d", "e"),
col2 = c("a", "x", "y", "z", "e"),
stringsAsFactors = FALSE
)
# count the number of rows where two columns equal each other
sum( df$col1 == df$col2 )
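One caveat worth a quick sketch: if either column can contain NA, the comparison yields NA for that row and a plain sum() returns NA. The variation below, on made-up data with missing values, skips those rows with na.rm = TRUE and also reports the share of matching rows among the comparable ones. The names matches, n_match and share are assumptions for this example.

```r
# Sample data with missing values (made up for illustration)
df <- data.frame(
  col1 = c("a", "b", NA, "d", "e"),
  col2 = c("a", "x", "y", NA, "e"),
  stringsAsFactors = FALSE
)

# Element-wise comparison; rows with an NA on either side give NA
matches <- df$col1 == df$col2

# Count of matching rows, ignoring rows that could not be compared
n_match <- sum(matches, na.rm = TRUE)

# Proportion of matches among the comparable rows
share <- mean(matches, na.rm = TRUE)
```

Whether a row with a missing value should count as a mismatch instead is a decision for your analysis; in that case, replace the NA entries of matches with FALSE before summing.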

Create an ordered list in a dataframe [duplicate]

This question already has answers here:
Sorting each row of a data frame [duplicate]
Row wise Sorting in R
Row-wise sort then concatenate across specific columns of data frame
Closed 5 years ago.
I have the following data frame:
col1 <- c("a", "b", "c")
col2 <- c("c", "a", "d")
col3 <- c("b", "c", "a")
df <- data.frame(col1,col2,col3)
I want to create a new column in this data frame that has, for each row, the ordered list of the columns col1, col2, col3. So, for the first row it would be a list like "a", "b", "c".
The way I'm handling it is to create a loop but since I have 50k rows, it's quite inefficient, so I'm looking for a better solution.
rown <- nrow(df)
i = 0
while(i<rown){
i = i +1
col1 <- df$col1[i]
col2 <- df$col2[i]
col3 <- df$col3[i]
col1 <- as.character(col1)
col2 <- as.character(col2)
col3 <- as.character(col3)
list1 <- c(col1, col2, col3)
list1 <- list1[order(sapply(list1, '[[', 1))]
a <- list1[1]
b <- list1[2]
c <- list1[3]
df$col.list[i] <- paste(a, b, c, sep = " ")
}
Any ideas on how to make this code more efficient?
EDIT: the other question is not relevant in my case since I need to paste the three columns after sorting each row, so it's the paste statement that is dynamic, I'm not trying to change the data frame by sorting.
Expected output:
col1 col2 col3 col.list
a    c    b    a b c
b    a    c    a b c
c    d    a    a c d
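A loop-free sketch that should scale better: apply() walks over the rows, sort() orders the three values, and paste() with collapse joins them into one string. The data frame itself is left unsorted; only the new column holds the ordered values. The helper vector cols is an assumption for this example.

```r
col1 <- c("a", "b", "c")
col2 <- c("c", "a", "d")
col3 <- c("b", "c", "a")
df <- data.frame(col1, col2, col3, stringsAsFactors = FALSE)

# Columns to combine (hypothetical helper vector)
cols <- c("col1", "col2", "col3")

# For each row: sort the values, then paste them into a single string.
# unname() drops any row names that apply() may attach to the result.
df$col.list <- unname(apply(df[cols], 1, function(r) paste(sort(r), collapse = " ")))

df
```

apply() converts the selected columns to a character matrix first, so this also works when the columns are factors.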

How to delete all rows with counterparts and the counterparts themselves?

Please, have a look at the following data frame:
df <- data.frame(col1 = c(1, -2, -1, 3, 2 , 2),
col2 = c("a", "b", "a", "c", "b", "b"),
col3 = c("d", "e", "f", "g", "h", "i"))
My goal is to delete all rows in df with negative counterparts and the counterparts themselves. Now, what do I mean by a "negative counterpart"? A row has a negative counterpart if there is another row with the same number in col1 but with a minus, and the same value in col2. The value in col3 does not matter. Rows can have multiple counterparts. In this case, only one of them should be deleted. Thus, in the above example, the final data frame should contain only the fourth and either the fifth or sixth row.
The real df has approx. 4*10^5 rows and 25 columns. Most rows do not have a counterpart. So, my idea was to build a for loop that checks for each row whose value in col1 is less than 0, if there is positive counterpart. But I am struggling with the "checking" part.
for (i in seq_len(nrow(df))) {
  if (df$col1[i] < 0) {
    # check for positive counterparts here
  }
}
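A vectorised sketch that avoids the loop, assuming the pairing rule described above: within each group of rows sharing abs(col1) and col2, the k-th positive row cancels against the k-th negative row, and any surplus rows are kept. All helper names (key, sign_key, rank_in_sign and so on) are made up for this example, and building the key with paste() assumes the col1 values print identically for a number and its negative.

```r
df <- data.frame(col1 = c(1, -2, -1, 3, 2, 2),
                 col2 = c("a", "b", "a", "c", "b", "b"),
                 col3 = c("d", "e", "f", "g", "h", "i"),
                 stringsAsFactors = FALSE)

# Rows are potential counterparts if they share this key
key <- paste(abs(df$col1), df$col2, sep = "|")

# Same key, split further by sign, to number rows within each sign
sign_key <- paste(key, sign(df$col1), sep = "|")
rank_in_sign <- ave(seq_along(key), sign_key, FUN = seq_along)

# Number of positive and negative rows per key, repeated on every row
n_pos <- ave(as.integer(df$col1 > 0), key, FUN = sum)
n_neg <- ave(as.integer(df$col1 < 0), key, FUN = sum)

# The first min(n_pos, n_neg) rows of each sign cancel each other out
drop <- rank_in_sign <= pmin(n_pos, n_neg)

result <- df[!drop, ]
result
```

With the example data this keeps the fourth row and one of the two (2, "b") rows; which of the duplicates survives depends on row order, which the question says does not matter.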
