Add rows when values in columns are equal in df - r

For a sample dataframe:
df <- structure(list(animal.1 = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 3L, 1L, 1L), .Label = c("cat", "dog", "horse", "rabbit"), class = "factor"),
animal.2 = structure(c(1L, 2L, 2L, 2L, 4L, 4L, 1L, 1L, 3L,
1L), .Label = c("cat", "dog", "hamster", "rabbit"), class = "factor"),
number = c(5L, 3L, 2L, 5L, 1L, 4L, 6L, 7L, 1L, 11L)), .Names = c("animal.1",
"animal.2","number"), class = "data.frame", row.names = c(NA,
-10L))
... I wish to make a new df with 'animal' duplicates all added together. For example multiple rows with the same animal in columns 1 and 2 will be put together. So for example the dataframe above would read:
cat cat 16
dog dog 7
cat dog 3 etc. etc... (those with different animals would be left as they are). Importantly the sum of 'number' in both dataframes would be the same.
My real df is >400K observations, so anything that anyone could recommend could cope with a large dataset would be great!
Thanks in advance.

One option would be to use data.table. Convert "data.frame" to "data.table" (setDT(), if the "animal.1" rows are equal to "animal.2", then, replace the "number" with sum of "number" after grouping by the two columns, and finally get the unique rows.
library(data.table)
setDT(df)[as.character(animal.1)==as.character(animal.2),
number:=sum(number) ,.(animal.1, animal.2)]
unique(df)
# animal.1 animal.2 number
#1: cat cat 16
#2: cat dog 3
#3: dog dog 7
#4: dog rabbit 1
#5: rabbit rabbit 4
#6: rabbit cat 6
#7: horse cat 7
#8: cat hamster 1
Or an option with dplyr. The approach is similar to data.table. We group by "animal.1", "animal.2", then replace the "number" with sum only when "animal.1" is equal to "animal.2", and get the unique rows
library(dplyr)
df %>%
group_by(animal.1, animal.2) %>%
mutate(number=replace(number,as.character(animal.1)==
as.character(animal.2),
sum(number))) %>%
unique()

Related

Converting a data frame to a motified list

although there are alot of questions concering this topic; I can not seem to find the correct question answer. Therefore I am directing this question to you guys.
The context:
I've got a data set with alot of rows (+150K) with 32 corresponding columns. The second column is a document number. The document number is not a unique ID. So the date contains rows with mutiple rows with the same document number. I like to create a list of the document numbers. This list of document numbers contains another list with the corresponding rows with the same document numbers.
For example:
Here is an example of the data (I included a dput output of the example below).
Document Number Col.A Col.B
A random_56681 random_24984
A random_78738 random_23098
A random_48640 random_32375
B random_96243 random_96927
B random_72045 random_52583
C random_19367 random_20441
C random_96778 random_22161
C random_48038 random_95644
C random_62999 random_44561
Now here is what I am looking for. I need a list that contains the 3 documents (A, B, C). Each of these list needs to contain another list containing the corresponding rows. For example, the main list (lets say my_list) should have 3 lists A, B and C; each of the lists should contain respectively 3, 2 and 4 lists.
Hope I was clear enough in asking the question (if not please let me know).
Here you can find the example data:
structure(list(Document_Number = structure(c(1L, 1L, 1L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
Col.A = structure(c(4L, 7L, 3L, 8L, 6L, 1L, 9L, 2L, 5L), .Label = c("random_19367",
"random_48038", "random_48640", "random_56681", "random_62999",
"random_72045", "random_78738", "random_96243", "random_96778"
), class = "factor"), Col.B = structure(c(4L, 3L, 5L, 9L,
7L, 1L, 2L, 8L, 6L), .Label = c("random_20441", "random_22161",
"random_23098", "random_24984", "random_32375", "random_44561",
"random_52583", "random_95644", "random_96927"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
You can use split like:
split(x, x$Document_Number)
#$A
# Document_Number Col.A Col.B
#1 A random_56681 random_24984
#2 A random_78738 random_23098
#3 A random_48640 random_32375
#
#$B
# Document_Number Col.A Col.B
#4 B random_96243 random_96927
#5 B random_72045 random_52583
#
#$C
# Document_Number Col.A Col.B
#6 C random_19367 random_20441
#7 C random_96778 random_22161
#8 C random_48038 random_95644
#9 C random_62999 random_44561
An option is group_split
library(dplyr)
df1 %>%
group_split(Document_Number)

R code to assign a sequence based off of multiple variables [duplicate]

This question already has answers here:
Recode dates to study day within subject
(2 answers)
Closed 3 years ago.
I have data structured as below:
ID Day Desired Output
1 1 1
1 1 1
1 1 1
1 2 2
1 2 2
1 3 3
2 4 1
2 4 1
2 5 2
3 6 1
3 6 1
Is it possible to create a sequence for the desired output without using a loop? The dataset is quite large so a loop won't work, is it possible to do this with the dplyr package or maybe a combination of cumsum/diff?
An option is to group by 'ID', and then do a match on the 'Day' with the unique values of 'Day' column
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(desired = match(Day, unique(Day)))
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L), Day = c(1L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 5L, 6L, 6L)), row.names = c(NA,
-11L), class = "data.frame")

Need help replacing values with NA when another condition is met in R (i.e. when another variable is a specific value)

I'm trying to delete some repeating information in my data set and replace it with NA. Here's an example of the data:
DataTable1
ID Day x y
1 1 1 3
1 2 1 3
2 1 2 5
2 2 2 5
3 1 3 4
3 2 3 4
4 1 4 6
4 2 4 6
I'm trying to replace "x" and "y" values with "NA" when Day=1. This is what I want:
ID Day x y
1 1 NA NA
1 2 1 3
2 1 NA NA
2 2 2 5
3 1 NA NA
3 2 3 4
4 1 NA NA
4 2 4 6
I'm not really sure where to start or how to go about this. I tried using the replace_with_na_if function from the naniar library. Otherwise, I am unsure what to try.
replace_with_na_if(data.frame=DataTable1$x,
condition=DataTable1$Day== 2)
I received an error message that reads:
Error in replace_with_na_if(data.frame = DataTable1$x, condition = DataTable1$Day == :
unused argument (data.frame = DataTable1$x)
An option in base R would be to create a logical vector based on the elements of 'Day'. Use that index to subset the 'x', 'y' columns and assign them to NA
i1 <- df1$Day == 1
df1[i1, c('x', 'y')] <- NA
Here's a data.table solution. Since you may be new to R, you need to install the data.table package first. If you have a large data set, data.table may work faster than using data frame. Also, I find the syntax to be easy to read and understand.
#Create the data frame:
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))
library(data.table)
dt <- setDT(df) # convert the data frame to a data.table
dt[Day == 1, c("x","y") := NA] # where Day equals 1, make the columns x and y equal NA
Good luck and welcome to stackoverflow!
Using dplyr, we can use mutate_at and replace like
library(dplyr)
df %>% mutate_at(vars(x, y), ~replace(., Day == 1, NA))
# ID Day x y
#1 1 1 NA NA
#2 1 2 1 3
#3 2 1 NA NA
#4 2 2 2 5
#5 3 1 NA NA
#6 3 2 3 4
#7 4 1 NA NA
#8 4 2 4 6
data
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))

How to calculate the amount of different 'combinations of classes' of a variable depending on a other variable?

CustomerID MarkrtungChannel OrderID
1 A 1
2 B 2
3 A 3
4 B 4
5 C 5
1 C 6
1 A 7
2 C 8
3 B 9
3 B 10
Hi, I want to know which combinations of marketing channels are used by how many customers .
How can I calculate this with R?
E.g. The combination of Marketing channels A and C is used by 1 customer (ID 1)
the combination of Marketing channels C and B is also used by 1 customer (ID 2)
And so on...
and here's a tidyverse way.
library(tidyverse)
data.df%>%
group_by(CustomerID)%>%
summarize(combo=paste0(sort(unique(MarkrtungChannel)),collapse=""))%>%
ungroup()%>%
group_by(combo)%>%
summarize(n.users=n())
counting the number of people using each combo at the end.
You can do it multiple ways. Here is data.table way:
# Here is your data
df<-structure(list(CustomerID = c(1L, 2L, 3L, 4L, 5L, 1L, 1L, 2L,
3L, 3L), MarkrtungChannel = structure(c(1L, 2L, 1L, 2L, 3L, 3L,
1L, 3L, 2L, 2L), .Label = c("A", "B", "C"), class = "factor"),
OrderID = 1:10), .Names = c("CustomerID", "MarkrtungChannel",
"OrderID"), class = "data.frame", row.names = c(NA, -10L))
df[]<-lapply(df[],as.character)
# Here is the combination field
library(data.table)
setDT(df)
df[,Combo:=.(list(unique(MarkrtungChannel))), by=CustomerID]
# Or (to get the combination counts)
df[,list(combo=(list(unique(MarkrtungChannel)))), by=CustomerID][,uniqueN(CustomerID),by=combo]

R - dplyr map slice for repeat rows

I have trouble combining slice and map.
I am interested of doing something similar to this; which is, in my case, transforming a compact person-period file to a long (sequential) person-period one. However, because my file is too big, I need to split the data first.
My data look like this
group id var ep dur
1 A 1 a 1 20
2 A 1 b 2 10
3 A 1 a 3 5
4 A 2 b 1 5
5 A 2 b 2 10
6 A 2 b 3 15
7 B 1 a 1 20
8 B 1 a 2 10
9 B 1 a 3 10
10 B 2 c 1 20
11 B 2 c 2 5
12 B 2 c 3 10
What I need is simply this (answer from this)
library(dplyr)
dt %>% slice(rep(1:n(),.$dur))
However, I am interested in introducing a split(.$group).
How I am suppose to do so ?
dt %>% split(.$group) %>% map_df(slice(rep(1:n(),.$dur)))
Is not working for example.
My desired output is the same as dt %>% slice(rep(1:n(),.$dur))
which is
group id var ep dur
1 A 1 a 1 20
2 A 1 a 1 20
3 A 1 a 1 20
4 A 1 a 1 20
5 A 1 a 1 20
6 A 1 a 1 20
7 A 1 a 1 20
8 A 1 a 1 20
9 A 1 a 1 20
10 A 1 a 1 20
.....
But I need to split this operation because the file is too big.
data
dt = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L,
2L, 2L), .Label = c("1", "2"), class = "factor"), var = structure(c(1L,
2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L), .Label = c("a",
"b", "c"), class = "factor"), ep = structure(c(1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2",
"3"), class = "factor"), dur = c(20, 10, 5, 5, 10, 15, 20,
10, 10, 20, 5, 10)), .Names = c("group", "id", "var", "ep",
"dur"), row.names = c(NA, -12L), class = "data.frame")
map takes two arguments: a vector/list in .x and a function in .f. It then applies .f on all elements in .x.
The function you are passing to map is not formatted correctly. Try this:
f <- function(x) x %>% slice(rep(1:n(), .$dur))
dt %>%
split(.$group) %>%
map_df(f)
You could also use it like this:
dt %>%
split(.$group) %>%
map_df(slice, rep(1:n(), dur))
This time you directly pass the slice function to map with additional parameters.
I'm not quite sure what your desired final output is, but you could use tidyr to nest the data that you want to repeat and a simple function to expand levels of your nested data, very similar to Tutuchan's answer.
expand_df <- function(df, repeats) {
df %>% slice(rep(1:n(), repeats))
}
dt %>%
tidyr::nest(var:ep) %>%
mutate(expanded = purrr::map2(data, dur, expand_df)) %>%
select(-data) %>%
tidyr::unnest()
Tutuchan's answer gives exactly the same output as your original approach - is that what you were looking for? I don't know if it will have any advantage over your original method.

Resources