Sum and order columns [duplicate] - r

This question already has answers here:
Sorting rows alphabetically
(4 answers)
How to sum a variable by group
(18 answers)
Closed 4 years ago.
I have a large dataset that I want to simplify, but I'm having trouble with one thing.
The following table shows origin-destination combinations. The count column gives the number of occurrences of each pair, for example A to B.
From To count
A B 2
A C 1
C A 3
B C 1
The problem is that, for example, A to C (1) is actually the same as C to A (3). Direction doesn't matter to me, only that there is a connection between A and C, so how can I simply get A to C (4)?
From and To are factors with 400 levels, so I can't do this manually. Is there something in dplyr or similar that can solve this for me?

df[1:2] <- t(apply(df[1:2], 1, sort))
aggregate(count ~ From + To, df, sum)
results in:
From To count
1 A B 2
2 A C 4
3 B C 1

Here is a base R method using aggregate, sort, paste, and mapply.
with(df, aggregate(count,
list(route=mapply(function(x, y) paste(sort(c(x, y)), collapse=" - "),
From, To)), sum))
route x
1 A - B 2
2 A - C 4
3 B - C 1
Here, mapply takes pairs of elements from the From and To variables, sorts them, and pastes them into a single string with collapse=" - ". The resulting string vector is used in aggregate to group the observations and sum the count values. with reduces typing.
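Since the question asks about dplyr, the same idea can be sketched there. This is only a sketch under a few assumptions: From and To are character vectors (factors would need as.character() first), dplyr is version 1.0 or later (for the .groups argument), and the column names a and b are made up for illustration.

```r
library(dplyr)

# Example data from the question; stringsAsFactors = FALSE keeps From/To as character
df <- data.frame(From = c("A", "A", "C", "B"),
                 To = c("B", "C", "A", "C"),
                 count = c(2, 1, 3, 1),
                 stringsAsFactors = FALSE)

result <- df %>%
  mutate(a = pmin(From, To),    # alphabetically smaller endpoint of each pair
         b = pmax(From, To)) %>% # alphabetically larger endpoint
  group_by(a, b) %>%
  summarise(count = sum(count), .groups = "drop")

result
```

pmin and pmax work element-wise, so each row gets its endpoints in a fixed order without a row-by-row apply, which may matter on a large dataset.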

Related

Counting number of repetitions in a particular column [duplicate]

This question already has answers here:
Counting the number of elements with the values of x in a vector
(20 answers)
Closed 2 years ago.
For example:
A data frame "housing" has a column "street" with different street names as levels.
I want to return a data frame with the number of houses in each street (level), basically the number of repetitions.
What functions do I use in R?
This should help:
library(dplyr)
housing %>% group_by(street) %>% summarise(Count=n())
This can be done in multiple ways, for instance with base R using table():
table(housing$street)
It can also be done through dplyr, as illustrated by Duck.
Another option (my preference) is using data.table.
library(data.table)
setDT(housing)
housing[, .N, by = street]
summary on a factor reports at most 100 level frequencies by default (its maxsum argument) and lumps the rest into "(Other)". If there are more levels, try:
table(housing$street)
For example, let's generate one hundred one-letter street names and summarise them with table.
set.seed(1234)
housing <- data.frame(street = sample(letters, size = 100, replace = TRUE))
x <- table(housing$street)
x
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# 1 3 5 6 4 6 2 6 5 3 1 3 1 2 5 5 4 1 5 5 3 7 4 5 3 5
As per OP's comment: to use the result in further analyses, it needs to be assigned to a variable, here x. The class of the variable is table, and with most base R functions it behaves like a named vector. For example, to find the most frequent street name, use which.max.
which.max(x)
# v
# 22
The result says that the 22nd position in x has the maximum value and it is called v.
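Because the question asks for a data frame rather than a table, a conversion step may be useful. The following is a minimal sketch using the same simulated housing data; the variable name counts and the column names street and count are chosen here for illustration.

```r
set.seed(1234)
housing <- data.frame(street = sample(letters, size = 100, replace = TRUE))

# as.data.frame() on a table gives one row per level plus a frequency column
counts <- as.data.frame(table(housing$street), responseName = "count")
names(counts)[1] <- "street"

# Most frequent street by name rather than by position
top_street <- names(which.max(table(housing$street)))

head(counts)
top_street
```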

How can I aggregate rows of a data.frame by name, summing the numeric values of the corresponding columns in R? [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 5 years ago.
I am new to RStudio and have a fairly simple problem that I have not been able to solve.
I just want to aggregate the rows of my data.frame by the words contained in its first column.
The data.frame has five columns:
the first contains words;
the second, third, fourth, and fifth contain numeric values.
For example, if the data were:
SecondWord X Y Z Q
NO 1 2 2 1
NO 0 0 1 0
YES 1 1 1 1
I expect to see a result like:
SecondWord X Y Z Q
NO 1 2 3 1
YES 1 1 1 1
How can I do this?
I have tried the following:
test <- read.csv2("test.csv")
df <- aggregate(. ~ SecondWord, data = test, FUN = sum, na.rm = TRUE)
But the values were not the ones I expected to see.
Thank you in advance for your help, and sorry for the "simple" question.
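You can also use the tidyverse:
library(tidyverse)
df <- test %>%
  group_by(SecondWord) %>%
  # across() replaces summarize_each(funs(sum)), which is deprecated in current dplyr
  summarise(across(everything(), sum))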
df
# SecondWord X Y Z Q
# NO 1 2 3 1
# YES 1 1 1 1
ddply should work as well.
For example, something like:
library(plyr)
grouped <- ddply(test, "SecondWord", numcolwise(sum))
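For comparison, here is a base R sketch on the example data from the question, typed in by hand rather than read from test.csv. It assumes the grouping column is spelled SecondWord exactly, since R column names are case sensitive; the names agg and rs are made up for illustration.

```r
# Example data from the question, entered inline
test <- data.frame(SecondWord = c("NO", "NO", "YES"),
                   X = c(1, 0, 1), Y = c(2, 0, 1),
                   Z = c(2, 1, 1), Q = c(1, 0, 1))

# Formula interface: sum every other column within each SecondWord group
agg <- aggregate(. ~ SecondWord, data = test, FUN = sum)

# rowsum() sums the numeric columns per group and uses the groups as row names
rs <- rowsum(test[, -1], group = test$SecondWord)

agg
rs
```

If the real file gives different totals, it may be worth checking with str(test) that read.csv2 parsed the numeric columns as numbers.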

Dplyr replace value based on function of previous column and row [duplicate]

This question already has answers here:
Replace NA with previous and next rows mean in R
(3 answers)
Closed 5 years ago.
I'm trying to replace each NA in B with the mean of B from the previous row and A from the same row, using dplyr. See the example below:
df <- data.frame(A=c(1,1,2),
B=c(2,4,NA))
So in this case the NA would be replaced by 3. How do I do this?
Below is what I was thinking of, but it doesn't work.
dfb <- df %>%
mutate(B = if_else(is.na(B), mean(lag(B),A), B))
Thanks!
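Instead of using mean, we can add the two values explicitly and divide by 2. mean(lag(B), A) does not work because mean averages only its first argument and treats the second one as trim.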
df %>% mutate(B = ifelse(is.na(B),(lag(B) + A)/2, B))
# A B
#1 1 2
#2 1 4
#3 2 3
A simple base R method using subsetting is
df$B[is.na(df$B)] <- (df$B[which(is.na(df$B))-1] + df$A[is.na(df$B)]) / 2
df
A B
1 1 2
2 1 4
3 2 3
is.na returns a logical vector indicating whether each element is NA. which returns the positions of the TRUE elements. which is necessary for the first component of the average, since we have to find the lagged value.
This can be extended a bit to reduce computation (responding to docendo-discimus's comment) by computing the missing-value indicator once, storing it, and then re-using that vector.
missers <- is.na(df$B)
df$B[missers] <- (df$B[which(missers)-1] + df$A[missers]) / 2
#clean up, maybe
rm(missers)
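One edge case is worth noting: if the first row of B is NA, which(missers)-1 contains 0, and indexing with 0 silently drops that element. The sketch below is a hypothetical variant that skips row 1 explicitly; the name idx is made up here.

```r
df <- data.frame(A = c(1, 1, 2),
                 B = c(2, 4, NA))

# Positions of missing values, excluding row 1, which has no previous row
idx <- which(is.na(df$B))
idx <- idx[idx > 1]

# Mean of B from the previous row and A from the same row
df$B[idx] <- (df$B[idx - 1] + df$A[idx]) / 2
df
```

An NA in row 1, or an NA that directly follows another NA, stays NA under this variant.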

How to extract information of a vector in a data frame corresponding to each unique value of another vector in the same data frame? [duplicate]

This question already has answers here:
Remove groups with less than three unique observations [duplicate]
(3 answers)
Closed 4 years ago.
Suppose I have the following data frame data-
V1 V2
A 3
A 2
A 1
B 2
B 3
C 4
C 3
C 1
C 2
Now I want to extract information for each level of V1, i.e. (A, B, C, D & E). For example, if I want to see the sum of V2 for each level of V1, what should the code be?
The output I want is-
V1 V2
A 6
B 5
C 10
I tried lapply and sapply, but they do not give the information I want. I also tried sapply(data,unique), which of course made no sense.
Also, as a further step (maybe a bit trickier), how can I find the values of V2 that occur in all the levels of V1?
Thanks !!
I think this is what you want: it finds the unique values that are common across the different groups.
# Common V2 values in each level of V1
Reduce(intersect, split(dat$V2, dat$V1))
#[1] 3 2
# Common V1 values in each level of V2
Reduce(intersect, split(dat$V1, dat$V2))
#[1] "C"
Using data.table, we can find the unique values in 'V2' that are common across 'V1'.
library(data.table)
setDT(data)[,uniqueN(V1)==uniqueN(data$V1) , by = V2][(V1)]$V2
#[1] 3 2
and the common 'V1' in each unique element of 'V2'
setDT(data)[, if(uniqueN(V1)==1) .SD , by = V2]$V1
#[1] "C"
Maybe this is helpful
output <- aggregate(V2 ~ ., data = data, FUN = paste)
To extract the values of V2 that are present in all the levels of V1, use this:
Reduce(intersect,output$V2)
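The first part of the question, the sum of V2 for each level of V1, can be sketched in base R as follows. The data frame is retyped from the question and called dat here; sums and by_level are names chosen for illustration.

```r
# Example data from the question
dat <- data.frame(V1 = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
                  V2 = c(3, 2, 1, 2, 3, 4, 3, 1, 2))

# One row per level of V1 with the sum of V2
sums <- aggregate(V2 ~ V1, data = dat, FUN = sum)

# The same totals as a named vector
by_level <- tapply(dat$V2, dat$V1, sum)

sums
by_level
```

Swapping sum for another function such as mean or length gives other per-level summaries in the same way.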

How to split data.frame into smaller data.frames of predetermined number of rows? [duplicate]

This question already has answers here:
The difference between bracket [ ] and double bracket [[ ]] for accessing the elements of a list or dataframe
(11 answers)
Closed 7 years ago.
I have the following data frame:
df <- data.frame(a=rep(1:3),b=rep(1:3),c=rep(4:6),d=rep(4:6))
df
a b c d
1 1 1 4 4
2 2 2 5 5
3 3 3 6 6
I would like to have a value N that determines my window size, so for this example I will set
N <- 1
I would like to split this data frame into equal portions of N rows and store the 3 resulting data frames in a list.
I have the following code:
groupMaker <- function(x, y) 0:(x-1) %/% y
testlist2 <- split(df, groupMaker(nrow(df), N))
The problem is that this code changes my column names by adding an X0. prefix:
result <- as.data.frame(testlist2[1])
result
X0.a X0.b X0.c X0.d
1 1 1 4 4
I would like code that does exactly the same thing but keeps the column names as they are. Please keep in mind that my original data has many more than 3 rows, so I need something that is applicable to a much larger data frame.
To extract a list element, we can use [[. Also, as each list element is already a data.frame, we don't need to call as.data.frame explicitly.
testlist2[[1]]
We can also use gl to create the grouping variable.
split(df, as.numeric(gl(nrow(df), N, nrow(df))))
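To see the difference on a window larger than one row, here is a small sketch with N set to 2 (an assumed value, not from the question). testlist2[1] is still a list of length one, so as.data.frame() prefixes the columns with the element name, whereas [[ returns the data frame itself.

```r
df <- data.frame(a = 1:3, b = 1:3, c = 4:6, d = 4:6)
N <- 2  # window size, assumed here for illustration

# Integer division of zero-based row positions gives group ids 0, 0, 1
chunks <- split(df, (seq_len(nrow(df)) - 1) %/% N)

first <- chunks[[1]]  # a data.frame with the original column names
names(first)
sapply(chunks, nrow)  # the last chunk holds the leftover rows
```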
