I am new to R and this site, but I searched and didn't find the answer I was looking for.
If I have the following data set "total":
names <- c("a", "b", "c", "d", "a", "b", "c", "d")
x <- cbind(x1 = 3, x2 = c(3:10))
total <- data.frame(names, x)
total
names x1 x2
1 a 3 3
2 b 3 4
3 c 3 5
4 d 3 6
5 a 3 7
6 b 3 8
7 c 3 9
8 d 3 10
How can I create a new data set that works like the SumIf Excel function with just unique rows?
The answer should be a new data set "summary" that is 4 x 3.
names <- unique(names)
summary <- data.frame(names)
summary$Sumx1 <- ?????
summary$Sumx2 <- ?????
summary
names Sumx1 Sumx2
1 a 6 10
2 b 6 12
3 c 6 14
4 d 6 16
In base R:
aggregate(. ~ names, data=total, sum)
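The formula `. ~ names` sums every remaining column within each level of `names`. As a minimal sketch, assuming the question's data and its requested `Sumx1`/`Sumx2` column names (the object name `res` is only illustrative), the result can be renamed afterwards:

```r
# Rebuild the question's data
names <- c("a", "b", "c", "d", "a", "b", "c", "d")
x <- cbind(x1 = 3, x2 = c(3:10))
total <- data.frame(names, x)

# Sum every non-grouping column within each level of `names`
res <- aggregate(. ~ names, data = total, FUN = sum)

# Prefix the summed columns with "Sum" to match the requested layout
colnames(res)[-1] <- paste0("Sum", colnames(res)[-1])
res
```

The result has one row per unique name, which is the 4 x 3 shape asked for in the question.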
You can use ddply from the plyr package:
library(plyr)
ddply(total, .(names), summarise, Sumx1 = sum(x1), Sumx2 = sum(x2))
names Sumx1 Sumx2
1 a 6 10
2 b 6 12
3 c 6 14
4 d 6 16
You can also use data.table:
library(data.table)
DT <- as.data.table(total)
DT[ , lapply(.SD, sum), by = "names"]
names x1 x2
1: a 6 10
2: b 6 12
3: c 6 14
4: d 6 16
With the new dplyr package, you can do:
library(dplyr)
total %>%
group_by(names) %>%
summarise(Sumx1 = sum(x1), Sumx2 = sum(x2))
names Sumx1 Sumx2
1 d 6 16
2 c 6 14
3 b 6 12
4 a 6 10
I have a pretty basic problem that I can't seem to solve:
Take 3 vectors:
V1 V2 V3
1 4 7
2 5 8
3 6 9
I want to merge the 3 columns into a single column and assign a group.
Desired result:
Group Value
A 1
A 2
A 3
B 4
B 5
B 6
C 7
C 8
C 9
My code:
V1 <- c(1,2,3)
V2 <- c(4,5,6)
V3 <- c(7,8,9)
data <- data.frame(group = c("A", "B", "C"),
values = c(V1, V2, V3))
Actual result:
Group Value
A 1
B 2
C 3
A 4
B 5
C 6
A 7
B 8
C 9
How can I reshape the data to get the desired result?
Thank you!
We can stack on a named list of vectors
stack(setNames(mget(paste0("V", 1:3)), LETTERS[1:3]))[2:1]
-output
ind values
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
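The one-liner above can be unpacked step by step. This is only a sketch of the same calls, with the intermediate names `vecs` and `long` introduced for illustration:

```r
V1 <- c(1, 2, 3)
V2 <- c(4, 5, 6)
V3 <- c(7, 8, 9)

# Collect the vectors V1..V3 into a list by looking them up by name
vecs <- mget(paste0("V", 1:3))

# Relabel the list elements A, B, C; these become the group labels
vecs <- setNames(vecs, LETTERS[1:3])

# stack() returns a data frame with columns `values` and `ind`
long <- stack(vecs)

# Put the group column first, as [2:1] does in the one-liner
long <- long[c("ind", "values")]
long
```

Note that `ind` is returned as a factor. Wrap it in `as.character()` if a character column is needed.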
Regarding the issue in the OP's data creation: when the group vector is shorter than the values column, data.frame() recycles it (A, B, C, A, B, C, ...). We need rep to repeat each label as many times as its vector has elements:
data <- data.frame(group = rep(c("A", "B", "C"), c(length(V1),
length(V2), length(V3))),
values = c(V1, V2, V3))
-output
> data
group values
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 C 7
8 C 8
9 C 9
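The difference between recycling and repeating each label can be seen on the group vector alone. A small sketch, where `recycled` and `blocked` are illustrative names:

```r
labels <- c("A", "B", "C")

# What happens implicitly when a length-3 vector meets a length-9 column:
# the labels are cycled until the length matches
recycled <- rep(labels, length.out = 9)

# What the desired result needs: each label repeated in a block
blocked <- rep(labels, each = 3)

recycled
blocked
```

`rep(labels, each = 3)` is enough when all vectors have the same length. The `times` form with `length(V1)`, `length(V2)`, `length(V3)` also handles vectors of unequal length.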
Here is another option using pivot_longer() and converting the V1, V2, V3 column names into a factor labeled A, B, C:
V1 <- c(1,2,3)
V2 <- c(4,5,6)
V3 <- c(7,8,9)
df <- data.frame(V1, V2, V3)
library(dplyr)
library(tidyr)
# basic answer:
answer <- pivot_longer(df, cols = starts_with("V"), names_to = "Group")
# or, relabelling the column names as A, B, C:
answer<-pivot_longer(df, cols=starts_with("V"), names_prefix = "V", names_to = "Group")
answer$Group <- factor(answer$Group, labels = c("A", "B", "C"))
answer %>% arrange(Group, value)
I am trying to figure out how to do this in R but would really appreciate some input on this. Let's say I have two dataframes, A and B:
dataframe A
a <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
b <- c(1, 5, 10, 2, 3, 8, 10, 28, 36)
c <- c(runif(9, min=5, max=99))
df_A <- data.frame(a,b,c)
names(df_A) <- c('name', 'trial', 'counts')
name trial counts
1 A 1 42.18785
2 A 5 17.17859
3 A 10 29.34961
4 B 2 23.20101
5 B 3 58.57507
6 B 8 28.94360
7 C 10 25.48171
8 C 28 55.67896
9 C 36 10.04799
dataframe B
e <- c("A", "A", "A", "B", "C", "C")
f <- c(1, 5, 10, 2, 3, 28)
g <- c(runif(6, min=5, max=99))
df_B <- data.frame(e,f,g)
names(df_B) <- c('name', 'trial', 'rate')
name trial rate
1 A 1 8.408579
2 A 5 28.029798
3 A 10 18.904179
4 B 2 20.577880
5 C 3 44.492629
6 C 28 81.408402
As you can see, these two dataframes share two columns but differ in length. What I need to do is to divide each value in the counts column by each value of the rate column in dataframe B. This has to be done on a name-by-name basis (i.e., group_by name column). A correct dataframe after this will look like this:
name trial counts
1 A 1 42.18785 / 8.408579
2 A 1 42.18785 / 28.029798
3 A 1 42.18785 / 18.904179
4 A 5 17.17859 / 8.408579
5 A 5 17.17859 / 28.029798
6 A 5 17.17859 / 18.904179
7 A 10 29.34961 / 8.408579
8 A 10 29.34961 / 28.029798
9 A 10 29.34961 / 18.904179
10 B 2 23.20101 / 20.577880
11 B 3 58.57507 / 20.577880
12 B 8 28.94360 / 20.577880
13 C 10 25.48171 / 44.492629
14 C 10 25.48171 / 81.408402
15 C 28 55.67896 / 44.492629
16 C 36 10.04799 / 81.408402
Here is a base R solution: merge the data sets, then divide the result's counts column by its rate column. It uses the native pipe and the \(x) lambda shorthand, both introduced in R 4.1.0, to avoid creating a temporary work data.frame.
merge(df_A, df_B[-2]) |>
(\(x) cbind(x[1:2], counts = x[[3]]/x[[4]]))()
#> name trial counts
#> 1 A 1 4.9008255
#> 2 A 1 1.9812148
#> 3 A 1 0.8574978
#> 4 A 5 3.2969133
#> 5 A 5 1.3328149
#> 6 A 5 0.5768612
#> 7 A 10 0.6277524
#> 8 A 10 0.2537761
#> 9 A 10 0.1098379
#> 10 B 2 0.3528129
#> 11 B 3 4.0136321
#> 12 B 8 1.9712023
#> 13 C 10 9.7051006
#> 14 C 10 0.9257950
#> 15 C 28 2.9923193
#> 16 C 28 0.2854452
#> 17 C 36 2.2441296
#> 18 C 36 0.2140734
Created on 2022-06-21 by the reprex package (v2.0.1)
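Merging on `name` alone pairs every row of one data frame with every row of the other that shares that name, which is why the output has more rows than `df_A`. A minimal sketch with small made-up numbers (the values and the `paired` and `ratio` names are assumptions for illustration, not the question's random data):

```r
# Hypothetical fixed values so the result is reproducible
df_A <- data.frame(name = c("A", "A", "B"),
                   trial = c(1, 5, 2),
                   counts = c(10, 20, 30))
df_B <- data.frame(name = c("A", "A", "B"),
                   trial = c(1, 5, 2),
                   rate = c(2, 5, 10))

# Drop df_B's trial column so that `name` is the only shared key
paired <- merge(df_A, df_B[c("name", "rate")], by = "name")

# Each counts value is now paired with each rate of the same name
paired$ratio <- paired$counts / paired$rate
paired
```

Here name A contributes 2 x 2 = 4 rows and name B contributes 1 x 1 = 1 row.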
A dplyr approach:
library(dplyr)
df_A |>
left_join(df_B, by = "name") |>
mutate(calc = counts / rate)
I have a dataframe including a column of factors that I would like to subset to select every nth row, after grouping by factor level. For example,
my_df <- data.frame(col1 = c(1:12), col2 = rep(c("A","B", "C"), 4))
my_df
col1 col2
1 1 A
2 2 B
3 3 C
4 4 A
5 5 B
6 6 C
7 7 A
8 8 B
9 9 C
10 10 A
11 11 B
12 12 C
Subsetting to select every 2nd row should yield my_new_df as,
col1 col2
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
I tried in dplyr:
my_df %>% group_by(col2) %>%
my_df[seq(2, nrow(my_df), 2), ] -> my_new_df
I get an error:
Error: Can't subset columns that don't exist.
x Locations 4, 6, 8, 10, and 12 don't exist.
ℹ There are only 2 columns.
To see if the nrow function was a problem, I tried using the number directly. So,
my_df %>% group_by(col2) %>%
my_df[seq(2, 4, 2), ] -> my_new_df
Also gave an error,
Error: Can't subset columns that don't exist.
x Location 4 doesn't exist.
ℹ There are only 2 columns.
Run `rlang::last_error()` to see where the error occurred.
My expectation was that it would run the subsetting on each group of data and then combine the results into 'my_new_df'. My understanding of how group_by works is clearly wrong, but I am stuck on how to move past this error. Any help would be much appreciated.
Try:
my_df %>%
group_by(col2)%>%
slice(seq(from = 2, to = n(), by = 2))
# A tibble: 6 x 2
# Groups: col2 [3]
col1 col2
<int> <chr>
1 4 A
2 10 A
3 5 B
4 11 B
5 6 C
6 12 C
You might want to ungroup after slicing if you want to do other operations not based on col2.
Here is a data.table option:
library(data.table)
data <- as.data.table(my_df)
data[(rowid(col2) %% 2) == 0]
col1 col2
1: 4 A
2: 5 B
3: 6 C
4: 10 A
5: 11 B
6: 12 C
Or base R:
my_df[as.logical(with(my_df, ave(col1, col2, FUN = function(x)
seq_along(x) %% 2 == 0))), ]
col1 col2
4 4 A
5 5 B
6 6 C
10 10 A
11 11 B
12 12 C
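The data.table and base R results above keep the original row order rather than the grouped order shown in the question. As a sketch of one way to get that order in base R (the helper name `pos_in_group` is illustrative), number the rows within each group, keep the even positions, then sort by group:

```r
my_df <- data.frame(col1 = c(1:12), col2 = rep(c("A", "B", "C"), 4))

# Position of each row within its col2 group: 1, 2, 3, 4 for each level
pos_in_group <- ave(seq_len(nrow(my_df)), my_df$col2, FUN = seq_along)

# Keep every 2nd row of each group
my_new_df <- my_df[pos_in_group %% 2 == 0, ]

# Sort by group so the rows come out as A, A, B, B, C, C
my_new_df <- my_new_df[order(my_new_df$col2), ]
my_new_df
```

Changing `%% 2` to `%% n` for another value of n selects every nth row per group in the same way.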
I have a data table:
Name Score
A 5
A 6
B 9
B 1
B 0
...
I want to add a column 'Fscore' to this table that holds the maximum Score for each Name.
My expected result:
Name Score Fscore
A 5 6
A 6 6
B 9 9
B 1 9
B 0 9
Thanks.
We can use the base R option ave
df$Fscore <- ave(df$Score, df$Name, FUN = max)
df
# Name Score Fscore
#1 A 5 6
#2 A 6 6
#3 B 9 9
#4 B 1 9
#5 B 0 9
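ave() applies FUN within each group and returns a vector as long as its input, so the result can be assigned directly as a new column. A self-contained sketch, assuming the question's table is stored in a data frame named `df`:

```r
# Rebuild the example table from the question
df <- data.frame(Name = c("A", "A", "B", "B", "B"),
                 Score = c(5, 6, 9, 1, 0))

# Group maximum of Score, repeated on every row of its group
df$Fscore <- ave(df$Score, df$Name, FUN = max)
df
```

Without the `FUN` argument, ave() computes the group mean, so `FUN = max` must be named explicitly.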
If you are trying to find the maximum score for each Name value, you can use data.table as below.
library(data.table)
# example data
d <- data.table(Name = c("A", "A", "B", "B", "B"),
Score = c(5, 6, 9, 1, 0))
# find max for each Name and save the value in a new column, Fscore
d[ , Fscore := max(Score), by=Name]
Result:
> print(d)
Name Score Fscore
1: A 5 6
2: A 6 6
3: B 9 9
4: B 1 9
5: B 0 9
Another option using dplyr could be:
library(dplyr)
df <- data.frame(Name = c('a', 'a', 'b', 'b', 'b'), Score = c(5, 6, 9, 1, 0))
df %>% group_by(Name) %>% mutate(Fscore = max(Score))
Source: local data frame [5 x 3]
Groups: Name [2]
Name Score Fscore
<fctr> <dbl> <dbl>
1 a 5 6
2 a 6 6
3 b 9 9
4 b 1 9
5 b 0 9
Here is my issue:
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df1
x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df2) <- LETTERS[3:7]
df2
x y z
C 1 2 3
D 2 3 4
E 3 4 5
F 4 5 6
G 5 6 7
What I want is:
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
where rows with the same row name are summed column by column.
A solution with base R:
# create a new variable from the rownames
df1$rn <- rownames(df1)
df2$rn <- rownames(df2)
# bind the two dataframes together by row and aggregate
res <- aggregate(cbind(x,y,z) ~ rn, rbind(df1,df2), sum)
# or (thanks to @alistaire for reminding me):
res <- aggregate(. ~ rn, rbind(df1,df2), sum)
# assign the rownames again
rownames(res) <- res$rn
# get rid of the 'rn' column
res <- res[, -1]
which gives:
> res
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
With dplyr,
library(dplyr)
# add rownames as a column in each data.frame and bind rows
bind_rows(df1 %>% add_rownames(),
df2 %>% add_rownames()) %>%
# evaluate following calls for each value in the rowname column
group_by(rowname) %>%
# add all non-grouping variables
summarise_all(sum)
## # A tibble: 7 x 4
## rowname x y z
## <chr> <int> <int> <int>
## 1 A 1 2 3
## 2 B 2 3 4
## 3 C 4 6 8
## 4 D 6 8 10
## 5 E 8 10 12
## 6 F 4 5 6
## 7 G 5 6 7
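You could also vectorize the operation by converting the data frames to matrices. Note that this adds values by position and ignores the row names, so it only matches the desired result when both data frames hold the same rows in the same order: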
result_df <- as.data.frame(as.matrix(df1) + as.matrix(df2))
This might need some tweaking to get the rownames logic working on a longer example:
dfr <-rbind(df1,df2)
do.call(rbind, lapply( split(dfr, sapply(rownames(dfr),substr,1,1)), colSums))
x y z
A 1 2 3
B 2 3 4
C 4 6 8
D 6 8 10
E 8 10 12
F 4 5 6
G 5 6 7
If the rownames could all be assumed to be alpha characters a gsub solution should be easy.
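The trimming with substr() is needed because rbind() alters duplicated row names to keep them unique. A sketch that sidesteps this, assuming base R's rowsum() is acceptable here, passes the original row names as a separate grouping vector (the name `res` is illustrative):

```r
df1 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df1) <- LETTERS[1:5]
df2 <- data.frame(x = 1:5, y = 2:6, z = 3:7)
rownames(df2) <- LETTERS[3:7]

# Keep the original row names in a grouping vector before binding,
# since rbind() will not preserve the duplicated ones as they are
grp <- c(rownames(df1), rownames(df2))

# Sum the rows of the stacked data frame within each original row name
res <- rowsum(rbind(df1, df2), group = grp)
res
```

This works for row names of any length, since nothing is parsed out of the modified names.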
An alternative is to melt the data and cast it. First, we copy the row names into a new last column of both data frames (thanks to @Jaap):
df1$rn <- rownames(df1)
df2$rn <- rownames(df2)
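Then we melt the data, using rn as the id variable:
# melt() on a list and dcast() are assumed here to come from the reshape2 package
library(reshape2)
melt(list(df1, df2), id.vars = "rn")
Then we use dcast together with mget, which retrieves multiple objects by name at once: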
mydf<- dcast(melt(mget(ls(pattern = "df\\d+")), id.vars = "rn"),
rn ~ variable, value.var = "value", fun.aggregate = sum)
rownames(mydf) <- mydf$rn
# get rid of the 'rn' column
mydf <- mydf[, -1]
> mydf
# x y z
#A 1 2 3
#B 2 3 4
#C 4 6 8
#D 6 8 10
#E 8 10 12
#F 4 5 6
#G 5 6 7