How to sum rows and keep their name in a dataframe [duplicate] - r

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 3 years ago.
I have a data frame in which some rows share the same name (cod) but have different values. I need to sum the values per name and keep the original values as a separate column.
data <- data.frame(cod = c("A", "B", "C", "A", "A", "B"),
                   Values = c(3, 4, 5, 1, 2, 5))
data
cod Values
A 3
B 4
C 5
A 1
A 2
B 5
I expect the following, where the original Values column is kept the same and the group sum is added as a new column, Values2:
> data2
cod Values Values2
A 3 6
B 4 9
C 5 5
A 1 6
A 2 6
B 5 9

An option with base R would be
data$Values2 <- with(data, ave(Values, cod, FUN = sum))
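For comparison, here is a minimal sketch of how ave() differs from aggregate() on the example data. It assumes the column is named Values, as in the printed output; the name totals is only illustrative. ave() returns one value per original row, while aggregate() collapses the data to one row per group.

```r
data <- data.frame(cod = c("A", "B", "C", "A", "A", "B"),
                   Values = c(3, 4, 5, 1, 2, 5))

# ave() computes the group sum and repeats it on every row of that group,
# so the result has the same length as the input
data$Values2 <- ave(data$Values, data$cod, FUN = sum)

# aggregate() returns one row per group instead
totals <- aggregate(Values ~ cod, data = data, FUN = sum)

data
totals
```

Because ave() keeps the row order and length, its result can be assigned straight into a new column.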


How to return 2 specific rows from a dataframe?

firstVector <- c("A", "B", "C", "D", "E")
secondVector <- c(1, 2, 3, 4, 5)
thirdVector <- c("a", "b", "c", "d", "e")
myDataFrame <- data.frame(firstVector, secondVector, thirdVector)
How do I extract rows 3 and 4 from my data frame? I want to print rows 3 and 4 so that the output looks like this:
firstVector secondVector thirdVector
3 C 3 c
4 D 4 d
You can subset your dataframe like this [rows,columns]:
myDataFrame[c(3,4),]
In your case you want rows 3 and 4, hence the vector c(3,4). You can add more row indices to the vector to select more rows, for example c(1,2,3,5).
If you leave an argument empty, the whole dimension is returned. In your example you subset the rows and get all the columns.
It's the same for columns:
myDataFrame[c(3,4),c(1,2)]
This subsets rows 3 and 4 and columns 1 and 2.
Another way to do this is with the : operator:
1:4 means the integers from 1 to 4, so myDataFrame[3:4,] returns rows 3 and 4.
Hope this helps
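The [rows, columns] indexing described above can be sketched as follows. The variable names rows_c, rows_colon and sub are only illustrative, and selecting columns by name is an extra shown here for completeness.

```r
myDataFrame <- data.frame(
  firstVector  = c("A", "B", "C", "D", "E"),
  secondVector = c(1, 2, 3, 4, 5),
  thirdVector  = c("a", "b", "c", "d", "e")
)

# Rows 3 and 4, all columns (the column argument is left empty)
rows_c     <- myDataFrame[c(3, 4), ]
rows_colon <- myDataFrame[3:4, ]

# Rows 3 and 4, first two columns only, selected by name
sub <- myDataFrame[3:4, c("firstVector", "secondVector")]

rows_c
sub
```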

R data frame subsetting based on a column value frequency threshold [duplicate]

This question already has answers here:
Getting the top values by group
(6 answers)
Closed 6 years ago.
I am a new R user and this is my first question submission (hopefully in compliance with the protocol).
I have a data frame with two columns.
library(dplyr)
df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))
dfc <- df %>% count(v1)
df$n <- with(dfc, n[match(df$v1,v1)])
v1 n
1 A 2
2 A 2
3 B 4
4 B 4
5 B 4
6 B 4
7 C 1
8 D 2
9 D 2
10 E 1
I want to delete the rows beyond the third occurrence of any value in v1; the first three rows for each value are retained. In this example I want to delete row 6 and keep all remaining rows in a new data frame.
The result would include the following values for v1:
v1
1 A
2 A
3 B
4 B
5 B
6 C
7 D
8 D
9 E
Row 6 would have been deleted because it was the 4th occurrence of "B", but the 3 previous rows for "B" have been retained.
I have read multiple posts that demonstrate how to remove ALL rows for a variable with row totals less/greater than a cumulative frequency value, such as 4. For example, I have tried:
df1 <- df %>%
group_by(v1) %>%
filter(n() < 4)
This approach keeps only the values of v1 that occur fewer than 4 times. 6 rows remain.
df2 <- df %>%
group_by(v1) %>%
filter(n() > 3)
This approach keeps only the values of v1 that occur more than 3 times. 4 rows remain.
df4 <- subset(df, v1 %in% names(table(df$v1))[table(df$v1) <4])
This approach has the same result as the first approach.
None of these methods produce the result I need.
As previously stated, I need to retain the first three rows where v1="B" and only delete rows if there are > 3 occurrences of that value.
Because I am new to R, it's possible I am overlooking a very simple solution. Any suggestions would be greatly appreciated.
Thanks.
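Using dplyr. Note that top_n(3) ranks rows by the last column and keeps ties, so it does not drop the fourth "B"; slice_head() (dplyr 1.0.0 or later) keeps the first three rows of each group:
df %>% group_by(v1) %>% slice_head(n = 3)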
This seems to do it:
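index <- vector("numeric", nrow(df))
for (i in seq_len(nrow(df))) {
  # count how often the current v1 value has appeared up to and including row i
  if (sum(df[1:i, 1] == as.character(df[i, 1])) <= 3) {
    index[i] <- i  # keep this row
  } else {
    cat(i)  # print the dropped row number; index[i] stays 0 and is ignored below
  }
}
df[index, ]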
v1 n
1 A 2
2 A 2
3 B 4
4 B 4
5 B 4
7 C 1
8 D 2
9 D 2
10 E 1
We can use data.table
library(data.table)
setDT(df)[, if(.N >3) head(.SD, 3) else .SD , v1]
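A base R alternative, offered as a sketch rather than as part of the answers above: compute a running occurrence count per value with ave() and seq_along(), then keep the rows whose count is at most 3. The names occurrence and result are only illustrative.

```r
df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"))

# Within each value of v1, number the rows 1, 2, 3, ... in order of appearance
occurrence <- ave(seq_along(df$v1), df$v1, FUN = seq_along)

# Keep only the first three occurrences of each value;
# drop = FALSE keeps the result a data frame even with a single column
result <- df[occurrence <= 3, , drop = FALSE]
result
```

This does not require the data to be sorted by v1, since the count is taken per value regardless of row position.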

Matching elements of two data frames in R [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 6 years ago.
I have two data frames. The first looks like this
name
1 a
2 b
3 c
4 d
5 f
and the second like this
name value
1 b 3
2 d 4
3 f 5
4 a 1
5 c 2
6 k 7
7 m 6
Now I want to add a second column to the first data frame, containing the matching values from the second data frame. It should look like this:
name value
1 a 1
2 b 3
3 c 2
4 d 4
5 f 5
Can somebody help me with this?
You can use merge to do this. Assuming your first data frame is called df1 and the second one df2:
merge(df1, df2, by='name')
What you want to do is an inner join. You might try with the dplyr package.
library(dplyr)
x <- data.frame(name = c("a", "b", "c", "d", "f"), stringsAsFactors = FALSE)
y <- data.frame(name = c("b", "d", "f", "a", "c", "k", "m"),
value = c(3, 4, 5, 1, 2, 7, 6),
stringsAsFactors = FALSE)
joined <- dplyr::inner_join(x, y, by = "name")
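If you only need the one extra column and want to keep the row order of the first data frame, a lookup with match() is another option. This is a sketch under the assumption that each name occurs at most once in the second data frame; df1 and df2 are rebuilt here from the question's printed data.

```r
df1 <- data.frame(name = c("a", "b", "c", "d", "f"))
df2 <- data.frame(name = c("b", "d", "f", "a", "c", "k", "m"),
                  value = c(3, 4, 5, 1, 2, 7, 6))

# match() gives, for each name in df1, its row position in df2
# (NA if the name is absent); indexing df2$value with it does the lookup
df1$value <- df2$value[match(df1$name, df2$name)]
df1
```

Unlike an inner join, names missing from df2 are kept here with an NA value rather than dropped.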

Observation number by group [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 5 years ago.
In R I have a data frame of observations described by several variables, one of which is a factor. I have sorted the dataset by this factor and would like to add a column that numbers the observations within each level of the factor, e.g.
factor obsnum
a 1
a 2
a 3
b 1
b 2
b 3
b 4
c 1
c 2
...
In SAS I do it with something like:
data logs.full;
set logs.full;
count + 1;
by cookie;
if first.cookie then count = 1;
run;
How can I achieve that in R?
Thanks,
Use rle (run length encoding) and sequence:
x <- c("a", "a", "a", "b", "b", "b", "b", "c", "c")
data.frame(
x=x,
obsnum = sequence(rle(x)$lengths)
)
x obsnum
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
Here is the ddply() solution
dataset <- data.frame(x = c("a", "a", "a", "b", "b", "b", "b", "c", "c"))
library(plyr)
ddply(dataset, .(x), function(z){
data.frame(obsnum = seq_along(z$x))
})
One solution using base R, assuming your data is in a data.frame named dfr:
dfr$cnt<-do.call(c, lapply(unique(dfr$factor), function(curf){
seq(sum(dfr$factor==curf))
}))
There are likely better solutions (e.g. employing package plyr and its ddply), but it should work.
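Another base R sketch, assuming as above that the data is in a data.frame named dfr with a column called factor: ave() with seq_along() numbers the rows within each level in a single call.

```r
dfr <- data.frame(factor = c("a", "a", "a", "b", "b", "b", "b", "c", "c"))

# seq_along() is applied separately to the rows of each level,
# and ave() puts the results back in the original row order
dfr$obsnum <- ave(seq_along(dfr$factor), dfr$factor, FUN = seq_along)
dfr
```

Because ave() restores the original row positions, this also works when the data is not sorted by the factor.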

Change name of a cell in data frame in R [duplicate]

This question already has answers here:
Replace all particular values in a data frame
(8 answers)
Replace a value in a data frame based on a conditional (`if`) statement
(10 answers)
How do I replace NA values with zeros in an R dataframe?
(29 answers)
Replace contents of factor column in R dataframe
(9 answers)
Closed 3 years ago.
I have a data set:
x y z
1 apple a 4
2 orange d 3
3 banana b 2
4 strawberry c 1
How can I change the name "banana" to "grape"? I want to get:
x y z
1 apple a 4
2 orange d 3
3 grape b 2
4 strawberry c 1
Reproducible code:
example <- data.frame(x = c("apple", "orange", "banana", "strawberry"), y = c("a", "d", "b", "c"), z = 4:1)
Here is a solution using the tidyverse. The as.character() step matters when x is a factor, which is the data.frame() default before R 4.0.0:
library(tidyverse)
example %>%
mutate(x = as.character(x)) %>%
mutate(x = replace(x, x == 'banana', 'grape'))
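The same replacement can be sketched in base R with logical indexing. This assumes x is a character column, which is forced here with stringsAsFactors = FALSE so the example behaves the same on older R versions.

```r
example <- data.frame(x = c("apple", "orange", "banana", "strawberry"),
                      y = c("a", "d", "b", "c"),
                      z = 4:1,
                      stringsAsFactors = FALSE)

# Select the cells of x equal to "banana" and overwrite them in place
example$x[example$x == "banana"] <- "grape"
example
```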
