I created a data frame of the following type
Name Date Value
A 01.01.01 10
B 02.01.01 2
A 04.01.01 4
...
I would like to obtain a list that ranks the elements in the Name column by their total sum of Value, restricted to dates within a certain range.
Welcome to Stack Overflow (SO). It is very important for anybody asking a question to provide reproducible data, which you can generate with dput(). Please read this link. If you have tried something, please include your code and describe where you got stuck; that way you save time for SO users and are more likely to receive support. Here, I did my best to read your question, created sample data, and did the following using the dplyr package.
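For example, dput() prints code that recreates an object exactly, so others can paste it straight into R:
dput(data.frame(x = 1:2, y = c("a", "b")))
# structure(list(x = 1:2, y = c("a", "b")), class = "data.frame",
# row.names = c(NA, -2L))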
# Sample data
foo <- data.frame(id = c("A", "B", "A", "C", "D", "B", "D", "E", "A", "S", "B"),
date = c("01.01.01", "02.01.01", "04.01.01", "05.01.01",
"11.01.01", "09.03.01", "12.15.01", "08.08.01",
"03.27.01", "11.16.01", "04.07.01"),
value = c(-10, -2, -4, 8, 5, 2, 10, 5, 11, 7, 8),
stringsAsFactors = FALSE)
# id date value
#1 A 01.01.01 -10
#2 B 02.01.01 -2
#3 A 04.01.01 -4
#4 C 05.01.01 8
#5 D 11.01.01 5
#6 B 09.03.01 2
#7 D 12.15.01 10
#8 E 08.08.01 5
#9 A 03.27.01 11
#10 S 11.16.01 7
#11 B 04.07.01 8
library(dplyr)
foo %>%
# Create date objects
mutate(date = as.Date(date, format = "%m.%d.%y")) %>%
# Select data points which stay between 2001-01-01 and 2001-08-31
filter(between(date, as.Date("2001-01-01"), as.Date("2001-08-31"))) %>%
# For each id group
group_by(id) %>%
# Get sum of value
summarise(Total = sum(value)) %>%
# Arrange row order by descending order with Total
arrange(desc(Total))
# id Total
#1 C 8
#2 B 6
#3 E 5
#4 A -3
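For comparison, a base R sketch of the same steps (using the same foo as above) could be:
foo$date <- as.Date(foo$date, format = "%m.%d.%y")
in_range <- subset(foo, date >= as.Date("2001-01-01") & date <= as.Date("2001-08-31"))
totals <- aggregate(value ~ id, data = in_range, FUN = sum)
totals[order(-totals$value), ]
#   id value
# 3  C     8
# 2  B     6
# 4  E     5
# 1  A    -3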
Related
I have a dataset with two columns: Name and Age. It is a very big dataset and it looks something like the table below:
Name: A, A, A, B, B, E, E, E, E, E
Age: 10, 10, 10, 15, 14, 20, 20, 20, 19
I want to find out how many times these two columns, Name and Age, do not co-occur consistently. Basically, how often does the same name appear with different ages? For instance, the B who is 15 years old and the B who is 14 years old could be different people.
If I understand the question, you're looking to see how many different ages each name has in the data.
One dplyr approach would be to identify those distinct combinations of age and name, and then count by name. This tells us A has only one age, while B and E each have two.
library(dplyr)
my_data %>%
distinct(name, age) %>%
count(name)
name n
1 A 1
2 B 2
3 E 2
If you want more info about what those combinations are, you could use add_count to keep all the combinations, plus the count by name.
my_data %>%
distinct(name, age) %>%
add_count(name)
name age n
1 A 10 1
2 B 15 2
3 B 14 2
4 E 20 2
5 E 19 2
Sample data
Please note, it is best practice to include in your question the code to generate a specific sample data object. This reduces redundant work for people who want to help you, and reduces ambiguity (e.g. in your example there aren't as many ages as names).
my_data <- data.frame(
name = c("A", "A", "A", "B", "B", "E", "E", "E", "E", "E"),
age = c(10, 10, 10, 15, 14, 20, 20, 20, 19, 20))
If you want to subset the data frame, you can try
subset(
df,
ave(Age, Name, FUN = sd) == 0
)
which gives
Name Age
1 A 10
2 A 10
3 A 10
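The trick is that ave() returns each group's statistic recycled to the group's rows, so comparing to 0 gives a row-wise logical mask (using the df from the Data section below):
ave(df$Age, df$Name, FUN = sd)
# [1] 0.0000000 0.0000000 0.0000000 0.7071068 0.7071068 0.5000000 0.5000000
# [8] 0.5000000 0.5000000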
Or a summary like
> aggregate(cbind(n = Age) ~ Name, df, function(x) length(unique(x)))
Name n
1 A 1
2 B 2
3 E 2
Data
df <- data.frame(
Name = c("A", "A", "A", "B", "B", "E", "E", "E", "E"),
Age = c(10, 10, 10, 15, 14, 20, 20, 20, 19)
)
An option with data.table
library(data.table)
setDT(df)[, .SD[sd(Age) == 0], Name]
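For the df in the Data section above, this keeps the rows of each Name group whose Age has zero standard deviation:
   Name Age
1:    A  10
2:    A  10
3:    A  10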
This works:
tibble(
Name = c(rep("A", 3), rep("B", 2), rep("E", 5)),
Age = c(rep(10, 3), 15, 14, rep(20, 3), 19, 19)
) %>%
group_by(Name, Age) %>%
summarise(n())
gives:
# A tibble: 5 x 3
# Groups: Name [3]
Name Age `n()`
<chr> <dbl> <int>
1 A 10 3
2 B 14 1
3 B 15 1
4 E 19 2
5 E 20 3
There might be a *_join variant for this that I'm missing, but I have two data frames, where:
1. The merging should happen into the first data frame, hence left_join.
2. I not only want to add columns, but also update existing columns in the first data frame; more specifically, replace NAs in the first data frame with values from the second data frame.
3. The second data frame contains more rows than the first one.
Conditions #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need to do some steps in between, and I am wondering if there is an easier way to get the desired output.
x <- data.frame(id = c(1, 2, 3),
a = c("A", "B", NA))
id a
1 1 A
2 2 B
3 3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
a = c("A", "B", "C", "D"),
q = c("u", "v", "w", "x"))
id a q
1 1 A u
2 2 B v
3 3 C w
4 4 D x
and the desired output would be:
id a q
1 1 A u
2 2 B v
3 3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
left_join(., y %>% select(id, q), by = c("id")) %>%
rows_update(., y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
left_join(y, by = 'id') %>%
transmute(id, a = coalesce(a.x, a.y), q)
# id a q
#1 1 A u
#2 2 B v
#3 3 C w
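coalesce() takes the first non-NA value element-wise across its arguments, which is what fills the missing a here:
coalesce(c("A", "B", NA), c("A", "B", "C"))
# [1] "A" "B" "C"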
I have a data set of individuals with a number of health conditions. Individuals either do (1) or do not (0) have each condition (my real data set has 14). What I want to do is summarise the data so I know how often pairs of conditions occur. Note that some individuals may have three or four of the conditions, but what I'm interested in is the pairwise co-occurence. I would then like to plot this as a heatmap.
I suspect that the solution involves the 'gather' function from tidyr, but I haven't been able to work it out. This is an example of what my input looks like and what I'd like to achieve:
Here's some data on individuals and whether or not they have conditions "a", "b" or "c":
library(tidyverse)
library(viridis)
dat <- tibble(
id = c(1:15),
a = c(1,0,0,0,1,1,1,0,1,0,0,0,1,0,1),
b = c(1,0,0,1,1,1,0,0,1,0,0,1,1,0,1),
c = c(0,0,1,1,0,1,0,1,0,1,1,0,1,1,0))
I want to summarise how often each of the conditions occurs, and how often they co-occur. In this case, it's evident that conditions "a" and "b" co-occur more often than either does with "c", which usually occurs on its own. Below is my imagined idea of what the data would look like in a plottable format: the first column is 'variable 1', the second is 'variable 2', and the third is the count of how often these occur together. Below that is the plot I have in mind.
plotdat <- tibble(
var1 = c("a", "a", "a", "b", "b", "c"),
var2 = c("a", "b", "c", "b", "c", "c"),
count = c(7, 6, 2, 8, 3, 8))
ggplot(plotdat) +
geom_tile(aes(var1, var2, fill = count)) +
scale_fill_viridis()
Perhaps this is not the right approach at all and I actually need to convert the data into a 3x3 matrix. Any possible solutions would be gratefully received!
Here is a way
library(tidyverse)
as.matrix(dat[-1]) %>%
crossprod() %>%
`[<-`(upper.tri(.), NA) %>%
as.data.frame() %>%
rownames_to_column() %>%
gather(key, value, -rowname) %>%
filter(!is.na(value))
# rowname key value
#1 a a 7
#2 b a 6
#3 c a 2
#4 b b 8
#5 c b 3
#6 c c 8
The most important piece is crossprod, I think: crossprod(X) computes t(X) %*% X, so for a 0/1 matrix the entry in row i, column j counts the rows in which conditions i and j both equal 1, i.e. their co-occurrences. But let's go through it step by step.
You don't need the id column, so we exclude it, and we convert dat[-1] to a matrix because that is what crossprod expects.
as.matrix(dat[-1]) %>%
crossprod()
# a b c
#a 7 6 2
#b 6 8 3
#c 2 3 8
Then we replace the upper triangle of this matrix with NA, because you don't want to count both a-b and b-a, etc.
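The `[<-`(upper.tri(.), NA) step is just a pipe-friendly spelling of an ordinary replacement; without the pipe it would read:
m <- crossprod(as.matrix(dat[-1]))
m[upper.tri(m)] <- NA
m
#   a  b  c
# a 7 NA NA
# b 6  8 NA
# c 2  3  8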
The next step is to convert to a dataframe, make the rownames a column and reshape from wide to long
as.matrix(dat[-1]) %>%
crossprod() %>%
`[<-`(upper.tri(.), NA) %>%
as.data.frame() %>%
rownames_to_column() %>%
gather(key, value, -rowname)
# rowname key value
#1 a a 7
#2 b a 6
#3 c a 2
#4 a b NA
#5 b b 8
#6 c b 3
#7 a c NA
#8 b c NA
#9 c c 8
Finally, remove the NAs to get the desired output.
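If you then want the heatmap from your question, the long-format result can be piped straight into ggplot(); rowname, key and value play the roles of var1, var2 and count:
as.matrix(dat[-1]) %>%
  crossprod() %>%
  `[<-`(upper.tri(.), NA) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(key, value, -rowname) %>%
  filter(!is.na(value)) %>%
  ggplot() +
  geom_tile(aes(rowname, key, fill = value)) +
  scale_fill_viridis()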
I have a df that looks like
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
Basically, A is 1, B is 2, C is 3, and so on.
However, as you can see, B has both "2" and "15". "15" is a wrong value and should not be there.
I would like to find the rows whose Value does not match the other values within the same Name.
The ideal output would look like
B 2
B 15
You can use tidyverse functions like:
df %>%
group_by(Name, Value) %>%
unique()
giving:
Name Value
1 A 1
2 B 2
3 B 15
4 C 3
5 D 4
6 E 5
then, to keep only the Names with more than one distinct Value, continue the pipe:
df %>%
  group_by(Name, Value) %>%
  unique() %>%
  group_by(Name) %>%
  filter(n() > 1)
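A more compact spelling of the same idea uses distinct() to deduplicate directly:
df %>%
  distinct(Name, Value) %>%
  group_by(Name) %>%
  filter(n() > 1)
# Name Value
#1 B 2
#2 B 15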
Something like this? It searches for Names associated with more than one distinct Value and outputs one copy of each {Name, Value} pair.
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
res <- do.call(rbind, lapply(unique(df$Name), (function(i){
if (length(unique(df[df$Name == i,]$Value)) > 1 ) {
out <- df[df$Name == i,]
out[!duplicated(out$Value), ]
}
})))
res
Result as expected
Name Value
4 B 2
5 B 15
Filter(function(x) nrow(unique(x)) != 1, split(df, df$Name))
$B
Name Value
4 B 2
5 B 15
Or:
Reduce(rbind, by(df, df$Name, function(x) if (nrow(unique(x)) > 1) x))
Name Value
4 B 2
5 B 15
I have a dataframe that is created by a for-loop, with a changing number of columns.
In a different function I want to drop the last five columns.
The variable holding the length of the dataframe is "units", and it takes values between 10 and 150.
I have tried dropping columns by their names, but it is not working (as soon as I try to open "newframe", RStudio crashes; viewing myframe is no problem).
drops <- c("name1","name2","name3","name4","name5")
newframe <- results[,!(names(myframe) %in% drops)]
Is there any way to just drop the last five columns of a dataframe without relying on the names or numbers of the columns?
length() can also be used:
mydf[1:(length(mydf) - 5)]
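This works because a data frame is a list of its columns, so length() and ncol() agree:
length(mtcars)
# [1] 11
ncol(mtcars)
# [1] 11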
You could use the number of columns (ncol()):
df <- data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10), ws = rnorm(10))
# remove the last 2 columns (the same idea works for 5)
df[ , -((ncol(df) - 1):ncol(df))]
# or
df[ , -seq(ncol(df) - 1, ncol(df))]
You can take advantage of the list method for head() (which drops whole list elements, unlike the data.frame method, which drops rows):
# data.frame with 26 columns (named a-z):
df <- setNames( as.data.frame( as.list(1:26)) , letters )
# drop last 5 'columns':
as.data.frame( head(as.list(df),-5) )
# a b c d e f g h i j k l m n o p q r s t u
#1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
My preferred method is using rev, which makes the syntax cleaner. For the mtcars data set:
mtcars[-rev(seq_len(ncol(mtcars)))[1:5]]
Here rev(seq_len(ncol(mtcars)))[1:5] gives the positions of the last five columns (11, 10, 9, 8, 7), which the - then drops.
Or using head (similar to Simon's suggestion):
mtcars[head(seq_len(ncol(mtcars)), -5)]
A tidyverse option is to use last_col, where we first select the fifth column from the end (i.e., last_col(offset = 4)) through the last column, and then use - to remove the selected columns.
library(tidyverse)
df %>%
select(-(last_col(offset = 4):last_col()))
Output
x y z
1 1 10 5
2 2 9 5
3 3 8 5
4 4 7 5
5 5 6 5
6 6 5 5
7 7 4 5
8 8 3 5
9 9 2 5
10 10 1 5
Another option is to use ncol() inside select():
df %>%
select(-((ncol(.) - 4):ncol(.)))
Or we could use tail with names:
df %>%
select(-tail(names(.), 5))
Data
df <- structure(list(x = 1:10, y = 10:1, z = c(5, 5, 5, 5, 5, 5, 5,
5, 5, 5), a = 11:20, b = c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j"), c = c("t", "s", "r", "q", "p", "o", "n", "m",
"l", "k"), d = 30:39, e = 50:59), class = "data.frame", row.names = c(NA,
-10L))
If you are using the data.table package for your data processing, one nice way is:
drops <- c("name1","name2","name3","name4","name5")
df[, .SD, .SDcols=!drops]
In fact, this allows you to drop any variables you like.
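.SDcols also accepts column positions, so a purely positional version of the drop (a sketch, assuming df is already a data.table) would be:
df[, .SD, .SDcols = seq_len(ncol(df) - 5)]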