I'm working in R.
I have a dataframe df with three columns. The structure looks like this:
df <- data.frame(c(11:15,4:7,21:24), c(rep("A",9),rep("B",4)), c(rep("X",5),rep("Y",4),rep("X",4)))
colnames(df) <- c("pos","name","name2")
Example:
pos name name2
11 A X
12 A X
13 A X
14 A X
15 A X
4 A Y
5 A Y
6 A Y
7 A Y
21 B X
22 B X
23 B X
24 B X
From this dataframe, I want to create a new one (df_new) that looks like this
name name2 pos_min pos_max
A X 11 15
A Y 4 7
B X 21 24
So for every unique combination of name & name2 (in this case: A-X, A-Y and B-X), I want to put the minimal and maximal value of df$pos in two new columns.
Can anybody help me to achieve this?
This can be solved using the dplyr package:
library(dplyr)
df_new <- df %>%
  group_by(name, name2) %>%
  summarise(pos_min = min(pos),
            pos_max = max(pos))
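If dplyr is not available, the same grouped summary can be sketched in base R. This is only one possible approach and not the answerer's method: it calls aggregate() once per statistic and joins the two results with merge(). The intermediate names pos_min_df and pos_max_df are made up for illustration.

```r
# Rebuild the example data from the question
df <- data.frame(
  pos   = c(11:15, 4:7, 21:24),
  name  = c(rep("A", 9), rep("B", 4)),
  name2 = c(rep("X", 5), rep("Y", 4), rep("X", 4))
)

# One aggregate() call per statistic; each returns name, name2 and pos
pos_min_df <- aggregate(pos ~ name + name2, data = df, FUN = min)
pos_max_df <- aggregate(pos ~ name + name2, data = df, FUN = max)

# Join on the grouping columns; the suffixes turn the two "pos" columns
# into pos_min and pos_max
df_new <- merge(pos_min_df, pos_max_df,
                by = c("name", "name2"),
                suffixes = c("_min", "_max"))
df_new
```

The result should have one row per name/name2 combination that actually occurs in df.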
I have a question regarding efficiently populating an R dataframe based on data retrieved from another dataframe.
So my input typically looks like:
dfInput <- data.frame(start = c(1,6,17,29), end = c(5,16,28,42), value = c(1,2,3,4))
start end value
1 5 1
6 16 2
17 28 3
29 42 4
I want to find the min and max values in cols 1 and 2 and create a new dataframe with a row for each value in that range:
rangeMin <- min(dfInput$start)
rangeMax <- max(dfInput$end)
dfOutput <- data.frame(index = c(rangeMin:rangeMax), value = 0)
And then populate it with the appropriate "values" from the input dataframe:
for (i in seq(nrow(dfOutput))) {
lookup <- dfOutput[i,"index"]
dfOutput[i, "value"] <- dfInput[which(dfInput$start <= lookup &
dfInput$end >= lookup),"value"]
}
This for-loop achieves what I want to do, but it feels like this is a very convoluted way to do it.
Is there a way that I can do something like:
dfOutput$value <- dfInput[which(dfInput$start <= dfOutput$index &
dfInput$end >= dfOutput$index),"value"]
Or something else to populate the values when instantiating dfOutput.
I feel like this is pretty basic but I'm new to R, so many thanks for any help!
You can create a sequence between start and end:
library(dplyr)
dfInput %>%
mutate(index = purrr::map2(start, end, seq)) %>%
tidyr::unnest(index) %>%
select(-start, -end)
# A tibble: 42 x 2
# value index
# <dbl> <int>
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# 6 2 6
# 7 2 7
# 8 2 8
# 9 2 9
#10 2 10
# … with 32 more rows
In base R:
do.call(rbind, Map(function(x, y, z)
data.frame(index = x:y, value = z), dfInput$start, dfInput$end, dfInput$value))
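A variation on the base R idea, offered as a sketch rather than a recommendation: build the index vector and the value vector separately and combine them once, instead of creating one small data frame per range and binding them. The helper name lens is made up for illustration.

```r
dfInput <- data.frame(start = c(1, 6, 17, 29),
                      end   = c(5, 16, 28, 42),
                      value = c(1, 2, 3, 4))

# Number of index positions covered by each start/end range
lens <- dfInput$end - dfInput$start + 1

dfOutput <- data.frame(
  # Expand every range into its full sequence and concatenate them
  index = unlist(Map(seq, dfInput$start, dfInput$end)),
  # Repeat each value once per index position in its range
  value = rep(dfInput$value, times = lens)
)
head(dfOutput)
```

Unlike the for-loop in the question, this does not require the ranges to cover every index between the overall minimum and maximum; gaps are simply left out of the result.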
I have just added new data to my original data frame, so simplified, it looks like this:
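df <- data.frame(ID = rep(c("a", "b", "c", "d"), times = c(4, 4, 5, 3)),
                 Volume = c(1.23, NA, NA, NA, 4.74, NA, NA, NA,
                            5.35, NA, NA, NA, NA, 1.53, NA, NA))
df
ID Volume
1 a 1.23
2 a NA
3 a NA
4 a NA
5 b 4.74
6 b NA
7 b NA
8 b NA
9 c 5.35
10 c NA
11 c NA
12 c NA
13 c NA
14 d 1.53
15 d NA
16 d NA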
where I have an ID column with differing numbers of entries for each ID and a volume for one entry, but not the others.
Is there a way to populate the empty Volume cells with the filled cell of the corresponding ID?
I'm essentially trying to remove the step of going into Excel and using the "drag to fill" for each ID (I have over 2000 IDs). Not every ID has the same amount of entries (i.e. ID "a" has 4, where ID "c" has 5 and ID "d" has 3).
I'm thinking dplyr will have a tool to do this, but I have not been able to find the answer.
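In the tidyverse, group by ID and fill the Volume column downwards (this assumes the empty cells are NA):
library(tidyverse)
df %>%
  group_by(ID) %>%
  fill(Volume) %>%
  ungroup()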
Here is a base R solution, where ave() is used to fill in the NAs:
df <- within(df, Volume <- ave(Volume, ID, FUN = function(x) unique(x[!is.na(x)])))
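Another way to think about the problem is as a lookup: collect the rows that already have a Volume, then match every ID against them. The sketch below assumes that the empty cells are NA and that each ID has exactly one known Volume; the name known is made up for illustration. It does not depend on where the filled cell sits within its group.

```r
# Example data with the same shape as in the question (blank cells as NA)
df <- data.frame(
  ID = rep(c("a", "b", "c", "d"), times = c(4, 4, 5, 3)),
  Volume = c(1.23, NA, NA, NA, 4.74, NA, NA, NA,
             5.35, NA, NA, NA, NA, 1.53, NA, NA)
)

# Lookup table: the rows where Volume is already filled in
known <- df[!is.na(df$Volume), ]

# match() gives, for every row of df, the position of its ID in the lookup
# table; indexing known$Volume with it copies the value to the whole group
df$Volume <- known$Volume[match(df$ID, known$ID)]
df
```

An ID with no known Volume at all would stay NA, because match() returns NA for it.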
I would like to update the data frame d_sub with two new columns, x and y (but not the column xy), by matching on the common columns (treatment, replicate) in the parent data frame d.
set.seed(0)
x <- rep(1:10, 4)
y <- sample(c(rep(1:10, 2)+rnorm(20)/5, rep(6:15, 2) + rnorm(20)/5))
treatment <- sample(gl(8, 5, 40, labels=letters[1:8]))
replicate <- sample(gl(8, 5, 40))
d <- data.frame(x=x, y=y, xy=x*y, treatment=treatment, replicate=replicate)
d_sub <- d[sample(nrow(d),6),4:5]
d_sub
# treatment replicate
# 32 b 2
# 11 h 7
# 9 h 3
# 20 e 3
# 10 b 5
# 7 d 3
Unlike a normal merge or the other methods mentioned here, I only need to extract a few columns, as shown in the expected output below:
# treatment replicate x y
# 32 b 2 2 8.998847
# 11 h 7 1 5.082928
# 9 h 3 2 7.050445
# 20 e 3 10 10.145350
# 10 b 5 10 7.941056
# 7 d 3 7 6.814287
Note the exclusion of the xy column in the output! In my original problem, there are thousands of columns that I do not need in the output and only a very few that I do. I am especially looking for methods other than merge, to find out whether I can get the result in a memory-efficient way.
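I guess it has been asked here before, but what you are looking for is merge, restricted to the columns you need so that xy is left out:
merge(d_sub, d[, c("treatment", "replicate", "x", "y")], by = c("treatment", "replicate"))
or, to overwrite d_sub with the result:
d_sub <- merge(d_sub, d[, c("treatment", "replicate", "x", "y")], by = c("treatment", "replicate"))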
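Since the question asks for something other than merge, here is a rough sketch of a lookup with match() that copies only the wanted columns and never builds a wide intermediate table. It uses a small made-up data set rather than the seeded data above, the helper make_key is hypothetical, and it assumes each treatment/replicate combination occurs only once in d (match() returns the first hit otherwise).

```r
# Small illustrative parent table (not the data from the question)
d <- data.frame(
  x = 1:4,
  y = c(0.5, 1.5, 2.5, 3.5),
  treatment = c("a", "a", "b", "b"),
  replicate = c(1, 2, 1, 2)
)
d$xy <- d$x * d$y

# Subset holding only the key columns
d_sub <- d[c(3, 2), c("treatment", "replicate")]

# Hypothetical helper: combine both key columns into a single string key
make_key <- function(df) paste(df$treatment, df$replicate, sep = "_")

# Row of d that corresponds to each row of d_sub
idx <- match(make_key(d_sub), make_key(d))

# Copy over only the columns that are needed; xy is never touched
d_sub$x <- d$x[idx]
d_sub$y <- d$y[idx]
d_sub
```

The original row names and row order of d_sub are kept, which merge() does not guarantee.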
I am interested in re-arranging a data.frame in R. Bear with me as I stumble through a reproducible example.
I have a nominal variable which can have one of two values. Currently this nominal variable is a column. Instead I would like to have two columns, representing the two values this nominal variable can have. Here is an example data frame. s is the nominal variable, with values t and c.
n <- c(1,1,2,2,3,3,4,4)
s <- c("t","c","t","c","t","c","t","c")
b <- c(11,23,6,5,12,16,41,3)
mydata <- data.frame(n, s, b)
I would rather have a data frame that looked like this
n.n <- c(1,2,3,4)
trt <- c(11,6,12,41)
cnt <- c(23,5,16,3)
new.data <- data.frame(n.n, trt, cnt)
I am sure there is a way to use mutate or possibly tidyr, but I am not sure what the best route is, and the data frame I would like to re-arrange is quite large.
You want spread (in tidyr 1.0.0 and later, pivot_wider is the recommended replacement, though spread still works):
library(dplyr)
library(tidyr)
new.data <- mydata %>% spread(s,b)
n c t
1 1 23 11
2 2 5 6
3 3 16 12
4 4 3 41
How about unstack(mydata, b~s):
c t
1 23 11
2 5 6
3 16 12
4 3 41
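Note that unstack() drops the n column. If n should be kept without loading any packages, base R's reshape() is one option. The following is a sketch under the assumption that each n has exactly one t row and one c row; the renaming to trt and cnt simply follows the names used in the question.

```r
n <- c(1, 1, 2, 2, 3, 3, 4, 4)
s <- c("t", "c", "t", "c", "t", "c", "t", "c")
b <- c(11, 23, 6, 5, 12, 16, 41, 3)
mydata <- data.frame(n, s, b)

# One row per n; the values of b are spread into one column per level of s.
# reshape() names the new columns b.t and b.c
new.data <- reshape(mydata, idvar = "n", timevar = "s", direction = "wide")

# Rename to the column names asked for in the question
names(new.data)[names(new.data) == "b.t"] <- "trt"
names(new.data)[names(new.data) == "b.c"] <- "cnt"
new.data
```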
Sorry if the solution to my problem is already out there, and I overlooked it. There are a lot of similar topics which all helped me understand the basics of what I'm trying to do, but did not quite solve my exact problem.
I have a data frame df:
> type = c("A","A","A","A","A","A","B","B","B","B","B","B")
> place = c("x","y","z","x","y","z","x","y","z","x","y","z")
> value = c(1:12)
>
> df=data.frame(type,place,value)
> df
type place value
1 A x 1
2 A y 2
3 A z 3
4 A x 4
5 A y 5
6 A z 6
7 B x 7
8 B y 8
9 B z 9
10 B x 10
11 B y 11
12 B z 12
>
(my real data has 3 different values in type and 10 in place, if that makes a difference)
I want to extract rows based on the strings in the columns type and place.
E.g. I want to extract all rows that contain A in type and x and z in place, or all rows with A and B in type and y in place.
This works perfectly with subset, but I want to run my scripts on different combinations of extracted rows, and adjusting the subset command every time isn't very effective.
I thought of using a vector containing as elements what to get from type and place, respectively.
I tried:
v=c("A","x","z")
df.extract <- df[df$type&df$place %in% v]
but this returns an error.
I'm a total beginner with R and programming, so please bear with me.
You could try
df[df$type=='A' & df$place %in% c('x','y'),]
# type place value
#1 A x 1
#2 A y 2
#4 A x 4
#5 A y 5
For the second case
df[df$type %in% c('A', 'B') & df$place=='y',]
Update
Suppose you have many columns and need to subset the dataset based on values from several of them. For example:
set.seed(24)
df1 <- cbind(df, df[sample(1:nrow(df)),], df[sample(1:nrow(df)),])
colnames(df1) <- paste0(c('type', 'place', 'value'), rep(1:3, each=3))
row.names(df1) <- NULL
You can create a list of the values from the columns of interest
v1 <- setNames(list('A', 'x', c('A', 'B'),
'x', 'B', 'z'), paste0(c('type', 'place'), rep(1:3, each=2)))
and then use Reduce
df1[Reduce(`&`,Map(`%in%`, df1[names(v1)], v1)),]
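To see what the Map/Reduce combination does, it may help to run it on the original two-column case from the question. This is a minimal sketch; the names filters, masks and keep are made up for illustration.

```r
df <- data.frame(
  type  = rep(c("A", "B"), each = 6),
  place = rep(c("x", "y", "z"), times = 4),
  value = 1:12
)

# One entry per column to filter on, holding the values to keep
filters <- list(type = "A", place = c("x", "z"))

# Map() applies %in% column by column, giving one logical vector per filter
masks <- Map(`%in%`, df[names(filters)], filters)

# Reduce() combines the logical vectors with &, so a row is kept only
# if it passes every filter
keep <- Reduce(`&`, masks)

df.extract <- df[keep, ]
df.extract
```

Changing the selection then only means editing the filters list, which is what the question was after.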
You can make a function extract:
extract <- function(df, type, place) {
  df[df$type %in% type & df$place %in% place, ]
}
that will work for the different subsets you want to do:
df.extract<-extract(df=df,type="A",place=c("x","y")) # or just extract(df,"A",c("x","y"))
> df.extract
type place value
1 A x 1
2 A y 2
4 A x 4
5 A y 5
df.extract<-extract(df=df,type=c("A","B"),place="y") # or just extract(df,c("A","B"),"y")
> df.extract
type place value
2 A y 2
5 A y 5
8 B y 8
11 B y 11