Say I have the data frame below. How can I remove only the rows with NA values for name a, without deleting them by hand?
a 1 4
a 7 3
a NA 4
a 6 3
a NA 4
a NA 3
a 2 4
a NA 3
a 1 4
b NA 2
c 3 NA
I've tried using the function !is.na, but that removes all the NA values in the column ID1 for all the names. How would I specifically target the ones that are associated with name a?
You could subset your data frame as follows:
df_new <- df[!(df$name == "a" & is.na(df$ID1)), ]
This can also be written as:
df_new <- df[df$name != "a" | !is.na(df$ID1), ]
With dplyr:
library(dplyr)
df %>%
filter(!(name == "a" & is.na(ID1)))
Or with subset:
subset(df, !(name == "a" & is.na(ID1)))
Output
name ID1 ID2
1 a 1 4
2 a 7 3
3 a 6 3
4 a 2 4
5 a 1 4
6 b NA 2
7 c 3 NA
Data
df <- structure(list(name = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "b", "c"), ID1 = c(1L, 7L, NA, 6L, NA, NA, 2L, NA, 1L, NA,
3L), ID2 = c(4L, 3L, 4L, 3L, 4L, 3L, 4L, 3L, 4L, 2L, NA)), class = "data.frame", row.names = c(NA,
-11L))
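The two base R forms above are the same condition rewritten with De Morgan's law: !(A & B) is the same as !A | !B. A minimal sketch to check this on the sample data (the names to_drop, keep_v1 and keep_v2 are only illustrative, not part of the answer):

```r
# Rebuild the sample data so the snippet runs on its own
df <- data.frame(
  name = c(rep("a", 9), "b", "c"),
  ID1  = c(1L, 7L, NA, 6L, NA, NA, 2L, NA, 1L, NA, 3L),
  ID2  = c(4L, 3L, 4L, 3L, 4L, 3L, 4L, 3L, 4L, 2L, NA),
  stringsAsFactors = FALSE
)

# Rows to drop: name is "a" AND ID1 is missing
to_drop <- df$name == "a" & is.na(df$ID1)

# Form 1: negate the drop condition
keep_v1 <- !to_drop
# Form 2: De Morgan rewrite, keep if name is not "a" OR ID1 is present
keep_v2 <- df$name != "a" | !is.na(df$ID1)

# Both logical vectors select the same rows
same <- identical(keep_v1, keep_v2)

df_new <- df[keep_v1, ]
```

Note that base subsetting keeps the original row names, whereas dplyr's filter() renumbers them from 1.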
I need to apply a function to several subsets of data of differing lengths within a column and generate a new data frame which includes the outputs and their associated metadata.
How can I do this without recourse to for loops? tapply() seems like a good place to start, but I struggle with the syntax.
For example -- I have something like this:
block plot id species type response
1 1 1 w a 1.5
1 1 2 w a 1
1 1 3 w a 2
1 1 4 w a 1.5
1 2 5 x a 5
1 2 6 x a 6
1 2 7 x a 7
1 3 8 y b 10
1 3 9 y b 11
1 3 10 y b 9
1 4 11 z b 1
1 4 12 z b 3
1 4 13 z b 2
2 5 14 w a 0.5
2 5 15 w a 1
2 5 16 w a 1.5
2 6 17 x a 3
2 6 18 x a 2
2 6 19 x a 4
2 7 20 y b 13
2 7 21 y b 12
2 7 22 y b 14
2 8 23 z b 2
2 8 24 z b 3
2 8 25 z b 4
2 8 26 z b 2
2 8 27 z b 4
And I want to produce something like this:
block plot species type mean.response
1 1 w a 1.5
1 2 x a 6
1 3 y b 10
1 4 z b 2
2 5 w a 1
2 6 x a 3
2 7 y b 13
2 8 z b 3
Try this. You can use group_by() to set the grouping variables and then summarise() to compute the mean for each group. Here is the code using dplyr:
library(dplyr)
#Code
newdf <- df %>%
  group_by(block, plot, species, type) %>%
  summarise(Mean = mean(response, na.rm = TRUE))
Output:
# A tibble: 8 x 5
# Groups: block, plot, species [8]
block plot species type Mean
<int> <int> <chr> <chr> <dbl>
1 1 1 w a 1.5
2 1 2 x a 6
3 1 3 y b 10
4 1 4 z b 2
5 2 5 w a 1
6 2 6 x a 3
7 2 7 y b 13
8 2 8 z b 3
Or using base R (-3 is used to omit id variable in the aggregation):
#Base R
newdf <- aggregate(response ~ ., data = df[, -3], mean, na.rm = TRUE)
Output:
block plot species type response
1 1 1 w a 1.5
2 2 5 w a 1.0
3 1 2 x a 6.0
4 2 6 x a 3.0
5 1 3 y b 10.0
6 2 7 y b 13.0
7 1 4 z b 2.0
8 2 8 z b 3.0
Some data used:
#Data
df <- structure(list(block = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L
), id = 1:27, species = c("w", "w", "w", "w", "x", "x", "x",
"y", "y", "y", "z", "z", "z", "w", "w", "w", "x", "x", "x", "y",
"y", "y", "z", "z", "z", "z", "z"), type = c("a", "a", "a", "a",
"a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b"), response = c(1.5,
1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2, 0.5, 1, 1.5, 3, 2, 4,
13, 12, 14, 2, 3, 4, 2, 4)), class = "data.frame", row.names = c(NA,
-27L))
Use any of these where the input dd is given reproducibly in the Note at the end:
# 1. aggregate.formula - base R
# Can use just response on left hand side if header doesn't matter.
aggregate(cbind(mean.response = response) ~ block + plot + species + type, dd, mean)
# 2. aggregate.default - base R
v <- c("block", "plot", "species", "type")
aggregate(list(mean.response = dd$response), dd[v], mean)
# 3. sqldf
library(sqldf)
sqldf("select block, plot, species, type, avg(response) as [mean.response]
from dd group by 1, 2, 3, 4")
# 4. data.table
library(data.table)
v <- c("block", "plot", "species", "type")
as.data.table(dd)[, .(mean.response = mean(response)), by = v]
# 5. doBy - last column of output will be labelled response.mean
library(doBy)
summaryBy(response ~ block + plot + species + type, dd)
Note
The input in reproducible form:
dd <- structure(list(block = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), plot = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L
), id = 1:27, species = c("w", "w", "w", "w", "x", "x", "x",
"y", "y", "y", "z", "z", "z", "w", "w", "w", "x", "x", "x", "y",
"y", "y", "z", "z", "z", "z", "z"), type = c("a", "a", "a", "a",
"a", "a", "a", "b", "b", "b", "b", "b", "b", "a", "a", "a", "a",
"a", "a", "b", "b", "b", "b", "b", "b", "b", "b"), response = c(1.5,
1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2, 0.5, 1, 1.5, 3, 2, 4,
13, 12, 14, 2, 3, 4, 2, 4)), class = "data.frame", row.names = c(NA,
-27L))
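Since the question mentions tapply(), here is a sketch of that route as well. It assumes, as in the sample data, that plot alone identifies each group, so block, species and type can be looked up afterwards. The names means and meta are just for illustration, and the data frame is rebuilt compactly with rep() so the snippet runs on its own.

```r
# Group sizes for plots 1..8 in the sample data
sizes <- c(4, 3, 3, 3, 3, 3, 3, 5)

dd <- data.frame(
  block    = rep(1:2, times = c(13, 14)),
  plot     = rep(1:8, times = sizes),
  species  = rep(rep(c("w", "x", "y", "z"), 2), times = sizes),
  type     = rep(rep(c("a", "a", "b", "b"), 2), times = sizes),
  response = c(1.5, 1, 2, 1.5, 5, 6, 7, 10, 11, 9, 1, 3, 2,
               0.5, 1, 1.5, 3, 2, 4, 13, 12, 14, 2, 3, 4, 2, 4),
  stringsAsFactors = FALSE
)

# tapply(values, grouping, function): one mean per plot, named by plot
means <- tapply(dd$response, dd$plot, mean)

# One row of metadata per group, in order of first appearance
meta <- unique(dd[c("block", "plot", "species", "type")])

# Attach the means by looking them up via the plot name
meta$mean.response <- as.vector(means[as.character(meta$plot)])
rownames(meta) <- NULL
```

If the groups were defined by several columns at once, aggregate() as shown above is probably the simpler choice, since tapply() would return a multi-dimensional array.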
I have a complete data frame of all cities in Brazil, but I only want some predefined cities, which are listed in a separate column. I'd like to keep all the columns of my data frame, but select only the rows where the city in the all-cities column matches one of the cities in the predefined-cities column.
data = read.csv(file="C:/Users/guilherme/Desktop/data.csv", header=TRUE, sep=";")
> data
AllCities Year1990 Year200 PredefinedCities CharacCities1 CharacCities2
1 A 2 4 C 12 5
2 B 2 2 A 11 10
3 C 3 4 F 09 2
4 D 4 2
5 E 5 6
6 F 6 2
I want the following
> data
AllCities Year1990 Year200 PredefinedCities CharacCities1 CharacCities2
1 C 3 4 C 12 5
2 A 2 4 A 11 10
3 F 6 2 F 09 2
You need merge -
merge(
data[, c("AllCities", "Year1990", "Year200")],
data[, c("PredefinedCities", "CharacCities1", "CharacCities2")],
by.x = "AllCities", by.y = "PredefinedCities"
)
AllCities Year1990 Year200 CharacCities1 CharacCities2
1 A 2 4 11 10
2 C 3 4 12 5
3 F 6 2 9 2
Note - Your data format is unusual. If you can, you should fix the data source so that it gives you the AllCities and PredefinedCities tables separately, or maybe even join them correctly before creating the csv file.
Data -
structure(list(AllCities = c("A", "B", "C", "D", "E", "F"), Year1990 = c(2L,
2L, 3L, 4L, 5L, 6L), Year200 = c(4L, 2L, 4L, 2L, 6L, 2L), PredefinedCities = c("C",
"A", "F", "", "", ""), CharacCities1 = c(12L, 11L, 9L, NA, NA,
NA), CharacCities2 = c(5L, 10L, 2L, NA, NA, NA)), .Names = c("AllCities",
"Year1990", "Year200", "PredefinedCities", "CharacCities1", "CharacCities2"
), class = "data.frame", row.names = c(NA, -6L))
data <- data[data$AllCities %in% data$PredefinedCities,]
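If the row order of PredefinedCities should be preserved (C, A, F, as in the desired output), a lookup with match() is one option. This is only a sketch on the sample data; the names pre, idx and result are made up for illustration, and it assumes every predefined city actually appears in AllCities.

```r
data <- data.frame(
  AllCities        = c("A", "B", "C", "D", "E", "F"),
  Year1990         = c(2L, 2L, 3L, 4L, 5L, 6L),
  Year200          = c(4L, 2L, 4L, 2L, 6L, 2L),
  PredefinedCities = c("C", "A", "F", "", "", ""),
  CharacCities1    = c(12L, 11L, 9L, NA, NA, NA),
  CharacCities2    = c(5L, 10L, 2L, NA, NA, NA),
  stringsAsFactors = FALSE
)

# The "predefined" half of the table, without the empty padding rows
pre <- data[data$PredefinedCities != "",
            c("PredefinedCities", "CharacCities1", "CharacCities2")]

# Position of each predefined city within AllCities
idx <- match(pre$PredefinedCities, data$AllCities)

# Pull the matching rows of the "all cities" half and bind the two halves
result <- cbind(data[idx, c("AllCities", "Year1990", "Year200")], pre)
rownames(result) <- NULL
```

Unlike merge(), which sorts by the key, this keeps the rows in the order given by PredefinedCities and keeps both city columns.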
I am working on a data frame like this:
groups . values
a . 1
a . 1
a . 2
b . 2
b . 3
b . 3
c . 4
c . 5
c . 6
d . 6
d . 7
d . 2
The problem is to turn it into something like:
groups . values
a . 1
a . 1
b . 3
b . 3
c . 4
c . 5
d . 7
I want to keep only rows whose value occurs in exactly ONE group. For example, value 2 is deleted because it occurs in three different groups, but value 1 is kept: although it occurs twice, both occurrences are in the same group.
Is there a function in the dplyr package that can handle this, or do I have to write my own?
As you asked for a dplyr solution:
df %>% group_by(values) %>% filter(n_distinct(groups) == 1)
# # A tibble: 7 x 2
# # Groups: values [5]
# groups values
# <chr> <int>
#1 a 1
#2 a 1
#3 b 3
#4 b 3
#5 c 4
#6 c 5
#7 d 7
with
df <- structure(list(groups = c("a", "a", "a", "b", "b", "b", "c", "c", "c", "d", "d", "d"),
values = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, 6L, 7L, 2L)),
row.names = c(NA, -12L), class = "data.frame")
Group by values and check whether each value appears in only one distinct group. This can be done with ave.
i <- as.logical(with(df1, ave(as.numeric(groups), values, FUN = function(x) length(unique(x)) == 1)))
df1[i, ]
# groups values
#1 a 1
#2 a 1
#5 b 3
#6 b 3
#7 c 4
#8 c 5
#11 d 7
Data in dput format.
df1 <-
structure(list(groups = structure(c(1L, 1L, 1L, 2L,
2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("a", "b",
"c", "d"), class = "factor"), values = c(1L, 1L, 2L,
2L, 3L, 3L, 4L, 5L, 6L, 6L, 7L, 2L)),
class = "data.frame", row.names = c(NA, -12L))
x[x$values %in% names(which(colSums(table(x)>0)==1)),]
where
x = structure(list(groups = c("a", "a", "a", "b", "b", "b", "c",
"c", "c", "d", "d", "d"), values = c(1L, 1L, 2L, 2L, 3L, 3L,
4L, 5L, 6L, 6L, 7L, 2L)), row.names = c(NA, -12L), class = "data.frame")
or, a data.table solution:
library(data.table)
setDT(x)[, .SD[uniqueN(groups)==1], values]
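The table() one-liner above is quite dense, so here is a sketch that unpacks it step by step on the same data. The intermediate names tab, present, n_groups and keep are only for illustration.

```r
x <- data.frame(
  groups = rep(c("a", "b", "c", "d"), each = 3),
  values = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 5L, 6L, 6L, 7L, 2L),
  stringsAsFactors = FALSE
)

# Contingency table: one row per group, one column per value
tab <- table(x)

# TRUE where a value occurs at least once in a group
present <- tab > 0

# Number of distinct groups each value occurs in
n_groups <- colSums(present)

# Values (as column names) that occur in exactly one group
keep <- names(which(n_groups == 1))

# Keep only the rows whose value is in that set
result <- x[x$values %in% keep, ]
```

The comparison in the last step works because %in% coerces the integer values to character before matching against the column names.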
Using the sqldf package on your original data frame df:
library(sqldf)
result <- sqldf("SELECT * FROM df
WHERE `values` IN (
SELECT `values` from (
SELECT `values`, groups, count(*) as num from df
GROUP BY `values`, groups) t
GROUP BY `values`
HAVING COUNT(1) = 1
)")
I want to create a new column in a dataframe that states the number of observations for a particular group.
I have a surgical procedure (HRG.Code) and multiple consultants who perform this procedure (Consultant.Code) and the length of stay their patients are in for in days.
Using
sourceData2$meanvalue <- with(sourceData2, ave(LengthOfStayDays., HRG.Code, Consultant.Code, FUN = mean))
I can get a new column (meanvalue) that shows the mean length of stay per consultant per procedure.
This is just what I need. However, I'd also like to know how many occurrences of each procedure each consultant performed, as a new column in the same data frame.
How do I generate this number of observations? There doesn't appear to be a FUN = Observations or FUN = freq capability.
You may try:
tbl <- table(sourceData2[,3:2]) #gives the frequency of each `procedure` i.e. `HRG.Code` done by every `Consultant.Code`
tbl
# HRG.Code
#Consultant.Code A B C
# A 1 1 0
# B 4 2 1
# C 0 0 1
# D 1 1 1
# E 2 0 0
as.data.frame.matrix(tbl) #converts the `table` to `data.frame`
If you want the number of unique procedures done by each Consultant.Code, in long form (one value per row):
with(sourceData2, as.numeric(ave(HRG.Code, Consultant.Code,
FUN=function(x) length(unique(x)))))
# [1] 3 3 3 2 1 3 3 3 3 1 1 3 3 3 2
data
sourceData2 <- structure(list(LengthofStayDays = c(2L, 2L, 4L, 3L, 4L, 5L, 2L,
4L, 5L, 2L, 4L, 2L, 4L, 4L, 2L), HRG.Code = c("C", "A", "A",
"B", "A", "A", "B", "C", "A", "A", "C", "A", "B", "B", "A"),
Consultant.Code = c("B", "B", "B", "A", "E", "B", "D", "D",
"D", "E", "C", "B", "B", "B", "A")), .Names = c("LengthofStayDays",
"HRG.Code", "Consultant.Code"), row.names = c(NA, -15L), class = "data.frame")