I have a data frame like this
> df
Source: local data frame [4 x 4]
a x y z
1 name1 1 1 1
2 name2 1 1 1
3 name3 1 1 1
4 name4 1 1 1
I want to mutate it by adding a total column with the row-wise sum of x, y, and z (there can be many more numeric columns). Trying to exclude column 'a' as follows is not working:
dft <- df %>% mutate(funs(total = rowSums(.)), -a)
Error: not compatible with STRSXP
This also produces an error:
dft <- df %>% mutate(total = rowSums(.), -a)
Error in rowSums(.) : 'x' must be numeric
What is the right way?
If you want to keep non-numeric columns in the result, you can do this:
df %>% mutate(total = rowSums(.[, sapply(., is.numeric)]))
UPDATE: Now that dplyr has scoped versions of its standard verbs, here's another option:
df %>% mutate(total = rowSums(select_if(., is.numeric)))
UPDATE 2: With dplyr 1.0, the approaches above will still work, but you can also do row sums by combining rowwise and c_across:
iris %>%
  rowwise() %>%
  mutate(row.sum = sum(c_across(where(is.numeric))))
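If you want to avoid rowwise() on larger data, a non-rowwise sketch (assuming dplyr >= 1.0, where across() is available) keeps the vectorised rowSums():
iris %>%
  mutate(row.sum = rowSums(across(where(is.numeric))))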
You can use rich selectors with select() inside the call to rowSums()
df %>% transmute(a, total = rowSums(select(., -a)))
This should work:
#dummy data
df <- read.table(text="a x y z
1 name1 1 1 1
2 name2 1 1 1
3 name3 1 1 1
4 name4 1 1 1",header=TRUE)
library(dplyr)
df %>% select(-a) %>% mutate(total=rowSums(.))
First exclude the text column a, then compute rowSums() over the remaining numeric columns.
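Note that this drops column a from the result. If you need to keep it, a small variation of the same idea (mutate() instead of a preceding select()) is:
df %>% mutate(total = rowSums(select(., -a)))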
Related
I have the data frame below
library(dplyr)
data = data.frame('A' = 1:3, 'CC' = 1:3, 'DD' = 1:3, 'M' = 1:3)
Now let's define a vector of strings that represents a subset of the column names of the above data frame:
Target_Col = c('CC', 'M')
Now I want to find the column names in data that match Target_Col and replace them with
paste0('Prefix_', Target_Col)
I prefer to do it with a dplyr chain.
Is there any direct function available to perform this?
vars <- data.frame(old = Target_Col, new = paste0('Prefix_', Target_Col))
data <- data %>%
  rename_at(vars$old, ~ vars$new)
or
data %>% rename_with(~ paste0('Prefix_', Target_Col), all_of(Target_Col))
We may use
library(stringr)
library(dplyr)
data %>%
rename_with(~ str_c('Prefix_', .x), all_of(Target_Col))
A Prefix_CC DD Prefix_M
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
With dplyr's rename_with:
library(dplyr)
rename_with(data, function(x) ifelse(x %in% Target_Col, paste0("Prefix_", x), x))
A Prefix_CC DD Prefix_M
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
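For comparison, a base R equivalent outside the dplyr chain (a small sketch; match() makes sure the prefix lands on the right columns regardless of their order):
idx <- match(Target_Col, names(data))          # positions of the target columns
names(data)[idx] <- paste0("Prefix_", Target_Col)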
I am trying to condense a grouped data frame, pulling out only rows that contain a certain value, but that value isn't present in all groups. I want to pull out all rows with that value, and also add an NA (or 0) row for each group that doesn't contain it.
Ex:
x1 <- c('1','1','1','1','1','2','2','2','2','2','3','3','3','3','3')
x2 <- c('a','b','c','d','e','b','c','d','e','f','a','b','d','e','f')
df <- data.frame(x1,x2)
df %>%
  group_by(x1) %>%
  filter(x2 == "a")
This returns:
x1 x2
<fct> <fct>
1 1 a
2 3 a
but I want it to return:
x1 x2
<fct> <fct>
1 1 a
2 2 NA
3 3 a
Obviously the real code is much more complicated, so I'm looking for the best way to keep these empty groups in a reproducible way.
PS: I would like to stay in dplyr to keep things smooth in a function chain.
Thanks!
One dplyr option could be:
df %>%
  group_by(x1) %>%
  slice(which.max(x2 == "a")) %>%
  mutate(x2 = replace(x2, x2 != "a", NA_character_))
x1 x2
<fct> <fct>
1 1 a
2 2 <NA>
3 3 a
If it's relevant to have multiple target values per group:
df %>%
  group_by(x1) %>%
  filter(x2 == "a") %>%
  bind_rows(df %>%
              group_by(x1) %>%
              filter(all(x2 != "a")) %>%
              slice(1) %>%
              mutate(x2 = replace(x2, x2 != "a", NA_character_)))
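A single-pipeline alternative that also handles multiple matches per group (a sketch, assuming dplyr >= 1.1, where reframe() allows a variable number of rows per group):
df %>%
  group_by(x1) %>%
  reframe(x2 = if (any(x2 == "a")) x2[x2 == "a"] else factor(NA, levels = levels(x2)))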
As you did not specify dplyr solutions only, here's one option with library(data.table)
setDT(df)
df[, .(x2 = x2[match('a', x2)]), x1]
# x1 x2
# 1: 1 a
# 2: 2 <NA>
# 3: 3 a
This happens because of the way dplyr was written.
According to Hadley Wickham (the package creator), to keep NA values you should declare that you want them explicitly. As he said in this issue on GitHub, you should filter(a == x | is.na(a)). In your case you can use the following:
df %>%
  group_by(x1) %>%
  filter(x2 == "a" | is.na(x2))
That will return this as a result:
x1 x2
<fct> <fct>
1 1 a
2 2 NA
3 3 a
In this code you're asking R for all rows in which x2 is equal to "a" and also those in which x2 is NA.
We can use complete after the filter step to get the missing combinations. By default, all the other columns will be filled with NA (this can be changed to a custom value with the fill argument; see the sketch after the output below).
library(dplyr)
library(tidyr)
df %>%
filter(x2 == 'a') %>%
complete(x1 = unique(df$x1))
# A tibble: 3 x 2
# x1 x2
# <fct> <fct>
#1 1 a
#2 2 <NA>
#3 3 a
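As a sketch of the fill argument mentioned above (assuming x2 is converted to character first so an arbitrary placeholder can be used; "none" is just an illustrative choice):
df %>%
  mutate(x2 = as.character(x2)) %>%
  filter(x2 == 'a') %>%
  complete(x1 = unique(df$x1), fill = list(x2 = "none"))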
Another option is match
df %>%
group_by(x1) %>%
summarise(x2 = x2[match('a', x2)])
If there are many columns, then mutate 'x2' with match and then slice the first row
df %>%
group_by(x1) %>%
mutate(x2 = x2[match('a', x2)]) %>%
slice(1)
How about the base R solution using aggregate() like below?
dfout <- aggregate(x2 ~ x1, df, function(v) ifelse("a" %in% v, "a", NA))
or
dfout <- aggregate(x2 ~ x1, df, function(v) v[match("a", v)])
such that
> dfout
x1 x2
1 1 a
2 2 <NA>
3 3 a
I have a dataframe like this
ID <- c("ID001","ID001","ID003","ID003","ID003",
"ID006","ID007","ID007","ID009","ID010")
Type <- c("Length","Breadth","Length","Breadth","Height",
"Length","Length","Height","Breadth","Length")
FailCount <- c(3,7,2,3,9,7,3,2,3,9)
df <- data.frame(ID,Type,FailCount)
I am trying to process this data frame with these steps:
Remove any ID with only 1 Type
Summarize the FailCount
Collapse the Type values into one row per ID, separated by commas
My desired output is
ID Type FailCount
ID001 Length, Breadth 10
ID003 Length, Breadth, Height 14
ID007 Length, Height 5
I can remove the rows with only 1 type this way
library(dplyr)
df <- df %>% group_by(ID) %>% filter(n_distinct(Type) > 1)
How do I accomplish the other tasks? Could someone point me in the right direction?
You can use summarise to get what you need:
df %>% group_by(ID) %>%
dplyr::filter(n_distinct(Type) > 1) %>%
summarise(Type=toString(Type), FailCount = sum(FailCount))
I hope this helps.
Try this
library(dplyr)
df <- df %>%
  group_by(ID) %>%
  filter(n_distinct(Type) > 1) %>%
  dplyr::summarise(Type = paste(Type, collapse = ','), FailCount = sum(FailCount))
# A tibble: 3 × 3
ID Type FailCount
<fctr> <chr> <dbl>
1 ID001 Length,Breadth 10
2 ID003 Length,Breadth,Height 14
3 ID007 Length,Height 5
This is my data:
ID a b c d
1 x 1 2 3
2 y 1 2 3
3 z NA NA NA
4 z 1 2 3
5 y NA NA NA
Now, if I wanted to replace the NAs in a single column, say b, with the mean of b by the group a, I know how to do it by using this code:
data %>%
  group_by(a) %>%
  mutate(b = ifelse(is.na(b), as.integer(mean(b, na.rm = TRUE)), b))
I want to use basically the same code but apply it over columns b, c, and d. The code I have isn't working and I don't know why; it says "error, incompatible size (3), expecting 10 (the group size) or 1"
cols <- c("b","c","d")
data %>%
group_by(a) %>%
mutate_at(.cols = cols, funs(ifelse(is.na(cols),
as.integer(mean(cols, na.rm=TRUE)), cols)
I'm assuming the problem has to do with the code not correctly applying the column names when looking at the data?
The problem is that inside funs() the name cols refers to the character vector of column names itself, not to the data in those columns, which is why you get the "incompatible size (3)" error. For referencing a character vector of columns, use mutate_if instead:
cols <- c("b","c","d")
data %>%
group_by(a) %>%
mutate_if(names(.) %in% cols,
funs(ifelse(is.na(.), as.integer(mean(., na.rm=TRUE)), .)))
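With newer dplyr (>= 1.0) the same idea is usually written with across(), which replaces the scoped mutate_if/mutate_at verbs (a sketch of the equivalent):
data %>%
  group_by(a) %>%
  mutate(across(all_of(cols), ~ ifelse(is.na(.x), as.integer(mean(.x, na.rm = TRUE)), .x)))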
I am working with a large data frame in R, and the task I need to perform makes my solution look too long. I will use DF as an example of the data frame I am using:
library(dplyr)
DF<-data.frame(ID=c(1:10),Cause1=c(rep("Yes 1",8),rep("No 1",2)),Cause2=c(rep("Yes 2",6),rep("No 2",4)),
Cause3=c(rep("Yes S",5),rep("No S",5)),Cause4=c(rep("Yes P",3),rep("No P",7)),
Cause5=c(rep("Yes",2),rep("No",8)),stringsAsFactors = F)
DF has the following structure:
ID Cause1 Cause2 Cause3 Cause4 Cause5
1 1 Yes 1 Yes 2 Yes S Yes P Yes
2 2 Yes 1 Yes 2 Yes S Yes P Yes
3 3 Yes 1 Yes 2 Yes S Yes P No
4 4 Yes 1 Yes 2 Yes S No P No
5 5 Yes 1 Yes 2 Yes S No P No
6 6 Yes 1 Yes 2 No S No P No
7 7 Yes 1 No 2 No S No P No
8 8 Yes 1 No 2 No S No P No
9 9 No 1 No 2 No S No P No
10 10 No 1 No 2 No S No P No
DF is composed of six variables (one ID variable and five variables that can be Yes or No). For each of the variables with the prefix Cause I need to compute, as a first step, a summary of that variable, and after that I have to filter by that variable where it was achieved (i.e. equal to Yes). For example, I do the first stage of this process with the following code and its respective explanation:
#Filtering stage
#N1
DF %>% group_by(Cause1) %>% summarise(N=n()) -> d1
DF %>% filter(Cause1=="Yes 1") -> DF2
In this case, using dplyr, I group DF by Cause1 and use summarise() with n() to count the number of values it has; the result is saved in d1. Then I filter DF where Cause1 is equal to Yes 1 and save that in a new data frame called DF2. Once I have DF2 I must repeat a similar routine for Cause2, Cause3, Cause4 and Cause5. For that I use the following code:
#N2
DF2 %>% group_by(Cause2) %>% summarise(N=n()) -> d2
DF2 %>% filter(Cause2=="Yes 2") -> DF3
#N3
DF3 %>% group_by(Cause3) %>% summarise(N=n()) -> d3
DF3 %>% filter(Cause3=="Yes S") -> DF4
#N4
DF4 %>% group_by(Cause4) %>% summarise(N=n()) -> d4
DF4 %>% filter(Cause4=="Yes P") -> DF5
#N5
DF5 %>% group_by(Cause5) %>% summarise(N=n()) -> d5
DF5 %>% filter(Cause5=="Yes") -> DF6
The final result is DF6, but I also have to make a control check by combining the data frames d1, d2, d3, d4 and d5 and filtering out all the No values. I used the code below for that purpose: it sets common names for all the d data frames, rbinds them, and filters on the No pattern.
#Connect
names(d1)<-names(d2)<-names(d3)<-names(d4)<-names(d5)<-c("Cause","N")
#Rbind
d<-rbind(d1,d2,d3,d4,d5)
d_reduced<-d[grepl("No",d$Cause),]
I obtain this:
Cause N
1 No 1 2
2 No 2 2
3 No S 1
4 No P 2
5 No 1
The final step is to compute the sum of N in d_reduced; the number of rows in DF minus that value must equal the number of rows of DF6:
(dim(DF)[1]-sum(d_reduced$N))==dim(DF6)[1]
That in this case is TRUE.
I would like to shorten this code, because in my analysis the number of Cause variables can grow and the code will get even longer. Maybe an apply strategy or reshaping the data would be better. Any help reducing the amount of code would be marvelous. Thanks in advance.
How about something like this?
First we summarise how many "No" cases are in each column that starts with "Cause":
num_no <- DF %>% summarise_each(funs(sum(substr(., 1, 1) == "N")), starts_with("Cause"))
> num_no
Cause1 Cause2 Cause3 Cause4 Cause5
1 2 4 5 7 8
You are interested in the incremental difference between each subsequent column, so let's just subtract a lagged version of num_no from num_no.
d_reduced <- num_no - lag(num_no, 1, 0)
> d_reduced
Cause1 Cause2 Cause3 Cause4 Cause5
1 2 2 1 2 1
This gives the values you wanted, but they are not labelled. Let's fix that by extracting, for each column, the unique string that begins with N:
labs <- lapply(DF, function(X){unique(X[grep("N", X)])}) %>% unlist
names(d_reduced) <- labs
> d_reduced
No 1 No 2 No S No P No
1 2 2 1 2 1
Then your final step would be: sum the values of d_reduced, subtract that from the number of rows of DF, and check that the result equals the number of rows that are "Yes" across the entire row.
> (nrow(DF) - sum(d_reduced)) == sum(DF[, ncol(DF)] == "Yes")
[1] TRUE
Warning: this only works because whenever the final column is Yes, all the preceding columns are also Yes (as in your example). If that assumption changes, this answer will not work.
You could reshape to long format, then count the votes, and then take the difference between the Yes counts. data.table::melt uses regular expressions to detect measure variables, which is useful for capturing all the Cause variables. Does this work?
library(data.table)
d <-
  melt(as.data.table(DF),                  # launch melt.data.table
       id.vars = "ID",
       measure.vars = patterns("Cause"),   # grep the Cause columns
       variable.name = "Cause") %>%
  group_by(Cause) %>%                      # tabulate Yes's and No's
  summarise(Yes = sum(grepl("Yes", value)),
            No = sum(grepl("No", value))) %>%
  mutate(N = lag(Yes) - Yes) %>%           # N = difference between consecutive Yes counts
  rowwise() %>%                            # replace the NA in the first row with the No value
  mutate(N = replace(N, is.na(N), No))
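If you prefer to stay within the tidyverse, roughly the same reshape can be done with tidyr::pivot_longer (a sketch, assuming tidyr >= 1.0; coalesce() fills the first row's NA with the No count, like the rowwise() step above):
library(tidyr)
DF %>%
  pivot_longer(starts_with("Cause"), names_to = "Cause", values_to = "value") %>%
  group_by(Cause) %>%
  summarise(Yes = sum(grepl("Yes", value)),
            No = sum(grepl("No", value))) %>%
  mutate(N = coalesce(lag(Yes) - Yes, No))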