I have a data frame with four rows, 23 numeric columns and one text column. I'm trying to normalize all the numeric columns by subtracting the value in the first row.
I've tried getting it to work with mutate_at, but I couldn't figure out a good way to get it to work.
I got it to work by converting to a matrix and converting back to a tibble:
## First, did some preprocessing to get out the group I want
totalNKFoldChange <- filter(signalingFrame,
Population == "Total NK") %>% ungroup
totalNKFoldChange_mat <- select(totalNKFoldChange, signalingCols) %>%
as.matrix()
normedNKFoldChange <- sweep(totalNKFoldChange_mat,
2, totalNKFoldChange_mat[1,])
normedNKFoldChange %<>% cbind(Timepoint =
levels(totalNKFoldChange$Timepoint)) %>%
as.tibble %>%
mutate(Timepoint = factor(Timepoint,
levels = levels(totalNKFoldChange$Timepoint)))
I'm so certain there's a nicer way to do it that would be fully dplyr native. Anyone have tips? Thank you!!
If we want to normalize all the numeric columns by subtracting the value in the first row, use mutate_if
library(dplyr)
df1 %>%
mutate_if(is.numeric, list(~ .- first(.)))
Related
I have several series, each one indicates the deflator for the GDP for each country. (Data attached down below)
So what I want to do is to divide every column for the 97th position.
I know this could be pretty simple for you, but I am struggling.
This is my code so far:
d_data <- d_data %>%
mutate_if(is.numeric, function(x) x/d_data[[97,x]])
So as you can see in the data, from columns 3 to 8 data are numeric.
I think the error is that argument x of the function refers to the column name, while in the d_data, the second argument refers to column position and that is the main issue.
How can I solve this? Thanks in advance!!
Data
Data was massive to put here (745 rows, 8 columns)
So I uploaded the dput(d_data) output here
Use mutate with across as _at/_all are deprecated. Also, to extract by position, use nth
library(dplyr)
d_data %>%
mutate(across(where(is.numeric), ~ .x/nth(.x, 97)))
In the OP's code, instead of d_data[[97,x]], it should be x[97] as x here is the column value itself
d_data %>%
mutate_if(is.numeric, function(x) x/x[97])
If we want to subset the original data column, have to pass either column index or column name. Here, x doesn't refer to column index or name. But with across, we can get the column name with cur_column() e.g. (mtcars %>% summarise(across(everything(), ~ cur_column()))) which is not needed for this case
I was trying to figure out a way to transform all values in selected columns of my dataset using an equation $$x_i = x_{max} - x_i$$ using dplyr. I'm not sure how to correctly do this for one column, let alone multiple columns. My attempt at mutating 1 column:
df1 <- df %>% mutate(column1 = replace(column1, ., x = max(column1) - x)
My x = max(column1) - x part is not literal, I just want to know how I can implement that equation into all row entries in the column. Furthermore, how can I do this for multiple columns in the same line? Any help is appreciated. Thanks!
If it is to replace all values across multiple columns, loop across the numeric columns and subtract the values from its max value for that column
library(dplyr)
df <- df %>%
mutate(across(where(is.numeric), ~ max(., na.rm = TRUE) - .))
I have a dataframe ("df") with a number of columns that I would like to estimate the weighted means of, weighting by population (df$Population), and grouping by commuting zone (df$cz).
This is the list of columns I would like to estimate the weighted means of:
vlist = c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
This is the code I have been using:
df = df %>% group_by(cz) %>% mutate_at(vlist, weighted.mean(., df$Population))
I have also tried:
df = df %>% group_by(cz) %>% mutate_at(vlist, function(x) weighted.mean(x, df$Population))
As well as tested the following code on only 2 columns:
df = df %>% group_by(cz) %>% mutate_at(vars(Public_Welf_Total_Exp, Welf_Cash_Total_Exp), weighted.mean(., df$Population))
However, everything I have tried gives me the following error, even though there are no NAs in any of my variables:
Error in weighted.mean.default(., df$Population) :
'x' and 'w' must have the same length
I understand that I could do the following estimation using lapply, but I don't know how to group by another variable using lapply. I would appreciate any suggestions!
There is a lot to unpack here...
Probably you mean summarise instead of mutate, because with mutate you would just replicate your result for each row.
mutate_at and summarise_at are subseeded and you should use across instead.
the reason why your code wasn't working was because you did not write your function as a formula (you did not add ~ at the beginning), also you were using df$Population instead of Population. When you write Population, summarise knows you're talking about the column Population which, at that point, is grouped like the rest of the dataframe. When you use df$Population you are calling the column of the original dataframe without grouping. Not only it is wrong, but you would also get an error because the length of the variable you are trying to average and the lengths of the weights provided by df$Population would not correspond.
Here is how you could do it:
library(dplyr)
df %>%
group_by(cz) %>%
summarise(across(vlist, weighted.mean, Population),
.groups = "drop")
If you really need to use summarise_at (and probably you are using an old version of dplyr [lower than 1.0.0]), then you could do:
df %>%
group_by(cz) %>%
summarise_at(vlist, ~weighted.mean(., Population)) %>%
ungroup()
I considered df and vlist like the following:
vlist <- c("Public_Welf_Total_Exp", "Welf_Cash_Total_Exp", "Welf_Cash_Cash_Assist", "Welf_Ins_Total_Exp","Total_Educ_Direct_Exp", "Higher_Ed_Total_Exp", "Welf_NEC_Cap_Outlay","Welf_NEC_Direct_Expend", "Welf_NEC_Total_Expend", "Total_Educ_Assist___Sub", "Health_Total_Expend", "Total_Hospital_Total_Exp", "Welf_Vend_Pmts_Medical","Hosp_Other_Total_Exp","Unemp_Comp_Total_Exp", "Unemp_Comp_Cash___Sec", "Total_Unemp_Rev", "Hous___Com_Total_Exp", "Hous___Com_Construct")
df <- as.data.frame(matrix(rnorm(length(vlist) * 100), ncol = length(vlist)))
names(df) <- vlist
df$cz <- rep(letters[1:10], each = 10)
df$Population <- runif(100)
I have a dataset in R with a little under 100 columns.
Some of the columns have numeric values such as 87+3 as oppose to 90.
I have been able to update each column with the following piece of code:
library(dplyr)
new_dataframe = dataframe %>%
rowwise() %>%
mutate(new_value = eval(parse(text = value)))
However, I would like to be able to update a list of 60 columns in a more efficient way than simply repeating this line for each column.
Can someone help me find a more efficient way?
We can use mutate_at
library(dplyr)
dataframe %>%
rowwise() %>%
mutate_at(1:60, list(new_value = ~eval(parse(text = .))))
I have a data.frame:
set.seed(1L)
vector <- data.frame(patient=rep(1:5,each=2),medicine=rep(1:3,length.out=10),prob=runif(10))
I want to get the mean of the "prob" column while grouping by patient. I do this with the following code:
vector %>%
group_by(patient) %>%
summarise(average=mean(prob))
This code perfectly works. However, I need to get the same values without using the word "prob" on the "summarise" line. I tried the following code, but it gives me a data.frame in which the column "average" is a vector with 5 identical values, which is not what I want:
vector %>%
group_by(patient) %>%
summarise(average=mean(vector[,3]))
PD: for the sake of understanding why I need this, I have another data frame with multiple columns with complex names that need to be "summarised", that's why I can't put one by one on the summarise command. What I want is to put a vector there to calculate the probs of each column grouped by patients.
It appears you want summarise_each
vector %>%
group_by(patient) %>%
summarise_each(funs(mean), vars= matches('prop'))
Using data.table you could do
setDT(vector)[,lapply(.SD,mean),by=patient,.SDcols='prob')