I have a dataset in R with a little under 100 columns.
Some of the columns contain arithmetic expressions stored as text, such as 87+3 as opposed to 90.
I have been able to update each column with the following piece of code:
library(dplyr)
new_dataframe = dataframe %>%
  rowwise() %>%
  mutate(new_value = eval(parse(text = value)))
However, I would like to be able to update a list of 60 columns in a more efficient way than simply repeating this line for each column.
Can someone help me find a more efficient way?
We can use mutate_at to apply the same expression to all 60 columns at once:
library(dplyr)
dataframe %>%
  rowwise() %>%
  mutate_at(1:60, list(new_value = ~ eval(parse(text = .))))
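As an aside, mutate_at has since been superseded; with dplyr 1.0+ the same idea can be written with across(). A minimal sketch, assuming the expressions sit in the first 60 columns and are stored as character:
dataframe %>%
  rowwise() %>%                       # eval/parse handles one value at a time
  mutate(across(1:60, ~ eval(parse(text = .x)),
                .names = "{.col}_new")) %>%
  ungroup()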
Good afternoon!
I think this is a pretty straightforward question, but I think I am missing a couple of steps. I would like to create groups based on column position.
I am working with a dataframe / tibble that is 33 rows long and 66 columns wide. However, every sequence of 6 columns should really be separated into its own sub-dataframe / tibble.
The numbering of the columns is arbitrary to the dataframe. Below is an attempt with mtcars, where I am trying to group every 2 columns into its own sub-dataframe.
mtcars %>%
  tibble() %>%
  group_by(across(seq(1, 2, length.out = 11))) %>%
  nest()
However, that method generates errors. Something similar happens when working just within nest() as well.
Using mtcars, I would like to create groups using a sequence of every 3 columns, or some other number.
I would ultimately like the mtcars dataframe to be...
Columns 1:3 to be group 1,
Columns 4:6 to be group 2,
Columns 7:9 to be group 3, etc... while retaining the information for the rows in each column.
I also considered something with pivot_longer...
mtcars %>%
  tibble() %>%
  pivot_longer(cols = seq(1, 3, by = 1))
...but that did not generate defined groups, nor did it continue the sequence across all columns of the dataframe.
Hope one of you can help me with this! It would make certain tasks at work much easier.
PS - A plus if you can keep the workflow to tidyverse-centric code :)
You could try this. It splits the dataframe into a list of dataframes based on the number of columns you want (3 in your example):
library(tidyverse)
list_of_dataframes <- mtcars %>%
  tibble() %>%
  mutate(row = row_number()) %>%                   # remember each row's position
  pivot_longer(-row) %>%                           # one row per (row, column) pair
  group_by(row) %>%
  mutate(group = ceiling(row_number() / 3)) %>%    # every 3 columns -> one group
  ungroup() %>%
  group_split(group) %>%
  map(
    ~ select(.x, row, name, value) %>%
      pivot_wider()                                # back to wide within each group
  )
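Each element of the resulting list should then be a tibble holding the helper row column plus up to three of the original columns; a quick way to check (sketch):
# inspect which columns landed in each sub-dataframe
map(list_of_dataframes, names)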
EDIT
Here, based on comments from the asker, we avoid pivoting the data. Instead, we map over the starting column of each group and select that slice of the dataframe directly.
list_of_dataframes <- map(
  seq(1, ncol(mtcars), by = 3),   # starting column of each group: 1, 4, 7, 10
  ~ mtcars %>%
    as_tibble() %>%
    # min() stops the last group at the final column
    select(all_of(.x:min(c(.x + 2, ncol(mtcars)))))
)
We can then wrap this in a function to make it a little easier to reuse with different group sizes and dataframes:
group_split_cols <- function(.data, ncols_per_group) {
  map(
    seq(1, ncol(.data), by = ncols_per_group),  # starting column of each group
    ~ .data %>%
      as_tibble() %>%
      select(all_of(.x:min(c(.x + ncols_per_group - 1, ncol(.data)))))
  )
}
list_of_dataframes <- group_split_cols(.data = mtcars, ncols_per_group = 3)
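For comparison, a base-R-flavoured sketch (not from the answer above) that achieves the same split with split.default(), which divides a data frame column-wise by a grouping index:
# assumes library(tidyverse) is already loaded
idx <- ceiling(seq_along(mtcars) / 3)   # 1 1 1 2 2 2 3 3 3 4 4
list_of_dataframes <- map(split.default(mtcars, idx), as_tibble)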
So I'm trying to learn R while playing with a dataset from https://www.kaggle.com/abcsds/pokemon
data = read.csv("Pokemon.csv")
# collapse duplicated names such as "VenusaurMega Venusaur" into "Mega Venusaur"
data$Name = sub(".*(Mega)", "Mega", data$Name)
And I want to find all the pokemon that have the maximum value in any of the stat columns (Total, Attack, HP, etc.).
I know I can do sapply(data[5:11], max, na.rm = TRUE) to find the max values, and things like
data[which.max(data$Total), ]
data[which.max(data$HP), ]
data[which.max(data$Attack), ]
to find the rows that hold a max, one column at a time.
Is there a way I can use something like sapply in order to get all the rows without going through the columns sequentially?
I believe this is what you want to achieve.
I use the tidyverse for this. Since the data is in wide format, with a different column per stat, I first convert it into long format using pivot_longer, then group_by the stats column and filter each group for its max to achieve the desired result.
library(tidyverse)
data %>%
  select(c(2, 5:11)) %>%                    # Name plus the stat columns
  pivot_longer(-1, names_to = "stats") %>%  # wide -> long: one row per (Name, stat)
  group_by(stats) %>%
  filter(value == max(value))               # keeps ties: every row at the max
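If you would rather stay close to the sapply idea from the question, a base-R sketch that pulls one max row per stat column (unlike the filter approach above, which.max keeps only the first row in case of ties):
# one row per stat column; a pokemon topping several stats appears several times
data[sapply(data[5:11], which.max), ]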
I have a data frame with four rows, 23 numeric columns, and one text column. I'm trying to normalize all the numeric columns by subtracting the value in the first row.
I've tried getting it to work with mutate_at, but I couldn't figure out a good way to make it work.
I got it to work by converting to a matrix and converting back to a tibble:
## First, did some preprocessing to get out the group I want
totalNKFoldChange <- filter(signalingFrame,
                            Population == "Total NK") %>% ungroup()
totalNKFoldChange_mat <- select(totalNKFoldChange, signalingCols) %>%
  as.matrix()
## Subtract the first row from every row, column by column
normedNKFoldChange <- sweep(totalNKFoldChange_mat,
                            2, totalNKFoldChange_mat[1, ])
## %<>% comes from magrittr; as_tibble() replaces the deprecated as.tibble()
normedNKFoldChange %<>%
  cbind(Timepoint = levels(totalNKFoldChange$Timepoint)) %>%
  as_tibble() %>%
  mutate(Timepoint = factor(Timepoint,
                            levels = levels(totalNKFoldChange$Timepoint)))
I'm certain there's a nicer way to do it that would be fully dplyr-native. Anyone have tips? Thank you!!
If we want to normalize all the numeric columns by subtracting the value in the first row, use mutate_if:
library(dplyr)
df1 %>%
  mutate_if(is.numeric, list(~ . - first(.)))
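mutate_if has likewise been superseded in dplyr 1.0+; the same normalization can be written with across() and where(). A sketch:
df1 %>%
  mutate(across(where(is.numeric), ~ .x - first(.x)))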
I'm basically looking for the equivalent of the following Python code in R:
df.groupby('Categorical')['Count'].count()[0]
The following is what I'm doing in R:
by(df$count, df$Categorical, sum)
This accomplishes the same thing as the Python code, but I'd like to know how to store a single indexed value in a variable in R (I'm new to R).
Based on the by code, it seems like we can use the following (assuming that 'count' is a column of 1s):
library(dplyr)
out <- df %>%
  group_by(Categorical) %>%
  summarise(Sum = sum(count))
If the 'count' column has other values as well, the Python code is taking the frequency count of the first group of the 'Categorical' column. So, a similar option would be:
out <- df %>%
  count(Categorical) %>%  # frequency of each level of 'Categorical'
  slice(1) %>%            # first group only
  pull(n)                 # extract the count as a plain value
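A base-R one-liner in the same spirit, assuming you only want the frequency of the first level of 'Categorical':
# table() counts each level; [[1]] picks the first, much like [0] in pandas
out <- table(df$Categorical)[[1]]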
I'm struggling with standardizing data columns within subgroups in R.
I created the data frame:
df <- data.frame(
  salesPerson = sample(c('Alan', 'Bob', 'Cindy'), 20, replace = TRUE),
  quater = sample(c('Q1', 'Q2', 'Q3'), 20, replace = TRUE),
  salesValue = runif(20, 5.0, 7.5)
)
I would like to add an additional column to the data frame with scaled values of salesValue.
To scale the whole column I can use:
df$salesValueScaled <- scale(df$salesValue)
The problem is that I would like to scale sales separately for each combination of the salesPerson and quater columns. Something like:
df$salesValueScaled <- scale(df$salesValue, by = c(df$salesPerson, df$quater))
I have been searching this forum but couldn't find a solution to this problem.
Thank you in advance for your help.
You can use dplyr for this. Note that scale() returns a one-column matrix, so wrapping it in as.numeric() keeps the new column a plain vector:
library(dplyr)
new_df <- df %>%
  group_by(salesPerson, quater) %>%
  mutate(scaled_Col = as.numeric(scale(salesValue))) %>%
  ungroup()
Groups with a single row return NaN, since sd() of one value is NA. To work around this, you can either keep the original values as they are or filter those groups out before scaling:
Keeping the original values (scaling only groups where n() is greater than 1):
new_df <- df %>%
  group_by(salesPerson, quater) %>%
  # use if/else rather than ifelse() so the whole group vector is returned,
  # not a single recycled value
  mutate(scaled_Col = if (n() > 1) as.numeric(scale(salesValue)) else salesValue) %>%
  ungroup()
Filtering them out (as suggested by #steveb):
new_df <- df %>%
  group_by(salesPerson, quater) %>%
  filter(n() > 1) %>%      # drop single-row groups entirely
  mutate(scaled_Col = as.numeric(scale(salesValue))) %>%
  ungroup()
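If you ever need this without dplyr, a base-R sketch with ave() does the grouped scaling in place (single-row groups are left unscaled, mirroring the first workaround):
df$salesValueScaled <- ave(df$salesValue, df$salesPerson, df$quater,
                           FUN = function(x) if (length(x) > 1) as.numeric(scale(x)) else x)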
I hope this helps.