I have a data frame and the first column (called company) is filled with various company names, but the case isn't uniform. One might be phillips while another is Phillips, and a third phiLlips. I want to make them standardized - all lower case, so I tried mutate(), with replace() and tolower():
library(readr)
library(dplyr)
library(tidyr)
refine_original <- read_csv("~/Desktop/refine_original.csv")
# cleaning up the company names to match (case)
refine_original %>% mutate(company=replace(company, length(company),
tolower(company)))
print(refine_original$company)
When I print the column contents, though, the values are unchanged, so it looks like my changes didn't stick to the data frame. Any input or help would be much appreciated, thanks!
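A minimal sketch of one likely fix, assuming the only goal is to lower-case the whole column: mutate() returns a new data frame rather than modifying the original, so the result has to be assigned back, and tolower() can be applied to the column directly without replace(). The small tibble below is made-up stand-in data, not the contents of refine_original.csv.

```r
library(dplyr)

# Hypothetical stand-in for the CSV: only the company column matters here.
refine_original <- tibble(
  company = c("phillips", "Phillips", "phiLlips"),
  product = c("a", "b", "c")
)

# mutate() does not change its input, so assign the result back.
refine_original <- refine_original %>%
  mutate(company = tolower(company))

print(refine_original$company)
```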
I have a wide dataset which makes it really difficult to manipulate the data in the way I need. It looks like the dummy table below:
Dummy_table_unsorted
Essentially, as seen in the table, each row holds information at user level: there is a user id, and then all the animals owned by that user are in the same row. I want the data at animal level instead, so that a user can have multiple rows, one for each of their animals. I have pasted a table below of what I would like it to look like:
Dummy_table_sorted
Is there a simple way to do this? I have an idea as to how, but it is very long-winded. I thought of subsetting the columns relating to one animal at a time and merging the datasets back together. The problem is that in my data one person can have up to 100 animals, which makes this approach very long-winded.
Can someone please suggest a package or command that would let me change this wide dataset into a long one?
Thank You.
First, you should provide data that someone can easily load into R. Screenshots are not helpful and increase the amount of work a person needs to do to help you.
The data as you have it can be split and then recombined with bind_rows or rbind. I would subset the data into three data frames, rename the columns so they match, and bind them. Assuming your original data is called df:
library(dplyr)

# one data frame per animal, each keeping the user id in column 1
df1 <- df[, 1:4]
df2 <- df[, c(1, 5:7)]
df3 <- df[, c(1, 8:10)]
# rename columns to match
names(df1) <- c('user id', 'animal', 'colour', 'legs')
names(df2) <- c('user id', 'animal', 'colour', 'legs')
names(df3) <- c('user id', 'animal', 'colour', 'legs')
# stack the three pieces on top of each other
remade <- bind_rows(df1, df2, df3)
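With up to 100 animals per user, writing out each subset by hand gets tedious. The same split, rename and bind idea can be sketched as a loop over column blocks. The data frame below is invented for illustration, and the sketch assumes the id is in column 1 and every animal occupies exactly three consecutive columns in the order animal, colour, legs; the column names are hypothetical.

```r
library(dplyr)

# Hypothetical wide data: one row per user, three columns per animal.
df <- data.frame(
  user_id = c(1, 2),
  animal1 = c("dog", "cat"), colour1 = c("brown", "black"), legs1 = c(4, 4),
  animal2 = c("bird", NA),   colour2 = c("green", NA),      legs2 = c(2, NA),
  stringsAsFactors = FALSE
)

# First column of each three-column animal block (2, 5, 8, ...).
block_starts <- seq(2, ncol(df), by = 3)

# Cut out one block per animal and give every block the same column names.
pieces <- lapply(block_starts, function(i) {
  piece <- df[, c(1, i:(i + 2))]
  names(piece) <- c("user_id", "animal", "colour", "legs")
  piece
})

# Stack the blocks, drop empty animal slots, and sort by user.
remade <- bind_rows(pieces) %>%
  filter(!is.na(animal)) %>%
  arrange(user_id)

print(remade)
```

If the real column names follow a regular pattern, tidyr's reshaping functions may be a shorter route, but that depends on the actual names, which the screenshots do not show.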
I have a data frame with three columns. The first column, Species_Name, contains species names. I want to remove the rows where the species is undetermined, such as "Salmonella sp", and keep only the rows with a fully determined name, such as Salmonella enterica or Salmonella bongori. I tried the following code, but it is not working. Any suggestions would be appreciated.
dfcox1 <- dfcox1 %>%
filter(Species_Name != "Salmonella sp")
Welcome to Stack Overflow! Please create a reproducible example so that it is easier for other people to help you; this is especially easy when working with GNU R.
If you want to remove rows of a data frame according to a regular expression (e.g. the name ending with sp), you can do so as follows:
library(dplyr)
# "sp$" only matches values that end in "sp"; replace iris and Species with your own data frame and column
iris %>%
  dplyr::filter(!stringr::str_detect(Species, "sp$"))
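Applied to the question's column, the idea might look like the sketch below. The rows are made up, and it assumes the undetermined entries end in "sp" or "sp.", possibly with trailing whitespace; stray whitespace or a trailing dot is one possible reason an exact comparison with != "Salmonella sp" fails, though the question does not show the actual values.

```r
library(dplyr)
library(stringr)

# Hypothetical data: the real dfcox1 is not shown in the question.
dfcox1 <- tibble(
  Species_Name = c("Salmonella enterica", "Salmonella sp",
                   "Salmonella sp. ", "Salmonella bongori"),
  count = 1:4
)

# Trim whitespace, then drop names whose last word is "sp" or "sp.".
determined <- dfcox1 %>%
  filter(!str_detect(str_trim(Species_Name), "\\bsp\\.?$"))

print(determined$Species_Name)
```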
I'm rather new to the tidyverse, and I want to learn, so this question is specifically about doing this the tibble way, using things like select(), mutate() and the like. I know how to achieve the desired effect with data frames matching column indices.
I have a rather large tibble, containing columns named Day1, Day2, ..., Day48, among others. I'd like to add columns of averages for every week, using regular expressions (assume the column names could be more complicated). How would I achieve this?
Figured it out:
data <- mutate(data, Week1 = select(data, matches("^Day[1-7]$")) %>% rowMeans(na.rm = TRUE))
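The line above covers the first week only. One way to extend it to every week is to build the regular expression for each week in a loop; this is a sketch rather than the only tidyverse idiom. The example data, the 14-day width, and the names n_days, days and pattern are all made up for illustration.

```r
library(dplyr)

# Hypothetical data: 3 rows, columns Day1..Day14 (the real data has Day1..Day48).
set.seed(42)
n_days <- 14
data <- as.data.frame(
  matrix(rnorm(3 * n_days), nrow = 3,
         dimnames = list(NULL, paste0("Day", seq_len(n_days))))
)

for (w in seq_len(ceiling(n_days / 7))) {
  # Day numbers belonging to week w; the last week may be shorter than 7 days.
  days <- ((w - 1) * 7 + 1):min(w * 7, n_days)
  # Anchored pattern such as "^Day(8|9|10|11|12|13|14)$".
  pattern <- paste0("^Day(", paste(days, collapse = "|"), ")$")
  data[[paste0("Week", w)]] <- data %>%
    select(matches(pattern)) %>%
    rowMeans(na.rm = TRUE)
}

print(names(data))
```

Anchoring the pattern with ^ and $ keeps Day1 from also matching Day10 to Day14, and keeps the new Week columns out of later selections.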
I recently moved from ordinary data frame manipulation in R to the tidyverse, but I ran into a problem when scaling columns with the scale() function.
My data consists of columns, some of which are numerical features and some categorical. The last column is the y value of the data. I want to scale all numerical columns except the last one.
With the select() function I can write a very short line of code that selects all the numerical columns that need to be scaled by adding the ends_with("...") argument. But I can't really make use of that for scaling. There I have to use transmute(feature1=scale(feature1),feature2=scale(feature2)...) and name each feature individually. This works fine but bloats the code.
So my question is:
Is there a smart way to manipulate the data column by column without having to spell out every single column name in transmute?
I imagine something like:
transmute(ends_with("...")=scale(ends_with("..."),featureX,featureZ)
(well aware that this does not work)
Many thanks in advance
library(tidyverse)
data("economics")
# set the seed first so that both random steps below are reproducible
set.seed(1)
# add variables that are not numeric
economics[7:9] <- sample(LETTERS[1:10], size = nrow(economics), replace = TRUE)
# add a 'y' column (for illustration)
economics$y <- rnorm(n = nrow(economics))
# scale every numeric column except y, then put the unscaled y back
economics_modified <- economics %>%
  select(-y) %>%
  transmute_if(is.numeric, scale) %>%
  add_column(y = economics$y)
If you want to keep the columns that are not numeric, replace transmute_if with mutate_if (or purrr's modify_if). (There might be a smarter way to exclude column y from being scaled.)
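One possibly smarter way to leave y out is to select the columns inside mutate() with across(), which is available in dplyr 1.0.0 and later. The sketch below uses a small made-up tibble instead of economics; the column names are hypothetical. Wrapping scale() in as.numeric() is a choice made here so that the result is a plain numeric vector rather than a one-column matrix.

```r
library(dplyr)

# Hypothetical data: two numeric features, one categorical column, and the target y.
df <- tibble(
  feature1 = c(1, 2, 3, 4),
  feature2 = c(10, 20, 30, 40),
  group    = c("a", "b", "a", "b"),
  y        = c(0.5, 1.5, 2.5, 3.5)
)

# Scale every numeric column except y; all other columns pass through unchanged.
scaled <- df %>%
  mutate(across(where(is.numeric) & !y, ~ as.numeric(scale(.x))))

print(scaled)
```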
I have a dataset with a few columns and some duplicate rows (duplicated on one column, ProjectID).
I want to remove the duplicate rows and keep just one of each.
However, each of these rows has its own Amount value, and these need to be summed and stored in the final consolidated row.
I have used the aggregate function. However, the way I used it removes all the other columns.
Can somebody please tell me an easier way?
The example dataset is attached.
dataset
This could be solved using dplyr, as @PLapointe pointed out. If your dataset is called df, it would go like this:
df %>%
group_by(`Project ID`, `Project No.`, `Account Head`, `Function`, `Functionary`) %>%
summarise(cost.total = sum(Amount))
This should do it. You can also adjust the variables you want to keep.
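If only ProjectID identifies a duplicate, a variant is to group by that column alone and keep the first value of every other column. This is a sketch on invented data; the column names are assumptions, since the real dataset is only available as an attachment, and it requires dplyr 1.0.0 or later for across().

```r
library(dplyr)

# Hypothetical data with ProjectID 101 duplicated.
df <- tibble(
  ProjectID = c(101, 101, 102),
  Function  = c("Roads", "Roads", "Water"),
  Amount    = c(100, 250, 75)
)

# One row per ProjectID: first value of the other columns, summed Amount.
consolidated <- df %>%
  group_by(ProjectID) %>%
  summarise(across(-Amount, first), Amount = sum(Amount), .groups = "drop")

print(consolidated)
```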
It's a more complicated method, but it worked for me.
I summed the amounts per ProjectID using the aggregate function and stored them in a new tibble.
Then I added that result to the de-duplicated tibble as a new column.
It isn't exactly what I wanted, but I was able to work with a new column, Final_Amount, which makes the earlier Amount column irrelevant.
library(dplyr)

Duplicate_remove2 <- function(dataGP_cleaned)
{
  # sum Amount within each ProjectID
  aggregated_amount <- aggregate(dataGP_cleaned['Amount'], by = dataGP_cleaned['ProjectID'], sum)
  # keep the first row of each ProjectID
  dataGP_unique <- distinct(dataGP_cleaned, ProjectID, .keep_all = TRUE)
  # rename the summed column for easy identification
  names(aggregated_amount)[names(aggregated_amount) == 'Amount'] <- 'Final_Amount'
  # join on ProjectID instead of binding columns by position:
  # aggregate() sorts by ProjectID, distinct() keeps first-appearance order
  aggregate_dataGP <- left_join(dataGP_unique, aggregated_amount, by = 'ProjectID')
  return(aggregate_dataGP)
}