I'm extremely new, sorry ahead of time. I have two vectors: a character vector of account names (30) and a character vector of product names (30). Lastly, I have a data frame with three columns (account names, product names, and revenue), but this list goes well beyond the 30 of either.
Ultimately I need a 30x30 data frame: rows as products from the product name vector, columns as account names from the account name vector, and the values as the revenue associated with the account in the column and the product in the row.
I think I need a nested loop, but I don't know how to use one to populate the data frame appropriately.
account <- c("a", "b", etc)
product <- c("prod_a", "prod_b", etc)
for (i in 1:length(account)) {
  for (j in 1:length(product)) {
    .....
  }
}
Honestly, I'm just very lost haha
I think I know what you're trying to do here. I suspect there is a good reason you want this 30x30 cross-table type of structure, but I would also like to take the opportunity to encourage "tidy" data for analysis purposes, which can be summarized by these three main criteria:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
That said, below is my attempt to interpret and demonstrate what I think you're trying to accomplish.
library(tidyr)
# set up some fake data to better explain
account_vec <- paste0(letters, 1:26)
product_vec <- paste0(as.character(101:126), LETTERS)
revenue_vec <- rnorm(26*26)
# permutating accounts and products to set up our fake data
df <- expand.grid(account_vec, product_vec)
names(df) <- c("accounts", "products")
df$revenue <- revenue_vec
# if this is what your data looks like currently, I would consider this fairly "tidy"
# now let's pretend there's some data we need to filter out
df <- rbind(df,
            data.frame(accounts = paste0("bad_account", 1:3),
                       products = paste0("bad_product", 1:3),
                       revenue  = rnorm(3)))
# filter to just what is included in our "accounts" and "products" vectors
df <- df[df$accounts %in% account_vec, ]
df <- df[df$products %in% product_vec, ]
# spread out the products so they occupy the column values
df2 <- df %>% tidyr::spread(key="products", value="revenue")
# if you aren't familiar with the "%>%" pipe operator, the above
# line of code is equivalent to this one below:
# df2 <- tidyr::spread(df, key="products", value="revenue")
# now we have accounts as rows, products as columns, and revenues at the intersection
# we can go one step further by making the accounts our row names if we want
row.names(df2) <- df2$accounts
df2$accounts <- NULL
# now the accounts are in the row name and not in a column on their own
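One note on the reshaping step: in tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A minimal sketch of the equivalent call, on small stand-in data (assuming tidyr >= 1.0.0 is installed):

```r
library(tidyr)

# small stand-in for the accounts/products/revenue data frame above
df <- expand.grid(accounts = c("a1", "b2"), products = c("101A", "102B"))
df$revenue <- c(10, 20, 30, 40)

# pivot_wider() is the current replacement for spread()
df2 <- pivot_wider(df, names_from = "products", values_from = "revenue")
# one row per account, one revenue column per product
```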
I have a data set that requires some cleaning up. Is there a way to code in R such that it picks up columns with more than 3 different levels from the data set? E.g. column C has the different education levels, and I would like it to be selected along with columns D and F, while columns E and G won't be picked up because they don't meet the more-than-3-levels requirement.
At the same time, I need one of the columns to be arranged in a specific way. E.g. for Education, I would like PHD to be at the top; the other levels of education do not need to be in any order.
Sorry, I am really new to R. I attached a snapshot of sample data I replicated from the original.
All help is greatly appreciated.
It is a bit complicated to replicate the data as it is an image, but you could use this function to select the columns of your data frame that have more than 3 levels.
First I convert the columns you are considering (from column C, i.e. column 3, onward) to factors. Then the for loop identifies the columns with more than 3 levels and saves their names in a vector, and finally the original data set is filtered down to those columns.
library(dplyr)
select_columns <- function(df){
  # convert everything from column 3 onward to factor
  factor_cols <- data.frame(lapply(df[, -c(1, 2)], as.factor))
  selectColumns <- c()
  for (i in 1:length(factor_cols)) {
    if (length(unique(factor_cols[[i]])) > 3) {
      selectColumns[i] <- colnames(factor_cols)[i]
    }
  }
  selectColumns <- na.omit(selectColumns)
  # keep the first two columns plus the selected ones
  return(df %>% select(1:2, all_of(selectColumns)))
}
select_columns(your_data_frame)
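For the second part of the question (having PHD at the top of the Education column), one option is to make Education a factor with PHD as its first level and then sort on it. The column and level names below are assumptions based on the description, since the actual data is only a screenshot:

```r
# toy stand-in for the Education column described in the question
df <- data.frame(Education = c("Degree", "PHD", "Diploma", "PHD"),
                 stringsAsFactors = FALSE)

# put PHD first; the remaining levels stay in alphabetical order
other_levels <- setdiff(sort(unique(df$Education)), "PHD")
df$Education <- factor(df$Education, levels = c("PHD", other_levels))

# sorting on the factor now floats the PHD rows to the top
df <- df[order(df$Education), , drop = FALSE]
```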
I am new to R and have spent the last 2 months on this website trying to learn more. I want to pull information from a dataset that has a specific keyword, and for the rows that DO have that keyword, I want to pull the 5 words before and after it. Then I want to know what number(s) appear near it in that same sentence.
To explain the "why": I have a list of tickets and want to pull all the ticket titles. Then I want to know which of those tickets are requesting additional storage. If they are, I want to know how MUCH storage they are asking for; later I will create actions depending on how much storage that is (but that is later).
Example of the code I have completed so far (it's a bit messy; I'm still working on a cleaner way, as I'm very new to R).
The keyword I'm searching for: Storage
Dataframe referenced as: DF, DF2, DF3 etc.
Column from DF: Title
#Check for keyword#
grep("storage", DF$Title, ignore.case = TRUE)
#Pull words before and after the keyword. This is case sensitive for some reason, so I have to do it twice and merge the data frames; it also creates a list instead of a data frame, so I have to convert that too... messy, I know#
DF2 <- stringr::str_extract_all(DF$Title, "([^\\s]+\\s){0,5}Storage(\\s[^\\s]+){0,5}")
#Turn list into dataframe#
DF3 <- do.call(rbind.data.frame, DF2)
#Pull words before and after but in lower case, same as step two#
DF4 <- stringr::str_extract_all(DF$Title, "([^\\s]+\\s){0,5}storage(\\s[^\\s]+){0,5}")
#Turn list into dataframe#
DF5 <- do.call(rbind.data.frame, DF4)
#Change column names (I have to do this to merge them via rbind)#
DF6 <- setnames(DF3, c("Keyword"))
DF7 <- setnames(DF5, c("Keyword"))
#Merge both data frames together#
DF6 <- rbind(DF6, DF7)
I want to check the amount of storage being requested, so I'm trying to look for a number referencing GB or TB, etc. I've tried numerous bits of code, but a lot of them only pull the number(s) right after the keyword, not all numbers in the sentence.
Example of what I've tried, without it working:
DFTest <- as.integer(str_match(DF6$Keyword, "(?i)\\bGB:?\\s*(\\d+)")[, 2])
The following approach will extract all numbers before a specific keyword (in this case I used AND) or after it. You can change the keyword in the regex pattern.
library(tidyverse)
df <- data.frame(obs = 1:5, COL_D = c("2019AND", "AND1999", "101AND", "AND12", "20AND1999999"))
df2 <- df %>%
mutate(Extracted_Num = str_extract_all(COL_D, regex("\\d+(?=AND)|(?<=AND)\\d+")))
# obs COL_D Extracted_Num
# 1 1 2019AND 2019
# 2 2 AND1999 1999
# 3 3 101AND 101
# 4 4 AND12 12
# 5 5 20AND1999999 20, 1999999
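Applied to the storage question above, the same idea can be loosened to pull every number that carries a GB/TB unit anywhere in the title, rather than only the number immediately after the keyword. The sample titles here are made up:

```r
library(stringr)

titles <- c("Request additional storage 500 GB for server X",
            "Add 2TB storage to the backup pool",
            "Reset my password")

# keep only titles that mention storage, case-insensitively
storage_titles <- titles[str_detect(titles, regex("storage", ignore_case = TRUE))]

# pull every number followed by a GB or TB unit, anywhere in the title
sizes <- str_extract_all(storage_titles, regex("\\d+\\s*[GT]B", ignore_case = TRUE))
```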
I have a data set of plant demographics from 5 years across 10 sites with a total of 37 transects within the sites. Below is a link to a GoogleDoc with some of the data:
https://docs.google.com/spreadsheets/d/1VT-dDrTwG8wHBNx7eW4BtXH5wqesnIDwKTdK61xsD0U/edit?usp=sharing
In total, I have 101 unique combinations.
I need to subset each unique set of data, so that I can run each through some code. This code will give me one column of output that I need to add back to the original data frame so that I can run LMs on the entire data set. I had hoped to write a for-loop where I could subset each unique combination, run the code on each, and then append the output for each model back onto the original dataset. My attempts at writing a subset loop have all failed to produce even a simple output.
I created a column, "SiteTY", with unique Site, Transect, Year combinations. So "PWR 832015" is site PWR Transect 83 Year 2015. I tried to use that to loop through and fill an empty matrix, as proof of concept.
transect <- unique(dat$SiteTY)
ntrans <- length(transect)
tmpout <- matrix(NA, nrow = ntrans, ncol = 2)
for (i in 1:ntrans) {
  df <- subset(dat, SiteTY == i)
  tmpout[i, ] <- unique(df$SiteTY)
}
When I do this, I notice that df has no observations. If I replace "i" with a known value (like "PWR 832015") and run each line of the for-loop individually, it populates correctly. If I use is.factor() on i or on "PWR 832015", both return FALSE.
This particular code also gives me the error:
Error in `[<-`(`*tmp*`, , i, value = mean(df$Year)) : subscript out of bounds
I can only assume this happens because the data frame is empty.
I've read enough SO posts to know that for-loops are tricky, but I've tried more iterations than I can remember to try to make this work in the last 3 years to no avail.
Any tips on loops or ways to avoid them while getting the output I need would be appreciated.
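One immediate fix worth noting: inside the loop, subset(dat, SiteTY == i) compares SiteTY to the loop index (1, 2, ...) rather than to the i-th label, which is why df ends up with no observations. Indexing into transect repairs it; the toy data below stands in for dat:

```r
# toy stand-in for dat
dat <- data.frame(SiteTY = c("PWR 832015", "PWR 832015", "ABC 412016"),
                  Year   = c(2015, 2015, 2016),
                  stringsAsFactors = FALSE)

transect <- unique(dat$SiteTY)
ntrans <- length(transect)
tmpout <- matrix(NA, nrow = ntrans, ncol = 2)

for (i in 1:ntrans) {
  df <- subset(dat, SiteTY == transect[i])  # compare to the label, not the index
  tmpout[i, ] <- c(unique(df$SiteTY), mean(df$Year))
}
```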
Per your needs (subset each unique set of data, run it through some code, and add the output back onto the original data frame), consider two routes:
Using ave if your function expects and returns a single numeric column.
Using by if your function expects a data frame and returns anything.
ave
Returns a grouped, inline-aggregated column with the value repeated for every member of the group. Below, with() is used to avoid repeated dat$ references.
# BY SITE GROUPING
dat$New_Column <- with(dat, ave(Numeric_Column, Site, FUN=myfunction))
# BY SITE AND TRANSECT GROUPINGS
dat$New_Column <- with(dat, ave(Numeric_Column, Site, Transect, FUN=myfunction))
# BY SITE AND TRANSECT AND YEAR GROUPINGS
dat$New_Column <- with(dat, ave(Numeric_Column, Site, Transect, Year, FUN=myfunction))
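A concrete run of the first pattern, with mean standing in for myfunction on a toy data frame:

```r
dat <- data.frame(Site  = c("A", "A", "B", "B"),
                  Count = c(1, 3, 10, 20))

# per-site mean, repeated for every row of that site
dat$Site_Mean <- with(dat, ave(Count, Site, FUN = mean))
# Site_Mean is 2, 2, 15, 15
```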
by
Returns a named list of whatever your function returns for each possible grouping. For more than one grouping variable, tryCatch is used because the set of all possible combinations can include empty data frames, on which myfunction may raise an error.
# BY SITE GROUPING
obj_list <- by(dat, dat$Site, function(sub) {
myfunction(sub) # RUN ANY OPERATION ON sub DATA FRAME
})
# BY SITE AND TRANSECT GROUPINGS
obj_list <- by(dat, dat[c("Site", "Transect")], function(sub) {
tryCatch(myfunction(sub),
error = function(e) NULL)
})
# BY SITE AND TRANSECT AND YEAR GROUPINGS
obj_list <- by(dat, dat[c("Site", "Transect", "Year")], function(sub) {
tryCatch(myfunction(sub),
error = function(e) NULL)
})
# FILTERS OUT ALL NULLs (I.E., NO LENGTH)
obj_list <- Filter(length, obj_list)
# BUILDS SINGLE OUTPUT IF MATRIX OR DATA FRAME
final_obj <- do.call(rbind, obj_list)
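A self-contained run of the by route, with a one-row mean summary standing in for myfunction (here returning NULL for empty combinations, which the Filter step then drops):

```r
dat <- data.frame(Site     = c("A", "A", "B"),
                  Transect = c(1, 2, 2),
                  Year     = c(2015, 2015, 2016),
                  stringsAsFactors = FALSE)

# one summary row per Site/Transect combination; empty combos return NULL
obj_list <- by(dat, dat[c("Site", "Transect")], function(sub) {
  if (nrow(sub) == 0) return(NULL)
  data.frame(Site = sub$Site[1], Transect = sub$Transect[1],
             Mean_Year = mean(sub$Year))
})

obj_list <- Filter(length, obj_list)     # drop the empty (B, 1) combination
final_obj <- do.call(rbind, obj_list)    # stack into one data frame
```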
Here's another approach using the dplyr library, in which I'm creating a data.frame of summary statistics for each group and then just joining it back on:
library(dplyr)
# Group by species (site, transect, etc) and summarise
species_summary <- iris %>%
group_by(Species) %>%
summarise(mean.Sepal.Length = mean(Sepal.Length),
mean.Sepal.Width = mean(Sepal.Width))
# A data.frame with one row per species, one column per statistic
species_summary
# Join the summary stats back onto the original data
iris_plus <- iris %>% left_join(species_summary, by = "Species")
head(iris_plus)
I'm working with data regarding people and what class of medicine they were prescribed. It looks something like this (the actual data is read in via a txt file):
library(data.table)
test <- matrix(c(1,"a",1,"a",1,"b",2,"a",2,"c"), ncol = 2, byrow = TRUE)
colnames(test) <- c("id","med")
test <- as.data.table(test)
test <- unique(test[, 1:2])
test
The table has about 5 million rows, 45k unique patients, and 49 unique medicines. Some patients have multiples of the same medicines, which I remove. Not all patients have every medicine. I want to make each of the 49 unique medicines into separate columns, and have each unique patient be a row, and populate the table with 1s and 0s to show if the patient has the medicine or not.
I was trying to use spread or dcast, but there's no value column. I tried to amend this by adding a column of 1s:
test$true <- rep(1, nrow(test))
And then using tidyr
library(tidyr)
test_wide <- spread(test, med, true, fill = 0)
My original data produced this error, but I'm not sure why the new data isn't reproducing it...
Error: `var` must evaluate to a single number or a column name, not a list
Please let me know what I can do to make this a better reproducible example; sorry, I'm really new to this.
It looks like you are trying to do one-hot encoding here. For this, please refer to the onehot package.
Code for reference:
library(onehot)
test <- matrix(c(1,"a",1,"a",1,"b",2,"a",2,"c"),ncol=2,byrow=TRUE)
colnames(test) <- c("id","med")
test <- as.data.frame(test)
str(test)
test$id <- as.numeric(test$id)
str(test)
encoder <- onehot(test)
finaldata <- predict(encoder,test)
finaldata
Make sure that all the columns you want encoded are of type factor. Also, I have taken the liberty of changing the data.table to a data.frame.
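Since the goal is just a 0/1 presence indicator per patient and medicine, a base-R alternative with no extra package is to cross-tabulate and clamp the counts; this uses the same toy data as above:

```r
test <- data.frame(id  = c(1, 1, 1, 2, 2),
                   med = c("a", "a", "b", "a", "c"),
                   stringsAsFactors = FALSE)

# counts of each medicine per patient, clamped to 0/1
wide <- (table(test$id, test$med) > 0) * 1
#   a b c
# 1 1 1 0
# 2 1 0 1
```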
I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled; if an equal number are filled, either can be removed.
For example,
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
The output would be the following:
Finished <- data.frame(id= c(1,2,3,4,5),
key=c(1,2,3,4,5),
num=c(1,1,1,1,1),
v4= c(1,5,5,5,7),
v5=c(1,5,5,5,7))
My real dataset is bigger, mostly numerical with some character variables, and I couldn't determine the best way to go about this. I've previously used a program that would do something similar through a check.all option within its duplicates command.
So far, my thoughts have been to use grepl and determine where "anything" is present
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resulting data frame, I take rowSums and cbind it to the original.
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps... I have a variable that tells me how many columns are filled in each row (CompleteNess); however, I'm unsure how to handle the duplicates.
Simply, I'm looking for: when id, key, and num are duplicated, keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this, or can get me through the last little bit, I would greatly appreciate it. Thanks, all!
Here is a solution. It is not very pretty, but it should work for your application:
# Order by the degree of completeness (CompleteNess as computed in the question)
Original <- Original[order(CompleteNess), ]
# Starting from the bottom, select the non-duplicated rows based on
# the first 3 columns; the last (most complete) copy of each survives
Original[!duplicated(Original[, 1:3], fromLast = TRUE), ]
This rearranges your original data frame, so beware if there is additional processing later on.
You can aggregate your data and select the row with the maximum score:
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original, .(id.key.num), summarize,
               Max = max(present),
               v4 = v4[which.max(present)],
               v5 = v5[which.max(present)])
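A dplyr alternative that keeps whole rows rather than rebuilding columns one by one (slice_max requires dplyr >= 1.0.0):

```r
library(dplyr)

Original <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 5, 5),
                       key = c(1, 2, 2, 3, 3, 4, 5, 5),
                       num = c(1, 1, 1, 1, 1, 1, 1, 1),
                       v4  = c(1, NA, 5, 5, NA, 5, NA, 7),
                       v5  = c(1, NA, 5, 5, NA, 5, NA, 7))

Finished <- Original %>%
  mutate(present = rowSums(!is.na(Original))) %>%     # completeness score per row
  group_by(id, key, num) %>%
  slice_max(present, n = 1, with_ties = FALSE) %>%    # keep the most complete duplicate
  ungroup() %>%
  select(-present)
```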