Combining function output into data frame with different column names - r

I am trying to figure out how to get a single data frame (or a tibble) with distinct column names from this function. This code prints three separate chunks with the same Date and same column names. Column values are however different.
hello <- function(x){
  fx <- Quandl(x)
  fx <- fx %>%
    select(Date, Open, Close) %>%
    mutate(new_col = `Open` - `Close`) %>%
    select(Date, new_col)
  print(kable(head(fx), format = "rst"))
}
lapply(c(EUR, USD, RUB), hello)
I am doing it the long way via replicating the code for every input (EUR, USD, RUB). Then merging them with the function below, then converting to a tibble to get rid of the observation numbers on the left before printing via kable.
Reduce(function(x, y) merge(x, y, all = TRUE), list(EUR, USD, RUB))
I just want to see whether this can be done more simply with one function.
Thank you!
EDIT 1:
Thank you for the answers and suggestions on improving the question!
So the reproducible code looks like this:
library(Quandl)
library(dplyr)
library(knitr)  # for kable()
Wheat <- "CFTC/001602_FO_ALL"
Corn <- "CFTC/002602_F_ALL"
Beans <- "CFTC/005602_F_ALL"
funds <- function(x){
  cftc <- Quandl(x)
  cftc <- cftc %>%
    select(Date, `Money Manager Longs`, `Money Manager Shorts`) %>%
    mutate(Funds_Net = `Money Manager Longs` - `Money Manager Shorts`) %>%
    select(Date, Funds_Net)
  print(kable(head(cftc), format = "rst"))
}
lapply(c(Wheat, Corn, Beans), funds)
The output I have is this:
[image: three separate tables, each with Date and Funds_Net columns]
What I want as output is this:
[image: a single table with a Date column and one distinctly named column per input]
Thank you!
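One way to get a single table is to have the function return its data frame, with the value column named after the series, instead of printing it, and then merge the pieces. A minimal sketch, assuming base R only: fake_fetch is a hypothetical stand-in for Quandl() that returns made-up numbers so the example runs offline, and the name argument is an illustrative addition to the function.

```r
# Hypothetical stand-in for Quandl(): made-up numbers, no network access needed.
fake_fetch <- function(code) {
  data.frame(
    Date = as.Date("2020-01-07") + 7 * 0:2,
    `Money Manager Longs` = 100 + nchar(code) + 0:2,
    `Money Manager Shorts` = 40 + 0:2,
    check.names = FALSE
  )
}

# Return the data frame (no printing) and name the value column after the series.
funds <- function(code, name, fetch = fake_fetch) {
  cftc <- fetch(code)
  out <- data.frame(
    Date = cftc$Date,
    net = cftc[["Money Manager Longs"]] - cftc[["Money Manager Shorts"]]
  )
  names(out)[2] <- name
  out
}

codes <- c(Wheat = "CFTC/001602_FO_ALL",
           Corn  = "CFTC/002602_F_ALL",
           Beans = "CFTC/005602_F_ALL")

# One data frame per series, then a full outer merge on Date.
pieces <- Map(funds, codes, names(codes))
combined <- Reduce(function(x, y) merge(x, y, by = "Date", all = TRUE), pieces)
```

With the real data, passing fetch = Quandl (from the Quandl package used in the question) should produce the same shape, and kable(combined, format = "rst") then prints one table.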

Related

How can I create a new column in a data frame that contains a new assigned ID number, 1 through 33, for the long ID numbers assigned in the dataset?

I'd like to keep the column of original Id numbers, but I'd like to create a new Id number for each participant in my dataset, so when I create a geom_bar graph it's not using decimals and looking strange.
This is the current R code I have written.
library(dplyr)
unique_ids <- daily_activity %>%
  group_by(Id) %>%
  summarize(days_used = n_distinct(ActivityDate))
This is the current data frame:
https://i.stack.imgur.com/T96hM.png
As you can see, it has the Id number, and when I geom_bar this, it becomes these skinny bars due to the program creating numbers using scientific notation, e.g. 7e+09.
I'd like to create a new column in this data frame that assigns a new number Id to each of the long Id numbers. That way, I have a unique identifier for each super long Id. I'm curious if there is a way to auto assign numbers starting at 1 and going up to whatever the last number needs to be, positive integers only. I'll then use a note on my graph that says, "See table for Id pairings" or something...
Does any of this make sense? I'm very new to R, coding, graphing, analysis...Any suggestions of ideas I can try? Thoughts?
I recommend creating a separate look up table for converting old IDs to new IDs. Something like this:
lookup_table = input_table %>%
  select(old_IDs = IDs) %>%
  distinct()
lookup_table$new_IDs = 1:nrow(lookup_table)
You can then join the look up table to the original table:
output_table = input_table %>%
  inner_join(lookup_table, by = c("IDs" = "old_IDs"))
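A shorter alternative, if a separate lookup table is not needed, is to number the IDs by order of first appearance with base R's match(). This is only a sketch: the Id values below are made up for illustration, and new_Id is a hypothetical column name.

```r
# Made-up long IDs standing in for the real daily_activity data.
daily_activity <- data.frame(
  Id = c(1503960366, 1503960366, 7086361926, 4020332650, 7086361926)
)

# Position of each Id within the vector of distinct Ids gives 1, 2, 3, ...
daily_activity$new_Id <- match(daily_activity$Id, unique(daily_activity$Id))
```

Wrapping the new column in factor() when plotting makes ggplot2 treat it as discrete, which avoids the scientific-notation axis.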

Why am I getting an 'Error in UseMethod("arrange")' in R?

I'm writing an r program which lists sales prices for various items. I have a column called InvoiceDate, which lists date and time as follows: '12/1/2009 7:45'. I'm trying to isolate the date only in a separate field called date, and then arrange the dates sequentially. The code I'm using is as follows:
library(dplyr)
library(ggplot2)
setwd("C:/Users/cshor/OneDrive/Environment/Restoration_Ecology/Udemy/Stat_Thinking_&_Data_Sci_with_R/Assignments/Sect_5")
retail_clean <- read.csv("C:/Users/cshor/OneDrive/Environment/Restoration_Ecology/Udemy/Stat_Thinking_&_Data_Sci_with_R/Data/retail_clean.csv")
retail_clean$date <- as.Date(retail_clean$InvoiceDate)#, format = "%d/%m/%Y")
total_sales = sum(retail_clean$Quantity, na.rm=TRUE) %>%
arrange(retail_clean$date) %>% ggplot(aes(x=date, y=total_sales)) + geom_line()
Initially, everything works fine, and the date field is created. However, I get the following error for the arrange() function:
Error in UseMethod("arrange") : no applicable method for 'arrange' applied to an object of class "c('integer', 'numeric')"
I've searched for over a week for a solution to this problem, but have found nothing that specifically addresses this issue. I've also used as.POSIXct instead of as.Date, with similar results. Any help as to why the program interprets Date data as numeric, and how I can correct the problem, would be greatly appreciated.
First, the error message is not about Date time.
Let's look at the code you provided:
total_sales = sum(retail_clean$Quantity, na.rm=TRUE) %>%
arrange(retail_clean$date) %>% ggplot(aes(x=date, y=total_sales)) + geom_line()
The result of the expression sum(retail_clean$Quantity, na.rm=TRUE) is an integer in your case, and it is piped into the first argument of the dplyr::arrange function, which calls UseMethod("arrange").
Then the piped argument is inspected and found to be an object of class integer and numeric, and arrange does not have a method for these classes; that is, neither arrange.integer nor arrange.numeric is defined. Hence the error message. There is nothing wrong with your date conversion, except that you do need the format argument you commented out in the code sample.
The solution is also simple. Change sum to something that returns a data.frame or other classes that arrange is aware of. You can check what methods are available for arrange:
$>methods(dplyr::arrange)
[1] arrange.data.frame*
In this R instance, you can only put a data.frame object through arrange, but you can always define specific methods for other classes.
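The dispatch failure can be reproduced without dplyr. The sketch below defines a toy generic, arrange_demo (a hypothetical name, not part of any package), that has only a data.frame method, and then calls it on the result of sum():

```r
# Toy S3 generic with a single method for data frames.
arrange_demo <- function(.data, ...) UseMethod("arrange_demo")
arrange_demo.data.frame <- function(.data, ...) .data

# A data frame dispatches to the data.frame method.
ok <- arrange_demo(data.frame(x = 1:3))

# sum() returns a bare number, for which no method exists, so dispatch fails.
msg <- tryCatch(arrange_demo(sum(1:3)),
                error = function(e) conditionMessage(e))
```

Here msg holds a "no applicable method" error of the same kind as the one in the question.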
Looks like this is a Udemy course assignment. Maybe here you need to calculate a sum for each day or each month, whichever your assignment is asking you to do, but sum is definitely not the right answer.
By the way, welcome to SO!
Update:
An example
n <- 100
data <- data.frame(sales = runif(n), day = sample(1:30, n, replace = TRUE))
data$date_ <- paste0(data$day, "/1/2009 7:45")
head(data$date_) # This is the original date string
data$date <- as.Date(data$date_, format = "%d/%m/%Y")
head(data$date) # Check here to see the formatted date
library(dplyr)
library(ggplot2)
data %>%
  group_by(date) %>%
  summarise(totalSale = sum(sales, na.rm=TRUE)) %>%
  arrange(date) %>%
  ggplot(aes(x = date, y = totalSale)) +
  geom_line()
Here is the plot
It looks fine, doesn't it? The sales are all ordered by date now.

How to arrange, group and concatenate string values of repeated keys in different columns using R

I have an HMMSCAN result file of protein domains with 10 columns. please see the link for the CSV file.
https://docs.google.com/spreadsheets/d/10d_YQwD41uj0q5pKinIo7wElhDj3BqilwWxThfIg75s/edit?usp=sharing
But I want it to look like this:-
1BVN:P|PDBID|CHAIN|SEQUENCE Alpha-amylase Alpha-amylase_C A_amylase_inhib
3EF3:A|PDBID|CHAIN|SEQUENCE Cutinase
3IP8:A|PDBID|CHAIN|SEQUENCE Amdase
4Q1U:A|PDBID|CHAIN|SEQUENCE Arylesterase
4ROT:A|PDBID|CHAIN|SEQUENCE Esterase
5XJH:A|PDBID|CHAIN|SEQUENCE DLH
6QG9:A|PDBID|CHAIN|SEQUENCE Tannase
The repeated entries of column 3 should get grouped and its corresponding values of column 1, which are in different rows, should be arranged in separate columns.
This is what I have written so far:
df <- read.csv ("hydrolase_sorted.txt" , header = FALSE, sep ="\t")
new <- df %>% select (V1,V3) %>% group_by(V3) %>% spread(V1, V3)
I hope I am clear with the problem statement. Thanks in advance!!
Your input data set has two irregular rows. Apart from that, the approach in your solution is right, but one more step is required:
library(dplyr)
df %>% select(V3, V1) %>% group_by(V3) %>%
  mutate(x = paste(V1, collapse = " ")) %>%
  select(V3, x) %>% distinct()  # distinct() keeps one row per V3
What we did here is simply concatenate the strings by V3. Before running the code above, you should preprocess the data and fix the improper rows manually (TIM, Dannase, and DLH). To do that, you can use the Text to Columns feature in Excel.
The required steps are shown below. The problematic columns are highlighted in yellow:
Sorry for the non-English interface of my Excel but the way is self-explanatory.
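The concatenate-by-key step can also be sketched in base R with aggregate(), which returns one row per key directly. The small data frame below is a made-up excerpt using values from the desired output in the question; the column names V1 and V3 follow the header = FALSE import.

```r
# Made-up excerpt: V1 holds the domain name, V3 holds the sequence key.
df <- data.frame(
  V1 = c("Alpha-amylase", "Alpha-amylase_C", "A_amylase_inhib", "Cutinase"),
  V3 = c("1BVN:P|PDBID|CHAIN|SEQUENCE", "1BVN:P|PDBID|CHAIN|SEQUENCE",
         "1BVN:P|PDBID|CHAIN|SEQUENCE", "3EF3:A|PDBID|CHAIN|SEQUENCE")
)

# Paste all V1 values that share a V3 key into one space-separated string.
grouped <- aggregate(V1 ~ V3, data = df, FUN = paste, collapse = " ")
```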

Why are the numbers not mapped to each row?

So I am trying to find the number of occurrences of each name in another dataset. The code I am trying to run is:
Data$Count <- grep(Data$Name,OtherDataSet$LeadName) %>% length()
The issue is when I run this, the number for the first name gets mapped to each spot in that column. Why is this happening?
library(tidyverse)
Data <- tibble(Name = c("Dog", "Cat", "Bird"))
OtherDataSet <- tibble(LeadName = c("Frog", "Cat", "Catfish", "BirdOfPrey", "Bird", "Bird"))
Data <- Data %>%
  mutate(Count = map(.x = Name, ~ str_detect(., pattern = OtherDataSet$LeadName)) %>% map_int(~ sum(.)))
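As for why the original line fills the column with one number: grep() uses only the first element of its pattern argument (with a warning), and length() then returns a single value, which is recycled down the whole column. A base R sketch that instead counts matches once per name is shown below. Note that it treats each name as a literal substring to search for, as the grep() call in the question does, so "Cat" also matches "Catfish"; that is an assumption about what is wanted.

```r
Data <- data.frame(Name = c("Dog", "Cat", "Bird"))
OtherDataSet <- data.frame(
  LeadName = c("Frog", "Cat", "Catfish", "BirdOfPrey", "Bird", "Bird")
)

# For each name, count the LeadName entries that contain it.
Data$Count <- vapply(
  Data$Name,
  function(nm) sum(grepl(nm, OtherDataSet$LeadName, fixed = TRUE)),
  integer(1),
  USE.NAMES = FALSE
)
```

For exact matches only, replacing the grepl() call with OtherDataSet$LeadName == nm would be one option.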

Applying dplyr's tally over large amount of columns to create codebook

I have a dataframe of 100+ variables and I would like to create a codebook to see the frequencies of each variable (and ideally output this to Excel). Right now, I'm using the following code:
freq_fun <- function(var){
  var <- enquo(var)
  frequencies <- raw %>% group_by(group, !!var) %>% tally()
  return(frequencies)
}
I added in the return in the hopes that looping by column names would at least show me the output but this was unsuccessful.
At this point, my plan is to do the following:
for(i in colnames(rawxl[,9:107])){
  assign(paste0(i,"freq"), freq_fun(!!i))
}
I would then output each dataframe to a CSV and copy and paste them into one Excel document. This is undesirable for obvious reasons, but I can't see a clear way around it. What is a better way to do this?
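One possible approach is to loop over the column names as strings and stack the per-variable counts into a single long data frame, which can then be written out once. The sketch below uses base R and a made-up three-column data frame in place of raw; the output column names variable, value, and n are illustrative choices.

```r
# Made-up stand-in for the real data: one grouping column plus two variables.
raw <- data.frame(
  group = c("a", "a", "b", "b"),
  q1 = c(1, 2, 1, 1),
  q2 = c("yes", "no", "yes", "yes")
)

vars <- setdiff(names(raw), "group")

# Count each variable's values within each group, then stack the results.
codebook <- do.call(rbind, lapply(vars, function(v) {
  counts <- as.data.frame(table(group = raw$group, value = raw[[v]]),
                          stringsAsFactors = FALSE)
  cbind(variable = v, counts)
}))
names(codebook)[names(codebook) == "Freq"] <- "n"
```

Unlike tally(), table() also reports combinations with a count of zero. A single write.csv(codebook, "codebook.csv", row.names = FALSE) call then produces one file that opens in Excel.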
