I am trying to bring several things together using dplyr. Given a time series of multiple returns, I want to calculate the average correlation of each return series with all of the others (I have simplified my real task to give the easiest possible example). In contrast to the example below, my real dataset is rather large, is not yet in wide format (I have not yet called spread(stock, ret)), and contains multiple NAs. In a second step I will have to write my own function and supply it to rollapply, so if you have a suggestion using something from the RcppRoll package I would be more than happy!
In the below example you can see that I need to input all columns at once, select a window, apply a function to all columns simultaneously, receive a vector with the same number of columns and so on...
Here is my example:
df <- data.frame(Date =as.Date("1926-01-01")+1:24,
PERMNO1 = rnorm(24,0.01,0.3),
PERMNO2 = rnorm(24,0.02,0.4),
PERMNO2 = rnorm(24,-0.01,0.6))
df %>%
do(rollapplyr(.[,-1],width=12,function(a) colMeans(cor(a))))
What I would like to get is something like this:
df2 <- df; df2[,2:4]<-NA
for (i in 12:24){
df2[i,2:4] <- colMeans(cor(df[(i-12):i,2:4]))
}
df2
Date PERMNO1 PERMNO2 PERMNO2.1
1926-01-02 NA NA NA
1926-01-03 NA NA NA
1926-01-04 NA NA NA
1926-01-05 NA NA NA
1926-01-06 NA NA NA
1926-01-07 NA NA NA
1926-01-08 NA NA NA
1926-01-09 NA NA NA
1926-01-10 NA NA NA
1926-01-11 NA NA NA
1926-01-12 NA NA NA
1926-01-13 0.14701350 0.2001694 0.3787320
1926-01-14 0.15364347 0.2438042 0.3143516
1926-01-15 0.16118233 0.2549841 0.3266877
1926-01-16 0.04727533 0.2534126 0.3132990
1926-01-17 0.05220443 0.2411095 0.2744379
1926-01-18 0.12252848 0.2461743 0.2766122
1926-01-19 0.08414717 0.2287705 0.2897744
1926-01-20 0.11164866 0.2503174 0.2414130
1926-01-21 0.08886537 0.2604810 0.2621597
1926-01-22 0.14216304 0.2667540 0.2543573
1926-01-23 0.12654902 0.3086711 0.2751671
1926-01-24 0.11068607 0.3019835 0.2728166
1926-01-25 0.06714698 0.2696828 0.2184242
Convert the data frame to a zoo object, run rollapplyr and convert back:
library(dplyr)
library(zoo)
df %>%
read.zoo %>%
rollapplyr(12, function(x) colMeans(cor(x)), by.column = FALSE, fill = NA) %>%
fortify.zoo
The last line could be omitted if you want to keep the answer as a zoo object, which would probably be more convenient than representing a time series as a data frame.
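To make the mechanics of by.column = FALSE concrete, here is a minimal base-R sketch of the same idea: a function is applied to every trailing window of rows and returns one value per column. The helper name roll_rows, the simulated returns and the choice of use = "pairwise.complete.obs" for handling NAs are assumptions made for illustration, not part of the answer above.

```r
# Hypothetical helper: apply `fun` to each trailing window of `width` rows.
roll_rows <- function(m, width, fun) {
  m <- as.matrix(m)
  out <- matrix(NA_real_, nrow = nrow(m), ncol = ncol(m),
                dimnames = list(NULL, colnames(m)))
  if (nrow(m) < width) return(out)
  for (i in width:nrow(m)) {
    window <- m[(i - width + 1):i, , drop = FALSE]  # exactly `width` rows
    out[i, ] <- fun(window)
  }
  out
}

set.seed(1)
returns <- data.frame(A = rnorm(24), B = rnorm(24), C = rnorm(24))
returns$B[5] <- NA  # simulate a missing return

# Assumption: pairwise-complete correlations are an acceptable way to
# tolerate NAs inside a window.
avg_cor <- function(x) colMeans(cor(x, use = "pairwise.complete.obs"))

res <- roll_rows(returns, width = 12, fun = avg_cor)
```

Each window here contains exactly 12 rows. The loop in the question indexes (i-12):i, which spans 13 rows once i exceeds 12, so its numbers would not match a width-12 rolling window exactly.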
I'm attempting to import multiple csv files and create a new dataframe that includes specific columns (some with the same name; some different) from each of these files. So far I have been able to create the dataframe with the specific columns I want, but somewhere in my code my data gets lost and doesn't transfer over to each column.
I would also like to create a new column named status in which each cell is Lost, Gained or Neutral, depending on whether the value from the all_v.csv file is also found in lost_v and/or gained_v. If it is found in neither, it is Neutral. I attempted to write a line of code for this, but I won't know if it works until I am able to attach the correct data to each column.
This would give me a total of 8 columns:
pre_contact, status, gained_variation, lost_variation, coord.lat, coord.long, country, Date
Most of these columns come from the 4 files listed below with the exception of the status column:
all_v - pre_contact
status - Lost / Gained / Neutral
gained_v - gained_variation
lost_v - lost_variation
SOUTH - coord.lat, coord.long, country, Date
Another issue I'm facing is that my data frames have different sizes. When I attempt to merge them or use rbind, I get an error saying that my rows do not line up because some columns are longer than others, so I would like a way to fix this by padding with NAs.
Here is my sample code:
folder_path<- setwd("/directory/")
setwd(folder_path)
#this creates a table with two columns: filename and file_path but I'm not sure how to utilize it
all_of_them <- fs::dir_ls(folder_path, pattern="*.csv")
file_names <- tibble(filename = list.files(folder_path))
file.paths <- file_names %>% mutate(filepath = paste0(folder_path, filename))
#Each file I want to use
gained_v <- read.csv("gained.csv", header = TRUE)
lost_v <- read.csv("lost.csv", header = TRUE)
all_v <- read.csv("all.csv", header = TRUE)
SOUTH <- read.csv("SOUTH.csv", header = TRUE)
files = list.files(pattern="*.csv", full.names = TRUE)
for (i in 1:length(files)){
data <-
files %>%
map_df(~fread(.))
}
# Set Column Names
subset_data <- data.frame(data)
subset_data$status <- with(subset_data, subset_data$pre_contact == subset_data$gained_variation | subset_data$pre_contact == subset_data$lost_variatiion)
subset_data <- subset(subset_data, select = c(pre_contact,status, gained_variation,lost_variatiion,coord.lat,coord.long, country, Date))
subset_data <- as_tibble(subset_data)
write.csv(subset_data, "subset_data.csv")
status_data = read.csv("subset_data.csv", header = TRUE)
status_data <- data.frame(subset(status_data, select = -c(X)))
status_data <- tibble(status_data)
So far my output looks like this (the only data showing is from my pre_contact column):
pre_contact status gained_variation lost_variation coord.lat coord.long country Date
1234 NA NA NA NA NA
6543 NA NA NA NA NA
9876 NA NA NA NA NA
1233 NA NA NA NA NA
1276 NA NA NA NA NA
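The two sub-problems described above (labelling each value as Lost, Gained or Neutral, and padding shorter columns with NA) can be sketched in base R as follows. The vectors stand in for columns read from the CSV files, the values are made up, and pad_na is a hypothetical helper, so treat this as an illustration under those assumptions rather than a fix for the posted script.

```r
# Hypothetical stand-ins for columns read from all.csv, gained.csv and lost.csv.
pre_contact      <- c(1234, 6543, 9876, 1233, 1276)
gained_variation <- c(6543, 1276)
lost_variation   <- c(9876)

# Classify each pre_contact value by membership. A value present in both
# files is labelled "Gained" here only because that test comes first.
status <- ifelse(pre_contact %in% gained_variation, "Gained",
          ifelse(pre_contact %in% lost_variation, "Lost", "Neutral"))

# Hypothetical helper: pad a shorter vector with NA up to length n.
pad_na <- function(x, n) c(x, rep(NA, n - length(x)))
n <- length(pre_contact)

combined <- data.frame(
  pre_contact      = pre_contact,
  status           = status,
  gained_variation = pad_na(gained_variation, n),
  lost_variation   = pad_na(lost_variation, n)
)
```

Using %in% rather than == avoids comparing the columns position by position, which only works when the files happen to be aligned row for row.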
I have the following data frame
school<-c("NYU", "BYU", "USC")
state<-c("NY","UT","CA")
measure<-c("MSAT","MSAT","GPA")
score<-c(500, 490, 2.9)
score2<-(c(200, 280, 4.3))
df<-data.frame(school,state, measure,score,score2, stringsAsFactors=FALSE)
> df
school state measure score score2
1 NYU NY MSAT 500.0 200.0
2 BYU UT MSAT 490.0 280.0
3 USC CA GPA 2.9 4.3
And I would like to set all the values for certain columns to NA without any condition. Just set them to NA. i.e.
> df
school state measure score score2
1 NYU NA MSAT NA NA
2 BYU NA MSAT NA NA
3 USC NA GPA NA NA
I have tried:
df <- mutate_at(vars(-school,-measure),na_if(.,!is.na(.)))
Where I was expecting na_if(.,!is.na(.)) to convert any value that wasn't already NA to NA. But as you can see I'm not correctly feeding the columns into the is.na() function.
Error in length_x %in% c(1L, n) : object '.' not found
How would I go about achieving this? I have many more columns I would like to perform this on than columns I want to preserve.
This would do it
df[,c("state","score","score2")]<-NA
Since you're asking specifically for mutate, here are some things to consider:
In your line df <- mutate_at(vars(-school,-measure),na_if(.,!is.na(.))), it fails because it expects df as the first argument - or piped in. The correct usage would be
df <- df %>% mutate_at(vars(-school,-measure),na_if(.,!is.na(.)))
But that doesn't solve it, because
Have you checked what na_if and is.na do? Just a quick ?na_if? na_if does not replace values where the second argument is TRUE; it replaces values that are equal to the second argument with NA. So that plainly doesn't work as expected.
And finally,
Why only change non-NA values to NA? Why not just change everything to NA?
Which leads to the following solution:
just.na <- function(x) rep(NA, length(x))
df %>% mutate_at(vars(-school, -measure), just.na)
Or, an anonymous function:
df %>% mutate_at(vars(-school, -measure), ~rep(NA, length(.)))
Or, it turns out you can do this:
df %>% mutate_at(vars(-school, -measure), ~NA)
(I am equally surprised!)
Try this:
df[,c(2,4,5)] <- NA
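Since the question mentions having many more columns to blank out than to preserve, a base-R variant that names only the columns to keep may be handy. This is a small sketch using the question's data; the variable name keep is an assumption for illustration.

```r
df <- data.frame(school  = c("NYU", "BYU", "USC"),
                 state   = c("NY", "UT", "CA"),
                 measure = c("MSAT", "MSAT", "GPA"),
                 score   = c(500, 490, 2.9),
                 score2  = c(200, 280, 4.3),
                 stringsAsFactors = FALSE)

# Name the columns to preserve; everything else is set to NA.
keep <- c("school", "measure")
df[setdiff(names(df), keep)] <- NA
```

Note that the blanked columns become logical NA columns; assign NA_real_ or NA_character_ to individual columns instead if the original types matter.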
I have a list of SNPs that I have filtered for LD (my.filtered.snp.id in the example below). I want to keep only these SNPs in my genotype matrix (geno_snp). I am trying to write a for loop in R, and I would appreciate any help fixing my code. I want to keep the lines (the whole line, including snp.id and genotype information) in the genotype matrix whose snp.id matches a snp.id in my.filtered.snp.id, and delete those that do not match.
head(my.filtered.snp.id)
Chr10_31458
Chr10_31524
Chr10_45901
Chr10_102754
Chr10_102828
Chr10_103480
head (geno_snp)
XRQChr10_103805 NA NA NA 0 NA 0 NA NA NA NA NA 0 0
XRQChr10_103937 NA NA NA 0 NA 1 NA NA NA NA NA 0 2
XRQChr10_103990 NA NA NA 0 NA 0 NA NA NA NA NA 0 NA
I am trying something like this:
for (i in 1:length(geno_snp[,1])){
for (j in 1:length(my.filtered.snp.id)){
if geno_snp[i,] == my.filtered.snp.i[j]
print (the whole line in geno_snp)
}
else (remove the line)
}
If I understood it correctly, you want a subset of your data.frame geno_snp in which the row names must match the selected SNP IDs from the vector my.filtered.snp.id.
Please check if this solution works for you:
index <- which(sapply(row.names(geno_snp), function(x) any(grepl(pattern = x, x = my.filtered.snp.id, fixed = TRUE))))
selected_subset <- geno_snp[index,]
What I did was create an index addressing the rows whose names match any value in my.filtered.snp.id, and then use the index to subset the data frame. For each row name, grepl reports whether it occurs in any of the filtered IDs, sapply collects these results into a logical vector, and which converts that vector into row positions of geno_snp.
EDIT:
I noticed you had some row.names that weren't an exact match with your original my.filtered.snp.id values. In this case, maybe what you wanna do is:
index <- unlist(sapply(my.filtered.snp.id, function(x) grep(pattern = x, x = row.names(geno_snp))))
selected_subset <- geno_snp[index,]
The thing is that you have row.names beginning with XRQ..., so in this last case the code uses the reference values from my.filtered.snp.id to detect matches in row.names(geno_snp), even if there is this XRQ string at the beginning of each name.
Finally, in the case I have misunderstood your data and what I'm calling row names here are, in fact, data in a column (the SNP IDs), just use geno_snp[,1] instead of row.names(geno_snp) in both codes above.
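Because grep does substring matching, an ID such as Chr10_31458 would also match a row named XRQChr10_314580. If exact matches are wanted, one option is to strip the prefix and compare with %in%. The sketch below assumes the prefix is always the literal string XRQ and uses made-up genotype columns s1 to s3.

```r
my.filtered.snp.id <- c("Chr10_31458", "Chr10_103805", "Chr10_103990")

# Hypothetical genotype data with the SNP IDs as row names.
geno_snp <- data.frame(
  s1 = c(NA, NA, NA),
  s2 = c(0, 1, 0),
  s3 = c(0, 2, NA),
  row.names = c("XRQChr10_103805", "XRQChr10_103937", "XRQChr10_103990")
)

# Assumption: every row name carries the same "XRQ" prefix.
ids <- sub("^XRQ", "", row.names(geno_snp))

# Keep only rows whose stripped ID is exactly one of the filtered IDs.
selected_subset <- geno_snp[ids %in% my.filtered.snp.id, , drop = FALSE]
```

This is vectorised, so it avoids the nested for loops from the question and should scale better to large genotype matrices.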
I would like to sort one column in my data frame by string length first then by alphabet, I tried code below:
#sort column by string length then alphabet
GSN[order(nchar(GSN[,3]),GSN[,3]),]
But I got this error:
Error in nchar(GSN[, 3]) : 'nchar()' requires a character vector
My data looks like:
Flowcell Lane barcode sample plate row column
314 NA NA AACAGACATT LD06_7620SDS GSN1_Hind384D B 4
307 NA NA AACAGCACT LG10_2688SDS GSN1_Hind384D C 3
289 NA NA AACCTC U09_105007SDS GSN1_Hind384D A 1
232 NA NA AACGACCACC 13_232 GSN1_Hind384C H 5
10 NA NA AACGCACATT 13_10 GSN1_Hind384A B 2
165 NA NA AACGG 13_165 GSN1_Hind384B E 9
I would like to sort "barcode" column.
Thanks for your time.
You can add another column to your data frame that contains the number of characters in the barcode, then sort in the usual way.
GSN <- transform(GSN, n=nchar(as.character(barcode)))
GSN[with(GSN, order(n, barcode)), ]
It appears that the issue you were having is because R thinks that barcode is a factor rather than a character vector, so nchar() is invalid. Converting it to character via as.character() solves this.
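The same sort can be done without adding a helper column. The sketch below rebuilds a small version of the data with barcode stored as a factor (an assumption consistent with the error message) and orders by length, then alphabetically.

```r
# Small stand-in for GSN, with barcode deliberately stored as a factor.
GSN <- data.frame(
  barcode = c("AACAGACATT", "AACAGCACT", "AACCTC",
              "AACGACCACC", "AACGCACATT", "AACGG"),
  row = c("B", "C", "A", "H", "B", "E"),
  stringsAsFactors = TRUE
)

# nchar() rejects factors, so convert to character first.
bc <- as.character(GSN$barcode)

# Order by string length, breaking ties alphabetically.
GSN_sorted <- GSN[order(nchar(bc), bc), ]
```

Reading the file with stringsAsFactors = FALSE would avoid the conversion step altogether.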
I wish to add a tidyverse solution
library(tidyverse)
GSN_sorted = GSN %>%
mutate(barcode = as.character(barcode)) %>%
arrange(str_length(barcode), barcode)
Note the factor to character conversion originally pointed out by Alex A.
I have preallocated a 3D array and am trying to fill it with data. However, whenever I do this with a previously defined data.frame column, the array gets mysteriously converted to a list, which messes up everything. Converting the data.frame column to a vector does not help.
Example:
exampleArray <- array(dim=c(3,4,6))
exampleArray[2,3,] <- c(1:6) # direct filling works perfectly
exampleArray
str(exampleArray) # output as expected
Problem:
exampleArray <- array(dim=c(3,4,6))
exampleContent <- as.vector(as.data.frame(c(1:6)))
exampleArray[2,3,] <- exampleContent # filling array from a data.frame column
# no errors or warnings
exampleArray
str(exampleArray) # list-like output!
Is there any way I can get around this and fill up my array normally?
Thanks for your suggestions!
Try this:
exampleArray <- array(dim=c(3,4,6))
exampleContent <- as.data.frame(c(1:6))
> exampleContent[,1]
[1] 1 2 3 4 5 6
exampleArray[2,3,] <- exampleContent[,1] # take the desired column
# no errors or warnings
str(exampleArray)
int [1:3, 1:4, 1:6] NA NA NA NA NA NA NA 1 NA NA ...
You were trying to insert a data frame into the array, which won't work. You should use dataframe$column or dataframe[,1] instead.
Also, as.vector doesn't do anything in as.vector(as.data.frame(c(1:6))); you were probably after as.numeric(as.data.frame(c(1:6))), although that doesn't work:
as.numeric(as.data.frame(c(1:6)))
Error: (list) object cannot be coerced to type 'double'
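The underlying reason is that a data frame is a list of columns, so assigning it into the array coerces the array to a list. A short sketch of checking for this and extracting the column as an atomic vector with [[ ]] follows; the variable name column is only for illustration.

```r
exampleArray   <- array(dim = c(3, 4, 6))
exampleContent <- as.data.frame(c(1:6))

# A data frame is a list, and as.vector() leaves it as one.
is.list(exampleContent)
is.list(as.vector(exampleContent))

# [[1]] extracts the first column as a plain integer vector.
column <- exampleContent[[1]]
exampleArray[2, 3, ] <- column

# The array keeps its atomic type and dimensions.
is.list(exampleArray)
dim(exampleArray)
```

unlist(exampleContent, use.names = FALSE) is another way to get an atomic vector when the data frame has a single column.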