I have got a problem in R. As I am still learning to code successfully, I hope you can help me.
I have a data frame where the header is defined as the equity ISIN and below I have the daily equity prices in a numeric format.
ISIN1   ISIN 2   ISIN3   ...
2.35    10.10    0.90    ...
2.45    9.98     0.85    ...
2.40    10.15    0.70    ...
...     ...      ...     ...
I don't need the prices themselves in the columns below each ISIN; I need the day-over-day percentage change, in the following format:
ISIN1      ISIN 2     ISIN3      ...
NA         NA         NA         ...
0.04255    -0.01188   -0.05555   ...
-0.02041   0.01703    -0.17647   ...
...        ...        ...        ...
I managed to achieve this for the first column (ISIN1), but I could not extend the code to the other columns. I assume there must be an easy way that avoids copying and pasting the code for every ISIN manually. I tried apply, sapply and for loops, but either I did not find the right one or I did not use it correctly.
Below is my code for the first ISIN. The data frame being created is named "return", and the data frame holding the available stock prices is called "stock_prices".
for (i in 2:nrow(return)) {
return$`ISIN1`[(i)] = ((stock_prices$`ISIN1`[(i)])/stock_prices$`ISIN1`[(i-1)])-1
}
I hope you can help me and that I have posed the question in an understandable way.
Thank you!!
The canonical way to do this in R is lapply, something like:
running_change <- function(x) c(NA, diff(x) / x[-length(x)])
changes <- lapply(dat[,1:3], running_change)
changes
# $ISIN1
# [1] NA 0.04255319 -0.02040816
# $ISIN.2
# [1] NA -0.01188119 0.01703407
# $ISIN3
# [1] NA -0.05555556 -0.17647059
That returns a list; if you want to add that to data, then
cbind(dat, changes)
# ISIN1 ISIN.2 ISIN3 ... ISIN1 ISIN.2 ISIN3
# 1 2.35 10.10 0.90 ... NA NA NA
# 2 2.45 9.98 0.85 ... 0.04255319 -0.01188119 -0.05555556
# 3 2.40 10.15 0.70 ... -0.02040816 0.01703407 -0.17647059
If you want to replace the data, though, you can do
dat[1:3] <- changes
dat
# ISIN1 ISIN.2 ISIN3 ...
# 1 NA NA NA ...
# 2 0.04255319 -0.01188119 -0.05555556 ...
# 3 -0.02040816 0.01703407 -0.17647059 ...
or, in the original step, do
dat[,1:3] <- lapply(dat[,1:3], running_change)
I use 1:3 here to restrict the operation to the columns that need it. That might not be necessary with your data if all columns are numeric and you want the calculation on all of them; in this case I kept the "..." placeholder column from your example, so I had to avoid calculating fractional changes on strings. You can select columns however you like, including by integer index as here or by name.
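If you'd rather not hard-code the column positions, one option is to pick the numeric columns programmatically. This is only a minimal sketch on toy data shaped like the example; the names prices, note and num_cols are mine, not from the question.

```r
# Toy data shaped like the question's example; "note" stands in for a non-numeric column
prices <- data.frame(
  ISIN1 = c(2.35, 2.45, 2.40),
  ISIN2 = c(10.10, 9.98, 10.15),
  note  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

running_change <- function(x) c(NA, diff(x) / x[-length(x)])

# Logical vector: TRUE for every numeric column
num_cols <- vapply(prices, is.numeric, logical(1))

# Replace only the numeric columns; the character column is left untouched
prices[num_cols] <- lapply(prices[num_cols], running_change)
prices
```

Selecting by is.numeric keeps the call safe when the data frame also carries dates or labels.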
By the way, if you want to do this on all columns, then
dat[] <- lapply(dat, running_change)
The dat[] <- ensures that the columns are replaced without changing from a data.frame to a list. It's a trick; other ways work too, but this is the most direct (and code-golf/shortest) I've seen.
Data
dat <- structure(list(ISIN1 = c(2.35, 2.45, 2.4), ISIN.2 = c(10.1, 9.98, 10.15), ISIN3 = c(0.9, 0.85, 0.7), ... = c("...", "...", "...")), row.names = c(NA, -3L), class = "data.frame")
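As a sanity check, the vectorised function can be compared with the loop from the question. This is a sketch on the three example prices for ISIN1 only; loop_result and vectorised are names I made up for the comparison.

```r
# The three example prices for ISIN1 from the question
stock_prices <- data.frame(ISIN1 = c(2.35, 2.45, 2.40))

running_change <- function(x) c(NA, diff(x) / x[-length(x)])

# The question's loop, writing into a plain vector instead of a data frame
loop_result <- rep(NA_real_, nrow(stock_prices))
for (i in 2:nrow(stock_prices)) {
  loop_result[i] <- stock_prices$ISIN1[i] / stock_prices$ISIN1[i - 1] - 1
}

# The vectorised version
vectorised <- running_change(stock_prices$ISIN1)

# TRUE if both approaches agree up to floating-point tolerance
isTRUE(all.equal(loop_result, vectorised))
```

Both compute (p[i] - p[i-1]) / p[i-1], so they should agree apart from rounding in the last digits.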
Related
I need to merge two lists with each other but I am not getting what I want and I think it is because the "Date" column is in two different formats. I have a list called li and in this list there are 12 lists each with the following format:
> tail(li$fxe)
Date fxe
3351 2020-06-22 0.0058722768
3352 2020-06-23 0.0044256216
3353 2020-06-24 -0.0044998220
3354 2020-06-25 -0.0027309539
3355 2020-06-26 0.0002832672
3356 2020-06-29 0.0007552346
I am trying to merge each of these unique lists with a different list called factors which looks like :
> tail(factors)
Date Mkt-RF SMB HML RF
3351 20200622 0.0071 0.83 -1.42 0.000
3352 20200623 0.0042 0.15 -0.56 0.000
3353 20200624 -0.0261 -0.52 -1.28 0.000
3354 20200625 0.0112 0.25 0.50 0.000
3355 20200626 -0.0243 0.16 -1.37 0.000
3356 20200629 0.0151 1.25 1.80 0.000
The reason I need this structure is that I am trying to send them to a function I wrote to do linear regressions. The first line of my function merges these lists, but when I merge them I end up with an empty structure even though my lists clearly have the same number of rows. In my function, df is an element of li. The nested structure of li is confusing me. Can someone help please?
Function I want to use:
Bf <- function(df, fac) {
  # Calculates the betas of the Fama-French three-factor model using linear regression
  # Input:  df  = a data frame containing the returns of the security
  #         fac = a data frame containing the excess market return and
  #               the Fama-French three factors
  # Output: the coefficient vector of the fitted Fama-French model
  temp <- merge(df, fac, by = "Date")
  temp <- temp[, !names(temp) %in% "Date"]
  temp[, 1] <- temp[, 1] - temp$RF  # excess return of the security
  return(lm(temp[, 1] ~ temp[, 2] + temp[, 3] + temp[, 4])$coeff)
}
a: you are dealing with data frames, not lists (each element of li is a data frame, and so is factors)
b: if you want to merge them, you need to convert the factors$Date column so that it has the same class as li$fxe$Date
Try:
factors$Date <- as.Date(as.character(factors$Date), format = "%Y%m%d")
This should convert the factors column to "Date" format. Note that the month is %m; %M means minutes.
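To illustrate, here is a small sketch using a few rows copied from the question. The two data frames are cut down to the columns needed to show the merge, so treat them as toy stand-ins for li$fxe and factors.

```r
# Toy stand-in for li$fxe: Date is already of class Date
fxe <- data.frame(
  Date = as.Date(c("2020-06-22", "2020-06-23", "2020-06-24")),
  fxe  = c(0.0058722768, 0.0044256216, -0.0044998220)
)

# Toy stand-in for factors: Date is stored as an integer like 20200622
factors <- data.frame(
  Date = c(20200622L, 20200623L, 20200624L),
  RF   = c(0, 0, 0)
)

# Convert the integer dates to Date so that both key columns have the same class
factors$Date <- as.Date(as.character(factors$Date), format = "%Y%m%d")

merged <- merge(fxe, factors, by = "Date")
merged
```

Once the keys match, each element of li could presumably be run through your function with something like lapply(li, Bf, fac = factors).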
I'm running the following code below to retrieve a data set, which unfortunately uses "." instead of NA to represent missing data. After much wrangling and searching SO and other fora, I still cannot make the code replace all instances of "." with NA so I can convert the columns to numeric and go on with my life. I'm pretty sure the problem is between the screen and the chair, so I don't see a need to post sessionInfo, but please let me know otherwise. Help in solving this would be greatly appreciated. The first four columns are integers setting out the date and the unique ID, so I would only need to correct the other columns. Thanks in advance you all!
library(data.table)
google_mobility_data <- data.table(read.csv("https://github.com/OpportunityInsights/EconomicTracker/raw/main/data/Google Mobility - State - Daily.csv",stringsAsFactors = FALSE))
# The following line is the one where I can't make it work properly.
google_mobility_data[, .SD := as.numeric(sub("^\\.$", NA, .SD)), .SDcols = -c(1:4)]
I downloaded your data and changed the last entry on the first row to "." to test NA in the final column.
Use readLines to read a character vector.
Use gsub to change . to NA.
Use fread to read as a data.table.
library(data.table)
gmd <- readLines("Google Mobility - State - Daily.csv")
gmd[c(2,3120)]
# [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,."
# [2] "2020,4,25,10,-.384,-.191,.,-.479,-.441,.179,-.213"
gmd <- gsub(",\\.,",",NA,",gmd)
gmd <- gsub(",\\.$",",NA",gmd)
gmd[c(2,3120)]
# [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,NA"
# [2] "2020,4,25,10,-.384,-.191,NA,-.479,-.441,.179,-.213"
google_mobility_data <- fread(text=gmd)
google_mobility_data[c(1,3119)]
# year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020 2 24 1 0.00286 -0.00714 0.0557 0.060 0.0129 0.00857 NA
#2: 2020 4 25 10 -0.38400 -0.19100 NA -0.479 -0.4410 0.17900 -0.213
summary(google_mobility_data)
EDIT: You mentioned using na.strings with fread didn't work for you, so I suggested the above approach.
However, at least with the data file downloaded as I did, this worked in one line, as suggested by @MichaelChirico:
google_mobility_data <- fread("Google Mobility - State - Daily.csv",na.strings=".")
google_mobility_data[c(1,3119)]
# year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020 2 24 1 0.00286 -0.00714 0.0557 0.060 0.0129 0.00857 NA
#2: 2020 4 25 10 -0.38400 -0.19100 NA -0.479 -0.4410 0.17900 -0.213
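If the file has already been read with "." left in place, the per-column replacement the question was attempting can also be done afterwards. Below is a base R sketch on a toy data frame; gmd_toy and dot_to_numeric are hypothetical names, and only two of the real measurement columns are included.

```r
# Toy stand-in for the downloaded data: "." marks a missing value,
# so the measurement columns were read as character
gmd_toy <- data.frame(
  year = c(2020L, 2020L), month = c(2L, 4L), day = c(24L, 25L), statefips = c(1L, 10L),
  gps_parks = c(".0557", "."),
  gps_away_from_home = c(".", "-.213"),
  stringsAsFactors = FALSE
)

# Every column except the first four (date parts and state ID)
cols <- names(gmd_toy)[-(1:4)]

# Turn a lone "." into NA first, then convert to numeric (avoids coercion warnings)
dot_to_numeric <- function(x) as.numeric(replace(x, x == ".", NA))

gmd_toy[cols] <- lapply(gmd_toy[cols], dot_to_numeric)
str(gmd_toy)
```

With a data.table, the same helper can be assigned by reference: google_mobility_data[, (cols) := lapply(.SD, dot_to_numeric), .SDcols = cols]. The original attempt fails because .SD cannot appear on the left-hand side of :=.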
I have multiple data frames (moving-average temperatures of different durations at 130 observation points) and want to generate monthly averages for all of them by applying the code below to each data frame, then put the results into one data frame. I have been trying to do this with a for loop, but I'm not getting anywhere. I'm relatively new to R and would really appreciate it if someone could help me get through this.
Here is a glimpse of one data frame:
head(maxT2016[,1:5])
X X0 X1 X2 X3
1 20160101 26.08987 26.08987 26.08987 26.08987
2 20160102 25.58242 25.58242 25.58242 25.58242
3 20160103 25.44290 25.44290 25.44290 25.44290
4 20160104 26.88043 26.88043 26.88043 26.88043
5 20160105 26.60278 26.60278 26.60278 26.60278
6 20160106 24.87676 24.87676 24.87676 24.87676
str(maxT2016)
'data.frame': 274 obs. of 132 variables:
$ X : int 20160101 20160102 20160103 20160104 20160105 20160106 20160107 20160108 20160109 20160110 ...
$ X0 : num 26.1 25.6 25.4 26.9 26.6 ...
$ X1 : num 26.1 25.6 25.4 26.9 26.6 ...
$ X2 : num 26.1 25.6 25.4 26.9 26.6 ...
$ X3 : num 26.1 25.6 25.4 26.9 26.6 ...
Here is the code that I use for individual data frame:
library(dplyr)
library(lubridate)
library(tidyverse)
maxT10$X <- as.Date(as.character(maxT10$X), format="%Y%m%d")
monthlyAvr <- maxT10 %>%
  group_by(month=floor_date(X, "month")) %>%
  summarise(across(X0:X130, ~ mean(.x, na.rm=TRUE))) %>%
  slice_tail(n=6) %>%
  select(-month)
monthlyAvr2 <- as.data.frame(t(monthlyAvr))
colnames(monthlyAvr2) <- c("meanT_Apr", "meanT_May", "meanT_Jun", "meanT_Jul", "meanT_Aug",
"meanT_Sep")
Essentially, I want to put all the data frames into a list, run the code above on each of them, and then combine the outputs into one data frame. I came across the lapply function as an alternative, but I feel more comfortable with a for loop.
d = list(maxT10, maxT20, maxT30, maxT60 ... ...)
for (i in 1:length(d)){
}
MonthlyAvrT <- cbind(maxT10, maxT20, maxT30, maxT60... ... )
Basil. Welcome to StackOverflow.
I was wary of lapply when I first started using R, but you should stick with it. It's almost always more efficient than using a for loop. In your particular case, you can put your individual data frames in a list and wrap the code you run on each one in a function, say myFunc, which takes the data frame you want to process as its argument.
Then you can simply say
allData <- bind_rows(lapply(dataFrameList, myFunc))
Incidentally, your column names make me think your data isn't yet tidy. I'd suggest you spend a little time making it so before you do much else. It will save you a huge amount of effort in the long run.
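To make that concrete, here is a rough base R sketch of the whole pattern. Everything in it is an assumption for illustration: the toy data frames have two observation columns instead of 131, monthly_means plays the role of myFunc, and the source column is my own addition to keep track of which data frame each row came from.

```r
# Hypothetical per-data-frame function: monthly mean of every column except X
monthly_means <- function(df) {
  dates <- as.Date(as.character(df$X), format = "%Y%m%d")
  month <- format(dates, "%Y-%m")
  aggregate(df[setdiff(names(df), "X")], by = list(month = month),
            FUN = mean, na.rm = TRUE)
}

# Toy stand-ins for maxT10 and maxT20
maxT10 <- data.frame(X = c(20160101L, 20160102L, 20160201L),
                     X0 = c(26, 24, 30), X1 = c(20, 22, 28))
maxT20 <- data.frame(X = c(20160101L, 20160102L, 20160201L),
                     X0 = c(25, 25, 31), X1 = c(21, 21, 29))

# A named list, so the origin of each result is not lost
d <- list(maxT10 = maxT10, maxT20 = maxT20)

# Apply the function to every data frame
results <- lapply(d, monthly_means)

# Tag each result with its name and stack everything into one data frame
all_means <- do.call(rbind, Map(function(res, nm) cbind(source = nm, res),
                                results, names(d)))
all_means
```

The same shape works with the dplyr pipeline from the question inside the function; only the body of monthly_means would change.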
The logic in pseudo-code would be:
for each data.frame in list
apply a function
save the results
Applying my_function to each data.frame in data_set:
my_function <- function(my_df) {
my_df <- as.data.frame(my_df)
out <- apply(my_df, 2, mean) # compute mean on dimension 2 (columns)
return(out)
}
# 100 data.frames
data_set <- replicate(100, data.frame(X=runif(6, 20160101, 20160131), X0=rnorm(6, 25)))
> dim(data_set)
[1] 2 100
results <- apply(data_set, 2, my_function) # Apply my_function on dimension 2
# Output for first 5 data.frames
> results[, 1:5]
[,1] [,2] [,3] [,4] [,5]
X 2.016012e+07 2.016011e+07 2.016011e+07 2.016012e+07 2.016011e+07
X0 2.533888e+01 2.495086e+01 2.523087e+01 2.491822e+01 2.482142e+01
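Note that replicate() simplifies its result to a matrix by default, which is why data_set above has dimensions. If you want an ordinary list of data frames instead, a variant along these lines should work; it is a sketch with 3 data frames rather than 100, and the seed is there only to make the toy data reproducible.

```r
set.seed(1)  # only so the toy data are reproducible

# simplify = FALSE keeps the result as a plain list of data frames
data_set <- replicate(
  3,
  data.frame(X = runif(6, 20160101, 20160131), X0 = rnorm(6, 25)),
  simplify = FALSE
)

# colMeans() on each data frame; sapply() binds the results into a matrix
# with one column per data frame and one row per variable
results <- sapply(data_set, colMeans)
dim(results)
```

This keeps the list structure the question asked about while giving the same kind of summary matrix.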
I'm moderately experienced using R, but I'm just starting to learn to write functions to automate tasks. I'm currently working on a project to run sentiment analysis and topic models of speeches from the five remaining presidential candidates and have run into a snag.
I wrote a function to do a sentence-by-sentence analysis of positive and negative sentiments, giving each sentence a score. Miraculously, it worked and gave me a dataframe with scores for each sentence.
score text
1 1 iowa, thank you.
2 2 thanks to all of you here tonight for your patriotism, for your love of country and for doing what too few americans today are doing.
3 0 you are not standing on the sidelines complaining.
4 1 you are not turning your backs on the political process.
5 2 you are standing up and fighting back.
So what I'm trying to do now is create a function that takes the scores and figures out what percentage of the total is represented by the count of each score and then plot it using plotly. So here is the function I've written:
scoreFun <- function(x){{
tbl <- table(x)
res <- cbind(tbl,round(prop.table(tbl)*100,2))
colnames(res) <- c('Score', 'Count','Percentage')
return(res)
}
percent = data.frame(Score=rownames, Count=Count, Percentage=Percentage)
return(percent)
}
Which returns this:
saPct <- scoreFun(sanders.scores$score)
saPct
Count Percentage
-6 1 0.44
-5 1 0.44
-4 6 2.64
-3 13 5.73
-2 20 8.81
-1 42 18.50
0 72 31.72
1 34 14.98
2 18 7.93
3 9 3.96
4 6 2.64
5 2 0.88
6 1 0.44
9 1 0.44
11 1 0.44
What I had hoped it would return is a dataframe with what has ended up being the rownames as a variable called Score and the next two columns called Count and Percentage, respectively. Then I want to plot the Score on the x-axis and Percentage on the y-axis using this code:
d <- subplot(
plot_ly(clPct, x = rownames, y=Percentage, xaxis="x1", yaxis="y1"),
plot_ly(saPct, x = rownames, y=Percentage, xaxis="x2", yaxis="y2"),
margin = 0.05,
nrows=2
) %>% layout(d, xaxis=list(title="", range=c(-15, 15)),
xaxis2=list(title="Score", range=c(-15,15)),
yaxis=list(title="Clinton", range=c(0,50)),
yaxis2=list(title="Sanders", range=c(0,50)),showlegend = FALSE)
d
I'm pretty certain I've made some obvious mistakes in my function and my plot_ly code, because it's clearly not returning the dataframe I want, and running the plotly code leads to the error "Error in list2env(data) : first argument must be a named list". Again, though, I'm not very experienced writing functions and I've not found a similar issue when I Google, so I don't know how to fix this.
Any advice would be most welcome. Thanks!
@MLavoie, this code from the question I referenced in my comment did the trick. Many thanks!
scoreFun <- function(x){
tbl <- data.frame(table(x))
colnames(tbl) <- c("Score", "Count")
tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
return(tbl)
}
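For anyone following along, here is a quick run of that function on made-up scores (the scores vector is invented for illustration). One thing to watch: Score comes back as a factor, so I convert it to numeric before using it on a numeric axis.

```r
scoreFun <- function(x) {
  tbl <- data.frame(table(x))
  colnames(tbl) <- c("Score", "Count")
  tbl$Percentage <- tbl$Count / sum(tbl$Count) * 100
  tbl
}

# Invented sentence scores, just to exercise the function
scores <- c(0, 1, 1, 2, 0, 1, -1, 0)
pct <- scoreFun(scores)

# table() turns the scores into factor levels; convert back for a numeric x-axis
pct$Score <- as.numeric(as.character(pct$Score))
pct
```

As far as I know, plotly 4 and later expects columns to be given as formulas, e.g. plot_ly(pct, x = ~Score, y = ~Percentage).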
Is there a clean/automatic way to convert CSV values formatted as percents (with a trailing % symbol) in R?
Here is some example data:
actual,simulated,percent error
2.1496,8.6066,-300%
0.9170,8.0266,-775%
7.9406,0.2152,97%
4.9637,3.5237,29%
Which can be read using:
junk = read.csv("Example.csv")
But all of the % columns are read as strings and converted to factors:
> str(junk)
'data.frame': 4 obs. of 3 variables:
$ actual : num 2.15 0.917 7.941 4.964
$ simulated : num 8.607 8.027 0.215 3.524
$ percent.error: Factor w/ 4 levels "-300%","-775%",..: 1 2 4 3
but I would like them to be numeric values.
Is there an additional parameter for read.csv? Is there a way to easily post process the needed columns to convert to numeric values? Other solutions?
Note: of course in this example I could simply recompute the values, but in my real application with a larger data file this is not practical.
There is no "percentage" type in R. So you need to do some post-processing:
DF <- read.table(text="actual,simulated,percent error
2.1496,8.6066,-300%
0.9170,8.0266,-775%
7.9406,0.2152,97%
4.9637,3.5237,29%", sep=",", header=TRUE)
DF[,3] <- as.numeric(gsub("%", "",DF[,3]))/100
# actual simulated percent.error
#1 2.1496 8.6066 -3.00
#2 0.9170 8.0266 -7.75
#3 7.9406 0.2152 0.97
#4 4.9637 3.5237 0.29
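For a larger file where you don't want to name each percent column by hand, the same post-processing can be applied to every column that looks like a percentage. This is a sketch under the assumption that a percent column is one where every value ends in "%"; is_pct is a name I made up.

```r
# Example data from the question, read from text instead of a file
junk <- read.csv(text = "actual,simulated,percent error
2.1496,8.6066,-300%
0.9170,8.0266,-775%
7.9406,0.2152,97%
4.9637,3.5237,29%", stringsAsFactors = FALSE)

# Assumption: a percent column is a character column whose values all end in "%"
is_pct <- vapply(junk, function(x) is.character(x) && all(grepl("%$", x)),
                 logical(1))

# Strip the trailing "%" and rescale to a fraction
junk[is_pct] <- lapply(junk[is_pct],
                       function(x) as.numeric(sub("%$", "", x)) / 100)
str(junk)
```

If a percent column can contain missing values, the all(grepl(...)) test would need to be relaxed accordingly.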
This is the same as Roland's solution, except that it uses the stringr package. I'd recommend stringr when working with strings, as its interface is more intuitive.
library(stringr)
d <- str_replace(junk$percent.error, pattern="%", "")
junk$percent.error <- as.numeric(d)/100
With data.table you can achieve it as
a <- fread("file.csv")[,`percent error` := as.numeric(sub('%', '', `percent error`))/100]
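Tidyverse has multiple ways of solving such issues. With readr you can use the col_number() column specification (the read-time counterpart of parse_number()), which strips any symbols, text, etc. surrounding a number. Note that the result is in percent units (-300, not -3), so divide by 100 if you want a fraction:
library(readr)
sample_data = "actual,simulated,percent error\n 2.1496,8.6066,-300%\n 0.9170,8.0266,-775%\n7.9406,0.2152,97%\n4.9637,3.5237,29%"
DF <- read_csv(sample_data,col_types = cols(`percent error`= col_number()))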
# A tibble: 4 x 3
#   actual simulated `percent error`
#   <chr>      <dbl>           <dbl>
# 1 2.1496     8.61           -300
# 2 0.9170     8.03           -775
# 3 7.9406     0.215            97.0
# 4 4.9637     3.52             29.0