I'm running the following code below to retrieve a data set, which unfortunately uses "." instead of NA to represent missing data. After much wrangling and searching SO and other fora, I still cannot make the code replace all instances of "." with NA so I can convert the columns to numeric and go on with my life. I'm pretty sure the problem is between the screen and the chair, so I don't see a need to post sessionInfo, but please let me know otherwise. Help in solving this would be greatly appreciated. The first four columns are integers setting out the date and the unique ID, so I would only need to correct the other columns. Thanks in advance you all!
library(data.table)
google_mobility_data <- data.table(read.csv("https://github.com/OpportunityInsights/EconomicTracker/raw/main/data/Google Mobility - State - Daily.csv",stringsAsFactors = FALSE))
# The following line is the one where I can't make it work properly.
google_mobility_data[, .SD := as.numeric(sub("^\\.$", NA, .SD)), .SDcols = -c(1:4)]
I downloaded your data and changed the last entry on the first row to "." to test NA in the final column.
Use readLines to read a character vector.
Use gsub to change . to NA.
Use fread to read as a data.table.
library(data.table)
gmd <- readLines("Google Mobility - State - Daily.csv")
gmd[c(2,3120)]
# [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,."
# [2] "2020,4,25,10,-.384,-.191,.,-.479,-.441,.179,-.213"
gmd <- gsub(",\\.,",",NA,",gmd)
gmd <- gsub(",\\.$",",NA",gmd)
gmd[c(2,3120)]
# [1] "2020,2,24,1,.00286,-.00714,.0557,.06,.0129,.00857,NA"
# [2] "2020,4,25,10,-.384,-.191,NA,-.479,-.441,.179,-.213"
google_mobility_data <- fread(text=gmd)
google_mobility_data[c(1,3119)]
# year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020 2 24 1 0.00286 -0.00714 0.0557 0.060 0.0129 0.00857 NA
#2: 2020 4 25 10 -0.38400 -0.19100 NA -0.479 -0.4410 0.17900 -0.213
summary(google_mobility_data)
EDIT: You mentioned using na.strings with fread didn't work for you, so I suggested the above approach.
However, at least with the data file downloaded as I did, this worked in one line - as suggested by #MichaelChirico:
google_mobility_data <- fread("Google Mobility - State - Daily.csv",na.strings=".")
google_mobility_data[c(1,3119)]
year month day statefips gps_retail_and_recreation gps_grocery_and_pharmacy gps_parks gps_transit_stations gps_workplaces gps_residential gps_away_from_home
#1: 2020 2 24 1 0.00286 -0.00714 0.0557 0.060 0.0129 0.00857 NA
#2: 2020 4 25 10 -0.38400 -0.19100 NA -0.479 -0.4410 0.17900 -0.213
Related
I have got a problem in R. As I am still learning to code successfully, I hope you can help me.
I have a data frame where the header is defined as the equity ISIN and below I have the daily equity prices in a numeric format.
ISIN1
ISIN 2
ISIN3
...
2.35
10.10
0.90
...
2.45
9.98
0.85
...
2.40
10.15
0.70
...
...
...
...
...
Now I don't need the equity prices in the columns below the ISIN but I need the percentage change. In the following format:
ISIN1
ISIN 2
ISIN3
...
NA
NA
NA
...
0.04255
-0.01188
-0.05555
...
-0.02041
0.01703
-0.12647
...
...
...
...
...
I managed to achieve this with the first column (ISIN1), however, I could not manifold the code to the other columns. I assume there must be an easy way without copying & pasting the code for every ISIN manually. I tried the different options of apply, sapply, loop, for, ... but I did not manage to find the right one or I did not enter it correctly.
Following you can find my code for the first ISIN. The df which is being created is named "return" and the df where I have the available stock prices is called "stock_prices"
for (i in 2:nrow(return)) {
return$`ISIN1`[(i)] = ((stock_prices$`ISIN1`[(i)])/stock_prices$`ISIN1`[(i-1)])-1
}
I hope you can help me and that I have posed the question in an understandable way.
Thank you!!
The canonical way to do this in R is lapply, something like:
running_change <- function(x) c(NA, diff(x) / x[-length(x)])
changes <- lapply(dat[,1:3], running_change)
changes
# $ISIN1
# [1] NA 0.04255319 -0.02040816
# $ISIN.2
# [1] NA -0.01188119 0.01703407
# $ISIN3
# [1] NA -0.05555556 -0.17647059
That returns a list; if you want to add that to data, then
cbind(dat, changes)
# ISIN1 ISIN.2 ISIN3 ... ISIN1 ISIN.2 ISIN3
# 1 2.35 10.10 0.90 ... NA NA NA
# 2 2.45 9.98 0.85 ... 0.04255319 -0.01188119 -0.05555556
# 3 2.40 10.15 0.70 ... -0.02040816 0.01703407 -0.17647059
If you want to replace the data, though, you can do
dat[1:3] <- changes
dat
# ISIN1 ISIN.2 ISIN3 ...
# 1 NA NA NA ...
# 2 0.04255319 -0.01188119 -0.05555556 ...
# 3 -0.02040816 0.01703407 -0.17647059 ...
or, in the original step, do
dat[,1:3] <- lapply(dat[,1:3], running_change)
My use of 1:3 here is to subset the data to just the columns needed to operate. It might not be necessary in your data if all columns are numeric and you need the calc on all of them; in this case, I kept the ... fake-data you had, so needed to not try to calculate fractional changes on strings. You can use whatever method you want to subset your columns, including integer indexing as here, or by-name.
By the way, if you want to do this on all columns, then
dat[] <- lapply(dat, running_change)
The dat[] <- ensures that the columns are replaced without changing from a data.frame to a list. It's a trick; other ways work too, but this is the most direct (and code-golf/shortest) I've seen.
Data
dat <- structure(list(ISIN1 = c(NA, 0.0425531914893617, -0.0204081632653062), ISIN.2 = c(NA, -0.0118811881188118, 0.0170340681362725), ISIN3 = c(NA, -0.0555555555555556, -0.176470588235294), ... = c("...", "...", "...")), row.names = c(NA, -3L), class = "data.frame")
My data
Hello, I have a problem with merging two dataframes with each other.
The goal is to merge them so that each date has the corresponding values. If there is no corresponding value, I want to replace NA with 0.
names(FiresNearLA.ab.03)[1] <- "Date.Local"
U.NO2.ab.03 <- unique(NO2.ab.03) # No2.ab.03 has all values multiplied
ind <- merge(FiresNearLA.ab.03,U.NO2.ab.03, all = TRUE, all.x=TRUE)
ind[is.na(ind)] <- 0
So far so good. And the first lines look like they are supposed to look. But beginning from 2004-04-24, all dates are doubled and it writes weird values in the second NO2.Mean colum.
U.NO2.Mean table:
Date.Local NO2.Mean
361 2004-03-31 30.217391
365 2004-04-24 50.000000
366 2004-04-25 47.304348
370 2004-04-26 50.913043
374 2004-04-27 41.157895
ind table:
Date.Local FIRE_SIZE F.number.n_fires NO2.Mean
113 2004-04-22 34.30 10 13.681818
114 2004-04-23 45.00 13 17.222222
115 2004-04-24 55.40 22 28.818182
116 2004-04-24 55.40 22 50.000000
117 2004-04-25 2306.85 15 47.304348
118 2004-04-25 2306.85 15 21.090909
Why, are there Values in NO2.Mean for 2004-04-23 and 2004-04-22 days if they should be 0? and why does it double the values after the 24th and where do the second ones come from?
Thank you
So I managed to merge your data:
FiresNearLA.ab.03 <- dget("FiresNearLA.ab.03.txt", keep.source = FALSE)
U.NO2.ab.03 <- dget("NO2.ab.03.txt", keep.source = FALSE)
ind <- merge(FiresNearLA.ab.03,
U.NO2.ab.03,
all = TRUE,
by.x = "DISCOVERY_DATEymd",
by.y = "Date.Local")
As a side note: Usually, you share a small sample of your data on stackoverflow, not the whole thing. In your case, dput(FiresNearLA.ab.03[1:50, ]) and then copy and paste from the console to the question would have been sufficient.
Back to your problem: The duplication already happens in NO2.ab.03 and a number of dates and values occurs twice or more often. The easiest way to solve this (in my experience) is to use the package data.table which has a duplicated which is more straightforward and also faster:
library(data.table)
# Test duplicated occurrences in U.NO2.ab.03
> table(duplicated(U.NO2.ab.03, by = c("DISCOVERY_DATEymd", "NO2.Mean")))
FALSE TRUE
7767 27308
>
> nrow(ind)
[1] 35229
# Remove duplicated rows from data frame
> ind <- ind[!duplicated(ind, by = c("DISCOVERY_DATEymd", "NO2.Mean")), ]
> nrow(ind)
[1] 7921
After these steps, you should be fine :)
I got the answer. The Original data source of NO3.ab.03 was faulty.
As JonGrub suggestes said, the problem was within NO3.ab.O3. For some days it had two different NO3.Means corresponding to the same date. I deleted this rows and now its working good. Thank you again for the help and the great advices
I have a data containing quotations of indexes (S&P500, CAC40,...) for every 5 minutes of the last 3 years, which make it quite huge. I am trying to create new columns containing the performance of the index for each time (ie (quotation at [TIME]/quotation at yesterday close) -1) and for each index. I began that way (my data is named temp):
listIndexes<-list("CAC","SP","MIB") # there are a lot more
listTime<-list(900,905,910,...1735) # every 5 minutes
for (j in 1:length(listTime)){
Time<-listTime[j]
for (i in 1:length(listIndexes)) {
Index<-listIndexes[i]
temp[[paste0(Index,"perf",Time)]]<-temp[[paste0(Index,Time)]]/temp[[paste0(Index,"close")]]-1
# other stuff to do but with the same concept
}
}
but it is quite long. Is there a way to get rid of the for loop(s) or to make the creation of those variables quicker ? I read some stuff about the apply functions and the derivatives of it but I do not see if and how it should be used here.
My data looks like this :
date CACcloseyesterday CAC1000 CAC1005 ... CACclose ... SP1000 ... SPclose
20140105 3999 4000 40001.2 4005 .... 2000 .... 2003
20140106 4005 4004 40003.5 4002 .... 2005 .... 2002
...
and my desired output would be a new column (more eaxcatly a new column for each time and each index) which would be added to temp
date CACperf1000 CACperf1005... SPperf1000...
20140106 (4004/4005)-1 (4003.5/4005)-1 .... (2005/2003)-1 # the close used is the one of the day before
idem for the following day
i wrote (4004/4005)-1 just to show the calcualtio nbut the result should be a number : -0.0002496879
It looks like you want to generate every combination of Index and Time. Each Index-Time combination is a column in temp and you want to calculate a new perf column by comparing each Index-Time column against a specific Index close column. And your problem is that you think there should be an easier (less error-prone) way to do this.
We can remove one of the for-loops by generating all the necessary column names beforehand using something like expand.grid.
listIndexes <-list("CAC","SP","MIB")
listTime <- list(900, 905, 910, 915, 920)
df <- expand.grid(Index = listIndexes, Time = listTime,
stringsAsFactors = FALSE)
df$c1 <- paste0(df$Index, "perf", df$Time)
df$c2 <- paste0(df$Index, df$Time)
df$c3 <- paste0(df$Index, "close")
head(df)
#> Index Time c1 c2 c3
#> 1 CAC 900 CACperf900 CAC900 CACclose
#> 2 SP 900 SPperf900 SP900 SPclose
#> 3 MIB 900 MIBperf900 MIB900 MIBclose
#> 4 CAC 905 CACperf905 CAC905 CACclose
#> 5 SP 905 SPperf905 SP905 SPclose
#> 6 MIB 905 MIBperf905 MIB905 MIBclose
Then only one loop is required, and it's for iterating over each batch of column names and doing the calculation.
for (row_i in seq_len(nrow(df))) {
this_row <- df[row_i, ]
temp[[this_row$c1]] <- temp[[this_row$c2]] / temp[[this_row$c3]] - 1
}
An alternative solution would also be to reshape your data into a form that makes this transformation much simpler. For instance, converting into a long, tidy format with columns for Date, Index, Time, Value, ClosingValue column and directly operating on just the two relevant columns there.
DF2
Date EMMI ACT NO2
2011/02/12 12345 21 11
2011/02/14 43211 22 12
2011/02/19 12345 21 13
2011/02/23 43211 13 12
2011/02/23 56341 13 12
2011/03/03 56431 18 20
I need to find difference between two dates in a column. For example difference between ACT column values.For example, the EMMI 12345, Difference between dates 2011/02/19 - 2011/02/12 = 21-21 = 0. like that i want to do for entire column of ACT. Add a new column diff and add values to that. Can anybody let me know please how to do it.
This is the output i want
DF3
Date EMMI ACT NO2 DifACT
2011/02/12 12345 21 11 NA
2011/02/14 43211 22 12 NA
2011/02/19 12345 21 13 0
2011/02/23 43211 13 12 -9
2011/02/23 56341 13 12 5
Try this:
DF3 <- DF2
DF3$difACT <- ave( DF3$ACT, DF3$EMMI, FUN= function(x) c(NA, diff(x)) )
As long as the dates are sorted (within EMMI) this will work, if they are not sorted then we would need to modify the above to sort within EMMI first. I would probably sort the entire data frame on date first (and save the results of order), then run the above. Then if you need it back in the original order you can run order on the results of the original order results to "unorder" the data frame.
This is based on plyr package (not tested):
library(plyr)
DF3<-ddply(DF2,.(EMMI),mutate,difACT=diff(ACT))
I would like to select in my dataframe (catch) only the rows for which my "tspp.name" variable is the same as my "elasmo.name" variable.
For example, row #74807 and #74809 in this case would be selected, but not row #74823 because the elasmo.name is "skate" and the tspp.name is "Northern shrimp".
I am sure there is an easy answer for this, but I have not found it yet. Any hints would be appreciated.
> catch[4:6,]
gear tripID obsID sortie setID date time NAFO lat long dur depth bodymesh
74807 GRL2 G00001 A 1 13 2000-01-04 13:40:00 2H 562550 594350 2.000000 377 80
74809 GRL2 G00001 A 1 14 2000-01-04 23:30:00 2H 562550 594350 2.166667 370 80
74823 GRL2 G00001 A 1 16 2000-01-05 07:45:00 2H 561450 593050 3.000000 408 80
codendmesh mail.fil long.fil nbr.fil hook.shape hook.size hooks VTS tspp tspp.name elasmo
74807 45 NA NA NA NA NA 3.3 2211 Northern shrimp 2211
74809 45 NA NA NA NA NA 3.2 2211 Northern shrimp 2211
74823 45 NA NA NA NA NA 3.3 2211 Northern shrimp 211
elasmo.name kept discard Tcatch date.1 latitude longitude EID
74807 Northern shrimp 2747 50 2797 2000-01-04 56.91667 -60.21667 G00001-13
74809 Northern shrimp 4919 100 5019 2000-01-04 56.91667 -60.21667 G00001-14
74823 Skates 0 50 50 2000-01-05 56.73333 -60.00000 G00001-16
fgear
74807 Shrimp trawl (stern) with a grid
74809 Shrimp trawl (stern) with a grid
74823 Shrimp trawl (stern) with a grid
I know what the problem is - you need to read in the data "as is", by adding the argument as.is=TRUE to the read.csv command (which you presumably used to load everything in). Without this, the strings get stored as factors, and all methods suggested above will fail (as you've discovered!)
Once you've read in the data correctly, you can use either
catch[which(catch$tspp.name == catch$elasmo.name),]
or
subset(catch, tspp.name == elasmo.name)
to obtain the matching rows - do not omit the which in the first one otherwise the code will fail when doing comparisons with NAs.
Below is a 30-second example using a small fabricated data set that illustrates all these points explicitly.
First, create a text file on disk that looks like this (I saved it as "F:/test.dat" but it can be saved anywhere)...
col1~col2
a~b
a~a
b~b
c~NA
NA~d
NA~NA
Let's load it in without converting factors to strings, just to see the methods proposed above fall over:
> dat=read.csv("F:/test.dat",sep="~") # don't forget to check the filename
> dat[which(dat$col1==dat$col2),]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different
> dat[dat$col1==dat$col2,]
Error in Ops.factor(dat$col1, dat$col2) : level sets of factors are different
> subset(dat,col1==col2)
Error in Ops.factor(col1, col2) : level sets of factors are different
This is exactly the problem you were having. If you type dat$col1 and dat$col2 you'll see that the first has factor levels a b c while the second has factor levels a b d - hence the error messages.
Now let's do the same, but this time reading in the data "as is":
> dat=read.csv("F:/test.dat",sep="~",as.is=TRUE) # note the as.is=TRUE
> dat[which(dat$col1==dat$col2),]
col1 col2
2 a a
3 b b
> dat[dat$col1==dat$col2,]
col1 col2
2 a a
3 b b
NA <NA> <NA>
NA.1 <NA> <NA>
NA.2 <NA> <NA>
> subset(dat,col1==col2)
col1 col2
2 a a
3 b b
As you can see, the first method (based on which) and the third method (based on subset) both give the right answer, while the second method gets confused by comparisons with NA. I would personally advocate the subset method as in my opinion it's the neatest.
A final note: There are other ways that you can get strings arising as factors in a data frame - and to avoid all of those headaches, always remember to include the argument stringsAsFactors = FALSE at the end whenever you create a data frame using data.frame. For instance, the correct way to create the object dat directly in R would be:
dat=data.frame(col1=c("a","a","b","c",NA,NA), col2=c("b","a","b",NA,"d",NA),
stringsAsFactors=FALSE)
Type dat$col1 and dat$col2 and you'll see they've been interpreted correctly. If you try it again but with the stringsAsFactors argument omitted (or set to TRUE), you'll see those darned factors appear (just like the dodgy first method of loading from disk).
In short, always remember as.is=TRUE and stringsAsFactors=FALSE, and learn how to use the subset command, and you won't go far wrong!
Hope this helps :)