I have a data frame (9000 x 304), but it looks like this:
date         a         b
1997-01-01   8.720551  10.61597
1997-01-02   NA        NA
1997-01-03   8.774251  NA
1997-01-04   8.808079  11.09641
I want to calculate differences such as:
first <- data[i-1,] - data[i-2,]
second <- data[i,] - data[i-1,]
third <- data[i,] - data[i-2,]
I want to ignore the NA values: whenever a value is NA, I want to use the last non-NA value in that column instead.
For example, for the second diff with i = 4 in column b:
11.09641 - 10.61597 is the value of b_diff on 1997-01-04
This is what I did, but it keeps generating data with NA:
first <- NULL
for (i in 3:nrow(data)) {
  first <- rbind(first, data[i - 1, ] - data[i - 2, ])
}
second <- NULL
for (i in 3:nrow(data)) {
  second <- rbind(second, data[i, ] - data[i - 1, ])
}
third <- NULL
for (i in 3:nrow(data)) {
  third <- rbind(third, data[i, ] - data[i - 2, ])
}
There may be a way to solve this with the aggregate function, but I need a solution that works on big data, and I can't specify each column name separately. Moreover, my column names are in a foreign language.
Thank you very much! I hope I gave you all the information you need to help me; otherwise, please let me know.
You can use fill to replace NAs with the closest previous value, and then use across and lag to compute the new variables. It is unclear what exactly your expected output is, but you can also replace the default value of lag when it does not exist (e.g. for the first value) using lag(.x, default = ...).
library(dplyr)
library(tidyr)
data %>%
  fill(a, b) %>%
  mutate(across(a:b, ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(a:b, ~ .x - lag(.x), .names = "second_{.col}"),
         across(a:b, ~ .x - lag(.x, n = 2), .names = "third_{.col}"))
date a b first_a first_b second_a second_b third_a third_b
1 1997-01-01 8.720551 10.61597 NA NA NA NA NA NA
2 1997-01-02 8.720551 10.61597 NA NA 0.000000 0.00000 NA NA
3 1997-01-03 8.774251 10.61597 0.0000 0 0.053700 0.00000 0.053700 0.00000
4 1997-01-04 8.808079 11.09641 0.0537 0 0.033828 0.48044 0.087528 0.48044
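Since the real data has 304 columns, listing them all in fill() and across() isn't practical. A minimal sketch of how the same idea could generalize, assuming every column other than date is numeric (everything() and where(is.numeric) are tidyselect helpers, so no column names need to be spelled out):

library(dplyr)
library(tidyr)

data %>%
  fill(everything()) %>%   # carry the last non-NA value forward in every column
  mutate(across(where(is.numeric), ~ lag(.x) - lag(.x, 2), .names = "first_{.col}"),
         across(where(is.numeric), ~ .x - lag(.x), .names = "second_{.col}"),
         across(where(is.numeric), ~ .x - lag(.x, 2), .names = "third_{.col}"))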
I am downloading all the tweets (using the rtweet package, version 0.7.0) that contain the user #sernac (a Chilean government entity) in the text of the tweet, then extracting all the usernames (screen names) from the body of each tweet using the following code.
Tweets <- search_tweets("#sernac", n = 50000, include_rts = F)
Names <- str_extract_all(Tweets$text, "(?<=^|\\s)#[^\\s]+")
This gives me a list object with every screen name from each tweet's text.
The first question is: how do I get a data frame with the following structure?
X1          X2            X3             X4          X5          ...    Xn
#sernac     #vtrchile     NA             NA          NA          NA     NA
#username   #playstation  #taylorswitft  #elonmusk   #instagram  NA     NA
#username2  #username5    #selenagomez   #username2  #username3  #FIFA  #xbox
#username4  #ebay         NA             NA          NA          NA     NA
where the number of columns equals the maximum number of elements in any object of the list.
I tried the following, but it only returns 4 columns, even though the maximum number of elements in an object is 9.
df <- data.frame(matrix(unlist(Names), nrow=length(Names), byrow = T))
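(unlist() discards the list structure, so matrix() just reflows all the values into whatever shape fits.) One way to pad each list element with NA up to the longest one before binding, as a sketch assuming Names is the list above, would be:

n_max <- max(lengths(Names))
df <- as.data.frame(do.call(rbind,
                            lapply(Names, function(x) c(x, rep(NA, n_max - length(x))))))
names(df) <- paste0("X", seq_len(n_max))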
After this, I need to perform a left join between this table and a cluster table created by me. The join should match the first column of the newly created data frame against the cluster table; if there is no match, it should try again with the second column, and so on, until all columns are exhausted.
This is an example of the database created by me and the final desired result:
CLUSTER DATA FRAME

screen_name   cluster
#sernac       Gov
#playstation  Videogames
#walmart      Supermarket
#SelenaGomez  Celebrity
#elonmusk     Celebrity
#xbox         Videogames
#ebay         Ecommerce
FINAL RESULT

X1          X2            X3             X4          X5          ...    Xn     cluster
#sernac     #vtrchile     NA             NA          NA          NA     NA     Gov
#username   #playstation  #taylorswitft  #elonmusk   #instagram  NA     NA     Videogames
#username2  #username5    #selenagomez   #username2  #username3  #FIFA  #xbox  Celebrity
#username4  #ebay         NA             NA          NA          NA     NA     Ecommerce
I have tried to explain myself as best I can; English is not my main language, so I can add more detail in the comments.
I would approach this differently.
First, if you are trying to download as many tweets as possible, set n = Inf and retryonratelimit = TRUE:
Tweets <- search_tweets("#sernac",
                        n = Inf,
                        include_rts = FALSE,
                        retryonratelimit = TRUE)
Second, there is no need to extract screen names from the tweet text, as this information can be found in the entities column.
One way to extract mentions is to use lapply. You can then create a data frame with just the useful columns, and convert screen names to lower case for matching.
library(dplyr)
mentions <- lapply(Tweets$entities, function(x) x$user_mentions) %>%
  bind_rows(.id = "tweet_number") %>%
  select(tweet_number, screen_name) %>%
  mutate(screen_name_lc = tolower(screen_name))
head(mentions)
tweet_number screen_name screen_name_lc
1 1 mundo_pacifico mundo_pacifico
2 1 OIMChile oimchile
3 1 subtel_chile subtel_chile
4 1 ReclamosSubtel reclamossubtel
5 1 SERNAC sernac
6 2 mundo_pacifico mundo_pacifico
Next, add a column with the lower-case screen names to your cluster data:
library(stringr)

cluster_df <- cluster_df %>%
  mutate(screen_name_lc = str_replace(screen_name, "#", "") %>%
           tolower())
Now we can join the data frames, just on the screen_name_lc column:
mentions_clusters <- mentions %>%
  left_join(cluster_df, by = "screen_name_lc") %>%
  select(tweet_number, screen_name = screen_name.x, cluster)
head(mentions_clusters)
tweet_number screen_name cluster
1 1 mundo_pacifico <NA>
2 1 OIMChile <NA>
3 1 subtel_chile <NA>
4 1 ReclamosSubtel <NA>
5 1 SERNAC Gov
6 2 mundo_pacifico <NA>
This "long" format is much easier to work with for subsequent analysis than the "wide" format, and can still be grouped by tweet using the tweet_number column.
Data for cluster_df:
cluster_df <- structure(list(screen_name = c("#sernac", "#playstation", "#walmart",
"#SelenaGomez", "#elonmusk", "#xbox", "#ebay"), cluster = c("Gov",
"Videogames", "Supermarket", "Celebrity", "Celebrity", "Videogames",
"Ecommerce"), screen_name_lc = c("sernac", "playstation", "walmart",
"selenagomez", "elonmusk", "xbox", "ebay")), class = "data.frame", row.names = c(NA,
-7L))
I am trying to subset a dataframe by two variables ('site' and 'year') and apply a function (dismo::biovars) to each subset. Biovars requires monthly inputs (12 values) and outputs 19 variables per year. I'd like to store the outputs for each subset and combine them.
Example data:
data1 <- data.frame(Meteostation = c(rep("OBERHOF", 12), rep("SOELL", 12)),
                    Year = c(rep(1:12), rep(1:12)),
                    tasmin = runif(24, min = -20, max = 5),
                    tasmax = runif(24, min = -1, max = 30),
                    pr = runif(24, min = 0, max = 300))
The full dataset contains 900 stations and 200 years.
I'm currently attempting a nested loop, which I realise isn't the most efficient and which I'm struggling to make work; code below:
sitesList <- as.character(unique(data1$Meteostation))
#yearsList <- unique(data1$Year)
bvList <- list()
for (i in c(1:length(unique(sitesList)))) {
  site <- filter(data1, Meteostation == sitesList[i])
  yearsList[i] <- unique(site$Year)
  for (j in c(1:length(yearsList))) {
    timestep <- filter(site, Year == yearsList[j])
    tmin <- timestep$tasmin
    tmax <- timestep$tasmax
    pr <- timestep$pr
    bv <- biovars(pr, tmin, tmax)
    bvList[[j]] <- bv
  }
}
bv_all <- do.call(rbind, bvList)
I'm aware there are much better ways to go about this and have been looking at variations of apply and dplyr solutions, but I'm struggling to get my head around them. Any advice much appreciated.
You could use the dplyr package, as follows perhaps?
library(dplyr)
data1 %>%
  group_by(Meteostation, Year) %>%
  do(data.frame(biovars(.$pr, .$tasmin, .$tasmax)))
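Note that do() is superseded in current dplyr versions; an equivalent sketch with group_modify (assuming the same data1, and that biovars returns a one-row matrix per group) would be:

library(dplyr)
library(dismo)

data1 %>%
  group_by(Meteostation, Year) %>%
  group_modify(~ as.data.frame(biovars(.x$pr, .x$tasmin, .x$tasmax))) %>%
  ungroup()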
Use by and rbind the result.
library("dismo")
res <- do.call(rbind, by(data1, data1[c("Year", "Meteostation")], function(x) {
  cbind(x[c("Year", "Meteostation")], biovars(x$pr, x$tasmin, x$tasmax))
}))
This produces:
head(res[, 1:10])
# Meteostation Year bio1 bio2 bio3 bio4 bio5 bio6 bio7 bio8
# 1 OBERHOF 1 12.932403 18.59525 100 NA 22.2300284 3.634777 18.59525 NA
# 2 OBERHOF 2 5.620587 7.66064 100 NA 9.4509069 1.790267 7.66064 NA
# 3 OBERHOF 3 0.245540 12.88662 100 NA 6.6888506 -6.197771 12.88662 NA
# 4 OBERHOF 4 5.680438 45.33159 100 NA 28.3462326 -16.985357 45.33159 NA
# 5 OBERHOF 5 -6.971906 16.83037 100 NA 1.4432801 -15.387092 16.83037 NA
# 6 OBERHOF 6 -7.915709 14.63323 100 NA -0.5990945 -15.232324 14.63323 NA
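If the by() indexing feels opaque, the same grouping can also be written with split() and lapply(), as a sketch on the same data1:

groups <- split(data1, data1[c("Year", "Meteostation")], drop = TRUE)
res2 <- do.call(rbind, lapply(groups, function(x) {
  cbind(x[c("Year", "Meteostation")], biovars(x$pr, x$tasmin, x$tasmax))
}))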
I've used full_join to combine two tables:
fsts = full_join(fstvarcal, fst, by = "SNP")
This had the effect of grouping first the rows that had values in both datasets, followed by rows with values only in the 1st dataset (and NAs for the 2nd), followed by rows with values only in the 2nd dataset (and NAs for the 1st).
I'm now trying to order by natural order.
Looking for the equivalent of sort -V -k1 in bash.
I've tried:
library(naturalsort)
fstordered <- fsts[naturalorder(fsts$SNP), ]
which works, but it's very slow.
Any faster ways of doing this? Or of merging the two datasets without losing the natural order?
I have:
SNP fst
scaffold_0 0.186473
scaffold_9 0.186475
scaffold_10 0.186472
scaffold_11 0.186470
scaffold_99 0.186420
scaffold_100 0.186440
and
SNP fstvarcal
scaffold_0 0.186472
scaffold_8 0.186475
scaffold_20 0.186477
scaffold_21 0.186440
scaffold_999 0.186450
scaffold_1000 0.186420
and want to combine them into
SNP fstvarcal fst
scaffold_0 0.186472 0.186473
scaffold_8 0.186475 NA
scaffold_9 NA 0.186475
scaffold_10 NA 0.186472
scaffold_11 NA 0.186470
scaffold_20 0.186477 NA
scaffold_21 0.186440 NA
scaffold_99 NA 0.186420
scaffold_100 NA 0.186440
scaffold_999 0.186450 NA
scaffold_1000 0.186420 NA
Perhaps you can do the following:
I generate some representative sample data first.
set.seed(2018)
df <- data.frame(
  SNP = sprintf("scaffold_%i", 1:1000),
  val = rnorm(1000))
df <- df[sample(nrow(df)), ]  # scramble the row order so it needs re-sorting
We now use tidyr::separate to separate SNP into "id" and "no", and arrange rows by "id" and "no" to ensure natural ordering (convert = T automatically converts "no" to an integer column vector).
library(tidyverse)
df %>%
  separate(SNP, into = c("id", "no"), remove = F, convert = T) %>%
  arrange(id, no) %>%
  select(-id, -no)
# SNP val
#1 scaffold_1 -0.4229839834
#2 scaffold_2 -1.5498781617
#3 scaffold_3 -0.0644293189
#4 scaffold_4 0.2708813526
#5 scaffold_5 1.7352836655
#6 scaffold_6 -0.2647112113
#7 scaffold_7 2.0994707023
#8 scaffold_8 0.8633512196
#9 scaffold_9 -0.6105871453
#10 scaffold_10 0.6370556066
#11 scaffold_11 -0.6430346953
#...
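As an aside, natural ordering is also available directly: stringr::str_order(..., numeric = TRUE) (backed by stringi) or gtools::mixedorder can index the rows in one call, which may well be faster than naturalorder on large data. A sketch, assuming fsts$SNP as above:

library(stringr)

# order rows by the SNP column, treating embedded numbers numerically
fstordered <- fsts[str_order(fsts$SNP, numeric = TRUE), ]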
I have a return data frame (xts, zoo object) of size 1379 x 843. It should be read as date x security.
This is an example of the input:
BIIB.US.Equity JNJ.US.Equity BLUE.US.Equity BMRN.US.Equity AGN.US.Equity
2018-06-15 -0.5126407 0.001633853 -0.070558376 0.0846854857 -0.004426559
2018-06-18 -0.052158804 -0.310521165 -0.035226652 -0.0206967213 -0.008430535
2018-06-19 0.010099613 0.010303330 0.006510048 0.0004184976 0.007745167
2018-06-20 0.016504588 -0.004324060 0.029808774 0.0284459318 0.012366368
2018-06-21 0.001616924 -0.004834480 0.023211360 0.0009151922 -0.015411839
2018-06-22 -0.004136679 0.010374640 -0.065652522 0.0097023265 0.005322048
Now I would like to get back a smaller data frame:
BIIB.US.Equity JNJ.US.Equity
2018-06-15 -0.5126407 0.001633853
2018-06-18 -0.052158804 -0.30521165
2018-06-19 0.010099613 0.010303330
2018-06-20 0.016504588 -0.004324060
2018-06-21 0.001616924 -0.004834480
2018-06-22 -0.004136679 0.010374640
As you can see, the second data frame only contains 2 columns: there is a roughly 51% dip in the first security on 2018-06-15 and a roughly 31% dip in the second on 2018-06-18, both of which exceed the 30% threshold.
What I want is to get a new data frame from my current one which selects the securities that have at least one instance of a 30% (or greater) drop in return.
Currently I have tried:
df1 <- returns < -.3
returns[df1]
but this returns the error:
Error in `[.xts`(returns, df1) : 'i' or 'j' out of range
I have also tried this:
cls <- sapply(returns, function(c) any(c < -.3))
a<- returns[, cls, with = FALSE]
Yet that returns a matrix of the same size, only with a lot of NA values.
Is there something I am missing?
Basically, what I'd expect to get out is a data frame "df" of size 1379 x (something less than 843), containing all columns that have at least one daily drop of -.3 or lower.
EDIT:
To those that have tried to help, thank you, but the output returns as follows (I assigned the call to a):
> a
BIIB.US.Equity JNJ.US.Equity BLUE.US.Equity BMRN.US.Equity AGN.US.Equity PFE.US.Equity NBIX.US.Equity
> summary(a)
Index
Min. :NA
1st Qu.:NA
Median :NA
Mean :NA
3rd Qu.:NA
Max. :NA
> str(a)
An 'xts' object of zero-width
This should work:
df[, sapply(df, function(x) min(x, na.rm = TRUE) <= -0.3)]
OK, it should work this way, using the data.table package:
Let's try with an example data set that contains some NAs.
library(data.table)
set.seed(1)
x <- rnorm(10) * 0.1
y <- x
z <- rnorm(10) + 1
equities <- data.table(x, y, z)
equities[sample(1:10, 3), x := NA]
equities[sample(1:10, 2), y := NA]
equities[sample(1:10, 2), z := NA]
print(equities)
x y z
1: -0.06264538 -0.06264538 NA
2: 0.01836433 0.01836433 1.3898432
3: -0.08356286 -0.08356286 0.3787594
4: 0.15952808 0.15952808 -1.2146999
5: 0.03295078 NA 2.1249309
6: NA NA 0.9550664
7: NA 0.04874291 0.9838097
8: 0.07383247 0.07383247 NA
9: NA 0.05757814 1.8212212
10: -0.03053884 -0.03053884 1.5939013
Choose the right columns, as described in Melissa's post:
myChoice <- sapply(equities, function(x) min(x, na.rm=T) <= -0.3)
And eventually:
newequities <- equities[, myChoice, with = FALSE]
print(newequities)
z
1: NA
2: 1.3898432
3: 0.3787594
4: -1.2146999
5: 2.1249309
6: 0.9550664
7: 0.9838097
8: NA
9: 1.8212212
10: 1.5939013
As you didn't provide an input/output example, I am not sure if I understand you correctly, but try
df[colSums(df <= -0.3, na.rm = T) > 0]
Edit: added na.rm = T after the OP's update.
A slight addendum to this, because I came back to work today and noticed the answer wasn't quite what I wanted. If you want to select the columns based on the row values, make sure there is a comma at the start of the brackets:
returns[, sapply(returns, function(x) min(x, na.rm = TRUE) <= -.3)]
Notice the comma at the beginning: it keeps all rows while the logical vector selects the columns.
Hopefully this helps someone else at some point!
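Since returns is an xts object (a matrix underneath), an apply-based version is another option, as a sketch assuming the same returns object and threshold:

# keep the columns that have at least one return of -30% or worse
returns[, apply(returns, 2, function(x) any(x <= -0.3, na.rm = TRUE))]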
I know how to make the NAs blank with the following code (data shown first):
IMILEFT IMIRIGHT IMIAVG
NA NA NA
NA 71.15127 NA
72.18310 72.86607 72.52458
70.61460 68.00766 69.31113
69.39032 69.91261 69.65146
72.58609 72.75168 72.66888
70.85714 NA NA
NA 69.88203 NA
74.47109 73.07963 73.77536
70.44855 71.28647 70.86751
NA 72.33503 NA
69.82818 70.45144 70.13981
68.66929 69.79866 69.23397
72.46879 71.50685 71.98782
71.11888 71.98336 71.55112
NA 67.86667 NA
IMILEFT <- ((ASLCOMPTEST$LHML + ASLCOMPTEST$LRML) /
              (ASLCOMPTEST$LFML + ASLCOMPTEST$LTML) * 100)
IMILEFT <- sapply(IMILEFT, as.character)
IMILEFT[is.na(IMILEFT)] <- ""
But when I use that code, it won't let me take an average of IMILEFT and IMIRIGHT or make IMIAVG numeric like the other columns.
IMIAVG<-((IMILEFT + IMIRIGHT)/2)
Error in IMILEFT + IMIRIGHT : non-numeric argument to binary operator
I get the same error if I convert back with as.numeric.
Try the following, leaving the NAs as they are:
rowSums(M, na.rm = TRUE) / (2 - (is.na(L) + is.na(R)))
## WHERE
M <- cbind(IMILEFT, IMIRIGHT)
L <- IMILEFT
R <- IMIRIGHT
Note the parentheses around the denominator: the division must apply to the count of non-NA values, 2 - (is.na(L) + is.na(R)). If you have rows where both columns are NA, have the denominator be
pmax(1, 2 - (is.na(L) + is.na(R)))
which floors it at 1, so you never divide by zero.
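Putting it together as a self-contained sketch (the IMILEFT/IMIRIGHT values here are made up; the real ones come from ASLCOMPTEST as above):

IMILEFT  <- c(72.18310, NA, 70.85714, NA)
IMIRIGHT <- c(72.86607, 71.15127, NA, NA)

M <- cbind(IMILEFT, IMIRIGHT)
n_obs <- pmax(1, rowSums(!is.na(M)))        # non-NA values per row, floored at 1
IMIAVG <- rowSums(M, na.rm = TRUE) / n_obs  # NA-aware row mean
IMIAVG[rowSums(!is.na(M)) == 0] <- NA       # all-NA rows stay NA instead of 0
IMIAVG
# ~72.525  71.151  70.857  NA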