I've used full_join to combine two tables:
fsts = full_join(fstvarcal, fst, by = "SNP")
This had the effect of grouping first the rows with values in both datasets, then the rows with values only in the first dataset (and NAs for the second), and finally the rows with values only in the second dataset (and NAs for the first).
I'm now trying to order by natural order.
Looking for the equivalent of sort -V -k1 in bash.
I've tried:
library(naturalsort)
fstordered = fsts[naturalorder(fsts$SNP), ]
which works, but it's very slow.
Any faster ways of doing this? Or of merging the two datasets without losing the natural order?
I have:
SNP fst
scaffold_0 0.186473
scaffold_9 0.186475
scaffold_10 0.186472
scaffold_11 0.186470
scaffold_99 0.186420
scaffold_100 0.186440
and
SNP fstvarcal
scaffold_0 0.186472
scaffold_8 0.186475
scaffold_20 0.186477
scaffold_21 0.186440
scaffold_999 0.186450
scaffold_1000 0.186420
and want to combine them into
SNP fstvarcal fst
scaffold_0 0.186472 0.186473
scaffold_8 0.186475 NA
scaffold_9 NA 0.186475
scaffold_10 NA 0.186472
scaffold_11 NA 0.186470
scaffold_20 0.186477 NA
scaffold_21 0.186440 NA
scaffold_99 NA 0.186420
scaffold_100 NA 0.186440
scaffold_999 0.186450 NA
scaffold_1000 0.186420 NA
Perhaps you can do the following:
I generate some representative sample data first.
set.seed(2018)
df <- data.frame(
  SNP = sprintf("scaffold_%i", 1:1000),
  val = rnorm(1000))
df <- df[sample(nrow(df)), ]  # shuffle the rows so they are no longer in natural order
We now use tidyr::separate to separate SNP into "id" and "no", and arrange rows by "id" and "no" to ensure natural ordering (convert = T automatically converts "no" to an integer column vector).
library(tidyverse)
df %>%
  separate(SNP, into = c("id", "no"), remove = F, convert = T) %>%
  arrange(id, no) %>%
  select(-id, -no)
# SNP val
#1 scaffold_1 -0.4229839834
#2 scaffold_2 -1.5498781617
#3 scaffold_3 -0.0644293189
#4 scaffold_4 0.2708813526
#5 scaffold_5 1.7352836655
#6 scaffold_6 -0.2647112113
#7 scaffold_7 2.0994707023
#8 scaffold_8 0.8633512196
#9 scaffold_9 -0.6105871453
#10 scaffold_10 0.6370556066
#11 scaffold_11 -0.6430346953
#...
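If every SNP name really does follow the scaffold_&lt;number&gt; pattern, a base R variant of the same idea (a sketch under that assumption, not a general natural sort) should also be fast: extract the integer suffix once and order by it, which avoids naturalorder()'s character-by-character comparisons.
# Strip the literal "scaffold_" prefix and order the joined table by the integer suffix.
suffix <- as.integer(sub("^scaffold_", "", fsts$SNP))
fstordered <- fsts[order(suffix), ]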
I am downloading all the tweets (using the rtweet package, version 0.7.0) that contain the user #sernac (a Chilean government entity) in the text of the tweet, then extracting all the usernames (screen names) from the body of each tweet using the following code.
Tweets <- search_tweets("#sernac", n = 50000, include_rts = F)
Names <- str_extract_all(Tweets$text, "(?<=^|\\s)#[^\\s]+")
This gives me a list object with every screen name from each tweet's text.
The first question is: how do I get a data frame with the following structure?
X1          X2            X3             X4          X5          ...    Xn
#sernac     #vtrchile     NA             NA          NA          NA     NA
#username   #playstation  #taylorswitft  #elonmusk   #instagram  NA     NA
#username2  #username5    #selenagomez   #username2  #username3  #FIFA  #xbox
#username4  #ebay         NA             NA          NA          NA     NA
Here the number of columns equals the maximum number of elements in any object of the list.
I tried the following function, but it only returns 4 columns, even though the maximum number of elements in an object is 9.
df <- data.frame(matrix(unlist(Names), nrow=length(Names), byrow = T))
After this, I need to perform a left join between this table and a cluster table I created. The left join should first use the first column of the newly created data frame; if there is no match, it should try again with the second column, and so on until all the columns are exhausted.
This is an example of the database created by me and the final desired result:
CLUSTER DATA FRAME

screen_name   cluster
#sernac       Gov
#playstation  Videogames
#walmart      Supermarket
#SelenaGomez  Celebrity
#elonmusk     Celebrity
#xbox         Videogames
#ebay         Ecommerce
FINAL RESULT

X1          X2            X3             X4          X5          ...    Xn     cluster
#sernac     #vtrchile     NA             NA          NA          NA     NA     Gov
#username   #playstation  #taylorswitft  #elonmusk   #instagram  NA     NA     Videogames
#username2  #username5    #selenagomez   #username2  #username3  #FIFA  #xbox  Celebrity
#username4  #ebay         NA             NA          NA          NA     NA     Ecommerce
I have tried to explain myself as well as I can; English is not my first language, so I can add more detail in the comments.
I would approach this differently.
First, if you are trying to download as many tweets as possible, set n = Inf and retryonratelimit = TRUE:
Tweets <- search_tweets("#sernac",
                        n = Inf,
                        include_rts = FALSE,
                        retryonratelimit = TRUE)
Second, there is no need to extract screen names from the tweet text, as this information can be found in the entities column.
One way to extract mentions is to use lapply. You can then create a data frame with just the useful columns, and convert screen names to lower case for matching.
library(dplyr)
mentions <- lapply(Tweets$entities, function(x) x$user_mentions) %>%
  bind_rows(.id = "tweet_number") %>%
  select(tweet_number, screen_name) %>%
  mutate(screen_name_lc = tolower(screen_name))
head(mentions)
tweet_number screen_name screen_name_lc
1 1 mundo_pacifico mundo_pacifico
2 1 OIMChile oimchile
3 1 subtel_chile subtel_chile
4 1 ReclamosSubtel reclamossubtel
5 1 SERNAC sernac
6 2 mundo_pacifico mundo_pacifico
Next, add a column with the lower-case screen names to your cluster data:
library(stringr)

cluster_df <- cluster_df %>%
  mutate(screen_name_lc = str_replace(screen_name, "#", "") %>%
           tolower())
Now we can join the data frames, just on the screen_name_lc column:
mentions_clusters <- mentions %>%
  left_join(cluster_df, by = "screen_name_lc") %>%
  select(tweet_number, screen_name = screen_name.x, cluster)
head(mentions_clusters)
tweet_number screen_name cluster
1 1 mundo_pacifico <NA>
2 1 OIMChile <NA>
3 1 subtel_chile <NA>
4 1 ReclamosSubtel <NA>
5 1 SERNAC Gov
6 2 mundo_pacifico <NA>
This "long" format is much easier to work with for subsequent analysis than the "wide" format, and can still be grouped by tweet using the tweet_number column.
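If you do need a single cluster per tweet, as in the desired result, the asker's rule of taking the first matching column can be approximated from this long format; a minimal sketch (assuming mentions are listed in their order of appearance within each tweet):
tweet_clusters <- mentions_clusters %>%
  group_by(tweet_number) %>%
  summarise(cluster = first(na.omit(cluster)), .groups = "drop")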
Data for cluster_df:
cluster_df <- structure(list(screen_name = c("#sernac", "#playstation", "#walmart",
"#SelenaGomez", "#elonmusk", "#xbox", "#ebay"), cluster = c("Gov",
"Videogames", "Supermarket", "Celebrity", "Celebrity", "Videogames",
"Ecommerce"), screen_name_lc = c("sernac", "playstation", "walmart",
"selenagomez", "elonmusk", "xbox", "ebay")), class = "data.frame", row.names = c(NA,
-7L))
I have a data frame (9000 x 304), but it looks like this:
date        a         b
1997-01-01  8.720551  10.61597
1997-01-02  NA        NA
1997-01-03  8.774251  NA
1997-01-04  8.808079  11.09641
I want to calculate values such as:
first <- data[i-1,] - data[i-2,]
second <- data[i,] - data[i-1,]
third <- data[i,] - data[i-2,]
I want to ignore the NA values: wherever there is an NA, I want to use the last non-NA value in that column.
For example, for the second difference with i = 4 in column b:
11.09641 - 10.61597 is the value of b_diff on 1997-01-04.
This is what I did, but it keeps generating data with NAs:
first <- NULL
for (i in 3:nrow(data)) {
  first <- rbind(first, data[i-1,] - data[i-2,])
}

second <- NULL
for (i in 3:nrow(data)) {
  second <- rbind(second, data[i,] - data[i-1,])
}

third <- NULL
for (i in 3:nrow(data)) {
  third <- rbind(third, data[i,] - data[i-2,])
}
There may be a way to solve this with the aggregate function, but I need a solution that can be applied to big data without specifying each column name separately; moreover, my column names are in a foreign language.
Thank you very much! I hope I gave you all the information you need to help me; otherwise, please let me know.
You can use fill to replace NAs with the closest previous value, and then use across and lag to compute the new variables. It is unclear what exactly your expected output is, but note that you can also replace the value lag returns when none exists (e.g. for the first row) using lag(.x, default = ...).
library(dplyr)
library(tidyr)
data %>%
  fill(a, b) %>%
  mutate(across(a:b, ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(a:b, ~ .x - lag(.x), .names = "second_{.col}"),
         across(a:b, ~ .x - lag(.x, n = 2), .names = "third_{.col}"))
date a b first_a first_b second_a second_b third_a third_b
1 1997-01-01 8.720551 10.61597 NA NA NA NA NA NA
2 1997-01-02 8.720551 10.61597 NA NA 0.000000 0.00000 NA NA
3 1997-01-03 8.774251 10.61597 0.0000 0 0.053700 0.00000 0.053700 0.00000
4 1997-01-04 8.808079 11.09641 0.0537 0 0.033828 0.48044 0.087528 0.48044
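Since the real data has 304 columns whose names cannot be listed one by one, a hedged extension of the same idea (assuming the first column is the date and all remaining columns are numeric) selects the columns by position instead of by name:
num_cols <- names(data)[-1]  # every column except the date

data %>%
  fill(all_of(num_cols)) %>%
  mutate(across(all_of(num_cols), ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(all_of(num_cols), ~ .x - lag(.x), .names = "second_{.col}"),
         across(all_of(num_cols), ~ .x - lag(.x, n = 2), .names = "third_{.col}"))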
This seems like a simple operation, but I cannot find a good way to do it in R. I have a column, P, that has many rows with multiple inputs:
P:
[340000, 410000]
[450000, 450000]
530000
110000
[330000, 440000]
510000
440000
620000
320000
Desired P1 (the values marked * were randomly selected; apologies for the spacing, it's just so each value is on its own line):
340000*
450000*
530000
110000
440000*
510000
440000
620000
320000
I want to build a new column, P1, that randomly selects one value from every row vector starting with "[" in column P, and carries over the other single-value rows unchanged. This is part of a larger effort to clean the column so it is usable for regression.
Right now, I've come up with this tidyverse code as my best attempt at mutating:
foo <- data.frame(P=="[")
foo %>%
  rowwise %>%
  mutate(P1 = sample(P, 1))
But this isn't returning the output I need. Aside from sample(), I'm not sure what else can be used for random selection from a [] vector. What would be the best way to go about this? I appreciate the help.
You can remove the [] from the column values, split the data on commas so that each value is in a different row, and then select one random value for each original row.
library(dplyr)
df %>%
  mutate(P1 = gsub('\\[|\\]', '', P),
         row = row_number()) %>%
  tidyr::separate_rows(P1, sep = ',\\s*') %>%
  group_by(row) %>%
  slice_sample(n = 1) %>%  # in older versions of dplyr, use sample_n(1)
  ungroup %>%
  select(-row)
# P P1
# <chr> <chr>
#1 [340000, 410000] 340000
#2 [450000, 450000] 450000
#3 530000 530000
#4 110000 110000
#5 [330000, 440000] 440000
#6 510000 510000
#7 440000 440000
#8 620000 620000
#9 320000 320000
In base R you can implement the same logic with
df$P1 <- sapply(strsplit(gsub('\\[|\\]', '', df$P), ',\\s*'), sample, 1)
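One small follow-up, since the stated goal is a column usable for regression: both approaches leave P1 as character strings, so a final conversion (an addition of mine, not part of the answer above) may be needed.
df$P1 <- as.numeric(df$P1)  # convert the sampled strings to numeric for modelling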
data
df <- structure(list(P = c("[340000, 410000]", "[450000, 450000]",
"530000", "110000", "[330000, 440000]", "510000", "440000", "620000",
"320000")), class = "data.frame", row.names = c(NA, -9L))
I have a table called DATA_SET. This table contains one column with six different cases of data.
#DATA_SET
DATA_SET <- data.frame(
  CUSTOMS_RATE = c("20", "15+0,41 eur/kg", "10+0,1 eur/kg max.17",
                   "0,1 eur/l max.17", "0,04 eur/kg max.10", "NA")
)
View(DATA_SET)
#DATA_SET1
DATA_SET1 <- data.frame(
  RATE = "",
  SPECIFIC_RATE = "",
  MAXIMUM_RATE = ""
)
My intention is to divide this column into three different columns, as in the table (DATA_SET1) above, in order to continue with other statistical operations (calculation of averages, etc.).
Can anybody help me transform this table?
Usually, separate would be a better option, but in this case the positions of the numbers are not the same in each row (and some are missing), so we use str_extract to extract each value individually:
library(tidyverse)
DATA_SET %>%
  mutate(CUSTOMS_RATE = str_replace_all(CUSTOMS_RATE, ",", "."),
         RATE = str_extract(CUSTOMS_RATE, "^[0-9]+(?=\\+|$)"),
         SPECIFIC_RATE = str_extract(CUSTOMS_RATE, "\\d+\\.\\d+"),
         MAXIMUM_RATE = str_extract(CUSTOMS_RATE, "(?<=max\\.)\\d+")) %>%
  select(2:4) %>%
  mutate_all(as.numeric)
# RATE SPECIFIC_RATE MAXIMUM_RATE
#1 20 <NA> <NA>
#2 15 0.41 <NA>
#3 10 0.1 17
#4 <NA> 0.1 17
#5 <NA> 0.04 10
#6 <NA> <NA> <NA>
Or use str_replace to create a single delimiter and then use separate
DATA_SET %>%
  mutate(CUSTOMS_RATE = str_replace_all(CUSTOMS_RATE, ",", ".") %>%
           str_replace("\\+?([0-9]+\\.[0-9]+)", "+\\1") %>%
           str_replace_all("[A-Za-z/ ]+\\.?", "+")) %>%
  separate(CUSTOMS_RATE, into = c("RATE", "SPECIFIC_RATE", "MAXIMUM_RATE"),
           sep = "\\+", convert = TRUE)
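Once either pipeline's output is assigned to a data frame (called split_rates here purely for illustration), the averages mentioned in the question are a single call away:
colMeans(split_rates, na.rm = TRUE)  # per-column means, ignoring missing components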
I have a dataset, df, where columns consist of various chemicals and rows consist of samples identified by their id and the concentration of each chemical.
I need to correct the chemical concentrations using a unique value for each chemical, which are found in another dataset, df2.
Here's a minimal df1 dataset:
df1 <- read.table(text="id,chem1,chem2,chem3,chemA,chemB
1,0.5,1,5,4,3
2,1.5,0.5,2,3,4
3,1,1,2.5,7,1
4,2,5,3,1,7
5,3,4,2.3,0.7,2.3",
header = TRUE,
sep=",")
and here is a df2 example:
df2 <- read.table(text="chem,value
chem1,1.7
chem2,2.3
chem3,4.1
chemA,5.2
chemB,2.7",
header = TRUE,
sep = ",")
What I need to do is to divide all observations of chem1 in df1 by the value provided for chem1 in df2, repeated for each chemical. In reality, chemical names are not sequential, and there's roughly 30 chemicals.
Previously I would have done this using Excel and index/match but I'm looking to make my methods more reproducible, hence fighting my way through with R. I mostly do data manipulation with dplyr, so if there's a tidyverse solution out there, that would be great!
Thankful for any help
We can use the 'chem' column of 'df2' to subset 'df1', divide by the 'value' column of 'df2' (replicated so the lengths match), and update the columns of 'df1' by assigning the results back:
df1[as.character(df2$chem)] <- df1[as.character(df2$chem)]/df2$value[col(df1[-1])]
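As a quick sanity check with the posted data: after the assignment, chem1 in df1 holds the original values divided by 1.7, which matches the div column of the long-format output shown below.
df1$chem1
# [1] 0.2941176 0.8823529 0.5882353 1.1764706 1.7647059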
Using the reshape2 package, the data frame can be changed to long format and merged with df2 as follows. (Note that the example df introduces some whitespace in the chem names, which is filtered out in this solution.)
library(reshape2)
df1 <- read.table(text="id,chem1,chem2,chem3,chemA,chemB
1,0.5,1,5,4,3
2,1.5,0.5,2,3,4
3,1,1,2.5,7,1
4,2,5,3,1,7
5,3,4,2.3,0.7,2.3",
header = TRUE,
sep=",",stringsAsFactors = F)
df2 <- read.table(text="chem,value
chem1,1.7
chem2,2.3
chem3,4.1
chemA,5.2
chemB,2.7",
header = TRUE,
sep = ",",stringsAsFactors = F)
df2$chem <- gsub("\\s+","",df2$chem) #example introduces whitespaces in the names
df1A <- melt(df1,id.vars=c("id"),variable.name="chem")
combined <- merge(x=df1A,y=df2,by="chem",all.x=T)
combined$div <- combined$value.x/combined$value.y
head(combined)
chem id value.x value.y div
1 chem1 1 0.5 1.7 0.2941176
2 chem1 2 1.5 1.7 0.8823529
3 chem1 3 1.0 1.7 0.5882353
4 chem1 4 2.0 1.7 1.1764706
5 chem1 5 3.0 1.7 1.7647059
6 chem2 1 1.0 2.3 0.4347826
or in wide format:
> dcast(combined[,c("id","chem","div")],id ~ chem,value.var="div")
id chem1 chem2 chem3 chemA chemB
1 1 0.2941176 0.4347826 1.2195122 0.7692308 1.1111111
2 2 0.8823529 0.2173913 0.4878049 0.5769231 1.4814815
3 3 0.5882353 0.4347826 0.6097561 1.3461538 0.3703704
4 4 1.1764706 2.1739130 0.7317073 0.1923077 2.5925926
5 5 1.7647059 1.7391304 0.5609756 0.1346154 0.8518519
Here's a tidyverse solution.
df3 <- df1 %>%
  # convert the data from wide to long to make the next step easier
  gather(key = chem, value = value, -id) %>%
  # do your math, using 'match' to map values from df2 to the long rows
  mutate(value = value / df2$value[match(chem, df2$chem)]) %>%
  # return the data to wide format if that's how you prefer to store it
  spread(chem, value)
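Since gather and spread have since been superseded, a hedged variant of the same tidyverse solution using the newer pivoting verbs (and a join in place of match) might look like this:
library(tidyverse)

df3 <- df1 %>%
  pivot_longer(-id, names_to = "chem", values_to = "value") %>%
  # bring in each chemical's correction value, then divide by it
  left_join(df2, by = "chem", suffix = c("", ".ref")) %>%
  mutate(value = value / value.ref) %>%
  select(-value.ref) %>%
  pivot_wider(names_from = chem, values_from = value)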