Categorize observations in dataframe by different identifiers - r

I've searched around for a solution to this problem, but can't seem to find any.
I have pulled tweets from Danish MPs using the rtweet package to access the Twitter API. I used get_timeline() to pull the data:
get_timeline(c(politikere), n = 100, parse = TRUE,
             since_id = "1315756184247435264",
             max_id = "1333904927559725056", type = "recent") %>%
  dplyr::filter(created_at > "2020-10-25" & created_at <= "2020-12-01")
Now I would like to categorize the different Twitter users by their party ID, in order to do some party-specific sentiment analysis.
The API call returns all sorts of information in a tibble dataframe, e.g. "user_id", spanning around 90 different variables:
user_id
status_id
created_at
screen_name
text
description
...x_i
The point is that I want to create a new column in the dataset named party_id, identifying the party affiliation, and assign a value to each user according to the party they belong to. It should look something like this:
user_id   status_id   created_at   screen_name   text   description   party_id
1234346   683901040   2020-11-23   larsen_mc     gg..   Danish MP..   Conservatives
I looked at the dplyr package, but I can't quite get my head around how to assign the same value to different rows that do not share the same identifier. If, e.g., all the Conservative MPs shared the same status_id, it would be a somewhat easier task using inner_join, but every user has their own unique identifier in this case (of course).
Here is some example data:
structure(list(user_id = c("2373406198", "4360080437", "3512158337",
"746909257", "36910691", "58550919", "279986859", "1225930531",
"26263965", "2222188479"), status_id = c("1354094283230474241",
"1354707826317393922", "1354391556900483072", "1347169543853117444",
"1354866447735005185", "1332633849659088897", "1355522537669734401",
"1355554489361686530", "1329028442105458688", "1330791375449829376"
), created_at = structure(c(1611676209, 1611822489, 1611747085,
1610025223, 1611860307, 1606559643, 1612016732, 1612024349, 1605700047,
1606120363), tzone = "UTC", class = c("POSIXct", "POSIXt")),
screen_name = c("jacobmark_sf", "RuneLundEL", "kimvalentinDK",
"TommyPetersenDK", "JuulMona", "Blixt22", "JanEJoergensen",
"RasmusJarlov", "StemLAURITZEN", "olebirkolesen")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
Hope this makes sense.
Best,
Gustav

Okay - I found a solution! After making the identifier manually (called Parti_id), I used the tidyverse package and used left_join():
poldata <- poldata %>%
  select(screen_name, Parti_id)

FTtweets <- left_join(tmlpol, poldata, by = "screen_name")
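The same lookup-table idea can be sketched in base R with merge(), which behaves like left_join() when all.x = TRUE. The party labels and the small stand-in data frames below are made up for illustration:

```r
# Hypothetical lookup table mapping each MP's screen name to a party
poldata <- data.frame(
  screen_name = c("jacobmark_sf", "RasmusJarlov", "JanEJoergensen"),
  Parti_id    = c("SF", "Conservatives", "Venstre"),
  stringsAsFactors = FALSE
)

# A minimal stand-in for the tweet data frame pulled from the API
tmlpol <- data.frame(
  screen_name = c("RasmusJarlov", "jacobmark_sf", "RasmusJarlov"),
  text        = c("tweet 1", "tweet 2", "tweet 3"),
  stringsAsFactors = FALSE
)

# merge() is the base-R analogue of dplyr::left_join();
# all.x = TRUE keeps every tweet even when no party is found
FTtweets <- merge(tmlpol, poldata, by = "screen_name", all.x = TRUE)
```

Every tweet by the same screen name gets the same party_id, which is exactly the many-rows-to-one-label assignment the question asks about.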


Rentrez is pulling the wrong data from NCBI in R?

I am trying to download sequence data from E. coli samples within the state of Washington - it's about 1283 sequences, which I know is a lot. The problem that I am running into is that entrez_search and/or entrez_fetch seem to be pulling the wrong data. For example, the following R code does pull 1283 IDs, but when I use entrez_fetch on those IDs, the sequence data I get is from chickens and corn and things that are not E. coli:
search <- entrez_search(db = "biosample",
                        term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
                        retmax = 9999, use_history = T)
Similarly, I tried pulling the sequence from one sample manually as a test. When I search for the accession number SAMN30954130 on the NCBI website, I see metadata for an E. coli sample. When I use this code, I see metadata for a chicken:
search <- entrez_search(db = "biosample",
                        term = "SAMN30954130[ACCN]",
                        retmax = 9999, use_history = T)
fetch_test <- entrez_fetch(db = "nucleotide",
                           id = search$ids,
                           rettype = "xml")
fetch_list <- xmlToList(fetch_test)
The issue here is that you are using a Biosample UID to query the Nucleotide database. However, the UID is then interpreted as a Nucleotide UID, so you get a sequence record unrelated to your original Biosample query.
What you need to use in this situation is entrez_link, which uses a UID to link records between two databases.
For example, your Biosample accession SAMN30954130 has the Biosample UID 30954130. You link that to Nucleotide like this:
nuc_links <- entrez_link(dbfrom='biosample', id=30954130, db='nuccore')
And you can get the corresponding Nucleotide UID(s) like this:
nuc_links$links$biosample_nuccore
[1] "2307876014"
And then:
fetch_test <- entrez_fetch(db = "nucleotide",
                           id = 2307876014,
                           rettype = "xml")
This is covered in the section "Finding cross-references" of the rentrez tutorial.

Update Purrr loop to input data row by row in R

This question kinda builds on questions I asked here and here, but it's finally coming together and I think I know what the problem is, I just need help kicking it over the goal line. TL;DR at the bottom.
The overall goal as simply put as possible:
I have a dataframe that is from an API pull of a redcap database. It has a few columns of information about various studies.
I'd like to go through that dataframe line by line, and push it into a different website called Oncore, through an API.
In the first question linked above (here again), I took a much simpler dataframe and took one column from it (the number), using it to do an API pull from Oncore: it would download a record from Oncore, copy one variable over to a different spot, and push it back in, once per row. Then it would return a simple dataframe of the row number and the API status code returned.
Now I want to get a bit more complicated: instead of just pulling a number from one column, I want to swap over a bunch of variables from my original dataframe and upload them.
The idea is for sample studies input into Redcap to be pushed into Oncore.
What I've tried:
I have this dataframe from the redcap api pull:
testprotocols<-structure(list(protocol_no = c("LS-P-Joe's API", "JoeTest3"),
nct_number = c(654321, 543210), library = structure(c(2L,
2L), levels = c("General Research", "Oncology"), class = "factor"),
organizational_unit = structure(c(1L, 1L), levels = c("Lifespan Cancer Institute",
"General Research"), class = "factor"), title = c("Testing to see if basic stuff came through",
"Testing Oncology Projects for API"), department = structure(c(2L,
2L), levels = c("Diagnostic Imaging", "Lifespan Cancer Institute"
), class = "factor"), protocol_type = structure(2:1, levels = c("Basic Science",
"Other"), class = "factor"), protocolid = 1:2), row.names = c(NA,
-2L), class = c("tbl_df", "tbl", "data.frame"))
I have used this code to try and push the data into Oncore:
## This chunk gets a random one we're going to change later
base <- "https://website.forteresearchapps.com"
endpoint <- "/website/rest/protocols/"
protocol <- "2501"

## 'results' will get changed later to plug back in
## store
protocolid <- protocolnb <- library_names <- get_codes <- put_codes <- list()
UpdateAccountNumbers <- function(protocol){
  call2 <- paste(base, endpoint, protocol, sep = "")
  httpResponse <- GET(call2, add_headers(authorization = token))
  results <- fromJSON(content(httpResponse, "text"))
  results$protocolId <- "8887" ## doesn't seem to matter
  results$protocolNo <- testprotocols$protocol_no
  results$library <- as.character(testprotocols$library)
  results$title <- testprotocols$title
  results$nctNo <- testprotocols$nct_number
  results$objectives <- "To see if the API works, specifically if you can write over a previous number"
  results$shortTitle <- "Short joseph Title"
  results$department <- as.character(testprotocols$department)
  results$organizationalUnit <- as.character(testprotocols$organizational_unit)
  results$protocolType <- as.character(testprotocols$protocol_type)

  call2 <- paste(base, endpoint, protocol, sep = "")
  httpResponse_put <- PUT(
    call2,
    add_headers(authorization = token),
    body = results, encode = "json",
    verbose()
  )

  # save stats
  protocolid <- append(protocolid, protocol)
  protocolnb <- append(protocolnb, testprotocols$PROTOCOL_NO[match(protocol, testprotocols$PROTOCOL_ID)])
  library_names <- append(library_names, testprotocols$LIBRARY[match(protocol, testprotocols$PROTOCOL_ID)])
  get_codes <- append(get_codes, status_code(httpResponse))
  put_codes <- append(put_codes, status_code(httpResponse_put))
}
## Oncology will have to change to whatever the df name is, above and below this
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)

allresults <- tibble('protocolNo' = unlist(protocolid),
                     'protocolnb' = unlist(protocolnb),
                     'library_names' = unlist(library_names),
                     'get_codes' = unlist(get_codes),
                     'put_codes' = unlist(put_codes))
When I get to the line:
purrr::walk(testprotocols$protocol_no, UpdateAccountNumbers)
I get this error:
When I do traceback() I get this:
When I step through the loop line by line I realized that in this chunk of code:
call2 <- paste(base, endpoint, protocol, sep = "")
httpResponse <- GET(call2, add_headers(authorization = token))
results <- fromJSON(content(httpResponse, "text"))
results$protocolId <- "8887" ## doesn't seem to matter
results$protocolNo <- testprotocols$protocol_no
results$library <- as.character(testprotocols$library)
results$title <- testprotocols$title
results$nctNo <- testprotocols$nct_number
results$objectives <- "To see if the API works, specifically if you can write over a previous number"
results$shortTitle <- "Short joseph Title"
results$department <- as.character(testprotocols$department)
results$organizationalUnit <- as.character(testprotocols$organizational_unit)
results$protocolType <- as.character(testprotocols$protocol_type)
Where I had envisioned it downloading ONE sample study and replacing aspects of it with variables from ONE row of my beginning dataframe, it's actually trying to paste everything in the column in there. I.e. results$nctNo is "654321 543210" instead of just "654321" from the first row.
TL;DR version:
I need my purrr loop to take one row at a time instead of my entire column, and I think if I do that, it'll all magically work.
Within UpdateAccountNumbers(), you are referring to entire columns of the testprotocols frame when you do things like results$nctNo <- testprotocols$nct_number.
Instead, perhaps at the top of the UpdateAccountNumbers() function, you can do something like tp <- testprotocols[testprotocols$protocol_no == protocol, ], and then when you are trying to assign values to results you can refer to tp instead of testprotocols.
Note that your purrr::walk() command is passing just one value of protocol at a time to the UpdateAccountNumbers() function.
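The single-row subsetting pattern can be sketched with a toy version of the function. There are no API calls here, and the returned list is a simplified stand-in for the real results object:

```r
testprotocols <- data.frame(
  protocol_no = c("LS-P-Joe's API", "JoeTest3"),
  nct_number  = c(654321, 543210),
  stringsAsFactors = FALSE
)

UpdateAccountNumbers <- function(protocol){
  # Subset to the single row matching this protocol, as suggested above
  tp <- testprotocols[testprotocols$protocol_no == protocol, ]
  # Refer to tp, not testprotocols, so each assigned field has length 1
  list(protocolNo = tp$protocol_no, nctNo = tp$nct_number)
}

# purrr::walk()/map() pass one protocol_no at a time; lapply() does the same
results <- lapply(testprotocols$protocol_no, UpdateAccountNumbers)
```

Each call now sees exactly one row, so results$nctNo would be "654321" for the first protocol rather than the whole column pasted together.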

rtweet - multiple AND/OR keyword search

I am using the rtweet package to retrieve tweets that contain specific keywords. I know how to do an "and"/"or" match, but not how to chain these together into one keyword query with multiple OR/AND conditions. For example, a search query I may wish to put into the search_tweets() function is:
('cash' or 'currency' or 'banknote' or 'accepting cash' or 'cashless') AND ('covid' or 'virus' or 'coronavirus')
So the tweets can contain any one of the words in the first bracket and also any one of the words in the second bracket.
Using dplyr:
Assuming you have a df with a column that contains a character field of tweets:
Sample data:
df <- structure(list(Column = c("coronavirus cash", "covid", "currency covid",
"currency coronavirus", "coronavirus virus", "trees", "plants",
"moneys")), row.names = c(NA, -8L), class = c("tbl_df", "tbl",
"data.frame"))
You can use the following:
library(dplyr)
library(stringr)

match <- df %>%
  dplyr::filter(str_detect(Column, "cash|currency|banknote|accepting cash|cashless")) %>%
  dplyr::filter(str_detect(Column, "covid|virus|coronavirus"))
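The same two-stage filter can also be written in base R with grepl(), which avoids the stringr dependency; the sample data is the df from above:

```r
df <- data.frame(
  Column = c("coronavirus cash", "covid", "currency covid",
             "currency coronavirus", "coronavirus virus", "trees",
             "plants", "moneys"),
  stringsAsFactors = FALSE
)

# Keep rows matching at least one word from each group
# (a logical AND of two OR-patterns)
money_hit <- grepl("cash|currency|banknote|accepting cash|cashless", df$Column)
covid_hit <- grepl("covid|virus|coronavirus", df$Column)
matched   <- df[money_hit & covid_hit, , drop = FALSE]
```

Each `|`-separated pattern is one OR group, and combining the two logical vectors with `&` gives the AND between the groups, mirroring the bracketed query in the question.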

How to create a function that adds different categories to a table with tidyverse in R

I am creating lots of tibbles. Because this is repetitive, I am trying to create a function that eases my work. It is necessary to create this function with the tidyverse library in R. This is the function I created:
cfg_write <- function(given = c(1:2),
                      common = c(1:2),
                      table = name_of_a_table,
                      path = "path/to/save"){
  table <- tibble::tibble(given = c(1:2),
                          common = c(1:2))
  table
  saveRDS(table, file = path)
}
To bear in mind: for the given and common parameters I want to pass more than 2 strings; sometimes I can reach 18 levels rather than the 2 set in the defaults.
Two things I do not get with the function created:
First, I wish to get extra rows when I pass values to the given and common parameters. These are categories of the given variable.
Secondly, when I attempt to create several tibbles, I get a tibble with two columns, which is good, but each column contains the numbers 1 and 2, which isn't what I expect.
This is what I do, to be more specific:
test <- cfg_write(given = c("Adrian", "Mary", "Neil"),
                  common = c("name1", "name2", "name3"),
                  table = test, path = "/users/bg/test.rds")
However, I get this:
dput(test)
structure(list(given = 1:2, common = 1:2), row.names = c(NA,-2L), class = c("tbl_df", "tbl", "data.frame"))
Can someone help?
You have not passed the argument names into the function body. Because the arguments are not used in the body, any new values supplied when calling the function are ignored. The table argument is also obsolete: the name of the tibble will be whatever you assign the function call to, in this case test. Perhaps this will move things in the right direction? Note I have changed the file path.
cfg_write <- function(given = c(1:2),
                      common = c(1:2),
                      path = "path/to/save"){
  table <- tibble::tibble(given = given,
                          common = common)
  saveRDS(table, file = path)
}
test <- cfg_write(given = c("Adrian", "Mary", "Neil"),
                  common = c("name1", "name2", "name3"),
                  path = "Desktop/test.rds")
readRDS("Desktop/test.rds")
# A tibble: 3 x 2
  given  common
  <chr>  <chr>
1 Adrian name1
2 Mary   name2
3 Neil   name3
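The argument-passing fix can be checked without tidyverse by the same pattern in base R; data.frame stands in for tibble::tibble and tempfile() stands in for the real path, purely for illustration:

```r
cfg_write <- function(given = c(1:2),
                      common = c(1:2),
                      path = tempfile(fileext = ".rds")){
  # Use the arguments themselves, not hard-coded defaults
  table <- data.frame(given = given, common = common,
                      stringsAsFactors = FALSE)
  saveRDS(table, file = path)
  path  # return the path so the caller can read the file back
}

saved <- cfg_write(given = c("Adrian", "Mary", "Neil"),
                   common = c("name1", "name2", "name3"))
test <- readRDS(saved)
```

Because `given = given` and `common = common` now reference the function's arguments, passing vectors of any length (2 values or 18) produces that many rows.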

R not producing the same result when the data set source is changed

If I manually create 2 DFs, the code does what it was intended to do:
df1 <- structure(list(CompanyName = c("Google", "Tesco")), .Names = "CompanyName", class = "data.frame", row.names = c(NA, -2L))

df2 <- structure(list(CompanyVariationsNames = c("google plc", "tesco bank", "tesco insurance", "google finance", "google play")), .Names = "CompanyVariationsNames", class = "data.frame", row.names = c(NA, -5L))
test <- df2 %>%
  rowwise() %>%
  mutate(CompanyName = as.character(Filter(length,
    lapply(df1$CompanyName, function(x) x[grepl(x, CompanyVariationsNames, ignore.case = T)])))) %>%
  group_by(CompanyName) %>%
  summarise(Variation = paste(CompanyVariationsNames, collapse = ",")) %>%
  cSplit("Variation", ",")
this produces the following result:
CompanyName Variation_1 Variation_2 Variation_3
1: Google google plc google finance google play
2: Tesco tesco bank tesco insurance NA
But..... if I import a data set (using read.csv), I get the following error: Error in mutate_impl(.data, dots) : Column CompanyName must be length 1 (the group size), not 0. My data sets are rather large, so df1 would have 1000 rows and df2 would have 54k rows.
Is there a specific reason why the code works when the data set is created manually but not when the data is imported?
DF1 contains company names and DF2 contains variation names of those companies.
help please!
Importing from CSV can be tricky. See if the default separator (comma) applies to your file in particular. If not, you can change it by setting the sep argument to a character that works (e.g. read.csv(file_path, sep = ";")), which is a common problem in my country due to our local conventions.
In fact, if your standard is semicolons, read.csv2(file_path) will suffice.
And also (to avoid further trouble): it is very common for CSV files to mess with columns of decimal values, because here we use commas as decimal separators rather than dots. So it would be worth checking whether this is a problem in any of the other columns of your file too.
If that is your case, you can set the adequate parameter in either read.csv or read.csv2 by setting dec = "," (E.g.: read.csv(file_path, sep = ";", dec = ","))
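A quick way to check both settings is to round-trip a small file; the sample content below is made up and written to a tempfile purely for illustration:

```r
# A semicolon-separated file with comma decimals, common in some locales
csv_text <- "CompanyName;Revenue\nGoogle;1,5\nTesco;2,25\n"
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

# read.csv2() defaults to sep = ";" and dec = ","
df <- read.csv2(path, stringsAsFactors = FALSE)

# Equivalent explicit call with read.csv():
df_explicit <- read.csv(path, sep = ";", dec = ",", stringsAsFactors = FALSE)
```

If the decimals come back as character rather than numeric, the dec setting does not match the file, which is exactly the kind of silent mismatch that can later break string-matching code like the grepl() pipeline above.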
