R list save as quoted list - r

I want to save a recommenderlab prediction list as a list of quoted (""), comma-separated values. I already have one question open on the same topic, but here I want to extend it with a twist.
I have tried a few approaches and found the one below relevant, but I am stuck on the simple step of putting the output into a quoted, comma-separated string.
library("recommenderlab")
library(stringi)
data("MovieLense")
MovieLense100 <- MovieLense[rowCounts(MovieLense) >100,]
MovieLense100
train <- MovieLense100[1:50]
rec <- Recommender(train, method = "UBCF")
rec
pre <- predict(rec, MovieLense100[101:105], n = 10)
as(pre, "list")
list1 = as(pre, "list")
cat(paste0(shQuote(list1[["291"]]),collapse=","))
The above gives me for given user:
"Titanic (1997)","Contact (1997)","Alien (1979)","Amadeus (1984)","Godfather, The (1972)","Aliens (1986)","Sting, The (1973)","American Werewolf in London, An (1981)","Schindler's List (1993)","Glory (1989)"
I want to put users and movies in a data frame where the first column is the user and the second column is that user's movies in the concatenated form above.

Given that paste0(shQuote(list1[["291"]]),collapse=",") builds the string of movie recommendations, one could do the following to turn this into a data frame tagged with a name. Note that cat() only prints its argument and returns NULL, so the string has to be built with paste0() alone before it is stored:
movies <- paste0(shQuote(list1[["291"]]),collapse=",")
theData <- data.frame(name="Santhosh",movies,stringsAsFactors=FALSE)
Another approach would be to save each movie as a separate column in the output data frame, which would make it easier to use the data in R without having to parse the movie list multiple times. The tidyverse (i.e. tidyr and dplyr) can be used to produce this data frame.
library(tidyr)
library(dplyr)
recommendedMovies <- c("Titanic (1997)","Contact (1997)","Alien (1979)","Amadeus (1984)","Godfather, The (1972)","Aliens (1986)","Sting, The (1973)","American Werewolf in London, An (1981)","Schindler's List (1993)","Glory (1989)")
theData <- data.frame(name="Santhosh",
rank=1:length(recommendedMovies),
movies=recommendedMovies,stringsAsFactors=FALSE)
theData %>% group_by(name) %>%
spread(.,rank,movies,sep="movie")
...and the output:
> theData %>% group_by(name) %>%
+ spread(.,rank,movies,sep="movie")
# A tibble: 1 x 11
# Groups: name [1]
name rankmovie1 rankmovie2 rankmovie3 rankmovie4 rankmovie5 rankmovie6 rankmovie7 rankmovie8 rankmovie9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Sant… Titanic (… Contact (… Alien (19… Amadeus (… Godfather… Aliens (1… Sting, Th… American … Schindler…
# ... with 1 more variable: rankmovie10 <chr>
>
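To cover every user in the prediction list rather than a single one, the same paste0() idea can be applied to each list element. The following is a minimal base R sketch, not the only way to do it. It assumes list1 is a named list of character vectors, as returned by as(pre, "list"); the small list below is a stand-in with shortened data, and the user ID "292" is hypothetical. Double quotes are pasted on explicitly because the default quoting style of shQuote() depends on the platform.

```r
# Stand-in for as(pre, "list"): one character vector of titles per user ID.
# User "292" and its titles are made up for illustration.
list1 <- list(
  `291` = c("Titanic (1997)", "Godfather, The (1972)"),
  `292` = c("Alien (1979)", "Glory (1989)")
)

# Wrap each title in double quotes and collapse to one string per user
quoted <- vapply(
  list1,
  function(m) paste0('"', m, '"', collapse = ","),
  character(1)
)

# One row per user: first column the user ID, second the quoted movie string
theData <- data.frame(
  user = names(list1),
  movies = unname(quoted),
  stringsAsFactors = FALSE
)
theData
```

Keeping the titles quoted matters here because some titles contain commas themselves, for example "Godfather, The (1972)".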

Related

select for rows that don't have a string

I have a df of lot #'s with all of the data associated with them. Some of that data is experimental. Those lot #'s start with X. For example, X42A7299, where any normal lot would be 42A7299. I want to exclude those rows. The DF is called all_cls4. Here is the code I have tried:
all_cls4new<- all_cls4 %>% filter(!str_detect(Lot_#, ^X))
this returns a +
I also get this result with filter and !grep. What am I missing?
library(dplyr)
library(stringr)
x <- tribble(
~lot, ~other_data,
"X42A7299", 45,
"42A7299", 100
)
x %>%
filter(!(str_detect(lot, '^X')))
#> # A tibble: 1 × 2
#> lot other_data
#> <chr> <dbl>
#> 1 42A7299 100
Also, be careful with a symbol in your column name (e.g. Lot_#). I would rename it to a "clean" name (e.g. snakecase). janitor::clean_names() is useful for this. If you use it as is, you will have to wrap in backticks:
all_cls4 %>%
  filter(!(str_detect(`Lot_#`, '^X')))
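As for why the original call returns a +: in R, an unquoted # starts a comment, so everything after Lot_ is ignored, the parentheses are never closed, and the console waits for more input. The pattern ^X also has to be a quoted string. For comparison, here is a small base R sketch of the same filter. The data frame is a made-up stand-in for all_cls4, and check.names = FALSE is only there so the column can keep its non-syntactic name.

```r
# Made-up stand-in for all_cls4; check.names = FALSE keeps the name "Lot_#"
all_cls4 <- data.frame(
  `Lot_#` = c("X42A7299", "42A7299"),
  other_data = c(45, 100),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep only rows whose lot number does not start with "X"
keep <- !startsWith(all_cls4[["Lot_#"]], "X")
all_cls4new <- all_cls4[keep, , drop = FALSE]
all_cls4new
```

Using [["Lot_#"]] with the name as a string avoids the backticks entirely.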

Rvest : Extracting clickable content

I am trying to extract the table in the link below
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State=0&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=--Select--&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--
I want the whole table to be extracted and I am using the following code
html_page <- read_html(curl(curl))
tab <- html_page %>% html_table(., fill = TRUE)
I get the table in tab[[1]], however, if you notice that website it has a clickable section within the table that has additional data. That part is missing from the extracted table. Will appreciate any help on how the whole table can be extracted.
I'm not sure what you're getting. However, when I pulled from this website I see that there are multiple tabs but I pulled all of the data.
Here is the bottom of the table, when you show all.
Here are the results, when I query for the last line of this website data.
library(rvest)
library(tidyverse)
hx = "https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State=0&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=--Select--&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--"
htp <- read_html(hx) %>% html_table(., fill = T)
tbOne = htp[[1]][, 1:10] # just the data
tbOne %>% filter(`State Name` == "Uttar Pradesh",
`District Name` == "Badaun",
`Market Name` == "Wazirganj")
# # A tibble: 1 × 10
# `State Name` `District Name` `Market Name` Variety Group `Arrivals (Tonnes)`
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Uttar Pradesh Badaun Wazirganj Dara Cereals 3.50
# # … with 4 more variables: `Min Price (Rs./Quintal)` <chr>,
# # `Max Price (Rs./Quintal)` <chr>, `Modal Price (Rs./Quintal)` <chr>,
# # `Reported Date` <chr>
Update
When I pressed the 2, nothing happened (and I did try repeatedly). However, I needed to be really patient and I wasn't. Sorry about that.
The URL has the query in it, so the URL can be used to get all of the data. You could do this by adding the states you're missing, or you could do this for every state. For example, page one ends on Uttar Pradesh, but we don't know if this is all of Uttar Pradesh. That might make more sense when you see what I did.
Using rvest, I collected all of the states' names from the form. Then I put these name-value pairs into a data frame.
# collect form values for State
ht <- read_html(hx) %>% html_form()
df1 <- as.data.frame(ht[[1]][["fields"]][["ctl00$ddlState"]][["options"]]) %>%
rownames_to_column("State")
names(df1)[2] <- "Abb"
To only look at the states that were not included in page one, you could just query the states after Uttar Pradesh like this.
which(df1$State == "Uttar Pradesh", arr.ind = T)
# [1] 35
# split the URL
urone = "https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=1&Tx_State="
urtwo = "&Tx_District=0&Tx_Market=0&DateFrom=2022-01-28&DateTo=2022-01-28&Fr_Date=2022-01-28&To_Date=2022-01-28&Tx_Trend=2&Tx_CommodityHead=Wheat&Tx_StateHead=West+Bengal&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--"
# collect remaining states' data
df2 <- map(36:nrow(df1),
function(x){
# assemble URL
y = toString(df1$Abb[x])
urall = paste0(urone, y, urtwo)
# get table
tabs <- read_html(urall) %>% html_table(., fill = T)
tabs
})
length(df2)
# [1] 2
length(df2[[1]]) # state 36 is empty
length(df2[[2]]) # state 37 is not
# add the new data to the original data
df3 <- df2[[2]][[1]]
tbOne <- rbind(tbOne, df3) # one data frame of tabled data
If you wanted to make sure that you had all the data for each state, you could expand this. Although, using map for that much data may be slow. So I used the function mclapply from the package parallel. In this code, I used 15 cores. You may need to change this depending on your computer's processor. Using 15 made this take less than a second.
library(parallel) # provides mclapply (it forks; mc.cores > 1 is not supported on Windows)
# skip row 1, that's "select" or all
df4 <- mclapply(2:nrow(df1), mc.cores = getOption("mc.cores", 15L),
function(x){
# assemble URL
y = toString(df1$Abb[x])
urall = paste0(urone, y, urtwo)
# get table
tabs <- read_html(urall) %>% html_table(., fill = T)
tabs
})
length(df4)
# [1] 36
# create storage using first state with data
df5 <- df4[[7]][[1]]
map(8:36,
function(x){
y = length(df4[[x]])
if(y > 0){
df5 <<- rbind(df5, df4[[x]][[1]])
}
})
Now you have a data frame, df5 that started as each state queried separately.
I didn't look at how the data was different. However, my tbOne data frame has 577 observations. My df5 data frame has 584.

Iteratively create global environment objects from tibble

I'm trying to make objects directly from information listed in a tibble that can be called on by later functions/tibbles in my environment. I can make the objects manually but I'm working to do this iteratively.
library(tidyverse)
##determine mean from 2x OD Negatives in experimental plates, then save summary for use in appending table
ELISA_negatives = "my_file.csv"
neg_tibble <- as_tibble(read_csv(ELISA_negatives, col_names = TRUE)) %>%
group_by(Species_ab, Antibody, Protein) %>%
filter(str_detect(Animal_ID, "2x.*")) %>%
summarize(ave_neg_U_mL = mean(U_mL, na.rm = TRUE), n=sum(!is.na(U_mL)))
neg_tibble
# A tibble: 4 x 5
# Groups: Species_ab, Antibody [2]
Species_ab Antibody Protein ave_neg_U_mL n
<chr> <chr> <chr> <dbl> <int>
1 Mouse IgG GP 28.2 6
2 Mouse IgG NP 45.9 6
3 Rat IgG GP 5.24 4
4 Rat IgG NP 1.41 1
I can write the object manually based off the above tibble:
Mouse_IgG_GP_cutoff <- as.numeric(neg_tibble[1,4])
Mouse_IgG_GP_cutoff
[1] 28.20336
In my attempt to do this iteratively, I can make a new tibble neg_tibble_string with the information I need. All I would need to do now is make a global object from the Name in the first column Test_Name, and assign it to the numeric value in the second column ave_neg_U_mL (which is where I'm getting stuck).
neg_tibble_string <- neg_tibble %>%
select(Species_ab:Protein) %>%
unite(col='Test_Name', c('Species_ab', 'Antibody', 'Protein'), sep = "_") %>%
mutate(Test_Name = str_c(Test_Name, "_cutoff")) %>%
bind_cols(neg_tibble[4])
neg_tibble_string
# A tibble: 4 x 2
Test_Name ave_neg_U_mL
<chr> <dbl>
1 Mouse_IgG_GP_cutoff 28.2
2 Mouse_IgG_NP_cutoff 45.9
3 Rat_IgG_GP_cutoff 5.24
4 Rat_IgG_NP_cutoff 1.41
I feel like there has to be a way to do this to get this from the above tibble neg_tibble_string, and make this for all four of the rows. I've tried a variant of this and this, but can't get anywhere.
> list_df <- mget(ls(pattern = "neg_tibble_string"))
> list_output <- map(list_df, ~neg_tibble_string$ave_neg_U_mL)
Warning message:
Unknown or uninitialised column: `ave_neg_U_mL`.
> list_output
$neg_tibble_string
NULL
As always, any insight is appreciated! I'm making progress on my R journey but I know I am missing large gaps in knowledge.
list_df is already a list whose single element is the tibble, so inside the lambda .x refers to that tibble and the column can be extracted from it directly. Note that column names are case-sensitive (ave_neg_U_mL).
library(purrr)
list_output <- map(list_df, ~.x$ave_neg_U_mL)
If the intention is to create global objects, deframe the tibble into a named vector, convert it to a list, and then use list2env:
library(tibble)
list2env(as.list(deframe(neg_tibble_string)), .GlobalEnv)
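For readers who want to see what the deframe step amounts to, here is a rough base R sketch of the same idea. It rebuilds neg_tibble_string as a plain data frame from the rounded values printed above, and it writes into a fresh environment (the name env is arbitrary) instead of .GlobalEnv so that it can be tried without touching the workspace.

```r
# Plain data frame with the rounded values shown in the printed tibble
neg_tibble_string <- data.frame(
  Test_Name = c("Mouse_IgG_GP_cutoff", "Mouse_IgG_NP_cutoff",
                "Rat_IgG_GP_cutoff", "Rat_IgG_NP_cutoff"),
  ave_neg_U_mL = c(28.2, 45.9, 5.24, 1.41),
  stringsAsFactors = FALSE
)

# Named list: names come from column 1, values from column 2
cutoffs <- setNames(
  as.list(neg_tibble_string$ave_neg_U_mL),
  neg_tibble_string$Test_Name
)

# Create one object per list element in the target environment
env <- new.env()
list2env(cutoffs, envir = env)

get("Mouse_IgG_GP_cutoff", envir = env)
```

Replacing env with .GlobalEnv gives the behaviour asked for in the question. Keeping the values in the named list cutoffs and looking them up as cutoffs[["Mouse_IgG_GP_cutoff"]] is often easier to manage than many separate global objects.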

R for loop to extract info from a file and add it into tibble?

I am not great with tidyverse so forgive me if this is a simple question. I have a bunch of files with data that I need to extract and add into distinct columns in a tibble I created.
I want the row names to start with the file IDs, which I did manage to create:
filelist <- list.files(pattern=".txt") # Gives me the filenames in current directory.
# The filenames are something like AA1230.report.txt for example
file_ID <- trimws(filelist, whitespace="\\..*") # Gives me the ID which is before the "report.txt"
metadata <- as_tibble(file_ID[1:181]) # create dataframe with IDs as row names for 180 files.
Now in these report files are information on species and abundance (kraken report files for those familiar with kraken) and all I need is to extract the number of reads for each domain. I can easily search up in each file the domains and number of reads that fall into that domain using something like:
sample_data <- as_tibble(read.table("AA1230.report.txt", sep="\t", header=FALSE, strip.white=TRUE))
sample_data <- rename(sample_data, Percentage=V1, Num_reads_root=V2, Num_reads_taxon=V3, Rank=V4, NCBI_ID=V5, Name=V6) # Just renaming the column headers for clarity
sample_data %>% filter(Rank=="D") # D for domain
This gives me a clear output such as:
Percentage Num_Reads_Root Num_Reads_Taxon Rank NCBI_ID Name
<dbl> <int> <int> <fct> <int> <fct>
1 75.9 60533 28 D 2 Bacteria
2 0.48 386 0 D 2759 Eukaryota
3 0.01 4 0 D 2157 Archaea
4 0.02 19 0 D 10239 Viruses
Now, I want to just grab the info in the second column and final column and save this info into my tibble so that I can get something like:
> metadata
value Bacteria_Counts Eukaryota_Counts Viruses_Counts Archaea_Counts
<chr> <int> <int> <int> <int>
1 AA1230 60533 386 19 4
2 AB0566
3 AA1231
4 AB0567
5 BC1148
6 AW0001
7 AW0002
8 BB1121
9 BC0001
10 BC0002
....with 171 more rows
I'm just having trouble coming up with a for loop to create these sample_data outputs, then from that, extract the info and place into a tibble. I guess my first loop should create these sample_data outputs so something like:
for (files in file.list()) {
>> get_domains <<
}
Then another loop to extract that info from the above loop and insert it into my metadata tibble.
Any suggestions? Thank you so much!
PS: If regular dataframes in R is better for this let me know, I have just recently learned that tidyverse is a better way to organize dataframes in R but I have to learn more about it.
You could also do:
library(tidyverse)
filelist <- list.files(pattern=".txt")
nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")
set_names(filelist,filelist) %>%
map_dfr(read_table, col_names = nms, .id = 'file_ID') %>%
filter(Rank == 'D') %>%
select(file_ID, Name, Num_reads_root) %>%
pivot_wider(id_cols = file_ID, names_from = Name, values_from = Num_reads_root) %>%
mutate(file_ID = str_remove(file_ID, '\\..*$')) # drop everything from the first dot, e.g. "AA1230.report.txt" -> "AA1230"
I've found that using a for loop is nice sometimes because it saves all the progress along the way in case you hit an error. Then you can find the problem file and debug it, or wrap the read in try() and throw a warning().
library(tidyverse)
filelist <- list.files(pattern=".txt") #list files
nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon", "Rank", "NCBI_ID", "Name")
tmp_list <- list()
for (i in seq_along(filelist)) {
  # It looks like your files are all tab-separated with no header row, so the column names are supplied here
  my_table <- read_tsv(filelist[i], col_names = nms) %>%
    filter(Rank=="D") %>%
    mutate(file_ID = trimws(filelist[i], whitespace="\\..*")) %>% # use = (not <-) to name the new column
    select(file_ID, everything())
  tmp_list[[i]] <- my_table
}
out <- bind_rows(tmp_list)
out
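To get from the stacked output to the one-row-per-file layout shown in the question, the domain counts still need to be spread into columns. Below is a self-contained base R sketch of the whole round trip. The two report files are tiny made-up examples written to a temporary directory, and the helper name read_domains is hypothetical. Note that xtabs() fills a domain missing from a file with 0 rather than NA, which may or may not be what you want.

```r
# Write two tiny, made-up report files to a temporary directory
tmp <- file.path(tempdir(), "kraken_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("75.9\t60533\t28\tD\t2\tBacteria",
             "0.48\t386\t0\tD\t2759\tEukaryota",
             "50.0\t100\t5\tP\t1224\tProteobacteria"),
           file.path(tmp, "AA1230.report.txt"))
writeLines(c("10.0\t500\t3\tD\t2\tBacteria",
             "0.02\t19\t0\tD\t10239\tViruses"),
           file.path(tmp, "AB0566.report.txt"))

nms <- c("Percentage", "Num_reads_root", "Num_reads_taxon",
         "Rank", "NCBI_ID", "Name")
files <- list.files(tmp, pattern = "\\.report\\.txt$", full.names = TRUE)

# Read one report, keep the domain rows, and tag them with the file ID
read_domains <- function(path) {
  x <- read.table(path, sep = "\t", header = FALSE, col.names = nms,
                  strip.white = TRUE, stringsAsFactors = FALSE,
                  quote = "", comment.char = "")
  x <- x[x$Rank == "D", c("Name", "Num_reads_root")]
  x$file_ID <- sub("\\..*$", "", basename(path))
  x
}

long <- do.call(rbind, lapply(files, read_domains))

# Cross-tabulate reads by file and domain, then convert to a data frame
counts <- xtabs(Num_reads_root ~ file_ID + Name, data = long)
metadata <- as.data.frame.matrix(counts)
metadata
```

The file IDs end up as row names here; metadata$file_ID <- rownames(metadata) would turn them into a regular column if that is preferred.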

R Beginner struggling with extremely messy XLSX

I got an XLSX with data from a questionnaire for my master thesis.
The questions and answers for an interviewee are in one row in the second column. The first column contains the date.
The data of the second column comes in a form like this:
"age":"52","height":"170","Gender":"Female",...and so on
I started with:
test12 <- read_xlsx("Testdaten.xlsx")
library(splitstackshape)
test13 <- concat.split(data = test12, split.col= "age", sep =",")
Then I got the questions and the answers as a column divided by a ":".
For e.g. column 1: "age":"52" and column2:"height":"170".
But the data is so messy that sometimes the column for the age question and answer contains a height question and answer instead, and in some questionnaires the questions and answers appear twice.
I need the questions as variables and the answers as observations, but I have no clue how to get there. I could clean the data in Excel first, but because the columns are not constant (e.g. some height questions sit in the age column), I see no way to do that, especially as I will regularly get new data formatted the same way.
Here is an example of the data:
A tibble: 5 x 2
partner.createdAt partner.wphg.info
<chr> <chr>
1 2019-11-09T12:13:11.099Z "{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\""
2 2019-11-01T06:43:22.581Z "{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\""
3 2019-11-10T07:59:46.136Z "{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\""
4 2019-11-11T13:01:48.488Z "{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000~
5 2019-11-08T14:54:26.654Z "{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\""
Thank you so much for your time!
You can loop through each entry, splitting at , as you did. Then you can loop through them all again, splitting at :.
The result will be a bunch of variable/value pairings. This can be all done stacked. Then you just want to pivot back into columns.
data
Updated the data based on your edit.
data <- tribble(~partner.createdAt, ~partner.wphg.info,
'2019-11-09T12:13:11.099Z', '{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\"',
'2019-11-01T06:43:22.581Z', '{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\"',
'2019-11-10T07:59:46.136Z', '{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\"',
'2019-11-11T13:01:48.488Z', '{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000\"',
'2019-11-08T14:54:26.654Z', '{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\"')
libraries
We need a few here. Or you can just call tidyverse.
library(stringr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
function
This function will create a data frame (or tibble) for each question. The first column is the date, the second is the variable, the third is the value.
clean_record <- function(date, text) {
clean_records <- str_split(text, pattern = ",", simplify = TRUE) %>%
str_remove_all(pattern = "\\\"") %>% # remove double quote
str_remove_all(pattern = "\\{|\\}") %>% # remove curly brackets
str_split(pattern = ":", simplify = TRUE)
tibble(date = as.Date(date), variable = clean_records[,1], value = clean_records[,2])
}
iteration
Now we use pmap_dfr from purrr to loop over the rows, outputting each row with an id variable named record.
This will stack the data as described in the function. The mutate() line converts all variable names to lowercase. The distinct() line will filter out rows that are exact duplicates.
What we do then is just pivot on the variable column. Of course, replace data with whatever you name your data frame.
data_clean <- pmap_dfr(data, ~ clean_record(..1, ..2), .id = "record") %>%
mutate(variable = tolower(variable)) %>%
distinct() %>%
pivot_wider(names_from = variable, values_from = value)
result
The result is something like this. Note how I had reordered some of the columns, but it still works. You are probably not done just yet. All columns are now of type character. You need to figure out the desired type for each and convert.
# A tibble: 5 x 10
record date age_years job_des height_cm gender born_in alcoholic knowledge_selfass total_wealth
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2019-11-09 50 unemployed 170 female Italy false 5 200000
2 2 2019-11-01 34 self-employed 158 male Germany true 3 10000
3 3 2019-11-10 24 NA 187 male England false 3 150000
4 4 2019-11-11 59 employed 167 female United States false 2 1000000
5 5 2019-11-08 36 employed 180 male Germany false 5 170000
For example, convert age_years to numeric.
data_clean %>%
mutate(age_years = as.numeric(age_years))
I am sure you may run into other things, but this should be a start.
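The split-at-comma, then split-at-colon idea can also be sketched without any packages, which may help in seeing what each step does. This is only an illustration on two shortened, made-up records in the same format; the helper name parse_record is hypothetical. It assumes that no answer itself contains a comma or a colon, which would break the splitting.

```r
# Two shortened example records in the same format as the questionnaire column
records <- c(
  '{"age_years":"50","job_des":"unemployed","height_cm":"170"',
  '{"age_years":"24","height_cm":"187"'
)

# Turn one record into a named character vector: names = questions, values = answers
parse_record <- function(text) {
  text <- gsub('[{}"]', "", text)                         # strip braces and quotes
  pairs <- strsplit(text, ",", fixed = TRUE)[[1]]         # one "question:answer" per element
  kv <- do.call(rbind, strsplit(pairs, ":", fixed = TRUE)) # two-column matrix
  setNames(kv[, 2], tolower(kv[, 1]))
}

parsed <- lapply(records, parse_record)

# Union of all questions seen, in order of first appearance
all_vars <- unique(unlist(lapply(parsed, names)))

# Align every record to the same columns; unanswered questions become NA
wide <- as.data.frame(
  do.call(rbind, lapply(parsed, function(p) unname(p[all_vars]))),
  stringsAsFactors = FALSE
)
names(wide) <- all_vars
wide
```

Indexing each record by all_vars is what makes the column order in the raw data irrelevant, and it is also why a missing question simply shows up as NA.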

Resources