So, I have a function:
complete <- function(directory,id = 1:332 ) {
directory <- list.files(path="......a")
g <- list()
for(i in 1:length(directory)) {
g[[i]] <- read.csv(directory[i],header=TRUE)
}
rbg <- do.call(rbind,g)
rbgr <- na.omit(rbg) #reads files and omits NA's
complete_subset <- subset(rbgr,rbgr$ID %in% id,select = ID)
table.rbgr <- sapply(complete_subset,table)
table.rbd <- data.frame(table.rbgr)
id.table <- c(id)
findla.tb <- cbind (id.table,table.rbd)
names(findla.tb) <- c("id","nobs")
print(findla.tb) #creates table with number of observations
}
Basically, when you call it with a specific numeric id (say 4),
you are supposed to get output like this:
id nobs
15 328
So I need the nobs value to be fed into another function that measures the correlation between two columns when nobs is greater than an arbitrary threshold (t). Since nobs is determined by the value of id, I am unsure how to write a function that takes the output of the other function into account.
I have tried something like this:
corr <- function (directory, t) {
directory <- list.files(path=".......")
g <- list()
for(i in 1:length(directory)) {
g[[i]] <- read.csv(directory[i],header=TRUE)
}
rbg <- do.call(rbind,g)
g.all <- na.omit(rbg) #reads files and removes observations
source(".....complete.R") #sourcing the complete function above
complete("spec",id)
g.allse <- subset(g.all,g.all$ID %in% id,scol )
g.allnit <- subset(g.all,g.all$ID %in% id,nit )
for(g.all$ID %in% id) {
if(id > t) {
cor(g.allse,g.allnit) #calculate correlation of these two columns if they have similar id
}
}
#basically for each id that matches the ID in g.all function, if the id > t variable, calculate the correlation between columns
}
complete("spec", 3)
cr <- corr("spec", 150)
head(cr)
I have also tried to make the complete function a data.frame but it does not work and it gives me the following error:
error in data.frame(... check.names = false) arguments imply differing number of rows. So, I am not sure how to proceed....
First off, a reproducible example always helps in getting your question answered, along with a clear explanation of what your functions do/are supposed to do. We cannot run your example code.
Next, you seem to have an error in your corr function. You make multiple references to id but never actually populate this variable in your example code. So we'll just have to guess at what you need help with.
I think what you are trying to do is:
given an id, call complete with that id
use the nobs from that in your code.
In this case, you need to make sure to store the output of your call to complete, e.g.
comp <- complete('spec', id)
You can access the id column value via comp['id'] and the nobs value via comp['nobs'], so you could do e.g.
if (comp['nobs'] > t) {
# do stuff e.g.
cor(g.allse, g.allnit)
}
Make sure you store the output of cor somewhere if you wish to actually get it back later.
You will have to fix the problem of id not being defined yourself, because it is unclear what you want that to be.
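As a rough illustration of the idea above (store the id/nobs table, then compare nobs against the threshold), here is a minimal sketch. It works on a small in-memory data frame rather than on the CSV files, and the names count_complete, corr_above, sulfate and nitrate are hypothetical, not taken from the question.

```r
# Toy stand-in for the combined monitor data; column names are assumptions.
set.seed(1)
g.all <- data.frame(
  ID      = rep(1:3, times = c(5, 50, 200)),
  sulfate = rnorm(255),
  nitrate = rnorm(255)
)
g.all$sulfate[c(1, 10, 100)] <- NA  # one incomplete row per ID

# Count complete observations per requested id (mirrors the id/nobs table).
count_complete <- function(dat, id) {
  ok <- dat[complete.cases(dat) & dat$ID %in% id, ]
  data.frame(id = id, nobs = as.vector(table(factor(ok$ID, levels = id))))
}

# Correlate the two columns only for ids whose nobs exceeds the threshold.
corr_above <- function(dat, threshold) {
  comp <- count_complete(dat, sort(unique(dat$ID)))  # store the output
  keep <- comp$id[comp$nobs > threshold]
  vapply(keep, function(k) {
    rows <- dat[dat$ID == k & complete.cases(dat), ]
    cor(rows$sulfate, rows$nitrate)
  }, numeric(1))
}

comp <- count_complete(g.all, 1:3)
cr <- corr_above(g.all, 10)  # only IDs 2 and 3 have more than 10 complete rows
```

Returning the correlations from the function (instead of only calling cor inside an if) is what makes them available to the caller afterwards.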
I am learning data scraping and, on top of that, I am quite a beginner with R (for work I use Stata; I use R only for very specific tasks).
In order to learn scraping, I am exercising with a few pages on Psychology Today.
I have written a function that allows me to scrape information for one therapist and to create a data set with the information collected in this way:
install.packages('rvest') #Installing the rvest package
install.packages('xml2') #Installing the xml2 package
library('rvest') #to scrape
library('xml2') #to handle missing values (it works with html_node, not with html_nodes)
#Specifying the url for desired website to be scraped
url <- 'https://www.psychologytoday.com/us/therapists/THE_ONE_YOU_WANT'
#Reading the HTML code from the website
URL <- read_html(url)
#creating the function
getProfile <- function(profilescrape) {
##NAME
#Using CSS selectors to name
nam_html <- html_node(URL,'.contact-name')
#Converting the name data to text
nam <- html_text(nam_html)
#Let's have a look at the rankings
head(nam)
#Data-Preprocessing: removing '\n' (for the next informations, I will keep \n, to help
# me separate each item within the same type of
# information)
nam<-gsub("\n","",nam)
head(nam)
#Convering each info from text to factor
nam<-as.factor(nam)
#Let's have a look at the name
head(nam)
##MODALITIES
#Using CSS selectors to modality
mod_html <- html_node(URL,'.attributes-modality .copy-small')
#Converting the name data to text
mod <- html_text(mod_html)
#Let's have a look at the rankings
head(mod)
#Convering each info from text to factor
mod<-as.factor(mod)
#Let's have a look at the rankings
head(mod)
##Combining all the lists to form a data frame
onet_df<-data.frame(Name = nam,
Modality = mod)
##Structure of the data frame
str(onet_df)
}
View(onet_df)
This code seems to be working well for whatever therapist I choose.
Now, I would like to use this function on multiple profiles, to generate one data set, with name and modality of MHPs.
Let's say that I want to apply the above function "getProfile" to the first 20 therapists in Illinois and store the information for these 20 therapists in a data set called "onet_df":
j <- 1
MHP_codes <- c(324585 : 449807) #therapist identifier
withinpage_codes <- c(1 : 20) #therapist running number
for(code1 in withinpage_codes) {
for(code2 in MHP_codes) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
record_profile <- getProfile <- function(profilescrape)
onet_df[[j]] <- rbind.fill(onet_df, record_profile)
j <- j + 1
}
}
EDITS START HERE:
This loop does not create any data set; moreover, it does not give any error message.
Would someone be able to help me de-bug this loop?
Please, keep in mind that I am a real beginner.
Following suggestions, I have modified the beginning as follows:
#creating the function
getProfile <- function(URL) {....}
Moreover, I have used three alternative loops:
1st alternative
j <- 1
MHP_codes <- c(324585 : 449807) #therapist identifier
withinpage_codes <- c(1 : 20) #therapist running number
for(code1 in withinpage_codes) {
for(code2 in MHP_codes) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
record_profile <- getProfile(URL)
onet_df[[j]] <- rbind.fill(onet_df, record_profile)
j <- j + 1
}
}
which gives the following error message:
Error in UseMethod("xml_find_first") :
no applicable method for 'xml_find_first' applied to an object of class "character"
2nd alternative
MHP_codes <- c(324585, 449807) #therapist identifier
withinpage_codes <- c(1:20) #therapist running number
df_list <- vector(mode = "list",
length = length(MHP_codes) * length(withinpage_codes))
j <- 1
for(code1 in withinpage_codes) {
for(code2 in MHP_codes) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
df_list[[j]] <- getProfile(URL)
j <- j + 1
}
}
final_df <- rbind.fill(df_list)
This loop gives the same error message (please, refer to the above one).
Now I just have to figure out why no data set is produced by the loop. There might be two problems. First, something within the loop does not work (I have run both loops on only one existing page and no data set is produced). Second, when I run the loop on a series of links, some of them might be missing, which would produce an error message.
Consider several adjustments:
Adjust the function to receive a URL parameter. Right now, profilescrape is not used anywhere in the function, so the function takes whatever URL is assigned in the global environment.
getProfile <- function(URL) {
...
}
Adjust the ending of the function to return the needed object. Without return, R returns the value of the last evaluated expression, and str() only prints the structure and returns NULL. Therefore, replace str(onet_df) with return(onet_df).
Pass the dynamic URL in the loop by calling the function rather than redefining it:
URL <- paste0(...)
record_profile <- getProfile(URL)
Initialize a list with the specified length (2 x 20) before the loop. Then, on each iteration, assign to the loop index rather than growing the object inside the loop, which is memory inefficient.
MHP_codes <- c(324585, 449807) #therapist identifier
withinpage_codes <- c(1:20) #therapist running number
df_list <- vector(mode = "list",
length = length(MHP_codes) * length(withinpage_codes))
j <- 1
for(code1 in withinpage_codes) {
for(code2 in MHP_codes) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
df_list[[j]] <- tryCatch(getProfile(URL),
error = function(e) NULL)
j <- j + 1
}
}
Call rbind.fill (from the plyr package) once outside the loop to combine all data frames together:
final_df <- rbind.fill(df_list)
With that said, consider an apply family solution, specifically Map (wrapper to mapply). Doing so, you avoid the bookkeeping of initializing list and incremental variable and you "hide" the loop for compact statement.
# ALL POSSIBLE PAIRINGS
web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
withinpage_codes = c(1:20))
# MOVE URL ASSIGNMENT INSIDE FUNCTION
getProfile <- function(code1, code2) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
# ...same code as before...
}
# ELEMENT-WISE LOOP PASSING PARAMS IN PARALLEL TO FUNCTION
df_list <- Map(function(code1, code2) tryCatch(getProfile(code1, code2),
error = function(e) NULL),
code1 = web_codes_df$withinpage_codes,
code2 = web_codes_df$MHP_codes)
final_df <- rbind.fill(df_list)
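The pattern above (wrap each call in tryCatch, collect the results in a list, combine once at the end) can be tried without any network access. The sketch below uses a hypothetical stand-in, fake_profile, which fails for odd codes to mimic missing pages; it is not the real scraper. It also uses base R's do.call(rbind, ...) instead of rbind.fill, which is enough here because every data frame has the same columns.

```r
# Hypothetical stand-in for getProfile(): errors for odd codes,
# returns a one-row data frame otherwise.
fake_profile <- function(code) {
  if (code %% 2 == 1) stop("page not found")
  data.frame(Name = paste0("therapist_", code),
             Modality = "Individuals",
             stringsAsFactors = FALSE)
}

codes <- 163805:163808

# One list element per code; failed calls become NULL instead of stopping.
df_list <- lapply(codes, function(code) {
  tryCatch(fake_profile(code), error = function(e) NULL)
})

# Drop the NULL entries, then bind everything in a single call.
df_list <- Filter(Negate(is.null), df_list)
final_df <- do.call(rbind, df_list)
```

In this toy run only the two even codes survive, so final_df has two rows.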
One of the users, Parfait, helped me sort out the issues, so a very big thank you goes to this user.
Below I post the script. I apologize if it is not precisely commented.
Here is the code.
#Loading packages
library('rvest') #to scrape
library('xml2') #to handle missing values (it works with html_node, not with html_nodes)
library('plyr') #to bind together different data sets
#get working directory
getwd()
setwd("~/YOUR OWN FOLDER HERE")
#DEFINE SCRAPING FUNCTION
getProfile <- function(URL) {
##NAME
#Using CSS selectors to name
nam_html <- html_node(URL,'.contact-name')
#Converting the name data to text
nam <- html_text(nam_html)
#Let's have a look at the rankings
head(nam)
#Data-Preprocessing: removing '\n' (for the next informations, I will keep \n, to help
# me separate each item within the same type of
# information)
nam<-gsub("\n","",nam)
head(nam)
#Convering each info from text to factor
nam<-as.factor(nam)
#Let's have a look at the name
head(nam)
#If I need to remove blank space do this:
#Data-Preprocessing: removing excess spaces
#variable<-gsub(" ","",variable)
##MODALITIES
#Using CSS selectors to modality
mod_html <- html_node(URL,'.attributes-modality .copy-small')
#Converting the name data to text
mod <- html_text(mod_html)
#Let's have a look at the rankings
head(mod)
#Convering each info from text to factor
mod<-as.factor(mod)
#Let's have a look at the rankings
head(mod)
##Combining all the lists to form a data frame
onet_df<-data.frame(Name = nam,
Modality = mod)
return(onet_df)
}
Then, I apply this function with a loop to a few therapists. For illustrative purposes, I take four adjacent therapist IDs, without knowing a priori whether each of these IDs has actually been assigned (I do this because I want to see what happens if the loop stumbles on a non-existent link).
j <- 1
MHP_codes <- c(163805:163808) #therapist identifier
df_list <- vector(mode = "list", length(MHP_codes))
for(code1 in MHP_codes) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
#Reading the HTML code inside tryCatch, so a missing page yields NULL instead of stopping the loop
df_list[[j]] <- tryCatch(getProfile(read_html(URL)),
error = function(e) NULL)
j <- j + 1
}
final_df <- rbind.fill(df_list)
save(final_df,file="final_df.Rda")
I have a dataframe with ~9000 rows of human coded data in it, two coders per item so about 4500 unique pairs. I want to break the dataset into each of these pairs, so ~4500 dataframes, run a kripp.alpha on the scores that were assigned, and then save those into a coder sheet I have made. I cannot get the loop to work to do this.
I can get it to work individually, using this:
example.m <- as.matrix(example.m)
s <- kripp.alpha(example.m)
example$alpha <- s$value
However, when trying a loop I am getting either "Error in get(v) : object 'NA' not found" when running this:
for (i in items) {
v <- i
v <- v[c("V1","V2")]
v <- assign(v, as.matrix(get(v)))
s <- kripp.alpha(v)
i$alpha <- s$value
}
Or am getting "In i$alpha <- s$value : Coercing LHS to a list" when running:
for (i in items) {
i.m <- i[c("V1","V2")]
i.m <- as.matrix(i.m)
s <- kripp.alpha(i.m)
i$alpha <- s$value
}
Here is an example set of data. Items is a list of individual dataframes.
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
items <- c("l","t")
I am sure this is a basic question, but what I want is for each file, i, to add a column with the alpha score at the end. Thanks!
Your problem is with scoping and with extracting objects when they are referenced through strings. You'd need to get() or eval() some of your objects to make your current approach work.
Here's another solution
library("irr") # For kripp.alpha
# Produce the data
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
# Collect the data as a list right away
items <- list(l, t)
Now you can sapply() directly over the elements in the list.
sapply(items, function(v) {
kripp.alpha(as.matrix(v[c("V1","V2")]))$value
})
which produces
[1] 0.0 -0.5
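To get the score back into each data frame as a column, which is what the question asks for, one option is to have lapply() return a modified copy of every element. Assigning to i$alpha inside a for loop only changes the loop variable, not the list. The sketch below is an assumption-laden illustration: add_stat and placeholder_stat are hypothetical names, and the placeholder statistic (a plain mean) is used only so the code runs without the irr package. With irr loaded, one could pass function(m) kripp.alpha(m)$value instead.

```r
# Same example data as above, collected in a named list.
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1), nrow = 2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1), nrow = 2))
items <- list(l = l, t = t)

# Placeholder statistic (assumption) standing in for kripp.alpha(m)$value.
placeholder_stat <- function(m) mean(m)

# Return a copy of the data frame with the statistic added as a column.
add_stat <- function(d, stat) {
  d$alpha <- stat(as.matrix(d[c("V1", "V2")]))
  d
}

# Replace the list with the modified copies.
items <- lapply(items, add_stat, stat = placeholder_stat)
```

Each element of items now carries an alpha column, with the single value recycled across its rows.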
RStudio Version 1.0.143
Windows Ver: Windows10 Pro
I have 300+ files with the same structure, and I want to create a loop that calculates the correlation for the required files. I can select the right files and calculate the correlation, but I can't get all of the results shown together. I tried to save them to a vector, but R tells me "object not found". And if it can work, I am also worried about whether the contents of the vector will persist if I run the function several times. Here's the loop:
for(i in ind_larg){
specdata_i <- read.csv(i)
com_case_ind <- complete.cases(specdata_i)
sulfate_i <- specdata_i[,2][com_case_ind]
nitrate_i <- specdata_i[,3][com_case_ind]
ou[i] <- cor(sulfate_i, nitrate_i)
}
and the result
Error: object 'ou' not found
I'm not sure if you need the rest of the code before this, so I attach it at the end here.
setwd("C:/Users/sunxi/Coursera/specdata")
ind <- dir(path = "C:/Users/sunxi/Coursera/specdata", pattern = ".csv") #Save the index of the files to a vector.
specdata_ful <- lapply(ind, read.csv) #Read all the files into a list of data frames.
specdat_recon_ful <- do.call(rbind, specdata_ful) #Combine the data frames to put the same variable in one column.
com_case_ful <- complete.cases(specdat_recon_ful) #Filter the complete cases.
id_ful <- specdat_recon_ful[,4][com_case_ful] #The ID of the complete cases.
sulfate_ful <- specdat_recon_ful[,2][com_case_ful] #The sulfate value of the complete cases.
nitrate_ful <- specdat_recon_ful[,3][com_case_ful] #The nitrate value of the complete cases.
id_freq_ful <- table(id_ful) #Summarize the frequency of each id.
id_freq_mat_ful <- as.data.frame(id_freq_ful) #Convert the table into a data.frame.
good <- id_freq_mat_ful[["Freq"]] > 1000 #Filter the frequencies larger than the threshold.
id_good <- id_freq_mat_ful[["id_ful"]][good] #Filter the ids whose frequency of complete cases is larger than the threshold.
ind_larg <- ind[id_good] #Create an index for the ids with the required frequency.
You have to create the variable ou before you access it with ou[i]:
ou <- c()
for(i in ind_larg){
# your loop here...
ou[i] <- cor(sulfate_i, nitrate_i)
}
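An alternative that avoids creating and growing ou by hand is to let vapply() build the result vector. The sketch below is self-contained: it writes three small CSV files to a temporary directory first, and the file layout, the column names and the helper file_cor are assumptions made for illustration.

```r
# Create a few toy CSV files in a temporary directory (assumed layout).
tmp <- tempfile("specdata_")
dir.create(tmp)
set.seed(42)
for (k in 1:3) {
  dat <- data.frame(Date = seq_len(20), sulfate = rnorm(20),
                    nitrate = rnorm(20), ID = k)
  dat$sulfate[1:2] <- NA  # some incomplete rows
  write.csv(dat, file.path(tmp, sprintf("%03d.csv", k)), row.names = FALSE)
}

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Correlation of sulfate and nitrate over the complete rows of one file.
file_cor <- function(path) {
  dat <- read.csv(path)
  ok <- complete.cases(dat)
  cor(dat$sulfate[ok], dat$nitrate[ok])
}

# One numeric result per file, collected without pre-creating the vector.
ou <- vapply(files, file_cor, numeric(1))
names(ou) <- basename(files)
```

Because ou is rebuilt from scratch on every run, results from earlier runs do not linger in it.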
I have the following code:
df<- iris
library(svDialogs)
columnFunction <- function (x) {
column.D <- dlgList(names(x), multiple = T, title = "Spalten auswaehlen")$res
if (!length((column.D))) {
cat("No column selected\n")
} else {
cat("The following columns are choosen:\n")
print(column.D)
for (z in column.D) {
x[[z]] <- NULL #with this part I wanted to delete the above selected columns
}
}
}
columnFunction(df)
So how is it possible to address data.frame columns "dynamically" so: x[[z]] <- NULL should translate to:
df$Species <- NULL
df[["Species"]] <- NULL
df[,"Species"] <- NULL
and that for every selected column in every data.frame chosen for the function.
Does anyone know how to achieve something like that? I tried several things, such as paste, sprintf and deparse, but I didn't get it working. I also tried to address the data.frame as a global variable by using <<-, but that didn't help either. (It's the first time I have even heard about that.) It looks like I am missing the right method for transferring x and z to the variable assignment.
If you want to create a function columnFunction that removes columns from a passed data frame df, all you need to do is pass the data frame to the function, return the modified version of df, and replace df with the result:
library(svDialogs)
columnFunction <- function (x) {
column.D <- dlgList(names(x), multiple = T, title = "Spalten auswaehlen")$res
if (!length((column.D))) {
cat("No column selected\n")
} else {
cat("The following columns are choosen:\n")
print(column.D)
x <- x[, !names(x) %in% column.D, drop = FALSE] # drop = FALSE keeps a data frame even if one column remains
}
return(x)
}
df <- columnFunction(df)
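The same idea can be checked without the dialog box. In this minimal sketch the columns to remove are passed as a character vector; drop_columns is a hypothetical helper name. It also shows why the reassignment is needed: the function works on a copy, so the caller's data frame is unchanged until the result is assigned back.

```r
# Remove the named columns and return the modified copy.
drop_columns <- function(x, cols) {
  x[, !names(x) %in% cols, drop = FALSE]
}

df <- iris

# Calling the function without assigning the result leaves df untouched.
invisible(drop_columns(df, "Species"))
still_there <- "Species" %in% names(df)

# Assigning the result back is what actually changes df.
df <- drop_columns(df, c("Species", "Sepal.Width"))
names(df)
```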
I'm having an issue in R where I am running a cor.test on a data frame where there are multiple groups.
I am trying to obtain the correlation coefficient for one dependent variable and multiple independent variables contained in a data frame. The data frame has 2 grouping columns for subsetting the data. Here is an example:
DF <- data.frame(group1=rep(1:4,3),group2=rep(1:2,6),x=rnorm(12),v1=rnorm(12),v2=rnorm(12),v3=rnorm(12))
I created the following script that uses plyr to calculate the correlation coefficient for each of the groups and then loop through for each of the variables.
library(plyr)
group_cor <- function(DF,x,y)
{
return(data.frame(cor = cor.test(DF[,x], DF[,y])$estimate))
}
resultDF <- ddply(DF, .(group1,group2), group_cor,3,4)
for(i in 5:6){
resultDF2 <- ddply(DF, .(group1,group2), group_cor,3,i)
resultDF <- merge(resultDF,resultDF2,by=c("group1","group2"))
rm(resultDF2)
}
This works fine. The problem I'm running into is when there aren't enough values in a group to calculate the correlation coefficient. For example: when I change the data frame created above to now include a few key NA values and then try to run the same loop:
DF[c(2,6,10),5]=NA
for(i in 5:6){
resultDF2 <- ddply(DF, .(group1,group2), group_cor,3,i)
resultDF <- merge(resultDF,resultDF2,by=c("group1","group2"))
rm(resultDF2)
}
I get the following error "Error: not enough finite observations"
I understand why I get this error and am not expecting to get a correlation coefficient for these cases. But what I would like to do is to pass out a null value and move on to the next group instead of stopping my code at an error.
I've tried using a wrapper with try() but can't seem to pass that variable into my result data frame.
Any help on how to get around this would be much appreciated.
I invariably forget how to use try if I haven't used it in, oh, a day or something. This link helped me remember the basics.
For your function, you could add it in like this:
group_cor = function(DF,x,y) {
check = try(cor.test(DF[,x], DF[,y])$estimate, silent = TRUE)
if(class(check) != "try-error")
return(data.frame(cor = cor.test(DF[,x], DF[,y])$estimate))
}
However, this won't return anything for the group with the error. That's actually OK if you use the all argument when you merge. Here's another way to merge, saving everything into a list with lapply and then merging with Reduce.
allcor = lapply(4:6, function(i) ddply(DF, .(group1,group2), group_cor, 3, i))
Reduce(function(...) merge(..., by = c("group1", "group2"), all = TRUE), allcor)
If you want to fill in with NA inside the function rather than waiting to fill in using merge, you could change your function to:
group_cor2 = function(DF,x,y) {
check = try(cor.test(DF[,x], DF[,y])$estimate, silent = TRUE)
if(class(check) == "try-error")
return(data.frame(cor = NA))
return(data.frame(cor = cor.test(DF[,x], DF[,y])$estimate))
}
Finally (and outside the scope of the question), depending on what you are doing with your output, you might consider naming your columns uniquely based on which columns you are doing the cor.test for so merge doesn't name them all with suffixes. There is likely a better way to do this, maybe with merge and the suffixes argument.
group_cor3 = function(DF,x,y) {
check = try(cor.test(DF[,x], DF[,y])$estimate, silent = TRUE)
if(class(check) != "try-error") {
dat = data.frame(cor = cor.test(DF[,x], DF[,y])$estimate)
names(dat) = paste("cor", x, "vs", y, sep = ".")
dat
}
}
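For comparison, the same error handling can be written with tryCatch, which avoids running cor.test twice. This is a minimal base R sketch that uses split() and sapply() instead of ddply so it runs without plyr; safe_cor is a hypothetical helper name and the data are a cut-down version of the example above.

```r
# Return the correlation estimate, or NA if cor.test() raises an error.
safe_cor <- function(x, y) {
  tryCatch(unname(cor.test(x, y)$estimate),
           error = function(e) NA_real_)
}

set.seed(1)
DF <- data.frame(group1 = rep(1:4, 3), x = rnorm(12), v1 = rnorm(12))
DF$v1[c(2, 6, 10)] <- NA  # group 2 now has no finite v1 values

# One estimate per group; the failing group yields NA instead of stopping.
res <- sapply(split(DF, DF$group1), function(d) safe_cor(d$x, d$v1))
```

Here the group with only missing values gets NA, and the other three groups get their estimates.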