I am working on wikipedia internal links using WikipediR package
I am looking for internal links about Hérodote, (in french)
install.packages("WikipediR")
library (WikipediR)
all_bls <- page_backlinks("fr","wikipedia",
page = "Hérodote",
clean_response = TRUE)
all_bls_df <- as.data.frame(all_bls) # converting in d.f
my result:
str(all_bls_df)
## 'data.frame': 3 obs. of 50 variables:
## $ structure.c..60....0....Attributs.du.pharaon.....Names...c..pageid... : Factor w/ 3 levels "0","60","Attributs du pharaon": 2 1 3
## $ structure.c..133....0....Apis.....Names...c..pageid....ns....title. : Factor w/ 3 levels "0","133","Apis": 2 1 3
## $ structure.c..152....0....Anthropologie.....Names...c..pageid... : Factor w/ 3 levels "0","152","Anthropologie": 2 1 3
## $ structure.c..159....0....Asie.....Names...c..pageid....ns....title. : Factor w/ 3 levels "0","159","Asie": 2 1 3
## $ structure.c..325....0....Ahmôsis.II.....Names...c..pageid... : Factor w/ 3 levels "0","325","Ahmôsis II": 2 1 3
## $ structure.c..412....0....Bastet.....Names...c..pageid....ns... : Factor w/ 3 levels "0","412","Bastet": 2 1 3
## $ structure.c..542....0....Corse.....Names...c..pageid....ns... : Factor w/ 3 levels "0","542","Corse": 2 1 3
## $ structure.c..715....0....Cyclades.....Names...c..pageid....ns... : Factor w/ 3 levels "0","715","Cyclades": 2 1 3
## (goes on for 42 more variables)
How can I tidy my data.frame object?
Expect result:
pageid title
60 Attributs du pharaon
133 Apis
152 Antropologie
159 Asie
The function you're using returns named character vectors in a list. We can use purrr::map_df() with as.list(). map_df() will execute the as.list() on each element in your all_bls list and automagically row-bind them into a data frame:
purrr::map_df(all_bls, as.list)
## # A tibble: 50 × 3
## pageid ns title
## <chr> <chr> <chr>
## 1 60 0 Attributs du pharaon
## 2 133 0 Apis
## 3 152 0 Anthropologie
## 4 159 0 Asie
## 5 325 0 Ahmôsis II
## 6 412 0 Bastet
## 7 542 0 Corse
## 8 715 0 Cyclades
## 9 734 0 Culte à mystères
## 10 821 0 Chamanisme
## # ... with 40 more rows
Related
I have this table.
'data.frame': 5303 obs. of 9 variables:
$ Metric.ID : num 7156 7220 7220 7220 7220 ...
$ Metric.Name : Factor w/ 99 levels "Avoid accessing data by using the position and length",..: 51 59 59
$ Technical.Criterion: Factor w/ 25 levels "Architecture - Multi-Layers and Data Access",..: 4 9 9 9 9 9 9 9 9 9 ...
$ RT.Snapshot.name : Factor w/ 1 level "2017_RT12": 1 1 1 1 1 1 1 1 1 1 ...
$ Violation.status : Factor w/ 2 levels "Added","Deleted": 2 1 2 2 2 1 1 1 1 1 ...
$ Critical.Y.N : num 0 0 0 0 0 0 0 0 0 0 ...
$ Grouping : Factor w/ 29 levels "281","Bes",..: 27 6 6 6 6 7 7 7 7 7 ...
$ Object.type : Factor w/ 11 levels "Cobol Program",..: 8 7 7 7 7 7 7 7 7 7 ...
$ Object.name : Factor w/ 3771 levels "[S:\\SOURCES\\",..: 3771 3770 3769 3768 3767 3
I want to have a statistic output like this:
For every Technical.Criterion a row with the sum of all rows of Critical.Y.N = 0 and 1
So I have to combine the rows of my database to a new matrix. Using Values of the factor sums ...
But I have no idea how to start...? Any hints?
Thanks
I believe you're asking for a cross-tabulation. Because you did not provide a reproducible sample, I've used mine:
xtabs(~ Sub.Category + Category, retail)
Produce this:
And if you want the value to be say, based on Sales, instead of the count, then you can modify the code to:
xtabs(Sales ~ Sub.Category + Category, retail)
And you will get the following output:
EDIT based on extra information in the OP's comment
If you want to have your tables also share a common title and want to change the name of that title, you can use a combination of names() and dimnames(). An xtab is a cross-tabulation table and if you call dimnames() on it it returns a list of length 2, first one corresponding to the row and second to the column.
dimnames(xtab(dat))
$Technical.Criterion
[1] "TechnicalCrit1" "TechnicalCrit2" "TechnicalCrit3"
$`Object.type`
[1] "Object.type1" "Object.type2" "Object.type3"
So given a data frame, b:
'data.frame': 3 obs. of 9 variables:
$ Metric.ID : int 101 102 103
$ Metric.Name : Factor w/ 3 levels "A","B","C": 1 2 3
$ Technical.Criterion: Factor w/ 3 levels "TechnicalCrit1",..: 1 2 3
$ RT.Snapshot.name : Factor w/ 3 levels "A","B","C": 1 2 3
$ Violation.status : Factor w/ 2 levels "Added","Deleted": 1 2 1
$ Critical.Y.N : num 1 0 1
$ Grouping : Factor w/ 3 levels "A","B","C": 1 2 3
$ Object.type : Factor w/ 3 levels "Object.type1",..: 1 2 3
$ Object.name : Factor w/ 3 levels "A","B","C": 1 2 3
We can use xtab and then change the "common" header right at the top of our table. Since I don't know how many levels are in b$Violation.status, I would use a generic for loop:
for(i in 1:length(unique(b$Violation.status))){
tab[[i]] <- xtabs(Critical.Y.N ~ Technical.Criterion + Object.type, b)
names(dimnames(tab[[i]]))[2] <- paste("Violation.status", i)
}
This produces:
Violation.status 1
Technical.Criterion Object.type1 Object.type2 Object.type3
TechnicalCrit1 1 0 0
TechnicalCrit2 0 0 0
TechnicalCrit3 0 0 1
Which I can now use in my shiny app.
I am pretty new to loops in R so I do apologies if this question has been asked elsewhere.
Read in all 30 CSVfiles -> Compare File A species to the other 30 CSV files by species -> Write a new CSV file for each of the 30 files with just the matching species
File A has one column with the names of 190 species ($name). The 30 other csv files each have a column with the species ($SBSname) with differing number of species in the column $SBSname that can range from 100-500 with replicates (so the file CSV file can be larger than 190 rows). However I don't know how to write the code that ...
This is all I have at the moment ...
I have looped in all the CSV files:
30files = list.files(pattern="*.csv")
for (i in 1:length(30files)) assign(30files[i], read.csv(30files[i]))
I have code for just comparing one CSV file (branching.csv) against File A:
> str(FileA)
'data.frame': **190 obs. of 1 variable**:
$ name: Factor w/ 190 levels "Acaena novae zelandiae",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(branching.csv)
'data.frame': **4055 obs. of 7 variables:**
$ SBSname : Factor w/ 2877 levels "Abies alba","Abies nordmanniana",..: 794 2075 1049 162 132 333 541 1840 272 1553 ...
$ SBS.number : int 16443 26711 40171 40398 40867 41151 37871 42412 35847 36245 ...
$ general.method : Factor w/ 5 levels "derivation from morphologies or other plant traits",..: 3 1 2 2 2 2 2 2 2 2 ...
$ branching : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 1 1 1 1 ...
$ valid : int 1 1 1 1 1 1 1 1 1 1 ...
$ reference : Factor w/ 6 levels "Barkman, J.J.(1988): New systems of plant growth forms and phenological plant types",..: 1 1 3 3 3 3 3 3 3 3 ...
$ original.reference: Factor w/ 97 levels "Aarssen, L.W. (1981): The biology of Canadian weeds. 50. Hypochoeris radicata L.",..: 9 9 20 3 3 3 3 3 33 33 ...
Species<-branching.csv[(branching.csv$SBSname %in% FileA$name),]
write.csv(Species, file = "Branching.csv")
> str(Species)
'data.frame': **298 obs. of 7 variables:**
$ name : Factor w/ 2877 levels "Abies alba","Abies nordmanniana",..: 1049 162 1548 47 57 1647 1060 2788 2094 1976 ...
$ SBS.number : int 40171 40398 36280 40532 41629 42495 40103 32792 32892 30583 ...
$ general.method : Factor w/ 5 levels "derivation from morphologies or other plant traits",..: 2 2 2 2 2 2 2 2 2 2 ...
$ branching : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 2 ...
$ valid : int 1 1 1 1 1 1 1 1 1 1 ...
$ reference : Factor w/ 6 levels "Barkman, J.J.(1988): New systems of plant growth forms and phenological plant types",..: 3 3 3 3 3 3 3 3 3 3 ...
$ original.reference: Factor w/ 97 levels "Aarssen, L.W. (1981): The biology of Canadian weeds. 50. Hypochoeris radicata L.",..: 20 3 33 33 33 33 33 44 44 44 ...
Any help or suggestions would be great. Doesn't have to be a loop!
How about this simple loop?
library(dplyr)
for(i in 1:length(30files))
{
csv.matching = read.csv(30files[i]) %>% inner_join(FileA, by=c("SBSname"="name"))
write.csv(csv.matching, file=gsub("\\.csv", "_matchin.csv", 30files[i]), na="")
}
I have a large data set, which I reduced applying gsub multiple times, basically in this form:
levels(Orders$Im) <- gsub("Osp", "OsProf", levels(Orders$Im))
I also used rbind:
DI_Reduced <- rbind(CX, OsP)
I need to apply function "tree" to the resulting data.frame, but I get an error:
tree.model <- tree(line ~ CountryCode + OrderType + Support, data=train.set)
The error is:
Error in tree(line ~ CountryCode + OrderType + Support, :
factor predictors must have at most 32 levels
Strange thing: if I export the train.set with write.csv and then I re-import it with read.csv, the error disappears and the tree is built.
I investigated the structure of the train.set and this is the difference before and after exporting/importing it:
$ CustomerNumber: Factor w/ 4616 levels "0","101959","210285",..: 3070 3069 4539 3732 2573 3086 2973 3817 4056 2956 ...
$ CountryCode : Factor w/ 4 levels "OtherCountry",..: 3 3 4 4 3 3 3 4 4 3 ...
$ OrderType : Factor w/ 5 levels "Order","NewOrder",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Support : Factor w/ 5 levels "#N/A","BN",..: 4 4 4 4 2 4 4 4 4 4 ...
$ Manuf : Factor w/ 163 levels "<Generic>","6gi",..: 52 52 52 52 52 52 52 52 52 52 ...
$ line : Factor w/ 623 levels "\"Generic\" Skews",..: 400 35 400 400 400 400 400 400 400 400 ...
________________________________________________________________
$ CustomerNumber: Factor w/ 692 levels "201500","20202",..: 361 360 680 499 138 367 315 523 592 304 ...
$ CountryCode : Factor w/ 2 levels "JP","US": 1 1 2 2 1 1 1 2 2 1 ...
$ OrderType : Factor w/ 1 level "Online": 1 1 1 1 1 1 1 1 1 1 ...
$ Support : Factor w/ 4 levels "BN","MC",..: 3 3 3 3 1 3 3 3 3 3 ...
$ Manuf : Factor w/ 1 level "DY": 1 1 1 1 1 1 1 1 1 1 ...
$ line : Factor w/ 2 levels "CX","OTX": 2 1 2 2 2 2 2 2 2 2 ...
It seems to me that gsub does not really subsect the original data.frame, and the hidden values stay in the train.set till I export/import the train. Is there another way to do this operation and obtain a tree?
As the error says, your dependent variable line has more than 32 levels. As per your train.set structure line : Factor w/ 623 levels
Try using other tree libraries like rpart.
Refactoring after subset might help.
sapply(train.set, {function(x) if(class(x) == "factor") {factor(x)}})
Also, gsub is not used for subsetting usually. It is global substitution function. You should share the pre-processing steps followed as well to help others help you with this better.
The First csv file is called "CLAIM" and these are parts of data
The second csv file is called "CUSTOMER" and these are parts of data
First, I wanted to merge two data based on the common column
Second, I wanted to remove all columns including NA value
Third, I wanted to remove the variables like 'SIU_CUST_YN, CTPR, OCCP_GRP_2, RECP_DATE, RESN_DATE'.
Fourth, I wanted to remove the empty row of OCCP_GRP_1
Expecting form is
dim(data_fin)
## [1] 114886 11
head(data_fin)
## CUST_ID DIVIDED_SET SEX AGE OCCP_GRP_1 CHLD_CNT WEDD_YN CHANG_FP_YN
## 1 1 1 2 47 3.사무직 2 Y Y
## 2 1 1 2 47 3.사무직 2 Y Y
## 3 1 1 2 47 3.사무직 2 Y Y
## 4 1 1 2 47 3.사무직 2 Y Y
## 5 2 1 1 53 3.사무직 2 Y Y
## 6 2 1 1 53 3.사무직 2 Y Y
## DMND_AMT PAYM_AMT NON_PAY_RATIO
## 1 52450 52450 0.4343986
## 2 24000 24000 0.8823529
## 3 17500 17500 0.7272727
## 4 47500 47500 0.9217391
## 5 99100 99100 0.8623195
## 6 7817 7500 0.8623195
str(data_fin)
## 'data.frame': 114886 obs. of 11 variables:
## $ CUST_ID : int 1 1 1 1 2 2 2 3 4 4 ...
## $ DIVIDED_SET : int 1 1 1 1 1 1 1 1 1 1 ...
## $ SEX : int 2 2 2 2 1 1 1 1 2 2 ...
## $ AGE : int 47 47 47 47 53 53 53 60 64 64 ...
## $ OCCP_GRP_1 : Factor w/ 9 levels "","1.주부","2.자영업",..: 4 4 4 4 4 4 4 6 3 3 ...
## $ CHLD_CNT : int 2 2 2 2 2 2 2 0 0 0 ...
## $ WEDD_YN : Factor w/ 3 levels "","N","Y": 3 3 3 3 3 3 3 2 2 2 ...
## $ CHANG_FP_YN : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 1 2 ...
## $ DMND_AMT : int 52450 24000 17500 47500 99100 7817 218614 430000 200000 120000 ...
## $ PAYM_AMT : int 52450 24000 17500 47500 99100 7500 218614 430000 200000 120000 ...
## $ NON_PAY_RATIO: num 0.434 0.882 0.727 0.922 0.862 ...
so I wrote down the code like
#gc(reset=T); rm(list=ls())
getwd()
setwd("/Users/Hong/Downloads")
getwd()
CUSTOMER <- read.csv("CUSTOMER.csv", header=T)
CLAIM <- read.csv("CLAIM.csv", header=T)
#install.packages("dplyr")
library("dplyr")
merge(CUSTOMER, CLAIM, by='CUST_ID', all.y=TRUE)
merged_data <- merge(CUSTOMER, CLAIM)
omitted_data <- na.omit(merged_data)
deducted_data <- head(select(omitted_data, -SIU_CUST_YN, -CTPR, -OCCP_GRP_2, -RECP_DATE, -RESN_DATE), 115327)
data_fin <- head(filter(deducted_data, OCCP_GRP_1 !=""), 115327)
dim(data_fin)
head(data_fin)
str(data_fin)
Next,
1) I should get top 3 (OCCP_GRP_1) that has high non_pay_ratio
2) I should get the (CUST_ID) over 600,000 of DMND_AMT Value
I don't know how to write it down
This is in reference to https://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome I am trying to use the Cross Validation in GLMNET (i.e. cv.glmnet) for a binomial target variable. The glmnet works fine but the cv.glmnet throws an error here is the error log:
Error in storage.mode(y) = "double" : invalid to change the storage mode of a factor
In addition: Warning messages:
1: In Ops.factor(x, w) : ‘*’ not meaningful for factors
2: In Ops.factor(y, ybar) : ‘-’ not meaningful for factors
Data Types:
'data.frame': 490 obs. of 13 variables:
$ loan_id : Factor w/ 614 levels "LP001002","LP001003",..: 190 381 259 310 432 156 179 24 429 408 ...
$ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
$ married : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 1 ...
$ dependents : Factor w/ 4 levels "0","1","2","3+": 1 1 1 3 1 4 2 3 1 1 ...
$ education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 1 2 1 2 ...
$ self_employed : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ applicantincome : int 9328 3333 14683 7667 6500 39999 3750 3365 2920 2213 ...
$ coapplicantincome: num 0 2500 2100 0 0 ...
$ loanamount : int 188 128 304 185 105 600 116 112 87 66 ...
$ loan_amount_term : Factor w/ 10 levels "12","36","60",..: 6 9 9 9 9 6 9 9 9 9 ...
$ credit_history : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ property_area : Factor w/ 3 levels "Rural","Semiurban",..: 1 2 1 1 1 2 2 1 1 1 ...
$ loan_status : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 2 2 ...
Codes Used:
xfactors<-model.matrix(loan_status ~ gender+married+dependents+education+self_employed+loan_amount_term+credit_history+property_area,data=data_train)[,-1]
x<-as.matrix(data.frame(applicantincome,coapplicantincome,loanamount,xfactors))
glmmod<-glmnet(x,y=as.factor(loan_status),alpha=1,family='binomial')
plot(glmmod,xvar="lambda")
grid()
cv.glmmod <- cv.glmnet(x,y=loan_status,alpha=1) #This Is Where It Throws The Error
The credit for the answer goes to #user20650.
Suspect you need to add the familyto cv.glmnet as well. An example:
x <- model.matrix(am ~ 0 + . , data=mtcars)
cv.glmnet(x, y=factor(mtcars$am), alpha=1)
cv.glmnet(x, y=factor(mtcars$am), alpha=1, family="binomial")