How to tidy data from a data frame (Wikipedia internal links)? - r

I am working on Wikipedia internal links using the WikipediR package.
I am looking for internal links about Hérodote (in French):
install.packages("WikipediR")
library(WikipediR)
all_bls <- page_backlinks("fr", "wikipedia",
                          page = "Hérodote",
                          clean_response = TRUE)
all_bls_df <- as.data.frame(all_bls) # convert to a data frame
My result:
str(all_bls_df)
## 'data.frame': 3 obs. of 50 variables:
## $ structure.c..60....0....Attributs.du.pharaon.....Names...c..pageid... : Factor w/ 3 levels "0","60","Attributs du pharaon": 2 1 3
## $ structure.c..133....0....Apis.....Names...c..pageid....ns....title. : Factor w/ 3 levels "0","133","Apis": 2 1 3
## $ structure.c..152....0....Anthropologie.....Names...c..pageid... : Factor w/ 3 levels "0","152","Anthropologie": 2 1 3
## $ structure.c..159....0....Asie.....Names...c..pageid....ns....title. : Factor w/ 3 levels "0","159","Asie": 2 1 3
## $ structure.c..325....0....Ahmôsis.II.....Names...c..pageid... : Factor w/ 3 levels "0","325","Ahmôsis II": 2 1 3
## $ structure.c..412....0....Bastet.....Names...c..pageid....ns... : Factor w/ 3 levels "0","412","Bastet": 2 1 3
## $ structure.c..542....0....Corse.....Names...c..pageid....ns... : Factor w/ 3 levels "0","542","Corse": 2 1 3
## $ structure.c..715....0....Cyclades.....Names...c..pageid....ns... : Factor w/ 3 levels "0","715","Cyclades": 2 1 3
## (goes on for 42 more variables)
How can I tidy my data.frame object?
Expected result:
pageid title
60 Attributs du pharaon
133 Apis
152 Anthropologie
159 Asie

The function you're using returns a list of named character vectors. We can use purrr::map_df() with as.list(): map_df() applies as.list() to each element of your all_bls list and automatically row-binds the results into a data frame:
purrr::map_df(all_bls, as.list)
## # A tibble: 50 × 3
## pageid ns title
## <chr> <chr> <chr>
## 1 60 0 Attributs du pharaon
## 2 133 0 Apis
## 3 152 0 Anthropologie
## 4 159 0 Asie
## 5 325 0 Ahmôsis II
## 6 412 0 Bastet
## 7 542 0 Corse
## 8 715 0 Cyclades
## 9 734 0 Culte à mystères
## 10 821 0 Chamanisme
## # ... with 40 more rows
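If you would rather avoid the purrr dependency, the same idea can be sketched in base R. The all_bls list below is a small hypothetical stand-in for what page_backlinks() returned above (an assumption, not real API output), and the final two steps simply select the two columns from the expected result.

```r
# Hypothetical stand-in for the list returned by page_backlinks(..., clean_response = TRUE)
all_bls <- list(
  c(pageid = "60",  ns = "0", title = "Attributs du pharaon"),
  c(pageid = "133", ns = "0", title = "Apis"),
  c(pageid = "152", ns = "0", title = "Anthropologie")
)

# Turn each named character vector into a one-row data frame, then stack the rows
rows <- lapply(all_bls, function(x) as.data.frame(as.list(x), stringsAsFactors = FALSE))
tidy <- do.call(rbind, rows)

# Keep only the columns asked for, with pageid as an integer
tidy$pageid <- as.integer(tidy$pageid)
tidy <- tidy[, c("pageid", "title")]
print(tidy)
```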

Related

Complex Counting of dataframe with factors

I have this table.
'data.frame': 5303 obs. of 9 variables:
$ Metric.ID : num 7156 7220 7220 7220 7220 ...
$ Metric.Name : Factor w/ 99 levels "Avoid accessing data by using the position and length",..: 51 59 59
$ Technical.Criterion: Factor w/ 25 levels "Architecture - Multi-Layers and Data Access",..: 4 9 9 9 9 9 9 9 9 9 ...
$ RT.Snapshot.name : Factor w/ 1 level "2017_RT12": 1 1 1 1 1 1 1 1 1 1 ...
$ Violation.status : Factor w/ 2 levels "Added","Deleted": 2 1 2 2 2 1 1 1 1 1 ...
$ Critical.Y.N : num 0 0 0 0 0 0 0 0 0 0 ...
$ Grouping : Factor w/ 29 levels "281","Bes",..: 27 6 6 6 6 7 7 7 7 7 ...
$ Object.type : Factor w/ 11 levels "Cobol Program",..: 8 7 7 7 7 7 7 7 7 7 ...
$ Object.name : Factor w/ 3771 levels "[S:\\SOURCES\\",..: 3771 3770 3769 3768 3767 3
I want statistical output like this:
For every Technical.Criterion, one row with the number of rows where Critical.Y.N = 0 and the number where Critical.Y.N = 1.
So I have to combine the rows of my data frame into a new matrix, using the values of the factor sums ...
But I have no idea how to start. Any hints?
Thanks
I believe you're asking for a cross-tabulation. Because you did not provide a reproducible sample, I've used my own data set (retail):
xtabs(~ Sub.Category + Category, retail)
This produces a table of row counts for each combination of Sub.Category and Category.
If you want the values to be based on, say, Sales instead of the count, you can modify the code to:
xtabs(Sales ~ Sub.Category + Category, retail)
This sums Sales within each combination instead of counting rows.
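Applied to the columns named in the question, a minimal sketch might look like the following. The dat data frame is made-up illustration data, not the OP's table, and I am assuming "sum of all rows" means a count of rows per Critical.Y.N value.

```r
# Hypothetical sample with the two relevant columns from the question
dat <- data.frame(
  Technical.Criterion = factor(c("Crit1", "Crit1", "Crit2", "Crit2", "Crit2")),
  Critical.Y.N        = c(0, 1, 0, 0, 1)
)

# One row per Technical.Criterion, one column per value of Critical.Y.N
counts <- xtabs(~ Technical.Criterion + Critical.Y.N, data = dat)
print(counts)
```

as.data.frame.matrix(counts) converts the result to an ordinary data frame if you need one.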
EDIT based on extra information in the OP's comment
If you want your tables to share a common title and want to change that title, you can use a combination of names() and dimnames(). xtabs() returns a cross-tabulation table; calling dimnames() on it returns a list of length 2, whose first element corresponds to the rows and whose second corresponds to the columns.
dimnames(xtabs(~ Technical.Criterion + Object.type, dat))
$Technical.Criterion
[1] "TechnicalCrit1" "TechnicalCrit2" "TechnicalCrit3"
$`Object.type`
[1] "Object.type1" "Object.type2" "Object.type3"
So given a data frame, b:
'data.frame': 3 obs. of 9 variables:
$ Metric.ID : int 101 102 103
$ Metric.Name : Factor w/ 3 levels "A","B","C": 1 2 3
$ Technical.Criterion: Factor w/ 3 levels "TechnicalCrit1",..: 1 2 3
$ RT.Snapshot.name : Factor w/ 3 levels "A","B","C": 1 2 3
$ Violation.status : Factor w/ 2 levels "Added","Deleted": 1 2 1
$ Critical.Y.N : num 1 0 1
$ Grouping : Factor w/ 3 levels "A","B","C": 1 2 3
$ Object.type : Factor w/ 3 levels "Object.type1",..: 1 2 3
$ Object.name : Factor w/ 3 levels "A","B","C": 1 2 3
We can use xtabs() and then change the "common" header right at the top of our table. Since I don't know how many levels are in b$Violation.status, I would use a generic for loop:
tab <- list() # one table per Violation.status level
status <- levels(b$Violation.status)
for (i in seq_along(status)) {
  # cross-tabulate only the rows with the i-th status
  tab[[i]] <- xtabs(Critical.Y.N ~ Technical.Criterion + Object.type, b[b$Violation.status == status[i], ])
  names(dimnames(tab[[i]]))[2] <- paste("Violation.status", i)
}
For the first level, this produces:
Violation.status 1
Technical.Criterion Object.type1 Object.type2 Object.type3
TechnicalCrit1 1 0 0
TechnicalCrit2 0 0 0
TechnicalCrit3 0 0 1
Which I can now use in my shiny app.

Compare one csv file to multiple csv files and write new csv files R

I am pretty new to loops in R, so I apologise if this question has been asked elsewhere.
What I want to do: read in all 30 CSV files -> compare the species in File A with each of the 30 CSV files -> write a new CSV file for each of the 30 files containing just the matching species.
File A has one column with the names of 190 species ($name). The 30 other CSV files each have a species column ($SBSname); the number of species in $SBSname ranges from 100 to 500, with replicates (so a CSV file can have more than 190 rows). However, I don't know how to write the code that ...
This is all I have at the moment ...
I have read in all the CSV files with a loop:
csv_files = list.files(pattern = "\\.csv$")
for (i in seq_along(csv_files)) assign(csv_files[i], read.csv(csv_files[i]))
I have code for just comparing one CSV file (branching.csv) against File A:
> str(FileA)
'data.frame': 190 obs. of 1 variable:
$ name: Factor w/ 190 levels "Acaena novae zelandiae",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(branching.csv)
'data.frame': 4055 obs. of 7 variables:
$ SBSname : Factor w/ 2877 levels "Abies alba","Abies nordmanniana",..: 794 2075 1049 162 132 333 541 1840 272 1553 ...
$ SBS.number : int 16443 26711 40171 40398 40867 41151 37871 42412 35847 36245 ...
$ general.method : Factor w/ 5 levels "derivation from morphologies or other plant traits",..: 3 1 2 2 2 2 2 2 2 2 ...
$ branching : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 1 1 1 1 ...
$ valid : int 1 1 1 1 1 1 1 1 1 1 ...
$ reference : Factor w/ 6 levels "Barkman, J.J.(1988): New systems of plant growth forms and phenological plant types",..: 1 1 3 3 3 3 3 3 3 3 ...
$ original.reference: Factor w/ 97 levels "Aarssen, L.W. (1981): The biology of Canadian weeds. 50. Hypochoeris radicata L.",..: 9 9 20 3 3 3 3 3 33 33 ...
Species<-branching.csv[(branching.csv$SBSname %in% FileA$name),]
write.csv(Species, file = "Branching.csv")
> str(Species)
'data.frame': 298 obs. of 7 variables:
$ name : Factor w/ 2877 levels "Abies alba","Abies nordmanniana",..: 1049 162 1548 47 57 1647 1060 2788 2094 1976 ...
$ SBS.number : int 40171 40398 36280 40532 41629 42495 40103 32792 32892 30583 ...
$ general.method : Factor w/ 5 levels "derivation from morphologies or other plant traits",..: 2 2 2 2 2 2 2 2 2 2 ...
$ branching : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 2 ...
$ valid : int 1 1 1 1 1 1 1 1 1 1 ...
$ reference : Factor w/ 6 levels "Barkman, J.J.(1988): New systems of plant growth forms and phenological plant types",..: 3 3 3 3 3 3 3 3 3 3 ...
$ original.reference: Factor w/ 97 levels "Aarssen, L.W. (1981): The biology of Canadian weeds. 50. Hypochoeris radicata L.",..: 20 3 33 33 33 33 33 44 44 44 ...
Any help or suggestions would be great. Doesn't have to be a loop!
How about this simple loop?
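library(dplyr)
# csv_files: the names of the 30 CSV files, e.g. from list.files(pattern = "\\.csv$")
for (i in seq_along(csv_files)) {
  # keep only the rows whose SBSname matches a species name in FileA
  csv.matching <- read.csv(csv_files[i]) %>% inner_join(FileA, by = c("SBSname" = "name"))
  # e.g. branching.csv is written to branching_matching.csv
  write.csv(csv.matching, file = sub("\\.csv$", "_matching.csv", csv_files[i]), na = "")
}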
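Since it doesn't have to be a loop, here is a rough base R sketch of the same workflow that reuses the %in% filter from the question. The temporary directory, the demo file and its contents are all made up so that the example runs on its own; with real data you would point list.files() at your own folder and make sure File A itself is not among the files listed.

```r
# Hypothetical demo data, written to a fresh temporary directory
demo_dir <- tempfile("species_demo")
dir.create(demo_dir)
FileA <- data.frame(name = c("Abies alba", "Acaena novae zelandiae"))
write.csv(
  data.frame(SBSname   = c("Abies alba", "Abies alba", "Picea abies"),
             branching = c("yes", "yes", "no")),
  file.path(demo_dir, "branching.csv"),
  row.names = FALSE
)

# Process every CSV file in the directory and return the output paths
csv_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
out_files <- vapply(csv_files, function(f) {
  traits  <- read.csv(f, stringsAsFactors = FALSE)
  matched <- traits[traits$SBSname %in% FileA$name, ]  # rows whose species is in File A
  out     <- sub("\\.csv$", "_matching.csv", f)
  write.csv(matched, out, row.names = FALSE)
  out
}, character(1))
```

Unlike inner_join(), the %in% filter never duplicates rows, even if a species name appears more than once in File A.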

How to reduce a data set after subsetting it with gsub

I have a large data set, which I reduced applying gsub multiple times, basically in this form:
levels(Orders$Im) <- gsub("Osp", "OsProf", levels(Orders$Im))
I also used rbind:
DI_Reduced <- rbind(CX, OsP)
I need to apply function "tree" to the resulting data.frame, but I get an error:
tree.model <- tree(line ~ CountryCode + OrderType + Support, data=train.set)
The error is:
Error in tree(line ~ CountryCode + OrderType + Support, :
factor predictors must have at most 32 levels
Strange thing: if I export the train.set with write.csv and then I re-import it with read.csv, the error disappears and the tree is built.
I investigated the structure of the train.set and this is the difference before and after exporting/importing it:
$ CustomerNumber: Factor w/ 4616 levels "0","101959","210285",..: 3070 3069 4539 3732 2573 3086 2973 3817 4056 2956 ...
$ CountryCode : Factor w/ 4 levels "OtherCountry",..: 3 3 4 4 3 3 3 4 4 3 ...
$ OrderType : Factor w/ 5 levels "Order","NewOrder",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Support : Factor w/ 5 levels "#N/A","BN",..: 4 4 4 4 2 4 4 4 4 4 ...
$ Manuf : Factor w/ 163 levels "<Generic>","6gi",..: 52 52 52 52 52 52 52 52 52 52 ...
$ line : Factor w/ 623 levels "\"Generic\" Skews",..: 400 35 400 400 400 400 400 400 400 400 ...
________________________________________________________________
$ CustomerNumber: Factor w/ 692 levels "201500","20202",..: 361 360 680 499 138 367 315 523 592 304 ...
$ CountryCode : Factor w/ 2 levels "JP","US": 1 1 2 2 1 1 1 2 2 1 ...
$ OrderType : Factor w/ 1 level "Online": 1 1 1 1 1 1 1 1 1 1 ...
$ Support : Factor w/ 4 levels "BN","MC",..: 3 3 3 3 1 3 3 3 3 3 ...
$ Manuf : Factor w/ 1 level "DY": 1 1 1 1 1 1 1 1 1 1 ...
$ line : Factor w/ 2 levels "CX","OTX": 2 1 2 2 2 2 2 2 2 2 ...
It seems to me that gsub does not really subset the original data.frame, and the hidden values stay in train.set until I export and re-import it. Is there another way to do this operation and obtain a tree?
As the error says, your dependent variable line has more than 32 levels. Your train.set structure shows line : Factor w/ 623 levels.
Try using other tree libraries such as rpart.
Re-creating the factors after subsetting, which drops their unused levels, might help:
train.set[] <- lapply(train.set, function(x) if (is.factor(x)) factor(x) else x)
Also, gsub is not normally used for subsetting; it is a global substitution function. You should also share your pre-processing steps so that others can help you with this better.
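To illustrate why the extra levels survive, here is a small sketch with made-up data (orders and the level "ZZ" are hypothetical). Subsetting a factor keeps every original level, and base R's droplevels() removes the ones that are no longer used, which is what the write.csv/read.csv round trip was doing implicitly.

```r
# Hypothetical data: a factor with three levels
orders <- data.frame(line = factor(c("CX", "OTX", "ZZ", "CX")))

# Subsetting removes the rows, but the factor still remembers all three levels
subset_orders <- orders[orders$line != "ZZ", , drop = FALSE]
levels_before <- nlevels(subset_orders$line)

# droplevels() rebuilds every factor in the data frame with only the levels in use
cleaned <- droplevels(subset_orders)
levels_after <- nlevels(cleaned$line)

print(c(before = levels_before, after = levels_after))
```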

Get top 3 average from data in R

I have two CSV files: the first is called "CLAIM" and the second is called "CUSTOMER".
First, I wanted to merge the two data sets on their common column.
Second, I wanted to remove all records containing NA values.
Third, I wanted to remove the variables SIU_CUST_YN, CTPR, OCCP_GRP_2, RECP_DATE and RESN_DATE.
Fourth, I wanted to remove the rows where OCCP_GRP_1 is empty.
The expected result is:
dim(data_fin)
## [1] 114886 11
head(data_fin)
## CUST_ID DIVIDED_SET SEX AGE OCCP_GRP_1 CHLD_CNT WEDD_YN CHANG_FP_YN
## 1 1 1 2 47 3.사무직 2 Y Y
## 2 1 1 2 47 3.사무직 2 Y Y
## 3 1 1 2 47 3.사무직 2 Y Y
## 4 1 1 2 47 3.사무직 2 Y Y
## 5 2 1 1 53 3.사무직 2 Y Y
## 6 2 1 1 53 3.사무직 2 Y Y
## DMND_AMT PAYM_AMT NON_PAY_RATIO
## 1 52450 52450 0.4343986
## 2 24000 24000 0.8823529
## 3 17500 17500 0.7272727
## 4 47500 47500 0.9217391
## 5 99100 99100 0.8623195
## 6 7817 7500 0.8623195
str(data_fin)
## 'data.frame': 114886 obs. of 11 variables:
## $ CUST_ID : int 1 1 1 1 2 2 2 3 4 4 ...
## $ DIVIDED_SET : int 1 1 1 1 1 1 1 1 1 1 ...
## $ SEX : int 2 2 2 2 1 1 1 1 2 2 ...
## $ AGE : int 47 47 47 47 53 53 53 60 64 64 ...
## $ OCCP_GRP_1 : Factor w/ 9 levels "","1.주부","2.자영업",..: 4 4 4 4 4 4 4 6 3 3 ...
## $ CHLD_CNT : int 2 2 2 2 2 2 2 0 0 0 ...
## $ WEDD_YN : Factor w/ 3 levels "","N","Y": 3 3 3 3 3 3 3 2 2 2 ...
## $ CHANG_FP_YN : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 1 2 ...
## $ DMND_AMT : int 52450 24000 17500 47500 99100 7817 218614 430000 200000 120000 ...
## $ PAYM_AMT : int 52450 24000 17500 47500 99100 7500 218614 430000 200000 120000 ...
## $ NON_PAY_RATIO: num 0.434 0.882 0.727 0.922 0.862 ...
So I wrote the following code:
#gc(reset=T); rm(list=ls())
getwd()
setwd("/Users/Hong/Downloads")
getwd()
CUSTOMER <- read.csv("CUSTOMER.csv", header=T)
CLAIM <- read.csv("CLAIM.csv", header=T)
#install.packages("dplyr")
library("dplyr")
merge(CUSTOMER, CLAIM, by='CUST_ID', all.y=TRUE)
merged_data <- merge(CUSTOMER, CLAIM)
omitted_data <- na.omit(merged_data)
deducted_data <- head(select(omitted_data, -SIU_CUST_YN, -CTPR, -OCCP_GRP_2, -RECP_DATE, -RESN_DATE), 115327)
data_fin <- head(filter(deducted_data, OCCP_GRP_1 !=""), 115327)
dim(data_fin)
head(data_fin)
str(data_fin)
Next,
1) I need to get the top 3 OCCP_GRP_1 groups with the highest NON_PAY_RATIO.
2) I need to get the CUST_ID values whose DMND_AMT is over 600,000.
I don't know how to write this.
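One possible approach in base R, sketched on a tiny made-up data_fin (the values and group labels below are invented for illustration). It assumes "top 3" means the three OCCP_GRP_1 groups with the highest mean NON_PAY_RATIO, as the title suggests, and that "over 600,000" means strictly greater than 600000.

```r
# Hypothetical miniature version of data_fin
data_fin <- data.frame(
  CUST_ID       = c(1, 1, 2, 3, 4, 5),
  OCCP_GRP_1    = c("A", "A", "B", "C", "D", "D"),
  DMND_AMT      = c(52450, 700000, 99100, 650000, 1000, 2000),
  NON_PAY_RATIO = c(0.4, 0.6, 0.9, 0.2, 0.7, 0.8)
)

# 1) Mean NON_PAY_RATIO per occupation group, then the three largest
mean_ratio <- tapply(data_fin$NON_PAY_RATIO, data_fin$OCCP_GRP_1, mean)
top3 <- head(sort(mean_ratio, decreasing = TRUE), 3)
print(top3)

# 2) Customers with at least one claim whose DMND_AMT exceeds 600,000
big_claims <- unique(data_fin$CUST_ID[data_fin$DMND_AMT > 600000])
print(big_claims)
```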

Error in Cross Validation in GLMNET package R for Binomial Target Variable

This is in reference to https://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome. I am trying to use cross-validation in glmnet (i.e. cv.glmnet) for a binomial target variable. glmnet works fine, but cv.glmnet throws an error. Here is the error log:
Error in storage.mode(y) = "double" : invalid to change the storage mode of a factor
In addition: Warning messages:
1: In Ops.factor(x, w) : ‘*’ not meaningful for factors
2: In Ops.factor(y, ybar) : ‘-’ not meaningful for factors
Data Types:
'data.frame': 490 obs. of 13 variables:
$ loan_id : Factor w/ 614 levels "LP001002","LP001003",..: 190 381 259 310 432 156 179 24 429 408 ...
$ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
$ married : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 1 ...
$ dependents : Factor w/ 4 levels "0","1","2","3+": 1 1 1 3 1 4 2 3 1 1 ...
$ education : Factor w/ 2 levels "Graduate","Not Graduate": 1 1 1 2 1 1 1 2 1 2 ...
$ self_employed : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ applicantincome : int 9328 3333 14683 7667 6500 39999 3750 3365 2920 2213 ...
$ coapplicantincome: num 0 2500 2100 0 0 ...
$ loanamount : int 188 128 304 185 105 600 116 112 87 66 ...
$ loan_amount_term : Factor w/ 10 levels "12","36","60",..: 6 9 9 9 9 6 9 9 9 9 ...
$ credit_history : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ property_area : Factor w/ 3 levels "Rural","Semiurban",..: 1 2 1 1 1 2 2 1 1 1 ...
$ loan_status : Factor w/ 2 levels "0","1": 2 2 1 2 1 2 2 1 2 2 ...
Code used:
xfactors<-model.matrix(loan_status ~ gender+married+dependents+education+self_employed+loan_amount_term+credit_history+property_area,data=data_train)[,-1]
x<-as.matrix(data.frame(applicantincome,coapplicantincome,loanamount,xfactors))
glmmod<-glmnet(x,y=as.factor(loan_status),alpha=1,family='binomial')
plot(glmmod,xvar="lambda")
grid()
cv.glmmod <- cv.glmnet(x,y=loan_status,alpha=1) #This Is Where It Throws The Error
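Credit for the answer goes to @user20650.
I suspect you need to pass family to cv.glmnet as well. An example:
library(glmnet)
x <- model.matrix(am ~ 0 + . , data=mtcars)
# without family, cv.glmnet falls back to the default gaussian family and fails on a factor response
cv.glmnet(x, y=factor(mtcars$am), alpha=1)
# specifying family="binomial" lets cv.glmnet accept the factor response
cv.glmnet(x, y=factor(mtcars$am), alpha=1, family="binomial")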
