Get top 3 average from data in R - r

The first CSV file is called "CLAIM" and the second CSV file is called "CUSTOMER".
First, I wanted to merge the two data sets on their common column.
Second, I wanted to remove the entries that contain NA values.
Third, I wanted to remove the variables SIU_CUST_YN, CTPR, OCCP_GRP_2, RECP_DATE and RESN_DATE.
Fourth, I wanted to remove the rows where OCCP_GRP_1 is empty.
The expected result is:
dim(data_fin)
## [1] 114886 11
head(data_fin)
## CUST_ID DIVIDED_SET SEX AGE OCCP_GRP_1 CHLD_CNT WEDD_YN CHANG_FP_YN
## 1 1 1 2 47 3.사무직 2 Y Y
## 2 1 1 2 47 3.사무직 2 Y Y
## 3 1 1 2 47 3.사무직 2 Y Y
## 4 1 1 2 47 3.사무직 2 Y Y
## 5 2 1 1 53 3.사무직 2 Y Y
## 6 2 1 1 53 3.사무직 2 Y Y
## DMND_AMT PAYM_AMT NON_PAY_RATIO
## 1 52450 52450 0.4343986
## 2 24000 24000 0.8823529
## 3 17500 17500 0.7272727
## 4 47500 47500 0.9217391
## 5 99100 99100 0.8623195
## 6 7817 7500 0.8623195
str(data_fin)
## 'data.frame': 114886 obs. of 11 variables:
## $ CUST_ID : int 1 1 1 1 2 2 2 3 4 4 ...
## $ DIVIDED_SET : int 1 1 1 1 1 1 1 1 1 1 ...
## $ SEX : int 2 2 2 2 1 1 1 1 2 2 ...
## $ AGE : int 47 47 47 47 53 53 53 60 64 64 ...
## $ OCCP_GRP_1 : Factor w/ 9 levels "","1.주부","2.자영업",..: 4 4 4 4 4 4 4 6 3 3 ...
## $ CHLD_CNT : int 2 2 2 2 2 2 2 0 0 0 ...
## $ WEDD_YN : Factor w/ 3 levels "","N","Y": 3 3 3 3 3 3 3 2 2 2 ...
## $ CHANG_FP_YN : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 1 2 ...
## $ DMND_AMT : int 52450 24000 17500 47500 99100 7817 218614 430000 200000 120000 ...
## $ PAYM_AMT : int 52450 24000 17500 47500 99100 7500 218614 430000 200000 120000 ...
## $ NON_PAY_RATIO: num 0.434 0.882 0.727 0.922 0.862 ...
So I wrote the following code:
# gc(reset = TRUE); rm(list = ls())
setwd("/Users/Hong/Downloads")
CUSTOMER <- read.csv("CUSTOMER.csv", header = TRUE)
CLAIM <- read.csv("CLAIM.csv", header = TRUE)
# install.packages("dplyr")
library(dplyr)
# Merge on all shared column names; the result has to be assigned,
# otherwise merge() only prints it
merged_data <- merge(CUSTOMER, CLAIM)
# Drop rows that contain NA
omitted_data <- na.omit(merged_data)
# Drop the unwanted variables (head() caps the result at 115327 rows)
deducted_data <- head(select(omitted_data, -SIU_CUST_YN, -CTPR, -OCCP_GRP_2, -RECP_DATE, -RESN_DATE), 115327)
# Drop rows where OCCP_GRP_1 is empty
data_fin <- head(filter(deducted_data, OCCP_GRP_1 != ""), 115327)
dim(data_fin)
head(data_fin)
str(data_fin)
Next,
1) I need to get the top 3 OCCP_GRP_1 groups with the highest average NON_PAY_RATIO.
2) I need to get the CUST_ID values whose DMND_AMT is over 600,000.
I don't know how to write this.
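One possible way to approach both steps in base R is sketched below. The small data_fin here is a hypothetical stand-in with made-up values, and "over 600,000" is assumed to mean a single claim with DMND_AMT strictly greater than 600000 (not a per-customer total).

```r
# Hypothetical stand-in for data_fin, with only the columns used below
data_fin <- data.frame(
  CUST_ID       = c(1, 1, 2, 3, 4, 5),
  OCCP_GRP_1    = c("A", "A", "B", "C", "D", "D"),
  DMND_AMT      = c(52450, 700000, 99100, 650000, 7817, 600000),
  NON_PAY_RATIO = c(0.4, 0.6, 0.9, 0.2, 0.7, 0.5),
  stringsAsFactors = FALSE
)

# 1) Mean NON_PAY_RATIO per occupation group, sorted in descending order, top 3
avg_ratio <- aggregate(NON_PAY_RATIO ~ OCCP_GRP_1, data = data_fin, FUN = mean)
top3 <- head(avg_ratio[order(-avg_ratio$NON_PAY_RATIO), ], 3)

# 2) Customers with at least one claim whose DMND_AMT exceeds 600,000
big_claim_ids <- unique(data_fin$CUST_ID[data_fin$DMND_AMT > 600000])
```

With dplyr already loaded, the same idea could be written with group_by(), summarise() and arrange(); the base R version is used here only to keep the sketch free of dependencies.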

Related

R - Weighted counting

I have this df
ZONA ID_DOM FE_DOM NO_MORAD
1 1 00010001 15.41667 2
2 1 00010001 15.41667 2
3 1 00010001 15.41667 2
4 1 00010001 15.41667 2
5 1 00010001 15.41667 2
6 1 00010002 15.41667 4
...
99 00010994 16.68444 5
Currently, I'm counting the number of ID_DOM using the weight variable FE_DOM with this code.
count(OD_2017[!duplicated(OD_2017$ID_DOM),],
wt = FE_DOM,
Zonas = ZONA,
name = "N_domicilios")
which returns
Zonas N_domicilios
<int> <dbl>
1 1 1151.
2 2 2342.
3 3 7100.
4 4 12588.
5 5 8050.
6 6 9411.
However, I want this data to be grouped by NO_MORAD as well, something like
Zonas 1Mor 2Mor ... 99Mor
1 50 78 ... 78
2 x y ... z
...
99 99 99 ... 99
Can anyone help me with this?
Thanks
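A minimal sketch of one way to get that layout in base R: keep one row per ID_DOM as before, then let xtabs() sum the weights for every ZONA / NO_MORAD combination. The miniature OD_2017 below is hypothetical and its values are made up.

```r
# Hypothetical miniature version of OD_2017
OD_2017 <- data.frame(
  ZONA     = c(1, 1, 1, 1, 2, 2),
  ID_DOM   = c("00010001", "00010001", "00010002", "00010003",
               "00020001", "00020002"),
  FE_DOM   = c(15.5, 15.5, 10, 20, 30, 5),
  NO_MORAD = c(2, 2, 4, 2, 4, 4),
  stringsAsFactors = FALSE
)

# One row per household, as in the original count() call
households <- OD_2017[!duplicated(OD_2017$ID_DOM), ]

# Sum of FE_DOM for each zone (rows) and number of residents (columns)
wide <- xtabs(FE_DOM ~ ZONA + NO_MORAD, data = households)

# Optional: turn the contingency table into an ordinary data frame
wide_df <- as.data.frame.matrix(wide)
```

Combinations that do not occur in the data come out as 0 rather than NA.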

'Object not found' error even though table() verifies the object is in the data set

I've read through others who have had a similar issue, but my situation doesn't seem to be the same as the fixes that have been proposed for those other issues. I'm trying to recode a variable using a conditional statement. I want to take a character string & turn it into a numeric so I can subset those observations out into a new data frame. Here's what I have, so far:
blad_mor <- read.csv("blad_mor.csv", header = T)
str(blad_mor)
blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod)
I get this output for the str() command:
> str(blad_mor)
'data.frame': 127073 obs. of 12 variables:
$ year : int 1999 1999 1999 1999 1999 1999 1999 1999 1999 1999 ...
$ sex : Factor w/ 4 levels "1","2","F","M": 1 1 1 2 1 2 2 2 2 2 ...
$ race : Factor w/ 17 levels "America","Asian &",..: 4 4 4 4 4 4 4 4 4 4 ...
$ county : Factor w/ 79 levels "COUNTY1","COUNTY2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ cod : Factor w/ 327 levels "C001","C005",..: 89 108 108 294 63 42 172 74 85 269 ...
$ fips : int 1 1 1 1 1 1 1 1 1 1 ...
$ state : int 5 5 5 5 5 5 5 5 5 5 ...
$ race_code : int 2 2 2 2 2 2 2 2 2 2 ...
$ ethnicity : Factor w/ 4 levels "","Hispanic",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ethnicity_code: int NA NA NA NA NA NA NA NA NA NA ...
But when I try the blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod) code I get this error:
> blad_mor_recode <- gsub(C670:C679, 29010, blad_mor$cod)
Error in gsub(C670:C679, 29010, blad_mor$cod) : object 'C670' not found
So, I verify that there actually is that object by table(blad_mor$cod) with this being some of the output:
C578 C579 C58 C60 C601 C609 C61 C629 C631 C639 C64 C65 C66 C670 C672 C674 C675 C676
2 43 4 1 1 53 6162 62 1 14 2911 30 47 1 4 1 1 2
C677 C678 C679 C680 C689 C690 C692 C693 C694 C695 C696 C699 C700 C701 C709 C71 C710 C711
1 4 2776 35 77 1 4 5 1 1 8 45 7 3 11 1 29 34
The object 'C670' has one instance as per this output, yet R is telling me it is not there & doesn't run the command. What am I missing here? Should I change the class type from factor to something else? I'm quite confused.
Edit: I have tried quotes around the character strings (e.g. blad_mor_recode <- gsub('C670:C679', '29010', blad_mor$cod)) as well as ifelse(). I still get the same error message.
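Unquoted, C670 is read as the name of a variable, not as a value inside blad_mor$cod, which is why R reports that the object is not found. If you want to change all strings from C70 to C79 you have to use a regex. Something like the following would work: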
blad_mor_recode <- gsub("C7[0-9]", "29010", blad_mor$cod)
A simple example:
gsub("C7[0-9]","",c("C60","C70","C78"))
[1] "C60" "" ""
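Note that the pattern "C7[0-9]" covers C70 to C79 and also matches inside longer codes such as C701. For the range in the question (C670 to C679), an anchored pattern is probably closer to what is wanted. The following is only a sketch; cod is a hypothetical sample standing in for blad_mor$cod.

```r
# Hypothetical sample of codes like those in blad_mor$cod
cod <- factor(c("C66", "C670", "C679", "C680", "C701"))

# Anchored pattern: exactly "C67" followed by a single digit
recoded <- sub("^C67[0-9]$", "29010", as.character(cod))

# If the goal is only to subset, a logical flag is enough
in_range <- grepl("^C67[0-9]$", cod)
selected <- cod[in_range]
```

The recoded vector is still character; rows can be subset directly with in_range, without converting anything to numeric.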

Subset list with For cycle

I have to subset this "plt" list.
"Plt" is a list of GPS points, with date and time.
"Labels" is a list of all trips in the day, with start and end time.
I would like to take the value in row 1 of labels$Start and the value in row 1 of labels$End, look these values up in the plt$Data_Time column, and subset all rows between the Start value and the End value.
> str(labels)
'data.frame': 10 obs. of 8 variables:
$ Date_ST: Factor w/ 5 levels "2008/04/28","2008/04/29",..: 1 1 2 2 3 3 4 4 5 5
$ Time_ST: Factor w/ 15 levels "01:27:05","01:33:29",..: 13 15 4 10 1 7 8 12 2 11
$ Date_ET: Factor w/ 5 levels "2008/04/28","2008/04/29",..: 1 1 2 2 3 3 4 4 5 5
$ Time_ET: Factor w/ 15 levels "01:35:25","01:41:11",..: 13 15 3 10 1 5 6 12 2 9
$ Mode : Factor w/ 2 levels "subway","walk": 2 2 2 2 2 2 2 2 2 2
$ ID : int 1 3 4 6 7 9 10 12 13 15
$ Start : chr "2008/04/28 11:27:42" "2008/04/28 11:42:56" "2008/04/29 01:38:21" "2008/04/29 01:57:55" ...
$ End : chr "2008/04/28 11:27:58" "2008/04/28 11:50:10" "2008/04/29 01:41:28" "2008/04/29 02:03:28" ...
> str(plt)
'data.frame': 4377 obs. of 9 variables:
$ Lat : num 40.1 40.1 40.1 40.1 40.1 ...
$ Long : num 116 116 116 116 116 ...
$ X0 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Alt : int 492 492 491 491 491 490 490 490 489 489 ...
$ n.days : num 39589 39589 39589 39589 39589 ...
$ Date : Factor w/ 5 levels "2008-05-21","2008-04-28",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Time : Factor w/ 2955 levels "01:33:29","01:33:30",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Data_Time: chr "2008-05-21 01:33:29" "2008-05-21 01:33:30" "2008-05-21 01:33:31" "2008-05-21 01:33:33" ...
head(plt)
Lat Long X0 Alt n.days Date Time ID Data_Time
1 40.07045 116.3130 0 492 39589.06 2008-05-21 01:33:29 1 2008-05-21 01:33:29
2 40.07045 116.3133 0 492 39589.06 2008-05-21 01:33:30 2 2008-05-21 01:33:30
3 40.07050 116.3131 0 491 39589.06 2008-05-21 01:33:31 3 2008-05-21 01:33:31
4 40.07052 116.3130 0 491 39589.06 2008-05-21 01:33:33 4 2008-05-21 01:33:33
5 40.07050 116.3129 0 491 39589.06 2008-05-21 01:33:35 5 2008-05-21 01:33:35
6 40.07047 116.3129 0 490 39589.07 2008-05-21 01:33:37 6 2008-05-21 01:33:37
labels
Date_ST Time_ST Date_ET Time_ET Mode ID Start End
1 2008/04/28 11:27:42 2008/04/28 11:27:58 walk 1 2008/04/28 11:27:42 2008/04/28 11:27:58
3 2008/04/28 11:42:56 2008/04/28 11:50:10 walk 3 2008/04/28 11:42:56 2008/04/28 11:50:10
4 2008/04/29 01:38:21 2008/04/29 01:41:28 walk 4 2008/04/29 01:38:21 2008/04/29 01:41:28
6 2008/04/29 01:57:55 2008/04/29 02:03:28 walk 6 2008/04/29 01:57:55 2008/04/29 02:03:28
7 2008/05/12 01:27:05 2008/05/12 01:35:25 walk 7 2008/05/12 01:27:05 2008/05/12 01:35:25
9 2008/05/12 01:51:11 2008/05/12 01:55:35 walk 9 2008/05/12 01:51:11 2008/05/12 01:55:35
I need to do it for each row, so I have thought to use a for cycle.
In the end, I want to keep only the column 1 and 2 (Lat and Long).
for(i in 1:nrow(labels)) {
a = labels$Start[i] # take the start/end time of the trip
b = labels$End[i]
k = plt[plt$Data_Time >= a & plt$Data_Time < b, ]
LatLong = k[1:2]
head(LatLong)
write.table(LatLong, "~/Desktop/LatLongTrip.txt", sep="\t")
}
Unfortunately, the result is:
> k = plt[plt$Data_Time >= b & plt$Data_Time < a, ]
> k
[1] Lat Long X0 Alt n.days Date Time ID Data_Time
<0 rows> (or 0-length row.names)
In reality, there are some rows between these two values, could you help me, please?
You don't need a for cycle :)
Here:
First, make sure the sqldf package is installed and loaded.
Then, setting up a mock data example:
fechasInicioYFin <- data.frame(
fechasInicio = as.POSIXct(c('2016-08-19 10:00','2016-08-25 15:00','2016-09-15 15:00','2016-07-20 11:00')),
fechasFin = as.POSIXct(c('2016-08-19 14:00','2016-08-25 18:00','2016-09-15 19:00','2016-07-20 16:00'))
)
dataConFecha <- data.frame(num1 = c(1,2,3,4,5,6), num2 = c(11:16),
fechas = as.POSIXct(c('2016-08-19 12:00','2016-08-25 16:00','2016-09-15 16:00','2016-07-20 13:00',
'2016-08-19 13:00','2016-09-15 17:00'))
)
Now just join them on the date column and select only the columns that you are interested in:
sqldf("select a.*,b.fechasInicio,b.fechasFin from dataConFecha as a join fechasInicioYFin as b on
a.fechas between b.fechasInicio and b.fechasFin")
**using the "between" SQL statement instead of >= and <=, as suggested by @G. Grothendieck
The output should be something like this:
As you can see, the data is now basically grouped by beginning date and ending date.
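For the original labels and plt data, a base R sketch without sqldf could look like the following. It rests on two assumptions: that one likely reason for the empty result is that labels$Start uses "/" as the date separator while plt$Data_Time uses "-", so the comparison happens on text rather than on times, and that all timestamps can be treated as UTC. The two small data frames are hypothetical stand-ins with made-up values.

```r
# Hypothetical miniature versions of labels and plt
labels <- data.frame(
  ID    = c(1, 3),
  Start = c("2008/04/28 11:27:42", "2008/04/28 11:42:56"),
  End   = c("2008/04/28 11:27:58", "2008/04/28 11:50:10"),
  stringsAsFactors = FALSE
)
plt <- data.frame(
  Lat  = c(40.1, 40.2, 40.3, 40.4),
  Long = c(116.1, 116.2, 116.3, 116.4),
  Data_Time = c("2008-04-28 11:27:45", "2008-04-28 11:30:00",
                "2008-04-28 11:45:00", "2008-04-28 11:50:10"),
  stringsAsFactors = FALSE
)

# Parse both sides to POSIXct so the comparison is on times, not on strings
start <- as.POSIXct(labels$Start, format = "%Y/%m/%d %H:%M:%S", tz = "UTC")
end   <- as.POSIXct(labels$End,   format = "%Y/%m/%d %H:%M:%S", tz = "UTC")
pt    <- as.POSIXct(plt$Data_Time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# One data frame of Lat/Long per trip: Start <= time < End
trips <- lapply(seq_len(nrow(labels)), function(i) {
  plt[pt >= start[i] & pt < end[i], c("Lat", "Long")]
})
names(trips) <- labels$ID
```

Keeping the trips in a list also avoids the problem in the original loop, where every iteration writes to the same file and overwrites the previous trip; each element could be written to its own file afterwards.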

Extracting complete dataframe from Hmisc package in R

I used aregImpute to impute the missing values, then I used the impute.transcan function to try to get the complete data set, with the following code.
impute_arg <- aregImpute(~ age + job + marital + education + default +
balance + housing + loan + contact + day + month + duration + campaign +
pdays + previous + poutcome + y , data = mov.miss, n.impute = 10 , nk =0)
imputed <- impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE)
y <- completed[names(imputed)]
When I run str(y), I do get a data frame, but it still contains NAs, as if nothing had been imputed. My question is: how do I get the complete data set without NAs after imputation?
str(y)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 NA 35 30 NA 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 NA 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 1 1 ...
$ balance : int NA 4789 1350 1476 0 747 307 147 NA -88 ...
$ housing : Factor w/ 2 levels "no","yes": NA 2 2 2 NA 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 NA 1 1 NA 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 NA 1 ...
$ day : int 19 NA 16 3 5 23 14 6 14 NA ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 NA 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 NA ...
$ pdays : int -1 339 330 NA -1 176 330 -1 -1 NA ...
$ previous : int 0 4 NA 0 NA 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I have tested your code myself, and it works just fine, except for the last line:
y <- completed[names(imputed)]
I believe there's a typo in the above line. Plus, you do not need completed at all.
Besides, if you want to get a data.frame from the impute.transcan function, then wrap it with as.data.frame:
imputed <- as.data.frame(impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE))
Moreover, if you need to test your missing data pattern, you can also use the md.pattern function provided by the mice package.

How to tidyr data from a dataframe (wikipedia internal links)?

I am working on Wikipedia internal links using the WikipediR package.
I am looking for internal links to the page Hérodote (in French).
install.packages("WikipediR")
library (WikipediR)
all_bls <- page_backlinks("fr","wikipedia",
page = "Hérodote",
clean_response = TRUE)
all_bls_df <- as.data.frame(all_bls) # convert to a data frame
My result:
str(all_bls_df)
## 'data.frame': 3 obs. of 50 variables:
## $ structure.c..60....0....Attributs.du.pharaon.....Names...c..pageid... : Factor w/ 3 levels "0","60","Attributs du pharaon": 2 1 3
## $ structure.c..133....0....Apis.....Names...c..pageid....ns....title. : Factor w/ 3 levels "0","133","Apis": 2 1 3
## $ structure.c..152....0....Anthropologie.....Names...c..pageid... : Factor w/ 3 levels "0","152","Anthropologie": 2 1 3
## $ structure.c..159....0....Asie.....Names...c..pageid....ns....title. : Factor w/ 3 levels "0","159","Asie": 2 1 3
## $ structure.c..325....0....Ahmôsis.II.....Names...c..pageid... : Factor w/ 3 levels "0","325","Ahmôsis II": 2 1 3
## $ structure.c..412....0....Bastet.....Names...c..pageid....ns... : Factor w/ 3 levels "0","412","Bastet": 2 1 3
## $ structure.c..542....0....Corse.....Names...c..pageid....ns... : Factor w/ 3 levels "0","542","Corse": 2 1 3
## $ structure.c..715....0....Cyclades.....Names...c..pageid....ns... : Factor w/ 3 levels "0","715","Cyclades": 2 1 3
## (goes on for 42 more variables)
How can I tidy my data.frame object?
Expected result:
pageid title
60 Attributs du pharaon
133 Apis
152 Anthropologie
159 Asie
The function you're using returns named character vectors in a list. We can use purrr::map_df() with as.list(). map_df() will execute the as.list() on each element in your all_bls list and automagically row-bind them into a data frame:
purrr::map_df(all_bls, as.list)
## # A tibble: 50 × 3
## pageid ns title
## <chr> <chr> <chr>
## 1 60 0 Attributs du pharaon
## 2 133 0 Apis
## 3 152 0 Anthropologie
## 4 159 0 Asie
## 5 325 0 Ahmôsis II
## 6 412 0 Bastet
## 7 542 0 Corse
## 8 715 0 Cyclades
## 9 734 0 Culte à mystères
## 10 821 0 Chamanisme
## # ... with 40 more rows
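If purrr is not available, roughly the same result can be sketched in base R. The all_bls below is a hypothetical three-element stand-in with the same shape as the real result (a list of named character vectors); it is not output from the API.

```r
# Hypothetical stand-in for all_bls: a list of named character vectors
all_bls <- list(
  c(pageid = "60",  ns = "0", title = "Attributs du pharaon"),
  c(pageid = "133", ns = "0", title = "Apis"),
  c(pageid = "152", ns = "0", title = "Anthropologie")
)

# Stack the vectors into a character matrix, then convert to a data frame
bls_df <- as.data.frame(do.call(rbind, all_bls), stringsAsFactors = FALSE)

# Keep only the two columns from the expected result, with a numeric id
bls_df$pageid <- as.integer(bls_df$pageid)
bls_df <- bls_df[, c("pageid", "title")]
```

This relies on every element having the same names in the same order, which appears to be the case in the output shown above.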
