Subset list with For cycle - r

I have to subset this "plt" list.
"Plt" is a list of GPS points, with date and time.
"Labels" is a list of all trips in the day, with start and end time.
I would take the point in row 1 from labels$Start and the point in row 1 from labels$End, search these values in plt$Data_Time column and subset all rows between the Start value and the End value.
> str(labels)
'data.frame': 10 obs. of 8 variables:
$ Date_ST: Factor w/ 5 levels "2008/04/28","2008/04/29",..: 1 1 2 2 3 3 4 4 5 5
$ Time_ST: Factor w/ 15 levels "01:27:05","01:33:29",..: 13 15 4 10 1 7 8 12 2 11
$ Date_ET: Factor w/ 5 levels "2008/04/28","2008/04/29",..: 1 1 2 2 3 3 4 4 5 5
$ Time_ET: Factor w/ 15 levels "01:35:25","01:41:11",..: 13 15 3 10 1 5 6 12 2 9
$ Mode : Factor w/ 2 levels "subway","walk": 2 2 2 2 2 2 2 2 2 2
$ ID : int 1 3 4 6 7 9 10 12 13 15
$ Start : chr "2008/04/28 11:27:42" "2008/04/28 11:42:56" "2008/04/29 01:38:21" "2008/04/29 01:57:55" ...
$ End : chr "2008/04/28 11:27:58" "2008/04/28 11:50:10" "2008/04/29 01:41:28" "2008/04/29 02:03:28" ...
> str(plt)
'data.frame': 4377 obs. of 9 variables:
$ Lat : num 40.1 40.1 40.1 40.1 40.1 ...
$ Long : num 116 116 116 116 116 ...
$ X0 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Alt : int 492 492 491 491 491 490 490 490 489 489 ...
$ n.days : num 39589 39589 39589 39589 39589 ...
$ Date : Factor w/ 5 levels "2008-05-21","2008-04-28",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Time : Factor w/ 2955 levels "01:33:29","01:33:30",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Data_Time: chr "2008-05-21 01:33:29" "2008-05-21 01:33:30" "2008-05-21 01:33:31" "2008-05-21 01:33:33" ...
head(plt)
Lat Long X0 Alt n.days Date Time ID Data_Time
1 40.07045 116.3130 0 492 39589.06 2008-05-21 01:33:29 1 2008-05-21 01:33:29
2 40.07045 116.3133 0 492 39589.06 2008-05-21 01:33:30 2 2008-05-21 01:33:30
3 40.07050 116.3131 0 491 39589.06 2008-05-21 01:33:31 3 2008-05-21 01:33:31
4 40.07052 116.3130 0 491 39589.06 2008-05-21 01:33:33 4 2008-05-21 01:33:33
5 40.07050 116.3129 0 491 39589.06 2008-05-21 01:33:35 5 2008-05-21 01:33:35
6 40.07047 116.3129 0 490 39589.07 2008-05-21 01:33:37 6 2008-05-21 01:33:37
labels
Date_ST Time_ST Date_ET Time_ET Mode ID Start End
1 2008/04/28 11:27:42 2008/04/28 11:27:58 walk 1 2008/04/28 11:27:42 2008/04/28 11:27:58
3 2008/04/28 11:42:56 2008/04/28 11:50:10 walk 3 2008/04/28 11:42:56 2008/04/28 11:50:10
4 2008/04/29 01:38:21 2008/04/29 01:41:28 walk 4 2008/04/29 01:38:21 2008/04/29 01:41:28
6 2008/04/29 01:57:55 2008/04/29 02:03:28 walk 6 2008/04/29 01:57:55 2008/04/29 02:03:28
7 2008/05/12 01:27:05 2008/05/12 01:35:25 walk 7 2008/05/12 01:27:05 2008/05/12 01:35:25
9 2008/05/12 01:51:11 2008/05/12 01:55:35 walk 9 2008/05/12 01:51:11 2008/05/12 01:55:35
I need to do it for each row, so I have thought to use a for cycle.
In the end, I want to keep only the column 1 and 2 (Lat and Long).
for(i in 1:nrow(labels)) {
a = labels$Start[i] #prendo coord inizio/fine percorso
b = labels$End[i]
k = plt[plt$Data_Time >= a & plt$Data_Time < b, ]
LatLong = k[1:2]
head(LatLong)
write.table(LatLong, "~/Desktop/LatLongTrip.txt", sep="\t")
Unfortunately, the result is:
> k = plt[plt$Data_Time >= b & plt$Data_Time < a, ]
> k
[1] Lat Long X0 Alt n.days Date Time ID Data_Time
<0 rows> (or 0-length row.names)
In reality, there are some rows between these two values, could you help me, please?

You don't need a for cycle :)
Here:
First make sure to have library sqldf
Then, setting up a mock data example:
fechasInicioYFin <- data.frame(
fechasInicio = as.POSIXct(c('2016-08-19 10:00','2016-08-25 15:00','2016-09-15 15:00','2016-07-20 11:00')),
fechasFin = as.POSIXct(c('2016-08-19 14:00','2016-08-25 18:00','2016-09-15 19:00','2016-07-20 16:00'))
)
dataConFecha <- data.frame(num1 = c(1,2,3,4,5,6), num2 = c(11:16),
fechas = as.POSIXct(c('2016-08-19 12:00','2016-08-25 16:00','2016-09-15 16:00','2016-07-20 13:00',
'2016-08-19 13:00','2016-09-15 17:00'))
)
Now just join them by the date column and select only the columns that you are interested:
sqldf("select a.*,b.fechasInicio,b.fechasFin from dataConFecha as a join fechasInicioYFin as b on
a.fechas between b.fechasInicio and b.fechasFin")
**using "between" sql statement instead of >= and <=, as suggested by #G. Grothendieck
The output should be something like this:
As you can see, the data is now basically grouped by beginning date and ending date.

Related

Access frequencies of an atomic vector in a tibble data frame

I am doing Exploratory Data Analysis on a tibble data frame. I've never used tibble so I'm experiecing some difficulties.
My tibble data frame has this structure:
spec_tbl_df [7,397 x 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ X1 : num [1:7397] 9617 12179 9905 5745 10067 ...
$ Administrative : num [1:7397] 5 26 4 3 7 16 4 3 2 0 ...
$ Administrative_Duration: num [1:7397] 408 1562 58 103 165 ...
$ Informational : num [1:7397] 2 9 2 0 1 3 4 5 0 0 ...
$ Informational_Duration : num [1:7397] 47.5 503.7 28.5 0 28.5 ...
$ ProductRelated : num [1:7397] 54 183 82 25 115 86 75 23 27 33 ...
$ ProductRelated_Duration: num [1:7397] 1547 9676 4729 1109 3428 ...
$ BounceRates : num [1:7397] 0 0.0111 0 0 0 ...
$ ExitRates : num [1:7397] 0.01733 0.0142 0.01454 0.00167 0.01629 ...
$ PageValues : num [1:7397] 0 19.57 9.06 61.3 4.97 ...
$ SpecialDay : num [1:7397] 0 0 0 0 0 0 0 0 0 0 ...
$ Month : Factor w/ 10 levels "Aug","Dec","Feb",..: 8 8 8 1 8 4 8 7 8 8 ...
$ OperatingSystems : Factor w/ 8 levels "1","2","3","4",..: 2 3 2 2 2 3 3 4 8 2 ...
$ Browser : Factor w/ 13 levels "1","2","3","4",..: 2 2 2 2 2 2 2 1 2 5 ...
$ Region : Factor w/ 9 levels "1","2","3","4",..: 3 2 1 6 4 8 1 1 7 3 ...
$ TrafficType : Factor w/ 19 levels "1","2","3","4",..: 2 12 2 5 10 4 2 4 2 1 ...
$ VisitorType : Factor w/ 3 levels "New_Visitor",..: 3 3 3 1 3 3 3 3 1 3 ...
$ Weekend : Factor w/ 2 levels "FALSE","TRUE": 2 1 1 1 1 1 1 1 1 1 ...
$ Revenue : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
Now if I use plot_bar to plot the cathegorical data (using DataExplorer package) I have no problem. I would like, for example, to create a boxplot for the cathegorical variable "Month" where for each month I have a boxplot showing how values are distribuited. The problem is that I can't find a way to access the frequencies. If I do the following:
boxplot(Month)
It creates a single boxplot for all the data (all the months) but it's not helpfull at all. Like this:
I would like the months on the x axis and the frequencies on the y axis and a boxplot for each month.
I've tried to "extract" the feature month, transform it to a matrix and repeat the process but it does not work.
Here is the variable montht taken alone:
> summary(x_Month)
Aug Dec Feb Jul June Mar May Nov Oct Sep
258 1034 123 259 166 1125 2014 1814 327 277
What am I missing ?
Something like this would probably work to create barplots for the frequencies of Month:
library(ggplot2)
spec_tbl_df %>%
ggplot(aes(x = Month)) +
geom_bar()

how to generate grouped result in R?

Here is my data
> str(myData)
'data.frame': 500 obs. of 12 variables:
$ PassengerId: int 1 2 5 6 7 8 9 10 11 12 ...
$ Survived : int 0 1 0 0 0 0 1 1 1 1 ...
$ Pclass : int 3 1 3 3 1 3 3 2 3 1 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 16 559 520 629 417 581 732 96 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 1 1 1 1 ...
$ Age : num 22 38 35 NA 54 2 27 14 4 58 ...
$ SibSp : int 1 1 0 0 0 3 0 1 1 0 ...
$ Parch : int 0 0 0 0 0 1 2 0 1 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 473 276 86 396 345 133 617 39 ...
$ Fare : num 7.25 71.28 8.05 8.46 51.86 ...
$ Cabin : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA NA 130 NA NA NA 146 50 ...
$ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 2 3 3 3 1 3 3 ...
I have to generate 2 results
1.grouped by title and pclass of each passenger like this
2.display table of missing age counts grouped by title and pclass like this
but when I used what I know both resulted like below
> myData$Name = as.character(myData$Name)
> table_words = table(unlist(strsplit(myData$Name, "\\s+")))
> sort(table_words [grep('\\.',names(table_words))], decreasing=TRUE)
Mr. Miss. Mrs. Master. Dr. Rev. Col. Capt. Countess. Don.
289 99 76 20 5 3 2 1 1 1
L. Mlle. Mme. Sir.
1 1 1 1
> library(stringr)
> tb = cbind(myData$Age, str_match(myData$Name, "[a-zA-Z]+\\."))
> table(tb[is.na(tb[,1]),2])
Dr. Master. Miss. Mr. Mrs.
1 3 18 62 7
basically I have to return tables not by total amount like I did above but to display by 3 different rows sorting by Pclass int which the total of 3rows would still be the same as total amount(myTitle = Pclass int 1 / 2 / 3 in 'myData')
so for example, the result of image 1 would mean that Capt. exists only 1 by int 1 unber Pclass data.
how should i sort the total amount by Pclass int 1,2,and 3?
It is hard to tell with no data provided (though I think that it comes from the Titanic dataset on Kaggle).
I think the first thing to do is to create a new factor with Title as you want to make analysis with it. I'd do something like:
# Extract title from name and make it a factor
dat$Title <- gsub(".* (.*)\\. .*$", "\\1", as.character(dat$Name))
dat$Title <- factor(dat$Title)
You'll need to check that it works with your data.
Once you have the Title factor you can use ddply from the plyr library and make the first table (grouped by Title and Pclass of each passenger):
library(plyr)
# Number of occurences
classTitle <- ddply(dat, c('Pclass', 'Title'), summarise,
count=length(Name))
# Convert to wide format
classTitle <- reshape(classTitle, idvar = "Title", timevar = "Pclass",
direction = "wide")
# Fill NA's with 0
classTitle[is.na(classTitle)] <- 0
Almost the same thing for your second requirement (display table of missing age counts grouped by Title and Pclass):
# Number of NA in Age
countNA <- ddply(dat, c('Pclass', 'Title'), summarise,
na=sum(is.na(Age)))
# Convert to wide format
countNA <- reshape(countNA, idvar = "Title", timevar = "Pclass",
direction = "wide")
# Fill NA's with 0
countNA[is.na(countNA)] <- 0

Get top 3 average from data in R

The First csv file is called "CLAIM" and these are parts of data
The second csv file is called "CUSTOMER" and these are parts of data
First, I wanted to merge two data based on the common column
Second, I wanted to remove all columns including NA value
Third, I wanted to remove the variables like 'SIU_CUST_YN, CTPR, OCCP_GRP_2, RECP_DATE, RESN_DATE'.
Fourth, I wanted to remove the empty row of OCCP_GRP_1
Expecting form is
dim(data_fin)
## [1] 114886 11
head(data_fin)
## CUST_ID DIVIDED_SET SEX AGE OCCP_GRP_1 CHLD_CNT WEDD_YN CHANG_FP_YN
## 1 1 1 2 47 3.사무직 2 Y Y
## 2 1 1 2 47 3.사무직 2 Y Y
## 3 1 1 2 47 3.사무직 2 Y Y
## 4 1 1 2 47 3.사무직 2 Y Y
## 5 2 1 1 53 3.사무직 2 Y Y
## 6 2 1 1 53 3.사무직 2 Y Y
## DMND_AMT PAYM_AMT NON_PAY_RATIO
## 1 52450 52450 0.4343986
## 2 24000 24000 0.8823529
## 3 17500 17500 0.7272727
## 4 47500 47500 0.9217391
## 5 99100 99100 0.8623195
## 6 7817 7500 0.8623195
str(data_fin)
## 'data.frame': 114886 obs. of 11 variables:
## $ CUST_ID : int 1 1 1 1 2 2 2 3 4 4 ...
## $ DIVIDED_SET : int 1 1 1 1 1 1 1 1 1 1 ...
## $ SEX : int 2 2 2 2 1 1 1 1 2 2 ...
## $ AGE : int 47 47 47 47 53 53 53 60 64 64 ...
## $ OCCP_GRP_1 : Factor w/ 9 levels "","1.주부","2.자영업",..: 4 4 4 4 4 4 4 6 3 3 ...
## $ CHLD_CNT : int 2 2 2 2 2 2 2 0 0 0 ...
## $ WEDD_YN : Factor w/ 3 levels "","N","Y": 3 3 3 3 3 3 3 2 2 2 ...
## $ CHANG_FP_YN : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 1 2 ...
## $ DMND_AMT : int 52450 24000 17500 47500 99100 7817 218614 430000 200000 120000 ...
## $ PAYM_AMT : int 52450 24000 17500 47500 99100 7500 218614 430000 200000 120000 ...
## $ NON_PAY_RATIO: num 0.434 0.882 0.727 0.922 0.862 ...
so I wrote down the code like
#gc(reset=T); rm(list=ls())
getwd()
setwd("/Users/Hong/Downloads")
getwd()
CUSTOMER <- read.csv("CUSTOMER.csv", header=T)
CLAIM <- read.csv("CLAIM.csv", header=T)
#install.packages("dplyr")
library("dplyr")
merge(CUSTOMER, CLAIM, by='CUST_ID', all.y=TRUE)
merged_data <- merge(CUSTOMER, CLAIM)
omitted_data <- na.omit(merged_data)
deducted_data <- head(select(omitted_data, -SIU_CUST_YN, -CTPR, -OCCP_GRP_2, -RECP_DATE, -RESN_DATE), 115327)
data_fin <- head(filter(deducted_data, OCCP_GRP_1 !=""), 115327)
dim(data_fin)
head(data_fin)
str(data_fin)
Next,
1) I should get top 3 (OCCP_GRP_1) that has high non_pay_ratio
2) I should get the (CUST_ID) over 600,000 of DMND_AMT Value
I don't know how to write it down

Extracting complete dataframe from Hmisc package in R

I've used aregImpute to impute the missing values then i used impute.transcan function trying to get complete dataset using the following code.
impute_arg <- aregImpute(~ age + job + marital + education + default +
balance + housing + loan + contact + day + month + duration + campaign +
pdays + previous + poutcome + y , data = mov.miss, n.impute = 10 , nk =0)
imputed <- impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE)
y <- completed[names(imputed)]
and when i used str(y) it already gives me a dataframe but with NAs as it is not imputed before, My question is how to get complete dataset without NAs after imputation?
str(y)
'data.frame': 4521 obs. of 17 variables:
$ age : int 30 NA 35 30 NA 35 36 39 41 43 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
$ marital : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
$ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 NA 2 3 1 ...
$ default : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 NA 1 1 1 ...
$ balance : int NA 4789 1350 1476 0 747 307 147 NA -88 ...
$ housing : Factor w/ 2 levels "no","yes": NA 2 2 2 NA 1 2 2 2 2 ...
$ loan : Factor w/ 2 levels "no","yes": 1 2 1 2 NA 1 1 NA 1 2 ...
$ contact : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 NA 1 ...
$ day : int 19 NA 16 3 5 23 14 6 14 NA ...
$ month : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 NA 9 9 1 ...
$ duration : int 79 220 185 199 226 141 341 151 57 313 ...
$ campaign : int 1 1 1 4 1 2 1 2 2 NA ...
$ pdays : int -1 339 330 NA -1 176 330 -1 -1 NA ...
$ previous : int 0 4 NA 0 NA 3 2 0 0 2 ...
$ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
I have tested your code myself, and it works just fine, except for the last line:
y <- completed[names(imputed)]
I believe there's a type in the above line. Plus, you do not even need the completed function.
Besides, if you want to get a data.frame from the impute.transcan function, then wrap it with as.data.frame:
imputed <- as.data.frame(impute.transcan(impute_arg, imputation=1, data=mov.miss, list.out=TRUE, pr=FALSE, check=FALSE))
Moreover, if you need to test your missing data pattern, you can also use the md.pattern function provided by the mice package.

Error in `row.names<-.data.frame using mlogit in R language

Here are the steps I'm following to do a Multinomial Linear Regression.
> z<-read.table("2008 Racedata.txt", header=TRUE, sep="\t", row.names=NULL)
> head(z)
datekey raceno horseno place winner draw winodds log_odds jwt hwt
1 2008091501 1 8 1 1 2 12.0 2.484907 128 1170
2 2008091501 1 11 2 0 3 8.6 2.151762 123 1135
3 2008091501 1 6 3 0 5 7.0 1.945910 127 1114
4 2008091501 1 12 4 0 10 23.0 3.135494 123 1018
5 2008091501 1 14 5 0 4 11.0 2.397895 113 1027
6 2008091501 1 5 6 0 14 50.0 3.912023 131 972
> x<-mlogit.data(z,choice="winner",shape="long",id.var="datekey",alt.var="horseno")
Error in `row.names<-.data.frame`(`*tmp*`, value = c("1.8", "1.11", "1.6", :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘10.2’, ‘10.4’, ‘10.8’,
‘100.7’, ‘101.12’, ‘102.1’, ‘102.3’, ‘103.2’, ‘103.4’,
‘103.6’, ‘104.12’, ‘104.3’, ‘104.9’, ‘105.1’, ‘105.5’,
‘105.6’, ‘105.8’, ‘106.11’, ‘106.12’, ‘106.13’, ‘106.7’,
‘107.10’, ‘107.14’, ‘107.3’, ‘108.12’, ‘108.2’, ‘108.6’,
‘108.9’, ‘109.1’, ‘109.14’, ‘109.7’, ‘11.12’, ‘11.5’,
‘11.9’, ‘110.2’, ‘110.3’, ‘110.4’, ‘110.9’, ‘111.1’,
‘111.7’, ‘112.12’, ‘112.3’, ‘112.6’, ‘112.8’, ‘113.10’,
‘113.13’, ‘113.7’, ‘114.12’, ‘114.2’, ‘114.9’, ‘115.10’,
‘115.13’, ‘115.5’, ‘116.11’, ‘116.6’, ‘117.14’, ‘117.3’,
‘117.7’, ‘118.1’, ‘118.13’, ‘118.2’, ‘118.9’, ‘119.10’,
‘119.5’, ‘119.6’, ‘119.8’, ‘12.1’, ‘12.10’, ‘12.3’,
‚Äò12.6‚Äô, ‚Äò120.2‚Äô, ‚Äò120.4‚Äô, ‚Äò120.7‚ [... truncated]
>
What step am I missing here? Why the duplicates in row.names?
Thanks,
Walt
Two problems.
You seem to have some problem with encoding since we are seeing lots of umlauts and accent marks in that error message. Furthernore I am wondering if that datekey column got converted into a factor class?
In this case it it referring to an error in construction of the row.names attribute of the new object, x. If you do:
with( z, table( datekey, horseno) )
... you may see an a horse with multiple entries on the same day.
Actually there were no duplicate datekey x horseno combos. Changing to factor for horseno and datekey and then switching the "long" argument to "wide" produces error free result with this result:
z$datekey <- as.character(z$datekey)
z$horseno <- as.character(z$horseno)
x<-mlogit.data(z,choice="winner",shape="wide",id.var="datekey",alt.var="horseno")
str(x)
#----------
Classes ‘mlogit.data’ and 'data.frame': 18312 obs. of 11 variables:
$ datekey : Factor w/ 733 levels "2008091501","2008091502",..: 1 1 1 1 1 1 1 1 1 1 ...
$ raceno : int 1 1 1 1 1 1 1 1 1 1 ...
$ horseno : chr "0" "1" "0" "1" ...
$ place : int 1 1 2 2 3 3 4 4 5 5 ...
$ winner : logi FALSE TRUE TRUE FALSE TRUE FALSE ...
$ draw : int 2 2 3 3 5 5 10 10 4 4 ...
$ winodds : num 12 12 8.6 8.6 7 7 23 23 11 11 ...
$ log_odds: num 2.48 2.48 2.15 2.15 1.95 ...
$ jwt : int 128 128 123 123 127 127 123 123 113 113 ...
$ hwt : int 1170 1170 1135 1135 1114 1114 1018 1018 1027 1027 ...
$ chid : num 1 1 2 2 3 3 4 4 5 5 ...
- attr(*, "index")='data.frame': 18312 obs. of 3 variables:
..$ chid: Factor w/ 9156 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
..$ alt : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 1 2 ...
..$ id : Factor w/ 733 levels "2008091501","2008091502",..: 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "choice")= chr "winner"

Resources