Checking row format of csv - r

I am trying to import some data (below) and checking to see if I have the appropriate number of rows for later analysis.
repexample <- structure(list(QueueName = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c(" Overall", "CCM4.usci_retention_eng", "usci_helpdesk"
), class = "factor"), X8Tile = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), .Label = c(" Average", "1", "2", "3", "4", "5", "6", "7",
"8"), class = "factor"), Actual = c(508.1821504, 334.6994838,
404.9048759, 469.4068667, 489.2800416, 516.5744106, 551.7966176,
601.5103783, 720.9810622, 262.4622533, 250.2777778, 264.8281938,
272.2807882, 535.2466968, 278.25, 409.9285714, 511.6635101, 553,
641, 676.1111111, 778.5517241, 886.3666667), Calls = c(54948L,
6896L, 8831L, 7825L, 5768L, 7943L, 5796L, 8698L, 3191L, 1220L,
360L, 454L, 406L, 248L, 11L, 9L, 94L, 1L, 65L, 9L, 29L, 30L),
Pop = c(41L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 3L, 1L, 1L,
1L, 11L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), .Names = c("QueueName",
"X8Tile", "Actual", "Calls", "Pop"), class = "data.frame", row.names = c(NA,
-22L))
The data gives 5 columns and is one example of some data that I would typically import (via a .csv file). As you can see there are three unique values in the column "QueueName". For each unique value in "QueueName" I want to check that it has 9 rows, or the corresponding values in the column "X8Tile" ( Average, 1, 2, 3, 4, 5, 6, 7, 8). As an example the "QueueName" Overall has all of the necessary rows, but usci_helpdesk does not.
So my first priority is to at least identify if one of the unique values in "QueueName" does not have all of the necessary rows.
My second priority would be to remove all of the rows corresponding to a unique "QueueName" that does not meet the requirements.

Both these priorities are easily addressed using the Split-Apply-Combine paradigm, implemented in the plyr package.
Priority 1: Identify values of QueueName which don't have enough rows
require(plyr)
# Make a short table of the number of rows for each unique value of QueueName
rowSummary <- ddply(repexample, .(QueueName), summarise, numRows=length(QueueName))
print(rowSummary)
If you have lots of unique values of QueueName, you'll want to identify the values which are not equal to 9:
rowSummary[rowSummary$numRows !=9, ]
Priority 2: Eliminate rows for which QueueNamedoes not have enough rows
repexample2 <- ddply(repexample, .(QueueName), transform, numRows=length(QueueName))
repexampleEdit <- repexample2[repexample2$numRows ==9, ]
print(repxampleEdit)
(I don't quite understand the meaning of 'check that it has 9 rows, or the corresponding values in the column "X8Tile"). You could edit the repexampleEdit line based on your needs.

This is an approach that makes some assumptions about how your data are ordered. It can be modified (or your data can be reordered) if the assumption doesn't fit:
## Paste together the values from your "X8tile" column
## If all is in order, you should have "Average12345678"
## If anything is missing, you won't....
myMatch <- names(
which(with(repexample, tapply(X8Tile, QueueName, FUN=function(x)
gsub("^\\s+|\\s+$", "", paste(x, collapse = ""))))
== "Average12345678"))
## Use that to subset...
repexample[repexample$QueueName %in% myMatch, ]
# QueueName X8Tile Actual Calls Pop
# 1 Overall Average 508.1822 54948 41
# 2 Overall 1 334.6995 6896 6
# 3 Overall 2 404.9049 8831 5
# 4 Overall 3 469.4069 7825 5
# 5 Overall 4 489.2800 5768 5
# 6 Overall 5 516.5744 7943 5
# 7 Overall 6 551.7966 5796 5
# 8 Overall 7 601.5104 8698 5
# 9 Overall 8 720.9811 3191 5
# 14 CCM4.usci_retention_eng Average 535.2467 248 11
# 15 CCM4.usci_retention_eng 1 278.2500 11 2
# 16 CCM4.usci_retention_eng 2 409.9286 9 2
# 17 CCM4.usci_retention_eng 3 511.6635 94 2
# 18 CCM4.usci_retention_eng 4 553.0000 1 1
# 19 CCM4.usci_retention_eng 5 641.0000 65 1
# 20 CCM4.usci_retention_eng 6 676.1111 9 1
# 21 CCM4.usci_retention_eng 7 778.5517 29 1
# 22 CCM4.usci_retention_eng 8 886.3667 30 1
Similar approaches can be taken with aggregate+merge and similar tools.

Related

How can create my own factor column in a dataframe?

I have dataframe and task:"Define your own criterion of income level, and split data according to levels of this criterion"
dput(head(creditcard))
structure(list(card = structure(c(2L, 2L, 2L, 2L, 2L, 2L), levels = c("no",
"yes"), class = "factor"), reports = c(0L, 0L, 0L, 0L, 0L, 0L
), age = c(37.66667, 33.25, 33.66667, 30.5, 32.16667, 23.25),
income = c(4.52, 2.42, 4.5, 2.54, 9.7867, 2.5), share = c(0.03326991,
0.005216942, 0.004155556, 0.06521378, 0.06705059, 0.0444384
), expenditure = c(124.9833, 9.854167, 15, 137.8692, 546.5033,
91.99667), owner = structure(c(2L, 1L, 2L, 1L, 2L, 1L), levels = c("no",
"yes"), class = "factor"), selfemp = structure(c(1L, 1L,
1L, 1L, 1L, 1L), levels = c("no", "yes"), class = "factor"),
dependents = c(3L, 3L, 4L, 0L, 2L, 0L), days = c(54L, 34L,
58L, 25L, 64L, 54L), majorcards = c(1L, 1L, 1L, 1L, 1L, 1L
), active = c(12L, 13L, 5L, 7L, 5L, 1L), income_fam = c(1.13,
0.605, 0.9, 2.54, 3.26223333333333, 2.5)), row.names = c("1",
"2", "3", "4", "5", "6"), class = "data.frame")
I defined this criterion in this way
inc_l<-c("low","average","above average","high")
grad_fact<-function(x){
ifelse(x>=10, 'high',
ifelse(x>6 && x<10, 'above average',
ifelse(x>=3 && x<=6,'average',
ifelse(x<3, 'low'))))
}
And added a column like this
creditcard<-transform(creditcard, incom_levev=factor(sapply(creditcard$income, grad_fact), inc_l, ordered = TRUE))
But I need not to use saaply for this and I tried to do it in this way
creditcard<-transform(creditcard, incom_level=factor(grad_fact(creditcard$income),inc_l, ordered = TRUE))
But in this case, all the elements of the column take the value "average" and I don't understand why, please help me figure out the problem
We may need to change the && to & as && will return a single TRUE/FALSE. According to ?"&&"
& and && indicate logical AND and | and || indicate logical OR. The shorter forms performs elementwise comparisons in much the same way as arithmetic operators. The longer forms evaluates left to right, proceeding only until the result is determined. The longer form is appropriate for programming control-flow and typically preferred in if clauses.
In addition, the last ifelse didn't had a no case
grad_fact<-function(x){
ifelse(x>=10, 'high',
ifelse(x>6 & x<10, 'above average',
ifelse(x>=3 & x<=6,'average',
ifelse(x<3, 'low', NA_character_))))
}
and then use
creditcard <- transform(creditcard, incom_level=
factor(grad_fact(income),inc_l, ordered = TRUE))
-output
creditcard
card reports age income share expenditure owner selfemp dependents days majorcards active income_fam incom_level
1 yes 0 37.66667 4.5200 0.033269910 124.983300 yes no 3 54 1 12 1.130000 average
2 yes 0 33.25000 2.4200 0.005216942 9.854167 no no 3 34 1 13 0.605000 low
3 yes 0 33.66667 4.5000 0.004155556 15.000000 yes no 4 58 1 5 0.900000 average
4 yes 0 30.50000 2.5400 0.065213780 137.869200 no no 0 25 1 7 2.540000 low
5 yes 0 32.16667 9.7867 0.067050590 546.503300 yes no 2 64 1 5 3.262233 above average
6 yes 0 23.25000 2.5000 0.044438400 91.996670 no no 0 54 1 1 2.500000 low

How to make "For loop" based on column

I have been using "For loops" before this. But the variable is usually k that refer to row numbers.
Example:
for (k in 1:n) {
expression
}
My question is, is it possible for the variable to be a certain column?
Example:
for ("column no" in 1:n) {
expression
}
I have had several trials and errors and a bit stuck now. Here is my data:
date mold no
22-May 1.35436 1
23-May 0.88592 1
24-May 0.81316 1
25-May 0.80856 1
26-May 0.84646 1
27-May 0.81762 1
28-May 0.79828 1
03-Jan 1.09158 2
04-Jan 0.86661 2
05-Jan 0.81908 2
06-Jan 0.7555 2
07-Jan 0.66577 2
08-Jan 0.66706 2
09-Jan 0.67133 2
05-Feb 20.4366 3
06-Feb 5.77923 3
06-Feb 3.12323 3
05-Feb 2.25436 3
06-Feb 1.74551 3
06-Feb 1.52744 3
05-Feb 1.45483 3
28-Jul 1.55148 4
29-Jul 1.18882 4
30-Jul 1.10595 4
31-Jul 1.14101 4
01-Aug 1.1453 4
02-Aug 1.10113 4
03-Aug 1.09152 4
30-Nov 8.3254 5
01-Dec 4.03003 5
02-Dec 2.18026 5
03-Dec 1.40028 5
04-Dec 1.02901 5
05-Dec 0.85859 5
06-Dec 0.7776 5
I would like to as R to sum up the values in the mold column for each group (1 to 5) in the no column. For example, for no=1, it would be
1.35436 + 0.88592 + 0.81316 + 0.80856 + 0.84646 + 0.81762 + 0.79828 = 6.32436
Then repeat for no = 2, 3, 4 etc.
We can loop through the unique elements, compare (==) and get the sumof the 'mold' elements that correspond to the boolean vector
un1 <- unique(df1$no)
v1 <- numeric(length(un1))
for(i in seq_along(v1)) v1[i] <- sum(df1$mold[df1$no== un1[i]])
v1
#[1] 6.32436 5.53693 36.32120 8.32521 18.60117
It is the same as rowsum
rowsum(df1$mold, df1$no)[,1]
# 1 2 3 4 5
# 6.32436 5.53693 36.32120 8.32521 18.60117
data
df1 <- structure(list(date = c("22-May", "23-May", "24-May", "25-May",
"26-May", "27-May", "28-May", "03-Jan", "04-Jan", "05-Jan", "06-Jan",
"07-Jan", "08-Jan", "09-Jan", "05-Feb", "06-Feb", "06-Feb", "05-Feb",
"06-Feb", "06-Feb", "05-Feb", "28-Jul", "29-Jul", "30-Jul", "31-Jul",
"01-Aug", "02-Aug", "03-Aug", "30-Nov", "01-Dec", "02-Dec", "03-Dec",
"04-Dec", "05-Dec", "06-Dec"), mold = c(1.35436, 0.88592, 0.81316,
0.80856, 0.84646, 0.81762, 0.79828, 1.09158, 0.86661, 0.81908,
0.7555, 0.66577, 0.66706, 0.67133, 20.4366, 5.77923, 3.12323,
2.25436, 1.74551, 1.52744, 1.45483, 1.55148, 1.18882, 1.10595,
1.14101, 1.1453, 1.10113, 1.09152, 8.3254, 4.03003, 2.18026,
1.40028, 1.02901, 0.85859, 0.7776), no = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L)),
class = "data.frame", row.names = c(NA,
-35L))

Producing smmary tables for very large datasets

I am working with migration data, and I want to produce three summary tables from a very large dataset (>4 million). An example of which is detailed below:
migration <- structure(list(area.old = structure(c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("leeds",
"london", "plymouth"), class = "factor"), area.new = structure(c(7L,
13L, 3L, 2L, 4L, 7L, 6L, 7L, 6L, 13L, 5L, 8L, 7L, 11L, 12L, 9L,
1L, 10L, 11L), .Label = c("bath", "bristol", "cambridge", "glasgow",
"harrogate", "leeds", "london", "manchester", "newcastle", "oxford",
"plymouth", "poole", "york"), class = "factor"), persons = c(6L,
3L, 2L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 3L, 4L, 1L, 1L, 2L, 3L, 4L,
9L, 4L)), .Names = c("area.old", "area.new", "persons"), class = "data.frame", row.names = c(NA,
-19L))
Summary table 1: 'area.within'
The first table I wish to create is called 'area.within'. This will detail only areas where people have moved within the same area (i.e. it will count the total number of persons where 'london' is noted down in 'area.old' and 'area.new'). There will probably be multiple occurrences of this within the data table. It will then do this for all of the different areas, so the summary would be:
area.within persons
1 london 13
2 leeds 5
3 plymouth 5
Using the data table package, I have as far as:
setDT(migration)[as.character(area.old)==as.character(area.new)]
... but this doesn't get rid of duplicates...
Summary table 2: 'moved.from'
The second table will summarise areas which have experienced people moving out (i.e. those unique values in 'area.old'). It will identify areas for which column 1 and 2 are different and add together all the people that are detailed (i.e. excluding those who have moved between areas - in summary table 1). The resulting table should be:
moved.from persons
1 london 24
2 leeds 17
3 plymouth 19
Summary table 3: 'moved.to'
The third table summarises which areas have experienced people moving to (i.e. those unique values in 'area.new'). It will identify all the unique areas for which column 1 and 2 are different and add together all the people that are detailed (i.e. excluding those who have moved between areas - in summary table 1). The resulting table should be:
moved.to persons
1 london 5
2 york 3
3 cambridge 2
4 bristol 5
5 glasgow 6
6 leeds 8
7 york 6
8 harrogate 3
9 manchester 4
10 plymouth 0
11 poole 2
12 newcastle 3
13 bath 4
14 oxford 9
Most importantly, a sum of all the persons detailed in tables 2 and 3 should be the same. And then this value, combined with the persons total for table 1 should equal the sum of the all the persons in the original table.
If anyone could help me sort out how to structure my code using the data table package to produce my tables, I should be most grateful.
Using data.table is a good choice i think.
setDT(migration) #This has to be done only once
1.
To avoid duplicates just sum them up by city as follows
migration[as.character(area.old)==as.character(area.new),
.(persons = sum(persons)),
by=.(area.within = area.new)]
2.
This is very similar to the 1. one but uses != in the i-Argument
migration[as.character(area.old)!=as.character(area.new),
.(persons = sum(persons)),
by=.(moved.from = area.old)]
3.
Same as 2.
migration[as.character(area.old)!=as.character(area.new),
.(persons = sum(persons)),
by=.(moved.to = area.new)]
Alternative
As 2. and 3. are very similar you can also do:
moved <- migration[as.character(area.old)!=as.character(area.new)]
#2
moved[,.(persons = sum(persons)), by=.(moved.from = area.old)]
#3
moved[,.(persons = sum(persons)), by=.(moved.to = area.new)]
Thus only once the selection of the right rows has to be done.

R: iterate over columns and plot

My data looks like this:
Group Feature_A Feature_B Feature_C Feature_D
1 1 0 3 2 4
2 1 5 2 2 8
3 1 9 8 6 5
4 2 5 7 8 8
5 2 2 6 8 1
6 2 3 8 6 4
7 3 1 5 3 5
8 3 1 4 3 4
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), Feature_A = c(0L,
5L, 9L, 5L, 2L, 3L, 1L, 1L), Feature_B = c(3L, 2L, 8L, 7L, 6L,
8L, 5L, 4L), Feature_C = c(2L, 2L, 6L, 8L, 8L, 6L, 3L, 3L), Feature_D = c(4L,
8L, 5L, 8L, 1L, 4L, 5L, 4L)), .Names = c("Group", "Feature_A",
"Feature_B", "Feature_C", "Feature_D"), class = "data.frame", row.names = c(NA,
-8L))
For every Feature I want to generate a plot (e.g., boxplot) that would higlight difference between Groups.
# Get unique Feature and Group
Features<-unique(colnames(df[,-1]))
Group<-unique(colnames(df$Group))
But how can I do the rest?
Pseudo-code might look like this:
Select Feature from Data
Split Data according Group
Boxplot
for (i in 1:levels(df$Features)){
for (o in 1:length(Group)){
}}
How can I achieve this? Hope someone can help me.
I would put py data in the long format. Then Using ggplot2 you can do some nice things.
library(reshape2)
library(ggplot2)
library(gridExtra)
## long format using Group as id
dat.m <- melt(dat,id='Group')
## bar plot
p1 <- ggplot(dat.m) +
geom_bar(aes(x=Group,y=value,fill=variable),stat='identity')
## box plot
p2 <- ggplot(dat.m) +
geom_boxplot(aes(x=factor(Group),y=value,fill=variable))
## aggregate the 2 plots
grid.arrange(p1,p2)
This is easy to do. I do this all the time
The code below will generate the charts using ggplot and save them as ch_Feature_A ....
you can wrap the answer in a pdf statement to send them to pdf as well
library(ggplot2)
df$Group <- as.factor(df$Group)
for (i in 2:dim(df)[2]) {
ch <- ggplot(df,aes_string(x="Group",y=names(df)[i],fill="Group"))+geom_boxplot()
assign(paste0("ch_",names(df)[i]),ch)
}
or even simpler, if you do not want separate charts
library(reshape2)
df1 <- melt(df)
ggplot(df1,aes(x=Group,y=value,fill=Group))+geom_boxplot()+facet_grid(.~variable)

R summing certain columns in each row

I'm having an issue, but I'm sure it's super easy for someone who is very familiar with R.
I have a matrix that is 3008 x 3008. What I want is it to sum every 8 columns in each row. So essentially you'd end up with a new matrix that is now 367 x 367.
Here's a small example:
C.1 C.2 C.3 C.4 C.5 C.6
row1 1 2 1 2 5 6
row1 1 2 3 4 5 6
row1 2 6 3 4 5 6
row1 1 2 3 4 10 6
So say I wanted to sum these for every 3 columns in each row, I'd want to end up with:
C.1 C.2
row1 4 13
row1 6 15
row1 11 15
row1 6 20
# m is your matrix
n <- 8
grp <- seq(1, ncol(m), by=n)
sapply(grp, function(x) rowSums(m[, x:(x+n-1)]))
Some explanation if you're new to R. grp is a sequence of numbers that gives the starting points for each group of columns: 1, 9, 17, etc if you want to sum every 8 columns.
The sapply call can be understood as follows. For each number in grp, it calls the rowSums function, passing it those matrix columns corresponding to that group number. Thus when grp is 1, it gets the row sums for columns 1-8; when grp is 9, it gets the row sums for columns 9-16 and so on. These are vectors, which sapply then binds together into a matrix.
Transform your matrix to an array, then use apply and rowSums.
mat <- structure(c(1L, 1L, 2L, 1L, 2L, 2L, 6L, 2L, 1L, 3L, 3L, 3L, 2L, 4L, 4L, 4L, 5L, 5L, 5L, 10L, 6L, 6L, 6L, 6L),
.Dim = c(4L, 6L),
.Dimnames = list(c("row1", "row2", "row3", "row4"), c("C.1", "C.2", "C.3", "C.4", "C.5", "C.6")))
n <- 3 #this needs to be a factor of the number of columns
a <- array(mat,dim=c(nrow(mat),n,ncol(mat)/n))
apply(a,3,rowSums)
# [,1] [,2]
# [1,] 4 13
# [2,] 6 15
# [3,] 11 15
# [4,] 6 20
#Create sample data:
df <- matrix(rexp(200, rate=.1), ncol=20)
#Choose the number of columns you'd like to sum up (e.g., 3 or 8)
number_of_columns_to_sum <- 3
df2 <- NULL #Set to null so that you can use cbind on the first value below
for (i in seq(1,ncol(df), by = number_of_columns_to_sum)) {
df2 <- cbind(df2, rowSums(df[,i:(i+number_of_columns_to_sum-1)]))
}
Another option: Though it may not be as elegant
mat <- structure(c(1L, 1L, 2L, 1L, 2L, 2L, 6L, 2L, 1L, 3L, 3L, 3L, 2L, 4L, 4L, 4L, 5L, 5L, 5L, 10L, 6L, 6L, 6L, 6L),
.Dim = c(4L, 6L),
.Dimnames = list(c("row1", "row1", "row1", "row1"), c("C.1", "C.2", "C.3", "C.4", "C.5", "C.6")))
new<- data.frame((mat[,1]+mat[,2]+mat[,3]),(mat[,4]+mat[,5]+mat[,6]))
names(new)<- c("C.1","C.2")
new

Resources