Adding missing rows in R

I have a table in R as follows:
month day total
1 1 3 1414
2 1 5 1070
3 1 6 211
4 1 7 2214
5 1 8 1766
6 1 13 2486
7 1 14 43
8 1 15 2349
9 1 16 4616
10 1 17 2432
11 1 18 1482
12 1 19 694
13 1 20 968
14 1 23 381
15 1 24 390
16 1 26 4063
17 1 27 3323
18 1 28 988
19 1 29 9671
20 1 30 11968
I need to add rows for the missing days (such as 1, 2, and 4) with a total of zero, so that the result looks like this:
   month day total
1      1   1     0
2      1   2     0
3      1   3  1414
4      1   4     0
5      1   5  1070
6      1   6   211
7      1   7  2214
8      1   8  1766
9      1   9     0
10     1  10     0
11     1  11     0
12     1  12     0
13     1  13  2486
14     1  14    43
15     1  15  2349
16     1  16  4616
17     1  17  2432
18     1  18  1482
19     1  19   694
20     1  20   968
21     1  21     0
22     1  22     0
23     1  23   381
24     1  24   390
25     1  25     0
26     1  26  4063
27     1  27  3323
28     1  28   988
29     1  29  9671
30     1  30 11968

Using only base R, you could do it this way:
for (d in 1:30) {
  if (!d %in% my.df$day)
    my.df[nrow(my.df) + 1, ] <- c(1, d, 0)
}
# Reorder rows
my.df <- my.df[with(my.df, order(month, day)),]
rownames(my.df) <- NULL
# Check the results
head(my.df)
#   month day total
# 1     1   1     0
# 2     1   2     0
# 3     1   3  1414
# 4     1   4     0
# 5     1   5  1070
# 6     1   6   211

Alternatively, we could create a new dataset with a 'day' column running 1:30 and 'month' set to 1, left_join it with the original dataset, and then replace the NA values produced by the merge with 0:
library(dplyr)
df2 <- data.frame(month = 1, day = 1:30)
left_join(df2, df1) %>%
  mutate(total = replace(total, is.na(total), 0))
Or use merge() from base R to get 'dM' and set the NA values in 'total' to 0:
dM <- merge(df1, df2, all.y=TRUE)
dM$total[is.na(dM$total)] <- 0
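For reference, tidyr's complete() wraps the same expand-and-fill pattern in one step. A sketch, assuming the same df1 as above (this uses an extra package, not base R):
library(tidyr)
# Expand 'day' to the full 1:30 range within the month and fill the
# newly created rows' 'total' with 0
df1 %>% complete(month, day = 1:30, fill = list(total = 0))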

Sum in R based on a date range and another condition?

I am working on a dataframe of baseball data called mlb_team_logs. A random sample lies below.
Date Team season AB PA H X1B X2B X3B HR R RBI BB IBB SO HBP SF SH GDP
1 2015-04-06 ARI 2015 34 39 9 7 1 1 0 4 4 3 0 6 2 0 0 2
2 2015-04-07 ARI 2015 31 36 8 4 1 1 2 7 7 5 0 7 0 0 0 1
3 2015-04-08 ARI 2015 32 35 5 3 2 0 0 2 1 2 0 7 1 0 0 0
4 2015-04-10 ARI 2015 35 38 7 6 0 0 1 4 4 3 0 10 0 0 0 0
5 2015-04-11 ARI 2015 32 35 10 9 0 0 1 6 6 3 0 7 0 0 0 1
6 2015-04-12 ARI 2015 36 38 10 7 3 0 0 4 4 1 0 11 0 0 1 1
7 2015-04-13 ARI 2015 39 44 12 8 3 1 0 8 7 4 0 11 0 0 1 0
8 2015-04-14 ARI 2015 28 32 3 1 2 0 0 1 1 3 0 4 1 0 0 2
9 2015-04-15 ARI 2015 33 34 9 7 1 0 1 2 2 1 0 8 0 0 0 1
10 2015-04-16 ARI 2015 47 51 11 6 2 0 3 7 7 3 1 8 1 0 0 0
240 2015-07-03 ATL 2015 30 32 7 4 1 0 2 2 2 2 0 6 0 0 0 1
241 2015-07-04 ATL 2015 34 40 10 6 3 0 1 9 9 5 0 5 0 0 1 0
242 2015-07-05 ATL 2015 35 37 7 6 1 0 0 0 0 1 0 10 1 0 0 1
243 2015-07-06 ATL 2015 40 44 15 10 4 0 1 5 5 3 0 7 0 0 1 1
244 2015-07-07 ATL 2015 34 37 10 7 1 1 1 4 4 2 0 4 0 0 1 1
245 2015-07-08 ATL 2015 31 38 7 4 1 0 2 5 5 5 1 7 0 0 2 1
246 2015-07-09 ATL 2015 34 37 10 8 2 0 0 3 3 1 0 9 0 1 1 2
247 2015-07-10 ATL 2015 32 35 8 7 0 0 1 3 3 2 0 5 1 0 0 2
248 2015-07-11 ATL 2015 33 38 6 3 1 0 2 2 2 5 1 8 0 0 0 0
249 2015-07-12 ATL 2015 34 41 8 6 2 0 0 3 3 7 1 10 0 0 0 1
250 2015-07-17 ATL 2015 30 36 7 4 3 0 0 4 4 5 1 7 0 0 0 0
In total, the df has 43 columns. My objective is to sum columns 4 (AB) through 43 based on two criteria:
the team
the date is within 7 days of the entry in "Date" (i.e., Date - 7 to Date - 1)
Eventually, I would like these columns to be appended to mlb_team_logs as l7_AB, l7_PA, etc. (but I know how to do that if the output is a new dataframe). Any help is appreciated!
EDIT: I altered the sample to allow for more easily tested results.
You might be able to use a data.table non-equi join here. The idea would be to create a lower date bound (below, I've named this date_lb), and then join the table on itself, matching on Team = Team, Date < Date, and Date >= date_lb. Then use lapply with .SDcols to sum the columns of interest.
Load the library and convert your frame to a data.table:
library(data.table)
setDT(mlb_team_logs)
Identify the columns you want to sum, in a character vector (change to 4:43 in your full dataset):
sum_cols = names(mlb_team_logs)[4:19]
Add a lower bound on the date:
mlb_team_logs[, date_lb := Date - 7]
Join the table on itself, and use lapply(.SD, sum) on the columns of interest:
result = mlb_team_logs[mlb_team_logs[, .(Team, Date, date_lb)],
                       on = .(Team, Date < Date, Date >= date_lb)
                       ][, lapply(.SD, sum), by = .(Date, Team), .SDcols = sum_cols]
Set the new names (in place, using setnames()):
setnames(result, old = sum_cols, new = paste0("I7_", sum_cols))
Output:
Date Team I7_AB I7_PA I7_H I7_X1B I7_X2B I7_X3B I7_HR I7_R I7_RBI I7_BB I7_IBB I7_SO I7_HBP I7_SF I7_SH I7_GDP
<IDat> <char> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1: 2015-04-06 ARI NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2: 2015-04-07 ARI 34 39 9 7 1 1 0 4 4 3 0 6 2 0 0 2
3: 2015-04-08 ARI 65 75 17 11 2 2 2 11 11 8 0 13 2 0 0 3
4: 2015-04-10 ARI 97 110 22 14 4 2 2 13 12 10 0 20 3 0 0 3
5: 2015-04-11 ARI 132 148 29 20 4 2 3 17 16 13 0 30 3 0 0 3
6: 2015-04-12 ARI 164 183 39 29 4 2 4 23 22 16 0 37 3 0 0 4
7: 2015-04-13 ARI 200 221 49 36 7 2 4 27 26 17 0 48 3 0 1 5
8: 2015-04-14 ARI 205 226 52 37 9 2 4 31 29 18 0 53 1 0 2 3
9: 2015-04-15 ARI 202 222 47 34 10 1 2 25 23 16 0 50 2 0 2 4
10: 2015-04-16 ARI 203 221 51 38 9 1 3 25 24 15 0 51 1 0 2 5
11: 2015-07-03 ATL NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
12: 2015-07-04 ATL 30 32 7 4 1 0 2 2 2 2 0 6 0 0 0 1
13: 2015-07-05 ATL 64 72 17 10 4 0 3 11 11 7 0 11 0 0 1 1
14: 2015-07-06 ATL 99 109 24 16 5 0 3 11 11 8 0 21 1 0 1 2
15: 2015-07-07 ATL 139 153 39 26 9 0 4 16 16 11 0 28 1 0 2 3
16: 2015-07-08 ATL 173 190 49 33 10 1 5 20 20 13 0 32 1 0 3 4
17: 2015-07-09 ATL 204 228 56 37 11 1 7 25 25 18 1 39 1 0 5 5
18: 2015-07-10 ATL 238 265 66 45 13 1 7 28 28 19 1 48 1 1 6 7
19: 2015-07-11 ATL 240 268 67 48 12 1 6 29 29 19 1 47 2 1 6 8
20: 2015-07-12 ATL 239 266 63 45 10 1 7 22 22 19 2 50 2 1 5 8
21: 2015-07-17 ATL 99 114 22 16 3 0 3 8 8 14 2 23 1 0 0 3
Date Team I7_AB I7_PA I7_H I7_X1B I7_X2B I7_X3B I7_HR I7_R I7_RBI I7_BB I7_IBB I7_SO I7_HBP I7_SF I7_SH I7_GDP
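If you would rather report 0 than NA for dates with no games in the prior seven days, one optional extra step (an assumption about the desired output, not part of the question) is to overwrite the NAs in place with data.table's set():
# The I7_ columns are integer sums here, so 0L keeps the column type
l7_cols <- paste0("I7_", sum_cols)
for (col in l7_cols)
  set(result, i = which(is.na(result[[col]])), j = col, value = 0L)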

Need sum of data rows and columns using R

I need to get the sums of the rows and columns of my data. I loaded my data from a csv and then replaced the NAs with zeros. I just can't get my data to be read as numbers so that I can sum it up.
data<-read.csv("DataSet.2.csv",header=FALSE)
mode(data)
[1] "list"
data[is.na(data)]=0
data
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 Var_1 Var_2 Var_3 Var_4 Var_5 Var_6 Var_7 Var_8 Var_9 Var_10
2 Crow 8 8 0 3 2 4 4 44 0 23
3 Mouse 2 0 5 4 2 6 36 636 2 2
4 Boar 15 113 48 36 15 66 14 0 2 23
5 Plain 8 17 164 14 91 0 6 10 6 32
6 Silver.Carp 3 1 0 6 7 0 35 35 0 432
7 Dog 1 0 27 0 0 11 0 0 7 43
8 Bingo 2 3 1 15 1 21 0 0 1 0
9 Chrysalis 1 0 2 0 47 0 0 0 7 3
10 Apple 2 0 3 0 0 0 0 0 5 4
11 Cork 3 0 1 0 461 8 2305 15 0 2
12 Ant 11 0 2 0 0 0 0 91 4 0
13 Cat.Claw 2 22 1 110 2 7 10 7 0 0
14 Aardvark 3 1 0 5 25 30 125 0 5 4
15 Carriage 0 3 3 15 0 533 0 1 7 3
16 Airplane 3 2 1 10 0 28 0 47 7 1
17 Clipper 2 1 2 5 0 507 0 0 23 2
18 Armadillo 3 2 4 11 24 0 2 10 3322 0
19 Cork 3 3 1 9 461 88 2305 15 233 3
20 Colt 3 4 1 10 4902 0 0 1 4322 111
21 Cat 3 22 2 220 3 11 10 7 2333 22
V12
1 Var_11
2 15
3 4
4 13
5 3
6 312
7 1
8 22
9 12
10 0
11 0
12 23
13 32
14 44
15 43
16 2
17 33
18 2
19 3
20 55
21 3
#When I use as.numeric I am getting an error
data2<-as.numeric(data)
Error: 'list' object cannot be coerced to type 'double'
It looks like your .csv file contains a header ('Var_1', 'Var_2', etc.) but you are specifying header=FALSE when you load the data, so those strings are being interpreted as data values. Additionally, it looks like your first column represents row names for your dataset. You can specify this via the row.names argument.
Instead, load the data using:
data <- read.csv("DataSet.2.csv", header=TRUE, row.names = 1)
Once the data is loaded you can get the column and row sums via the functions colSums() and rowSums(), respectively. Additionally, if you are replacing the NA values with 0s just for the computation of the sums, you can skip that step by setting the parameter na.rm = TRUE within colSums() and rowSums(). This excludes the NA values from the computation of the sums. For example:
data <- read.csv("DataSet.2.csv", header=TRUE, row.names = 1)
row_sum <- rowSums(data, na.rm = TRUE)
col_sum <- colSums(data, na.rm = TRUE)
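As an aside on the as.numeric() error above: a data frame is a list of columns, so it cannot be coerced to numeric in one call; each column has to be converted individually. A minimal sketch, assuming some columns were read as character because of the header mix-up:
# as.numeric() on a whole data frame fails with "'list' object cannot be
# coerced"; convert column-by-column instead
data[] <- lapply(data, function(col) as.numeric(as.character(col)))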

I have a weights variable and I need to create cross tabulations for a chord diagram

I have a dataset with over 15,000 observations. I've dropped all variables but three (3).
One is the individual's origin, or; the second is the individual's destination, dest; and the third is the individual's weight, wgt.
Origin and destination are categorical variables.
The weights are used as analytic weights in Stata. However, Stata can't handle the number of columns I generate when making tables, whereas R generates them with ease. What I can't figure out is how to apply the weights to the generated table.
I tried using wtd.table(), but the following error appears:
wtd.table(NonHSGrad$b206reg, NonHSGrad$c305reg, weights=NonHSGrad$ind_wgts)
Error in proxy[, ..., drop = FALSE] : incorrect number of dimensions
When I use table() alone, this comes out:
table(NonHSGrad$b206reg, NonHSGrad$c305reg)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 285 38 20 8 6 3 1 2 0 1 0 10 38 46 0 2 14
2 32 312 26 3 1 0 2 1 1 0 1 1 22 51 0 0 8
3 17 35 325 12 12 2 3 7 0 2 3 5 52 13 1 1 25
4 3 5 27 224 19 5 2 10 1 1 1 2 51 4 0 3 35
5 4 9 44 81 778 6 7 22 1 4 5 5 155 5 0 5 47
6 4 5 22 21 10 547 24 12 32 21 32 81 86 5 3 15 58
7 5 4 12 17 20 21 558 20 31 99 93 33 59 1 3 67 15
8 8 9 41 49 17 11 24 919 5 8 37 10 151 2 0 52 19
9 0 1 7 9 1 4 26 5 466 66 19 17 17 2 24 24 7
10 1 2 3 4 2 3 27 8 41 528 21 17 13 2 11 36 2
11 3 0 3 10 1 5 11 5 6 17 519 59 7 1 2 49 1
12 0 1 1 2 0 1 5 2 2 10 39 318 10 0 14 17 1
13 15 9 26 34 25 21 12 42 2 5 3 5 187 2 1 6 15
14 14 47 7 5 0 0 0 1 1 0 0 0 9 475 0 0 0
15 0 0 3 1 2 2 4 2 22 9 3 60 9 2 342 2 3
16 0 2 6 10 3 2 11 21 3 33 29 4 34 0 3 404 5
17 1 1 7 15 2 6 1 2 0 1 1 0 34 0 0 2 463
99 0 0 1 1 0 0 0 1 0 1 0 0 0 1 2 0 1
I am also going to use the table for a chord diagram to show flows.
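One base R route that may work here: xtabs() sums a left-hand-side variable over a cross-classification, which gives exactly a weighted table. A sketch, assuming the column names from the question:
# Sum the weights within each origin x destination cell
wtab <- xtabs(ind_wgts ~ b206reg + c305reg, data = NonHSGrad)
# The resulting matrix can then be passed to circlize for the chord diagram:
# library(circlize); chordDiagram(wtab)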

R data frame rank by groups (group by rank) with package dplyr

I have a data frame 'test' that looks like this:
session_id seller_feedback_score
1 1 282470
2 1 275258
3 1 275258
4 1 275258
5 1 37831
6 1 282470
7 1 26
8 1 138351
9 1 321350
10 1 841
11 1 138351
12 1 17263
13 1 282470
14 1 396900
15 1 282470
16 1 282470
17 1 321350
18 1 321350
19 1 321350
20 1 0
21 1 1596
22 7 282505
23 7 275283
24 7 275283
25 7 275283
26 7 37834
27 7 282505
28 7 26
29 7 138359
30 7 321360
and code (using the dplyr package) that should rank 'seller_feedback_score' within each group of session_id:
test <- test %>% group_by(session_id) %>%
  mutate(seller_feedback_score_rank = dense_rank(-seller_feedback_score))
however, what actually happens is that R ranks the entire data frame together, ignoring the groups (session_ids):
session_id seller_feedback_score seller_feedback_score_rank
1 1 282470 5
2 1 275258 7
3 1 275258 7
4 1 275258 7
5 1 37831 11
6 1 282470 5
7 1 26 15
8 1 138351 9
9 1 321350 3
10 1 841 14
11 1 138351 9
12 1 17263 12
13 1 282470 5
14 1 396900 1
15 1 282470 5
16 1 282470 5
17 1 321350 3
18 1 321350 3
19 1 321350 3
20 1 0 16
21 1 1596 13
22 7 282505 4
23 7 275283 6
24 7 275283 6
25 7 275283 6
26 7 37834 10
27 7 282505 4
28 7 26 15
29 7 138359 8
30 7 321360 2
I checked this by counting the unique 'seller_feedback_score_rank' values, and not surprisingly it equals the highest rank value. I'd appreciate it if someone could reproduce this and help. Thanks.
link to my original question: R group by and aggregate - return relative rank within groups using plyr
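One common cause of this exact symptom (worth ruling out here, though I can't confirm it from the question alone) is having plyr attached after dplyr: plyr's mutate then masks dplyr's, and the grouping is silently ignored. Calling the verb with an explicit namespace bypasses any masking:
# dplyr::mutate respects group_by() even if plyr::mutate is masking it
test <- test %>%
  group_by(session_id) %>%
  dplyr::mutate(seller_feedback_score_rank = dense_rank(-seller_feedback_score))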
I had a similar issue; my answer was to sort on the groups and the relevant ranked variable(s), and then use row_number() with group_by():
# Sample dataset
df <- data.frame(group=rep(c("GROUP 1", "GROUP 2"), 10),
                 value=as.integer(rnorm(20, mean=1000, sd=500)))
require(dplyr)
print.data.frame(df[1:10, ])
group value
1 GROUP 1 1273
2 GROUP 2 1261
3 GROUP 1 1189
4 GROUP 2 1390
5 GROUP 1 1942
6 GROUP 2 1111
7 GROUP 1 530
8 GROUP 2 893
9 GROUP 1 997
10 GROUP 2 237
sorted <- df %>%
  arrange(group, -value) %>%
  group_by(group) %>%
  mutate(rank=row_number())
print.data.frame(sorted)
group value rank
1 GROUP 1 1942 1
2 GROUP 1 1368 2
3 GROUP 1 1273 3
4 GROUP 1 1249 4
5 GROUP 1 1189 5
6 GROUP 1 997 6
7 GROUP 1 562 7
8 GROUP 1 535 8
9 GROUP 1 530 9
10 GROUP 1 1 10
11 GROUP 2 1472 1
12 GROUP 2 1390 2
13 GROUP 2 1281 3
14 GROUP 2 1261 4
15 GROUP 2 1111 5
16 GROUP 2 893 6
17 GROUP 2 774 7
18 GROUP 2 669 8
19 GROUP 2 631 9
20 GROUP 2 237 10
Found an answer in: Add a "rank" column to a data frame
data.selected <- transform(data.selected,
                           seller_feedback_score_rank = ave(seller_feedback_score, session_id,
                                                            FUN = function(x) rank(-x, ties.method = "first")))
One way you can do this is:
# Sort so that rows within each ID group are in the desired order
dataset <- dataset %>% arrange(ID, DateTime, Index)
# TRUE where a row shares its ID with the previous row, FALSE at each group start
dataset$Rank <- c(0, dataset$ID)[-(nrow(dataset) + 1)] == dataset$ID
# Cumulative sum within each group turns that into a 0-based rank
dataset <- dataset %>% group_by(ID) %>% mutate(Rank = cumsum(Rank))
Had the same issue!

How do I use plyr to number rows?

Basically, I want an auto-incremented ID column based on my cohorts - in this case .(kmer, cvCut).
> myDataFrame
size kmer cvCut cumsum
1 8132 23 10 8132
10000 778 23 10 13789274
30000 324 23 10 23658740
50000 182 23 10 28534840
100000 65 23 10 33943283
200000 25 23 10 37954383
250000 584 23 12 16546507
300000 110 23 12 29435303
400000 28 23 12 34697860
600000 127 23 2 47124443
600001 127 23 2 47124570
I want a column added that numbers the rows within each kmer/cvCut group:
> myDataFrame
size kmer cvCut cumsum newID
1 8132 23 10 8132 1
10000 778 23 10 13789274 2
30000 324 23 10 23658740 3
50000 182 23 10 28534840 4
100000 65 23 10 33943283 5
200000 25 23 10 37954383 6
250000 584 23 12 16546507 1
300000 110 23 12 29435303 2
400000 28 23 12 34697860 3
600000 127 23 2 47124443 1
600001 127 23 2 47124570 2
I'd do it like this:
library(plyr)
ddply(myDataFrame, c("kmer", "cvCut"), transform, newID = seq_along(kmer))
Just add a new column each time plyr calls you:
R> DF <- data.frame(kmer=sample(1:3, 50, replace=TRUE),
+                   cvCut=sample(LETTERS[1:3], 50, replace=TRUE))
R> library(plyr)
R> ddply(DF, .(kmer, cvCut), function(X) data.frame(X, newId=1:nrow(X)))
kmer cvCut newId
1 1 A 1
2 1 A 2
3 1 A 3
4 1 A 4
5 1 A 5
6 1 A 6
7 1 A 7
8 1 A 8
9 1 A 9
10 1 A 10
11 1 A 11
12 1 B 1
13 1 B 2
14 1 B 3
15 1 B 4
16 1 B 5
17 1 B 6
18 1 C 1
19 1 C 2
20 1 C 3
21 2 A 1
22 2 A 2
23 2 A 3
24 2 A 4
25 2 A 5
26 2 B 1
27 2 B 2
28 2 B 3
29 2 B 4
30 2 B 5
31 2 B 6
32 2 B 7
33 2 C 1
34 2 C 2
35 2 C 3
36 2 C 4
37 3 A 1
38 3 A 2
39 3 A 3
40 3 A 4
41 3 B 1
42 3 B 2
43 3 B 3
44 3 B 4
45 3 C 1
46 3 C 2
47 3 C 3
48 3 C 4
49 3 C 5
50 3 C 6
R>
I think that this is what you want:
Load the data:
x <- read.table(textConnection(
"id size kmer cvCut cumsum
1 8132 23 10 8132
10000 778 23 10 13789274
30000 324 23 10 23658740
50000 182 23 10 28534840
100000 65 23 10 33943283
200000 25 23 10 37954383
250000 584 23 12 16546507
300000 110 23 12 29435303
400000 28 23 12 34697860
600000 127 23 2 47124443
600001 127 23 2 47124570"), header=TRUE)
Use ddply:
library(plyr)
ddply(x, .(kmer, cvCut), function(d) cbind(d, newID = seq_len(nrow(d))))
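For comparison, the same per-group counter can be written without plyr using base R's ave(), which applies a function within groups. A sketch against the x loaded above:
# seq_along() restarts the counter within each kmer/cvCut group
x$newID <- ave(seq_len(nrow(x)), x$kmer, x$cvCut, FUN = seq_along)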
