Need sum of data rows and columns using R

I need to get the sums of the rows and columns of my data. I loaded my data from a CSV file and replaced the NA values with zeros. I just can’t get my data to read as integers and then sum it up.
data<-read.csv("DataSet.2.csv",header=FALSE)
mode(data)
[1] "list"
data[is.na(data)]=0
data
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 Var_1 Var_2 Var_3 Var_4 Var_5 Var_6 Var_7 Var_8 Var_9 Var_10
2 Crow 8 8 0 3 2 4 4 44 0 23
3 Mouse 2 0 5 4 2 6 36 636 2 2
4 Boar 15 113 48 36 15 66 14 0 2 23
5 Plain 8 17 164 14 91 0 6 10 6 32
6 Silver.Carp 3 1 0 6 7 0 35 35 0 432
7 Dog 1 0 27 0 0 11 0 0 7 43
8 Bingo 2 3 1 15 1 21 0 0 1 0
9 Chrysalis 1 0 2 0 47 0 0 0 7 3
10 Apple 2 0 3 0 0 0 0 0 5 4
11 Cork 3 0 1 0 461 8 2305 15 0 2
12 Ant 11 0 2 0 0 0 0 91 4 0
13 Cat.Claw 2 22 1 110 2 7 10 7 0 0
14 Aardvark 3 1 0 5 25 30 125 0 5 4
15 Carriage 0 3 3 15 0 533 0 1 7 3
16 Airplane 3 2 1 10 0 28 0 47 7 1
17 Clipper 2 1 2 5 0 507 0 0 23 2
18 Armadillo 3 2 4 11 24 0 2 10 3322 0
19 Cork 3 3 1 9 461 88 2305 15 233 3
20 Colt 3 4 1 10 4902 0 0 1 4322 111
21 Cat 3 22 2 220 3 11 10 7 2333 22
V12
1 Var_11
2 15
3 4
4 13
5 3
6 312
7 1
8 22
9 12
10 0
11 0
12 23
13 32
14 44
15 43
16 2
17 33
18 2
19 3
20 55
21 3
#When I use as.numeric I am getting an error
data2<-as.numeric(data)
Error: 'list' object cannot be coerced to type 'double'

It looks like your .csv file contains a header ('Var_1', 'Var_2', etc.) but you are specifying header=FALSE when you load the data, so those strings are being interpreted as data values. Additionally, it looks like your first column represents row names for your dataset. You can specify this via the row.names argument.
Instead, load the data using:
data <- read.csv("DataSet.2.csv", header=TRUE, row.names = 1)
Once the data is loaded you can get the column and row sums via the functions colSums() and rowSums(), respectively. Additionally, if you are replacing the NA values with 0s just for the computation of the sums, you can skip that step by setting the parameter na.rm = TRUE within colSums() and rowSums(). This will exclude the NA values from the computation of the sums. For example:
data <- read.csv("DataSet.2.csv", header=TRUE, row.names = 1)
row_sum <- rowSums(data, na.rm = TRUE)
col_sum <- colSums(data, na.rm = TRUE)
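The as.numeric() error in the question arises because a data frame is a list of columns, and a whole list cannot be coerced to a numeric vector in one call. A minimal self-contained sketch of the approach above follows, assuming a small inline CSV in place of DataSet.2.csv (the `Name` header and the three rows are illustrative assumptions, not the asker's actual file):

```r
# Illustrative stand-in for the CSV file: a header row, a name column, one NA per row.
csv_text <- "Name,Var_1,Var_2,Var_3
Crow,8,8,NA
Mouse,2,NA,5
Boar,15,113,48"

# header = TRUE keeps 'Var_1', ... out of the data; row.names = 1 uses the
# first column as row names, so every remaining column is numeric.
data <- read.csv(text = csv_text, header = TRUE, row.names = 1)

# na.rm = TRUE skips the NA cells instead of replacing them with 0 first.
row_sum <- rowSums(data, na.rm = TRUE)
col_sum <- colSums(data, na.rm = TRUE)

print(row_sum)
print(col_sum)
```

Because the name column has become the row names, both results are named numeric vectors, so row_sum[["Crow"]] looks up a single row total.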

Related

Sum in R based on a date range and another condition?

I am working on a dataframe of baseball data called mlb_team_logs. A random sample lies below.
Date Team season AB PA H X1B X2B X3B HR R RBI BB IBB SO HBP SF SH GDP
1 2015-04-06 ARI 2015 34 39 9 7 1 1 0 4 4 3 0 6 2 0 0 2
2 2015-04-07 ARI 2015 31 36 8 4 1 1 2 7 7 5 0 7 0 0 0 1
3 2015-04-08 ARI 2015 32 35 5 3 2 0 0 2 1 2 0 7 1 0 0 0
4 2015-04-10 ARI 2015 35 38 7 6 0 0 1 4 4 3 0 10 0 0 0 0
5 2015-04-11 ARI 2015 32 35 10 9 0 0 1 6 6 3 0 7 0 0 0 1
6 2015-04-12 ARI 2015 36 38 10 7 3 0 0 4 4 1 0 11 0 0 1 1
7 2015-04-13 ARI 2015 39 44 12 8 3 1 0 8 7 4 0 11 0 0 1 0
8 2015-04-14 ARI 2015 28 32 3 1 2 0 0 1 1 3 0 4 1 0 0 2
9 2015-04-15 ARI 2015 33 34 9 7 1 0 1 2 2 1 0 8 0 0 0 1
10 2015-04-16 ARI 2015 47 51 11 6 2 0 3 7 7 3 1 8 1 0 0 0
240 2015-07-03 ATL 2015 30 32 7 4 1 0 2 2 2 2 0 6 0 0 0 1
241 2015-07-04 ATL 2015 34 40 10 6 3 0 1 9 9 5 0 5 0 0 1 0
242 2015-07-05 ATL 2015 35 37 7 6 1 0 0 0 0 1 0 10 1 0 0 1
243 2015-07-06 ATL 2015 40 44 15 10 4 0 1 5 5 3 0 7 0 0 1 1
244 2015-07-07 ATL 2015 34 37 10 7 1 1 1 4 4 2 0 4 0 0 1 1
245 2015-07-08 ATL 2015 31 38 7 4 1 0 2 5 5 5 1 7 0 0 2 1
246 2015-07-09 ATL 2015 34 37 10 8 2 0 0 3 3 1 0 9 0 1 1 2
247 2015-07-10 ATL 2015 32 35 8 7 0 0 1 3 3 2 0 5 1 0 0 2
248 2015-07-11 ATL 2015 33 38 6 3 1 0 2 2 2 5 1 8 0 0 0 0
249 2015-07-12 ATL 2015 34 41 8 6 2 0 0 3 3 7 1 10 0 0 0 1
250 2015-07-17 ATL 2015 30 36 7 4 3 0 0 4 4 5 1 7 0 0 0 0
The full df has 43 columns. My objective is to sum columns 4 (AB) to 43 on two criteria:
the team
the date falls within the 7 days before the entry in "Date" (i.e. Date - 7 to Date - 1)
Eventually, I would like these columns to be appended to mlb_team_logs as l7_AB, l7_PA, etc (but I know how to do that if the output will be a new dataframe). Any help is appreciated!
EDIT: I altered the sample so that results are easier to test.
You might be able to use a data.table non-equi join here. The idea would be to create a lower date bound (below, I've named this date_lb), and then join the table on itself, matching on Team = Team, Date < Date, and Date >= date_lb. Then use lapply with .SDcols to sum the columns of interest.
Load the library and convert your data frame to a data.table
library(data.table)
setDT(mlb_team_logs)
Identify the columns you want to sum, in a character vector (change to 4:43 in your full dataset)
sumcols = names(mlb_team_logs)[4:19]
Add a lower bound on date
mlb_team_logs[, date_lb := Date - 7]
Join the table on itself, and use lapply(.SD, sum) on the columns of interest
result = mlb_team_logs[mlb_team_logs[, .(Team, Date, date_lb)], on=.(Team, Date<Date, Date>=date_lb)][
, lapply(.SD, sum), by=.(Date,Team), .SDcols = sumcols ]
Set the new names (inplace, using setnames())
setnames(result, old=sumcols, new=paste0("I7_",sumcols))
Output:
Date Team I7_AB I7_PA I7_H I7_X1B I7_X2B I7_X3B I7_HR I7_R I7_RBI I7_BB I7_IBB I7_SO I7_HBP I7_SF I7_SH I7_GDP
<IDat> <char> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1: 2015-04-06 ARI NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2: 2015-04-07 ARI 34 39 9 7 1 1 0 4 4 3 0 6 2 0 0 2
3: 2015-04-08 ARI 65 75 17 11 2 2 2 11 11 8 0 13 2 0 0 3
4: 2015-04-10 ARI 97 110 22 14 4 2 2 13 12 10 0 20 3 0 0 3
5: 2015-04-11 ARI 132 148 29 20 4 2 3 17 16 13 0 30 3 0 0 3
6: 2015-04-12 ARI 164 183 39 29 4 2 4 23 22 16 0 37 3 0 0 4
7: 2015-04-13 ARI 200 221 49 36 7 2 4 27 26 17 0 48 3 0 1 5
8: 2015-04-14 ARI 205 226 52 37 9 2 4 31 29 18 0 53 1 0 2 3
9: 2015-04-15 ARI 202 222 47 34 10 1 2 25 23 16 0 50 2 0 2 4
10: 2015-04-16 ARI 203 221 51 38 9 1 3 25 24 15 0 51 1 0 2 5
11: 2015-07-03 ATL NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
12: 2015-07-04 ATL 30 32 7 4 1 0 2 2 2 2 0 6 0 0 0 1
13: 2015-07-05 ATL 64 72 17 10 4 0 3 11 11 7 0 11 0 0 1 1
14: 2015-07-06 ATL 99 109 24 16 5 0 3 11 11 8 0 21 1 0 1 2
15: 2015-07-07 ATL 139 153 39 26 9 0 4 16 16 11 0 28 1 0 2 3
16: 2015-07-08 ATL 173 190 49 33 10 1 5 20 20 13 0 32 1 0 3 4
17: 2015-07-09 ATL 204 228 56 37 11 1 7 25 25 18 1 39 1 0 5 5
18: 2015-07-10 ATL 238 265 66 45 13 1 7 28 28 19 1 48 1 1 6 7
19: 2015-07-11 ATL 240 268 67 48 12 1 6 29 29 19 1 47 2 1 6 8
20: 2015-07-12 ATL 239 266 63 45 10 1 7 22 22 19 2 50 2 1 5 8
21: 2015-07-17 ATL 99 114 22 16 3 0 3 8 8 14 2 23 1 0 0 3
Date Team I7_AB I7_PA I7_H I7_X1B I7_X2B I7_X3B I7_HR I7_R I7_RBI I7_BB I7_IBB I7_SO I7_HBP I7_SF I7_SH I7_GDP
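As a cross-check on the window logic, the same "previous 7 days, same team" sum can be sketched in base R. This is not the data.table join used above, only a slower row-by-row equivalent. It uses the first four ARI rows from the sample, restricted to AB and PA, and the `l7_` prefix requested in the question; the names `logs`, `sum_cols`, and `window_sums` are illustrative.

```r
# First four ARI rows from the sample, restricted to two of the columns.
logs <- data.frame(
  Date = as.Date(c("2015-04-06", "2015-04-07", "2015-04-08", "2015-04-10")),
  Team = "ARI",
  AB = c(34L, 31L, 32L, 35L),
  PA = c(39L, 36L, 35L, 38L)
)
sum_cols <- c("AB", "PA")

# For each row, sum the rows of the same team dated Date - 7 through Date - 1.
window_sums <- t(sapply(seq_len(nrow(logs)), function(i) {
  in_window <- logs$Team == logs$Team[i] &
    logs$Date < logs$Date[i] &
    logs$Date >= logs$Date[i] - 7
  # No earlier games in the window: return NA, as in the join output above.
  if (!any(in_window)) {
    return(setNames(rep(NA_real_, length(sum_cols)), sum_cols))
  }
  colSums(logs[in_window, sum_cols, drop = FALSE])
}))
colnames(window_sums) <- paste0("l7_", sum_cols)

result <- cbind(logs[, c("Date", "Team")], window_sums)
print(result)
```

The l7_AB and l7_PA values for these four dates agree with the first four rows of the output above. On a full season of logs the non-equi join will likely be much faster, since this sketch rescans the whole table for every row.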

I have a weights variable and I need to create cross tabulations for a chord diagram

I have a dataset with over 15,000 observations. I've dropped all variables but three (3).
One is the individual's origin (or), another is the individual's destination (dest), and the third is that individual's weight (wgt).
Origin and destination are categorical variables.
The weights I have are used as analytic weights in Stata. However, Stata can't handle the number of columns I generate when making tables. R generates them with ease. However, I can't figure out how to apply weights into the generated table.
I tried using wtd.table(), but the following error appears.
wtd.table(NonHSGrad$b206reg, NonHSGrad$c305reg, weights=NonHSGrad$ind_wgts)
Error in proxy[, ..., drop = FALSE] : incorrect number of dimensions
When I use only the table(), this comes out:
table(NonHSGrad$b206reg, NonHSGrad$c305reg)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 285 38 20 8 6 3 1 2 0 1 0 10 38 46 0 2 14
2 32 312 26 3 1 0 2 1 1 0 1 1 22 51 0 0 8
3 17 35 325 12 12 2 3 7 0 2 3 5 52 13 1 1 25
4 3 5 27 224 19 5 2 10 1 1 1 2 51 4 0 3 35
5 4 9 44 81 778 6 7 22 1 4 5 5 155 5 0 5 47
6 4 5 22 21 10 547 24 12 32 21 32 81 86 5 3 15 58
7 5 4 12 17 20 21 558 20 31 99 93 33 59 1 3 67 15
8 8 9 41 49 17 11 24 919 5 8 37 10 151 2 0 52 19
9 0 1 7 9 1 4 26 5 466 66 19 17 17 2 24 24 7
10 1 2 3 4 2 3 27 8 41 528 21 17 13 2 11 36 2
11 3 0 3 10 1 5 11 5 6 17 519 59 7 1 2 49 1
12 0 1 1 2 0 1 5 2 2 10 39 318 10 0 14 17 1
13 15 9 26 34 25 21 12 42 2 5 3 5 187 2 1 6 15
14 14 47 7 5 0 0 0 1 1 0 0 0 9 475 0 0 0
15 0 0 3 1 2 2 4 2 22 9 3 60 9 2 342 2 3
16 0 2 6 10 3 2 11 21 3 33 29 4 34 0 3 404 5
17 1 1 7 15 2 6 1 2 0 1 1 0 34 0 0 2 463
99 0 0 1 1 0 0 0 1 0 1 0 0 0 1 2 0 1
I am also going to use the table for a chord diagram to show flows.
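One possible approach, sketched here under assumptions, is base R's xtabs(): with the weight variable on the left-hand side of the formula, each cell holds the sum of the weights rather than a count of rows. The data frame `flows` and its values are made up for illustration; the column names or, dest, and wgt follow the description above.

```r
# Made-up example data: origin, destination and an analytic weight per person.
flows <- data.frame(
  or   = factor(c(1, 1, 2, 2, 2)),
  dest = factor(c(1, 2, 1, 2, 2)),
  wgt  = c(1.5, 2.0, 0.5, 1.0, 3.0)
)

# Weighted cross-tabulation: each cell is the sum of wgt for that (or, dest) pair.
weighted_tab <- xtabs(wgt ~ or + dest, data = flows)
print(weighted_tab)

# Unweighted counts, for comparison with table().
print(table(flows$or, flows$dest))
```

The result is a two-way table with origins as rows and destinations as columns, the same layout as the table() output above.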

Replace a dot "." with NA in a dataframe in R

I have the following data frame:
obs zip age bed bath size lot exter garage fp price
1 1 1 3 21 3 3.0 951 64904 other 0 0 30000
2 2 2 3 21 3 2.0 1036 217800 frame 0 0 39900
3 3 3 4 7 1 1.0 676 54450 other 2 0 46500
4 4 4 3 6 3 2.0 1456 51836 other 0 1 48600
5 5 5 1 51 3 1.0 1186 10857 other 1 0 51500
6 6 6 2 19 3 2.0 1456 40075 frame 0 0 56990
7 7 7 3 8 3 2.0 1368 . frame 0 0 59900
8 8 8 4 27 3 1.0 994 11016 frame 1 0 62500
9 9 9 1 51 2 1.0 1176 6259 frame 1 1 65500
10 10 10 3 1 3 2.0 1216 11348 other 0 0 69000
11 11 11 4 32 3 2.0 1410 25450 brick 0 0 76900
12 12 12 3 2 3 2.0 1344 . other 0 1 79000
13 13 13 3 25 2 2.0 1064 218671 other 0 0 79900
14 14 14 1 31 3 1.5 1770 19602 brick 0 1 79950
15 15 15 4 29 3 2.0 1524 12720 brick 2 1 82900
16 16 16 3 16 3 2.0 1750 130680 frame 0 0 84900
17 17 17 3 20 3 2.0 1152 104544 other 2 0 85000
18 18 18 3 18 4 2.0 1770 10640 other 0 0 87900
19 19 19 4 28 3 2.0 1624 12700 brick 2 1 89900
20 20 20 2 27 3 2.0 1540 5679 brick 2 1 89900
with the following structure:
str(df)
'data.frame': 69 obs. of 12 variables:
$ Obs : int 1 2 3 4 5 6 7 8 9 10 ...
$ obs : int 1 2 3 4 5 6 7 8 9 10 ...
$ zip : int 3 3 4 3 1 2 3 4 1 3 ...
$ age : int 21 21 7 6 51 19 8 27 51 1 ...
$ bed : int 3 3 1 3 3 3 3 3 2 3 ...
$ bath : num 3 2 1 2 1 2 2 1 1 2 ...
$ size : Factor w/ 66 levels ".","1036","1064",..: 65 2 64 14 6 14 10 66 5 7 ...
$ lot : Factor w/ 60 levels ".","10295","10400",..: 47 28 43 39 9 35 1 11 46 13 ...
$ exter : Factor w/ 3 levels "brick","frame",..: 3 2 3 3 3 2 2 2 2 3 ...
$ garage: int 0 0 2 0 1 0 0 1 1 0 ...
$ fp : int 0 0 0 1 0 0 0 0 1 0 ...
$ price : int 30000 39900 46500 48600 51500 56990 59900 62500 65500 69000 ...
As can be seen, the "lot" variable appears as a factor. I have the following questions about this data:
Why does R read this variable "lot" as a factor?
When I tried:
df$lot[df$lot == "."] <- NA all dots (.) were replaced with <NA> and not as NA as I wanted.
I then tried df$lot <- as.numeric(df$lot) but the numerical values of this variable have changed completely, with the (.) being replaced by 1. What happened when I changed the variable's type?
How may I replace all dots (.) with NA?
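A hedged sketch of what is probably going on: a column containing "." cannot be parsed as numbers, so it is read as text (and as a factor when stringsAsFactors = TRUE, the default before R 4.0). Calling as.numeric() on a factor returns the internal level codes, and "." is the first level in the str() output above, which is why it becomes 1. The inline CSV below is illustrative, not the asker's file.

```r
# Illustrative data: "." marks a missing lot size.
csv_text <- "size,lot,price
951,64904,30000
1368,.,59900
1344,.,79000"

# Reading as in the question: the "." forces 'lot' to be text, here a factor.
df_raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)
str(df_raw$lot)

# as.numeric() on a factor gives level codes, not the original numbers.
level_codes <- as.numeric(df_raw$lot)

# Repair an already-loaded factor: go through character, blank out the dots.
lot_chr <- as.character(df_raw$lot)
lot_chr[lot_chr == "."] <- NA
lot_num <- as.numeric(lot_chr)

# Or declare "." as the missing-value marker when reading the file.
df_clean <- read.csv(text = csv_text, na.strings = ".")
str(df_clean$lot)
```

The <NA> shown for a factor is only how R prints a missing factor value; it is still a true NA, as is.na() confirms.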

Adding missing rows in R

I have a table in R as follows
month day total
1 1 3 1414
2 1 5 1070
3 1 6 211
4 1 7 2214
5 1 8 1766
6 1 13 2486
7 1 14 43
8 1 15 2349
9 1 16 4616
10 1 17 2432
11 1 18 1482
12 1 19 694
13 1 20 968
14 1 23 381
15 1 24 390
16 1 26 4063
17 1 27 3323
18 1 28 988
19 1 29 9671
20 1 30 11968
I have to add rows for the missing days (such as 1, 2 and 4) with a total of zero, so that the result looks like
month day total
- 1 1 0
- 1 2 0
1 3 1414
1 4 0
2 1 5 1070
3 1 6 211
4 1 7 2214
5 1 8 1766
1 9 0
1 10 0
1 11 0
1 12 0
6 1 13 2486
7 1 14 43
8 1 15 2349
9 1 16 4616
10 1 17 2432
11 1 18 1482
12 1 19 694
13 1 20 968
1 21 0
1 22 0
14 1 23 381
15 1 24 390
1 25 0
16 1 26 4063
17 1 27 3323
18 1 28 988
19 1 29 9671
20 1 30 11968
Using only base R, you could do it this way:
for(d in 1:31) {
if(!d %in% my.df$day)
my.df[nrow(my.df) + 1,] <- c(1,d,0)
}
# Reorder rows
my.df <- my.df[with(my.df, order(month, day)),]
rownames(my.df) <- NULL
# Check the results
head(my.df)
# month day total
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1414
# 4 1 4 0
# 5 1 5 1070
# 6 1 6 211
In R, we could create a new dataset with 'day' column from 1:30 and 'month' as '1', left_join with the original dataset and replace the NA values after the merge with '0'
df2 <- data.frame(month=1, day=1:30)
library(dplyr)
left_join(df2, df1) %>%
mutate(total=replace(total, which(is.na(total)),0))
Or use merge from base R to get 'dM' and assign the NA values in the 'total' to '0'
dM <- merge(df1, df2, all.y=TRUE)
dM$total[is.na(dM$total)] <- 0
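For a quick check, the merge() variant can be run on a cut-down copy of the table. This is a sketch assuming only days 3, 5 and 6 of the original data and a scaffold covering days 1 to 6; the names df1, df2 and dM follow the answer above.

```r
# Cut-down version of the original table: only days 3, 5 and 6 are present.
df1 <- data.frame(month = 1L, day = c(3L, 5L, 6L), total = c(1414L, 1070L, 211L))

# Scaffold containing every day that should appear in the result.
df2 <- data.frame(month = 1L, day = 1:6)

# all.y = TRUE keeps every scaffold row; days absent from df1 get total = NA.
dM <- merge(df1, df2, all.y = TRUE)

# Replace the NA totals with 0.
dM$total[is.na(dM$total)] <- 0
print(dM)
```

merge() joins on the columns the two data frames share (month and day) and sorts by them by default, so no separate reordering step is needed.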

Subset rows containing not 0 elements in a data.frame

I have a data.frame that looks like this:
cln1 cln2 cln3 cln4
0 1 2 0
3 9 7 12
1 0 13 0
4 98 23 11
I would like to keep only those rows that contain no 0 elements. The desired output is:
cln1 cln2 cln3 cln4
3 9 7 12
4 98 23 11
Assuming your data.frame is called "mydf":
> mydf[!rowSums(mydf == 0), ]
cln1 cln2 cln3 cln4
2 3 9 7 12
4 4 98 23 11
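To see why this works, the one-liner can be unpacked step by step. The sketch below rebuilds the example data and uses the intermediate names `zero_counts` and `kept`, which are illustrative only.

```r
# Rebuild the example data frame from the question.
mydf <- data.frame(
  cln1 = c(0, 3, 1, 4),
  cln2 = c(1, 9, 0, 98),
  cln3 = c(2, 7, 13, 23),
  cln4 = c(0, 12, 0, 11)
)

# mydf == 0 is a logical matrix; rowSums() counts the TRUE cells in each row.
zero_counts <- rowSums(mydf == 0)
print(zero_counts)

# !zero_counts is TRUE only where the count is 0, i.e. the row has no zeros.
# Writing the comparison out makes the intent explicit.
kept <- mydf[zero_counts == 0, ]
print(kept)
```

If the data may contain NA values, rowSums(mydf == 0, na.rm = TRUE) keeps an NA cell from turning the whole row's count into NA.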
