Create a matrix from pre-ordered vectors in R

How can I create a matrix that will contain the sales values:
summary(sales.dp)
Dept Week Sales
Min. : 1.00 8 : 81 Min. : -545.8
1st Qu.:21.00 9 : 81 1st Qu.: 2794.9
Median :42.00 11 : 81 Median : 7840.9
Mean :45.93 19 : 81 Mean : 14444.1
3rd Qu.:72.00 3 : 80 3rd Qu.: 19309.3
Max. :99.00 6 : 80 Max. :242627.6
(Other):3625
so that each Sales value appears in the row for its Dept and the column for its Week?
All missing values should be represented by NA.

You can use:
with(sales.dp, tapply(Sales, list(Dept, Week), mean))
If a Week:Dept pair has multiple sales, this takes their average; otherwise it returns the single value. A pair with no sales is represented by NA.
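As a minimal sketch of how this behaves, the toy data below stands in for sales.dp (the values are made up for illustration and are not from the question). Dept 2 has two sales in week 1, so that cell holds their mean, and the pairs with no sales come out as NA.

```r
# Toy stand-in for sales.dp; the values are invented for illustration
sales.dp <- data.frame(
  Dept  = c(1, 1, 2, 2, 2),
  Week  = factor(c(1, 2, 1, 1, 3)),
  Sales = c(100, 150, 200, 300, 50)
)

# Rows are Dept, columns are Week; duplicated pairs are averaged
sales.mat <- with(sales.dp, tapply(Sales, list(Dept, Week), mean))
print(sales.mat)
```

Row and column names of the result are the Dept and Week values, so a single cell can be looked up by name, for example sales.mat["2", "1"].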

Related

Not a Number (NaN) for the standard error, lower and upper using predict function in unmarked

Why do I get NaNs when running the following code? I am conducting a distance sampling analysis of species density per covariate. This hasn't happened for any of my other covariates.
#summary of umf2
> summary(umf2)
unmarkedFrameDS Object
line-transect survey design
Distance class cutpoints (m): 0 5 10 15 20 25 30 35
64 sites
Maximum number of distance classes per site: 7
Mean number of distance classes per site: 7
Sites with at least one detection: 8
Tabulation of y observations:
0 1 2
434 12 2
Site-level covariates:
transect length_m length number
01_Cungha_T1 : 1 Min. : 996 Min. :0.996 Min. : 1.00
02_Capicada_T1: 1 1st Qu.:1061 1st Qu.:1.061 1st Qu.:16.75
03_Caghode_T1 : 1 Median :1098 Median :1.099 Median :32.50
04_Caghode_T2 : 1 Mean :1126 Mean :1.126 Mean :32.50
05_Cafal_T1 : 1 3rd Qu.:1167 3rd Qu.:1.167 3rd Qu.:48.25
06_Muna_T1 : 1 Max. :1758 Max. :1.758 Max. :64.00
(Other) :58
y x d_forest_1 d_all_roads_1
Min. :11.10 Min. :-15.16 Min. : 0 Min. : 5.0
1st Qu.:11.19 1st Qu.:-15.08 1st Qu.: 0 1st Qu.: 180.3
Median :11.25 Median :-15.02 Median : 10 Median : 534.8
Mean :11.27 Mean :-15.00 Mean : 2751 Mean : 811.5
3rd Qu.:11.34 3rd Qu.:-14.94 3rd Qu.: 2893 3rd Qu.:1145.9
Max. :11.42 Max. :-14.77 Max. :17242 Max. :3666.7
#fitting an a priori model set: all roads, hazard.
m.haz.2.allroads <- distsamp(~1 ~d_all_roads_1, umf2, keyfun="hazard", output="density", unitsOut="kmsq")
#predict with distance to all roads
> m.allroads2 <- data.frame(d_all_roads_1=seq(5.0000, 3666.6775, length=64))
> allroads.pred2 <- predict(m.haz.2.allroads, type="state", newdata=m.allroads2, appendData=TRUE)
There were 50 or more warnings (use warnings() to see the first 50)
> allroads.pred2
Predicted SE lower upper d_all_roads_1
1 1.979158 0.23962464 1.561069 2.509221 5.00000
2 2.041778 0.23667781 1.626820 2.562582 63.12187
3 2.106379 0.23300058 1.695819 2.616337 121.24373
4 2.173025 0.22849261 1.768321 2.670350 179.36560
5 2.241779 0.22303220 1.844623 2.724444 237.48746
...
10 2.619589 0.17424708 2.299396 2.984369 528.09679
11 2.702472 0.15742845 2.410881 3.029331 586.21865
12 2.787978 0.13611667 2.533561 3.067943 644.34052
13 2.876189 0.10742754 2.673157 3.094641 702.46238
14 2.967191 0.06136485 2.849323 3.089934 760.58425
15 3.061072 NaN NaN NaN 818.70611
16 3.157924 NaN NaN NaN 876.82798
17 3.257839 NaN NaN NaN 934.94984
18 3.360917 NaN NaN NaN 993.07171
19 3.467255 NaN NaN NaN 1051.19357
...
60 12.434748 NaN NaN NaN 3434.19004
61 12.828180 NaN NaN NaN 3492.31190
62 13.234061 NaN NaN NaN 3550.43377
63 13.652784 NaN NaN NaN 3608.55563
64 14.084755 NaN NaN NaN 3666.67750
Please let me know if any further information is needed to help me solve this, many thanks.

Smart transpose of a data frame in R

I have multiple numeric columns in R. I summarised them with mydata = summary(labo[sapply(labo, is.numeric)]), converted the result with mydata <- as.data.frame(mydata), and now have this as 'data have':
Var1 Var2 Freq
1 cars Min. : 1.100
2 cars 1st Qu.: 3.375
3 cars Median : 4.500
4 cars Mean :12.075
5 cars 3rd Qu.:12.350
6 cars Max. :12.000
7 cars NA's :3
8 bikes Min. : 12.00
9 bikes 1st Qu.: 23.00
10 bikes Median : 12.00
11 bikes Mean : 10.14
12 bikes 3rd Qu.: 12.00
13 bikes Max. :12.00
14 bikes NA's :2
15 wheels Min. :10.00
16 wheels 1st Qu.:12.00
17 wheels Median :10.00
18 wheels Mean :10.54
19 wheels 3rd Qu.:12.00
20 wheels Max. :20.00
21 wheels NA's :3
I'm looking for a way to smartly transpose the data frame output into this:
data want:
Var2 ! Min ! 1st Qu. ! Median ! 3rd Qu. ! Max. ! NA's
cars !1.100! 3.375. .....
bikes!12.00! 23.00......
One option is pivot_wider:
library(dplyr)
library(tidyr)
df1 %>%
separate(Freq, into = c('VarN', 'Freq'), sep=":\\s*", convert = TRUE) %>%
select(-Var1) %>%
pivot_wider(names_from = VarN, values_from = Freq)
data
mydata <- summary(iris[sapply(iris, is.numeric)])
df1 <- as.data.frame(mydata)
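A base R alternative, sketched here on the same iris data, is to skip the formatted summary table entirely and compute the statistics as numbers with sapply(), then transpose. This assumes every numeric column yields the same set of statistics; if only some columns contain NAs, sapply() would return a list rather than a matrix and this sketch would need adjusting. The names num_cols and wide are illustrative.

```r
# Keep only the numeric columns
num_cols <- iris[sapply(iris, is.numeric)]

# sapply() gives statistics in rows and variables in columns;
# t() flips that so each variable becomes one row
wide <- as.data.frame(t(sapply(num_cols, summary)))
print(wide)
```

Here the values stay numeric, so no string splitting on ":" is needed.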

Reading multiple csv file that are located in the same directory in R

Hello, I am trying to read in multiple csv files that are located in the same directory. I would like to select the working directory and then read all the files into one big list (if possible). My attempt is below. Any assistance would be greatly appreciated. I do not know what I am doing wrong!
directory <- dlgDir()
file_list <- list.files(path = "directory", pattern = "*.csv")
bigList <- sapply(file_list, read.csv)
Here is an example using an updated version of Alberto Barradas' Pokémon Stats data from kaggle.com that reads the list of files from a directory and combines them into a data frame.
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/pokemonData.zip",
"pokemonData.zip",
method="curl",mode="wb")
unzip("pokemonData.zip")
thePokemonFiles <- list.files("./pokemonData",
full.names=TRUE)
thePokemonFiles
pokemonData <- lapply(thePokemonFiles,function(x) read.csv(x))
At this point, the object pokemonData is a list of seven data frames, each containing one of the seven generations of Pokémon, which we'll demonstrate with summary().
> summary(pokemonData)
Length Class Mode
[1,] 13 data.frame list
[2,] 13 data.frame list
[3,] 13 data.frame list
[4,] 13 data.frame list
[5,] 13 data.frame list
[6,] 13 data.frame list
[7,] 13 data.frame list
To combine them into a single data frame, we use do.call() with the rbind() function.
pokemonData <- do.call(rbind,pokemonData)
To demonstrate that the pokemonData object now consists of a single data frame with all seven generations of Pokémon, we'll execute summary() again.
> summary(pokemonData)
Number Name
Min. : 1.0 Abra : 1
1st Qu.:208.0 Aerodactyl : 1
Median :402.0 AerodactylMega Aerodactyl: 1
Mean :405.4 Alakazam : 1
3rd Qu.:609.0 AlakazamMega Alakazam : 1
Max. :807.0 Arbok : 1
(Other) :887
Type1 Type2 Total HP
Water :122 :385 Min. :175.0 Min. : 1.00
Normal :110 Flying :108 1st Qu.:330.0 1st Qu.: 50.00
Grass : 82 Ground : 37 Median :455.0 Median : 66.00
Bug : 78 Poison : 35 Mean :437.6 Mean : 69.44
Psychic: 66 Psychic: 35 3rd Qu.:518.0 3rd Qu.: 80.00
Fire : 58 (Other):258 Max. :780.0 Max. :255.00
(Other):377 NA's : 35
Attack Defense SpecialAtk
Min. : 5.00 Min. : 5.00 Min. : 10.0
1st Qu.: 55.00 1st Qu.: 50.00 1st Qu.: 50.0
Median : 75.00 Median : 70.00 Median : 65.0
Mean : 79.83 Mean : 74.39 Mean : 73.4
3rd Qu.:100.00 3rd Qu.: 90.00 3rd Qu.: 95.0
Max. :190.00 Max. :230.00 Max. :194.0
SpecialDef Speed Generation
Min. : 20.00 Min. : 5.00 Min. :1.000
1st Qu.: 50.00 1st Qu.: 45.00 1st Qu.:2.000
Median : 70.00 Median : 65.00 Median :4.000
Mean : 72.37 Mean : 68.21 Mean :3.713
3rd Qu.: 90.00 3rd Qu.: 90.00 3rd Qu.:5.000
Max. :230.00 Max. :180.00 Max. :7.000
Legendary
False:734
True : 65
NA's : 94
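Applied to the attempt in the question, the same pattern might look like the sketch below. It uses a hypothetical temporary directory (csv_dir) with two small files so that it runs on its own. Two details differ from the original attempt: the directory is passed as a variable rather than the quoted string "directory", and the pattern is a regular expression ("\\.csv$") rather than a glob. If the directory comes from svDialogs::dlgDir(), the selected path is, as far as I know, stored in the $res element of the returned object; treat that as an assumption to check against the package documentation.

```r
# Hypothetical demo directory with two small CSV files
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(csv_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(csv_dir, "b.csv"), row.names = FALSE)

# full.names = TRUE returns paths that read.csv() can open directly
file_list <- list.files(path = csv_dir, pattern = "\\.csv$", full.names = TRUE)

# One data frame per file, named after the file it came from
big_list <- lapply(file_list, read.csv)
names(big_list) <- basename(file_list)

# Optional: stack them into one data frame (columns must match)
combined <- do.call(rbind, big_list)
```

lapply() is used instead of sapply() so the result is always a plain list of data frames.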

Keep only the numbers in the summary cells and use the statistic names as row names

Using trees dataset.
data(trees)
Each cell of the summary output contains the statistic's label (Min., Max., 1st Qu., and so on) together with its value. Only the numbers should appear in the cells, and the labels should become the row names for the whole dataset.
The output needed looks like this:
We can apply summary on each of the columns separately by looping with sapply.
data(trees)
sapply(trees, summary)
# Girth Height Volume
# Min. 8.30 63 10.20
# 1st Qu. 11.05 72 19.40
# Median 12.90 76 24.20
# Mean 13.25 76 30.17
# 3rd Qu. 15.25 80 37.30
# Max. 20.60 87 77.00
The OP's output may have resulted from applying the summary directly on the whole dataset.
summary(trees)
# Girth Height Volume
# Min. : 8.30 Min. :63 Min. :10.20
# 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
# Median :12.90 Median :76 Median :24.20
# Mean :13.25 Mean :76 Mean :30.17
# 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
# Max. :20.60 Max. :87 Max. :77.00

Imputing data with median by date in R

I need to replace the missing values in the field "steps" with the median of "steps" calculated over that particular day (grouped by "date"), with NA values removed. I have already consulted this thread, but my NA values aren't replaced. Can somebody help me find out where I am going wrong? I would prefer using base R, data.table, or plyr. The dataset looks approximately like this:
steps date interval
1: NA 2012-10-01 0
2: NA 2012-10-01 5
3: NA 2012-10-01 10
4: NA 2012-10-01 15
5: NA 2012-10-01 20
---
17564: NA 2012-11-30 2335
17565: NA 2012-11-30 2340
17566: NA 2012-11-30 2345
17567: NA 2012-11-30 2350
17568: NA 2012-11-30 2355
The structure and summary of the dataset(activity) are as shown below
#str(activity)
Classes ‘data.table’ and 'data.frame': 17568 obs. of 3 variables:
$ steps : int NA NA NA NA NA NA NA NA NA NA ...
$ date : Date, format: "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
#summary(activity)
steps date interval
Min. : 0.00 Min. :2012-10-01 Min. : 0.0
1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
NA's :2304
Things I have tried:
Datatable method:
activityrepNA<-activity[,steps := ifelse(is.na(steps), median(steps, na.rm=TRUE), steps), by=date]
summary(activityrepNA)
steps date interval
Min. : 0.00 Min. :2012-10-01 Min. : 0.0
1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
NA's :2304
Using ave
activity$steps[is.na(activity$steps)] <- with(activity, ave(steps,date, FUN = function(x) median(x, na.rm = TRUE)))[is.na(activity$steps)]
> summary(activity)
steps date interval
Min. : 0.00 Min. :2012-10-01 Min. : 0.0
1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
NA's :2304
Attempt at ddply
cleandatapls <- ddply(activity,
.(as.character(date)),
transform,
steps = ifelse(is.na(steps), median(steps, na.rm = TRUE), steps))
> summary(cleandatapls)
as.character(date) steps date interval
Length:17568 Min. : 0.00 Min. :2012-10-01 Min. : 0.0
Class :character 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Mode :character Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
NA's :2304
Aggregate for calculating median
whynoclean<-aggregate(activity,by=list(activity$date),FUN=median,na.rm=TRUE)
> summary(whynoclean)
Group.1 steps date interval
Min. :2012-10-01 Min. :0 Min. :2012-10-01 Min. :1178
1st Qu.:2012-10-16 1st Qu.:0 1st Qu.:2012-10-16 1st Qu.:1178
Median :2012-10-31 Median :0 Median :2012-10-31 Median :1178
Mean :2012-10-31 Mean :0 Mean :2012-10-31 Mean :1178
3rd Qu.:2012-11-15 3rd Qu.:0 3rd Qu.:2012-11-15 3rd Qu.:1178
Max. :2012-11-30 Max. :0 Max. :2012-11-30 Max. :1178
NA's :8
EDIT: output for the code using mutate
activity %>% group_by(date) %>% mutate(steps = replace(steps, is.na(steps), median(steps, na.rm = T)))
Source: local data table [17,568 x 3]
steps date interval
1 NA 2012-10-01 0
2 NA 2012-10-01 5
3 NA 2012-10-01 10
4 NA 2012-10-01 15
5 NA 2012-10-01 20
6 NA 2012-10-01 25
7 NA 2012-10-01 30
8 NA 2012-10-01 35
9 NA 2012-10-01 40
10 NA 2012-10-01 45
.. ... ... ...
UPDATE:
Steven Beaupre helped me realize that my imputation approach was flawed: some dates contain only NA values, and the median of an all-NA group is still NA even with na.rm = TRUE, so those rows were never filled. I used another suggested approach instead.
Try:
library(dplyr)
df %>%
group_by(date) %>%
mutate(steps = ifelse(is.na(steps), median(steps, na.rm = TRUE), steps))
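If, for a given date, all steps are NA, you could replace them with 0:
df %>%
group_by(date) %>%
# if/else rather than a scalar ifelse(), which would return a single value for the whole group
mutate(steps = if (all(is.na(steps))) 0 else
ifelse(is.na(steps), median(steps, na.rm = TRUE), as.numeric(steps)))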
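Since the question asked for a base R, data.table, or plyr solution, the same idea can be sketched in base R with ave(). The toy activity data and the helper impute_day() below are illustrative and not from the original post; the second date contains only NAs to show the fallback to 0.

```r
# Toy data: the second day has no observed steps at all
activity <- data.frame(
  steps    = c(NA, 4, 10, NA, NA, NA),
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-01",
                       "2012-10-02", "2012-10-02", "2012-10-02")),
  interval = c(0, 5, 10, 0, 5, 10)
)

# Hypothetical helper: fill NAs with the day's median,
# or with 0 when the whole day is missing
impute_day <- function(x) {
  if (all(is.na(x))) return(rep(0, length(x)))
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# ave() applies the helper within each date and keeps the row order
activity$steps <- ave(activity$steps, activity$date, FUN = impute_day)
print(activity)
```

Whether 0 is a sensible fill value for a fully missing day depends on the analysis; an overall median is another possible choice.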

Resources