This question already has answers here:
How to replace NA with mean by group / subset?
(5 answers)
Closed 3 years ago.
I need to replace the missing values in the column "steps" with the median of "steps" calculated over that particular day (grouped by "date"), with NA values removed. I have already referred to this thread, but my NA values are not being replaced. Can somebody help me find out where I am going wrong? I would prefer base R, data.table, or plyr. The dataset looks approximately like this:
steps date interval
1: NA 2012-10-01 0
2: NA 2012-10-01 5
3: NA 2012-10-01 10
4: NA 2012-10-01 15
5: NA 2012-10-01 20
---
17564: NA 2012-11-30 2335
17565: NA 2012-11-30 2340
17566: NA 2012-11-30 2345
17567: NA 2012-11-30 2350
17568: NA 2012-11-30 2355
The structure and summary of the dataset (activity) are shown below.
#str(activity)
Classes ‘data.table’ and 'data.frame': 17568 obs. of 3 variables:
$ steps : int NA NA NA NA NA NA NA NA NA NA ...
$ date : Date, format: "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
#summary(activity)
steps date interval
Min. : 0.00 Min. :2012-10-01 Min. : 0.0
1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
NA's :2304
Things I have tried:
data.table method:
activityrepNA <- activity[, steps := ifelse(is.na(steps), median(steps, na.rm = TRUE), steps), by = date]
summary(activityrepNA)
steps date interval
Min. : 0.00 Min. :2012-10-01 Min. : 0.0
1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
NA's :2304
Using ave
activity$steps[is.na(activity$steps)] <- with(activity, ave(steps,date, FUN = function(x) median(x, na.rm = TRUE)))[is.na(activity$steps)]
> summary(activity)
steps date interval
Min. : 0.00 Min. :2012-10-01 Min. : 0.0
1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
NA's :2304
Attempt at ddply
cleandatapls <- ddply(activity,
                      .(as.character(date)),
                      transform,
                      steps = ifelse(is.na(steps), median(steps, na.rm = TRUE), steps))
> summary(cleandatapls)
as.character(date) steps date interval
Length:17568 Min. : 0.00 Min. :2012-10-01 Min. : 0.0
Class :character 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Mode :character Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
NA's :2304
Aggregate for calculating median
whynoclean <- aggregate(activity, by = list(activity$date), FUN = median, na.rm = TRUE)
> summary(whynoclean)
Group.1 steps date interval
Min. :2012-10-01 Min. :0 Min. :2012-10-01 Min. :1178
1st Qu.:2012-10-16 1st Qu.:0 1st Qu.:2012-10-16 1st Qu.:1178
Median :2012-10-31 Median :0 Median :2012-10-31 Median :1178
Mean :2012-10-31 Mean :0 Mean :2012-10-31 Mean :1178
3rd Qu.:2012-11-15 3rd Qu.:0 3rd Qu.:2012-11-15 3rd Qu.:1178
Max. :2012-11-30 Max. :0 Max. :2012-11-30 Max. :1178
NA's :8
EDIT: output of the code using mutate
activity %>% group_by(date) %>% mutate(steps = replace(steps, is.na(steps), median(steps, na.rm = T)))
Source: local data table [17,568 x 3]
steps date interval
1 NA 2012-10-01 0
2 NA 2012-10-01 5
3 NA 2012-10-01 10
4 NA 2012-10-01 15
5 NA 2012-10-01 20
6 NA 2012-10-01 25
7 NA 2012-10-01 30
8 NA 2012-10-01 35
9 NA 2012-10-01 40
10 NA 2012-10-01 45
.. ... ... ...
UPDATE:
Steven Beaupre helped me realize that my imputation approach was flawed: some dates contain only NA values, and the median of all NAs is itself NA, which is what was causing the problem. I used another suggested approach instead.
Try:
library(dplyr)
df %>%
group_by(date) %>%
mutate(steps = ifelse(is.na(steps), median(steps, na.rm = T), steps))
If for a given date, all steps are NAs, you could replace them with 0:
df %>%
group_by(date) %>%
mutate(steps = ifelse(all(is.na(steps)), 0,
ifelse(is.na(steps), median(steps, na.rm = T), steps)))
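The failure mode can be reproduced in one line of base R: for a date whose values are all missing, na.rm = TRUE removes everything, and median() of the resulting empty vector is NA (a minimal illustration, not part of the original code):

```r
# For an all-NA group, na.rm = TRUE leaves nothing to summarise,
# so the "imputed" value is NA again and the NAs survive.
median(c(NA_real_, NA_real_), na.rm = TRUE)  # NA
```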
Related
Incidentally, I have found this problem with write.csv() and NA values when using format():
d <- data.frame(id=1:10, f=0.1*(1:10),f2=0.01*(1:10))
d$f2[3] <- NA
summary(d)
id f f2
Min. : 1.00 Min. :0.100 Min. :0.01000
1st Qu.: 3.25 1st Qu.:0.325 1st Qu.:0.04000
Median : 5.50 Median :0.550 Median :0.06000
Mean : 5.50 Mean :0.550 Mean :0.05778
3rd Qu.: 7.75 3rd Qu.:0.775 3rd Qu.:0.08000
Max. :10.00 Max. :1.000 Max. :0.10000
NA's :1
format(d, nsmall=3)
id f f2
1 1 0.100 0.010
2 2 0.200 0.020
3 3 0.300 NA
4 4 0.400 0.040
5 5 0.500 0.050
6 6 0.600 0.060
7 7 0.700 0.070
8 8 0.800 0.080
9 9 0.900 0.090
10 10 1.000 0.100
format(d$f2, nsmall = 3)
[1] "0.010" "0.020" " NA" "0.040" "0.050" "0.060" "0.070" "0.080" "0.090" "0.100"
format(d$f2[3])
[1] "NA"
write.csv(format(d,nsmall=3),file="test.csv",row.names = FALSE)
d2 <- read.csv("test.csv")
summary(d2)
id f f2
Min. : 1.00 Min. :0.100 Length:10
1st Qu.: 3.25 1st Qu.:0.325 Class :character
Median : 5.50 Median :0.550 Mode :character
Mean : 5.50 Mean :0.550
3rd Qu.: 7.75 3rd Qu.:0.775
Max. :10.00 Max. :1.000
I checked test.csv and found that the cell corresponding to d$f2[3] is not "NA" but " NA".
d2 <- read.csv("test.csv", na.strings=" NA")
summary(d2)
id f f2
Min. : 1.00 Min. :0.100 Min. :0.01000
1st Qu.: 3.25 1st Qu.:0.325 1st Qu.:0.04000
Median : 5.50 Median :0.550 Median :0.06000
Mean : 5.50 Mean :0.550 Mean :0.05778
3rd Qu.: 7.75 3rd Qu.:0.775 3rd Qu.:0.08000
Max. :10.00 Max. :1.000 Max. :0.10000
NA's :1
Should this behavior of format(), padding NAs with white space, not be considered a bug?
This is not a critical issue, since using format() within write.csv() is not really necessary (I ran into this problem in a very particular case), but in principle NAs should not be affected by any formatting. Having a nicer print at the console is one thing; actually saving those white spaces to a file that could be read back into R is another.
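A sketch of where the padding comes from, and one way to avoid it: format() right-justifies a numeric vector to a common width, which is what turns NA into " NA"; trim = TRUE suppresses that padding. A single value, as in format(d$f2[3]), is unaffected because there is nothing to align against.

```r
x <- c(0.01, NA)
format(x, nsmall = 3)               # "0.010" "   NA": NA padded to the common width
format(x, nsmall = 3, trim = TRUE)  # "0.010" "NA": leading blanks suppressed
```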
I summarized the numeric columns of my data (mydata <- summary(labo[sapply(labo, is.numeric)])), then converted the result to a data frame with mydata <- as.data.frame(mydata), and I have this as 'data have':
Var1 Var2 Freq
1 cars Min. : 1.100
2 cars 1st Qu.: 3.375
3 cars Median : 4.500
4 cars Mean :12.075
5 cars 3rd Qu.:12.350
6 cars Max. :12.000
7 cars NA's :3
8 bikes Min. : 12.00
9 bikes 1st Qu.: 23.00
10 bikes Median : 12.00
11 bikes Mean : 10.14
12 bikes 3rd Qu.: 12.00
13 bikes Max. :12.00
14 bikes NA's :2
15 wheels Min. :10.00
16 wheels 1st Qu.:12.00
17 wheels Median :10.00
18 wheels Mean :10.54
19 wheels 3rd Qu.:12.00
20 wheels Max. :20.00
21 wheels NA's :3
I'm looking for a way to smartly transpose the output of the data frame to this:
data want:
Var2  ! Min.  ! 1st Qu. ! Median ! 3rd Qu. ! Max. ! NA's
cars  ! 1.100 ! 3.375   ! .....
bikes ! 12.00 ! 23.00   ! ......
One option is pivot_wider:
library(dplyr)
library(tidyr)
df1 %>%
separate(Freq, into = c('VarN', 'Freq'), sep=":\\s*", convert = TRUE) %>%
select(-Var1) %>%
pivot_wider(names_from = VarN, values_from = Freq)
data
mydata <- summary(iris[sapply(iris, is.numeric)])
df1 <- as.data.frame(mydata)
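For comparison, here is a base R sketch of the same reshape. It assumes every cell of Freq has the "Name : value" shape shown above (true for the iris example; note that xtabs() would sum duplicates, but there is exactly one value per Var2/statistic pair here):

```r
# Rebuild the example data, then split "Median :4.500"-style cells
# into a statistic name and a numeric value.
mydata <- summary(iris[sapply(iris, is.numeric)])
df1 <- as.data.frame(mydata)
parts <- strsplit(as.character(df1$Freq), ":\\s*")
df1$VarN  <- trimws(vapply(parts, `[`, "", 1))
df1$value <- as.numeric(vapply(parts, `[`, "", 2))

# Cross-tabulate: one row per variable, one column per statistic.
xtabs(value ~ Var2 + VarN, data = df1)
```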
Hello, I am trying to read in multiple csv files that are located in the same directory. I would like to select the working directory and then read all the files into one big list (if possible). My attempt is below; any assistance would be greatly appreciated. I do not know what I am doing wrong!
directory <- dlgDir()
file_list <- list.files(path = "directory", pattern = "*.csv")
bigList <- sapply(file_list, read.csv)
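One likely culprit in the attempt above: path = "directory" passes the literal string "directory" rather than the variable, so list.files() searches a folder named "directory" and returns nothing. A corrected sketch (assuming dlgDir() is from the svDialogs package, whose result is stored in the $res component):

```r
library(svDialogs)                           # for dlgDir()

directory <- dlgDir()$res                    # the chosen path lives in $res
file_list <- list.files(path = directory,    # unquoted: the variable, not "directory"
                        pattern = "\\.csv$", # anchored regex; "*.csv" is a glob, not a regex
                        full.names = TRUE)   # keep the directory prefix so read.csv can find them
bigList <- lapply(file_list, read.csv)       # lapply keeps the result as a plain list
```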
Here is an example using an updated version of Alberto Barradas' Pokémon Stats data from kaggle.com that reads the list of files from a directory and combines them into a data frame.
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/pokemonData.zip",
"pokemonData.zip",
method="curl",mode="wb")
unzip("pokemonData.zip")
thePokemonFiles <- list.files("./pokemonData",
full.names=TRUE)
thePokemonFiles
pokemonData <- lapply(thePokemonFiles,function(x) read.csv(x))
At this point, the object pokemonData is a list of seven data frames, each containing one of the seven generations of Pokémon, which we'll demonstrate with summary().
> summary(pokemonData)
Length Class Mode
[1,] 13 data.frame list
[2,] 13 data.frame list
[3,] 13 data.frame list
[4,] 13 data.frame list
[5,] 13 data.frame list
[6,] 13 data.frame list
[7,] 13 data.frame list
To combine them into a single data frame, we use do.call() with the rbind() function.
pokemonData <- do.call(rbind,pokemonData)
To demonstrate that the pokemonData object now consists of a single data frame with all seven generations of Pokémon, we'll execute summary() again.
> summary(pokemonData)
Number Name
Min. : 1.0 Abra : 1
1st Qu.:208.0 Aerodactyl : 1
Median :402.0 AerodactylMega Aerodactyl: 1
Mean :405.4 Alakazam : 1
3rd Qu.:609.0 AlakazamMega Alakazam : 1
Max. :807.0 Arbok : 1
(Other) :887
Type1 Type2 Total HP
Water :122 :385 Min. :175.0 Min. : 1.00
Normal :110 Flying :108 1st Qu.:330.0 1st Qu.: 50.00
Grass : 82 Ground : 37 Median :455.0 Median : 66.00
Bug : 78 Poison : 35 Mean :437.6 Mean : 69.44
Psychic: 66 Psychic: 35 3rd Qu.:518.0 3rd Qu.: 80.00
Fire : 58 (Other):258 Max. :780.0 Max. :255.00
(Other):377 NA's : 35
Attack Defense SpecialAtk
Min. : 5.00 Min. : 5.00 Min. : 10.0
1st Qu.: 55.00 1st Qu.: 50.00 1st Qu.: 50.0
Median : 75.00 Median : 70.00 Median : 65.0
Mean : 79.83 Mean : 74.39 Mean : 73.4
3rd Qu.:100.00 3rd Qu.: 90.00 3rd Qu.: 95.0
Max. :190.00 Max. :230.00 Max. :194.0
SpecialDef Speed Generation
Min. : 20.00 Min. : 5.00 Min. :1.000
1st Qu.: 50.00 1st Qu.: 45.00 1st Qu.:2.000
Median : 70.00 Median : 65.00 Median :4.000
Mean : 72.37 Mean : 68.21 Mean :3.713
3rd Qu.: 90.00 3rd Qu.: 90.00 3rd Qu.:5.000
Max. :230.00 Max. :180.00 Max. :7.000
Legendary
False:734
True : 65
NA's : 94
I have an xts object with 12 variables over time in 15 minute intervals.
> summary(wideRawXTSscaled)
Index DO0182U09A3 DO0182U09B3 DO0182U09C3 DO0182U21A1 DO0182U21A2
Min. :2017-01-20 16:30:00 Min. :-1.09338 Min. :-1.0666 Min. :-0.9700 Min. :-1.2687 Min. :-1.00676
1st Qu.:2017-01-24 04:22:30 1st Qu.:-0.60133 1st Qu.:-0.6675 1st Qu.:-0.6009 1st Qu.:-0.4522 1st Qu.:-0.48525
Median :2017-01-27 16:15:00 Median :-0.38317 Median :-0.2742 Median :-0.1761 Median :-0.2127 Median :-0.27482
Mean :2017-01-27 16:15:00 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
3rd Qu.:2017-01-31 04:07:30 3rd Qu.: 0.08221 3rd Qu.: 0.2922 3rd Qu.: 0.1125 3rd Qu.: 0.1248 3rd Qu.: 0.05455
Max. :2017-02-03 16:00:00 Max. : 3.33508 Max. : 9.2143 Max. : 5.8473 Max. :18.4909 Max. :12.21382
DO0182U21A3 DO0182U21B1 DO0182U21B2 DO0182U21B3 DO0182U21C1 DO0182U21C2 DO0182U21C3
Min. :-1.09339 Min. :-1.0268 Min. :-0.9797 Min. :-1.0853 Min. :-1.3556 Min. :-1.15469 Min. :-1.2063
1st Qu.:-0.33919 1st Qu.:-0.6020 1st Qu.:-0.5597 1st Qu.:-0.6692 1st Qu.:-0.5600 1st Qu.:-0.37291 1st Qu.:-0.3460
Median :-0.21082 Median :-0.3389 Median :-0.3466 Median :-0.3828 Median :-0.2138 Median :-0.16183 Median :-0.1635
Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
3rd Qu.:-0.01826 3rd Qu.: 0.2105 3rd Qu.: 0.2486 3rd Qu.: 0.5624 3rd Qu.: 0.1992 3rd Qu.: 0.08052 3rd Qu.: 0.1363
Max. :12.69083 Max. : 7.7314 Max. : 9.2900 Max. : 7.3540 Max. :13.7427 Max. :13.76166 Max. :15.8086
I wish to compute the correlations between the variables at each interval of time. Since I have 12 variables, I expect a 12 x 12 matrix for each of the 15-minute data points in my xts object.
For the correlation computation I am using the following code:
wideRawXTSscaledCorr <- rollapplyr(wideRawXTSscaled, 10, cor, by.column = FALSE)
"10" in the code above uses 10 time series values to calculate the correlation matrix therefore I will have 9 NA values at the start of my wideRawXTSscaledCorr with the correlation values returned in the 10th.
> wideRawXTSscaledCorr[1:10,1:5]
[,1] [,2] [,3] [,4] [,5]
2017-01-20 16:30:00 NA NA NA NA NA
2017-01-20 16:45:00 NA NA NA NA NA
2017-01-20 17:00:00 NA NA NA NA NA
2017-01-20 17:15:00 NA NA NA NA NA
2017-01-20 17:30:00 NA NA NA NA NA
2017-01-20 17:45:00 NA NA NA NA NA
2017-01-20 18:00:00 NA NA NA NA NA
2017-01-20 18:15:00 NA NA NA NA NA
2017-01-20 18:30:00 NA NA NA NA NA
2017-01-20 18:45:00 1 0.1590656 0.2427391 0.1987761 -0.1026246
When I change the sliding-window width to any value < 10, I get repetitions of the following warning:
> wideRawXTSscaledCorr <- rollapplyr(wideRawXTSscaled, 7, cor, by.column = FALSE)
Warning messages:
1: In FUN(.subset_xts(data, (i - width + 1):i), ...) :
the standard deviation is zero
2: In FUN(.subset_xts(data, (i - width + 1):i), ...) :
the standard deviation is zero
3: In FUN(.subset_xts(data, (i - width + 1):i), ...) :
the standard deviation is zero
4: In FUN(.subset_xts(data, (i - width + 1):i), ...) :
the standard deviation is zero
5: In FUN(.subset_xts(data, (i - width + 1):i), ...) :
the standard deviation is zero
Is this the only test I can use to see whether these warnings are solely a symptom of my data, or could they be due to some coding error I have made? Is there some other way I can go deeper into the code to see which values are causing them?
cor(c(0, 1, 0, 1, 1), c(1, 1, 1, 1, 1))
gives a similar warning. Check whether any of the cor() calls uses data that is identical across all elements of one of the vectors (as in the second vector here)...
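To locate the offending windows rather than guess, you could roll the same 7-row window over the data and flag any window in which some column is constant, since a zero standard deviation is exactly what makes cor() warn (a diagnostic sketch, assuming xts/zoo are loaded and the data contain no NAs):

```r
library(xts)  # also loads zoo, which provides rollapplyr()

# For each 7-row window, TRUE if any column is constant (sd == 0),
# i.e. the condition that triggers "the standard deviation is zero".
constFlags <- rollapplyr(wideRawXTSscaled, 7,
                         function(w) any(apply(w, 2, sd) == 0),
                         by.column = FALSE)
which(coredata(constFlags))  # indices of the windows causing the warnings
```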
I have a binary file of size 360 x 720 covering the globe. I wrote the code given below to read it and to extract an area from that file. When I use summary() on the whole file I get:
summary(a, na.rm=FALSE)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00    1.00    3.00    4.15    7.00   20.00  200083
But when I used summary() on the region (b) that I extracted, I got many V1, V2 columns. That is not right: I should have got one line (as for a), not many V1, V2 columns.
Here is the code:
X <- c(200:300)
Y <- c(150:190)
conne <- file("C:\\initial-WTD.bin", "rb")
a <- readBin(conne, numeric(), size = 4, n = 360 * 720, signed = TRUE)
a[a == -9999] <- NA
y <- matrix(data = a, ncol = 360, nrow = 720)
image(t(t(y[X, Y])), ylim = c(1, 0))
b <- y[X, Y]
summary(b, na.rm = FALSE)
V1 V2 V3 V4 V5 V6 V7
Min. : NA Min. : NA Min. : NA Min. : NA Min. : 8 Min. : NA Min. :
1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.:11 1st Qu.: NA 1st Qu.:
Median : NA Median : NA Median : NA Median : NA Median :14 Median : NA Median
Mean :NaN Mean :NaN Mean :NaN Mean :NaN Mean :14 Mean :NaN Mean
3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.:17 3rd Qu.: NA 3rd
Max. : NA Max. : NA Max. : NA Max. : NA Max. :20 Max. : NA Max.
NA's :101 NA's :101 NA's :101 NA's :101 NA's :99 NA's :101 NA's :
The problem is not in your indexing of the matrix, but some place prior to accessing it:
a <- matrix(1:100, 10, 10)
summary( a[1:3,1:3] )
V1 V2 V3
Min. :1.0 Min. :11.0 Min. :21.0
1st Qu.:1.5 1st Qu.:11.5 1st Qu.:21.5
Median :2.0 Median :12.0 Median :22.0
Mean :2.0 Mean :12.0 Mean :22.0
3rd Qu.:2.5 3rd Qu.:12.5 3rd Qu.:22.5
Max. :3.0 Max. :13.0 Max. :23.0
You managed to hit a few non-NA values (apparently only 2), but why are you doing this with such sparse data? I scaled this up to 100 columns (out of 1000) and still got the expected results.
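The V1, V2, ... headings appear because summary() treats a matrix column by column. To reproduce the single-line summary you saw for the full vector a, collapse the extracted region to a plain vector first (a small sketch using the toy matrix above):

```r
a <- matrix(1:100, 10, 10)
b <- a[1:3, 1:3]

summary(b)             # column-wise summary: V1, V2, V3 headings
summary(as.vector(b))  # one line, like summary() on a plain numeric vector
```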