I have multiple numeric columns in R. I summarised them with mydata = summary(labo[sapply(labo, is.numeric)]), converted the result with mydata <- as.data.frame(mydata), and now have this as 'data have':
Var1 Var2 Freq
1 cars Min. : 1.100
2 cars 1st Qu.: 3.375
3 cars Median : 4.500
4 cars Mean :12.075
5 cars 3rd Qu.:12.350
6 cars Max. :12.000
7 cars NA's :3
8 bikes Min. : 12.00
9 bikes 1st Qu.: 23.00
10 bikes Median : 12.00
11 bikes Mean : 10.14
12 bikes 3rd Qu.: 12.00
13 bikes Max. :12.00
14 bikes NA's :2
15 wheels Min. :10.00
16 wheels 1st Qu.:12.00
17 wheels Median :10.00
18 wheels Mean :10.54
19 wheels 3rd Qu.:12.00
20 wheels Max. :20.00
21 wheels NA's :3
I'm looking for a smart way to transpose the data frame output into this:
data want:
Var2 ! Min ! 1st Qu. ! Median ! 3rd Qu. ! Max. ! NA's
cars !1.100! 3.375. .....
bikes!12.00! 23.00......
One option is pivot_wider:
library(dplyr)
library(tidyr)
df1 %>%
  separate(Freq, into = c('VarN', 'Freq'), sep = ":\\s*", convert = TRUE) %>%
  select(-Var1) %>%
  pivot_wider(names_from = VarN, values_from = Freq)
data
mydata <- summary(iris[sapply(iris, is.numeric)])
df1 <- as.data.frame(mydata)
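If the original data frame is still available, a base R alternative is to skip parsing the printed summary and compute the statistics per column instead. This is a minimal sketch, assuming every numeric column yields the same set of statistics (a column with missing values gains an extra NA's entry, in which case sapply() returns a list rather than a matrix). The names num_cols and wide are only illustrative.

```r
# Keep only the numeric columns of the source data
num_cols <- iris[sapply(iris, is.numeric)]

# sapply() gives a statistics-by-variable matrix; t() puts one variable per row
wide <- as.data.frame(t(sapply(num_cols, summary)))

# Move the variable names from the row names into a regular column
wide <- cbind(Var2 = rownames(wide), wide)
rownames(wide) <- NULL

wide
```

The values stay numeric throughout, so no string splitting or type conversion is needed.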
Hello, I am trying to read in multiple CSV files that are located in the same directory. I would like to select the working directory and then read all the files into one big list (if possible). My attempt is below. Any assistance would be greatly appreciated. I do not know what I am doing wrong!
directory <- dlgDir()
file_list <- list.files(path = "directory", pattern = "*.csv")
bigList <- sapply(file_list, read.csv)
Here is an example using an updated version of Alberto Barradas' Pokémon Stats data from kaggle.com that reads the list of files from a directory and combines them into a data frame.
download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/pokemonData.zip",
"pokemonData.zip",
method="curl",mode="wb")
unzip("pokemonData.zip")
thePokemonFiles <- list.files("./pokemonData",
full.names=TRUE)
thePokemonFiles
pokemonData <- lapply(thePokemonFiles,function(x) read.csv(x))
At this point, the object pokemonData is a list of seven data frames, each containing one of the seven generations of Pokémon, which we'll demonstrate with summary().
> summary(pokemonData)
Length Class Mode
[1,] 13 data.frame list
[2,] 13 data.frame list
[3,] 13 data.frame list
[4,] 13 data.frame list
[5,] 13 data.frame list
[6,] 13 data.frame list
[7,] 13 data.frame list
To combine them into a single data frame, we use do.call() with the rbind() function.
pokemonData <- do.call(rbind,pokemonData)
To demonstrate that the pokemonData object now consists of a single data frame with all seven generations of Pokémon, we'll execute summary() again.
> summary(pokemonData)
Number Name
Min. : 1.0 Abra : 1
1st Qu.:208.0 Aerodactyl : 1
Median :402.0 AerodactylMega Aerodactyl: 1
Mean :405.4 Alakazam : 1
3rd Qu.:609.0 AlakazamMega Alakazam : 1
Max. :807.0 Arbok : 1
(Other) :887
Type1 Type2 Total HP
Water :122 :385 Min. :175.0 Min. : 1.00
Normal :110 Flying :108 1st Qu.:330.0 1st Qu.: 50.00
Grass : 82 Ground : 37 Median :455.0 Median : 66.00
Bug : 78 Poison : 35 Mean :437.6 Mean : 69.44
Psychic: 66 Psychic: 35 3rd Qu.:518.0 3rd Qu.: 80.00
Fire : 58 (Other):258 Max. :780.0 Max. :255.00
(Other):377 NA's : 35
Attack Defense SpecialAtk
Min. : 5.00 Min. : 5.00 Min. : 10.0
1st Qu.: 55.00 1st Qu.: 50.00 1st Qu.: 50.0
Median : 75.00 Median : 70.00 Median : 65.0
Mean : 79.83 Mean : 74.39 Mean : 73.4
3rd Qu.:100.00 3rd Qu.: 90.00 3rd Qu.: 95.0
Max. :190.00 Max. :230.00 Max. :194.0
SpecialDef Speed Generation
Min. : 20.00 Min. : 5.00 Min. :1.000
1st Qu.: 50.00 1st Qu.: 45.00 1st Qu.:2.000
Median : 70.00 Median : 65.00 Median :4.000
Mean : 72.37 Mean : 68.21 Mean :3.713
3rd Qu.: 90.00 3rd Qu.: 90.00 3rd Qu.:5.000
Max. :230.00 Max. :180.00 Max. :7.000
Legendary
False:734
True : 65
NA's : 94
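The same pattern applies to the attempt in the question, where path = "directory" passes the literal string "directory" rather than the variable, and list.files() returns bare file names unless full.names = TRUE is set. The sketch below is self-contained: it writes two small CSV files into a temporary folder first. The folder name csv_demo and the file contents are made up for illustration.

```r
# Hypothetical setup: two small CSV files in a temporary directory
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(csv_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(csv_dir, "b.csv"), row.names = FALSE)

# Pass the variable (no quotes); pattern is a regular expression,
# and full.names = TRUE returns paths that read.csv() can open directly
file_list <- list.files(path = csv_dir, pattern = "\\.csv$", full.names = TRUE)

# lapply() always returns a list, one data frame per file
big_list <- lapply(file_list, read.csv)
names(big_list) <- basename(file_list)

# Optional: stack the list into one data frame (columns must match)
combined <- do.call(rbind, big_list)
```

lapply() is used instead of sapply() because sapply() may try to simplify the result into a matrix when the files happen to have the same shape.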
Using trees dataset.
data(trees)
Each column of the summary output contains the values together with their labels (Min, Max, 1st Quartile and so on). I want only the numbers in the cells, with the labels used as row names for the whole dataset.
Need Output like this
We can apply summary on each of the columns separately by looping with sapply.
data(trees)
sapply(trees, summary)
# Girth Height Volume
# Min. 8.30 63 10.20
# 1st Qu. 11.05 72 19.40
# Median 12.90 76 24.20
# Mean 13.25 76 30.17
# 3rd Qu. 15.25 80 37.30
# Max. 20.60 87 77.00
The OP's output may have resulted from applying the summary directly on the whole dataset.
summary(trees)
# Girth Height Volume
# Min. : 8.30 Min. :63 Min. :10.20
# 1st Qu.:11.05 1st Qu.:72 1st Qu.:19.40
# Median :12.90 Median :76 Median :24.20
# Mean :13.25 Mean :76 Mean :30.17
# 3rd Qu.:15.25 3rd Qu.:80 3rd Qu.:37.30
# Max. :20.60 Max. :87 Max. :77.00
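sapply() returns a matrix here. If a data frame with the statistic names as row names is needed, the matrix can be converted directly. A small sketch (trees_summary is an illustrative name):

```r
data(trees)

# Convert the statistics-by-variable matrix into a data frame
trees_summary <- as.data.frame(sapply(trees, summary))

# Rows are named after the statistics, so cells can be looked up by name
trees_summary["Median", "Height"]
trees_summary[c("Min.", "Max."), ]
```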
This question already has answers here:
How to replace NA with mean by group / subset?
I need to replace the missing values in the field "steps" with the median of "steps" calculated over that particular day (grouped by "date"), with NA values removed. I have already referred to this thread, but my NA values aren't replaced. Can somebody help me find out where I am going wrong? I would prefer using base R, data.table, or plyr. The dataset looks approximately like this:
steps date interval
1: NA 2012-10-01 0
2: NA 2012-10-01 5
3: NA 2012-10-01 10
4: NA 2012-10-01 15
5: NA 2012-10-01 20
---
17564: NA 2012-11-30 2335
17565: NA 2012-11-30 2340
17566: NA 2012-11-30 2345
17567: NA 2012-11-30 2350
17568: NA 2012-11-30 2355
The structure and summary of the dataset(activity) are as shown below
#str(activity)
Classes ‘data.table’ and 'data.frame': 17568 obs. of 3 variables:
$ steps : int NA NA NA NA NA NA NA NA NA NA ...
$ date : Date, format: "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
#summary(activity)
steps date interval
Min. : 0.00 Min. :2012-10-01 Min. : 0.0
1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
NA's :2304
Things I have tried:
Datatable method:
activityrepNA<-activity[,steps := ifelse(is.na(steps), median(steps, na.rm=TRUE), steps), by=date]
summary(activityrepNA)
steps date interval
Min. : 0.00 Min. :2012-10-01 Min. : 0.0
1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
NA's :2304
Using ave
activity$steps[is.na(activity$steps)] <- with(activity, ave(steps,date, FUN = function(x) median(x, na.rm = TRUE)))[is.na(activity$steps)]
> summary(activity)
steps date interval
Min. : 0.00 Min. :2012-10-01 Min. : 0.0
1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
NA's :2304
Attempt at ddply
cleandatapls <- ddply(activity,
                      .(as.character(date)),
                      transform,
                      steps = ifelse(is.na(steps), median(steps, na.rm = TRUE), steps))
> summary(cleandatapls)
as.character(date) steps date interval
Length:17568 Min. : 0.00 Min. :2012-10-01 Min. : 0.0
Class :character 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
Mode :character Median : 0.00 Median :2012-10-31 Median :1177.5
Mean : 37.38 Mean :2012-10-31 Mean :1177.5
3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
Max. :806.00 Max. :2012-11-30 Max. :2355.0
NA's :2304
Aggregate for calculating median
whynoclean<-aggregate(activity,by=list(activity$date),FUN=median,na.rm=TRUE)
> summary(whynoclean)
Group.1 steps date interval
Min. :2012-10-01 Min. :0 Min. :2012-10-01 Min. :1178
1st Qu.:2012-10-16 1st Qu.:0 1st Qu.:2012-10-16 1st Qu.:1178
Median :2012-10-31 Median :0 Median :2012-10-31 Median :1178
Mean :2012-10-31 Mean :0 Mean :2012-10-31 Mean :1178
3rd Qu.:2012-11-15 3rd Qu.:0 3rd Qu.:2012-11-15 3rd Qu.:1178
Max. :2012-11-30 Max. :0 Max. :2012-11-30 Max. :1178
NA's :8
EDIT output for the code using mutate
activity %>% group_by(date) %>% mutate(steps = replace(steps, is.na(steps), median(steps, na.rm = T)))
Source: local data table [17,568 x 3]
steps date interval
1 NA 2012-10-01 0
2 NA 2012-10-01 5
3 NA 2012-10-01 10
4 NA 2012-10-01 15
5 NA 2012-10-01 20
6 NA 2012-10-01 25
7 NA 2012-10-01 30
8 NA 2012-10-01 35
9 NA 2012-10-01 40
10 NA 2012-10-01 45
.. ... ... ...
UPDATE:
Steven Beaupre helped me realize that my imputation approach was flawed: some dates contain only NA values, and the median of an all-NA group is itself NA, so those values were never replaced. I used another suggested approach instead.
Try:
library(dplyr)
df %>%
  group_by(date) %>%
  mutate(steps = ifelse(is.na(steps), median(steps, na.rm = TRUE), steps))
If for a given date, all steps are NAs, you could replace them with 0:
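df %>%
  group_by(date) %>%
  # if/else rather than an outer ifelse(): ifelse() returns a result the length
  # of its test, so a length-one test would keep only the group's first value
  mutate(steps = if (all(is.na(steps))) 0 else
           ifelse(is.na(steps), median(steps, na.rm = TRUE), steps))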
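Since the question asked for base R, data.table or plyr, the same idea can be sketched in base R with ave(). The toy activity data and the helper impute_median() below are made up for illustration. Filling all-NA days with 0 is the same assumption as in the dplyr version, not the only reasonable choice.

```r
# Toy data: the first day has one NA, the second day is entirely NA
activity <- data.frame(
  steps    = c(NA, 4, 10, NA, NA, NA),
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-01",
                       "2012-10-02", "2012-10-02", "2012-10-02")),
  interval = c(0, 5, 10, 0, 5, 10)
)

# Hypothetical helper: fill NAs with the group median,
# or with 0 when the whole group is NA (its median would be NA too)
impute_median <- function(x) {
  if (all(is.na(x))) {
    return(rep(0, length(x)))
  }
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# ave() applies the helper within each date and keeps the original row order
activity$steps <- ave(activity$steps, as.character(activity$date),
                      FUN = impute_median)

activity$steps
```

The helper returns a vector of the same length as its input, which is what ave() expects from FUN.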
Is there an easy way to change the output format for R's summary function so that the results print in a column instead of row? R does this automatically when you pass summary a data frame. I'd like to print summary statistics in a column when I pass it a single vector. So instead of this:
>summary(vector)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 6.699 6.000 559.000
It would look something like this:
>summary(vector)
Min. 1.000
1st Qu. 1.000
Median 2.000
Mean 6.699
3rd Qu. 6.000
Max. 559.000
Sure. Treat it as a data.frame:
set.seed(1)
x <- sample(30, 100, TRUE)
summary(x)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.00 10.00 15.00 16.03 23.25 30.00
summary(data.frame(x))
# x
# Min. : 1.00
# 1st Qu.:10.00
# Median :15.00
# Mean :16.03
# 3rd Qu.:23.25
# Max. :30.00
For slightly more usable output, you can use data.frame(unclass(.)):
data.frame(val = unclass(summary(x)))
# val
# Min. 1.00
# 1st Qu. 10.00
# Median 15.00
# Mean 16.03
# 3rd Qu. 23.25
# Max. 30.00
Or you can use stack:
stack(summary(x))
# values ind
# 1 1.00 Min.
# 2 10.00 1st Qu.
# 3 15.00 Median
# 4 16.03 Mean
# 5 23.25 3rd Qu.
# 6 30.00 Max.
How can I create a matrix that will contain the sales values:
summary(sales.dp)
Dept Week Sales
Min. : 1.00 8 : 81 Min. : -545.8
1st Qu.:21.00 9 : 81 1st Qu.: 2794.9
Median :42.00 11 : 81 Median : 7840.9
Mean :45.93 19 : 81 Mean : 14444.1
3rd Qu.:72.00 3 : 80 3rd Qu.: 19309.3
Max. :99.00 6 : 80 Max. :242627.6
(Other):3625
so that each Sales value appears in the row for its Dept and the column for its Week?
All missing combinations should be represented by NA.
You can use:
with(sales.dp, tapply(Sales, list(Dept, Week), mean))
If a Dept:Week pair has multiple sales, this takes their average; otherwise it returns the single value. Pairs with no sales are represented by NA.
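A small self-contained illustration of that behaviour, using made-up values for sales.dp (the real data is not shown in the question):

```r
# Toy stand-in for sales.dp; Dept 2 has two sales in week 8
sales.dp <- data.frame(
  Dept  = c(1, 1, 2, 2, 2),
  Week  = factor(c(8, 9, 8, 8, 11)),
  Sales = c(100, 200, 50, 150, 300)
)

# Rows are Dept values, columns are Week levels
sales_matrix <- with(sales.dp, tapply(Sales, list(Dept, Week), mean))

sales_matrix
#     8   9  11
# 1 100 200  NA
# 2 100  NA 300
```

Dept 2 in week 8 shows 100, the mean of 50 and 150, and the combinations that never occur are NA.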