I would like to show the structure of a data frame/tibble, but without the attributes (the - attr(*, "spec")= block at the end). Is there another command, or a better way, that shows only the variable lines, i.e. lines 2 and 3 of the output below?
## spec_tbl_df [4,238 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ male : Factor w/ 2 levels "F","M": 2 1 2 1 1 1 1 1 2 2 ...
## $ age : num [1:4238] 39 46 48 61 46 43 63 45 52 43 ...
## - attr(*, "spec")=
## .. cols(
## .. male = col_factor(levels = c("0", "1"), ordered = FALSE, include_na = FALSE),
## .. age = col_double(),
## .. )
## - attr(*, "problems")=<externalptr>
Pass give.attr = FALSE to str(). Example:
test <- `attr<-`(data.frame(), 'foo', 'bar')
str(test)
# 'data.frame': 0 obs. of 0 variables
# - attr(*, "foo")= chr "bar"
str(test, give.attr=FALSE)
# 'data.frame': 0 obs. of 0 variables
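The same argument can be checked on a data frame that looks more like the one in the question. This is only a sketch: the "spec" attribute below is a made-up string standing in for the attribute that readr attaches, and capture.output() is used just to make the difference easy to inspect.

```r
# Small data frame with two columns, similar in shape to the question's tibble.
df <- data.frame(male = factor(c("M", "F")), age = c(39, 46))

# Hypothetical stand-in for the "spec" attribute added by readr.
attr(df, "spec") <- "cols(...)"

# Capture what str() prints, with and without attributes.
full  <- capture.output(str(df))
short <- capture.output(str(df, give.attr = FALSE))

cat(full, sep = "\n")   # header, variable lines, then the attr line
cat(short, sep = "\n")  # header and variable lines only
```

Nothing is removed from the object itself; give.attr only changes what str() prints.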
This is the code I am using for my data mining assignment in RStudio. I am preprocessing the data.
setwd('C:/Users/user/OneDrive/assignments/Data mining/individual')
dataset = read.csv('Dataset.csv')
dataset[dataset == '?'] <- NA
View(dataset)
x <- na.omit(dataset)
library(tidyr)
library(dplyr)
library(outliers)
View(gather(x))
x$Age[x$Age <= 30] <- 3
x$Age[(x$Age <=49) & (x$Age >= 31)] <- 2
x$Age[(x$Age != 3) & (x$Age !=2)] <- 1
x$Hours_Per_week[x$Hours_Per_week <= 30] <- 3
x$Hours_Per_week[(x$Hours_Per_week <= 49)& (x$Hours_Per_week >= 31)] <- 2
x$Hours_Per_week[(x$Hours_Per_week != 3) & (x$Hours_Per_week != 2)] <- 1
x$Work_Class <- factor(x$Work_Class,
  levels = c("Federal-gov", "Local-gov", "Private", "Self-emp-inc", "Self-emp-not-inc", "State-gov"),
  labels = c(1, 2, 3, 4, 5, 6))
Here is the result of the code:
str(x)
As you can see in the result, after the last statement all the data in the column Hours_Per_week has suddenly changed to NA. I don't know why this occurs, since in every other example I saw online the data were replaced by the labels.
The link to the dataset:
dataset
Unfortunately I do not have the original data, but possibly you just need to swap the contents of levels and labels:
x$Work_Class <- factor(x$Work_Class, levels = c(1,2,3,4,5,6), labels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov") )
The problem is the factor() statement. The character strings in Dataset.csv are not surrounded by quotation marks, so every character field is read with a leading space. None of the values then match the levels passed to factor(), and every entry becomes NA.
str(dataset)
# 'data.frame': 100 obs. of 7 variables:
# $ Age : int 39 50 38 53 28 37 49 52 31 42 ...
# $ Work_Class : chr " State-gov" " Self-emp-not-inc" " Private" NA ...
# $ Education : chr " Bachelors" " Bachelors" " HS-grad" " 11th" ...
# $ Marital_Status: chr " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
# $ Sex : chr " Male" " Male" " Male" " Male" ...
# $ Hours_Per_week: int 40 13 40 40 40 40 16 45 50 40 ...
# $ Income : chr " <=50K" " <=50K" " <=50K" " <=50K" ...
Notice the blank space before each label in Work_Class, Education, Marital_Status, Sex, and Income. You need to trim the white space when you read the file:
dataset = read.csv('Dataset.csv', strip.white=TRUE)
Then change the last line by removing the labels= argument:
x$Work_Class <- factor(x$Work_Class, levels = c("Federal-gov", "Local-gov", "Private", "Self-emp-inc", "Self-emp-not-inc", "State-gov"))
str(x)
# 'data.frame': 93 obs. of 7 variables:
# $ Age : num 2 1 2 3 2 2 1 2 2 3 ...
# $ Work_Class : Factor w/ 6 levels "Federal-gov",..: 6 5 3 3 3 3 5 3 3 6 ...
# $ Education : chr "Bachelors" "Bachelors" "HS-grad" "Bachelors" ...
# $ Marital_Status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
# $ Sex : chr "Male" "Male" "Male" "Female" ...
# $ Hours_Per_week: num 2 3 2 2 2 3 2 2 1 2 ...
# $ Income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
# - attr(*, "na.action")= 'omit' Named int [1:7] 4 9 28 62 70 78 93
# ..- attr(*, "names")= chr [1:7] "4" "9" "28" "62" ...
table(x$Work_Class)
#
# Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc State-gov
# 6 6 67 3 7 4
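The effect can be reproduced without the file. The sketch below uses a made-up two-row CSV string (not the real Dataset.csv) to show that values with a leading space match none of the levels, and that strip.white = TRUE or trimws() fixes it.

```r
# Made-up CSV text with a space after each comma, as in the real file.
csv_text <- "Age,Work_Class\n39, State-gov\n50, Private"

raw_df   <- read.csv(text = csv_text, stringsAsFactors = FALSE)
clean_df <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                     strip.white = TRUE)

lv <- c("Federal-gov", "Local-gov", "Private",
        "Self-emp-inc", "Self-emp-not-inc", "State-gov")

# " State-gov" is not one of the levels, so every value becomes NA.
bad <- factor(raw_df$Work_Class, levels = lv)

# After trimming, the values match the levels.
good <- factor(clean_df$Work_Class, levels = lv)

# trimws() gives the same result on data that has already been read.
also_good <- factor(trimws(raw_df$Work_Class), levels = lv)

print(bad)
print(good)
```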
I am working on an economics research project and have a data frame of regression coefficients, built with melt and with the tidy function from the broom package. My df:
> head(LmModGDP, 10)
Country variable term estimate std.error statistic p.value
1 Netherlands FDI_InFlow_MilUSD (Intercept) 5.354083e+02 5.974760e+01 8.961167 1.976417e-09
2 Netherlands FDI_InFlow_MilUSD value 2.400677e-03 1.409779e-03 1.702875 1.005189e-01
3 Netherlands FDI_InFlow_percGDP (Intercept) 6.184273e+02 6.723554e+01 9.197923 1.173719e-09
4 Netherlands FDI_InFlow_percGDP value -1.261933e+00 1.008740e+01 -0.125100 9.014067e-01
5 Netherlands FDI_InStock_MilUSD (Intercept) 3.110956e+02 2.719577e+01 11.439116 1.201802e-11
6 Netherlands FDI_InStock_MilUSD value 7.025298e-04 5.307147e-05 13.237429 4.620706e-13
7 Netherlands FDI_OutFlow_MilUSD (Intercept) 5.106762e+02 5.939921e+01 8.597356 4.465840e-09
8 Netherlands FDI_OutFlow_MilUSD value 1.920313e-03 8.646908e-04 2.220808 3.528536e-02
9 Netherlands FDI_OutFlow_percGDP (Intercept) 2.593453e+02 5.334202e+01 4.861932 4.838082e-05
10 Netherlands FDI_OutFlow_percGDP value 3.931491e+00 5.332541e-01 7.372641 7.896681e-08
After I filter the df using any method (even plain subsetting, or the dplyr package):
LmModGDP[LmModGDP$variable == "FDI_InStock_MilUSD",]
or
LmModGDP %>%
filter(variable == "FDI_InStock_MilUSD")
This returns the desired df, but when I hover the mouse over the last column (p.value) in the RStudio viewer, it is reported as "Unknown Column", although the data are still correct. Also, str and class both report the column as numeric, yet the viewer shows something else.
My desired df:
Country variable term estimate std.error statistic p.value
5 Netherlands FDI_InStock_MilUSD (Intercept) 3.110956e+02 2.719577e+01 11.439116 1.201802e-11
6 Netherlands FDI_InStock_MilUSD value 7.025298e-04 5.307147e-05 13.237429 4.620706e-13
19 Romania FDI_InStock_MilUSD (Intercept) 3.122229e+01 3.313134e+00 9.423796 7.188216e-10
20 Romania FDI_InStock_MilUSD value 2.128223e-03 7.035679e-05 30.249006 8.588104e-22
When I use the kable function to display it in a markdown report, the p.value column shows only 0 values, not the actual ones.
Can someone help me?
Update:
Here is the output of str:
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 28 obs. of 7 variables:
$ Country : chr "Netherlands" "Netherlands" "Netherlands" "Netherlands" ...
$ variable : Factor w/ 7 levels "FDI_InFlow_MilUSD",..: 1 1 2 2 3 3 4 4 5 5 ...
$ term : chr "(Intercept)" "value" "(Intercept)" "value" ...
$ estimate : num 535.4083 0.0024 618.4273 -1.2619 311.0956 ...
$ std.error: num 59.7476 0.00141 67.23554 10.0874 27.19577 ...
$ statistic: num 8.961 1.703 9.198 -0.125 11.439 ...
$ p.value : num 1.98e-09 1.01e-01 1.17e-09 9.01e-01 1.20e-11 ...
- attr(*, "vars")= chr "Country" "variable"
- attr(*, "drop")= logi TRUE
- attr(*, "indices")=List of 14
..$ : int 0 1
..$ : int 2 3
..$ : int 4 5
..$ : int 6 7
..$ : int 8 9
..$ : int 10 11
..$ : int 12 13
..$ : int 14 15
..$ : int 16 17
..$ : int 18 19
..$ : int 20 21
..$ : int 22 23
..$ : int 24 25
..$ : int 26 27
- attr(*, "group_sizes")= int 2 2 2 2 2 2 2 2 2 2 ...
- attr(*, "biggest_group_size")= int 2
- attr(*, "labels")='data.frame': 14 obs. of 2 variables:
..$ Country : chr "Netherlands" "Netherlands" "Netherlands" "Netherlands" ...
..$ variable: Factor w/ 7 levels "FDI_InFlow_MilUSD",..: 1 2 3 4 5 6 7 1 2 3 ...
..- attr(*, "vars")= chr "Country" "variable"
..- attr(*, "drop")= logi TRUE
I cannot comment yet, so I am writing this as an answer.
Could you show us the output of str(LmModGDP)? Maybe the df is nested, or it is not a plain data frame but has special properties. Have you tried forcing it with LmModGDP <- as.data.frame(LmModGDP)?
Have you tried forcing LmModGDP$p.value <- as.numeric(LmModGDP$p.value)?
Have you tried converting to a data.table to see whether the behavior is different after applying your filter?
UPDATE1:
Thanks for posting the str() output. Your object is a "grouped_df", i.e. it still carries the grouping from group_by(). Have you tried ungroup(LmModGDP)?
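A minimal sketch of what ungroup() does, using a small made-up data frame in place of LmModGDP (only the p-values are copied from the question). The last part concerns the zeros in the report. It rests on an assumption: that the table function rounds numeric columns to a fixed number of decimals. If so, formatting the p-values as text beforehand keeps them visible.

```r
library(dplyr)

# Hypothetical stand-in for LmModGDP: a small grouped data frame.
coefs <- data.frame(
  Country  = c("Netherlands", "Netherlands", "Romania", "Romania"),
  variable = factor(rep("FDI_InStock_MilUSD", 4)),
  p.value  = c(1.201802e-11, 4.620706e-13, 7.188216e-10, 8.588104e-22)
) %>%
  group_by(Country, variable)

class(coefs)   # includes "grouped_df"

# ungroup() drops the grouping metadata and the grouped_df class.
flat <- ungroup(coefs)
class(flat)    # "grouped_df" is gone

# Rounding tiny p-values to a fixed number of decimals yields 0 ...
rounded <- round(flat$p.value, 7)

# ... so convert them to text in scientific notation before rendering.
flat$p.value <- formatC(flat$p.value, format = "e", digits = 2)
flat$p.value
```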
I'm using the aggregate function to summarise some loan data that contains ContractNum and LoanAmount. I want to aggregate the data by StartDate, count the number of loans and average the loan amount.
Here is a sample of the data and the function that I use:
ContractNum <- c("RHL-1","RHL-2","RHL-3","RHL-3")
StartDate <- c("2016-11-01","2016-11-01","2016-12-01","2016-12-01")
LoanPurpose <- c("Personal","Personal","HomeLoan","Investment")
LoanAmount <- c(200,500,600,150)
dat <- data.frame(ContractNum,StartDate,LoanPurpose,LoanAmount)
aggr.data <- aggregate(
cbind(LoanAmount,ContractNum) ~ StartDate + LoanPurpose
,data = dat
,FUN = function(x)c(count = mean(x),length(x))
)
When I look at the results of the aggregate function, they look OK:
> aggr.data
StartDate LoanPurpose LoanAmount.count LoanAmount.V2 ContractNum.count ContractNum.V2
1 2016-12-01 HomeLoan 600 1 3.0 1.0
2 2016-12-01 Investment 150 1 3.0 1.0
3 2016-11-01 Personal 350 2 1.5 2.0
But when I look at its structure, it seems to have created a sub-list:
> str(aggr.data)
'data.frame': 3 obs. of 4 variables:
$ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
$ LoanPurpose: Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
$ LoanAmount : num [1:3, 1:2] 600 150 350 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
$ ContractNum: num [1:3, 1:2] 3 3 1.5 1 1 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "count" ""
How do I get rid of this sub-list so that I can access each column the way I would normally access a DF? I understand that in the code I've asked to give me a mean on a ContractNum which is not meaningful, but I can just get rid of that column.
Thank you
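When FUN returns a vector, aggregate() stores the results as a matrix column (not a sub-list), one per aggregated variable. Call do.call(data.frame, ...) on aggr.data to split those matrices into ordinary columns.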
aggr.data <- do.call(data.frame, aggr.data);
str(aggr.data);
#'data.frame': 3 obs. of 6 variables:
# $ StartDate : Factor w/ 2 levels "2016-11-01","2016-12-01": 2 2 1
# $ LoanPurpose : Factor w/ 3 levels "HomeLoan","Investment",..: 1 2 3
# $ LoanAmount.count : num 600 150 350
# $ LoanAmount.V2 : num 1 1 2
# $ ContractNum.count: num 3 3 1.5
# $ ContractNum.V2 : num 1 1 2
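A smaller, self-contained version of the same idea, as a sketch: the data are a reduced copy of the question's sample, and the names mean and n are chosen here so that the flattened columns get readable names instead of V2.

```r
dat <- data.frame(
  StartDate  = c("2016-11-01", "2016-11-01", "2016-12-01", "2016-12-01"),
  LoanAmount = c(200, 500, 600, 150)
)

# FUN returns a named vector, so LoanAmount becomes a 2-column matrix.
agg <- aggregate(LoanAmount ~ StartDate, data = dat,
                 FUN = function(x) c(mean = mean(x), n = length(x)))

ncol(agg)                  # 2: StartDate plus one matrix column
is.matrix(agg$LoanAmount)  # TRUE

# Flatten the matrix column into ordinary columns.
flat <- do.call(data.frame, agg)
names(flat)                # "StartDate" "LoanAmount.mean" "LoanAmount.n"
```

Naming every element returned by FUN avoids the generated V2 suffix seen in the question.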
Let's say we have a list of data frames/tables organized like this:
x$user1, x$user2, etc.
Each x$usern is a data table with columns like $age, $department, $sale, $price, etc.
I would like to "push" the data frame in x$usern one level down, so that I can add other data tables below x$usern.
Perhaps an illustration makes this clearer. The current structure is
x
$user1 $user2
$price,$age, etc. $price, $age, etc.
Target structure is
x
$user1 $user2
$data $stat $data $stat
$price,$age, etc. $min, $max, etc. $price,$age, etc. $min, $max, etc.
What would be the best way to achieve this? I am thinking of lapply and/or a loop over all users, but perhaps there is a more elegant way to do this?
Thank you.
This seems like a good place for lapply (or one of its kin). Some mock data:
x <- list(
user1 = data.frame(price = 11, age = 12),
user2 = data.frame(price = 21, age = 22)
)
str(x)
# List of 2
# $ user1:'data.frame': 1 obs. of 2 variables:
# ..$ price: num 11
# ..$ age : num 12
# $ user2:'data.frame': 1 obs. of 2 variables:
# ..$ price: num 21
# ..$ age : num 22
The transformation:
newx <- lapply(x, function(l) {
st <- data.frame(min = 0.9*min(l$price), max = 1.1*max(l$age))
list(data = l, stat = st)
})
str(newx)
# List of 2
# $ user1:List of 2
# ..$ data:'data.frame': 1 obs. of 2 variables:
# .. ..$ price: num 11
# .. ..$ age : num 12
# ..$ stat:'data.frame': 1 obs. of 2 variables:
# .. ..$ min: num 9.9
# .. ..$ max: num 13.2
# $ user2:List of 2
# ..$ data:'data.frame': 1 obs. of 2 variables:
# .. ..$ price: num 21
# .. ..$ age : num 22
# ..$ stat:'data.frame': 1 obs. of 2 variables:
# .. ..$ min: num 18.9
# .. ..$ max: num 24.2
(Obviously, my definition of st would have to be tailored to your needs. Additionally, it does not strictly need to be defined within the lapply, but it makes sense to do it there if you already know its definition based on x$user1$....)
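If the statistics are computed elsewhere, Map() can pair them with the data instead of building them inside lapply. This is a sketch under assumptions: the stats list below is invented for illustration and is assumed to be in the same order as x.

```r
x <- list(
  user1 = data.frame(price = 11, age = 12),
  user2 = data.frame(price = 21, age = 22)
)

# Hypothetical statistics prepared separately, one entry per user.
stats <- list(
  user1 = data.frame(min = 1, max = 2),
  user2 = data.frame(min = 3, max = 4)
)

# Pair each data frame with its statistics; names are taken from x.
newx <- Map(function(d, s) list(data = d, stat = s), x, stats)

names(newx)           # "user1" "user2"
newx$user2$stat$max   # 4
```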
I have a list called tst, reproducible with the dput output below.
structure(list(CAF = structure(list(word = "CAF", freq = structure(list(
StartDate = structure(1:5, .Label = c("2004-01-04 - 2004-01-10",
"2004-01-11 - 2004-01-17", "2004-01-18 - 2004-01-24", "2004-01-25 - 2004-01-31",
"2004-02-01 - 2004-02-07"), class = "factor"), RelFreq = c(23L,
24L, 26L, 27L, 26L)), .Names = c("StartDate", "RelFreq"), row.names = c(NA,
5L), class = "data.frame")), .Names = c("word", "freq")), NAV = structure(list(
word = "NAV", freq = structure(list(StartDate = structure(1:5, .Label = c("2004-01-04 - 2004-01-10",
"2004-01-11 - 2004-01-17", "2004-01-18 - 2004-01-24", "2004-01-25 - 2004-01-31",
"2004-02-01 - 2004-02-07"), class = "factor"), RelFreq = c(67L,
55L, 62L, 79L, 60L)), .Names = c("StartDate", "RelFreq"), row.names = c(NA,
5L), class = "data.frame")), .Names = c("word", "freq"))), .Names = c("CAF",
"NAV"))
For ease of reading, the str output is here
> str(tst)
List of 2
$ CAF:List of 2
..$ word: chr "CAF"
..$ freq:'data.frame': 5 obs. of 2 variables:
.. ..$ StartDate: Factor w/ 5 levels "2004-01-04 - 2004-01-10",..: 1 2 3 4 5
.. ..$ RelFreq : int [1:5] 23 24 26 27 26
$ NAV:List of 2
..$ word: chr "NAV"
..$ freq:'data.frame': 5 obs. of 2 variables:
.. ..$ StartDate: Factor w/ 5 levels "2004-01-04 - 2004-01-10",..: 1 2 3 4 5
.. ..$ RelFreq : int [1:5] 67 55 62 79 60
I'd like to assign new values to all the StartDate elements nested inside the freq data frame, across all list elements. Specifically, I will replace each value with the POSIXct date of the first date it contains (e.g. 2004-01-04 above), though I'm looking for a general solution that also applies to other variables in the list that are not reproduced here.
I have a function fun that does the conversion given a StartDate vector as input, but I couldn't figure out how to do a batch reassignment across the entire list.
At the moment I have resorted to a for loop over the entire tst list. Is there a better way, preferably vectorized?
If you want to keep tst as a list, then:
tst2 <- lapply(tst,function(x) { x$freq$StartDate <- as.POSIXct(x$freq$StartDate); x; });
tst2;
## $CAF
## $CAF$word
## [1] "CAF"
##
## $CAF$freq
## StartDate RelFreq
## 1 2004-01-04 23
## 2 2004-01-11 24
## 3 2004-01-18 26
## 4 2004-01-25 27
## 5 2004-02-01 26
##
##
## $NAV
## $NAV$word
## [1] "NAV"
##
## $NAV$freq
## StartDate RelFreq
## 1 2004-01-04 67
## 2 2004-01-11 55
## 3 2004-01-18 62
## 4 2004-01-25 79
## 5 2004-02-01 60
##
##
str(tst2);
## List of 2
## $ CAF:List of 2
## ..$ word: chr "CAF"
## ..$ freq:'data.frame': 5 obs. of 2 variables:
## .. ..$ StartDate: POSIXct[1:5], format: "2004-01-04" "2004-01-11" "2004-01-18" "2004-01-25" ...
## .. ..$ RelFreq : int [1:5] 23 24 26 27 26
## $ NAV:List of 2
## ..$ word: chr "NAV"
## ..$ freq:'data.frame': 5 obs. of 2 variables:
## .. ..$ StartDate: POSIXct[1:5], format: "2004-01-04" "2004-01-11" "2004-01-18" "2004-01-25" ...
## .. ..$ RelFreq : int [1:5] 67 55 62 79 60
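The same pattern generalizes to any column and any conversion function. The sketch below is an illustration, not the asker's code: update_freq_col is a hypothetical helper, first_date stands in for the asker's fun, and the list is a shortened copy of tst.

```r
# Hypothetical converter standing in for `fun`:
# parse the first 10 characters ("YYYY-MM-DD") of each value.
first_date <- function(v) {
  as.POSIXct(substr(as.character(v), 1, 10), tz = "UTC")
}

# Shortened copy of the list from the question.
tst <- list(
  CAF = list(word = "CAF", freq = data.frame(
    StartDate = c("2004-01-04 - 2004-01-10", "2004-01-11 - 2004-01-17"),
    RelFreq   = c(23L, 24L))),
  NAV = list(word = "NAV", freq = data.frame(
    StartDate = c("2004-01-04 - 2004-01-10", "2004-01-11 - 2004-01-17"),
    RelFreq   = c(67L, 55L)))
)

# Apply f to column `col` of the freq data frame in every list element.
update_freq_col <- function(lst, col, f) {
  lapply(lst, function(el) {
    el$freq[[col]] <- f(el$freq[[col]])
    el
  })
}

tst2 <- update_freq_col(tst, "StartDate", first_date)
str(tst2$CAF$freq)
```

lapply is still a loop internally, but it returns a new list and leaves tst unchanged.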
However, I'd also like to make a recommendation that you transform your data into a data.frame, which would make a lot of operations easier, including this one:
df <- do.call(rbind,lapply(tst,function(x) cbind(Word=x$word,x$freq)));
df$StartDate <- as.POSIXct(df$StartDate);
df;
## Word StartDate RelFreq
## CAF.1 CAF 2004-01-04 23
## CAF.2 CAF 2004-01-11 24
## CAF.3 CAF 2004-01-18 26
## CAF.4 CAF 2004-01-25 27
## CAF.5 CAF 2004-02-01 26
## NAV.1 NAV 2004-01-04 67
## NAV.2 NAV 2004-01-11 55
## NAV.3 NAV 2004-01-18 62
## NAV.4 NAV 2004-01-25 79
## NAV.5 NAV 2004-02-01 60
str(df);
## 'data.frame': 10 obs. of 3 variables:
## $ Word : Factor w/ 2 levels "CAF","NAV": 1 1 1 1 1 2 2 2 2 2
## $ StartDate: POSIXct, format: "2004-01-04" "2004-01-11" "2004-01-18" "2004-01-25" ...
## $ RelFreq : int 23 24 26 27 26 67 55 62 79 60