I am encountering an issue with an R dataframe.
The dataframe contains columns that are not recognized as variables.
These new columns contain the other columns' names including the symbol '$'.
I cannot delete these columns from the dataframe.
Would anyone have an idea about what these columns are, why they are not considered as variables, and why I cannot delete them?
Here is a fraction of what appears when I use str() on the dataframe:
(...)
$ EDU_MAX_1: Ord.factor w/ 8 levels "NAP"<"NONE"<"ED1"<..: NA NA NA
$ EDU_MAX_2 : chr NA NA NA NA ...
$ age_14 :'data.frame': 6 obs. of 61 variables:
..$ leeftijd : num 14 14 14 14 14 14
..$ geboortejaar: num 1985 1985 1985 1985 1985 ...
(...)
The problem seems to be linked to the variable age_14 which then seems coupled to each column.
When I export the dataframe to an excel file, these columns do not appear in the exported file.
Many thanks in advance for your help
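The str() output above shows age_14 listed as 'data.frame': 6 obs. of 61 variables, which suggests that age_14 is a whole data frame stored as a single column. A minimal sketch of that situation, using a made-up toy data frame (the names nested, flat and dropped are hypothetical, and only two of the 61 inner columns are imitated):

```r
# Toy data: a data frame that holds another data frame as one column.
nested <- data.frame(id = 1:3)
nested$age_14 <- data.frame(
  leeftijd     = c(14, 14, 14),
  geboortejaar = c(1985, 1985, 1985)
)

# str() shows the inner columns with a "..$" prefix, as in the question.
str(nested)

# The nested data frame counts as ONE column of the outer data frame.
ncol(nested)

# Option 1: flatten the nested columns into ordinary top-level columns.
flat <- do.call(data.frame, nested)
names(flat)

# Option 2: drop the nested column as a whole, by its outer name.
dropped <- nested
dropped$age_14 <- NULL
names(dropped)
```

If this is what happened, the inner columns have to be addressed through the outer column (for example nested$age_14$leeftijd), which may be why deleting them by their own names has no effect.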
Issue:
I have a data frame (called Yeo) containing seven parameters with continuous values (columns 5-11; see the parameters below). I conducted a Shapiro-Wilk test to determine whether the univariate samples came from a normal distribution. For each parameter, the residuals were non-normal and skewed, so I want to transform my variables using both the yjPower (Yeo-Johnson transformation) and the bcPower (Box-Cox transformation) families and compare the two.
I have used the R code below on many occasions, so I know it works. However, for this data frame, I keep getting the error shown below. Unfortunately, I cannot provide a reproducible example online because the data belongs to three different organisations. When I open an old data frame with the same parameters, my R code runs absolutely fine. I really can't figure out a solution.
Would anybody be able to please help me understand this error message below?
Many thanks if you can advise.
Error
transform=powerTransform(as.matrix(Yeo[5:11]), family= "yjPower")
Error
Error in optim(start, llik, hessian = TRUE, method = method, ...) :
non-finite finite-difference value [1]
#save transformed data in stand_trans to compare both
stand_trans=Yeo
stand_trans[,5]=yjPower(Yeo[,5],transform$lambda[1])
stand_trans[,6]=yjPower(Yeo[,6],transform$lambda[2])
stand_trans[,7]=yjPower(Yeo[,7],transform$lambda[3])
stand_trans[,8]=yjPower(Yeo[,8],transform$lambda[4])
stand_trans[,9]=yjPower(Yeo[,9],transform$lambda[5])
stand_trans[,10]=yjPower(Yeo[,10],transform$lambda[6])
stand_trans[,11]=yjPower(Yeo[,11],transform$lambda[7])
Parameters
'data.frame': 888 obs. of 14 variables:
$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ Year : num 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
$ Date : Factor w/ 19 levels "","01.09.2019",..: 19 19 19 19 19 19 19 17 17 17 ...
$ Country : Factor w/ 3 levels "","France","Argentina": 3 3 3 3 3 3 3 3 3 3 ...
$ Low.Freq : num 4209 8607 9361 9047 7979 ...
$ High.Freq : num 15770 18220 19853 18220 17843 ...
$ Start.Freq : num 4436 13945 16264 12283 12691 ...
$ End.Freq : num 4436 13945 16264 12283 12691 ...
$ Peak.Freq : num 4594 8906 11531 10781 8812 ...
$ Center.Freq : num 1.137 0.754 0.785 0.691 0.883 ...
$ Delta.Freq : num 11560 9613 10492 9173 9864 ...
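Since the real data cannot be shared, one possible first step is to check the seven columns for values that an optimiser might struggle with. The sketch below is only a diagnostic under the assumption that missing, infinite, or constant columns could be involved; it does not claim to identify the actual cause. The data frame toy and the helper check_columns are made up for illustration.

```r
# Toy stand-in for Yeo[5:11]; the real data is not available.
toy <- data.frame(
  a = c(1, 2, NA, 4),        # contains a missing value
  b = c(5, Inf, 7, 8),       # contains an infinite value
  c = c(3, 3, 3, 3),         # constant column
  d = c(0.5, 1.5, 2.5, 3.5)  # unremarkable column
)

# Hypothetical helper: one row of diagnostics per column.
check_columns <- function(d) {
  data.frame(
    column      = names(d),
    n_missing   = unname(vapply(d, function(x) sum(is.na(x)), integer(1))),
    n_infinite  = unname(vapply(d, function(x) sum(is.infinite(x)), integer(1))),
    is_constant = unname(vapply(d, function(x) length(unique(x[is.finite(x)])) <= 1,
                                logical(1)))
  )
}

report <- check_columns(toy)
print(report)
```

On the real data the call would be check_columns(Yeo[5:11]); comparing that report with the same report for the old data frame that works may show where the two differ.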
I am creating a new column that looks at conditions in my data frame and alerts me whether an issue needs to be investigated or monitored. The code to add the column looks like this:
library(dplyr)
df %>%
mutate("Status" =
ifelse(apply(.[2:7], 1, sum) > 0 & .[8] > 0, "Investigate",
"Monitor"
)
)
If I run the command class(df$Status) on this newly generated column, the class is listed as 'matrix'. What? Why isn't it listed as 'character'?
If I look at the structure of my data frame, there's an oddity that may be the key, but I don't understand why. Notice that the first columns listed simply look like integers, but the third column listed, which is the same kind of data, is followed by all this 'attr' phrasing. What is going on?
$ 2017-08 : int NA 1 NA 1 1 2 NA NA NA NA ...
$ 2017-09 : int NA NA 1 NA NA NA NA NA NA NA ...
$ 2017-10 : int NA NA NA NA NA NA 1 NA NA NA ...
- attr(*, "vars")= chr "Material"
- attr(*, "drop")= logi TRUE
- attr(*, "indices")=List of 34
..$ : int 0
..$ : int 1
..$ : int 2
..$ : int 3
..$ : int 4
...continued...
- attr(*, "group_sizes")= int 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "biggest_group_size")= int 1
- attr(*, "labels")='data.frame': 34 obs. of 1 variable:
I grouped variables earlier and sometimes ungrouping magically helps. In addition I often have to convert tibbles back to data frames to get other routines to work in my code. This may or may not be related.
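A likely explanation for the 'matrix' class, sketched in base R on made-up data: single-bracket indexing such as .[8] returns a one-column data frame, comparing that with 0 gives a logical matrix, and ifelse() returns a result shaped like its test. Double brackets return a plain vector instead. The data frame toy and its column names are hypothetical. Separately, the 'attr' lines in str() appear to describe the grouped data frame as a whole (they are printed after the last column), not the third column.

```r
# Toy data standing in for the real data frame.
toy <- data.frame(
  Material = c("a", "b", "c"),
  m1       = c(1, 0, 0),
  m2       = c(0, 0, 2),
  flag     = c(1, 1, 0)
)

# Single brackets keep the data frame, so the test becomes a matrix.
status_matrix <- ifelse(
  rowSums(toy[c("m1", "m2")]) > 0 & toy["flag"] > 0,
  "Investigate", "Monitor"
)
is.matrix(status_matrix)

# Double brackets extract a plain vector, so the result is a vector.
status_vector <- ifelse(
  rowSums(toy[c("m1", "m2")]) > 0 & toy[["flag"]] > 0,
  "Investigate", "Monitor"
)
is.matrix(status_vector)
is.character(status_vector)
```

In the dplyr pipeline the equivalent change would be .[[8]] in place of .[8]. Note that rowSums() here behaves like apply(., 1, sum) only when there are no NA values; with NAs, na.rm = TRUE would have to be chosen deliberately.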
I have a rectangular table with three variables: country, year and inflation. I already have all the descriptives I need; now I want to do some analytics, and figured that I should run a linear regression against a target country. The best idea I had was to create a new variable called inflation.in.country.x and loop through the inflation of x in this new column, but that seems like a somewhat unclean solution.
How can I run a linear regression on a rectangular data table? The structure is like this:
> dat %>% str
'data.frame': 1196 obs. of 3 variables:
$ Country.Name: Factor w/ 31 levels "Albania","Armenia",..: 9 8 10 11 12 14 15 16 17 19 ...
$ year : chr "1967" "1967" "1967" "1967" ...
$ inflation : num 1.238 8.328 3.818 0.702 1.467 ...
I want to take Armenia's inflation as the dependent variable and Albania's as the independent variable in a linear regression. Is it possible to do this without transforming the data, while keeping the years aligned?
One way is to spread your data table (spread() comes from the tidyr package) using Country.Name as key:
dat.spread <- dat %>% spread(key="Country.Name", value="inflation")
dat.spread %>% str
'data.frame': 50 obs. of 31 variables:
$ year : chr "1967" "1968" "1969" "1970" ...
$ Albania : num NA NA NA NA NA NA NA NA NA NA ...
$ Armenia : num NA NA NA NA NA NA NA NA NA NA ...
$ Brazil : num NA NA NA NA NA NA NA NA NA NA ...
[...]
But that forces you to transform the data, which may seem undesirable. Afterwards, you can simply use cbind to regress all the other countries on Albania:
lm(cbind(Armenia, Brazil, Colombia, etc...) ~ Albania, data = dat.spread)
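For just one pair of countries, a lighter alternative is to subset the long table twice and merge the two subsets on year, which keeps the years aligned without reshaping the whole table. This is a sketch on made-up numbers (the toy values are chosen so that Armenia is exactly 1 + 2 * Albania), not the original data:

```r
# Toy long-format table with the same three columns as in the question.
dat <- data.frame(
  Country.Name = rep(c("Albania", "Armenia", "Brazil"), each = 4),
  year         = rep(as.character(2001:2004), times = 3),
  inflation    = c(1, 2, 3, 5,      # Albania
                   3, 5, 7, 11,     # Armenia = 1 + 2 * Albania
                   9, 8, 7, 6)      # Brazil (unused here)
)

# Pull out the two countries and align them by year.
armenia <- dat[dat$Country.Name == "Armenia", c("year", "inflation")]
albania <- dat[dat$Country.Name == "Albania", c("year", "inflation")]
pair <- merge(armenia, albania, by = "year",
              suffixes = c(".armenia", ".albania"))

# Armenia as dependent variable, Albania as independent variable.
fit <- lm(inflation.armenia ~ inflation.albania, data = pair)
coef(fit)
```

merge() keeps only the years present for both countries, so years with data for just one of them drop out of the regression.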
I've used the package haven to read SPSS data into R. All seems ok, except that when I try to subset the data it doesn't seem to behave correctly. Here's the code (I don't have SPSS to create example data and can't post the real stuff):
require(haven)
df <- read_spss("filename1.sav")
tmp <- df[as_factor(df$variable1) == "factor1",]
tmp <- tmp[!is.na(tmp$variable2), ]
The above df has "NA" scattered throughout. I expected the above to subset only the data, keeping only rows with variable1 with "factor1" and discarding all rows with NAs in variable2. The first subset works as expected. But the second subset does not. It removes rows, but NAs are still present.
I suspect the issue has something to do with the way haven structures the imported data and uses the class labelled instead of an actual factor variable, but it's over my head. Anyone know what could be happening and how to accomplish the same?
Here's the structure of df, variable1 and variable2:
> str(df)
'data.frame': 4573 obs. of 316 variables:
> str(df$variable1)
Class 'labelled' atomic [1:4573] 9 9 9 14 8 8 2 4 8 16 ...
..- attr(*, "labels")= Named num [1:18] 1 2 3 4 5 6 7 8 9 10 ...
.. ..- attr(*, "names")= chr [1:18] "factor1" "factor2" "factor3" "factor4" ...
> str(df$variable2)
Class 'labelled' atomic [1:4573] 3 NA 3 NA 3 NA 1 1 NA NA ...
..- attr(*, "labels")= Named num [1:3] 1 2 3
.. ..- attr(*, "names")= chr [1:3] "Sponsor" "Not a Sponsor" "Don't Know"
I have a data set containing salary test data. Not all cells have values, so I used na.action=na.pass and na.rm=TRUE, but I still get warnings. Is that because I want to aggregate by JobTitle, which is a factor?
So far I have developed the code below:
aggregate(salaries$JobTitle,
list(pay = salaries$TotalPay),
FUN=mean,
na.action=na.pass,
na.rm=TRUE)
My test data has the following columns:
'data.frame': 104 obs. of 36 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ EmployeeName : Factor w/ 11 levels "","ALBERT PARDINI",..: 10 7 2 4 11 6 3 5 9 8 ...
$ JobTitle : Factor w/ 9 levels "","ASSISTANT DEPUTY CHIEF II",..: 8 4 4 9 6 2 3 7 3 5 ...
$ BasePay : num 167411 155966 212739 77916 134402 ...
$ OvertimePay : num 0 245132 106088 56121 9737 ...
$ OtherPay : num 400184 137811 16453 198307 182235 ...
$ Benefits : logi NA NA NA NA NA NA ...
$ TotalPay : num 567595 538909 335280 332344 326373 ...
$ TotalPayBenefits: num 567595 538909 335280 332344 326373 ...
$ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
$ Notes : logi NA NA NA NA NA NA ...
$ Agency : Factor w/ 2 levels "","San Francisco": 2 2 2 2 2 2 2 2 2 2 ..
The warnings that come up are:
Warning messages:
1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
etc...
I have tried with salaries$Id and it works like magic, so I assume the code is correct and perhaps I need to change the data type of JobTitle?
If we are getting the mean of 'TotalPay' grouped by 'JobTitle', the formula method would be
aggregate(TotalPay~JobTitle, salaries, mean, na.rm=TRUE, na.action=na.pass)
Or use
aggregate(salaries$TotalPay, list(salaries$JobTitle), FUN=mean, na.rm=TRUE)
data
set.seed(24)
salaries <- data.frame(JobTitle = sample(LETTERS[1:5], 20,
replace=TRUE), TotalPay= sample(c(1:20, NA), 20))
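Put together with the sample data, the two calls can be run side by side. The original attempt passed the factor JobTitle as the values to average and TotalPay as the grouping, which is why mean() warned that its argument was not numeric; here the roles are swapped. The names by_formula and by_list are only illustrative.

```r
# Sample data from the answer above.
set.seed(24)
salaries <- data.frame(
  JobTitle = sample(LETTERS[1:5], 20, replace = TRUE),
  TotalPay = sample(c(1:20, NA), 20)
)

# Formula method: mean of TotalPay within each JobTitle.
# na.action = na.pass keeps rows with NA; na.rm = TRUE is passed on to mean().
by_formula <- aggregate(TotalPay ~ JobTitle, salaries, mean,
                        na.rm = TRUE, na.action = na.pass)

# Non-formula method: values first, grouping list second.
by_list <- aggregate(salaries$TotalPay,
                     list(JobTitle = salaries$JobTitle),
                     FUN = mean, na.rm = TRUE)

print(by_formula)
print(by_list)
```

Both results have one row per JobTitle; the non-formula version names the value column x by default.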