Using tapply to calculate the mean of a group [duplicate] - r

This question already has answers here:
Calculate the mean by group
(9 answers)
Closed last year.
In the following CSV file:
Species, Age
australian, 2.6
australian, 2.3
brown, 2.3
brown, 2.3
brown, 3.4
brown, 3.4
dalmatian, 5.1
dalmatian, 4.4
dalmatian, 4.4
dalmatian, 4.1
dalmatian, 4.2
dalmatian, 4.7
dalmatian, 5.5
I am attempting to calculate the mean age for each pelican species, but R is displaying an error about unequal lengths.
df <- read.csv('c:/Users/Michelle/Downloads/pelican.csv')
tapply(df$Species, df$Age, mean)
Error in tapply(df$Species, df$Age, mean) :
arguments must have same length
I assumed the tapply function would output each pelican species with its mean age.
Unfortunately, the director at the University of Florida is insisting I use base R functions.
Edit 1:
str(df)
'data.frame':   13 obs. of  2 variables:
 $ Species: chr  "australian" "australian" "brown" "brown" ...
 $ Age    : num  2.6 2.3 2.3 2.3 3.4 3.4 5.1 4.4 4.4 4.1 ...
dput(df)
structure(list(Species = c("australian", "australian", "brown",
"brown", "brown", "brown", "dalmatian", "dalmatian", "dalmatian",
"dalmatian", "dalmatian", "dalmatian", "dalmatian"), Age = c(2.6,
2.3, 2.3, 2.3, 3.4, 3.4, 5.1, 4.4, 4.4, 4.1, 4.2, 4.7, 5.5)),
class = "data.frame", row.names = c(NA, -13L))
Thank you Pedro for the help.
Thank you for any help you can provide.
M.

Welcome Michelle! The tapply() function works with two main objects (these objects need to be vectors), called X and INDEX. What the error message is telling you is that X and INDEX do not have the same length.
The example below reproduces the same error you are facing. Notice that the X object has 4 elements, but INDEX has only 2.
tapply(X = c(5, 6, 7, 8), INDEX = c(1, 2), mean)
This means that, to fix your error, the first and second objects that you pass to tapply() need to have the same length. In your example, these two objects are df$Species and df$Age. You can check whether they have the same length by comparing the results of length(df$Species) and length(df$Age): if the two numbers are equal, the vectors have the same length; if not, they have different lengths.
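For example, a quick check could look like this (a trivial sketch, just to make the comparison explicit):
length(df$Species)
length(df$Age)
length(df$Species) == length(df$Age)  # TRUE means the two vectors match in length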
What is probably going wrong in your code is that the read.csv() function is not reading your CSV file correctly. Maybe df was turned into a list rather than a data.frame. We cannot give you better help than this, because we do not know what the df object is or how it is structured in your R session.
You could give us this useful information by copying and pasting the result of the str(df) command, or of dput(df). Either of these commands would give us enough information to probably point out exactly what you need to do. So, next time you post a question, it is a good idea to include this information.
Anyway, when I copy and paste the CSV file that you posted and run your code, everything works fine. So, again, your df object is probably not structured as you expected, probably because of some problem in the read.csv() step. (Note also that in the call below the numeric vector Age is the first argument and the grouping vector Species is the second, which is the order tapply() expects.)
text <- "
Species, Age
australian, 2.6
australian, 2.3
brown, 2.3
brown, 2.3
brown, 3.4
brown, 3.4
dalmatian, 5.1
dalmatian, 4.4
dalmatian, 4.4
dalmatian, 4.1
dalmatian, 4.2
dalmatian, 4.7
dalmatian, 5.5"
data <- readr::read_csv(text)
tapply(data$Age, data$Species, mean)
Result:
australian brown dalmatian
2.450000 2.850000 4.628571
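Since your director insists on base R, the same result can be reproduced without readr. A minimal base R sketch, assuming the file reads into the two columns shown in your dput() output (strip.white = TRUE drops the stray spaces after the commas):
df <- read.csv("c:/Users/Michelle/Downloads/pelican.csv", strip.white = TRUE)
tapply(df$Age, df$Species, mean)
# aggregate() is another base R option; it returns a data.frame instead of a named vector
aggregate(Age ~ Species, data = df, FUN = mean)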

Related

How to create two different CSV files with the same name, where one uses upper case letters and the other uses lower case letters

I want to create multiple files for columns in a life table. I thought the easiest way to do this would be to save the files using their variable names (ax, Sx, lx, Lx, ...). However, I cannot get R to create two files based on the same name (one in lower case and one in upper case, e.g. lx.csv and Lx.csv).
To demonstrate the problem:
# write a csv as normal
write.csv(mtcars, "d.csv")
# next line seems to replace d.csv rather than create a new D.csv file
write.csv(iris, "D.csv")
# get iris when read back in
d <- read.csv("d.csv")
head(d)
# X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 1 5.1 3.5 1.4 0.2 setosa
# 2 2 4.9 3.0 1.4 0.2 setosa
# 3 3 4.7 3.2 1.3 0.2 setosa
# 4 4 4.6 3.1 1.5 0.2 setosa
# 5 5 5.0 3.6 1.4 0.2 setosa
# 6 6 5.4 3.9 1.7 0.4 setosa
Is this behavior normal, and is there a way to force the creation of a new file with the upper case name?
I am using Windows and R 4.1.0
Update
Thanks to #tim for the answer. I had to go through the following steps in PowerShell (in admin mode):
Run Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
Restart PC
Run cd C:\folder to get to the location where I want to enable case-sensitive file names
Run (Get-ChildItem -Recurse -Directory).FullName | ForEach-Object {fsutil.exe file setCaseSensitiveInfo $_ enable}
I wanted to enable case-sensitive file names for all the subdirectories. I think if I only needed it for a single folder, I could have used fsutil.exe file setCaseSensitiveInfo C:\folder enable for steps 3 and 4.
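After enabling case sensitivity, the original example can be re-run from R to confirm the two files now coexist; a minimal check (run inside a folder where the flag is enabled):
write.csv(mtcars, "d.csv")
write.csv(iris, "D.csv")
list.files(pattern = "^[dD]\\.csv$")  # should now list both "D.csv" and "d.csv"
head(read.csv("d.csv"))               # mtcars again, not iris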
Windows' NTFS file system is case insensitive. With the April 2018 update, case sensitivity for specific folders was introduced:
https://www.howtogeek.com/354220/how-to-enable-case-sensitive-folders-on-windows-10/

Missing data warning R

I have a dataframe with climatic values like temperature_max, temperature_min... in different locations. The data collection is a time series, and there are some specific days on which no data were registered. I would like to impute taking into account the date and also the location (the place variable in the dataframe).
I have tried to impute those missing values with Amelia, but no imputation is done and a warning is shown.
Checking variables:
head(df):
PLACE DATE TEMP_MAX TEMP_MIN TEMP_AVG
F 12/01/2007 19.7 2.5 10.1
F 13/01/2007 18.8 3.5 10.4
F 14/01/2007 17.3 2.4 10.4
F 15/01/2007 19.5 4.0 9.2
F 16/01/2007
F 17/01/2007 21.5 2.8 9.7
F 18/01/2007 17.7 3.3 12.9
F 19/01/2007 18.3 3.8 9.7
A 16/01/2007 17.7 3.4 9.7
A 17/01/2007
A 18/01/2007 19.7 6.2 10.4
A 19/01/2007 17.7 3.8 10.1
A 20/01/2007 18.6 3.8 12.9
This is just some of the records of my data set.
DF = amelia(df, m=4, ts= c("DATE"), cs = c("PLACE"))
where DATE is time series data (01/01/2001, 02/01/2001, 03/01/2001...), but if you filter by PLACE the time series are not equal (they do not have the same start and end time).
I have 3 questions:
I am not sure whether I should have the complete time series for all the places, i.e. the same start and end time for every place.
I am not using the lags or polytime parameters, so am I imputing correctly, taking into account the time series influence? I am not sure how to use the lags parameter, although I have checked the R package documentation.
The last question is that when I try to run that code, a warning is shown and no imputation is done.
Warning: There are observations in the data that are completely missing.
These observations will remain unimputed in the final datasets.
-- Imputation 1 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 2 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 3 --
No missing data in bootstrapped sample: EM chain unnecessary
-- Imputation 4 --
No missing data in bootstrapped sample: EM chain unnecessary
Can someone help me with this?
Thanks very much for your time!
For the software it does not matter if you have different start and end dates for different places. I think it is more up to you and your understanding of the data: I would ask myself whether those are missing data (missing at random), and therefore whether or not to create empty rows for them in the data set.
You want to use lags in order to use past values of the variable to improve the prediction of missing values. It is not mandatory (the function can impute missing data even without such a specification), but it can be useful.
I contacted the author of the package and he told me that you need to specify the splinetime or polytime arguments to make sure that Amelia uses the time-series information to impute. For instance, if you set polytime = 3, it will impute based on a cubic of time. If you do that, I think you shouldn't see that warning anymore.
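For illustration, a minimal sketch of such a call (the date conversion and the choice of lag columns are assumptions for this example, not from the original post):
library(Amelia)
# convert DATE to a numeric time index so it can be used as the ts variable
df$DATE <- as.numeric(as.Date(df$DATE, format = "%d/%m/%Y"))
imp <- amelia(df, m = 4, ts = "DATE", cs = "PLACE",
              polytime = 3,                                 # impute using a cubic of time
              lags = c("TEMP_MAX", "TEMP_MIN", "TEMP_AVG")) # example lag columns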

Retrieve data that have similar values in one column

I have the following dataset:
Class Value
Drive 9.5
Analyser 6.35
GameGUI 12.09
Drive 9.5
Analyser 5.5
GameGUI 2.69
Drive 9.5
Analyser 9.10
GameGUI 6.1
I want to retrieve the classes that have similar values, which in the case of the example above would be Drive. To do that I have the following command:
dataset[as.logical(ave(dataset$Value, dataset$Class, FUN = function(x) all(x==1))), ]
But this command returns only the classes whose values are always 1. What I want is different: I don't want to give a specific value.
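A minimal sketch along the same lines, assuming "similar" means that all values within a class are identical: compare each group against its own first value instead of a fixed constant.
dataset[as.logical(ave(dataset$Value, dataset$Class,
                       FUN = function(x) all(x == x[1]))), ]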

htmlTable is replacing dataframe contents with sequential numbers

I'm using R markdown to create an html document. I've written a function that produces the following data frame as its output:
April ($) April Growth (%) Current ($) Current Growth (%) Change (%)
1 2013:3 253,963.49 0.2 251,771.20 0.7 -0.9
2 2013:4 253,466.09 -0.8 251,515.26 -0.4 -0.8
3 2014:1 255,448.95 3.2 255,300.10 6.2 -0.1
4 2014:2 259,376.84 6.3 259,919.99 7.4 0.2
5 2014:3 261,398.85 3.2 262,486.91 4.0 0.4
6 2014:4 264,309.06 4.5 266,662.59 6.5 0.9
I'm then supplying this data frame to htmlTable as shown:
html.tab <- htmlTable(sample.df, rnames=F)
print(html.tab)
However, when I knit the file, the table that is produced contains only sequential numbers in place of the data frame's values.
Can anyone explain what is happening? I thought perhaps it was the data class in the data frame but I didn't see anything in the htmlTable vignette saying it couldn't handle data of certain classes.
This is my first time working with R Markdown and htmlTables so hopefully I've just made some basic mistake but I haven't been able to find anyone else with the same problem.
Thanks to Benjamin for the suggestion. It turns out the problem was the data class: sample.df contained data of class factor, which htmlTable apparently can't handle. By converting the data to characters, the correct table is produced.
sample.df[] <- lapply(sample.df, as.character)
Perhaps someone more familiar with the package can explain why factors are a problem?
I knew it would be something basic like this!
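One likely explanation (an illustration, not from the htmlTable documentation): when a factor is coerced to numbers, R uses the underlying integer level codes rather than the original labels, which would show up as sequential-looking numbers.
f <- factor(c("253,963.49", "253,466.09", "255,448.95"))
as.integer(f)    # 2 1 3 -- the internal level codes, not the original values
as.character(f)  # the original strings, which render as expected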

rank() doesn't rank properly when used with scientific notation numbers

I tried to order a CSV file, but the rank() function acts weird on numbers in E notation.
> comparison = read.csv("e:/thesis/comparison/output.csv", header=TRUE)
> comparison$proxygeneld_full.txt[0:20]
[1] 9.34E-07 4.04E-06 4.16E-06 7.17E-06 2.08E-05 3.00E-05
[7] 3.59E-05 4.16E-05 7.75E-05 9.50E-05 0.0001116 0.00012452
[13] 0.00015494 0.00017892 0.00017892 0.00018345 0.0002232 0.000231775
[19] 0.00023241 0.0002666
13329 Levels: 0.0001116 0.00012452 0.00015494 0.00017892 0.00018345 ... adjP
> rank(comparison$proxygeneld_full.txt[0:20])
[1] 19.0 14.0 16.0 17.0 11.0 12.0 13.0 15.0 18.0 20.0 1.0 2.0 3.0 4.5 4.5
[16] 6.0 7.0 8.0 9.0 10.0
#It should be 1-20 in order ....
It seems to just ignore the E notation there. It turns out to be fine if I'm not using data from the file:
> rank(c(9.34E-07, 4.04E-06, 7.17E-06))
[1] 1 2 3
Am I missing something? Thanks.
I guess you have some non-numeric data in your csv file.
What happens if you do the following?
as.numeric(comparison$proxygeneld_full.txt)
If this produces different numbers than you expected, you certainly have some text in this column.
Yep - $proxygeneld_full.txt[0:20] isn't even numeric. It is a factor:
13329 Levels: 0.0001116 0.00012452 0.00015494 0.00017892 0.00018345 ... adjP
So rank() is ranking the numeric codes that lie behind the factor representation, and the E-0X "numbers" sort after the non-E numbers in the levels.
Look at str(comparison) and you'll see that proxygeneld_full.txt is a factor.
I'm struggling to replicate the behaviour you are seeing with E numbers in a csv file; R reads them properly as numeric. Check your CSV to make sure you don't have non-numeric values in that column, and that the E numbers are not quoted.
Ah! Looking again at the levels you quote: there is an adjP lurking at the end of the output you show. Check your data again, as this adjP is in there somewhere, and that is forcing R to code the variable as a factor, hence the behaviour you see with ranking, as I described above.
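A minimal sketch of the usual remedy, assuming the stray adjP entry is the culprit: convert via character rather than relying on the factor codes, then locate the rows that fail to parse.
x <- comparison$proxygeneld_full.txt
x_num <- as.numeric(as.character(x))   # convert via character, not via the factor codes
which(is.na(x_num) & !is.na(x))        # rows holding non-numeric entries such as "adjP"
rank(x_num[1:20])                      # now ranks the actual values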
