t-test for multiple columns of dataframe - r

I have hourly PM10 data from 2014 to 2019 for 4 stations. I want to apply a t-test to these stations to compare their values. My data contains only 4 columns, named bafra, atakum, canik, and ilkadim (the station names).
This link is related to my question, but the code there always gave me an error.
R: t test over multiple columns using t.test function
I tried this code, but I want to show my results more clearly, for example with the station names, p-value, confidence interval...
library(reshape2)
meltdf <- melt(samsun)
pairwise.t.test(meltdf$value, meltdf$Var2, p.adjust = "none")
How can I do this?
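One possible approach, sketched under assumptions: run t.test() on every pair of station columns and collect the pieces you care about into one data frame. The data below is simulated, since the real samsun data is not shown, and the result column names (station_1, p_value, conf_low, ...) are made up for illustration. Note that t.test() defaults to the Welch test (unequal variances).

```r
# Simulated stand-in for the real data: one numeric column per station
set.seed(1)
samsun <- data.frame(
  bafra   = rnorm(100, mean = 40, sd = 10),
  atakum  = rnorm(100, mean = 45, sd = 10),
  canik   = rnorm(100, mean = 50, sd = 10),
  ilkadim = rnorm(100, mean = 42, sd = 10)
)

# All unordered pairs of station names
station_pairs <- combn(names(samsun), 2, simplify = FALSE)

# One row of results per pair
results <- do.call(rbind, lapply(station_pairs, function(p) {
  tt <- t.test(samsun[[p[1]]], samsun[[p[2]]])  # Welch two-sample t-test
  data.frame(
    station_1 = p[1],
    station_2 = p[2],
    mean_diff = unname(tt$estimate[1] - tt$estimate[2]),
    t_stat    = unname(tt$statistic),
    p_value   = tt$p.value,
    conf_low  = tt$conf.int[1],
    conf_high = tt$conf.int[2]
  )
}))

results
```

With 4 stations this gives 6 rows, one per pair. If you want to correct for multiple comparisons, p.adjust(results$p_value, method = "holm") is one option.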

Related

I want to select all rows in a larger dataset, whose identification number, exists in another dataset in R

I have a large dataset, let's call it df1 (4226 observations X 186 variables).
I used a package called naniar to assess missingness, and created a dataset that shows, for each observation, the percentage of missing data. I then filtered that dataset to show only the observations (rows) with less than 50% missing data. Then I created a dataset containing just the row numbers of all rows that fit the missingness criterion; we can call this df2.
Now I want to create a subset of df1 using the data in df2 (2044 observations X 1 variable).
Can anyone help me here?
I have tried something like:
df3 <- df2[df2$row %in% df1]
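A minimal sketch of two ways this could work, using small made-up data frames. It assumes the single column in df2 is called row (as in the attempt above) and, for the second variant, that df1 has an identifier column, here hypothetically named id. Note the attempt above indexes df2 rather than df1 and is missing the comma that selects rows.

```r
# Toy stand-ins for the real data
df1 <- data.frame(id = 1:10, value = letters[1:10])
df2 <- data.frame(row = c(2, 5, 9))

# Variant 1: df2$row holds row positions of df1
df3 <- df1[df2$row, ]

# Variant 2: df2$row holds identification numbers that match df1$id
df3_by_id <- df1[df1$id %in% df2$row, ]

df3
```

Which variant applies depends on whether df2 stores row positions or ID values; in this toy example the two happen to coincide.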

Combining factor levels via taking the mean

Newbie here.
I have a dataset with columns "YEAR" (2014-2019), "SITE" (7 SITES), "TRANSECT" (UPSTREAM,DOWNSTREAM), and about 50 Insect species columns containing counts of individuals. I want to average the upstream and downstream samples for each year and site. The end goal is a dataset with columns "YEAR", "SITE", and the 50 Insect species columns containing the mean of the upstream and downstream counts. I have tried several methods to do this but have been unsuccessful. The following code is the last thing I have tried.
INS_YxS<-aggregate(INV.MEANS[5:54], INV.MEANS[1:3], mean)
Columns 1-4 in this dataset are X, YEAR, SITE, TRANSECT. 5-54 are Insect Species.
The resulting dataset appeared to have the correct columns but it looks like it just removed the TRANSECT column without averaging the upstream and downstream species counts... Anyone know how to accomplish what I am trying to do?
Here is a visual representation of what my data looks like (table 1) and what I want it to look like (table 2): https://i.stack.imgur.com/WkX4e.png
Notice that in table 2 there is no TRANSECT column, and that the new values in the insect columns are the means of the UPSTREAM and DOWNSTREAM TRANSECT rows for each YEAR and SITE, resulting in fewer rows.
Apologies, I am trying to find the best way to explain what I want to do...
I know the answer is out there and depends on me asking the correct question...
Thank you!!!
Consider the formula version for aggregate with dot notation:
INSECTS_MEANS <- aggregate(
  . ~ YEAR + SITE + TRANSECT,
  data = INSECTS_COUNTS,
  FUN = mean, na.rm = TRUE,
  na.action = na.omit
)
Otherwise, you need to pass a list to the by argument:
INSECTS_MEANS <- aggregate(
  x = INSECTS_COUNTS[5:ncol(INSECTS_COUNTS)],
  by = list(
    YEAR = INSECTS_COUNTS$YEAR,
    SITE = INSECTS_COUNTS$SITE,
    TRANSECT = INSECTS_COUNTS$TRANSECT
  ),
  FUN = mean, na.rm = TRUE,
  na.action = na.omit
)
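To average the upstream and downstream rows together, TRANSECT has to be left out of the grouping variables, so that both transects fall into the same YEAR and SITE group. The sketch below uses a small made-up INV.MEANS with two hypothetical species columns (sp_a, sp_b). It also assumes that column X is just a row index: if so, grouping by INV.MEANS[1:3] (X, YEAR, SITE) puts every row in its own group, which would explain why nothing was averaged in the question.

```r
# Toy data with the same layout as described: X, YEAR, SITE, TRANSECT, species...
INV.MEANS <- data.frame(
  X        = 1:8,
  YEAR     = rep(c(2014, 2015), each = 4),
  SITE     = rep(rep(c("S1", "S2"), each = 2), times = 2),
  TRANSECT = rep(c("UPSTREAM", "DOWNSTREAM"), times = 4),
  sp_a     = c(2, 4, 6, 8, 1, 3, 5, 7),
  sp_b     = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# Group by YEAR and SITE only; X and TRANSECT are neither grouped on nor averaged
INS_YxS <- aggregate(
  INV.MEANS[5:ncol(INV.MEANS)],
  by  = INV.MEANS[c("YEAR", "SITE")],
  FUN = mean
)

INS_YxS
```

Each YEAR and SITE combination now has one row holding the mean of its two transect rows, and there is no TRANSECT column in the result.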

How do I keep my missing values to stay the same after I do mice imputation and save my results?

As a new R user I'm having trouble understanding why the NA values in my dataframe keep changing. I'm running my code on Kaggle. Maybe that's where my problem comes from?
Original dataframe titled "abc"
There are multiple columns that have NA values, so I decided to try multiple imputation to handle them.
I created a new dataframe with just the columns that had NA values and began the imputation.
This is the new dataframe, titled "abc1":
abc1 <- select(abc, c(9,10,15,16,17,18,19,25,26))
#mice imputation
input_data = abc1
my_imp = mice(input_data, m=5, method="pmm", maxit=20)
summary(input_data$m_0_9)
my_imp$imp$m_0_9
When the imputation begins, it creates 5 columns of new values to fill in for the NA values of column m_0_9, and I choose which column to use.
Imputation of column 'm_0_9'
Then I run this code:
final_clean_abc1 <- complete(my_imp,5)
This assigns the values from column 5 of the last image to the NA values in my "abc1" dataframe and saves the result as "final_clean_abc1".
Lastly, I replace the columns in the original "abc" dataframe that had missing values with the new columns from "final_clean_abc1".
I know this probably isn't the cleanest:
abc$m_0_9 <- final_clean_abc1$m_0_9
abc$m_10_12 <- final_clean_abc1$m_10_12
abc$f_0_9 <- final_clean_abc1$f_0_9
abc$f_10_12 <- final_clean_abc1$f_10_12
abc$f_13_14 <- final_clean_abc1$f_13_14
abc$f_15 <- final_clean_abc1$f_15
abc$f_16 <- final_clean_abc1$f_16
abc$asian_pacific_islander <- final_clean_abc1$asian_pacific_islander
abc$american_indian <- final_clean_abc1$american_indian
Now that I have a dataframe 'abc' with no missing values, this is where my problem arises. I should be seeing '162' in row 10 of the m_0_9 column, but when I save my code and view it on Kaggle I get the value '7' for that row and column, as shown in the photo below.
"abc" dataframe with no NA values
Hopefully this makes sense; I tried to be as specific as I could.
There are multiple stochastic processes at work in mice: it imputes several values for each missing value, and the results are then pooled. You should not expect the same result each time you run mice.
From the MICE documentation
In the first step, the dataset with missing values (i.e. the
incomplete dataset) is copied several times. Then in the next step,
the missing values are replaced with imputed values in each copy of
the dataset. In each copy, slightly different values are imputed due
to random variation. This results in multiple imputed datasets. In the
third step, the imputed datasets are each analyzed and the study
results are then pooled into the final study result. In this Chapter,
the first phase in multiple imputation, the imputation step, is the
main topic. In the next Chapter, the analysis and pooling phases are
discussed.
https://bookdown.org/mwheymans/bookmi/multiple-imputation.html
We have a wonderful series of vignettes that detail the use of mice. Part of this series covers the stochastic nature of the algorithm and how to fix it. Setting mice(yourdata, seed = 123) would generate the same set of multiple imputations every time.
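A small sketch of the seed behaviour described above. It uses the nhanes example data that ships with mice instead of the asker's abc1, and the settings (m = 5, maxit = 5, seed = 123) are arbitrary choices for illustration.

```r
library(mice)

# Two runs with the same seed on the same incomplete data
imp_a <- mice(nhanes, m = 5, method = "pmm", maxit = 5, seed = 123, printFlag = FALSE)
imp_b <- mice(nhanes, m = 5, method = "pmm", maxit = 5, seed = 123, printFlag = FALSE)

# complete(imp, 5) returns the 5th imputed dataset; with equal seeds they match
same_result <- identical(complete(imp_a, 5), complete(imp_b, 5))
same_result
```

Without the seed argument (or a set.seed() call beforehand), each run draws different imputed values, which would be consistent with seeing 162 in one session and 7 after the notebook is saved and re-run.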

Create a subsample from a data frame in R

I have five data frames among which I want to run regressions:
df1: stock returns
df2: housing returns
df3: actual inflation rate
df4: expected inflation rate
df5: unexpected inflation rate
Dataframe example
Each of the data frames has the same format as above, with only different data inside it.
I want to do separate regression of housing and stocks against expected and unexpected inflation as below:
df1[i] ~ df4[i] + df5[i]
df2[i] ~ df4[i] + df5[i]
I want to compare the results of regression for periods where actual inflation (included in df3) is higher than the median value with periods where actual inflation is lower than the median value. For doing that, I need to create two subsamples from each data frame based on the value that each column has in df3.
Since I don't have deep knowledge of R, I don't know how to do it. Is it possible, and if so, how? Or is it better to create 13 different data frames, one for each country?
Thank you in advance!
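One way this could be done, as a rough sketch. The layout of the data frames is not shown, so this assumes that each of df1 to df5 has one column per country and that rows are the same time periods in the same order. The data is simulated, and country_a, country_b and the helper fit_split() are hypothetical names.

```r
set.seed(42)
n_periods <- 60
countries <- c("country_a", "country_b")

# Simulated stand-ins: one column per country, rows aligned across data frames
make_df <- function() {
  as.data.frame(setNames(
    replicate(length(countries), rnorm(n_periods), simplify = FALSE),
    countries
  ))
}
df1 <- make_df()  # stock returns
df2 <- make_df()  # housing returns
df3 <- make_df()  # actual inflation
df4 <- make_df()  # expected inflation
df5 <- make_df()  # unexpected inflation

# Fit the regression separately for above-median and other periods
fit_split <- function(returns, country) {
  d <- data.frame(
    ret        = returns[[country]],
    expected   = df4[[country]],
    unexpected = df5[[country]],
    actual     = df3[[country]]
  )
  high <- d$actual > median(d$actual, na.rm = TRUE)
  list(
    high = lm(ret ~ expected + unexpected, data = d[which(high), ]),
    low  = lm(ret ~ expected + unexpected, data = d[which(!high), ])
  )
}

stock_fits   <- fit_split(df1, "country_a")
housing_fits <- fit_split(df2, "country_a")
coef(stock_fits$high)
```

Looping over countries, for example lapply(countries, function(cn) fit_split(df1, cn)), would avoid creating 13 separate data frames by hand.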

NA variables in dplyr summary r

I am trying to create a table, which includes relative frequencies (counts) of variables taken from two groups (A and B) that fall within pre-given temporal intervals. My problem is that if a row starts with 0 seconds (see start_sec), the variable does not fall within the 0-5 seconds interval but is marked as NA (see Output). My wish is to include these cases within the above-mentioned interval.
This is a dummy example:
Variables
group <- c("A","A","A","A","A","A","B","B","B")
person <- c("p1","p1","p1","p3","p2","p2","p1","p1","p2")
start_sec <- c(0,10.7,11.8,3.9,7.4,12.1,0,3.3,0)
dur_sec <- c(7.1,8.2,9.3,10.4,11.5,12.6,13.7,14.8,15.9)
Data frame
df <- data.frame(group,person,start_sec,dur_sec)
df
Pipeline
library(dplyr)
df %>%
  group_by(group, person, interval = cut(start_sec, breaks = c(0, 5, 10, 15))) %>%
  summarise(counts = n(), sum_dur_sec = sum(dur_sec))
Output (so far)
Thank you in advance for all comments and feedback!
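A likely cause: by default cut() builds intervals that are open on the left, so the first interval is (0,5] and a value of exactly 0 falls outside it and becomes NA. A minimal sketch of one fix is to pass include.lowest = TRUE, which closes the first interval to [0,5]. The name out and the .groups = "drop" argument are additions for illustration.

```r
library(dplyr)

group     <- c("A", "A", "A", "A", "A", "A", "B", "B", "B")
person    <- c("p1", "p1", "p1", "p3", "p2", "p2", "p1", "p1", "p2")
start_sec <- c(0, 10.7, 11.8, 3.9, 7.4, 12.1, 0, 3.3, 0)
dur_sec   <- c(7.1, 8.2, 9.3, 10.4, 11.5, 12.6, 13.7, 14.8, 15.9)
df <- data.frame(group, person, start_sec, dur_sec)

out <- df %>%
  group_by(
    group, person,
    # include.lowest = TRUE makes the first interval [0,5] instead of (0,5]
    interval = cut(start_sec, breaks = c(0, 5, 10, 15), include.lowest = TRUE)
  ) %>%
  summarise(counts = n(), sum_dur_sec = sum(dur_sec), .groups = "drop")

out
```

An alternative is right = FALSE, which gives intervals [0,5), [5,10), [10,15); a value of exactly 15 would then be NA unless include.lowest = TRUE is also set.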
