Factor Level issues after filling data frame using match - r

I am using two large data files, each having >2m records. The sample data frames are
x <- data.frame("ItemID" = c(1,2,1,1,3,4,2,3,4,1), "SessionID" = c(111,112,111,112,113,114,114,115,115,115), "Avg" = c(1.0,0.45,0.5,0.5,0.46,0.34,0.5,0.6,0.10,0.15),"Category" =c(0,0,0,0,0,0,0,0,0,0))
y <- data.frame("ItemID" = c(1,2,3,4,3,4,5,7),"Category" = c("1","0","S","120","S","120","512","621"))
I successfully filled x$Category using the following command
x$Category <- y$Category[match(x$ItemID,y$ItemID)]
but
x$Category
gave me
[1] 1 0 1 1 S 120 0 S 120 1
Levels: 0 1 120 512 621 S
In x there are only four distinct categories, but Levels shows six. Similarly, the frequency table shows 512 and 621 with a frequency of 0. I am using the same data for classification, where it reports six classes instead of four, which negatively affects the F-measure, recall, etc.
table(x$Category)
  0   1 120 512 621   S
  2   4   2   0   0   2
while I want
table(x$Category)
  0   1 120   S
  2   4   2   2
I tried merge, following this and this along with a number of other questions, but it gives me an error message. I found here (Practical limits of R data frame) that this is a limitation of R.

I would omit the Category column from your x data.frame, since it seems to only be serving as a placeholder until values from the y data.frame are filled in. Then, you can use left_join from dplyr with ItemID as the key variable, followed by droplevels() as suggested by TingITangIBob.
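This gets you close, but my table does not exactly match yours:
library(dplyr)  # provides %>%, select() and left_join()
joined <- select(x, -Category) %>%
  left_join(y, by = "ItemID") %>%
  droplevels()
table(joined$Category)
  0   1 120   S
  2   4   4   4
The extra counts come from the repeated ItemIDs in y rather than in x: items 3 and 4 each appear twice in y, so left_join returns two rows for every matching row of x. Joining against unique(y) instead should give the counts you want.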
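If the goal is only to discard the unused levels after the match() lookup from the question, droplevels() can be applied to the looked-up factor directly. This is a minimal sketch using the sample data from the question. Building Category with factor() explicitly is an assumption made here, so that the example behaves the same whether or not strings are converted to factors by default.

```r
# Sample data from the question (only the columns needed here)
x <- data.frame(ItemID = c(1, 2, 1, 1, 3, 4, 2, 3, 4, 1))
y <- data.frame(
  ItemID   = c(1, 2, 3, 4, 3, 4, 5, 7),
  Category = factor(c("1", "0", "S", "120", "S", "120", "512", "621"))
)

# match() returns the position of the FIRST matching ItemID in y,
# so duplicated ItemIDs in y do not add rows to x
looked_up <- y$Category[match(x$ItemID, y$ItemID)]

# droplevels() removes the levels (512, 621) that never occur in x
x$Category <- droplevels(looked_up)

table(x$Category)
```

Because match() keeps one row per row of x, this avoids the row duplication that a join against a y with repeated keys produces.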


How to make a bar plot using ggplot that uses multiple columns for the x-axis?

I am trying to use multiple column names as the x-axis in a barplot. So each column name will be the "factor" and the data it contains is the count for that.
I have tried iterations of this:
ggplot(aes( x = names, y = count)) + geom_bar()
I tried concatenating the x values I want to show with aes(c(col1, col2))
but the aesthetics length does not match and won't work.
library(dplyr)
library(ggplot2)
head(dat)
  Sample Week Response_1 Response_2 Response_3 Response_4 Vaccine_Type
1      1    1        300          0       2000        100            1
2      2    1        305          0        320         15            1
3      3    1        310          0        400         35            1
4      4    1        400          1        410         35            1
5      5    1        405          0        180         35            2
6      6    1        410          2        800         75            2
dat %>%
  group_by(Week) %>%
  ggplot(aes(c(Response_1, Response_2, Response_3, Response_4))) +
  geom_boxplot() +
  facet_grid(.~Week)
dat %>%
  group_by(Week) %>%
  ggplot(aes(Response_1, Response_2, Response_3, Response_4)) +
  geom_boxplot() +
  facet_grid(.~Week)
> Error: Aesthetics must be either length 1 or the same as the data (24): x
Both of these failed (kind of expected based on aes length error code), but hopefully you know the direction I was aiming for and can help out.
Goal is to have 4 separate groups, each with their own boxplot (1 for every response). And also have them faceted by week.
The simple code below gets me most of what I want. Unfortunately, I don't think it would be as easy to add the points and other features to the plot as it is with ggplot.
boxplot(dat[,3:6], use.cols = TRUE)
And I could pretty easily just filter by the different weeks and use mfrow for faceting. Not as informative as ggplot, but gets the job done. If anyone else has other workarounds, I'd be interested in seeing.
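One common way to get one box per response column in ggplot is to reshape the data to long format first, so that the column name becomes a single x variable. The sketch below uses the six rows shown in head(dat) and base R's stack() to avoid extra dependencies (tidyr::pivot_longer() would be a common alternative). The names resp_cols and long are illustrative, and the plotting step only runs if ggplot2 is installed.

```r
# The six rows shown in head(dat) in the question
dat <- data.frame(
  Sample       = 1:6,
  Week         = rep(1, 6),
  Response_1   = c(300, 305, 310, 400, 405, 410),
  Response_2   = c(0, 0, 0, 1, 0, 2),
  Response_3   = c(2000, 320, 400, 410, 180, 800),
  Response_4   = c(100, 15, 35, 35, 35, 75),
  Vaccine_Type = c(1, 1, 1, 1, 2, 2)
)

resp_cols <- grep("^Response_", names(dat), value = TRUE)

# stack() puts all response values in 'values' and the column name in 'ind';
# the id columns are repeated once per response column to line up with it
id_part <- dat[rep(seq_len(nrow(dat)), times = length(resp_cols)),
               c("Sample", "Week", "Vaccine_Type")]
long <- cbind(id_part, stack(dat[resp_cols]))

# One box per response column, faceted by week, with the raw points on top
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = ind, y = values)) +
    geom_boxplot() +
    geom_jitter(width = 0.1) +
    facet_grid(. ~ Week)
  print(p)
}
```

With the data in this shape, the column name is an ordinary variable, so it can also be mapped to fill or colour.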

Reverse cumsum with breaks with non-sequential numbers

Looking to fill a matrix with a reverse cumsum. There are multiple breaks that must be maintained.
I have provided a sample matrix for what I want to accomplish. The first column is the data, the second column is what I want. You will see that column 2 is updated to reflect the number of items that are left. When there are 0's the previous number must be carried through.
update <- matrix(c(rep(0,4),rep(1,2),2,rep(0,2),1,3,
rep(10,4), 9,8,6, rep(6,2), 5, 2),ncol=2)
I have tried multiple ways to create a sequence and to loop, using numerous packages (e.g. zoo). The difficulty is that the numbers in column 1 can take any value 0, 1, ..., X, as long as they are less than column 2.
Any help or tips would be appreciated
EDIT: Column 2 starts with a given value, which can represent any starting value (e.g. inventory at the beginning of a month). Column 1 then represents the "purchases" made; thus, column 2 should reflect the total number of remaining items available.
The following will report the purchase and inventory balance as described:
starting_inventory <- 100
df <- data.frame(purchases=c(rep(0,4),rep(1,2),2,rep(0,2),1,3))
df$cum_purchases <- cumsum(df$purchases)
df$remaining_inventory <- starting_inventory - df$cum_purchases
Result:
   purchases cum_purchases remaining_inventory
1          0             0                 100
2          0             0                 100
3          0             0                 100
4          0             0                 100
5          1             1                  99
6          1             2                  98
7          2             4                  96
8          0             4                  96
9          0             4                  96
10         1             5                  95
11         3             8                  92
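The same idea can be checked against the sample matrix from the question, whose second column starts at 10. This is a small sketch; the names purchases, wanted, starting_value and remaining are illustrative.

```r
# Column 1 (data) and column 2 (desired result) from the question's matrix
purchases <- c(rep(0, 4), rep(1, 2), 2, rep(0, 2), 1, 3)
wanted    <- c(rep(10, 4), 9, 8, 6, rep(6, 2), 5, 2)

starting_value <- 10

# Subtracting the running total carries the previous balance through
# the rows where nothing is purchased (the zeros)
remaining <- starting_value - cumsum(purchases)

update <- cbind(purchases, remaining)
all(remaining == wanted)
```

Since cumsum() does not change on a zero, the "carry the previous number through" requirement is handled without any explicit loop.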

How to perform a repeated G.test in R?

I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
  Date   Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol       1       3    1       0       13
2 23Ap cital       1       5    3       1        6
3 23Ap gerol       0       3    0       0        9
4 23Ap   mix       0       5    0       0        8
5 23Ap cital       0       5    1       0       13
6 23Ap cella       0       5    0       1        4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
   Date Trt Treated Control Dead DeadinC AliveinC
4  23Ap mix       0       5    0       0        8
8  23Ap mix       0       5    1       0        8
10 23Ap mix       0       2    3       0        5
20 23Ap mix       0       0    0       0       18
25 23Ap mix       0       2    1       0       15
28 23Ap mix       0       1    0       0       12
For G.test(x) to work when x is a matrix, it must have 2 columns of numbers, with 1 row per population. If I use the apply() function, I can run the G.test on each row, provided my data set contains only two columns of numbers. I want to look only at Treated and Control, for example, but I'm not sure how to omit the other columns so that the G.test ignores the headers and the remaining columns. I've tried the following, but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about trying to use something like this rather than creating subsets.
by(bio, Trt, rowG.test)
The G.test spits out this, when you compare two numbers.
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question: is there some way to add up all the df and G values that I get for each row (once I'm able to get all these numbers) for each treatment? Is there also some way to have R report the G, df and p-values in a table to be summed, rather than printing them as above for each row?
Any help is hugely appreciated.
You're really close. This seems to work (hard to tell with such a small sample though).
by(bio,bio$Trt,function(x)G.test(as.matrix(x[,3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle, passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as its first argument, whereas x[,3:4] in the code above is a data.frame. So you need to convert with as.matrix(...).
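For the second part of the question (collecting G, df and p-values per row in a table and summing them per treatment), the sketch below computes a goodness-of-fit G statistic by hand in base R. It is a stand-in for illustration and not the RVAideMemoire implementation: g_row is a hypothetical helper, and it assumes equal expected proportions for the two counts and no continuity correction. The data are the six rows shown in head(bio). As an aside, the "object 'G.test' not found" error in the question usually means the package has not been attached with library(RVAideMemoire).

```r
# Rows shown in head(bio), restricted to the columns used here
bio <- data.frame(
  Trt     = c("citol", "cital", "gerol", "mix", "cital", "cella"),
  Treated = c(1, 1, 0, 0, 0, 0),
  Control = c(3, 5, 3, 5, 5, 5)
)

# Hypothetical helper: G = 2 * sum(O * log(O / E)) with equal expected counts.
# Zero observed counts contribute 0 to the sum.
g_row <- function(counts) {
  expected <- rep(sum(counts) / length(counts), length(counts))
  terms <- ifelse(counts > 0, counts * log(counts / expected), 0)
  G  <- 2 * sum(terms)
  df <- length(counts) - 1
  c(G = G, df = df, p.value = pchisq(G, df, lower.tail = FALSE))
}

# One row of results per row of data
per_row <- t(apply(as.matrix(bio[, c("Treated", "Control")]), 1, g_row))
result  <- data.frame(Trt = bio$Trt, per_row)

# Sum G and df within each treatment, then get a p-value for the total
totals <- aggregate(cbind(G, df) ~ Trt, data = result, FUN = sum)
totals$p.value <- pchisq(totals$G, totals$df, lower.tail = FALSE)
totals
```

Rows where both counts are 0 carry no information and would be better dropped before this step; here they would simply yield G = 0.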

drawing multiple boxplots from imputed data in R

I have an imputed dataset that I'm analysing, and I'm trying to draw boxplots, but I can't wrap my head around the proper procedure.
My data (a sample; the original has 20 observations per imputation and 13 vars per group, and all values range from 0 to 25):
.imp .id FTE_RM FTE_PD OMZ_RM OMZ_PD
   1   1     25     25     24     24
   1   2      4      0      2      6
   1   3     11      5      3      2
   1   4     12      3      3      3
   2   1     20     15     15     15
   2   2      4      1      2      3
   2   3      0      0      0      6
   2   4     20      0      0      0
.imp signifies the imputation round, .id the identifier for each observation.
I want to draw all the FTE_* variables in a single plot (and the OMZ_* variables in another), but I wonder what to do with all the imputations: can I just include all values? The imputed data now has 500 observations. With an ANOVA, for instance, I'd need to average the ANOVA results by 5 to get back to 20 observations. But is this needed for a boxplot as well, since I only deal with medians, means, max. and min.?
Such as:
data_melt <- melt(df[grep("^FTE_", colnames(df))])
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()
I've played a couple of times with ggplot, but consider myself a complete newbie.
I assume you want to keep the identifiers .imp and .id after melting, so instead use:
data_melt <- melt(df,c(".imp",".id"))
For completeness of the dataframe it probably helps to introduce a column that identifies the type - FTE vs. OMZ:
data_melt$type <- ifelse(grepl("FTE",data_melt$variable),"FTE","OMZ")
Having this data.frame you can, for example, facet on the type (alternatively you can just use a simple filter statement on data_melt to restrict to one type):
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()+facet_wrap(~type,scales="free_x")
EDIT: fixed the data mess-up
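Keeping .imp in the long data also makes it possible to draw the imputation rounds side by side rather than pooling them into one box. The sketch below is one option for inspecting the data, not statistical advice on how imputations should be combined. It uses the eight sample rows from the question and builds the long format with base R's stack() as a stand-in for melt(); the plotting step only runs if ggplot2 is installed.

```r
# Sample rows from the question
df <- data.frame(
  .imp   = rep(1:2, each = 4),
  .id    = rep(1:4, times = 2),
  FTE_RM = c(25, 4, 11, 12, 20, 4, 0, 20),
  FTE_PD = c(25, 0, 5, 3, 15, 1, 0, 0),
  OMZ_RM = c(24, 2, 3, 3, 15, 2, 0, 0),
  OMZ_PD = c(24, 6, 2, 3, 15, 3, 6, 0)
)

# Long format that keeps .imp and .id (base R stand-in for melt())
measure <- setdiff(names(df), c(".imp", ".id"))
stacked <- stack(df[measure])
data_melt <- data.frame(
  df[rep(seq_len(nrow(df)), times = length(measure)), c(".imp", ".id")],
  variable = stacked$ind,
  value = stacked$values,
  row.names = NULL
)

# Everything before the first underscore is the type (FTE or OMZ)
data_melt$type <- sub("_.*$", "", data_melt$variable)

# One box per variable and imputation round, one panel per type
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data_melt, aes(x = variable, y = value, fill = factor(.imp))) +
    geom_boxplot() +
    facet_wrap(~type, scales = "free_x")
  print(p)
}
```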

Include zero frequencies in frequency table for Likert data

I have a dataset with responses to a Likert item on a 9pt scale. I would like to create a frequency table (and barplot) of the data but some values on the scale never occur in my dataset, so table() removes that value from the frequency table. I would like it instead to present the value with a frequency of 0. That is, given the following dataset
# Assume a 5pt Likert scale for ease of example
data <- c(1, 1, 2, 1, 4, 4, 5)
I would like to get the following frequency table without having to manually insert a column named 3 with the value 0.
1 2 3 4 5
3 1 0 2 1
I'm new to R, so maybe I've overlooked something basic, but I haven't come across a function or option that gives the desired result.
EDIT:
tabulate produces frequency tables while table produces contingency tables. However, to get zero frequencies in a one-dimensional contingency table as in the above example, the code below still works, of course.
This question provided the missing link. By converting the Likert item to a factor, and explicitly specifying the levels, levels with a frequency of 0 are still counted
data <- factor(data, levels = c(1:5))
table(data)
produces the desired output
table produces a contingency table, while tabulate produces a frequency table that includes zero counts.
tabulate(data)
# [1] 3 1 0 2 1
Another way (if you have integers starting from 1 - but easily modifiable for other cases):
setNames(tabulate(data), 1:max(data)) # to make the output easier to read
# 1 2 3 4 5
# 3 1 0 2 1
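Both approaches can be compared on the example data. One detail worth checking is what happens when the highest scale value never occurs: tabulate() only counts up to the largest value it sees unless nbins is given, while a factor with explicit levels always covers the full scale. This is a small sketch; the names scale_points, by_factor and by_tabulate are illustrative.

```r
data <- c(1, 1, 2, 1, 4, 4, 5)
scale_points <- 1:5

# Explicit levels keep the empty category 3 in the table
by_factor <- table(factor(data, levels = scale_points))

# tabulate() with nbins gives the same counts as a plain integer vector
by_tabulate <- setNames(tabulate(data, nbins = max(scale_points)), scale_points)

# If the top of the scale is missing from the data, nbins still pads with zeros
short <- c(1, 1, 2)
padded <- tabulate(short, nbins = max(scale_points))

by_factor
by_tabulate
padded
```

Passing by_factor to barplot() then draws a labelled, zero-height bar for the empty category as well.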
If you want to quickly calculate the counts or proportions for multiple Likert items and get your output in a data.frame, you may like the function response.frequencies in the psych package.
Let's create some data (note that there are no 8s or 9s):
df <- data.frame(item1 = sample(1:7, 2000, replace = TRUE),
                 item2 = sample(1:7, 2000, replace = TRUE),
                 item3 = sample(1:7, 2000, replace = TRUE))
If you want to calculate the proportion in each category
psych::response.frequencies(df, max = 1000, uniqueitems = 1:9)
you get the following:
           1      2     3      4      5      6      7 8 9 miss
item1 0.1450 0.1435 0.139 0.1325 0.1380 0.1605 0.1415 0 0    0
item2 0.1535 0.1315 0.126 0.1505 0.1535 0.1400 0.1450 0 0    0
item3 0.1320 0.1505 0.132 0.1465 0.1425 0.1535 0.1430 0 0    0
If you want counts, you can multiply by the sample size:
psych::response.frequencies(df, max = 1000, uniqueitems = 1:9) * nrow(df)
You get the following:
        1   2   3   4   5   6   7 8 9 miss
item1 290 287 278 265 276 321 283 0 0    0
item2 307 263 252 301 307 280 290 0 0    0
item3 264 301 264 293 285 307 286 0 0    0
A few notes:
The default max is 10. Thus, if you have more than 10 response options, you'll have issues. Otherwise, in your case and in many Likert item cases, you could omit the max argument.
uniqueitems specifies the possible values. If all your values were present in at least one item, this would be inferred from the data.
I think the function only works with numeric data. So if your Likert categories are coded as "Strongly disagree", etc., it won't work.
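When the items are stored as text labels, a base R alternative is to apply the factor-with-explicit-levels idea to every column. This is a minimal sketch on made-up data: the data frame items, the label set and the names counts and props are all illustrative assumptions, not taken from the answers above.

```r
# Made-up example: two items coded with text labels
labels <- c("Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree")
items <- data.frame(
  item1 = c("Agree", "Agree", "Neutral", "Strongly agree"),
  item2 = c("Disagree", "Agree", "Agree", "Agree"),
  stringsAsFactors = FALSE
)

# One row per item, one column per possible response, zeros included
counts <- t(sapply(items, function(col) table(factor(col, levels = labels))))

# Proportions instead of counts
props <- counts / nrow(items)

counts
props
```

Listing the labels explicitly also fixes the column order, which would otherwise be alphabetical.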
