Reshaping dataframe (converting rows to columns) - r

I have the following dataset:
Name TOR_A Success_rate_A Realizable_Prod_A Assist_Rate_A Task_Count_A Date
1 BVG1 2.00 85 4.20 0.44 458 31/01/2014
2 BVG2 3.99 90 3.98 0.51 191 31/01/2014
3 BVG3 4.00 81 8.95 0.35 1260 31/01/2014
4 BVG4 3.50 82 2.44 4.92 6994 31/01/2014
5 BVG1 2.75 85 4.00 2.77 7954 07/02/2014
6 BVG2 4.00 91 3.50 1.50 757 07/02/2014
7 BVG3 3.80 82 7.00 1.67 7898 07/02/2014
8 BVG4 3.60 83 3.50 4.87 7000 07/02/2014
I wish to plot a ggplot line graph with Date on x-axis and TOR_A, Success_rate_A etc. on y-axis. I would also like to see it by the Name column. How can I prepare this dataset to achieve this objective?
I tried reshape in R but couldn't make it work.
UPDATE
Done it using the reshape2::recast method, as shown below:
data_weekly = recast(data_frame_to_be_reshaped, variable + Name ~ Date, id.var = c("Name", "Date"))

You can use Hadley Wickham's tidyr package.
df_reshaped <- gather(df_original, key = Variable, value = Value, TOR_A:Task_Count_A)
As you can see, the first argument of gather() is the original data frame. Next you name the column that will hold the names of your original variables (key), and then the column that will hold their values (value). Finally, you specify which columns you want to reshape. All columns that are not selected (in our example, Date and Name) remain as they were in the original data frame.
There is a nice tutorial on tidyr published by Brad Boehmke in case you would need more information.
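To connect the reshaping step to the plot the question asks for, here is a minimal sketch. It assumes a trimmed-down copy of the data (two names, two measures), that the dates are day/month/year strings, and that one facet per measure is an acceptable layout; df_long and p are illustrative names only.

```r
library(tidyr)
library(ggplot2)

# Trimmed-down copy of the question's data (two names, two measures).
df_original <- data.frame(
  Name = rep(c("BVG1", "BVG2"), times = 2),
  TOR_A = c(2.00, 3.99, 2.75, 4.00),
  Success_rate_A = c(85, 90, 85, 91),
  Date = c("31/01/2014", "31/01/2014", "07/02/2014", "07/02/2014")
)

# Assumption: dates are day/month/year; parse them so the x-axis is ordered in time.
df_original$Date <- as.Date(df_original$Date, format = "%d/%m/%Y")

# Wide to long: one row per Name, Date and measure.
df_long <- gather(df_original, key = "Variable", value = "Value",
                  TOR_A:Success_rate_A)

# One line per Name, one panel per measure (free y scales, since units differ).
p <- ggplot(df_long, aes(x = Date, y = Value, colour = Name, group = Name)) +
  geom_line() +
  facet_wrap(~ Variable, scales = "free_y")
print(p)
```

Faceting is only one option; mapping Variable to linetype in a single panel would also work if the measures share a comparable scale.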

Related

Grouping Frame Values

I have a dataset of ingredients for cookies. I'm trying to answer which group (A, B, C, etc) of cookies has the most sugar in them. The dataset is structured as follows:
group id mois prot fat chocolate sugar carb cal
1 A 14069 27.82 21.43 44.87 5.11 1.77 0.77 4.93
2 A 14053 28.49 21.26 43.89 5.34 1.79 1.02 4.84
3 A 14025 28.35 19.99 45.78 5.08 1.63 0.80 4.95
4 B 14016 30.55 20.15 43.13 4.79 1.61 1.38 4.74
5 B 14005 30.49 21.28 41.65 4.82 1.64 1.76 4.67
6 A 14075 31.14 20.23 42.31 4.92 1.65 1.40 4.67
7 C 14082 31.21 20.97 41.34 4.71 1.58 1.77 4.63
8 C 14097 28.76 21.41 41.60 5.28 1.75 2.95 4.72
etc....
How can I plot the mean of each grouping to show that one of them has a higher average of sugar than the others? Or at the least, how can I print off the results of the grouped averages of sugar to defend my argument that one has more sugar than the other?
After saving your text to CSV and loading this file into R, it's pretty easy to obtain the mean sugar quantity per group, which I'm assuming is what you need.
You first group your data by variable group and then summarize the data using the "mean" function.
library(dplyr)
(cookies = df %>%
  group_by(group) %>%
  summarize(meanSugar = mean(sugar)))
group meanSugar
<chr> <dbl>
1 A 1.71
2 B 1.62
3 C 1.66
As you can see, group A has sugar content a bit higher than the others based on your data.
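For anyone who wants to reproduce the grouped means without loading a CSV, here is a small base R sketch. It assumes only the eight rows shown in the question, typed in by hand, and swaps dplyr for base R's aggregate(); mean_sugar is an illustrative name.

```r
# The eight rows shown in the question (only the columns needed here).
df <- data.frame(
  group = c("A", "A", "A", "B", "B", "A", "C", "C"),
  sugar = c(1.77, 1.79, 1.63, 1.61, 1.64, 1.65, 1.58, 1.75)
)

# Mean sugar per group, using the formula interface of aggregate().
mean_sugar <- aggregate(sugar ~ group, data = df, FUN = mean)

# Sort so the group with the highest mean comes first.
mean_sugar <- mean_sugar[order(mean_sugar$sugar, decreasing = TRUE), ]
print(mean_sugar)
```

With so few rows per group, the differences between the means are small, so this supports a descriptive statement rather than a statistical claim.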
If you wanna go a step further and really plot this data, you can do that:
library(ggplot2)
cookies %>%
  ggplot(aes(x=meanSugar,y=reorder(group,meanSugar),fill=group,label=meanSugar)) +
  geom_col()+
  labs(y="Cookie groups",x="Mean Sugar")+
  geom_label(stat="identity",hjust=+1.2,color="white")+
  theme(legend.position = "none")
If you have any questions on some of these steps, let me know!
Obs: please try to provide better data the next time so it's easy to reproduce what you need and give you a quick answer :)

Is there a way to resolve this error in cardinality_threshold problem?

I tried to use ggpairs to visualise my dataset, but I don't understand the error message I am getting. Can someone please help me?
> describe(Mydata)
vars n mean sd median trimmed mad min max range skew
Time 1 192008 4257.07 2589.28 4156.44 4210.33 3507.03 0 8869.91 8869.91 0.09
Source* 2 192008 9.32 5.95 8.00 8.53 2.97 1 51.00 50.00 3.39
Destination* 3 192008 8.22 6.49 7.00 7.31 2.97 1 51.00 50.00 3.07
Protocol* 4 192008 16.14 4.29 19.00 16.77 0.00 1 20.00 19.00 -1.26
Length 5 192008 166.12 464.07 74.00 96.25 11.86 60 21786.00 21726.00 14.40
Info* 6 192008 63731.70 46463.90 60732.50 62899.62 69904.59 1 131625.00 131624.00 0.14
kurtosis se
Time -1.28 5.91
Source* 15.94 0.01
Destination* 13.21 0.01
Protocol* 0.66 0.01
Length 349.17 1.06
Info* -1.47 106.04
> Mydata[,1][Mydata[,1] ==0]<-NA
> ggpairs(Mydata)
Error in stop_if_high_cardinality(data, columns, cardinality_threshold) :
Column 'Source' has more levels (51) than the threshold (15) allowed.
Please remove the column or increase the 'cardinality_threshold' parameter. Increasing the
cardinality_threshold may produce long processing times
As the error suggests, the way to get rid of the error is to set cardinality_threshold=NULL or cardinality_threshold=51 as Source and Destination are both factor variables with 51 levels.
However, it is likely to be hard to see any detail in the plots, if it plots at all, because one of the panels would be attempting to fit 51 bar plots with 51 columns each. You may want to consider whether grouping your factor levels makes sense for the analysis you're interested in, or exclude the factors (although that leaves only two continuous variables).
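One way to group factor levels before calling ggpairs is to keep the most frequent levels and collapse the rest. The sketch below is only an illustration: Mydata_small is made-up stand-in data, lump_levels() is a hypothetical helper written here (not part of GGally), and keeping five levels is an arbitrary choice.

```r
# Made-up stand-in for Mydata: one numeric column and one factor with 51 levels.
set.seed(1)
Mydata_small <- data.frame(
  Length = rexp(500, rate = 1 / 100),
  Source = factor(sample(sprintf("host%02d", 1:51), 500, replace = TRUE,
                         prob = c(rep(10, 5), rep(1, 46))))
)

# Hypothetical helper: keep the n most frequent levels, collapse the rest into "Other".
lump_levels <- function(f, n = 5, other = "Other") {
  keep <- names(sort(table(f), decreasing = TRUE))[seq_len(n)]
  factor(ifelse(f %in% keep, as.character(f), other), levels = c(keep, other))
}

Mydata_small$Source <- lump_levels(Mydata_small$Source, n = 5)
print(nlevels(Mydata_small$Source))

# The factor now has fewer levels than the threshold of 15 named in the error.
if (requireNamespace("GGally", quietly = TRUE)) {
  p <- GGally::ggpairs(Mydata_small, cardinality_threshold = 15)
}
```

Whether collapsing into "Other" is meaningful depends on the analysis; for network captures it may be more useful to group addresses by subnet or role instead.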

How can I use dplyr to turn one column into 3 based on the characters in the original column?

Hopefully this makes sense. I have one column in my dataset that has multiple entries of one of three size categories (read in the data as characters), "(0,1.88]", "(1.88,4]", and "(4,10]". I would like to combine all of my entries together by plot (another column in the dataset), totaling the response for each size category in its own column.
Ideally, I'm trying to take data which has multiple responses in each Plot and end up with one total response for each plot, divided by size category. I'm hoping to get something like this:
Plot Total Response for (0,1.88] Total Response for (1.88,4] Total Response for (4,10]
Here is the head of my data. Not all of it is needed, only Plot, ounces, and tuber.diam. tuber.diam has the entries grouped into size categories.
head(newChippers)
Plot ounces Height Shape Area plot variety rate block width length tuber.oz.bin tuber.diam
1 2422 1.31 1.22 26122 3237 242 Lamoka 3 4 1.65 1.70 (0,4] (0,1.88]
2 2422 2.76 1.56 27853 5740 242 Lamoka 3 4 2.20 2.24 (0,4] (1.88,4]
3 2422 1.62 1.31 24125 3721 242 Lamoka 3 4 1.53 1.95 (0,4] (0,1.88]
4 2422 3.37 1.70 27147 6498 242 Lamoka 3 4 2.17 2.48 (0,4] (1.88,4]
5 2422 3.19 1.70 27683 6126 242 Lamoka 3 4 2.22 2.34 (0,4] (1.88,4]
6 2422 2.83 1.53 27356 6009 242 Lamoka 3 4 2.00 2.53 (0,4] (1.88,4]
Here is what I currently have for making the new dataset:
YieldSizeProfileDiameter <- newChippers %>%
group_by(Plot) %>%
summarize(totalOz = sum(Weight),
Diameter.0.1.88 = (tuber.diam("(0,1.88]")),
Diameter.1.88.4 = (tuber.diam(" (1.88,4]")),
Diameter.4.10 = (tuber.diam(" (4,10]")))
I get the following error code:
Error in x[[n]] : object of type 'closure' is not subsettable
Any help would be very much appreciated! Again, I'm very sorry if I've explained it poorly or made it too complicated. If any additional information is needed, I can try to provide it. Thank you!
I have revised your code. I assume your variable Weight is the same as ounces, as there is no Weight column in newChippers. I use Weight here, as in your code; replace it with ounces if that is the actual column name:
library(dplyr)
library(tidyr)
YieldSizeProfileDiameter <- newChippers %>%
  group_by(Plot, tuber.diam) %>%
  summarize(totalOz = sum(Weight)) %>%
  pivot_wider(names_from = tuber.diam, values_from = totalOz)
YieldSizeProfileDiameter
I have not tested the code on my side as I do not have the data.
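A small check of the same approach on made-up data may help. In this sketch the first six rows copy Plot, ounces and tuber.diam from the head() output in the question, the two rows for plot 2423 are invented, and ounces is assumed to be the weight column; newChippers_small and wide are illustrative names.

```r
library(dplyr)
library(tidyr)

# First six rows from the question's head() output, plus two invented rows for plot 2423.
newChippers_small <- data.frame(
  Plot = c(2422, 2422, 2422, 2422, 2422, 2422, 2423, 2423),
  ounces = c(1.31, 2.76, 1.62, 3.37, 3.19, 2.83, 5.00, 1.00),
  tuber.diam = c("(0,1.88]", "(1.88,4]", "(0,1.88]", "(1.88,4]",
                 "(1.88,4]", "(1.88,4]", "(4,10]", "(0,1.88]")
)

# Total ounces per plot and size class, then one column per size class.
wide <- newChippers_small %>%
  group_by(Plot, tuber.diam) %>%
  summarize(totalOz = sum(ounces), .groups = "drop") %>%
  pivot_wider(names_from = tuber.diam, values_from = totalOz, values_fill = 0)

print(wide)
```

Here values_fill = 0 turns missing plot and size combinations into zeros rather than NA, which seems reasonable for totals but is a design choice.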

Ramp up/down missing time-series data in R

I have a set of time-series data (GPS speed data, specifically), which includes gaps of missing values where the signal was lost. For missing periods of short duration I am able to fill the gaps simply using na.spline; however, this is inappropriate for longer periods. I would like to ramp the values from the last true value down to zero, based on predefined acceleration limits.
#create sample data frame
test <- as.data.frame(c(6,5.7,5.4,5.14,4.89,4.64,4.41,4.19,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,5,5.1,5.3,5.4,5.5))
names(test)[1] <- "speed"
#set rate of acceleration for ramp
ramp <- 6
#set sampling rate of receiver
Hz <- 1/10
So for missing data the ramp would use the previous value and the rate of acceleration to get the next data point, until speed reached zero (i.e. last speed [4.19] - (Hz * ramp)), yielding the following values:
3.59
2.99
2.39
1.79
1.19
0.59
0
Lastly, I need to do this in the reverse fashion, to ramp up from zero when the signal picks back up again.
Hope this is clear.
Cheers
It's not really elegant, but you can do it in a loop.
na.pos <- which(is.na(test$speed))
acc <- FALSE
for (i in na.pos) {
  if (acc) {
    speed <- test$speed[i-1] + (Hz*ramp)
  } else {
    speed <- test$speed[i-1] - (Hz*ramp)
    if (round(speed, 1) < 0) {
      acc <- TRUE
      speed <- test$speed[i-1] + (Hz*ramp)
    }
  }
  test[i,] <- speed
}
The result is:
speed
1 6.00
2 5.70
3 5.40
4 5.14
5 4.89
6 4.64
7 4.41
8 4.19
9 3.59
10 2.99
11 2.39
12 1.79
13 1.19
14 0.59
15 -0.01
16 0.59
17 1.19
18 1.79
19 2.39
20 2.99
21 3.59
22 4.19
23 4.79
24 5.00
25 5.10
26 5.30
27 5.40
28 5.50
Note the '-0.01': 0.59 - (6 * 1/10) is -0.01, not 0. You can round it later; I decided not to.
When the question says "ramp the values from the last true value down to zero" in each run of NAs I assume that that means that any remaining NAs in the run after reaching zero are also to be replaced by zero.
Now, use rleid from data.table to create a grouping vector the same length as test$speed identifying each run in is.na(test$speed), and use ave to create sequence numbers within such groups, seqno. Then calculate the declining sequence, ramp_down, by combining na.locf(test$speed) and seqno. Finally, replace the NAs.
library(data.table)
library(zoo)
test_speed <- test$speed
seqno <- ave(test_speed, rleid(is.na(test_speed)), FUN = seq_along)
ramp_down <- pmax(na.locf(test_speed) - seqno * ramp * Hz, 0)
result <- ifelse(is.na(test_speed), ramp_down, test_speed)
giving:
> result
[1] 6.00 5.70 5.40 5.14 4.89 4.64 4.41 4.19 3.59 2.99 2.39 1.79 1.19 0.59 0.00
[16] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 5.10 5.30 5.40 5.50
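The question also asks for a ramp up from zero before the signal returns. Here is a minimal base R sketch of one way to get both ramps. It assumes the ramp up uses the same rate and should arrive exactly at the first valid reading after the gap, and that the speed stays at zero in between; fill_ramp() is a hypothetical helper, not a function from zoo or data.table.

```r
speed <- c(6, 5.7, 5.4, 5.14, 4.89, 4.64, 4.41, 4.19,
           rep(NA, 15), 5, 5.1, 5.3, 5.4, 5.5)
ramp <- 6      # rate of acceleration for the ramp
Hz <- 1 / 10   # sampling interval of the receiver
step <- ramp * Hz

# Hypothetical helper: fill each NA with the larger of a ramp down from the
# last observed value and a ramp up towards the next observed value, floored at 0.
fill_ramp <- function(x, step) {
  n <- length(x)
  is_na <- is.na(x)
  idx <- seq_len(n)
  # Index of the last observed value at or before each position (0 if none).
  prev_idx <- cummax(ifelse(is_na, 0L, idx))
  # Index of the next observed value at or after each position (n + 1 if none).
  next_idx <- rev(cummin(rev(ifelse(is_na, n + 1L, idx))))
  down <- ifelse(prev_idx >= 1, x[pmax(prev_idx, 1)] - (idx - prev_idx) * step, 0)
  up <- ifelse(next_idx <= n, x[pmin(next_idx, n)] - (next_idx - idx) * step, 0)
  ifelse(is_na, pmax(down, up, 0), x)
}

result <- fill_ramp(speed, step)
print(round(result, 2))
```

Because the ramp up is computed backwards from the next valid reading, the filled values meet the observed data without a jump; leading or trailing gaps with no neighbour on one side simply ramp from or to zero.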

How to reshape a matrix

I have a matrix (d) that looks like:
d <-
as.matrix(read.table(text = "
month Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13
X10 10 7.04 8.07 9.4 8.17 9.39 8.13 9.43 9.06 8.59 9.37 9.79 8.47 8.86
X11 11 12.10 11.50 12.6 13.70 11.90 11.50 13.10 17.20 19.00 14.60 13.70 13.20 16.10
X12 12 24.00 22.00 22.2 20.50 21.60 22.50 23.10 23.30 30.50 34.10 36.10 37.40 28.90
X1 1 18.30 16.30 16.2 14.80 16.60 15.40 15.20 14.80 16.70 14.90 15.00 13.80 15.90
X2 2 16.70 14.40 15.3 14.10 15.50 16.70 15.20 16.10 18.00 26.30 28.00 31.10 34.20",
header=TRUE))
going from Q1 to Q31 (these are the days in each month). What I would like to get is:
month day Q
10 1 7.04
10 2 8.07
and so on for the 31 days and the 12 months.
I have tried using the following code:
reshape(d, direction="long", varying = list(colnames(d)[2:32]), v.names="Q", idvar="month", timevar="day")
but I get the error :
Error in d[, timevar] <- times[1L] : subscript out of bounds
Can anyone tell me what is wrong with the code? I don't really understand the help file on "reshape", it's a bit confusing... Thanks!
Almost there - you're just missing as.data.frame(d) to make your matrix into a data frame. Also you don't need the list in varying - just a vector, so
reshape(as.data.frame(d), varying=colnames(d)[2:32], v.names="Q",
direction="long", idvar="month", timevar="day")
The help file is confusing as heck, not least because (as I've learned) the necessary information almost always actually is in there --- somewhere.
As a prime example, midway through the help file, there is this bit:
The function will
attempt to guess the ‘v.names’ and ‘times’ from these names [i.e. the ones in the 'varying' argument]. The
default is variable names like ‘x.1’, ‘x.2’, where ‘sep = "."’
specifies to split at the dot and drop it from the name. To have
alphabetic followed by numeric times use ‘sep = ""’.
That last sentence is the one you need here: "Q1", "Q2", etc. are indeed "alphabetic followed by numeric", so you need to set the sep = "" argument if reshape() is to know how to split apart those column names.
Try this:
res <- reshape(as.data.frame(d), idvar="month", timevar="day",
varying = -1, direction = "long", sep = "")
head(res[with(res, order(month,day)),])
# month day Q
# 1.1 1 1 18.3
# 1.2 1 2 16.3
# 1.3 1 3 16.2
# 1.4 1 4 14.8
# 1.5 1 5 16.6
# 1.6 1 6 15.4
The help file on reshape is not a bit confusing. It's a LOT confusing. Assuming your matrix has 12 rows (one for each month) and 31 day columns (I'm guessing you have NA values for months with fewer than 31 days), you could easily construct this by hand.
d <- data.frame(month = rep(d[,1], 31), day = rep(1:31, each = 12), Q = as.vector(d[,2:32]))
Now, back to your reshape... I'm guessing it's not parsing the names of your columns correctly. It might work better with Q.1, Q.2, etc. BTW, my reshaping above really depends on what you presented actually being a matrix and not a data.frame.
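To see that the reshape() route and the by-hand route agree, here is a small self-contained sketch. It assumes a cut-down version of the question's matrix (three months, three days) so the output is easy to inspect; long and by_hand are illustrative names.

```r
# Cut-down version of the question's matrix: 3 months, 3 day columns.
d <- as.matrix(read.table(text = "
month Q1 Q2 Q3
X10 10 7.04 8.07 9.4
X11 11 12.10 11.50 12.6
X12 12 24.00 22.00 22.2", header = TRUE))

# Route 1: reshape(), with sep = "" so that "Q1" is split into "Q" and 1.
long <- reshape(as.data.frame(d), idvar = "month", timevar = "day",
                varying = -1, direction = "long", sep = "")
long <- long[order(long$month, long$day), ]

# Route 2: by hand, relying on the column-major layout of a matrix.
n_days <- ncol(d) - 1
by_hand <- data.frame(
  month = rep(unname(d[, "month"]), times = n_days),
  day = rep(seq_len(n_days), each = nrow(d)),
  Q = as.vector(d[, -1])
)
by_hand <- by_hand[order(by_hand$month, by_hand$day), ]

print(long)
```

Using ncol(d) and nrow(d) instead of hard-coded 31 and 12 keeps the by-hand version from silently misaligning if the matrix has a different shape.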
