In R, I've created a 3-dimensional table from a dataset. The three variables are all factors and are labelled H, O, and S. This is the code I used to simply create the table:
attach(df)
test <- table(H, O, S)
Outputting the flattened table produces this table below. The two values of S were split up, so these are labelled S1 and S2:
ftable(test)
+-----------+-----------+-----+-----+
| H | O | S1 | S2 |
+-----------+-----------+-----+-----+
| Isolation | Dead | 2 | 15 |
| | Sick | 64 | 20 |
| | Recovered | 153 | 379 |
| ICU | Dead | 0 | 15 |
| | Sick | 0 | 2 |
| | Recovered | 1 | 9 |
| Other | Dead | 7 | 133 |
| | Sick | 4 | 20 |
| | Recovered | 17 | 261 |
+-----------+-----------+-----+-----+
The goal is to use this table object, subset it, and produce a second table. Essentially, I want only "Isolation" and "ICU" from H, "Sick" and "Recovered" from O, and only S1, so it basically becomes the 2-dimensional table below:
+-----------+------+-----------+
| | Sick | Recovered |
+-----------+------+-----------+
| Isolation | 64 | 153 |
| ICU | 0 | 1 |
+-----------+------+-----------+
S = S1
I know I could first subset the dataframe and then create the new table, but the goal is to subset the table object itself. I'm not sure how to retrieve certain values from each dimension and produce the reduced table.
Edit: ANSWER
I now found a much simpler method. All I needed to do was reference the specific columns in their respective directions. So a much simpler solution is below:
> test[1:2,2:3,1]
O
H Sick Healed
Isolation 64 153
ICU 0 1
Subset the data before running table, example:
ftable(table(mtcars[, c("cyl", "gear", "vs")]))
# vs 0 1
# cyl gear
# 4 3 0 1
# 4 0 8
# 5 1 1
# 6 3 0 2
# 4 2 2
# 5 1 0
# 8 3 12 0
# 4 0 0
# 5 2 0
# subset then run table
ftable(table(mtcars[ mtcars$gear == 4, c("cyl", "gear", "vs")]))
# vs 0 1
# cyl gear
# 4 4 0 8
# 6 4 2 2
I am trying to count the number of not null rows of all column in a txt file. I am able to read not null rows in each column individually but I am trying to loop them all together. awk - F "|" '$1!=""{N++} print N'
Here is a look at my data
A | B | C | D | E
1 | 2 | 0 | 8 |
5 | 3 | 6 | | 4
| | 8 | |
| 7 | 8 | |
8 | 9 | 2 | | 4
I want the result to be like :
Column A: 3
Column B: 4
Column C: 5
Column D: 1
Column E: 2
Your attempt is not working. Please remove the space between - F and call print N at the end using END :
awk -F "|" '$1!=""{N++} END {print N}' input.txt
This command will also count lines with some text missing a |.
An alternative would be
grep -cE "[^|]+\|" input.txt
If you want to check all columns of all lines, instead of a particular column:
awk -F'|' '{ for (i = 1; i <= NF; i++) if ($i != "") n++ } END { print n }' input.txt
For each line, loop over every |-delimited field in that line, incrementing a counter if it's not empty. Finally print the count at the end.
The Modeling and Solving Linear Programming with R book has a nice example on planning shifts in Sec 3.7. I am unable to solve it with R. Also, I am not clear with the solution provided in the book.
Problem
A company has a emergency center which is working 24 hours a day. In
the table below, is detailed the minimal needs of employees for each of the
six shifts of four hours in which the day is divided.
Shift Employees
00:00 - 04:00 5
04:00 - 08:00 7
08:00 - 12:00 18
12:00 - 16:00 12
16:00 - 20:00 15
20:00 - 00:00 10
R solution
I used the following to solve the above.
library(lpSolve)
obj.fun <- c(1,1,1,1,1,1)
constr <- c(1,1,0,0,0,0,
0,1,1,0,0,0,
0,0,1,1,0,0,
0,0,0,1,1,0,
0,0,0,0,1,1,
1,0,0,0,0,1)
constr.dir <- rep(">=",6)
constr.val <-c (12,25,30,27,25,15)
day.shift <- lp("min",obj.fun,constr,constr.dir,constr.val,compute.sens = TRUE)
And, I get the following result.
> day.shift$objval
[1] 1.666667
> day.shift$solution
[1] 0.000000 1.666667 0.000000 0.000000 0.000000 0.000000
This is nowhere close to the numerical solution mentioned in the book.
Numerical solution
The total number of solutions required as per the numerical solution is 38. However, since the problem stated that, there is a defined minimum number of employees in every period, how can this solution be valid?
s1 5
s2 6
s3 12
s4 0
s5 15
s6 0
Your mistake is at the point where you initialize the variable constr, because you don't define it as a matrix. Second fault is your matrix itself. Just look at my example.
I was wondering why you didn't stick to the example in the book because I wanted to check my solution. Mine is based on that.
library(lpSolve)
obj.fun <- c(1,1,1,1,1,1)
constr <- matrix(c(1,0,0,0,0,1,
1,1,0,0,0,0,
0,1,1,0,0,0,
0,0,1,1,0,0,
0,0,0,1,1,0,
0,0,0,0,1,1), ncol = 6, byrow = TRUE)
constr.dir <- rep(">=",6)
constr.val <-c (5,7,18,12,15,10)
day.shift <- lp("min",obj.fun,constr,constr.dir,constr.val,compute.sens = TRUE)
day.shift$objval
# [1] 38
day.shift$solution
# [1] 5 11 7 5 10 0
EDIT based on your question in the comments:
This is the distribution of the shifts on the periods:
shift | 0-4 | 4-8 | 8-12 | 12-16 | 16-20 | 20-24
---------------------------------------------------
20-4 | 5 | 5 | | | |
0-8 | | 11 | 11 | | |
4-12 | | | 7 | 7 | |
8-16 | | | | 5 | 5 |
12-20 | | | | | 10 | 10
18-24 | | | | | |
----------------------------------------------------
sum | 5 | 16 | 18 | 12 | 15 | 10
----------------------------------------------------
need | 5 | 7 | 18 | 12 | 15 | 10
---------------------------------------------------
i have a text corpus and already sorted it by frequency:
tr ' ' '\n' < corpus.txt | sort | uniq -c | sort -nr
Now i want to count up all lines that start with the same number.
For example:
100 the
50 in
50 and
10 cat
10 dog
should return:
100 1
50 2
10 2
Is there a way to do it?
Thanks!
Easy with awk:
$ awk '{count[$1]++} END {for (i in count) print i, count[i]}' file
100 1
10 2
50 2
Just tweak your already written command:-
cut -d' ' -f1 corpus.txt| sort -rn | uniq -c
Required output is:-
1 100
2 50
2 10
I have a large dataframe which has observations from surveys from multiple states for several years. Here's the data structure:
state | survey.year | time1 | obs1 | time2 | obs2
CA | 2000 | 1 | 23 | 1.2 | 43
CA | 2001 | 2 | 43 | 1.4 | 52
CA | 2002 | 5 | 53 | 3.2 | 61
...
CA | 1998 | 3 | 12 | 2.3 | 20
CA | 1999 | 4 | 14 | 2.8 | 25
CA | 2003 | 5 | 19 | 4.3 | 29
...
ND | 2000 | 2 | 223 | 3.2 | 239
ND | 2001 | 4 | 233 | 4.2 | 321
ND | 2003 | 7 | 256 | 7.9 | 387
For each state/survey.year combination, I would like to interpolate obs2 so that it's time-location is lined up with (time1,obs1).
ie I would like to break up the dataframe into state/survey.year chunks, perform my linear interpolation, and then stitch the individual state/survey.year dataframes back together into a master dataframe.
I have been trying to figure out how to use the plyr and Hmisc packages for this. But keeping getting myself in a tangle.
Here's the code that I wrote to do the interpolation:
require(Hmisc)
df <- new.obs2 <- NULL
for (i in 1:(0.5*(ncol(indirect)-1))){
df[,"new.obs2"] <- approxExtrap(df[,"time1"],
df[,"obs1"],
xout = df[,"obs2"],
method="linear",
rule=2)
}
But I am not sure how to unleash plyr on this problem. Your generous advice and suggestions would be much appreciated. Essentially - I am just trying to interpolate "obs2", within each state/survey.year combination, so it's time references line up with those of "obs1".
Of course if there's a slick way to do this without invoking plyr functions, then I'd be open to that...
Thank you!
This should be as simple as,
ddply(df,.(state,survey.year),transform,
new.obs2 = approxExtrap(time1,obs1,xout = obs2,
method = "linear",
rule = 2))
But I can't promise you anything, since I haven't the foggiest idea what the point of your for loop is. (It's overwriting df[,"new.obs2"] each time through the loop? You initialize the entire data frame df to NULL? What's indirect?)