Returning Multiple Counts of Rows - count

Hard to capture the problem in a single sentence, but it's a simple one. The table looks like this:
col_a col_b
v1 c
v2 c
v3 c
v5 d
v2 d
v1 a
v5 a
v7 a
v4 a
Each row is distinct, and I need a count of the rows for each unique value of col_b. So, my result set for this table should be:
c 3
d 2
a 4
I have no idea how to think this through and do it.

I'm assuming that you're asking how to do this in SQL.
Use GROUP BY with an aggregate function, like this:
SELECT col_b, COUNT(*) FROM MyTable GROUP BY col_b

Well, in SQL you would do:
SELECT col_b, COUNT(col_a)
FROM SomeTable
GROUP BY col_b
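Both answers assume the table lives in a database. If it is instead a plain R data frame (an assumption; the related questions below are all R questions), the same per-group count can be sketched with table():

```r
# sample data, reconstructed from the question
df <- data.frame(col_a = c("v1","v2","v3","v5","v2","v1","v5","v7","v4"),
                 col_b = c("c","c","c","d","d","a","a","a","a"))

# table() counts rows per unique value of col_b
as.data.frame(table(col_b = df$col_b))
#   col_b Freq
# 1     a    4
# 2     c    3
# 3     d    2
```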

Related

How would I capture a specific amount of numerical characters within a function in R?

I'm sure the wording of my question could be better, but this is the scenario I'm dealing with.
My current data looks like this:
v1 v2 v3 v4
1 abc def 1 1
2 abc def 1 1
3 1990 def 0 1
v3 and v4 return 1 when v1 or v2 is 'abc' or 'def'. I have numerous instances in my dataset where there are years listed in the typical 4 digit context (ex: 1960, 1990, 2000). How can I include these in my code to return a '1' for v3 and v4 regardless of the date?
This is my current code:
df$v3 <- as.integer(grepl("(^abc$|^def$)", df$v1))
df$v4 <- as.integer(grepl("(^abc$|^def$)", df$v2))
Just to make sure I'm interpreting your desired output correctly, you want v3 to be 1 if v1 is 'abc' or 'def' or a 4-digit year, and you want v4 to be 1 if v2 is 'abc' or 'def' or a 4-digit year, correct?
If so, then instead of having your regex just look for the exact strings 'abc' or 'def', you can have it look for a 4-digit number as well.
df$v3 <- as.integer(grepl("(^abc$|^def$|^[[:digit:]]{4}$)", df$v1))
df$v4 <- as.integer(grepl("(^abc$|^def$|^[[:digit:]]{4}$)", df$v2))
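As a quick sanity check of the extended pattern on some made-up values (the anchors ensure exactly four digits match, so a five-digit string does not):

```r
grepl("^abc$|^def$|^[[:digit:]]{4}$", c("abc", "1990", "19901", "other"))
# [1]  TRUE  TRUE FALSE FALSE
```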

filter a dataset under specific conditions in R

I have a dataframe structured like this:
V1  V2  V3  V4  V5  V6  V7
A.  B.  C.  D.  E.
C   C.  D.  K.
A.  B.  C.  D.  E.  F.  G.
where there are empty cells.
I want to filter the data frame according to this condition:
For every row of the dataframe, if there are at least two non-blank values in the columns V4, V5, V6, V7, keep the row. Otherwise, delete it.
V1  V2  V3  V4  V5  V6  V7
A.  B.  C.  D.  E.
A.  B.  C.  D.  E.  F.  G.
How could I do this?
Using rowSums
df[rowSums((df!='')[,c('V4','V5','V6','V7')])>=2,]
V1 V2 V3 V4 V5 V6 V7
1 A. B. C. D. E.
3 A. B. C. D. E. F. G.
You can subset the data frame, first setting an index that sums a logical operator for each column, as in the example below:
df <- data.frame(V4 = c('A', '', 'C'),
                 V5 = c('A', '', 'C'),
                 V6 = c('A', 'B', ''))
  V4 V5 V6
1  A  A  A
2        B
3  C  C
df <- df[(df$V4 != '') + (df$V5 != '') + (df$V6 != '') >= 2, ]
Output
  V4 V5 V6
1  A  A  A
3  C  C
When you sum the logical comparisons (each testing whether a value differs from the empty string ''), the result is the number of TRUE values in that row. In your example you want the condition that at least 2 of the columns satisfy != ''.
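The sum-of-logicals step can be seen in isolation on a single made-up row:

```r
x <- c("D.", "E.", "", "")   # hypothetical V4:V7 values for one row

x != ""        # TRUE TRUE FALSE FALSE
sum(x != "")   # 2, so this row meets the >= 2 condition and is kept
```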

Column is being returned with negative sign at the end of field in R

I'm trying to reproduce this SQL Server code in RStudio...
CASE
WHEN COLUMN1 LIKE '%-%'
THEN CAST(REPLACE(COLUMN1, '-', '') AS NUMERIC) * -1
ELSE COLUMN1
END VALUE
I'm using data.table, because my file is 2 GB. I've tried this:
MYDATATABLE[, newfield:=ifelse((COLUMN1 %like% '-'), ***replace '-' with nothing and multiply by -1***, COLUMN1)]
The column is being returned with the negative sign at the end of the value, like this:
COLUMN1
--------
55.400-
60.440-
61.280-
136.400-
506.333-
If I understood your question correctly, you can try a two-step process.
Instead of the slow-performing ifelse, first create a new column new_col for the rows where col1 contains '-', assigning the negative of col1's numeric value after removing the '-'. Then replace the NA values in new_col with the numeric value of col1.
library(data.table)
DT[grepl("-", col1),
new_col := - as.numeric(gsub("-", "", col1))][is.na(new_col), new_col := as.numeric(col1)]
which gives
> DT
col1 id new_col
1: -123 1 -123
2: 1233- 2 -1233
3: 45 3 45
Sample data:
DT <- data.table(col1 = c("-123","1233-", "45"),
id = c(1,2,3))
# col1 id
#1: -123 1
#2: 1233- 2
#3: 45 3
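A single-assignment alternative (a sketch, not the only way): strip the '-' wherever it appears, convert to numeric, then flip the sign for values that contained one. Uses the same sample DT:

```r
library(data.table)
DT <- data.table(col1 = c("-123", "1233-", "45"), id = c(1, 2, 3))

# remove any '-', convert, then negate the rows that originally had a '-'
DT[, new_col := as.numeric(gsub("-", "", col1)) *
                ifelse(grepl("-", col1), -1, 1)]
DT$new_col
# [1]  -123 -1233    45
```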

variable corresponding to a rowname in r

Now I have a df that looks like the one below:
v1 v2 v3
1 2 3
4 5 6
What should I do with rownames so that, if the v2 of a row where rownames(df) %% 2 == 0 does not equal the v2 of the row where rownames(df) %% 2 == 1, both rows are deleted?
Thank you all.
Update:
For the df below, you can see that rows 1 and 2 have the same ID, so I want to keep these two rows as a pair (CODE shows 1 and 4).
Similarly I want to keep row 10 and 11 because they have the same ID and they are a pair.
What should I do to get a new df?
1) Create a dataframe with a column for the number of times each id occurs:
library(sqldf)
df2 = sqldf("select count(id) as count, id from df group by id")
2) Merge them:
df3 = merge(df, df2)
3) Keep only the rows where the count is greater than 1:
df3[df3$count > 1, ]
If what you are looking for is to keep paired IDs and delete the rest (I doubt it is as simple as this), then:
Extract your ids. I have written them out here; you should extract them from your df.
id = c(263733,263733,2913733,3243733,3723733,4493733,273733,393733,2953733,3583733,3583733)
sort them
Find out which ones to keep.
id = sort(id)
id1 = cbind(id[-length(id)], id[-1])
chosenID = id1[which(id1[,1] == id1[,2]), 1]
And then extract from your df those rows that have chosenID.
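The same keep-paired-IDs idea can be sketched more directly with duplicated(), which avoids the sort-and-shift step (the column name id is assumed):

```r
id <- c(263733, 263733, 2913733, 3243733, 3723733, 4493733,
        273733, 393733, 2953733, 3583733, 3583733)
df <- data.frame(id = id)

# an id is "paired" if it occurs more than once anywhere in the vector
df[id %in% id[duplicated(id)], , drop = FALSE]
# keeps the rows with id 263733 and 3583733
```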

How do I find values similar but not identical to others?

I'm trying to bind values from one table onto another table whose index is similar but not identical. My tables look like this:
V1 V2
1 1.2352
2 3.2345
3 2.2132
4 3.3344
The other table looks like this
V1 V2
1A 1.9494
1B 1.5092
1C 1.3242
2A 1.3833
2B 2.5223
etc...
I'm trying to get a table like this
V1 V2 V3 (value from table 1)
1A 1.9494 1.2352
1B 1.5092 1.2352
1C 1.3242 1.2352
2A 1.3833 3.2345
2B 2.5223 3.2345
Then, I have to iterate through a bunch of tables that go up to 1A1A1,
so in the end it'll look like this:
V1 V2 V3 V4 V5
1A1A1 1A1A 1A1 1A 1
1A1A2 1A1A 1A1 1A 1
1A1B1 1A1B 1A1 1A 1
etc....
Any thoughts?
Thanks!
This will get you to the third table above--the one you say you're trying to get to. Even with your clarifying comment I don't understand what the last table is.
# create data frames (data.frame keeps V2 numeric, unlike cbind)
df1 <- data.frame(V1 = 1:4,
                  V2 = rnorm(4, mean = 2))
df2 <- data.frame(V1 = c('1A', '1B', '1C', '2A', '2B'),
                  V2 = rnorm(5, mean = 2))
# create a column in the second data frame to match on
df2$V3 <- substr(df2$V1, start = 1, stop = 1)
# merge the two data frames by the key they have in common
merge(df2, df1, by.x = 'V3', by.y = 'V1')
You can drop columns or reorganize them however you'd like at this point.
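If the last table is meant to hold successively shorter prefixes of each code (a guess, since the question leaves that unclear), it could be sketched with substr():

```r
codes <- c("1A1A1", "1A1A2", "1A1B1")

# each column after the first drops one more character from the end
prefixes <- sapply((nchar(codes[1]) - 1):1, function(n) substr(codes, 1, n))
cbind(V1 = codes, prefixes)
# row 1 becomes: "1A1A1" "1A1A" "1A1" "1A" "1"
```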
