looping through csv file

I'm very new to R and Rstudio. What I'm trying to do is loop through a csv file.
The file has 3 columns. 1) user 2) event (success or fail) 3) randNum
So basically every user starts off with a fail and once they reach a success it moves on to the next user.
Ex:
user: | event: | randNum
user1 | fail | 6
user1 | fail | 4
user1 | fail | 1
user1 | success | 2
user2 | ... |
Basically what I need to do is this: store the first random number (6) and then the last random number (2), which is the one on the row where the user succeeds. How would I do that? I need to do this for each user because I will be doing something with these numbers.

The quickest way would be to use table to get counts:
table(df$user)
Example code:
> df <- data.frame(user=c(rep("john",4),rep("jane",3)), event=c(rep("failed",3), "success", rep("failed",2), "success"))
> df
  user   event
1 john  failed
2 john  failed
3 john  failed
4 john success
5 jane  failed
6 jane  failed
7 jane success
> table(df$user)
jane john
   3    4
EDIT: To address latest edits you made that substantially modified the question:
> df <- data.frame(user=c(rep("john",4),rep("jane",3)), event=c(rep("failed",3), "success", rep("failed",2), "success"), randNum=c(4,6,1,2,9,3,5))
> library(dplyr)
> df <- df %>% group_by(user) %>% mutate(trial = 1:n())
> df[df$trial==1 | df$event=="success",]
Source: local data frame [4 x 4]
Groups: user [2]
    user   event randNum trial
  <fctr>  <fctr>   <dbl> <int>
1   john  failed       4     1
2   john success       2     4
3   jane  failed       9     1
4   jane success       5     3

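If every user eventually succeeds and you only need the first and last row for each user, try the following code:
df <- split(df, df$user)   # one data frame per user
df <- lapply(df, function(x) {
  # keep the first row (first fail) and the last row (the success)
  rbind(head(x, 1), tail(x, 1))
})
df <- do.call("rbind", df)   # stack the pieces back into one data frame
This gives you the first fail row and the success row for each user.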
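If only the two numbers are needed rather than whole rows, a base R sketch along the same lines might look like this. It assumes the rows for each user are in time order and that the success is always the user's last row. The toy data and the names first_num, last_num and result are made up for illustration.

```r
# Toy data shaped like the question's file; in practice it might come from
# read.csv("your_file.csv") (the file name is a placeholder).
df <- data.frame(
  user    = c(rep("john", 4), rep("jane", 3)),
  event   = c(rep("failed", 3), "success", rep("failed", 2), "success"),
  randNum = c(4, 6, 1, 2, 9, 3, 5),
  stringsAsFactors = FALSE
)

# Assumption: rows are in time order and each user's last row is the success.
first_num <- tapply(df$randNum, df$user, function(x) x[1])
last_num  <- tapply(df$randNum, df$user, function(x) x[length(x)])

# One row per user with both numbers side by side.
result <- data.frame(
  user  = names(first_num),
  first = as.vector(first_num),
  last  = as.vector(last_num)
)
print(result)
```

tapply orders its result by user name, so both vectors line up row for row. If a success could appear before a user's last row, selecting on event == "success" would be the safer choice.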

Related

Create a histogram with a non-numeric variable in R [duplicate]

Imagine you have a data frame with 2 variables - Name & Age. Name is of class factor and Age is numeric. Now imagine there are thousands of people in this data frame. How do you:
Produce a table with: NAME | COUNT(NAME) for each unique name?
Produce a histogram where you can change the minimum number of occurrences a name needs in order to show up in the histogram?
For part 2, I want to be able to test different minimum frequency values and see how the histogram comes out. Or is there a better method pragmatically to determine the minimum count for each name to enter the histogram?
Thanks!
Edit: Here is what the table would look like in an RDBMS:
NAME | COUNT(NAME)
John | 10
Bill | 24
Jane | 12
Tony | 50
Emanuel| 1
...
What I want to be able to do is create a function to graph a histogram, where I can change a value that sets the minimum frequency to be graphed. Does that make more sense?

> x <- read.table(textConnection('
+ Name Age Gender Presents Behaviour
+ 1 John 9 male 25 naughty
+ 2 Bill 5 male 20 nice
+ 3 Jane 4 female 30 nice
+ 4 Jane 4 female 20 naughty
+ 5 Tony 4 male 34 naughty'
+ ), header=TRUE)
>
> table(x$Name)
Bill Jane John Tony
   1    2    1    1
> layout(matrix(1:4, ncol = 2))
> plot(table(x$Name), main = "plot method for class \"table\"")
> barplot(table(x$Name), main = "barplot")
> tab <- as.numeric(table(x$Name))
> names(tab) <- names(table(x$Name))
> dotchart(tab, main = "dotchart or dotplot")
> ## or just this
> ## dotchart(table(x$Name))
> ## and ignore the warning
> layout(1)
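For the minimum-frequency part of the question, one possible sketch is to filter the table of counts before plotting it. The helper name names_at_least, its min_count argument and the people vector are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: count each name and keep only the names that occur
# at least `min_count` times.
names_at_least <- function(name_vec, min_count = 1) {
  counts <- table(name_vec)
  counts[counts >= min_count]
}

# Made-up names, matching the Name column of the example above.
people <- c("John", "Bill", "Jane", "Jane", "Tony")

kept <- names_at_least(people, min_count = 2)
print(kept)

# Draw the bar chart only when something passed the threshold.
if (length(kept) > 0) {
  barplot(kept, main = "Names with at least 2 occurrences")
}
```

Calling the helper again with a different min_count is enough to try other thresholds.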

Returning count for each id in table

I'm using sqlite3 with my node.js API.
I have a DB table structured as below:
id | colour
___|_______
1 | blue
1 | red
1 | green
2 | yellow
2 | green
5 | red
I want to return a count of each ID in my table such that
1 - 3 occurrences
2 - 2 occurrences
5 - 1 occurrence
Is there a SQL clause I can use to count like this, or will this need to be done within the JS itself?
Any help here would be awesome!

You can use COUNT with GROUP BY:
SELECT id, COUNT(id) AS occurrences FROM tbl GROUP BY id;

Get frequency counts for a subset of elements in a column

I may be missing some elegant ways in Stata to get to this example, which has to do with electrical parts and observed monthly failures etc.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
I would like to group by (bysort) each PartID and record the highest frequency for FailType within each PartID type. Ties can be broken arbitrarily, and preferably, the lower one can be picked.
I looked at groups etc., but do not know how to peel off certain elements from the result set. So that is a major question for me. If you execute a query, how do you select only the elements you want for the next computation? Something like n(0) is the count, n(1) is the mean etc. I was able to use contract, bysort etc. and create a separate data set, which I then merged back into the main set with an extra column. There must be something simple using gen or egen so that there is no need to create an extra data set.
The expected results here will be:
PartID Freq
ABD 4 #(4 occurs twice)
ABC 2 #(tie broken with minimum)
BBB 0 #(0 occurs 3 times)
Please let me know how I can pick off specific elements that I need from a result set (can be from duplicate reports, tab etc.)
Part II - Clarification: Perhaps I should have clarified and split the question into two parts. For example, suppose I issue this follow-up command after running your code: tabdisp Type, c(Freq). It may print out a nice table. Can I then use that (derived) table to perform more computations programmatically?
For example, get the first row of the table.
Table.
----------------------
     Type |       Freq
----------+-----------
        A |         -1
        B |         -1
        C |         -1
        D |         -3
        S |         -3
----------------------

I found this difficult to follow (see comment on question), but some technique is demonstrated here. The numbers of observations in subsets of observations defined by by: are given by _N. The rest is sorting tricks. Negating the frequency is a way to select the highest frequency and the lowest Type which I think is what you are after when splitting ties. Negating back gets you the positive frequencies.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
bysort PartID FailType: gen Freq = -_N
bysort PartID (Freq Type) : gen ToShow = _n == 1
replace Freq = -Freq
list PartID Type FailType Freq if ToShow
     +---------------------------------+
     | PartID   Type   FailType   Freq |
     |---------------------------------|
  1. |    ABC      A          2      1 |
  3. |    ABD      A          4      2 |
  7. |    BBB      A          0      3 |
     +---------------------------------+

CAPTURE CHANGE AT SOURCE LEVEL

Consider a source table (Oracle 11g) with BATCH ID 1 holding 3 records for day 1. Say these are loaded into a target table. Imagine that on day 2 there are 3 more customer entries with batch ID 2. Can I write a SQL query that will enable the source node to check whether a BATCH_ID already exists in the target and, if not, read and process that BATCH_ID's records through the code?
SRC TBL(say day1)
Batch_no | ID
1 | xx
1 | yy
1 | zz
TGT TBL(EOD Day1)
Batch_no | ID
1 | xx
1 | yy
1 | zz
SRC TBL(Day 2)
Batch_no|ID
1 |xx
1 |yy
1 |zz
2 |aa
2 |bb
2 |cc

This is what I found. Thank you for your help.
SELECT required fields
FROM
"SRC TBL"
LEFT JOIN "TGT TBL"
ON "SRC TBL".BATCH_ID="TGT TBL".BATCH_ID
WHERE "TGT TBL".BATCH_ID IS NULL

dplyr filter function gives wrong data

I have the following dataset: (sample)
Team Job Question Answer
1 1 1 2 1
2 1 1 3a 2
3 1 1 3b 2
4 1 1 4a 1
5 1 1 4b 1
and I have 21 teams, so there are many rows. I am trying to filter the rows of the teams which did well in the experiment (with the dplyr package):
q10best <- filter(quest,Team==c(2,4,6,10,13,17,21))
But it gives me messed up data and with many missing rows.
On the other hand, when I use:
q10best <- filter(quest,Team==2 | Team==4 | Team==6 | Team==10 | Team==13 | Team==17 | Team==21)
It gives me exactly the dataset I want. What is the difference? What am I doing wrong in the first command?
Thanks

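== does not test membership. It compares the two vectors element by element, recycling the shorter one, so Team==c(2,4,6,10,13,17,21) compares row 1 with 2, row 2 with 4, and so on, which is why many rows go missing. What you want is to check whether each element of quest$Team belongs to a set of values. The proper way to do that is to use %in%: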
q10best <- filter(quest,Team %in% c(2,4,6,10,13,17,21))
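To see why the two filters differ, here is a small base R sketch on a made-up vector (team and wanted are illustrative names, not taken from the question's data). It compares what == and %in% return when the right-hand side holds several values.

```r
team   <- c(2, 2, 4, 4, 6, 6)   # made-up Team column
wanted <- c(2, 4, 6)

# `==` lines the two vectors up position by position, recycling `wanted`:
# 2==2, 2==4, 4==6, 4==2, 6==4, 6==6
recycled <- team == wanted

# `%in%` asks, for each element of `team`, whether it appears anywhere in `wanted`.
member <- team %in% wanted

print(recycled)
print(member)
```

Only the first and last positions match under ==, whereas %in% flags every row, which is the behaviour the filter needs.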
