I am working on data collection with R on Windows 7.
This is related to my previous question:
Data grouping and sub-grouping by column variable in R
I have this data frame.
var1 var2 value
1 56 649578
1 56 427352
1 88 354623
1 88 572397
2 17 357835
2 17 498455
2 90 357289
2 90 678658
I need to write them to a CSV file as:
649578 354623 357835 357289
427352 572397 498455 678658
Do I need to use a dictionary or hash set in R?
Here's your data, again, just for reproducibility:
mydf <- read.table(text='var1 var2 value
1 56 649578
1 56 427352
1 88 354623
1 88 572397
2 17 357835
2 17 498455
2 90 357289
2 90 678658', header=TRUE)
Take a look at the documentation for write.table.
You say you want a CSV, which would look like the following:
write.csv(matrix(mydf$value, nrow=2), 'test.csv')
Produces "test.csv":
"","V1","V2","V3","V4"
"1",649578,354623,357835,357289
"2",427352,572397,498455,678658
Or, I think you probably want:
write.table(matrix(mydf$value, nrow=2), 'test.tsv', sep='\t')
Produces "test.tsv":
"V1" "V2" "V3" "V4"
"1" 649578 354623 357835 357289
"2" 427352 572397 498455 678658
I have column names like alpha[1], alpha[2], ..., alpha[9] (shown in a plot I cannot reproduce here).
Can I select all the alphas at once instead of typing alpha[1], alpha[2], ..., alpha[9]?
How can I adjust the following code so that R returns results for all the alphas?
t_alpha <- mcmc_trace(mcmc,pars="alpha")
Something like this perhaps?
library(dplyr)
library(magrittr)
df %>% select(matches("^alpha"))
# alpha.1. alpha.10.
# 1 55 43
# 2 97 20
# 3 80 84
# 4 24 60
# 5 27 21
# 6 98 70
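A base-R equivalent needs no extra packages: grepl() on the column names gives a logical selector. (If mcmc_trace() here is the bayesplot function, it also accepts a regex_pars argument, e.g. mcmc_trace(mcmc, regex_pars = "^alpha"), so the selection can happen inside the call itself; that is an assumption about which package is in use.) A sketch with made-up data:

```r
# Hypothetical frame with bracketed alpha names plus an unrelated column
df <- data.frame(55:60, 43:48, 1:6)
names(df) <- c("alpha[1]", "alpha[10]", "beta")
# keep columns whose name starts with "alpha"
alpha_cols <- df[grepl("^alpha", names(df))]
names(alpha_cols)
# [1] "alpha[1]"  "alpha[10]"
```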
I'm relatively new to R and still learning. I have the following data frame, data:
ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016
I am looking to count the number of people (in this case only two unique individuals) who passed their tests after multiple attempts (passing is defined as a grade of 65 or over). The final product would be a list of unique IDs that had multiple attempts before their test score hit 65. This would tell me that approximately 66% of the clients in this data frame require multiple test sessions before getting a passing grade.
Below is my idea or concept, more or less; I've framed it as an if statement:
If ID appears twice
count how often it appears, until TEST GRADE >= 65
ifelse(duplicated(data$ID), count(ID), NA)
I'm struggling with the second piece where I want to say, count the occurrence of ID until grade >=65.
The other option I see is some sort of loop. Below is my attempt
for (i in data$ID) {
duplicated(data$ID)
count(data$ID)
Here is where something would say until =65
}
Again the struggle comes in how to tell R to stop counting when grade hits 65.
Appreciate the help!
You can use data.table:
library(data.table)
dt <- fread(" ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")
# count the number of tries per ID, then keep only the successful attempts
dt <- dt[, N:=.N, by=ID][grade>=65]
# proportion of successful testers who tried more than once
length(dt[N>1]$ID)/length(dt$ID)
[1] 0.6666667
Another option, though the other two work just fine:
library(dplyr)
dat2 <- dat %>%
group_by(ID) %>%
summarize(
multiattempts = n() > 1 & any(grade < 65),
maxgrade = max(grade)
)
dat2
# Source: local data frame [3 x 3]
# ID multiattempts maxgrade
# <int> <lgl> <int>
# 1 1 TRUE 73
# 2 2 TRUE 76
# 3 3 FALSE 66
sum(dat2$multiattempts) / nrow(dat2)
# [1] 0.6666667
Here is a method using the aggregate function and subsetting; it returns the maximum score for testers who took the test more than once, looking at their second test onward.
multiTestMax <- aggregate(grade~ID, data=df[duplicated(df$ID),], FUN=max)
multiTestMax
ID grade
1 1 73
2 2 76
To get the number of rows, you can use nrow:
nrow(multiTestMax)
2
or the proportion of all test takers
nrow(multiTestMax) / length(unique(df$ID))
data
df <- read.table(header=T, text="ID grade Test_Date
1 56 01-25-2012
1 63 02-21-2016
1 73 02-31-2016
2 41 12-23-2015
2 76 01-07-2016
3 66 02-08-2016")
I have 4 data frames with data from different experiments, where each row represents a trial. The participant's id (SID) is stored as a factor. Each one of the data frames look like this:
Experiment 1:
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059
I want to make a new data frame with the id's of the participants in each of the experiments, for example:
Exp1 Exp2 Exp3 Exp4
5402 22081 22160 25434
25403 22069 22179 25439
25485 22115 22141 25408
25457 22120 22185 25445
28041 22448 22239 25473
29514 22492 22291 25489
I want each column to be ordered as numbers, that is, 2 comes before 10.
I used unique() to extract the participant IDs (SID) in each data frame, but I am having problems ordering the columns.
I tried using:
data.frame(order(unique(df1$SID)),
order(unique(df2$SID)),
order(unique(df3$SID)),
order(unique(df4$SID)))
and I get (without the column names):
38 60 16 32 15
2 9 41 14 41
3 33 5 30 62
4 51 11 18 33
I'm sorry if I am missing something very basic, I am still very new to R.
Thank you for any help!
Edit:
I tried the solutions in the comments, and now I have:
x<-cbind(sort(as.numeric(unique(df1$SID)),decreasing = F),
sort(as.numeric(unique(df2$SID)),decreasing = F),
sort(as.numeric(unique(df3$SID)),decreasing = F),
sort(as.numeric(unique(df4$SID)),decreasing = F) )
Still does not work... I get:
V1 V2 V3 V4
1 8 6 5 2
2 9 35 11 3
3 10 37 17 184
4 13 38 91 185
5 15 39 103 186
The subject IDs are 3- to 5-digit numbers...
If your data looks like this:
df <- read.table(text="
SID trial measure
5402 1 0.6403791
5402 2 -1.8515095
5402 3 -4.8158912
25403 1 NA
25403 2 -3.9424822
25403 3 -2.2100059",
header=TRUE, colClasses = c("factor","integer","numeric"))
I would do something like this:
df <- df[order(as.numeric(as.character(df$SID)), df$trial),] # sort df on SID (numeric) & trial
split(df$SID, df$trial) # breaks the vector SID into a list of vectors of SID for each trial
If you were worried about unique values you could do:
lapply(split(df$SID, df$trial), unique) # breaks SID into list of unique SIDs for each trial
That will give you a list of participant IDs for each trial, sorted by numeric value but maintaining their factor property.
If you really wanted a data frame, and the number of participants in each experiment were equal, you could use data.frame() on the list, as in: data.frame(split(df$SID, df$trial))
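When the experiments have different numbers of participants, the vectors can still be padded to a common length with NA before being combined into a data frame. A sketch with made-up IDs (the factor SIDs are converted via as.character first, as above):

```r
ids <- list(Exp1 = factor(c(25403, 5402, 25485)),
            Exp2 = factor(c(22115, 22081, 22069, 22120)))
# sort numerically (so 5402 comes before 25403), not alphabetically
ids <- lapply(ids, function(x) sort(as.numeric(as.character(x))))
# pad every vector with NA up to the longest one
n <- max(lengths(ids))
padded <- lapply(ids, function(x) c(x, rep(NA, n - length(x))))
res <- data.frame(padded)
res
```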
Suppose x and y represent the Exp1 SID and Exp2 SID. You can create an ordered list of unique values as shown below:
x<-factor(x = c(2,5,4,3,6,1,4,5,6,3,2,3))
y<-factor(x = c(2,3,4,2,4,1,4,5,5,3,2,3))
list(exp1=sort(x = unique(x),decreasing = F),y=sort(x = unique(y),decreasing = F))
I have a large data.frame with 12 columns and a lot of rows, but let's simplify:
Id A1 A2 B1 B2 Result
1 55 23 62 12 1
2 23 55 12 62 1 * (dup of Id 1)
3 23 6 2 62 1
4 23 55 62 12 1 * (dup of Id 1)
5 21 62 55 23 0 * (dup of Id 1)
6 ...
Now the ordering of the A's (A1, A2) and B's (B1, B2) does not matter: if two rows have the same pair of values, e.g. (55, 23) and (62, 12), they are duplicates regardless of the order within each pair.
Furthermore, if row x's A pair equals row y's B pair, row x's B pair equals row y's A pair, and Result of row x = 1 - Result of row y, we also have a duplicate.
How does one go about cleaning this frame of duplicates?
For the first one I would create a new variable doing something like this:
tc= 'Id A1 A2 B1 B2 Result
1 55 23 62 12 1
2 23 55 12 62 1
3 23 6 2 62 1
4 23 55 62 12 1
5 21 62 55 23 0'
df =read.table(textConnection(tc),header=T)
df$tmp = paste(apply(df[,2:3],1,min),apply(df[,2:3],1,max),sep='-') # separator avoids collisions such as (5,523) vs (55,23)
subset(df, !duplicated(tmp))
For the second part, your notation is quite confusing, but maybe you can follow a similar procedure.
How about this:
tc= 'Id A1 A2 B1 B2 Result
1 55 23 62 12 1
2 23 55 12 62 1
3 213 6 2 62 1
4 23 55 62 12 1
5 21 62 55 23 0'
x <- read.table(textConnection(tc),header=T)
a1b1 <- transform(x, combi="a1b1", a=A1, b=B1)
a1b2 <- transform(x, combi="a1b2", a=A1, b=B2)
a2b1 <- transform(x, combi="a2b1", a=A2, b=B1)
a2b2 <- transform(x, combi="a2b2", a=A2, b=B2)
x_long <- rbind(a1b1,a1b2,a2b1,a2b2)
idx <- duplicated(x_long[,c("a", "b")])
dup_ids <- unique(x_long[idx, "Id"])
unique_ids <- setdiff(x_long$Id, dup_ids)
x[x$Id %in% unique_ids,] # select by Id value, not by row position
Regarding the Result part, it is not clear to me what you mean.
Check out the allelematch package. While this package is primarily intended for finding matching rows in a data.frame consisting of allelic genotype data, it will work on data of any source.
It may be of particular interest to you as you are working with a case where you need to move beyond the perfect matching functionality provided by duplicated(). allelematch handles missing data, and mismatching data (i.e. where not all elements of two row vectors match or are present). It returns candidate matches by identifying rows of the data frame that are most similar.
This may be more functionality than you need; it sounds as if your columns have been permuted in some consistent way (though it is not exactly clear from your post). However, if identifying the consistent permutation is itself a challenge, then this empirical approach might help.
I ended up using Excel VBA programming to solve the problem
This was the procedure:
Internally sort each A and each B for all of the rows
Then, for rows with Result = 0, swap the A and B pairs and change Result to 1
Remove duplicates
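For completeness, those same three steps can be sketched in base R (the column names are assumed to match the example frame; this is an illustration, not the VBA code):

```r
df <- read.table(header = TRUE, text = 'Id A1 A2 B1 B2 Result
1 55 23 62 12 1
2 23 55 12 62 1
3 23 6 2 62 1
4 23 55 62 12 1
5 21 62 55 23 0')

# 1. sort each A pair and each B pair within its row
df[, c("A1", "A2")] <- t(apply(df[, c("A1", "A2")], 1, sort))
df[, c("B1", "B2")] <- t(apply(df[, c("B1", "B2")], 1, sort))

# 2. for Result == 0 rows, swap the A and B pairs and flip Result to 1
flip <- df$Result == 0
tmp <- df[flip, c("A1", "A2")]
df[flip, c("A1", "A2")] <- df[flip, c("B1", "B2")]
df[flip, c("B1", "B2")] <- tmp
df$Result[flip] <- 1

# 3. remove duplicates on the canonical columns
dedup <- df[!duplicated(df[, c("A1", "A2", "B1", "B2", "Result")]), ]
dedup
```

Note that with the sample values exactly as typed, row 5's A pair sorts to (21, 62), which does not match row 1's B pair (12, 62), so row 5 survives the deduplication even though the question labels it a duplicate of Id 1; presumably 21 is a typo for 12 in the example.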
I have data that looks like:
Row 1 Row 2 Row 3 Row 4 Row 5 Row 6 Row7
abc89 abc62 67 abc513 abc512 abc81 abc10
abc6 pop abc11 abc4 giant 13 abc15
abc90 abc16 abc123 abc33 abc22 abc08 9
111 abc15 abc72 abc36 abc57 abc9 abc55
I would like to calculate the percentage of cells in the data frame that begin with "abc". For example, there are 28 total cells here, which can be gotten with prod(dim(df)). So I need the number of cells that start with "abc", divided by prod(dim(df)). Here the answer would be 22/28, about 0.786. How can this be done in R?
I would use:
> mean(grepl("^abc",unlist(dat)))
[1] 0.7857143
Using mean means you don't have to compute the numerator and denominator separately. grepl is the logical version of grep: it returns TRUE whenever "^abc" (i.e., a string beginning with abc) is found. Recall that the mean of a logical (Bernoulli) vector is the proportion of TRUEs.
If you wanted to do this by row or by column you'd use apply, e.g. apply(dat,1,function(x)mean(grepl("^abc",x))) to get the row-wise means.
You can use grep to search for the pattern of interest (a string starting with "abc"):
length(grep("^abc", as.character(unlist(dat)))) / prod(dim(dat))
# [1] 0.7857143
You can get row counts with:
(row.counts <- apply(dat, 1, function(x) length(grep("^abc", as.character(x)))))
# [1] 6 4 6 6
Data:
dat = read.table(text="Row1 Row2 Row3 Row4 Row5 Row6 Row7
abc89 abc62 67 abc513 abc512 abc81 abc10
abc6 pop abc11 abc4 giant 13 abc15
abc90 abc16 abc123 abc33 abc22 abc08 9
111 abc15 abc72 abc36 abc57 abc9 abc55", header=TRUE)