We are working on a social capital project so our data set has a list of an individual's organizational memberships. So each person gets a numeric ID and then a sub ID for each group they are in. The unit of analysis, therefore, is the group they are in. One of our variables is a three point scale for the type of group it is. Sounds simple enough?
We want to bring the unit of analysis to the individual level and condense the type of group it is into a variable signifying how many different types of groups they are in.
For instance, person one is in eight groups. Of those groups, three are type 1, three are type 2, and two are type 3. Ideally, the individual-level variable for her would be 3, because she is in all three types of groups.
Is this possible at all?
##simulate data
##individuals
n <- 10
## groups
g <- 5
## group types
gt <- 3
## individuals*group membership
N <- 20
## individuals data frame
di <- data.frame(individual=sample(1:n,N,replace=TRUE),
group=sample(1:g,N, replace=TRUE))
## groups data frame
dg <- data.frame(group=1:g, type=sample(1:gt,g,replace=TRUE))
## merge
dm <- merge(di,dg)
## order - not necessary, but nice
dm <- dm[order(dm$individual),]
## group type per individual
library(plyr)
dr <- ddply(dm, "individual", function(x) length(unique(x$type)))
> head(dm)
group individual type
2 2 1 2
8 2 1 2
20 5 1 1
9 3 3 2
12 3 3 2
17 4 3 2
> head(dr)
individual V1
1 1 2
2 3 1
3 4 2
4 5 1
5 6 1
6 7 1
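If you would rather not load plyr, the same count can be sketched in base R with aggregate. The toy data below is an assumption built from the example in the question (person 1 is in eight groups covering all three types; person 2 is a made-up second person), and the names memberships and n_types are only illustrative:

```r
# Toy data: one row per membership (hypothetical, modelled on the question)
memberships <- data.frame(
  individual = c(rep(1, 8), rep(2, 3)),
  type       = c(1, 1, 1, 2, 2, 2, 3, 3, 2, 2, 2)
)

# Count the distinct group types per individual
n_types <- aggregate(type ~ individual, data = memberships,
                     FUN = function(x) length(unique(x)))
names(n_types)[2] <- "n_types"
n_types
```

Person 1 gets 3 because she is in all three types, and person 2 gets 1.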
I think what you're asking is whether it is possible to count the number of unique types of group to which an individual belongs.
If so, then that is certainly possible.
I wouldn't be able to tell you how to do it in R since I don't know a lot of R, and I don't know what your data looks like. But there's no reason why it wouldn't be possible.
Is this data coming from a database? If so, then it might be easier to write a SQL query to compute the value you want, rather than to do it in R. If you describe your schema, there should be lots of people here who could give you the query you need.
I have the following dataframe
OCC1990 Skilllevel
3 1
8 2
12 2
14 3
15 1
As illustrated above it contains a long list of occupations assigned to a specific skill level.
My actual dataframe is a household survey with millions of rows, including a column which is also named OCC1990.
My goal is to implement my assigned skill levels from the above-listed data frame into the household survey.
In the past, for smaller dataframes, I used the following code, which is a pretty manual approach:
cps_data[cps_data$OCC1990 %in% 3,"skilllevel"] <- 1
cps_data[cps_data$OCC1990 %in% 4:7,"skilllevel"] <- 1
cps_data[cps_data$OCC1990 %in% 8,"skilllevel"] <- 2
But I don't want to spend hours copying and pasting, which also increases the chance of mistakes, so I'm looking for a more direct way.
I've already tried to merge both dataframes, but this results in an error related to the size of the vector.
Is there a way other than merging the two dataframes to assign the skill level to the occupations in the survey?
Many thanks in advance
Xx freddy
Use data.table for a large dataset.
Create two vectors, levels and labels: levels contains the unique values of OCC1990, and labels contains the new skill levels you want to apply.
Now use levels and labels inside the factor function to modify the skill level. (I used Skilllevel = 3 for OCC1990 = 8.)
library(data.table)
setDT(df)
levels <- c(3:7,8) # unique values of OCC1990
labels <- c(rep(1,5), 3) # new Skill levels corresponding to OCC1990
setkey(df, OCC1990) # sort OCC1990 for speed before filtering
df[ OCC1990 %in% levels, Skilllevel := as.integer(as.character(factor(OCC1990, levels = levels, labels = labels)))]
head(df)
# OCC1990 Skilllevel
#1: 3 1
#2: 8 3
#3: 12 2
#4: 14 3
#5: 15 1
If you are still facing memory size issues, read the data in chunks (use fread), apply the above operation to each chunk, and append the result to a new file.
Data:
df <- read.table(text='OCC1990 Skilllevel
3 1
8 2
12 2
14 3
15 1 ', header=TRUE)
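If the skill levels already sit in a lookup data frame like the one in the question, a lookup with base R's match may be enough and avoids a full merge. This is only a sketch: lookup copies the table from the question, while the small cps_data here is a made-up stand-in for the real survey.

```r
# Lookup table from the question
lookup <- data.frame(OCC1990    = c(3, 8, 12, 14, 15),
                     Skilllevel = c(1, 2, 2, 3, 1))

# Hypothetical stand-in for the household survey
cps_data <- data.frame(OCC1990 = c(8, 3, 15, 99, 12, 8))

# For each survey row, find the matching lookup row and copy its skill level;
# occupations missing from the lookup (99 here) end up as NA
cps_data$skilllevel <- lookup$Skilllevel[match(cps_data$OCC1990, lookup$OCC1990)]
cps_data
```

match returns one index per survey row, so the result always has as many values as the survey has rows.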
I have two sets of data, which correspond to different experiment tasks that I want to merge for analysis. The problem is that I need to search and match up certain rows for particular stimuli and for particular participants. I'd like to use a script to save some trouble. This is probably quite simple, but I've never done it before.
Here's my problem more specifically:
In the first data set, each row corresponds to a two-alternative forced choice task where two stimuli are presented at a time and the participant selects one. In the second data set, each row corresponds to a single item task where the participants are asked if they have ever seen the stimulus before. The stimuli in the second task match the stimuli in the pairs on the first task (twice as many rows). I want to be able to match up and add two columns to the first dataset--one that states if the leftside item was recognized later and one for the rightside stimulus.
I assume this could be done with nested loops, but I'm not sure if there is an elegant way to do this, or perhaps a package.
As I understand it, your first dataset looks something like this:
(dat1 <- data.frame(person=1:2, stim1=1:2, stim2=3:4))
# person stim1 stim2
# 1 1 1 3
# 2 2 2 4
This would mean person 1 got stimuli 1 and 3 and person 2 got stimuli 2 and 4. Then your second dataset looks something like this:
(dat2 <- data.frame(person=c(1, 1, 2, 2), stim=c(1, 3, 4, 2), responded=c(0, 1, 0, 1)))
# person stim responded
# 1 1 1 0
# 2 1 3 1
# 3 2 4 0
# 4 2 2 1
This gives information about how each person responded to each stimulus they were given.
You can merge these two by matching person/stimulus pairs with the match function:
dat1$response1 <- dat2$responded[match(paste(dat1$person, dat1$stim1), paste(dat2$person, dat2$stim))]
dat1$response2 <- dat2$responded[match(paste(dat1$person, dat1$stim2), paste(dat2$person, dat2$stim))]
dat1
# person stim1 stim2 response1 response2
# 1 1 1 3 0 1
# 2 2 2 4 1 0
Another option (starting from the original dat1 and dat2) would be to merge twice with the merge function. You have a little less control on the names of the output columns, but it requires a bit less typing:
merged <- merge(dat1, dat2, by.x=c("person", "stim1"), by.y=c("person", "stim"))
merged <- merge(merged, dat2, by.x=c("person", "stim2"), by.y=c("person", "stim"))
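If you want the merge approach but with control over the output names, one possible sketch is to rename the columns of dat2 before each merge. The helper names left and right are just illustrative; dat1 and dat2 are the example data from above.

```r
dat1 <- data.frame(person = 1:2, stim1 = 1:2, stim2 = 3:4)
dat2 <- data.frame(person = c(1, 1, 2, 2), stim = c(1, 3, 4, 2),
                   responded = c(0, 1, 0, 1))

# Two renamed copies of dat2, one per side of the pair
left  <- setNames(dat2, c("person", "stim1", "response1"))
right <- setNames(dat2, c("person", "stim2", "response2"))

# merge() joins on the shared column names:
# (person, stim1) first, then (person, stim2)
merged <- merge(merge(dat1, left), right)
merged
```

This gives the same response1 and response2 values as the match version, although the column order differs because merge puts the key columns first.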
Suppose I have the dataset that has the following information:
1) Number (of products bought, for example)
1 2 3
2) Frequency for each number (e.g., how many people purchased that number of products)
2 5 10
Let's say I have the above information for each of the 2 groups: control and test data.
How do I format the data such that it would look like this:
controldata<-c(1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
(each number * frequency listed as a vector)
testdata<- (similar to above)
so that I can perform the two independent sample t-test in R?
If I don't even need to make them a vector / if there's an alternative clever way to format the data to perform the t-test, please let me know!
It would be simple if the vector is small like above, but I can have the frequency>10000 for each number.
P.S.
Control and test data have a different sample size.
Thanks!
Use rep. Using your data above
rep(c(1, 2, 3), c(2, 5, 10))
# [1] 1 1 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Or, for your case
control_data = rep(n_bought, frequency)
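Putting it together, a rough end-to-end sketch could look like the following. The control frequencies are the ones from the question; the test frequencies are invented purely for illustration, so swap in your own.

```r
n_bought     <- c(1, 2, 3)
control_freq <- c(2, 5, 10)   # from the question
test_freq    <- c(4, 6, 3)    # hypothetical values

# Expand each (value, frequency) pair into raw observations
control_data <- rep(n_bought, control_freq)
test_data    <- rep(n_bought, test_freq)

# Two independent sample t-test (Welch's version is the default in t.test)
res <- t.test(control_data, test_data)
res$p.value
```

The two vectors can have different lengths, so differing sample sizes are not a problem here.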
I have a .csv file with several columns, but I am only interested in two of the columns (TIME and USER). The USER column consists of the value markers 1 or 2 in chunks and the TIME column consists of a value in seconds. I want to calculate the difference between the TIME value of the first 2 in a chunk in the USER column and the first 1 in a chunk in the USER column. I want to accomplish this through R. It would be ideal for there to be another column added to my data file with these differences.
So far I have only imported the .csv into R.
Latency <- read.csv("/Users/alinazjoo/Documents/Latency_allgaze.csv")
I'm going to guess your data looks like this
# sample data
set.seed(15)
rr <- sample(1:4, 10, replace=TRUE)
dd<-data.frame(
user=rep(1:5, each=10),
marker=rep(rep(1:2,10), c(rbind(rr, 5-rr))),
time=1:50
)
Then you can calculate the difference using the base functions aggregate and transform:
namin <- function(...) min(..., na.rm=TRUE)
dx<-transform(aggregate(
cbind(m2=ifelse(marker==2,time,NA), m1=ifelse(marker==1, time,NA)) ~ user,
dd, namin, na.action=na.pass),
diff = m2-m1)
dx
# user m2 m1 diff
# 1 1 4 1 3
# 2 2 15 11 4
# 3 3 23 21 2
# 4 4 35 31 4
# 5 5 44 41 3
We use aggregate to find the minimal time for each of the two kinds of markers, then we use transform to calculate the difference between them.
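If the aggregate call is hard to read, the same idea (first time per marker within each user, then subtract) can be sketched with tapply. The tiny dd below is a made-up data set with the same guessed column names as above, and first_time and latency are illustrative names:

```r
# Hypothetical toy data: two users, markers appear in chunks
dd <- data.frame(
  user   = rep(1:2, each = 5),
  marker = c(1, 1, 2, 2, 2,  1, 2, 2, 2, 2),
  time   = 1:10
)

# Matrix of earliest times: one row per user, one column per marker
first_time <- tapply(dd$time, list(dd$user, dd$marker), min)

# First time of marker 2 minus first time of marker 1, per user
latency <- first_time[, "2"] - first_time[, "1"]
latency
```

This still assumes one chunk of each marker per user, just like the guess above.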
I have a data frame that looks like:
id fromuserid touserid from_country to_country length
1 1 54525953 47195889 US US 2
2 2 54525953 54361607 US US 1
3 3 54525953 53571081 US US 2
4 4 41943048 55379244 US US 1
5 5 47185938 53140304 US PR 1
6 6 47185938 54121387 US US 1
7 7 54525974 50928645 GB GB 1
8 8 54525974 53495302 GB GB 1
9 9 51380247 45214216 SG SG 2
10 10 51380247 43972484 SG US 2
Each row describes a number of messages (length) sent from one user to another user.
What I would like to do is create a visualization (via a chord diagram in D3) of the messages sent between each country.
There are almost 200 countries. I use the function dcast as follows:
countries <- dcast(chats,from_country ~ to_country,drop=FALSE,fill=0)
This worked before for me when I had a smaller data set and fewer variables, but this data set is over 3M rows, and not easy to debug, so to speak.
At any rate, what I am getting now is a matrix that is not square, and I can't figure out why not. What I am expecting to get is essentially a matrix where the (i,j)th cell represents the messages sent from country i to country j. What I end up with is something very close to this, but with some rows and columns obviously missing, which is easy to spot because US->US messages show up shifted by one row or column.
So here's my question. Is there anything I'm doing that is obviously wrong? If not, is there something "strange" I should be looking for in the data set to sort this out?
Be sure that your "from_country" and "to_country" variables are factors, and that they share the same levels. Using the example data you shared:
library(reshape2)  # assumption: dcast here comes from reshape2
chats$from_country <- factor(chats$from_country,
levels = unique(c(chats$from_country,
chats$to_country)))
chats$to_country <- factor(chats$to_country,
levels = levels(chats$from_country))
dcast(chats,from_country ~ to_country, drop = FALSE, fill = 0)
# Using length as value column: use value.var to override.
# Aggregation function missing: defaulting to length
# from_country US GB SG PR
# 1 US 5 0 0 1
# 2 GB 0 2 0 0
# 3 SG 1 0 1 0
# 4 PR 0 0 0 0
If your "from_country" and "to_country" variables are already factors, but not with the same levels, you can do something like this for the first step:
chats$from_country <- factor(chats$from_country,
levels = unique(c(levels(chats$from_country),
levels(chats$to_country))))
Why is this necessary? If they are already factors, then c(chats$from_country, chats$to_country) coerces the factors to their underlying integer codes (in R versions before 4.1.0; newer versions return a combined factor instead). Those codes don't match any of the character values of the factors, so the result is <NA>.
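One more thing to watch: the dcast call above defaulted to length as the aggregation function, so the cells count rows rather than add up messages. If you want the total number of messages, a base R sketch with xtabs could look like this. It swaps in xtabs for dcast, and the name all_countries is just illustrative; the data is the ten rows from the question.

```r
chats <- data.frame(
  from_country = c("US", "US", "US", "US", "US", "US", "GB", "GB", "SG", "SG"),
  to_country   = c("US", "US", "US", "US", "PR", "US", "GB", "GB", "SG", "US"),
  length       = c(2, 1, 2, 1, 1, 1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Same trick as above: both columns share one set of factor levels
all_countries <- unique(c(chats$from_country, chats$to_country))
chats$from_country <- factor(chats$from_country, levels = all_countries)
chats$to_country   <- factor(chats$to_country,   levels = all_countries)

# Sum the length column for every (from, to) pair; unused level
# combinations are kept as zeros, so the result is square
msg_matrix <- xtabs(length ~ from_country + to_country, data = chats)
msg_matrix
```

Because both factors share the same levels, every country gets both a row and a column even if it never appears on one side.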