Filling in missing data and transposing

Chong Ho (Alex) Yu, Ph.D.

Problem

The raw data are structured in the way that the same subject occupies multiple rows, as shown in the following table. This is called the tall structure, as opposed to the wide structure, in which each user takes a row and each item takes a column. In this example, there are five items and Question 5 is skipped by all examinees while the other items are skipped by only some of the examinees.
 
User ID
Item ID
Score

123

Q1

4

123

Q2

2

123

Q3

3

123

Q4

4

124

Q1

3

124

Q2

4

124

Q4

4

125

Q2

2

125

Q3

1

125

Q4

4

The data should be transposed into the following. What should be done?
 

User ID
Q1
Q2
Q3
Q4
Q5

123

4

2

3

4

.

124

3

4

.

4

.

125

.

2

1

4

.


Solution

It is easy to manually convert the data when there are only five items and three subjects. However, this process will be extremely tedious when there are 50 items and 3,000 examinees. Before transposing, the missing data must be filled in. The following SAS program shows you how to fill in the missing and then transpose the data.

/* Create a file to retain unqiue IDs only */

data idfile; set rawdata;

    proc sort nodupkey; by userid; run;

/* Create a file to make each subject carries 5 rows (items) and assign a faked ID to each row. The faked ID is just the ascending order (1-5) */

data idonly; set idfile;

    array XX(5) X1-X5;

        do fid=1 to 5;

        x=xx[fid];

        output;

    end;

    drop x1-x5 x itemid rawscore;

    proc sort; by fid; run;

/* Import the raw data and sort them for the merging procedure in a later stage */

data score; set rawdata;

    proc sort; by userid itemid; run;

/* Import a file listing the faked ID and all item IDs, including those that no one answered */

data item; set formitem;

    proc sort; by fid; run;

/* merge the ID file and item file so that every subject carries 5 items regardless of whether they have missing data */

data iditem; merge idonly item; by fid;

    proc sort; by userid itemid; run;

/* merge the raw data and the iditem file, and now even missing values are showed up. */

/* FID can be used to check whether every subject has 5 items */

data realdata; merge score iditem; by userid itemid; run;

    proc sort; by userid itemid; run;

/* transpose the data to make each row represents a subject with answered and non-answered items */

data transpose; set realdata; by userid itemid;

    length i1-i5 $8;

    array scores s1-s5;

    array inames $ i1-i5;

        retain s1-s5 i1-i5 n;

        if first.userid then n = 1;

        scores(n) = rawscore;

        inames(n) = itemid;

        n = n+1;

        if last.userid then output;

run;

 


Thanks Dr. Marlyin Thompson for her input in solving this problem.


 

Navigation

Index

Simplified Navigation

Table of Contents

Search Engine

Contact